score:4

Accepted answer

you don't even need to use an udf for that :

val df3 = df1.as('a).join(df2.as('b), $"a.id" === $"b.id","left").
               select(df1("id"),'x,'y,(coalesce('x, lit(0)) + coalesce('y, lit(0))).alias("x_plus_y")).na.fill(0)

df3.show
// df3: org.apache.spark.sql.dataframe = [id: int, x: int, y: int, x_plus_y: int]
// +---+---+---+--------+
// | id|  x|  y|x_plus_y|
// +---+---+---+--------+
// |  1| 10|  0|      10|
// |  2| 20|200|     220|
// |  3| 30|300|     330|
// |  4| 40|400|     440|
// +---+---+---+--------+

score:0

in scala noticed this solution,

val d = sqlcontext.sql("""
  select df1.id, x, y from df1 left join df2 on df1.id=df2.id""").na.fill(0)

to join the frames and replace non available values with zeroes, and then define this udf,

import org.apache.spark.sql.functions
import org.apache.spark.sql.functions._

val plus: (int,int) => int = (x:int,y:int) => x+y
val plus_udf = udf(plus)

d.withcolumn("x_plus_y", plus_udf($"x", $"y")).show
+---+---+---+--------+
| id|  x|  y|x_plus_y|
+---+---+---+--------+
|  1| 10|  0|      10|
|  2| 20|200|     220|
|  3| 30|300|     330|
|  4| 40|400|     440|
+---+---+---+--------+

score:4

df3 = df1.join(df2, df1.id == df2.id, "left_outer").select(df1.id, df1.x, df2.y).fillna(0)
df3.select("id", (df3.x + df3.y).alias("x_plus_y")).show()

this works in python.


Related Query

More Query from same tag