
Adding a column with withColumn() operates on the entire dataset, and the more columns you have to add, the worse it gets. You can instead use Spark SQL and write a SQL-style query that adds the new columns. This requires more SQL skill than plain Spark, and with 100 columns the query may be hard to maintain.
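For comparison, here is a minimal sketch of those two options. It assumes Spark 2.x or later (SparkSession) rather than the sqlContext used further down, and the input columns (id, amount), the derived columns (doubled, plusOne), and the view name main_table are all made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object WithColumnVsSql {
  // Option 1: one withColumn call per derived column.
  // "amount", "doubled" and "plusOne" are hypothetical column names.
  def viaWithColumn(df: DataFrame): DataFrame =
    df.withColumn("doubled", col("amount") * 2)
      .withColumn("plusOne", col("amount") + 1)

  // Option 2: register a temporary view and express the same columns in SQL.
  def viaSql(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("main_table")
    spark.sql("SELECT *, amount * 2 AS doubled, amount + 1 AS plusOne FROM main_table")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("with-column-vs-sql")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up dataset, just to have something to run against.
    val df = Seq((1, 10), (2, 20)).toDF("id", "amount")

    viaWithColumn(df).show()
    viaSql(spark, df).show()
    spark.stop()
  }
}
```

With two columns both versions stay readable; with around 100, the withColumn chain and the SELECT list each grow by one entry per column, which is the maintenance concern raised above.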

You can follow another approach.

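You can convert an RDD into a DataFrame, then use map on the DataFrame and process each row as you wish. (In Spark 2.x and later, call map on mainDataFrame.rdd so that the result is still an RDD of Row.) Inside the map function: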

a. Compute the new values from your calculations.

b. Add the new column values to the row of the main RDD, as below:

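val newColumns: Seq[Any] = Seq(newcol1, newcol2)
// Note: init drops the row's last value; use row.toSeq ++ newColumns to keep every original column
Row.fromSeq(row.toSeq.init ++ newColumns)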

Here, row is the Row passed to the map function.

c. Create the schema for the new columns, as below:

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
val newColumnsStructType = StructType(Seq(StructField("newcolName1", IntegerType), StructField("newColName2", IntegerType)))

d. Append it to the old schema:

// init drops the last original field, mirroring row.toSeq.init in step b
val newSchema = StructType(mainDataFrame.schema.init ++ newColumnsStructType)

e. Create the new DataFrame with the new columns:

val newDataFrame = sqlContext.createDataFrame(newRDD, newSchema)
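Putting steps a to e together, a minimal end-to-end sketch might look like the following. It assumes Spark 2.x or later (SparkSession in place of sqlContext, and mainDataFrame.rdd.map in place of calling map on the DataFrame directly). The input column amount and the two calculations are hypothetical. Unlike the snippets above, this version does not call init, so every original column is kept.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object AddColumnsViaRdd {
  // Appends two computed Int columns to every row, keeping all original columns.
  def addColumns(spark: SparkSession, mainDataFrame: DataFrame): DataFrame = {
    // Steps a and b: compute the new values per row and append them to the row.
    val newRDD = mainDataFrame.rdd.map { row =>
      val amount = row.getAs[Int]("amount") // hypothetical input column
      val newColumns: Seq[Any] = Seq(amount * 2, amount + 1) // made-up calculations
      Row.fromSeq(row.toSeq ++ newColumns)
    }

    // Step c: schema for the new columns only.
    val newColumnsStructType = StructType(Seq(
      StructField("newcolName1", IntegerType),
      StructField("newColName2", IntegerType)))

    // Step d: old fields followed by the new fields.
    val newSchema = StructType(mainDataFrame.schema.fields ++ newColumnsStructType.fields)

    // Step e: rebuild a DataFrame from the widened rows and the widened schema.
    spark.createDataFrame(newRDD, newSchema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("add-columns-via-rdd")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up dataset, just to have something to run against.
    val mainDataFrame = Seq((1, 10), (2, 20)).toDF("id", "amount")

    val newDataFrame = addColumns(spark, mainDataFrame)
    newDataFrame.printSchema()
    newDataFrame.show()
    spark.stop()
  }
}
```

The values in each Row must line up with the fields of newSchema in both order and type, since createDataFrame does not check this up front; a mismatch typically only surfaces as a runtime error once an action such as show() runs.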
