
An ideal approach would be to read the entire dataframe as a Binary (Array[Byte]) data type and then cast the values to a compatible data type. However, Spark does not allow a Double column to be read as Binary, so this approach cannot be used.

A workaround is to set the Spark property "spark.sql.files.ignoreCorruptFiles" to true and then read the files with the desired schema. Files that don't match the specified schema are ignored, so the resulting dataset contains data only from the files whose schema matches. Read the data twice, once with a String schema and once with a Double schema, cast one of the resulting dataframes to the common type, and finally union them.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Two schemas for the same column, one per file variant
val stringSchema = StructType(StructField("final_height", StringType, false) :: Nil)
val doubleSchema = StructType(StructField("final_height", DoubleType, false) :: Nil)

// Files whose schema does not match the requested one are skipped instead of failing the read
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val stringDF = spark.read.schema(stringSchema).parquet("path/")
val doubleDF = spark.read.schema(doubleSchema).parquet("path/")

// Cast to a compatible type before the union
val doubleToStringDF = doubleDF.select(col("final_height").cast(StringType))

val finalDF = stringDF.union(doubleToStringDF)
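If the column is ultimately needed as a number, the unioned String column can be cast back to Double once the two reads are combined. Below is a minimal sketch; the output path "path_consolidated/" is only an illustrative assumption, and values that cannot be parsed as numbers become null after the cast.

// Cast the unified String column back to Double; unparseable values become null
val numericDF = finalDF.select(col("final_height").cast(DoubleType).as("final_height"))

// Optionally rewrite the data with one consistent schema so future reads no longer need the workaround
// ("path_consolidated/" is a hypothetical target path)
numericDF.write.mode("overwrite").parquet("path_consolidated/")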
