score:-1

Can you try cross joining DF and xmldf data frames.

df.crossJoin(xmldf).select($"id",$"code",$"Arrive",$"Depart",$"Destination",$"Passenger._Age",$"Passenger._Quantity").show

Thanks,

oops, sorry, i didn't think about that. Can you try below commands. Key is, you need to select proper keys from xml.

case class Data(id: String, code: Int, xmldata: String)
val df = Seq(Data("123abc", 12345,"<XML><Date Depart=\"2019-06-30\" Arrive=\"2019-06-22\" /><Passengers><Passenger Age=\"ADT\" Quantity=\"1\" /><Passenger Age=\"CHD\" Quantity=\"1\" /></Passengers><Destination Code=\"LAX\"/></XML>"),Data("345xyz", 102030,"<XML><Date Depart=\"2019-07-30\" Arrive=\"2019-07-22\" /><Passengers><Passenger Age=\"BCD\" Quantity=\"2\" /><Passenger Age=\"APB\" Quantity=\"2\" /></Passengers><Destination Code=\"TX\"/></XML>"),Data("xxdf456", 201910,"<XML><Date Depart=\"2019-07-30\" Arrive=\"2019-07-22\" /><Passengers><Passenger Age=\"BCD\" Quantity=\"2\" /><Passenger Age=\"APB\" Quantity=\"2\" /></Passengers><Destination Code=\"TX\"/></XML>")).toDF

val xrdd = df.select("xmldata").map(x => x.getString(0)).rdd
val xmldf = (new XmlReader()).xmlRdd(spark.sqlContext,xrdd).select($"Date._Arrive".as("Arrive"),$"Date._Depart".as("Depart"),$"Destination._Code".as("Destination"),explode($"Passengers.Passenger").alias("Passenger")).select($"Arrive",$"Depart",$"Destination",$"Passenger._Age",$"Passenger._Quantity")

val w = Window.orderBy("xmldata")
val dfx=df.withColumn("id1",dense_rank.over(w))

val w1 = Window.orderBy($"Arrive",$"Depart",$"Destination")
val xmldfx =xmldf.withColumn("id1",dense_rank.over(w1))

dfx.alias("a").join(xmldfx.alias("b"), $"a.id1" === $"b.id1").select($"a.id",$"a.code",$"b.Arrive",$"b.Depart",$"b.Destination",$"b._Age",$"b._Quantity").orderBy("id").show

Here 2nd and 3rd xml columns are same . so output will be

+-------+------+----------+----------+-----------+----+---------+
|     id|  code|    Arrive|    Depart|Destination|_Age|_Quantity|
+-------+------+----------+----------+-----------+----+---------+
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| ADT|        1|
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| CHD|        1|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| BCD|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| APB|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| APB|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| BCD|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| BCD|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| APB|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| BCD|        2|
|xxdf456|201910|2019-07-22|2019-07-30|         TX| APB|        2|
+-------+------+----------+----------+-----------+----+---------+

with Unique xml columns.

val df = Seq(
    Data("123abc", 12345,"<XML><Date Depart=\"2019-06-30\" Arrive=\"2019-06-22\" /><Passengers><Passenger Age=\"ADT\" Quantity=\"1\" /><Passenger Age=\"CHD\" Quantity=\"1\" /></Passengers><Destination Code=\"LAX\"/></XML>"),Data("345xyz", 102030,"<XML><Date Depart=\"2019-07-30\" Arrive=\"2019-07-22\" /><Passengers><Passenger Age=\"BCD\" Quantity=\"2\" /><Passenger Age=\"APB\" Quantity=\"2\" /></Passengers><Destination Code=\"TX\"/></XML>"),Data("xxdf456", 201910,"<XML><Date Depart=\"2019-08-10\" Arrive=\"2019-08-22\" /><Passengers><Passenger Age=\"yyy\" Quantity=\"10\" /><Passenger Age=\"xxx\" Quantity=\"10\" /></Passengers><Destination Code=\"TX\"/></XML>")).toDF

output:

+-------+------+----------+----------+-----------+----+---------+
|     id|  code|    Arrive|    Depart|Destination|_Age|_Quantity|
+-------+------+----------+----------+-----------+----+---------+
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| CHD|        1|
| 123abc| 12345|2019-06-22|2019-06-30|        LAX| ADT|        1|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| BCD|        2|
| 345xyz|102030|2019-07-22|2019-07-30|         TX| APB|        2|
|xxdf456|201910|2019-08-22|2019-08-10|         TX| yyy|       10|
|xxdf456|201910|2019-08-22|2019-08-10|         TX| xxx|       10|
+-------+------+----------+----------+-----------+----+---------+

Sorry for the long content.


Related Query

More Query from same tag