Accepted answer

Spark >= 2.3

You can simplify the process using the map_keys function:

import org.apache.spark.sql.functions.map_keys

There is also a map_values function, but it won't be directly useful here.
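As a quick sketch of both functions (assuming a SparkSession is available and spark.implicits._ is imported; the column name m is made up for illustration):

```scala
import org.apache.spark.sql.functions.{map_keys, map_values}

// A minimal map-typed column for illustration
val df = Seq(Map("foo" -> 1, "bar" -> 2)).toDF("m")

// map_keys returns an array<string> column of the map's keys
df.select(map_keys($"m")).show()
// map_values returns an array column of the map's values
df.select(map_values($"m")).show()
```

Note that key order within a map column is not guaranteed, so don't rely on the ordering of the resulting arrays.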

Spark < 2.3

The general method can be expressed in a few steps. First, the required imports:

import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.Row

and example data:

val ds = Seq(
  (1, Map("foo" -> (1, "a"), "bar" -> (2, "b"))),
  (2, Map("foo" -> (3, "c"))),
  (3, Map("bar" -> (4, "d")))
).toDF("id", "alpha")
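For reference, the tuple values become struct columns, so the schema of ds looks roughly like this (a sketch; exact nullability flags may differ):

```scala
ds.printSchema()
// root
//  |-- id: integer
//  |-- alpha: map
//  |    |-- key: string
//  |    |-- value: struct
//  |    |    |-- _1: integer
//  |    |    |-- _2: string
```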

To extract the keys we can use a UDF (Spark < 2.3):

val map_keys = udf[Seq[String], Map[String, Row]](_.keys.toSeq)

or built-in functions

import org.apache.spark.sql.functions.map_keys

val keysDF = ds.select(map_keys($"alpha"))

Find the distinct ones:

val distinctKeys = keysDF.as[Seq[String]].flatMap(identity).distinct
  .collect.sorted

You can also generalize key extraction with explode:

import org.apache.spark.sql.functions.explode

val distinctKeys = ds
  // Flatten the map column into key, value columns
  .select(explode($"alpha"))
  .select($"key")
  .as[String].distinct
  .collect.sorted

And select:

ds.select($"id" +: => $"alpha".getItem(x).alias(x)): _*)
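Putting the pieces together on the sample data above, the final select produces one column per distinct key, holding the corresponding struct value or null (a sketch; with this sample, distinctKeys sorts to Array("bar", "foo")):

```scala
val result = ds.select($"id" +: => $"alpha".getItem(x).alias(x)): _*)

// Resulting columns: id, bar, foo.
// Rows where a key is absent from the map get null in that column,
// e.g. id = 2 has a value under foo but null under bar.
result.show()
```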


And if you are using PySpark, I just found an easy implementation:

from pyspark.sql.functions import map_keys

df.select(map_keys("ALPHA").alias("keys")).show()

You can check the details here.
