Spark >= 2.3
You can simplify the process using the built-in map_keys function.
There is also a map_values function, but it won't be directly useful here.
Spark < 2.3
The general method can be expressed in a few steps. First, the required imports:
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.Row
// Outside spark-shell, toDF and the $"..." syntax also need: import spark.implicits._
and example data:
val ds = Seq(
  (1, Map("foo" -> (1, "a"), "bar" -> (2, "b"))),
  (2, Map("foo" -> (3, "c"))),
  (3, Map("bar" -> (4, "d")))
).toDF("id", "alpha")
To extract the keys we can use a UDF (Spark < 2.3):
val map_keys = udf[Seq[String], Map[String, Row]](_.keys.toSeq)
or the built-in function (Spark >= 2.3):
import org.apache.spark.sql.functions.map_keys

val keysDF = ds.select(map_keys($"alpha"))
Then find the distinct ones:
val distinctKeys = keysDF.as[Seq[String]].flatMap(identity).distinct
  .collect.sorted
You can also generalize key extraction with explode:
import org.apache.spark.sql.functions.explode

val distinctKeys = ds
  // Flatten the column into key, value columns
  .select(explode($"alpha"))
  .select($"key")
  .as[String].distinct
  .collect.sorted
Finally, select the id column together with one column per distinct key:
ds.select($"id" +: distinctKeys.map(x => $"alpha".getItem(x).alias(x)): _*)
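To make the intended result concrete, here is a minimal sketch of the same two steps (collect the distinct keys, then look each key up per row) on plain Scala collections, with no Spark dependency. The names MapToColumnsSketch, rows and wide are illustrative and not part of any Spark API. Keys absent from a row come out as None here, where the Spark version would produce null in that column.

```scala
// Plain-Scala analogue of the Spark steps above (illustrative names, no Spark required).
object MapToColumnsSketch {
  // Same shape as the example data: (id, map from key to an (Int, String) value).
  val rows: Seq[(Int, Map[String, (Int, String)])] = Seq(
    (1, Map("foo" -> (1, "a"), "bar" -> (2, "b"))),
    (2, Map("foo" -> (3, "c"))),
    (3, Map("bar" -> (4, "d")))
  )

  // Step 1: gather every key that occurs in any row, deduplicated and sorted.
  val distinctKeys: Seq[String] = rows.flatMap(_._2.keys).distinct.sorted

  // Step 2: for each row, look up every distinct key in that row's map.
  // A key missing from the row yields None.
  val wide: Seq[(Int, Seq[Option[(Int, String)]])] =
    rows.map { case (id, m) => (id, distinctKeys.map(m.get)) }
}
```

Each entry of wide pairs an id with one slot per distinct key, in the same sorted order as distinctKeys, which mirrors the column layout produced by the select above.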
If you are using PySpark, the built-in function gives an equally simple implementation:
from pyspark.sql.functions import map_keys

alphaDF.select(map_keys("ALPHA").alias("keys")).show()
You can check details in here