Score: 1
Accepted answer
I think a combination of Tokenizer and explode might work. The solution is given below:
scala> val data = spark.read.format("csv").option("delimiter", "\t").schema(schema).load("plot_summaries.txt")
data: org.apache.spark.sql.DataFrame = [documentId: bigint, description: string]
scala> data.show(1)
+----------+--------------------+
|documentId|         description|
+----------+--------------------+
|  23890098|shlykov, a hard-w...|
+----------+--------------------+
only showing top 1 row
scala> import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.functions.explode
scala> import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.Tokenizer
scala> val tokenizer = new Tokenizer().setInputCol("description").setOutputCol("words")
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_80d1c6e72cbc
scala> val wordsData = tokenizer.transform(data)
wordsData: org.apache.spark.sql.DataFrame = [documentId: bigint, description: string ... 1 more field]
scala> wordsData.show(1)
+----------+--------------------+--------------------+
|documentId|         description|               words|
+----------+--------------------+--------------------+
|  23890098|shlykov, a hard-w...|[shlykov,, a, har...|
+----------+--------------------+--------------------+
only showing top 1 row
scala> val newwordsdata = wordsdata.drop("description")
newwordsdata: org.apache.spark.sql.dataframe = [documentid: bigint, words: array<string>]
scala> newwordsdata.show(1)
+----------+--------------------+
|documentid| words|
+----------+--------------------+
| 23890098|[shlykov,, a, har...|
+----------+--------------------+
only showing top 1 row
scala> val flattened = newWordsData.withColumn("token", explode($"words"))
flattened: org.apache.spark.sql.DataFrame = [documentId: bigint, words: array<string> ... 1 more field]
scala> flattened.show
+----------+--------------------+-------------+
|documentId|               words|        token|
+----------+--------------------+-------------+
|  23890098|[shlykov,, a, har...|     shlykov,|
|  23890098|[shlykov,, a, har...|            a|
|  23890098|[shlykov,, a, har...| hard-working|
|  23890098|[shlykov,, a, har...|         taxi|
|  23890098|[shlykov,, a, har...|       driver|
|  23890098|[shlykov,, a, har...|          and|
|  23890098|[shlykov,, a, har...|      lyosha,|
|  23890098|[shlykov,, a, har...|            a|
|  23890098|[shlykov,, a, har...| saxophonist,|
|  23890098|[shlykov,, a, har...|      develop|
|  23890098|[shlykov,, a, har...|            a|
|  23890098|[shlykov,, a, har...|      bizarre|
|  23890098|[shlykov,, a, har...|    love-hate|
|  23890098|[shlykov,, a, har...|relationship,|
|  23890098|[shlykov,, a, har...|          and|
|  23890098|[shlykov,, a, har...|      despite|
|  23890098|[shlykov,, a, har...|        their|
|  23890098|[shlykov,, a, har...|  prejudices,|
|  23890098|[shlykov,, a, har...|      realize|
|  23890098|[shlykov,, a, har...|         they|
+----------+--------------------+-------------+
only showing top 20 rows
Let me know if it helps!
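By the way, Tokenizer just lowercases and splits on whitespace, so punctuation stays attached to the tokens (e.g. "shlykov,"). If you want cleaner tokens, a RegexTokenizer might help. A minimal sketch, assuming the same data DataFrame as above (the \W+ pattern is my choice, not part of the question):

import org.apache.spark.ml.feature.RegexTokenizer

// Sketch only: RegexTokenizer splits on the given regex (here \W+, i.e. runs of
// non-word characters) and lowercases by default, so trailing commas disappear.
val regexTokenizer = new RegexTokenizer()
  .setInputCol("description")
  .setOutputCol("words")
  .setPattern("\\W+")

val cleanWords = regexTokenizer.transform(data)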
Score: 1
First split the description data to get an array, then explode the array to get individual words as rows associated with the documentId.
Example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Dataset
import spark.implicits._

// Sample data (the gap after 23890098 must be a tab character to match the "\t" delimiter)
val csvData: Dataset[String] = spark.sparkContext.parallelize(
  """
    |23890098	shlykov, a hard-working taxi driver and lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all.
  """.stripMargin.lines.toList).toDS()

val schema = new StructType().add("documentId", LongType, true).add("description", StringType, true)

// Read the dataset
val data = spark.read.option("delimiter", "\t").schema(schema).csv(csvData)

data.withColumn("description", split(col("description"), " ")). // split description to get an array of words
  withColumn("token", explode(col("description"))). // explode the array: each word becomes its own row with documentId
  show(false)
//+----------+--------------------+-------------+
//|documentId|         description|        token|
//+----------+--------------------+-------------+
//|  23890098|[shlykov,, a, har...|     shlykov,|
//|  23890098|[shlykov,, a, har...|            a|
//|  23890098|[shlykov,, a, har...| hard-working|
//|  23890098|[shlykov,, a, har...|         taxi|
//|  23890098|[shlykov,, a, har...|       driver|
//|  23890098|[shlykov,, a, har...|          and|
//|  23890098|[shlykov,, a, har...|      lyosha,|
//|  23890098|[shlykov,, a, har...|            a|
//|  23890098|[shlykov,, a, har...| saxophonist,|
//|  23890098|[shlykov,, a, har...|      develop|
//|  23890098|[shlykov,, a, har...|            a|
//|  23890098|[shlykov,, a, har...|      bizarre|
//|  23890098|[shlykov,, a, har...|    love-hate|
//|  23890098|[shlykov,, a, har...|relationship,|
//|  23890098|[shlykov,, a, har...|          and|
//|  23890098|[shlykov,, a, har...|      despite|
//|  23890098|[shlykov,, a, har...|        their|
//|  23890098|[shlykov,, a, har...|  prejudices,|
//|  23890098|[shlykov,, a, har...|      realize|
//|  23890098|[shlykov,, a, har...|         they|
//+----------+--------------------+-------------+
//only showing top 20 rows
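A follow-up variant of my own (not from the answer above): if you don't need the intermediate array column, the split and explode can be collapsed into a single select, and splitting on \W+ instead of a single space also strips the punctuation from the tokens:

// Sketch under the same schema as above; split on runs of non-word characters
// so tokens like "shlykov," come out as "shlykov".
data.select(col("documentId"), explode(split(lower(col("description")), "\\W+")).as("token"))
  .show(false)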