score:5

Accepted answer

With Spark SQL, each line of the input must contain a single, self-contained, valid JSON object; otherwise the computation fails.

However, you can try one of the following:

spark.read.json(spark.sparkContext.wholeTextFiles("path to json").values) 

or

spark.read.option("wholeFile", true).option("mode", "PERMISSIVE").json("path to json")

Either of these should load the JSON into a DataFrame. (Note: in released Spark 2.2+ the option is named multiLine; wholeFile was its pre-release name.)
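For example, a minimal end-to-end sketch, assuming Spark 2.2+, a SparkSession named spark as above, and a hypothetical file data/input.json:

// Read a file whose JSON objects span multiple lines; PERMISSIVE mode
// routes unparseable records into a _corrupt_record column instead of
// failing the whole job.
val df = spark.read
  .option("multiLine", true)
  .option("mode", "PERMISSIVE")
  .json("data/input.json") // hypothetical path

df.printSchema()
df.show(false)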

score:0

Adding to this, since it took me some time to understand: when querying for something nested inside total, you may have to use the explode method:

Dataset<Row> socials = sparkSession
            .read()
            .option("multiLine", true)
            .option("mode", "PERMISSIVE")
            .json(<path to file>).cache();

socials
    .select(org.apache.spark.sql.functions.explode(socials.col("total")).as("t"))
    .where("t.<some nested column under total> = 'foo'")
    .toJSON()
    .collectAsList();

This is Java Spark, but I hope the explode method is of some help.
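For readers working in Scala, a rough equivalent of the Java snippet above, assuming the same SparkSession and a hypothetical nested field name inside each element of total:

import org.apache.spark.sql.functions.explode

val socials = spark.read
  .option("multiLine", true)
  .option("mode", "PERMISSIVE")
  .json("path/to/file.json") // hypothetical path
  .cache()

// explode turns each element of the total array into its own row "t",
// so nested fields can be filtered with ordinary column expressions.
val matches = socials
  .select(explode(socials.col("total")).as("t"))
  .where("t.name = 'foo'") // "name" is a hypothetical nested field
  .toJSON
  .collectAsList()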

score:1

Given an input file with JSON data such as

{
  "element" : value,
  "id" : value,
  "total" : []
}
{
  "element" : value,
  "id" : value,
  "total: []
}

which is not valid JSON and cannot be converted to a DataFrame directly, so you have to transform the data into a valid, Spark-readable JSON format first.

val rdd = sc.wholeTextFiles("path to the json file")

// Strip whitespace, quote the bare values, then split the concatenated
// objects so each line holds exactly one self-contained JSON object.
val validJsonRdd = rdd.flatMap(_._2.replace(" ", "").replace("\n", "").replace(":value", ":\"value\"").replace("}{", "}\n{").split("\n"))

The step above works only if the element and id fields contain the bare, unquoted string value; otherwise, adjust the replacements to fit your data.
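To see what the chain of replacements actually does, here is a quick, self-contained check against the sample input above (plain Scala, no Spark needed):

val raw = """{
  "element" : value,
  "id" : value,
  "total" : []
}
{
  "element" : value,
  "id" : value,
  "total" : []
}"""

// Same transformations as in the flatMap above.
val lines = raw
  .replace(" ", "")                // drop spaces
  .replace("\n", "")               // collapse to a single line
  .replace(":value", ":\"value\"") // quote the bare values
  .replace("}{", "}\n{")           // one object per line
  .split("\n")

lines.foreach(println)
// {"element":"value","id":"value","total":[]}
// {"element":"value","id":"value","total":[]}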

The next step is to convert the RDD into a DataFrame using the sqlContext:

 val df = sqlContext.read.json(validJsonRdd)

which should result in

+-------+-----+-----+
|element|id   |total|
+-------+-----+-----+
|value  |value|[]   |
|value  |value|[]   |
+-------+-----+-----+

Now you should be able to select id and the corresponding total values and work with them.
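For instance, a short follow-up on the df built above; exploding total is a hypothetical next step for when the array actually contains elements (exploding an empty array simply yields no rows):

import org.apache.spark.sql.functions.explode

df.select("id", "total").show()

// Each element of a non-empty total array becomes its own row.
df.select(df("id"), explode(df("total")).as("t")).show()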

I hope the answer is helpful.

