score:1

Accepted answer

You can group the inner array by its first element and sum the values:

// raw RDD
val hashedRDD = spark.sparkContext.parallelize(Seq(
  ("abc", Array(("asd", 1), ("asd", 3), ("cvd", 2), ("cvd", 2), ("xyz", 1)))
))

// group by the first element of each tuple and sum the second
val y = hashedRDD.map(x => {
  (x._1, x._2.groupBy(_._1).mapValues(_.map(_._2).sum))
})

Output:

y.foreach(println)
(abc,Map(xyz -> 1, asd -> 4, cvd -> 4))
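For reference, here is a minimal self-contained sketch of the same approach, assuming a local SparkSession (the snippet above assumes `spark` already exists in scope; the object and app names are illustrative):

import org.apache.spark.sql.SparkSession

object GroupSumExample {
  def main(args: Array[String]): Unit = {
    // Local session for testing; master and app name are illustrative.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-sum-example")
      .getOrCreate()

    val hashedRDD = spark.sparkContext.parallelize(Seq(
      ("abc", Array(("asd", 1), ("asd", 3), ("cvd", 2), ("cvd", 2), ("xyz", 1)))
    ))

    // For each record, group the inner array by key and sum the counts.
    // A strict map (rather than mapValues) sidesteps the lazy-view behavior
    // of Scala's Map.mapValues when the result gets serialized by Spark.
    val y = hashedRDD.map { case (key, pairs) =>
      (key, pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) })
    }

    y.collect().foreach(println) // e.g. (abc,Map(asd -> 4, cvd -> 4, xyz -> 1)); key order may vary

    spark.stop()
  }
}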

Hope this helps!

score:0

One way would be to reduce over the tuples after a groupBy on the first entry:

@ hashedRDD.map { f => (f._1, f._2.groupBy { _._1 }.map { _._2.reduce { (a, b) => (a._1, a._2 + b._2) } }) }.collect
res11: Array[(String, Map[String, Int])] = Array(("abc", Map("xyz" -> 1, "asd" -> 4, "cvd" -> 4)))
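If the inner arrays are large, a distributed alternative (a sketch, not part of either answer) is to flatten the records first, let Spark aggregate with reduceByKey, and then rebuild the per-record map:

// Assumes hashedRDD: RDD[(String, Array[(String, Int)])] as defined above.
val summed = hashedRDD
  .flatMap { case (k, pairs) => pairs.map { case (ik, v) => ((k, ik), v) } } // ((outer, inner), count)
  .reduceByKey(_ + _)                       // sum counts per (outer, inner) key across the cluster
  .map { case ((k, ik), v) => (k, (ik, v)) }
  .groupByKey()
  .mapValues(_.toMap)

summed.collect().foreach(println) // (abc,Map(asd -> 4, cvd -> 4, xyz -> 1))

This trades the per-record groupBy for cluster-wide shuffles, which only pays off when individual arrays are too large to aggregate within a single task.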
