If you want to find how many 1s and 0s there are, you can do:

val rdd = clusterAndLabel.map(x => (x,1)).reduceByKey(_+_)

This will give you an RDD[((Int,Int),Int)] containing exactly what you described, meaning: [((0,0),2), ((1,0),1), ((1,1),1), ((2,1),2), ((2,0),1)]. If you really want them gathered by their first key, you can add this line:

val rdd2 = rdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()

This will yield an RDD[(Int, Iterable[(Int,Int)])] which will look like what you described, i.e.: [(0, [(0,2)]), (1, [(0,1),(1,1)]), (2, [(1,2),(0,1)])].
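
To make the two steps above easier to check by hand, here is a minimal sketch that mirrors them on plain Scala collections instead of an RDD (so it runs without Spark). The sample data in `clusterAndLabel` is an assumption, chosen to reproduce the example output; the object and value names are hypothetical.

```scala
// Local-collection analogue of the two RDD steps (no Spark dependency).
object CountPairs {
  // Assumed sample data: (cluster, label) pairs matching the example output.
  val clusterAndLabel: Seq[(Int, Int)] =
    Seq((0, 0), (0, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 0))

  // Same result as map(x => (x, 1)).reduceByKey(_ + _):
  // how many times each (cluster, label) pair occurs.
  val counts: Map[(Int, Int), Int] =
    clusterAndLabel.groupBy(identity).map { case (pair, hits) => (pair, hits.size) }

  // Same result as map(x => (x._1._1, (x._1._2, x._2))).groupByKey():
  // for each cluster, the list of (label, count) entries.
  val byCluster: Map[Int, Seq[(Int, Int)]] =
    counts.toSeq
      .map { case ((cluster, label), n) => (cluster, (label, n)) }
      .groupBy(_._1)
      .map { case (cluster, rows) => (cluster, rows.map(_._2)) }

  def main(args: Array[String]): Unit = {
    println(counts)
    println(byCluster)
  }
}
```

On a real RDD the order of elements inside each group is not guaranteed, which is why the sketch makes no claim about ordering either.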

If you need the number of instances, it looks like (at least in your example) clusterAndLabel.count() should do the work.

I don't really understand question 3. I can see two possible readings:

  • you want to know how many keys have 3 occurrences. To do so, you can start from the object I called rdd (no need for the groupByKey line) and do:

    val rdd3 = rdd.map(x => (x._2,1)).reduceByKey(_+_)

    This will yield an RDD[(Int,Int)] which is a kind of frequency RDD: the key is the number of occurrences and the value is how many pairs occur that many times. Here it would look like: [(1,3),(2,2)]. So if you want to know how many pairs occur 3 times, you just do rdd3.filter(_._1==3).collect() (which will be an array of size 0 in your example; if it's not empty, it will have exactly one element, whose second component is your answer).

  • you want to know how many times the first key 3 occurs (once again 0 in your example). Then you start from rdd2 and do:

    val rdd3 = rdd2.map(x => (x._1, x._2.size)).filter(_._1==3).collect()

    Once again it will yield either an empty array or an array of size 1 containing how many elements have a 3 for their first key. Note that if you don't need to display rdd2, you can do it directly from rdd:

    val rdd4 = rdd.map(x => (x._1._1,1)).reduceByKey(_+_).filter(_._1==3).collect()

    (For performance, you might also want to apply the filter before reduceByKey!)
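
Both readings of question 3 can be sketched on plain Scala collections, again without Spark. The `counts` value below is an assumption: it is simply the example content of rdd copied into a local Seq, and the object and value names are hypothetical.

```scala
// Local-collection analogue of the two readings of question 3.
object Question3 {
  // Assumed input: the per-pair counts from the example, i.e. the content of `rdd`.
  val counts: Seq[((Int, Int), Int)] =
    Seq(((0, 0), 2), ((1, 0), 1), ((1, 1), 1), ((2, 1), 2), ((2, 0), 1))

  // First reading, like map(x => (x._2, 1)).reduceByKey(_ + _):
  // key = number of occurrences, value = how many pairs occur that often.
  val frequency: Map[Int, Int] =
    counts.groupBy(_._2).map { case (n, rows) => (n, rows.size) }

  // Second reading, like map(x => (x._1._1, 1)).reduceByKey(_ + _):
  // key = first key, value = how many distinct pairs start with it.
  val pairsPerFirstKey: Map[Int, Int] =
    counts.groupBy(_._1._1).map { case (first, rows) => (first, rows.size) }

  def main(args: Array[String]): Unit = {
    // getOrElse plays the role of filter(_._1 == 3).collect() returning an empty array.
    println(frequency.getOrElse(3, 0))        // prints 0
    println(pairsPerFirstKey.getOrElse(3, 0)) // prints 0
  }
}
```

Note that the second reading counts distinct (first key, second key) pairs, which is what x._2.size on the grouped rdd2 gives as well.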
