score:3

If you want to count how many 1s and 0s there are, you can do:

``````
val rdd = clusterAndLabel.map(x => (x,1)).reduceByKey(_+_)
``````

This gives you an `RDD[((Int,Int),Int)]` containing exactly what you described, i.e. `[((0,0),2), ((1,0),1), ((1,1),1), ((2,1),2), ((2,0),1)]`. If you really want them gathered by their first key, you can add this line:

``````
val rdd2 = rdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()
``````

This yields an `RDD[(Int, Iterable[(Int,Int)])]` that looks like what you described, i.e. `[(0, [(0,2)]), (1, [(0,1),(1,1)]), (2, [(1,2),(0,1)])]`.
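As a rough end-to-end sketch of the two steps above, assuming `clusterAndLabel` is an `RDD[(Int, Int)]` of (cluster, label) pairs: the sample data here is reconstructed from the outputs quoted above, and the local `SparkSession` setup is only an assumption so the snippet can be tried in isolation (for example in `spark-shell`).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; reuse your own SparkContext in a real job.
val spark = SparkSession.builder().master("local[*]").appName("cluster-label-counts").getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample data: (cluster, label) pairs matching the counts quoted above.
val clusterAndLabel = sc.parallelize(Seq((0, 0), (0, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 0)))

// Count each distinct (cluster, label) pair.
val rdd = clusterAndLabel.map(x => (x, 1)).reduceByKey(_ + _)

// Regroup by cluster: cluster -> collection of (label, count).
val rdd2 = rdd.map { case ((cluster, label), n) => (cluster, (label, n)) }.groupByKey()

// collect() is fine here because the result is tiny; element order is not guaranteed.
rdd2.collect().foreach { case (cluster, counts) =>
  println(s"$cluster -> ${counts.toList.sorted}")
}
```

Keep in mind that `groupByKey` pulls every value for a key onto one executor, so it is only comfortable when each cluster has few distinct labels, as here.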

If you need the number of instances, it looks like (at least in your example) `clusterAndLabel.count()` should do the work.

I don't fully understand question 3. I can see two possible readings:

• You want to know how many keys have 3 occurrences. To do so, start from the object I called `rdd` (no need for the `groupByKey` line) and do:

``````
val rdd3 = rdd.map(x => (x._2,1)).reduceByKey(_+_)
``````

This yields an `RDD[(Int,Int)]`, which is a kind of frequency RDD: the key is the number of occurrences and the value is how many pairs occur that many times. Here it would look like `[(1,3),(2,2)]`. So if you want to know how many pairs occur 3 times, you just do `rdd3.filter(_._1==3).collect()` (which will be an array of size 0 here; if it is not empty, it will hold a single element whose value is your answer).

• You want to know how many times the first key 3 occurs (once again 0 in your example). Then start from `rdd2` and do:

``````
// Renamed from rdd3: the result of collect() is an Array, and rdd3 is already used above
val keyCount = rdd2.map(x=>(x._1,x._2.size)).filter(_._1==3).collect()
``````

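Once again it yields either an empty array or an array of size 1 containing `(3, n)`, where `n` is the number of distinct second keys seen with first key 3 (to count the underlying instances instead, sum the counts with `x._2.map(_._2).sum` in place of `x._2.size`). Note that if you don't need `rdd2` itself, you can do it directly: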

``````
val rdd4 = rdd.map(x => (x._1._1,1)).reduceByKey(_+_).filter(_._1==3).collect()
``````

(For performance, you may also want to apply the filter before `reduceByKey`, so that only the matching records are shuffled.)
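Both readings of question 3 can be sketched with the filter applied first. This is only an illustration under assumptions: the sample data and the local `SparkSession` are made up for the example, the helper names `pairsOccurring` and `labelsForKey` are hypothetical, and the first reading uses a plain `filter` plus `count()` rather than the frequency RDD described above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; reuse your own SparkContext in a real job.
val spark = SparkSession.builder().master("local[*]").appName("question-3-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample data: (cluster, label) pairs.
val clusterAndLabel = sc.parallelize(Seq((0, 0), (0, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 0)))
val rdd = clusterAndLabel.map(x => (x, 1)).reduceByKey(_ + _)

// Reading 1: number of (cluster, label) pairs that occur exactly `n` times.
def pairsOccurring(n: Int): Long = rdd.filter(_._2 == n).count()

// Reading 2: number of distinct labels seen with first key `key`,
// filtering first so that only matching records reach reduceByKey.
def labelsForKey(key: Int): Array[(Int, Int)] =
  rdd.filter(_._1._1 == key).map(x => (x._1._1, 1)).reduceByKey(_ + _).collect()

println(pairsOccurring(3))       // 0 with this sample data
println(labelsForKey(3).toList)  // List()
println(labelsForKey(2).toList)  // List((2,2))
```

With this sample data nothing has 3 as a count or as a first key, so both readings come back empty; key 2 is included only to show what a non-empty result looks like.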