score:1

Accepted answer

Dima's answer should work. Here is another solution.

I think (not positive) groupBy would hold all of the data in memory, so perhaps this would be better for you.

val rows = scala.io.Source.fromFile("data.txt") // assuming data is in a file
             .getLines()  // create an iterator over the lines in the file
             .foldLeft(Map.empty[String, Int]) { // fold, starting from an empty map
                (acc, row) => acc + (row -> (acc.getOrElse(row, 0) + 1)) }  // accumulator tracks how many times each row has been seen
             .filter(t => t._2 > 1)  // keep only rows that occur more than once

I'm new to Scala myself; I actually spent a while answering this as practice, haha. It's confusing at first, but it makes sense!

Think of a Map like a dictionary: you can store key/value pairs in it. In Scala, you can add or update a key/value pair by adding a pair to the map. Map("b" -> 4) + ("c" -> 2) would return Map(b -> 4, c -> 2). Expanding on that, Map("b" -> 4, "c" -> 2) + ("b" -> 1) returns Map(b -> 1, c -> 2). acc (renamed from count for clarity) is the accumulator, a map that grows as the iterator is folded. Each time the fold reaches a row, it checks whether that row is in the map yet (again, think dictionary). If it is, getOrElse returns the previous count, 1 is added to it, and the acc map is updated with that new pair. If it isn't (this is the first time the row has been seen), getOrElse returns 0, so the count is initialized to 1.
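To make the walkthrough above concrete, here is a small sketch that runs the same fold over an in-memory list instead of a file. The object name FoldDemo and the sample rows are made up for illustration; only the Map operations and the foldLeft/filter steps come from the answer.

```scala
object FoldDemo {
  // Adding a pair to an immutable Map returns a new Map;
  // a pair whose key already exists replaces the old value.
  val m1: Map[String, Int] = Map("b" -> 4) + ("c" -> 2)
  val m2: Map[String, Int] = m1 + ("b" -> 1)

  // Hypothetical sample rows standing in for the lines of data.txt.
  val sampleRows: List[String] = List("x", "y", "x", "z", "x", "y")

  // Same fold as in the answer: count how many times each row occurs.
  val counts: Map[String, Int] =
    sampleRows.iterator.foldLeft(Map.empty[String, Int]) { (acc, row) =>
      acc + (row -> (acc.getOrElse(row, 0) + 1))
    }

  // Keep only the rows seen more than once.
  val duplicates: Map[String, Int] = counts.filter { case (_, n) => n > 1 }
}
```

With this sample input, counts maps x to 3, y to 2 and z to 1, so duplicates keeps only x and y. Note that the accumulator holds one entry per distinct row, not one per line of input.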

Here is the best blog post I found for learning folding. The author describes it succinctly and accurately: https://coderwall.com/p/4l73-a/scala-fold-foldleft-and-foldright

score:-1

If you use Scala collections (like Seq or List), you have a method called .distinct. Otherwise, you can convert the data to a Set, which removes duplicates by default (but doesn't preserve the order).
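A quick sketch of what this looks like, using a made-up list of rows (DistinctDemo and its values are illustrative names, not from the question). Keep in mind that distinct and toSet return the de-duplicated data rather than the rows that were duplicated. The last line is my own addition, not part of this answer: one possible way to recover the duplicated rows is to combine distinct with diff.

```scala
object DistinctDemo {
  // Hypothetical sample rows.
  val rows: Seq[String] = Seq("a", "b", "a", "c", "b", "a")

  // distinct keeps the first occurrence of each element, in order.
  val unique: Seq[String] = rows.distinct

  // A Set also drops duplicates, but makes no promise about order.
  val asSet: Set[String] = rows.toSet

  // diff removes one occurrence of each element of `unique` from `rows`,
  // so whatever is left over appeared more than once.
  val duplicated: Seq[String] = rows.diff(unique).distinct
}
```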

score:1

dataset.groupBy(identity).collect { case (k, v) if v.size > 1 => k }
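For reference, a minimal runnable version of that one-liner, assuming dataset is an ordinary in-memory Seq of rows (GroupByDemo and the sample values are made up for illustration):

```scala
object GroupByDemo {
  // Hypothetical sample data; in the question this would be the loaded rows.
  val dataset: Seq[String] = Seq("x", "y", "x", "z", "x", "y")

  // groupBy(identity) builds a Map from each distinct row to the
  // collection of all its occurrences; collect keeps the keys whose
  // group has more than one element.
  val duplicates: Set[String] =
    dataset.groupBy(identity).collect { case (k, v) if v.size > 1 => k }.toSet
}
```

The result is converted to a Set here because the iteration order of the grouped map is not something to rely on. On a strict collection like this, groupBy keeps every occurrence of every row in its groups, whereas the fold-based approach keeps only one counter per distinct row.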

