score:1
Dima's answer should work. Here is another solution.
I think (not positive) groupBy would hold all of the data in memory, so perhaps this would be better for you.
val rows = scala.io.Source.fromFile("data.txt") // assuming data is in a file
  .getLines() // create an iterator from lines in file
  .foldLeft(Map.empty[String, Int]) { // fold over empty map
    (acc, row) => acc + (row -> (acc.getOrElse(row, 0) + 1)) } // the accumulator keeps track of row counts as the fold proceeds
  .filter(t => t._2 > 1) // keep only the rows that occur more than once
I'm new to Scala myself; I actually spent a while answering this as practice, haha. Confusing, but it makes sense!
Think of a Map like a dictionary: it stores key/value pairs. In Scala, you can add or update a key/value pair by adding a pair to the map. Map("b" -> 4) + ("c" -> 2) would return Map(b -> 4, c -> 2). Expanding on that, Map("b" -> 4, "c" -> 2) + ("b" -> 1) returns Map(b -> 1, c -> 2).
acc (renamed from count for clarity) is the accumulator: a map that grows as the iterator is folded. For each row, the fold checks whether that row is in the map yet (again, think dictionary). If it is, getOrElse returns the previous count, 1 is added to it, and the acc map is updated with that new pair. If it isn't, getOrElse supplies 0, so the count is initialized at one (since this is the first time the row has been seen).
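To make the accumulator easier to follow, here is a small sketch of the same fold run over an in-memory iterator instead of a file. The object name RowCounts, the helper names countRows and duplicates, and the sample data are made up for illustration; only foldLeft, getOrElse, and filter come from the answer above.

```scala
object RowCounts {
  // Fold over the rows, carrying a Map from row to the number of times seen so far.
  def countRows(rows: Iterator[String]): Map[String, Int] =
    rows.foldLeft(Map.empty[String, Int]) { (acc, row) =>
      // getOrElse returns 0 the first time a row is seen, so its count starts at 1.
      acc + (row -> (acc.getOrElse(row, 0) + 1))
    }

  // Keep only the rows that were seen more than once.
  def duplicates(rows: Iterator[String]): Map[String, Int] =
    countRows(rows).filter { case (_, count) => count > 1 }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data standing in for the lines of data.txt.
    val sample = Iterator("a", "b", "a", "c", "b", "a")
    println(duplicates(sample))
  }
}
```

Because the rows are consumed one at a time, only the map of counts is held in memory, not the rows themselves. The map still has one entry per distinct row, so this helps most when many rows repeat.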
Here is the best blog post I found for learning folding. The author describes it succinctly and accurately: https://coderwall.com/p/4l73-a/scala-fold-foldleft-and-foldright
score:-1
If you use Scala collections (like Seq or List), you have a method called .distinct. Otherwise you can transform the collection into a Set, which removes duplicates by default (but doesn't preserve the order).
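A short sketch of what these two calls do, using a made-up list. Note that both of them drop the repeats rather than report them. The last line is one possible way (an assumption about what is wanted, not part of the answer above) to recover the repeated values using diff. The object name DistinctDemo and the data are hypothetical.

```scala
object DistinctDemo {
  // Hypothetical sample data.
  val rows: List[String] = List("b", "a", "b", "c", "a")

  // distinct keeps the first occurrence of each element, in the original order.
  val deduped: List[String] = rows.distinct

  // toSet also removes duplicates, but a Set makes no promise about ordering.
  val asSet: Set[String] = rows.toSet

  // diff removes one occurrence per element of its argument, so what is left
  // are the extra occurrences; distinct then lists each repeated value once.
  val repeated: List[String] = rows.diff(rows.distinct).distinct
}
```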
score:1
dataset.groupBy(identity).collect { case (k, v) if v.size > 1 => k }
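As a rough illustration, the one-liner can be wrapped in a helper and tried on a small in-memory Seq. The object name GroupByDemo, the method name duplicates, and the final toSet are additions for this sketch; it assumes dataset is an ordinary Scala collection rather than, say, a Spark Dataset.

```scala
object GroupByDemo {
  // Group equal elements together, then keep the keys whose group has more than one member.
  def duplicates[A](dataset: Seq[A]): Set[A] =
    dataset
      .groupBy(identity)                        // Map from each value to all of its occurrences
      .collect { case (k, v) if v.size > 1 => k }
      .toSet                                    // added here so the result type is explicit
}
```

For example, GroupByDemo.duplicates(Seq("a", "b", "a", "c", "b")) yields a set containing "a" and "b". Since groupBy builds the whole map of groups first, every element is held in memory at once.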
Source: stackoverflow.com