Score: 0
You cannot solve this elegantly and performantly the Spark way, because not enough initial information is provided for Spark to guarantee that all related data lands in the same partition. And if we force all processing into a single partition, that defeats the true intent of Spark.
In fact, a sensible partitionBy cannot be issued (for a window function). The issue is that the data is one long sequential list, so deciding a row's value can require looking across partitions to see whether data in the previous partition relates to the current one. That could be done, but it's quite a job. zero323 has an answer somewhere here that attempts to solve this, but if I remember correctly, it is cumbersome.
The logic itself is easy enough; it is using Spark for it that is problematic.
Without a partitionBy, all the data gets shuffled to a single partition, which can result in OOM and space problems.
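To illustrate why the logic is easy sequentially: assuming the task is the usual fill-forward (carry the last non-empty value down into later rows, as in the linked question), a single-threaded sketch is just a scanLeft over the rows. The data and the fill-forward rule here are illustrative assumptions, not the asker's actual schema:

```scala
// Hypothetical input: a column where gaps should inherit the value above.
val rows: Seq[Option[String]] = Seq(Some("a"), None, None, Some("b"), None)

// Carry the last defined value forward. scanLeft emits the seed as its
// first element, so drop it to realign with the input rows.
val filled: Seq[Option[String]] =
  rows.scanLeft(Option.empty[String])((prev, cur) => cur.orElse(prev)).drop(1)

// filled: Seq(Some("a"), Some("a"), Some("a"), Some("b"), Some("b"))
```

The whole difficulty the answer describes is that this fold carries state from row N-1 to row N, so distributing it means the last value of each partition must flow into the next partition, which plain window functions over a partitionBy cannot express.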
sorry.
Source: stackoverflow.com