score:0
You have multiple files per partition because each node writes its output to its own file. That means the only way to get a single file per partition is to repartition the data before writing. Note that this will be quite inefficient, because repartitioning causes a shuffle of your data.
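For illustration, a minimal, self-contained sketch of that approach in Scala (the sample DataFrame, the column name "key", and the output path are assumptions, not from the original question):

// Sketch: repartition by the output partition column before writing,
// so each partition directory receives exactly one file.
import org.apache.spark.sql.SparkSession

object SingleFilePerPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("single-file-per-partition").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; "key" is the column the output is partitioned by.
    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

    df.repartition($"key")   // the shuffle: all rows with the same key end up in one task
      .write
      .partitionBy("key")    // one key=... directory per distinct key
      .format("csv")
      .save("/tmp/output")   // illustrative output path

    spark.stop()
  }
}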
score:1
Having lots of files is the expected behavior: each partition (resulting from whatever computation you ran before the write) writes its own files into the output partitions you requested.
If you wish to avoid that, you need to repartition before the write:
spark.createDataFrame(asRow, struct)
  .repartition("foo", "bar")   // shuffle so all rows with the same (foo, bar) land in one task
  .write
  .partitionBy("foo", "bar")   // one foo=.../bar=... directory per combination
  .format("text")
  .save("/some/output-path")
Source: stackoverflow.com
Related Queries
- How to convert a SQL query output (dataframe) into an array list of key value pairs in Spark Scala?
- Spark: repartition output by key
- Read XML in spark 2.2 with java and expected output in key value format
- Write to multiple outputs by key Spark - one Spark job
- Joining Spark dataframes on the key
- Merge Spark output CSV files with a single header
- Spark Standalone Mode: How to compress spark output written to HDFS
- Spark Group By Key to (Key,List) Pair
- Spark ML VectorAssembler returns strange output
- Spark runs out of memory when grouping by key
- Passing a map with struct-type key into a Spark UDF
- Spark throws java.util.NoSuchElementException: key not found: 67
- Get value from a map for a column value as a key in spark dataframes
- Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each
- Spark Dataframes- Reducing By Key
- How can I merge spark results files without repartition and copyMerge?
- spark group multiple rdd items by key
- How to use ReduceByKey on multiple key in a Scala Spark Job
- Spark joinWithCassandraTable() on map multiple partition key ERROR
- how to collect spark sql output to a file?
- Why Spark repartition leads to MemoryOverhead?
- Spark JoinWithCassandraTable on TimeStamp partition key STUCK
- When create two different Spark Pair RDD with same key set, will Spark distribute partition with same key to the same machine?
- Why does Spark ML NaiveBayes output labels that are different from the training data?
- Turn list of key/value pairs into list of values per key in spark
- Spark Structured Streaming with Kafka - How to repartition the data and distribute the processing among worker nodes
- Spark output to kafka exactly-once
- Scala spark reduce by key and find common value
- how to use spark intersection() by key or filter() with two RDD?
- Spark Standalone Mode: Change replication factor of HDFS output
More Queries from the same tag
- Scala make List of strings from char array
- Scala amazon S3 file upload show error not found
- how to replace distinct() with reducebykey
- Schema transformation in Scala
- Disable false warning "possible missing interpolator"
- Class broken error with Joda Time using Scala
- Function of White Space in Scala Class Property Setter
- Spark withColumn working for modifying column but not adding a new one
- SSL connection in Play! Framework
- Need help in solving type mismatch error (value is not a member of Double) in Scala
- Extracting array index in Spark Dataframe
- Scala : Nested getOrElse
- Play framework: scala template, if statement issue
- Scala Play: List to Json-Array
- Cannot extend Flink ProcessFunction
- Strange delays in spark streaming
- How to fix all the errors and successfully run the Akka Distributed Workers Sample Project?
- Scala create a numeric from a string
- Scala: A extends Ordered[A]
- Too Many Arguments for reduce [Flink 1.9 in Scala]
- Need pointers for optimization of Merge Sort implementation in Scala
- Scala reflection with Int parameter
- Generate apply methods creating a class
- Improving the performance of a specific piece of Scala code
- Building configuration DSL in Scala
- How to do feature engineering for integer column?
- Using a single actor for in and out ports of a Flow
- How can I update a Swing canvas at a certain framerate?
- How to parameterize writing dataframe into hive table
- How to deserialize json with json-lenses