score:269

Accepted answer

reduce vs foldLeft

A big big difference, not mentioned in any other stackoverflow answer relating to this topic clearly, is that reduce should be given a commutative monoid, i.e. an operation that is both commutative and associative. This means the operation can be parallelized.

This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why reduce even exists. The collection can be chopped up and the reduce can operate on each chunk, then the reduce can operate on the results of each chunk - in fact the level of chunking need not stop one level deep. We could chop up each chunk too. This is why summing integers in a list is O(log N) if given an infinite number of CPUs.

If you just look at the signatures there is no reason for reduce to exist because you can achieve everything you can with reduce with a foldLeft. The functionality of foldLeft is a greater than the functionality of reduce.

But you cannot parallelize a foldLeft, so its runtime is always O(N) (even if you feed in a commutative monoid). This is because it's assumed the operation is not a commutative monoid and so the cumulated value will be computed by a series of sequential aggregations.

foldLeft does not assume commutativity nor associativity. It's associativity that gives the ability to chop up the collection, and it's commutativity that makes cumulating easy because order is not important (so it doesn't matter which order to aggregate each of the results from each of the chunks). Strictly speaking commutativity is not necessary for parallelization, for example distributed sorting algorithms, it just makes the logic easier because you don't need to give your chunks an ordering.

If you have a look at the Spark documentation for reduce it specifically says "... commutative and associative binary operator"

http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD

Here is proof that reduce is NOT just a special case of foldLeft

scala> val intParList: ParSeq[Int] = (1 to 100000).map(_ => scala.util.Random.nextInt()).par

scala> timeMany(1000, intParList.reduce(_ + _))
Took 462.395867 milli seconds

scala> timeMany(1000, intParList.foldLeft(0)(_ + _))
Took 2589.363031 milli seconds

reduce vs fold

Now this is where it gets a little closer to the FP / mathematical roots, and a little trickier to explain. Reduce is defined formally as part of the MapReduce paradigm, which deals with orderless collections (multisets), Fold is formally defined in terms of recursion (see catamorphism) and thus assumes a structure / sequence to the collections.

There is no fold method in Scalding because under the (strict) Map Reduce programming model we cannot define fold because chunks do not have an ordering and fold only requires associativity, not commutativity.

Put simply, reduce works without an order of cumulation, fold requires an order of cumulation and it is that order of cumulation that necessitates a zero value NOT the existence of the zero value that distinguishes them. Strictly speaking reduce should work on an empty collection, because its zero value can by deduced by taking an arbitrary value x and then solving x op y = x, but that doesn't work with a non-commutative operation as there can exist a left and right zero value that are distinct (i.e. x op y != y op x). Of course Scala doesn't bother to work out what this zero value is as that would require doing some mathematics (which are probably uncomputable), so just throws an exception.

It seems (as is often the case in etymology) that this original mathematical meaning has been lost, since the only obvious difference in programming is the signature. The result is that reduce has become a synonym for fold, rather than preserving it's original meaning from MapReduce. Now these terms are often used interchangeably and behave the same in most implementations (ignoring empty collections). Weirdness is exacerbated by peculiarities, like in Spark, that we shall now address.

So Spark does have a fold, but the order in which sub results (one for each partition) are combined (at the time of writing) is the same order in which tasks are completed - and thus non-deterministic. Thanks to @CafeFeed for pointing out that fold uses runJob, which after reading through the code I realised that it's non-deterministic. Further confusion is created by Spark having a treeReduce but no treeFold.

Conclusion

There is a difference between reduce and fold even when applied to non-empty sequences. The former is defined as part of the MapReduce programming paradigm on collections with arbitrary order (http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf) and one ought to assume operators are commutative in addition to being associative to give deterministic results. The latter is defined in terms of catomorphisms and requires that the collections have a notion of sequence (or are defined recursively, like linked lists), thus do not require commutative operators.

In practice due to the unmathematical nature of programming, reduce and fold tend to behave in the same way, either correctly (like in Scala) or incorrectly (like in Spark).

Extra: My Opinion On the Spark API

My opinion is that confusion would be avoided if use of the term fold was completely dropped in Spark. At least spark does have a note in their documentation:

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala.

score:2

One other difference for Scalding is the use of combiners in Hadoop.

Imagine your operation is commutative monoid, with reduce it will be applied on the map side also instead of shuffling/sorting all data to reducers. With foldLeft this is not the case.

pipe.groupBy('product) {
   _.reduce('price -> 'total){ (sum: Double, price: Double) => sum + price }
   // reduce is .mapReduceMap in disguise
}

pipe.groupBy('product) {
   _.foldLeft('price -> 'total)(0.0){ (sum: Double, price: Double) => sum + price }
}

It is always good practice to define your operations as monoid in Scalding.

score:3

fold in Apache Spark is not the same as fold on not-distributed collections. In fact it requires commutative function to produce deterministic results:

This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.

This has been shown by Mishael Rosenthal and suggested by Make42 in his comment.

It's been suggested that observed behavior is related to HashPartitioner when in fact parallelize doesn't shuffle and doesn't use HashPartitioner.

import org.apache.spark.sql.SparkSession

/* Note: standalone (non-local) mode */
val master = "spark://...:7077"  

val spark = SparkSession.builder.master(master).getOrCreate()

/* Note: deterministic order */
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 4).sortBy(identity[String])
require(rdd.collect.sliding(2).forall { case Array(x, y) => x < y })

/* Note: all posible permutations */
require(Seq.fill(1000)(rdd.fold("")(_ + _)).toSet.size == 24)

Explained:

Structure of fold for RDD

def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
  var jobResult: T
  val cleanOp: (T, T) => T
  val foldPartition = Iterator[T] => T
  val mergeResult: (Int, T) => Unit
  sc.runJob(this, foldPartition, mergeResult)
  jobResult
}

is the same as structure of reduce for RDD:

def reduce(f: (T, T) => T): T = withScope {
  val cleanF: (T, T) => T
  val reducePartition: Iterator[T] => Option[T]
  var jobResult: Option[T]
  val mergeResult =  (Int, Option[T]) => Unit
  sc.runJob(this, reducePartition, mergeResult)
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}

where runJob is performed with disregard of partition order and results in need of commutative function.

foldPartition and reducePartition are equivalent in terms of order of processing and effectively (by inheritance and delegation) implemented by reduceLeft and foldLeft on TraversableOnce.

Conclusion: fold on RDD cannot depend on order of chunks and needs commutativity and associativity.

score:10

If I am not mistaken, even though the Spark API does not require it, fold also requires for the f to be commutative. Because the order in which the partitions will be aggregated is not assured. For example in the following code only the first print out is sorted:

import org.apache.spark.{SparkConf, SparkContext}

object FoldExample extends App{

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("Simple Application")
  implicit val sc = new SparkContext(conf)

  val range = ('a' to 'z').map(_.toString)
  val rdd = sc.parallelize(range)

  println(range.reduce(_ + _))
  println(rdd.reduce(_ + _))
  println(rdd.fold("")(_ + _))
}  

Print out:

abcdefghijklmnopqrstuvwxyz

abcghituvjklmwxyzqrsdefnop

defghinopjklmqrstuvabcwxyz


Related Query

More Query from same tag