Accepted answer

Yes. I haven't used MongoDB, but based on other things I've done with Spark, these should all be quite possible.

However, do keep in mind that a Spark application is not typically fault-tolerant. The application (aka "driver") itself is a single point of failure. There's a related question on that topic (Resources/Documentation on how does the failover process work for the Spark Driver (and its YARN Container) in yarn-cluster mode), but I think it doesn't have a really good answer at the moment.

I have no experience running a critical HDFS cluster, so I don't know how much of a problem the single point of failure is. But another idea may be running on top of Amazon S3 or Google Cloud Storage. I would expect these to be way more reliable than anything you can cook up. They have large support teams and lots of money and expertise invested.

Related Query

More Query from same tag