In general, materializing 400M rows on a single node runs against the idea of distributed computation. If you can share details of what you want to accomplish, we can suggest how to do it in a parallelized manner; Spark's APIs are flexible enough to cover most use cases.
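
For example, instead of `collect()`ing everything to the driver, you can aggregate on the executors or write the full result out in parallel with the DataFrame API. A rough sketch (the storage paths and column names are placeholders, and this assumes a cluster with a running Spark session):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder input path; substitute your own source.
df = spark.read.parquet("wasbs://container@account.blob.core.windows.net/input")

# Aggregation happens on the executors; only the small
# summary result ever reaches the driver.
summary = df.groupBy("key").agg(F.count("*").alias("n"))
summary.show()

# To keep all 400M rows, write them out in parallel
# rather than collecting them to one process.
df.write.mode("overwrite").parquet(
    "wasbs://container@account.blob.core.windows.net/output")
```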

That said, you can still make this happen. When you collect 400M rows, they all accumulate in the driver process. On Azure HDInsight, the driver runs inside the YARN application master, so you will need to configure it with enough memory to hold that much data. From a Jupyter notebook, the configuration is:

```
%%configure -f
{ "driverMemory": "60G" }
```

Just add it as a separate cell to the notebook.
