In general, materializing 400 million rows on a single machine runs against the idea of distributed computation. If you can provide details of what you want to accomplish, we can suggest how to do it in a parallelized manner; Spark has flexible APIs that can accommodate most use cases.
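For instance, if the goal is to export the query results, the executors can write them out in parallel so nothing ever funnels through the driver. A minimal PySpark sketch, assuming a hypothetical table name and output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical query; substitute your own table and filter.
df = spark.sql("SELECT * FROM events WHERE year = 2018")

# Each executor writes its own partitions directly to storage
# (wasb:// is HDInsight's Azure Blob Storage scheme), so the
# 400M rows are never accumulated in a single process.
df.write.mode("overwrite").parquet("wasb:///output/events_2018")
```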
You can still make this happen. When you collect 400M rows, you accumulate all of them in the driver process. On Azure HDInsight the driver runs inside the YARN application master, so you will need to configure it with enough memory to hold that much data. From Jupyter, the configuration is:
%%configure -f
{ "driverMemory": "60G" }
Just add it as a separate cell in the notebook; the -f flag recreates the session with the new settings.
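Once the session has restarted with the larger driver, a follow-up cell might look like the sketch below. The table name and the process() handler are hypothetical placeholders; toLocalIterator() is shown as a gentler alternative that streams one partition at a time instead of holding all 400M rows in memory at once.

```python
# Run after the %%configure cell has recreated the session
# (sparkmagic notebooks expose `spark` automatically).
df = spark.sql("SELECT * FROM events")  # hypothetical table name

# Option 1: materialize everything in driver memory (what the
# 60G driverMemory setting above is sized for).
rows = df.collect()

# Option 2: stream partition by partition; the driver only holds
# one partition's rows at a time.
for row in df.toLocalIterator():
    process(row)  # hypothetical row handler
```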
Source: stackoverflow.com
Related Queries
- Pulling 400 million rows using Spark SQL using Jupyter notebook
- Spark SQL using a window - collect data from rows after current row based on a column condition
- How do I check for equality using Spark Dataframe without SQL Query?
- Read from a hive table and write back to it using spark sql
- Dollar sign in function call in Java using Spark SQL
- How to show the spark progress bar in Jupyter notebook (using pyspark)
- Spark 2.2/Jupyter Notebook SQL regexp_extract function not matching regex pattern
- How to break each rows into multiple rows in Spark DataFrame using scala
- explode a row of spark dataset into several rows with added column using flatmap
- How can I find the size of each Row in an Apache Spark SQL dataframe and discard the rows having a size larger than a threshold in kilobytes
- How to compare one row with all other rows in spark using scala
- Can I use Jupyter lab to interact with databricks spark cluster using Scala?
- Add identical rows to a Spark Dataframe using an integer
- Add new rows in the Spark DataFrame using scala
- spark vs pandas dataframe (with large columns) head(n) in jupyter notebook
- Spark sql using spark thrift server
- Delete rows from Azure Sql table using Azure Databricks with Scala
- Spark scala - Output number of rows after sql insert operation
- Find the facility with the longest interval without accidents using Apache Spark SQL
- "ORA-00933: SQL command not properly ended" error while querying Oracle DB using Spark
- How to refer to data in Spark SQL using column names?
- remove duplicated columns in Spark using SQL expression
- Drop rows of Spark DataFrame that contain specific value in column using Scala
- Filling missing values in rows using Apache spark
- Optimal way to save spark sql dataframe to S3 using information stored in them
- What is the correct way to dynamically pass a list or variable into a SQL cell in a spark databricks notebook in Scala?
- Error while writing spark 3.0 sql dataframe to CSV file using scala without Databricks
- Compare rows of an array column with the headers of another data frame using Scala and Spark
- how to split row into multiple rows on the basis of date using spark scala?
- How to do a Spark dataframe(1 million rows) cartesian product with a list(1000 entries) efficiently to generate a new dataframe with 1 billion rows
More Queries from the same tag
- Playframework play.api.libs.json.Json.format real use case
- How scale a Elasticbeanstalk application worker on based on messages from SQS?
- how to keep records information when working in ML
- Dropping multiple columns from Spark dataframe by Iterating through the columns from a Scala List of Column names
- sbt-native-packager defining multiple mainClasses in different modules
- Play Framework Json validation that needs to check database
- Scala parser combinators and Reader infinite loop
- Scalaz Type Classes for Apache Spark RDDs
- How to extract values from Some() in Scala
- Tests fail on a simple RDD action
- How to split a sorted RDD into n parts and get first element from each part?
- HPROF result interpretation
- Scala's `toMap` in Javascript
- Scala: removing a list element by value
- play framework - how can i call this function for the authenticated user in this code play2 scala zentasks
- How to exclude LICENSE files in sbt-android-sdk plugin?
- Some[Seq[X]] does not conform to the expected type Option[Seq[X]]
- Why does a method parameter cause NotSerializableException with Mockito?
- Dependency error while setting up spark with java in eclipse
- Gatling. Check, if a HTML result contains some string
- Scala reduce a List based on a condition
- scala/java get json value using dot pattern
- Extracting lift-json into a case class with an upper bound
- Possibility to configure Akka to use blocking or non blocking i/o?
- Inserting default values in slick
- Unable to override bash startup script in sbt-native-packager
- Why won't sbt run ScalaTest's example test?
- Spark: how to filter rows without using any joins?
- Spark scala not able to push data in Hive table
- Databricks Connect and External Libraries