Since the error message points to a specific file that you did not specify (/wikiinput/wiki.xml.gz), that file is probably either corrupt or you don't have the right permissions to access it.
Are you using the latest version of Spark? I think the Python API lagged a bit behind in older Spark versions.
And what input does gensim.corpora.wikicorpus.extract_pages expect? I'm just curious because /wikiinput/wiki.xml.gz contains neither protocol nor bucket, and thus might simply fail to address the correct file. When I'm using wholeTextFiles from Scala and on HDFS, the file name is
The problem seems to lie not primarily with Spark, but with the version of the Hadoop libraries it is linked against. I was getting this error when using Spark 1.3.0 with Hadoop 1, but do not see it when using Hadoop 2. If you need this method to work with S3, be sure to install a version of Spark linked to the Hadoop 2 libs. Specifically, if you're using the spark-ec2 script to set up a cluster on AWS, be sure to include the option --hadoop-major-version=2.
Full details can be found here: https://issues.apache.org/jira/browse/SPARK-4414
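To check the earlier point about a path that carries neither protocol nor bucket, here is a minimal sketch in plain Python that splits a path into its scheme and authority before it is handed to Spark. The helper names, the bucket name `my-bucket`, and the host `namenode:8020` are hypothetical and only for illustration; the sketch says nothing about which form your cluster actually needs.

```python
from urllib.parse import urlsplit


def describe_path(path):
    """Return (scheme, authority) for a path string.

    Both values are empty strings when the path names no protocol
    and no bucket/host, as with /wikiinput/wiki.xml.gz.
    """
    parts = urlsplit(path)
    return parts.scheme, parts.netloc


def has_scheme_and_authority(path):
    """True only if the path names both a protocol and a bucket or host."""
    scheme, authority = describe_path(path)
    return bool(scheme) and bool(authority)


# Hypothetical example paths; only the first one comes from the error message.
candidates = [
    "/wikiinput/wiki.xml.gz",
    "s3n://my-bucket/wikiinput/wiki.xml.gz",
    "hdfs://namenode:8020/wikiinput/wiki.xml.gz",
]

for candidate in candidates:
    print(candidate, describe_path(candidate), has_scheme_and_authority(candidate))
```

A path without a scheme is normally resolved against the default filesystem configured for the job, so a check like this can help show whether the file that was requested is the file you intended.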