Since the error message points to a specific file that you did not specify (/wikiinput/wiki.xml.gz), that file is probably either corrupt, or you don't have the right permissions to access it.

Are you using the latest version of Spark? I think the Python API lagged a bit behind in older Spark versions.

And what input does gensim.corpora.wikicorpus.extract_pages expect? I'm just curious, because /wikiinput/wiki.xml.gz contains neither a protocol nor a bucket, and thus might simply fail to address the correct file. When I use wholeTextFiles from Scala against HDFS, the file name looks like hdfs://<host>:<port>/path/to/file.
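To illustrate, here is a minimal sketch of building a fully qualified URI before handing it to wholeTextFiles. The host, port, and helper names are hypothetical placeholders, and it assumes PySpark is available; the point is simply that the path should carry an explicit protocol and host rather than being a bare path like /wikiinput/wiki.xml.gz:

```python
def hdfs_uri(host, port, path):
    """Build a fully qualified HDFS URI: protocol + host + port + path.
    (Hypothetical helper; adjust host/port for your cluster.)"""
    return "hdfs://%s:%d/%s" % (host, port, path.lstrip("/"))

def read_wiki_dump(sc, uri):
    """Read the dump as (filename, content) pairs.
    `sc` is an existing pyspark.SparkContext."""
    return sc.wholeTextFiles(uri)

# Example:
# hdfs_uri("namenode.example.com", 8020, "/wikiinput/wiki.xml.gz")
# -> "hdfs://namenode.example.com:8020/wikiinput/wiki.xml.gz"
```

With a URI like that, Spark can resolve the filesystem unambiguously; an equivalent s3n:// URI would need the bucket name in the host position.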


The problem seems to lie not with Spark itself, but with the version of the Hadoop libraries it is linked against. I was getting this error when using Spark 1.3.0 built for Hadoop 1, but I do not see it when using Hadoop 2. If you need this method to work with s3, be sure to install a version of Spark linked against the Hadoop 2 libraries. Specifically, if you're using the spark-ec2 script to set up a cluster on AWS, be sure to include the option --hadoop-major-version=2.
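For concreteness, a launch command might look like the sketch below. The cluster name, key pair, and identity file are placeholders; the important part is the --hadoop-major-version=2 flag, which makes spark-ec2 install a Spark build linked against the Hadoop 2 libraries:

```shell
# Hypothetical spark-ec2 invocation; substitute your own key pair,
# identity file, and cluster name.
./spark-ec2 --key-pair=my-key \
            --identity-file=my-key.pem \
            --hadoop-major-version=2 \
            launch my-cluster
```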

Full details can be found here:
