Score: 1
Answer to your question 2: instead of using ', you need to use ` (a backtick) before and after column names that contain spaces. Try the query below; it will work:
val expLevel = sqlContext.sql("SELECT `time_spend_company (years)` AS `years_spent_in_company`, count(1) FROM emp WHERE left_company = 1 GROUP BY `time_spend_company (years)`")
Question 1: loading Excel using "com.crealytics.spark.excel" is fine; I am also using it, though other options exist too. To assign different column names, you can use a StructType to define the schema and impose it while loading the data into the DataFrame, e.g.
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val newSchema = StructType(
  List(
    StructField("a", IntegerType, nullable = true),
    StructField("b", IntegerType, nullable = true),
    StructField("c", IntegerType, nullable = true),
    StructField("d", IntegerType, nullable = true)
  )
)
val employeesDf = spark.read.schema(newSchema)
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", empFile)
  .option("addColorColumns", "false")
  .load()
The first four columns can now be accessed as a, b, c, and d. Run the query below; it will work with the new column names.
sqlContext.sql("SELECT a, b, c, d FROM emp").show()
Score: 1
- Spark has good support for working with CSVs, so if your Excel file has just one sheet you can convert it to CSV by simply renaming empDatasets.xlsx to empDatasets.csv. Once you have your file as a CSV, you can read it with spark.read.csv(pathToCSV) and supply many options, e.g. to read or skip the header, or to supply the schema of the dataset with spark.read.schema(schema).csv(pathToCSV). Here, the schema can be created as described above, or extracted from a case class using the Spark SQL Encoders: Encoders.product[CASE_CLASS_NAME].schema
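As a sketch of the Encoders approach (this requires a Spark dependency on the classpath; the case class name and its fields below are made up for illustration), the schema can be derived directly from a case class:

import org.apache.spark.sql.Encoders

// Hypothetical case class mirroring the first four columns of the sheet
case class Employee(a: Int, b: Int, c: Int, d: Int)

// Derive a StructType schema from the case class fields
val schema = Encoders.product[Employee].schema

// Then read the CSV with that schema imposed:
// spark.read.schema(schema).csv(pathToCSV)

This avoids writing the StructType by hand: the field names and types of the case class become the column names and types of the schema.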
- You can remove spaces from the column names like:
val employeesDfColumns = employeesDf.columns.map(x => col(x.replaceAll(" ", "")))
and apply these new column names to the DataFrame:
val cleanedDf = employeesDf.select(employeesDfColumns: _*)
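The replaceAll step above is plain String manipulation, so it can be sketched without Spark at all. Here is a minimal, self-contained illustration (the column names are made up to resemble the ones in the question) of what the mapping does to headers containing spaces:

```scala
object ColumnCleaner {
  // Strip every space from a column name,
  // e.g. "time_spend_company (years)" -> "time_spend_company(years)"
  def clean(name: String): String = name.replaceAll(" ", "")

  def main(args: Array[String]): Unit = {
    val rawColumns = Seq("time_spend_company (years)", "left company", "salary")
    val cleaned = rawColumns.map(clean)
    println(cleaned.mkString(", "))
  }
}
```

In the Spark version, the same function is applied to employeesDf.columns and the results are wrapped in col(...) so they can be passed to select.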
Score: 3
For version 0.13.5 you will need a different set of parameters:
def readExcel(file: String): DataFrame = {
  sqlContext.read
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'Sheet_Name'!A1")   // Optional, default: "A1"
    .option("header", "true")                   // Required
    .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("inferSchema", "true")              // Optional, default: false
    .option("addColorColumns", "false")         // Optional, default: false
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    .option("maxRowsInMemory", 20)              // Optional, default: None. If set, uses a streaming reader which can help with big files
    .load(file)
}
Maven dependency:
<dependency>
  <groupId>com.crealytics</groupId>
  <artifactId>spark-excel_2.11</artifactId>
  <version>0.13.5</version>
</dependency>
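If your build uses sbt rather than Maven, the equivalent dependency (same version; %% appends the Scala-version suffix such as _2.11 automatically) would be declared in build.sbt as:

// build.sbt
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.5"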
Source: stackoverflow.com