Score: 1

Accepted answer

Answer to your question 2: instead of using `'`, you need to use backticks (`` ` ``) around column names that contain spaces. Try the query below; it will work:

val expLevel = spark.sql("SELECT `time_spend_company (years)` AS `years_spent_in_company`, COUNT(1) FROM emp WHERE left_company = 1 GROUP BY `time_spend_company (years)`")

Question 1: loading Excel files using "com.crealytics.spark.excel" is fine; I am also using it. There are other options too. To assign different column names, you can use a StructType to define the schema and impose it while loading the data into the DataFrame. E.g.

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val newSchema = StructType(
    List(StructField("a", IntegerType, nullable = true),
         StructField("b", IntegerType, nullable = true),
         StructField("c", IntegerType, nullable = true),
         StructField("d", IntegerType, nullable = true))
  )

val employeesDF = spark.read.schema(newSchema)
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", empFile)
  .option("addColorColumns", "false")
  .load()

The first four columns can now be accessed as a, b, c, and d. Register the DataFrame as a temp view and run the query below; it works against the new column names.

employeesDF.createOrReplaceTempView("emp")
spark.sql("SELECT a, b, c, d FROM emp").show()

Score: 1

  1. Spark has good support for working with CSV files. So if your Excel file has just one sheet, you can convert it to CSV by simply renaming empDatasets.xlsx to empDatasets.csv. Use this to do it.

Once you have your file as a CSV, you can read it as spark.read.csv(pathToCsv) and can supply many options, e.g. to read/skip the header, or supply the schema of the dataset as spark.read.schema(schema).csv(pathToCsv).
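
A minimal sketch of such a read (the path is hypothetical, and `schema` is a StructType built as described below):

val df = spark.read
  .option("header", "true") // treat the first line as column names
  .schema(schema)           // or .option("inferSchema", "true") to let Spark guess types
  .csv("empDatasets.csv")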

Here the schema can be created as described here, or can be extracted from a case class using Spark SQL Encoders: Encoders.product[CASE_CLASS_NAME].schema
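
For example, a short sketch with a hypothetical case class mirroring four integer columns:

import org.apache.spark.sql.Encoders

// Hypothetical case class, for illustration only
case class Employee(a: Int, b: Int, c: Int, d: Int)

val schema = Encoders.product[Employee].schema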

  2. You can remove the spaces from the column names like this:

import org.apache.spark.sql.functions.col

val employeesDFColumns = employeesDF.columns.map(x => col(x).as(x.replaceAll(" ", "")))

and apply these new column names to the DataFrame:

val employeesDF2 = employeesDF.select(employeesDFColumns: _*)
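
An equivalent approach, shown here only as a sketch, is to fold over the columns with withColumnRenamed:

val renamedDF = employeesDF.columns.foldLeft(employeesDF) { (df, c) =>
  df.withColumnRenamed(c, c.replaceAll(" ", ""))
}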

Score: 3

For version 0.13.5 you will need a different set of parameters:

import org.apache.spark.sql.DataFrame

def readExcel(file: String): DataFrame = {
    sqlContext.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'sheet_name'!A1") // Optional, default: "A1"
      .option("header", "true") // Required
      .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
      .option("inferSchema", "true") // Optional, default: false
      .option("addColorColumns", "false") // Optional, default: false
      .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
      .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
      .load(file)
  }
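
Example call (the file path here is hypothetical):

val empDF = readExcel("data/empDatasets.xlsx")
empDF.show(false)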

Maven dependency:

<dependency>
  <groupId>com.crealytics</groupId>
  <artifactId>spark-excel_2.11</artifactId>
  <version>0.13.5</version>
</dependency>
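
If you build with sbt instead of Maven, the equivalent dependency is:

libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.5"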
