score:4

Accepted answer

Well... use substring to split each line into fields, then trim to remove the whitespace padding, and then convert to whatever types you need.

case class DataUnit(s1: Int, s2: String, s3: Boolean, s4: Double)

// assuming a SparkSession named `spark`; its implicits are needed for .toDF
import spark.implicits._

sc.textFile("your_file_path")
  .map(l => (l.substring(0, 3).trim, l.substring(3, 13).trim, l.substring(13, 18).trim, l.substring(18, 22).trim))
  .map { case (e1, e2, e3, e4) => DataUnit(e1.toInt, e2, e3.toBoolean, e4.toDouble) }
  .toDF()
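For reference, here is how a single 22-character line maps onto those offsets; the sample value below is made up, not from the original question:

// Hypothetical input line matching the offsets above:
//   0-3 -> s1, 3-13 -> s2, 13-18 -> s3, 18-22 -> s4
val line = " 42hello     true 3.14"

val parts = (line.substring(0, 3).trim, line.substring(3, 13).trim,
             line.substring(13, 18).trim, line.substring(18, 22).trim)
// parts == ("42", "hello", "true", "3.14"),
// which becomes DataUnit(42, "hello", true, 3.14) after the conversions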

score:5

The fixed-length format is very old and I could not find a good Scala library for it... so I created my own.

You can check it out here: https://github.com/atais/Fixed-Length

Usage with Spark is quite simple: you get a Dataset of your objects!

You first need to create a description of your objects, e.g.:

case class Employee(name: String, number: Option[Int], manager: Boolean)

object Employee {

  import com.github.atais.util.Read._
  import cats.implicits._
  import com.github.atais.util.Write._
  import Codec._

  implicit val employeeCodec: Codec[Employee] = {
    fixed[String](0, 10) <<:
      fixed[Option[Int]](10, 13, Alignment.Right) <<:
      fixed[Boolean](13, 18)
  }.as[Employee]
}
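To sanity-check the codec outside Spark, you can decode a single line by hand. The 18-character sample below is made up to match the offsets above, and the exact shape of the error value is up to the library:

// Hypothetical record: name in 0-10, number right-aligned in 10-13,
// boolean in 13-18.
val line = "Fishercat   1true "

Parser.decode[Employee](line)
// Right(Employee("Fishercat", Some(1), true)) on success,
// Left(...) describing the failure otherwise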

And later just use the parser:

val input = sql.sparkContext.textFile(file)
  .filter(_.trim.nonEmpty)
  .map(Parser.decode[Employee])
  .flatMap {
    case Right(x) => Some(x)
    case Left(e) =>
      // log and drop lines that fail to decode
      System.err.println(s"Failed to decode a line in $file, error: $e")
      None
  }
sql.createDataset(input)
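Note that createDataset needs an implicit Encoder[Employee] in scope; for case classes this comes from the SparkSession implicits. A minimal sketch of the surrounding setup, assuming `sql` is a SparkSession (the answer does not show how it was created):

import org.apache.spark.sql.SparkSession

val sql = SparkSession.builder().appName("fixed-length-example").getOrCreate()
import sql.implicits._ // supplies the Encoder[Employee] used by createDataset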
