Here is a summary of what I found while researching this issue. Hadoop's TextInputFormat, like most input formats, processes one line at a time. It also treats the record delimiter as a literal character or string, not as a regex.

One way around this is to build a custom regex-based InputFormat. This blog post describes how that can be done in more detail:

https://hadoopi.wordpress.com/2013/05/31/custom-recordreader-processing-string-pattern-delimited-records/
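
To illustrate the idea behind a pattern-delimited reader, here is a minimal sketch in plain Python. It is not the Hadoop RecordReader API and not the code from the blog post above; `read_records`, `chunk_size`, and the sample pattern (a newline not preceded by a backslash) are all hypothetical names and values chosen for illustration. It assumes the delimiter regex never matches an empty string and does not rely on lookahead past the end of a chunk.

```python
import io
import re


def read_records(stream, delimiter_pattern, chunk_size=4096):
    """Yield records from a text stream, splitting on a regex delimiter."""
    delimiter = re.compile(delimiter_pattern)
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        at_eof = chunk == ""
        buffer += chunk
        pos = 0
        for match in delimiter.finditer(buffer):
            # A match touching the end of the buffer may continue in the
            # next chunk, so defer it until more data (or EOF) arrives.
            if match.end() == len(buffer) and not at_eof:
                break
            yield buffer[pos:match.start()]
            pos = match.end()
        # Keep only the unconsumed tail for the next iteration.
        buffer = buffer[pos:]
        if at_eof:
            if buffer:
                yield buffer  # trailing record without a final delimiter
            return


# Hypothetical sample: the second record contains an escaped newline.
sample = io.StringIO("id1,foo\nid2,bar\\\nbaz\nid3,qux\n")
for record in read_records(sample, r"(?<!\\)\n"):
    print(repr(record))
```

A real Hadoop implementation would additionally have to deal with input splits, so that a record crossing a split boundary is read exactly once.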

Another approach, when only a few lines contain these escape sequences, is to filter those lines into a separate RDD, reduce it to a single String, and split that String to generate a new RDD, which can then be unioned back with the rest. This works as a hack, but it is not a real solution to the problem. Better solutions are welcome.
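
A rough PySpark-style sketch of this hack might look as follows. The function name `resplit_broken_lines`, the predicate `is_broken`, and the pattern `record_pattern` are hypothetical; `sc` is assumed to be a SparkContext and `lines` an RDD of strings. One deliberate deviation: it gathers the filtered lines with `collect()` and joins them on the driver instead of calling `reduce`, because `RDD.reduce` expects a commutative and associative operator and string concatenation is not commutative. It also assumes the filtered lines are few enough to fit in driver memory.

```python
import re


def resplit_broken_lines(sc, lines, is_broken, record_pattern):
    """Re-split the lines flagged by is_broken and union them back.

    sc is assumed to be a SparkContext and lines an RDD of strings.
    """
    # Separate the problematic lines from the ones that are already fine.
    broken = lines.filter(is_broken)
    clean = lines.filter(lambda line: not is_broken(line))

    # Bring the (assumed small) set of broken lines to the driver, in order,
    # and merge them into one string.
    merged = "\n".join(broken.collect())

    # Split the merged text on the real record delimiter, dropping empties.
    repaired = [rec for rec in re.split(record_pattern, merged) if rec]

    # Turn the repaired records into an RDD and combine with the clean lines.
    return clean.union(sc.parallelize(repaired))


# Hypothetical usage, treating a literal backslash-n inside a line as an
# extra record delimiter:
#   fixed = resplit_broken_lines(
#       sc, sc.textFile("input.txt"),
#       lambda line: "\\n" in line,
#       r"\\n|\n",
#   )
```

The resulting RDD does not preserve the original record order, since the repaired records are appended by the union. If records never span more than one physical line, a per-line `flatMap` with `re.split` would probably avoid the round trip through the driver.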

