Summarizing what I found while researching this issue: Hadoop's TextInputFormat, like most input formats, processes one line at a time. It also treats the record delimiter as a literal character or string, not as a regex.

One way around this is to build a custom regex-based InputFormat. This blog describes how this can be done in more detail.
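To make the difference concrete, here is a small plain-Python sketch (no Hadoop or Spark involved) comparing a literal delimiter with a regex delimiter. The sample data and the assumption that the troublesome escape sequence is a literal backslash followed by `n` are mine, purely for illustration; the function names are hypothetical too.

```python
import re


def split_literal(text, delimiter="\n"):
    """Split on one fixed delimiter string, as a line-oriented reader does."""
    return [record for record in text.split(delimiter) if record]


def split_regex(text, pattern=r"\n|\\n"):
    """Split on any match of a regex: a real newline OR a literal backslash + n."""
    return [record for record in re.split(pattern, text) if record]


# Assumed sample: the second line contains a literal "\n" escape sequence
# (two characters) that should also act as a record separator.
sample = "a,1\nb,2\\nc,3\nd,4"

print(split_literal(sample))
print(split_regex(sample))
```

The literal split leaves `b,2` and `c,3` glued together in one record, while the regex split yields four separate records. A custom regex InputFormat would apply the same idea while reading the file, rather than on an in-memory string.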

Another approach, when only a few lines contain these escape sequences, is to filter those lines out into a separate RDD, reduce that RDD to a single String, and split the String to generate a new RDD, which can then be unioned back with the rest. This works as a hack, but it is not a real solution to the problem. Better solutions are welcome.
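The steps of that hack can be sketched roughly as follows. This is plain Python with lists standing in for RDDs, so it runs without a cluster; in Spark the steps would map onto `filter`, `reduce`, `parallelize` and `union`. The escape sequence (a literal backslash followed by `n`) and the `repair_lines` name are assumptions for illustration only.

```python
# Assumed escape sequence: a literal backslash followed by "n" (two characters).
ESCAPE = "\\n"


def repair_lines(lines, escape=ESCAPE):
    """Split lines containing the escape sequence and merge them back in."""
    # Step 1: separate affected lines from clean ones (two filters on the RDD).
    affected = [line for line in lines if escape in line]
    clean = [line for line in lines if escape not in line]

    # Step 2: reduce the affected lines to one string, joined by the escape
    # sequence so that the original line boundaries are not lost.
    joined = escape.join(affected)

    # Step 3: split that string on the escape sequence to get the real records.
    repaired = [record for record in joined.split(escape) if record]

    # Step 4: union the repaired records back with the clean lines.
    return clean + repaired


lines = ["a,1", "b,2\\nc,3", "d,4", "e,5\\nf,6"]
print(repair_lines(lines))
```

Note that the repaired records end up after the clean ones, so the original ordering is not preserved. The reduce step also pulls all affected lines into a single string, which is only reasonable when there really are just a few such lines.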
