score:0

Accepted answer

tr might be the fastest solution. Note that you can pipe any stream into it; in this case I'm cat-ing a file on disk, but it could also be a file stream from SFTP.

~/desktop/test $ cat data.txt
$~$field1$~$|$~$field2$~$|$~$field3$~$
$~$data1$~$|$~$data2$~$|$~$data3$~$
$~$data4$~$|$~$data5$~$|$~$data6$~$
$~$data7$~$|$~$data8$~$|$~$data9$~$

# tr -d treats its argument as a set of characters, so this removes every '$' and '~'; the '>' opens a new file for writing

~/desktop/test $ cat data.txt | tr -d '$~$' > output.psv

# see the results here
~/desktop/test $ cat output.psv
field1|field2|field3
data1|data2|data3
data4|data5|data6
data7|data8|data9

Examples: https://shapeshed.com/unix-tr/#what-is-the-tr-command-in-unix
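If the cleaned file will then be loaded into Spark (as in the answers below), a minimal sketch of reading it, assuming the output.psv produced above and that its first line is a header row:

// read the pipe-delimited file produced by tr above;
// assumes output.psv is reachable by Spark and its first line is the header
val cleaned = spark.read
  .option("delimiter", "|")
  .option("header", "true")
  .csv("output.psv")

cleaned.show()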

score:0

Here is a pure Spark solution. There might be better-performing solutions.

import org.apache.spark.sql.functions.{col, lit, udf}

val df = spark.read.option("delimiter", "|").csv(filepath)

// UDF that replaces every occurrence of `find` with `replacement`
val replaceFn = (value: String, find: String, replacement: String) => value.replace(find, replacement)
val replaceUdf = udf(replaceFn)

// apply the UDF to every column, keeping the original column names
df.select(
       df.columns.map(c => replaceUdf(col(c), lit("$~$"), lit("")).alias(c)): _*)
  .show

Update: you cannot use $~$ as the quote option or $~$|$~$ as the delimiter in Spark 2.3.0, as those options accept only a single character.
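If the single-character restriction is the blocker, one possible workaround (a sketch, not part of the original answer) is to read each row as plain text and split it manually; filepath and the three-field layout come from the examples in this thread:

import org.apache.spark.sql.functions.{col, regexp_replace, split}

// read each row as a single text column named "value"
val lines = spark.read.text(filepath)

// strip the leading/trailing $~$, then split on the full $~$|$~$ separator
val parsed = lines.select(
  split(regexp_replace(col("value"), """^\$~\$|\$~\$$""", ""), """\$~\$\|\$~\$""").alias("fields"))

// the sample data has three fields; adjust the range for other layouts
parsed.select((0 until 3).map(i => col("fields").getItem(i).alias(s"field${i + 1}")): _*).show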

score:0

Using regexp_replace and foldLeft to update all the columns. Check this out:

scala> val df = Seq(("$~$data1$~$","$~$data2$~$","$~$data3$~$"), ("$~$data4$~$","$~$data5$~$","$~$data6$~$"), ("$~$data7$~$","$~$data8$~$","$~$data9$~$"),("$~$data10$~$","$~$data11$~$","$~$data12$~$")).toDF("field1","field2","field3")
df: org.apache.spark.sql.DataFrame = [field1: string, field2: string ... 1 more field]

scala> df.show(false)
+------------+------------+------------+
|field1      |field2      |field3      |
+------------+------------+------------+
|$~$data1$~$ |$~$data2$~$ |$~$data3$~$ |
|$~$data4$~$ |$~$data5$~$ |$~$data6$~$ |
|$~$data7$~$ |$~$data8$~$ |$~$data9$~$ |
|$~$data10$~$|$~$data11$~$|$~$data12$~$|
+------------+------------+------------+


scala> val df2 = df.columns.foldLeft(df) { (acc,x) => acc.withColumn(x,regexp_replace(col(x),"""^\$~\$|\$~\$$""","")) }
df2: org.apache.spark.sql.DataFrame = [field1: string, field2: string ... 1 more field]

scala> df2.show(false)
+------+------+------+
|field1|field2|field3|
+------+------+------+
|data1 |data2 |data3 |
|data4 |data5 |data6 |
|data7 |data8 |data9 |
|data10|data11|data12|
+------+------+------+


scala>
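An equivalent variant (just a sketch) builds all the columns in a single select instead of chaining withColumn calls, so only one projection is added rather than one per column:

import org.apache.spark.sql.functions.{col, regexp_replace}

// same regex as above, applied to every column in a single projection
val df3 = df.select(
  df.columns.map(c => regexp_replace(col(c), """^\$~\$|\$~\$$""", "").alias(c)): _*)

df3.show(false)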
