score:5

Accepted answer

I do wonder whether 0x12 is valid even in XML 1.1. See this summary of the differences between XML 1.0 and 1.1. In particular:

In addition, XML 1.1 allows you to have control characters in your documents through the use of character references. This concerns the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. This means that your document can now include the bell character, like this: &#x7;. However, you still cannot have these characters appear directly in your documents; this violates the definition of the MIME type used for XML (text/xml).

Xerces can parse XML 1.1, but it seems to expect the character reference &#18; instead of the literal 0x12 character:

import scala.xml.{Source, XML}

val s = "<?xml version='1.1'?><root>\u0012</root>"
// Fails with: An invalid XML character (Unicode: 0x12)
// XML.loadXML(Source.fromString(s), XML.parser)

val u = "<?xml version='1.1'?><root>&#18;</root>"
val v = XML.loadXML(Source.fromString(u), XML.parser)
println(v) // works

As suggested by lavinio, you may be able to filter out the invalid characters. This does not take many lines in Scala:

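import java.io.{FileInputStream, InputStream}
import scala.xml.XML

// Wraps the file and skips every 0x12 byte before the parser sees it.
val in = new InputStream {
  val in0 = new FileInputStream("invalid.xml")
  override def read(): Int = in0.read() match { case 0x12 => read() case x => x }
}
val x = XML.load(in)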

score:3

0x12 is only valid in XML 1.1. If your XML file declares that version, you might be able to turn on 1.1 processing support in your SAX parser.

Otherwise, the underlying parser is probably Xerces, which, as a conforming XML parser, is right to complain.

If you must handle these streams, I'd write a wrapper InputStream or Reader around the input file, filter out the characters with invalid Unicode values, and pass the rest on.

score:11

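To expand on @huynhjl's answer: the InputStream filter is dangerous if you have multi-byte characters, for example in UTF-8 encoded text, because a 0x12 byte could be part of a longer sequence in some encodings and the filter works on raw bytes rather than decoded characters. Instead, use a character-oriented filter: FilterReader. Or, if the file is small enough, load it into a String and replace the characters there.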

scala> val origXml = "<?xml version='1.1'?><root>\u0012</root>"
origXml: java.lang.String = <?xml version='1.1'?><root></root>

scala> val cleanXml = origXml flatMap {
   case x if Character.isISOControl(x) => "&#x" + Integer.toHexString(x) + ";"
   case x => Seq(x)
}
cleanXml: String = <?xml version='1.1'?><root>&#x12;</root>

scala> scala.xml.XML.loadString(cleanXml)
res14: scala.xml.Elem = <root></root>
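
For files too large to load into a String, the FilterReader route mentioned above might look roughly like the sketch below. The names ControlCharFilter, StrippingReader, isForbiddenControl and clean are made up for illustration, and the definition of "invalid" (control characters below 0x20 other than tab, line feed and carriage return) is an assumption you may need to adjust.

```scala
import java.io.{FilterReader, Reader, StringReader}

object ControlCharFilter {

  // Assumption: "invalid" means the C0 control characters that XML 1.0 forbids,
  // i.e. everything below 0x20 except tab, line feed and carriage return.
  def isForbiddenControl(c: Char): Boolean =
    c < 0x20 && c != '\t' && c != '\n' && c != '\r'

  // Hypothetical helper: a Reader that silently drops forbidden characters.
  // It sees decoded chars, not bytes, so multi-byte encodings are not corrupted.
  class StrippingReader(underlying: Reader) extends FilterReader(underlying) {

    // Single-char read: skip forward until a permitted char or end of stream.
    override def read(): Int = {
      var c = super.read()
      while (c != -1 && isForbiddenControl(c.toChar)) c = super.read()
      c
    }

    // Bulk read: fill the buffer, then compact the permitted chars in place.
    override def read(buf: Array[Char], off: Int, len: Int): Int =
      if (len == 0) 0
      else {
        var kept = 0
        var n = 0
        // Retry if a whole chunk was filtered away, so 0 is never returned early.
        while (kept == 0 && n != -1) {
          n = super.read(buf, off, len)
          var i = off
          while (i < off + n) {
            if (!isForbiddenControl(buf(i))) { buf(off + kept) = buf(i); kept += 1 }
            i += 1
          }
        }
        if (kept == 0) -1 else kept
      }
  }

  // Convenience for trying the filter on an in-memory string.
  def clean(s: String): String = {
    val reader = new StrippingReader(new StringReader(s))
    val sb = new StringBuilder
    val buf = new Array[Char](1024)
    var n = reader.read(buf, 0, buf.length)
    while (n != -1) {
      sb.append(new String(buf, 0, n))
      n = reader.read(buf, 0, buf.length)
    }
    sb.toString
  }
}
```

To parse a file, you would presumably wrap an InputStreamReader (with the file's actual encoding, which you have to know or guess) in a StrippingReader and hand it to scala.xml.XML.load, which accepts a java.io.Reader. Unlike the flatMap version, this drops the characters instead of turning them into character references.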
