python - Parse XML with unicode characters into ElementTree

I'm using PDFminer, but it contains a bug and I get the following invalid XML file:
<?xml version="1.1" encoding="UTF-8"?>
<string size="16">ôÌfÆ*]Ö[</string>
When I'm trying to parse it with ElementTree I'm getting the following error:
bookXml = xml.etree.ElementTree.parse(filename)
File "C:\Users\User\Anaconda3\lib\xml\etree\ElementTree.py", line 1196, in parse
tree.parse(source, parser)
File "C:\Users\User\Anaconda3\lib\xml\etree\ElementTree.py", line 597, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 36
I think the best way to handle this case is to fix the XML first, but how?

I would wrap the offending XML string in CDATA. Confirmed working as soon as I did this. Example:
<?xml version="1.1" encoding="UTF-8"?>
<string><![CDATA[ôÌÆ*Ö]]></string>
More about CDATA here.
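
To see why CDATA helps, here is a minimal sketch. It assumes the parse error comes from a numeric character reference such as `&#0;` (the strings below are made-up stand-ins, not the actual PDFminer output). Inside a CDATA section the parser treats the reference as literal text instead of trying to resolve it.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the PDFminer output: &#0; is not a legal
# character reference in XML 1.0, so the first document is not well-formed.
broken = '<string size="3">a&#0;b</string>'
wrapped = '<string size="3"><![CDATA[a&#0;b]]></string>'


def try_parse(text):
    """Return the root element's text, or a message if parsing fails."""
    try:
        return ET.fromstring(text).text
    except ET.ParseError as exc:
        return "ParseError: {}".format(exc)


broken_result = try_parse(broken)    # parsing fails
wrapped_result = try_parse(wrapped)  # the reference survives as plain text
print(broken_result)
print(wrapped_result)
```

Note that the CDATA version keeps the characters `&#0;` verbatim in the element text, so the data still needs cleaning up afterwards if the original bytes matter.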

Related

XPath SyntaxError: invalid predicate

I have an XML file like this:
$ cat sample.xml
<Requests>
<Request>
<ID>123</ID>
<Items>
<Item>a item</Item>
<Item>b item</Item>
<Item>c item</Item>
</Items>
</Request>
<Request>
<ID>456</ID>
<Items>
<Item>d item</Item>
<Item>e item</Item>
</Items>
</Request>
</Requests>
I simply want to extract the XML of the Request elements that have a certain value in their grandchild element Item. Here is the code:
bash-4.2$ cat xsearch.py
import sys
import xml.etree.ElementTree as ET
if __name__ == '__main__':
    tree = ET.parse(sys.argv[1])
    root = tree.getroot()
    for request in root.findall(".//Item[.='c item']/../.."):
    #for request in root.findall(".//Request[Items/Item = 'c item']"):
        print(request)
I got "invalid predicate" error:
bash-4.2$ python3 xsearch.py sample.xml
Traceback (most recent call last):
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 263, in iterfind
selector = _cache[cache_key]
KeyError: (".//Item[.='c item']/../..", None)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "xsearch.py", line 8, in <module>
for request in root.findall(".//Item[.='c item']/../.."):
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 304, in findall
return list(iterfind(elem, path, namespaces))
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 277, in iterfind
selector.append(ops[token[0]](next, token))
File "/usr/lib64/python3.6/xml/etree/ElementPath.py", line 233, in prepare_predicate
raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate
Could anyone point out where I got it wrong?

In general, an XPath invalid predicate error means something is syntactically wrong with one of the XPath's predicates, the code between the [ and ].
Specifically in your case, there are two issues:
The SyntaxError("invalid predicate") is because there's an extra ) in the predicate:
for request in root.findall(".//Item[.='c item')]/../.."):
                                               ^
Note also that you can hoist the predicate to avoid navigating down and then back up (../..):
Instead of
.//Item[.='c item']/../..
consider
.//Request[Items/Item = 'c item']
to select the Request element with the targeted Item.
The XPath library you're using, ElementTree, is not a full implementation of the XPath standard. You can waste a lot of time trying to identify what ElementTree does support (".//Items[Item='c item']/.." happens to work here) and does not support, but it'd be better to just use a more compliant library such as lxml.
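
As a minimal sketch of the ElementTree-only route, the snippet below runs the expression mentioned above as working against the sample data from the question (inlined as a string here rather than read from sample.xml). According to the Python documentation, the `[.='text']` predicate form is only supported from Python 3.7 on, which may also explain an "invalid predicate" error on 3.6.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, inlined for a self-contained example.
SAMPLE = """<Requests>
  <Request>
    <ID>123</ID>
    <Items>
      <Item>a item</Item>
      <Item>b item</Item>
      <Item>c item</Item>
    </Items>
  </Request>
  <Request>
    <ID>456</ID>
    <Items>
      <Item>d item</Item>
      <Item>e item</Item>
    </Items>
  </Request>
</Requests>"""

root = ET.fromstring(SAMPLE)

# Select every Items element that has an Item child with the wanted text,
# then step up one level to the enclosing Request.
matches = root.findall(".//Items[Item='c item']/..")

for request in matches:
    print(ET.tostring(request, encoding="unicode"))
```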

XML ParseError for letter?

While attempting to parse some very large XML, I ran into a parsing error. I was able to narrow down that the issue was somewhere in the writeup tag and when I re-ran my code with just that section, it produced the following traceback.
Traceback (most recent call last):
File "<input>", line 3, in <module>
File "path\to\Python\Python37-32\lib\xml\etree\ElementTree.py", line 1315, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 310
When I looked at line 1, column 310 in Atom, there's just an N, which is not a forbidden character in XML. Why is this issue popping up and how can I fix it?
Code
from xml.etree import ElementTree as etree
xml = """<writeup><p>sVw*f4FgT9`|wXNz!x)McB})KDh*0O"47BKR;G4F3]p3!-?n!\'%_sP:3WuGw44yTGF""Mf=8d34:Pb0pCZF](d%+(V\'M3-i*Dr:#sS/o*[_Z$"8%F*H6_lr&gt;I#lmd/RIUskV9#Ba\\poJ&lt;GVG]5CVIeJJytI7]q{pJQLF/&amp;N:kYrJ^3s"aCdHupx#_/Ool9qfo1.?$cdd&gt;u{Xi|yQyPahZ88ayU;DX[eDr9p?G)"*I^VG4xvJjZDCTUr1#qE6e=By_^YINk!\x02~eU3v1(pgU-\\"(*)[dg#}cVG&gt;2b=P-uH9z?fOS9amy\'e~ZO,2?,^cAWpt;jo+`p/D`B&gt;&gt;NLDqhN~&lt;"=_"DU0V^kqDTN=7EWZL|ax&amp;7dn&gt;]u1C)-[}~wuS",je`OOGIwT1g.jSe:3!tn^E2z!|4)B+rUV#6&amp;~,(iv,A%`W_\')E"kdD({ppNuPts%P%/Gi;`Hx-P/}WX(\\&amp;N2[pSy=\'9D1b?XNKG*E.#v3riX]Dq#8EEt;OA3:Uav3\'2^\\r;|Ck75}inlV)TrTFGgsI{wLx/KrmehxiwK*9^"UGa8DAV?wd~\\)gP4!r}(Y0Sx^ssxS^6zx4)#XS7|.bxFbV`t\\D,w\\YqW+&amp;%v)+&amp;fFtl]g28M61m34gD=|w{~OmjKbJr1QOI7I%]X\'m*r-p=sUeE.L-"rXR`L&gt;,nz{%\';3VY:aAKQa~ngm"Sx$3RxB!AH$O^t1&lt;9~t}ujaZ}D2\'*\\b}gGMBg4,`m9WL0Eo</p></writeup>"""
root = etree.fromstring(xml)

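The problem is the \x02 in the string. In a normal string literal Python treats it as a hex escape and inserts the control character U+0002, which XML 1.0 does not allow.
Try making your string a raw string by prefixing it with "r". This makes \ a literal character instead of an escape character, so the parser sees the four ordinary characters \, x, 0 and 2.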
Example...
xml = r"""<writeup><p>sVw*f4FgT9`|wXNz!x)McB})KDh*0O"47BKR;G4F3]p3!-?n!\'%_sP:3WuGw44yTGF""Mf=8d34:Pb0pCZF](d%+(V\'M3-i*Dr:#sS/o*[_Z$"8%F*H6_lr&gt;I#lmd/RIUskV9#Ba\\poJ&lt;GVG]5CVIeJJytI7]q{pJQLF/&amp;N:kYrJ^3s"aCdHupx#_/Ool9qfo1.?$cdd&gt;u{Xi|yQyPahZ88ayU;DX[eDr9p?G)"*I^VG4xvJjZDCTUr1#qE6e=By_^YINk!\x02~eU3v1(pgU-\\"(*)[dg#}cVG&gt;2b=P-uH9z?fOS9amy\'e~ZO,2?,^cAWpt;jo+`p/D`B&gt;&gt;NLDqhN~&lt;"=_"DU0V^kqDTN=7EWZL|ax&amp;7dn&gt;]u1C)-[}~wuS",je`OOGIwT1g.jSe:3!tn^E2z!|4)B+rUV#6&amp;~,(iv,A%`W_\')E"kdD({ppNuPts%P%/Gi;`Hx-P/}WX(\\&amp;N2[pSy=\'9D1b?XNKG*E.#v3riX]Dq#8EEt;OA3:Uav3\'2^\\r;|Ck75}inlV)TrTFGgsI{wLx/KrmehxiwK*9^"UGa8DAV?wd~\\)gP4!r}(Y0Sx^ssxS^6zx4)#XS7|.bxFbV`t\\D,w\\YqW+&amp;%v)+&amp;fFtl]g28M61m34gD=|w{~OmjKbJr1QOI7I%]X\'m*r-p=sUeE.L-"rXR`L&gt;,nz{%\';3VY:aAKQa~ngm"Sx$3RxB!AH$O^t1&lt;9~t}ujaZ}D2\'*\\b}gGMBg4,`m9WL0Eo</p></writeup>"""
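In your actual code, if you're getting the XML as a string from an outside source, try encoding the string with .encode("unicode-escape"). Be aware that this also escapes newlines and non-ASCII characters, so the parsed text will contain escape sequences rather than the original characters.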
root = etree.fromstring(xml.encode("unicode-escape"))
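
An alternative, sketched below, is to drop the characters XML 1.0 forbids before parsing instead of escaping the whole string. This is only an illustration: the helper name `remove_invalid_xml_chars` is made up, the sample string is a shortened stand-in for the one above, and the pattern covers the C0 control characters plus U+FFFE/U+FFFF but not unpaired surrogates.

```python
import re
from xml.etree import ElementTree as etree

# Control characters that XML 1.0 does not allow (tab, LF and CR are legal),
# plus the two non-characters U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def remove_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


# Shortened stand-in for the original string; \x02 is a real control char here.
xml = "<writeup><p>INk!\x02~eU3v1</p></writeup>"

root = etree.fromstring(remove_invalid_xml_chars(xml))
text = root.find("p").text
print(text)
```

Whether silently dropping the characters is acceptable depends on the data; if they carry meaning, escaping or base64-encoding the payload at the source is the safer choice.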

ParseError: not well-formed (invalid token) using ElementTree

I have a strange problem I can't seem to wrap my head around. It is related to XML parsing using ElementTree in Python.
I tried searching for clues and answers regarding similar problems, but to no avail.
My function:
def errorChecker(xmlResponse):
    xmlResponse = str(xmlResponse)
    xmlText = xmlParser.fromstring(xmlResponse)
    errorText = ""
    for xmlData in xmlText.iter():
        print xmlData.tag
        if xmlData.tag == "fault":
            for errorData in xmlText.iter('code'):
                #errorText = errorDict[errorData]
                return errorData.text
    return False
When I pass this XML code it returns just fine:
"""<?xml version="1.0" encoding="utf-8"?>
<response>
<fault>
<code>1055</code>
</fault>
</response>"""
But when I get the XML directly from the server and pass it to the function I get this error:
Traceback (most recent call last):
File "wmsTest.py", line 556, in <module>
errorChecker(str(location))
File "wmsTest.py", line 134, in errorChecker
xmlText = xmlParser.fromstring(str(xmlResponse))
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1311, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1659, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1523, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 6, column 11
Added information about my server request:
I use requests to access the server. By using:
response = requests.post(appServer, data=xml)
print "raw from server"
print response.text
print "str response"
print str(response.text)
return response.text
the response is:
raw from server
<?xml version="1.0" encoding="utf-8"?>
<response>
<fault>
<code>1055</code>
</fault>
</response>
str response
<?xml version="1.0" encoding="utf-8"?>
<response>
<fault>
<code>1055</code>
</fault>
</response>
Python interprets the incoming XML as type unicode. It is exactly the same as the manual XML code above, since it is a printout; I only added the """ at the start and end.
Any clues?

After checking the lengths of the returned responses, I figured out that the server returns a longer response than expected: there is a space added at the end, after the closing </response> tag. Removing the space cleared the problem!
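
A minimal sketch of that kind of check and cleanup follows. It assumes the stray trailing character is one the parser rejects (a NUL byte is used here purely as a stand-in, since the exact character is not known), and `parse_trimmed` is a hypothetical helper name. Printing the tail with repr() makes invisible characters visible.

```python
import xml.etree.ElementTree as ET


def parse_trimmed(xml_text):
    """Parse xml_text after dropping anything that follows the last '>'."""
    end = xml_text.rfind(">")
    if end == -1:
        raise ValueError("no '>' found in input")
    return ET.fromstring(xml_text[:end + 1])


# Stand-in for the server response, with an assumed stray character appended.
raw = (
    "<response>\n"
    "<fault>\n"
    "<code>1055</code>\n"
    "</fault>\n"
    "</response>\x00"
)

print(len(raw), repr(raw[-15:]))  # repr() exposes the trailing character

root = parse_trimmed(raw)
code = root.findtext("fault/code")
print(code)
```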

Import xml in another xml with Python ElementTree parser

Is it possible to load an xml file which imports another xml file with Python ElementTree.parse ?
For example:
I have file test.xml which contains:
<TestXml>
<!DOCTYPE doc [
<!ENTITY otherFile SYSTEM "test_1.xml">
]>
</TestXml>
and I also have test_1.xml which contains:
<test>it works!</test>
and I want to load test.xml in my python script:
from xml.etree.ElementTree import parse
a = parse('test.xml')
print a.find('test').text
and I expect it to output:
it works!
but instead I have:
Traceback (most recent call last):
File "D:/Work/depot/WIP/olex/Python/test/test.py", line 3, in <module>
a = parse('test.xml')
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 656, in parse
parser.feed(data)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 2, column 6
Does anybody know what I am doing wrong, or is it simply impossible to load such an XML file with Python's ElementTree parser?

The specific problem you are having is that your xml is malformed. Your DOCTYPE declaration should not be inside your root element. Rather, it should precede your root element:
<!DOCTYPE doc [
<!ENTITY otherFile SYSTEM "test_1.xml">
]>
<TestXml>
some content . . .
</TestXml>
That said, you will face a larger problem once you solve that issue. How do you use Python to parse the DOCTYPE declaration? Should you use the xml module, the lxml module, or the bs4 module?
That's a tough question. From what I have seen, people have (recently) had to do dtd parsing themselves. See the SO threads here and here for some possible leads.
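
Since xml.etree.ElementTree does not expand external entities, one workaround is to skip the entity mechanism and merge the two documents by hand. The sketch below is that swapped-in technique, not DTD processing: it assumes both files are trusted and that the root of test_1.xml should simply become a child of the outer root. The files are written to a temporary directory only to keep the example self-contained.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

with tempfile.TemporaryDirectory() as tmp:
    outer_path = os.path.join(tmp, "test.xml")
    inner_path = os.path.join(tmp, "test_1.xml")

    # Stand-in file contents: the outer file no longer declares an entity.
    with open(outer_path, "w") as f:
        f.write("<TestXml></TestXml>")
    with open(inner_path, "w") as f:
        f.write("<test>it works!</test>")

    outer = ET.parse(outer_path)
    inner_root = ET.parse(inner_path).getroot()

    # Attach the second document's root element under the first one's root.
    outer.getroot().append(inner_root)

    result = outer.find("test").text

print(result)
```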

How to validate xml using python without third-party libs?

I have some xml pieces like this:
<!DOCTYPE mensaje SYSTEM "record.dtd">
<record>
<player_birthday>1979-09-23</player_birthday>
<player_name>Orene Ai'i</player_name>
<player_team>Blues</player_team>
<player_id>453</player_id>
<player_height>170</player_height>
<player_position>F&W</player_position> <---- a '&' here.
<player_weight>75</player_weight>
</record>
Is there any way to check whether the XML pieces are well-formed?
Is there any way to validate the XML against a DTD or XML Schema?
For various reasons I can't use any third-party packages.
E.g. the XML above is not correct since it has a '&' in it. Note that the DOCTYPE declaration refers to a DTD.

Just try to parse it with ElementTree (xml.etree.ElementTree.fromstring) - it will raise an error if the XML is not well formed.
>>> a = """<record>
... <player_birthday>1979-09-23</player_birthday>
... <player_name>Orene Ai'i</player_name>
... <player_team>Blues</player_team>
... <player_id>453</player_id>
... <player_height>170</player_height>
... <player_position>F&W</player_position> <---- a '&' here.
... <player_weight>75</player_weight>
... </record>"""
>>>
>>> from xml.etree import ElementTree as ET
>>> x = ET.fromstring(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1282, in XML
parser.feed(text)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1488, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 7, column 24
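
The same idea can be wrapped in a small helper. This is a sketch with a made-up function name, `is_well_formed`, and shortened sample records. It only checks well-formedness: the standard-library parsers are non-validating, so as far as I know they will not check a document against a DTD or an XML Schema.

```python
from xml.etree import ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Shortened versions of the record above: a bare '&' versus an escaped one.
bad = "<record><player_position>F&W</player_position></record>"
good = "<record><player_position>F&amp;W</player_position></record>"

bad_ok = is_well_formed(bad)
good_ok = is_well_formed(good)
print(bad_ok, good_ok)
```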

You can use python's xml.dom.minidom XML parser (which is in the standard library, but isn't as powerful as alternatives such as lxml).
Just do:
import xml.dom.minidom
xml.dom.minidom.parseString('<My><XML><String/><XML/><My/>')
You will get an xml.parsers.expat.ExpatError if the XML is not well-formed.
