xml parsing error special characters - python

I have the following XML that I want to parse with the xml.dom.minidom module:
<?xml version="1.0" encoding="UTF-8"?>
<RootTag>
<InnerTag>
<MyValue>"< here is special char."</MyValue>
</InnerTag>
</RootTag>
I use the following snippet to parse the XML above:
import xml.dom.minidom
xml.dom.minidom.parse('input_xml')
But I get following error:
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 4, column 26
The error occurs only when the MyValue tag contains '&' or '<'.
How can I resolve this issue?
I do not want to change my XML by using escape sequences such as &lt;, and I want to keep the "" (quotes).

Your example is not well-formed XML. < is not allowed anywhere in XML other than in tags. Your data needs to be wrapped in CDATA or escaped as &lt; (and a bare & as &amp;):
<![CDATA[< here is special char.]]>
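The two options can be compared side by side. The following is a minimal sketch, assuming the document is built as a string before parsing; the helper name my_value and the variable names are made up for illustration. It uses xml.sax.saxutils.escape for the escaping route and a CDATA section for the other, and reads the value back with minidom.

```python
import xml.dom.minidom
from xml.sax.saxutils import escape

# The value from the question, including the quotes the asker wants to keep.
raw_value = '"< here is special char."'

TEMPLATE = "<RootTag><InnerTag><MyValue>%s</MyValue></InnerTag></RootTag>"

# Option 1: escape &, < and > so they become &amp;, &lt; and &gt;.
escaped_doc = TEMPLATE % escape(raw_value)

# Option 2: wrap the value in CDATA (it must not contain the sequence "]]>").
cdata_doc = TEMPLATE % ("<![CDATA[%s]]>" % raw_value)


def my_value(doc_text):
    """Return the text content of the first MyValue element (hypothetical helper)."""
    dom = xml.dom.minidom.parseString(doc_text)
    node = dom.getElementsByTagName("MyValue")[0]
    # Text and CDATA child nodes both expose their content as .data.
    return "".join(child.data for child in node.childNodes)


print(my_value(escaped_doc))
print(my_value(cdata_doc))
```

Either way the parser hands back the original string, so the escaping is only visible in the file, not in the parsed value.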

Related

Getting error "not well-formed (invalid token)"

I have an XML file with the following data:
<?xml version="1.0" encoding="utf-8"?>
<metadata>
<filter>
<regex>ATL|LAX|DFW</regex >
<start_char>3</start_char>
<end_char></end_char>
<action>remove</action>
</filter>
<filter>
<regex>DFW.+\.$</regex >
<start_char>3</start_char>
<end_char>-1</end_char>
<action>remove</action>
</filter>
<filter>
<regex>\-</regex >
<replacement></replacement>
<action>substitute</action>
</filter>
<filter>
<regex>\s</regex >
<replacement></replacement>
<action>substitute</action>
</filter>
</metadata>
I am trying to read in the xml file into my python code and loop through all the filter tags and see if the action tag is 'remove'. If the action tag is 'remove', I want to remove the part of the mfn_pn that matches the text within the regex tag.
Next, I want it to see if the action tag is 'substitute'. If it is 'substitute', I want it to substitute the text within the regex tag with what's in the replacement tag.
However, I keep getting the error
File "C:\Python\Python37-32\lib\xml\etree\ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 50, column 13
Not sure what "not well-formed (invalid token)" is referring to.
from xml.etree.ElementTree import ElementTree
# filters.xml is the file that holds the things to be filtered
tree = ElementTree()
tree.parse("filters.xml")
It looks like the error occurs in the first 4 lines of your script. As such, the rest of the script is not needed for a minimal reproducible example.
Having said that, interestingly the example from the documentation yields the same error.
Finally, I managed to resolve the issue by following the solution provided here.
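Once the file parses, the loop described in the question might look like the sketch below. It is only an illustration under stated assumptions: the XML is a shortened inline copy of the filters file, apply_filters is a made-up name, and the start_char and end_char fields are ignored because the question does not say how they should be used.

```python
import re
import xml.etree.ElementTree as ET

# Shortened inline copy of filters.xml (assumption: same structure as the real file).
FILTERS_XML = r"""<metadata>
<filter>
<regex>ATL|LAX|DFW</regex>
<action>remove</action>
</filter>
<filter>
<regex>\-</regex>
<replacement></replacement>
<action>substitute</action>
</filter>
</metadata>"""


def apply_filters(mfn_pn, xml_text=FILTERS_XML):
    """Apply every <filter> in order and return the filtered string."""
    root = ET.fromstring(xml_text)
    for flt in root.iter("filter"):
        pattern = flt.findtext("regex", default="")
        action = flt.findtext("action", default="")
        if action == "remove":
            # Remove the part of the value that matches the regex.
            mfn_pn = re.sub(pattern, "", mfn_pn)
        elif action == "substitute":
            # An empty <replacement/> yields an empty string from findtext.
            replacement = flt.findtext("replacement", default="")
            mfn_pn = re.sub(pattern, replacement, mfn_pn)
    return mfn_pn


print(apply_filters("ATL-123-45"))  # 12345
```

For the real file, ET.parse("filters.xml").getroot() would replace the ET.fromstring call.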

Fix namespace with regular expression

I have the following name spaces coming from a certain service
<soapenv:Envelope xmlns:soapenv=http://schemas.xmlsoap.org/soap/envelope/ xmlns:soap=http://www.4cgroup.co.za/soapauth xmlns:gen=http://www.4cgroup.co.za/genericsoap>
Trying to parse this request I receive the following error
xml.etree.ElementTree.ParseError: not well-formed
I noticed there are no double quotes ("") around the namespace values. How can I add them with a regular expression?
Proper format
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:soap="http://www.4cgroup.co.za/soapauth" xmlns:gen="http://www.4cgroup.co.za/genericsoap">
Note double quotes
Using regex:
import re
namespace = "<soapenv:Envelope xmlns:soapenv=http://schemas.xmlsoap.org/soap/envelope/ xmlns:soap=http://www.4cgroup.co.za/soapauth xmlns:gen=http://www.4cgroup.co.za/genericsoap>"
FIND_URL = re.compile(r"((?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+)")
print(FIND_URL.sub(r'"\1"', namespace))
Output:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:soap="http://www.4cgroup.co.za/soapauth" xmlns:gen="http://www.4cgroup.co.za/genericsoap">
Note that the regex isn't perfect. It works for this case but if the urls become more "unique" it may fail.
Credit to this answer
This regex seems to do the trick:
import re
nsmap = "<soapenv:Envelope xmlns:soapenv=http://schemas.xmlsoap.org/soap/envelope/ xmlns:soap=http://www.4cgroup.co.za/soapauth xmlns:gen=http://www.4cgroup.co.za/genericsoap>"
nsmap = re.sub(r"(https?://.*?)(?=\sxmlns|>)", r'"\1"', nsmap)
print(nsmap)
Output:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:soap="http://www.4cgroup.co.za/soapauth" xmlns:gen="http://www.4cgroup.co.za/genericsoap">
Check it out online here.
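To confirm that the quoted header is now acceptable to the parser, one can feed the result to ElementTree. This is a small sketch that reuses the second regex above; the closing tag is appended only so that the fragment forms a complete document, which is an assumption about how the real request continues.

```python
import re
import xml.etree.ElementTree as ET

raw = ("<soapenv:Envelope xmlns:soapenv=http://schemas.xmlsoap.org/soap/envelope/ "
       "xmlns:soap=http://www.4cgroup.co.za/soapauth "
       "xmlns:gen=http://www.4cgroup.co.za/genericsoap>")

# Quote every URL that is followed by the next xmlns attribute or by '>'.
fixed = re.sub(r"(https?://.*?)(?=\sxmlns|>)", r'"\1"', raw)

# Close the envelope so the fragment can be parsed on its own.
root = ET.fromstring(fixed + "</soapenv:Envelope>")
print(root.tag)
```

ElementTree reports the tag in {namespace}localname form, so a successful parse prints the envelope namespace followed by Envelope.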

Parse large python xml using xmltree

I have a python script that parses huge xml files ( largest one is 446 MB)
try:
    parser = etree.XMLParser(encoding='utf-8')
    tree = etree.parse(os.path.join(srcDir, fileName), parser)
    root = tree.getroot()
except Exception, e:
    print "Error parsing file "+str(fileName) + " Reason "+str(e.message)
for child in root:
    if "PersonName" in child.tag:
        personName = child.text
This is what my xml looks like :
<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
<Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
<Description>myData</Description>
<Identifier>43hhjh87n4nm</Identifier>
</Aliases>
<RollNo uom="kPa">39979172.201167159</RollNo>
<PersonName>Miracle Smith</PersonName>
<Date>2017-06-02T01:10:32-05:00</Date>
....
All I want to do is get the contents of the PersonName tag. I don't care about the other tags.
Sadly, my files are huge and I keep getting these errors when I use the code above:
Error parsing file 2eb6d894-0775-e611.xml Reason unknown error, line 1, column 310915857
Error parsing file 2ecc18b5-ef41-e711-80f.xml Reason Extra content at the end of the document, line 1, column 3428182
Error parsing file 2f0d6926-b602-e711-80f4-005.xml Reason Extra content at the end of the document, line 1, column 6162118
Error parsing file 2f12636b-b2f5-e611-80f3-00.xml Reason Extra content at the end of the document, line 1, column 8014679
Error parsing file 2f14e35a-d22b-4504-8866-.xml Reason Extra content at the end of the document, line 1, column 8411238
Error parsing file 2f50c2eb-55c6-e611-80f0-005056a.xml Reason Extra content at the end of the document, line 1, column 7636614
Error parsing file 3a1a3806-b6af-e611-80ef-00505.xml Reason Extra content at the end of the document, line 1, column 11032486
My XML is perfectly fine and has no extra content. It seems that parsing large files causes the error.
I have looked at iterparse(), but it seems too complex for what I want to achieve, as it walks the whole DOM while I just want the one tag that sits under the root. I also could not find a good sample showing how to get a value by tag name.
Should I use a regex or a grep/awk approach to do this? Or is there a tweak to my code that will let me get the PersonName in these huge files?
UPDATE:
I tried this sample and it seems to print everything in the XML except my tag.
Does iterparse read from the bottom to the top of the file? In that case it would take a long time to reach the top, i.e. my PersonName tag. I tried changing the events argument below to events=("end", "start") and it does the same thing.
path = []
for event, elem in ET.iterparse('D:\\mystage\\2-80ea-005056.xml', events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        print elem.text  # prints whole world
        if elem.tag == 'PersonName':
            print elem.text
        path.pop()
Iterparse is not that difficult to use in this case.
temp.xml is the file presented in your question with a </MyRoot> stuck on as a line at the end.
Think of the source = line as boilerplate, if you will: it parses the XML file and returns chunks of it element by element, indicating whether each chunk is the 'start' or the 'end' of an element and supplying information about the element.
In this case we need consider only the 'start' events. We watch for the 'PersonName' tags and pick up their texts. Having found the one and only such item in the xml file we abandon the processing.
>>> from xml.etree import ElementTree
>>> source = iter(ElementTree.iterparse('temp.xml', events=('start', 'end')))
>>> for an_event, an_element in source:
... if an_event=='start' and an_element.tag.endswith('PersonName'):
... an_element.text
... break
...
'Miracle Smith'
Edit, in response to question in a comment:
Normally you wouldn't do this since iterparse is intended for use with large chunks of xml. However, by wrapping a string in a StringIO object it can be processed with iterparse.
>>> from xml.etree import ElementTree
>>> from io import StringIO
>>> xml = StringIO('''\
... <?xml version="1.0" encoding="utf-8"?>
... <MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
... <Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
... <Description>myData</Description>
... <Identifier>43hhjh87n4nm</Identifier>
... </Aliases>
... <RollNo uom="kPa">39979172.201167159</RollNo>
... <PersonName>Miracle Smith</PersonName>
... <Date>2017-06-02T01:10:32-05:00</Date>
... </MyRoot>''')
>>> source = iter(ElementTree.iterparse(xml, events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...
'Miracle Smith'
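One caveat on the 'start' approach: the ElementTree documentation notes that an element's text is only guaranteed to be complete once its 'end' event has been emitted, so on a very large file the text may not be available yet at 'start'. The following variant is a sketch under that assumption; first_person_name and the trimmed SAMPLE document are made up for illustration. It listens for 'end' events only and clears elements it does not need, which keeps memory use low.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed version of the document from the question (illustrative only).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns="http://www.example.org/yml/data/litsmlv2" uuid="ertr">
<RollNo uom="kPa">39979172.201167159</RollNo>
<PersonName>Miracle Smith</PersonName>
<Date>2017-06-02T01:10:32-05:00</Date>
</MyRoot>"""


def first_person_name(source):
    """Return the text of the first PersonName element, or None if absent."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Namespaced tags look like '{uri}PersonName'; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "PersonName":
            return elem.text
        # Discard elements that are not needed so memory use stays flat.
        elem.clear()
    return None


print(first_person_name(io.BytesIO(SAMPLE)))  # Miracle Smith
```

The same function accepts a file name or an open binary file in place of the BytesIO object.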

What's the best way to handle &nbsp;-like entities in XML documents with lxml?

Consider the following:
from lxml import etree
from StringIO import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<aa>&nbsp;&acirc;</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
This would fail with:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 2, column 11
This is because resolve_entities=False doesn't ignore them, it just doesn't resolve them.
If I use etree.HTMLParser instead, it creates html and body tags, plus a lot of other special handling it tries to do for HTML.
What's the best way to get a &nbsp;&acirc; text child under the aa tag with lxml?
You can't ignore entities as they are part of the XML definition. Your document is not well-formed if it doesn't have a DTD or standalone="yes" or if it includes entities without an entity definition in the DTD. Lie and claim your document is HTML.
https://mailman-mail5.webfaction.com/pipermail/lxml/2008-February/003398.html
You can try lying and putting an XHTML DTD on your document. e.g.
from lxml import etree
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >\n<aa>&nbsp;&acirc;</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
etree.tostring(r) # '<aa>&nbsp;&acirc;</aa>'
@Alex is right: your document is not well-formed XML, and so XML parsers will not parse it. One option is to pre-process the text of the document to replace bogus entities with their utf-8 characters:
entities = [
    ('&nbsp;', u'\u00a0'),
    ('&acirc;', u'\u00e2'),
    ...
]
for before, after in entities:
    x = x.replace(before, after.encode('utf8'))
Of course, this can be broken by sufficiently weird "xml" also.
Your best bet is to fix your input XML documents to be well-formed XML.
When I was trying to do something similar, I just used x.replace('&', '&amp;') before parsing the string.
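The pre-processing idea can be generalised without listing entities by hand. The sketch below is one possible approach, not something taken from the answers above: it rewrites HTML named entities as numeric character references, which any XML parser accepts, and leaves the five entities predefined by XML untouched. The function name html_entities_to_char_refs is made up, and the standard-library ElementTree is used for the check so the example runs without lxml. Note that the substitution would also alter matching text inside CDATA sections.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities every XML parser already understands; these must stay as they are.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def html_entities_to_char_refs(text):
    """Rewrite HTML named entities such as &nbsp; as numeric references (&#160;)."""
    def to_ref(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML's own entities and unknown names alone.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", to_ref, text)


x = "<aa>&nbsp;&acirc;</aa>"
root = ET.fromstring(html_entities_to_char_refs(x))
print(repr(root.text))
```

After this step the aa element carries a two-character text child: a no-break space followed by â.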

Parsing XML with SAX/Python + no validation

I am new to Python and I'm trying to parse an XML file with SAX without validating it.
The head of my xml file is:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE n:document SYSTEM "schema.dtd">
<n:document....
and I've tried to parse it with python 2.5.2:
from xml.sax import make_parser, handler
import sys
parser = make_parser()
parser.setFeature(handler.feature_namespaces,True)
parser.setFeature(handler.feature_validation,False)
parser.setContentHandler(handler.ContentHandler())
parser.parse(sys.argv[1])
but I got an error:
python doc.py document.xml
(...)
File "/usr/lib/python2.5/urllib2.py", line 244, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: schema.dtd
I don't want the SAX parser to look for a schema. Where am I wrong ?
Thanks !
expatreader considers the DTD external subset as an external general entity. So the feature you want is:
parser.setFeature(handler.feature_external_ges, False)
However, it's a bit dodgy pointing the DTD external subset to a nonexistent URL; as this shows, it's not only validating parsers that read it.
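Put together, the parser setup might look like the following sketch, written for Python 3. The TagCollector handler, the collect_tags function, the sample document and its namespace URI are all made up for illustration; only the setFeature calls come from the question and the answer. In recent Python 3 releases the SAX parser no longer processes external general entities by default, so the explicit call mainly documents the intent.

```python
import io
from xml.sax import make_parser, handler


class TagCollector(handler.ContentHandler):
    """Records the (namespace URI, local name) pair of every start tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElementNS(self, name, qname, attrs):
        self.names.append(name)


# Sample document that points at a DTD which does not exist.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE n:document SYSTEM "schema.dtd">
<n:document xmlns:n="http://example.org/ns"><n:item/></n:document>"""


def collect_tags(data):
    parser = make_parser()
    parser.setFeature(handler.feature_namespaces, True)
    parser.setFeature(handler.feature_validation, False)
    # Do not fetch external general entities, which includes the DTD external subset.
    parser.setFeature(handler.feature_external_ges, False)
    collector = TagCollector()
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(data))
    return collector.names


print(collect_tags(DOC))
```

The document parses even though schema.dtd cannot be found, because the parser never tries to open it.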
