Getting error "not well-formed (invalid token)" - python

I have an XML file with the following data:
<?xml version="1.0" encoding="utf-8"?>
<metadata>
<filter>
<regex>ATL|LAX|DFW</regex >
<start_char>3</start_char>
<end_char></end_char>
<action>remove</action>
</filter>
<filter>
<regex>DFW.+\.$</regex >
<start_char>3</start_char>
<end_char>-1</end_char>
<action>remove</action>
</filter>
<filter>
<regex>\-</regex >
<replacement></replacement>
<action>substitute</action>
</filter>
<filter>
<regex>\s</regex >
<replacement></replacement>
<action>substitute</action>
</filter>
</metadata>
I am trying to read the XML file into my Python code, loop through all the filter tags, and check whether the action tag is 'remove'. If the action tag is 'remove', I want to remove the part of the mfn_pn that matches the text within the regex tag.
Next, I want it to check whether the action tag is 'substitute'. If it is 'substitute', I want it to substitute the text matching the regex tag with what's in the replacement tag.
However, I keep getting the error
File "C:\Python\Python37-32\lib\xml\etree\ElementTree.py", line 598, in parse
self._root = parser._parse_whole(source)
File "", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 50, column 13".
Not sure what "not well-formed (invalid token)" is referring to.
from xml.etree.ElementTree import ElementTree
# filters.xml is the file that holds the things to be filtered
tree = ElementTree()
tree.parse("filters.xml")

It looks like the error occurs in the first 4 lines of your script. As such, the rest of the script is not needed for a minimal reproducible example.
Having said that, interestingly the example from the documentation yields the same error.
Finally, I managed to resolve the issue by following the solution provided here.
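Once the file parses cleanly, a minimal sketch of the remove/substitute loop described in the question could look like this (the mfn_pn value here is a made-up example, and the start_char/end_char handling is left out since the post doesn't describe it):
import re
from xml.etree.ElementTree import ElementTree

tree = ElementTree()
tree.parse("filters.xml")

mfn_pn = "DFW-1234 ABC."  # hypothetical part number to clean

for flt in tree.getroot().iter("filter"):
    pattern = flt.findtext("regex", default="")
    action = flt.findtext("action", default="")
    if action == "remove":
        # drop whatever part of mfn_pn matches the regex
        mfn_pn = re.sub(pattern, "", mfn_pn)
    elif action == "substitute":
        # replace matches of the regex with the replacement text
        replacement = flt.findtext("replacement", default="")
        mfn_pn = re.sub(pattern, replacement, mfn_pn)

print(mfn_pn)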

Related

Parse large python xml using xmltree

I have a python script that parses huge xml files ( largest one is 446 MB)
try:
    parser = etree.XMLParser(encoding='utf-8')
    tree = etree.parse(os.path.join(srcDir, fileName), parser)
    root = tree.getroot()
except Exception, e:
    print "Error parsing file "+str(fileName) + " Reason "+str(e.message)

for child in root:
    if "PersonName" in child.tag:
        personName = child.text
This is what my XML looks like:
<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
<Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
<Description>myData</Description>
<Identifier>43hhjh87n4nm</Identifier>
</Aliases>
<RollNo uom="kPa">39979172.201167159</RollNo>
<PersonName>Miracle Smith</PersonName>
<Date>2017-06-02T01:10:32-05:00</Date>
....
All I want to do is get the PersonName tag's contents, that's all; I don't care about the other tags.
Sadly, my files are huge and I keep getting these errors when I use the code above:
Error parsing file 2eb6d894-0775-e611.xml Reason unknown error, line 1, column 310915857
Error parsing file 2ecc18b5-ef41-e711-80f.xml Reason Extra content at the end of the document, line 1, column 3428182
Error parsing file 2f0d6926-b602-e711-80f4-005.xml Reason Extra content at the end of the document, line 1, column 6162118
Error parsing file 2f12636b-b2f5-e611-80f3-00.xml Reason Extra content at the end of the document, line 1, column 8014679
Error parsing file 2f14e35a-d22b-4504-8866-.xml Reason Extra content at the end of the document, line 1, column 8411238
Error parsing file 2f50c2eb-55c6-e611-80f0-005056a.xml Reason Extra content at the end of the document, line 1, column 7636614
Error parsing file 3a1a3806-b6af-e611-80ef-00505.xml Reason Extra content at the end of the document, line 1, column 11032486
My XML is perfectly fine and has no extra content. It seems that parsing the large files causes the error.
I have looked at iterparse(), but it seems too complex for what I want to achieve, as it parses the whole document while I just want that one tag under the root. Also, it does not give me a good sample for getting the correct value by tag name.
Should I use a regex, or a grep/awk approach, to do this? Or is there any tweak to my code that will let me get the PersonName in these huge files?
UPDATE:
Tried this sample and it seems to be printing everything from the XML except my tag?
Does iterparse read from the bottom to the top of the file? In that case it will take a long time to get to the top, i.e. my PersonName tag? I tried changing the line below to use events=("end", "start") instead of ("start", "end") and it does the same thing!
path = []
for event, elem in ET.iterparse('D:\\mystage\\2-80ea-005056.xml', events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        print elem.text  # prints everything
        if elem.tag == 'PersonName':
            print elem.text
        path.pop()
Iterparse is not that difficult to use in this case.
temp.xml is the file presented in your question with a </MyRoot> stuck on as a line at the end.
Think of the source = line as boilerplate, if you will, that parses the xml file and returns chunks of it element-by-element, indicating whether the chunk is the 'start' of an element or the 'end', and supplying information about the element.
In this case we need to consider only the 'start' events. We watch for the 'PersonName' tags and pick up their texts. Having found the one and only such item in the xml file, we abandon the processing.
>>> from xml.etree import ElementTree
>>> source = iter(ElementTree.iterparse('temp.xml', events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...
'Miracle Smith'
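For files in the hundreds of megabytes it can also help to clear each element once it has been processed so memory stays bounded. A rough sketch (Python 3 syntax; the file name 'huge.xml' is an assumption):
from xml.etree import ElementTree

person_name = None
for event, elem in ElementTree.iterparse('huge.xml', events=('end',)):
    # tag names carry the default namespace, hence endswith()
    if elem.tag.endswith('PersonName'):
        person_name = elem.text
        break
    elem.clear()  # discard the finished element's content to keep memory low

print(person_name)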
Edit, in response to question in a comment:
Normally you wouldn't do this since iterparse is intended for use with large chunks of xml. However, by wrapping a string in a StringIO object it can be processed with iterparse.
>>> from xml.etree import ElementTree
>>> from io import StringIO
>>> xml = StringIO('''\
... <?xml version="1.0" encoding="utf-8"?>
... <MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
... <Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
... <Description>myData</Description>
... <Identifier>43hhjh87n4nm</Identifier>
... </Aliases>
... <RollNo uom="kPa">39979172.201167159</RollNo>
... <PersonName>Miracle Smith</PersonName>
... <Date>2017-06-02T01:10:32-05:00</Date>
... </MyRoot>''')
>>> source = iter(ElementTree.iterparse(xml, events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...
'Miracle Smith'

Beautiful Soup fails to recognize UTF-8 encoding on Python 3, IPython 6 console

I am trying to read an xml document using Beautiful Soup on Python 3.6.2, IPython 6.1.0, Windows 10, and I can't get the encoding right.
Here's my test XML, saved as a file in UTF-8 encoding:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<info name="愛よ">ÜÜÜÜÜÜÜ</info>
<items>
<item thing="ÖöÖö">"23Äßßß"</item>
</items>
</root>
First check the XML using ElementTree:
import xml.etree.ElementTree as ET
def printXML(xml,indent=''):
    print(indent+str(xml.tag)+': '+(xml.text if xml.text is not None else '').replace('\n',''))
    if len(xml.attrib) > 0:
        for k,v in xml.attrib.items():
            print(indent+'\t'+k+' - '+v)
    if xml.getchildren():
        for child in xml.getchildren():
            printXML(child,indent+'\t')
xml0 = ET.parse("test.xml").getroot()
printXML(xml0)
The output is correct:
root:
info: ÜÜÜÜÜÜÜ
name - 愛よ
items:
item: "23Äßßß"
thing - ÖöÖö
Now read the same file with Beautiful Soup and pretty-print it:
import bs4
with open("test.xml") as ff:
xml = bs4.BeautifulSoup(ff,"html5lib")
print(xml.prettify())
Output:
<!--?xml version="1.0" encoding="UTF-8"?-->
<html>
<head>
</head>
<body>
<root>
<info name="愛よ">
ÜÜÜÜÜÜÜ
</info>
<items>
<item thing="ÖöÖö">
"23Äßßß"
</item>
</items>
</root>
</body>
</html>
This is just wrong. Making the call with the encoding explicitly specified, bs4.BeautifulSoup(ff,"html5lib",from_encoding="UTF-8"), doesn't change the result.
Doing
print(xml.original_encoding)
outputs
None
So Beautiful Soup is apparently unable to detect the original encoding, even though the file is encoded in UTF-8 (according to Notepad++) and the header says UTF-8 as well, and I do have chardet installed as the docs recommend.
Am I making a mistake here? What could be causing this?
EDIT:
When I invoke the code without the html5lib I get this warning:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib").
This usually isn't a problem, but if you run this code on another system, or in a different virtual environment,
it may use a different parser and behave differently.
The code that caused this warning is on line 241 of the file C:\Users\My.Name\AppData\Local\Continuum\Anaconda2\envs\Python3\lib\site-packages\spyder\utils\ipython\start_kernel.py.
To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html5lib")
markup_type=markup_type))
EDIT 2:
As suggested in a comment, I tried bs4.BeautifulSoup(ff,"html.parser"), but the problem remains.
Then I installed lxml and tried bs4.BeautifulSoup(ff,"lxml-xml"); still the same output.
What also strikes me as odd is that even when specifying an encoding, like bs4.BeautifulSoup(ff,"lxml-xml",from_encoding='UTF-8'), the value of xml.original_encoding is None, contrary to what is written in the docs.
EDIT 3:
I put my XML contents into a string:
xmlstring = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><info name=\"愛よ\">ÜÜÜÜÜÜÜ</info><items><item thing=\"ÖöÖö\">\"23Äßßß\"</item></items></root>"
And used bs4.BeautifulSoup(xmlstring,"lxml-xml"), now I'm getting the correct output:
<?xml version="1.0" encoding="utf-8"?>
<root>
<info name="愛よ">
ÜÜÜÜÜÜÜ
</info>
<items>
<item thing="ÖöÖö">
"23Äßßß"
</item>
</items>
</root>
So it seems something was wrong with the file after all.
Found the error: I have to specify the encoding when opening the file:
with open("test.xml",encoding='UTF-8') as ff:
xml = bs4.BeautifulSoup(ff,"html5lib")
As I'm on Python 3, I thought the default value of the encoding parameter was UTF-8, but it turns out it's system-dependent, and on my system it's cp1252.
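Two ways to see and sidestep that, sketched below: check what the platform default actually is, or hand Beautiful Soup the raw bytes and let it detect the encoding itself (file name as in the question):
import locale
import bs4

# What open() uses when no encoding is given; cp1252 here rather than UTF-8.
print(locale.getpreferredencoding(False))

# Opening in binary mode passes bytes to Beautiful Soup, which then runs
# its own encoding detection instead of trusting a mis-decoded str.
with open("test.xml", "rb") as ff:
    xml = bs4.BeautifulSoup(ff, "html5lib")
print(xml.prettify())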

Python XPath SyntaxError: invalid predicate

I am trying to parse an XML document like
<document>
<pages>
<page>
<paragraph>XBV</paragraph>
<paragraph>GHF</paragraph>
</page>
<page>
<paragraph>ash</paragraph>
<paragraph>lplp</paragraph>
</page>
</pages>
</document>
and here is my code
import xml.etree.ElementTree as ET
tree = ET.parse("../../xml/test.xml")
root = tree.getroot()
path="./pages/page/paragraph[text()='GHF']"
print root.findall(path)
but I get an error
print root.findall(path)
File "X:\Anaconda2\lib\xml\etree\ElementTree.py", line 390, in findall
return ElementPath.findall(self, path, namespaces)
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 293, in findall
return list(iterfind(elem, path, namespaces))
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 263, in iterfind
selector.append(ops[token[0]](next, token))
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 224, in prepare_predicate
raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate
What is wrong with my XPath?
Follow up
Thanks falsetru, your solution worked. I have a follow-up. Now I want to get all the paragraph elements that come before the paragraph with text GHF. So in this case I only need the XBV element; I want to ignore ash and lplp. I guess one way to do this would be
result = []
for para in root.findall('./pages/page/'):
    t = para.text.encode("utf-8", "ignore")
    if t == "GHF":
        break
    else:
        result.append(para)
but is there a better way to do this?
ElementTree's XPath support is limited. Use another library, like lxml:
import lxml.etree
root = lxml.etree.parse('test.xml')
path = "./pages/page/paragraph[text()='GHF']"
print(root.xpath(path))
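For the follow-up question, lxml's fuller XPath support also covers the preceding-sibling axis, so the paragraphs before the 'GHF' one can be selected directly; a sketch against the same test.xml:
import lxml.etree

root = lxml.etree.parse('test.xml')
before = root.xpath("./pages/page/paragraph[text()='GHF']/preceding-sibling::paragraph")
print([p.text for p in before])  # ['XBV']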
As #falsetru mentioned, ElementTree doesn't support the text() predicate, but it does support matching a child element by text, so in this example it is possible to search for a page that has a paragraph with specific text using the path ./pages/page[paragraph='GHF']. The problem here is that there are multiple paragraph tags in a page, so one would have to iterate to find the specific paragraph. In my case, I needed to find the version of a dependency in a Maven pom.xml, and there is only a single version child, so the following worked:
In [1]: import xml.etree.ElementTree as ET
In [2]: ns = {"pom": "http://maven.apache.org/POM/4.0.0"}
In [3]: ET.parse("pom.xml").findall(".//pom:dependencies/pom:dependency[pom:artifactId='some-artifact-with-hardcoded-version']/pom:version", ns)[0].text
Out[3]: '1.2.3'

Python flask cannot open an xml file

I am trying to implement a server-side multi-language service on my website. This is the structure of the folders:
data
--locale
static
--css
--images
--js
templates
--index.html
--page1.html
...
main.py
I use Crowdin to translate the website and the output files are in XML. The locale folder contains one folder for each language, with one XML file for every page.
I store the language in a cookie, and here is my Python code:
from flask import request
from xml.dom.minidom import parseString
def languages(page):
    langcode = request.cookies.get("Language")
    xml = "/data/locale/%s/%s.xml" % (langcode, page)
    dom = parseString(xml)
    ................
    .............
I call this on every page, like languages("index").
This is an example of the exported XML files:
<?xml version="1.0" encoding="utf-8"?>
<!--Generated by crowdin.com-->
<!--
This is a description of my page
-->
<resources>
<string name="name1">value 1</string>
<string name="name2">value 2</string>
<string name="name3">value 3</string>
</resources>
However, I get the following error: ExpatError: not well-formed (invalid token): line 1, column 0
I googled it and ended up at other Stack Overflow questions, but most of them are about encoding problems, and I cannot find any in my example.
You have to use parse() if you want to parse a file. parseString() parses a string, which in your case is just the file name.
from flask import request
from xml.dom.minidom import parse
def languages(page):
    langcode = request.cookies.get("Language")
    xml = "/data/locale/%s/%s.xml" % (langcode, page)
    dom = parse(xml)
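Once the file parses, the translated strings can be pulled out with the usual minidom calls. A sketch (the dictionary shape is an assumption; the tag and attribute names come from the sample export above):
from flask import request
from xml.dom.minidom import parse

def languages(page):
    langcode = request.cookies.get("Language")
    dom = parse("/data/locale/%s/%s.xml" % (langcode, page))
    strings = {}
    for node in dom.getElementsByTagName("string"):
        # e.g. "name1" -> "value 1"
        strings[node.getAttribute("name")] = node.firstChild.nodeValue if node.firstChild else ""
    return strings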

xml parsing error special characters

I have the following XML that I want to parse with the xml.dom.minidom module:
<?xml version="1.0" encoding="UTF-8"?>
<RootTag>
<InnerTag>
<MyValue>"< here is special char."</MyValue>
</InnerTag>
</RootTag>
I have the following snippet for parsing the above XML:
import xml.dom.minidom
xml.dom.minidom.parse('input_xml')
But I get the following error:
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 4, column 26
The above error occurs only when there is a '&' or '<' inside the MyValue tag.
So, how do I resolve this issue?
I do not wish to change my XML by using escape sequences such as &lt;, and I want to keep the "" (quotes).
Your example is not well-formed XML. A literal < is not allowed in XML anywhere other than in markup. Your data needs to be wrapped in CDATA or escaped as &lt;:
<![CDATA[< here is special char.]]>
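A quick sketch to confirm the fix, using the sample document with the value wrapped in CDATA (escaping the < as &lt; would parse just the same):
from xml.dom.minidom import parseString

fixed = """<?xml version="1.0" encoding="UTF-8"?>
<RootTag>
<InnerTag>
<MyValue><![CDATA["< here is special char."]]></MyValue>
</InnerTag>
</RootTag>"""

dom = parseString(fixed)
print(dom.getElementsByTagName("MyValue")[0].firstChild.data)  # "< here is special char."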
