I receive XML strings from an external source that can contain unsanitized user-contributed content.
The following xml string gave a ParseError in cElementTree:
>>> print repr(s)
'<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
>>> import xml.etree.cElementTree as ET
>>> ET.XML(s)
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
ET.XML(s)
File "<string>", line 106, in XML
ParseError: not well-formed (invalid token): line 1, column 17
Is there a way to make cElementTree not complain?
It seems to be complaining about \x08; you will need to escape that character.
Edit:
Or you can have the parser ignore the errors using recover
from lxml import etree
parser = etree.XMLParser(recover=True)
etree.fromstring(xmlstring, parser=parser)
I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)
import xml.etree.ElementTree as ET
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(xmlstring, parser=parser)
EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...
This code snippet worked for me. I had an issue with parsing a batch of XML files and had to parse them with the 'iso-8859-5' encoding:
import xml.etree.ElementTree as ET
tree = ET.parse(filename, parser = ET.XMLParser(encoding = 'iso-8859-5'))
See this answer to another question and the corresponding part of the XML spec.
The backspace U+0008 is an invalid character in XML 1.0 documents: it cannot occur plainly, and (unlike in XML 1.1) it cannot even be written as an escaped character reference.
If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.
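A minimal sketch of that pre-processing step (my own illustration, assuming the backspace characters carry no meaning and can simply be dropped):
import xml.etree.ElementTree as ET

s = '<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
elem = ET.XML(s.replace('\x08', ''))  # parses fine once the illegal control characters are gone
print(elem.text)  # dddddddd_____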
None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree as follows:
from bs4 import BeautifulSoup
with open("data/myfile.xml") as fp:
soup = BeautifulSoup(fp, 'xml')
Then you can search the tree as:
soup.find_all('mytag')
This is most probably an encoding error. For example, I had an XML file encoded as UTF-8 with BOM (checked from the Notepad++ Encoding menu) and got a similar error message.
The workaround (Python 3.6)
import io
from xml.etree import ElementTree as ET
with io.open(file, 'r', encoding='utf-8-sig') as f:
contents = f.read()
tree = ET.fromstring(contents)
Check the encoding of your XML file. If it is using a different encoding, change the 'utf-8-sig' accordingly.
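If you are not sure which codec to use, a rough sketch (my own illustration, not from the original answer) is to peek at the first bytes of the file and check for a BOM before choosing it:
with io.open(file, 'rb') as f:
    head = f.read(4)
if head.startswith(b'\xef\xbb\xbf'):
    codec = 'utf-8-sig'   # UTF-8 with BOM
elif head[:2] in (b'\xff\xfe', b'\xfe\xff'):
    codec = 'utf-16'      # UTF-16 with BOM
else:
    codec = 'utf-8'       # assume plain UTF-8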
After lots of searching through the entire WWW, I found out that you have to strip out certain illegal characters if you want your XML parser to work! Here's how I did it, and it worked for me:
import re

escape_illegal_xml_characters = lambda x: re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', '', x)
And use it like you'd normally do:
ET.XML(escape_illegal_xml_characters(my_xml_string)) #instead of ET.XML(my_xml_string)
A gotcha for me, using Python's ElementTree... this has the invalid token error:
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
xml = u"""<?xml version='1.0' encoding='utf8'?>
<osm generator="pycrocosm server" version="0.6"><changeset created_at="2017-09-06T19:26:50.302136+00:00" id="273" max_lat="0.0" max_lon="0.0" min_lat="0.0" min_lon="0.0" open="true" uid="345" user="john"><tag k="test" v="Съешь же ещё этих мягких французских булок да выпей чаю" /><tag k="foo" v="bar" /><discussion><comment data="2015-01-01T18:56:48Z" uid="1841" user="metaodi"><text>Did you verify those street names?</text></comment></discussion></changeset></osm>"""
xmltest = ET.fromstring(xml.encode("utf-8"))
However, it works with the addition of a hyphen in the encoding type:
<?xml version='1.0' encoding='utf-8'?>
Most odd. Someone found this footnote in the python docs:
The encoding string included in XML output should conform to the
appropriate standards. For example, “UTF-8” is valid, but “UTF8” is
not.
I was stuck with a similar problem and finally figured out the root cause in my particular case: if you read data from multiple XML files that lie in the same folder, you will also parse the .DS_Store file.
Before parsing, add this condition:
for file in files:
if file.endswith('.xml'):
run_your_code...
This trick helped me as well
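An equivalent way to do the same filtering (just a sketch, assuming the files sit in a single directory, here a hypothetical data/) is to glob for .xml files directly:
import glob
import xml.etree.ElementTree as ET

for path in glob.glob('data/*.xml'):  # only .xml files, so .DS_Store is never touched
    tree = ET.parse(path)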
lxml solved the issue in my case:
from lxml import etree

for _, ele in etree.iterparse(xml_file, tag='tag_i_wanted', encoding='utf-8'):
    print(ele.tag, ele.text)
in another case,
parser = etree.XMLParser(recover=True)
tree = etree.parse(xml_file, parser=parser)
tags_needed = tree.iter('TAG NAME')
Thanks to theeastcoastwest
Python 2.7
In my case I got the same error (using ElementTree). I had to add these lines:
import xml.etree.ElementTree as ET
from lxml import etree
parser = etree.XMLParser(recover=True, encoding='utf-8')
xml_file = ET.parse(path_xml, parser=parser)
Works in Python 3.10.2.
What helped me with that error was Juan's answer - https://stackoverflow.com/a/20204635/4433222
But it wasn't enough: after struggling I found out that the XML file needs to be saved as UTF-8 without BOM.
The solution wasn't working for "normal" UTF-8.
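If re-saving the file by hand is not an option, a possible workaround (my assumption of what the fix amounts to, using a placeholder file name) is to strip a leading BOM from the raw bytes before parsing:
import codecs
import xml.etree.ElementTree as ET

with open('myfile.xml', 'rb') as f:   # placeholder file name
    raw = f.read()
if raw.startswith(codecs.BOM_UTF8):
    raw = raw[len(codecs.BOM_UTF8):]  # drop the UTF-8 BOM
root = ET.fromstring(raw)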
The only thing that worked for me was that I had to add the mode and encoding while opening the file, like below:
with open(filenames[0], mode='r', encoding='utf-8') as f:
    readFile()
Otherwise it was failing every time with an invalid token error if I simply did this:
f = open(filenames[0], 'r')
readFile()
This error can also occur when you pass a link (URL) straight to the parser; you first have to fetch the content of that link as a string:
import requests
from xml.etree import cElementTree

response = requests.get(Link)  # Link is the URL you want to fetch
# Parse the raw bytes so the parser can honour the encoding declared in the document itself.
root = cElementTree.fromstring(response.content)
I tried the other solutions in the answers here but had no luck. Since I only needed to extract the value from a single XML node, I gave in and wrote my own function to do so:
import re

def ParseXmlTagContents(source, tag, tagContentsRegex):
    openTagString = "<"+tag+">"
    closeTagString = "</"+tag+">"
    found = re.search(openTagString + tagContentsRegex + closeTagString, source)
    if found:
        start = found.regs[0][0]
        end = found.regs[0][1]
        return source[start+len(openTagString):end-len(closeTagString)]
    return ""
Example usage would be:
xmlString = """<?xml version="1.0" encoding="utf-16"?>
<parentNode>
    <childNode>123</childNode>
</parentNode>"""

ParseXmlTagContents(xmlString, "childNode", "[0-9]+")
Related
I am reading an XML file and converting it to a DataFrame using xmltodict and pandas.
This is what one of the elements in the file looks like:
<net>
<ref>https://whois.arin.net/rest/v1/net/NET-66-125-37-120-1</ref>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
<name>SBC066125037120020307</name>
<netBlocks>
<netBlock>
<cidrLenth>29</cidrLenth>
<endAddress>066.125.037.127</endAddress>
<type>S</type>
<startAddress>066.125.037.120</startAddress>
</netBlock>
</netBlocks>
<pocLinks/>
<orgHandle>C00285134</orgHandle>
<parentNetHandle>NET-66-120-0-0-1</parentNetHandle>
<registrationDate>2002-03-08T00:00:00-05:00</registrationDate>
<startAddress>66.125.37.120</startAddress>
<updateDate>2002-03-08T07:56:59-05:00</updateDate>
<version>4</version>
</net>
Since there are a large number of records like this being pulled in by an API, some <net> objects at the end of the file can sometimes be only partially downloaded, e.g. a tag missing its closing tag.
This is what I wrote to parse the XML:
import xmltodict

xml_data = open('/Users/dgoswami/Downloads/net.xml', 'r').read()  # Read data
xml_data = xmltodict.parse(xml_data,
                           process_namespaces=True,
                           namespaces={'http://www.arin.net/bulkwhois/core/v1': None})
When that happens, I get an error like this:
no element found: line 30574438, column 37
I want to be able to parse till the last valid <net> element.
How can that be done?
You may need to fix your xml beforehand - xmltodict has no ability to do that for you.
You can leverage lxml as described in Python xml - handle unclosed token to fix your xml:
from lxml import etree
def fixme(x):
    p = etree.fromstring(x, parser=etree.XMLParser(recover=True))
    return etree.tostring(p).decode("utf8")
fixed = fixme("""<start><net>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
</net><net>
<endAddress>66.125.37.227</endAddress>
<handle>NET-66-125-37-220-1</handle>
""")
and then use the fixed xml:
import xmltodict
print(xmltodict.parse(fixed))
to get
OrderedDict([('start',
OrderedDict([('net', [
OrderedDict([('endAddress', '66.125.37.127'), ('handle', 'NET-66-125-37-120-1')]),
OrderedDict([('endAddress', '66.125.37.227'), ('handle', 'NET-66-125-37-220-1')])
])
]))
])
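Since the original goal was a DataFrame, a possible follow-up (only a sketch, assuming pandas is available and the <net> entries are what you want as rows):
import pandas as pd

parsed = xmltodict.parse(fixed)
nets = parsed['start']['net']          # list of dicts, one per <net>, in this toy example
df = pd.DataFrame(nets)
print(df[['handle', 'endAddress']])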
I am trying to use the pom.xml below in a Python script that validates the pom for any syntax errors using lxml, then confirms the <version> is a SNAPSHOT and updates the <version> to match this format: ci_{git hub org}_{branch name}-SNAPSHOT.
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.wsi.devops</groupId>
<artifactId>python-test</artifactId>
<version>1.0-SNAPSHOT</version>
</project>
This is where I am currently with my solution:
# For XML validation, importing the etree module from the lxml
# package, as well as sys for handling input.
from lxml import etree
import sys
# filename passed as a command line argument
filename_xml = sys.argv[1]

# parse xml
try:
    doc = etree.parse(sys.argv[1])
    print('XML well formed, syntax ok.')
# check for XML syntax errors
except etree.XMLSyntaxError as err:
    print('XML Syntax Error, see error_syntax.log')
    with open('error_syntax.log', 'w') as error_log_file:
        error_log_file.write(str(err.error_log))
    quit()
except:
    print('Unknown error, exiting.')
    quit()
#Update version
from xml.etree import ElementTree as et
tree = et.parse(sys.argv[1])
tree.find('1.0').text = 'ci_{git hub org name}_{branch name}'
tree.write(sys.argv[1])
I just want to get some help with any mistakes I am making in my script.
The main problem with your code is an incorrect use of the ElementTree find() method. It takes a tag name or a certain simplified XPath syntax, whereas you seem to be treating it like the str.find() method, which takes an arbitrary string. What you need is the version tag.
Your parsing code should look like this:
version = tree.find('ns:version', {'ns': 'http://maven.apache.org/POM/4.0.0'})
if 'SNAPSHOT' in version.text:
    version.text = 'ci_{git hub org name here}_{branch name here}'
    # I guess you have some other code that sets this version properly
else:
    print("Not a snapshot.")
Note that you always have to set a namespace to find version. That brings me to my second point: why are you parsing the file twice? lxml is just a more featureful version of the standard library's xml module; you only need to import one! lxml also has the advantage that its ElementTrees have an nsmap attribute, so you don't have to type the namespace yourself. I guess that makes it more robust, if Apache releases a new Maven version or something:
tree = etree.parse(sys.argv[1])
version = tree.find('ns:version', {'ns':tree.getroot().nsmap[None]})
For complete code, using only lxml:
from lxml import etree
import sys
# parse xml
try:
    tree = etree.parse(sys.argv[1])
    print('XML well formed, syntax ok.')
except OSError:  # check for file errors (e.g. a missing file)
    print("Bad file: " + sys.argv[1])
    quit()
# check for XML syntax errors
except etree.XMLSyntaxError as err:
    print('XML Syntax Error, see error_syntax.log')
    with open('error_syntax.log', 'w') as error_log_file:
        error_log_file.write(str(err.error_log))
    quit()
except:
    print('Unknown error, exiting.')
    quit()

# Update version
version = tree.find('ns:version', {'ns': tree.getroot().nsmap[None]})
if 'SNAPSHOT' not in version.text:
    print("Not a snapshot")
    quit()  # Quitting after a failure is a way to avoid nesting

version.text = 'ci_{git hub org name}_{branch name}'
# I guess you have some other code that sets this version properly
tree.write(sys.argv[1])
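One optional tweak, not part of the original answer: lxml's write() also accepts xml_declaration, encoding and pretty_print keywords if you want the rewritten pom to keep a declaration and readable formatting.
tree.write(sys.argv[1], xml_declaration=True, encoding='utf-8', pretty_print=True)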
I'm using lxml to parse some HTML with Russian letters. That's why I have a headache with encodings.
I transform the HTML text to a tree using the following code. Then I'm trying to extract some things from the page (header, article content) using CSS queries.
from lxml import html
from bs4 import UnicodeDammit
doc = UnicodeDammit(html_text, is_html=True)
parser = html.HTMLParser(encoding=doc.original_encoding)
tree = html.fromstring(html_text, parser=parser)
...
def extract_title(tree):
metas = tree.cssselect("meta[property^=og]")
for meta in metas:
# print(meta.attrib)
# print(sys.stdout.encoding)
# print("123") # Uncomment this to fix error
content = meta.attrib['content']
print(content.encode('utf-8')) # This fails with "[Decode error - output not utf-8]"
I get "Decode error" when i'm trying to print unicode symbols to stdout. But if i add some print statement before failing print then everything works fine. I never saw such strange behavior of python print function. I thought it has no side-effects.
Do you have any idea why this is happening?
I use Windows and Sublime to run this code.
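(A hedged workaround sketch, not from the original post: when the console's encoding cannot represent the characters, writing through an error-tolerant wrapper avoids the crash. The content variable here is a hypothetical stand-in for meta.attrib['content'].)
import codecs
import sys

content = u"пример"  # hypothetical stand-in for meta.attrib['content']
out = codecs.getwriter('utf-8')(getattr(sys.stdout, 'buffer', sys.stdout), errors='replace')
out.write(content + u"\n")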
I am working on CityGML data right now and am trying to parse CityGML in Python.
To do so, I use ElementTree, which works fine with other XML files. But whenever I try to parse the CityGML file I don't get any results.
As one example, I want to get a list of all child tags named "creationDate" in the CityGML file. Here is the code:
import xml.etree.ElementTree as ET
tree = ET.parse('Gasometer.xml')
root = tree.getroot()
def child_list(child):
    list_child = list(tree.iter(child))
    return list_child
date = child_list('creationDate')
print (date)
I only get an empty list [].
Here is the very first part of the CityGML file (you can find the "creationDate" tag at the end):
<?xml version="1.0" encoding="UTF-8"?>
<CityModel>
<cityObjectMember>
<bldg:Building gml:id="UUID_899cac3f-e0b6-41e6-ae30-a91ce51d6d95">
<gml:description>Wohnblock in geschlossener Bauweise</gml:description>
<gml:boundedBy>
<gml:Envelope srsName="urn:ogc:def:crs,crs:EPSG::3068,crs:EPSG::5783" srsDimension="3">
<gml:lowerCorner>21549.6537889055 17204.3479916992 38.939998626709</gml:lowerCorner>
<gml:upperCorner>21570.6420902953 17225.660050148 60.6840192923434</gml:upperCorner>
</gml:Envelope>
</gml:boundedBy>
<creationDate>2014-03-28</creationDate>
This appears not only when I try to get lists of child tags. I can't print any attributes or tag names. It looks like the way I parse the file is wrong. I hope somebody can help me out with my problem and tell me what I should do! Thanks!
Since this is an old post I'll just leave this here in case someone else might need it.
To parse CityGML, try the following code; it should give you a general idea of how to fetch the information.
import xml.etree.ElementTree as ET
def loadfile():
tree = ET.parse('filename')
root = tree.getroot()
for envelope in root.iter('{http://www.opengis.net/gml}Envelope'):
print "ENV tag", envelope.tag
print "ENV attrib", envelope.attrib
print "ENV text", envelope.text
lCorner = envelope.find('{http://www.opengis.net/gml}lowerCorner').text
uCorner = envelope.find('{http://www.opengis.net/gml}upperCorner').text
print "lC",lCorner
print "uC",uCorner
if __name__== "__main__":
loadfile()
To get the srsName, try the following:
import xml.etree.ElementTree as ET
def loadfile():
tree = ET.parse('filename')
root = tree.getroot()
for envelope in root.iter('{http://www.opengis.net/gml}Envelope'):
key = envelope.attrib
srsName = key.get('srsName')
print "SRS Name: ", srsName
if __name__== "__main__":
loadfile()
I hope this helps you or anyone else who might try parsing CityGML with ElementTree.
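Coming back to the original question about creationDate, here is a small sketch of my own (assuming the empty list simply comes from the tag being namespaced in the real file) that matches on the local name regardless of namespace:
import xml.etree.ElementTree as ET

tree = ET.parse('Gasometer.xml')
dates = [el.text for el in tree.iter()
         if el.tag.split('}')[-1] == 'creationDate']  # strip any '{namespace}' prefix
print(dates)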
Hi all, I'm trying to extract the "META" description from a webpage using lxml for Python. When it encounters UTF-8 chars it seems to choke and display garbage chars. However, when getting the data via a regex I get the unicode chars just fine. Am I doing something wrong with lxml?
Thanks
''' test encoding issues with utf8 '''
from lxml.html import fromstring
from lxml.html.clean import Cleaner
import urllib2
import re
url = 'http://www.youtube.com/watch?v=LE-JN7_rxtE'
page = urllib2.urlopen(url).read()
xmldoc = fromstring(page)
desc = xmldoc.xpath('/html/head/meta[@name="description"]/@content')
meta_description = desc[0].strip()
print "**** LIBXML TEST ****\n"
print meta_description
print "**** REGEX TEST ******"
reg = re.compile(r'<meta name="description" content="(.*)">')
for desc in reg.findall(page):
print desc
OUTPUTS:
**** LIBXML TEST ****
My name is Hikakin.<br>I'm Japanese Beatboxer.<br><br>HIKAKIN Official Blog<br>http://ameblo.jp/hikakin/<br><br>ãã³çã³ãã¥<br>http://com.nicovideo.jp/community/co313576<br><br>â»å¾¡ç¨ã®æ¹ã¯Youtubeã®ã¡ãã»ã¼ã¸ã¾ã...
**** REGEX TEST ******
My name is Hikakin.<br>I'm Japanese Beatboxer.<br><br>HIKAKIN Official Blog<br>http://ameblo.jp/hikakin/<br><br>ニコ生コミュ<br>http://com.nicovideo.jp/community/co313576<br><br>※御用の方はYoutubeのメッセージまた...
Does this help?
xmldoc = fromstring(page.decode('utf-8'))
It is very possible that the problem is that your console does not support the display of Unicode characters. Try piping the output to a file and then open it in something that can display Unicode.
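For example, a quick sketch of that suggestion (writing the extracted value to a UTF-8 file instead of the console; meta_description is the unicode value from the question's code):
import io

with io.open('meta_description.txt', 'w', encoding='utf-8') as out:
    out.write(meta_description)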
In lxml, you need to pass the encoding to the parser.
For HTML/XML parsing:
import urllib2
from lxml import etree

url = 'http://en.wikipedia.org/wiki/' + wiki_word  # wiki_word defined elsewhere
parser = etree.HTMLParser(encoding='utf-8')  # you could also use an XMLParser()
page = urllib2.urlopen(url)
doc = etree.parse(page, parser)
T = doc.xpath('//p//text()')
text = u''.join(T).encode('utf-8')