Xml parsing with python - python

I am parsing XML with Python using xmltodict, but I am getting the following error:
xml.parsers.expat.ExpatError: mismatched tag: line 2890, column 2
Here is my code:
import xmltodict
import urllib2
url="url here"
data=xmltodict.parse(urllib2.urlopen(url).read())
print data
I have also tried using lxml.etree.
Here is the code:
import urllib2
import lxml.etree as ET
print 'Started Execution here'
url="url here"
xmldata = urllib2.urlopen(url).read()
root = ET.fromstring(xmldata)
print 'Done'
print root
It also gives me an error:
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: ClinicID line 54 and Type, line 55, column 14
I know there is a problem in the XML.
My question is: is there any way to read the XML with all the nodes except the ones that cause the error?

Related

parsing invalid xml using xmltodict

I am reading an XML file and converting it to a DataFrame using xmltodict and pandas.
This is what one of the elements in the file looks like:
<net>
<ref>https://whois.arin.net/rest/v1/net/NET-66-125-37-120-1</ref>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
<name>SBC066125037120020307</name>
<netBlocks>
<netBlock>
<cidrLenth>29</cidrLenth>
<endAddress>066.125.037.127</endAddress>
<type>S</type>
<startAddress>066.125.037.120</startAddress>
</netBlock>
</netBlocks>
<pocLinks/>
<orgHandle>C00285134</orgHandle>
<parentNetHandle>NET-66-120-0-0-1</parentNetHandle>
<registrationDate>2002-03-08T00:00:00-05:00</registrationDate>
<startAddress>66.125.37.120</startAddress>
<updateDate>2002-03-08T07:56:59-05:00</updateDate>
<version>4</version>
</net>
Since a large number of records like this are pulled in by an API, some <net> objects at the end of the file are sometimes only partially downloaded,
e.g. one tag is missing its closing tag.
This is what I wrote to parse the XML:
xml_data = open('/Users/dgoswami/Downloads/net.xml', 'r').read() # Read data
xml_data = xmltodict.parse(xml_data,
process_namespaces=True,
namespaces={'http://www.arin.net/bulkwhois/core/v1':None})
When that happens, I get an error like this:
no element found: line 30574438, column 37
I want to be able to parse till the last valid <net> element.
How can that be done?
You may need to fix your xml beforehand - xmltodict has no ability to do that for you.
You can use lxml, as described in "Python xml - handle unclosed token", to fix your XML:
from lxml import etree
def fixme(x):
    p = etree.fromstring(x, parser=etree.XMLParser(recover=True))
    return etree.tostring(p).decode("utf8")
fixed = fixme("""<start><net>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
</net><net>
<endAddress>66.125.37.227</endAddress>
<handle>NET-66-125-37-220-1</handle>
""")
and then use the fixed xml:
import xmltodict
print(xmltodict.parse(fixed))
to get
OrderedDict([('start',
OrderedDict([('net', [
OrderedDict([('endAddress', '66.125.37.127'), ('handle', 'NET-66-125-37-120-1')]),
OrderedDict([('endAddress', '66.125.37.227'), ('handle', 'NET-66-125-37-220-1')])
])
]))
])
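
When the input is too large to repair in memory, or only the complete records matter, a streaming parse that stops at the first error is another option. Below is a minimal sketch using only the standard library's xml.etree.ElementTree.iterparse. The helper name complete_records and the sample data are made up for illustration, and with the namespaced ARIN file the tag would need its {namespace} prefix.

```python
import io
import xml.etree.ElementTree as ET


def complete_records(source, tag):
    """Collect every fully closed <tag> element seen before the first parse error."""
    records = []
    try:
        # An "end" event fires only once an element's closing tag has been read.
        for _event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == tag:
                records.append({child.tag: child.text for child in elem})
                elem.clear()  # drop the children we no longer need
    except ET.ParseError:
        # Truncated or malformed input: keep what was parsed so far.
        pass
    return records


# Hypothetical sample: two complete <net> records and one cut off mid-download.
truncated = (
    "<start>"
    "<net><handle>NET-66-125-37-120-1</handle></net>"
    "<net><handle>NET-66-125-37-220-1</handle></net>"
    "<net><handle>NET-66-125"
)
records = complete_records(io.BytesIO(truncated.encode("utf-8")), "net")
print(records)
```

Unlike recover=True, this drops the incomplete trailing record instead of trying to repair it, which may or may not be what you want.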

Python lxml getpath error

I'm trying to get a full list of xpaths from a device config in xml.
When I run it though I get:
AttributeError: 'Element' object has no attribute 'getpath'
Code is just a few lines
import xml.etree.ElementTree
import os
from lxml import etree
file1 = 'C:\Users\test1\Desktop\test.xml'
file1_path = file1.replace('\\','/')
e = xml.etree.ElementTree.parse(file1_path).getroot()
for entry in e.iter():
    print e.getpath(entry)
Has anyone come across this before?
Thanks
Richie
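You are doing it incorrectly: getpath is a method of lxml's ElementTree, not of the elements produced by the standard library's xml.etree.ElementTree. Don't call getroot; just parse and iterate using lxml.etree: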
import lxml.etree as et
file1 = 'C:/Users/test1/Desktop/test.xml'
root = et.parse(file1)
for e in root.iter():
    print root.getpath(e)
If you are dealing with namespaces, you may find getelementpath useful:
root.getelementpath(e)
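
If lxml is not available, a comparable listing can be sketched with the standard library by building each path during traversal. This is a swapped-in technique, not lxml's getpath: iter_paths is a hypothetical helper, the sample document is invented, and namespaces get no special treatment.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def iter_paths(elem, path=None):
    """Yield a positional path for elem and for each of its descendants."""
    path = path or "/" + elem.tag
    yield path
    totals = Counter(child.tag for child in elem)  # how often each tag repeats
    seen = Counter()
    for child in elem:
        seen[child.tag] += 1
        step = child.tag
        if totals[child.tag] > 1:
            # Only repeated siblings get a 1-based index.
            step += "[{}]".format(seen[child.tag])
        yield from iter_paths(child, path + "/" + step)


# Hypothetical device config used only for this illustration.
root = ET.fromstring(
    "<config>"
    "<iface><name>eth0</name></iface>"
    "<iface><name>eth1</name></iface>"
    "<dns/>"
    "</config>"
)
paths = list(iter_paths(root))
for p in paths:
    print(p)
```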

Python Urllib/Requests XML iterparse error

I am currently trying to fetch XML from Wikipedia and parse it with ElementTree. My general setup is the following:
import requests
import xml.etree.cElementTree as etree
payload = {'pages': 'Apple', 'action': 'submit', 'offset' : '2008-01-24 09:39:22'}
r = requests.post('http://en.wikipedia.org/w/index.php?title=Special:Export', params=payload, stream=True)
xmlIterator = etree.iterparse(r.raw, events=("start","end"))
When I run my parsing loop, I get the following error:
for event, element in self.xmlIterator:
File "<string>", line 107, in next
ParseError: no element found: line 249375, column 2
I have tried the same approach with urllib and received the same error. It also seems to happen only for this specific XML; others work fine.
But the strange thing is as follows: if I store the response to a file and then pass the file to the XML parser it works fine. E.g.,:
open("test.xml","w").write(r.text.encode('utf-8'))
xmlIterator = etree.iterparse("test.xml", events=("start","end"))
Again, the same behavior for urllib.
Does anyone have an idea of what the problem could be?

Python lxml validation offline yields "connection refused" to http://www.opengis.net/kml/2.2

I am trying to validate some XML via lxml and an XSD
(ogckml22.xsd). This is happening OFFLINE. I read the
file via a straight open/read.
For the record, http://www.opengis.net/kml/2.2 is not valid.
from another article:
(clarified due to comment request..)
from lxml import etree
import os
import sys
import StringIO
file=open('ogckml22.xsd')
data=file.read()
str=StringIO.StringIO(data)
try:
    xmlschema_doc=etree.parse(data)
except IOError as ex:
    print "oops {0}".format(ex.strerror)
except:
    print "Unexpected error:", sys.exc_info()[0]
xmlschema=etree.XMLSchema(xmlschema_doc)
All I get is a "connection refused".
With the try/except, I get the xmlschema_doc is not defined.
File "<stdin>", line 1, in <module>
File "<xmlschema.pxi",line 105, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:132748
self.error_log)
lxml.etree.XMLSchemaParseError: connection refused
I know it can read the xsd file above and another xsd file that
gets included.
OK maybe the xsd gets read? I downloaded the source for lxml and in src/lxml/xmlschema.pxi,
if self._c_schema is NULL:
    raise XMLSchemaParseError(
        self.error_log._buildExceptionMessage(
            u"Document is not valid XML Schema"),
        self._error_log)
I never see the "Document is not valid XML Schema" message. I can only assume that "Connection Refused" is used in place of the "Document message" (a default?) but
a more thorough reading of _error_log (outside of recompilation) evades me....
Sincerely,
ArrowInTree
ogckml22.xsd imports two other schema documents (atom-author-link.xsd and xAL.xsd):
<!-- import atom:author and atom:link -->
<import namespace="http://www.w3.org/2005/Atom"
schemaLocation="atom-author-link.xsd"/>
<!-- import xAL:Address -->
<import namespace="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0"
schemaLocation="http://docs.oasis-open.org/election/external/xAL.xsd"/>
If you want to parse the schema offline, you need to have both of these documents available locally, and the paths given by schemaLocation must be correct.
The parsing and loading of the schema can be simplified (there is no need for StringIO):
from lxml import etree
xmlschema_doc = etree.parse("ogckml22.xsd")
xmlschema = etree.XMLSchema(xmlschema_doc)
print xmlschema
Output:
<lxml.etree.XMLSchema object at 0x00D25120>
I don't understand what you mean by "For the record, http://www.opengis.net/kml/2.2 is not valid".
If you have internet access, you can use the URL as argument to etree.parse():
xmlschema_doc = etree.parse("http://www.opengis.net/kml/2.2")
At least this works for me.

ParseError: not well-formed (invalid token) using cElementTree

I receive xml strings from an external source that can contains unsanitized user contributed content.
The following xml string gave a ParseError in cElementTree:
>>> print repr(s)
'<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
>>> import xml.etree.cElementTree as ET
>>> ET.XML(s)
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
ET.XML(s)
File "<string>", line 106, in XML
ParseError: not well-formed (invalid token): line 1, column 17
Is there a way to make cElementTree not complain?
It seems to complain about \x08. You will need to escape that.
Edit:
Or you can use lxml and have its parser ignore the errors with recover=True:
from lxml import etree
parser = etree.XMLParser(recover=True)
etree.fromstring(xmlstring, parser=parser)
I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)
import xml.etree.ElementTree as ET
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(xmlstring, parser=parser)
EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...
This code snippet worked for me. I had an issue parsing a batch of XML files and had to tell the parser that they were encoded as 'iso-8859-5':
import xml.etree.ElementTree as ET
tree = ET.parse(filename, parser = ET.XMLParser(encoding = 'iso-8859-5'))
See this answer to another question and the corresponding part of the XML spec.
The backspace U+0008 is an invalid character in XML documents. It must be represented as the escaped entity &#8; and cannot occur plainly (note that XML 1.0 rejects even the escaped form; only XML 1.1 accepts it).
If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.
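
A minimal sketch of that replacement step, assuming the offending characters can simply be dropped. The pattern matches everything outside the character range that XML 1.0 allows; strip_invalid_xml_chars is an illustrative name, not a library function.

```python
import re
import xml.etree.ElementTree as ET

# XML 1.0 allows tab, LF, CR and the ranges U+0020-U+D7FF, U+E000-U+FFFD,
# U+10000-U+10FFFF. The negated class matches everything else.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Remove characters that an XML 1.0 parser would reject."""
    return _INVALID_XML_CHARS.sub("", text)


s = "<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>"
root = ET.fromstring(strip_invalid_xml_chars(s))
print(repr(root.text))
```

Dropping the characters changes the data, so if the backspaces carry meaning for you, substitute a visible placeholder instead of an empty string.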
None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree as follows:
from bs4 import BeautifulSoup
with open("data/myfile.xml") as fp:
    soup = BeautifulSoup(fp, 'xml')
Then you can search the tree as:
soup.find_all('mytag')
This is most probably an encoding error. For example I had an xml file encoded in UTF-8-BOM (checked from the Notepad++ Encoding menu) and got similar error message.
The workaround (Python 3.6)
import io
from xml.etree import ElementTree as ET
with io.open(file, 'r', encoding='utf-8-sig') as f:
    contents = f.read()
    tree = ET.fromstring(contents)
Check the encoding of your xml file. If it is using different encoding, change the 'utf-8-sig' accordingly.
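
To illustrate what the 'utf-8-sig' codec changes, here is a small in-memory sketch (the sample bytes are invented). Decoding with 'utf-8-sig' removes a leading byte order mark if one is present, so the parser receives text that starts directly with the XML.

```python
import codecs
import xml.etree.ElementTree as ET

# Hypothetical file contents: a UTF-8 BOM followed by a small document.
raw = codecs.BOM_UTF8 + "<root>café</root>".encode("utf-8")

with_bom = raw.decode("utf-8")      # plain utf-8 keeps U+FEFF as the first character
text = raw.decode("utf-8-sig")      # utf-8-sig strips it

root = ET.fromstring(text)
print(with_bom.startswith("\ufeff"), text.startswith("\ufeff"), root.text)
```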
After lots of searching, I found that you have to remove certain characters that are illegal in XML before the parser will accept the input. Here's how I did it, and it worked for me (despite its name, the function deletes the characters rather than escaping them):
import re
escape_illegal_xml_characters = lambda x: re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', '', x)
And use it like you'd normally do:
ET.XML(escape_illegal_xml_characters(my_xml_string)) #instead of ET.XML(my_xml_string)
Here is a gotcha I ran into using Python's ElementTree. This has the invalid token error:
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
xml = u"""<?xml version='1.0' encoding='utf8'?>
<osm generator="pycrocosm server" version="0.6"><changeset created_at="2017-09-06T19:26:50.302136+00:00" id="273" max_lat="0.0" max_lon="0.0" min_lat="0.0" min_lon="0.0" open="true" uid="345" user="john"><tag k="test" v="Съешь же ещё этих мягких французских булок да выпей чаю" /><tag k="foo" v="bar" /><discussion><comment data="2015-01-01T18:56:48Z" uid="1841" user="metaodi"><text>Did you verify those street names?</text></comment></discussion></changeset></osm>"""
xmltest = ET.fromstring(xml.encode("utf-8"))
However, it works with the addition of a hyphen in the encoding type:
<?xml version='1.0' encoding='utf-8'?>
Most odd. Someone found this footnote in the python docs:
The encoding string included in XML output should conform to the
appropriate standards. For example, “UTF-8” is valid, but “UTF8” is
not.
I was stuck on a similar problem and finally figured out the root cause in my particular case. If you read the data from multiple XML files that lie in the same folder, you will also try to parse the .DS_Store file.
Before parsing, add this condition:
for file in files:
    if file.endswith('.xml'):
        run_your_code...
This trick helped me as well
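
The same .DS_Store filter can be sketched with pathlib, where a "*.xml" glob skips hidden files and other non-XML entries. The helper name parse_xml_files and the temporary folder contents are assumptions made for the demonstration.

```python
import tempfile
from pathlib import Path
import xml.etree.ElementTree as ET


def parse_xml_files(folder):
    """Parse only the *.xml files in folder and return {filename: root element}."""
    roots = {}
    for path in sorted(Path(folder).glob("*.xml")):
        roots[path.name] = ET.parse(path).getroot()
    return roots


# Demonstration folder with one XML file and one macOS metadata file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.xml").write_text("<root/>", encoding="utf-8")
    Path(tmp, ".DS_Store").write_bytes(b"\x00\x01Bud1")
    parsed = parse_xml_files(tmp)

print(sorted(parsed))
```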
lxml solved the issue, in my case
from lxml import etree
for _, ele in etree.iterparse(xml_file, tag='tag_i_wanted', encoding='utf-8'):
    print(ele.tag, ele.text)
in another case,
parser = etree.XMLParser(recover=True)
tree = etree.parse(xml_file, parser=parser)
tags_needed = tree.iter('TAG NAME')
Thanks to theeastcoastwest
Python 2.7
In my case I got the same error. (using Element Tree)
I had to add these lines:
import xml.etree.ElementTree as ET
from lxml import etree
parser = etree.XMLParser(recover=True,encoding='utf-8')
xml_file = ET.parse(path_xml,parser=parser)
Works in Python 3.10.2
What helped me with that error was Juan's answer - https://stackoverflow.com/a/20204635/4433222
But it wasn't enough. After struggling, I found out that the XML file needs to be saved with UTF-8 without BOM encoding.
The solution wasn't working for "normal" UTF-8.
The only thing that worked for me is I had to add mode and encoding while opening the file like below:
with open(filenames[0], mode='r',encoding='utf-8') as f:
    readFile()
Otherwise it was failing every time with invalid token error if I simply do this:
f = open(filenames[0], 'r')
readFile()
This error occurs when you pass a link to the parser directly. You first have to fetch the content behind that link and then parse it:
response = requests.get(Link)
root = cElementTree.fromstring(response.content)
I tried the other solutions in the answers here but had no luck. Since I only needed to extract the value from a single XML node, I gave in and wrote my own function to do so:
import re
def ParseXmlTagContents(source, tag, tagContentsRegex):
    openTagString = "<"+tag+">"
    closeTagString = "</"+tag+">"
    found = re.search(openTagString + tagContentsRegex + closeTagString, source)
    if found:
        start = found.start()
        end = found.end()
        return source[start+len(openTagString):end-len(closeTagString)]
    return ""
Example usage would be:
<?xml version="1.0" encoding="utf-16"?>
<parentNode>
<childNode>123</childNode>
</parentNode>
ParseXmlTagContents(xmlString, "childNode", "[0-9]+")
