Error while parsing xml file in python - python

This is the xml file I am trying to parse. This file does not have a root tag.
<data txt="some0" txt1 = "some1" txt2 = "some2" >
<data2>
< bank = "SBI" bank2 = "SBI2" >
<data2>
<data3>
<branch = "bang1" branch = bang"2" >
<data3>
<data>
My script contains the lines below. They are meant to pull out the specific data and then parse it.
data = re.findall("<data txt=.*?</data>", re.DOTALL)
tree = ElementTree.fromstringlist(data)
I am unable to parse this file because it does not have a root tag. Please help me: how can I parse the file when it has no root tag?

As pointed out in a comment already, you can just parse the whole thing. If the missing root element is the problem, you can grab the contents of the file as a string and then add an arbitrary root tag at the beginning and the end.
stringdata = "<myroot>%s</myroot>" % stringdata
and then parse the string.
EDIT:
In response to comment.
If you have one string, you'll want fromstring, but you'll almost certainly get the same error. Something else is going on. Try this ...
from xml.etree import ElementTree
stringdata = "<myroot>%s</myroot>" % stringdata
tree = ElementTree.fromstring(stringdata)
Then get what you need from tree.
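The wrapping step above can be sketched as follows. This is a minimal illustration, not the asker's data: `fragment` is a made-up, well-formed stand-in with two top-level elements, and `myroot` is an arbitrary name. Wrapping only supplies the missing root; the sample in the question also has unclosed tags and malformed attributes, which would still raise a parse error.

```python
from xml.etree import ElementTree

# Hypothetical root-less input: two sibling top-level elements.
fragment = '<data txt="some0"><bank name="SBI"/></data><data txt="some1"/>'

# Add an arbitrary root element around the fragment, then parse as usual.
wrapped = "<myroot>%s</myroot>" % fragment
root = ElementTree.fromstring(wrapped)

# Collect the txt attribute of each top-level <data> element.
txt_values = [child.get("txt") for child in root.findall("data")]
print(txt_values)
```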

Related

Parse large python xml using xmltree

I have a python script that parses huge xml files ( largest one is 446 MB)
try:
    parser = etree.XMLParser(encoding='utf-8')
    tree = etree.parse(os.path.join(srcDir, fileName), parser)
    root = tree.getroot()
except Exception, e:
    print "Error parsing file "+str(fileName) + " Reason "+str(e.message)
for child in root:
    if "PersonName" in child.tag:
        personName = child.text
This is what my xml looks like :
<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
<Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
<Description>myData</Description>
<Identifier>43hhjh87n4nm</Identifier>
</Aliases>
<RollNo uom="kPa">39979172.201167159</RollNo>
<PersonName>Miracle Smith</PersonName>
<Date>2017-06-02T01:10:32-05:00</Date>
....
All I want to do is get the contents of the PersonName tag, that's all. I don't care about the other tags.
Sadly, my files are huge and I keep getting this error when I use the code above:
Error parsing file 2eb6d894-0775-e611.xml Reason unknown error, line 1, column 310915857
Error parsing file 2ecc18b5-ef41-e711-80f.xml Reason Extra content at the end of the document, line 1, column 3428182
Error parsing file 2f0d6926-b602-e711-80f4-005.xml Reason Extra content at the end of the document, line 1, column 6162118
Error parsing file 2f12636b-b2f5-e611-80f3-00.xml Reason Extra content at the end of the document, line 1, column 8014679
Error parsing file 2f14e35a-d22b-4504-8866-.xml Reason Extra content at the end of the document, line 1, column 8411238
Error parsing file 2f50c2eb-55c6-e611-80f0-005056a.xml Reason Extra content at the end of the document, line 1, column 7636614
Error parsing file 3a1a3806-b6af-e611-80ef-00505.xml Reason Extra content at the end of the document, line 1, column 11032486
My XML is perfectly fine and has no extra content. It seems that parsing the large files causes the error.
I have looked at iterparse(), but it seems too complex for what I want to achieve, as it walks the whole document while I just want that one tag under the root. Also, I could not find a good sample showing how to get the correct value by tag name.
Should I use a regex or a grep/awk approach to do this? Or is there a tweak to my code that will let me get the person name from these huge files?
UPDATE:
I tried this sample and it seems to print everything in the XML except my tag.
Does iterparse read from the bottom of the file to the top? In that case it would take a long time to reach my PersonName tag near the top. I tried changing the events to events=("end", "start") and it does the same thing!
path = []
for event, elem in ET.iterparse('D:\\mystage\\2-80ea-005056.xml', events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        print elem.text  # prints whole world
        if elem.tag == 'PersonName':
            print elem.text
        path.pop()
iterparse is not that difficult to use in this case.
temp.xml is the file presented in your question with a </MyRoot> appended as a final line.
Think of the source = line as boilerplate, if you will: it parses the XML file and returns it element by element, indicating whether each chunk is the 'start' or the 'end' of an element and supplying information about the element.
In this case we need to consider only the 'start' events. We watch for the 'PersonName' tag and pick up its text. Having found the one and only such item in the XML file, we abandon the processing.
>>> from xml.etree import ElementTree
>>> source = iter(ElementTree.iterparse('temp.xml', events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...
'Miracle Smith'
Edit, in response to question in a comment:
Normally you wouldn't do this since iterparse is intended for use with large chunks of xml. However, by wrapping a string in a StringIO object it can be processed with iterparse.
>>> from xml.etree import ElementTree
>>> from io import StringIO
>>> xml = StringIO('''\
... <?xml version="1.0" encoding="utf-8"?>
... <MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
... <Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
... <Description>myData</Description>
... <Identifier>43hhjh87n4nm</Identifier>
... </Aliases>
... <RollNo uom="kPa">39979172.201167159</RollNo>
... <PersonName>Miracle Smith</PersonName>
... <Date>2017-06-02T01:10:32-05:00</Date>
... </MyRoot>''')
>>> source = iter(ElementTree.iterparse(xml, events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...
'Miracle Smith'
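One caveat worth knowing: the ElementTree documentation only guarantees that an element's text is populated when its 'end' event fires, so reading it on 'start' can return None on large inputs. A variant of the loop above that waits for 'end' might look like the sketch below. The helper name `first_person_name` and the shortened XML sample are illustrative assumptions, not part of the original answer.

```python
import io
from xml.etree import ElementTree

# Shortened, hypothetical stand-in for the file in the question.
xml_bytes = b"""<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns="http://www.example.org/yml/data/litsmlv2">
  <RollNo uom="kPa">39979172.201167159</RollNo>
  <PersonName>Miracle Smith</PersonName>
  <Date>2017-06-02T01:10:32-05:00</Date>
</MyRoot>"""

def first_person_name(source):
    """Return the text of the first PersonName element, or None if absent."""
    for event, elem in ElementTree.iterparse(source, events=("end",)):
        # Tags look like '{namespace-uri}PersonName'; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "PersonName":
            return elem.text
        # Discard elements we do not need so memory stays flat on huge files.
        elem.clear()
    return None

person_name = first_person_name(io.BytesIO(xml_bytes))
print(person_name)
```

iterparse also accepts a file name in place of the BytesIO object, and the early return stops reading as soon as the tag has been seen.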

How to parse out xml from noisy file using python

I have a file which contains a bunch of logging information including XML. I'd like to parse out the XML portion into a string object so I can then run some XPath queries on it to verify the existence of certain information on the 'data' element.
File to parse:
Requesting event notifications...
Receiving command objects...
<?xml version="1.0" encoding="UTF-8"?><Root xmlns="http://schemas.com/service" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><data id="123" interface="2017.1" implementation="2016.122-SNAPSHOT" Version="2016.1.2700-SNAPSHOT"></data></Root>
All information has been collected
Command execution successful...
Python:
import re
with open('./output.out', 'r') as outFile:
    data = outFile.read().replace('\n','')
regex = re.escape("<.*?>.*?<\/Root>");
p = re.compile(regex)
m = p.match(data)
if m:
    print(m.group())
else:
    print('No match')
Output:
No match
What am I doing wrong? How can I accomplish my goal? Any help would be much appreciated.
Thou shalt never use regular expressions for parsing XML/HTML. There is BeautifulSoup for this daunting task.
import bs4
soup = bs4.BeautifulSoup(open("output.out").read(), "lxml")
roots = soup.findAll('root')
#[<root xmlns="http://schemas.com/service"
# xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
# <data id="123" implementation="2016.122-SNAPSHOT" interface="2017.1"
# version="2016.1.2700-SNAPSHOT"></data></root>]
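roots[0] is a bs4 Tag holding the parsed <root> element, so you can query it further. Note that the "lxml" parser used here is an HTML parser and lowercases tag and attribute names, as the output above shows.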
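As for what went wrong in the question's code: re.escape() turns the whole pattern into a literal string, and match() only matches at the very start of the input. If case-sensitive names and namespaces matter, one alternative is to use a regex only to locate the XML span in the log and hand that span to a real XML parser. The sketch below assumes the log holds exactly one <Root> document; the `s` prefix is an arbitrary local alias for the namespace.

```python
import re
from xml.etree import ElementTree

# The log from the question, inlined here instead of read from output.out.
log_text = """Requesting event notifications...
Receiving command objects...
<?xml version="1.0" encoding="UTF-8"?><Root xmlns="http://schemas.com/service" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><data id="123" interface="2017.1" implementation="2016.122-SNAPSHOT" Version="2016.1.2700-SNAPSHOT"></data></Root>
All information has been collected
Command execution successful...
"""

# search() scans the whole text; the pattern is NOT passed through re.escape().
match = re.search(r"<Root\b.*?</Root>", log_text, re.DOTALL)
root = ElementTree.fromstring(match.group(0))

# The document uses a default namespace, so lookups must be namespace-qualified.
ns = {"s": "http://schemas.com/service"}
data = root.find("s:data", ns)
data_id = data.get("id")
print(data_id)
```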

Adding html tags to text of XML.ElementTree Elements in Python

I am trying to use a Python script to generate an HTML document with text from a data table using the xml.etree.ElementTree module. I would like to format some of the cells to include HTML tags, typically either <br /> or <sup></sup> tags. When I generate a string and write it to a file, I believe the XML serializer is escaping these tags character by character. The output then shows the tags as text rather than processing them as tags. Here is a trivial example:
import xml.etree.ElementTree as ET
root = ET.Element('html')
#extraneous code removed
td = ET.SubElement(tr, 'td')
td.text = 'This is the first line <br /> and the second'
tree = ET.tostring(root)
out = open('test.html', 'w+')
out.write(tree)
out.close()
When you open the resulting 'test.html' file, it displays the text string exactly as typed: 'This is the first line <br /> and the second'.
The HTML document itself shows the problem in the source. It appears that the serializer replaces the "less than" and "greater than" symbols in the tag with the HTML entities for those symbols:
<!--Extraneous code removed-->
<td>This is the first line &lt;br /&gt; and the second</td>
Clearly, my intent is to have the document process the tag itself, not display it as text. I'm not sure if there are different parser options I can pass to get this to work, or if there is a different method I should be using. I am open to using other modules (e.g. lxml) if that will solve the problem. I am mainly using the built-in XML module for convenience.
The only thing I've figured out that works is to modify the final string with re substitutions before I write the file:
tree = ET.tostring(root)
tree = re.sub(r'&lt;','<',tree)
tree = re.sub(r'&gt;','>',tree)
This works, but seems like it should be avoidable by using a different setting in xml. Any suggestions?
You can create br as a child element of td and use its tail attribute to build exactly the text you want:
import xml.etree.ElementTree as ET
root = ET.Element('html')
table = ET.SubElement(root, 'table')
tr = ET.SubElement(table, 'tr')
td = ET.SubElement(tr, 'td')
td.text = "This is the first line "
# note how to end td tail
td.tail = None
br = ET.SubElement(td, 'br')
# now continue your text with br.tail
br.tail = " and the second"
tree = ET.tostring(root)
# see the string
tree
'<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>'
with open('test.html', 'w+') as f:
    f.write(tree)
# and the output html file
cat test.html
<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>
As a side note, to include <sup></sup> and then append more text that stays inside <td>, using tail has the desired effect too:
...
td.text = "this is first line "
sup = ET.SubElement(td, 'sup')
sup.text = "this is second"
# use tail to continue your text
sup.tail = "well and the last"
print ET.tostring(root)
<html><table><tr><td>this is first line <sup>this is second</sup>well and the last</td></tr></table></html>
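The snippets above are written for Python 2. On Python 3, ET.tostring() returns bytes by default, so writing the result to a file opened in text mode fails. A minimal sketch of the same text/tail technique for Python 3, assuming a string result is what you want, passes encoding="unicode":

```python
import xml.etree.ElementTree as ET

root = ET.Element("html")
table = ET.SubElement(root, "table")
tr = ET.SubElement(table, "tr")
td = ET.SubElement(tr, "td")

# Text before the first child goes in td.text ...
td.text = "This is the first line "
br = ET.SubElement(td, "br")
# ... and text after a child element goes in that child's tail.
br.tail = " and the second"

# encoding="unicode" makes tostring() return str instead of bytes.
markup = ET.tostring(root, encoding="unicode")
print(markup)
```

Because the br tag is a real element rather than text, nothing is escaped and no re.sub post-processing is needed.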

Programmatically clean/ignore namespaces in XML - python

I'm trying to write a simple program to read my financial XML files from GNUCash, and learn Python in the process.
The XML looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<gnc-v2
xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
{...}
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:count-data cd:type="book">1</gnc:count-data>
<gnc:book version="2.0.0">
<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
<gnc:count-data cd:type="account">80</gnc:count-data>
<gnc:count-data cd:type="transaction">826</gnc:count-data>
<gnc:count-data cd:type="budget">1</gnc:count-data>
<gnc:commodity version="2.0.0">
<cmdty:space>ISO4217</cmdty:space>
<cmdty:id>BRL</cmdty:id>
<cmdty:get_quotes/>
<cmdty:quote_source>currency</cmdty:quote_source>
<cmdty:quote_tz/>
</gnc:commodity>
Right now, I'm able to iterate and get results using
import xml.etree.ElementTree as ET
r = ET.parse("file.xml").findall('.//')
after manually cleaning the namespaces, but I'm looking for a solution that could either read the entries regardless of their namespaces OR remove the namespaces before parsing.
Note that I'm a complete noob in python, and I've read: Python and GnuCash: Extract data from GnuCash files, Cleaning an XML file in Python before parsing and python: xml.etree.ElementTree, removing "namespaces" along with ElementTree docs and I'm still lost...
I've come up with this solution:
def strip_namespaces(self, tree):
    nspOpen = re.compile("<\w*:", re.IGNORECASE)
    nspClose = re.compile("<\/\w*:", re.IGNORECASE)
    for i in tree:
        start = re.sub(nspOpen, '<', tree.tag)
        end = re.sub(nspOpen, '<\/', tree.tag)
        # pprint(finaltree)
    return
But I'm failing to apply it. I can't seem to be able to retrieve the tag names as they appear on the file.
I think the Python code below will be helpful to you.
sample.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
xmlns:act="http://www.gnucash.org/XML/act"
xmlns:book="http://www.gnucash.org/XML/book"
xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:change>
<gnc:lastUpdate>2018-12-21
</gnc:lastUpdate>
</gnc:change>
<gnc:bill>
<gnc:billAccountNumber>1234</gnc:billAccountNumber>
<gnc:roles>
<gnc:id>111111</gnc:id>
<gnc:pos>2</gnc:pos>
<gnc:genid>15</gnc:genid>
</gnc:roles>
</gnc:bill>
<gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>
PYTHON CODE: to remove xmlns for root tag.
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
def xmlns(tag):
    # Drop every '{namespace-uri}' prefix from an ElementTree tag name.
    parts = tag.split('{')
    l = []
    for i in parts:
        if '}' in i:
            l.append(i.split('}')[1])
        else:
            l.append(i)
    return ''.join(l)
tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)  # root tag with the namespace URI as prefix
print(xmlns(root.tag))  # root tag without the namespace prefix
Output:
{http://www.gnucash.org/XML/gnc}prodinfo
prodinfo
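The answer above only cleans the root tag. To read entries regardless of namespace, the same idea can be applied to every element in the tree after parsing. The sketch below is one possible approach, not the answerer's code; the inline `sample` string is a shortened stand-in for sample.xml. Be aware that elements with the same local name in different namespaces become indistinguishable once the prefixes are gone.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Rewrite tags and attribute names in place, dropping '{uri}' prefixes."""
    for elem in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
        # Copy the keys first because the dict is modified inside the loop.
        for key in list(elem.attrib):
            if "}" in key:
                elem.attrib[key.split("}", 1)[1]] = elem.attrib.pop(key)
    return root

# Shortened, hypothetical version of sample.xml.
sample = """<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc">
  <gnc:bill><gnc:billAccountNumber>1234</gnc:billAccountNumber></gnc:bill>
</gnc:prodinfo>"""

root = strip_namespaces(ET.fromstring(sample))
# Plain paths now work without any namespace map.
account = root.find("bill/billAccountNumber").text
print(root.tag, account)
```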

lxml not adding newlines when inserting a new element into existing xml

I have a large set of existing xml files, and I am trying to add one element to all of them (they are pom.xml for a number of maven projects, and I am trying to add a parent element to all of them). The following is my exact code.
The problem is that the final xml output in pom2.xml has the complete parent element in a single line. Though, when I print the element by itself, it writes it out in 4 lines as usual. How do I print out the complete xml with proper formatting for the parent element?
from lxml import etree
parentPom = etree.Element('parent')
groupId = etree.Element('groupId')
groupId.text = 'org.myorg'
parentPom.append(groupId)
artifactId = etree.Element('artifactId')
artifactId.text = 'myorg-master-pom'
parentPom.append(artifactId)
version = etree.Element('version')
version.text = '1.0.0'
parentPom.append(version)
print etree.tostring(parentPom, pretty_print=True)
pom = etree.parse("pom.xml")
projectElement = pom.getroot()
projectElement.insert(0, parentPom)
file = open("pom2.xml", 'wb')
file.write(etree.tostring(projectElement, pretty_print=True))
file.close()
Output of print:
<parent>
<groupId>org.myorg</groupId>
<artifactId>myorg-master-pom</artifactId>
<version>1.0.0</version>
</parent>
Output of same element in pom2.xml:
<parent><groupId>com.inmobi</groupId><artifactId>inmobi-master-pom</artifactId><version>1.0.1</version></parent><modelVersion>4.0.0</modelVersion>
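This might be of interest to you:
http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
In short, for future reference: parse the file with remove_blank_text=True so that the existing whitespace-only text nodes are dropped and pretty_print=True can re-indent the whole tree, including the inserted element: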
parser = etree.XMLParser(remove_blank_text=True)
pom = etree.parse("pom.xml",parser)
