ExpatError: junk after document element xml python error - python

I have a project that needs to convert XML to a dict in Python. I am using the xmltodict library, but when I convert the XML to a dict it raises this error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/Users/deanchristianarmada/Desktop/projects/asian_gaming/radar/lib/python2.7/site-packages/xmltodict.py", line 311, in parse
parser.Parse(xml_input, True)
ExpatError: junk after document element: line 2, column 0
My code is:
import xmltodict
xml = '<row dataType="TR" ID="3B6B408870BA7AC3E05381010A0A5849" agentCode="690001001001001" transferId="G87_AGIN160901115820S441XB" tradeNo="160831287638239" platformType="AGIN" playerName="mubuuvu2" transferType="IN" transferAmount="28" previousAmount="0" currentAmount="28" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:16" gameCode="" />\r\n<row dataType="TR" ID="3B6B408870BB7AC3E05381010A0A5849" agentCode="690001001001001" transferId="160831231227456_Hunter_Out" tradeNo="160831287639025" platformType="AGIN" playerName="zxh123" transferType="OUT" transferAmount="-50" previousAmount="50" currentAmount="0" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:18" gameCode="" />\r\n<row dataType="TR" ID="3B6B408870BC7AC3E05381010A0A5849" agentCode="690001001001001" transferId="160831231227452_Hunter_In" tradeNo="160831287639507" platformType="AGIN" playerName="qqq19qq32b" transferType="IN" transferAmount="71" previousAmount="0" currentAmount="71" currency="CNY" exchangeRate="1" IP="0" flag="0" creationTime="2016-08-31 23:58:19" gameCode="" />\r\n'
_dict = xmltodict.parse(xml, attr_prefix="")
I can't seem to find a way to fix it. I'm not used to XML; I'm used to JSON.

If you wrap the XML string in an opening and a closing root tag, it should work:
import xmltodict
xml = 'xml string here'
xml = '<root>'+xml+'</root>'
_dict = xmltodict.parse(xml, attr_prefix="")
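Basically, the string is missing a single root element: an XML document must have exactly one top-level element, and this string contains three sibling <row> elements.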
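The same idea can be checked with the standard library alone. This is a minimal sketch, assuming shortened rows (the two-attribute rows and the parse_rows helper are illustrative, not the original payload or part of xmltodict). Without the wrapper the parser rejects the second top-level element; with it, each row is a child of the root.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative rows: several top-level elements and no single root.
fragment = (
    '<row playerName="mubuuvu2" transferAmount="28" />\r\n'
    '<row playerName="zxh123" transferAmount="-50" />\r\n'
)


def parse_rows(xml_fragment):
    """Wrap sibling elements in a root element and return their attributes."""
    root = ET.fromstring("<root>" + xml_fragment + "</root>")
    return [dict(row.attrib) for row in root.findall("row")]


# Parsing the bare fragment fails because it has more than one top-level element.
try:
    ET.fromstring(fragment)
    unwrapped_ok = True
except ET.ParseError:
    unwrapped_ok = False

rows = parse_rows(fragment)
print(unwrapped_ok)
print(rows)
```

The wrapper name "root" is arbitrary; any valid element name would do.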

Related

extract a specific tag from xml file using beautiful soup in python

I have an XML file (let's call it abc.xml) which looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<product name="XYZ" version="123"/>
<application-links>
<application-links>
<id>111111111111111</id>
<name>Link_1</name>
<primary>true</primary>
<type>applinks.ABC</type>
<display-url>http://ABC.displayURL</display-url>
<rpc-url>http://ABC.displayURL</rpc-url>
</application-links>
</application-links>
</properties>
My Python code looks like this:
f = open ('file.xml', 'r')
from bs4 import BeautifulSoup
soup = BeautifulSoup(f,'lxml')
print(soup.product)
for applinks in soup.application-links:
    print(applinks)
which prints the following
<product name="XYZ" version="123"></product>
Traceback (most recent call last):
File "parse.py", line 7, in <module>
for applinks in soup.application-links:
NameError: name 'links' is not defined
Please can you help me understand how to print lines whose tags include a dash/hyphen '-'?
The NameError arises because Python parses soup.application-links as the subtraction soup.application - links. I don't know whether BeautifulSoup is the best option here, but I suggest using the ElementTree module in Python, like so:
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('file.xml').getroot()
>>> for app in root.findall('*/application-links/'):
...     print(app.text)
111111111111111
Link_1
true
applinks.ABC
http://ABC.displayURL
http://ABC.displayURL
So, to print only the value inside the <name> tag, you can do this:
>>> for app in root.findall('*/application-links/name'):
...     print(app.text)
Link_1
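In ElementTree a hyphenated tag name is just a string inside the path, so the hyphen never reaches the Python parser. The sketch below assumes a trimmed copy of the file inlined as a string (the variable names are illustrative). With BeautifulSoup itself, the usual workaround is to pass the name as a string, for example soup.find('application-links'), rather than using attribute access.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the file from the question, inlined for illustration.
xml_text = """<properties>
  <product name="XYZ" version="123"/>
  <application-links>
    <application-links>
      <id>111111111111111</id>
      <name>Link_1</name>
      <primary>true</primary>
    </application-links>
  </application-links>
</properties>"""

root = ET.fromstring(xml_text)

# The hyphenated name is an ordinary string, so no attribute-style access is needed.
links = {}
for child in root.findall("application-links/application-links/*"):
    links[child.tag] = child.text

print(links)
```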

Parse large python xml using xmltree

I have a Python script that parses huge XML files (the largest one is 446 MB):
try:
    parser = etree.XMLParser(encoding='utf-8')
    tree = etree.parse(os.path.join(srcDir, fileName), parser)
    root = tree.getroot()
except Exception, e:
    print "Error parsing file "+str(fileName) + " Reason "+str(e.message)
for child in root:
    if "PersonName" in child.tag:
        personName = child.text
This is what my XML looks like:
<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
<Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
<Description>myData</Description>
<Identifier>43hhjh87n4nm</Identifier>
</Aliases>
<RollNo uom="kPa">39979172.201167159</RollNo>
<PersonName>Miracle Smith</PersonName>
<Date>2017-06-02T01:10:32-05:00</Date>
....
All I want is the contents of the PersonName tag. I don't care about the other tags.
Sadly, my files are huge and I keep getting these errors when I use the code above:
Error parsing file 2eb6d894-0775-e611.xml Reason unknown error, line 1, column 310915857
Error parsing file 2ecc18b5-ef41-e711-80f.xml Reason Extra content at the end of the document, line 1, column 3428182
Error parsing file 2f0d6926-b602-e711-80f4-005.xml Reason Extra content at the end of the document, line 1, column 6162118
Error parsing file 2f12636b-b2f5-e611-80f3-00.xml Reason Extra content at the end of the document, line 1, column 8014679
Error parsing file 2f14e35a-d22b-4504-8866-.xml Reason Extra content at the end of the document, line 1, column 8411238
Error parsing file 2f50c2eb-55c6-e611-80f0-005056a.xml Reason Extra content at the end of the document, line 1, column 7636614
Error parsing file 3a1a3806-b6af-e611-80ef-00505.xml Reason Extra content at the end of the document, line 1, column 11032486
My XML is perfectly fine and has no extra content. It seems that parsing large files causes the error.
I have looked at iterparse(), but it seems too complex for what I want to achieve, as it parses the whole DOM while I just want that one tag under the root. Also, I have not found a good sample that shows how to get the value by tag name.
Should I use a regex, or grep/awk, to do this? Or is there a tweak to my code that will let me get the person name from these huge files?
UPDATE:
I tried this sample and it seems to print everything in the XML except my tag.
Does iterparse read from the bottom to the top of the file? In that case it would take a long time to get to the top, i.e. my PersonName tag. I tried changing the events argument below to events=("end", "start") and it does the same thing!
path = []
for event, elem in ET.iterparse('D:\\mystage\\2-80ea-005056.xml', events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        print elem.text  # prints whole world
        if elem.tag == 'PersonName':
            print elem.text
        path.pop()
Iterparse is not that difficult to use in this case.
temp.xml is the file presented in your question with a </MyRoot> stuck on as a line at the end.
Think of the source = line as boilerplate, if you will: it parses the XML file and returns it element by element, indicating whether each chunk is the 'start' or the 'end' of an element and supplying information about that element.
In this case the code looks only at the 'start' events. We watch for 'PersonName' tags and pick up their text. Having found the one and only such item in the XML file, we abandon the processing. (The ElementTree documentation guarantees an element's text only once its 'end' event has been seen, so if the text comes back as None, test for 'end' instead.)
>>> from xml.etree import ElementTree
>>> source = iter(ElementTree.iterparse('temp.xml', events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...
'Miracle Smith'
Edit, in response to question in a comment:
Normally you wouldn't do this since iterparse is intended for use with large chunks of xml. However, by wrapping a string in a StringIO object it can be processed with iterparse.
>>> from xml.etree import ElementTree
>>> from io import StringIO
>>> xml = StringIO('''\
... <?xml version="1.0" encoding="utf-8"?>
... <MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" uuid="ertr" xmlns="http://www.example.org/yml/data/litsmlv2">
... <Aliases authority="OPP" xmlns="http://www.example.org/yml/data/commonv2">
... <Description>myData</Description>
... <Identifier>43hhjh87n4nm</Identifier>
... </Aliases>
... <RollNo uom="kPa">39979172.201167159</RollNo>
... <PersonName>Miracle Smith</PersonName>
... <Date>2017-06-02T01:10:32-05:00</Date>
... </MyRoot>''')
>>> source = iter(ElementTree.iterparse(xml, events=('start', 'end')))
>>> for an_event, an_element in source:
...     if an_event=='start' and an_element.tag.endswith('PersonName'):
...         an_element.text
...         break
...
'Miracle Smith'
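For very large files, a variant that waits for the 'end' event and discards elements it has already passed may be safer. The sketch below is one way to do that, assuming the tag of interest is called PersonName as in the question; the helper name first_person_name and the in-memory sample are illustrative.

```python
import io
from xml.etree import ElementTree


def first_person_name(source):
    """Return the text of the first PersonName element, or None if absent."""
    for _event, element in ElementTree.iterparse(source, events=("end",)):
        # Namespaced tags look like '{namespace-uri}PersonName'; keep the local part.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "PersonName":
            return element.text
        # Drop the content of elements we are not interested in to limit memory use.
        element.clear()
    return None


# Small in-memory stand-in for one of the large files.
sample = io.BytesIO(b'''<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns="http://www.example.org/yml/data/litsmlv2">
<RollNo uom="kPa">39979172.201167159</RollNo>
<PersonName>Miracle Smith</PersonName>
<Date>2017-06-02T01:10:32-05:00</Date>
</MyRoot>''')

name = first_person_name(sample)
print(name)
```

The function accepts either a file name or a file object, because iterparse does. It returns as soon as the tag is found, so the rest of the file is not read.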

Parsing an XML file that depends on tags which may or may not exist

I'm trying to parse an XML file using a tag which may or may not exist.
How can I avoid this IndexError without using an exception handler?
Python script:
#!/usr/bin/python3
from xml.dom import minidom
doc = minidom.parse("Data.xml")
persons = doc.getElementsByTagName("person")
for person in persons:
    print(person.getElementsByTagName("phone")[0].firstChild.data)
Data.xml :
<?xml version="1.0" encoding="UTF-8"?>
<obo>
<Persons>
<person>
<id>XA123</id>
<first_name>Adam</first_name>
<last_name>John</last_name>
<phone>01-12322222</phone>
</person>
<person>
<id>XA7777</id>
<first_name>Anna</first_name>
<last_name>Watson</last_name>
<relationship>
<type>Friends</type>
<to>XA123</to>
</relationship>
<!--<phone>01-12322222</phone>-->
</person>
</Persons>
</obo>
and I get an IndexError:
01-12322222
Traceback (most recent call last):
File "XML->Neo4j-try.py", line 29, in <module>
print(person.getElementsByTagName("phone")[0].firstChild.data)
IndexError: list index out of range
First, you need to check whether the current person has phone data, and proceed only if it does. It is also slightly better to store the result of getElementsByTagName() in a variable to avoid running the same query repeatedly, especially when the actual XML has a lot more content in each person element:
for person in persons:
    phones = person.getElementsByTagName("phone")
    if phones:
        print(phones[0].firstChild.data)
It raises the error because getElementsByTagName("phone") returns an empty list for any person without a phone, so indexing it with [0] fails. Check the list first:
from xml.dom import minidom
doc = minidom.parseString
doc = minidom.parse("Data.xml")
persons = doc.getElementsByTagName("person")
for person in persons:
    if person.getElementsByTagName("phone"):
        print(person.getElementsByTagName("phone")[0].firstChild.data)
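The check can also be folded into a small helper that returns a placeholder when the tag is missing. This is a sketch only: the helper name phone_numbers, its default parameter, and the trimmed inline data are assumptions made for illustration.

```python
from xml.dom import minidom

# Trimmed version of Data.xml; the second person has the phone commented out.
data = """<obo><Persons>
<person><id>XA123</id><phone>01-12322222</phone></person>
<person><id>XA7777</id><!--<phone>01-12322222</phone>--></person>
</Persons></obo>"""


def phone_numbers(xml_text, default=None):
    """Return one phone value per person, or `default` when the tag is absent."""
    doc = minidom.parseString(xml_text)
    numbers = []
    for person in doc.getElementsByTagName("person"):
        phones = person.getElementsByTagName("phone")
        # Guard both cases: no <phone> element, and an empty <phone/> element.
        if phones and phones[0].firstChild is not None:
            numbers.append(phones[0].firstChild.data)
        else:
            numbers.append(default)
    return numbers


numbers = phone_numbers(data)
print(numbers)
```

Checking firstChild as well avoids an AttributeError if a phone element exists but is empty.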

How to change element text of only one element with python elementtree

I have the following xml example file:
<Book>
<Location>page10</Location>
<Chapter>
<Location>page11</Location>
</Chapter>
</Book>
I want to change the text of the <Location> element directly beneath <Book>.
Using findall gives both Location elements.
Using find gives the first one, which could be right, but if the Chapter element is placed before Location then I get the wrong element.
Any suggestions?
Use a path. './Location' matches only Location elements that are direct children of the element you search from, so the nested one inside Chapter is skipped:
>>> import xml.etree.ElementTree as etree
>>> frag = '<Book><Chapter><Location>page11</Location></Chapter><Location>page10</Location></Book>'
>>> tree = etree.fromstring(frag)
>>> tree.findall('./Location')[0].text
'page10'
>>> tree.findall('./Location')[1].text
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
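To actually change the text, assign to the text attribute of the element the path returns. A minimal sketch, assuming the same fragment as above and a made-up new value of "page20":

```python
import xml.etree.ElementTree as ET

frag = "<Book><Chapter><Location>page11</Location></Chapter><Location>page10</Location></Book>"
root = ET.fromstring(frag)

# './Location' looks only at direct children of <Book>, not inside <Chapter>.
location = root.find("./Location")
if location is not None:
    location.text = "page20"  # hypothetical new value

result = ET.tostring(root, encoding="unicode")
print(result)
```

The Location inside Chapter keeps its original text, even though Chapter comes first in the document.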

Parse xsd with values [python]

I'm trying to examine and extract some data from an xml file using python. I'm doing this by parsing with etree then looping through the elements:
import xml.etree.ElementTree as etree
root = etree.fromstring(xml_string)
for element in root.iter():
    print("%s , %s , %s" % (element.tag, element.attrib, element.text))
This works fine for some test data, but the actual xml files that I'm working with seem to contain xsd tags along with the data. Below is an example
<wdtf:observationMember>
<wdtf:TimeSeriesObservation gml:id="ts1">
<gml:description>Reading using DTW (Depth To Water) from TOC</gml:description>
<gml:name codeSpace="http://www.bom.gov.au/std/water/xml/wio0.2/feature/TimeSeriesObservation/w00066/12/A/GroundWaterLevel/">1</gml:name>
<om:procedure xlink:href="#gwTOC12" />
<om:observedProperty xlink:href="http://www.bom.gov.au/std/water/xml/wio0.2/property//bom/GroundWaterLevel_m" />
<om:featureOfInterest xlink:href="http://www.bom.gov.au/std/water/xml/wio0.2/feature/BorePipeSamplingInterval/w00066/12" />
<wdtf:metadata>
<wdtf:TimeSeriesObservationMetadata>
<wdtf:regulationProperty>Reg200806.s3.2a</wdtf:regulationProperty>
<wdtf:status>validated</wdtf:status>
</wdtf:TimeSeriesObservationMetadata>
</wdtf:metadata>
<wdtf:result>
<wdtf:TimeSeries>
<wdtf:defaultInterpolationType>InstVal</wdtf:defaultInterpolationType>
<wdtf:defaultUnitsOfMeasure>m</wdtf:defaultUnitsOfMeasure>
<wdtf:defaultQuality>quality-A</wdtf:defaultQuality>
<wdtf:timeValuePair time="1915-12-09T12:00:00+10:00">51.82</wdtf:timeValuePair>
<wdtf:timeValuePair time="1917-12-18T12:00:00+10:00">41.38</wdtf:timeValuePair>
<wdtf:timeValuePair time="1924-05-23T12:00:00+10:00">21.95</wdtf:timeValuePair>
<wdtf:timeValuePair time="1988-02-02T12:00:00+10:00">7.56</wdtf:timeValuePair>
</wdtf:TimeSeries>
</wdtf:result>
</wdtf:TimeSeriesObservation>
</wdtf:observationMember>
Using this XML in the code above causes etree to return an error:
Traceback (most recent call last):
File "xml_test2.py", line 38, in <module>
root = etree.fromstring(xml_string)
File "<string>", line 124, in XML
ParseError: unbound prefix: line 1, column 4
Is there a different parser I should be using? Or can I remove the xsd tags somehow?
Thanks
From what I can see in your post, your parser is namespace-aware and is complaining that the XML namespace prefixes are not bound to a URI. Assuming that <wdtf:observationMember> is your topmost element, it needs at least the following declaration:
<wdtf:observationMember xmlns:wdtf="some-uri">
The same applies for all other prefixes, such as gml, om, etc.
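A short sketch of the difference, assuming placeholder namespace URIs (the example.org URIs below are hypothetical; the real ones come from the header of the original document). Once each prefix is declared, the fragment parses and every tag is reported as {uri}localname.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, used only for illustration.
WDTF = "http://example.org/wdtf"
GML = "http://example.org/gml"

# A prefixed element with no xmlns declaration in scope.
unbound = "<wdtf:status>validated</wdtf:status>"

# The same kind of content with both prefixes declared on the top element.
bound = (
    '<wdtf:observationMember xmlns:wdtf="%s" xmlns:gml="%s">'
    "<gml:name>1</gml:name>"
    "<wdtf:status>validated</wdtf:status>"
    "</wdtf:observationMember>" % (WDTF, GML)
)

try:
    ET.fromstring(unbound)
    error = None
except ET.ParseError as exc:
    error = str(exc)

root = ET.fromstring(bound)
# Searches need a prefix-to-URI mapping; the prefix used here is local to the query.
status = root.find("wdtf:status", {"wdtf": WDTF})

print(error)
print(root.tag)
print(status.text)
```

If the declarations were cut off when the sample was extracted from a larger file, copying them from the original root element onto the new top element is usually enough.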
