Python parse XML has multiple root - python

I failed to parse an XML file (it is a GC history). A sample of the XML is shown below.
<?xml version="1.0" ?>
<verbosegc xmlns="http://www.ibm.com/j9/verbosegc" version="R28_jvm.28_20150612_0201_B252774_CMPRSS">
<initialized id="1" timestamp="2015-12-04T20:17:07.219">
<attribute name="gcPolicy" value="-Xgcpolicy:gencon" />
<attribute name="maxHeapSize" value="0x20000000" />
<attribute name="initialHeapSize" value="0x400000" />
</initialized>
<cycle-start id="4" type="scavenge" contextid="0" timestamp="2015-12-04T20:17:10.677" intervalms="3457.977" />
<gc-start id="5" type="scavenge" contextid="4" timestamp="2015-12-04T20:17:10.677">
<mem-info id="6" free="3037768" total="4194304" percent="72">
</mem-info>
</gc-start>
<gc-end id="8" type="scavenge" contextid="4" durationms="0.807" usertimems="0.000" systemtimems="0.000" timestamp="2015-12-04T20:17:10.678" activeThreads="2">
<mem-info id="9" free="3163968" total="4194304" percent="75">
</mem-info>
</gc-end>
<cycle-end id="10" type="scavenge" contextid="4" timestamp="2015-12-04T20:17:10.678" />
<cycle-start id="16" type="scavenge" contextid="0" timestamp="2015-12-04T20:17:10.742" intervalms="64.838" />
<gc-start id="17" type="scavenge" contextid="16" timestamp="2015-12-04T20:17:10.742">
<mem-info id="18" free="3037664" total="4194304" percent="72">
</mem-info>
</gc-start>
<gc-end id="20" type="scavenge" contextid="16" durationms="0.649" usertimems="0.000" systemtimems="0.000" timestamp="2015-12-04T20:17:10.743" activeThreads="2">
<mem-info id="21" free="3110592" total="4194304" percent="74">
</mem-info>
</gc-end>
<cycle-end id="22" type="scavenge" contextid="16" timestamp="2015-12-04T20:17:10.743" />
<allocation-satisfied id="23" threadId="0000000002E10500" bytesRequested="416" />
</verbosegc>
I want the free attribute of the mem-info elements inside gc-start and gc-end. Each gc-start/gc-end pair is enclosed by cycle-start and cycle-end tags and shares the same contextid. For example, the first two mem-info values are 3037768 and 3163968; their contextid is 4, which equals the id of the enclosing cycle-start. With this data I can plot the memory footprint.
The main problem is that I could not parse the XML successfully with the method from "XML parse python". getroot() works, but every other find/findall call returns nothing. Is there another solution for this? Thanks.
Here is what I tried:
>>> tree = ET.parse('gc.trace')
>>> tree
<xml.etree.ElementTree.ElementTree object at 0x7fdfaddc19d0>
>>> root=tree.getroot()
>>> root
<Element '{http://www.ibm.com/j9/verbosegc}verbosegc' at 0x7fdfaddc1a90>
>>> cycle_start = root.findall('cycle-start')
>>> cycle_start
[] ; Empty???
>>> cycle_start = root.findall('mem-info')
>>> print cycle_start
[] ;Empty???
>>>
>>> cycle_start = root.find('mem-info')
>>> cycle_start
>>> print cycle_start
None
>>> from lxml import etree
>>> tree = etree.parse("gc.log")
>>> root = tree.getroot()
>>> root.findall('mem-info', root.nsmap)
>>> root.nsmap
{None: 'http://www.ibm.com/j9/verbosegc'}

That's because your XML declares a default namespace here:
xmlns="http://www.ibm.com/j9/verbosegc"
Descendant elements inherit an ancestor's default namespace implicitly, so every element in this file belongs to that namespace and a bare name such as 'cycle-start' matches nothing. You can use a prefix-to-namespace mapping to select elements in the namespace, for example:
ns = {'d': 'http://www.ibm.com/j9/verbosegc'}
cycle_starts = root.findall('d:cycle-start', namespaces=ns)
print(cycle_starts)
mem_infos = root.findall('d:gc-start/d:mem-info', namespaces=ns)
print(mem_infos)
Output:
[<Element '{http://www.ibm.com/j9/verbosegc}cycle-start' at 0x29ae6a0>, <Element '{http://www.ibm.com/j9/verbosegc}cycle-start' at 0x29ae8d0>]
[<Element '{http://www.ibm.com/j9/verbosegc}mem-info' at 0x29ae780>, <Element '{http://www.ibm.com/j9/verbosegc}mem-info' at 0x29ae9b0>]
Update:
In response to your comment, here is one possible way to avoid hard-coding the namespace. It relies on lxml, whose elements expose nsmap (the standard library's ElementTree does not):
# map the default namespace URI to the prefix d without hard-coding it
ns = {'d': root.nsmap[None]}
result = root.findall('.//d:mem-info', namespaces=ns)
As an aside, I'd suggest using lxml's xpath() method instead of findall(), since the former provides better support for standard XPath 1.0 expressions, which will be useful in more complex situations.
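For the original goal of collecting the free values before and after each collection, a minimal sketch might look like the following. It is not part of the answer above: it assumes the standard library's xml.etree.ElementTree with the namespace URI written out, uses a trimmed copy of the sample from the question, and the helper name free_by_cycle is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question (two scavenge cycles).
SAMPLE = """<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
<cycle-start id="4" type="scavenge" contextid="0" />
<gc-start id="5" type="scavenge" contextid="4">
<mem-info id="6" free="3037768" total="4194304" percent="72">
</mem-info>
</gc-start>
<gc-end id="8" type="scavenge" contextid="4">
<mem-info id="9" free="3163968" total="4194304" percent="75">
</mem-info>
</gc-end>
<cycle-end id="10" type="scavenge" contextid="4" />
<cycle-start id="16" type="scavenge" contextid="0" />
<gc-start id="17" type="scavenge" contextid="16">
<mem-info id="18" free="3037664" total="4194304" percent="72">
</mem-info>
</gc-start>
<gc-end id="20" type="scavenge" contextid="16">
<mem-info id="21" free="3110592" total="4194304" percent="74">
</mem-info>
</gc-end>
<cycle-end id="22" type="scavenge" contextid="16" />
</verbosegc>"""

# Prefix 'd' is arbitrary; only the URI has to match the document.
NS = {"d": "http://www.ibm.com/j9/verbosegc"}


def free_by_cycle(root):
    """Map each cycle-start id to (free at gc-start, free at gc-end)."""
    result = {}
    for cycle in root.findall("d:cycle-start", NS):
        cid = cycle.get("id")
        # gc-start / gc-end carry the cycle id in their contextid attribute.
        before = root.find(f"d:gc-start[@contextid='{cid}']/d:mem-info", NS)
        after = root.find(f"d:gc-end[@contextid='{cid}']/d:mem-info", NS)
        if before is not None and after is not None:
            result[cid] = (int(before.get("free")), int(after.get("free")))
    return result


root = ET.fromstring(SAMPLE)
pairs = free_by_cycle(root)
print(pairs)
# {'4': (3037768, 3163968), '16': (3037664, 3110592)}
```

Cycles that lack either a gc-start or a gc-end are skipped here; whether that is the right choice depends on what the plot should show.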

Related

Python Beautiful Soup XML Conditional Query (mixed tags and attributes) Parsing

I have two versions of XML files that I need to extract content from. Both contain the same information in two different formats (not just different tags, but a different structure):
One version defines active and inactive elements between <activeelementSubstance></activeelementSubstance> and <inactiveelementSubstance></inactiveelementSubstance>.
The other version defines active and inactive elements between <element classCode="ACTIM"></element> and <element classCode="IACT"></element>.
How do I handle both cases to extract the inactive and active elements? (There are several examples where tags have synonyms, but I have not seen any where the actual structure differs as it does here.)
I came up with the following code (not very clean) to extract from the classCode version (extracting each element's name and code):
activeElements = soup.findAll('Element', attrs={'classCode': 'ACTIM'})
for i in activeElements:
    aiName = i.find('name')
    aiCode = str(i.find('code'))
    print(aiName.text)
    print(re.findall(r'"(.*?)"', aiCode)[0])
print('\nInactive Elements\n')
inactiveElements = soup.findAll('Element', attrs={'classCode': 'IACT'})
for i in inactiveElements:
    aiName = i.find('name')
    print(aiName.text)
    aiCode = i.find('code')['code']
    print(aiCode)
Examples of XML files are as follows:
First type (with the format <element classCode="IACT"></element> and <element classCode="ACTIM"></element>):
<?xml version="1.0" encoding="UTF-8"?>
<document>
<manufacturedProduct>
<element classCode="IACT">
<elementSubstance>
<code code="36SFW2JZ" codeSystem="33590coding"/>
<name>HYPROMELLOSE 2910 (15 MPA.S)</name>
</elementSubstance>
</element>
<element classCode="IACT">
<elementSubstance>
<code code="70097M6I" codeSystem="33590coding"/>
<name>MAGNESIUM STEARATE</name>
</elementSubstance>
</element>
<elementSubstance>
<code code="XHX3C3X6" codeSystem="33590coding"/>
<name>TRIACETIN</name>
</elementSubstance>
</element>
<element classCode="ACTIM">
<quantity>
<numerator unit="mg" value="250"/>
<denominator unit="1" value="1"/>
</quantity>
<elementSubstance>
<code code="JTE4MNN1" codeSystem="33590coding"/>
<name>AZITHROMYCIN MONOHYDRATE</name>
</elementSubstance>
</element>
</manufacturedProduct>
</document>
Second type (with the format <activeelementSubstance></activeelementSubstance> and <inactiveelementSubstance></inactiveelementSubstance>):
<?xml version="1.0" encoding="UTF-8"?>
<document>
<manufacturedProduct>
<activeelementSubstance>
<code code="VB0R961H" codeSystem="33590coding" codeSystemName="USDA" />
<name>Prednisone</name>
</activeelementSubstance>
</activeelement>
<inactiveelement>
<inactiveelementSubstance>
<code code="776XM704" codeSystem="33590coding" codeSystemName="USDA" />
<name>calcium stearate</name>
</inactiveelementSubstance>
</inactiveelement>
<inactiveelement>
<inactiveelementSubstance>
<name>corn starch</name>
</inactiveelementSubstance>
</inactiveelement>
</manufacturedProduct>
</document>
The XML files are deeply nested (this is partly why I am using Beautiful Soup); I tried to clean them up and extract only the relevant portion.
You can use a CSS selector list, with the alternatives separated by commas. For example:
from bs4 import BeautifulSoup
xml_doc1 = ...version 1 of the xml document...
xml_doc2 = ...version 2 of the xml document...
soup = BeautifulSoup(xml_doc1 + xml_doc2, "html.parser")
elem = soup.select(
    'element[classCode="IACT"], element[classCode="ACTIM"], activeelementSubstance, inactiveelementSubstance'
)
for e in elem:
    code = c["code"] if (c := e.code) else "-"
    name = n.text if (n := e.find("name")) else "-"
    print(code, name)
Prints:
36SFW2JZ HYPROMELLOSE 2910 (15 MPA.S)
70097M6I MAGNESIUM STEARATE
JTE4MNN1 AZITHROMYCIN MONOHYDRATE
VB0R961H Prednisone
776XM704 calcium stearate
- corn starch
Or split between active and inactive:
print("Active Elements")
elem = soup.select('element[classCode="ACTIM"], activeelementSubstance')
for e in elem:
    code = c["code"] if (c := e.code) else "-"
    name = n.text if (n := e.find("name")) else "-"
    print(code, name)
print("\nInactive Elements")
elem = soup.select('element[classCode="IACT"], inactiveelementSubstance')
for e in elem:
    code = c["code"] if (c := e.code) else "-"
    name = n.text if (n := e.find("name")) else "-"
    print(code, name)
You could use CSS OR syntax within a CSS selector list, then add conditional logic to decide where to assign the extracted results. The code below could do with a refactor to reduce nesting (for example, moving certain parts into their own functions), but it gives a starting point.
from bs4 import BeautifulSoup as bs
type_1 = '''
<?xml version="1.0" encoding="UTF-8"?>
<document>
<manufacturedProduct>
<element classCode="IACT">
<elementSubstance>
<code code="36SFW2JZ" codeSystem="33590coding"/>
<name>HYPROMELLOSE 2910 (15 MPA.S)</name>
</elementSubstance>
</element>
<element classCode="IACT">
<elementSubstance>
<code code="70097M6I" codeSystem="33590coding"/>
<name>MAGNESIUM STEARATE</name>
</elementSubstance>
</element>
<elementSubstance>
<code code="XHX3C3X6" codeSystem="33590coding"/>
<name>TRIACETIN</name>
</elementSubstance>
</element>
<element classCode="ACTIM">
<quantity>
<numerator unit="mg" value="250"/>
<denominator unit="1" value="1"/>
</quantity>
<elementSubstance>
<code code="JTE4MNN1" codeSystem="33590coding"/>
<name>AZITHROMYCIN MONOHYDRATE</name>
</elementSubstance>
</element>
</manufacturedProduct>
</document>'''
type_2 = '''
<?xml version="1.0" encoding="UTF-8"?>
<document>
<manufacturedProduct>
<activeelementSubstance>
<code code="VB0R961H" codeSystem="33590coding" codeSystemName="USDA" />
<name>Prednisone</name>
</activeelementSubstance>
</activeelement>
<inactiveelement>
<inactiveelementSubstance>
<code code="776XM704" codeSystem="33590coding" codeSystemName="USDA" />
<name>calcium stearate</name>
</inactiveelementSubstance>
</inactiveelement>
<inactiveelement>
<inactiveelementSubstance>
<name>corn starch</name>
</inactiveelementSubstance>
</inactiveelement>
</manufacturedProduct>
</document>'''
results = {'active': [], 'inactive': []}
for doc in [type_1, type_2]:
    soup = bs(doc, 'lxml')
    for i in soup.select('[classcode=IACT], [classcode=ACTIM], activeelement, inactiveelement'):
        if i.get('classcode') == 'IACT' or i.name == 'inactiveelement':
            try:
                d = {i.select_one('name').text: i.select_one('code')['code']}
            except TypeError:  # no <code> child: select_one() returned None
                d = {i.select_one('name').text: 'missing'}
            results['inactive'].append(d)
        else:
            results['active'].append({i.select_one('name').text: i.select_one('code')['code']})
print(results)

How do I access elements in an XML when multiple default namespaces are used?

I would expect this code to produce a non-empty list:
import xml.etree.ElementTree as et
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<A
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="a:namespace">
<B xmlns="b:namespace">
<C>"Stuff"</C>
</B>
</A>
'''
namespaces = {'a' : 'a:namespace', 'b' : 'b:namespace'}
xroot = et.fromstring(xml)
res = xroot.findall('b:C', namespaces)
Instead, res is an empty list. Why?
When I inspect the contents of xroot I can see that the C item is within b:namespace as expected:
for x in xroot.iter():
    print(x)
# result:
<Element '{a:namespace}A' at 0x7f56e13b95e8>
<Element '{b:namespace}B' at 0x7f56e188d2c8>
<Element '{b:namespace}C' at 0x7f56e188def8>
To check whether something was wrong with my namespacing, I also tried xroot.findall('{b:namespace}C'), but the result was an empty list as well.
Your findall path 'b:C' searches only the direct children of the root element, and C is a grandchild. Change it to './/b:C' so that the tag is found anywhere in the tree, e.g.:
import xml.etree.ElementTree as et
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<A
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="a:namespace">
<B xmlns="b:namespace">
<C>"Stuff"</C>
</B>
</A>
'''
namespaces = {'a' : 'a:namespace', 'b' : 'b:namespace'}
xroot = et.fromstring(xml)
######## changed the xpath to start with .//
res = xroot.findall('.//b:C', namespaces)
print( f"{res=}" )
for x in xroot.iter():
    print(x)
Output:
res=[<Element '{b:namespace}C' at 0x00000222DFCAAA40>]
<Element '{a:namespace}A' at 0x00000222DFCAA9A0>
<Element '{b:namespace}B' at 0x00000222DFCAA9F0>
<Element '{b:namespace}C' at 0x00000222DFCAAA40>
See the ElementTree documentation for some useful examples of its XPath support: https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=xpath#xpath-support
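To make the difference between the path forms concrete, here is a small sketch that runs three variants against the same document. The variable names are illustrative only, and the '{*}' wildcard assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

XML = """<A xmlns="a:namespace">
  <B xmlns="b:namespace">
    <C>"Stuff"</C>
  </B>
</A>"""

ns = {"a": "a:namespace", "b": "b:namespace"}
root = ET.fromstring(XML)

# A bare tag path only looks at direct children of root; C is a grandchild.
direct = root.findall("b:C", ns)

# Spelling out the full path from the root reaches it.
explicit = root.findall("b:B/b:C", ns)

# './/' searches all descendants; '{*}' matches the tag in any namespace.
anywhere = root.findall(".//{*}C")

print(len(direct), len(explicit), len(anywhere))
# 0 1 1
```

The explicit path is stricter than './/b:C': it only matches C elements that sit directly under a B child of the root, which can be preferable when the same tag name appears at several depths.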

Get attribute of first element using lxml

Trying to parse an XML file using lxml in Python, how do I simply get the value of an element's attribute? Example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<item id="123">
<sub>ABC</sub>
</item>
I'd like to get the result 123, and store it as a variable.
When using etree.parse(), simply call .getroot() to get the root element; the .attrib attribute is a dictionary of all attributes, use that to get the value:
>>> from lxml import etree
>>> tree = etree.parse('test.xml')
>>> tree.getroot().attrib['id']
'123'
If you used etree.fromstring() the object returned is the root object already, so no .getroot() call is needed:
>>> tree = etree.fromstring(b'''\
... <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
... <item id="123">
... <sub>ABC</sub>
... </item>
... ''')
>>> tree.attrib['id']
'123'
Alternatively, you could use an XPath selector:
>>> from lxml import etree
>>> tree = etree.fromstring(b'''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
... <item id="123">
... <sub>ABC</sub>
... </item>''')
>>> tree.xpath('/item/@id')
['123']
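If the attribute might be absent, .get() avoids the KeyError that .attrib[...] raises. The sketch below illustrates this with the standard library's xml.etree.ElementTree rather than lxml (lxml's Element offers the same .attrib and .get() interface); the lang attribute is a made-up example of a missing attribute.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<item id="123"><sub>ABC</sub></item>')

item_id = root.attrib["id"]        # raises KeyError if the attribute is absent
same_id = root.get("id")           # returns None if the attribute is absent
missing = root.get("lang", "n/a")  # explicit default for an absent attribute

print(item_id, same_id, missing)
# 123 123 n/a
```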
I think Martijn has answered your question. Building on his answer, you can also use the items() method to get a list of tuples with the attributes and values. This may be useful if you need the values of multiple attributes. Like so:
>>> from lxml import etree
>>> tree = etree.parse('test.xml')
>>> item = tree.xpath('/item')[0]
>>> item.items()
[('id', '123')]
Or in case of string:
>>> tree = etree.fromstring(b"""\
... <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
... <item id="123">
... <sub>ABC</sub>
... </item>
... """)
>>> tree.items()
[('id', '123')]

Store XML values as Python list

I have XML stored as a string "vincontents", formatted as such:
<response>
<data>
<vin>1FT7X2B69CEC76666</vin>
</data>
<data>
<vin>1GNDT13S452225555</vin>
</data>
</response>
I'm trying to use Python's elementtree library to parse out the VIN values into an array or Python list. I'm only interested in the values, not the tags.
def parseVins():
    content = etree.fromstring(vincontents)
    vins = content.findall("data/vin")
    print(vins)
Outputs all of the tag information:
[<Element 'vin' at 0x2d2eef0>, <Element 'vin' at 0x2d2efd0> ....
Any help would be appreciated. Thank you!
Use the .text attribute of each element:
>>> import xml.etree.ElementTree as etree
>>> data = """<response>
... <data>
... <vin>1FT7X2B69CEC76666</vin>
... </data>
... <data>
... <vin>1GNDT13S452225555</vin>
... </data>
... </response>"""
>>> tree = etree.fromstring(data)
>>> [el.text for el in tree.findall('.//data/vin')]
['1FT7X2B69CEC76666', '1GNDT13S452225555']
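If the result should come back from a function rather than be printed, one possible shape is sketched below. The function name parse_vins and the decision to skip empty <vin> elements are assumptions for illustration, not part of the question or the answer.

```python
import xml.etree.ElementTree as ET

VINCONTENTS = """<response>
<data>
<vin>1FT7X2B69CEC76666</vin>
</data>
<data>
<vin>1GNDT13S452225555</vin>
</data>
<data>
<vin></vin>
</data>
</response>"""


def parse_vins(xml_text):
    """Return the text of every non-empty <vin> element as a list of strings."""
    root = ET.fromstring(xml_text)
    # iter('vin') walks the whole tree, so the nesting depth does not matter.
    return [vin.text.strip() for vin in root.iter("vin")
            if vin.text and vin.text.strip()]


vins = parse_vins(VINCONTENTS)
print(vins)
# ['1FT7X2B69CEC76666', '1GNDT13S452225555']
```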

How do I get a list of all parent tags in BeautifulSoup?

Let's say I have a structure like this:
<folder name="folder1">
<folder name="folder2">
<bookmark href="link.html">
</folder>
</folder>
If I point to bookmark, what would be the command to just extract all of the folder lines?
For example,
bookmarks = soup.findAll('bookmark')
then beautifulsoupcommand(bookmarks[0]) would return:
[<folder name="folder1">,<folder name="folder2">]
I'd also want to know when the ending tags hit too. Any ideas?
Thanks in advance!
Here is my stab at it:
>>> from bs4 import BeautifulSoup
>>> html = """<folder name="folder1">
<folder name="folder2">
<bookmark href="link.html">
</folder>
</folder>
"""
>>> soup = BeautifulSoup(html)
>>> bookmarks = soup.find_all('bookmark')
>>> [p.get('name') for p in bookmarks[0].find_all_previous(name = 'folder')]
[u'folder2', u'folder1']
The key difference from @eumiro's answer is that I am using find_all_previous instead of find_parents. When I tested @eumiro's solution, I found that find_parents returned only the first (immediate) parent when the parent and grandparent have the same tag name. Note that find_all_previous matches every earlier folder tag in the document, not only ancestors.
>>> [p.get('name') for p in bookmarks[0].find_parents('folder')]
[u'folder2']
>>> [p.get('name') for p in bookmarks[0].find_parents()]
[u'folder2', None]
It does return two generations of parents if the parent and grandparent are differently named.
>>> html = """<folder name="folder1">
<folder_parent name="folder2">
<bookmark href="link.html">
</folder_parent>
</folder>
"""
>>> soup = BeautifulSoup(html)
>>> bookmarks = soup.find_all('bookmark')
>>> [p.get('name') for p in bookmarks[0].find_parents()]
[u'folder2', u'folder1', None]
bookmarks[0].findParents('folder') returns a list of all parent nodes named folder. You can then iterate over them and read their name attribute.
