lxml XPath gobbles an element from the next record - python

I am using lxml to obtain names from each record via XPath. For some reason XPath always fetches the first name from the next record, even though I feed it only one record at a time. It also fetches the same name again when the next record is loaded. What am I doing wrong?
Example: parse the following sample.xml:
<?xml version="1.0" encoding="UTF-8"?>
<records>
<REC>
<name>Alpha</name>
<name>Beta</name>
<name>Gamma</name>
</REC>
<REC>
<name>Delta</name>
</REC>
</records>
Code:
#!/usr/bin/env python3
from lxml import etree
class Nam:
    XPATH = '/records/REC/name'
    def __init__(self):
        self.xp = etree.XPath(self.XPATH)
    def getvals(self, doc):
        for no, el in enumerate(self.xp(doc)):
            print("{} val: {} ".format(no, el.text))
        print()
def main():
    nam = Nam()
    context = etree.iterparse("sample.xml", events=('end',), tag='REC')
    for event, elem in context:
        print("Element: {}".format( etree.tostring(elem).decode()))
        nam.getvals(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
if __name__ == '__main__':
    main()
Output:
Element: <REC>
<name>Alpha</name>
<name>Beta</name>
<name>Gamma</name> </REC>
0 val: Alpha
1 val: Beta
2 val: Gamma
3 val: Delta
Element: <REC>
<name>Delta</name> </REC>
0 val: Delta
Thank you for your help.

When iterparse emits an event, that does not mean it has parsed the input only up to the current element. It reads the input file in fixed-size chunks as you iterate over it, so it may already have parsed well beyond that point.
There is therefore no guarantee about how much of the input XML has already been parsed. In a start event you should not access an element's content (other than its attributes), because it may not have been parsed yet, and in either start or end events you should not access any of the following siblings.
In this case your sample XML is very short, so it is parsed as a single chunk. Your XPath expression is rooted, so it always returns all matching elements of the document, regardless of the element you pass in.
Given that you only handle REC tags anyway, your XPath expression should probably be ./name instead.
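As a rough illustration of the relative-path idea, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made so the sketch has no third-party dependency), and the names_per_record helper is hypothetical. The point is only that a path starting with ./ is evaluated against the element you pass in, so each REC yields just its own names.

```python
import io
import xml.etree.ElementTree as ET

# The sample.xml from the question, inlined so the sketch is self-contained.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<records>
<REC>
<name>Alpha</name>
<name>Beta</name>
<name>Gamma</name>
</REC>
<REC>
<name>Delta</name>
</REC>
</records>
"""

def names_per_record(source):
    """Hypothetical helper: collect the <name> texts of each <REC> separately."""
    result = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "REC":
            continue
        # './name' is relative to elem, so only this record's children match.
        result.append([el.text for el in elem.findall("./name")])
        # Drop the record's children once handled to keep memory use low.
        elem.clear()
    return result

records = names_per_record(io.BytesIO(SAMPLE))
for no, names in enumerate(records):
    print(no, names)
```

In the original lxml code, the corresponding change is the one suggested above: compile './name' instead of '/records/REC/name'.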

Related

XML counting and printing elements

<?xml version="1.0" encoding="utf-8"?>
<export_full date="2022-03-15 07:01:30" version="20160107">
<items>
<item code="A1005" image="https://www.astramodel.cz/images/A/800x600/A1005.jpg" imageDate="2014-04-08" name="Uhlíková tyčka 0.6mm (1m)" brandId="32" brand="ASTRA" czk="89.00" eur="3.50" czksmap="89.00" eursmap="3.50" hasPrice="true" created="2014-01-09" changed="" new="false" stock="true" date="" stock2="true" date2="" stock3="high" date3="" discontinued="false" weight="0.001" length="0.001" width="0.001" height="1.000" recycling_fee="">
<descriptions>
<description title="Charakteristika" order="1"><p>Tyč z uhlíkových vláken kruhového průřezu ø0.6&nbsp;mm v délce 1&nbsp;m. Hmotnost 0,3&nbsp;g</p></description>
</descriptions>
</item>
I have an XML file which is significantly large. I am trying to count the total number of items and print the name attribute of each item; above you can see what an individual item with its tags looks like. I do get a number when printing the total item count, but I'm not sure I'm going about it the right way, and for the name attributes I am getting nothing so far. Please help.
import xml.etree.ElementTree as ET
tree = ET.parse('export_full.xml')
root = tree.getroot()
test = [elem.tag for elem in root.iter("item")]
print(len(test))
for item in root.iter('./item[@name]'):
    print(item.attrib)
To evaluate an XPath expression, use the findall() function. Note that the "item" elements are children of the "items" element, so you need to include 'items' in the path when spelling out the full path from the root; otherwise use ".//item[@name]".
for item in root.findall('./items/item[@name]'):
    print(item.attrib)
If you want to iterate over all items and collect the name attributes in a list:
items = [elem.get('name') for elem in root.iter("item")]
print(len(items), items) # print count of items and list of names
If the XML is huge, you can benefit from parsing it incrementally with the iterparse() function.
The example below iterates over the XML and, if the tag is 'item', prints its 'name' attribute. You can add whatever logic you want to check.
count = 0
for _, elem in ET.iterparse('export_full.xml'):
    if elem.tag == 'item':
        print(elem.get('name')) # print out just the name
        count += 1
        # print(elem.attrib) # print out all attributes
print(count) # display number of items
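For a very large file, the loop above still keeps every parsed element in memory. A common refinement, sketched here under the assumption that only the name attribute of each item is needed, is to clear each item once it has been handled. The count_items helper and the inline sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def count_items(source):
    """Hypothetical helper: return (count, names) for all <item> elements."""
    count = 0
    names = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            names.append(elem.get("name"))  # None if the attribute is missing
            count += 1
            elem.clear()  # discard the item's children, text and attributes
    return count, names

# Tiny made-up stand-in for export_full.xml
sample = io.BytesIO(
    b'<export_full><items>'
    b'<item code="A1" name="first"><descriptions/></item>'
    b'<item code="A2" name="second"/>'
    b'</items></export_full>'
)
count, names = count_items(sample)
print(count, names)
```

The attribute is read before elem.clear() is called, because clearing an element also removes its attributes.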

Parsing XML in Python using the cElementTree module

I have an XML file, which I wanted to convert to a dictionary. I have tried to write the following code but the output is not as expected. I have the following XML file named core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdfs/tmp</value>
<description>Temporary Directory.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.XXX.X.XXX:XXXX</value>
<description>Use HDFS as file storage engine</description>
</property>
</configuration>
The code that I wrote is:
import xml.etree.cElementTree
import xml.etree.ElementTree as ET
import warnings
warnings.filterwarnings("ignore")
class XmlListConfig(list):
    def __init__(self, aList):
        for element in aList:
            if element:
                # treat like dict
                if len(element) == 1 or element[0].tag != element[1].tag:
                    self.append(XmlDictConfig(element))
                # treat like list
                elif element[0].tag == element[1].tag:
                    self.append(XmlListConfig(element))
            elif element.text:
                text = element.text.strip()
                if text:
                    self.append(text)
class XmlDictConfig(dict):
    def __init__(self, parent_element):
        if parent_element.items():
            self.update(dict(parent_element.items()))
        for element in parent_element:
            if element:
                # treat like dict - we assume that if the first two tags
                # in a series are different, then they are all different.
                if len(element) == 1 or element[0].tag != element[1].tag:
                    aDict = XmlDictConfig(element)
                # treat like list - we assume that if the first two tags
                # in a series are the same, then the rest are the same.
                else:
                    # here, we put the list in dictionary; the key is the
                    # tag name the list elements all share in common, and
                    # the value is the list itself
                    aDict = {element[0].tag: XmlListConfig(element)}
                # if the tag has attributes, add those to the dict
                if element.items():
                    aDict.update(dict(element.items()))
                self.update({element.tag: aDict})
            # this assumes that if you've got an attribute in a tag,
            # you won't be having any text. This may or may not be a
            # good idea -- time will tell. It works for the way we are
            # currently doing XML configuration files...
            elif element.items():
                self.update({element.tag: dict(element.items())})
            # finally, if there are no child tags and no attributes, extract
            # the text
            else:
                self.update({element.tag: element.text})
tree = ET.parse('core-site.xml')
root = tree.getroot()
xmldict = XmlDictConfig(root)
print xmldict
This is the output that I am getting:
{
'property':
{
'name': 'fs.defaultFS',
'value': 'hdfs://192.X.X.X:XXXX',
'description': 'Use HDFS as file storage engine'
}
}
Why isn't the first property tag being shown? It only shows the data in the last property tag.
Since you are using a dict, the second element with the same key property replaces the first element previously recorded in the dict.
You have to use a different data structure, a list of dict for instance.
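A minimal sketch of the list-of-dicts approach, written for Python 3 and assuming every property contains only simple text children as in the sample file. The properties_as_list helper is a hypothetical name, and the XML is inlined so the sketch runs on its own.

```python
import xml.etree.ElementTree as ET

# The core-site.xml from the question, inlined for a self-contained example.
CORE_SITE = """<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdfs/tmp</value>
<description>Temporary Directory.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.XXX.X.XXX:XXXX</value>
<description>Use HDFS as file storage engine</description>
</property>
</configuration>"""

def properties_as_list(root):
    """Hypothetical helper: one dict per <property>, in document order."""
    return [
        {child.tag: child.text for child in prop}
        for prop in root.findall("property")
    ]

root = ET.fromstring(CORE_SITE)
props = properties_as_list(root)
print(props)

# Optional: index by the <name> value when names are unique (an assumption).
by_name = {p["name"]: p["value"] for p in props}
print(by_name)
```

Because each property gets its own dict inside a list, the repeated tag no longer overwrites an earlier entry.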

How to remove duplicate nodes xml Python

I have a special case: my XML file structure is something like:
<Root>
<parent1>
<parent2>
<element id="Something" />
</parent2>
</parent1>
<parent1>
<element id="Something" />
</parent1>
</Root>
My use case is to remove duplicated elements: I want to remove elements that have the same id. I tried the following code with no success (it is not finding the duplicate node):
import xml.etree.ElementTree as ET
path = 'old.xml'
tree = ET.parse(path)
root = tree.getroot()
prev = None
def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])
for page in root: # iterate over pages
    elems_to_remove = []
    for elem in page:
        for insideelem in page:
            if elements_equal(elem, insideelem) and elem != insideelem:
                print("found duplicate: %s" % insideelem.text) # equal function works well
                elems_to_remove.append(insideelem)
                continue
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
# [...]
tree.write("out.xml")
Can someone help me figure out how I can solve this? I am very new to Python, with almost zero experience.
First of all, what you're doing is a hard problem in the library you're using; see this question: How to remove a node inside an iterator in python xml.etree.ElemenTree
The solution to this would be to use lxml, which "implements the same API but with additional enhancements". Then you can apply the following fix.
You seem to be traversing only the second level of nodes in your XML tree. You get root, then walk the children of its children. This gives you parent2 from the first page and the element from the second page. Furthermore, you never compare across pages: your comparison only finds second-level duplicates within the same page.
Select the right set of elements using a proper traversal function such as iter:
# Use a `set` to keep track of "visited" elements with good lookup time.
visited = set()
# The iter method does a recursive traversal; list() takes a snapshot
# so that removing elements does not disturb the traversal.
for el in list(root.iter('element')):
    # Since the id is what defines a duplicate for you
    if 'id' in el.attrib:
        current = el.get('id')
        # In visited already means it's a duplicate, remove it
        if current in visited:
            el.getparent().remove(el)
        # Otherwise mark this ID as "visited"
        else:
            visited.add(current)
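If switching to lxml is not an option, the same idea can be sketched with the standard library. xml.etree.ElementTree elements have no getparent(), so this sketch builds a child-to-parent map first. The remove_duplicate_ids helper is hypothetical, and the sample is a well-formed variant of the XML in the question.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Root>
<parent1>
<parent2>
<element id="Something" />
</parent2>
</parent1>
<parent1>
<element id="Something" />
</parent1>
</Root>"""

def remove_duplicate_ids(root, tag="element"):
    """Hypothetical helper: keep the first element per id, remove later ones."""
    # ElementTree has no getparent(), so map every child to its parent up front.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    seen = set()
    removed = 0
    # Snapshot the matches so removals do not disturb the traversal.
    for el in list(root.iter(tag)):
        key = el.get("id")
        if key is None:
            continue
        if key in seen:
            parent_of[el].remove(el)
            removed += 1
        else:
            seen.add(key)
    return removed

root = ET.fromstring(SAMPLE)
removed = remove_duplicate_ids(root)
remaining = [el.get("id") for el in root.iter("element")]
print(removed, remaining)
```

Because the traversal is in document order, the copy that survives is the first one encountered, here the element nested under parent2.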

Filtering XML in Python

I need to write a filter to discard some elements, tags and blocks in my XML files. Below you can see my XML examples and expected outputs. I am somewhat confused about the differences between element, tag and attribute in ElementTree. My test does not work!
Filter:
import xml.etree.ElementTree as xee
def test(input):
    doc=xee.fromstring(input)
    print xee.tostring(doc)
    #RemoveTimeStampAttribute
    for elem in doc.findall('Component'):
        if 'timeStamp' in elem.attrib:
            del elem.attrib['timeStamp']
    #RemoveTimeStampElements
    for elem in doc.findall('TimeStamp'):
        del elem
    print xee.tostring(doc)
    return xee.tostring(doc)
First of all, you are removing the attribute incorrectly. Check whether timeStamp is in the element's attrib dictionary, then use del to remove it:
import xml.etree.ElementTree as xee
def amdfilter(input):
    doc = xee.fromstring(input)
    for node in doc.findall('Component'):
        if 'timeStamp' in node.attrib:
            del node.attrib['timeStamp']
    return xee.tostring(doc)
Also, since you are testing only the attribute removal here, change your expectation to:
expected = '<ComponentMain><Component /></ComponentMain>'
Complete test (it passes):
import unittest
from amdfilter import *
class FilterTest(unittest.TestCase):
    def testRemoveTimeStampAttribute(self):
        input = '<?xml version="1.0"?><ComponentMain><Component timeStamp="2014"></Component></ComponentMain>'
        output = amdfilter(input)
        expected = '<ComponentMain><Component /></ComponentMain>'
        self.assertEqual(expected, output)
Note that I don't care here about the xml declaration line (it could be easily added).
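The question's second loop, del elem, only unbinds the local variable and leaves the tree unchanged. To actually drop TimeStamp elements, they have to be removed from their parent. A minimal sketch, written for Python 3 and assuming the TimeStamp elements may sit at any depth (strip_timestamps is a hypothetical name, and the input string is made up):

```python
import xml.etree.ElementTree as ET

def strip_timestamps(xml_text):
    """Hypothetical helper: drop timeStamp attributes and <TimeStamp> elements."""
    doc = ET.fromstring(xml_text)
    # Snapshot the traversal so removing children does not disturb it.
    for parent in list(doc.iter()):
        # pop() with a default is a no-op when the attribute is absent.
        parent.attrib.pop("timeStamp", None)
        for child in list(parent):
            if child.tag == "TimeStamp":
                parent.remove(child)
    # encoding="unicode" makes tostring() return str instead of bytes.
    return ET.tostring(doc, encoding="unicode")

result = strip_timestamps(
    '<ComponentMain>'
    '<Component timeStamp="2014"><TimeStamp>12:00</TimeStamp></Component>'
    '</ComponentMain>'
)
print(result)
```

Note that remove() also discards any tail text that followed the removed element, which is acceptable here only because the sample has none.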

Turning ElementTree findall() into a list

I'm using ElementTree findall() to find elements in my XML which have a certain tag. I want to turn the result into a list. At the moment, I'm iterating through the elements, picking out the .text for each element, and appending to the list. I'm sure there's a more elegant way of doing this.
#!/usr/bin/python2.7
#
from xml.etree import ElementTree
import os
myXML = '''<root>
<project project_name="my_big_project">
<event name="my_first_event">
<location>London</location>
<location>Dublin</location>
<location>New York</location>
<month>January</month>
<year>2013</year>
</event>
</project>
</root>
'''
tree = ElementTree.fromstring(myXML)
for node in tree.findall('.//project'):
    for element in node.findall('event'):
        event_name=element.attrib.get('name')
        print event_name
        locations = []
        if element.find('location') is not None:
            for events in element.findall('location'):
                locations.append(events.text)
        # Could I use something like this instead?
        # locations.append(''.join.text(*events) for events in element.findall('location'))
        print locations
Outputs this (which is correct, but I'd like to assign the findall() results directly to a list, in text format, if possible):
my_first_event
['London', 'Dublin', 'New York']
You can try this - it uses a list comprehension to generate the list without having to create a blank one and then append.
if element.find('location') is not None:
    locations = [events.text for events in element.findall('location')]
With this, you can also get rid of the locations definition above, so your code would be:
tree = ElementTree.fromstring(myXML)
for node in tree.findall('.//project'):
    for element in node.findall('event'):
        event_name=element.attrib.get('name')
        print event_name
        if element.find('location') is not None:
            locations = [events.text for events in element.findall('location')]
        print locations
One thing you will want to be wary of is what you are doing with locations - it won't be defined if location doesn't exist, so you will get a NameError if you try to print it and it doesn't exist. If that is an issue, you can retain the locations = [] definition - if the matching element isn't found, the result will just be an empty list.
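Since findall() returns an empty list when nothing matches, the guard and the separate initialisation can both be dropped. A Python 3 sketch; the second event and the locations_by_event helper are made up for illustration.

```python
from xml.etree import ElementTree

MY_XML = """<root>
<project project_name="my_big_project">
<event name="my_first_event">
<location>London</location>
<location>Dublin</location>
<location>New York</location>
<month>January</month>
<year>2013</year>
</event>
<event name="event_without_locations">
<month>March</month>
</event>
</project>
</root>"""

def locations_by_event(tree):
    """Hypothetical helper: map each event name to its list of location texts."""
    return {
        # findall() yields [] for an event with no <location>, so no guard is needed.
        event.get("name"): [loc.text for loc in event.findall("location")]
        for event in tree.findall(".//project/event")
    }

tree = ElementTree.fromstring(MY_XML)
result = locations_by_event(tree)
print(result)
```

With this shape, locations is always defined for every event, so the NameError described above cannot occur.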
