extraction child text python with lxml

I'm trying to extract all the information related to the waypoints in my GPX (XML) file with the lxml library.
Here is a subset of my GPX file:
<?xml version="1.0"?>
<gpx
    version="1.0"
    creator="GPSBabel - http://www.gpsbabel.org"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://www.topografix.com/GPX/1/0"
    xsi:schemaLocation="http://www.topografix.com/GPX/1/0 http://www.topografix.com/GPX/1/0/gpx.xsd">
  <time>2006-01-23T02:00:28Z</time>
  <trk>
    <name>08-JAN-06 02</name>
    <trkseg>
      <trkpt lat="-33.903422356" lon="151.175565720">
        <ele>19.844360</ele>
        <time>2006-01-08T06:45:07Z</time>
      </trkpt>
    </trkseg>
  </trk>
</gpx>
I can get a point's latitude and longitude with:
node.get("lon") and node.get("lat")
But when I try to get the time with:
for element in root:
    if element.tag == "{http://www.topografix.com/GPX/1/0}time":
        time = str(element.text)
I end up with results like this:
(1.45,32.12,'')
The time value is blank. How can I solve this?

I'm assuming there is a </trk> and a </trkseg> tag that's supposed to be at the end of what you posted, or else this would be kind of malformed.
I'm going to write this out in a very verbose way. First, let's assume you've got an lxml object containing your xml-- we'll call it tree.
First define your namespace, if necessary:
ns = {'gpx': 'http://www.topografix.com/GPX/1/0'}
I like using XPath queries. If a query like tree.xpath('//trk') returns nothing (because the document declares a default namespace) or you get an undefined namespace prefix error, try again with a namespaces argument. You then have to prefix the element names in your XPath expressions with the key, like tree.xpath('//gpx:trk', namespaces=ns)
Now you want to get a list of all your trk objects:
trk_objects = tree.xpath('//gpx:trk', namespaces=ns)
This will return a list of them or an empty list if there are no trk tags.
Then iterate through them (I'm assuming there's only one trkseg tag per trk tag, and that you need to use the namespace):
for trk in trk_objects:
    # XPath queries always return a list
    # @lat and @lon select attributes, so the results are strings (no .text needed)
    lat_objects = trk.xpath('./gpx:trkseg/gpx:trkpt/@lat', namespaces=ns)
    if lat_objects:
        lat = lat_objects[0]
    lon_objects = trk.xpath('./gpx:trkseg/gpx:trkpt/@lon', namespaces=ns)
    if lon_objects:
        lon = lon_objects[0]
    # time is a child element of trkpt, so read its .text
    time_objects = trk.xpath('./gpx:trkseg/gpx:trkpt/gpx:time', namespaces=ns)
    if time_objects:
        time = time_objects[0].text
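To keep each point's latitude, longitude and time together, one option is to loop over the trkpt elements themselves rather than over trk. The following is a minimal sketch, assuming the structure of the sample file above; it uses findall() and findtext(), which lxml shares with the standard library, so the fallback import is only a convenience. The names GPX and points are made up for the example.

```python
try:
    from lxml import etree
except ImportError:
    # findall()/findtext()/get() behave the same way in the standard library
    import xml.etree.ElementTree as etree

# Trimmed copy of the sample file from the question
GPX = '''<gpx version="1.0" xmlns="http://www.topografix.com/GPX/1/0">
  <time>2006-01-23T02:00:28Z</time>
  <trk>
    <name>08-JAN-06 02</name>
    <trkseg>
      <trkpt lat="-33.903422356" lon="151.175565720">
        <ele>19.844360</ele>
        <time>2006-01-08T06:45:07Z</time>
      </trkpt>
    </trkseg>
  </trk>
</gpx>'''

ns = {'gpx': 'http://www.topografix.com/GPX/1/0'}
root = etree.fromstring(GPX)

points = []
for trkpt in root.findall('.//gpx:trkpt', ns):
    lat = float(trkpt.get('lat'))  # attributes are read with get()
    lon = float(trkpt.get('lon'))
    # time is a namespaced child element of trkpt; findtext() returns None if it is missing
    time = trkpt.findtext('gpx:time', namespaces=ns)
    points.append((lat, lon, time))

print(points)
```

Because every tuple is built from a single trkpt, a point with no time child simply gets None instead of picking up a value from another point.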

Related

XML counting and printing elements

<?xml version="1.0" encoding="utf-8"?>
<export_full date="2022-03-15 07:01:30" version="20160107">
<items>
<item code="A1005" image="https://www.astramodel.cz/images/A/800x600/A1005.jpg" imageDate="2014-04-08" name="Uhlíková tyčka 0.6mm (1m)" brandId="32" brand="ASTRA" czk="89.00" eur="3.50" czksmap="89.00" eursmap="3.50" hasPrice="true" created="2014-01-09" changed="" new="false" stock="true" date="" stock2="true" date2="" stock3="high" date3="" discontinued="false" weight="0.001" length="0.001" width="0.001" height="1.000" recycling_fee="">
<descriptions>
<description title="Charakteristika" order="1"><p>Tyč z uhlíkových vláken kruhového průřezu ø0.6&nbsp;mm v délce 1&nbsp;m. Hmotnost 0,3&nbsp;g</p></description>
</descriptions>
</item>
I have a fairly large XML file. I am trying to count the total number of items and print the name attribute of each item; above you can see what an individual item and its tags look like. I do get a number when printing the total item count, but I'm not sure I'm going about it the right way, and for the name attributes I am getting nothing so far. Please help.
import xml.etree.ElementTree as ET
tree = ET.parse('export_full.xml')
root = tree.getroot()
test = [elem.tag for elem in root.iter("item")]
print(len(test))
for item in root.iter('./item[@name]'):
    print(item.attrib)

To evaluate an XPath expression, use the findall() function; iter() only accepts a tag name. Note that the "item" elements are children of the "items" element, so you need to include 'items' if you write the full path from the root; otherwise use ".//item[@name]".
for item in root.findall('./items/item[@name]'):
    print(item.attrib)
If you want to iterate over all items and collect the name attributes in a list:
items = [elem.get('name') for elem in root.iter("item")]
print(len(items), items) # print count of items and list of names
If the XML is huge, you can benefit from parsing it incrementally with the iterparse() function.
The example below iterates over the XML and, whenever the tag is 'item', prints its 'name' attribute. You can add whatever other checks you need.
count = 0
for _, elem in ET.iterparse('export_full.xml'):
    if elem.tag == 'item':
        print(elem.get('name'))  # print out just the name
        count += 1
        # print(elem.attrib)  # print out all attributes
print(count)  # display number of items
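Note that iterparse() still builds the whole tree in memory unless processed elements are released. A minimal sketch of that pattern follows, assuming you only need the name attributes; the helper item_names and the two-item SAMPLE document are made up for the example, with io.BytesIO standing in for the real file.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for export_full.xml (illustrative data only)
SAMPLE = (b'<export_full><items>'
          b'<item code="A1" name="first"/>'
          b'<item code="A2" name="second"/>'
          b'</items></export_full>')

def item_names(source):
    """Collect the name attribute of every item, freeing each item once read."""
    names = []
    # 'end' events fire when an element has been parsed completely
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'item':
            names.append(elem.get('name'))
            elem.clear()  # drop the item's attributes and children to save memory
    return names

names = item_names(io.BytesIO(SAMPLE))
print(len(names), names)
```

With a real file you would pass the filename instead, for example item_names('export_full.xml').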

Get XPath to attribute

I want to get the actual XPath expression to an attribute node for a specific attribute in an xml element tree (using lxml).
Suppose the following XML tree.
<foo>
<bar attrib_name="hello_world"/>
</foo>
The XPath expression //@*[local-name() = "attrib_name"] produces ['hello_world'], the values of the matching attributes, and //@*[local-name() = "attrib_name"]/.. gets me the bar element, which is one level too high. I need the XPath expression for the specific attribute node itself, not its parent element. That is, given the string 'attrib_name', I want to generate '/foo/bar/@attrib_name'.
from lxml import etree
from io import StringIO
f = StringIO('<foo><bar attrib_name="hello_world"></bar></foo>')
tree = etree.parse(f)
print(tree.xpath('//@*[local-name() = "attrib_name"]'))
# --> ['hello_world']
print([tree.getpath(el) for el in tree.xpath('//@*[local-name() = "attrib_name"]/..')])
# --> ['/foo/bar']
As an add-on this should work with namespaces too.

If you remove the /.. then each result is an _ElementUnicodeResult, which knows its parent element and its attribute name.
This allows you to append the attribute name to the parent's XPath:
>>> print(['%s/@%s' % (tree.getpath(attrib_result.getparent()), attrib_result.attrname) for attrib_result in tree.xpath('//@*[local-name() = "attrib_name"]')])
['/foo/bar/@attrib_name']
Applying that to a namespaced attribute puts the full namespace URI into the path (which may not be what you want):
>>> tree = etree.parse(StringIO('<foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><bar xsi:attrib_name="hello_world"></bar></foo>'))
>>> print(['%s/@%s' % (tree.getpath(attrib_result.getparent()), attrib_result.attrname) for attrib_result in tree.xpath('//@*[local-name() = "attrib_name"]')])
['/foo/bar/@{http://www.w3.org/2001/XMLSchema-instance}attrib_name']
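If a prefixed step such as @xsi:attrib_name is preferred over the {uri} form, one possible approach is to look the prefix up in the parent element's nsmap. This is only a sketch: attribute_paths is a hypothetical helper, not part of lxml, and it assumes the first prefix bound to the URI is acceptable. The resulting path is only valid when evaluated with a matching namespaces mapping.

```python
from io import StringIO
from lxml import etree

def attribute_paths(tree, local_name):
    """Hypothetical helper: one path string per attribute with this local name."""
    paths = []
    # $name is an XPath variable, filled in from the keyword argument
    for result in tree.xpath('//@*[local-name() = $name]', name=local_name):
        parent = result.getparent()
        qname = etree.QName(result.attrname)  # splits '{uri}local' into its parts
        if qname.namespace is None:
            step = qname.localname
        else:
            # Prefixes in scope on the parent that are bound to the attribute's URI
            prefixes = [p for p, uri in parent.nsmap.items()
                        if p and uri == qname.namespace]
            # Fall back to the '{uri}local' form if no prefix is in scope
            step = '%s:%s' % (prefixes[0], qname.localname) if prefixes else result.attrname
        paths.append('%s/@%s' % (tree.getpath(parent), step))
    return paths

plain = etree.parse(StringIO('<foo><bar attrib_name="hello_world"/></foo>'))
namespaced = etree.parse(StringIO(
    '<foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    '<bar xsi:attrib_name="hello_world"/></foo>'))

print(attribute_paths(plain, 'attrib_name'))
print(attribute_paths(namespaced, 'attrib_name'))
```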

Python getting element value for specific element

I am attempting to parse the below XML file but having difficulty getting a specific element value. I am trying to specify element 'Item_No_2' to get the related value <v>2222222222</v> but am unable to do it using get.element('Item_No_2'). Am I using the get.element value incorrectly?
XML File:
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="Data.xsl"?>
<abc>
<md>
<mi>
<datetime>20160822020003</datetime>
<period>3600</period>
<it>Item_No_1</it>
<it>Item_No_2</it>
<it>Item_No_3</it>
<it>Item_No_4</it>
<it>Item_No_5</it>
<it>Item_No_6</it>
<it>Item_No_7</it>
<ovalue>
<v>1111111111</v>
<v>2222222222</v>
<v>3333333333</v>
<v>4444444444</v>
<v>5555555555</v>
<v>6666666666</v>
<v>7777777777</v>
</ovalue>
</mi>
</md>
</abc>
My Code:
from xml.etree.ElementTree import parse
doc = parse('test.xml').getroot()
for element in doc.findall('md/mi/'):
    print(element.text)
for element in doc.findall('md/mi/ovalue/'):
    print(element.text)
The current output gets them separately but I can't seem to understand how to call a specific element value.
Output:
20160822020003
3600
Item_No_1
Item_No_2
Item_No_3
Item_No_4
Item_No_5
Item_No_6
Item_No_7
1111111111
2222222222
3333333333
4444444444
5555555555
6666666666
7777777777
Tried this but did not work:
for element in doc.findall('md/mi/ovalue/'):
    print(element.get('Item_No_1'))

The elements found by doc.findall('md/mi/ovalue/') have no attribute called Item_No_1, and get() only looks up attributes, so it returns None.
I think what you may want to do is get both lists:
items = [e.text for e in doc.findall('md/mi/it')]
values = [e.text for e in doc.findall('md/mi/ovalue/v')]
Then find the index of the string 'Item_No_1' in items, and index into values with that number.
Alternatively, zip the two lists together and check each pair until you find the item you want.
for item, value in zip(doc.findall('md/mi/it'), doc.findall('md/mi/ovalue/v')):
    if item.text == 'Item_No_1':
        print(value.text)
There might be a better way, but those are the first ways that come to mind.
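If several items need to be looked up, the pairing can also be turned into a dictionary once. A small sketch, assuming the it and v elements always appear in matching order and number, as they do in the sample; the XML below is a shortened copy of it and the name lookup is made up for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question
XML = '''<abc><md><mi>
  <it>Item_No_1</it><it>Item_No_2</it><it>Item_No_3</it>
  <ovalue><v>1111111111</v><v>2222222222</v><v>3333333333</v></ovalue>
</mi></md></abc>'''

doc = ET.fromstring(XML)
items = [e.text for e in doc.findall('md/mi/it')]
values = [e.text for e in doc.findall('md/mi/ovalue/v')]

# Pair the n-th item name with the n-th value
lookup = dict(zip(items, values))
print(lookup['Item_No_2'])
```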

XPath - Return ALL nodes with certain string pattern

Here is a sample from the doc I am working with:
<idx:index xsi:schemaLocation="http://www.belscript.org/schema/index index.xsd" idx:belframework_version="2.0">
<idx:namespaces>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/entrez-gene-ids-hmr.belns"/>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/hgnc-approved-symbols.belns"/>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/mgi-approved-symbols.belns"/>
I can get all nodes with name "namespace" with the following code:
tree = etree.parse(self.old_files)
urls = tree.xpath('//*[local-name()="namespace"]')
This would return a list of the 3 namespace elements. But what if I want to get to the data in the idx:resourceLocation attribute? Here is my attempt at doing that, using the XPath docs as a guide.
urls = tree.xpath('//*[local-name()="namespace"]/@idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/"',
                  namespaces={'idx' : 'http://www.belscript.org/schema/index'})
What I want is all nodes that have an attribute starting with http://resource.belframework.org/belframework/1.0/namespace. So in the sample doc, it would return me only those strings in the resourceLocation attribute. Unfortunately, the syntax is not quite right, and I am having trouble deriving the proper syntax from the documentation. Thank you!

I think what you are looking for is:
//*[local-name()="namespace"]/@idx:resourceLocation
or
//idx:namespace/@idx:resourceLocation
or, if you want only those @idx:resourceLocation attributes that start with "http://resource.belframework.org/belframework/1.0/namespace", you could use
'''//idx:namespace[
    starts-with(@idx:resourceLocation,
    "http://resource.belframework.org/belframework/1.0/namespace")]
    /@idx:resourceLocation'''
import lxml.etree as ET
content = '''\
<root xmlns:xsi="http://www.xxx.com/zzz/yyy" xmlns:idx="http://www.belscript.org/schema/index">
<idx:index xsi:schemaLocation="http://www.belscript.org/schema/index index.xsd" idx:belframework_version="2.0">
<idx:namespaces>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/entrez-gene-ids-hmr.belns"/>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/hgnc-approved-symbols.belns"/>
<idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/mgi-approved-symbols.belns"/>
</idx:namespaces>
</idx:index>
</root>
'''
root = ET.XML(content)
namespaces = {'xsi': 'http://www.xxx.com/zzz/yyy',
              'idx': 'http://www.belscript.org/schema/index'}
for item in root.xpath(
        '//*[local-name()="namespace"]/@idx:resourceLocation', namespaces=namespaces):
    print(item)
yields
http://resource.belframework.org/belframework/1.0/namespace/entrez-gene-ids-hmr.belns
http://resource.belframework.org/belframework/1.0/namespace/hgnc-approved-symbols.belns
http://resource.belframework.org/belframework/1.0/namespace/mgi-approved-symbols.belns
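The starts-with() variant can be checked the same way. The sketch below assumes a reduced document in which one namespace element points somewhere else (the example.org URL is made up) so that the filter has something to exclude; the prefix is passed as an XPath variable rather than embedded in the expression.

```python
import lxml.etree as ET

# Reduced document; the second resourceLocation is an invented non-matching value
content = '''\
<root xmlns:idx="http://www.belscript.org/schema/index">
  <idx:namespace idx:resourceLocation="http://resource.belframework.org/belframework/1.0/namespace/hgnc-approved-symbols.belns"/>
  <idx:namespace idx:resourceLocation="http://example.org/other.belns"/>
</root>
'''
root = ET.XML(content)
namespaces = {'idx': 'http://www.belscript.org/schema/index'}
wanted = 'http://resource.belframework.org/belframework/1.0/namespace'

# $wanted is an XPath variable supplied through the keyword argument
urls = root.xpath(
    '//idx:namespace[starts-with(@idx:resourceLocation, $wanted)]'
    '/@idx:resourceLocation',
    namespaces=namespaces, wanted=wanted)

for url in urls:
    print(url)
```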

Turning ElementTree findall() into a list

I'm using ElementTree findall() to find elements in my XML which have a certain tag. I want to turn the result into a list. At the moment, I'm iterating through the elements, picking out the .text for each element, and appending to the list. I'm sure there's a more elegant way of doing this.
#!/usr/bin/python2.7
#
from xml.etree import ElementTree
import os
myXML = '''<root>
<project project_name="my_big_project">
<event name="my_first_event">
<location>London</location>
<location>Dublin</location>
<location>New York</location>
<month>January</month>
<year>2013</year>
</event>
</project>
</root>
'''
tree = ElementTree.fromstring(myXML)
for node in tree.findall('.//project'):
    for element in node.findall('event'):
        event_name = element.attrib.get('name')
        print(event_name)
        locations = []
        if element.find('location') is not None:
            for events in element.findall('location'):
                locations.append(events.text)
            # Could I use something like this instead?
            # locations.append(''.join.text(*events) for events in element.findall('location'))
        print(locations)
This outputs the following, which is correct, but I'd like to assign the findall() results directly to a list, in text format, if possible:
my_first_event
['London', 'Dublin', 'New York']

You can try this. It uses a list comprehension to generate the list without having to create an empty one and then append to it.
if element.find('location') is not None:
    locations = [events.text for events in element.findall('location')]
With this, you can also get rid of the locations definition above, so your code would be:
tree = ElementTree.fromstring(myXML)
for node in tree.findall('.//project'):
    for element in node.findall('event'):
        event_name = element.attrib.get('name')
        print(event_name)
        if element.find('location') is not None:
            locations = [events.text for events in element.findall('location')]
        print(locations)
One thing to be wary of is what you do with locations afterwards. It won't be defined if no location element exists, so you will get a NameError if you try to print it in that case (or, on a later event, a stale list from an earlier one). If that is an issue, you can retain the locations = [] definition; if no matching element is found, the result will just be an empty list.
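Since findall() already returns an empty list when nothing matches, the guard can also be dropped altogether. A short sketch, assuming an empty list is an acceptable result for an event with no locations; the second event and the name locations_by_event are made up to show that case.

```python
from xml.etree import ElementTree

# Variant of the sample with an extra, invented event that has no locations
XML = '''<root>
  <project project_name="my_big_project">
    <event name="my_first_event">
      <location>London</location>
      <location>Dublin</location>
    </event>
    <event name="event_without_locations">
      <month>January</month>
    </event>
  </project>
</root>'''

tree = ElementTree.fromstring(XML)
locations_by_event = {}
for event in tree.findall('.//project/event'):
    # No find() check needed: the comprehension yields [] when there are no matches
    locations_by_event[event.get('name')] = [loc.text for loc in event.findall('location')]

print(locations_by_event)
```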
