How do I use a default namespace in an lxml xpath query? - python

I have an xml document in the following format:
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:gsa="http://schemas.google.com/gsa/2007">
...
<entry>
<id>https://ip.ad.dr.ess:8000/feeds/diagnostics/smb://ip.ad.dr.ess/path/to/file</id>
<updated>2011-11-07T21:32:39.795Z</updated>
<app:edited xmlns:app="http://purl.org/atom/app#">2011-11-07T21:32:39.795Z</app:edited>
<link rel="self" type="application/atom+xml" href="https://ip.ad.dr.ess:8000/feeds/diagnostics"/>
<link rel="edit" type="application/atom+xml" href="https://ip.ad.dr.ess:8000/feeds/diagnostics"/>
<gsa:content name="entryID">smb://ip.ad.dr.ess/path/to/directory</gsa:content>
<gsa:content name="numCrawledURLs">7</gsa:content>
<gsa:content name="numExcludedURLs">0</gsa:content>
<gsa:content name="type">DirectoryContentData</gsa:content>
<gsa:content name="numRetrievalErrors">0</gsa:content>
</entry>
<entry>
...
</entry>
...
</feed>
I need to retrieve all entry elements using xpath in lxml. My problem is that I can't figure out how to use an empty namespace. I have tried the following examples, but none work. Please advise.
import lxml.etree as et
tree=et.fromstring(xml)
The various things I have tried are:
for node in tree.xpath('//entry'):
or
namespaces = {None:"http://www.w3.org/2005/Atom" ,"openSearch":"http://a9.com/-/spec/opensearchrss/1.0/" ,"gsa":"http://schemas.google.com/gsa/2007"}
for node in tree.xpath('//entry', namespaces=ns):
or
for node in tree.xpath('//\"{http://www.w3.org/2005/Atom}entry\"'):
At this point I just don't know what to try. Any help is greatly appreciated.

Something like this should work:
import lxml.etree as et
ns = {"atom": "http://www.w3.org/2005/Atom"}
tree = et.fromstring(xml)
for node in tree.xpath('//atom:entry', namespaces=ns):
    print(node)
See also http://lxml.de/xpathxslt.html#namespaces-and-prefixes.
Alternative:
for node in tree.xpath("//*[local-name() = 'entry']"):
    print(node)

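Use the findall method with the namespace URI in Clark notation ({uri}tag). Written like this, the path matches only direct children of tree; prefix it with .// to search all descendants.
for item in tree.findall('{http://www.w3.org/2005/Atom}entry'):
    print(item)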
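As a minimal sketch of the same idea with prefixes instead of Clark notation: findall also accepts a prefix-to-URI mapping. The example below uses the standard library's xml.etree.ElementTree so it runs without lxml; the SAMPLE document and the NS mapping are assumptions made up for illustration, and the prefix names are arbitrary as long as the URIs match the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down feed modelled on the question's document.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gsa="http://schemas.google.com/gsa/2007">
  <entry><gsa:content name="numCrawledURLs">7</gsa:content></entry>
  <entry><gsa:content name="numCrawledURLs">3</gsa:content></entry>
</feed>"""

# Prefixes are local to this mapping; only the URIs must match the document.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "gsa": "http://schemas.google.com/gsa/2007",
}

root = ET.fromstring(SAMPLE)

# Direct children of the root, written with a prefix instead of {uri}tag.
entries = root.findall("atom:entry", NS)

# Descendants at any depth, filtered on an attribute value.
crawled = [
    int(c.text)
    for c in root.findall(".//gsa:content[@name='numCrawledURLs']", NS)
]

print(len(entries), crawled)
```

The default namespace still needs a prefix of your own choosing in the mapping, just as in the xpath answer above.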

Related

How to get a href attribute value in xml content (atom feed)?

I'm saving the content (atom feed / xml content) from a get request as content = response.text and the content looks like this:
<feed xmlns="http://www.w3.org/2005/Atom">
<title type="text">title-a</title>
<subtitle type="text">content: application/abc</subtitle>
<updated>2021-08-05T16:29:20.202Z</updated>
<id>tag:tag-a,2021-08:27445852</id>
<generator uri="uri-a" version="v-5.1.0.3846329218047">abc</generator>
<author>
<name>name-a</name>
<email>email-a</email>
</author>
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>
How can I get the value "url-b" of the href attribute with rel="next" ?
I tried it with the ElementTree module, for example:
import requests
from xml.etree import ElementTree
response = requests.get("myurl", headers={"Authorization": f"Bearer {my_access_token}"})
content = response.text
tree = ElementTree.fromstring(content)
tree.find('.//link[@rel="next"]')
# or
tree.find('./link').attrib['href']
but that didn't work.
I appreciate any help and thank you in advance.
If there is an easier, simpler solution (maybe feedparser) I welcome that too.
How can I get the value "url-b" of the href attribute with rel="next" ?
see below
from xml.etree import ElementTree as ET
xml = '''<feed xmlns="http://www.w3.org/2005/Atom">
<title type="text">title-a</title>
<subtitle type="text">content: application/abc</subtitle>
<updated>2021-08-05T16:29:20.202Z</updated>
<id>tag:tag-a,2021-08:27445852</id>
<generator uri="uri-a" version="v-5.1.0.3846329218047">abc</generator>
<author>
<name>name-a</name>
<email>email-a</email>
</author>
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>'''
root = ET.fromstring(xml)
links = root.findall('.//{http://www.w3.org/2005/Atom}link[@rel="next"]')
for link in links:
    print(link.attrib["href"])
output
url-b
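You can use this XPath 1.0 expression. It needs a full XPath engine such as lxml, because ElementTree's find does not support local-name():
./*[local-name()="feed"]/*[local-name()="link" and @rel="next"]/@href
This should result in "url-b".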
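If you want to stay with the standard library, a rough equivalent of the local-name() trick is the {*} namespace wildcard that ElementTree accepts in find/findall since Python 3.8. This is only a sketch: the helper name link_href is hypothetical, and FEED is a shortened copy of the question's document.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the feed from the question.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="url-a" rel="self"/>
  <link href="url-b" rel="next"/>
  <link href="url-c" rel="previous"/>
</feed>"""


def link_href(root, rel):
    """Return the href of the first child link with the given rel, or None."""
    # "{*}link" matches a link element in any namespace (Python 3.8+).
    link = root.find(f'{{*}}link[@rel="{rel}"]')
    return None if link is None else link.get("href")


root = ET.fromstring(FEED)
print(link_href(root, "next"))
print(link_href(root, "missing"))
```

Checking the result against None avoids an AttributeError when the feed has no link with the requested rel, for example on the last page of a paginated feed.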

python elementtree - how to find all elements in xml with certain attribute

I have XML with a structure similar to this:
<terms>
<entry ID="1">
<language ID="1">en</language>
<term>user</term>
<state>text</state>
<use>text</use>
<definition ID="1">text</definition>
<subdefinition ID="1">text</subdefinition>
<definition-source>text</definition-source>
<source ID="1">text</source>
<circle>text</circle>
</entry>
In this case, both parent and child elements carry an ID attribute. Is there a way to find all elements in the tree that have an ID attribute and either change its value to 0 or delete it?
I tried to do it with XPath, but it's difficult when the hierarchy is deep and any element can have this attribute.
Another way would be to handle the document as a string, but is there a way to do it in ElementTree?
It should be fairly easy using the XPath .//*[@ID] to select the elements.
Here's an example changing all ID values to 0...
XML Input (test.xml)
<terms>
<entry ID="1">
<language ID="1">en</language>
<term>user</term>
<state>text</state>
<use>text</use>
<definition ID="1">text</definition>
<subdefinition ID="1">text</subdefinition>
<definition-source>text</definition-source>
<source ID="1">text</source>
<circle>text</circle>
</entry>
</terms>
Python
import xml.etree.ElementTree as ET
tree = ET.parse("test.xml")
for elem in tree.findall(".//*[@ID]"):
    elem.attrib["ID"] = "0"
ET.dump(tree.getroot())
Output (dumped to console)
<terms>
<entry ID="0">
<language ID="0">en</language>
<term>user</term>
<state>text</state>
<use>text</use>
<definition ID="0">text</definition>
<subdefinition ID="0">text</subdefinition>
<definition-source>text</definition-source>
<source ID="0">text</source>
<circle>text</circle>
</entry>
</terms>
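The question also asks about deleting the attribute. A minimal sketch of that variant follows; the ID on the root element is an assumption added here to show one difference: .//*[@ID] only selects descendants of the context element, while iter() also visits the root itself.

```python
import xml.etree.ElementTree as ET

# Shortened input; the ID on <terms> is hypothetical, added for illustration.
XML = """<terms ID="9">
  <entry ID="1">
    <language ID="1">en</language>
    <term>user</term>
  </entry>
</terms>"""

root = ET.fromstring(XML)

# iter() walks the root and every descendant; pop() with a default
# silently skips elements that carry no ID attribute.
for elem in root.iter():
    elem.attrib.pop("ID", None)

result = ET.tostring(root, encoding="unicode")
print(result)
```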

Get XML value using ElementTree in Python

I have this XML file :
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="https://receasy1p1942606901trial.hanatrial.ondemand.com:443/rec/Accrual_PO.xsodata/"
xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
xmlns="http://www.w3.org/2005/Atom">
<title type="text">accruals_po</title>
<id>https://receasy1p1942606901trial.hanatrial.ondemand.com:443/rec/Accrual_PO.xsodata/accruals_po</id>
<author>
<name />
</author>
<link rel="self" title="accruals_po" href="accruals_po" />
<entry>
<id>https://receasy1p1942606901trial.hanatrial.ondemand.com:443/rec/Accrual_PO.xsodata/accruals_po('96372537-120')</id>
<title type="text"></title>
<author>
<name />
</author>
<link rel="edit" title="accruals_po" href="accruals_po('96372537-120')"/>
<category term="receasy.accruals_poType" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
<content type="application/xml">
<m:properties>
<d:PO_NUMBER m:type="Edm.String">96372537-120</d:PO_NUMBER>
<d:SAP_AMT m:type="Edm.Single">109</d:SAP_AMT>
<d:GL_ACCOUNT m:type="Edm.Int64">65009000</d:GL_ACCOUNT>
<d:COMPANY_CODE m:type="Edm.String">US10_OH</d:COMPANY_CODE>
<d:CONFIRMED_ACCRUAL_AMT m:type="Edm.Single">109</d:CONFIRMED_ACCRUAL_AMT>
<d:FINAL_APPROVER m:type="Edm.String">europe\bamcguir</d:FINAL_APPROVER>
<d:FINAL_GL_ACCOUNT m:type="Edm.Int64">65009000</d:FINAL_GL_ACCOUNT>
<d:FINAL_COMPANY_CODE m:type="Edm.String">US10_OH</d:FINAL_COMPANY_CODE>
<d:RECONCILIATION m:type="Edm.String">Successful</d:RECONCILIATION>
</m:properties>
</content>
</entry>
</feed>
I'm trying to get the values listed below; they are under the entry tag.
96372537-120
109
65009000
US10_OH
109
europe\bamcguir
65009000
US10_OH
Successful
This is the code I have as of now to get the values.
import urllib2
import xmltodict
import xml.etree.ElementTree as ET
import requests
tree = ET.parse('export.xml')
root = tree.getroot()
for child in root:
    print child.tag, child.attrib
    for child2 in child:
        print child2.tag, child2.attrib
        for child3 in child2:
            print child3.tag, child3.attrib
            for child4 in child3:
                print child4.tag, child4.attrib
                for child5 in child4:
                    print child5.tag, child5.attrib
This is part of the output that I get for PO_NUMBER.
{http://schemas.microsoft.com/ado/2007/08/dataservices}PO_NUMBER {'{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}type': 'Edm.String'}
I'm not able to get the value of PO_NUMBER which is 96372537-120. How do I get this value, and the other values as highlighted above?
In ElementTree, an element's (leading) text node is set on the text attribute. tag is the name of the XML tag (in Clark's notation) and attrib are the XML attributes only (also in Clark's notation).
So child5.text will give you the information you need.
Incidentally, you can use Clark's notation {namespace}tag with ElementTree's regular querying API to access the content or properties element directly, you don't have to iterate everything by hand:
tree.iter('{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}properties')
will give you an iterator on all the "properties" objects in the tree, and then you can just iterate on each property and get the corresponding child's text:
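for properties in tree.iter('{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}properties'):
    for child in properties:
        print(child.text)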
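Putting those pieces together, here is a sketch that turns each properties block into a dict keyed by the tag's local name. SAMPLE is a cut-down version of the question's file, and the local_name helper is a hypothetical name introduced for this example.

```python
import xml.etree.ElementTree as ET

D = "http://schemas.microsoft.com/ado/2007/08/dataservices"
M = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"

# Cut-down version of the feed from the question.
SAMPLE = f"""<feed xmlns="http://www.w3.org/2005/Atom" xmlns:d="{D}" xmlns:m="{M}">
  <entry><content type="application/xml"><m:properties>
    <d:PO_NUMBER m:type="Edm.String">96372537-120</d:PO_NUMBER>
    <d:SAP_AMT m:type="Edm.Single">109</d:SAP_AMT>
  </m:properties></content></entry>
</feed>"""


def local_name(tag):
    """Strip the {namespace} part from a Clark-notation tag."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(SAMPLE)

# One dict per <m:properties> element, mapping local tag name to text.
rows = [
    {local_name(child.tag): child.text for child in props}
    for props in root.iter(f"{{{M}}}properties")
]
print(rows)
```

All values come back as strings; converting them according to the m:type attribute is left out of this sketch.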
Note an oddity for mixed content (when an element can have both text and element children): in the ElementTree document model, only the text before the first child element is stored on .text. Text that follows a child element is stored as .tail on that preceding element, e.g.
<foo>
bar
<qux/>
baz
</foo>
will have foo.text == "bar" (ignoring the surrounding whitespace), but "baz" will be set on qux.tail.
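A short sketch of that behaviour, using the same foo/qux document. It also shows itertext(), which yields the .text and .tail pieces in document order and is handy when you want all the text regardless of where it is stored.

```python
import xml.etree.ElementTree as ET

foo = ET.fromstring("<foo>\n  bar\n  <qux/>\n  baz\n</foo>")
qux = foo.find("qux")

# The raw values keep the whitespace from the source document.
print(repr(foo.text))
print(repr(qux.tail))

# itertext() yields .text and .tail pieces in document order.
all_text = " ".join(piece.strip() for piece in foo.itertext() if piece.strip())
print(all_text)
```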

Python parsing Google Contacts xml

Here's a sample of my XML file I want to parse. How can I get gd:fullName and address from it? The problem is in some situations I have the name field and sometimes I don't. Any help?
I use
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for item in root.findall('./{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}title'):
    print(item.attrib)
but it just gives me empty dictionaries. And I haven't used the gdata library, so I want to extract the data myself. Here's my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<feed gd:etag="&quot;Rn84fzVSLyt7I2A9XRVbFkwOQAE.&quot;" xmlns="http://www.w3.org/2005/Atom" xmlns:batch="http://schemas.google.com/gdata/batch" xmlns:gContact="http://schemas.google.com/contact/2008" xmlns:gd="http://schemas.google.com/g/2005" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/">
<id>moha****ee@gmail.com</id>
<updated>2015-08-03T15:12:37.137Z</updated>
<category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/contact/2008#contact"/>
<title>Mohammad Amin's Contacts</title>
<link rel="alternate" type="text/html" href="https://www.google.com/"/>
<link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/mohamma***ee%40gmail.com/full"/>
<link rel="http://schemas.google.com/g/2005#post" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/mohamm***aee%40gmail.com/full"/>
<link rel="http://schemas.google.com/g/2005#batch" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/moha****ee%40gmail.com/full/batch"/>
<link rel="self" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/moham***ee%40gmail.com/full?max-results=25"/>
<link rel="next" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/moha****aee%40gmail.com/full?max-results=25&amp;start-index=26"/>
<author>
<name>Mohammad Amin</name>
<email>moha****ee@gmail.com</email>
</author>
<generator version="1.0" uri="http://www.google.com/m8/feeds">Contacts</generator>
<openSearch:totalResults>131</openSearch:totalResults>
<openSearch:startIndex>1</openSearch:startIndex>
<openSearch:itemsPerPage>25</openSearch:itemsPerPage>
<entry gd:etag="&quot;SXc5cTNQJit7I2A9XRRbGEsPQQY.&quot;">
<id>http://www.google.com/m8/feeds/contacts/moh***ee%40gmail.com/base/15281000e768a31</id>
<updated>2015-04-12T19:07:08.929Z</updated>
<app:edited xmlns:app="http://www.w3.org/2007/app">2015-04-12T19:07:08.929Z</app:edited>
<category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/contact/2008#contact"/>
<title>Sina Ghazi</title>
<link rel="http://schemas.google.com/contacts/2008/rel#photo" type="image/*" href="https://www.google.com/m8/feeds/photos/media/moh***aee%40gmail.com/15****a31" gd:etag="&quot;WR1-e34pSit7I2BlWW4TbChNHHg6LF88WhE.&quot;"/>
<link rel="self" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/moham****aee%40gmail.com/full/1528****8a31"/>
<link rel="edit" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/mohamm***ee%40gmail.com/full/15***a31"/>
<gd:name>
<gd:fullName>Si***i</gd:fullName>
<gd:givenName>Si***a</gd:givenName>
<gd:familyName>G***zi</gd:familyName>
</gd:name>
<gd:email rel="http://schemas.google.com/g/2005#home" address="si***i@gmail.com" primary="true"/>
<gContact:website href="http://www.google.com/profiles/1167****31" rel="profile"/>
</entry>
There are a few problems going on here.
You don't show all your code, but if root is the result of...
with open('data.xml') as fd:
doc = etree.parse(fd)
root = doc.getroot()
...then you're never going to find any entry elements:
>>> print root.findall('{http://www.w3.org/2005/Atom}entry')
[]
Because the root element doesn't contain any entry elements. You will find title elements:
>>> print root.findall('{http://www.w3.org/2005/Atom}title')
[<Element {http://www.w3.org/2005/Atom}title at 0x7ff8b3d166c8>, <Element {http://www.w3.org/2005/Atom}title at 0x7ff8b3d162d8>]
But looking at .attrib on these will yield empty dictionaries, because the title elements have no attributes:
<title>Si***azi</title>
If you want gd:fullName, you would need something like:
>>> root.findall('*/{http://schemas.google.com/g/2005}fullName')
[<Element {http://schemas.google.com/g/2005}fullName at 0x7ff8b04a83f8>]
Note that this is using a different namespace
(http://schemas.google.com/g/2005) from the default namespace
(http://www.w3.org/2005/Atom).
There is no address element that I can see, but you could use the
above to get the gd:email element and then extract the address
attribute.
Update
Consider instead using the xpath method and a namespaces dictionary:
>>> namespaces={'atom': 'http://www.w3.org/2005/Atom',
... 'gd': 'http://schemas.google.com/g/2005'}
>>> root.xpath('//atom:title', namespaces=namespaces)
[<Element {http://www.w3.org/2005/Atom}title at 0x7fd63aa30b90>, <Element {http://www.w3.org/2005/Atom}title at 0x7fd642b18ea8>]
>>> root.xpath('//gd:fullName', namespaces=namespaces)
[<Element {http://schemas.google.com/g/2005}fullName at 0x7fd63a5c5170>]
Or, using an explicit path instead of //:
>>> root.xpath('/atom:feed/atom:entry/atom:title', namespaces=namespaces)
[<Element {http://www.w3.org/2005/Atom}title at 0x7fd642b18ea8>]
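For the "sometimes there is no name" part of the question, one possible approach is to walk the entries and look up each field relative to the entry, so a missing gd:name yields None instead of shifting the results. This is a sketch using the standard library's ElementTree; it assumes the entries are direct children of the feed root, and SAMPLE with its names and addresses is invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "gd": "http://schemas.google.com/g/2005",
}

# Hypothetical two-entry feed: the second contact has no gd:name.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gd="http://schemas.google.com/g/2005">
  <entry>
    <gd:name><gd:fullName>Alice Example</gd:fullName></gd:name>
    <gd:email address="alice@example.com" primary="true"/>
  </entry>
  <entry>
    <gd:email address="bob@example.com"/>
  </entry>
</feed>"""

root = ET.fromstring(SAMPLE)
contacts = []
for entry in root.findall("atom:entry", NS):
    # findtext() returns the default instead of raising when the path is absent.
    name = entry.findtext("gd:name/gd:fullName", default=None, namespaces=NS)
    # The address lives in an attribute of gd:email, not in its text.
    emails = [e.get("address") for e in entry.findall("gd:email", NS)]
    contacts.append((name, emails))

print(contacts)
```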

How to remove all attributes of a tag

How can I remove all the attributes of a xml tag so I can get from this:
<xml blah blah blah> to just <xml>.
With lxml I know how to remove a whole element, but I couldn't find a way to remove only the attributes of a tag. (I found solutions on stackoverflow for C# but I want Python).
I am opening a gpx(xml) file and this is my code so far (based on How do I get the whole content between two xml tags in Python?):
from lxml import etree
t = etree.parse("1.gpx")
e = t.xpath('//trk')[0]
print(e.text + ''.join(map(etree.tostring, e))).strip()
Another approach I did was this:
from lxml import etree
TOPOGRAFIX_NS = './/{http://www.topografix.com/GPX/1/1}'
TRACKPOINT_NS = TOPOGRAFIX_NS + 'extensions/{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}TrackPointExtension/{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}'
doc1 = etree.parse("1.gpx")
for node1 in doc1.findall(TOPOGRAFIX_NS + 'trk'):
node_to_string1 = etree.tostring(node1)
print(node_to_string1)
But I get the trk tag with the TOPOGRAFIX_NS attributes, which I don't want, so what I want here is to remove the tag's attributes. I just want to get:
<trk> all the inside content </trk>
Thank you very much!
P.S. The content of the gpx file:
<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" creator="Endomondo.com" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/GpxExtensions/v3 http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd" xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<metadata>
<author>
<name>Blah Blah</name>
<email id="blah" domain="blah.com"/>
</author>
<link href="http://www.endomondo.com">
<text>Endomondo</text>
</link>
<time>2014-01-20T10:50:28Z</time>
</metadata>
<trk>
<name>Galati</name>
<src>http://www.endomondo.com/</src>
<link href="http://www.endomondo.com/workouts/260782567/13005122">
<text>Galati</text>
</link>
<type>MOUNTAIN_BIKING</type>
<trkseg>
<trkpt lat="45.431074" lon="28.021038">
<time>2013-10-20T05:49:04Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>
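One observation that may help: the extra "attributes" printed on trk are namespace declarations, which the serializer emits because trk lives in the GPX namespace. A possible way to get a bare <trk> is to rewrite the tags in a copy of the subtree to their local names before serializing. The sketch below uses the standard library's ElementTree rather than lxml, and GPX is a shortened stand-in for the file above.

```python
import copy
import xml.etree.ElementTree as ET

# Shortened stand-in for the GPX file from the question.
GPX = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><name>Galati</name><trkseg>
    <trkpt lat="45.431074" lon="28.021038"><time>2013-10-20T05:49:04Z</time></trkpt>
  </trkseg></trk>
</gpx>"""

root = ET.fromstring(GPX)

# Work on a copy so the original tree keeps its namespaces.
trk = copy.deepcopy(root.find("{http://www.topografix.com/GPX/1/1}trk"))

# Rewrite every tag in the copied subtree to its local name; the serializer
# then has no namespace left to declare on <trk>.
for elem in trk.iter():
    elem.tag = elem.tag.rsplit("}", 1)[-1]

trk.tail = None  # drop the whitespace that followed </trk> in the source
snippet = ET.tostring(trk, encoding="unicode")
print(snippet)
```

Real attributes such as lat and lon are kept; only the namespace information is dropped, so the result is no longer valid GPX on its own.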
