How to get a href attribute value in xml content (atom feed)? - python

I'm saving the content (atom feed / xml content) from a get request as content = response.text and the content looks like this:
<feed xmlns="http://www.w3.org/2005/Atom">
<title type="text">title-a</title>
<subtitle type="text">content: application/abc</subtitle>
<updated>2021-08-05T16:29:20.202Z</updated>
<id>tag:tag-a,2021-08:27445852</id>
<generator uri="uri-a" version="v-5.1.0.3846329218047">abc</generator>
<author>
<name>name-a</name>
<email>email-a</email>
</author>
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>
How can I get the value "url-b" of the href attribute with rel="next" ?
I tried it with the ElementTree module, for example:
import requests
from xml.etree import ElementTree
response = requests.get("myurl", headers={"Authorization": f"Bearer {my_access_token}"})
content = response.text
tree = ElementTree.fromstring(content)
tree.find('.//link[@rel="next"]')
# or
tree.find('./link').attrib['href']
but that didn't work.
I appreciate any help and thank you in advance.
If there is an easier, simpler solution (maybe feedparser) I welcome that too.

How can I get the value "url-b" of the href attribute with rel="next" ?
The link elements are in the Atom namespace, so the path has to include the namespace URI. See below:
from xml.etree import ElementTree as ET
xml = '''<feed xmlns="http://www.w3.org/2005/Atom">
<title type="text">title-a</title>
<subtitle type="text">content: application/abc</subtitle>
<updated>2021-08-05T16:29:20.202Z</updated>
<id>tag:tag-a,2021-08:27445852</id>
<generator uri="uri-a" version="v-5.1.0.3846329218047">abc</generator>
<author>
<name>name-a</name>
<email>email-a</email>
</author>
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>'''
root = ET.fromstring(xml)
links = root.findall('.//{http://www.w3.org/2005/Atom}link[@rel="next"]')
for link in links:
    print(link.attrib["href"])
Output:
url-b
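The same lookup can also be written with a prefix-to-URI mapping, which keeps the path readable. This is a minimal sketch, assuming a trimmed copy of the feed from the question; the `atom` prefix and the names `ATOM_NS`, `FEED` and `next_link_href` are made up for illustration:

```python
from xml.etree import ElementTree as ET

# Prefix chosen for this example; any prefix works as long as the URI matches.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Trimmed copy of the feed from the question.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
<title type="text">title-a</title>
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>"""


def next_link_href(xml_text):
    """Return the href of the <link rel="next"> child, or None if absent."""
    root = ET.fromstring(xml_text)
    link = root.find('atom:link[@rel="next"]', ATOM_NS)
    return link.get("href") if link is not None else None


print(next_link_href(FEED))
```

Returning None when there is no rel="next" link makes it easy to stop a paging loop on the last page.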

You can use this XPath-1.0 expression:
./*[local-name()="feed"]/*[local-name()="link" and @rel="next"]/@href
This should result in "url-b".
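Note that the standard library's ElementTree does not implement `local-name()`, so the expression above needs a fuller XPath engine such as lxml. A hedged stdlib alternative, assuming Python 3.8 or later, is the `{*}` namespace wildcard; the feed below is a trimmed copy of the one in the question and the variable names are illustrative:

```python
from xml.etree import ElementTree as ET

feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
<link href="url-a" rel="self"/>
<link href="url-b" rel="next"/>
<link href="url-c" rel="previous"/>
</feed>"""

root = ET.fromstring(feed_xml)

# "{*}link" matches a <link> child in any namespace (Python 3.8+).
next_link = root.find('{*}link[@rel="next"]')
next_href = next_link.get("href") if next_link is not None else None
print(next_href)
```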

Related

Parsing messy XML in Python

I'm super new to coding, and it would be awesome if someone could help me figure out how to parse an XML file.
I'm trying to write a Python script that reads all notes created in Gnome-Notes and displays them on the command line. I've got the part that loads the notes, but I can't figure out how to parse the XML so that it displays only the text. The sample data looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<note version="1" xmlns:link="http://projects.gnome.org/bijiben/link" xmlns:size="http://projects.gnome.org/bijiben/size" xmlns="http://projects.gnome.org/bijiben">
<title>Testnote</title>
<text xml:space="preserve"><html xmlns="http://www.w3.org/1999/xhtml"><head><link rel="stylesheet" href="Default.css" type="text/css" /><script language="javascript" src="bijiben.js"></script></head><body id="editable" style="color: white;">Some text for the note.</body></html></text>
<last-change-date>2021-04-01T20:03:08Z</last-change-date>
<last-metadata-change-date>2021-04-01T20:02:53Z</last-metadata-change-date>
<create-date>2021-03-29T10:37:14Z</create-date>
<cursor-position>0</cursor-position>
<selection-bound-position>0</selection-bound-position>
<width>0</width>
<height>0</height>
<x>0</x>
<y>0</y>
<color>rgb(0,0,0)</color>
<tags/>
<open-on-startup>False</open-on-startup>
After parsing, I should get only the "Some text for the note." part. I've been trying ElementTree for this. I have no issues with the "clean" XML files provided in samples, but I can't figure out what to do with this one.
This should be doable using ElementTree:
from xml.etree import ElementTree as ET
data = '''\
<?xml version="1.0" encoding="UTF-8"?>
<note version="1" xmlns:link="http://projects.gnome.org/bijiben/link" xmlns:size="http://projects.gnome.org/bijiben/size" xmlns="http://projects.gnome.org/bijiben">
<title>Testnote</title>
<text xml:space="preserve">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link rel="stylesheet" href="Default.css" type="text/css"/>
<script language="javascript" src="bijiben.js"/>
</head>
<body id="editable" style="color: white;">Some text for the note.</body>
</html>
</text>
<last-change-date>2021-04-01T20:03:08Z</last-change-date>
<last-metadata-change-date>2021-04-01T20:02:53Z</last-metadata-change-date>
<create-date>2021-03-29T10:37:14Z</create-date>
<cursor-position>0</cursor-position>
<selection-bound-position>0</selection-bound-position>
<width>0</width>
<height>0</height>
<x>0</x>
<y>0</y>
<color>rgb(0,0,0)</color>
<tags/>
<open-on-startup>False</open-on-startup>
</note>
'''
tree = ET.fromstring(data)
nmsp = {
    'xhtml': 'http://www.w3.org/1999/xhtml',
}  # namespace prefix assignment
print(tree.find('.//xhtml:body', namespaces=nmsp).text)
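If the note body ever contains nested markup (for example bold text), `.text` only returns the text before the first child element. Here is a minimal sketch that joins all text with `itertext()`, assuming such a note; the sample string and the `xhtml` prefix are made up for illustration:

```python
from xml.etree import ElementTree as ET

# Hypothetical note whose body contains a nested <b> element.
sample = (
    '<note xmlns="http://projects.gnome.org/bijiben">'
    '<text xml:space="preserve">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body id="editable">Some <b>text</b> for the note.</body>'
    '</html></text></note>'
)

ns = {"xhtml": "http://www.w3.org/1999/xhtml"}
body = ET.fromstring(sample).find(".//xhtml:body", ns)

# itertext() walks the body and yields every text fragment in document order.
note_text = "".join(body.itertext())
print(note_text)
```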
You can use regex to extract the string between the body tags:
<body.*>(.*)</body>
The first .* matches any character, zero or more times, to account for any attributes in the body tag.
(.*) captures anything between the tags.
import re
with open('file.xml', 'r') as file:
    data = file.read()
x = re.search(r"<body.*>(.*)</body>", data)
print(x.group(1))
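Because `.*` is greedy and `.` does not match newlines by default, the pattern above can over-match or miss a body that spans several lines. Here is a hedged variant, assuming a single body element per file; `BODY_RE` and `extract_body_text` are names made up for this sketch:

```python
import re

# [^>]* stays inside the opening tag; .*? stops at the first </body>;
# re.DOTALL lets the captured text span several lines.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.DOTALL)


def extract_body_text(data):
    """Return the text between <body ...> and </body>, or None if not found."""
    match = BODY_RE.search(data)
    return match.group(1) if match else None


sample = '<html><body id="editable" style="color: white;">Some text for the note.</body></html>'
print(extract_body_text(sample))
```

A regex is fine for a quick script, but it does not decode entities or handle nested markup, so an XML parser is usually the safer choice.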

Python parsing Google Contacts xml

Here's a sample of my XML file I want to parse. How can I get gd:fullName and address from it? The problem is in some situations I have the name field and sometimes I don't. Any help?
I use
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for item in root.findall('./{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}title'):
    print(item.attrib)
but it just gives me empty dictionaries. And I haven't used the gdata library, so I want to extract the data myself. Here's my XML file:
<?xml version="1.0" encoding="UTF-8"?>
<feed gd:etag="&quot;Rn84fzVSLyt7I2A9XRVbFkwOQAE.&quot;" xmlns="http://www.w3.org/2005/Atom" xmlns:batch="http://schemas.google.com/gdata/batch" xmlns:gContact="http://schemas.google.com/contact/2008" xmlns:gd="http://schemas.google.com/g/2005" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/">
<id>moha****ee@gmail.com</id>
<updated>2015-08-03T15:12:37.137Z</updated>
<category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/contact/2008#contact"/>
<title>Mohammad Amin's Contacts</title>
<link rel="alternate" type="text/html" href="https://www.google.com/"/>
<link rel="http://schemas.google.com/g/2005#feed" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/mohamma***ee%40gmail.com/full"/>
<link rel="http://schemas.google.com/g/2005#post" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/mohamm***aee%40gmail.com/full"/>
<link rel="http://schemas.google.com/g/2005#batch" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/moha****ee%40gmail.com/full/batch"/>
<link rel="self" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/moham***ee%40gmail.com/full?max-results=25"/>
<link rel="next" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/moha****aee%40gmail.com/full?max-results=25&amp;start-index=26"/>
<author>
<name>Mohammad Amin</name>
<email>moha****ee@gmail.com</email>
</author>
<generator version="1.0" uri="http://www.google.com/m8/feeds">Contacts</generator>
<openSearch:totalResults>131</openSearch:totalResults>
<openSearch:startIndex>1</openSearch:startIndex>
<openSearch:itemsPerPage>25</openSearch:itemsPerPage>
<entry gd:etag="&quot;SXc5cTNQJit7I2A9XRRbGEsPQQY.&quot;">
<id>http://www.google.com/m8/feeds/contacts/moh***ee%40gmail.com/base/15281000e768a31</id>
<updated>2015-04-12T19:07:08.929Z</updated>
<app:edited xmlns:app="http://www.w3.org/2007/app">2015-04-12T19:07:08.929Z</app:edited>
<category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/contact/2008#contact"/>
<title>Sina Ghazi</title>
<link rel="http://schemas.google.com/contacts/2008/rel#photo" type="image/*" href="https://www.google.com/m8/feeds/photos/media/moh***aee%40gmail.com/15****a31" gd:etag="&quot;WR1-e34pSit7I2BlWW4TbChNHHg6LF88WhE.&quot;"/>
<link rel="self" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/moham****aee%40gmail.com/full/1528****8a31"/>
<link rel="edit" type="application/atom+xml" href="https://www.google.com/m8/feeds/contacts/mohamm***ee%40gmail.com/full/15***a31"/>
<gd:name>
<gd:fullName>Si***i</gd:fullName>
<gd:givenName>Si***a</gd:givenName>
<gd:familyName>G***zi</gd:familyName>
</gd:name>
<gd:email rel="http://schemas.google.com/g/2005#home" address="si***i@gmail.com" primary="true"/>
<gContact:website href="http://www.google.com/profiles/1167****31" rel="profile"/>
</entry>
There are a few problems going on here.
You don't show all your code, but if root is the result of...
from lxml import etree
with open('data.xml') as fd:
    doc = etree.parse(fd)
root = doc.getroot()
...then you're never going to find any entry elements:
>>> print(root.findall('{http://www.w3.org/2005/Atom}entry'))
[]
Because the root element doesn't contain any entry elements. You will find title elements:
>>> print(root.findall('{http://www.w3.org/2005/Atom}title'))
[<Element {http://www.w3.org/2005/Atom}title at 0x7ff8b3d166c8>, <Element {http://www.w3.org/2005/Atom}title at 0x7ff8b3d162d8>]
But looking at .attrib on these will yield empty dictionaries, because the title elements have no attributes:
<title>Si***azi</title>
If you want gd:fullName, you would need something like:
>>> root.findall('*/{http://schemas.google.com/g/2005}fullName')
[<Element {http://schemas.google.com/g/2005}fullName at 0x7ff8b04a83f8>]
Note that this is using a different namespace
(http://schemas.google.com/g/2005) from the default namespace
(http://www.w3.org/2005/Atom).
There is no address element that I can see, but you could use the
above to get the gd:email element and then extract the address
attribute.
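Putting those pieces together, here is a minimal sketch using the standard library's ElementTree that reads the full name and the email addresses of each entry and tolerates a missing name. It assumes the entry elements are direct children of feed, as in the XML shown in the question; the sample feed is a trimmed, made-up stand-in, and `contacts_from_feed` is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "gd": "http://schemas.google.com/g/2005",
}

# Made-up sample: the second entry has no gd:name element.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gd="http://schemas.google.com/g/2005">
  <entry>
    <title>Example One</title>
    <gd:name><gd:fullName>Example One</gd:fullName></gd:name>
    <gd:email address="one@example.com" primary="true"/>
  </entry>
  <entry>
    <title/>
    <gd:email address="two@example.com"/>
  </entry>
</feed>"""


def contacts_from_feed(xml_text):
    """Return a list of (full_name_or_None, [addresses]), one per entry."""
    root = ET.fromstring(xml_text)
    contacts = []
    for entry in root.findall("atom:entry", NS):
        # findtext returns the default (None) when the element is missing.
        full_name = entry.findtext("gd:name/gd:fullName", default=None, namespaces=NS)
        addresses = [e.get("address") for e in entry.findall("gd:email", NS)]
        contacts.append((full_name, addresses))
    return contacts


for name, addresses in contacts_from_feed(SAMPLE):
    print(name, addresses)
```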
Update
Consider instead using the xpath method and a namespaces dictionary:
>>> namespaces={'atom': 'http://www.w3.org/2005/Atom',
... 'gd': 'http://schemas.google.com/g/2005'}
>>> root.xpath('//atom:title', namespaces=namespaces)
[<Element {http://www.w3.org/2005/Atom}title at 0x7fd63aa30b90>, <Element {http://www.w3.org/2005/Atom}title at 0x7fd642b18ea8>]
>>> root.xpath('//gd:fullName', namespaces=namespaces)
[<Element {http://schemas.google.com/g/2005}fullName at 0x7fd63a5c5170>]
Or, using an explicit path instead of //:
>>> root.xpath('/atom:feed/atom:entry/atom:title', namespaces=namespaces)
[<Element {http://www.w3.org/2005/Atom}title at 0x7fd642b18ea8>]

How to remove all attributes of a tag

How can I remove all the attributes of an XML tag so that I can get from this:
<xml blah blah blah> to just <xml>.
With lxml I know I can remove the whole element, but I didn't find a way to remove only the attributes of a tag. (I found solutions on Stack Overflow for C#, but I want Python.)
I am opening a GPX (XML) file and this is my code so far (based on "How do I get the whole content between two xml tags in Python?"):
from lxml import etree
t = etree.parse("1.gpx")
e = t.xpath('//trk')[0]
print(e.text + ''.join(map(etree.tostring, e))).strip()
Another approach I did was this:
from lxml import etree
TOPOGRAFIX_NS = './/{http://www.topografix.com/GPX/1/1}'
TRACKPOINT_NS = TOPOGRAFIX_NS + 'extensions/{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}TrackPointExtension/{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}'
doc1 = etree.parse("1.gpx")
for node1 in doc1.findall(TOPOGRAFIX_NS + 'trk'):
    node_to_string1 = etree.tostring(node1)
    print(node_to_string1)
But I get the trk tag with the TOPOGRAFIX_NS namespace attributes, which I don't want, and those are the attributes I want to remove. I just want to get:
<trk> all the inside content </trk>
Thank you very much!
P.S. The content of the gpx file:
<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1" creator="Endomondo.com" xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd http://www.garmin.com/xmlschemas/GpxExtensions/v3 http://www.garmin.com/xmlschemas/GpxExtensionsv3.xsd http://www.garmin.com/xmlschemas/TrackPointExtension/v1 http://www.garmin.com/xmlschemas/TrackPointExtensionv1.xsd" xmlns="http://www.topografix.com/GPX/1/1" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1" xmlns:gpxx="http://www.garmin.com/xmlschemas/GpxExtensions/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<metadata>
<author>
<name>Blah Blah</name>
<email id="blah" domain="blah.com"/>
</author>
<link href="http://www.endomondo.com">
<text>Endomondo</text>
</link>
<time>2014-01-20T10:50:28Z</time>
</metadata>
<trk>
<name>Galati</name>
<src>http://www.endomondo.com/</src>
<link href="http://www.endomondo.com/workouts/260782567/13005122">
<text>Galati</text>
</link>
<type>MOUNTAIN_BIKING</type>
<trkseg>
<trkpt lat="45.431074" lon="28.021038">
<time>2013-10-20T05:49:04Z</time>
</trkpt>
</trkseg>
</trk>
</gpx>

Python ElementTree find() using a wildcard?

I am parsing an XML feed in python to extract certain tags. My XML contains namespaces and this results in each tag containing a namespace followed by tag name.
Here is the xml:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:rte="http://www.rte.ie/schemas/vod">
<id>10038711/</id>
<updated>2013-01-24T22:52:43+00:00</updated>
<title type="text">Reeling in the Years</title>
<logo>http://www.rte.ie/iptv/images/logo.gif</logo>
<link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist?type=iptv&showId=10038711" />
<category term="feed"/>
<author>
<name>RTE</name>
<uri>http://www.rte.ie</uri>
</author>
<entry>
<id>10038711</id>
<published>2012-07-04T12:00:00+01:00</published>
<updated>2013-01-06T12:31:25+00:00</updated>
<title type="text">Reeling in the Years</title>
<content type="text">National and international events with popular music from the year 1989.First Broadcast: 08/11/1999</content>
<category term="WEB Exclusive" rte:type="channel"/>
<category term="Classics 1980" rte:type="genre"/>
<category term="rte player" rte:type="source"/>
<category term="" rte:type="transmision_details"/>
<category term="False" rte:type="copyprotectionoptout"/>
<category term="long" rte:type="form"/>
<category term="3275" rte:type="progid"/>
<link rel="site" type="text/html" href="http://www.rte.ie/tv50/"/>
<link rel="self" type="application/atom+xml" href="http://feeds.rasset.ie/rteavgen/player/playlist/?itemId=10038711&type=iptv&format=xml" />
<link rel="alternate" type="text/html" href="http://www.rte.ie/player/#v=10038711"/>
<rte:valid start="2012-07-23T15:56:04+01:00" end="2017-08-01T15:56:04+01:00"/>
<rte:duration ms="842205" formatted="0:10"/>
<rte:statistics views="19"/>
<rte:bri id="na"/>
<rte:channel id="13"/>
<rte:item id="10038711"/>
<media:title type="plain">Reeling in the Years</media:title>
<media:description type="plain">National and international events with popular music from the year 1989. First Broadcast: 08/11/1999</media:description>
<media:thumbnail url="http://img.rasset.ie/00062efc200.jpg" height="288" width="512" time="00:00:00+00:00"/>
<media:teaserimgref1x1 url="" time="00:00:00+00:00"/>
<media:rating scheme="http://www.rte.ie/schemes/vod">NA</media:rating>
<media:copyright>RTÉ</media:copyright>
<media:group rte:format="single">
<media:content url="http://vod.hds.rasset.ie/manifest/2012/0728/20120728_reelingint_cl10038711_10039316_260_.f4m" type="video/mp4" medium="video" expression="full" duration="842205" rte:format="content"/>
</media:group>
<rte:ads>
<media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&iu=%2F3014%2FP_RTE_TV50_Pre&ciu_szs=300x250&impl=s&gdfp_req=1&env=vp&output=xml_vast2&unviewed_position_start=1&url=[referrer_url]&correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
<media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&iu=%2F3014%2FP_RTE_TV50_Pre2&ciu_szs=300x250&impl=s&gdfp_req=1&env=vp&output=xml_vast2&unviewed_position_start=1&url=[referrer_url]&correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
<media:content url="http://pubads.g.doubleclick.net/gampad/ads?sz=512x288&iu=%2F3014%2FP_RTE_TV50_Pre3&ciu_szs=300x250&impl=s&gdfp_req=1&env=vp&output=xml_vast2&unviewed_position_start=1&url=[referrer_url]&correlator=[timestamp]" type="text/xml" medium="video" expression="full" rte:format="advertising" rte:cue="0" />
</rte:ads>
</entry>
<!-- playlist.xml -->
</feed>
When the XML is parsed, each element comes out as:
{http://www.w3.org/2005/Atom}id
{http://www.w3.org/2005/Atom}published
{http://www.w3.org/2005/Atom}updated
.....
.....
{http://www.rte.ie/schemas/vod}valid
{http://www.rte.ie/schemas/vod}duration
....
....
{http://search.yahoo.com/mrss/}description
{http://search.yahoo.com/mrss/}thumbnail
....
As I have 3 different namespaces and I cannot guarantee that they will always be the same, I would prefer not to hard-code each tag like so:
for elem in tree.iter('{http://www.w3.org/2005/Atom}entry'):
    stream = str(elem.find('{http://www.w3.org/2005/Atom}id').text)
    date_tmp = str(elem.find('{http://www.w3.org/2005/Atom}published').text)
    name_tmp = str(elem.find('{http://www.w3.org/2005/Atom}title').text)
    short_tmp = str(elem.find('{http://www.w3.org/2005/Atom}content').text)
    channel_tmp = elem.find('{http://www.w3.org/2005/Atom}category', "channel")
    channel = str(channel_tmp.get('term'))
    icon_tmp = elem.find('{http://search.yahoo.com/mrss/}thumbnail')
    icon_url = str(icon_tmp.get('url'))
Is there any way that I can put a wildcard or something similar into the find so it will simply ignore the namespace?
stream = str(elem.find('*id').text)
I can hard-code them as above, but it would be just my luck that the namespace changes down the line and my queries stop returning data.
Thanks for the help.
You can use an XPath expression with the local-name() function:
<?xml version="1.0"?>
<root xmlns="ns">
<tag/>
</root>
Assuming "doc" is the ElementTree for the above XML:
import lxml.etree
doc = lxml.etree.parse(<some_file_like_object>)
root = doc.getroot()
root.xpath('//*[local-name()="tag"]')
[<Element {ns}tag at 0x7fcde6f7c960>]
replacing <some_file_like_object> as appropriate (alternatively, you can use lxml.etree.fromstring with an XML string to get the root element directly).
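If lxml is not available, a similar effect can be had with the standard library by comparing each element's local name. This is a minimal sketch; `local_name` and `find_by_local_name` are hypothetical helpers, not part of ElementTree, and the sample document is made up:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rpartition("}")[2]


def find_by_local_name(root, name):
    """Return all elements under root (inclusive) whose local name equals name."""
    return [
        el for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) == name
    ]


# Made-up document with the same local name in two different namespaces.
root = ET.fromstring(
    '<root xmlns="ns"><tag/><other xmlns:x="other-ns"><x:tag/></other></root>'
)
matches = find_by_local_name(root, "tag")
print([el.tag for el in matches])
```

This ignores the namespace entirely, so it also matches same-named elements from unrelated namespaces; whether that is acceptable depends on the feed.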

How do I use a default namespace in an lxml xpath query?

I have an xml document in the following format:
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:gsa="http://schemas.google.com/gsa/2007">
...
<entry>
<id>https://ip.ad.dr.ess:8000/feeds/diagnostics/smb://ip.ad.dr.ess/path/to/file</id>
<updated>2011-11-07T21:32:39.795Z</updated>
<app:edited xmlns:app="http://purl.org/atom/app#">2011-11-07T21:32:39.795Z</app:edited>
<link rel="self" type="application/atom+xml" href="https://ip.ad.dr.ess:8000/feeds/diagnostics"/>
<link rel="edit" type="application/atom+xml" href="https://ip.ad.dr.ess:8000/feeds/diagnostics"/>
<gsa:content name="entryID">smb://ip.ad.dr.ess/path/to/directory</gsa:content>
<gsa:content name="numCrawledURLs">7</gsa:content>
<gsa:content name="numExcludedURLs">0</gsa:content>
<gsa:content name="type">DirectoryContentData</gsa:content>
<gsa:content name="numRetrievalErrors">0</gsa:content>
</entry>
<entry>
...
</entry>
...
</feed>
I need to retrieve all entry elements using xpath in lxml. My problem is that I can't figure out how to use an empty namespace. I have tried the following examples, but none work. Please advise.
import lxml.etree as et
tree=et.fromstring(xml)
The various things I have tried are:
for node in tree.xpath('//entry'):
or
namespaces = {None:"http://www.w3.org/2005/Atom" ,"openSearch":"http://a9.com/-/spec/opensearchrss/1.0/" ,"gsa":"http://schemas.google.com/gsa/2007"}
for node in tree.xpath('//entry', namespaces=ns):
or
for node in tree.xpath('//\"{http://www.w3.org/2005/Atom}entry\"'):
At this point I just don't know what to try. Any help is greatly appreciated.
Something like this should work:
import lxml.etree as et
ns = {"atom": "http://www.w3.org/2005/Atom"}
tree = et.fromstring(xml)
for node in tree.xpath('//atom:entry', namespaces=ns):
    print(node)
See also http://lxml.de/xpathxslt.html#namespaces-and-prefixes.
Alternative:
for node in tree.xpath("//*[local-name() = 'entry']"):
    print(node)
Use findall method.
for item in tree.findall('{http://www.w3.org/2005/Atom}entry'):
    print(item)
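Note that findall with a bare tag only searches the direct children of tree. With the standard library's xml.etree.ElementTree (not lxml's xpath()), Python 3.8 and later also accept an empty-string key in the namespaces mapping to stand for the default namespace. Here is a minimal sketch under that assumption, using a made-up, trimmed feed:

```python
import xml.etree.ElementTree as ET

ns = {
    "": "http://www.w3.org/2005/Atom",  # default namespace (Python 3.8+)
    "gsa": "http://schemas.google.com/gsa/2007",
}

# Made-up feed with two entries, loosely modelled on the question.
xml = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gsa="http://schemas.google.com/gsa/2007">
  <entry>
    <gsa:content name="type">DirectoryContentData</gsa:content>
  </entry>
  <entry>
    <gsa:content name="type">FileContentData</gsa:content>
  </entry>
</feed>"""

tree = ET.fromstring(xml)

# Unprefixed names in the path are resolved against the "" entry.
entries = tree.findall("entry", ns)
types = [e.findtext("gsa:content[@name='type']", namespaces=ns) for e in entries]
print(len(entries), types)
```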
