Trouble parsing XML archive using ElementTree - Python

Python + programming noob here, so you may have to bear with me. I have a number of XML files (RSS archives) and I want to extract news article URLs from them. I'm using Python 2.7.3 on Windows... and here's an example of the XML I'm looking at:
<feed xmlns:media="http://search.yahoo.com/mrss/" xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns:idx="urn:atom-extension:indexing" xmlns="http://www.w3.org/2005/Atom" idx:index="no" gr:dir="ltr">
<!--
Content-type: Preventing XSRF in IE.
-->
<generator uri="http://www.google.com/reader">Google Reader</generator>
<id>
tag:google.com,2005:reader/feed/http://feeds.smh.com.au/rssheadlines/national.xml
</id>
<title>The Sydney Morning Herald National Headlines</title>
<subtitle type="html">
The top National headlines from The Sydney Morning Herald. For all the news, visit http://www.smh.com.au.
</subtitle>
<gr:continuation>CJPL-LnHybcC</gr:continuation>
<link rel="self" href="http://www.google.com/reader/atom/feed/http://feeds.smh.com.au/rssheadlines/national.xml?n=1000&c=%5BC%5D"/>
<link rel="alternate" href="http://www.smh.com.au/national" type="text/html"/>
<updated>2013-06-16T07:55:56Z</updated>
<entry gr:is-read-state-locked="true" gr:crawl-timestamp-msec="1371369356359">
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
<category term="user/03956512242887934409/state/com.google/read" scheme="http://www.google.com/reader/" label="read"/>
<title type="html">Daley opts for Dugan for Origin two</title>
<published>2013-06-16T07:12:11Z</published>
<updated>2013-06-16T07:12:11Z</updated>
<link rel="alternate" href="http://rss.feedsportal.com/c/34697/f/644122/s/2d5973e2/l/0Lnews0Bsmh0N0Bau0Cbreaking0Enews0Esport0Cdaley0Eopts0Efor0Edugan0Efor0Eorigin0Etwo0E20A130A6160E2oc5k0Bhtml/story01.htm" type="text/html"/>
Specifically, I want to extract the "original-id" link:
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
I originally tried using BeautifulSoup for this but ran into problems, and from the research I did it looks like ElementTree is the way to go. First off, with ET I tried:
import xml.etree.ElementTree as ET
tree = ET.parse('thefile.xml')
root = tree.getroot()
#first_original_id = root[8][0]
parents_of_interest = root[8::]
for elem in parents_of_interest:
    print elem.items()[0][1]
So far as I can work out, parents_of_interest does grab the data I want (as a list of elements), but the for loop only prints a bunch of 'true' values, and after reading the documentation and SO it seems like this is the wrong approach.
I think this has the answer I'm looking for but even though it's a good explanation I can't seem to apply it to my own situation. From that answer I tried:
print tree.find('//{http://www.w3.org/2005/Atom}entry}id').text
But got the error:
__main__:1: FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version. If you rely
on the current behaviour, change it to './/{http://www.w3.org/2005/Atom}entry}id'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'text'
Any help on this would be appreciated... and sorry if that's a verbose question... but I thought I'd detail everything... just in case.

Your XPath expression matches the first id, not the one you're looking for, and original-id is an attribute of that element, so you should write something like this:
idelem = tree.find('./{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}id')
if idelem is not None:
    print idelem.get('{http://www.google.com/schemas/reader/atom/}original-id')
That will find only the first matching id; if you want them all, use findall and iterate over the results.
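For the findall version, here's a minimal sketch (assuming the same thefile.xml and the namespaces shown in the feed above):
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'
GR = '{http://www.google.com/schemas/reader/atom/}'

tree = ET.parse('thefile.xml')
# findall returns every <entry>'s <id>, not just the first match
for idelem in tree.findall('./' + ATOM + 'entry/' + ATOM + 'id'):
    original_id = idelem.get(GR + 'original-id')
    if original_id is not None:
        print original_id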

Related

Lxml Get all items but test the next one as well - Python

I'm having trouble trying to parse this XML with lxml. I'm using Python 3.6.9.
It is something like this.
<download date="22/05/2020 08:34">
<link url="http://xpto" document="y"/>
<link url="http://xpto" document="y"/>
<subjects number="2"><subject>Text explaining the previous link</subject><subject>Another text explaining the previous link</subject></subjects>
<link url="http://xpto" document="z"/>
<subjects number="1"><subject>Text explaining the previous link</subject></subjects>
<link url="http://xpto" document="y"/>
<link url="http://xpto" document="z"/>
</download>
Currently, I'm able to get all the links (which is easy to accomplish) using this snippet:
import requests
from lxml import html
response = html.fromstring(requests.post(url_post, data=data).content)
links = response.xpath('//link')
As shown in the XML above, the subjects elements, when they exist, explain the previous link. Sometimes there is more than one subject (in the example above, one subjects element has number="2", meaning it contains two subject items, while the other subjects has just one). It is a large XML file, so this pattern (a run of bare links, then one link with an explanation after it) occurs very often.
How can I build a query that gets all these links and, when a subjects element exists next to a link (after the link, to be precise), puts them together or inserts the subjects into the link as well?
My dream would be something like this:
<link url="http://xpto" document="y" subjects="Text explaining the previous link| Another text explaining the thing"/>
A list with both links and subjects together would help a lot as well.
[
[<link url="http://xpto" document="y"/>],
[<link url="http://xpto" document="y"/>, <subjects number="2"><subject>Text explaining the previous link</subject><subject>Another text explaining the previous link</subject></subjects>],
[<link url="http://xpto" document="y"/>],
]
Please feel free to suggest something different, of course.
Thank you, folks!
This does what I think you need:
from lxml import html
example = """
<link url="some_url" document="a"/>
<link url="some_url" document="b"/>
<subjects><subject>some text</subject></subjects>
<link url="some_url" document="c"/>
<link url="some_url" document="d"/>
<subjects><subject>some text</subject><subject>some more</subject></subjects>
"""
response = html.fromstring(example)
links = response.xpath('//link')
result = []
for link in links:
    result.append([link])
    next_element = link.getnext()
    if next_element is not None and next_element.tag == 'subjects':
        result[-1].append(next_element)
print(result)
Result:
[[<Element link at 0x1a0891e0d60>], [<Element link at 0x1a0891e0db0>, <Element subjects at 0x1a089096360>], [<Element link at 0x1a0891e0e00>], [<Element link at 0x1a0891e0e50>, <Element subjects at 0x1a0891e0d10>]]
Note that the lists still contain lxml Element objects; you can turn them into strings, of course, if you need to.
The key step is the next_element = link.getnext() line. For an lxml Element, the .getnext() method returns the next sibling in the document. So, although you're looping over link elements matched with .xpath(), link.getnext() will still get you a subjects element if that's the next sibling in the document. If there is no next element (i.e. for the last link, if it's not followed by a subjects), .getnext() will return None, which is why the following lines of code check for is not None.
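If you do need strings rather than Element objects, here's a minimal sketch using lxml's tostring on the result list built above:
from lxml import html

# Serialize each [link] or [link, subjects] group back to markup strings
as_strings = [
    [html.tostring(el, encoding='unicode') for el in group]
    for group in result
]
print(as_strings)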
This isn't the most elegant way of doing things, but it gets the job done...
subjects= """
<download date="22/05/2020 08:34">
<link url="http://xpto" document="y"/>
<link url="http://xpto" document="y"/>
<subjects number="2">
<subject>First Text explaining the previous link</subject>
<subject>Another text explaining the previous link</subject>
</subjects>
<link url="http://xpto2" document="z"/>
<subjects number="1"><subject>Second Text explaining the previous link</subject></subjects>
<link url="http://xpto3" document="y"/>
<link url="http://xpto4" document="z"/>
</download>
"""
# Note that I changed your XML a bit to emphasize the differences between nodes
import lxml.html as lh
import elementpath
doc = lh.fromstring(subjects)
elements = elementpath.select(doc, "//link[following-sibling::*[1][name()='subjects']]/concat('<link url=',./@url, ' document=xxx',@document,'xxx subjects=xxx',string-join(./following-sibling::subjects[1]//subject,' | '),'xxx/>')")
# I needed to use the xxx placeholder because I couldn't find a way to escape the double quote marks inside the expression, and this way is simple to implement
for element in elements:
    print(element.replace('xxx','"'))
Output:
<link url=http://xpto document="y" subjects="First Text explaining the previous link | Another text explaining the previous link"/>
<link url=http://xpto2 document="z" subjects="Second Text explaining the previous link"/>
I came up with this solution.
It is a little slower than @grismar's suggestion, but it achieves the insertion of the 'subjects' into the link. On the other hand, it saved me the need to loop through the list one more time to parse the '[[link, subjects],]' elements.
filteredData = response.xpath('//link | //subjects')  # get both link and subjects into one list, in document order
for i, item in enumerate(filteredData):
    if item.tag == 'subjects':
        filteredData[i-1].append(item)  # nest the subjects inside the preceding link
        filteredData.remove(item)
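A variant of the same idea that avoids removing items from the list while enumerating it (a minimal sketch, assuming the response object from the question):
merged = []
for item in response.xpath('//link | //subjects'):
    if item.tag == 'subjects':
        # Attach the subjects element to the link element just before it
        merged[-1].append(item)
    else:
        merged.append(item)
# merged now holds only link elements, each with any subjects nested inside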

BeautifulSoup4 not recognizing xml tag

I'm using BeautifulSoup4 (with lxml parser) to parse xml that looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<data>
<metadata id="8735180" name="Dauphin Island" lat="30.2500" lon="-88.0750"/>
<observations>
<wl t="2013-12-14 00:00" v="0.725" s="0.059" f="0,0,0,0" q="v" />
<wl t="2013-12-14 00:06" v="0.771" s="0.066" f="0,0,0,0" q="v" />
<wl t="2013-12-14 00:12" v="0.764" s="0.085" f="0,0,0,0" q="v" />
....etc
The Python code is like so:
obs_soup = BeautifulSoup(urllib2.urlopen('http://tidesandcurrents.noaa.gov/api/datagetter?product=water_level&application=NOS.COOPS.TAC.WL&begin_date=20131214&end_date=20131216&datum=MSL&station=8735180&time_zone=GMT&units=english&interval=&format=xml'), 'lxml')
for l in obs_soup.findall('wl'):
    obs.append(l['v'])
I keep getting the error:
for l in obs_soup.findall('wl'):
TypeError: 'NoneType' object is not callable
I tried the solution here (except instead of looking for 'html', I looked for 'data'), but that didn't work. Any suggestions?
There are two problems here.
First, there is no such method as findall in BeautifulSoup. Change that to:
for l in obs_soup.find_all('wl'):
    obs.append(l['v'])
… and it will work.
So, why are you getting this TypeError: 'NoneType' object is not callable instead of the more usual AttributeError? Because of BeautifulSoup's magic lookup: the same thing that lets you do obs_soup.wl as a shortcut for finding a <wl> tag also lets you do obs_soup.findall as a shortcut for finding a <findall> tag. Because there is no <findall> node, it returns None. And then you're trying to call that None object as a function, which of course is nonsense.
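Here's a minimal illustration of that magic lookup (assuming bs4 with lxml installed so the xml parser is available):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<data><wl v="0.725"/></data>', 'xml')
print soup.wl       # the <wl> element, found via magic lookup
print soup.findall  # None, because there is no <findall> tag in the document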
Also, if you had actually copied and pasted the code from there as you claimed, you wouldn't have had this problem. That code uses findAll, with a capital "A", which is a deprecated synonym for find_all. (You shouldn't use the deprecated synonyms, of course.)
Second, you're explicitly asking for lxml's HTML parser instead of its XML parser. Don't do that. See the docs:
BeautifulSoup(markup, ["lxml", "xml"])
Since you didn't give us a complete XML document, I don't know whether this will affect you, or whether you'll happen to get lucky. But you shouldn't rely on happening to get lucky when it's so easy to actually do things right.
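Putting the two fixes together, a minimal sketch (assuming Python 2's urllib2 and lxml installed so the xml parser is available):
import urllib2
from bs4 import BeautifulSoup

url = ('http://tidesandcurrents.noaa.gov/api/datagetter?product=water_level'
       '&application=NOS.COOPS.TAC.WL&begin_date=20131214&end_date=20131216'
       '&datum=MSL&station=8735180&time_zone=GMT&units=english&interval='
       '&format=xml')

# XML parser instead of the HTML one, and find_all instead of findall
obs_soup = BeautifulSoup(urllib2.urlopen(url), 'xml')
obs = [wl['v'] for wl in obs_soup.find_all('wl')]
print obs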

XML Python Choosing one of numerous attributes using ElementTree

As far as I know this question is not a repeat, as I have been searching for a solution for days now and simply cannot pin the problem down. I am attempting to print a nested attribute from an XML document tag using Python. I believe the error I am running into has to do with the fact that the tag from which I'm trying to get information has more than one attribute. Is there some way I can specify that I want the "status" value from the "second-tag" tag? Thank you so much for any help.
My XML document 'test.xml':
<?xml version="1.0" encoding="UTF-8"?>
<first-tag xmlns="http://somewebsite.com/" date-produced="20130703" lang="en" produced-by="steve" status="OFFLINE">
<second-tag country="US" id="3651653" lang="en" status="ONLINE">
</second-tag>
</first-tag>
My Python File:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
whatiwant = root.find('second-tag').get('status')
print whatiwant
Error:
AttributeError: 'NoneType' object has no attribute 'get'
You fail at .find('second-tag'), not on the .get.
For what you want, and your idiom, BeautifulSoup shines.
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(xml_string)
whatyouwant = soup.find('second-tag')['status']
I don't know how to do it with ElementTree, but I would do it with ehp (Easy HTML Parser). Here is the link:
http://easyhtmlparser.sourceforge.net/
A friend told me about this tool; I'm still learning it, but it's pretty good and simple.
from ehp import *
data = '''<?xml version="1.0" encoding="UTF-8"?>
<first-tag xmlns="http://somewebsite.com/" date-produced="20130703" lang="en" produced-by="steve" status="OFFLINE">
<second-tag country="US" id="3651653" lang="en" status="ONLINE">
</second-tag>
</first-tag>'''
html = Html()
dom = html.feed(data)
item = dom.fst('second-tag')
value = item.attr['status']
print value
The problem is that there is no tag named second-tag here. There's a tag named {http://somewebsite.com/}second-tag.
You can see this pretty easily:
>>> print(root.getchildren())
[<Element '{http://somewebsite.com/}second-tag' at 0x105b24190>]
A non-namespace-compliant XML parser might do the wrong thing and ignore that, making your code work. A parser that bends over backward to be friendly (like BeautifulSoup) will, in effect, automatically try {http://somewebsite.com/}second-tag when you ask for second-tag. But ElementTree is neither.
If that isn't all you need to know, you first need to read a tutorial on namespaces (maybe this one).
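For completeness, a minimal ElementTree sketch of the namespace-qualified lookup against the test.xml shown above:
import xml.etree.ElementTree as ET

NS = '{http://somewebsite.com/}'

tree = ET.parse('test.xml')
root = tree.getroot()
# The default xmlns on <first-tag> applies to <second-tag> too,
# so the search path must be namespace-qualified
whatiwant = root.find(NS + 'second-tag').get('status')
print whatiwant  # ONLINE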

Retrieving first urban dictionary result for a term in python

I have written a pretty simple code to get the first result for any term on urbandictionary.com. I started by writing a simple thing to see how their code is formatted.
import urllib

def parseudtest(searchurl):
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    url_info = urllib.urlopen(url)
    for lines in url_info:
        print lines
For a test, I searched for 'cats', and used that as the variable searchurl. The output I receive is of course a gigantic page, but here is the part I care about:
<meta content='He set us up the bomb. Also took all our base.' name='Description' />
<meta content='He set us up the bomb. Also took all our base.' property='og:description' />
<meta content='cats' property='og:title' />
<meta content="http://static3.urbandictionary.com/rel-1e0b481/images/og_image.png" property="og:image" />
<meta content='Urban Dictionary' property='og:site_name' />
As you can see, the first time the element "meta content" appears on the site, it is the first definition for the search term. So I wrote this code to retrieve it:
from xml.dom import minidom

def parseud(searchurl):
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    url_info = urllib.urlopen(url)
    if url_info:
        xmldoc = minidom.parse(url_info)
        if xmldoc:
            definition = xmldoc.getElementsByTagName('meta content')[0].firstChild.data
            print definition
For some reason the parsing doesn't seem to be working and invariably encounters an error every time. It is especially confusing since the site appears to use basically the same format as other sites I have successfully retrieved specific data from. If anyone could help me figure out what I am messing up here, it would be greatly appreciated.
As you don't give the traceback for the errors that occur, it's hard to be specific, but I assume that although the site claims to be XHTML, it's not actually valid XML. You'd be better off using Beautiful Soup, as it is designed for parsing HTML and will correctly handle broken markup.
I never used the minidom parser, but I think the problem is that you call:
xmldoc.getElementsByTagName('meta content')
while the tag name is meta; content is just the first attribute (as shown pretty well by the highlighting of your HTML code).
Try to replace that bit with:
xmldoc.getElementsByTagName('meta')
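If the page does happen to parse as XML, here is a minimal sketch of that correction; filtering on the og:description property instead of blindly taking the first <meta> is an extra assumption based on the snippet shown in the question:
import urllib
from xml.dom import minidom

def parseud(searchurl):
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    xmldoc = minidom.parse(urllib.urlopen(url))
    for meta in xmldoc.getElementsByTagName('meta'):
        # The definition lives in the content attribute, not in a child text node
        if meta.getAttribute('property') == 'og:description':
            return meta.getAttribute('content')

print parseud('cats')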

Python get ID from XML data

I am a total python newb and am trying to parse an XML document that is being returned from google as a result of a post request.
The document returned looks like the one outlined in this doc
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#Archives
where it says 'The response contains information about the archive.'
The only part I am interested in is the Id attribute right near the beginning. There will only ever be 1 entry, and 1 id attribute. How can I extract it to be used later? I've been fighting with this for a while and I feel like I've tried everything from minidom to ElementTree. No matter what I do my search comes back blank, loops don't iterate, or methods are missing. Any assistance is much appreciated. Thank you.
I would highly recommend the Python package BeautifulSoup. It is awesome. Here is a simple example using their example data (assuming you've installed BeautifulSoup already):
from BeautifulSoup import BeautifulSoup
data = """<?xml version='1.0' encoding='utf-8'?>
<entry xmlns='http://www.w3.org/2005/Atom'
xmlns:docs='http://schemas.google.com/docs/2007'
xmlns:gd='http://schemas.google.com/g/2005'>
<id>
https://docs.google.com/feeds/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA</id>
<published>2010-11-18T18:34:06.981Z</published>
<updated>2010-11-18T18:34:07.763Z</updated>
<app:edited xmlns:app='http://www.w3.org/2007/app'>
2010-11-18T18:34:07.763Z</app:edited>
<category scheme='http://schemas.google.com/g/2005#kind'
term='http://schemas.google.com/docs/2007#archive'
label='archive' />
<title>Document Archive - someuser@somedomain.com</title>
<link rel='self' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<link rel='edit' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<author>
<name>someuser</name>
<email>someuser@somedomain.com</email>
</author>
<docs:archiveNotify>someuser@somedomain.com</docs:archiveNotify>
<docs:archiveStatus>flattening</docs:archiveStatus>
<docs:archiveResourceId>
0Adj-hQNOVsTFSNDEkdk2221OTJfMWpxOGI5OWZu</docs:archiveResourceId>
<docs:archiveResourceId>
0Adj-hQNOVsTFZGZodGs2O72NFMllMQDN3a2Rq</docs:archiveResourceId>
<docs:archiveConversion source='application/vnd.google-apps.document'
target='text/plain' />
</entry>"""
soup = BeautifulSoup(data, fromEncoding='utf8')
print soup('id')[0].text
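Note that soup('id') is BeautifulSoup shorthand for soup.findAll('id'), so this grabs every <id> element in the document and takes the first one.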
There is also expat, which is built into Python, but it is worth learning BeautifulSoup, because it will respond way better to real-world XML (and HTML).
Assuming the variable response contains a string representation of the returned XML document, let me tell you the WRONG way to solve your problem:
id = response.split("</id>")[0].split("<id>")[1]
The right way to do it is with xml.sax or xml.dom or expat, but personally, I wouldn't be bothered unless I wanted to have robust error handling of exception cases when response contains something unexpected.
EDIT: I forgot about BeautifulSoup, it is indeed as awesome as Travis describes.
If you'd like to use minidom, you can do the following (replace gd.xml with your xml input):
from xml.dom import minidom
dom = minidom.parse("gd.xml")
id = dom.getElementsByTagName("id")[0].childNodes[0].nodeValue
print id
Also, I assume you meant id element, and not id attribute.
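For completeness, an ElementTree sketch of the same lookup; the Atom default namespace is what usually makes a naive search for 'id' come back blank (a minimal sketch, assuming the sample above is saved as gd.xml):
import xml.etree.ElementTree as ET

tree = ET.parse('gd.xml')
# <entry> declares a default Atom namespace, so plain 'id' won't match
id_text = tree.find('{http://www.w3.org/2005/Atom}id').text.strip()
print id_text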
