Python get ID from XML data - python

I am a total python newb and am trying to parse an XML document that is being returned from google as a result of a post request.
The document returned looks like the one outlined in this doc
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#Archives
where it says 'The response contains information about the archive.'
The only part I am interested in is the Id attribute right near the beginning. There will only ever be 1 entry, and 1 id attribute. How can I extract it to be used later? I've been fighting with this for a while and I feel like I've tried everything from minidom to elementtree. No matter what I do my search comes back blank, loops don't iterate, or methods are missing. Any assistance is much appreciated. Thank you.

I would highly recommend the Python package BeautifulSoup. It is awesome. Here is a simple example using their example data (assuming you've installed BeautifulSoup already):
from BeautifulSoup import BeautifulSoup
data = """<?xml version='1.0' encoding='utf-8'?>
<entry xmlns='http://www.w3.org/2005/Atom'
xmlns:docs='http://schemas.google.com/docs/2007'
xmlns:gd='http://schemas.google.com/g/2005'>
<id>
https://docs.google.com/feeds/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA</id>
<published>2010-11-18T18:34:06.981Z</published>
<updated>2010-11-18T18:34:07.763Z</updated>
<app:edited xmlns:app='http://www.w3.org/2007/app'>
2010-11-18T18:34:07.763Z</app:edited>
<category scheme='http://schemas.google.com/g/2005#kind'
term='http://schemas.google.com/docs/2007#archive'
label='archive' />
<title>Document Archive - someuser#somedomain.com</title>
<link rel='self' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<link rel='edit' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<author>
<name>someuser</name>
<email>someuser#somedomain.com</email>
</author>
<docs:archiveNotify>someuser#somedomain.com</docs:archiveNotify>
<docs:archiveStatus>flattening</docs:archiveStatus>
<docs:archiveResourceId>
0Adj-hQNOVsTFSNDEkdk2221OTJfMWpxOGI5OWZu</docs:archiveResourceId>
<docs:archiveResourceId>
0Adj-hQNOVsTFZGZodGs2O72NFMllMQDN3a2Rq</docs:archiveResourceId>
<docs:archiveConversion source='application/vnd.google-apps.document'
target='text/plain' />
</entry>"""
soup = BeautifulSoup(data, fromEncoding='utf8')
print soup('id')[0].text
There is also expat, which is built into Python, but it is worth learning BeautifulSoup, because it will respond way better to real-world XML (and HTML).

Assuming the variable response contains a string representation of the returned XML document, let me tell you the WRONG way to solve your problem:
id = response.split("</id>")[0].split("<id>")[1]
The right way to do it is with xml.sax or xml.dom or expat, but personally, I wouldn't be bothered unless I wanted to have robust error handling of exception cases when response contains something unexpected.
EDIT: I forgot about BeautifulSoup, it is indeed as awesome as Travis describes.

If you'd like to use minidom, you can do the following (replace gd.xml with your xml input):
from xml.dom import minidom
dom = minidom.parse("gd.xml")
id = dom.getElementsByTagName("id")[0].childNodes[0].nodeValue
print id
Also, I assume you meant id element, and not id attribute.
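For comparison, the same lookup can be sketched with the standard library's xml.etree.ElementTree. The entry declares a default Atom namespace, so the element has to be looked up by its namespace-qualified name; a plain search for "id" finds nothing, which is a likely reason earlier attempts came back blank. The function name extract_entry_id and the shortened sample ID are placeholders for illustration, not part of the Google response.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns attribute of the <entry> element.
ATOM_NS = "http://www.w3.org/2005/Atom"


def extract_entry_id(xml_bytes):
    """Return the text of the Atom <id> element, or None if it is absent."""
    root = ET.fromstring(xml_bytes)
    # ElementTree spells namespaced tags as {namespace-uri}localname.
    id_el = root.find("{%s}id" % ATOM_NS)
    if id_el is None or id_el.text is None:
        return None
    # The sample response puts a line break before the URL, so strip it.
    return id_el.text.strip()


# Hypothetical, shortened stand-in for the real response body.
sample = b"""<?xml version='1.0' encoding='utf-8'?>
<entry xmlns='http://www.w3.org/2005/Atom'>
<id>
https://docs.google.com/feeds/archive/EXAMPLE-ID</id>
</entry>"""

entry_id = extract_entry_id(sample)
print(entry_id)
```

Returning None instead of raising keeps the caller in charge of deciding what a missing id means.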

Related

'XML' document with multiple root elements

I have an 'XML' file, which I do not control, that contains two root elements. I am trying to parse it with etree.ElementTree:
<?xml version="1.0"?>
<meta>
... data I do not care about
</meta>
<database>
... data I wish to parse
</database>
Trying to parse the file I'm getting the error: 'junk after document element' which I understand is related to the fact that it isn't valid xml, since xml can only have one root element. I've been reading around for a solution, and while I have found a few posts addressing this issue they have all been different enough or difficult enough that I could not, as a beginner, get my head round them.
As I understand it the solution would either be to encase everything in a new root element, and parse that, or somehow ignore/split off the <meta> element and its children. Any guidance on how to best accomplish this would be appreciated.
Beautiful Soup might ease your problem (although it is the lxml parser inside that does the work), but it is a long-term downgrade, for instance when you later want to use XPath.
Stick to ET. It is strict and won't parse XML that is not well-formed, and well-formed XML requires one root element and nothing else outside of it.
If you manage to parse your XML file, you can be sure it is well-formed. Both of these options are legitimate:
1) Read the file as a string, remove the declaration and put new root tags around the rest. Then parse from the string. (Clear the string variable after that.) Or you could edit the file first.
2) Create a new root element ( new_root = ET.Element('new_root') ), read the top-level elements in the file and append them with SubElement.
The second option requires more coding and maintenance if the file gets changed.
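Option 1 can be sketched roughly as follows. The helper name parse_multi_root and the wrapper tag name are made up for this example, and the sample data is the small two-root document used elsewhere in this thread; a real file may need more careful handling (for instance a doctype or a byte-order mark).

```python
import re
import xml.etree.ElementTree as ET


def parse_multi_root(text, wrapper="wrapper"):
    """Wrap a multi-root document in one synthetic root and parse it."""
    # Remove the XML declaration, since it is only allowed at the very start.
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", text, count=1)
    return ET.fromstring("<%s>%s</%s>" % (wrapper, body, wrapper))


data = """<?xml version="1.0"?>
<meta>
<somedata>1</somedata>
</meta>
<database>
<important>100</important>
</database>"""

root = parse_multi_root(data)
top_level_tags = [child.tag for child in root]
important = root.find("database/important").text
print(top_level_tags, important)
```

The original top-level elements become children of the synthetic root, so the <meta> part can simply be ignored.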
Here is one solution using BeautifulSoup, where data holds the malformed XML. BeautifulSoup will process it like any other document, so you can access both parts:
from bs4 import BeautifulSoup
data = """<?xml version="1.0"?>
<meta>
<somedata>1</somedata>
</meta>
<database>
<important>100</important>
</database>"""
soup = BeautifulSoup(data, 'lxml')
print(soup.database.important.text)
Prints:
100

How to efficiently extract <![CDATA[]]> content from an XML with Python?

I have the following xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
<document><![CDATA["#username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING ]]></document>
<document><![CDATA[Ugh ]]></document>
<document><![CDATA[YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt ]]></document>
<document><![CDATA[#username Shout out to me???? ]]></document>
</author>
What is the most efficient way to parse and extract the <![CDATA[content]]> into a list. Let's say:
[#username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING Ugh YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt #username Shout out to me???? ]
This is what I tried:
from bs4 import BeautifulSoup
x='/Users/user/PycharmProjects/TratandoDeMejorarPAN/test.xml'
y = BeautifulSoup(open(x), 'xml')
out = [y.author.document]
print out
And this is the output:
[<document>"#username: That boner came at the wrong time ???? http://t.co/5XgDyCaCjR" HELP I'M DYING </document>]
The problem with this output is that I should not get the <document></document>. How can I remove the <document></document> tags and get all the elements of this xml in a list?
There are several things wrong here. (Asking questions on selecting a library is against the rules here, so I'm ignoring that part of the question).
You need to pass in a file handle, not a file name.
That is: y = BeautifulSoup(open(x))
You need to tell BeautifulSoup that it's dealing with XML.
That is: y = BeautifulSoup(open(x), 'xml')
CDATA sections don't create elements. You can't search for them in the DOM, because they don't exist in the DOM; they're just syntactic sugar. Just look at the text directly under the document, don't try to search for something named CDATA.
To state it again, somewhat differently: <doc><![CDATA[foo]]></doc> is exactly the same as <doc>foo</doc>. What's different about a CDATA section is that everything inside it is automatically escaped, meaning that <![CDATA[<hello>]]> is interpreted as &lt;hello&gt;. However -- you can't tell from the parsed object tree whether your document contained a CDATA section with literal < and > or a raw text section with &lt; and &gt;. This is by design, and true of any compliant XML DOM implementation.
Now, how about some code that actually works:
import bs4
doc="""
<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
<document><![CDATA["#username: That came at the wrong time ????" HELP I'M DYING ]]></document>
<document><![CDATA[Ugh ]]></document>
<document><![CDATA[YES !!!! WE GO FOR IT. ]]></document>
<document><![CDATA[#username Shout out to me???? ]]></document>
</author>
"""
doc_el = bs4.BeautifulSoup(doc, 'xml')
print [ el.text for el in doc_el.findAll('document') ]
If you want to read from a file, replace doc with open(filename, 'r').
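The point that a CDATA section and escaped text are indistinguishable after parsing can be checked with a small sketch. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, purely so that it runs without extra packages, and the three-element document is invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# Invented sample: one plain CDATA section, one CDATA section holding
# markup characters, and one text node with the same characters escaped.
doc = """<author id="user23">
<document><![CDATA[Ugh ]]></document>
<document><![CDATA[a < b & c ]]></document>
<document>a &lt; b &amp; c </document>
</author>"""

root = ET.fromstring(doc)

# The parser hands back plain text; no CDATA node or wrapper survives.
texts = [el.text for el in root.iter("document")]
print(texts)
```

The second and third entries of the list come out identical, even though only one of them was written as CDATA in the source.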

XML Python Choosing one of numerous attributes using ElementTree

As far as I know this question is not a repeat, as I have been searching for a solution for days now and simply cannot pin the problem down. I am attempting to print a nested attribute from an XML document tag using Python. I believe the error I am running into has to do with the fact that the tag from which I'm trying to get information has more than one attribute. Is there some way I can specify that I want the "status" value from the "second-tag" tag? Thank you so much for any help.
My XML document 'test.xml':
<?xml version="1.0" encoding="UTF-8"?>
<first-tag xmlns="http://somewebsite.com/" date-produced="20130703" lang="en" produced-by="steve" status="OFFLINE">
<second-tag country="US" id="3651653" lang="en" status="ONLINE">
</second-tag>
</first-tag>
My Python File:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
whatiwant = root.find('second-tag').get('status')
print whatiwant
Error:
AttributeError: 'NoneType' object has no attribute 'get'
You fail at .find('second-tag'), not on the .get.
For what you want, and your idiom, BeautifulSoup shines.
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(xml_string)
whatyouwant = soup.find('second-tag')['status']
I don't know how to do it with ElementTree, but I would do it with ehp (EasyHtmlParser).
Here is the link:
http://easyhtmlparser.sourceforge.net/
A friend told me about this tool. I'm still learning it, but it is pretty good and simple.
from ehp import *
data = '''<?xml version="1.0" encoding="UTF-8"?>
<first-tag xmlns="http://somewebsite.com/" date-produced="20130703" lang="en" produced-by="steve" status="OFFLINE">
<second-tag country="US" id="3651653" lang="en" status="ONLINE">
</second-tag>
</first-tag>'''
html = Html()
dom = html.feed(data)
item = dom.fst('second-tag')
value = item.attr['status']
print value
The problem here is that there is no tag named second-tag here. There's a tag named {http://somewebsite.com/}second-tag.
You can see this pretty easily (getchildren() was removed in Python 3.9; list(root) gives the same children):
>>> print(list(root))
[<Element '{http://somewebsite.com/}second-tag' at 0x105b24190>]
A non-namespace-compliant XML parser might do the wrong thing and ignore that, making your code work. A parser that bends over backward to be friendly (like BeautifulSoup) will, in effect, automatically try {http://somewebsite.com/}second-tag when you ask for second-tag. But ElementTree is neither.
If that isn't all you need to know, you first need to read a tutorial on namespaces (maybe this one).
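With ElementTree itself, the lookup works once the namespace is spelled out. The sketch below shows two equivalent ways of doing that; the prefix "site" is an arbitrary label chosen for this example (only the URI has to match the document), and the sample is a trimmed copy of test.xml.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's test.xml.
xml_text = """<first-tag xmlns="http://somewebsite.com/" lang="en" status="OFFLINE">
<second-tag country="US" id="3651653" lang="en" status="ONLINE">
</second-tag>
</first-tag>"""

root = ET.fromstring(xml_text)

# The unqualified name matches nothing, which is what caused the error.
missing = root.find("second-tag")

# Way 1: map a prefix of your choosing to the namespace URI.
ns = {"site": "http://somewebsite.com/"}
status_via_prefix = root.find("site:second-tag", ns).get("status")

# Way 2: write the fully qualified name in {uri}localname form.
status_via_uri = root.find("{http://somewebsite.com/}second-tag").get("status")

print(missing, status_via_prefix, status_via_uri)
```

Having several attributes on the tag is not a problem; get('status') picks out the one that is asked for.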

Django parse XML from a POST

I'm receiving an HTTP POST with one parameter sent: xml
It contains an XML document. The format of this document is:
<?xml version="1.1" encoding="ISO-8859-1"?>
<delivery_receipt>
<version>1.0</version>
<status>Delivered</status>
</delivery_receipt>
I need to get what's in <status> from the POST. How do I parse the parameter and get the 'status'?
Update....
if request.POST:
    from lxml.cssselect import CSSSelector
    from lxml.etree import fromstring
    h = fromstring(request.POST['xml'])
    # cssselect() returns a list of matching elements, so take the first one
    h.cssselect('delivery_receipt status')[0].text
I'm not sure that request.POST['xml'] will work though.
You can use CSS selectors with XML documents, provided you are doing relatively simple parsing tasks. CSS selectors are clear and easy to read and write. XPath is more expressive overall, but for a simple lookup like this one a selector is often easier to follow.
I suggest getting lxml installed, and using their cssselect features.
Your end result might look like this:
>>> h = fromstring("""<?xml version="1.1" encoding="ISO-8859-1"?>
<delivery_receipt>
<version>1.0</version>
<status>Delivered</status>
</delivery_receipt> """)
>>> h.cssselect('delivery_receipt status')[0].text
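If installing lxml is not an option, a rough equivalent with the standard library might look like the sketch below. The function name extract_status is invented for this example, and the sample string deliberately leaves out the XML declaration, so it says nothing about how a given parser treats version="1.1". In a Django view the input would be whatever request.POST.get('xml') returns.

```python
import xml.etree.ElementTree as ET


def extract_status(xml_text):
    """Return the text of <status> under <delivery_receipt>, or None."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Malformed input from the client should not crash the view.
        return None
    if root.tag != "delivery_receipt":
        return None
    # findtext returns None when there is no <status> child.
    return root.findtext("status")


# Sample body modelled on the question, without the XML declaration.
sample = """<delivery_receipt>
<version>1.0</version>
<status>Delivered</status>
</delivery_receipt>"""

status = extract_status(sample)
print(status)
```

Catching ParseError matters here because the XML arrives from outside and cannot be trusted to be well-formed.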

Retrieving first urban dictionary result for a term in python

I have written some pretty simple code to get the first result for any term on urbandictionary.com. I started by writing a simple function to see how their markup is formatted.
def parseudtest(searchurl):
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    url_info = urllib.urlopen(url)
    for lines in url_info:
        print lines
For a test, I searched for 'cats', and used that as the variable searchurl. The output I receive is of course a gigantic page, but here is the part I care about:
<meta content='He set us up the bomb. Also took all our base.' name='Description' />
<meta content='He set us up the bomb. Also took all our base.' property='og:description' />
<meta content='cats' property='og:title' />
<meta content="http://static3.urbandictionary.com/rel-1e0b481/images/og_image.png" property="og:image" />
<meta content='Urban Dictionary' property='og:site_name' />
As you can see, the first time the element "meta content" appears on the site, it is the first definition for the search term. So I wrote this code to retrieve it:
def parseud(searchurl):
    url = 'http://www.urbandictionary.com/define.php?term=%s' % searchurl
    url_info = urllib.urlopen(url)
    if (url_info):
        xmldoc = minidom.parse(url_info)
        if (xmldoc):
            definition = xmldoc.getElementsByTagName('meta content')[0].firstChild.data
            print definition
For some reason the parsing doesn't seem to be working and invariably encounters an error every time. It is especially confusing since the site appears to use basically the same format as other sites I have successfully retrieved specific data from. If anyone could help me figure out what I am messing up here, it would be greatly appreciated.
As you don't give the traceback for the errors that occur it's hard to be specific, but I assume that although the site claims to be XHTML it's not actually valid XML. You'd be better off using Beautiful Soup as it is designed for parsing HTML and will correctly handle broken markup.
I never used the minidom parser, but I think the problem is that you call:
xmldoc.getElementsByTagName('meta content')
while the tag name is meta; content is just the first attribute (as shown pretty well by the highlighting of your HTML code).
Try to replace that bit with:
xmldoc.getElementsByTagName('meta')
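Both answers can be combined in a small sketch: a forgiving HTML parser plus reading content as an attribute of the meta tag rather than as child text. This version uses the standard library's html.parser (Python 3) instead of Beautiful Soup so it has no dependencies; the class name MetaContentParser is made up, and the input is the snippet quoted in the question, not a live page.

```python
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Collect the content attribute of every <meta> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "meta":
            content = dict(attrs).get("content")
            if content is not None:
                self.contents.append(content)


# Snippet copied from the page output shown in the question.
html = """<meta content='He set us up the bomb. Also took all our base.' name='Description' />
<meta content='cats' property='og:title' />
<meta content='Urban Dictionary' property='og:site_name' />"""

parser = MetaContentParser()
parser.feed(html)
first_definition = parser.contents[0] if parser.contents else None
print(first_definition)
```

Unlike minidom, this parser does not require the page to be well-formed XML, so stray or unclosed tags elsewhere in the page do not stop it.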
