XML Python: Choosing one of numerous attributes using ElementTree - python

As far as I know this question is not a repeat, as I have been searching for a solution for days now and simply cannot pin the problem down. I am attempting to print a nested attribute from an XML document tag using Python. I believe the error I am running into has to do with the fact that the tag from which I'm trying to get information has more than one attribute. Is there some way I can specify that I want the "status" value from the "second-tag" tag? Thank you so much for any help.
My XML document 'test.xml':
<?xml version="1.0" encoding="UTF-8"?>
<first-tag xmlns="http://somewebsite.com/" date-produced="20130703" lang="en" produced-by="steve" status="OFFLINE">
<second-tag country="US" id="3651653" lang="en" status="ONLINE">
</second-tag>
</first-tag>
My Python File:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
whatiwant = root.find('second-tag').get('status')
print whatiwant
Error:
AttributeError: 'NoneType' object has no attribute 'get'

You fail at .find('second-tag'), not on the .get.
For what you want, and your idiom, BeautifulSoup shines.
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(xml_string)
whatyouwant = soup.find('second-tag')['status']

I don't know how to do it with ElementTree, but I would do it with ehp (EasyHtmlParser).
Here is the link:
http://easyhtmlparser.sourceforge.net/
A friend told me about this tool; I'm still learning it, but it's pretty good and simple.
from ehp import *
data = '''<?xml version="1.0" encoding="UTF-8"?>
<first-tag xmlns="http://somewebsite.com/" date-produced="20130703" lang="en" produced-by="steve" status="OFFLINE">
<second-tag country="US" id="3651653" lang="en" status="ONLINE">
</second-tag>
</first-tag>'''
html = Html()
dom = html.feed(data)
item = dom.fst('second-tag')
value = item.attr['status']
print value

The problem here is that there is no tag named second-tag here. There's a tag named {http://somewebsite.com/}second-tag.
You can see this pretty easily:
>>> print(root.getchildren())
[<Element '{http://somewebsite.com/}second-tag' at 0x105b24190>]
A non-namespace-compliant XML parser might do the wrong thing and ignore that, making your code work. A parser that bends over backward to be friendly (like BeautifulSoup) will, in effect, automatically try {http://somewebsite.com/}second-tag when you ask for second-tag. But ElementTree is neither.
If that isn't all you need to know, you first need to read a tutorial on namespaces (maybe this one).
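A minimal, self-contained sketch of that fix with ElementTree, using the question's XML inline (the namespace URI comes from the xmlns attribute on <first-tag>):

```python
import xml.etree.ElementTree as ET

xml = """<?xml version="1.0" encoding="UTF-8"?>
<first-tag xmlns="http://somewebsite.com/" date-produced="20130703" lang="en" produced-by="steve" status="OFFLINE">
<second-tag country="US" id="3651653" lang="en" status="ONLINE">
</second-tag>
</first-tag>"""

root = ET.fromstring(xml)
# The default namespace applies to child elements too, so the search
# string must carry the namespace URI in Clark notation:
status = root.find('{http://somewebsite.com/}second-tag').get('status')
print(status)  # ONLINE
```

The same qualified name works with tree.parse('test.xml') and tree.getroot() from the question's code.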

Related

Getting XML attributes from XML with namespaces and Python (lxml)

I'm trying to grab the "id" and "href" attributes from the below XML. Thus far I can't seem to get my head around the namespacing aspects. I can get things easily enough with XML that doesn't have namespace references. But this has befuddled me. Any ideas would be appreciated!
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<ns3:searchResult total="1" xmlns:ns5="ers.ise.cisco.com" xmlns:ers-v2="ers-v2" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:ns3="v2.ers.ise.cisco.com">
<ns3:resources>
<ns5:resource id="d28b5080-587a-11e8-b043-d8b1906198a4" name="00:1B:4F:32:27:50">
<link rel="self" href="https://ho-lab-ise1:9060/ers/config/endpoint/d28b5080-587a-11e8-b043-d8b1906198a4" type="application/xml"/>
</ns5:resource>
</ns3:resources>
</ns3:searchResult>
You can use the xpath function to search all resources and iterate over them. The function has a namespaces keyword argument; you can use it to declare the mapping between namespace prefixes and namespace URLs.
Here is the idea:
from lxml import etree

NS = {
    "ns5": "ers.ise.cisco.com",
    "ns3": "v2.ers.ise.cisco.com",
}

tree = etree.parse('your.xml')
resources = tree.xpath('//ns5:resource', namespaces=NS)
for resource in resources:
    print(resource.attrib['id'])
    links = resource.xpath('link')
    for link in links:
        print(link.attrib['href'])
sorry, this is not tested
Here is the documentation about xpath.
@laurent-laporte's answer is great for showing how to handle multiple namespaces (+1).
However if you truly only need to select a couple of attributes no matter what namespace they're in, you can test local-name() in a predicate...
from lxml import etree

tree = etree.parse('your.xml')
attrs = tree.xpath("//@*[local-name()='id' or local-name()='href']")
for attr in attrs:
    print(attr)
This will print (the same as Laurent's)...
d28b5080-587a-11e8-b043-d8b1906198a4
https://ho-lab-ise1:9060/ers/config/endpoint/d28b5080-587a-11e8-b043-d8b1906198a4

How to efficiently extract <![CDATA[]]> content from an xml with python?

I have the following xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
<document><![CDATA["#username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING ]]></document>
<document><![CDATA[Ugh ]]></document>
<document><![CDATA[YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt ]]></document>
<document><![CDATA[#username Shout out to me???? ]]></document>
</author>
What is the most efficient way to parse and extract the <![CDATA[content]]> into a list. Let's say:
[#username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING Ugh YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt #username Shout out to me???? ]
This is what I tried:
from bs4 import BeautifulSoup
x='/Users/user/PycharmProjects/TratandoDeMejorarPAN/test.xml'
y = BeautifulSoup(open(x), 'xml')
out = [y.author.document]
print out
And this is the output:
[<document>"#username: That boner came at the wrong time ???? http://t.co/5XgDyCaCjR" HELP I'M DYING </document>]
The problem with this output is that I should not get the <document></document>. How can I remove the <document></document> tags and get all the elements of this xml in a list?.
There are several things wrong here. (Asking questions on selecting a library is against the rules here, so I'm ignoring that part of the question).
You need to pass in a file handle, not a file name.
That is: y = BeautifulSoup(open(x))
You need to tell BeautifulSoup that it's dealing with XML.
That is: y = BeautifulSoup(open(x), 'xml')
CDATA sections don't create elements. You can't search for them in the DOM, because they don't exist in the DOM; they're just syntactic sugar. Just look at the text directly under the document, don't try to search for something named CDATA.
To state it again, somewhat differently: <doc><![CDATA[foo]]></doc> is exactly the same as <doc>foo</doc>. What's different about a CDATA section is that everything inside it is automatically escaped, meaning that <![CDATA[<hello>]]> is interpreted as the text &lt;hello&gt;, not as an element. However -- you can't tell from the parsed object tree whether your document contained a CDATA section with a literal < and > or a raw text section with &lt; and &gt;. This is by design, and true of any compliant XML DOM implementation.
Now, how about some code that actually works:
import bs4
doc="""
<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
<document><![CDATA["#username: That came at the wrong time ????" HELP I'M DYING ]]></document>
<document><![CDATA[Ugh ]]></document>
<document><![CDATA[YES !!!! WE GO FOR IT. ]]></document>
<document><![CDATA[#username Shout out to me???? ]]></document>
</author>
"""
doc_el = bs4.BeautifulSoup(doc, 'xml')
print [ el.text for el in doc_el.findAll('document') ]
If you want to read from a file, replace doc with open(filename, 'r').

Parse xml from file using etree works when reading string, but not a file

I am a relative newby to Python and SO. I have an xml file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having troubles getting the right output. Here is my code:
from xml.etree import ElementTree as etree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to be able to read from a file instead of a string. So I try this code:
from xml.etree import ElementTree as etree
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
Now my result is "None None". I have a feeling I'm either not getting the file in right or something is wrong with the output. Here is the contents of test3.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>
</response>
Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
You could also give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
root.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
Have you thought of trying BeautifulSoup to parse your XML with Python?
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML
There is some good documentation and a healthy online group, so support is quite good.

Appropriate xpath syntax with python for non-standard xml

The input file is actually multiple XML files appending to one file. (Sourced from Google Patents). This is an example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23">
<applicants>
<applicant sequence="001" app-type="applicant-inventor" designation="us-only">
<addressbook><last-name>Beyer</last-name>
<first-name>Daniel Lee</first-name>
<address><city>Franklin</city>
<state>TN</state>
<country>US</country></address></addressbook>
<nationality><country>omitted</country></nationality>
<residence><country>US</country></residence>
</applicant>
<applicant sequence="002" app-type="applicant-inventor" designation="us-only">
<addressbook><last-name>Friedland</last-name>
<first-name>Jason Michael</first-name>
<address><city>Franklin</city>
<state>TN</state>
<country>US</country></address></addressbook>
<nationality><country>omitted</country></nationality>
<residence><country>US</country></residence>
</applicant>
</applicants>
</us-patent-grant>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>
I'm trying to create a string with a "-".join xpath for all of the children and grandchildren within <applicant> using the following in python with lxml:
import urllib2, os, zipfile
from lxml import etree

count = 0
for item in xmlSplitter(zf.open(xml_file)):
    count += 1
    if count > 1: break
    doc = etree.XML(item)
    docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
    title = first(doc.xpath('//invention-title/text()'))
    applicant = "-".join(doc.xpath('//applicants/applicant/*/text()'))
    print "DocID: {0}\nTitle: {1}\nApplicant: {2}\n".format(docID, title, applicant)
    outFile.write(str(docID) + "|" + str(title) + "|" + str(applicant) + "\n")
I've tried multiple xpath combinations but I can't produce a string with hyphens for <applicants>, and while //text() can't get to the grandchildren it doesn't help with the stringing. What is the appropriate xpath syntax to select all text within the children and grandchildren of <applicant> and still punch it out in a string? While not shown in this example, is there a way to ignore unicode that might be present at the beginning of a text line too (I believe it appears in some of the later xml docs)? The 'applicant' output I'm hoping to get should look something like:
Beyer-Daniel Lee-Franklin-TN-US-omitted-US-Friedland-Jason Michael-Franklin-TN-US-omitted-US
This question is very similar to this other question of yours.
There are two problems here:
How to get from "non-standard XML" to "standard XML"?
How to use XPath to get text values of descendant elements and concatenate them?
You need to solve 1 before attacking 2. If you need help with that, ask a separate question.
"Non-standard XML" is the same as not XML at all. You can't parse it as XML, and you can't use XPath on it. But you have phrased the question in a way that makes it look like you are trying to do that anyway.
Assuming that your question is actually about working with "standard XML", how about using the same approach as in my answer to your other question?
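Assuming the concatenated file has first been split into individual well-formed documents, the hyphen-joined string can be sketched with only the standard library: itertext() walks every text node under an element in document order, so stripping the whitespace-only nodes and joining with "-" gives exactly the output the question asks for.

```python
import xml.etree.ElementTree as ET

# One well-formed record from the question (no namespaces involved here).
xml = """<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23">
<applicants>
<applicant sequence="001" app-type="applicant-inventor" designation="us-only">
<addressbook><last-name>Beyer</last-name>
<first-name>Daniel Lee</first-name>
<address><city>Franklin</city>
<state>TN</state>
<country>US</country></address></addressbook>
<nationality><country>omitted</country></nationality>
<residence><country>US</country></residence>
</applicant>
<applicant sequence="002" app-type="applicant-inventor" designation="us-only">
<addressbook><last-name>Friedland</last-name>
<first-name>Jason Michael</first-name>
<address><city>Franklin</city>
<state>TN</state>
<country>US</country></address></addressbook>
<nationality><country>omitted</country></nationality>
<residence><country>US</country></residence>
</applicant>
</applicants>
</us-patent-grant>"""

root = ET.fromstring(xml)
parts = []
for applicant in root.iter('applicant'):
    # itertext() yields text and tail nodes of all descendants, including
    # the newlines between elements; keep only the non-blank pieces.
    parts.extend(t.strip() for t in applicant.itertext() if t.strip())
applicant_str = "-".join(parts)
print(applicant_str)
# Beyer-Daniel Lee-Franklin-TN-US-omitted-US-Friedland-Jason Michael-Franklin-TN-US-omitted-US
```

With lxml the equivalent one-liner would be "-".join(t.strip() for t in doc.xpath('//applicant//text()') if t.strip()), since //text() there does reach grandchildren.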

Python get ID from XML data

I am a total python newb and am trying to parse an XML document that is being returned from google as a result of a post request.
The document returned looks like the one outlined in this doc
http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#Archives
where it says 'The response contains information about the archive.'
The only part I am interested in is the Id attribute right near the beginning. There will only ever be 1 entry, and 1 id attribute. How can I extract it to be used later? I've been fighting with this for a while and I feel like I've tried everything from minidom to ElementTree. No matter what I do my search comes back blank, loops don't iterate, or methods are missing. Any assistance is much appreciated. Thank you.
I would highly recommend the Python package BeautifulSoup. It is awesome. Here is a simple example using their example data (assuming you've installed BeautifulSoup already):
from BeautifulSoup import BeautifulSoup
data = """<?xml version='1.0' encoding='utf-8'?>
<entry xmlns='http://www.w3.org/2005/Atom'
xmlns:docs='http://schemas.google.com/docs/2007'
xmlns:gd='http://schemas.google.com/g/2005'>
<id>
https://docs.google.com/feeds/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA</id>
<published>2010-11-18T18:34:06.981Z</published>
<updated>2010-11-18T18:34:07.763Z</updated>
<app:edited xmlns:app='http://www.w3.org/2007/app'>
2010-11-18T18:34:07.763Z</app:edited>
<category scheme='http://schemas.google.com/g/2005#kind'
term='http://schemas.google.com/docs/2007#archive'
label='archive' />
<title>Document Archive - someuser@somedomain.com</title>
<link rel='self' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<link rel='edit' type='application/atom+xml'
href='https://docs.google.com/feeds/default/private/archive/-228SJEnnmwemsiDLLxmGeGygWrvW1tMZHHg6ARCy3Uj3SMH1GHlJ2scb8BcHSDDDUosQAocwBQOAKHOq3-0gmKA' />
<author>
<name>someuser</name>
<email>someuser@somedomain.com</email>
</author>
<docs:archiveNotify>someuser@somedomain.com</docs:archiveNotify>
<docs:archiveStatus>flattening</docs:archiveStatus>
<docs:archiveResourceId>
0Adj-hQNOVsTFSNDEkdk2221OTJfMWpxOGI5OWZu</docs:archiveResourceId>
<docs:archiveResourceId>
0Adj-hQNOVsTFZGZodGs2O72NFMllMQDN3a2Rq</docs:archiveResourceId>
<docs:archiveConversion source='application/vnd.google-apps.document'
target='text/plain' />
</entry>"""
soup = BeautifulSoup(data, fromEncoding='utf8')
print soup('id')[0].text
There is also expat, which is built into Python, but it is worth learning BeautifulSoup, because it will respond way better to real-world XML (and HTML).
Assuming the variable response contains a string representation of the returned XML document, let me tell you the WRONG way to solve your problem:
id = response.split("</id>")[0].split("<id>")[1]
The right way to do it is with xml.sax or xml.dom or expat, but personally, I wouldn't be bothered unless I wanted to have robust error handling of exception cases when response contains something unexpected.
EDIT: I forgot about BeautifulSoup, it is indeed as awesome as Travis describes.
If you'd like to use minidom, you can do the following (replace gd.xml with your xml input):
from xml.dom import minidom
dom = minidom.parse("gd.xml")
id = dom.getElementsByTagName("id")[0].childNodes[0].nodeValue
print id
Also, I assume you meant id element, and not id attribute.
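Since the searches coming back blank are almost certainly the Atom namespace at work, here is a stdlib-only sketch against a trimmed-down, hypothetical version of the response: the <entry> and its <id> live in the Atom namespace, so find()/findtext() need the qualified tag name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the archive response document.
data = """<?xml version='1.0' encoding='utf-8'?>
<entry xmlns='http://www.w3.org/2005/Atom'>
<id>https://docs.google.com/feeds/archive/ABC123</id>
</entry>"""

root = ET.fromstring(data)
# An unqualified root.findtext('id') would return None here;
# the Atom namespace URI must be part of the search string.
doc_id = root.findtext('{http://www.w3.org/2005/Atom}id')
print(doc_id)  # https://docs.google.com/feeds/archive/ABC123
```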
