python alexa result parsing with lxml.etree - python

I am using the Alexa API from AWS, but I am having trouble parsing the result to get what I want.
The Alexa API returns an object tree of type <type 'lxml.etree._ElementTree'>.
I use this code to print the tree:
from lxml import etree
# tree is the object returned by the Alexa API call
root = tree.getroot()
print(etree.tostring(root))
I get the XML below:
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"><aws:OperationRequest><aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId></aws:OperationRequest><aws:UrlInfoResult><aws:Alexa>
<aws:ContentData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:SiteData>
<aws:Title>Google</aws:Title>
<aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
<aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
</aws:SiteData>
<aws:LinksInCount>3453627</aws:LinksInCount>
</aws:ContentData>
<aws:TrafficData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:Rank>1</aws:Rank>
</aws:TrafficData>
</aws:Alexa></aws:UrlInfoResult><aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:StatusCode>Success</aws:StatusCode></aws:ResponseStatus></aws:Response></aws:UrlInfoResponse>
I use root.find('LinksInCount').text to get the value of the element, but it does not work.
I want to know how to get the text 3453627 of aws:LinksInCount.

You run into two challenges:
XML using namespaces
two namespaces sharing the same namespace prefix
XML document with reused prefix for 2 different namespaces
You see "aws:" prefix, but it is used for two different namespaces:
xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"
xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"
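Reusing the same namespace prefix in XML is completely legal. The rule is that the innermost declaration in scope wins: each element resolves the prefix using the nearest xmlns:aws declaration on itself or on one of its ancestors.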
xmlstr = """
<?xml version="1.0"?>
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
<aws:OperationRequest>
<aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId>
</aws:OperationRequest>
<aws:UrlInfoResult>
<aws:Alexa>
<aws:ContentData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:SiteData>
<aws:Title>Google</aws:Title>
<aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
<aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
</aws:SiteData>
<aws:LinksInCount>3453627</aws:LinksInCount>
</aws:ContentData>
<aws:TrafficData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:Rank>1</aws:Rank>
</aws:TrafficData>
</aws:Alexa>
</aws:UrlInfoResult>
<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:StatusCode>Success</aws:StatusCode>
</aws:ResponseStatus>
</aws:Response>
</aws:UrlInfoResponse>
"""
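To see the scoping rule in action, here is a minimal sketch using a made-up document (the element names and the urn:example:... URIs are assumptions for illustration, not part of the Alexa response). It uses the standard library's ElementTree; lxml elements expose the same .tag attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the "aws" prefix is declared twice with different URIs.
XML = (
    '<aws:Outer xmlns:aws="urn:example:outer">'
    '<aws:Inner xmlns:aws="urn:example:inner"><aws:Leaf/></aws:Inner>'
    '<aws:Sibling/>'
    '</aws:Outer>'
)

root = ET.fromstring(XML)

# The parser resolves every prefix to its URI; .tag holds "{uri}localname".
resolved_tags = [element.tag for element in root.iter()]

for tag in resolved_tags:
    print(tag)
```

Leaf inherits the inner declaration, while Sibling sits outside it and falls back to the outer one, even though all four elements are written with the same aws: prefix.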
The next challenge is how to search for namespaced elements.
I prefer using xpath. With it, you can use whatever namespace prefix you like in the xpath expression, but you have to tell the xpath call what you mean by those prefixes. This is done with a namespaces dictionary:
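from lxml import etree
# strip() removes the leading newline so the XML declaration starts the string
doc = etree.fromstring(xmlstr.strip())
# Map the prefix used in the XPath expression to the namespace of the target element
namespaces = {"aws": "http://awis.amazonaws.com/doc/2005-07-11"}
texts = doc.xpath("//aws:LinksInCount/text()", namespaces=namespaces)
print(texts[0])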
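Because the prefix in the search expression is independent of the prefix in the document, you can give each of the two namespaces its own name. The sketch below assumes a trimmed-down copy of the response above; the prefixes "alexa" and "awis" are arbitrary choices made for this example.

```python
from lxml import etree

# Trimmed copy of the response, keeping the two declarations of the "aws" prefix.
XML = (
    b'<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    b'<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">'
    b'<aws:UrlInfoResult><aws:Alexa><aws:ContentData>'
    b'<aws:LinksInCount>3453627</aws:LinksInCount>'
    b'</aws:ContentData></aws:Alexa></aws:UrlInfoResult>'
    b'<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    b'<aws:StatusCode>Success</aws:StatusCode>'
    b'</aws:ResponseStatus>'
    b'</aws:Response>'
    b'</aws:UrlInfoResponse>'
)

# Arbitrary prefixes chosen for this example; only the URIs have to match.
NS = {
    "alexa": "http://alexa.amazonaws.com/doc/2005-10-05/",
    "awis": "http://awis.amazonaws.com/doc/2005-07-11",
}

root = etree.fromstring(XML)

# findtext() accepts the same kind of prefix mapping as xpath().
links_in = root.findtext(".//awis:LinksInCount", namespaces=NS)
status = root.findtext(".//alexa:StatusCode", namespaces=NS)

# The same lookup without any prefix, using the "{uri}localname" notation.
links_in_clark = root.findtext(
    ".//{http://awis.amazonaws.com/doc/2005-07-11}LinksInCount"
)

print(links_in, status)
```

This also explains why root.find('LinksInCount') fails in the question: the name is not namespace-qualified, and the element is not a direct child of the root.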

Related

How can I go to exact position of xml file having its XPath and Offset?

I'm using lxml to parse XML files as ElementTree objects. I'm building an annotation application, and I need to reach exact positions in the file.
I have relative XPath and startOffset of where the intended text is located. For example in this piece of code:
<section role="doc-abstract">
<h1>Abstract</h1>
<p>The creation and use of knowledge graphs for information discovery, question answering, and task completion has exploded in recent years, but their application has often been limited to the most common user scenarios.</p>
</section>
I want to get the part "knowledge graphs for information discovery" using the XPath ".//section[2]/p[1]", which gets me to that <p> element. Then I have a startOffset variable equal to "26", which means the text starts 26 characters from the beginning of the element.
My question is: how can I get to that exact position using lxml?
Assuming your XML is stored in a string, xml_string:
from lxml import etree
# initialize a parser
parser = etree.XMLParser(remove_blank_text=True)
# parse the string; etree.XML returns the root element
root = etree.XML(xml_string, parser)
# find() needs a relative path, so start with './/' rather than '//'
node = root.find('.//section[2]/p[1]')
Now you can process this node. You can also use root.findall() to loop over several matching elements.
For more reference on lxml: https://lxml.de/tutorial.html
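Once you have the node, the offset can be applied to its text. The following is a sketch under two assumptions that the question does not confirm: the offset counts characters of the element's text content (including text inside child elements), and the length of the annotated span is known. The helper name text_at and the sample document are hypothetical, and the example derives the offset from the sample text rather than hard-coding one.

```python
from lxml import etree

# Hypothetical sample: two sections, so that ".//section[2]/p[1]" matches the abstract.
XML = (
    "<article>"
    "<section><h1>Intro</h1><p>Skip me.</p></section>"
    '<section role="doc-abstract"><h1>Abstract</h1>'
    "<p>The creation and use of knowledge graphs for information discovery, "
    "question answering, and task completion has exploded in recent years.</p>"
    "</section>"
    "</article>"
)


def text_at(root, path, start, length):
    """Return `length` characters of the matched element's text, starting at `start`."""
    node = root.find(path)
    if node is None:
        raise LookupError("no element matches %r" % path)
    # itertext() also yields text inside child elements, in document order.
    full_text = "".join(node.itertext())
    return full_text[start:start + length]


root = etree.fromstring(XML)
path = ".//section[2]/p[1]"
target = "knowledge graphs for information discovery"

# Derive the offset from the sample itself instead of assuming a value.
start_offset = "".join(root.find(path).itertext()).index(target)
snippet = text_at(root, path, start_offset, len(target))
print(start_offset, snippet)
```

If your offsets are relative to node.text only (the text before the first child element), slice node.text instead of the joined itertext() result.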

Finding and converting XML processing instructions using Python

We are converting our ancient FrameMaker docs to XML. My job is to convert this:
<?FM MARKER [Index] foo, bar ?>
to this:
<indexterm>
<primary>foo, bar</primary>
</indexterm>
I'm not worried about that part (yet); what is stumping me is that the ProcessingInstructions are all over the documents and could potentially be under any element, so I need to be able to search the entire tree, find them, and then process them. I cannot figure out how to iterate over an entire XML tree using minidom. Am I missing some secret method/iterator? This is what I've looked at thus far:
Elementtree has the excellent Element.iter() method, which is a depth-first search, but it doesn't process ProcessingInstructions.
ProcessingInstructions don't have tag names, so I cannot search for them using minidom's getElementsByTagName.
xml.sax's ContentHandler.processingInstruction looks like it's only used to create ProcessingInstructions.
Short of creating my own depth-first search algorithm, is there a way to generate a list of ProcessingInstructions in an XML file, or identify their parents?
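Use the XPath API of the lxml module like this:
from io import BytesIO
from lxml import etree
foo = BytesIO(b'<foo><bar></bar></foo>')
tree = etree.parse(foo)
# this sample contains no processing instructions, so result is an empty list
result = tree.xpath('//processing-instruction()')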
The node test processing-instruction() is true for any processing instruction. The processing-instruction() test may have an argument that is Literal; in this case, it is true for any processing instruction that has a name equal to the value of the Literal.
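As a rough sketch of the whole conversion, the code below finds every FM processing instruction and replaces it with an indexterm element. The sample document, the helper name convert_index_markers, and the assumption that every index marker starts with "MARKER [Index]" are illustrative guesses based on the single example in the question.

```python
from lxml import etree

# Hypothetical input with one FrameMaker marker inside mixed content.
XML = b"<doc><p>Intro<?FM MARKER [Index] foo, bar ?> text</p></doc>"

MARKER_PREFIX = "MARKER [Index]"  # assumed to start every index marker


def convert_index_markers(root):
    """Replace <?FM MARKER [Index] ...?> instructions with <indexterm> elements."""
    converted = 0
    for pi in root.xpath("//processing-instruction('FM')"):
        data = (pi.text or "").strip()
        parent = pi.getparent()
        # Skip instructions outside the root element and other kinds of markers.
        if parent is None or not data.startswith(MARKER_PREFIX):
            continue
        indexterm = etree.Element("indexterm")
        primary = etree.SubElement(indexterm, "primary")
        primary.text = data[len(MARKER_PREFIX):].strip()
        # Carry over the text that follows the instruction, since it leaves with the old node.
        indexterm.tail = pi.tail
        parent.replace(pi, indexterm)
        converted += 1
    return converted


root = etree.fromstring(XML)
count = convert_index_markers(root)
result = etree.tostring(root)
print(count, result)
```

The parent of each instruction is available through getparent(), which covers the "identify their parents" part of the question.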
References
XPath and XSLT with lxml
XML Path Language 1.0: Node Tests

Parsing XML with namespaces using ElementTree in Python

I have an xml, small part of it looks like this:
<?xml version="1.0" ?>
<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
<data>
<image imageId="1"></image>
<content>Content</content>
</data>
</i:insert>
When I parse it using ElementTree and save it to a file, I see the following:
<ns0:insert xmlns:ns0="urn:com:xml:insert" xmlns:ns1="urn:com:xml:data">
<ns1:data>
<ns1:image imageId="1"></ns1:image>
<ns1:content>Content</ns1:content>
</ns1:data>
</ns0:insert>
Why does it change the prefixes and put them everywhere? Using minidom I don't have this problem. Can this be configured? The documentation for ElementTree is very poor.
The problem is that I can't find any node after such parsing. For example, I can't find image with or without a namespace, whether I use {namespace}image or just image. Why is that? Any suggestions are strongly appreciated.
What I already tried:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
for a in root.findall('ns1:image'):
    print(a.attrib)
This returns an error and the other one returns nothing:
for a in root.findall('{urn:com:xml:data}image'):
    print(a.attrib)
I also tried to make a namespace mapping like this and use it:
namespaces = {'ns1': 'urn:com:xml:data'}
for a in root.findall('ns1:image', namespaces):
    print(a.attrib)
It returns nothing. What am I doing wrong?
This snippet from your question,
for a in root.findall('{urn:com:xml:data}image'):
    print(a.attrib)
does not output anything because it only looks for direct {urn:com:xml:data}image children of the root of the tree.
This slightly modified code,
for a in root.findall('.//{urn:com:xml:data}image'):
    print(a.attrib)
will print {'imageId': '1'} because it uses .//, which selects matching subelements on all levels.
Reference: https://docs.python.org/2/library/xml.etree.elementtree.html#supported-xpath-syntax.
It is a bit annoying that ElementTree does not just retain the original namespace prefixes by default, but keep in mind that it is not the prefixes that matter anyway. The register_namespace() function can be used to set the wanted prefix when serializing the XML. The function does not have any effect on parsing or searching.
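Putting both points together, here is a small sketch that searches with a prefix mapping and serializes with the original prefixes. The prefix "d" in the search mapping is an arbitrary choice for this example, and note that register_namespace() changes a module-wide table, so it affects every later serialization in the same process.

```python
import xml.etree.ElementTree as ET

XML = """<i:insert xmlns:i="urn:com:xml:insert" xmlns="urn:com:xml:data">
  <data>
    <image imageId="1"></image>
    <content>Content</content>
  </data>
</i:insert>"""

# Affects serialization only; an empty prefix requests a default namespace.
ET.register_namespace("i", "urn:com:xml:insert")
ET.register_namespace("", "urn:com:xml:data")

root = ET.fromstring(XML)

# The search prefix is local to this mapping and need not match the document.
ns = {"d": "urn:com:xml:data"}
attribs = [image.attrib for image in root.findall(".//d:image", ns)]

serialized = ET.tostring(root, encoding="unicode")
print(attribs)
print(serialized)
```

The ns0/ns1 prefixes no longer appear in the output, while the search still has to name the namespace explicitly.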
From what I gather, it has something to do with the namespace recognition in ET.
from here http://effbot.org/zone/element-namespaces.htm
When you save an Element tree to XML, the standard Element serializer generates unique prefixes for all URI:s that appear in the tree. The prefixes usually have the form “ns” followed by a number. For example, the above elements might be serialized with the prefix ns0 for “http://www.w3.org/1999/xhtml” and ns1 for “http://effbot.org/namespace/letters”.
If you want to use specific prefixes, you can add prefix/uri mappings to a global table in the ElementTree module. In 1.3 and later, you do this by calling the register_namespace function. In earlier versions, you can access the internal table directly:
ElementTree 1.3
ET.register_namespace(prefix, uri)
ElementTree 1.2 (Python 2.5)
ET._namespace_map[uri] = prefix
Note the argument order; the function takes the prefix first, while the raw dictionary maps from URI:s to prefixes.

Python parsing multiple xml tags

I am finding ElementTree a little overwhelming. I have an XML file with two tags whose content I want to extract and write to a txt file. The two tags are
<l>...</l>,
and
<bib>...</bib>.
Is there an easy way to simply grab what's in these two tags? I can handle the output fine; I am just shaky dealing with XML in Python.
Thanks
XML files can be challenging to parse until you get the hang of it. First, you need to access the 'node' to which the tags you want belong. To do this, you need to determine where in the file they are located in the XML's hierarchy.
Assuming both of those tags are not nested deep and are at level 2 of the tree:
import xml.etree.ElementTree as ET
root = ET.parse(filename).getroot()
# The dot represents the current nesting level (the root); for deeper tags, include the parent tags in the path
l_list = []
for node in root.findall("./l"):
    # node.text is the text between the opening and closing tag
    l_list.append(node.text)
bib_list = []
for node in root.findall("./bib"):
    bib_list.append(node.text)
A real-world example involves parsing a Nessus scan file. In that case, the desired findings are nested much deeper. A high-level summary of getting to those is like this (this assumes one host, multiple hosts would be more complicated, as you would enumerate each host for findings):
import xml.etree.ElementTree as ET
root = ET.parse(filename).getroot()
findings = []
report_items = root.findall("./Report/ReportHost/ReportItem")
for node in report_items:
    # Save all child tags of this ReportItem as a dictionary of tag_name: tag_text
    one_finding = {}
    for child in node:
        one_finding[child.tag] = child.text
    findings.append(one_finding)
I hope this example was also useful in showing how to create a dictionary of tag names and their text, then appending all of them into a list of nested dictionaries in case you need to take it that far with what you are parsing.
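If you do not know how deeply the two tags are nested, iterating over the whole tree avoids spelling out the path. This is a minimal sketch with a made-up sample document; the helper name collect_text and the tab-separated output format are assumptions, since the question does not show the real file or the desired txt layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file's structure is not shown in the question.
SAMPLE = (
    "<doc>"
    "<entry><l>first line</l><bib>Smith 2001</bib></entry>"
    "<entry><l>second line</l></entry>"
    "</doc>"
)


def collect_text(root, tags):
    """Return {tag: [text, ...]} for every element with one of the given tags, at any depth."""
    found = {tag: [] for tag in tags}
    for element in root.iter():
        if element.tag in found:
            # itertext() also picks up text inside nested child elements.
            found[element.tag].append("".join(element.itertext()))
    return found


root = ET.fromstring(SAMPLE)
result = collect_text(root, ("l", "bib"))

# One "tag<TAB>text" line per match; write this string to a .txt file as needed.
output = "\n".join(
    "%s\t%s" % (tag, text) for tag in ("l", "bib") for text in result[tag]
)
print(output)
```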
I'm not sure if you've heard of BeautifulSoup, but I believe it's very useful for this sort of task. Of course there are multiple ways to accomplish what you're asking; I'll explain it using BeautifulSoup. You can find the documentation here.
Install: pip install beautifulsoup4
from bs4 import BeautifulSoup
my_xml='''<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
</CATALOG>'''
souped = BeautifulSoup(my_xml, 'xml')  # the 'xml' parser requires lxml to be installed
>>> print(souped.find("COMMON").text) # Finds the first instance
Bloodroot
>>> _commons = souped.find_all("COMMON") # Returns a list of all instances
>>> print(_commons[0].text)
Bloodroot

Parse xml from file using etree works when reading string, but not a file

I am a relative newbie to Python and SO. I have an XML file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having trouble getting the right output. Here is my code:
from xml.etree import ElementTree as etree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to be able to read from a file instead of a string. So I try this code:
from xml.etree import ElementTree as etree
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print(identifier, description)
Now my result is "None None". I have a feeling I'm either not reading the file in correctly or something is wrong with the output. Here are the contents of test3.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>
Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
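The difference is easy to reproduce without a file on disk. The sketch below assumes a shortened, well-formed version of test3.xml (only the default namespace declaration is kept, and the closing tag is added) and feeds it to the parser through an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for test3.xml; only the default namespace is kept.
FILE_CONTENT = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3">
  <identifier>5e1882d822ec530069d6d29e28944369</identifier>
  <description>This is a paragraph about a shark.</description>
</response>"""

EOL = "{http://www.eol.org/transfer/content/0.3}"

tree = ET.parse(io.BytesIO(FILE_CONTENT))

# Unqualified name: no match, because every element is in the default namespace.
unqualified = tree.findtext("identifier")

# Qualified name: matches.
identifier = tree.findtext(EOL + "identifier")
description = tree.findtext(EOL + "description")

print(unqualified, identifier, description)
```

So the file is read correctly; the string version only worked because that snippet had no xmlns declaration.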
You could also give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
root.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
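A sketch of how .nsmap can be turned into a search mapping with lxml, again assuming a shortened version of test3.xml. The default namespace is stored under the key None, which is not usable as a prefix in a path, so the example renames it to "eol" (an arbitrary name chosen here).

```python
from lxml import etree

# Shortened stand-in for test3.xml with the default namespace and one prefixed one.
XML = b"""<response xmlns="http://www.eol.org/transfer/content/0.3"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
  <identifier>5e1882d822ec530069d6d29e28944369</identifier>
  <description>This is a paragraph about a shark.</description>
</response>"""

root = etree.fromstring(XML)

# root.nsmap holds the declarations in scope; the default namespace has the key None.
namespaces = {(prefix or "eol"): uri for prefix, uri in root.nsmap.items()}

identifier = root.findtext("eol:identifier", namespaces=namespaces)
description = root.findtext("eol:description", namespaces=namespaces)

print(sorted(namespaces))
print(identifier, description)
```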
Have you thought of trying BeautifulSoup to parse your XML with Python?
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML
There is some good documentation and a healthy online community, so support is quite good.