I have used xpaths to great effect with both HTML and XML before, but can't seem to get any results this time.
The data is from http://www.ahrefs.com/api/, under "Example of an answer", saved to an .xml file
My code:
from lxml import etree
doc = etree.XML(open('example.xml').read())
print doc.xpath('//result')
which doesn't give any results.
Where am I going wrong?
You need to take the namespace of the document into account:
from lxml import etree
doc = etree.parse('example.xml')
print doc.xpath('//n:result',
namespaces={'n': "http://ahrefs.com/schemas/api/links/1"})
=>
[<Element {http://ahrefs.com/schemas/api/links/1}result at 0xc8d670>,
<Element {http://ahrefs.com/schemas/api/links/1}result at 0xc8d698>]
My experience is from using XPath in C#, but I believe the XML namespace is causing your query to fail. You'll need to either use some variation of the local-name() function, or check your documentation for some way of defining the namespace beforehand.
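As a rough sketch of the local-name() approach in lxml: the small document below is a hypothetical stand-in for the saved API response (only the namespace URI is taken from the answer above), and the variable names are illustrative.

```python
from lxml import etree

# Hypothetical stand-in for example.xml; only the namespace URI comes from the thread.
XML = b"""<links xmlns="http://ahrefs.com/schemas/api/links/1">
  <result><url>a</url></result>
  <result><url>b</url></result>
</links>"""

doc = etree.fromstring(XML)

# An unprefixed name only matches elements in no namespace, so this finds nothing.
plain = doc.xpath('//result')

# local-name() compares the tag name while ignoring the namespace.
by_local_name = doc.xpath('//*[local-name() = "result"]')

# Binding a prefix of your choice to the namespace URI also works.
by_prefix = doc.xpath(
    '//n:result',
    namespaces={'n': 'http://ahrefs.com/schemas/api/links/1'})

print(len(plain), len(by_local_name), len(by_prefix))  # 0 2 2
```

The local-name() form is handy for quick exploration, but it matches elements of that name in any namespace, so the prefixed form is usually the safer choice.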
I am testing against the following test document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>hi there</title>
</head>
<body>
<img class="foo" src="bar.png"/>
</body>
</html>
If I parse the document using lxml.html, I can get the IMG with an xpath just fine:
>>> root = lxml.html.fromstring(doc)
>>> root.xpath("//img")
[<Element img at 1879e30>]
However, if I parse the document as XML and try to get the IMG tag, I get an empty result:
>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath("//img")
[]
I can navigate to the element directly:
>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>
But of course that doesn't help me process arbitrary documents. I would also expect to be able to query etree to get an xpath expression that will directly identify this element, which, technically I can do:
>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]
But that xpath is, again, obviously not useful for parsing arbitrary documents.
Obviously I am missing some key issue here, but I don't know what it is. My best guess is that it has something to do with namespaces but the only namespace defined is the default and I don't know what else I might need to consider in regards to namespaces.
So, what am I missing?
The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace.
Try this:
>>> tree.getroot().xpath(
... "//xhtml:img",
... namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
... )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]
XPath considers all unprefixed names to be in "no namespace".
In particular the spec says:
"A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded). "
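A small sketch of what that rule means in practice, using a trimmed copy of the test document from the question (the variable names are illustrative). Note that the unprefixed attribute name in the predicate still matches, because attributes without a prefix are in no namespace in the document as well.

```python
from lxml import etree

# Trimmed copy of the XHTML test document from the question.
XHTML = (b'<html xmlns="http://www.w3.org/1999/xhtml">'
         b'<body><img class="foo" src="bar.png"/></body></html>')

root = etree.fromstring(XHTML)
ns = {"xhtml": "http://www.w3.org/1999/xhtml"}

# Unprefixed element name: looked up in no namespace, so nothing matches.
no_prefix = root.xpath("//img")

# Prefixed element name: the prefix is resolved through the mapping passed in.
with_prefix = root.xpath("//xhtml:img", namespaces=ns)

# Attributes stay unprefixed: the default namespace never applies to them.
sources = root.xpath("//xhtml:img[@class='foo']/@src", namespaces=ns)

print(len(no_prefix), len(with_prefix), sources)  # 0 1 ['bar.png']
```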
See those two detailed explanations of the problem and its solution: here and here. The solution is to associate a prefix (with the API that's being used) and to use it to prefix any unprefixed name in the XPath expression.
Hope this helped.
Cheers,
Dimitre Novatchev
If you are going to use tags from a single namespace only, as I see is the case above, you are much better off using lxml.objectify.
In your case it would be like
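from lxml import objectify
# parse() returns an ElementTree, so fetch the root (<html>) element from it
root = objectify.parse(url).getroot()  # also available: objectify.fromstring
You can access the nodes as
root.head
body = root.body
for img in body.img:  # assuming all images are direct children of the body tag
    print(img.get('src'))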
While it might not be of great help in html, it can be highly useful in well structured xml.
For more info, check out http://lxml.de/objectify.html
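A self-contained sketch of the objectify approach, run against a trimmed copy of the XHTML test document with a second, made-up image added so there is something to loop over. It relies on objectify looking up children in the parent's namespace when none is given.

```python
from lxml import objectify

# Trimmed copy of the test document; the second <img> is made up for illustration.
XHTML = (b'<html xmlns="http://www.w3.org/1999/xhtml">'
         b'<head><title>hi there</title></head>'
         b'<body><img class="foo" src="bar.png"/>'
         b'<img class="foo" src="baz.png"/></body></html>')

root = objectify.fromstring(XHTML)  # root is the <html> element

# Child lookups inherit the namespace of the parent element,
# so no prefix or namespace map is needed here.
title = root.head.title.text

# Iterating over a child yields all siblings that share its tag.
sources = [img.get('src') for img in root.body.img]

print(title, sources)
```

This only reaches images that are direct children of body; for arbitrarily nested markup an XPath query with a bound prefix is the more general tool.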
I'm using lxml to parse an XML message. I want to convert the string into an XML tree, extract some information with XPath expressions, edit a few attributes, and then dump the XML back into a string.
lxml does a wonderful job of this, except for one thing: it does not preserve the form of the tags as they were originally written. What I mean is that if your input is:
xml_str = "<root><tag><tutu/></tag></root>"
or
xml_str = "<root><tag><tutu></tutu></tag></root>"
The following code will return the same thing:
>>> from lxml import etree
>>> root = etree.XML(xml_str)
>>> print etree.tostring(root)
<root><tag><tutu/></tag></root>
The tutu tag will be rendered no matter what as <tutu/>
I found here that by setting the text of the element to '' we can force the closing tag to be explicitly rendered.
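A minimal sketch of that trick, assuming it is acceptable to expand every empty element (it does not remember which form each tag had in the input, which is the open part of the question):

```python
from lxml import etree

root = etree.XML("<root><tag><tutu></tutu></tag></root>")

# By default an element with no text and no children is written in short form.
default_output = etree.tostring(root)

# Giving childless, text-less elements an empty string as text
# makes the serializer emit an explicit closing tag instead.
for element in root.iter():
    if len(element) == 0 and element.text is None:
        element.text = ""

expanded_output = etree.tostring(root)

print(default_output)   # b'<root><tag><tutu/></tag></root>'
print(expanded_output)  # b'<root><tag><tutu></tutu></tag></root>'
```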
My issue is the following : I need to have the exact same tag rendering before and after calling lxml (because some external program will perform a string comparison on both strings and a mismatch will be detected on <tutu/> and <tutu></tutu>)
I know we can create a custom ElementTree class as well as a custom parser. My idea was to record in the custom ElementTree, while parsing the string, which form each tag used (short or extended), and then, before calling tostring, to update the tree and set each element's text to None or '' so that the output keeps the same form as the input.
The question is : How may I know what type of tag I have? Or do you have any other idea on how to solve this issue?
Thanks a lot for your help
I understand that an etree object from the lxml library is a tree representation of an XML document. It is not clear to me what the .xpath function does. I just need to know how to interpret its argument and its output. I saw the following examples of its use:
tree.xpath('.//' + tagname)
html.xpath("string()")
html.xpath("//text()")
What do all these string() and //text() mean?
What you see there is XPath. https://de.wikipedia.org/wiki/XPath
The xpath() function will take a valid XPath and return the result. You could say XPath is a query language for xml documents.
This should clear things up: http://lxml.de/xpathxslt.html#the-xpath-method
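To make the three expressions from the question concrete, here is a small sketch on a made-up snippet (the markup and variable names are illustrative only):

```python
from lxml import etree

# Made-up snippet, just to have something to query.
root = etree.fromstring("<p>Hello <b>big</b> world</p>")

# './/' + tagname selects all descendant elements with that tag name.
bold_elements = root.xpath('.//' + 'b')

# string() returns the string value of the context node:
# all of its descendant text joined together.
whole_text = root.xpath("string()")

# //text() returns every text node in the document as a list of strings.
text_nodes = root.xpath("//text()")

print([el.tag for el in bold_elements])  # ['b']
print(whole_text)                        # Hello big world
print(text_nodes)                        # ['Hello ', 'big', ' world']
```

So the return type depends on the expression: a list of elements for node selections, a single string for string(), and a list of strings for text().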
In BeautifulSoup versions prior to 4 I could take any chunk of HTML and get a string representation in this way:
from BeautifulSoup import BeautifulSoup
soup3 = BeautifulSoup('<div><b>soup 3</b></div>')
print unicode(soup3)
'<div><b>soup 3</b></div>'
However with BeautifulSoup4 the same operation creates additional tags:
from bs4 import BeautifulSoup
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
print unicode(soup4)
'<html><body><div><b>soup 4</b></div></body></html>'
 ^^^^^^^^^^^^                        ^^^^^^^^^^^^^^
I don't need the outer <html><body>..</body></html> tags that BS4 is adding. I have looked through the BS4 docs and also searched inside the class but could not find any setting for suppressing the extra tags in the output. How do I do it? Downgrading to v3 is not an option since the SGML parser used in BS3 is not nearly as good as the lxml or html5lib parsers that are available with BS4.
If you want your code to work on everyone's machine, no matter which parser(s) they have installed, etc. (the same lxml version built on libxml2 2.9 vs. 2.8 acts very differently, the stdlib html.parser had some radical changes between 2.7.2 and 2.7.3, …), you pretty much need to handle all of the legitimate results.
If you know you have a fragment, something like this will give you exactly that fragment:
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
if soup4.body:
    return soup4.body.next
elif soup4.html:
    return soup4.html.next
else:
    return soup4
Of course if you know your fragment is a single div, it's even easier—but it's not as easy to think of a use case where you'd know that:
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
return soup4.div
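The same idea can be sketched as a self-contained helper. The function name fragment_html is made up for illustration, and the parser is named explicitly (here the standard library's "html.parser", which is an assumption about what is acceptable in your setup) so that the result does not depend on which parsers happen to be installed.

```python
from bs4 import BeautifulSoup


def fragment_html(markup, parser="html.parser"):
    """Return the markup as re-serialized by Beautiful Soup, without any
    <html>/<body> wrapper the chosen parser may have added.

    Hypothetical helper, not part of the bs4 API.
    """
    soup = BeautifulSoup(markup, parser)
    # Pick the innermost wrapper that exists; fall back to the soup itself
    # when the parser added none (as html.parser does for this input).
    container = soup.body or soup.html or soup
    # decode_contents() renders the children without the container's own tag.
    return container.decode_contents()


result = fragment_html('<div><b>soup 4</b></div>')
print(result)  # <div><b>soup 4</b></div>
```

Unlike the snippet above, this returns a string covering all top-level nodes of the fragment rather than only the first node.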
If you want to know why this happens:
BeautifulSoup is intended for parsing HTML documents. An HTML fragment is not a valid document. It's pretty close to a document, but that's not good enough to guarantee that you'll get back exactly what you give it.
As Differences between parsers says:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will give different results.
So, while this exact difference isn't documented, it's just a special case of something that is.
As was noted in the old BeautifulStoneSoup documentation:
The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML doesn't have a fixed tag set, so those heuristics don't apply. So BeautifulSoup doesn't do XML very well.
Use the BeautifulStoneSoup class to parse XML documents. It's a general class with no special knowledge of any XML dialect and very simple rules about tag nesting...
And in the BeautifulSoup4 docs:
There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.
Perhaps that will yield what you want.
I am a relative newbie to Python and SO. I have an xml file from which I need to extract information. I've been struggling with this for several days, but I think I finally found something that will extract the information properly. Now I'm having trouble getting the right output. Here is my code:
from xml.etree import ElementTree as etree
node = etree.fromstring('<dataObject><identifier>5e1882d882ec530069d6d29e28944396</identifier><description>This is a paragraph about a shark.</description></dataObject>')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
The result that I get is "5e1882d882ec530069d6d29e28944396 This is a paragraph about a shark.", which is what I want.
However, what I really need is to be able to read from a file instead of a string. So I try this code:
from xml.etree import ElementTree as etree
node = etree.parse('test3.xml')
identifier = node.findtext('identifier')
description = node.findtext('description')
print identifier, description
Now my result is "None None". I have a feeling I'm either not getting the file in right or something is wrong with the output. Here is the contents of test3.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<response xmlns="http://www.eol.org/transfer/content/0.3" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dwc="http://rs.tdwg.org/dwc/dwcore/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:dwct="http://rs.tdwg.org/dwc/terms/" xsi:schemaLocation="http://www.eol.org/transfer/content/0.3 http://services.eol.org/schema/content_0_3.xsd">
<identifier>5e1882d822ec530069d6d29e28944369</identifier>
<description>This is a paragraph about a shark.</description>
Your XML file uses a default namespace. You need to qualify your searches with the correct namespace:
identifier = node.findtext('{http://www.eol.org/transfer/content/0.3}identifier')
for ElementTree to match the correct elements.
You could also give the .find(), findall() and iterfind() methods an explicit namespace dictionary. This is not documented very well:
namespaces = {'eol': 'http://www.eol.org/transfer/content/0.3'} # add more as needed
root.findall('eol:identifier', namespaces=namespaces)
Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the eol: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.eol.org/transfer/content/0.3}identifier instead.
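Both forms can be tried with the standard library alone. This sketch uses a shortened, closed copy of test3.xml (the extra xmlns declarations are dropped and the closing tag is added, as an assumption about the rest of the file):

```python
import xml.etree.ElementTree as ET

# Shortened copy of test3.xml with only the default namespace kept.
XML = """<response xmlns="http://www.eol.org/transfer/content/0.3">
  <identifier>5e1882d822ec530069d6d29e28944369</identifier>
  <description>This is a paragraph about a shark.</description>
</response>"""

EOL = "http://www.eol.org/transfer/content/0.3"
root = ET.fromstring(XML)

# Unqualified name: no such element in "no namespace", so the result is None.
unqualified = root.findtext("identifier")

# Fully qualified name in {uri}tag notation.
qualified = root.findtext("{%s}identifier" % EOL)

# Prefix of your own choosing, resolved through the namespaces mapping.
prefixed = root.findtext("eol:description", namespaces={"eol": EOL})

print(unqualified)  # None
print(qualified)    # 5e1882d822ec530069d6d29e28944369
print(prefixed)     # This is a paragraph about a shark.
```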
If you can switch to the lxml library things are better; that library supports the same ElementTree API, but collects namespaces for you in a .nsmap attribute on elements.
Have you thought of trying beautifulsoup to parse your xml with python:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML
There is some good documentation and a healthy online group so support is quite good