Navigating in HTML by XPath in Python

So, I am accessing some url that is formatted something like the following:
<DOCUMENT>
<TYPE>A
<SEQUENCE>1
<TEXT>
<HTML>
<BODY BGCOLOR="#FFFFFF" LINK=BLUE VLINK=PURPLE>
</BODY>
</HTML>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>B
<SEQUENCE>2
...
As you can see, it starts a document (which is sequence number 1), finishes the document, and then the document with sequence 2 starts, and so on.
What I want to do is write an XPath expression in Python that gets just the document with sequence value 1 (or, equivalently, TYPE A).
I supposed that something like this would work:
import lxml
from lxml import html
page = html.fromstring(pagehtml)
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
however, it just gives me an empty list for the type_a variable.
Could someone please let me know what my mistake is in this code? I am really new to this XML stuff.

It might be because that's highly dubious HTML. The <SEQUENCE> tag is unclosed, so it could well be interpreted by lxml as containing all of the code until the next </DOCUMENT>, so it does not end up just containing the 1. When your XPath code then looks for a <SEQUENCE> containing 1, there isn't one.
Additionally, XML is case-sensitive, but HTML isn't. XPath is designed for XML, so it is also case sensitive, which would also stop your document matching <DOCUMENT>.
Try //DOCUMENT[starts-with(SEQUENCE,'1')]. That uses the XPath starts-with() function.
Ideally, if the input is under your control, you should instead just close the type and sequence tags (with </TYPE> and </SEQUENCE>) to make the input valid.
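For instance, the first document would then look like this:
<DOCUMENT>
<TYPE>A</TYPE>
<SEQUENCE>1</SEQUENCE>
<TEXT>
...
</TEXT>
</DOCUMENT>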

I'd like to point out that, apart from the great answer provided by @GKFX, the lxml.html module is capable of parsing broken HTML or HTML fragments. In fact, it will parse your string just fine and handle it well.
fromstring(string): Returns document_fromstring or fragment_fromstring, based on whether the string looks like a full document, or just a fragment.
The problem you have (perhaps caused by whatever other code generates the string) also lies in the fact that you haven't given the true path to access the SEQUENCE node.
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
Your above XPath tries to find all document nodes that have a child node called sequence whose value is 1; however, your document's first child node is type, not sequence, so you will never get what you want.
Consider rewriting it like this, which will get what you need:
page.xpath('//document[type/sequence=1]/descendant::*/text()')
['A\n ', '1\n ']
Since your HTML string is missing the closing tag for sequence, you cannot, however, get the correct result with another XPath like this:
page.xpath('//document[type/sequence=1]/../..//text()')
['A\n ', '1\n ', 'B\n ', '2']
That is because your sequence=1 has no closing tag, so sequence=2 becomes a child node of it.
I should also point out that your HTML string is still invalid, but lxml's parser is tolerant enough to handle your case just fine.
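Putting it together, a minimal sketch (assuming the snippet from the question is stored in pagehtml):
from lxml import html

pagehtml = """<DOCUMENT>
<TYPE>A
<SEQUENCE>1
<TEXT>
<HTML>
<BODY BGCOLOR="#FFFFFF" LINK=BLUE VLINK=PURPLE>
</BODY>
</HTML>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>B
<SEQUENCE>2
</DOCUMENT>"""

page = html.fromstring(pagehtml)
# lxml lowercases HTML tag names, and since <TYPE> is never closed,
# <SEQUENCE> is parsed as its child -- hence type/sequence in the path.
type_a = page.xpath('//document[type/sequence=1]/descendant::*/text()')
print(type_a)  # per the answer above: ['A\n ', '1\n ']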

Try using a relative path, explicitly specifying the correct path to your element (not skipping type):
page.xpath("//document[./type/sequence = 1]")
See: http://pastebin.com/ezQXtKcr
Output:
Trying original post (novice_007): //document[sequence=1]/descendant::*/text()
[]
Using GKFX's answer: //DOCUMENT[starts-with(SEQUENCE,'1')]
[]
My answer: //document[./type/sequence = 1]
[<Element document at 0x1bfcb30>]
Currently, the XPath I provided is the only one that does what was asked: "... to just get the document with sequence value 1".

Related

Incorrect parent element lxml

I am implementing a web scraping program in Python.
Consider the following HTML snippet:
<div>
<b>
<i>
HelloWorld
</i>
HiThere
</b>
</div>
If I wish to use lxml to extract only my bold or italicized text, I use the following:
tree = etree.fromstring(myhtmlstr, htmlparser)
opp1 = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
This gives me the correct result, i.e. the result of my opp1 is:
['HelloWorld', 'HiThere']
So far, everything is perfect. However, the real problem arises when I query the parents of the tags. As expected, the outputs of opp1[0].getparent().tag and opp1[0].getparent().getparent().tag are i and b.
The real problem is with the second text node. Ideally, the parent of opp1[1] should be the b tag. However, the outputs of opp1[1].getparent().tag and opp1[1].getparent().getparent().tag are i and b again.
You can verify the same in the following code:
from lxml import etree
htmlstr = """<div><b><i>HelloWorld</i>HiThere</b></div>"""
htmlparser = etree.HTMLParser()
tree = etree.fromstring(htmlstr, htmlparser)
opp1 = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
print(opp1)
print(opp1[0].getparent(), opp1[0].getparent().getparent())
print(opp1[1].getparent(), opp1[1].getparent().getparent())
Can someone point out why this is the case? What can I do to correct it? I plan to use only lxml for my program, and do not want any solution that uses bs4.
The issue seems to stem from LXML's (and ElementTree's) data model, where an element is roughly "tag, attributes, text, children, tail"; the DOM data model has actual nodes for text too.
If you change your program to do
for x in tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]"):
    print(x, x.getparent(), "text?", x.is_text, "tail?", x.is_tail)
it will print
HelloWorld <Element i at 0x10aa0ccd0> text? True tail? False
HiThere <Element i at 0x10aa0ccd0> text? False tail? True
i.e. "HiThere" is the tail of the i element, since that's the way the Etree data model represents intermingled text and tags.
The takeaway here (which should probably work for your use case) is to consider .getparent().getparent() as the effective parent of a text result that has is_tail=True.
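As a sketch, that rule can be wrapped in a small helper (effective_parent is a hypothetical name, not an lxml API), using opp1 from the snippet above:
def effective_parent(text_node):
    # For tail text, the element that logically contains the text is the
    # grandparent; for ordinary text it is the parent itself.
    parent = text_node.getparent()
    return parent.getparent() if text_node.is_tail else parent

print(effective_parent(opp1[0]).tag)  # i
print(effective_parent(opp1[1]).tag)  # b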

Parse self-closing tags missing the '/'

I'm trying to parse some old SGML code using BeautifulSoup4 and build an element tree with the data. It's mostly working fine, but some of the tags that should be self-closing aren't marked as such. For example:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element1>
When I parse the data, it ends up like:
<element1>
<element2 attr="0">
<element3>Data</element3>
</element2>
</element1>
What I'd like is for it to assume that if it doesn't find a closing tag for such an element, it should treat it as a self-closing tag instead of assuming that everything after it is a child and putting the closing tag as late as possible, like so:
<element1>
<element2 attr="0"/>
<element3>Data</element3>
</element1>
Can anyone point me to a parser that could do this, or some way to modify an existing one to act this way? I've dug through a few parsers (lxml, lxml-xml, html5lib) but I can't figure out how to get these results.
What I ended up doing was extracting from the DTD all empty elements whose end tag can be omitted (e.g. <!ELEMENT elem_name - o EMPTY >), creating a list of those elements, then using regex to close all the tags in the list. The resulting text is then passed to the XML parser.
Here's a boiled down version of what I'm doing:
import re
from lxml.html import soupparser
from lxml import etree as ET
empty_tags = ['elem1', 'elem2', 'elem3']
markup = """
<elem1 attr="some value">
<elem2/>
<elem3></elem3>
"""
for t in empty_tags:
    # Capture "<tag" plus any attributes, then replace the bare ">" (and an
    # optional empty "</tag>" after it) with a self-closing "/>".
    markup = re.sub(r'(<{0}(?:\s+[^>/]*)?)>\s*(?:</{0}>)?\n?'.format(t),
                    r'\1/>\n', markup)
tree = soupparser.fromstring(markup)
print(ET.tostring(tree, pretty_print=True).decode("utf-8"))
The output should be:
<elem1 attr="some value"/>
<elem2/>
<elem3/>
(This will actually be enclosed in <html> and <body> tags, since the parser adds those in.)
It will leave attributes alone, and won't touch tags that are already self-closed. If the tag has a closing tag, but is empty, it will remove the closing tag and self-close the tag instead, just so it's standardized.
It's not a very generic solution but, as far as I can tell, there's no other way to do this without knowing which tags should be closed. Even OpenSP needs the DTD to know which tags it should be closing.

Parsing broken XML with numbers as tag names

I have lots of XML files that have tag names in digit format, e.g. <12345>Golly</12345>.
When parsing using ElementTree I get the error not well-formed (invalid token). I assume this is because the keys are digits and not words. When I try to change/replace the keys into strings by adding double quotes using regex:
xmlstr = re.sub('<([\d]+)>','<"' + str(re.search('<([\d]+)>', xmlstr).group(1))+ '">',xmlstr)
xmlstr = re.sub('</([\d]+)>','</"' + str(re.search('</([\d]+)>', xmlstr).group(1))+ '">',xmlstr)
all other keys are replaced using the first key found (all keys end up being the same, whereas the keys in the original file are unique in each document). I guess the files were converted directly from JSON to XML. The keys represent ID numbers and the values are the names associated with those ID numbers.
I was wondering if there is a way to work with digits as tag names, or a way to replace the keys one by one instead of replacing all matches with the one string found.
.group(1) returns the first occurrence, which causes the problem.
Please help.
I think you need to capture both the numeric tag name and the content in different capturing groups and then reference them in the replacement string:
In [2]: data = "<content><12345>Golly</12345><67890>Jelly</67890></content>"
In [3]: re.sub(r"<(\d+)>(.*?)</\d+>", r'<item id="\1">\2</item>', data)
Out[3]: '<content><item id="12345">Golly</item><item id="67890">Jelly</item></content>'
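Once the tags are rewritten like this, the string parses cleanly. A quick follow-up sketch (my addition, not part of the original answer), checking the result with ElementTree:
import re
import xml.etree.ElementTree as ET

data = "<content><12345>Golly</12345><67890>Jelly</67890></content>"
fixed = re.sub(r"<(\d+)>(.*?)</\d+>", r'<item id="\1">\2</item>', data)

# Map each former numeric tag (now an id attribute) to its text.
root = ET.fromstring(fixed)
print({item.get("id"): item.text for item in root.iter("item")})
# {'12345': 'Golly', '67890': 'Jelly'}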
Though, it is difficult to come up with something 100% reliable without having access to the possible variations of the input XML data. For instance, I am not sure if this expression going to handle nested numerical tags nicely.
You may also want to explore possibilities to parse the document in lxml's "recovery" mode.
Another possible tool that may help in this situation is BeautifulSoup. You may try a non-traditional approach: parse the XML data with the lenient html5lib parser:
In [1]: from bs4 import BeautifulSoup
In [2]: data = "<content><12345>Golly</12345><67890>Jelly</67890></content>"
In [3]: soup = BeautifulSoup(data, "html5lib")
In [3]: print(soup.prettify())
<html>
<head>
</head>
<body>
<content>
<12345>Golly
<!--12345-->
<67890>Jelly
<!--67890-->
</content>
</body>
</html>
It is not the desired output, of course, but may be something you can work with and extract the keys and words.
The lxml package will make your life easier than struggling with regex.
Take a look at the documentation page.
pip install lxml
import lxml.etree

file_path = 'your/xml/file.xml'
parser_obj = lxml.etree.XMLParser(recover=True)
tree = lxml.etree.parse(file_path, parser=parser_obj)

Extract text between tags with XPath including markup

I have the following piece of XML:
...<span class="st">In Tim <em>Power</em>: Politieman...</span>...
I want to extract the part between the <span> tags.
For this I use XPath:
/span[@class="st"]
This however will extract everything including the <span>, and
/span[@class="st"]/text()
will return a list of two text elements, one containing "In Tim", the other ": Politieman". The <em>..</em> is not included and is handled like a separator.
Is there a pure XPath solution which returns:
In Tim <em>Power</em>: Politieman...
EDIT
Thanks to @helderdarocha and @TextGeek. It seems non-trivial to extract the plain text including the <em> with XPath only.
The /span[@class="st"]/node() solution creates a list containing the individual nodes, from which it is trivial in Python to create a string.
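For example, a sketch of that joining step (assuming the fragment from the question; the variable names are mine):
from lxml import etree, html

span = html.fromstring('<span class="st">In Tim <em>Power</em>: Politieman...</span>')

parts = []
for node in span.xpath('node()'):  # relative to the span element itself
    if isinstance(node, str):      # text nodes come back as smart strings
        parts.append(node)
    else:
        # Serialize element nodes back to markup; with_tail=False because the
        # tail text (": Politieman...") is already a separate node() result.
        parts.append(etree.tostring(node, encoding='unicode', with_tail=False))

print(''.join(parts))  # In Tim <em>Power</em>: Politieman...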
To get any child node you can use:
/span[@class="st"]/node()
This will return:
- two child text nodes, and
- the full <em> node (element and contents).
If you actually want all the text() nodes, including the ones inside em, then get all the text() descendants:
/span[@class="st"]//text()
or
/span[@class="st"]/descendant::text()
This will return all three text nodes, including the text inside <em>, but not the <em> element itself.
Sounds like you want the equivalent of the JavaScript DOM innerHTML property, but for XML. I don't think there's a way to do that in pure XPath.
XPath doesn't really operate on markup strings like "<em>" and "</em>" at all -- it works with a tree of Node objects (there might possibly be an XPath implementation that tries to work directly off markup, but I doubt it). Most XPath implementations wouldn't even have the 4 characters "<em>" anywhere (except maybe kept around for printing error messages or something), and of course the DOM could have been built from scratch rather than from XML or other input in the first place. Likewise, XPath doesn't really figure on handing back marked-up strings, but lists of nodes.
In XSLT or XQuery you can do this easily, but not in XPath by itself, unless I'm missing something.

lxml bug in .xpath?

After going through the XPath-in-lxml tutorial for Python, I'm finding it hard to understand two behaviors that seem like bugs to me. Firstly, lxml seems to return a list even when my XPath expression clearly selects only one element; secondly, .xpath seems to return the elements' parent rather than the elements themselves for a straightforward XPath search expression.
Is my understanding of XPath all wrong or does lxml indeed have a bug?
The script to replicate the behaviors I'm talking about:
from lxml.html.soupparser import fromstring
doc = fromstring("""
<html>
<head></head>
<body>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
</body>
</html>
""")
print(doc.xpath("//html"))
#[<Element html at 1f385e0>]
#(This makes sense - return a list of all possible matches for html)
print(doc.xpath("//html[1]"))
#[<Element html at 1f385e0>]
#(This doesn't make sense - why do I get a list when there
#can clearly only be 1 element returned?)
print(doc.xpath("body"))
#[<Element body at 1d003e8>]
#(This doesn't make sense - according to
#http://www.w3schools.com/xpath/xpath_syntax.asp if I use a tag name
#without any leading / I should get the *child* nodes of the named
#node, which in this case would mean I get a list of
#p tags [<Element p at ...>, <Element p at ...>])
It's because the context node of doc is the html node. When you use doc.xpath('body'), it selects the body child element of html. This conforms to the XPath 1.0 standard.
To get all p tags, use doc.findall(".//p").
As per the guide, the expression nodename selects all child nodes of the named node.
Thus, to use a bare nodename (without a leading /), the named node must be a child of the context node (use a dot to refer to the context node itself, as in .//p).
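For instance, with the doc parsed in the question above:
body = doc.xpath("body")[0]     # relative path: body is a child of the
                                # context node (the html element)
paras = doc.findall(".//p")     # all p descendants of the context node
print(body.tag)                 # body
print([p.text for p in paras])  # ['Paragraph 1', 'Paragraph 2']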
In fact, doc.xpath("//html[1]") can return more than one node with a different input document than your example. That path picks the first sibling that matches //html; if there are matching non-sibling elements, it will select the first sibling from each group.
The XPath (//html)[1] forces a different order of evaluation: it selects all of the matching elements in the document and then chooses the first.
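A quick sketch of that difference (a hypothetical document with two groups of p siblings, since html itself can only occur once):
from lxml import html

doc2 = html.fromstring(
    "<div><section><p>a</p><p>b</p></section><section><p>c</p></section></div>")

# //p[1]: every p that is the first p among its siblings - one per section.
print([p.text for p in doc2.xpath("//p[1]")])    # ['a', 'c']
# (//p)[1]: collect all p elements first, then take the first overall.
print([p.text for p in doc2.xpath("(//p)[1]")])  # ['a']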
But, in any case, it's a better API design to always return a list. Otherwise, code would always have to test for single or None values before processing the list.
