I am implementing a web scraping program in Python.
Consider my following HTML snippet.
<div>
<b>
<i>
HelloWorld
</i>
HiThere
</b>
</div>
If I wish to use lxml to extract my bold or italicized texts only, I use the following command
tree = etree.fromstring(myhtmlstr, htmlparser)
opp1 = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
This gives me the correct result, i.e. the result of my opp1 is :
['HelloWorld', 'HiThere']
So far, everything is perfect. However, the real problem arises if I try to query the parents of the tags. As expected, the output of opp1[0].getparent().tag and opp1[0].getparent().getparent().tag are i and b.
The real problem, however, is with the second text result. Ideally, the parent of opp1[1] should be the b tag. However, the output of opp1[1].getparent().tag and opp1[1].getparent().getparent().tag are i and b again.
You can verify the same in the following code:
from lxml import etree
htmlstr = """<div><b><i>HelloWorld</i>HiThere</b></div>"""
htmlparser = etree.HTMLParser()
tree = etree.fromstring(htmlstr, htmlparser)
opp1 = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")
print(opp1)
print(opp1[0].getparent(), opp1[0].getparent().getparent())
print(opp1[1].getparent(), opp1[1].getparent().getparent())
Can someone point out why this is the case? What can I do to correct it? I plan to use only lxml for my program, and do not want any solution that uses bs4.
The issue seems to stem from LXML's (and ElementTree's) data model, where an element is roughly "tag, attributes, text, children, tail"; the DOM data model has actual nodes for text too.
If you change your program to do
for x in tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]"):
    print(x, x.getparent(), "text?", x.is_text, "tail?", x.is_tail)
it will print
HelloWorld <Element i at 0x10aa0ccd0> text? True tail? False
HiThere <Element i at 0x10aa0ccd0> text? False tail? True
i.e. "HiThere" is the tail of the i element, since that's the way the Etree data model represents intermingled text and tags.
The takeaway here (which should probably work for your use case) is to consider .getparent().getparent() as the effective parent of a text result that has is_tail=True.
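As a minimal sketch of that takeaway (the helper name effective_parent is made up for illustration and is not part of lxml), the is_tail check can be wrapped in a small function:

```python
from lxml import etree

def effective_parent(text_node):
    """Return the element that encloses an XPath text() result in the markup."""
    parent = text_node.getparent()
    # Tail text is stored on the preceding sibling element, so the enclosing
    # element is one level further up.
    return parent.getparent() if text_node.is_tail else parent

html_str = "<div><b><i>HelloWorld</i>HiThere</b></div>"
tree = etree.fromstring(html_str, etree.HTMLParser())
texts = tree.xpath(".//text()[ancestor::b or ancestor::i or ancestor::strong]")

enclosing_tags = [(str(t), effective_parent(t).tag) for t in texts]
print(enclosing_tags)
```

With the snippet from the question this should pair HelloWorld with i and HiThere with b.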
I have the following Code and Output (without closing br-tag):
from lxml import html
root = html.fromstring('<br/>')
html.tostring(root)
Output : '<br>'
However, I expected something similar to this (a closing tag is also added for this self-closing element):
from lxml import html
root = html.fromstring('<a/>')
html.tostring(root)
Output : '<a></a>'
Is there a way to produce the wanted output or just '<br/>' ?
Maybe somebody could also point to the relevant part in the source code.
HTML is not XML compliant and this is one of the places where that shows up. A break is <br>. Since a break element never has internal content (child elements or text), it does not need to be written as a self-closing <br/> or as <br></br> (although most browsers tolerate such markup). You can change the output method to get XML-compliant serialization.
html.tostring(root, method='xml')
This could negatively impact other parts of the document, though.
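A small sketch comparing the two output methods may help here. It only uses html.fromstring and html.tostring as in the question; the variable names are illustrative. It also hints at the kind of side effect meant above: with method='xml', other empty elements such as <a> are collapsed to the self-closing form as well, which HTML consumers may not read the way XML consumers do.

```python
from lxml import html

br = html.fromstring("<br/>")
anchor = html.fromstring("<a/>")

# Default HTML serialization: void elements get no closing tag.
br_as_html = html.tostring(br)
# XML serialization: empty elements are written in self-closing form.
br_as_xml = html.tostring(br, method="xml")
anchor_as_xml = html.tostring(anchor, method="xml")

print(br_as_html, br_as_xml, anchor_as_xml)
```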
I have an XML string
xml_str = '<Foo><Bar>burp</Bar></Foo>'
I'm parsing it with xml etree
import xml.etree.ElementTree as ET
root_element = ET.fromstring(xml_str)
This creates an Element object(root_element) with a tag, tail, text, and attrib values within it. I can see all of them when debugging. However, I can't see any children Elements while debugging. I know the children are there because I can access them in a for loop.
for child in root_element:
    *break point here*
Below is a screenshot of what I'm seeing
Is there a way to see all elements at once while debugging? And is this issue because the XML parser is a JIT parser or something?
It sounds like you want to be able to see the available elements in the XML document you want to parse.
This will list every element in the tree, starting with the root element itself
all_children = list(root_element.iter())
This will produce
[<Element 'Foo' at 0x11b315908>, <Element 'Bar' at 0x11b315c28>]
This output, however, doesn't respect the 'shape' of the XML.
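If the shape matters, one option is a short recursive walk that indents each tag by its depth. This is only a sketch using the standard library; the function name outline is made up for illustration.

```python
import xml.etree.ElementTree as ET

def outline(element, depth=0):
    """Return one indented line per element so the nesting stays visible."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an Element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines

root_element = ET.fromstring("<Foo><Bar>burp</Bar></Foo>")
tree_lines = outline(root_element)
print("\n".join(tree_lines))
```

Evaluating outline(root_element) in the debugger's console gives the whole tree at once instead of one child per loop iteration.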
When I want to parse XML, I find it easier to use ElementTree, but my first experiences parsing XML were with BeautifulSoup. I still like the prettify() function.
This code,
from bs4 import BeautifulSoup

pretty = ""
# html.parser lowercases tag names, hence "foo" rather than "Foo"
soup = BeautifulSoup(xml_str, 'html.parser')
for value in soup.find_all("foo"):
    pretty += value.prettify()
produces this output
print(pretty)
<foo>
<bar>
burp
</bar>
</foo>
You can replace the find_all() with specific elements you might be looking for.
So, I am accessing some url that is formatted something like the following:
<DOCUMENT>
<TYPE>A
<SEQUENCE>1
<TEXT>
<HTML>
<BODY BGCOLOR="#FFFFFF" LINK=BLUE VLINK=PURPLE>
</BODY>
</HTML>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>B
<SEQUENCE>2
...
As you can see, it starts a document, (which is the sequence number 1), and then finishes the document, and then document with sequence 2 starts and so on.
So, what I want to do is write an XPath expression in Python that selects just the document with sequence value 1 (or, equivalently, TYPE A).
I supposed that such a thing would work:
import lxml
from lxml import html
page = html.fromstring(pagehtml)
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
however, it just gives me an empty list as type_a variable.
Could someone please let me know what is my mistake in this code? I am really new to this xml stuff.
It might be because that's highly dubious HTML. The <SEQUENCE> tag is unclosed, so it could well be interpreted by lxml as containing all of the code until the next </DOCUMENT>, so it does not end up just containing the 1. When your XPath code then looks for a <SEQUENCE> containing 1, there isn't one.
Additionally, XML is case-sensitive, but HTML isn't. XPath is designed for XML, so it is also case sensitive, which would also stop your document matching <DOCUMENT>.
Try //DOCUMENT[starts-with(SEQUENCE,'1')]. That's based on Xpath using starts-with function.
Ideally, if the input is under your control, you should instead just close the type and sequence tags (with </TYPE> and </SEQUENCE>) to make the input valid.
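The case-sensitivity point can be checked directly. The sketch below uses a shortened stand-in for the page (the string page_html is an assumption, not the real input) and lists the tag names lxml.html produces. Its HTML parser lowercases element names, so it is worth comparing an uppercase and a lowercase XPath before settling on one.

```python
from lxml import html

# Shortened stand-in for the real page: uppercase names, unclosed TYPE/SEQUENCE.
page_html = (
    "<DOCUMENT><TYPE>A<SEQUENCE>1</DOCUMENT>"
    "<DOCUMENT><TYPE>B<SEQUENCE>2</DOCUMENT>"
)
page = html.fromstring(page_html)

# Tag names as stored in the parsed tree (comments etc. are skipped).
tags = [el.tag for el in page.iter() if isinstance(el.tag, str)]
upper_matches = page.xpath("//DOCUMENT")
lower_matches = page.xpath("//document")

print(tags)
print(len(upper_matches), len(lower_matches))
```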
I'd like to point out, apart from the great answer provided by @GKFX, that the lxml.html module is capable of parsing broken HTML or a fragment of HTML. In fact it will parse your string just fine and handle it well.
fromstring(string): Returns document_fromstring or
fragment_fromstring, based on whether the string looks like a full
document, or just a fragment.
The problem you have (perhaps caused by whatever other code generates the string) also lies in the fact that you haven't given the true path to the SEQUENCE node.
type_a = page.xpath("//document[sequence=1]/descendant::*/text()")
The XPath above looks for document nodes that have a direct child node called sequence whose value is 1. However, the first child node of your document is type, not sequence (sequence ends up nested inside the unclosed type), so you will never get what you want.
Rewriting it like this gets what you need:
page.xpath('//document[type/sequence=1]/descendant::*/text()')
['A\n ', '1\n ']
However, since your HTML string is missing the closing tag for sequence, you cannot get the correct result with another XPath like this:
page.xpath('//document[type/sequence=1]/../..//text()')
['A\n ', '1\n ', 'B\n ', '2']
That is because the sequence containing 1 has no closing tag, so the content for sequence 2 becomes a child node of it.
I have to stress that your HTML string is still invalid, but lxml's parser is tolerant enough to handle your case just fine.
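To see the nesting described above, it can help to ask each sequence element for its parent. This is a sketch on a shortened stand-in for the page (page_html is an assumption, not the real input); the real page may nest differently because of its extra markup.

```python
from lxml import html

# Shortened stand-in for the real page, with the same unclosed tags.
page_html = (
    "<DOCUMENT><TYPE>A<SEQUENCE>1</DOCUMENT>"
    "<DOCUMENT><TYPE>B<SEQUENCE>2</DOCUMENT>"
)
page = html.fromstring(page_html)

# Where did the parser attach each <sequence>?
parents = [seq.getparent().tag for seq in page.iter("sequence")]
for seq in page.iter("sequence"):
    print(repr(seq.text), "is nested under", seq.getparent().tag)

# A predicate that skips <type> finds nothing; the full path does.
direct = page.xpath("//document[sequence]")
nested = page.xpath("//document[type/sequence]")
print(len(direct), len(nested))
```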
Try using a relative path that explicitly spells out the correct path to your element (not skipping type):
page.xpath("//document[./type/sequence = 1]")
See: http://pastebin.com/ezQXtKcr
Output:
Trying original post (novice_007): //document[sequence=1]/descendant::*/text()
[]
Using GKFX's answer: //DOCUMENT[starts-with(SEQUENCE,'1')]
[]
My answer: //document[./type/sequence = 1]
[<Element document at 0x1bfcb30>]
Currently, the xpath I provided is the only one that ... to just get the document with sequence value 1
I'm trying to parse an XML document. The document has HTML like formatting embedded, for example
<p>This is a paragraph
<em>with some <b>extra</b> formatting</em>
scattered throughout.
</p>
So far I've used
import xml.etree.cElementTree as xmlTree
to handle the XML document, but I am not sure if this provides the functionality I look for. How would I go about handling the text nodes here?
Also, is there a way to find the closing tags in a document?
Thanks!
If your XML document fits in memory, you should use Beautiful Soup, which will give you much cleaner access to the document. You'll be able to select a node and automatically interact with its children; every node has a .next attribute, which steps through the text up to the next tag.
So:
>>> b = BeautifulSoup.BeautifulStoneSoup("<p>This is a paragraph <em>with some <b>extra</b> formatting</em> scattered throughout.</p>")
>>> b.find('p')
<p>This is a paragraph <em>with some <b>extra</b> formatting</em> scattered throughout.</p>
>>> b.find('p').next
u'This is a paragraph '
>>> b.find('p').next.next
<em>with some <b>extra</b> formatting</em>
That, or something like it, should solve your problem.
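For comparison, a similar walk is possible with the ElementTree module the question already imports. In its data model, text before the first child lives in .text and text after a child's closing tag lives in that child's .tail. The sketch below (the generator name text_runs is made up for illustration) yields each run of text together with the tag that encloses it.

```python
import xml.etree.ElementTree as ET

def text_runs(element):
    """Yield (enclosing tag, text) pairs in document order."""
    if element.text:
        yield (element.tag, element.text)
    for child in element:
        yield from text_runs(child)
        # A child's tail follows its closing tag, so it belongs to `element`.
        if child.tail:
            yield (element.tag, child.tail)

doc = ET.fromstring(
    "<p>This is a paragraph <em>with some <b>extra</b> formatting</em>"
    " scattered throughout.</p>"
)
runs = list(text_runs(doc))
for tag, text in runs:
    print(tag, repr(text))
```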
If it doesn't fit in memory, you'll need to subclass a SAX parser, which is a bit more work. To do that, you use from xml.parsers import expat and write handlers for opening and closing of tags. It's a bit more involved.
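A minimal sketch of that handler-based approach with the standard library's expat bindings follows. The handler function names and the events list are illustrative only; a real parser would feed the file in chunks rather than pass one string.

```python
from xml.parsers import expat

events = []

def on_start(name, attrs):
    events.append(("start", name))

def on_end(name):
    # This is where closing tags are seen.
    events.append(("end", name))

def on_text(data):
    if data.strip():
        events.append(("text", data.strip()))

parser = expat.ParserCreate()
parser.buffer_text = True  # deliver contiguous character data in one call
parser.StartElementHandler = on_start
parser.EndElementHandler = on_end
parser.CharacterDataHandler = on_text

parser.Parse(
    "<p>This is a paragraph <em>with some <b>extra</b> formatting</em>"
    " scattered throughout.</p>",
    True,  # final chunk
)

for event in events:
    print(event)
```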
I am writing some HTML parsers using LXML Xpath feature. It seems to be working fine, but I have one main problem.
When parsing all the HTML <p> tags, there are words that use the tags <b>, <i> and etc. I need to keep those tags.
When parsing the HTML, for example;
<div class="ArticleDetail">
<p>Hello world, this is a <b>simple</b> test, which contains words in <i>italic</i> and others.
I have a <strong>strong</strong> tag here. I guess this is a silly test.
<br/>
Ops, line breaks.
<br/></p>
If I run this Python code;
x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for stuff in x:
    print(stuff.text_content())
This seems to work fine, but it removes all the other tags instead of p only.
Output:
Hello world, this is a simple test, which contains words in italic and others.
I have a strong tag here. I guess this is a silly test.
Ops, line breaks.
As you can see, it removed all the <b>, <i> and <strong> tags. Is there any way to keep them?
You are currently retrieving only the text content, not the HTML content (which would include tags).
You want to retrieve all child nodes of your XPath match instead:
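import lxml.html
from lxml import etree

x = lxml.html.fromstring("...html text...").xpath("//div[@class='ArticleDetail']/p")
for elem in x:
    # each descendant is serialized with its tags (and its tail text)
    for child in elem.iterdescendants():
        print(etree.tostring(child, encoding="unicode"))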
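If the goal is the paragraph's content as one string with the inline tags kept, a possible variation is to join the paragraph's leading text with the serialized direct children. This is a sketch only; the helper name inner_html and the shortened sample markup are assumptions made for illustration.

```python
import lxml.html

def inner_html(element):
    """Return the markup inside `element`, without the element's own tags."""
    parts = [element.text or ""]
    # Direct children only: tostring() already includes each child's
    # descendants and its tail text.
    for child in element:
        parts.append(lxml.html.tostring(child, encoding="unicode"))
    return "".join(parts)

doc = lxml.html.fromstring(
    '<div class="ArticleDetail"><p>Hello world, this is a <b>simple</b> test,'
    " which contains words in <i>italic</i> and others.<br>Ops, line breaks.</p></div>"
)
results = [inner_html(p) for p in doc.xpath("//div[@class='ArticleDetail']/p")]
for markup in results:
    print(markup)
```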