I want to parse tables in HTML, but I found that lxml can't parse them. What's wrong?
# -*- coding: utf8 -*-
import urllib
import lxml.etree
keyword = 'lxml+tutorial'
url = 'http://www.baidu.com/s?wd='
if __name__ == '__main__':
    page = 0
    link = url + keyword + '&pn=' + str(page)
    f = urllib.urlopen(link)
    content = f.read()
    f.close()
    tree = lxml.etree.HTML(content)
    query_link = '//table'
    info_link = tree.xpath(query_link)
    print info_link
The printed result is just []...
lxml's documentation says, "The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing."
And sure enough, the HTML returned by Baidu is invalid: the W3C validator reports "173 Errors, 7 warnings". I don't know (and haven't investigated) whether these particular errors have caused your trouble with lxml, because I think that your strategy of using lxml to parse HTML found "in the wild" (which is nearly always invalid) is doomed.
For parsing invalid HTML, you need a parser that implements the (surprisingly bizarre!) HTML error recovery algorithm. So I recommend swapping lxml for html5lib, which handles Baidu's invalid HTML with no problems:
>>> import urllib
>>> from html5lib import html5parser, treebuilders
>>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
>>> dom = p.parse(urllib.urlopen('http://www.baidu.com/s?wd=foo').read())
>>> len(dom.getElementsByTagName('table'))
12
I see several places where that code could be improved, but for your question, here are my suggestions:
Use lxml.html.parse(link) rather than lxml.etree.HTML(content) so all the "just works" automatics can kick in (e.g., handling character encoding declarations in headers properly).
Try using tree.findall(".//table") rather than tree.xpath("//table"). I'm not sure whether it'll make a difference, but I just used that syntax in a project of my own a few hours ago without issue and, as a bonus, it's compatible with non-LXML ElementTree APIs.
The other major thing I'd suggest would be using Python's built-in functions for building URLs so you can be sure the URL you're building is valid and properly escaped in all circumstances.
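Putting those suggestions together, here is a minimal sketch (Python 2, to match the question's code; the query values are just placeholders):

import urllib
import lxml.html

params = urllib.urlencode({'wd': 'lxml tutorial', 'pn': 0})
link = 'http://www.baidu.com/s?' + params   # query string is properly escaped
tree = lxml.html.parse(link).getroot()      # lxml fetches and parses the URL itself
tables = tree.findall('.//table')           # ElementTree-compatible search
print len(tables)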
If lxml can't find a table but the browser shows one, I can only imagine it's one of these three problems:
Bad request. lxml gets a page without a table in it (e.g., a 404 or 500 error).
Bad parsing. Something about the page confused lxml.etree.HTML when called directly.
Javascript needed. Maybe the table is generated client-side.
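A quick way to rule out the first possibility is to check what actually came back; a small sketch reusing the urllib call from the question:

f = urllib.urlopen(link)
print f.getcode()                          # 200 means the request itself succeeded
content = f.read()
open('fetched.html', 'w').write(content)   # open this file to see what lxml actually received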
I'm trying to use BeautifulSoup from bs4/Python 3 to extract CData. However, whenever I search for it using the following, it returns an empty result. Can anyone point out what I'm doing wrong?
from bs4 import BeautifulSoup,CData
txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''
soup = BeautifulSoup(txt)
for cd in soup.findAll(text=True):
    if isinstance(cd, CData):
        print('CData contents: %r' % cd)
The problem appears to be that the default parser doesn't parse CDATA properly. If you specify the correct parser, the CDATA shows up:
soup = BeautifulSoup(txt,'html.parser')
For more information on parsers, see the docs.
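For example, these are the parsers you can ask for by name (assuming the corresponding libraries are installed; only html.parser ships with Python):

soup = BeautifulSoup(txt, 'html.parser')  # stdlib parser, always available
soup = BeautifulSoup(txt, 'lxml')         # requires the lxml package
soup = BeautifulSoup(txt, 'html5lib')     # requires the html5lib package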
I got onto this by using the diagnose function, which the docs recommend:
If you have questions about Beautiful Soup, or run into problems, send mail to the discussion group. If your problem involves parsing an HTML document, be sure to mention what the diagnose() function says about that document.
Using the diagnose() function gives you output showing how the different parsers see your HTML, which lets you choose the right parser for your use case.
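For example, a minimal sketch (diagnose() lives in bs4.diagnose):

from bs4.diagnose import diagnose
diagnose(txt)   # prints how each installed parser handles the markup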
I tried many different solutions to the following problem, and so far I couldn't find one that works.
I need to get some information from meta tags on several webpages. For this purpose I found lxml very useful, because I also need XPath to find specific content. However, while XPath works on the tree, about 20% of the websites (out of roughly 100 in total) don't work: specifically, the head seems to be broken.
tree = html.fromstring(htmlfrompage)  # using html from the lxml package
head_object = tree.head               # access the head element of this page
On all of these websites, accessing the head object (which is only a shortcut to an XPath query) fails with the same error:
print tree.head
IndexError: list index out of range
Because the following xpath fails:
self.xpath('//head|//x:head', namespaces={'x':XHTML_NAMESPACE})[0]
This XPath result is empty, so accessing the first element fails. I navigated the tree myself, and self.xpath('//head'), self.xpath('//html/head'), and even self.xpath('//body') are all empty. But if I access the meta tags directly, anywhere in the document:
head = tree.xpath("//meta")
for meta_tag in head:
    print meta_tag.text  # just printing something
It works, so somehow the metas are not attached to the head; they're floating somewhere else in the tree, and the head doesn't exist at all. Of course I could patch around this by accessing the head and, whenever I get an IndexError, walking the metas to find what I'm looking for, but I expected lxml to fix broken HTML (as I read in the documentation).
Has anybody had the same issue and solved it in a better way?
Using requests I can load the tree just fine:
>>> import requests
>>> from lxml import html
>>> r = requests.get('http://www.lanacion.com.ar/1694725-ciccone-manana-debera-declarar-carosso-donatiello-el-inquilino-de-boudou')
>>> tree = html.fromstring(r.content)
>>> tree.head
<Element head at 0x10681b4c8>
Do note that you want to pass a byte string to html.fromstring(); don't use r.text as that'll pass in Unicode instead.
Moreover, if the server did not indicate the encoding in the headers, requests falls back to the HTTP RFC default, which is ISO-8859-1 for text/ responses. For this specific response that is incorrect:
>>> r.headers['Content-Type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding # make an educated guess
'utf-8'
This means r.text will use Latin-1 to decode the UTF-8 data, leading to an incorrectly decoded Unicode string, further confusing matters.
The HTML parser, on the other hand, can make use of the <meta> header present to tell it what encoding to use:
>>> tree.find('.//meta').attrib
{'content': 'text/html; charset=utf-8', 'http-equiv': 'Content-Type'}
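If you also want a correctly decoded r.text, one option (a small sketch) is to override the guessed encoding before touching it:

>>> r.encoding = r.apparent_encoding   # or simply r.encoding = 'utf-8'
>>> # r.text will now decode the body as UTF-8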
In BeautifulSoup versions prior to 4 I could take any chunk of HTML and get a string representation in this way:
from BeautifulSoup import BeautifulSoup
soup3 = BeautifulSoup('<div><b>soup 3</b></div>')
print unicode(soup3)
'<div><b>soup 3</b></div>'
However with BeautifulSoup4 the same operation creates additional tags:
from bs4 import BeautifulSoup
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
print unicode(soup4)
'<html><body><div><b>soup 4</b></div></body></html>'
^^^^^^^^^^^^ ^^^^^^^^^^^^^^
I don't need the outer <html><body>..</body></html> tags that BS4 is adding. I have looked through the BS4 docs and also searched inside the class, but could not find any setting for suppressing the extra tags in the output. How do I do it? Downgrading to v3 is not an option, since the SGML parser used in BS3 is nowhere near as good as the lxml or html5lib parsers available with BS4.
If you want your code to work on everyone's machine, no matter which parser(s) they have installed, etc. (the same lxml version built on libxml2 2.9 vs. 2.8 acts very differently, the stdlib html.parser had some radical changes between 2.7.2 and 2.7.3, …), you pretty much need to handle all of the legitimate results.
If you know you have a fragment, something like this will give you exactly that fragment:
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
if soup4.body:
    return soup4.body.next
elif soup4.html:
    return soup4.html.next
else:
    return soup4
Of course if you know your fragment is a single div, it's even easier—but it's not as easy to think of a use case where you'd know that:
soup4 = BeautifulSoup('<div><b>soup 4</b></div>')
return soup4.div
If you want to know why this happens:
BeautifulSoup is intended for parsing HTML documents. An HTML fragment is not a valid document. It's pretty close to a document, but that's not good enough to guarantee that you'll get back exactly what you give it.
As Differences between parsers says:
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will give different results.
So, while this exact difference isn't documented, it's just a special case of something that is.
As was noted in the old BeautifulStoneSoup documentation:
The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML doesn't have a fixed tag set, so those heuristics don't apply. So BeautifulSoup doesn't do XML very well.
Use the BeautifulStoneSoup class to parse XML documents. It's a general class with no special knowledge of any XML dialect and very simple rules about tag nesting...
And in the BeautifulSoup4 docs:
There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.
Perhaps that will yield what you want.
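For reference, a quick sketch of that XML route (it needs lxml installed, since bs4 delegates XML parsing to it):

from bs4 import BeautifulSoup
soup4 = BeautifulSoup('<div><b>soup 4</b></div>', 'xml')
print unicode(soup4)   # no <html>/<body> wrapper; an XML declaration may be prepended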
I am currently developing a wiki and will keep posting information into the wiki. However, I'll have to fetch the information from the wiki using a python code. For example, if I have a wiki page about a company, say Coca Cola, I will need all the information (text) that I have posted on the wiki to be parsed to my python program. Please let me know if there's a way to do this.
Thanks!
You can use api.php to get the Wikipedia source text. It includes only the actual article.
I wrote this one for the German Wikipedia, so it works with umlauts. Some special characters from other languages don't work (Russian works, so the problem might be with some Asian languages). This is a working example:
import urllib2
from BeautifulSoup import BeautifulStoneSoup
import xml.sax.saxutils
def load(lemma, language="en", format="xml"):
    """ Get the Wikipedia source text (not the HTML source code)
        format: xml, json, ...
        language: en, de, ...
        Returns None if the page doesn't exist
    """
    url = 'http://' + language + '.wikipedia.org/w/api.php' + \
          '?action=query&format=' + format + \
          '&prop=revisions&rvprop=content' + \
          '&titles=' + lemma
    request = urllib2.Request(url)
    handle = urllib2.urlopen(request)
    text = handle.read()
    if format == 'xml':
        soup = BeautifulStoneSoup(text)
        rev = soup.rev
        if rev is not None:
            text = unicode(rev.contents[0])
            text = xml.sax.saxutils.unescape(text)
        else:
            return None
    return text
print load("Coca-Cola")
If you want to get the actual HTML source code instead, you have to change the URL and the part that uses BeautifulStoneSoup.
BeautifulStoneSoup parses XML, BeautifulSoup parses HTML. Both are part of the BeautifulSoup package.
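A rough sketch of that HTML route, assuming the standard MediaWiki action=parse endpoint (load_html is just an illustrative name, not part of the code above):

import urllib2
import json

def load_html(lemma, language='en'):
    # Fetch the rendered HTML of a page via the MediaWiki API (sketch).
    url = 'http://' + language + '.wikipedia.org/w/api.php' + \
          '?action=parse&format=json&prop=text&page=' + lemma
    data = json.loads(urllib2.urlopen(url).read())
    return data['parse']['text']['*']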
One approach is to download the page with urllib or httplib and then analyze it with regexes to extract the precise information you want. It may be tedious, but it's relatively easy to do.
Maybe there are other ways to analyze the source of the page, with parsers or something like that; I don't know enough about them.
In the past for this sort of thing I've used SemanticMediawiki, and found it to work reasonably well. It's not terribly flexible, though, so if you're doing something complicated you'll find yourself writing custom plugins or delegating to an external service to do the work.
I ultimately ended up writing a lot of python web services to do extra processing.
I'm trying to extract text from arbitrary HTML pages. Some of the pages (which I have no control over) have malformed HTML or scripts, which makes this difficult. Also, I'm on a shared hosting environment, so while I can install Python libraries, I can't just install anything I want on the server.
pyparsing and html2text.py also did not seem to work for malformed html pages.
Example URL is http://apnews.myway.com/article/20091015/D9BB7CGG1.html
My current implementation is approximately the following:
# Try using BeautifulSoup 3.0.7a
import BeautifulSoup
from BeautifulSoup import Comment

soup = BeautifulSoup.BeautifulSoup(s)
# remove HTML comments
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]
# remove <script> elements
for script in soup.findAll('script'):
    script.extract()
body = soup.body(text=True)
text = ''.join(body)
# if BeautifulSoup can't handle it,
# alter html by trying to find 1st instance of "<body" and replace everything prior to that, with "<html><head></head>"
# try beautifulsoup again with new html
If BeautifulSoup still does not work, I resort to a heuristic: look at the first and last characters of each line (to see whether it looks like a line of code, e.g. starting or ending with #, <, or ;), take a sample of the line, and check whether its tokens are English words or numbers. If too few of the tokens are words or numbers, I guess that the line is code.
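A rough sketch of that heuristic (the looks_like_code name and the threshold are illustrative, not my actual code):

import re

def looks_like_code(line, threshold=0.5):
    # Guess that a line is leftover code/markup if too few of its tokens
    # look like plain English words or numbers.
    tokens = line.split()
    if not tokens:
        return False
    wordlike = sum(1 for t in tokens if re.match(r'^[A-Za-z]+$|^\d+$', t))
    return float(wordlike) / len(tokens) < threshold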
I could use machine learning to inspect each line, but that seems a little expensive and I would probably have to train it (since I don't know that much about unsupervised learning machines), and of course write it as well.
Any advice, tools, or strategies would be most welcome. Also, I realize that the latter part of this is rather messy: if a line is determined to contain code, I currently throw away the entire line, even if it contains a small amount of actual English text.
Try not to laugh, but:
from subprocess import Popen, PIPE

class TextFormatter:
    def __init__(self, lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        return Popen([self.lynx,
                      '-assume-charset=UTF-8',
                      '-display-charset=UTF-8',
                      '-dump',
                      '-stdin'],
                     stdin=PIPE,
                     stdout=PIPE).communicate(input=unicode_html_source.encode('utf-8'))[0].decode('utf-8')
I hope you've got lynx!
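A quick usage sketch (assuming lynx really is at /usr/bin/lynx):

formatter = TextFormatter()
text = formatter.html2text(u'<p>Hello, <b>world</b>!</p>')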
Well, it depends how good the solution has to be. I had a similar problem, importing hundreds of old html pages into a new website. I basically did
from BeautifulSoup import BeautifulSoup
from html2text import html2text

# remove all that crap around the body and let BS fix the tags
newhtml = "<html><body>%s</body></html>" % (
    u''.join(unicode(tag) for tag in BeautifulSoup(oldhtml).body.contents))
# use html2text to turn it into text
text = html2text(newhtml)
and it worked out, but of course the documents could be so bad that even BS can't salvage much.
BeautifulSoup does badly with malformed HTML. What about some regex-fu?
>>> import re
>>>
>>> html = """<p>This is paragraph with a bunch of lines
... from a news story.</p>"""
>>>
>>> pattern = re.compile('(?<=p>).+(?=</p)', re.DOTALL)
>>> pattern.search(html).group()
'This is paragraph with a bunch of lines\nfrom a news story.'
You can then assemble a list of valid tags from which you want to extract information.
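For instance, a rough sketch along those lines (the tag list and the extract_text name are just illustrative):

import re

def extract_text(html, tags=('p', 'h1', 'h2', 'li')):
    # Pull the inner text of each listed tag with a non-greedy regex.
    chunks = []
    for tag in tags:
        pattern = re.compile(r'<%s[^>]*>(.*?)</%s>' % (tag, tag),
                             re.DOTALL | re.IGNORECASE)
        chunks.extend(match.group(1) for match in pattern.finditer(html))
    return chunks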