lxml tree head and some other elements broken - python

I have tried many different solutions for the following problem and so far I couldn't find one that works.
I need to get some information from the meta tags of several webpages. For this purpose I found lxml very useful, because I also need to find specific content using XPath. XPath works on the tree; however, about 20% of the websites (out of around 100 in total) don't work, specifically the head seems to be broken.
tree = html.fromstring(htmlfrompage)  # using html from the lxml package
head_object = tree.head  # access the head object of this webpage
On all of these websites, accessing the head object (which is only a shortcut to an XPath query) fails with the same error:
print tree.head
IndexError: list index out of range
Because the following xpath fails:
self.xpath('//head|//x:head', namespaces={'x':XHTML_NAMESPACE})[0]
This XPath result is empty, so accessing the first element fails. I navigated the tree myself and self.xpath('//head'), self.xpath('//html/head') and even self.xpath('//body') are all empty. But if I try to access the meta tags directly anywhere in the document:
head = tree.xpath("//meta")
for meta_tag in head:
    print meta_tag.text  # just printing something
It works, so somehow the metas are not connected to the head; they're floating somewhere in the tree, and the head doesn't exist at all. Of course I could patch around this issue by catching the index-out-of-range exception when accessing the head and then walking the metas to find what I'm looking for, but I expected lxml to fix broken HTML (as I read in the documentation).
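A minimal sketch of that patch, just to illustrate what I mean (names are taken from the snippets above):
try:
    head_object = tree.head
except IndexError:
    head_object = None  # no <head> in the parsed tree
for meta_tag in tree.xpath('//meta'):
    print meta_tag.get('content')  # the metas are still reachable directly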
Is there anybody that had the same issue and could solve it in a better way?

Using requests I can load the tree just fine:
>>> import requests
>>> from lxml import html
>>> r = requests.get('http://www.lanacion.com.ar/1694725-ciccone-manana-debera-declarar-carosso-donatiello-el-inquilino-de-boudou')
>>> tree = html.fromstring(r.content)
>>> tree.head
<Element head at 0x10681b4c8>
Do note that you want to pass a byte string to html.fromstring(); don't use r.text as that'll pass in Unicode instead.
Moreover, if the server did not indicate the encoding in the headers, requests falls back to the HTTP RFC default, which is ISO-8859-1 for text/ responses. For this specific response that is incorrect:
>>> r.headers['Content-Type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding # make an educated guess
'utf-8'
This means r.text will use Latin-1 to decode the UTF-8 data, leading to an incorrectly decoded Unicode string, further confusing matters.
The HTML parser, on the other hand, can make use of the <meta> tag present in the document to tell it what encoding to use:
>>> tree.find('.//meta').attrib
{'content': 'text/html; charset=utf-8', 'http-equiv': 'Content-Type'}
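Putting it together, a minimal sketch of the safe pattern (the URL variable and the meta XPath here are just placeholders):
import requests
from lxml import html

r = requests.get(url)                  # url is a placeholder
tree = html.fromstring(r.content)      # pass bytes; lxml reads the <meta> charset itself
print tree.head                        # tree.head should now resolve
for meta_tag in tree.xpath('//meta'):
    print meta_tag.get('content')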

Related

No handlers could be found for logger "bs4.dammit"

I wrote a small link miner using the BeautifulSoup library.
But I noticed that some links weren't handled, so I tested one of them:
result = requests.get('https://domain.ir/PATH_TO_FILE/optics-program-msc.pdf')
soup = BeautifulSoup(result.content,'html.parser')
f2.write('{"counter":'+str(i)+', "id": "'+a['href']+'", "group":'+str(counter)+", \"children\":"+str(len(soup.find_all('a',href=True)))+"},\n")
I figured out that html.parser cannot handle all of these links, and I get this error:
No handlers could be found for logger "bs4.dammit"
So the link isn't written to the file. But there are some links (like .pdf, .zip, ...) for which I don't know which parser should be used.
So what should I do?
First of all: you should use result.text, because it is already a Unicode string (rather than the bytes in result.content).
Second thing to check: is the parsed "soup" really HTML with links? Check with a simple if soup.body:
The third one: the bs4.dammit warning is about a problem detecting the encoding, so try giving it some more information: BeautifulSoup(result.content, 'html.parser', from_encoding="windows-1259")
Another one: instead of html.parser, try using lxml.
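For example, something like the following should work, assuming lxml is installed (untested sketch):
soup = BeautifulSoup(result.content, 'lxml')
if soup.body:  # only treat it as a page with links if it really parsed as HTML
    links = soup.find_all('a', href=True)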

Correctly detect encoding without any guessing when using Beautiful Soup

I’m working on improving the character encoding support for a Python IRC bot that retrieves the titles of pages whose URLs are mentioned in a channel.
The current process I’m using is as follows:
Requests:
r = requests.get(url, headers={ 'User-Agent': '...' })
Beautiful Soup:
soup = bs4.BeautifulSoup(r.text, from_encoding=r.encoding)
title = soup.title.string.replace('\n', ' ').replace(...) etc.
Specifying from_encoding=r.encoding is a good start, because it allows us to heed the charset from the Content-Type header when parsing the page.
Where this falls on its face is with pages that specify a <meta http-equiv … charset=…"> or <meta charset="…"> instead of (or on top of) a charset in their Content-Type header.
The approaches I currently see from here are as follows:
Use Unicode, Dammit unconditionally when parsing the page. This is the default, but it seems to be ineffective for any of the pages that I’ve been testing it with.
Use ftfy unconditionally before or after parsing the page. I’m not fond of this option, because it basically relies on guesswork for a task for which we (usually) have perfect information.
Write code to look for an appropriate <meta> tag, try to heed any encodings we find there, then fall back on Requests’ .encoding, possibly in combination with the previous option. I find this option ideal, but I’d rather not write this code if it already exists.
TL;DR is there a Right Way™ to make Beautiful Soup correctly heed the character encoding of arbitrary HTML pages on the web, using a similar technique to what browsers use?
It seems you want to prefer encodings declared in documents over those declared in the HTTP headers. UnicodeDammit (used internally by BeautifulSoup) does this the other way around if you just pass it the encoding from the header. You can overcome this by reading declared encodings from the document and passing those to try first. Roughly (untested!):
r = requests.get(url, headers={ 'User-Agent': '...' })
content_type_header = r.headers.get('content-type', '')
is_html = content_type_header.split(';', 1)[0].lower().startswith('text/html')
declared_encoding = UnicodeDammit.find_declared_encoding(r.text, is_html=is_html)
encodings_to_try = [r.encoding]
if declared_encoding is not None:
    encodings_to_try.insert(0, declared_encoding)
soup = bs4.BeautifulSoup(r.text, from_encoding=encodings_to_try)
title = soup.title...
Unlike the more general module ftfy, the approach that Unicode, Dammit takes is exactly what I’m looking for (see bs4/dammit.py). It heeds the information provided by any <meta> tags, rather than applying more blind guesswork to the problem.
When r.text is used, however, Requests tries to be helpful by automatically decoding the page with the charset from its Content-Type header, falling back to ISO 8859-1 when none is present; but Unicode, Dammit does not touch any markup which is already a Unicode string!
The solution I chose was to use r.content instead:
r = requests.get(url, headers={ 'User-Agent': '...' })
soup = bs4.BeautifulSoup(r.content)
title = soup.title.string.replace('\n', ' ').replace(...) etc.
The only drawback that I can see is that pages with only a charset from their Content-Type will be subject to some guesswork by Unicode, Dammit, because passing BeautifulSoup the from_encoding=r.encoding argument will override Unicode, Dammit completely.
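One way to soften that drawback (a rough, untested sketch): only pass from_encoding when the Content-Type header actually declares a charset, and let Unicode, Dammit handle everything else.
header_charset = None
if 'charset=' in r.headers.get('content-type', '').lower():
    header_charset = r.encoding  # trust the header only when it was explicit
soup = bs4.BeautifulSoup(r.content, from_encoding=header_charset)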

Why do the results of response.xpath('//html') differ from response.body?

I'm trying to parse this page using scrapy
http://mobileshop.ae/one-x
I need to extract the links of the products.
The problem is that the links are available in the response.body result, but not available if you try response.xpath('//body').extract().
The results of response.body and response.xpath('//body') are different.
>>> body = response.body
>>> body_2 = response.xpath('//html').extract()[0]
>>> len(body)
238731
>>> len(body_2)
67520
I get the same short result for response.xpath('.').extract()[0].
Any idea why this happens, and how can I extract the data I need?
So, the issue here is a lot of malformed content in that page, including several unclosed tags. One way to solve this problem is to use lxml's soupparser to parse the malformed content (it uses BeautifulSoup under the covers) and build a Scrapy Selector with it.
Example session with scrapy shell http://mobileshop.ae/one-x:
>>> from lxml.html import soupparser
>>> from scrapy import Selector
>>> sel = Selector(_root=soupparser.fromstring(response.body))
>>> sel.xpath('//h4[@class="name"]/a/text()').extract()
[u'HTC One X 3G 16GB Grey',
u'HTC One X 3G 16GB White',
u'HTC One X 3G 32GB Grey',
u'HTC One X 3G 32GB White']
Note that using the BeautifulSoup parser is a lot slower than lxml's default parser. You probably want to do this only in the places where it's really needed.
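If you need this in more than one callback, a small wrapper might look like the following; it assumes the same older Selector(_root=...) API used in the session above:
from lxml.html import soupparser
from scrapy import Selector

def lenient_selector(response):
    # Re-parse the raw bytes with BeautifulSoup (via lxml's soupparser),
    # which tolerates the unclosed tags, then wrap the tree for XPath use.
    return Selector(_root=soupparser.fromstring(response.body))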
response.xpath("//body") returns the body element of the HTML contained in the response, while response.body returns the body (or message-body) of the whole HTTP response, i.e. all of the HTML in the response, including both the head and body elements.
response.xpath("//body") is actually a shortcut that converts the body of the HTTP response into a Selector object that can be navigated with XPath.
The links you need are contained in the body of the html element; they cannot really be anywhere else, so I'm not sure why you suggest they are not there. response.xpath("//body//a/@href") will give you all the links on the page; you probably need to write a more precise XPath that selects only the links you need.
The difference in lengths you mention comes from what each expression measures. len(response.xpath('//body').extract()) returns the number of body elements in the HTML document, because extract() returns a list of strings, one per element matching the XPath, and there is only one body in the document. len(response.xpath('//body').extract()[0]) gives you the body element serialized as a string, so you get the length of that string (the number of characters contained in the body). len(response.body) likewise gives you the number of characters in the whole HTTP response; that number is higher most likely because the HTML head contains lots of scripts and stylesheets that are not present in the HTML body.
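To make the difference concrete (illustrative only, the actual numbers will vary by page):
len(response.body)                          # characters in the whole HTTP response
len(response.xpath('//body').extract())     # number of <body> elements matched, normally 1
len(response.xpath('//body').extract()[0])  # characters in the serialized <body> element only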

lxml can't parse <table>?

I want to parse tables in HTML, but I found that lxml can't parse them. What's wrong?
# -*- coding: utf8 -*-
import urllib
import lxml.etree

keyword = 'lxml+tutorial'
url = 'http://www.baidu.com/s?wd='

if __name__ == '__main__':
    page = 0
    link = url + keyword + '&pn=' + str(page)
    f = urllib.urlopen(link)
    content = f.read()
    f.close()
    tree = lxml.etree.HTML(content)
    query_link = '//table'
    info_link = tree.xpath(query_link)
    print info_link
the print result is just []...
lxml's documentation says, "The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing."
And sure enough, the HTML returned by Baidu is invalid: the W3C validator reports "173 Errors, 7 warnings". I don't know (and haven't investigated) whether these particular errors have caused your trouble with lxml, because I think that your strategy of using lxml to parse HTML found "in the wild" (which is nearly always invalid) is doomed.
For parsing invalid HTML, you need a parser that implements the (surprisingly bizarre!) HTML error recovery algorithm. So I recommend swapping lxml for html5lib, which handles Baidu's invalid HTML with no problems:
>>> import urllib
>>> from html5lib import html5parser, treebuilders
>>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
>>> dom = p.parse(urllib.urlopen('http://www.baidu.com/s?wd=foo').read())
>>> len(dom.getElementsByTagName('table'))
12
I see several places where that code could be improved but, for your question, here are my suggestions:
Use lxml.html.parse(link) rather than lxml.etree.HTML(content) so all the "just works" automatics can kick in (e.g. handling character encoding declarations in headers properly).
Try using tree.findall(".//table") rather than tree.xpath("//table"). I'm not sure whether it'll make a difference, but I just used that syntax in a project of my own a few hours ago without issue and, as a bonus, it's compatible with non-LXML ElementTree APIs.
The other major thing I'd suggest would be using Python's built-in functions for building URLs so you can be sure the URL you're building is valid and properly escaped in all circumstances.
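A minimal sketch of those suggestions combined (untested against Baidu; the query parameters here are only an example):
import urllib
import lxml.html

params = urllib.urlencode({'wd': 'lxml tutorial', 'pn': 0})
link = 'http://www.baidu.com/s?' + params      # properly escaped query string
tree = lxml.html.parse(urllib.urlopen(link))   # lxml.html honours encoding declarations
tables = tree.findall('.//table')              # ElementTree-compatible syntax
print len(tables)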
If LXML can't find a table but the browser shows that a table exists, I can only imagine it's one of these three problems:
Bad request. LXML gets a page without a table in it (e.g. a 404 or 500 error).
Bad parsing. Something about the page confused lxml.etree.HTML when called directly.
Javascript needed. Maybe the table is generated client-side.

How to download any(!) webpage with correct charset in python?

Problem
When screen-scraping a webpage using Python, one has to know the character encoding of the page. If you get the character encoding wrong, then your output will be messed up.
People usually use some rudimentary technique to detect the encoding. They either use the charset from the header or the charset defined in the meta tag, or they use an encoding detector (which does not care about meta tags or headers).
By using only one of these techniques, sometimes you will not get the same result as you would in a browser.
Browsers do it this way:
Meta tags always take precedence (or the XML declaration)
Encoding defined in the header is used when there is no charset defined in a meta tag
If the encoding is not defined at all, then it is time for encoding detection.
(Well... at least that is the way I believe most browsers do it. Documentation is really scarce.)
What I'm looking for is a library that can decide the character set of a page the way a browser would. I'm sure I'm not the first who needs a proper solution to this problem.
Solution (I have not tried it yet...)
According to Beautiful Soup's documentation:
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:
An encoding you pass in as the fromEncoding argument to the soup constructor.
An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
An encoding sniffed by the chardet library, if you have it installed.
UTF-8
Windows-1252
When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:
fp = urllib2.urlopen(request)
charset = fp.headers.getparam('charset')
You can use BeautifulSoup to locate a meta element in the HTML:
soup = BeautifulSoup.BeautifulSoup(data)
meta = soup.findAll('meta', {'http-equiv':lambda v:v.lower()=='content-type'})
If neither is available, browsers typically fall back to user configuration, combined with auto-detection. As rajax proposes, you could use the chardet module. If you have user configuration available telling you that the page should be Chinese (say), you may be able to do better.
Use the Universal Encoding Detector:
>>> import chardet
>>> import urllib2
>>> chardet.detect(urllib2.urlopen("http://google.cn/").read())
{'encoding': 'GB2312', 'confidence': 0.99}
The other option would be to just use wget:
import os
h = os.popen('wget -q -O foo1.txt http://foo.html')
h.close()
s = open('foo1.txt').read()
It seems like you need a hybrid of the answers presented:
Fetch the page using urllib
Find <meta> tags using Beautiful Soup or another method
If no meta tags exist, check the headers returned by urllib
If that still doesn't give you an answer, use the universal encoding detector.
I honestly don't believe you're going to find anything better than that.
In fact, if you read further into the FAQ you linked to in the comments on the other answer, that's what the author of the detector library advocates.
If you believe the FAQ, this is what browsers do (as requested in your original question), since the detector is a port of the Firefox sniffing code.
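A rough sketch of that sequence (Python 2, using bs4 here; untested, and the fallback order simply follows the steps above):
import urllib2
import cgi
import chardet
from bs4 import BeautifulSoup

fp = urllib2.urlopen(url)                              # 1. fetch the page
data = fp.read()

soup = BeautifulSoup(data)                             # 2. look for a content-type <meta> tag
meta = soup.find('meta', attrs={'http-equiv': lambda v: v and v.lower() == 'content-type'})
charset = None
if meta is not None:
    charset = cgi.parse_header(meta.get('content', ''))[1].get('charset')

if not charset:                                        # 3. fall back to the HTTP header
    charset = fp.headers.getparam('charset')

if not charset:                                        # 4. last resort: content sniffing
    charset = chardet.detect(data)['encoding']

text = data.decode(charset or 'utf-8', 'replace')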
I would use html5lib for this.
Scrapy downloads a page and detects the correct encoding for it, unlike requests.get(url).text or urlopen. To do so it tries to follow browser-like rules - this is the best one can do, because website owners have an incentive to make their websites work in a browser. Scrapy needs to take HTTP headers, <meta> tags, BOM marks and differences in encoding names into account.
Content-based guessing (chardet, UnicodeDammit) on its own is not a correct solution, as it may fail; it should only be used as a last resort when headers, <meta> tags or BOM marks are not available or provide no information.
You don't have to use Scrapy to get its encoding detection functions; they are released (along with some other stuff) in a separate library called w3lib: https://github.com/scrapy/w3lib.
To get the page encoding and a Unicode body, use the w3lib.encoding.html_to_unicode function, with a content-based guessing fallback:
import chardet
from w3lib.encoding import html_to_unicode

def _guess_encoding(data):
    return chardet.detect(data).get('encoding')

detected_encoding, html_content_unicode = html_to_unicode(
    content_type_header,
    html_content_bytes,
    default_encoding='utf8',
    auto_detect_fun=_guess_encoding,
)
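For example, the glue for a requests response might look like this (hypothetical, untested; url is a placeholder):
import requests

r = requests.get(url)
detected_encoding, html_content_unicode = html_to_unicode(
    r.headers.get('content-type'),
    r.content,
    default_encoding='utf8',
    auto_detect_fun=_guess_encoding,
)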
Instead of trying to fetch a page and then figure out the charset the browser would use, why not just use a browser to fetch the page and check what charset it uses?
from win32com.client import DispatchWithEvents
import threading
import pythoncom

stopEvent = threading.Event()

class EventHandler(object):
    def OnDownloadBegin(self):
        pass

def waitUntilReady(ie):
    """
    copypasted from
    http://mail.python.org/pipermail/python-win32/2004-June/002040.html
    """
    if ie.ReadyState != 4:
        while 1:
            print "waiting"
            pythoncom.PumpWaitingMessages()
            stopEvent.wait(.2)
            if stopEvent.isSet() or ie.ReadyState == 4:
                stopEvent.clear()
                break

ie = DispatchWithEvents("InternetExplorer.Application", EventHandler)
ie.Visible = 0
ie.Navigate('http://kskky.info')
waitUntilReady(ie)
d = ie.Document
print d.CharSet
BeautifulSoup does this with UnicodeDammit: Unicode, Dammit
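A tiny illustration of using UnicodeDammit on its own (raw_bytes stands for whatever undecoded content you downloaded):
from bs4 import UnicodeDammit

dammit = UnicodeDammit(raw_bytes)
print dammit.original_encoding      # what Unicode, Dammit decided on
print dammit.unicode_markup[:100]   # the decoded document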
