I have an issue parsing this website: http://fm4-archiv.at/files.php?cat=106
It contains special characters such as umlauts. My Chrome browser displays the umlauts on that page properly. However, on other pages (e.g. http://fm4-archiv.at/files.php?cat=105) the umlauts are not displayed properly.
The meta HTML tag defines the following charset on the pages:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
I use the Python requests package to get the HTML and then use BeautifulSoup to scrape the desired data. My code is as follows:
import requests
from bs4 import BeautifulSoup

r = requests.get(URL)
soup = BeautifulSoup(r.content, "lxml")
If I print the encoding (print(r.encoding)) the result is UTF-8. If I manually change the encoding to ISO-8859-1 or cp1252 by calling r.encoding = 'ISO-8859-1', nothing changes when I output the data on the console. This is also my main issue.
r = requests.get(URL)
r.encoding = 'ISO-8859-1'
soup = BeautifulSoup(r.content, "lxml")
still results in the following string shown on the console output in my python IDE:
Der WildlÃ¶wenpfleger
instead it should be
Der Wildlöwenpfleger
How can I change my code to parse the umlauts properly?
In general, instead of using r.content, which is the raw byte string received, use r.text, which is the content decoded with the encoding determined by requests.
In this case requests will use UTF-8 to decode the incoming byte string because this is the encoding reported by the server in the Content-Type header:
>>> import requests
>>> r = requests.get('http://fm4-archiv.at/files.php?cat=106')
>>> type(r.content) # raw content
<class 'bytes'>
>>> type(r.text) # decoded to unicode
<class 'str'>
>>> r.headers['Content-Type']
'text/html; charset=UTF-8'
>>> r.encoding
'UTF-8'
>>> soup = BeautifulSoup(r.text, 'lxml')
That will fix the "Wildlöwenpfleger" problem; however, other parts of the page then begin to break, for example:
>>> soup = BeautifulSoup(r.text, 'lxml') # using decoded string... should work
>>> soup.find_all('a')[39]
Der Wildlöwenpfleger
>>> soup.find_all('a')[10]
<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)
shows that "Wildlöwenpfleger" is fixed but now "übergeben" and others in the second link are broken.
It appears that multiple encodings are used in the one HTML document. The first link uses UTF-8 encoding:
>>> r.content[8013:8070].decode('iso-8859-1')
'Der WildlÃ¶wenpfleger'
>>> r.content[8013:8070].decode('utf8')
'Der Wildlöwenpfleger'
but the second link uses ISO-8859-1 encoding:
>>> r.content[2868:3132].decode('iso-8859-1')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon übergeben. Auf Streifzügen durch die Popliteratur stößt Hermes auf deren große Themen und hört mit euch quer. In der heutige">Salon Hermes (6 files)\r\n'
>>> r.content[2868:3132].decode('utf8', 'replace')
'<a href="files.php?cat=87" title="Stermann und Grissemann sind auf Sommerfrische und haben Hermes ihren Salon �bergeben. Auf Streifz�gen durch die Popliteratur st��t Hermes auf deren gro�e Themen und h�rt mit euch quer. In der heutige">Salon Hermes (6 files)\r\n'
Obviously it is incorrect to use multiple encodings in the same HTML document. Other than contacting the document's author and asking for a correction, there is not much that you can easily do to handle the mixed encoding. Perhaps you can run chardet.detect() over the data as you process it, but it's not going to be pleasant.
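For what it's worth, a rough sketch of what per-fragment detection with chardet could look like; the byte offsets are just the ones from the examples above and only make sense for this particular snapshot of the page:

import chardet
import requests

r = requests.get('http://fm4-archiv.at/files.php?cat=106')

# Offsets taken from the examples above; they are specific to this page snapshot.
for start, end in [(8013, 8070), (2868, 3132)]:
    fragment = r.content[start:end]
    guess = chardet.detect(fragment)          # e.g. {'encoding': 'utf-8', 'confidence': 0.94, ...}
    encoding = guess['encoding'] or 'utf-8'   # detect() may return None for the encoding
    print(encoding, fragment.decode(encoding, 'replace'))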
I just found two solutions. Can you confirm?
soup = BeautifulSoup(r.content.decode('utf-8', 'ignore'), "lxml")
and
soup = BeautifulSoup(r.content, "lxml", from_encoding='utf-8')
Both result in the following example output:
Der Wildlöwenpfleger
EDIT:
I just wonder why these work, because r.encoding is UTF-8 anyway, which tells me that requests already treated the data as UTF-8. So why do .decode('utf-8', 'ignore') or from_encoding='utf-8' make a difference?
EDIT 2:
Okay, I think I get it now. Both .decode('utf-8', 'ignore') and from_encoding='utf-8' declare that the actual data is encoded as UTF-8 and that BeautifulSoup should parse it as such, which is indeed the case.
I assume requests handled it correctly as UTF-8, but because I passed r.content (the raw bytes) to BeautifulSoup, requests' decoding never came into play; BeautifulSoup guessed the encoding itself (from the meta tag) and got it wrong. Hence the extra decoding.
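To make that concrete, a minimal sketch of the difference (assuming BeautifulSoup 4, where the keyword is from_encoding):

import requests
from bs4 import BeautifulSoup

r = requests.get('http://fm4-archiv.at/files.php?cat=106')

# Raw bytes: BeautifulSoup sniffs the encoding itself, trusts the (incorrect)
# iso-8859-1 meta tag and produces "Der WildlÃ¶wenpfleger".
broken = BeautifulSoup(r.content, 'lxml')

# Declaring the real encoding (or passing the already-decoded r.text) overrides the meta tag.
fixed = BeautifulSoup(r.content, 'lxml', from_encoding='utf-8')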
Related
I'm retrieving some data from a website via lxml xpath:
import requests
from lxml import html

page = requests.get(url)
tree = html.fromstring(page.content)
titles_arr = tree.xpath("//span[@class='lister-item-header']/span/a/text()")
Some of the titles contain German umlauts (e.g. üöä), so I thought of encoding the returned text like so:
for title in titles_arr:
    title = title.encode('utf-8')
but the output still contains escape sequences like Der Herr der Ringe - Die R\u00fcckkehr des K\u00f6nigs instead of the actual characters. What am I doing wrong?
Thanks
You seem to be dealing with a bytestring in which the unicode characters have been escaped.
You can decode like this:
>>> bs = b'Die R\u00fcckkehr des K\u00f6nigs'
>>> bs.decode('raw-unicode-escape')
'Die Rückkehr des Königs'
If you are dealing with text rather than bytes (a str that literally contains the \u00fc escape sequences), you'll need to encode and then decode:
>>> s = r'Die R\u00fcckkehr des K\u00f6nigs'
>>> s.encode('latin-1').decode('raw-unicode-escape')
'Die Rückkehr des Königs'
This kind of escaping is used for unicode characters in JSON, to restrict the JSON output to ASCII values:
>>> import json
>>> json.dumps('Die Rückkehr des Königs')
'"Die R\\u00fcckkehr des K\\u00f6nigs"'
so it's possible that whatever URL you are fetching is HTML with embedded JSON, or JSON with embedded HTML; it might be worth checking the response's .json() method (if you are using requests).
I need to get a page's source (HTML) and convert it to UTF-8, because I want to find some text in the page (like: if 'my_same_text' in page_source: then ...). The page contains Russian text (Cyrillic characters) and this tag:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
I use Flask and the requests Python library.
I send a request:
source = requests.get('url/')
if 'сyrillic symbols' in source.text: ...
and I can't find my text; this is due to the encoding.
How can I convert the text to UTF-8? I tried .encode() and .decode() but it did not help.
Let's create a page with a windows-1251 charset declared in the meta tag and some Russian placeholder text. To be sure, I saved it in Sublime Text as a windows-1251 file.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
</head>
<body>
<p>Привет, мир!</p>
</body>
</html>
You can use a little trick in the requests library:
If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text.
So it goes like that:
In [1]: import requests
In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')
In [3]: result.encoding = 'windows-1251'
In [4]: u'Привет' in result.text
Out[4]: True
Voila!
If it doesn't work for you, there's a slightly uglier approach.
You should take a look at what encoding the web server claims to be sending you.
It may be that the declared encoding of the response is actually ISO-8859-1 (a close relative of cp1252), or something else entirely, but neither utf8 nor cp1251. It can differ and depends on the web server!
In [1]: import requests
In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')
In [3]: result.encoding
Out[3]: 'ISO-8859-1'
So we should recode it accordingly.
In [4]: u'Привет'.encode('cp1251').decode('cp1252') in result.text
Out[4]: True
But that just looks ugly to me (also, I'm not great with encodings and it's really not the best solution). I'd go with re-setting the encoding using requests itself.
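Alternatively, instead of hard-coding windows-1251, you can feed requests' own guess back into the response via apparent_encoding (a chardet/charset_normalizer based detection); a small sketch against the same local test page:

import requests

result = requests.get('http://127.0.0.1:1234/1251.html')
result.encoding = result.apparent_encoding  # detected from the body, e.g. 'windows-1251'
print(u'Привет' in result.text)             # True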
As documented, requests automatically decodes response.text to unicode, so you must either look for a unicode string:
if u'cyrillic symbols' in source.text:
    # ...
or encode response.text in the appropriate encoding:
# -*- coding: utf-8 -*-
# (....)
if 'cyrillic symbols' in source.text.encode("utf-8"):
    # ...
The first solution is much simpler and lighter.
From this website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31
<tr class="list03" onclick="showMen1(9);" style="cursor:pointer;">
<td id="e_9" class="qh_one">百度汇总</td>
I'm scraping the text and trying to get 百度汇总
but when I set r.encoding = 'utf-8' the result is �ٶȻ���
if I don't use utf-8 the result is °Ù¶È»ã×Ü
The server doesn't tell you anything helpful in the response headers, but the HTML page itself contains:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
GB2312 is a variable-width encoding, like UTF-8. However, the page lies; it actually uses GBK, an extension of GB2312.
You can decode it with GBK just fine:
>>> len(r.content.decode('gbk'))
44535
>>> u'百度汇总' in r.content.decode('gbk')
True
Decoding with gb2312 fails:
>>> r.content.decode('gb2312')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 26367-26368: illegal multibyte sequence
but since GBK is a superset of GB2312, it should always be safe to use the former even when the latter is specified.
If you are using requests, then setting r.encoding to gb2312 appears to work because r.text handles decoding errors with replace:
content = str(self.content, encoding, errors='replace')
so the codepoints that are only defined in GBK are silently replaced rather than raising an error.
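So the practical fix with requests, assuming the page is still served the same way, is simply to declare the real encoding before touching r.text; a minimal sketch:

import requests

r = requests.get('http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31')
r.encoding = 'gbk'             # the page declares gb2312 but actually uses GBK
print(u'百度汇总' in r.text)    # True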
Note that BeautifulSoup can do the decoding all by itself; it'll find the meta header:
>>> soup = BeautifulSoup(r.content)
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
The warning is caused by the GBK codepoints being used while the page claims to use GB2312.
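If you want BeautifulSoup to use the real encoding instead of sniffing it (and so avoid the replacement characters), you can pass it explicitly; a sketch assuming BeautifulSoup 4, where the keyword is from_encoding:

from bs4 import BeautifulSoup

# 'r' is the response from above; from_encoding overrides BeautifulSoup's own detection.
soup = BeautifulSoup(r.content, 'lxml', from_encoding='gbk')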
I'm trying to fetch a segment of some website. The script works, however it's a website that has accents such as á, é, í, ó, ú.
When I fetch the site using urllib or urllib2, the site source code is not encoded in utf-8, which I would like it to be, as utf-8 supports these accents.
I believe that the target site is encoded in utf-8 as it contains the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
My python script:
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('Accept-Charset', 'utf-8')]
url_response = opener.open(url)
deal_html = url_response.read().decode('utf-8')
However, I keep getting results that look like they are not encoded in utf-8.
E.g.: "Milán" on the website = "Mil\xe1n" after urllib2 fetches it.
Any suggestions?
Your script is working correctly. The "\xe1" is just how the repr of the unicode object resulting from the decode escapes the character; the data itself is fine. For example:
>>> "Mil\xc3\xa1n".decode('utf-8')
u'Mil\xe1n'
The "\xc3\xa1" sequence is the UTF-8 sequence for leter a with diacritic mark: á.
I have the following code to open and read URLs:
html_data = urllib2.urlopen(req).read()
and I believe this is the most standard way to read data from HTTP.
However, when the response has chunked transfer-encoding, the response starts with the following characters:
1eb0\r\n2625\r\n
<?xml version="1.0" encoding="UTF-8"?>
...
This happens due to the chunked encoding mentioned above, and as a result my XML data becomes corrupted.
So I wonder how I can get rid of all meta-data related to the chunked encoding?
I ended up with custom xml stripping, like this:
xml_start = html_data.find('<?xml')
xml_end = html_data.rfind('</mytag>')
if xml_start != 0:
    log_user_action(req.get_host(), 'chunked data', html_data, {})
    html_data = html_data[xml_start:]
if xml_end != len(html_data) - len('</mytag>') - 1:
    html_data = html_data[:xml_end + 1]
Can't find any simple solution.
1eb0\r\n2625\r\n are chunk-size markers (in hex) from the chunked transfer encoding that were left in the reassembled payload.
You can remove everything before the <?xml declaration:
html_data = html_data[html_data.find('<?xml'):]
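If you want that stripping to be a bit more defensive, here is a small sketch along the same lines; '</mytag>' is the placeholder root tag from your own code:

def strip_to_xml(html_data, root_close='</mytag>'):
    """Keep only the span from the XML declaration to the closing root tag."""
    start = html_data.find('<?xml')
    end = html_data.rfind(root_close)
    if start == -1 or end == -1:
        return html_data                      # nothing recognisable, leave untouched
    return html_data[start:end + len(root_close)]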