I am working on a Python web scraper to extract data from this webpage. It contains Latin characters like ą, č, ę, ė, į, š, ų, ū, ž. I use BeautifulSoup's UnicodeDammit to detect the encoding:
from bs4 import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string)
    print(converted.original_encoding)
    if not converted.unicode_markup:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.tried_encodings))
    return converted.unicode_markup
The encoding it always seems to detect is "windows-1252". However, this turns characters like ė into ë and ų into ø when printing to a file or the console. I use the lxml library to scrape the data, so I would think the wrong encoding is being used, but what's odd is that if I use lxml.html.open_in_browser(decoded_html), all the characters appear correctly. How do I write the characters to a file without all the mojibake?
This is what I am using for output:
import json

def write(filename, obj):
    with open(filename, "w", encoding="utf-8") as output:
        json.dump(obj, output, cls=CustomEncoder, ensure_ascii=False)
    return
From the HTTP headers set on the specific webpage you tried to load:
Content-Type:text/html; charset=windows-1257
the page is declared as Windows-1257, so decoding it as Windows-1252 produces incorrect results. BeautifulSoup made a guess (based on statistical models) and guessed wrong. As you noticed, decoding with 1252 instead of 1257 leads to incorrect codepoints:
>>> 'ė'.encode('cp1257').decode('cp1252')
'ë'
>>> 'ų'.encode('cp1257').decode('cp1252')
'ø'
CP1252 is the fallback for the base character-set detection implementation in BeautifulSoup. You can improve the success rate of BeautifulSoup's character detection by installing an external library; both chardet and cchardet are supported. For this page those two libraries guess MacCyrillic and ISO-8859-13, respectively (both wrong, but cchardet came pretty close, perhaps close enough).
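If you already suspect the right encoding, you can also hand Unicode, Dammit a list of candidate encodings to try before it falls back to guessing; a minimal sketch, assuming html_string holds the raw page bytes as in the question's decode_html():
from bs4 import UnicodeDammit

# Sketch: the second argument is a list of encodings UnicodeDammit will try
# first; windows-1257 is what the server's Content-Type header declares.
converted = UnicodeDammit(html_string, ["windows-1257"])
print(converted.original_encoding)  # should now report windows-1257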
In this specific case, you can make use of the HTTP headers instead. In requests, I generally use:
import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
resp = requests.get(url)
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)
The above only uses the encoding from the response if the server set one explicitly and no HTML <meta> header declares an encoding. For text/* mime-types, HTTP specifies that the response should be considered to use Latin-1, which requests adheres to, but that default would be incorrect for most HTML data.
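As a quick sanity check, requests exposes this distinction directly: resp.encoding reflects the Content-Type header (falling back to ISO-8859-1 for text/* responses with no charset), while resp.apparent_encoding is a statistical guess made from the raw bytes. A small sketch reusing resp from the block above:
# resp.encoding comes from the header (here it should be windows-1257);
# resp.apparent_encoding is a chardet/charset_normalizer guess and may differ.
print(resp.encoding)
print(resp.apparent_encoding)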
Related
I use Python's requests library to access (public) ads.txt files:
import requests
r = requests.get('https://www.sicurauto.it/ads.txt')
print(r.text)
This works fine in most cases, but the text from the URL above begins with some strange symbols:
> ï»¿google.com, [...]
If I open the URL in my browser, I do not see these three symbols; the text begins with google.com, [...] I am a beginner when it comes to encodings and web protocols ... where might these odd symbols come from?
You need to specify your encoding (by setting r.encoding) before accessing r.text:
import requests
r = requests.get('https://www.sicurauto.it/ads.txt')
r.encoding = 'utf-8-sig' # specify UTF-8-sig encoding
print(r.text)
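Those three symbols are a UTF-8 byte-order mark (BOM) at the start of the file. requests falls back to ISO-8859-1 for text responses that declare no charset, so the three BOM bytes most likely come through as three stray characters, whereas the utf-8-sig codec strips them. A quick demonstration of the effect:
# The UTF-8 BOM is the byte sequence EF BB BF; read through a single-byte
# codec it becomes three junk characters, while 'utf-8-sig' removes it.
bom = b'\xef\xbb\xbf'
print(bom.decode('latin-1'))                      # -> ï»¿
print((bom + b'google.com').decode('utf-8-sig'))  # -> google.com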
I have an issue with a Python script. I am trying to translate some sentences with the Google Translate API. Some sentences with special characters like ä, ö or ü cause problems; I can't imagine why some sentences work and others don't.
If I try the API call directly in the browser, it works, but inside my Python script I get garbled output.
This is a reduced version of my script that reproduces the error:
# -*- coding: utf-8 -*-
import requests
import json

satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=de&dt=t&q=' + satz
r = requests.get(url)
r.text.encode().decode('utf8', 'ignore')
n = json.loads(r.text)
i = 0
while i < len(n[0]):
    newLine = n[0][i][0]
    print(newLine)
    i = i + 1
This is what my result looks like:
Unter dem Mondschein glänzt ein winziges Silberfragment, ein Bruchteil einer Li
nie â ? |
Google has served you a Mojibake; the JSON response contains data that was originally encoded as UTF-8 but was then decoded with a different codec, resulting in incorrect data.
I suspect Google does this as it decodes the URL parameters; in the past URL parameters could be encoded with any number of codecs, and UTF-8 becoming the standard is a relatively recent development. This is Google's fault, not yours or that of requests.
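The damage is easy to reproduce; a round-trip sketch, assuming Windows-1252 is the wrong codec involved (any single-byte codec produces similar junk):
# UTF-8 bytes mis-decoded with a single-byte codec such as Windows-1252 turn
# the ellipsis and umlauts into the garbage seen in the output above.
print('…'.encode('utf-8').decode('cp1252'))  # -> â€¦
print('ä'.encode('utf-8').decode('cp1252'))  # -> Ã¤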
I found that setting a User-Agent header makes Google behave better; even an (incomplete) user agent of Mozilla/5.0 is enough here for Google to use UTF-8 when decoding your URL parameters.
You should also make sure your URL string is properly percent-encoded; if you pass the parameters in a dictionary to params, requests will take care of adding them to the URL properly:
satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&dt=t'
params = {
    'q': satz,
    'sl': 'en',
    'tl': 'de',
}
headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, params=params, headers=headers)
results = r.json()[0]
for inputline, outputline, *__ in results:
    print(outputline)
Note that I pulled out the source and target language parameters into the params dictionary too, and pulled out the input and output line values from the results lists.
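If you do want to build the URL by hand instead, percent-encode the query text yourself; a sketch using only the standard library (not part of the original answer):
from urllib.parse import quote

satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
# quote() percent-encodes the UTF-8 bytes of the text, so characters such as
# the ellipsis survive the trip through the URL.
url = ('https://translate.googleapis.com/translate_a/single'
       '?client=gtx&sl=en&tl=de&dt=t&q=' + quote(satz))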
What is the most appropriate method to determine the encoding of the text of a webpage? I process webpages in various languages and use Python and the requests library. The aim is eventually to be able to get clean text using some text-extraction library for text-mining projects.
resp = requests.get(url)
Now I know that we have the following options:
1)
from requests.utils import get_encoding_from_headers
encoding = get_encoding_from_headers(resp.headers)
html = (resp.content).decode(encoding)
2)
from requests_toolbelt.utils.deprecated import get_encodings_from_content
encoding = get_encodings_from_content(resp.content)
html = (resp.content).decode(encoding)
3)
from requests_toolbelt.utils.deprecated import get_unicode_from_response
html = get_unicode_from_response(resp)
I processed around 1000 URLs and was expecting (1) and (2) to agree, but about 20% of the time that wasn't the case. In those 20% of cases, (1) gave "ISO-8859-1" (which, from looking at the code, means it didn't find a charset in the headers) while (2) mostly gave "utf-8".
Does anyone have experience with which of these is the most appropriate technique, or whether a better, cleaner way exists?
Don't trust the declared encoding; just test it directly on the HTML you received, using the chardet or cchardet (faster) libraries:
import requests

try:
    # this module is much faster
    import cchardet as chardet
except ImportError:
    import chardet

response = requests.get(url)
guessed_encoding = chardet.detect(response.content)['encoding']
if guessed_encoding is not None:
    try:
        htmltext = response.content.decode(guessed_encoding)
    except UnicodeDecodeError:
        htmltext = response.text
else:
    htmltext = response.text
# do something with htmltext...
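chardet and cchardet also report a confidence score alongside the guess; one possible refinement (the 0.5 threshold and errors='replace' are my own choices, not from the answer) is to fall back to requests' own decoding whenever the detector is unsure:
# detect() returns a dict with 'encoding' and 'confidence'; only trust the
# guess when the detector is reasonably sure, and soften decode failures.
result = chardet.detect(response.content)
if result['encoding'] and result.get('confidence', 0) > 0.5:
    htmltext = response.content.decode(result['encoding'], errors='replace')
else:
    htmltext = response.text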
I have the following code for urllib and BeautifulSoup:
import urllib
from bs4 import BeautifulSoup

defLinks = []
getSite = urllib.urlopen(pageName)           # open current site
getSitesoup = BeautifulSoup(getSite.read())  # read the site content
print getSitesoup.originalEncoding
for value in getSitesoup.find_all('link'):   # extract all <link> tags
    defLinks.append(value.get('href'))
The result:
/usr/lib/python2.6/site-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
"Some characters could not be decoded, and were "
And when I try to read the site I get:
�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z
The page is in UTF-8, but the server is sending it to you in a compressed format:
>>> print getSite.headers['content-encoding']
gzip
You'll need to decompress the data before running it through Beautiful Soup. I got an error using zlib.decompress() on the data, but writing the data to a file and using gzip.open() to read it back worked fine; I'm not sure why.
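For what it's worth, plain zlib.decompress() chokes on the gzip header, which is likely why it failed; passing 16 + MAX_WBITS as the window-size argument tells zlib to expect a gzip wrapper, so the page can be decompressed in memory without a temporary file. A sketch, assuming a fresh getSite = urllib.urlopen(pageName) as in the question:
import zlib
from bs4 import BeautifulSoup

# gzip streams carry an extra header that zlib.decompress() rejects by
# default; 16 + MAX_WBITS makes zlib accept the gzip wrapper.
raw = getSite.read()
html = zlib.decompress(raw, 16 + zlib.MAX_WBITS)
soup = BeautifulSoup(html)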
BeautifulSoup works with Unicode internally; it'll try to decode non-Unicode responses from UTF-8 by default.
It looks like the site you are trying to load uses a different encoding; for example, it could be UTF-16 instead:
>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('utf-16-le')
뿯㞽뿯施뿯붿뿯붿⨰䤢럟뿯䞽뿯䢽뿯붿뿯붿붿뿯붿뿯붿뿯㦽붿뿯붿뿯붿뿯㮽뿯붿붿썙䊞붿뿯붿뿯붿뿯붿뿯붿铣㾶뿯㒽붿뿯붿붿뿯붿뿯붿坞뿯붿뿯붿뿯悽붿敋뿯붿붿뿯⪽붿✮兏붿뿯붿붿뿯䂽뿯붿뿯붿뿯嶽뿯붿뿯⢽붿뿯庽뿯붿붿붿㕓뿯붿뿯璽⩔뿯媽
It could be mac_cyrillic too:
>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��#�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('mac_cyrillic')
пњљ7пњљeпњљпњљпњљпњљ0*"IяЈпњљGпњљHпњљпњљпњљпњљFпњљпњљпњљпњљпњљпњљ9-пњљпњљпњљпњљпњљпњљ;пњљпњљEпњљY√ЮBsпњљпњљпњљпњљпњљпњљпњљпњљпњљгФґ?пњљ4iпњљпњљпњљ)пњљпњљпњљпњљпњљ^Wпњљпњљпњљпњљпњљ`wпњљKeпњљпњљ%пњљпњљ*9пњљ.'OQBпњљпњљпњљVпњљпњљ#пњљпњљпњљпњљпњљ]пњљпњљпњљ(Pпњљпњљ^пњљпњљqпњљ$пњљS5пњљпњљпњљtT*пњљZ
But I have far too little information about what kind of site you are trying to load, nor can I read the output in either encoding. :-)
You'll need to decode the result of getSite.read() before passing it to BeautifulSoup:
getSite = urllib.urlopen(pageName).read().decode('utf-16')
Generally, the website will tell you in the headers which encoding was used, in the form of a Content-Type header (probably text/html; charset=utf-16 or similar).
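Rather than hardcoding 'utf-16', you can read the declared charset from that header; a small sketch in the same urllib style (pageName is the question's variable, and the 'utf-8' fallback is an arbitrary choice, not something this site is known to use):
# Pull the charset out of the Content-Type header and decode with it,
# falling back to utf-8 when the server does not declare one.
getSite = urllib.urlopen(pageName)
content_type = getSite.headers.get('content-type', '')
charset = content_type.split('charset=')[-1] if 'charset=' in content_type else 'utf-8'
html = getSite.read().decode(charset)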
I ran into the same problem, and as Leonard mentioned, it was due to a compressed format.
This link solved it for me; it says to add ('Accept-Encoding', 'gzip,deflate') to the request headers. For example:
import urllib2
import gzip
import zlib
import StringIO

def fetch(url, referer, uagent):  # wrapper function added here for completeness
    opener = urllib2.build_opener()
    opener.addheaders = [('Referer', referer),
                         ('User-Agent', uagent),
                         ('Accept-Encoding', 'gzip,deflate')]
    usock = opener.open(url)
    url = usock.geturl()
    data = decode(usock)
    usock.close()
    return data
Where the decode() function is defined by:
def decode(page):
    encoding = page.info().get("Content-Encoding")
    if encoding in ('gzip', 'x-gzip', 'deflate'):
        content = page.read()
        if encoding == 'deflate':
            data = StringIO.StringIO(zlib.decompress(content))
        else:
            data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
        page = data.read()
    return page
I have a problem with website encoding. I made a program to scrape a website, but I have not been successful in changing the encoding of the content I read. My code is:
import sys,os,glob,re,datetime,optparse
import urllib2
from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup
#from utility import *
sTargetEncoding = "utf-8"
page_to_process = "http://www.xxxx.com"
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding
ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content
document = BSXPathEvaluator(ucontent)
print "ORIGINAL ENCODING: " + document.originalEncoding
I used an external library (BSXPath, an extension of BeautifulSoup), and document.originalEncoding prints the website's original encoding, not the utf-8 encoding I tried to convert to.
Does anyone have a suggestion?
Thanks
Well, there is no guarantee that the encoding presented by the HTTP headers is the same as the one specified inside the HTML itself. This can happen either because of misconfiguration on the server side or because the charset declaration inside the HTML is simply wrong. There is no fully reliable automatic way to detect the right encoding. I suggest checking the HTML manually for the right encoding (e.g. iso-8859-1 vs. utf-8 can be easily distinguished) and then hardcoding the encoding somewhere in your app.
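One minimal way to make that manual check repeatable is to sniff the <meta> charset declaration from the raw bytes and fall back to a hardcoded value; a sketch (the helper name and the utf-8 fallback are my own choices, not from the answer):
import re

def sniff_meta_charset(raw_html, fallback='utf-8'):
    # Look for <meta charset="..."> or a Content-Type <meta> tag near the
    # top of the document; return the fallback if nothing is declared.
    match = re.search(br'charset=["\']?([\w-]+)', raw_html[:2048], re.IGNORECASE)
    if match:
        return match.group(1).decode('ascii', 'ignore')
    return fallback
You could then call encoding = sniff_meta_charset(content) before the unicode(content, encoding) conversion in the script above.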