detect and change website encoding in python

I have a problem with website encoding. I wrote a program to scrape a website, but I have not been able to change the encoding of the content I read. My code is:
import sys,os,glob,re,datetime,optparse
import urllib2
from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup
#from utility import *
sTargetEncoding = "utf-8"
page_to_process = "http://www.xxxx.com"
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding
ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content
document = BSXPathEvaluator(ucontent)
print "ORIGINAL ENCODING: " + document.originalEncoding
I used an external library (BSXPath, an extension of BeautifulSoup), and document.originalEncoding prints the encoding of the website, not the utf-8 encoding that I tried to convert to.
Does anyone have a suggestion?
Thanks

Well, there is no guarantee that the encoding given in the HTTP headers is the same as the one specified inside the HTML itself. A mismatch can happen either because the server is misconfigured or because the charset declaration inside the HTML is simply wrong. There is no fully reliable automatic way to detect the right encoding. I suggest checking the HTML manually for the right encoding (e.g. iso-8859-1 vs. utf-8 is easy to tell apart) and then hardcoding that encoding in your app.
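As a rough illustration of that advice, here is a minimal Python 3 sketch that reads the charset from both places so the two can be compared by hand, then converts with a hardcoded encoding. The helper names (charset_from_header, charset_from_meta, to_utf8) and the sample header and body are made up for this example; they do not come from urllib2 or BSXPath.

```python
import re

def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    match = re.search(r'charset=["\']?([\w.:-]+)', content_type or '', re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_meta(raw_html):
    """Return the charset declared in a <meta> tag near the top of the page, or None."""
    match = re.search(rb'<meta[^>]+charset=["\']?([\w.:-]+)', raw_html[:2048], re.IGNORECASE)
    return match.group(1).decode('ascii').lower() if match else None

def to_utf8(raw_html, source_encoding):
    """Decode bytes with the chosen encoding and re-encode them as UTF-8."""
    return raw_html.decode(source_encoding).encode('utf-8')

# Hypothetical response: the header says UTF-8, the page itself says ISO-8859-1.
header = 'text/html; charset=UTF-8'
body = b'<html><head><meta charset="iso-8859-1"></head><body>caf\xe9</body></html>'

print(charset_from_header(header))  # charset from the HTTP header
print(charset_from_meta(body))      # charset from the HTML itself
utf8_body = to_utf8(body, 'iso-8859-1')  # encoding hardcoded after manual inspection
```

Note that the <meta> tag inside the converted bytes still names the old charset, so a parser that trusts the declaration may keep reporting it as the original encoding.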

Related

requests.Response.text shows odd symbols

I use Python's requests library to access (public) ads.txt files:
import requests
r = requests.get('https://www.sicurauto.it/ads.txt')
print(r.text)
This works fine in most cases, but the text from the URL above begins with some strange symbols:
> ï»¿google.com, [...]
If I open the URL in my browser, I do not see these three symbols; the text begins with google.com, [...]. I am a beginner when it comes to encodings and web protocols. Where might these odd symbols come from?

You need to specify your encoding (in r.encoding) before calling r.text:
import requests
r = requests.get('https://www.sicurauto.it/ads.txt')
r.encoding = 'utf-8-sig' # specify UTF-8-sig encoding
print(r.text)
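The three symbols are what the UTF-8 byte order mark (bytes EF BB BF) looks like when it is decoded with a single-byte Latin codec, which is presumably what happened here. The following sketch reproduces this offline; the sample body is hypothetical, not the real file.

```python
import codecs

# Hypothetical ads.txt body: a UTF-8 byte order mark followed by ASCII text.
raw = codecs.BOM_UTF8 + b'google.com, pub-0000000000000000, DIRECT'

as_latin1 = raw.decode('latin-1')      # BOM bytes shown as three characters
as_utf8 = raw.decode('utf-8')          # BOM kept as the invisible U+FEFF
as_utf8_sig = raw.decode('utf-8-sig')  # BOM stripped

print(repr(as_latin1[:3]))
print(repr(as_utf8[0]))
print(as_utf8_sig)
```

Plain utf-8 would remove the visible symbols but leave an invisible U+FEFF at the start of the text, which is why utf-8-sig is the safer choice when the first field is parsed.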

UTF8 mismatch in script

I have issues with a Python script. I am trying to translate some sentences with the Google Translate API. Some sentences have problems with non-ASCII characters like ä, ö or ü. I can't figure out why some sentences work and others don't.
If I make the API call directly in the browser, it works, but inside my Python script I get a mismatch.
This is a small version of my script that shows the error directly:
# -*- coding: utf-8 -*-
import requests
import json
satz="Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=de&dt=t&q='+satz
r = requests.get(url)
r.text.encode().decode('utf8','ignore')
n = json.loads(r.text)
i = 0
while i < len(n[0]):
    newLine = n[0][i][0]
    print(newLine)
    i=i+1
This is what my result looks like:
Unter dem Mondschein glÃ¤nzt ein winziges Silberfragment, ein Bruchteil einer Li
nie â ? |

Google has served you mojibake: the JSON response contains data that was originally encoded as UTF-8 but was then decoded with a different codec, resulting in incorrect data.
I suspect Google does this when it decodes the URL parameters. In the past, URL parameters could be encoded in any number of codecs; UTF-8 as the standard is a relatively recent development. This is Google's fault, not yours or that of requests.
I found that setting a User-Agent header makes Google behave better; even an (incomplete) user agent of Mozilla/5.0 is enough here for Google to use UTF-8 when decoding your URL parameters.
You should also make sure your URL string is properly percent-encoded. If you pass the parameters as a dictionary to params, requests will take care of adding them to the URL properly:
import requests

satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&dt=t'
params = {
    'q': satz,
    'sl': 'en',
    'tl': 'de',
}
headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, params=params, headers=headers)
results = r.json()[0]
for inputline, outputline, *__ in results:
    print(outputline)
Note that I pulled out the source and target language parameters into the params dictionary too, and pulled out the input and output line values from the results lists.
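To make the two ideas above concrete without calling Google, here is a small offline sketch. It uses the standard library's urllib.parse.urlencode to show the kind of percent-encoding that a params dictionary gets, and it reproduces the glÃ¤nzt mojibake by decoding UTF-8 bytes as Latin-1. The shortened sentence is only a sample, and choosing Latin-1 as the wrong codec is an assumption that happens to match the garbled output shown in the question.

```python
from urllib.parse import urlencode

satz = "a fraction of a line…"

# Non-ASCII characters are percent-encoded as their UTF-8 bytes.
query = urlencode({'q': satz, 'sl': 'en', 'tl': 'de'})
print(query)

# Mojibake: UTF-8 bytes decoded with a single-byte codec.
garbled = 'glänzt'.encode('utf-8').decode('latin-1')
print(garbled)
```

The ellipsis becomes %E2%80%A6, its three UTF-8 bytes, which is unambiguous for a server that expects UTF-8 in the query string.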

Latin encoding issue

I am working on a python web scraper to extract data from this webpage. It contains latin characters like ą, č, ę, ė, į, š, ų, ū, ž. I use BeautifulSoup to recognise the encoding:
from bs4 import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string)
    print(converted.original_encoding)
    if not converted.unicode_markup:
        # UnicodeDecodeError requires five arguments, so raise ValueError instead
        raise ValueError(
            "Failed to detect encoding, tried %r" % (converted.tried_encodings,))
    return converted.unicode_markup
The encoding that it always seems to use is "windows-1252". However, this turns characters like ė into ë and ų into ø when printing to file or console. I use the lxml library to scrape the data. So I would think that it uses the wrong encoding, but what's odd is that if I use lxml.html.open_in_browser(decoded_html), all the characters are back to normal. How do I print the characters to a file without all the mojibake?
This is what I am using for output:
def write(filename, obj):
    with open(filename, "w", encoding="utf-8") as output:
        json.dump(obj, output, cls=CustomEncoder, ensure_ascii=False)
    return

From the HTTP headers set on the specific webpage you tried to load:
Content-Type:text/html; charset=windows-1257
so decoding the page as Windows-1252 produces wrong characters. BeautifulSoup made a guess (based on statistical models), and guessed wrong. As you noticed, using 1252 instead leads to incorrect codepoints:
>>> 'ė'.encode('cp1257').decode('cp1252')
'ë'
>>> 'ų'.encode('cp1257').decode('cp1252')
'ø'
CP1252 is the fallback for the base character set detection implementation in BeautifulSoup. You can improve the success rate of BeautifulSoup's character-detection code by installing an external library; both chardet and cchardet are supported. For this page, these two libraries guess MacCyrillic and ISO-8859-13, respectively (both wrong, but cchardet got pretty close, perhaps close enough).
In this specific case, you can make use of the HTTP headers instead. In requests, I generally use:
import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
resp = requests.get(url)
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)
The above only uses the encoding from the HTTP response if the server set it explicitly and there is no charset declaration in an HTML <meta> tag. For text/* MIME types, HTTP specifies that a response without a charset should be treated as Latin-1, which requests adheres to, but that default is incorrect for most HTML data.
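If text has already been decoded with the wrong code page, the cp1252 versus cp1257 mix-up described above can usually be reversed, because both are single-byte codecs and no bytes were lost. A minimal sketch follows; the helper name redecode is made up for this example.

```python
def redecode(text, wrong, right):
    """Undo a decode made with the wrong single-byte codec."""
    return text.encode(wrong).decode(right)

garbled = 'ëø'  # what 'ėų' looks like after being decoded as cp1252
fixed = redecode(garbled, wrong='cp1252', right='cp1257')
print(fixed)

raw = 'ėų'.encode('cp1257')
print(raw.decode('cp1257'))  # decoded with the charset the server declared
print(raw.decode('cp1252'))  # decoded with BeautifulSoup's wrong guess
```

This is a repair for data that is already garbled; decoding with the right encoding in the first place, as in the requests snippet above, is the cleaner fix.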

BeautifulSoup: scraping spanish characters issue

I'm trying to get some Spanish text from a website using BeautifulSoup and urllib2. I currently get this: Â¡Hola! Â¿CÃ³mo estÃ¡s?.
I have tried applying the different unicode functions I have seen on related threads, but nothing seems to work for my issue:
# import the main window object (mw) from aqt
from aqt import mw
# import the "show info" tool from utils.py
from aqt.utils import showInfo
# import all of the Qt GUI library
from aqt.qt import *
from BeautifulSoup import BeautifulSoup
import urllib2
wiki = "http://spanishdict.com/translate/hola"
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page)
dictionarydiv = soup.find("div", { "class" : "dictionary-neodict-example" })
dictionaryspans = dictionarydiv.contents
firstspan = dictionaryspans[0]
firstspantext = firstspan.contents
thetext = firstspantext[0]
thetextstring = str(thetext)

thetext is of type <class 'BeautifulSoup.NavigableString'>. Printing it outputs a Unicode string, which is encoded using the output terminal's encoding:
print thetext
Output (in a Windows console):
¡Hola! ¿Cómo estás?
This will work on any terminal configured for an encoding supporting the Unicode characters being printed.
You'll get UnicodeEncodeError if your terminal is configured with an encoding that doesn't support the Unicode characters you try to print.
Using str on that type returns a byte string...in this case encoded in UTF-8. If you print that on anything but a UTF-8-configured terminal, you'll get an incorrect display.
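The answer above is written for Python 2. The same effects can be reproduced in Python 3 with explicit encode and decode calls, as in this sketch, where Latin-1 stands in for a terminal that is not configured for UTF-8 (an assumption for illustration).

```python
text = '¡Hola! ¿Cómo estás?'

# UTF-8 bytes shown through a Latin-1 "terminal" reproduce the garbled output.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes.decode('latin-1'))

# An encoding that lacks the characters cannot represent them at all.
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    print('cannot encode:', exc.reason)

# errors='replace' trades the exception for placeholder characters.
fallback = text.encode('ascii', errors='replace')
print(fallback)
```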

Spynner wrong encoding

I'm trying to download this page - https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8 (looks like this for me in Russia - http://screencloud.net/v/6a7o) via spynner in python - it uses some javascript checking so one does not simply download it without full browser emulation.
My code:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from StringIO import StringIO
import spynner
def log(str, filename_end):
    filename = '/tmp/apple_log_%s.html' % filename_end
    print 'logged to %s' % filename
    f = open(filename, 'w')
    f.write(str)
    f.close()
debug_stream = StringIO()
browser = spynner.Browser(debug_level=3, debug_stream=debug_stream)
browser.load("https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8")
ret = browser.contents
log(ret, 'noenc')
print 'content length = %s' % len(ret)
browser.close()
del browser
f=open('/tmp/apple_log_debug', 'w')
f.write(debug_stream.getvalue())
f.close()
print 'log stored in /tmp/debug_log'
So, the problem is: either apple, either spynner work wrong with Cyrillic symbols. I see them fine if I try browser.show() after loading, but in the code and logs they are still wrong encoded like <meta content="ÐÐ¾Ð»ÑÑÐ¸ÑÑ Farm Storyâ¢ Ð² App Store. ÐÑÐ¾ÑÐ¼Ð¾ÑÑÐµÑÑ ÑÐºÑÐ¸Ð½ÑÐ¾ÑÑ Ð¸ ÑÐµÐ¹ÑÐ¸Ð½Ð³Ð¸, Ð¿ÑÐ¾ÑÐ¸ÑÐ°ÑÑ Ð¾ÑÐ·ÑÐ²Ñ Ð¿Ð¾ÐºÑÐ¿Ð°ÑÐµÐ»ÐµÐ¹." property="og:description">.
http://2cyr.com/ says that it is UTF-8 text displayed as iso-8859-1...
As you can see, I don't use any headers in my request, but if I take them from Chrome's network debug console and pass them to the load() method, e.g. headers=[('Accept-Encoding', 'utf-8'), ('Accept-Language', 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4')], I get the same result.
Also, from the same network console you can see that chrome uses gzip,deflate,sdch as Accept-Encoding. I can try that too, but I fail to decode what I get: <html><head></head><body>ï¿½ï¿½}ksÇï¿½g!ï¿½ï¿½ï¿½4ï¿½I/zï¿½Oï¿½ï¿½ï¿½/)ï¿½(ywï¿½ï¿½ï¿½é®iï¿½ï¿½{ï¿½<vï¿½ï¿½ï¿½:ï¿½ï¿½Ù·ï¿½Ø³-?ï¿½bï¿½bï¿½ï¿½ jï¿½... even if I remove the tags from the begin and end of the result.
Any help?

Basically, browser.webframe.toHtml() returns a QString, so str() won't help if the result actually contains non-Latin Unicode characters.
If you want to get a Python unicode string, you need to do:
ret = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")
# if you want to get rid of non-Latin text
ret = ret.encode("ascii", errors="replace") # encodes to a byte string
If you suspect the text is Russian, you could instead encode it to a single-byte Russian code page (still a byte string):
ret = ret.encode("cp1251", errors="replace") # encodes to Windows-1251
# or
ret = ret.encode("cp866", errors="replace") # encodes to the Windows/DOS console code page
Only then can you save it to a file as plain bytes.
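For comparison, here is a minimal Python 3 sketch of the same idea, where the file is opened with an explicit encoding instead of relying on sys.setdefaultencoding. The sample string and the file name are hypothetical stand-ins for the page content and the log path; spynner itself is not used here.

```python
import os
import tempfile

def log(text, filename):
    """Write a Unicode string to filename as UTF-8 bytes."""
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(text)

sample = 'Получить Farm Story в App Store'  # hypothetical page fragment
path = os.path.join(tempfile.mkdtemp(), 'apple_log_demo.html')
log(sample, path)

with open(path, 'rb') as f:
    raw = f.read()  # the bytes actually stored on disk

# The same text in the byte encodings mentioned above.
as_cp1251 = sample.encode('cp1251', errors='replace')
as_ascii = sample.encode('ascii', errors='replace')
print(len(raw), len(as_cp1251), as_ascii)
```

UTF-8 keeps every character at the cost of two bytes per Cyrillic letter, cp1251 uses one byte per letter, and ASCII with errors="replace" loses the Cyrillic text entirely.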

str(browser.webframe.toHtml()) saved me
