UTF8 mismatch in script - python

I have a problem with a Python script. I am trying to translate some sentences with the Google Translate API. Some sentences have problems with special UTF-8 characters like ä, ö or ü, and I can't see why some sentences work while others don't.
If I make the API call directly in the browser, it works, but inside my Python script I get a mismatch.
This is a small version of my script that shows the error directly:
# -*- encoding: utf-8 -*-
import requests
import json
satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=de&dt=t&q=' + satz
r = requests.get(url)
r.text.encode().decode('utf8', 'ignore')
n = json.loads(r.text)
i = 0
while i < len(n[0]):
    newLine = n[0][i][0]
    print(newLine)
    i = i + 1
This is how my result looks:
Unter dem Mondschein glänzt ein winziges Silberfragment, ein Bruchteil einer Li
nie â ? |

Google has served you mojibake: the JSON response contains data that was originally encoded as UTF-8 but was then decoded with a different codec, resulting in incorrect data.
I suspect this happens when Google decodes the URL parameters. In the past, URL parameters could be encoded in any number of codecs; UTF-8 becoming the standard is a relatively recent development. This is Google's fault, not yours or that of requests.
I found that setting a User-Agent header makes Google behave better; even an (incomplete) user agent of Mozilla/5.0 is enough here for Google to use UTF-8 when decoding your URL parameters.
You should also make sure your URL string is properly percent-encoded. If you pass the parameters as a dictionary to params, requests will take care of adding them to the URL properly:
satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&dt=t'
params = {
    'q': satz,
    'sl': 'en',
    'tl': 'de',
}
headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, params=params, headers=headers)
results = r.json()[0]
for inputline, outputline, *__ in results:
    print(outputline)
Note that I pulled out the source and target language parameters into the params dictionary too, and pulled out the input and output line values from the results lists.
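To see what this kind of mojibake looks like, here is a minimal sketch. It assumes the wrong codec was Windows-1252 (cp1252), which is only a guess at what happened on Google's side, and the short sample string is made up for illustration. It also shows the kind of percent-encoding that is applied to query parameters, using the standard library's urlencode:

```python
from urllib.parse import urlencode

# Short sample ending in the same ellipsis character as the original sentence.
text = "a line…"

# Encode correctly as UTF-8, then decode with the wrong codec (assumed: cp1252).
mojibake = text.encode("utf-8").decode("cp1252")
print(mojibake)  # a lineâ€¦

# Reversing the wrong step recovers the text, as long as no bytes were lost.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == text)  # True

# Percent-encode the query parameters as UTF-8.
query = urlencode({"q": text, "sl": "en", "tl": "de"})
print(query)  # q=a+line%E2%80%A6&sl=en&tl=de
```

The three bytes of the ellipsis (E2 80 A6) each turn into a separate character under cp1252, which matches the shape of the garbage at the end of the output in the question.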

Related

Python 3 when writing json file double backslash problem

I use urllib.request and regex to parse HTML, but when I write the result to a JSON file there are double backslashes in the text. How can I get rid of the extra backslash?
I have looked at many solutions but none of them have worked.
import io
import json
import re
from urllib.request import Request, urlopen
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'
req = Request('https://www.manga-tr.com/manga-list.html', headers=headers)
response = urlopen(req).read()
a = re.findall(r'<b><a[^>]* href="([^"]*)"',str(response))
sub_req = Request('https://www.manga-tr.com/'+a[3], headers=headers)
sub_response = urlopen(sub_req).read()
manga = {}
manga['manga'] = []
manga_subject = re.findall(r'<h3>Tan.xc4.xb1t.xc4.xb1m<.h3>[^<]*.n.t([^<]*).t',str(sub_response))
manga['manga'].append({'msubject': manga_subject })
with io.open('allmanga.json', 'w', encoding='utf-8-sig') as outfile:
    outfile.write(json.dumps(manga, indent=4))
This is my JSON file:
{
    "manga": [
        {
            "msubject": [
                " Minami Ria 16 ya\\xc5\\x9f\\xc4\\xb1ndad\\xc4\\xb1r. \\xc4\\xb0lk erkek arkada\\xc5\\x9f\\xc4\\xb1 sakatani jirou(16) ile yakla\\xc5\\x9f\\xc4\\xb1k 6 ayd\\xc4\\xb1r beraberdir. Herkes taraf\\xc4\\xb1ndan \\xc3\\xa7ifte kumru olarak g\\xc3\\xb6r\\xc3\\xbclmelerine ra\\xc4\\x9fmen ili\\xc5\\x9fkilerinde %1\\'lik bir eksiklik vard\\xc4\\xb1r. Bu eksikli\\xc4\\x9fi tamamlayabilecekler mi?"
            ]
        }
    ]
}
Why Is This Happening?
The problem is that str is used to convert a bytes object to a str, which does not do the conversion in the desired way.
a = re.findall(r'<b><a[^>]* href="([^"]*)"',str(response))
#                                           ^^^
For example, if the response is the word "Tanıtım", it would be expressed in UTF-8 as b'Tan\xc4\xb1t\xc4\xb1m'. If you then use str on that, you get:
In [1]: response = b'Tan\xc4\xb1t\xc4\xb1m'
In [2]: str(response)
Out[2]: "b'Tan\\xc4\\xb1t\\xc4\\xb1m'"
If you convert this to JSON, you'll see double backslashes (which are really just ordinary backslashes, encoded as JSON).
In [3]: import json
In [4]: print(json.dumps(str(response)))
"b'Tan\\xc4\\xb1t\\xc4\\xb1m'"
The correct way to convert a bytes object back to a str is by using the decode method, with the appropriate encoding:
In [5]: response.decode('UTF-8')
Out[5]: 'Tanıtım'
Note that the response is not valid UTF-8, unfortunately. The website operators appear to be serving corrupted data.
Quick Fix
Replace every call to str(response) with response.decode('UTF-8', 'replace') and update the regular expressions to match.
a = re.findall(
    # "r" prefix to string is unnecessary
    '<b><a[^>]* href="([^"]*)"',
    response.decode('UTF-8', 'replace'))
sub_req = Request('https://www.manga-tr.com/'+a[3],
                  headers=headers)
sub_response = urlopen(sub_req).read()
manga = {}
manga['manga'] = []
manga_subject = re.findall(
    # "r" prefix to string is unnecessary
    '<h3>Tanıtım</h3>([^<]*)',
    sub_response.decode('UTF-8', 'replace'))
manga['manga'].append({'msubject': manga_subject })
# io.open is the same as open
with open('allmanga.json', 'w', encoding='utf-8-sig') as fp:
    # json.dumps is unnecessary
    json.dump(manga, fp, indent=4)
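To see what the 'replace' error handler does with invalid input, here is a small sketch. The sample bytes are made up for illustration (valid UTF-8 plus one stray byte) and are not taken from the website:

```python
# Valid UTF-8 for "Tanıtım", followed by 0xff, a byte that never occurs in UTF-8.
raw = b'Tan\xc4\xb1t\xc4\xb1m\xff'

# A strict decode refuses the invalid byte.
try:
    raw.decode('UTF-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False

# With 'replace', the invalid byte becomes U+FFFD (the replacement character)
# and the rest of the text decodes normally.
replaced = raw.decode('UTF-8', 'replace')
print(replaced == 'Tanıtım\ufffd')  # True
```

This keeps the scraper running on corrupted pages, at the cost of a few replacement characters in the output.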
Better Fix
Use "Requests"
The Requests library is much easier to use than urlopen. It does not come with Python, so you will have to install it (with pip, apt, dnf, or whatever you use). It will look like this:
response = requests.get(
    'https://www.manga-tr.com/manga-list.html')
response.text then contains the decoded string, so you don't need to decode it yourself. Easier!
Use BeautifulSoup
The Beautiful Soup library can search through HTML documents, and it is more reliable and easier to use than regular expressions. It also needs to be installed. You might use it like this, for example, to find the summary on a manga page:
soup = BeautifulSoup(response.text, 'html.parser')
subject = soup.find('h3', text='Tanıtım').next_sibling.string
Summary
Here is a Gist containing a more complete example of what the scraper might look like.
Keep in mind that scraping a website can be a bit difficult, just because you might scrape 100 pages and then suddenly discover that something is wrong with your scraper, or you are hitting the website too hard, or something crashes and fails and you need to start over. So scraping well often involves rate-limiting, saving progress and caching responses, and (ideally) parsing robots.txt.
But Requests + BeautifulSoup will at least get you started. Again, see the Gist.
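As a rough illustration of two of the points above (robots.txt and rate-limiting), here is a standard-library sketch. The robots.txt content, the example.com URLs, and the names build_robots and Throttle are all hypothetical; a real scraper would download the site's actual robots.txt and pick a delay that suits the site:

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


class Throttle:
    """Make sure consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


robots = build_robots(ROBOTS_TXT)
throttle = Throttle(min_interval=0.05)

for url in ("https://example.com/manga-list.html",
            "https://example.com/private/page.html"):
    if robots.can_fetch("*", url):
        throttle.wait()
        print("would fetch", url)
    else:
        print("skipping", url)
```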

UTF-8 text from website is decoded improperly when using python 3 and requests, works well with Python 2 and mechanize

I've been tinkering with Python using Pythonista on my iPad. I decided to write a simple script that pulls song lyrics in Japanese from one website, and makes post requests to another website that basically annotates the lyrics with extra information.
When I use Python 2 and the module mechanize for the second website, everything works fine, but when I use Python 3 and requests, the resulting text is nonsense.
This is a minimal script that doesn't exhibit the issue:
#!/usr/bin/env python2
from bs4 import BeautifulSoup
import requests
import mechanize
def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib")
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()
    # Use second website to annotate lyrics with furigana
    browser = mechanize.Browser()
    browser.open('http://furigana.sourceforge.net/cgi-bin/index.cgi')
    browser.select_form(nr=0)
    browser.form['text'] = raw_lyrics
    request = browser.submit()
    # My actual script does more stuff at this point, but this snippet doesn't need it
    annotated_lyrics = BeautifulSoup(request.read().decode('utf-8'), "html5lib").find("body").get_text()
    print annotated_lyrics
if __name__ == '__main__':
    main()
The truncated output is:
扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)の夜(よる)昨日(きのう)どうやって帰(かえ)った体(からだ)だけが確(たし)かおはよう これからまた迷子(まいご)の続(つづ)き見慣(みな)れた知(し)らない景色(けしき)の中(なか)でもう駄目(だめ)って思(おも)ってから わりと何(なん)だかやれている死(し)にきらないくらいに丈夫(じょうぶ)何(なに)かちょっと恥(は)ずかしいやるべきことは忘(わす)れていても解(わか)るそうしないと とても苦(くる)しいから顔(かお)を上(あ)げて黒(くろ)い目(め)の人(にん)君(くん)が見(み)たから光(ひかり)は生(う)まれた選(えら)んだ色(しょく)で塗(ぬ)った世界(せかい)に [...]
This is a minimal script that exhibits the issue:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import requests
def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib")
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()
    # Use second website to annotate lyrics with furigana
    url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
    data = {'text': raw_lyrics, 'state': 'output'}
    html_annotated_lyrics = BeautifulSoup(requests.post(url, data=data).text, "html5lib")
    annotated_lyrics = html_annotated_lyrics.find("body").get_text()
    print(annotated_lyrics)
if __name__ == '__main__':
    main()
whose truncated output is:
IQp{_<n(åiFcf0c_S`QLºKJoFSK~_÷PnMc_åjDorn-gFÄîcfcfKhU`KfD{kMjDOD+UKacheZKWDyMSho،fDfã]FWjDhhfæWDKTRfÒDînºL_KIo~_x`rgWc_Lkò~fxyjD·nsoiS`FTê`QLÒüíüLn [...]
It's worth noting that if I just try to get the HTML of the second request, like so:
# Use second website to annotate lyrics with furigana
url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
data = {'text': raw_lyrics, 'state': 'output'}
annotated_lyrics = requests.post(url, data=data).content.decode('utf-8')
An embedded null character error occurs when printing annotated_lyrics. This issue can be circumvented by passing truncated lyrics to the POST request. In the current example, only one character can be passed.
However, with
url = 'https://www.lyrical-nonsense.com/lyrics/aimer/brave-shine/'
I can pass up to 51 characters, like so:
data = {'text': raw_lyrics[0:51], 'state': 'output'}
before triggering the embedded null character error.
I've tried using urllib instead of requests, decoding and encoding to utf-8 the resulting HTML of the post request, or the data passed as an argument to this request. I've also checked that the encoding of the website is utf-8, which matches the encoding of the post requests:
r = requests.post(url, data=data)
print(r.encoding)
prints utf-8.
I think the problem has to do with how Python 3 is more strict in how it treats strings vs bytes, but I've been unable to pinpoint the exact cause.
While I'd appreciate a working code sample in Python 3, I'm more interested in what exactly I'm doing wrong, and in what the code is doing that results in failure.
I'm able to get the lyrics properly with this code in python3.x:
url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
resp = requests.get(url)
print(BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text())
Printing (truncated)
>>> BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text()
'扉開けば\u3000捻れた昼の夜\r\n昨日どうやって帰った\u3000体だけ...'
A few things strike me as odd there, notably the \r\n (Windows line ending) and \u3000 (IDEOGRAPHIC SPACE), but that's probably not the problem.
The one thing I noticed that's odd about the form submission (and probably why the browser emulator succeeds) is that the form uses multipart instead of urlencoded form data (signified by enctype="multipart/form-data").
Sending multipart form data is a little bit strange in requests. I had to poke around a bit and eventually found this which helps show how to format the multipart data in a way that the backing server understands. To do this you have to abuse files but have a "None" filename. "for humans" hah!
url2 = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
resp2 = requests.post(url2, files={'text': (None, raw_lyrics), 'state': (None, 'output')})
And the text is not mangled now!
>>> BeautifulSoup(resp2.text).find('body').get_text()
'\n扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)...'
(Note that this code should work in either python2 or python3)
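To make the difference between the two form encodings concrete, here is a simplified, hand-built sketch of the two request bodies. It is for illustration only: requests builds the real bodies for you, the helper names urlencoded_body and multipart_body are hypothetical, and the multipart helper handles plain text fields only (no file uploads, no escaping of field names):

```python
import uuid
from urllib.parse import urlencode


def urlencoded_body(fields):
    """Body in the style sent by requests.post(url, data=fields)."""
    body = urlencode(fields).encode("ascii")
    return body, "application/x-www-form-urlencoded"


def multipart_body(fields, boundary=None):
    """Simplified multipart/form-data body for text-only fields."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


fields = {"text": "扉開けば", "state": "output"}

form_body, form_type = urlencoded_body(fields)
multi_body, multi_type = multipart_body(fields, boundary="example-boundary")

print(form_type)
print(form_body)
print(multi_type)
print(multi_body.decode("utf-8"))
```

In the urlencoded body every non-ASCII byte is percent-escaped, while the multipart body carries the raw UTF-8 bytes between boundary lines. A server that expects one format will not necessarily parse the other correctly.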

requests.Response.text shows odd symbols

I use Python's requests library to access (public) ads.txt files:
import requests
r = requests.get('https://www.sicurauto.it/ads.txt')
print(r.text)
This works fine in most cases, but the text from the URL above begins with some strange symbols:
> google.com, [...]
If I open the URL in my browser, I do not see these three symbols; the text begins with google.com, [...] I am a beginner when it comes to encodings and web protocols ... where might these odd symbols come from?
You need to specify your encoding (in r.encoding) before calling r.text:
import requests
r = requests.get('https://www.sicurauto.it/ads.txt')
r.encoding = 'utf-8-sig' # specify UTF-8-sig encoding
print(r.text)
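The reason this helps can be sketched without a network connection. The sample bytes below are an assumption: a UTF-8 byte order mark (BOM) followed by the start of the file, decoded as ISO-8859-1 in the way a text response without a declared charset would be:

```python
import codecs

# Assumed sample: UTF-8 BOM (EF BB BF) followed by the beginning of the file.
raw = codecs.BOM_UTF8 + b"google.com, [...]"

# Decoded as ISO-8859-1, the three BOM bytes become three visible symbols.
as_latin1 = raw.decode("iso-8859-1")
print(as_latin1[:3] == "\u00ef\u00bb\u00bf")  # True

# Decoded as plain UTF-8, the BOM survives as an invisible U+FEFF character.
as_utf8 = raw.decode("utf-8")
print(as_utf8[0] == "\ufeff")  # True

# utf-8-sig decodes as UTF-8 and drops a leading BOM if there is one.
as_utf8_sig = raw.decode("utf-8-sig")
print(as_utf8_sig)  # google.com, [...]
```

A browser strips the BOM silently, which is why the symbols do not show up there.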

Latin encoding issue

I am working on a python web scraper to extract data from this webpage. It contains latin characters like ą, č, ę, ė, į, š, ų, ū, ž. I use BeautifulSoup to recognise the encoding:
def decode_html(html_string):
    converted = UnicodeDammit(html_string)
    print(converted.original_encoding)
    if not converted.unicode_markup:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.tried_encodings))
    return converted.unicode_markup
The encoding that it always seems to use is "windows-1252". However, this turns characters like ė into ë and ų into ø when printing to file or console. I use the lxml library to scrape the data. So I would think that it uses the wrong encoding, but what's odd is that if I use lxml.html.open_in_browser(decoded_html), all the characters are back to normal. How do I print the characters to a file without all the mojibake?
This is what I am using for output:
def write(filename, obj):
    with open(filename, "w", encoding="utf-8") as output:
        json.dump(obj, output, cls=CustomEncoder, ensure_ascii=False)
    return
From the HTTP headers set on the specific webpage you tried to load:
Content-Type:text/html; charset=windows-1257
so Windows-1252 will result in invalid results. BeautifulSoup made a guess (based on statistical models), and guessed wrong. As you noticed, using 1252 instead leads to incorrect codepoints:
>>> 'ė'.encode('cp1257').decode('cp1252')
'ë'
>>> 'ų'.encode('cp1257').decode('cp1252')
'ø'
CP1252 is the fallback for the base characterset detection implementation in BeautifulSoup. You can improve the success-rate of BeautifulSoup's character-detection code by installing an external library; both chardet and cchardet are supported. These two libraries guess at MacCyrillic and ISO-8859-13, respectively (both wrong, but cchardet got pretty close, perhaps close enough).
In this specific case, you can make use of the HTTP headers instead. In requests, I generally use:
import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
resp = requests.get(url)
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)
The above only uses the encoding from the response if it was explicitly set by the server and there was no HTML <meta> header. For text/* mime-types, HTTP specifies that the response should be considered as using Latin-1, which requests adheres to, but that default would be incorrect for most HTML data.
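The "only trust an explicitly declared charset" check can also be sketched with the standard library alone. The function name declared_charset and the sample header values are hypothetical; the sample bytes are simply two of the characters from the question encoded as cp1257:

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Two characters from the question, encoded the way the server sends them.
raw = "ė ų".encode("cp1257")

charset = declared_charset("text/html; charset=windows-1257")
print(charset)  # windows-1257

# Decode with the declared charset rather than a statistical guess.
text = raw.decode(charset)
print(text == "ė ų")  # True

# Without a charset parameter there is nothing to trust, so fall back to
# the <meta> declaration or to detection instead.
print(declared_charset("text/html"))  # None
```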

detect and change website encoding in python

I have a problem with website encoding. I wrote a program to scrape a website, but I have not succeeded in changing the encoding of the content I read. My code is:
import sys,os,glob,re,datetime,optparse
import urllib2
from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup
#from utility import *
sTargetEncoding = "utf-8"
page_to_process = "http://www.xxxx.com"
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding
ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content
document = BSXPathEvaluator(ucontent)
print "ORIGINAL ENCODING: " + document.originalEncoding
I used an external library (BSXPath, an extension of BeautifulSoup), and document.originalEncoding prints the encoding of the website, not the UTF-8 encoding that I tried to change it to.
Does anyone have a suggestion?
Thanks
Well, there is no guarantee that the encoding presented by the HTTP headers is the same as the one specified inside the HTML itself. This can happen either because of a misconfiguration on the server side or because the charset definition inside the HTML is simply wrong. There is really no automatic way to detect the right encoding. I suggest checking the HTML manually for the right encoding (e.g. iso-8859-1 vs. utf-8 can easily be told apart) and then hardcoding the encoding in your app.
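One way to tell iso-8859-1 and utf-8 apart while checking by hand is a strict UTF-8 decode, since most non-ASCII iso-8859-1 text is not valid UTF-8. The sketch below is written for Python 3 (the question uses Python 2), and the function name decode_utf8_or_latin1 is hypothetical. Note that decoding as iso-8859-1 never fails, so the fallback is a guess, not a detection:

```python
def decode_utf8_or_latin1(raw):
    """Try strict UTF-8 first; fall back to ISO-8859-1 if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1"), "iso-8859-1"


utf8_result = decode_utf8_or_latin1("café".encode("utf-8"))
latin1_result = decode_utf8_or_latin1("café".encode("iso-8859-1"))

print(utf8_result)    # ('café', 'utf-8')
print(latin1_result)  # ('café', 'iso-8859-1')
```

Once you know which encoding the site really uses, hardcode it as suggested above rather than running this check on every page.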
