I use Python's requests library to access (public) ads.txt files:
import requests
r = requests.get('https://www.sicurauto.it/ads.txt')
print(r.text)
This works fine in most cases, but the text from the URL above begins with some strange symbols:
> google.com, [...]
If I open the URL in my browser, I do not see these three symbols; the text begins with google.com, [...]. I am a beginner when it comes to encodings and web protocols. Where might these odd symbols come from?
The file starts with a UTF-8 byte order mark (BOM), which is what shows up as the three odd symbols. Specify the encoding (in r.encoding) before calling r.text so that the BOM is stripped during decoding:
import requests
r = requests.get('https://www.sicurauto.it/ads.txt')
r.encoding = 'utf-8-sig' # specify UTF-8-sig encoding
print(r.text)
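To see where the three symbols come from, here is a small offline sketch. It assumes the file starts with the UTF-8 byte order mark (bytes EF BB BF) and that the text was decoded as ISO-8859-1 because no charset was declared; the sample bytes and the publisher ID are made up for illustration.

```python
# Hypothetical sample of what the server sends: a UTF-8 BOM followed by the text.
raw = b'\xef\xbb\xbfgoogle.com, pub-0000000000000000, DIRECT'

# Decoding as ISO-8859-1 turns the three BOM bytes into three visible characters.
as_latin1 = raw.decode('iso-8859-1')

# Plain UTF-8 keeps the BOM as an invisible U+FEFF character at the start.
as_utf8 = raw.decode('utf-8')

# 'utf-8-sig' decodes as UTF-8 and strips a leading BOM if one is present.
as_utf8_sig = raw.decode('utf-8-sig')

print(repr(as_latin1[:3]))
print(repr(as_utf8[0]))
print(as_utf8_sig)
```

Setting r.encoding = 'utf-8-sig' makes r.text perform the third kind of decoding.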
I've been tinkering with Python using Pythonista on my iPad. I decided to write a simple script that pulls song lyrics in Japanese from one website, and makes post requests to another website that basically annotates the lyrics with extra information.
When I use Python 2 and the module mechanize for the second website, everything works fine, but when I use Python 3 and requests, the resulting text is nonsense.
This is a minimal script that doesn't exhibit the issue:
#!/usr/bin/env python2
from bs4 import BeautifulSoup
import requests
import mechanize
def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib")
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()
    # Use second website to annotate lyrics with furigana
    browser = mechanize.Browser()
    browser.open('http://furigana.sourceforge.net/cgi-bin/index.cgi')
    browser.select_form(nr=0)
    browser.form['text'] = raw_lyrics
    request = browser.submit()
    # My actual script does more stuff at this point, but this snippet doesn't need it
    annotated_lyrics = BeautifulSoup(request.read().decode('utf-8'), "html5lib").find("body").get_text()
    print annotated_lyrics
if __name__ == '__main__':
    main()
The truncated output is:
扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)の夜(よる)昨日(きのう)どうやって帰(かえ)った体(からだ)だけが確(たし)かおはよう これからまた迷子(まいご)の続(つづ)き見慣(みな)れた知(し)らない景色(けしき)の中(なか)でもう駄目(だめ)って思(おも)ってから わりと何(なん)だかやれている死(し)にきらないくらいに丈夫(じょうぶ)何(なに)かちょっと恥(は)ずかしいやるべきことは忘(わす)れていても解(わか)るそうしないと とても苦(くる)しいから顔(かお)を上(あ)げて黒(くろ)い目(め)の人(にん)君(くん)が見(み)たから光(ひかり)は生(う)まれた選(えら)んだ色(しょく)で塗(ぬ)った世界(せかい)に [...]
This is a minimal script that exhibits the issue:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import requests
def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib")
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()
    # Use second website to annotate lyrics with furigana
    url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
    data = {'text': raw_lyrics, 'state': 'output'}
    html_annotated_lyrics = BeautifulSoup(requests.post(url, data=data).text, "html5lib")
    annotated_lyrics = html_annotated_lyrics.find("body").get_text()
    print(annotated_lyrics)
if __name__ == '__main__':
    main()
whose truncated output is:
IQp{_<n(åiFcf0c_S`QLºKJoFSK~_÷PnMc_åjDorn-gFÄîcfcfKhU`KfD{kMjDOD+UKacheZKWDyMSho،fDfã]FWjDhhfæWDKTRfÒDînºL_KIo~_x`rgWc_Lkò~fxyjD·nsoiS`FTê`QLÒüíüLn [...]
It's worth noting that if I just try to get the HTML of the second request, like so:
# Use second website to annotate lyrics with furigana
url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
data = {'text': raw_lyrics, 'state': 'output'}
annotated_lyrics = requests.post(url, data=data).content.decode('utf-8')
An embedded null character error occurs when printing annotated_lyrics. This issue can be circumvented by passing truncated lyrics to the POST request. In the current example, only one character can be passed.
However, with
url = 'https://www.lyrical-nonsense.com/lyrics/aimer/brave-shine/'
I can pass up to 51 characters, like so:
data = {'text': raw_lyrics[0:51], 'state': 'output'}
before triggering the embedded null character error.
I've tried using urllib instead of requests, and I've tried decoding and re-encoding as UTF-8 both the HTML returned by the POST request and the data passed to it. I've also checked that the encoding of the website is utf-8, which matches the encoding reported for the POST request:
r = requests.post(url, data=data)
print(r.encoding)
prints utf-8.
I think the problem has to do with Python 3 being stricter about strings vs. bytes, but I've been unable to pinpoint the exact cause.
While I'd appreciate a working code sample in Python 3, I'm more interested in what exactly I'm doing wrong: what is the code doing that results in failure?
I'm able to get the lyrics properly with this code in Python 3.x:
url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
resp = requests.get(url)
print(BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text())
Printing (truncated)
>>> BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text()
'扉開けば\u3000捻れた昼の夜\r\n昨日どうやって帰った\u3000体だけ...'
A few things strike me as odd there, notably the \r\n (Windows line ending) and \u3000 (IDEOGRAPHIC SPACE), but that's probably not the problem.
The one thing I noticed that's odd about the form submission (and probably why the browser emulator succeeds) is that the form uses multipart instead of urlencoded form data (signified by enctype="multipart/form-data").
Sending multipart form data is a little bit strange in requests. I had to poke around a bit and eventually found this, which helps show how to format the multipart data in a way that the backing server understands. To do this you have to abuse files but pass None as the filename. "For humans", hah!
url2 = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
resp2 = requests.post(url2, files={'text': (None, raw_lyrics), 'state': (None, 'output')})
And the text is not mangled now!
>>> BeautifulSoup(resp2.text).find('body').get_text()
'\n扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)...'
(Note that this code should work in either python2 or python3)
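To make the difference between the two form encodings concrete, here is a rough, stdlib-only sketch of what each kind of request body looks like. The boundary string and the build_multipart helper are hypothetical and only for illustration; requests generates its own boundary and part headers when you pass files=.

```python
from urllib.parse import urlencode

fields = {'text': '扉開けば', 'state': 'output'}

# application/x-www-form-urlencoded: roughly what data={...} produces.
urlencoded_body = urlencode(fields)

def build_multipart(fields, boundary):
    """Hand-rolled multipart/form-data body (illustration only)."""
    parts = []
    for name, value in fields.items():
        parts.append('--' + boundary)
        parts.append('Content-Disposition: form-data; name="%s"' % name)
        parts.append('')  # blank line separates the part headers from the value
        parts.append(value)
    parts.append('--' + boundary + '--')
    parts.append('')
    return '\r\n'.join(parts).encode('utf-8')

multipart_body = build_multipart(fields, boundary='example-boundary')

print(urlencoded_body)
print(multipart_body.decode('utf-8'))
```

In the urlencoded body the Japanese text is percent-encoded, while in the multipart body it travels as raw UTF-8 bytes inside its own part. A server-side script written for one format may misread the other.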
I have issues with a Python script. I am trying to translate some sentences with the Google Translate API. Some sentences have problems with special characters like ä, ö or ü. I can't work out why some sentences work and others don't.
If I try the API call directly in the browser, it works, but inside my Python script the output is garbled.
This is a small version of my script that reproduces the error:
# -*- encoding: utf-8 -*-
import requests
import json
satz="Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl=de&dt=t&q='+satz
r = requests.get(url)
r.text.encode().decode('utf8','ignore')
n = json.loads(r.text)
i = 0
while i < len(n[0]):
    newLine = n[0][i][0]
    print(newLine)
    i=i+1
This is how my result looks:
Unter dem Mondschein glänzt ein winziges Silberfragment, ein Bruchteil einer Li
nie â ? |
Google has served you mojibake: the JSON response contains data that was originally encoded as UTF-8 but was then decoded with a different codec, resulting in incorrect data.
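As a small illustration of how such mojibake arises, the sketch below encodes the ellipsis from the question as UTF-8 and decodes the bytes with the wrong codec. Using cp1252 here is an assumption; which codec was actually applied on the server side is not known.

```python
original = '…'  # U+2026 HORIZONTAL ELLIPSIS, as used in the question

# UTF-8 encodes this single character as three bytes.
utf8_bytes = original.encode('utf-8')

# Assumption: cp1252 stands in for whichever codec was wrongly applied.
garbled = utf8_bytes.decode('cp1252')

# Reversing the wrong step recovers the original text.
repaired = garbled.encode('cp1252').decode('utf-8')

print(utf8_bytes)
print(garbled)
print(repaired)
```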
I suspect Google does this as it decodes the URL parameters. In the past, URL parameters could be encoded in any number of codecs; UTF-8 becoming the standard is a relatively recent development. This is Google's fault, not yours or that of requests.
I found that setting a User-Agent header makes Google behave better; even an (incomplete) user agent of Mozilla/5.0 is enough here for Google to use UTF-8 when decoding your URL parameters.
You should also make sure your URL string is properly percent-encoded. If you pass the parameters as a dictionary to params, requests will take care of adding them to the URL properly:
import requests
satz = "Beneath the moonlight glints a tiny fragment of silver, a fraction of a line…"
url = 'https://translate.googleapis.com/translate_a/single?client=gtx&dt=t'
params = {
    'q': satz,
    'sl': 'en',
    'tl': 'de',
}
headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, params=params, headers=headers)
results = r.json()[0]
# As in the question's n[0][i][0], the translated line is the first element of each row
for outputline, inputline, *__ in results:
    print(outputline)
Note that I pulled out the source and target language parameters into the params dictionary too, and pulled out the input and output line values from the results lists.
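For reference, this is roughly what percent-encoding does to the query text. The sketch uses only urllib.parse from the standard library and a shortened sample sentence; it approximates what happens when a dictionary is passed to params and is not a copy of the requests internals.

```python
from urllib.parse import quote, urlencode

satz = "a fraction of a line…"

# quote() percent-encodes the UTF-8 bytes of every unsafe character.
quoted = quote(satz)

# urlencode() builds a whole query string; spaces become '+'.
query = urlencode({'q': satz, 'sl': 'en', 'tl': 'de'})

print(quoted)
print(query)
```

Concatenating the raw sentence onto the URL, as in the question, leaves the encoding step to be done implicitly, which is easy to get wrong.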
I am working on a project that reads a URL that serves an ICS (iCalendar) file. Instead of being read as a string, it prints as bytes. I need some advice on this.
import requests
url = "http://ical.keele.ac.uk/index.php/ical/ical/15021113"
c = requests.get(url)
c.encoding = 'ISO-8859-1'
print(c.content)
Expected return
BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//hacksw/handcal//NONSGML v1.0//EN
BEGIN:VEVENT
Actual return
b"BEGIN:VCALENDAR\rVERSION:2.0\rPRODID:-//hacksw/handcal//NONSGML v1.0//EN\rBEGIN:VEVENT\r
I have tried using the ICS file directly and it works without any problems, but when I request it from the URL it doesn't work. Thanks
Delirious Lettuce is right: c.content holds the raw response bytes, so just use c.text, which is the decoded string:
http://docs.python-requests.org/en/master/user/quickstart/#response-content
import requests
url = "http://ical.keele.ac.uk/index.php/ical/ical/15021113"
c = requests.get(url)
#c.encoding = 'ISO-8859-1'
#print(c.content)
print(c.text[:10])
results in
BEGIN:VCAL
(Python 3.6.1, 32-bit Windows)
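The difference between the two attributes can be reproduced without a network call. In this sketch the bytes literal is a hypothetical stand-in for c.content, shaped like the output in the question, and decoding it by hand mimics what c.text does with the chosen encoding.

```python
# Hypothetical stand-in for c.content: raw bytes with bare '\r' line endings.
content = b"BEGIN:VCALENDAR\rVERSION:2.0\rPRODID:-//hacksw/handcal//NONSGML v1.0//EN\rBEGIN:VEVENT\r"

# Decoding the bytes gives a str, which is what c.text returns.
text = content.decode('iso-8859-1')

# splitlines() also splits on a bare '\r', so each property gets its own line.
lines = text.splitlines()

for line in lines:
    print(line)
```

Bare \r line endings may not display as separate lines in every terminal, which is why the sketch splits the text before printing.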
I am working on a python web scraper to extract data from this webpage. It contains latin characters like ą, č, ę, ė, į, š, ų, ū, ž. I use BeautifulSoup to recognise the encoding:
def decode_html(html_string):
    converted = UnicodeDammit(html_string)
    print(converted.original_encoding)
    if not converted.unicode_markup:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.tried_encodings))
    return converted.unicode_markup
The encoding that it always seems to use is "windows-1252". However, this turns characters like ė into ë and ų into ø when printing to file or console. I use the lxml library to scrape the data. So I would think that it uses the wrong encoding, but what's odd is that if I use lxml.html.open_in_browser(decoded_html), all the characters are back to normal. How do I print the characters to a file without all the mojibake?
This is what I am using for output:
def write(filename, obj):
    with open(filename, "w", encoding="utf-8") as output:
        json.dump(obj, output, cls=CustomEncoder, ensure_ascii=False)
    return
From the HTTP headers set on the specific webpage you tried to load:
Content-Type:text/html; charset=windows-1257
so decoding the page as Windows-1252 produces incorrect results. BeautifulSoup made a guess (based on statistical models), and guessed wrong. As you noticed, using 1252 instead leads to incorrect codepoints:
>>> 'ė'.encode('cp1257').decode('cp1252')
'ë'
>>> 'ų'.encode('cp1257').decode('cp1252')
'ø'
CP1252 is the fallback for the base character set detection implementation in BeautifulSoup. You can improve the success rate of BeautifulSoup's character-detection code by installing an external library; both chardet and cchardet are supported. For this page, these two libraries guess MacCyrillic and ISO-8859-13, respectively (both wrong, but cchardet got pretty close, perhaps close enough).
In this specific case, you can make use of the HTTP headers instead. In requests, I generally use:
import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
resp = requests.get(url)
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, 'lxml', from_encoding=encoding)
The above only uses the encoding from the response if it was explicitly set by the server and there was no HTML <meta> header. For text/* MIME types, HTTP specifies that the response should be considered as using Latin-1, which requests adheres to, but that default would be incorrect for most HTML data.
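The "explicitly set by the server" check can also be sketched offline with the standard library. The declared_charset helper below is hypothetical and simply parses a Content-Type value; the header strings are sample inputs modelled on the one quoted above.

```python
from email.message import Message

def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()

# Sample bytes, encoded the way the page in the question is served.
raw = 'ė ų'.encode('cp1257')

charset = declared_charset('text/html; charset=windows-1257')
print(charset)
print(raw.decode(charset))    # decoded with the declared charset
print(raw.decode('cp1252'))   # the wrong guess from the question

# No charset parameter: the caller should fall back to other detection.
print(declared_charset('text/html'))
```

Returning None when no charset is declared keeps the Latin-1 default from being mistaken for a real declaration.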
I have a problem with website encoding. I made a program to scrape a website, but I have not managed to change the encoding of the content I read. My code is:
import sys,os,glob,re,datetime,optparse
import urllib2
from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup
#from utility import *
sTargetEncoding = "utf-8"
page_to_process = "http://www.xxxx.com"
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding
ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content
document = BSXPathEvaluator(ucontent)
print "ORIGINAL ENCODING: " + document.originalEncoding
I used an external library (BSXPath, an extension of BeautifulSoup), and document.originalEncoding prints the encoding of the website, not the UTF-8 encoding that I tried to change it to.
Does anyone have a suggestion?
Thanks
Well, there is no guarantee that the encoding given in the HTTP headers is the same as the one specified inside the HTML itself. This can happen either because the server is misconfigured or because the charset definition inside the HTML is simply wrong. There is really no automatic way to reliably detect the right encoding. I suggest checking the HTML manually for the right encoding (e.g. iso-8859-1 vs. utf-8 can be easily distinguished) and then hardcoding that encoding in your app.
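One way to act on this advice is sketched below, written for Python 3. It relies on the fact that non-ASCII text in a single-byte encoding usually fails to decode as strict UTF-8, so UTF-8 is tried first and a hardcoded encoding is used otherwise. The function name decode_with_fallback and the default fallback are assumptions for illustration, not part of any library.

```python
def decode_with_fallback(raw, fallback='iso-8859-1'):
    """Decode bytes as UTF-8 if valid, else with a hardcoded fallback."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: use the encoding chosen after manual inspection.
        return raw.decode(fallback)

utf8_page = 'café'.encode('utf-8')
latin1_page = 'café'.encode('iso-8859-1')

print(decode_with_fallback(utf8_page))
print(decode_with_fallback(latin1_page))
```

This does not tell single-byte encodings apart from each other, so the fallback still has to be chosen by inspecting the page.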