I was looking at this code golf problem and decided to try rewriting the Python solution to use urllib. I started from some sample code for fetching JSON with urllib:
import urllib.request
import json
res = urllib.request.urlopen('http://api.stackexchange.com/questions?sort=hot&site=codegolf')
res_body = res.read()
j = json.loads(res_body.decode("utf-8"))
This gives:
➜ codegolf python clickbait.py
Traceback (most recent call last):
  File "clickbait.py", line 7, in <module>
    j = json.loads(res_body.decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
If you go to http://api.stackexchange.com/questions?sort=hot&site=codegolf in a browser and look under "Headers", it says charset=utf-8. Why is urlopen giving me these weird results?
res_body is gzipped: 0x8b at position 1 is the second byte of the gzip magic number. urllib does not decompress responses for you by default.
You'll have your data once you decompress the response from the API server.
import urllib.request
import zlib
import json
with urllib.request.urlopen(
    'http://api.stackexchange.com/questions?sort=hot&site=codegolf'
) as res:
    decompressed_data = zlib.decompress(res.read(), 16 + zlib.MAX_WBITS)
    j = json.loads(decompressed_data.decode('utf-8'))
    print(j)
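As an aside, the magic-looking second argument to zlib.decompress, 16 + zlib.MAX_WBITS, tells zlib to expect a gzip wrapper around the deflate stream. The standard-library gzip module expresses the same operation more readably; a minimal round-trip sketch, with a made-up payload standing in for the API response body:

```python
import gzip
import zlib

payload = b'{"items": []}'          # stand-in for the API response body
compressed = gzip.compress(payload)

# 16 + zlib.MAX_WBITS: accept a gzip header on the deflate stream
assert zlib.decompress(compressed, 16 + zlib.MAX_WBITS) == payload
# gzip.decompress does the same thing without the magic number
assert gzip.decompress(compressed) == payload
```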
I am reading a zip file from a URL. Inside the zip file there is an HTML file. Reading the file works fine, but when I print the text I run into a Unicode problem. Python version: 3.8
import requests
from zipfile import ZipFile
from io import BytesIO
from bs4 import BeautifulSoup

content = requests.get("www.url.com")
zf = ZipFile(BytesIO(content.content))
file_name = zf.namelist()[0]
file = zf.open(file_name)
soup = BeautifulSoup(file.read(), 'html.parser', from_encoding='utf-8', exclude_encodings='utf-8')
for product in soup.find_all('tr'):
    product = product.find_all('td')
    if len(product) < 2: continue
    print(product[1].text)
I already tried opening the file and printing the text with .decode('utf-8'), and I got the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 0: invalid continuation byte
I then added from_encoding and exclude_encodings to the BeautifulSoup call; that got rid of the error, but the output did not change.
Expected prints:
ÇEŞİTLİ MADDELER TOPLAMI
Tarçın
Fidanı
What I am getting:
ÇEÞÝTLÝ MADDELER TOPLAMI
Tarçýn
Fidaný
I looked at the file, and the encoding is not utf-8 but iso-8859-9 (Latin-5, Turkish).
Change the encoding and everything will be fine:
soup = BeautifulSoup(file.read(),'html.parser',from_encoding='iso-8859-9')
This will output: ÇEŞİTLİ MADDELER TOPLAMI
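The garbled output is classic mojibake: the bytes are iso-8859-9 but were decoded as iso-8859-1, where the Turkish code points for Ş, İ, ş, ı map to Þ, Ý, þ, ý. A small sketch reproducing exactly the garbling seen above:

```python
# Turkish text encoded as iso-8859-9 but decoded as iso-8859-1
# reproduces the broken output from the question.
mojibake = 'ÇEŞİTLİ'.encode('iso-8859-9').decode('iso-8859-1')
print(mojibake)  # ÇEÞÝTLÝ

assert 'Tarçın'.encode('iso-8859-9').decode('iso-8859-1') == 'Tarçýn'
```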
On Amazon SES, I have a rule that saves incoming emails to an S3 bucket. Amazon saves these in MIME format.
These emails have a .txt attachment that appears in the MIME file with content-type=text/plain, Content-Disposition=attachment ... .txt, and Content-Transfer-Encoding=quoted-printable or base64.
I am able to parse that fine using Python.
The problem is decoding the content of the .txt attachment when it is compressed (i.e., content-type: application/zip); it behaves as if the encoding weren't base64.
My code:
import base64
s = unicode(base64.b64decode(attachment_content), "utf-8")
throws the error:
Traceback (most recent call last):
  File "<input>", line 796, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcf in position 10: invalid continuation byte
Below are the first few lines of the "base64" string in attachment_content. Incidentally, it has length 53683 plus "==" at the end, and I thought the length of a base64 string should be a multiple of 4 (??).
So maybe the decoding is failing because the compression changes attachment_content, and I need some other operation before/after decoding it? I really have no idea.
UEsDBBQAAAAIAM9Ah0otgkpwx5oAADMTAgAJAAAAX2NoYXQudHh0tL3bjiRJkiX23sD+g0U3iOxu
REWGu8c1l2Ag8lKd0V2ZWajM3kLuC6Hubu5uFeZm3nYJL6+n4T4Ry8EOdwCSMyQXBRBLgMQ+7CP5
QPBj5gdYn0CRI6JqFxWv7hlyszursiJV1G6qonI5cmQyeT6dPp9cnCaT6Yvp5Yvz6xfJe7cp8P/k
1SbL8xfJu0OSvUvr2q3TOnFVWjxrknWZFeuk2VRlu978s19MRvNMrHneOv51SOZlGUtMLYnfp0nd
...
I have also tried "latin-1", but got gibberish.
The problem was that, after base64 decoding, I was dealing with a file in zip format, starting with the bytes "PK\x03\x04...", and I needed to unzip it before decoding it as UTF-8.
This code worked for me:
import base64
import email
import zipfile
from cStringIO import StringIO

# Parse results from email
received_email = email.message_from_string(email_text)
for part in received_email.walk():
    c_type = part.get_content_type()
    c_enco = part.get('Content-Transfer-Encoding')
    attachment_content = part.get_payload()
    if c_enco == 'base64':
        decoded_file = base64.b64decode(attachment_content)
        print("File decoded from base64")
        if c_type == "application/zip":
            zfp = zipfile.ZipFile(StringIO(decoded_file), "r")
            unzipped_list = zfp.open(zfp.namelist()[0]).readlines()
            decoded_file = "".join(unzipped_list)
            print('And un-zipped')
        result = unicode(decoded_file, "utf-8")
I have the following code in Python 3:
import urllib.request
f = urllib.request.urlopen("https://www.okcoin.cn/api/v1/trades.do?since=0")
a = f.read() # there is data here
print(a.decode()) # error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I can get a readable result for https://www.okcoin.cn/api/v1/trades.do?since=0 in a browser. The browser confirms the encoding is UTF-8.
What am I missing?
Thanks
Downloading the data with wget reveals that the data is actually
compressed with gzip. So you need to decompress it first. There’s a
gzip module that should be useful.
Edit: try this.
import urllib.request
import gzip
f = urllib.request.urlopen("https://www.okcoin.cn/api/v1/trades.do?since=0")
a = f.read() # there is data here
uncompressed = gzip.decompress(a)
print(uncompressed.decode())
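Since not every server gzips every response, it may be safer to check the Content-Encoding header before decompressing. A small helper along these lines (the function name is my own):

```python
import gzip

def body_bytes(raw, content_encoding):
    """Decompress the response body only when the server declared gzip."""
    if content_encoding == "gzip":
        return gzip.decompress(raw)
    return raw

# usage with urllib:
#   body_bytes(f.read(), f.headers.get("Content-Encoding"))
```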
Why not use the requests module?
import requests
f = requests.get("https://www.okcoin.cn/api/v1/trades.do?since=0")
a = f.text
print(a)
Works fine for me :)
As I mentioned in my comment on Yuval Pruss's answer, the requests module handles compressed data implicitly. urllib3 does the same thing, as it has support for gzip and deflate encoding. Here is a demonstration:
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request("GET", "https://www.okcoin.cn/api/v1/trades.do?since=0")
>>> r.headers['content-encoding']
'gzip'
>>>
>>> import json
>>> if r.status == 200:
...     json_data = json.loads(r.data.decode('utf-8'))
...     print(json_data[0])
...
{'date_ms': 1489842827000, 'tid': 7368887285, 'date': 1489842827, 'price': '7236.01', 'amount': '1.081', 'type': 'sell'}
I tried to download an HTML file like this:
import urllib
req = urllib.urlopen("http://www.stream-urls.de/webradio")
html = req.read()
print html
html = html.decode('utf-16')
print html
Since the output of req.read() did not look like plain text, I tried to decode the response, but I get this error:
Traceback (most recent call last):
  File "e:\Documents\Python\main.py", line 8, in <module>
    html = html.decode('utf-16')
  File "E:\Software\Python2.7\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 38-39: illegal UTF-16 surrogate
What do I have to do to get the right encoding?
Use requests and you get correct, ungzipped HTML
import requests
r = requests.get("http://www.stream-urls.de/webradio")
print r.text
EDIT: how to use gzip and StringIO to ungzip the data without saving it to a file:
import urllib
import gzip
import StringIO
req = urllib.urlopen("http://www.stream-urls.de/webradio")
# create file-like object in memory
buf = StringIO.StringIO(req.read())
# create gzip object using file-like object instead of real file on disk
f = gzip.GzipFile(fileobj=buf)
# get data from file
html = f.read()
print html
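Under Python 3, StringIO.StringIO becomes io.BytesIO, since the response body is bytes. A round-trip sketch of the same approach, with stand-in data in place of the real HTTP response:

```python
import gzip
import io

data = b"<html>...</html>"          # stand-in for the uncompressed page
gzipped = gzip.compress(data)       # what req.read() would actually return

# create a file-like object in memory and let GzipFile inflate it
buf = io.BytesIO(gzipped)
with gzip.GzipFile(fileobj=buf) as f:
    html = f.read()

assert html == data
```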
I am trying to parse a JSON file, and I get an error when I print a JSON value that is an HTML string.
The error is:
Traceback (most recent call last):
  File "parseJson.py", line 11, in <module>
    print entryContentHTML.prettify()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u02c8' in position 196: ordinal not in range(128)
import json
from bs4 import BeautifulSoup

with open('cat.json') as f:
    data = json.load(f)

print data["entryLabel"]
entryContentHTML = BeautifulSoup(data["entryContent"])
print entryContentHTML.prettify()
What is the common way to load a JSON file with a UTF-8 specification?
You are loading the JSON just fine. It is your print statement that fails.
You are trying to print to a console or terminal that is configured for ASCII handling only. You'll either have to alter your console configuration or explicitly encode your output:
print data["entryLabel"].encode('ascii', 'replace')
and
print entryContentHTML.prettify().encode('ascii', 'replace')
Without more information about your environment it is otherwise impossible to tell how to fix your configuration (if at all possible).
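For a concrete picture of what the 'replace' error handler does (shown here in Python 3 syntax): the unencodable character becomes a question mark instead of raising, while 'xmlcharrefreplace' preserves it as an HTML character reference:

```python
# u'\u02c8' is the IPA stress mark from the traceback above;
# it has no ASCII representation.
assert '\u02c8'.encode('ascii', 'replace') == b'?'

# 'xmlcharrefreplace' keeps the information as an HTML entity instead
assert '\u02c8'.encode('ascii', 'xmlcharrefreplace') == b'&#712;'
```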