I have an encoded string, payload, of a request that I want to uncompress.
payload = ''
My process / goal is explained here. However, I want to stay with Python.
First, I believe I need to decode the data. With the help of this answer I wrote:
import base64
payload_in_bytes = base64.b64decode(payload)
Next, I assume that the end-result is a dictionary so I use json.loads() as the documentation states it accepts bytes.
import json
data = json.loads(payload_in_bytes)
However, this results in a UnicodeDecodeError:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
What am I doing wrong?
Here is a possible solution using gzip.decompress():
import base64
import gzip
import json
payload = ''
binary_string = base64.b64decode(payload)
decomp_data = gzip.decompress(binary_string).decode()
data = json.loads(decomp_data)
print(data)
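The byte 0x8b at position 1 in the traceback matches the second byte of the gzip magic number (0x1f 0x8b), which is why decompressing first helps. Below is a minimal sketch, assuming the payload really is base64-encoded, gzip-compressed JSON. The helper name decode_payload and the sample data are made up for illustration.

```python
import base64
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_payload(payload):
    """Base64-decode a payload, gunzip it if it looks gzipped, then parse JSON."""
    raw = base64.b64decode(payload)
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


# Round trip with made-up sample data standing in for the real payload.
sample = base64.b64encode(gzip.compress(json.dumps({"id": 1}).encode("utf-8")))
print(decode_payload(sample))
```

Checking the magic bytes first means the same helper also copes with a payload that is base64-encoded but not compressed.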
I use the requests module in Python to fetch a result of a web page. However, I found that if the URL includes a character à in its URL, it issues the UnicodeDecodeError:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 27: invalid continuation byte
Strangely, this only happens if I also add a space in the URL. So for example, the following does not issue an error.
requests.get("http://myurl.com/àieou")
However, the following does:
requests.get("http://myurl.com/àienah aie")
Why does it happen and how can I make the request correctly?
Use the urllib library to percent-encode the characters automatically (Python 2 syntax):
import urllib
requests.get("http://myurl.com/"+urllib.quote_plus("àieou"))
Use quote_plus().
from urllib.parse import quote_plus
requests.get("http://myurl.com/" + quote_plus("àienah aie"))
You can try to url encode your value:
requests.get("http://myurl.com/%C3%A0ieou")
The value for à is %C3%A0 once encoded.
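To see what the two quoting functions produce, here is a small sketch for Python 3. It only builds the URL and sends no request. The host name is the placeholder from the question, and whether the server expects %20 or + for a space in the path is an assumption you would need to check.

```python
from urllib.parse import quote, quote_plus

path = "àienah aie"

# quote() percent-encodes the UTF-8 bytes of "à" and writes the space as %20.
quoted = quote(path)
# quote_plus() encodes "à" the same way but writes the space as "+".
quoted_plus = quote_plus(path)

print(quoted)
print(quoted_plus)

url = "http://myurl.com/" + quoted
print(url)
```

A "+" is conventionally read as a space only in query strings, so quote() is usually the safer choice for a path segment.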
I'm working on a new project but I can't fix the error in the title.
Here's the code:
#!/usr/bin/env python3.5.2
import urllib.request, urllib.parse

def start(url):
    source_code = urllib.request.urlopen(url).read()
    info = urllib.parse.parse_qs(source_code)
    print(info)

start('https://www.youtube.com/watch?v=YfRLJQlpMNw')
The error occurred because of .encode which works on a unicode object. So we need to convert the byte string to unicode string using
.decode('unicode_escape')
So the code will be:
#!/usr/bin/env python3.5.2
import urllib.request, urllib.parse

def start(url):
    source_code = urllib.request.urlopen(url).read()
    info = urllib.parse.parse_qs(source_code.decode('unicode_escape'))
    print(info)

start('https://www.youtube.com/watch?v=YfRLJQlpMNw')
Try this
source_code = urllib.request.urlopen(url).read().decode('utf-8')
The error message is self-explanatory: there is a byte 0xf0 in an input string that is expected to be an ASCII string.
You should have given the exact error message and the line it happened on, but I can guess that it happened on info = urllib.parse.parse_qs(source_code), because parse_qs expects either a unicode string or an ASCII byte string.
The first question is why you call parse_qs on data coming from YouTube, because the doc for the Python Standard Library says:
Parse a query string given as a string argument (data of type application/x-www-form-urlencoded). Data are returned as a dictionary. The dictionary keys are the unique query variable names and the values are lists of values for each name.
So you are going to split this on the = and & characters and interpret it as a query string of the form key1=value11&key2=value2&key1=value12, giving { 'key1': [ 'value11', 'value12'], 'key2': ['value2']}.
If you know why you want that, you should first decode the byte string into a unicode string using the proper encoding or, if unsure, Latin-1, which can decode any byte:
def start(url):
    source_code = urllib.request.urlopen(url).read().decode('latin1')
    info = urllib.parse.parse_qs(source_code)
    print(info)
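For reference, this is roughly what parse_qs does with a real query string, and why Latin-1 never raises while decoding. It is a small offline sketch, and the query string is the made-up one from the explanation above.

```python
from urllib.parse import parse_qs

# A query string with a repeated key, as in the example above.
query = "key1=value11&key2=value2&key1=value12"
parsed = parse_qs(query)
print(parsed)

# Latin-1 maps each of the 256 possible byte values to one code point,
# so decoding arbitrary bytes with it cannot fail.
every_byte = bytes(range(256))
text = every_byte.decode("latin1")
print(len(text))
```

That Latin-1 always succeeds does not mean the result is meaningful: an HTML page decoded this way is still not a query string.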
This code is rather weird indeed. You are using query parser to parse contents of a web page.
So instead of using parse_qs you should be using something like this.
I am trying to make a crawler in Python by following a Udacity course. I have this method get_page() which returns the content of the page.
def get_page(url):
    '''
    Open the given url and return the content of the page.
    '''
    data = urlopen(url)
    html = data.read()
    return html.decode('utf8')
The original method was just returning data.read(), but that way I could not do operations like str.find(). After a quick search I found out I need to decode the data. But now I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I have found similar questions on SO but none of them were specifically for this. Please help.
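You are trying to decode a byte sequence that is not valid UTF-8.
The first byte of a UTF-8 encoded character must be either in the range 0x00 to 0x7F (a single-byte character) or a multi-byte lead byte in the range 0xC2 to 0xF4.
0x8B falls in the continuation-byte range 0x80 to 0xBF, so it is definitely invalid as a start byte.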
From RFC3629 Section 3:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number.
You should post the string you are trying to decode.
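The byte ranges defined by RFC 3629 can be written down as a small lookup, which shows where 0x8B falls. This is only an illustrative sketch, and the function name classify_utf8_byte is made up.

```python
def classify_utf8_byte(byte):
    """Describe the role a single byte value (0-255) can play in UTF-8."""
    if byte <= 0x7F:
        return "single-byte character (ASCII)"
    if byte <= 0xBF:
        return "continuation byte"
    if 0xC2 <= byte <= 0xDF:
        return "lead byte of a 2-byte sequence"
    if 0xE0 <= byte <= 0xEF:
        return "lead byte of a 3-byte sequence"
    if 0xF0 <= byte <= 0xF4:
        return "lead byte of a 4-byte sequence"
    # 0xC0, 0xC1 and 0xF5..0xFF never appear in well-formed UTF-8.
    return "never valid in UTF-8"


print(hex(0x8B), "->", classify_utf8_byte(0x8B))
```

A continuation byte is only legal after a lead byte, which is why the decoder reports 0x8b as an invalid start byte.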
Maybe the page is encoded with a character encoding other than UTF-8, so the start byte is invalid.
You could do this.
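def get_page(self, url):
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        print("Http code:", response.getcode())
        return None
    # Read the body once; a second read() on the same response returns b''.
    body = response.read()
    try:
        return body.decode('utf-8')
    except UnicodeDecodeError:
        # Fall back to the raw bytes if the page is not valid UTF-8.
        return body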
Web servers often serve HTML pages with a Content-Type header that includes the encoding used to encode the page. The header might look like this:
Content-Type: text/html; charset=UTF-8
We can inspect the content of this header to find the encoding to use to decode the page:
from urllib.request import urlopen

def get_page(url):
    """ Open the given url and return the content of the page."""
    data = urlopen(url)
    content_type = data.headers.get('content-type', '')
    print(f'{content_type=}')
    encoding = 'latin-1'
    if 'charset' in content_type:
        _, _, encoding = content_type.rpartition('=')
        print(f'{encoding=}')
    html = data.read()
    return html.decode(encoding)
Using requests is similar:
import requests

response = requests.get(url)
content_type = response.headers.get('content-type', '')
Latin-1 (or ISO-8859-1) is a safe default: it will always decode any bytes (though the result may not be useful).
If the server doesn't serve a content-type header you can try looking for a <meta> tag that specifies the encoding in the HTML. Or pass the response bytes to Beautiful Soup and let it try to guess the encoding.
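Splitting on '=' works for the simple header shown above but can pick up extra parameters that follow the charset. As an alternative sketch, the standard library's email.message.Message can parse the header value. The helper name charset_from_content_type is made up, and the Latin-1 default mirrors the fallback discussed above.

```python
from email.message import Message


def charset_from_content_type(content_type, default="latin-1"):
    """Return the charset parameter of a Content-Type value, or a default."""
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case,
    # or the failobj value when no charset parameter is present.
    return message.get_content_charset(failobj=default)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

The headers object of a urlopen() response is itself a Message subclass, so data.headers.get_content_charset() should be usable directly on a live response.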
I'm using Python 2.7 and I have some problems converting chars like "ä" to "ae".
I'm retrieving the content of a webpage using:
req = urllib2.Request(url + str(questionID))
response = urllib2.urlopen(req)
data = response.read()
After that I'm doing some extraction stuff and there is my problem.
extractedStr = pageContent[start:end]  # this string contains the "ä" !
extractedStr = extractedStr.decode("utf8")  # here I get the error, tried it with encode as well
extractedStr = extractedStr.replace(u"ä", "ae")
--> 'utf8' codec can't decode byte 0xe4 in position 13: invalid continuation byte
But: my simple trial is working fine...:
someStr = "geräusch"
someStr = someStr.decode("utf8")
someStr = someStr.replace(u"ä", "ae")
I've got the feeling, it has something to do with WHEN I try to use the .decode() function... I tried it at several positions, no success :(
Use .decode("latin-1") instead. The byte 0xe4 is "ä" in Latin-1 (ISO-8859-1), so that is the encoding of the data you are trying to decode.
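The difference is easy to reproduce offline. The sketch below is written for Python 3 rather than the Python 2.7 used in the question, and the byte string is a made-up stand-in for the extracted page content.

```python
# "geräusch" as a web page encoded in Latin-1 would deliver it.
raw = b"ger\xe4usch"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as error:
    # 0xe4 announces a 3-byte UTF-8 sequence, but the next byte is plain ASCII.
    print("UTF-8 failed:", error)

text = raw.decode("latin-1")
replaced = text.replace("\u00e4", "ae")  # "\u00e4" is "ä"
print(replaced)
```

The simple trial in the question works because the string literal in the source file was already UTF-8 encoded, unlike the bytes coming from the page.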
I am trying to count the number of Unicode characters in the JSON data. I am using requests to get the data from the feed.
import requests
r = requests.get('https://venmo.com/api/v5/public?since=1438578858&until=1438578958')
j_data = r.text
Now, I need to convert the j_data into a dictionary to get the message items alone. If I just use json.loads(j_data), I get UnicodeEncodeError: 'charmap' codec can't encode character.
Therefore, I am encoding j_data and then trying to convert it to a dict using loads. I get this error:
TypeError: the JSON object must be str, not 'bytes'
How to approach this problem?
Code:
import requests
import json
r = requests.get('https://venmo.com/api/v5/public?since=1438578858&until=1438578958')
j_data = r.text
encoded = j_data.encode()
b = json.loads(encoded)
print(b)
It seems to work fine in Python 2.7.6:
import requests
import json
req = requests.get('https://venmo.com/api/v5/public?since=1438578858&until=1438578958')
contentJ = json.loads(req.content)
and I get a dict named contentJ
As I see it, you are trying to encode something that does not need to be encoded. Remove the line with the encoding and everything works fine in Python 3.4.
import requests
import json
r = requests.get('https://venmo.com/api/v5/public?since=1438578858&until=1438578958')
j_data = r.text
b = json.loads(j_data)
print(type(b))
To get json, use r.json():
import requests # $ pip install requests
r = requests.get(url)
data = r.json()
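For comparison, here is what json.loads does with str and bytes input, without any network access. The sample body is made up. The bytes case assumes Python 3.6 or later, where json.loads accepts bytes; on earlier 3.x versions it raises the TypeError shown in the question.

```python
import json

# Made-up response body standing in for r.text / r.content.
body_text = '{"data": [{"message": "hello"}]}'
body_bytes = body_text.encode("utf-8")

from_text = json.loads(body_text)
from_bytes = json.loads(body_bytes)  # bytes are accepted on Python 3.6+

messages = [item["message"] for item in from_text["data"]]
print(messages)
```

On any version, parsing the text directly (or calling r.json()) avoids the extra encode step.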
Your error, UnicodeEncodeError: 'charmap' codec can't encode character, is unrelated to the JSON parsing. Most likely it happens when you try to print Unicode to the Windows console. Configure a console font that can display the desired characters and install the win-unicode-console package:
T:\> py -mpip install win-unicode-console
T:\> py -mrun your_script_that_prints_unicode.py
See What's the deal with Python 3.4, Unicode, different languages and Windows?
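If changing the console setup is not an option, one workaround is to print an ASCII-only representation. This is a minimal sketch that relies on json.dumps escaping non-ASCII characters by default; the sample message is made up.

```python
import json

# Made-up parsed data containing non-ASCII characters.
data = {"message": "caf\u00e9 \U0001F600"}

# ensure_ascii=True is the default, so every non-ASCII character is
# written as a \uXXXX escape and the output is safe for any console codec.
safe = json.dumps(data)
print(safe)

# Counting the non-ASCII characters works on the parsed string itself.
non_ascii = sum(1 for ch in data["message"] if ord(ch) > 127)
print(non_ascii)
```

The escaped text is only for display; the parsed dictionary still holds the original characters.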