UnicodeDecodeError utf8 codec Python 2.7 - python

I have built a scraper that reads artist names from a CSV file and collects data about those artists via the Songkick API. However, after running my code for a while I get the following error:
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 64-65: invalid continuation byte
Sample data can be downloaded here:
I am relatively new to coding and I was wondering how I can solve this error. Below you can find my code.
import urllib2
import requests
import json
import csv
from tinydb import TinyDB, Query
db = TinyDB('spotify_artists.json')
#read csv
def wait_for_internet():
    while True:
        try:
            resp = urllib2.urlopen('http://google.com', timeout=1)
            return
        except:
            pass
def load_artists():
    f = open('artistnames.csv', 'r').readlines()
    for a in f:
        artist = a.strip()
        print(artist)
        url = 'http://api.songkick.com/api/3.0/search/artists.json?query='+artist+'&apikey='
        # wait_for_internet()
        r = requests.get(url)
        resp = r.json()
        # print(resp)
        try:
            if(resp['resultsPage']['totalEntries']):
                # print(json.dumps(resp['resultsPage']['results']['artist'], indent=4, sort_keys=True))
                results = resp['resultsPage']['results']['artist']
                for x in results:
                    # print('rxx')
                    # print(json.dumps(x, indent=4, sort_keys=True))
                    if(x['displayName'] == artist):
                        print(x)
                        db.insert(x)
        except:
            print('cannot fetch url',url)
load_artists()
db.close()
Traceback (most recent call last):
File "C:\Users\rmlj\Dropbox\songkick\scrapers\Data\Scraper.py", line 45, in <module>
load_artists()
File "C:\Users\rmlj\Dropbox\songkick\scrapers\Data\Scraper.py", line 25, in load_artists
r = requests.get(url)
File "C:\Python27\lib\site-packages\requests\api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "C:\Python27\lib\site-packages\requests\api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 474, in request
prep = self.prepare_request(req)
File "C:\Python27\lib\site-packages\requests\sessions.py", line 407, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "C:\Python27\lib\site-packages\requests\models.py", line 302, in prepare
self.prepare_url(url, params)
File "C:\Python27\lib\site-packages\requests\models.py", line 358, in prepare_url
url = url.decode('utf8')
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 64-65: invalid continuation byte

The problem is in how you build the URL: you pass the query string as bytes (a regular str on Python 2.x) containing characters in a non-UTF-8 encoding to the requests module, which in turn tries to decode it as UTF-8 into a unicode string, and fails.
First of all, you should let the requests module form your query string and deal with the creation of your final URL:
url = "http://api.songkick.com/api/3.0/search/artists.json"
r = requests.get(url, params={"query": artist, "apikey": ""})
# etc.
Second, you should not mix encodings unless you want to be in a world of hurt. Unfortunately, the built-in csv module in Python 2 doesn't work with Unicode, which is probably why you end up with invalid characters. To remedy that, install unicodecsv and use it as a drop-in replacement (just replace your import csv with import unicodecsv as csv).
Update: Wait, on second look you're not even using csv. You're reading your file line by line and passing each line off as a query. Is that your intended behavior? If so, in keeping with the idea of sticking to a single encoding:
import codecs
URL = "http://api.songkick.com/api/3.0/search/artists.json" # no need to redefine this
with codecs.open("artistnames.csv", "r", "utf-8") as f:
    for a in f:
        artist = a.strip()
        r = requests.get(URL, params={"query": artist, "apikey": ""})
        # etc.
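To see why handing the parameters over as a dictionary is safer than gluing strings together, here is a minimal sketch using only the standard library. It is written in Python 3 syntax, which is an assumption (the question uses Python 2.7). urllib.parse.urlencode stands in for the comparable percent-encoding that requests applies to params, build_search_url is a hypothetical helper, and the artist names are made-up sample values:

```python
from urllib.parse import urlencode

BASE_URL = "http://api.songkick.com/api/3.0/search/artists.json"

def build_search_url(artist, api_key=""):
    """Return the search URL with every parameter percent-encoded as UTF-8."""
    return BASE_URL + "?" + urlencode({"query": artist, "apikey": api_key})

# Sample names (assumptions for illustration): one with a non-ASCII character
# and one with a character that would otherwise split the query string in two.
for name in ["Motörhead", "Simon & Garfunkel"]:
    print(build_search_url(name))
```

Note that this only helps once the name is a proper text string. If artistnames.csv was not saved as UTF-8, the encoding used to read the file has to match the file's actual encoding.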

You should use unicode whenever possible. Requests should convert any non-ASCII characters in the URL to the correct encoding.
>>> import requests
>>> requests.get(u'http://Motörhead.com/?q=Motörhead').url
u'http://xn--motrhead-p4a.com/?q=Mot%C3%B6rhead'
As you can see, the domain name is encoded as Punycode, and the query string uses percent-encoding.
As long as artist is a valid unicode string, this should work:
url = u'http://api.songkick.com/api/3.0/search/artists.json?query='+artist
If artist is a byte string, you must decode it into unicode using the correct encoding, which depends on how your original input file was encoded:
artist = artist.decode('SHIFT-JIS')
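If the encoding of the input file is not known up front, one hedged approach is to try a short list of candidate encodings in order. The sketch below uses Python 3 syntax (an assumption, since the question targets Python 2.7). decode_with_fallback and its candidate list are hypothetical, and the sample bytes are made up to mimic a name saved by a Windows editor:

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Decode raw bytes with the first candidate encoding that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit: %r" % (encodings,))

# 0xE9 is "é" in cp1252 but starts an incomplete multi-byte sequence in UTF-8.
windows_bytes = b"Caf\xe9 Tacvba"
utf8_bytes = "Café Tacvba".encode("utf-8")

print(decode_with_fallback(windows_bytes))
print(decode_with_fallback(utf8_bytes))
```

The order matters: UTF-8 goes first because it rejects most text written in other encodings, whereas a single-byte encoding such as cp1252 accepts almost any byte sequence and would silently produce wrong characters.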

Related

Does multiple looping affect response.json()?

As a part of a small project of mine, I'm using the requests module to make an API call. Here's the snippet:
date = str(day) + '-' + str(month) + '-' + str(year)
req = "https://cdn-api.co-vin.in/api/v2/appointment/sessions/public/findByDistrict?district_id=" + str(distid) + "&date=" + date
response = requests.get(req,headers={'Content-Type': 'application/json'})
st = str(jprint(response.json()))
file = open("data.json",'w')
file.write(st)
file.close()
The jprint function is as follows:
def jprint(obj):
    text = json.dumps(obj,sort_keys=True,indent=4)
    return text
This is a part of a nested loop. On the first few runs, it worked successfully but after that it gave the following error:
Traceback (most recent call last):
File "vax_alert2.py", line 99, in <module>
st = str(jprint(response.json()))
File "/usr/lib/python3/dist-packages/requests/models.py", line 897, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I tried adding a sleep of 1 second but got the same error. How should I resolve it?
Also, I checked it without using the jprint function yet got the exact same error.
I would suggest recording the response whenever parsing it raises an exception, because the response body is likely empty and accompanied by an error status. You are probably getting a 403 or some other error status (potentially from a DDoS-aware firewall). Once you know which status comes with the empty response, you can detect it and throttle your requests accordingly.
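try:
    st = str(jprint(response.json()))
    file = open("data.json",'w')
    file.write(st)
    file.close()
except ValueError:
    # JSON decoding failed: show the status code and body that caused it
    print(response.status_code, response.text)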
See the following (from https://docs.python-requests.org/en/master/user/quickstart/):
In case the JSON decoding fails, r.json() raises an exception. For
example, if the response gets a 204 (No Content), or if the response
contains invalid JSON, attempting r.json() raises
simplejson.JSONDecodeError if simplejson is installed or raises
ValueError: No JSON object could be decoded on Python 2 or
json.JSONDecodeError on Python 3.
It should be noted that the success of the call to r.json() does not
indicate the success of the response. Some servers may return a JSON
object in a failed response (e.g. error details with HTTP 500). Such
JSON will be decoded and returned. To check that a request is
successful, use r.raise_for_status() or check r.status_code is what
you expect.
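Putting the documentation excerpt into practice, a helper that checks the status before decoding might look like the sketch below. safe_json is a hypothetical name. It only assumes that the response object exposes status_code and text, as requests.Response does, so the demo uses stand-in objects instead of a real network call:

```python
import json
from types import SimpleNamespace

def safe_json(response):
    """Return (data, None) on success or (None, problem) on failure."""
    if response.status_code != 200:
        return None, "unexpected HTTP status %d" % response.status_code
    try:
        return json.loads(response.text), None
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        return None, "body is not valid JSON: %s" % exc

# Stand-ins for responses (assumptions for illustration only).
ok = SimpleNamespace(status_code=200, text='{"sessions": []}')
blocked = SimpleNamespace(status_code=403, text="")

print(safe_json(ok))
print(safe_json(blocked))
```

Inside a loop, the problem string can be logged and used as the signal to pause before the next request.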

How to capture the response based on the Content-Type sent by the server using requests in Python?

I am pretty new to Python and am learning how to make HTTP requests and store the response in a variable.
Below is a code snippet similar to the POST request I am trying to make.
import requests
import simplejson as json
api_url = "https://jsonplaceholder.typicode.com/tickets"
raw_body = {"searchBy":"city","searchValue":"1","processed":9,"size":47,"filter":{"cityCode":["BA","KE","BE"],"tickets":["BLUE"]}}
raw_header = {"X-Ticket-id": "1234567", "X-Ticket-TimeStamp": "11:01:1212", "X-Ticket-MessageId": "123", 'Content-Type': 'application/json'}
# raw_header is already a dict, so it is passed as-is; the body is serialized to JSON to match the Content-Type header
result = requests.post(api_url, headers=raw_header, data=json.dumps(raw_body))
#Response Header
response_header_contentType = result.headers['Content-Type'] #---> I am getting response_header_contentType as "text/html; charset=utf-8"
#Trying to get the result in json format
response = result.json() # --> I am getting error at this line. Maybe because the server is sending the content type as "text/html" and I am trying to capture the json response.
Error in console :
Traceback (most recent call last):
File "C:\sam\python-project\v4\makeApiRequest.py", line 45, in make_API_request
response = result.json()
File "C:\Users\userName\AppData\Local\Programs\Python\Python37\lib\site-packages\requests\models.py", line 898, in json
return complexjson.loads(self.text, **kwargs)
File "C:\Users\userName\AppData\Local\Programs\Python\Python37\lib\site-packages\simplejson\__init__.py", line 525, in loads
return _default_decoder.decode(s)
File "C:\Users\userName\AppData\Local\Programs\Python\Python37\lib\site-packages\simplejson\decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "C:\Users\userName\AppData\Local\Programs\Python\Python37\lib\site-packages\simplejson\decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So, how can I store the response in a variable based on the Content-Type sent by the server using requests?
Can somebody please help me here? I tried googling too but did not find any helpful documentation on how to capture the response based on the Content-Type.
As you already said, your Content-Type is 'text/html', not 'application/json', which normally means that the body cannot be decoded as JSON.
If you look at the documentation
https://2.python-requests.org/en/master/user/quickstart/#response-content you can find that there are different ways to decode the body. If you already know you have 'text/html', it makes sense to decode it with response.text.
Hence it makes sense to choose how to decode your data based on the content type:
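content_type = result.headers.get('Content-Type', '')
# the header may carry parameters, e.g. "text/html; charset=utf-8"
if content_type.startswith('application/json'):
    data = result.json()
elif content_type.startswith('text/html'):
    data = result.text
else:
    data = result.content  # undecoded bytes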
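Content-Type values frequently carry parameters (the header in the question is "text/html; charset=utf-8"), so it can help to compare only the media type. A small sketch, where media_type and choose_decoder are hypothetical helper names and the returned labels are placeholders for result.json(), result.text and the undecoded bytes:

```python
def media_type(content_type):
    """Strip parameters such as charset and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()

def choose_decoder(content_type):
    """Map a Content-Type header value to a decoding strategy label."""
    kind = media_type(content_type)
    if kind == "application/json" or kind.endswith("+json"):
        return "json"
    if kind.startswith("text/"):
        return "text"
    return "bytes"

for header in ["text/html; charset=utf-8", "Application/JSON", "image/png"]:
    print(header, "->", choose_decoder(header))
```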

Can not parse response from sg.media-imdb in python

I'm trying to parse response from https://sg.media-imdb.com/suggests/a/a.json in Python 3.6.8.
Here is my code:
import requests
url = 'https://sg.media-imdb.com/suggests/a/a.json'
data = requests.get(url).json()
I get this error:
$ /usr/bin/python3 /home/livw/Python/test_scrapy/phase_1.py
Traceback (most recent call last):
File "/home/livw/Python/test_scrapy/phase_1.py", line 33, in <module>
data = requests.get(url).json()
File "/home/livw/.local/lib/python3.6/site-packages/requests/models.py", line 889, in json
self.content.decode(encoding), **kwargs
File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
It seems like the response format is not JSON format, although I can parse the response at JSON Formatter & Validator
How to fix it and store the response in a json object?
This probably happened because the response is not plain JSON: it has a prefix.
You can see that the response starts with imdb$a( and ends with ).
The JSON parser doesn't know how to handle that wrapper and fails. You can remove the wrapper and parse just the JSON inside.
You can do this:
import json
import requests
url = 'https://sg.media-imdb.com/suggests/a/a.json'
data = requests.get(url).text
json.loads(data[data.index('{'):-1])
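A slightly more defensive variant of the same idea cuts at the first opening and the last closing parenthesis rather than at the first brace, so it also copes with trailing whitespace or a wrapped array. strip_jsonp is a hypothetical helper, and the sample string is a made-up, shortened imitation of the response format described above:

```python
import json

def strip_jsonp(text):
    """Return the JSON payload inside a callback(...) wrapper."""
    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end < start:
        raise ValueError("no callback(...) wrapper found")
    return text[start + 1:end]

sample = 'imdb$a({"v": 1, "q": "a", "d": [{"l": "example"}]})'
data = json.loads(strip_jsonp(sample))
print(data["q"])
```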

Post file and data to API using Python

I'm experimenting a bit with an API that can detect faces in an image. I'm using Python and want to be able to upload an image whose filename is given as a command-line argument. For example:
python detect.py jack.jpg
This is meant to send the file jack.jpg to the API and afterwards print the JSON response. Here is the documentation of the API to identify the face.
http://rekognition.com/developer/docs#facerekognize
Below is my code, I'm using Python 2.7.4
#!/usr/bin/python
# Imports
import sys
import requests
import json
# Facedetection.py sends us an argument with a filename
filename = (sys.argv[1])
# API-Keys
rekognition_key = ""
rekognition_secret = ""
array = {'api_key': rekognition_key,
         'api_secret': rekognition_secret,
         'jobs': 'face_search',
         'name_space': 'multify',
         'user_id': 'demo',
         'uploaded_file': open(filename)
         }
endpoint = 'http://rekognition.com/func/api/'
response = requests.post(endpoint, params= array)
data = json.loads(response.content)
print data
I can see that everything looks fine, but my console gets this output:
Traceback (most recent call last):
File "upload.py", line 23, in <module>
data = json.loads(response.content)
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 383, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
What is wrong?
Success!
The question didn't contain good code. I used code from here:
uploading a file to imgur via python
#!/usr/bin/python
# Imports
import base64
import sys
import requests
import json
from base64 import b64encode
# Facedetection.py sends us an argument with a filename
filename = (sys.argv[1])
# API-Keys
rekognition_key = ""
rekognition_secret = ""
url = "http://rekognition.com/func/api/"
j1 = requests.post(
    url,
    data = {
        'api_key': rekognition_key,
        'api_secret': rekognition_secret,
        'jobs': 'face_recognize',
        'name_space': 'multify',
        'user_id': 'demo',
        'base64': b64encode(open(filename, 'rb').read()),
    }
)
data = json.loads(j1.text)
print data
Now this: python detect.py jack.jpg returns the wanted JSON. Fully working.
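Comparing the two scripts, the working one reads the image in binary mode, base64-encodes it, and sends all fields in the request body (data=) rather than in the query string (params=). The encoding step on its own can be sketched as follows. The byte string is a made-up stand-in for the contents of jack.jpg, and the code is written for Python 3 (an assumption, since the scripts above are Python 2):

```python
from base64 import b64decode, b64encode

# Stand-in for open(filename, 'rb').read(); real image data is arbitrary
# binary content like this, not text.
image_bytes = b"\xff\xd8\xff\xe0 not a real JPEG"

encoded = b64encode(image_bytes)

# The result contains only ASCII characters, so it can travel in a form field.
print(encoded.decode("ascii"))

# Decoding restores the original bytes exactly.
restored = b64decode(encoded)
print(restored == image_bytes)
```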

Nested text encodings in suds requests

Environment: Python 2.7.4 (partly on Windows, partly on Linux, see below), suds (SVN HEAD with minor modifications)
I need to call into a web service that takes a single argument, which is an XML string (yes, I know…), i.e. the request is declared in the WSDL with the following type:
<s:complexType>
<s:sequence>
<s:element minOccurs="0" maxOccurs="1" name="actionString" type="s:string"/>
</s:sequence>
</s:complexType>
I'm using cElementTree to construct this inner XML document, then I pass it as the only parameter to the client.service.ProcessAction(request) method that suds generates.
For a while, this worked okay:
root = ET.Element(u'ActionCommand')
value = ET.SubElement(root, u'value')
value.text = saxutils.escape(complex_value)
request = u'<?xml version="1.0" encoding="utf-8"?>\n' + ET.tostring(root, encoding='utf-8')
client.service.ProcessAction(request)
I added the saxutils.escape call at some point to fix the first encoding problems, without really understanding why I need it or what difference it makes.
Now (possibly due to the first occurrence of the pound sign), I suddenly got the following exception:
Traceback (most recent call last):
File "/app/module.py", line 135, in _process_web_service_call
request = u'<?xml version="1.0" encoding="utf-8"?>\n' + ET.tostring(root, encoding='utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 137: ordinal not in range(128)
The position 137 here corresponds to the location of the special characters inside the inner XML request. Apparently, cElementTree.tostring() returns a 'str' type, not a 'unicode' even when an encoding is given. So Python tries to decode this string str into unicode (why with 'ascii'?), so that it can concatenate it with the unicode literal. This fails (of course, because the str is actually encoded in UTF-8, not ASCII).
So I figured, fine, I'll decode it to unicode myself then:
root = ET.Element(u'ActionCommand')
value = ET.SubElement(root, u'value')
value.text = saxutils.escape(complex_value)
request_encoded_str = ET.tostring(root, encoding='utf-8')
request_unicode = request_encoded_str.decode('utf-8')
request = u'<?xml version="1.0" encoding="utf-8"?>\n' + request_unicode
client.service.ProcessAction(request)
Except that now, it blows up inside suds, which tries to decode the outer XML request for some reason:
Traceback (most recent call last):
File "/app/module.py", line 141, in _process_web_service_call
raw_response = client.service.ProcessAction(request)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/client.py", line 542, in __call__
return client.invoke(args, kwargs)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/client.py", line 602, in invoke
result = self.send(soapenv)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/client.py", line 643, in send
reply = transport.send(request)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/transport/https.py", line 64, in send
return HttpTransport.send(self, request)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/transport/http.py", line 118, in send
return self.invoke(request)
File "/app/.heroku/python/lib/python2.7/site-packages/suds/transport/http.py", line 153, in invoke
u2response = urlopener.open(u2request, timeout=tm)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 1222, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/app/.heroku/python/lib/python2.7/urllib2.py", line 1181, in do_open
h.request(req.get_method(), req.get_selector(), req.data, headers)
File "/app/.heroku/python/lib/python2.7/httplib.py", line 973, in request
self._send_request(method, url, body, headers)
File "/app/.heroku/python/lib/python2.7/httplib.py", line 1007, in _send_request
self.endheaders(body)
File "/app/.heroku/python/lib/python2.7/httplib.py", line 969, in endheaders
self._send_output(message_body)
File "/app/.heroku/python/lib/python2.7/httplib.py", line 827, in _send_output
msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 565: ordinal not in range(128)
The position 565 here again corresponds with the same character as above, except this time it's the location of my inner XML request embedded into the outer XML request (SOAP) created by suds.
I'm confused. Can anyone help me out of this mess? :)
To make matters worse, all of this only happens on the server under Linux. None of these raises an exception in my development environment on Windows. (Bonus points for an explanation as to why that is, just because I'm curious. I suspect it has to do with a different default encoding.) However, they all are not accepted by the server. What does work on Windows is if I drop the saxutils.escape and then hand a proper unicode object to suds. This however still results in the same UnicodeDecodeError on Linux.
Update: I started debugging this on Windows (where it works fine), and in the line 827 of httplib.py, it indeed tries to concatenate the unicode object msg (containing the HTTP headers) and the str object message_body, leading to the implicit unicode decoding with the incorrect encoding. I guess it just happens to not fail on Windows for some reason. I don't understand why suds tries to send a str object when I put a unicode object in at the top.
This turned out to be more than absurd. I'm still understanding only small parts of the whole problem and situation, but I managed to solve my problem.
So let's trace it back: my last attempt was the most sane one, I believe. So let's start there:
msg += message_body
That line in Python's httplib.py tries to concatenate a unicode and a str object, which leads to an implicit .decode('ascii') of the str, even though the str is UTF8-encoded. Why is that? Because msg is a unicode object.
msg = "\r\n".join(self._buffer)
self._buffer is a list of HTTP headers. Inspecting that, only one header in there was unicode, 'infecting' the resulting string: the action and endpoint.
And there's the problem: I'm using unicode_literals from __future__ (makes it more future-proof, right? right???) and I'm passing my own endpoint into suds.
By just doing an .encode('utf-8') on the URL, all my problems went away. Even the whole saxutils.escape was no longer needed (even though it weirdly also didn't hurt).
tl;dr: make sure you're not passing any unicode objects anywhere into httplib or suds, I guess.
root = ET.Element(u'ActionCommand')
value = ET.SubElement(root, u'value')
value.text = complex_value
request = ET.tostring(root, encoding='utf-8').decode('utf-8')
client.service.ProcessAction(request)
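The implicit conversion behind both tracebacks can be reproduced in isolation. When Python 2 concatenates a unicode object with a str, it decodes the str as ASCII. The sketch below performs that decode explicitly (so it runs on Python 2 and 3 alike) on a UTF-8 encoded pound sign, the character suspected in the question:

```python
# A pound sign encoded as UTF-8 is the two bytes 0xC2 0xA3.
body = u"\u00a3".encode("utf-8")

try:
    # This is the step Python 2 performs implicitly for `unicode + str`.
    body.decode("ascii")
except UnicodeDecodeError as exc:
    failure = exc
    print(exc)

# Decoding with the encoding the bytes were actually written in works.
text = body.decode("utf-8")
```

The first byte, 0xC2, is the same byte both tracebacks complain about, which is consistent with the pound sign being the trigger.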
