I have spent hours debugging why my code randomly breaks with this error: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This is the code I have:
while True:
    try:
        submissions = requests.get('http://reymisterio.net/data-dump/api.php/submission?filter[]=form,cs,'+client+'&filter[]=date,cs,'+since).json()['submission']['records']
        break
    except requests.exceptions.ConnectionError:
        time.sleep(100)
And I've been debugging by printing requests.get(url) and requests.get(url).text, and I have encountered the following "special" cases:

1. requests.get(url) returns a successful 200 response and requests.get(url).text returns HTML. I have read online that requests.get(url).json() should fail in this case, because it won't be able to parse the HTML, but somehow it doesn't break. Why is this?
2. requests.get(url) returns a successful 200 response and requests.get(url).text is in JSON format. I don't understand why, when it reaches the requests.get(url).json() line, it breaks with the JSONDecodeError.
The exact value of requests.get(url).text for case 2 is:
{
    "submission": {
        "columns": [
            "pk",
            "form",
            "date",
            "ip"
        ],
        "records": [
            [
                "21197",
                "mistico-form-contacto-form",
                "2018-09-21 09:04:41",
                "186.179.71.106"
            ]
        ]
    }
}
Looking at the documentation for this API, it seems the only responses are in JSON format, so receiving HTML is strange. To increase the likelihood of receiving a JSON response, you can set the 'Accept' header to 'application/json'.
I queried this API many times with parameters and did not encounter a JSONDecodeError, so this error is most likely the result of an intermittent error on the server side. To handle it, catch json.decoder.JSONDecodeError in addition to the ConnectionError you currently catch, and handle it the same way.
Here is an example with all that in mind:
import requests, json, time, random

def get_submission_records(client, since, try_number=1):
    url = 'http://reymisterio.net/data-dump/api.php/submission?filter[]=form,cs,'+client+'&filter[]=date,cs,'+since
    headers = {'Accept': 'application/json'}
    try:
        response = requests.get(url, headers=headers).json()
    except (requests.exceptions.ConnectionError, json.decoder.JSONDecodeError):
        time.sleep(2**try_number + random.random()*0.01)  # exponential backoff
        return get_submission_records(client, since, try_number=try_number+1)
    else:
        return response['submission']['records']
I've also wrapped this logic in a recursive function rather than a while loop, because I think it is semantically clearer. The function also waits before trying again, using exponential backoff (waiting twice as long after each failure).
Edit: On Python 2.7, the error from trying to parse bad JSON is a ValueError, not a JSONDecodeError:
import requests, time, random

def get_submission_records(client, since, try_number=1):
    url = 'http://reymisterio.net/data-dump/api.php/submission?filter[]=form,cs,'+client+'&filter[]=date,cs,'+since
    headers = {'Accept': 'application/json'}
    try:
        response = requests.get(url, headers=headers).json()
    except (requests.exceptions.ConnectionError, ValueError):
        time.sleep(2**try_number + random.random()*0.01)  # exponential backoff
        return get_submission_records(client, since, try_number=try_number+1)
    else:
        return response['submission']['records']
so just change that except line to catch ValueError instead of json.decoder.JSONDecodeError.
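If you'd rather avoid the recursion (each retry adds a stack frame, and Python limits recursion depth), an iterative version of the same idea might look like the following. This is only a sketch; max_tries is a hypothetical cap I've added so the retries can't go on forever:

import requests, json, time, random

def get_submission_records(client, since, max_tries=5):
    url = 'http://reymisterio.net/data-dump/api.php/submission?filter[]=form,cs,'+client+'&filter[]=date,cs,'+since
    headers = {'Accept': 'application/json'}
    for try_number in range(1, max_tries + 1):
        try:
            response = requests.get(url, headers=headers).json()
        except (requests.exceptions.ConnectionError, json.decoder.JSONDecodeError):
            time.sleep(2**try_number + random.random()*0.01)  # same exponential backoff
        else:
            return response['submission']['records']
    return None  # every attempt failed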
Try this; it might work:
while True:
    try:
        response = requests.get('http://reymisterio.net/data-dump/api.php/submission?filter[]=form,cs,'+client+'&filter[]=date,cs,'+since)
        sub = json.loads(response.text)['submission']['records']
        print(sub)
        break
    except requests.exceptions.ConnectionError:
        time.sleep(100)
Working on an API project in which I'm trying to get all the redirect URLs from an API output such as https://urlscan.io/api/v1/result/39a4fc22-39df-4fd5-ba13-21a91ca9a07d/
Example of where I'm trying to pull the URLs from:

"redirectResponse": {
    "url": "https://www.coke.com/"
I currently have the following code:
import requests
import json
import time

# URL to be scanned
url = 'https://www.coke.com'

# URL Scan Headers
headers = {'API-Key': apikey, 'Content-Type': 'application/json'}
data = {"url": url, "visibility": "public"}

response = requests.post('https://urlscan.io/api/v1/scan/', headers=headers, data=json.dumps(data))
uuid = response.json()['uuid']
responseUrl = response.json()['api']

time.sleep(10)

req = requests.Session()
r = req.get(responseUrl).json()
r.keys()

for value in r['data']['requests']['redirectResponse']['url']:
    print(f"{value}")
I get the following error: TypeError: list indices must be integers or slices, not str. I'm not sure of the best way to parse the nested JSON in order to get all the redirect URLs.
A redirectResponse isn't always present in the requests, so the code has to be written to handle that and keep going. In Python that's usually done with a try/except:
for obj in r['data']['requests']:
    try:
        redirectResponse = obj['request']['redirectResponse']
    except KeyError:
        continue  # Ignore and skip to the next one.
    url = redirectResponse['url']
    print(f'{url=!r}')
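If you'd rather avoid the try/except, dict.get with a default gives the same behaviour. A small sketch, assuming the same response shape as above:

for obj in r['data']['requests']:
    redirect = obj.get('request', {}).get('redirectResponse')
    if redirect is not None:
        print(redirect['url'])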
I have always used:
r = requests.get(url)
if r.status_code == 200:
    # my passing code
else:
    # anything else, if this even exists
Now I was working on another issue and decided to allow for other errors and am instead now using:
try:
    r = requests.get(url)
    r.raise_for_status()
except requests.exceptions.ConnectionError as err:
    # eg, no internet
    raise SystemExit(err)
except requests.exceptions.HTTPError as err:
    # eg, url, server and other errors
    raise SystemExit(err)

# the rest of my code is going here
With the exception that various other errors could be tested for at this level, is one method any better than the other?
Response.raise_for_status() is just a built-in method for checking status codes: it raises an HTTPError for any 4xx or 5xx response, so it covers essentially the same ground as your first example.
There is no "better" here; it's just personal preference with flow control. My preference is for try/except blocks for catching errors around any call, as this informs the future programmer that these conditions are some sort of error; if/else doesn't necessarily indicate an error when scanning code.
Edit: Here's my quick-and-dirty pattern.
import time

import requests
from requests.exceptions import HTTPError

url = "https://theurl.com"
retries = 3

for n in range(retries):
    try:
        response = requests.get(url)
        response.raise_for_status()
        break
    except HTTPError as exc:
        code = exc.response.status_code
        if code in [429, 500, 502, 503, 504]:
            # retry after n seconds
            time.sleep(n)
            continue
        raise
However, in most scenarios, I subclass requests.Session, make a custom HTTPAdapter that handles exponential backoffs, and the above lives in an overridden requests.Session.request method. An example of that can be seen here.
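For reference, here is a minimal sketch of that Session-based pattern built on urllib3's Retry helper rather than a hand-rolled loop; the retry count, backoff factor, and URL are illustrative only:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(
    total=3,                                     # give up after 3 retries
    backoff_factor=1,                            # sleep between attempts, growing exponentially
    status_forcelist=[429, 500, 502, 503, 504],  # retry only on these status codes
)
session.mount('http://', HTTPAdapter(max_retries=retry))
session.mount('https://', HTTPAdapter(max_retries=retry))

response = session.get("https://theurl.com")
response.raise_for_status()

With this in place, every request made through the session gets the backoff behaviour for free, instead of each call site repeating the loop.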
Almost always, raise_for_status() is better.
The main reason is that there is a bit more to it than testing status_code == 200, and you should be making best use of tried-and-tested code rather than creating your own implementation.
For instance, did you know that the HTTP standard actually defines a whole class of 2xx 'success' codes (201 Created, 204 No Content, and so on)? Every one of those other than 200 will be misinterpreted as a failure by testing for status_code == 200.
If you are not sure, follow Ian Goldby's answer.
...however, please be aware that raise_for_status() is not some magical or exceptionally smart solution; it's a very simple method that checks the status code and raises an exception for HTTP codes 400-599, distinguishing client-side and server-side errors (see its code here).
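Paraphrased, its core check is roughly the following (simplified; the real method also builds a descriptive message from the reason phrase and URL):

if 400 <= r.status_code < 500:
    raise requests.exceptions.HTTPError('%s Client Error' % r.status_code, response=r)
elif 500 <= r.status_code < 600:
    raise requests.exceptions.HTTPError('%s Server Error' % r.status_code, response=r)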
And the client-side error responses especially may contain valuable information in the response body that you may want to process. For example, an HTTP 400 Bad Request response may tell you why the request was rejected.
In such a case it may be better not to use raise_for_status() but instead to cover all the cases yourself.
Example code
try:
    r = requests.get(url)

    # process the specific codes from the range 400-599
    # that you are interested in first
    if r.status_code == 400:
        invalid_request_reason = r.text
        print(f"Your request has failed because: {invalid_request_reason}")
        return
    # this will handle all other errors
    elif r.status_code > 400:
        print(f"Your request has failed with status code: {r.status_code}")
        return
except requests.exceptions.ConnectionError as err:
    # eg, no internet
    raise SystemExit(err)

# the rest of my code is going here
Real-world use case
PuppetDB's API using the Puppet Query Language (PQL) responds to a syntactically invalid query with an HTTP 400 Bad Request containing very precise information about where the error is.
Request query:
nodes[certname] { certname == "bastion" }
Body of the HTTP 400 response:
PQL parse error at line 1, column 29:

nodes[certname] { certname == "bastion" }
                            ^

Expected one of:

[
false
true
#"[0-9]+"
-
'
"
#"\s+"
See my Pull Request to an app that uses this API to make it show this error message to a user here, but note that it doesn't exactly follow the example code above.
"Better" is somewhat subjective; both can get the job done. That said, as a relatively inexperienced programmer, I prefer the try/except form.
For me, try/except reminds me that requests don't always give you what you expect (in a way that if/else doesn't, but that could just be me).
raise_for_status() also lets you easily implement as many or as few different actions for the different error types (HTTPError, ConnectionError) as you need.
In my current project, I've settled on the form below, as I'm taking the same action regardless of cause but am still interested to know the cause:
try:
    ...
except requests.exceptions.RequestException as e:
    raise SystemExit(e) from None
Toy implementation:
import requests

def http_bin_response(status_code):
    sc = status_code
    try:
        url = "http://httpbin.org/status/" + str(sc)
        response = requests.post(url)
        response.raise_for_status()
        p = response.content
    except requests.exceptions.RequestException as e:
        print("placeholder for save file / clean-up")
        raise SystemExit(e) from None
    return response, p

response, p = http_bin_response(403)
print(response.status_code)
I'm new to Python (3), so please don't bash me :D
I'm using the following code to import a .txt file which contains different URLs, so I can check their status codes. In my example, I added a few site URLs.
Here is the import.txt:
https://site1.site
https://site2.site
https://site3.site
https://site4.site
https://site5.site
And this is the .py script itself:
import requests

with open('import.txt', 'r') as f:
    for line in f:
        print(line)

#try :
r = requests.get(line)
print(r.status_code)
#except requests.ConnectionError :
#    print("failed to connect")
this is the response:
https://site1.site
https://site2.site
https://site3.site
https://site4.site
https://site5.site
400
Even though site3 and site4 are 301s, and site5 gives a failed-to-connect response, I only receive a single 400 response, which seems to apply to all of the submitted URLs.
If I requests.head each one of those URLs using the following script, then I receive the correct page status code ('Moved Permanently' for the example below). This is the single-request script:
import requests

try:
    r = requests.head("http://site3.net/")
    if r.status_code == 200:
        print('Success!')
    elif r.status_code == 301:
        print('Moved Permanently')
    elif r.status_code == 404:
        print('Not Found')
    # print(r.status_code)
except requests.ConnectionError:
    print("failed to connect")
kudos to What’s the best way to get an HTTP response code from a URL?
Your call to requests.get() is outside the for loop, and so is only executed once. Try indenting the relevant lines, like so:
import requests

with open('import.txt', 'r') as f:
    for line in f:
        print(line)
        #try :
        r = requests.get(line)
        print(r.status_code)
        #except requests.ConnectionError :
        #    print("failed to connect")
PS: I suggest you use 4-space indents. That way, errors like this are easier to spot.
As this is the first time I'm trying this out, I do not know what is wrong, so it would be great if someone could help me solve this problem.
The code I'm using is at the bottom of this page: https://www.twilio.com/blog/2014/11/build-your-own-pokedex-with-django-mms-and-pokeapi.html
It gives an example of how you can write an HTTP request function and retrieve data for your query.
The code on the website is this.
query.py
import requests
import json

BASE_URL = 'http://pokeapi.co'

def query_pokeapi(resource_url):
    url = '{0}{1}'.format(BASE_URL, resource_url)
    response = requests.get(url)
    if response.status_code == 200:
        return json.loads(response.text)
    return None

charizard = query_pokeapi('/api/v1/pokemon/charizard/')

sprite_uri = charizard['sprites'][0]['resource_uri']
description_uri = charizard['descriptions'][0]['resource_uri']

sprite = query_pokeapi(sprite_uri)
description = query_pokeapi(description_uri)

print charizard['name']
print description['description']
print BASE_URL + sprite['image']
In my edit, I only changed the print lines at the bottom to this:
query.py
print(charizard['name'])
print(description['description'])
print(BASE_URL + sprite['image'])
But I got this error instead:
Traceback (most recent call last):
  File "query2.py", line 46, in <module>
    sprite_uri = charizard['sprites'][0]['resource_uri']
TypeError: 'NoneType' object is not subscriptable
query_pokeapi must be returning None, which would mean that your API call is not receiving a 200 HTTP response. I'd check your URL to make sure it's properly formed; test it in your web browser.
Best practice would be to wrap your API call in a try/except, with an error message letting you know that your API call failed, and otherwise routing the thread.
Update: on rereading, the subscripting issue could be in any layer of your nested object. Evaluate charizard['sprites'][0]['resource_uri'] step by step in your debugger.
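A sketch of the try/except best practice mentioned above, applied to the question's function (the message and fallback behaviour here are illustrative, not from the original tutorial):

import requests

BASE_URL = 'http://pokeapi.co'

def query_pokeapi(resource_url):
    url = '{0}{1}'.format(BASE_URL, resource_url)
    try:
        response = requests.get(url)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as err:
        print('API call to {0} failed: {1}'.format(url, err))
        return None
    return response.json()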
When you call requests.get(url) for that resource, the response body is:

More than one resource is found at this URI

You are then calling charizard['sprites'][0]['resource_uri'] on that result, which raises the exception. When I tried the request, the status code was 300, so query_pokeapi(resource_url) returns None.
Update:

url = '{0}{1}'.format(BASE_URL, resource_url)

means that {0} is replaced by BASE_URL and {1} is replaced by resource_url, so the complete URL will be

url = 'http://pokeapi.co/api/v1/pokemon/charizard/'
Update: you can try

charizard = query_pokeapi('/api/v1/pokemon/')
print charizard['objects'][0]['descriptions'][0]

(query_pokeapi already returns the parsed JSON, so there is no need for another json.loads call.) The result will be:

{u'name': u'ekans_gen_1', u'resource_uri': u'/api/v1/description/353/'}
Update with complete code:

import requests
import json

BASE_URL = 'http://pokeapi.co'

def query_pokeapi(resource_url):
    url = '{0}{1}'.format(BASE_URL, resource_url)
    response = requests.get(url)
    if response.status_code == 200:
        return json.loads(response.text)
    return None

charizard = query_pokeapi('/api/v1/pokemon/')
print charizard['objects'][0]['descriptions'][0]
result will be:
{u'name': u'ekans_gen_1', u'resource_uri': u'/api/v1/description/353/'}
I made a simple script for amusement that takes the latest comment from http://www.reddit.com/r/random/comments.json?limit=1 and speaks it through espeak. I ran into a problem, however: if Reddit fails to give me the JSON data, which it commonly does, the script stops and gives a traceback. Is there any way to retry getting the JSON if it fails to load? I am using requests, if that means anything.
If you need it, here is the part of the code that gets the json data
url = 'http://www.reddit.com/r/random/comments.json?limit=1'
r = requests.get(url)
quote = r.text
body = json.loads(quote)['data']['children'][0]['data']['body']
subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
For the vocabulary: the actual error you're having is an exception that was thrown at some point in the program because of a detected runtime error, and the traceback is the report of the call stack that tells you where the exception was thrown.
Basically, what you want is an exception handler:
try:
    url = 'http://www.reddit.com/r/random/comments.json?limit=1'
    r = requests.get(url)
    quote = r.text
    body = json.loads(quote)['data']['children'][0]['data']['body']
    subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
except Exception as err:
    print(err)
so that you skip over the part that depends on the request that failed. Have a look at this doc as well: HandlingExceptions - Python Wiki
As pss suggests, if you want to retry after the url failed to load:
done = False
while not done:
    try:
        url = 'http://www.reddit.com/r/random/comments.json?limit=1'
        r = requests.get(url)
        done = True  # success, stop retrying
    except Exception as err:
        print(err)

quote = r.text
body = json.loads(quote)['data']['children'][0]['data']['body']
subreddit = json.loads(quote)['data']['children'][0]['data']['subreddit']
N.B.: That solution may not be optimal, since if you're offline or the URL is always failing, it'll loop forever. If you retry too fast and too often, Reddit may also ban you.
N.B. 2: I'm using the newest Python 3 syntax for exception handling, which may not work with Python older than 2.7.
N.B. 3: You may also want to catch a class more specific than Exception, to be able to select which kinds of errors you handle. It mostly depends on your app design; given what you say, you might want to handle requests.exceptions.ConnectionError, but have a look at requests' documentation to choose the right one.
Here's what you may want, but please think this through and adapt it to your use case:
import sys
import requests
import time
import json

def get_reddit_comments():
    retries = 5
    while retries != 0:
        try:
            url = 'http://www.reddit.com/r/random/comments.json?limit=1'
            r = requests.get(url)
            break  # if the request succeeded we get out of the loop
        except requests.exceptions.ConnectionError as err:
            print("Warning: couldn't get the URL: {}".format(err))
            time.sleep(1)  # wait 1 second between two requests
            retries -= 1
            if retries == 0:  # if we've done 5 attempts, we fail loudly
                return None
    return r.text

def use_data(quote):
    if not quote:
        print("could not get URL, despite multiple attempts!")
        return False
    data = json.loads(quote)
    if 'error' in data.keys():
        print("could not get data from reddit: error code #{}".format(data['error']))
        return False
    body = data['data']['children'][0]['data']['body']
    subreddit = data['data']['children'][0]['data']['subreddit']
    # … do stuff with your data here
    return True  # signal success to the caller

if __name__ == "__main__":
    quote = get_reddit_comments()
    if not use_data(quote):
        print("Fatal error: Couldn't handle data receipt from reddit.")
        sys.exit(1)
I hope this snippet will help you correctly design your program. And now that you've discovered exceptions, please always remember that exceptions are for handling things that should stay exceptional. If you throw an exception at some point in one of your programs, always ask yourself whether it marks something unexpected (like a webpage not loading) or an expected error (like a page loading but giving you output that is not what you expected).