Scrapy - how to wait for json page to be fully loaded - python

I scrape json pages but sometimes I get this error:
ERROR: Spider error processing <GET https://reqbin.com/echo/get/json/page/2>
Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/home/user/path/scraping.py", line 239, in parse_images
jsonresponse = json.loads(response.text)
File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 48662 (char 48661)
So I suspect that the JSON page does not have time to load fully, and that is why parsing its content fails. If I do it manually (taking the JSON content as a string and loading it with the json module), it works and I don't get the json.decoder.JSONDecodeError.
What I've done so far is to set in settings.py:
DOWNLOAD_DELAY = 5
DOWNLOAD_TIMEOUT = 600
DOWNLOAD_FAIL_ON_DATALOSS = False
CONCURRENT_REQUESTS = 8
hoping that it would slow down the scraping and solve my problem, but the problem still occurs.
Any idea how to make sure that the JSON page has loaded completely, so that parsing its content does not fail?

You can try increasing DOWNLOAD_TIMEOUT; it usually helps. If that's not enough, try reducing CONCURRENT_REQUESTS.
If that still doesn't help, retry the request. You can write your own retry_request method and call it with return self.retry_request(response).
Or copy the failed request, set dont_filter so the duplicate filter does not drop it, and return it: req = response.request.copy(); req.dont_filter = True; return req.
You can also use RetryMiddleware. Read more in the documentation: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.retry
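The "retry the request" idea could be sketched roughly like this. It is only a sketch under some assumptions: parse_json_or_retry, MAX_JSON_RETRIES and the json_retries meta key are names made up here, not part of Scrapy. The only things it relies on from Scrapy are response.text, response.meta and Request.replace(). As far as I know, with DOWNLOAD_FAIL_ON_DATALOSS = False a truncated body is passed to the callback instead of failing the download, which would explain why the broken JSON reaches json.loads in the first place.

```python
import json

# Hypothetical limit for this sketch; not a Scrapy setting.
MAX_JSON_RETRIES = 3


def parse_json_or_retry(response):
    """Return (data, None) on success, or (None, retry_request) on bad JSON.

    Returns (None, None) once the retry budget is used up.
    """
    try:
        return json.loads(response.text), None
    except json.JSONDecodeError:
        # Count retries in request meta so a permanently broken page
        # does not loop forever.
        retries = response.meta.get("json_retries", 0)
        if retries >= MAX_JSON_RETRIES:
            return None, None
        # dont_filter=True stops the duplicate filter from dropping
        # a request for a URL that was already seen.
        retry = response.request.replace(dont_filter=True)
        retry.meta["json_retries"] = retries + 1
        return None, retry
```

In a callback such as parse_images you would call data, retry = parse_json_or_retry(response), yield retry if it is not None, and otherwise carry on with data.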

Why doesn't the json() method on the requests module return anything?

When I call the json() method on request response I get an error.
Any suggestions to what could be wrong here?
My code:
import requests
import bs4
url = 'https://www.reddit.com/r/AskReddit/comments/l4styp/serious_what_is_the_the_scariest_thing_that_you/'
rsp = requests.get(url)
sc = rsp.json()
print(sc)
Output:
File "c:\VS_Code1\scrape.py", line 6, in <module>
sc = rsp.json()
File "C:\Users\User\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\models.py", line 900, in json
return complexjson.loads(self.text, **kwargs)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.496.0_x64__qbz5n2kfra8p0\lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.496.0_x64__qbz5n2kfra8p0\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.496.0_x64__qbz5n2kfra8p0\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 5 (char 5)
Your rsp is a Response object (it prints as <Response [200]>), and its body here is not JSON. If you want to read the content of the response, you can simply do:
rsp.text
What you get from the URL you posted here is HTML, not JSON.
This does not work because the page you are fetching returns HTML source code, not JSON.
To fetch the content of the web page, replace sc = rsp.json() with sc = rsp.text.
If you need this data as JSON, you can look into Reddit's API: https://www.reddit.com/dev/api
I figured out that to get JSON from the URL I entered, I have to add .json to the end of the URL. To my knowledge, this is specific to Reddit and a few other sites that allow it.
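One way to avoid the traceback when you are not sure what a URL returns is to look at the Content-Type header before calling json(). A small sketch; json_or_text is a made-up helper name, and it only assumes the object has headers, json() and text, as a requests response does:

```python
def json_or_text(rsp):
    """Return parsed JSON if the response declares a JSON body, else the raw text."""
    content_type = rsp.headers.get("Content-Type", "")
    if "json" in content_type.lower():
        return rsp.json()
    # HTML pages and other non-JSON bodies are returned as plain text.
    return rsp.text
```

With the code from the question this would be called as sc = json_or_text(requests.get(url)). For the HTML page it should give back the page source instead of raising.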

API returned JSON data wrapped in a function, breaking json.loads

I'm trying to work with JSON data that is pulled from USGS Earthquake API. If you follow that link, you can see the raw JSON data.
The JSON looks great; however, the returned response is wrapped in eqfeed_callback(...); and that breaks the JSON deserializer in Python.
A quick look at the code I have so far:
import requests
import json
from pprint import pprint
URL = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_week.geojsonp"
response = requests.get(URL)
raw_json = str(response.content)
json = json.loads(raw_json)
print(json)
I get the errors:
Traceback (most recent call last):
File "run.py", line 11, in <module>
json = json.loads(raw_json)
File "C:\Program Files\Anaconda3\lib\json\__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "C:\Program Files\Anaconda3\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Program Files\Anaconda3\lib\json\decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I'm fairly sure the issue is that the data is wrapped in that function call and the JSON decoder doesn't like it. How would I go about removing the function wrapper so that I'm left with the clean JSON inside?
You're using the wrong URL.
JSON wrapped in a function call is JSONP, which is used to get around the browser's same-origin restrictions when calling an API from a web page.
The URL to get normal JSON is
URL = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_week.geojson"
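If you are ever stuck with an endpoint that only serves JSONP, the wrapper can also be stripped before decoding. A minimal sketch, assuming the whole body is a single call such as eqfeed_callback({...}); with an optional trailing semicolon. The helper name loads_jsonp is made up for this example.

```python
import json
import re

# callback name, "(", payload, ")", optional ";" and surrounding whitespace
_JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def loads_jsonp(text):
    """Decode JSON that may be wrapped in a JSONP callback call."""
    match = _JSONP_RE.match(text)
    # Fall back to the text itself when there is no wrapper to strip.
    payload = match.group(1) if match else text
    return json.loads(payload)
```

Pass it decoded text (response.text in requests), not str(response.content), which would add a b'...' prefix around the body.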

Can not parse response from sg.media-imdb in python

I'm trying to parse response from https://sg.media-imdb.com/suggests/a/a.json in Python 3.6.8.
Here is my code:
import requests
url = 'https://sg.media-imdb.com/suggests/a/a.json'
data = requests.get(url).json()
I get this error:
$ /usr/bin/python3 /home/livw/Python/test_scrapy/phase_1.py
Traceback (most recent call last):
File "/home/livw/Python/test_scrapy/phase_1.py", line 33, in <module>
data = requests.get(url).json()
File "/home/livw/.local/lib/python3.6/site-packages/requests/models.py", line 889, in json
self.content.decode(encoding), **kwargs
File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
It seems the response is not in JSON format, although I can parse it with JSON Formatter & Validator.
How can I fix this and store the response in a JSON object?
This probably happens because the response is not pure JSON: it has a prefix.
You can see that the response starts with imdb$a( and ends with ).
The JSON parser doesn't know how to handle that and fails. You can remove the wrapper and parse just the JSON itself.
You can do this:
import json
import requests
url = 'https://sg.media-imdb.com/suggests/a/a.json'
data = requests.get(url).text
# keep only the text from the first '{' to the last '}' and parse that
parsed = json.loads(data[data.index('{'):data.rindex('}') + 1])

Python struggling to read many JSON files?

I'm writing a short script to check which 5-letter Twitter handles are available: basically 5 nested for loops, then a call to the Twitter API to check whether each handle is available.
In the middle for loop I have two lines:
response = requests.get("https://twitter.com/users/username_available?username=" + user)
print user, str(response.json()["valid"])
It ran for a little bit and at some point decided it couldn't read JSON files anymore, and now when I try running it it stops immediately with the same error:
File "check.py", line 25, in <module>
main()
File "check.py", line 16, in main
print user, str(response.json()["valid"])
File "/Library/Python/2.7/site-packages/requests/models.py", line 886, in json
return complexjson.loads(self.text, **kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
The only logical answer that comes to my mind is that my computer can't handle so many JSON requests but I was wondering if anyone knew any way to get around this.
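Worked it out after many minutes of confusion.
Twitter has a rate limit of 180 calls every 15 minutes. Once the script goes over that limit, the response is presumably no longer JSON, so response.json() fails.
https://dev.twitter.com/rest/public/rate-limiting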
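180 calls per 15 minutes works out to 900 s / 180 = one call every 5 seconds. One way to stay under the limit is to space the requests out. A rough sketch; the Throttle class is made up for this example (the clock and sleep arguments only exist so it can be tested without waiting), and it is written for Python 3 rather than the Python 2.7 in the question:

```python
import time


class Throttle:
    """Space calls so that at most max_calls happen per period seconds."""

    def __init__(self, max_calls=180, period=15 * 60,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls  # 900 / 180 = 5 seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one Throttle() before the loops and call throttle.wait() right before each requests.get(...). Checking response.status_code before calling response.json() is also worth doing, so that a rate-limit response shows up as such instead of as a JSON error.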

API Coinbase ValueError from get_buy_price()

python 3.4 and Coinbase V2 API
I am working on some BTC data analysis and trying to make continuous requests to the Coinbase API. When I run my script, it always eventually crashes on one of the calls to
r = client.get_spot_price()
r = client.get_buy_price()
r = client.get_sell_price()
The unusual thing is that the script will always crash at different times. Sometimes it will successfully collect data for an hour or so and then crash, other times it will crash after 5 - 10 minutes.
ERROR:
r = client.get_spot_price()
File "/home/g/.local/lib/python3.4/site-packages/coinbase/wallet/client.py", line 191, in get_spot_price
response = self._get('v2', 'prices', 'spot', data=params)
File "/home/g/.local/lib/python3.4/site-packages/coinbase/wallet/client.py", line 129, in _get
return self._request('get', *args, **kwargs)
File "/home/g/.local/lib/python3.4/site-packages/coinbase/wallet/client.py", line 116, in _request
return self._handle_response(response)
File "/home/g/.local/lib/python3.4/site-packages/coinbase/wallet/client.py", line 125, in _handle_response
raise build_api_error(response)
File "/home/g/.local/lib/python3.4/site-packages/coinbase/wallet/error.py", line 49, in build_api_error
blob = blob or response.json()
File "/home/g/.local/lib/python3.4/site-packages/requests/models.py", line 812, in json
return complexjson.loads(self.text, **kwargs)
File "/usr/lib/python3.4/json/__init__.py", line 318, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.4/json/decoder.py", line 343, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.4/json/decoder.py", line 361, in raw_decode
raise ValueError(errmsg("Expecting value", s, err.value)) from None
ValueError: Expecting value: line 1 column 1 (char 0)
It seems to be crashing due to some json decoding?
Does anyone have any idea why this will only throw errors at certain times?
I have tried something like the following to avoid crashing due to this error:
# snap is a tuple containing the data from the buy, sell and spot price calls
if not any(snap):
    print('\n\n-----ENTRY ERROR---- Snap returned None \n\n')
    success = False
    return
but it isn't doing the trick
What are some good ways to handle this error in your opinion?
Thanks, any help is much appreciated!
To me it looks related to this issue: https://github.com/coinbase/coinbase-python/issues/15. It does seem to be an error inside the library: the traceback shows raise build_api_error(response), which then fails when it calls response.json().
It may also be related to internet connectivity. If your network or the server fails, the client can either fail to retrieve the JSON or retrieve an empty body. Either way, the library should report this more clearly.
In that case it tries to decode an empty body in the JSON decoder, which causes the error.
A temporary workaround is to wrap your code in a try statement and try again if it fails.
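The try-and-retry workaround might look something like this. It is a sketch, not part of the coinbase library: call_with_retry is a made-up helper, and it catches ValueError only because that is what the traceback in the question ends with.

```python
import time


def call_with_retry(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func(), retrying on ValueError with a growing pause between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ValueError:
            if attempt == attempts:
                raise  # out of retries, let the caller see the error
            sleep(delay * attempt)  # wait 1x, 2x, ... the base delay
```

Usage would be along the lines of r = call_with_retry(client.get_spot_price), or with a lambda if the call needs arguments.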
You have to supply it with a currency to get a price.
Here is an example:
price = client.get_spot_price(currency_pair='XRP-USD')
