Can not parse response from sg.media-imdb in python - python

I'm trying to parse response from https://sg.media-imdb.com/suggests/a/a.json in Python 3.6.8.
Here is my code:
import requests
url = 'https://sg.media-imdb.com/suggests/a/a.json'
data = requests.get(url).json()
I get this error:
$ /usr/bin/python3 /home/livw/Python/test_scrapy/phase_1.py
Traceback (most recent call last):
File "/home/livw/Python/test_scrapy/phase_1.py", line 33, in <module>
data = requests.get(url).json()
File "/home/livw/.local/lib/python3.6/site-packages/requests/models.py", line 889, in json
self.content.decode(encoding), **kwargs
File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
It seems like the response format is not JSON format, although I can parse the response at JSON Formatter & Validator
How to fix it and store the response in a json object?

This probably happend because its not a complete json, it have a prefix
you can see that the response start with imdb$a( and ends with )
json parsing doesn't know how to handle it and he fails, you can remove those values and just parse the json itself
you can do this:
import json
import requests
url = 'https://sg.media-imdb.com/suggests/a/a.json'
data = requests.get(url).text
json.loads(data[data.index('{'):-1])

Related

Scrapy - how to wait for json page to be fully loaded

I scrape json pages but sometimes I get this error:
ERROR: Spider error processing <GET https://reqbin.com/echo/get/json/page/2>
Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/twisted/internet/defer.py", line 857, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "/home/user/path/scraping.py", line 239, in parse_images
jsonresponse = json.loads(response.text)
File "/usr/lib/python3.8/json/init.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 48662 (char 48661)
So I suspect that the json page does not have the time to be fully loaded and that's why parsing of its json content fails. And if I do it manually, I mean taking the json content as a string and loading it with the json module, it works and I don't get the json.decoder.JSONDecodeError error.
What I've done so far is to set in settings.py:
DOWNLOAD_DELAY = 5
DOWNLOAD_TIMEOUT = 600
DOWNLOAD_FAIL_ON_DATALOSS = False
CONCURRENT_REQUESTS = 8
hoping that it would slow down the scraping and solve my problem but the problem still occurs.
Any idea on how to be sure that the json page loaded completely so the parsing of its content does not fail ?
you can try to increase DOWNLOAD_TIMEOUT. It usually helps. If that's not enough, you can try to reduce CONCURRENT_REQUESTS.
If that still doesn't help, try use retry request. You can write your own retry_request function and call it return self.retry_request(response).
Or do it something like that req = response.request.copy(); req.dont_filter=True And return req.
You can also use RetryMiddleware. Read more on the documentation page https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.retry

Why doesn't the json() method on the requests module return anything?

When I call the json() method on request response I get an error.
Any suggestions to what could be wrong here?
My code:
import requests
import bs4
url = 'https://www.reddit.com/r/AskReddit/comments/l4styp/serious_what_is_the_the_scariest_thing_that_you/'
rsp = requests.get(url)
sc = rsp.json()
print(sc)
Output:
File "c:\VS_Code1\scrape.py", line 6, in <module>
sc = rsp.json()
File "C:\Users\User\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\requests\models.py", line 900, in json
return complexjson.loads(self.text, **kwargs)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.496.0_x64__qbz5n2kfra8p0\lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.496.0_x64__qbz5n2kfra8p0\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.496.0_x64__qbz5n2kfra8p0\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 5 (char 5)
Your rsp is actually returning <Response [200]> which is not a JSON. If you want to read the content of the response, you can simply do:
rsp.text
What you get from the URL you posted here is HTML, and not JSON.
This does not work because the page you are fetching does not return Json but HTML source code instead.
To fetch the content of the webpage you need to replace sc = rsp.json() with sc = rsp.text
If you need this data in Json, you can look into Reddit's API: https://www.reddit.com/dev/api
I figured out that if I want to get the JSON from the URL that I've inputted I have to add .json to the end of the URL, this is for some reason (to my knowledge) unique to Reddit and a few other sites that allow it.

API returned JSON data wrapped in a function, breaking json.loads

I'm trying to work with JSON data that is pulled from USGS Earthquake API. If you follow that link, you can see the raw JSON data.
The JSON looks great; however, the returned request is wrapped in an eqfeed_callback(); that is breaking the JSON deserializer in Python.
A quick look at the code I have so far:
import requests
import json
from pprint import pprint
URL = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_week.geojsonp"
response = requests.get(URL)
raw_json = str(response.content)
json = json.loads(raw_json)
print(json)
I get the errors:
Traceback (most recent call last):
File "run.py", line 11, in <module>
json = json.loads(raw_json)
File "C:\Program Files\Anaconda3\lib\json\__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "C:\Program Files\Anaconda3\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Program Files\Anaconda3\lib\json\decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Although I'm positive the issue is that it's wrapped in that function and the JSON decoder doesn't like it. So how would I go about removing the function wrapper to leave me with the clean JSON inside.
You're using the wrong URL.
JSON wrapped in a function call is JSONP, which is needed for getting around CORS when calling an API from web browsers.
The URL to get normal JSON is
URL = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_week.geojson"

JSONDecodeError Expecting value

I am trying to get json object, and it tells me it's expecting a value even though i define the path to the json in r.json(). Also when i do r.headers[content-type] give me text/html;charset=ISO-8859-1 ... Thank you for your time everyone
import requests
import json
session = requests.Session()
username = "------"
password = "-------"
url_cookie = 'http://ludwig.podiumdata.com:----/podium/j_spring_security_check?j_username=--&j_password=----'
url_get = 'http://ludwig.corp.podiumdata.com:----/qdc/entity/v1/getEntities?type=EXTERNAL&count=2&sortAttr=name&sortDir=ASC'
r = requests.get(url_get, auth=(username,password), verify=False)
r.json()
r.headers['content-type']
Traceback (most recent call last):
File "<ipython-input-108-61f8159bb1b5>", line 10, in <module>
r.json()
File "//anaconda3/lib/python3.7/site-packages/requests/models.py", line 897, in json
return complexjson.loads(self.text, **kwargs)
File "//anaconda3/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "//anaconda3/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "//anaconda3/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
JSONDecodeError: Expecting value
Your request is receiving the response as text/html, but you want to receive application/json.
You need to include {'Accept': 'application/json'} as a header in your request.
Example: requests.get(url_get, auth=(username,password), headers={'Accept': 'application/json'}, verify=False)
Also, it looks like you aren't using the Session you created on line 3, but that isn't what is causing this error.

Python Json Error: ValueError: No JSON object could be decoded

I'm getting an error when I execute this script and can't figure it out.
The Error:
Traceback (most recent call last):
File "./upload.py", line 227, in <module>
postImage()
File "./upload.py", line 152, in postImage
reddit = RedditConnection(redditUsername, redditPassword)
File "./upload.py", line 68, in __init__
self.modhash = r.json()['json']['data']['modhash']
File "/usr/lib/python2.6/site-packages/requests/models.py", line 799, in json
return json.loads(self.text, **kwargs)
File "/usr/lib/python2.6/site-packages/simplejson/__init__.py", line 307, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.6/site-packages/simplejson/decoder.py", line 335, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.6/site-packages/simplejson/decoder.py", line 353, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
You are getting this exception because you are using the wrong json function here:
def getNumberOfFailures(path):
try:
with open(path + '.failurecount') as f:
return json.loads(f.read())
except:
return 0
You need to do this instead:
def getNumberOfFailures(path):
try:
with open(path + '.failurecount') as f:
return json.load(f)
except:
return 0
json.loads() is used on json strings. json.load() is used on json file objects.
As some people have mentioned, you need to reissue yourself a new API key and delete the one you posted here in your code. Other people can and will abuse those secret keys to spam Reddit under your name.

Categories

Resources