I asked a similar question earlier (python JSON feed returns string not object), but I am having a little more trouble and don't understand it.
For about half of the dates this works and returns a JSON object. For example, November 9 2013 works:
import json
import requests

url = 'http://data.ncaa.com/jsonp/scoreboard/basketball-men/d1/2013/11/09/scoreboard.html?callback=c'
r = requests.get(url)
# strip the JSONP wrapper "c(...);" before parsing
jsonObj = json.loads(r.content[2:-2])
but if I try November 11 2013:
url = 'http://data.ncaa.com/jsonp/scoreboard/basketball-men/d1/2013/11/11/scoreboard.html?callback=c'
r = requests.get(url)
jsonObj = json.loads(r.content[2:-2])
I get this error:
ValueError: No JSON object could be decoded
I don't understand why. When I put both URLs into a browser, the responses look exactly the same.
The JSON in the second feed is, in fact, invalid. I found this by removing the callback wrapper and running the payload through http://jsonlint.com/
To see for yourself, search for the following ID: 336252
The lines just above that ID contain two commas in a row, which is disallowed by the JSON spec.
My guess is that the server at data.ncaa.com is trying to generate JSON itself rather than using a JSON library. You should contact the site administrator and make them aware of this error.
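If you want to reproduce the check programmatically, here is a minimal sketch; the regex-based unwrapping is my own suggestion, assuming the wrapper has the form c(...); as the callback=c parameter implies:
import json
import re

import requests

url = 'http://data.ncaa.com/jsonp/scoreboard/basketball-men/d1/2013/11/11/scoreboard.html?callback=c'
raw = requests.get(url).text

# pull the payload out of the JSONP wrapper c(...); without relying on fixed offsets
match = re.match(r'^\s*c\((.*)\);?\s*$', raw, re.DOTALL)
payload = match.group(1)

try:
    json.loads(payload)
    print('valid JSON')
except ValueError as exc:
    # on recent Pythons the message includes the position of the first error
    print(exc)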
Using demjson:
import demjson

demjson.decode(r.content[2:-2])
seems to work.
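demjson's decoder is evidently more permissive than the standard library's json module, which is presumably why it tolerates artifacts like the doubled comma noted above; the cleanest fix is still for the feed to emit valid JSON.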
Related
I'm using an API that is giving me an output formatted as
['{"quote":{"symbol":"AAPL"', '"companyName":"Apple Inc."', '"primaryExchange":"Nasdaq Global Select"', '"sector":"Technology"', '"calculationPrice":"close"', '"open":367.88', '"openTime":1593696600532', '"close":364.11', '"closeTime":1593720000277', '"high":370.47', '"low":363.64', '"latestPrice":364.11'}]
...(it keeps going like this with many more categories.)
I am attempting to pull out only the latest price. What would be the best way to do that?
This is what I have, but I get a bunch of errors.
string = (data.decode("utf-8"))
data_v = string.split(',')
for word in data_v[latestPrice]:
    if word == ["latestPrice"]:
        print(word)
print(data_v)
Judging by the output, this is JSON. To parse it easily, use the json module (see https://docs.python.org/3/library/json.html).
If I'm correct, you got this output from Yahoo Finance; if that is indeed the case, don't fetch and parse it manually but use the yfinance module (see https://pypi.org/project/yfinance/).
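For example, a minimal sketch with yfinance; treat the field name as an illustration rather than a guaranteed schema, since the keys in info vary with the data source:
import yfinance as yf

ticker = yf.Ticker("AAPL")
# info is a plain dict of quote fields; inspect its keys to see what is available
print(ticker.info.get("regularMarketPrice"))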
You will have to use the json module to parse this JSON string; it converts the string into a dictionary. I have indented the JSON for ease of understanding. You can use the following approach:
import json
text_to_parse = """
{"quote":
{
"symbol":"AAPL",
"companyName":"Apple Inc.",
"primaryExchange":"Nasdaq Global Select",
"sector":"Technology",
"calculationPrice":"close",
"open":367.88,
"openTime":1593696600532,
"close":364.11,
"closeTime":1593720000277,
"high":370.47,
"low":363.64,
"latestPrice":364.11
}
}
"""
parsed_dict = json.loads(text_to_parse)
print(parsed_dict["quote"]["latestPrice"])
When the program is run, it outputs 364.11
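For completeness, applied directly to the raw bytes from the API (assuming data holds the raw response, as in the question), the same parse is a one-liner:
price = json.loads(data.decode("utf-8"))["quote"]["latestPrice"]
print(price)  # 364.11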
I'm a Python novice; thanks for your patience.
I retrieved a web page, using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical example that is realistic for just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into 100s of lines of code, I understand what to do in order to retrieve each page:
from lxml import html
import requests
url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01' : '910', 'pf7331' : '11'}
r = requests.get(url, params = payload)
Then get the second page
payload = {'PT001F01' : '910', 'pf7331' : '12'}
r = requests.get(url, params = payload)
And keep typing in payload objects. Not all of the hrefs I'm dealing with are sequential, and not all of the payloads differ only in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I found urisplit, which can give me the part I need to parse into a payload:
[urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object, and I don't understand how to do that.
I'm using Python 3, and I used uritools because it appears to be the standards-compliant replacement for urltools.
I fell back on shell script to get pages with wget, which does work, but it is so un-Python-ish that I'm asking here for what to do. I mean, this does work:
import subprocess
for i in hrefs2:
    subprocess.call(["wget", i])
You can pass the full url to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
If for some reason you don't want to do that, you'll need to split up the parameters somehow. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# list to store all url parameters after they're converted to dicts
urldata = []

# iterate over the list of params
for param in a:
    data = {}
    # split the string into key=value pairs
    for kv in param.split('&'):
        # split the pairs up
        b = kv.split('=')
        # first part is the key, second is the value
        data[b[0]] = b[1]
    # after converting every kv pair in the parameter, add the result to the list
    urldata.append(data)
You could do this with less code but I wanted to be clear what was going on. I'm sure there is already a module somewhere out there that does this for you too.
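There is: urllib.parse in the standard library does this split for you. A minimal sketch using parse_qsl, which returns a list of key/value tuples:
from urllib.parse import parse_qsl

a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# one dict of parameters per query string
urldata = [dict(parse_qsl(param)) for param in a]
print(urldata)  # [{'PT001F01': '910', 'pf7331': '11'}, {'PT001F01': '910', 'pf7331': '12'}]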
I'm using Flask in a web application that calls a service API which generates a JSON response. The following part of the function works fine and returns the JSON text output:
import urllib.request

def get_weather(query = 'london'):
    api_url = "http://api.openweathermap.org/data/2.5/weather?q={}&units=metric&appid=XXXXX****2a6eaf86760c"
    query = urllib.request.quote(query)
    url = api_url.format(query)
    response = urllib.request.urlopen(url)
    data = response.read()
    return data
The output returned is:
{"coord":{"lon":-0.13,"lat":51.51},"weather":[{"id":803,"main":"Clouds","description":"broken clouds","icon":"04d"}],"base":"cmc stations","main":{"temp":12.95,"pressure":1030,"humidity":68,"temp_min":12.95,"temp_max":12.95,"sea_level":1039.93,"grnd_level":1030},"wind":{"speed":5.11,"deg":279.006},"clouds":{"all":76},"dt":1462290955,"sys":{"message":0.0048,"country":"GB","sunrise":1462249610,"sunset":1462303729},"id":2643743,"name":"London","cod":200}
This means that data is a string, doesn't it?
However, commenting out return data and then adding the following two lines:
jsonData = json.loads(data)
return jsonData
generates the following error:
TypeError: the JSON object must be str, not 'bytes'
What's wrong? data, the JSON object, was previously returned as a string! I need to know where the mistake is.
The data returned by urlopen is a bytes object, while json.loads accepts strings, so you need to decode the data to a string using the encoding that your response declares (it is usually safe to assume UTF-8).
You should be able to just change your code to this:
return json.loads(data.decode("utf-8"))
PS: Storing the result in a variable right before returning it is redundant, so I simplified things.
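As an aside, on Python 3.6 and later json.loads also accepts bytes directly, so the decode step is only required on older versions. And if you would rather not hard-code the encoding, the response headers usually carry it. A minimal sketch, reusing response and data from the question's function; note that get_content_charset() can return None, so falling back to UTF-8 is a reasonable default:
charset = response.headers.get_content_charset() or "utf-8"
return json.loads(data.decode(charset))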
I'm starting to learn Python. I've written the following code (some of it omitted), and it works fine, but I'd like to understand it better. So I do the following:
html_doc = requests.get('[url here]')
Followed by:
if html_doc.status_code == 200:
    soup = BeautifulSoup(html_doc.text, 'html.parser')
    line = soup.find('a', class_="some_class")
    value = re.search('[regex]', str(line))
    print(value.group(0))
My questions are:
What does html_doc.text really do? I understand that it makes "text" (a string?) out of html_doc, but why isn't it text already? What is it? Bytes? Maybe a stupid question but why doesn't requests.get create a really long string containing the HTML code?
The only way that I could get the result of re.search was by value.group(0) but I have literally no idea what this does. Why can't I just look at value directly? I'm passing it a string, there's only one match, why is the resulting value not a string?
requests.get()'s return value, as stated in the docs, is a Response object.
re.search()'s return value, as stated in the docs, is a MatchObject.
Both objects exist because they carry much more information than the raw response bytes alone (e.g. the HTTP status code and response headers) or the matched string alone (e.g. the positions of the first and last matched characters).
For more information you'll have to study the docs.
FYI, to check the type of a returned value you may use the built-in type function:
response = requests.get('[url here]')
print(type(response))  # <class 'requests.models.Response'>
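To make the second question concrete, here is a minimal sketch of why .group(0) is needed: the match object records where and what matched, and .group(0) extracts the full matched text:
import re

m = re.search(r'\d+', 'order-12345-shipped')
print(m)           # e.g. <re.Match object; span=(6, 11), match='12345'>
print(m.group(0))  # '12345' -- the matched substring itself
print(m.span())    # (6, 11) -- start and end positions of the match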
It seems to me you are lacking some basic knowledge about classes, objects, methods, etc.; you should read more about them here (for Python 2.7) and about the requests module here.
Concerning what you asked: when you type html_doc = requests.get('url'), you are creating an instance of the class requests.models.Response, which you can check by:
>>> type(html_doc)
<class 'requests.models.Response'>
Now, html_doc has attributes and methods; html_doc.text will return the server's response to you as text.
The same goes for the re module: each of its functions returns objects that are not simply an int or a string.
I am trying to use the requests library in Python to push data (a raw value) to a Firebase location.
Say, I have urladd (the url of the location with authentication token). At the location, I want to push a string, say International. Based on the answer here, I tried
data = {'.value': 'International'}
p = requests.post(urladd, data = sjson.dumps(data))
I get <Response [400]>. p.text gives me:
u'{\n "error" : "Invalid data; couldn\'t parse JSON object, array, or value. Perhaps you\'re using invalid characters in your key names."\n}\n'
It appears that the key .value is invalid. But that is what the answer linked above suggests. Any idea why this may not be working, or how I can do this through Python? There are no problems with connection or authentication, because the following works. However, that pushes an object instead of a raw value.
data = {'name': 'International'}
p = requests.post(urladd, data = sjson.dumps(data))
Thanks for your help.
The answer you've linked is a special case for when you want to assign a priority to a value. In general, '.value' is an invalid name and will throw an error.
If you want to write just "International", you should write the stringified-JSON version of that data. I don't have a python example in front of me, but the curl command would be:
curl -X POST -d "\"International\"" https://...
Andrew's answer above works. In case someone else wants to know how to do this using the requests library in Python, I thought this would be helpful.
import requests
import simplejson as sjson

data = sjson.dumps("International")
p = requests.post(urladd, data = data)
For some reason I had thought that the data had to be in dictionary format before being converted to its stringified JSON version. That is not the case; a simple string can be used as input to sjson.dumps().
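For what it's worth, you can print the payload to see exactly what goes over the wire:
print(sjson.dumps("International"))  # "International" -- a quoted JSON string, matching the curl payload above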