Extracting Data from JSON - python

I have a large JSON item returned through a REST API, I wont junk up this with the full text but here is the code I am currently using:
import urllib2
import json
req = urllib2.Request
('http://elections.huffingtonpost.com/
pollster/api/polls.json?state=IA')
response = urllib2.urlopen(req)
the_page = response.read()
decode = json.loads(the_page)
#print = decode #removed, because it is not actually related to the question
print decode
I have been trying to extract information out of it such as the date polls are updated, the actual data from the polls etc (particularly the presidential polls) but I am having trouble returning any data at all. Can anyone assist?
EDIT:
The actual question is how to query data from the returned array/dict

The problem is, that you overwrite print with your data, instead of printing the data. Just remove the = in the last line and it should work fine:
print decode
If you want to use Python 3, you need parenthesis for print. This would look like this:
print(decode)
Edit: As you updated your question, here an answer to your actual question: The data is returned as a combination of dicts and lists by the loads function. Hence you can also access the data like a dict/list. For example, to get the last_updated field of all polls in one list, you can do something like this:
all_last_updated = [poll['last_updated'] for poll in decode]
Or to just get the end date of all polls sponsored by "Constitutional Responsibility Project", you could do this:
end_dates = [poll['end_date'] for poll in decode if any(sponsor['name'] == 'Constitutional Responsibility Project' for sponsor in poll['sponsors'])]
Or if you just want the id of the first poll in the list, do:
the_id = decode[0]['id']
You access anything you want from the json in a similar way.

it is because you do
print = decode
instead, if you are using python 2 do
print decode
or in python 3 do
print(decode)

Related

Input variable name as raw string into request in python

I am kind of very new to python.
I tried to loop through an URL request via python and I want to change one variable each time it loops.
My code looks something like this:
codes = ["MCDNDF3","MCDNDF4"]
#count = 0
for x in codes:
response = requests.get(url_part1 + str(codes) + url_part3, headers=headers)
print(response.content)
print(response.status_code)
print(response.url)
I want to have the url change at every loop to like url_part1+code+url_part3 and then url_part1+NEXTcode+url_part3.
Sadly my request badly formats the string from the variable to "%5B'MCDNDF3'%5D".
It should get inserted as a raw string each loop. I don't know if I need url encoding as I don't have any special chars in the request. Just change code to MCDNDF3 and in the next request to MCDNDF4.
Any thoughts?
Thanks!
In your for loop, the first line should be:
response = requests.get(url_part1 + x + url_part3, headers=headers)
This will work assuming url_part1 and url_part3 are regular strings. x is already a string, as your codes list (at least in your example) contains only strings. %5B and %5D are [ and ] URL-encoded, respectively. You got that error because you called str() on a single-membered list:
>>> str(["This is a string"])
"['This is a string']"
If url_part1 and url_part3 are raw strings, as you seem to indicate, please update your question to show how they are defined. Feel free to use example.com if you don't want to reveal your actual target URL. You should probably be calling str() on them before constructing the full URL.
You’re putting the whole list in (codes) when you probably want x.

retrieved URLs, trouble building payload to use requests module

I'm a Python novice, thanks for your patience.
I retrieved a web page, using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical example that is realistic for just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into 100s of lines of code, I understand what to do in order to retrieve each page:
from lxml import html
import requests
url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01' : '910', 'pf7331' : '11')
r = requests.get(url, params = payload)
Then get the second page
payload = {'PT001F01' : '910', 'pf7331' : '12')
r = requests.get(url, params = payload)
And keep typing in payload objects. Not all of the hrefs I'm dealing with are sequential, not all of the payloads are different simply in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I find urisplit which can give me the part I need to parse into a payload:
[urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object and I don't understand what to do.
I'm using Python3 and I used uritools because that appears to be the standards-compliant replacement of urltools.
I fell back on shell script to get pages with wget, which does work, but it is so un-Python-ish that I'm asking here for what to do. I mean, this does work:
import subprocess
for i in hrefs2:
subprocess.call(["wget", i])
You can pass the full url to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
If for some reason you don't want to do that, you'll need to split up the parameters some how. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
# list to store all url parameters after they're converted to dicts
urldata = []
#iterate over list of params
for param in a:
data = {}
# split the string into key value pairs
for kv in param.split('&'):
# split the pairs up
b = kv.split('=')
# first part is the key, second is the value
data[b[0]] = b[1]
# After converting every kv pair in the parameter, add the result to a list.
urldata.append(data)
You could do this with less code but I wanted to be clear what was going on. I'm sure there is already a module somewhere out there that does this for you too.

How Do I Start Pulling Apart This Block of JSON Data?

I'd like to make a program that makes offline copies of math questions from Khan Academy. I have a huge 21.6MB text file that contains data on all of their exercises, but I have no idea how to start analyzing it, much less start pulling the questions from it.
Here is a pastebin containing a sample of the JSON data. If you want to see all of it, you can find it here. Warning for long load time.
I've never used JSON before, but I wrote up a quick Python script to try to load up individual "sub-blocks" (or equivalent, correct term) of data.
import sys
import json
exercises = open("exercises.txt", "r+b")
byte = 0
frontbracket = 0
backbracket = 0
while byte < 1000: #while byte < character we want to read up to
#keep at 1000 for testing purposes
char = exercises.read(1)
sys.stdout.write(char)
#Here we decide what to do based on what char we have
if str(char) == "{":
frontbracket = byte
while True:
char = exercises.read(1)
if str(char)=="}":
backbracket=byte
break
exercises.seek(frontbracket)
block = exercises.read(backbracket-frontbracket)
print "Block is " + str(backbracket-frontbracket) + " bytes long"
jsonblock = json.loads(block)
sys.stdout.write(block)
print jsonblock["translated_display_name"]
print "\nENDBLOCK\n"
byte = byte + 1
Ok, the repeated pattern appears to be this: http://pastebin.com/4nSnLEFZ
To get an idea of the structure of the response, you can use JSONlint to copy/paste portions of your string and 'validate'. Even if the portion you copied is not valid, it will still format it into something you can actually read.
First I have used requests library to pull the JSON for you. It's a super-simple library when you're dealing with things like this. The API is slow to respond because it seems you're pulling everything, but it should work fine.
Once you get a response from the API, you can convert that directly to python objects using .json(). What you have is essentially a mixture of nested lists and dictionaries that you can iterate through and pull specific details. In my example below, my_list2 has to use a try/except structure because it would seem that some of the entries do not have two items in the list under translated_problem_types. In that case, it will just put 'None' instead. You might have to use trial and error for such things.
Finally, since you haven't used JSON before, it's also worth noting that it can behave like a dictionary itself; you are not guaranteed the order in which you receive details. However, in this case, it seems the outermost structure is a list, so in theory it's possible that there is a consistent order but don't rely on it - we don't know how the list is constructed.
import requests
api_call = requests.get('https://www.khanacademy.org/api/v1/exercises')
json_response = api_call.json()
# Assume we first want to list "author name" with "author key"
# This should loop through the repeated pattern in the pastebin
# access items as a dictionary
my_list1 = []
for item in json_response:
my_list1.append([item['author_name'], item['author_key']])
print my_list1[0:5]
# Now let's assume we want the 'sha' of the SECOND entry in translated_problem_types
# to also be listed with author name
my_list2 = []
for item in json_response:
try:
the_second_entry = item['translated_problem_types'][0]['items'][1]['sha']
except IndexError:
the_second_entry = 'None'
my_list2.append([item['author_name'], item['author_key'], the_second_entry])
print my_list2[0:5]

Parsing data from MapQuest reverse geocoding api in Python?

My code:
from urllib import request
import json
lat = 31.33 ; long = -84.52
webpage = "http://www.mapquestapi.com/geocoding/v1/reverse?key=MY_KEY&callback=renderReverse&location={},{}".format(lat, long)
response = request.urlopen(webpage)
json_data = response.read().decode(response.info().get_param('charset') or 'utf-8')
data = json.loads(json_data)
print(data)
This gives me following error:
ValueError: Expecting value: line 1 column 1 (char 0)
I am trying to read county and state from MapQuest reverse geocoding api. The response looks like this:
renderReverse({"info":{"statuscode":0,"copyright":{"text":"\u00A9 2015 MapQuest, Inc.","imageUrl":"http://api.mqcdn.com/res/mqlogo.gif","imageAltText":"\u00A9 2015 MapQuest, Inc."},"messages":[]},"options":{"maxResults":1,"thumbMaps":true,"ignoreLatLngInput":false},"results":[{"providedLocation":{"latLng":{"lat":32.841516,"lng":-83.660992}},"locations":[{"street":"562 Patterson St","adminArea6":"","adminArea6Type":"Neighborhood","adminArea5":"Macon","adminArea5Type":"City","adminArea4":"Bibb","adminArea4Type":"County","adminArea3":"GA","adminArea3Type":"State","adminArea1":"US","adminArea1Type":"Country","postalCode":"31204-3508","geocodeQualityCode":"P1AAA","geocodeQuality":"POINT","dragPoint":false,"sideOfStreet":"L","linkId":"0","unknownInput":"","type":"s","latLng":{"lat":32.84117,"lng":-83.660973},"displayLatLng":{"lat":32.84117,"lng":-83.660973},"mapUrl":"http://www.mapquestapi.com/staticmap/v4/getmap?key=MY_KEY&type=map&size=225,160&pois=purple-1,32.84117,-83.660973,0,0,|&center=32.84117,-83.660973&zoom=15&rand=-189494136"}]}]})
How do I convert this string to a dict from which I can query using a key? Any help would be appreciated. Thanks!
First, get rid of the callback parameter in your URL since that's what is causing the response to be wrapped in renderReverse()
webpage = "http://www.mapquestapi.com/geocoding/v1/reverse?key=MY_KEY&location={},{}".format(lat, long)
That will give you valid json which should work with the json.loads function you call. At this point you can interact with data like a dictionary, getting the county and state names by their keys. The way mapquests structures their json is pretty strange, so it looks like you may have to do some string matching to get the correct key name. In this case, 'adminArea4Type' is set to 'County' so you want to access the 'adminArea4' key to return the county name.
data['results'][0]['locations'][0]['adminArea4']

extract json data from post body request with python

Is there a way to easily extract the json data portion in the body of a POST request?
For example, if someone posts to www.example.com/post with the body of the form with json data, my GAE server will receive the request by calling:
jsonstr = self.request.body
However, when I look at the jsonstr, I get something like :
str: \r\n----------------------------8cf1c255b3bd7f2\r\nContent-Disposition: form-data;
name="Actigraphy"\r\n Content-Type: application/octet-
stream\r\n\r\n{"Data":"AfgCIwHGAkAB4wFYAZkBKgHwAebQBaAD.....
I just want to be able to call a function to extract the json part of the body which starts at the {"Data":...... section.
Is there an easy function I can call to do this?
there is a misunderstanding, the string you show us is not json data, it looks like a POST body. You have to parse the body with something like cgi.parse_multipart.
Then you could parse json like answered by aschmid00. But instead of the body, you parse only the data.
Here you can find a working code that shows how to use cgi.FieldStorage for parsing the POST body.
This Question is also answered here..
It depends on how it was encoded on the browser side before submitting, but normally you would get the POST data like this:
jsonstr = self.request.POST["Data"]
If that's not working you might want to give us some info on how "Data" was encoded into the POST data on the client side.
you can try:
import json
values = 'random stuff .... \r\n {"data":{"values":[1,2,3]}} more rnandom things'
json_value = json.loads(values[values.index('{'):values.rindex('}') + 1])
print json_value['data'] # {u'values': [1, 2, 3]}
print json_value['data']['values'] # [1, 2, 3]
but this is dangerous and takes a fair amount of assumptions, Im not sure which framework you are using, bottle, flask, theres many, please use the appropriate call to POST
to retrieve the values, based on the framework, if indeed you are using one.
I think you mean to do this self.request.get("Data") If you are using the GAE by itself.
https://developers.google.com/appengine/docs/python/tools/webapp/requestclass#Request_get
https://developers.google.com/appengine/docs/python/tools/webapp/requestclass#Request_get_all

Categories

Resources