How Do I Start Pulling Apart This Block of JSON Data? - python

I'd like to write a program that makes offline copies of math questions from Khan Academy. I have a huge 21.6MB text file containing data on all of their exercises, but I have no idea how to start analyzing it, much less how to start pulling questions out of it.
Here is a pastebin containing a sample of the JSON data. If you want to see all of it, you can find it here. Warning: long load time.
I've never used JSON before, but I wrote up a quick Python script to try to load up individual "sub-blocks" (or whatever the correct term is) of data.
import sys
import json

exercises = open("exercises.txt", "r+b")
byte = 0
frontbracket = 0
backbracket = 0

while byte < 1000:  # while byte < character we want to read up to
                    # keep at 1000 for testing purposes
    char = exercises.read(1)
    sys.stdout.write(char)
    # Here we decide what to do based on what char we have
    if str(char) == "{":
        frontbracket = byte
        while True:
            char = exercises.read(1)
            byte = byte + 1  # keep the counter in sync with the file position
            if str(char) == "}":
                backbracket = byte
                break
        # Note: this naive scan stops at the FIRST closing brace,
        # so it breaks on nested objects
        exercises.seek(frontbracket)
        block = exercises.read(backbracket - frontbracket + 1)
        print "Block is " + str(backbracket - frontbracket + 1) + " bytes long"
        jsonblock = json.loads(block)
        sys.stdout.write(block)
        print jsonblock["translated_display_name"]
        print "\nENDBLOCK\n"
    byte = byte + 1

Ok, the repeated pattern appears to be this: http://pastebin.com/4nSnLEFZ
To get an idea of the structure of the response, you can use JSONLint to copy/paste portions of your string and 'validate'. Even if the portion you copied is not valid, it will still format it into something you can actually read.
First, I used the requests library to pull the JSON for you. It's a super-simple library for things like this. The API is slow to respond because it seems you're pulling everything, but it should work fine.
Once you get a response from the API, you can convert it directly to Python objects using .json(). What you have is essentially a mixture of nested lists and dictionaries that you can iterate through to pull out specific details. In my example below, my_list2 has to use a try/except structure because it would seem that some of the entries do not have two items in the list under translated_problem_types. In that case, it will just put 'None' instead. You might have to use trial and error for such things.
Finally, since you haven't used JSON before, it's also worth noting that a JSON object can behave like a dictionary itself: you are not guaranteed the order in which you receive details. However, in this case the outermost structure is a list, so in theory there could be a consistent order, but don't rely on it; we don't know how the list is constructed.
import requests

api_call = requests.get('https://www.khanacademy.org/api/v1/exercises')
json_response = api_call.json()

# Assume we first want to list "author name" with "author key"
# This should loop through the repeated pattern in the pastebin
# access items as a dictionary
my_list1 = []
for item in json_response:
    my_list1.append([item['author_name'], item['author_key']])
print my_list1[0:5]

# Now let's assume we want the 'sha' of the SECOND entry in translated_problem_types
# to also be listed with author name
my_list2 = []
for item in json_response:
    try:
        the_second_entry = item['translated_problem_types'][0]['items'][1]['sha']
    except IndexError:
        the_second_entry = 'None'
    my_list2.append([item['author_name'], item['author_key'], the_second_entry])
print my_list2[0:5]
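If you would rather read the 21.6MB file you already saved instead of hitting the API each time, there is no need to scan it byte by byte; json.load can parse the whole file in one call. A minimal sketch, assuming the file contains the same JSON array the API returns:
import json

# Parse the entire file at once; 21.6 MB fits comfortably in memory.
with open("exercises.txt") as f:
    exercises = json.load(f)

# 'exercises' is now a list of dictionaries, same as json_response above.
print exercises[0]["translated_display_name"]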

Related

how to get nested data with pandas and requests

I'm going crazy trying to get data through an API call using requests and pandas. It looks like nested data, but I can't get the data I need.
https://xorosoft.docs.apiary.io/#reference/sales-orders/get-sales-orders
Above is the API documentation. I'm just trying to keep it simple and get the ItemNumber and QtyRemainingToShip, but I can't even figure out how to access the nested data. I'm trying to use a DataFrame to get it, but I'm just lost. Any help would be appreciated. I keep getting stuck at the 'Data' level.
type(json['Data'])
df = pd.DataFrame(['Data'])
df.explode('SoEstimateHeader')
df.explode('SoEstimateHeader')
Cell In [64], line 1
df.explode([0:])
^
SyntaxError: invalid syntax
I used the link to grab a sample response from the API documentation page you provided. From the code you posted, it looks like you are already able to get the data, and I'm assuming that you have it as a dictionary already.
From what I can tell, I don't think you should be using pandas unless it's some downstream requirement of the task you are doing. But to get the ItemNumber & QtyRemainingToShip you can use the code below.
# get the interesting part of the data out of the api response
data_list = json['Data']
# the data_list is only one element long, so grab the first element, which is of type dictionary
data = data_list[0]
# the dictionary has two keys at the top level
so_estimate_header = data['SoEstimateHeader']
# similar to data_list, the value associated with "SoEstimateItemLineArr" is of type list and has 1 element in it, so we grab the first & only element
so_estimate_item_line_arr = data['SoEstimateItemLineArr'][0]
# now we can grab the pieces of information we're interested in out of the dictionary
qtyremainingtoship = so_estimate_item_line_arr["QtyRemainingToShip"]
itemnumber = so_estimate_item_line_arr["ItemNumber"]
print("QtyRemainingToShip: ", qtyremainingtoship)
print("ItemNumber: ", itemnumber)
Output
QtyRemainingToShip: 1
ItemNumber: BC
Side Note
As a side note, I wouldn't name any variable json, because that's also the name of the popular Python library for parsing JSON; it will confuse future readers and will shadow the name if you end up having to import the json library.
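If pandas really is a downstream requirement, pd.json_normalize can flatten the nested records into a table in one step. A rough sketch, assuming parsed_response is the dictionary parsed from the API response (a hypothetical name; use whatever variable you already have):
import pandas as pd

# Flatten each record in the one-element "Data" list; each row comes
# from an entry of its nested "SoEstimateItemLineArr" list.
df = pd.json_normalize(
    parsed_response["Data"],
    record_path="SoEstimateItemLineArr",
)
print(df[["ItemNumber", "QtyRemainingToShip"]])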

Loop for iterating through variable URLs (Python)

Python newbie here.
I am developing a simple sequence search program for DNA sequences. The main idea is to get from the NCBI database the sequences for a specific genome and given start-end points. So far, I am able to do a simple search for one genome and one specific position:
import urllib

genome = "NC_009089.1"
start = "359055"
end = "359070"
link = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=%s&rettype=fasta&seq_start=%s&seq_stop=%s" % (genome, start, end)
f = urllib.urlopen(link)
myfile = f.read()
print(myfile.splitlines()[1])
And this is the output I'm getting (the sequence in that position):
AGTAAAACGGTTTCCT
Now, I would like to find several sequences from different genomes and with different start-end points at the same time, returning all the sequences that were found. I've tried to import the data as a CSV with the genomes in the first column, starts in the second and ends in the third, and then do a for loop over the open file, but since I'm not familiar with changing variables in URLs, I don't know how to proceed.
Sorry if this is a naive question. Any help would be appreciated.
If you already have all the parameters in a file, you can iterate over that data and make requests like this (I use lists, because you don't show how you read your data from the file):
import urllib

url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
for genome, start, end in zip(genome_list, start_list, end_list):
    # use the same parameter names as in your working URL
    params = urllib.urlencode({
        'db': 'nuccore',
        'rettype': 'fasta',
        'id': genome,
        'seq_start': start,
        'seq_stop': end,
    })
    f = urllib.urlopen(url + '?' + params)
By passing a dict of the query params through urllib.urlencode(), you make sure all the parameters are encoded properly (with = and &) instead of building the URL by hand.
If urllib is a bit complicated, you can try the Python requests library, which is much nicer to work with in my experience (but it is a third-party lib, not built-in).
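For illustration, here is a rough requests version that also reads the genome/start/end rows from a CSV file. The file name queries.csv and its column order are assumptions; adjust them to your actual data:
import csv
import requests

url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# "queries.csv" is a hypothetical file with one genome,start,end triple per row.
with open("queries.csv") as f:
    for genome, start, end in csv.reader(f):
        params = {
            "db": "nuccore",
            "rettype": "fasta",
            "id": genome,
            "seq_start": start,
            "seq_stop": end,
        }
        r = requests.get(url, params=params)  # requests builds the query string for you
        print(r.text.splitlines()[1])  # second line of the FASTA record is the sequence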

Scraping data from an HTTP & JavaScript site

I currently want to scrape some data from an Amazon page and I'm kind of stuck.
For example, let's take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found by opening the source code and searching for 'variationValues'.
There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimentionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimentionIndexMap we can see
"B01KWIUH5M":[0,0]
This means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the variationValues size_name section) and the color 'Teal' (same idea as before).
I want to scrape both variationValues and asinToDimentionIndexMap, so I can associate the IndexMap numbers with the variationValues ones.
Another person on the site (thanks for the help, btw) suggested doing it this way.
script = response.xpath('//script/text()').extract_first()
import re
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)
import json
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything inside a 'script' tag as a string and then capture everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading some stuff about it didn't help that much.
Is there a way to get, from that data, two dictionaries or lists with variationValues and asinToDimentionIndexMap (maybe using some regular expressions in the middle to get some data out of a big string)? Or could someone explain a little about what happens in the JSON part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close, Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming that where you may be finding things a challenge is when there are errors accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
E.g. if your 'asinToDimentionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimentionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side you will see "which boxes are within one another", which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily parse each of them as JSON and combine them however you wish.
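To illustrate how the two structures fit together, here is a rough sketch that parses two of the extracted strings and maps each ASIN back to human-readable values. page_source is a hypothetical string holding the page HTML, and the color_name key is an assumption alongside the size_name section mentioned above:
import json
import re

# Extract and parse the two structures (same regexes as above).
variation_values = json.loads(
    re.findall(r'variationValues\" : ({.*?})', page_source)[0])
index_map = json.loads(
    re.findall(r'asinToDimensionIndexMap\" : ({.*})', page_source)[0])

# Each ASIN maps to a pair of indices into the variationValues lists.
for asin, (size_idx, color_idx) in index_map.items():
    print(asin,
          variation_values["size_name"][size_idx],
          variation_values["color_name"][color_idx])  # "color_name" is assumed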

json change dictionary item to a list with one dictionary

I'm working with a REST API for finding address details. I pass it an address and it passes back details for that address: lat/long, suburb, etc. I'm using the requests library with the json() method on the response and adding the JSON response to a list to analyse later.
What I'm finding is that when there is a single match for an address, the 'FoundAddress' key in the JSON response contains a dictionary, but when more than one match is found, the 'FoundAddress' key contains a list of dictionaries.
The returned json looks something like:
For a single match:
{
    'FoundAddress': {AddressDetails...}
}
For multiple matches:
{
    'FoundAddress': [{Address1Details...}, {Address2Details...}]
}
I don't want to write code to handle a single match and then multiple matches.
How can I modify 'FoundAddress' so that when there is a single match it becomes a list with a single dictionary entry? Such that I get something like this:
{
    'FoundAddress': [{AddressDetails...}]
}
If it's the external API sending responses in that format then you can't really change FoundAddress itself, since it will always arrive in that format.
You can change the response if you want to, since you have full control over what you've received:
r = json.loads(response)
fixed = r['FoundAddress'] if (type(r['FoundAddress']) is list) else [r['FoundAddress']]
r['FoundAddress'] = fixed
Alternatively you can do the distinction at address usage time:
def func(foundAddress):
    # work with a single dictionary instance here
then:
result = map(func, r['FoundAddress']) if (type(r['FoundAddress']) is list) else [func(r['FoundAddress'])]
But honestly I'd take a clear:
if type(r['FoundAddress']) is list:
    result = map(func, r['FoundAddress'])
else:
    result = func(r['FoundAddress'])
or the response fix-up over the 'a if b else c' one-liner any day.
If you can, I would just change the API. If you can't, there's nothing magical you can do; you just have to handle the special case. You could probably do this in one place in your code with a function like:
def handle_found_addresses(found_addresses):
    if not isinstance(found_addresses, list):
        found_addresses = [found_addresses]
    ...
and then proceed from there to do whatever you do with found addresses as if the value is always a list with one or more items.
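A minimal sketch of that normalize-to-list pattern in use, assuming raw_json holds the response text (a hypothetical name):
import json

def handle_found_addresses(found_addresses):
    # Wrap a bare dictionary so callers always see a list.
    if not isinstance(found_addresses, list):
        found_addresses = [found_addresses]
    return found_addresses

r = json.loads(raw_json)
for address in handle_found_addresses(r['FoundAddress']):
    print(address)  # each entry is now always a dictionary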

rss is not updated immediately

I have a little script that monitors the RSS feed for 'new questions tagged with python', specifically on SO. It stores the feed in a variable on the first iteration of the loop, and then constantly checks the feed against the one stored in the variable. If the feed changes, it updates the variable, outputs the newest entry to the console, and plays a sound file to alert me that there are new questions. All in all, it's quite handy, as I don't have to keep an eye on anything. However, there are time discrepancies between new questions actually being posted and my script detecting feed updates. These discrepancies vary in length, but generally the alert isn't instant and tends not to reach me before there has been enough action on a question that it's pretty much been dealt with. Not always the case, but generally. Is there a way for me to ensure much faster or quicker updates/alerts? Or is this as good as it gets? (It's crossed my mind that this particular feed is only updated when there is actually action on a question... anyone know if that's the case?)
Have I misunderstood the way RSS actually works?
import urllib2
import mp3play
import time
from xml.dom import minidom

def SO_notify():
    """ play alarm when rss is updated """
    rss = ''
    filename = "path_to_soundfile"
    mp3 = mp3play.load(filename)
    mp3.volume(25)
    while True:
        html = urllib2.urlopen("http://stackoverflow.com/feeds/tag?tagnames=python&sort=newest")
        new_rss = html.read()
        if new_rss == rss:
            time.sleep(30)  # sleep on this path too, or the loop hammers the server
            continue
        rss = new_rss
        feed = minidom.parseString(rss)
        new_entry = feed.getElementsByTagName('entry')[0]
        title = new_entry.getElementsByTagName('title')[0].childNodes[0].nodeValue
        print title
        mp3.play()
        time.sleep(30)  # Edit - thanks to all who suggested this

SO_notify()
Something like:
import requests
import mp3play
import time

curr_ids = []
filename = "path_to_soundfile"
mp3 = mp3play.load(filename)
mp3.volume(25)

while True:
    api_json = requests.get("http://api.stackoverflow.com/1.1/questions/unanswered?order=desc&tagged=python").json()
    new_questions = []
    all_questions = []
    for q in api_json["questions"]:
        all_questions.append(q["question_id"])
        if q["question_id"] not in curr_ids:
            new_questions.append(q["question_id"])
    if new_questions:
        print(new_questions)
        mp3.play()
    curr_ids = all_questions
    time.sleep(30)
I used the requests package here because urllib gave me some encoding trouble.
IMHO, you could have 2 solutions to this, depending on which approach you want:
Use JSON - this will give you a nice dict with all entries.
Use RSS (XML). In this case you'd need something like feedparser to process your XML.
Either way, the code should be something like:
# make curr_ids a dictionary for easier lookup
from time import sleep

curr_ids = []
filename = "path_to_soundfile"
mp3 = mp3play.load(filename)
mp3.volume(25)

# Loop
while True:
    # Get the list of entries in objects
    entries = get_list_of_entries()
    new_ids = []
    for entry in entries:
        # Check if we reached the most recent entry
        if entry.id in curr_ids:
            # Force loop end if we did
            break
        new_ids.append(entry.id)
        # Do whatever operations
        print entry.title
    if len(new_ids) > 0:
        mp3.play()
        curr_ids = new_ids
    else:
        # No updates in the meantime
        pass
    sleep(30)
Several notes:
I'd order the entries by "oldest" instead, so the printed entries look like a stream, with the most recent one printed last.
The new_ids thing is to keep the list of ids to a minimum; otherwise lookup will become slower over time.
get_list_of_entries() is a placeholder to get the entries from the source (objects from XML or a dict from JSON). Depending on which approach you pick, accessing them differs (but the principle is the same); see the sketch below.
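For the RSS route, a minimal get_list_of_entries() using feedparser might look like this (a sketch, assuming the same feed URL as in the question):
import feedparser

def get_list_of_entries():
    # feedparser exposes each entry with .id and .title attributes,
    # which is all the loop above needs
    feed = feedparser.parse("http://stackoverflow.com/feeds/tag?tagnames=python&sort=newest")
    return feed.entries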
