Parsing complex and changing JSON data in Python, several levels deep


I am trying to parse JSON data that is a bit complex and changes with each iteration.
The JSON data is parsed inside a loop, so each time the loop runs the data is different. Right now I'm focused on the education data.
THE JSON DATA:
First one might look like this:
{u'gender': u'female', u'id': u'15394'}
Next one might be:
{
u'gender': u'male', u'birthday': u'12/10/1983', u'location': {u'id': '12', u'name': u'Mexico City, Mexico'}, u'hometown': {u'id': u'19', u'name': u'Mexico City, Mexico'},
u'education': [
{
u'school': {u'id': u'22', u'name': u'Institut Saint Dominique de Rome'},
u'type': u'High School',
u'year': {u'id': u'33', u'name': u'2002'}
},
{
u'school': {u'id': u'44', u'name': u'Instituto Cumbres'},
u'type': u'High School',
u'year': {u'id': u'55', u'name': u'1999'}
},
{
u'school': {u'id': u'66', u'name': u'Chantemerle International School'},
u'type': u'High School',
u'year': {u'id': u'77', u'name': u'1998'}
},
{
u'school': {u'id': u'88', u'name': u'Columbia University'},
u'type': u'College',
u'concentration':
[{u'id': u'91', u'name': u'Economics'},
{u'id': u'92', u'name': u'Film Studies'}]
}
],
u'id': u'100384'}
I am trying to return all the values for school name, school id and school type, so essentially I want [education][school][id], [education][school][name], [education][school][type] in one line. However, every person has a different number of schools listed and different types of schools or no schools at all. I want to return each school with its associated name, id and type on a new line within my existing loop.
IDEAL OUTPUT:
1 34 Boston Latin School High School
1 26 Harvard University College
1 22 University of Michigan Graduate School
The one in this case refers to a friend_id, which I have already set up to append to the list as the first item in each loop.
I've tried:
friend_data = response.read()
friend_json = json.loads(friend_data)
#This below is inside a loop pulling data for each friend:
try:
    for school_id in friend_json['education']:
        school_id = school_id['school']['id']
        friendedu.append(school_id)
    for school_name in friend_json['education']:
        school_name = school_name['school']['name']
        friendedu.append(school_name)
    for school_type in friend_json['education']:
        school_type = school_type['type']
        friendedu.append(school_type)
except:
    school_id = "NULL"
print friendedu
writer.writerow(friendedu)
CURRENT OUTPUT:
[u'22', u'44', u'66', u'88', u'Institut Saint Dominique de Rome', u'Instituto Cumbres', u'Chantemerle International School', u'Columbia University', u'High School', u'High School', u'High School', u'College']
This output is just a flat list of the values it has pulled; instead, I'm trying to organize the output as shown above. I think another for loop may be called for, since I want each school for one person to be on its own line. Right now, all the education info for one person is appended to a single friendedu list and written as one row. I want each education item on a new line, and then to move on to the next person and continue writing rows for that person.

How about this:
friend_data = response.read()
friend_json = json.loads(friend_data)
if 'education' in friend_json:
    for entry in friend_json['education']:
        friendedu = []
        try:
            friendedu.append(entry['school']['id'])
            friendedu.append(entry['school']['name'])
            # 'type' sits next to 'school' in each entry, not inside it
            friendedu.append(entry['type'])
        except KeyError:
            friendedu.append('School ID, NAME, or type not found')
        print(" ".join(friendedu))

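import csv
import json
import requests
def student_schools(student, default=None):
    # Yield one (id, name, type) tuple per education entry.
    # 'id' and 'name' live under 'school'; 'type' sits on the entry itself.
    for entry in student.get("education", []):
        school = entry.get("school", {})
        yield (school.get("id", default), school.get("name", default), entry.get("type", default))
def main():
    # STUDENT_URL and OUTPUT are placeholders you need to define yourself
    res = requests.get(STUDENT_URL).content
    students = json.loads(res)
    with open(OUTPUT, "wb") as outf:   # Python 2; on Python 3 use open(OUTPUT, "w", newline="")
        outcsv = csv.writer(outf)
        for student in students["results"]:   # or whatever the root label is
            outcsv.writerows(student_schools(student))
if __name__=="__main__":
    main()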

You certainly don't need more for loops.
One will do:
friendedu = []
for entry in friend_json['education']:
    friendedu.append("{id} {name} {type}".format(
        id=entry['school']['id'],
        name=entry['school']['name'],
        type=entry['type']))
Or a list comprehension:
friendedu = ["{id} {name} {type}".format(
    id=entry['school']['id'],
    name=entry['school']['name'],
    type=entry['type']) for entry in friend_json['education']]
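To tie the answers back to the question's "one row per school" goal, here is a minimal sketch, assuming Python 3 and a trimmed copy of the question's sample data. The helper name `school_rows` and the `"NULL"` default are illustrative choices, not part of the original code; `io.StringIO` stands in for a real output file.

```python
import csv
import io

def school_rows(friend_id, friend_json, default="NULL"):
    """Return one [friend_id, school_id, school_name, school_type] row per school."""
    rows = []
    # .get() with a default covers friends that have no 'education' key at all.
    for entry in friend_json.get("education", []):
        school = entry.get("school", {})
        rows.append([
            friend_id,
            school.get("id", default),
            school.get("name", default),
            entry.get("type", default),  # 'type' is a sibling of 'school'
        ])
    return rows

# Sample data trimmed from the question.
friends = [
    {"gender": "female", "id": "15394"},
    {"id": "100384", "education": [
        {"school": {"id": "22", "name": "Institut Saint Dominique de Rome"},
         "type": "High School"},
        {"school": {"id": "88", "name": "Columbia University"},
         "type": "College"},
    ]},
]

out = io.StringIO()  # stand-in for an open CSV file
writer = csv.writer(out)
for friend_id, friend_json in enumerate(friends, start=1):
    writer.writerows(school_rows(friend_id, friend_json))
print(out.getvalue())
```

A friend without any schools simply produces no rows, so the outer loop never needs a try/except.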

Related

Joining Keywords in a list NYT

A keyword list is returned for each article. We want to collect all the entries under the 'value' key into a list, as shown below. I would like to strip out the 'u' from the list before I do the append. Then we want to count how many words the two lists have in common and return the result.
Example lists returned from dic['keywords']:
Article one returns:
[
{
u'value': u'Dunford, Joseph F Jr',
u'name': u'persons',
u'rank': u'1'
},
{
u'value': u'Afghanistan',
u'name': u'glocations',
u'rank': u'1'
},
{
u'value': u'Afghan National Police',
u'name': u'organizations',
u'rank': u'1'
},
{
u'value': u'Afghanistan War (2001- )',
u'name': u'subject',
u'rank': u'1'
},
{
u'value': u'Defense and Military Forces',
u'name': u'subject',
u'rank': u'2'
}
]
Article two returns:
[
{
u'value': u'Gall, Carlotta',
u'name': u'persons',
u'rank': u'1'
},
{
u'value': u'Gannon, Kathy',
u'name': u'persons',
u'rank': u'2'
},
{
u'value': u'Niedringhaus, Anja (1965-2014)',
u'name': u'persons',
u'rank': u'3'
},
{
u'value': u'Kabul (Afghanistan)',
u'name': u'glocations',
u'rank': u'2'
},
{
u'value': u'Afghanistan',
u'name': u'glocations',
u'rank': u'1'
},
{
u'value': u'Afghan National Police',
u'name': u'organizations',
u'rank': u'1'
},
{
u'value': u'Afghanistan War (2001- )',
u'name': u'subject',
u'rank': u'1'
}
]
Desired Output:
List1 = ['Dunford, Joseph F Jr', 'Afghanistan', 'Afghan National Police', 'Afghanistan War (2001- )', 'Defense and Military Forces']
List2 = ['Gall, Carlotta', 'Gannon, Kathy', 'Niedringhaus, Anja (1965-2014)', 'Afghanistan']
Keywords in common: 2
My Code is as follows:
from flask import Flask, render_template, request, session, g, redirect, url_for
from nytimesarticle import articleAPI
api = articleAPI('X')
articles = api.search( q = 'Afghan War',
     fq = {'headline':'', 'source':['Reuters','AP', 'The New York Times']},
     begin_date = 20111231 )
def parse_articles(articles):
    '''
    This function takes in a response to the NYT api and parses
    the articles into a list of dictionaries
    '''
    news = []
    for i in articles['response']['docs']:
        dic = {}
        dic['id'] = i['_id']
        if i['abstract'] is not None:
            dic['abstract'] = i['abstract'].encode("utf8")
        dic['headline'] = i['headline']['main'].encode("utf8")
        dic['desk'] = i['news_desk']
        dic['date'] = i['pub_date'][0:10] # cutting time of day.
        dic['section'] = i['section_name']
        dic['keywords'] = i['keywords']
        print dic['keywords']
        if i['snippet'] is not None:
            dic['snippet'] = i['snippet'].encode("utf8")
        dic['source'] = i['source']
        dic['type'] = i['type_of_material']
        dic['url'] = i['web_url']
        dic['word_count'] = i['word_count']
        # locations
        locations = []
        for x in range(0,len(i['keywords'])):
            if 'glocations' in i['keywords'][x]['name']:
                locations.append(i['keywords'][x]['value'])
        dic['locations'] = locations
        # subject
        subjects = []
        for x in range(0,len(i['keywords'])):
            if 'subject' in i['keywords'][x]['name']:
                subjects.append(i['keywords'][x]['value'])
        dic['subjects'] = subjects
        news.append(dic)
    return(news)
print(parse_articles(articles))

You can use list comprehension to build lists from the given dict.
d = [{u'value': u'Dunford, Joseph F Jr', u'name': u'persons', u'rank': u'1'}, {u'value': u'Afghanistan', u'name': u'glocations', u'rank': u'1'}, {u'value': u'Afghan National Police', u'name': u'organizations', u'rank': u'1'}, {u'value': u'Afghanistan War (2001- )', u'name': u'subject', u'rank': u'1'}, {u'value': u'Defense and Military Forces', u'name': u'subject', u'rank': u'2'}]
print [v['value'] for v in d] # prints [u'Dunford, Joseph F Jr', u'Afghanistan', u'Afghan National Police', u'Afghanistan War (2001- )', u'Defense and Military Forces']
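The comprehension covers the list-building step; the comparison step can be sketched with a set intersection. This is a minimal example assuming Python 3, using the two keyword lists from the question; `keyword_values` is a hypothetical helper name. Note that the `u` prefix is only how Python 2 displays unicode strings when printing a list; it is not part of the string, so there is nothing to strip.

```python
def keyword_values(keywords):
    """Pull the 'value' field out of each keyword dict."""
    return [kw["value"] for kw in keywords]

# The two keyword lists from the question.
article_one = [
    {"value": "Dunford, Joseph F Jr", "name": "persons", "rank": "1"},
    {"value": "Afghanistan", "name": "glocations", "rank": "1"},
    {"value": "Afghan National Police", "name": "organizations", "rank": "1"},
    {"value": "Afghanistan War (2001- )", "name": "subject", "rank": "1"},
    {"value": "Defense and Military Forces", "name": "subject", "rank": "2"},
]
article_two = [
    {"value": "Gall, Carlotta", "name": "persons", "rank": "1"},
    {"value": "Gannon, Kathy", "name": "persons", "rank": "2"},
    {"value": "Niedringhaus, Anja (1965-2014)", "name": "persons", "rank": "3"},
    {"value": "Kabul (Afghanistan)", "name": "glocations", "rank": "2"},
    {"value": "Afghanistan", "name": "glocations", "rank": "1"},
    {"value": "Afghan National Police", "name": "organizations", "rank": "1"},
    {"value": "Afghanistan War (2001- )", "name": "subject", "rank": "1"},
]

list1 = keyword_values(article_one)
list2 = keyword_values(article_two)

# Keywords present in both lists, compared as exact strings.
common = set(list1) & set(list2)
print("Keywords in common:", len(common))
```

The comparison is by exact string match, so 'Kabul (Afghanistan)' and 'Afghanistan' count as different keywords.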

Paginate and collect posts and comments from Facebook (Python, Json) [duplicate]

I'm trying to pull posts and comments from Facebook for a project, but can't seem to pull everything. It seems that I only get two links, previous and next.
Here's my code:
import facebook
import requests
def some_action(post):
    #print posts['data']
    print(post['created_time'])
access_token = 'INSERT ACCESS TOKEN HERE'
user = 'walkers'
graph = facebook.GraphAPI(access_token)
profile = graph.get_object(user)
posts = graph.get_connections(profile['id'], 'posts')
x = 0
while x < 900000:
    #while True:
    try:
        posts = requests.get(posts['paging']['next']).json()
        print (posts)
    except KeyError:
        break
    x = x+1
My results are something like this:
{u'paging': {u'next': u' https://graph.facebook.com/v2.0/53198517648/posts?limit=25&__paging_token=enc_AdDr64IO8892JzsoPWiKkMDcF4lTDosOcP0H0ZB1mIpIW5EYRrCylZAji6ZBSCVBAVUYiS80oNtWtAL9GazXxRf0yva&access_token=INSERT ACCESS TOKEN HERE, u'previous': u' https://graph.facebook.com/v2.0/53198517648/posts?limit=25&__paging_token=enc_AdCqTjKBhfOsNBoKe3CJbnM2gU2RvLEYLgAQt1pHEcERVeK4qiw1dQAFHjt2sSInSZAIioZCqotwLMx8azzfZClIuCN&since=1430997206&access_token=INSERT ACCESS TOKEN HERE'}, u'data': [{u'picture': u'https://scontent.xx.fbcdn.net/hphotos-xfp1/v/t1.0-9/p130x130/11200605_10153202855322649_2472649283560371030_n.jpg?oh=50a0b3998e7bae8bb10c3a5f0854af46&oe=56556974', u'story': u'Walkers added 8 new photos to the album: 20 Years of Gary.', u'likes': {u'paging': {u'cursors': {u'after': u'MzgxNzk1MDQ4NjQ3MTI3', u'before': u'MTAxNTI0OTY1MDIwNDIyODI='}}, u'data': [{u'id': u'10152496502042282', u'name': u'Aaron Hanson'}, {u'id': u'10203040513950876', u'name': u'Gary GazRico Hinchliffe'}, {u'id': u'10152934096109345', u'name': u'Stuart Collister'}, {u'id': u'10152297022606059', u'name': u'Helen Preston'}, {u'id': u'326285380900188', u'name': u'Rhys Edwards'}, {u'id': u'10204744346589601', u'name': u'Aaron Benfield'}, {u'id': u'10200910780691953', u'name': u'Mike S Q Wilkins'}, {u'id': u'10204902354187051', u'name': u'Paul Owen Davies'}, {u'id': u'10152784755311784', u'name': u'Dafydd Ifan'}, {u'id': u'1517704468487365', u'name': u'Stephen Collier'}, {u'id': u'10202198826115234', u'name': u'John McKellar'}, {u'id': u'10151949129487143', u'name': u'Lucy Morrison'}, {u'id': u'1474199509524133', u'name': u'Christine Leek'}, {u'id': u'381795048647127', u'name': u'Sandra Taylor'}]}, u'from': {u'category': u'Product/Service', u'name':
These are the two links with the missing information:
u'https://graph.facebook.com/v2.0/53198517648/posts?limit=25&__paging_token=enc_AdDr64IO8892JzsoPWiKkMDcF4lTDosOcP0H0ZB1mIpIW5EYRrCylZAji6ZBSCVBAVUYiS80oNtWtAL9GazXxRf0yva&access_token=INSERT ACCESS TOKEN HERE',
u'previous': u'https://graph.facebook.com/v2.0/53198517648/posts?limit=25&__paging_token=enc_AdCqTjKBhfOsNBoKe3CJbnM2gU2RvLEYLgAQt1pHEcERVeK4qiw1dQAFHjt2sSInSZAIioZCqotwLMx8azzfZClIuCN&since=1430997206&access_token=INSERT ACCESS TOKEN HERE'
Obviously where I've put "INSERT ACCESS TOKEN HERE" I've removed the access token. Is there any way of getting all the data?
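One thing visible in the loop above is that `posts` is overwritten on every iteration and nothing is accumulated, so the `data` of each page is printed but never kept. A minimal sketch of collecting every page follows, assuming each page is a dict with a `data` list and an optional `paging.next` URL, as in the output shown. `collect_all` and `fetch_json` are hypothetical names; in real use `fetch_json` could be something like `lambda url: requests.get(url).json()`, and here a fake lookup table stands in for the network.

```python
def collect_all(first_page, fetch_json, max_pages=1000):
    """Gather 'data' items from first_page and every page reachable via paging.next."""
    items = []
    page = first_page
    for _ in range(max_pages):  # hard cap so a bad 'next' link cannot loop forever
        items.extend(page.get("data", []))
        next_url = page.get("paging", {}).get("next")
        if not next_url:
            break  # last page: no 'next' link
        page = fetch_json(next_url)
    return items

# Fake pages keyed by "URL", standing in for real Graph API responses.
fake_pages = {
    "page2": {"data": [{"id": "c"}], "paging": {"next": "page3", "previous": "page1"}},
    "page3": {"data": [{"id": "d"}], "paging": {"previous": "page2"}},
}
first = {"data": [{"id": "a"}, {"id": "b"}], "paging": {"next": "page2"}}

all_posts = collect_all(first, fake_pages.__getitem__)
print([post["id"] for post in all_posts])
```

Passing the fetch function in as a parameter keeps the paging logic testable without an access token.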

Python and Vend JSON Queries

Just trying to make some sense of the JSON outputs I'm getting from the Vend JSON API:
Item pagination
Item {u'pages': 10, u'results': 487, u'page_size': 50, u'page': 1}
Item customers
Item [{u'custom_field_3': u'', u'customer_code': u'WALKIN', u'custom_field_1': u'', u'balance': u'0', u'customer_group_id': u'xxx', u'custom_field_2': u'',
Is an example.
I'm trying to isolate a number of fields, such as 'customer_code', from the JSON output, but haven't worked it out quite yet.
My code:
response = requests.get('http://xxx.vendhq.com/api/customers',
                        auth=('xxx', 'yyy'))
data = response.json()
for item in data.items():
    print 'Item', item[0]
    print 'Item', item[1]
If I could "walk" across the JSON output, isolating the fields that would be pertinent, that would be really good code.

According to the output and the given code, the structure of the data is:
{
    'pagination': {u'pages': 10, u'results': 487, u'page_size': 50, u'page': 1},
    'customers': [{u'custom_field_3': u'', u'customer_code': u'WALKIN',
                   u'custom_field_1': u'', u'balance': u'0',
                   u'customer_group_id': u'xxx', u'custom_field_2': u'', ...]
}
To get the list of customer_code values, access the dict entry with the key customers and iterate over it:
customer_codes = [customer['customer_code'] for customer in data['customers']]
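The same idea extends to isolating several fields at once. Below is a minimal sketch, assuming the structure described above; `pick_fields` is a hypothetical helper, and the second customer record is invented purely to show what happens when a field is missing.

```python
def pick_fields(records, fields, default=None):
    """Keep only the named fields of each record, filling gaps with default."""
    return [{field: record.get(field, default) for field in fields}
            for record in records]

# Structure modelled on the output in the question.
data = {
    "pagination": {"pages": 10, "results": 487, "page_size": 50, "page": 1},
    "customers": [
        {"customer_code": "WALKIN", "balance": "0", "customer_group_id": "xxx"},
        {"customer_code": "JDOE", "balance": "12.50"},  # hypothetical record
    ],
}

wanted = ["customer_code", "balance", "customer_group_id"]
for row in pick_fields(data["customers"], wanted):
    print(row)
```

Using `.get()` rather than indexing means a record that lacks one of the fields yields the default instead of raising a KeyError.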

Entrez epost + elink returns results out of order with Biopython

I ran into this today and wanted to toss it out there. It appears that, using the Biopython interface to Entrez at NCBI, it's not possible to get results back (at least from elink) in the correct (same as input) order. Please see the code below for an example. I have thousands of GIs for which I need to get taxonomy information, and querying them individually is painfully slow due to NCBI restrictions.
from Bio import Entrez
Entrez.email = "my#email.com"
ids = ["148908191", "297793721", "48525513", "507118461"]
search_results = Entrez.read(Entrez.epost("protein", id=','.join(ids)))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
print Entrez.read(Entrez.elink(webenv=webenv,
                               query_key=query_key,
                               dbfrom="protein",
                               db="taxonomy"))
print "-------"
for i in ids:
    search_results = Entrez.read(Entrez.epost("protein", id=i))
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    print Entrez.read(Entrez.elink(webenv=webenv,
                                   query_key=query_key,
                                   dbfrom="protein",
                                   db="taxonomy"))
Results:
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}, {u'Id': '81972'}, {u'Id': '32630'}, {u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191', '297793721', '48525513', '507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]
-------
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '81972'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['297793721'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['48525513'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '32630'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]
The elink documentation (http://www.ncbi.nlm.nih.gov/books/NBK25499/) at NCBI says this should be possible,
but only by passing multiple 'id=' parameters, which doesn't appear to be possible with the Biopython epost interface. Has anyone else seen this, or am I missing something obvious?
Thanks!

from Bio import Entrez
Entrez.email = "my#email.com"
ids = ["148908191", "297793721", "48525513", "507118461"]
search_results = Entrez.read(Entrez.epost("protein", id=','.join(ids)))
xml = Entrez.efetch("protein",
                    query_key=search_results["QueryKey"],
                    WebEnv=search_results["WebEnv"],
                    rettype="gp",
                    retmode="xml")
for record in Entrez.read(xml):
    print [x[3:] for x in record["GBSeq_other-seqids"] if x.startswith("gi")]
    gb_quals = record["GBSeq_feature-table"][0]["GBFeature_quals"]
    for qualifier in gb_quals:
        if qualifier["GBQualifier_name"] == "db_xref":
            print qualifier["GBQualifier_value"]
    # Or with list comprehension
    # print [q["GBQualifier_value"] for q in
    #        record["GBSeq_feature-table"][0]["GBFeature_quals"] if
    #        q["GBQualifier_name"] == "db_xref"]
xml.close()
I efetch the query and then parse the XML with Entrez.read(). This is where things get messy, because you have to dig through the nested dicts and lists. I guess there's a nicer way than mine to extract the "GBFeature_quals" entries where "GBQualifier_name" is "db_xref", but this works for now. Output:
['148908191']
taxon:3332
['297793721']
taxon:81972
['48525513']
taxon:211604
['507118461']
taxon:32630
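Because each fetched record carries its own GI, the pairs can be put into a dict and then read back in the input order. The sketch below assumes records shaped like the parsed efetch output used above; the two sample records are hand-written stand-ins (the "organism" qualifier value is invented), not real Entrez responses, and `gi_to_taxon` is a hypothetical helper name.

```python
def gi_to_taxon(records):
    """Map each record's GI number to its taxon ID, based on the db_xref qualifier."""
    mapping = {}
    for record in records:
        gis = [x[3:] for x in record["GBSeq_other-seqids"] if x.startswith("gi|")]
        quals = record["GBSeq_feature-table"][0]["GBFeature_quals"]
        taxa = [q["GBQualifier_value"][len("taxon:"):]
                for q in quals
                if q["GBQualifier_name"] == "db_xref"
                and q["GBQualifier_value"].startswith("taxon:")]
        if gis and taxa:
            mapping[gis[0]] = taxa[0]
    return mapping

# Hand-written stand-ins for two parsed records, deliberately out of input order.
records = [
    {"GBSeq_other-seqids": ["gi|297793721"],
     "GBSeq_feature-table": [{"GBFeature_quals": [
         {"GBQualifier_name": "organism", "GBQualifier_value": "placeholder"},
         {"GBQualifier_name": "db_xref", "GBQualifier_value": "taxon:81972"}]}]},
    {"GBSeq_other-seqids": ["gi|148908191"],
     "GBSeq_feature-table": [{"GBFeature_quals": [
         {"GBQualifier_name": "db_xref", "GBQualifier_value": "taxon:3332"}]}]},
]

ids = ["148908191", "297793721"]
mapping = gi_to_taxon(records)
# Look results up by GI, so the order returned by the server no longer matters.
ordered = [(gi, mapping.get(gi)) for gi in ids]
print(ordered)
```

Any GI missing from the response comes back paired with None, which makes gaps easy to spot.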

Traversing dictionary of list

I am new to python and having trouble traversing this type of structure. Any help would be appreciated.
I would like to print "name" field from each book type:
[
{ 'books': [ { u'published': 1957,
u'name': u'The Cat In The Hat'},
{ u'published': 1947,
u'name': u'Goodnight Moon'},
{ u'published': 1964,
u'name': u'The Giving Tree'}],
'type': u'Kids'
}
{ 'books': [ { u'published': 1954,
u'name': u'The Lord Of The Rings'},
{ u'published': 2008,
u'name': u'The Hunger Games'}],
'type': u'Adventure'
}
]
Here is the code I have that is not working:
for books in d:
    book = books['book']
    for name in files.iteritems():
        print name

Your first issue is that d is a string - lose those quotes around the { } braces.
Then for books in d gives you each key in the dictionary as a variable books. Those keys are books and type.
You then have another syntax error - the dictionary ends (}), and you then open 'another', without declaring it as a new variable.
I expect you wanted the dictionary something more like:
books = {
    'Kids': [
        {'published': 1957, 'name': u'The Cat in the hat'},
        {'published': 1964, 'name': u'The Giving Tree'},
        {'published': 1947, 'name': u'Goodnight Moon'}
    ],
    'Adventure': []
}
And then you can do something like:
for bookType in books:
    for book in books[bookType]:
        print book['name']
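If you would rather keep the original layout (a list of dicts, each with a 'books' list and a 'type'), it can also be traversed directly once the two dicts are separated by a comma. This is a minimal sketch assuming Python 3; `shelves` and `names_by_type` are illustrative names.

```python
# The structure from the question, with the missing comma between the dicts added.
shelves = [
    {"books": [{"published": 1957, "name": "The Cat In The Hat"},
               {"published": 1947, "name": "Goodnight Moon"},
               {"published": 1964, "name": "The Giving Tree"}],
     "type": "Kids"},
    {"books": [{"published": 1954, "name": "The Lord Of The Rings"},
               {"published": 2008, "name": "The Hunger Games"}],
     "type": "Adventure"},
]

def names_by_type(shelves):
    """Map each book type to the list of book names under it."""
    return {shelf["type"]: [book["name"] for book in shelf["books"]]
            for shelf in shelves}

for book_type, names in names_by_type(shelves).items():
    for name in names:
        print(book_type, name)
```

The outer loop walks the list, and each element is a dict, so `shelf["books"]` gives the inner list to loop over.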
