How to handle a JSON that returns a list of dict-like objects in Pandas? - python

I am using an API from collegefootballdata.com to get data on scores and betting lines. I want to use the betting lines to infer expected win % and then compare that to actual results (I feel like my team loses too many games where we are big favorites, and I want to test that). This code retrieves one game for example purposes:
import requests

parameters = {
    "gameId": 401112435,
    "year": 2019
}
response = requests.get("https://api.collegefootballdata.com/lines", params=parameters)
The JSON output is this:
[
    {
        "awayConference": "ACC",
        "awayScore": 28,
        "awayTeam": "Virginia Tech",
        "homeConference": "ACC",
        "homeScore": 35,
        "homeTeam": "Boston College",
        "id": 401112435,
        "lines": [
            {
                "formattedSpread": "Virginia Tech -4.5",
                "overUnder": "57.5",
                "provider": "consensus",
                "spread": "4.5"
            },
            {
                "formattedSpread": "Virginia Tech -4.5",
                "overUnder": "57",
                "provider": "Caesars",
                "spread": "4.5"
            },
            {
                "formattedSpread": "Virginia Tech -4.5",
                "overUnder": "58",
                "provider": "numberfire",
                "spread": "4.5"
            },
            {
                "formattedSpread": "Virginia Tech -4.5",
                "overUnder": "56.5",
                "provider": "teamrankings",
                "spread": "4.5"
            }
        ],
        "season": 2019,
        "seasonType": "regular",
        "week": 1
    }
]
I'm then loading into a pandas dataframe with:
import json
import pandas as pd

def jstring(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    return text

json_str = jstring(response.json())
df = pd.read_json(json_str)
This creates a dataframe with a "lines" column that contains the entire lines section of the JSON as a string. Ultimately, I want to use the "spread" value in the block where "provider" = "consensus". Everything else is extraneous for my purposes. I've tried exploding the column with
df = df.explode('lines')
which gives me 4 rows with something like this for each game (as expected):
{'formattedSpread': 'Virginia Tech -4.5', 'overUnder': '57.5', 'provider': 'consensus', 'spread': '4.5'}
Here is where I'm stuck. I want to keep only the rows where 'provider' = 'consensus', and further I need to have 'spread' to use as a separate variable/column in my analysis. I've tried exploding a 2nd time, df.split, df.replace to change { to [ and explode as a list, all to no avail. Any help is appreciated!!

This is probably what you're looking for -
EDIT: Handling the special case where a game has no lines.
import pandas as pd
import requests

params = {
    "gameId": 401112435,
    "year": 2019,
}
r = requests.get("https://api.collegefootballdata.com/lines", params=params)

df = pd.DataFrame(r.json())     # DataFrame with a "lines" column that holds the list of line dicts
df = df.explode('lines')        # explode so that each line gets its own row
df = df.reset_index(drop=True)  # after exploding, rows share an index; reset it so the concat below aligns cleanly

def fill_na_lines(lines):
    # games with no lines come through as NaN; replace them with an all-None dict
    if pd.isna(lines):
        return {k: None for k in ['provider', 'spread', 'formattedSpread', 'overUnder']}
    return lines

df.lines = df.lines.apply(fill_na_lines)

lines_df = pd.DataFrame(df.lines.tolist())  # a separate DataFrame built from the line dicts
df = pd.concat([df, lines_df], axis=1)      # concatenate the two DataFrames column-wise (axis=1)

# Now you can filter down to whichever rows you need.
df = df[df.provider == 'consensus']
The documentation on joining DataFrames in different ways is probably useful.
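As an alternative, if every game in the response has a lines key, pd.json_normalize (top-level in pandas >= 1.0) can flatten the nested list in one call. A minimal sketch under those assumptions (a game without a lines key would raise a KeyError here):
import pandas as pd
import requests

r = requests.get("https://api.collegefootballdata.com/lines",
                 params={"gameId": 401112435, "year": 2019})

# one row per (game, line) pair; meta columns are copied down from the parent game record
flat = pd.json_normalize(
    r.json(),
    record_path='lines',
    meta=['id', 'homeTeam', 'awayTeam', 'homeScore', 'awayScore'],
)
consensus = flat[flat.provider == 'consensus']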

Related

How can I organize JSON data from pandas dataframe

I can't figure out how to correctly organize the JSON data that is created from my pandas dataframe. This is my code:
with open(spreadsheetName, 'rb') as spreadsheet:
    newSheet = spreadsheet.read()
    newSheet = pd.read_excel(newSheet)
    exportSheet = newSheet.to_json('file.json', orient='index')
And I'd like for the JSON data to look something like
{
    "cars": [
        {
            "Model": "Camry",
            "Year": "2015"
        },
        {
            "Model": "Model S",
            "Year": "2018"
        }
    ]
}
But instead I'm getting a single line of JSON data from the code I have. Any ideas on how I can make it so that each row is a JSON object with its own keys and values from the column headers (like Model and Year)?
Set the indent argument to the desired value in the to_json call. (Note that orient='index' keys each record by its row index; for a plain list of row objects like your example, orient='records' is closer to what you want.)
exportSheet = newSheet.to_json('file.json', orient='index', indent=4)
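to_json alone won't add the wrapping "cars" key from your example, so if you need that exact shape, one hedged sketch (assuming the sheet's columns are Model and Year) is to build the dict yourself and dump it with the json module:
import json
import pandas as pd

df = pd.read_excel(spreadsheetName)     # read_excel accepts the path directly
records = df.to_dict(orient='records')  # one {column: value} dict per row

with open('file.json', 'w') as f:
    json.dump({"cars": records}, f, indent=4)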

How can I filter API GET Request on multiple variables?

I am really struggling with this one. I'm new to Python and I'm trying to extract data from an API.
I have managed to run the script below, but I need to amend it to filter on multiple values for one column, let's say England and Scotland. Is there an equivalent to the SQL IN operator, e.g. Area_Name IN ('England','Scotland')?
from requests import get
from json import dumps

ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"
AREA_TYPE = "nation"
AREA_NAME = "england"

filters = [
    f"areaType={ AREA_TYPE }",
    f"areaName={ AREA_NAME }"
]

structure = {
    "date": "date",
    "name": "areaName",
    "code": "areaCode",
    "dailyCases": "newCasesByPublishDate",
}

api_params = {
    "filters": str.join(";", filters),
    "structure": dumps(structure, separators=(",", ":")),
    "latestBy": "cumCasesByPublishDate"
}

formats = [
    "json",
    "xml",
    "csv"
]

for fmt in formats:
    api_params["format"] = fmt
    response = get(ENDPOINT, params=api_params, timeout=10)
    assert response.status_code == 200, f"Failed request for {fmt}: {response.text}"
    print(f"{fmt} data:")
    print(response.content.decode())
I have tried the script, and a dict is the easiest type to handle in this case.
Given your JSON output, parsed into a Python dict (pretty-printed here, with JSON null as Python None):
data = {
    "length": 1,
    "maxPageLimit": 1,
    "data": [
        {
            "date": "2020-09-17",
            "name": "England",
            "code": "E92000001",
            "dailyCases": 2788
        }
    ],
    "pagination": {
        "current": "/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1",
        "next": None,
        "previous": None,
        "first": "/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1",
        "last": "/v1/data?filters=areaType%3Dnation%3BareaName%3Dengland&structure=%7B%22date%22%3A%22date%22%2C%22name%22%3A%22areaName%22%2C%22code%22%3A%22areaCode%22%2C%22dailyCases%22%3A%22newCasesByPublishDate%22%7D&latestBy=cumCasesByPublishDate&format=json&page=1"
    }
}
You can try something like this:
countries = ['England', 'France', 'Whatever']
filtered = [row for row in data['data'] if row['name'] in countries]
I presume the data list is the only interesting key in the data dict, since the others do not hold any meaningful values.
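If you load the result into pandas anyway, Series.isin is the direct counterpart of SQL's IN. A small sketch over the parsed response (assuming data holds the dict shown above):
import pandas as pd

df = pd.DataFrame(data['data'])
# keep only rows whose name appears in the list -- pandas' equivalent of SQL IN
df = df[df['name'].isin(['England', 'Scotland'])]
print(df)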

Append item to Mongo Array

I have a mongodb document I am trying to update. This answer was helpful, but every time I insert into the database, the data is inserted as an array inside of the array whereas I just want to insert the object directly into the array.
Here is what I am doing.
# My function to update the array
def append_site(gml_id, new_site):
    col.update_one({'gml_id': gml_id}, {'$push': {'websites': new_site}}, upsert=True)

# My DataFrame
data = {'name': ['ABC'],
        'gml_id': ['f9395e09'],
        'url': ['ABC.com']
        }
df = pd.DataFrame(data)

# Grouping data for upsert
df = df.groupby(['gml_id']).apply(lambda x: x[['name', 'url']].to_dict('r')).reset_index().rename(columns={0: 'websites'})

# Apply function to every row
df.apply(lambda row: append_site(row['gml_id'], row['websites']), axis=1)
Here is the outcome:
{
    "gml_id": "f9395e09",
    "websites": [
        {
            "name": "XYZ.com",
            "url": "...xyz.com"
        },
        [
            {
                "name": "ABC.com",
                "url": "...abc.com"
            }
        ]
    ]
}
Here is the goal:
{
    "gml_id": "f9395e09",
    "websites": [
        {
            "name": "XYZ.com",
            "url": "...xyz.com"
        },
        {
            "name": "ABC.com",
            "url": "...abc.com"
        }
    ]
}
Your issue is that the websites array is being appended with a list object rather than a dict, i.e. new_site is a list.
This is a little speculative, but since your groupby's to_dict('r') produces a one-element list of records, you could try changing this line and seeing if it gives the effect you need:
col.update_one({'gml_id': gml_id}, {'$push': {'websites': new_site[0]}}, upsert=True)
Alternatively, make sure you are passing a dict object to the function.
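If new_site can legitimately arrive as a list of several site dicts, MongoDB's $each modifier pushes each element as its own array entry instead of nesting the whole list. A sketch, assuming a pymongo collection col:
def append_sites(gml_id, new_sites):
    # $each unpacks new_sites so every dict becomes its own element of the websites array
    col.update_one(
        {'gml_id': gml_id},
        {'$push': {'websites': {'$each': new_sites}}},
        upsert=True,
    )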
Instead of doing an unnecessary groupby, I decided to leave the dataframe flat and then adjust the function like this:
def append_site(gml_id, name, url):
    col.update_one({'gml_id': gml_id}, {'$push': {'websites': {'name': name, 'url': url}}}, upsert=True)
I now call it like this: df.apply(lambda row: append_site(row['gml_id'], row['name'], row['url']), axis=1)
Works perfectly fine.

How to read this JSON into dataframe with specific dataframe format

This is my JSON string; I want to read it into a dataframe with one column per title (2009, 2010, longer_run) and one row per data field (actual, upper_end_of_central_tendency).
I have no idea what to do after pd.DataFrame(json.loads(data)).
JSON data, edited
{
    "data": [
        {
            "data": {
                "actual": "(0.2)",
                "upper_end_of_central_tendency": "-"
            },
            "title": "2009"
        },
        {
            "data": {
                "actual": "2.8",
                "upper_end_of_central_tendency": "-"
            },
            "title": "2010"
        },
        {
            "data": {
                "actual": "-",
                "upper_end_of_central_tendency": "2.3"
            },
            "title": "longer_run"
        }
    ],
    "schedule_id": "2014-03-19"
}
That's a somewhat overly nested JSON. But if that's what you have to work with, and assuming your parsed JSON is in jdata:
datapts = jdata['data']
rownames = ['actual', 'upper_end_of_central_tendency']
colnames = [ item['title'] for item in datapts ] + ['schedule_id' ]
sched_id = jdata['schedule_id']
rows = [ [item['data'][rn] for item in datapts ] + [sched_id] for rn in rownames]
df = pd.DataFrame(rows, index=rownames, columns=colnames)
df is now:
                                2009 2010 longer_run schedule_id
actual                         (0.2)  2.8          -  2014-03-19
upper_end_of_central_tendency      -    -        2.3  2014-03-19
If you wanted to simplify that a bit, you could construct the core data without the asymmetric schedule_id field, then add that after the fact:
datapts = jdata['data']
rownames = ['actual', 'upper_end_of_central_tendency']
colnames = [ item['title'] for item in datapts ]
rows = [ [item['data'][rn] for item in datapts ] for rn in rownames]
d2 = pd.DataFrame(rows, index=rownames, columns=colnames)
d2['schedule_id'] = jdata['schedule_id']
That will make an identical DataFrame (i.e. df == d2). It helps when learning pandas to try a few different construction strategies, and get a feel for what is more straightforward. There are more powerful tools for unfolding nested structures into flatter tables, but they're not as easy to understand first time out of the gate.
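For reference, pd.json_normalize (top-level in pandas >= 1.0) is one of those more powerful unfolding tools. A sketch under that assumption: it flattens the nested data dicts into prefixed columns, which are then transposed and renamed to match the table above.
import pandas as pd

flat = pd.json_normalize(jdata['data'])
# flat columns: title, data.actual, data.upper_end_of_central_tendency
d3 = flat.set_index('title').T
d3.index = d3.index.str.replace('data.', '', regex=False)  # strip the "data." prefix
d3['schedule_id'] = jdata['schedule_id']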
(Update) If you wanted a better structure for your JSON to make it easier to put into this format, ask pandas what it likes. E.g. df.to_json() output, slightly prettified:
{
    "2009": {
        "actual": "(0.2)",
        "upper_end_of_central_tendency": "-"
    },
    "2010": {
        "actual": "2.8",
        "upper_end_of_central_tendency": "-"
    },
    "longer_run": {
        "actual": "-",
        "upper_end_of_central_tendency": "2.3"
    },
    "schedule_id": {
        "actual": "2014-03-19",
        "upper_end_of_central_tendency": "2014-03-19"
    }
}
That is a format from which pandas' read_json function will immediately construct the DataFrame you desire.

Python - JSON to CSV table?

I was wondering how I could import a JSON file, and then save that to an ordered CSV file, with header row and the applicable data below.
Here's what the JSON file looks like:
[
    {
        "firstName": "Nicolas Alexis Julio",
        "lastName": "N'Koulou N'Doubena",
        "nickname": "N. N'Koulou",
        "nationality": "Cameroon",
        "age": 24
    },
    {
        "firstName": "Alexandre Dimitri",
        "lastName": "Song-Billong",
        "nickname": "A. Song",
        "nationality": "Cameroon",
        "age": 26,
        etc. etc. + } ]
Note there are multiple 'keys' (firstName, lastName, nickname, etc.). I would like to create a CSV file with those as the header, then the applicable info beneath in rows, with each row having a player's information.
Here's the script I have so far for Python:
import urllib2
import json
import csv

writefilerows = csv.writer(open('WCData_Rows.csv', "wb+"))
api_key = "xxxx"
url = "http://worldcup.kimonolabs.com/api/players?apikey=" + api_key + "&limit=1000"
json_obj = urllib2.urlopen(url)
readable_json = json.load(json_obj)
list_of_attributes = readable_json[0].keys()
print list_of_attributes
writefilerows.writerow(list_of_attributes)
for x in readable_json:
    writefilerows.writerow(x[list_of_attributes])
But when I run that, I get a "TypeError: unhashable type:'list'" error. I am still learning Python (obviously I suppose). I have looked around online (found this) and can't seem to figure out how to do it without explicitly stating what key I want to print...I don't want to have to list each one individually...
Thank you for any help/ideas! Please let me know if I can clarify or provide more information.
Your TypeError occurs because you are trying to index a dictionary x with a list, list_of_attributes, in x[list_of_attributes]. That is not how Python works. In this case you are iterating readable_json, which yields a dictionary on each iteration, so there is no need to pull values out of the data in order to write them out.
csv.DictWriter should give you what you're looking for:
import csv
[...]

def encode_dict(d, out_encoding="utf8"):
    '''Encode dictionary to desired encoding, assumes incoming data in unicode'''
    encoded_d = {}
    for k, v in d.iteritems():
        k = k.encode(out_encoding)
        v = unicode(v).encode(out_encoding)
        encoded_d[k] = v
    return encoded_d

list_of_attributes = readable_json[0].keys()
# sort fields in desired order
list_of_attributes.sort()

with open('WCData_Rows.csv', "wb+") as csv_out:
    writer = csv.DictWriter(csv_out, fieldnames=list_of_attributes)
    writer.writeheader()
    for data in readable_json:
        writer.writerow(encode_dict(data))
Note:
This assumes that each entry in readable_json has the same fields.
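If the entries can have different fields, one hedged workaround (not part of the original answer) is to build the field list from the union of all keys and let DictWriter pad missing values via restval:
# union of all keys across records, sorted for a stable header order
fieldnames = sorted(set().union(*(d.keys() for d in readable_json)))
writer = csv.DictWriter(csv_out, fieldnames=fieldnames, restval='')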
Maybe pandas could do this - but I never tried to read JSON:
import pandas as pd
df = pd.read_json( ... )
df.to_csv( ... )
pandas.DataFrame.to_csv
pandas.io.json.read_json
EDIT:
data = '''[
    {
        "firstName": "Nicolas Alexis Julio",
        "lastName": "N'Koulou N'Doubena",
        "nickname": "N. N'Koulou",
        "nationality": "Cameroon",
        "age": 24
    },
    {
        "firstName": "Alexandre Dimitri",
        "lastName": "Song-Billong",
        "nickname": "A. Song",
        "nationality": "Cameroon",
        "age": 26
    }
]'''

import pandas as pd

df = pd.read_json(data)
print df
df.to_csv('results.csv')
result:
age firstName lastName nationality nickname
0 24 Nicolas Alexis Julio N'Koulou N'Doubena Cameroon N. N'Koulou
1 26 Alexandre Dimitri Song-Billong Cameroon A. Song
With pandas you can save it as CSV, Excel, etc. (and maybe even write directly to a database).
You can also do some operations on the data in the table and show it as a graph.
