Python generator to pandas DataFrame

I have a generator being returned from:
data = public_client.get_product_trades(product_id='BTC-USD', limit=10)
How do I turn the data into a pandas DataFrame?
The method docstring reads:
"""{"Returns": [{
"time": "2014-11-07T22:19:28.578544Z",
"trade_id": 74,
"price": "10.00000000",
"size": "0.01000000",
"side": "buy"
}, {
"time": "2014-11-07T01:08:43.642366Z",
"trade_id": 73,
"price": "100.00000000",
"size": "0.01000000",
"side": "sell"
}]}"""
I have tried:
df = [x for x in data]
df = pd.DataFrame.from_records(df)
but it does not work, as I get the error:
AttributeError: 'str' object has no attribute 'keys'
When I print the above "x for x in data" I see the list of dicts, but the end looks strange. Could this be why?
print(list(data))
[{'time': '2020-12-30T13:04:14.385Z', 'trade_id': 116918468, 'price': '27853.82000000', 'size': '0.00171515', 'side': 'sell'}, {'time': '2020-12-30T12:31:24.185Z', 'trade_id': 116915675, 'price': '27683.70000000', 'size': '0.01683711', 'side': 'sell'}, 'message']
It looks to be a list of dicts but the end value is a single string 'message'.

Based on the updated question:
df = pd.DataFrame(list(data)[:-1])
Or, more cleanly:
df = pd.DataFrame([x for x in data if isinstance(x, dict)])
print(df)
time trade_id price size side
0 2020-12-30T13:04:14.385Z 116918468 27853.82000000 0.00171515 sell
1 2020-12-30T12:31:24.185Z 116915675 27683.70000000 0.01683711 sell
Oh, and BTW, you'll still need to change those strings into something usable...
So e.g.:
df['time'] = pd.to_datetime(df['time'])
for k in ['price', 'size']:
    df[k] = pd.to_numeric(df[k])
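Putting the pieces together, a minimal end-to-end sketch (reusing public_client from the question, and assuming, as the print above shows, that the generator yields dict records plus a stray trailing string):

import pandas as pd

data = public_client.get_product_trades(product_id='BTC-USD', limit=10)
# keep only the dict records, dropping stray strings like 'message'
df = pd.DataFrame([x for x in data if isinstance(x, dict)])
# timestamps and numeric columns arrive as strings, so convert them
df['time'] = pd.to_datetime(df['time'])
df[['price', 'size']] = df[['price', 'size']].apply(pd.to_numeric)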

You could access the values in the dictionary and build a dataframe from it (although not particularly clean):
dict_of_data = [{
    "time": "2014-11-07T22:19:28.578544Z",
    "trade_id": 74,
    "price": "10.00000000",
    "size": "0.01000000",
    "side": "buy"
}, {
    "time": "2014-11-07T01:08:43.642366Z",
    "trade_id": 73,
    "price": "100.00000000",
    "size": "0.01000000",
    "side": "sell"
}]
import pandas as pd
list_of_data = [list(dict_of_data[0].values()), list(dict_of_data[1].values())]
pd.DataFrame(list_of_data, columns=list(dict_of_data[0].keys())).set_index('time')
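If the list can hold more than two records, the same idea generalizes without hardcoding indices (a sketch assuming every record shares the first record's keys):

import pandas as pd

list_of_data = [list(rec.values()) for rec in dict_of_data]
pd.DataFrame(list_of_data, columns=list(dict_of_data[0].keys())).set_index('time')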

It's straightforward: just use the pd.DataFrame constructor:
# list_of_dicts = [{
#     "time": "2014-11-07T22:19:28.578544Z",
#     "trade_id": 74,
#     "price": "10.00000000",
#     "size": "0.01000000",
#     "side": "buy"
# }, {
#     "time": "2014-11-07T01:08:43.642366Z",
#     "trade_id": 73,
#     "price": "100.00000000",
#     "size": "0.01000000",
#     "side": "sell"
# }]
# or if you take it from 'data'
list_of_dicts = list(data)[:-1]  # data is a generator, so materialize it first and drop the trailing 'message'
df = pd.DataFrame(list_of_dicts)
df
Out[4]:
time trade_id price size side
0 2014-11-07T22:19:28.578544Z 74 10.00000000 0.01000000 buy
1 2014-11-07T01:08:43.642366Z 73 100.00000000 0.01000000 sell
UPDATE
According to the question update, it seems you have JSON data that is still a string...
import json
data = json.loads(data)
data = data['Returns']
pd.DataFrame(data)
time trade_id price size side
0 2014-11-07T22:19:28.578544Z 74 10.00000000 0.01000000 buy
1 2014-11-07T01:08:43.642366Z 73 100.00000000 0.01000000 sell

Related

Fastest way to generate a nested JSON using pandas

This is a sample of a real-world problem that I cannot find a way to solve.
I need to create a nested JSON from a pandas dataframe. Considering this data, I need to create a JSON object like this:
[
    {
        "city": "Belo Horizonte",
        "by_rooms": [
            {
                "rooms": 1,
                "total price": [
                    {
                        "total (R$)": 499,
                        "details": [
                            {
                                "animal": "acept",
                                "area": 22,
                                "bathroom": 1,
                                "parking spaces": 0,
                                "furniture": "not furnished",
                                "hoa (R$)": 30,
                                "rent amount (R$)": 450,
                                "property tax (R$)": 13,
                                "fire insurance (R$)": 6
                            }
                        ]
                    }
                ]
            },
            {
                "rooms": 2,
                "total price": [
                    {
                        "total (R$)": 678,
                        "details": [
                            {
                                "animal": "not acept",
                                "area": 50,
                                "bathroom": 1,
                                "parking spaces": 0,
                                "furniture": "not furnished",
                                "hoa (R$)": 0,
                                "rent amount (R$)": 644,
                                "property tax (R$)": 25,
                                "fire insurance (R$)": 9
                            }
                        ]
                    }
                ]
            }
        ]
    },
    {
        "city": "Campinas",
        "by_rooms": [
            {
                "rooms": 1,
                "total price": [
                    {
                        "total (R$)": 711,
                        "details": [
                            {
                                "animal": "acept",
                                "area": 42,
                                "bathroom": 1,
                                "parking spaces": 0,
                                "furniture": "not furnished",
                                "hoa (R$)": 0,
                                "rent amount (R$)": 690,
                                "property tax (R$)": 12,
                                "fire insurance (R$)": 9
                            }
                        ]
                    }
                ]
            }
        ]
    }
]
Each level can have one or more items.
Based on this answer, I have a snippet like this:
data = pd.read_csv("./houses_to_rent_v2.csv")
cols = data.columns
data = (
    data.groupby(['city', 'rooms', 'total (R$)'])[['animal', 'area', 'bathroom', 'parking spaces', 'furniture',
                                                   'hoa (R$)', 'rent amount (R$)', 'property tax (R$)', 'fire insurance (R$)']]
    .apply(lambda x: x.to_dict(orient='records'))
    .reset_index(name='details')
    .groupby(['city', 'rooms'])[['total (R$)', 'details']]
    .apply(lambda x: x.to_dict(orient='records'))
    .reset_index(name='total price')
    .groupby(['city'])[['rooms', 'total price']]
    .apply(lambda x: x.to_dict(orient='records'))
    .reset_index(name='by_rooms')
)
data.to_json('./jsondata.json', orient='records', force_ascii=False)
but all those groupbys don't look very Pythonic and it's pretty slow.
Before using this method, I tried splitting this big dataframe into smaller ones and using individual groupbys for each level, but it was even slower.
I tried dask, with no improvement at all.
I read about numba and cython, but I have no idea how to implement them in this case. All the docs I find use only numeric data, and I have string and date/datetime data too.
In my real-world problem, this data is processed to respond to an HTTP request. My dataframe has 30+ columns and ~35K rows per request, and it takes 45 seconds to process just this snippet.
So, there is a faster way to do that?
This can be done with list/dict comprehensions. I have not timed this, but I'm not left waiting for it.
import kaggle.cli
import sys, requests
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
import urllib
# fmt: off
# download data set
url = "https://www.kaggle.com/rubenssjr/brasilian-houses-to-rent"
sys.argv = [sys.argv[0]] + f"datasets download {urllib.parse.urlparse(url).path[1:]}".split(" ")
kaggle.cli.main()
zfile = ZipFile(f'{urllib.parse.urlparse(url).path.split("/")[-1]}.zip')
dfs = {f.filename: pd.read_csv(zfile.open(f)) for f in zfile.infolist()}
# fmt: on
js = [
    {
        "city": g[0],
        "by_room": [
            {
                "rooms": r["rooms"],
                "total_price": [
                    {
                        "total (R$)": r["total (R$)"],
                        "details": [
                            {
                                k: v
                                for k, v in r.items()
                                if k not in ["city", "rooms", "total (R$)"]
                            }
                        ],
                    }
                ],
            }
            for r in g[1].to_dict("records")
        ],
    }
    for g in dfs["houses_to_rent_v2.csv"].groupby("city")
]
print(len(js), len(js[0]["by_room"]))
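If you also need the file output the original snippet produced, a json.dump of js should be equivalent (path copied from the question):

import json

with open('./jsondata.json', 'w', encoding='utf-8') as f:
    json.dump(js, f, ensure_ascii=False)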
I needed to adapt @RobRaymond's answer, because I need the inner data grouped too. So I took his code, made some adjustments, and this is the final result:
import kaggle.cli
import sys, requests
import pandas as pd
from pathlib import Path
from zipfile import ZipFile
import urllib
# fmt: off
# download data set
url = "https://www.kaggle.com/rubenssjr/brasilian-houses-to-rent"
sys.argv = [sys.argv[0]] + f"datasets download {urllib.parse.urlparse(url).path[1:]}".split(" ")
kaggle.cli.main()
zfile = ZipFile(f'{urllib.parse.urlparse(url).path.split("/")[-1]}.zip')
dfs = {f.filename: pd.read_csv(zfile.open(f)) for f in zfile.infolist()}
# fmt: on
js = [
    {
        "city": g[0],
        "by_room": [
            {
                "rooms": r["rooms"],
                "total_price": [
                    {
                        "total (R$)": r["total (R$)"],
                        "details": [
                            {
                                k: v
                                for k, v in r.items()
                                if k not in ["city", "rooms", "total (R$)"]
                            }
                        ],
                    }
                ],
            }
            for r in g[1].to_dict("records")
        ],
    }
    for g in dfs["houses_to_rent_v2.csv"].groupby("city")
]
for city in js:
    rooms_qty = list(set([r['rooms'] for r in city['by_room']]))
    newRooms = [{'rooms': x, 'total_price': []} for x in rooms_qty]
    for r in city['by_room']:
        newRooms[rooms_qty.index(r['rooms'])]['total_price'].extend(r['total_price'])
    for r in newRooms:
        prices = list(set([p['total (R$)'] for p in r['total_price']]))
        newPrices = [{'total (R$)': x, 'details': []} for x in prices]
        for price in r['total_price']:
            newPrices[prices.index(price['total (R$)'])]['details'].extend(price['details'])
        r['total_price'] = newPrices
    city['by_room'] = newRooms
And the execution time drops to 5 seconds.
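As a further tweak (untimed, just a sketch), the repeated list.index lookups in the regrouping loops can be avoided by grouping through plain dicts keyed on room count and price:

for city in js:
    by_rooms = {}
    for r in city['by_room']:
        # collect all total_price entries under their room count
        by_rooms.setdefault(r['rooms'], []).extend(r['total_price'])
    new_rooms = []
    for rooms, prices in by_rooms.items():
        by_price = {}
        for p in prices:
            # collect all details under their total price
            by_price.setdefault(p['total (R$)'], []).extend(p['details'])
        new_rooms.append({'rooms': rooms,
                          'total_price': [{'total (R$)': t, 'details': d}
                                          for t, d in by_price.items()]})
    city['by_room'] = new_rooms

Each setdefault lookup is O(1), whereas list.index rescans the list on every hit.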

Flattening nested JSON to pandas.DataFrame: Ordering and Naming Columns based on dictionary values

My question arose when I used this helpful answer provided by Trenton McKinney on the issue of flattening multiple nested JSON files for handling in pandas.
Following his advice, I have used the flatten_json function described here to flatten a batch of nested JSON files. However, I have run into a problem with the uniformity of my JSON files.
A single JSON file looks roughly like this made-up example data:
{
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}
Utilizing the referenced code, I will end up with data looking like this:
product | product_id | product_type | producer | currency | client_id | supplement_0_supplementtype | supplement_0_price | supplement_0_rebate | supplement_1_supplementtype | supplement_1_price | supplement_1_rebate | etc
example_productname | example_productid | example_type | example_producer | example_currency | example_clientid | RTZ | 300000 | 500 | CVB | 500000 | 250 | etc
example_productname2 | example_productid2 | example_type2 | example_producer2 | example_currency2 | example_clientid2 | CVB | 500000 | 250 | RTZ | 300000 | 500 | etc
There are multiple issues with this.
Firstly, in my data there is a limited set of "supplements"; however, they do not always appear, and when they do, they are not always in the same order. In the example table, you can see that the two "supplements" switched positions in the second row. I would prefer a fixed order of the "supplement" columns.
Secondly, the best option would be a table like this:
product | product_id | product_type | producer | currency | client_id | supplement_RTZ_price | supplement_RTZ_rebate | supplement_CVB_price | supplement_CVB_rebate | etc
example_productname | example_productid | example_type | example_producer | example_currency | example_clientid | 300000 | 500 | 500000 | 250 | etc
I have tried editing the flatten_json function referenced, but I don't have an inkling of how to make this work.
The solution consists of simply editing the dictionary (thanks to Andrej Kesely). I just added a pass on exceptions in case some keys are nonexistent.
d = {
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}
for s in d["supplement"]:
    try:
        d["supplementtype_{}_price".format(s["supplementtype"])] = s["price"]
    except KeyError:
        pass
    try:
        d["supplementtype_{}_rebate".format(s["supplementtype"])] = s["rebate"]
    except KeyError:
        pass
del d["supplement"]
df = pd.DataFrame([d])
print(df)
product product_id product_type producer currency client_id supplementtype_RTZ_price supplementtype_RTZ_rebate supplementtype_CVB_price supplementtype_CVB_rebate supplementtype_JKL_price supplementtype_JKL_rebate
0 example_productname example_productid example_producttype example_producer example_currency example_clientid 300000 500 500000 250 100000 750
The used/referenced code:
def flatten_json(nested_json: dict, exclude: list=[''], sep: str='_') -> dict:
    """
    Flatten a list of nested dicts.
    """
    out = dict()

    def flatten(x: (list, dict, str), name: str='', exclude=exclude):
        if type(x) is dict:
            for a in x:
                if a not in exclude:
                    flatten(x[a], f'{name}{a}{sep}')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, f'{name}{i}{sep}')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out
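To see where the numbered columns come from, a quick toy call (a sketch; the expected result is shown as a comment, using the default '_' separator):

flatten_json({'supplement': [{'supplementtype': 'RTZ', 'price': 300000}]})
# {'supplement_0_supplementtype': 'RTZ', 'supplement_0_price': 300000}

List elements get their positional index baked into the key, which is exactly why the column order follows the order of the supplements in each file.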
import json
import pandas as pd

# list of files
files = ['test1.json', 'test2.json']
# list to add dataframe from each file
df_list = list()
# iterate through files
for file in files:
    with open(file, 'r') as f:
        # read with json
        data = json.loads(f.read())
        # flatten_json into a dataframe and add to the dataframe list
        df_list.append(pd.DataFrame.from_dict(flatten_json(data), orient='index').T)
# concat all dataframes together
df = pd.concat(df_list).reset_index(drop=True)
You can modify the dictionary before you create the dataframe from it:
d = {
    "product": "example_productname",
    "product_id": "example_productid",
    "product_type": "example_producttype",
    "producer": "example_producer",
    "currency": "example_currency",
    "client_id": "example_clientid",
    "supplement": [
        {
            "supplementtype": "RTZ",
            "price": 300000,
            "rebate": "500",
        },
        {
            "supplementtype": "CVB",
            "price": 500000,
            "rebate": "250",
        },
        {
            "supplementtype": "JKL",
            "price": 100000,
            "rebate": "750",
        },
    ],
}
for s in d["supplement"]:
    d["supplementtype_{}_price".format(s["supplementtype"])] = s["price"]
    d["supplementtype_{}_rebate".format(s["supplementtype"])] = s["rebate"]
del d["supplement"]
df = pd.DataFrame([d])
print(df)
Prints:
product product_id product_type producer currency client_id supplementtype_RTZ_price supplementtype_RTZ_rebate supplementtype_CVB_price supplementtype_CVB_rebate supplementtype_JKL_price supplementtype_JKL_rebate
0 example_productname example_productid example_producttype example_producer example_currency example_clientid 300000 500 500000 250 100000 750
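If you have a batch of files, a sketch that combines this rewrite with the file loop from the referenced code (the file names are the placeholders used above):

import json
import pandas as pd

df_list = []
for file in ['test1.json', 'test2.json']:
    with open(file, 'r') as f:
        d = json.loads(f.read())
    # hoist each supplement into columns named after its type, then drop the list
    for s in d.get('supplement', []):
        d['supplementtype_{}_price'.format(s['supplementtype'])] = s.get('price')
        d['supplementtype_{}_rebate'.format(s['supplementtype'])] = s.get('rebate')
    d.pop('supplement', None)
    df_list.append(pd.DataFrame([d]))
df = pd.concat(df_list).reset_index(drop=True)

Because the columns are keyed on the supplement type rather than its position, records that list their supplements in a different order line up under the same columns, and pd.concat fills any missing ones with NaN.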

Can't delete a dictionary from a list of dicts using del method

I'm trying to delete every dictionary that has a Points value of 0, but when I run this code the object still remains. Is there something special here that prevents it from being deleted?
import json
import numpy as np
import pandas as pd
from itertools import groupby

# using json open the player objects file and set it equal to data
with open('Combined_Players_DK.json') as json_file:
    player_data = json.load(json_file)

for player in player_data:
    for points in player['Tournaments']:
        player['Average'] = round(sum(float(tourny['Points']) for tourny in player['Tournaments']) / len(player['Tournaments']), 2)

for players in player_data:
    for value in players['Tournaments']:
        if value['Points'] == 0:
            del value

with open('PGA_Player_Objects_With_Average.json', 'w') as my_file:
    json.dump(player_data, my_file)
Here is the JSON
[
    {
        "Name": "Dustin Johnson",
        "Tournaments": [
            {
                "Date": "2020-06-25",
                "Points": 133.5
            },
            {
                "Date": "2020-06-18",
                "Points": 101
            },
            {
                "Date": "2020-06-11",
                "Points": 25
            },
            {
                "Date": "2020-02-20",
                "Points": 60
            },
            {
                "Date": "2020-02-13",
                "Points": 89.5
            },
            {
                "Date": "2020-02-06",
                "Points": 69.5
            },
            {
                "Date": "2020-01-02",
                "Points": 91
            },
            {
                "Date": "2019-12-04",
                "Points": 0
            }
        ],
        "Average": 71.19
    }
]
I'm not sure why I can't delete the value. I tried remove as well, but then I was left with an empty object.
You can't delete value with del while looping, as you can see here. If you want to use del, you should delete the item by its index, not by the name it takes in the scope of the for loop, because as @MosesKoledoye said in the comments:
You want del players['Tournaments'] for the deletion to act on the
dictionary & remove that entry, and not del value which only deletes
the value name from the scope. See the del stmt.
When you're looping and you want to modify the list, you have to create a copy, as you can see in the docs. I suggest you look at the link above to see other ways to delete an element while looping. Try this:
for players in player_data:
    # iterate in reverse so deleting by index doesn't shift the positions still to visit
    for i in reversed(range(len(players['Tournaments']))):
        if players['Tournaments'][i]['Points'] == 0:
            del players['Tournaments'][i]
I prefer a list comprehension, so you can try this too:
player_data[0]['Tournaments'] = [dct for dct in player_data[0]['Tournaments'] if dct['Points'] != 0]
print(player_data)
Output:
[{'Name': 'Dustin Johnson', 'Tournaments': [{'Date': '2020-06-25', 'Points': 133.5}, {'Date': '2020-06-18', 'Points': 101}, {'Date': '2020-06-11', 'Points': 25}, {'Date': '2020-02-20', 'Points': 60}, {'Date': '2020-02-13', 'Points': 89.5}, {'Date': '2020-02-06', 'Points': 69.5}, {'Date': '2020-01-02', 'Points': 91}], 'Average': 71.19}]
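Note that the comprehension above only rewrites the first player (player_data[0]); if the file holds several players, a sketch applying the same filter to every entry:

for player in player_data:
    player['Tournaments'] = [t for t in player['Tournaments'] if t['Points'] != 0]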

Pandas DataFrame created for each row

I am attempting to pass JSON data from an API to a pandas DataFrame. I could not get pandas.read_json to work with the API data, so I'm sure this isn't the best solution, but I currently have a for loop running through the JSON to extract the values I want.
Here is what I have:
import json
import urllib.request
import pandas as pd
r = urllib.request.urlopen("https://graph.facebook.com/v3.1/{page-id}/insights?access_token={access-token}&pretty=0&metric=page_impressions%2cpage_engaged_users%2cpage_fans%2cpage_video_views%2cpage_posts_impressions").read()
output = json.loads(r)
for item in output['data']:
    name = item['name']
    period = item['period']
    value = item['values'][0]['value']
    df = [{'Name': name, 'Period': period, 'Value': value}]
    df = pd.DataFrame(df)
    print(df)
And here is an excerpt of the JSON from the API:
{
    "data": [
        {
            "name": "page_video_views",
            "period": "day",
            "values": [
                {
                    "value": 634,
                    "end_time": "2018-11-23T08:00:00+0000"
                },
                {
                    "value": 465,
                    "end_time": "2018-11-24T08:00:00+0000"
                }
            ],
            "title": "Daily Total Video Views",
            "description": "Daily: Total number of times videos have been viewed for more than 3 seconds. (Total Count)",
            "id": "{page-id}/insights/page_video_views/day"
        },
The issue I am now facing (I believe because of the for loop) is that each row of data is being inserted into its own DataFrame, like so:
Name Period Value
0 page_video_views day 465
Name Period Value
0 page_video_views week 3257
Name Period Value
0 page_video_views days_28 9987
Name Period Value
0 page_impressions day 1402
How can I pass all of them easily into the same DataFrame like so?
Name Period Value
0 page_video_views day 465
1 page_video_views week 3257
2 page_video_views days_28 9987
3 page_impressions day 1402
Again, I know this most likely isn't the best solution so any suggestions on how to improve any aspect are very welcome.
You can create a list of dictionaries and pass it to the DataFrame constructor:
L = []
for item in output['data']:
    name = item['name']
    period = item['period']
    value = item['values'][0]['value']
    L.append({'Name': name, 'Period': period, 'Value': value})
df = pd.DataFrame(L)
Or use a list comprehension:
L = [{'Name': item['name'], 'Period': item['period'], 'Value': item['values'][0]['value']}
     for item in output['data']]
df = pd.DataFrame(L)
print (df)
Name Period Value
0 page_video_views day 634
Sample for testing:
output = {
    "data": [
        {
            "name": "page_video_views",
            "period": "day",
            "values": [
                {
                    "value": 634,
                    "end_time": "2018-11-23T08:00:00+0000"
                },
                {
                    "value": 465,
                    "end_time": "2018-11-24T08:00:00+0000"
                }
            ],
            "title": "Daily Total Video Views",
            "description": "Daily: Total number of times videos have been viewed for more than 3 seconds. (Total Count)",
            "id": "{page-id}/insights/page_video_views/day"
        }
    ]
}
Try converting the dictionary to a dataframe after the JSON load, like:
output = json.loads(r)
df = pd.DataFrame.from_dict(output, orient='index')
df.reset_index(level=0, inplace=True)
If you are taking the data from the URL, I would suggest this approach: request the JSON and pass along only the records stored under the relevant attribute.
import requests
data = requests.get("url here").json()['data']
data is now a list of records, so you can call pd.DataFrame.from_dict to parse it:
df = pd.DataFrame.from_dict(data)

Pandas DataFrame to Nested JSON Without Changing Data Structure

I have pandas.DataFrame:
import pandas as pd
import json
df = pd.DataFrame([['2016-04-30T20:02:25.693Z', 'vmPowerOn', 'vmName'],
                   ['2016-04-07T22:35:41.145Z', 'vmPowerOff', 'hostName']],
                  columns=['date', 'event', 'object'])
date event object
0 2016-04-30T20:02:25.693Z vmPowerOn vmName
1 2016-04-07T22:35:41.145Z vmPowerOff hostName
I want to convert that dataframe into the following format:
{
    "name": "Alarm/Error",
    "data": [
        {"date": "2016-04-30T20:02:25.693Z", "details": {"event": "vmPowerOn", "object": "vmName"}},
        {"date": "2016-04-07T22:35:41.145Z", "details": {"event": "vmPowerOff", "object": "hostName"}}
    ]
}
So far, I've tried this:
df = df.to_dict(orient='records')
j = {"name":"Alarm/Error", "data":df}
json.dumps(j)
'{"name": "Alarm/Error",
"data": [{"date": "2016-04-30T20:02:25.693Z", "event": "vmPowerOn", "object": "vmName"},
{"date": "2016-04-07T22:35:41.145Z", "event": "vmPowerOff", "object": "hostName"}
]
}'
However, this obviously does not put the detail columns in their own dictionary.
How would I efficiently split the df date column and all other columns into separate parts of the JSON?
With a list and dict comprehension, you can do that like this:
Code:
[{'date': x['date'], 'details': {k: v for k, v in x.items() if k != 'date'}}
 for x in df.to_dict('records')]
Test Code:
df = pd.DataFrame([['2016-04-30T20:02:25.693Z', 'vmPowerOn', 'vmName'],
                   ['2016-04-07T22:35:41.145Z', 'vmPowerOff', 'hostName']],
                  columns=['date', 'event', 'object'])
print([{'date': x['date'],
        'details': {k: v for k, v in x.items() if k != 'date'}}
       for x in df.to_dict('records')])
Results:
[{'date': '2016-04-30T20:02:25.693Z', 'details': {'event': 'vmPowerOn', 'object': 'vmName'}},
{'date': '2016-04-07T22:35:41.145Z', 'details': {'event': 'vmPowerOff', 'object': 'hostName'}}
]
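To land on the exact shape the question asks for, the list can be wrapped back into the outer object (a sketch reusing json and df from the question):

records = [{'date': x['date'],
            'details': {k: v for k, v in x.items() if k != 'date'}}
           for x in df.to_dict('records')]
json.dumps({'name': 'Alarm/Error', 'data': records})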
