Pandas grouping and express as proportion - python

import pandas as pd

d = [{'name': 'tv', 'value': 10, 'amount': 35},
     {'name': 'tv', 'value': 10, 'amount': 14},
     {'name': 'tv', 'value': 15, 'amount': 23},
     {'name': 'tv', 'value': 34, 'amount': 56},
     {'name': 'radio', 'value': 90, 'amount': 35},
     {'name': 'radio', 'value': 90, 'amount': 65},
     {'name': 'radio', 'value': 100, 'amount': 50},
     {'name': 'dvd', 'value': 0.5, 'amount': 35},
     {'name': 'dvd', 'value': 0.2, 'amount': 40},
     {'name': 'dvd', 'value': 0.5, 'amount': 15}
    ]
df = pd.DataFrame(d)
dff = df.groupby(['name', 'value']).agg('sum').reset_index()
dfff = dff.groupby(['name']).apply(lambda x: round((x['amount']/x['amount'].sum())*100))
print(dff)
print(dfff)
name value amount
0 dvd 0.2 40
1 dvd 0.5 50
2 radio 90.0 100
3 radio 100.0 50
4 tv 10.0 49
5 tv 15.0 23
6 tv 34.0 56
name
dvd    0    44.0
       1    56.0
radio  2    67.0
       3    33.0
tv     4    38.0
       5    18.0
       6    44.0
I now want to take this dataset and concatenate the rows grouped on the name variable. The amount variable should be expressed as a proportion.
The final dataset should look like below, where the value is the first term and amount expressed as a proportion is the second term.
name concatenated_values
0 dvd 0.2, 44%, 0.5, 56%
1 radio 90, 67%, 100, 33%
.
.
.

Use a custom lambda function that flattens the nested lists inside GroupBy.apply:
dff = df.groupby(['name', 'value']).agg('sum').reset_index()
dff['amount'] = ((dff['amount'] / dff.groupby(['name'])['amount'].transform('sum')*100)
.round().astype(int).astype(str) + '%')
f = lambda x: ', '.join(str(z) for y in x.to_numpy() for z in y)
d = dff.groupby('name')[['value','amount']].apply(f).reset_index(name='concatenated_values')
print(d)
name concatenated_values
0 dvd 0.2, 44%, 0.5, 56%
1 radio 90.0, 67%, 100.0, 33%
2 tv 10.0, 38%, 15.0, 18%, 34.0, 44%
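For comparison, a minimal alternative sketch (assuming d and import pandas as pd from above) that builds the same concatenated strings in a single groupby.apply pass, without the intermediate percentage column:
def summarize(g):
    # sum duplicate (name, value) rows, then express each amount as a share of the group total
    sums = g.groupby('value', sort=True)['amount'].sum()
    pct = (sums / sums.sum() * 100).round().astype(int)
    return ', '.join(f'{v}, {p}%' for v, p in pct.items())

out = (pd.DataFrame(d)
         .groupby('name')
         .apply(summarize)
         .reset_index(name='concatenated_values'))
print(out)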

Related

python pandas unnest data column containing a list of dictionaries

we have the following dataframe:
import pandas as pd
our_df = pd.DataFrame(data = {'rank': {0: 1, 1: 2}, 'title_name': {0: "And It's Still Alright", 1: 'Black Madonna'}, 'title_id': {0: '120034150', 1: '106938609'}, 'artist_id': {0: '222521', 1: '200160'}, 'artist_name': {0: 'Nathaniel Rateliff', 1: 'Cage The Elephant'}, 'label': {0: 'CNCO', 1: 'RCA'}, 'metrics': {0: [{'name': 'Rank', 'value': 1}, {'name': 'Song', 'value': "And It's Still Alright"}, {'name': 'Artist', 'value': 'Nathaniel Rateliff'}, {'name': 'TP Spins', 'value': 933}, {'name': '+/- Chg. Spins', 'value': -32}, {'name': 'LP Spins', 'value': 965}, {'name': 'Stations', 'value': '44/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1260000}, {'name': '+/- Chg. Audience', 'value': -40600}, {'name': 'LP Audience', 'value': 1300600}, {'name': 'TP Stream', 'value': 413101}], 1: [{'name': 'Rank', 'value': 2}, {'name': 'Song', 'value': 'Black Madonna'}, {'name': 'Artist', 'value': 'Cage The Elephant'}, {'name': 'TP Spins', 'value': 814}, {'name': '+/- Chg. Spins', 'value': 38}, {'name': 'LP Spins', 'value': 776}, {'name': 'Stations', 'value': '38/46'}, {'name': 'Adds', 'value': 0}, {'name': 'TP Audience', 'value': 1283400}, {'name': '+/- Chg. Audience', 'value': -21600}, {'name': 'LP Audience', 'value': 1305000}, {'name': 'TP Stream', 'value': 362366}]}})
and we are looking to convert the metrics column into 12 new columns in our dataframe, using the metric's name field as the column name, and value field as the field in the dataframe. Something like this:
rank title_name title_id artist_id artist_name label Rank Song ...
1 'And It's Still Alright' 120034150 222521 'Nathaniel Rateliff' 'CNCO' 1 "And It's Still Alright"
Here's what the value in the metrics column looks like for row 1:
our_df['metrics'][0]
[{'name': 'Rank', 'value': 1},
{'name': 'Song', 'value': "And It's Still Alright"},
{'name': 'Artist', 'value': 'Nathaniel Rateliff'},
{'name': 'TP Spins', 'value': 933},
{'name': '+/- Chg. Spins', 'value': -32},
{'name': 'LP Spins', 'value': 965},
{'name': 'Stations', 'value': '44/46'},
{'name': 'Adds', 'value': 0},
{'name': 'TP Audience', 'value': 1260000},
{'name': '+/- Chg. Audience', 'value': -40600},
{'name': 'LP Audience', 'value': 1300600},
{'name': 'TP Stream', 'value': 413101}]
The +/- in the column names may be problematic though, along with the . in Chg. This dataframe would be best if all the column names were snake_case, if the +/- was replaced with plus_minus, and if the . in Chg. was simply dropped.
Edit: we can assume that the metric names will be the same in every row in the dataframe. However, there may be other dataframes with different metric names, so it would be preferable if the names 'Rank', 'Song', 'Artist', etc. were not hardcoded. Here is the original list before it was converted into a pandas dataframe:
raw_data = [{'rank': 1,
'title_name': 'BUTTER',
'title_id': '',
'artist_id': '',
'artist_name': 'BTS',
'label': '',
'peak_position': 1,
'last_week_rank': 7,
'last_2week_rank': 8,
'metrics': [{'name': 'Rank', 'value': 1},
{'name': 'Song', 'value': 'BUTTER'},
{'name': 'Artist', 'value': 'BTS'},
{'name': 'Label Description', 'value': None},
{'name': 'Label', 'value': ' '},
{'name': 'Last Week Rank', 'value': 7},
{'name': 'Last 2 Week Rank', 'value': 8},
{'name': 'Weeks On Chart', 'value': 15}]},
{'rank': 2,
'title_name': 'STAY',
'title_id': '',
'artist_id': '',
'artist_name': 'THE KID LAROI & JUS',
'label': '',
'peak_position': 1,
'last_week_rank': 1,
'last_2week_rank': 1,
'metrics': [{'name': 'Rank', 'value': 2},
{'name': 'Song', 'value': 'STAY'},
{'name': 'Artist', 'value': 'THE KID LAROI & JUS'},
{'name': 'Label Description', 'value': None},
{'name': 'Label', 'value': ' '},
{'name': 'Last Week Rank', 'value': 1},
{'name': 'Last 2 Week Rank', 'value': 1},
{'name': 'Weeks On Chart', 'value': 8}]}]
Most likely, the fastest way is to process raw_data as plain Python dictionaries and only then construct a DataFrame from it.
records = []
for rec in raw_data:
    for metric in rec['metrics']:
        # process name: snake_case > drop '.' > '+/-' to 'plus_minus'
        name = metric['name'].lower().replace(' ', '_').replace('.', '').replace('+/-', 'plus_minus')
        rec[name] = metric['value']
    rec.pop('metrics')  # drop metric records
    records.append(rec)
df = pd.DataFrame(records)
# note: this mutates the dicts inside raw_data in place; work on a copy if raw_data must stay intact
Output
Resulting df:
   rank title_name title_id artist_id          artist_name label  peak_position  last_week_rank  last_2week_rank    song               artist label_description  last_2_week_rank  weeks_on_chart
0     1     BUTTER                                      BTS                    1               7                8  BUTTER                  BTS              None                 8              15
1     2       STAY                      THE KID LAROI & JUS                    1               1                1    STAY  THE KID LAROI & JUS              None                 1               8
Setup
raw_data as given in the question above.
Using the example's data (our_df as defined in the question) as raw_data, i.e.
raw_data = our_df.to_dict(orient='records')
Output
Resulting df from the solution above:
   rank              title_name   title_id artist_id         artist_name label                    song              artist  tp_spins  plus_minus_chg_spins  lp_spins stations  adds  tp_audience  plus_minus_chg_audience  lp_audience  tp_stream
0     1  And It's Still Alright  120034150    222521  Nathaniel Rateliff  CNCO  And It's Still Alright  Nathaniel Rateliff       933                   -32       965    44/46     0      1260000                   -40600      1300600     413101
1     2           Black Madonna  106938609    200160   Cage The Elephant   RCA           Black Madonna   Cage The Elephant       814                    38       776    38/46     0      1283400                   -21600      1305000     362366
Let's start decomposing your issue. After defining our_df we can generate a new dataframe based on the column metrics with:
pd.concat([pd.DataFrame({x['name']: x['value'] for x in y}, index=[0]) for y in our_df['metrics']])
Which outputs:
Rank Song ... LP Audience TP Stream
0 1 And It's Still Alright ... 1300600 413101
0 2 Black Madonna ... 1305000 362366
Next, it's just a question of joining them together with pd.concat() or merge. I assume the common key is the Rank column, so I'll use merge:
our_df.drop(columns=['metrics']).merge(pd.concat([pd.DataFrame({x['name']:x['value'] for x in y},index=[0]) for y in our_df['metrics']]),left_on='rank',right_on='Rank')
Outputting the full dataframe
rank title_name ... LP Audience TP Stream
0 1 And It's Still Alright ... 1300600 413101
1 2 Black Madonna ... 1305000 362366
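Note that because the merge keys have different names ('rank' on the left, 'Rank' on the right), both key columns are kept in the merged result; if that is unwanted, you can drop one afterwards, for example with .drop(columns='Rank') on the merged frame.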
An alternative that might be more robust against missing names:
metric_df = our_df.apply(
    lambda r: pd.Series(
        index=list(map(lambda d: d['name'], r['metrics'])) + ['rank'],
        data=list(map(lambda d: d['value'], r['metrics'])) + [r['rank']],
    ),
    axis=1,
)
our_df.merge(metric_df, on='rank')
box = pd.concat({index: pd.DataFrame(ent)
                 for index, ent in zip(our_df.index, our_df.metrics)})
(our_df
 .drop(columns='metrics')
 .join(box.droplevel(-1))
 .pivot(index=['rank', 'title_name', 'title_id', 'artist_id', 'artist_name', 'label'],
        columns='name',
        values='value')
 .reset_index()
)
name rank title_name title_id artist_id artist_name label +/- Chg. Audience +/- Chg. Spins Adds Artist LP Audience LP Spins Rank Song Stations TP Audience TP Spins TP Stream
0 1 And It's Still Alright 120034150 222521 Nathaniel Rateliff CNCO -40600 -32 0 Nathaniel Rateliff 1300600 965 1 And It's Still Alright 44/46 1260000 933 413101
1 2 Black Madonna 106938609 200160 Cage The Elephant RCA -21600 38 0 Cage The Elephant 1305000 776 2 Black Madonna 38/46 1283400 814 362366
Working on the raw_data:
from itertools import chain, product
metrics = [ent['metrics'] for ent in raw_data]
non_metrics = [{key: value
                for key, value in ent.items()
                if key != 'metrics'}
               for ent in raw_data]
combo = zip(metrics, non_metrics)
combo = (product(metrics, [non_metrics])
         for metrics, non_metrics in combo)
combo = chain.from_iterable(combo)
combo = [{**left, **right} for left, right in combo]
pd.DataFrame(combo)
name value rank title_name title_id artist_id artist_name label peak_position last_week_rank last_2week_rank
0 Rank 1 1 BUTTER BTS 1 7 8
1 Song BUTTER 1 BUTTER BTS 1 7 8
2 Artist BTS 1 BUTTER BTS 1 7 8
3 Label Description None 1 BUTTER BTS 1 7 8
4 Label 1 BUTTER BTS 1 7 8
5 Last Week Rank 7 1 BUTTER BTS 1 7 8
6 Last 2 Week Rank 8 1 BUTTER BTS 1 7 8
7 Weeks On Chart 15 1 BUTTER BTS 1 7 8
8 Rank 2 2 STAY THE KID LAROI & JUS 1 1 1
9 Song STAY 2 STAY THE KID LAROI & JUS 1 1 1
10 Artist THE KID LAROI & JUS 2 STAY THE KID LAROI & JUS 1 1 1
11 Label Description None 2 STAY THE KID LAROI & JUS 1 1 1
12 Label 2 STAY THE KID LAROI & JUS 1 1 1
13 Last Week Rank 1 2 STAY THE KID LAROI & JUS 1 1 1
14 Last 2 Week Rank 1 2 STAY THE KID LAROI & JUS 1 1 1
15 Weeks On Chart 8 2 STAY THE KID LAROI & JUS 1 1 1
You can then reshape/transform into whatever you desire.
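For instance, a minimal sketch of one possible reshape (assuming the combo list built above and pandas >= 1.1 for a list-valued pivot index), turning each metric name into its own column:
long_df = pd.DataFrame(combo)
id_cols = [c for c in long_df.columns if c not in ('name', 'value')]
wide_df = (long_df
           .pivot(index=id_cols, columns='name', values='value')
           .reset_index())
print(wide_df)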

Extract values from Dictionary inside a pandas dataframe (Python)

Trying to extract the dictionary column in a dataframe, but unable to. None of the solutions I found matches my requirement, hence seeking help for the same.
instrument_token last_price change depth
0 17600770 180.75 20.500000 {'buy': [{'quantity': 1, 'price': 1, 'orders': 1},{'quantity': 0, 'price': 0.0, 'orders': 0}], 'sell': [{'quantity': 1, 'price': 1, 'orders': 1},{'quantity': 0, 'price': 0.0, 'orders': 0}]}
1 12615426 0.05 -50.000000 {'buy': [{'quantity': 2, 'price': 2, 'orders': 2},{'quantity': 0, 'price': 0.0, 'orders': 0}], 'sell': [{'quantity': 2, 'price': 2, 'orders': 2},{'quantity': 0, 'price': 0.0, 'orders': 0}]}
2 17543682 0.35 -89.062500 {'buy': [{'quantity': 3, 'price': 3, 'orders': 3},{'quantity': 0, 'price': 0.0, 'orders': 0}], 'sell': [{'quantity': 3, 'price': 3, 'orders': 3},{'quantity': 0, 'price': 0.0, 'orders': 0}]}
3 17565954 6.75 -10.000000 {'buy': [{'quantity': 4, 'price': 4, 'orders': 4},{'quantity': 0, 'price': 0.0, 'orders': 0}], 'sell': [{'quantity': 4, 'price': 4, 'orders': 4},{'quantity': 0, 'price': 0.0, 'orders': 0}]}
4 26077954 3.95 -14.130435 {'buy': [{'quantity': 5, 'price': 5, 'orders': 5},{'quantity': 0, 'price': 0.0, 'orders': 0}], 'sell': [{'quantity': 5, 'price': 5, 'orders': 5},{'quantity': 0, 'price': 0.0, 'orders': 0}]}
5 17599490 141.75 -2.241379 {'buy': [{'quantity': 6, 'price': 6, 'orders': 6},{'quantity': 0, 'price': 0.0, 'orders': 0}], 'sell': [{'quantity': 6, 'price': 6, 'orders': 6},{'quantity': 0, 'price': 0.0, 'orders': 0}]}
6 17566978 17.65 -1.671309 {'buy': [{'quantity': 7, 'price': 7, 'orders': 7},{'quantity': 0, 'price': 0.0, 'orders': 0}], 'sell': [{'quantity': 7, 'price': 7, 'orders': 7},{'quantity': 0, 'price': 0.0, 'orders': 0}]}
7 26075906 24.70 -16.554054 {'buy': [{'quantity': 8, 'price': 8, 'orders': 8},{'quantity': 0, 'price': 0.0, 'orders': 0}], 'sell': [{'quantity': 8, 'price': 8, 'orders': 8},{'quantity': 0, 'price': 0.0, 'orders': 0}]}
looking to convert to the following:
instrument_token last_price change buy_price sell_price
0 17600770 180.75 20.500000 1 1
1 12615426 0.05 -50.000000 2 2
2 17543682 0.35 -89.062500 3 3
3 17565954 6.75 -10.000000 4 4
4 26077954 3.95 -14.130435 5 5
5 17599490 141.75 -2.241379 6 6
6 17566978 17.65 -1.671309 7 7
...
I am able to access the individual elements using a for loop, but unable to convert the dictionary to the desired df columns as shown in the desired df above.
If you want to get the price only from the first element of each list, and not a sum, then do:
df["buy_price"]=df["depth"].str["buy"].str[0].str["price"]
df["sell_price"]=df["depth"].str["sell"].str[0].str["price"]
In case you wish to get a sum of all nested elements:
df["buy_price"]=df["depth"].str["buy"].apply(lambda x: sum(el["price"] for el in x))
df["sell_price"]=df["depth"].str["sell"].apply(lambda x: sum(el["price"] for el in x))
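As a side note on why this works: the .str accessor indexes element-wise into list and dict objects stored in an object-dtype Series, not only into strings. A minimal toy illustration (hypothetical data, not from the question):
s = pd.Series([{'buy': [{'price': 1}]}, {'buy': [{'price': 2}]}])
print(s.str['buy'].str[0].str['price'])  # a Series with values 1 and 2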
Is this what you're looking for?
def get_prices(depth, tag):
    def sum(items):  # note: this shadows the built-in sum inside get_prices
        total = 0
        for item in items:
            total += item['price']
        return total
    return int(sum(depth[tag]))

df['buy_price'] = df['depth'].apply(lambda depth: get_prices(depth, 'buy'))
df['sell_price'] = df['depth'].apply(lambda depth: get_prices(depth, 'sell'))
df.drop(columns='depth', inplace=True)
print(df)
Output:
instrument_token last_price change buy_price sell_price
0 17600770 180.75 20.500000 1 1
1 12615426 0.05 -50.000000 2 2
2 17543682 0.35 -89.062500 3 3
3 17565954 6.75 -10.000000 4 4
4 26077954 3.95 -14.130435 5 5
5 17599490 141.75 -2.241379 6 6
6 17566978 17.65 -1.671309 7 7
7 26075906 24.70 -16.554054 8 8
I use ast here to convert the strings into Python data structures. For actual dictionaries, as in your case, you can drop the ast.literal_eval part from the script.
Get the dictionaries and merge them back into the original dataframe. The assumption, based on your desired output, is that you are only interested in the first dict in each sublist for buy and sell respectively.
import ast
res = [{f"{x}_price": ast.literal_eval(ent)[x][0]['price']
        for x in ("buy", "sell")}
       for ent in df.pop('depth')]
df.join(pd.DataFrame(res))
instrument_token last_price change buy_price sell_price
0 17600770 180.75 20.500000 1 1
1 12615426 0.05 -50.000000 2 2
2 17543682 0.35 -89.062500 3 3
3 17565954 6.75 -10.000000 4 4
4 26077954 3.95 -14.130435 5 5
5 17599490 141.75 -2.241379 6 6
6 17566978 17.65 -1.671309 7 7
7 26075906 24.70 -16.554054 8 8
For actual dictionaries:
res = [{f"{x}_price": ent[x][0]['price']
        for x in ("buy", "sell")}
       for ent in df.pop('depth')]
# merge back to df
result = df.join(pd.DataFrame(res))

Create pandas dataframe from a series containing list of dictionaries

One of the columns of my pandas dataframe looks like this
>> df
Item
0 [{"id":A,"value":20},{"id":B,"value":30}]
1 [{"id":A,"value":20},{"id":C,"value":50}]
2 [{"id":A,"value":20},{"id":B,"value":30},{"id":C,"value":40}]
I want to expand it as
A B C
0 20 30 NaN
1 20 NaN 50
2 20 30 40
I tried
dfx = pd.DataFrame()
for i in range(df.shape[0]):
    df1 = pd.DataFrame(df.Item[i]).T
    header = df1.iloc[0]
    df1 = df1[1:]
    df1 = df1.rename(columns=header)
    dfx = dfx.append(df1)
But this takes a lot of time as my data is huge. What is the best way to do this?
My original json data looks like this:
[
  {
    '_id': '5b1284e0b840a768f5545ef6',
    'device': '0035sdf121',
    'customerId': '38',
    'variantId': '31',
    'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
    'item': [{'id': A, 'value': 20},
             {'id': B, 'value': 30},
             {'id': C, 'value': 50}]
  },
  {
    '_id': '5b1284e0b840a768f5545ef6',
    'device': '0035sdf121',
    'customerId': '38',
    'variantId': '31',
    'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
    'item': [{'id': A, 'value': 20},
             {'id': B, 'value': 30},
             {'id': C, 'value': 50}]
  },
  .............
]
I agree with @JeffH, you should really look at how you are constructing the DataFrame.
Assuming you are getting this from somewhere out of your control, you can convert it to your desired DataFrame with:
In []:
pd.DataFrame(df['Item'].apply(lambda r: {d['id']: d['value'] for d in r}).values.tolist())
Out[]:
A B C
0 20 30.0 NaN
1 20 NaN 50.0
2 20 30.0 40.0
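An alternative sketch with explode (assuming pandas >= 0.25 and the same df['Item'] column of lists of dicts; the names s, flat and wide_items are illustrative):
s = df['Item'].explode()                         # one row per inner dict, original index repeated
flat = pd.DataFrame(s.tolist(), index=s.index)   # columns: id, value
wide_items = flat.set_index('id', append=True)['value'].unstack('id')
print(wide_items)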

How to unpack an object of dictionaries to a range of Data Frames

I am creating a function that grabs data from an ERP system to display to the end user.
I want to unpack an object of dictionaries and create a range of Pandas DataFrames with them.
For example, I have:
troRows
{0: [{'productID': 134336, 'price': '10.0000', 'amount': '1', 'cost': 0}],
1: [{'productID': 142141, 'price': '5.5000', 'amount': '4', 'cost': 0}],
2: [{'productID': 141764, 'price': '5.5000', 'amount': '1', 'cost': 0}],
3: [{'productID': 81661, 'price': '4.5000', 'amount': '1', 'cost': 0}],
4: [{'productID': 146761, 'price': '5.5000', 'amount': '1', 'cost': 0}],
5: [{'productID': 143585, 'price': '5.5900', 'amount': '9', 'cost': 0}],
6: [{'productID': 133018, 'price': '5.0000', 'amount': '1', 'cost': 0}],
7: [{'productID': 146250, 'price': '13.7500', 'amount': '5', 'cost': 0}],
8: [{'productID': 149986, 'price': '5.8900', 'amount': '2', 'cost': 0},
{'productID': 149790, 'price': '4.9900', 'amount': '2', 'cost': 0},
{'productID': 149972, 'price': '5.2900', 'amount': '2', 'cost': 0},
{'productID': 149248, 'price': '2.0000', 'amount': '2', 'cost': 0},
{'productID': 149984, 'price': '4.2000', 'amount': '2', 'cost': 0},
Each time, the function will need to unpack some number of dictionaries, which may have different numbers of rows, into a range of DataFrames.
So for example, this range of Dictionaries would return
DF0, DF1, DF2, DF3, DF4, DF5, DF6, DF7, DF8.
I can unpack a single Dictionary with:
pd.DataFrame(troRows[8])
which returns
amount cost price productID
0 2 0 5.8900 149986
1 2 0 4.9900 149790
2 2 0 5.2900 149972
3 2 0 2.0000 149248
4 2 0 4.2000 149984
How can I structure my code so that it does this for all the dictionaries for me?
Solution for a dictionary of DataFrames - use a dictionary comprehension, with the keys of troRows as the keys of the result:
dfs = {k: pd.DataFrame(v) for k, v in troRows.items()}
print (dfs)
{0: amount cost price productID
0 1 0 10.0000 134336, 1: amount cost price productID
0 4 0 5.5000 142141, 2: amount cost price productID
0 1 0 5.5000 141764, 3: amount cost price productID
0 1 0 4.5000 81661, 4: amount cost price productID
0 1 0 5.5000 146761, 5: amount cost price productID
0 9 0 5.5900 143585, 6: amount cost price productID
0 1 0 5.0000 133018, 7: amount cost price productID
0 5 0 13.7500 146250, 8: amount cost price productID
0 2 0 5.8900 149986
1 2 0 4.9900 149790
2 2 0 5.2900 149972
3 2 0 2.0000 149248
4 2 0 4.2000 149984}
print (dfs[8])
amount cost price productID
0 2 0 5.8900 149986
1 2 0 4.9900 149790
2 2 0 5.2900 149972
3 2 0 2.0000 149248
4 2 0 4.2000 149984
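If you later need all of these pieces in a single frame, a minimal follow-up sketch is to concatenate the dictionary, which keeps the original keys as the outer level of a MultiIndex:
combined = pd.concat(dfs)   # outer index level = the keys 0..8 of troRows
print(combined.loc[8])      # same rows as dfs[8]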
Solutions for one DataFrame:
Use a list comprehension with flattening and pass it to the DataFrame constructor:
troRows = pd.Series([[{'productID': 134336, 'price': '10.0000', 'amount': '1', 'cost': 0}],
[{'productID': 142141, 'price': '5.5000', 'amount': '4', 'cost': 0}],
[{'productID': 141764, 'price': '5.5000', 'amount': '1', 'cost': 0}],
[{'productID': 81661, 'price': '4.5000', 'amount': '1', 'cost': 0}],
[{'productID': 146761, 'price': '5.5000', 'amount': '1', 'cost': 0}],
[{'productID': 143585, 'price': '5.5900', 'amount': '9', 'cost': 0}],
[{'productID': 133018, 'price': '5.0000', 'amount': '1', 'cost': 0}],
[{'productID': 146250, 'price': '13.7500', 'amount': '5', 'cost': 0}],
[{'productID': 149986, 'price': '5.8900', 'amount': '2', 'cost': 0},
{'productID': 149790, 'price': '4.9900', 'amount': '2', 'cost': 0},
{'productID': 149972, 'price': '5.2900', 'amount': '2', 'cost': 0},
{'productID': 149248, 'price': '2.0000', 'amount': '2', 'cost': 0},
{'productID': 149984, 'price': '4.2000', 'amount': '2', 'cost': 0}]])
df = pd.DataFrame([y for x in troRows for y in x])
Another way to flatten your data is to use chain.from_iterable:
from itertools import chain
df = pd.DataFrame(list(chain.from_iterable(troRows)))
print (df)
amount cost price productID
0 1 0 10.0000 134336
1 4 0 5.5000 142141
2 1 0 5.5000 141764
3 1 0 4.5000 81661
4 1 0 5.5000 146761
5 9 0 5.5900 143585
6 1 0 5.0000 133018
7 5 0 13.7500 146250
8 2 0 5.8900 149986
9 2 0 4.9900 149790
10 2 0 5.2900 149972
11 2 0 2.0000 149248
12 2 0 4.2000 149984

pandas dataframe convert values in array of objects

I want to convert the below pandas data frame
data = pd.DataFrame([[1,2], [5,6]], columns=['10+', '20+'], index=['A', 'B'])
data.index.name = 'City'
data.columns.name= 'Age Group'
print data
Age Group 10+ 20+
City
A 1 2
B 5 6
in to an array of dictionaries, like
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
I am able to get the above expected result using the following loops
result = []
cols_name = data.columns.name
index_names = data.index.name
for index in data.index:
    for col in data.columns:
        result.append({cols_name: col, index_names: index, 'count': data.loc[index, col]})
Is there any better way of doing this? Since my original data will have a large number of records, using for loops will take more time.
I think you can use stack with reset_index to reshape, and finally to_dict:
print (data.stack().reset_index(name='count'))
City Age Group count
0 A 10+ 1
1 A 20+ 2
2 B 10+ 5
3 B 20+ 6
print (data.stack().reset_index(name='count').to_dict(orient='records'))
[
{'Age Group': '10+', 'City': 'A', 'count': 1},
{'Age Group': '20+', 'City': 'A', 'count': 2},
{'Age Group': '10+', 'City': 'B', 'count': 5},
{'Age Group': '20+', 'City': 'B', 'count': 6}
]
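A minimal alternative sketch with melt gives the same records (possibly in a different order); here the column names are passed explicitly instead of being taken from the stored index/columns names:
records = (data.reset_index()
               .melt(id_vars='City', var_name='Age Group', value_name='count')
               .to_dict(orient='records'))
print(records)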
