I have this reproducible data set where I need to add a column based on the source of the 'best' usage.
import pandas as pd

df_in = pd.DataFrame({
'year': [ 5, 5, 5,
10, 10,
15, 15,
30, 30, 30 ],
'usage': ['farm', 'best', '',
'manual', 'best',
'best', 'city',
'random', 'best', 'farm' ],
'value': [0.825, 0.83, 0.85,
0.935, 0.96,
1.12, 1.305,
1.34, 1.34, 1.455],
'source': ['wood', 'metal', 'water',
'metal', 'water',
'wood', 'water',
'wood', 'metal', 'water' ]})
Desired outcome:
print(df)
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
Is there a way to do that without grouping? Currently, I'm using:
grouped = df_in.groupby('usage').get_group('best')
grouped = grouped.rename(columns={'source': 'best'})
df = df_in.merge(grouped[['year','best']],how='outer', on='year')
You could just query:
df_in.merge(df_in.query('usage=="best"')[['year','source']]
.drop_duplicates('year') # you might not need/want this line if `best` is unique per year (or doesn't need to be in the output)
.rename(columns={'source':'best'}),
on='year', how='left')
Output:
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
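If you also want the merge to fail loudly should some year ever contain more than one 'best' row, merge supports a validate argument (a sketch along the same lines, not part of the original answer):
out = df_in.merge(df_in.query('usage == "best"')[['year', 'source']]
                       .rename(columns={'source': 'best'}),
                  on='year', how='left', validate='m:1')  # raises MergeError if a year has duplicate 'best' rows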
Here is a way using .loc and .map():
(df_in.assign(best=df_in['year']
              .map(df_in.loc[df_in['usage'].eq('best'), ['year', 'source']]
                   .set_index('year')
                   .squeeze())))
Output:
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
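A caveat worth noting for the .map approach (an addition, not part of the original answer): if some year had two 'best' rows, set_index('year') would produce a duplicate index and .map would raise an InvalidIndexError. Dropping duplicates first guards against that:
best_by_year = (df_in.loc[df_in['usage'].eq('best'), ['year', 'source']]
                     .drop_duplicates('year')   # keep one 'best' source per year
                     .set_index('year')['source'])
df_out = df_in.assign(best=df_in['year'].map(best_by_year))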
Context:
I have a pandas dataframe with 7 columns (taste, color, temperature, texture, shape, age_of_participant, name_of_participant).
Of the 7 columns, taste, color, temperature, texture and shape can have overlapping values across multiple rows (i.e. taste could be sour for more than one row).
I'm trying to collapse all the rows into the lowest number of combinations given the taste, color, temperature, texture and shape values while ignoring NaNs (in other words, overwriting them). The next part is to map each of these collapsed rows back to the original rows.
Mock data set:
data_set = [
{'color':'brown', 'age_of_participant':23, 'name_of_participant':'feb'},
{'taste': 'sour', 'color':'green', 'temperature': 'hot', 'age_of_participant':16,'name_of_participant': 'joe'},
{'taste': 'sour', 'color':'green', 'texture':'soft', 'age_of_participant':17,'name_of_participant': 'jane'},
{'color':'green','age_of_participant':18,'name_of_participant': 'jeff'},
{'taste': 'sweet', 'color':'red', 'age_of_participant':19,'name_of_participant': 'joke'},
{'taste': 'sweet', 'temperature': 'cold', 'age_of_participant':20,'name_of_participant': 'jolly'},
{'taste': 'salty', 'color':'purple', 'texture':'soft', 'age_of_participant':21,'name_of_participant': 'jupyter'},
{'taste': 'salty', 'color':'brown', 'age_of_participant':22,'name_of_participant': 'january'}
]
import pandas as pd
import random
data_set = random.sample(data_set, k=len(data_set))
data_frame = pd.DataFrame(data_set)
print(data_frame)
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot NaN
1 17 green jane sour NaN soft
2 18 green jeff NaN NaN NaN
3 19 red joke sweet NaN NaN
4 20 NaN jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
What I've attempted:
# These columns are used to do the grouping since age_of_participant and name_of_participant are unique per row
values_that_can_be_grouped = ['taste', 'color','temperature','texture']
sub_set = data_frame[values_that_can_be_grouped].drop_duplicates().reset_index(drop=False)
my_unique_set = sub_set.groupby('taste', as_index=False).first()
print(my_unique_set)
taste index color temperature texture
0 2 green
1 salty 6 brown
2 sour 1 green soft
3 sweet 4 cold
At this point I'm not quite sure how I can map the rows above to all the original rows except for indices 2, 6, 1, 4. I checked the pandas code and it doesn't look like the other indices are preserved anywhere.
What I'm trying to achieve:
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot soft
1 17 green jane sour hot soft
2 18 green jeff sour hot soft
3 19 red joke sweet cold NaN
4 20 red jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
data_frame.assign(color=data_frame.color.ffill()).groupby('color').apply(lambda x: x.ffill().bfill())
Out[1089]:
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot soft
1 17 green jane sour hot soft
2 18 green jeff sour hot soft
3 19 red joke sweet cold NaN
4 20 red jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
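The same idea can also be written with transform, which returns a like-shaped result per column (a sketch under the same assumption that the forward-filled color identifies each group):
cols = ['taste', 'temperature', 'texture']
filled = data_frame.assign(color=data_frame['color'].ffill())
filled[cols] = filled.groupby('color')[cols].transform(lambda s: s.ffill().bfill())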
IIUC, I feel that using ffill and bfill within each taste and color group, then grouping by both, is safer here:
df = data_frame.copy()  # work on a copy of the question's frame
df.taste.fillna(df.groupby('color').taste.apply(lambda x: x.ffill().bfill()), inplace=True)
df.color.fillna(df.groupby('taste').color.apply(lambda x: x.ffill().bfill()), inplace=True)
df = df.groupby(['color','taste']).apply(lambda x: x.ffill().bfill())
df
age_of_participant color ... temperature texture
0 16 green ... hot soft
1 17 green ... hot soft
2 18 green ... hot soft
3 19 red ... cold NaN
4 20 red ... cold NaN
5 21 purple ... NaN soft
6 22 brown ... NaN NaN
[7 rows x 6 columns]
I am working on a car-sale data set having the columns: 'car', 'price', 'body', 'mileage', 'engV', 'engType', 'registration', 'year', 'model', 'drive'.
The 'drive' and 'engType' columns have NaN (missing) values. I want to calculate the mode of, say, 'drive' within each ['car', 'model'] group, and then replace the NaN values in each group with that group's mode.
I have tried these methods:
For numeric data:
carsale['engV2'] = (carsale.groupby(['car','body','model']))['engV'].transform(lambda x: x.fillna(x.median()))
This works fine, filling/replacing the data accurately.
For categorical data:
carsale['driveT'] = (carsale.groupby(['car','model']))['drive'].transform(lambda x: x.fillna(x.mode()))
carsale['driveT'] = (carsale.groupby(['car','model']))['drive'].transform(lambda x: x.fillna(pd.Series.mode(x)))
Both give the same results.
Here is the full code:
# carsale['price2'] = (carsale.groupby(['car','model','year']))['price'].transform(lambda x: x.fillna(x.median()))
# carsale['engV2'] = (carsale.groupby(['car','body','model']))['engV'].transform(lambda x: x.fillna(x.median()))
# carsale['mileage2'] = (carsale.groupby(['car','model','year']))['mileage'].transform(lambda x: x.fillna(x.median()))
# mode = carsale.filter(['car','drive']).mode()
# carsale[['test1','test2']] = carsale[['car','engType']].fillna(carsale.mode().iloc[0])

carsale.groupby(['car', 'model'])['engType'].apply(pd.Series.mode)

# carsale.apply()
# carsale
# carsale['engType2'] = carsale.groupby('car').engType.transform(lambda x: x.fillna(x.mode()))

carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'].transform(lambda x: x.fillna(x.mode()))
carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'].transform(lambda x: x.fillna(pd.Series.mode(x)))

# carsale[carsale.car == 'Mercedes-Benz'].sort_values(['body','engType','model','mileage']).tail(50)
# carsale[carsale.engV.isnull()]
# carsale.sort_values(['car','body','model'])

carsale
Both methods above give the same result: they just copy values into the new driveT column exactly as they appear in the original 'drive' column. If some index has NaN in 'drive', driveT shows the same NaN, and likewise for the other values.
But for the numerical data, applying the median adds/replaces the correct value.
So it is not actually calculating the mode over the ['car', 'model'] groups; it seems to compute the mode of single values in 'drive' instead. Yet if you run this command:
carsale.groupby(['car','model'])['engType'].apply(pd.Series.mode)
it correctly calculates the mode per (car, model) group.
Can anyone help in this matter?
My approach was to:
Use .groupby() to create a look-up dataframe that contains the mode of the drive feature for each car/model combo.
Write a method that looks up the mode in this dataframe and returns it for a given car/model, when that car/model's value in drive is null.
However, it turned out there were two key corner cases specific to the OP's dataset that needed to be handled:
When a particular car/model combo has no mode (because all entries in the drive column for this combo were NaN).
When a particular car brand has no mode.
Below are the steps I followed. I begin with an example extended from the first several rows of the sample dataframe in the question:
import numpy as np
import pandas as pd

carsale = pd.DataFrame({'car': ['Ford', 'Mercedes-Benz', 'Mercedes-Benz', 'Mercedes-Benz', 'Mercedes-Benz', 'Nissan', 'Honda','Renault', 'Mercedes-Benz', 'Mercedes-Benz', 'Toyota', 'Toyota', 'Ferrari'],
'price': [15500.000, 20500.000, 35000.000, 17800.000, 33000.000, 16600.000, 6500.000, 10500.000, 21500.000, 21500.000, 1280.000, 2005.00, 300000.000],
'body': ['crossover', 'sedan', 'other', 'van', 'vagon', 'crossover', 'sedan', 'vagon', 'sedan', 'sedan', 'compact', 'compact', 'sport'],
'mileage': [68.0, 173.0, 135.0, 162.0, 91.0, 83.0, 199.0, 185.0, 146.0, 146.0, 200.0, 134, 123.0],
'engType': ['Gas', 'Gas', 'Petrol', 'Diesel', np.nan, 'Petrol', 'Petrol', 'Diesel', 'Gas', 'Gas', 'Hybrid', 'Gas', 'Gas'],
'registration':['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes'],
'year': [2010, 2011, 2008, 2012, 2013, 2013, 2003, 2011, 2012, 2012, 2009, 2003, 1988],
'model': ['Kuga', 'E-Class', 'CL 550', 'B 180', 'E-Class', 'X-Trail', 'Accord', 'Megane', 'E-Class', 'E-Class', 'Prius', 'Corolla', 'Testarossa'],
'drive': ['full', 'rear', 'rear', 'front', np.nan, 'full', 'front', 'front', 'rear', np.nan, np.nan, 'front', np.nan],
})
carsale
car price body mileage engType registration year model drive
0 Ford 15500.0 crossover 68.0 Gas yes 2010 Kuga full
1 Mercedes-Benz 20500.0 sedan 173.0 Gas yes 2011 E-Class rear
2 Mercedes-Benz 35000.0 other 135.0 Petrol yes 2008 CL 550 rear
3 Mercedes-Benz 17800.0 van 162.0 Diesel yes 2012 B 180 front
4 Mercedes-Benz 33000.0 vagon 91.0 NaN yes 2013 E-Class NaN
5 Nissan 16600.0 crossover 83.0 Petrol yes 2013 X-Trail full
6 Honda 6500.0 sedan 199.0 Petrol yes 2003 Accord front
7 Renault 10500.0 vagon 185.0 Diesel yes 2011 Megane front
8 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class rear
9 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class NaN
10 Toyota 1280.0 compact 200.0 Hybrid yes 2009 Prius NaN
11 Toyota 2005.0 compact 134.0 Gas yes 2003 Corolla front
12 Ferrari 300000.0 sport 123.0 Gas yes 1988 Testarossa NaN
Create a dataframe that shows the mode of the drive feature for each car/model combination.
If a car/model combo has no mode (such as the row with the Toyota Prius), I fill with the mode of that particular car brand (Toyota).
However, if the car brand itself (such as Ferrari in my example) has no mode, I fill with the dataset's mode for the drive feature.
def get_drive_mode(x):
    brand = x.name[0]
    if x.count() > 0:
        return x.mode()  # Return the mode for a brand/model if it exists.
    elif carsale.groupby(['car'])['drive'].count()[brand] > 0:
        # The brand/model combo has no mode, but the brand itself does:
        brand_mode = carsale.groupby(['car'])['drive'].apply(lambda x: x.mode())[brand]
        return brand_mode
    else:
        # Otherwise return the dataset's mode for the 'drive' feature.
        return carsale['drive'].mode()
drive_modes = carsale.groupby(['car','model'])['drive'].apply(get_drive_mode).reset_index().drop('level_2', axis=1)
drive_modes.rename(columns={'drive': 'drive_mode'}, inplace=True)
drive_modes
car model drive_mode
0 Ferrari Testarossa front
1 Ford Kuga full
2 Honda Accord front
3 Mercedes-Benz B 180 front
4 Mercedes-Benz CL 550 rear
5 Mercedes-Benz E-Class rear
6 Nissan X-Trail full
7 Renault Megane front
8 Toyota Corolla front
9 Toyota Prius front
Write a method that looks up the drive mode value for a given car/model in a given row if that row's value for drive is NaN:
def fill_with_mode(x):
    if pd.isnull(x['drive']):
        return drive_modes[(drive_modes['car'] == x['car']) &
                           (drive_modes['model'] == x['model'])]['drive_mode'].values[0]
    else:
        return x['drive']
Apply the above method to the rows in the carsale dataframe in order to create the driveT feature:
carsale['driveT'] = carsale.apply(fill_with_mode, axis=1)
del(drive_modes)
Which results in the following dataframe:
carsale
car price body mileage engType registration year model drive driveT
0 Ford 15500.0 crossover 68.0 Gas yes 2010 Kuga full full
1 Mercedes-Benz 20500.0 sedan 173.0 Gas yes 2011 E-Class rear rear
2 Mercedes-Benz 35000.0 other 135.0 Petrol yes 2008 CL 550 rear rear
3 Mercedes-Benz 17800.0 van 162.0 Diesel yes 2012 B 180 front front
4 Mercedes-Benz 33000.0 vagon 91.0 NaN yes 2013 E-Class NaN rear
5 Nissan 16600.0 crossover 83.0 Petrol yes 2013 X-Trail full full
6 Honda 6500.0 sedan 199.0 Petrol yes 2003 Accord front front
7 Renault 10500.0 vagon 185.0 Diesel yes 2011 Megane front front
8 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class rear rear
9 Mercedes-Benz 21500.0 sedan 146.0 Gas yes 2012 E-Class NaN rear
10 Toyota 1280.0 compact 200.0 Hybrid yes 2009 Prius NaN front
11 Toyota 2005.0 compact 134.0 Gas yes 2003 Corolla front front
12 Ferrari 300000.0 sport 123.0 Gas yes 1988 Testarossa NaN front
Notice that in rows 4 and 9 of the driveT column, the NaN value that was in the drive column has been replaced by the string rear, which as we would expect, is the mode of drive for a Mercedes E-Class.
Also, in row 10, since there is no mode for the Toyota Prius car/model combo, we fill with the mode for the Toyota brand, which is front.
Finally, in row 12, since there is no mode for the Ferrari car brand, we fill with the mode of the entire dataset's drive column, which is also front.
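As an aside, the likely reason the OP's transform version appeared to do nothing: x.mode() returns a Series (there can be several modal values), and Series.fillna aligns a Series argument by index, so the mode's fresh 0-based index rarely lines up with the group's row labels. Taking the first modal value as a scalar fixes that; a minimal sketch, ignoring the corner cases handled explicitly above:
def fill_group_mode(x):
    m = x.mode()  # modal value(s) of the group; empty if the group is all NaN
    return x.fillna(m.iloc[0]) if not m.empty else x

carsale['driveT'] = carsale.groupby(['car', 'model'])['drive'].transform(fill_group_mode)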
I have an Excel file with some (mostly) nicely grouped rows. I built a fake example below.
Is there a way to get read_excel in pandas to produce a MultiIndex preserving this structure?
For this example the MultiIndex would have four levels (Family, Individual, Child (optional), investment). If the subtotal values were lost that would be fine as they can easily be recreated in Pandas.
No, pandas can't read such a structure.
An alternative solution is to use pandas to read your data, but transform this into an easily accessible dictionary, rather than keeping your data in a dataframe with MultiIndex.
There are 2 sensible requirements to make your data more usable:
Make your investment fund names unique. This is trivial.
Convert your Excel grouping to an additional column which indicates the parent of the row.
In the below example, these 2 requirements are assumed.
Setup
from collections import defaultdict
from functools import reduce
import operator
import pandas as pd
df = pd.DataFrame({'name': ['Simpson Family', 'Marge Simpson', 'Maggies College Fund',
'MCF Investment 2', 'MS Investment 1', 'MS Investment 2', 'MS Investment 3',
'Homer Simpson', 'HS Investment 1', 'HS Investment 3', 'HS Investment 2',
'Griffin Family', 'Lois Griffin', 'LG Investment 2', 'LG Investment 3',
'Brian Giffin', 'BG Investment 3'],
'Value': [600, 450, 100, 100, 100, 200, 50, 150, 100, 50, 0, 200, 150, 100, 50, 50, 50],
'parent': ['Families', 'Simpson Family', 'Marge Simpson', 'Maggies College Fund',
'Marge Simpson', 'Marge Simpson', 'Marge Simpson', 'Simpson Family',
'Homer Simpson', 'Homer Simpson', 'Homer Simpson', 'Families',
'Griffin Family', 'Lois Griffin', 'Lois Griffin', 'Griffin Family',
'Brian Giffin']})
Value name parent
0 600 Simpson Family Families
1 450 Marge Simpson Simpson Family
2 100 Maggies College Fund Marge Simpson
3 100 MCF Investment 2 Maggies College Fund
4 100 MS Investment 1 Marge Simpson
5 200 MS Investment 2 Marge Simpson
6 50 MS Investment 3 Marge Simpson
7 150 Homer Simpson Simpson Family
8 100 HS Investment 1 Homer Simpson
9 50 HS Investment 3 Homer Simpson
10 0 HS Investment 2 Homer Simpson
11 200 Griffin Family Families
12 150 Lois Griffin Griffin Family
13 100 LG Investment 2 Lois Griffin
14 50 LG Investment 3 Lois Griffin
15 50 Brian Giffin Griffin Family
16 50 BG Investment 3 Brian Giffin
Step 1
Define a child -> parent dictionary and some utility functions:
child_parent_dict = df.set_index('name')['parent'].to_dict()
tree = lambda: defaultdict(tree)
d = tree()
def get_all_parents(child):
    """Get all parents from hierarchy structure"""
    while child != 'Families':
        child = child_parent_dict[child]
        if child != 'Families':
            yield child

def getFromDict(dataDict, mapList):
    """Iterate nested dictionary"""
    return reduce(operator.getitem, mapList, dataDict)

def default_to_regular_dict(d):
    """Convert nested defaultdict to regular dict of dicts."""
    if isinstance(d, defaultdict):
        d = {k: default_to_regular_dict(v) for k, v in d.items()}
    return d
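For reference, tree() simply builds an arbitrarily nested defaultdict, so intermediate keys spring into existence on first access (a small illustration, using a throwaway name):
demo = tree()
demo['a']['b']['c'] = 1  # no KeyError at any level; each lookup creates a new nested defaultdict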
Step 2
Apply this to your dataframe. Use it to create a nested dictionary structure which will be more efficient for repeated queries.
df['structure'] = df['name'].apply(lambda x: ['Families'] + list(get_all_parents(x))[::-1])
for idx, row in df.iterrows():
    getFromDict(d, row['structure'])[row['name']]['Value'] = row['Value']
res = default_to_regular_dict(d)
Result
Dataframe
Value name parent \
0 600 Simpson Family Families
1 450 Marge Simpson Simpson Family
2 100 Maggies College Fund Marge Simpson
3 100 MCF Investment 2 Maggies College Fund
4 100 MS Investment 1 Marge Simpson
5 200 MS Investment 2 Marge Simpson
6 50 MS Investment 3 Marge Simpson
7 150 Homer Simpson Simpson Family
8 100 HS Investment 1 Homer Simpson
9 50 HS Investment 3 Homer Simpson
10 0 HS Investment 2 Homer Simpson
11 200 Griffin Family Families
12 150 Lois Griffin Griffin Family
13 100 LG Investment 2 Lois Griffin
14 50 LG Investment 3 Lois Griffin
15 50 Brian Giffin Griffin Family
16 50 BG Investment 3 Brian Giffin
structure
0 [Families]
1 [Families, Simpson Family]
2 [Families, Simpson Family, Marge Simpson]
3 [Families, Simpson Family, Marge Simpson, Magg...
4 [Families, Simpson Family, Marge Simpson]
5 [Families, Simpson Family, Marge Simpson]
6 [Families, Simpson Family, Marge Simpson]
7 [Families, Simpson Family]
8 [Families, Simpson Family, Homer Simpson]
9 [Families, Simpson Family, Homer Simpson]
10 [Families, Simpson Family, Homer Simpson]
11 [Families]
12 [Families, Griffin Family]
13 [Families, Griffin Family, Lois Griffin]
14 [Families, Griffin Family, Lois Griffin]
15 [Families, Griffin Family]
16 [Families, Griffin Family, Brian Giffin]
Dictionary
{'Families': {'Griffin Family': {'Brian Giffin': {'BG Investment 3': {'Value': 50},
'Value': 50},
'Lois Griffin': {'LG Investment 2': {'Value': 100}, 'LG Investment 3': {'Value': 50},
'Value': 150},
'Value': 200},
'Simpson Family': {'Homer Simpson': {'HS Investment 1': {'Value': 100}, 'HS Investment 2': {'Value': 0}, 'HS Investment 3': {'Value': 50},
'Value': 150},
'Marge Simpson': {'MS Investment 1': {'Value': 100}, 'MS Investment 2': {'Value': 200}, 'MS Investment 3': {'Value': 50},
'Maggies College Fund': {'MCF Investment 2': {'Value': 100},
'Value': 100},
'Value': 450},
'Value': 600}}}
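A repeated query against this structure can reuse the getFromDict helper, for example (value taken from the result above):
getFromDict(res, ['Families', 'Simpson Family', 'Marge Simpson'])['Value']  # 450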
I don't think it is possible to implement this using read_excel as-is.
What you can do is add additional columns to your Excel sheet based on the four hierarchy levels (Family, Individual, Child (optional), Investment) and then use read_excel() with index_col=[0,1,2,3] to generate the pandas dataframe.
See the index_col parameter of the read_excel function.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
index_col : int, list of ints, default None
Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.
To avoid duplicates of the same user, I want to neatly organize a nested dictionary of {user: [artist1, artist2, artist3, ...]} using the pandas groupby function. Here is sample data (my instinct tells me to chain an agg func?)
...like df.groupby('users')?
users artist
0 00001411dc427966b17297bf4d69e7e193135d89 the most serene republic
1 00001411dc427966b17297bf4d69e7e193135d89 stars
2 00001411dc427966b17297bf4d69e7e193135d89 broken social scene
3 00001411dc427966b17297bf4d69e7e193135d89 have heart
4 00001411dc427966b17297bf4d69e7e193135d89 luminous orange
5 00001411dc427966b17297bf4d69e7e193135d89 boris
6 00001411dc427966b17297bf4d69e7e193135d89 arctic monkeys
7 00001411dc427966b17297bf4d69e7e193135d89 bright eyes
8 00001411dc427966b17297bf4d69e7e193135d89 coaltar of the deepers
9 00001411dc427966b17297bf4d69e7e193135d89 polar bear club
10 00001411dc427966b17297bf4d69e7e193135d89 the libertines
11 00001411dc427966b17297bf4d69e7e193135d89 death from above 1979
12 00001411dc427966b17297bf4d69e7e193135d89 owl city
13 00001411dc427966b17297bf4d69e7e193135d89 coldplay
14 00001411dc427966b17297bf4d69e7e193135d89 okkervil river
15 00001411dc427966b17297bf4d69e7e193135d89 jim sturgess
16 00001411dc427966b17297bf4d69e7e193135d89 deerhoof
17 00001411dc427966b17297bf4d69e7e193135d89 fear before the march of flames
18 00001411dc427966b17297bf4d69e7e193135d89 breathe carolina
19 00001411dc427966b17297bf4d69e7e193135d89 mstrkrft
I believe you're looking for groupby + agg here.
df.groupby('users').artist.apply(list).to_dict()
{'00001411dc427966b17297bf4d69e7e193135d89': ['the most serene republic',
'stars',
'broken social scene',
'have heart',
'luminous orange',
'boris',
...
]
}
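An equivalent spelling that uses agg directly, matching the groupby + agg suggestion (it should behave the same here):
df.groupby('users')['artist'].agg(list).to_dict()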
I have a dictionary of values:
{'Spanish Omlette': -0.20000000000000284,
'Crumbed Chicken Salad': -1.2999999999999972,
'Chocolate Bomb': 0.0,
'Seed Nut Muesli': -3.8999999999999915,
'Fruit': -1.2999999999999972,
'Frikerdels Salad': -1.2000000000000028,
'Seed Nut Cheese Biscuits': 0.4000000000000057,
'Chorizo Pasta': -2.0,
'No carbs Ice Cream': 0.4000000000000057,
'Veg Stew': 0.4000000000000057,
'Bulgar spinach Salad': 0.10000000000000853,
'Mango Cheese': 0.10000000000000853,
'Crumbed Calamari chips': 0.10000000000000853,
'Slaw Salad': 0.20000000000000284,
'Mango': -1.2000000000000028,
'Rice & Fish': 0.20000000000000284,
'Almonds Cheese': -0.09999999999999432,
'Nectarine': -1.7000000000000028,
'Banana Cheese': 0.7000000000000028,
'Mediteranean Salad': 0.7000000000000028,
'Almonds': -4.099999999999994}
I am trying to get the aggregated sum of the values of each food item from the dictionary using Pandas:
fooddata = pd.DataFrame(list(foodWeight.items()), columns=['food','weight']).groupby('food')['weight'].agg(['sum']).sort_values(by='sum', ascending=0)
The above code gives the correct output:
sum
food
Banana Cheese 0.7
Mediteranean Salad 0.7
Seed Nut Cheese Biscuits 0.4
Veg Stew 0.4
No carbs Ice Cream 0.4
Slaw Salad 0.2
Rice & Fish 0.2
Almonds Mango 0.1
Bulgar spinach Salad 0.1
Crumbed Calamari chips 0.1
Frikkadels Salad 0.1
Mango Cheese 0.1
Chocolate Bomb 0.0
Burrito Salad 0.0
Fried Eggs Cheese Avocado 0.0
Burger and Chips -0.1
Traditional Breakfast -0.1
Almonds Cheese -0.1
However, I need the output in two columns, not the one that pandas gives me above.
How do I get the output into a format in which I can plot the data, i.e. with label and value as separate columns?
Set as_index=False when calling groupby:
fooddata = pd.DataFrame(list(foodWeight.items()), columns=['food','weight']).groupby('food',as_index=False).agg({"weight":"sum"}).sort_values(by='weight', ascending=0)
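Since as_index=False keeps food as a regular column, the result can be handed straight to a plotting call (a sketch):
fooddata.plot.barh(x='food', y='weight')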
You can use parameter as_index=False in groupby and aggregate sum:
fooddata = pd.DataFrame(list(foodWeight.items()), columns=['food','weight'])
print (fooddata.groupby('food', as_index=False)['weight']
.sum()
.sort_values(by='weight', ascending=0))
food weight
2 Banana Cheese 0.7
12 Mediteranean Salad 0.7
20 Veg Stew 0.4
14 No carbs Ice Cream 0.4
16 Seed Nut Cheese Biscuits 0.4
18 Slaw Salad 0.2
15 Rice & Fish 0.2
3 Bulgar spinach Salad 0.1
6 Crumbed Calamari chips 0.1
11 Mango Cheese 0.1
4 Chocolate Bomb 0.0
1 Almonds Cheese -0.1
19 Spanish Omlette -0.2
10 Mango -1.2
8 Frikerdels Salad -1.2
9 Fruit -1.3
7 Crumbed Chicken Salad -1.3
13 Nectarine -1.7
5 Chorizo Pasta -2.0
17 Seed Nut Muesli -3.9
0 Almonds -4.1
Another solution is to add reset_index:
print (fooddata.groupby('food')['weight']
.sum()
.sort_values(ascending=0)
.reset_index(name='sum'))
food sum
0 Banana Cheese 0.7
1 Mediteranean Salad 0.7
2 Veg Stew 0.4
3 Seed Nut Cheese Biscuits 0.4
4 No carbs Ice Cream 0.4
5 Slaw Salad 0.2
6 Rice & Fish 0.2
7 Crumbed Calamari chips 0.1
8 Mango Cheese 0.1
9 Bulgar spinach Salad 0.1
10 Chocolate Bomb 0.0
11 Almonds Cheese -0.1
12 Spanish Omlette -0.2
13 Mango -1.2
14 Frikerdels Salad -1.2
15 Crumbed Chicken Salad -1.3
16 Fruit -1.3
17 Nectarine -1.7
18 Chorizo Pasta -2.0
19 Seed Nut Muesli -3.9
20 Almonds -4.1
For plotting it is better not to reset the index - the index values then form the x-axis - use plot:
fooddata.groupby('food')['weight'].sum().sort_values(ascending=0).plot()
Or if you need a horizontal bar plot, use plot.barh:
fooddata.groupby('food')['weight'].sum().sort_values(ascending=0).plot.barh()
After the grouping you need to reset the index or use as_index=False when calling groupby. Paraphrasing this post, by default aggregation functions will not return the groups that you are aggregating over if they are named columns; instead the grouped columns will be the indices of the returned object. Passing as_index=False or calling reset_index afterwards will return the groups that you are aggregating over, if they are named columns.
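A quick illustration of the difference, using a raw (pre-grouping) frame called raw here (a hypothetical name):
g = raw.groupby('food')['weight'].sum()                       # 'food' becomes the index
g_flat = raw.groupby('food', as_index=False)['weight'].sum()  # 'food' stays a column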
See below my attempt to turn your results into a meaningful graph:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
df = fooddata.reset_index()
ax = df[['food','sum']].plot(kind='barh', title ="Total Sum per Food Item", figsize=(15, 10), legend=True, fontsize=12)
ax.set_xlabel("Sum per Food Item", fontsize=12)
ax.set_ylabel("Food Items", fontsize=12)
ax.set_yticklabels(df['food'])
plt.show()
This results in a horizontal bar chart of the summed values per food item.