Pandas DataFrame groupby statement output to 2 columns - python

I have a dictionary of values:
{'Spanish Omlette': -0.20000000000000284,
'Crumbed Chicken Salad': -1.2999999999999972,
'Chocolate Bomb': 0.0,
'Seed Nut Muesli': -3.8999999999999915,
'Fruit': -1.2999999999999972,
'Frikerdels Salad': -1.2000000000000028,
'Seed Nut Cheese Biscuits': 0.4000000000000057,
'Chorizo Pasta': -2.0,
'No carbs Ice Cream': 0.4000000000000057,
'Veg Stew': 0.4000000000000057,
'Bulgar spinach Salad': 0.10000000000000853,
'Mango Cheese': 0.10000000000000853,
'Crumbed Calamari chips': 0.10000000000000853,
'Slaw Salad': 0.20000000000000284,
'Mango': -1.2000000000000028,
'Rice & Fish': 0.20000000000000284,
'Almonds Cheese': -0.09999999999999432,
'Nectarine': -1.7000000000000028,
'Banana Cheese': 0.7000000000000028,
'Mediteranean Salad': 0.7000000000000028,
'Almonds': -4.099999999999994}
I am trying to get the aggregated sum of the values of each food item from the dictionary using Pandas:
fooddata = (pd.DataFrame(list(foodWeight.items()), columns=['food', 'weight'])
              .groupby('food')['weight']
              .agg(['sum'])
              .sort_values(by='sum', ascending=False))
The above code gives the correct output:
sum
food
Banana Cheese 0.7
Mediteranean Salad 0.7
Seed Nut Cheese Biscuits 0.4
Veg Stew 0.4
No carbs Ice Cream 0.4
Slaw Salad 0.2
Rice & Fish 0.2
Almonds Mango 0.1
Bulgar spinach Salad 0.1
Crumbed Calamari chips 0.1
Frikkadels Salad 0.1
Mango Cheese 0.1
Chocolate Bomb 0.0
Burrito Salad 0.0
Fried Eggs Cheese Avocado 0.0
Burger and Chips -0.1
Traditional Breakfast -0.1
Almonds Cheese -0.1
However, I need the output in two columns, not the single column (with food as the index) that Pandas gives me above.
How do I get the output into a format that I can plot, i.e. with the label and the value as separate columns?

Set as_index=False when calling groupby:
fooddata = (pd.DataFrame(list(foodWeight.items()), columns=['food', 'weight'])
              .groupby('food', as_index=False)
              .agg({'weight': 'sum'})
              .sort_values(by='weight', ascending=False))
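With as_index=False, food is an ordinary column rather than the index, so label and value are directly addressable; a quick sanity check (assuming the foodWeight dictionary from the question):
print(fooddata.columns.tolist())  # ['food', 'weight']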

You can use the parameter as_index=False in groupby and aggregate with sum:
fooddata = pd.DataFrame(list(foodWeight.items()), columns=['food','weight'])
print(fooddata.groupby('food', as_index=False)['weight']
              .sum()
              .sort_values(by='weight', ascending=False))
food weight
2 Banana Cheese 0.7
12 Mediteranean Salad 0.7
20 Veg Stew 0.4
14 No carbs Ice Cream 0.4
16 Seed Nut Cheese Biscuits 0.4
18 Slaw Salad 0.2
15 Rice & Fish 0.2
3 Bulgar spinach Salad 0.1
6 Crumbed Calamari chips 0.1
11 Mango Cheese 0.1
4 Chocolate Bomb 0.0
1 Almonds Cheese -0.1
19 Spanish Omlette -0.2
10 Mango -1.2
8 Frikerdels Salad -1.2
9 Fruit -1.3
7 Crumbed Chicken Salad -1.3
13 Nectarine -1.7
5 Chorizo Pasta -2.0
17 Seed Nut Muesli -3.9
0 Almonds -4.1
Another solution is to add reset_index:
print(fooddata.groupby('food')['weight']
              .sum()
              .sort_values(ascending=False)
              .reset_index(name='sum'))
food sum
0 Banana Cheese 0.7
1 Mediteranean Salad 0.7
2 Veg Stew 0.4
3 Seed Nut Cheese Biscuits 0.4
4 No carbs Ice Cream 0.4
5 Slaw Salad 0.2
6 Rice & Fish 0.2
7 Crumbed Calamari chips 0.1
8 Mango Cheese 0.1
9 Bulgar spinach Salad 0.1
10 Chocolate Bomb 0.0
11 Almonds Cheese -0.1
12 Spanish Omlette -0.2
13 Mango -1.2
14 Frikerdels Salad -1.2
15 Crumbed Chicken Salad -1.3
16 Fruit -1.3
17 Nectarine -1.7
18 Chorizo Pasta -2.0
19 Seed Nut Muesli -3.9
20 Almonds -4.1
For plotting it is better not to reset the index - the index values then form the x axis - use plot:
fooddata.groupby('food')['weight'].sum().sort_values(ascending=False).plot()
Or if you need a horizontal bar plot, use plot.barh:
fooddata.groupby('food')['weight'].sum().sort_values(ascending=False).plot.barh()

After the grouping you need to reset the index or use as_index=False when calling groupby. Paraphrasing this post: by default, aggregation functions do not return the columns you group over when they are named columns; instead, the grouped columns become the index of the returned object. Passing as_index=False, or calling reset_index afterwards, returns the groups you aggregate over as regular columns.
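To make the difference concrete, here is a tiny sketch on toy data (not from the question) contrasting the two return shapes:
import pandas as pd

toy = pd.DataFrame({'food': ['a', 'a', 'b'], 'weight': [1, 2, 3]})

# default: the grouped column becomes the index of the result
print(toy.groupby('food')['weight'].sum())

# as_index=False: the grouped column stays a regular column
print(toy.groupby('food', as_index=False)['weight'].sum())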
See below my attempt to turn your results into a meaningful graph:
import pandas as pd
import matplotlib.pyplot as plt

# bring 'food' out of the index so it is available as a column
df = fooddata.reset_index()
ax = df[['food', 'sum']].plot(kind='barh', title="Total Sum per Food Item",
                              figsize=(15, 10), legend=True, fontsize=12)
ax.set_xlabel("Sum per Food Item", fontsize=12)
ax.set_ylabel("Food Items", fontsize=12)
ax.set_yticklabels(df['food'])  # replace the default integer ticks with the food names
plt.show()
This results in a horizontal bar chart of the total sum per food item.
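As a side note (not part of the original answer), DataFrame.plot also accepts x and y directly, which labels the ticks for you and makes set_yticklabels unnecessary:
ax = df.plot(x='food', y='sum', kind='barh', title="Total Sum per Food Item",
             figsize=(15, 10), legend=False, fontsize=12)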

Related

add a new column based on a group without grouping

I have this reproducible data set where I need to add a column based on the 'best' usage source.
df_in = pd.DataFrame({
'year': [ 5, 5, 5,
10, 10,
15, 15,
30, 30, 30 ],
'usage': ['farm', 'best', '',
'manual', 'best',
'best', 'city',
'random', 'best', 'farm' ],
'value': [0.825, 0.83, 0.85,
0.935, 0.96,
1.12, 1.305,
1.34, 1.34, 1.455],
'source': ['wood', 'metal', 'water',
'metal', 'water',
'wood', 'water',
'wood', 'metal', 'water' ]})
desired outcome:
print(df)
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
Is there a way to do that without grouping? Currently, I'm using:
grouped = df_in.groupby('usage').get_group('best')
grouped = grouped.rename(columns={'source': 'best'})
df = df_in.merge(grouped[['year','best']],how='outer', on='year')
You could just query:
df_in.merge(df_in.query('usage=="best"')[['year','source']]
.drop_duplicates('year') # you might not need/want this line if `best` is unique per year (or doesn't need to be in the output)
.rename(columns={'source':'best'}),
on='year', how='left')
Output:
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
Here is a way using .loc and .map()
(df_in.assign(best=df_in['year']
.map(df_in.loc[df_in['usage'].eq('best'),['year','source']]
.set_index('year')
.squeeze())))
Output:
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
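For reference, the lookup built inside map is just a year-to-source Series; a sketch of it in isolation, assuming the df_in from the question:
lookup = (df_in.loc[df_in['usage'].eq('best'), ['year', 'source']]
               .set_index('year')
               .squeeze())
print(lookup)
# year
# 5     metal
# 10    water
# 15     wood
# 30    metal
# Name: source, dtype: object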

pandas sort values to get which item is placed with the most quantity

How do I show which item is placed with the largest quantity in this data?
And how do I show which item is most ordered, grouped by choice_description?
import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
df = pd.read_csv(url, sep='\t')
My data
order_id quantity item_name choice_description item_price
1 1 Chips and Fresh Tomato Salsa NULL $2.39
1 1 Nantucket Nectar [Apple] $3.39
2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]] $16.98
3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]] $10.98
3 1 Side of Chips NULL $1.69
4 1 Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]] $11.75
4 1 Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Cheese, Sour Cream, Lettuce]] $9.25
...
...
If you want to show all the data:
df.sort_values('quantity', ascending=False)
Output:
order_id quantity item_name choice_description item_price
1443 15 Chips and Fresh Tomato Salsa NaN $44.25
1660 10 Bottled Water NaN $15.00
1559 8 Side of Chips NaN $13.52
1443 7 Bottled Water NaN $10.50
...
If you want to show only the first row:
df.sort_values('quantity', ascending=False).head(1)
Output:
order_id quantity item_name choice_description item_price
1443 15 Chips and Fresh Tomato Salsa NaN $44.25
Or if you want to show only the name:
df.sort_values('quantity', ascending=False).head(1).item_name
3598 Chips and Fresh Tomato Salsa
Name: item_name, dtype: object
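If you want the bare string rather than a one-row Series, iloc gets it:
df.sort_values('quantity', ascending=False)['item_name'].iloc[0]
# 'Chips and Fresh Tomato Salsa'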
This will list all ties (if any).
df.loc[df["quantity"] == df["quantity"].max(), "item_name"]
Output:
3598 Chips and Fresh Tomato Salsa
Name: item_name, dtype: object
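Both snippets above look at single order lines. If "most ordered" should instead aggregate quantity across all rows, a groupby sum along these lines (a sketch, not from the original answers) may be closer to the second question:
print(df.groupby(['item_name', 'choice_description'])['quantity']
        .sum()
        .sort_values(ascending=False)
        .head())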

Adding column to dataframe based on values in another dataframe

I have two dataframes the first one:
df1:
product price
0 apples 1.99
1 bananas 1.20
2 oranges 1.49
3 lemons 0.5
4 Olive Oil 8.99
df2:
product product.1 product.2
0 apples bananas Olive Oil
1 bananas lemons oranges
2 Olive Oil bananas oranges
3 lemons apples bananas
I want a column in the second dataframe to be the sum of the prices, based on the price of each item in the first dataframe. So the desired outcome would be:
product product.1 product.2 total_price
0 apples bananas Olive Oil 12.18
1 bananas lemons oranges 3.19
2 Olive Oil bananas oranges 11.68
3 lemons apples bananas 3.69
What is the best way to accomplish this? I have tried merging the dataframes on the name for each of the columns in df2, but this seems time consuming, especially as df1 gets more rows and df2 gets more columns.
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product')
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product.1')
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product.2')
df['Total_Price'] = df['price']+df['price.1']+df['price.2']
You can try something like below:
First, convert df1 into a dictionary mapping product to price.
Then use that dictionary with applymap, followed by a row-wise sum.
Maybe the following snippet will do something similar:
dictionary_val = {k[0]: k[1] for k in df1.values}
df2['Total_Price'] = df2.applymap(lambda item: dictionary_val[item]).sum(axis=1)  # adds the column to the existing df2 rather than creating a new dataframe
The result is then in df2:
product product.1 product.2 Total_Price
0 apples bananas Olive Oil 12.18
1 bananas lemons oranges 3.19
2 Olive Oil bananas oranges 11.68
3 lemons apples bananas 3.69
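As a side note, DataFrame.applymap is deprecated in recent pandas (in favor of DataFrame.map); a version of the same idea that maps each product column through a price Series, assuming the df1/df2 from the question:
price_map = df1.set_index('product')['price']
cols = ['product', 'product.1', 'product.2']
df2['Total_Price'] = df2[cols].apply(lambda col: col.map(price_map)).sum(axis=1)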

filter dataframe and add the newly created columns to original df

Is there a simple way to perform calculations for each fruit in turn, adding the newly created column to original df?
df
concatted score fruit status date
apple_bana 0.500 apple high 2010-02-20
apple 0.600 apple low 2010-02-21
banana 0.530 pear low 2010-01-12
Expected output:
concatted score fruit status date first_diff
apple_bana 0.500 apple high 2010-02-20
apple 0.600 apple low 2010-02-21 0.1
banana 0.530 pear low 2010-01-12
I tried:
fruits = ['apple', 'banana', 'pair']
for fruit in fruits:
    selected_rows = df[df['fruit'] == fruit]
    selected_rows['first_diff'] = df.score.diff().dropna()
    df = df.append(selected_rows)
Use groupby() and apply .diff() to score:
df['first_diff'] = df[['concatted', 'score', 'fruit', 'status', 'date']].groupby('fruit')['score'].diff().fillna('')
If you need something more general, try:
df['first_diff'] = df[[x for x in df.columns]].groupby('fruit')['score'].diff().fillna('')
concatted score fruit status date first_diff
0 apple_bana 0.50 apple high 2010-02-20
1 apple 0.60 apple low 2010-02-21 0.1
2 banana 0.53 pear low 2010-01-12
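As a side note, the column selection in both snippets is a no-op, since groupby('fruit')['score'] only ever touches those two columns; the minimal equivalent:
df['first_diff'] = df.groupby('fruit')['score'].diff().fillna('')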

How to collapse pandas rows for select column values to minimal combinations and map back to original rows

Context:
I have a pandas dataframe with 7 columns (taste, color, temperature, texture, shape, age_of_participant, name_of_participant).
Of the 7 columns, taste, color, temperature, texture and shape can have overlapping values across multiple rows (i.e. taste could be sour for more than one row).
I'm trying to collapse all the rows into the lowest number of combinations given the taste, color, temperature, texture and shape values, while ignoring NAs (in other words, overwriting them). The next part is to map each of these rows back to the original rows.
Mock data set:
data_set = [
{'color':'brown', 'age_of_participant':23, 'name_of_participant':'feb'},
{'taste': 'sour', 'color':'green', 'temperature': 'hot', 'age_of_participant':16,'name_of_participant': 'joe'},
{'taste': 'sour', 'color':'green', 'texture':'soft', 'age_of_participant':17,'name_of_participant': 'jane'},
{'color':'green','age_of_participant':18,'name_of_participant': 'jeff'},
{'taste': 'sweet', 'color':'red', 'age_of_participant':19,'name_of_participant': 'joke'},
{'taste': 'sweet', 'temperature': 'cold', 'age_of_participant':20,'name_of_participant': 'jolly'},
{'taste': 'salty', 'color':'purple', 'texture':'soft', 'age_of_participant':21,'name_of_participant': 'jupyter'},
{'taste': 'salty', 'color':'brown', 'age_of_participant':22,'name_of_participant': 'january'}
]
import pandas as pd
import random
data_set = random.sample(data_set, k=len(data_set))
data_frame = pd.DataFrame(data_set)
print(data_frame)
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot NaN
1 17 green jane sour NaN soft
2 18 green jeff NaN NaN NaN
3 19 red joke sweet NaN NaN
4 20 NaN jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
What I've attempted:
# These columns are used to do the grouping since age_of_participant and name_of_participant are unique per row
values_that_can_be_grouped = ['taste', 'color','temperature','texture']
sub_set = data_frame[values_that_can_be_grouped].drop_duplicates().reset_index(drop=False)
my_unique_set = sub_set.groupby('taste', as_index=False).first()
print(my_unique_set)
taste index color temperature texture
0 2 green
1 salty 6 brown
2 sour 1 green soft
3 sweet 4 cold
At this point I'm not quite sure how I can map the rows above back to all the original rows except for indices 2, 6, 1, 4. I checked the pandas code and it doesn't look like the other indices are preserved anywhere?
What I'm trying to achieve:
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot soft
1 17 green jane sour hot soft
2 18 green jeff sour hot soft
3 19 red joke sweet cold NaN
4 20 red jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
One approach: forward-fill color first so every row has a group key, then forward- and back-fill the remaining columns within each color group:
data_frame.assign(color=data_frame.color.ffill()).groupby('color').apply(lambda x: x.ffill().bfill())
Out[1089]:
age_of_participant color name_of_participant taste temperature texture
0 16 green joe sour hot soft
1 17 green jane sour hot soft
2 18 green jeff sour hot soft
3 19 red joke sweet cold NaN
4 20 red jolly sweet cold NaN
5 21 purple jupyter salty NaN soft
6 22 brown january salty NaN NaN
IIUC, it is safer here to first fill taste and color from each other's groups with ffill and bfill, then groupby both and fill the remaining columns:
df.taste.fillna(df.groupby('color').taste.apply(lambda x : x.ffill().bfill()),inplace=True)
df.color.fillna(df.groupby('taste').color.apply(lambda x : x.ffill().bfill()),inplace=True)
df=df.groupby(['color','taste']).apply(lambda x : x.ffill().bfill())
df
age_of_participant color ... temperature texture
0 16 green ... hot soft
1 17 green ... hot soft
2 18 green ... hot soft
3 19 red ... cold NaN
4 20 red ... cold NaN
5 21 purple ... NaN soft
6 22 brown ... NaN NaN
[7 rows x 6 columns]
