In a pandas DataFrame df I have columns like this:
NAME KEYWORD AMOUNT INFO
0 orange fruit 13 from italy
1 potato veggie 7 from germany
2 potato veggie 9 from germany
3 orange fruit 8 from italy
4 potato veggie 6 from germany
Grouping by KEYWORD, I want to sum the AMOUNT values per group and always keep the first value of the other columns, so that the result reads:
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
I tried
df.groupby('KEYWORD').sum()
but this "summarises" over all columns, i.e. I get
NAME KEYWORD AMOUNT INFO
0 orangeorange fruit 21 from italyfrom italy
1 potatopotatopotato veggie 22 from germanyfrom germanyfrom germany
Then I tried to use different functions for different columns:
df.groupby('KEYWORD').agg({'AMOUNT': sum, 'NAME': first, ....})
with
def first(f_arg, *args):
    return f_arg
Unfortunately, this gives me a "ValueError: function does not reduce" error.
So I am a bit at a loss. How can I apply sum only to the AMOUNT column, while keeping the others?
Use groupby + agg with a custom aggfunc dict.
f = dict.fromkeys(df.columns.difference(['KEYWORD']), 'first')
f['AMOUNT'] = 'sum'  # the string alias avoids the warning newer pandas raises for the builtin sum
df = df.groupby('KEYWORD', as_index=False).agg(f)
df
KEYWORD NAME AMOUNT INFO
0 fruit orange 21 from italy
1 veggie potato 22 from germany
dict.fromkeys gives me a nice way of generalising this to any number of columns. If column order matters, add a reindex operation at the end (run this on the original df, in place of the assignment above):
df = df.groupby('KEYWORD', as_index=False).agg(f).reindex(columns=df.columns)
df
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
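For reference, here is a self-contained sketch of the same idea on the sample data from the question, using the string aliases 'first' and 'sum' throughout. The names agg_map and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "NAME": ["orange", "potato", "potato", "orange", "potato"],
    "KEYWORD": ["fruit", "veggie", "veggie", "fruit", "veggie"],
    "AMOUNT": [13, 7, 9, 8, 6],
    "INFO": ["from italy", "from germany", "from germany",
             "from italy", "from germany"],
})

# Default every non-key column to 'first', then override AMOUNT with 'sum'.
agg_map = {col: "first" for col in df.columns if col != "KEYWORD"}
agg_map["AMOUNT"] = "sum"

result = (
    df.groupby("KEYWORD", as_index=False)
      .agg(agg_map)
      .reindex(columns=df.columns)  # restore the original column order
)
print(result)
```

Building the mapping from df.columns means new columns are picked up automatically and default to 'first'.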
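Use drop_duplicates on the KEYWORD column and then assign the aggregated values. The values are assigned by position, so pass sort=False to groupby to keep the groups in order of first appearance, which is the order drop_duplicates leaves the rows in:
df=df.drop_duplicates('KEYWORD').assign(AMOUNT=df.groupby('KEYWORD', sort=False)['AMOUNT'].sum().values)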
print (df)
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
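Assigning .values is positional, so it only works while the group order and the remaining row order agree. A variant that should not depend on that is sketched below: transform('sum') writes each group's total onto every row of the group, aligned on the index, and drop_duplicates then keeps the first row per KEYWORD. The sample data is the one from the question; result is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "NAME": ["orange", "potato", "potato", "orange", "potato"],
    "KEYWORD": ["fruit", "veggie", "veggie", "fruit", "veggie"],
    "AMOUNT": [13, 7, 9, 8, 6],
    "INFO": ["from italy", "from germany", "from germany",
             "from italy", "from germany"],
})

# Put the group total on every row (index-aligned), then keep one row per key.
result = (
    df.assign(AMOUNT=df.groupby("KEYWORD")["AMOUNT"].transform("sum"))
      .drop_duplicates("KEYWORD")
)
print(result)
```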
Related
I have a table like this
Name Type Food Variant and Price
A Cake {'Choco':100, 'Cheese':100, 'Mix': 125}
B Drinks {'Cola':25, 'Milk':35}
C Side dish {'French Fries':20}
D Bread {None:10}
I want to turn the keys and values of the dictionaries in the Variant and Price column into 2 separate columns, but I can't work out how. Here is the output that I want:
Name Type Food Variant Price
A Cake Choco 100
A Cake Cheese 100
A Cake Mix 125
B Drinks Cola 25
B Drinks Milk 35
C Side dish French Fries 20
D Bread NaN 10
Can anyone help me to figure it out?
Create a list of tuples, then use DataFrame.explode, and finally create the 2 columns:
df['Variant and Price'] = df['Variant and Price'].apply(lambda x: list(x.items()))
df = df.explode('Variant and Price').reset_index(drop=True)
df[['Variant','Price']] = df.pop('Variant and Price').to_numpy().tolist()
print (df)
Name Type Food Variant Price
0 A Cake Choco 100
1 A Cake Cheese 100
2 A Cake Mix 125
3 B Drinks Cola 25
4 B Drinks Milk 35
5 C Side dish French Fries 20
6 D Bread None 10
Or create the 2 columns first and then use DataFrame.explode (exploding several columns at once requires pandas 1.3 or newer):
df['Variant'] = df['Variant and Price'].apply(lambda x: list(x.keys()))
df['Price'] = df.pop('Variant and Price').apply(lambda x: list(x.values()))
df = df.explode(['Variant', 'Price']).reset_index(drop=True)
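A runnable sketch of the tuple-based approach follows. It assumes the column names are Name, Type Food and Variant and Price, as the table in the question suggests; pairs, variant_price and out are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Type Food": ["Cake", "Drinks", "Side dish", "Bread"],  # assumed column name
    "Variant and Price": [
        {"Choco": 100, "Cheese": 100, "Mix": 125},
        {"Cola": 25, "Milk": 35},
        {"French Fries": 20},
        {None: 10},
    ],
})

# Turn each dict into a list of (variant, price) pairs, then one pair per row.
pairs = df.pop("Variant and Price").apply(lambda d: list(d.items())).explode()

# Split the pairs into two columns; the index still points at the source row.
variant_price = pd.DataFrame(
    pairs.tolist(), index=pairs.index, columns=["Variant", "Price"]
)

# Joining on the index repeats Name and Type Food once per pair.
out = df.join(variant_price).reset_index(drop=True)
print(out)
```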
I have a dataframe with the columns SellerName, SellerID and Fruit.
Is there any way to combine the values in the Fruit column for each combination of values in the other two columns using pandas?
Use groupby with agg. If you have other columns, extend the dict with other functions as needed (max, min, first, last, ... or a lambda):
out = df.groupby(['SellerName', 'SellerID'], as_index=False).agg({'Fruit': ', '.join})
print(out)
# Output
SellerName SellerID Fruit
0 Rob 200 Apple, Bannana
1 Scott 201 Apple, Kiwi, Pineapple
Input dataframe:
>>> df
SellerName SellerID Fruit
0 Rob 200 Apple
1 Scott 201 Apple
2 Rob 200 Bannana
3 Scott 201 Kiwi
4 Scott 201 Pineapple
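To illustrate what extending the dict looks like, here is a small sketch. The Qty column is hypothetical and not part of the question; it is only there so that a second aggregation has something to work on.

```python
import pandas as pd

df = pd.DataFrame({
    "SellerName": ["Rob", "Scott", "Rob", "Scott", "Scott"],
    "SellerID": [200, 201, 200, 201, 201],
    "Fruit": ["Apple", "Apple", "Bannana", "Kiwi", "Pineapple"],
    "Qty": [3, 1, 2, 5, 4],  # hypothetical extra column, not in the question
})

# One entry per column: join the fruit names, sum the hypothetical quantities.
out = df.groupby(["SellerName", "SellerID"], as_index=False).agg(
    {"Fruit": ", ".join, "Qty": "sum"}
)
print(out)
```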
I have two dataframes the first one:
df1:
product price
0 apples 1.99
1 bananas 1.20
2 oranges 1.49
3 lemons 0.5
4 Olive Oil 8.99
df2:
product product.1 product.2
0 apples bananas Olive Oil
1 bananas lemons oranges
2 Olive Oil bananas oranges
3 lemons apples bananas
I want a column in the second dataframe to be the sum of the prices, based on the price of each item in the first dataframe. So the desired outcome would be:
product product.1 product.2 total_price
0 apples bananas Olive Oil 12.18
1 bananas lemons oranges 3.19
2 Olive Oil bananas oranges 11.68
3 lemons apples bananas 3.69
What is the best way to accomplish this? I have tried merging the dataframes on the name for each of the columns in df2 but this seems time consuming especially as df1 gets more rows and df2 gets more columns.
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product')
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product.1')
df = pd.merge(df1, df2, how='right', left_on='product', right_on='product.2')
df['Total_Price'] = df['price']+df['price.1']+df['price.2']
You can try the following:
First, convert df1 to a dictionary that maps each product to its price.
Then look up every cell of df2 in that dictionary with applymap and sum across each row.
The following snippet does this:
dictionary_val = { k[0]: k[1] for k in df1.values }
# Adds the column to the existing df2 rather than creating a new dataframe.
# applymap is deprecated since pandas 2.1; DataFrame.map is its replacement there.
df2['Total_Price'] = df2.applymap(lambda cell: dictionary_val[cell]).sum(axis=1)
The result is df2:
product product.1 product.2 Total_Price
0 apples bananas Olive Oil 12.18
1 bananas lemons oranges 3.19
2 Olive Oil bananas oranges 11.68
3 lemons apples bananas 3.69
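An alternative sketch that avoids applymap maps each product column through a price Series instead. It uses the data from the question; prices and product_cols are illustrative names, and the rounding to 2 decimals is my own addition to keep floating-point noise out of the totals.

```python
import pandas as pd

df1 = pd.DataFrame({
    "product": ["apples", "bananas", "oranges", "lemons", "Olive Oil"],
    "price": [1.99, 1.20, 1.49, 0.5, 8.99],
})
df2 = pd.DataFrame({
    "product": ["apples", "bananas", "Olive Oil", "lemons"],
    "product.1": ["bananas", "lemons", "bananas", "apples"],
    "product.2": ["Olive Oil", "oranges", "oranges", "bananas"],
})

# Lookup Series: index is the product name, value is its price.
prices = df1.set_index("product")["price"]

# Select the product columns explicitly so re-running does not pick up the total.
product_cols = ["product", "product.1", "product.2"]

# Map every product column to prices, then add up each row.
# Note: a product missing from df1 becomes NaN and is skipped by sum().
df2["total_price"] = (
    df2[product_cols].apply(lambda col: col.map(prices)).sum(axis=1).round(2)
)
print(df2)
```

This scales with the number of columns in df2 without any merges, since each column is one vectorised lookup.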
I have a DF as follows:
Date Bought | Fruit
2018-01 Apple
2018-02 Orange
2018-02 Orange
2018-02 Lemon
I wish to group the data by 'Date Bought' & 'Fruit' and count the occurrences.
Expected result:
Date Bought | Fruit | Count
2018-01 Apple 1
2018-02 Orange 2
2018-02 Lemon 1
What I get:
Date Bought | Fruit | Count
2018-01 Apple 1
2018-02 Orange 2
Lemon 1
Code used:
Initial attempt:
df.groupby(['Date Bought','Fruit'])['Fruit'].agg('count')
#2
df.groupby(['Date Bought','Fruit'])['Fruit'].agg('count').reset_index()
ERROR: Cannot insert Fruit, already exists
#3
df.groupby(['Date Bought','Fruit'])['Fruit'].agg('count').reset_index(inplace=True)
ERROR: Type Error: Cannot reset_index inplace on a Series to create a DataFrame
Documentation shows that the groupby function returns a 'groupby object' not a standard DF. How can I group the data as mentioned above and retain the DF format?
The problem here is that resetting the index would leave you with 2 columns with the same name. Because the result is a Series, you can set the name parameter of Series.reset_index to give the count column its own name:
df1 = (df.groupby(['Date Bought','Fruit'], sort=False)['Fruit']
         .agg('count')
         .reset_index(name='Count'))
print (df1)
Date Bought Fruit Count
0 2018-01 Apple 1
1 2018-02 Orange 2
2 2018-02 Lemon 1
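A closely related sketch uses size() instead of selecting a column and counting it. It counts rows per group, so no column has to be picked, and the result is again a Series that reset_index(name=...) turns into a regular DataFrame. The data is the sample from the question; counts is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "Date Bought": ["2018-01", "2018-02", "2018-02", "2018-02"],
    "Fruit": ["Apple", "Orange", "Orange", "Lemon"],
})

counts = (
    df.groupby(["Date Bought", "Fruit"], sort=False)  # keep first-appearance order
      .size()                                         # rows per group, as a Series
      .reset_index(name="Count")                      # back to a flat DataFrame
)
print(counts)
```

One difference to keep in mind: size() counts all rows in a group, while count skips missing values in the selected column.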
What I mean by explode is this: I want to transform a dataframe like
ID | Name    | Food          | Drink
1    John      Apple, Orange   Tea , Water
2    Shawn                     Milk
3    Patrick   Chichken
4    Halley    Fish Nugget
into this dataframe:
ID | Name | Order Type | Items
1 John Food Apple
2 John Food Orange
3 John Drink Tea
4 John Drink Water
5 Shawn Drink Milk
6 Patrick Food Chichken
I don't know how to make this happen. Any help would be appreciated!
IIUC, you can stack and then unnest. I would not change the ID here; I think keeping the original one is better.
s=df.set_index(['ID','Name']).stack()
pd.DataFrame(data=s.str.split(',').sum(),index=s.index.repeat(s.str.split(',').str.len())).reset_index()
Out[289]:
ID Name level_2 0
0 1 John Food Apple
1 1 John Food Orange
2 1 John Drink Tea
3 1 John Drink Water
4 2 Shawn Drink Milk
5 3 Patrick Food Chichken
6 4 Halley Food Fish Nugget
# if you need rename the column to item try below
#pd.DataFrame(data=s.str.split(',').sum(),index=s.index.repeat(s.str.split(',').str.len())).rename(columns={0:'Item'}).reset_index()
You can use pd.melt to convert the data from wide to long format. I think this will be easier to understand step by step.
# first split into separate columns
df[['Food1','Food2']] = df.Food.str.split(',', expand=True)
df[['Drink1','Drink2']] = df.Drink.str.split(',', expand=True)
# now melt the df into long format
df = pd.melt(df, id_vars=['Name'], value_vars=['Food1','Food2','Drink1','Drink2'])
# remove unwanted rows and filter data
df = df[df['value'].notnull()].sort_values('Name').reset_index(drop=True)
# rename the column names and values
df.rename(columns={'variable':'Order Type', 'value':'Items'}, inplace=True)
df['Order Type'] = df['Order Type'].str.replace(r'\d', '', regex=True)
# output
print(df)
Name Order Type Items
0 Halley Food Fish Nugget
1 John Food Apple
2 John Food Orange
3 John Drink Tea
4 John Drink Water
5 Patrick Food Chichken
6 Shawn Drink Milk
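On newer pandas versions, melt can be combined with DataFrame.explode so that the number of items per cell does not have to be known in advance. This is only a sketch: it assumes the empty cells in the question are missing values (None/NaN) and keeps the original ID, and the name long is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["John", "Shawn", "Patrick", "Halley"],
    "Food": ["Apple, Orange", None, "Chichken", "Fish Nugget"],  # None = empty cell (assumed)
    "Drink": ["Tea , Water", "Milk", None, None],
})

long = (
    df.melt(id_vars=["ID", "Name"], var_name="Order Type", value_name="Items")
      .dropna(subset=["Items"])                               # drop the empty cells
      .assign(Items=lambda d: d["Items"].str.split(","))      # "a, b" -> ["a", " b"]
      .explode("Items", ignore_index=True)                    # one item per row
      .assign(Items=lambda d: d["Items"].str.strip())         # trim stray spaces
      .sort_values("ID", kind="stable")                       # group rows per person
      .reset_index(drop=True)
)
print(long)
```

The stable sort keeps Food before Drink within each ID, because that is the order melt produces.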