I have the dataframe below.
Is there any way to combine the values in the Fruit column with respect to the values in the other two columns and get the result below using pandas?
Use groupby + agg. If you have other columns, expand the dict with other functions as needed (max, min, first, last, ... or a lambda):
out = df.groupby(['SellerName', 'SellerID'], as_index=False).agg({'Fruit': ', '.join})
print(out)
# Output
SellerName SellerID Fruit
0 Rob 200 Apple, Bannana
1 Scott 201 Apple, Kiwi, Pineapple
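For example, if the frame also had a hypothetical Quantity column (not part of the data above), the dict could be extended like this:
out = df.groupby(['SellerName', 'SellerID'], as_index=False).agg(
    {'Fruit': ', '.join, 'Quantity': 'sum'}  # 'Quantity' is a made-up column for illustration
)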
Input dataframe:
>>> df
SellerName SellerID Fruit
0 Rob 200 Apple
1 Scott 201 Apple
2 Rob 200 Bannana
3 Scott 201 Kiwi
4 Scott 201 Pineapple
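For completeness, the input frame shown above can be rebuilt from the displayed values (a minimal sketch):
import pandas as pd

df = pd.DataFrame({
    'SellerName': ['Rob', 'Scott', 'Rob', 'Scott', 'Scott'],
    'SellerID': [200, 201, 200, 201, 201],
    'Fruit': ['Apple', 'Apple', 'Bannana', 'Kiwi', 'Pineapple'],
})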
Related
I'm struggling with the following task: using pandas (or any other Python tool), I would like to check, for each row of Table 2, whether any of the cells Fruit 1 through Fruit 3 is contained in the Fruits column of Table 1, and in the end obtain the "Contains Fruits Table 2?" column.
Table 1:

Fruits
apple
orange
grape
melon

Table 2:

Name   Fruit 1   Fruit 2   Fruit 3   Contains Fruits Table 2?
Mike   apple                         Yes
Bob    peach     pear      orange    Yes
Jack   banana                        No
Rob    peach     banana              No
Rita   apple     orange    banana    Yes
Table 2 can have up to 40 fruit columns. Table 1 has about 300 rows.
I hope it is understandable, and someone can help me resolve this.
I really appreciate the support in advance!
Try:
filter the DataFrame to the columns whose names contain the word "Fruit"
use isin to check whether the values are in table1["Fruits"]
use any to return True if any of the fruits are found in a row
map True/False to "Yes"/"No"
table2["Contains Fruits Table 2"] = table2.filter(like="Fruit")
.isin(table1["Fruits"].tolist())
.any(axis=1)
.map({True: "Yes", False: "No"})
>>> table2
Name Fruit 1 Fruit 2 Fruit 3 Contains Fruits Table 2
0 Mike apple None None Yes
1 Bob peach pear orange Yes
2 Jack banana None None No
3 Rob peach banana None No
4 Rita apple orange banana Yes
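To make the example reproducible, the two tables can be built from the values in the question (a sketch; the column names follow the tables above):
import pandas as pd

table1 = pd.DataFrame({'Fruits': ['apple', 'orange', 'grape', 'melon']})
table2 = pd.DataFrame({
    'Name': ['Mike', 'Bob', 'Jack', 'Rob', 'Rita'],
    'Fruit 1': ['apple', 'peach', 'banana', 'peach', 'apple'],
    'Fruit 2': [None, 'pear', None, 'banana', 'orange'],
    'Fruit 3': [None, 'orange', None, None, 'banana'],
})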
I am mainly interested in how to do this in clean, idiomatic pandas.
In this example data, Tim from Osaka has two fruits.
import pandas as pd
data = {'name': ['Susan', 'Tim', 'Tim', 'Anna'],
'fruit': ['Apple', 'Apple', 'Banana', 'Banana'],
'town': ['Berlin', 'Osaka', 'Osaka', 'Singabpur']}
df = pd.DataFrame(data)
print(df)
Result
name fruit town
0 Susan Apple Berlin
1 Tim Apple Osaka
2 Tim Banana Osaka
3 Anna Banana Singabpur
I investigate the data and see that one of the persons has multiple fruits. I want to create a new "category" for it named Apple&Banana (or something similar). The point is that Tim's other fields have identical values.
df.groupby(['name', 'town', 'fruit']).size()
I am not sure if this is the correct way to explore this data set. The underlying question is whether some of the person+town combinations have multiple fruits.
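To check that directly, one quick sketch is to count distinct fruits per person+town and look for counts above one (treating "multiple" as more than one distinct fruit is my assumption):
counts = df.groupby(['name', 'town'])['fruit'].nunique()
print(counts[counts > 1])  # person+town combinations with more than one distinct fruit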
As a result I want this
name fruit town
0 Susan Apple Berlin
1 Tim Apple&Banana Osaka
2 Anna Banana Singabpur
Use groupby agg:
new_df = (
df.groupby(['name', 'town'], as_index=False, sort=False)
.agg(fruit=('fruit', '&'.join))
)
new_df:
name town fruit
0 Susan Berlin Apple
1 Tim Osaka Apple&Banana
2 Anna Singabpur Banana
>>> df.groupby(["name", "town"], sort=False)["fruit"]
.apply(lambda f: "&".join(f)).reset_index()
name town fruit
0 Anna Singabpur Banana
1 Susan Berlin Apple
2 Tim Osaka Apple&Banana
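If the combined category ever needs to be split back into one row per fruit, the reverse is a str.split followed by explode (a sketch, assuming new_df from the first answer and pandas >= 0.25 for DataFrame.explode):
restored = (
    new_df.assign(fruit=new_df['fruit'].str.split('&'))
          .explode('fruit')
          .reset_index(drop=True)
)
print(restored)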
I have a dataframe that records the number and type of fruits owned by various people. I'd like to add a column that indicates the top fruit(s) for each person. If a person has 2+ top-ranking fruits (aka, a tie), I want a list (or tuple) of them all.
Input
For example, let's say my input is this dataframe:
import numpy as np
import pandas as pd

# Create all the fruit data
data = [{'fruit0':'strawberry','fruit0_count':23,'fruit1':'orange','fruit1_count':4,'fruit2':'grape','fruit2_count':27},
{'fruit0':'apple','fruit0_count':45,'fruit1':'mango','fruit1_count':45,'fruit2':'orange','fruit2_count':12},
{'fruit0':'blueberry','fruit0_count':30,'fruit1':'grapefruit','fruit1_count':32,'fruit2':'cherry','fruit2_count':94},
{'fruit0':'pineapple','fruit0_count':4,'fruit1':'grape','fruit1_count':4,'fruit2':'lemon','fruit2_count':67}]
# Add people's names as an index
df = pd.DataFrame(data, index=['Shawn', 'Monica','Jamal','Tracy'])
# Print the dataframe
df
. . . which creates the input dataframe:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count
Shawn strawberry 23 orange 4 grape 27
Monica apple 45 mango 45 orange 12
Jamal blueberry 30 grapefruit 32 cherry 94
Tracy pineapple 4 grape 4 lemon 67
Target output
What I'd like to get is a new column that gives the name of the top fruit for each person. If the person has two (or more) fruits that tied for first, I'd like a list or a tuple of those fruits:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count top_fruit
Shawn strawberry 23 orange 4 grape 27 grape
Monica apple 45 mango 45 orange 12 (apple,mango)
Jamal blueberry 30 grapefruit 32 cherry 94 cherry
Tracy pineapple 4 grape 4 lemon 67 lemon
My attempt so far
The closest I've gotten is based on https://stackoverflow.com/a/38955365/6480859.
Problems:
If there is a tie for top fruit, it only captures one top fruit (Monica's top fruit is only apple.)
It's really complicated. Not really a problem, but if there is a more straightforward path, I'd like to learn it.
# List the columns that contain count numbers
cols = ['fruit0_count', 'fruit1_count', 'fruit2_count']
# Make a new dataframe with just those columns.
only_counts_df=pd.DataFrame()
only_counts_df[cols]=df[cols].copy()
# Indicate how many results you want. Note: If you increase
# this from 1, it gives you the #2, #3, etc. ranking -- it
# doesn't represent tied results.
nlargest = 1
# The next two lines are suggested from
# https://stackoverflow.com/a/38955365/6480859. I don't totally
# follow along . . .
order = np.argsort(-only_counts_df.values, axis=1)[:, :nlargest]
result = pd.DataFrame(only_counts_df.columns[order],
columns=['top{}'.format(i) for i in range(1, nlargest+1)],
index=only_counts_df.index)
# Join the results back to our original dataframe
result = df.join(result).copy()
# The dataframe now reports the name of the column that
# contains the top fruit. Convert this to the fruit name.
def id_fruit(row):
    if row['top1'] == 'fruit0_count':
        return row['fruit0']
    elif row['top1'] == 'fruit1_count':
        return row['fruit1']
    elif row['top1'] == 'fruit2_count':
        return row['fruit2']
    else:
        return "Failed"
result['top_fruit'] = result.apply(id_fruit,axis=1)
result = result.drop(['top1'], axis=1).copy()
result
. . . which outputs:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count top_fruit
Shawn strawberry 23 orange 4 grape 27 grape
Monica apple 45 mango 45 orange 12 apple
Jamal blueberry 30 grapefruit 32 cherry 94 cherry
Tracy pineapple 4 grape 4 lemon 67 lemon
Monica's top fruit should be apple and mango.
Any tips are welcome, thanks!
The idea is to split the fruit-name columns and the count columns into df1 and df2, compare the counts against the row-wise maximum, mask the name columns with DataFrame.where, and finally collect the non-missing values with apply:
df1 = df.iloc[:, ::2]
df2 = df.iloc[:, 1::2]
mask = df2.eq(df2.max(axis=1), axis=0)
df['top'] = df1.where(mask.to_numpy()).apply(lambda x: x.dropna().tolist(), axis=1)
print (df)
fruit0 fruit0_count fruit1 fruit1_count fruit2 \
Shawn strawberry 23 orange 4 grape
Monica apple 45 mango 45 orange
Jamal blueberry 30 grapefruit 32 cherry
Tracy pineapple 4 grape 4 lemon
fruit2_count top
Shawn 27 [grape]
Monica 12 [apple, mango]
Jamal 94 [cherry]
Tracy 67 [lemon]
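If a bare name is preferred when there is no tie, and a tuple only for actual ties (as in the target output), a small post-processing step can follow (a sketch working on the top column built above):
# single winner -> plain string, tie -> tuple of names
df['top'] = df['top'].apply(lambda x: x[0] if len(x) == 1 else tuple(x))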
Here's what I've come up with:
maxes = df[[f"fruit{i}_count" for i in range(3)]].max(axis=1)
mask = df[[f"fruit{i}_count" for i in range(3)]].isin(maxes)
df_masked = df[[f"fruit{i}" for i in range(3)]][
mask.rename(lambda x: x.replace("_count", ""), axis=1)
]
df["top_fruit"] = df_masked.apply(lambda x: x.dropna().tolist(), axis=1)
This will return
fruit0 fruit0_count ... fruit2_count top_fruit
Shawn strawberry 23 ... 27 [grape]
Monica apple 45 ... 12 [apple, mango]
Jamal blueberry 30 ... 94 [cherry]
Tracy pineapple 4 ... 67 [lemon]
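As a side note, when isin is given a Series, pandas matches values by index, so each count above is effectively compared against that row's maximum. A perhaps more explicit way to build the same mask (a sketch over the same df):
count_cols = [f"fruit{i}_count" for i in range(3)]
mask = df[count_cols].eq(df[count_cols].max(axis=1), axis=0)  # True where a count equals the row max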
So, what I mean by explode is this: I want to transform a dataframe like this:
ID | Name    | Food          | Drink
1  | John    | Apple, Orange | Tea , Water
2  | Shawn   |               | Milk
3  | Patrick | Chichken      |
4  | Halley  | Fish Nugget   |
into this dataframe:
ID | Name    | Order Type | Items
1  | John    | Food       | Apple
2  | John    | Food       | Orange
3  | John    | Drink      | Tea
4  | John    | Drink      | Water
5  | Shawn   | Drink      | Milk
6  | Patrick | Food       | Chichken
I don't know how to make this happen. Any help would be appreciated!
IIUC, this is a stack-then-unnest process. Here I would not change the ID; I think keeping the original one is better:
s = df.set_index(['ID', 'Name']).stack()
splits = s.str.split(',')
pd.DataFrame(data=splits.sum(), index=s.index.repeat(splits.str.len())).reset_index()
Out[289]:
ID Name level_2 0
0 1 John Food Apple
1 1 John Food Orange
2 1 John Drink Tea
3 1 John Drink Water
4 2 Shawn Drink Milk
5 3 Patrick Food Chichken
6 4 Halley Food Fish Nugget
# if you need to rename the column to Item, try the line below
#pd.DataFrame(data=splits.sum(), index=s.index.repeat(splits.str.len())).rename(columns={0:'Item'}).reset_index()
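To arrive at the exact column names from the question and trim the stray spaces left by the comma split, a small follow-up can be appended (a sketch building on s and splits from above; 'Order Type' and 'Items' are the names requested in the question):
out = (
    pd.DataFrame(data=splits.sum(), index=s.index.repeat(splits.str.len()), columns=['Items'])
      .reset_index()
      .rename(columns={'level_2': 'Order Type'})
)
out['Items'] = out['Items'].str.strip()  # remove whitespace left over from splitting on ','
print(out)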
You can use pd.melt to convert the data from wide to long format. I think this will be easier to understand step by step.
# first split into separate columns
df[['Food1','Food2']] = df.Food.str.split(',', expand=True)
df[['Drink1','Drink2']] = df.Drink.str.split(',', expand=True)
# now melt the df into long format
df = pd.melt(df, id_vars=['Name'], value_vars=['Food1','Food2','Drink1','Drink2'])
# remove unwanted rows and filter data
df = df[df['value'].notnull()].sort_values('Name').reset_index(drop=True)
# rename the column names and values
df.rename(columns={'variable':'Order Type', 'value':'Items'}, inplace=True)
df['Order Type'] = df['Order Type'].str.replace(r'\d', '', regex=True)
# output
print(df)
Name Order Type Items
0 Halley Food Fish Nugget
1 John Food Apple
2 John Food Orange
3 John Drink Tea
4 John Drink Water
5 Patrick Food Chichken
6 Shawn Drink Milk
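On pandas 0.25 or later, the same reshaping can also be written with DataFrame.explode, which matches the wording of the question (a sketch over the same df, assuming missing entries are NaN as in the first answer):
long = (
    df.melt(id_vars=['ID', 'Name'], value_vars=['Food', 'Drink'],
            var_name='Order Type', value_name='Items')
      .dropna(subset=['Items'])          # drop people with no entry for that order type
)
long['Items'] = long['Items'].str.split(',')
long = long.explode('Items')             # one row per item
long['Items'] = long['Items'].str.strip()
print(long.sort_values('ID').reset_index(drop=True))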
In a pandas Dataframe df I have columns likes this:
NAME KEYWORD AMOUNT INFO
0 orange fruit 13 from italy
1 potato veggie 7 from germany
2 potato veggie 9 from germany
3 orange fruit 8 from italy
4 potato veggie 6 from germany
Doing a groupby on KEYWORD, I want to sum the AMOUNT values per group and always keep the first value from each of the other columns, so that the result reads:
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
I tried
df.groupby('KEYWORD').sum()
but this "summarises" over all columns, i.e. I get
NAME KEYWORD AMOUNT INFO
0 orangeorange fruit 21 from italyfrom italy
1 potatopotatopotato veggie 22 from germanyfrom germanyfrom germany
Then I tried to use different functions for different columns:
df.groupby('KEYWORD').agg({'AMOUNT': sum, 'NAME': first, ....})
with
def first(f_arg, *args):
    return f_arg
But this gives me unfortunately a "ValueError: function does not reduce" error.
So I am a bit at a loss. How can I apply sum only to the AMOUNT column, while keeping the others?
Use groupby + agg with a custom aggfunc dict.
f = dict.fromkeys(df.columns.difference(['KEYWORD']), 'first')
f['AMOUNT'] = sum
df = df.groupby('KEYWORD', as_index=False).agg(f)
df
KEYWORD NAME AMOUNT INFO
0 fruit orange 21 from italy
1 veggie potato 22 from germany
dict.fromkeys gives me a nice way of generalising this for N number of columns. If column order matters, add a reindex operation at the end:
df = df.groupby('KEYWORD', as_index=False).agg(f).reindex(columns=df.columns)
df
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
Use drop_duplicates by column KEYWORD and then assign aggregate values:
df=df.drop_duplicates('KEYWORD').assign(AMOUNT=df.groupby('KEYWORD')['AMOUNT'].sum().values)
print (df)
NAME KEYWORD AMOUNT INFO
0 orange fruit 21 from italy
1 potato veggie 22 from germany
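On pandas 0.25 or later, named aggregation is another way to spell the same thing without building the dict by hand (a sketch over the same df):
out = df.groupby('KEYWORD', as_index=False).agg(
    NAME=('NAME', 'first'),
    AMOUNT=('AMOUNT', 'sum'),
    INFO=('INFO', 'first'),
)
print(out)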