I know basic pandas functions, but I'm not clear how to group in this case.
I have a dataframe with a list of various fruits and their characteristics:
fruit x1 x2
apple red sweet
apple yellow sweet
apple green tart
apple red sweet
apple red sweet
How could I sum each fruit (there are more after apples) like this?
3 apples: red and sweet
1 apple: yellow and sweet
1 apple: green and tart
I've looked at groupby, tried an apply function, and looked over pandas documentation, but this escapes me.
Any ideas? Thank you so much.
With Counter
import pandas as pd
from collections import Counter
pd.Series(Counter(map(tuple, df.values)))
apple green tart 1
red sweet 3
yellow sweet 1
dtype: int64
pd.factorize and np.bincount
import numpy as np
i, r = pd.factorize(list(map(tuple, df.values)))
pd.Series(dict(zip(r, np.bincount(i))))
apple green tart 1
red sweet 3
yellow sweet 1
dtype: int64
You can try below:
df['count']=0
group_df = df.groupby(["fruit","x1","x2"])['count'].count().reset_index()
The output will be as below:
fruit x1 x2 count
0 apple green tart 1
1 apple red sweet 3
2 apple yellow sweet 1
Of course, you can concatenate the columns afterwards to match your required output exactly.
And if you want the counts sorted:
group_df = df.groupby(["fruit","x1","x2"])['count'].count().reset_index().sort_values(by=['count'],ascending=False)
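A self-contained sketch of the same counting, using groupby(...).size() so that no helper column is needed. The frame is rebuilt from the question's sample rows, and the `label` column and its text format are illustrative assumptions, not something the answers above define.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "fruit": ["apple"] * 5,
    "x1": ["red", "yellow", "green", "red", "red"],
    "x2": ["sweet", "sweet", "tart", "sweet", "sweet"],
})

# size() counts the rows in each unique (fruit, x1, x2) combination.
counts = (
    df.groupby(["fruit", "x1", "x2"])
      .size()
      .reset_index(name="count")
      .sort_values("count", ascending=False)
)

# Build a readable label per combination (hypothetical format).
counts["label"] = (
    counts["count"].astype(str) + " " + counts["fruit"]
    + ": " + counts["x1"] + " and " + counts["x2"]
)
print(counts["label"].tolist())
```

The most frequent combination comes first; rows with equal counts may appear in either order.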
Say I have the following dataframes:
>>> df1 = pd.DataFrame({'fruit':['apple','orange','orange'],'taste':['sweet','sweet','sour']})
>>> df1
fruit taste
0 apple sweet
1 orange sweet
2 orange sour
>>> df2 = pd.DataFrame({'fruit':['apple','orange','orange'],'price':['high','low','low']})
>>> df2
fruit price
0 apple high
1 orange low
2 orange low
When I do df3=df1.merge(df2,on='fruit'), I get the following result:
fruit taste price
0 apple sweet high
1 orange sweet low
2 orange sweet low
3 orange sour low
4 orange sour low
Here it looks like 2 duplicate rows were created; instead, I would expect something like
fruit taste price
0 apple sweet high
1 orange sweet low
3 orange sour low
How should I understand this behavior, and how can I obtain the result I was looking for?
If you want to pair row 1, row 2 and row 3 of df1 with row 1, row 2 and row 3 of df2 (that is, align by position), then the code below works.
import pandas as pd
df1 = pd.DataFrame({'fruit':['apple','orange','orange'],'taste':['sweet','sweet','sour']})
df2 = pd.DataFrame({'fruit':['apple','orange','orange'],'price':['high','low','low']})
df3=df1.copy()
df3["price"]=None
df3.update(df2, join="left")
print(df3)
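The reason you get duplicated rows with df3=df1.merge(df2,on='fruit') is that merge joins on the key: every row of df1 is paired with every row of df2 that has the same fruit. 'orange' appears twice in each frame, so it produces 2 x 2 = 4 rows.
This is the same behavior as a SQL inner join on a non-unique key (in effect a cross join within each key value), so researching SQL joins gives more background.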
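The row multiplication can be checked directly, and it can be avoided while still using merge. The following is a minimal sketch, assuming the intent is to pair the n-th 'orange' in df1 with the n-th 'orange' in df2. The helper column name `occ` is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"fruit": ["apple", "orange", "orange"],
                    "taste": ["sweet", "sweet", "sour"]})
df2 = pd.DataFrame({"fruit": ["apple", "orange", "orange"],
                    "price": ["high", "low", "low"]})

# Plain merge: 1 apple pair + 2 * 2 orange pairs = 5 rows.
plain = df1.merge(df2, on="fruit")

# Number the repeats of each key within each frame (0, 1, ...) and merge on
# both columns, so each occurrence is matched only with its counterpart.
left = df1.assign(occ=df1.groupby("fruit").cumcount())
right = df2.assign(occ=df2.groupby("fruit").cumcount())
paired = left.merge(right, on=["fruit", "occ"]).drop(columns="occ")

print(len(plain), len(paired))
print(paired)
```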
You should remove duplicate rows from df2 before merging:
df3 = df1.merge(df2.drop_duplicates(), on='fruit')
fruit taste price
0 apple sweet high
1 orange sweet low
2 orange sour low
I have a dataset and I am looking to see if there is a way to match data based on col values.
col-A col-B
Apple squash
Apple lettuce
Banana Carrot
Banana Carrot
Banana Carrot
dragon turnip
melon potato
melon potato
pear potato
A row should match if:
its col-A value appears in another row whose col-B value is different, or
its col-B value appears in another row whose col-A value is different.
col-A col-B
Apple squash
Apple lettuce
melon potato
melon potato
pear potato
edit fixed typo
edit2 fixed 2nd typo
So, if I understand correctly, you want to select every row whose colA group contains more than one distinct colB value, or whose colB group contains more than one distinct colA value.
I would suggest:
grA = df2.groupby("colA").filter(lambda x : x.groupby("colB").ngroups > 1)
grB = df2.groupby("colB").filter(lambda x : x.groupby("colA").ngroups > 1)
This gives:
grA
colA colB
0 Apple squash
1 Apple lettuce
and
grB
colA colB
6 melon potato
7 melon potato
8 pear potato
Concatenating the two dataframes (for example with pd.concat) gives the desired output.
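A runnable sketch of combining the two selections, assuming the question's data is loaded with the column names colA and colB used in this answer. It uses nunique() inside the filter, which is equivalent to counting sub-groups, and drops any row that both filters happen to select.

```python
import pandas as pd

# Data from the question, with the column names used in this answer.
df = pd.DataFrame({
    "colA": ["Apple", "Apple", "Banana", "Banana", "Banana",
             "dragon", "melon", "melon", "pear"],
    "colB": ["squash", "lettuce", "Carrot", "Carrot", "Carrot",
             "turnip", "potato", "potato", "potato"],
})

# Keep groups whose other column holds more than one distinct value.
grA = df.groupby("colA").filter(lambda g: g["colB"].nunique() > 1)
grB = df.groupby("colB").filter(lambda g: g["colA"].nunique() > 1)

# Stack both selections, drop rows picked by both, restore original order.
out = pd.concat([grA, grB])
out = out[~out.index.duplicated()].sort_index()
print(out)
```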
IIUC, you need to compute two masks that flag the rows whose group has more than one distinct value in the other column:
m1 = df.groupby('col-B')['col-A'].transform('nunique').gt(1)
m2 = df.groupby('col-A')['col-B'].transform('nunique').gt(1)
out = df[m1|m2]
Output:
col-A col-B
0 Apple squash
1 Apple lettuce
6 melon potato
7 melon potato
8 pear potato
You can also get the unique/exclusive pairs with:
df[~(m1|m2)]
col-A col-B
2 Banana Carrot
3 Banana Carrot
4 Banana Carrot
5 dragon turnip
So let's say I have the following:
Item Quantity
Blue Banana 3
Red Banana 4
Green Banana 1
Blue Apple 2
Orange Apple 6
I would like to grab all of the bananas and add them, no matter the color.
Or I would like to grab all Blue item, no matter the fruit type, and add them.
You can use a dictionary comprehension and str.contains:
words = ['banana', 'blue']
pd.Series({w: df.loc[df['Item'].str.contains(w, case=False), 'Quantity'].sum()
for w in words})
output:
banana 8
blue 5
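If every Item really has the form "Colour Fruit", an alternative is to split it into two columns and group on them. This is only a sketch under that assumption; the column names `colour` and `fruit` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Item": ["Blue Banana", "Red Banana", "Green Banana",
             "Blue Apple", "Orange Apple"],
    "Quantity": [3, 4, 1, 2, 6],
})

# Split "Colour Fruit" on the first space into two new columns.
parts = df["Item"].str.split(" ", n=1, expand=True)
df["colour"] = parts[0]
df["fruit"] = parts[1]

# Totals per fruit and per colour.
by_fruit = df.groupby("fruit")["Quantity"].sum()
by_colour = df.groupby("colour")["Quantity"].sum()

print(by_fruit["Banana"], by_colour["Blue"])
```

Unlike str.contains, this gives totals for every fruit and every colour at once, but it breaks if an item name does not follow the two-word pattern.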
I have a list named list1
list1 = ['Banana','Apple','Pear','Strawberry','Muskmelon','Apricot','Peach','Plum','Cherry','Blackberry','Raspberry','Cranberry','Grapes','Greenapple','Kiwi','Watermelon','Orange','Lychee','Custardapples','Jackfruit','Pineapple','Mango']
I want to form a df with specific columns and random data from list1
Eg:
a b c d e f
0 Banana Orange Lychee Custardapples Jackfruit Pineapple
1 Apple Pear Strawberry Muskmelon Apricot Peach
2 Raspberry Cherry Plum Kiwi Mango Blackberry
A structure something like this but with random data from list1?
There can't be any duplicate/repeated values present.
If every item from the list can end up everywhere in the DataFrame you could write:
pd.DataFrame(np.random.choice(list1, 3*6, replace=False).reshape(3, 6), columns=list("abcdef"))
Out:
a b c d e f
0 Lychee Peach Apricot Pear Plum Grapes
1 Cherry Jackfruit Blackberry Cranberry Kiwi Apple
2 Orange Greenapple Watermelon Banana Custardapples Raspberry
The replace-parameter in np.random.choice() is True by default, so for unique values you need to set it to False.
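The same idea as a self-contained sketch with the imports included. It swaps in NumPy's Generator API (np.random.default_rng) and shuffles the list before taking the first 18 items, which also guarantees that no value repeats. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

list1 = ['Banana', 'Apple', 'Pear', 'Strawberry', 'Muskmelon', 'Apricot',
         'Peach', 'Plum', 'Cherry', 'Blackberry', 'Raspberry', 'Cranberry',
         'Grapes', 'Greenapple', 'Kiwi', 'Watermelon', 'Orange', 'Lychee',
         'Custardapples', 'Jackfruit', 'Pineapple', 'Mango']

n_rows, n_cols = 3, 6
rng = np.random.default_rng()

# Shuffle the whole list, then keep the first 18 items: no repeats possible.
picked = rng.permutation(list1)[:n_rows * n_cols].reshape(n_rows, n_cols)
frame = pd.DataFrame(picked, columns=list("abcdef"))
print(frame)
```

This only works while n_rows * n_cols does not exceed len(list1); with 22 items, a 3 x 6 frame fits.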
I have a pandas dataframe like this:
FRUITS COLOURS
0 apple red
1 berry black
2 apple green
3 grapes green
4 apple black
5 grapes red
6 tomato black
7 tomato green
Keeping in mind the priority order of COLOURS (red > green > black), I want to eliminate all the duplicate entries in FRUITS, keeping the highest-priority colour for each fruit.
Desired output should be:
FRUITS COLOURS
0 apple red
1 berry black
2 grapes red
3 tomato green
You can set the order by converting COLOURS to an ordered categorical, then sorting and dropping the duplicate FRUITS:
df['COLOURS'] = pd.Categorical(df['COLOURS'], categories=['red','green','black'],ordered=True)
df.sort_values('COLOURS').drop_duplicates('FRUITS').sort_index()
FRUITS COLOURS
0 apple red
1 berry black
5 grapes red
7 tomato green
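For reference, here is the approach as a complete script built from the question's data. The final reset of the index is an addition so that the result matches the 0..3 numbering in the desired output, and the names `priority` and `best` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "FRUITS": ["apple", "berry", "apple", "grapes",
               "apple", "grapes", "tomato", "tomato"],
    "COLOURS": ["red", "black", "green", "green",
                "black", "red", "black", "green"],
})

# Earlier entries in the category list sort first, so red wins over green
# and green wins over black.
priority = ["red", "green", "black"]
df["COLOURS"] = pd.Categorical(df["COLOURS"], categories=priority, ordered=True)

best = (
    df.sort_values("COLOURS", kind="stable")  # best colour first
      .drop_duplicates("FRUITS")              # keep first row per fruit
      .sort_values("FRUITS")
      .reset_index(drop=True)
)
print(best)
```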