Duplicate rows when merging dataframes with repetitions [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed last month.
Say I have the following dataframes:
>>> df1 = pd.DataFrame({'fruit':['apple','orange','orange'],'taste':['sweet','sweet','sour']})
>>> df1
fruit taste
0 apple sweet
1 orange sweet
2 orange sour
>>> df2 = pd.DataFrame({'fruit':['apple','orange','orange'],'price':['high','low','low']})
>>> df2
fruit price
0 apple high
1 orange low
2 orange low
When I do df3 = df1.merge(df2, on='fruit'), I get the following result:
fruit taste price
0 apple sweet high
1 orange sweet low
2 orange sweet low
3 orange sour low
4 orange sour low
Here it looks like 2 duplicate rows were created; instead, I would expect something like
fruit taste price
0 apple sweet high
1 orange sweet low
3 orange sour low
How should I understand this behavior and how to obtain the result I was looking for?

If you want to merge row 1 of df1 with row 1 of df2, row 2 with row 2, and so on (a positional merge on the index), the code below works:
import pandas as pd
df1 = pd.DataFrame({'fruit':['apple','orange','orange'],'taste':['sweet','sweet','sour']})
df2 = pd.DataFrame({'fruit':['apple','orange','orange'],'price':['high','low','low']})
df3 = df1.copy()
df3["price"] = None  # placeholder column for the values coming from df2
df3.update(df2, join="left")  # align on the row index and fill in price
print(df3)
The reason you get duplicated rows with df3 = df1.merge(df2, on='fruit') is that merge joins every matching row of df1 with every matching row of df2: with two 'orange' rows on each side, the orange part of the result is their Cartesian product (2 × 2 = 4 rows). This is the same behavior as a SQL join on a non-unique key.
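A runnable sketch of that row multiplication, using the frames from the question: the single 'apple' row matches once (1 × 1), while the two 'orange' rows on each side produce 2 × 2 = 4 combinations, for 5 rows total.

```python
import pandas as pd

df1 = pd.DataFrame({'fruit': ['apple', 'orange', 'orange'],
                    'taste': ['sweet', 'sweet', 'sour']})
df2 = pd.DataFrame({'fruit': ['apple', 'orange', 'orange'],
                    'price': ['high', 'low', 'low']})

merged = df1.merge(df2, on='fruit')
# 1 apple match (1x1) + 4 orange matches (2x2) = 5 rows
print(len(merged))                           # 5
print((merged['fruit'] == 'orange').sum())   # 4
```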

You should remove duplicate rows from df2 before merging:
df3 = df1.merge(df2.drop_duplicates(), on='fruit')
fruit taste price
0 apple sweet high
1 orange sweet low
2 orange sour low
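If you'd rather have merge fail loudly than silently multiply rows, the validate argument (available since pandas 0.21) raises a MergeError when the right side is not unique on the join key. A sketch with the frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'fruit': ['apple', 'orange', 'orange'],
                    'taste': ['sweet', 'sweet', 'sour']})
df2 = pd.DataFrame({'fruit': ['apple', 'orange', 'orange'],
                    'price': ['high', 'low', 'low']})

try:
    df1.merge(df2, on='fruit', validate='many_to_one')
except pd.errors.MergeError as e:
    # right side has two 'orange' rows, so the check fails
    print('not many-to-one:', e)

# after deduplicating df2 the same check passes
df3 = df1.merge(df2.drop_duplicates(), on='fruit', validate='many_to_one')
print(df3)
```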

Related

How to count duplicate rows in pandas dataframe where the order of the column values is not important?

I wonder if we can extend the logic of How to count duplicate rows in pandas dataframe?, so that we also consider rows which have similar values on the columns with other rows, but the values are unordered.
Imagine a dataframe like this:
fruit1 fruit2
0 apple banana
1 cherry orange
3 apple banana
4 banana apple
we want to produce an output like this:
fruit1 fruit2 occurrences
0 apple banana 3
1 cherry orange 1
Is this possible?
You can directly re-assign the np.sort values like so, then use value_counts():
import numpy as np
df.loc[:] = np.sort(df, axis=1)
out = df.value_counts().reset_index(name='occurrences')
print(out)
Output:
fruit1 fruit2 occurrences
0 apple banana 3
1 cherry orange 1
You could use np.sort along axis=1 to sort the values in your rows.
Then it's just the regular groupby.size():
import numpy as np
import pandas as pd

fruit_cols = ['fruit1', 'fruit2']
df_sort = pd.DataFrame(np.sort(df.values, axis=1), columns=fruit_cols)
df_sort.groupby(fruit_cols, as_index=False).size()
prints:
fruit1 fruit2 size
0 apple banana 3
1 cherry orange 1
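Another order-insensitive key, not taken from the answers above but sometimes handy when np.sort is awkward (e.g. mixed dtypes): build a frozenset per row and count those. Note this loses the fruit1/fruit2 column layout, so it is only a sketch of the counting step.

```python
import pandas as pd

df = pd.DataFrame({'fruit1': ['apple', 'cherry', 'apple', 'banana'],
                   'fruit2': ['banana', 'orange', 'banana', 'apple']})

# frozenset ignores order, so (apple, banana) == (banana, apple)
key = df.apply(frozenset, axis=1)
counts = key.value_counts()
print(counts)  # {apple, banana}: 3, {cherry, orange}: 1
```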

Change values in one dataframe if values appear in another dataframe based on one column value

I have two dataframes, "big" and "correction":
big:
>>>ID class fruit
0 1 medium banana
1 2 medium banana
2 3 high peach
3 4 low nuts
4 5 low banana
and correction:
>>> ID class time fruit
0 2 medium 17 melon
1 5 high 19 oranges
I want to fix the table "big" according to the information in the table "correction", in order to get the following table:
>>>ID class fruit
0 1 medium banana
1 2 medium **melon**
2 3 high peach
3 4 low nuts
4 5 **high** **oranges**
As you can see, the starred values were "fixed" according to the correction table, matched on the ID field.
I thought of using a nested for loop, but I believe there are better ways to get the same result.
Try df.update after aligning the indexes of both dataframes:
big.set_index("ID", inplace=True)
big.update(correction.set_index("ID"))
big = big.reset_index()  # if you want `ID` back as a column
print(big)
ID class fruit
0 1 medium banana
1 2 medium melon
2 3 high peach
3 4 low nuts
4 5 high oranges
Let's try combine_first, which takes values from correction first and falls back to big:
correction.set_index('ID').combine_first(big.set_index('ID'))
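One caveat with combine_first: it keeps the union of the columns, so correction's time column also ends up in the result. A sketch (assuming the frames are named big and correction as in the question) that drops it first:

```python
import pandas as pd

big = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'class': ['medium', 'medium', 'high', 'low', 'low'],
                    'fruit': ['banana', 'banana', 'peach', 'nuts', 'banana']})
correction = pd.DataFrame({'ID': [2, 5],
                           'class': ['medium', 'high'],
                           'time': [17, 19],
                           'fruit': ['melon', 'oranges']})

fixed = (correction.set_index('ID')
         .drop(columns='time')          # keep only columns present in big
         .combine_first(big.set_index('ID'))
         .reset_index())
print(fixed)
```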

Select rows in pandas where value in one column is a substring of value in another column

I have a dataframe below
>df = pd.DataFrame({'A':['apple','orange','grape','pear','banana'], \
'B':['She likes apples', 'I hate oranges', 'This is a random sentence',\
'This one too', 'Bananas are yellow']})
>print(df)
A B
0 apple She likes apples
1 orange I hate oranges
2 grape This is a random sentence
3 pear This one too
4 banana Bananas are yellow
I'm trying to fetch all rows where column B contains the value in column A.
Expected Result:
A B
0 apple She likes apples
1 orange I hate oranges
4 banana Bananas are yellow
I'm able to do fetch only one row using
>df[df['B'].str.contains(df.iloc[0,0])]
A B
0 apple She likes apples
How can I fetch all such rows?
Use DataFrame.apply with axis=1: lowercase the value in B, test containment with the in operator, and filter by boolean indexing:
df = df[df.apply(lambda x: x.A in x.B.lower(), axis=1)]
Or list comprehension solution:
df = df[[a in b.lower() for a, b in zip(df.A, df.B)]]
print (df)
A B
0 apple She likes apples
1 orange I hate oranges
4 banana Bananas are yellow
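Both solutions above rely on column A already being lowercase. A symmetric sketch that lowercases both sides, in case A can contain capitals too:

```python
import pandas as pd

df = pd.DataFrame({'A': ['apple', 'orange', 'grape', 'pear', 'banana'],
                   'B': ['She likes apples', 'I hate oranges',
                         'This is a random sentence', 'This one too',
                         'Bananas are yellow']})

# case-insensitive substring test per row
mask = [a.lower() in b.lower() for a, b in zip(df['A'], df['B'])]
out = df[mask]
print(out)
```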

Python & pandas: how to group in a nonstandard way

I know basic pandas functions, but I'm not clear how to group in this case.
I have a dataframe with a list of various fruits and their characteristics:
fruit x1 x2
apple red sweet
apple yellow sweet
apple green tart
apple red sweet
apple red sweet
How could I count each combination (there are more fruits after apples) like this?
3 apples: red and sweet
1 apple: yellow and sweet
1 apple: green and tart
I've looked at groupby, tried an apply function, and looked over pandas documentation, but this escapes me.
Any ideas? Thank you so much.
With Counter
import pandas as pd
from collections import Counter
pd.Series(Counter(map(tuple, df.values)))
apple green tart 1
red sweet 3
yellow sweet 1
dtype: int64
With pd.factorize and np.bincount:
import numpy as np
i, r = pd.factorize(list(map(tuple, df.values)))
pd.Series(dict(zip(r, np.bincount(i))))
apple green tart 1
red sweet 3
yellow sweet 1
dtype: int64
You can try below:
df['count']=0
group_df = df.groupby(["fruit","x1","x2"])['count'].count().reset_index()
The output will be as below:
fruit x1 x2 count
0 apple green tart 1
1 apple red sweet 3
2 apple yellow sweet 1
Of course, you can concatenate the columns after this to match your required output exactly.
And if you want the result sorted by count:
group_df = df.groupby(["fruit","x1","x2"])['count'].count().reset_index().sort_values(by=['count'],ascending=False)
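The same counts can also be produced in one step with DataFrame.value_counts (available since pandas 1.1), which counts unique rows across all columns and already returns them sorted by count, descending:

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple'] * 5,
                   'x1': ['red', 'yellow', 'green', 'red', 'red'],
                   'x2': ['sweet', 'sweet', 'tart', 'sweet', 'sweet']})

# count unique (fruit, x1, x2) rows, most frequent first
out = df.value_counts().reset_index(name='count')
print(out)  # (apple, red, sweet) with count 3 ranks first
```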

Merge dataframes based on column, only keeping first match

I have 2 dataframes like the following.
df_1
Index Fruit
1 Apple
2 Banana
3 Peach
df_2
Fruit Taste
Apple Tasty
Banana Tasty
Banana Rotten
Peach Rotten
Peach Tasty
Peach Tasty
I want to merge the two dataframes based on Fruit but only keeping the first occurrence of Apple, Banana, and Peach in the second dataframe. The final result should be:
df_output
Index Fruit Taste
1 Apple Tasty
2 Banana Tasty
3 Peach Rotten
Where Index, Fruit, and Taste are column headers. I tried something like df_1.merge(df_2, how='left', on='Fruit'), but it created extra rows based on the length of df_2.
Thanks.
Use drop_duplicates to keep only the first row per Fruit:
df = df_1.merge(df_2.drop_duplicates('Fruit'),how='left',on='Fruit')
print (df)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten
If you only want to add a single column, using map is faster:
s = df_2.drop_duplicates('Fruit').set_index('Fruit')['Taste']
df_1['Taste'] = df_1['Fruit'].map(s)
print (df_1)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten
