Merge dataframes based on column, only keeping first match - python

I have 2 dataframes like the following.
df_1
Index Fruit
1 Apple
2 Banana
3 Peach
df_2
Fruit Taste
Apple Tasty
Banana Tasty
Banana Rotten
Peach Rotten
Peach Tasty
Peach Tasty
I want to merge the two dataframes based on Fruit but only keeping the first occurrence of Apple, Banana, and Peach in the second dataframe. The final result should be:
df_output
Index Fruit Taste
1 Apple Tasty
2 Banana Tasty
3 Peach Rotten
Where Fruit, Index, and Taste are column headers. I tried something like df_1.merge(df_2, how='left', on='Fruit'), but it created extra rows because of the duplicate fruits in df_2.
Thanks.

Use drop_duplicates to keep only the first row per Fruit:
df = df_1.merge(df_2.drop_duplicates('Fruit'),how='left',on='Fruit')
print (df)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten
If you only want to add a single column, it is faster to use map:
s = df_2.drop_duplicates('Fruit').set_index('Fruit')['Taste']
df_1['Taste'] = df_1['Fruit'].map(s)
print (df_1)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten
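If the rule for picking a row is anything other than "first", a groupby aggregation is an alternative; a minimal sketch of the same merge (groupby().first() takes the first non-null value per column, which matches drop_duplicates here because df_2 has no NaNs):
import pandas as pd

df_1 = pd.DataFrame({'Index': [1, 2, 3], 'Fruit': ['Apple', 'Banana', 'Peach']})
df_2 = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Banana', 'Peach', 'Peach', 'Peach'],
                     'Taste': ['Tasty', 'Tasty', 'Rotten', 'Rotten', 'Tasty', 'Tasty']})

# reduce df_2 to one row per Fruit, then merge as before
first = df_2.groupby('Fruit', as_index=False).first()
df = df_1.merge(first, how='left', on='Fruit')
print(df)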

Duplicate rows when merging dataframes with repetitions [duplicate]

Say I have the following dataframes:
>>> df1 = pd.DataFrame({'fruit':['apple','orange','orange'],'taste':['sweet','sweet','sour']})
>>> df1
fruit taste
0 apple sweet
1 orange sweet
2 orange sour
>>> df2 = pd.DataFrame({'fruit':['apple','orange','orange'],'price':['high','low','low']})
>>> df2
fruit price
0 apple high
1 orange low
2 orange low
When I do df3 = df1.merge(df2, on='fruit'), I get the following result:
fruit taste price
0 apple sweet high
1 orange sweet low
2 orange sweet low
3 orange sour low
4 orange sour low
Here it looks like 2 duplicate rows were created; instead, I would expect something like
fruit taste price
0 apple sweet high
1 orange sweet low
3 orange sour low
How should I understand this behavior, and how can I obtain the result I was looking for?
If you want to combine row 1, row 2, row 3 of df1 with row 1, row 2, row 3 of df2 (i.e. align by position rather than by key), the code below works.
import pandas as pd
df1 = pd.DataFrame({'fruit':['apple','orange','orange'],'taste':['sweet','sweet','sour']})
df2 = pd.DataFrame({'fruit':['apple','orange','orange'],'price':['high','low','low']})
df3 = df1.copy()
df3["price"] = None          # create the column so update() has somewhere to write
df3.update(df2, join="left") # aligns on the index, so rows combine by position here
print(df3)
The reason you get duplicated rows with df3 = df1.merge(df2, on='fruit') is that merge performs a many-to-many join on the key: every 'orange' row in df1 is paired with every 'orange' row in df2, so the two 'orange' rows on each side produce 2 x 2 = 4 combined rows. This is the same behavior as a SQL join on a non-unique key; researching SQL joins gives more background.
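If you want the merge itself to catch this, merge accepts a validate argument that raises when the keys are less unique than you expect; a minimal sketch:
import pandas as pd

df1 = pd.DataFrame({'fruit':['apple','orange','orange'],'taste':['sweet','sweet','sour']})
df2 = pd.DataFrame({'fruit':['apple','orange','orange'],'price':['high','low','low']})

try:
    # 'one_to_one' asserts 'fruit' is unique on both sides; it is not here
    df1.merge(df2, on='fruit', validate='one_to_one')
except pd.errors.MergeError as e:
    print(e)  # explains which side has duplicate keys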
You should remove duplicate rows before the merge:
df3 = df1.merge(df2.drop_duplicates(), on='fruit')
fruit taste price
0 apple sweet high
1 orange sweet low
2 orange sour low

How do I create a table to match based on different column values? If statements?

I have a dataset and I am looking to see if there is a way to match data based on col values.
col-A col-B
Apple squash
Apple lettuce
Banana Carrot
Banana Carrot
Banana Carrot
dragon turnip
melon potato
melon potato
pear potato
Match rules:
if col-A matches another row's col-A and col-B doesn't match
if col-B matches another row's col-B and col-A doesn't match
col-A col-B
Apple squash
Apple lettuce
melon potato
melon potato
pear potato
So, if I understand correctly, you want to select the rows such that grouping by col-A (resp. col-B) and then by col-B (resp. col-A) leads to more than one group. I would advise:
grA = df.groupby("col-A").filter(lambda x: x.groupby("col-B").ngroups > 1)
grB = df.groupby("col-B").filter(lambda x: x.groupby("col-A").ngroups > 1)
Leading to:
grA
col-A col-B
0 Apple squash
1 Apple lettuce
and
grB
col-A col-B
6 melon potato
7 melon potato
8 pear potato
Concatenating the two selections leads to the desired output, as shown below.
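A minimal sketch of that final step, reusing grA and grB from above:
import pandas as pd

# stack the two selections and restore the original row order
out = pd.concat([grA, grB]).sort_index()
# if a row could satisfy both conditions, drop the index duplicates
out = out[~out.index.duplicated()]
print(out)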
IIUC, you need to compute two masks that flag the groups with more than one unique match in the other column:
m1 = df.groupby('col-B')['col-A'].transform('nunique').gt(1)
m2 = df.groupby('col-A')['col-B'].transform('nunique').gt(1)
out = df[m1|m2]
Output:
col-A col-B
0 Apple squash
1 Apple lettuce
6 melon potato
7 melon potato
8 pear potato
You can also get the unique/exclusive pairs with:
df[~(m1|m2)]
col-A col-B
2 Banana Carrot
3 Banana Carrot
4 Banana Carrot
5 dragon turnip

How to create a DataFrame with random sample from list?

I have a list named list1
list1 = ['Banana','Apple','Pear','Strawberry','Muskmelon','Apricot','Peach','Plum','Cherry','Blackberry','Raspberry','Cranberry','Grapes','Greenapple','Kiwi','Watermelon','Orange','Lychee','Custardapples','Jackfruit','Pineapple','Mango']
I want to form a df with specific columns and random data from list1
Eg:
a b c d e f
0 Banana Orange Lychee Custardapples Jackfruit Pineapple
1 Apple Pear Strawberry Muskmelon Apricot Peach
2 Raspberry Cherry Plum Kiwi Mango Blackberry
A structure something like this but with random data from list1?
There can't be any duplicate/repeated values present.
If every item from the list can end up anywhere in the DataFrame, you could write:
pd.DataFrame(np.random.choice(list1, 3*6, replace=False).reshape(3, 6), columns=list("abcdef"))
Out:
a b c d e f
0 Lychee Peach Apricot Pear Plum Grapes
1 Cherry Jackfruit Blackberry Cranberry Kiwi Apple
2 Orange Greenapple Watermelon Banana Custardapples Raspberry
The replace parameter of np.random.choice() is True by default, so for unique values you need to set it to False.
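An equivalent standard-library route is random.sample, which also draws without replacement; a minimal sketch reusing list1 from the question:
import random
import pandas as pd

rows, cols = 3, 6
picks = random.sample(list1, rows * cols)  # raises ValueError if list1 has fewer than 18 items
df = pd.DataFrame([picks[i * cols:(i + 1) * cols] for i in range(rows)],
                  columns=list("abcdef"))
print(df)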

Select rows in pandas where value in one column is a substring of value in another column

I have a dataframe below
>df = pd.DataFrame({'A':['apple','orange','grape','pear','banana'], \
'B':['She likes apples', 'I hate oranges', 'This is a random sentence',\
'This one too', 'Bananas are yellow']})
>print(df)
A B
0 apple She likes apples
1 orange I hate oranges
2 grape This is a random sentence
3 pear This one too
4 banana Bananas are yellow
I'm trying to fetch all rows where column B contains the value in column A.
Expected Result:
A B
0 apple She likes apples
1 orange I hate oranges
4 banana Bananas are yellow
I'm able to fetch only one row using
>df[df['B'].str.contains(df.iloc[0,0])]
A B
0 apple She likes apples
How can I fetch all such rows?
Use DataFrame.apply to lowercase the value in B and test membership with the in operator, then filter by boolean indexing:
df = df[df.apply(lambda x: x.A in x.B.lower(), axis=1)]
Or a list comprehension solution:
df = df[[a in b.lower() for a, b in zip(df.A, df.B)]]
print (df)
A B
0 apple She likes apples
1 orange I hate oranges
4 banana Bananas are yellow
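Note this relies on column A already being lowercase; if A could be mixed case too, lower both sides (a minimal variant of the list comprehension above, reusing df):
df = df[[a.lower() in b.lower() for a, b in zip(df.A, df.B)]]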

Problems while concatenating string columns into a new column with pandas?

I have the following pandas dataframe:
ColA ColB
orange NaN
apple red apples
NaN fruit
... ...
tomato tomato
I am interested in concatenating ColA and ColB into a new column (ColC); the problem is that when I do:
df["ColC"] = df["ColA"].map(str) + df["ColB"]
I get:
ColA ColB ColC
orange NaN orangenan
apple red apples applered apples
NaN fruit nanfruit
... ... ...
tomato tomato tomatotomato
How can I handle repeated strings and NaNs, and join the different strings with a comma separator? For example, the expected output should be:
ColA ColB ColC
orange NaN orange
apple red apples apple, red apples
NaN fruit fruit
... ... ...
tomato tomato tomato
UPDATE
After trying @MaxU's solution:
df["ColC"] = df[["ColA","ColB"].fillna('').astype(str).sum(1)
I am still having problems with rows like:
apple red apples applered apples
since the strings are not separated by commas; the expected result is:
apple red apples apple, red apples
Any idea how to get the strings separated by commas?
Try this:
df["ColC"] = df["ColA"].fillna('').astype(str) + df["ColB"].fillna('').astype(str)
or:
df["ColC"] = df[["ColA","ColB"]].fillna('').astype(str).sum(1)
UPDATE:
cols = ['ColA','ColB']
In [94]: df['ColC'] = df[cols].apply(lambda x: ', '.join(x.dropna().unique()), axis=1)
In [95]: df
Out[95]:
ColA ColB ColC
0 orange NaN orange
1 apple red apples apple, red apples
2 NaN fruit fruit
3 tomato tomato tomato
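Here dropna() is what keeps the NaN out of the joined string (no more 'orangenan'), and unique() collapses the repeated 'tomato' into a single value before the comma join.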
