Problems concatenating string columns into a new column with pandas - python

I have the following pandas dataframe:
ColA ColB
orange NaN
apple red apples
NaN fruit
... ...
tomato tomato
I am interested in concatenating ColA and ColB into a new column (ColC), the problem is that when I do:
df["ColC"] = df["ColA"].map(str) + df["ColB"]
I get:
ColA ColB ColC
orange NaN orangenan
apple red apples applered apples
NaN fruit nanfruit
... ... ...
tomato tomato tomatotomato
How can I handle repeated strings and NaNs, and join different strings with a comma separator? For example, the expected output should be:
ColA ColB ColC
orange NaN orange
apple red apples apple, red apples
NaN fruit fruit
... ... ...
tomato tomato tomato
UPDATE
After trying @MaxU's solution:
df["ColC"] = df[["ColA","ColB"]].fillna('').astype(str).sum(1)
I still get, for the second row:
apple red apples applered apples
since the strings are not separated by commas. The expected result is:
apple red apples apple, red apples
Any idea how to get the strings separated by commas?

Try this:
df["ColC"] = df["ColA"].fillna('').astype(str) + df["ColB"].fillna('').astype(str)
or:
df["ColC"] = df[["ColA","ColB"]].fillna('').astype(str).sum(1)
UPDATE:
cols = ['ColA','ColB']
In [94]: df['ColC'] = df[cols].apply(lambda x: ', '.join(x.dropna().unique()), axis=1)
In [95]: df
Out[95]:
ColA ColB ColC
0 orange NaN orange
1 apple red apples apple, red apples
2 NaN fruit fruit
3 tomato tomato tomato
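For reference, here is a self-contained sketch of the dropna/unique approach above, using a reconstruction of the question's frame (the "..." rows are omitted):

```python
import numpy as np
import pandas as pd

# Reconstruction of the question's data.
df = pd.DataFrame({"ColA": ["orange", "apple", np.nan, "tomato"],
                   "ColB": [np.nan, "red apples", "fruit", "tomato"]})

# Per row: drop NaNs, drop duplicate values, then join the rest with ", ".
df["ColC"] = df[["ColA", "ColB"]].apply(
    lambda row: ", ".join(row.dropna().unique()), axis=1)
```

This handles all three requirements in one pass: `dropna()` removes the NaNs, `unique()` collapses the repeated `tomato`, and `", ".join` inserts the comma separator.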

Related

Groupby drop duplicates

I have a df where the category is separated by underscores.
df
fruit cat
0 apple green_heavy_pricy
1 apple heavy_cheap
2 banana yellow
3 pear green
4 banana brown_raw_yellow
...
I want to create an agg column that gathers all unique information. I tried df.groupby("fruit")["cat"].transform("unique"), but it did not give the result I want. Expected output:
fruit cat agg
0 apple green_heavy_pricy green_heavy_pricy_cheap
1 apple heavy_cheap green_heavy_pricy_cheap
2 banana yellow yellow_brown_raw
3 pear green green
4 banana brown_raw_yellow yellow_brown_raw
Use a custom lambda function with dict.fromkeys inside GroupBy.transform:
f = lambda x: '_'.join(dict.fromkeys('_'.join(x).split('_')))
#alternative solution
#f = lambda x: '_'.join(pd.unique('_'.join(x).split('_')))
#alternative2 solution
#f = lambda x: '_'.join(dict.fromkeys(y for s in x for y in s.split('_')))
df['agg'] = df.groupby("fruit")["cat"].transform(f)
print (df)
fruit cat agg
0 apple green_heavy_pricy green_heavy_pricy_cheap
1 apple heavy_cheap green_heavy_pricy_cheap
2 banana yellow yellow_brown_raw
3 pear green green
4 banana brown_raw_yellow yellow_brown_raw
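A runnable sketch of the transform above, on a reconstruction of the question's frame:

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "apple", "banana", "pear", "banana"],
                   "cat": ["green_heavy_pricy", "heavy_cheap", "yellow",
                           "green", "brown_raw_yellow"]})

# Join all values of a group into one string, split it back into tokens,
# and deduplicate while preserving order (dict.fromkeys keeps insertion
# order in Python 3.7+); transform broadcasts the result to every row.
f = lambda x: "_".join(dict.fromkeys("_".join(x).split("_")))
df["agg"] = df.groupby("fruit")["cat"].transform(f)
```

Note the order-preserving deduplication is what makes `apple` come out as `green_heavy_pricy_cheap` rather than an arbitrary token order.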

How do I create a table to match based on different column values? If statements?

I have a dataset and I am looking to see if there is a way to match data based on col values.
col-A col-B
Apple squash
Apple lettuce
Banana Carrot
Banana Carrot
Banana Carrot
dragon turnip
melon potato
melon potato
pear potato
Match rows where:
col A matches another row's col A and col B doesn't match, or
col B matches another row's col B and col A doesn't match:
col-A col-B
Apple squash
Apple lettuce
melon potato
melon potato
pear potato
So, if I understand correctly, you want to select the rows such that grouping by colA (resp. colB) and then by colB (resp. colA) leads to more than one group.
I can advise:
grA = df2.groupby("colA").filter(lambda x : x.groupby("colB").ngroups > 1)
grB = df2.groupby("colB").filter(lambda x : x.groupby("colA").ngroups > 1)
Leading to:
grA
colA colB
0 Apple squash
1 Apple lettuce
and
grB
colA colB
6 melon potato
7 melon potato
8 pear potato
Merging the two dataframes will lead to the desired output.
IIUC, you need to compute two masks to identify which group has a unique match with the other values:
m1 = df.groupby('col-B')['col-A'].transform('nunique').gt(1)
m2 = df.groupby('col-A')['col-B'].transform('nunique').gt(1)
out = df[m1|m2]
Output:
col-A col-B
0 Apple squash
1 Apple lettuce
6 melon potato
7 melon potato
8 pear potato
You can also get the unique/exclusive pairs with:
df[~(m1|m2)]
col-A col-B
2 Banana Carrot
3 Banana Carrot
4 Banana Carrot
5 dragon turnip
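A self-contained sketch of the mask approach, on a reconstruction of the question's data:

```python
import pandas as pd

df = pd.DataFrame({"col-A": ["Apple", "Apple", "Banana", "Banana", "Banana",
                             "dragon", "melon", "melon", "pear"],
                   "col-B": ["squash", "lettuce", "Carrot", "Carrot", "Carrot",
                             "turnip", "potato", "potato", "potato"]})

# m1: this row's col-B value pairs with more than one distinct col-A value.
m1 = df.groupby("col-B")["col-A"].transform("nunique").gt(1)
# m2: this row's col-A value pairs with more than one distinct col-B value.
m2 = df.groupby("col-A")["col-B"].transform("nunique").gt(1)
out = df[m1 | m2]
```

The `transform('nunique')` broadcast keeps the masks aligned with the original index, so the two conditions can be combined row-wise with `|`.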

How to create a DataFrame with random sample from list?

I have a list named list1
list1 = ['Banana','Apple','Pear','Strawberry','Muskmelon','Apricot','Peach','Plum','Cherry','Blackberry','Raspberry','Cranberry','Grapes','Greenapple','Kiwi','Watermelon','Orange','Lychee','Custardapples','Jackfruit','Pineapple','Mango']
I want to form a df with specific columns and random data from list1
Eg:
a b c d e f
0 Banana Orange Lychee Custardapples Jackfruit Pineapple
1 Apple Pear Strawberry Muskmelon Apricot Peach
2 Raspberry Cherry Plum Kiwi Mango Blackberry
A structure something like this but with random data from list1?
There can't be any duplicate/repeated values present.
If every item from the list can end up anywhere in the DataFrame, you could write:
pd.DataFrame(np.random.choice(list1, 3*6, replace=False).reshape(3, 6), columns=list("abcdef"))
Out:
a b c d e f
0 Lychee Peach Apricot Pear Plum Grapes
1 Cherry Jackfruit Blackberry Cranberry Kiwi Apple
2 Orange Greenapple Watermelon Banana Custardapples Raspberry
The replace parameter of np.random.choice() is True by default, so for unique values you need to set it to False.
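A runnable sketch of this approach (the sampled values vary per run, so only shape and uniqueness are fixed):

```python
import numpy as np
import pandas as pd

list1 = ['Banana', 'Apple', 'Pear', 'Strawberry', 'Muskmelon', 'Apricot',
         'Peach', 'Plum', 'Cherry', 'Blackberry', 'Raspberry', 'Cranberry',
         'Grapes', 'Greenapple', 'Kiwi', 'Watermelon', 'Orange', 'Lychee',
         'Custardapples', 'Jackfruit', 'Pineapple', 'Mango']

# replace=False guarantees no repeats; this only works while
# len(list1) >= 3 * 6, otherwise np.random.choice raises a ValueError.
sample = np.random.choice(list1, 3 * 6, replace=False)
df = pd.DataFrame(sample.reshape(3, 6), columns=list("abcdef"))
```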

Select rows in pandas where value in one column is a substring of value in another column

I have a dataframe below
>df = pd.DataFrame({'A':['apple','orange','grape','pear','banana'], \
'B':['She likes apples', 'I hate oranges', 'This is a random sentence',\
'This one too', 'Bananas are yellow']})
>print(df)
A B
0 apple She likes apples
1 orange I hate oranges
2 grape This is a random sentence
3 pear This one too
4 banana Bananas are yellow
I'm trying to fetch all rows where column B contains the value in column A.
Expected Result:
A B
0 apple She likes apples
1 orange I hate oranges
4 banana Bananas are yellow
I'm able to do fetch only one row using
>df[df['B'].str.contains(df.iloc[0,0])]
A B
0 apple She likes apples
How can I fetch all such rows?
Use DataFrame.apply, convert the values in B to lower case, test containment with the in operator, and filter by boolean indexing:
df = df[df.apply(lambda x: x.A in x.B.lower(), axis=1)]
Or list comprehension solution:
df = df[[a in b.lower() for a, b in zip(df.A, df.B)]]
print (df)
A B
0 apple She likes apples
1 orange I hate oranges
4 banana Bananas are yellow
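The list-comprehension variant end to end, on the question's frame:

```python
import pandas as pd

df = pd.DataFrame({"A": ["apple", "orange", "grape", "pear", "banana"],
                   "B": ["She likes apples", "I hate oranges",
                         "This is a random sentence", "This one too",
                         "Bananas are yellow"]})

# Row-wise substring test; lower-casing B makes the match case-insensitive
# (the values in A are already lower case here).
out = df[[a in b.lower() for a, b in zip(df.A, df.B)]]
```

Note this matches `apple` inside `apples` as well, which is exactly what the expected output asks for.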

Merge dataframes based on column, only keeping first match

I have 2 dataframes like the following.
df_1
Index Fruit
1 Apple
2 Banana
3 Peach
df_2
Fruit Taste
Apple Tasty
Banana Tasty
Banana Rotten
Peach Rotten
Peach Tasty
Peach Tasty
I want to merge the two dataframes based on Fruit but only keeping the first occurrence of Apple, Banana, and Peach in the second dataframe. The final result should be:
df_output
Index Fruit Taste
1 Apple Tasty
2 Banana Tasty
3 Peach Rotten
Where Fruit, Index, and Taste are column headers. I tried something like df_1.merge(df_2, how='left', on='Fruit'), but it created extra rows because of the duplicate Fruit values in df_2.
Thanks.
Use drop_duplicates to keep only the first row per Fruit:
df = df_1.merge(df_2.drop_duplicates('Fruit'),how='left',on='Fruit')
print (df)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten
If you want to add only one column, it is faster to use map:
s = df_2.drop_duplicates('Fruit').set_index('Fruit')['Taste']
df_1['Taste'] = df_1['Fruit'].map(s)
print (df_1)
Index Fruit Taste
0 1 Apple Tasty
1 2 Banana Tasty
2 3 Peach Rotten
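A self-contained sketch of the merge approach, on a reconstruction of the two frames:

```python
import pandas as pd

df_1 = pd.DataFrame({"Index": [1, 2, 3],
                     "Fruit": ["Apple", "Banana", "Peach"]})
df_2 = pd.DataFrame({"Fruit": ["Apple", "Banana", "Banana",
                               "Peach", "Peach", "Peach"],
                     "Taste": ["Tasty", "Tasty", "Rotten",
                               "Rotten", "Tasty", "Tasty"]})

# drop_duplicates keeps the first occurrence of each Fruit by default,
# so the left merge contributes exactly one Taste per row of df_1.
df = df_1.merge(df_2.drop_duplicates("Fruit"), how="left", on="Fruit")
```

Deduplicating before the merge is the key step: merging on a key with duplicates on the right side would otherwise multiply the rows.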
