Python Pandas DataFrame str contains merge if

I want to merge the rows of the two dataframes below whenever a string in the Test1 column of DF2 contains a value from the Test1 column of DF1 as a substring.
import pandas as pd

DF1 = pd.DataFrame({'Test1': list('ABC'),
                    'Test2': [1, 2, 3]})
print(DF1)
  Test1  Test2
0     A      1
1     B      2
2     C      3
DF2 = pd.DataFrame({'Test1': ['ee', 'bA', 'cCc', 'D'],
                    'Test2': [1, 2, 3, 4]})
print(DF2)
  Test1  Test2
0    ee      1
1    bA      2
2   cCc      3
3     D      4
Using str.contains, I can identify which strings in DF2.Test1 contain a value from DF1.Test1 as a substring:
INPUT:
for i in DF1.Test1:
    ok = DF2[DF2.Test1.str.contains(i)]
    print(ok)
OUTPUT:
  Test1  Test2
1    bA      2
Empty DataFrame
Columns: [Test1, Test2]
Index: []
  Test1  Test2
2   cCc      3
Now I would like the output to also contain the merged rows: each value from DF1.Test1 next to the DF2 row it matches, together with both Test2 columns.
OUTPUT:
  Test1_x  Test2_x Test1_y  Test2_y
0       A        1      bA        2
1       C        3     cCc        3
For that, I tried pd.merge combined with an if, but I have not found the right code yet. Do you have any suggestions?
for i in DF1.Test1:
    if DF2.Test1.str.contains(i) == 'True':
        ok = pd.merge(DF1, DF2, on=['Test1'[i]], how='outer')
        print(ok)
Thank you for your ideas :)

I could not respond to jezrael's comment because of my reputation, but I changed his answer into a function that merges on non-capitalized (lowercased) text.
def str_merge(part_string_df, full_string_df, merge_column):
    merge_column_lower = 'merge_column_lower'
    part_string_df[merge_column_lower] = part_string_df[merge_column].str.lower()
    full_string_df[merge_column_lower] = full_string_df[merge_column].str.lower()
    pat = '|'.join(r"{}".format(x) for x in part_string_df[merge_column_lower])
    full_string_df['Test3'] = full_string_df[merge_column_lower].str.extract('(' + pat + ')', expand=True)
    DF = pd.merge(part_string_df, full_string_df,
                  left_on=merge_column_lower, right_on='Test3').drop(
                      [merge_column_lower + '_x', merge_column_lower + '_y', 'Test3'], axis=1)
    return DF
Used with example:
DF1 = pd.DataFrame({'Test1': list('ABC'),
                    'Test2': [1, 2, 3]})
DF2 = pd.DataFrame({'Test1': ['ee', 'bA', 'cCc', 'D'],
                    'Test2': [1, 2, 3, 4]})
print(str_merge(DF1, DF2, 'Test1'))
  Test1_x  Test2_x Test1_y  Test2_y
0       B        2      bA        2
1       C        3     cCc        3

I believe you need to extract the values into a new column, then merge, and finally remove the helper column Test3:
pat = '|'.join(r"{}".format(x) for x in DF1.Test1)
DF2['Test3'] = DF2.Test1.str.extract('(' + pat + ')', expand=False)
DF = pd.merge(DF1, DF2, left_on='Test1', right_on='Test3').drop('Test3', axis=1)
print(DF)
  Test1_x  Test2_x Test1_y  Test2_y
0       A        1      bA        2
1       C        3     cCc        3
Detail:
print(DF2)
  Test1  Test2 Test3
0    ee      1   NaN
1    bA      2     A
2   cCc      3     C
3     D      4   NaN
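As a side note: if the matching itself should be case-insensitive without the lowercase helper columns used in str_merge above, str.extract accepts regex flags. A minimal sketch (re.escape and the .str.upper() normalization are my additions, assuming the DF1.Test1 keys are uppercase and should match regardless of case):
import re

# build the pattern from DF1, escaping any regex metacharacters as a precaution
pat = '|'.join(re.escape(x) for x in DF1.Test1)

# extract case-insensitively, then uppercase the match so it lines up with DF1.Test1
DF2['Test3'] = DF2.Test1.str.extract('(' + pat + ')', flags=re.IGNORECASE,
                                     expand=False).str.upper()
DF = pd.merge(DF1, DF2, left_on='Test1', right_on='Test3').drop('Test3', axis=1)
With the sample frames this pairs B with bA and C with cCc, matching the str_merge output above.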

Related

Pandas DataFrame efficiently split one column into multiple

I have a dataframe similar to this:
data = {"col_1": [0, 1, 2],
        "col_2": ["abc", "defg", "hi"]}
df = pd.DataFrame(data)
Visually:
   col_1 col_2
0      0   abc
1      1  defg
2      2    hi
What I'd like to do is split col_2 into its individual characters and append them as new columns to the dataframe.
Example iterative method:
def get_chars(string):
    chars = []
    for char in string:
        chars.append(char)
    return chars

char_df = pd.DataFrame()
for i in range(len(df)):
    char_arr = get_chars(df.loc[i, "col_2"])
    temp_df = pd.DataFrame(char_arr).T
    char_df = pd.concat([char_df, temp_df], ignore_index=True, axis=0)

df = pd.concat([df, char_df], ignore_index=True, axis=1)
Which results in the correct form:
   0     1  2  3    4    5
0  0   abc  a  b    c  NaN
1  1  defg  d  e    f    g
2  2    hi  h  i  NaN  NaN
But I believe iterating through the dataframe like this is very inefficient, so I want to find a faster (ideally vectorised) solution.
In reality, I'm not really splitting up strings, but the point of this question is to find a way to efficiently process one column, and return many.
If you need performance, use the DataFrame constructor with the values converted to lists:
df = df.join(pd.DataFrame([list(x) for x in df['col_2']], index=df.index))
Or:
df = df.join(pd.DataFrame(df['col_2'].apply(list).tolist(), index=df.index))
print(df)
   col_1 col_2  0  1     2     3
0      0   abc  a  b     c  None
1      1  defg  d  e     f     g
2      2    hi  h  i  None  None
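Since the stated goal is really to process one column and return many, the same constructor trick generalizes to any function that returns a fixed-length sequence per value. A hedged sketch (split_fn and the column names head/tail/length are hypothetical, not from the question):
# split_fn is a hypothetical stand-in for your real per-value transformation
def split_fn(value):
    return [value[:1], value[1:], len(value)]

new_cols = pd.DataFrame([split_fn(x) for x in df['col_2']],
                        index=df.index,
                        columns=['head', 'tail', 'length'])
df = df.join(new_cols)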

Python - lookup value in dataframe and return random corresponding value

df1 has a lot of NaN values.
I have compiled df2 with all unique values for code and name.
I need to replace the NaN code values in df1 with a random code value from df2 where df1 and df2 match on name.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['ID', 'name', 'code'])
df1.ID = [1, 2, 3, 4]
df1.name = ['A', 'A', 'B', 'B']
df1.code = [np.nan, np.nan, np.nan, np.nan]

df2 = pd.DataFrame(columns=['name', 'code'])
df2.name = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
df2.code = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
df1:
   ID name code
0   1    A  NaN
1   2    A  NaN
2   3    B  NaN
3   4    B  NaN
df2:
  name code
0    A    a
1    A    b
2    A    c
3    A    d
4    B    e
5    B    f
6    B    g
7    B    h
An example result would fill each NaN in df1.code with a random code drawn from the df2 rows with the matching name.
You could use random.sample and pass 2 after joining the values for each group into a list. Then merge back into the initial dataframe, explode the list, and drop_duplicates():
import random
df2 = df2.groupby('name')['code'].apply(lambda x: random.sample(list(x), 2)).reset_index()
df3 = df1[['ID', 'name']].merge(df2).explode('code').drop_duplicates(['name', 'code']).reset_index(drop=True)
df3['ID'] = np.flatnonzero(df3['ID']) + 1
Out[1]:
   ID name code
0   1    A    d
1   2    A    a
2   3    B    h
3   4    B    f
You could create a dictionary where the keys are names and the values are the possible codes, then for each name in df1 sample from the corresponding values:
import random

lookup = df2.groupby('name')['code'].apply(list).to_dict()
df1['code'] = df1['code'].fillna(pd.Series([random.choice(lookup[name]) for name in df1['name']],
                                           index=df1.index))
print(df1)
Output
   ID name code
0   1    A    b
1   2    A    b
2   3    B    g
3   4    B    g
If sampling without replacement is needed, you could do:
lst = [s for k, g in df1.groupby('name', as_index=False) for s in random.sample(lookup[k], len(g))]
df1['code'] = df1['code'].fillna(pd.Series(lst, index=df1.index))
print(df1)
Output
   ID name code
0   1    A    d
1   2    A    a
2   3    B    e
3   4    B    h
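One practical note: both approaches rely on Python's random module, so the result changes on every run. If you need reproducible draws, seed the generator first. A minimal sketch (the seed value 42 is arbitrary; since df1['code'] starts out all NaN here, plain assignment works in place of fillna):
import random

random.seed(42)  # any fixed seed makes the sampled codes repeatable across runs
lookup = df2.groupby('name')['code'].apply(list).to_dict()
df1['code'] = [random.choice(lookup[name]) for name in df1['name']]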

How to replace a value in column A where column B is equal to value?

I want to replace the start of a string (the first 5 characters) with nothing, but only on the rows where the companion column equals a given value (here: where ColumnA is 'A'). I am not getting any further than replacing a value without the condition mentioned above.
My data:
  ColumnA     Vendor
1       A  ABBC/1234
2       B  BCCD/1234
3       B       1234
4       C  1234ABBC/
I have tried the following code:
Dataset.ColumnA = Dataset.ColumnA.replace(regex=['ABBC/'], value='')
# This should be the output
  ColumnA     Vendor
1       A       1234
2       B  BCCD/1234
3       B       1234
4       C  1234ABBC/
>>> df = pd.DataFrame({'ColumnA': ['A', 'B', 'B', 'C'], 'Vendor': ['ABBC/1234', 'BCCD/1234', '1234', '1234ABBC/']})
>>> cola = ''.join(df['ColumnA'].values.tolist())
>>> df['Vendor'] = df.apply(lambda row: row['Vendor'].split('/')[1] if row['Vendor'].startswith(cola) else row['Vendor'], axis=1)
>>> df
  ColumnA     Vendor
0       A       1234
1       B  BCCD/1234
2       B       1234
3       C  1234ABBC/
You can split on the '/' and use np.where to specify only on 'A'.
df['Vendor'] = np.where(df['ColumnA'].eq('A'), df['Vendor'].str.split('/').str[1], df['Vendor'])
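If you literally want to drop the first five characters rather than split on '/', the conditional assignment can also be done with .loc and string slicing. A small alternative sketch, assuming the same df as above:
# slice off the first five characters only on the rows where ColumnA is 'A'
mask = df['ColumnA'].eq('A')
df.loc[mask, 'Vendor'] = df.loc[mask, 'Vendor'].str[5:]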

Appending duplicates as columns and removing the other rows

I have a df with some repeated IDs, like this:
index  ID name surname
    1   1    a       x
    2   2    b       y
    3   1    c       z
    4   3    d       j
I'd like to append the columns of the repeated rows to the right and to remove the "single" rows, like this:
index  ID name surname second_name second_surname
    1   1    a       x           c              z
What is the most efficient way to do it? (I have many millions of rows)
Try using drop_duplicates, merge and query like so:
df['second_name'] = (df.drop_duplicates(subset='ID')
                       .reset_index()
                       .merge(df, on='ID', how='inner', suffixes=('', '_'))
                       .query("name != name_")
                       .set_index('level_0')['name_'])
[out]
   index  ID name second_name
0      1   1    a           c
1      2   2    b         NaN
2      3   1    c         NaN
3      4   3    d         NaN
If you only need the single row, use dropna:
df.dropna(subset=['second_name'])
[out]
   index  ID name second_name
0      1   1    a           c
My suggestion involves groupby and should work for an arbitrary number of "additional" names:
df_in = pd.DataFrame({'ID': [1, 2, 1, 3], 'name': ['a', 'b', 'c', 'd']})
grp = df_in.groupby('ID', as_index=True)
df_a = grp.first()
df_b = (grp['name'].unique()
                   .apply(pd.Series)
                   .rename(columns=lambda x: 'name_{:.0f}'.format(x + 1))
                   .drop('name_1', axis=1))
df_out = df_a.merge(df_b, how='inner', left_index=True, right_index=True).reset_index(drop=False)
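For reference, with the sample df_in above this should produce something like:
print(df_out)
   ID name name_2
0   1    a      c
1   2    b    NaN
2   3    d    NaN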
I would try to pivot the dataframe. For that, I will first add a rank column to give the rank of a name for its ID:
df['rank'] = df.groupby('ID').cumcount()
pivoted = df.pivot(index='ID', columns='rank', values='name')
giving:
rank  0    1
ID
1     a    c
2     b  NaN
3     d  NaN
Let us just format it:
pivoted = pivoted.rename_axis(None, axis=1).rename(lambda x: 'name_{}'.format(x),
                                                   axis=1).reset_index()
   ID name_0 name_1
0   1      a      c
1   2      b    NaN
2   3      d    NaN
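To also drop the "single" rows the asker mentions, you can then filter on the helper column. A small follow-up sketch on the formatted frame above:
# keep only IDs that actually have a second name
pivoted = pivoted.dropna(subset=['name_1'])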
Numpy / Pandas
r, i = np.unique(df.ID, return_inverse=True)
j = df.groupby('ID').cumcount()
names = np.empty((len(r), j.max() + 1), object)
names.fill(np.nan)
names[i, j] = df.name
pd.DataFrame(names, r).rename_axis('ID').add_prefix('name_')
   name_0 name_1
ID
1       a      c
2       b    NaN
3       d    NaN
Loopy
from itertools import count
from collections import defaultdict

c = defaultdict(count)
d = defaultdict(dict)
for i, n in zip(df.ID, df.name):
    d[f'name_{next(c[i])}'][i] = n

pd.DataFrame(d).rename_axis('ID')
   name_0 name_1
ID
1       a      c
2       b    NaN
3       d    NaN

pandas how to groupby and create other columns by counting values of existing columns

I know how to do this in R (How to make new columns by counting up an existing column), but I'd like to know how it works in Python as well.
The original table is as below:
userID cat1  cat2
     a    f     3
     a    f     3
     a    u     1
     a    m     1
     b    u     2
     b    m     1
     b    m     2
I group them by userID and want it to come out like:
userID  cat1_f  cat1_m  cat1_u  cat2_1  cat2_2  cat2_3
     a       2       1       1       2       0       2
     b       0       2       1       1       2       0
Use melt with GroupBy.size and unstack:
df = (df.melt('userID')
        .groupby(['userID', 'variable', 'value'])
        .size()
        .unstack([1, 2], fill_value=0))

# python 3.6+
df.columns = [f'{a}_{b}' for a, b in df.columns]
# python below 3.6
# df.columns = ['{}_{}'.format(a, b) for a, b in df.columns]

df = df.reset_index()
print(df)
  userID  cat1_f  cat1_m  cat1_u  cat2_1  cat2_3  cat2_2
0      a       2       1       1       2       2       0
1      b       0       2       1       1       0       2
Alternative with crosstab:
df = df.melt('userID')
df = pd.crosstab(df['userID'], [df['variable'], df['value']])
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
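Another option worth knowing is pd.get_dummies, which one-hot encodes both columns in one pass before summing per user. A hedged sketch (assuming df is still the original frame from the question, not the melted one above; astype(str) makes sure the numeric cat2 column is encoded too):
out = (pd.get_dummies(df.set_index('userID').astype(str))
         .groupby(level=0)
         .sum()
         .reset_index())
With the sample data this reproduces the cat1_/cat2_ counts shown above, just with the columns in sorted order.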
