Hi, I have two DataFrames like below.
DF1:
Alpha   | Numeric   | Special
and, or | 1,2,3,4,5 | #,$,&
and DF2 with a single column:
Content
boy or girl
school # morn
I want to check whether any column of DF1 contains any of the keywords in the Content column of DF2, and the output should be the matching column names in a new DF:
output_DF
output_column
Alpha
Special
Can someone help me with this?
The solution is a bit complicated, because for multiple matches (row 2) we need only the first matched column of df1:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Alpha':['and','or', None, None, None],
                    'Numeric':['1','2','3','4','5'],
                    'Special':['#','$','&', None, None]})
print (df1)
Alpha Numeric Special
0 and 1 #
1 or 2 $
2 None 3 &
3 None 4 None
4 None 5 None
df2 = pd.DataFrame({'Content':['boy or girl','school # morn',
'1 school # morn', 'Pechi']})
print (df2)
Content
0 boy or girl
1 school # morn
2 1 school # morn
3 Pechi
#reshape df1
df1.columns = [np.arange(len(df1.columns)), df1.columns]
df11 = (df1.unstack()
           .reset_index(level=2,drop=True)
           .rename_axis(('col_order','col_name'))
           .dropna()
           .reset_index(name='val'))
print (df11)
col_order col_name val
0 0 Alpha and
1 0 Alpha or
2 1 Numeric 1
3 1 Numeric 2
4 1 Numeric 3
5 1 Numeric 4
6 1 Numeric 5
7 2 Special #
8 2 Special $
9 2 Special &
#split column by whitespace, reshape
df22 = (df2['Content'].str.split(expand=True)
                      .stack()
                      .rename('val')
                      .reset_index(level=1,drop=True)
                      .rename_axis('idx')
                      .reset_index())
print (df22)
idx val
0 0 boy
1 0 or
2 0 girl
3 1 school
4 1 #
5 1 morn
6 2 1
7 2 school
8 2 #
9 2 morn
10 3 Pechi
#left join the dataframes, remove non-matched values by dropna
#for multiple matches always take the first one - sort and drop_duplicates
df = (pd.merge(df22, df11, on='val', how='left')
        .dropna(subset=['col_name'])
        .sort_values(['idx','col_order'])
        .drop_duplicates(['idx']))
#if necessary, get the values from df2
#if no value matched, add an Other category
df = (pd.concat([df2, df.set_index('idx')], axis=1)
        .fillna({'col_name':'Other'})[['val','col_name','Content']])
print (df)
val col_name Content
0 or Alpha boy or girl
1 # Special school # morn
2 1 Numeric 1 school # morn
3 NaN Other Pechi
EDIT: For case-insensitive matching (e.g. OR vs or), compare lowercased values:
df1 = pd.DataFrame({'Alpha':['and','or', None, None,None],
'Numeric':['1','2','3','4','5'],
'Special':['#','$','&', None, None]})
df2 = pd.DataFrame({'Content':['boy OR girl','school # morn',
'1 school # morn', 'Pechi']})
#if df1 Alpha values are not lowercase, normalize them first
#df1['Alpha'] = df1['Alpha'].str.lower()
df1.columns = [np.arange(len(df1.columns)), df1.columns]
df11 = (df1.unstack()
.reset_index(level=2,drop=True)
.rename_axis(('col_order','col_name'))
.dropna()
.reset_index(name='val_low'))
df22 = (df2['Content'].str.split(expand=True)
.stack()
.rename('val')
.reset_index(level=1,drop=True)
.rename_axis('idx')
.reset_index())
#convert column values to lowercase in a new column
df22['val_low'] = df22['val'].str.lower()
df = (pd.merge(df22, df11, on='val_low', how='left')
.dropna(subset=['col_name'])
.sort_values(['idx','col_order'])
.drop_duplicates(['idx']))
df = (pd.concat([df2, df.set_index('idx')], axis=1)
.fillna({'col_name':'Other'})[['val','col_name','Content']])
print (df)
val col_name Content
0 OR Alpha boy OR girl
1 # Special school # morn
2 1 Numeric 1 school # morn
3 NaN Other Pechi
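For reference, a more compact (though loop-based, not vectorized) sketch of the same idea: build a per-column keyword set from df1 and scan each Content row, returning the first matching column in df1's column order. The first_match helper name is mine, not part of the original solution.
import pandas as pd

df1 = pd.DataFrame({'Alpha':['and','or', None, None, None],
                    'Numeric':['1','2','3','4','5'],
                    'Special':['#','$','&', None, None]})
df2 = pd.DataFrame({'Content':['boy or girl','school # morn',
                               '1 school # morn', 'Pechi']})
#keyword sets per column, lowercased for case-insensitive matching
keywords = {col: set(df1[col].dropna().str.lower()) for col in df1.columns}

def first_match(text):
    #return the first df1 column (in column order) containing any word
    words = set(text.lower().split())
    for col, vals in keywords.items():
        if words & vals:
            return col
    return 'Other'

df2['output_column'] = df2['Content'].apply(first_match)
print (df2)
Content output_column
0 boy or girl Alpha
1 school # morn Special
2 1 school # morn Numeric
3 Pechi Other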
I have a list of columns whose values are all strings. I need to one-hot encode them with pd.get_dummies().
I want to keep the original name of those columns along with the value.
So let's say I have a column named Street, and its values are Paved and Not Paved.
After running get_dummies(), I would like the two resulting columns to be titled Street_Paved and Street_Not_Paved. Is this possible? Basically, the format for the prefix parameter should be {i}_{value}, with i referring to the usual for i in cols nomenclature.
My code is:
cols = ['Street', 'Alley', 'CentralAir', 'Utilities', 'LandSlope', 'PoolQC']
pd.get_dummies(df, columns = cols, prefix = '', prefix_sep = '')
If you remove the prefix = '', prefix_sep = '' parameters, you get the default prefix taken from the column names, with the default separator _:
import pandas as pd

df = pd.DataFrame({'Street' : ['Paved','Paved','Not Paved','Not Paved'],
                   'Alley':list('acca')})
cols = ['Street','Alley']
df = pd.get_dummies(df, columns = cols)
print (df)
Street_Not Paved Street_Paved Alley_a Alley_c
0 0 1 1 0
1 0 1 0 1
2 1 0 0 1
3 1 0 1 0
If you need to replace all spaces with _, add rename:
cols = ['Street','Alley']
df = pd.get_dummies(df, columns = cols).rename(columns=lambda x: x.replace(' ', '_'))
print (df)
Street_Not_Paved Street_Paved Alley_a Alley_c
0 0 1 1 0
1 0 1 0 1
2 1 0 0 1
3 1 0 1 0
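As a side note, the prefix parameter also accepts a dict mapping column names to prefixes if you need per-column control; a small sketch (the shortened prefixes are just for illustration):
import pandas as pd

df = pd.DataFrame({'Street' : ['Paved','Paved','Not Paved','Not Paved'],
                   'Alley':list('acca')})
#per-column prefixes via a dict; the separator still defaults to _
df = pd.get_dummies(df, columns=['Street','Alley'],
                    prefix={'Street':'St', 'Alley':'Al'})
print (df.columns.tolist())
['St_Not Paved', 'St_Paved', 'Al_a', 'Al_c']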
I have imported the following Excel file and would like to sort it by Frequency descending, but with 'Other', 'No data' and 'All' (the total) at the bottom, in that order. Is this possible?
table1 = pd.read_excel("table1.xlsx")
table1
Use:
import pandas as pd

df = pd.DataFrame({
'generalenq':list('abcdef'),
'percentage':[1,3,5,7,1,0],
'frequency':[5,3,6,9,2,4],
})
df.loc[0, 'generalenq'] = 'All'
df.loc[2, 'generalenq'] = 'No data'
df.loc[3, 'generalenq'] = 'Other'
print (df)
generalenq percentage frequency
0 All 1 5
1 b 3 3
2 No data 5 6
3 Other 7 9
4 e 1 2
5 f 0 4
First create a dictionary mapping the special labels to integers for ordering. Then create a mask by membership with Series.isin, and sort the non-matched rows, selected with ~ to invert the mask via boolean indexing:
d = {'Other':0,'No data':1,'All':2}
mask = df['generalenq'].isin(list(d.keys()))
df1 = df[~mask].sort_values('frequency', ascending=False)
print (df1)
generalenq percentage frequency
5 f 0 4
1 b 3 3
4 e 1 2
Then filter the matched rows with the mask and create a helper column for sorting by the mapped dict:
df2 = df[mask].assign(new = lambda x: x['generalenq'].map(d)).sort_values('new').drop(columns='new')
print (df2)
generalenq percentage frequency
3 Other 7 9
2 No data 5 6
0 All 1 5
And finally, join them together with concat:
df = pd.concat([df1, df2], ignore_index=True)
print (df)
generalenq percentage frequency
0 f 0 4
1 b 3 3
2 e 1 2
3 Other 7 9
4 No data 5 6
5 All 1 5
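On pandas 1.1+ the same result can also be obtained in a single sort_values call using the key argument; a hedged one-pass sketch, where the -1 sentinel puts all regular rows first:
#map special labels to 0/1/2 and everything else to -1, so regular rows
#sort first; frequency (descending) breaks ties among the regular rows
d = {'Other':0,'No data':1,'All':2}
df = (df.sort_values(['generalenq','frequency'],
                     ascending=[True, False],
                     key=lambda s: s.map(d).fillna(-1)
                                   if s.name == 'generalenq' else s)
        .reset_index(drop=True))
print (df)
generalenq percentage frequency
0 f 0 4
1 b 3 3
2 e 1 2
3 Other 7 9
4 No data 5 6
5 All 1 5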
I want to compare two MultiIndex DataFrames and add another column to show the difference in values (where the index values match between the first and second DataFrame), without using loops.
import numpy as np
import pandas as pd

index_a = [1,2,2,3,3,3]
index_b = [0,0,1,0,1,2]
index_c = [1,2,2,4,4,4]
index = pd.MultiIndex.from_arrays([index_a,index_b], names=('a','b'))
index_1 = pd.MultiIndex.from_arrays([index_c,index_b], names=('a','b'))
df1 = pd.DataFrame(np.random.rand(6,), index=index, columns=['p'])
df2 = pd.DataFrame(np.random.rand(6,), index=index_1, columns=['q'])
df1
p
a b
1 0 .4655
2 0 .8600
1 .9010
3 0 .0652
1 .5686
2 .8965
df2
q
a b
1 0 .6591
2 0 .5684
1 .5689
4 0 .9898
1 .3656
2 .6989
The resultant matrix (df1-df2) should look like
p diff
a b
1 0 .4655 -0.1936
2 0 .8600 .2916
1 .9010 .3321
3 0 .0652 No Match
1 .5686 No Match
2 .8965 No Match
Use reindex_like or reindex for intersection of indices:
df1['new'] = (df1['p'] - df2['q'].reindex_like(df1)).fillna('No Match')
#alternative
#df1['new'] = (df1['p'] - df2['q'].reindex(df1.index)).fillna('No Match')
print (df1)
p new
a b
1 0 0.955587 0.924466
2 0 0.312497 -0.310224
1 0.306256 0.231646
3 0 0.575613 No Match
1 0.674605 No Match
2 0.462807 No Match
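One design note: fillna('No Match') mixes floats and strings, so the new column ends up with object dtype. If you still need arithmetic on it afterwards, a sketch that keeps the column numeric and only renders the label when printing:
#keep the column numeric; NaN marks the unmatched rows
df1['new'] = df1['p'] - df2['q'].reindex_like(df1)
print (df1.fillna({'new':'No Match'}))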
Another idea with Index.intersection and DataFrame.loc:
df1['new'] = (df1['p'] - df2.loc[df2.index.intersection(df1.index), 'q']).fillna('No Match')
Or with merge with left join:
df = pd.merge(df1, df2, how='left', left_index=True, right_index=True)
df['new'] = (df['p'] - df['q']).fillna('No Match')
print (df)
p q new
a b
1 0 0.789693 0.665148 0.124544
2 0 0.082677 0.814190 -0.731513
1 0.762339 0.235435 0.526905
3 0 0.727695 NaN No Match
1 0.903596 NaN No Match
2 0.315999 NaN No Match
Use the following to get the difference for matched indices. Unmatched indices will be NaN:
diff = df1['p'] - df2['q']
#Output
a b
1 0 -0.666542
2 0 -0.389033
1 0.064986
3 0 NaN
1 NaN
2 NaN
4 0 NaN
1 NaN
2 NaN
dtype: float64
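Since plain subtraction aligns on the union of both indices (note the extra a=4 rows above), a small follow-up sketch to cut the result back to df1's rows and label the gaps, matching the asked-for output:
#keep only df1's index and label unmatched rows
df1['diff'] = (df1['p'] - df2['q']).reindex(df1.index).fillna('No Match')
print (df1)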
I would like to add elements to specific groups in a Pandas DataFrame in a selective way. In particular, I would like to add zeros so that all groups have the same number of elements. The following is a simple example:
import pandas as pd
df = pd.DataFrame([[1,1], [2,2], [1,3], [2,4], [2,5]], columns=['key', 'value'])
df
key value
0 1 1
1 2 2
2 1 3
3 2 4
4 2 5
I would like to have the same number of elements per group (where grouping is by the key column). Group 2 has the most elements: three. Group 1 has only two elements, so a zero should be added as follows:
key value
0 1 1
1 2 2
2 1 3
3 2 4
4 2 5
5 1 0
Note that the index does not matter.
You can create a new MultiIndex level with cumcount and then add the missing values by unstack/stack or reindex:
df = (df.set_index(['key', df.groupby('key').cumcount()])['value']
.unstack(fill_value=0)
.stack()
.reset_index(level=1, drop=True)
.reset_index(name='value'))
Alternative solution:
df = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df.index.levels, names = df.index.names)
df = df.reindex(mux, fill_value=0).reset_index(level=1, drop=True).reset_index()
print (df)
key value
0 1 1
1 1 3
2 1 0
3 2 2
4 2 4
5 2 5
If the order of the values is important:
df1 = df.set_index(['key', df.groupby('key').cumcount()])
mux = pd.MultiIndex.from_product(df1.index.levels, names = df1.index.names)
#get appended values
miss = mux.difference(df1.index).get_level_values(0)
#create helper df and add 0 to all columns of original df
df2 = pd.DataFrame({'key':miss}).reindex(columns=df.columns, fill_value=0)
#append to original df
df = pd.concat([df, df2], ignore_index=True)
print (df)
key value
0 1 1
1 2 2
2 1 3
3 2 4
4 2 5
5 1 0
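A shorter alternative sketch in the same spirit, which also preserves the original row order: count how many rows each key is missing relative to the largest group, then append that many zero rows:
import pandas as pd

df = pd.DataFrame([[1,1], [2,2], [1,3], [2,4], [2,5]], columns=['key', 'value'])
#per-key deficit relative to the largest group
counts = df['key'].value_counts()
missing = counts.max() - counts
#build the zero rows and append them
pad = pd.DataFrame({'key': missing.index.repeat(missing.to_numpy()),
                    'value': 0})
df = pd.concat([df, pad], ignore_index=True)
print (df)
key value
0 1 1
1 2 2
2 1 3
3 2 4
4 2 5
5 1 0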
Note: for simplicity's sake, I'm using a toy example, because copy/pasting DataFrames is difficult in Stack Overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one DataFrame onto another without getting the _X, _Y columns? I'd like the values in one column to replace all zero values of another column.
df1:
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0 <- Y and Z have zero values for Nonprofit and Educ
Z 0 0 0
Y 0 1 0
df2:
Name Nonprofit Education
Y 1 1 <- this df has the correct values.
Z 1 1
pd.merge(df1, df2, on='Name', how='outer')
Name Nonprofit_X Business Education_X Nonprofit_Y Education_Y
X 1 1 0 NaN NaN
Y 0 1 0 1 1
Z 0 0 0 1 1
Y 0 1 0 1 1
In a previous post, I tried combine_first and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2.
Furthermore, I want all rows with the same Names to be changed according to df2.
Name Nonprofit Business Education
Y 1 1 1
Y 1 1 1
X 1 1 0
Z 1 0 1
(To clarify: the value in the 'Business' column where Name = Z should be 0.)
My existing solution does the following:
I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.loc[pubunis.index, ['Education', 'Public']] = 1
searchnamesre(sdf, 'ORGS', regex)
Attention: in the latest versions of pandas, both answers above no longer work.
KSD's answer will raise an error:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,0,0]],columns=["Name","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1]],columns=["Name","Nonprofit", "Education"])
#both variants raise, because the mask selects three rows of df1 (Y, Z, Y)
#while df2 supplies values for only two rows:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
Well, it will work safely only if values in column 'Name' are unique and are sorted in both data frames.
Here is my answer:
Way 1:
df1 = df1.merge(df2,on='Name',how="left")
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Business_y'] = df1['Business_y'].fillna(df1['Business_x'])
df1.drop(["Business_x","Nonprofit_x"],inplace=True,axis=1)
df1.rename(columns={'Business_y':'Business','Nonprofit_y':'Nonprofit'},inplace=True)
Way 2:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
More notes about update: the column names the two DataFrames set as the index do not need to be the same before calling update; you could use 'Name1' and 'Name2'. Also, it works even if df2 contains extra rows that don't exist in df1, which simply won't update anything. In other words, df2 doesn't need to be a superset of df1.
Example:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1],
['U',1,3]],columns=["Name2","Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0
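One caveat: update upcasts the modified columns to float, which is why the result above shows 1.0. A hedged sketch to restore integer dtype afterwards:
#update leaves Nonprofit/Education as float; cast them back to int
df1[['Nonprofit','Education']] = df1[['Nonprofit','Education']].astype(int)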
Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
This is the correct one:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
The above will work only when all rows in df1 exist in df; in other words, df should be a superset of df1.
In case df1 has rows that do not match df (i.e. df is not a superset of df1), you should instead use:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1.loc[df1.Name.isin(df.Name),['Nonprofit', 'Education']].values
Another option is combine_first after setting the index on both frames:
df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()