Merge three DataFrames, count the matches and add new columns - python

I have three DataFrames (df1, df2, df3). I want to merge them on a column and add two new columns: one that says which DataFrames matched, and one that says how many of them matched.
# df1
data = {'ID': ["M1", "M2", "M3", "M4"],
'Movie': ["Top gun", "Thor", "Batman", "MadMax"],
'Actor' : ["Tom", "Chris", "Bale", "Tom"],
'type': ["Action", "SciFi", "Comic", "SciFi"]}
df1 = pd.DataFrame(data)
# df2
data = {'ID': ["M1", "M2", "M3"],
'highlight': ["Flight school", "Love and thunder", "I am Batman"]}
df2 = pd.DataFrame(data)
# df3
data = {'ID': ["M2", "M3"],
'no of parts': [3, 3],
'co-star' : ["portman", "neeson"],
'award': ["yes", "yes"]}
df3 = pd.DataFrame(data)
In the expected output, match and No of match are the new columns.
Thank you for your time
Any help would be much appreciated

You can merge your three dataframes on ID, then use the indicator parameter to merge to determine which dataframes had valid data, using this info to generate the match column. You can then count the number of | characters in match to determine the No of match column:
import pandas as pd
data = {'ID': ["M1", "M2", "M3", "M4"], 'Movie': ["Top gun", "Thor", "Batman", "MadMax"], 'Actor' : ["Tom", "Chris", "Bale", "Tom"], 'type': ["Action", "SciFi", "Comic", "SciFi"]}
df1 = pd.DataFrame(data)
data = {'ID': ["M1", "M2", "M3"], 'highlight': ["Flight school", "Love and thunder", "I am Batman"]}
df2 = pd.DataFrame(data)
data = {'ID': ["M2", "M3"], 'no of parts': [3, 3], 'co-star' : ["portman", "neeson"], 'award': ["yes", "yes"]}
df3 = pd.DataFrame(data)
df = df1.merge(df2, on='ID', how='left', indicator='df1df2').merge(df3, on='ID', how='left',indicator='df3')
df['match'] = df['df1df2'].map({'both':'df1|df2', 'left_only':'df1'})+df['df3'].map({'both':'|df3', 'left_only':''})
df['No of match'] = df['match'].str.count(r'\|')+1
df = df.drop(['df1df2', 'df3'], axis=1)
Output:
ID Movie Actor type highlight no of parts co-star award match No of match
0 M1 Top gun Tom Action Flight school NaN NaN NaN df1|df2 2
1 M2 Thor Chris SciFi Love and thunder 3.0 portman yes df1|df2|df3 3
2 M3 Batman Bale Comic I am Batman 3.0 neeson yes df1|df2|df3 3
3 M4 MadMax Tom SciFi NaN NaN NaN NaN df1 1
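To see what the indicator parameter contributes on its own, here is a minimal sketch on two tiny frames. The names left, right and source are made up for illustration; only merge and its indicator argument come from the answer above.

```python
import pandas as pd

# Hypothetical miniature inputs, not the question's data
left = pd.DataFrame({"ID": ["M1", "M2", "M4"]})
right = pd.DataFrame({"ID": ["M1", "M2"], "highlight": ["a", "b"]})

# indicator="source" adds a column named "source" that records where each key was found
merged = left.merge(right, on="ID", how="left", indicator="source")

# Convert the indicator values to plain strings for easy inspection
flags = merged["source"].astype(str).tolist()
print(flags)  # ['both', 'both', 'left_only']
```

With a left join the indicator can only be both or left_only, which is why the answer's mapping dictionaries need just those two keys.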

You can use pandas.concat on a list of the input DataFrames. This will work on any number of input DataFrames (not just 3):
# dataframes will be later named in order: 1->2->3
# you can easily tweak this solution to use a dictionary
# and custom names if desired
dfs = [df1, df2, df3]
out = (pd
   .concat([d.set_index('ID').assign(ID=f'df{i}')
            for i,d in enumerate(dfs, start=1)], axis=1)
   .assign(**{'match': lambda d: d[['ID']].agg(lambda x: '|'.join(x.dropna()),
                                               axis=1),
              'No of matches': lambda d: d[['ID']].notna().sum(axis=1)
             })
   .drop('ID', axis=1).reset_index()
)
NB. This approach uses a temporary ID column, so make sure it is not present in any of the input DataFrames' columns. You can choose another name for safety if needed.
Output:
ID Movie Actor type highlight no of parts co-star award match No of matches
0 M1 Top gun Tom Action Flight school NaN NaN NaN df1|df2 2
1 M2 Thor Chris SciFi Love and thunder 3.0 portman yes df1|df2|df3 3
2 M3 Batman Bale Comic I am Batman 3.0 neeson yes df1|df2|df3 3
3 M4 MadMax Tom SciFi NaN NaN NaN NaN df1 1
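The comment in the code above mentions that a dictionary with custom names could be used instead of a list. One possible sketch of that idea follows. The labels movies, highlights and sequels are hypothetical, the input frames are trimmed to fewer columns, and the membership table is built with Index.isin rather than the temporary ID column, so this is a variation and not the answer's exact method.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": ["M1", "M2", "M3", "M4"],
                    "Movie": ["Top gun", "Thor", "Batman", "MadMax"]})
df2 = pd.DataFrame({"ID": ["M1", "M2", "M3"],
                    "highlight": ["Flight school", "Love and thunder", "I am Batman"]})
df3 = pd.DataFrame({"ID": ["M2", "M3"], "award": ["yes", "yes"]})

# Hypothetical labels; any dict of {label: DataFrame} works
frames = {"movies": df1, "highlights": df2, "sequels": df3}

# Align all frames on ID (outer join on the index)
combined = pd.concat([d.set_index("ID") for d in frames.values()], axis=1).rename_axis("ID")

# One boolean column per label: is this ID present in that frame?
membership = pd.DataFrame(
    {label: combined.index.isin(d["ID"]) for label, d in frames.items()},
    index=combined.index,
)

# Join the labels whose flag is True, and count them
combined["match"] = ["|".join(membership.columns[row]) for row in membership.to_numpy()]
combined["No of matches"] = membership.sum(axis=1)
out = combined.reset_index()
print(out)
```

This assumes the non-key column names do not collide across frames and that ID is unique within each frame.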

Use DataFrame.merge with a left join and the indicator parameter to see which DataFrames matched. Then use DataFrame.pop to remove each indicator column while passing it to Series.map with a dictionary, append the mapped df3 column, and finally count the | characters with Series.str.count:
df = (df1.merge(df2, on='ID', how='left', indicator='df2')
         .merge(df3, on='ID', how='left', indicator='df3'))
df['match'] = (df.pop('df2').map({'both':'df1|df2', 'left_only':'df1'}) +
               df.pop('df3').map({'both':'|df3', 'left_only':''}))
df['No of match'] = df['match'].str.count(r'\|') + 1
print(df)
ID Movie Actor type highlight no of parts co-star award \
0 M1 Top gun Tom Action Flight school NaN NaN NaN
1 M2 Thor Chris SciFi Love and thunder 3.0 portman yes
2 M3 Batman Bale Comic I am Batman 3.0 neeson yes
3 M4 MadMax Tom SciFi NaN NaN NaN NaN
match No of match
0 df1|df2 2
1 df1|df2|df3 3
2 df1|df2|df3 3
3 df1 1
Another idea is to use concat after assigning helper columns df1-df3 filled with True, and then build the match column with DataFrame.dot:
dfs = [df1, df2, df3]
L = [df_.set_index('ID').assign(**{f'df{i}':True}) for i, df_ in enumerate(dfs, start=1)]
df = pd.concat(L, axis=1)
cols = df.filter(regex=r'^df\d').columns
df['match'] = df[cols].fillna(False).dot(cols + '|').str[:-1]
df['No of match'] = df['match'].str.count(r'\|') + 1
df = df.drop(cols, axis=1)
print (df)
Movie Actor type highlight no of parts co-star award \
ID
M1 Top gun Tom Action Flight school NaN NaN NaN
M2 Thor Chris SciFi Love and thunder 3.0 portman yes
M3 Batman Bale Comic I am Batman 3.0 neeson yes
M4 MadMax Tom SciFi NaN NaN NaN NaN
match No of match
ID
M1 df1|df2 2
M2 df1|df2|df3 3
M3 df1|df2|df3 3
M4 df1 1

You can also try merging with functools.reduce and a lambda:
from functools import reduce
dfs = [df1, df2, df3]
df_final = reduce(lambda left,right: pd.merge(left,right,on='ID',how='outer'), dfs)
# the first non-key column of each frame serves as its presence marker
df_temp = df_final[[df1.columns[1],df2.columns[1],df3.columns[1]]]
df_final["match"] = df_temp.apply(lambda x: "|".join(["df"+str(idx+1) for idx,i in enumerate(x) if pd.notna(i)]),axis=1)
df_final["No of match"] = df_final["match"].apply(lambda x: x.count("|")+1)
Output:
ID Movie Actor type ... co-star award match No of match
0 M1 Top gun Tom Action ... NaN NaN df1|df2 2
1 M2 Thor Chris SciFi ... portman yes df1|df2|df3 3
2 M3 Batman Bale Comic ... neeson yes df1|df2|df3 3
3 M4 MadMax Tom SciFi ... NaN NaN df1 1
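The reduce approach infers presence from the first non-key column of each frame, so a genuine NaN in that column would be read as "no match". A possible variation, sketched below under that concern, tags each frame with its own flag column before merging. The _in_dfN column names are hypothetical helpers and the frames are trimmed for brevity.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"ID": ["M1", "M2", "M3", "M4"],
                    "Movie": ["Top gun", "Thor", "Batman", "MadMax"]})
df2 = pd.DataFrame({"ID": ["M1", "M2", "M3"],
                    "highlight": ["Flight school", "Love and thunder", "I am Batman"]})
df3 = pd.DataFrame({"ID": ["M2", "M3"], "award": ["yes", "yes"]})

dfs = [df1, df2, df3]
names = [f"df{i}" for i in range(1, len(dfs) + 1)]
flags = [f"_in_{name}" for name in names]  # hypothetical helper column names

# Add a constant True flag to each frame so presence survives the merge
tagged = [d.assign(**{flag: True}) for d, flag in zip(dfs, flags)]
merged = reduce(lambda left, right: pd.merge(left, right, on="ID", how="outer"), tagged)

# A flag is NaN wherever the corresponding frame had no row for that ID
present = merged[flags].notna().to_numpy()
merged["match"] = ["|".join(n for n, hit in zip(names, row) if hit) for row in present]
merged["No of match"] = present.sum(axis=1)
merged = merged.drop(columns=flags)
print(merged)
```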

Related

replace values to a dataframe column from another dataframe

I am trying to replace values from a dataframe column with values from another based on a third one and keep the rest of the values from the first df.
# df1
country name value
romania john 100
russia emma 200
sua mark 300
china jack 400
# df2
name value
emma 2
mark 3
Desired result:
# df3
country name value
romania john 100
russia emma 2
sua mark 3
china jack 400
Thank you
One approach could be as follows:
Use Series.map on the name column, turning df2 into a Series for mapping by setting its index to name (DataFrame.set_index). Here df stands for the question's df1.
Next, chain Series.fillna to replace NaN values with the original values from df.value (i.e. wherever the mapping found no match) and assign the result to df['value'].
df['value'] = df['name'].map(df2.set_index('name')['value']).fillna(df['value'])
print(df)
country name value
0 romania john 100.0
1 russia emma 2.0
2 sua mark 3.0
3 china jack 400.0
N.B. The result will now contain floats. If you prefer integers, chain .astype(int) as well.
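As a small illustration of the note about integers, the sketch below runs the same map and fillna chain on the question's sample data and casts back with .astype(int). The variable name lookup is introduced here for readability only.

```python
import pandas as pd

df1 = pd.DataFrame({"country": ["romania", "russia", "sua", "china"],
                    "name": ["john", "emma", "mark", "jack"],
                    "value": [100, 200, 300, 400]})
df2 = pd.DataFrame({"name": ["emma", "mark"], "value": [2, 3]})

# Series indexed by name, used as the lookup table for map
lookup = df2.set_index("name")["value"]

# map yields NaN for names missing from df2; fillna restores the original value;
# astype(int) undoes the float conversion caused by the intermediate NaN
df1["value"] = df1["name"].map(lookup).fillna(df1["value"]).astype(int)
print(df1)
```

The cast is only safe because every NaN has been filled before it runs.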
Another option could be using pandas.DataFrame.update:
df1.set_index('name', inplace=True)
df1.update(df2.set_index('name'))
df1.reset_index(inplace=True)
name country value
0 john romania 100.0
1 emma russia 2.0
2 mark sua 3.0
3 jack china 400.0
Another option:
df3 = df1.merge(df2, on = 'name', how = 'left')
df3['value'] = df3.value_y.fillna(df3.value_x)
df3.drop(['value_x', 'value_y'], axis = 1, inplace = True)
# country name value
# 0 romania john 100.0
# 1 russia emma 2.0
# 2 sua mark 3.0
# 3 china jack 400.0
Reproducible data:
df1=pd.DataFrame({'country':['romania','russia','sua','china'],'name':['john','emma','mark','jack'],'value':[100,200,300,400]})
df2=pd.DataFrame({'name':['emma','mark'],'value':[2,3]})

Python - expand rows in dataframe n-times

I need to make a function to expand a dataframe. For example, the input of the function is :
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', 'phone', 'food', 'bag']
})
suppose the n value is 3. Then, for each person inside the Name column, I have to add 3 more new rows and leave the Cart as np.nan. The output should be like this :
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', np.nan, np.nan, np.nan, 'phone', 'food', 'bag', np.nan, np.nan, np.nan]
})
How can I solve this using copy() and append()?
You can use np.repeat with pd.Series.unique:
n = 3
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
print(pd.concat([df, pd.DataFrame(np.repeat(df["Name"].unique(), n), columns=["Name"])]))
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Sasha phone
4 Sasha food
5 Sasha bag
0 Ali NaN
1 Ali NaN
2 Ali NaN
3 Sasha NaN
4 Sasha NaN
5 Sasha NaN
Try this one: (it adds n rows to each group of rows with the same Name value)
import pandas as pd
import numpy as np
n = 3
list_of_df_unique_names = [df[df["Name"] == name] for name in df["Name"].unique()]
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df2 = pd.concat([pd.concat([d, pd.DataFrame({"Name": np.repeat(d["Name"].values[-1], n)})])
                 for d in list_of_df_unique_names]).reset_index(drop=True)
print(df2)
Output:
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Ali NaN
4 Ali NaN
5 Ali NaN
6 Sasha phone
7 Sasha food
8 Sasha bag
9 Sasha NaN
10 Sasha NaN
11 Sasha NaN
Maybe not the most beautiful of all solutions, but it works. Say that you want to add 4 NaN rows by group. Then, given your df:
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', 'phone', 'food', 'bag']
})
you can create an empty list DF and loop over range(4); in every iteration, filter df by name and append one empty row per group:
DF = []
names = list(set(df.Name))
for i in range(4):
    for name in names:
        gf = df[df['Name']=='{}'.format(name)]
        a = pd.concat([gf, gf.groupby('Name')['Cart'].apply(lambda x: x.shift(-1).iloc[-1]).reset_index()]).sort_values('Name').reset_index(drop=True)
        DF.append(a)
DF_full = pd.concat(DF)
Now you'll end up with copies of your original df, so you need to drop the duplicates without dropping the NaN rows:
DFF = DF_full.sort_values(['Name','Cart'])
DFF = DFF[(~DFF.duplicated()) | (DFF['Cart'].isnull())]
which gives:
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Ali NaN
3 Ali NaN
3 Ali NaN
3 Ali NaN
2 Sasha bag
1 Sasha food
0 Sasha phone
3 Sasha NaN
3 Sasha NaN
3 Sasha NaN
3 Sasha NaN
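For comparison, here is a compact sketch that produces the ordering the question asked for (each person's rows followed by n empty ones). The function name expand_rows and its key parameter are hypothetical; it relies only on groupby and pd.concat.

```python
import pandas as pd

def expand_rows(df, n, key="Name"):
    """Append n rows per group that carry only the key; other columns become NaN."""
    pieces = []
    # sort=False keeps the groups in order of first appearance
    for name, group in df.groupby(key, sort=False):
        filler = pd.DataFrame({key: [name] * n})
        pieces.append(pd.concat([group, filler]))
    return pd.concat(pieces, ignore_index=True)

df = pd.DataFrame({
    "Name": ["Ali", "Ali", "Ali", "Sasha", "Sasha", "Sasha"],
    "Cart": ["book", "phonecase", "shirt", "phone", "food", "bag"],
})

result = expand_rows(df, n=3)
print(result)
```

The filler frames have no Cart column, so pd.concat fills that column with NaN for the added rows.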

Matching IDs with names on a pandas DataFrame

I have a DataFrame with 3 columns: ID, BossID and Name. Each row has a unique ID and has a corresponding name. BossID is the ID of the boss of the person in that row. Suppose I have the following DataFrame:
df = pd.DataFrame({'id':[1,2,3,4,5], 'bossId':[np.nan, 1, 2, 2, 3],
'name':['Anne Boe','Ben Coe','Cate Doe','Dan Ewe','Erin Aoi']})
So here, Anne is Ben's boss and Ben Coe is Cate and Dan's boss, etc.
Now, I want to have another column that has the boss's name for each person.
The desired output is:
id boss name boss_name
0 1 NaN Anne NaN
1 2 1.0 Ben Anne
2 3 2.0 Cate Ben
3 4 2.0 Dan Ben
4 5 3.0 Erin Cate
I can get my output using an ugly double for-loop. Is there a cleaner way to obtain the desired output?
This should work:
bossmap = df.set_index('id')['name'].squeeze()
df['boss_name'] = df['bossId'].map(bossmap)
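Run end to end on the question's sample data, the mapping idea looks roughly like this. It is a sketch that assumes id is unique, which the question states.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "bossId": [np.nan, 1, 2, 2, 3],
                   "name": ["Anne Boe", "Ben Coe", "Cate Doe", "Dan Ewe", "Erin Aoi"]})

# Series mapping id -> name; it serves as the lookup table
bossmap = df.set_index("id")["name"]

# Each bossId is looked up in the index of bossmap; NaN stays NaN
df["boss_name"] = df["bossId"].map(bossmap)
print(df)
```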
You can set id as index, then use pd.Series.reindex
df = df.set_index('id')
df['boss_name'] = df['name'].reindex(df['bossId']).to_numpy() # or .to_list()
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe
Create a separate dataframe with 'id' and 'name'.
Rename 'name' to 'boss_name' and set 'id' as the index.
Merge df with the new dataframe.
import numpy as np
import pandas as pd
# test dataframe
df = pd.DataFrame({'id':[1,2,3,4,5], 'bossId':[np.nan, 1, 2, 2, 3], 'name':['Anne Boe','Ben Coe','Cate Doe','Dan Ewe','Erin Aoi']})
# separate dataframe with id and name
names = df[['id', 'name']].dropna().set_index('id').rename(columns={'name': 'boss_name'})
# merge the two
df = df.merge(names, left_on='bossId', right_index=True, how='left')
# df
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe

How to add the index of another dataframe

I have two different dataframes. First I had to check whether the rows of df1 appear in df2. If a row matches, an "isRep" column is set to True, otherwise to False. This gave me df3.
Now I need to add an "idRep" column to df3 holding the index (generated automatically by pandas) at which the matching row is found in df2.
This is df1:
Index Firstname Name Origine
0 Johnny Depp USA
1 Brad Pitt USA
2 Angelina Pitt USA
This is df2:
Index Firstname Name Origine
0 Kidman Nicole AUS
1 Jean Dujardin FR
2 Brad Pitt USA
I merged them with this code:
df = pd.merge(data, dataRep, on=['Firstname', 'Name', 'Origine'], how='left', indicator='IsRep')
df['IsRep'] = np.where(df.IsRep == 'both', True, False)
After this code I get the following result, which is my df3 (the same as df1 but with the "isRep" column):
Index Firstname Name Origine isRep
0 Johnny Depp USA False
1 Brad Pitt USA True
2 Angelina Pitt USA False
Now I need another dataframe with a column named "idRep" that holds the corresponding index in df2, as shown below. But I don't know how to do that:
Index Firstname Name Origine isRep IdRep
0 Johnny Depp USA False -
1 Brad Pitt USA True 2
2 Angelina Pitt USA False -
A bit of a hack would be to reset_index before you merge. Only reset the index on the right DataFrame.
m = dataRep.rename_axis('IdRep').reset_index()
df = pd.merge(data, m, on=['Firstname', 'Name', 'Origine'], how='left', indicator='IsRep')
df['IsRep'] = np.where(df.IsRep == 'both', True, False)
Firstname Name Origine IdRep IsRep
0 Johnny Depp USA NaN False
1 Brad Pitt USA 2.0 True
2 Angelina Pitt USA NaN False
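The merge leaves IdRep as a float (2.0) because of the NaN values, whereas the question shows a plain 2. One way to get whole numbers while keeping missing entries is pandas' nullable Int64 dtype, sketched below. The sample frames are rebuilt from the question; deriving IsRep from IdRep instead of from the indicator is a simplification made here.

```python
import pandas as pd

data = pd.DataFrame({"Firstname": ["Johnny", "Brad", "Angelina"],
                     "Name": ["Depp", "Pitt", "Pitt"],
                     "Origine": ["USA", "USA", "USA"]})
dataRep = pd.DataFrame({"Firstname": ["Kidman", "Jean", "Brad"],
                        "Name": ["Nicole", "Dujardin", "Pitt"],
                        "Origine": ["AUS", "FR", "USA"]})

# Expose the right frame's index as a regular column named IdRep
m = dataRep.rename_axis("IdRep").reset_index()

df = data.merge(m, on=["Firstname", "Name", "Origine"], how="left")

# A row is a match exactly when an IdRep was found
df["IsRep"] = df["IdRep"].notna()

# Nullable integer dtype: 2 instead of 2.0, <NA> instead of NaN
df["IdRep"] = df["IdRep"].astype("Int64")
print(df)
```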
Reverse lookup using a dict:
cols = ['Firstname', 'Name', 'Origine']
d = dict(zip(zip(*map(df2.get, cols)), df2.index))
z = [*zip(*map(df1.get, cols))]
df1.assign(
isRep=[*map(d.__contains__, z)],
IdRep=[*map(d.get, z)]
)
Firstname Name Origine isRep IdRep
Index
0 Johnny Depp USA False NaN
1 Brad Pitt USA True 2.0
2 Angelina Pitt USA False NaN
A variation that takes advantage of assign arguments being order dependent:
cols = ['Firstname', 'Name', 'Origine']
d = dict(zip(zip(*map(df2.get, cols)), df2.index))
z = [*zip(*map(df1.get, cols))]
df1.assign(
IdRep=[*map(d.get, z)],
isRep=lambda d: d.IdRep.notna()
)

Adding Dates (Series) column from one DataFrame to the other Pandas, Python

I am trying to 'broadcast' a date column from df1 to df2.
In df1 I have the names of all the users and their basic information.
In df2 I have a list of purchases made by the users.
df1 and df2 code
Assuming I have a much bigger dataset (the above created for sample) how can I add just(!) the df1['DoB'] column to df2?
I have tried both concat() and merge(), but neither seems to work:
code and error
The only way it seems to work is only if I merge both df1 and df2 together and then just delete the columns I don't need. But if I have tens of unwanted columns, it is going to be very problematic.
The full code (including the lines that throw an error):
import pandas as pd
df1 = pd.DataFrame(columns=['Name','Age','DoB','HomeTown'])
df1['Name'] = ['John', 'Jack', 'Wendy','Paul']
df1['Age'] = [25,23,30,31]
df1['DoB'] = pd.to_datetime(['04-01-2012', '03-02-1991', '04-10-1986', '06-03-1985'], dayfirst=True)
df1['HomeTown'] = ['London', 'Brighton', 'Manchester', 'Jersey']
df2 = pd.DataFrame(columns=['Name','Purchase'])
df2['Name'] = ['John','Wendy','John','Jack','Wendy','Jack','John','John']
df2['Purchase'] = ['fridge','coffee','washingmachine','tickets','iPhone','stove','notebook','laptop']
df2 = df2.concat(df1) # error
df2 = df2.merge(df1['DoB'], on='Name', how='left') #error
df2 = df2.merge(df1, on='Name', how='left')
del df2['Age'], df2['HomeTown']
df2 #that's how i want it to look like
Any help would be much appreciated. Thank you :)
I think you need merge with the subset [['Name','DoB']]; the Name column is needed for matching:
print (df1[['Name','DoB']])
Name DoB
0 John 2012-01-04
1 Jack 1991-02-03
2 Wendy 1986-10-04
3 Paul 1985-03-06
df2 = df2.merge(df1[['Name','DoB']], on='Name', how='left')
print (df2)
Name Purchase DoB
0 John fridge 2012-01-04
1 Wendy coffee 1986-10-04
2 John washingmachine 2012-01-04
3 Jack tickets 1991-02-03
4 Wendy iPhone 1986-10-04
5 Jack stove 1991-02-03
6 John notebook 2012-01-04
7 John laptop 2012-01-04
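On a much bigger dataset it can be worth guarding against duplicate names in df1, which would silently multiply rows in the result. The sketch below adds merge's validate argument to the same call. The sample data is shortened, and the row-count check is an extra precaution added here, not part of the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "Jack", "Wendy", "Paul"],
    "Age": [25, 23, 30, 31],
    "DoB": pd.to_datetime(["04-01-2012", "03-02-1991", "04-10-1986", "06-03-1985"],
                          dayfirst=True),
    "HomeTown": ["London", "Brighton", "Manchester", "Jersey"],
})
df2 = pd.DataFrame({
    "Name": ["John", "Wendy", "John", "Jack"],
    "Purchase": ["fridge", "coffee", "washingmachine", "tickets"],
})

# validate="many_to_one" raises MergeError if Name is not unique in the right frame
out = df2.merge(df1[["Name", "DoB"]], on="Name", how="left", validate="many_to_one")

# A left join against unique keys must not change the number of rows
assert len(out) == len(df2)
print(out)
```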
Another solution uses map with a Series s:
s = df1.set_index('Name')['DoB']
print (s)
Name
John 2012-01-04
Jack 1991-02-03
Wendy 1986-10-04
Paul 1985-03-06
Name: DoB, dtype: datetime64[ns]
df2['DoB'] = df2.Name.map(s)
print (df2)
Name Purchase DoB
0 John fridge 2012-01-04
1 Wendy coffee 1986-10-04
2 John washingmachine 2012-01-04
3 Jack tickets 1991-02-03
4 Wendy iPhone 1986-10-04
5 Jack stove 1991-02-03
6 John notebook 2012-01-04
7 John laptop 2012-01-04
