Processing a dataframe with another dataframe - python

I have two data frames, df1 and df2. Both contain columns like 'ID', 'Name', 'Score' and 'Status'. I need to update the 'Score' in df1 if that person's status in df2 is "Edit", and I also need to drop the row from df1 if that person's status in df2 is "Cancel".
For example:
import pandas as pd

dic1 = {'ID': [1, 2, 3],
        'Name': ['Jack', 'Tom', 'Annie'],
        'Score': [20, 10, 25],
        'Status': ['New', 'New', 'New']}
dic2 = {'ID': [1, 2],
        'Name': ['Jack', 'Tom'],
        'Score': [28, 10],
        'Status': ['Edit', 'Cancel']}
df1 = pd.DataFrame(dic1)
df2 = pd.DataFrame(dic2)
The output should be like:
ID Name Score Status
1 Jack 28 Edit
3 Annie 25 New
Any pointers or hints?

Use DataFrame.merge with a left join first, then filter out the Cancel rows and the helper columns ending with _ that came from the original DataFrame:
df = df1.merge(df2, on=['ID','Name'], how='left', suffixes=('_', ''))
df = df.loc[df['Status'] != 'Cancel', ~df.columns.str.endswith('_')]
print(df)
ID Name Score Status
0 1 Jack 28 Edit
EDIT: add DataFrame.combine_first to fill the missing rows back in:
df = df1.merge(df2, on=['ID','Name'], how='left', suffixes=('', '_'))
df = df.loc[df['Status_'] != 'Cancel']
df1 = df.loc[:, df.columns.str.endswith('_')]
df = df1.rename(columns=lambda x: x.rstrip('_')).combine_first(df).drop(df1.columns, axis=1)
print(df)
ID Name Score Status
0 1.0 Jack 28.0 Edit
2 3.0 Annie 25.0 New
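Alternatively, the same matching can be done by label alignment instead of merging. A minimal sketch, assuming (ID, Name) pairs uniquely identify rows; note that update may upcast the int Score column to float:
df = df1.set_index(['ID', 'Name'])
# overwrite Score/Status wherever df2 has a row with the same (ID, Name)
df.update(df2.set_index(['ID', 'Name']))
df = df.reset_index()
# drop the rows whose updated status is Cancel
df = df[df['Status'] != 'Cancel']
print(df)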

Use the pandas.DataFrame.update command from the pandas package:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
df1.update(df2)
print(df1)
df1 = df1[df1.Status != "Cancel"]
print(df1)
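Be aware that update aligns rows on the index, not on ID, so this works on the sample data only because both frames use the default RangeIndex and the rows happen to line up. A safer sketch (starting again from the original df1 and df2) sets the index first:
# align on ID rather than on row position
df1 = df1.set_index('ID')
df1.update(df2.set_index('ID'))
df1 = df1[df1.Status != "Cancel"].reset_index()
print(df1)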

Related

Drop rows in a pandas dataframe by criteria from another dataframe

I have the following dataframe containing scores for a competition, as well as a column that counts which entry number each row is for that person.
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John', 'Jim', 'John', 'Jim', 'John',
                            'Jim', 'John', 'Jim', 'Jack', 'Jack', 'Jack', 'Jack'],
                   'Score': [10, 8, 9, 3, 5, 0, 1, 2, 3, 4, 5, 6, 8, 9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df
Then I have another table that stores data on the maximum number of entries that each person can have:
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2
I am trying to drop rows from df where the entry number is greater than that person's Limit in df2, so that only each person's allowed entries remain.
If anyone has ideas on how to achieve this, that would be fantastic! Thanks
You can use pandas.merge to create another dataframe and then drop rows by your condition:
df3 = pd.merge(df, df2, on="Name", how="left")
df3 = df3[df3["Entry_No"] <= df3["Limit"]][df.columns].reset_index(drop=True)
print(df3)
Name Score Entry_No
0 John 10 1
1 Jim 8 1
2 John 9 2
3 Jim 3 2
4 Jim 0 3
5 Jack 5 1
I used how="left" to keep the order of df and reset_index(drop=True) to reset the index of the resulting dataframe.
You could join the two dataframes and then drop rows with a condition:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John', 'Jim', 'John', 'Jim', 'John',
                            'Jim', 'John', 'Jim', 'Jack', 'Jack', 'Jack', 'Jack'],
                   'Score': [10, 8, 9, 3, 5, 0, 1, 2, 3, 4, 5, 6, 8, 9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2 = df2.set_index('Name')
df = df.join(df2, on='Name')
df.drop(df[df.Entry_No > df.Limit].index, inplace=True)
gives the expected output
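A merge-free variant, if you prefer: starting from the original df and df2 (before the set_index and join above), map each person's Limit onto df with Series.map and filter. A sketch assuming Name is unique in df2:
# look up each row's limit by name, then keep only rows within it
limits = df['Name'].map(df2.set_index('Name')['Limit'])
result = df[df['Entry_No'] <= limits].reset_index(drop=True)
print(result)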

Combining columns for each row in a data frame

I would like to combine two columns (Column 1 + Column 2) for each row individually. Unfortunately it didn't work for me. How do I solve this?
import pandas as pd
import numpy as np
d = {'Nameid': [1, 2, 3, 1],
     'Name': ['Michael', 'Max', 'Susan', 'Michael'],
     'Project': ['S455', 'G874', 'B7445', 'Z874']}
df = pd.DataFrame(data=d)
display(df.head(10))
df['Dataframe']='df'
d2 = {'Nameid': [4, 2, 5, 1],
      'Name': ['Petrova', 'Michael', 'Mike', 'Gandalf'],
      'Project': ['Z845', 'Q985', 'P512', 'Y541']}
df2 = pd.DataFrame(data=d2)
display(df2.head(10))
df2['Dataframe']='df2'
What I tried
df_merged = pd.concat([df,df2])
df_merged.head(10)
df3 = pd.concat([df,df2])
df3['unique_string'] = df['Nameid'].astype(str) + df['Dataframe'].astype(str)
df3.head(10)
As you can see, it didn't combine every row correctly; the values from the first frame were simply reused for every row. How can I combine the two columns row by row?
What I want
You can simply concatenate the strings like this. You don't need df['Dataframe'].astype(str), because that column already holds strings:
In [363]: df_merged['unique_string'] = df_merged.Nameid.astype(str) + df_merged.Dataframe
In [365]: df_merged
Out[365]:
Nameid Name Project Dataframe unique_string
0 1 Michael S455 df 1df
1 2 Max G874 df 2df
2 3 Susan B7445 df 3df
3 1 Michael Z874 df 1df
0 4 Petrova Z845 df2 4df2
1 2 Michael Q985 df2 2df2
2 5 Mike P512 df2 5df2
3 1 Gandalf Y541 df2 1df2
Please make sure you assign back to df3 and use df3 (not df) on the right-hand side; also do a reset_index first:
df3 = df3.reset_index()
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe'].astype(str)
Use df3 instead of df; also pass ignore_index=True so a default index is added:
df3 = pd.concat([df,df2], ignore_index=True)
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe']
print (df3)
Nameid Name Project Dataframe unique_string
0 1 Michael S455 df 1df
1 2 Max G874 df 2df
2 3 Susan B7445 df 3df
3 1 Michael Z874 df 1df
4 4 Petrova Z845 df2 4df2
5 2 Michael Q985 df2 2df2
6 5 Mike P512 df2 5df2
7 1 Gandalf Y541 df2 1df2
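As an aside, pd.concat can tag the source frame for you via the keys parameter, which avoids adding the Dataframe column by hand. A sketch, with source as an illustrative column name:
# tag each row with its source frame while concatenating
df3 = pd.concat([df, df2], keys=['df', 'df2'], names=['source', None])
df3 = df3.reset_index(level=0).reset_index(drop=True)
df3['unique_string'] = df3['Nameid'].astype(str) + df3['source']
print(df3)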

How to update DataFrame without making column type implicitly changed?

Given the following code
df1 = pd.DataFrame({
    'name': ['a', 'b'],
    'age': [1, 2]
})
df2 = pd.DataFrame({
    'name': ['a'],
    'age': [3]
})
df1.update(df2)
df1 becomes
name age
0 a 3.0
1 b 2.0
But what I want is
name age
0 a 3
1 b 2
I can do df1['age'] = df1['age'].astype(np.int64), but it feels a little awkward, especially if there are more columns.
Is there a better/more concise way to do this?
One idea is to save the original dtypes and pass them back to DataFrame.astype after the update:
d = df1.dtypes
df1.update(df2)
df1 = df1.astype(d)
print(df1)
name age
0 a 3
1 b 2
If you need to update rows matched by name, this solution fails because update matches on the index; check the edited data:
df1 = pd.DataFrame({
    'name': ['a', 'b'],
    'age': [1, 2]
})
df2 = pd.DataFrame({
    'name': ['a', 'c', 'b'],
    'age': [3, 4, 8]
})
d = df1.dtypes
df1.update(df2)
df1 = df1.astype(d)
print(df1)
name age
0 a 3
1 c 4 <- matched on index 1, so both values were replaced
The correct solution is to set the index to name first:
d = df1.dtypes
df11 = df1.set_index('name')
df22 = df2.set_index('name')
df11.update(df22)
df1 = df11.reset_index().astype(d)
print(df1)
name age
0 a 3
1 b 8 <- matched on the 'b' index label, age replaced
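If this pattern comes up often, it is easy to wrap in a small helper. A hypothetical convenience function (not part of pandas), combining the set_index, update and astype steps above:
def update_preserving_dtypes(target, source, key):
    # update target from source, matching rows on key, then restore dtypes
    dtypes = target.dtypes
    out = target.set_index(key)
    out.update(source.set_index(key))
    return out.reset_index().astype(dtypes)

df1 = update_preserving_dtypes(df1, df2, 'name')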

Compare two dataframes and get the nearest matching dataframe

I have two dataframes with these columns:
df1
name cell marks
tom 2 21862
df2
name cell marks passwd
tom 2 11111 2548
matt 2 158416 2483
     2 21862 26846
How can I compare df2 with df1 and get the nearest matching rows?
Expected output:
df2
name cell marks passwd
tom 2 11111 2548
2 21862 26846
I tried merge, but the data is dynamic: in one case the name might change, and in another the marks might change.
You can try the following:
import pandas as pd
dict1 = {'name': ['tom'], 'cell': [2], 'marks': [21862]}
dict2 = {'name': ['tom', 'matt'], 'cell': [2, 2],
         'marks': [21862, 158416], 'passwd': [2548, 2483]}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
compare = df2.isin(df1)
df2 = df2.iloc[df2.where(compare).dropna(how='all').index]
print(df2)
Output:
name cell marks passwd
0 tom 2 21862 2548
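One caveat: DataFrame.isin with a DataFrame argument aligns on both index and column labels, so a df1 row is only compared against the df2 row sharing its index label. Printing the mask makes the alignment visible (passwd is all False because df1 has no such column):
print(df2.isin(df1))
#     name   cell  marks  passwd
# 0   True   True   True   False
# 1  False  False  False   False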
You can use pandas.merge with the option indicator=True, filtering the result for 'both':
import pandas as pd
df1 = pd.DataFrame([['tom', 2, 11111]], columns=["name", "cell", "marks"])
df2 = pd.DataFrame([['tom', 2, 11111, 2548],
                    ['matt', 2, 158416, 2483]],
                   columns=["name", "cell", "marks", "passwd"])
def compare_dataframes(df1, df2):
    """Find rows which are similar between two DataFrames."""
    comparison_df = df1.merge(df2, indicator=True, how='outer')
    return comparison_df[comparison_df['_merge'] == 'both'].drop(columns=["_merge"])
print(compare_dataframes(df1, df2))
Returns:
name cell marks passwd
0 tom 2 11111 2548

Pandas left join - how to replace values not present in second df with NaN

I have two dataframes which I am joining like so:
df3 = df1.join(df2.set_index('id'), on='id', how='left')
But I want to replace values for ids which are present in df1 but not in df2 with NaN (a left join will just leave the values in df1 as they are). What's the easiest way to accomplish this?
I think you need Series.where with Series.isin:
df1['id'] = df1['id'].where(df1['id'].isin(df2['id']))
Or numpy.where:
df1['id'] = np.where(df1['id'].isin(df2['id']), df1['id'], np.nan)
Sample:
df1 = pd.DataFrame({
    'id': list('abc'),
})
df2 = pd.DataFrame({
    'id': list('dmna'),
})
df1['id'] = df1['id'].where(df1['id'].isin(df2['id']))
print(df1)
id
0 a
1 NaN
2 NaN
Or a solution with merge and the indicator parameter:
df3 = df1.merge(df2, on='id', how='left', indicator=True)
df3['id'] = df3['id'].mask(df3.pop('_merge').eq('left_only'))
print(df3)
id
0 a
1 NaN
2 NaN
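If you want to blank out every df1 value for the unmatched ids, not just the id column, the same indicator trick extends to a row mask. A sketch with a hypothetical val column:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': list('abc'), 'val': [1, 2, 3]})
df2 = pd.DataFrame({'id': list('dmna'), 'other': [10, 20, 30, 40]})

df3 = df1.merge(df2, on='id', how='left', indicator=True)
unmatched = df3.pop('_merge').eq('left_only')
# overwrite all df1 columns where the id is absent from df2
df3.loc[unmatched, df1.columns] = np.nan
print(df3)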
