pandas replace items in df from mappings defined in separate df

Let's say I have the following two DataFrames:
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Johnny', 'Sara', 'Mike']})
df2 = pd.DataFrame({'name': [2, 1, 2]})
How can I update df2 from the mappings defined in df1, so that it becomes:
df2 = pd.DataFrame({'name': ['Sara', 'Johnny', 'Sara']})
I've done the following but there has to be a better way to do it:
id_to_name = {i: name for i, name in zip(df1['id'].tolist(), df1['name'].tolist())}
df2['name'] = df2['name'].map(id_to_name)

Another way :-) I learned it recently:
df1.set_index('id').name.get(df2.name)
Out[381]:
id
2 Sara
1 Johnny
2 Sara
Name: name, dtype: object
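
A closely related sketch, assuming the ids in df1 are unique: reindex performs the same label lookup and gives NaN for any id that is missing from df1. The variable name lookup is mine and not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Johnny', 'Sara', 'Mike']})
df2 = pd.DataFrame({'name': [2, 1, 2]})

# Series of names indexed by id (assumes the ids are unique)
lookup = df1.set_index('id')['name']

# Look up each id; to_numpy() drops the id index so the values
# are assigned positionally to df2's rows.
df2['name'] = lookup.reindex(df2['name']).to_numpy()
print(df2)
```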

You can pass a Series/dict to map (thanks to piR for the improvement!) -
df2.name.map(dict(df1.values))
Or, replace, but this is slower -
df2.name.replace(df1.set_index('id').name)
0 Sara
1 Johnny
2 Sara
Name: name, dtype: object
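
Apart from speed, the two calls differ in how they treat ids that have no entry in df1. The following is a small sketch of that difference; the id 99 is made up for illustration and does not appear in the question.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Johnny', 'Sara', 'Mike']})
# 99 is a hypothetical id with no entry in df1
df2 = pd.DataFrame({'name': [2, 1, 99]})

id_to_name = df1.set_index('id')['name']

# map: ids without a match become NaN
mapped = df2['name'].map(id_to_name)

# replace: ids without a match are left as they were
replaced = df2['name'].replace(id_to_name.to_dict())

print(mapped)
print(replaced)
```

So map suits cases where every id should resolve, while replace is the safer pick when unmatched values must survive.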

Related

Drop rows in a pandas dataframe by criteria from another dataframe

I have the following DataFrame containing scores for a competition, as well as a column that numbers each person's entries.
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df
Then I have another table that stores data on the maximum number of entries that each person can have:
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2
I am trying to drop the rows of df whose entry number is greater than that person's Limit in df2.
Any ideas on how to achieve this would be fantastic. Thanks!

You can use pandas.merge to create another DataFrame and then keep only the rows that satisfy your condition:
df3 = pd.merge(df, df2, on="Name", how="left")
df3[df3["Entry_No"] <= df3["Limit"]][df.columns].reset_index(drop=True)
   Name  Score  Entry_No
0  John     10         1
1   Jim      8         1
2  John      9         2
3   Jim      3         2
4   Jim      0         3
5  Jack      5         1
I used how="left" to keep the order of df and reset_index(drop=True) to reset the index of the resulting dataframe.
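
If the extra Limit column is not needed at all, the limit can also be looked up per row with map instead of merge. This is only a sketch of that alternative, assuming each Name appears exactly once in df2; the names limit and result are mine.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jim', 'John', 'Jim', 'John', 'Jim', 'John',
             'Jim', 'John', 'Jim', 'Jack', 'Jack', 'Jack', 'Jack'],
    'Score': [10, 8, 9, 3, 5, 0, 1, 2, 3, 4, 5, 6, 8, 9],
})
df['Entry_No'] = df.groupby('Name').cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'], 'Limit': [2, 3, 1]})

# Per-row limit looked up by name (assumes each Name is unique in df2)
limit = df['Name'].map(df2.set_index('Name')['Limit'])

# A name missing from df2 would give NaN here, the comparison would be
# False, and that row would be dropped.
result = df[df['Entry_No'] <= limit].reset_index(drop=True)
print(result)
```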

You could join the 2 dataframes, and then drop with a condition:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2 = df2.set_index('Name')
df = df.join(df2, on='Name')
df.drop(df[df.Entry_No>df.Limit].index, inplace = True)
This gives the expected rows (the joined Limit column is still present in df).

Combining columns for each row in a data frame

I would like to combine two columns (Column 1 + Column 2) for each row individually. Unfortunately it didn't work for me. How do I solve this?
import pandas as pd
import numpy as np
d = {'Nameid': [1, 2, 3, 1], 'Name': ['Michael', 'Max', 'Susan', 'Michael'], 'Project': ['S455', 'G874', 'B7445', 'Z874']}
df = pd.DataFrame(data=d)
display(df.head(10))
df['Dataframe']='df'
d2 = {'Nameid': [4, 2, 5, 1], 'Name': ['Petrova', 'Michael', 'Mike', 'Gandalf'], 'Project': ['Z845', 'Q985', 'P512', 'Y541']}
df2 = pd.DataFrame(data=d2)
display(df2.head(10))
df2['Dataframe']='df2'
What I tried
df_merged = pd.concat([df,df2])
df_merged.head(10)
df3 = pd.concat([df,df2])
df3['unique_string'] = df['Nameid'].astype(str) + df['Dataframe'].astype(str)
df3.head(10)
As you can see, it didn't combine every row. It probably only combined the values from the first DataFrame with all of them. How can I combine the two columns row by row?
What I want

You can simply concatenate the strings like this.
You don't need df['Dataframe'].astype(str), since that column already holds strings:
In [363]: df_merged['unique_string'] = df_merged.Nameid.astype(str) + df_merged.Dataframe
In [365]: df_merged
Out[365]:
   Nameid     Name Project Dataframe unique_string
0       1  Michael    S455        df           1df
1       2      Max    G874        df           2df
2       3    Susan   B7445        df           3df
3       1  Michael    Z874        df           1df
0       4  Petrova    Z845       df2          4df2
1       2  Michael    Q985       df2          2df2
2       5     Mike    P512       df2          5df2
3       1  Gandalf    Y541       df2          1df2

Make sure you build the string from df3 and assign it back to df3; also call reset_index first:
df3 = df3.reset_index()
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe'].astype(str)

Use df3 instead of df, and pass ignore_index=True so that a default index is created:
df3 = pd.concat([df,df2], ignore_index=True)
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe']
print (df3)
   Nameid     Name Project Dataframe unique_string
0       1  Michael    S455        df           1df
1       2      Max    G874        df           2df
2       3    Susan   B7445        df           3df
3       1  Michael    Z874        df           1df
4       4  Petrova    Z845       df2          4df2
5       2  Michael    Q985       df2          2df2
6       5     Mike    P512       df2          5df2
7       1  Gandalf    Y541       df2          1df2
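
For anyone wondering why the original attempt went wrong, here is a reduced sketch of the cause as I understand it: pandas aligns on index labels when a Series is assigned to a column. The frames are trimmed to the two relevant columns, and the column name misaligned is made up for the demonstration.

```python
import pandas as pd

df = pd.DataFrame({'Nameid': [1, 2, 3, 1], 'Dataframe': 'df'})
df2 = pd.DataFrame({'Nameid': [4, 2, 5, 1], 'Dataframe': 'df2'})

# Without ignore_index the index is 0..3 followed by 0..3 again
df3 = pd.concat([df, df2])

# The right-hand side comes from df, so it only has labels 0..3.
# Assignment aligns on those labels, so the rows that came from df2
# (also labelled 0..3) receive df's strings.
df3['misaligned'] = df['Nameid'].astype(str) + df['Dataframe']

# Building the string from df3 itself keeps each row with its own values
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe']
print(df3)
```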

How to update DataFrame without making column type implicitly changed?

Given the following code
df1 = pd.DataFrame({
'name': ['a', 'b'],
'age': [1, 2]
})
df2 = pd.DataFrame({
'name': ['a'],
'age': [3]
})
df1.update(df2)
df1 becomes
name age
0 a 3.0
1 b 2.0
But what I want is
name age
0 a 3
1 b 2
I can do df1['age'] = df1['age'].astype(np.int64), but it feels a little awkward, especially if there are more columns.
Is there a better/more concise way to do this?

One idea is to save the original dtypes and pass them to DataFrame.astype:
d = df1.dtypes
df1.update(df2)
df1 = df1.astype(d)
print (df1)
name age
0 a 3
1 b 2
If you need to update rows by name, this solution fails because update matches on the index. Check the edited data:
df1 = pd.DataFrame({
'name': ['a', 'b'],
'age': [1, 2]
})
df2 = pd.DataFrame({
'name': ['a','c','b'],
'age': [3,4,8]
})
d = df1.dtypes
df1.update(df2)
df1 = df1.astype(d)
print (df1)
name age
0 a 3
1 c 4 <- replaced by index 1 both values
The correct solution is to set name as the index first (starting again from the df1 and df2 defined just above):
d = df1.dtypes
df11 = df1.set_index('name')
df22 = df2.set_index('name')
df11.update(df22)
df1 = df11.reset_index().astype(d)
print (df1)
name age
0 a 3
1 b 8 <- replaced by index 'b'
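
When only a single column needs updating, a name-based lookup with map is another option that avoids update altogether. This is a sketch of that alternative rather than the approach above; it assumes the names in df2 are unique, and new_age is a name I made up.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b'], 'age': [1, 2]})
df2 = pd.DataFrame({'name': ['a', 'c'], 'age': [3, 4]})

# New ages looked up by name; names absent from df2 give NaN
new_age = df1['name'].map(df2.set_index('name')['age'])

# Keep the old age where there is no update, then restore the original dtype
df1['age'] = new_age.fillna(df1['age']).astype(df1['age'].dtype)
print(df1)
```

Here 'c' is ignored because it does not occur in df1, and 'b' keeps its old age.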

compare two dataframes and get nearest matching dataframe

I have two DataFrames with the following columns:
df1
name cell marks
tom 2 21862
df2
name cell marks passwd
tom 2 11111 2548
matt 2 158416 2483
2 21862 26846
How can I compare df2 with df1 and get the nearest matching rows?
expected_output:
df2
name cell marks passwd
tom 2 11111 2548
2 21862 26846
I tried merge, but the data is dynamic: in one case the name might change, and in another case the marks might change.

You can try the following:
import pandas as pd
dict1 = {'name': ['tom'], 'cell': [2], 'marks': [21862]}
dict2 = {'name': ['tom', 'matt'], 'cell': [2, 2],
'marks': [21862, 158416], 'passwd': [2548, 2483]}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
compare = df2.isin(df1)
df2 = df2.loc[df2.where(compare).dropna(how='all').index]
print(df2)
Output:
name cell marks passwd
0 tom 2 21862 2548

You can use pandas.merge with the option indicator=True, filtering the result for 'both':
import pandas as pd
df1 = pd.DataFrame([['tom', 2, 11111]], columns=["name", "cell", "marks"])
df2 = pd.DataFrame([['tom', 2, 11111, 2548],
                    ['matt', 2, 158416, 2483]
                    ], columns=["name", "cell", "marks", "passwd"])
def compare_dataframes(df1, df2):
    """Find rows which are similar between two DataFrames."""
    comparison_df = df1.merge(df2,
                              indicator=True,
                              how='outer')
    return comparison_df[comparison_df['_merge'] == 'both'].drop(columns=["_merge"])
print(compare_dataframes(df1, df2))
Returns:
name cell marks passwd
0 tom 2 11111 2548
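
Both answers look for exact matches. If "nearest" is read as "agrees with the df1 row in the most columns" (that reading is my assumption, not something the question states), a rough sketch is to count matching columns per row and keep the rows with the highest count. It uses the question's data, with None standing in for the blank name, and assumes df1 has a single row.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['tom'], 'cell': [2], 'marks': [21862]})
df2 = pd.DataFrame({
    'name': ['tom', 'matt', None],   # None stands in for the blank name
    'cell': [2, 2, 2],
    'marks': [11111, 158416, 21862],
    'passwd': [2548, 2483, 26846],
})

# Columns present in both frames
shared = [col for col in df1.columns if col in df2.columns]
target = df1.iloc[0]

# Number of shared columns in which each df2 row equals the df1 row
score = sum((df2[col] == target[col]).astype(int) for col in shared)

# Keep the rows that agree in the most columns
nearest = df2[score == score.max()]
print(nearest)
```

With this data the rows at index 0 and 2 both agree in two columns, while the matt row agrees only on cell.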

Replace a value in a column by vlookup another dataframe only if the value exists

I want to overwrite my df1.Name values based on a mapping table in (df2.Name1, df2.Name2). However, not all values in df1.Name exist in df2.Name1
df1:
Name
Alex
Maria
Marias
Pandas
Coala
df2:
Name1 Name2
Alex Alexs
Marias Maria
Coala Coalas
Expected Result:
Name
Alexs
Maria
Maria
Pandas
Coalas
I have tried several solutions found online, such as using map. After turning df2 into a dictionary, I use df1.Name = df1.Name.map(Dictionary), but this results in NaN for every value not in df2, as shown below.
Name
Alexs
Maria
Maria
NaN
Coalas
I am not sure how to use an if statement to replace only the values that exist in df2 and keep the rest as they are in df1.
I also tried to write a function with if statements, but that failed badly.
How could I approach this problem?

Using replace:
df1.Name.replace(df2.set_index('Name1').Name2.to_dict())
Out[437]:
0 Alexs
1 Maria
2 Maria
3 Pandas
4 Coalas
Name: Name, dtype: object

Let's use a Pandas solution with map and combine_first:
df1['Name'].map(df2.set_index('Name1')['Name2']).combine_first(df1['Name'])
Output:
0 Alexs
1 Maria
2 Maria
3 Pandas
4 Coalas
Name: Name, dtype: object
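
A close variant, in case it reads more naturally: fillna with the original column fills the gaps that map leaves behind. This is a sketch on the question's data; mapping is a name I introduced.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alex', 'Maria', 'Marias', 'Pandas', 'Coala']})
df2 = pd.DataFrame({'Name1': ['Alex', 'Marias', 'Coala'],
                    'Name2': ['Alexs', 'Maria', 'Coalas']})

# Series mapping old names to new names (assumes Name1 values are unique)
mapping = df2.set_index('Name1')['Name2']

# map gives NaN for names without a mapping; fillna restores the originals
df1['Name'] = df1['Name'].map(mapping).fillna(df1['Name'])
print(df1)
```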

Python's dict.get() accepts a default value. So if you build a translation dict, it is easy to return the original value whenever the lookup is not found:
Code:
translate = {x: y for x, y in df2[['Name1', 'Name2']].values}
new_names = [translate.get(x, x) for x in df1['Name']]
Test Code:
import pandas as pd
df1 = pd.DataFrame({'Name': ['Alex', 'Maria', 'Marias', 'Pandas', 'Coala']})
df2 = pd.DataFrame({'Name1': ['Alex', 'Marias', 'Coala'],
'Name2': ['Alexs', 'Maria', 'Coalas']})
print(df1)
print(df2)
translate = {x: y for x, y in df2[['Name1', 'Name2']].values}
print([translate.get(x, x) for x in df1['Name']])
Test Results:
Name
0 Alex
1 Maria
2 Marias
3 Pandas
4 Coala
Name1 Name2
0 Alex Alexs
1 Marias Maria
2 Coala Coalas
['Alexs', 'Maria', 'Maria', 'Pandas', 'Coalas']

You can also use merge:
In [27]: df1['Name'] = df1.merge(df2.rename(columns={'Name1':'Name'}), how='left') \
.ffill(axis=1)['Name2']
In [28]: df1
Out[28]:
Name
0 Alexs
1 Maria
2 Maria
3 Pandas
4 Coalas

You can also use replace
df1 = pd.DataFrame({'Name': ['Alex', 'Maria', 'Marias', 'Pandas', 'Coala']})
df2 = pd.DataFrame({'Name1': ['Alex', 'Marias', 'Coala'],
'Name2': ['Alexs', 'Maria', 'Coalas']})
# Create the dictionary from df2
d = {"Name": {k:v for k, v in zip(df2["Name1"], df2["Name2"])}}
# Suggestion from Wen to create the dictionary
# d = {"Name": df2.set_index('Name1').Name2.to_dict()}
df1.replace(d) # Use df1.replace(d, inplace=True) if you want this in place
Name
0 Alexs
1 Maria
2 Maria
3 Pandas
4 Coalas
replace can take a nested dictionary that names the column to operate on ("Name" here) and the mapping to apply within that column:
{"Name": {old_1: new_1, old_2: new_2...}}
-> replaces values in the "Name" column so that old_1 becomes new_1, old_2 becomes new_2, and so on.
Thanks to Stephen Rauch for the setup, and to Wen for offering a clean way to create the dictionary.
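
To see the column scoping in action, here is a tiny sketch. The second column, Nick, is made up purely to show that values outside "Name" are left alone.

```python
import pandas as pd

# 'Nick' is a hypothetical extra column holding some of the same strings
df = pd.DataFrame({'Name': ['Alex', 'Pandas'], 'Nick': ['Alex', 'Coala']})

# Only the "Name" column is searched for these keys
d = {'Name': {'Alex': 'Alexs', 'Coala': 'Coalas'}}
out = df.replace(d)
print(out)
```

'Alex' and 'Coala' in the Nick column stay unchanged because the outer key restricts the mapping to "Name".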
