I have two different DataFrames. First, I had to check whether each row of df1 also appears in df2. If it does, a column "isRep" is set to True; otherwise it is False. That gave me df3.
Now I need to add an "idRep" column to df3 that holds the index (generated automatically by pandas) of the matching row in df2.
This is df1:
Index Firstname Name Origine
0 Johnny Depp USA
1 Brad Pitt USA
2 Angelina Pitt USA
This is df2:
Index Firstname Name Origine
0 Kidman Nicole AUS
1 Jean Dujardin FR
2 Brad Pitt USA
I merged them with this code:
df = pd.merge(data, dataRep, on=['Firstname', 'Name', 'Origine'], how='left', indicator='IsRep')
df['IsRep'] = np.where(df.IsRep == 'both', True, False)
After this code, I got the following result, which is my df3 (the same as df1 but with the "isRep" column):
Index Firstname Name Origine isRep
0 Johnny Depp USA False
1 Brad Pitt USA True
2 Angelina Pitt USA False
Now I need another DataFrame with a column named "IdRep" that contains the index of the matching row in df2, as shown below. I don't know how to do that:
Index Firstname Name Origine isRep IdRep
0 Johnny Depp USA False -
1 Brad Pitt USA True 2
2 Angelina Pitt USA False -
A bit of a hack would be to reset_index before you merge. Only reset the index on the right DataFrame.
m = dataRep.rename_axis('IdRep').reset_index()
df = pd.merge(data, m, on=['Firstname', 'Name', 'Origine'], how='left', indicator='IsRep')
df['IsRep'] = np.where(df.IsRep == 'both', True, False)
Firstname Name Origine IdRep IsRep
0 Johnny Depp USA NaN False
1 Brad Pitt USA 2.0 True
2 Angelina Pitt USA NaN False
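For anyone who wants to run this end to end, here is a minimal sketch of the same idea. The sample frames are rebuilt from the question, and the final cast to the nullable "Int64" dtype is my own addition (not part of the answer above) so that the id shows as 2 rather than 2.0.

```python
import pandas as pd

# Sample frames rebuilt from the question; the variable names follow the answer.
data = pd.DataFrame({
    "Firstname": ["Johnny", "Brad", "Angelina"],
    "Name": ["Depp", "Pitt", "Pitt"],
    "Origine": ["USA", "USA", "USA"],
})
dataRep = pd.DataFrame({
    "Firstname": ["Kidman", "Jean", "Brad"],
    "Name": ["Nicole", "Dujardin", "Pitt"],
    "Origine": ["AUS", "FR", "USA"],
})

keys = ["Firstname", "Name", "Origine"]

# Expose the right frame's index as an ordinary column named IdRep.
right = dataRep.rename_axis("IdRep").reset_index()

df3 = data.merge(right, on=keys, how="left", indicator="IsRep")
df3["IsRep"] = df3["IsRep"].eq("both")

# Assumption: a nullable integer column is acceptable; missing ids become <NA>.
df3["IdRep"] = df3["IdRep"].astype("Int64")
print(df3)
```

One thing to keep in mind: if the same key combination occurs several times in dataRep, a left merge repeats the matching row of data once per match.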
Reverse lookup using a dict:
cols = ['Firstname', 'Name', 'Origine']
d = dict(zip(zip(*map(df2.get, cols)), df2.index))
z = [*zip(*map(df1.get, cols))]
df1.assign(
isRep=[*map(d.__contains__, z)],
IdRep=[*map(d.get, z)]
)
Firstname Name Origine isRep IdRep
Index
0 Johnny Depp USA False NaN
1 Brad Pitt USA True 2.0
2 Angelina Pitt USA False NaN
A variation that takes advantage of assign evaluating its arguments in order:
cols = ['Firstname', 'Name', 'Origine']
d = dict(zip(zip(*map(df2.get, cols)), df2.index))
z = [*zip(*map(df1.get, cols))]
df1.assign(
IdRep=[*map(d.get, z)],
isRep=lambda d: d.IdRep.notna()
)
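The zip/map one-liners above are compact but a little dense. Below is a sketch of the same dict lookup written out more explicitly; the sample frames are rebuilt from the question and the names lookup and key_rows are just mine for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Firstname": ["Johnny", "Brad", "Angelina"],
    "Name": ["Depp", "Pitt", "Pitt"],
    "Origine": ["USA", "USA", "USA"],
})
df2 = pd.DataFrame({
    "Firstname": ["Kidman", "Jean", "Brad"],
    "Name": ["Nicole", "Dujardin", "Pitt"],
    "Origine": ["AUS", "FR", "USA"],
})

cols = ["Firstname", "Name", "Origine"]

# Map each (Firstname, Name, Origine) tuple in df2 to its index label.
lookup = dict(zip(df2[cols].itertuples(index=False, name=None), df2.index))

# The same key tuples for df1, in row order.
key_rows = list(df1[cols].itertuples(index=False, name=None))

result = df1.assign(
    isRep=[key in lookup for key in key_rows],
    IdRep=[lookup.get(key) for key in key_rows],  # None where there is no match
)
print(result)
```

If df2 contains the same key combination more than once, the dict keeps the last index it saw for that key.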
I am trying to replace values in one DataFrame column with values from another DataFrame, matching rows on a third column (name), while keeping the remaining values from the first DataFrame.
# df1
country name value
romania john 100
russia emma 200
sua mark 300
china jack 400
# df2
name value
emma 2
mark 3
Desired result:
# df3
country name value
romania john 100
russia emma 2
sua mark 3
china jack 400
Thank you
One approach could be as follows:
Use Series.map on column name and turn df2 into a Series for mapping by setting its index to name (df.set_index).
Next, chain Series.fillna to replace NaN values with original values from df.value (i.e. whenever mapping did not result in a match) and assign to df['value'].
df['value'] = df['name'].map(df2.set_index('name')['value']).fillna(df['value'])
print(df)
country name value
0 romania john 100.0
1 russia emma 2.0
2 sua mark 3.0
3 china jack 400.0
N.B. The result will now contain floats. If you prefer integers, chain .astype(int) as well.
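A self-contained sketch of the map + fillna approach, using the reproducible data from this thread and including the .astype(int) step. It works on a copy (df3) so that df1 stays untouched, and it assumes each name appears at most once in df2.

```python
import pandas as pd

df1 = pd.DataFrame({
    "country": ["romania", "russia", "sua", "china"],
    "name": ["john", "emma", "mark", "jack"],
    "value": [100, 200, 300, 400],
})
df2 = pd.DataFrame({"name": ["emma", "mark"], "value": [2, 3]})

# Series indexed by name, used as the lookup table for map.
replacements = df2.set_index("name")["value"]

df3 = df1.copy()
df3["value"] = (
    df3["name"].map(replacements)   # NaN where the name is not in df2
    .fillna(df3["value"])           # fall back to the original value
    .astype(int)                    # undo the float upcast caused by NaN
)
print(df3)
```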
Another option is to use pandas.DataFrame.update, which modifies df1 in place:
df1.set_index('name', inplace=True)
df1.update(df2.set_index('name'))
df1.reset_index(inplace=True)
name country value
0 john romania 100.0
1 emma russia 2.0
2 mark sua 3.0
3 jack china 400.0
Another option:
df3 = df1.merge(df2, on = 'name', how = 'left')
df3['value'] = df3.value_y.fillna(df3.value_x)
df3.drop(['value_x', 'value_y'], axis = 1, inplace = True)
# country name value
# 0 romania john 100.0
# 1 russia emma 2.0
# 2 sua mark 3.0
# 3 china jack 400.0
Reproducible data:
df1=pd.DataFrame({'country':['romania','russia','sua','china'],'name':['john','emma','mark','jack'],'value':[100,200,300,400]})
df2=pd.DataFrame({'name':['emma','mark'],'value':[2,3]})
DOB Name
0 1956-10-30 Anna
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry
6 1972-05-04 Kate
In a DataFrame similar to the one above, I have duplicate names. I want to add the suffix '_0' to a name if it is duplicated and the DOB is before 1990.
I am expecting a result like this:
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I am using the following
df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0')
But I am getting this result
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 NaN
2 2001-09-09 NaN
3 1993-01-15 NaN
4 1999-05-02 NaN
5 1962-12-17 Jerry_0
6 1972-05-04 NaN
How can I add a suffix to a Name that is duplicated and belongs to someone born before 1990?
The problem with df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0') is that df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))] is a filtered DataFrame with fewer rows than the original. When you assign the result back, pandas aligns on the index, so the rows that were filtered out have no corresponding value and become NaN.
You can try mask instead:
m = (df['DOB'] < '1990-01-01') & df['Name'].duplicated(keep=False)
df['Name'] = df['Name'].mask(m, df['Name']+'_0')
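To make the alignment issue concrete, here is a small sketch that reproduces both behaviours side by side. The frame is rebuilt from the question; I am assuming DOB is a datetime column, and the names partial, broken and fixed are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "DOB": pd.to_datetime([
        "1956-10-30", "1993-03-21", "2001-09-09", "1993-01-15",
        "1999-05-02", "1962-12-17", "1972-05-04",
    ]),
    "Name": ["Anna", "Jerry", "Peter", "Anna", "James", "Jerry", "Kate"],
})

# True only for duplicated names with a DOB before 1990.
m = (df["DOB"] < pd.Timestamp("1990-01-01")) & df["Name"].duplicated(keep=False)

# Filtering first leaves a Series that only has index labels 0 and 5.
partial = df.loc[m, "Name"] + "_0"

# Assigning it to the whole column aligns on the index: other rows get NaN.
broken = df.assign(Name=partial)

# mask keeps the original value wherever the condition is False.
fixed = df.assign(Name=df["Name"].mask(m, df["Name"] + "_0"))

print(broken)
print(fixed)
```

Kate was born before 1990 but her name is not duplicated, so she is left alone in the fixed version.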
You can use masks and boolean indexing:
# is the year before 1990?
m1 = pd.to_datetime(df['DOB']).dt.year.lt(1990)
# is the name duplicated?
m2 = df['Name'].duplicated(keep=False)
# if both conditions are True, add '_0' to the name
df.loc[m1&m2, 'Name'] += '_0'
output:
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I have a dataframe which looks something like this:
dfA
name field country action
Sam elec USA POS
Sam elec USA POS
Sam elec USA NEG
Tommy mech Canada NEG
Tommy mech Canada NEG
Brian IT Spain NEG
Brian IT Spain NEG
Brian IT Spain POS
I want to group the DataFrame by the first 3 columns and add a new column "No_of_data" with the row count of each group. I do that like this:
dfB = dfA.groupby(["name", "field", "country"], dropna=False).size().reset_index(name = "No_of_data")
This gives me a new dataframe which looks something like this:
dfB
name field country No_of_data
Sam elec USA 3
Tommy mech Canada 2
Brian IT Spain 3
Now I also want to add a new column to this DataFrame with the number of "POS" rows for every combination of "name", "field" and "country". It should look something like this:
dfB
name field country No_of_data No_of_POS
Sam elec USA 3 2
Tommy mech Canada 2 0
Brian IT Spain 3 1
How do I add the new column (No_of_POS) to dfB when the "POS"/"NEG" information is not in dfB and has to be taken from dfA?
You can use named aggregation in the agg method (passing a renaming dictionary to a single column's agg raises SpecificationError in pandas 1.0 and later):
dfA.groupby(["name", "field", "country"], as_index=False)\
.agg(No_of_data=('action', 'size'), No_of_POS=('action', lambda x: x.eq('POS').sum()))
You can precompute the boolean before aggregating; performance should be better as the data size increases:
(dfA.assign(action = dfA.action.eq('POS'))
.groupby(['name', 'field', 'country'],
sort = False,
as_index = False)
.agg(no_of_data = ('action', 'size'),
no_of_pos = ('action', 'sum'))
)
name field country no_of_data no_of_pos
0 Sam elec USA 3 2
1 Tommy mech Canada 2 0
2 Brian IT Spain 3 1
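Here is a runnable sketch of the precompute-then-aggregate idea with the sample data from the question. The helper column is_pos is a name I made up, and I kept dropna=False from the question's own groupby call (it needs pandas 1.1 or later).

```python
import pandas as pd

dfA = pd.DataFrame({
    "name": ["Sam"] * 3 + ["Tommy"] * 2 + ["Brian"] * 3,
    "field": ["elec"] * 3 + ["mech"] * 2 + ["IT"] * 3,
    "country": ["USA"] * 3 + ["Canada"] * 2 + ["Spain"] * 3,
    "action": ["POS", "POS", "NEG", "NEG", "NEG", "NEG", "NEG", "POS"],
})

dfB = (
    dfA.assign(is_pos=dfA["action"].eq("POS"))   # boolean helper column
    .groupby(["name", "field", "country"],
             sort=False, dropna=False, as_index=False)
    .agg(
        No_of_data=("is_pos", "size"),   # rows per group
        No_of_POS=("is_pos", "sum"),     # True counts as 1
    )
)
print(dfB)
```

Because sort=False is passed, the groups come out in order of first appearance (Sam, Tommy, Brian) rather than alphabetically.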
You can pass aggregation functions when you group your data. Have a look at the agg() method of the groupby object; it may help here.
I have a DataFrame with 3 columns: ID, BossID and Name. Each row has a unique ID and has a corresponding name. BossID is the ID of the boss of the person in that row. Suppose I have the following DataFrame:
df = pd.DataFrame({'id':[1,2,3,4,5], 'bossId':[np.nan, 1, 2, 2, 3],
'name':['Anne Boe','Ben Coe','Cate Doe','Dan Ewe','Erin Aoi']})
So here, Anne is Ben's boss and Ben Coe is Cate and Dan's boss, etc.
Now, I want to have another column that has the boss's name for each person.
The desired output is:
id boss name boss_name
0 1 NaN Anne NaN
1 2 1.0 Ben Anne
2 3 2.0 Cate Ben
3 4 2.0 Dan Ben
4 5 3.0 Erin Cate
I can get my output using an ugly double for-loop. Is there a cleaner way to obtain the desired output?
This should work:
bossmap = df.set_index('id')['name'].squeeze()
df['boss_name'] = df['bossId'].map(bossmap)
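A self-contained version of the map approach, using the DataFrame from the question (the name id_to_name is just mine for illustration). It assumes the id column is unique, as the question states.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "bossId": [np.nan, 1, 2, 2, 3],
    "name": ["Anne Boe", "Ben Coe", "Cate Doe", "Dan Ewe", "Erin Aoi"],
})

# Series mapping each id to the corresponding name.
id_to_name = df.set_index("id")["name"]

# Look up every bossId in that Series; a missing bossId stays NaN.
df["boss_name"] = df["bossId"].map(id_to_name)
print(df)
```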
You can set id as index, then use pd.Series.reindex
df = df.set_index('id')
df['boss_name'] = df['name'].reindex(df['bossId']).to_numpy() # or .to_list()
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe
Create a separate DataFrame with 'id' and 'name'.
Rename 'name' to 'boss_name' and set 'id' as the index.
Merge df with the new DataFrame.
import numpy as np
import pandas as pd
# test dataframe
df = pd.DataFrame({'id':[1,2,3,4,5], 'bossId':[np.nan, 1, 2, 2, 3], 'name':['Anne Boe','Ben Coe','Cate Doe','Dan Ewe','Erin Aoi']})
# separate dataframe with id and name
names = df[['id', 'name']].dropna().set_index('id').rename(columns={'name': 'boss_name'})
# merge the two
df = df.merge(names, left_on='bossId', right_index=True, how='left')
# df
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe
I am attempting to append two DataFrames using Python Pandas, but I am receiving a null error. How can I resolve this?
Here's the first DataFrame (after I load to Python):
name State
0 Tom NY
1 Lee CA
Here's the second DataFrame (after I load to Python) with no header:
0 1
0 Jon FL
1 Tan NJ
I attempt to append the DataFrames using:
pd.concat([df1,df2])
The result is:
name State 0 1
0 Tom NY NaN NaN
1 Lee CA NaN NaN
0 NaN NaN Jon FL
1 NaN NaN Tan NJ
I want the result to be:
name State
0 Tom NY
1 Lee CA
2 Jon FL
3 Tan NJ
I've made the following attempt, but it doesn't work:
pd.concat([df1,df2], axis=1)
Here is my second unsuccessful attempt:
pd.concat([df1,df2], ignore_index=True)
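Align the column names of the second DataFrame with the first, then append it (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df2.columns = df1.columns
pd.concat([df1, df2]).reset_index(drop=True)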
# Result
name State
0 Tom NY
1 Lee CA
2 Jon FL
3 Tan NJ
Rename the columns of df2 to match df1 and then concat them:
df2.columns = df1.columns
pd.concat([df1, df2], ignore_index=True)
Output:
name State
0 Tom NY
1 Lee CA
2 Jon FL
3 Tan NJ
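If you would rather not overwrite df2.columns in place, a sketch along these lines should give the same result. The frames are rebuilt from the question, and the names renamed and combined are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["Tom", "Lee"], "State": ["NY", "CA"]})
# Loaded without a header, so the columns are labelled 0 and 1.
df2 = pd.DataFrame([["Jon", "FL"], ["Tan", "NJ"]])

# Map df2's positional labels onto df1's column names (returns a new frame).
renamed = df2.rename(columns=dict(zip(df2.columns, df1.columns)))

# concat aligns on column names, so both frames now stack into two columns.
combined = pd.concat([df1, renamed], ignore_index=True)
print(combined)
```

If the second frame comes from a CSV file, passing header=None together with names=['name', 'State'] to pd.read_csv would probably avoid the rename step altogether.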