I have df1
df1 = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
df1 = pd.DataFrame(df1)
and I have another df2
df2 = {'Name':['krish', 'jack','Tom', 'nick',]}
df2 = pd.DataFrame(df2)
df2['Name'] contains exactly the same values as df1['Name'], but in a different order.
I want to fill df2['Age'] based on df1.
If I use df2['Age'] = df1['Age'], the values are filled in but they are wrong, because the assignment aligns by index position rather than by Name.
How can I map those values from df1 onto df2 correctly?
Thank you
Use:
df2 = df2.merge(df1,on='Name')
df2
Name Age
0 krish 19
1 jack 18
2 Tom 20
3 nick 21
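If you prefer to fill the Age column in place and keep df2's row order and columns untouched, a map-based sketch of my own (not part of the answer above) also works:

# Build a Name -> Age lookup Series from df1 and map it onto df2['Name']
df2['Age'] = df2['Name'].map(df1.set_index('Name')['Age'])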
Set Name as index and reindex based on df2:
df1.set_index('Name').reindex(df2.Name).reset_index()
Name Age
0 krish 19
1 jack 18
2 Tom 20
3 nick 21
Or, for better performance, we can use pd.Categorical here:
df1['Name'] = pd.Categorical(df1.Name, categories=df2.Name)
df1.sort_values('Name', inplace=True)
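The Categorical answer ends with df1 sorted into df2's order; here is a minimal runnable sketch assembled from the snippets above (my addition) that also copies the ages across:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]})
df2 = pd.DataFrame({'Name': ['krish', 'jack', 'Tom', 'nick']})

# Making Name categorical with df2's Name as the category order means
# sorting df1 by Name puts its rows into df2's order
df1['Name'] = pd.Categorical(df1.Name, categories=df2.Name)
df1 = df1.sort_values('Name').reset_index(drop=True)

df2['Age'] = df1['Age']  # positions now line up
print(df2)
#     Name  Age
# 0  krish   19
# 1   jack   18
# 2    Tom   20
# 3   nick   21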
I have the following dataframe containing scores for a competition, as well as a column that counts the entry number for each person.
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df
Then I have another table that stores data on the maximum number of entries that each person can have:
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2
I am trying to drop rows from df where the entry number is greater than that person's Limit in df2, so that each person keeps at most their allowed number of entries.
If there are any ideas on how to help me achieve this, that would be fantastic! Thanks
You can use pandas.merge to create another dataframe and drop rows by your condition:
df3 = pd.merge(df, df2, on="Name", how="left")
df3[df3["Entry_No"] <= df3["Limit"]][df.columns].reset_index(drop=True)
Name Score Entry_No
0 John 10 1
1 Jim 8 1
2 John 9 2
3 Jim 3 2
4 Jim 0 3
5 Jack 5 1
I used how="left" to keep the order of df and reset_index(drop=True) to reset the index of the resulting dataframe.
You could join the 2 dataframes, and then drop with a condition:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2 = df2.set_index('Name')
df = df.join(df2, on='Name')
df.drop(df[df.Entry_No>df.Limit].index, inplace = True)
which gives the expected output.
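After the join, df still carries the Limit column; if you want only the original columns with a clean index, a small optional tidy-up (my addition) is:

# Optional: drop the helper column and renumber the rows
df = df.drop(columns='Limit').reset_index(drop=True)
print(df)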
I have two dataframes:
df = pd.DataFrame({'Ages':[20, 22, 57], 'Label':[1,1,2]})
label_df = pd.DataFrame({'Label':[1,2,3], 'Description':['Young','Old','Very Old']})
I want to replace the Label values in df with the descriptions from label_df.
Wanted result:
df = pd.DataFrame({'Ages':[20, 22, 57], 'Label':['Young','Young','Old']})
Use Series.map with a Series built from label_df:
df['Label'] = df['Label'].map(label_df.set_index('Label')['Description'])
print (df)
Ages Label
0 20 Young
1 22 Young
2 57 Old
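Series.map also accepts a plain dict, so an equivalent sketch (my own variant) avoids the set_index step:

# Build a {label: description} dict and map it onto the Label column
label_map = dict(zip(label_df['Label'], label_df['Description']))
df['Label'] = df['Label'].map(label_map)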
Simply use merge:
df['Label'] = df.merge(label_df,on='Label')['Description']
Ages Label
0 20 Young
1 22 Young
2 57 Old
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
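One caveat with the merge one-liner (my note, not from the answer above): merge returns a result with a fresh 0..n-1 index, and the assignment back to df['Label'] aligns on the index, so it only works cleanly when df itself has a default RangeIndex. A variant that sidesteps index alignment:

# how='left' keeps every row of df; .to_numpy() assigns by position, not by index
df['Label'] = df.merge(label_df, on='Label', how='left')['Description'].to_numpy()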
I would like to combine two columns (column 1 + column 2), row by row. Unfortunately, it didn't work for me. How do I solve this?
import pandas as pd
import numpy as np
d = {'Nameid': [1, 2, 3, 1], 'Name': ['Michael', 'Max', 'Susan', 'Michael'], 'Project': ['S455', 'G874', 'B7445', 'Z874']}
df = pd.DataFrame(data=d)
display(df.head(10))
df['Dataframe']='df'
d2 = {'Nameid': [4, 2, 5, 1], 'Name': ['Petrova', 'Michael', 'Mike', 'Gandalf'], 'Project': ['Z845', 'Q985', 'P512', 'Y541']}
df2 = pd.DataFrame(data=d2)
display(df2.head(10))
df2['Dataframe']='df2'
What I tried
df_merged = pd.concat([df,df2])
df_merged.head(10)
df3 = pd.concat([df,df2])
df3['unique_string'] = df['Nameid'].astype(str) + df['Dataframe'].astype(str)
df3.head(10)
As you can see, it didn't combine every row correctly; it looks like only the values from the first dataframe were used for all rows. How can I combine the two columns row by row?
What I want is a unique_string column built from Nameid and Dataframe for every row (1df, 2df, ..., 4df2, 2df2, ...).
You can simply concatenate the strings like this (you don't need df['Dataframe'].astype(str), because that column already holds strings):
In [363]: df_merged['unique_string'] = df_merged.Nameid.astype(str) + df_merged.Dataframe
In [365]: df_merged
Out[365]:
Nameid Name Project Dataframe unique_string
0 1 Michael S455 df 1df
1 2 Max G874 df 2df
2 3 Susan B7445 df 3df
3 1 Michael Z874 df 1df
0 4 Petrova Z845 df2 4df2
1 2 Michael Q985 df2 2df2
2 5 Mike P512 df2 5df2
3 1 Gandalf Y541 df2 1df2
Make sure you use df3 (not df) and assign back to df3; also call reset_index first, because pd.concat keeps the original indices:
df3 = df3.reset_index()
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe'].astype(str)
Use df3 instead of df; also pass ignore_index=True so that a default index is added:
df3 = pd.concat([df,df2], ignore_index=True)
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe']
print (df3)
Nameid Name Project Dataframe unique_string
0 1 Michael S455 df 1df
1 2 Max G874 df 2df
2 3 Susan B7445 df 3df
3 1 Michael Z874 df 1df
4 4 Petrova Z845 df2 4df2
5 2 Michael Q985 df2 2df2
6 5 Mike P512 df2 5df2
7 1 Gandalf Y541 df2 1df2
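For completeness, the same column can be built with Series.str.cat instead of + (a small variant of my own):

# str.cat concatenates element-wise, just like the + above
df3['unique_string'] = df3['Nameid'].astype(str).str.cat(df3['Dataframe'])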
Apologies for the crappy title...
Say I have two pandas dataframes concerning field sampling locations. DF1 contains sample ID, coordinates, year of recording etc. DF2 contains a meteorological variable, with values provided per year as columns:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(data={'ID': [10, 20, 30], 'YEAR': [1980, 1981, 1991]}, index=[1, 2, 3])
df2 = pd.DataFrame(data=np.random.randint(0, 100, size=(3, 11)),
                   columns=['year_{0}'.format(x) for x in range(1980, 1991)],
                   index=[10, 20, 30])
print(df1)
> ID YEAR
1 10 1980
2 20 1981
3 30 1991
print(df2)
> year_1980 year_1981 ... year_1990
10 48 61 ... 53
20 68 69 ... 21
30 76 37 ... 70
Note how the plot IDs from DF1 correspond to DF2.index, and how DF1's sampling years extend beyond the coverage of DF2. I'd like to add, as a new column to DF1, the value from DF2 that corresponds to each row's ID and YEAR. What I have so far is:
def grab(df, plot_id, yr):
    try:
        out = df.loc[plot_id, 'year_{}'.format(yr)]
    except KeyError:
        out = -99
    return out

df1['meteo_val'] = df1.apply(lambda row: grab(df2, row['ID'], row['YEAR']), axis=1)
print(df1)
> ID YEAR meteo_val
1 10 1980 48
2 20 1981 69
3 30 1991 -99
This works, but it seems to take an awfully long time to compute. Is there a smarter, quicker approach to solving this? Any suggestions?
SetUp
np.random.seed(0)
df1 = pd.DataFrame(data = {'ID': [10, 20, 30], 'YEAR': [1980, 1981, 1991]}, index=[1,2,3])
df2 = pd.DataFrame(data= np.random.randint(0,100,size=(3, 11)),
columns=['year_{0}'.format(x) for x in range(1980, 1991)],
index=[10, 20, 30])
Solution with DataFrame.lookup:
mapper = df1.assign(YEAR = ('year_' + df1['YEAR'].astype(str)))
c2 = mapper['ID'].isin(df2.index)
c1 = mapper['YEAR'].isin(df2.columns)
mapper = mapper.loc[c1 & c2]
df1.loc[c1 & c2, 'meteo_val'] = df2.lookup(mapper['ID'], mapper['YEAR'])
df1['meteo_val'] = df1['meteo_val'].fillna(-99)
ID YEAR meteo_val
1 10 1980 44.0
2 20 1981 88.0
3 30 1991 -99.0
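DataFrame.lookup has been deprecated and removed in newer pandas releases; if it is not available, the lookup call above can be replaced with positional indexing (my own substitution, reusing mapper, c1 and c2 from above):

# Translate the labels in mapper into positions and index the underlying array
row_pos = df2.index.get_indexer(mapper['ID'])
col_pos = df2.columns.get_indexer(mapper['YEAR'])
df1.loc[c1 & c2, 'meteo_val'] = df2.to_numpy()[row_pos, col_pos]
df1['meteo_val'] = df1['meteo_val'].fillna(-99)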
Alternative with DataFrame.join and DataFrame.stack
df1 = df1.join(df2.set_axis(df2.columns.str.split('_').str[1].astype(int),
axis=1).stack().rename('meteo_val'),
on = ['ID', 'YEAR'], how='left').fillna(-99)
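If the chained expression is hard to follow, the intermediate lookup table it builds can be inspected on its own (just the inner part of the answer above, pulled out for readability):

# Reshape df2 into a long Series indexed by (ID, year), one value per combination
long_df2 = (df2.set_axis(df2.columns.str.split('_').str[1].astype(int), axis=1)
               .stack()
               .rename('meteo_val'))
print(long_df2.head())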
This is my first question on this forum, sorry if my English is not so good!
I want to add a row to a DataFrame only if a specific column doesn't already contain a specific value. Let's say I write this:
df = pd.DataFrame([['Mark', 9], ['Laura', 22]], columns=['Name', 'Age'])
new_friend = pd.DataFrame([['Alex', 23]], columns=['Name', 'Age'])
df = df.append(new_friend, ignore_index=True)
print(df)
Name Age
0 Mark 9
1 Laura 22
2 Alex 23
Now I want to add another friend, but I want to make sure I don't already have a friend with the same name. Here is what I'm actually doing:
new_friend = pd.DataFrame([['Mark', 16]], columns=['Name', 'Age'])
df = df.append(new_friend, ignore_index=True)
print(df)
Name Age
0 Mark 9
1 Laura 22
2 Alex 23
3 Mark 16
then :
df = df.drop_duplicates(subset='Name', keep='first')
df = df.reset_index(drop=True)
print(df)
Name Age
0 Mark 9
1 Laura 22
2 Alex 23
Is there another way of doing this, something like:
if name in column 'Name':
    don't add friend
else:
    add friend
Thank you!
if 'Mark' in list(df['Name']):
    print('Mark already in DF')
else:
    print('Mark not in DF')
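Putting the check and the add together, a minimal sketch of my own (assuming Name should stay unique; pd.concat is used because DataFrame.append was removed in pandas 2.x):

new_friend = pd.DataFrame([['Mark', 16]], columns=['Name', 'Age'])
name = new_friend.loc[0, 'Name']

if name in df['Name'].values:
    print(f'{name} already in df, not added')
else:
    # append only when the name is not present yet
    df = pd.concat([df, new_friend], ignore_index=True)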