This is my first question on this forum, so apologies if my English is not great!
I want to add a row to a DataFrame only if a specific column doesn't already contain a specific value. Let's say I write this:
df = pd.DataFrame([['Mark', 9], ['Laura', 22]], columns=['Name', 'Age'])
new_friend = pd.DataFrame([['Alex', 23]], columns=['Name', 'Age'])
df = df.append(new_friend, ignore_index=True)
print(df)
    Name  Age
0   Mark    9
1  Laura   22
2   Alex   23
Now I want to add another friend, but I want to make sure I don't already have a friend with the same name. Here is what I'm currently doing:
new_friend = pd.DataFrame([['Mark', 16]], columns=['Name', 'Age'])
df = df.append(new_friend, ignore_index=True)
print(df)
    Name  Age
0   Mark    9
1  Laura   22
2   Alex   23
3   Mark   16
Then:
df = df.drop_duplicates(subset='Name', keep='first')
df = df.reset_index(drop=True)
print(df)
    Name  Age
0   Mark    9
1  Laura   22
2   Alex   23
Is there another way of doing this, something like:
if name in column 'Name':
    don't add friend
else:
    add friend
Thank you!
You can check whether the name is already present before appending:
if 'Mark' in list(df['Name']):
    print('Mark already in DF')
else:
    print('Mark not in DF')
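If it helps, here is a minimal sketch of my own of the whole conditional append (using pd.concat, since DataFrame.append was removed in pandas 2.0; the isin mask is my addition, not from the question):

import pandas as pd

df = pd.DataFrame([['Mark', 9], ['Laura', 22]], columns=['Name', 'Age'])
new_friend = pd.DataFrame([['Mark', 16]], columns=['Name', 'Age'])

# keep only the new rows whose Name is not already in df
mask = ~new_friend['Name'].isin(df['Name'])
df = pd.concat([df, new_friend[mask]], ignore_index=True)
print(df)  # the Mark/16 row is skipped because Mark is already present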
I have the following dataframe containing scores for a competition as well as a column that counts what number entry for each person.
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John', 'Jim', 'John', 'Jim', 'John',
                            'Jim', 'John', 'Jim', 'Jack', 'Jack', 'Jack', 'Jack'],
                   'Score': [10, 8, 9, 3, 5, 0, 1, 2, 3, 4, 5, 6, 8, 9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df
Then I have another table that stores data on the maximum number of entries that each person can have:
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2
I am trying to drop the rows from df where the entry number is greater than that person's Limit in df2, so that my expected output is this:
   Name  Score  Entry_No
0  John     10         1
1   Jim      8         1
2  John      9         2
3   Jim      3         2
4   Jim      0         3
5  Jack      5         1
If there are any ideas on how to help me achieve this, that would be fantastic! Thanks
You can use pandas.merge to create another dataframe and drop rows by your condition:
df3 = pd.merge(df, df2, on="Name", how="left")
df3[df3["Entry_No"] <= df3["Limit"]][df.columns].reset_index(drop=True)
   Name  Score  Entry_No
0  John     10         1
1   Jim      8         1
2  John      9         2
3   Jim      3         2
4   Jim      0         3
5  Jack      5         1
I used how="left" to keep the order of df and reset_index(drop=True) to reset the index of the resulting dataframe.
You could join the 2 dataframes, and then drop with a condition:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John', 'Jim', 'John', 'Jim', 'John',
                            'Jim', 'John', 'Jim', 'Jack', 'Jack', 'Jack', 'Jack'],
                   'Score': [10, 8, 9, 3, 5, 0, 1, 2, 3, 4, 5, 6, 8, 9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2 = df2.set_index('Name')
df = df.join(df2, on='Name')
df.drop(df[df.Entry_No > df.Limit].index, inplace=True)
This gives the expected output.
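As a related sketch of my own (not from either answer above), you can also map each Name to its Limit instead of merging, assuming df and df2 as originally constructed:

# build a Name -> Limit lookup and keep only rows within each person's limit
limits = df2.set_index('Name')['Limit']
result = df[df['Entry_No'] <= df['Name'].map(limits)].reset_index(drop=True)
print(result)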
I'm trying to populate the 'Selected' column with the value from the 'Name' column when the conditions on 'Name' and 'Age' are true, and leave it as an empty string otherwise. However, the program doesn't seem to read the if condition; it always falls through to the result inside 'else'.
import pandas as pd
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'],
        'Age': ['20', '35', '43', '50'],
        'Selected': ' '}
df = pd.DataFrame(data)
df['Selected'] = df['Selected'].apply(lambda x: df['Name'] if ((df['Name']).any() == 'Tom') &
                                      (df['Age'].any() < 25) else ' ')
print(df)
Here's the output of above code:
     Name Age Selected
0     Tom  20
1  Joseph  35
2   Krish  43
3    John  50
whereas I'm expecting to see Tom in the Selected column for that row, because Tom meets the conditions for both 'Name' and 'Age' (Tom's age 20 < 25).
Any helps are appreciated, thanks!
Using the np.where function:
import numpy as np

# Age holds strings in the example data, so cast to int before the numeric comparison
condition = (df.Name == 'Tom') & (df.Age.astype(int).lt(25))
df['col'] = np.where(condition, df.Name, df.Selected)
print(df)
     Name Age Selected  col
0     Tom  20           Tom
1  Joseph  35
2   Krish  43
3    John  50
Or using the apply method:
df.apply(lambda row: row.Name if ((row.Name == 'Tom') &
                                  (int(row.Age) < 25)) else row.Selected, axis=1)
When using lambda functions with apply, you can access each row's column values by name, if I understood correctly that you are trying to check the conditions row by row:
df['Selected'] = df.apply(lambda x: x.Name if (x.Name == 'Tom') & (int(x.Age) < 25) else '', axis=1)
This will return:
     Name Age Selected
0     Tom  20      Tom
1  Joseph  35
2   Krish  43
3    John  50
Your current solution evaluates the conditions on the whole columns instead of on each row, which is why every row falls into the else branch.
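If it helps, here is a minimal vectorized sketch of my own (not from the answers above) that avoids apply entirely:

import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Joseph', 'Krish', 'John'],
                   'Age': ['20', '35', '43', '50'],
                   'Selected': ''})

# Age is stored as strings, so convert before the numeric comparison
mask = (df['Name'] == 'Tom') & (df['Age'].astype(int) < 25)
df.loc[mask, 'Selected'] = df.loc[mask, 'Name']
print(df)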
I have 2 DataFrames:
df = pd.DataFrame({'Ages':[20, 22, 57], 'Label':[1,1,2]})
label_df = pd.DataFrame({'Label':[1,2,3], 'Description':['Young','Old','Very Old']})
I want to replace the Label values in df with the descriptions in label_df.
Wanted result:
df = pd.DataFrame({'Ages':[20, 22, 57], 'Label':['Young','Young','Old']})
Use Series.map with a Series built from label_df:
df['Label'] = df['Label'].map(label_df.set_index('Label')['Description'])
print (df)
   Ages  Label
0    20  Young
1    22  Young
2    57    Old
Or simply use merge:
df['Label'] = df.merge(label_df, on='Label')['Description']
   Ages  Label
0    20  Young
1    22  Young
2    57    Old
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
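Another sketch of my own (not from the answers above), assuming the original df and label_df: Series.replace with a dict built from label_df also handles this kind of lookup:

# build a {Label: Description} mapping and replace the codes in place
mapping = dict(zip(label_df['Label'], label_df['Description']))
df['Label'] = df['Label'].replace(mapping)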
I have a df similar to the one below:
     name  age sex
1    john   12   m
2    mary   13   f
3  joseph   12   m
4   maria   14   f
How can I make a new column based on the index? For example, for index 1 and 2 I want them to have the label cluster1, and for index 3 and 4 I want them labeled cluster2, like so:
     name  age sex     label
1    john   12   m  cluster1
2    mary   13   f  cluster1
3  joseph   12   m  cluster2
4   maria   14   f  cluster2
Should I use something like (df.index.isin([1, 2])) == 'cluster1'? I think it's not possible to do df['target'] = (df.index.isin([1, 2])) == 'cluster1', assuming that the label column doesn't exist in the beginning.
I think this is what you are looking for? You can use lists for different clusters to make your labels arbitrary in this way.
import pandas as pd
data = {'name':['bob','sue','mary','steve'], 'age':[11, 23, 53, 44]}
df = pd.DataFrame(data)
print(df)
df['label'] = 0
cluster1 = [0, 3]
cluster2 = [1, 2]
df.loc[cluster1, 'label'] = 1
df.loc[cluster2, 'label'] = 2
#another way
#df.iloc[cluster1, df.columns.get_loc('label')] = 1
#df.iloc[cluster2, df.columns.get_loc('label')] = 2
print(df)
output:
    name  age
0    bob   11
1    sue   23
2   mary   53
3  steve   44

    name  age  label
0    bob   11      1
1    sue   23      2
2   mary   53      2
3  steve   44      1
You can let the initial column value be anything. So you can either have it be one of the cluster values (so that you only have to set the other cluster manually instead of both), or you can have it be None so you can easily check after assigning the labels that you didn't miss any rows.
If the assignment to clusters is truly arbitrary I don't think you'll be able to automate it much more than this.
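As an extra sketch of my own (not part of this answer), if you keep the cluster membership in a dict keyed by index, Index.map does the assignment in one step for the same df as above:

# hypothetical mapping from row index to cluster label
cluster_map = {0: 'cluster1', 3: 'cluster1', 1: 'cluster2', 2: 'cluster2'}
df['label'] = df.index.map(cluster_map)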
Is this the solution you are looking for? I doubled the data so you can try different sequences. If you call create_label(df, 3) instead of 2, it will label the rows in groups of 3, so the solution is parametric.
import pandas as pd
df = pd.DataFrame({'name': ['john', 'mary', 'joseph', 'maria', 'john', 'mary', 'joseph', 'maria'],
                   'age': [12, 13, 12, 14, 12, 13, 12, 14],
                   'sex': ['m', 'f', 'm', 'f', 'm', 'f', 'm', 'f']})
df.index = df.index + 1
df['label'] = ''

def create_label(data, each_row):
    i = 0
    j = 1
    while i < len(data):
        # assign by position to avoid chained-assignment warnings
        data.iloc[i:i + each_row, data.columns.get_loc('label')] = 'label' + str(j)
        i += each_row
        j += 1
    return data

df_new = create_label(df, 2)
For a small DataFrame or dataset you can use the code below:
# give the Series the same index as df so the values align as intended
Label = pd.Series(['cluster1', 'cluster1', 'cluster2', 'cluster2'], index=df.index)
df['label'] = Label
You can use a for loop and a list to build a new column with the desired data:
import pandas as pd
df = pd.read_csv("dataset.csv")
list1 = []
for i in range(len(df.name)):
    if i < 2:
        list1.append('cluster1')
    else:
        list1.append('cluster2')
label = pd.Series(list1)
df['label'] = label
You can simply use iloc and assign the values for the columns:
import pandas as pd
df = pd.read_csv('test.txt', sep=r'\+', engine="python")
df["label"] = ""  # adds an empty "label" column
# assign through the frame itself; chained df["label"].iloc[...] = ... triggers a SettingWithCopyWarning
df.iloc[0:2, df.columns.get_loc("label")] = "cluster1"
df.iloc[2:4, df.columns.get_loc("label")] = "cluster2"
Since the values do not follow a certain order, as per your comments, you'd have to assign each "cluster" value manually.
I have df1
df1 = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
df1 = pd.DataFrame(df1)
and I have another df2
df2 = {'Name':['krish', 'jack','Tom', 'nick',]}
df2 = pd.DataFrame(df2)
df2['Name'] contains exactly the same names as df1['Name']; however, they are in a different order.
I want to fill df2['Age'] based on df1.
If I use df2['Age'] = df1['Age'], the values are filled in but they are wrong, because they are copied by index position rather than matched by name.
How to map those values on df2 from df1 correctly?
Thank you
Use:
df2 = df2.merge(df1,on='Name')
df2
    Name  Age
0  krish   19
1   jack   18
2    Tom   20
3   nick   21
Set Name as index and reindex based on df2:
df1.set_index('Name').reindex(df2.Name).reset_index()
    Name  Age
0  krish   19
1   jack   18
2    Tom   20
3   nick   21
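Another way (a sketch of my own, not from the answers above) is to write the ages straight into df2 with Series.map, mirroring the map approach shown earlier:

# look up each name's age in df1 and assign it onto df2
df2['Age'] = df2['Name'].map(df1.set_index('Name')['Age'])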
Or for a better performance, we can use pd.Categorical here:
df1['Name'] = pd.Categorical(df1.Name, categories=df2.Name)
df1.sort_values('Name', inplace=True)
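The answer stops here; presumably the remaining step is to copy the now-aligned ages over to df2, for example:

# df1 is now sorted into df2's name order, so the ages line up positionally
df2['Age'] = df1['Age'].to_numpy()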