I have a pandas dataframe like this:
Name Preferred Name
0 Tyler None
1 Rachel None
2 Jason None
3 Jack John
4 Peter None
I'd like to overwrite the observation in the Name field with the Preferred Name field if there is a Preferred Name available, to get:
Name Preferred Name
0 Tyler None
1 Rachel None
2 Jason None
3 John John
4 Peter None
What is the best way to accomplish this?
I have tried to create a dictionary from Name:Preferred Name, and then use the dictionary to overwrite, but it brings over all of the blank values in this case.
Is there any way to apply it to only those rows where Preferred Name is populated?
Thank you
You can do this with boolean indexing and notna:
df.loc[df.PreferredName.notna(), 'Name'] = df.PreferredName
print(df)
Name PreferredName
0 Tyler NaN
1 Rachel NaN
2 Jason NaN
3 John John
4 Peter NaN
Related
I have the following toy dataset df:
import pandas as pd
data = {
'id' : [1, 2, 3],
'name' : ['John Smith', 'Sally Jones', 'William Lee']
}
df = pd.DataFrame(data)
df
id name
0 1 John Smith
1 2 Sally Jones
2 3 William Lee
My ultimate goal is to add a column that represents a Google search of the value in the name column.
I do this using:
def create_hyperlink(search_string):
return f'https://www.google.com/search?q={search_string}'
df['google_search'] = df['name'].apply(create_hyperlink)
df
id name google_search
0 1 John Smith https://www.google.com/search?q=John Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones
2 3 William Lee https://www.google.com/search?q=William Lee
Unfortunately, newly created google_search column is returning a malformed URL. The URL should have a "+" between the first name and last name.
The google_search column should return the following:
https://www.google.com/search?q=John+Smith
It's possible to do this using split() and join().
foo = df['name'].str.split()
foo
0 [John, Smith]
1 [Sally, Jones]
2 [William, Lee]
Name: name, dtype: object
Now, joining them:
df['bar'] = ['+'.join(map(str, l)) for l in df['foo']]
df
id name google_search foo bar
0 1 John Smith https://www.google.com/search?q=John Smith [John, Smith] John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones [Sally, Jones] Sally+Jones
2 3 William Lee https://www.google.com/search?q=William Lee [William, Lee] William+Lee
Lastly, creating the updated google_search column:
df['google_search'] = df['bar'].apply(create_hyperlink)
df
Is there a more elegant, streamlined, Pythonic way to do this?
Thanks!
Rather than reinvent the wheel and modify your string manually, use a library that's guaranteed to give you the right result :
from urllib.parse import quote_plus
def create_hyperlink(search_string):
return f"https://www.google.com/search?q={quote_plus(search_string)}"
Use Series.str.replace:
df['google_search'] = 'https://www.google.com/search?q=' + \
df.name.str.replace(' ','+')
print(df)
id name google_search
0 1 John Smith https://www.google.com/search?q=John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally+Jones
2 3 William Lee https://www.google.com/search?q=William+Lee
DOB Name
0 1956-10-30 Anna
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry
6 1972-05-04 Kate
In the dataframe similar to the one above where I have duplicate names. So I am want to add a suffix '_0' to the name if DOB is before 1990 and a duplicate name.
I am expecting a result like this
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I am using the following
df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0')
But I am getting this result
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 NaN
2 2001-09-09 NaN
3 1993-01-15 NaN
4 1999-05-02 NaN
5 1962-12-17 Jerry_0
6 1972-05-04 NaN
How can I add a suffix to the Name which is a duplicate and have to be born before 1990.
Problem in your df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0') is that df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))] is a filtered dataframe whose rows are less than the original. When you assign it back, the not filtered rows doesn't have corresponding value in the filtered dataframe, so it becomes NaN.
You can try mask instead
m = (df['DOB'] < '1990-01-01') & df['Name'].duplicated(keep=False)
df['Name'] = df['Name'].mask(m, df['Name']+'_0')
You can use masks and boolean indexing:
# is the year before 1990?
m1 = pd.to_datetime(df['DOB']).dt.year.lt(1990)
# is the name duplicated?
m2 = df['Name'].duplicated(keep=False)
# if both conditions are True, add '_0' to the name
df.loc[m1&m2, 'Name'] += '_0'
output:
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I have a pandas dataframe that includes a "Name" column. Strings in the Name column may contain "Joe", "Bob", or "Joe Bob". I want to add a column for the type of person: just Joe, just Bob, or Both.
I was able to do this by creating boolean columns, turning them into strings, combining the strings, and then replacing the values. It just...didn't feel very elegant! I am new to Python...is there a better way to do this?
My original dataframe:
df = pd.DataFrame(data= [['Joe Biden'],['Bobby Kennedy'],['Joe Bob Briggs']], columns = ['Name'])
0
Name
1
Joe Biden
2
Bobby Kennedy
3
Joe Bob Briggs
I added two boolean columns to find names:
df['Joe'] = df.Name.str.contains('Joe')
df['Joe'] = df.Joe.astype('int')
df['Bob'] = df.Name.str.contains('Bob')
df['Bob'] = df.Bob.astype('int')
Now my dataframe looks like this:
df = pd.DataFrame(data= [['Joe Biden',1,0],['Bobby Kennedy',0,1],['Joe Bob Briggs',1,1]], columns = ['Name','Joe', 'Bob'])
0
Name
Joe
Bob
1
Joe Biden
1
0
2
Bobby Kennedy
0
1
3
Joe Bob Briggs
1
1
But what I really want is one "Type" column with categorical values: Joe, Bob, or Both.
To do that, I added a column to combine the booleans, then I replaced the values:
df["Type"] = df["Joe"].astype(str) + df["Bob"].astype(str)
0
Name
Joe
Bob
Type
1
Joe Biden
1
0
10
2
Bobby Kennedy
0
1
1
3
Joe Bob Briggs
1
1
11
df['Type'] = df.Type.astype('str') df['Type'].replace({'11': 'Both', '10': 'Joe','1': 'Bob'}, inplace=True)
0
Name
Joe
Bob
Type
1
Joe Biden
1
0
Joe
2
Bobby Kennedy
0
1
Bob
3
Joe Bob Briggs
1
1
Both
This feels clunky. Anyone have a better way?
Thanks!
You can use np.select to create the column Type.
You need to ordered correctly your condlist from the most precise to the widest.
df['Type'] = np.select([df['Name'].str.contains('Joe') & df['Name'].str.contains('Bob'),
df['Name'].str.contains('Joe'),
df['Name'].str.contains('Bob')],
choicelist=['Both', 'Joe', 'Bob'])
Output:
>>> df
Name Type
0 Joe Biden Joe
1 Bobby Kennedy Bob
2 Joe Bob Briggs Both
I have two df's, one for user names and another for real names. I'd like to know how I can check if I have a real name in my first df using the data of the other, and then replace it.
For example:
import pandas as pd
df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})
df1
userName
0 peterKing
1 john
2 joe545
3 mary
df2
realName
0 alice
1 peter
2 john
3 francis
4 joe
5 carol
My code should replace 'peterKing' and 'joe545' since these names appear in my df2. I tried using pd.contains, but I can only verify if a name appears or not.
The output should be like this:
userName
0 peter
1 john
2 joe
3 mary
Can someone help me with that? Thanks in advance!
You can use loc[row, colum], here you can see the documentation about loc method. And Series.str.contain method to select the usernames you need to replace with the real names. In my opinion, this solution is clear in terms of readability.
for real_name in df2['realName'].to_list():
df1.loc[ df1['userName'].str.contains(real_name), 'userName' ] = real_name
Output:
userName
0 peter
1 john
2 joe
3 mary
I have two dataframes with the same columns that I need to combine:
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
and
first_name last_name
0 Billy Bonder
1 Brian Black
2 Bran Balwner
When I do this:
df_new = pd.concat([df1, df1])
I get this:
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
0 Billy Bonder
1 Brian Black
2 Bran Balwner
Is there a way to have the left column have a unique number like this?
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
3 Billy Bonder
4 Brian Black
5 Bran Balwner
If not, how can I add a new key column with numbers from 1 to whatever the row count is?
As said earlier by #MaxU you can use ignore_index=True.
If you want to keep the index of your first table you can use the parameter ignore_index=True after the [dataframe1, dataframe2].
You can check if the indexes are being repeated with the paremeter verify_integrity=True it will return a boolean (you never know when you'll have to check.
But be careful because this procedure can be a little slow depending on the size of you Dataframe