I have a dataframe with one column of last names, and one column of first names. How do I merge these columns so that I have one column with first and last names?
Here is what I have:
First Name (Column 1)
John
Lisa
Jim
Last Name (Column 2)
Smith
Brown
Dandy
This is what I want:
Full Name
John Smith
Lisa Brown
Jim Dandy.
Thank you!
Try
df.assign(name = df.apply(' '.join, axis = 1)).drop(['first name', 'last name'], axis = 1)
You get
name
0 bob smith
1 john smith
2 bill smith
Here's a sample df:
df
first name last name
0 bob smith
1 john smith
2 bill smith
You can do the following to combine columns:
df['combined']= df['first name'] + ' ' + df['last name']
df
first name last name combined
0 bob smith bob smith
1 john smith john smith
2 bill smith bill smith
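One caveat with the `+` approach: if either column has a missing value, the combined value becomes NaN too. A minimal sketch of one way around that, assuming the same column names as the sample above (the missing last name in the third row is made up for illustration):

```python
import pandas as pd

# Hypothetical sample; the missing last name is added for illustration only.
df = pd.DataFrame({
    "first name": ["bob", "john", "bill"],
    "last name": ["smith", "smith", None],
})

# str.cat joins the two columns element-wise; na_rep substitutes an empty
# string for missing values, and strip() removes the leftover separator.
df["combined"] = (
    df["first name"].str.cat(df["last name"], sep=" ", na_rep="").str.strip()
)
```

Whether a missing name should become an empty string or stay NaN depends on your data, so treat `na_rep=""` as one possible choice.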
Let's say I have a pandas dataframe that looks like this:
import pandas as pd
data = {'name': ['Tom, Jeffrey, Henry', 'Nick, James', 'Chris', 'David, Oscar']}
df = pd.DataFrame(data)
df
name
0 Tom, Jeffrey, Henry
1 Nick, James
2 Chris
3 David, Oscar
I know I can split the names into separate columns using the comma as separator, like so:
df[["name1", "name2", "name3"]] = df["name"].str.split(", ", expand=True)
df
name name1 name2 name3
0 Tom, Jeffrey, Henry Tom Jeffrey Henry
1 Nick, James Nick James None
2 Chris Chris None None
3 David, Oscar David Oscar None
However, if the name column contains a row with 4 names, as below, the code above raises ValueError: Columns must be same length as key
data = {'name': ['Tom, Jeffrey, Henry', 'Nick, James', 'Chris', 'David, Oscar', 'Jim, Jones, William, Oliver']}
# Create DataFrame
df = pd.DataFrame(data)
df
name
0 Tom, Jeffrey, Henry
1 Nick, James
2 Chris
3 David, Oscar
4 Jim, Jones, William, Oliver
How can I automatically split the name column into n separate columns based on the ',' separator? The desired output would be this:
name name1 name2 name3 name4
0 Tom, Jeffrey, Henry Tom Jeffrey Henry None
1 Nick, James Nick James None None
2 Chris Chris None None None
3 David, Oscar David Oscar None None
4 Jim, Jones, William, Oliver Jim Jones William Oliver
Use DataFrame.join to attach the split result as new columns, with rename to generate the new column names:
f = lambda x: f'name{x+1}'
df = df.join(df["name"].str.split(", ", expand=True).rename(columns=f))
print (df)
name name1 name2 name3 name4
0 Tom, Jeffrey, Henry Tom Jeffrey Henry None
1 Nick, James Nick James None None
2 Chris Chris None None None
3 David, Oscar David Oscar None None
4 Jim, Jones, William, Oliver Jim Jones William Oliver
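If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name `split_names` and its parameters are made up here, not part of pandas.

```python
import pandas as pd

def split_names(df, column="name", sep=", ", prefix="name"):
    """Return df joined with one new column per separated value."""
    parts = df[column].str.split(sep, expand=True)
    # expand=True labels the new columns 0, 1, 2, ...; rename them 1-based.
    parts.columns = [f"{prefix}{i + 1}" for i in range(parts.shape[1])]
    return df.join(parts)

df = pd.DataFrame(
    {"name": ["Tom, Jeffrey, Henry", "Chris", "Jim, Jones, William, Oliver"]}
)
result = split_names(df)
```

The number of new columns follows the longest row, so nothing has to be hard-coded; shorter rows are padded with missing values.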
I have the following toy dataset df:
import pandas as pd
data = {
'id' : [1, 2, 3],
'name' : ['John Smith', 'Sally Jones', 'William Lee']
}
df = pd.DataFrame(data)
df
id name
0 1 John Smith
1 2 Sally Jones
2 3 William Lee
My ultimate goal is to add a column that represents a Google search of the value in the name column.
I do this using:
def create_hyperlink(search_string):
    return f'https://www.google.com/search?q={search_string}'
df['google_search'] = df['name'].apply(create_hyperlink)
df
id name google_search
0 1 John Smith https://www.google.com/search?q=John Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones
2 3 William Lee https://www.google.com/search?q=William Lee
Unfortunately, the newly created google_search column contains a malformed URL. The URL should have a "+" between the first name and last name.
The google_search column should return the following:
https://www.google.com/search?q=John+Smith
It's possible to do this using split() and join().
df['foo'] = df['name'].str.split()
df['foo']
0 [John, Smith]
1 [Sally, Jones]
2 [William, Lee]
Name: foo, dtype: object
Now, joining them:
df['bar'] = ['+'.join(map(str, l)) for l in df['foo']]
df
id name google_search foo bar
0 1 John Smith https://www.google.com/search?q=John Smith [John, Smith] John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones [Sally, Jones] Sally+Jones
2 3 William Lee https://www.google.com/search?q=William Lee [William, Lee] William+Lee
Lastly, creating the updated google_search column:
df['google_search'] = df['bar'].apply(create_hyperlink)
df
Is there a more elegant, streamlined, Pythonic way to do this?
Thanks!
Rather than reinvent the wheel and modify your string manually, use a library function that is designed to encode query strings correctly:
from urllib.parse import quote_plus
def create_hyperlink(search_string):
    return f"https://www.google.com/search?q={quote_plus(search_string)}"
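To see why this is safer than replacing spaces by hand, here is a small sketch applying the function to a frame. The second name is a made-up example containing a reserved character:

```python
import pandas as pd
from urllib.parse import quote_plus

def create_hyperlink(search_string):
    # quote_plus turns spaces into "+" and percent-encodes reserved characters.
    return f"https://www.google.com/search?q={quote_plus(search_string)}"

# "Tom & Jerry" is a hypothetical value added to show the encoding of "&".
df = pd.DataFrame({"id": [1, 2], "name": ["John Smith", "Tom & Jerry"]})
df["google_search"] = df["name"].apply(create_hyperlink)
```

A plain space-to-plus replacement would leave the "&" untouched, and the search engine would read it as the start of a new query parameter.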
Use Series.str.replace:
df['google_search'] = 'https://www.google.com/search?q=' + \
df.name.str.replace(' ','+')
print(df)
id name google_search
0 1 John Smith https://www.google.com/search?q=John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally+Jones
2 3 William Lee https://www.google.com/search?q=William+Lee
I'm trying to remove a duplicate word in a cell
Current Desired
0 John and Jane John and Jane
1 John and John John
2 John John
3 Jane and Jane Jane
I have tried the following, but the desired column gets filled with o d i c t _ k e y s ( [ ' n a n ' ] ):
from collections import OrderedDict
df['Current'] = (df['Desired'].astype(str).str.split()
.apply(lambda x: OrderedDict.fromkeys(x).keys())
.astype(str).str.join(' '))
I have also tried this, but the desired column gets filled with nan
df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1')
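Split on ' and ', deduplicate with set, then join back. Note that a set does not preserve order, so the words may come out in a different order than in the original: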
df['out'] = df.Current.str.split(' and ').map(lambda x : ' and '.join(set(x)))
df
Out[876]:
Current out
0 John and Jane Jane and John
1 John and John John
2 John John
3 Jane and Jane Jane
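If the original word order matters (row 0 above came back as "Jane and John"), one option is to deduplicate with dict.fromkeys instead of set, since dicts keep insertion order. A minimal sketch, assuming the ' and ' separator from the question; the helper name `dedupe` is made up:

```python
import pandas as pd

df = pd.DataFrame(
    {"Current": ["John and Jane", "John and John", "John", "Jane and Jane"]}
)

def dedupe(text, sep=" and "):
    # dict.fromkeys keeps the first occurrence of each word, in order.
    return sep.join(dict.fromkeys(text.split(sep)))

df["Desired"] = df["Current"].map(dedupe)
```

This assumes every cell is a string; missing values would need to be handled (for example with fillna) before calling map.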
I am trying to find the number of unique values that cover 2 fields. So for example, a typical example would be last name and first name. I have a data frame.
When I do the following, I just get the number of unique fields for each column, in this case, Last and First. Not a composite.
df[['Last Name','First Name']].nunique()
Thanks!
Group by both columns first, and then count the resulting groups:
>>> df.groupby(['First Name', 'Last Name']).ngroups
IIUC, you could use value_counts() for that:
df[['Last Name','First Name']].value_counts().size
3
For another example, if you start with this extended data frame that contains some dups:
Last Name First Name
0 Smith Bill
1 Johnson Bill
2 Smith John
3 Curtis Tony
4 Taylor Elizabeth
5 Smith Bill
6 Johnson Bill
7 Smith Bill
Then value_counts() gives you the counts by unique composite last-first name:
df[['Last Name','First Name']].value_counts()
Last Name First Name
Smith Bill 3
Johnson Bill 2
Curtis Tony 1
Smith John 1
Taylor Elizabeth 1
Then the size of that Series will give you the number of unique composite last-first names:
df[['Last Name','First Name']].value_counts().size
5
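As a cross-check, dropping duplicate rows over the two columns should give the same number. A short sketch that rebuilds the extended frame above and compares both counts:

```python
import pandas as pd

df = pd.DataFrame({
    "Last Name": ["Smith", "Johnson", "Smith", "Curtis",
                  "Taylor", "Smith", "Johnson", "Smith"],
    "First Name": ["Bill", "Bill", "John", "Tony",
                   "Elizabeth", "Bill", "Bill", "Bill"],
})
cols = ["Last Name", "First Name"]

# Each remaining row after drop_duplicates is one unique (last, first) pair.
n_unique = len(df[cols].drop_duplicates())

# value_counts has one entry per unique pair, so its size should match.
n_from_counts = df[cols].value_counts().size
```

Keep in mind that the two can differ when the columns contain missing values, because value_counts drops rows with NaN by default while drop_duplicates keeps them.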
I have a df that looks like this:
fname lname
joe smith
john smith
jane#jane.com
jacky /jax jack
a#a.com non
john (jack) smith
Bob J. Smith
I want to create logic that says: if lname is empty and there are two OR three strings in fname, separate the last string (the second OR third) and move it into the lname column. If fname contains an email address, leave it as is, and if there are slashes or parentheses in the fname column and no value in lname, leave it as is.
new df:
fname lname
joe smith
john smith
jane#jane.com
jacky /jax jack
a#a.com non
john (jack) smith
Bob J. smith
Code so far to separate two strings:
df[['lname']] = df['name'].loc[df['fname'].str.split().str.len() == 2].str.split(expand=True)
With the following sample dataframe:
df = pd.DataFrame({'fname': ['joe', 'john smith', 'jane#jane.com', 'jacky /jax', 'a#a.com', 'john (jack)', 'Bob J. Smith'],
'lname': ['smith', '', '', 'jack', 'non', 'smith', '']})
You can use np.where():
import numpy as np
conditions = (df['lname']=='') & (df['fname'].str.split().str.len()>1)
df['lname'] = np.where(conditions, df['fname'].str.split().str[-1].str.lower(), df['lname'])
Yields:
fname lname
0 joe smith
1 john smith smith
2 jane#jane.com
3 jacky /jax jack
4 a#a.com non
5 john (jack) smith
6 Bob J. Smith smith
To remove the last string from the fname column of the rows that had their lname column populated:
df['fname'] = np.where(conditions, df['fname'].str.split().str[:-1].str.join(' '), df['fname'])
Yields:
fname lname
0 joe smith
1 john smith
2 jane#jane.com
3 jacky /jax jack
4 a#a.com non
5 john (jack) smith
6 Bob J. smith
If I understand correctly, you have a dataframe with columns fname and lname. If so, you can fill the empty values in column lname with:
condition = (df.loc[:, 'lname'] == '') & (df.loc[:, 'fname'].str.contains(' '))
df.loc[condition, 'lname'] = df.loc[condition, 'fname'].str.split().str[-1]
The code works for the sample data provided in the question, but it would need to be improved for more general cases.
To modify column fname, you may use:
df.loc[condition, 'fname'] = df.loc[condition, 'fname'].str.split().str[:-1].str.join(sep=' ')
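Neither answer above explicitly skips rows with slashes, parentheses, or email-like values, which the question asks for; they only work because those rows happen to have lname filled in or a single word in fname. A rough sketch of a stricter mask follows. The last sample row is hypothetical, and the character class `[#@/()]` is an assumption about what marks a row to leave alone:

```python
import pandas as pd

df = pd.DataFrame({
    "fname": ["joe", "john smith", "jane#jane.com", "jacky /jax",
              "john (jack)", "Bob J. Smith", "ann /annie"],  # last row is hypothetical
    "lname": ["smith", "", "", "jack", "smith", "", ""],
})

words = df["fname"].str.split()

# Rows to change: lname is empty, fname has 2 or 3 words, and fname contains
# none of the characters assumed to mark an email, slash, or parenthesis.
special = df["fname"].str.contains(r"[#@/()]", regex=True)
mask = df["lname"].eq("") & words.str.len().between(2, 3) & ~special

# Move the last word to lname (lowercased, as in the desired output)
# and keep the remaining words in fname.
df.loc[mask, "lname"] = words[mask].str[-1].str.lower()
df.loc[mask, "fname"] = words[mask].str[:-1].str.join(" ")
```

With this mask, "ann /annie" is left untouched even though its lname is empty, while "john smith" and "Bob J. Smith" are split as in the desired output.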