Partial string slice (or string split?) in new column - python

I am trying to figure out how to remove a word from a group of words in a column and insert that removed word into a new column. I figured out how to remove part of a column and insert it into a new column, but I cannot figure out how to target a specific word (by position, I assume: "Mr." is always the 2nd word; or maybe by taking the text between the first "," and the "." that follows it, which is also constant in my data set).
Name Age New_Name
Doe, Mr. John 23 Mr.
Anna, Mrs. Fox 33 Mrs.
EDITED the above to add another row
How would I remove the "Mr." from the name column and insert it into the "New_Name" column?
So far I have come up with:
data['New_name'] = data.Name.str[:2]
This doesn't allow me to specifically target "Mr." though.
I think I have to use a string.split, but the exact code is eluding me.

If the Mr. is always in the same position, as in your example, this can be accomplished with a list comprehension:
df['New_Name'] = [x.split(' ')[1] for x in df['Name']]
and
df['Name'] = [' '.join(x.split(' ')[::2]) for x in df['Name']]
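The same positional idea can be written with the pandas `.str` accessor instead of list comprehensions. This is a minimal sketch that assumes every name has exactly three whitespace-separated words with the title in second place; the sample frame is rebuilt from the question and `parts` is a name chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Doe, Mr. John", "Anna, Mrs. Fox"], "Age": [23, 33]})

# Split each name on whitespace once; .str[i] then indexes into each row's list.
parts = df["Name"].str.split()

# Assumption: the title is always the second word.
df["New_Name"] = parts.str[1]

# Keep the first word (surname with its comma) and the third word (given name).
df["Name"] = parts.str[0] + " " + parts.str[2]

print(df)
```

Rows with fewer than three words would produce missing values, so this only suits data as regular as the example.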

First, get the title from the name (it sits between the comma and the dot) and store it in another column. Then repeat the same lookup to remove the title from the 'Name' column:
import pandas as pd
df = pd.DataFrame({'Name':['Doe, Mr. John', 'Anna, Ms. Fox'], 'Age':[23,33]})
# slice from just after the comma up to the last dot, then drop the leading space
df['New_Name'] = df['Name'].apply(lambda x: (x[x.find(',')+len(','):x.rfind('.')]+'.').strip())
# remove the same slice (leading space included) from the original name
df['Name'] = df['Name'].apply(lambda x: x.replace(x[x.find(',')+len(','):x.rfind('.')]+'.',''))
print(df)
Output:
Age Name New_Name
0 23 Doe, John Mr.
1 33 Anna, Fox Ms.

You can use the pandas str.extract and str.replace methods.
First, extract the title to form the new column:
df['New_Name'] = df['Name'].str.extract(r',\s([A-Za-z]+\.)', expand=False)
Then use replace to swap the title (and the spaces around it) for a single space:
df['Name'] = df['Name'].str.replace(r'\s[A-Za-z]+\.\s', ' ', regex=True)
You get:
Age Name New_Name
0 23 Doe, John Mr.

name = "Doe, Mr. John"
# if you always expect a title (Mr/Ms) between comma and dot
# split to lastname, title and firstname and strip spaces
newname = [ n.strip() for n in name.replace(".", ",").split(",") ]
print(newname)
#> ['Doe', 'Mr', 'John']
Then you can build a title column and a first-name/last-name column, or any other combination of the three parts.
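The three-way split above can also be applied to a whole column in one pass. The sketch below uses `str.extract` with named groups; the group names `last`, `title` and `first` are chosen here for illustration, and the pattern assumes every name has the form "Last, Title. First".

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Doe, Mr. John", "Anna, Mrs. Fox"]})

# Named groups become the column names of the frame returned by str.extract.
pattern = r"^(?P<last>[^,]+),\s*(?P<title>[^.]+\.)\s*(?P<first>.+)$"
parts = df["Name"].str.extract(pattern)

df["New_Name"] = parts["title"]
df["Name"] = parts["last"] + ", " + parts["first"]
print(df)
```

Names that do not match the pattern come back as missing values, which makes irregular rows easy to spot.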

Related

Pandas DF: Create New Col by removing last word from of existing column

This should be easy, but I'm stumped.
I have a df that includes a column of PLACENAMES. Some of these have multiple word names:
Able County
Baker County
Charlie County
St. Louis County
All I want to do is to create a new column in my df that has just the name, without the "county" word:
Able
Baker
Charlie
St. Louis
I've tried a variety of things:
1. places['name_split'] = places['PLACENAME'].str.split()
2. places['name_split'] = places['PLACENAME'].str.split()[:-1]
3. places['name_split'] = places['PLACENAME'].str.rsplit(' ',1)[0]
4. places = places.assign(name_split = lambda x: ' '.join(x['PLACENAME'].str.split()[:-1]))
The results, in the same order:
1. Works - splits the names into a list ['St.','Louis','County']
2. The list slice is ignored, resulting in the same list ['St.','Louis','County'] rather than ['St.','Louis']
3. Raises a ValueError: Length of values (2) does not match length of index (41414)
4. Raises a TypeError: sequence item 0: expected str instance, list found
I've also defined a function and called it with .assign():
def processField(namelist):
    words = namelist[:-1]
    name = ' '.join(words)
    return name
places = places.assign(name_split = lambda x: processField(x['PLACENAME']))
This also raises a TypeError: sequence item 0: expected str instance, list found
This seems to be a very simple goal and I've probably overthought it, but I'm just stumped. Suggestions about what I should be doing would be deeply appreciated.
Apply Series.str.rpartition function:
places['name_split'] = places['PLACENAME'].str.rpartition()[0]
Use str.replace to remove the last word and the preceding spaces:
places['new'] = places['PLACENAME'].str.replace(r'\s*\w+$', '', regex=True)
# or
places['new'] = places['PLACENAME'].str.replace(r'\s*\S+$', '', regex=True)
# or, only match 'County'
places['new'] = places['PLACENAME'].str.replace(r'\s*County$', '', regex=True)
Output:
PLACENAME new
0 Able County Able
1 Baker County Baker
2 Charlie County Charlie
3 St. Louis County St. Louis
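For comparison, the split-based attempts from the question can be made to work by slicing through the `.str` accessor, so the slice applies to each row's list rather than to the Series as a whole. This is a sketch on the four sample names; `name_rsplit` is a column name added here only to show the second variant.

```python
import pandas as pd

places = pd.DataFrame(
    {"PLACENAME": ["Able County", "Baker County", "Charlie County", "St. Louis County"]}
)

# .str[:-1] slices each row's list; a bare [:-1] would slice the Series itself.
words = places["PLACENAME"].str.split()
places["name_split"] = words.str[:-1].str.join(" ")

# Equivalent: split once from the right and keep the left-hand piece.
places["name_rsplit"] = places["PLACENAME"].str.rsplit(" ", n=1).str[0]
print(places)
```

Both variants drop whatever the last word is, so they behave like the `\s*\S+$` pattern rather than the `County`-only one.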

How to replace substrings in a Dataframe column, but only at the start of the strings?

I'm trying to replace specific characters in a data frame just if the string of the column starts with the characters specified. I mean, the df is as below:
UBICACION NAME
AL03 Joe
FL03 Maria
AL07 Karla
DAL5 Marco
The desired output would be:
UBICACION NAME
FL03 Joe
FL03 Maria
FL07 Karla
DAL5 Marco
This is my try:
df['UBICACION'] = df['UBICACION'].replace ("FL","AL")
The last statement is not working, because it changes the whole word and keeps only the specified characters.
Hope you can help me, I'm a little bit new on this. Best regards.
Series.replace (like DataFrame.replace) accepts regex=True, so you can anchor the pattern to the start of the string with ^AL:
df['UBICACION'] = df['UBICACION'].replace('^AL', 'FL', regex=True)
# UBICACION NAME
# 0 FL03 Joe
# 1 FL03 Maria
# 2 FL07 Karla
# 3 DAL5 Marco
Try this:
df["UBICACION"] = df["UBICACION"].apply(lambda x: f"FL{x[2:]}" if x.startswith("AL") else x)
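A third option, sketched here on the sample data from the question, is to build a boolean mask with `str.startswith` and assign only to the matching rows. The variable name `starts_with_al` is chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"UBICACION": ["AL03", "FL03", "AL07", "DAL5"], "NAME": ["Joe", "Maria", "Karla", "Marco"]}
)

# Boolean mask: True only where the value begins with "AL".
starts_with_al = df["UBICACION"].str.startswith("AL")

# Rewrite just those rows, swapping the two-character prefix.
df.loc[starts_with_al, "UBICACION"] = "FL" + df.loc[starts_with_al, "UBICACION"].str[2:]
print(df)
```

"DAL5" is left alone because the mask checks the start of the string, not any occurrence of "AL".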

Formatting strings in a dataframe

I have a dataframe:
Name
Joe Smith
Jane Doe
Homer Simpson
I am trying to format this to get:
Name
Smith, Joe
Doe, Jane
Simpson, Homer
I have this code, and it works for ~80% of the users in my list, but some users are not coming through right.
invalid_users = ['Test User', 'Test User2', 'Test User3']
for index, row in df_Users.iterrows():
    gap_pos = df_Users["Name"][index].find(" ")
    if gap_pos > 0 and row["Name"] not in invalid_users:
        row["Name"] = df_Users["Name"][index][len(df_Users["Name"][index])-gap_pos+1:].strip() +', ' + df_Users["Name"][index][:gap_pos]
The users who are not coming through correctly usually have their last name truncated somewhere, e.g. Simpson ==> mpson
What am I doing wrong here?
Just split on space, then reverse it (that's what .str[::-1] is doing) and join on , :
>>> df['Name'].str.split(' ').str[::-1].str.join(', ')
0 Smith, Joe
1 Doe, Jane
2 Simpson, Homer
Name: Name, dtype: object
And if your data contains the name like Jr. Joe Smith, then you may do it following way:
df['Name'].str.split(' ').str[::-1].apply(lambda x:(x[0],' '.join(x[1:]))).str.join(', ')
I'm not sure what you were trying to do with len there, but it's not right. You just want to start straight from gap_pos:
row["Name"] = df_Users["Name"][index][gap_pos+1:].strip() +', ' + df_Users["Name"][index][:gap_pos]
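To see why the original slice truncates some surnames, the two index calculations can be compared on plain strings. This is a small sketch; the helper names `buggy_swap` and `fixed_swap` are made up for the comparison.

```python
def buggy_swap(name):
    gap_pos = name.find(" ")
    # The start index counts back from the end of the string, so it only
    # equals gap_pos + 1 for some name lengths.
    return name[len(name) - gap_pos + 1:].strip() + ", " + name[:gap_pos]


def fixed_swap(name):
    gap_pos = name.find(" ")
    # Start right after the first space, so the whole surname is kept.
    return name[gap_pos + 1:].strip() + ", " + name[:gap_pos]


for name in ["Jane Doe", "Homer Simpson"]:
    print(name, "->", buggy_swap(name), "|", fixed_swap(name))
```

"Jane Doe" comes out right either way because both words are almost the same length, which is consistent with the code appearing to work for most of the list.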
I would be tempted to use split for this.
pandas is built around vectorized operations, which suit simple transformations and most DataFrame manipulations.
Given your example, here is some code that would work:
import pandas as pd
df = pd.DataFrame({"name": ["Joe Smith", "Jane Doe", "Homer Simpson"]})
# df
# name
# 0 Joe Smith
# 1 Jane Doe
# 2 Homer Simpson
df["name"] = df["name"].apply(lambda x: f"{x.split(' ')[1]}, {x.split(' ')[0]}")
# df
# name
# 0 Smith, Joe
# 1 Doe, Jane
# 2 Simpson, Homer
On a column, apply calls the given function once for each value.
Here the function is a lambda that assumes every name follows the "FirstName LastName" pattern and swaps the two parts.
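If some names have more than two words, a single split from the right keeps everything before the last word together. This sketch assumes the last word is always the surname; "Homer Jay Simpson" is a made-up row added to show the difference.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Joe Smith", "Jane Doe", "Homer Jay Simpson"]})

# Split once from the right: column 0 holds the given names, column 1 the surname.
parts = df["name"].str.rsplit(" ", n=1, expand=True)
df["name"] = parts[1] + ", " + parts[0]
print(df)
```

A name with no space at all would end up with a missing value here, so single-word entries may need to be filtered out first.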

How to split a column with names that are in different format and have different delimiters

I have a column of names that are in different languages and are entered in different formats. It appears that the English and Mandarin names have "," as a delimiter. The Korean names have "." as a delimiter, while the Japanese names have both "," and "/" as delimiters. I am hoping to be able to obtain the New_Name column:
Name_old Language New_Name
Phillipe, Mr Johnson English Mr Johnson Phillipe
李, Mr 永 Mandarin Mr 永 李
김두한.Kim Do Han Korean Kim Do Han
Amori, Shinji/ あもりさせる / 由紀 Japanese Shinji Amori
I have tried the following code, but it only works for the English and Mandarin names. I am thinking I might have to filter the rows based on the Language column and then split the strings. I would appreciate any form of help, thank you.
splitname = df1["Name_old"].str.split(",", n = 1, expand = True)
# create first name column based on values after comma in Name_old column
df1["First_Name"]= splitname[1]
# create last name column based on values before comma in Name_old column
df1["Last_Name"]= splitname[0]
#concatenate the first name and last name
df1['New_Name'] = df1['First_Name'] +' '+ df1['Last_Name']
One way is to use np.select with conditions based on your Language column:
import numpy as np
import pandas as pd
d = {"Name":["Phillipe, Mr Johnson","李, Mr 永","김두한.Kim Do Han","Amori, Shinji/ あもりさせる / 由紀"],
     "Language":["English","Mandarin","Korean","Japanese"]}
df = pd.DataFrame(d)
# the first matching condition wins; the last argument is the default (the Japanese rows here)
df["new"] = np.select([df["Language"].isin(["English", "Mandarin"]),
                       df["Language"].eq("Korean")],
                      [df["Name"].str.split(",", n = 1).str[::-1].str.join(" "),
                       df["Name"].str.findall(r"[A-Za-z]+").str.join(" ")],
                      df["Name"].str.findall(r"[A-Za-z]+").str[::-1].str.join(" "))
print (df)
Output:
Name Language new
0 Phillipe, Mr Johnson English Mr Johnson Phillipe
1 李, Mr 永 Mandarin Mr 永 李
2 김두한.Kim Do Han Korean Kim Do Han
3 Amori, Shinji/ あもりさせる / 由紀 Japanese Shinji Amori
You can split your string using regular expression:
import re
test_str = 'a,b.c/d,e,f.g/hij.k'
print(re.split(r'[,\/.]', test_str))
r'[,\/.]' is a character class that matches any one of the three delimiters: a comma, a slash or a dot.
Output would be:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'hij', 'k']
Complete example:
import re
import pandas as pd
test_str = 'abc,def'
df = pd.DataFrame({"old_name": [test_str]})
def split_name(name):
    # split the argument itself, not the global test_str
    parts = re.split(r'[,\/.]', name)
    return parts[0], parts[1]
df['first_name'], df['last_name'] = zip(*df['old_name'].apply(split_name))
print(df)
Output:
old_name first_name last_name
0 abc,def abc def
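The same character class can be handed to the pandas `str.split` method, which avoids the helper function. This is a sketch on made-up values; it relies on pandas treating a pattern longer than one character as a regular expression, and it splits only at the first delimiter.

```python
import pandas as pd

df = pd.DataFrame({"old_name": ["abc,def", "ghi.jkl", "mno/pqr"]})

# A multi-character pattern is treated as a regular expression by str.split.
parts = df["old_name"].str.split(r"[,/.]", n=1, expand=True)
df["first_name"] = parts[0]
df["last_name"] = parts[1]
print(df)
```

With `n=1`, any further delimiters stay inside the second piece, which may or may not be what the Japanese rows in the question need.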

How to perform a 'selective strip' on a pandas dataframe column

Let's say I have a df:
Name A
'John '
'John and Mary '
'Harry '
'Paul '
'Paul and Harry '
How would I remove the trailing whitespace from each of the dataframe values without removing the spaces between 'John and Mary' so...the new df would look like:
Name A
'John'
'John and Mary'
'Harry'
'Paul'
'Paul and Harry'
I have tried the str.split method, but this interferes with the multi-name values. I also tried the replace method. Maybe some sort of indexing of the dataframe values like [:-1] could work?
I am not sure what else to try.
It seems you need strip if you want to remove the ' characters and the whitespace from both the left and right sides:
df['Name A'] = df['Name A'].str.strip("' ")
print (df)
Name A
0 John
1 John and Mary
2 Harry
3 Paul
4 Paul and Harry
If you only need to remove whitespace from the right side, use rstrip. No parameter is necessary, because whitespace is the default:
df['Name A'] = df['Name A'].str.rstrip()
Try .rstrip(), this will strip all the spaces on the right side of the string
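A quick check on the sample values, assuming the quotes in the question only mark where each string ends and are not part of the data:

```python
import pandas as pd

df = pd.DataFrame(
    {"Name A": ["John ", "John and Mary ", "Harry ", "Paul ", "Paul and Harry "]}
)

# rstrip() only touches the right-hand end, so the inner spaces survive.
df["Name A"] = df["Name A"].str.rstrip()
print(df["Name A"].tolist())
```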
