I have a dataframe with a Name column like this:
How can I use pandas to reverse the names in the format "xxx, xxx" efficiently? Also if you have other string cleaning tips for munging names like these I would appreciate it!
Maybe you can try something like this, using the built-in reversed function:
import pandas as pd
d = {'name':['Bran Stark','Jon Snow','Rhaegar Targaryen']}
df = pd.DataFrame(data=d)
df['new name'] = df['name'].apply(lambda x : ', '.join(reversed(x.split(' '))))
print(df['new name'])
0 Stark, Bran
1 Snow, Jon
2 Targaryen, Rhaegar
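A vectorized alternative to apply, as a minimal sketch: Series.str.replace with a regular expression swaps the two parts without a Python-level lambda. It assumes every name is a first word followed by the rest of the name, separated by whitespace; the data and column names simply mirror the example above.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bran Stark', 'Jon Snow', 'Rhaegar Targaryen']})

# Capture the first word and the remainder, then emit "remainder, first".
# regex=True is passed explicitly because recent pandas versions
# treat the pattern as a literal string by default.
df['new name'] = df['name'].str.replace(r'^(\S+)\s+(.+)$', r'\2, \1', regex=True)
print(df['new name'].tolist())
```

Names with middle names or suffixes would need a more specific pattern.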
Use Series.str.replace to perform regex string substitutions:
df['Name'] = df['Name'].str.replace(r'(.+),\s+(.+)', r'\2 \1', regex=True)
The regex pattern (.+),\s+(.+) means
( begin group #1
.+ match 1-or-more of any character
) end group #1
, match a literal comma
\s+ match 1-or-more whitespace characters
( begin group #2
.+ match 1-or-more of any character
) end group #2
The second argument, r'\2 \1', tells str.replace to replace each substring that matches the pattern with group #2, followed by a space, followed by group #1.
import pandas as pd
names = '''\
John Snow
Black, Jack
Jim Bean/
Draper, Don
'''
df = pd.DataFrame({'Name': names.splitlines()})
# Name
# 0 John Snow
# 1 Black, Jack
# 2 Jim Bean/
# 3 Draper, Don
df['Name'] = df['Name'].str.replace(r'(.+),\s+(.+)', r'\2 \1', regex=True)
yields
Name
0 John Snow
1 Jack Black
2 Jim Bean/
3 Don Draper
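Row 2 still ends with a stray slash. On the broader question of cleaning tips, here is a small sketch that chains string methods before reordering the names. Which characters count as junk (here a trailing "/" and extra whitespace) is an assumption about the data, and the last sample row is made up to show the whitespace handling.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John Snow', 'Black, Jack', 'Jim Bean/', '  Draper,   Don ']})

cleaned = (
    df['Name']
    .str.strip()                                          # drop leading/trailing whitespace
    .str.rstrip('/')                                      # drop trailing slashes
    .str.replace(r'\s+', ' ', regex=True)                 # collapse runs of whitespace
    .str.replace(r'(.+),\s+(.+)', r'\2 \1', regex=True)   # "Last, First" -> "First Last"
)
print(cleaned.tolist())
```

Each step returns a new Series, so the chain can be reordered or shortened to fit the actual data.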
Related
I'm trying to replace specific characters in a data frame just if the string of the column starts with the characters specified. I mean, the df is as below:
UBICACION NAME
AL03 Joe
FL03 Maria
AL07 Karla
DAL5 Marco
The desired output would be:
UBICACION NAME
FL03 Joe
FL03 Maria
FL07 Karla
DAL5 Marco
This is my try:
df['UBICACION'] = df['UBICACION'].replace ("FL","AL")
That last line does not work, because it changes the whole word and keeps only the specified characters.
I hope you can help me; I'm fairly new to this. Best regards.
Series.replace (like DataFrame.replace) accepts a regex=True option, so you can use the pattern ^AL, which matches "AL" only at the start of the string:
df['UBICACION'] = df['UBICACION'].replace('^AL', 'FL', regex=True)
# UBICACION NAME
# 0 FL03 Joe
# 1 FL03 Maria
# 2 FL07 Karla
# 3 DAL5 Marco
Alternatively, apply a function that rewrites only the values starting with "AL":
df["UBICACION"] = df["UBICACION"].apply(lambda x: f"FL{x[2:]}" if x.startswith("AL") else x)
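To illustrate the difference between the two kinds of matching, here is a minimal sketch. Without regex=True, Series.replace compares whole cell values, whereas Series.str.replace works on substrings. The frame is rebuilt from the table in the question.

```python
import pandas as pd

df = pd.DataFrame({'UBICACION': ['AL03', 'FL03', 'AL07', 'DAL5'],
                   'NAME': ['Joe', 'Maria', 'Karla', 'Marco']})

# Whole-value matching: no cell equals "AL" exactly, so nothing changes.
unchanged = df['UBICACION'].replace('AL', 'FL')

# Substring matching anchored at the start: only a leading "AL" is rewritten,
# so "DAL5" is left alone.
fixed = df['UBICACION'].str.replace(r'^AL', 'FL', regex=True)

print(unchanged.tolist())
print(fixed.tolist())
```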
I have a function with a for loop that is returning a bunch of strings for example:
58, pluto
172, uno
5, peaches
How can I put the first part of each string (the number) in one column of a pandas dataframe and the second part (the fruit) in a second column? The columns should be named "amount" and "fruit".
Here is the code so far:
regex = r"(\d+)( ML/year )(in the |the )([\w \/\(\)]+)"
for line in finalText.splitlines():
    matches = re.finditer(regex, line)
    for matchNum, match in enumerate(matches, start=1):
        print (match.group(1) +","+ match.group(4))
I am using re to filter out the data I need from a large block of text but for now it is just printing to the console and I need it to go into a dataframe.
Essentially, the last print statement in that code needs to be changed so instead of printing I am inserting into a dataframe.
Example of final text is:
(a)58 ML/Y in the pear region
(b)
64 ML/Y in the apple region
It is plain text
I worked out a simpler solution for you. Splitting on the \W regex (any non-word character) removes the parentheses, the slash, and the spaces from your string.
If the pattern of your string is always going to be
(x)## ML/Y in the fruit region (y) ## ML/Y in the fruit region
then use this code. It splits each line into a list of words, with empty strings where two separators are adjacent. The 3rd, 8th, 13th, and 18th items of that list (indices 2, 7, 12, and 17) are the values you want.
import pandas as pd
import re
finalText = '(a)58 ML/Y in the pear region (b) 64 ML/Y in the apple region'
df = pd.DataFrame(data=None, columns=['amount','fruit'])
for line in finalText.splitlines():
    matches = re.split(r'\W',line)
    df.loc[len(df)] = [matches[2],matches[7]]
    df.loc[len(df)] = [matches[12],matches[17]]
print(df)
The output for this resulted in:
amount fruit
0 58 pear
1 64 apple
An alternative is to use findall, which returns only the words and so shifts the indices:
df = pd.DataFrame(data=None, columns=['amount','fruit'])  # start again from an empty frame
for line in finalText.splitlines():
    print (line)
    m = re.findall(r'\w+',line)
    print (m)
    matches = re.findall(r'\w+',line)
    df.loc[len(df)] = [matches[1],matches[6]]
    df.loc[len(df)] = [matches[9],matches[14]]
print(df)
Same results as above
amount fruit
0 58 pear
1 64 apple
old code
Try this and let me know if it works.
import re
import pandas as pd
df = pd.DataFrame(data=None, columns=['amount','fruit'])
regex = r"(\d+)( ML/year )(in the |the )([\w \/\(\)]+)"
for line in finalText.splitlines():
    matches = re.finditer(regex, line)
    for matchNum, match in enumerate(matches, start=1):
        # append one row per match rather than assigning a new column
        df.loc[len(df)] = [match.group(1) , match.group(4)]
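Another option, sketched here under assumptions, is to let capture groups pick out the values so that list positions do not matter. The pattern assumes the unit is written "ML/Y" as in the sample text (the question's regex uses "ML/year") and that the fruit is the single word after "in the" or "the". The name `pattern` is illustrative.

```python
import re
import pandas as pd

finalText = '(a)58 ML/Y in the pear region\n(b)\n64 ML/Y in the apple region'

# Group 1: the number; group 2: the word following "in the" / "the".
pattern = re.compile(r'(\d+)\s+ML/Y\s+(?:in the|the)\s+(\w+)')

# With two groups, findall returns a list of (amount, fruit) tuples.
rows = pattern.findall(finalText)
df = pd.DataFrame(rows, columns=['amount', 'fruit'])
print(df)
```

Building the frame once from a list of tuples also avoids growing it row by row inside the loop.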
Here is my solution
import pandas as pd
s = "58, pluto 172, uno 5, peaches"
temp = s.split() # ['58,', 'pluto', '172,', 'uno', '5,', 'peaches']
amount = temp[::2] # ['58,', '172,', '5,']
fruit = temp[1::2] # ['pluto', 'uno', 'peaches']
df = pd.DataFrame({'amount': amount, 'fruit': fruit})
From there you can drop the trailing comma and change the type from string to int.
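Dropping the comma and converting the type could look like the following sketch. It assumes every amount is a whole number followed by at most one trailing comma.

```python
import pandas as pd

s = "58, pluto 172, uno 5, peaches"
temp = s.split()

df = pd.DataFrame({'amount': temp[::2], 'fruit': temp[1::2]})

# Remove the trailing comma, then convert the column from strings to integers.
df['amount'] = df['amount'].str.rstrip(',').astype(int)
print(df)
```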
I have a list of suffixes I want to remove in a list, say suffixes = ['inc','co','ltd'].
I want to remove these from a column in a Pandas dataframe, and I have been doing this:
df['name'] = df['name'].str.replace('|'.join(suffixes), '', regex=True)
This works, but I do NOT want to remove the suffix if what remains is numeric. For example, if the name is 123 inc, I don't want to strip the 'inc'. Is there a way to add this condition in the code?
Using Regex --> negative lookbehind.
Ex:
suffixes = ['inc','co','ltd']
df = pd.DataFrame({"Col": ["Abc inc", "123 inc", "Abc co", "123 co"]})
df['Col_2'] = df['Col'].str.replace(r"(?<!\d) \b(" + '|'.join(suffixes) + r")\b", '', regex=True)
print(df)
Output:
Col Col_2
0 Abc inc Abc
1 123 inc 123 inc
2 Abc co Abc
3 123 co 123 co
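A quick way to probe the edge cases of this pattern is to try it with the plain re module, as in the sketch below. The extra test names and the use of re.escape are additions for illustration, not part of the answer above.

```python
import re

suffixes = ['inc', 'co', 'ltd']

# re.escape guards against suffixes that contain regex metacharacters.
alternation = '|'.join(re.escape(s) for s in suffixes)
pattern = re.compile(r'(?<!\d) \b(?:' + alternation + r')\b')

names = ['Abc inc', '123 inc', 'Abc co', 'A1 ltd', 'Acme company']
result = [pattern.sub('', n) for n in names]
print(result)
```

Note that the lookbehind inspects only the single character before the space, so 'A1 ltd' keeps its suffix even though the name is not purely numeric. The trailing \b keeps 'company' from being treated as 'co'.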
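Try anchoring the pattern with ^[^0-9]+, a regex that means "one or more non-digit characters from the start of the string". Capture that part and substitute it back, so that only the suffix is removed and names containing digits are left alone. The code would look like this:
non_numeric_regex = r"^([^0-9]+)"
suffixes = ['inc','co','ltd']
# a name made only of non-digits, then whitespace and one of the suffixes at the end
regex_w_suffixes = non_numeric_regex + r"\s(?:" + '|'.join(suffixes) + r")$"
df['name'] = df['name'].str.replace(regex_w_suffixes, r'\1', regex=True)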
I have a column of names that are in different languages and are entered in different formats. It appears that the English and Mandarin names use "," as a delimiter. The Korean names use "." as a delimiter, while the Japanese names use both "," and "/". I am hoping to obtain the New_Name column shown below:
Name_old Language New_Name
Phillipe, Mr Johnson English Mr Johnson Phillipe
李, Mr 永 Mandarin Mr 永 李
김두한.Kim Do Han Korean Kim Do Han
Amori, Shinji/ あもりさせる / 由紀 Japanese Shinji Amori
I have tried the following code, but it only works for the English and Mandarin names. I am thinking I might have to filter the rows based on the Language column and then split the strings. I would appreciate any form of help, thank you.
splitname = df1["Name_old"].str.split(",", n = 1, expand = True)
# create first name column based on values after comma in Name_old column
df1["First_Name"]= splitname[1]
# create last name column based on values before comma in Name_old column
df1["Last_Name"]= splitname[0]
#concatenate the first name and last name
df1['New_Name'] = df1['First_Name'] +' '+ df1['Last_Name']
One way is to use np.select with conditions based on your Language column:
import numpy as np
import pandas as pd
d = {"Name":["Phillipe, Mr Johnson","李, Mr 永","김두한.Kim Do Han","Amori, Shinji/ あもりさせる / 由紀"],
     "Language":["English","Mandarin","Korean","Japanese"]}
df = pd.DataFrame(d)
df["new"] = np.select([df["Language"].isin(["English", "Mandarin"]),
                       df["Language"].eq("Korean")],
                      [df["Name"].str.split(",", n = 1).str[::-1].str.join(" ").str.strip(),
                       df["Name"].str.findall(r"[A-Za-z]+").str.join(" ")],
                      df["Name"].str.findall(r"[A-Za-z]+").str[::-1].str.join(" "))
print (df)
#
Name Language new
0 Phillipe, Mr Johnson English Mr Johnson Phillipe
1 李, Mr 永 Mandarin Mr 永 李
2 김두한.Kim Do Han Korean Kim Do Han
3 Amori, Shinji/ あもりさせる / 由紀 Japanese Shinji Amori
You can split your string using regular expression:
import re
test_str = 'a,b.c/d,e,f.g/hij.k'
print(re.split(r'[,\/.]', test_str))
r'[,\/.]' is a character class that matches any one of the three delimiters: ",", "/" or "."
Output would be:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'hij', 'k']
Complete example:
import re
import pandas as pd
test_str = 'abc,def'
df = pd.DataFrame({"old_name": [test_str]})
def split_name(name):
    # split the value passed in, not the global test_str
    parts = re.split(r'[,\/.]', name)
    return parts[0], parts[1]
df['first_name'], df['last_name'] = zip(*df['old_name'].apply(split_name))
print(df)
Output:
old_name first_name last_name
0 abc,def abc def
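The same split can also be done without apply, as a rough sketch using Series.str.split. It relies on pandas treating a pattern longer than one character as a regular expression by default, and on every value containing at least one delimiter. The sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({'old_name': ['abc,def', 'ghi.jkl', 'mno/pqr']})

# Split once on the first ",", "/" or "." and spread the pieces into columns.
parts = df['old_name'].str.split(r'[,/.]', n=1, expand=True)
df['first_name'] = parts[0]
df['last_name'] = parts[1]
print(df)
```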
I am trying to figure out how to remove a word from a group of words in a column and insert that removed word into a new column. I figured out how to remove a part of a column and insert it into a new row, but I cannot figure out how to target a specific word (by position, I assume, since "Mr." is always the 2nd word; or maybe by taking the text between the first "," and the ".", which is also constant in my data set).
Name Age New_Name
Doe, Mr. John 23 Mr.
Anna, Mrs. Fox 33 Mrs.
EDITED the above to add another row
How would I remove the "Mr." from the name column and insert it into the "New_Name" column?
So far I have come up with:
data['New_name'] = data.Name.str[:2]
This doesn't allow me to specifically target "Mr." though.
I think I have to use a string.split, but the exact code is eluding me.
If the "Mr." is always in the same position, as your example indicates, this can be accomplished with a list comprehension:
df['New_Name'] = [x.split(' ')[1] for x in df['Name']]
and
df['Name'] = [' '.join(x.split(' ')[::2]) for x in df['Name']]
First, extract the title from the name (it sits between the comma and the dot) and store it in another column. Then use the same slice to remove the title from the 'Name' column:
import pandas as pd
df = pd.DataFrame({'Name':['Doe, Mr. John', 'Anna, Ms. Fox'], 'Age':[23,33]})
df['New_Name'] = df['Name'].apply(lambda x: x[x.find(',')+len(','):x.rfind('.')]+'.')
df['Name'] = df['Name'].apply(lambda x: x.replace(x[x.find(',')+len(','):x.rfind('.')]+'.',''))
print(df)
Output:
Age Name New_Name
0 23 Doe, John Mr.
1 33 Anna, Fox Ms.
You can use the pandas str.extract and str.replace methods.
First, extract the title to form the new column:
df['New_Name'] = df['Name'].str.extract(r',\s([A-Za-z]+\.)', expand=False)
Then use replace to swap the extracted title, together with the whitespace around it, for a single space:
df['Name'] = df['Name'].str.replace(r'\s[A-Za-z]+\.\s', ' ', regex=True)
You get:
Age Name New_Name
0 23 Doe, John Mr.
name = "Doe, Mr. John"
# if you always expect a title (Mr/Ms) between comma and dot
# split to lastname, title and firstname and strip spaces
newname = [ n.strip() for n in name.replace(".", ",").split(",") ]
print(newname)
#> ['Doe', 'Mr', 'John']
You can then print a title column and a first-name/last-name column, or any other combination of them.
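Applied to a whole column, the same idea might be sketched with Series.str.extract and named groups. The group names last, title and first are made up for illustration, and the pattern assumes every row looks like "Last, Title. First".

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Doe, Mr. John', 'Anna, Mrs. Fox'], 'Age': [23, 33]})

# Named groups become the column names of the frame returned by str.extract.
parts = df['Name'].str.extract(
    r'^(?P<last>[^,]+),\s*(?P<title>[^.]+\.)\s*(?P<first>.+)$'
)

df['New_Name'] = parts['title']
df['Name'] = parts['last'] + ', ' + parts['first']
print(df)
```

Rows that do not match the pattern come back as NaN, which makes malformed names easy to spot.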