How to perform a 'selective strip' on a pandas dataframe column - python

Let's say I have a df:
Name A
'John '
'John and Mary '
'Harry '
'Paul '
'Paul and Harry '
How would I remove the trailing whitespace from each value without removing the spaces inside 'John and Mary', so that the new df looks like:
Name A
'John'
'John and Mary'
'Harry'
'Paul'
'Paul and Harry'
I have tried the str.split method, but it interferes with the multi-name values. I also tried the replace method. Maybe some sort of indexing of the dataframe values, like [:-1], could work?
I'm not sure what else to try.

It seems you need strip if you want to remove both ' and whitespace from the left and right sides:
df['Name A'] = df['Name A'].str.strip("' ")
print (df)
Name A
0 John
1 John and Mary
2 Harry
3 Paul
4 Paul and Harry
If you only need to remove whitespace from the right side, use rstrip. No argument is needed, because whitespace is the default:
df['Name A'] = df['Name A'].str.rstrip()

Try .rstrip(), this will strip all the spaces on the right side of the string
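As a minimal sketch of the rstrip suggestion, assuming the quotes in the question are only delimiters and not part of the stored values, the following rebuilds the sample column and checks that inner spaces survive:

```python
import pandas as pd

# Sample data modelled on the question; each value has one trailing space.
df = pd.DataFrame(
    {"Name A": ["John ", "John and Mary ", "Harry ", "Paul ", "Paul and Harry "]}
)

# rstrip() removes whitespace from the right end only; spaces between words stay.
df["Name A"] = df["Name A"].str.rstrip()

print(df["Name A"].tolist())
```

If the values may also carry leading whitespace, str.strip() would cover both ends in the same way.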

Related

How to replace substrings in a Dataframe column, but only at the start of the strings?

I'm trying to replace specific characters in a data frame just if the string of the column starts with the characters specified. I mean, the df is as below:
UBICACION  NAME
AL03       Joe
FL03       Maria
AL07       Karla
DAL5       Marco
The desired output would be:
UBICACION  NAME
FL03       Joe
FL03       Maria
FL07       Karla
DAL5       Marco
This is my try:
df['UBICACION'] = df['UBICACION'].replace ("FL","AL")
The last line is not working, because it changes the whole word and keeps only the specified characters.
I hope you can help me, I'm a little bit new to this. Best regards.
DataFrame.replace includes a regex=True option, so you can use ^AL:
df['UBICACION'] = df['UBICACION'].replace('^AL', 'FL', regex=True)
# UBICACION NAME
# 0 FL03 Joe
# 1 FL03 Maria
# 2 FL07 Karla
# 3 DAL5 Marco
try this:
df["UBICACION"] = df["UBICACION"].apply(lambda x: f"FL{x[2:]}" if x.startswith("AL") else x)
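To see why the ^ anchor matters, here is a small sketch that rebuilds the sample frame from the question and compares an anchored replacement with an unanchored one. The variable names anchored and unanchored are only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "UBICACION": ["AL03", "FL03", "AL07", "DAL5"],
        "NAME": ["Joe", "Maria", "Karla", "Marco"],
    }
)

# ^ anchors the match at the start of each string, so "DAL5" is left alone.
anchored = df["UBICACION"].str.replace(r"^AL", "FL", regex=True)

# Without the anchor, every "AL" is replaced, including the one inside "DAL5".
unanchored = df["UBICACION"].str.replace("AL", "FL", regex=False)

print(anchored.tolist())
print(unanchored.tolist())
```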

Pandas regex, replace group with char

Problem
How to replace X with _, given the following dataframe:
data = {'street':['13XX First St', '2XXX First St', '47X Second Ave'],
'city':['Ashland', 'Springfield', 'Ashland']}
df = pd.DataFrame(data)
The streets need to be edited, replacing each X with an underscore _.
Notice that the number of Integers changes, as does the number of Xs. Also, street names such as Xerxes should not be edited to _er_es, but rather left unedited. Only the street number section should change.
Desired Output
data = {'street':['13__ First St', '2___ First St', '47_ Second Ave'],
'city':['Ashland', 'Springfield', 'Ashland']}
df = pd.DataFrame(data)
Progress
Some potential regex building blocks include:
1. [0-9]+ to capture numbers
2. X+ to capture Xs
3. ([0-9]+)(X+) to capture groups
df['street'].replace("([0-9]+)(X+)", value=r"\2", regex=True, inplace=False)
I'm pretty weak with regex, so my approach may not be the best. Preemptive thank you for any guidance or solutions!
IIUC, this would do:
def repl(m):
    # Keep the leading digits and emit one underscore per X
    return m.group(1) + '_'*len(m.group(2))
df['street'].str.replace(r"^([0-9]+)(X*)", repl, regex=True)
Output:
0 13__ First St
1 2___ First St
2 47_ Second Ave
Name: street, dtype: object
IIUC, we can pass a function into the repl argument much like re.sub
def repl(m):
    # One underscore per matched character
    return '_' * len(m.group())
df['street'].str.replace(r'([X])+', repl, regex=True)
out:
0 13__ First St
1 2___ First St
2 47_ Second Ave
Name: street, dtype: object
If you need to match Xs only when they directly follow a digit, add a lookbehind (?<=\d). The digit is then required but is not part of the match, so repl still counts only the Xs:
df['street'].str.replace(r'(?<=\d)X+', repl, regex=True)
Assuming 'X' only occurs in the number part of the 'street' column, re.sub can be applied to each value:
streetresult = df['street'].apply(lambda s: re.sub('X', '_', s))
Your desired output should be the result
Code I tested
import pandas as pd
import re
data = {'street':['13XX First St', '2XXX First St', '47X Second Ave'],
        'city':['Ashland', 'Springfield', 'Ashland']}
df = pd.DataFrame(data)
# Apply re.sub to each street string rather than to str(Series)
streetresult = df['street'].apply(lambda s: re.sub('X', '_', s))
print(streetresult)
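The question also asks that street names such as Xerxes stay untouched. A sketch of how the anchored, two-group approach handles that is below; the "9XX Xerxes Ave" row and the function name mask_number are hypothetical additions for illustration, not part of the original data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        # The last row is a made-up example with an X inside the street name.
        "street": ["13XX First St", "2XXX First St", "47X Second Ave", "9XX Xerxes Ave"]
    }
)

def mask_number(m):
    # Group 1 holds the leading digits, group 2 the Xs that follow them.
    return m.group(1) + "_" * len(m.group(2))

# ^ restricts the match to the street number at the start of the string.
df["street"] = df["street"].str.replace(r"^(\d+)(X+)", mask_number, regex=True)

print(df["street"].tolist())
```

Because the pattern is anchored at the start and requires digits first, an X later in the string never matches.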

Python "if does not exist, then..." logic?

With the following dataframe, I'm trying to create a new guest_1 column that takes the first two words in each item of the guest column. At the bottom, you can see my desired output.
Is there some sort of "if doesn't exist, then..." logic I can apply here?
I've tried the following, but the obvious difficulty is accounting for a person with a single word for a name.
df.guest_1 = data.guest.str.split().str.get(0) + ' ' + data.guest.str.split().str.get(1)
df = pd.DataFrame(
{'date': ['2018-11-21','2018-02-26'],
'guest': ['Anthony Scaramucci & Michael Avenatti', 'Robyn'],
})
df.guest_1 = ['Anthony Scaramucci', 'Robyn']
You can split, slice, and join. This will gracefully handle out-of-bounds slices:
df.guest.str.split().str[:2].str.join(' ')
df['guest_1'] = df.guest.str.split().str[:2].str.join(' ')
df
date guest guest_1
0 2018-11-21 Anthony Scaramucci & Michael Avenatti Anthony Scaramucci
1 2018-02-26 Robyn Robyn
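As a short sketch of why slicing avoids the "doesn't exist" problem, the following compares .str.get(1), which yields a missing value for one-word names, with the slice-and-join approach. The intermediate names words and second are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"guest": ["Anthony Scaramucci & Michael Avenatti", "Robyn"]})

words = df["guest"].str.split()

# .str.get(1) gives NaN when a row has no second word,
# and NaN then propagates through string concatenation.
second = words.str.get(1)

# A slice never goes out of bounds; it simply returns a shorter list.
df["guest_1"] = words.str[:2].str.join(" ")

print(second.tolist())
print(df["guest_1"].tolist())
```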

Partial string slice (or string split?) in new column

I am trying to figure out how to remove a word from a group of words in a column and insert the removed word into a new column. I figured out how to remove a part of a column and insert it into a new row, but I cannot figure out how to target a specific word, either by position ("Mr." is always the 2nd word) or by taking the text between the first "," and the ".", which is also always constant in my data set.
Name Age New_Name
Doe, Mr. John 23 Mr.
Anna, Mrs. Fox 33 Mrs.
EDITED the above to add another row
How would I remove the "Mr." from the name column and insert it into the "New_Name" column?
So far I have come up with:
data['New_name'] = data.Name.str[:2]
This doesn't allow me to specifically target "Mr." though.
I think I have to use a string.split, but the exact code is eluding me.
If the Mr. is always in the same position, as indicated by your example, this can be accomplished with a list comprehension:
df['New_Name'] = [x.split(' ')[1] for x in df['Name']]
and
df['Name'] = [' '.join(x.split(' ')[::2]) for x in df['Name']]
First, get the title from the name (it sits between the comma and the dot) and store it in another column. Then use the same slice to remove the title from the 'Name' column:
import pandas as pd
df = pd.DataFrame({'Name':['Doe, Mr. John', 'Anna, Ms. Fox'], 'Age':[23,33]})
# Slice from just after the comma up to the last dot; strip() drops the leading space
df['New_Name'] = df['Name'].apply(lambda x: (x[x.find(',')+len(','):x.rfind('.')]+'.').strip())
df['Name'] = df['Name'].apply(lambda x: x.replace(x[x.find(',')+len(','):x.rfind('.')]+'.',''))
print(df)
Output:
Age Name New_Name
0 23 Doe, John Mr.
1 33 Anna, Fox Ms.
You can use the pandas str.extract and str.replace methods.
First extract the title to form a new column:
df['New_Name'] = df['Name'].str.extract(r',\s([A-Za-z]+\.)', expand=False)
Then use replace to swap the extracted string, together with the spaces around it, for a single space:
df['Name'] = df['Name'].str.replace(r'\s[A-Za-z]+\.\s', ' ', regex=True)
You get:
Age Name New_Name
0 23 Doe, John Mr.
name = "Doe, Mr. John"
# if you always expect a title (Mr/Ms) between comma and dot
# split to lastname, title and firstname and strip spaces
newname = [ n.strip() for n in name.replace(".", ",").split(",") ]
print(newname)
#> ['Doe', 'Mr', 'John']
Then you can print a title column and a firstname-lastname column, or any other combination of them.
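The comma-and-dot structure described in these answers can also be captured in one pass. This is only a sketch, assuming every name has the form "Last, Title. First"; the group names last, title, and first are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Doe, Mr. John", "Anna, Mrs. Fox"], "Age": [23, 33]})

# Named groups become column names in the frame returned by str.extract.
parts = df["Name"].str.extract(
    r"^(?P<last>[^,]+),\s*(?P<title>[A-Za-z]+\.)\s*(?P<first>.+)$"
)

df["New_Name"] = parts["title"]
df["Name"] = parts["last"] + ", " + parts["first"]

print(df)
```

Rows that do not follow the assumed layout would come back as missing values, so they are easy to spot afterwards.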

Reversing names in pandas

I have a dataframe with a Name column like this:
How can I use pandas to reverse the names in the format "xxx, xxx" efficiently? Also if you have other string cleaning tips for munging names like these I would appreciate it!
Maybe you can try something like this, using the reversed function:
d = {'name':['Bran Stark','Jon Snow','Rhaegar Targaryen']}
df = pd.DataFrame(data=d)
df['new name'] = df['name'].apply(lambda x : ', '.join(reversed(x.split(' '))))
print(df['new name'])
0 Stark, Bran
1 Snow, Jon
2 Targaryen, Rhaegar
Use Series.str.replace to perform regex string substitutions:
df['Name'] = df['Name'].str.replace(r'(.+),\s+(.+)', r'\2 \1', regex=True)
The regex pattern (.+),\s+(.+) means
( begin group #1
.+ match 1-or-more of any character
) end group #1
, match a literal comma
\s+ match 1-or-more whitespace characters
( begin group #2
.+ match 1-or-more of any character
) end group #2
The second argument r'\2 \1', tells str.replace to replace substrings that match the pattern with group #2 followed by a space, followed by group #1.
import pandas as pd
names = '''\
John Snow
Black, Jack
Jim Bean/
Draper, Don
'''
df = pd.DataFrame({'Name': names.splitlines()})
# Name
# 0 John Snow
# 1 Black, Jack
# 2 Jim Bean/
# 3 Draper, Don
df['Name'] = df['Name'].str.replace(r'(.+),\s+(.+)', r'\2 \1', regex=True)
yields
Name
0 John Snow
1 Jack Black
2 Jim Bean/
3 Don Draper
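The backreference behaviour can be checked outside pandas with the standard re module. A minimal sketch, where the helper name swap is hypothetical:

```python
import re

PATTERN = r"(.+),\s+(.+)"

def swap(name):
    # \2 and \1 refer to the second and first captured groups.
    # Strings without a comma do not match and are returned unchanged.
    return re.sub(PATTERN, r"\2 \1", name)

for name in ["John Snow", "Black, Jack", "Jim Bean/", "Draper, Don"]:
    print(swap(name))
```

Only the "Last, First" rows are rewritten, which matches the pandas result shown above.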
