Pandas - Replace substrings from a column if not numeric - python

I have a list of suffixes I want to remove, say suffixes = ['inc','co','ltd'].
I want to remove these from a column in a Pandas dataframe, and I have been doing this:
df['name'] = df['name'].str.replace('|'.join(suffixes), '', regex=True)
This works, but I do NOT want to remove the suffix if what remains is numeric. For example, if the name is 123 inc, I don't want to strip the 'inc'. Is there a way to add this condition in the code?

Use a regex with a negative lookbehind.
Ex:
import pandas as pd

suffixes = ['inc','co','ltd']
df = pd.DataFrame({"Col": ["Abc inc", "123 inc", "Abc co", "123 co"]})
df['Col_2'] = df['Col'].str.replace(r"(?<!\d) \b(" + '|'.join(suffixes) + r")\b", '', regex=True)
print(df)
Output:
       Col    Col_2
0  Abc inc      Abc
1  123 inc  123 inc
2   Abc co      Abc
3   123 co   123 co
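An alternative sketch that sidesteps the lookbehind entirely: strip the suffix unconditionally, then restore the original value wherever the remainder is purely numeric (same sample data; the intermediate stripped series is just an illustration):

```python
import pandas as pd

suffixes = ['inc', 'co', 'ltd']
df = pd.DataFrame({"Col": ["Abc inc", "123 inc", "Abc co", "123 co"]})

# Strip the suffix everywhere first...
stripped = (df['Col']
            .str.replace(r'\s*\b(' + '|'.join(suffixes) + r')\b', '', regex=True)
            .str.strip())
# ...then keep the original value wherever the remainder is all digits
df['Col_2'] = stripped.where(~stripped.str.fullmatch(r'\d+'), df['Col'])
```

This produces the same result as the lookbehind version and may be easier to extend with further conditions on the stripped remainder.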

Try anchoring on a non-numeric prefix. The regex ^[^0-9]+ means "one or more non-digit characters at the start of the string". Capture it so the prefix survives the replacement:
non_numeric_regex = r"^([^0-9]+)"
suffixes = ['inc', 'co', 'ltd']
pattern = non_numeric_regex + '(?:' + '|'.join(suffixes) + ')'
df['name'] = df['name'].str.replace(pattern, r'\1', regex=True)
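A variant sketch that captures the non-numeric prefix so it is kept in the output (made-up sample names; a final strip drops the leftover trailing space):

```python
import pandas as pd

suffixes = ['inc', 'co', 'ltd']
# Capture the non-digit prefix in group 1 so it survives the replacement
pattern = r'^([^0-9]+)(?:' + '|'.join(suffixes) + r')$'
df = pd.DataFrame({'name': ['Abc inc', '123 inc', 'Xyz ltd']})
df['name'] = df['name'].str.replace(pattern, r'\1', regex=True).str.strip()
```

Names that start with a digit, like '123 inc', never match the pattern and are left untouched.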

Related

How to identify records in a DataFrame (Python/Pandas) that contains leading or trailing spaces

I would like to know how to write a formula that would identify/display records of string/object data type on a Pandas DataFrame that contains leading or trailing spaces.
The purpose for this is to get an audit on a Jupyter notebook of such records before applying any strip functions.
The goal is for the script to identify these records automatically without having to type the name of the columns manually. The scope should be any column of str/object data type that contains a value that includes either a leading or trailing spaces or both.
Please note: I would like to see the resulting output in a dataframe format.
Thank you!
Link to sample dataframe data
You can use:
df['col'].str.startswith(' ')
df['col'].str.endswith(' ')
or with a regex:
df['col'].str.match(r'\s+')
df['col'].str.contains(r'\s+$')
Example:
df = pd.DataFrame({'col': [' abc', 'def', 'ghi ', ' jkl ']})
df['start'] = df['col'].str.startswith(' ')
df['end'] = df['col'].str.endswith(' ')
df['either'] = df['start'] | df['end']
     col  start    end  either
0    abc   True  False    True
1    def  False  False   False
2   ghi   False   True    True
3   jkl    True   True    True
However, this is likely not faster than directly stripping the spaces:
df['col'] = df['col'].str.strip()
col
0 abc
1 def
2 ghi
3 jkl
Updated answer
To detect the columns with leading/trailing spaces, you can use:
cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()
cols[cols].index
Example on the provided link:
Index(['First Name', 'Team'], dtype='object')
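To get the audit back as a dataframe rather than a column index, the same test can be applied row-wise; a minimal sketch on made-up sample data (the column names here are just placeholders):

```python
import pandas as pd

df = pd.DataFrame({'First Name': [' Alice', 'Bob'],
                   'Team': ['Red ', 'Blue'],
                   'Age': [30, 25]})

# Check only the string (object) columns, flag rows with leading/trailing whitespace
obj = df.select_dtypes('object')
mask = obj.apply(lambda c: c.str.contains(r'^\s+|\s+$', na=False)).any(axis=1)
print(df[mask])  # only the offending rows, all columns, as a dataframe
```

Restricting the check to object columns avoids .str errors on numeric columns, and na=False keeps missing values from poisoning the mask.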

Remove * from a specific column value

For this dataframe, what is the best way to get rid of the * in "Stad Brussel*"? In the real dataframe, the * is superscripted; please refer to the pic. Thanks.
            Dutch name  postcode  Population
0           Anderlecht      1070      118241
1             Oudergem      1160       33313
2  Sint-Agatha-Berchem      1082       24701
3        Stad Brussel*      1000      176545
4            Etterbeek      1040       47414
Desired results:
            Dutch name  postcode  Population
0           Anderlecht      1070      118241
1             Oudergem      1160       33313
2  Sint-Agatha-Berchem      1082       24701
3         Stad Brussel      1000      176545
4            Etterbeek      1040       47414
You can try:
df['Dutch name'] = df['Dutch name'].replace({r'\*': ''}, regex=True)
This will remove all * characters in the 'Dutch name' column. If you need to remove the character from multiple columns, use:
df.replace({r'\*': ''}, regex=True)
If you are manipulating only strings, you can use regular expression matching. See here.
Something like:
import re
txt = 'Your file as a string here'
out = re.sub(r'\*', '', txt)
out now contains what you want.
For a dataframe, first define the column(s) to be checked:
cols_to_check = ['4']
then:
df[cols_to_check] = df[cols_to_check].replace({r'\*': ''}, regex=True)
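Putting the Series-level replacement together on an abbreviated version of the table above:

```python
import pandas as pd

df = pd.DataFrame({'Dutch name': ['Anderlecht', 'Stad Brussel*', 'Etterbeek'],
                   'postcode': [1070, 1000, 1040]})

# The * must be escaped: a bare '*' is not a valid regular expression
df['Dutch name'] = df['Dutch name'].replace(r'\*', '', regex=True)
```

The postcode column is untouched because the replacement is applied only to the one Series.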

Replace values in dataframe column (regex)

I have a dataframe column with names:
df = pd.DataFrame({'Names': ['ROS-053', 'ROS-54', 'ROS-51', 'ROS-051B', 'ROS-051A', 'ROS-52']})
df.replace(to_replace=r'[a-zA-Z]{3}-\d{2}$', value='new', regex=True)
The format needs to be three letters, followed by -, then three numbers. So ROS-51 should be replaced with ROS-051, and ROS-051B should be ROS-051. I have tried numerous things but can't seem to figure it out.
Any help would be highly appreciated:)
You can do:
df['Names'] = df.Names.replace(r'^([a-zA-Z]{3})-0?(\d{2})(.*)$', r'\1-0\2', regex=True)
Output:
Names
0 ROS-053
1 ROS-054
2 ROS-051
3 ROS-051
4 ROS-051
5 ROS-052
Here is one option using regex replacement with a callback:
repl = lambda m: m.group(1) + ('00' + m.group(2))[-3:] + m.group(3)
df['Names'] = df['Names'].str.replace(r'^([A-Z]{3}-)(\d+)(.*)$', repl, regex=True)
Note this answer is flexible: it left-pads either a one- or two-digit number with zeroes up to three digits.
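A runnable sketch of the callback approach (note that it deliberately keeps any trailing letter via the third group, so with this variant ROS-051B stays ROS-051B):

```python
import pandas as pd

# Left-pad the digits to three with zeroes; keep the prefix and any trailing text
repl = lambda m: m.group(1) + ('00' + m.group(2))[-3:] + m.group(3)
s = pd.Series(['ROS-053', 'ROS-54', 'ROS-51', 'ROS-051B'])
out = s.str.replace(r'^([A-Z]{3}-)(\d+)(.*)$', repl, regex=True)
```

If the trailing letter should be dropped to match the desired output, simply omit m.group(3) from the callback.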
Here's another way to do it:
df = pd.DataFrame({'Names': ['ROS-053', 'ROS-54', 'ROS-51', 'ROS-051B', 'ROS-051A', 'ROS-52']})
df['Names'] = df['Names'].str.replace(r'[A-Z]$', '', regex=True)
df['Names'] = df['Names'].str.split('-').str[0] + '-' + df['Names'].str.split('-').str[1].apply(lambda x: x.zfill(3))
print(df)
Output:
Names
0 ROS-053
1 ROS-054
2 ROS-051
3 ROS-051
4 ROS-051
5 ROS-052

Is there a way in pandas to remove duplicates from within a series?

I have a dataframe which has some duplicate tags separated by commas in the "Tags" column. Is there a way to remove the duplicate strings from the series? I want the output in row 400 to have just Museum, Drinking, Shopping.
I can't split on a comma & remove them because there are some tags in the series that have similar words like for example: [Museum, Art Museum, Shopping] so splitting and dropping multiple museum strings would affect the unique 'Art Museum' string.
Desired Output
You can split by comma and convert to a set(), which removes duplicates, after removing leading/trailing white space with str.strip(). Then, you can apply() this to your column (note that a set does not preserve the original order of the tags):
df['Tags'] = df['Tags'].apply(lambda x: ', '.join(set([y.strip() for y in x.split(',')])))
You can create a function that removes duplicates from a given string. Then apply this function to your column Tags.
def remove_dup(strng):
    '''Split a comma-separated string and rejoin it,
    dropping duplicate entries while preserving order.'''
    return ', '.join(dict.fromkeys(strng.split(', ')))

df['Tags'] = df['Tags'].apply(remove_dup)
DEMO:
import pandas as pd
my_dict = {'Tags':["Museum, Art Museum, Shopping, Museum",'Drink, Drink','Shop','Visit'],'Country':['USA','USA','USA', 'USA']}
df = pd.DataFrame(my_dict)
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
df
Output:
                           Tags Country
0  Museum, Art Museum, Shopping     USA
1                         Drink     USA
2                          Shop     USA
3                         Visit     USA
Without a code example to work from, I've thrown together something that would work.
import pandas as pd
test = [['Museum', 'Art Museum', 'Shopping', 'Museum']]
df = pd.DataFrame()
df[0] = test
df[0] = df[0].map(set)
Out[35]:
0
0 {Museum, Shopping, Art Museum}
One approach that avoids apply:
# in your code just s = df['Tags']
s = pd.Series(['', '', 'Tour',
               'Outdoors, Beach, Sports',
               'Museum, Drinking, Drinking, Shopping'])
(s.str.split(r',\s+', expand=True)
.stack()
.reset_index()
.drop_duplicates(['level_0',0])
.groupby('level_0')[0]
.agg(','.join)
)
Output:
level_0
0
1
2 Tour
3 Outdoors,Beach,Sports
4 Museum,Drinking,Shopping
Name: 0, dtype: object
There may be fancier ways of doing this kind of thing, but this will do the job.
Make everything lower-case:
data['tags'] = data['tags'].str.lower()
Split every row in the tags column on commas; this returns a list of strings:
data['tags'] = data['tags'].str.split(',')
Map str.strip over every element of each list (removing leading/trailing spaces), then apply set to drop the duplicates:
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip, x)))
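The three steps strung together on a tiny sample frame (the tags column name is assumed from the snippets above):

```python
import pandas as pd

data = pd.DataFrame({'tags': ['Museum, Drinking, Drinking, Shopping']})

data['tags'] = data['tags'].str.lower()                              # lower-case
data['tags'] = data['tags'].str.split(',')                           # split on commas into lists
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip, x)))  # strip whitespace + dedupe
```

The result for the sample row is the set {'museum', 'drinking', 'shopping'}.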

Reversing names in pandas

I have a dataframe with a Name column like this:
How can I use pandas to reverse the names in the format "xxx, xxx" efficiently? Also if you have other string cleaning tips for munging names like these I would appreciate it!
Maybe you can try something like this with the reversed function:
d = {'name': ['Bran Stark', 'Jon Snow', 'Rhaegar Targaryen']}
df = pd.DataFrame(data=d)
df['new name'] = df['name'].apply(lambda x: ', '.join(reversed(x.split(' '))))
print(df['new name'])
0 Stark, Bran
1 Snow, Jon
2 Targaryen, Rhaegar
Use Series.str.replace to perform regex string substitutions:
df['Name'] = df['Name'].str.replace(r'(.+),\s+(.+)', r'\2 \1', regex=True)
The regex pattern (.+),\s+(.+) means
( begin group #1
.+ match 1-or-more of any character
) end group #1
, match a literal comma
\s+ match 1-or-more whitespace characters
( begin group #2
.+ match 1-or-more of any character
) end group #2
The second argument, r'\2 \1', tells str.replace to replace substrings that match the pattern with group #2, followed by a space, followed by group #1.
import pandas as pd
names = '''\
John Snow
Black, Jack
Jim Bean/
Draper, Don
'''
df = pd.DataFrame({'Name': names.splitlines()})
# Name
# 0 John Snow
# 1 Black, Jack
# 2 Jim Bean/
# 3 Draper, Don
df['Name'] = df['Name'].str.replace(r'(.+),\s+(.+)', r'\2 \1', regex=True)
yields
Name
0 John Snow
1 Jack Black
2 Jim Bean/
3 Don Draper
