Faster way to remove punctuation and special characters in a pandas dataframe column - python

I'm using the code below to remove special characters and punctuation from a column in a pandas dataframe, but this approach with re.sub is not time efficient. Are there other options I could try for better time efficiency? Or is the way I'm removing special characters and writing them back into the column what's causing the major computational burden?
import re, string

for n, s in data['text'].iteritems():
    data.at[n, 'text'] = re.sub(f'[{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’]', '', s)

One way would be to keep only alphanumeric characters. Consider this dataframe:
import pandas as pd

df = pd.DataFrame({'Text': ['#^#346fetvx#!.,;:', 'fhfgd54#!#><?']})
Text
0 #^#346fetvx#!.,;:
1 fhfgd54#!#><?
You can use
df['Text'] = df['Text'].str.extract(r'(\w+)', expand=False)
Text
0 346fetvx
1 fhfgd54
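Note that str.extract with (\w+) returns only the first run of word characters, which works here because each string contains a single run; something like '#ab#cd' would come back as just 'ab'. If punctuation can appear mid-string, a vectorized replace that strips every non-word character is safer (a sketch; this keeps underscores, since \w counts them as word characters):
df['Text'] = df['Text'].str.replace(r'\W+', '', regex=True)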

Use regex and a lambda function:
import re
data['PROD_NAME'] = data['PROD_NAME'].apply(lambda x: re.sub('[^A-Za-z0-9]', ' ', x))
This removes every character except letters and digits, replacing each with a space.
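Since the original question is about speed, the same substitution can also be written with the vectorized string accessor, avoiding the Python-level lambda (a sketch; the speedup varies by pandas version, so benchmark on your own data):
data['PROD_NAME'] = data['PROD_NAME'].str.replace('[^A-Za-z0-9]', ' ', regex=True)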

Related

Remove Characters From A String Until A Specific Format is Reached

So I have the following strings and I have been trying to figure out how to manipulate them in such a way that I get a specific format.
string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo
I want to get rid of the last part of the string so I am just left with the month and year, like below:
string1-itd_jan2021
string2itd_mar2021
string3itd_feb2021
string4-itd_mar2021
string5itd_jun2021
string6-itd_feb2021
I thought about using string.split on the - but then realized that for some strings this wouldn't work. I also thought about putting the string into a list and slicing off a set number of characters, but the endings vary in length.
Is there anything I can do with regex or any other Python module?
Use str.rsplit with the appropriate maxsplit parameter:
s = s.rsplit("-", 1)[0]
You could also use str.split (even though this is clearly the worse choice):
s = "-".join(s.split("-")[:-1])
Or using regular expressions:
s = re.sub(r'-[^-]*$', '', s)
# "-[^-]*" a "-" followed by any number of non-"-"
With a regex:
import re
re.sub(r'([0-9]{4}).*$', r'\1', s)
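For example:
>>> import re
>>> re.sub(r'([0-9]{4}).*$', r'\1', 'string6-itd_feb2021-apollo')
'string6-itd_feb2021'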
Use re.sub like so:
import re
lines = '''string1-itd_jan2021-internal
string2itd_mar2021-space
string3itd_feb2021-internal
string4-itd_mar2021-moon
string5itd_jun2021-internal
string6-itd_feb2021-apollo'''
for old in lines.split('\n'):
    new = re.sub(r'[-][^-]+$', '', old)
    print('\t'.join([old, new]))
Prints:
string1-itd_jan2021-internal string1-itd_jan2021
string2itd_mar2021-space string2itd_mar2021
string3itd_feb2021-internal string3itd_feb2021
string4-itd_mar2021-moon string4-itd_mar2021
string5itd_jun2021-internal string5itd_jun2021
string6-itd_feb2021-apollo string6-itd_feb2021
Explanation:
r'[-][^-]+$' : Literal dash (-), followed by any character other than a dash ([^-]) repeated 1 or more times, followed by the end of the string ($).

python remove variable substring patterns from a long string in a python dataframe column

I have a column in my dataframe, containing very large strings.
Here is a short sample of the string:
FixedChar{3bf3423 Data to keep}, FixedChar{5e0d20 Data to keep}, FixedChar{6cb86d9 Data to keep}, ...
I need to remove the recurring static "FixedChar{", the variable substring after it (which has a fixed length of 6), and the closing "}", keeping only the "Data to keep" strings, which have variable lengths.
What is the best way to remove this recurring variable pattern?
It was easier than I thought. At first I used re.sub() from the re library.
The regex \w* removes all the word characters (letters and digits) after "FixedChar{", and the argument flags=re.I makes it case insensitive.
import re
re.sub(r"FixedChar{\w*","",dataFrame.Column[row],flags = re.I)
but I found str.replace() more useful, and replaced the values in my dataFrame using loc, as I needed to filter my dataframe because this pattern shows up only in specific rows.
dataFrame.loc[:, 'Column'] = dataFrame.Column.str.replace(r"FixedChar{\w* ", '', regex=True)
dataFrame.loc[:, 'Column'] = dataFrame.Column.str.replace("}", '', regex=False)

using str.replace() to remove nth character from a string in a pandas dataframe

I have a pandas dataframe that consists of strings. I would like to remove the n-th character from the end of the strings. I have the following code:
DF = pandas.DataFrame({'col': ['stri0ng']})
DF['col'] = DF['col'].str.replace('(.)..$', '', regex=True)
Instead of removing the third to the last character (0 in this case), it removes 0ng. The result should be string but it outputs stri. Where am I wrong?
You may rather want to replace a single character that is followed by n-1 characters at the end of the string:
DF['col'] = DF['col'].str.replace(r'.(?=.{2}$)', '', regex=True)
col
0 string
If you want to make sure you're only removing digits (so that 'string' in one special row doesn't get changed to 'strng'), then use something like '[0-9](?=.{2}$)' as pattern.
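A quick check of the digit-only variant:
>>> import pandas as pd
>>> pd.Series(['stri0ng', 'string']).str.replace(r'[0-9](?=.{2}$)', '', regex=True)
0 string
1 string
dtype: object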
Another way using pd.Series.str.slice_replace:
DF['col'].str.slice_replace(4, 5, '')
Output:
0 string
Name: col, dtype: object
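Since slice_replace follows ordinary Python slice semantics, negative indices should also work, which is more direct when counting from the end of the string as the question asks (a sketch):
DF['col'].str.slice_replace(-3, -2, '')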

Pandas to match column contents to keywords (with spaces and brackets)

I have a list of keywords that I want to match against a column in my data frame.
I want to check whether each row contains any of the keywords. If yes, print them.
Tried below:
import pandas as pd
import re
Keywords = [
    "Caden(S, A)",
    "Caden(a",
    "Caden(.A))",
    "Caden.Q",
    "Caden.K",
    "Caden"
]
data = {'People' : ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]}
df = pd.DataFrame(data)
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
print df["found"]
It returns NaN. I guess the challenge lies in the spaces and brackets in the keywords.
What's the right way to get the ideal output below? Thank you.
Caden(S, A); Caden.K
Caden(a
Caden(.A))
Caden.Q
Since you do not need to find every keyword, but rather the longest ones when they overlap, you may use a regex findall approach.
The point here is that you need to sort the keywords by length in descending order first (because there are whitespaces in them). Then you need to escape these values, as they contain special characters. Finally, you must replace the word boundaries with the unambiguous (?<!\w) and (?!\w) lookarounds (note that \b is context dependent).
Use
pat = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
See this Python demo:
import re
Keywords = ["Caden(S, A)", "Caden(a","Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
rx = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
# => (?<!\w)(?:Caden\(S,\ A\)|Caden\(\.A\)\)|Caden\(a|Caden\.Q|Caden\.K|Caden)(?!\w)
strs = ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]
for s in strs:
    print(re.findall(rx, s))
Output
['Caden(S, A)', 'Caden.K']
['Caden(a']
['Caden(.A))']
['Caden.Q']
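Plugging the corrected pattern back into the original dataframe code then produces the desired found column:
df["found"] = df['People'].str.findall(rx).str.join('; ')
print(df["found"])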
Hey, I don't know if this solution is optimal, but it works. I just replaced the dot with 8, '(' with 6 and ')' with 9. I don't know why those characters are ignored by str.findall? (Presumably because they are regex metacharacters, and \b does not match as expected next to non-word characters.)
A kind of bijection between {8, 6, 9} and {'.', '(', ')'}:
for i in range(len(Keywords)):
    Keywords[i] = Keywords[i].replace('(','6').replace(')','9').replace('.','8')
for i in range(len(df['People'])):
    df['People'][i] = df['People'][i].replace('(','6').replace(')','9').replace('.','8')
And then you apply your function
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
Final step: get back the {'.', '(', ')'}:
for i in range(len(df['found'])):
    df['found'][i] = df['found'][i].replace('6','(').replace('9',')').replace('8','.')
    df['People'][i] = df['People'][i].replace('6','(').replace('9',')').replace('8','.')
Voilà

str.translate() method gives error against Pandas series

I have a DataFrame of 3 columns. The two columns I wish to manipulate are Dog_Summary and Dog_Description. These columns contain strings, and I wish to remove any punctuation they may have.
I have tried the following:
df[['Dog_Summary', 'Dog_Description']] = df[['Dog_Summary', 'Dog_Description']].apply(lambda x: x.str.translate(None, string.punctuation))
For the above I get an error saying:
ValueError: ('deletechars is not a valid argument for str.translate in python 3. You should simply specify character deletions in the table argument', 'occurred at index Summary')
The second way I tried was:
df[['Dog_Summary', 'Dog_Description']] = df[['Dog_Summary', 'Dog_Description']].apply(lambda x: x.replace(string.punctuation, ' '))
However, it still does not work!
Can anyone give me suggestions or advice? Thanks! :)
I wish to remove any punctuation it may have.
You can use a regular expression and string.punctuation for this:
>>> import pandas as pd
>>> from string import punctuation
>>> s = pd.Series(['abcd$*%&efg', ' xyz#)$(#rst'])
>>> s.str.replace(rf'[{punctuation}]', '', regex=True)
0 abcdefg
1 xyzrst
dtype: object
The first argument to .str.replace() can be a regular expression. In this case, you can use f-strings and a character class to catch any of the punctuation characters:
>>> rf'[{punctuation}]'
'[!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~]' # ' and \ are escaped
If you want to apply this to a DataFrame, just follow what you're doing now:
df.loc[:, cols] = df[cols].apply(lambda s: s.str.replace(rf'[{punctuation}]', '', regex=True))
Alternatively, you could use s.replace(rf'[{punctuation}]', '', regex=True) (no .str accessor).
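If you specifically want str.translate, the error message itself points at the Python 3 fix: build a deletion table with str.maketrans and pass only the table. A sketch, reusing the Series above:
>>> table = str.maketrans('', '', punctuation)
>>> s.str.translate(table)
0 abcdefg
1 xyzrst
dtype: object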
