I have a DataFrame with 3 columns. The two columns I want to manipulate are Dog_Summary and Dog_Description. These columns contain strings, and I want to remove any punctuation they may have.
I have tried the following:
df[['Dog_Summary', 'Dog_Description']] = df[['Dog_Summary', 'Dog_Description']].apply(lambda x: x.str.translate(None, string.punctuation))
For the above I get an error saying:
ValueError: ('deletechars is not a valid argument for str.translate in python 3. You should simply specify character deletions in the table argument', 'occurred at index Summary')
The second way I tried was:
df[['Dog_Summary', 'Dog_Description']] = df[['Dog_Summary', 'Dog_Description']].apply(lambda x: x.replace(string.punctuation, ' '))
However, it still does not work!
Can anyone give me suggestions or advice?
Thanks! :)
I wish to remove any punctuation it may have.
You can use a regular expression and string.punctuation for this:
>>> import pandas as pd
>>> from string import punctuation
>>> s = pd.Series(['abcd$*%&efg', ' xyz#)$(#rst'])
>>> s.str.replace(rf'[{punctuation}]', '', regex=True)
0 abcdefg
1 xyzrst
dtype: object
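The first argument to .str.replace() can be a regular expression when regex=True is passed (recent pandas versions treat the pattern as a literal string by default). In this case, you can use an f-string and a character class to catch any of the punctuation characters: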
>>> rf'[{punctuation}]'
'[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]' # ' and \ are escaped
If you want to apply this to a DataFrame, just follow what you're doing now:
df.loc[:, cols] = df[cols].apply(lambda s: s.str.replace(rf'[{punctuation}]', '', regex=True))
Alternatively, you could use s.replace(rf'[{punctuation}]', '', regex=True) (no .str accessor).
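A minimal sketch of the DataFrame case, assuming a small made-up frame that uses the two column names from the question (the sample rows and the Dog_Id column are invented for illustration). It builds the character class with re.escape, so it does not depend on how the punctuation characters happen to be ordered, and it passes regex=True explicitly:

```python
import re
from string import punctuation

import pandas as pd

# Hypothetical sample data; only the two column names come from the question.
df = pd.DataFrame({
    "Dog_Summary": ["Good boy!", "Loud, but friendly."],
    "Dog_Description": ["Likes: walks & naps", "Age (approx.) 3"],
    "Dog_Id": [1, 2],
})

cols = ["Dog_Summary", "Dog_Description"]

# re.escape makes every punctuation character literal inside the class.
pattern = f"[{re.escape(punctuation)}]"

# Apply the vectorised replacement to each selected column.
df[cols] = df[cols].apply(lambda s: s.str.replace(pattern, "", regex=True))
print(df)
```

Removing a character that sat between two spaces leaves a double space behind; a follow-up replacement of r"\s+" with a single space tidies that up if it matters.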
I need to replace substrings in a column value in dataframe
Example: I have this column 'code' in a dataframe (in reality, the dataframe is very large)
3816R(motor) #I need '3816R'
97224(Eletro)
502812(Defletor)
97252(Defletor)
97525(Eletro)
5725 ( 56)
And I have this list to replace the values:
list = ['(motor)', '(Eletro)', '(Defletor)', '(Eletro)', '( 56)']
I've tried a lot of methods, like:
df['code'] = df['code'].str.replace(list, '')
I also tried regex=True, but none of the methods removed the substrings.
How can I do that?
You can try a regex replace with a regex "or" (alternation): https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
https://www.ocpsoft.org/tutorials/regular-expressions/or-in-regex/
l = ['(motor)', '(Eletro)', '(Defletor)', '( 56)']
l = [s.replace('(', r'\(').replace(')', r'\)') for s in l]  # raw strings avoid invalid escape sequences
regex_str = f"({'|'.join(l)})"
df['code'] = df['code'].str.replace(regex_str, '', regex=True)
The regex_str will end up with something like
"(\(motor\)|\(Eletro\)|\(Defletor\)|\( 56\))"
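As a sketch of the same idea, re.escape can do the escaping instead of chained replace calls; it also covers any other regex metacharacters that might appear in the list. The sample rows are taken from the question, and the name to_remove is only illustrative:

```python
import re

import pandas as pd

df = pd.DataFrame(
    {"code": ["3816R(motor)", "97224(Eletro)", "502812(Defletor)", "5725 ( 56)"]}
)
to_remove = ["(motor)", "(Eletro)", "(Defletor)", "( 56)"]

# Escape each literal substring, then join them into one alternation.
pattern = "|".join(re.escape(s) for s in to_remove)

# Remove every listed substring, then trim leftover whitespace.
df["code"] = df["code"].str.replace(pattern, "", regex=True).str.strip()
print(df["code"].tolist())  # ['3816R', '97224', '502812', '5725']
```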
If you are certain any and all rows follow the format provided, you could attempt the following by using a lambda function:
df['code_clean'] = df['code'].apply(lambda x: x.split('(')[0])
You can try the regular expression match method:
https://docs.python.org/3/library/re.html#re.Pattern.match
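df['code'] = df['code'].apply(lambda x: re.match(r'^(\w+)\s*\(', x).group(1))
The first part of the regular expression, ^(\w+), creates a capturing group of the letters, digits or underscores at the start of the string, up to the parenthesis; \s*\( allows optional whitespace before the opening parenthesis, as in '5725 ( 56)'. group(1) then extracts the captured text. This needs import re and assumes every row matches, because re.match returns None otherwise.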
str.replace works with one string, not a list of strings, so you could loop through the list. Pass regex=False so the parentheses are treated literally:
rmlist = ['(motor)', '(Eletro)', '(Defletor)', '(Eletro)', '( 56)']
for repl in rmlist:
    df['code'] = df['code'].str.replace(repl, '', regex=False)
Alternatively, if the bracketed substring is always at the end, split at "(" and discard the additional column that is generated. This should also be faster:
df["code"]=df["code"].str.split(pat="(",n=1,expand=True)[0]
str.split is reasonably fast
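A small sketch comparing the split approach with a single anchored regex, using rows from the question. Both assume the bracketed part is the last thing in each value; the variable names are illustrative:

```python
import pandas as pd

codes = pd.Series(["3816R(motor)", "97224(Eletro)", "5725 ( 56)"])

# Split once at the first "(" and keep the left-hand piece, trimming spaces.
by_split = codes.str.split("(", n=1).str[0].str.strip()

# Or delete optional whitespace plus everything from "(" to the closing ")"
# at the end of the string.
by_regex = codes.str.replace(r"\s*\(.*\)$", "", regex=True)

print(by_split.tolist())  # ['3816R', '97224', '5725']
print(by_regex.tolist())  # ['3816R', '97224', '5725']
```

The strip call matters for values such as '5725 ( 56)', where a space precedes the parenthesis.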
I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the , comma with - dash. I'm currently using this method but nothing is changed.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced
with value
Series.replace only replaces values that match to_replace in full, and no value here is exactly ',', so no replacement occurs. Compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
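The difference between the three calls can be seen side by side in a short sketch that reuses the two-row example above (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(["(2,30)", ","])

# Series.replace without regex: only values equal to "," are replaced.
whole_value = s.replace(",", "-")

# Series.replace with regex=True: the pattern is replaced inside each string.
substring = s.replace(",", "-", regex=True)

# The .str accessor always works on substrings.
vectorised = s.str.replace(",", "-", regex=False)

print(whole_value.tolist())  # ['(2,30)', '-']
print(substring.tolist())    # ['(2-30)', '-']
print(vectorised.tolist())   # ['(2-30)', '-']
```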
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built in replace method available on a dataframe object.
df.replace(',', '-', regex=True)
Source: Docs
If you only need to replace characters in one specific column, and regex=True and inplace=True both failed for you, this approach should work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
Here apply calls the lambda once per entry, much like a for loop.
x represents each entry in the current column in turn, so x.replace is the ordinary Python str.replace.
The only thing you need to do is to change the "column_name", "characters_need_to_replace" and "new_characters".
Replace all spaces with underscores in the column names:
data.columns= data.columns.str.replace(' ','_',regex=True)
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')', '']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)
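Similar to the answer by Nancy K, this works for me. It uses the vectorised .str accessor directly on the column, because the values passed to a Series.apply lambda are plain strings and have no .str attribute:
data["column_name"] = data["column_name"].str.replace("characters_need_to_replace", "new_characters")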
If you want to remove two or more characters from a string, for example '$' and ',':
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> [ 100000, 1100000 ]
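A runnable sketch of that replacement, assuming the goal is to end up with integers. The frame is built from the two sample values above, and the conversion with astype(int) is an added step that is not part of the original answer:

```python
import pandas as pd

data = pd.DataFrame({"Column_Name": ["$100,000", "$1,100,000"]})

# Inside a character class, "$" is a literal dollar sign, not an end anchor.
cleaned = data["Column_Name"].str.replace(r"[$,]", "", regex=True)

# str.replace returns a new Series, so assign the result back to keep it.
data["Column_Name"] = cleaned.astype(int)
print(data["Column_Name"].tolist())  # [100000, 1100000]
```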
I have a column
|ABC|
-----
|JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ|
I want to check whether the words 'DK' and 'PK' are both present in the row or not. I need to do this with different words across the entire column.
match = ['DK', 'PK']
I used df.ABC.str.split('_').isin(match). The split produces a list per row, but then I get this error:
SystemError: <built-in method view of numpy.ndarray object at
0x0000021171056DB0> returned a result with an error set
What is the best way to get the expected output, which is a bool True|False
Thanks.
Maybe either of the two following options:
(?:[A-Z\d]+_)*?([DP]K)\d*_(?:[A-Z\d]+_)*?(?!\1)([DP]K)\d*(?:_[A-Z\d]+)*?
See an online demo: https://regex101.com/r/KyqtsT/10
import pandas as pd
df = pd.DataFrame(data={'ABC': ['JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ']})
df['REX_TEST'] = df.ABC.str.match(r'(?:[A-Z\d]+_)*?([DP]K)\d*_(?:[A-Z\d]+_)*?(?!\1)([DP]K)\d*(?:_[A-Z\d]+)*?')
print(df)
Or, add leading/trailing underscores to your data before matching:
import pandas as pd
df = pd.DataFrame(data={'ABC': ['JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ']})
df['REX_TEST']= '_' + df.ABC + '_'
df['REX_TEST'] = df.REX_TEST.str.match(r'(?=.*_PK\d*_)(?=.*_DK\d*_).*')
print(df)
Both options print:
ABC REX_TEST
0 JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ True
Note that I wanted to make sure that neither 'DK' nor 'PK' is matched when it is only a substring of a larger word.
You can use Python's re library to search a string:
import re
s = "JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ"
r = re.search(r"(DK).*(PK)|(PK).*(DK)",s) # the pipe is used like "or" keyword
If the pattern is found in the string, r is a match object, which evaluates to True:
if r:
    print("hello world!")
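For a whole column and an arbitrary list of words, one possible sketch is to run one str.contains test per word and combine the results with a logical AND. This is a plain substring test, so unlike the stricter patterns earlier it would also accept 'DK' inside a longer token. The second sample row and the has_all column name are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({"ABC": [
    "JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ",
    "JWUFT_P_RECF_1_DK1_VWAP",  # hypothetical row that lacks 'PK'
]})
match = ["DK", "PK"]

# Start from all True and AND in one containment test per word.
has_all = pd.Series(True, index=df.index)
for word in match:
    has_all &= df["ABC"].str.contains(word, regex=False)

df["has_all"] = has_all
print(df["has_all"].tolist())  # [True, False]
```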
I'm using the code below to remove special characters and punctuation from a column in a pandas dataframe, but using re.sub this way is not time efficient. Are there other options I could try that are faster? Or is the way I remove the special characters and write the result back to the column what is costing so much computation?
for n, string in data['text'].iteritems():
    data['text'] = re.sub('([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])','', string)
One way would be to keep only alphanumeric characters. Consider this dataframe:
df=pd.DataFrame({'Text':['#^#346fetvx#!.,;:', 'fhfgd54#!#><?']})
Text
0 #^#346fetvx#!.,;:
1 fhfgd54#!#><?
You can use str.extract, which keeps the first run of word characters in each string:
df['Text'] = df['Text'].str.extract(r'(\w+)', expand = False)
Text
0 346fetvx
1 fhfgd54
Use Regex and lambda function:
import re
data['PROD_NAME'] = data['PROD_NAME'].apply(lambda x: re.sub('[^A-Za-z0-9]', ' ', x))
This replaces every character other than letters and digits with a space. Use '' as the replacement to remove them instead.
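Another option worth timing is str.translate with a deletion table, which avoids the regex engine altogether. The sketch below assumes the extra characters listed in the question should be removed along with string.punctuation; the sample rows are made up, and whether this is actually faster should be measured on your own data:

```python
import string

import pandas as pd

# Hypothetical sample data for the 'text' column.
df = pd.DataFrame({"text": ["Hello, world!", "“Quoted” text: done."]})

# Extra characters taken from the question's pattern.
extra = "“”¨«»®´·º½¾¿¡§£₤‘’"

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans("", "", string.punctuation + extra)

# One vectorised call for the whole column, with no Python-level loop.
df["text"] = df["text"].str.translate(table)
print(df["text"].tolist())  # ['Hello world', 'Quoted text done']
```

In Python 3 this is also the way to express what str.translate(None, string.punctuation) did in Python 2: the deletions go into the table itself.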
I am trying to split a string such as: add(ten)sub(one) into add(ten) sub(one).
I can't figure out how to match the closing parenthesis. I have used re.sub(r'\\)', '\\) ') and every variation of escaping the parenthesis I can think of. It is hard to tell in this font, but I am trying to add a space between these commands so I can split the string into a list later.
There's no need to escape ) in the replacement string. ) has a special meaning only in the regex pattern, so it needs to be escaped there in order to match it in the string, but in a normal string it can be used as is.
>>> strs = "add(ten)sub(one)"
>>> re.sub(r'\)(?=\S)',r') ', strs)
'add(ten) sub(one)'
As @StevenRumbalski pointed out in the comments, the above operation can be done more simply using str.replace and str.strip:
>>> strs.replace(')',') ').strip()
'add(ten) sub(one)'
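Since the stated goal is a list, a short sketch of the follow-up step: insert the spaces with the same pattern as above, then call split with no arguments (the names spaced and commands are illustrative):

```python
import re

strs = "add(ten)sub(one)"

# Add a space after every ")" that is directly followed by a non-space character.
spaced = re.sub(r"\)(?=\S)", ") ", strs)

# Splitting on whitespace then yields one command per list item.
commands = spaced.split()
print(commands)  # ['add(ten)', 'sub(one)']
```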
d = ')'
my_str = 'add(ten)sub(one)'
result = [t+d for t in my_str.split(d) if len(t) > 0]
# result is now ['add(ten)', 'sub(one)']
Create a list of all substrings
import re
a = 'add(ten)sub(one)'
print(re.findall(r'(.+?\(.+?\))', a))
Output:
['add(ten)', 'sub(one)']