How to find multiple keywords in a string column in Python

I have a column
|ABC|
|---|
|JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ|

I want to check whether the words "DK" and "PK" are present in each row or not. I need to perform this check with different words across the entire column.
match = ['DK', 'PK']
I used df.ABC.str.split('_').isin(match), which splits each string into a list, but I get this error:
SystemError: <built-in method view of numpy.ndarray object at
0x0000021171056DB0> returned a result with an error set
What is the best way to get the expected output, which is a boolean True/False?
Thanks.

Maybe either of the two following options:
(?:[A-Z\d]+_)*?([DP]K)\d*_(?:[A-Z\d]+_)*?(?!\1)([DP]K)\d*(?:_[A-Z\d]+)*?
See an online [demo](https://regex101.com/r/KyqtsT/10).
import pandas as pd
df = pd.DataFrame(data={'ABC': ['JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ']})
df['REX_TEST'] = df.ABC.str.match(r'(?:[A-Z\d]+_)*?([DP]K)\d*_(?:[A-Z\d]+_)*?(?!\1)([DP]K)\d*(?:_[A-Z\d]+)*?')
print(df)
Or, add leading/trailing underscores to your data before matching:
import pandas as pd
df = pd.DataFrame(data={'ABC': ['JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ']})
df['REX_TEST']= '_' + df.ABC + '_'
df['REX_TEST'] = df.REX_TEST.str.match(r'(?=.*_PK\d*_)(?=.*_DK\d*_).*')
print(df)
Both options print:
                                              ABC  REX_TEST
0  JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ      True
Note that I wanted to make sure that neither 'DK' nor 'PK' is matched as a substring of a larger word.
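To see what the (?!\1) backreference in the first option does, here is a small hedged demo with plain re (the second sample string is made up): it stops the second group from matching the same keyword twice.
import re

pattern = r'(?:[A-Z\d]+_)*?([DP]K)\d*_(?:[A-Z\d]+_)*?(?!\1)([DP]K)\d*(?:_[A-Z\d]+)*?'
# The original sample contains both DK and PK, so it matches.
print(bool(re.match(pattern, 'JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ')))  # True
# DK appears twice but PK never does, so (?!\1) rejects the string.
print(bool(re.match(pattern, 'AAA_DK1_BBB_DK2_CCC')))  # False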

You can use Python's re library to search a string:
import re
s = "JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ"
r = re.search(r"(DK).*(PK)|(PK).*(DK)", s)  # the pipe acts like an "or"
If both keywords are found in the string, the match object is truthy:
if r:
    print("hello world!")

Related

Replace text string in entire column after first occurrence

I'm trying to replace all but the first occurrence of a text string in an entire column. My specific case is replacing underscores with periods in data that looks like client_19_Aug_21_22_2022, which I need to become client_19.Aug.21.22.2022.
If I use [1], I get this error: string index out of range.
[:1] replaces all occurrences (it doesn't skip the first one).
[1:] inserts . after every character but doesn't find and replace _.
df1['Client'] = df1['Client'].str.replace('_'[:1],'.')
Not the simplest, but a solution:
import re

df1['Client'] = df1['Client'].apply(lambda s: re.sub(r'^(.*?)\.', r'\1_', s.replace('_', '.')))
In the lambda function we first replace every _ with a period, then substitute the first period back with _. Finally, the lambda is applied to each value in the column.
A Pandas Series has a .map method that you can use to apply an arbitrary function to every value in the Series.
In your case you can write your own replace_underscores_except_first function, looking something like:
def replace_underscores_except_first(s):
    newstring = ''
    # Some logic here to handle replacing all but the first.
    # You probably want a for loop with some conditional checking.
    return newstring
and then pass that to .map like:
df1['Client'] = df1['Client'].map(replace_underscores_except_first)
An example using map, where the function checks whether the string contains an underscore. If it does, it splits on it and joins all parts except the first back with a dot.
import pandas as pd

items = [
    "client_19_Aug_21_22_2022",
    "client123",
]

def replace_underscore_with_dot_except_first(s):
    if "_" in s:
        parts = s.split("_")
        return f"{parts[0]}_{'.'.join(parts[1:])}"
    return s

df1 = pd.DataFrame(items, columns=["Client"])
df1['Client'] = df1['Client'].map(replace_underscore_with_dot_except_first)
print(df1)
Output

                     Client
0  client_19.Aug.21.22.2022
1                 client123
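If you prefer a vectorized version without a Python-level function, a hedged sketch using str.split with n=1 (split on the first underscore only, then dot-join the remainder) could look like this:
import pandas as pd

df1 = pd.DataFrame({'Client': ['client_19_Aug_21_22_2022', 'client123']})
# Split on the first underscore only; column 1 is NaN when there is none.
parts = df1['Client'].str.split('_', n=1, expand=True)
# Rejoin with '_' and dot the rest; keep the original value where no underscore was found.
df1['Client'] = (parts[0] + '_' + parts[1].str.replace('_', '.', regex=False)).fillna(df1['Client'])
print(df1)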

Is there any method or function to extract only URLs from a dataframe column

I have a dataframe with a column named replies, which has the contents below:
ljaganathan:https://engineering.paypalcorp.com/confluence/pages/viewpage.action?spaceKey=CAL&title=Report+REST+Interface
vanbalagan:Please refer: https://engineering.paypalcorp.com/confluence/display/GPS/User+Guide+for+Self-serve+Alerts
I want to extract only the URLs from that specific column. I tried the code below:
import re
re.findall(r'(https?://\S+)', df['replies'])
got this error
TypeError: expected string or bytes-like object
I even tried this:
df["replies"]=df["replies"].astype(str)
pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'
df['links'] = ''
df['links']= df["replies"].str.extract(pattern, expand=True)
print(df['links'])
but I'm getting NaN values with this one.
Can someone help me with this?
Your regular expression is not quite right, and re.findall expects a single string, not a whole Series. You can let Pandas apply a simpler pattern row by row:
df['links'] = df['replies'].str.findall(r'https?:\/\/\S+')
or fix your second version:
df["replies"] = df["replies"].astype(str)
pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&\/=]*)'
df['links'] = df["replies"].str.extract(pattern, expand=True)
print(df['links'])
There are online testers, such as regex101.com, that are very useful for debugging regular expressions.
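For reference, a self-contained sketch using the two sample rows from the question; str.findall returns a list of all links per row, which also covers replies containing more than one URL:
import pandas as pd

df = pd.DataFrame({'replies': [
    'ljaganathan:https://engineering.paypalcorp.com/confluence/pages/viewpage.action?spaceKey=CAL&title=Report+REST+Interface',
    'vanbalagan:Please refer: https://engineering.paypalcorp.com/confluence/display/GPS/User+Guide+for+Self-serve+Alerts',
]})
# Each row gets a (possibly empty) list of URLs rather than a single match.
df['links'] = df['replies'].str.findall(r'https?://\S+')
print(df['links'])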

Apply filtering on column to keep values that have specific structure

I would like to put a condition on a column so that I only keep rows whose values respect the following rule: 'Start with 3 capital letters followed by a number'. Whatever comes after the number is OK.
Example, if the input is :
pd.Series(['LMJ5410P','PTJ9910B','C4800WI3','INJ1B','CDBBM2','ALI9920L'])
Then the output should be
pd.Series(['LMJ5410P','PTJ9910B','INJ1B','ALI9920L'])
So far, here's how I proceed:
import re

def filter_rows(value):
    pattern = re.compile("[A-Z]{3}[0-9]")
    try:
        if not pattern.match(value):
            return 'remove'
        return value
    except:
        if isinstance(value, float):
            return 'remove'

Then I apply this function to my column and remove all rows with the value "remove".
Is there a more efficient way to get the same result?
You may use str.match in Pandas, as it relies on re.match, which only matches the pattern at the start of the string:
import pandas as pd
s = pd.Series(['LMJ5410P','PTJ9910B','C4800WI3','INJ1B','CDBBM2','ALI9920L'])
s[s.str.match(r"[A-Z]{3}\d")]
# 0 LMJ5410P
# 1 PTJ9910B
# 3 INJ1B
# 5 ALI9920L
# dtype: object
Alternatively, you may use str.contains (which uses re.search), but then you will need to prepend the pattern with the ^ anchor (making sure the match starts at the beginning of the string):
s[s.str.contains(r"^[A-Z]{3}\d")]
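To see the difference on the sample data, a short sketch: without the anchor, str.contains also keeps 'CDBBM2', because its substring 'BBM2' matches [A-Z]{3}\d mid-string.
import pandas as pd

s = pd.Series(['LMJ5410P', 'PTJ9910B', 'C4800WI3', 'INJ1B', 'CDBBM2', 'ALI9920L'])
# Unanchored search matches anywhere, so 'CDBBM2' wrongly passes.
print(s[s.str.contains(r"[A-Z]{3}\d")])
# The anchored version only keeps strings that start with the pattern.
print(s[s.str.contains(r"^[A-Z]{3}\d")])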

Custom Regex Query in an efficient manner

So, I have a simple doubt, but I am new to regex. I am working with a Pandas DataFrame. One of the columns contains names; some are written like "John Doe" but some like "John.Doe", and I need all of them to be like "John Doe". I need to run this on the whole dataframe. What is an efficient regex query to fix this? Col Name = 'Customer_Name'. Let me know if more details are needed.
Try running this to replace all . with a space, if that is your only condition:
df['Customer_Name'] = df['Customer_Name'].str.replace('.', ' ', regex=False)
All you need is the apply function from pandas, which applies a function to every value in a column. You do not need regex for this, but below is an example that shows both:
import pandas as pd
import re

# Function that does the replacement
def format_name(val):
    return val.replace('.', ' ')
    # return re.sub(r'\.', ' ', val)  # If you would like to use regex

# Read CSV file
df = pd.read_csv(<PATH TO CSV FILE>)

# Apply the function to the column
df['NewCustomerName'] = df['Customer_Name'].apply(format_name)
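A self-contained variant of the same idea with inline sample data instead of a CSV (the names here are made up); the vectorized str.replace with regex=False treats the dot literally:
import pandas as pd

df = pd.DataFrame({'Customer_Name': ['John Doe', 'John.Doe', 'Jane.Roe']})
# regex=False replaces the literal '.' instead of the regex "any character".
df['NewCustomerName'] = df['Customer_Name'].str.replace('.', ' ', regex=False)
print(df)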

str.translate() method gives error against Pandas series

I have a DataFrame of 3 columns. Two of the columns I wish to manipulate are Dog_Summary and Dog_Description. These columns contain strings, and I wish to remove any punctuation they may have.
I have tried the following:
df[['Dog_Summary', 'Dog_Description']] = df[['Dog_Summary', 'Dog_Description']].apply(lambda x: x.str.translate(None, string.punctuation))
For the above I get an error saying:
ValueError: ('deletechars is not a valid argument for str.translate in python 3. You should simply specify character deletions in the table argument', 'occurred at index Summary')
The second way I tried was:
df[['Dog_Summary', 'Dog_Description']] = df[['Dog_Summary', 'Dog_Description']].apply(lambda x: x.replace(string.punctuation, ' '))
However, it still does not work!
Can anyone give me suggestions or advice?
Thanks! :)
> I wish to remove any punctuation it may have.
You can use a regular expression and string.punctuation for this:
>>> import pandas as pd
>>> from string import punctuation
>>> s = pd.Series(['abcd$*%&efg', ' xyz#)$(#rst'])
>>> s.str.replace(rf'[{punctuation}]', '', regex=True)
0 abcdefg
1 xyzrst
dtype: object
The first argument to .str.replace() can be a regular expression. In this case, you can use f-strings and a character class to catch any of the punctuation characters:
>>> rf'[{punctuation}]'
'[!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~]' # ' and \ are escaped
If you want to apply this to a DataFrame, just follow what you're doing now:
df.loc[:, cols] = df[cols].apply(lambda s: s.str.replace(rf'[{punctuation}]', '', regex=True))
Alternatively, you could use s.replace(rf'[{punctuation}]', '', regex=True) (no .str accessor).
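For completeness, a minimal sketch of the str.translate route the question originally attempted (sample data made up): in Python 3 the deletions go into the table built by str.maketrans, which is exactly what the ValueError message suggests.
import string
import pandas as pd

df = pd.DataFrame({'Dog_Summary': ['Good dog!!!'],
                   'Dog_Description': ['Barks; loudly.']})
# Map every punctuation character to None, i.e. delete it.
table = str.maketrans('', '', string.punctuation)
cols = ['Dog_Summary', 'Dog_Description']
df[cols] = df[cols].apply(lambda s: s.str.translate(table))
print(df)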
