How to find multiple keywords in a string column in Python

I have a column
|ABC|
|---|
|JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ|

I want to check whether the words "DK" and "PK" are present in each row or not. I need to perform this check with different words across the entire column.
match = ['DK', 'PK']
I used df.ABC.str.split('_').isin(match), which splits each string into a list, but I get this error:
SystemError: <built-in method view of numpy.ndarray object at
0x0000021171056DB0> returned a result with an error set
What is the best way to get the expected output, which is a boolean True/False?
Thanks.

Maybe either of the two following options:
(?:[A-Z\d]+_)*?([DP]K)\d*_(?:[A-Z\d]+_)*?(?!\1)([DP]K)\d*(?:_[A-Z\d]+)*?
See an online [demo](https://regex101.com/r/KyqtsT/10).
import pandas as pd
df = pd.DataFrame(data={'ABC': ['JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ']})
df['REX_TEST'] = df.ABC.str.match(r'(?:[A-Z\d]+_)*?([DP]K)\d*_(?:[A-Z\d]+_)*?(?!\1)([DP]K)\d*(?:_[A-Z\d]+)*?')
print(df)
Or, add leading/trailing underscores to your data before matching:
import pandas as pd
df = pd.DataFrame(data={'ABC': ['JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ']})
df['REX_TEST']= '_' + df.ABC + '_'
df['REX_TEST'] = df.REX_TEST.str.match(r'(?=.*_PK\d*_)(?=.*_DK\d*_).*')
print(df)
Both options print:
                                              ABC  REX_TEST
0  JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ      True
Note that I wanted to make sure that neither 'DK' nor 'PK' is matched as a substring of a larger word.
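To see what the (?!\1) backreference in the first option does, here is a small hedged demo with plain re (the second sample string is made up): it stops the second group from matching the same keyword twice.
import re

pattern = r'(?:[A-Z\d]+_)*?([DP]K)\d*_(?:[A-Z\d]+_)*?(?!\1)([DP]K)\d*(?:_[A-Z\d]+)*?'
# The original sample contains both DK and PK, so it matches.
print(bool(re.match(pattern, 'JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ')))  # True
# DK appears twice but PK never does, so (?!\1) rejects the string.
print(bool(re.match(pattern, 'AAA_DK1_BBB_DK2_CCC')))  # False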

You can use Python's re library to search a string:
import re
s = "JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ"
r = re.search(r"(DK).*(PK)|(PK).*(DK)", s)  # the pipe acts like an "or"
If both keywords are found in the string, the match object is truthy:
if r:
    print("hello world!")

Related

Replace text string in entire column after first occurrence

I'm trying to replace all but the first occurrence of a text string in an entire column. My specific case is replacing underscores with periods in data that looks like client_19_Aug_21_22_2022, which I need to become client_19.Aug.21.22.2022.
If I use [1], I get this error: string index out of range.
[:1] replaces all occurrences (it doesn't skip the first one).
[1:] inserts . after every character but doesn't find and replace _.
df1['Client'] = df1['Client'].str.replace('_'[:1],'.')
Not the simplest, but a solution:
import re

df1['Client'] = df1['Client'].apply(lambda s: re.sub(r'^(.*?)\.', r'\1_', s.replace('_', '.')))
In the lambda function we first replace every _ with a period, then substitute the first period back with _. Finally, the lambda is applied to each value in the column.
A Pandas Series has a .map method that you can use to apply an arbitrary function to every value in the Series.
In your case you can write your own replace_underscores_except_first function, looking something like:
def replace_underscores_except_first(s):
    newstring = ''
    # Some logic here to handle replacing all but the first.
    # You probably want a for loop with some conditional checking.
    return newstring
and then pass that to .map like:
df1['Client'] = df1['Client'].map(replace_underscores_except_first)
An example using map, where the function checks whether the string contains an underscore. If it does, it splits on it and joins all parts except the first back with a dot.
import pandas as pd

items = [
    "client_19_Aug_21_22_2022",
    "client123",
]

def replace_underscore_with_dot_except_first(s):
    if "_" in s:
        parts = s.split("_")
        return f"{parts[0]}_{'.'.join(parts[1:])}"
    return s

df1 = pd.DataFrame(items, columns=["Client"])
df1['Client'] = df1['Client'].map(replace_underscore_with_dot_except_first)
print(df1)
Output

                     Client
0  client_19.Aug.21.22.2022
1                 client123
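If you prefer a vectorized version without a Python-level function, a hedged sketch using str.split with n=1 (split on the first underscore only, then dot-join the remainder) could look like this:
import pandas as pd

df1 = pd.DataFrame({'Client': ['client_19_Aug_21_22_2022', 'client123']})
# Split on the first underscore only; column 1 is NaN when there is none.
parts = df1['Client'].str.split('_', n=1, expand=True)
# Rejoin with '_' and dot the rest; keep the original value where no underscore was found.
df1['Client'] = (parts[0] + '_' + parts[1].str.replace('_', '.', regex=False)).fillna(df1['Client'])
print(df1)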

Is there any method or function to extract only URLs from a dataframe column

I have a dataframe with a column named replies, which has the contents below:
ljaganathan:https://engineering.paypalcorp.com/confluence/pages/viewpage.action?spaceKey=CAL&title=Report+REST+Interface
vanbalagan:Please refer: https://engineering.paypalcorp.com/confluence/display/GPS/User+Guide+for+Self-serve+Alerts
I want to extract only the URLs from that specific column. I tried the code below:
import re
re.findall(r'(https?://\S+)', df['replies'])
got this error
TypeError: expected string or bytes-like object
I even tried this:
df["replies"]=df["replies"].astype(str)
pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&/=]*)'
df['links'] = ''
df['links']= df["replies"].str.extract(pattern, expand=True)
print(df['links'])
but I'm getting NaN values with this one.
Can someone help me with this?
Your regular expression is not quite right, and re.findall expects a single string, not a whole Series. You can let Pandas apply a simpler pattern row by row:
df['links'] = df['replies'].str.findall(r'https?:\/\/\S+')
or fix your second version:
df["replies"] = df["replies"].astype(str)
pattern = r'(https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}[-a-zA-Z0-9()@:%_+.~#?&\/=]*)'
df['links'] = df["replies"].str.extract(pattern, expand=True)
print(df['links'])
There are online testers, such as regex101.com, that are very useful for debugging regular expressions.
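For reference, a self-contained sketch using the two sample rows from the question; str.findall returns a list of all links per row, which also covers replies containing more than one URL:
import pandas as pd

df = pd.DataFrame({'replies': [
    'ljaganathan:https://engineering.paypalcorp.com/confluence/pages/viewpage.action?spaceKey=CAL&title=Report+REST+Interface',
    'vanbalagan:Please refer: https://engineering.paypalcorp.com/confluence/display/GPS/User+Guide+for+Self-serve+Alerts',
]})
# Each row gets a (possibly empty) list of URLs rather than a single match.
df['links'] = df['replies'].str.findall(r'https?://\S+')
print(df['links'])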

Apply filtering on column to keep values that have specific structure

I would like to put a condition on a column so that I only keep rows whose values respect the following rule: 'Start with 3 capital letters followed by a number'. Whatever comes after the number is OK.
Example, if the input is :
pd.Series(['LMJ5410P','PTJ9910B','C4800WI3','INJ1B','CDBBM2','ALI9920L'])
Then the output should be
pd.Series(['LMJ5410P','PTJ9910B','INJ1B','ALI9920L'])
So far, here's how I proceed:
import re

def filter_rows(value):
    pattern = re.compile("[A-Z]{3}[0-9]")
    try:
        if not pattern.match(value):
            return 'remove'
        return value
    except:
        if isinstance(value, float):
            return 'remove'

Then I apply this function to my column and remove all rows with the value "remove".
Is there a more efficient way to get the same result?
You may use str.match in Pandas, as it relies on re.match, which only matches the pattern at the start of the string:
import pandas as pd
s = pd.Series(['LMJ5410P','PTJ9910B','C4800WI3','INJ1B','CDBBM2','ALI9920L'])
s[s.str.match(r"[A-Z]{3}\d")]
# 0 LMJ5410P
# 1 PTJ9910B
# 3 INJ1B
# 5 ALI9920L
# dtype: object
Alternatively, you may use str.contains (which uses re.search), but then you will need to prepend the pattern with the ^ anchor (making sure the match starts at the beginning of the string):
s[s.str.contains(r"^[A-Z]{3}\d")]
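To see the difference on the sample data, a short sketch: without the anchor, str.contains also keeps 'CDBBM2', because its substring 'BBM2' matches [A-Z]{3}\d mid-string.
import pandas as pd

s = pd.Series(['LMJ5410P', 'PTJ9910B', 'C4800WI3', 'INJ1B', 'CDBBM2', 'ALI9920L'])
# Unanchored search matches anywhere, so 'CDBBM2' wrongly passes.
print(s[s.str.contains(r"[A-Z]{3}\d")])
# The anchored version only keeps strings that start with the pattern.
print(s[s.str.contains(r"^[A-Z]{3}\d")])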

Custom Regex Query in an efficient manner

So, I have a simple doubt, but I am new to regex. I am working with a Pandas DataFrame. One of the columns contains names; some are written like "John Doe" but some like "John.Doe", and I need all of them to be like "John Doe". I need to run this on the whole dataframe. What is an efficient regex query to fix this? Col Name = 'Customer_Name'. Let me know if more details are needed.
Try running this to replace all . with a space, if that is your only condition:
df['Customer_Name'] = df['Customer_Name'].str.replace('.', ' ', regex=False)
All you need is the apply function from pandas, which applies a function to every value in a column. You do not need regex for this, but below is an example that shows both:
import pandas as pd
import re

# Function that does the replacement
def format_name(val):
    return val.replace('.', ' ')
    # return re.sub(r'\.', ' ', val)  # If you would like to use regex

# Read CSV file
df = pd.read_csv(<PATH TO CSV FILE>)

# Apply the function to the column
df['NewCustomerName'] = df['Customer_Name'].apply(format_name)
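A self-contained variant of the same idea with inline sample data instead of a CSV (the names here are made up); the vectorized str.replace with regex=False treats the dot literally:
import pandas as pd

df = pd.DataFrame({'Customer_Name': ['John Doe', 'John.Doe', 'Jane.Roe']})
# regex=False replaces the literal '.' instead of the regex "any character".
df['NewCustomerName'] = df['Customer_Name'].str.replace('.', ' ', regex=False)
print(df)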

str.translate() method gives error against Pandas series

I have a DataFrame of 3 columns. Two of the columns I wish to manipulate are Dog_Summary and Dog_Description. These columns contain strings, and I wish to remove any punctuation they may have.
I have tried the following:
df[['Dog_Summary', 'Dog_Description']] = df[['Dog_Summary', 'Dog_Description']].apply(lambda x: x.str.translate(None, string.punctuation))
For the above I get an error saying:
ValueError: ('deletechars is not a valid argument for str.translate in python 3. You should simply specify character deletions in the table argument', 'occurred at index Summary')
The second way I tried was:
df[['Dog_Summary', 'Dog_Description']] = df[['Dog_Summary', 'Dog_Description']].apply(lambda x: x.replace(string.punctuation, ' '))
However, it still does not work!
Can anyone give me suggestions or advice?
Thanks! :)
> I wish to remove any punctuation it may have.
You can use a regular expression and string.punctuation for this:
>>> import pandas as pd
>>> from string import punctuation
>>> s = pd.Series(['abcd$*%&efg', ' xyz#)$(#rst'])
>>> s.str.replace(rf'[{punctuation}]', '', regex=True)
0 abcdefg
1 xyzrst
dtype: object
The first argument to .str.replace() can be a regular expression. In this case, you can use f-strings and a character class to catch any of the punctuation characters:
>>> rf'[{punctuation}]'
'[!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~]' # ' and \ are escaped
If you want to apply this to a DataFrame, just follow what you're doing now:
df.loc[:, cols] = df[cols].apply(lambda s: s.str.replace(rf'[{punctuation}]', '', regex=True))
Alternatively, you could use s.replace(rf'[{punctuation}]', '', regex=True) (no .str accessor).
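For completeness, a minimal sketch of the str.translate route the question originally attempted (sample data made up): in Python 3 the deletions go into the table built by str.maketrans, which is exactly what the ValueError message suggests.
import string
import pandas as pd

df = pd.DataFrame({'Dog_Summary': ['Good dog!!!'],
                   'Dog_Description': ['Barks; loudly.']})
# Map every punctuation character to None, i.e. delete it.
table = str.maketrans('', '', string.punctuation)
cols = ['Dog_Summary', 'Dog_Description']
df[cols] = df[cols].apply(lambda s: s.str.translate(table))
print(df)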
