Custom Regex Query in an Efficient Manner - Python

So, I have a simple question, but I am new to regex. I am working with a Pandas DataFrame. One of its columns contains names. However, some names are written like "John Doe" while others are written like "John.Doe", and I need all of them written like "John Doe". I need to run this on the whole DataFrame. What is the regex query to fix this, and in an efficient manner? Col Name = 'Customer_Name'. Let me know if more details are needed.

Try running this to replace every literal . with a space, if that is your only condition. Pass regex=False so pandas treats the dot literally rather than as the regex "any character":
df['Customer_Name'] = df['Customer_Name'].str.replace('.', ' ', regex=False)
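Why the flag matters: '.' is a regex metacharacter, and pandas versions before 2.0 defaulted str.replace to regex mode, which would silently blank out every character. A minimal sketch of the safe, literal replacement:

```python
import pandas as pd

df = pd.DataFrame({'Customer_Name': ['John.Doe', 'Jane Doe']})
# regex=False treats the dot as a literal character; with regex=True
# (the pre-2.0 default) '.' would match every character instead.
df['Customer_Name'] = df['Customer_Name'].str.replace('.', ' ', regex=False)
print(df['Customer_Name'].tolist())  # ['John Doe', 'Jane Doe']
```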

All you need is the apply function from pandas, which applies a function to every value in a column. You do not need regex for this, but the example below shows both:
import pandas as pd
import re

# Function that does the replacement
def format_name(val):
    return val.replace('.', ' ')
    # return re.sub(r'\.', ' ', val)  # If you would like to use regex

# Read CSV file
df = pd.read_csv(<PATH TO CSV FILE>)

# Apply the function to the column
df['NewCustomerName'] = df['Customer_Name'].apply(format_name)

Related

How can I split a string with no delimiter?

I need to import a CSV file which contains all values in one column, although they should be in 3 different columns.
The value I want to split looks like this: "2020-12-30 13:17:00Mojito5.5". I want it to look like this: "2020-12-30 13:17:00 Mojito 5.5".
I tried different approaches to splitting it, but I either get the error "'DataFrame' object has no attribute 'split'" or something similar.
Any ideas how I can split this?
Assuming you always want to add spaces around a word without special characters or numbers, you can use this regex:
import re

def add_spaces(m):
    return f' {m.group(0)} '

s = "2020-12-30 13:17:00Mojito5.5"
re.sub('[a-zA-Z]+', add_spaces, s)  # '2020-12-30 13:17:00 Mojito 5.5'
We could use a regex approach here:
import re

inp = "2020-12-30 13:17:00Mojito5.5"
m = re.findall(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(\w+?)(\d+(?:\.\d+)?)', inp)
print(m)  # [('2020-12-30 13:17:00', 'Mojito', '5.5')]
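If the goal is ultimately three DataFrame columns rather than a list of tuples, str.extract can do the split in one vectorized step. A sketch (the column names 'raw', 'timestamp', 'drink' and 'price' are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'raw': ['2020-12-30 13:17:00Mojito5.5']})
# Named groups become column names; str.extract applies the regex row-wise.
pattern = (r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'
           r'(?P<drink>[A-Za-z]+)'
           r'(?P<price>\d+(?:\.\d+)?)')
parts = df['raw'].str.extract(pattern)
print(parts.iloc[0].tolist())  # ['2020-12-30 13:17:00', 'Mojito', '5.5']
```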

How to find multiple keywords in a string column in python

I have a column
|ABC|
-----
|JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ|
I want to check whether the words "DK" and "PK" are in the row or not. I need to perform this with different words across the entire column.
match = ['DK', 'PK']
I used df.ABC.str.split('_').isin(match), which splits the value into a list, but I get this error:
SystemError: <built-in method view of numpy.ndarray object at
0x0000021171056DB0> returned a result with an error set
What is the best way to get the expected output, which is a bool (True|False)?
Thanks.
Maybe either of the two following options:
(?:[A-Z\d]+_)*?([DP]K)\d*_(?:[A-Z\d]+_)*?(?!\1)([DP]K)\d*(?:_[A-Z\d]+)*?
See an online [demo](https://regex101.com/r/KyqtsT/10).
import pandas as pd
df = pd.DataFrame(data={'ABC': ['JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ']})
df['REX_TEST'] = df.ABC.str.match(r'(?:[A-Z\d]+_)*?([DP]K)\d*_(?:[A-Z\d]+_)*?(?!\1)([DP]K)\d*(?:_[A-Z\d]+)*?')
print(df)
Or, add leading/trailing underscores to your data before matching:
import pandas as pd
df = pd.DataFrame(data={'ABC': ['JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ']})
df['REX_TEST'] = '_' + df.ABC + '_'
df['REX_TEST'] = df.REX_TEST.str.match(r'(?=.*_PK\d*_)(?=.*_DK\d*_).*')
print(df)
Both options print:
                                              ABC  REX_TEST
0  JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ      True
Note that I wanted to make sure that neither 'DK' nor 'PK' is matched as a substring of a larger word.
You can use the Python re library to search a string:
import re

s = "JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ"
r = re.search(r"(DK).*(PK)|(PK).*(DK)", s)  # the pipe is used like an "or" keyword
If the pattern matches the string, the result is truthy:
if r:
    print("hello world!")
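For the whole column at once, a vectorized sketch using str.contains (assuming each keyword can be checked independently and the results combined):

```python
import pandas as pd

df = pd.DataFrame({'ABC': ['JWUFT_P_RECF_1_DK1_VWAP_DFGDG_P_REGB_1_PK1_XYZ',
                           'NO_MATCH_HERE']})
match = ['DK', 'PK']
# One boolean Series per keyword; all(axis=1) requires every keyword to be present.
df['has_all'] = pd.concat([df['ABC'].str.contains(k) for k in match], axis=1).all(axis=1)
print(df['has_all'].tolist())  # [True, False]
```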

In dataframe: delete the parentheses and everything within them in a column

I have a pandas dataframe where a column contains parentheses. I want to keep the content of the column but delete everything inside the parentheses, as below, and then add a constant text called "data" to it.
col1
counties(17) - cities(8)
I tried df['col1'] = df['col1'].str.replace(r"\(.*\)", "")
but this command outputs only "counties".
My desired output is
counties - cities data
Your expression is replacing every match with "". Note also that \(.*\) is greedy, so it removes everything from the first ( to the last ). Use a non-greedy pattern and append " data" to get the result shown above.
Change
df['col1'] = df['col1'].str.replace(r"\(.*\)", "")
to
df['col1'] = df['col1'].str.replace(r"\(.*?\)", "", regex=True) + " data"
Pandas uses the re module under the hood, so your regex must follow its syntax and can use all its features. Here you want a non-greedy match for the parenthesised part (the shorter match), so you should use df['col1'].str.replace(r"\(.*?\)", ""). If you want to add " data", you end up with:
df['col1'] = df.col1.str.replace(r'\(.*?\)', '', regex=True) + ' data'
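Put together, a runnable sketch (regex=True is passed explicitly, since pandas 2.0 changed the str.replace default to literal replacement):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['counties(17) - cities(8)']})
# Non-greedy \(.*?\) removes each parenthesised group separately;
# ' data' is then appended to every row.
df['col1'] = df['col1'].str.replace(r'\(.*?\)', '', regex=True) + ' data'
print(df['col1'].tolist())  # ['counties - cities data']
```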

how to extract a particular substring from a string in python

I have a dataframe "data" which has a column called "Description" containing text like "the IN678I78 is delivered". Every row has some code that starts with 'IN'.
Now I need to pull that IN------ code out separately into a new column.
Please do help.
Thanks
When asking a question, always include a sample of your dataframe for us to visualize your problem and try some solutions.
IIUC you can use an apply on your Description column and regular expressions manipulation to extract your desired feature. You can try the following:
def extr(x):
    lis = x.split(' ')
    for string in lis:
        if string[:2] == 'IN':
            return string

data['New col'] = data.Description.apply(extr)
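A vectorized alternative to apply, assuming the code is always a whitespace-delimited token starting with 'IN' (a sketch; the pattern may need adjusting to your real data):

```python
import pandas as pd

data = pd.DataFrame({'Description': ['the IN678I78 is delivered']})
# expand=False makes str.extract return a Series instead of a DataFrame.
data['New col'] = data['Description'].str.extract(r'\b(IN\w+)\b', expand=False)
print(data['New col'].tolist())  # ['IN678I78']
```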

Search for string in file while ignoring id and replacing only a substring

I’ve got a master .xml file generated by an external application and want to create several new .xmls by adapting and deleting some rows with python. The search strings and replace strings for these adaptions are stored within an array, e.g.:
replaceArray = [
[u'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"',
u'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'],
[u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>',
u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>'],
[u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>',
u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="true"/>']]
So I'd like to iterate through my file and replace all occurences of 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"' with 'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"' and so on.
Unfortunately, the ID values of "RowID", "id_tool_base" and "ref_layerid_mapping" might change occasionally. So what I need is to search for matches of the whole string in the master file, regardless of which id value sits between the quotation marks, and to replace only the substring that differs between the two strings of the replaceArray (e.g. use="true" instead of use="false"). I'm not very familiar with regular expressions, but I think I need something like this for my search:
re.sub(r'<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use="false"/>', "", sentence)
I'm happy about any hint that points me in the right direction! If you need any further information or if something is not clear in my question, please let me know.
One way to do this is to have a function for replacing text. The function gets the match object from re.sub and re-inserts the id captured from the string being replaced.
import re

s = 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"'
pat = re.compile(r'ref_layerid_mapping=(.+) lyvis="off" toc_visible="off"')

def replacer(m):
    return 'ref_layerid_mapping=' + m.group(1) + ' lyvis="on" toc_visible="on"'

re.sub(pat, replacer, s)
Output:
'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'
Another way is to use back-references in the replacement pattern (see http://www.regular-expressions.info/replacebackref.html).
For example:
import re
s = "Ab ab"
re.sub(r"(\w)b (\w)b", r"\1d \2d", s)
Output:
'Ad ad'
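Applied to the XML strings above, the back-reference approach might look like this sketch: capture the parts that stay the same (including the changing ids) and re-insert them, replacing only the use attribute.

```python
import re

line = '<TOOL_BUFFER RowID="999999" id_tool_base="3651" use="false"/>'
# Group 1 captures everything up to the attribute value, group 2 the tail;
# the ids may vary, so they are matched with \d+ instead of literal values.
pattern = r'(<TOOL_BUFFER RowID="\d+" id_tool_base="\d+" use=")false("/>)'
print(re.sub(pattern, r'\1true\2', line))
# '<TOOL_BUFFER RowID="999999" id_tool_base="3651" use="true"/>'
```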
