Speed up python regex matching - python

Using Python's re module, I need to make a large number of calls (about 1 million) to re.findall() and re.sub(). I want to find all occurrences of a pattern in a string and then replace them with a fixed string. For example, all dates in a string are returned as a list, and in the original string each one is replaced by 'DATE'. How can I combine both steps into one?

The replacement argument of re.sub can be a callable:
import re

dates = []

def store_dates(match):
    # Record each matched date, then return the replacement text.
    dates.append(match.group())
    return 'DATE'

data = re.sub('some-date-string', store_dates, data)
# data is now your data with all the date strings replaced with 'DATE'
# dates now has all of the date strings that matched your regex
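
For a large number of calls, the same idea can be sketched with the pattern compiled once. The ISO-style date pattern, the sample text, and the helper name extract_and_replace below are assumptions for illustration; none of them come from the question.

```python
import re

# Assumed pattern: ISO-style dates such as 2020-01-05 (illustrative only).
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract_and_replace(text, placeholder="DATE"):
    """Return (new_text, matches) using a single pass over text."""
    found = []

    def collect(match):
        # Called once per match: remember it, then substitute the placeholder.
        found.append(match.group())
        return placeholder

    return DATE_RE.sub(collect, text), found

new_text, dates = extract_and_replace("Met on 2020-01-05 and again on 2021-12-31.")
print(new_text)
print(dates)
```

Compiling the pattern once skips the per-call cache lookup that the module-level re functions perform, and a single sub() pass replaces separate findall() and sub() scans. How much this helps depends on the pattern and the data.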

Related

pythonic method for extracting numeric digits from string

I am developing a program to read through a CSV file and create a dictionary of information from it. Each line in the CSV is essentially a new dictionary entry with the delimited objects being the values.
As one part of the task, I need to extract an unknown number of numeric digits from within a string. I have a working version, but it does not seem very pythonic.
An example string looks like this:
variable = Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]
variable is the name of the string in the Python code, and it holds a variable name within MODBUS. I want to extract just the digits before .WORD_type[0], which relate to the number of bytes the string is packed into.
Here is my working code. Note that it is nested within a for statement iterating through the lines in the CSV. var_length and var_type are some of the keys, i.e. {"var_length": var_length}
if re.search(".+_ST[0-9]{1,2}\\.WORD_type.+", variable):
    var_type = "string"
    temp = re.split("\\.", variable)
    temp = re.split("_", temp[2])
    temp = temp[-1]
    var_length = int(str.lstrip(temp, "ST")) / 2
You could try using a capturing group, like so:
import re
variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"
matches = re.match(r".+_ST(\d+)\.WORD_type.+", variable)
if matches:
    print(matches[1])
matches[0] has the full match and matches[1] contains the captured group (the digits after _ST).
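
A minimal sketch that folds the capturing group back into the original task follows. The group name length and the use of integer division are assumptions; the question's code divides with / 2, which yields a float in Python 3.

```python
import re

variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"

# Named group "length" captures the digits between "_ST" and ".WORD_type".
match = re.search(r"_ST(?P<length>\d+)\.WORD_type", variable)
if match:
    var_type = "string"
    # Integer division is an assumption; the question's code uses / 2.
    var_length = int(match.group("length")) // 2
    print(var_type, var_length)
```

Using re.search here avoids the leading and trailing .+ in the pattern, because it looks for the match anywhere in the string.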

python remove variable substring patterns from a long string in a python dataframe column

I have a column in my dataframe containing very large strings.
Here is a short sample of the string:
FixedChar{3bf3423 Data to keep}, FixedChar{5e0d20 Data to keep}, FixedChar{6cb86d9 Data to keep}, ...
I need to remove the recurring static "FixedChar{", the variable substring after it that has a static length of 6, and also the "}",
and just keep the "Data to keep" strings, which have variable lengths.
What is the best way to remove this recurring variable pattern?
It was easier than I thought.
At first I used re.sub() from the re library.
The pattern \w* matches all the word characters (letters, digits, and underscores) after "FixedChar{", and the argument flags=re.I makes the match case-insensitive.
import re
re.sub(r"FixedChar\{\w*", "", dataFrame.Column[row], flags=re.I)
But I found str.replace() more useful and replaced the values in my dataFrame using loc, because I needed to filter my dataframe: this pattern shows up only in specific rows.
# Use a boolean row mask in place of ":" to restrict the update to specific rows.
dataFrame.loc[:, 'Column'] = dataFrame.Column.str.replace(r"FixedChar\{\w* ", '', regex=True)
dataFrame.loc[:, 'Column'] = dataFrame.Column.str.replace("}", '', regex=True)
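
The two replacements can also be combined into one substitution that keeps only the captured payload. This is a sketch on a plain string using the standard re module; the name WRAPPER_RE is hypothetical, and it assumes the payload never contains a "}". The same pattern and r"\1" replacement could presumably be passed to Series.str.replace with regex=True.

```python
import re

sample = "FixedChar{3bf3423 Data to keep}, FixedChar{5e0d20 Data to keep}"

# Match the prefix, the code, one space, then capture everything up to "}".
WRAPPER_RE = re.compile(r"FixedChar\{\w+ ([^}]*)\}", flags=re.I)

# Replace each wrapper with just the captured "Data to keep" part.
cleaned = WRAPPER_RE.sub(r"\1", sample)
print(cleaned)
```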

Filter a list of strings but the filter must appear at a certain place

I have a pandas DataFrame with a column populated with strings of the same length, like 0302000C0AABGBG, 0407020B0AAAGAG, 040702040BGAAAC.
I want to filter to identify all values that contain 'AA', but only at the position _________AA ____, i.e. 040702040BGAAAC should not be included in the results.
How do I achieve that?
Current searches yield str.contains but I can't find how to specify the position of the substring.
Append \w{4}$ to your regex in the str.contains call, so the pattern becomes AA\w{4}$. This requires exactly four word characters between 'AA' and the end of the string.
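
The effect of the anchored pattern can be checked on the three sample values with the standard re module. This is a sketch on a plain list rather than a DataFrame column; the name POSITION_RE is hypothetical.

```python
import re

values = ["0302000C0AABGBG", "0407020B0AAAGAG", "040702040BGAAAC"]

# "AA" must be followed by exactly four word characters and then the end.
POSITION_RE = re.compile(r"AA\w{4}$")

matching = [v for v in values if POSITION_RE.search(v)]
print(matching)
```

The third value is rejected because its only 'AA' occurrences have fewer than four characters after them.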

using str.replace() to remove nth character from a string in a pandas dataframe

I have a pandas dataframe that consists of strings. I would like to remove the n-th character from the end of the strings. I have the following code:
DF = pandas.DataFrame({'col': ['stri0ng']})
DF['col'] = DF['col'].str.replace('(.)..$','')
Instead of removing the third-to-last character (0 in this case), it removes 0ng. The result should be string, but the output is stri. Where am I going wrong?
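Instead, replace a single character that is followed by exactly n-1 characters at the end of the string (here n = 3). The lookahead checks those trailing characters without consuming them. Pass regex=True, since pandas 2.0 and later treat the pattern as a literal string by default:
DF['col'] = DF['col'].str.replace('.(?=.{2}$)', '', regex=True)
      col
0  string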
If you want to make sure you're only removing digits (so that 'string' in one special row doesn't get changed to 'strng'), then use something like '[0-9](?=.{2}$)' as pattern.
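
The difference between the three patterns can be seen on plain strings with the standard re module. This is only a sketch; the variable names are hypothetical.

```python
import re

text = "stri0ng"

# Original attempt: the pattern consumes the last three characters,
# so all three are removed.
attempt = re.sub(r"(.)..$", "", text)

# Lookahead: matches one character and only checks (does not consume)
# the two characters that follow it.
fixed = re.sub(r".(?=.{2}$)", "", text)

# Digit-only variant: a string with no digit in that position is unchanged.
untouched = re.sub(r"[0-9](?=.{2}$)", "", "string")

print(attempt, fixed, untouched)
```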
Another way, using pd.Series.str.slice_replace (the positions here are counted from the start of the string):
DF['col'].str.slice_replace(4, 5, '')
Output:
0 string
Name: col, dtype: object

How to slice all of the elements of pandas dataframe at once?

I have the following data stored in my pandas DataFrame:
         Factor         SimTime          RealTime  SimStatus
0  Factor[0.48]  SimTime[83.01]  RealTime[166.95]  Paused[F]
1  Factor[0.48]  SimTime[83.11]  RealTime[167.15]  Paused[F]
2  Factor[0.49]  SimTime[83.21]  RealTime[167.36]  Paused[F]
3  Factor[0.48]  SimTime[83.31]  RealTime[167.57]  Paused[F]
I want to create a new dataframe with only everything within [].
I am attempting to use the following code:
df = dataframe.apply(lambda x: x.str.slice(start=x.str.find('[')+1, stop=x.str.find(']')))
However, all I see in df is NaN. Why? What's going on? What should I do to achieve the desired behavior?
You can use a regex to replace the contents:
df = dataframe.replace(r'\w+\[(\S+)\]', r'\1', regex=True)
Edit
From the pandas documentation for DataFrame.replace:
Replace values given in to_replace with value.
Both the target string and the value that replaces it can be regular expressions. For that, you need to pass regex=True to replace.
https://regex101.com/r/7KCs6q/1
See the link above for a detailed explanation of the regular expression.
Basically, the pattern matches any run of word characters followed by square brackets containing non-whitespace characters, and it replaces the whole match with the non-whitespace content captured inside the brackets.
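
The capture-and-replace step can be tried on one row of the sample data with the standard re module. This is a sketch on a plain list; the name BRACKET_RE is hypothetical.

```python
import re

row = ["Factor[0.48]", "SimTime[83.01]", "RealTime[166.95]", "Paused[F]"]

# Word characters, then "[", a captured non-whitespace run, then "]".
BRACKET_RE = re.compile(r"\w+\[(\S+)\]")

# Replace each whole cell with the captured content inside the brackets.
inner = [BRACKET_RE.sub(r"\1", cell) for cell in row]
print(inner)
```

The results are still strings, so a numeric conversion would be a separate step.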
