How to speed up this Regular Expression in .Replace function? - python

I have 4.5 million rows to process, so I badly need to speed this up.
I don't understand regex very well, so the other answers are hard for me to follow.
I have a column that contains IDs, e.g.:
ELI-123456789
The numeric part of this ID is contained in a string in a column bk_name, preceded by a "#":
AAAAA#123456789_BBBBB;CCCCC;
Now my goal is to change that string so the ID moves to the end, preceded by a "#", and save it in new_name:
AAAAA_BBBBB;CCCCC;#123456789
Here's what I tried:
Take the ID and replace "ELI-" with "#", save as ID2:
df["ID2"] = df["ID"].str.replace("ELI-", "#")
Find ID2 in the string of bk_name and replace it with "":
df["new_name"] = df["bk_name"].replace(regex = r'(?i)' + df["ID2"].astype(str), value = "")
Take the ID again, replace "ELI-" with "#", add it at the end of the string of new_name:
df["new_name"] = df["new_name"] + df["ID"].str.replace("ELI-", "#")
Now the problem is that step 2 is the line that takes most of the time: it took 6.5 seconds to process 6700 rows.
But now I have 4.6 million rows to process; it's been 7 hours and it's still running, and I have no idea why.
My guess is that the regex slows down my code, but I don't understand it deeply enough to be sure.
So thanks for your help in advance, any suggestions would be appreciated :)

So, your approach wasn't that bad.
I recommend looking into a function's docs before using it: the built-in str.replace takes no keyword arguments. (The example below works on a plain dict of strings rather than a DataFrame.)
Regarding step 2, don't use a regex just to replace one string with another ("" in this case).
df = {};
df['ID'] = 'ELI-123456789'; df['bk_name'] = 'AAAAA#123456789_BBBBB;CCCCC;';
print(df)
# Take the ID and replace "ELI-" with "#", save as ID2:
# df["ID2"] = df["ID"].str.replace("ELI-", "#")
df["ID2"] = df["ID"].replace("ELI-", "#")
print(df)
# Find ID2 in the string of bk_name and replace it with "":
# df["new_name"] = df["bk_name"].replace(regex = r'(?i)' + df["ID2"].astype(str), value = "")
df["new_name"] = df["bk_name"].replace(df["ID2"], "")
print(df)
# Take the ID again, replace "ELI-" with "#", add it at the end of the string of new_name:
# df["new_name"] = df["new_name"] + df["ID"].str.replace("ELI-", "#")
df["new_name"] = df["new_name"] + df["ID"].replace("ELI-", "#")
print(df)
Hope it helps
EDIT:
Are you sure this piece of code is what's slowing down your project? Have you tried isolating it to see how long it actually takes to execute?
Try this:
import timeit
code_to_test = """
df = {};
df['ID'] = 'ELI-123456789'; df['bk_name'] = 'AAAAA#123456789_BBBBB;CCCCC;';
df["new_name"] = df['bk_name'].replace('#' + df['ID'][4:],'') + '#' + df['ID'][4:] # <----- ONE LINER
# print(df['new_name'])
"""
elapsed_time = timeit.timeit(code_to_test, number=4600000)
print(elapsed_time)
I shortened it to a one-liner, which ONLY WORKS IF:
ELI- (or some other 4-character prefix) is always the string to be replaced
# is always the character to be removed and # is always the one to be added
At least on the laptop I'm writing this on, it consistently executes the code 4600000 times in under 4 secs.
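For an actual DataFrame, the same one-liner can be applied row-wise with apply. This is just a sketch; the two-row frame below is made up to match the shapes in the question:

```python
import pandas as pd

# Hypothetical frame with the column shapes from the question.
df = pd.DataFrame({
    "ID": ["ELI-123456789", "ELI-987654321"],
    "bk_name": ["AAAAA#123456789_BBBBB;CCCCC;", "XXXXX#987654321_YYYYY;"],
})

# Row-wise version of the one-liner above: strip "#<digits>" out of
# bk_name, then append it at the end.
df["new_name"] = df.apply(
    lambda r: r["bk_name"].replace("#" + r["ID"][4:], "") + "#" + r["ID"][4:],
    axis=1,
)
print(df["new_name"].tolist())
# ['AAAAA_BBBBB;CCCCC;#123456789', 'XXXXX_YYYYY;#987654321']
```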

Well, if you know that the ID part is always ELI- followed by a number, and that this number always comes after the # near the beginning, then I would skip reading the ID in step 1 entirely. Work directly on step 2 with this regular expression and replacement:
https://regex101.com/r/QBQB7E/1
You should be able to create the content of your new field with one single assignment from the result of the substitution on the old field. The idea for speeding things up is to avoid 4-5 lines of code for getting the id, searching for it and then recomposing the result: the regular expression and the substitution pattern can do all of that in one single operation.
You can generate some Python code directly from Regex101 so that you can integrate it to create your new_name field's content.
Explanation
^(.*?)#(\d+)(.*)$ is the regular expression
^ means "starting with".
$ means "ending with".
() is to capture a group and create a variable that we can then use
in the replacement pattern. \1 for the first matching group, \2 for
the second, etc.
.*? matches any character between zero and unlimited times, as few times
as possible, expanding as needed (lazy).
# matches the character # literally (case sensitive).
\d+ matches a digit (equal to [0-9]) one or more times.
.* matches any character between zero and unlimited times, as many times as possible, giving back as needed (greedy).
The replacement string is \1\3#\2, so it takes the first matching group
followed by the last one and then adds a # followed by the second matching
group, which is your id.
In terms of speed, the regex itself could also be tweaked a bit to find a faster version, depending on how it's written.
Second version:
^([^#]+)#(\d+)(.*)$ where I replaced .*? with [^#]+, meaning: find any character which is not #, one or more times.
Solution here: https://regex101.com/r/QBQB7E/3
Tested the code and I got it in about 4 seconds...
import timeit
code_to_test = """
import re
regex = r"^([^#]+)#(\d+)(.*)$"
subst = "\\1\\3#\\2"
df = {
    "ID": "ELI-123456789",
    "bk_name": "AAAAA#123456789_BBBBB;CCCCC;"
}
df["new_name"] = re.sub(regex, subst, df["bk_name"])
"""
elapsed_time = timeit.timeit(code_to_test, number=4600000)
print(elapsed_time)
I really think that if your code runs for 7 hours, the problem is probably somewhere else. The regex engine doesn't seem much slower than doing manual search/replace operations.
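If the data lives in a pandas DataFrame as in the question, this same substitution can be applied to the whole column at once with the vectorized str.replace, avoiding any per-row Python loop. A sketch, assuming the bk_name column from the question:

```python
import pandas as pd

df = pd.DataFrame({"bk_name": ["AAAAA#123456789_BBBBB;CCCCC;"]})

# Vectorized regex substitution over the whole column:
# capture (before #)(digits)(after), reorder, append "#digits" at the end.
df["new_name"] = df["bk_name"].str.replace(
    r"^([^#]+)#(\d+)(.*)$", r"\1\3#\2", regex=True
)
print(df["new_name"].iloc[0])  # AAAAA_BBBBB;CCCCC;#123456789
```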

I don't understand what caused the problem, but I solved it.
The only thing I changed is step 2: I used an apply/lambda function and it suddenly works, and I have no idea why.
Take the ID and replace "ELI-" with "#", save as ID2:
df["ID2"] = df["ID"].str.replace("ELI-", "#")
Find ID2 in the string of bk_name and replace it with "":
df["new_name"] = df[["bk_name", "ID2"]].astype(str).apply(lambda x: x["bk_name"].replace(x["ID2"], ""), axis=1)
Take the ID again, replace "ELI-" with "#", add it at the end of the string of new_name:
df["new_name"] = df["new_name"] + df["ID"].str.replace("ELI-", "#")

Related

Seemingly simple text replacement script question

I have a little "problem" that I would like to solve by programming a simple script. I don't have much programming experience and thought I'd ask here what I should look at or know to do this.
Basically I want to take an email address such as placeholder.1234#fakemail.com and turn it into pl*************4#fakemail.com.
I need the script to take the letters after the first two, and before the last one before the #, and turn those letters into asterisks, matching the number of characters.
Just for clarification, I am not asking for someone to write this for me, I just need some guidance on how to go about it. I would also like to do this in Python, as I already have Python set up on my PC.
You can use Python string slicing:
email = "placeholder.1234#fakemail.com"
idx1 = 2
idx2 = email.index("#") - 1
print(email[:idx1] + "*" * (idx2 - idx1) + email[idx2:])
Output:
pl*************4#fakemail.com
Explanation:
Define the string that will contain the email address:
email = "placeholder.1234#fakemail.com"
Define the index of where the asterisks should begin, which is 2:
idx1 = 2
Define the index of where the asterisks should end, which is the index of where the # symbol is minus 1:
idx2 = email.index("#") - 1
Finally, using the indices defined, you can slice and concatenate the string defined accordingly:
print(email[:idx1] + "*" * (idx2 - idx1) + email[idx2:])
So this email will be a string.
Try using a combination of string indexing and string replacement (or string concatenation).
First, let's think about what data type we would store this in. It should be a String since it contains letters and characters.
We know we want to replace a portion of the String from the THIRD character to the character 2 before "#".
Let's think about this in terms of indexing now. The third character, i.e. our start index for replacement, is at index 2. To find the index of the character "#", we can use the index() function like so:
end = email.index('#')
However, we want to start replacing from 2 before that index, so we can just subtract 2 from it.
Now we have the indexes (start and end points) of what we want to replace. We can take the substring spanning those indexes and use the .replace() function to replace it in our email with a run of *'s.
To determine how many *'s we need, we can take the difference of the indexes and add 1 to get the total number. So we could do something like:
stars = "*" * (end - start + 1)
email = email.replace(email[start:end + 1], stars)
Notice how I used start:end + 1 for the substring. This is because we want to include the ending index, and slicing on its own will not include it.
I hope this helped answer your question! Please let me know if you need any further clarification or details:)
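Putting those steps together, a complete sketch (assuming the address always contains exactly one #):

```python
email = "placeholder.1234#fakemail.com"

start = 2                     # begin masking at the third character
end = email.index('#') - 2    # stop two characters before the '#'

# One asterisk per masked character, inclusive of both endpoints.
stars = "*" * (end - start + 1)
masked = email.replace(email[start:end + 1], stars)
print(masked)  # pl*************4#fakemail.com
```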

Substring replacements based on replace and no-replace rules

I have a string and rules/mappings for replacement and no-replacements.
E.g.
"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."
Replacement rules:
replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}
Result:
"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
Additional criteria:
Only replace if case is matched, i.e. case matters.
Whole-word replacement only; punctuation should be ignored, but kept after replacement.
I was thinking what would the cleanest way to solve this problem in Python 3.x be?
Based on the answer of demongolem.
UPDATE
I am sorry, I missed the fact that only whole words should be replaced. I updated my code and even generalized it for use in a function.
def replace_whole(sentence, replace_token, replace_with, dont_replace):
    rx = f"[\"'.,:; ]({replace_token})[\"'.,:; ]"
    matches = re.finditer(rx, sentence)
    out_sentence = ""
    found = []
    indices = []
    for m in matches:
        indices.append(m.start(0))
        found.append(m.group())
    context_size = len(dont_replace)
    for i in range(len(indices)):
        context = sentence[indices[i] - context_size:indices[i] + context_size]
        if dont_replace in context:
            continue
        else:
            # First replace the word only in the substring found
            to_replace = found[i].replace(replace_token, replace_with)
            # Then replace the word in the context found, so any special token like "" or . gets carried over and the context does not change
            replace_val = context.replace(found[i], to_replace)
            # Finally replace the context found with the replacing context
            out_sentence = sentence.replace(context, replace_val)
    return out_sentence
Use regular expressions to find all occurrences and values of your string (we need to check whether it is a whole word or embedded in some other word), using finditer(). You might need to adjust the rx to match your definition of "whole word". Then get the context around these values, sized to your no_replace rule, and check whether the context contains your no_replace string.
If not, you may replace it: use replace() on the word only, then replace the occurrence of the word in the context, then replace the context in the whole text. That way the replacement is nearly unique and no weird behaviour should happen.
Using your examples, this leads to:
replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
>>>"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
and
replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
>>>'This is an example sentence that needs to be processed into a new processed_sentence.'
After some research, this is what I believe to be the best and cleanest solution to my problem. The solution works by calling match_fun whenever a match has been found, and match_fun performs the replacement if and only if there is no no-replace phrase overlapping the current match. Let me know if you need more clarification or if you believe something can be improved.
replace_dict = ...  # The code below assumes you already have this
no_replace_dict = ...  # The code below assumes you already have this
text = ...  # The text on input.

def match_fun(match: re.Match):
    str_match: str = match.group()
    if str_match not in no_replace_dict:
        return replace_dict[str_match]
    for no_replace in no_replace_dict[str_match]:
        no_replace_matches_iter = re.finditer(r'\b' + no_replace + r'\b', text)
        for no_replace_match in no_replace_matches_iter:
            if no_replace_match.start() >= match.start() and no_replace_match.start() < match.end():
                return str_match
            if no_replace_match.end() > match.start() and no_replace_match.end() <= match.end():
                return str_match
    return replace_dict[str_match]

for replace in replace_dict:
    pattern = re.compile(r'\b' + replace + r'\b')
    text = pattern.sub(match_fun, text)
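For reference, here is a self-contained sketch of the same idea. The shape of no_replace_dict (word mapped to a list of protecting phrases) is an assumption, and the overlap test is simplified to a containment check:

```python
import re

replace_dict = {'sentence': 'processed_sentence'}
no_replace_dict = {'sentence': ['example sentence']}

def apply_rules(text):
    def match_fun(match):
        word = match.group()
        if word not in no_replace_dict:
            return replace_dict[word]
        # Keep the word unchanged if this match sits inside a protected phrase.
        for phrase in no_replace_dict[word]:
            for m in re.finditer(r'\b' + re.escape(phrase) + r'\b', text):
                if m.start() <= match.start() and match.end() <= m.end():
                    return word
        return replace_dict[word]

    for word in replace_dict:
        text = re.compile(r'\b' + re.escape(word) + r'\b').sub(match_fun, text)
    return text

print(apply_rules("This is an example sentence that needs to be "
                  "processed into a new sentence."))
# This is an example sentence that needs to be processed into a new processed_sentence.
```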

Python inserting spaces in string

Alright, I'm working on a little project for school, a 6-frame translator. I won't go into too much detail, I'll just describe what I wanted to add.
The normal output would be something like:
TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSD
The important part of this string are the M and the _ (the start and stop codons, biology stuff). What I wanted to do was highlight these like so:
TTCPTISPALGLAWS_DLGTLGF 'MSYSANTASGETLVSLYQLGLFEM_' VVSYGRTKYYLICP_LFHLSVGFVPSD
Now here is where (for me) it gets tricky, I got my output to look like this (adding a space and a ' to highlight the start and stop). But it only does this once, for the first start and stop it finds. If there are any other M....._ combinations it won't highlight them.
Here is my current code, attempting to make it highlight more than once:
def start_stop(translation):
    index_2 = 0
    while True:
        if 'M' in translation[index_2::1]:
            index_1 = translation[index_2::1].find('M')
            index_2 = translation[index_1::1].find('_') + index_1
            new_translation = translation[:index_1] + " '" + \
                translation[index_1:index_2 + 1] + "' " + \
                translation[index_2 + 1:]
        else:
            break
    return new_translation
I really thought this would do it, guess not. So now I find myself being stuck.
If any of you are willing to try and help, here is a randomly generated string with more than one M....._ set:
'TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSDGRRLTLYMPPARRLATKSRFLTPVISSG_DKPRHNPVARSQFLNPLVRPNYSISASKSGLRLVLSYTRLSLGINSLPIERLQYSVPAPAQITP_IPEHGNARNFLPEWPRLLISEPAPSVNVPCSVFVVDPEHPKAHSKPDGIANRLTFRWRLIG_VFFHNAL_VITHGYSRVDILLPVSRALHVHLSKSLLLRSAWFTLRNTRVTGKPQTSKT_FDPKATRVHAIDACAE_QQH_PDSGLRFPAPGSCSEAIRQLMI'
Thank you to anyone willing to help :)
Regular expressions are pretty handy here:
import re
sequence = "TTCP...."
highlighted = re.sub(r"(M\w*?_)", r" '\1' ", sequence)
# Output:
"TTCPTISPALGLAWS_DLGTLGF 'MSYSANTASGETLVSLYQLGLFEM_' VVSYGRTKYYLICP_LFHLSVGFVPSDGRRLTLY 'MPPARRLATKSRFLTPVISSG_' DKPRHNPVARSQFLNPLVRPNYSISASKSGLRLVLSYTRLSLGINSLPIERLQYSVPAPAQITP_IPEHGNARNFLPEWPRLLISEPAPSVNVPCSVFVVDPEHPKAHSKPDGIANRLTFRWRLIG_VFFHNAL_VITHGYSRVDILLPVSRALHVHLSKSLLLRSAWFTLRNTRVTGKPQTSKT_FDPKATRVHAIDACAE_QQH_PDSGLRFPAPGSCSEAIRQLMI"
Regex explanation:
We look for an M followed by any number of "word characters" \w* then an _, using the ? to make it a non-greedy match (otherwise it would just make one group from the first M to the last _).
The replacement is the matched group (\1 indicates "first group", there's only one), but surrounded by spaces and quotes.
You just need a little slicing; no external module is required.
Python strings have a method called index(), just use it.
string_1='TTCPTISPALGLAWS_DLGTLGFMSYSANTASGETLVSLYQLGLFEM_VVSYGRTKYYLICP_LFHLSVGFVPSD'
before=string_1.index('M')
after=string_1[before:].index('_')
print('{} {} {}'.format(string_1[:before],string_1[before:before+after+1],string_1[before+after+1:]))
output:
TTCPTISPALGLAWS_DLGTLGF MSYSANTASGETLVSLYQLGLFEM_ VVSYGRTKYYLICP_LFHLSVGFVPSD

Print everything after a few specific characters in a string

My intention was to use the index method to search for either a colon (:) or an equals sign (=) in the string and print everything after that character, but I realized it's not syntactically possible as written below with the or expression. Is there another way to write this piece of code? (I wasn't able to come up with a simple way without getting into loops and if statements.)
l='Name = stack'
pos=l.index(':' or '=')
print (' '.join(l[pos+1:-1].split())) #this just gets rid of the whitespaces
Assuming your example as above, the long way (explanation of each piece below):
pos = max(l.find(':'), l.find('='))
print(l[pos + 1:].strip())
Here's a way to shorten it to one line, with an explanation of each part in the order it's evaluated.
print(l[max(l.find(':'), l.find('=')) + 1:].strip())
#--------------- Breakdown
# max -> the higher of the two values; find returns -1 if the character isn't there.
# if neither ':' nor '=' is in the string, both finds return -1, so pos + 1 is 0
# and the whole string prints.
# l[pos + 1:] -> everything after the separator, through to the end (implied with empty :])
# .strip() -> remove surrounding whitespace
import re
l='Name = stack'
print(re.split(':|=', l)[-1])
Regular-expression split on either character, then take the last piece.
You didn't mention whether there is guaranteed to be one separator or the other and not both, always a separator, or never more than one... this might not do what you want, depending.
You should limit the number of splits to one, using maxsplit in re.split():
import re
s1 = 'name1 = x1 and noise:noise=noise'
s2 = 'name2: x2 and noise:noise=noise'
print(re.split(':|=', s1, maxsplit=1)[-1].strip())
print(re.split(':|=', s2, maxsplit=1)[-1].strip())
Output:
x1 and noise:noise=noise
x2 and noise:noise=noise

python's regular expression that repeats

I have a list of lines. I'm writing a typical text modifying function that runs through each line in the list and modifies it when a pattern is detected.
I realized later, while writing this type of function, that a pattern may repeat multiple times in a line.
For example, this is one of the functions I wrote:
def change_eq(string):
    # inputs a string and outputs the modified string
    # replaces (X####=#) with (X####==#)
    # set pattern
    pat_eq = r"""(.*)            # stuff before
    ([\(\|][A-Z]+[0-9]*)         # either ( or | followed by the variable name
    (=)                          # single equal sign we want to change
    ([0-9]*[\)\|])               # numeric value of the variable followed by ) or |
    (.*)"""                      # stuff after
    p = re.compile(pat_eq, re.X)
    p1 = p.match(string)
    if bool(p1) == 1:
        # if the pattern in pat_eq is detected, replace that portion of the string with a modified version
        original = p1.group(0)
        fixed = p1.group(1) + p1.group(2) + "==" + p1.group(4) + p1.group(5)
        string_c = string.replace(original, fixed)
        return string_c
    else:
        # return the original string
        return string
But for an input string such as
'IF (X2727!=78|FLAG781=0) THEN PURPILN2=(X2727!=78|FLAG781=0)*X2727'
, group() only works on the last pattern detected in the string, so it changes it to
'IF (X2727!=78|FLAG781=0) THEN PURPILN2=(X2727!=78|FLAG781==0)*X2727'
, ignoring the first case detected. I understand that's the product of my function using the group attribute.
How would I address this issue? I know there is {m,n}, but does it work with match?
Thank you in advance.
Different languages handle "global" matches in different ways. You'll want to use Python's re.finditer (link) and use a for loop to iterate through the resulting match objects.
Example with some of your code:
# Drop the leading and trailing (.*) groups from the pattern first;
# otherwise one match spans the whole line and only the last
# occurrence gets rewritten.
pat_core = r"""([\(\|][A-Z]+[0-9]*)  # either ( or | followed by the variable name
(=)                                  # single equal sign we want to change
([0-9]*[\)\|])"""                    # numeric value of the variable followed by ) or |
p = re.compile(pat_core, re.X)
string_c = string
for match_obj in p.finditer(string):
    original = match_obj.group(0)
    fixed = match_obj.group(1) + '==' + match_obj.group(3)
    string_c = string_c.replace(original, fixed)
return string_c
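Alternatively, since re.sub rewrites every non-overlapping match, a pattern without the wrapping (.*) groups lets a single call handle all occurrences. A sketch using the sample line from the question:

```python
import re

line = 'IF (X2727!=78|FLAG781=0) THEN PURPILN2=(X2727!=78|FLAG781=0)*X2727'

# ( or | , variable name, a single '=', numeric value, ) or |
pattern = r'([(|][A-Z]+[0-9]*)=([0-9]*[)|])'
print(re.sub(pattern, r'\1==\2', line))
# IF (X2727!=78|FLAG781==0) THEN PURPILN2=(X2727!=78|FLAG781==0)*X2727
```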
