Python regex fuzzy searching - python

I have a question about making a pattern using fuzzy regex with the python regex module.
I have several strings such as TCATGCACGTGGGGCTGAC
The first eight characters of this string are variable (multiple options): TCAGTGTG, TCATGCAC, TGGTGGCT. In addition, there is a constant part after the variable part: GTGGGGCTGAC.
I would like to design a regex that can detect this string in a longer string, while allowing for at most 2 substitutions.
For example, this would be acceptable as two characters have been substituted:
TCATGCACGTGGGGCTGAC
TCCTGCACGTGGAGCTGAC
However, more substitutions should not be accepted.
In my code, I tried to do the following:
import regex
variable_parts = ["TCAGTGTG", "TCATGCAC", "TGGTGGCT", "GATAAGTG", "ATTAGACG", "CACTTCCG", "GTCTGTAT", "TGTCAAAG"]
string_to_test = "TCATGCACGTGGGGCTGAC"
motif = "(%s)GTGGGGCTGAC" % "|".join(variable_parts)
pattern = regex.compile(r''+motif+'{s<=2}')
print(pattern.search(string_to_test))
I get a match when I run this code and when I change the last character of string_to_test. But when I manually add a substitution in the middle of string_to_test, I do not get any match (even though I want to allow up to 2 substitutions).
Now I know that my regex is probably total crap, but I would like to know what I exactly need to do to make this work and where in the code I need to add/remove/change stuff. Any suggestions/tips are welcome!

Right now, the {s<=2} restriction applies only to the last C in the pattern, which looks like (TCAGTGTG|TCATGCAC|TGGTGGCT|GATAAGTG|ATTAGACG|CACTTCCG|GTCTGTAT|TGTCAAAG)GTGGGGCTGAC{s<=2}.
To apply the {s<=2} quantifier to the whole expression you need to enclose the pattern within a non-capturing group:
pattern = regex.compile(fr'(?:{motif}){{s<=2}}')
The example above shows how to declare your pattern with the help of an f-string literal, where literal braces are defined with {{ and }} (doubled) braces. It yields the same result as pattern = regex.compile('(?:'+motif+'){s<=2}').
Also, note that r''+ is redundant and has no effect on the final pattern.
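As a cross-check that does not depend on the regex module, substitution-only matching can also be sketched in plain Python by sliding a window over the text and counting mismatches (a Hamming distance). This is a swapped-in technique, not what regex does internally. The helper names hamming and fuzzy_find are made up for illustration, and only the three variable parts named in the question text are used.

```python
VARIABLE_PARTS = ["TCAGTGTG", "TCATGCAC", "TGGTGGCT"]
CONSTANT = "GTGGGGCTGAC"


def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def fuzzy_find(text, variable_parts, constant, max_subs=2):
    """Return (start, motif, substitutions) for the best hit, or None.

    Only substitutions are allowed, so every window has the motif's length.
    """
    best = None
    for part in variable_parts:
        motif = part + constant
        for start in range(len(text) - len(motif) + 1):
            window = text[start:start + len(motif)]
            subs = hamming(motif, window)
            if subs <= max_subs and (best is None or subs < best[2]):
                best = (start, motif, subs)
    return best


# Two substitutions (positions 2 and 12): accepted.
print(fuzzy_find("TCCTGCACGTGGAGCTGAC", VARIABLE_PARTS, CONSTANT))
# (0, 'TCATGCACGTGGGGCTGAC', 2)

# A third substitution at the end: rejected.
print(fuzzy_find("TCCTGCACGTGGAGCTGAT", VARIABLE_PARTS, CONSTANT))
# None
```

If the grouped regex pattern and this scan disagree on an input, that is a hint that the quantifier is still attached to the wrong part of the pattern.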

Regex Statement to only match parts of a string for comparison - Python

What I am trying to do is match values from one file to another, but I only need to match the first portion of the string and the last portion.
I am reading each file into a list, and manipulating these based on different Regex patterns I have created. Everything works, except when it comes to these types of values:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
In this example, I only want to match 'V-1\ZDS\R\EMBO-20' and then compare the '24' value at the end of the string. The number x in '20-x:', can vary and doesn't matter in terms of comparisons, as long as the first and last parts of this string match.
This is the Regex I am using:
re.compile(r"(?:.*V-1\\ZDS\\R\\EMBO-20-\d.*)(:\d*\w.*)")
Once I filter down the list, I use the following function to return the difference between the two sets:
funcDiff = lambda x, y: list((set(x)- set(y))) + list((set(y)- set(x)))
Is there a way to take the list of differences and filter out the ones that have matching values after the
:
as mentioned above?
I apologize if this is an obvious answer, I'm new to Python and Regex!
The output I get is the differences between the entire strings, so even if the first and last part of the string match, if the number following the 'EMBO-20-x' doesn't also match, it returns it as being different.
Before discussing your question, regex101 is an incredibly useful tool for this type of thing.
Your problem stems from two issues:
1.) The way you used .*
2.) Greedy vs. Nongreedy matches
.* kinda sucks
.* is a regex expression that is very rarely what you actually want.
As a quick aside, a useful regex expression is [^c]* or [^c]+. These expressions match any character except the letter c, with the first expression matching 0 or more, and the second matching 1 or more.
.* will match all characters as many times as it can. Instead, try to start your regex patterns with more concrete starting points. Two good ways to do this are lookbehind expressions and anchors.
Another quick aside: it's likely that you are mixing up re.match and re.search. match only returns a match that begins at the start of the string, while search returns the first match anywhere in the input string. This could be the reason you included the .* in the first place: to allow a match call to find something deeper in the string.
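A minimal sketch of that difference, using a made-up input line (the "id=7 " prefix is an assumption for illustration, not part of the question's data):

```python
import re

# Hypothetical line where the record does not sit at the very start.
line = r"id=7 V-1\ZDS\R\EMBO-20-1:24"
pattern = re.compile(r"EMBO-20-\d:(\d+)")

at_start = pattern.match(line)    # match() only tries position 0
anywhere = pattern.search(line)   # search() scans the whole string

print(at_start)           # None
print(anywhere.group(1))  # 24
```

With search there is no need for a leading .* just to reach text further into the string.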
Lookbehind Expressions
There are more complete explanations online, but in short, regex patterns like:
(?<=test)foo
will match the text foo, but only if test is right in front of it. To be more clear, the following strings will not match that regex:
foo
test-foo
test foo
but the following string will match:
testfoo
This will only match the text foo, though.
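The four strings above can be checked directly. This is a small sketch; the variable names are only for illustration:

```python
import re

pattern = re.compile(r"(?<=test)foo")

results = {}
for text in ["foo", "test-foo", "test foo", "testfoo"]:
    m = pattern.search(text)
    # Store the matched text and its span, or None when nothing matched.
    results[text] = (m.group(0), m.span()) if m else None

for text, result in results.items():
    print(repr(text), "->", result)
# 'foo' -> None
# 'test-foo' -> None
# 'test foo' -> None
# 'testfoo' -> ('foo', (4, 7))
```

The span (4, 7) shows that test itself is not part of the match; the lookbehind only checks that it is there.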
Anchors
Another option is anchors. ^ and $ are special characters, matching the start and end of a line of text. If you know your regex pattern will match exactly one line of text, start it with ^ and end it with $.
Leading patterns with .* and ending with .* are likely the source of your issue. Although you did not include full examples of your input or your code, you likely used match as opposed to search.
In regex, . matches any character, and * means 0 or more times. This means that for any input, your pattern will match the entire string.
Greedy vs. Non-Greedy qualifiers
The second issue is related to greediness. A * in a regex pattern matches 0 or more characters, and by default it is greedy: it takes as many characters as it can. This can hide problems, because an entire * expression can also match nothing and be skipped silently. Your regex is likely matching much more text than you intended, hiding multiple records in a single .*.
The Actual Answer
Taking all of this in to consideration, let's assume that your input data looks like this:
V-1\ZDS\R\EMBO-20-1:24
V-1\ZDS\R\EMBO-20-6:24
V-1\ZDS\R\EMBO-20-3:93
V-1\ZDS\R\EMBO-20-6:22309
V-1\ZDS\R\EMBO-20-8:2238
V-1\ZDS\R\EMBO-20-3:28
A better regular expression would be:
^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$
To visualize this regex in action, follow this link.
There are several differences I would like to highlight:
Starting the expression with ^ and ending with $. This forces the regex to match exactly one line. Even though the pattern works without these characters, it's good practice when working with regex to be as explicit as possible.
No useless non-capturing group. Your example had a (?:) group at the start. This denotes a group that does not capture its match. It's useful if you want to match a subpattern multiple times ((?:ab){5} matches ababababab without capturing anything). However, in your example, it did nothing :)
Only capturing the number. This makes it easier to extract the value of the capture groups.
No use of *, one use of +. + works like *, but it matches 1 or more. This is often more correct, as it prevents 'skipping' entire characters.
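To connect this back to the original goal of comparing two files, here is a rough sketch that keeps only the captured trailing number from each line and then takes the symmetric difference. The names PATTERN and trailing_values and the split of the sample data into two lists are assumptions for illustration:

```python
import re

PATTERN = re.compile(r"^V-1\\ZDS\\R\\EMBO-20-\d:(\d+)$")


def trailing_values(lines):
    """Collect the number after the colon from every line that matches."""
    values = set()
    for line in lines:
        m = PATTERN.match(line)
        if m:
            values.add(m.group(1))
    return values


# Hypothetical contents of the two files.
file_a = [r"V-1\ZDS\R\EMBO-20-1:24", r"V-1\ZDS\R\EMBO-20-3:93"]
file_b = [r"V-1\ZDS\R\EMBO-20-6:24", r"V-1\ZDS\R\EMBO-20-3:28"]

# Values present in exactly one of the two files; the x in 20-x is ignored.
diff = sorted(trailing_values(file_a) ^ trailing_values(file_b))
print(diff)  # ['28', '93']
```

Because the comparison is done on the captured group instead of the whole string, 20-1:24 and 20-6:24 count as the same entry.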

Python: Dynamic matching with regular expressions

I've been trying several things to use variables inside a regular expression. None seem to be capable of doing what I need.
I want to search for a consecutively repeated substring (e.g. foofoofoo) within a string (e.g. "barbarfoofoofoobarbarbar"). However, I need both the repeating substring (foo) and the number of repetitions (In this case, 3) to be dynamic, contained within variables. Since the regular expression for repeating is re{n}, those curly braces conflict with the variables I put inside the string, since they also need curly braces around.
The code should match foofoofoo, but NOT foo or foofoo.
I suspect I need to use string interpolation of some sort.
I tried stuff like
n = 3
str = "foo"
string = "barbarfoofoofoobarbarbar"
match = re.match(fr"{str}{n}", string)
or
match = re.match(fr"{str}{{n}}", string)
or escaping with
match = re.match(fr"re.escape({str}){n}", string)
but none of that seems to work. Any thoughts? It's really important both pieces of information are dynamic, and it matches only consecutive stuff. Perhaps I could use findall or finditer? No idea how to proceed.
Something I haven't tried at all is not using regular expressions, but something like
if (str*n) in string:
match
I don't know if that would work, but if I ever need the extra functionality of regex, I'd like to be able to use it.
For the string barbarfoofoofoobarbarbar, if you wanted to capture foofoofoo, the regex would be r"(foo){3}". If you wanted to do this dynamically, you could do fr"({your_string}){{{your_number}}}".
If you want a curly brace in an f-string, you use {{ or }} and it'll be printed literally as { or }.
Also, str is not a good variable name because str is a class (the string class).
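Putting these points together, a minimal sketch might look like the following. The function name find_repeats is made up for illustration. It uses re.search rather than re.match, because match only looks at the start of the string, and re.escape so that substrings containing regex metacharacters are taken literally:

```python
import re


def find_repeats(substring, count, text):
    """Find `substring` repeated `count` times in a row inside `text`."""
    # {{ and }} produce literal braces; {count} is interpolated between them.
    pattern = re.compile(fr"(?:{re.escape(substring)}){{{count}}}")
    return pattern.search(text)


m = find_repeats("foo", 3, "barbarfoofoofoobarbarbar")
print(m.group(0), m.span())  # foofoofoo (6, 15)

# Only two repetitions: no match.
print(find_repeats("foo", 3, "barfoofoobar"))  # None

# The dot is escaped, so "aXb" does not count as "a.b".
print(find_repeats("a.b", 2, "xaXbaXby"))  # None
```

Note that a run of four or more repetitions still contains three in a row, so this will also find a match inside foofoofoofoo.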

Regex optional order of capturing group

I have a simple but tricky question about regex (used in Python), for which I have not found an answer anywhere here or on Google. Is there any "trick" to make two capture groups match in optional order? Let's say we have the following:
.*abc.*
What i want is to match also this:
.*acb.*
I know i could use
.*abc|acb.*
but the problem is that if we have something more complicated than abc, the code gets very long. Isn't there a workaround to say, e.g., "match the last two capturing groups (or symbols, etc.) in any order"?
I don't really see how an in-any-order construct would make the regex shorter. On the other hand, I can show you how to make this readable, even if you have tons of options.
import re
pattern = """
.* # match from starting the line
(?: # A non-capturing group starts so we can list lots of alternatives
abc| # alternative 1
acb # alternative 2
) # end of alternatives
.* # then match everything up to the end of the line
"""
re.search(pattern, 'qqabcqq', re.VERBOSE) # returns a match
re.search(pattern, 'qqacbqq', re.VERBOSE) # returns a match
re.search(pattern, 'qqaSDqq', re.VERBOSE) # does not return a match
So what did we just see here?
The """ ... """ construct is a convenient way to define multiline strings in python.
Then the re.VERBOSE skips the whitespaces and comments. As the manual says:
Whitespace within the pattern is ignored, except when in a character
class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
These two things let you add structure and comments to your regex. Here is another great example.
With standard regular expressions you can define patterns without order. Example:
[cdgjow]
Of course this example refers to characters.
Alternative sequences must be specified using "|". Example:
abc|cba
There is no way to express this in classic regular expression syntax: it simply has no element for "these parts in any order", so you have to specify the alternatives "manually". This is not a limit of the automaton constructed from regular expressions but of the regular expression syntax itself.
That means: You will have to construct the regular expression you require by yourself with all variants possible. There are two ways how to do this:
Do it manually. Take your time, be careful, build the correct regex manually.
Do it programmatically. Write some code that generates the regex you require.
If you do it manually, consider @TamasRev's answer. (Thanks @TamasRev! Nice answer!) But if I were you I'd build the regex programmatically. (That is what programming was invented for anyway :-) )
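One possible way to do it programmatically is to generate every ordering with itertools.permutations and join them with "|". This is only a sketch; the function name any_order_pattern and its parameters are made up for illustration:

```python
import re
from itertools import permutations


def any_order_pattern(fixed_prefix, parts):
    """Build a regex: fixed_prefix followed by all parts in any order."""
    alternatives = "|".join(
        "".join(re.escape(p) for p in ordering)
        for ordering in permutations(parts)
    )
    # Group the alternatives so "|" does not swallow the prefix.
    return re.compile(f"{re.escape(fixed_prefix)}(?:{alternatives})")


pattern = any_order_pattern("a", ["b", "c"])
print(pattern.pattern)  # a(?:bc|cb)

print(bool(pattern.search("qqabcqq")))  # True
print(bool(pattern.search("qqacbqq")))  # True
print(bool(pattern.search("qqaSDqq")))  # False
```

The number of alternatives grows factorially with the number of parts, so this stays practical only for a handful of them.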

Python regex: how to achieve this complex replacement rule?

I'm working with long strings and I need to replace with '' all the combinations of adjacent full stops . and/or colons :, but only when they are not adjacent to any whitespace. Examples:
a.bcd should give abcd
a..::.:::.:bcde.....:fg should give abcdefg
a.b.c.d.e.f.g.h should give abcdefgh
a .b should give a .b, because . here is adjacent to a whitespace on its left, so it must not be replaced
a..::.:::.:bcde.. ...:fg should give abcde.. ...:fg for the same reason
Well, here is what I tried (without any success).
Attempt 1:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1), r'', s1)
I would expect to get 'abcdefgh' but what I actually get is r''. I understood why: the code
re.search(r'[^\s.:]+([.:]+)[^\s.:]+', s1).group(1)
returns '.' instead of '\.', and thus re.search doesn't understand that it has to replace the single full stop . rather than understanding '.' as the usual regex.
Attempt 2:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*\S)[.:]+(\S[^\s.:]*)', r'\g<1>\g<2>', s1)
This doesn't work as it returns a.b.c.d.e.f.gh.
Attempt 3:
s1 = r'a.b.c.d.e.f.g.h'
re.sub(r'([^\s.:]*)[.:]+([^\s.:]*)', r'\g<1>\g<2>', s1)
This works on s1, but it doesn't solve my problem because on s2 = r'a .b' it returns a b rather than a .b.
Any suggestion?
There are multiple problems here. Your regex doesn't match what you want to match; but also, your understanding of re.sub and re.search is off.
To find something, re.search lets you find where in a string that something occurs.
To replace that something, use re.sub with the same regular expression instead of re.search, not in addition to it.
And, understand that re.sub(r'thing(moo)other', '', s1) replaces the entire match with the replacement string.
With that out of the way, for your regex, it sounds like you want
r'(?<![\s.:])[.:]+(?![\s.:])' # updated from comments, thanks!
which contains a character class with full stop and colon, repeated as many times as possible (notice how no backslash is necessary inside the square brackets: this is a context where dot and colon do not have any special meaning [1]). The lookarounds on both sides say that we cannot match these characters when there is whitespace \s on either side. They also exclude the characters themselves, so that the regex engine cannot find a match by applying the + less strictly (it will do its darndest to find a match if there is a way).
Now, the regex only matches the part you want to actually replace, so you can do
>>> import re
>>> s1 = 'name.surname#domain.com'
>>> re.sub(r'(?<![\s.:])[.:]+(?![\s.:])', r'', s1)
'namesurname#domaincom'
though in the broader scheme of things, you also need to know how to preserve some parts of the match. For the purpose of this demonstration, I will use a regular expression which captures into parenthesized groups the text before and after the dot or colon:
>>> re.sub(r'(.*\S)[.:]+(\S.*)', r'\g<1>\g<2>', s1)
'name.surname#domaincom'
See how \g<1> in the replacement string refers back to "whatever the first set of parentheses matched" and similarly \g<2> to the second parenthesized group.
You will also notice that this failed to replace the first full stop, because the .* inside the first set of parentheses matches as much of the string as possible. To avoid this, you need a regex which only matches as little as possible. We already solved that above with the lookarounds, so I will leave you here, though it would be interesting (and yet not too hard) to solve this in a different way.
[1] You could even say that the normal regex language (or syntax, or notation, or formalism) is separate from the language (or syntax, or notation, or formalism) inside square brackets!
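For completeness, the lookaround pattern can be checked against the five examples from the question. The name strip_inner_punctuation is made up for this sketch:

```python
import re

PATTERN = re.compile(r"(?<![\s.:])[.:]+(?![\s.:])")


def strip_inner_punctuation(s):
    """Remove runs of . and : that have no whitespace next to them."""
    return PATTERN.sub("", s)


# Input -> expected output, taken from the question.
examples = {
    "a.bcd": "abcd",
    "a..::.:::.:bcde.....:fg": "abcdefg",
    "a.b.c.d.e.f.g.h": "abcdefgh",
    "a .b": "a .b",
    "a..::.:::.:bcde.. ...:fg": "abcde.. ...:fg",
}

for source, expected in examples.items():
    result = strip_inner_punctuation(source)
    print(repr(source), "->", repr(result), result == expected)
```

One case the question does not cover: a run at the very start or end of the string has no neighbour on one side, so the negative lookaround succeeds there and the run is removed.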

replace a comma only if is between two numbers [duplicate]

I'm trying to replace commas for cases like:
123,123
where the output should be:
123123
for that I tried this:
re.sub('\d,\d','','123,123')
but that also deletes the digits. How can I avoid this?
I only want to remove the comma in that particular case, which is why I'm using regex. For this case, e.g.
'123,123 hello,word'
The desired output is:
'123123 hello,word'
You can use regex lookarounds to restrict which commas match: (?<=\d),(?=\d). Use ?<= for a lookbehind and ?= for a lookahead. They are zero-length assertions and don't consume characters, so the digits matched inside the lookarounds are not removed:
import re
re.sub(r'(?<=\d),(?=\d)', '', '123,123 hello,word')
# '123123 hello,word'
This is one of the cases where you want regular expression "lookaround assertions", which match at a position without consuming any characters (they have zero length).
Doing so allows you to match cases which would otherwise be "overlapping" in your substitution.
Here's an example:
import re
num = '123,456,7,8.012,345,6,7,8'
pattern = re.compile(r'(?<=\d),(?=\d)')
pattern.sub('',num)
# >>> '12345678.012345678'
... note that I'm using re.compile() to make this more readable and also because that usage pattern is likely to perform better in many cases. I'm using the same regular expression as @Psidom, but in a Python 'raw' string, which is the more common way to express regular expressions in Python.
I'm deliberately using an example where the matches would overlap if I were using a regular expression such as re.compile(r'(\d),(\d)') and substituting with back references to the captured characters, pattern.sub(r'\1\2', num). That would work for many examples, but '1,2,3' would come out as '12,3': the 2 is consumed by the first match, so the second comma no longer has a digit available in front of it.
This is one of the main reasons that these "lookaround" (lookahead and lookbehind) assertions exist: to avoid cases where you'd have to repeatedly/recursively apply a pattern due to capture and overlap semantics. These assertions don't capture; they match "zero" characters, as with some PCRE meta patterns like \b, which matches the zero-length boundary between words rather than any of the whitespace (\s) or non-"word" (\W) characters which separate words.
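The overlap problem is easy to see side by side. A small sketch, with an input string chosen for illustration:

```python
import re

text = "1,2,3 hello,world"

# Capturing approach: each match consumes the digits around the comma,
# so the digit between two commas cannot be reused by the next match.
captured = re.sub(r"(\d),(\d)", r"\1\2", text)

# Lookaround approach: only the comma itself is consumed.
lookaround = re.sub(r"(?<=\d),(?=\d)", "", text)

print(captured)    # 12,3 hello,world
print(lookaround)  # 123 hello,world
```

In both cases the comma in hello,world is left alone, because it has no digits next to it.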
