Python: Dynamic matching with regular expressions

I've been trying several things to use variables inside a regular expression. None seem to be capable of doing what I need.
I want to search for a consecutively repeated substring (e.g. foofoofoo) within a string (e.g. "barbarfoofoofoobarbarbar"). However, I need both the repeating substring (foo) and the number of repetitions (In this case, 3) to be dynamic, contained within variables. Since the regular expression for repeating is re{n}, those curly braces conflict with the variables I put inside the string, since they also need curly braces around.
The code should match foofoofoo, but NOT foo or foofoo.
I suspect I need to use string interpolation of some sort.
I tried stuff like
n = 3
str = "foo"
string = "barbarfoofoofoobarbarbar"
match = re.match(fr"{str}{n}", string)
or
match = re.match(fr"{str}{{n}}", string)
or escaping with
match = re.match(fr"re.escape({str}){n}", string)
but none of that seems to work. Any thoughts? It's really important both pieces of information are dynamic, and it matches only consecutive stuff. Perhaps I could use findall or finditer? No idea how to proceed.
Something I haven't tried at all is not using regular expressions, but something like
if (str*n) in string:
match
I don't know if that would work, but if I ever need the extra functionality of regex, I'd like to be able to use it.

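For the string barbarfoofoofoobarbarbar, if you wanted to capture foofoofoo, the regex would be r"(foo){3}". If you wanted to do this dynamically, you could use fr"({your_string}){{{your_number}}}". Note that re.match only matches at the beginning of the string, so use re.search to find a run that sits in the middle of the input.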
If you want a curly brace in an f-string, you use {{ or }} and it'll be printed literally as { or }.
Also, str is not a good variable name because str is a class (the string class).
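As a minimal sketch of the dynamic pattern above (the function name find_repeated and its parameter names are made up for illustration, not from the question), the substring can be passed through re.escape so that any regex metacharacters in it are treated literally, and the quantifier braces are doubled inside the f-string:

```python
import re

def find_repeated(substring, count, text):
    """Search `text` for `substring` repeated `count` times in a row.

    Returns a match object, or None when no such run exists.
    """
    # re.escape makes characters like "." or "+" in `substring` literal.
    # {{ and }} produce literal braces; {count} is interpolated between them.
    pattern = fr"(?:{re.escape(substring)}){{{count}}}"
    # re.search scans the whole string; re.match would only try position 0.
    return re.search(pattern, text)

m = find_repeated("foo", 3, "barbarfoofoofoobarbarbar")
print(m.group(0) if m else None)
```

For "foo" and 3 the generated pattern is (?:foo){3}. If the input contains a longer run such as foofoofoofoo, this still matches its first three repetitions; whether that counts as a hit depends on what you need.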

Related

Python regex fuzzy searching

I have a question about making a pattern using fuzzy regex with the python regex module.
I have several strings such as TCATGCACGTGGGGCTGAC
The first eight characters of this string are variable (multiple options): TCAGTGTG, TCATGCAC, TGGTGGCT. In addition, there is a constant part after the variable part: GTGGGGCTGAC.
I would like to design a regex that can detect this string in a longer string, while allowing for at most 2 substitutions.
For example, this would be acceptable as two characters have been substituted:
TCATGCACGTGGGGCTGAC
TCCTGCACGTGGAGCTGAC
However, more substitutions should not be accepted.
In my code, I tried to do the following:
import regex
variable_parts = ["TCAGTGTG", "TCATGCAC", "TGGTGGCT", "GATAAGTG", "ATTAGACG", "CACTTCCG", "GTCTGTAT", "TGTCAAAG"]
string_to_test = "TCATGCACGTGGGGCTGAC"
motif = "(%s)GTGGGGCTGAC" % "|".join(variable_parts)
pattern = regex.compile(r''+motif+'{s<=2}')
print(pattern.search(string_to_test))
I get a match when I run this code and when I change the last character of string_to_test. But when I manually add a substitution in the middle of string_to_test, I do not get any match (even while I want to allow up to 2 substitutions).
Now I know that my regex is probably total crap, but I would like to know what I exactly need to do to make this work and where in the code I need to add/remove/change stuff. Any suggestions/tips are welcome!
Right now, you only add the restriction to the last C in the pattern, which looks like (TCAGTGTG|TCATGCAC|TGGTGGCT|GATAAGTG|ATTAGACG|CACTTCCG|GTCTGTAT|TGTCAAAG)GTGGGGCTGAC{s<=2}.
To apply the {s<=2} quantifier to the whole expression you need to enclose the pattern within a non-capturing group:
pattern = regex.compile(fr'(?:{motif}){{s<=2}}')
The example above shows how to declare your pattern with the help of an f-string literal, where literal braces are defined with {{ and }} (doubled) braces. It yields the same result as pattern = regex.compile('(?:'+motif+'){s<=2}').
Also, note that r''+ is redundant and has no effect on the final pattern.
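To see what the two spellings produce without running a match, here is a small sketch that only builds the pattern strings (it uses a shortened, illustrative list of variable parts and does not need the third-party regex module):

```python
# Shortened list of variable parts, for illustration only.
variable_parts = ["TCAGTGTG", "TCATGCAC", "TGGTGGCT"]
motif = "(%s)GTGGGGCTGAC" % "|".join(variable_parts)

# Original attempt: the fuzzy quantifier binds only to the final "C".
unwrapped = motif + "{s<=2}"

# Wrapped in a non-capturing group: the quantifier applies to the whole motif.
wrapped_fstring = fr"(?:{motif}){{s<=2}}"
wrapped_concat = "(?:" + motif + "){s<=2}"

print(unwrapped)
print(wrapped_fstring)
print(wrapped_fstring == wrapped_concat)
```

The last line prints True, since the f-string with doubled braces and the plain concatenation build the same pattern text.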

Edit regex strings in Python using format method

I want to develop a regex in Python where a component of the pattern is defined in a separate variable and combined to a single string on-the-fly using Python's .format() string method. A simplified example will help to clarify. I have a series of strings where the space between words may be represented by a space, an underscore, a hyphen etc. As an example:
new referral
new-referal
new - referal
new_referral
I can define a regex string to match these possibilities as:
space_sep = '[\s\-_]+'
(The hyphen is escaped to ensure it is not interpreted as defining a character range.)
I can now build a bigger regex to match the strings above using:
myRegexStr = "new{spc}referral".format(spc = space_sep)
The advantage of this method for me is that I need to define lots of reasonably complex regexes where there may be several different commonly-occurring strings that occur multiple times and in an unpredictable order; defining commonly-used patterns beforehand makes the regexes easier to read and allows the strings to be edited very easily.
However, a problem occurs if I want to define the number of occurrences of other characters using the {m,n} or {n} structure. For example, to allow for a common typo in the spelling of 'referral', I need to allow either 1 or 2 occurrences of the letter 'r'. I can edit myRegexStr to the following:
myRegexStr = "new{spc}refer{1,2}al".format(spc = space_sep)
However, now all sorts of things break due to confusion over the use of curly braces (either a KeyError in the case of {1,2} or an IndexError: tuple index out of range in the case of {n}).
Is there a way to use the .format() string method to build longer regexes whilst still being able to define number of occurrences of characters using {n,m}?
You can double the { and } to escape them or you can use the old-style string formatting (% operator):
my_regex = "new{spc}refer{{1,2}}al".format(spc="hello")
my_regex_old_style = "new%(spc)srefer{1,2}al" % {"spc": "hello"}
print(my_regex) # newhellorefer{1,2}al
print(my_regex_old_style) # newhellorefer{1,2}al
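Putting the doubled braces together with the separator pattern from the question, a rough sketch could look like this (the name matches_referral is invented for the example, and the separator is written as a raw string, which the question did not do):

```python
import re

# Raw string so that \s and \- reach the regex engine unchanged.
space_sep = r"[\s\-_]+"

# {spc} is filled in by .format(); {{1,2}} becomes the literal quantifier {1,2}.
my_regex_str = "new{spc}refer{{1,2}}al".format(spc=space_sep)
my_regex = re.compile(my_regex_str)

def matches_referral(text):
    """Return True when the whole of `text` matches the built pattern."""
    return my_regex.fullmatch(text) is not None

for candidate in ["new referral", "new-referal", "new - referal", "new_referral"]:
    print(candidate, matches_referral(candidate))
```

All four sample strings from the question should print True, while a string with no separator at all (for example "newreferral") is rejected because the separator class uses +.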

Pandas extract text notation

I'm new to Pandas, using it for a class, and I can't for the life of me find a resource that shows the notation used in pandas when representing text in the extract function. For example:
movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)
I know this is telling the extract function to extract everything inside the parentheses from examples done in class, but I don't understand which symbols mean what inside the extract function. Is there a resource that can explain what these symbols mean? Thank you.
In General
The string argument of the .str.extract is a Regular Expression (regex), which is a language used for pattern matching and feature extraction in strings. If you go to the section called "Regular Expression Patterns" in the previous link you can find the meaning of the special control characters.
This Example
What specifically that regex string means is:
match any character, ., zero or more times, *, until a parenthesis, \(, then extract all the content in the parentheses, (.*), then close parenthesis, \), then any character zero or more times, .*, again.
Essentially this will match any string like: 'xxx(message)xxxx' or '(message)' or 'xx(message)' or '(message)x' and extract the 'message'.
Notes on Pandas and Regex
An important part of regular expressions (in general, but particularly for use in pandas with .str.extract) is capturing groups. You can 'capture' or grab part of a string by enclosing the pattern for that part in parentheses. Note that these are the unescaped parentheses in the regex (the inner set, with no preceding backslash) and not the actual parentheses that appear in the string itself, e.g. in 'xxx(message)xxx'.
Check out the docs on .str.extract for a few examples of using regex with capturing groups in pandas:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html
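To try the pattern outside pandas, here is a small sketch using the standard re module on a few made-up titles (the sample titles and the helper name extract_parenthesised are assumptions for illustration). It shows which part of each string the capturing group picks up:

```python
import re

# Same pattern as in the question, written as a raw string.
pattern = re.compile(r".*\((.*)\).*")

def extract_parenthesised(title):
    """Return the text captured by the group, or None when there is no match."""
    m = pattern.search(title)
    return m.group(1) if m else None

# Made-up sample titles.
for title in ["Toy Story (1995)", "Heat (1995) [remastered]", "No year here"]:
    print(repr(title), "->", extract_parenthesised(title))
```

Because the leading .* is greedy, a title with more than one parenthesised part yields the content of the last pair.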

Python Regex that will work for any type of bracket

Is there a way to take a regular expression, such as
\(.*\)
and make it correctly identify pairs of any type of bracket, like
(\(|\{|\[).*(\)|\}|\])
without making incorrect matches, like \(.*\]?
I'm specifically working with Python, but it should work similarly in any language.
No. Regular languages can't handle nesting correctly. You'll need a proper parser for that.
((?:\([^()]*\))|(?:\{[^{}]*\})|(?:\[[^[\]]*\]))
Even more unwieldy, but unlike the solution containing .* this will only catch the innermost pair of brackets in case of nested brackets. Between a pair of brackets everything is allowed, even newlines, except the brackets themselves [^{}].
.* is greedy, it would catch two pairs of brackets as one group, like (ab)cd(ef), or even mix pairs chaotically and match (ab)cd) for example.
To catch a group containing the outer pair of brackets like (ab(cd)ef) I would consider not possible with regex.
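As a quick sketch of how the innermost-pair pattern behaves in Python (the function name innermost_pairs is invented here, and the pattern is a slightly simplified spelling of the one above without the outer groups):

```python
import re

# One alternative per bracket type; each forbids its own brackets inside,
# so only pairs with no nesting of the same type can match.
INNERMOST = re.compile(r"\([^()]*\)|\{[^{}]*\}|\[[^\[\]]*\]")

def innermost_pairs(text):
    """Return every correctly paired, innermost bracket group in `text`."""
    return INNERMOST.findall(text)

print(innermost_pairs("a(b)c{d}e[f]"))   # three separate pairs
print(innermost_pairs("(ab(cd)ef)"))     # only the inner pair
print(innermost_pairs("(ab]"))           # mismatched, so nothing
```

Mismatched pairs such as (ab] are skipped because each alternative requires the closing bracket of its own type.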
There is no magic way to tell regex to match a specific char to some other char (if it was the exact same char/string you could use a backreference, but this is not the case).
What you can do is write a "complex" (not so much) expression for that:
((?:\(.*\))|(?:\{.*\})|(?:\[.*\]))

Regular Expressions Dependant on Previous Matchings

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change the previously matched groups before using them, or have I just asked yet another question not meant for RE?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
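A minimal sketch of that match-then-postprocess idea might look like the following. The helper name read_length_prefixed is hypothetical, and it slices the payload by the declared length rather than matching \w+, so that payloads containing ":" (as in the example 5:5:str) still work:

```python
import re

LENGTH_PREFIX = re.compile(r"(\d+):")

def read_length_prefixed(text):
    """Parse 'LenOfStr:Str' from the start of `text`.

    Returns the payload string, or None when the prefix is missing
    or fewer than the declared number of characters follow it.
    """
    m = LENGTH_PREFIX.match(text)
    if m is None:
        return None
    length = int(m.group(1))
    start = m.end()
    payload = text[start:start + length]
    # The regex only finds the length; the length check happens in Python.
    if len(payload) != length:
        return None
    return payload

print(read_length_prefixed("5:5:str and some more characters..."))
```

The regex is responsible only for locating the numeric prefix; interpreting that number is left to ordinary Python code.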
Well, this can be done with a regex (if I understand the question):
>>> import re
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'
