Combine compiled Python regexes - python

Is there any mechanism in Python for combining compiled regular expressions?
I know it's possible to compile a new expression by extracting the plain-old-string .pattern property from existing pattern objects. But this fails in several ways. For example:
import re
first = re.compile(r"(hello?\s*)")
# one-two-three or one/two/three - but not one-two/three or one/two-three
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
# Incorrect - back-reference \1 would refer to the wrong capturing group now,
# and we get an error "redefinition of group name 'r1' as group 3; was
# group 2 at position 47" for the `(?P)` group.
# Result is also now case-sensitive, unlike 'second' which is IGNORECASE
both = re.compile(first.pattern + second.pattern + second.pattern)
The result I'm looking for is achievable like so in Perl:
$first = qr{(hello?\s*)};
# one-two-three or one/two/three - but not one-two/three or one/two-three
$second = qr{one([-/])two\g{-1}three}i;
$both = qr{$first$second$second};
A test shows the results:
test($second, "...one-two-three..."); # Matches
test($both, "...hello one-two-THREEone-two-three..."); # Matches
test($both, "...hellone/Two/ThreeONE-TWO-THREE..."); # Matches
test($both, "...HELLO one/Two/ThreeONE-TWO-THREE..."); # No match
sub test {
    my ($pat, $str) = @_;
    print $str =~ $pat ? "Matches\n" : "No match\n";
}
Is there a library somewhere that makes this use case possible in Python? Or a built-in feature I'm missing somewhere?
(Note - one very useful feature in the Perl regex above is \g{-1}, which unambiguously refers to the immediately preceding capture group, so that there are no collisions of the type that Python is complaining about when I try to compile the combined expression. I haven't seen that anywhere in Python world, not sure if there's an alternative I haven't thought of.)

Ken, this is an interesting problem. I agree with you that the Perl solution is very slick.
I came up with something, but it is not so elegant. Maybe it gives you some idea to further explore the solution using Python. The idea is to simulate the concatenation using Python re methods.
import re
first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
text = "...hello one-two-THREEone/two/three..."
# text = "...hellone/Two/ThreeONE-TWO-THREE..."
if re.search(first, text):
    first_end_pos = re.search(first, text).end()
    if re.match(second, text[first_end_pos:]):
        second_end_pos = re.match(second, text[first_end_pos:]).end() + first_end_pos
        if re.match(second, text[second_end_pos:]):
            print('Matches')
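It works for most cases, but not for the case below, because the first search consumes "hello" greedily and, unlike a single combined pattern, cannot backtrack to hand the "o" over to the second expression:
...hellone/Two/ThreeONE-TWO-THREE...
So yes, I admit it is not a complete solution to your problem. I hope it helps, though.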

I'm not a Perl expert, but it doesn't seem like you're comparing apples to apples. You're using named capture groups in Python, but I don't see any named capture groups in the Perl example. This causes the error you mention, because this
both = re.compile(first.pattern + second.pattern + second.pattern)
tries to create two capture groups named r1.
For example, if you use the regex below, then try to access group_one by name, would you get the numbers before "some text" or after?
# Not actually a valid regex
r'(?P<group_one>[0-9]*)some text(?P<group_one>[0-9]*)'
Solution 1
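An easy solution is probably to remove the names from the capture groups and pass re.IGNORECASE to both. The code below compiles, although I'm not sure the resulting regex will match what you want it to match: replacing the back-reference with a second ([-/]) group means mixed separators such as one-two/three are now accepted, and applying re.IGNORECASE to the whole pattern makes the first part case-insensitive as well.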
first = re.compile(r"(hello?\s*)")
second = re.compile(r"one([-/])two([-/])three", re.IGNORECASE)
both = re.compile(first.pattern + second.pattern + second.pattern, re.IGNORECASE)
Solution 2
What I'd probably do instead is define the separate regular expressions as strings, then you can combine them however you'd like.
pattern1 = r"(hello?\s*)"
pattern2 = r"one([-/])two([-/])three"
first = re.compile(pattern1, re.IGNORECASE)
second = re.compile(pattern2, re.IGNORECASE)
both = re.compile(r"{}{}{}".format(pattern1, pattern2, pattern2), re.IGNORECASE)
Or better yet, for this specific example, don't repeat pattern2 twice, just account for the fact that it'll repeat in the regex:
both = re.compile("{}({}){{2}}".format(pattern1, pattern2), re.IGNORECASE)
which gives you the following regex:
r'(hello?\s*)(one([-/])two([-/])three){2}'
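If the named group and its back-reference need to survive the concatenation, one option is to rename the groups in each copy and to carry each pattern's case-insensitivity with a scoped inline flag. This is only a minimal sketch, not a built-in feature: embed is a hypothetical helper, the _a/_b/_c suffixes are arbitrary, the (?i:...) syntax needs Python 3.6 or later, and the naive renaming does not handle numbered back-references such as \1 or escaped parentheses.

```python
import re

def embed(compiled, suffix):
    """Hypothetical helper: return compiled.pattern in a form that is safe to concatenate."""
    src = compiled.pattern
    # Give every named group, and every back-reference to it, a unique name.
    src = re.sub(r"\(\?P<(\w+)>", lambda m: "(?P<{}{}>".format(m.group(1), suffix), src)
    src = re.sub(r"\(\?P=(\w+)\)", lambda m: "(?P={}{})".format(m.group(1), suffix), src)
    # Keep IGNORECASE local to this fragment instead of applying it globally.
    if compiled.flags & re.IGNORECASE:
        return "(?i:{})".format(src)
    return "(?:{})".format(src)

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
both = re.compile(embed(first, "_a") + embed(second, "_b") + embed(second, "_c"))

for text in ("...hello one-two-THREEone-two-three...",
             "...hellone/Two/ThreeONE-TWO-THREE...",
             "...HELLO one/Two/ThreeONE-TWO-THREE..."):
    print("Matches" if both.search(text) else "No match")
```

With these inputs the first two strings should match and the third should not, which mirrors the Perl test in the question.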

Related

Is there a way to find (potentially) multiple results with re.search?

While parsing file names of TV shows, I would like to extract information about them to use for renaming. I have a working model, but it currently uses 28 if/elif statements for every iteration of filename I've seen over the last few years. I'd love to be able to condense this to something that I'm not ashamed of, so any help would be appreciated.
Phase one of this code repentance is to hopefully grab multiple episode numbers. I've gotten as far as the code below, but in the first entry it only displays the first episode number and not all three.
import re
def main():
    pattern = '(.*)\.S(\d+)[E(\d+)]+'
    strings = ['blah.s01e01e02e03', 'foo.s09e09', 'bar.s05e05']
    #print(strings)
    for string in strings:
        print(string)
        result = re.search("(.*)\.S(\d+)[E(\d+)]+", string, re.IGNORECASE)
        print(result.group(2))
if __name__ == "__main__":
    main()
This outputs:
blah.s01e01e02e03
01
foo.s09e09
09
bar.s05e05
05
It's probably trivial, but regular expressions might as well be Cuneiform most days. Thanks in advance!
No. You can use findall to find all e\d+, but it cannot find overlapping matches, which makes it impossible to use s\d+ together with it (i.e. you can't distinguish the e007 in "foo.s01e006e007" from the one in "age007.s01e001"), and Python doesn't let you use variable-length lookbehind (to make sure s\d+ is before it without overlapping).
The way to do this is to find \.(s\d+)((?:e\d+)+)$, then split the resulting group 2 in another step (whether by using findall with e\d+, or by splitting with (?<!^)(?=e)).
import re
text = 'blah.s01e01e02e03'
match = re.search(r'\.(s\d+)((?:e\d+)+)$', text, re.I)
season = match.group(1)
episodes = re.findall(r'e\d+', match.group(2), re.I)
print(season, episodes)
# => s01 ['e01', 'e02', 'e03']
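The two-step approach above can be wrapped in a small function and applied to the list from the question. This is only a sketch using the standard re module; parse_name and EPISODE_RE are hypothetical names, and it assumes the season/episode block sits at the very end of the name.

```python
import re

# Title, season number, then one or more eNN blocks up to the end of the name.
EPISODE_RE = re.compile(r"^(.*)\.s(\d+)((?:e\d+)+)$", re.IGNORECASE)

def parse_name(name):
    """Return (title, season, [episodes]) or None if the name does not fit."""
    m = EPISODE_RE.search(name)
    if m is None:
        return None
    title, season, episode_run = m.groups()
    # Second step: split the run of episodes into individual numbers.
    episodes = re.findall(r"e(\d+)", episode_run, re.IGNORECASE)
    return title, season, episodes

for name in ['blah.s01e01e02e03', 'foo.s09e09', 'bar.s05e05']:
    print(parse_name(name))
```

Returning None for names that do not fit keeps the caller in charge of what to do with the remaining filename variants.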
re.findall instead of re.search will return a list of all matches
If you can use the regex module from PyPI, you can use a repeated capture group in the pattern and then call .captures().
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(s\d+)(e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.captures(1)[0], m.captures(2))
Output:
s01 ['e01', 'e02', 'e03']
Or use .capturesdict() with named capture groups.
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(?P<season>s\d+)(?P<episodes>e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.capturesdict())
Output:
{'season': ['s01'], 'episodes': ['e01', 'e02', 'e03']}
Note that the notation [E(\d+)] that you used is a character class: it matches one of the listed characters, that is E, (, a digit, + or ).

RegEx: How to match Prefix + Shared OR Shared + Postfix?

Assume I want to match:
PREFIXsomething
or:
somethingPOSTFIX
But certainly NOT:
PREFIXsomethingPOSTFIX
Where something is a certain shared pattern, and PREFIX/POSTFIX are in reality also certain different patterns.
I thought I could solve this in Python. However, the construct below works for 'PREFIXabc' but does not work for 'abcPOSTFIX'. How can I solve this?
import re
prefix_pattern = "PREFIX"
postfix_pattern = "POSTFIX"
shared_pattern = "[a-zA-Z]*"
test_pattern ="("+prefix_pattern+shared_pattern+")|("+shared_pattern+postfix_pattern+")$"
pattern = re.compile(test_pattern)
#test = 'PREFIXabc' # Match
test = 'abcPOSTFIX' # No match
x = re.match(pattern,test)
if x:
    print(x.group())
else:
    print("Not found")
Note that your pattern, when used with re.match, follows the scheme ^(alternative1)|^(alternative2)$. That means that the $ end-of-string anchor only affects the second alternative, and in the case of test = 'PREFIXabc123', PREFIXabc will get matched.
There are two ways to solve it depending on your requirements.
Either remove the $, in which case you will also match abcPOSTFIX in test = 'abcPOSTFIX123', or group the two alternatives:
test_pattern=r"(?:{0}{1}|{1}{2})$".format(prefix_pattern, shared_pattern, postfix_pattern)
Then, partial matches won't be found any longer.
And FYI: If the prefix_pattern, shared_pattern and postfix_pattern are literal strings, do not forget to use re.escape().
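The grouped pattern still accepts PREFIXsomethingPOSTFIX, which the question wants to reject. One way to exclude it is to add a negative lookahead to each alternative. The following is a rough sketch rather than a definitive solution: build_exclusive_pattern is a hypothetical helper, and it assumes the three sub-patterns contain no top-level | (wrap them in (?:...) first if they do).

```python
import re

def build_exclusive_pattern(prefix, shared, postfix):
    """Hypothetical helper: prefix+shared or shared+postfix, but never both."""
    # Alternative 1: after the prefix, the rest must not end with the postfix.
    # Alternative 2: the string must not start with the prefix.
    return re.compile(
        r"(?:{p}(?!.*{q}\Z){s}|(?!{p}){s}{q})\Z".format(p=prefix, s=shared, q=postfix)
    )

pattern = build_exclusive_pattern("PREFIX", "[a-zA-Z]*", "POSTFIX")
for test in ("PREFIXabc", "abcPOSTFIX", "PREFIXabcPOSTFIX", "PREFIXabc123"):
    print(test, bool(pattern.match(test)))
```

Here the first two strings should match and the last two should not.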

RegEx to match a term before OR after another specific term

I'm looking for a square-meter term in some kind of text using this regular expression:
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
Works pretty well.
Now, this thing should only be matched if a string like "Wohnfläche"/"Wohnfl"/"Wfl" exists before OR after it. In other words: the latter term is mandatory, but its position is not.
Writing a RegEx for this is not the issue in general; my problem is how to write it most elegantly. Currently I only see one approach:
^[.]*[Wohnfläche|Wohnfl|Wfl]([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
followed by a new search, combined with an 'or' statement (I'm using Python):
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2][.]*[Wohnfläche|Wohnfl|Wfl]$
Ugly, isn't it? ;)
You can use alternation like this:
(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)
Then check which capture group matched. It is just not possible to make the restrictive strings optional on both sides of the regex; they would simply be ignored.
Demo:
import re
pat = re.compile(r'(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)')
strs = ["12,56m qm Wohnfläche", "14.54 mqm Wohnfl", "Wfl 134 m qm"]
for x in strs:
    m = pat.search(x)
    if m:
        if m.group(1):  # First alternative found a match
            print("{} - {}".format(m.group(1), m.group(2)))
        else:  # Second alternative "won"
            print("{} - {}".format(m.group(3), m.group(4)))
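Since the question is mostly about elegance, the duplication in the alternation can be reduced by assembling it from shared pieces. This is only a sketch under assumptions: LABEL, AREA, UNIT, AREA_RE and find_area are hypothetical names, and the unit is assumed to be written as "qm" or "m²", which may not be what the original pattern intended.

```python
import re

LABEL = r"(?:Wohnfläche|Wohnfl|Wfl)"
AREA = r"\d{1,3}(?:[.,]\d{1,2})?"
UNIT = r"(?:qm|m\u00B2)"  # assumption: the unit is written "qm" or "m²"

# Label before the area, or label after the area.
AREA_RE = re.compile(
    r"{label}\s*({area})\s?{unit}|({area})\s?{unit}\s*{label}".format(
        label=LABEL, area=AREA, unit=UNIT
    )
)

def find_area(text):
    """Return the area figure as a string, or None if no labelled area is found."""
    m = AREA_RE.search(text)
    if m is None:
        return None
    # Exactly one of the two groups is set, depending on which alternative matched.
    return m.group(1) or m.group(2)

print(find_area("Wfl 134 qm"))
print(find_area("12,56 qm Wohnfläche"))
```

The pattern is still an alternation, but each building block is written only once, so a change to the label list or the number format happens in one place.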
Specify a logical conjunction in the controlling application, like (pseudo-code) <area-regex>.match(string) and <text-regex>.match(string).
This assumes that any pair of matches of the two regexen on the same string will never overlap (if they did, you'd get a false positive). Your regexen meet this requirement.
Note that your regex for the textual context contains the additional restriction that your test string either starts or ends with a match, while in your informal description you just require a match to either occur before or after the area spec. This difference is incorporated in pt vs pt_anchored in the code below.
Python fragment (untested):
import re
...
# pa: <area_regex>
# pt: <text_regex>
# pt_anchored: <text_regex>, anchored
#
pa = re.compile ( r'([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]' )
pt = re.compile ( r'[.]*[Wohnfläche|Wohnfl|Wfl]' )
pt_anchored = re.compile ( r'^[.]*[Wohnfläche|Wohnfl|Wfl]|[.]*[Wohnfläche|Wohnfl|Wfl]$' )
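teststring = "Wfl 134 m qm"  # example input; substitute your own text
# search() scans the whole string; match() would only test its start
if pa.search(teststring) and pt.search(teststring):
    print('Match found')
else:
    print('No match')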
...

replace regex variable with string in python

I have a situation where I have a regular expression like this
regex_string = r'(?P<x>\d+)\s(?P<y>\w+)'
r = re.compile(regex_string)
and, before I start matching things with it, I'd like to replace the regex group named x with a particular value, say 2014. This way, when I search for matches to this regular expression, we will only find things that have x=2014. What is the best way to approach this issue?
The challenge here is that both the original regular expression regex_string and the arbitrary replacement value x=2014 are specified by an end user. In my head, the ideal thing would be to have a function like replace_regex:
r = re.compile(regex_string)
r = replace_regex_variables(r, x=2014)
for match in r.finditer(really_big_string):
    do_something_with_each_match(match)
I'm open to any solution, but I'm specifically interested in whether it's possible to do this without checking matches after they are returned by finditer, so as to take advantage of re's performance. In other words, preferably NOT this:
r = re.compile(regex_string)
for match in r.finditer(really_big_string):
    if match.groupdict()['x'] == '2014':
        do_something_with_each_match(match)
You want something like this, don't you?
r = r'(?P<x>%(x)s)\s(?P<y>\w+)'
r = re.compile(r % {'x': 2014})
for match in r.finditer(really_big_string):
    do_something_with_each_match(match)
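Because the regular expression comes from an end user, it may not be written as a template. A rough sketch of the replace_regex_variables function from the question could rewrite the body of each named group in the pattern string instead. The implementation below is an assumption, not an existing re feature: it takes the pattern string rather than a compiled object, it scans for the closing parenthesis by hand, and it does not handle character classes that contain a literal ] as their first character.

```python
import re

def replace_regex_variables(regex_string, **values):
    """Hypothetical helper: pin named groups to literal values, then compile."""
    for name, value in values.items():
        opener = "(?P<%s>" % name
        start = regex_string.find(opener)
        if start == -1:
            raise KeyError(name)
        # Walk forward to the parenthesis that closes this group.
        depth, i, in_class = 1, start + len(opener), False
        while depth:
            ch = regex_string[i]
            if ch == "\\":
                i += 1                  # skip the escaped character
            elif in_class:
                in_class = ch != "]"    # stay in the class until it closes
            elif ch == "[":
                in_class = True
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            i += 1
        literal = re.escape(str(value))
        regex_string = regex_string[:start] + opener + literal + ")" + regex_string[i:]
    return re.compile(regex_string)

r = replace_regex_variables(r'(?P<x>\d+)\s(?P<y>\w+)', x=2014)
print(r.pattern)
print([m.group('y') for m in r.finditer("2013 foo 2014 bar 2014 baz")])
```

The group name is kept, so match.group('x') still works on the results, and re.escape keeps the user-supplied value from being interpreted as regex syntax.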

regex replace group with specific value

I use Python 2.7.
I am trying to replace a group in a regex with a value:
import re
r = "/foo/bar/(?P<pk>[0-9]+)/"
rc = re.compile(r)
# what I am trying to do: rc["pk"] = 42 and get the result
print rc.groupindex
# returns {'pk': 1}
I need to do this because I don't know the regex, but I know that there is a group in it.
Edit:
I want to have a result like this:
rc["pk"] = 42
# now rc is /foo/bar/42 because (?P<pk>[0-9]+) is replaced with 42
I am not a Python programmer, but I work with regexes a great deal in a number of other systems. I believe you can use the re.sub function with backreferences to groups like so:
Search Pattern:
'(/foo/bar/)[0-9]+(/)'
Replacement pattern:
'\g<1>42\g<2>'
This would replace
'/foo/bar/17/'
with
'/foo/bar/42/'
This would even work where the folder names are expressions themselves:
'(/\w+/\w+/)\d+(/)'
Python also supports lookaround assertions, like this:
'(?<=/foo/bar/)\d+(?=/)'
Then you just replace the match with '42'. (Lookarounds do not "consume" characters, so the text in '(?<=...)' and '(?=...)' would not be replaced.)
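In Python, the two substitutions described above might look like the sketch below. The last two lines go one step further and substitute the value into the pattern source itself, which is closer to what the question asks for; that part is an assumption that only works while the group body contains no nested parentheses. The variable names are illustrative.

```python
import re

url = "/foo/bar/17/"

# Capture the surrounding text and put it back around the new value.
with_groups = re.sub(r"(/foo/bar/)[0-9]+(/)", r"\g<1>42\g<2>", url)

# Lookarounds only assert the context, so only the digits are replaced.
with_lookarounds = re.sub(r"(?<=/foo/bar/)\d+(?=/)", "42", url)

# Assumption: replace the named group inside the pattern source itself.
rc = re.compile(r"/foo/bar/(?P<pk>[0-9]+)/")
filled = re.sub(r"\(\?P<pk>[^)]*\)", "42", rc.pattern)

print(with_groups, with_lookarounds, filled)
```

Note that \g<1> is used instead of \1 in the replacement so that the following digits of 42 are not read as part of the group number.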
