Assume I want to match:
PREFIXsomething
or:
somethingPOSTFIX
But certainly NOT:
PREFIXsomethingPOSTFIX
Where something is a certain shared pattern, and PREFIX/POSTFIX are in reality also certain different patterns.
I can (or thought) solve this in Python. However this construct works for 'PREFIXabc' but does not work for 'abcPOSTFIX'. How to solve this?
import re
prefix_pattern = "PREFIX"
postfix_pattern = "POSTFIX"
shared_pattern = "[a-zA-z]*"
test_pattern ="("+prefix_pattern+shared_pattern+")|("+shared_pattern+postfix_pattern+")$"
pattern = re.compile(test_pattern)
#test = 'PREFIXabc' # Match
test = 'abcPOSTFIX' # No match
x = re.match(pattern,test)
if x:
print(x.group())
else:
print("Not found")
Note that your pattern, when used with re.match, follows the scheme like ^(alternative1)|^(alternative2)$. That means that the $ end of string anchor only affects the second alternative and in case test = 'PREFIXabc123', PREFIXabc will get matched.
There are two ways to solve it depending on your requirements.
Either you need to remove $ and then you will also match abcPOSTFIX in test = 'abcPOSTIFX123', or group the two alternatives:
test_pattern=r"(?:{0}{1}|{1}{2})$".format(prefix_pattern, shared_pattern, postfix_pattern)
Then, partial matches won't be found any longer.
And FYI: If the prefix_pattern, shared_pattern and postfix_pattern are literal strings, do not forget to use re.escape().
Related
Is there any mechanism in Python for combining compiled regular expressions?
I know it's possible to compile a new expression by extracting the plain-old-string .pattern property from existing pattern objects. But this fails in several ways. For example:
import re
first = re.compile(r"(hello?\s*)")
# one-two-three or one/two/three - but not one-two/three or one/two-three
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
# Incorrect - back-reference \1 would refer to the wrong capturing group now,
# and we get an error "redefinition of group name 'r1' as group 3; was
# group 2 at position 47" for the `(?P)` group.
# Result is also now case-sensitive, unlike 'second' which is IGNORECASE
both = re.compile(first.pattern + second.pattern + second.pattern)
The result I'm looking for is achievable like so in Perl:
$first = qr{(hello?\s*)};
# one-two-three or one/two/three - but not one-two/three or one/two-three
$second = qr{one([-/])two\g{-1}three}i;
$both = qr{$first$second$second};
A test shows the results:
test($second, "...one-two-three..."); # Matches
test($both, "...hello one-two-THREEone-two-three..."); # Matches
test($both, "...hellone/Two/ThreeONE-TWO-THREE..."); # Matches
test($both, "...HELLO one/Two/ThreeONE-TWO-THREE..."); # No match
sub test {
my ($pat, $str) = #_;
print $str =~ $pat ? "Matches\n" : "No match\n";
}
Is there a library somewhere that makes this use case possible in Python? Or a built-in feature I'm missing somewhere?
(Note - one very useful feature in the Perl regex above is \g{-1}, which unambiguously refers to the immediately preceding capture group, so that there are no collisions of the type that Python is complaining about when I try to compile the combined expression. I haven't seen that anywhere in Python world, not sure if there's an alternative I haven't thought of.)
Ken, this is an interesting problem. I agree with you that the Perl solution is very slick.
I came up with something, but it is not so elegant. Maybe it gives you some idea to further explore the solution using Python. The idea is to simulate the concatenation using Python re methods.
first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
str="...hello one-two-THREEone/two/three..."
#str="...hellone/Two/ThreeONE-TWO-THREE..."
if re.search(first,str):
first_end_pos = re.search(first,str).end()
if re.match(second,str[first_end_pos:]):
second_end_pos = re.match(second,str[first_end_pos:]).end() + first_end_pos
if re.match(second,str[second_end_pos:]):
print ('Matches')
It will work for most of the cases but it is not working for the below case:
...hellone/Two/ThreeONE-TWO-THREE...
So, yes I admit it is not a complete solution to your problem. Hope this helps though.
I'm not a perl expert, but it doesn't seem like you're comparing apples to apples. You're using named capture groups in python, but I don't see any named capture groups in the perl example. This causes the error you mention, because this
both = re.compile(first.pattern + second.pattern + second.pattern)
tries to create two capture groups named r1
For example, if you use the regex below, then try to access group_one by name, would you get the numbers before "some text" or after?
# Not actually a valid regex
r'(?P<group_one>[0-9]*)some text(?P<group_one>[0-9]*)'
Solution 1
An easy solution is probably to remove the names from the capture groups. Also add the re.IGNORECASE to both. The code below works, although I'm not sure the resulting regex pattern will match what you want it to match.
first = re.compile(r"(hello?\s*)")
second = re.compile(r"one([-/])two([-/])three", re.IGNORECASE)
both = re.compile(first.pattern + second.pattern + second.pattern, re.IGNORECASE)
Solution 2
What I'd probably do instead is define the separate regular expressions as strings, then you can combine them however you'd like.
pattern1 = r"(hello?\s*)"
pattern2 = r"one([-/])two([-/])three"
first = re.compile(pattern1, re.IGNORECASE)
second = re.compile(pattern2, re.IGNORECASE)
both = re.compile(r"{}{}{}".format(pattern1, pattern2, pattern2), re.IGNORECASE)
Or better yet, for this specific example, don't repeat pattern2 twice, just account for the fact that it'll repeat in the regex:
both = re.compile("{}({}){{2}}".format(pattern1, pattern2), re.IGNORECASE)
which gives you the following regex:
r'(hello?\s*)(one([-/])two([-/])three){2}'
I'm looking for a squaremeter term in some kind of text using this RegExpression:
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
Works pretty well.
Now, this thing should only be matched if before OR after it, a string like "Wohnfläche"/"Wohnfl"/"Wfl" exists. In other words: the latter term is mandatory, however its positon is not.
Writing a RegEx for this is not the issue in general, my problem is how to write it most elegant. Currently I only see one approach:
^[.]*[Wohnfläche|Wohnfl|Wfl]([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
new search, kombined with 'or' statement (I'm using Python)
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2][.]*[Wohnfläche|Wohnfl|Wfl]$
Ugly, isn't it? ;)
You can use alternation like this:
(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)
And check which capture group matched. It is just not possible to use the restrictive strings optionally in the regex on both sides, the will just be ignored.
See the regex demo
IDEONE demo:
import re
pat = re.compile(r'(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)')
strs = ["12,56m qm Wohnfläche", "14.54 mqm Wohnfl", "Wfl 134 m qm"]
for x in strs:
m = pat.search(x)
if m:
if m.group(1): # First alternative found a match
print("{}".format(m.group(1), " - ", m.group(2)))
else: # Second alternative "won"
print("{}".format(m.group(3), " - ", m.group(4)))
Specify a logical conjunction in the controlling application, like (pseudo-code) <area-regex>.match(string) and <text-regex>.match(string).
This assumes that any pair of matches of the two regexen on the same string will never overlap ( if they did, you'd get a false positive ). Your regexen meet this requirement.
Note that your regex for the textual context contains the additional restriction that your test string either starts or ends with a match, while in your informal description you just require a match to either occur before or after the area spec. This difference is incorporated in pt vs pt_anchored in the code below.
Python fragment (untested):
import re
...
# pa: <area_regex>
# pt: <text_regex>
# pt_anchored: <text_regex>, anchored
#
pa = re.compile ( r'([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]' )
pt = re.compile ( r'[.]*[Wohnfläche|Wohnfl|Wfl]' )
pt_anchored = re.compile ( r'^[.]*[Wohnfläche|Wohnfl|Wfl]|[.]*[Wohnfläche|Wohnfl|Wfl]$' )
if pa.match(<teststring>) and pt.match(<teststring>):
print 'Match found: '
else:
print 'No match'
...
I have a series of reg expressions called in order. I need to check the first one, and then the second, then the third etc etc right the way until the end. I need to do some processed on the matched string, so I'm trying to avoid too much logic, but in python, unlike perl I do not think I can perform assignment in the if-elif-elif..blocks so I'll end up doing an assignment, then checking for a match and then getting the results of that match. For example:
m = re.search(patternA, string)
if m:
stripped = m.group(0)
xyz = stripped[45:67]
elif:
m = re.search(patternB, string)
if m:
stripped = m.group(0)
abc = stripped[5:7]
elif:
m = re.search(patternB, string)
if m:
stripped = m.group(0)
txt = stripped[4:5]
elif:
......
Ideally I'd like to find a better structure that ensures I preserve the ordering of the tested regular expressions, and also that I can incorporate the assignment into the if-then statements. So for example:
if (m = re.search(patternA, string)):
stripped = m.group(0)
xyz = stripped[45:67]
elif (m = re.search(patternB, string)):
stripped = m.group(0)
abc = stripped[5:7]
...
What is the most pythonic way of dealing with this? Thanks.
The use case is to read old data - very old data. However each string may include information about particular values and these are only present if the regular expression matches a particular pattern. So the variables extracted are highly dependent upon what matches.
for (pattern, slice) in zip([patternA, patternB, patternC],
[slice(45,67), slice(5,7), slice(4,5)]):
m = re.search(pattern, string)
if m:
value = m.group(0)[slice]
break
else:
# Handle no match found for any pattern here
This iterates over pairs of regular expressions and the relevant portion of their match until a match is found. If there is no match found, the else clause of the for loop will execute. The result of the match is found in value after the loop, regardless of which pattern matches.
Having different variables set based on which "branch" succeeds is not a great idea, since you won't necessarily know which variables are set at any given time. A dictionary would be a better idea if you really want separate labels for each match, since you can query which key or keys are set in a dictionary.
value = {}
for (pattern, slice, key) in zip([patternA, patternB, patternC],
[slice(45,67), slice(5,7), slice(4,5)],
['abc', 'xyx', 'txt']):
m = re.search(pattern, string)
if m:
value[key] = m.group(0)[slice]
break
The general idea, though, is to note that your chain of if statements is like a hard-coded iteration, so you just need to identify which parts of each if/elif clause varies from the preceding ones, and create a list that you can iterate over instead.
I have a situation where I have a regular expression like this
regex_string = r'(?P<x>\d+)\s(?P<y>\w+)'
r = re.compile(regex_string)
and, before I start matching things with it, I'd like to replace the regex group named x with a particular value, say 2014. This way, when I search for matches to this regular expression, we will only find things that have x=2014. What is the best way to approach this issue?
The challenge here is that both the original regular expression regex_string and the arbitrary replacement value x=2014 are specified by an end user. In my head, the ideal thing would be to have a function like replace_regex:
r = re.compile(regex_string)
r = replace_regex_variables(r, x=2014)
for match in r.finditer(really_big_string):
do_something_with_each_match(match)
I'm open to any solution, but specifically interested in understanding if its possible to do this without checking matches after they are returned by finditer to take advantage of re's performance. In other words, preferrably NOT this:
r = re.compile(regex_string)
for match in r.finditer(really_big_string):
if r.groupdict()['x'] == 2014:
do_sometehing_with_each_match(match)
You want something like this, don't you?
r = r'(?P<x>%(x)s)\s(?P<y>\w+)'
r = re.compile(r % {x: 2014})
for match in r.finditer(really_big_string):
do_something_with_each_match(match)
I need to match contents of a list with a given pattern, and form another list which will be having everything except the matches. Meaning, I am trying to make an exclude list.
Now with one pattern match, it is easy. But for more that one, it becomes tricky.
Lets see an example :
Lmain=[arc123, arc234,xyz111,xyz222,ppp999,ppp888]
for count in range(len[Lmain]):
if Pattern matches Lmain[i]:
Pass
else:result.append(Lmain[i])
Now lets say pattern = arc , my result will be
result = [xyz111,xyz222,ppp999,ppp888]
This is just a logic, where I will be using regular expr for finding match.
Now if we have 2 patterns, then using above logic in a loop :
Pattern=['arc','xyz']
for pat in Pattern:
if pat matches Lmain[i]:
Pass
else:result.append(Lmain[i])
This will give us the wrong result
result = [xyz111,xyz222,ppp999,ppp888,arc123,arc234,ppp999,ppp888]
So , you can see above logic just wont work .
My plan:
First we find exclude list for first Pattern which will give us result:
result = [xyz111,xyz222,ppp999,ppp888]
For 2nd pattern, we need to look in to the above result.
if Pattern matches Result[i]:
Pass
else:result_final.append(Result[i])
I think we need to use Recursion to implement above logic. Now how do we do that?
Also we dont know the number of patterns user is going to enter. It can be one or more.
Anybody has any logic ideas, then please share.
Using a list comprehension and a generator expression, and skipping the intermediate step of building an exclude list and just building the final list:
>>> import re
>>> Lmain=['arc123', 'arc234', 'xyz111', 'xyz222','ppp999','ppp888']
>>> Pattern=['arc','xyz']
>>> [x for x in Lmain if not any(re.search(y, x) for y in Pattern)]
['ppp999', 'ppp888']
for item in lst:
if all(pat not in item for pat in patterns):
exclude_list.append(item)
Replace in to what is more appropriate in your case (e.g. item.startswith(pat))
If there are more matches than non-matches, it should be more efficient to find the matches first, and then exclude them:
matches = [x for x in lst if any(x.startswith(p) for p in patterns)]
exclude_list = list(set(lst).difference(matches))
Yet another (and probably the fastest) option is to use regular expressions (here in combination with filter):
import re
expr = '^(?!%s)' % '|'.join(patterns)
exclude_list = filter(re.compile(expr).search, lst)
matched = False
for pat in Pattern:
if pat patches Lmain[i]:
matched = True
break;
if matched:
Pass
else:
result.append(Lmain[i])