How to build a regular expression from predefined blocks - python

I want to build a regular expression from a template and a set of predefined blocks, using string.Template for the substitution.
For example:
template: /data/${year}_${month}_${day}/${year}${month}${day}_${type}_${id}.dat
blocks:
day: (?P<day>\d{2})
month: (?P<month>\d{2})
year: (?P<year>\d{4})
type: (?P<typechar>[BDPCLNIYSQJ])
id: (?P<id>\d{8})
>>> string.Template(template).safe_substitute(blocks)
/data/(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})/(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})_(?P<typechar>[BDPCLNIYSQJ])_(?P<id>\d{8}).dat
The problem is the duplicated named groups, which are not accepted in a regular expression.
I'm looking either for a way to correct the template (before or after the substitution), a way to trick re to swallow the duplicates, or a complete new approach to the problem.

I'm not sure about Python, but both PCRE and Perl support the (?(DEFINE)...) construction, so you can use something like this:
(?x)
(?(DEFINE)
(?<date> (?&long_date) | (?&short_date))
(?<long_date> (?&year) _ (?&month) _ (?&day) _ (?&type) _ (?&id))
(?<short_date> (?&year) _ (?&month) _ (?&day))
(?<day> \d{2})
(?<month> \d{2})
(?<year> \d{4})
(?<type> [BDPCLNIYSQJ])
(?<id> \d{8})
)
(?&date)
I've used the "x" modifier (?x) just to make the regex more readable (whitespace inside the regex is then ignored).
A "conditional group" of the form (?(DEFINE)...) can be used to define
groups (named and numbered) that are never evaluated inline, but can
be called as "subroutines" from elsewhere. In effect, the DEFINE
condition is always false. There may be only one alternative in such a
group.
http://www.pcre.org/changelog.txt
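Python's built-in re module does not support (?(DEFINE)...) or (?&name) subroutine calls, so the pattern above will not compile with re. The third-party regex module advertises support for subroutine calls; whether this exact construction works there is an assumption I have not verified, but a sketch would look like this:
import regex  # third-party module (pip install regex); not part of the standard library

pattern = regex.compile(r'''(?x)
    (?(DEFINE)
        (?<long_date> (?&year) _ (?&month) _ (?&day) _ (?&type) _ (?&id))
        (?<day>   \d{2})
        (?<month> \d{2})
        (?<year>  \d{4})
        (?<type>  [BDPCLNIYSQJ])
        (?<id>    \d{8})
    )
    (?&long_date)
''')
print(bool(pattern.search('2014_06_01_B_12345678')))  # expected: True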

Get rid of the ?P name elements, i.e.
day: (?P<name>\d{2})
becomes
day: (\d{2})
I've never used the ?P feature before tbh
Your regexp template idea is good though!
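A minimal sketch of that approach (my own illustration, not from the thread): with unnamed groups the duplicate-name error disappears, at the cost of positional access and of losing any guarantee that the repeated year/month/day occurrences contain the same text.
import re
import string

template = r'/data/${year}_${month}_${day}/${year}${month}${day}_${type}_${id}.dat'
blocks = {
    'day': r'(\d{2})', 'month': r'(\d{2})', 'year': r'(\d{4})',
    'type': r'([BDPCLNIYSQJ])', 'id': r'(\d{8})',
}
pattern = re.compile(string.Template(template).safe_substitute(blocks))
m = pattern.match('/data/2014_06_01/20140601_B_12345678.dat')
print(m.group(1), m.group(7), m.group(8))  # -> 2014 B 12345678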

Apply the template twice: once to set the group name, and then use the output to make the final regexp.
from string import Template

day = r'(?P<$dayname>\d{2})'
d = dict(dayname='day_start')
Template(day).safe_substitute(d)  # -> '(?P<day_start>\d{2})'
Repeat for all the names you need, then feed them all into your final template, which uses day1, day2, etc.
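A fuller sketch of that two-pass idea (my own illustration, with hypothetical group names year1/year2 and so on): the first pass drops in the regex blocks, which carry a nested placeholder for the group name, and the second pass gives each occurrence its own name.
import re
from string import Template

blocks = {
    'day':   r'(?P<${day_name}>\d{2})',
    'month': r'(?P<${month_name}>\d{2})',
    'year':  r'(?P<${year_name}>\d{4})',
}
# first pass: insert the blocks, leaving the group names as placeholders
dir_part = Template('${year}_${month}_${day}').safe_substitute(blocks)
file_part = Template('${year}${month}${day}').safe_substitute(blocks)
# second pass: give each occurrence its own group name
dir_part = Template(dir_part).safe_substitute(year_name='year1', month_name='month1', day_name='day1')
file_part = Template(file_part).safe_substitute(year_name='year2', month_name='month2', day_name='day2')
pattern = re.compile(r'/data/%s/%s_(?P<typechar>[BDPCLNIYSQJ])_(?P<id>\d{8}).dat' % (dir_part, file_part))
m = pattern.match('/data/2014_06_01/20140601_B_12345678.dat')
print(m.group('year1'), m.group('year2'))  # -> 2014 2014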

After following a friend's advice, I found a way to achieve the desired result.
The idea is to modify the template string to eliminate duplicate vars before substituting the regex blocks. In fact it doesn't remove the duplicates but replaces each of them with a back-reference to the first occurrence, using the (?P=name) syntax. This way you force the contents to be the same everywhere that block is used.
I will assume the regex groupname is the same as the template block name. This is not true in the question example, but it can be changed without any problem.
To transform the duplicates I use the following function:
>>> import re
>>> def remove_duplicate_blocks(template):
...     regex = r'\$\{(\w+)\}'
...     seen = set()
...     seen_add = seen.add
...     def alt_seen(matchobj):
...         x = matchobj.group(1)
...         if x not in seen and not seen_add(x):
...             return '${%s}' % x
...         else:
...             return '(?P=%s)' % x
...     return re.sub(regex, alt_seen, template)
which returns the transformed template without duplicates, forcing all occurrences of the same block to match the same content.
Afterwards it's just a matter of substituting the blocks:
>>> unique_blocks_template = remove_duplicate_blocks(template)
>>> print(unique_blocks_template)
/data/${year}_${month}_${day}/(?P=year)(?P=month)(?P=day)_${type}_${id}.dat
>>> string.Template(unique_blocks_template).safe_substitute(blocks)
'/data/(?P<year>\\d{4})_(?P<month>\\d{2})_(?P<day>\\d{2})/(?P=year)(?P=month)(?P=day)_(?P<type>[BDPCLNIYSQJ])_(?P<id>\\d{8}).dat'
Not mentioned in the question, but the same original template can also be used to reconstruct the string we want to match with the regex, which is the original aim of this code.
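As a small illustration of that last point (with made-up values): the untouched template from the question still builds a concrete path.
>>> values = {'year': '2014', 'month': '06', 'day': '01', 'type': 'B', 'id': '12345678'}
>>> string.Template(template).safe_substitute(values)
'/data/2014_06_01/20140601_B_12345678.dat'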

Related

Check if a variable substring is in a string

I receive an input string with values expressed in two possible formats, e.g.:
#short format
data = '"interval":19'
>>> "interval":19
#extended format
data = '"interval":{"t0":19,"tf":19}'
>>> "interval":{"t0":19,"tf":19}
I would like to check whether a short format is used and, in case, make it extended.
Considering that the string could be composed of multiple values, i.e.
data = '"interval":19,"interval2":{"t0":10,"tf":15}'
>>> "interval":19,"interval2":{"t0":10,"tf":15}
I cannot just say:
if ":{" not in data:
#then short format is used
I would like to code something like:
if ":$(a general int/float/double number)" in data:
#extract the number
#replace ":{number}" with the extended format
I know how to code the replacing part.
I need help implementing the if condition: in my mind I model it as a variable substring, in which the variable part is the number inside it, while the rigid format is the $(value name) + ":" part.
"some_value":19
^ ^
rigid format variable part
EDIT - WHY NOT PARSE IT?
I know the string is "JSON-friendly" and I can convert it into a dictionary, easily accessing then the values.
Indeed, I already have this solution in my code. But I don't like it, since the input string could be multilevel and I would need to iterate over the leaf values of the resulting dictionary, independently of the dictionary levels. The latter is not a simple thing to do.
So I was wondering whether a way to act directly on the string exists.
If you replace all keys followed by numbers, except t0 and tf, it should work.
Here is an example on a multilevel string (it could probably be put in better shape):
import re
s = '"interval": 19,"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval":23}}'
gex = '("(?!(t0|tf)")\w+":)\s*(\d+)'
new_s = re.sub(gex, r'\1 {"t0": \3, "tf": \3}', s)
print(new_s)
# -> "interval": {"t0": 19, "tf": 19},"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval": {"t0": 23, "tf": 23}}}
You could use a regular expression. ("interval":)(\d+) will look for the string '"interval":' followed by one or more digits.
Let's test this
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'xxx', data)
print(result)
# -> xxx,"interval2":{"t0":10,"tf":15},xxx
We see that we found the correct places. Now we're going to create your target format. Here the matched groups come in handy. In the regular expression ("interval":) is group 1, (\d+) is group 2.
Now we use the content of those groups to create your wanted result.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"interval":{"t0":25,"tf":25}
If there are floating point values involved you'll have to change (\d+) to ([.\d]+).
If you want to match any Unicode word characters, and not only interval, you can use the special sequence \w; and because there could be multiple characters, the expression becomes \w+.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'("\w+":)([.\d]+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":{"t0":10,"tf":10},"tf":{"t0":15,"tf":15}},"Monty":{"t0":25.4,"tf":25.4}
Dang! Yes, we found "Monty", but now the values from the second part are found too. We'll have to fix this somehow. Let's see. We don't want ("\w+") if it's preceded by {, so we're going to use a negative lookbehind assertion: (?<!{)("\w+"). And after the number part we don't want a } or another digit, so we're using a negative lookahead assertion here: ([.\d]+)(?!})(?!\d).
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'(?<!{)("\w+":)([.\d]+)(?!})(?!\d)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}
Hooray, it works!
Regular expressions are powerful and fun, but if you start to add more constraints this might become unmanageable.
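For completeness, here is a rough sketch (not from the thread) of the parsing route the question set aside: recursing over the decoded structure is only a few lines and avoids the regex edge cases entirely. It assumes the fragment becomes valid JSON once wrapped in braces.
import json

def expand(node):
    # replace bare numeric leaves with the extended {"t0": n, "tf": n} form
    if isinstance(node, dict):
        if set(node) == {"t0", "tf"}:  # already in extended format
            return node
        return {key: expand(value) for key, value in node.items()}
    if isinstance(node, (int, float)) and not isinstance(node, bool):
        return {"t0": node, "tf": node}
    return node

data = '"interval":19,"interval2":{"t0":10,"tf":15}'
obj = json.loads('{%s}' % data)  # wrap the fragment so it parses
print(json.dumps(expand(obj)))
# -> {"interval": {"t0": 19, "tf": 19}, "interval2": {"t0": 10, "tf": 15}}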

Python re.sub() is not replacing every match

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:
abbcabb should give c and abca should give bc.
I've tried the following regex (here):
(.)(.*?)\1
But, it gives wrong output for first string. Also, when I tried another one (here):
(.)(.*?)*?\1
But, this one again gives wrong output. What's going wrong here?
The python code is a print statement:
print(re.sub(r'(.)(.*?)\1', r'\g<2>', s))  # s is the string
It can be solved without a regular expression, like below:
>>> s, s1 = 'abbcabb', 'abca'
>>> ''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>> ''.join([i for i in s if s.count(i) == 1])
'c'
re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on
abbcabb
it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
If you want that, you need to write your own loop.
while True:
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)
Regular expressions don't seem to be the ideal solution here:
they don't handle overlapping, so a loop is needed (like in this answer) and strings are created over and over (performance suffers)
they're overkill here; we just need to count the characters
I like this answer, but using count repeatedly in a list comprehension loops over all the elements each time.
It can be solved without a regular expression and without O(n**2) complexity, only O(n), using collections.Counter:
first, count the characters of the string (very easy and quick)
then filter the string, testing whether each count matches, using the counter we just created,
like this:
import collections
s = "abbcabb"
cnt = collections.Counter(s)
s = "".join([c for c in s if cnt[c]==1])
(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)
EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)
My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".
You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:
>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'
Hmm. That didn't work. Why? Because with the first pattern not being greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. So if we use a greedy match, Python will backtrack until it finds as long a match for subpattern A as still allows the A-B-A pattern to appear:
>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'
Looks good to me.
The linked site explains it well; hover over the pattern and use the explanation section.
(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by anything sandwiched in the middle, until that same character is encountered again.
So, for abbcabb the "sandwiched" portion is bbc, between the two a's.
EDIT:
You can try something like this instead without regexes:
string = "abbcabb"
result = []
for i in string:
if i not in result:
result.append(i)
else:
result.remove(i)
print(''.join(result))
Note that this keeps the "last" odd occurrence of each character, not the first.
To keep the "first" occurrence, you should use a counter as suggested in the answer above. Just change the condition to check for odd counts, i.e. count[letter] % 2 == 1.

Combine compiled Python regexes

Is there any mechanism in Python for combining compiled regular expressions?
I know it's possible to compile a new expression by extracting the plain-old-string .pattern property from existing pattern objects. But this fails in several ways. For example:
import re
first = re.compile(r"(hello?\s*)")
# one-two-three or one/two/three - but not one-two/three or one/two-three
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
# Incorrect - back-reference \1 would refer to the wrong capturing group now,
# and we get an error "redefinition of group name 'r1' as group 3; was
# group 2 at position 47" for the `(?P)` group.
# Result is also now case-sensitive, unlike 'second' which is IGNORECASE
both = re.compile(first.pattern + second.pattern + second.pattern)
The result I'm looking for is achievable like so in Perl:
$first = qr{(hello?\s*)};
# one-two-three or one/two/three - but not one-two/three or one/two-three
$second = qr{one([-/])two\g{-1}three}i;
$both = qr{$first$second$second};
A test shows the results:
test($second, "...one-two-three..."); # Matches
test($both, "...hello one-two-THREEone-two-three..."); # Matches
test($both, "...hellone/Two/ThreeONE-TWO-THREE..."); # Matches
test($both, "...HELLO one/Two/ThreeONE-TWO-THREE..."); # No match
sub test {
    my ($pat, $str) = @_;
    print $str =~ $pat ? "Matches\n" : "No match\n";
}
Is there a library somewhere that makes this use case possible in Python? Or a built-in feature I'm missing somewhere?
(Note - one very useful feature in the Perl regex above is \g{-1}, which unambiguously refers to the immediately preceding capture group, so that there are no collisions of the type that Python is complaining about when I try to compile the combined expression. I haven't seen that anywhere in Python world, not sure if there's an alternative I haven't thought of.)
Ken, this is an interesting problem. I agree with you that the Perl solution is very slick.
I came up with something, but it is not so elegant. Maybe it gives you some ideas for exploring the solution further in Python. The idea is to simulate the concatenation using Python re methods.
import re

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
s = "...hello one-two-THREEone/two/three..."
# s = "...hellone/Two/ThreeONE-TWO-THREE..."
if re.search(first, s):
    first_end_pos = re.search(first, s).end()
    if re.match(second, s[first_end_pos:]):
        second_end_pos = re.match(second, s[first_end_pos:]).end() + first_end_pos
        if re.match(second, s[second_end_pos:]):
            print('Matches')
It will work for most cases, but it does not work for the case below:
...hellone/Two/ThreeONE-TWO-THREE...
So, yes I admit it is not a complete solution to your problem. Hope this helps though.
I'm not a Perl expert, but it doesn't seem like you're comparing apples to apples. You're using named capture groups in Python, but I don't see any named capture groups in the Perl example. This causes the error you mention, because this
both = re.compile(first.pattern + second.pattern + second.pattern)
tries to create two capture groups named r1
For example, if you use the regex below, then try to access group_one by name, would you get the numbers before "some text" or after?
# Not actually a valid regex
r'(?P<group_one>[0-9]*)some text(?P<group_one>[0-9]*)'
Solution 1
An easy solution is probably to remove the names from the capture groups. Also add the re.IGNORECASE to both. The code below works, although I'm not sure the resulting regex pattern will match what you want it to match.
first = re.compile(r"(hello?\s*)")
second = re.compile(r"one([-/])two([-/])three", re.IGNORECASE)
both = re.compile(first.pattern + second.pattern + second.pattern, re.IGNORECASE)
Solution 2
What I'd probably do instead is define the separate regular expressions as strings, then you can combine them however you'd like.
pattern1 = r"(hello?\s*)"
pattern2 = r"one([-/])two([-/])three"
first = re.compile(pattern1, re.IGNORECASE)
second = re.compile(pattern2, re.IGNORECASE)
both = re.compile(r"{}{}{}".format(pattern1, pattern2, pattern2), re.IGNORECASE)
Or better yet, for this specific example, don't repeat pattern2 twice, just account for the fact that it'll repeat in the regex:
both = re.compile("{}({}){{2}}".format(pattern1, pattern2), re.IGNORECASE)
which gives you the following regex:
r'(hello?\s*)(one([-/])two([-/])three){2}'
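If you do want to keep the named groups, here is a rough helper (my own sketch, not from the answers above): it suffixes every (?P<name>...) group and (?P=name) back-reference with the pattern's position before concatenating, and re-applies IGNORECASE as a scoped (?i:...) flag (Python 3.6+). Numbered back-references such as \1 are not handled.
import re

def combine(*patterns):
    """Concatenate compiled patterns, suffixing named groups to avoid collisions."""
    parts = []
    for i, p in enumerate(patterns):
        src = p.pattern
        src = re.sub(r'\(\?P<(\w+)>', lambda m, i=i: '(?P<%s_%d>' % (m.group(1), i), src)
        src = re.sub(r'\(\?P=(\w+)\)', lambda m, i=i: '(?P=%s_%d)' % (m.group(1), i), src)
        if p.flags & re.IGNORECASE:
            src = '(?i:%s)' % src  # keep the original pattern's case-insensitivity local
        parts.append(src)
    return re.compile(''.join(parts))

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
both = combine(first, second, second)
print(bool(both.search("...hello one-two-THREEone-two-three...")))  # -> True
print(bool(both.search("...hellone/Two/ThreeONE-TWO-THREE...")))    # -> True
print(bool(both.search("...HELLO one/Two/ThreeONE-TWO-THREE...")))  # -> False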

Smart pythonic way of removing if elif on regular expressions

I have a series of regular expressions that are called in order. I need to check the first one, then the second, then the third, and so on, right through to the end. I need to do some processing on the matched string, so I'm trying to avoid too much logic, but in Python, unlike Perl, I don't think I can perform assignment in the if-elif-elif... blocks, so I'll end up doing an assignment, then checking for a match, and then getting the results of that match. For example:
m = re.search(patternA, string)
if m:
    stripped = m.group(0)
    xyz = stripped[45:67]
else:
    m = re.search(patternB, string)
    if m:
        stripped = m.group(0)
        abc = stripped[5:7]
    else:
        m = re.search(patternC, string)
        if m:
            stripped = m.group(0)
            txt = stripped[4:5]
        else:
            # ... and so on
Ideally I'd like to find a better structure that ensures I preserve the ordering of the tested regular expressions, and also that I can incorporate the assignment into the if-then statements. So for example:
if (m = re.search(patternA, string)):
    stripped = m.group(0)
    xyz = stripped[45:67]
elif (m = re.search(patternB, string)):
    stripped = m.group(0)
    abc = stripped[5:7]
...
What is the most pythonic way of dealing with this? Thanks.
The use case is to read old data - very old data. However each string may include information about particular values and these are only present if the regular expression matches a particular pattern. So the variables extracted are highly dependent upon what matches.
# the loop variable is named slice_ so it does not shadow the built-in slice()
for (pattern, slice_) in zip([patternA, patternB, patternC],
                             [slice(45, 67), slice(5, 7), slice(4, 5)]):
    m = re.search(pattern, string)
    if m:
        value = m.group(0)[slice_]
        break
else:
    pass  # Handle no match found for any pattern here
Having different variables set based on which "branch" succeeds is not a great idea, since you won't necessarily know which variables are set at any given time. A dictionary would be a better idea if you really want separate labels for each match, since you can query which key or keys are set in a dictionary.
value = {}
for (pattern, slice_, key) in zip([patternA, patternB, patternC],
                                  [slice(45, 67), slice(5, 7), slice(4, 5)],
                                  ['xyz', 'abc', 'txt']):
    m = re.search(pattern, string)
    if m:
        value[key] = m.group(0)[slice_]
        break
The general idea, though, is to note that your chain of if statements is like a hard-coded iteration, so you just need to identify which parts of each if/elif clause varies from the preceding ones, and create a list that you can iterate over instead.
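Worth adding (not part of the original thread): since Python 3.8, assignment expressions let you write almost exactly the structure the question sketches (with := instead of =), keeping the ordering and putting the assignment inline. A toy example with made-up patterns:
import re

string = "ID:20140601 type B"
patternA = r"ID:\d{8}"
patternB = r"type [A-Z]"

if (m := re.search(patternA, string)):
    value = m.group(0)[3:]   # -> '20140601'
elif (m := re.search(patternB, string)):
    value = m.group(0)[-1]
else:
    value = None
print(value)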

replace regex variable with string in python

I have a situation where I have a regular expression like this
regex_string = r'(?P<x>\d+)\s(?P<y>\w+)'
r = re.compile(regex_string)
and, before I start matching things with it, I'd like to replace the regex group named x with a particular value, say 2014. That way, when I search for matches to this regular expression, it will only find things that have x=2014. What is the best way to approach this issue?
The challenge here is that both the original regular expression regex_string and the arbitrary replacement value x=2014 are specified by an end user. In my head, the ideal thing would be to have a function like replace_regex_variables:
r = re.compile(regex_string)
r = replace_regex_variables(r, x=2014)
for match in r.finditer(really_big_string):
    do_something_with_each_match(match)
I'm open to any solution, but specifically interested in understanding if it's possible to do this without checking matches after they are returned by finditer, to take advantage of re's performance. In other words, preferably NOT this:
r = re.compile(regex_string)
for match in r.finditer(really_big_string):
    if match.groupdict()['x'] == '2014':
        do_something_with_each_match(match)
You want something like this, don't you?
r = r'(?P<x>%(x)s)\s(?P<y>\w+)'
r = re.compile(r % {'x': 2014})
for match in r.finditer(really_big_string):
    do_something_with_each_match(match)
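A rough text-level alternative (my own sketch, not from the thread): rewrite the body of the named group in the pattern string before compiling, so that re does all the filtering during finditer. It assumes the group body contains no nested parentheses.
import re

def fix_named_group(pattern_string, **values):
    """Replace the body of each (?P<name>...) group with an escaped literal value."""
    for name, value in values.items():
        repl = '(?P<%s>%s)' % (name, re.escape(str(value)))
        pattern_string = re.sub(r'\(\?P<%s>[^)]*\)' % re.escape(name),
                                lambda m, repl=repl: repl,
                                pattern_string)
    return pattern_string

regex_string = r'(?P<x>\d+)\s(?P<y>\w+)'
r = re.compile(fix_named_group(regex_string, x=2014))  # pattern is now (?P<x>2014)\s(?P<y>\w+)
print(bool(r.search('2014 report')))  # -> True
print(bool(r.search('2013 report')))  # -> False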
