Replace regex variable with string in Python

I have a situation where I have a regular expression like this
regex_string = r'(?P<x>\d+)\s(?P<y>\w+)'
r = re.compile(regex_string)
and, before I start matching things with it, I'd like to replace the regex group named x with a particular value, say 2014. That way, when I search for matches with this regular expression, I will only find things that have x=2014. What is the best way to approach this?
The challenge here is that both the original regular expression regex_string and the arbitrary replacement value x=2014 are specified by an end user. In my head, the ideal thing would be to have a function like replace_regex_variables:
r = re.compile(regex_string)
r = replace_regex_variables(r, x=2014)
for match in r.finditer(really_big_string):
    do_something_with_each_match(match)
I'm open to any solution, but I'm specifically interested in understanding whether it's possible to do this without checking matches after they are returned by finditer, to take advantage of re's performance. In other words, preferably NOT this:
r = re.compile(regex_string)
for match in r.finditer(really_big_string):
    if match.groupdict()['x'] == '2014':
        do_something_with_each_match(match)

You want something like this, don't you?
regex_string = r'(?P<x>%(x)s)\s(?P<y>\w+)'
r = re.compile(regex_string % {'x': 2014})
for match in r.finditer(really_big_string):
    do_something_with_each_match(match)
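If the pattern really has to stay exactly as the user typed it (no %-placeholders), another route is to rewrite the named group in the pattern text before compiling. Below is a minimal sketch of the replace_regex_variables function the question wishes for; it assumes the named groups contain no nested parentheses and is only meant as an illustration, not a robust parser:

import re

def replace_regex_variables(regex_string, **values):
    # For each supplied name, swap the named group's subpattern for the
    # escaped literal value, keeping the group so match.group('x') still works.
    for name, value in values.items():
        regex_string = re.sub(r'\(\?P<%s>[^)]*\)' % name,
                              '(?P<%s>%s)' % (name, re.escape(str(value))),
                              regex_string)
    return re.compile(regex_string)

r = replace_regex_variables(r'(?P<x>\d+)\s(?P<y>\w+)', x=2014)
for match in r.finditer('2014 foo 2015 bar'):
    print(match.group('x'), match.group('y'))
# -> 2014 foo

Because the literal value is baked into the compiled pattern, finditer skips non-matching candidates itself and no post-filtering of matches is needed.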

Related

Check if a variable substring is in a string

I receive an input string having values expressed in two possible formats. E.g.:
#short format
data = '"interval":19'
>>> "interval":19
#extended format
data = '"interval":{"t0":19,"tf":19}'
>>> "interval":{"t0":19,"tf":19}
I would like to check whether the short format is used and, if so, convert it to the extended one.
Considering that the string could be composed of multiple values, e.g.
data = '"interval":19,"interval2":{"t0":10,"tf":15}'
>>> "interval":19,"interval2":{"t0":10,"tf":15}
I cannot just say:
if ":{" not in data:
    # then short format is used
I would like to code something like:
if ":$(a general int/float/double number)" in data:
    # extract the number
    # replace ":{number}" with the extended format
I know how to code the replacing part.
I need help implementing the if condition: in my mind, I model it as a variable substring, in which the variable part is the number inside it, while the rigid part is the $(value name) + ":" prefix.
"some_value":19
^ ^
rigid format variable part
EDIT - WHY NOT PARSE IT?
I know the string is "JSON-friendly" and I can convert it into a dictionary, easily accessing then the values.
In fact, I already have this solution in my code. But I don't like it, since the input string could be multilevel and I would need to iterate over the leaf values of the resulting dictionary regardless of how deeply they are nested, which is not a simple thing to do.
So I was wondering whether a way to act directly on the string exists.
If you replace all keys that are followed by numbers, except t0 and tf, it should work.
Here is an example on a multilevel string; it could probably be put in better shape:
import re
s = '"interval": 19,"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval":23}}'
gex = '("(?!(t0|tf)")\w+":)\s*(\d+)'
new_s = re.sub(gex, r'\1 {"t0": \3, "tf": \3}', s)
print(new_s)
>>> print(new_s)
"interval": {"t0": 19, "tf": 19},"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval": {"t0": 23, "tf": 23}}}
You could use a regular expression. ("interval":)(\d+) will look for the string '"interval":' followed by any number of digits.
Let's test this
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'xxx', data)
print(result)
# -> xxx,"interval2":{"t0":10,"tf":15},xxx
We see that we found the correct places. Now we're going to create your target format. Here the matched groups come in handy. In the regular expression ("interval":) is group 1, (\d+) is group 2.
Now we use the content of those groups to create your wanted result.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"interval":{"t0":25,"tf":25}
If there are floating point values involved you'll have to change (\d+) to ([.\d]+).
If you want to match any Unicode word characters, not only interval, you can use the special sequence \w, and because the key could be multiple characters the expression becomes \w+.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'("\w+":)([.\d]+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":{"t0":10,"tf":10},"tf":{"t0":15,"tf":15}},"Monty":{"t0":25.4,"tf":25.4}
Dang! Yes, we found "Monty" but now the values from the second part are found too. We'll have to fix this somehow. Let's see. We don't want ("\w+") if it's preceded by {, so we're going to use a negative lookbehind assertion: (?<!{)("\w+"). And after the number part (\d+) we don't want a } or another digit, so we're using a negative lookahead assertion here: ([.\d]+)(?!})(?!\d).
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'(?<!{)("\w+":)([.\d]+)(?!})(?!\d)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}
Hooray, it works!
Regular expressions are powerful and fun, but if you start to add more constraints this might become unmanageable.
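For completeness, since the question's edit asks "why not parse it?": the recursive walk over the parsed JSON that the OP wanted to avoid can stay fairly small. This is only a sketch under the assumption that the fragment becomes valid JSON once wrapped in braces; the extend_leaves helper is made up for illustration:

import json

def extend_leaves(node):
    # Recursively wrap bare numeric leaves as {"t0": n, "tf": n}; dicts that
    # already look extended (exactly the keys t0 and tf) are left untouched.
    if isinstance(node, dict):
        if set(node) == {"t0", "tf"}:
            return node
        return {k: extend_leaves(v) for k, v in node.items()}
    if isinstance(node, (int, float)):
        return {"t0": node, "tf": node}
    return node

data = '"interval":19,"interval2":{"t0":10,"tf":15}'
parsed = json.loads('{' + data + '}')  # wrap so the fragment parses as an object
print(json.dumps(extend_leaves(parsed)))
# -> {"interval": {"t0": 19, "tf": 19}, "interval2": {"t0": 10, "tf": 15}}

Unlike the regex, this keeps working no matter how many constraints you add, at the cost of the wrap/parse/re-serialize round trip.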

Pythonic way to find the last position in a string matching a negative regex

In Python, I try to find the last position in an arbitrary string that matches a given pattern, which is specified as a negative character set regex pattern. For example, with the string uiae1iuae200 and the pattern of not being a number (the regex pattern in Python for this would be [^0-9]), I would need 8 (the position of the last 'e' before the '200') as the result.
What is the most pythonic way to achieve this?
As it's a little tricky to quickly find the best-suited method for something in the Python docs (method documentation tends to sit somewhere in the middle of the corresponding page, like re.search() on the re page), the best I quickly came up with uses re.search() - but its current form surely must be a suboptimal way of doing it:
import re
string = 'uiae1iuae200' # the string to investigate
len(string) - re.search(r'[^0-9]', string[::-1]).start() - 1  # -> 8
I am not satisfied with this for two reasons:
- a) I need to reverse the string with [::-1] before using it, and
- b) I also need to reverse the resulting position (subtracting it from len(string)) because of having reversed the string before.
There must be a better way to do this, likely even with the result of re.search().
I am aware of re.search(...).end() over .start(), but re.search() seems to split the results into groups, and I did not quickly find a non-cumbersome way to apply it to the last matched group. Without specifying the group, .start(), .end(), etc. seem to always refer to the first group, which does not have the position information about the last match. Selecting the group, however, seems to require saving the return value in a variable first (which prevents neat one-liners), as I would need it both to select the last group and to call .end() on it.
What's your pythonic solution to this? I would value being pythonic more than having the most optimized runtime.
Update
The solution should also work in corner cases, like 123 (no position that matches the regex), the empty string, etc. It should not crash, e.g. because of selecting the last index of an empty list. However, as even my ugly attempt above in the question would need more than one line for this, I guess a one-liner might be impossible (simply because one needs to check the return value of re.search() or re.finditer() before handling it). For this reason I'll accept pythonic multi-line solutions as well.
You can use re.finditer to extract the start positions of all matches and return the last one from the list. Try this Python code:
import re
print([m.start(0) for m in re.finditer(r'\D', 'uiae1iuae200')][-1])
Prints:
8
Edit:
To make the solution behave properly for all kinds of inputs, here is the updated code. The solution now takes two lines, as we have to check whether the list is empty: if it is, None is printed, otherwise the index value:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    lst = [m.start() for m in re.finditer(r'\D', s)]
    print(s, '-->', lst[-1] if len(lst) > 0 else None)
This prints the following, showing None instead of an index when no such position is found:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
Edit 2:
As the OP stated in the post, \D was only the example we started with, which is why I came up with a solution that works with any general regex. But if this problem really only has to be solved for \D, then there is a better solution that does not require a list comprehension at all and can be written with a regex that finds the last occurrence of a non-digit character and prints its position. We can use the .*(\D) regex to find the last non-digit and easily print its index using the following Python code:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
    m = re.match(r'.*(\D)', s)
    print(s, '-->', m.start(1) if m else None)
This prints each string with the index of its last non-digit character, or None if there is none:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
And as you can see, this code doesn't need any list comprehension and is better, as it finds the index with just one call to re.match.
But in case the OP indeed meant it to be written using any general regex pattern, then my code above using the comprehension will be needed. I could even write it as a function that takes the regex (like \D or even a complex one) as an argument and dynamically generates the negation of the passed regex to use in the code. Let me know if this is indeed needed.
To me it seems that you just want the last position that matches a given pattern (in this case the not-a-number pattern).
This is as pythonic as it gets:
import re
string = 'uiae1iuae200'
pattern = r'[^0-9]'
match = re.match(fr'.*({pattern})', string)
print(match.end(1) - 1 if match else None)
Output:
8
Or the exact same as a function and with more test cases:
import re
def last_match(pattern, string):
    match = re.match(fr'.*({pattern})', string)
    return match.end(1) - 1 if match else None
cases = [(r'[^0-9]', 'uiae1iuae200'), (r'[^0-9]', '123a'), (r'[^0-9]', '123'), (r'[^abc]', 'abcabc1abc'), (r'[^1]', '11eea11')]
for pattern, string in cases:
    print(f'{pattern}, {string}: {last_match(pattern, string)}')
Output:
[^0-9], uiae1iuae200: 8
[^0-9], 123a: 3
[^0-9], 123: None
[^abc], abcabc1abc: 6
[^1], 11eea11: 4
This does not look Pythonic because it's not a one-liner, and it uses range(len(foo)), but it's pretty straightforward and probably not too inefficient.
import re

def last_match(pattern, string):
    for i in range(1, len(string) + 1):
        substring = string[-i:]
        if re.match(pattern, substring):
            return len(string) - i
The idea is to iterate over the suffixes of string from the shortest to the longest, and to check whether each one matches pattern.
Since we're checking from the end, we know for sure that the first matching suffix we meet corresponds to the last matching position.

Combine compiled Python regexes

Is there any mechanism in Python for combining compiled regular expressions?
I know it's possible to compile a new expression by extracting the plain-old-string .pattern property from existing pattern objects. But this fails in several ways. For example:
import re
first = re.compile(r"(hello?\s*)")
# one-two-three or one/two/three - but not one-two/three or one/two-three
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
# Incorrect - back-reference \1 would refer to the wrong capturing group now,
# and we get an error "redefinition of group name 'r1' as group 3; was
# group 2 at position 47" for the `(?P)` group.
# Result is also now case-sensitive, unlike 'second' which is IGNORECASE
both = re.compile(first.pattern + second.pattern + second.pattern)
The result I'm looking for is achievable like so in Perl:
$first = qr{(hello?\s*)};
# one-two-three or one/two/three - but not one-two/three or one/two-three
$second = qr{one([-/])two\g{-1}three}i;
$both = qr{$first$second$second};
A test shows the results:
test($second, "...one-two-three..."); # Matches
test($both, "...hello one-two-THREEone-two-three..."); # Matches
test($both, "...hellone/Two/ThreeONE-TWO-THREE..."); # Matches
test($both, "...HELLO one/Two/ThreeONE-TWO-THREE..."); # No match
sub test {
    my ($pat, $str) = @_;
    print $str =~ $pat ? "Matches\n" : "No match\n";
}
Is there a library somewhere that makes this use case possible in Python? Or a built-in feature I'm missing somewhere?
(Note - one very useful feature in the Perl regex above is \g{-1}, which unambiguously refers to the immediately preceding capture group, so that there are no collisions of the type that Python is complaining about when I try to compile the combined expression. I haven't seen that anywhere in Python world, not sure if there's an alternative I haven't thought of.)
Ken, this is an interesting problem. I agree with you that the Perl solution is very slick.
I came up with something, but it is not so elegant. Maybe it gives you some ideas for further exploring a solution in Python. The idea is to simulate the concatenation using Python's re methods.
import re

first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
s = "...hello one-two-THREEone/two/three..."
# s = "...hellone/Two/ThreeONE-TWO-THREE..."
if re.search(first, s):
    first_end_pos = re.search(first, s).end()
    if re.match(second, s[first_end_pos:]):
        second_end_pos = re.match(second, s[first_end_pos:]).end() + first_end_pos
        if re.match(second, s[second_end_pos:]):
            print('Matches')
It will work for most cases, but it does not work for the case below:
...hellone/Two/ThreeONE-TWO-THREE...
So, yes I admit it is not a complete solution to your problem. Hope this helps though.
I'm not a Perl expert, but it doesn't seem like you're comparing apples to apples. You're using named capture groups in Python, but I don't see any named capture groups in the Perl example. This causes the error you mention, because this
both = re.compile(first.pattern + second.pattern + second.pattern)
tries to create two capture groups named r1
For example, if you use the regex below, then try to access group_one by name, would you get the numbers before "some text" or after?
# Not actually a valid regex
r'(?P<group_one>[0-9]*)some text(?P<group_one>[0-9]*)'
Solution 1
An easy solution is probably to remove the names from the capture groups. Also add the re.IGNORECASE to both. The code below works, although I'm not sure the resulting regex pattern will match what you want it to match.
first = re.compile(r"(hello?\s*)")
second = re.compile(r"one([-/])two([-/])three", re.IGNORECASE)
both = re.compile(first.pattern + second.pattern + second.pattern, re.IGNORECASE)
Solution 2
What I'd probably do instead is define the separate regular expressions as strings, then you can combine them however you'd like.
pattern1 = r"(hello?\s*)"
pattern2 = r"one([-/])two([-/])three"
first = re.compile(pattern1, re.IGNORECASE)
second = re.compile(pattern2, re.IGNORECASE)
both = re.compile(r"{}{}{}".format(pattern1, pattern2, pattern2), re.IGNORECASE)
Or better yet, for this specific example, don't repeat pattern2 twice, just account for the fact that it'll repeat in the regex:
both = re.compile("{}({}){{2}}".format(pattern1, pattern2), re.IGNORECASE)
which gives you the following regex:
r'(hello?\s*)(one([-/])two([-/])three){2}'
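Building on Solution 2, the plain strings can also be combined through a small helper so that the flags travel with them; combine below is a hypothetical helper, not anything from the standard library:

import re

def combine(*patterns, flags=0):
    # Wrap each piece in a non-capturing group so alternations or quantifiers
    # in one piece cannot leak into its neighbours, then compile the result.
    return re.compile(''.join('(?:%s)' % p for p in patterns), flags)

pattern1 = r"(hello?\s*)"
pattern2 = r"one([-/])two([-/])three"
both = combine(pattern1, pattern2, pattern2, flags=re.IGNORECASE)
print(bool(both.search("hello one-two-THREEone/two/three")))  # -> True

Like Solution 2, this drops the cross-separator back-reference that the Perl version enforces, so a mixed one-two/three would also be accepted.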

Smart pythonic way of removing if elif on regular expressions

I have a series of regular expressions called in order. I need to check the first one, then the second, then the third, and so on, right the way to the end. I need to do some processing on the matched string, so I'm trying to avoid too much logic, but in Python, unlike Perl, I do not think I can perform assignment in the if-elif-elif... blocks, so I'll end up doing an assignment, then checking for a match and then getting the results of that match. For example:
m = re.search(patternA, string)
if m:
    stripped = m.group(0)
    xyz = stripped[45:67]
else:
    m = re.search(patternB, string)
    if m:
        stripped = m.group(0)
        abc = stripped[5:7]
    else:
        m = re.search(patternC, string)
        if m:
            stripped = m.group(0)
            txt = stripped[4:5]
        else:
            ......
Ideally I'd like to find a better structure that ensures I preserve the ordering of the tested regular expressions, and also that I can incorporate the assignment into the if-then statements. So for example:
if (m = re.search(patternA, string)):
    stripped = m.group(0)
    xyz = stripped[45:67]
elif (m = re.search(patternB, string)):
    stripped = m.group(0)
    abc = stripped[5:7]
...
What is the most pythonic way of dealing with this? Thanks.
The use case is to read old data - very old data. However each string may include information about particular values and these are only present if the regular expression matches a particular pattern. So the variables extracted are highly dependent upon what matches.
for pattern, slc in zip([patternA, patternB, patternC],
                        [slice(45, 67), slice(5, 7), slice(4, 5)]):
    m = re.search(pattern, string)
    if m:
        value = m.group(0)[slc]
        break
else:
    pass  # Handle "no match found for any pattern" here
This iterates over pairs of regular expressions and the relevant portion of their match until a match is found. If there is no match found, the else clause of the for loop will execute. The result of the match is found in value after the loop, regardless of which pattern matches.
Having different variables set based on which "branch" succeeds is not a great idea, since you won't necessarily know which variables are set at any given time. A dictionary would be a better idea if you really want separate labels for each match, since you can query which key or keys are set in a dictionary.
value = {}
for pattern, slc, key in zip([patternA, patternB, patternC],
                             [slice(45, 67), slice(5, 7), slice(4, 5)],
                             ['xyz', 'abc', 'txt']):
    m = re.search(pattern, string)
    if m:
        value[key] = m.group(0)[slc]
        break
The general idea, though, is to note that your chain of if statements is like a hard-coded iteration, so you just need to identify which parts of each if/elif clause varies from the preceding ones, and create a list that you can iterate over instead.
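It is also worth noting that on Python 3.8+ the assignment expression ("walrus") operator from PEP 572 makes the shape the OP asked for legal, if the explicit chain really has to stay; the patterns below are placeholders, just to keep the snippet runnable:

import re

# Requires Python 3.8+ (assignment expressions, PEP 572).
patternA, patternB, patternC = r'AAA.{70}', r'BBB.{10}', r'CCC.{10}'  # placeholders
string = 'BBB0123456789'

if (m := re.search(patternA, string)):
    xyz = m.group(0)[45:67]
elif (m := re.search(patternB, string)):
    abc = m.group(0)[5:7]
elif (m := re.search(patternC, string)):
    txt = m.group(0)[4:5]

The caveat above about not knowing which of xyz, abc or txt ends up being set still applies, and the loop-based versions scale better as the number of patterns grows.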

How to build a regular expression from predefined blocks

I want to build a regular expression based on a template and a set of predefined blocks, and using string.Template for the substitution.
For example:
template: /data/${year}_${month}_${day}/${year}${month}${day}_${type}_${id}.dat
blocks:
day: (?P<day>\d{2})
month: (?P<month>\d{2})
year: (?P<year>\d{4})
type: (?P<typechar>[BDPCLNIYSQJ])
id: (?P<id>\d{8})
>>> string.Template(template).safe_substitute(blocks)
/data/(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})/(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})_(?P<typechar>[BDPCLNIYSQJ])_(?P<id>\d{8}).dat
The problem is the duplicated group names, which are not accepted in a regular expression.
I'm looking either for a way to correct the template (before or after the substitution), a way to trick re to swallow the duplicates, or a complete new approach to the problem.
I'm not sure about Python, but both PCRE and Perl support the (?(DEFINE)...) construction. So you can use something like this:
(?x)
(?(DEFINE)
(?<date> (?&long_date) | (?&short_date))
(?<long_date> (?&year) _ (?&month) _ (?&day) _ (?&type) _ (?&id))
(?<short_date> (?&year) _ (?&month) _ (?&day))
(?<day> \d{2})
(?<month> \d{2})
(?<year> \d{4})
(?<type> [BDPCLNIYSQJ])
(?<id> \d{8})
)
(?&date)
I've used the "x" modifier (?x) just to make the regex more readable (whitespace inside the regex is then ignored).
A "conditional group" of the form (?(DEFINE)...) can be used to define
groups (named and numbered) that are never evaluated inline, but can
be called as "subroutines" from elsewhere. In effect, the DEFINE
condition is always false. There may be only one alternative in such a
group.
http://www.pcre.org/changelog.txt
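Python's built-in re module does not support this construction. The third-party regex module advertises PCRE-style subroutine calls such as (?&name), so a rough Python sketch of the same idea is shown below; it only reuses groups that are already defined inline rather than a (?(DEFINE)...) block, and is an assumption to verify against the regex module's docs. Also note that a subroutine call re-matches the group's pattern, not its captured text, so use (?P=name) instead when the repeated part must be byte-for-byte identical:

import regex  # third-party module: pip install regex

pattern = regex.compile(
    r'/data/(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})/'
    r'(?&year)(?&month)(?&day)_(?P<typechar>[BDPCLNIYSQJ])_(?P<id>\d{8})\.dat'
)
# The two dates may disagree here because (?&...) only repeats the pattern:
print(bool(pattern.match('/data/2014_05_17/20140518_B_12345678.dat')))  # -> True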
Get rid of the ?P name elements.
i.e.
day: (?P<name>\d{2})
becomes
day: (\d{2})
I've never used the ?P feature before tbh
Your regexp template idea is good though!
apply the template twice, once to set the name and then use the output to make the final regexp
from string import Template

day = r'(?P<$dayname>\d{2})'
d = dict(dayname='day_start')
Template(day).safe_substitute(d)  # -> '(?P<day_start>\d{2})'
Repeat for all the names you need, then feed them all into your final template, which uses day1, day2, etc. - as sketched below.
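A rough sketch of that two-pass idea (the dayname placeholder, the day_dir/day_file group names and the simplified path template are all made up for illustration): the group name itself is a template variable, so each use of a block can be given a distinct name before the blocks are substituted into the final template:

from string import Template

day_block = r'(?P<$dayname>\d{2})'
day1 = Template(day_block).safe_substitute(dayname='day_dir')   # (?P<day_dir>\d{2})
day2 = Template(day_block).safe_substitute(dayname='day_file')  # (?P<day_file>\d{2})

path_template = r'/data/${day1}/${day2}.dat'  # simplified final template
print(Template(path_template).safe_substitute(day1=day1, day2=day2))
# -> /data/(?P<day_dir>\d{2})/(?P<day_file>\d{2}).dat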
After following a friend's advice, I found a way to achieve the desired result.
The idea is to modify the template string to eliminate duplicate vars before substituting the regex blocks. In fact it's not removing the duplicates but replacing them with a reference to the first one with the (?P=name) syntax. This way you force the contents to be the same everywhere you use that block.
I will assume the regex groupname is the same as the template block name. This is not true in the question example, but it can be changed without any problem.
To transform the duplicates I use the following function:
def remove_duplicate_blocks(template):
    regex = r'\$\{([\w]+)\}'
    def alt_seen(matchobj):
        x = matchobj.group(1)
        if x not in seen and not seen_add(x):
            return '${%s}' % x
        else:
            return '(?P=%s)' % x
    seen = set()
    seen_add = seen.add
    return re.sub(regex, alt_seen, template)
which returns the transformed template without duplicates and enforcing all similar blocks to have the same content.
Afterwards it's just a matter of replacing the blocks
>>> unique_blocks_template = remove_duplicate_blocks(template)
>>> print(unique_blocks_template)
/data/${year}_${month}_${day}/(?P=year)(?P=month)(?P=day)_${type}_${id}.dat
>>> string.Template(unique_blocks_template).safe_substitute(blocks)
'/data/(?P<year>\\d{4})_(?P<month>\\d{2})_(?P<day>\\d{2})/(?P=year)(?P=month)(?P=day)_(?P<type>[BDPCLNIYSQJ])_(?P<id>\\d{8}).dat'
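As a quick sanity check of the resulting pattern (the file names below are made up for illustration):

import re

combined = re.compile(
    r'/data/(?P<year>\d{4})_(?P<month>\d{2})_(?P<day>\d{2})/'
    r'(?P=year)(?P=month)(?P=day)_(?P<type>[BDPCLNIYSQJ])_(?P<id>\d{8}).dat'
)
m = combined.match('/data/2014_05_17/20140517_B_12345678.dat')
print(m.groupdict() if m else None)
# -> {'year': '2014', 'month': '05', 'day': '17', 'type': 'B', 'id': '12345678'}
print(combined.match('/data/2014_05_17/20140518_B_12345678.dat'))  # dates disagree
# -> None

The back-references make the second date collapse to whatever the first one captured, which is exactly what the original duplicated groups were meant to express.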
Not mentioned in the question, but the same original template can also be used to reconstruct the string we want to match with the regex, which is the original aim of this code.
