Removing redundant regex? - python

Suppose I have a list of very simple regex represented as strings (by "very simple", I mean only containing .*). Every string in the list starts and ends with .*. For example, I could have
rs = [.*a.*, .*ab.*, .*ba.*cd.*, ...]
What I would like to do is keep track of those patterns that are a subset of another. In this example, .*a.* matches everything .*ab.* does, and more. Hence, I consider the latter pattern to be redundant.
What I thought to do was to split the strings on .*, match up corresponding elements, and test if one startswith the other. More specifically, consider .*a.* and .*ab.*. Splitting these on .*
a = ['', 'a', '']
b = ['', 'ab', '']
and zipping them together gives
c = [('', ''), ('a', 'ab'), ('', '')]
And then,
all(elt[1].startswith(elt[0]) for elt in c)
returns True and so I conclude that .*ab.* is indeed redundant if .*a.* is included in the list.
Does this make sense and does it do what I am trying to do? Of course, this approach gets complicated for a number of reasons, and so my next question is, is there a better way to do this that anyone has encountered previously?

For this problem you need to find the minimal DFAs for both the regex and compare them.
Here is the link of a discussion of same problem-
How to tell if one regular expression matches a subset of another regular expression?

Assuming every letter combination is surrounded by .* and does not have it in the middle, the approach would almost work. Instead of startswith you need to check for contains, though.
reglist = ['.*a.*', '.*ab.*', '.*ba.*', '.*cd.*']
patterns = set(x.split('.*')[1] for x in reglist)
remove = []
for x in patterns:
for y in patterns:
if x in y and x != y:
remove.append(y)
print (['.*{}.*'.format(x) for x in sorted(patterns - set(remove))])
gives you
['.*a.*', '.*cd.*']

Related

In python, find tokens in line

long time ago I wrote a tool for parsing text files, line by line, and do some stuff, depending on commands and conditions in the file.
I used regex for this, however, I was never good in regex.
A line holding a condition looks like this:
[type==STRING]
And the regex I use is:
re.compile(r'^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
This regex would result me the keyword "type" and the value "STRING".
However, now I need to update my tool to have more conditions in one line, e.g.
[type==STRING][amount==0]
I need to update my regex to get me two pairs of results, one pair type/STRING and one pair amount/0.
But I'm lost on this. My regex above gets me zero results with this line.
Any ideas how to do this?
You could either match a second pair of groups:
^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*(?:\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*)?$
Regex demo
Or you can omit the anchors and the [^\[\]]* part to get the group1 and group 2 values multiple times:
\[([^\]\[=]*)==([^\]\[=]*)\]
Regex demo
Is it a requirement that you use regex? You can alternatively accomplish this pretty easily using the split function twice and stripping the first opening and last closing bracket.
line_to_parse = "[type==STRING]"
# omit the first and last char before splitting
pairs = line_to_parse[1:-1].split("][")
for pair in pairs:
x, y = pair.split("==")
Rather depends on the precise "rules" that describe your data. However, for your given data why not:
import re
text = '[type==STRING][amount==0]'
words = re.findall('\w+', text)
lst = []
for i in range(0, len(words), 2):
lst.append((words[i], words[i+1]))
print(lst)
Output:
[('type', 'STRING'), ('amount', '0')]

Pythonic way to find the last position in a string matching a negative regex

In Python, I try to find the last position in an arbitrary string that does match a given pattern, which is specified as negative character set regex pattern. For example, with the string uiae1iuae200, and the pattern of not being a number (regex pattern in Python for this would be [^0-9]), I would need '8' (the last 'e' before the '200') as result.
What is the most pythonic way to achieve this?
As it's a little tricky to quickly find method documentation and the best suited method for something in the Python docs (due to method docs being somewhere in the middle of the corresponding page, like re.search() in the re page), the best way I quickly found myself is using re.search() - but the current form simply must be a suboptimal way of doing it:
import re
string = 'uiae1iuae200' # the string to investigate
len(string) - re.search(r'[^0-9]', string[::-1]).start()
I am not satisfied with this for two reasons:
- a) I need to reverse string before using it with [::-1], and
- b) I also need to reverse the resulting position (subtracting it from len(string) because of having reversed the string before.
There needs to be better ways for this, likely even with the result of re.search().
I am aware of re.search(...).end() over .start(), but re.search() seems to split the results into groups, for which I did not quickly find a not-cumbersome way to apply it to the last matched group. Without specifying the group, .start(), .end(), etc, seem to always match the first group, which does not have the position information about the last match. However, selecting the group seems to at first require the return value to temporarily be saved in a variable (which prevents neat one-liners), as I would need to access both the information about selecting the last group and then to select .end() from this group.
What's your pythonic solution to this? I would value being pythonic more than having the most optimized runtime.
Update
The solution should be functional also in corner cases, like 123 (no position that matches the regex), empty string, etc. It should not crash e.g. because of selecting the last index of an empty list. However, as even my ugly answer above in the question would need more than one line for this, I guess a one-liner might be impossible for this (simply because one needs to check the return value of re.search() or re.finditer() before handling it). I'll accept pythonic multi-line solutions to this answer for this reason.
You can use re.finditer to extract start positions of all matches and return the last one from list. Try this Python code:
import re
print([m.start(0) for m in re.finditer(r'\D', 'uiae1iuae200')][-1])
Prints:
8
Edit:
For making the solution a bit more elegant to behave properly in for all kind of inputs, here is the updated code. Now the solution goes in two lines as the check has to be performed if list is empty then it will print -1 else the index value:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
lst = [m.start() for m in re.finditer(r'\D', s)]
print(s, '-->', lst[-1] if len(lst) > 0 else None)
Prints the following, where if no such index is found then prints None instead of index:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
Edit 2:
As OP stated in his post, \d was only an example we started with, due to which I came up with a solution to work with any general regex. But, if this problem has to be really done with \d only, then I can give a better solution which would not require list comprehension at all and can be easily written by using a better regex to find the last occurrence of non-digit character and print its position. We can use .*(\D) regex to find the last occurrence of non-digit and easily print its index using following Python code:
import re
arr = ['', '123', 'uiae1iuae200', 'uiae1iuae200aaaaaaaa']
for s in arr:
m = re.match(r'.*(\D)', s)
print(s, '-->', m.start(1) if m else None)
Prints the string and their corresponding index of non-digit char and None if not found any:
--> None
123 --> None
uiae1iuae200 --> 8
uiae1iuae200aaaaaaaa --> 19
And as you can see, this code doesn't need to use any list comprehension and is better as it can just find the index by just one regex call to match.
But in case OP indeed meant it to be written using any general regex pattern, then my above code using comprehension will be needed. I can even write it as a function that can take the regex (like \d or even a complex one) as an argument and will dynamically generate a negative of passed regex and use that in the code. Let me know if this indeed is needed.
To me it sems that you just want the last position which matches a given pattern (in this case the not a number pattern).
This is as pythonic as it gets:
import re
string = 'uiae1iuae200'
pattern = r'[^0-9]'
match = re.match(fr'.*({pattern})', string)
print(match.end(1) - 1 if match else None)
Output:
8
Or the exact same as a function and with more test cases:
import re
def last_match(pattern, string):
match = re.match(fr'.*({pattern})', string)
return match.end(1) - 1 if match else None
cases = [(r'[^0-9]', 'uiae1iuae200'), (r'[^0-9]', '123a'), (r'[^0-9]', '123'), (r'[^abc]', 'abcabc1abc'), (r'[^1]', '11eea11')]
for pattern, string in cases:
print(f'{pattern}, {string}: {last_match(pattern, string)}')
Output:
[^0-9], uiae1iuae200: 8
[^0-9], 123a: 3
[^0-9], 123: None
[^abc], abcabc1abc: 6
[^1], 11eea11: 4
This does not look Pythonic because it's not a one-liner, and it uses range(len(foo)), but it's pretty straightforward and probably not too inefficient.
def last_match(pattern, string):
for i in range(1, len(string) + 1):
substring = string[-i:]
if re.match(pattern, substring):
return len(string) - i
The idea is to iterate over the suffixes of string from the shortest to the longest, and to check if it matches pattern.
Since we're checking from the end, we know for sure that the first substring we meet that matches the pattern is the last.

The difference between ( [^,]*) and (.*,) in regular expression? Using python

When I tried to transform the string into a dict-like form, I met this problem
s = '&a: 12, &b:13, &c:14, &d: 15' # the string I want to convert
Before converting it, I tried to find all the matched results at first so I used
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
result = dict_form.findall(s)
print(result) # [('&a:', ' 12, &b:13, &c:14')]
It's quite unexpected, and a little bit messy
But when I tried another way to match the string:
dict_form1 = re.compile(r'(&[a-zA-Z]*:)([^,]*)')
result = dict_form1.findall(s)
print(result) # [('&a:', ' 12'), ('&b:', '13'), ('&c:', '14'), ('&d:', ' 15')]
This time, I get a better one with key and item separately stored in a tuple.
The only difference I made was (.), into [^,]
The first one I thought was to find anything until it matches a comma
The second one I thought was to find anything but comma
What's the difference?
In the first instance:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
the (.*) operator is greedy. This means it will match everything up to the last comma, which is why you see the match extend up to &c:14.
In the second instance, by excluding the comma, you are forcing the match to be bound by a comma-- it's like saying "match everything until we hit a comma". This will cause the matching behavior you were expecting in the first place.
as have been said the .* will be greedy and try to match as much as possible, to make it non-greedy use the question mark (?) as in .*?. In your code:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*?),')
result = dict_form.findall(s)
print(result)
Another maybe easier solution is to just use string splits instead of regex:
result = [_s.split(':') for _s in s.split(',')]

Python Regular Expressions Findall

To look through data, I am using regular expressions. One of my regular expressions is (they are dynamic and change based on what the computer needs to look for --- using them to search through data for a game AI):
O,2,([0-9],?){0,},X
After the 2, there can (and most likely will) be other numbers, each followed by a comma.
To my understanding, this will match:
O,2,(any amount of numbers - can be 0 in total, each followed by a comma),X
This is fine, and works (in RegExr) for:
O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X # matches this
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X
My issue is that I need to match all the numbers after the original, provided number. So, I want to match (in the example) 9,6,7,11,8.
However, implementing this in Python:
import re
pattern = re.compile("O,2,([0-9],?){0,},X")
matches = pattern.findall(s) # s is the above string
matches is ['8'], the last number, but I need to match all of the numbers after the given (so '9,6,7,11,8').
Note: I need to use pattern.findall because thee will be more than one match (I shortened my list of strings, but there are actually around 20 thousand strings), and I need to find the shortest one (as this would be the shortest way for the AI to win).
Is there a way to match the entire string (or just the last numbers after those I provided)?
Thanks in advance!
Use this:
O,2,((?:[0-9],?){0,}),X
See it in action:http://regex101.com/r/cV9wS1
import re
s = '''O,4,1,8,6,7,9,5,3,X
X,6,3,7,5,9,4,1,8,2,T
O,2,9,6,7,11,8,X
O,4,6,9,3,1,7,5,O
X,6,9,3,5,1,7,4,8,O
X,3,2,7,1,9,4,6,X
X,9,2,6,8,5,3,1,X'''
pattern = re.compile("O,2,((?:[0-9],?){0,}),X")
matches = pattern.findall(s) # s is the above string
print matches
Outputs:
['9,6,7,11,8']
Explained:
By wrapping the entire value capture between 2, and ,X in (), you end up capturing that as well. I then used the (?: ) to ignore the inner captured set.
you don't have to use regex
split the string to array
check item 0 == 0 , item 1==2
check last item == X
check item[2:-2] each one of them is a number (is_digit)
that's all

python regex in pyparsing

How do you make the below regex be used in pyparsing? It should return a list of tokens given the regex.
Any help would be greatly appreciated! Thank you!
python regex example in the shell:
>>> re.split("(\w+)(lab)(\d+)", "abclab1", 3)
>>> ['', 'abc', 'lab', '1', '']
I tried this in pyparsing, but I can't seem to figure out how to get it right because the first match is being greedy, i.e the first token will be 'abclab' instead of two tokens 'abc' and 'lab'.
pyparsing example (high level, i.e non working code):
name = 'abclab1'
location = Word(alphas).setResultsName('location')
lab = CaselessLiteral('lab').setResultsName('environment')
identifier = Word(nums).setResultsName('identifier')
expr = location + lab + identifier
match, start, end = expr.scanString(name).next()
print match.asDict()
Pyparsing's classes are pretty much left-to-right, with lookahead implemented using explicit expressions like FollowedBy (for positive lookahead) and NotAny or the '~' operator (for negative lookahead). This allows you to detect a terminator which would normally match an item that is being repeated. For instance, OneOrMore(Word(alphas)) + Literal('end') will never find a match in strings like "start blah blah end", because the terminating 'end' will get swallowed up in the repetition expression in OneOrMore. The fix is to add negative lookahead in the expression being repeated: OneOrMore(~Literal('end') + Word(alphas)) + Literal('end') - that is, before reading another word composed of alphas, first make sure it is not the word 'end'.
This breaks down when the repetition is within a pyparsing class, like Word. Word(alphas) will continue to read alpha characters as long as there is no whitespace to stop the word. You would have to break into this repetition using something very expensive, like Combine(OneOrMore(~Literal('lab') + Word(alphas, exact=1))) - I say expensive because composition of simple tokens using complex Combine expressions will make for a slow parser.
You might be able to compromise by using a regex wrapped in a pyparsing Regex object:
>>> labword = Regex(r'(\w+)(lab)(\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
This does the right kind of grouping and detection, but does not expose the groups themselves. To do that, add names to each group - pyparsing will treat these like results names, and give you access to the individual fields, just as if you had called setResultsName:
>>> labword = Regex(r'(?P<locn>\w+)(?P<env>lab)(?P<identifier>\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
- env: lab
- identifier: 1
- locn: abc
>>> print labword.parseString("abclab1").asDict()
{'identifier': '1', 'locn': 'abc', 'env': 'lab'}
The only other non-regex approach I can think of would be to define an expression to read the whole string, and then break up the parts in a parse action.
If you strip the subgroup sign(the parenthesis), you'll get the right answer:)
>>> re.split("\w+lab\d+", "abclab1")
['', '']

Categories

Resources