Python: check if one regex covers another regex [duplicate] - python

I want to find out if there could ever be conflicts between two known regular expressions, in order to allow the user to construct a list of mutually exclusive regular expressions.
For example, we know that the regular expressions below are quite different, but they both match xy12:
'^xy1\d'
'[^\d]\d2$'
Is it possible to determine, using a computer algorithm, if two regular expressions can have such a conflict? How?

There's no halting problem involved here. All you need is to compute whether the intersection of ^xy1\d and [^\d]\d2$ is non-empty.
I can't give you an algorithm here, but here are two discussions of a method to generate the intersection without resorting to the construction of a DFA:
http://sulzmann.blogspot.com/2008/11/playing-with-regular-expressions.html
And then there's RAGEL
http://www.complang.org/ragel/
which can compute the intersection of regular expressions too.
UPDATE: I just tried out Ragel with OP's regexp. Ragel can generate a "dot" file for graphviz from the resulting state machine, which is terrific. The intersection of the OP's regexp looks like this in Ragel syntax:
('xy1' digit any*) & (any* ^digit digit '2')
and has the following state machine:
While the empty intersection:
('xy1' digit any*) & ('q' any* ^digit digit '2')
looks like this:
So if all else fails, then you can still have Ragel compute the intersection and check if it outputs the empty state machine, by comparing the generated "dot" file.

The problem can be restated as, "do the languages described by two or more regular
expressions have a non-empty intersection"?
If you confine the question to pure regular expressions (no backreferences, lookahead,
lookbehind, or other features that allow recognition of context-free or more complex
languages), the question is at least decidable. Regular languages are closed under
intersection, and there is an algorithm that takes the two regular expressions
as inputs and produces, in finite time, a DFA that recognizes the intersection.
Each regular expression can be converted to a nondeterministic finite automaton,
and then to a deterministic finite automaton. A pair of DFAs can be converted
to a DFA that recognizes the intersection. If there is a path from the
start state to any accepting state of that final DFA, the intersection is non-empty
(a "conflict", using your terminology).
Unfortunately, there is a possibly-exponential blowup when converting the initial NFA
to a DFA, so the problem quickly becomes infeasible in practice as the size of
the input expressions grows.
And if extensions to pure regular expressions are permitted, all bets are off --
such languages are no longer closed under intersection, so this construction won't
work.
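The product construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout for a DFA, the function name intersection_is_nonempty and the three toy automata are made up for this example, and the step of turning a regular expression into a DFA is skipped entirely (the automata are written by hand).

```python
from collections import deque

def intersection_is_nonempty(dfa_a, dfa_b, alphabet):
    """Return True if at least one string is accepted by both DFAs."""
    start = (dfa_a["start"], dfa_b["start"])
    seen = {start}
    queue = deque([start])
    while queue:
        state_a, state_b = queue.popleft()
        # A product state accepts only if both components accept.
        if state_a in dfa_a["accept"] and state_b in dfa_b["accept"]:
            return True
        for symbol in alphabet:
            next_a = dfa_a["delta"].get((state_a, symbol))
            next_b = dfa_b["delta"].get((state_b, symbol))
            if next_a is None or next_b is None:
                continue  # a missing transition is treated as a dead state
            pair = (next_a, next_b)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return False

# Hand-built toy DFAs over the alphabet {a, b} (hypothetical examples).
starts_with_a = {"start": 0, "accept": {1},
                 "delta": {(0, "a"): 1, (1, "a"): 1, (1, "b"): 1}}
starts_with_b = {"start": 0, "accept": {1},
                 "delta": {(0, "b"): 1, (1, "a"): 1, (1, "b"): 1}}
ends_with_b = {"start": 0, "accept": {1},
               "delta": {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}}

conflict = intersection_is_nonempty(starts_with_a, ends_with_b, "ab")
no_conflict = intersection_is_nonempty(starts_with_a, starts_with_b, "ab")
```

Only product states reachable from the start pair are ever created, so the search stops as soon as a common accepting pair is found. The exponential cost mentioned above comes from building the input DFAs, not from this search.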

Yes I think this is solvable: instead of thinking of regular expressions as matching strings, you can also think of them as generating strings. That is, all the strings they would match.
Let [R] be the set of strings generated by the regular expression R. Then, given two regular expressions R and T, we could try to match T against [R]. That is, [R] matches T iff there is a string in [R] which matches T.
It should be possible to develop this into an algorithm where [R] is lazily constructed as needed: where normal regular expression matching would try to match the next character from an input string and then advance to the next character in the string, the modified algorithm would check whether the FSM corresponding to the input regular expression can generate a matching character at its current state and then advances to 'all next states' simultaneously.

Another approach would be to leverage Dan Kogai's Perl Regexp::Optimizer instead.
use Regexp::Optimizer;
my $o = Regexp::Optimizer->new->optimize(qr/foobar|fooxar|foozap/);
# $o is now qr/foo(?:[bx]ar|zap)/
That is, first optimize and then discard all redundant patterns.
Maybe Ron Savage's Regexp::Assemble could be even more helpful.
It allows assembling an arbitrary number of regular expressions into a single regular expression that matches all that the individual REs match.* Or a combination of both.
* However, you need to be aware of some differences between Perl and Java or other PCRE-flavors.

If you are looking for a Java library, you can use the Automaton library (dk.brics.automaton) and its '&' (intersection) operator:
RegExp re = new RegExp("(ABC_123.*56.txt)&(ABC_12.*456.*\\.txt)", RegExp.INTERSECTION); // Parse RegExp
Automaton a = re.toAutomaton(); // Convert RegExp to automaton
if (a.isEmpty()) { // Test if intersection is empty
    System.out.println("Intersection is empty!");
} else {
    // Print the shortest accepted string
    System.out.println("Intersection is non-empty, example: " + a.getShortestExample(true));
}
Original Answer:
Detecting if two regexes could possibly match the same string

Related

REGEX: Negative lookbehind with multiple whitespaces [duplicate]

I am trying to use lookbehinds in a regular expression and it doesn't seem to work as I expected. So, this is not my real usage, but to simplify I will put an example. Imagine I want to match "example" on a string that says "this is an example". So, according to my understanding of lookbehinds this should work:
(?<=this\sis\san\s*?)example
What this should do is find "this is an", then space characters and finally match the word "example". Now, it doesn't work and I don't understand why, is it impossible to use '+' or '*' inside lookbehinds?
I also tried those two and they work correctly, but don't fulfill my needs:
(?<=this\sis\san\s)example
this\sis\san\s*?example
I am using this site to test my regular expressions: http://gskinner.com/RegExr/
Many regular expression libraries allow only restricted expressions in lookbehind assertions, for example:
only match strings of the same fixed length: (?<=foo|bar|\s,\s) (three characters each)
only match strings of fixed lengths: (?<=foobar|\r\n) (each branch with fixed length)
only match strings with an upper bound length: (?<=\s{,4}) (up to four repetitions)
The main reason for these limitations is that those libraries can't process regular expressions backwards at all, or can do so only for a limited subset.
Another reason could be to prevent authors from building overly complex regular expressions that are expensive to process because of so-called pathological behavior (see also ReDoS).
See also section about limitations of look-behind assertions on Regular-Expressions.info.
If you are not using Python, you can work around the lack of variable-length lookbehind assertions by using \K, which makes the regex engine discard everything matched so far and start the reported match over from that point.
This site explains it well: http://www.phpfreaks.com/blog/pcre-regex-spotlight-k
In short, when you have an expression that must match first and you only want what comes after it, \K forces the match to start over at that position.
Example:
string = '<a this is a tag> with some information <div this is another tag > LOOK FOR ME </div>'
matching /(\<a).+?(\<div).+?(\>)\K.+?(?=\<\/div)/ will cause the regex to restart after the closing > of the opening div tag, so the regex won't include anything before that point in the result. The (?=\<\/div) lookahead will make the engine take everything in front of the ending div tag.
What Amber said is true, but you can work around it with another approach: a non-capturing group.
(?<=this\sis\san)(?:\s*)example
That makes it a fixed-length lookbehind, so it should work.
You can use sub-expressions.
(this\sis\san\s*?)(example)
Then retrieve group 2, "example": use $2 in most regex flavors, or \2 in a replacement string (as in Python's re.sub).
Most regex engines don't support variable-length expressions for lookbehind assertions.
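For Python specifically, the behavior and the capture-group workaround from the answers above can be checked with a short sketch like the following. The sample text and the variable names are made up for illustration.

```python
import re

text = "this is an   example"

# Python's re module rejects a variable-width lookbehind at compile time.
try:
    re.compile(r"(?<=this\sis\san\s*?)example")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Workaround: match the prefix normally and capture only the wanted part.
match = re.search(r"this\sis\san\s*(example)", text)
word = match.group(1) if match else None
```

The capture group gives the same result a variable-length lookbehind would have produced here, at the cost of the overall match also containing the prefix.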

Python Search repeated operators in string

For the purpose of writing a calculator, like the Python interpreter, I want to check the validity of my expressions.
I want to check a string for repeated mathematical operators. I don't want to capture anything, just to know if they exist, in which case the expression would be invalid.
4++-+4 is valid.
4-*8 is invalid
4-/7 is invalid
4/-4 is valid, mine probably fails here.
Minus and plus can repeat themselves, but -* is, for example, invalid.
Much like the way the Python interpreter works.
This is what I have, as a regex, but any simpler solution is welcome, even a non-regex one.
[*/^%\-+][*/^%] | [\-+*/^%][*/^%]
Basically, check if operators */^%-+ are either preceded by or followed by */^% (without minus and plus)
Again a more concise solution would be either a CFG or a stack based approach for infix expressions. However something you could hack and experiment with is the following idea.
construct a product of all operators like so:
from itertools import product

# Every ordered pair of operators, e.g. '*-' or '+/'
pairs = [''.join(p) for p in product('*/^%-+', repeat=2)]
invalids = []  # write the invalid pairs by hand here (the hacky part)
valids = [pair for pair in pairs if pair not in invalids]
And now you're left with all valid operations of length 2. You can scan your string with a window of 2, and when you find a pair of operators not belonging in the valids you can declare the expression invalid and move on.
Another way you could go about it is a rule-based one. Construct a dictionary with operators as keys, and for each operator the value would be a list holding all operators that can follow it.
Then your problem becomes one of checking your string at character i with the validity condition being
string[i+1] in dictionary[string[i]]
If you do find a CFG solution and it's beautiful, let me know.
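The dictionary-based rule check could look roughly like this. The rule table is an assumption inferred from the examples in the question (after any operator, only a + or - sign may follow); the names OPERATORS, ALLOWED_NEXT and has_valid_operator_runs are made up for this sketch, and it assumes the expression contains no whitespace between operators.

```python
OPERATORS = "*/^%-+"

# Assumed rule table: after any operator, only a sign (+ or -) may follow.
ALLOWED_NEXT = {op: {"+", "-"} for op in OPERATORS}

def has_valid_operator_runs(expression):
    """Return False if an operator is followed by one it may not precede."""
    for current, following in zip(expression, expression[1:]):
        # Only adjacent operator/operator pairs are subject to the rules.
        if current in ALLOWED_NEXT and following in OPERATORS:
            if following not in ALLOWED_NEXT[current]:
                return False
    return True

valid_example = has_valid_operator_runs("4/-4")
invalid_example = has_valid_operator_runs("4-/7")
```

This only checks adjacent operator pairs. It does not verify that the expression as a whole is well formed (for instance, it accepts a trailing operator).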

Matching characters in two Python strings

I am trying to print the characters shared between two strings in Python. I am hoping to find out how to do this using nothing but Python regular expressions (I don't know regex, so this might be a good time to learn it).
So if first_word = "peepa" and second_word = "poopa" I want the return value to be: "pa"
since in both variables the characters that are shared are p and a. So far I am following the documentation on how to use the re module, but I can't seem to grasp the basic concepts of this.
Any ideas as to how would I solve this problem?
This sounds like a problem where you want to find the intersection of characters between the two strings. The quickest way would be to do this:
>>> set(first_word).intersection(second_word)
set(['a', 'p'])
I don't think regular expressions are the right fit for this problem.
Use sets. Casting a string to a set returns an iterable with unique letters. Then you can retrieve the intersection of the two sets.
match = set(first_word.lower()) & set(second_word.lower())
Using regular expressions
This problem is tailor made for sets. But, you ask for "how to do this using nothing but python regular expressions."
Here is a start:
>>> import re
>>> re.sub('[^peepa]', '', "poopa")
'ppa'
The above uses regular expressions to remove from "poopa" every letter that was not already in "peepa". (As you see it leaves duplicated letters which sets would not do.)
In more detail, re.sub does substitutions based on regular expressions. [peepa] is a regular expression that means any of the letters peepa. The regular expression [^peepa] means anything that is not in peepa. Anything matching this regular expression is replaced with the empty string "", that is, it is removed. What remains are only the common letters.
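The same idea can be wrapped in a small function that works for arbitrary words and also drops the duplicated letters. This is a sketch, not part of the original answer: the name shared_chars is made up, re.escape is added so that characters such as ] or ^ in the first word cannot break the character class, and dict.fromkeys is used to remove repeats while keeping first-seen order.

```python
import re

def shared_chars(first_word, second_word):
    """Return the characters of second_word that also occur in first_word."""
    if not first_word:
        return ""  # '[^]' would not be a valid pattern
    # Negated character class: matches anything NOT in first_word.
    pattern = "[^" + re.escape(first_word) + "]"
    kept = re.sub(pattern, "", second_word)
    # Drop repeated letters while preserving their first-seen order.
    return "".join(dict.fromkeys(kept))

result = shared_chars("peepa", "poopa")
```

With the words from the question, result is "pa".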

Is this regex correct for xsd:anyURI

I am implementing a function (in Python) that checks for conformance of the string to xsd:anyURI.
According to Schema Central, it only makes sense to check for repeated # characters (consecutive or not) and for % followed by something other than two hex digits (0-9, A-F, a-f).
So far, I have something like and it seems to be working:
if re.search(r'(%[^0-9A-Fa-f]+)|(#.*#+)', uri):
The second expression for multiple '#' signs may be faulty.
If you are aiming for an exclusion regex according to the Schema Central parser requirement, you are almost there. The first half, excluding percent signs not followed by two hexadecimal digits is best solved using a negative look-ahead assertion; the second half is fine, though you can ditch the last repeat indicator without affecting your results:
(%(?![0-9A-F]{2})|#.*#)
Compile your regex with case-insensitive matching (the i flag, re.IGNORECASE in Python) and you are good to go.
Recommended reading: the Python Standard Library’s chapter on Regular Expression Operation Syntax.
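In Python, the exclusion regex above could be used along these lines. The names INVALID_URI and looks_like_any_uri are made up for this sketch, and it checks only the two conditions discussed here, not full xsd:anyURI conformance.

```python
import re

# Matches a '%' not followed by two hex digits, or a second '#' in the string.
INVALID_URI = re.compile(r"%(?![0-9A-F]{2})|#.*#", re.IGNORECASE)

def looks_like_any_uri(uri):
    """Return True if none of the excluded constructs occurs in uri."""
    return INVALID_URI.search(uri) is None

ok = looks_like_any_uri("http://example.com/a%20b#frag")
bad_escape = looks_like_any_uri("http://example.com/a%2")
two_hashes = looks_like_any_uri("http://example.com/#a#b")
```

Because the lookahead requires two hex digits, a % at the very end of the string or followed by a single hex digit is also rejected.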
I recently had to do this without a negative lookahead, and the following seems to work:
(%.?[^0-9A-Fa-f]|#.*#)

Regular Expressions Dependant on Previous Matchings

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to transform previously matched groups before using them, or have I just asked yet another question that regular expressions are not meant for?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
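A minimal sketch of that two-step idea in Python: match only the numeric prefix with a regex, then slice the payload and check its length in ordinary code. The function name parse_length_prefixed is made up, and the sketch matches just the prefix (\d+): rather than (\d+):\w+, so that payloads containing a colon, as in the question's example, still work.

```python
import re

def parse_length_prefixed(text):
    """Return the payload of 'Len:Payload...' or None if it does not fit."""
    m = re.match(r"(\d+):", text)
    if m is None:
        return None  # no numeric length prefix
    length = int(m.group(1))
    payload = text[m.end():m.end() + length]
    if len(payload) < length:
        return None  # fewer characters available than announced
    return payload

payload = parse_length_prefixed("5:5:str and some more characters...")
```

With this input, payload is "5:str". A string whose announced length exceeds the remaining text yields None.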
Well this can be done with a regex (if I understand the question):
>>> import re
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'
