I am fairly new to the programming world. I am trying to create a common regex that would match only the given list of strings, nothing more than that.
For example, given the list below:
List = ['starguide', 'snoreguide', 'snoraguide', 'smarguides']
It should create a regex like this - s(((tar|nor(e|a))(guide))|marguides)
I implemented a trie, but could only manage to get s(marguides|nor(aguide|eguide)|targuide)
I want my regex to be shortened (common suffixes tied together). Is there any better way to shorten the regex I am getting from the trie?
To get the desired result, try automata minimization.
For your simple example, a deterministic automaton suffices.
Use github.com/siddharthasahu/automata-from-regex to build a minimal deterministic automaton from the trivial regex (a plain enumeration of the words), then transform the automaton back into a regex. This is easy for acyclic automata (http://www-igm.univ-mlv.fr/~dr/thdr/ www.dcc.fc.up.pt/~nam/publica/extAbsCIAA05.pdf); see also https://cs.stackexchange.com/questions/2016/how-to-convert-finite-automata-to-regular-expressions
In the general case, non-deterministic automata can yield a shorter regex, but minimizing the regex itself is a hard problem: https://cstheory.stackexchange.com/questions/31630/how-can-one-actually-minimize-a-regular-expression
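If you do not want to pull in an automata library, here is a minimal sketch of a weaker version of the same idea: build a trie, then, when emitting the regex for a node, merge child branches whose entire remaining suffix regex is identical. The function names are my own, and full DFA minimization, as suggested above, would share suffixes more aggressively than this does.

import re

def build_trie(words):
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}  # '' marks the end of a word
    return trie

def to_regex(node):
    # Children whose remaining-suffix regexes are identical are alternated
    # together in front of that shared suffix.
    if not node:
        return ''
    groups = {}  # suffix regex -> leading characters that reach it
    for ch, child in node.items():
        groups.setdefault(to_regex(child), []).append(re.escape(ch))
    parts = []
    for suffix, chars in groups.items():
        prefix = chars[0] if len(chars) == 1 else '(' + '|'.join(chars) + ')'
        parts.append(prefix + suffix)
    return parts[0] if len(parts) == 1 else '(' + '|'.join(parts) + ')'

words = ['starguide', 'snoreguide', 'snoraguide', 'smarguides']
print(to_regex(build_trie(words)))
# s(targuide|nor(e|a)guide|marguides) -- (e|a) is shared, but "guide" still
# repeats; proper minimization would also pull that common suffix out.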
Is there any practical difference in power between a 'regular expression' as exemplified by NLTK's docs and a CFG from the same? There definitely should be, since there are context-free languages which are not regular, but I can't find a concrete example where the CFG approach outshines a regular expression.
http://nltk.org/book/ch07.html
From the documentation of RegexpParser:
The patterns of a clause are executed in order. An earlier
pattern may introduce a chunk boundary that prevents a later
pattern from executing. Sometimes an individual pattern will
match on multiple, overlapping extents of the input. As with
regular expression substitution more generally, the chunker will
identify the first match possible, then continue looking for matches
after this one has ended.
The clauses of a grammar are also executed in order. A cascaded
chunk parser is one having more than one clause. The maximum depth
of a parse tree created by this chunk parser is the same as the
number of clauses in the grammar.
That is, each clause/pattern is executed once. Thus you'll run into trouble as soon as you need the output of a later clause to be matched by an earlier one.
A practical example is the way something that could be a complete sentence on its own can be used as a clause in a larger sentence:
The cat purred.
He heard that the cat purred.
She saw that he heard that the cat purred.
As we can read from the documentation above, when you construct a RegexpParser you're setting an arbitrary limit for the "depth" of this sort of sentence. There is no "recursion limit" for context-free grammars.
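As a concrete contrast, a context-free grammar handles that nesting with a single genuinely recursive rule. Here is a small sketch using NLTK's CFG tools (assuming NLTK is installed; the toy grammar is my own):

import nltk

cfg = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V | V SBAR
  SBAR -> C S
  NP -> 'she' | 'he' | 'the' 'cat'
  V -> 'saw' | 'heard' | 'purred'
  C -> 'that'
""")
parser = nltk.ChartParser(cfg)
# The SBAR -> C S rule lets clauses nest to any depth.
for tree in parser.parse("she saw that he heard that the cat purred".split()):
    print(tree)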
The documentation mentions that you can use looping to mitigate this somewhat -- if you run through a suitable grammar two or three or four times, you can get a deeper parse. You can add external logic to loop your grammar many times, or until nothing more can be parsed.
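For illustration, here is the cascaded chunking grammar from the NLTK book (chapter 7), run once and then with loop=2; the exact tree printed may differ a little between NLTK versions:

import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}
  PP: {<IN><NP>}
  VP: {<VB.*><NP|PP|CLAUSE>+$}
  CLAUSE: {<NP><VP>}
"""
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
            ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
            ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# One pass chunks the inner clause but leaves "saw" outside any VP;
# a second pass can use that CLAUSE to build the enclosing VP and CLAUSE.
print(nltk.RegexpParser(grammar).parse(sentence))
print(nltk.RegexpParser(grammar, loop=2).parse(sentence))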
However, as the documentation also notes, the basic approach of this parser is still "greedy". It proceeds like this for a fixed or variable number of steps:
Do as much chunking as you can in one step.
Use the output of the last step as the input of the next step, and repeat.
This is naïve because if an early step makes a mistake, this will ruin the whole parse.
Think of a "garden path sentence":
The horse raced past the barn fell.
And a similar string but an entirely different sentence:
The horse raced past the barn.
It will likely be hard to construct a RegexpParser that will parse both of these sentences, because the approach relies on the initial chunking being correct. Correct initial chunking for one will probably be incorrect initial chunking for the other, yet you can't know "which sentence you're in" until you're at a late level in the parsing logic.
For instance, if "the barn fell" is chunked together early on, the parse will fail.
You can add external logic to backtrack when you end up with a "poor" parse, to see if you can find a better one. However, I think you'll find that at that point, more of the important parts of the parsing algorithm are in your external logic, instead of in RegexpParser.
For testing purposes on a project I'm working on, I need to take a regular expression and randomly generate a string that will FAIL to be matched by it. For instance, if I'm given this regex:
^[abcd]d+
Then I should be able to generate strings such as:
hnbbad
uduebbaef
9f8;djfew
skjcc98332f
...each of which does NOT match the regex, but NOT generate:
addr32
bdfd09usdj
cdddddd-9fdssee
...each of which DOES. In other words, I want something like an anti-Xeger.
Does such a library exist, preferably in Python (if I can understand the theory, I can most likely convert it to Python if need be)? I gave some thought to how I could write this, but given the scope of regular expressions, it seemed that might be a much harder problem than what things like Xeger can tackle. I also looked around for a pre-made library to do this, but either I'm not using the right keywords to search or nobody's had this problem before.
My initial instinct is, no, such a library does not exist because it's not possible. You can't be sure that you can find a valid input for any arbitrary regular expression in a reasonable amount of time.
For example, consider primality, which is not something a regex-driven generator can be expected to reason about. The following regular expression matches any string which is at least 10000 characters long and whose total length is a prime number:
(?!(..+)\1+$).{10000}
I doubt that any library exists that can find a valid input to this regular expression in reasonable time. And this is a very easy example with a simple solution, e.g. 'x' * 10007 will work. It would be possible to come up with other regular expressions that are much harder to find valid inputs for.
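To see that claim in Python syntax (note that the lookahead has to rule out every possible factorization, so each call can take a moment):

import re

prime_length = re.compile(r'(?!(..+)\1+$).{10000}')
print(bool(prime_length.match('x' * 10007)))  # True: 10007 is prime
print(bool(prime_length.match('x' * 10000)))  # False: 10000 = 2 * 5000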
I think the only way you are going to solve this is if you limit yourself to some subset of all possible regular expressions.
But having said that, if you have a magical library that generates text matching any arbitrary regular expression, then all you need to do is construct a regular expression that matches all the strings that don't match your original expression.
Luckily this is possible using a negative lookahead:
^(?![\s\S]*(?:^[abcd]d+))
If you are willing to change the requirements to only allow a limited subset of regular expressions, then you can negate the regular expression using boolean logic. For example, ^[abcd]d+ becomes ^[^abcd]|^[abcd][^d]. It is then possible to find a valid input for this regular expression in reasonable time.
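A quick sanity check of that hand-negated pattern (restricted here to length-3 strings over a small alphabet, which sidesteps the too-short-string cases):

import itertools
import re

original = re.compile(r'^[abcd]d+')
negated = re.compile(r'^[^abcd]|^[abcd][^d]')

# Exactly one of the two patterns should match each candidate string.
for s in map(''.join, itertools.product('abcdxy', repeat=3)):
    assert bool(original.match(s)) != bool(negated.match(s)), s
print("the two patterns partition all length-3 strings over 'abcdxy'")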
I would do a loop, generating random combinations of random length, and test whether each one matches the regexp. Repeat the loop until a non-matching string is found.
Obviously, this would be inefficient. Are you sure you cannot invert the regexp and generate a match on the inverted regexp?
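A minimal sketch of that brute-force loop; the alphabet, length cap and retry cap are arbitrary choices of mine:

import random
import re
import string

def random_non_match(pattern, alphabet=string.printable, max_len=20, max_tries=100000):
    compiled = re.compile(pattern)
    for _ in range(max_tries):
        # Generate a random candidate and keep it only if the regex rejects it.
        candidate = ''.join(random.choices(alphabet, k=random.randint(1, max_len)))
        if not compiled.search(candidate):
            return candidate
    raise RuntimeError("gave up: almost everything seems to match this pattern")

print(random_non_match(r'^[abcd]d+'))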
No, this is impossible. There are an infinite number of regexes that match every string in the known universe. For example:
/^/
/.*/
/[^"\\]*(\\.[^"\\]*)*$/
etc.
This is because all of these regexes can match the empty string, and the empty string is contained in every string!
Can we reduce the infinite number of possibilities by restricting generation to strings from a given character set?
For example, I can define the character set [QWERTYUIOP!##$%^%^&*))_] and require that all the strings I generate randomly be drawn from this set. Would that reduce the infinite nature of this problem?
In fact, I am looking for a utility like this as well, preferably in Python.
I am learning Python, and need to format "From" fields received from IMAP. I tried it using str.find() and str.strip(), and also using regex. With find(), etc. my function runs quite a bit faster than with re (I timed it). So, when is it better to use re? Does anybody have any good links/articles related to that? The Python documentation obviously doesn't cover that...
find only matches an exact sequence of characters, while a regular expression matches a pattern. Naturally, only looking for an exact sequence is faster (even if your regex pattern is also an exact sequence, there is still some overhead involved).
As a consequence of the above, you should use find if you know the exact sequence, and a regular expression (or something else) when you don't. The exact approach you should use really depends on the complexity of the problem you face.
As a side note, the Python re module provides a compile function that allows you to pre-compile a regex you are going to use repeatedly. This can substantially improve speed if you use the same pattern many times.
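A rough way to see both effects, assuming a plain substring search (the absolute numbers will vary by machine and input):

import re
import timeit

text = "From: Alice Example <alice@example.com>" * 100
needle = "alice@example.com"
pattern = re.compile(re.escape(needle))

print(timeit.timeit(lambda: text.find(needle), number=100000))     # str.find
print(timeit.timeit(lambda: pattern.search(text), number=100000))  # pre-compiled regex
print(timeit.timeit(lambda: re.search(re.escape(needle), text), number=100000))  # pattern passed as a string each call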
If you intend to do something complex, you should use re. It scales better than string methods.
String methods are good for simple things that are not worth bothering with regular expressions for.
So it depends on what you are doing, but usually you should use regular expressions, since they are more powerful.
From Perl's documentation:
study takes extra time to study SCALAR ($_ if unspecified) in anticipation of doing
many pattern matches on the string before it is next modified. This may or may not save
time, depending on the nature and number of patterns you are searching and the distribution
of character frequencies in the string to be searched;
I'm trying to speed up some regular expression-driven parsing that I'm doing in Python, and I remembered this trick from Perl. I realize I'll have to benchmark to determine if there is a speedup, but I can't find an equivalent method in Python.
Perl’s study doesn’t really do much anymore. The regex compiler has gotten a whole, whole lot smarter than it was when study was created.
For example, it compiles alternatives into a trie structure with Aho–Corasick prediction.
Run with perl -Mre=debug to see the sorts of cleverness the regex compiler and execution engine apply.
As far as I know there's nothing like this built into Python. But according to the perldoc:
The way study works is this: a linked list of every character in the
string to be searched is made, so we know, for example, where all the
'k' characters are. From each search string, the rarest character is
selected, based on some static frequency tables constructed from some
C programs and English text. Only those places that contain this
"rarest" character are examined.
This doesn't sound very sophisticated, and you could probably hack together something equivalent yourself.
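Something in that spirit might look like this rough sketch for plain substring search; here the character frequencies are counted from the text itself rather than from Perl's static English tables:

from collections import Counter, defaultdict

def study(text):
    # Index every position of every character, like study's per-character lists.
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    return positions

def find_all(text, positions, needle):
    # Only attempt a full comparison at positions of the needle's rarest character.
    counts = Counter(text)
    rarest = min(needle, key=lambda ch: counts.get(ch, 0))
    offset = needle.index(rarest)
    hits = []
    for pos in positions.get(rarest, []):
        start = pos - offset
        if start >= 0 and text.startswith(needle, start):
            hits.append(start)
    return hits

text = "the quick brown fox jumps over the lazy dog"
index = study(text)
print(find_all(text, index, "the"))  # [0, 31]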
esmre is kind of vaguely similar. And as @Frg noted, you'll want to use re.compile if you're reusing a single regex (to avoid re-parsing the regex itself over and over).
Or you could use suffix trees (here's one implementation, or here's a C extension with unicode support) or suffix arrays (implementation).
I am seeking direction and attempting to label this problem:
I am attempting to build a simple inference engine (is there a better name?) in Python which will take a string and -
1 - Create a list of tokens by simply splitting the string on whitespace
2 - Categorise these tokens using regular expressions
3 - Use a higher-level set of rules to make decisions based on the categorisations
Example:
"90001" - one token, matches the zipcode regex, a rule exists for a string containing just a zipcode causes a certain behaviour to occur
"30 + 14" - three tokens, regexs for numerical value and mathematical operators match, a rule exists for a numerical value followed by a mathematical operator followed by another numerical value causes a certain behaviour to occur
I'm struggling with how best to do step #3, the higher level set of rules. I'm sure that some framework must exist. Any ideas? Also, how would you characterise this problem? Rule based system, expert system, inference engine, something else?
Thanks!
I'm very surprised that step #3 is the one giving you trouble...
Assuming you can properly label/categorize each token (and that, prior to categorization, you can find the proper tokens, as there may be many ambiguous cases...), the "Step #3" problem seems like one that could easily be tackled with a context-free grammar, where each of the desired actions (such as ZIP code lookup or mathematical expression calculation) would be a symbol whose production rule is made of the possible token categories. To illustrate this in BNF notation, we could have something like
<SimpleMathOperation> ::= <NumericalValue><Operator><NumericalValue>
Maybe your concern is that when things get complicated, it will become difficult to express the whole requirement in terms of non-conflicting grammar rules. Or maybe your concern is that one could add rules dynamically, hence forcing the grammar "compilation" logic to be integrated with the program? Whatever the concern, I think this third step will be comparatively trivial.
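For what it's worth, here is a bare-bones sketch of the three steps with rules keyed on tuples of categories. The category patterns and rules are made-up examples, and a real system would need more careful ambiguity handling (note that "90001" also looks like a NUMBER, so category order matters here):

import re

CATEGORIES = [
    ("ZIPCODE", re.compile(r"^\d{5}$")),
    ("NUMBER", re.compile(r"^\d+(\.\d+)?$")),
    ("OPERATOR", re.compile(r"^[+\-*/]$")),
]

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def categorize(token):
    for name, pattern in CATEGORIES:
        if pattern.match(token):
            return name
    return "UNKNOWN"

# Step 3: each rule keys on a tuple of categories and acts on the raw tokens.
RULES = {
    ("ZIPCODE",): lambda toks: "look up ZIP code " + toks[0],
    ("NUMBER", "OPERATOR", "NUMBER"):
        lambda toks: OPS[toks[1]](float(toks[0]), float(toks[2])),
}

def handle(text):
    tokens = text.split()                              # step 1: whitespace tokens
    categories = tuple(categorize(t) for t in tokens)  # step 2: categorize
    rule = RULES.get(categories)                       # step 3: dispatch on the category sequence
    return rule(tokens) if rule else "no rule matched"

print(handle("90001"))    # look up ZIP code 90001
print(handle("30 + 14"))  # 44.0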
On the other hand, unless the various categories (and the underlying input text) are such that they can be described with a regular language as well (as you seem to hint in the question), a text parser and classifier (steps #1 and #2...) is typically a less-than-trivial affair.
Some example Python libraries that simplify writing and evaluating grammars:
pijnu
pyparsing
It looks like you are searching for a "grammar inference" (grammar induction) library.