Python Search repeated operators in string - python

For the purpose of writing a calculator, like the python interpeter, I want to check the validity of my expressions.
I want to check a string for repeated mathematical operators, I don't want to catch anything, just to know if they exist, in which case the expression would be invalid.
4++-+4 is valid.
4*-8 is invalid
4-/7 is invalid
4/-4 is valid, mine probably fails here.
minut and plus can repeat themselves, but -* is, for example, invalid.
Much like the way the python interpeter works.
This is what I have, as a Regex, but any simpler solution is welcome, even not regex is great.
[*/^%\-+][*/^%] | [\-+*/^%][*/^%]
Link
Basicially, check if operators */^%-+ are either preceded by or followed by */^% (without minus and plus)

Again a more concise solution would be either a CFG or a stack based approach for infix expressions. However something you could hack and experiment with is the following idea.
construct a product of all operators like so:
from itertools import product as p
all=list(p('*/^%-+',repeat=2))
all=map(lambda x:''.join(x),all)
invalids=[..write them by hand in here(hacky part)]
valids=filter(lambda x:x not in invalids,all)
And now you're left with all valid operations of length 2. You can scan your string with a window of 2, and when you find a pair of operators not belonging in the valids you can declare the expression invalid and move on.
Another way you could go about it is a rule-based one. Construct a dictionary with operators as keys, and for each operator the value would be a list holding all operators that can follow it.
Then your problem becomes one of checking your string at character i with the validity condition being
string[i+1] in dictionary[string[i]]
If you do find a CFG solution and its beautiful, let me know

Related

Wildcard in python dictionary

I am trying create a python dictionary to reference 'WHM1',2,3, 'HISPM1',2,3, etc. and other iterations to create a new column with a specific string for ex. White or Hispanic. Using regex seems like the right path but I am missing something here and refuse to hard code the whole thing in the dictionary.
I have tried several iterations of regex and regexdict :
d = regexdict({'W*':'White', 'H*':'Hispanic'})
eeoc_nac2_All_unpivot_df['Race'] =
eeoc_nac2_All_unpivot_df['EEOC_Code'].map(d)
A new column will be created with 'White' or 'Hispanic' for each row based on what is in an existing column called 'EEOC_Code'.
Your regular expressions are wrong - you appear to be using glob syntax instead of proper regular expressions.
In regex, x* means "zero or more of x" and so both your regexes will trivially match the empty string. You apparently mean
d = regexdict({'^W':'White', '^H':'Hispanic'})
instead, where the regex anchor ^ matches beginning of string.
There are several third-party packages 1, 2, 3 named regexdict so you should probably point out which one you use. I can't tell whether the ^ is necessary here, or whether the regexes need to match the input completely (I have assumed a substring match is sufficient, as is usually the case in regex) because this sort of detail may well differ between implementations.
I'm not sure to have completely understood your problem. However, if all your labels have structure WHM... and HISP..., then you can simply check the first character:
for race in eeoc_nac2_All_unpivot_df['EEOC_Code']:
if race.startswith('W'):
eeoc_nac2_All_unpivot_df['Race'] = "White"
else:
eeoc_nac2_All_unpivot_df['Race'] = "Hispanic"
Note: it only works if what you have inside eeoc_nac2_All_unpivot_df['EEOC_Code'] is iterable.

Python: check if one regex covers another regex [duplicate]

I want to find out if there could ever be conflicts between two known regular expressions, in order to allow the user to construct a list of mutually exclusive regular expressions.
For example, we know that the regular expressions below are quite different but they both match xy50:
'^xy1\d'
'[^\d]\d2$'
Is it possible to determine, using a computer algorithm, if two regular expressions can have such a conflict? How?
There's no halting problem involved here. All you need is to compute if the intersection of ^xy1\d and [^\d]\d2$ in non-empty.
I can't give you an algorithm here, but here are two discussions of a method to generate the intersection without resorting the construction of a DFA:
http://sulzmann.blogspot.com/2008/11/playing-with-regular-expressions.html
And then there's RAGEL
http://www.complang.org/ragel/
which can compute the intersection of regular expressions too.
UPDATE: I just tried out Ragel with OP's regexp. Ragel can generate a "dot" file for graphviz from the resulting state machine, which is terrific. The intersection of the OP's regexp looks like this in Ragel syntax:
('xy1' digit any*) & (any* ^digit digit '2')
and has the following state machine:
While the empty intersection:
('xy1' digit any*) & ('q' any* ^digit digit '2')
looks like this:
So if all else fails, then you can still have Ragel compute the intersection and check if it outputs the empty state machine, by comparing the generated "dot" file.
The problem can be restated as, "do the languages described by two or more regular
expressions have a non-empty intersection"?
If you confine the question to pure regular expressions (no backreferences, lookahead,
lookbehind, or other features that allow recognition of context-free or more complex
languages), the question is at least decidable. Regular languages are closed under
intersection, and there is an algorithm that takes the two regular expressions
as inputs and produces, in finite time, a DFA that recognizes the intersection.
Each regular expression can be converted to a nondeterministic finite automaton,
and then to a deterministic finite automaton. A pair of DFAs can be converted
to a DFA that recognizes the intersection. If there is a path from the
start state to any accepting state of that final DFA, the intersection is non-empty
(a "conflict", using your terminology).
Unfortunately, there is a possibly-exponential blowup when converting the initial NFA
to a DFA, so the problem quickly becomes infeasible in practice as the size of
the input expressions grows.
And if extensions to pure regular expressions are permitted, all bets are off --
such languages are no longer closed under intersection, so this construction won't
work.
Yes I think this is solvable: instead of thinking of regular expressions as matching strings, you can also think of them as generating strings. That is, all the strings they would match.
Let [R] be the set of strings generated by the regular expression R. Then given to regular expressions R and T, we could try to match T against [R]. That is [R] matches T iff there is a string in [R] which matches T.
It should be possible to develop this into an algorithm where [R] is lazily constructed as needed: where normal regular expression matching would try to match the next character from an input string and then advance to the next character in the string, the modified algorithm would check whether the FSM corresponding to the input regular expression can generate a matching character at its current state and then advances to 'all next states' simultaneously.
Another approach would be to leverage Dan Kogai's Perl Regexp::Optimizer instead.
use Regexp::Optimizer;
my $o = Regexp::Optimizer->new->optimize(qr/foobar|fooxar|foozap/);
# $re is now qr/foo(?:[bx]ar|zap)/
.. first, optimize and then discard all redundant patterns.
Maybe Ron Savage's Regexp::Assemble could be even more helpful.
It allows assembling an arbitrary number of regular expressions into a single regular expression that matches all that the individual REs match.* Or a combination of both.
* However, you need to be aware of some differences between Perl and Java or other PCRE-flavors.
If you are looking for a lib in Java you can use Automaton using '&' operator:
RegExp re = new RegExp("(ABC_123.*56.txt)&(ABC_12.*456.*\\.txt)", RegExp.INTERSECTION); // Parse RegExp
Automaton a = re.toAutomaton(); // convert RegExp to automaton
if(a.isEmpty()) { // Test if intersection is empty
System.out.println("Intersection is empty!");
}
else {
// Print the shortest accepted string
System.out.println("Intersection is non-empty, example: " + a.getShortestExample(true));
}
Original Answer:
Detecting if two regexes could possibly match the same string

Matching characters in two Python strings

I am trying to print the shared characters between 2 sets of strings in Python, I am doing this with the hopes of actually finding how to do this using nothing but python regular expressions (I don't know regex so this might be a good time to learn it).
So if first_word = "peepa" and second_word = "poopa" I want the return value to be: "pa"
since in both variables the characters that are shared are p and a. So far I am following the documentation on how to use the re module, but I can't seem to grasp the basic concepts of this.
Any ideas as to how would I solve this problem?
This sounds like a problem where you want to find the intersection of characters between the two strings. The quickest way would be to do this:
>>> set(first_word).intersection(second_word)
set(['a', 'p'])
I don't think regular expressions are the right fit for this problem.
Use sets. Casting a string to a set returns an iterable with unique letters. Then you can retrieve the intersection of the two sets.
match = set(first_word.lower()) & set(second_word.lower())
Using regular expressions
This problem is tailor made for sets. But, you ask for "how to do this using nothing but python regular expressions."
Here is a start:
>>> import re
>>> re.sub('[^peepa]', '', "poopa")
'ppa'
The above uses regular expressions to remove from "poopa" every letter that was not already in "peepa". (As you see it leaves duplicated letters which sets would not do.)
In more detail, re.sub does substitutions based on regular expressions. [peepa] is a regular expression that means any of the letters peepa. The regular expression [^peepa] means anything that is not in peepa. Anything matching this regular expression is replaced with the empty string "", that is, it is removed. What remains are only the common letters.

Regular Expressions Dependant on Previous Matchings

For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change the previously matched groups before using them or I just asked yet another question not meant for RE.
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
Well this can be done with a regex (if I understand the question):
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it lool like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'

Conditional Regular Expressions

I'm using Python and I want to use regular expressions to check if something "is part of an include list" but "is not part of an exclude list".
My include list is represented by a regex, for example:
And.*
Everything which starts with And.
Also the exclude list is represented by a regex, for example:
(?!Andrea)
Everything, but not the string Andrea. The exclude list is obviously a negation.
Using the two examples above, for example, I want to match everything which starts with And except for Andrea.
In the general case I have an includeRegEx and an excludeRegEx. I want to match everything which matchs includeRegEx but not matchs excludeRegEx. Attention: excludeRegEx is still in the negative form (as you can see in the example above), so it should be better to say: if something matches includeRegEx, I check if it also matches excludeRegEx, if it does, the match is satisfied. Is it possible to represent this in a single regular expression?
I think Conditional Regular Expressions could be the solution but I'm not really sure of that.
I'd like to see a working example in Python.
Thank you very much.
Why not put both in one regex?
And(?!rea$).*
Since the lookahead only "looks ahead" without consuming any characters, this works just fine (well, this is the whole point of lookaround, actually).
So, in Python:
if re.match(r"And(?!rea$).*", subject):
# Successful match
# Note that re.match always anchor the match
# to the start of the string.
else:
# Match attempt failed
From the wording of your question, I'm not sure if you're starting with two already finished lists of "match/don't match" pairs. In that case, you could simply combine them automatically by concatenating the regexes. This works just as well but is uglier:
(?!Andrea$)And.*
In general, then:
(?!excludeRegex$)includeRegex

Categories

Resources