Complementing a complex regular expression in Python

I'm trying to learn regular expressions, and despite some great posts on here and links to a regex site, I have a case that I kept hacking at out of sheer stubbornness but that never produced the match I was looking for. To understand it, consider the following code, which lets us pass in a list of strings and a pattern and find out whether the pattern matches all items in the list or matches none of them:
import re

def matchNone(pattern, lst):
    return not any([re.search(pattern, i) for i in lst])

def matchAll(pattern, lst):
    return all([re.search(pattern, i) for i in lst])
To help with debugging, this simple code lets us just add _test to a function call and see the list that would be passed to the any() or all() call that ultimately produces the result:
def matchAll_test(pattern, lst):
    return [re.search(pattern, i) for i in lst]

def matchNone_test(pattern, lst):
    return [re.search(pattern, i) for i in lst]
This pattern and list produces True from matchAll():
wordPattern = "^[cfdrp]an$"
matchAll(wordPattern, ['can', 'fan', 'dan', 'ran', 'pan']) # True
On the surface, this pattern appears to work with matchNone() in our effort to reverse the original pattern:
wordPattern = "^[^cfdrp]an|[cfdrp](^an)$"
matchNone(wordPattern, ['can', 'fan', 'dan', 'ran', 'pan']) # True
It returns True as we hoped it would. But with a true reversal of the pattern, matchNone() would return False for any list of values none of which appear in our original list ['can', 'fan', 'dan', 'ran', 'pan'], regardless of what else we pass in (i.e. the reversed pattern should "match anything except these 5 words").
In testing which changes to the words in this list produce a False, we quickly discover the pattern is not as successful as it first appears. If it were, matchNone() would return False for anything not in the aforementioned list.
These permutations helped uncover the shortcomings of my pattern:
["something unrelated", "p", "xan", "dax", "ccan", "dann", "ra"]
Exploring the above, I tried other permutations as well: taking the original list, using the _test versions of the functions, changing one letter at a time in the original words, and modifying or adding one term at a time, as in the list above.
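To make the failure concrete, here is a self-contained probe (re-declaring the pattern and the _test helper from above):

```python
import re

wordPattern = "^[^cfdrp]an|[cfdrp](^an)$"

def matchNone_test(pattern, lst):
    return [re.search(pattern, i) for i in lst]

results = matchNone_test(wordPattern, ["xan", "dax", "something unrelated"])
# "xan" is caught by the first alternative, but "dax" and
# "something unrelated" match neither alternative, so matchNone()
# wrongly reports True for a list containing them
assert results[0] is not None
assert results[1] is None and results[2] is None
```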
If anyone can find the true inverse of my original pattern, I would love to see it so I can learn from it.
To help with your investigation:
This pattern also works with matchAll() for all words, but I could not seem to create its inverse either: "^(can|fan|dan|ran|pan)$"
Thanks for any time you expend on this. I'm hoping to find a regEx guru on here who spots the mistake and can propose the right solution.

I hope I understood your question. This is the solution that I found:
^(?:[^cfdrp].*|[cfdrp][^a].*|[cfdrp]a[^n].*|.{4,}|.{0,2})$
[^cfdrp].*: if the text does not start with c, f, d, r, or p, match.
[cfdrp][^a].*: the text starts with c, f, d, r, or p: match if the second character is not an a.
[cfdrp]a[^n].*: the text starts with [cfdrp]a: match if the third character is not an n.
.{4,}: match anything with more than 3 characters
.{0,2}: match anything with 0, 1, or 2 characters
It is equivalent to:
^(?:[^cfdrp].*|.[^a].*|..[^n].*|.{4,}|.{0,2})$
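A quick check of this pattern against both the original list and the probe words from the question (a sketch):

```python
import re

pattern = r"^(?:[^cfdrp].*|.[^a].*|..[^n].*|.{4,}|.{0,2})$"

words = ['can', 'fan', 'dan', 'ran', 'pan']
others = ["something unrelated", "p", "xan", "dax", "ccan", "dann", "ra"]

# the complement matches none of the original five words...
assert not any(re.search(pattern, w) for w in words)
# ...but matches every probe that tripped up the earlier attempt
assert all(re.search(pattern, w) for w in others)
```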

What you are looking for is the complement. Computing it for an arbitrary regular expression is a difficult problem, and there is no built-in for complementing a regex.
There is an open challenge on PPCG to do this. One comment explains the difficulty:
It's possible, but crazy tedious. You need to parse the regexp into an NFA (eg Thompson's algorithm), convert the NFA to a DFA (powerset construction), complete the DFA, find the complement, then convert the DFA to a RE (eg Brzozowski's method). ie slightly harder than writing a complete RE engine!
There are Python libraries out there that will convert a regular expression to an NFA, the NFA to a DFA, complement the DFA, and convert it back. (The original formal notion of a "regular language" allows only literals, alternation ("or"), and the Kleene star -- much simpler than the kind of regex you're thinking of.) It's pretty complicated.
Here's a related SO question: Finding the complement of a DFA?
In summary, it's far simpler to evaluate the original regular expression and negate the Boolean result.
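For example, a minimal sketch using the question's whole-word pattern (the function name is mine):

```python
import re

five_words = re.compile(r"^(can|fan|dan|ran|pan)$")

def matches_anything_but_the_five(text):
    # Boolean negation of the original match, instead of a complement regex
    return five_words.search(text) is None

assert not matches_anything_but_the_five("can")
assert matches_anything_but_the_five("something unrelated")
```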

Related

Trying to sort two combined strings alphabetically without duplicates

Challenge: Take two strings s1 and s2 containing only letters from a to z. Return a new sorted string, the longest possible, containing distinct letters, each taken only once, coming from s1 or s2.
# Examples
a = "xyaabbbccccdefww"
b = "xxxxyyyyabklmopq"
assert longest(a, b) == "abcdefklmopqwxy"
a = "abcdefghijklmnopqrstuvwxyz"
assert longest(a, a) == "abcdefghijklmnopqrstuvwxyz"
So I am just starting to learn, but so far I have this:
def longest(a1, a2):
    for letter in max(a1, a2):
        return ''.join(sorted(a1 + a2))
which returns all the letters but I am trying to filter out the duplicates.
This is my first time on stack overflow so please forgive anything I did wrong. I am trying to figure all this out.
I also do not know how to indent in the code section if anyone could help with that.
You have two options here: the first is the answer you want, and the second is an alternative method.
To filter out duplicates, you can start with an empty string and then go through the returned string. For each character, if it is already in the output, move on to the next; otherwise add it:
out = ""
for i in returned_string:
    if i not in out:
        out += i
return out
This would be embedded inside a function.
The second option is to use Python's sets. For what you want to do, you can think of them as lists with no duplicate elements. You could simplify your function to:
def longest(a: str, b: str):
    return "".join(set(a).union(set(b)))
This makes a set of all the characters in a and another of all the characters in b. It then joins them together (a union) to get another set. You can then join all the characters in this final set to get your string. Hope this helps.
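One caveat: sets are unordered, so joining one directly will not generally produce the sorted result the challenge expects. Adding a sorted() call fixes that (a sketch):

```python
def longest(a: str, b: str) -> str:
    # union of the two character sets, sorted into a string
    return "".join(sorted(set(a) | set(b)))

# the challenge's own examples
assert longest("xyaabbbccccdefww", "xxxxyyyyabklmopq") == "abcdefklmopqwxy"
assert longest("abcdefghijklmnopqrstuvwxyz",
               "abcdefghijklmnopqrstuvwxyz") == "abcdefghijklmnopqrstuvwxyz"
```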

Generating expressions from permutations of variables and operators

So, I've decided that it's time to learn regular expressions. Thus, I set out to solve various problems, and after a bit of smooth sailing, I seem to have hit a wall and need help getting unstuck.
The task:
Given a list of characters and logical operators, find all possible combinations of these characters and operators that are not gibberish.
For example, given:
my_list = ['p', 'q', '&', '|']
the output would be:
answers = ['p', 'q', 'p&q', 'p|q'...]
However, strings like 'pq&' and 'p&|' are gibberish and therefore not allowed.
Naturally, the more elements are added to my_list, the more complicated the process becomes.
My current approach:
(I'd like to learn how to solve it with regex, though I'm also curious whether a better way exists... again, my focus is regex.)
step 1:
find all permutations of the elements such that each permutation is 3 <= x <= len(my_list) long.
step 2:
Loop over the list, and if a regex match is found, pull that element out and put it in the answers list.
(I'm not married to this 2-step approach, it is just what seemed most logical to me)
My current code, minus the regex:
import re
from itertools import permutations

my_list = ['p', 'q', '~r', 'r', '|', '&']
foo = []
answers = []
count = 3
while count < 7:
    for i in permutations(my_list, count):
        i = ''.join(k for k in i)
        foo.append(i)
    count += 1

for i in foo:
    if re.match(r'insert_regex', i):
        answers.append(i)

print(answers)
Now, I have tried a vast slew of different regexes to get this to work (too many to list them all here), but here are some of the main ones:
A straightforward approach: find all the cases that have two letters side by side or two operators side by side and, instead of appending to answers, remove them from foo. This is the regex I tried:
r'(\w\w)[&\|]{2,}'
and it did not even come close.
I then decided to try and find the strings that I wanted, as opposed to the ones I did not want.
First I tested:
r'^[~\w]'
to make sure I could get the strings whose first character were a letter or a negation. This worked. I was happy.
I then tried:
r'^[~\w][&\|]'
to try and get the next logical operator; however, it only picked up strings whose first character was a letter, and ignored all of the strings whose first character was a negation.
I then tried a conditional so that if the first character was a negation, the next character would be a letter, otherwise it would be an operator:
r'^(?(~)\w|[&\|])'
but this threw me "error: bad character in group name".
I then tried to resolve this error by:
r'^(?:(~)\w|[&\|])'
But that returned only strings that started with '~' or an operator.
I then tried a slew of other things related to conditionals and groupings (2 days' worth, actually), but I can't seem to find a solution. Part of the problem is that I don't know enough about regex to know where to look, so I have been wandering around the internet aimlessly.
I have run through a lot of tutorials and explanation pages, but they are all rather opaque and don't piece things together in a way that is conducive to understanding; they just throw out code for you to copy and paste or mimic.
Any insights you have would be much appreciated, and as much as I would love an answer to the problem, if possible, an ELI5 explanation of what the solution does would be excellent for my own progress.
In a bitter twist of irony, it turns out I had the solution written down all along (I documented every regex I tried), but it originally failed because I was removing strings from the original list rather than from the copy.
If anyone is looking for a solution to the problem, the following code worked on all of my test cases (can't promise beyond that, however).
import re
from itertools import permutations
import copy

a = ['p', 'q', 'r', '~r', '|', '&']
foo = []
count = 3
while count < len(a) + 1:
    for j in permutations(a, count):
        j = ''.join(k for k in j)
        foo.append(j)
    count += 1

foo_copy = copy.copy(foo)
for i in foo:
    if re.search(r'(^[&\|])|(\w\w)|(\w~)|([&\|][&\|])|([&\|]$)', i):
        foo_copy.remove(i)

print(foo_copy)
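Equivalently, rather than deleting the bad permutations, one can keep only the well-formed ones with a positive pattern (a sketch; it assumes single-letter variables, optionally prefixed with ~, and the `valid` name is mine):

```python
import re

# a valid expression is a variable followed by zero or more
# (operator, variable) pairs
valid = re.compile(r"^~?[a-z](?:[&|]~?[a-z])*$")

assert valid.match("p")
assert valid.match("p&q")
assert valid.match("~r|q&p")
assert not valid.match("pq&")   # two variables side by side
assert not valid.match("p&|")   # two operators side by side
```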
You have a list of variables (characters), binary operators, and/or variables prefixed with a unary operator (like ~). The last case can be dealt with just like a variable.
As binary operators need a variable at either side, we can conclude that a valid expression is an alternation of variables and operators, starting and ending with a variable.
So, you could first divide the input list into two lists based on whether an item is a variable or an operator. Then you could increase the size of the output you will generate, and for each size, get the permutations of both lists and zip these in order to build a valid expression each time. This way you don't need a regular expression to verify the validity.
Here is the suggested function:
from itertools import permutations, zip_longest, chain

def expressions(my_list):
    answers = []
    variables = [x for x in my_list if x[-1].isalpha()]
    operators = [x for x in my_list if not x[-1].isalpha()]
    max_var_count = min(len(operators) + 1, len(variables))
    for var_count in range(1, max_var_count + 1):
        for vars in permutations(variables, var_count):
            for ops in permutations(operators, var_count - 1):
                answers.append(''.join(list(chain.from_iterable(zip_longest(vars, ops)))[:-1]))
    return answers

print(expressions(['p', 'q', '~r', 'r', '|', '&']))

Check logical concatenation of regular expressions

I have the following problem in Python, which I hope you can assist with.
The input is two regular expressions, and I have to check whether their concatenation can ever match any value. For example, if one says to take strings with length greater than 10 and the other at most 5, then no value can ever pass both expressions.
Is there something in python to solve this issue?
Thanks,
Max.
Getting this brute force algorithm from here:
Generating a list of values a regex COULD match in Python
def all_matching_strings(alphabet, max_length, regex1, regex2):
    """Find all strings over 'alphabet' of length up to 'max_length' that match both regexes"""
    if max_length == 0:
        return
    L = len(alphabet)
    for N in range(1, max_length + 1):
        indices = [0] * N
        for z in range(L ** N):
            r = ''.join(alphabet[i] for i in indices)
            if regex1.match(r) and regex2.match(r):
                yield r
            # advance the indices like an odometer
            i = 0
            indices[i] += 1
            while (i < N) and (indices[i] == L):
                indices[i] = 0
                i += 1
                if i < N:
                    indices[i] += 1
    return
Example usage for your situation (two regexes); you'd also need to add all possible symbols, whitespace, etc. to the alphabet:
alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890'
import re
regex1 = re.compile(regex1_str)
regex2 = re.compile(regex2_str)
for r in all_matching_strings(alphabet, 5, regex1, regex2):
    print(r)
That said, the runtime on this is super-crazy, and you'll want to do whatever you can to speed it up. One suggestion on the answer I swiped the algorithm from was to filter the alphabet down to only the characters that are "possible" for the regex. So if you scan your regex and only see [1-3] and [a-eA-E], with no ".", "\w", "\s", etc., then you can reduce the alphabet to 13 characters. There are lots of other little tricks you could implement as well.
Is there something in python to solve this issue?
There is nothing in Python that solves this directly.
That said, you can simulate a logical-and operation for two regexes by using lookahead assertions. There is a good explanation with examples at Regular Expressions: Is there an AND operator?
This will combine the regexes but won't show directly whether some string exists that satisfies the combined regex.
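For instance, the two length constraints from the question can be ANDed by turning each into a lookahead. As noted, the combined pattern is unsatisfiable, but nothing about it says so directly; only testing candidate strings reveals it (a sketch):

```python
import re

# "longer than 10 characters" AND "at most 5 characters",
# each constraint expressed as a lookahead anchored at the start
impossible = re.compile(r"^(?=.{11,}$)(?=.{0,5}$)")
assert impossible.search("x" * 12) is None
assert impossible.search("xxx") is None

# a satisfiable pair for contrast: "contains a digit" AND "at least 3 chars"
possible = re.compile(r"^(?=.*\d)(?=.{3,}$)")
assert possible.search("ab1") is not None
```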
I doubt that something like this is implemented, or even that there is a way to compute it efficiently.
One approximate approach that comes to mind, which would detect the most obvious conflicts, is to generate a random string conforming to each of the regexes and then check whether the concatenation of the regexes matches the concatenation of the generated strings.
Something like:
import re
import rstr

s1 = rstr.xeger(r1)
s2 = rstr.xeger(r2)
print(re.match(r1 + r2, s1 + s2))
Although I can't really think of a way for this to fail: in your example, where r1 matches strings longer than 10 characters and r2 matches strings shorter than 5, the concatenation of the two would yield strings whose first part is longer than 10 characters with a tail shorter than 5.

Storing and evaluating nested string elements

Given the exampleString = "[9+[7*3+[1+2]]-5]"
How does one extract and store elements enclosed by [] brackets, and then evaluate them in order?
1+2 --+
      |
7*3+3 --+
        |
9+24-5
Does one have to create some kind of nested list? Sorry for this somewhat broad question and bad English.
I see, this question really is too broad... Is there a way to create a nested list from that string? Or maybe I should simply do a regex search for every element and evaluate each? The nested-list option (if it exists) would IMO be a "cleaner" approach than looping over the same string and evaluating until there are no [] brackets left.
Have a look at the pyparsing module and some of the examples they have (the four-function calculator is what you want, and more).
PS. In case the size of that code worries you, look again: most of it can be stripped. The lower half is just tests, and the upper part can be stripped of things like support for e/pi constants, trigonometric functions, etc. I'm sure you can cut it down to 10 lines for what you need.
A good starting point is the shunting-yard algorithm.
There are multiple Python implementations available online; here is one.
The algorithm can be used to translate infix notation into a variety of representations. If you are not constrained with regards to which representation you can use, I'd recommend considering Reverse Polish notation as it's easy to work with.
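Once an expression is in RPN, evaluating it is a short stack loop. A minimal sketch (the eval_rpn name and the hand-translated token string are mine, not from the linked implementation):

```python
import operator

# binary operators mapped to functions
OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def eval_rpn(tokens):
    """Evaluate a list of RPN tokens with a stack."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[tok](a, b))
        else:
            stack.append(int(tok))
    return stack[0]

# the question's "[9+[7*3+[1+2]]-5]" hand-translated to RPN
assert eval_rpn("9 7 3 * 1 2 + + + 5 -".split()) == 28
```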
Here is a regex solution:
import re

def evaluatesimple(s):
    return eval(s)

def evaluate(s):
    while True:
        simplesums = re.findall(r"\[([^\]\[]*)\]", s)
        if len(simplesums) == 0:
            break
        replacements = [('[%s]' % item, str(evaluatesimple(item))) for item in simplesums]
        for r in replacements:
            s = s.replace(*r)
    return s

print(evaluate("[9+[7*3+[1+2]]-5]"))
But if you want to go the whole hog and build a tree to evaluate later, you can use the same technique but store the expressions and sub expressions in a dict:
def tokengen():
    for c in 'abcdefghijklmnopqrstuvwxyz':
        yield c

def makeexpressiontree(s):
    d = dict()
    tokens = tokengen()
    while True:
        simplesums = re.findall(r"\[([^\]\[]*)\]", s)
        if len(simplesums) == 0:
            break
        for item in simplesums:
            t = next(tokens)
            d[t] = item
            s = s.replace("[%s]" % item, t)
    return d

def evaltree(d):
    """A simple dumb way to show in principle how to evaluate the tree"""
    result = 0
    ev = {}
    for i, t in zip(range(len(d)), tokengen()):
        ev[t] = eval(d[t], ev)
        result = ev[t]
    return result

s = "[9+[7*3+[1+2]]-5]"
print(evaluate(s))
tree = makeexpressiontree(s)
print(tree)
print(evaltree(tree))
(Edited to extend my answer)

What's the difference between findall() and using a for loop with an iterator to find pattern matches

I'm using this to calculate the number of sentences in a text:
import codecs
import re

fileObj = codecs.open("someText.txt", "r", "utf-8")
shortText = fileObj.read()
pat = '[.]'
nSentences = 0
for match in re.finditer(pat, shortText, re.UNICODE):
    nSentences = nSentences + 1
Someone told me this is better:
result = re.findall(pat, shortText)
nSentences = len(result)
Is there a difference? Don't they do the same thing?
The second is probably going to be a little faster, since the iteration is done entirely in C. How much faster? About 15% in my tests (matching 'a' in 'a' * 16), though that percentage will get smaller as the regex gets more complex and takes a larger proportion of the running time. But it will use more memory since it's actually going to create a list for you. Assuming you don't have a ton of matches, though, not too much more memory.
As to which I'd prefer, I do kind of like the second's conciseness, especially when combined into a single statement:
nSentences = len(re.findall(pat, shortText))
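If you want to measure the difference yourself, timeit makes the comparison easy (a sketch; the exact speedup will vary with the pattern and input):

```python
import re
import timeit

text = "a" * 16
pat = "a"

def count_finditer():
    # explicit Python-level loop over match objects
    n = 0
    for _ in re.finditer(pat, text):
        n += 1
    return n

def count_findall():
    # iteration happens inside findall, at the cost of building a list
    return len(re.findall(pat, text))

# both approaches agree on the count
assert count_finditer() == count_findall() == 16

t_finditer = timeit.timeit(count_finditer, number=1000)
t_findall = timeit.timeit(count_findall, number=1000)
```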
The finditer function returns an iterator of match objects.
The findall function returns a list of matching strings.
The advantage of iterators over lists is that they are memory friendly (producing values only when needed).
The advantage of match objects over strings is that they are versatile (giving you groups, groupdict, start, end, span, etc.).
The choice of which is best depends on your needs. If you need a list of matching strings, then findall is great. If you need match object methods or if you need to conserve memory, then finditer is the way to go.
Hope this helps. Good luck with your project :-)
They do much the same thing. Your choice should be dictated by whether your other usage suggests an iterator or list would be better.
One difference between finditer and findall is that the former returns regex match objects, whereas the latter returns a list of strings, or, if one or more groups are present in the pattern, a list of groups (a list of tuples if the pattern has more than one group).
Other than that it all depends on your usage.
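The group behaviour described above is easy to see with a small made-up string:

```python
import re

s = "a1 b2"
# no groups: findall returns whole-match strings
assert re.findall(r"\w\d", s) == ["a1", "b2"]
# one group: findall returns just that group
assert re.findall(r"(\w)\d", s) == ["a", "b"]
# two groups: findall returns tuples of groups
assert re.findall(r"(\w)(\d)", s) == [("a", "1"), ("b", "2")]
# finditer always yields match objects, regardless of groups
assert [m.group() for m in re.finditer(r"(\w)(\d)", s)] == ["a1", "b2"]
```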
There are two main differences:
1) findall() returns a list, while finditer() returns an iterator. This can be a huge difference if you're going to handle big strings (like files).
2) findall() returns str objects, while finditer() returns Match objects. I think that's the major difference. So, depending on what information you need from the matches, you can choose between one or the other. Here's a small example:
We want to get all the numbers from a string:
>>> s = 'I have 921 apples, 53 oranges, 3 bananas and 1 lemon.'
# if you just need to find them, better use findall():
>>> re.findall(r'\d+', s)
['921', '53', '3', '1']
# but, if you need more than just that, use finditer():
>>> [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', s)]
[('921', 7, 10), ('53', 19, 21), ('3', 31, 32), ('1', 45, 46)]
