Python Regex to start from inner match

I am working on a regex in Python that converts a mathematical expression to the power format used by SymPy, pow(x,y). For example, it takes 2^3 and returns pow(2,3).
My current pattern is:
# pattern
pattern = re.compile(r'(.+?)\^(.*)')
To find the matches for nested expressions, I use a for loop over the number of ^ (hat) characters and apply re.sub each time to generate the power format:
# length of the for loop
num_hat = len(re.findall(r'\^', exp))
# for nested pow
for i in range(num_hat):
    exp = re.sub(pattern, r'pow(\1,\2)', exp)
return exp
This method does not work for nested ^ expressions such as a^b^c^d or sin(x^2)^3, because the closing parentheses end up in the wrong positions.
For a^b^c^d it returns pow(pow(pow(a,b,c,d)))
For sin(x^2)^3 it returns pow(pow(sin(x,2),3))
Is there any way to overcome this issue? I tried a negative lookahead, but it still does not work.

There is no nice way of saying this, but you have an extreme case of an XY problem. What you apparently want is to convert some mathematical expression to SymPy. Writing your own regular expression seems like a very tedious, error-prone, and possibly impossible approach to this.
Being a vast symbolic library, SymPy comes with an entire parsing submodule, which allows you to tweak the parsing in great detail; in particular, convert_xor governs what happens to the ^ character. However, it appears you do not need to do anything, since converting ^ to exponentiation is the default. You can therefore simply do:
from sympy import sympify
print( sympify("a^b^c^d") )    # a**(b**(c**d))
print( sympify("sin(x^2)^3") ) # sin(x**2)**3
Note that ** is equivalent to pow, so I am not sure why you are insisting on the latter. If you need output that should work in yet another programming language, that's what the printing module is for, and it's comparatively easy to adapt it yourself. Another thing that may help you obtain the desired form is sympy.srepr.
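For instance, sympy.srepr already produces a fully nested, function-style form (a small sketch; the exact srepr output may differ slightly between SymPy versions):
from sympy import sympify, srepr

expr = sympify("sin(x^2)^3")
print(expr)         # sin(x**2)**3
print(srepr(expr))  # Pow(sin(Pow(Symbol('x'), Integer(2))), Integer(3))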

Why don't you use recursion for this? It might not be the best approach, but it will work for your use case as long as the nesting is not too deep.
A small demonstration:
import re

# Your approach
def func(exp):
    # pattern
    pattern = re.compile(r'(.+?)\^(.*)')
    # length of the for loop
    num_hat = len(re.findall(r'\^', exp))
    # for nested pow
    for i in range(num_hat):  # num_hat-1 since we created the first match already
        exp = re.sub(pattern, r'pow(\1,\2)', exp)
    return exp
# With recursion
def refined_func(exp):
    # pattern
    pattern = r'(.+?)\^(.*)'
    # length of the for loop
    num_hat = len(re.findall(r'\^', exp))
    # base condition
    if num_hat == 1:
        search = re.search(pattern, exp)
        group1 = search.group(1)
        group2 = search.group(2)
        exp = "pow(" + group1 + ", " + group2 + ")"
        return exp
    # for nested pow
    for i in range(num_hat):  # num_hat-1 since we created the first match already
        search = re.search(pattern, exp)
        if not search:  # the point where there are no hats in the exp
            break
        group1 = search.group(1)
        group2 = search.group(2)
        exp = "pow(" + group1 + ", " + refined_func(group2) + ")"
    return exp
if __name__ == '__main__':
    print(func("a^b^c^d"))
    print("###############")
    print(refined_func("a^b^c^d"))
The output of the above program is:
pow(pow(pow(a,b,c,d)))
###############
pow(a, pow(b, pow(c, d)))
Problem in your approach:
Initially you start off with the following expression,
a^b^c^d
With your regex, the above expression is split into two parts -> part1: a and part2: b^c^d. From these, you generate pow(a,b^c^d). So the next expression that you work with is:
pow(a,b^c^d)
Now your regex gives part1 as pow(a,b and part2 as c^d). Since the replacement is constructed as pow(part1, part2), you end up with pow( pow(a,b , c^d) ), which is not what you intended.
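A quick trace of the repeated re.sub calls makes this drift visible (this simply re-runs the original pattern so the intermediate strings can be inspected):
import re

# Re-running the original substitution step by step shows how the
# closing parentheses end up in the wrong places.
pattern = re.compile(r'(.+?)\^(.*)')
exp = "a^b^c^d"
for _ in range(3):
    exp = re.sub(pattern, r'pow(\1,\2)', exp)
    print(exp)
# pow(a,b^c^d)
# pow(pow(a,b,c^d))
# pow(pow(pow(a,b,c,d)))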

I took a stab at your examples, but I'll still advise you to find a math parser (from my comment), as this regex is getting very complex.
import re

pattern = re.compile(r"(\w+(\(.+\))?)\^(\w+(\(.+\))?)([^^]*)$")

def convert(string):
    while "^" in string:
        string = pattern.sub(r"pow(\1, \3)\5", string, 1)
    return string

print(convert("a^b^c^d"))    # pow(a, pow(b, pow(c, d)))
print(convert("sin(x^2)^3")) # pow(sin(pow(x, 2)), 3)
Explanation: loop while there is a ^ and replace the rightmost match (the $ anchors the match to the end of the string).

Related

Check logical concatenation of regular expressions

I have the following problem in python, which I hope you can assist with.
The input is 2 regular expressions, and I have to check if their concatenation can have values.
For example, if one regex says take strings with length greater than 10 and the other says at most 5, then no value can ever pass both expressions.
Is there something in python to solve this issue?
Thanks,
Max.
Getting this brute force algorithm from here:
Generating a list of values a regex COULD match in Python
def all_matching_strings(alphabet, max_length, regex1, regex2):
    """Find all strings over 'alphabet' of length up to 'max_length' that match both regexes"""
    if max_length == 0: return
    L = len(alphabet)
    for N in range(1, max_length+1):
        indices = [0]*N
        for z in xrange(L**N):
            r = ''.join(alphabet[i] for i in indices)
            if regex1.match(r) and regex2.match(r):
                yield(r)
            i = 0
            indices[i] += 1
            while (i<N) and (indices[i]==L):
                indices[i] = 0
                i += 1
                if i<N: indices[i] += 1
    return
example usage, for your situation (two regexes)... you'd need to add all possible symbols/whitespaces/etc to that alphabet also...:
alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890'
import re
regex1 = re.compile(regex1_str)
regex2 = re.compile(regex2_str)
for r in all_matching_strings(alphabet, 5, regex1, regex2):
    print r
That said, the runtime on this is super-crazy and you'll want to do whatever you can to speed it up. One suggestion on the answer I swiped the algorithm from was to filter the alphabet to only have characters that are "possible" for the regex. So if you scan your regex and you only see [1-3] and [a-eA-E], with no ".", "\w", "\s", etc., then you can reduce the alphabet to 13 characters. Lots of other little tricks you could implement as well.
Is there something in python to solve this issue?
There is nothing in Python that solves this directly.
That said, you can simulate a logical-and operation for two regexes by using lookahead assertions. There is a good explanation with examples at Regular Expressions: Is there an AND operator?
This will combine the regexes but won't show directly whether some string exists that satisfies the combined regex.
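For illustration, here is a minimal sketch of that lookahead trick; the two patterns are made up for the example, with the first wrapped as a lookahead so that both conditions must hold:
import re

r1 = r"(?=.*\d)"        # hypothetical: the string must contain a digit
r2 = r"[a-z0-9]{3,5}$"  # hypothetical: 3-5 lowercase letters/digits in total
combined = re.compile(r1 + r2)

print(bool(combined.match("ab1")))    # True: satisfies both
print(bool(combined.match("abcde")))  # False: no digit
print(bool(combined.match("abc123"))) # False: too long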
I highly doubt that something like this is implemented anywhere, or even that it can be computed efficiently.
One approximate way that comes to mind, which would detect only the most obvious conflicts, is to generate a random string conforming to each of the regexes and then check whether the concatenation of the regexes matches the concatenation of the generated strings.
Something like:
import re, rstr
s1 = rstr.xeger(r1)
s2 = rstr.xeger(r2)
print re.match(r1 + r2, s1 + s2)
I can't really think of a case where this would fail, though. For your example, where r1 matches strings with more than 10 chars and r2 matches strings shorter than 5 chars, the concatenation of the two would yield strings whose first part is longer than 10 and whose tail is shorter than 5.

repetition "{}" on the fly for regex

I am trying to write a function that compares a value with a regex to see if it matches. The problem is that I have quite a few regexes that are similar, with just one difference: the range {}. For example, ^[a-z]{0,500}$ and ^[a-z]{0,200}$ differ only in the repetition range. I am trying to handle all of these regexes with one function. So far I have written the function below, but I think there must be a much better option. It should also cope when no max or min is specified.
def check(value, min=None, max=None):
    regex = "^[a-z]"+"{"+min+","+max+"}$"
    r = re.compile(regex)
    if r.match(value):
        return True
    else:
        return False
Use min="0" and max="" instead (that way, they will construct valid ranges if left unspecified).
Also, don't do if condition: return True etc. - just return the match object - it will evaluate to True if there is a match (and you can do stuff with it later if you want to).
Further, no need to compile the regex if you're only using it once.
def check(value, min="0", max=""):
    regex = "[a-z]{" + min + "," + max + "}$"
    return re.match(regex, value)
Also, I've removed the ^ because it's implicit in re.match().
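A quick usage sketch (repeating the function so the snippet runs on its own; the inputs are just illustrative):
import re

def check(value, min="0", max=""):
    regex = "[a-z]{" + min + "," + max + "}$"
    return re.match(regex, value)

print(bool(check("hello", "1", "10")))  # True
print(bool(check("HELLO", "1", "10")))  # False: uppercase is not in [a-z]
print(bool(check("abc")))               # True: the defaults build "[a-z]{0,}$"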

Using Python, How can I evaluate an expression in the form of prefix notation compacted?

I am currently working on the Python problem sets on a website called singpath. The question is:
Prefix Evaluation
Create a function that evaluates the arithmetic expression in the form of prefix notation without spaces or syntax errors. The expression is given as a string, all the numbers in the expression are integer 0~9, and the operators are +(addition), -(subtraction), *(multiplication), /(division), %(modulo), which operate just the same as those in Python.
Prefix notation, also known as Polish notation, is a form of notation for logic, arithmetic, and algebra. It places operators to the left of their operands. If the arity of the operators is fixed, the result is a syntax lacking parentheses or other brackets that can still be parsed without ambiguity.
This seems simple enough, but the input string is condensed with no spaces to split the data on. How could I separate the data from the string without importing modules? Furthermore, how could I use the extracted data to solve the given equation? Also, please keep in mind that Singpath solutions must be ONE function and cannot use anything that isn't in the standard Python library. This also includes functions declared within the solution :S
Examples:
>>> eval_prefix("+34")
7
>>> eval_prefix("*−567")
-7
>>> eval_prefix("-*33+2+11")
5
>>> eval_prefix("-+5*+1243")
14
>>> eval_prefix("*+35-72")
40
>>> eval_prefix("%3/52")
1
See my point no spaces D:
OK, not as snazzy as alex jordan's lambda/reduce solution, but it doesn't choke on garbage input. It's sort of a recursive descent parser meets bubble sort abomination (I'm thinking it could be a little more efficient when it finds a solvable portion than just jumping back to the start ;)
import operator

def eval_prefix(expr):
    d = {'+': operator.add,
         '-': operator.sub,
         '*': operator.mul,
         '/': operator.div,  # for 3.x change this to operator.truediv
         '%': operator.mod}
    for n in range(10):
        d[str(n)] = n
    e = list(d.get(e, None) for e in expr)
    i = 0
    while i + 3 <= len(e):
        o, l, r = e[i:i+3]
        if type(o) == type(operator.add) and type(l) == type(r) == type(0):
            e[i:i+3] = [o(l, r)]
            i = 0
        else:
            i += 1
    if len(e) != 1:
        print 'Error in expression:', expr
        return 0
    else:
        return e[0]

def test(s, v):
    r = eval_prefix(s)
    print s, '==', v, r, r == v

test("+34", 7)
test("*-567", -7)
test("-*33+2+11", 5)
test("-+5*+1243", 14)
test("*+35-72", 40)
test("%3/52", 1)
test("****", 0)
test("-5bob", 10)
I think the crucial bit here is "all the numbers in the expression are integer 0~9". All numbers are single digit. You don't need spaces to find out where one number ends and the next one starts. You can access the numbers directly by their string index, as lckknght said.
To convert the characters in the string into integers for calculation, use ord(ch) - 48 (because "0" has the ASCII code 48). So, to get the number stored in position 5 of input, use ord(input[5]) - 48.
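A tiny illustration of that conversion (int() is shown only as the more idiomatic equivalent):
s = "*-567"
print(ord(s[2]) - 48)  # 5, using the ASCII trick described above
print(int(s[2]))       # 5, the built-in conversion does the same thing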
To evaluate nested expressions, you can call your function recursively. The crucial assumption here is that there are always exactly two operands to an operator.
Your "one function" limitation isn't as bad as you think. Python allows defining functions inside functions. In the end, a function definition is nothing more than assigning the function to a (usually new) variable. In this case, I think you will want to use recursion. While that can also be done without an extra function, you may find it easier to define an extra recursion function for it. This is no problem for your limits:
def eval_prefix (data):
    def handle_operator (operator, rest):
        # You fill this in.
    # and this, too.
That should be enough of a hint (if you want to use a recursive approach).
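For completeness, here is one possible recursive sketch along those lines; this is my own illustration, not the solution the hint points at, and it leaves / as plain Python division:
def eval_prefix(data):
    ops = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b,
           '/': lambda a, b: a / b,   # true division in Python 3
           '%': lambda a, b: a % b}

    def parse(i):
        # Return (value, next index) for the sub-expression starting at index i.
        ch = data[i]
        if ch.isdigit():
            return int(ch), i + 1
        left, i = parse(i + 1)
        right, i = parse(i)
        return ops[ch](left, right), i

    return parse(0)[0]

print(eval_prefix("+34"))        # 7
print(eval_prefix("-*33+2+11"))  # 5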
Well, does a one-liner fit? In Python 3, reduce is hidden away in functools.
Somewhat lispy :)
from functools import reduce  # in Python 3, reduce lives in functools

eval_prefix = lambda inp:\
    reduce(lambda stack, symbol:\
        (
            (stack+[symbol]) if symbol.isdigit() \
            else \
            (
                stack[:-2]+\
                [str(
                    eval(
                        stack[-1]+symbol+stack[-2]
                    )
                )
                ]
            )
        ), inp[::-1], [])[0]
The hint that you are most likely looking for is "strings are iterable":
def eval_prefix(data):
    # setup state machine
    for symbol_ in data:
        # update state machine
Separating the elements of the string is easy. All elements are a single character long, so you can directly iterate over (or index) the string to get at each one. Or if you want to be able to manipulate the values, you could pass the string to the list constructor.
Here are some examples of how this can work:
string = "*-567"
# iterating over each character, one at a time:
for character in string:
    print(character) # prints one character from the string per line
# accessing a specific character by index:
third_char = string[2] # note indexing is zero-based, so 3rd char is at index 2
# transform string to list
list_of_characters = list(string) # will be ["*", "-", "5", "6", "7"]
As for how to solve the equation, I think there are two approaches.
One is to make your function recursive, so that each call evaluates a single operation or literal value. This is a little tricky, since you're only supposed to use one function (it would be much easier if you could have a recursive helper function that gets called with a different API than the main non-recursive function).
The other approach is to build up a stack of values and operations that you're waiting to evaluate while taking just a single iteration over the input string. This is probably easier given the one-function limit.
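As a rough sketch of that second approach (my own illustration: scan the string right to left, push digits, and apply each operator to the top two stack values):
def eval_prefix(expr):
    ops = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b,
           '/': lambda a, b: a / b,
           '%': lambda a, b: a % b}
    stack = []
    for ch in reversed(expr):
        if ch.isdigit():
            stack.append(int(ch))
        else:
            a = stack.pop()  # operand that appeared first in the input
            b = stack.pop()
            stack.append(ops[ch](a, b))
    return stack[0]

print(eval_prefix("*+35-72"))  # 40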

Storing and evaluating nested string elements

Given the exampleString = "[9+[7*3+[1+2]]-5]"
How does one extract and store elements enclosed by [] brackets, and then evaluate them in order?
1+2 --+
      |
7*3+3 --+
        |
9+24-5
Does one have to create some kind of nested list? Sorry for this somewhat broad question and bad English.
I see, this question is really too broad... Is there a way to create a nested list from that string? Or maybe I should simply do a regex search for every element and evaluate each? The nested list option (if it exists) would IMO be a "cleaner" approach than looping over the same string and evaluating until there are no [] brackets left.
Have a look at the pyparsing module and some of the examples it ships with (the four-function calculator is what you want, and more).
PS. In case the size of that code worries you, look again: most of it can be stripped. The lower half is just tests. The upper part can be stripped of things like support for e/pi/... constants, trigonometric functions, etc. I'm sure you can cut it down to 10 lines for what you need.
A good starting point is the shunting-yard algorithm.
There are multiple Python implementations available online; here is one.
The algorithm can be used to translate infix notation into a variety of representations. If you are not constrained with regards to which representation you can use, I'd recommend considering Reverse Polish notation as it's easy to work with.
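For example, once the expression has been translated to Reverse Polish notation, evaluating it is a short stack loop (a sketch assuming space-separated RPN tokens; the token string below is the RPN form of the example expression):
def eval_rpn(tokens):
    ops = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b,
           '/': lambda a, b: a / b}
    stack = []
    for tok in tokens.split():
        if tok in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(int(tok))
    return stack[0]

# "[9+[7*3+[1+2]]-5]" rewritten in RPN:
print(eval_rpn("9 7 3 * 1 2 + + + 5 -"))  # 28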
Here is a regex solution:
import re

def evaluatesimple(s):
    return eval(s)

def evaluate(s):
    while 1:
        simplesums = re.findall(r"\[([^\]\[]*)\]", s)
        if (len(simplesums) == 0):
            break
        replacements = [('[%s]' % item, str(evaluatesimple(item))) for item in simplesums]
        for r in replacements:
            s = s.replace(*r)
    return s

print evaluate("[9+[7*3+[1+2]]-5]")
But if you want to go the whole hog and build a tree to evaluate later, you can use the same technique but store the expressions and sub expressions in a dict:
def tokengen():
    for c in 'abcdefghijklmnopqrstuvwxyz':
        yield c

def makeexpressiontree(s):
    d = dict()
    tokens = tokengen()
    while 1:
        simplesums = re.findall(r"\[([^\]\[]*)\]", s)
        if (len(simplesums) == 0):
            break
        for item in simplesums:
            t = tokens.next()
            d[t] = item
            s = s.replace("[%s]" % item, t)
    return d

def evaltree(d):
    """A simple dumb way to show in principle how to evaluate the tree"""
    result = 0
    ev = {}
    for i, t in zip(range(len(d)), tokengen()):
        ev[t] = eval(d[t], ev)
        result = ev[t]
    return result

s = "[9+[7*3+[1+2]]-5]"
print evaluate(s)
tree = makeexpressiontree(s)
print tree
print evaltree(tree)

Possible to Simplify These Python Regular Expressions?

patterns = {}
patterns[1] = re.compile("[A-Z]\d-[A-Z]\d")
patterns[2] = re.compile("[A-Z]\d-[A-Z]\d\d")
patterns[3] = re.compile("[A-Z]\d\d-[A-Z]\d\d")
patterns[4] = re.compile("[A-Z]\d\d-[A-Z]\d\d\d")
patterns[5] = re.compile("[A-Z]\d\d\d-[A-Z]\d\d\d")
patterns[6] = re.compile("[A-Z][A-Z]\d-[A-Z][A-Z]\d")
patterns[7] = re.compile("[A-Z][A-Z]\d-[A-Z][A-Z]\d\d")
patterns[8] = re.compile("[A-Z][A-Z]\d\d-[A-Z][A-Z]\d\d")
patterns[9] = re.compile("[A-Z][A-Z]\d\d-[A-Z][A-Z]\d\d\d")
patterns[10] = re.compile("[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d")
def matchFound(toSearch):
    for items in sorted(patterns.keys(), reverse=True):
        matchObject = patterns[items].search(toSearch)
        if matchObject:
            return items
    return 0
then I use the following code to look for matches:
while matchFound(toSearch) > 0:
I have 10 different regular expressions but I feel like they could be replaced by one, well written, more elegant regular expression. Do you guys think it's possible?
EDIT: FORGOT TWO MORE EXPRESSIONS:
patterns[11] = re.compile("[A-Z]\d-[A-Z]\d\d\d")
patterns[12] = re.compile("[A-Z][A-Z]\d-[A-Z][A-Z]\d\d\d")
EDIT2: I ended up with the following. I realize I COULD get extra results but I don't think they're possible in the data I'm parsing.
patterns = {}
patterns[1] = re.compile("[A-Z]{1,2}\d-[A-Z]{1,2}\d{1,3}")
patterns[2] = re.compile("[A-Z]{1,2}\d\d-[A-Z]{1,2}\d{2,3}")
patterns[3] = re.compile("[A-Z]{1,2}\d\d\d-[A-Z]{1,2}\d\d\d")
Josh Caswell noted that Sean Bright's answer will match more inputs than your original group. Sorry I didn't figure this out. (In the future it might be good to spell out your problem a little bit more.)
So your basic problem is that regular expressions can't count. But we can still solve this in Python in a very slick way. First we make a pattern that matches any of your legal inputs, but would also match some you want to reject.
Next, we define a function that uses the pattern and then examines the match object, and counts to make sure that the matched string meets the length requirements.
import re

_s_pat = r'([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})'
_pat = re.compile(_s_pat)
_valid_n_len = set([(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)])

def check_match(s):
    m = _pat.search(s)
    try:
        a0, n0, a1, n1 = m.groups()
        if len(a0) != len(a1):
            return False
        if not (len(n0), len(n1)) in _valid_n_len:
            return False
        return True
    except (AttributeError, TypeError, ValueError):
        return False
Here is some explanation of the above code.
First we use a raw string to define the pattern, and then we pre-compile the pattern. We could just stuff the literal string into the call to re.compile() but I like to have a separate string. Our pattern has four distinct sections enclosed in parentheses; these will become "match groups". There are two match groups to match the alphabet characters, and two match groups to match numbers. This one pattern will match everything you want, but won't exclude some stuff you don't want.
Next we declare a set that has all the valid lengths for numbers. For example, the first group of numbers can be 1 digit long and the second group can be 2 digits; this is (1,2) (a tuple value). A set is a nice way to specify all the possible combinations that we want to be legal, while still being able to check quickly whether a given pair of lengths is legal.
The function check_match() first uses the pattern to match against the string, returning a "match object" which is bound to the name m. If the search fails, m will be set to None. Instead of explicitly testing for None, I used a try/except block; in retrospect it might have been better to just test for None. Sorry, I didn't mean to be confusing. But the try/except block is a pretty simple way to wrap something and make it very reliable, so I often use it for things like this.
Finally, check_match() unpacks the match groups into four variables. The two alpha groups are a0 and a1, and the two number groups are n0 and n1. Then it checks that the lengths are legal. As far as I can tell, the rule is that alpha groups need to be the same length; and then we build a tuple of number group lengths and check to see if the tuple is in our set of valid tuples.
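A few illustrative inputs run through check_match() as defined above (the strings are made up for the example):
# assuming check_match() from the snippet above
print(check_match("AB12-CD123"))  # True: alpha lengths match, digit lengths (2, 3) are allowed
print(check_match("A123-B12"))    # False: digit lengths (3, 2) are not in the set
print(check_match("A1-BC2"))      # False: the alpha groups have different lengths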
Here's a slightly different version of the above. Maybe you will like it better.
import re

# match alpha: 1 or 2 capital letters
_s_pat_a = r'[A-Z]{1,2}'
# match number: 1-3 digits
_s_pat_n = r'\d{1,3}'
# pattern: four match groups: alpha, number, alpha, number
_s_pat = '(%s)(%s)-(%s)(%s)' % (_s_pat_a, _s_pat_n, _s_pat_a, _s_pat_n)
_pat = re.compile(_s_pat)
# set of valid lengths of number groups
_valid_n_len = set([(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)])

def check_match(s):
    m = _pat.search(s)
    if not m:
        return False
    a0, n0, a1, n1 = m.groups()
    if len(a0) != len(a1):
        return False
    tup = (len(n0), len(n1))  # make tuple of actual lengths
    if not tup in _valid_n_len:
        return False
    return True
Note: It looks like the rule for valid lengths is actually simple:
if len(n0) > len(n1):
    return False
If that rule works for you, you could get rid of the set and the tuple stuff. Hmm, and I'll make the variable names a bit shorter.
import re

# match alpha: 1 or 2 capital letters
pa = r'[A-Z]{1,2}'
# match number: 1-3 digits
pn = r'\d{1,3}'
# pattern: four match groups: alpha, number, alpha, number
p = '(%s)(%s)-(%s)(%s)' % (pa, pn, pa, pn)
_pat = re.compile(p)

def check_match(s):
    m = _pat.search(s)
    if not m:
        return False
    a0, n0, a1, n1 = m.groups()
    if len(a0) != len(a1):
        return False
    if len(n0) > len(n1):
        return False
    return True
Sean Bright gave you the answer you need. Here's just a general tip:
Python has wonderful documentation. In this case, you could read it with the "help" command:
import re
help(re)
And if you read through the help, you would see:
{m,n} Matches from m to n repetitions of the preceding RE.
It also helps to use Google. "Python regular expressions" found these links for me:
http://docs.python.org/library/re.html
http://docs.python.org/howto/regex.html
Both are worth reading.
Josh is right about at least reducing the number of REs.
But you could also use an RE which is wider than allowed and then additionally check that all conditions are met. Such as
pattern = re.compile("([A-Z]{1,2})(\d{1,3})-([A-Z]{1,2})(\d{1,3})")
and then
matchObject = pattern.search(toSearch)
if matchObject and <do something with the lengths of the groups, comparing them>:
    return <stuff>
But even if that does not work due to any reason, there are ways to improve that:
patterns = tuple(re.compile(r) for r in (
    "[A-Z]\d-[A-Z]\d{1,2}",
    "[A-Z]\d\d-[A-Z]\d{2,3}",
    "[A-Z]\d\d\d-[A-Z]\d\d\d",
    "[A-Z][A-Z]\d-[A-Z][A-Z]\d{1,2}",
    "[A-Z][A-Z]\d\d-[A-Z][A-Z]\d{2,3}",
    "[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d",
))

def matchFound(toSearch):
    for pat in reversed(patterns):
        matchObject = pat.search(toSearch)
        if matchObject:
            return matchObject  # maybe more useful than the index?
    return None
Building on Sean's (now apparently deleted) answer, you can reduce the number of patterns. Because of the limitations on the combinations of digit-match lengths (i.e., with m digits in the first position, at least m and no more than 3 in the second), I'm not sure you can get it down to one:
"[A-Z]\d-[A-Z]\d{1,3}"
"[A-Z]\d\d-[A-Z]\d{2,3}"
"[A-Z]\d\d\d-[A-Z]\d\d\d"
"[A-Z][A-Z]\d-[A-Z][A-Z]\d{1,3}"
"[A-Z][A-Z]\d\d-[A-Z][A-Z]\d{2,3}"
"[A-Z][A-Z]\d\d\d-[A-Z][A-Z]\d\d\d"
This uses the {m,n} repeat qualifier syntax, which specifies that the immediately preceding match be repeated at least m but no more than n times. You can also specify a single number n; then the match must succeed exactly n times:
"[A-Z]{2}\d-[A-Z]{2}\d{2,3}"
