I want to write a program that detects a correct expansion.
For example:
I want to expand (x + 2)*(x - 3).
The solution is x*x -x -6.
But x*x +2*x -3*x -6 is also a correct solution.
I want to detect such correct (but unsimplified) expansions.
If you let the user enter the expression as a string and parse it with evaluate=False, as shown here, you can compare the number of arguments in what was entered with the number in the fully simplified version.
>>> from sympy import S, expand
>>> from sympy.abc import x
>>> from sympy.parsing.sympy_parser import parse_expr
>>> expr = (x - 3)*(x + 2)
>>> expanded = expand(expr)
>>> ans = 'x*x +2*x -3*x -6'  # obtained from user
>>> if S(ans) == expanded:  # it's right
...     if len(parse_expr(ans, evaluate=False).args) != len(expanded.args):
...         print('right, but not simplified')
The unsimplified ans will have 4 arguments while the expanded form will have 3.
CPython's substring search (e.g. via in) is implemented by the following algorithm.
def find(s, p):
    # find first occurrence of p in s
    n = len(s)
    m = len(p)
    skip = delta1(p)[p[m-1]]
    i = 0
    while i <= n-m:
        if s[i+m-1] == p[m-1]: # (boyer-moore)
            # potential match
            if s[i:i+m-1] == p[:m-1]:
                return i
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + skip # (horspool)
        else:
            # skip
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + 1
    return -1 # not found
At least, according to this source (taken from this older answer) written by the author (?) of the CPython implementation.
This same source mentions a worst-case complexity of this algorithm as O(nm), where n and m are the lengths of the two strings. I am interested in whether this bound is tight. My question is:
Are there adversarial examples for the algorithm used by Python's in operator? Can we give a sequence of pairs of strings (pattern, string) so that running pattern in string takes quadratic (or at least superlinear) time?
The standard example that demonstrates the quadratic worst-case run-time of naive substring search, where string = 'a'*n and pattern = 'a'*m + 'b', does not work.
The naive example of s='a'*n and p='a'*m+'b' does not work because of the line
if s[i+m-1] == p[m-1]:
This compares the last character of p ('b'), not the first, with the corresponding position in s. Since that comparison fails every time, the search amounts to a single pass over s, which is why it is so fast.
If you flip p (s='a'*n and p='b'+'a'*m), a similar thing occurs: this time the above check passes (the last character of p is now 'a'), but p is then compared from the front, so the 'b' mismatch is found immediately. Again this example is linear and fast.
A simple change to the naive example that would show O(nm) behaviour is s='a'*n and p='a'*m+'ba'. In this case, the last character of p is 'a', so the initial check passes, but then it needs to iterate over the rest of p before it gets to the 'b'.
# full='a'*n; sub='a'*m+'b'
>>> timeit("sub in full", "sub='a'*10+'b'; full='a'*100")
0.13620498299860628
>>> timeit("sub in full", "sub='a'*10+'b'; full='a'*1000")
0.9594046580004942
>>> timeit("sub in full", "sub='a'*100+'b'; full='a'*1000")
0.9768632190007338
# Linear in n, but m has minimal effect: ~O(n)
# full='a'*n; sub='a'*m+'ba'
>>> timeit("sub in full", "sub='a'*10+'ba'; full='a'*100")
0.35251976200015633
>>> timeit("sub in full", "sub='a'*10+'ba'; full='a'*1000")
3.4642483099996753
>>> timeit("sub in full", "sub='a'*100+'ba'; full='a'*1000")
27.152958754999418
# Both n and m have linear effect: ~O(nm)
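The difference between the two patterns can also be seen by counting character comparisons instead of measuring time. The following is a minimal sketch that mirrors the pseudocode quoted in the question, not CPython's actual C code: find_count is a name introduced here, the Horspool skip is computed inline rather than through delta1, bounds checks are added for s[i+m], and the cost of the "not in p" test is not counted.

```python
def find_count(s, p):
    """Return (index, comparisons) for the first occurrence of p in s.

    Follows the quoted pseudocode with bounds checks added; only
    character-to-character comparisons are counted.
    """
    n, m = len(s), len(p)
    if m == 0:
        return 0, 0
    last = p[m - 1]
    # Horspool shift: distance from the rightmost earlier occurrence of
    # the last pattern character to the end of the pattern.
    skip = m
    for j in range(m - 1):
        if p[j] == last:
            skip = m - 1 - j
    comparisons = 0
    i = 0
    while i <= n - m:
        comparisons += 1
        if s[i + m - 1] == last:
            # Potential match: compare the first m-1 characters in order.
            j = 0
            while j < m - 1:
                comparisons += 1
                if s[i + j] != p[j]:
                    break
                j += 1
            if j == m - 1:
                return i, comparisons
            step = skip
        else:
            step = 1
        # Sunday shift: the character just past the window is not in p.
        if i + m < n and s[i + m] not in p:
            step = m + 1
        i += step
    return -1, comparisons


n, m = 1000, 100
print(find_count("a" * n, "a" * m + "b"))   # last character never matches
print(find_count("a" * n, "a" * m + "ba"))  # every window is scanned up to the 'b'
```

In this model the first pattern costs about one comparison per position, while the second costs about m comparisons for every second position, which is the O(nm) growth the timings suggest.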
Try this:
import re
import time

def slow_match(n):
    pat = 'a' + ('z' * n)
    text = 'z' * (n + n)  # renamed from str to avoid shadowing the built-in
    start_time = time.time()
    if re.search(pat, text):
        print("Shouldn't happen")
    print(("Searched", n, time.time() - start_time))

slow_match(10000)
slow_match(50000)
slow_match(100000)
slow_match(300000)
I want to use z3 to solve this case. The input is a 10 character string. Each character of the input is a printable character (ASCII). The input should be such that when calc2() function is called with input as a parameter, the result should be: 0x0009E38E1FB7629B.
How can I use z3py in such cases?
Usually I would just add independent equations as a constraint to z3. In this case, I am not sure how to use z3.
import sys

def calc2(input):
    result = 0
    for i in range(len(input)):
        r1 = (result << 0x5) & 0xffffffffffffffff
        r2 = result >> 0x1b
        r3 = (r1 ^ r2)
        result = (r3 ^ ord(input[i]))
    return result

if __name__ == "__main__":
    input = sys.argv[1]
    result = calc2(input)
    if result == 0x0009E38E1FB7629B:
        print("solved")
Update: I tried the following; however, it does not give me the correct answer:
from z3 import *

def calc2(input):
    result = 0
    for i in range(len(input)):
        r1 = (result << 0x5) & 0xffffffffffffffff
        r2 = result >> 0x1b
        r3 = (r1 ^ r2)
        result = r3 ^ Concat(BitVec(0, 56), input[i])
    return result

if __name__ == "__main__":
    s = Solver()
    X = [BitVec('x' + str(i), 8) for i in range(10)]
    s.add(calc2(X) == 0x0009E38E1FB7629B)
    if s.check() == sat:
        print(s.model())
I hope this isn't homework, but here's one way to go about it:
from z3 import *
s = Solver()
# Input is 10 characters long; represent with 10 8-bit symbolic variables
input = [BitVec("input%s" % i, 8) for i in range(10)]
# Make sure each character is printable ASCII, i.e., between 0x20 and 0x7E
for i in range(10):
    s.add(input[i] >= 0x20)
    s.add(input[i] <= 0x7E)
def calc2(input):
    # result is a 64-bit value
    result = BitVecVal(0, 64)
    for i in range(len(input)):
        # NB. We don't actually need to mask with 0xffffffffffffffff
        # Since we explicitly have a 64-bit value in result.
        # But it doesn't hurt to mask it, so we do it here.
        r1 = (result << 0x5) & 0xffffffffffffffff
        # NB. >> on a z3 bit-vector is an arithmetic shift. It agrees with
        # Python's shift here because the top bit is never set for a
        # 10-character input; LShR(result, 0x1b) is the logical shift.
        r2 = result >> 0x1b
        r3 = r1 ^ r2
        # We need to zero-extend to match sizes
        result = r3 ^ ZeroExt(56, input[i])
    return result
# Assert the required equality
s.add(calc2(input) == 0x0009E38E1FB7629B)
# Check and get model
print(s.check())
m = s.model()
# reconstruct the string (kept separate from the solver variable s):
secret = ''.join([chr(m[input[i]].as_long()) for i in range(10)])
print(secret)
This prints:
$ python a.py
sat
L`p:LxlBVU
Looks like your secret string is
"L`p:LxlBVU"
I've put in some comments in the program to help you with how things are coded in z3py, but feel free to ask for clarification. Hope this helps!
Getting all solutions
To get other solutions, you simply loop and assert that the solution shouldn't be the previous one. You can use the following while loop after the assertion:
while s.check() == sat:
    m = s.model()
    print(''.join([chr(m[input[i]].as_long()) for i in range(10)]))
    s.add(Or([input[i] != m[input[i]] for i in range(10)]))
When I ran it, it kept going! You might want to stop it after a while.
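Since the loop can produce many candidates, it may be worth running each one through the plain-Python function before trusting it. This is a minimal sketch assuming Python 3; TARGET and is_solution are names introduced here for illustration, and calc2 is the question's function rewritten without the Python 2 print.

```python
TARGET = 0x0009E38E1FB7629B

def calc2(text):
    """Plain-Python version of the hash from the question."""
    result = 0
    for ch in text:
        r1 = (result << 5) & 0xFFFFFFFFFFFFFFFF
        r2 = result >> 27
        result = r1 ^ r2 ^ ord(ch)
    return result

def is_solution(candidate, target=TARGET):
    """True if candidate is 10 printable ASCII characters hashing to target."""
    return (len(candidate) == 10
            and all(0x20 <= ord(ch) <= 0x7E for ch in candidate)
            and calc2(candidate) == target)
```

Inside the enumeration loop, each reconstructed string could be passed to is_solution, and any mismatch would point to a modelling error in the z3 encoding.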
You can encode calc2 in Z3. You'll need to unroll the loop 1, 2, 3, 4, ..., n times (for n = the maximum expected input size), but that's it.
(You don't actually need to unroll the loop, you can use z3py to create the constraints)
I have a very long math formula (just to put you in context: it has 293095 characters) which in practice will be the body of a python function. This function has 15 input parameters as in:
def math_func(t,X,P,n1,n2,R,r):
    x,y,z = X
    a,b,c = P
    u1,v1,w1 = n1
    u2,v2,w2 = n2
    return <long math formula>
The formula uses simple math operations + - * ** / and one function call to arctan. Here is an extract of it:
r*((-16*(r**6*t*u1**6 - 6*r**6*u1**5*u2 - 15*r**6*t*u1**4*u2**2 +
20*r**6*u1**3*u2**3 + 15*r**6*t*u1**2*u2**4 - 6*r**6*u1*u2**5 -
r**6*t*u2**6 + 3*r**6*t*u1**4*v1**2 - 12*r**6*u1**3*u2*v1**2 -
18*r**6*t*u1**2*u2**2*v1**2 + 12*r**6*u1*u2**3*v1**2 +
3*r**6*t*u2**4*v1**2 + 3*r**6*t*u1**2*v1**4 - 6*r**6*u1*u2*v1**4 -
3*r**6*t*u2**2*v1**4 + r**6*t*v1**6 - 6*r**6*u1**4*v1*v2 -
24*r**6*t*u1**3*u2*v1*v2 + 36*r**6*u1**2*u2**2*v1*v2 +
24*r**6*t*u1*u2**3*v1*v2 - 6*r**6*u2**4*v1*v2 -
12*r**6*u1**2*v1**3*v2 - 24*r**6*t*u1*u2*v1**3*v2 +
12*r**6*u2**2*v1**3*v2 - 6*r**6*v1**5*v2 - 3*r**6*t*u1**4*v2**2 + ...
Now the point is that in practice the bulk evaluation of this function will be done for fixed values of P, n1, n2, R and r, which reduces the set of free variables to only four, and "in theory" the formula with fewer parameters should be faster.
So the question is: How can I implement this optimization in Python?
I know I can put everything in a string and do some sort of replace,compile and eval like in
formula = formula.replace('r','1').replace('R','2')....
code = compile(formula,'formula-name','eval')
math_func = lambda t,x,y,z: eval(code)
It would be good if some operations (like power) are substituted by their value, for example 18*r**6*t*u1**2*u2**2*v1**2 should become 18*t for r=u1=u2=v1=1. I think compile should do so but in any case I'm not sure. Does compile actually perform this optimization?
My solution speeds up the computation, but it would be great if I could squeeze out more. Note: preferably within standard Python (I could try Cython later).
In general I'm interested in a Pythonic way to accomplish my goal, maybe with some extra libraries: what is a reasonably good way of doing this? Is my solution a good approach?
EDIT: (To give more context)
The huge expression is the output of a symbolic line integral over an arc of circle. The arc is given in space by the radius r, two ortho-normal vectors (like the x and y axis in a 2D version) n1=(u1,v1,w1),n2=(u2,v2,w2) and the center P=(a,b,c). The rest is the point over which I'm performing the integration X=(x,y,z) and a parameter R for the function I'm integrating.
Sympy and Maple just take ages to compute this; the actual output is from Mathematica.
If you are curious about the formula here it is (pseudo-pseudo-code):
G(u) = P + r*(1-u**2)/(1+u**2)*n1 + r*2*u/(1+u**2)*n2
integral of (1-|X-G(t)|^2/R^2)^3 over t
You could use Sympy:
>>> from sympy import symbols
>>> x,y,z,a,b,c,u1,v1,w1,u2,v2,w2,t,r = symbols("x,y,z,a,b,c,u1,v1,w1,u2,v2,w2,t,r")
>>> r=u1=u2=v1=1
>>> a = 18*r**6*t*u1**2*u2**2*v1**2
>>> a
18*t
Then you can create a Python function like this:
>>> from sympy import lambdify
>>> f = lambdify(t, a)
>>> f(1)
18
And that f function is indeed simply 18*t:
>>> import dis
>>> dis.dis(f)
1 0 LOAD_CONST 1 (18)
3 LOAD_FAST 0 (_Dummy_18)
6 BINARY_MULTIPLY
7 RETURN_VALUE
If you want to compile the resulting code into machine code, you can try a JIT compiler such as Numba, Theano, or Parakeet.
Here's how I would approach this problem:
1. compile() your function to an AST (Abstract Syntax Tree) instead of a normal bytecode function; see the standard ast module for details.
2. Traverse the AST, replacing all references to the fixed parameters with their fixed values. Libraries such as macropy may be useful for this, but I don't have a specific recommendation.
3. Traverse the AST again, performing whatever optimizations this enables, such as Mult(1, X) => X. You don't have to worry about operations between two constants, as Python (since 2.6) optimizes that already.
4. compile() the AST into a normal function. Call it, and hope that the speed was increased by a sufficient amount to justify all the pre-optimization.
Note that Python will never optimize things like 1*X on its own, as it cannot know what type X will be at runtime - it could be an instance of a class that implements the multiplication operation in an arbitrary way, so the result is not necessarily X. Only your knowledge that all the variables are ordinary numbers, obeying the usual rules of arithmetic, makes this optimization valid.
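The steps above can be sketched with the standard ast module. This is a rough illustration under stated assumptions, not a complete optimizer: Substitute, Simplify and specialize are names introduced here, all values are assumed to be ordinary ints or floats, and the sketch folds constants itself (for +, -, * and ** only) so that results such as 1**6 become visible to the multiply-by-one rule.

```python
import ast
import operator

# Operators folded by this sketch; division is left out to avoid
# raising ZeroDivisionError while transforming.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Pow: operator.pow}

def _is_num(node):
    return isinstance(node, ast.Constant) and isinstance(node.value, (int, float))

class Substitute(ast.NodeTransformer):
    """Replace reads of the fixed parameters with their values."""
    def __init__(self, fixed):
        self.fixed = fixed

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Load) and node.id in self.fixed:
            return ast.copy_location(ast.Constant(value=self.fixed[node.id]), node)
        return node

class Simplify(ast.NodeTransformer):
    """Fold constant operands and drop multiplications by 1."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # simplify the operands first
        left, right = node.left, node.right
        fold = _OPS.get(type(node.op))
        if fold and _is_num(left) and _is_num(right):
            value = fold(left.value, right.value)
            return ast.copy_location(ast.Constant(value=value), node)
        if isinstance(node.op, ast.Mult):
            if _is_num(left) and left.value == 1:
                return right
            if _is_num(right) and right.value == 1:
                return left
        return node

def specialize(formula, free, fixed):
    """Compile formula into a function of the free variables only."""
    source = "lambda {}: {}".format(", ".join(free), formula)
    tree = ast.parse(source, mode="eval")
    tree = Simplify().visit(Substitute(fixed).visit(tree))
    ast.fix_missing_locations(tree)
    return eval(compile(tree, "<specialized>", "eval"))

f = specialize("18*r**6*t*u1**2*u2**2*v1**2", ["t"],
               {"r": 1, "u1": 1, "u2": 1, "v1": 1})
print(f(2))
```

With these inputs the body of the generated lambda reduces to 18*t. Whether the saving is worth the preprocessing would need to be measured on the real formula.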
The "right way" to solve a problem like this is one or more of:
1. Find a more efficient formulation
2. Symbolically simplify and reduce terms
3. Use vectorization (e.g. NumPy)
4. Punt to low-level libraries that are already optimized (e.g. in languages like C or Fortran that implicitly do strong expression optimization, rather than Python, which does nada).
Let's say for a moment, though, that approaches 1, 3, and 4 are not available, and you have to do this in Python. Then simplifying and "hoisting" common subexpressions is your primary tool.
The good news is, there are a lot of opportunities. The expression r**6, for example, is repeated 26 times. You could save 25 computations by simply assigning r_6 = r ** 6 once, then replacing r**6 every time it occurs.
When you start looking for common expressions here, you'll find them everywhere. It'd be nice to mechanize that process, right? In general, that requires a full expression parser (e.g. from the ast module) and is an exponential-time optimization problem. But your expression is a bit of a special case. While long and varied, it's not especially complicated. It has few internal parenthetical groupings, so we can get away with a quicker and dirtier approach.
Before the how, the resulting code is:
sa = r**6 # 26 occurrences
sb = u1**2 # 5 occurrences
sc = u2**2 # 5 occurrences
sd = v1**2 # 5 occurrences
se = u1**4 # 4 occurrences
sf = u2**3 # 3 occurrences
sg = u1**3 # 3 occurrences
sh = v1**4 # 3 occurrences
si = u2**4 # 3 occurrences
sj = v1**3 # 3 occurrences
sk = v2**2 # 1 occurrence
sl = v1**6 # 1 occurrence
sm = v1**5 # 1 occurrence
sn = u1**6 # 1 occurrence
so = u1**5 # 1 occurrence
sp = u2**6 # 1 occurrence
sq = u2**5 # 1 occurrence
sr = 6*sa # 6 occurrences
ss = 3*sa # 5 occurrences
st = ss*t # 5 occurrences
su = 12*sa # 4 occurrences
sv = sa*t # 3 occurrences
sw = v1*v2 # 5 occurrences
sx = sj*v2 # 3 occurrences
sy = 24*sv # 3 occurrences
sz = 15*sv # 2 occurrences
sA = sr*u1 # 2 occurrences
sB = sy*u1 # 2 occurrences
sC = sb*sc # 2 occurrences
sD = st*se # 2 occurrences
# revised formula
sv*sn - sr*so*u2 - sz*se*sc +
20*sa*sg*sf + sz*sb*si - sA*sq -
sv*sp + sD*sd - su*sg*u2*sd -
18*sv*sC*sd + su*u1*sf*sd +
st*si*sd + st*sb*sh - sA*u2*sh -
st*sc*sh + sv*sl - sr*se*sw -
sy*sg*u2*sw + 36*sa*sC*sw +
sB*sf*sw - sr*si*sw -
su*sb*sx - sB*u2*sx +
su*sc*sx - sr*sm*v2 - sD*sk
That avoids 81 computations. It's just a rough cut. Even the result could be further improved. The subexpressions sr*sw and su*sd for example, could be pre-computed as well. But we'll leave that next level for another day.
Note that this doesn't include the starting r*((-16*(. The majority of the simplification can be (and needs to be) done on the core of the expression, not on its outer terms. So I stripped those away for now; they can be added back once the common core is computed.
How do you do this?
f = """
r**6*t*u1**6 - 6*r**6*u1**5*u2 - 15*r**6*t*u1**4*u2**2 +
20*r**6*u1**3*u2**3 + 15*r**6*t*u1**2*u2**4 - 6*r**6*u1*u2**5 -
r**6*t*u2**6 + 3*r**6*t*u1**4*v1**2 - 12*r**6*u1**3*u2*v1**2 -
18*r**6*t*u1**2*u2**2*v1**2 + 12*r**6*u1*u2**3*v1**2 +
3*r**6*t*u2**4*v1**2 + 3*r**6*t*u1**2*v1**4 - 6*r**6*u1*u2*v1**4 -
3*r**6*t*u2**2*v1**4 + r**6*t*v1**6 - 6*r**6*u1**4*v1*v2 -
24*r**6*t*u1**3*u2*v1*v2 + 36*r**6*u1**2*u2**2*v1*v2 +
24*r**6*t*u1*u2**3*v1*v2 - 6*r**6*u2**4*v1*v2 -
12*r**6*u1**2*v1**3*v2 - 24*r**6*t*u1*u2*v1**3*v2 +
12*r**6*u2**2*v1**3*v2 - 6*r**6*v1**5*v2 - 3*r**6*t*u1**4*v2**2
""".strip()
from collections import Counter
import re

# raw strings so the regex escapes are not treated as string escapes
expre = re.compile(r'(?<!\w)\w+\*\*\d+')
multre = re.compile(r'(?<!\w)\w+\*\w+')
expr_saved = 0
stmts = []
secache = {}
seindex = 0

def subexpr(e):
    global seindex
    cached = secache.get(e)
    if cached:
        return cached
    base = ord('a') if seindex < 26 else ord('A') - 26
    name = 's' + chr(seindex + base)
    seindex += 1
    secache[e] = name
    return name

def hoist(e, flat, c):
    """
    Hoist the expression e into name defined by flat.
    c is the count of how many times seen in incoming
    formula.
    """
    global expr_saved
    assign = "{} = {}".format(flat, e)
    s = "{:30} # {} occurrence{}".format(assign, c, '' if c == 1 else 's')
    stmts.append(s)
    print("{} needless computations quashed with {}".format(c-1, flat))
    expr_saved += c - 1

def common_exp(form):
    """
    Replace ALL exponentiation operations with a hoisted
    sub-expression.
    """
    # find the exponentiation operations
    exponents = re.findall(expre, form)
    # find and count exponentiation operations
    expcount = Counter(re.findall(expre, form))
    # for each exponentiation, create a hoisted sub-expression
    for e, c in expcount.most_common():
        hoist(e, subexpr(e), c)
    # replace all exponentiation operations with their sub-expressions
    form = re.sub(expre, lambda x: subexpr(x.group(0)), form)
    return form

def common_mult(f):
    """
    Replace multiplication operations with a hoisted
    sub-expression if they occur > 1 time. Also, only
    replaces one sub-expression at a time (the most common)
    because it may affect further expressions
    """
    mults = re.findall(multre, f)
    for e, c in Counter(mults).most_common():
        # unlike exponents, only replace if >1 occurrence
        if c == 1:
            return f
        # occurs >1 time, so hoist
        hoist(e, subexpr(e), c)
        # replace in loop and return
        return re.sub(r'(?<!\w)' + re.escape(e), subexpr(e), f)
        # return f.replace(e, flat(e))
    return f

# fix all exponents
form = common_exp(f)
# fix selected multiplies
prev = form
while True:
    form = common_mult(form)
    if form == prev:
        # have converged; no more replacements possible
        break
    prev = form
print("--")
mults = re.split(r'\s*[+-]\s*', form)
smults = ['*'.join(sorted(terms.split('*'))) for terms in mults]
print(smults)
# print the hoisted statements and the revised expression
print('\n'.join(stmts))
print()
print("# revised formula")
print(form)
Parsing with regular expressions is dicey business. That journey is prone to error, sorrow, and regret. I guarded against bad outcomes by hoisting some exponentiations that didn't strictly need to be, and by plugging random values into both the before and after formulas to make sure they both give the same results. I recommend the "punt to C" strategy if this is production code. But if you can't...
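The random-value check mentioned above can be sketched as follows. This is a minimal illustration, assuming the formulas are available as strings and that only trusted input is passed to eval/exec; formulas_agree and its parameters are names introduced here. The example uses the first two terms of the formula together with the matching hoisted definitions from the listing.

```python
import math
import random

def formulas_agree(before, hoisted, after, names, trials=200, seed=0):
    """Compare two formulations at random points.

    before  -- original expression (string)
    hoisted -- assignment statements defining the sub-expressions (string)
    after   -- revised expression that uses the hoisted names (string)
    names   -- free variable names appearing in `before`
    """
    rng = random.Random(seed)
    for _ in range(trials):
        env = {name: rng.uniform(-2.0, 2.0) for name in names}
        expected = eval(before, {}, dict(env))
        scope = dict(env)
        exec(hoisted, {}, scope)       # defines sa, sn, ... in scope
        actual = eval(after, {}, scope)
        if not math.isclose(expected, actual, rel_tol=1e-9, abs_tol=1e-9):
            return False
    return True

before = "r**6*t*u1**6 - 6*r**6*u1**5*u2"
hoisted = "sa = r**6\nsn = u1**6\nso = u1**5\nsv = sa*t\nsr = 6*sa"
after = "sv*sn - sr*so*u2"
print(formulas_agree(before, hoisted, after, ["r", "t", "u1", "u2"]))
```

Agreement at random points is evidence rather than proof, but it catches the kind of slip a regex-based rewrite tends to introduce, such as a dropped factor.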
So, I am using the answer to this question to color some values I have for some polygons to plot to a basemap instance. I modified the function found in that link to be the following. The issue I'm having is that I have to convert the string that it returns to a hex number so that I can color the polygons. But when I convert something like "0x00ffaa" to a Python hex number, it becomes "0xffaa", which cannot be used to color the polygon.
How can I get around this?
Here is the modified function:
def rgb(mini,maxi,value):
    mini, maxi, value = float(mini), float(maxi), float(value)
    ratio = 2* (value - mini) / (maxi-mini)
    b = int(max(0,255*(1-ratio)))
    r = int(max(0,255*(ratio -1)))
    g = 255 - b - r
    b = hex(b)
    r = hex(r)
    g = hex(g)
    if len(b) == 3:
        b = b[0:2] + '0' + b[-1]
    if len(r) == 3:
        r = r[0:2] + '0' + r[-1]
    if len(g) == 3:
        g = g[0:2] + '0' + g[-1]
    string = r+g[2:]+b[2:]
    return string
The answer from cdarke is OK, but using the % operator for string interpolation is no longer the preferred style. For the sake of completeness, here are the format function and the str.format method:
>>> format(254, '06X')
'0000FE'
>>> "#{:06X}".format(255)
'#0000FF'
New code is expected to use one of the above instead of the % operator. If you are curious about "why does Python have a format function as well as a format method?", see my answer to this question.
But usually you don't have to worry about the representation of the value if the function/method you are using takes integers as well as strings, because in this case the string '0x0000AA' is the same as the integer value 0xAA or 170.
Use string formatting, for example:
>>> "0x%08x" % 0xffaa
'0x0000ffaa'
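Applied to the function from the question, zero-padded formatting removes the need to patch the hex strings by hand. This is a minimal sketch: rgb_hex is a name introduced here, and it assumes the plotting call accepts a '#rrggbb' color string; if a '0x' prefix is needed instead, only the format string changes.

```python
def rgb_hex(mini, maxi, value):
    """Map value in [mini, maxi] to a blue-green-red color as '#rrggbb'."""
    mini, maxi, value = float(mini), float(maxi), float(value)
    ratio = 2 * (value - mini) / (maxi - mini)
    b = int(max(0, 255 * (1 - ratio)))
    r = int(max(0, 255 * (ratio - 1)))
    g = 255 - b - r
    # {:02x} pads each channel to two hex digits, so leading zeros survive
    return "#{:02x}{:02x}{:02x}".format(r, g, b)

print(rgb_hex(0, 10, 0))
print(rgb_hex(0, 10, 5))
print(rgb_hex(0, 10, 10))
```

Formatting each channel separately keeps the leading zeros that are lost when the combined value is treated as a single integer.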