Optimization on a set of data using python

Optimization on a set of data using python - python

Optimization on a set of data using python.
Following data sets available
x, y, f(x), f(y).
Function to be optimized (maximize):
f(x,y) = f(x)*y - f(y)*x
based on following contraints:
V >= sqrt(f(x)^2+f(y)^2)
I >= sqrt(x^2+y2)
where V and I are constants.
Can anyone please let me know what optimization module do I need to use? From what I understand I need to perform a discrete optimization as I have set f values for x, y, f(x) and f(y).

Using complex optimizers (http://docs.scipy.org/doc/scipy/reference/optimize.html) for such a problem is rather a bad idea.
It looks like a problem which can be quite easily solved in under O(n^2) where n=max(|x|,|y|), simply:
sort x,y,f(x),f(y) creating sorted(x), sorted(y), sorted(f(x)), sorted(f(y))
for each x find the positions in sorted(y) for which I^2 >= x^2+y^2 holds and similarly for f(x) and sorted(f(y)) and V^2 >= f(x)^2 + f(y)^2 (two binary searches, as I^2 >= x^2+y^2 <=> |y| <= sqrt(I^2-x^2) so you can find the "barrier"in constant time and then use bin searches to find actual data points which are the closest ones "on the right side of inequality")
Iterate through sorted(x) and for each x:
Iterate simultanously through elements of y and f(y) and discard (in this loop) points which are not in borth intervals found in step 2. (linear complexity)
Record argument pairs x_max,y_max for which f(x_max,y_max) is maximized
Return x_max,y_max
Total complexity is under quadratic, as step 1 takes O(nlgn), each iteration of loop in step 2 is O(lgn) so the whole step 2 takes O(nlgn), loop in step 3 is O(n) and loop in first substep of step 3 is O(n) (but in real life it should be almost constant due to the constraints), which makes the whole algorithm O(n^2) (and in most cases it will behave as O(nlgn)). It also does not depend on the definition of f(x,y) (it uses it as a black box) so you can optimize an arbitrary function is such a way.

Related

How can I specify to the Z3's optimizer to start the search from the lower bound of the objective function?

I'm using Z3 to optimize a SMT problem. I have a variable "h" (obviously bounded by some constraints) that I want to minimize with the Z3 Optimize class. The point is that I know the lower bound of h but not its upper bound, so if I write something like
optimizer.add(h >= lower_bound)
what happens is that the solver spends a lot of time trying suboptimal values of h. If instead I write
optimizer.add(h == lower_bound)
the optimizer finds the solution for h fairly quickly if there is one. My problem is that clearly the optimal solution doesn't always have h == lower_bound, but it's usually close to it. It would be nice if there was a way to specify to the optimizer to start searching from the lower bound and then go up. A workaround that I found consists in using the Z3 Solver class instead of the Optimize one and iterating over all the possible values of h starting from the lower bound, so something like:
h = lower_bound
sat = 'unsat'
while sat != 'sat':
solver = Solver()
h_var = Int('h')
solver.add(h_var == h)
# all the other constraints here...
sat = solver.check()
h += 1
But it's not really elegant. Can some of you help me? Thank you very much.

If you know an upper bound as well, then you can do a binary search. That'd be logarithmically optimal in terms of the number of calls to check you have to make.
If you don't have an upper limit, then first find it by incrementing h not by 1, but by a larger amount to "jump" to an upper-bound. (For instance, increment by 1000 till you hit unsat.) Then do a binary search since you'll have upper-lower bounds at that time. Of course a good value for increment will depend on the exact problem you have.
I'm afraid these are your only options. The "official" way of doing this is to say opt.add(h >= lower_limit), which doesn't seem to be working for your problem. Perhaps the above trick can help.
Another thing to try is a different solver: OptiMathSAT has different algorithms and optimization techniques implemented. Perhaps it'll perform better: https://optimathsat.disi.unitn.it

Python odeint with array in differential equation

I have next first order differential equation (example):
dn/dt=A*n; n(0)=28
When A is constant, it is perfectly solved with python odeint.
But i have an array of different values of A from .txt file [not function,just an array of values]
A = [0.1,0.2,0.3,-0.4,0.7,...,0.0028]
And i want that in each iteration (or in each moment of time t) of solving ode A is a new value from array.
I mean that:
First iteration (or t=0) - A=0.1
Second iteration (or t=1) - A=0.2 and etc from array.
How can i do it with using python odeint?

Yes, you can to that, but not directly in odeint, as that has no event mechanism, and what you propose needs an event-action mechanism.
But you can separate your problem into steps, use inside each step odeint with the now constant A parameter, and then in the end join the steps.
T = [[0]]
N = [[n0]]
for k in range(len(A)):
t = np.linspan(k,k+1,11);
n = odeint(lambda u,t: A[k]*u, [n0],t)
n0 = n[-1]
T.append(t[1:])
N.append(n[1:])
T = np.concatenate(T)
N = np.concatenate(N)
If you are satisfied with less efficiency, both in the evaluation of the ODE and in the number of internal steps, you can also implement the parameter as a piecewise constant function.
tA = np.arange(len(A));
A_func = interp1d(tA, A, kind="zero", fill_value="extrapolate")
T = np.linspace(0,len(A)+1, 10*len(A)+11);
N = odeint(lambda u,t: A_func(t)*u, [n0], T)
The internal step size controller works on the assumption that the ODE function is well differentiable to 5th or higher order. The jumps are then seen via the implicit numerical differentiation inherent in the step error calculation as highly oscillatory events, requiring a very small step size. There is some mitigation inside the code that usually allows the solver to eventually step over such a jump, but it will require much more internal steps and thus function evaluations than the first variant above.

my code is giving me the wrong output sometimes, how to solve it?

I am trying to solve this problem: 'Your task is to construct a building which will be a pile of n cubes. The cube at the bottom will have a volume of n^3, the cube above will have the volume of (n-1)^3 and so on until the top which will have a volume of 1^3.
You are given the total volume m of the building. Being given m can you find the number n of cubes you will have to build?
The parameter of the function findNb (find_nb, find-nb, findNb) will be an integer m and you have to return the integer n such as n^3 + (n-1)^3 + ... + 1^3 = m if such a n exists or -1 if there is no such n.'
I tried to first create an arithmetic sequence then transform it into a sigma sum with the nth term of the arithmetic sequence, the get a formula which I can compare its value with m.
I used this code and work 70 - 80% fine, most of the calculations that it does are correct, but some don't.
import math
def find_nb(m):
n = 0
while n < 100000:
if (math.pow(((math.pow(n, 2))+n), 2)) == 4*m:
return n
break
n += 1
return -1
print(find_nb(4183059834009))
>>> output 2022, which is correct
print(find_nb(24723578342962))
>>> output -1, which is also correct
print(find_nb(4837083252765022010))
>>> real output -1, which is incorrect
>>> expected output 57323

As mentioned, this is a math problem, which is mainly what I am better at :).
Sorry for the in-line mathematical formula as I cannot do any math formula rendering (in SO).
I do not see any problem with your code and I believe your sample test case is wrong. However, I'll still give optimisation "tricks" below for your code to run quicker
Firstly as you know, sum of the cubes between 1^3 and n^3 is n^2(n+1)^2/4. Therefore we want to find integer solutions for the equation
n^2(n+1)^2/4 == m i.e. n^4+2n^3+n^2 - 4m=0
Running a loop for n from 1 (or in your case, 2021) to 100000 is inefficient. Firstly, if m is a large number (1e100+) the complexity of your code is O(n^0.25). Considering Python's runtime, you can run your code in time only if m is less than around 1e32.
To optimise your code, you have two approaches.
1) Use Binary Search. I will not get into the details here, but basically, you can halve the search range for a simple comparison. For the initial bounds you can use lower = 0 & upper = k. A better bound for k will be given below, but let's use k = m for now.
Complexity: O(log(k)) = O(log(m))
Feasible range for m: m < 10^(3e7)
2) Use the almighty Newton-Raphson!! Using the iteration formula x_(n+1) = x_n - f(x_n) / f'(x_n), where f'(x) can be calculated explicitly, and a reasonable initial guess, let's say k = m again, the complexity is (I believe) O(log(k)) + O(1) = O(log(m)).
Complexity: O(log(k)) = O(log(m))
Feasible range for m: m < 10^(3e7)
Finally, I'll give a better initial guess for k in the above methods, also given in Ian's answer to this question. Since n^4+2n^3+n^2 = O(n^4), we can actually take k ~ m^0.25 = (m^0.5)^0.5. To calculate this, We can take k = 2^(log(k)/4) where log is base 2. The log should be O(1), but I'm not sure for big numbers/dynamic size (int in Python). Not a theorist. Using this better guess and Newton-Raphson, since the guess is in a constant range from the result, the algorithm is nearly O(1). Again, check out the links for better understanding.
Finally
Since your goal is to find whether n exists such that the equation is "exactly satisfied", use Newton-Raphson and iterate until the next guess is less than 0.5 from the current guess. If your implementation is "floppy", you can also do a range +/- 10 from the guess to ensure that you find the solution.

I think this is a Math question rather than a programming question.
Firstly, I would advise you to start iterating from a function of your input m. Right now you are initialising your n value arbitrarily (though of course it might be a requirement of the question) but I think there are ways to optimise it. Maybe, just maybe you can iterate from the cube root, so if n reaches zero or if at any point the sum becomes smaller than m you can safely assume there is no possible building that can be built.
Secondly, the equation you derived from your summation doesn't seem to be correct. I substituted your expected n and input m into the condition in your if clause and they don't match. So either 1) your equation is wrong or 2) the expected output is wrong. I suggest that you relook at your derivation of the condition. Are you using the sum of cubes factorisation? There might be some edge cases that you neglected (maybe odd n) but my Math is rusty so I can't help much.
Of course, as mentioned, the break is unnecessary and will never be executed.

Memoized to DP solution - Making Change

Recently I read a problem to practice DP. I wasn't able to come up with one, so I tried a recursive solution which I later modified to use memoization. The problem statement is as follows :-
Making Change. You are given n types of coin denominations of values
v(1) < v(2) < ... < v(n) (all integers). Assume v(1) = 1, so you can
always make change for any amount of money C. Give an algorithm which
makes change for an amount of money C with as few coins as possible.
[on problem set 4]
I got the question from here
My solution was as follows :-
def memoized_make_change(L, index, cost, d):
if index == 0:
return cost
if (index, cost) in d:
return d[(index, cost)]
count = cost / L[index]
val1 = memoized_make_change(L, index-1, cost%L[index], d) + count
val2 = memoized_make_change(L, index-1, cost, d)
x = min(val1, val2)
d[(index, cost)] = x
return x
This is how I've understood my solution to the problem. Assume that the denominations are stored in L in ascending order. As I iterate from the end to the beginning, I have a choice to either choose a denomination or not choose it. If I choose it, I then recurse to satisfy the remaining amount with lower denominations. If I do not choose it, I recurse to satisfy the current amount with lower denominations.
Either way, at a given function call, I find the best(lowest count) to satisfy a given amount.
Could I have some help in bridging the thought process from here onward to reach a DP solution? I'm not doing this as any HW, this is just for fun and practice. I don't really need any code either, just some help in explaining the thought process would be perfect.
[EDIT]
I recall reading that function calls are expensive and is the reason why bottom up(based on iteration) might be preferred. Is that possible for this problem?

Here is a general approach for converting memoized recursive solutions to "traditional" bottom-up DP ones, in cases where this is possible.
First, let's express our general "memoized recursive solution". Here, x represents all the parameters that change on each recursive call. We want this to be a tuple of positive integers - in your case, (index, cost). I omit anything that's constant across the recursion (in your case, L), and I suppose that I have a global cache. (But FWIW, in Python you should just use the lru_cache decorator from the standard library functools module rather than managing the cache yourself.)
To solve for(x):
If x in cache: return cache[x]
Handle base cases, i.e. where one or more components of x is zero
Otherwise:
Make one or more recursive calls
Combine those results into `result`
cache[x] = result
return result
The basic idea in dynamic programming is simply to evaluate the base cases first and work upward:
To solve for(x):
For y starting at (0, 0, ...) and increasing towards x:
Do all the stuff from above
However, two neat things happen when we arrange the code this way:
As long as the order of y values is chosen properly (this is trivial when there's only one vector component, of course), we can arrange that the results for the recursive call are always in cache (i.e. we already calculated them earlier, because y had that value on a previous iteration of the loop). So instead of actually making the recursive call, we replace it directly with a cache lookup.
Since every component of y will use consecutively increasing values, and will be placed in the cache in order, we can use a multidimensional array (nested lists, or else a Numpy array) to store the values instead of a dictionary.
So we get something like:
To solve for(x):
cache = multidimensional array sized according to x
for i in range(first component of x):
for j in ...:
(as many loops as needed; better yet use `itertools.product`)
If this is a base case, write the appropriate value to cache
Otherwise, compute "recursive" index values to use, look up
the values, perform the computation and store the result
return the appropriate ("last") value from cache

I suggest considering the relationship between the value you are constructing and the values you need for it.
In this case you are constructing a value for index, cost based on:
index-1 and cost
index-1 and cost%L[index]
What you are searching for is a way of iterating over the choices such that you will always have precalculated everything you need.
In this case you can simply change the code to the iterative approach:
for each choice of index 0 upwards:
for each choice of cost:
compute value corresponding to index,cost
In practice, I find that the iterative approach can be significantly faster (e.g. *4 perhaps) for simple problems as it avoids the overhead of function calls and checking the cache for preexisting values.

Python: sliding window of variable width

I'm writing a program in Python that's processing some data generated during experiments, and it needs to estimate the slope of the data. I've written a piece of code that does this quite nicely, but it's horribly slow (and I'm not very patient). Let me explain how this code works:
1) It grabs a small piece of data of size dx (starting with 3 datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope using OLS regression. If the difference is too small, it will increase dx and redo the loop with this new dx
4) This continues for all the datapoints
[See updated code further down]
For a datasize of about 100k measurements, this takes about 40 minutes, whereas the rest of the program (it does more processing than just this bit) takes about 10 seconds. I am certain there is a much more efficient way of doing these operations, could you guys please help me out?
Thanks
EDIT:
Ok, so I've got the problem solved by using only binary searches, limiting the number of allowed steps by 200. I thank everyone for their input and I selected the answer that helped me most.
FINAL UPDATED CODE:
def slope(self, data, time):
(wave1, wave2) = wt.dwt(data, "db3")
std = 2*np.std(wave2)
e = std/0.05
de = 5*std
N = len(data)
slopes = np.ones(shape=(N,))
data2 = np.concatenate((-data[::-1]+2*data[0], data, -data[::-1]+2*data[N-1]))
time2 = np.concatenate((-time[::-1]+2*time[0], time, -time[::-1]+2*time[N-1]))
for n in xrange(N+1, 2*N):
left = N+1
right = 2*N
for i in xrange(200):
mid = int(0.5*(left+right))
diff = np.abs(data2[n-mid+N]-data2[n+mid-N])
if diff >= e:
if diff < e + de:
break
right = mid - 1
continue
left = mid + 1
leftlim = n - mid + N
rightlim = n + mid - N
y = data2[leftlim:rightlim:int(0.05*(rightlim-leftlim)+1)]
x = time2[leftlim:rightlim:int(0.05*(rightlim-leftlim)+1)]
xavg = np.average(x)
yavg = np.average(y)
xlen = len(x)
slopes[n-N] = (np.dot(x,y)-xavg*yavg*xlen)/(np.dot(x,x)-xavg*xavg*xlen)
return np.array(slopes)

Your comments suggest that you need to find a better method to estimate ik+1 given ik. No knowledge of values in data would yield to the naive algorithm:
At each iteration for n, leave i at previous value, and see if the abs(data[start]-data[end]) value is less than e. If it is, leave i at its previous value, and find your new one by incrementing it by 1 as you do now. If it is greater, or equal, do a binary search on i to find the appropriate value. You can possibly do a binary search forwards, but finding a good candidate upper limit without knowledge of data can prove to be difficult. This algorithm won't perform worse than your current estimation method.
If you know that data is kind of smooth (no sudden jumps, and hence a smooth plot for all i values) and monotonically increasing, you can replace the binary search with a search backwards by decrementing its value by 1 instead.

How to optimize this will depend on some properties of your data, but here are some ideas:
Have you tried profiling the code? Using one of the Python profilers can give you some useful information about what's taking the most time. Often, a piece of code you've just written will have one biggest bottleneck, and it's not always obvious which piece it is; profiling lets you figure that out and attack the main bottleneck first.
Do you know what typical values of i are? If you have some idea, you can speed things up by starting with i greater than 0 (as #vhallac noted), or by increasing i by larger amounts — if you often see big values for i, increase i by 2 or 3 at a time; if the distribution of is has a long tail, try doubling it each time; etc.
Do you need all the data when doing the least squares regression? If that function call is the bottleneck, you may be able to speed it up by using only some of the data in the range. Suppose, for instance, that at a particular point, you need i to be 200 to see a large enough (above-noise) change in the data. But you may not need all 400 points to get a good estimate of the slope — just using 10 or 20 points, evenly spaced in the start:end range, may be sufficient, and might speed up the code a lot.

I work with Python for similar analyses, and have a few suggestions to make. I didn't look at the details of your code, just to your problem statement:
1) It grabs a small piece of data of size dx (starting with 3
datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is
larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope
using OLS regression. If the difference is too small, it will increase
dx and redo the loop with this new dx
4) This continues for all the datapoints
I think the more obvious reason for slow execution is the LOOPING nature of your code, when perhaps you could use the VECTORIZED (array-based operations) nature of Numpy.
For step 1, instead of taking pairs of points, you can perform directly `data[3:] - data[-3:] and get all the differences in a single array operation;
For step 2, you can use the result from array-based tests like numpy.argwhere(data > threshold) instead of testing every element inside some loop;
Step 3 sounds conceptually wrong to me. You say that if the difference is too small, it will increase dx. But if the difference is small, the resulting slope would be small because it IS actually small. Then, getting a small value is the right result, and artificially increasing dx to get a "better" result might not be what you want. Well, it might actually be what you want, but you should consider this. I would suggest that you calculate the slope for a fixed dx across the whole data, and then take the resulting array of slopes to select your regions of interest (for example, using data_slope[numpy.argwhere(data_slope > minimum_slope)].
Hope this helps!

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Optimization on a set of data using python - python

Related

How can I specify to the Z3's optimizer to start the search from the lower bound of the objective function?

Python odeint with array in differential equation

my code is giving me the wrong output sometimes, how to solve it?

Memoized to DP solution - Making Change

Python: sliding window of variable width

Categories

Resources