Python: Finding the period of two-column data - python

This question seems trivial, but I didn't find any suitable answer, so I am asking!
Let's say I have two-column data (say, {x, sin(x)}):
X Y(X)
0.0 0.0
0.1 0.099
0.2 0.1986
How do I find the period of the function Y(X)?
I have some experience in Mathematica, where (roughly) I would:
interpolate the data as a function, say y(x), then
calculate y'(x) and solve y'(x_p) = 0;
collect all the differences (x_{p+1} - x_p) and take their average to get the period.
In Python, however, I am stuck after step 1: I can find x_p for a particular guess value, but not all of the x_p's. Also, this procedure doesn't seem very elegant to me. Is there a better way to do this in Python?
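(As an aside: the same interpolate-and-find-extrema procedure can be written with scipy's splines. This is a minimal sketch, not from the question; the spline order and the sin(x) test data are my own assumptions.)

import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# toy data standing in for the two columns
x = np.arange(0, 20, 0.1)
y = np.sin(x)

spline = InterpolatedUnivariateSpline(x, y, k=4)   # k=4 so the derivative is a cubic spline
x_p = spline.derivative().roots()                  # all x_p where y'(x_p) = 0
period = 2 * np.mean(np.diff(x_p))                 # consecutive extrema are half a period apart
print(period)                                      # ~6.28 for sin(x)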

For calculating periods I would just find the peak of the Fourier-transformed data; to do that in Python, look into scipy.fft. It could be computationally intensive, though.
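A minimal sketch of that idea using numpy's FFT (the evenly spaced x values and the sine test data are assumptions on my part):

import numpy as np

def estimate_period(x, y):
    """Estimate the dominant period of y(x) from the peak of its FFT.
    Assumes x is evenly spaced."""
    dx = x[1] - x[0]
    spectrum = np.abs(np.fft.rfft(y - np.mean(y)))   # subtract the mean to suppress the DC bin
    freqs = np.fft.rfftfreq(len(y), d=dx)
    peak = np.argmax(spectrum[1:]) + 1               # skip the zero-frequency bin
    return 1.0 / freqs[peak]

x = np.arange(0, 50, 0.1)
print(estimate_period(x, np.sin(x)))                 # ~6.28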

Related

OR-Tools solution to partition the data so that the subset of rows where every feature falls in the corresponding range maximizes the objective function

Cross-posted from https://cs.stackexchange.com/questions/153558/find-a-range-of-values-to-subset-the-rows-to-maximize-the-objective-function?noredirect=1#comment323025_153558.
I have searched around for some time but couldn't find an example similar to my problem.
It looks common enough that I would expect it to be solved. It lies between search and optimization/regression.
The goal is to find a range of values for each feature, so that the subset of rows where every feature falls in the corresponding range maximizes the objective function.
Assume we have a matrix with responses Y_i and a corresponding set of features X_i (say around 40 features).
The number of samples is relatively large, 100k+.
[Example table omitted.]
So in this case, over the full data, sum(Y_i) = 73 and mean(Y_i) = 6.0833.
The problem is to:
maximize sum(Y_i) subject to:
mean(Y_i) > 7
number of selected rows > 5000
where i indexes the selected rows, and rows are selected by imposing 2 constraints (< and >) on each feature.
I have managed to get a solution using DEoptim in R for 5-6 variables with the 2 conditions (partitions) "<" and ">". With more features it gets slow or fails to converge.
Seeing the (somewhat) similar question (and answer) here: Pandas find subset of rows minimizing the sum of a column under other column constraint,
I am wondering if there is a way to formulate my problem in OR-Tools as well. I have gone through the documentation at https://developers.google.com/optimization but still struggle to understand how to express my problem.
I would appreciate any pointers on how to formulate (and solve) this problem in OR-Tools in the general case: given a dataset with features plus a response variable, find the splits on the features that maximize (or minimize) the sum (or some other function) of the response variable.
The number of splits should be 2 per feature, since we want the solution to be locally monotonic with respect to the features.
Thanks.
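For what it's worth, below is a minimal CP-SAT sketch of one possible formulation, written as an illustration rather than taken from an existing answer. It uses a Boolean indicator per row and integer lower/upper split variables per feature; the data, sizes, and thresholds are toy stand-ins, and the model only enforces the forward implication (a selected row must lie inside the box), not that every in-box row is selected.

# Minimal CP-SAT sketch (illustrative, not from the post); toy integer data.
import numpy as np
from ortools.sat.python import cp_model

rng = np.random.default_rng(0)
n_rows, n_feat = 200, 3                                 # toy sizes for illustration
X = rng.integers(0, 100, size=(n_rows, n_feat))         # features (CP-SAT needs integers)
Y = rng.integers(0, 15, size=n_rows)                    # integer response

model = cp_model.CpModel()
z = [model.NewBoolVar(f"z{i}") for i in range(n_rows)]              # row i selected?
lo = [model.NewIntVar(0, 100, f"lo{j}") for j in range(n_feat)]     # lower split, feature j
hi = [model.NewIntVar(0, 100, f"hi{j}") for j in range(n_feat)]     # upper split, feature j

for j in range(n_feat):
    model.Add(lo[j] <= hi[j])

# row selected  =>  every feature lies inside [lo_j, hi_j]
for i in range(n_rows):
    for j in range(n_feat):
        model.Add(lo[j] <= int(X[i, j])).OnlyEnforceIf(z[i])
        model.Add(int(X[i, j]) <= hi[j]).OnlyEnforceIf(z[i])

MIN_ROWS, MEAN_MIN = 20, 7                               # stand-ins for 5000 and 7
model.Add(sum(z) >= MIN_ROWS)
# mean(Y) over selected rows >= MEAN_MIN, written linearly
model.Add(sum(int(Y[i]) * z[i] for i in range(n_rows)) >= MEAN_MIN * sum(z))

model.Maximize(sum(int(Y[i]) * z[i] for i in range(n_rows)))

solver = cp_model.CpSolver()
solver.parameters.max_time_in_seconds = 30
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print("rows selected:", sum(solver.BooleanValue(v) for v in z))
    print("splits:", [(solver.Value(a), solver.Value(b)) for a, b in zip(lo, hi)])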

Implement variational approach for budget closure with 2 constraints in python

I'm new to Python and quite helpless with a problem I have to solve:
I have two budget equations, let's say a+b+c+d=Res1 and a+c+e+f=Res2. Every term has a specific standard deviation (a_std, b_std, ...), and I want to distribute the budget residuals Res1 and Res2 onto the individual terms relative to their uncertainty (see the equation below), so that a_new+b_new+c_new+d_new=0 and a_new+c_new+e_new+f_new=0.
For a single budget equation I'm able to solve the problem and get the terms a_new, b_new, c_new and d_new. But how can I add the second constraint to also get e_new and f_new?
E.g. I calculate a_new = a - (a_std^2 / (a_std^2 + b_std^2 + c_std^2 + d_std^2)) * Res1; however, this only depends on the first equation, whereas I want a to be modified in a way that also satisfies the second equation.
I appreciate any help or ideas on how to approach this problem.
Thanks in advance,
Sue
Edit:
What I have so far:
import numpy as np

def var_close(a, a_std, b, b_std, c, c_std, d, d_std, e, e_std, f, f_std, g, g_std):
    # residual and summed variance of the first budget (a+b+c+d+e)
    x = [a, b, c, d, e]
    Res1 = np.sum(x)
    std_ges1 = a_std**2 + b_std**2 + c_std**2 + d_std**2 + e_std**2
    # residual and summed variance of the second budget (a+c+f+g)
    y = [a, c, f, g]
    Res2 = np.sum(y)
    std_ges2 = a_std**2 + c_std**2 + f_std**2 + g_std**2
    # distribute each residual proportionally to the term variances
    a_new = a - ((a_std**2) / std_ges1) * Res1
    b_new = b - ((b_std**2) / std_ges1) * Res1
    c_new = c - ((c_std**2) / std_ges1) * Res1
    d_new = d - ((d_std**2) / std_ges1) * Res1
    e_new = e - ((e_std**2) / std_ges1) * Res1
    a_new2 = a - ((a_std**2) / std_ges2) * Res2
    c_new2 = c - ((c_std**2) / std_ges2) * Res2
    f_new = f - ((f_std**2) / std_ges2) * Res2
    g_new = g - ((g_std**2) / std_ges2) * Res2
    return a_new, b_new, c_new, d_new, e_new, a_new2, c_new2, f_new, g_new
But like this, e.g., a_new and a_new2 come out slightly different, whereas I want them to be equal, with the other terms modified according to their uncertainty.
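One way to treat both equations at once is the standard variance-weighted least-squares (Lagrange multiplier) closure: minimize the sum of (adjustment/std)^2 subject to both budgets closing simultaneously. Below is a minimal sketch of that approach with toy numbers of my own, not the poster's data; for a single budget it reduces to the formula used in the code above.

import numpy as np

def close_budgets(x, std, A):
    """Return adjusted terms x_new with A @ x_new = 0, minimizing the
    variance-weighted squared adjustments (Lagrange-multiplier solution)."""
    S = np.diag(std**2)                           # diagonal covariance of the terms
    resid = A @ x                                 # current budget residuals (Res1, Res2)
    K = S @ A.T @ np.linalg.inv(A @ S @ A.T)      # gain matrix
    return x - K @ resid

# terms a, b, c, d, e, f, g with toy values and standard deviations
x   = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
std = np.array([0.5, 1.0, 0.2, 0.8, 0.3, 0.6, 0.4])

# budget 1: a+b+c+d+e = Res1, budget 2: a+c+f+g = Res2 (as in the function above)
A = np.array([[1, 1, 1, 1, 1, 0, 0],
              [1, 0, 1, 0, 0, 1, 1]], dtype=float)

x_new = close_budgets(x, std, A)
print(A @ x_new)    # both residuals are now ~0, and each shared term is adjusted only once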

Optimization of multiple functions

I have 3 functions of 6 variables (p1, p2, p3, p4, p5, p6). The value of each function is equal to x (say):
f1 = sgn(2-p1)*sqrt(abs(2-p1)) + sgn(2-p2)*sqrt(abs(2-p2)) + sgn(2-p3)*sqrt(abs(2-p3))
f2 = sgn(p4-2)*sqrt(abs(p4-2)) + sgn(p5-2)*sqrt(abs(p5-2)) + sgn(p6-2)*sqrt(abs(p6-2))
f3 = sgn(p1-p4)*sqrt(abs(p1-p4)) + sgn(p2-p5)*sqrt(abs(p2-p5)) + sgn(p3-p6)*sqrt(abs(p3-p6))
I want to find the combination of values of p1,p2,p3,p4,p5 and p6 for which x is maximum. Constraints are:
0 <= p1,p2,p3,p4,p5,p6 <= 4
Simply varying every variable from 0 to 4 in small steps is not a good solution. Can someone suggest an efficient method to optimize this (preferably in Python)?
This is a non-linear optimization problem without an obvious closed-form solution. It might be better to ask this question in another forum.
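That said, if the intent is to maximize a common value x with f1 = f2 = f3 = x, one numerical route is scipy's SLSQP with equality constraints and a multistart loop (the problem is non-convex and the sqrt terms have infinite slope at zero, so convergence is not guaranteed). A rough sketch, with all names and starting points my own:

import numpy as np
from scipy.optimize import minimize

def signed_sqrt(v):
    return np.sign(v) * np.sqrt(np.abs(v))

def f1(p): return signed_sqrt(2 - p[0]) + signed_sqrt(2 - p[1]) + signed_sqrt(2 - p[2])
def f2(p): return signed_sqrt(p[3] - 2) + signed_sqrt(p[4] - 2) + signed_sqrt(p[5] - 2)
def f3(p): return signed_sqrt(p[0] - p[3]) + signed_sqrt(p[1] - p[4]) + signed_sqrt(p[2] - p[5])

# decision vector v = [p1..p6, x]; maximizing x means minimizing -x
objective = lambda v: -v[6]
constraints = [
    {"type": "eq", "fun": lambda v: f1(v[:6]) - v[6]},
    {"type": "eq", "fun": lambda v: f2(v[:6]) - v[6]},
    {"type": "eq", "fun": lambda v: f3(v[:6]) - v[6]},
]
bounds = [(0, 4)] * 6 + [(None, None)]

best = None
rng = np.random.default_rng(0)
for _ in range(20):                          # multistart over random initial points
    v0 = np.concatenate([rng.uniform(0, 4, 6), [0.0]])
    res = minimize(objective, v0, method="SLSQP", bounds=bounds, constraints=constraints)
    if res.success and (best is None or res.fun < best.fun):
        best = res
if best is not None:
    print("p =", best.x[:6], " x =", -best.fun)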

matching a multi-variable function with 2 bounded unknown variables to a value with graphical representation

My question is about matching this function, N = 0.13*(s^a), where s and a are variables, to a value. I am trying to find all values of s and a that satisfy N = 100 and N = 10,000,000. s is bounded from 0 to 101 and a is bounded from 3 to 8. I would like to visualize the results, possibly by graphing them with the axes being s and a, like a 2D plot. The algorithms I found that were similar to what I need all seemed to want to find the minimum or maximum of a function instead of matching it to a value. I have hit a wall and I don't know if my coding skills are good enough to write my own algorithm. Any help would be greatly appreciated! Thanks in advance!
This can easily be converted to a minimization problem. Simply minimize this function:
abs(0.13 * s^a - 100)
Replace the 100 with 10,000,000 for the second part. It will take some modification to find all values of s and a rather than just one pair; this could be done by fixing an s value, minimizing over a, and then repeating for different s values.
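Alternatively, since the solutions form a curve in the (s, a) plane, you can get both the full solution set and the plot at once by drawing the level sets; for any fixed a the exact solution is s = (N/0.13)**(1/a). A minimal matplotlib sketch (mine, not from the answer above):

import numpy as np
import matplotlib.pyplot as plt

s = np.linspace(0.01, 101, 500)        # start just above 0 so s**a stays well behaved
a = np.linspace(3, 8, 500)
S, A = np.meshgrid(s, a)
N = 0.13 * S**A

# level curves N = 100 and N = 10,000,000 in the (s, a) plane
cs = plt.contour(S, A, N, levels=[100, 1e7], colors=["C0", "C1"])
plt.clabel(cs, fmt={100: "N = 100", 1e7: "N = 1e7"})
plt.xlabel("s")
plt.ylabel("a")
plt.title("Level sets of N = 0.13 * s**a")
plt.show()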

Python: sliding window of variable width

I'm writing a program in Python that's processing some data generated during experiments, and it needs to estimate the slope of the data. I've written a piece of code that does this quite nicely, but it's horribly slow (and I'm not very patient). Let me explain how this code works:
1) It grabs a small piece of data of size dx (starting with 3 datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope using OLS regression. If the difference is too small, it will increase dx and redo the loop with this new dx
4) This continues for all the datapoints
[See updated code further down]
For a datasize of about 100k measurements, this takes about 40 minutes, whereas the rest of the program (it does more processing than just this bit) takes about 10 seconds. I am certain there is a much more efficient way of doing these operations, could you guys please help me out?
Thanks
EDIT:
Ok, so I've solved the problem by using only binary searches and limiting the number of allowed steps to 200. I thank everyone for their input, and I selected the answer that helped me most.
FINAL UPDATED CODE:
def slope(self, data, time):
    # assumes the usual imports from the rest of the program,
    # e.g. numpy as np and a wavelet module as wt (pywt-style dwt)
    (wave1, wave2) = wt.dwt(data, "db3")
    std = 2*np.std(wave2)                 # noise estimate from the detail coefficients
    e = std/0.05                          # minimum required end-to-end difference
    de = 5*std                            # tolerance band above the threshold
    N = len(data)
    slopes = np.ones(shape=(N,))
    # mirror the data and time arrays at both ends to handle the boundaries
    data2 = np.concatenate((-data[::-1]+2*data[0], data, -data[::-1]+2*data[N-1]))
    time2 = np.concatenate((-time[::-1]+2*time[0], time, -time[::-1]+2*time[N-1]))
    for n in xrange(N+1, 2*N):
        # binary search (at most 200 steps) for a window just wide enough
        # that the end-to-end difference exceeds e
        left = N+1
        right = 2*N
        for i in xrange(200):
            mid = int(0.5*(left+right))
            diff = np.abs(data2[n-mid+N] - data2[n+mid-N])
            if diff >= e:
                if diff < e + de:
                    break
                right = mid - 1
                continue
            left = mid + 1
        leftlim = n - mid + N
        rightlim = n + mid - N
        # subsample ~20 points in the window and fit the slope by least squares
        y = data2[leftlim:rightlim:int(0.05*(rightlim-leftlim)+1)]
        x = time2[leftlim:rightlim:int(0.05*(rightlim-leftlim)+1)]
        xavg = np.average(x)
        yavg = np.average(y)
        xlen = len(x)
        slopes[n-N] = (np.dot(x, y) - xavg*yavg*xlen) / (np.dot(x, x) - xavg*xavg*xlen)
    return np.array(slopes)
Your comments suggest that you need a better method to estimate i_{k+1} given i_k. With no knowledge of the values in data, a naive algorithm would be:
At each iteration for n, leave i at its previous value and check whether abs(data[start]-data[end]) is less than e. If it is, keep i at its previous value and find the new one by incrementing it by 1, as you do now. If it is greater or equal, do a binary search on i to find the appropriate value. You could possibly do the binary search forwards, but finding a good candidate upper limit without knowledge of the data can prove difficult. This algorithm won't perform worse than your current estimation method.
If you know that the data is reasonably smooth (no sudden jumps, and hence a smooth plot for all i values) and monotonically increasing, you can replace the binary search with a backwards search, decrementing i by 1 instead.
How to optimize this will depend on some properties of your data, but here are some ideas:
Have you tried profiling the code? Using one of the Python profilers can give you some useful information about what's taking the most time. Often a piece of code you've just written has one biggest bottleneck, and it's not always obvious which piece it is; profiling lets you figure that out and attack the main bottleneck first (a minimal profiling sketch follows after these suggestions).
Do you know what typical values of i are? If you have some idea, you can speed things up by starting with i greater than 0 (as @vhallac noted), or by increasing i in larger steps: if you often see big values for i, increase i by 2 or 3 at a time; if the distribution of i values has a long tail, try doubling it each time; etc.
Do you need all the data when doing the least-squares regression? If that function call is the bottleneck, you may be able to speed it up by using only some of the data in the range. Suppose, for instance, that at a particular point you need i to be 200 to see a large enough (above-noise) change in the data. You may not need all 400 points to get a good estimate of the slope; just 10 or 20 points, evenly spaced in the start:end range, may be sufficient and might speed up the code a lot.
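Regarding the profiling suggestion above, a minimal sketch (the profiled call is hypothetical and simply stands in for the slope routine):

import cProfile
import pstats

cProfile.run("slopes = slope(data, time)", "slope.prof")              # hypothetical call to profile
pstats.Stats("slope.prof").sort_stats("cumulative").print_stats(10)   # 10 most expensive calls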
I work with Python for similar analyses and have a few suggestions. I didn't look at the details of your code, just at your problem statement:
1) It grabs a small piece of data of size dx (starting with 3 datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope using OLS regression. If the difference is too small, it will increase dx and redo the loop with this new dx
4) This continues for all the datapoints
I think the most obvious reason for the slow execution is the LOOPING nature of your code, when you could perhaps use the VECTORIZED (array-based operations) nature of NumPy.
For step 1, instead of taking pairs of points, you can compute `data[3:] - data[:-3]` directly and get all the differences in a single array operation;
For step 2, you can use the result from array-based tests like numpy.argwhere(data > threshold) instead of testing every element inside some loop;
Step 3 sounds conceptually wrong to me. You say that if the difference is too small, it will increase dx. But if the difference is small, the resulting slope is small because it IS actually small. Then getting a small value is the right result, and artificially increasing dx to get a "better" result might not be what you want. Well, it might actually be what you want, but you should consider this. I would suggest that you calculate the slope for a fixed dx across the whole data, and then use the resulting array of slopes to select your regions of interest (for example, using data_slope[numpy.argwhere(data_slope > minimum_slope)]).
Hope this helps!
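To illustrate the vectorized idea above, a minimal sketch with a fixed half-width dx (the toy data is mine, and the adaptive-window logic of the original code is deliberately left out):

import numpy as np

def vectorized_slopes(y, t, dx):
    """Central-difference slope estimate with a fixed half-width dx,
    computed with array operations instead of a Python loop."""
    dy = y[2*dx:] - y[:-2*dx]           # y(x+dx) - y(x-dx) at every interior point
    dt = t[2*dx:] - t[:-2*dx]
    return dy / dt                      # slopes for points dx .. len(y)-dx-1

# toy usage
t = np.linspace(0, 10, 1000)
y = np.sin(t) + 0.01 * np.random.randn(t.size)
slopes = vectorized_slopes(y, t, dx=3)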
