Python - speed up finding percentile of set which is greater than threshold - python

I need to find which percentile of a group of numbers is over a threshold value. Is there a way that this can be sped up? My implementation is much too slow for the intended application. In case this changes anything, I am running my program using mpirun -np 100 python program.py. I cannot use numba, as the rest of this program uses try/except statements.
import numpy as np

my_vals = []
threshold_val = 0.065
for i in range(60000):
    my_vals.append(np.random.normal(0.05, 0.02))

for i in np.arange(0, 100, 0.001):
    if np.percentile(my_vals, i) > threshold_val:
        perc = 1 * i
        break
else:
    perc = 100

Since the Gaussian (Normal) distribution produces a bell curve, you should be able to calculate the percentile with the highest probability of being the answer, check there first, and then use a modified binary search to find the lowest percentile that exceeds the threshold.
For example, if you determine that your parameters are most likely to favor e.g. 17.951 (this is an example only, I didn't actually bother computing it), then begin near that point rather than starting at 0. Treat this like a binary search - start your lower limit at 0 and your upper limit at 100.0, and set the point to bisect the list as the optimal percentile for the distribution.
If your current upper limit is over threshold_val, bisect the lower half to find the lowest such value that matches; if it is not over the threshold, bisect the upper half, and so on. So, e.g. in the range 0.000 to 100.000, if you start at 17.951 and find that it is not above the threshold, adjust the bounds to 17.952 and 100.000 and try 58.976 (halfway between). As soon as you find a value that is above the threshold, use that value as the new upper bound (since it may not yet be the lowest such value). Continue this process until the lower and upper bounds are 0.001 apart, which gives you the answer. On average, you should have to run about 17 tests rather than 100,000.
You may also be able to automate the computation of the optimal value in case your normal distribution will change, since the distribution produces a bell-curve, and you will know the statistics of that bell-curve based on the parameters anyway.
Your solution only needs to find the lowest value for which the percentile is above your threshold, so this approach should minimize the number of samples you need to check.
One more hint: np.percentile has to sort my_vals 100,000 times in your code; I do not know whether a pre-sorted list would help, but it may be worth checking (you'll probably have to test several possible sort parameters, since it doesn't appear to be documented in which direction it sorts).
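For reference, here is a minimal sketch of that bisection, assuming the same my_vals and threshold_val as in the question. It uses a plain midpoint rather than the distribution-informed starting point suggested above, and it sorts the values once up front (whether that actually speeds up np.percentile is untested):

import numpy as np

def lowest_percentile_above(vals, threshold, tol=0.001):
    # Find, to within tol, the lowest percentile whose value exceeds threshold.
    vals = np.sort(vals)                       # pre-sorting may or may not help np.percentile
    if np.percentile(vals, 100) <= threshold:  # nothing exceeds the threshold
        return 100.0
    lo, hi = 0.0, 100.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.percentile(vals, mid) > threshold:
            hi = mid                           # mid is high enough; search lower
        else:
            lo = mid                           # mid is too low; search higher
    return hi

perc = lowest_percentile_above(my_vals, threshold_val)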

You can find the solution directly by sorting the values and searching for the first value that exceeds your threshold. The percentile is the fraction of array values before this element:
import numpy as np
from bisect import bisect_right

my_vals = []
threshold_val = 0.065
for i in range(60000):
    my_vals.append(np.random.normal(0.05, 0.02))

print(bisect_right(sorted(my_vals), threshold_val) / len(my_vals) * 100)
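The same lookup can also be done entirely in numpy with np.searchsorted, which behaves like bisect_right on a sorted array (a small sketch reusing my_vals and threshold_val from above):

import numpy as np

sorted_vals = np.sort(my_vals)
# fraction of values at or below the threshold, expressed as a percentage
perc = np.searchsorted(sorted_vals, threshold_val, side='right') / len(sorted_vals) * 100
print(perc)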

Related

Suppose an array contains only two kinds of elements, how to quickly find their boundaries?

I've asked a similar question before, but this time it's different.
Since our array contains only two elements, we might as well set it to 1 and -1, where 1 is on the left side of the array and -1 is on the right side of the array:
[1,...,1,1,-1,-1,...,-1]
Both 1 and -1 are present, their counts are not necessarily equal, and both counts are very large.
Define the boundary between the 1s and the -1s as the index of the first -1 (the -1 closest to the 1s). For example, for the following array:
[1,1,1,-1,-1,-1,-1]
Its boundary is 3.
Now, for each number in the array, I cover it with a device that you have to unlock to see the number in it.
I want to unlock as few devices covering a 1 as possible, because it takes much longer to reveal a '1' than a '-1', and I also want to keep my total time cost as low as possible.
How can I search to get the boundary as quickly as possible?
The problem is very much like the "egg dropping" problem, but here a probe that reveals a 1 has a large fixed cost (100) and a probe that reveals a -1 has a small cost (1).
Let E(n) be the (optimal) expected cost of finding the index of the right-most 1 in an array (or finding that the array is all -1), assuming each possible position of the boundary is equally likely. Define the index of the right-most 1 to be -1 if the array is all -1.
If you choose to look at the array element at index i (0-based), it is -1 with probability (i+1)/(n+1) and 1 with probability (n-i)/(n+1).
So if you look at array element i, your expected cost for finding the boundary is (1 + E(i)) * (i+1)/(n+1) + (100 + E(n-i-1)) * (n-i)/(n+1).
Thus E(n) = min over i = 0..n-1 of (1 + E(i)) * (i+1)/(n+1) + (100 + E(n-i-1)) * (n-i)/(n+1), with E(0) = 0.
For each n, the i that minimizes the equation is the optimal array element to look at for an array of that length.
I don't think you can solve these equations analytically, but you can solve them with dynamic programming in O(n^2) time.
The solution is going to look like a very skewed binary search for large n. For smaller n, it'll be skewed so much that it will be a traversal from the right.
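Here is a small sketch of that dynamic program under the cost model above (1 to reveal a -1, 100 to reveal a 1); E[n] is the optimal expected cost for an array of length n, and best[n] is the index to probe first:

def solve(n_max, cost_cheap=1, cost_expensive=100):
    # E[n]: optimal expected cost for an array of length n (uniform prior on the boundary).
    # best[n]: index to probe first in an array of that length.
    E = [0.0] * (n_max + 1)
    best = [None] * (n_max + 1)
    for n in range(1, n_max + 1):
        E[n], best[n] = min(
            ((cost_cheap + E[i]) * (i + 1) / (n + 1)
             + (cost_expensive + E[n - i - 1]) * (n - i) / (n + 1), i)
            for i in range(n)
        )
    return E, best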
If I am right, a strategy that minimizes the expected cost is to probe at a fraction of the interval that favors the -1 outcome, in inverse proportion to the costs. So instead of picking the middle index, probe about one centile from the right end of the interval.
But this still corresponds to a logarithmic asymptotic complexity.
There is probably nothing that you can do regarding the worst case.
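A minimal sketch of that cost-weighted bisection; the reveal() callback and the names are illustrative, and split_frac = 100/101 puts each probe about one centile from the right end of the current interval:

def find_first_minus_one(reveal, n, split_frac=100/101):
    # reveal(i) returns the value (1 or -1) at index i; the array is a run of 1s
    # followed by a run of -1s, and both values are known to be present.
    lo, hi = 1, n - 1                              # the first -1 lies somewhere in [lo, hi]
    while lo < hi:
        mid = lo + int(split_frac * (hi - lo))     # skewed probe: a 1 costs ~100x more to reveal
        if reveal(mid) == -1:
            hi = mid                               # first -1 is at mid or earlier
        else:
            lo = mid + 1                           # mid is a 1, so the first -1 is to its right
    return lo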

Finding the significant changes in a list of values?

I have a list of values, my_list, which shows the usage of a device in different times, like below:
my_list=[0.0, 11500312.5, 12293437.5, 11896875.0, 7711186.0,
3281768.863, 3341550.1363, 3300694.0,...]
I have many lists of this type and I want to find the most significant changes (decreases or increases) over time. One of these lists is plotted below. For example, if you look at the second, third and fourth points in the graph you can see that the differences between the values are not large, but the value suddenly decreases at the fifth and sixth points. Similar significant changes happen between points 20, 21 and 22.
So you can see in the plot there are two or three significant increases and decreases relative to the other times. Any idea how to find these points automatically?
Here's an approach that might work for you. Check how the value compares to the moving average. Is it more than one standard deviation away?
Here's a moving average implementation using numpy:
import numpy as np

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / float(N)
From here
Here's an implementation of the comparison operation:
TimeSEries = [0.0, 11500312.5, 12293437.5, 11896875.0, 7711186.0,
              3281768.863, 3341550.1363, 3300694.0]
MOV = running_mean(TimeSEries, 3).tolist()
STD = np.std(MOV)
events = []
ind = []
for ii in range(len(TimeSEries)):
    try:
        if TimeSEries[ii] > MOV[ii] + STD:
            print(TimeSEries[ii])
    except IndexError:
        pass
From here

Finding and ranking intervals of data

Every time I ride my bike a gather second by second data on a number of metrics. For simplicity, lets pretend that I have a csv file that looks something like:
secs, watts,
1,150
2,151
3,149
4,135
.
.
.
7000,160
So, every second of my ride has an associated power value, in watts.
I want to know: "If I break my ride into N-second blocks, which blocks have the highest average power?"
I am using a pandas dataframe to manage my data, and this is the code I have been using to answer my question:
def bestEffort(ride_data,
               metric='watts',
               interval_length=5,
               sort_descending=True):
    seconds_in_ride = len(ride_data[metric])
    average_interval_list = [[i + 1,
                              np.average(
                                  [ride_data[metric][i + j]
                                   for j in range(interval_length)])
                              ]
                             for i in range(0,
                                            seconds_in_ride -
                                            interval_length)]
    average_interval_list.sort(key=lambda x: x[1], reverse=sort_descending)
    return average_interval_list
Seems simple, right? Given an index, compute the average value of the interval_length subsequent entries. Keep track of this in a list of the form
[[second 1, avg val of metric over the interval starting that second],
[second 2, avg val of metric over the interval starting that second],
[second 3, avg val of metric over the interval starting that second],
.
.
.
[second 7000-interval_length, avg val of metric over the interval starting that second]]
Then, I sort the resulting list by the average values. So the first entry is of the form
[second_n, avg val of metric over the interval starting in second n]
telling me that my strongest effort over the given interval length started at second_n in my workout.
The problem is that if I set "interval_length" to anything higher than 30, this computation takes forever (read: over two minutes on a decent machine). Please help me find where my code is hitting a bottleneck; this seems like it should be way faster.
If you put your data in a numpy array, say watts, you can compute the mean power using convolve:
mean_power = np.convolve(watts, np.ones(interval_length)/interval_length, mode='valid')
As you can see in the reference of np.convolve, this function computes a local mean of the first argument, smoothed with a window defined by the second argument. Here we smooth with a "top-hat" function--i.e. an "on/off" function which is constant over an interval of length interval_length, and zero otherwise. This is rudimentary but gives a first estimate.
Then the time of your strongest effort is:
time_strongest_effort = np.argmax(mean_power)
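For instance, a small end-to-end sketch; the file name 'ride.csv' and the exact column handling are assumptions based on the CSV layout in the question:

import numpy as np
import pandas as pd

ride_data = pd.read_csv('ride.csv', skipinitialspace=True)   # columns: secs, watts
watts = ride_data['watts'].to_numpy()
interval_length = 30

mean_power = np.convolve(watts, np.ones(interval_length) / interval_length, mode='valid')
start_second = np.argmax(mean_power)     # index where the best interval starts
best_average = mean_power[start_second]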
Here's a pure-pandas solution using DataFrame.rolling. It's slightly slower than the numpy convolution approach by #BenBoulderite, but is a convenient idiom:
df.rolling(interval_length).mean().shift(-(interval_length - 1))
The .shift() is needed to align the rolling-mean values so that the results are aligned to the left edge of the rolling window, instead of the default right edge (docs on DataFrame.rolling).
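As a usage sketch, assuming df is the DataFrame loaded from the ride CSV with a 'watts' column:

rolling_mean = df['watts'].rolling(interval_length).mean().shift(-(interval_length - 1))
start_second = rolling_mean.idxmax()     # row label where the best interval starts
best_average = rolling_mean.max()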

Python: sliding window of variable width

I'm writing a program in Python that's processing some data generated during experiments, and it needs to estimate the slope of the data. I've written a piece of code that does this quite nicely, but it's horribly slow (and I'm not very patient). Let me explain how this code works:
1) It grabs a small piece of data of size dx (starting with 3 datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope using OLS regression. If the difference is too small, it will increase dx and redo the loop with this new dx
4) This continues for all the datapoints
[See updated code further down]
For a datasize of about 100k measurements, this takes about 40 minutes, whereas the rest of the program (it does more processing than just this bit) takes about 10 seconds. I am certain there is a much more efficient way of doing these operations, could you guys please help me out?
Thanks
EDIT:
Ok, so I've got the problem solved by using only binary searches, limiting the number of allowed steps to 200. I thank everyone for their input and I selected the answer that helped me most.
FINAL UPDATED CODE:
def slope(self, data, time):
    # Assumes numpy imported as np and a wavelet module imported as wt
    # (presumably PyWavelets: import pywt as wt).
    (wave1, wave2) = wt.dwt(data, "db3")
    std = 2 * np.std(wave2)
    e = std / 0.05
    de = 5 * std
    N = len(data)
    slopes = np.ones(shape=(N,))
    # Mirror the data and time arrays at both ends to handle the boundaries.
    data2 = np.concatenate((-data[::-1] + 2*data[0], data, -data[::-1] + 2*data[N-1]))
    time2 = np.concatenate((-time[::-1] + 2*time[0], time, -time[::-1] + 2*time[N-1]))
    for n in range(N + 1, 2*N):
        left = N + 1
        right = 2*N
        # Binary search (at most 200 steps) for a window whose difference is
        # just above the noise threshold e.
        for i in range(200):
            mid = int(0.5 * (left + right))
            diff = np.abs(data2[n - mid + N] - data2[n + mid - N])
            if diff >= e:
                if diff < e + de:
                    break
                right = mid - 1
                continue
            left = mid + 1
        leftlim = n - mid + N
        rightlim = n + mid - N
        # Subsample the window (roughly 20 points) and compute the OLS slope.
        y = data2[leftlim:rightlim:int(0.05*(rightlim - leftlim) + 1)]
        x = time2[leftlim:rightlim:int(0.05*(rightlim - leftlim) + 1)]
        xavg = np.average(x)
        yavg = np.average(y)
        xlen = len(x)
        slopes[n - N] = (np.dot(x, y) - xavg*yavg*xlen) / (np.dot(x, x) - xavg*xavg*xlen)
    return np.array(slopes)
Your comments suggest that you need to find a better method to estimate ik+1 given ik. With no knowledge of the values in data, a naive algorithm would be:
At each iteration for n, leave i at previous value, and see if the abs(data[start]-data[end]) value is less than e. If it is, leave i at its previous value, and find your new one by incrementing it by 1 as you do now. If it is greater, or equal, do a binary search on i to find the appropriate value. You can possibly do a binary search forwards, but finding a good candidate upper limit without knowledge of data can prove to be difficult. This algorithm won't perform worse than your current estimation method.
If you know that data is kind of smooth (no sudden jumps, and hence a smooth plot for all i values) and monotonically increasing, you can replace the binary search with a search backwards by decrementing its value by 1 instead.
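A rough sketch of that idea; window_ok(i) is a hypothetical stand-in for the "is abs(data[start] - data[end]) at least e" test from the question, and the binary search assumes the test is monotone in i:

def next_window(i_prev, window_ok):
    # Start from the previous window size rather than from the minimum.
    if window_ok(i_prev):
        # Previous size already passes: binary-search downwards for the smallest size that passes.
        lo, hi = 1, i_prev
        while lo < hi:
            mid = (lo + hi) // 2
            if window_ok(mid):
                hi = mid
            else:
                lo = mid + 1
        return lo
    # Previous size is too small: grow by one step at a time, as in the original code.
    i = i_prev + 1
    while not window_ok(i):
        i += 1
    return i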
How to optimize this will depend on some properties of your data, but here are some ideas:
Have you tried profiling the code? Using one of the Python profilers can give you some useful information about what's taking the most time. Often, a piece of code you've just written will have one biggest bottleneck, and it's not always obvious which piece it is; profiling lets you figure that out and attack the main bottleneck first.
Do you know what typical values of i are? If you have some idea, you can speed things up by starting with i greater than 0 (as #vhallac noted), or by increasing i by larger amounts: if you often see big values for i, increase i by 2 or 3 at a time; if the distribution of i values has a long tail, try doubling it each time; etc.
Do you need all the data when doing the least squares regression? If that function call is the bottleneck, you may be able to speed it up by using only some of the data in the range. Suppose, for instance, that at a particular point, you need i to be 200 to see a large enough (above-noise) change in the data. But you may not need all 400 points to get a good estimate of the slope — just using 10 or 20 points, evenly spaced in the start:end range, may be sufficient, and might speed up the code a lot.
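For the last suggestion, a hedged sketch of fitting the slope on a thinned window; np.polyfit is used here only as a stand-in for the OLS step:

import numpy as np

def subsampled_slope(x, y, start, end, max_points=20):
    # Use at most max_points evenly spaced samples from [start, end) instead of every point.
    step = max(1, (end - start) // max_points)
    xs = x[start:end:step]
    ys = y[start:end:step]
    slope, intercept = np.polyfit(xs, ys, 1)   # degree-1 least-squares fit
    return slope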
I work with Python for similar analyses, and have a few suggestions to make. I didn't look at the details of your code, just to your problem statement:
1) It grabs a small piece of data of size dx (starting with 3
datapoints)
2) It evaluates whether the difference (i.e. |y(x+dx)-y(x-dx)| ) is
larger than a certain minimum value (40x std. dev. of noise)
3) If the difference is large enough, it will calculate the slope
using OLS regression. If the difference is too small, it will increase
dx and redo the loop with this new dx
4) This continues for all the datapoints
I think the more obvious reason for slow execution is the LOOPING nature of your code, when perhaps you could use the VECTORIZED (array-based operations) nature of Numpy.
For step 1, instead of taking pairs of points, you can compute data[3:] - data[:-3] directly and get all the differences in a single array operation;
For step 2, you can use the result from array-based tests like numpy.argwhere(data > threshold) instead of testing every element inside some loop;
Step 3 sounds conceptually wrong to me. You say that if the difference is too small, it will increase dx. But if the difference is small, the resulting slope would be small because it IS actually small. Then, getting a small value is the right result, and artificially increasing dx to get a "better" result might not be what you want. Well, it might actually be what you want, but you should consider this. I would suggest that you calculate the slope for a fixed dx across the whole data, and then take the resulting array of slopes to select your regions of interest (for example, using data_slope[numpy.argwhere(data_slope > minimum_slope)]).
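Putting those three ideas together, a rough vectorized sketch; data, time and the noise threshold e are assumed to be defined as in the question, np.gradient stands in for the fixed-dx slope, and minimum_slope is the example threshold named above:

import numpy as np

diffs = np.abs(data[3:] - data[:-3])            # step 1: all the differences in one array operation
big_change = np.argwhere(diffs > e).ravel()     # step 2: indices where the change exceeds the noise level
data_slope = np.gradient(data, time)            # step 3: slope at a fixed resolution everywhere
regions = data_slope[np.argwhere(data_slope > minimum_slope).ravel()]   # select regions of interest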
Hope this helps!

Grouping arbitrary arrays of data into N bins

I want to group an arbitrary-sized array of random values into n groups, such that the sum of values in any one group/bin is as equal as possible.
So for values [1, 2, 4, 5] and n = 2, the output buckets should be [sum(5+1), sum(4+2)].
Some possibilities that occur to me:
Full exhaustive breadth first search
Random processes with stopping conditions hard coded
Start from one end of the sorted array, grouping until the sum is equal to the global average, and move to the next group until n is reached
Seems like the optimal solution (where the sum of the contents of the bins are as equal as possible given the input array) is probably non-trivial; so at the moment I'm leaning towards the last option, but have the feeling I am possibly missing more elegant solutions?
This is an NP-hard problem. In other words, there is no known polynomial-time algorithm that guarantees an optimal solution; a brute-force search has to explore on the order of n^M combinations (where M is the size of your array and n is the number of bins). It is very similar to clustering, which is also NP-hard.
If your data set is small enough to deal with, a brute force algorithm is best (explore all combinations).
However, if your data set is big, you'll want a polynomial-time algorithm that won't get you the optimal solution, but a good approximation. In that case, I suggest you use something similar to K-Means...
Step 1. Calculate the expected sum per bin. Let A be your array, then the expected sum per bin is SumBin = SUM(A) / n (the sum of all elements in your array over the number of bins).
Step 2. Put all elements of your array in some collection (e.g. another array) that we'll call The Bag (this is just conceptual, so you can follow the next steps).
Step 3. Partition The Bag into n groups (preferably randomly, so that each element ends up in some bin i with probability 1/n). At this point, your bins have all the elements, and The Bag is empty.
Step 4. Calculate the sum for each bin. If the result is the same as in the last iteration, exit. (This is the expectation step of K-Means.)
Step 5. For each bin i, if its sum is greater than SumBin, pick the first element greater than SumBin and put it back in The Bag; if its sum is less than SumBin, pick the first element less than SumBin and put it back in The Bag. This is the gradient descent step (aka maximization step) of K-Means.
Step 6. Go to step 3.
This algorithm is just an approximation, but it's fast and guaranteed to converge.
If you are skeptical about a randomized algorithm like the above, then after the first iteration, when you are back at step 3, instead of assigning elements randomly you can do so optimally by running the Hungarian algorithm, but I am not sure that would guarantee better overall results.
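For comparison, here is a much simpler greedy heuristic (the classic longest-processing-time rule, not the K-Means-style procedure above): sort the values in decreasing order and always add the next value to the bin with the smallest current sum. It is only an approximation too, but it is short and easy to check:

import heapq

def greedy_partition(values, n):
    # Assign each value, largest first, to the bin with the smallest current sum.
    bins = [[] for _ in range(n)]
    heap = [(0, i) for i in range(n)]          # (current sum, bin index)
    heapq.heapify(heap)
    for v in sorted(values, reverse=True):
        total, i = heapq.heappop(heap)
        bins[i].append(v)
        heapq.heappush(heap, (total + v, i))
    return bins

print(greedy_partition([1, 2, 4, 5], 2))       # [[5, 1], [4, 2]], matching the example in the question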
