Finding and ranking intervals of data

Finding and ranking intervals of data - python

Every time I ride my bike a gather second by second data on a number of metrics. For simplicity, lets pretend that I have a csv file that looks something like:
secs, watts,
1,150
2,151
3,149
4,135
.
.
.
7000,160
So, every second of my ride has an associated power value, in watts.
I want to know "If I break my ride into N second blocks, which blocks have the realize in the highest average power?"
I am using a pandas dataframe to manage my data, and this is the code I have been using to answer my question:
def bestEffort(ride_data,
metric='watts',
interval_length=5,
sort_descending=True):
seconds_in_ride = len(ride_data[metric])
average_interval_list = [[i+1,
np.average(
[ride_data[metric][i+j]
for j in range(interval_length)])
]
for i in range(0,
seconds_in_ride -
interval_length)]
average_interval_list.sort(key=lambda x: x[1], reverse=sort_descending)
return average_interval_list
Seems simple? Right? Given an index, compute the average value of the interval_length subsequent entries. Keep track of this in a list of the form
[[second 1, avg val of metric over the interval starting that second],
[second 2, avg val of metric over the interval starting that second],
[second 3, avg val of metric over the interval starting that second],
.
.
.
[second 7000-interval_length, avg val of metric over the interval starting that second]]
Then, I sort the resulting list by the average values. So the first entry is of the form
[second_n, avg val of metric over the interval starting in second n]
telling me that my strongest effort over the given interval length started at second_n in my workout.
The problem is that if I set "interval_length" to anything higher than 30, this computation takes forever (read: over two minutes on a decent machine). Please, help me find where my code is hitting a bottleneck, this seems like it should be way faster.

If you put your data in a numpy array, say watts, you can compute the mean power using convolve:
mean_power = np.convolve(watts, np.ones(interval_length)/interval_length, mode='valid')
As you can see in the reference of np.convolve, this function computes a local mean of the first argument, smoothed with a window defined by the second argument. Here we smooth with a "top-hat" function--i.e. an "on/off" function which is constant over an interval of length interval_length, and zero otherwise. This is rudimentary but gives a first estimate.
Then the time of your strongest effort is:
time_strongest_effort = np.argmax(mean_power)

Here's a pure-pandas solution using DataFrame.rolling. It's slightly slower than the numpy convolution approach by #BenBoulderite, but is a convenient idiom:
df.rolling(interval_length).mean().shift(-(interval_length - 1))
The .shift() is needed to align the rolling-mean values so that the results are aligned to the left edge of the rolling window, instead of the default right edge (docs on DataFrame.rolling).

Related

find frequency of a int appear in a list of interval

I were given a list of intervals, for example [[10,40],[20,60]] and a list of position [5,15,30]
we should return the frequency of position appeared in the list, the answer would be [[5,0],[15,1],[30,2]] because 5 didn't cover by the interval and 15 was covered once, 30 was covered twice.
If I just do a for loop the time complexity would be O(m*n) m is the number of the interval, n is the number of position
Can I preprocess the intervals and make it faster? I was thinking of sort the interval first and use binary search but I am not sure how to implement it in python, Can someone give me a hint? Or can I use a hashtable to store intervals? what would be the time complexity for that?

You can use a frequency array to preprocess all interval data and then query for any value to get the answer. Specifically, create an array able to hold the min and max possible end-points of all the intervals. Then, for each interval, increment the frequency of the starting interval point and decrease the frequency of the value just after the end interval. At the end, accumulate this data for the array and we will have the frequency of occurrence of each value between the min and max of the interval. Each query is then just returning the frequency value from this array.
freq[] --> larger than max-min+1 (min: minimum start value, max: maximum end value)
For each [L,R] --> freq[L]++, freq[R+1] = freq[R+1]-1
freq[i] = freq[i]+freq[i-1]
For any query V, answer is freq[V]
Do consider tradeoffs when range is very large compared to simple queries, where simple check for all may suffice.

Finding the significant changes in a list of values?

I have a list of values, my_list, which shows the usage of a device in different times, like below:
my_list=[0.0, 11500312.5, 12293437.5, 11896875.0, 7711186.0,
3281768.863, 3341550.1363, 3300694.0,...]
I have many lists of this type and I want to find the numbers of the most significant changes (decreasing or increasing) in different times. One of these lists is plotted below. For example, if you look at the second, third and forth points in the graph you can see the difference between the values are not much, but the value suddenly decreased at fifth and Sith point. The same significant changes happened between point 20, 21 and 22.
So you can see in the plot they are two-three significant increasing and decreasing w.r.t to the other times. Any idea to find the numbers automatically?

Here's an approach that might work for you. Check how the value compares to the moving average. Is it more than one standard deviation away?
Here's a moving average implementation using numpy:
import numpy as np
def running_mean(x, N):
cumsum = numpy.cumsum(numpy.insert(x, 0, 0))
return (cumsum[N:] - cumsum[:-N]) / float(N)
From here
Here's an implementation of the comparison operation:
TimeSEries=[0.0, 11500312.5, 12293437.5, 11896875.0, 7711186.0,
3281768.863, 3341550.1363, 3300694.0]
MOV = running_mean(TimeSEries,3).tolist()
STD = np.std(MOV)
events= []
ind = []
for ii in range(len(TimeSEries)):
try:
if TimeSEries[ii] > MOV[ii]+STD:
print(TimeSEries[ii])
except IndexError:
pass
From here

How to calculate Delta F / F using python?

I've recently "taught" myself python in order to analyze data for my experiments. As such I'm pretty clueless on many aspects. I've managed to make my analysis work for certain files but in some cases it breaks down and I imagine it is a result of faulty programming.
Currently I export a file containing 3 numpy arrays. One of these arrays is my signal (float values from -10 to 10). What I wish to do is to normalize every datum in this array to a range of values that preceed it. (i.e. the 30001st value must have the average of the preceeding 3000 values subtracted from it and then the difference must then be divided by thisvery same average (the preceeding 3000 values). My data is collected at a rate of 100Hz thus to get a normalization of the alst 30s i must use the preceeding 3000values.
As it stand this is how I've managed to make it work:
this stores the signal into the variable photosignal
photosignal = np.array(seg.analogsignals[0], ndmin=1)
now this the part I use to get the delta F/F over a moving window of 30s
normalizedphotosignal = [(uu-(np.mean(photosignal[uu-3000:uu])))/abs(np.mean(photosignal[uu-3000:uu])) for uu in photosignal[3000:]]
The following adds 3000 values to the beginning to keep the array the same length since later on i must time lock it to another list that is the same length
holder =list(range(3000))
normalizedphotosignal = holder + normalizedphotosignal
What I have noticed is that in certain files this code gives me an error because it says that the"slice" is empty and therefore it cannot create a mean.
I think maybe there is a better way to program this that could avoid this problem altogether. Or this a correct way to approach this problem?
So i tried the solution but it is quite slow and it nevertheless still gives me the "empty slice error".
I went over the moving average post and found this method:
def running_mean(x, N):
cumsum = np.cumsum(np.insert(x, 0, 0))
return (cumsum[N:] - cumsum[:-N]) / N
however I'm having trouble accommodating it to my desired output. namely (x-running average)/running average

Allright so I finally figured it out thanks to your help and the posts you referred me to.
The calculation for my entire data (300 000 +) takes about a second!
I used the following code:
def runningmean(x,N):
cumsum =np.cumsum(np.insert(x,0,0))
return (cumsum[N:] -cumsum[:-N])/N
photosignal = np.array(seg.analogsignal[0], ndmin =1)
photosignalaverage = runningmean(photosignal, 3000)
holder = np.zeros(2999)
photosignalaverage = np.append(holder,photosignalaverage)
detalfsignal = (photosignal-photosignalaverage)/abs(photosignalaverage)
Photosignal stores my raw signal in a numpy array.
Photosignalaverage uses cumsum to calculate the running average of every datapoint in photosignal. I then add the first 2999 values as 0, to maintian the same list size as my photosignal.
I then use basic numpy calculations to get my delta F/F signal.
Thank you once more for the feedback, was truly helpful!

Your approach goes in the right direction. However, you made a mistake in your list comprehension: you are using uu as your index whereas uu are the elements of your input data photosignal.
You want something like this:
normalizedphotosignal2 = np.zeros((photosignal.shape[0]-3000))
for i, uu in enumerate(photosignal[3000:]):
normalizedphotosignal2 = (uu - (np.mean(photosignal[i-3000:i]))) / abs(np.mean(photosignal[i-3000:i]))
Keep in mind that for-loops are relatively slow in python. If performance is an issue here, you could try avoiding the for loop and use numpy methods instead (e.g. have a look at Moving average or running mean).
Hope this helps.

Python - speed up finding percentile of set which is greater than threshold

I need to find which percentile of a group of numbers is over a threshold value. Is there a way that this can be speed up? My implementation is much too slow for the intended application. In case this changes anything, I am running my program using mpirun -np 100 python program.py. I cannot use numba, as the rest of this program uses try/except statements.
import numpy as np
my_vals = []
threshold_val = 0.065
for i in range(60000):
my_vals.append(np.random.normal(0.05, 0.02))
for i in np.arange(0,100,0.001):
if np.percentile(my_vals,i) > threshold_val:
perc = 1*i
break
else: perc = 100

Since the Gaussian (Normal) distribution produces a bell-curve, you should be able to calculate the percentile with the highest probability of being optimal and then write your code to check there first, and then use a modified binary search to find the optimal lowest threshold.
For example, if you determine that your parameters are most likely to favor e.g. 17.951 (this is an example only, I didn't actually bother computing it), then begin near that point rather than starting at 0. Treat this like a binary search - start your lower limit at 0 and your upper limit at 100.0, and set the point to bisect the list as the optimal percentile for the distribution.
If your current upper limit is over threshold_val, bisect the lower half to find the lowest such value that matches; if it is not over the threshold, bisect the upper half, etc. So, e.g. in the range 0.000 to 100.000, if you start at 17.951 and find that it is not above the threshold, adjust to bounds to 17.952 to 100.000 and try 58.976 (halfway between). As soon as you find a value that is above the threshold, then use that value as the upper bound (since it is a non-optimal answer). Continue this process until the lower and upper bounds are 0.001 apart, which gives you the optimal answer. On average, you should have to run about 17 tests rather than 100,000.
You may also be able to automate the computation of the optimal value in case your normal distribution will change, since the distribution produces a bell-curve, and you will know the statistics of that bell-curve based on the parameters anyway.
Your solution only needs to find the lowest value for which the percentile is above your threshold, so this approach should minimize the number of samples you need to check.
One more hint: np.percentile has to sorts the my_vals 100,000 times in your code; I do not know whether a pre-sorted list would help, but it may be worth checking (you'll probably have to test several possible sort parameters since it doesn't appear to be documented in which direction it sorts).

You can find the solution directly by sorting the values and searching for the first value that exceeds your threshold. The percentile is the fraction of array values before this element:
import numpy as np
my_vals = []
threshold_val = 0.065
for i in range(60000):
my_vals.append(np.random.normal(0.05, 0.02))
from bisect import bisect_right
print bisect_right(sorted(my_vals),threshold_val)/float(len(my_vals))*100

Can I make an O(1) search algorithm using a sorted array with a known step?

Background
my software visualizes very large datasets, e.g. the data is so large I can't store all the data in RAM at any one time it is required to be loaded in a page fashion. I embed matplotlib functionality for displaying and manipulating the plot in the backend of my application.
These datasets contains three internal lists I use to visualize: time, height and dataset. My program plots the data with time x height , and additionally users have the options of drawing shapes around regions of the graph that can be extracted to a whole different plot.
The difficult part is, when I want to extract the data from the shapes, the shape vertices are real coordinates computed by the plot, not rounded to the nearest point in my time array. Here's an example of a shape which bounds a region in my program
While X1 may represent the coordinate (2007-06-12 03:42:20.070901+00:00, 5.2345) according to matplotlib, the closest coordinate existing in time and height might be something like (2007-06-12 03:42:20.070801+00:00, 5.219) , only a small bit off from matploblib's coordinate.
The Problem
So given some arbitrary value, lets say x1 = 732839.154395 (a representation of the date in number format) and a list of similar values with a constant step:
732839.154392
732839.154392
732839.154393
732839.154393
732839.154394
732839.154394
732839.154395
732839.154396
732839.154396
732839.154397
732839.154397
732839.154398
732839.154398
732839.154399
etc...
What would be the most efficient way of finding the closest representation of that point? I could simply loop through the list and grab the value with the smallest different, but the size of time is huge. Since I know the array is 1. Sorted and 2. Increments with a constant step , I was thinking this problem should be able to be solved in O(1) time? Is there a known algorithm that solves these kind of problems? Or would I simply need to devise some custom algorithm, here is my current thought process.
grab first and second element of time
subtract second element of time with first, obtain step
subtract bounding x value with first element of time, obtain difference
divide difference by step, obtain index
move time forward to index
check surrounding elements of index to ensure closest representation

The algorithm you suggest seems reasonable and like it would work.
As has become clear in your comments, the problem with it is the coarseness at which your time was recorded. (This can be common when unsynchronized data is recorded -- ie, the data generation clock, eg, frame rate, is not synced with the computer).
The easy way around this is to read two points separated by a larger time, so for example, read the first time value and then the 1000th time value. Then everything stays the same in your calculation but get you timestep by subtracting and then dividing to 1000
Here's a test that makes data a similar to yours:
import matplotlib.pyplot as plt
start = 97523.29783
increment = .000378912098
target = 97585.23452
# build a timeline
times = []
time = start
actual_index = None
for i in range(1000000):
trunc = float(str(time)[:10]) # truncate the time value
times.append(trunc)
if actual_index is None and time>target:
actual_index = i
time = time + increment
# now test
intervals = [1, 2, 5, 10, 100, 1000, 10000]
for i in intervals:
dt = (times[i] - times[0])/i
index = int((target-start)/dt)
print " %6i %8i %8i %.10f" % (i, actual_index, index, dt)
Result:
span actual guess est dt (actual=.000378912098)
1 163460 154841 0.0004000000
2 163460 176961 0.0003500000
5 163460 162991 0.0003800000
10 163460 162991 0.0003800000
100 163460 163421 0.0003790000
1000 163460 163464 0.0003789000
10000 163460 163460 0.0003789100
That is, as the space between the sampled points gets larger, the time interval estimate gets more accurate (compare to increment in the program) and the estimated index (3rd col) gets closer to the actual index (2nd col). Note that the accuracy of the dt estimate is basically just proportional to the number of digits in the span. The best you could do is use the times at the start and end points, but it seemed from you question statement that this would be difficult; but if it's not, it will give the most accurate estimate of your time interval. Note that here, for clarity, I exaggerated the lack of accuracy by making my time interval recording very course, but in general, every power of 10 in your span increase your accuracy by the same amount.
As an example of that last point, if I reduce the courseness of the time values by changing the coursing line to, trunc = float(str(time)[:12]), I get:
span actual guess est dt (actual=.000378912098)
1 163460 163853 0.0003780000
10 163460 163464 0.0003789000
100 163460 163460 0.0003789100
1000 163460 163459 0.0003789120
10000 163460 163459 0.0003789121
So if, as you say, using a span of 1 gets you very close, using a span of 100 or 1000 should be more than enough.
Overall, this is very similar in idea to the linear "interpolation search". It's just a bit easier to implement because it's only making a single guess based on the interpolation, so it just takes one line of code: int((target-start)*i/(times[i] - times[0]))

What you're describing is pretty much interpolation search. It works very much like binary search, but instead of choosing the middle element it assumes the distribution is close to uniform and guesses the approximate location.
The wikipedia link contains a C++ implementation.

That what you did is actually finding the value of n-th element of arithmetic sequence given the first two elements.
It is of course good.
Apart from the real question, if you have that much data that you can't fit into ram, you could setup something like Memory Mapped Files or simply creating Virtual Memory files, on Linux called swap.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.