Time-based sliding window and measuring (change of) data arrival rate - python

I'm trying to implement a time-based sliding window (in Python), i.e., a data source inserts new data items, and items older than, say, 1h are automatically removed. On top of that, I need to measure the rate, or rather the change of the rate, at which the data source inserts items.
My question is kind of two-fold. First, what is the best way to implement a time-based window? In my current, probably naive, solution I simply use a Python list window = []. In case of a new data item, I append the item with the current timestamp: window.append((current_time, item)). Then, using a timer, every 1 sec I pop all leading elements with a timestamp older than (current timestamp - 1h):
threshold = int(time.time() * 1000) - self.WINDOW_SIZE_IN_MS
# pop items from the front of the window until the oldest one is recent enough
while window:
    if window[0][0] < threshold:
        del window[0]
    else:
        break
While this works, I wonder if there are more clever solutions to this.
More importantly, what would be a good way to measure the change of the rate at which data items enter the window? Here, I have no good idea how to approach this, at least none that also sounds efficient. Something very naive I had in mind: I split the 1h window into 20 intervals of 5 min each and count the number of items. If the most recent 5-min interval contains significantly more items than the average of the 20 intervals, I say there is a burst. But I would have to do this every, say, 1 min. This doesn't sound efficient and there are a lot of parameters.
In short, I need to measure the acceleration in which new items enter my window. Are there best-practices approaches for this?

For the first part, it is more efficient to check for expired items and remove them when you receive a new item to add. That is, don't bother with a timer which wakes up the process for no reason once a second--just piggyback the maintenance work when real work is happening.
For the second part, the entire 1 hour window has a known length. Store an integer which is the index in the list of the first item from the last five minutes. You can maintain this when doing an insert, and you know you only have to move it forward.
Putting it all together, in rough Python:
from datetime import timedelta

window = []        # (timestamp, item) pairs, oldest first
recent_index = 0   # index of the first item that arrived within the last 5 minutes

def insert(time, item):
    global recent_index
    # expire items older than one hour from the front of the window
    while window and window[0][0] < time - timedelta(hours=1):
        window.pop(0)
        recent_index -= 1
    # advance the 5-minute boundary; it only ever moves forward
    while recent_index < len(window) and window[recent_index][0] < time - timedelta(minutes=5):
        recent_index += 1
    window.append((time, item))
    return float(len(window) - recent_index) / len(window)
The above function returns the fraction of items from the past hour that arrived in the past five minutes. If that fraction goes over, say, 20% or 50%, you have a burst.
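For instance, a quick way to exercise the sketch above (the timestamps, items and threshold here are made up purely for illustration):
from datetime import datetime, timedelta

base = datetime(2024, 1, 1, 12, 0, 0)
# a slow trickle of items for the first 55 minutes...
for minute in range(55):
    insert(base + timedelta(minutes=minute), "slow item")
# ...then a flurry of items in the last 5 minutes
for second in range(0, 300, 2):
    fraction = insert(base + timedelta(minutes=55, seconds=second), "fast item")
print(round(fraction, 2))   # well above 0.5: most of the hour's items arrived in the last 5 minutes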

Related

How to calculate Delta F / F using python?

I've recently "taught" myself python in order to analyze data for my experiments. As such I'm pretty clueless on many aspects. I've managed to make my analysis work for certain files but in some cases it breaks down and I imagine it is a result of faulty programming.
Currently I export a file containing 3 numpy arrays. One of these arrays is my signal (float values from -10 to 10). What I wish to do is to normalize every datum in this array to a range of values that precede it (i.e. the 30001st value must have the average of the preceding 3000 values subtracted from it, and then the difference must be divided by this very same average of the preceding 3000 values). My data is collected at a rate of 100 Hz, thus to get a normalization over the last 30 s I must use the preceding 3000 values.
As it stands, this is how I've managed to make it work.
This stores the signal into the variable photosignal:
photosignal = np.array(seg.analogsignals[0], ndmin=1)
Now this is the part I use to get the delta F/F over a moving window of 30 s:
normalizedphotosignal = [(uu-(np.mean(photosignal[uu-3000:uu])))/abs(np.mean(photosignal[uu-3000:uu])) for uu in photosignal[3000:]]
The following adds 3000 values to the beginning to keep the array the same length, since later on I must time-lock it to another list that is the same length:
holder = list(range(3000))
normalizedphotosignal = holder + normalizedphotosignal
What I have noticed is that in certain files this code gives me an error saying that the "slice" is empty and therefore it cannot compute a mean.
I think maybe there is a better way to program this that could avoid the problem altogether. Or is this a correct way to approach the problem?
So I tried the solution, but it is quite slow and it nevertheless still gives me the "empty slice" error.
I went over the moving average post and found this method:
def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
However, I'm having trouble adapting it to my desired output, namely (x - running average) / running average.
Alright, so I finally figured it out thanks to your help and the posts you referred me to.
The calculation for my entire dataset (300,000+ points) takes about a second!
I used the following code:
def runningmean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

photosignal = np.array(seg.analogsignals[0], ndmin=1)
photosignalaverage = runningmean(photosignal, 3000)
holder = np.zeros(2999)
photosignalaverage = np.append(holder, photosignalaverage)
deltafsignal = (photosignal - photosignalaverage) / abs(photosignalaverage)
Photosignal stores my raw signal in a numpy array.
Photosignalaverage uses cumsum to calculate the running average of every datapoint in photosignal. I then add the first 2999 values as 0, to maintain the same list size as my photosignal.
I then use basic numpy calculations to get my delta F/F signal.
Thank you once more for the feedback, was truly helpful!
Your approach goes in the right direction. However, you made a mistake in your list comprehension: you are using uu as your index, whereas uu is actually an element of your input data photosignal.
You want something like this:
normalizedphotosignal2 = np.zeros(photosignal.shape[0] - 3000)
for i, uu in enumerate(photosignal[3000:]):
    # the 3000 values preceding element i + 3000 are photosignal[i:i+3000]
    normalizedphotosignal2[i] = (uu - np.mean(photosignal[i:i+3000])) / abs(np.mean(photosignal[i:i+3000]))
Keep in mind that for-loops are relatively slow in Python. If performance is an issue here, you could try avoiding the for loop and using numpy methods instead (e.g. have a look at Moving average or running mean).
Hope this helps.

How to calculate running mean traffic for last minute

I have a python server that accepts time series data. Now I need to calculate the average traffic for the last minute, with output like 90 samples/minute. I'm currently using a python list to hold all the timestamps and a pretty awful way (in my opinion) to calculate this. The code roughly looks like this:
import datetime

class TrafficCalculator(object):
    timestamps = []

    def run(self):
        while True:
            # this gets one record of traffic
            data = self.accept_data()
            # get the record's timestamp
            timestamp = data.timestamp
            # add it to the list
            self.timestamps.append(timestamp)
            # get the time one minute ago
            minute_ago = timestamp - datetime.timedelta(minutes=1)
            # find the first index whose timestamp is within the last minute
            for i, t in enumerate(self.timestamps):
                if t > minute_ago:
                    break
            # see how many records are within the last minute
            result = len(self.timestamps[i:])
            # throw away the earlier data
            self.timestamps = self.timestamps[i:]
As you can see, I have to do this for every record, if my traffic gets intense, the performance is miserable.
Is there a better data structure or algorithm I can use to make this more performant? Even further, how do I write a test to verify my algorithm? Thanks!
Use a queue to hold (traffic, timestamp) pairs. Here timestamp is the time the record was pushed onto the queue (when it arrived from the server). Track the sum of the traffic values in the queue. When a new record arrives and the difference between its timestamp and the queue's front element's timestamp is more than 1 minute, pop the front of the queue and subtract the popped traffic value from the sum. Then push the new record into the queue and add its value to the sum.
This way, your queue acts as a window that always holds the last minute of traffic. Since you are tracking the sum and you know the queue size, you can calculate the average.
The space complexity is O(maximum number of records that can arrive within 1 minute). The time complexity is O(1) for getting the average at any time.
This is a conventional algorithm for answering queries over a running stream of data in constant time.
Note: unfortunately I don't know Python, otherwise I would include an implementation.
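A rough Python sketch of that queue-plus-running-sum idea (the class and method names here are illustrative, not from the answer):
from collections import deque
import time

class TrafficWindow(object):
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.items = deque()   # (timestamp, traffic) pairs, oldest first
        self.total = 0.0

    def add(self, traffic, timestamp=None):
        if timestamp is None:
            timestamp = time.time()
        # evict entries older than the window, keeping the running sum in step
        while self.items and timestamp - self.items[0][0] > self.window:
            _, old = self.items.popleft()
            self.total -= old
        self.items.append((timestamp, traffic))
        self.total += traffic

    def average(self):
        # average traffic per record currently in the window
        return self.total / len(self.items) if self.items else 0.0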
You should be able to achieve it with something like this (a sketch in code follows the list):
define a vector (or a list) data of length 90 (samples/min.)
have a pointer p = 0
have a sum variable (uninitialized yet)
Fill in the vector with the first 90 samples; compute the sum and put it in the variable sum.
Then:
subtract data[p] from sum (remove the oldest sample from the sum)
read the next sample and put it in the vector at location p (thus erasing the oldest data)
add the new data[p] to the sum (the current sum)
increment the pointer p by 1; if p >= 90, then set p = 0 again (p points to the oldest available data)
the current mean is sum / 90
and so on.
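A minimal Python sketch of that circular buffer (the class name is illustrative; it also handles the warm-up before the first 90 samples have arrived):
class RingMean(object):
    def __init__(self, size=90):
        self.data = [0.0] * size
        self.size = size
        self.p = 0            # index of the oldest sample (the next slot to overwrite)
        self.total = 0.0
        self.count = 0        # how many slots have been filled so far

    def add(self, sample):
        self.total -= self.data[self.p]   # remove the oldest sample from the sum
        self.data[self.p] = sample        # overwrite it with the new sample
        self.total += sample
        self.p = (self.p + 1) % self.size
        self.count = min(self.count + 1, self.size)

    def mean(self):
        return self.total / self.count if self.count else 0.0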

Can I make an O(1) search algorithm using a sorted array with a known step?

Background
My software visualizes very large datasets, i.e. the data is so large that I can't store all of it in RAM at any one time, so it has to be loaded in a paged fashion. I embed matplotlib functionality for displaying and manipulating the plot in the backend of my application.
These datasets contain three internal lists I use to visualize: time, height and dataset. My program plots the data as time x height, and additionally users have the option of drawing shapes around regions of the graph that can be extracted to a whole different plot.
The difficult part is that when I want to extract the data from the shapes, the shape vertices are real coordinates computed by the plot, not rounded to the nearest point in my time array. Here's an example of a shape which bounds a region in my program.
While X1 may represent the coordinate (2007-06-12 03:42:20.070901+00:00, 5.2345) according to matplotlib, the closest coordinate existing in time and height might be something like (2007-06-12 03:42:20.070801+00:00, 5.219), only a small bit off from matplotlib's coordinate.
The Problem
So given some arbitrary value, let's say x1 = 732839.154395 (a representation of the date in number format), and a list of similar values with a constant step:
732839.154392
732839.154392
732839.154393
732839.154393
732839.154394
732839.154394
732839.154395
732839.154396
732839.154396
732839.154397
732839.154397
732839.154398
732839.154398
732839.154399
etc...
What would be the most efficient way of finding the closest representation of that point? I could simply loop through the list and grab the value with the smallest difference, but the size of time is huge. Since I know the array is 1. sorted and 2. increments with a constant step, I was thinking this problem should be solvable in O(1) time. Is there a known algorithm that solves this kind of problem, or would I simply need to devise a custom algorithm? Here is my current thought process (a sketch of it follows the steps below):
grab first and second element of time
subtract second element of time with first, obtain step
subtract bounding x value with first element of time, obtain difference
divide difference by step, obtain index
move time forward to index
check surrounding elements of index to ensure closest representation
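A minimal sketch of those steps in Python (the function name is illustrative; as the answer below explains, estimating the step from only two adjacent points can be too coarse):
def nearest_index(times, x):
    # assumes `times` is sorted and increments with a (nearly) constant step
    step = times[1] - times[0]                  # step estimated from the first two elements
    i = int(round((x - times[0]) / step))       # jump straight to the estimated index
    i = max(0, min(i, len(times) - 1))          # clamp into range
    # check the surrounding elements to make sure we return the closest value
    candidates = [j for j in (i - 1, i, i + 1) if 0 <= j < len(times)]
    return min(candidates, key=lambda j: abs(times[j] - x))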
The algorithm you suggest seems reasonable and like it would work.
As has become clear in your comments, the problem with it is the coarseness at which your time was recorded. (This can be common when unsynchronized data is recorded, i.e. the data generation clock, e.g. the frame rate, is not synced with the computer.)
The easy way around this is to read two points separated by a larger time; for example, read the first time value and then the 1000th time value. Then everything stays the same in your calculation, but you get your timestep by subtracting and then dividing by 1000.
Here's a test that makes data a similar to yours:
import matplotlib.pyplot as plt

start = 97523.29783
increment = .000378912098
target = 97585.23452

# build a timeline
times = []
time = start
actual_index = None
for i in range(1000000):
    trunc = float(str(time)[:10])   # truncate the time value to simulate coarse recording
    times.append(trunc)
    if actual_index is None and time > target:
        actual_index = i
    time = time + increment

# now test
intervals = [1, 2, 5, 10, 100, 1000, 10000]
for i in intervals:
    dt = (times[i] - times[0]) / i
    index = int((target - start) / dt)
    print(" %6i %8i %8i %.10f" % (i, actual_index, index, dt))
Result:
span actual guess est dt (actual=.000378912098)
1 163460 154841 0.0004000000
2 163460 176961 0.0003500000
5 163460 162991 0.0003800000
10 163460 162991 0.0003800000
100 163460 163421 0.0003790000
1000 163460 163464 0.0003789000
10000 163460 163460 0.0003789100
That is, as the space between the sampled points gets larger, the time interval estimate gets more accurate (compare it to increment in the program) and the estimated index (3rd column) gets closer to the actual index (2nd column). Note that the accuracy of the dt estimate is basically just proportional to the number of digits in the span. The best you could do is use the times at the start and end points, but it seemed from your question statement that this would be difficult; if it's not, it will give the most accurate estimate of your time interval. Note that here, for clarity, I exaggerated the lack of accuracy by making my time interval recording very coarse, but in general every power of 10 in your span increases your accuracy by the same amount.
As an example of that last point, if I reduce the coarseness of the time values by changing the truncation line to trunc = float(str(time)[:12]), I get:
span actual guess est dt (actual=.000378912098)
1 163460 163853 0.0003780000
10 163460 163464 0.0003789000
100 163460 163460 0.0003789100
1000 163460 163459 0.0003789120
10000 163460 163459 0.0003789121
So if, as you say, using a span of 1 gets you very close, using a span of 100 or 1000 should be more than enough.
Overall, this is very similar in idea to the linear "interpolation search". It's just a bit easier to implement because it's only making a single guess based on the interpolation, so it just takes one line of code: int((target-start)*i/(times[i] - times[0]))
What you're describing is pretty much interpolation search. It works very much like binary search, but instead of choosing the middle element it assumes the distribution is close to uniform and guesses the approximate location.
The wikipedia link contains a C++ implementation.
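For reference, a rough Python version of the same idea (my own sketch, not taken from the linked article; it returns the index of the closest element even when the value is not present exactly):
def interpolation_search(a, x):
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= x <= a[hi]:
        if a[lo] == a[hi]:                      # flat run of duplicates: avoid dividing by zero
            break
        # guess a position by linear interpolation instead of taking the midpoint
        pos = lo + int((x - a[lo]) * (hi - lo) / (a[hi] - a[lo]))
        if a[pos] == x:
            return pos
        if a[pos] < x:
            lo = pos + 1
        else:
            hi = pos - 1
    # x is not present exactly: pick the closest of the indices bracketing the final interval
    candidates = [j for j in {lo - 1, lo, hi, hi + 1} if 0 <= j < len(a)]
    return min(candidates, key=lambda j: abs(a[j] - x))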
What you did there is actually finding the n-th element of an arithmetic sequence given its first two elements.
It is of course a good approach.
Apart from the real question: if you have so much data that you can't fit it into RAM, you could set up something like memory-mapped files, or simply create virtual memory files (on Linux, called swap).
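On the memory-mapping suggestion: numpy supports this directly through numpy.memmap, so an array can live on disk and only the touched pages are pulled into RAM. A minimal sketch (the file name, dtype and size are illustrative):
import numpy as np

n = 10000000   # illustrative size
# create the file-backed array once
big = np.memmap("times.dat", dtype="float64", mode="w+", shape=(n,))
big[:] = np.arange(n)
big.flush()
# later, map it read-only; pages are loaded from disk on demand
times = np.memmap("times.dat", dtype="float64", mode="r", shape=(n,))
print(times[n // 2])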

Finding several regions of interest in an array

Say I have conducted an experiment where I've left a python program running for some long time, and in that time I've taken several measurements of some quantity against time. Each measurement is separated by some value between 1 and 3 seconds, with the time step used much smaller than that, say 0.01 s. An example of such an event, if you just take the y axis, might look like:
[...0,1,-1,4,1,0,0,2,3,1,0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1,1,0,0,1,-1,0,0,2,0...]
Here we have some period of inactivity followed by a sharp rise, a fall, a brief pause around 0, a sharp drop, a sharp rise and then settling again around 0. The dots indicate that this is part of a long stream of data extending in both directions. There will be many of these events over the whole dataset, with varying lengths, separated by low-magnitude regions.
I wish to essentially form an array of 'n' arrays (tuples?) with varying lengths, capturing just the events so I can analyse them separately later. I can't separate purely by an np.absolute()-type threshold because there are occasional small regions of near-zero values within a given event, such as in the above example. In addition, there may be occasional blips in between measurements with large magnitudes but short duration.
The sample above would ideally end up as the following, with a couple of elements or so from the flat region on either side:
[0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1]
I'm thinking something like:
Input:
[0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0]
Split based on some number of consecutive values below a magnitude of 2.
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[8,-7,-1,0],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0],[1,22,4,]]
Like in this graph:
If a sub-array's length is less than, say, 10, then remove it:
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0]]
Is this a good way to approach it? The first step is confusing me a little. I also need to preserve those small low-magnitude regions within an event.
Re-edited! I'm going to be comparing two signals each measured as a function of time so they will be zipped together in a list of tuples.
Here is my two cents, based on exponential smoothing.
import itertools
import numpy as np

A = np.array([0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0])
B = np.hstack(([0, 0], A, [0, 0]))
B = np.asanyarray(list(zip(*[B[i:] for i in range(5)])))
# C is the 5-element sliding-window smoothed signal
C = (B * [0.25, 0.5, 1, 0.5, 0.25]).mean(axis=1)
D = []
for item in itertools.groupby(enumerate(C), lambda x: abs(x[1]) > 1.5):
    if item[0]:
        # collect the runs of indices where the smoothed signal has magnitude > 1.5 (change 1.5 to control the behaviour)
        D.append(list(item[1]))
E = [D[0]]
for item in D[1:]:
    if (item[0][0] - E[-1][-1][0]) < 5:
        # merge interesting regions if they are fewer than 5 indices apart (change 5 to control the behaviour)
        E[-1] = E[-1] + item
    else:
        E.append(item)
print([(item[0][0], item[-1][0]) for item in E])
# filter out interesting regions shorter than 10 samples
[A[item[0][0]:item[-1][0]] for item in E if (item[-1][0] - item[0][0]) > 9]

Python Deque - 10 minutes worth of data

I'm trying to write a script that, when executed, appends a new available piece of information and removes data that's over 10 minutes old.
I'm wondering what's the most efficient way, performance-wise, of keeping track of the specific time of each information element while also removing the data that's over 10 minutes old.
My novice thought would be to append the information with a time stamp - [info, time] - to the deque and in a while loop continuously evaluate the end of the deque to remove anything older than 10 minutes... I doubt this is the best way.
Can someone provide an example? Thanks.
One way to do this is to use a sorted tree structure, keyed on the timestamps. Then you can find the first element >= 10 minutes ago, and remove everything before that.
Using the bintrees library as an example (because its key-slicing syntax makes this very easy to read and write…):
import datetime
import bintrees

q = bintrees.FastRBTree()
now = datetime.datetime.now()
q[now] = 'a'
q[now - datetime.timedelta(seconds=5)] = 'b'
q[now - datetime.timedelta(seconds=10)] = 'c'
q[now - datetime.timedelta(seconds=15)] = 'd'
now = datetime.datetime.now()
del q[:now - datetime.timedelta(seconds=10)]
That will remove everything up to, but not including, now-10s, which should be both c and d.
This way, finding the first element to remove takes log N time, and removing N elements below that should be average case amortized log N but worst case N. So, your overall worst case time complexity doesn't improve, but your average case does.
Of course the overhead of managing a tree instead of a deque is pretty high, and could easily be higher than the savings of N/log N steps if you're dealing with a pretty small queue.
There are other logarithmic data structures that may be more appropriate, like a pqueue/heapqueue (as implemented by heapq in the stdlib), or a clock ring; I just chose a red-black tree because (with a PyPI module) it was the easiest one to demonstrate.
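For instance, the heapq variant takes only a few lines (a sketch, not necessarily what the answer had in mind; it also tolerates timestamps arriving slightly out of order):
import heapq
import itertools
import time

heap = []                   # (timestamp, seq, item) triples; the oldest timestamp sits at heap[0]
_seq = itertools.count()    # tie-breaker so two items never get compared directly

def add(item, timestamp=None):
    heapq.heappush(heap, (time.time() if timestamp is None else timestamp, next(_seq), item))

def prune(max_age=600):
    cutoff = time.time() - max_age
    # discard the oldest entries while they are more than 10 minutes old
    while heap and heap[0][0] < cutoff:
        heapq.heappop(heap)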
If you're only ever appending to the end, and the values are always inherently in sorted order, you don't actually need a logarithmic data structure like a tree or heap at all; you can do a logarithmic search within any sorted random-access structure like a list or collections.deque.
The problem is that deleting everything up to an arbitrary point in a list or deque takes O(N) time. There's no reason that it should; you should be able to drop N elements off a deque in amortized constant time (with del q[:pos] or q.popleft(pos)), it's just that collections.deque doesn't do that. If you find or write a deque class that does have that feature, you could just write this:
import bisect
import datetime
from collections import deque

q = deque()   # assumed to support slice deletion, as discussed above
now = datetime.datetime.now()
# append oldest first so the deque stays sorted by timestamp
q.append((now - datetime.timedelta(seconds=15), 'd'))
q.append((now - datetime.timedelta(seconds=10), 'c'))
q.append((now - datetime.timedelta(seconds=5), 'b'))
q.append((now, 'a'))
now = datetime.datetime.now()
# find the first entry that is no more than 10 seconds old...
pos = bisect.bisect_left(q, (now - datetime.timedelta(seconds=10),))
# ...and drop everything before it
del q[:pos]
I'm not sure whether a deque like this exists on PyPI, but the C source to collections.deque is available to fork, or the Python source from PyPy, or you could wrap a C or C++ deque type, or write one from scratch…
Or, if you're expecting that the "current" values in the deque will always be a small subset of the total length, you can do it in O(M) time just by not using the deque destructively:
q = q[pos:]
In fact, in that case, you might as well just use a list; it has O(1) append on the right, and slicing the last M items off a list is about as low-overhead a way to copy M items as you're going to find.
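For completeness, the same pruning with a plain list and bisect (a sketch; the 10-minute cutoff is spelled out as 600 seconds):
import bisect
import time

q = []                                    # (timestamp, value) pairs, appended in time order
q.append((time.time(), 'x'))
cutoff = (time.time() - 600,)             # 10 minutes ago, as a 1-tuple so it compares against the pairs
q = q[bisect.bisect_left(q, cutoff):]     # keep only the entries from the last 10 minutes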
Yet another answer, with even more constraints:
If you can bucket things with, e.g., one-minute precision, all you need is 10 lists.
Unlike my other constrained answer, this doesn't require that you only ever append on the right; you can append in the middle (although you'll come after any other values for the same minute).
The down side is that you can't actually remove everything more than 10 minutes old; you can only remove everything in the 10th bucket, which could be off by up to 1 minute. You can choose what this means by choosing how to round:
Truncate one way, and nothing ever gets dropped too early, but everything is dropped late, an average of 30 seconds and at worst 60.
Truncate the other way, and nothing ever gets dropped late, but everything is dropped early, an average of 30 seconds and at worst 60.
Round at half, and things get dropped both early and late, but with an average of 0 seconds and a worst case of 30.
And you can of course use smaller buckets, like 100 buckets of 6-second intervals instead of 10 buckets of 1-minute intervals, to cut the error down as far as you like. Push that too far and you'll ruin the efficiency; a list of 600000 buckets of 1ms intervals is nearly as slow as a list of 1M entries.* But if you need 1 second or even 50ms, that's probably fine.
Here's a simple example:
# assumes self.buckets was initialized to 10 empty lists and self.last_bucket_time to the current minute
def prune_queue(self):
    now = int(time.time() // 60)
    age = now - self.last_bucket_time
    if age:
        # drop the oldest buckets, append empty ones for the elapsed minutes (cap at 10)
        age = min(age, 10)
        self.buckets = self.buckets[age:] + [[] for _ in range(age)]
        self.last_bucket_time = now

def enqueue(self, thing):
    self.prune_queue()
    self.buckets[-1].append(thing)
* Of course you could combine this with the logarithmic data structure—a red-black tree of 600000 buckets is fine.
