I have a Python server that accepts time-series data. Now I need to calculate the average traffic for the last minute and output something like "90 samples/minute". I'm currently using a Python list to hold all the timestamps and, in my opinion, a pretty awful way to calculate this. The code roughly looks like this:
class TrafficCalculator(object):
    timestamps = []

    def run(self):
        while True:
            # this gets one record of traffic
            data = self.accept_data()
            # get the record's timestamp
            timestamp = data.timestamp
            # add it to the list
            self.timestamps.append(timestamp)
            # get the time one minute ago
            minute_ago = timestamp - datetime.timedelta(minutes=1)
            # find the first index whose timestamp is within the last minute
            for i, t in enumerate(self.timestamps):
                if t > minute_ago:
                    break
            # see how many records are within the last minute
            result = len(self.timestamps[i:])
            # throw away the earlier data
            self.timestamps = self.timestamps[i:]
As you can see, I have to do this for every record, and if my traffic gets intense the performance is miserable.
Is there a better data structure or algorithm I can use to make this more performant? Even further, how do I write a test to verify my algorithm? Thanks!
Use a queue to hold <traffic, timestamp> pairs, where the timestamp is the time the record was pushed onto the queue (i.e. when it arrives from the server). Track the running sum of the traffic values in the queue. When a new record arrives and the difference between its timestamp and the queue's front element's timestamp is more than 1 minute, pop the front of the queue and subtract the popped traffic value from the sum; repeat until the front is within the last minute. Then push the new record onto the queue and add its value to the sum.
This way, your queue works as a window holding the last minute of traffic at all times. Since you are tracking the sum and you know the queue's size, you can calculate the average.
The space complexity is O(maximum number of records that can arrive within 1 minute). The time complexity for getting the average at any time is O(1).
This is a very conventional algorithm for querying a running stream of data in constant time.
Note: unfortunately I don't know Python, otherwise I would have included an implementation.
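Since the answer above doesn't include Python, here is a minimal sketch of the same queue idea (the class name, the deque choice and the per-record value field are illustrative assumptions, not the original poster's code):

import datetime
from collections import deque

class TrafficWindow:
    def __init__(self, window=datetime.timedelta(minutes=1)):
        self.window = window
        self.queue = deque()   # (timestamp, value) pairs, oldest at the left
        self.total = 0.0

    def add(self, timestamp, value=1.0):
        # Evict records older than one minute relative to the new record.
        while self.queue and timestamp - self.queue[0][0] > self.window:
            _, old_value = self.queue.popleft()
            self.total -= old_value
        self.queue.append((timestamp, value))
        self.total += value

    def rate(self):
        return len(self.queue)   # samples seen in the last minute

    def average(self):
        return self.total / len(self.queue) if self.queue else 0.0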
You should be able to achieve it with something like this:
define a vector (or a list) data of length 90 (samples/min.)
have a pointer p = 0
have a sum variable (uninitialized for now)
Fill the vector with the first 90 samples; compute their sum and store it in sum.
Then:
subtract data[p] from sum (remove the oldest sample from the sum)
read the next sample and put it in the vector at location p
(thus overwriting the oldest data);
add the new data[p] to sum (the current sum)
increment the pointer p by 1; if p >= 90, set p = 0 again
(p now points to the oldest available data)
the current mean is sum/90
etc.
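A minimal Python sketch of that ring-buffer scheme, assuming a fixed rate of exactly 90 samples per minute (the class and method names are illustrative):

WINDOW = 90

class RollingMean:
    def __init__(self):
        self.data = [0.0] * WINDOW   # fixed-size vector of the last 90 samples
        self.p = 0                   # slot holding the oldest sample
        self.total = 0.0             # running sum of the vector
        self.count = 0               # samples seen so far, capped at WINDOW

    def add(self, sample):
        self.total -= self.data[self.p]   # remove the oldest sample from the sum
        self.data[self.p] = sample        # overwrite it with the newest sample
        self.total += sample              # add the new sample to the sum
        self.p = (self.p + 1) % WINDOW    # advance the pointer, wrapping at 90
        self.count = min(self.count + 1, WINDOW)

    def mean(self):
        return self.total / self.count if self.count else 0.0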
I was given a list of intervals, for example [[10,40],[20,60]], and a list of positions [5,15,30].
We should return how often each position is covered by the intervals; the answer would be [[5,0],[15,1],[30,2]] because 5 isn't covered by any interval, 15 is covered once, and 30 is covered twice.
If I just do a for loop, the time complexity would be O(m*n), where m is the number of intervals and n is the number of positions.
Can I preprocess the intervals and make it faster? I was thinking of sorting the intervals first and using binary search, but I am not sure how to implement that in Python. Can someone give me a hint? Or can I use a hashtable to store the intervals? What would be the time complexity of that?
You can use a frequency array to preprocess all the interval data and then answer any query with a single lookup. Specifically, create an array able to hold the range between the minimum and maximum end-points of all the intervals. Then, for each interval, increment the frequency at the interval's starting point and decrement the frequency at the value just after its end point. Finally, accumulate (prefix-sum) this data across the array, and you will have the coverage count of each value between the min and max of the intervals. Each query is then just a lookup into this array.
freq[] --> size larger than max-min+1 (min: minimum start value, max: maximum end value)
For each interval [L,R] --> freq[L] += 1, freq[R+1] -= 1
freq[i] = freq[i] + freq[i-1] (prefix sum over the array)
For any query V, the answer is freq[V]
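A hedged Python sketch of that idea (the function name is mine; interval endpoints are assumed to be small non-negative integers):

def build_coverage(intervals):
    hi = max(r for _, r in intervals)
    freq = [0] * (hi + 2)
    for l, r in intervals:
        freq[l] += 1       # coverage starts at l
        freq[r + 1] -= 1   # ...and stops just after r
    for i in range(1, len(freq)):
        freq[i] += freq[i - 1]   # prefix sum -> coverage count at each position
    return freq

freq = build_coverage([[10, 40], [20, 60]])
print([[p, freq[p] if p < len(freq) else 0] for p in [5, 15, 30]])
# -> [[5, 0], [15, 1], [30, 2]]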
Do consider the trade-off when the value range is very large compared to the number of queries; in that case a simple check of each query against all intervals may suffice.
Every time I ride my bike I gather second-by-second data on a number of metrics. For simplicity, let's pretend that I have a csv file that looks something like:
secs, watts,
1,150
2,151
3,149
4,135
.
.
.
7000,160
So, every second of my ride has an associated power value, in watts.
I want to know: "If I break my ride into N-second blocks, which block has the highest average power?"
I am using a pandas dataframe to manage my data, and this is the code I have been using to answer my question:
def bestEffort(ride_data,
               metric='watts',
               interval_length=5,
               sort_descending=True):
    seconds_in_ride = len(ride_data[metric])
    average_interval_list = [[i + 1,
                              np.average(
                                  [ride_data[metric][i + j]
                                   for j in range(interval_length)])]
                             for i in range(0,
                                            seconds_in_ride -
                                            interval_length)]
    average_interval_list.sort(key=lambda x: x[1], reverse=sort_descending)
    return average_interval_list
Seems simple, right? Given an index, compute the average value of the interval_length subsequent entries. Keep track of this in a list of the form
[[second 1, avg val of metric over the interval starting that second],
[second 2, avg val of metric over the interval starting that second],
[second 3, avg val of metric over the interval starting that second],
.
.
.
[second 7000-interval_length, avg val of metric over the interval starting that second]]
Then, I sort the resulting list by the average values. So the first entry is of the form
[second_n, avg val of metric over the interval starting in second n]
telling me that my strongest effort over the given interval length started at second_n in my workout.
The problem is that if I set interval_length to anything higher than 30, this computation takes forever (read: over two minutes on a decent machine). Please help me find where my code is hitting a bottleneck; this seems like it should be way faster.
If you put your data in a numpy array, say watts, you can compute the mean power using convolve:
mean_power = np.convolve(watts, np.ones(interval_length)/interval_length, mode='valid')
As you can see in the reference for np.convolve, this function computes a local mean of the first argument, smoothed with a window defined by the second argument. Here we smooth with a "top-hat" function, i.e. an "on/off" function that is constant over an interval of length interval_length and zero otherwise. This is rudimentary but gives a good first estimate.
Then the time of your strongest effort is:
time_strongest_effort = np.argmax(mean_power)
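For example, a short end-to-end sketch (assuming ride_data is the pandas DataFrame from the question, with 'secs' and 'watts' columns):

import numpy as np

interval_length = 30
watts = ride_data['watts'].to_numpy()

# Rolling mean over every window of interval_length consecutive seconds.
mean_power = np.convolve(watts, np.ones(interval_length) / interval_length,
                         mode='valid')

best = int(np.argmax(mean_power))              # offset of the best window's first sample
best_start_sec = ride_data['secs'].iloc[best]  # map it back to the ride's second counter
best_avg_watts = mean_power[best]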
Here's a pure-pandas solution using DataFrame.rolling. It's slightly slower than the numpy convolution approach by @BenBoulderite, but is a convenient idiom:
df.rolling(interval_length).mean().shift(-(interval_length - 1))
The .shift() is needed to align the rolling-mean values to the left edge of the rolling window, instead of the default right edge (see the docs on DataFrame.rolling).
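A short usage sketch of that idiom (assuming df has a 'watts' column indexed by second, as in the question):

import pandas as pd

interval_length = 30
rolled = (df['watts']
          .rolling(interval_length)
          .mean()
          .shift(-(interval_length - 1)))   # align each mean to its window's left edge

best_start = rolled.idxmax()   # second at which the strongest interval begins
best_power = rolled.max()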
I'm trying to implement a time-based sliding window (in Python): a data source inserts new data items, and items older than, say, 1 hour are automatically removed. On top of that, I need to measure the rate, or rather the change in the rate, at which the data source inserts items.
My question is kind of two-fold. First, what is the best way to implement a time-based window? In my current, probably naive, solution I simply use a Python list window = []. When a new data item arrives, I append it with the current timestamp: window.append((current_time, item)). Then, using a timer, every 1 sec I pop all leading elements with a timestamp older than (current time - 1 hour):
threshold = int(time.time() * 1000) - self.WINDOW_SIZE_IN_MS
while True:
    try:
        if window[0][0] < threshold:
            del window[0]
        else:
            break
    except IndexError:
        break
While this works, I wonder if there are more clever solutions to this.
More importantly, what would be a good way to measure the change in the rate at which data items enter the window? Here I have no good idea how to approach this, at least none that also sounds efficient. Something very naive I had in mind: split the 1-hour window into 20 intervals of 5 minutes each and count the number of items in each. If the most recent 5-minute interval contains significantly more items than the average of the 20 intervals, I say there is a burst. But I would have to do this every, say, 1 minute. This doesn't sound efficient, and there are a lot of parameters.
In short, I need to measure the acceleration with which new items enter my window. Are there best-practice approaches for this?
For the first part, it is more efficient to check for expired items and remove them when you receive a new item to add. That is, don't bother with a timer that wakes up the process for no reason once a second; just piggyback the maintenance work onto the real work as it happens.
For the second part, the entire 1-hour window has a known length. Store an integer which is the index, in the list, of the item from five minutes ago. You can maintain it when doing an insert, and you know you only ever have to move it forward.
Putting it all together, pseudo-code:
from datetime import timedelta

window = []        # (time, item) pairs, oldest first
recent_index = 0   # index of the first item from the last five minutes

def insert(time, item):
    global recent_index
    # Drop items older than one hour from the front of the window.
    while window and window[0][0] < time - timedelta(hours=1):
        window.pop(0)
        recent_index -= 1
    # Advance the pointer past items older than five minutes.
    while recent_index < len(window) and window[recent_index][0] < time - timedelta(minutes=5):
        recent_index += 1
    window.append((time, item))
    return float(len(window) - recent_index) / len(window)
The above function returns the fraction of items from the past hour that arrived in the past five minutes. If it's over, say, 20 or 50%, you have a burst.
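For example (illustrative timestamps only), feeding a few arrivals through insert():

from datetime import datetime, timedelta

t0 = datetime.now()
insert(t0 - timedelta(minutes=50), 'a')
insert(t0 - timedelta(minutes=20), 'b')
insert(t0 - timedelta(minutes=3), 'c')
print(insert(t0, 'd'))   # 0.5: two of the four items in the window arrived in the last five minutes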
Background
My software visualizes very large datasets, i.e. the data is so large that I can't store it all in RAM at any one time; it has to be loaded in a paged fashion. I embed matplotlib functionality for displaying and manipulating the plot in the backend of my application.
These datasets contain three internal lists I use to visualize: time, height and dataset. My program plots the data as time x height, and additionally users have the option of drawing shapes around regions of the graph that can be extracted to a whole different plot.
The difficult part is that when I want to extract the data from the shapes, the shape vertices are real coordinates computed by the plot, not rounded to the nearest point in my time array. Here's an example of a shape which bounds a region in my program.
While X1 may represent the coordinate (2007-06-12 03:42:20.070901+00:00, 5.2345) according to matplotlib, the closest coordinate existing in time and height might be something like (2007-06-12 03:42:20.070801+00:00, 5.219), only a small bit off from matplotlib's coordinate.
The Problem
So given some arbitrary value, let's say x1 = 732839.154395 (a representation of the date in number format), and a list of similar values with a constant step:
732839.154392
732839.154392
732839.154393
732839.154393
732839.154394
732839.154394
732839.154395
732839.154396
732839.154396
732839.154397
732839.154397
732839.154398
732839.154398
732839.154399
etc...
What would be the most efficient way of finding the closest representation of that point? I could simply loop through the list and grab the value with the smallest difference, but the size of time is huge. Since I know the array is (1) sorted and (2) incremented with a constant step, I was thinking this problem should be solvable in O(1) time. Is there a known algorithm that solves this kind of problem, or would I simply need to devise a custom algorithm? Here is my current thought process (sketched in code after the list):
grab the first and second elements of time
subtract the first element of time from the second to obtain the step
subtract the first element of time from the bounding x value to obtain the difference
divide the difference by the step to obtain the index
move forward to that index in time
check the surrounding elements of the index to ensure it is the closest representation
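A minimal sketch of that thought process (assuming times is a sorted Python list of floats with a roughly constant step, e.g. matplotlib date numbers):

def closest_index(times, x):
    step = times[1] - times[0]                 # constant step between samples
    i = int(round((x - times[0]) / step))      # jump straight to the estimated index
    i = max(0, min(i, len(times) - 1))         # clamp to the valid range
    # Check the immediate neighbours in case of floating-point drift.
    candidates = [j for j in (i - 1, i, i + 1) if 0 <= j < len(times)]
    return min(candidates, key=lambda j: abs(times[j] - x))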
The algorithm you suggest seems reasonable and should work.
As has become clear in your comments, the problem with it is the coarseness at which your time was recorded. (This can be common when unsynchronized data is recorded, i.e. the data-generation clock, e.g. the frame rate, is not synced with the computer.)
The easy way around this is to read two points separated by a larger time, so, for example, read the first time value and then the 1000th time value. Everything then stays the same in your calculation, but you get your timestep by subtracting the two and dividing by 1000.
Here's a test that makes data a similar to yours:
start = 97523.29783
increment = .000378912098
target = 97585.23452

# build a timeline
times = []
time = start
actual_index = None
for i in range(1000000):
    trunc = float(str(time)[:10])   # truncate the time value
    times.append(trunc)
    if actual_index is None and time > target:
        actual_index = i
    time = time + increment

# now test
intervals = [1, 2, 5, 10, 100, 1000, 10000]
for i in intervals:
    dt = (times[i] - times[0]) / i
    index = int((target - start) / dt)
    print(" %6i %8i %8i %.10f" % (i, actual_index, index, dt))
Result:
span actual guess est dt (actual=.000378912098)
1 163460 154841 0.0004000000
2 163460 176961 0.0003500000
5 163460 162991 0.0003800000
10 163460 162991 0.0003800000
100 163460 163421 0.0003790000
1000 163460 163464 0.0003789000
10000 163460 163460 0.0003789100
That is, as the space between the sampled points gets larger, the time-interval estimate gets more accurate (compare it to increment in the program) and the estimated index (3rd column) gets closer to the actual index (2nd column). Note that the accuracy of the dt estimate is basically proportional to the number of digits in the span. The best you could do is use the times at the start and end points, but it seemed from your question that this would be difficult; if it's not, it will give the most accurate estimate of your time interval. Note that here, for clarity, I exaggerated the lack of accuracy by recording the time values very coarsely, but in general, every power of 10 in your span increases your accuracy by the same amount.
As an example of that last point, if I reduce the coarseness of the time values by changing the truncation line to trunc = float(str(time)[:12]), I get:
span actual guess est dt (actual=.000378912098)
1 163460 163853 0.0003780000
10 163460 163464 0.0003789000
100 163460 163460 0.0003789100
1000 163460 163459 0.0003789120
10000 163460 163459 0.0003789121
So if, as you say, using a span of 1 gets you very close, using a span of 100 or 1000 should be more than enough.
Overall, this is very similar in idea to a linear "interpolation search". It's just a bit easier to implement because it only makes a single guess based on the interpolation, so it takes one line of code: int((target-start)*i/(times[i] - times[0]))
What you're describing is pretty much interpolation search. It works very much like binary search, but instead of choosing the middle element it assumes the distribution is close to uniform and guesses the approximate location.
The Wikipedia article on interpolation search contains a C++ implementation.
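For reference, a rough Python sketch of interpolation search adapted to return the index of the closest value (my own adaptation, assuming strictly increasing values; this is not the Wikipedia implementation):

def interpolation_search_closest(values, target):
    lo, hi = 0, len(values) - 1
    while lo < hi:
        # Guess where the target "should" sit if the values were perfectly linear.
        guess = lo + int((target - values[lo]) * (hi - lo) / (values[hi] - values[lo]))
        guess = max(lo, min(guess, hi))
        if values[guess] < target:
            lo = guess + 1
        elif values[guess] > target:
            hi = guess
        else:
            return guess
    # lo now borders the target; pick the closest of its neighbours.
    candidates = [j for j in (lo - 1, lo, lo + 1) if 0 <= j < len(values)]
    return min(candidates, key=lambda j: abs(values[j] - target))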
What you did is actually finding the value of the n-th element of an arithmetic sequence given the first two elements, which is of course a fine approach.
Apart from the real question: if you have so much data that you can't fit it into RAM, you could set up something like memory-mapped files, or simply create virtual-memory files (which on Linux is swap).
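As an illustration, numpy's memmap is one easy way to page a large on-disk array on demand (the filename and dtype here are assumptions, not from the question):

import numpy as np

# Open a huge on-disk array without reading it all into RAM.
times = np.memmap('times.dat', dtype='float64', mode='r')

# Slices are read from disk lazily, page by page, as they are accessed.
chunk = times[1_000_000:1_000_100]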
I'm trying to write a script that, when executed, appends a newly available piece of information and removes data that's over 10 minutes old.
I'm wondering what's the most efficient way, performance-wise, of keeping track of the specific time of each information element while also removing the data over 10 minutes old.
My novice thought would be to append the information with a timestamp, [info, time], to a deque, and in a while loop continuously evaluate the end of the deque to remove anything older than 10 minutes... I doubt this is the best way.
Can someone provide an example? Thanks.
One way to do this is to use a sorted tree structure, keyed on the timestamps. Then you can find the first element >= 10 minutes ago, and remove everything before that.
Using the bintrees library as an example (because its key-slicing syntax makes this very easy to read and write…):
import datetime
import bintrees

q = bintrees.FastRBTree()
now = datetime.datetime.now()
q[now] = 'a'
q[now - datetime.timedelta(seconds=5)] = 'b'
q[now - datetime.timedelta(seconds=10)] = 'c'
q[now - datetime.timedelta(seconds=15)] = 'd'

now = datetime.datetime.now()
del q[:now - datetime.timedelta(seconds=10)]
That will remove everything up to, but not including, now-10s, which should be both c and d.
This way, finding the first element to remove takes log N time, and removing the N elements below it should be amortized log N in the average case, but worst case N. So your overall worst-case time complexity doesn't improve, but your average case does.
Of course the overhead of managing a tree instead of a deque is pretty high, and could easily be higher than the savings of N/log N steps if you're dealing with a pretty small queue.
There are other logarithmic data structures that may be more appropriate, like a pqueue/heapqueue (as implemented by heapq in the stdlib), or a clock ring; I just chose a red-black tree because (with a PyPI module) it was the easiest one to demonstrate.
If you're only ever appending to the end, and the values are always inherently in sorted order, you don't actually need a logarithmic data structure like a tree or heap at all; you can do a logarithmic search within any sorted random-access structure like a list or collections.deque.
The problem is that deleting everything up to an arbitrary point in a list or deque takes O(N) time. There's no fundamental reason it should; you ought to be able to drop N elements off a deque in amortized constant time (with del q[:pos] or q.popleft(pos)), it's just that collections.deque doesn't do that. If you find or write a deque class that does have that feature, you could just write this:
import bisect
import datetime
from collections import deque

q = deque()
now = datetime.datetime.now()
q.append((now - datetime.timedelta(seconds=15), 'd'))
q.append((now - datetime.timedelta(seconds=10), 'c'))
q.append((now - datetime.timedelta(seconds=5), 'b'))
q.append((now, 'a'))

now = datetime.datetime.now()
# Bisect against a 1-tuple so tuples are compared with tuples.
pos = bisect.bisect_left(q, (now - datetime.timedelta(seconds=10),))
del q[:pos]   # needs a deque type that supports slice deletion (collections.deque does not)
I'm not sure whether a deque like this exists on PyPI, but the C source to collections.deque is available to fork, or the Python source from PyPy, or you could wrap a C or C++ deque type, or write one from scratch…
Or, if you're expecting that the "current" values in the deque will always be a small subset of the total length, you can do it in O(M) time just by not using the deque destructively:
q = q[pos:]
In fact, in that case, you might as well just use a list; it has O(1) append on the right, and slicing the last M items off a list is about as low-overhead a way to copy M items as you're going to find.
Yet another answer, with even more constraints:
If you can bucket things with, e.g., one-minute precision, all you need is 10 lists.
Unlike my other constrained answer, this doesn't require that you only ever append on the right; you can append in the middle (although you'll come after any other values for the same minute).
The down side is that you can't actually remove everything more than 10 minutes old; you can only remove everything in the 10th bucket, which could be off by up to 1 minute. You can choose what this means by choosing how to round:
Truncate one way, and nothing ever gets dropped too early, but everything is dropped late, an average of 30 seconds and at worst 60.
Truncate the other way, and nothing ever gets dropped late, but everything is dropped early, an average of 30 seconds and at worst 60.
Round at half, and things get dropped both early and late, but with an average of 0 seconds and a worst case of 30.
And you can of course use smaller buckets, like 100 buckets of 6-second intervals instead of 10 buckets of 1-minute intervals, to cut the error down as far as you like. Push that too far and you'll ruin the efficiency; a list of 600000 buckets of 1ms intervals is nearly as slow as a list of 1M entries.* But if you need 1 second or even 50ms, that's probably fine.
Here's a simple example:
def prune_queue(self):
    now = int(time.time() // 60)           # current minute number
    age = now - self.last_bucket_time      # whole minutes since the last prune
    if age > 0:
        keep = max(10 - age, 0)            # buckets still younger than 10 minutes
        # Keep the newest `keep` buckets and pad with fresh, empty ones.
        self.buckets = (self.buckets[len(self.buckets) - keep:] +
                        [[] for _ in range(10 - keep)])
        self.last_bucket_time = now

def enqueue(self, thing):
    self.prune_queue()
    self.buckets[-1].append(thing)
* Of course you could combine this with the logarithmic data structure—a red-black tree of 600000 buckets is fine.