Background
My software visualizes very large datasets. The data is so large that it can't all be held in RAM at once, so it has to be loaded in a paged fashion. I embed matplotlib in the backend of my application for displaying and manipulating the plot.
These datasets contain three internal lists I use to visualize: time, height, and dataset. My program plots time against height, and additionally users have the option of drawing shapes around regions of the graph that can be extracted to a whole different plot.
The difficult part is that when I want to extract the data from the shapes, the shape vertices are real coordinates computed by the plot, not values rounded to the nearest point in my time array. Here's an example of a shape that bounds a region in my program.
While X1 may represent the coordinate (2007-06-12 03:42:20.070901+00:00, 5.2345) according to matplotlib, the closest coordinate existing in time and height might be something like (2007-06-12 03:42:20.070801+00:00, 5.219), only slightly off from matplotlib's coordinate.
The Problem
So given some arbitrary value, let's say x1 = 732839.154395 (a representation of the date in number format), and a list of similar values with a constant step:
732839.154392
732839.154392
732839.154393
732839.154393
732839.154394
732839.154394
732839.154395
732839.154396
732839.154396
732839.154397
732839.154397
732839.154398
732839.154398
732839.154399
etc...
What would be the most efficient way of finding the closest representation of that point? I could simply loop through the list and grab the value with the smallest difference, but the size of time is huge. Since I know the array is (1) sorted and (2) increments with a constant step, I was thinking this problem should be solvable in O(1) time. Is there a known algorithm that solves this kind of problem, or would I need to devise a custom algorithm? Here is my current thought process (sketched in code below the steps):
grab first and second element of time
subtract second element of time with first, obtain step
subtract bounding x value with first element of time, obtain difference
divide difference by step, obtain index
move time forward to index
check surrounding elements of index to ensure closest representation
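A minimal sketch of those steps (closest_index is a name invented here for illustration; it assumes time is a sorted Python list or NumPy array of floats):

def closest_index(times, x):
    # a hypothetical helper: estimate the index of the value closest to x,
    # assuming times is sorted and increments with a roughly constant step
    step = times[1] - times[0]                 # step from the first two elements
    i = int((x - times[0]) / step)             # jump straight to the estimated index
    i = max(0, min(i, len(times) - 1))         # clamp to a valid range
    # check the surrounding elements to make sure we return the closest value
    candidates = range(max(0, i - 1), min(len(times), i + 2))
    return min(candidates, key=lambda j: abs(times[j] - x))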
The algorithm you suggest seems reasonable and should work.
As has become clear in your comments, the problem with it is the coarseness at which your time was recorded. (This can be common when unsynchronized data is recorded, i.e., when the data-generation clock, e.g., the frame rate, is not synced with the computer.)
The easy way around this is to read two points separated by a larger time: for example, read the first time value and then the 1000th. Everything in your calculation then stays the same, but you get your timestep by subtracting the two values and dividing by 1000.
Here's a test that makes data similar to yours:
start = 97523.29783
increment = .000378912098
target = 97585.23452

# build a timeline
times = []
time = start
actual_index = None
for i in range(1000000):
    trunc = float(str(time)[:10])  # truncate the time value to simulate coarse recording
    times.append(trunc)
    if actual_index is None and time > target:
        actual_index = i
    time = time + increment

# now test: estimate the step from points separated by various spans
intervals = [1, 2, 5, 10, 100, 1000, 10000]
for i in intervals:
    dt = (times[i] - times[0]) / i
    index = int((target - start) / dt)
    print(" %6i %8i %8i %.10f" % (i, actual_index, index, dt))
Result:
span actual guess est dt (actual=.000378912098)
1 163460 154841 0.0004000000
2 163460 176961 0.0003500000
5 163460 162991 0.0003800000
10 163460 162991 0.0003800000
100 163460 163421 0.0003790000
1000 163460 163464 0.0003789000
10000 163460 163460 0.0003789100
That is, as the space between the sampled points gets larger, the time-interval estimate gets more accurate (compare it to increment in the program) and the estimated index (3rd column) gets closer to the actual index (2nd column). Note that the accuracy of the dt estimate is basically proportional to the number of digits in the span. The best you could do is use the times at the start and end points; it seemed from your question statement that this would be difficult, but if it's not, it will give the most accurate estimate of your time interval. Note that here, for clarity, I exaggerated the lack of accuracy by making my time-value recording very coarse, but in general, every power of 10 in your span increases your accuracy by the same amount.
As an example of that last point, if I reduce the coarseness of the time values by changing the truncation line to trunc = float(str(time)[:12]), I get:
span actual guess est dt (actual=.000378912098)
1 163460 163853 0.0003780000
10 163460 163464 0.0003789000
100 163460 163460 0.0003789100
1000 163460 163459 0.0003789120
10000 163460 163459 0.0003789121
So if, as you say, using a span of 1 gets you very close, using a span of 100 or 1000 should be more than enough.
Overall, this is very similar in idea to the linear "interpolation search". It's just a bit easier to implement because it's only making a single guess based on the interpolation, so it just takes one line of code: int((target-start)*i/(times[i] - times[0]))
What you're describing is pretty much interpolation search. It works very much like binary search, but instead of choosing the middle element it assumes the distribution is close to uniform and guesses the approximate location.
The Wikipedia article on interpolation search contains a C++ implementation.
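For reference, a minimal Python sketch of interpolation search (the classic version searches for an exact match; for your floats you would check the neighbours of the final position for the nearest element):

def interpolation_search(a, target):
    lo, hi = 0, len(a) - 1
    while lo <= hi and a[lo] <= target <= a[hi]:
        if a[hi] == a[lo]:                    # avoid division by zero on a flat range
            break
        # guess the position by assuming the values are roughly uniformly spaced
        mid = lo + int((target - a[lo]) * (hi - lo) / (a[hi] - a[lo]))
        if a[mid] == target:
            return mid
        if a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return lo  # no exact match: lo is close; check its neighbours for the nearest value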
What you did is actually finding the index of the n-th element of an arithmetic sequence given the first two elements: since a_n = a_1 + (n-1)*d, the index is n = (x - a_1)/d + 1.
It is of course a good approach.
Apart from the real question: if you have so much data that you can't fit it into RAM, you could set up something like memory-mapped files, or simply create virtual-memory files (on Linux, swap).
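For example, with NumPy you could back the big arrays with a memory-mapped file so the OS pages data in and out on demand (the filename, dtype, and shape here are assumptions for illustration):

import numpy as np

# write once, e.g. during data acquisition
times = np.memmap('times.dat', dtype=np.float64, mode='w+', shape=(100_000_000,))

# later: open read-only; only the pages you actually touch get loaded into RAM
times = np.memmap('times.dat', dtype=np.float64, mode='r', shape=(100_000_000,))
step = times[1] - times[0]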
Say I have conducted an experiment where I've left a Python program running for some long time, and in that time I've taken several measurements of some quantity against time. Each measurement is separated by some value between 1 and 3 seconds, with the time step used much smaller than that... say 0.01s. An example of such an event, if you just take the y axis, might look like:
[...0,1,-1,4,1,0,0,2,3,1,0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1,1,0,0,1,-1,0,0,2,0...]
Here we have some period of inactivity followed by a sharp rise, fall, a brief pause around 0, drop sharply, rise sharply and settle again around 0. The dots indicate that this is part of a long stream of data extending in both directions. There will be many of these events over the whole dataset with varying lengths separated by low magnitude regions.
I wish to essentially form an array of 'n' arrays (tuples?) with varying lengths capturing just the events so I can analyse them separately later. I can't separate purely by an np.absolute() type threshold because there are occasional small regions of near zero values within a given event such as in the above example. In addition to this there may be occasional blips in between measurements with large magnitudes but short duration.
The sample above would ideally end up as below, with a couple of elements or so from the flat region on either side.
[0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1]
I'm thinking something like:
Input:
[0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0]
Split based on some number of consecutive values below a magnitude of 2.
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[8,-7,-1,0],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0],[1,22,4,]]
If a sub-array's length is less than, say, 10, then remove it:
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0]]
Is this a good way to approach it? The first step is also confusing me a little: I need to preserve those small low-magnitude regions within an event. (A rough sketch of this split-and-filter idea follows below.)
Re-edited! I'm going to be comparing two signals, each measured as a function of time, so they will be zipped together in a list of tuples.
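For what it's worth, here is a rough sketch of the split-and-filter idea (split_events and its parameter names are invented for illustration; the gap tolerance is what preserves short low-magnitude stretches inside an event):

import numpy as np

def split_events(a, threshold=2, min_gap=3, min_len=10):
    mask = np.abs(a) >= threshold          # mark samples that reach the threshold
    events, start, gap = [], None, 0
    for i, hot in enumerate(mask):
        if hot:
            if start is None:
                start = i                  # a new event begins
            gap = 0                        # quiet stretch inside the event: keep going
        elif start is not None:
            gap += 1
            if gap >= min_gap:             # enough consecutive quiet samples: event over
                events.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        events.append((start, len(a)))
    return [a[s:e] for s, e in events if e - s >= min_len]  # drop the short blips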
Here are my two cents, based on exponential smoothing.
import itertools
import numpy as np

A = np.array([0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0])
B = np.hstack(([0, 0], A, [0, 0]))  # pad both ends so every element gets a full window
B = np.asanyarray(list(zip(*[B[i:] for i in range(5)])))
C = (B * [0.25, 0.5, 1, 0.5, 0.25]).mean(axis=1)  # C is the 5-element sliding-window exponentially smoothed signal
D = []
for key, group in itertools.groupby(enumerate(C), lambda x: abs(x[1]) > 1.5):
    if key:
        D.append(list(group))  # get the indices where the smoothed signal has magnitude > 1.5; change 1.5 to control the behavior
E = [D[0]]
for item in D[1:]:
    if (item[0][0] - E[-1][-1][0]) < 5:  # merge interesting regions if they are fewer than 5 indices apart; change 5 to control the behavior
        E[-1] = E[-1] + item
    else:
        E.append(item)
print([(item[0][0], item[-1][0]) for item in E])
events = [A[item[0][0]:item[-1][0]] for item in E if (item[-1][0] - item[0][0]) > 9]  # filter out interesting regions shorter than 10
I'm trying to write a script that, when executed, appends a new available piece of information and removes data that's over 10 minutes old.
I'm wondering what's the most efficient way, performance-wise, of keeping track of the specific time of each information element while also removing the data that's over 10 minutes old.
My novice thought would be to append the information with a timestamp - [info, time] - to the deque, and in a while loop continuously evaluate the end of the deque to remove anything older than 10 minutes... I doubt this is the best way.
Can someone provide an example? Thanks.
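For concreteness, the naive version described above might look like this (a sketch; a 600-second window stands in for the 10 minutes):

import collections
import time

q = collections.deque()

def add(info):
    q.append((info, time.time()))   # newest entries go on the right
    cutoff = time.time() - 600      # anything older than 10 minutes has expired
    while q and q[0][1] < cutoff:   # oldest entries live at the left end
        q.popleft()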
One way to do this is to use a sorted tree structure, keyed on the timestamps. Then you can find the first element >= 10 minutes ago, and remove everything before that.
Using the bintrees library as an example (because its key-slicing syntax makes this very easy to read and write…):
import datetime
import bintrees

q = bintrees.FastRBTree()
now = datetime.datetime.now()
q[now] = 'a'
q[now - datetime.timedelta(seconds=5)] = 'b'
q[now - datetime.timedelta(seconds=10)] = 'c'
q[now - datetime.timedelta(seconds=15)] = 'd'
now = datetime.datetime.now()
del q[:now - datetime.timedelta(seconds=10)]
That will remove everything up to, but not including, now-10s, which should be both c and d.
This way, finding the first element to remove takes log N time, and removing N elements below that should be average case amortized log N but worst case N. So, your overall worst case time complexity doesn't improve, but your average case does.
Of course the overhead of managing a tree instead of a deque is pretty high, and could easily be higher than the savings of N/log N steps if you're dealing with a pretty small queue.
There are other logarithmic data structures that may be more appropriate, like a pqueue/heapqueue (as implemented by heapq in the stdlib), or a clock ring; I just chose a red-black tree because (with a PyPI module) it was the easiest one to demonstrate.
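For instance, a minimal sketch of the heapq variant (illustrative only):

import heapq
import time

heap = []  # entries are (timestamp, value) tuples; the oldest is always at heap[0]
heapq.heappush(heap, (time.time(), 'a'))

cutoff = time.time() - 600   # 10 minutes ago
while heap and heap[0][0] < cutoff:
    heapq.heappop(heap)      # pop expired entries until the oldest one is fresh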
If you're only ever appending to the end, and the values are always inherently in sorted order, you don't actually need a logarithmic data structure like a tree or heap at all; you can do a logarithmic search within any sorted random-access structure like a list or collections.deque.
The problem is that deleting everything up to an arbitrary point in a list or deque takes O(N) time. There's no reason that it should; you should be able to drop N elements off a deque in amortized constant time (with del q[:pos] or q.popleft(pos)), it's just that collections.deque doesn't do that. If you find or write a deque class that does have that feature, you could just write this:
import bisect
import datetime

q = deque()  # assumes a deque class that supports del q[:pos]; collections.deque does not
now = datetime.datetime.now()
q.append((now - datetime.timedelta(seconds=15), 'd'))
q.append((now - datetime.timedelta(seconds=10), 'c'))
q.append((now - datetime.timedelta(seconds=5), 'b'))
q.append((now, 'a'))  # appended in ascending time order, so bisect works
now = datetime.datetime.now()
# a one-element tuple compares against the (timestamp, value) pairs by timestamp alone
pos = bisect.bisect_left(q, (now - datetime.timedelta(seconds=10),))
del q[:pos]
I'm not sure whether a deque like this exists on PyPI, but the C source to collections.deque is available to fork, or the Python source from PyPy, or you could wrap a C or C++ deque type, or write one from scratch…
Or, if you're expecting that the "current" values in the deque will always be a small subset of the total length, you can do it in O(M) time just by not using the deque destructively:
q = q[pos:]
In fact, in that case, you might as well just use a list; it has O(1) append on the right, and slicing the last M items off a list is about as low-overhead a way to copy M items as you're going to find.
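Putting that together with a plain list (a sketch; the one-element-tuple trick makes bisect compare on the timestamp alone):

import bisect
import datetime

q = []  # (timestamp, value) pairs, appended in time order

def add(value):
    q.append((datetime.datetime.now(), value))

def prune():
    global q
    cutoff = datetime.datetime.now() - datetime.timedelta(minutes=10)
    pos = bisect.bisect_left(q, (cutoff,))  # first element >= cutoff
    q = q[pos:]                             # copy the M survivors, O(M)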
Yet another answer, with even more constraints:
If you can bucket things with, e.g., one-minute precision, all you need is 10 lists.
Unlike my other constrained answer, this doesn't require that you only ever append on the right; you can append in the middle (although you'll come after any other values for the same minute).
The down side is that you can't actually remove everything more than 10 minutes old; you can only remove everything in the 10th bucket, which could be off by up to 1 minute. You can choose what this means by choosing how to round:
Truncate one way, and nothing ever gets dropped too early, but everything is dropped late, an average of 30 seconds and at worst 60.
Truncate the other way, and nothing ever gets dropped late, but everything is dropped early, an average of 30 seconds and at worst 60.
Round at half, and things get dropped both early and late, but with an average of 0 seconds and a worst case of 30.
And you can of course use smaller buckets, like 100 buckets of 6-second intervals instead of 10 buckets of 1-minute intervals, to cut the error down as far as you like. Push that too far and you'll ruin the efficiency; a list of 600000 buckets of 1ms intervals is nearly as slow as a list of 1M entries.* But if you need 1 second or even 50ms, that's probably fine.
Here's a simple example:
import time

class BucketQueue:  # hypothetical wrapper class, to make the example self-contained
    def __init__(self):
        self.buckets = [[] for _ in range(10)]        # one bucket per minute
        self.last_bucket_time = int(time.time() // 60)

    def prune_queue(self):
        now = int(time.time() // 60)
        age = now - self.last_bucket_time
        if age:
            # shift in `age` fresh buckets and keep only the newest 10
            self.buckets = (self.buckets + [[] for _ in range(age)])[-10:]
            self.last_bucket_time = now

    def enqueue(self, thing):
        self.prune_queue()
        self.buckets[-1].append(thing)
* Of course you could combine this with the logarithmic data structure—a red-black tree of 600000 buckets is fine.