Python Deque - 10 minutes worth of data

I'm trying to write a script that, when executed, appends a new available piece of information and removes data that's over 10 minutes old.
I'm wondering what's the most efficient way, performance-wise, of keeping track of the specific time of each information element while also removing the data over 10 minutes old.
My novice thought would be to append the information with a timestamp - [info, time] - to the deque, and in a while loop continuously evaluate the end of the deque to remove anything older than 10 minutes... I doubt this is the best way.
Can someone provide an example? Thanks.

One way to do this is to use a sorted tree structure, keyed on the timestamps. Then you can find the first element >= 10 minutes ago, and remove everything before that.
Using the bintrees library as an example (because its key-slicing syntax makes this very easy to read and write…):
import datetime
import bintrees

q = bintrees.FastRBTree()
now = datetime.datetime.now()
q[now] = 'a'
q[now - datetime.timedelta(seconds=5)] = 'b'
q[now - datetime.timedelta(seconds=10)] = 'c'
q[now - datetime.timedelta(seconds=15)] = 'd'
now = datetime.datetime.now()
del q[:now - datetime.timedelta(seconds=10)]
That will remove everything up to, but not including, now - 10s, which should be both c and d: because now was re-read after the inserts, the cutoff is slightly later than the timestamp stored for c, so c falls below it too.
This way, finding the first element to remove takes O(log N) time, and removing the elements below it should be amortized O(log N) on average but O(N) in the worst case. So your overall worst-case time complexity doesn't improve, but your average case does.
Of course the overhead of managing a tree instead of a deque is pretty high, and could easily be higher than the savings of N/log N steps if you're dealing with a pretty small queue.
There are other logarithmic data structures that may be more appropriate, like a pqueue/heapqueue (as implemented by heapq in the stdlib), or a clock ring; I just chose a red-black tree because (with a PyPI module) it was the easiest one to demonstrate.
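For instance, here's a minimal sketch of the same expiry idea using the stdlib heapq; the add/prune helpers and the 10-minute max_age are just illustrative, not part of the original answer:
import heapq
import time

q = []  # heap of (timestamp, item) pairs; the oldest entry sits at q[0]

def add(item):
    heapq.heappush(q, (time.time(), item))

def prune(max_age=600):
    # Pop entries older than max_age seconds off the top of the heap.
    cutoff = time.time() - max_age
    while q and q[0][0] < cutoff:
        heapq.heappop(q)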

If you're only ever appending to the end, and the values are always inherently in sorted order, you don't actually need a logarithmic data structure like a tree or heap at all; you can do a logarithmic search within any sorted random-access structure like a list or collections.deque.
The problem is that deleting everything up to an arbitrary point in a list or deque takes O(N) time. There's no reason that it should; you should be able to drop N elements off a deque in amortized constant time (with del q[:pos] or q.popleft(pos)), it's just that collections.deque doesn't do that. If you find or write a deque class that does have that feature, you could just write this:
import bisect
import datetime
from collections import deque

q = deque()
now = datetime.datetime.now()
q.append((now - datetime.timedelta(seconds=15), 'd'))
q.append((now - datetime.timedelta(seconds=10), 'c'))
q.append((now - datetime.timedelta(seconds=5), 'b'))
q.append((now, 'a'))
now = datetime.datetime.now()
pos = bisect.bisect_left(q, (now - datetime.timedelta(seconds=10),))
del q[:pos]  # assumes a deque type that supports slice deletion
I'm not sure whether a deque like this exists on PyPI, but the C source to collections.deque is available to fork, or the Python source from PyPy, or you could wrap a C or C++ deque type, or write one from scratch…
Or, if you're expecting that the "current" values in the deque will always be a small subset of the total length, you can do it in O(M) time just by not using the deque destructively:
q = q[pos:]
In fact, in that case, you might as well just use a list; it has O(1) append on the right, and slicing the last M items off a list is about as low-overhead a way to copy M items as you're going to find.
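Put together, a minimal sketch of that non-destructive list version might look like this; the append/prune helper names and the 10-minute window are just illustrative:
import bisect
import datetime

q = []  # (timestamp, item) pairs, appended in increasing time order

def append(item):
    q.append((datetime.datetime.now(), item))

def prune():
    # Find the cutoff by binary search, then keep only the current items.
    global q
    cutoff = datetime.datetime.now() - datetime.timedelta(minutes=10)
    pos = bisect.bisect_left(q, (cutoff,))
    q = q[pos:]  # O(M) copy of the M items still in the window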

Yet another answer, with even more constraints:
If you can bucket things with, e.g., one-minute precision, all you need is 10 lists.
Unlike my other constrained answer, this doesn't require that you only ever append on the right; you can append in the middle (although you'll come after any other values for the same minute).
The down side is that you can't actually remove everything more than 10 minutes old; you can only remove everything in the 10th bucket, which could be off by up to 1 minute. You can choose what this means by choosing how to round:
Truncate one way, and nothing ever gets dropped too early, but everything is dropped late, an average of 30 seconds and at worst 60.
Truncate the other way, and nothing ever gets dropped late, but everything is dropped early, an average of 30 seconds and at worst 60.
Round at half, and things get dropped both early and late, but with an average of 0 seconds and a worst case of 30.
And you can of course use smaller buckets, like 100 buckets of 6-second intervals instead of 10 buckets of 1-minute intervals, to cut the error down as far as you like. Push that too far and you'll ruin the efficiency; a list of 600000 buckets of 1ms intervals is nearly as slow as a list of 1M entries.* But if you need 1 second or even 50ms, that's probably fine.
Here's a simple example:
import time

def prune_queue(self):
    now = int(time.time() // 60)
    age = min(now - self.last_bucket_time, 10)
    if age:
        # drop the oldest `age` buckets and append that many fresh ones
        self.buckets = self.buckets[age:] + [[] for _ in range(age)]
        self.last_bucket_time = now

def enqueue(self, thing):
    self.prune_queue()
    self.buckets[-1].append(thing)
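The two methods above assume some surrounding state; a minimal, purely illustrative initialization (the class name is an assumption) might be:
import time

class BucketQueue:
    def __init__(self):
        # 10 one-minute buckets, oldest first; prune_queue and enqueue
        # above would be methods of this class.
        self.buckets = [[] for _ in range(10)]
        self.last_bucket_time = int(time.time() // 60)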
* Of course you could combine this with the logarithmic data structure—a red-black tree of 600000 buckets is fine.

Related

What is optimal algorithm to check if a given integer is equal to sum of two elements of an int array?

import numpy as np

def check_set(S, k):
    S2 = k - S
    set_from_S2 = set(S2.flatten())
    for x in S:
        if x in set_from_S2:
            return True
    return False
I have a given integer k. I want to check if k is equal to the sum of two elements of the array S.
S = np.array([1,2,3,4])
k = 8
It should return False in this case because no two elements of S sum to 8. The code above treats 8 as 4 + 4, so it returns True.
I can't find an algorithm that solves this problem with O(n) complexity.
Can someone help me?
You have to account for multiple instances of the same item, so a set is not a good choice here.
Instead you can use a dictionary mapping each value to its number of occurrences (or, equivalently, collections.Counter).
A = [3, 1, 2, 3, 4]
Cntr = {}
for x in A:
    if x in Cntr:
        Cntr[x] += 1
    else:
        Cntr[x] = 1

#k = 11
k = 8
ans = False
for x in A:
    if (k - x) in Cntr:
        if k == 2 * x:
            if Cntr[k - x] > 1:
                ans = True
                break
        else:
            ans = True
            break
print(ans)
Returns True for k=5,6 (I added one more 3) and False for k=8,11
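For reference, here is the same check written with collections.Counter, mentioned above as a variant; the function name is just illustrative:
from collections import Counter

def has_pair_sum(A, k):
    cntr = Counter(A)
    for x in A:
        need = k - x
        # If the complement is x itself, we need at least two copies of it.
        if need in cntr and (need != x or cntr[need] > 1):
            return True
    return False

print(has_pair_sum([3, 1, 2, 3, 4], 6))  # True (3 + 3)
print(has_pair_sum([1, 2, 3, 4], 8))     # False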
Adding onto MBo's answer.
"Optimal" can be an ambiguous term in terms of algorithmics, as there is often a compromise between how fast the algorithm runs and how memory-efficient it is. Sometimes we may also be interested in either worst-case resource consumption or in average resource consumption. We'll loop at worst-case here because it's simpler and roughly equivalent to average in our scenario.
Let's call n the length of our array, and let's consider 3 examples.
Example 1
We start with a very naive algorithm for our problem, with two nested loops that iterate over the array, and check for every two items of different indices if they sum to the target number.
Time complexity: the worst-case scenario (where the answer is False, or where it's True but we only find out on the last pair of items we check) takes n^2 loop iterations. If you're familiar with big-O notation, we say the algorithm's time complexity is O(n^2), which basically means that the time it takes to solve the problem grows more or less like n^2 with our input size n, up to a multiplicative factor (technically the notation means "at most like n^2 up to a multiplicative factor", but it's a common abuse of language to read it as "more or less like").
Space complexity (memory consumption): we only store an array, plus a fixed set of objects whose sizes do not depend on n (everything Python needs to run, the call stack, maybe two iterators and/or some temporary variables). The part of the memory consumption that grows with n is therefore just the size of the array, which is n times the amount of memory required to store an integer in an array (let's call that sizeof(int)).
Conclusion: Time is O(n^2), Memory is n*sizeof(int) (+O(1), that is, up to an additional constant factor, which doesn't matter to us, and which we'll ignore from now on).
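For concreteness, a sketch of this naive version (the function name is just illustrative):
def has_pair_sum_naive(S, k):
    # Try every pair of distinct indices: O(n^2) time, O(1) extra memory.
    n = len(S)
    for i in range(n):
        for j in range(n):
            if i != j and S[i] + S[j] == k:
                return True
    return False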
Example 2
Let's consider the algorithm in MBo's answer.
Time complexity: much, much better than in Example 1. We start by creating a dictionary. This is done in a loop over n. Setting keys in a dictionary is a constant-time operation in proper conditions, so the time taken by each step of that first loop does not depend on n; so far we've used O(n) in terms of time complexity. Now we only have one remaining loop over n. The time spent accessing elements of our dictionary is independent of n, so once again the total is O(n). Combining the two loops: since both grow like n up to a multiplicative factor, so does their sum (up to a different multiplicative factor). Total: O(n).
Memory: Basically the same as before, plus a dictionary of n elements. For the sake of simplicity, let's consider that these elements are integers (we could have used booleans), and forget about some of the aspects of dictionaries to only count the size used to store the keys and the values. There are n integer keys and n integer values to store, which uses 2*n*sizeof(int) in terms of memory. Add to that what we had before and we have a total of 3*n*sizeof(int).
Conclusion: Time is O(n), Memory is 3*n*sizeof(int). The algorithm is considerably faster when n grows, but uses three times more memory than example 1. In some weird scenarios where almost no memory is available (embedded systems maybe), this 3*n*sizeof(int) might simply be too much, and you might not be able to use this algorithm (admittedly, it's probably never going to be a real issue).
Example 3
Can we find a trade-off between Example 1 and Example 2?
One way to do that is to replicate the same kind of nested loop structure as in Example 1, but with some pre-processing to replace the inner loop with something faster. To do that, we sort the initial array, in place. Done with well-chosen algorithms, this has a time-complexity of O(n*log(n)) and negligible memory usage.
Once we have sorted our array, we write our outer loop (a regular loop over the whole array), and inside that outer loop, use dichotomy (binary search) to look for the number we're missing to reach our target k. This search has a time complexity of O(log(n)), and a memory consumption of O(log(n)) for the call stack if implemented recursively.
Time complexity: The pre-processing sort is O(n*log(n)). Then in the main part of the algorithm, we have n calls to our O(log(n)) dichotomy search, which totals to O(n*log(n)). So, overall, O(n*log(n)).
Memory: Ignoring the constant parts, we have the memory for our array (n*sizeof(int)) plus the memory for our call stack in the dichotomy search (O(log(n))). Total: n*sizeof(int) + O(log(n)).
Conclusion: Time is O(n*log(n)), Memory is n*sizeof(int) + O(log(n)). Memory is almost as small as in Example 1. Time complexity is slightly more than in Example 2. In scenarios where the Example 2 cannot be used because we lack memory, the next best thing in terms of speed would realistically be Example 3, which is almost as fast as Example 2 and probably has enough room to run if the very slow Example 1 does.
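For concreteness, a sketch of Example 3 using the stdlib bisect module for the dichotomy (the function name is just illustrative):
import bisect

def has_pair_sum_sorted(S, k):
    S = sorted(S)  # O(n log n) pre-processing
    for i, x in enumerate(S):
        j = bisect.bisect_left(S, k - x)  # O(log n) search for the complement
        # A match at an index other than i is a valid pair.
        if j < len(S) and S[j] == k - x and j != i:
            return True
        # If the search landed on i itself, a duplicate right after it also works.
        if j == i and j + 1 < len(S) and S[j + 1] == k - x:
            return True
    return False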
Overall conclusion
This answer was just to show that "optimal" is context-dependent in algorithmics. It's very unlikely that in this particular example, one would choose to implement Example 3. In general, you'd see either Example 1 if n is so small that one would choose whatever is simplest to design and fastest to code, or Example 2 if n is a bit larger and we want speed. But if you look at the wikipedia page I linked for sorting algorithms, you'll see that none of them is best at everything. They all have scenarios where they could be replaced with something better.

Which is better: deque or list slicing?

If I use the code
from collections import deque
q = deque(maxlen=2)
while step <= step_max:
    calculate(item)
    q.append(item)
    another_calculation(q)
how does it compare in efficiency and readability to
q = []
while step <= step_max:
    calculate(item)
    q.append(item)
    q = q[-2:]
    another_calculation(q)
calculate() and another_calculation() are not real in this case but in my actual program are simply two calculations. I'm doing these calculations every step for millions of steps (I'm simulating an ion in 2-d space). Because there are so many steps, q gets very long and uses a lot of memory, while another_calculation() only uses the last two values of q. I had been using the latter method, then heard deque mentioned and thought it might be more efficient; thus the question.
I.e., how do deques in python compare to just normal list slicing?
q = q[-2:]
now this is a costly operation because it recreates a list every time (and copies the references). A nasty side effect here is that it rebinds the name q, although you can use q[:] = q[-2:] to avoid that.
The deque object just changes the start of the list pointer and "forgets" the oldest item. So it's faster and it's one of the usages it's been designed for.
Of course, for 2 values, there isn't much difference, but for a bigger number there is.
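If in doubt, it's easy to measure; here's a rough, purely illustrative timeit comparison of the two approaches (exact numbers depend on the machine):
import timeit

deque_version = """
from collections import deque
q = deque(maxlen=2)
for i in range(10000):
    q.append(i)
"""

slice_version = """
q = []
for i in range(10000):
    q.append(i)
    q = q[-2:]
"""

# The deque version avoids the per-iteration list copy that slicing pays for.
print(timeit.timeit(deque_version, number=100))
print(timeit.timeit(slice_version, number=100))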
If I interpret your question correctly, you have a function, that calculates a value, and you want to do another calculation with this and the previous value. The best way is to use two variables:
previous_item = None  # needs a sensible initial value before the first pass
while step <= step_max:
    item = calculate()
    another_calculation(previous_item, item)
    previous_item = item
If the calculations are some form of vector math, you should consider using numpy.

Decreasing while loop running time for gigantic numbers

i = 1
d = 1
e = 20199
a = 10242248284272
while True:
    print(i)
    if (i * e) % a == 1:
        d = i
        break
    i = i + 1
The numbers shown are only placeholders indicating the lengths of the actual values.
I have a piece of Python code which runs on really huge numbers, and I have to wait for a day, maybe two. How can I decrease the time needed for the code to run?
There are a variety of things you can do to speed this up; whether it will be enough is another matter.
Avoid printing things you don't have to.
Avoid multiplication. Instead of computing i*e each iteration, you could have a new variable, say i_e, which holds this value. Each time you increment i, you can add e to it.
The vast majority of your iterations aren't even close to satisfying (i*e)%a==1. If you used a larger increment for i, you could avoid testing most of them. Of course, you have to be careful not to overshoot.
Because you are doing modular arithmetic, you don't need to use the full value of i in your test (though you will still need to track it, for the output)
Combining the previous two points, you can avoid computing the modulus: If you know that a<=x<2*a, then x-a==1 is equivalent to x%a==1.
None of these may be enough to reduce the complexity of the problem, but might be enough to reduce the constant factor to speed things up enough for your purposes.
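Putting several of those suggestions together, a sketch might look like this; the variable names follow the original code, and whether it's fast enough depends on the actual inputs:
e = 20199
a = 10242248284272
i = 1
i_e = e  # always holds (i * e) reduced modulo a, maintained by addition
while True:
    if i_e >= a:
        i_e -= a  # cheap reduction: i_e never exceeds a + e here
    if i_e == 1:
        d = i
        break
    i += 1
    i_e += e
# Note: like the original, this loops forever if gcd(e, a) != 1.
print(d)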

Time-based sliding window and measuring (change of) data arrival rate

I'm trying to implement a time-based sliding window (in Python), i.e., a data source inserts new data items, and items older than, say, 1h are automatically removed. On top of that, I need to measure the rate, or rather the change of rate, at which the data source inserts items.
My question is kind of two-fold. First, what is the best way to implement a time-based window? In my current, probably naive, solution I simply use a Python list: window = []. For a new data item, I append the item with the current timestamp: window.append((current_time, item)). Then, using a timer, every 1 sec I pop all leading elements with a timestamp older than (current time - 1h):
threshold = int(time.time() * 1000) - self.WINDOW_SIZE_IN_MS
while window:
    if window[0][0] < threshold:
        del window[0]  # drop the expired item at the front
    else:
        break
While this works, I wonder if there are more clever solutions to this.
More importantly, what would be a good way to measure the change of the rate at which data items enter the window? Here I have no good idea how to approach this, at least none that also sounds efficient. Something very naive I had in mind: I split the 1h window into 20 intervals of 5 min each and count the number of items per interval. If the most recent 5-min interval contains significantly more items than the average of the 20 intervals, I say there is a burst. But I would have to do this every, say, 1 min. This doesn't sound efficient, and there are a lot of parameters.
In short, I need to measure the acceleration in which new items enter my window. Are there best-practices approaches for this?
For the first part, it is more efficient to check for expired items and remove them when you receive a new item to add. That is, don't bother with a timer which wakes up the process for no reason once a second--just piggyback the maintenance work when real work is happening.
For the second part, the 1-hour window has a known extent. Store an integer which is the index, in the list, of the first item from the last five minutes. You can maintain this when doing an insert, and you know you only ever have to move it forward.
Putting it all together, pseudo-code:
from datetime import timedelta

window = []
recent_index = 0

def insert(time, item):
    global recent_index
    # Expire items older than one hour from the front of the window.
    while window and window[0][0] < time - timedelta(hours=1):
        window.pop(0)
        recent_index -= 1
    # Advance the marker past items older than five minutes.
    while recent_index < len(window) and window[recent_index][0] < time - timedelta(minutes=5):
        recent_index += 1
    window.append((time, item))
    return float(len(window) - recent_index) / len(window)
The above function returns what fraction of items from the past hour arrived in the past five minutes. Over 20 or 50%, say, and you have a burst.
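Illustrative usage of that sketch (the 0.5 threshold is arbitrary):
from datetime import datetime

frac = insert(datetime.now(), "event")
if frac > 0.5:
    print("burst: over half of the last hour's items arrived in 5 minutes")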

Recursive algorithm using memoization

My problem is as follows:
I have a list of missions each taking a specific amount of time and grants specific amount of points, and a time 'k' given to perform them:
e.g: missions = [(14,3),(54,5),(5,4)] and time = 15
in this example I have 3 missions and the first one gives me 14 points and takes 3 minutes.
I have 15 minutes total.
Each mission is a tuple with the first value being num of points for this mission and second value being num of minutes needed to perform this mission.
I have to find recursively using memoization the maximum amount of points I am able to get for a given list of missions and given time.
I am trying to implement a function called choose(missions, time) that will operate recursively and use the function choose_mem(missions, time, mem, k) to achieve my goal.
The function choose_mem should get k, which is the number of missions to choose from, and mem, an initially empty dictionary which will contain all the subproblems that have already been solved.
This is what I got so far. I need help implementing what is required above, namely the dictionary usage: mem is currently just there, staying empty all the time, and choose_mem never actually uses it.
If anyone can help me adjust my code it would be very appreciated.
mem = {}

def choose(missions, time):
    j = time
    result = []
    for i in range(len(missions), 0, -1):
        if choose_mem(missions, j, mem, i) != choose_mem(missions, j, mem, i - 1):
            j -= missions[i - 1][1]
    return choose_mem(missions, time, mem, len(missions))

def choose_mem(missions, time, mem, k):
    if k == 0:
        return 0
    points, a = missions[k - 1]
    if a > time:
        return choose_mem(missions, time, mem, k - 1)
    else:
        return max(choose_mem(missions, time, mem, k - 1),
                   choose_mem(missions, time - a, mem, k - 1) + points)
This is a bit vague, but your problem translates roughly to a very famous NP-complete problem, the Knapsack Problem.
You can read a bit more about it on Wikipedia; if you replace weight with time, you have your problem.
Dynamic programming is a common way to approach that problem, as you can see here:
http://en.wikipedia.org/wiki/Knapsack_problem#Dynamic_programming
Memoization is more or less equivalent to Dynamic Programming for practical purposes, so don't let the fancy name fool you.
The base concept is that you use an additional data structure to store parts of your problem that you already solved. Since the solution you're implementing is recursive, many sub-problems will overlap, and memoization allows you to only calculate each of them once.
So, the hard part is for you to think about your problem and decide what you need to store in the dictionary, so that when you call choose_mem with values that you have already calculated, you simply retrieve the result from the dictionary instead of doing another recursive call.
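To make that concrete, here is a minimal sketch of how choose_mem could use the dictionary, with one natural choice of key: the (k, time) pair that identifies each subproblem. Treat it as one possible answer, not the only one:
def choose_mem(missions, time, mem, k):
    if k == 0:
        return 0
    if (k, time) in mem:           # subproblem already solved: reuse it
        return mem[(k, time)]
    points, a = missions[k - 1]
    if a > time:
        best = choose_mem(missions, time, mem, k - 1)
    else:
        best = max(choose_mem(missions, time, mem, k - 1),
                   choose_mem(missions, time - a, mem, k - 1) + points)
    mem[(k, time)] = best          # store before returning
    return best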
If you want to check an implementation of the generic 0-1 Knapsack Problem (your case, since you can't add items partially), then this seemed to me like a good resource:
https://sites.google.com/site/mikescoderama/Home/0-1-knapsack-problem-in-p
It's well explained, and the code is readable enough. If you understand the usage of the matrix to store costs, then you'll have your problem worked out for you.
