I have two different functions for solving the knapsack problem.
The difference between these functions is that v2 uses less space than v1. From my time complexity analysis, the v2 function should not be faster than v1.
However, after running my test cases several times, I found that v2 is significantly faster than v1, and I cannot understand why.
I am using Python Unittest.
Here are the test times:
v1 execution time:
Ran 1 test in 35.985s
v2 execution time:
Ran 1 test in 25.294s
Here is my v1 function:
def knapsack_bottom_up_v1(self):
    N = len(self.values)
    C = self.capacity
    # table
    dp = [[0 for rc in range(C+1)] for i in range(N)]
    # filling out the table
    for i in range(0, N):
        i_weight = self.weights[i]
        i_val = self.values[i]
        for rc in range(1, C+1):
            # edge case
            if i == 0:
                if i_weight > rc:
                    dp[i][rc] = 0
                else:
                    dp[i][rc] = i_val
            # recurrence relation
            if i_weight > rc:
                dp[i][rc] = dp[i-1][rc]
            else:
                dp[i][rc] = max(dp[i-1][rc], dp[i-1][rc-i_weight] + i_val)
    return dp[N-1][C]
Here is my v2 function:
def knapsack_bottom_up_v2(self):
    N = len(self.values)
    C = self.capacity
    # prev_dp == dp[i-1]
    prev_dp = [0]*(C+1)
    # dp == dp[i]
    dp = [0]*(C+1)
    # filling out the table
    for i in range(0, N):
        i_weight = self.weights[i]
        i_val = self.values[i]
        for rc in range(1, C+1):
            # recurrence relation
            if i_weight > rc:
                dp[rc] = prev_dp[rc]
            else:
                dp[rc] = max(prev_dp[rc], prev_dp[rc-i_weight] + i_val)
        prev_dp, dp = dp, prev_dp
        for i in range(len(dp)):
            dp[i] = 0
    return prev_dp[C]
Here is also the test case I'm using:
values = [825594,1677009,1676628,1523970,943972,97426,69666,1296457,1679693,\
1902996,1844992,1049289,1252836,1319836,953277,2067538,675367,853655,\
1826027,65731,901489,577243,466257,369261]
weights = [382745,799601,909247,729069,467902,44328,34610,698150,823460,903959,\
853665,551830,610856,670702,488960,951111,323046,446298,931161,31385,\
496951,264724,224916,169684]
capacity = 6404180
solution = [1,1,0,1,1,1,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,1,1,1]
Can anyone help me understand why the execution time of v2 is so much shorter than v1's? I think they should take about the same time; if anything, v2 should be slightly slower than v1.
Thanks!
The time difference mainly comes from the two or three extra list indexing operations in each inner loop iteration.
I ran a test on my machine with just two extra indexing operations per inner iteration, and the difference was about 9 seconds:
>>> from timeit import timeit
>>> N, C = 24, 6404180   # assuming the question's item count and capacity
>>> lst = [0]
>>> timeit("""for i in range(C):
...     prev = lst[0]
...     for j in range(N):
...         prev
...         prev
... """, globals=globals(), number=1)
3.6931853000132833
>>> timeit("""for i in range(C):
...     for j in range(N):
...         lst[0]
...         lst[0]
... """, globals=globals(), number=1)
12.408428700000513
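To see the effect on the original functions, here is a sketch (my addition, not part of the measurement above) of v1 with the row lookups hoisted into local names; its inner loop then performs roughly the same number of indexing operations as v2, so the gap should mostly disappear:
# Sketch only: assumes the same self.values / self.weights / self.capacity
# attributes as in the question.
def knapsack_bottom_up_v1_hoisted(self):
    N = len(self.values)
    C = self.capacity
    dp = [[0] * (C + 1) for _ in range(N)]
    for i in range(N):
        i_weight = self.weights[i]
        i_val = self.values[i]
        row = dp[i]            # current row, looked up once per item
        prev_row = dp[i - 1]   # previous row (row N-1, still all zeros, when i == 0)
        for rc in range(1, C + 1):
            if i_weight > rc:
                row[rc] = prev_row[rc]
            else:
                row[rc] = max(prev_row[rc], prev_row[rc - i_weight] + i_val)
    return dp[N - 1][C]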
I have to speed up my current code to do around 10^6 operations in a feasible time. Before using multiprocessing on my actual data I tried it on a mock case. The following is my attempt:
import concurrent.futures
import time

import numpy as np

def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out

def do_something(List):
    # in the real case this function takes about 0.5 seconds to finish for each iteration
    turn = []
    for e in List:
        turn.append((e[0]**2, e[1]**2, e[2]**2))
    return turn

t1 = time.time()

List = []
# in the real case these 20's can go as high as 150
for i in range(1, 20-2):
    for k in range(i+1, 20-1):
        for j in range(k+1, 20):
            List.append((i, k, j))

t3 = time.time()

test = []
List = chunkIt(List, 3)

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(do_something, List)
        for result in results:
            test.append(result)

test = np.array(test)
t2 = time.time()
T = t2 - t1
T2 = t3 - t1
However, when I increase the size of my "List", my computer tries to use all of my RAM and CPU and freezes. I even cut my "List" into 3 pieces so it will only use 3 of my cores. However, nothing changed. Also, when I tried it on a smaller data set, I noticed the code ran much slower than on a single core.
I am still very new to multiprocessing in Python; am I doing something wrong? I would appreciate it if you could help me.
To reduce memory usage, I suggest you instead use the multiprocessing module, and specifically its imap method (or imap_unordered). Unlike the map method of either multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor, the iterable argument is processed lazily. This means that if you use a generator function or generator expression for the iterable argument, you do not need to create the complete list of arguments in memory; as a process in the pool becomes free and ready to execute more tasks, the generator is called upon to produce the next argument for the imap call.
By default a chunksize value of 1 is used, which can be inefficient for a large iterable. When you use map with the default chunksize of None, the pool looks at the length of the iterable (first converting it to a list if necessary) and then computes what it deems to be an efficient chunksize based on that length and the size of the pool. When using imap or imap_unordered, converting the iterable to a list would defeat the whole purpose of using that method. But if you know (more or less) what that size would be, there is no reason not to apply the same chunksize calculation the map method would have used, and that is what is done below.
The following benchmarks perform the same processing, first in a single process and then with multiprocessing using imap, where each invocation of do_something on my desktop takes approximately 0.5 seconds. do_something has been modified to process a single (i, k, j) tuple, as there is no longer any need to break anything up into smaller lists:
from multiprocessing import Pool, cpu_count
import time

def half_second():
    HALF_SECOND_ITERATIONS = 10_000_000
    sum = 0
    for _ in range(HALF_SECOND_ITERATIONS):
        sum += 1
    return sum

def do_something(tpl):
    # in real case this function takes about 0.5 seconds to finish for each iteration
    half_second() # on my desktop
    return tpl[0]**2, tpl[1]**2, tpl[2]**2

"""
def generate_tpls():
    for i in range(1, 20-2):
        for k in range(i+1, 20-1):
            for j in range(k+1, 20):
                yield i, k, j
"""

# Use smaller number of tuples so we finish in a reasonable amount of time:
def generate_tpls():
    # 64 tuples:
    for i in range(1, 5):
        for k in range(1, 5):
            for j in range(1, 5):
                yield i, k, j

def benchmark1():
    """ single processing """
    t = time.time()
    for tpl in generate_tpls():
        result = do_something(tpl)
    print('benchmark1 time:', time.time() - t)

def compute_chunksize(iterable_size, pool_size):
    """ This is more-or-less the function used by the Pool.map method """
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize

def benchmark2():
    """ multiprocessing """
    t = time.time()
    pool_size = cpu_count() # 8 logical cores (4 physical cores)
    N_TUPLES = 64 # number of tuples that will be generated
    pool = Pool(pool_size)
    chunksize = compute_chunksize(N_TUPLES, pool_size)
    for result in pool.imap(do_something, generate_tpls(), chunksize=chunksize):
        pass
    print('benchmark2 time:', time.time() - t)

if __name__ == '__main__':
    benchmark1()
    benchmark2()
Prints:
benchmark1 time: 32.261038303375244
benchmark2 time: 8.174998044967651
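If you also need the computed tuples (as in your original mock-up) rather than just the timing, the lazily produced results can be collected straight into a NumPy array. A small sketch under the same assumptions, reusing do_something, generate_tpls and compute_chunksize from the benchmark above (N_TUPLES must match what the generator will yield):
import numpy as np
from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    pool_size = cpu_count()
    N_TUPLES = 64
    chunksize = compute_chunksize(N_TUPLES, pool_size)
    with Pool(pool_size) as pool:
        results = list(pool.imap(do_something, generate_tpls(), chunksize=chunksize))
    test = np.array(results)   # shape (N_TUPLES, 3), like the original mock-up
    print(test.shape)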
The nested for loops creating the array before the main definition appear to be the problem. Moving that part underneath the if __name__ == '__main__': guard clears up the memory problems: with the spawn start method (the default on Windows and macOS), every worker process re-imports the main module, so module-level work such as building that big list is repeated in every worker unless it is protected by the guard.
import concurrent.futures
import time

import numpy as np

def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out

def do_something(List):
    # in the real case this function takes about 0.5 seconds to finish for each iteration
    turn = []
    for e in List:
        turn.append((e[0]**2, e[1]**2, e[2]**2))
    return turn

if __name__ == '__main__':
    t1 = time.time()

    List = []
    # in the real case these 20's can go as high as 150
    for i in range(1, 20-2):
        for k in range(i+1, 20-1):
            for j in range(k+1, 20):
                List.append((i, k, j))

    t3 = time.time()

    test = []
    List = chunkIt(List, 3)

    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(do_something, List)
        for result in results:
            test.append(result)

    test = np.array(test)
    t2 = time.time()
    T = t2 - t1
    T2 = t3 - t1
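A tiny illustration (my own sketch, not part of the answer above) of why the guard matters: under the spawn start method, each worker process re-imports the main module, so any print or heavy computation at module level runs again in every worker.
import concurrent.futures
import os

# This line runs in the parent and again in every spawned worker process.
print('module level executed in pid', os.getpid())

def square(x):
    return x * x

if __name__ == '__main__':
    # Only the parent executes this block.
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        print(list(executor.map(square, range(4))))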
Let A be a numpy 1D array of size 5 to 20 million.
I'd like to determine, for each i, how many items among A[i-1000000], A[i-999999], ..., A[i-2], A[i-1] are smaller than A[i].
Said in another way: I'm looking for the proportion of items smaller than A[i] in a 1-million-item window preceding A[i].
I've tested various approaches and a few answers were given in Rolling comparison between a value and a past window, with percentile/quantile:
import numpy as np
A = np.random.random(5*1000*1000)
n = 1000*1000
B = (np.lib.stride_tricks.as_strided(A, shape=(n,A.size-n), strides=(A.itemsize,A.itemsize)) <= A[n:]).sum(0)
#or similar version with "view_as_windows(A, n)"
Finally the fastest solution was some naive code + numba:
from numba import jit, prange

@jit(parallel=True)
def doit(A, n):
    Q = np.zeros(len(A))
    for i in prange(n, len(Q)):
        Q[i] = np.sum(A[i-n:i] <= A[i])
    return Q

C = doit(A, n)
But even with this code, it's too slow for me with A of length 5 million and n = 1 million: about 30 minutes to do this computation!
Is there a cleverer approach that avoids re-comparing 1 million items for each element of the output?
Note: an approximate proportion with 10^(-3) precision, like "~34.3% of the 1-million previous items are smaller than A[i]", would be enough.
Here is an "exact" approach. It solves the 5,000,000 / 1,000,000 sized problem (with floats) in under 20 seconds on rather pedestrian hardware.
I apologize for the rather technical code. I'm not sure it can be made much more readable.
The basic idea is to partition the array into a binary-ish tree-like thing (sorry, no formal scicomp training).
For example, if we have chunks of size half a million, then we can sort each of those at linlog cost and afterwards find the contribution of any block to each element of the next block at amortized constant cost.
The tricky bit is how to piece chunks of different sizes together in such a way that in the end we've counted everything and exactly once.
My approach is to start with small blocks and then keep fusing pairs of those. In principle that should keep the cost of sorting linear at each iteration because in theory (but not in numpy) we could fully exploit the sortedness of the smaller chunks.
As mentioned above the code is tricky mostly because we need to compare the right elements to any given block. It basically comes down to two rules: 1) The block must be fully contained in the element's lookback. 2) the block must not be contained in a larger block that is fully contained in the element's lookback.
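The cheap "contribution of a sorted block" step can be illustrated in isolation (my own toy sketch, not the code below): once a block is sorted, a single np.searchsorted call counts, for many later elements at once, how many block elements are strictly smaller.
import numpy as np

block = np.array([0.3, 0.9, 0.1, 0.7])   # a preceding block
queries = np.array([0.5, 0.2, 0.8])      # elements whose lookback fully contains the block
sorted_block = np.sort(block)            # sort once per block
counts = np.searchsorted(sorted_block, queries, side='left')
print(counts)                            # [2 1 3]: elements of `block` smaller than each query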
Anyway, here is a sample run
size 5_000_000, lookback 1_000_000 -- took 14.593 seconds
seems correct -- 10_000 samples checked
and the code:
UPDATE: simplified the code a bit, also runs faster
UPDATE 2: added a version that does "<=" instead of "<"
"<":
import numpy as np
from numpy.lib.stride_tricks import as_strided
def add_along_axis(a, indices, values, axis):
    if axis<0:
        axis += a.ndim
    I = np.ogrid[(*map(slice, a.shape),)]
    I = *I[:axis], indices, *I[axis+1:]
    a[I] += values

aaa, taa, paa = add_along_axis, np.take_along_axis, np.put_along_axis
m2f, f2m = np.ravel_multi_index, np.unravel_index

def inv_perm(p):
    i = np.empty_like(p)
    paa(i, p, np.arange(p.shape[-1]), -1)
    return i

def rolling_count_smaller(data, n):
    N = len(data)
    b = n.bit_length()
    NN = (((N-1)>>b)+2)<<b
    d0 = np.empty(NN, data.dtype)
    d0[NN-N:] = data[::-1]
    d0[:NN-N] = data.max() + 1
    dt, it, r0 = d0.copy(), np.zeros(NN, int), np.zeros(NN, int)
    ch, ch2 = 1, 2
    for i in range(b-1):
        d0.shape = dt.shape = it.shape = r0.shape = -1, 2, ch
        sh = dt.shape
        (il, ir), (jl, jr), (k, _) = f2m(m2f(np.add(sh, (-1, -2, -1)), sh) - (n, n-ch), sh)
        I = min(il, ir) + 1
        bab = np.empty((I, ch2), dt.dtype)
        bab[:, ch:] = dt[sh[0]-I:, 0]
        IL, IR = np.s_[il-I+1:il+1, ir-I+1:ir+1]
        bab[:, k:ch] = d0[IL, jl, k:]
        bab[:, :k] = d0[IR, jr, :k]
        o = bab.argsort(1, kind='stable')
        ns, io = (o>=ch).cumsum(1), inv_perm(o)
        r0[IL, jl, k:] += taa(ns, io[:, k:ch], 1)
        r0[IR, jr, :k] += taa(ns, io[:, :k], 1)
        it[:, 1, :] += ch
        dt.shape = it.shape = r0.shape = -1, ch2
        o = dt.argsort(1, kind='stable')
        ns, io = (o>=ch).cumsum(1), inv_perm(o)
        aaa(r0, it[:, :ch], taa(ns, io[:, :ch], 1), 1)
        dt, it = taa(dt, o, 1), taa(it, o, 1)
        ch, ch2 = ch2, ch2<<1
    si, sj = dt.shape
    o = as_strided(dt, (si-1, sj<<1), dt.strides).argsort(1, kind='stable')
    ns, io = (o>=ch).cumsum(1), inv_perm(o)
    r0[:-1, ch2-n-1:] += taa(ns, taa(io, inv_perm(it)[:-1, ch2-n-1:], 1), 1)
    return r0.ravel()[:NN-N-1:-1]
l = 1000
data = np.random.randint(-99, 100, (5*l,))
from time import perf_counter as pc
t = pc()
x = rolling_count_smaller(data, l)
t = pc() - t
print(f'size {data.size:_d}, lookback {l:_d} -- took {t:.3f} seconds')
check = 1000
sample = np.random.randint(0, len(x), check)
y = np.array([np.count_nonzero(data[max(0, i-l):i]<data[i]) for i in sample])
assert np.all(y==x[sample])
print(f'seems correct -- {check:_d} samples checked')
"<=":
import numpy as np
from numpy.lib.stride_tricks import as_strided
def add_along_axis(a, indices, values, axis):
    if axis<0:
        axis += a.ndim
    I = np.ogrid[(*map(slice, a.shape),)]
    I = *I[:axis], indices, *I[axis+1:]
    a[I] += values

aaa, taa, paa = add_along_axis, np.take_along_axis, np.put_along_axis
m2f, f2m = np.ravel_multi_index, np.unravel_index

def inv_perm(p):
    i = np.empty_like(p)
    paa(i, p, np.arange(p.shape[-1]), -1)
    return i

def rolling_count_smaller(data, n):
    N = len(data)
    b = n.bit_length()
    NN = (((N-1)>>b)+2)<<b
    d0 = np.empty(NN, data.dtype)
    d0[:N] = data
    d0[N:] = data.max() + 1
    dt, it, r0 = d0.copy(), np.zeros(NN, int), np.zeros(NN, int)
    ch, ch2 = 1, 2
    for i in range(b-1):
        d0.shape = dt.shape = it.shape = r0.shape = -1, 2, ch
        sh = dt.shape
        (il, ir), (jl, jr), (k, _) = f2m(m2f((0, 1, 0), sh) + (n-ch+1, n+1), sh)
        I = sh[0] - max(il, ir)
        bab = np.empty((I, ch2), dt.dtype)
        bab[:, :ch] = dt[:I, 1]
        IL, IR = np.s_[il:il+I, ir:ir+I]
        bab[:, ch+k:] = d0[IL, jl, k:]
        bab[:, ch:ch+k] = d0[IR, jr, :k]
        o = bab.argsort(1, kind='stable')
        ns, io = (o<ch).cumsum(1), inv_perm(o)
        r0[IL, jl, k:] += taa(ns, io[:, ch+k:], 1)
        r0[IR, jr, :k] += taa(ns, io[:, ch:ch+k], 1)
        it[:, 1, :] += ch
        dt.shape = it.shape = r0.shape = -1, ch2
        o = dt.argsort(1, kind='stable')
        ns, io = (o<ch).cumsum(1), inv_perm(o)
        aaa(r0, it[:, ch:], taa(ns, io[:, ch:], 1), 1)
        dt, it = taa(dt, o, 1), taa(it, o, 1)
        ch, ch2 = ch2, ch2<<1
    si, sj = dt.shape
    o = as_strided(dt, (si-1, sj<<1), dt.strides).argsort(1, kind='stable')
    ns, io = (o<ch).cumsum(1), inv_perm(o)
    r0[1:, :n+1-ch] += taa(ns, taa(io, ch+inv_perm(it)[1:, :n+1-ch], 1), 1)
    return r0.ravel()[:N]
l = 1000
data = np.random.randint(-99, 100, (5*l,))
from time import perf_counter as pc
t = pc()
x = rolling_count_smaller(data, l)
t = pc() - t
print(f'size {data.size:_d}, lookback {l:_d} -- took {t:.3f} seconds')
check = 1000
sample = np.random.randint(0, len(x), check)
y = np.array([np.count_nonzero(data[max(0, i-l):i]<=data[i]) for i in sample])
assert np.all(y==x[sample])
print(f'seems correct -- {check:_d} samples checked')
First attempt at an answer, based on the assumption (from the comments) that
we could as well use 16-bits integers by pre-multiplying A by 32768
and rounding. The precision would be enough with int16
Assuming we're working with int16 numbers: I would try to maintain a relatively small array of size 2**16 counting how many times each number appeared in the last 1M window. Maintaining the array is O(1) per step: with each index increment you just decrement the count of the number that has left the window and increment the count of the new number.
Then counting how many numbers in the window are smaller than the current number reduces to summing the array over all indices up to (excluding) the current number.
Assuming A[i] is in the range [-32768, 32768]:
B = np.zeros(2 * 32768 + 1)
Q = np.zeros(len(A))
n = 1000 * 1000

def adjust_index(i):
    return int(i) + 32768

for i in range(len(Q)):
    if i >= n + 1:
        B[adjust_index(A[i - n - 1])] -= 1
    if i > 0:
        B[adjust_index(A[i - 1])] += 1
    Q[i] = B[:adjust_index(A[i])].sum() / float(n)
This ran on my machine in about one minute.
You can trade off space and some speed for accuracy by using a larger (or smaller) range of integers (e.g. multiplying by 2**17 instead of 2**16 to be more accurate at the cost of some speed, or multiplying by 2**15 to get results faster but less accurately).
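For completeness, a small sketch (my addition) of the pre-multiplication step this answer assumes; 32767 is used rather than 32768 so the rounded values stay inside the int16 range:
import numpy as np

A_float = np.random.random(5 * 1000 * 1000)       # original float data in [0, 1)
A = np.rint(A_float * 32767).astype(np.int16)     # quantized copy, values in [0, 32767]
# A can now be fed to the counting loop above; the quantization error of about
# 1/32768 of the value range is what limits the precision of the proportions.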
Sorry in advance for not implementing my idea for you; I don’t quite have the time right now. But I hope it helps!
Notation
I'll use n as the array size, and k as the window size.
The Concept
For each element A[i], build a splay tree
ordering all elements a in A[max(0, i-k): i+1], and then use the splay tree to count the number of elements a < A[i]. The advantage here is that the splay trees for adjacent elements A[i] & A[i+1] will differ only by one node insertion and (for i > k) one node removal, which reduces the time needed to build the splay trees.
The required operations have the following complexities:
for each i: n iterations of the steps below
adding A[i] as a node to the splay tree: amortized O(log k)
counting a < A[i]: since adding A[i] puts it in the root position, you need only check the left branch’s size counter -> O(1)
removing A[i-k-1] node: amortized O(log k)
Overall complexity: amortized O(n log(k))
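For reference, here is a minimal sketch (my addition, not the answerer's code) of the same O(n log k) sliding-window bookkeeping, using sortedcontainers.SortedList as a stand-in for a hand-rolled splay tree with subtree size counters. It only shows the bookkeeping and is not meant to compete with the compiled solutions in the other answers.
from sortedcontainers import SortedList

def rolling_smaller_counts(A, k):
    window = SortedList()                 # holds A[max(0, i-k):i] at the top of iteration i
    out = []
    for i, x in enumerate(A):
        out.append(window.bisect_left(x)) # number of window elements < x
        window.add(x)                     # A[i] joins the window seen by later elements
        if i >= k:
            window.remove(A[i - k])       # drop the element that just left the window
    return out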
Reposting the contents of my comment at @Basj's request:
The Thought
Suppose for a window size k, you use the window A[i-k: i] not for the element A[i], but one of its neighbors A[i+1] (or A[i-1]).
The contents of this window A[i-k:i] are almost identical to that of the "true window for A[i+1]", A[i-k+1: i+1]; k-1 of their elements are the same, with only 1 (potentially) non-matching element. This would affect the lessers count for A[i+1] by at most 1; either the changed element is counted when the real one would not be, or vice-versa. Thus at the most, the lessers count for A[i+1] will deviate from "the true count for A[i+1]" by at most 1.
By the same logic, doing the same for A[i+2] (or A[i-2]) would give you a max deviation of 2, and more generally, doing the same for A[i+j] would give you a max deviation of abs(j).
So if your target precision is 1e-3, meaning that your acceptable error is half of that, 5e-4, then you could instead approximate results for the whole set of values A[i+j] for j in range(int(-k * 5e-4), int(k * 5e-4)), by simply reusing the same window A[i-k: i] for each A[i+j].
...Now what?
You can simply adjust your code to count the lessers in this adjusted window for each A[i+j], and increment i by k*1e-3 chunks.
...but this doesn't save you any time. You're still taking a chunk of k numbers, and counting the number of values less than some reference value a, and doing so for 5 million a's. That's exactly what you did before.
So the question is: how can you abuse the repetition to save time?
@Basj I'll leave the rest of this thought to you. It is Finals season, after all ;]
Here is a pythranized version of my solution. It is roughly twice as fast and I think more readable, even if it is longer. The obvious downside is the added pythran dependency.
The main workhorse is _mergesorted3; it scales well with increasing block size but is comparatively slow at small block sizes.
I've written one specialist for blocksize 1 to demonstrate how much more speed one could potentially gain.
import numpy as np
from _mergesorted2 import _mergesorted_1
from _mergesorted3 import _mergesorted3
from time import perf_counter as pc
USE_SPEC_1 = True
def rolling_count_smaller(D, n, countequal=True):
    N = len(D)
    B = n.bit_length() - 1
    # now: 2^(B+1) >= n > 2^B
    # result and sorter
    R, S = np.zeros(N, int), np.empty(N, int) if USE_SPEC_1 else np.arange(N)
    FL, FH, SL, SH = (np.zeros(3, dt) for dt in 'llll')
    T = pc()
    if USE_SPEC_1:
        _mergesorted_1(D, R, S, n, countequal)
    for b in range(USE_SPEC_1, B):
        print(b, pc()-T)
        T = pc()
        # for each odd block first treat the elements that are so far to its
        # right that they can see that block in full but not the block
        # containing it
        # most of the time (whenever 2^b does not divide n+1) these will span
        # two blocks, hence fall into two ordered subgroups
        # thus do a threeway merge, but only a "dry run":
        # update the counts R but not the sorter S
        L, BB = n+1, ((n>>b)+1)<<b
        if L == BB:
            Kref = int(countequal)
            SL[1-countequal] = BB
            SH[1-countequal] = BB+(1<<b)
            FL[1-countequal] = BB
            FH[1-countequal] = n+1+(1<<b)
            SL[2] = SH[2] = FL[2] = FH[2] = 0
        else:
            Kref = countequal<<1
            SL[1-countequal:3-countequal] = BB-(1<<b), BB
            SH[1-countequal:3-countequal] = BB, BB+(1<<b)
            FL[1-countequal:3-countequal] = L, BB
            FH[1-countequal:3-countequal] = BB, n+1+(1<<b)
        SL[Kref] = FL[Kref] = 1<<b
        SH[Kref] = FH[Kref] = 1<<(b+1)
        _mergesorted3(D, R, S, SL, SH, FL, FH, N, 1<<(b+1), Kref, False, True)
        # merge pairs of adjacent blocks
        SL[...] = 0
        SL[1-countequal] = 1<<b
        SH[2] = 0
        SH[:2] = SL[:2] + (1<<b)
        _mergesorted3(D, R, S, SL, SH, FL, FH, N, 1<<(b+1), int(countequal), True, False)
    # in this last step even and odd blocks are treated the same because
    # neither can be contained in larger valid block
    SL[...] = 0
    SL[1-countequal] = 1<<B
    SH[2] = 0
    SH[int(countequal)] = 1<<B
    SH[1-countequal] = 1<<(B+1)
    FL[...] = 0
    FL[1-countequal] = 1<<B
    FH[2] = 0
    FH[int(countequal)] = 1<<B
    FH[1-countequal] = n+1
    _mergesorted3(D, R, S, SL, SH, FL, FH, N, 1<<B, int(countequal), False, True)
    return R
countequal=True
l = 1_000_000
np.random.seed(0)
data = np.random.randint(-99, 100, (5*l,))
from time import perf_counter as pc
t = pc()
x = rolling_count_smaller(data, l, countequal)
t = pc() - t
print(f'size {data.size:_d}, lookback {l:_d} -- took {t:.3f} seconds')
check = 10
sample = np.random.randint(0, len(x), check)
if countequal:
    y = np.array([np.count_nonzero(data[max(0, i-l):i]<=data[i]) for i in sample])
else:
    y = np.array([np.count_nonzero(data[max(0, i-l):i]<data[i]) for i in sample])
assert np.all(y==x[sample])
print(f'seems correct -- {check:_d} samples checked')
The main worker _mergesorted3.py. Compile: pythran _mergesorted3.py
import numpy as np

#pythran export _mergesorted3(float[:], int[:], int[:], int[3], int[3], int[3], int[3], int, int, int, bool, bool)
#pythran export _mergesorted3(int[:], int[:], int[:], int[3], int[3], int[3], int[3], int, int, int, bool, bool)

# DB, RB, SB are the data, result and sorter arrays; here they are treated a
# bit like base pointers, hence the B in the names
# SL, SH are the low and high ends of the current rows of the three queues
# the next rows are assumed to be at offset N
# FL, FH are low and high ends of ranges in non sorted order used to filter
# each queue. they are ignored if 'filter' is False
# ST is the top index; this can fall in the middle of a row which will then be
# processed partially
# Kref is the index of the reference queue (the one whose elements are counted)
def _mergesorted3(DB, RB, SB, SL, SH, FL, FH, ST, N, Kref, writeback, filter):
    if writeback: # set up row buffer for writing back of merged sort order
        SLbuf = min(SL[0], SL[1]) # low end of row
        SHbuf = max(SH[0], SH[1]) # high end of row
        Sbuf = np.empty(SHbuf-SLbuf, int) # buffer
        Ibuf = 0 # index
    D = np.empty(3, DB.dtype) # heads of the three queues. values
    S = np.empty(3, int) # heads of the three queues. sorters
    while True: # loop over rows
        C = 0 # count of elements in the reference block seen so far
        I = SL.copy() # heads of the three queues. indices
        S[:2] = SB[I[:2]] # the inner loop expects the heads of the two non
                          # active (i.e. not incremented just now) queues
                          # to be in descending order
        if filter: # skip elements that are not within a contiguous range.
                   # this requires filtering because everything is referenced
                   # in sorted order. so we cannot directly select ranges in
                   # the original order
                   # it is the caller's responsibility that for all except
                   # possibly the last row the filtered queues are not empty
            for KK in range(2):
                while S[KK] < FL[KK] or S[KK] >= FH[KK]:
                    I[KK] += 1
                    S[KK] = SB[I[KK]]
        D[:2] = DB[S[:2]] # fetch the first two queue head values
        # and set the inter queue sorter accordingly
        K = np.array([1, 0, 2], int) if D[1] > D[0] else np.array([0, 1, 2], int)
        while I[K[2]] < SH[K[2]]: # loop to merge three rows
            # get a valid new element from the active queue at sorter level
            S[K[2]] = SB[I[K[2]]]
            if filter and (S[K[2]] < FL[K[2]] or S[K[2]] >= FH[K[2]]):
                I[K[2]] += 1
                continue
            # fetch the corresponding value
            D[K[2]] = DB[S[K[2]]]
            # re-establish inter-queue sort order
            if D[K[2]] > D[K[1]] or (D[K[2]] == D[K[1]] and K[2] < K[1]):
                K[2], K[1] = K[1], K[2]
            if D[K[1]] > D[K[0]] or (D[K[1]] == D[K[0]] and K[1] < K[0]):
                K[1], K[0] = K[0], K[1]
            # do the book keeping depending on which queue has become active
            if K[2] == Kref: # reference queue: adjust counter
                C += 1
            else: # other: add current ref element count to head of result queue
                RB[S[K[2]]] += C
            I[K[2]] += 1 # advance active queue
        # one queue has been exhausted, which one?
        if K[2] == Kref: # reference queue: no need to sort what's left just
                         # add the current ref element count to all leftovers
                         # subject to filtering if applicable
            if filter:
                KK = SB[I[K[1]]:SH[K[1]]]
                RB[KK[(KK >= FL[K[1]]) & (KK < FH[K[1]])]] += C
                KK = SB[I[K[0]]:SH[K[0]]]
                RB[KK[(KK >= FL[K[0]]) & (KK < FH[K[0]])]] += C
            else:
                RB[SB[I[K[1]]:SH[K[1]]]] += C
                RB[SB[I[K[0]]:SH[K[0]]]] += C
        else: # one of the other queues: we are left with a two-way merge
              # this is in a separate loop because it also supports writing
              # back the new sort order which we do not need in the three way
              # situation
            while I[K[1]] < SH[K[1]]:
                S[K[1]] = SB[I[K[1]]]
                if filter and (S[K[1]] < FL[K[1]] or S[K[1]] >= FH[K[1]]):
                    I[K[1]] += 1
                    continue
                D[K[1]] = DB[S[K[1]]]
                if D[K[1]] > D[K[0]] or (D[K[1]] == D[K[0]] and K[1] < K[0]):
                    K[1], K[0] = K[0], K[1]
                if K[1] == Kref:
                    C += 1
                else:
                    RB[S[K[1]]] += C
                if writeback: # we cannot directly write back without messing
                              # things up. instead we buffer one row at a time
                    Sbuf[Ibuf] = S[K[1]]
                    Ibuf += 1
                I[K[1]] += 1
            # a second queue has been exhausted. which one?
            if K[1] == Kref: # the reference queue: must update results in
                             # the remainder of the other queue
                if filter:
                    KK = SB[I[K[0]]:SH[K[0]]]
                    RB[KK[(KK >= FL[K[0]]) & (KK < FH[K[0]])]] += C
                else:
                    RB[SB[I[K[0]]:SH[K[0]]]] += C
            if writeback: # write back updated order
                # the leftovers of the last remaining queue have not been
                # buffered but being contiguous can be written back directly
                # the way this is used by the main script actually gives a
                # fifty-fifty chance of copying something exactly onto itself
                SB[SLbuf+Ibuf:SHbuf] = SB[I[K[0]]:SH[K[0]]]
                # now copy the buffer
                SB[SLbuf:SLbuf+Ibuf] = Sbuf[:Ibuf]
                SLbuf += N; SHbuf += N
                Ibuf = 0
        SL += N; SH += N
        if filter:
            FL += N; FH += N
        # this is ugly:
        # going to the next row we must check whether one or more queues
        # have fully or partially hit the ceiling ST.
        # if two and fully we are done
        # if one fully we must alter the queue indices to make sure the
        # empty queue is at index 2, because of the requirement of having
        # at least one valid element in queues 0 and 1
        done = -1
        for II in range(3):
            if SH[II] == SL[II]:
                if done >= 0:
                    done = -2
                    break
                done = II
            elif SH[II] > ST:
                if SL[II] >= ST or (filter and FL[II] >= ST):
                    if done >= 0:
                        done = -2
                        break
                    done = II
                    if writeback:
                        SHbuf -= SH[II] - SL[II]
                    SH[II] = SL[II] = 0
                else:
                    if writeback:
                        SHbuf -= SH[II] - ST
                    SH[II] = ST
                    if filter and FH[II] > ST:
                        FH[II] = ST
        if done == Kref or done == -2:
            break
        elif done == 0:
            SL[:2], SH[:2] = SL[1:], SH[1:]
            if filter:
                FL[:2], FH[:2] = FL[1:], FH[1:]
            SH[2] = SL[2]
            Kref -= 1
        elif done == 1:
            SL[1], SH[1] = SL[2], SH[2]
            if filter:
                FL[1], FH[1] = FL[2], FH[2]
            SH[2] = SL[2]
            Kref >>= 1
And the special case, _mergesorted2.py (compile with: pythran _mergesorted2.py):
import numpy as np
#pythran export _mergesorted_1(float[:], int[:], int[:], int, bool)
#pythran export _mergesorted_1(int[:], int[:], int[:], int, bool)
def _mergesorted_1(DB, RB, SB, n, countequal):
    N = len(DB)
    K = ((N-n-1)>>1)<<1
    for i in range(0, K, 2):
        if DB[i] < DB[i+1] or (countequal and DB[i] == DB[i+1]):
            SB[i] = i
            SB[i+1] = i+1
            RB[i+1] += 1
        else:
            SB[i] = i+1
            SB[i+1] = i
        if DB[i+1] < DB[i+1+n] or (countequal and DB[i+1] == DB[i+1+n]):
            RB[i+1+n] += 1
    for i in range(K, (N>>1)<<1, 2):
        if DB[i] < DB[i+1] or (countequal and DB[i] == DB[i+1]):
            SB[i] = i
            SB[i+1] = i+1
            RB[i+1] += 1
        else:
            SB[i] = i+1
            SB[i+1] = i
    if N & 1:
        SB[N-1] = N-1
Here is an approximate approach that is simple to implement and runs in O(n) time (21 seconds for 5M values on my laptop). It should work well for data sets whose values vary by more than 1/1000th of the largest difference.
from collections import deque, Counter
import numpy as np

def lessCount(A, window):
    precision = 1000 # 1/1000th of the value range
    result = deque()
    counts = [0]*(precision+1)
    minVal = min(A)
    chunkSize = (max(A)-minVal)/precision
    keys = deque()
    for i, a in enumerate(A):
        key = int((a-minVal)/chunkSize)
        keys.append(key)
        counts[key] += 1
        lowerCount = sum(counts[:key])
        result.append(lowerCount)
        if i < window: continue
        counts[keys.popleft()] -= 1
    return np.array(result)
It builds a rolling array of counts where the index is the relative position of the value, divided into chunks. The chunk size is 1/1000th of the largest difference between values. For each element in A there is only one addition and one subtraction to the array of counts. The number of values lower than the current one is the sum of counts up to the position of that value in the counts array. You can increase the precision as you need, but keep in mind that the time will be proportional to O(n)*precision.
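The question asks for a proportion rather than a raw count, so here is a small normalization sketch (my addition), assuming the lessCount function above; the divisor is the number of elements actually present in each lookback window, clipped to at least 1 to avoid dividing by zero at the first element.
import numpy as np

A = np.random.random(100_000)
window = 1_000
counts = lessCount(A, window)
denominators = np.minimum(np.maximum(np.arange(len(A)), 1), window)
proportions = counts / denominators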
I would like to query the value of an exponentially weighted moving average at particular points. An inefficient way to do this is as follows: l is the list of event times and queries holds the times at which I want the value of this average.
a = 0.01
l = [3,7,10,20,200]
y = [0]*1000
for item in l:
    y[int(item)] = 1

s = [0]*1000
for i in xrange(1,1000):
    s[i] = a*y[i-1]+(1-a)*s[i-1]

queries = [23,68,103]
for q in queries:
    print s[q]
Outputs:
0.0355271185019
0.0226018371526
0.0158992102478
In practice l will be very large and the range of values in l will also be huge. How can you find the values at the times in queries more efficiently, and in particular without computing the potentially huge lists y and s explicitly? I need it to be in pure Python so I can use PyPy.
Is it possible to solve the problem in time proportional to len(l)
and not max(l) (assuming len(queries) < len(l))?
Here is my code for doing this:
def ewma(l, queries, a=0.01):
    def decay(t0, x, t1, a):
        from math import pow
        return pow((1-a), (t1-t0))*x

    assert l == sorted(l)
    assert queries == sorted(queries)

    samples = []
    try:
        t0, x0 = (0.0, 0.0)
        it = iter(queries)
        q = it.next()-1.0
        for t1 in l:
            # new value is decayed previous value, plus a
            x1 = decay(t0, x0, t1, a) + a
            # take care of all queries between t0 and t1
            while q < t1:
                samples.append(decay(t0, x0, q, a))
                q = it.next()-1.0
            # take care of all queries equal to t1
            while q == t1:
                samples.append(x1)
                q = it.next()-1.0
            # update t0, x0
            t0, x0 = t1, x1
        # take care of any remaining queries
        while True:
            samples.append(decay(t0, x0, q, a))
            q = it.next()-1.0
    except StopIteration:
        return samples
I've also uploaded a fuller version of this code with unit tests and some comments to pastebin: http://pastebin.com/shhaz710
EDIT: Note that this does the same thing as what Chris Pak suggests in his answer, which he must have posted as I was typing this. I haven't gone through the details of his code, but I think mine is a bit more general. This code supports non-integer values in l and queries. It also works for any kind of iterables, not just lists since I don't do any indexing.
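Usage sketch (my addition, Python 2 like the code above): with the data from the question this should reproduce the three values printed there, since the decay-based update is mathematically the same recurrence.
l = [3, 7, 10, 20, 200]
queries = [23, 68, 103]
for value in ewma(l, queries, a=0.01):
    print value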
I think you could do it in O(log(len(l))) time per query, if l is sorted. The basic idea is that the non-recursive form of the EMA is s_i = a*y_(i-1) + a*(1-a)*y_(i-2) + a*(1-a)^2*y_(i-3) + ...
This means that for a query time k, you find the greatest entry in l less than k and, up to an estimation limit, sum terms of the form below, where i indexes backwards through l and v maps each event time to its value (1 if every event just contributes 1):
(1-a)^(k - l[i]) * v[l[i]] + ...
Then you spend lg(len(l)) time in the search plus a constant multiple for the depth of your estimation. I'll provide a code sample in a little bit (after work) if you want it; I just wanted to get the idea out there while I was thinking about it.
here's the code -
v is the dictionary of values at a given time; replace with 1 if it's just a 1 every time...
import math
from bisect import bisect_right

a = .01
limit = 1000
l = [1,5,14,29...]

def find_nearest_lt(l, time):
    i = bisect_right(l, time)
    if i:
        return i-1
    raise ValueError

def find_ema(l, time):
    i = find_nearest_lt(l, time)
    if l[i] == time:
        result = a * v[l[i]]
        i -= 1
    else:
        result = 0
    while i >= 0 and (time-l[i]) < limit:  # stop at the start of l
        result += math.pow(1-a, time-l[i]) * v[l[i]]
        i -= 1
    return result
If I'm thinking correctly, the find-nearest step is O(log n), and the while loop runs at most 1000 iterations, guaranteed, so it's technically a constant (though a rather large one). find_nearest_lt was adapted from the bisect documentation - http://docs.python.org/2/library/bisect.html
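Usage sketch (my addition): unit-valued events at the question's event times, estimating the EMA at each query time; v is the event-value dictionary mentioned above.
l = [3, 7, 10, 20, 200]
v = {t: 1 for t in l}
for q in [23, 68, 103]:
    print find_ema(l, q)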
It appears that y is a binary value -- either 0 or 1 -- depending on the values of l. Why not use y = set(int(item) for item in l)? A set is the most efficient way to store and look up a collection of numbers for membership testing.
Also, as written, your code will quietly use the wrong values the first time through this loop:
s = [0]*1000
for i in xrange(1000):
    s[i] = a*y[i-1]+(1-a)*s[i-1]
because i-1 is -1 when i=0 (first pass of loop) and both y[-1] and s[-1] are the last element of the list, not the previous. Maybe you want xrange(1,1000)?
How about this code:
a = 0.01
l = [3.0,7.0,10.0,20.0,200.0]
y = set(int(item) for item in l)
queries = [23,68,103]

ewma = []
x = 1 if (0 in y) else 0
for i in xrange(1, queries[-1]+1):  # +1 so the loop reaches the last query point
    x = (1-a)*x
    if i in y:
        x += a
    if i == queries[0]:
        ewma.append(x)
        queries.pop(0)
Edited to include SchighSchagh's improvements.