Expected execution time vs actual execution time in Python

I have two different functions for solving the knapsack problem.
The difference between these functions is that v2 uses less space than v1. From my time complexity analysis, v2 should not be faster than v1.
However, after running my test cases several times, I found that v2 is significantly faster than v1, and I cannot understand why.
I am using Python Unittest.
Here are the test times:
v1 execution time:
Ran 1 test in 35.985s
v2 execution time:
Ran 1 test in 25.294s
Here is my v1 function:
def knapsack_bottom_up_v1(self):
    N = len(self.values)
    C = self.capacity
    # table
    dp = [[0 for rc in range(C+1)] for i in range(N)]
    # filling out the table
    for i in range(0, N):
        i_weight = self.weights[i]
        i_val = self.values[i]
        for rc in range(1, C+1):
            # edge case
            if i == 0:
                if i_weight > rc:
                    dp[i][rc] = 0
                else:
                    dp[i][rc] = i_val
            # recurrence relation
            if i_weight > rc:
                dp[i][rc] = dp[i-1][rc]
            else:
                dp[i][rc] = max(dp[i-1][rc], dp[i-1][rc-i_weight] + i_val)
    return dp[N-1][C]
Here is my v2 function:
def knapsack_bottom_up_v2(self):
    N = len(self.values)
    C = self.capacity
    # prev_dp == dp[i-1]
    prev_dp = [0]*(C+1)
    # dp == dp[i]
    dp = [0]*(C+1)
    # filling out the table
    for i in range(0, N):
        i_weight = self.weights[i]
        i_val = self.values[i]
        for rc in range(1, C+1):
            # recurrence relation
            if i_weight > rc:
                dp[rc] = prev_dp[rc]
            else:
                dp[rc] = max(prev_dp[rc], prev_dp[rc-i_weight] + i_val)
        prev_dp, dp = dp, prev_dp
        for i in range(len(dp)):
            dp[i] = 0
    return prev_dp[C]
Here is also the test case I'm using:
values = [825594,1677009,1676628,1523970,943972,97426,69666,1296457,1679693,\
1902996,1844992,1049289,1252836,1319836,953277,2067538,675367,853655,\
1826027,65731,901489,577243,466257,369261]
weights = [382745,799601,909247,729069,467902,44328,34610,698150,823460,903959,\
853665,551830,610856,670702,488960,951111,323046,446298,931161,31385,\
496951,264724,224916,169684]
capacity = 6404180
solution = [1,1,0,1,1,1,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,1,1,1]
Can anyone help me understand why the execution time of v2 is faster than v1? I think it should be about the same; if anything, v2 should be slightly slower than v1.
Thanks!

The time difference mainly comes from the two or three extra index lookups in each inner loop.
I ran a test on my machine with two additional index lookups in the inner loop; the difference was about 9 seconds:
>>> from timeit import timeit
>>> lst = [0]
>>> # C and N were set beforehand (e.g. to the capacity and item count from the question)
>>> timeit("""for i in range(C):
...     prev = lst[0]
...     for j in range(N):
...         prev
...         prev
... """, globals=globals(), number=1)
3.6931853000132833
>>> timeit("""for i in range(C):
...     for j in range(N):
...         lst[0]
...         lst[0]
... """, globals=globals(), number=1)
12.408428700000513
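To see how much those per-iteration lookups cost inside v1 itself, here is a hedged sketch (mine, not from the original post) that keeps v1's full table but hoists the dp[i] and dp[i-1] row lookups out of the inner loop. It assumes the same self.values, self.weights and self.capacity attributes as the question:
def knapsack_bottom_up_v1_hoisted(self):
    # Sketch only: same recurrence as v1, but the rows are bound to locals
    # once per item instead of being re-indexed on every inner iteration.
    N = len(self.values)
    C = self.capacity
    dp = [[0] * (C + 1) for _ in range(N)]
    for i in range(N):
        i_weight = self.weights[i]
        i_val = self.values[i]
        row = dp[i]
        prev_row = dp[i - 1] if i > 0 else [0] * (C + 1)
        for rc in range(1, C + 1):
            if i_weight > rc:
                row[rc] = prev_row[rc]
            else:
                row[rc] = max(prev_row[rc], prev_row[rc - i_weight] + i_val)
    return dp[N - 1][C]
This removes most of the repeated double indexing that the timing above points to, so its runtime should land much closer to v2's.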

Related

how can I handle the code to avoid killed?

I got Killed after running this code for a while.
Part one of the code is:
def load_data(distance_file):
    distance = {}
    min_dis, max_dis = sys.float_info.max, 0.0
    num = 0
    with open(distance_file, 'r', encoding = 'utf-8') as infile:
        for line in infile:
            content = line.strip().split()
            assert(len(content) == 3)
            idx1, idx2, dis = int(content[0]), int(content[1]), float(content[2])
            num = max(num, idx1, idx2)
            min_dis = min(min_dis, dis)
            max_dis = max(max_dis, dis)
            distance[(idx1, idx2)] = dis
            distance[(idx2, idx1)] = dis
    for i in range(1, num + 1):
        distance[(i, i)] = 0.0
    # no need to close the file explicitly: the with statement closes it automatically
    return distance, num, max_dis, min_dis
EDIT: I tried this solution:
bigfile = open(folder,'r')
tmp_lines = bigfile.readlines(1024)
while tmp_lines:
    for line in tmp_lines:
        tmp_lines = bigfile.readlines(1024)
        i, j, dis = line.strip().split()
        i, j, dis = int(i), int(j), float(dis)
        distance[(i, j)] = dis
        distance[(j, i)] = dis
        max_pt = max(i, j, max_pt)
for num in range(1, max_pt + 1):
    distance[(num, num)] = 0
return distance, max_pt
but got this error
gap = distance[(i, j)] - threshold
KeyError: (1, 2)
from this method
def CutOff(self, distance, max_id, threshold):
    '''
    :rtype: list with Cut-off kernel values by desc
    '''
    cut_off = dict()
    for i in range(1, max_id + 1):
        tmp = 0
        for j in range(1, max_id + 1):
            gap = distance[(i, j)] - threshold
            print(gap)
            tmp += 0 if gap >= 0 else 1
        cut_off[i] = tmp
    sorted_cutoff = sorted(cut_off.items(), key=lambda k:k[1], reverse=True)
    return sorted_cutoff
I used print(gap) to see why this problem appeared and got the value -0.3.
The rest of the code is here.
I have a file that contains 20,000 lines, and the code stopped at
['2686', '13856', '64.176689']
Killed
How can I change the code so it handles more lines?
Can I increase the available memory, and how? Or does the code itself need to change, e.g. storing the data in a file instead of keeping it all in memory?
I used dmesg and got
Out of memory: Killed process 24502 (python) total-vm:19568804kB, anon-rss:14542148kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:31232kB oom_score_adj:0
[ pid ] uid tgid total_vm rss pgtables_bytes swapents o
1000 24502 4892200 3585991 33763328 579936
On a Linux system, check the output of dmesg. If the process is getting killed by the kernel, there will be an explanation there. The most probable reason: out of memory.
One reason you might hit a memory limit is the call to distance.values() in your auto_select_dc function:
neighbor_percent = sum([1 for value in distance.values() if value < dc]) / num ** 2
This allocates a list that contains all the values from your dictionary. If the dictionary holds a lot of data, this can be a very big list. A possible solution is distance.iteritems(), which is a generator: rather than returning all the items in a list, it lets you iterate over them with much less memory usage.
neighbor_percent = sum([1 for _,value in distance.iteritems() if value < dc]) / num ** 2
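A side note for Python 3, where iteritems() no longer exists: dict.values() already returns a lazy view there, and dropping the square brackets turns the list comprehension into a generator expression, so no intermediate list is built. A minimal sketch, assuming the same distance, dc and num names as above:
# Python 3 sketch: a dict view plus a generator expression avoids building any list
neighbor_percent = sum(1 for value in distance.values() if value < dc) / num ** 2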
The CutOff function checks every (i, j) pair from 1 to max_id:
def CutOff(self, distance, max_id, threshold):
    for i in range(1, max_id + 1):
        for j in range(1, max_id + 1):
A sample data file provided in the github link contains distance values for every ID pair from 1 to 2000 (so it has 2M lines for the 2K IDs).
However, your data seems to be very sparse: it has only 20,000 lines but contains large ID numbers such as 2686 and 13856. The error message 'KeyError: (1, 2)' tells you that there is no distance value between IDs 1 and 2.
Finally, it does not make sense to me that code loading only 20,000 lines of data (probably a few MBytes) raises an out-of-memory error. I suspect your data is much larger, or the OOM error comes from another part of your code.
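If the pairwise data really is sparse, one hedged workaround (mine, not from the answers above) is to fall back to a default distance for pairs missing from the file, for example via dict.get. default_dist below is a hypothetical placeholder; a sensible value depends on the data (e.g. the max_dis returned by load_data):
def cutoff_gap(distance, i, j, threshold, default_dist):
    # default_dist is a hypothetical fallback for (i, j) pairs absent from the sparse file
    return distance.get((i, j), default_dist) - threshold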

Among 1 million items preceding A[i], how many of them are smaller than A[i]?

Let A be a numpy 1D array of size 5 to 20 millions.
I'd like to determine, for each i, how many items among A[i-1000000], A[i-999999], ..., A[i-2], A[i-1] are smaller than A[i].
Said in another way: I'm looking for the proportion of items smaller than A[i] in a 1-million-item window preceding A[i].
I've tested various approaches and a few answers were given in Rolling comparison between a value and a past window, with percentile/quantile:
import numpy as np
A = np.random.random(5*1000*1000)
n = 1000*1000
B = (np.lib.stride_tricks.as_strided(A, shape=(n,A.size-n), strides=(A.itemsize,A.itemsize)) <= A[n:]).sum(0)
#or similar version with "view_as_windows(A, n)"
Finally the fastest solution was some naive code + numba:
from numba import jit, prange

@jit(parallel=True)
def doit(A, n):
    Q = np.zeros(len(A))
    for i in prange(n, len(Q)):
        Q[i] = np.sum(A[i-n:i] <= A[i])
    return Q

C = doit(A, n)
But even with this code, it's too slow for me with A of length 5 million and n = 1 million: about 30 minutes to do this computation!
Is there a more clever idea that avoids re-comparing 1 million items for each element of the output?
Note: an approximate proportion with 10^-3 precision, like "~34.3% of the 1-million previous items are smaller than A[i]", would be enough.
Here is an "exact" approach. It solves the 5,000,000 / 1,000,000 sized problem (with floats) in under 20 seconds on rather pedestrian hardware.
I apologize for the rather technical code. I'm not sure it can be made much more readable.
The basic idea is to partition the array into a binary-ish tree-like thing (sorry, no formal scicomp training).
For example, if we have chunks of size half a million, then we can sort each of those at linlog cost and afterwards find the contribution of any block to each element of the next block at amortized constant cost.
The tricky bit is how to piece chunks of different sizes together in such a way that in the end we've counted everything and exactly once.
My approach is to start with small blocks and then keep fusing pairs of those. In principle that should keep the cost of sorting linear at each iteration because in theory (but not in numpy) we could fully exploit the sortedness of the smaller chunks.
As mentioned above the code is tricky mostly because we need to compare the right elements to any given block. It basically comes down to two rules: 1) The block must be fully contained in the element's lookback. 2) the block must not be contained in a larger block that is fully contained in the element's lookback.
Anyway, here is a sample run
size 5_000_000, lookback 1_000_000 -- took 14.593 seconds
seems correct -- 10_000 samples checked
and the code:
UPDATE: simplified the code a bit, also runs faster
UPDATE 2: added a version that does "<=" instead of "<"
"<":
import numpy as np
from numpy.lib.stride_tricks import as_strided

def add_along_axis(a, indices, values, axis):
    if axis<0:
        axis += a.ndim
    I = np.ogrid[(*map(slice, a.shape),)]
    I = *I[:axis], indices, *I[axis+1:]
    a[I] += values

aaa, taa, paa = add_along_axis, np.take_along_axis, np.put_along_axis
m2f, f2m = np.ravel_multi_index, np.unravel_index

def inv_perm(p):
    i = np.empty_like(p)
    paa(i, p, np.arange(p.shape[-1]), -1)
    return i

def rolling_count_smaller(data, n):
    N = len(data)
    b = n.bit_length()
    NN = (((N-1)>>b)+2)<<b
    d0 = np.empty(NN, data.dtype)
    d0[NN-N:] = data[::-1]
    d0[:NN-N] = data.max() + 1
    dt, it, r0 = d0.copy(), np.zeros(NN, int), np.zeros(NN, int)
    ch, ch2 = 1, 2
    for i in range(b-1):
        d0.shape = dt.shape = it.shape = r0.shape = -1, 2, ch
        sh = dt.shape
        (il, ir), (jl, jr), (k, _) = f2m(m2f(np.add(sh, (-1, -2, -1)), sh) - (n, n-ch), sh)
        I = min(il, ir) + 1
        bab = np.empty((I, ch2), dt.dtype)
        bab[:, ch:] = dt[sh[0]-I:, 0]
        IL, IR = np.s_[il-I+1:il+1, ir-I+1:ir+1]
        bab[:, k:ch] = d0[IL, jl, k:]
        bab[:, :k] = d0[IR, jr, :k]
        o = bab.argsort(1, kind='stable')
        ns, io = (o>=ch).cumsum(1), inv_perm(o)
        r0[IL, jl, k:] += taa(ns, io[:, k:ch], 1)
        r0[IR, jr, :k] += taa(ns, io[:, :k], 1)
        it[:, 1, :] += ch
        dt.shape = it.shape = r0.shape = -1, ch2
        o = dt.argsort(1, kind='stable')
        ns, io = (o>=ch).cumsum(1), inv_perm(o)
        aaa(r0, it[:, :ch], taa(ns, io[:, :ch], 1), 1)
        dt, it = taa(dt, o, 1), taa(it, o, 1)
        ch, ch2 = ch2, ch2<<1
    si, sj = dt.shape
    o = as_strided(dt, (si-1, sj<<1), dt.strides).argsort(1, kind='stable')
    ns, io = (o>=ch).cumsum(1), inv_perm(o)
    r0[:-1, ch2-n-1:] += taa(ns, taa(io, inv_perm(it)[:-1, ch2-n-1:], 1), 1)
    return r0.ravel()[:NN-N-1:-1]

l = 1000
data = np.random.randint(-99, 100, (5*l,))
from time import perf_counter as pc
t = pc()
x = rolling_count_smaller(data, l)
t = pc() - t
print(f'size {data.size:_d}, lookback {l:_d} -- took {t:.3f} seconds')
check = 1000
sample = np.random.randint(0, len(x), check)
y = np.array([np.count_nonzero(data[max(0, i-l):i]<data[i]) for i in sample])
assert np.all(y==x[sample])
print(f'seems correct -- {check:_d} samples checked')
"<=":
import numpy as np
from numpy.lib.stride_tricks import as_strided

def add_along_axis(a, indices, values, axis):
    if axis<0:
        axis += a.ndim
    I = np.ogrid[(*map(slice, a.shape),)]
    I = *I[:axis], indices, *I[axis+1:]
    a[I] += values

aaa, taa, paa = add_along_axis, np.take_along_axis, np.put_along_axis
m2f, f2m = np.ravel_multi_index, np.unravel_index

def inv_perm(p):
    i = np.empty_like(p)
    paa(i, p, np.arange(p.shape[-1]), -1)
    return i

def rolling_count_smaller(data, n):
    N = len(data)
    b = n.bit_length()
    NN = (((N-1)>>b)+2)<<b
    d0 = np.empty(NN, data.dtype)
    d0[:N] = data
    d0[N:] = data.max() + 1
    dt, it, r0 = d0.copy(), np.zeros(NN, int), np.zeros(NN, int)
    ch, ch2 = 1, 2
    for i in range(b-1):
        d0.shape = dt.shape = it.shape = r0.shape = -1, 2, ch
        sh = dt.shape
        (il, ir), (jl, jr), (k, _) = f2m(m2f((0, 1, 0), sh) + (n-ch+1, n+1), sh)
        I = sh[0] - max(il, ir)
        bab = np.empty((I, ch2), dt.dtype)
        bab[:, :ch] = dt[:I, 1]
        IL, IR = np.s_[il:il+I, ir:ir+I]
        bab[:, ch+k:] = d0[IL, jl, k:]
        bab[:, ch:ch+k] = d0[IR, jr, :k]
        o = bab.argsort(1, kind='stable')
        ns, io = (o<ch).cumsum(1), inv_perm(o)
        r0[IL, jl, k:] += taa(ns, io[:, ch+k:], 1)
        r0[IR, jr, :k] += taa(ns, io[:, ch:ch+k], 1)
        it[:, 1, :] += ch
        dt.shape = it.shape = r0.shape = -1, ch2
        o = dt.argsort(1, kind='stable')
        ns, io = (o<ch).cumsum(1), inv_perm(o)
        aaa(r0, it[:, ch:], taa(ns, io[:, ch:], 1), 1)
        dt, it = taa(dt, o, 1), taa(it, o, 1)
        ch, ch2 = ch2, ch2<<1
    si, sj = dt.shape
    o = as_strided(dt, (si-1, sj<<1), dt.strides).argsort(1, kind='stable')
    ns, io = (o<ch).cumsum(1), inv_perm(o)
    r0[1:, :n+1-ch] += taa(ns, taa(io, ch+inv_perm(it)[1:, :n+1-ch], 1), 1)
    return r0.ravel()[:N]

l = 1000
data = np.random.randint(-99, 100, (5*l,))
from time import perf_counter as pc
t = pc()
x = rolling_count_smaller(data, l)
t = pc() - t
print(f'size {data.size:_d}, lookback {l:_d} -- took {t:.3f} seconds')
check = 1000
sample = np.random.randint(0, len(x), check)
y = np.array([np.count_nonzero(data[max(0, i-l):i]<=data[i]) for i in sample])
assert np.all(y==x[sample])
print(f'seems correct -- {check:_d} samples checked')
First attempt at an answer, based on this assumption (from the comments):
we could as well use 16-bit integers by pre-multiplying A by 32768
and rounding. The precision would be enough with int16
Assuming we're working with int16 numbers: I would try to maintain a relatively small array of size 2**16 counting how many times each number appeared in the last 1m window. Maintaining the array is O(1) as with each index increment you just reduce 1 count of the number the window just "left", and increment the "new" number.
Then counting how many numbers in the window are smaller than the current number reduces to summing the array over all indices up to (excluding) the current number.
Assuming A[i] is in the range [-32768, 32768]:
B = np.zeros(2 * 32768 + 1)
Q = np.zeros(len(A))
n = 1000 * 1000

def adjust_index(i):
    return int(i) + 32768

for i in range(len(Q)):
    if i >= n + 1:
        B[adjust_index(A[i - n - 1])] -= 1
    if i > 0:
        B[adjust_index(A[i - 1])] += 1
    Q[i] = B[:adjust_index(A[i])].sum() / float(n)
This ran on my machine in about one minute.
You can trade-off space and some speed for accuracy by using a larger (or smaller) range of integers (e.g. multiplying by 2**17 instead of 2**16 to get more accurate at the cost of some speed; multiplying by 2**15 to get results faster but less accurately).
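As a hedged illustration of the quantization assumed above (mine, not part of the answer), a float array with values in [0, 1), as in the question, could be mapped to int16 bins like this:
import numpy as np

A_float = np.random.random(5 * 1000 * 1000)   # values in [0, 1), as in the question
# multiply and round as the quoted comment suggests; using 32767 rather than 32768
# keeps the result inside the int16 range even for values very close to 1.0
A = np.round(A_float * 32767).astype(np.int16)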
Sorry in advance for not implementing my idea for you; I don’t quite have the time right now. But I hope it helps!
Notation
I'll use n as the array size, and k as the window size.
The Concept
For each element A[i], build a splay tree
ordering all elements a in A[max(0, i-k): i+1], and then use the splay tree to count the number of elements a < A[i]. The advantage here is that the splay trees for adjacent elements A[i] & A[i+1] will differ only by one node insertion and (for i > k) one node removal, which reduces the time needed to build the splay trees.
The required operations have the following complexities:
iterating over each i: n iterations
adding A[i] as a node to the splay tree: amortized O(log k)
counting a < A[i]: since adding A[i] puts it in the root position, you need only check the left branch's size counter -> O(1)
removing the A[i-k-1] node: amortized O(log k)
Overall complexity: amortized O(n log k)
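A minimal sketch of this rolling-rank idea (mine, not part of the answer), substituting a sortedcontainers.SortedList for the splay tree; that substitution is an assumption, and any order-statistics structure with O(log k) insert/remove and rank queries would do:
from sortedcontainers import SortedList

def rolling_lessers(A, k):
    window = SortedList()   # ordered multiset of up to k preceding items
    counts = []
    for i, a in enumerate(A):
        counts.append(window.bisect_left(a))   # rank query: window items strictly smaller than a
        window.add(a)                          # O(log k) insert
        if len(window) > k:
            window.remove(A[i - k])            # O(log k) removal of the oldest item
    return counts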
Reposting the contents of my comment at @Basj's request:
The Thought
Suppose for a window size k, you use the window A[i-k: i] not for the element A[i], but one of its neighbors A[i+1] (or A[i-1]).
The contents of this window A[i-k:i] are almost identical to that of the "true window for A[i+1]", A[i-k+1: i+1]; k-1 of their elements are the same, with only 1 (potentially) non-matching element. This would affect the lessers count for A[i+1] by at most 1; either the changed element is counted when the real one would not be, or vice-versa. Thus at the most, the lessers count for A[i+1] will deviate from "the true count for A[i+1]" by at most 1.
By the same logic, doing the same for A[i+2] (or A[i-2]) would give you a max deviation of 2, and more generally, doing the same for A[i+j] would give you a max deviation of abs(j).
So if your target precision is 1e-3, meaning that your acceptable error is half of that, 5e-4, then you could instead approximate results for the whole set of values A[i+j] for j in range(int(-k * 5e-4), int(k * 5e-4)), by simply reusing the same window A[i-k: i] for each A[i+j].
...Now what?
You can simply adjust your code to count the lessers in this adjusted window for each A[i+j], and increment i by k*1e-3 chunks.
...but this doesn't save you any time. You're still taking a chunk of k numbers, and counting the number of values less than some reference value a, and doing so for 5 million a's. That's exactly what you did before.
So the question is: how can you abuse the repetition to save time?
@Basj I'll leave the rest of this thought to you. It is finals season, after all ;]
Here is a pythranized version of my solution. It is roughly twice as fast and, I think, more readable, even if it is longer. The obvious downside is the added pythran dependency.
The main workhorse is _mergesorted3; it scales well with increasing block size but is comparatively slow at small block size.
I've written one specialization for block size 1 to demonstrate how much more speed one could potentially gain.
import numpy as np
from _mergesorted2 import _mergesorted_1
from _mergesorted3 import _mergesorted3
from time import perf_counter as pc

USE_SPEC_1 = True

def rolling_count_smaller(D, n, countequal=True):
    N = len(D)
    B = n.bit_length() - 1
    # now: 2^(B+1) >= n > 2^B
    # result and sorter
    R, S = np.zeros(N, int), np.empty(N, int) if USE_SPEC_1 else np.arange(N)
    FL, FH, SL, SH = (np.zeros(3, dt) for dt in 'llll')
    T = pc()
    if USE_SPEC_1:
        _mergesorted_1(D, R, S, n, countequal)
    for b in range(USE_SPEC_1, B):
        print(b, pc()-T)
        T = pc()
        # for each odd block first treat the elements that are so far to its
        # right that they can see that block in full but not the block
        # containing it
        # most of the time (whenever 2^b does not divide n+1) these will span
        # two blocks, hence fall into two ordered subgroups
        # thus do a threeway merge, but only a "dry run":
        # update the counts R but not the sorter S
        L, BB = n+1, ((n>>b)+1)<<b
        if L == BB:
            Kref = int(countequal)
            SL[1-countequal] = BB
            SH[1-countequal] = BB+(1<<b)
            FL[1-countequal] = BB
            FH[1-countequal] = n+1+(1<<b)
            SL[2] = SH[2] = FL[2] = FH[2] = 0
        else:
            Kref = countequal<<1
            SL[1-countequal:3-countequal] = BB-(1<<b), BB
            SH[1-countequal:3-countequal] = BB, BB+(1<<b)
            FL[1-countequal:3-countequal] = L, BB
            FH[1-countequal:3-countequal] = BB, n+1+(1<<b)
        SL[Kref] = FL[Kref] = 1<<b
        SH[Kref] = FH[Kref] = 1<<(b+1)
        _mergesorted3(D, R, S, SL, SH, FL, FH, N, 1<<(b+1), Kref, False, True)
        # merge pairs of adjacent blocks
        SL[...] = 0
        SL[1-countequal] = 1<<b
        SH[2] = 0
        SH[:2] = SL[:2] + (1<<b)
        _mergesorted3(D, R, S, SL, SH, FL, FH, N, 1<<(b+1), int(countequal), True, False)
    # in this last step even and odd blocks are treated the same because
    # neither can be contained in larger valid block
    SL[...] = 0
    SL[1-countequal] = 1<<B
    SH[2] = 0
    SH[int(countequal)] = 1<<B
    SH[1-countequal] = 1<<(B+1)
    FL[...] = 0
    FL[1-countequal] = 1<<B
    FH[2] = 0
    FH[int(countequal)] = 1<<B
    FH[1-countequal] = n+1
    _mergesorted3(D, R, S, SL, SH, FL, FH, N, 1<<B, int(countequal), False, True)
    return R

countequal=True
l = 1_000_000
np.random.seed(0)
data = np.random.randint(-99, 100, (5*l,))

from time import perf_counter as pc
t = pc()
x = rolling_count_smaller(data, l, countequal)
t = pc() - t
print(f'size {data.size:_d}, lookback {l:_d} -- took {t:.3f} seconds')
check = 10
sample = np.random.randint(0, len(x), check)
if countequal:
    y = np.array([np.count_nonzero(data[max(0, i-l):i]<=data[i]) for i in sample])
else:
    y = np.array([np.count_nonzero(data[max(0, i-l):i]<data[i]) for i in sample])
assert np.all(y==x[sample])
print(f'seems correct -- {check:_d} samples checked')
The main worker is _mergesorted3.py. Compile it with: pythran _mergesorted3.py
import numpy as np

#pythran export _mergesorted3(float[:], int[:], int[:], int[3], int[3], int[3], int[3], int, int, int, bool, bool)
#pythran export _mergesorted3(int[:], int[:], int[:], int[3], int[3], int[3], int[3], int, int, int, bool, bool)

# DB, RB, SB are the data, result and sorter arrays; here they are treated a
# bit like base pointers, hence the B in the names
# SL, SH are the low and high ends of the current rows of the three queues
# the next rows are assumed to be at offset N
# FL, FH are low and high ends of ranges in non sorted order used to filter
# each queue. they are ignored if 'filter' is False
# ST is the top index; this can fall in the middle of a row which will then be
# processed partially
# Kref is the index of the reference queue (the one whose elements are counted)
def _mergesorted3(DB, RB, SB, SL, SH, FL, FH, ST, N, Kref, writeback, filter):
    if writeback:  # set up row buffer for writing back of merged sort order
        SLbuf = min(SL[0], SL[1])  # low end of row
        SHbuf = max(SH[0], SH[1])  # high end of row
        Sbuf = np.empty(SHbuf-SLbuf, int)  # buffer
        Ibuf = 0  # index
    D = np.empty(3, DB.dtype)  # heads of the three queues: values
    S = np.empty(3, int)       # heads of the three queues: sorters
    while True:  # loop over rows
        C = 0  # count of elements in the reference block seen so far
        I = SL.copy()  # heads of the three queues: indices
        S[:2] = SB[I[:2]]  # the inner loop expects the heads of the two non
                           # active (i.e. not incremented just now) queues
                           # to be in descending order
        if filter:  # skip elements that are not within a contiguous range.
                    # this requires filtering because everything is referenced
                    # in sorted order, so we cannot directly select ranges in
                    # the original order
                    # it is the caller's responsibility that for all except
                    # possibly the last row the filtered queues are not empty
            for KK in range(2):
                while S[KK] < FL[KK] or S[KK] >= FH[KK]:
                    I[KK] += 1
                    S[KK] = SB[I[KK]]
        D[:2] = DB[S[:2]]  # fetch the first two queue head values
        # and set the inter queue sorter accordingly
        K = np.array([1, 0, 2], int) if D[1] > D[0] else np.array([0, 1, 2], int)
        while I[K[2]] < SH[K[2]]:  # loop to merge three rows
            # get a valid new element from the active queue at sorter level
            S[K[2]] = SB[I[K[2]]]
            if filter and (S[K[2]] < FL[K[2]] or S[K[2]] >= FH[K[2]]):
                I[K[2]] += 1
                continue
            # fetch the corresponding value
            D[K[2]] = DB[S[K[2]]]
            # re-establish inter-queue sort order
            if D[K[2]] > D[K[1]] or (D[K[2]] == D[K[1]] and K[2] < K[1]):
                K[2], K[1] = K[1], K[2]
            if D[K[1]] > D[K[0]] or (D[K[1]] == D[K[0]] and K[1] < K[0]):
                K[1], K[0] = K[0], K[1]
            # do the book keeping depending on which queue has become active
            if K[2] == Kref:  # reference queue: adjust counter
                C += 1
            else:  # other: add current ref element count to head of result queue
                RB[S[K[2]]] += C
            I[K[2]] += 1  # advance active queue
        # one queue has been exhausted, which one?
        if K[2] == Kref:  # reference queue: no need to sort what's left, just
                          # add the current ref element count to all leftovers
                          # subject to filtering if applicable
            if filter:
                KK = SB[I[K[1]]:SH[K[1]]]
                RB[KK[(KK >= FL[K[1]]) & (KK < FH[K[1]])]] += C
                KK = SB[I[K[0]]:SH[K[0]]]
                RB[KK[(KK >= FL[K[0]]) & (KK < FH[K[0]])]] += C
            else:
                RB[SB[I[K[1]]:SH[K[1]]]] += C
                RB[SB[I[K[0]]:SH[K[0]]]] += C
        else:  # one of the other queues: we are left with a two-way merge
               # this is in a separate loop because it also supports writing
               # back the new sort order, which we do not need in the three way
               # situation
            while I[K[1]] < SH[K[1]]:
                S[K[1]] = SB[I[K[1]]]
                if filter and (S[K[1]] < FL[K[1]] or S[K[1]] >= FH[K[1]]):
                    I[K[1]] += 1
                    continue
                D[K[1]] = DB[S[K[1]]]
                if D[K[1]] > D[K[0]] or (D[K[1]] == D[K[0]] and K[1] < K[0]):
                    K[1], K[0] = K[0], K[1]
                if K[1] == Kref:
                    C += 1
                else:
                    RB[S[K[1]]] += C
                if writeback:  # we cannot directly write back without messing
                               # things up. instead we buffer one row at a time
                    Sbuf[Ibuf] = S[K[1]]
                    Ibuf += 1
                I[K[1]] += 1
            # a second queue has been exhausted. which one?
            if K[1] == Kref:  # the reference queue: must update results in
                              # the remainder of the other queue
                if filter:
                    KK = SB[I[K[0]]:SH[K[0]]]
                    RB[KK[(KK >= FL[K[0]]) & (KK < FH[K[0]])]] += C
                else:
                    RB[SB[I[K[0]]:SH[K[0]]]] += C
            if writeback:  # write back updated order
                # the leftovers of the last remaining queue have not been
                # buffered but being contiguous can be written back directly
                # the way this is used by the main script actually gives a
                # fifty-fifty chance of copying something exactly onto itself
                SB[SLbuf+Ibuf:SHbuf] = SB[I[K[0]]:SH[K[0]]]
                # now copy the buffer
                SB[SLbuf:SLbuf+Ibuf] = Sbuf[:Ibuf]
                SLbuf += N; SHbuf += N
                Ibuf = 0
        SL += N; SH += N
        if filter:
            FL += N; FH += N
        # this is ugly:
        # going to the next row we must check whether one or more queues
        # have fully or partially hit the ceiling ST.
        # if two and fully, we are done
        # if one fully, we must alter the queue indices to make sure the
        # empty queue is at index 2, because of the requirement of having
        # at least one valid element in queues 0 and 1
        done = -1
        for II in range(3):
            if SH[II] == SL[II]:
                if done >= 0:
                    done = -2
                    break
                done = II
            elif SH[II] > ST:
                if SL[II] >= ST or (filter and FL[II] >= ST):
                    if done >= 0:
                        done = -2
                        break
                    done = II
                    if writeback:
                        SHbuf -= SH[II] - SL[II]
                    SH[II] = SL[II] = 0
                else:
                    if writeback:
                        SHbuf -= SH[II] - ST
                    SH[II] = ST
                    if filter and FH[II] > ST:
                        FH[II] = ST
        if done == Kref or done == -2:
            break
        elif done == 0:
            SL[:2], SH[:2] = SL[1:], SH[1:]
            if filter:
                FL[:2], FH[:2] = FL[1:], FH[1:]
            SH[2] = SL[2]
            Kref -= 1
        elif done == 1:
            SL[1], SH[1] = SL[2], SH[2]
            if filter:
                FL[1], FH[1] = FL[2], FH[2]
            SH[2] = SL[2]
            Kref >>= 1
And the special case _mergesorted2.py (compile with: pythran _mergesorted2.py):
import numpy as np

#pythran export _mergesorted_1(float[:], int[:], int[:], int, bool)
#pythran export _mergesorted_1(int[:], int[:], int[:], int, bool)
def _mergesorted_1(DB, RB, SB, n, countequal):
    N = len(DB)
    K = ((N-n-1)>>1)<<1
    for i in range(0, K, 2):
        if DB[i] < DB[i+1] or (countequal and DB[i] == DB[i+1]):
            SB[i] = i
            SB[i+1] = i+1
            RB[i+1] += 1
        else:
            SB[i] = i+1
            SB[i+1] = i
        if DB[i+1] < DB[i+1+n] or (countequal and DB[i+1] == DB[i+1+n]):
            RB[i+1+n] += 1
    for i in range(K, (N>>1)<<1, 2):
        if DB[i] < DB[i+1] or (countequal and DB[i] == DB[i+1]):
            SB[i] = i
            SB[i+1] = i+1
            RB[i+1] += 1
        else:
            SB[i] = i+1
            SB[i+1] = i
    if N & 1:
        SB[N-1] = N-1
Here is an approximate approach that is simple to implement and responds in O(n) time (21 seconds for 5M values on my laptop). It should work well for data sets with values that vary by more than 1/1000th of the largest difference.
from collections import deque,Counter

def lessCount(A, window):
    precision = 1000  # 1/1000 th of value range
    result = deque()
    counts = [0]*(precision+1)
    minVal = min(A)
    chunkSize = (max(A)-minVal)/precision
    keys = deque()
    for i, a in enumerate(A):
        key = int((a-minVal)/chunkSize)
        keys.append(key)
        counts[key] += 1
        lowerCount = sum(counts[:key])
        result.append(lowerCount)
        if i < window: continue
        counts[keys.popleft()] -= 1
    return np.array(result)
It builds a rolling array of counts where the index is the relative position of the value, divided into chunks. The chunk size is 1/1000th of the largest difference between values. For each element of A there is only one addition and one subtraction to the array of counts. The number of values lower than the current one is the sum of counts up to the position of that value in the counts array. You can increase the precision as needed, but keep in mind that the time will be proportional to O(n)*precision.
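A hedged usage sketch (mine, not part of the answer), matching the question's setup; note that for the first n indices the window is shorter than n, so dividing by n there only gives a rough proportion:
import numpy as np

A = np.random.random(5 * 1000 * 1000)
n = 1000 * 1000
counts = lessCount(A, n)    # approximate count of smaller items in the preceding window
proportions = counts / n    # rough proportions, as asked for in the question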

Python: Parallel distance calculation in pandas dataframe

I have a data frame for which I want to calculate the distance of each row to every other row. It needs to be very fast, of course, so I added some parallelism. But I see that it runs faster with a single thread for some reason.
dist = {}
dist_lock = threading.Lock()

def calculate_dist_threaded(data_frame, num_threads):
    # calculate the set of indexes for each thread, store in an indexes array.
    tuple_data_frame = list(data_frame.itertuples())
    r = [None]
    for attr in list(data_frame):
        if data_frame[attr].dtype in ['int32', 'float64']:
            dft = data_frame[attr][data_frame[attr] != sys.maxint].dropna()
            r.append(dft.max() - dft.min())
        else:
            r.append(None)
    for i in range(data_frame.shape[0]):
        dist[i] = {}
    # run calculate_dist in each thread.

def calculate_dist(tuples, indexes, r):
    for i in indexes:
        logger.debug("working on index={0}".format(i))
        for other in tuples:
            if other[0] in dist[i]:
                continue
            if i == other[0]:
                d = 0.0
            else:
                d = dist_tuples(tuples[i], other, r)
            with dist_lock:
                dist[i][other[0]] = d
                dist[other[0]][i] = d

def dist_tuples(x, y, r):
    d = 0.0
    for i in range(1, len(x)):
        if x[i] == y[i]:
            d += 0.0
        elif isinstance(x[i], numbers.Number):
            d += abs(x[i] - y[i]) / r[i]
        else:
            d += 1.0
    return d

if __name__ == "__main__":
    calculate_dist_threaded(data, multiprocessing.cpu_count())
When I run it with 4 threads, as that is the number of CPUs on my laptop, I see that it takes 15 seconds to calculate 4 separate indexes at the same time.
But if I run the same code with just one thread, I see that it takes just 2 seconds to calculate a single index, so per index it's almost 2 seconds faster than with 4 threads.
Am I missing something, and do I have a blocking piece of code here, or is it just my lame laptop?

python itertools permutations with tied values

I want to find efficiently permutations of a vector which has tied values.
E.g., if perm_vector = [0,0,1,2] I would want to obtain as output all combinations of [0,0,1,2], [0,0,2,1], [0,1,2,0] and so on, but I don't want to obtain [0,0,1,2] twice which is what the standard itertools.permutations(perm_vector) would give.
I tried the following, but it becomes really SLOW as perm_vector grows in length:
vectors_list = []
for it in itertools.permutations(perm_vector):
    vectors_list.append(list(it))
df_vectors_list = pd.DataFrame(vectors_list)
df_gb = df_vectors_list.groupby(list(df_vectors_list.columns))
vectors_list = pd.DataFrame(df_gb.groups.keys()).T
The question is of a more general "speed-up" nature, actually. Most of the time is spent creating the permutations of long vectors: even without the duplicates, creating the permutations of a vector of 12 unique values takes practically forever. Is there a way to use itertools iteratively, working on chunks of the permutations instead of materializing all of them at once?
Try this if perm_vector is small:
import itertools as iter
{x for x in iter.permutations(perm_vector)}
This should give you unique values, because it builds a set, which removes duplicates by definition.
If perm_vector is large, you might want to try backtracking:
def permu(L, left, right, cache):
    for i in range(left, right):
        L[left], L[i] = L[i], L[left]
        L_tuple = tuple(L)
        if L_tuple not in cache:
            permu(L, left + 1, right, cache)
        L[left], L[i] = L[i], L[left]
        cache[L_tuple] = 0

cache = {}
permu(perm_vector, 0, len(perm_vector), cache)
cache.keys()
How about this:
from collections import Counter

def starter(l):
    cnt = Counter(l)
    res = [None] * len(l)
    return worker(cnt, res, len(l) - 1)

def worker(cnt, res, n):
    if n < 0:
        yield tuple(res)
    else:
        for k in cnt.keys():
            if cnt[k] != 0:
                cnt[k] = cnt[k] - 1
                res[n] = k
                for r in worker(cnt, res, n - 1):
                    yield r
                cnt[k] = cnt[k] + 1
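A hedged usage sketch (mine, not part of the answer), assuming starter and worker from above are defined; starter returns a generator over the distinct permutations, so for the question's example:
perm_vector = [0, 0, 1, 2]
distinct = list(starter(perm_vector))   # each distinct permutation exactly once
print(len(distinct))                    # 12 == 4! / 2! for this input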

Query long lists

I would like to query the value of an exponentially weighted moving average at particular points. An inefficient way to do this is as follows. l is the list of times of events and queries has the times at which I want the value of this average.
a=0.01
l = [3,7,10,20,200]
y = [0]*1000
for item in l:
    y[int(item)]=1
s = [0]*1000
for i in xrange(1,1000):
    s[i] = a*y[i-1]+(1-a)*s[i-1]
queries = [23,68,103]
for q in queries:
    print s[q]
Outputs:
0.0355271185019
0.0226018371526
0.0158992102478
In practice l will be very large and the range of values in l will also be huge. How can you find the values at the times in queries more efficiently, and especially without computing the potentially huge lists y and s explicitly. I need it to be in pure python so I can use pypy.
Is it possible to solve the problem in time proportional to len(l)
and not max(l) (assuming len(queries) < len(l))?
Here is my code for doing this:
def ewma(l, queries, a=0.01):
    def decay(t0, x, t1, a):
        from math import pow
        return pow((1-a), (t1-t0))*x

    assert l == sorted(l)
    assert queries == sorted(queries)
    samples = []
    try:
        t0, x0 = (0.0, 0.0)
        it = iter(queries)
        q = it.next()-1.0
        for t1 in l:
            # new value is decayed previous value, plus a
            x1 = decay(t0, x0, t1, a) + a
            # take care of all queries between t0 and t1
            while q < t1:
                samples.append(decay(t0, x0, q, a))
                q = it.next()-1.0
            # take care of all queries equal to t1
            while q == t1:
                samples.append(x1)
                q = it.next()-1.0
            # update t0, x0
            t0, x0 = t1, x1
        # take care of any remaining queries
        while True:
            samples.append(decay(t0, x0, q, a))
            q = it.next()-1.0
    except StopIteration:
        return samples
I've also uploaded a fuller version of this code with unit tests and some comments to pastebin: http://pastebin.com/shhaz710
EDIT: Note that this does the same thing as what Chris Pak suggests in his answer, which he must have posted as I was typing this. I haven't gone through the details of his code, but I think mine is a bit more general. This code supports non-integer values in l and queries. It also works for any kind of iterables, not just lists since I don't do any indexing.
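A hedged usage note (mine, not part of the answer): the code above targets Python 2 (it.next(), matching the question's xrange); under Python 3 one would write next(it) instead. Calling it on the question's data would look roughly like this:
l = [3, 7, 10, 20, 200]
queries = [23, 68, 103]
samples = ewma(l, queries, a=0.01)   # one EWMA value per query time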
I think you could do it in logarithmic time per query, if l is sorted. The basic idea is the non-recursive form of the EMA: the value at a query time is a sum of contributions from past events, each weighted by (1-a) raised to the elapsed time since that event.
This means that for query k you find the greatest time in l less than k, say l[v] at index v, and, up to an estimation limit, sum terms of the form
(1-a)^(k - l[v]) * value-at-l[v] + ...
walking backwards through earlier events. Then you spend lg(len(l)) time in the search plus a constant multiple of the depth of your estimation. I'll provide a code sample in a little bit (after work) if you want it; I just wanted to get my idea out there while I was thinking about it.
Here's the code. v is the dictionary of values at a given time; replace it with 1 if it's just a 1 every time...
import math
from bisect import bisect_right

a = .01
limit = 1000
l = [1,5,14,29...]

def find_nearest_lt(l, time):
    i = bisect_right(l, time)   # adapted from the bisect docs recipe
    if i:
        return i-1
    raise ValueError

def find_ema(l, time):
    i = find_nearest_lt(l, time)
    if l[i] == time:
        result = a * v[l[i]]
        i -= 1
    else:
        result = 0
    while (time-l[i]) < limit:
        result += math.pow(1-a, time-l[i]) * v[l[i]]
        i -= 1
    return result
If I'm thinking correctly, the find-nearest step is lg(n); then the while loop is <= 1000 iterations, guaranteed, so it's technically a constant (though a kind of large one). find_nearest_lt was adapted from the recipe on the bisect page - http://docs.python.org/2/library/bisect.html
It appears that y is a binary value -- either 0 or 1 -- depending on the values of l. Why not use y = set(int(item) for item in l)? That's the most efficient way to store and look up a list of numbers.
Your code will cause an error the first time through this loop:
s = [0]*1000
for i in xrange(1000):
    s[i] = a*y[i-1]+(1-a)*s[i-1]
because i-1 is -1 when i=0 (first pass of loop) and both y[-1] and s[-1] are the last element of the list, not the previous. Maybe you want xrange(1,1000)?
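For illustration (mine, not part of the answer), negative indices silently wrap to the end of the list in Python rather than raising an error:
y = [0, 0, 1]
print(y[-1])   # 1: the last element, not the "previous" one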
How about this code:
a=0.01
l = [3.0,7.0,10.0,20.0,200.0]
y = set(int(item) for item in l)
queries = [23,68,103]
ewma = []
x = 1 if (0 in y) else 0
for i in xrange(1, queries[-1]):
    x = (1-a)*x
    if i in y:
        x += a
    if i == queries[0]:
        ewma.append(x)
        queries.pop(0)
Edited to include SchighSchagh's improvements.
