Is insertion with heapq faster than insertion with bisect? - python

I have a question about bisect and heapq.
First I will show you 2 versions of code and then ask question about it.
version using bisect:
while len(scoville) > 1:
    a = scoville.pop(0)
    # pops out the smallest element
    if a >= K:
        break
    b = scoville.pop(0)
    # pops out the smallest element
    c = a + b * 2
    bisect.insort(scoville, c)
version using heapq:
while len(scoville) > 1:
    a = heapq.heappop(scoville)
    # pops out the smallest element
    if a >= K:
        break
    b = heapq.heappop(scoville)
    # pops out the smallest element
    c = a + b * 2
    heapq.heappush(scoville, c)
Both algorithms use 2 pops and 1 insert per iteration.
As far as I know, in the bisect version the list pop operation is O(1) and bisect's insertion is O(log n).
And in the heapq version the heap pop operation is O(1) and the heap insertion is O(log n) on average.
So both pieces of code should have roughly the same time efficiency. However, the bisect version keeps failing the time-efficiency test on a code challenge site.
Does anybody have a good guess?
*scoville is a list of integers
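For reference, a minimal setup under which either snippet runs (K and the values here are hypothetical, not part of the original question): the bisect version assumes scoville is kept sorted, and the heapq version assumes it has been heapified first.
import bisect
import heapq

K = 7
scoville = [1, 2, 3, 9, 10, 12]

# bisect version operates on a sorted list
scoville_sorted = sorted(scoville)

# heapq version operates on a heap
scoville_heap = scoville[:]
heapq.heapify(scoville_heap)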

Your assumptions are wrong. Neither is pop(0) O(1), nor is bisect.insort O(log n).
The problem is that in both cases, all the elements after the one you pop or insert have to be shifted one position to the left or right, making both operations O(n).
From the bisect.insort documentation:
bisect.insort_left(a, x, lo=0, hi=len(a))
Insert x in a in sorted order. This is equivalent to a.insert(bisect.bisect_left(a, x, lo, hi), x) assuming that a is already sorted. Keep in mind that the O(log n) search is dominated by the slow O(n) insertion step.
You can test this by creating a really long list, say l = list(range(10**8)), and then doing l.pop(0) or l.pop() and bisect.insort(l, 0) or bisect.insort(l, 10**9). Popping and inserting at the end should be instantaneous, while the other operations have a short but noticeable delay.
You can also use %timeit to test it repeatedly on shorter lists, if you alternatingly pop and insert so that the length of the list remains constant over many thousands of runs:
>>> l = list(range(10**6))
>>> %timeit l.pop(); bisect.insort(l, 10**6)
100000 loops, best of 3: 2.21 us per loop
>>> %timeit l.pop(0); bisect.insort(l, 0)
100 loops, best of 3: 14.2 ms per loop
Thus, the version using bisect is O(n) per iteration and the one with heapq is O(log n).
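To see the difference end to end, here is a rough benchmark sketch of the two loops (K, the list size and the values are arbitrary choices of mine, not from the question):
import bisect
import heapq
import random
import time

K = float('inf')   # never break early, so both loops do the same amount of work
data = [random.randrange(10**6) for _ in range(10**4)]

def run_bisect(scoville):
    scoville = sorted(scoville)
    while len(scoville) > 1:
        a = scoville.pop(0)                  # O(n): shifts every remaining element
        if a >= K:
            break
        b = scoville.pop(0)
        bisect.insort(scoville, a + b * 2)   # O(n): shifts again on insert

def run_heapq(scoville):
    scoville = list(scoville)
    heapq.heapify(scoville)
    while len(scoville) > 1:
        a = heapq.heappop(scoville)          # O(log n)
        if a >= K:
            break
        b = heapq.heappop(scoville)
        heapq.heappush(scoville, a + b * 2)  # O(log n)

for fn in (run_bisect, run_heapq):
    t0 = time.perf_counter()
    fn(data)
    print(fn.__name__, '%.3f s' % (time.perf_counter() - t0))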

Related

Why is islice(permutations) 100 times faster if I keep a reference to the underlying iterator?

Iterating through islice(permutations(a), n) is somehow 100 times faster if I just keep an extra reference to the permutations iterator. Alternating between with and without the extra reference:
2.1 ms with
202.2 ms without
2.1 ms with
195.8 ms without
2.1 ms with
192.4 ms without
What's going on?
Full code (Try it online!):
from timeit import timeit
from itertools import permutations, islice
from collections import deque
a = range(10 ** 7)
n = 10 ** 5
for label in ['with', 'without'] * 3:
    if label == 'with':
        perms = islice((foo := permutations(a)), n)
    else:
        perms = islice(permutations(a), n)
    next(perms)
    t = timeit(lambda: deque(perms, 0), number=1)
    print('%5.1f ms ' % (t * 1e3), label)
I just realized why. If I don't keep a reference, then the iterator and everything it references gets garbage collected at the end of the timing, and that's included in the measured time.
Note that the list I build permutations of is very large. So each permutation is very large. So the permutations iterator has a large result tuple and internal state data structure, and I also have millions of integer objects from the range. All that must be cleaned up.
When I halve the size of a to a = range(10 ** 7 // 2), the times for "without" extra reference drop to about half (100 ms).
When I double the size of a to a = range(10 ** 7 * 2), the times for "without" extra reference roughly double (over 400 ms).
Neither change affects the "with" times (always around 2 ms).
In case anyone is wondering why I build permutations of such a large list: I was looking into the time it takes permutations to provide all n! permutations of n elements. One might think it needs O(n × n!), since that's the overall result size. But it reuses and modifies its result tuple if it can, so it doesn't build each permutation from scratch but just needs to update it a little. So I tested that with large n in order to see a large speed difference between when it can and can't reuse its result tuple. It is indeed much faster if it can, and seems to take only O(n!) time to provide all permutations. It appears to on average change just 2.63 elements from one permutation to the next.
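A quick way to confirm this (my own sketch, not from the answer) is to time the teardown explicitly once the only reference to the iterator is dropped; in CPython, reference counting frees it immediately at the del:
import time
from collections import deque
from itertools import islice, permutations

a = range(10 ** 7)
it = permutations(a)
deque(islice(it, 10 ** 5), 0)   # consume 100,000 permutations

t0 = time.perf_counter()
del it                          # drop the only reference: cleanup happens here
print('%.1f ms to free the iterator' % ((time.perf_counter() - t0) * 1e3))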

FAST comparing two numpy arrays for equality [Python] [duplicate]

Suppose I have a bunch of arrays, including x and y, and I want to check if they're equal. Generally, I can just use np.all(x == y) (barring some dumb corner cases which I'm ignoring now).
However this evaluates the entire array of (x == y), which is usually not needed. My arrays are really large, and I have a lot of them, and the probability of two arrays being equal is small, so in all likelihood, I really only need to evaluate a very small portion of (x == y) before the all function could return False, so this is not an optimal solution for me.
I've tried using the builtin all function, in combination with itertools.izip: all(val1==val2 for val1,val2 in itertools.izip(x, y))
However, that just seems much slower in the case that the two arrays are equal, so overall it's still not worth using over np.all. I presume that's because of the builtin all's general-purposeness. And np.all doesn't work on generators.
Is there a way to do what I want in a more speedy manner?
I know this question is similar to previously asked questions (e.g. Comparing two numpy arrays for equality, element-wise) but they specifically don't cover the case of early termination.
Until this is implemented in numpy natively you can write your own function and jit-compile it with numba:
import numpy as np
import numba as nb
@nb.jit(nopython=True)
def arrays_equal(a, b):
    if a.shape != b.shape:
        return False
    for ai, bi in zip(a.flat, b.flat):
        if ai != bi:
            return False
    return True
a = np.random.rand(10, 20, 30)
b = np.random.rand(10, 20, 30)
%timeit np.all(a==b) # 100000 loops, best of 3: 9.82 µs per loop
%timeit arrays_equal(a, a) # 100000 loops, best of 3: 9.89 µs per loop
%timeit arrays_equal(a, b) # 100000 loops, best of 3: 691 ns per loop
Worst case performance (arrays equal) is equivalent to np.all and in case of early stopping the compiled function has the potential to outperform np.all a lot.
Adding short-circuit logic to array comparisons is apparently being discussed on the numpy page on github, and will thus presumably be available in a future version of numpy.
Probably someone who understands the underlying data structure could optimize this or explain whether it's reliable/safe/good practice, but it seems to work.
np.all(a==b)
Out[]: True
memoryview(a.data)==memoryview(b.data)
Out[]: True
%timeit np.all(a==b)
The slowest run took 10.82 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.2 µs per loop
%timeit memoryview(a.data)==memoryview(b.data)
The slowest run took 8.55 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.85 µs per loop
If I understand this correctly, ndarray.data creates a pointer to the data buffer and memoryview creates a native python type that can be short-circuited out of the buffer.
I think.
EDIT: further testing shows it may not be as big a time improvement as shown; previously a = b = np.eye(5).
a=np.random.randint(0,10,(100,100))
b=a.copy()
%timeit np.all(a==b)
The slowest run took 6.70 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 17.7 µs per loop
%timeit memoryview(a.data)==memoryview(b.data)
10000 loops, best of 3: 30.1 µs per loop
np.all(a==b)
Out[]: True
memoryview(a.data)==memoryview(b.data)
Out[]: True
Hmmm, I know it is a poor answer, but it seems there is no easy way to do this. The NumPy developers should fix it. I suggest:
def compare(a, b):
    if len(a) > 0 and not np.array_equal(a[0], b[0]):
        return False
    if len(a) > 15 and not np.array_equal(a[:15], b[:15]):
        return False
    if len(a) > 200 and not np.array_equal(a[:200], b[:200]):
        return False
    return np.array_equal(a, b)
:)
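A more general form of that chunked early-exit idea might look like this (my sketch; the name chunked_equal and the chunk sizes are arbitrary):
import numpy as np

def chunked_equal(a, b, first_chunk=16):
    # Compare in exponentially growing chunks so unequal arrays fail fast,
    # while equal arrays pay only a small overhead over a single np.array_equal.
    if a.shape != b.shape:
        return False
    a_flat, b_flat = a.ravel(), b.ravel()
    start, size = 0, first_chunk
    while start < a_flat.size:
        stop = min(start + size, a_flat.size)
        if not np.array_equal(a_flat[start:stop], b_flat[start:stop]):
            return False
        start, size = stop, size * 2
    return True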
Well, not really an answer as I haven't checked whether it short-circuits, but:
assert_array_equal.
From the documentation:
Raises an AssertionError if two array_like objects are not equal.
Try Except it if not on a performance sensitive code path.
Or follow the underlying source code, maybe it's efficient.
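For example, a small wrapper along those lines (my sketch, not from the answer):
import numpy as np
from numpy.testing import assert_array_equal

def arrays_equal_via_assert(a, b):
    # Turn the assertion into a boolean; note that raising and catching the
    # exception has its own cost on the unequal path.
    try:
        assert_array_equal(a, b)
        return True
    except AssertionError:
        return False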
You could iterate all elements of the arrays and check if they are equal.
If the arrays are most likely not equal it will return much faster than the .all function.
Something like this:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 3, 4])

are_equal = True
for x in range(a.size):
    if a[x] != b[x]:
        are_equal = False
        break
    else:
        print("a[%d] is equal to b[%d]" % (x, x))

if are_equal:
    print("The arrays are equal")
else:
    print("The arrays are not equal")
As Thomas Kühn wrote in a comment to your post, array_equal is a function which should solve the problem. It is described in Numpy's API reference.
Breaking down the original problem to three parts: "(1) My arrays are really large, and (2) I have a lot of them, and (3) the probability of two arrays being equal is small"
All the solutions (to date) focus on part (1), optimizing the performance of each equality check, and some improve it by a factor of 10. Points (2) and (3) are ignored. Comparing every pair has O(n^2) complexity, which becomes huge for many arrays, and is needless when the probability of two being duplicates is very small.
The check can become much faster with the following general algorithm -
fast hash of each array O(n)
check equality only for arrays with the same hash
A good hash is almost unique, so the number of distinct hash values can easily be a very large fraction of n. On average, the number of arrays sharing a hash will be very small, often just 1. Duplicate arrays always have the same hash, while having the same hash doesn't guarantee they are duplicates, so the algorithm still catches all the duplicates. Comparing only arrays that share a hash drastically reduces the number of comparisons, making the whole check almost O(n).
For my problem, I had to check for duplicates within ~1 million integer arrays, each with 10k elements. Optimizing only the array equality check (with @MB-F's solution) gave an estimated run time of 5 days. With hashing first it finished in minutes. (I used the array sum as the hash, which suited my arrays' characteristics.)
Some pseudo-Python code:
from collections import defaultdict
import numpy as np

def fast_hash(arr) -> int:
    # e.g. the array sum, as used by the author
    pass

def arrays_equal(arr1, arr2) -> bool:
    # e.g. the jit-compiled comparison from the accepted answer
    pass

def make_hash_dict(array_stack, hash_fn=np.sum):
    hash_dict = defaultdict(list)
    hashes = np.squeeze(np.apply_over_axes(hash_fn, array_stack, range(1, array_stack.ndim)))
    for idx, hash_val in enumerate(hashes):
        hash_dict[hash_val].append(idx)
    return hash_dict

def get_duplicate_sets(hash_dict, array_stack):
    duplicate_sets = []
    for hash_key, ind_list in hash_dict.items():
        if len(ind_list) == 1:
            continue
        all_duplicates = []
        for idx1 in range(len(ind_list)):
            v1 = ind_list[idx1]
            if v1 in all_duplicates:
                continue
            arr1 = array_stack[v1]
            curr_duplicates = []
            for idx2 in range(idx1 + 1, len(ind_list)):
                v2 = ind_list[idx2]
                arr2 = array_stack[v2]
                if arrays_equal(arr1, arr2):
                    if len(curr_duplicates) == 0:
                        curr_duplicates.append(v1)
                    curr_duplicates.append(v2)
            if len(curr_duplicates) > 0:
                all_duplicates.extend(curr_duplicates)
                duplicate_sets.append(curr_duplicates)
    return duplicate_sets
The variable duplicate_sets is a list of lists; each inner list contains the indices of one group of identical arrays.
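For a concrete (hypothetical) instantiation of the two placeholders, following what the author describes, the hash can be the array sum and the comparison any early-exiting equality check:
import numpy as np

def fast_hash(arr) -> int:
    # Cheap hash: the array sum, as the author used for integer arrays.
    return int(arr.sum())

def arrays_equal(arr1, arr2) -> bool:
    # np.array_equal as a simple stand-in; the jit-compiled version above
    # exits earlier on mismatches.
    return bool(np.array_equal(arr1, arr2))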

Why is the "map" version of ThreeSum so slow?

I expected this Python implementation of ThreeSum to be slow:
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
But I was shocked that this version looks pretty slow too:
def count_python(a):
    """ThreeSum using itertools"""
    return sum(map(lambda X: sum(X) == 0, itertools.combinations(a, r=3)))
Can anyone recommend a faster Python implementation? Both implementations just seem so slow...
Thanks
...
ANSWER SUMMARY:
Here is how the various O(N^3) versions provided in this thread (for educational purposes only, not for real-life use) ran on my machine:
56 sec RUNNING count_slow...
28 sec RUNNING count_itertools, written by Ashwini Chaudhary...
14 sec RUNNING count_fixed, written by roippi...
11 sec RUNNING count_itertools (faster), written by Veedrak...
08 sec RUNNING count_enumerate, written by roippi...
*Note: I needed to modify Veedrak's solution as follows to get the correct count output:
sum(1 for x, y, z in itertools.combinations(a, r=3) if x+y==-z)
Supplying a second answer. From various comments, it looks like you're primarily concerned about why this particular O(n**3) algorithm is slow when being ported over from java. Let's dive in.
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
One major problem that immediately pops out is that you're doing something your java code almost certainly isn't doing: materializing a 3-element list just to add three numbers together!
if sum([a[i], a[j], a[k]]) == 0:
Yuck! Just write that as
if a[i] + a[j] + a[k] == 0:
Some benchmarking shows that you're adding 50%+ overhead just by doing that. Yikes.
The other issue here is that you're using indexing where you should be using iteration. In python try to avoid writing code like this:
for i in range(len(some_list)):
    do_something(some_list[i])
And instead just write:
for x in some_list:
    do_something(x)
And if you explicitly need the index that you're on (as you actually do in your code), use enumerate:
for i, x in enumerate(some_list):
    # etc
This is, in general, a style thing (though it goes deeper than that, with duck typing and the iterator protocol) - but it is also a performance thing. In order to look up the value of a[i], that call is converted to a.__getitem__(i), then python has to dynamically resolve a __getitem__ method lookup, call it, and return the value. Every time. It's not a crazy amount of overhead - at least on builtin types - but it adds up if you're doing it a lot in a loop. Treating a as an iterable, on the other hand, sidesteps a lot of that overhead.
So taking that change in mind, you can rewrite your function once again:
def count_enumerate(a):
    cnt = 0
    for i, x in enumerate(a):
        for j, y in enumerate(a[i+1:], i+1):
            for z in a[j+1:]:
                if x + y + z == 0:
                    cnt += 1
    return cnt
Let's look at some timings:
%timeit count(range(-100,100))
1 loops, best of 3: 394 ms per loop
%timeit count_fixed(range(-100,100)) #just fixing your sum() line
10 loops, best of 3: 158 ms per loop
%timeit count_enumerate(range(-100,100))
10 loops, best of 3: 88.9 ms per loop
And that's about as fast as it's going to go. You can shave off a percent or so by wrapping everything in a comprehension instead of doing cnt += 1 but that's pretty minor.
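For completeness, the comprehension form mentioned above might look like this (my sketch, not from the answer):
def count_comprehension(a):
    # Same triple loop as count_enumerate, expressed as a generator expression.
    return sum(1
               for i, x in enumerate(a)
               for j, y in enumerate(a[i+1:], i+1)
               for z in a[j+1:]
               if x + y + z == 0)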
I've toyed around with a few itertools implementations but I actually can't get them to go faster than this explicit loop version. This makes sense if you think about it - for every iteration, the itertools.combinations version has to rebind what all three variables refer to, whereas the explicit loops get to "cheat" and rebind the variables in the outer loops far less often.
Reality check time, though: after everything is said and done, you can still expect cPython to run this algorithm an order of magnitude slower than a modern JVM would. There is simply too much abstraction built in to python that gets in the way of looping quickly. If you care about speed (and you can't fix your algorithm - see my other answer), either use something like numpy to spend all of your time looping in C, or use a different implementation of python.
postscript: pypy
For fun, I ran count_fixed on a 1000-element list, on both cPython and pypy.
cPython:
In [81]: timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
Out[81]: 19.230753898620605
pypy:
>>>> timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
0.6961538791656494
Speedy!
I might add some java testing in later to compare :-)
Algorithmically, both versions of your function are O(n**3) - so asymptotically neither is superior. You will find that the itertools version is in practice somewhat faster since it spends more time looping in C rather than in python bytecode. You can get it down a few more percentage points by removing map entirely (especially if you're running py2) but it's still going to be "slow" compared to whatever times you got from running it in a JVM.
Note that there are plenty of python implementations other than cPython out there - for loopy code, pypy tends to be much faster than cPython. So I wouldn't write python-as-a-language off as being slow, necessarily, but I would certainly say that the reference implementation of python is not known for its blazing loop speed. Give other python flavors a shot if that's something you care about.
Specific to your algorithm, an optimization will let you drop it down to O(n**2). Build up a set of your integers, s, and build up all pairs (a,b). You know that you can "zero out" (a+b) if and only if -(a+b) in (s - {a,b}).
Thanks to @Veedrak: unfortunately constructing s - {a,b} is itself a slow O(len(s)) operation, so simply check whether -(a+b) is equal to either a or b. If it is, you know there's no third c that can fulfill a+b+c == 0, since all numbers in your input are distinct.
def count_python_faster(a):
    s = frozenset(a)
    return sum(1 for x, y in itertools.combinations(a, 2)
               if -(x+y) not in (x, y) and -(x+y) in s) // 3
Note the divide-by-three at the end; this is because each successful combination is triple-counted. It's possible to avoid that but it doesn't actually speed things up and (imo) just complicates the code.
Some timings for the curious:
%timeit count(range(-100,100))
1 loops, best of 3: 407 ms per loop
%timeit count_python(range(-100,100)) #this is about 100ms faster on py3
1 loops, best of 3: 382 ms per loop
%timeit count_python_faster(range(-100,100))
100 loops, best of 3: 5.37 ms per loop
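If you do go the numpy route mentioned earlier, the same O(n**2) pair idea can be vectorized roughly like this (my sketch; count_numpy and its details are not from the thread, and it trades O(n**2) memory for the pairwise sums):
import numpy as np

def count_numpy(a):
    # For every unordered pair (x, y), check whether -(x + y) is also an input
    # value different from both x and y; each triple is then found three times.
    arr = np.asarray(list(a))
    i, j = np.triu_indices(len(arr), k=1)   # all index pairs with i < j
    x, y = arr[i], arr[j]
    need = -(x + y)                         # value that would complete the triple
    hits = np.isin(need, arr) & (need != x) & (need != y)
    return int(hits.sum()) // 3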
You haven't stated which version of Python you're using.
In Python 3.x, a generator expression is around 10% faster than either of the two implementations you listed. Using a random array of 100 numbers in the range [-100,100] for a:
count(a) -> 8.94 ms # as per your implementation
count_python(a) -> 8.75 ms # as per your implementation
def count_generator(a):
    return sum(sum(x) == 0 for x in itertools.combinations(a, r=3))
count_generator(a) -> 7.63 ms
But other than that, it's the sheer number of combinations that dominates the execution time - O(N^3).
I should add the times shown above are for loops of 10 calls each, averaged over 10 loops. And yeah, my laptop is slow too :)

Pre-allocating a list of None

Suppose you want to write a function which yields a list of objects, and you know in advance the length n of such list.
In Python, lists support indexed access in O(1), so it is arguably a good idea to pre-allocate the list and access it by index instead of starting from an empty list and using the append() method. This is because we avoid the cost of growing the list whenever it runs out of space.
If I'm using python, probably performances are not that relevant in any case, but what is the better way of pre-allocating a list?
I know these possible candidates:
[None] * n → allocating two lists
[None for x in range(n)] — or xrange in python2 → building another object
Is one significantly better than the other?
What if we are in the case n = len(input)? Since input exists already, would [None for x in input] have better performances w.r.t. [None] * len(input)?
Between those two options, the first one is clearly better, as no Python-level for loop is involved.
>>> %timeit [None] * 100
1000000 loops, best of 3: 469 ns per loop
>>> %timeit [None for x in range(100)]
100000 loops, best of 3: 4.8 us per loop
Update:
And list.append has amortized O(1) complexity too; it can even be a better choice than pre-creating the list if you bind the list.append method to a local variable first.
>>> n = 10**3
>>> %%timeit
lis = [None]*n
for _ in range(n):
    lis[_] = _
...
10000 loops, best of 3: 73.2 us per loop
>>> %%timeit
lis = []
for _ in range(n):
    lis.append(_)
...
10000 loops, best of 3: 92.2 us per loop
>>> %%timeit
lis = []; app = lis.append
for _ in range(n):
    app(_)
...
10000 loops, best of 3: 59.4 us per loop
>>> n = 10**6
>>> %%timeit
lis = [None]*n
for _ in range(n):
    lis[_] = _
...
10 loops, best of 3: 106 ms per loop
>>> %%timeit
lis = []
for _ in range(n):
    lis.append(_)
...
10 loops, best of 3: 122 ms per loop
>>> %%timeit
lis = []; app = lis.append
for _ in range(n):
    app(_)
...
10 loops, best of 3: 91.8 ms per loop
When you append an item to a list, Python 'over-allocates', see the source-code of the list object. This means that for example when adding 1 item to a list of 8 items, it actually makes room for 8 new items, and uses only the first one of those. The next 7 appends are then 'for free'.
In many languages (e.g. old versions of Matlab, the newer JIT might be better) you are always told that you need to pre-allocate your vectors, since appending during a loop is very expensive. In the worst case, appending of a single item to a list of length n can cost O(n) time, since you might have to create a bigger list and copy all the existing items over. If you need to do this on every iteration, the overall cost of adding n items is O(n^2), ouch. Python's pre-allocation scheme spreads the cost of growing the array over many single appends (see amortized costs), effectively making the cost of a single append O(1) and the overall cost of adding n items O(n).
Additionally, the overhead of the rest of your Python code is usually so large, that the tiny speedup that can be obtained by pre-allocating is insignificant. So in most cases, simply forget about pre-allocating, unless your profiler tells you that appending to a list is a bottleneck.
The other answers show some profiling of the list preallocation itself, but this is useless. The only thing that matters is profiling your complete code, with all your calculations inside your loop, with and without pre-allocation. If my prediction is right, the difference is so small that the computation time you win is dwarfed by the time spent thinking about, writing and maintaining the extra lines to pre-allocate your list.
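You can see the over-allocation directly with a small sketch like this (mine, not from the answer; the exact byte sizes depend on the CPython version):
import sys

lis = []
last = sys.getsizeof(lis)
for i in range(32):
    lis.append(None)
    size = sys.getsizeof(lis)
    if size != last:              # the byte size only jumps occasionally,
        print(len(lis), size)     # so most appends do not reallocate
        last = size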
Obviously, the first version. Let me explain why.
When you do [None] * n, Python internally creates a list object of size n and stores a reference to the same object (here None) in every slot (this is why you should only use this method with immutable objects). So the memory allocation is done only once, followed by a single pass over the list that fills in the references. list_repeat is the C function that implements this kind of list creation.
/* Creates the list of the specified size */
np = (PyListObject *) PyList_New(size);
....
...
items = np->ob_item;
if (Py_SIZE(a) == 1) {
    elem = a->ob_item[0];
    for (i = 0; i < n; i++) {
        items[i] = elem;    /* copies the same item */
        Py_INCREF(elem);
    }
    return (PyObject *) np;
}
When you use a list comprehension to build a list, Python cannot know the final size of the list in advance, so it starts with a small allocation and appends a reference for each element as it goes. Whenever the list grows beyond the allocated length, it has to reallocate and then keep appending, which is what makes this slower than [None] * n.
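A quick check of the shared-reference behaviour (my example; the nested-list case is the classic pitfall with mutable objects):
row = [None] * 3
print(all(x is None for x in row))   # True: the same None object three times

grid = [[0] * 3] * 3                 # three references to ONE inner list
grid[0][0] = 99
print(grid)                          # [[99, 0, 0], [99, 0, 0], [99, 0, 0]]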

Python "in" operator speed

Is the in operator's speed in python proportional to the length of the iterable?
So,
len(x)  # 10
if a in x:  # let's say this takes time A
    pass

len(y)  # 10000
if a in y:  # let's say this takes time B
    pass
Is A > B?
A summary for in:
list - Average: O(n)
set/dict - Average: O(1), Worst: O(n)
See this for more details.
There's no general answer to this: it depends on the type of a and especially on the type of the container (your x or y). If, for example, the container is a list, then yes, in takes worst-case time proportional to its length. But if, for example, it is a dict or a set, then in takes expected-case time O(1) (i.e., constant time).
About "Is A > B?": as above, there's no general answer to which of your in statements will run faster; it depends on the container types.
Assuming x and y are lists:
Is the in operator's speed in Python proportional to the length of the iterable?
Yes.
The time for in to run on a list of length n is O(n). Note that x in some_set and x in some_dict are hashed lookups, so they take O(1) time on average.
Is A > B?
No.
Time A < Time B as 10 < 10000
Docs: https://wiki.python.org/moin/TimeComplexity
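A quick way to see this yourself (my sketch; the sizes and the probe value are arbitrary):
from timeit import timeit

small = list(range(10))
big = list(range(10000))
big_set = set(big)

# Searching for a missing value forces a full scan of each list,
# while the set lookup stays roughly constant.
print(timeit(lambda: -1 in small, number=100000))
print(timeit(lambda: -1 in big, number=100000))
print(timeit(lambda: -1 in big_set, number=100000))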
