I have a problem where I am trying to take a randomly ordered list, and for each element I want to know how many elements with a greater index are smaller in value than that element.
For example:
[1,2,5,3,7,6,8,4]
should return:
[0,0,2,0,2,1,1,0]
This is the code I have that is currently working.
bribe_array = [0] * len(q)
for i in range(0, len(bribe_array)-1):
    bribe_array[i] = sum(j < q[i] for j in q[(i+1):])
This does produce the desired array but it runs slowly. What is the more pythonic way to get this accomplished?
We could fiddle around with the code in the question, but it would still be an O(n^2) algorithm. To truly improve the performance is not a matter of making the implementation more or less pythonic, but of using a different approach with a helper data structure.
Here's an outline of an O(n log n) solution: implement a self-balancing BST (AVL or red-black are good options), and additionally store in each node an attribute with the size of the subtree rooted in it. Now traverse the list from right to left and insert all its elements into the tree as new nodes. We also need an extra output list, of the same size as the input list, to keep track of the answer.
For every node we insert in the tree, we compare its key with the root. If it's greater than the key in the root, it is greater than the root and all the nodes in its left subtree, hence we add the size of the left subtree plus one to the answer list at the position of the element we're trying to insert.
We keep doing this recursively and updating the size attribute in each node we visit, until we find the right place to insert the new node, and proceed to the next element in the input list. In the end the output list will contain the answer.
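To make this concrete, here is a minimal sketch of the counting logic (my own illustrative code with made-up names, not a production implementation). It uses a plain, unbalanced BST whose nodes track subtree sizes, so its worst case degrades to O(n^2); substituting an AVL or red-black tree with the same size bookkeeping gives the O(n log n) bound described above.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in the subtree rooted here

def insert_and_count(node, key):
    """Insert key; return (subtree root, number of existing keys < key)."""
    if node is None:
        return Node(key), 0
    node.size += 1
    if key > node.key:
        # key is greater than this node and everything in its left subtree
        smaller_here = (node.left.size if node.left else 0) + 1
        node.right, smaller_below = insert_and_count(node.right, key)
        return node, smaller_here + smaller_below
    node.left, smaller_below = insert_and_count(node.left, key)
    return node, smaller_below

def smaller_to_the_right(items):
    root, out = None, [0] * len(items)
    for i in range(len(items) - 1, -1, -1):  # traverse the list from right to left
        root, out[i] = insert_and_count(root, items[i])
    return out

print(smaller_to_the_right([1, 2, 5, 3, 7, 6, 8, 4]))  # [0, 0, 2, 0, 2, 1, 1, 0]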
Another option that's much simpler than implementing a balanced BST is to adapt merge sort to count inversions and accumulate them per element during the process. Any single swap of two adjacent elements is an inversion, so the lower-indexed element gets one count. Then during each merge, simply keep track of how many elements from the right group have already been moved ahead of the left group, and add that count to every element taken from the left group.
Here's a very crude illustration :)
[1,2,5,3,7,6,8,4]
sort 1,2 | 5,3
3,5 -> 5: 1
merge
1,2,3,5
sort 7,6 | 8,4
6,7 -> 7: 1
4,8 -> 8: 1
merge
4 -> 6: 1, 7: 2
4,6,7,8
merge 1,2,3,5 | 4,6,7,8
1,2,3,4 -> 1 moved
5 -> +1 -> 5: 2
6,7,8
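In code, the merge-and-count idea sketched above could look roughly like this (my own illustrative implementation; the function names are made up):

def count_smaller_right(items):
    """O(n log n) merge sort that credits each element with the number of
    smaller elements originally to its right."""
    counts = [0] * len(items)

    def sort(pairs):  # pairs of (original_index, value), returned sorted by value
        if len(pairs) <= 1:
            return pairs
        mid = len(pairs) // 2
        left, right = sort(pairs[:mid]), sort(pairs[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if right[j][1] < left[i][1]:
                merged.append(right[j])  # a right-half element jumps ahead
                j += 1
            else:
                counts[left[i][0]] += j  # j smaller right-half elements passed it
                merged.append(left[i])
                i += 1
        for pair in left[i:]:            # leftovers: all j merged right elements are smaller
            counts[pair[0]] += j
            merged.append(pair)
        merged.extend(right[j:])
        return merged

    sort(list(enumerate(items)))
    return counts

print(count_smaller_right([1, 2, 5, 3, 7, 6, 8, 4]))  # [0, 0, 2, 0, 2, 1, 1, 0]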
There are several ways of speeding up your code without touching the overall computational complexity.
This is because the same algorithm can be written in several ways.
Let's start with your code:
def bribe_orig(q):
    bribe_array = [0] * len(q)
    for i in range(0, len(bribe_array)-1):
        bribe_array[i] = sum(j < q[i] for j in q[(i+1):])
    return bribe_array
This is of somewhat mixed style: firstly, you generate a list of zeros, which is not really needed, since you could append items on demand; secondly, the outer loop uses range() and repeated indexing even though you access the same item multiple times, so binding it to a local name would be faster; thirdly, the generator inside sum() sums up booleans and hence performs implicit conversions all the time, which is also sub-optimal.
A cleaner approach would be:
def bribe(items):
    result = []
    for i, item in enumerate(items):
        partial_sum = 0
        for x in items[i + 1:]:
            if x < item:
                partial_sum += 1
        result.append(partial_sum)
    return result
This is somewhat simpler, and since it does a number of things explicitly and only performs an addition when necessary (thus skipping the cases where you would be adding 0), it may be faster.
Another way of writing your code in a more compact way is:
def bribe_compr(items):
    return [sum(x < item for x in items[i + 1:]) for i, item in enumerate(items)]
This involves the use of generators and list comprehensions, and the outer loop is written with enumerate(), following typical Python style.
But Python is infamously slow at raw looping, so, when possible, vectorization can be helpful. One way of doing this (only for the inner loop) is with NumPy:
import numpy as np

def bribe_np(items):
    items = np.array(items)
    return [np.sum(items[i + 1:] < item) for i, item in enumerate(items)]
Finally, it is possible to use a JIT compiler to speed up the plain Python loops using Numba:
import numba as nb

bribe_jit = nb.jit(bribe)
As with any JIT, there is some cost for the just-in-time compilation, which is eventually offset for large enough inputs.
Unfortunately, Numba's JIT does not support all Python code, but when it does, like in this case, it can be pretty rewarding.
Let's look at some numbers.
Consider the input generated with the following:
import numpy as np
np.random.seed(0)
n = 10
q = np.random.randint(1, n, n)
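Note that when benchmarking the JIT-compiled version it is a good idea to call it once beforehand, so that the one-off compilation cost is not mixed into the timings:

bribe_jit(q)  # warm-up call: triggers Numba's compilation before the measurements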
On small-sized input (n = 10):
%timeit bribe_orig(q)
# 228 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe(q)
# 20.3 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bribe_compr(q)
# 216 µs ± 5.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_np(q)
# 133 µs ± 9.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit bribe_jit(q)
# 1.11 µs ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
On medium-sized input (n = 100):
%timeit bribe_orig(q)
# 20.5 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bribe(q)
# 741 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_compr(q)
# 18.9 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit bribe_np(q)
# 1.22 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit bribe_jit(q)
# 7.54 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
On larger input (n = 10000):
%timeit bribe_orig(q)
# 1.99 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit bribe(q)
# 60.6 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit bribe_compr(q)
# 1.8 s ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit bribe_np(q)
# 12.8 ms ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit bribe_jit(q)
# 182 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
From these results, we observe that we gain the most from substituting sum() with the explicit construct involving only Python loops.
The comprehension-based version gains you only about 10% over your original code.
For larger inputs, NumPy can be even faster than the explicit construct involving only Python loops.
However, the real gains come from Numba's JIT-compiled version of bribe().
You can get better performance by progressively building a sorted list going from last to first in your array. Using a binary search on the sorted list for each element of the array, you get the index at which the element will be inserted, which also happens to be the number of smaller elements among the ones already processed.
Collecting these insertion points will give you the expected result (in reverse order).
Here's an example:
from bisect import bisect_left

a = [1,2,5,3,7,6,8,4]
s = []
r = []
for x in reversed(a):
    p = bisect_left(s, x)
    r.append(p)
    s.insert(p, x)
r = r[::-1]
print(r)  # [0, 0, 2, 0, 2, 1, 1, 0]
For this example, the progression will be as follows:
step 1: x = 4, p=0 ==> r=[0] s=[4]
step 2: x = 8, p=1 ==> r=[0,1] s=[4,8]
step 3: x = 6, p=1 ==> r=[0,1,1] s=[4,6,8]
step 4: x = 7, p=2 ==> r=[0,1,1,2] s=[4,6,7,8]
step 5: x = 3, p=0 ==> r=[0,1,1,2,0] s=[3,4,6,7,8]
step 6: x = 5, p=2 ==> r=[0,1,1,2,0,2] s=[3,4,5,6,7,8]
step 7: x = 2, p=0 ==> r=[0,1,1,2,0,2,0] s=[2,3,4,5,6,7,8]
step 8: x = 1, p=0 ==> r=[0,1,1,2,0,2,0,0] s=[1,2,3,4,5,6,7,8]
Reverse r with r = r[::-1] ==> r=[0,0,2,0,2,1,1,0]
You will be performing N iterations (the size of the array), and each binary search runs in O(log i) where i grows from 1 to N, so the searches alone cost O(N*log(N)). The only caveat is the performance of s.insert(p,x), which has to shift the elements after position p and therefore introduces some variability depending on the order of the original list.
Overall the performance profile should fall between O(N*log(N)) and O(n^2), with the worst case occurring when the array is already sorted in ascending order (every insertion then happens at the front of s).
If you only need to make your code a little faster and more concise, you could use a list comprehension (but that'll still be O(n^2) time):
r = [sum(v<p for v in a[i+1:]) for i,p in enumerate(a)]
I create an array that does not contain a single zero (let's ignore the fact that it could, with essentially zero probability, since np.random.rand() samples [0, 1) uniformly). I want to check whether all values are equal to zero (for another purpose, the arrays may sometimes be all zeros). Below are some timings.
Surprisingly to me, checking a single (nonzero) element is about 2000 times faster than using np.all() or np.any(). I would assume that the compiler internally replaces np.all() by np.any() of the inverse condition and that np.any()/np.all() returns True/False at the first instance that the condition is fulfilled/violated (i.e. the compiler does not create the entire array of True or False values first).
How come np.all() and np.any() are that much slower when they would only have to check one element? Or is it because of the external knowledge I bring, namely that the array does not contain only zeros? In the case of an all-zeros array, I guess it might be too slow to do the boolean comparison separately for each element. I don't know about the performance of the underlying low-level algorithms, but each element needs to be accessed once regardless of whether it is checked one by one or the whole boolean array is created first.
import numpy as np
np.random.seed(100)
a = np.random.rand(10418,144)
%timeit a[0,0] == 0
%timeit (a == 0).all()
%timeit np.all(a == 0)
%timeit (a != 0).any()
%timeit np.any(a != 0)
# 400 ns ± 2.08 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 713 µs ± 382 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 720 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 711 µs ± 407 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 723 µs ± 630 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
When you write a == 0, numpy creates a new array of type boolean, compares each element in a with 0 and stores the result in the array. This allocation, initialization, and subsequent deallocation is the reason for the high cost.
Note that you don't need the explicit a == 0 in the first place. Integers that are zero always evaluate to False and nonzero integers to True, so np.all(a) is equivalent to np.all(a != 0), and np.all(a == 0) is equivalent to not np.any(a).
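As a small illustration of that point (using the same array as in the question):

import numpy as np

np.random.seed(100)
a = np.random.rand(10418, 144)

print(not np.any(a))   # all-zero check without building a temporary boolean array
print(np.all(a == 0))  # equivalent result, but allocates a full boolean array first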
I'm trying to use different weights for my model, and I need those weights to add up to 1, like this:
def func(length):
    return ['a list of numbers add up to 1 with given length']
func(4) returns [0.1, 0.2, 0.3, 0.4]
The numbers should be linearly spaced and they should not start from 0. Is there any way to achieve this with numpy or scipy?
This can be done quite simply using numpy arrays:
import numpy as np

def func(length):
    linArr = np.arange(1, length+1)
    return linArr/linArr.sum()
First we create an array of the given length containing the integers from 1 to length, then we divide it by its sum so that the values add up to 1.
Thanks to Paul Panzer for pointing out that the efficiency of this function can be improved by using Gauss's formula for the sum of the first n integers:
def func(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum
For large inputs, you might find that using np.linspace is faster than the accepted answer:

def f1(length):
    linArr = np.arange(1, length+1)
    arrSum = length * (length+1) // 2
    return linArr/arrSum

def f2(l):
    delta = 2/(l*(l+1))
    return np.linspace(delta, l*delta, l)
Ensure that the two things produce the same result:
In [39]: np.allclose(f1(1000000), f2(1000000))
Out[39]: True
Check timing of both:
In [68]: %timeit f1(10000000)
515 ms ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [69]: %timeit f2(10000000)
247 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's tempting to just use np.arange(delta, l*delta, delta), which should be even faster, but this presents the risk of rounding errors causing the array to have a length different from l (as will happen e.g. for l = 10000000).
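If you want to check whether a given l is affected, a quick test (illustrative only) is to compare the resulting length with l:

l = 10000000
delta = 2/(l*(l+1))
print(len(np.arange(delta, l*delta, delta)) == l)  # may print False due to rounding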
If speed is more important than code style, it might also be possible to squeeze out a bit more by using Numba:
from numba import jit

@jit
def f3(l):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for n in range(l):
        a[n] = (n+1)*delta
    return a
In [96]: %timeit f3(10000000)
216 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While we're at it, let's note that it's possible to parallelize this loop. Doing so naively with Numba doesn't appear to give much, but helping it out a bit and pre-splitting the array into num_parallel parts does give further improvement on a quad core system:
from numba import njit, prange

@njit(parallel=True)
def f4(l, num_parallel=4):
    a = np.empty(l, dtype=np.float64)
    delta = 2/(l*(l+1))
    for j in prange(num_parallel):
        # The last chunk gets whatever is left over from the integer division
        offset = 0 if j != num_parallel - 1 else l % num_parallel
        for n in range(l//num_parallel + offset):
            i = j*(l//num_parallel) + n
            a[i] = (i+1)*delta
    return a
In [171]: %timeit f4(10000000, 4)
163 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [172]: %timeit f4(10000000, 8)
158 ms ± 5.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [173]: %timeit f4(10000000, 12)
157 ms ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here I have two functions:
from functools import reduce
from operator import mul
def fn1(arr):
    product = 1
    for number in arr:
        product *= number
    return [product//x for x in arr]

def fn2(arr):
    product = reduce(mul, arr)
    return [product//x for x in arr]
Benchmark:
In [2]: arr = list(range(1,11))
In [3]: %timeit fn1(arr)
1.62 µs ± 23.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit fn2(arr)
1.88 µs ± 28.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: arr = list(range(1,101))
In [6]: %timeit fn1(arr)
38.5 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit fn2(arr)
41 µs ± 463 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [8]: arr = list(range(1,1001))
In [9]: %timeit fn1(arr)
4.23 ms ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %timeit fn2(arr)
4.24 ms ± 36.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: arr = list(range(1,10001))
In [12]: %timeit fn1(arr)
605 ms ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [13]: %timeit fn2(arr)
594 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here fn2() is marginally slower with the small lists. My understanding was that reduce() and mul() are both built-in functions, therefore they run at C speed and should be faster than the for loop. Is it perhaps because fn2() makes many more function calls (which also take some time), and that contributes to the end performance? But then the trend shows that fn2() outperforms fn1() with the larger lists. Why?
There could be a lot of reasons for this. The first is CPU execution prediction and compiler optimizations. In my opinion, as long as you choose a proper algorithm, it shouldn't matter much which form of the code is faster. Use the one that suits your needs and reads better, and leave performance to the Python implementation. Faster options often cause more problems with memory, readability, or maintainability. Also, it's not guaranteed that slightly more complex code inside simple loops won't change performance, simply because some optimizations may or may not be applied.
-- Update --
If you want to increase the performance of Python on simple operations, I would advise you to look at PyPy, Cython, or Nim, which are made to approach the performance of C where Python's type wrappers cost too much.
These are always going to be very close, but for somewhat interesting reasons:
If the product is small (e.g. for range(1,10)) then the functions are doing very little "useful" work, and everything goes into marshalling machine ints between Python objects so the functions can be invoked.
If the product is large (e.g. the range(1,10001) case) the numbers become enormous (i.e. thousands of decimal digits) and the majority of time is spent multiplying very large numbers.
For example, with Python 3.7.3 (on Linux 5.0.10):
from functools import reduce
from operator import mul
prod = reduce(mul, range(1, 10001))
prod has ~36k digits and consumes ~16KiB --- i.e. check with math.log10(prod) and sys.getsizeof(prod).
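For instance, a quick way to see those two figures (the exact numbers depend a little on the platform):

import math
import sys
from functools import reduce
from operator import mul

prod = reduce(mul, range(1, 10001))
print(int(math.log10(prod)) + 1)  # number of decimal digits, roughly 36k
print(sys.getsizeof(prod))        # bytes used by the Python int, roughly 16 KiB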
Working with small (i.e. non-bignum) products such as:
reduce(mul, [1] * 10001)
is ~50 times faster on my computer than when we need to use bignums as above. Also notice that it's basically the same speed when using floating point numbers as compared to integers, e.g.
reduce(mul, [1.] * 10001)
only takes ~10% more time.
Your additional code that makes another pass over the array seems to just complicate the issue, hence I've ignored it; microbenchmarks like these are awkward enough to get right!
Q. Write an algorithm that returns the second largest number in an array
a = [1, 2, 3, 4, 5]
print(max([x for x in a if x != max(a)]))
>> 4
I'm trying to figure out how this algorithm works and whether or not Python's internal magic will make it as efficient as writing a linear algorithm which just loops over the list a once and stores the highest and second-highest values.
Correct me if I'm wrong:
The call to max(a) would be O(n)
[x for x in a] would also be O(n)
Would Python be smart enough to cache the value of max(a), or does this mean that the list comprehension part of the algorithm is O(n^2)?
And then the final max([listcomp]) would be another O(n), but this would only run once after the comprehension is finished, so the final algorithm would be O(n^2)?
Is there any fancy business going on internally that would cache the max(a) value and result in this algorithm working out quicker than O(n^2)?
The easy way to find out is timing it. Consider this timing code:
for i in range(1, 5):
    a = list(range(10**i))
    %timeit max([x for x in a if x != max(a)])
17.6 µs ± 178 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
698 µs ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
61.6 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
6.31 s ± 167 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Each time the number of elements is multiplied by 10, the runtime increases by a factor of roughly 100. That's almost certainly O(n**2). For an O(n) algorithm the runtime would increase linearly with the number of elements:
for i in range(1, 6):
    a = list(range(10**i))
    max_ = max(a)
    %timeit max([x for x in a if x != max_])
4.82 µs ± 27.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
29 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
262 µs ± 3.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.42 ms ± 13 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
24.9 ms ± 231 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
But I'm not certain the algorithm really does what is asked. Consider the list a = [1, 3, 3]: even the heapq module tells me that the second largest element is 3 (not 1, which is what your algorithm returns):
>>> import heapq
>>> heapq.nlargest(2, [1, 3, 3])[1]
3
Would Python be smart enough to cache the value of max(a), or does this mean that the list comprehension part of the algorithm is O(n^2)?
No, because, as MSeifert said in a comment, Python doesn't make assumptions about a, so it doesn't cache the value of max(a), which is recomputed each time.
You might want to consider an implementation that keeps track of the two largest items in one pass; you'll need to code an explicit for loop to do it. There's a useful write-up on GeeksForGeeks which I recommend. A sketch of that approach follows at the end of this answer.
Alternatively, you can consider multiple traversals that still end up being linear in complexity:
In [1782]: a = [1, 2, 3, 4, 5]
In [1783]: max(set(a) - {max(a)}) # 3 linear traversals
Out[1783]: 4
There's scope for improvement here, but like I said, nothing beats the explicit for loop approach.
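For completeness, here is a minimal sketch of that explicit single-pass loop (my own code, assuming the list has at least two elements and following the same convention for duplicates as heapq.nlargest):

def second_largest(a):
    largest = second = float('-inf')
    for x in a:
        if x > largest:
            second, largest = largest, x  # old largest becomes the runner-up
        elif x > second:
            second = x
    return second

print(second_largest([1, 2, 3, 4, 5]))  # 4
print(second_largest([1, 3, 3]))        # 3, consistent with heapq.nlargest(2, ...)[1]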
I have a list of elements which are dictionaries; for simplicity I have written them as strings:
ls = ['element1', 'element2', 'element3', 'element4', 'element5', 'element6', 'element7', 'element8', 'element9', 'element10']
I am trying to process every other element of the list, as follows:
# m1. for loop with an if condition on the index
for x in ls:
    if ls.index(x) % 2 == 0:
        # my code to be processed
        print(x)  # for simplicity I just printed the element
# m2. tried another way, like below:
for x in range(0, len(ls), 2):
    # this way gives me the alternate elements of the list
    print(ls[x])
Is there any way to get only the alternate elements while iterating over the list items in m1, just like m2 does?
You can slice the list with a step of two (at the expense of building a new list in memory):
for x in ls[::2]:
    print(x)
You can use itertools.islice with a step of 2:
import itertools

for item in itertools.islice(ls, None, None, 2):  # start and stop are None, step is 2
    print(item)
Which prints:
element1
element3
element5
element7
element9
islice won't create a new list, so it's more memory-efficient than ls[::2], though sometimes at the cost of a bit of speed (see the timings below).
Timing comparison:
(NB: I use IPython's %%timeit to measure the execution time.)
For short sequences [::2] is faster:
ls = list(range(100))
%%timeit
for item in itertools.islice(ls, None, None, 2):
    pass
3.81 µs ± 90 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
for item in ls[::2]:
    pass
3.16 µs ± 82 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But for long sequences islice will be faster and require less memory:
import itertools
ls = list(range(100000))
%%timeit
for item in itertools.islice(ls, None, None, 2):
    pass
3.14 ms ± 53.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
for item in ls[::2]:
    pass
4.82 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One exception: if you want the result as a list, then slicing with [::2] will always be faster; but if you just want to iterate over the items, islice should be the preferred option.