Cython automatic loop unrolling - python

I am trying to accelerate a part of a Python code in which I have the following code:
for i in range(n):
    for j in range(m):
        for (sign, idx) in [(a,b),(c,d),(e,f),(g,h)]:
            array[idx,i] += sign * something
            array[idx,j] += sign * somethingElse
where a,b,c... are relatively complex expressions.
If I manually unroll the inner for loop by writing:
for i in range(n):
    for j in range(m):
        sign, idx = a, b
        array[idx,i] += sign * something
        array[idx,j] += sign * somethingElse
        sign, idx = c, d
        array[idx,i] += sign * something
        array[idx,j] += sign * somethingElse
        sign, idx = e, f
        array[idx,i] += sign * something
        array[idx,j] += sign * somethingElse
        sign, idx = g, h
        array[idx,i] += sign * something
        array[idx,j] += sign * somethingElse
The code runs 4x faster... But copy-pasting seems like a bad idea.
My question: can it be done automatically at compile time?

I guess this is indeed a problem of typing: in test1() I explicitly construct a typed C array "values" once, while in test2() a Python list is constructed on every iteration.
def test1():
    cdef int i
    cdef int value
    cdef int values[4]
    cdef double sum = 0
    values[:] = [1, 2, 3, 4]
    for i in range(1000000):
        for value in values:
            sum += value
    return sum
def test2():
    cdef int i
    cdef int value
    cdef double sum = 0
    for i in range(1000000):
        for value in [1, 2, 3, 4]:
            sum += value
    return sum
The first version is roughly 3 times faster:
%timeit test1()
4.4 ms ± 44.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test2()
13.3 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
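Applying the same idea back to the original loop gives something like the sketch below. This is only a sketch, not from the original thread: it assumes the a, b, ..., h expressions can be dropped into typed C arrays on each iteration (the names signs and idxs are mine) and that array, something and somethingElse are typed appropriately.
cdef int i, j, k
cdef double signs[4]   # assuming the signs are floating point
cdef int idxs[4]       # assuming the indices are C ints
for i in range(n):
    for j in range(m):
        signs[:] = [a, c, e, g]   # the "relatively complex expressions"
        idxs[:] = [b, d, f, h]
        for k in range(4):
            array[idxs[k], i] += signs[k] * something
            array[idxs[k], j] += signs[k] * somethingElse
With everything typed, the range(4) loop is cheap and the C compiler is free to unroll it on its own.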

My Python code gets TimeLimitExceeded. How do I optimize and reduce the number of operations in my code?

Here is my code:
def seperateInputs(inp):
    temp = inp.split()
    n = int(temp[0])
    wires = []
    temp[1] = temp[1].replace('),(', ') (')
    storeys = temp[1][1:len(temp[1])-1].split()
    for each in storeys:
        each = each[1:len(each)-1]
        t = each.split(',')
        wires.append((int(t[0]), int(t[1])))
    return n, wires

def findCrosses(n, wires):
    cross = 0
    for i in range(len(wires)-1):
        for j in range(i+1, len(wires)):
            if (wires[i][0] < wires[j][0] and wires[i][1] > wires[j][1]) or (wires[i][0] > wires[j][0] and wires[i][1] < wires[j][1]):
                cross += 1
    return cross

def main():
    m = int(input())
    for i in range(m):
        inp = input()
        n, wires = seperateInputs(inp)
        print(findCrosses(n, wires))

main()
The question asks how many pairs of wires cross. I also tested my own sample input, which gave the correct output:
Sample input:
3
20 [(1,8),(10,18),(17,19),(13,16),(4,1),(8,17),(2,10),(11,0),(3,2),(12,3),(18,14),(7,7),(19,5),(0,6)]
20 [(3,4),(10,7),(6,11),(7,17),(13,9),(15,19),(19,12),(16,14),(12,8),(0,3),(8,15),(4,18),(18,6),(5,5),(9,13),(17,1),(1,0)]
20 [(15,8),(0,14),(1,4),(6,5),(3,0),(13,15),(7,10),(5,9),(19,7),(17,13),(10,3),(16,16),(14,2),(11,11),(8,18),(9,12),(4,1)]
Sample output:
38
57
54
However, although small inputs work, medium to large inputs give me a TimeLimitExceeded error.
How do I optimize this? Is there a way to use far fewer operations than I already do? TIA.
There are a handful of things you can do.
First, things are easier to compute if you sort the list by the left building first. This costs a little up front, but it makes processing easier and faster, because for each wire you only need to count how many of the second elements seen so far are larger. The code for this is nice and simple:
l = [(3,4),(10,7),(6,11),(7,17),(13,9),(15,19),(19,12),(16,14),(12,8),(0,3),(8,15),(4,18),(18,6),(5,5),(9,13),(17,1),(1,0)]
def count_crossings(l):
    s = sorted(l, key=lambda p: p[0])
    endpoints = []
    count = 0
    for i, j in s:
        count += sum(e > j for e in endpoints)
        endpoints.append(j)
    return count

count_crossings(l)
# 57
This is a little inefficient because you loop through endpoints for every point. If you could also keep endpoints sorted, you would only need to count how many of them are greater than the given right-hand endpoint. Any time you think of keeping a list sorted, you should consider the excellent built-in bisect module. This makes things an order of magnitude faster:
import bisect

def count_crossings_b(l):
    s = sorted(l, key=lambda p: p[0])
    endpoints = []
    count = 0
    for i, j in s:
        bisect.insort_left(endpoints, j)
        count += len(endpoints) - bisect.bisect_right(endpoints, j)
    return count

count_crossings_b(l)
# 57
The various timings on my laptop look like:
import random
l = [(random.randint(1, 200), random.randint(1, 200)) for _ in range(1000)]
%timeit findCrosses(len(l), l) # original
# 179 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit count_crossings(l)
# 38.1 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit count_crossings_b(l)
# 1.08 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
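As a side note that is not part of the original answer: the parsing helper can also be simplified, because everything after the first token of each line is already a valid Python literal. The sketch below (the name separate_inputs is mine) relies on ast.literal_eval and assumes the input format shown in the samples.
import ast

def separate_inputs(inp):
    # split off the leading wire count, then let ast parse the list of tuples
    n_str, pairs_str = inp.split(maxsplit=1)
    return int(n_str), ast.literal_eval(pairs_str)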

Multiple arithmetic operations efficiently on NumPy

I have the following data for a Python program.
import numpy as np
np.random.seed(28)
n = 100000
d = 60
S = np.random.rand(n)
O = np.random.rand(n, d, d)
p = np.random.rand()
mask = np.where(S < 0.5)
And I want to run the following algorithm:
def method1():
    sum_p = np.zeros([d, d])
    sum_m = np.zeros([d, d])
    for k in range(n):
        s = S[k] * O[k]
        sum_p += s
        if S[k] < 0.5:
            sum_m -= s
    return p * sum_p + sum_m
This is a minimal example, but the code in method1() will be run many times in my project, so I would like to rewrite it in a more pythonic way and make it as efficient as possible. I have tried the following method:
def method2():
    sall = S[:, None, None] * O
    return p * sall.sum(axis=0) - sall[mask].sum(axis=0)
But, although this method performs better with low values of d, when d=60 it does not provide good times:
# To check that both methods provide the same result.
In [1]: np.sum(method1() == method2()) == d*d
Out[1]: True
In [2]: %timeit method1()
Out[2]: 801 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit method2()
Out[3]: 1.91 s ± 6.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Do you have any other ideas to optimize this method?
(As additional information, the variable mask is supposed to be used in other parts of my final code, so I don't need to consider it inside the code of method2 for the time computation.)
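One further idea, offered as a sketch rather than a tested answer: let np.tensordot perform the contraction over k directly, so the (n, d, d) temporary sall is never materialised. The hypothetical method3 below reuses the globals S, O, p and mask defined above.
def method3():
    # sum_k S[k] * O[k], written as a contraction of S against the first axis of O
    sum_p = np.tensordot(S, O, axes=(0, 0))
    # the same contraction restricted to the masked entries, then negated
    sum_m = -np.tensordot(S[mask], O[mask], axes=(0, 0))
    return p * sum_p + sum_m
Whether this beats method1 at d=60 still needs to be timed on the real data, but it avoids allocating the large intermediate array.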

How to effectively work on combinations along one dimension in np array

Given an n x m matrix S as a numpy array, I want to call a function f on pairs (S[i], S[j]) to calculate a particular value of interest, which is stored in an n x n matrix. In my particular case the function f is commutative, so f(x,y) = f(y,x).
With all this in mind I am wondering if I can do any tricks to speed this up as much as I can, n can be fairly large.
When I time the function f, it's around a couple of microseconds, which is as expected. It's a pretty straightforward calculation. Below I show you the timings I got, compared with max() and sum() for reference.
In [19]: %timeit sum(s[11])
4.68 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [20]: %timeit max(s[11])
3.61 µs ± 64.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [21]: %timeit f(s[11], s[21], 50, 10, 1e-5)
1.23 µs ± 7.25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [22]: %timeit f(s[121], s[321], 50, 10, 1e-5)
1.26 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
However when I time the overall processing time for a 500x50 sample data (resulting in 500 x 500 /2 = 125K comparisons), the overall time blows up significantly (into minutes). I would have expected something like 0.2-0.3 seconds (1.25E5 * 2E-6 sec/calc).
@jit
def testf(s, n, m, p):
    tol = 1e-5
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i > j:
                delta = p[i] - p[j]
                if delta < 0:
                    res = f(s[i], s[j], m, abs(delta), tol)  # <-- max(s[i])
                else:
                    res = f(s[j], s[i], m, delta, tol)  # <-- sum(s[j])
                sim[i][j] = res
                sim[j][i] = res
    return sim
In the code above I changed the lines where res is assigned to use max() and sum() instead (the commented-out parts) for testing, and the code then executes roughly 100 times faster, even though those functions are themselves slower than my function f().
Which brings me to my questions:
Can I avoid the double loop to speed this up? Ideally I want to be able to run this for matrices with n = 1e5. (Comment: since the max and sum functions work considerably faster, my guess is that the for loops aren't the bottleneck here, but it would still be good to know if there is a better way.)
What may cause the severe slowdown with my function, if it's not the double for loop?
EDIT
The specifics of the function f were asked about in the comments. It iterates over two arrays and counts the number of values in the two arrays that are "close enough". I removed the comments and changed some variable names, but the logic is as shown below. It was interesting to note that math.isclose(x, y, rel_tol), which is equivalent to the if-statements I have below, makes the code significantly slower, probably due to the call overhead.
from numba import jit

@jit
def f(arr1, arr2, n, d, rel_tol):
    counter = 0
    i, j, k = 0, 0, 0
    while i < n and j < n and k < n:
        val = arr1[j] + d
        if abs(arr1[i] - arr2[k]) < rel_tol * max(arr1[i], arr2[k]):
            counter += 1
            i += 1
            k += 1
        elif abs(val - arr2[k]) < rel_tol * max(val, arr2[k]):
            counter += 1
            j += 1
            k += 1
        else:
            # increment the index corresponding to the lightest value
            if arr1[i] <= arr2[k] and arr1[i] <= val:
                if i < n:
                    i += 1
            elif val <= arr1[i] and val <= arr2[k]:
                if j < n:
                    j += 1
            else:
                k += 1
    return counter
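One thing worth checking, as an assumption on my part rather than a diagnosis from the thread: if the outer function only compiles in Numba's object mode, every call into f still goes through the Python call machinery, which can easily dominate a microsecond-scale body. The minimal self-contained sketch below keeps both functions in nopython mode; pair_metric and all_pairs are hypothetical stand-ins for f and testf.
import numpy as np
from numba import njit

@njit
def pair_metric(a, b, rel_tol):
    # stand-in for f(): count element pairs that are "close enough"
    count = 0
    for i in range(a.shape[0]):
        if abs(a[i] - b[i]) < rel_tol * max(abs(a[i]), abs(b[i])):
            count += 1
    return count

@njit
def all_pairs(s):
    n = s.shape[0]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            res = pair_metric(s[i], s[j], 1e-5)  # compiled-to-compiled call
            sim[i, j] = res
            sim[j, i] = res
    return sim

s = np.random.rand(500, 50)
all_pairs(s)  # the first call includes compilation time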

Efficiency difference between two linear searches in Python

I have a question regarding the difference in efficiency when searching a list. Why is there a difference between these two?
test_list= [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50]
The first one -
def linearSearch(A, x):
    if x in A:
        return True
    return False
The second one -
def linearSearch_2(A, x):
    for element in A:
        if element == x:
            return True
    return False
Testing them
%timeit linearSearch(test_list, 3)
438 ns ± 5.86 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit linearSearch_2(test_list, 3)
1.28 µs ± 7.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The difference remains when I use a much larger list. Is there any fundamental difference between these two methods?
Although in theory these should complete in the same time, Python's in operator is implemented at the raw C level, so it completes much faster than an explicit for-loop written in Python.
However, if you were to translate the second snippet into C, it would outperform the first snippet in Python, since C is much lower-level and therefore runs faster.
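One way to see this concretely (my own aside, not part of the original answer) is to disassemble both functions. On recent CPython versions the membership test compiles to a single CONTAINS_OP whose element-by-element scan runs inside the interpreter's C code, while the explicit loop executes several bytecode instructions per element.
import dis

def linearSearch(A, x):
    if x in A:
        return True
    return False

def linearSearch_2(A, x):
    for element in A:
        if element == x:
            return True
    return False

dis.dis(linearSearch)    # the whole scan hides behind one membership opcode
dis.dis(linearSearch_2)  # FOR_ITER / COMPARE_OP / jumps run once per element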
Note:
The first function is pretty much useless, as it is identical to:
def linearSearch(A, x):
    return x in A
which makes it clear that wherever you would call it, you could instead write x in A directly to produce the same result!
Out of interest, I wrote the second snippet in C, but to make timing more exaggerated, made it do the whole thing 1000000 times:
#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t begin = clock();
    for (int s = 0; s < 1000000; s++) {
        int x = 3;
        int a[25] = {2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50};
        for (int i = 0; i < 25; i++) {
            if (a[i] == x) break;
        }
    }
    printf("completed in %f secs\n", (double)(clock() - begin) / CLOCKS_PER_SEC);
    return 0;
}
which outputted:
completed in 0.021514 secs
whereas my modified version of your first snippet in Python:
import time

start = time.time()
for _ in range(1000000):
    x = 3
    l = [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50]
    if x in l:
        continue
print("completed in", time.time() - start, "seconds")
outputted:
completed in 1.1042814254760742 seconds

Python: different for-loop implementations give different speeds, even in contrast with the cache behaviour seen in C

I have been learning about caching recently, and I know that a program can be optimized by improving the probability of cache hits. There is a pair of C loops that take different amounts of time:
int a[100][5000]; // ... initialize ...
for (j = 0; j < 5000; j = j + 1) {
    for (i = 0; i < 100; i = i + 1) {
        a[i][j] = 2 * a[i][j];
    }
}

int a[100][5000]; // ... initialize ...
for (i = 0; i < 100; i = i + 1) {
    for (j = 0; j < 5000; j = j + 1) {
        a[i][j] = 2 * a[i][j];
    }
}
The second one is faster than the first. The principle is that the second version walks along cache lines, while the first has to go to memory more often. But in Python, the code below gives a different result.
arr1 = [[i for i in range(5000)] for j in range(100)]
arr2 = [[i for i in range(5000)] for j in range(100)]

def test1():
    for i in range(5000):
        for j in range(100):
            arr1[j][i] = 2 * arr1[j][i]

def test2():
    for j in range(100):
        for i in range(5000):
            arr2[j][i] = 2 * arr2[j][i]

%timeit -n100 test1()
%timeit -n100 test2()
# 1.16 s ± 67.2 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 1.5 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I wonder whether the cache mechanism differs between these two languages, given that CPython itself is implemented in C?
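As an aside of my own, not from the original post: a Python list of lists stores pointers to boxed int objects, so the element data is scattered over the heap and interpreter overhead dominates, which is why the C cache-line argument does not carry over directly. With a contiguous NumPy array the memory layout becomes visible again. The sketch below uses a square array so both versions perform the same number of Python-level iterations and only the access pattern differs; on most machines the row version is clearly faster.
import numpy as np
import timeit

a = np.random.rand(2000, 2000)   # C order: each row is one contiguous block

def double_rows():
    for i in range(a.shape[0]):
        a[i, :] *= 2.0           # contiguous slice: streams through memory

def double_cols():
    for j in range(a.shape[1]):
        a[:, j] *= 2.0           # strided slice: jumps a whole row per element

print(timeit.timeit(double_rows, number=20))
print(timeit.timeit(double_cols, number=20))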
