I have a question regarding the difference in efficiency when doing list search. Why is there a difference between these two?
test_list= [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50]
The first one -
def linearSearch(A,x):
    if x in A:
        return True
    return False
The second one -
def linearSearch_2(A,x):
    for element in A:
        if element == x:
            return True
    return False
Testing them
%timeit linearSearch(test_list, 3)
438 ns ± 5.86 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit linearSearch_2(test_list, 3)
1.28 µs ± 7.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The difference remains when I use a much larger list. Is there any fundamental difference between these two methods?
Although in theory these should take the same time, Python's in operator is implemented at the raw C level, so it completes much faster than an equivalent for-loop written in Python.
However, if you were to translate the second snippet into C, it would outperform the first snippet in Python, since compiled C code runs much faster than interpreted Python.
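One way to see where the gap comes from is to compare the bytecode of the two functions with the dis module (a rough sketch; exact opcode names vary between Python versions). The membership test compiles to a single 'in' instruction that hands the whole scan to the list's C implementation, while the explicit loop executes several bytecode instructions per element.
import dis

def linearSearch(A, x):
    if x in A:
        return True
    return False

def linearSearch_2(A, x):
    for element in A:
        if element == x:
            return True
    return False

dis.dis(linearSearch)    # the search itself is one CONTAINS_OP / COMPARE_OP 'in' instruction
dis.dis(linearSearch_2)  # FOR_ITER, COMPARE_OP and jump instructions run once per element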
Note:
The first function is pretty much useless as it is identical to:
def linearSearch(A,x):
    return x in A
which makes it clear that wherever you would call it, you could instead just write x in A directly to produce the same result!
Out of interest, I wrote the second snippet in C, but to make timing more exaggerated, made it do the whole thing 1000000 times:
#include <stdio.h>
#include <time.h>
int main(void){
    clock_t begin = clock();
    for (int s = 0; s < 1000000; s++){
        int x = 3;
        int a[25] = {2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50};
        for (int i = 0; i < 25; i++){
            if (a[i] == x) break;   /* compare the element, not the loop index */
        }
    }
    printf("completed in %f secs\n", (double)(clock() - begin) / CLOCKS_PER_SEC);
    return 0;
}
which outputted:
completed in 0.021514 secs
whereas my modified version of your first snippet in Python:
import time
start = time.time()
for _ in range(1000000):
    x = 3
    l = [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50]
    if x in l:
        continue
print("completed in", time.time() - start, "seconds")
outputted:
completed in 1.1042814254760742 seconds
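Note that in the Python loop above, part of the per-iteration cost is rebuilding the 25-element list itself (the C version initializes a stack array, which the compiler can do very cheaply). A variant that hoists the list out of the loop, shown as a sketch below, isolates the cost of the membership test alone:
import time

x = 3
l = [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50]

start = time.time()
for _ in range(1000000):
    x in l  # membership test only; the list is built once, outside the loop
print("completed in", time.time() - start, "seconds")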
Related
I wrote the following two tetration functions in Python:
def recur_tet(b, n):
    if n == 1:
        return(b)
    else:
        return(b ** recur_tet(b, n - 1))
def iter_tet(b, n):
    ans = 1
    for i in range(n):
        ans = b ** ans
    return(ans)
And, surprisingly, the recursive version was slightly faster:
python3> %timeit recur_tet(2,4)
1 µs ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
python3> %timeit iter_tet(2,4)
1.15 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I thought it might have something to do with how Python was interpreting it, so I did a C version:
/* tetration.c */
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
int recur_tet(int b, int n){
    if(n == 1){
        return(b);
    }
    else{
        return(pow(b, recur_tet(b, n - 1)));
    }
}

int iter_tet(int b, int n){
    int ans = 1;
    int i;
    for(i = 1; i <= n; i++){
        ans = pow(b, ans);
    }
    return(ans);
}

int main(int argc, char *argv[]){
    /* giving an argument of "1" will do a recursive tetration
       while an argument of "2" will do an iterative one */
    if(atoi(argv[1]) == 1){
        recur_tet(2,4);
    }
    else if(atoi(argv[1]) == 2){
        iter_tet(2,4);
    }
    return(0);
}
And the recursive version was still faster:
> gcc tetration.c -o tet.o
> time(while ((n++ < 100000)); do ./tet.o 1; done)
real 4m24.226s
user 1m26.503s
sys 1m32.155s
> time(while ((n++ < 100000)); do ./tet.o 2; done)
real 4m40.998s
user 1m30.699s
sys 1m37.110s
So this difference seems real. The assembly for the C program (as returned by gcc -S) represents recur_tet as 42 instructions, while iter_tet is 39 instructions, so it seems like the recursive one should take longer? But I don't really know anything about assembly, so who knows.
Anyway, does anyone have insights about why the recursive version of each function is faster, despite the common wisdom about recursion vs. iteration? Am I writing my iterative version in a silly way with some inefficiency I'm not seeing?
The problem with both the Python and the C comparisons is that the recursive and iterative algorithms are not really equivalent (even though they should produce the same result).
When n is 1, the recursive versions are returning b immediately, with no exponentiation being performed. But the iterative versions are doing exponentiation in that case (b**1 in Python and pow(b, 1) in C). This accounts for the slower speed of the iterative versions.
So in general, the iterative versions make one more exponentiation call than the recursive versions.
To do a fair comparison, either change the recursive versions to do exponentiation when n is 1, or else change the iterative versions to avoid it.
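One way to even things out, as a sketch, is an iterative version that starts from b (mirroring the recursive base case) and therefore performs exactly n - 1 exponentiations:
def iter_tet_fair(b, n):
    # start from b, as recur_tet does when n == 1, so this version
    # performs n - 1 exponentiations, the same as the recursive one
    ans = b
    for _ in range(n - 1):
        ans = b ** ans
    return ans
With that change, both versions do the same amount of work, so the timings should line up much more closely.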
I am trying to accelerate a part of a Python code in which I have the following code:
for i in range(n):
    for j in range(m):
        for (sign,idx) in [(a,b),(c,d),(e,f),(g,h)]:
            array[idx,i] += sign * something
            array[idx,j] += sign * somethingElse
where a,b,c... are relatively complex expressions.
If I manually unroll the inner for loop by writing:
for i in range(n):
    for j in range(m):
        sign,idx = a,b
        array[idx,i] += sign * something
        array[idx,j] += sign * somethingElse
        sign,idx = c,d
        array[idx,i] += sign * something
        array[idx,j] += sign * somethingElse
        sign,idx = e,f
        array[idx,i] += sign * something
        array[idx,j] += sign * somethingElse
        sign,idx = g,h
        array[idx,i] += sign * something
        array[idx,j] += sign * somethingElse
The code runs 4x faster... But copy-pasting seems like a bad idea.
My question: can it be done automatically at compile time?
I guess that this is indeed a problem of typing: in test1(), I explicitly construct an array "values" while in test2() I construct this array each time.
def test1():
    cdef int i
    cdef int value
    cdef int values[4]
    cdef double sum = 0
    values[:] = [1,2,3,4]
    for i in range(1000000):
        for value in values:
            sum += value
    return sum
def test2():
    cdef int i
    cdef int value
    cdef double sum = 0
    for i in range(1000000):
        for value in [1,2,3,4]:
            sum += value
    return sum
The first version is roughly 3 times faster:
%timeit test1()
4.4 ms ± 44.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit test2()
13.3 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
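As a rough illustration of that per-iteration construction cost (timed in pure Python here, so it overstates what Cython's generated C would pay, but the work is of the same kind), building the four-element list a million times on its own already takes a noticeable amount of time:
import timeit

# cost of just constructing [1, 2, 3, 4] one million times,
# which is work test2 repeats on every outer iteration and test1 does only once
print(timeit.timeit("[1, 2, 3, 4]", number=1_000_000))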
I have a problem with Numba typing - I read the manual, but eventually hit a brick wall.
The function in question is part of a bigger project and needs to run fast, so Python lists are out of the question, hence I've decided to try Numba. Sadly, the function fails in nopython=True mode, despite the fact that, as far as I understand, all types are being provided.
The code is as follows:
import numpy as np
from numba import jit, njit, uint8, int64, typeof

@jit(uint8[:,:,:](int64))
def findWhite(cropped):
    h1 = int64(0)
    for i in cropped:
        for j in i:
            if np.sum(j) == 765:
                h1 = h1 + int64(1)
            else:
                pass
    return h1
also, separately:
print(typeof(cropped))
array(uint8, 3d, C)
print(typeof(h1))
int64
In this case 'cropped' is a large uint8 3D C-contiguous array (an RGB TIFF file read with PIL.Image). Could someone please explain to a Numba newbie what I am doing wrong?
Have you considered using Numpy? That's often a good intermediate between Python lists and Numba, something like:
h1 = (cropped.sum(axis=-1) == 765).sum()
or
h1 = (cropped == 255).all(axis=-1).sum()
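As a quick sanity check (with made-up pixel values), both expressions count the same pixels:
import numpy as np

# 2x2 RGB patch: two pure-white pixels, two that are not
cropped = np.array([[[255, 255, 255], [255, 254, 255]],
                    [[0, 0, 0],       [255, 255, 255]]], dtype=np.uint8)

print((cropped.sum(axis=-1) == 765).sum())   # -> 2
print((cropped == 255).all(axis=-1).sum())   # -> 2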
The example code you provide is not valid Numba. Your signature is also incorrect, since the input is a 3D array and the output an integer, it should probably be:
@njit(int64(uint8[:,:,:]))
Looping over the array like you do is not valid code. A close translation of your code would be something like this:
@njit(int64(uint8[:,:,:]))
def findWhite(cropped):
    h1 = int64(0)
    ys, xs, n_bands = cropped.shape
    for i in range(ys):
        for j in range(xs):
            if cropped[i, j, :].sum() == 765:
                h1 += 1
    return h1
But that isn't very fast and doesn't beat Numpy on my machine. With Numba it's fine to explicitly loop over every element in an array; this is already a lot faster:
@njit(int64(uint8[:,:,:]))
def findWhite_numba(cropped):
    h1 = int64(0)
    ys, xs, zs = cropped.shape
    for i in range(ys):
        for j in range(xs):
            incr = 1
            for k in range(zs):
                if cropped[i, j, k] != 255:
                    incr = 0
                    break
            h1 += incr
    return h1
For a 5000x5000x3 array these are the results for me:
Numpy (h1 = (cropped == 255).all(axis=-1).sum()):
427 ms ± 6.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
findWhite:
612 ms ± 6.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
findWhite_numba:
31 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A benefit of the Numpy method is that it generalizes to any number of dimensions.
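For example, the same one-liner works unchanged on a hypothetical 4D stack of images (batch, height, width, bands):
import numpy as np

# hypothetical batch of 10 RGB images, 64x64 pixels each
batch = np.random.randint(0, 256, size=(10, 64, 64, 3), dtype=np.uint8)

# counts pure-white pixels across the whole batch, exactly as in the 3D case
n_white = (batch == 255).all(axis=-1).sum()
print(n_white)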
Numpy's string functions are all very slow and are less performant than pure python lists. I am looking to optimize all the normal string functions using Cython.
For instance, let's take a numpy array of 100,000 unicode strings with data type either unicode or object and lowercase each one.
alist = ['JsDated', 'УКРАЇНА'] * 50000
arr_unicode = np.array(alist)
arr_object = np.array(alist, dtype='object')
%timeit np.char.lower(arr_unicode)
51.6 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using a list comprehension is just as fast
%timeit [a.lower() for a in arr_unicode]
44.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For the object data type, we cannot use np.char. The list comprehension is 3x as fast.
%timeit [a.lower() for a in arr_object]
16.1 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The only way I know how to do this in Cython is to create an empty object array and call the Python string method lower on each iteration.
import numpy as np
cimport numpy as np
from numpy cimport ndarray
def lower(ndarray[object] arr):
    cdef int i
    cdef int n = len(arr)
    cdef ndarray[object] result = np.empty(n, dtype='object')
    for i in range(n):
        result[i] = arr[i].lower()
    return result
This yields a modest improvement
%timeit lower(arr_object)
11.3 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I have tried accessing the memory directly with the data ndarray attribute like this:
def lower_fast(ndarray[object] arr):
    cdef int n = len(arr)
    cdef int i
    cdef char* data = arr.data
    cdef int itemsize = arr.itemsize
    for i in range(n):
        # no idea here
I believe data is one contiguous piece of memory holding all the raw bytes one after the other. Accessing these bytes is extremely fast and it seems converting these raw bytes would increase performance by 2 orders of magnitude. I found a tolower c++ function that might work, but I don't know how to hook it in with Cython.
Update with fastest method (doesn't work for unicode)
Here is the fastest method I found by far, from another SO post. This lowercases all the ascii characters by accessing the numpy memoryview via the data attribute. I think it will mangle other unicode characters that have bytes between 65 and 90 as well. But the speed is very good.
cdef int f(char *a, int itemsize, int shape):
    cdef int i
    cdef int num
    cdef int loc
    for i in range(shape * itemsize):
        num = a[i]
        if 65 <= num <= 90:
            a[i] += 32

def lower_fast(ndarray arr):
    cdef char *inp
    inp = arr.data
    f(inp, arr.itemsize, arr.shape[0])
    return arr
This is 100x faster than the others and what I am looking for.
%timeit lower_fast(arr)
103 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This was only slightly faster than the list comprehension for me on my machine, but if you want unicode support this might be the fastest way of doing it. You'll need to apt-get install libunistring-dev or whatever is appropriate for your OS / package manager.
In some C file, say, _lower.c, have
#include <stdlib.h>
#include <string.h>
#include <unistr.h>
#include <unicase.h>
void _c_tolower(uint8_t **s, uint32_t total_len) {
    size_t lower_len, s_len;
    uint8_t *s_ptr = *s, *lowered;
    while(s_ptr - *s < total_len) {
        s_len = u8_strlen(s_ptr);
        if (s_len == 0) {
            s_ptr += 1;
            continue;
        }
        lowered = u8_tolower(s_ptr, s_len, NULL, NULL, NULL, &lower_len);
        memcpy(s_ptr, lowered, lower_len);
        free(lowered);
        s_ptr += s_len;
    }
}
Then, in lower.pxd you do
cdef extern from "_lower.c":
    cdef void _c_tolower(unsigned char **s, unsigned int total_len)
And finally, in lower.pyx:
from numpy cimport ndarray

cpdef void lower(ndarray arr):
    cdef unsigned char * _arr
    _arr = <unsigned char *> arr.data
    _c_tolower(&_arr, arr.shape[0] * arr.itemsize)
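To compile this, a minimal setup.py might look like the sketch below (assuming the files are named _lower.c, lower.pxd and lower.pyx as above, and that libunistring is installed where the linker can find it); build with python setup.py build_ext --inplace:
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy

ext = Extension(
    "lower",
    sources=["lower.pyx"],             # _lower.c is pulled in textually by the cdef extern block
    include_dirs=[numpy.get_include()],
    libraries=["unistring"],           # link against libunistring for u8_tolower / u8_strlen
)

setup(ext_modules=cythonize([ext]))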
On my laptop, I got 46ms for the list comprehension you had above and 37ms for this method (and 0.8ms for your lower_fast), so it's probably not worth it, but I figured I'd type it out in case you wanted an example of how to hook such a thing into Cython.
There are a few points of improvement that I don't know will make much of a difference:
arr.data is something like a square matrix I guess? (I don't know, I don't use numpy for anything), and pads the ends of the shorter strings with \x00s. I was too lazy to figure out how to get u8_tolower to look past the 0s, so I just manually fast-forward past them (that's what the if (s_len == 0) clause is doing). I suspect that one call to u8_tolower would be significantly faster than doing it thousands of times.
I'm doing a lot of freeing/memcpying. You can probably avoid that if you're clever.
I think it's the case that every lowercase unicode character is at most as wide as its uppercase variant, so this should not run into any segfaults or buffer overwrites or just overlapping substring issues, but don't take my word for that.
Not really an answer, but hope it helps your further investigations!
PS You'll notice that this does the lowering in-place, so the usage would be like this:
>>> alist = ['JsDated', 'УКРАЇНА', '道德經', 'Ну И йЕшШо'] * 2
>>> arr_unicode = np.array(alist)
>>> lower_2(arr_unicode)
>>> for x in arr_unicode:
... print x
...
jsdated
україна
道德經
ну и йешшо
jsdated
україна
道德經
ну и йешшо
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); x = [a.lower() for a in arr_unicode]; time() - ct;
0.046072959899902344
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.037489891052246094
EDIT
DUH, you modify the C function to look like this
void _c_tolower(uint8_t **s, uint32_t total_len) {
size_t lower_len;
uint8_t *lowered;
lowered = u8_tolower(*s, total_len, NULL, NULL, NULL, &lower_len);
memcpy(*s, lowered, lower_len);
free(lowered);
}
and then it does it all in one go. Looks more dangerous in terms of possibly having something from the old data left over if lower_len is shorter than the original string... in short, this code is TOTALLY EXPERIMENTAL AND FOR ILLUSTRATIVE PURPOSES ONLY DO NOT USE THIS IN PRODUCTION IT WILL PROBABLY BREAK.
Anyway, ~40% faster this way:
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.022463043975830078
I learned about caching recently and know that it can speed up a program by improving the probability of hitting the cache. Here are two versions of a piece of C code that take different amounts of time:
a[100][5000] = ... // initialize
for (j = 0; j < 5000; j = j + 1) {
    for (i = 0; i < 100; i = i + 1) {
        a[i][j] = 2 * a[i][j];
    }
}

a[100][5000] = ... // initialize
for (i = 0; i < 100; i = i + 1) {
    for (j = 0; j < 5000; j = j + 1) {
        a[i][j] = 2 * a[i][j];
    }
}
The second (below) one is faster than the first: its inner loop walks through memory sequentially, so consecutive accesses hit the same cache line, while the first one jumps between rows and has to go to memory more often. But in Python I get a different result with the code below.
arr1 = [[i for i in range(5000)] for j in range(100)]   # 100 rows of 5000 elements
arr2 = [[i for i in range(5000)] for j in range(100)]

def test1():
    # column-order traversal: the inner loop moves down a column, row by row
    for i in range(5000):
        for j in range(100):
            arr1[j][i] = 2 * arr1[j][i]

def test2():
    # row-order traversal: the inner loop walks along one row at a time
    for j in range(100):
        for i in range(5000):
            arr2[j][i] = 2 * arr2[j][i]
%timeit -n100 test1()
%timeit -n100 test2()
# 1.16 s ± 67.2 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 1.5 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I wonder whether the cache mechanism differs between these two languages, given that basic Python (CPython) is itself implemented in C?