I have successfully used Cython for the first time to significantly speed up packing nibbles from one list of integers (bytes) into another (see Faster bit-level data packing), e.g. packing the two sequential bytes 0x0A and 0x0B into 0xAB.
def pack(it):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(it)//2
cdef unsigned int i
return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
While the resulting speed is satisfactory, I am curious whether this can be taken further by making better use of the input and output lists.
cython3 -a pack.cyx generates a very "cythonic" HTML report that I unfortunately am not experienced enough to draw any useful conclusions from.
From a C point of view the loop should "simply" access two unsigned int arrays. Possibly, using a wider data type (16/32 bit) could further speed this up proportionally.
The question is: (how) can Python [binary/immutable] sequence types be typed as unsigned int array for Cython?
Using array as suggested in How to convert python array to cython array? does not seem to make it faster (and the array needs to be created from bytes object beforehand), nor does typing the parameter as list instead of object (same as no type) or using for loop instead of list comprehension:
def packx(list it):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef list r = [0]*n
for i in range(n):
r[i] = (it[i*2]//16)<<4 | it[i*2+1]//16
return r
I think my earlier test just specified an array.array as input, but following the comments I've now just tried
from cpython cimport array
import array
def packa(array.array a):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(a)//2
cdef unsigned int i
cdef unsigned int b[256*64/2]
for i in range(n):
b[i] = (a[i*2]//16)<<4 | a[i*2+1]//16;
cdef array.array c = array.array("B", b)
return c
which compiles but
ima = array.array("B", imd) # unsigned char (1 Byte)
pa = packa(ima)
packed = pa.tolist()
segfaults.
I find the documentation a bit sparse, so any hints on what the problem is here and how to allocate the array for output data are appreciated.
Taking #ead's first approach, plus combining division and shifting (seems to save a microsecond:
#cython: boundscheck=False, wraparound=False
def packa(char[::1] a):
"""Cythonize python nibble packing loop, typed with array"""
cdef unsigned int n = len(a)//2
cdef unsigned int i
# cdef unsigned int b[256*64/2]
cdef array.array res = array.array('B', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = ( a[i*2] & 0xF0 ) | (a[i*2+1] >> 4);
return res
compiles much longer, but runs much faster:
python3 -m timeit -s 'from pack import packa; import array; data = array.array("B", bytes([0]*256*64))' 'packa(data)'
1000 loops, best of 3: 236 usec per loop
Amazing! But, with the additional bytes-to-array and array-to-list conversion
ima = array.array("B", imd) # unsigned char (1 Byte)
pa = packa(ima)
packed = pa.tolist() # bytes would probably also do
it now only takes about 1.7 ms - very cool!
Down to 150 us timed or approx. 0.4 ms actual:
from cython cimport boundscheck, wraparound
from cpython cimport array
import array
#boundscheck(False)
#wraparound(False)
def pack(const unsigned char[::1] di):
cdef:
unsigned int i, n = len(di)
unsigned char h, l, r
array.array do = array.array('B')
array.resize(do, n>>1)
for i in range(0, n, 2):
h = di[i] & 0xF0
l = di[i+1] >> 4
r = h | l
do.data.as_uchars[i>>1] = r
return do
I'm not converting the result array to a list anymore, this is done automatically by py-spidev when writing, and the total time is about the same: 10 ms (# 10 MHz).
If you wanna to be as fast as C you should not use list with python-integers inside but an array.array. It is possible to get a speed-up of around 140 for your python+list code by using cython+array.array.
Here are some ideas how to make your code faster with cython. As benchmark I choose a list with 1000 elements (big enough and cache-misses have no effects yet):
import random
l=[random.randint(0,15) for _ in range(1000)]
As baseline, your python-implementation with list:
def packx(it):
n = len(it)//2
r = [0]*n
for i in range(n):
r[i] = (it[i*2]%16)<<4 | it[i*2+1]%16
return r
%timeit packx(l)
143 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
By the way, I use % instead of //, which is probably what you want, otherwise you would get only 0s as result (only lower bits have data in your description).
After cythonizing the same function (with %%cython-magic) we get a speed-up of around 2:
%timeit packx(l)
77.6 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Let's look at the html produced by option -a, we see the following for the line corresponding to the for-loop:
.....
__pyx_t_2 = PyNumber_Multiply(__pyx_v_i, __pyx_int_2); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 6, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_2);
__pyx_t_5 = PyObject_GetItem(__pyx_v_it, __pyx_t_2); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 6, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_5);
__Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
__pyx_t_2 = __Pyx_PyInt_RemainderObjC(__pyx_t_5, __pyx_int_16, 16, 0); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 6, __pyx_L1_error)
...
Py_NumberMultiply means that we use slow python-multiplication, Pyx_DECREF- all temporaries are slow python-objects. We need to change that!
Let's pass not a list but an array.array of bytes to our function and return an array.array of bytes back. Lists have full fledged python objects inside, array.arraythe lowly raw c-data which is faster:
%%cython
from cpython cimport array
def cy_apackx(char[::1] it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef array.array res = array.array('b', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16
return res
import array
a=array.array('B', l)
%timeit cy_apackx(a)
19.2 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Better, but let take a look at the generated html, there is still some slow python-code:
__pyx_t_2 = __Pyx_PyInt_From_long(((__Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))), 16) << 4) | __Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))), 16))); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 9, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_2);
if (unlikely(__Pyx_SetItemInt(((PyObject *)__pyx_v_res), __pyx_v_i, __pyx_t_2, unsigned int, 0, __Pyx_PyInt_From_unsigned_int, 0, 0, 1) < 0)) __PYX_ERR(0, 9, __pyx_L1_error)
__Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
We still use a python-setter for array (__Pax_SetItemInt) and for this a python objecct __pyx_t_2 is needed, to avoid this we use array.data.as_chars:
%%cython
from cpython cimport array
def cy_apackx(char[::1] it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef array.array res = array.array('B', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16 ##HERE!
return res
%timeit cy_apackx(a)
1.86 µs ± 30.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Much better, but let's take a look at html again, and we see some calls to __Pyx_RaiseBufferIndexError - this safety costs some time, so let's switch it off:
%%cython
from cpython cimport array
import cython
#cython.boundscheck(False) # switch of safety-checks
#cython.wraparound(False) # switch of safety-checks
def cy_apackx(char[::1] it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef array.array res = array.array('B', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = (it[i*2]%16)<<4 | it[i*2+1]%16 ##HERE!
return res
%timeit cy_apackx(a)
1.53 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
When we look at the generated html, we see:
__pyx_t_7 = (__pyx_v_i * 2);
__pyx_t_8 = ((__pyx_v_i * 2) + 1);
(__pyx_v_res->data.as_chars[__pyx_v_i]) = ((__Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))), 16) << 4) | __Pyx_mod_long((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))), 16));
No python-stuff! Good so far. However, I'm not sure about __Pyx_mod_long, its definition is:
static CYTHON_INLINE long __Pyx_mod_long(long a, long b) {
long r = a % b;
r += ((r != 0) & ((r ^ b) < 0)) * b;
return r;
}
So C and Python have differences for mod of negative numbers and it must be taken into account. This function-definition, albeit inlined, will prevent the C-compiler from optimizing a%16 as a&15. We have only positive numbers, so no need to care about them, thus we need to do the a&15-trick by ourselves:
%%cython
from cpython cimport array
import cython
#cython.boundscheck(False)
#cython.wraparound(False)
def cy_apackx(char[::1] it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef array.array res = array.array('B', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = (it[i*2]&15)<<4 | (it[i*2+1]&15)
return res
%timeit cy_apackx(a)
1.02 µs ± 8.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I'm also satified with the resulting C-code/html (only one line):
(__pyx_v_res->data.as_chars[__pyx_v_i]) = ((((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_7)) ))) & 15) << 4) | ((*((char *) ( /* dim=0 */ ((char *) (((char *) __pyx_v_it.data) + __pyx_t_8)) ))) & 15));
Conclusion: In the sum that means speed up of 140 (140 µs vs 1.02 µs)- not bad! Another interesting point: the calculation itself takes about 2 µs (and that comprises less than optimal bound checking and division) - 138 µs are for creating, registering and deleting temporary python objects.
If you need the upper bits and can assume that lower bits are without dirt (otherwise &250 can help), you can use:
from cpython cimport array
import cython
#cython.boundscheck(False)
#cython.wraparound(False)
def cy_apackx(char[::1] it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef array.array res = array.array('B', [])
array.resize(res, n)
for i in range(n):
res.data.as_chars[i] = it[i*2] | (it[i*2+1]>>4)
return res
%timeit cy_apackx(a)
819 ns ± 8.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Another interesting question is which costs have the operations if list is used. If we start with the "improved" version:
%%cython
def cy_packx(it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
res=[0]*n
for i in range(n):
res[i] = it[i*2] | (it[i*2+1]>>4))
return res
%timeit cy_packx(l)
20.7 µs ± 450 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
we see, that reducing the number of integer operation leads to a big speed-up. That is due to the fact, that python-integers are immutable and every operation creates a new temporary object, which is costly. Eliminating operations means also eliminating costly temporaries.
However, it[i*2] | (it[i*2+1]>>4) is done with python-integer, as next step we make it cdef-operations:
%%cython
def cy_packx(it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef unsigned char a,b
res=[0]*n
for i in range(n):
a=it[i*2]
b=it[i*2+1] # ensures next operations are fast
res[i]= a | (b>>4)
return res
%timeit cy_packx(l)
7.3 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I don't know how it can be improved further, thus we have 7.3 µs for list vs. 1 µs for array.array.
Last question, what is the costs break down of the list version? I order to avoid being optimized away by the C-compiler, we use a slightly different baseline function:
%%cython
def cy_packx(it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef unsigned char a,b
cdef unsigned char s = 0
res=[0]*n
for i in range(n):
a=it[i*2]
b=it[i*2+1] # ensures next operations are fast
s+=a | (b>>4)
res[i]= s
return res
%timeit cy_packx(l)
In [79]: %timeit cy_packx(l)
7.67 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The usage of the s variable means, it does not get optimized away in the second version:
%%cython
def cy_packx(it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef unsigned char a,b
cdef unsigned char s = 0
res=[0]*n
for i in range(n):
a=it[i*2]
b=it[i*2+1] # ensures next operations are fast
s+=a | (b>>4)
res[0]=s
return res
In [81]: %timeit cy_packx(l)
5.46 µs ± 72.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
About 2 µs or about 30% are the costs for creating new integer objects. What are the costs of the memory allocation?
%%cython
def cy_packx(it):
cdef unsigned int n = len(it)//2
cdef unsigned int i
cdef unsigned char a,b
cdef unsigned char s = 0
for i in range(n):
a=it[i*2]
b=it[i*2+1] # ensures next operations are fast
s+=a | (b>>4)
return s
In [84]: %timeit cy_packx(l)
3.84 µs ± 43.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
That leads to the following performance break down of the list-version:
Time(in µs) Percentage(in %)
all 7.7 100
calculation 1 12
alloc memory 1.6 21
create ints 2.2 29
access data/cast 2.6 38
I must confess, I expected create ints to play a bigger role and didn't thing accessing the data in the list and casting it to chars will cost that much.
Related
I wrote the following two tetration functions in Python:
def recur_tet(b, n):
if n == 1:
return(b)
else:
return(b ** recur_tet(b, n - 1))
def iter_tet(b, n):
ans = 1
for i in range(n):
ans = b ** ans
return(ans)
And, surprisingly, the recursive version was slightly faster:
python3> %timeit recur_tet(2,4)
1 µs ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
python3> %timeit iter_tet(2,4)
1.15 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
I thought it might have something to do with how Python was interpreting it, so I did a C version:
/* tetration.c */
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
int recur_tet(int b, int n){
if(n == 1){
return(b);
}
else{
return(pow(b, recur_tet(b, n - 1)));
}
}
int iter_tet(int b, int n){
int ans = 1;
int i;
for(i = 1; i <= n; i++){
ans = pow(b, ans);
}
return(ans);
}
int main(int argc, char *argv[]){
/* giving an argument of "1" will do a recursive tetration
while an argument of "2" will do an iterative one */
if(atoi(argv[1]) == 1){
recur_tet(2,4);
}
else if(atoi(argv[1]) == 2){
iter_tet(2,4);
}
return(0);
}
And the recursive version was still faster:
> gcc tetration.c -o tet.o
> time(while ((n++ < 100000)); do ./tet.o 1; done)
real 4m24.226s
user 1m26.503s
sys 1m32.155s
> time(while ((n++ < 100000)); do ./tet.o 2; done)
real 4m40.998s
user 1m30.699s
sys 1m37.110s
So this difference seems real. The assembled C program (as returned by gcc -S) represents recur_tet as 42 instructions, while iter_tet is 39 instructions, so it seems like the recursive one should be longer? but I don't really know anything about assembly so who knows.
Anyway, does anyone have insights about why the recursive version of each function is faster, despite the common wisdom about recursion vs. iteration? Am I writing my iterative version in a silly way with some inefficiency I'm not seeing?
The problem with both the Python and the C comparisons is that the recursive and iterative algorithms are not really equivalent (even though they should produce the same result).
When n is 1, the recursive versions are returning b immediately, with no exponentiation being performed. But the iterative versions are doing exponentiation in that case (b**1 in Python and pow(b, 1) in C). This accounts for the slower speed of the iterative versions.
So in general, the iterative versions are making one additional exponentiation call than the recursive versions.
To do a fair comparison, either change the recursive versions to do exponentiation when n is 1, or else change the iterative versions to avoid it.
I have the following code to generate all possible combinations in specified range using itertools but I cant get any speed improvements from using the code with cython. Original code is this:
from itertools import *
def x(e,f,g):
a=[]
for c in combinations(range(e, f),g):
d = list((c))
a.append(d)
and after declaring types for cython:
from itertools import *
cpdef x(int e,int f,int g):
cpdef tuple c
cpdef list a
cpdef list d
a=[]
for c in combinations(range(e, f),g):
d = list((c))
a.append(d)
I saved the latter as test_cy.pyx and compiled using cythonize -a -i test_cy.pyx
After compiling, I created a new script with the following code and ran it:
import test_cy
test_cy.x(1,45,6)
I didnt get any significant speed improvement, still took about the same time as the original script, about 10.8 sec.
Is there anything I did wrong or is itertools already so optimised that there cant be any bigger improvements to its speed?
As already pointed out in the comments, you should not expect cython to speed-up your code because the most of the time the algorithm spends in itertools and creation of lists.
Because I'm curios to see how itertools's generic implementation fares against old-school-tricks, let's take a look at this Cython implementation of "all subsets k out of n":
%%cython
ctypedef unsigned long long ull
cdef ull next_subset(ull subset):
cdef ull smallest, ripple, ones
smallest = subset& -subset
ripple = subset + smallest
ones = subset ^ ripple
ones = (ones >> 2)//smallest
subset= ripple | ones
return subset
cdef subset2list(ull subset, int offset, int cnt):
cdef list lst=[0]*cnt #pre-allocate
cdef int current=0;
cdef int index=0
while subset>0:
if((subset&1)!=0):
lst[index]=offset+current
index+=1
subset>>=1
current+=1
return lst
def all_k_subsets(int start, int end, int k):
cdef int n=end-start
cdef ull MAX=1L<<n;
cdef ull subset=(1L<<k)-1L;
lst=[]
while(MAX>subset):
lst.append(subset2list(subset, start, k))
subset=next_subset(subset)
return lst
This implementation uses some well-known bit-tricks and has the limitation, that it only works for at most 64 elements.
If we compare both approaches:
>>> %timeit x(1,45,6)
2.52 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit all_k_subsets(1,45,6)
1.29 s ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The speed-up of factor 2 is quite disappointing.
However, the bottle-neck is the creation of the lists and not the calculation itself - it is easy to check, that without list creation the calculation would take about 0.1 seconds.
My take away from it: if you are serious about speed you should not create so many lists but proceed the subset on the fly (best in cython) - a speed-up of more than 10 is possible. If it is a must to have all subsets as lists, so you cannot expect a huge speed-up.
I have a question regarding the difference in efficiency when doing list search. Why is there a difference between these two?
test_list= [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50]
The first one -
def linearSearch(A,x):
if x in A:
return True
return False
The second one -
def linearSearch_2(A,x):
for element in A:
if element == x:
return True
return False
Testing them
%timeit linearSearch(test_list, 3)
438 ns ± 5.86 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit linearSearch_2(test_list, 3)
1.28 µs ± 7.05 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The difference remains when I use a much larger list. Is there any fundamental difference between these two methods?
Although in theory, these should complete in the same time, Python's in operator is written to work at raw C level so completes much faster than writing your own for-loop in Python.
However, if you were to translate the second snippet into C, then it would out-perform the first snippet in Python as C is much more low-level so runs faster.
Note:
The first function is pretty much useless as it is identical to:
def linearSearch(A,x):
return x in A
which is clear now that whenever you would call it, you could instead just write directly: x in A to produce the same result!
Out of interest, I wrote the second snippet in C, but to make timing more exaggerated, made it do the whole thing 1000000 times:
#include <stdio.h>
#include <time.h>
void main(){
clock_t begin = clock();
for (int s = 0; s < 1000000; s++){
int x = 3;
int a[25] = {2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50};
for (int i = 0; i < 25; i++){
if (i == x) break;
}
}
printf("completed in %f secs\n", (double)(clock() - begin) / CLOCKS_PER_SEC);
}
which outputted:
completed in 0.021514 secs
whereas my modified version of your first snippet in Python:
import time
start = time.time()
for _ in range(1000000):
x = 3
l = [2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50]
if x in l:
continue;
print("completed in", time.time() - start, "seconds")
outputted:
completed in 1.1042814254760742 seconds
Numpy's string functions are all very slow and are less performant than pure python lists. I am looking to optimize all the normal string functions using Cython.
For instance, let's take a numpy array of 100,000 unicode strings with data type either unicode or object and lowecase each one.
alist = ['JsDated', 'УКРАЇНА'] * 50000
arr_unicode = np.array(alist)
arr_object = np.array(alist, dtype='object')
%timeit np.char.lower(arr_unicode)
51.6 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using a list comprehension is just as fast
%timeit [a.lower() for a in arr_unicode]
44.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For the object data type, we cannot use np.char. The list comprehension is 3x as fast.
%timeit [a.lower() for a in arr_object]
16.1 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The only way I know how to do this in Cython is to create an empty object array and call the Python string method lower on each iteration.
import numpy as np
cimport numpy as np
from numpy cimport ndarray
def lower(ndarray[object] arr):
cdef int i
cdef int n = len(arr)
cdef ndarray[object] result = np.empty(n, dtype='object')
for i in range(n):
result[i] = arr[i].lower()
return result
This yields a modest improvement
%timeit lower(arr_object)
11.3 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I have tried accessing the memory directly with the data ndarray attribute like this:
def lower_fast(ndarray[object] arr):
cdef int n = len(arr)
cdef int i
cdef char* data = arr.data
cdef int itemsize = arr.itemsize
for i in range(n):
# no idea here
I believe data is one contiguous piece of memory holding all the raw bytes one after the other. Accessing these bytes is extremely fast and it seems converting these raw bytes would increase performance by 2 orders of magnitude. I found a tolower c++ function that might work, but I don't know how to hook it in with Cython.
Update with fastest method (doesn't work for unicode)
Here is the fastest method I found by far, from another SO post. This lowercases all the ascii characters by accessing the numpy memoryview via the data attribute. I think it will mangle other unicode characters that have bytes between 65 and 90 as well. But the speed is very good.
cdef int f(char *a, int itemsize, int shape):
cdef int i
cdef int num
cdef int loc
for i in range(shape * itemsize):
num = a[i]
print(num)
if 65 <= num <= 90:
a[i] +=32
def lower_fast(ndarray arr):
cdef char *inp
inp = arr.data
f(inp, arr.itemsize, arr.shape[0])
return arr
This is 100x faster than the others and what I am looking for.
%timeit lower_fast(arr)
103 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This was only slightly faster than the list comprehension for me on my machine, but if you want unicode support this might be the fastest way of doing it. You'll need to apt-get install libunistring-dev or whatever is appropriate for your OS / package manager.
In some C file, say, _lower.c, have
#include <stdlib.h>
#include <string.h>
#include <unistr.h>
#include <unicase.h>
void _c_tolower(uint8_t **s, uint32_t total_len) {
size_t lower_len, s_len;
uint8_t *s_ptr = *s, *lowered;
while(s_ptr - *s < total_len) {
s_len = u8_strlen(s_ptr);
if (s_len == 0) {
s_ptr += 1;
continue;
}
lowered = u8_tolower(s_ptr, s_len, NULL, NULL, NULL, &lower_len);
memcpy(s_ptr, lowered, lower_len);
free(lowered);
s_ptr += s_len;
}
}
Then, in lower.pxd you do
cdef extern from "_lower.c":
cdef void _c_tolower(unsigned char **s, unsigned int total_len)
And finally, in lower.pyx:
cpdef void lower(ndarray arr):
cdef unsigned char * _arr
_arr = <unsigned char *> arr.data
_c_tolower(&_arr, arr.shape[0] * arr.itemsize)
On my laptop, I got 46ms for the list comprehension you had above and 37ms for this method (and 0.8ms for your lower_fast), so it's probably not worth it, but I figured I'd type it out in case you wanted an example of how to hook such a thing into Cython.
There are a few points of improvement that I don't know will make much of a difference:
arr.data is something like a square matrix I guess? (I don't know, I don't use numpy for anything), and pads the ends of the shorter strings with \x00s. I was too lazy to figure out how to get u8_tolower to look past the 0s, so I just manually fast-forward past them (that's what the if (s_len == 0) clause is doing). I suspect that one call to u8_tolower would be significantly faster than doing it thousands of times.
I'm doing a lot of freeing/memcpying. You can probably avoid that if you're clever.
I think it's the case that every lowercase unicode character is at most as wide as its uppercase variant, so this should not run into any segfaults or buffer overwrites or just overlapping substring issues, but don't take my word for that.
Not really an answer, but hope it helps your further investigations!
PS You'll notice that this does the lowering in-place, so the usage would be like this:
>>> alist = ['JsDated', 'УКРАЇНА', '道德經', 'Ну И йЕшШо'] * 2
>>> arr_unicode = np.array(alist)
>>> lower_2(arr_unicode)
>>> for x in arr_unicode:
... print x
...
jsdated
україна
道德經
ну и йешшо
jsdated
україна
道德經
ну и йешшо
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); x = [a.lower() for a in arr_unicode]; time() - ct;
0.046072959899902344
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.037489891052246094
EDIT
DUH, you modify the C function to look like this
void _c_tolower(uint8_t **s, uint32_t total_len) {
size_t lower_len;
uint8_t *lowered;
lowered = u8_tolower(*s, total_len, NULL, NULL, NULL, &lower_len);
memcpy(*s, lowered, lower_len);
free(lowered);
}
and then it does it all in one go. Looks more dangerous in terms of possibly having something from the old data left over of lower_len is shorter than the original string... in short, this code is TOTALLY EXPERIMENTAL AND FOR ILLUSTRATIVE PURPOSES ONLY DO NOT USE THIS IN PRODUCTION IT WILL PROBABLY BREAK.
Anyway, ~40% faster this way:
>>> alist = ['JsDated', 'УКРАЇНА'] * 50000
>>> arr_unicode = np.array(alist)
>>> ct = time(); lower_2(arr_unicode); time() - ct
0.022463043975830078
I have been trying to work with Cython and I encountered the following peculiar scenario where a sum function over an array takes 3 times the amount of time that the average of an array takes.
Here are my three functions
cpdef FLOAT_t cython_sum(cnp.ndarray[FLOAT_t, ndim=1] A):
cdef double [:] x = A
cdef double sum = 0
cdef unsigned int N = A.shape[0]
for i in xrange(N):
sum += x[i]
return sum
cpdef FLOAT_t cython_avg(cnp.ndarray[FLOAT_t, ndim=1] A):
cdef double [:] x = A
cdef double sum = 0
cdef unsigned int N = A.shape[0]
for i in xrange(N):
sum += x[i]
return sum/N
cpdef FLOAT_t cython_silly_avg(cnp.ndarray[FLOAT_t, ndim=1] A):
cdef unsigned int N = A.shape[0]
return cython_avg(A)*N
Here are the run times in ipython
In [7]: A = np.random.random(1000000)
In [8]: %timeit np.sum(A)
1000 loops, best of 3: 906 us per loop
In [9]: %timeit np.mean(A)
1000 loops, best of 3: 919 us per loop
In [10]: %timeit cython_avg(A)
1000 loops, best of 3: 896 us per loop
In [11]: %timeit cython_sum(A)
100 loops, best of 3: 2.72 ms per loop
In [12]: %timeit cython_silly_avg(A)
1000 loops, best of 3: 862 us per loop
I am unable to account for the memory jump in simple cython_sum. Is it because of some memory allocation? Since these are random nos from 0 to 1. The sum is around 500K.
Since line_profiler doesn't work with cython, I was unable to profile my code.
It seems like the results from #nbren12 are the definite answer: these results cannot be reproduced.
The evidence (and logic) point out that both methods have the same runtime.