Why is numpy list access slower than vanilla python? - python

I was under the impression that numpy would be faster for list operations, but the following example seems to indicate otherwise:
import numpy as np
import time
def ver1():
a = [i for i in range(40)]
b = [0 for i in range(40)]
for i in range(1000000):
for j in range(40):
b[j]=a[j]
def ver2():
a = np.array([i for i in range(40)])
b = np.array([0 for i in range(40)])
for i in range(1000000):
for j in range(40):
b[j]=a[j]
t0 = time.time()
ver1()
t1 = time.time()
ver2()
t2 = time.time()
print(t1-t0)
print(t2-t1)
Output is:
4.872278928756714
9.120521068572998
(I'm running 64-bit Python 3.4.3 in Windows 7, on an i7 920)
I do understand that this isn't the fastest way to copy a list, but I'm trying to find out if I'm using numpy incorrectly. Or is it the case that numpy is slower for this kind of operation and is only more efficient in more complex operations?
EDIT:
I also tried the following, which just just does a direct copy via b[:] = a, and numpy is still twice as slow:
import numpy as np
import time
def ver6():
a = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
b = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
for i in range(1000000):
b[:] = a
def ver7():
a = np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
b = np.array([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
for i in range(1000000):
b[:] = a
t0 = time.time()
ver6()
t1 = time.time()
ver7()
t2 = time.time()
print(t1-t0)
print(t2-t1)
Output is:
0.36202096939086914
0.6750380992889404

You're using NumPy wrong. NumPy's efficiency relies on doing as much work as possible in C-level loops instead of interpreted code. When you do
for j in range(40):
b[j]=a[j]
That's an interpreted loop, with all the intrinsic interpreter overhead and more, because NumPy's indexing logic is way more complex than list indexing, and NumPy needs to create a new element wrapper object on every element retrieval. You're not getting any of the benefits of NumPy when you write code like this.
You need to write the code in such a way that the work happens in C:
b[:] = a
This would also improve the efficiency of the list operation, but it's much more important for NumPy.

Most of what you are seeing is Python object creation from C native types.
A Python list is, at it's heart, an array of PyObject pointers. When a and b are both Python lists, doing b[i] = a[i] will imply:
decreasing the reference count of the object pointed by b[i],
increasing the reference count of the object pointed by a[i], and
copying the address stored in a[i] into b[i].
But if a and b are NumPy arrays, things are a little more ellaborate, and the same b[i] = a[i] then requires:
creating a Python integer object from the native C integer type stored at a[i], see this,
converting the Python integer object into a native C integer type, and storing its value in b[i], see here, and
decreasing the reference count of the temporary Python integer object.
So the difference is mostly in creating and disposing of that intermediate Python object, that lists do not need to do.

Related

Speed up the List appending process

I would like to speed up the process of the below code
x = []
y = []
z = []
for i in range(0, 1000000):
if 0 < u[i] < 1920 and 0 < v[i] < 1080:
x.append(u[i])
y.append(v[i])
z.append([x_ind[i], y_ind[i]])
Any ideas would be really appreciated.
Thanks
Typically, you can optimize cases like this by replacing loops over indices and indexing with loops over the raw values. So replacement code here would be:
x = []
y = []
z = []
for a, b, c in zip(u, v, ind):
if 0 < a < 1920 and 0 < b < 1080:
x.append(a)
y.append(b)
z.append([c, c])
If u, v and ind might be longer than 1000000 and you must stop at 1000000 items checked, you'd just add an import to the top of the file, from itertools import islice, and change the for loop itself to:
for a, b, c in islice(zip(u, v, ind), 1000000):
Either way, you remove all indexing from the code (indexing has one of the worst ratios of overhead to useful work accomplished in the CPython reference interpreter, though other interpreters and tools like Cython will behave differently) and, if you use nicer names than a, b and c, more self-documenting code.
There are minor benefits (decreasing in the most recent versions of Python) to pre-binding copies of append instead of dynamic binding, so if you're really hurting for speed, especially on older versions of Python that didn't optimize away the creation of bound methods, you can try:
x = []
y = []
z = []
xapp, yapp, zapp = x.append, y.append, z.append
for a, b, c in zip(u, v, ind):
if 0 < a < 1920 and 0 < b < 1080:
xapp(a)
yapp(b)
zapp([c, c])
(adding islice if needed) to reduce method call overhead a bit at the expense of uglier code. Definitely don't do this unless profiling has shown this is the hot code path and you really need it faster.
Lastly, a note: If this code is being run at top-level (outside of any function) it will run significantly slower (variable lookup for locally scoped names in a function is a C array lookup; looking up globally scoped names, which all lookups outside a function involve, involves at least one dict key lookup, which is substantially more expensive). Put it in a function (along with the definitions of x, y and z; u, v and ind don't matter so much if you're zipping rather than indexing them) and call that function instead of running at global scope and it should run a lot faster.
Improvements beyond this might be possible using numpy arrays instead of lists, but you'd need to be much more specific about your problem to hazard a guess on the utility of such a change.

number of loops matters efficiency (interpreted vs compiled languages?)

Say you have to carry out a computation by using 2 or even 3 loops. Intuitively, one may thing that it's more efficient to do this with a single loop. I tried a simple Python example:
import itertools
import timeit
def case1(n):
c = 0
for i in range(n):
c += 1
return c
def case2(n):
c = 0
for i in range(n):
for j in range(n):
for k in range(n):
c += 1
return c
print(case1(1000))
print(case2(10))
if __name__ == '__main__':
import timeit
print(timeit.timeit("case1(1000)", setup="from __main__ import case1", number=10000))
print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))
This code run:
$ python3 code.py
1000
1000
0.8281264099932741
1.04944919400441
So effectively 1 loop seems to be a bit more efficient. Yet I have a slightly different scenario in my problem, as I need to use the values in an array (in the following example I use the function range for simplification). That is, if I collapse everything to a single loop I would have to create an extended array from the values of another array whose size is between 2 and 10 elements.
import itertools
import timeit
def case1(n):
b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]
c = 0
for i in range(len(b)):
c += b[i]
return c
def case2(n):
c = 0
for i in range(n):
for j in range(n):
for k in range(n):
c += i*j*k
return c
print(case1(10))
print(case2(10))
if __name__ == '__main__':
import timeit
print(timeit.timeit("case1(10)", setup="from __main__ import case1", number=10000))
print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))
In my computer this code run in:
$ python3 code.py
91125
91125
2.435348572995281
1.6435037050105166
So it seems the 3 nested loops are more efficient because I spend sometime creating the array b in case1. so I'm not sure I'm creating this array in the most efficient way, but leaving that aside, does it really pay off collapsing loops to a single one? I'm using Python here, but what about compiled languages like C++? Does the compiler in this case do something to optimize the single loop? Or on the other hand, does the compiler do some optimization when you have multiple nested loops?
This is why the single loop function takes supposedly longer than it should
b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]
Just by changing the whole function to
def case1(n, b):
c = 0
for i in range(len(b)):
c += b[i]
return c
Makes the timeit return :
case1 : 0.965343249744
case2 : 2.28501694207
Your case is simple enough that various optimizations would probably do a lot. Be it numpy for more efficient array's, maybe pypy for a better JIT optimizer, or various other things.
Looking at the bytecode via the dis module can help you understand what happens under the hood and make some micro optimizations, but in general it does not really matter if you do one loop or a nested loop, if your memory access pattern is somewhat predictable for the CPU. If not, it may differ wildly.
Python has some bytecodes that are cheap and others that are more expensive, e.g. function calls are much more expensive than a simple addition. Same with creating new objects and various other things. So the usual optimization is moving the loop to C, which is one of the benefits of itertools sometimes.
Once you are on the C-level it usually comes down to: Avoid syscalls/mallocs() in tight loops, have predictable memory access patterns and make sure your algorithm is cache friendly.
So, your algorithms above will probably vary wildly in performance if you go to large values of N, due to the amount of memory allocation and cache access.
But the fastest way for the specific problem above would be to find a closed form for the function, it seems wasteful to iterate for that, as there must be a much simpler formula to calculate the final value of 'c'. As usual, first get the best algorithm before doing micro optimizations.
e.g. Wolfram Alpha tells you that you could replace two loops with, there is probably a closed form for all three, but Alpha didn't tell me...
def case3(n):
c = 0
for j in range(n):
c += (j* n^2 *(n+1)^2))/4
return c

fast python matrix creation and iteration

I need to create a matrix starting from the values of a weight matrix. Which is the best structure to hold the matrix in term of speed both when creating and iterating over it? I was thinking about a list of lists or a numpy 2D array but they both seem slow to me.
What I need:
numpy array
A = np.zeros((dim, dim))
for r in range(A.shape[0]):
for c in range(A.shape[0]):
if(r==c):
A.itemset(node_degree[r])
else:
A.itemset(arc_weight[r,c])
or
list of lists
l = []
for r in range(dim):
l.append([])
for c in range(dim):
if(i==j):
l[i].append(node_degree[r])
else:
l[i].append(arc_weight[r,c])
where dim can be also 20000 , node_degree is a vector and arc_weight is another matrix. I wrote it in c++, it takes less less than 0.5 seconds while the others two in python more than 20 seconds. I know python is not c++ but I need to be as fast as possible.
Thank you all.
One thing is you shouldn't be appending to the list if you already know it's size.
Preallocate the memory first using list comprehension and generate the r, c values using xrange() instead of range() since you are using Python < 3.x (see here):
l = [[0 for c in xrange(dim)] for r in xrange(dim)]
Better yet, you can build what you need in one shot using:
l = [[node_degree[r] if r == c else arc_weight[r,c]
for c in xrange(dim)] for r in xrange(dim)]
Compared to your original implementation, this should use less memory (because of the xrange() generators), and less time because you remove the need to reallocating memory by specifying the dimensions up front.
Numpy matrices are generally faster as they know their dimensions and entry type.
In your particular situation, since you already have the arc_weight and node_degree matrices created so you can create your matrix directly from arc_weight and then replace the diagonal:
A = np.matrix(arc_matrix)
np.fill_diagonal(A, node_degree)
Another option is to replace the double loop with a function that puts the right element in each position and create a matrix from the function:
def fill_matrix(r, c):
return arc_weight[r,c] if r != c else node_degree[r]
A = np.fromfunction(fill_matrix, (dim, dim))
As a rule of thumb, with numpy you must avoid loops at all costs. First method should be faster but you should profile both to see what works for you. You should also take into account that you seem to be duplicating your data set in memory, so if it is really huge you might get in trouble. Best idea would be to create your matrix directly avoiding arc_weight and node_degree altogether.
Edit: Some simple time comparisons between list comprehension and numpy matrix creation. Since I don't know how your arc_weight and node_degree are defined, I just made up two random functions. It seems that numpy.fromfunction complains a bit if the function has a conditional on it, so I construct the matrix in two steps.
import numpy as np
def arc_weight(a,b):
return a+b
def node_degree(a):
return a*a
def create_as_list(N):
return [[arc_weight(c,r) if c!=r else node_degree(c) for c in xrange(N)] for r in xrange(N)]
def create_as_numpy(N):
A = np.fromfunction(arc_weight, (N,N))
np.fill_diagonal(A, node_degree(np.arange(N)))
return A
And here the timings for N=2000:
time A = create_as_list(2000)
CPU times: user 839 ms, sys: 16.5 ms, total: 856 ms
Wall time: 845 ms
time A = create_as_numpy(2000)
CPU times: user 83.1 ms, sys: 12.9 ms, total: 96 ms
Wall time: 95.3 ms
Make a copy of arc_weight and fill the diagonal with values from node_degree. For a 20000-by-20000 output, it takes about 1.6 seconds on my machine:
>>> import numpy
>>> dim = 20000
>>> arc_weight = numpy.arange(dim**2).reshape([dim, dim])
>>> node_degree = numpy.arange(dim)
>>> import timeit
>>> timeit.timeit('''
... A = arc_weight.copy()
... A.flat[::dim+1] = node_degree
... ''', '''
... from __main__ import dim, arc_weight, node_degree''',
... number=1)
1.6081738501125764
Once you have your array, try not to iterate over it. Compared to broadcasted operators and NumPy built-in functions, Python-level loops are a performance disaster.

Reduce python loop to array calculation

I am trying to fill an array with calculated values from functions defined earlier in my code. I started with a code that has a similar structure to the following:
from numpy import cos, sin, arange, zeros
a = arange(1000)
b = arange(1000)
def defcos(x):
return cos(x)
def defsin(x):
return sin(x)
a_len = len(a)
b_len = len(b)
result = zeros((a_len,b_len))
for i in xrange(b_len):
for j in xrange(a_len):
a_res = defcos(a[j])
b_res = defsin(b[i])
result[i,j] = a_res * b_res
I tried to use array representations of the functions, which ended up in the following change for the loop
a_res = defsin(a)
b_res = defcos(b)
for i in xrange(b_len):
for j in xrange(a_len):
result[i,j] = a_res[i] * b_res[j]
This is already significantly faster, than the first version. But is there a way to avoid the loop entirely? I have encountered those loops a couple of times in the past but never botheres as it was not critical in terms of speed. But this time it is the core component of something, which is looped through a couple of times more. :)
Any help would be appreciated, thanks in advance!
Like so:
from numpy import newaxis
a_res = sin(a)
b_res = cos(b)
result = a_res[:, newaxis] * b_res
To understand how this works, have a look at the rules for array broadcasting. And please don't define useless functions like defsin, just use sin itself! Another minor detail, you get i from range(b_len), but you use it to index a_res! This is a bug if a_len != b_len.

Speeding up python code with cython

I have a function which just basically makes lots of calls to a simple defined hash function and tests to see when it finds a duplicate. I need to do lots of simulations with it so would like it to be as fast as possible. I am attempting to use cython to do this. The cython code is currently called with a normal python list of integers with values in the range 0 to m^2.
import math, random
cdef int a,b,c,d,m,pos,value, cyclelimit, nohashcalls
def h3(int a,int b,int c,int d, int m,int x):
return (a*x**2 + b*x+c) %m
def floyd(inputx):
dupefound, nohashcalls = (0,0)
m = len(inputx)
loops = int(m*math.log(m))
for loopno in xrange(loops):
if (dupefound == 1):
break
a = random.randrange(m)
b = random.randrange(m)
c = random.randrange(m)
d = random.randrange(m)
pos = random.randrange(m)
value = inputx[pos]
listofpos = [0] * m
listofpos[pos] = 1
setofvalues = set([value])
cyclelimit = int(math.sqrt(m))
for j in xrange(cyclelimit):
pos = h3(a,b, c,d, m, inputx[pos])
nohashcalls += 1
if (inputx[pos] in setofvalues):
if (listofpos[pos]==1):
dupefound = 0
else:
dupefound = 1
print "Duplicate found at position", pos, " and value", inputx[pos]
break
listofpos[pos] = 1
setofvalues.add(inputx[pos])
return dupefound, nohashcalls
How can I convert inputx and listofpos to use C type arrays and to access the arrays at C speed? Are there any other speed ups I can use? Can setofvalues be sped up?
So that there is something to compare against, 50 calls to floyd() with m = 5000 currently takes around 30 seconds on my computer.
Update: Example code snippet to show how floyd is called.
m = 5000
inputx = random.sample(xrange(m**2), m)
(dupefound, nohashcalls) = edcython.floyd(inputx)
First of all, it seems that you must type the variables inside the function. A good example of it is here.
Second, cython -a, for "annotate", gives you a really excellent break down of the code generated by the cython compiler and a color-coded indication of how dirty (read: python api heavy) it is. This output is really essential when trying to optimize anything.
Third, the now famous page on working with Numpy explains how to get fast, C-style access to the Numpy array data. Unforunately it's verbose and annoying. We're in luck however, because more recent Cython provides Typed Memory Views, which are both easy to use and awesome. Read that entire page before you try to do anything else.
After ten minutes or so I came up with this:
# cython: infer_types=True
# Use the C math library to avoid Python overhead.
from libc cimport math
# For boundscheck below.
import cython
# We're lazy so we'll let Numpy handle our array memory management.
import numpy as np
# You would normally also import the Numpy pxd to get faster access to the Numpy
# API, but it requires some fancier compilation options so I'll leave it out for
# this demo.
# cimport numpy as np
import random
# This is a small function that doesn't need to be exposed to Python at all. Use
# `cdef` instead of `def` and inline it.
cdef inline int h3(int a,int b,int c,int d, int m,int x):
return (a*x**2 + b*x+c) % m
# If we want to live fast and dangerously, we tell cython not to check our array
# indices for IndexErrors. This means we CAN overrun our array and crash the
# program or screw up our stack. Use with caution. Profiling suggests that we
# aren't gaining anything in this case so I leave it on for safety.
# #cython.boundscheck(False)
# `cpdef` so that calling this function from another Cython (or C) function can
# skip the Python function call overhead, while still allowing us to use it from
# Python.
cpdef floyd(int[:] inputx):
# Type the variables in the scope of the function.
cdef int a,b,c,d, value, cyclelimit
cdef unsigned int dupefound = 0
cdef unsigned int nohashcalls = 0
cdef unsigned int loopno, pos, j
# `m` has type int because inputx is already a Cython memory view and
# `infer-types` is on.
m = inputx.shape[0]
cdef unsigned int loops = int(m*math.log(m))
# Again using the memory view, but letting Numpy allocate an array of zeros.
cdef int[:] listofpos = np.zeros(m, dtype=np.int32)
# Keep this random sampling out of the loop
cdef int[:, :] randoms = np.random.randint(0, m, (loops, 5)).astype(np.int32)
for loopno in range(loops):
if (dupefound == 1):
break
# From our precomputed array
a = randoms[loopno, 0]
b = randoms[loopno, 1]
c = randoms[loopno, 2]
d = randoms[loopno, 3]
pos = randoms[loopno, 4]
value = inputx[pos]
# Unforunately, Memory View does not support "vectorized" operations
# like standard Numpy arrays. Otherwise we'd use listofpos *= 0 here.
for j in range(m):
listofpos[j] = 0
listofpos[pos] = 1
setofvalues = set((value,))
cyclelimit = int(math.sqrt(m))
for j in range(cyclelimit):
pos = h3(a, b, c, d, m, inputx[pos])
nohashcalls += 1
if (inputx[pos] in setofvalues):
if (listofpos[pos]==1):
dupefound = 0
else:
dupefound = 1
print "Duplicate found at position", pos, " and value", inputx[pos]
break
listofpos[pos] = 1
setofvalues.add(inputx[pos])
return dupefound, nohashcalls
There are no tricks here that aren't explained on docs.cython.org, which is where I learned them myself, but helps to see it all come together.
The most important changes to your original code are in the comments, but they all amount to giving Cython hints about how to generate code that doesn't use the Python API.
As an aside: I really don't know why infer_types is not on by default. It lets the compiler
implicitly use C types instead of Python types where possible, meaning less work for you.
If you run cython -a on this, you'll see that the only lines that call into Python are your calls to random.sample, and building or adding to a Python set().
On my machine, your original code runs in 2.1 seconds. My version runs in 0.6 seconds.
The next step is to get random.sample out of that loop, but I'll leave that to you.
I have edited my answer to demonstrate how to precompute the rand samples. This brings the time down to 0.4 seconds.
Do you need to use this particular hashing algorithm? Why not use the built-in hashing algorithm for dicts? For example:
from collections import Counter
cnt = Counter(inputx)
dupes = [k for k, v in cnt.iteritems() if v > 1]

Categories

Resources