How should I improve Python/Cython performance? Parallelization/memoryviews/numpy?

My task: take 3 lists of ints, each with some multiplier, and see if the elements can be rearranged to make two lists (with larger multipliers).
I have code that does this - looped over my whole data set, it takes about 15 seconds: (EDIT: fixed errors)
%%cython
cdef bint my_check(
    list pattern1,
    list pattern2,
    list pattern3,
    int amount1,
    int amount2,
    int amount3
):
    cdef dict all_items = dict()
    cdef int i, total_amount = amount1 + amount2 + amount3, m1, m2
    cdef bint bad_split = False
    # Pool the items together.
    for i in range(len(pattern1)):
        all_items[pattern1[i]] = all_items.get(pattern1[i], 0) + amount1
    for i in range(len(pattern2)):
        all_items[pattern2[i]] = all_items.get(pattern2[i], 0) + amount2
    for i in range(len(pattern3)):
        all_items[pattern3[i]] = all_items.get(pattern3[i], 0) + amount3
    # Iterate through possible split points:
    for m1 in range(total_amount//2, total_amount):
        m2 = total_amount - m1
        bad_split = False  # Reset for this split point.
        # Split items into those with quantities divisible at this point and those without.
        divisible = {i: all_items[i] for i in all_items if all_items[i] % m1 == 0}
        not_divisible = {i: all_items[i] for i in all_items if all_items[i] % m1 != 0}
        # Check that all of the element amounts that are not divisible by m1 are divisible by m2.
        for i in not_divisible:
            if not_divisible[i] % m2 != 0:
                bad_split = True
                break
        # If there is an element that doesn't divide by either, try the next split value.
        if bad_split:
            continue
        items1 = {i: divisible[i]//m1 for i in divisible}
        items2 = {i: not_divisible[i]//m2 for i in not_divisible}
        if <some other stuff here>:
            return True
    # Tried all of the split points.
    return False
Then if this returns True, I run another function to do the combination. On my data set, the my_check() function is being called > 150,000 times (and taking the bulk of the time) and the other function < 500 times, so I'm not too concerned with optimizing that one.
I'd like to parallelize this to improve the performance, but here's what I've found so far:
- My first thought was to use numpy functions to take advantage of vectorization: convert all_items to a numpy array, use np.mod() and np.logical_not() to split the items, and use other numpy functions in the last if clause. That blows the time up by 3-4x compared to using the dict comprehensions.
- If I switch the m1 range to a Cython prange, the compiler complains about using Python objects without the GIL. I switched the dicts to cdef'd numpy arrays, but that was even slower. I tried using memoryviews, but they don't seem to be easily manipulated: I read in another question here that slices can't be assigned to variables, so I don't know how I'd work with them. It also won't let me cdef new variables inside the for loop.
Since I'm running at different values of m1, and terminating as soon as any of them return True, it should be parallelizable without worrying about race conditions.
What should my approach be here? Numpy? Cython? Something else?
I'm happy to post more detailed errors from any of my attempts, but figured that posting them all would get overwhelming. I haven't been able to get profiling or line profiling working for this - I've added the relevant # cython: statements to the top of the Jupyter notebook cell, but it doesn't find anything when I run it.
EDIT:
Per @DavidW's answer I've replaced the middle chunk of code with the following, which cuts the time in half:
items1 = dict()
items2 = dict()
bad_split = False
for k, v in all_items.items():
    if v % m1 == 0:
        items1[k] = v//m1
    elif v % m2 == 0:
        items2[k] = v//m2
    else:
        bad_split = True
        break
I'd still like to find some way of taking advantage of my multi-core processor if that's possible.

There are definitely some improvements you can make to the loops that don't change the fundamental approach but may be faster. I haven't timed these, so it's worth doing that rather than taking my word for it.
for i in range(len(pattern1)):
    all_items[pattern1[i]] = all_items.get(pattern1[i], 0) + amount1
It's generally more idiomatic to iterate by item rather than over a range, and it avoids two lookups (sometimes that isn't true in Cython, for example when iterating over numpy arrays, but for a list it's probably true):
for pattern1_i in pattern1:
    all_items[pattern1_i] = all_items.get(pattern1_i, 0) + amount1
More significantly you have two loops:
divisible = {i: all_items[i] for i in all_items if all_items[i] % m1 == 0}
not_divisible = {i: all_items[i] for i in all_items if all_items[i] % m1 != 0}
You're wasting a lot of time doing dict-lookups when you could iterate directly over both keys and values. For example
divisible = {k: v for k, v in all_items.items() if v % m1 == 0}
But you're also looping over the dictionary twice and performing the same test twice.
divisible = {}
not_divisible = {}
for k, v in all_items.items():
    if v % m1 == 0:
        divisible[k] = v
    else:
        not_divisible[k] = v
It might well be possible to translate your algorithm to something involving Numpy arrays, but it's a fairly significant change and beyond my interest here.
Addendum: I'm increasingly reluctant to recommend that people use C++ classes in Cython these days, mainly because a) it can often lead to quite awkward code, b) people tend to use it in a cargo-cult way because "it's C++ so it must be faster than Python objects", and c) people tend to forget about the cost of converting their objects to/from C++ at the start and end of every function.
However, in this case it might actually be a good choice, since your dict objects are uniformly typed and entirely contained within a single function. The key substitution is dict -> unordered_map.
What you want to do (rough outline) is
from libcpp.unordered_map cimport unordered_map
Then type all_items, items1 and items2 as cdef unordered_map[int, int] (I think...). You do this typing outside the loop. The rest of your code then remains largely the same (you may need to find a substitute for dict.get...).
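Very roughly (and untested), the typed serial version might look something like the sketch below. It mirrors the names from the question, assumes the cell is compiled in C++ mode (e.g. %%cython --cplus), and relies on unordered_map's operator[] default-initialising missing int values to 0 as the stand-in for dict.get; the <some other stuff here> check is still elided:
%%cython --cplus
from libcpp.unordered_map cimport unordered_map
from libcpp.pair cimport pair

cdef bint my_check(list pattern1, list pattern2, list pattern3,
                   int amount1, int amount2, int amount3):
    cdef unordered_map[int, int] all_items, items1, items2
    cdef pair[int, int] kv
    cdef int key, total_amount = amount1 + amount2 + amount3, m1, m2
    cdef bint bad_split

    # Pool the items together; operator[] creates missing entries as 0,
    # which replaces all_items.get(key, 0).
    for key in pattern1:
        all_items[key] = all_items[key] + amount1
    for key in pattern2:
        all_items[key] = all_items[key] + amount2
    for key in pattern3:
        all_items[key] = all_items[key] + amount3

    for m1 in range(total_amount//2, total_amount):
        m2 = total_amount - m1
        items1.clear()
        items2.clear()
        bad_split = False
        for kv in all_items:
            if kv.second % m1 == 0:
                items1[kv.first] = kv.second // m1
            elif kv.second % m2 == 0:
                items2[kv.first] = kv.second // m2
            else:
                bad_split = True
                break
        if bad_split:
            continue
        # if <some other stuff here>:
        #     return True
    return False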
Once you've got it working as a serial calculation, you should be able to
turn your for m1 in range(total_amount//2, total_amount): loop into a prange loop, and assuming everything is correctly typed this should run in parallel. Obviously the elided if <some other stuff here>: block is a big unknown.
You must treat all_items as strictly read-only during the loop to avoid race-conditions. However, items1 and items2 should be correctly identified as loop-local variables by Cython I hope.
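As a rough, untested sketch of the prange pattern only: you can't break out of a prange loop, so one option is to fold the per-m1 results into an integer reduction and test it after the loop. check_one_split and any_split are hypothetical names, the body of check_one_split (the typed items1/items2 work from the sketch above) is elided, and you need to compile with OpenMP enabled to get actual parallelism.
from cython.parallel import prange
from libcpp.unordered_map cimport unordered_map

cdef bint check_one_split(int m1, int total_amount,
                          unordered_map[int, int]& all_items) nogil:
    # Hypothetical helper: would hold the typed body for a single split point,
    # reading all_items but never writing to it.
    return False

cdef bint any_split(int total_amount, unordered_map[int, int]& all_items):
    cdef int m1, n_found = 0
    for m1 in prange(total_amount//2, total_amount, nogil=True):
        if check_one_split(m1, total_amount, all_items):
            n_found += 1   # += makes n_found a reduction across threads
    return n_found > 0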
Here's a fairly similar answer to use as a starting point. For future readers: please think twice about whether you really need to convert all your Python objects to C++ ones; you probably don't

Related

Is it more performant to access function from variable?

I was reading the source of Python's statistics module and saw a strange variable, partials_get = partials.get, which was then used inside the for loop as partials[d] = partials_get(d, 0) + n.
def _sum(data, start=0):
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get  # STRANGE VARIABLE
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n, d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n  # AND ITS USAGE
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
So my question: Why not just write partials[d] = partials.get(d, 0) + n? Is it slower than storing and calling function from variable?
partials.get has to search for the get attribute, starting with the object's dictionary and then going to the dictionary of the class and its parent classes. This will be done each time through the loop.
Assigning it to a variable does this lookup once, rather than repeating it.
This is a microoptimization that's typically only significant if the loop has many repetitions. The statistics library often processes large data sets, so it's reasonable here. It's rarely needed in ordinary application code.
Short answer: yes.
Python is an interpreted language, and while dictionary/attribute access is blazingly fast and very optimized, it still incurs a hit.
Since they are running this in a tight loop, they are taking the slight performance advantage of removing the "dot" from accessing partials.get.
There are other slight improvements from doing this in other cases, where the variable is enough of a hint to the compiler (for CPython at least) to keep the value local, but I'm not sure that applies here.
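If you want to see the effect yourself, a quick (machine-dependent) comparison along these lines should show the cached bound method coming out slightly ahead:
import timeit

with_lookup = timeit.timeit(
    "for d in range(1000): partials[d] = partials.get(d, 0) + 1",
    setup="partials = {}", number=1000)

with_cached = timeit.timeit(
    "for d in range(1000): partials[d] = partials_get(d, 0) + 1",
    setup="partials = {}; partials_get = partials.get", number=1000)

print(with_lookup, with_cached)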

Which is better: deque or list slicing?

If I use the code
from collections import deque
q = deque(maxlen=2)
while step <= step_max:
    calculate(item)
    q.append(item)
    another_calculation(q)
how does it compare in efficiency and readability to
q = []
while step <= step_max:
    calculate(item)
    q.append(item)
    q = q[-2:]
    another_calculation(q)
calculate() and another_calculation() are not real in this case but in my actual program are simply two calculations. I'm doing these calculations every step for millions of steps (I'm simulating an ion in 2-d space). Because there are so many steps, q gets very long and uses a lot of memory, while another_calculation() only uses the last two values of q. I had been using the latter method, then heard deque mentioned and thought it might be more efficient; thus the question.
I.e., how do deques in python compare to just normal list slicing?
q = q[-2:]
This is a costly operation because it creates a new list every time (and copies the references). (A nasty side effect is that it rebinds q to a new object, though you can use q[:] = q[-2:] to avoid that.)
The deque object just advances the start of its internal storage and "forgets" the oldest item. So it's faster, and a bounded queue is one of the use cases it was designed for.
Of course, for 2 values, there isn't much difference, but for a bigger number there is.
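If you want to measure it yourself, a minimal (machine-dependent) comparison might look like this; the point is that the slicing version builds a brand new list on every step, while the deque just appends:
import timeit

deque_version = timeit.timeit(
    "q.append(1)",
    setup="from collections import deque; q = deque(maxlen=2)",
    number=1_000_000)

slice_version = timeit.timeit(
    "q.append(1); q = q[-2:]",
    setup="q = []",
    number=1_000_000)

print(deque_version, slice_version)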
If I interpret your question correctly, you have a function that calculates a value, and you want to do another calculation with this value and the previous one. The best way is to use two variables:
previous_item = None  # or a suitable initial value for the first step
while step <= step_max:
    item = calculate()
    another_calculation(previous_item, item)
    previous_item = item
If the calculations are some form of vector math, you should consider using numpy.

I don't understand why/how one of these methods is faster than the others

I wanted to test the difference in time between implementations of some simple code. I decided to count how many values out of a random sample of 10,000,000 numbers are greater than 0.5. The random sample is drawn uniformly from the range [0.0, 1.0).
Here is my code:
from numpy.random import random_sample; import time;
n = 10000000;
t1 = time.clock();
t = 0;
z = random_sample(n);
for x in z:
    if x > 0.5: t += 1;
print t;
t2 = time.clock();
t = 0;
for _ in xrange(n):
    if random_sample() > 0.5: t += 1;
print t;
t3 = time.clock();
t = (random_sample(n) > 0.5).sum();
print t;
t4 = time.clock();
print t2-t1; print t3-t2; print t4-t3;
This is the output:
4999445
4999511
5001498
7.0348236652
1.75569394301
0.202538106332
I get that the first implementation sucks because creating a massive array and then counting it element-wise is a bad idea, so I thought that the second implementation would be the most efficient.
But how is the third implementation 10 times faster than the second method? Doesn't the third method also create a massive array in the form of random_sample(n) and then go through it checking values against 0.5?
How is this third method different from the first method and why is it ~35 times faster than the first method?
EDIT: @merlin2011 suggested that Method 3 probably doesn't create the full array in memory. So, to test that theory I tried the following:
z = random_sample(n);
t = (z > 0.5).sum();
print t;
which runs in a time of 0.197948451549 which is practically identical to Method 3. So, this is probably not a factor.
Method 1 generates a full array in memory before using it. This is slow because the memory has to be allocated and then accessed, probably missing the cache multiple times.
Method 2 uses a generator (xrange), so the full list is never created in memory; each value is produced on demand.
Method 3 is probably faster because sum() is implemented as a loop in C, but I am not 100% sure. My guess is that this is faster for the same reason that Matlab vectorization is faster than for loops in Matlab.
Update: Separating out each of the three steps, I observe that method 3 is still equally fast, so I have to agree with utdemir that each individual operator is executing instructions closer to machine code.
z = random_sample(n)
z2 = z > 0.5
t = z2.sum();
In each of the first two methods, you are invoking Python's standard functionality to do a loop, and this is much slower than a C-level loop that is baked into the implementation.
AFAIK
Function calls are heavy: in method two, you're calling random_sample() 10,000,000 times, but in the third method you call it just once.
Numpy's > and .sum are optimized down to the last bit in C, and most probably use SIMD instructions to avoid loops.
So,
In method 2 you are comparing and looping in Python, but in method 3 you're much closer to the processor, using optimized instructions to compare and sum.
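The code in the question is Python 2 (xrange, print statements, time.clock). If you want to re-run the comparison on Python 3, a rough equivalent (the relative ordering should hold, the exact numbers will differ) is:
import time
import numpy as np

n = 10_000_000
rng = np.random.default_rng()

t1 = time.perf_counter()
t = 0
for x in rng.random(n):          # method 1: Python loop over a full array
    if x > 0.5:
        t += 1
t2 = time.perf_counter()

t = 0
for _ in range(n):               # method 2: one random draw per iteration
    if rng.random() > 0.5:
        t += 1
t3 = time.perf_counter()

t = int((rng.random(n) > 0.5).sum())   # method 3: vectorised comparison and sum
t4 = time.perf_counter()

print(t2 - t1, t3 - t2, t4 - t3)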

How to avoid using for-loops with numpy?

I have already written the following piece of code, which does exactly what I want, but it goes way too slow. I am certain that there is a way to make it faster, but I can't seem to find how it should be done. The first part of the code is just to show what is of which shape:
- two images of measurements (VV1 and HH1)
- precomputed values, VV simulated and HH simulated, which both depend on 3 parameters (precomputed for (101, 31, 11) values)
- the index 2 is just to put the VV and HH images in the same ndarray, instead of making two 3d arrays
VV1 = numpy.ndarray((54, 43)).flatten()
HH1 = numpy.ndarray((54, 43)).flatten()
precomp = numpy.ndarray((101, 31, 11, 2))
Two of the three parameters we let vary:
comp = numpy.zeros((len(parameter1), len(parameter2)))
for i, (vv, hh) in enumerate(zip(VV1, HH1)):
    comp0 = numpy.zeros((len(parameter1), len(parameter2)))
    for j in range(len(parameter1)):
        for jj in range(len(parameter2)):
            comp0[j, jj] = numpy.min((vv - precomp[j, jj, :, 0])**2 + (hh - precomp[j, jj, :, 1])**2)
    comp += comp0
The obvious thing I know I should do is get rid of as many for-loops as I can, but I don't know how to make numpy.min behave properly when working with more dimensions.
A second thing (less important if it can get vectorized, but still interesting) I noticed is that it mostly costs CPU time rather than RAM, but I have searched for a long time and can't find a way to write something like Matlab's "parfor" instead of "for". (Is it possible to make an @parallel decorator, if I just put the for-loop in a separate method?)
Edit: in reply to Janne Karila: yeah, that definitely improves it a lot:
for (vv, hh) in zip(VV1, HH1):
    comp += numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)
This is definitely a lot faster, but is there any possibility to remove the outer for-loop too? And is there a way to make a for-loop parallel, with an @parallel decorator or something?
This can replace the inner loops, j and jj
comp0 = numpy.min((vv-precomp[...,0])**2+(hh-precomp[...,1])**2, axis=2)
This may be a replacement for the whole loop, though all this indexing is stretching my mind a bit (note that it creates a large intermediate array):
comp = numpy.sum(
          numpy.min((VV1.reshape(-1, 1, 1, 1) - precomp[numpy.newaxis, ..., 0])**2
                    + (HH1.reshape(-1, 1, 1, 1) - precomp[numpy.newaxis, ..., 1])**2,
                    axis=3),   # min over the 11-long axis, which becomes axis 3 once the new leading axis is added
          axis=0)
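A quick shape check (using the array sizes from the question) helps keep track of which axis the min runs over once the extra leading axis is added:
import numpy as np

VV1 = np.zeros(54 * 43)                  # flattened image, length 2322
precomp = np.zeros((101, 31, 11, 2))

diff = VV1.reshape(-1, 1, 1, 1) - precomp[np.newaxis, ..., 0]
print(diff.shape)                                    # (2322, 101, 31, 11)
print(np.min(diff**2, axis=3).shape)                 # (2322, 101, 31): min over the 11-long axis
print(np.min(diff**2, axis=3).sum(axis=0).shape)     # (101, 31), matching comp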
One way to parallelize the loop is to construct it in such a way as to use map. In that case, you can then use multiprocessing.Pool to use a parallel map.
I would change this:
for (vv, hh) in zip(VV1, HH1):
    comp += numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)
To something like this:
def buildcomp(vvhh):
    vv, hh = vvhh
    return numpy.min((vv - precomp[..., 0])**2 + (hh - precomp[..., 1])**2, axis=2)

if __name__ == '__main__':
    from multiprocessing import Pool
    nthreads = 2
    p = Pool(nthreads)
    complist = p.map(buildcomp, np.column_stack((VV1, HH1)))
    comp = np.dstack(complist).sum(-1)
Note that the dstack assumes each array in complist is 2-d, because it adds a third axis and sums along it. This will slow it down a bit because you have to build the list, stack it, then sum it, but these are all either parallel or numpy operations.
I also changed the zip to a numpy operation np.column_stack, since zip is much slower for long arrays, assuming they're already 1d arrays (which they are in your example).
I can't easily test this so if there's a problem, feel free to let me know.
In computer science, there is the concept of Big O notation, used for getting an approximation of how much work is required to do something. To make a program fast, do as little as possible.
This is why Janne's answer is so much faster: you do fewer calculations. Taking this principle further, we can apply the concept of memoization, because you are CPU bound instead of RAM bound. You can use an existing memoization library if it needs to be more complex than the following example.
class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

memo = AutoVivification()

def memoize(n, arr, end):
    # arr is always precomp here, and numpy arrays aren't hashable,
    # so key the cache on (n, end) rather than on the array itself.
    if end not in memo[n]:
        memo[n][end] = (n - arr[..., end])**2
    return memo[n][end]

for (vv, hh) in zip(VV1, HH1):
    first = memoize(vv, precomp, 0)
    second = memoize(hh, precomp, 1)
    comp += numpy.min(first + second, axis=2)
Anything that has already been computed gets saved to memory in the dictionary, and we can look it up later instead of recomputing it. You can even break down the math being done into smaller steps that are each memoized if necessary.
The AutoVivification dictionary is just to make it easier to save the results inside of memoize, because I'm lazy. Again, you can memoize any of the math you do, so if numpy.min is slow, memoize it too.

Efficient generic Python memoize

I have a generic Python memoizer:
cache = {}
def memoize(f):
    """Memoize any function."""
    def decorated(*args):
        key = (f, str(args))
        result = cache.get(key, None)
        if result is None:
            result = f(*args)
            cache[key] = result
        return result
    return decorated
It works, but I'm not happy with it, because sometimes it's not efficient. Recently, I used it with a function that takes lists as arguments, and apparently making keys with whole lists slowed everything down. What is the best way to do that? (i.e., to efficiently compute keys, whatever the args, and however long or complex they are)
I guess the question is really about how you would efficiently produce keys from the args and the function for a generic memoizer - I have observed in one program that poor keys (too expensive to produce) had a significant impact on the runtime. My program was taking 45s with str(args), but I could reduce that to 3s with handcrafted keys. Unfortunately, the handcrafted keys are specific to that program, and I want a fast memoizer where I won't have to roll out specific, handcrafted keys for the cache each time.
First, if you're pretty sure that O(N) hashing is reasonable and necessary here, and you just want to speed things up with a faster algorithm than hash(str(x)), try this:
def hash_seq(iterable):
    result = hash(type(iterable))
    for element in iterable:
        result ^= hash(element)
    return result
Of course this won't work for possibly-deep sequences, but there's an obvious way around that:
def hash_seq(iterable):
    result = hash(type(iterable))
    for element in iterable:
        try:
            result ^= hash(element)
        except TypeError:
            result ^= hash_seq(element)
    return result
I'm not sure this is a good-enough hash algorithm, because it will return the same value for different permutations of the same list. But I am pretty sure that no good-enough hash algorithm will be much faster, at least not unless it's written in C or Cython, which you'll probably ultimately want to do if this is the direction you're going.
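For example, with the shallow hash_seq above, two orderings of the same elements collide:
print(hash_seq([1, 2, 3]))
print(hash_seq([3, 2, 1]))   # same value, because xor is order-insensitive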
Also, it's worth noting that this will be correct in many cases where str (or marshal) will not—for example, if your list may have some mutable element whose repr involves its id rather than its value. However, it's still not correct in all cases. In particular, it assumes that "iterates the same elements" means "equal" for any iterable type, which obviously isn't guaranteed to be true. False negatives aren't a huge deal, but false positives are (e.g., two dicts with the same keys but different values may spuriously compare equal and share a memo).
Also, it uses no extra space, instead of O(N) with a rather large multiplier.
At any rate, it's worth trying this first, and only then deciding whether it's worth analyzing for good-enough-ness and tweaking for micro-optimizations.
Here's a trivial Cython version of the shallow implementation:
def test_cy_xor(iterable):
    cdef int result = hash(type(iterable))
    cdef int h
    for element in iterable:
        h = hash(element)
        result ^= h
    return result
From a quick test, the pure Python implementation is pretty slow (as you'd expect, with all that Python looping, compared to the C looping in str and marshal), but the Cython version wins easily:
test_str( 3): 0.015475
test_marshal( 3): 0.008852
test_xor( 3): 0.016770
test_cy_xor( 3): 0.004613
test_str(10000): 8.633486
test_marshal(10000): 2.735319
test_xor(10000): 24.895457
test_cy_xor(10000): 0.716340
Just iterating the sequence in Cython and doing nothing (which is effectively just N calls to PyIter_Next and some refcounting, so you're not going to do much better in native C) takes 70% of the time of test_cy_xor. You can presumably make it faster by requiring an actual sequence instead of an iterable, and even more so by requiring a list, although either way it might require writing explicit C rather than Cython to get the benefits.
Anyway, how do we fix the ordering problem? The obvious Python solution is to hash (i, element) instead of element, but all that tuple manipulation slows the Cython version down by up to 12x. The standard solution is to multiply by some number between each xor. But while you're at it, it's worth trying to get the values to spread out nicely for short sequences, small int elements, and other very common edge cases. Picking the right numbers is tricky, so… I just borrowed everything from tuple. Here's the complete test.
_hashtest.pyx:
cdef _test_xor(seq):
    cdef long result = 0x345678
    cdef long mult = 1000003
    cdef long h
    cdef long l = 0
    try:
        l = len(seq)
    except TypeError:
        # NOTE: This probably means very short non-len-able sequences
        # will not be spread as well as they should, but I'm not
        # sure what else to do.
        l = 100
    for element in seq:
        try:
            h = hash(element)
        except TypeError:
            h = _test_xor(element)
        result ^= h
        result *= mult
        mult += 82520 + l + l
    result += 97531
    return result

def test_xor(seq):
    return _test_xor(seq) ^ hash(type(seq))
hashtest.py:
import marshal
import random
import timeit
import pyximport
pyximport.install()
import _hashtest

def test_str(seq):
    return hash(str(seq))

def test_marshal(seq):
    return hash(marshal.dumps(seq))

def test_cy_xor(seq):
    return _hashtest.test_xor(seq)

# This one is so slow that I don't bother to test it...
def test_xor(seq):
    result = hash(type(seq))
    for i, element in enumerate(seq):
        try:
            result ^= hash((i, element))
        except TypeError:
            result ^= hash((i, hash_seq(element)))
    return result

smalltest = [1, 2, 3]
bigtest = [random.randint(10000, 20000) for _ in range(10000)]

def run():
    for seq in smalltest, bigtest:
        for f in test_str, test_marshal, test_cy_xor:
            print('%16s(%5d): %9f' % (f.func_name, len(seq),
                                      timeit.timeit(lambda: f(seq), number=10000)))

if __name__ == '__main__':
    run()
Output:
test_str( 3): 0.014489
test_marshal( 3): 0.008746
test_cy_xor( 3): 0.004686
test_str(10000): 8.563252
test_marshal(10000): 2.744564
test_cy_xor(10000): 0.904398
Here are some potential ways to make this faster:
- If you have lots of deep sequences, instead of using try around hash, call PyObject_Hash and check for -1.
- If you know you have a sequence (or, even better, specifically a list) rather than just an iterable, PySequence_ITEM (or PyList_GET_ITEM) is probably going to be faster than the PyIter_Next implicitly used above.
In either case, once you start calling C API calls, it's usually easier to drop Cython and just write the function in C. (You can still use Cython to write a trivial wrapper around that C function, instead of manually coding up the extension module.) And at that point, just borrow the tuplehash code directly instead of reimplementing the same algorithm.
If you're looking for a way to avoid the O(N) in the first place, that's just not possible. If you look at how tuple.__hash__, frozenset.__hash__, and ImmutableSet.__hash__ work (the last one is pure Python and very readable, by the way), they all take O(N). However, they also all cache the hash values. So, if you're frequently hashing the same tuple (rather than non-identical-but-equal ones), it approaches constant time. (It's O(N/M), where M is the number of times you call with each tuple.)
If you can assume that your list objects never mutate between calls, you can obviously do the same thing, e.g., with a dict mapping id to hash as an external cache. But in general, that obviously isn't a reasonable assumption. (If your list objects never mutate, it would be easier to just switch to tuple objects and not bother with all this complexity.)
But you can wrap up your list objects in a subclass that adds a cached hash value member (or slot), and invalidates the cache whenever it gets a mutating call (append, __setitem__, __delitem__, etc.). Then your hash_seq can check for that.
The end result is the same correctness and performance as with tuples: amortized O(N/M), except that for tuple M is the number of times you call with each identical tuple, while for list it's the number of times you call with each identical list without mutating in between.
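As a minimal sketch of that wrapper idea (only a few mutating methods are shown; a real version would cover extend, insert, pop, sort, and so on, and the hasher argument is whatever sequence hash you settle on, e.g. hash_seq above):
class HashCachingList(list):
    """A list that remembers its last computed hash until it is mutated."""
    def __init__(self, *args):
        super().__init__(*args)
        self._cached_hash = None

    def _invalidate(self):
        self._cached_hash = None

    def append(self, item):
        self._invalidate()
        super().append(item)

    def __setitem__(self, index, value):
        self._invalidate()
        super().__setitem__(index, value)

    def __delitem__(self, index):
        self._invalidate()
        super().__delitem__(index)

    def cached_hash(self, hasher):
        # Recompute only when the cache has been invalidated by a mutation.
        if self._cached_hash is None:
            self._cached_hash = hasher(self)
        return self._cached_hash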
You could try a couple of things:
Using marshal.dumps instead of str might be slightly faster (at least on my machine):
>>> timeit.timeit("marshal.dumps([1,2,3])","import marshal", number=10000)
0.008287056301007567
>>> timeit.timeit("str([1,2,3])",number=10000)
0.01709315717356219
Also, if your functions are expensive to compute, and could possibly return None themselves, then your memoizing function will be re-computing them each time (I'm possibly reaching here, but without knowing more I can only guess).
Incorporating these 2 things gives:
import marshal

cache = {}

def memoize(f):
    """Memoize any function."""
    def decorated(*args):
        key = (f, marshal.dumps(args))
        if key in cache:
            return cache[key]
        cache[key] = f(*args)
        return cache[key]
    return decorated
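Usage is unchanged from the original decorator; slow_fib here is just a hypothetical stand-in for an expensive function:
@memoize
def slow_fib(n):
    # Deliberately naive recursion; with the decorator, repeated
    # sub-calls are served from the cache.
    return n if n < 2 else slow_fib(n - 1) + slow_fib(n - 2)

print(slow_fib(30))   # computed
print(slow_fib(30))   # returned straight from the cache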
