Is it more performant to access function from variable?

I was reading the source of Python's statistics module and saw a strange variable, partials_get = partials.get, which is then used once in a for loop: partials[d] = partials_get(d, 0) + n.
def _sum(data, start=0):
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get  # STRANGE VARIABLE
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n, d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n  # AND ITS USAGE
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
So my question: why not just write partials[d] = partials.get(d, 0) + n? Is that slower than storing the function in a variable and calling it through the variable?

partials.get has to search for the get attribute, starting with the object's dictionary and then going to the dictionary of the class and its parent classes. This will be done each time through the loop.
Assigning it to a variable does this lookup once, rather than repeating it.
This is a microoptimization that's typically only significant if the loop has many repetitions. The statistics library often processes large data sets, so it's reasonable here. It's rarely needed in ordinary application code.
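A rough way to check this on your own machine is to time both forms with timeit. This is only a sketch: the function names with_attribute_lookup and with_cached_method are made up for illustration, and the size of the gap depends on the interpreter version (recent CPython releases have made method calls cheaper, so the difference may be small).

```python
import timeit


def with_attribute_lookup(keys):
    counts = {}
    for k in keys:
        # The "get" attribute is looked up on every iteration.
        counts[k] = counts.get(k, 0) + 1
    return counts


def with_cached_method(keys):
    counts = {}
    counts_get = counts.get  # bound method looked up once, before the loop
    for k in keys:
        counts[k] = counts_get(k, 0) + 1
    return counts


keys = [i % 100 for i in range(100_000)]

if __name__ == "__main__":
    for fn in (with_attribute_lookup, with_cached_method):
        seconds = timeit.timeit(lambda: fn(keys), number=20)
        print(f"{fn.__name__}: {seconds:.3f}s")
```

Both functions return the same dictionary; only the time spent inside the loop differs.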

Short answer: yes.
Python is an interpreted language, and while dictionary/attribute access is blazingly fast and very optimized, it still incurs a hit.
Since they are running this in a tight loop, they are taking the slight performance advantage of removing the "dot" from accessing partials.get.
There can be other slight gains from this pattern in cases where the variable is enough of a hint to the compiler (in CPython at least) to keep the name local, but I'm not sure that is the case here.

Related

How should I improve Python/Cython performance? Parallelization/memoryviews/numpy?

My task: take 3 lists of ints, each with some multiplier, and see if the elements can be rearranged to make two lists (with larger multipliers).
I have code that does this - looped over my whole data set, it takes about 15 seconds: (EDIT: fixed errors)
%%cython
cdef bint my_check(
    list pattern1,
    list pattern2,
    list pattern3,
    int amount1,
    int amount2,
    int amount3
):
    cdef dict all_items = dict()
    cdef int i, total_amount = amount1 + amount2 + amount3, m1, m2
    cdef bint bad_split = False
    # Pool the items together.
    for i in range(len(pattern1)):
        all_items[pattern1[i]] = all_items.get(pattern1[i],0) + amount1
    for i in range(len(pattern2)):
        all_items[pattern2[i]] = all_items.get(pattern2[i],0) + amount2
    for i in range(len(pattern3)):
        all_items[pattern3[i]] = all_items.get(pattern3[i],0) + amount3
    # Iterate through possible split points:
    for m1 in range(total_amount//2, total_amount):
        m2 = total_amount - m1
        # Split items into those with quantities divisible at this point and those without
        divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 == 0}
        not_divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 != 0}
        # Check that all of the element amounts that are not divisible by m1 are divisible by m2.
        for i in not_divisible:
            if not_divisible[i]%m2 != 0:
                bad_split = True
                break
        # If there is an element that doesn't divide by either, try the next split value.
        if bad_split:
            continue
        items1 = {i:divisible[i]//m1 for i in divisible}
        items2 = {i:not_divisible[i]//m2 for i in not_divisible}
        if <some other stuff here>:
            return True
    # Tried all of the split points
    return False
Then if this returns True, I run another function to do the combination. On my data set, the my_check() function is being called > 150,000 times (and taking the bulk of the time) and the other function < 500 times, so I'm not too concerned with optimizing that one.
I'd like to parallelize this to improve the performance, but what I've found:
my first thought was to use numpy functions to take advantage of vectorization, by converting all_items to a numpy array, using np.mod() and np.logical_not() to split the items, and other numpy functions in the last if clause, but that blows the time up by 3-4x compared to using the dict comprehension
if I switch the m1 range to a Cython prange, the compiler complained about using Python objects without the GIL. I switched the dicts to cdef'd numpy arrays, but that was even slower. I tried using memoryviews, but they don't seem to be easily manipulated? I read in another question here that slices can't be assigned to variables, so I don't know how I'd work with them. It won't let me cdef new variables inside the for loop.
Since I'm running at different values of m1, and terminating as soon as any of them return True, it should be parallelizable without worrying about race conditions.
What should my approach be here? Numpy? Cython? Something else?
I'm happy to post more detailed errors from any of my attempts, but figured that posting them all would get overwhelming. I haven't been able to get profiling or line profiling working for this - I've added the relevant # cython: statements to the top of the Jupyter notebook cell, but it doesn't find anything when I run it.
EDIT:
Per @DavidW's answer I've replaced the middle chunk of code with the following, which cuts the time in half:
items1 = dict()
items2 = dict()
bad_split = False
for k,v in all_items.items():
    if v % m1 == 0:
        items1[k] = v//m1
    elif v % m2 == 0:
        items2[k] = v//m2
    else:
        bad_split = True
        break
I'd still like to find some way of taking advantage of my multi-core processor if that's possible.

There are definitely some improvements you can make to the loops that don't change the fundamental approach but may be faster. I haven't timed these, so it's worth doing that rather than taking my word for it.
for i in range(len(pattern1)):
    all_items[pattern1[i] = all_items.get(pattern1[i],0) + amount1
(Ignoring the syntax error). It's generally more idiomatic to iterate by item rather than over a range, and it avoids two lookups (sometimes that isn't true in Cython, for example when iterating over numpy arrays, but for a list it's probably true):
for pattern1_i in pattern1:
    all_items[pattern1_i] = all_items.get(pattern1_i,0) + amount1
More significantly, you have two loops:
divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 == 0}
not_divisible = {i:all_items[i] for i in all_items if all_items[i]%m1 != 0}
You're wasting a lot of time doing dict lookups when you could iterate directly over both keys and values. For example:
divisible = {k: v for k, v in all_items.items() if v%m1 == 0}
But you're also looping over the dictionary twice and performing the same test twice. A single loop does both:
divisible = {}
not_divisible = {}
for k, v in all_items.items():
    if v%m1 == 0:
        divisible[k] = v
    else:
        not_divisible[k] = v
It might well be possible to translate your algorithm to something involving Numpy arrays, but it's a fairly significant change and beyond my interest here.
Addendum: I'm increasingly reluctant to recommend that people use C++ classes in Cython these days, mainly because a) it can often lead to quite awkward code, b) people tend to use it in a cargo-culty way because "it's C++ so it must be faster than Python objects", and c) people tend to forget about the cost of converting their objects to/from C++ at the start and end of every function.
However, in this case it might actually be a good choice, since your dict objects are uniformly typed and entirely contained within a single function. The key substitution is dict -> unordered_map.
What you want to do (rough outline) is
from libcpp.unordered_map cimport unordered_map
Then type all_items, items1 and items2 as cdef unordered_map[int, int] (I think...). You do this typing outside the loop. The rest of your code then remains largely the same (you may need to find a substitute for dict.get...).
Once you've got it working as a serial calculation, you should be able to turn your for m1 in range(total_amount//2, total_amount): into a prange loop, and assuming everything is correctly typed, this should work in parallel. Obviously, <some other stuff here> is a big unknown.
You must treat all_items as strictly read-only during the loop to avoid race-conditions. However, items1 and items2 should be correctly identified as loop-local variables by Cython I hope.
Here's a fairly similar answer to use as a starting point. For future readers: please think twice about whether you really need to convert all your Python objects to C++ ones; you probably don't.

Memoized to DP solution - Making Change

Recently I read a problem to practice DP. I wasn't able to come up with one, so I tried a recursive solution which I later modified to use memoization. The problem statement is as follows :-
Making Change. You are given n types of coin denominations of values
v(1) < v(2) < ... < v(n) (all integers). Assume v(1) = 1, so you can
always make change for any amount of money C. Give an algorithm which
makes change for an amount of money C with as few coins as possible.
[on problem set 4]
I got the question from here
My solution was as follows :-
def memoized_make_change(L, index, cost, d):
    if index == 0:
        return cost
    if (index, cost) in d:
        return d[(index, cost)]
    count = cost // L[index]
    val1 = memoized_make_change(L, index-1, cost%L[index], d) + count
    val2 = memoized_make_change(L, index-1, cost, d)
    x = min(val1, val2)
    d[(index, cost)] = x
    return x
This is how I've understood my solution to the problem. Assume that the denominations are stored in L in ascending order. As I iterate from the end to the beginning, I have a choice to either choose a denomination or not choose it. If I choose it, I then recurse to satisfy the remaining amount with lower denominations. If I do not choose it, I recurse to satisfy the current amount with lower denominations.
Either way, at a given function call, I find the best(lowest count) to satisfy a given amount.
Could I have some help in bridging the thought process from here onward to reach a DP solution? I'm not doing this as any HW, this is just for fun and practice. I don't really need any code either, just some help in explaining the thought process would be perfect.
[EDIT]
I recall reading that function calls are expensive and is the reason why bottom up(based on iteration) might be preferred. Is that possible for this problem?

Here is a general approach for converting memoized recursive solutions to "traditional" bottom-up DP ones, in cases where this is possible.
First, let's express our general "memoized recursive solution". Here, x represents all the parameters that change on each recursive call. We want this to be a tuple of non-negative integers - in your case, (index, cost). I omit anything that's constant across the recursion (in your case, L), and I suppose that I have a global cache. (But FWIW, in Python you should just use the lru_cache decorator from the standard library functools module rather than managing the cache yourself.)
To solve for(x):
    If x in cache: return cache[x]
    Handle base cases, i.e. where one or more components of x is zero
    Otherwise:
        Make one or more recursive calls
        Combine those results into `result`
        cache[x] = result
        return result
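As a concrete instance of that outline, here is a minimal sketch of the question's recursion with functools.lru_cache doing the caching. The wrapper name make_change and the inner name solve are made up for illustration; the recurrence is copied from the question as posted (take as many of the current coin as fit, or take none of it), not a different algorithm.

```python
from functools import lru_cache


def make_change(coins, amount):
    """Memoized recursion from the question; coins ascending, coins[0] == 1."""
    coins = tuple(coins)  # constant across the recursion, so closed over

    @lru_cache(maxsize=None)
    def solve(index, cost):
        # Base case: only the 1-unit coin is left.
        if index == 0:
            return cost
        count = cost // coins[index]
        use = solve(index - 1, cost % coins[index]) + count
        skip = solve(index - 1, cost)
        return min(use, skip)

    return solve(len(coins) - 1, amount)


print(make_change([1, 5, 10, 25], 63))
```

Only (index, cost) changes between calls, so those two values are exactly the cache key that lru_cache builds from the arguments.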
The basic idea in dynamic programming is simply to evaluate the base cases first and work upward:
To solve for(x):
    For y starting at (0, 0, ...) and increasing towards x:
        Do all the stuff from above
However, two neat things happen when we arrange the code this way:
As long as the order of y values is chosen properly (this is trivial when there's only one vector component, of course), we can arrange that the results for the recursive call are always in cache (i.e. we already calculated them earlier, because y had that value on a previous iteration of the loop). So instead of actually making the recursive call, we replace it directly with a cache lookup.
Since every component of y will use consecutively increasing values, and will be placed in the cache in order, we can use a multidimensional array (nested lists, or else a Numpy array) to store the values instead of a dictionary.
So we get something like:
To solve for(x):
    cache = multidimensional array sized according to x
    for i in range(first component of x):
        for j in ...:
            (as many loops as needed; better yet use `itertools.product`)
            If this is a base case, write the appropriate value to cache
            Otherwise, compute "recursive" index values to use, look up
            the values, perform the computation and store the result
    return the appropriate ("last") value from cache
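Applied to the making-change problem, the outline above might look like the sketch below. Two things here are assumptions rather than something taken from the question: the function name make_change_bottom_up, and the recurrence, which is the textbook one (either skip the current coin, or use one more of it and look at the same row) rather than the cost % L[index] step used in the question.

```python
from itertools import product


def make_change_bottom_up(coins, amount):
    """Fewest coins summing to `amount`; coins ascending, coins[0] == 1."""
    n = len(coins)
    # table[i][c] = fewest coins for amount c using only coins[0..i]
    table = [[0] * (amount + 1) for _ in range(n)]
    # product() yields (i, c) with i as the outer loop and c increasing,
    # so every cell read below has already been filled in.
    for i, c in product(range(n), range(amount + 1)):
        if i == 0:
            table[i][c] = c  # base case: only 1-unit coins
            continue
        best = table[i - 1][c]  # do not use coins[i] at all
        if c >= coins[i]:
            # use one more coins[i]; the rest comes from the same row
            best = min(best, table[i][c - coins[i]] + 1)
        table[i][c] = best
    return table[n - 1][amount]


print(make_change_bottom_up([1, 5, 10, 25], 63))
```

The nested list plays the role of the cache, and the "recursive calls" have become two plain index lookups.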

I suggest considering the relationship between the value you are constructing and the values you need for it.
In this case you are constructing a value for index, cost based on:
index-1 and cost
index-1 and cost%L[index]
What you are searching for is a way of iterating over the choices such that you will always have precalculated everything you need.
In this case you can simply change the code to the iterative approach:
for each choice of index 0 upwards:
    for each choice of cost:
        compute value corresponding to index,cost
In practice, I find that the iterative approach can be significantly faster (perhaps 4x) for simple problems, as it avoids the overhead of function calls and of checking the cache for preexisting values.

In python is it better to have a series of += or to append to a list and then sum?

Both of these bits of code do the same thing:
g = 1
g += 2
g += 17
print g
g = []
g.append(1)
g.append(2)
g.append(17)
print sum(g)
I was just wondering if one of these ways is "better" or more Python than the other.
My own testing with the following bit of code:
import time
n = 1000000
A = time.clock()
w = 0
for i in range(n):
    w += i
print w, time.clock() - A
A = time.clock()
g = []
for i in range(n):
    g.append( i )
print sum(g), time.clock() - A
seems to indicate that the first method is slightly faster, but I may be missing something.
Or I may be missing an entirely better way to perform this type of operation. Any input would be welcome.

It's not a matter of being Pythonic, it's a matter of what you want to achieve.
Do you want to save the values that constitute the sum so they can be referred to later? If so, use a list. If not, then why even bother with a list? It would just be a convoluted and less efficient way to do the same thing -- just add the values up. The direct addition method will obviously be faster because all you're doing is adding to a variable (very cheap), instead of mutating a list (which has a greater cost). Not to mention the evident memory advantage of using the direct addition approach, since you wouldn't be storing useless numbers.
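To measure the difference rather than reason about it, a sketch along these lines could be used. It is written for Python 3 (unlike the Python 2 snippets in the question), the function names are made up for illustration, and builtin_sum is included only as a further point of comparison.

```python
import timeit


def running_total(n):
    total = 0
    for i in range(n):
        total += i  # nothing is stored except the running total
    return total


def list_then_sum(n):
    values = []
    for i in range(n):
        values.append(i)  # every value is kept in memory first
    return sum(values)


def builtin_sum(n):
    return sum(range(n))  # no explicit loop and no list


if __name__ == "__main__":
    n = 100_000
    for fn in (running_total, list_then_sum, builtin_sum):
        seconds = timeit.timeit(lambda: fn(n), number=20)
        print(f"{fn.__name__}: {seconds:.3f}s")
```

All three return the same number; they differ only in how much intermediate work and memory they need.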

Method A is
Add the integers
Method B is
Create a list of integers
Add the integers
If all you want to do is
Add the integers
I'd go with method A.

The only reason to have the second method is if you plan to use the list g elsewhere in the code. Otherwise, there isn't a reason to do it the second way. Making a list and then summing its values is a lot more costly than just incrementing a variable.
Moreover, if incrementing g is your goal, then why not do that? "Explicit is better than implicit" is a motto of Python. The first method explicitly increments g.
Also, the list may confuse people. When they see your code, they will think you need the list and plan to use it elsewhere. Not to mention that g is now a list. If g is supposed to be a number, having it be a list is not good and can lead to problems.
Finally, the first solution has less syntax (always a plus if it does the same job efficiently).
So, I'd go with method 1.

Absolutely the first, for many reasons: first of all memory allocation (N integers instead of just one), and then performance: in a real-world application the GC overhead would show up.

edit: disregard this, I can see now that it is not generally true, and is only true for specific values of x.
Ok, so I can see that making a list should be inefficient, but then why does fun2 run more quickly in this instance? Doesn't it essentially create a list and then sum over it?
import timeit
def fun1(x):
    w = 0
    for i in range(x):
        w += i
    return w
def fun2(x):
    return sum([i for i in range(x)])
timer = timeit.Timer(stmt='fun1(10000)', setup='from __main__ import fun1')
print timer.timeit(number=10000)
timer = timeit.Timer(stmt='fun2(10000)', setup='from __main__ import fun2')
print timer.timeit(number=10000)

Empty zeroth element in array/list to eliminate repeated decrementing. Does this improve performance?

I am using Python to solve Project Euler problems. Many require caching the results of past calculations to improve performance, leading to code like this:
pastResults = [None] * 1000000
def someCalculation(integerArgument):
    # return result of a calculation performed on integerArgument
    # for example, summing the factorial or square of its digits
for eachNumber in range(1, 1000001):
    if pastResults[eachNumber - 1] is None:
        pastResults[eachNumber - 1] = someCalculation(eachNumber)
    # perform additional actions with pastResults[eachNumber - 1]
Would the repeated decrementing have an adverse impact on program performance? Would having an empty or dummy zeroth element (so the zero-based array emulates a one-based array) improve performance by eliminating the repeated decrementing?
pastResults = [None] * 1000001
def someCalculation(integerArgument):
    # return result of a calculation performed on integerArgument
    # for example, summing the factorial or square of its digits
for eachNumber in range(1, 1000001):
    if pastResults[eachNumber] is None:
        pastResults[eachNumber] = someCalculation(eachNumber)
    # perform additional actions with pastResults[eachNumber]
I also feel that emulating a one-based array would make the code easier to follow. That is why I do not make the range zero-based with for eachNumber in range(1000000) as someCalculation(eachNumber + 1) would not be logical.
How significant is the additional memory from the empty zeroth element? What other factors should I consider? I would prefer answers that are not confined to Python and Project Euler.
EDIT: Should be is None instead of is not None.

Not really an answer to the performance question, but rather a general tip about caching previously calculated values. The usual way to do this is to use a map (a Python dict), as this allows more complex keys than plain integers, such as floating point numbers, strings, or even tuples. Also, you won't run into problems if your keys are rather sparse.
pastResults = {}
def someCalculation(integerArgument):
    if integerArgument not in pastResults:
        pastResults[integerArgument] = # calculation performed on numberArg.
    return pastResults[integerArgument]
Also, there is no need to perform the calculations "in order" using a loop. Just call the function for the value you are interested in, and the if statement will take care that, when invoked recursively, the function is called only once for each argument.
Ultimately, if you are using this a lot (as clearly the case for Project Euler) you can define yourself a function decorator, like this one:
def memo(f):
    f.cache = {}
    def _f(*args, **kwargs):
        if args not in f.cache:
            f.cache[args] = f(*args, **kwargs)
        return f.cache[args]
    return _f
What this does is: it takes a function and defines another function that first checks whether the given parameters can be found in the cache, and otherwise calculates the result of the original function and puts it into the cache. Just add the @memo decorator to your function definitions and this will take care of caching for you.
@memo
def someCalculation(integerArgument):
    # function body
This is syntactic sugar for someCalculation = memo(someCalculation). Note, however, that this will not always work out well. First, the parameters have to be hashable (no lists or other mutable types); second, in case you are passing parameters that are not relevant for the result (e.g., debugging stuff etc.) your cache can grow unnecessarily large, as all the parameters are used as the key.
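One detail worth knowing about the decorator above: it passes keyword arguments through to the function but builds the cache key from the positional arguments only. A possible variant that includes keyword arguments in the key is sketched below; the example function digit_square_sum and the call_log list are made up purely to show the caching behaviour.

```python
import functools


def memo(f):
    cache = {}

    @functools.wraps(f)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        # Keyword arguments are part of the key; sorting makes the key
        # independent of the order in which they were passed.
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = f(*args, **kwargs)
        return cache[key]

    wrapper.cache = cache
    return wrapper


call_log = []


@memo
def digit_square_sum(n, base=10):
    call_log.append((n, base))  # records real (uncached) calls only
    total = 0
    while n:
        n, digit = divmod(n, base)
        total += digit * digit
    return total


digit_square_sum(145)
digit_square_sum(145)  # second call is answered from the cache
print(call_log)
```

Note that digit_square_sum(145) and digit_square_sum(145, base=10) still produce different keys, which is the "cache can grow unnecessarily large" effect described above.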

Efficient generic Python memoize

I have a generic Python memoizer:
cache = {}
def memoize(f):
    """Memoize any function."""
    def decorated(*args):
        key = (f, str(args))
        result = cache.get(key, None)
        if result is None:
            result = f(*args)
            cache[key] = result
        return result
    return decorated
It works, but I'm not happy with it, because sometimes it's not efficient. Recently, I used it with a function that takes lists as arguments, and apparently making keys with whole lists slowed everything down. What is the best way to do that? (i.e., to efficiently compute keys, whatever the args, and however long or complex they are)
I guess the question is really about how you would efficiently produce keys from the args and the function for a generic memoizer - I have observed in one program that poor keys (too expensive to produce) had a significant impact on the runtime. My prog was taking 45s with 'str(args)', but I could reduce that to 3s with handcrafted keys. Unfortunately, the handcrafted keys are specific to this prog, but I want a fast memoizer where I won't have to roll out specific, handcrafted keys for the cache each time.

First, if you're pretty sure that O(N) hashing is reasonable and necessary here, and you just want to speed things up with a faster algorithm than hash(str(x)), try this:
def hash_seq(iterable):
    result = hash(type(iterable))
    for element in iterable:
        result ^= hash(element)
    return result
Of course this won't work for possibly-deep sequences, but there's an obvious way around that:
def hash_seq(iterable):
    result = hash(type(iterable))
    for element in iterable:
        try:
            result ^= hash(element)
        except TypeError:
            result ^= hash_seq(element)
    return result
I'm not sure this is a good-enough hash algorithm, because it will return the same value for different permutations of the same list. But I am pretty sure that no good-enough hash algorithm will be much faster. At least if it's written in C or Cython, which you'll probably ultimately want to do if this is the direction you're going.
Also, it's worth noting that this will be correct in many cases where str (or marshal) will not—for example, if your list may have some mutable element whose repr involves its id rather than its value. However, it's still not correct in all cases. In particular, it assumes that "iterates the same elements" means "equal" for any iterable type, which obviously isn't guaranteed to be true. False negatives aren't a huge deal, but false positives are (e.g., two dicts with the same keys but different values may spuriously compare equal and share a memo).
Also, it uses no extra space, instead of O(N) with a rather large multiplier.
At any rate, it's worth trying this first, and only then deciding whether it's worth analyzing for good-enough-ness and tweaking for micro-optimizations.
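The permutation weakness mentioned above is easy to confirm with the shallow hash_seq, repeated here so the snippet runs on its own. Because XOR is commutative and every value cancels itself, reordered lists collide, and so do lists that differ only by a pair of equal elements.

```python
def hash_seq(iterable):
    result = hash(type(iterable))
    for element in iterable:
        result ^= hash(element)
    return result


# Same elements, different order: XOR does not care about position.
print(hash_seq([1, 2, 3]) == hash_seq([3, 1, 2]))

# A repeated element cancels itself out (x ^ x == 0), so these collide too.
print(hash_seq([5, 5]) == hash_seq([]))
```

Both comparisons print True, which is why a position-dependent mixing step is needed before this is usable as a memoization key.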
Here's a trivial Cython version of the shallow implementation:
def test_cy_xor(iterable):
    cdef int result = hash(type(iterable))
    cdef int h
    for element in iterable:
        h = hash(element)
        result ^= h
    return result
From a quick test, the pure Python implementation is pretty slow (as you'd expect, with all that Python looping, compared to the C looping in str and marshal), but the Cython version wins easily:
test_str( 3): 0.015475
test_marshal( 3): 0.008852
test_xor( 3): 0.016770
test_cy_xor( 3): 0.004613
test_str(10000): 8.633486
test_marshal(10000): 2.735319
test_xor(10000): 24.895457
test_cy_xor(10000): 0.716340
Just iterating the sequence in Cython and doing nothing (which is effectively just N calls to PyIter_Next and some refcounting, so you're not going to do much better in native C) is 70% of the same time as test_cy_xor. You can presumably make it faster by requiring an actual sequence instead of an iterable, and even more so by requiring a list, although either way it might require writing explicit C rather than Cython to get the benefits.
Anyway, how do we fix the ordering problem? The obvious Python solution is to hash (i, element) instead of element, but all that tuple manipulation slows down the Cython version up to 12x. The standard solution is to multiply by some number between each xor. But while you're at it, it's worth trying to get the values to spread out nicely for short sequences, small int elements, and other very common edge cases. Picking the right numbers is tricky, so… I just borrowed everything from tuple. Here's the complete test.
_hashtest.pyx:
cdef _test_xor(seq):
    cdef long result = 0x345678
    cdef long mult = 1000003
    cdef long h
    cdef long l = 0
    try:
        l = len(seq)
    except TypeError:
        # NOTE: This probably means very short non-len-able sequences
        # will not be spread as well as they should, but I'm not
        # sure what else to do.
        l = 100
    for element in seq:
        try:
            h = hash(element)
        except TypeError:
            h = _test_xor(element)
        result ^= h
        result *= mult
        mult += 82520 + l + l
    result += 97531
    return result
def test_xor(seq):
    return _test_xor(seq) ^ hash(type(seq))
hashtest.py:
import marshal
import random
import timeit
import pyximport
pyximport.install()
import _hashtest
def test_str(seq):
    return hash(str(seq))
def test_marshal(seq):
    return hash(marshal.dumps(seq))
def test_cy_xor(seq):
    return _hashtest.test_xor(seq)
# This one is so slow that I don't bother to test it...
def test_xor(seq):
    result = hash(type(seq))
    for i, element in enumerate(seq):
        try:
            result ^= hash((i, element))
        except TypeError:
            # unhashable element (e.g. a nested list): recurse into it
            result ^= hash((i, test_xor(element)))
    return result
smalltest = [1,2,3]
bigtest = [random.randint(10000, 20000) for _ in range(10000)]
def run():
    for seq in smalltest, bigtest:
        for f in test_str, test_marshal, test_cy_xor:
            print('%16s(%5d): %9f' % (f.__name__, len(seq),
                                      timeit.timeit(lambda: f(seq), number=10000)))
if __name__ == '__main__':
    run()
Output:
test_str( 3): 0.014489
test_marshal( 3): 0.008746
test_cy_xor( 3): 0.004686
test_str(10000): 8.563252
test_marshal(10000): 2.744564
test_cy_xor(10000): 0.904398
Here are some potential ways to make this faster:
If you have lots of deep sequences, instead of using try around hash, call PyObject_Hash and check for -1.
If you know you have a sequence (or, even better, specifically a list), instead of just an iterable, PySequence_ITEM (or PyList_GET_ITEM) is probably going to be faster than the PyIter_Next implicitly used above.
In either case, once you start calling C API calls, it's usually easier to drop Cython and just write the function in C. (You can still use Cython to write a trivial wrapper around that C function, instead of manually coding up the extension module.) And at that point, just borrow the tuplehash code directly instead of reimplementing the same algorithm.
If you're looking for a way to avoid the O(N) in the first place, that's just not possible. If you look at how tuple.__hash__, frozenset.__hash__, and ImmutableSet.__hash__ work (the last one is pure Python and very readable, by the way), they all take O(N). However, they also all cache the hash values. So, if you're frequently hashing the same tuple (rather than non-identical-but-equal ones), it approaches constant time. (It's O(N/M), where M is the number of times you call with each tuple.)
If you can assume that your list objects never mutate between calls, you can obviously do the same thing, e.g., with a dict mapping id to hash as an external cache. But in general, that obviously isn't a reasonable assumption. (If your list objects never mutate, it would be easier to just switch to tuple objects and not bother with all this complexity.)
But you can wrap up your list objects in a subclass that adds a cached hash value member (or slot), and invalidates the cache whenever it gets a mutating call (append, __setitem__, __delitem__, etc.). Then your hash_seq can check for that.
The end result is the same correctness and performance as with tuples: amortized O(N/M), except that for tuple M is the number of times you call with each identical tuple, while for list it's the number of times you call with each identical list without mutating in between.

You could try a couple of things:
Using marshal.dumps instead of str might be slightly faster (at least on my machine):
>>> timeit.timeit("marshal.dumps([1,2,3])","import marshal", number=10000)
0.008287056301007567
>>> timeit.timeit("str([1,2,3])",number=10000)
0.01709315717356219
Also, if your functions are expensive to compute, and could possibly return None themselves, then your memoizing function will be re-computing them each time (I'm possibly reaching here, but without knowing more I can only guess).
Incorporating these 2 things gives:
import marshal
cache = {}
def memoize(f):
    """Memoize any function."""
    def decorated(*args):
        key = (f, marshal.dumps(args))
        if key in cache:
            return cache[key]
        cache[key] = f(*args)
        return cache[key]
    return decorated
