Python optimisations in this code?

I have two fairly simple code snippets, and I'm running both of them a very large number of times; I'm trying to determine whether there's any optimisation I can do to speed up the execution time. Is there anything that stands out as something that could be done a lot quicker?
In the first one, we've got a list, fields. We've also got a list of lists, weights. We're trying to find which weight list, multiplied element-wise by fields, will produce the maximum sum. fields is about 30k entries long.
def find_best(weights, fields):
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = 0
        for i in range(num_fields):
            score += float(fields[i]) * weights[c][i]
        if score > best:
            best = score
            winner = c
    return winner
In the second one, we're trying to update two of our weight lists; one gets increased and one decreased. The amount to increase/decrease each element in the list by is equal to the corresponding element in fields (e.g. if fields[4] = 10.5, then we want to increase weights[toincrease][4] by 10.5 and decrease weights[todecrease][4] by 10.5).
def update_weights(weights, fields, toincrease, todecrease):
    for i in range(num_fields):
        update = float(fields[i])
        weights[toincrease][i] += update
        weights[todecrease][i] -= update
    return weights
I hope this isn't an overly specific question.

When you are trying to optimise, the thing you have to do is profile and measure! Python provides the timeit module which makes measuring things easy!
This will assume that you've converted fields to a list of floats beforehand (outside any of these functions), since the string → float conversion is very slow. You can do this via fields = [float(f) for f in string_fields].
Also, for doing numerical processing, pure python isn't very good, since it ends up doing a lot of type-checking (and some other stuff) for each operation. Using a C library like numpy will give massive improvements.
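For example, the one-off conversion might look like the sketch below (this is my own illustration, not part of the original code; string_fields and weights stand in for however the data actually arrives):
import numpy as np

string_fields = ["1.5", "2.0", "0.25"]        # placeholder for the real input strings
weights = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # placeholder for the real weight lists

fields = np.array([float(f) for f in string_fields])  # string -> float happens exactly once
weights = np.asarray(weights, dtype=float)            # 2-D array: one row per category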
find_best
I have incorporated the answers of others (and a few more) into a profiling suite (say, test_find_best.py):
import random, operator, numpy as np, itertools, timeit
fields = [random.random() for _ in range(3000)]
fields_string = [str(field) for field in fields]
weights = [[random.random() for _ in range(3000)] for c in range(100)]
npw = np.array(weights)
npf = np.array(fields)
num_fields = len(fields)
num_category = len(weights)
def f_original():
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = 0
        for i in range(num_fields):
            score += float(fields_string[i]) * weights[c][i]
        if score > best:
            best = score
            winner = c

def f_original_no_string():
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = 0
        for i in range(num_fields):
            score += fields[i] * weights[c][i]
        if score > best:
            best = score
            winner = c

def f_original_xrange():
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = 0
        for i in xrange(num_fields):
            score += fields[i] * weights[c][i]
        if score > best:
            best = score
            winner = c

# Zenon http://stackoverflow.com/a/10134298/1256624
def f_index_comprehension():
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = sum(fields[i] * weights[c][i] for i in xrange(num_fields))
        if score > best:
            best = score
            winner = c

# steveha http://stackoverflow.com/a/10134247/1256624
def f_comprehension():
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = sum(f * w for f, w in itertools.izip(fields, weights[c]))
        if score > best:
            best = score
            winner = c

def f_schwartz_original():  # https://en.wikipedia.org/wiki/Schwartzian_transform
    tup = max(((i, sum(t[0] * t[1] for t in itertools.izip(fields, wlist))) for i, wlist in enumerate(weights)),
              key=lambda t: t[1]
              )

def f_schwartz_opt():  # https://en.wikipedia.org/wiki/Schwartzian_transform
    tup = max(((i, sum(f * w for f, w in itertools.izip(fields, wlist))) for i, wlist in enumerate(weights)),
              key=operator.itemgetter(1)
              )

def fweight(field_float_list, wlist):
    f = iter(field_float_list)
    return sum(f.next() * w for w in wlist)

def f_schwartz_iterate():
    tup = max(
        ((i, fweight(fields, wlist)) for i, wlist in enumerate(weights)),
        key=lambda t: t[1]
    )

# Nolen Royalty http://stackoverflow.com/a/10134147/1256624
def f_numpy_mult_sum():
    np.argmax(np.sum(npf * npw, axis=1))

# me
def f_imap():
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = sum(itertools.imap(operator.mul, fields, weights[c]))
        if score > best:
            best = score
            winner = c

def f_numpy():
    np.argmax(npw.dot(npf))

for f in [f_original,
          f_index_comprehension,
          f_schwartz_iterate,
          f_original_no_string,
          f_schwartz_original,
          f_original_xrange,
          f_schwartz_opt,
          f_comprehension,
          f_imap]:
    print "%s: %.2f ms" % (f.__name__, timeit.timeit(f, number=10) / 10 * 1000)

for f in [f_numpy_mult_sum, f_numpy]:
    print "%s: %.2f ms" % (f.__name__, timeit.timeit(f, number=100) / 100 * 1000)
Running python test_find_best.py gives me:
f_original: 310.34 ms
f_index_comprehension: 102.58 ms
f_schwartz_iterate: 103.39 ms
f_original_no_string: 96.36 ms
f_schwartz_original: 90.52 ms
f_original_xrange: 89.31 ms
f_schwartz_opt: 69.48 ms
f_comprehension: 68.87 ms
f_imap: 53.33 ms
f_numpy_mult_sum: 3.57 ms
f_numpy: 0.62 ms
So the numpy version using .dot (i.e. numpy.dot) is the fastest. If you are doing a lot of numerical operations (which it seems you are), it's worth converting fields and weights to numpy arrays as soon as you create them.
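For reference, a drop-in numpy version of find_best might look something like this sketch (untested here, and it assumes npw and npf are arrays built as in the profiling suite above):
import numpy as np

def find_best_np(npw, npf):
    # npw has shape (num_category, num_fields); npf has shape (num_fields,).
    scores = npw.dot(npf)          # one matrix-vector product replaces the nested loops
    return int(np.argmax(scores))  # index of the best-scoring weight row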
update_weights
Numpy is likely to offer a similar speed-up for update_weights, doing something like:
def update_weights(weights, fields, to_increase, to_decrease):
    weights[to_increase, :] += fields
    weights[to_decrease, :] -= fields
    return weights
(I haven't tested or profiled that btw, you need to do that.)
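If you want to check it yourself, a minimal timing sketch could look like the following (the array sizes are made up, so adjust them to your real data):
import timeit
import numpy as np

npw = np.random.rand(100, 30000)   # stand-in for the real weights array
npf = np.random.rand(30000)        # stand-in for the real fields array

def update_weights_np():
    npw[0, :] += npf
    npw[1, :] -= npf

print("%.3f ms per call" % (timeit.timeit(update_weights_np, number=1000) / 1000 * 1000))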

I think you could get a pretty big speed boost using numpy. Stupidly simple example:
>>> fields = numpy.array([1, 4, 1, 3, 2, 5, 1])
>>> weights = numpy.array([[.2, .3, .4, .2, .1, .5, .9], [.3, .1, .1, .9, .2, .4, .5]])
>>> fields * weights
array([[ 0.2,  1.2,  0.4,  0.6,  0.2,  2.5,  0.9],
       [ 0.3,  0.4,  0.1,  2.7,  0.4,  2. ,  0.5]])
>>> result = _
>>> numpy.argmax(numpy.sum(result, axis=1))
1
>>> result[1]
array([ 0.3, 0.4, 0.1, 2.7, 0.4, 2. , 0.5])

If you are running Python 2.x, I would use xrange() rather than range(); it uses less memory because it doesn't generate a list.
This is assuming you want to keep the current code structure.

First, if you are using Python 2.x, you can gain some speed by using xrange() instead of range(). In Python 3.x there is no xrange(), but the built-in range() is basically the same as xrange().
Next, if we are going for speed, we need to write less code, and rely more on Python's built-in features (that are written in C for speed).
You could speed things up by using a generator expression inside of sum() like so:
from itertools import izip
def find_best(weights, fields):
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = sum(float(t[0]) * t[1] for t in izip(fields, weights[c]))
        if score > best:
            best = score
            winner = c
    return winner
Applying the same idea again, let's try to use max() to find the best result. I think this code is ugly to look at, but if you benchmark it and it's enough faster, it might be worth it:
from itertools import izip
def find_best(weights, fields):
    tup = max(
        ((i, sum(float(t[0]) * t[1] for t in izip(fields, wlist))) for i, wlist in enumerate(weights)),
        key=lambda t: t[1]
    )
    return tup[0]
Ugh! But if I didn't make any mistakes, this does the same thing, and it should rely a lot on the C machinery in Python. Measure it and see if it is faster.
So, we are calling max(). We give it a generator expression, and it will find the maximum value returned from that generator expression. But you want the index of the best value, so the generator expression returns a tuple: index and weight value. We pass the generator expression as the first argument, and the second argument must be a key function that looks at the weight value from the tuple and ignores the index. Since the generator expression is not the only argument to max(), it needs to be in parentheses. It builds a tuple of i and the weight computed by the same sum() we used above. Finally, once we get a tuple back from max(), we index it to get the index value, and return that.
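Stripped of the weight calculation, the pattern is just this (a toy example of my own, not from the original code):
scores = [(0, 3.5), (1, 9.2), (2, 7.1)]   # (index, score) pairs from some generator
best = max(scores, key=lambda t: t[1])    # compare on the score, but keep the whole tuple
print(best[0])                            # -> 1, the index of the highest score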
We can make this much less ugly if we break out a function. This adds the overhead of a function call, but if you measure it I'll bet it isn't too much slower. Also, now that I think about it, it makes sense to build a list of fields values already pre-coerced to float; then we can use that multiple times. Also, instead of using izip() to iterate over two lists in parallel, let's just make an iterator and explicitly ask it for values. In Python 2.x we use the .next() method function to ask for a value; in Python 3.x you would use the next() built-in function.
def fweight(field_float_list, wlist):
    f = iter(field_float_list)
    return sum(f.next() * w for w in wlist)

def find_best(weights, fields):
    flst = [float(x) for x in fields]
    tup = max(
        ((i, fweight(flst, wlist)) for i, wlist in enumerate(weights)),
        key=lambda t: t[1]
    )
    return tup[0]
If there are 30K fields values, then pre-computing the float() values is likely to be a big speed win.
EDIT: I missed one trick. Instead of the lambda function, I should have used operator.itemgetter() like some of the code in the accepted answer. Also, the accepted answer timed things, and it does look like the overhead of the function call was significant. But the Numpy answers were so much faster that it's not worth playing with this answer anymore.
As for the second part, I don't think it can be sped up very much. I'll try:
def update_weights(weights, fields, toincrease, todecrease):
    w_inc = weights[toincrease]
    w_dec = weights[todecrease]
    for i, f in enumerate(fields):
        f = float(f)  # see note below
        w_inc[i] += f
        w_dec[i] -= f
So, instead of iterating over an xrange(), here we use enumerate() to iterate over the fields values (and their indices) directly. We still have a line that coerces to float.
Note that if the weights values are already float, we don't really need to coerce to float here, and we can save time by just deleting that line.
Your code was indexing the weights list four times: twice to do the increment, twice to do the decrement. This code does the first index (using the toincrease or todecrease argument) just once. It still has to index by i in order for += to work. (My first version tried to avoid this with an iterator and didn't work. I should have tested before posting. But it's fixed now.)
One last version to try: instead of incrementing and decrementing values as we go, just use list comprehensions to build a new list with the values we want:
def update_weights(weights, field_float_list, toincrease, todecrease):
    f = iter(field_float_list)
    weights[toincrease] = [x + f.next() for x in weights[toincrease]]
    f = iter(field_float_list)
    weights[todecrease] = [x - f.next() for x in weights[todecrease]]
This assumes you have already coerced all the fields values to float, as shown above.
Is it faster, or slower, to replace the whole list this way? I'm going to guess faster, but I'm not sure. Measure and see!
Oh, I should add: note that my version of update_weights() shown above does not return weights. This is because in Python it is considered a good practice to not return a value from a function that mutates a data structure, just to make sure that nobody ever gets confused about which functions do queries and which functions change things.
http://en.wikipedia.org/wiki/Command-query_separation
Measure measure measure. See how much faster my suggestions are, or are not.

An easy optimisation is to use xrange instead of range. xrange returns a lazy sequence object that yields values one by one as you iterate over it, whereas range first creates the entire (30,000-item) list as a temporary object, using more memory and CPU cycles.
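A quick way to see the difference on Python 2.x (just an illustration of my own):
big = range(30000)     # builds a real 30,000-element list up front
lazy = xrange(30000)   # a lazy sequence object; values are produced as you iterate

print(type(big))       # <type 'list'>
print(type(lazy))      # <type 'xrange'>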

As @Levon says, xrange() in Python 2.x is a must. Also, if you are on Python 2.4+ you can use a generator expression (thanks @steveha), which works much like a list comprehension, for your inner loop. For example, this:
for i in range(num_fields):
    score += float(fields[i]) * weights[c][i]
is equivalent to:
score = sum(float(fields[i]) * weights[c][i] for i in xrange(num_fields))
Also, in general, there is a great page on the Python wiki about simple but effective optimisation tricks!

Related

Improving sum() calculations on evenly spaced list

I have code where I need 10 million evenly spaced numbers between 0 and 1, and a logic function which is responsible for picking a random index and returning the sum of the numbers from that index to the end of the list.
The code looks like this:
import random
import numpy as np
ten_million = np.linspace(0.0, 1.0, 10000000)
def deep_dive_logic():
    # this pick is derived from good logic; however, let's just use random here for demonstration
    pick = random.randint(0, 10000000)
    return sum(ten_million[pick:])

for _ in range(2500):
    r = deep_dive_logic()
    print(r)
    # more logic ahead...
The problem here is that calling sum() on a slice of this size in a loop takes approx. 1.3 s per result.
Is there any efficient way to reduce the 1.3 s wait per call? I also tried creating a kind of cache dictionary, but deep_dive_logic() runs in a multi-process environment, so the dictionary would need to be shared between processes; neither Redis nor json.dump is an option, because the dictionary is around 236 MB in size and would add inter-process communication overhead if it isn't cached.
sums_dict = {0: sum(ten_million)}
even_difference = (ten_million[1] - ten_million[0])
for i in range(len(ten_million) - 1):
    sums_dict[i+1] = sums_dict[i] - (even_difference * (i+1))
I need help with either caching the 10-million-entry dictionary, or an alternative formula that returns the result without using sum(), or any other out-of-the-box solution.
https://repl.it/repls/HoneydewGoldenShockwave
np.sum(ten_million) does it in about 0.005 seconds, whereas sum(ten_million) is about 1.5 seconds on my machine
As for a solution without using any out of the box functions, as suggested in the comments to your question by MrT, you can use the property of arithmetic progressions, which says that the sum of a progression is equal to n(a1+an) / 2, where n is the number of elements (10000000), a1 is your first element (0), and an is your last element (1). In your example, this is 10000000(0+1) / 2 = 5000000
so, for your deep_dive_logic function, just return that:
def deep_dive_logic():
    pick = random.randint(0, 10000000 - 1)  # randint is inclusive at both ends, so stay within the array
    return (len(ten_million) - pick) * (ten_million[pick] + ten_million[-1]) / 2
This also does the job extremely fast, in fact much faster than np.sum: on average, the arithmetic-progression calculation took 1.223e-06 seconds, whereas np.sum took 0.00577 seconds on my machine. That makes sense, seeing how it's just one addition, one multiplication, and one division...
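If you want to convince yourself the closed form is right, a quick sanity check against np.sum might look like this (a sketch of my own; the difference should come down to floating-point rounding only):
import random
import numpy as np

ten_million = np.linspace(0.0, 1.0, 10000000)
pick = random.randint(0, len(ten_million) - 1)

closed_form = (len(ten_million) - pick) * (ten_million[pick] + ten_million[-1]) / 2
brute_force = np.sum(ten_million[pick:])

print(abs(closed_form - brute_force))  # should be tiny (floating-point rounding only)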
Do it analytically:
def cumm_sum(start, finish, steps, k):
    step = (finish - start) / steps
    pop = (finish - k) / step
    return (pop + 1) * 0.5 * (k + finish)
and the call would be like:
pick = ten_million[random.randint(0, 10000000 - 1)]  # randint is inclusive at both ends
result = cumm_sum(0.0, 1.0, 10000000, pick)
use math to reduce the problem complexity:
The sum of an arithmetic progression is given by
(m+n)*(m-n+1)*0.5
use np.vectorize to speed up the array operation:
ten_m = 10000000

def sum10m_py(n):
    return (1 + n) * (ten_m - n * ten_m + 1) * 0.5

sum_np = np.vectorize(sum10m_py)
pick the elements you want, and then apply the vectorized function on it.
mask = np.random.randint(0,ten_m,2500)
sums = sum_np(ten_million[mask])

Faster looping with itertools

I have a function
def getSamples():
    p = lambda x: mlab.normpdf(x, 3, 2) + mlab.normpdf(x, -5, 1)
    q = lambda x: mlab.normpdf(x, 5, 14)
    k = 30
    goodSamples = []
    rightCount = 0
    totalCount = 0
    while rightCount < 100000:
        z0 = np.random.normal(5, 14)
        u0 = np.random.uniform(0, k * q(z0))
        if p(z0) > u0:
            goodSamples.append(z0)
            rightCount += 1
        totalCount += 1
    return np.array(goodSamples)
My implementation to generate 100,000 samples is taking too long. How can I make it faster with itertools or something similar?
I would say that the secret to making this code faster does not lie in changing the loop syntax. Here are a few points:
np.random.normal has an additional parameter size that lets you get many values at once. I would suggest using an array of say 1E09 elements and then checking your condition on that for how many are good. You can then estimate how likely that is.
To create your uniform samples, why not use sympy for symbolic evaluation of the pdf? (I don't know if this is faster but it could be since you already know the mean and variance.)
Again, for p could you use a symbolic function?
In general, performance problems are caused by doing things the "wrong way". Numpy can be very fast when used as it is designed to be used, that is, by exploiting its vector processing, where vectorized operations are handed off to compiled code. Two bad practices that come from other programming languages/approaches are:
Loops: whenever you think you need a loop, stop and think. Most of the time you do not, and in fact do not even want one. It is much faster both to write and to run code without loops.
Memory allocation: Whenever you know the size of an object, preallocate space for it. Growing memory, particularly in Python lists, is very slow compared to the alternatives.
In this case it is easy to get (approximately) two orders of magnitude speedup; the tradeoff is more memory usage.
Below is some representative code, it is not meant to be blindly used. I have not even verified it produces the correct results. It is more or less a direct translation of your routine. It appears you are drawing random numbers from a probability distribution using the rejection method. There may be more efficient algorithms to do this for your probability distribution.
def getSamples2():
    p = lambda x: mlab.normpdf(x, 3, 2) + mlab.normpdf(x, -5, 1)
    q = lambda x: mlab.normpdf(x, 5, 14)
    k = 30
    N = 100000                 # Total number of samples we want
    Ngood = 0                  # Current number of good samples
    goodSamples = np.zeros(N)  # Storage for the good samples
    while Ngood < N:           # Unfortunately a loop, ....
        z0 = np.random.normal(5, 14, size=N)
        u0 = np.random.uniform(size=N) * k * q(z0)
        ind, = np.where(p(z0) > u0)
        n = min(len(ind), N - Ngood)
        goodSamples[Ngood:Ngood+n] = z0[ind[:n]]
        Ngood += n
    return goodSamples
This generates random numbers in chunks and saves the good ones. I have not tried to optimize the chunk size (here I just use N, the total number we want, in principle this could/should be different and could even be adjusted based on the number we have left to generate). This still uses a loop, unfortunately, but now this will be run "tens" of times instead of 100,000 times. This also uses the where function and array slicing; these are good general tools to be comfortable with.
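For anyone unfamiliar with them, here is roughly what those two pieces do in isolation (toy values of my own, not from the answer):
import numpy as np

a = np.array([1.0, 250.0, 3.0, 200.0])
ind, = np.where(a > 100.0)   # indices where the condition holds: array([1, 3])
print(a[ind[:1]])            # take only as many of those as we still need (here just the first: 250.0)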
In one test with %timeit on my machine I found
In [27]: %timeit getSamples() # Original routine
1 loops, best of 3: 49.3 s per loop
In [28]: %timeit getSamples2()
1 loops, best of 3: 505 ms per loop
Here is some itertools "magic", but I'm not sure it can help. It's probably much better for performance to prepare a numpy array (using zeros) and fill it, rather than growing a Python list. Here are both the itertools version and the zero-preallocation. (Excuse me in advance for untested code.)
from itertools import count, ifilter, imap, takewhile
import operator

def getSamples():
    p = lambda x: mlab.normpdf(x, 3, 2) + mlab.normpdf(x, -5, 1)
    q = lambda x: mlab.normpdf(x, 5, 14)
    k = 30
    n = 100000
    samples_iter = imap(
        operator.itemgetter(1),
        takewhile(
            lambda pair: pair[0] < n,
            enumerate(
                ifilter(lambda z: p(z) > np.random.uniform(0, k * q(z)),
                        (np.random.normal(5, 14) for _ in count()))
            )))
    goodSamples = np.zeros(n)
    # set values from the iterator; probably there is a better way for that
    for i, sample in enumerate(samples_iter):
        goodSamples[i] = sample
    return goodSamples

Need a fast way to count and sum an iterable in a single pass

Can any one help me? I'm trying to come up with a way to compute
>>> sum_widths = sum(col.width for col in cols if not col.hide)
and also count the number of items in this sum, without having to make two passes over cols.
It seems unbelievable, but after scanning the std-lib (built-in functions, itertools, functools, etc.), I couldn't even find a function which would count the number of members in an iterable. I found the function itertools.count, which sounds like what I want, but it's really just a deceptively named range function.
After a little thought I came up with the following (which is so simple that the lack of a library function may be excusable, except for its obtuseness):
>>> visable_col_count = sum(col is col for col in cols if not col.hide)
However, using these two functions requires two passes of the iterable, which just rubs me the wrong way.
As an alternative, the following function does what I want:
>>> def count_and_sum(iter):
>>>     count = sum = 0
>>>     for item in iter:
>>>         count += 1
>>>         sum += item
>>>     return count, sum
The problem with this is that it takes 100 times as long (according to timeit) as the generator-expression form of sum().
If anybody can come up with a simple one-liner which does what I want, please let me know (using Python 3.3).
Edit 1
Lots of great ideas here, guys. Thanks to all who replied. It will take me a while to digest all these answers, but I will and I will try to pick one to check.
Edit 2
I repeated the timings on my two humble suggestions (count_and_sum function and 2 separate sum functions) and discovered that my original timing was way off, probably due to an auto-scheduled backup process running in the background.
I also timed most of the excellent suggestions given as answers here, all with the same model. Analysing these answers has been quite an education for me: new uses for deque, enumerate and reduce and first time for count and accumulate. Thanks to all!
Here are the results (from my slow netbook) using the software I'm developing for display:
┌───────────────────────────────────────────────────────┐
│                  Count and Sum Timing                  │
├──────────────────────────┬───────────┬────────────────┤
│ Method                   │Time (usec)│Time (% of base)│
├──────────────────────────┼───────────┼────────────────┤
│count_and_sum (base)      │        7.2│            100%│
│Two sums                  │        7.5│            104%│
│deque enumerate accumulate│        7.3│            101%│
│max enumerate accumulate  │        7.3│            101%│
│reduce                    │        7.4│            103%│
│count sum                 │        7.3│            101%│
└──────────────────────────┴───────────┴────────────────┘
(I didn't time the complex and fold methods as being just too obscure, but thanks anyway.)
Since there's very little difference in timing among all these methods I decided to use the count_and_sum function (with an explicit for loop) as being the most readable, explicit and simple (Python Zen) and it also happens to be the fastest!
I wish I could accept one of these amazing answers as correct but they are all equally good though more or less obscure, so I'm just up-voting everybody and accepting my own answer as correct (count_and_sum function) since that's what I'm using.
What was that about "There should be one-- and preferably only one --obvious way to do it."?
Using complex numbers
z = [1, 2, 4, 5, 6]
y = sum(x + 1j for x in z)
sum_z, count_z = y.real, int(y.imag)
print sum_z, count_z
18.0 5
I don't know about speed, but this is kind of pretty:
>>> from itertools import accumulate
>>> it = range(10)
>>> max(enumerate(accumulate(it), 1))
(10, 45)
Adaptation of DSM's answer, using deque(..., maxlen=1) to save memory.
import itertools
from collections import deque
deque(enumerate(itertools.accumulate(x), 1), maxlen=1)
timing code in ipython:
import itertools, random
from collections import deque

def count_and_sum(iter):
    count = sum = 0
    for item in iter:
        count += 1
        sum += item
    return count, sum

X = [random.randint(0, 10) for _ in range(10**7)]
%timeit count_and_sum(X)
%timeit deque(enumerate(itertools.accumulate(X), 1), maxlen=1)
%timeit (max(enumerate(itertools.accumulate(X), 1)))
results: now faster than OP's method
1 loops, best of 3: 1.08 s per loop
1 loops, best of 3: 659 ms per loop
1 loops, best of 3: 1.19 s per loop
Here's some timing data that might be of interest:
import timeit

setup = '''
import random, functools, itertools, collections
x = [random.randint(0, 10) for _ in range(10**5)]

def count_and_sum(it):
    c, s = 0, 0
    for i in it:
        c += 1
        s += i
    return c, s

def two_pass(it):
    return sum(i for i in it), sum(True for i in it)

def functional(it):
    return functools.reduce(lambda pair, x: (pair[0]+1, pair[1]+x), it, [0, 0])

def accumulator(it):
    return max(enumerate(itertools.accumulate(it), 1))

def complex(it):
    cpx = sum(x + 1j for x in it)
    return cpx.real, int(cpx.imag)

def dequed(it):
    return collections.deque(enumerate(itertools.accumulate(it), 1), maxlen=1)
'''

number = 100
for stmt in ['count_and_sum(x)',
             'two_pass(x)',
             'functional(x)',
             'accumulator(x)',
             'complex(x)',
             'dequed(x)']:
    print('{:.4}'.format(timeit.timeit(stmt=stmt, setup=setup, number=number)))
Result:
3.404 # OP's one-pass method
3.833 # OP's two-pass method
8.405 # Timothy Shields's fold method
3.892 # DSM's accumulate-based method
4.946 # 1_CR's complex-number method
2.002 # M4rtini's deque-based modification of DSM's method
Given these results, I'm not really sure how the OP is seeing a 100x slowdown with the one-pass method. Even if the data looks radically different from a list of random integers, that just shouldn't happen.
Also, M4rtini's solution looks like the clear winner.
To clarify, these results are in CPython 3.2.3. For a comparison to PyPy3, see James_pic's answer, which shows some serious gains from JIT compilation for some methods (also mentioned in a comment by M4rtini).
As a follow-up to senshin's answer, it's worth noting that the performance differences are largely due to quirks in CPython's implementation that make some methods slower than others (for example, for loops are relatively slow in CPython). I thought it would be interesting to try the exact same test in PyPy (using PyPy3 2.1 beta), which has different performance characteristics. In PyPy the results are:
0.6227 # OP's one-pass method
0.8714 # OP's two-pass method
1.033 # Timothy Shields's fold method
6.354 # DSM's accumulate-based method
1.287 # 1_CR's complex-number method
3.857 # M4rtini's deque-based modification of DSM's method
In this case, the OP's one-pass method is fastest. This makes sense, as it's arguably the simplest (at least from a compiler's point of view) and PyPy can eliminate many of the overheads by inlining method calls, which CPython can't.
For comparison, CPython 3.3.2 on my machine give the following:
1.651 # OP's one-pass method
1.825 # OP's two-pass method
3.258 # Timothy Shields's fold method
1.684 # DSM's accumulate-based method
3.072 # 1_CR's complex-number method
1.191 # M4rtini's deque-based modification of DSM's method
You can keep count inside the sum with tricks similar to this
>>> from itertools import count
>>> cnt = count()
>>> sum((next(cnt), x)[1] for x in range(10) if x%2)
25
>>> next(cnt)
5
But it's probably going to be more readable to just use a for loop
You can use this:
from itertools import count
lst = range(10)
c = count(1)
tot = sum(next(c) and x for x in lst if x % 2)
n = next(c)-1
print(n, tot)
# 5 25
It's kind of a hack, but it works perfectly well.
I don't know what the Python syntax is off hand, but you could potentially use a fold. Something like this:
(count, total) = fold((0, 0), lambda pair, x: (pair[0] + 1, pair[1] + x))
The idea is to use a seed of (0,0) and then at each step add 1 to the first component and the current number to the second component.
For comparison, you could implement sum as a fold as follows:
total = fold(0, lambda t, x: t + x)
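In Python, one way to spell that fold is functools.reduce; here is a minimal sketch of the idea above (my wording, not the original poster's):
from functools import reduce

items = [10, 20, 30]
count, total = reduce(lambda pair, x: (pair[0] + 1, pair[1] + x), items, (0, 0))
print(count, total)  # -> 3 60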
1_CR's complex number solution is cute but overly hacky. The reason it works is that a complex number behaves like a 2-tuple that sums elementwise. The same is true of numpy arrays, and I think it's slightly cleaner to use those:
import numpy as np
z = [1, 2, 4, 5, 6]
y = sum(np.array([x, 1]) for x in z)
sum_z, count_z = y[0], y[1]
print sum_z, count_z
18 5
Something else to consider: If it is possible to determine a minimum possible count, we can let the efficient built-in sum do part of the work:
from itertools import islice

def count_and_sum(iterable):
    ...  # insert favorite implementation here

def count_and_sum_with_min_count(iterable, min_count):
    iterator = iter(iterable)
    slice_sum = sum(islice(iterator, None, min_count))
    rest_count, rest_sum = count_and_sum(iterator)
    return min_count + rest_count, slice_sum + rest_sum
For example, using the deque method with a sequence of 10000000 items and min_count of 5000000, timing results are:
count_and_sum: 1.03
count_and_sum_with_min_count: 0.63
Thanks for all the great answers, but I decided to use my original count_and_sum function, called as follows:
>>> cc, cs = count_and_sum(c.width for c in cols if not c.hide)
As explained in the edits to my original question this turned out to be the fastest and most readable solution.
How about this?
It seems to work.
from functools import reduce

class Column:
    def __init__(self, width, hide):
        self.width = width
        self.hide = hide

lst = [Column(10, False), Column(100, False), Column(1000, True), Column(10000, False)]
print(reduce(lambda acc, col: Column(col.width + acc.width, False) if not col.hide else acc, lst, Column(0, False)).width)
You might only need the sum & count today, but who knows what you'll need tomorrow!
Here's an easily extensible solution:
def fold_parallel(itr, **fs):
    res = {
        k: zero for k, (zero, f) in fs.items()
    }
    for x in itr:
        for k, (_, f) in fs.items():
            res[k] = f(res[k], x)
    return res

from operator import add

print(fold_parallel([1, 2, 3],
                    count=(0, lambda a, b: a + 1),
                    sum=(0, add),
                    ))
# {'count': 3, 'sum': 6}

How to check if a number is within a group of non-consecutive ranges?

The question may sound complicated, but it's actually pretty simple; I just can't find a nice solution in Python.
I have ranges like
("8X5000", "8X5099"). Here X can be any digit, so I want to match numbers that fall into one of the ranges (805000..805099 or 815000..815099 or ... 895000..895099).
How can I do this?
@TimPietzcker's answer is correct and Pythonic, but it raises some performance concerns (arguably making it even more Pythonic). It creates an iterator that then searches for a value, and I don't expect Python to be able to optimize that search.
This should perform better:
def IsInRange(n, r=("8X5000", "8X5099")):
    (minr, maxr) = [[int(i) for i in l.split('X')] for l in r]
    p = len(r[0]) - r[0].find('X')
    nl = (n // 10**p, n % 10**(p-1))
    fInRange = all([minr[i] <= nl[i] <= maxr[i] for i in range(2)])
    return fInRange
The second line inside the function is a nested list comprehension, so it may be a little hard to read, but it sets:
minr = [8, 5000]
maxr = [8, 5099]
When n = 595049:
nl = (5, 5049)
The code just splits the ranges into parts (while converting to int), splits the target number into parts, then range checks the parts. It wouldn't be hard to enhance this to handle multiple X's in the range specifiers.
Update
I just tested relative performance using timeit:
def main():
    t1 = timeit.timeit('MultiRange.in_range(985000)', setup='import MultiRange', number=10000)
    t2 = timeit.timeit('MultiRange.IsInRange(985000)', setup='import MultiRange', number=10000)
    print t1, t2
    print float(t2)/float(t1), 1 - float(t2)/float(t1)
On my 32-bit Win 7 machine running Python 2.7.2, my solution is almost 10 times faster than @TimPietzcker's (to be specific, it runs in 12% of the time). As you increase the size of the range, it only gets worse. When:
ranges=("8X5000", "8X5999")
The performance boost is 50x. Even for the smallest range, my version runs 4 times faster.
With @PaulMcGuire's suggested performance patch to in_range, my version runs 3 times faster.
Update 2
Motivated by @PaulMcGuire's comment, I went ahead and refactored our functions into classes. Here's mine:
class IsInRange5(object):
    def __init__(self, r=("8X5000", "8X5099")):
        ((self.minr0, self.minr1), (self.maxr0, self.maxr1)) = [[int(i) for i in l.split('X')] for l in r]
        pos = len(r[0]) - r[0].find('X')
        self.basel = 10**(pos-1)
        self.baseh = self.basel*10
        self.ir = range(2)

    def __contains__(self, n):
        return self.minr0 <= n // self.baseh <= self.maxr0 and \
               self.minr1 <= n % self.basel <= self.maxr1
This did close the gap, but even after pre-computing range invariants (for both), @PaulMcGuire's took 50% longer.
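Usage is then just the in operator, since __contains__ handles the membership test (a quick sketch of my own):
checker = IsInRange5()    # defaults to the ("8X5000", "8X5099") ranges
print(815049 in checker)  # True: matches 8X5000..8X5099 with X = 1
print(985000 in checker)  # False: the leading 9 falls outside the pattern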
range = (80555,80888)
x = 80666
print range[0] < x < range[1]
Maybe this is what you're looking for...
Example for Python 3 (in Python 2, use xrange instead of range):
def in_range(number, ranges=("8X5000", "8X5099")):
    actual_ranges = ((int(ranges[0].replace("X", digit)),
                      int(ranges[1].replace("X", digit)) + 1)
                     for digit in "0123456789")
    return any(number in range(*interval) for interval in actual_ranges)
Results:
>>> in_range(805001)
True
>>> in_range(895099)
True
>>> in_range(805100)
False
An improvement to this, suggested by Paul McGuire (thanks!):
def in_range(number, ranges=("8X5000", "8X5099")):
    actual_ranges = ((int(ranges[0].replace("X", digit)),
                      int(ranges[1].replace("X", digit)))
                     for digit in "0123456789")
    return any(minval <= number <= maxval for minval, maxval in actual_ranges)

How to make this python script fast? (benchmarking related to branch prediction from a post from here)

Starting from here (a branch prediction problem), I wrote the Python version of the program to check the runtime of the sorted/unsorted versions in Python.
I tried sorted first.
Here's the code:
import time
from random import *

arraysize = 327
data = []
for c in range(arraysize):
    data.append(randint(1, 256))

## !!! with this, the next loop runs faster
data.sort()

## test
start_time = time.clock()
sum = 0
for i in range(100000):
    for c in range(arraysize):
        if data[c] >= 128:
            sum += data[c]
print time.clock() - start_time
I'm not sure about the accuracy of my simple timing methodology, but it seems good enough.
When I set arraysize = 32768 I waited for >20 mins the first time!! More than 20 minutes!
But with arraysize = 327, I get a time of 8.141656691 s.
Please correct me if I'm wrong somewhere in my code, or tell me whether using NumPy/SciPy in some way would speed things up.
Thanks.
I started with the answer by @mgilson and reworked it a bit. I wanted to test the "decision bit" and lookup table techniques as discussed in my answer to the original question: https://stackoverflow.com/a/17782979/166949
I made a few changes to the original. Some were just style things that reflect my personal preference. However, I discovered bugs in the original, and I think it's important to measure the correct code.
I made the Python code now take arguments from the command line, and wrote a shell script that runs the Python script using Python 2.x, Python 3.x, and PyPy. The exact versions were: Python 2.7.6, Python 3.4.0, and PyPy 2.7.3
I ran the tests on Linux Mint 17, 64-bit version. The CPU is an AMD Phenom 9850 running at 2.5 GHz with 16 GB of RAM. The Linux kernel version, per uname -r, is: 3.13.0-24-generic
The reason I made the code take arguments from the command line is that 327 is a pretty short list. I figured that sum() and generator expressions would do better when the list was much longer, so I made the list size and number of trials be passed from the command line. The results show which Python was used, and the list length and number of trials.
Conclusion: to my surprise, even with a long list, Python was fastest using sum() with a list comprehension! There is some overhead to running a generator that seems to be slower than the overhead of building a list and then tearing it down.
However, if the list got truly large, I expected the generator would begin to out-perform the list comprehension. With a list of a million random values, the listcomp was still faster, but when I went to 16 million random values, the genexp became faster. And the speed penalty of the generator expression is not large for shorter lists. So I still favor the generator expression as the go-to way to solve this problem in Python.
Interestingly, PyPy was fastest with the table lookup. This makes sense: that was the fastest way I found in C, and PyPy is generating native code from the JIT.
For CPython, with its virtual machine, it is faster to invoke a single operation than several operations; the overhead of the Python VM can outweigh a more expensive fundamental operation. Thus integer division is faster than bit masking plus bit shifting, because the division is a single operation. But in PyPy, the bit masking+shifting is much faster than the division.
Also, in CPython, using sum() lets your code run in the C internals so it can be very fast; but in PyPy, sum() is slower than just writing a straightforward loop that the JIT can turn into a wicked fast native loop. My guess is that the generator machinery is hard for PyPy to grok and optimize away into native code.
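One way to see what the "single operation" point means is to disassemble the two expressions: the division compiles to a single binary-operation bytecode, while the mask-and-shift needs two (a quick illustration of my own; exact opcode names vary between CPython versions):
import dis

dis.dis(lambda n: n // 128)         # one binary-operation bytecode for the division
dis.dis(lambda n: (n & 0x80) >> 7)  # two binary-operation bytecodes: the AND, then the shift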
The shell script:
for P in python python3 pypy; do
    echo "$P ($1, $2)"
    $P test_branches.py $1 $2
    echo
done
The Python code:
import random
import sys
import timeit

try:
    RANGE = xrange
except NameError:
    RANGE = range

if len(sys.argv) != 3:
    print("Usage: python test_branches.py <length_of_array> <number_of_trials>")
    sys.exit(1)

TEST_DATA_LEN = int(sys.argv[1])
NUM_REPEATS = int(sys.argv[2])

_test_data = [random.randint(0, 255) for _ in RANGE(TEST_DATA_LEN)]

def test0(data):
    """original way"""
    total = 0
    for i in RANGE(TEST_DATA_LEN):
        if data[i] >= 128:
            total += data[i]
    return total

def test1(data):
    """better loop"""
    total = 0
    for n in data:
        if n >= 128:
            total += n
    return total

def test2(data):
    """sum + generator"""
    return sum(n for n in data if n >= 128)

def test3(data):
    """sum + listcomp"""
    return sum([n for n in data if n >= 128])

def test4(data):
    """decision bit -- bit mask and shift"""
    lst = [0, 0]
    for n in data:
        lst[(n & 0x80) >> 7] += n
    return lst[1]

def test5(data):
    """decision bit -- division"""
    lst = [0, 0]
    for n in data:
        lst[n // 128] += n
    return lst[1]

_lut = [0 if n < 128 else n for n in RANGE(256)]

def test6(data):
    """lookup table"""
    total = 0
    for n in data:
        total += _lut[n]
    return total

def test7(data):
    """lookup table with sum()"""
    return sum(_lut[n] for n in data)

test_functions = [v for k, v in globals().items() if k.startswith("test")]
test_functions.sort(key=lambda x: x.__name__)

correct = test0(_test_data)

for fn in test_functions:
    name = fn.__name__
    doc = fn.__doc__
    if fn(_test_data) != correct:
        print("{}() not correct!!!".format(name))
    s_call = "{}(_test_data)".format(name)
    s_import = "from __main__ import {},_test_data".format(name)
    t = timeit.timeit(s_call, s_import, number=NUM_REPEATS)
    print("{:7.03f}: {}".format(t, doc))
The results:
python (327, 100000)
3.170: original way
2.211: better loop
2.378: sum + generator
2.188: sum + listcomp
5.321: decision bit -- bit mask and shift
4.977: decision bit -- division
2.937: lookup table
3.464: lookup table with sum()
python3 (327, 100000)
5.786: original way
3.444: better loop
3.286: sum + generator
2.968: sum + listcomp
8.858: decision bit -- bit mask and shift
7.056: decision bit -- division
4.640: lookup table
4.783: lookup table with sum()
pypy (327, 100000)
0.296: original way
0.312: better loop
1.932: sum + generator
1.011: sum + listcomp
0.172: decision bit -- bit mask and shift
0.613: decision bit -- division
0.140: lookup table
1.977: lookup table with sum()
python (65536, 1000)
6.528: original way
4.661: better loop
4.974: sum + generator
4.150: sum + listcomp
10.971: decision bit -- bit mask and shift
10.218: decision bit -- division
6.052: lookup table
7.070: lookup table with sum()
python3 (65536, 1000)
12.999: original way
7.618: better loop
6.826: sum + generator
5.587: sum + listcomp
19.326: decision bit -- bit mask and shift
14.917: decision bit -- division
9.779: lookup table
9.575: lookup table with sum()
pypy (65536, 1000)
0.681: original way
0.884: better loop
2.640: sum + generator
2.642: sum + listcomp
0.316: decision bit -- bit mask and shift
1.573: decision bit -- division
0.280: lookup table
1.561: lookup table with sum()
python (1048576, 100)
10.371: original way
7.065: better loop
7.910: sum + generator
6.579: sum + listcomp
17.583: decision bit -- bit mask and shift
15.426: decision bit -- division
9.285: lookup table
10.850: lookup table with sum()
python3 (1048576, 100)
20.435: original way
11.221: better loop
10.162: sum + generator
8.981: sum + listcomp
29.108: decision bit -- bit mask and shift
23.626: decision bit -- division
14.706: lookup table
14.173: lookup table with sum()
pypy (1048576, 100)
0.985: original way
0.926: better loop
5.462: sum + generator
6.623: sum + listcomp
0.527: decision bit -- bit mask and shift
2.334: decision bit -- division
0.481: lookup table
5.800: lookup table with sum()
python (16777216, 10)
15.704: original way
11.331: better loop
11.995: sum + generator
13.787: sum + listcomp
28.527: decision bit -- bit mask and shift
25.204: decision bit -- division
15.349: lookup table
17.607: lookup table with sum()
python3 (16777216, 10)
32.822: original way
18.530: better loop
16.339: sum + generator
14.999: sum + listcomp
47.755: decision bit -- bit mask and shift
38.930: decision bit -- division
23.704: lookup table
22.693: lookup table with sum()
pypy (16777216, 10)
1.559: original way
2.234: better loop
6.586: sum + generator
10.931: sum + listcomp
0.817: decision bit -- bit mask and shift
3.714: decision bit -- division
0.752: lookup table
3.837: lookup table with sum()
One small optimization, which is also a matter of style: lists can be iterated over directly instead of being indexed:
for d in data:
    if d >= 128:
        sum += d
This saves a few function calls.
This loop also might go faster if you use the builtin sum function:
my_sum += sum( d for d in data if d>=128 )
A list comp may be faster than the above generator (at the expense of a little extra memory):
my_sum += sum( [d for d in data if d>=128] )
Of course, from an algorithm perspective, we can get rid of the outermost loop since the sum of the inner loop isn't going to change:
my_sum = 100000 * sum( d for d in data if d>=128 )
but I'm guessing you already knew that...
Finally, here's how I would benchmark this:
import random
import timeit

N = 327
testarr = [random.randint(1, 256) for _ in range(N)]

def test1(data):
    """Original way"""
    s = 0
    for c in range(N):
        if data[c] >= 128:
            s += data[c]

def test2(data):
    """better loop"""
    s = 0
    for d in data:
        if d >= 128:
            s += d

def test3(data):
    """ sum + generator """
    sum( d for d in data if d >= 128 )

def test4(data):
    """ sum + listcomp """
    sum( [ d for d in data if d >= 128 ])

NNUMBER = 100000
print timeit.timeit('test1(testarr)', 'from __main__ import test1,testarr', number=NNUMBER)
print timeit.timeit('test2(testarr)', 'from __main__ import test2,testarr', number=NNUMBER)
print timeit.timeit('test3(testarr)', 'from __main__ import test3,testarr', number=NNUMBER)
print timeit.timeit('test4(testarr)', 'from __main__ import test4,testarr', number=NNUMBER)
My results (OS-X Mavericks, python 2.7.5 -- mileage may vary):
2.80526804924 # loop with range
2.04129099846 # loop over elements
1.91441488266 # sum with generator
2.05234098434 # sum with list
For me, the generator with sum wins by a small margin. The list-comp with sum and the explicit loop are about equal. Looping over indices is (not surprisingly) the slowest.
