I would like to speed up the code below.
x = []
y = []
z = []
for i in range(0, 1000000):
    if 0 < u[i] < 1920 and 0 < v[i] < 1080:
        x.append(u[i])
        y.append(v[i])
        z.append([x_ind[i], y_ind[i]])
Any ideas would be really appreciated.
Thanks
Typically, you can optimize cases like this by replacing loops over indices (and the indexing that goes with them) with loops over the raw values. The replacement code here would be:
x = []
y = []
z = []
for a, b, c, d in zip(u, v, x_ind, y_ind):
    if 0 < a < 1920 and 0 < b < 1080:
        x.append(a)
        y.append(b)
        z.append([c, d])
If u, v, x_ind and y_ind might be longer than 1000000 and you must stop after 1000000 items checked, you'd just add an import to the top of the file, from itertools import islice, and change the for loop itself to:
for a, b, c, d in islice(zip(u, v, x_ind, y_ind), 1000000):
Either way, you remove all indexing from the code (indexing has one of the worst ratios of overhead to useful work accomplished in the CPython reference interpreter, though other interpreters and tools like Cython behave differently) and, if you use nicer names than a, b, c and d, you get more self-documenting code.
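For instance, a version with more descriptive names could look like this (a sketch; the names are just one possible choice):
x = []
y = []
z = []
for u_px, v_px, xi, yi in zip(u, v, x_ind, y_ind):
    if 0 < u_px < 1920 and 0 < v_px < 1080:
        x.append(u_px)
        y.append(v_px)
        z.append([xi, yi])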
There are minor benefits (decreasing in the most recent versions of Python) to pre-binding copies of append instead of dynamic binding, so if you're really hurting for speed, especially on older versions of Python that didn't optimize away the creation of bound methods, you can try:
x = []
y = []
z = []
xapp, yapp, zapp = x.append, y.append, z.append
for a, b, c, d in zip(u, v, x_ind, y_ind):
    if 0 < a < 1920 and 0 < b < 1080:
        xapp(a)
        yapp(b)
        zapp([c, d])
(adding islice if needed) to reduce method call overhead a bit at the expense of uglier code. Definitely don't do this unless profiling has shown this is the hot code path and you really need it faster.
Lastly, a note: if this code is run at top level (outside of any function) it will run significantly slower. Variable lookup for locally scoped names in a function is a C array lookup; looking up globally scoped names, which every lookup outside a function involves, requires at least one dict key lookup, which is substantially more expensive. Put it in a function (along with the definitions of x, y and z; u, v, x_ind and y_ind matter less if you're zipping rather than indexing them) and call that function instead of running at global scope, and it should run a lot faster.
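A sketch of what that might look like (the function name is arbitrary):
def filter_points(u, v, x_ind, y_ind):
    # All names here are locals, so every lookup is a cheap C array access.
    x, y, z = [], [], []
    for a, b, c, d in zip(u, v, x_ind, y_ind):
        if 0 < a < 1920 and 0 < b < 1080:
            x.append(a)
            y.append(b)
            z.append([c, d])
    return x, y, z

x, y, z = filter_points(u, v, x_ind, y_ind)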
Improvements beyond this might be possible using numpy arrays instead of lists, but you'd need to be much more specific about your problem to hazard a guess on the utility of such a change.
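For what it's worth, a rough sketch of a numpy version, assuming u, v, x_ind and y_ind can be turned into 1-D arrays of equal length:
import numpy as np

u = np.asarray(u)
v = np.asarray(v)
x_ind = np.asarray(x_ind)
y_ind = np.asarray(y_ind)

# Boolean filter computed entirely in C.
mask = (u > 0) & (u < 1920) & (v > 0) & (v < 1080)
x = u[mask]
y = v[mask]
z = np.column_stack((x_ind[mask], y_ind[mask]))  # pairs [x_ind[i], y_ind[i]]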
I'm not sure if this is a duplicate of the other PyPy memory questions, but here I'll provide a concrete example.
from __future__ import division
def mul_inv(a, m):
    """Modular multiplicative inverse, a^-1 mod m. Credit: rosettacode.org"""
    m0 = m
    x0, x1 = 0, 1
    if m == 1: return 1
    while a > 1:
        assert m != 0, "a and m must be coprime"
        q = a // m
        a, m = m, a % m
        x0, x1 = x1 - q * x0, x0
    if x1 < 0: x1 += m0
    return x1

M = 1000000009
L = 10**8

bin2 = [0] * L
bin2[0] = 1
for n in range(L-1):
    bin2[n+1] = (bin2[n] * (4*n + 2) * mul_inv(n+1, M)) % M
    if n % 10**5 == 0: print(n, bin2[n])
print(bin2[:20])
With Python 3.6, the program uses 3-4 GB at most and runs to completion (Armin Rigo's list change doesn't change this significantly). With PyPy 5.10.0 on Python 2.7.13, the program quickly reaches 8 GB (all the RAM I have) and freezes. Even with gc.collect() calls, the program runs out of memory when n is about 3.5 * 10^7.
Where is this memory usage coming from? The only large memory usage should be initializing bin2 as a 10^8 int list. Nothing else should be increasing the memory usage, under the assumption that all the local variables in mul_inv are garbage collected.
Oops, it's a bad case of the optimization for lists of integers. The problem is that this starts as a list of ints:
bin2 = [0] * L
This is internally stored as an array of machine-word ints. That representation is usually much more compact, although in this case it doesn't change much, because on CPython the equivalent is a list containing L references to the same object 0.
The problem is that pretty soon we store a long in the list. At that point, the whole list has to be converted to the generic kind that can store anything. But because PyPy sees 100 million zeroes, it creates 100 million separate 0 objects. That instantly adds about 3 GB of memory pressure for nothing, on top of the 800 MB for the list itself.
We can check that the problem doesn't occur if we initialize the list like this, so that it really contains 100 million times the same object:
bin2 = [0L] * L # Python 2.x
bin2[0] = 1
That said, in your example you don't need the list to contain 100 million elements in the first place. You can initialize it as:
bin2 = [1]
and use bin2.append(). This lets the program start much more quickly and without any large memory usage near the beginning.
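Applied to the loop from the question, the append-based version could look roughly like this:
bin2 = [1]
for n in range(L - 1):
    bin2.append((bin2[n] * (4*n + 2) * mul_inv(n + 1, M)) % M)
    if n % 10**5 == 0:
        print(n, bin2[n])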
Note that PyPy3 still uses more memory than CPython3.
AFAICT the issue here is that you're assigning longs to the array, and despite your modulo, PyPy doesn't seem to notice that the number still fits into a machine word.
I can think of two ways to fix this:
Pass the value assigned to bin2[n+1] through int().
Use array.array().
The former only affects PyPy2, and results in what appears to be a stable memory footprint of ~800MB on my Mac, whereas the latter appears to stabilise at ~1.4GB regardless of whether I run it in PyPy2 or PyPy3.
I haven't run the program fully to completion, though, so YMMV…
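For reference, a sketch of the array.array variant (combined with the int() cast); the 'l' typecode is an assumption that a C long is wide enough for values below M:
from array import array

bin2 = array('l', [0]) * L   # compact machine-int storage on both interpreters
bin2[0] = 1
for n in range(L - 1):
    # int() keeps the stored value a plain machine-sized integer on PyPy2.
    bin2[n + 1] = int((bin2[n] * (4*n + 2) * mul_inv(n + 1, M)) % M)
    if n % 10**5 == 0:
        print(n, bin2[n])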
I was trying to code Quicksort in Python (see the full code at the end of the question) and in the partition function I am supposed to swap two elements of an array (call it x). I am using the following code for swapping based on the xor operator:
x[i]^=x[j]
x[j]^=x[i]
x[i]^=x[j]
I know that it should work because of the nilpotence of the xor operator (i.e. x^x=0) and I have done it like a million times in Java and in C without any problem. My question is: why doesn’t it work in Python? It seems that it is not working when x[i] == x[j] (maybe i = j?).
x = [2,4,3,5,2,5,46,2,5,6,2,5]
print x
def part(a,b):
    global x
    i = a
    for j in range(a,b):
        if x[j]<=x[b]:
            x[i]^=x[j]  #t = x[i]
            x[j]^=x[i]  #x[i] = x[j]
            x[i]^=x[j]  #x[j] = t
            i = i+1
    r = x[i]
    x[i]=x[b]
    x[b]=r
    return i

def quick(y,z):
    if z-y<=0:
        return
    p = part(y,z)
    quick(y,p-1)
    quick(p+1,z)

quick(0,len(x)-1)
print x
As to why it doesn't work, it really shouldn't matter¹, because you shouldn't be using code like that in the first place, especially when Python gives you a perfectly good 'atomic swap' capability:
x[i], x[j] = x[j], x[i]
It's always been my opinion that programs should be optimised for readability first, and only have performance or storage improvements imposed if there's a clear need and a clear benefit (neither of which I've ever seen for the XOR trick outside some incredibly small-data environments like certain embedded systems).
Even in languages that don't provide that nifty feature, it's more readable and probably faster to use a temporary variable:
tmp = x[i]
x[i] = x[j]
x[j] = tmp
¹ However, if you really want to know why it's not working: the trick is fine for swapping two distinct variables, but it doesn't work so well when you use it on the same variable, which is what you're doing when you try to swap x[i] with x[j] while i is equal to j.
It's functionally equivalent to the following, with print statements added so you can see where the whole thing falls apart:
>>> a = 42
>>> a ^= a ; print(a)
0
>>> a ^= a ; print(a)
0
>>> a ^= a ; print(a)
0
Contrast that with two distinct variables, which works okay:
>>> a = 314159; b = 271828; print(a,b)
314159 271828
>>> a ^= b; print(a,b)
61179 271828
>>> b ^= a; print(a,b)
61179 314159
>>> a ^= b; print(a,b)
271828 314159
The problem is that the trick works by transferring information between the two variables gradually (similar to the fox/goose/beans puzzle). When it's the same variable, the first step doesn't so much transfer information as it does destroy it.
Both Python's 'atomic swap' and use of a temporary variable will avoid this problem completely.
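Applied to the partition function from the question, the three XOR lines collapse to a single safe swap (a sketch, keeping the rest of the code as-is):
def part(a, b):
    global x
    i = a
    for j in range(a, b):
        if x[j] <= x[b]:
            x[i], x[j] = x[j], x[i]   # safe even when i == j
            i = i + 1
    x[i], x[b] = x[b], x[i]
    return i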
Reviewing this, it may help to note that xor can be expressed as:
a ^ b == (a | b) - (a & b)
So if you substitute a for b, you get (a | a) - (a & a) = a - a = 0, which is exactly why the trick destroys the value when both operands are the same variable.
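A quick interactive check of that identity, and of what happens when both operands are the same:
>>> a, b = 314159, 271828
>>> a ^ b == (a | b) - (a & b)
True
>>> (a | a) - (a & a)
0
>>> a ^ a
0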
Say you have to carry out a computation by using 2 or even 3 loops. Intuitively, one may think that it's more efficient to do this with a single loop. I tried a simple Python example:
import itertools
import timeit
def case1(n):
    c = 0
    for i in range(n):
        c += 1
    return c

def case2(n):
    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += 1
    return c
print(case1(1000))
print(case2(10))
if __name__ == '__main__':
    import timeit
    print(timeit.timeit("case1(1000)", setup="from __main__ import case1", number=10000))
    print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))
Running this code gives:
$ python3 code.py
1000
1000
0.8281264099932741
1.04944919400441
So effectively 1 loop seems to be a bit more efficient. Yet I have a slightly different scenario in my problem, as I need to use the values in an array (in the following example I use the function range for simplification). That is, if I collapse everything to a single loop I would have to create an extended array from the values of another array whose size is between 2 and 10 elements.
import itertools
import timeit
def case1(n):
    b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]
    c = 0
    for i in range(len(b)):
        c += b[i]
    return c

def case2(n):
    c = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c += i*j*k
    return c
print(case1(10))
print(case2(10))
if __name__ == '__main__':
    import timeit
    print(timeit.timeit("case1(10)", setup="from __main__ import case1", number=10000))
    print(timeit.timeit("case2(10)", setup="from __main__ import case2", number=10000))
On my computer this code runs in:
$ python3 code.py
91125
91125
2.435348572995281
1.6435037050105166
So it seems the 3 nested loops are more efficient, because I spend some time creating the array b in case1. I'm not sure I'm creating this array in the most efficient way, but leaving that aside, does it really pay off to collapse loops into a single one? I'm using Python here, but what about compiled languages like C++? Does the compiler in this case do something to optimize the single loop? Or, on the other hand, does it do some optimization when you have multiple nested loops?
This line is why the single-loop function takes longer than it should:
b = [i * j * k for i, j, k in itertools.product(range(n), repeat=3)]
Just by changing the whole function to
def case1(n, b):
    c = 0
    for i in range(len(b)):
        c += b[i]
    return c
makes timeit return:
case1 : 0.965343249744
case2 : 2.28501694207
Your case is simple enough that various optimizations would probably do a lot, be it numpy for more efficient arrays, PyPy for a better JIT optimizer, or various other things.
Looking at the bytecode via the dis module can help you understand what happens under the hood and make some micro optimizations, but in general it does not really matter if you do one loop or a nested loop, if your memory access pattern is somewhat predictable for the CPU. If not, it may differ wildly.
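For example, a minimal way to peek at the bytecode of the functions above:
import dis

# Disassemble both versions to compare the opcodes executed per iteration.
dis.dis(case1)
dis.dis(case2)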
Python has some bytecodes that are cheap and others that are more expensive, e.g. function calls are much more expensive than a simple addition. Same with creating new objects and various other things. So the usual optimization is moving the loop to C, which is one of the benefits of itertools sometimes.
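As a sketch of that idea, the triple loop from case2 could let itertools.product drive the iteration in C (the per-item arithmetic still runs as Python bytecode; the name case2_itertools is just illustrative):
import itertools

def case2_itertools(n):
    # itertools.product generates the (i, j, k) triples in C.
    return sum(i * j * k for i, j, k in itertools.product(range(n), repeat=3))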
Once you are on the C-level it usually comes down to: Avoid syscalls/mallocs() in tight loops, have predictable memory access patterns and make sure your algorithm is cache friendly.
So, your algorithms above will probably vary wildly in performance if you go to large values of N, due to the amount of memory allocation and cache access.
But the fastest way for the specific problem above would be to find a closed form for the function, it seems wasteful to iterate for that, as there must be a much simpler formula to calculate the final value of 'c'. As usual, first get the best algorithm before doing micro optimizations.
E.g. Wolfram Alpha can give you a closed form for the two inner loops, so you could replace them with a single expression; there is probably a closed form for all three, but Alpha didn't tell me...
def case3(n):
    # Closed form for the two inner loops of case2: for a fixed j,
    # the sum of i*j*k over i and k in range(n) is j * (n*(n-1)//2)**2.
    c = 0
    for j in range(n):
        c += j * (n * (n - 1) // 2) ** 2
    return c
I have a function which just basically makes lots of calls to a simple defined hash function and tests to see when it finds a duplicate. I need to do lots of simulations with it so would like it to be as fast as possible. I am attempting to use cython to do this. The cython code is currently called with a normal python list of integers with values in the range 0 to m^2.
import math, random
cdef int a,b,c,d,m,pos,value, cyclelimit, nohashcalls
def h3(int a, int b, int c, int d, int m, int x):
    return (a*x**2 + b*x + c) % m

def floyd(inputx):
    dupefound, nohashcalls = (0, 0)
    m = len(inputx)
    loops = int(m*math.log(m))
    for loopno in xrange(loops):
        if (dupefound == 1):
            break
        a = random.randrange(m)
        b = random.randrange(m)
        c = random.randrange(m)
        d = random.randrange(m)
        pos = random.randrange(m)
        value = inputx[pos]
        listofpos = [0] * m
        listofpos[pos] = 1
        setofvalues = set([value])
        cyclelimit = int(math.sqrt(m))
        for j in xrange(cyclelimit):
            pos = h3(a, b, c, d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos] == 1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])
    return dupefound, nohashcalls
How can I convert inputx and listofpos to use C type arrays and to access the arrays at C speed? Are there any other speed ups I can use? Can setofvalues be sped up?
So that there is something to compare against, 50 calls to floyd() with m = 5000 currently takes around 30 seconds on my computer.
Update: Example code snippet to show how floyd is called.
m = 5000
inputx = random.sample(xrange(m**2), m)
(dupefound, nohashcalls) = edcython.floyd(inputx)
First of all, it seems that you must type the variables inside the function. A good example of it is here.
Second, cython -a, for "annotate", gives you a really excellent breakdown of the code generated by the cython compiler and a color-coded indication of how dirty (read: Python API heavy) it is. This output is really essential when trying to optimize anything.
Third, the now famous page on working with Numpy explains how to get fast, C-style access to Numpy array data. Unfortunately it's verbose and annoying. We're in luck, however, because more recent Cython provides Typed Memory Views, which are both easy to use and awesome. Read that entire page before you try to do anything else.
After ten minutes or so I came up with this:
# cython: infer_types=True

# Use the C math library to avoid Python overhead.
from libc cimport math
# For boundscheck below.
import cython
# We're lazy so we'll let Numpy handle our array memory management.
import numpy as np
# You would normally also import the Numpy pxd to get faster access to the Numpy
# API, but it requires some fancier compilation options so I'll leave it out for
# this demo.
# cimport numpy as np
import random

# This is a small function that doesn't need to be exposed to Python at all. Use
# `cdef` instead of `def` and inline it.
cdef inline int h3(int a, int b, int c, int d, int m, int x):
    return (a*x**2 + b*x + c) % m

# If we want to live fast and dangerously, we tell cython not to check our array
# indices for IndexErrors. This means we CAN overrun our array and crash the
# program or screw up our stack. Use with caution. Profiling suggests that we
# aren't gaining anything in this case so I leave it on for safety.
# @cython.boundscheck(False)
# `cpdef` so that calling this function from another Cython (or C) function can
# skip the Python function call overhead, while still allowing us to use it from
# Python.
cpdef floyd(int[:] inputx):
    # Type the variables in the scope of the function.
    cdef int a, b, c, d, value, cyclelimit
    cdef unsigned int dupefound = 0
    cdef unsigned int nohashcalls = 0
    cdef unsigned int loopno, pos, j

    # `m` has type int because inputx is already a Cython memory view and
    # `infer_types` is on.
    m = inputx.shape[0]

    cdef unsigned int loops = int(m*math.log(m))

    # Again using the memory view, but letting Numpy allocate an array of zeros.
    cdef int[:] listofpos = np.zeros(m, dtype=np.int32)

    # Keep this random sampling out of the loop
    cdef int[:, :] randoms = np.random.randint(0, m, (loops, 5)).astype(np.int32)

    for loopno in range(loops):
        if (dupefound == 1):
            break

        # From our precomputed array
        a = randoms[loopno, 0]
        b = randoms[loopno, 1]
        c = randoms[loopno, 2]
        d = randoms[loopno, 3]
        pos = randoms[loopno, 4]

        value = inputx[pos]

        # Unfortunately, Memory View does not support "vectorized" operations
        # like standard Numpy arrays. Otherwise we'd use listofpos *= 0 here.
        for j in range(m):
            listofpos[j] = 0

        listofpos[pos] = 1
        setofvalues = set((value,))

        cyclelimit = int(math.sqrt(m))

        for j in range(cyclelimit):
            pos = h3(a, b, c, d, m, inputx[pos])
            nohashcalls += 1
            if (inputx[pos] in setofvalues):
                if (listofpos[pos] == 1):
                    dupefound = 0
                else:
                    dupefound = 1
                    print "Duplicate found at position", pos, " and value", inputx[pos]
                break
            listofpos[pos] = 1
            setofvalues.add(inputx[pos])

    return dupefound, nohashcalls
There are no tricks here that aren't explained on docs.cython.org, which is where I learned them myself, but it helps to see it all come together.
The most important changes to your original code are in the comments, but they all amount to giving Cython hints about how to generate code that doesn't use the Python API.
As an aside: I really don't know why infer_types is not on by default. It lets the compiler implicitly use C types instead of Python types where possible, meaning less work for you.
If you run cython -a on this, you'll see that the only lines that call into Python are your calls to random.sample, and building or adding to a Python set().
On my machine, your original code runs in 2.1 seconds. My version runs in 0.6 seconds.
The next step is to get random.sample out of that loop, but I'll leave that to you.
I have edited my answer to demonstrate how to precompute the rand samples. This brings the time down to 0.4 seconds.
Do you need to use this particular hashing algorithm? Why not use the built-in hashing algorithm for dicts? For example:
from collections import Counter
cnt = Counter(inputx)
dupes = [k for k, v in cnt.iteritems() if v > 1]
I have two fairly simple code snippets and I'm running both of them a very large amount of times; I'm trying to determine if there's any optimisation I can do to speed up the execution time. If there's anything that stands out as something that could be done a lot quicker...
In the first one, we've got a list, fields. We've also got a list of lists, weights. We're trying to find which weight list multiplied by fields will produce the maximum sum. Fields is about 30k entries long.
def find_best(weights,fields):
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = 0
        for i in range(num_fields):
            score += float(fields[i]) * weights[c][i]
        if score > best:
            best = score
            winner = c
    return winner
In the second one, we're trying to update two of our weight lists; one gets increased and one decreased. The amount to increase/decrease each element by is equal to the corresponding element in fields (e.g. if fields[4] = 10.5, then we want to increase weights[toincrease][4] by 10.5 and decrease weights[todecrease][4] by 10.5).
def update_weights(weights,fields,toincrease,todecrease):
    for i in range(num_fields):
        update = float(fields[i])
        weights[toincrease][i] += update
        weights[todecrease][i] -= update
    return weights
I hope this isn't an overly specific question.
When you are trying to optimise, the thing you have to do is profile and measure! Python provides the timeit module which makes measuring things easy!
This will assume that you've converted fields to a list of floats beforehand (outside any of these functions), since the string → float conversion is very slow. You can do this via fields = [float(f) for f in string_fields].
Also, for doing numerical processing, pure python isn't very good, since it ends up doing a lot of type-checking (and some other stuff) for each operation. Using a C library like numpy will give massive improvements.
find_best
I have incorporated the answers of others (and a few more) into a profiling suite (say, test_find_best.py):
import random, operator, numpy as np, itertools, timeit
fields = [random.random() for _ in range(3000)]
fields_string = [str(field) for field in fields]
weights = [[random.random() for _ in range(3000)] for c in range(100)]
npw = np.array(weights)
npf = np.array(fields)
num_fields = len(fields)
num_category = len(weights)
def f_original():
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = 0
        for i in range(num_fields):
            score += float(fields_string[i]) * weights[c][i]
        if score > best:
            best = score
            winner = c

def f_original_no_string():
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = 0
        for i in range(num_fields):
            score += fields[i] * weights[c][i]
        if score > best:
            best = score
            winner = c

def f_original_xrange():
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = 0
        for i in xrange(num_fields):
            score += fields[i] * weights[c][i]
        if score > best:
            best = score
            winner = c

# Zenon http://stackoverflow.com/a/10134298/1256624
def f_index_comprehension():
    winner = -1
    best = -float('inf')
    for c in range(num_category):
        score = sum(fields[i] * weights[c][i] for i in xrange(num_fields))
        if score > best:
            best = score
            winner = c

# steveha http://stackoverflow.com/a/10134247/1256624
def f_comprehension():
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = sum(f * w for f, w in itertools.izip(fields, weights[c]))
        if score > best:
            best = score
            winner = c

def f_schwartz_original(): # https://en.wikipedia.org/wiki/Schwartzian_transform
    tup = max(((i, sum(t[0] * t[1] for t in itertools.izip(fields, wlist))) for i, wlist in enumerate(weights)),
              key=lambda t: t[1]
    )

def f_schwartz_opt(): # https://en.wikipedia.org/wiki/Schwartzian_transform
    tup = max(((i, sum(f * w for f, w in itertools.izip(fields, wlist))) for i, wlist in enumerate(weights)),
              key=operator.itemgetter(1)
    )

def fweight(field_float_list, wlist):
    f = iter(field_float_list)
    return sum(f.next() * w for w in wlist)

def f_schwartz_iterate():
    tup = max(
        ((i, fweight(fields, wlist)) for i, wlist in enumerate(weights)),
        key=lambda t: t[1]
    )

# Nolen Royalty http://stackoverflow.com/a/10134147/1256624
def f_numpy_mult_sum():
    np.argmax(np.sum(npf * npw, axis = 1))

# me
def f_imap():
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = sum(itertools.imap(operator.mul, fields, weights[c]))
        if score > best:
            best = score
            winner = c

def f_numpy():
    np.argmax(npw.dot(npf))

for f in [f_original,
          f_index_comprehension,
          f_schwartz_iterate,
          f_original_no_string,
          f_schwartz_original,
          f_original_xrange,
          f_schwartz_opt,
          f_comprehension,
          f_imap]:
    print "%s: %.2f ms" % (f.__name__, timeit.timeit(f, number=10)/10 * 1000)

for f in [f_numpy_mult_sum, f_numpy]:
    print "%s: %.2f ms" % (f.__name__, timeit.timeit(f, number=100)/100 * 1000)
Running python test_find_best.py gives me:
f_original: 310.34 ms
f_index_comprehension: 102.58 ms
f_schwartz_iterate: 103.39 ms
f_original_no_string: 96.36 ms
f_schwartz_original: 90.52 ms
f_original_xrange: 89.31 ms
f_schwartz_opt: 69.48 ms
f_comprehension: 68.87 ms
f_imap: 53.33 ms
f_numpy_mult_sum: 3.57 ms
f_numpy: 0.62 ms
So the numpy version using .dot (sorry, I can't find the documentation for it atm) is the fastest. If you are doing a lot of numerical operations (which it seems you are), it might be worth converting fields and weights to numpy arrays as soon as you create them.
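A minimal sketch of that conversion, reusing the npf/npw names from the suite above and assuming fields may arrive as strings (the function name is just for illustration):
import numpy as np

npf = np.array([float(f) for f in fields])   # shape: (num_fields,)
npw = np.array(weights, dtype=float)         # shape: (num_category, num_fields)

def find_best_numpy(npw, npf):
    # One matrix-vector product computes every category's score at once.
    return int(np.argmax(npw.dot(npf)))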
update_weights
Numpy is likely to offer a similar speed-up for update_weights, doing something like:
def update_weights(weights, fields, to_increase, to_decrease):
    weights[to_increase,:] += fields
    weights[to_decrease,:] -= fields
    return weights
(I haven't tested or profiled that btw, you need to do that.)
I think you could get a pretty big speed boost using numpy. Stupidly simple example:
>>> fields = numpy.array([1, 4, 1, 3, 2, 5, 1])
>>> weights = numpy.array([[.2, .3, .4, .2, .1, .5, .9], [.3, .1, .1, .9, .2, .4, .5]])
>>> fields * weights
array([[ 0.2, 1.2, 0.4, 0.6, 0.2, 2.5, 0.9],
[ 0.3, 0.4, 0.1, 2.7, 0.4, 2. , 0.5]])
>>> result = _
>>> numpy.argmax(numpy.sum(result, axis=1))
1
>>> result[1]
array([ 0.3, 0.4, 0.1, 2.7, 0.4, 2. , 0.5])
If you are running Python 2.x I would use xrange() rather than range(), which uses less memory since it doesn't generate a list.
This is assuming you want to keep the current code structure.
First, if you are using Python 2.x, you can gain some speed by using xrange() instead of range(). In Python 3.x there is no xrange(), but the built-in range() is basically the same as xrange().
Next, if we are going for speed, we need to write less code, and rely more on Python's built-in features (that are written in C for speed).
You could speed things up by using a generator expression inside of sum() like so:
from itertools import izip

def find_best(weights,fields):
    winner = -1
    best = -float('inf')
    for c in xrange(num_category):
        score = sum(float(t[0]) * t[1] for t in izip(fields, weights[c]))
        if score > best:
            best = score
            winner = c
    return winner
Applying the same idea again, let's try to use max() to find the best result. I think this code is ugly to look at, but if you benchmark it and it's enough faster, it might be worth it:
from itertools import izip

def find_best(weights, fields):
    tup = max(
        ((i, sum(float(t[0]) * t[1] for t in izip(fields, wlist))) for i, wlist in enumerate(weights)),
        key=lambda t: t[1]
    )
    return tup[0]
Ugh! But if I didn't make any mistakes, this does the same thing, and it should rely a lot on the C machinery in Python. Measure it and see if it is faster.
So, we are calling max(). We are giving it a generator expression, and it will find the max value returned from the generator expression. But you want the index of the best value, so the generator expression returns a tuple: index and weight value. So we need to pass the generator expression as the first argument, and the second argument must be a key function that looks at the weight value from the tuple and ignores the index. Since the generator expression is not the only argument to max() it needs to be in parens. Then it builds a tuple of i and the calculated weight, calculated by the same sum() we used above. Finally once we get back a tuple from max() we index it to get the index value, and return that.
We can make this much less ugly if we break out a function. This adds the overhead of a function call, but if you measure it I'll bet it isn't too much slower. Also, now that I think about it, it makes sense to build a list of fields values already pre-coerced to float; then we can use that multiple times. Also, instead of using izip() to iterate over two lists in parallel, let's just make an iterator and explicitly ask it for values. In Python 2.x we use the .next() method function to ask for a value; in Python 3.x you would use the next() built-in function.
def fweight(field_float_list, wlist):
    f = iter(field_float_list)
    return sum(f.next() * w for w in wlist)

def find_best(weights, fields):
    flst = [float(x) for x in fields]
    tup = max(
        ((i, fweight(flst, wlist)) for i, wlist in enumerate(weights)),
        key=lambda t: t[1]
    )
    return tup[0]
If there are 30K fields values, then pre-computing the float() values is likely to be a big speed win.
EDIT: I missed one trick. Instead of the lambda function, I should have used operator.itemgetter() like some of the code in the accepted answer. Also, the accepted answer timed things, and it does look like the overhead of the function call was significant. But the Numpy answers were so much faster that it's not worth playing with this answer anymore.
As for the second part, I don't think it can be sped up very much. I'll try:
def update_weights(weights,fields,toincrease,todecrease):
    w_inc = weights[toincrease]
    w_dec = weights[todecrease]
    for i, f in enumerate(fields):
        f = float(f)  # see note below
        w_inc[i] += f
        w_dec[i] -= f
So, instead of iterating over an xrange(), here we just iterate over the fields values directly. We have a line that coerces to float.
Note that if the weights values are already float, we don't really need to coerce to float here, and we can save time by just deleting that line.
Your code was indexing the weights list four times per iteration: twice to do the increment, twice to do the decrement. This code does the first index (using the toincrease or todecrease argument) just once, outside the loop. It still has to index by i in order for += to work. (My first version tried to avoid this with an iterator and didn't work. I should have tested before posting. But it's fixed now.)
One last version to try: instead of incrementing and decrementing values as we go, just use list comprehensions to build a new list with the values we want:
def update_weights(weights, field_float_list, toincrease, todecrease):
    f = iter(field_float_list)
    weights[toincrease] = [x + f.next() for x in weights[toincrease]]
    f = iter(field_float_list)
    weights[todecrease] = [x - f.next() for x in weights[todecrease]]
This assumes you have already coerced all the fields values to float, as shown above.
Is it faster, or slower, to replace the whole list this way? I'm going to guess faster, but I'm not sure. Measure and see!
Oh, I should add: note that my version of update_weights() shown above does not return weights. This is because in Python it is considered a good practice to not return a value from a function that mutates a data structure, just to make sure that nobody ever gets confused about which functions do queries and which functions change things.
http://en.wikipedia.org/wiki/Command-query_separation
Measure measure measure. See how much faster my suggestions are, or are not.
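A minimal timeit harness for that comparison might look like this (update_weights_inplace and update_weights_rebuild are placeholder names for the two versions above, with fields already converted to floats):
import timeit

setup = """
from __main__ import update_weights_inplace, update_weights_rebuild
import random
fields = [random.random() for _ in range(30000)]
weights = [[random.random() for _ in range(30000)] for _ in range(5)]
"""

print(timeit.timeit("update_weights_inplace(weights, fields, 0, 1)", setup=setup, number=100))
print(timeit.timeit("update_weights_rebuild(weights, fields, 0, 1)", setup=setup, number=100))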
An easy optimisation is to use xrange instead of range. xrange returns a lazy object that yields values one by one as you iterate over it, whereas range first creates the entire (30,000-item) list as a temporary object, using more memory and CPU cycles.
As @Levon says, xrange() in Python 2.x is a must. Also, if you are on Python 2.4+ you can use a generator expression (thanks @steveha), which works a bit like a list comprehension, for your inner loop. So this:
for i in range(num_fields):
    score += float(fields[i]) * weights[c][i]
is equivalent to:
score = sum(float(fields[i]) * weights[c][i] for i in xrange(num_fields))
Also, in general, there is this great page on the Python wiki about simple but effective optimization tricks!