Swap assignment slower than introducing a new variable? - Python

Okay, I hope the title makes some sense. I have written a fast Fourier transform (FFT) and want to see how it could go faster. So here is the code ('code1') where I do the swapping the FFT dictates:
import timeit

stp='''
import numpy as nn
D=2**18
N=12
w=(nn.linspace(0,D-1,D)%2**N<(2**N/2))-.5 #square wave
f=w[0:2**N]+nn.zeros(2**N)*1j #it takes only one wavelength from w
'''
code1='''
for m in range(N):
    for l in range(2**m):
        for k in range(2**(N-1-m)):
            t=f[k+l*2**(N-m)]
            f[k+l*2**(N-m)]=(t+f[k+l*2**(N-m)+2**(N-1-m)])
            f[k+l*2**(N-m)+2**(N-1-m)]=(t-f[k+l*2**(N-m)+2**(N-1-m)])*nn.e**(-2*nn.pi*1j*k*2**(m)/2**N)
'''
print(timeit.timeit(setup=stp,stmt=code1,number=1))
where I introduce a new variable 't'. This outputs a time of 0.132.
So I thought it should go faster if I did (setup is the same as before):
code2='''
for m in range(N):
    for l in range(2**m):
        for k in range(2**(N-1-m)):
            f[k+l*2**(N-m)],f[k+l*2**(N-m)+2**(N-1-m)]=f[k+l*2**(N-m)]+f[k+l*2**(N-m)+2**(N-1-m)],(f[k+l*2**(N-m)]-f[k+l*2**(N-m)+2**(N-1-m)])*nn.e**(-2*nn.pi*1j*k*2**(m)/2**N)
'''
print(timeit.timeit(setup=stp,stmt=code2,number=1))
since now I do two assignments instead of three (that was my line of thinking). But it appears that this is actually slower (0.152). Does anyone have an idea why? And does anyone know a faster way to do this swapping than the t=a; a=f(a,b); b=g(t,b) I introduced before? I find it hard to believe that this is the most efficient way.
EDIT: added the actual code instead of the pseudo-code.
MORE EDIT:
I tried running the same code without NumPy. Both versions are faster, so that's positive, but again the t=a; a=f(a,b); b=g(t,b) method appears faster (0.104) than the a,b=f(a,b),g(a,b) method (0.114). So the mystery remains.
new code:
stpsansnumpy='''
import cmath as mm
D=2**18
N=12
w=[0]*D
for i in range(D):
    w[i]=(i%2**N<(2**N/2))-.5 #square wave
f=w[0:2**N]+[0*1j]*2**N #it takes only one wavelength from w
'''
code1math='''
for m in range(N):
    for l in range(2**m):
        for k in range(2**(N-1-m)):
            t=f[k+l*2**(N-m)]
            f[k+l*2**(N-m)]=(t+f[k+l*2**(N-m)+2**(N-1-m)])
            f[k+l*2**(N-m)+2**(N-1-m)]=(t-f[k+l*2**(N-m)+2**(N-1-m)])*mm.exp(-2*mm.pi*1j*k*2**(m)/2**N)
'''
print(timeit.timeit(setup=stpsansnumpy,stmt=code1math,number=1))
and:
code2math='''
for m in range(N):
    for l in range(2**m):
        for k in range(2**(N-1-m)):
            f[k+l*2**(N-m)],f[k+l*2**(N-m)+2**(N-1-m)]=f[k+l*2**(N-m)]+f[k+l*2**(N-m)+2**(N-1-m)],(f[k+l*2**(N-m)]-f[k+l*2**(N-m)+2**(N-1-m)])*mm.exp(-2*mm.pi*1j*k*2**(m)/2**N)
'''
print(timeit.timeit(setup=stpsansnumpy,stmt=code2math,number=1))

It would have been nice if you had shared why you think one is slower than the other, and if your code had been formatted properly, as Python code should be.
Something like this:
from timeit import timeit

def f(a, b):
    return a

def g(a, b):
    return a

def extra_var(a, b):
    t = a
    a = f(a, b)
    b = g(t, b)
    return a, b

def swap_direct(a, b):
    a, b = f(a, b), g(a, b)
    return a, b

print(timeit(lambda: extra_var(1, 2)))
print(timeit(lambda: swap_direct(1, 2)))
However, if you had, you would probably have found the same results I did:
0.2162299
0.21171479999999998
The results are so close that in consecutive runs, either function can appear to be a bit faster or slower.
So, you'd increase the volume:
print(timeit(lambda: extra_var(1, 2), number=10000000))
print(timeit(lambda: swap_direct(1, 2), number=10000000))
And the mystery goes away:
2.1527828999999996
2.1225841
The direct swap is actually slightly faster, as expected. What is different about what you were doing that was giving you other results?
You say you're seeing a difference when you implement it in the context of more complicated code - however, this shows that your code itself is the likely culprit, which is why Stack Overflow suggests you share a minimal, reproducible example, so people can actually try what you say happens instead of having to take your word for it.
In most cases, it turns out someone made a mistake and everything is as expected. In some cases, you get an interesting answer.

In the first version I see 5 indexing operations, and in the second one I see 6. I’m not surprised that 6 indexing operations (with all the computations you use in them) are more expensive than 5.
Creating a temporary variable, or creating a temporary tuple, is peanuts compared to all the computations you do in these code fragments.
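To make that concrete, here is a rough, untimed sketch of the same butterfly with the repeated index arithmetic hoisted into locals, so each pass indexes f only twice per element (it reuses the question's setup and names; 'code3' is just a label for this sketch):
code3='''
for m in range(N):
    step = 2**(N-m)
    half = 2**(N-1-m)
    for l in range(2**m):
        base = l*step
        for k in range(half):
            i = base + k
            j = i + half
            a, b = f[i], f[j]
            f[i] = a + b
            f[j] = (a - b)*nn.e**(-2*nn.pi*1j*k*2**m/2**N)
'''
print(timeit.timeit(setup=stp,stmt=code3,number=1))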


How can I check that a list is in my array in Python

For example, if I have:
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
and I want to check if the following list is the same as one of the rows that the array consists of:
B = [2,3,4]
I tried
B in A #which returns True
But the following also returns True, when it should be False:
B = [2,2,2]
B in A
Try this generator comprehension. The builtin any() short-circuits, so you don't do evaluations you don't need.
any(np.array_equal(row, B) for row in A)
For now, np.array_equal doesn't implement internal short-circuiting. In a different question the performance impact of different ways of accomplishing this is discussed.
As @Dan mentions below, broadcasting is another valid way to solve this problem, and it's often (though not always) a better way. For some rough heuristics, here's how you might want to choose between the two approaches. As with any other micro-optimization, benchmark your results (a rough timing sketch follows the two lists below).
Generator Comprehension
Reduced memory footprint (not creating the array B==A)
Short-circuiting (if the first row of A is B, we don't have to look at the rest)
When rows are large (definition depends on your system, but could be ~100 - 100,000), broadcasting isn't noticeably faster.
Uses builtin language features. You have numpy installed anyway, but I'm partial to using the core language when there isn't a reason to do otherwise.
Broadcasting
Fastest way to solve an extremely broad range of problems using numpy. Using it here is good practice.
If we do have to search through every row in A (i.e. if more often than not we expect B to not be in A), broadcasting will almost always be faster (not always a lot faster necessarily, see next point)
When rows are smallish, the generator expression won't be able to vectorize the computations efficiently, so broadcasting will be substantially faster (unless of course you have enough rows that short-circuiting outweighs that concern).
In a broader context where you have more numpy code, the use of broadcasting here can help to have more consistent patterns in your code base. Coworkers and future you will appreciate not having a mix of coding styles and patterns.
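For instance, a rough way to time the two approaches on your own data (the shapes here are made up; substitute your real A and B):
import timeit
import numpy as np

A = np.random.randint(0, 10, size=(10000, 3))
B = A[5000].tolist()  # a row we know is present

gen = lambda: any(np.array_equal(row, B) for row in A)      # generator + any()
bcast = lambda: (B == A).all(axis=1).any()                  # broadcasting

print(timeit.timeit(gen, number=100))
print(timeit.timeit(bcast, number=100))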
You can do it by using broadcasting like this:
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
B = np.array([2,3,4]) # Or [2,3,4], a list will work fine here too
(B==A).all(axis=1).any()
Using the built-in any(): as soon as an identical row is found, it stops iterating and returns True.
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
B = [3,2,4]
if any(np.array_equal(B, x) for x in A):
    print(f'{B} inside {A}')
else:
    print(f'{B} NOT inside {A}')
You need to use .all() to compare all the elements of the list.
A = np.array([[2,3,4],[5,6,7]])
B = [2,3,4]
for i in A:
    if (i==B).all():
        print ("Yes, B is present in A")
        break
EDIT: I put break to break out of the loop as soon as the first occurrence is found. This applies to examples such as A = np.array([[2,3,4],[2,3,4]]).
# print ("Yes, B is present in A")
Alternative solution using any:
any((i==B).all() for i in A)
# True
list((A[[i], :]==B).all() for i in range(A.shape[0]))
[True, False]
This will tell you which row of A is equal to B.
Straightforward: you could use any() to go through a generator, comparing the arrays with array_equal.
from numpy import array_equal
import numpy as np
A = np.array([[2,3,4],[5,6,7]])
B = np.array([2,2,4])
in_A = lambda x, A : any((array_equal(a,x) for a in A))
print(in_A(B, A))
False

Solve large number of small equation systems in numpy

I have a large number of small linear equation systems that I'd like to solve efficiently using numpy. Basically, given A[:,:,:] and b[:,:], I wish to find x[:,:] given by A[i,:,:].dot(x[i,:]) = b[i,:]. So if I didn't care about speed, I could solve this as
for i in range(n):
    x[i,:] = np.linalg.solve(A[i,:,:],b[i,:])
But since this involves explicit looping in Python, and since A typically has a shape like (1000000,3,3), such a solution would be quite slow. If numpy isn't up to this, I could do the loop in Fortran (i.e. using f2py), but I'd prefer to stay in Python if possible.
For those coming back to read this question now, I thought I'd save others some time and mention that numpy now handles this using broadcasting.
So, in numpy 1.8.0 and higher, the following can be used to solve N linear equations.
x = np.linalg.solve(A,b)
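For example (a small sketch; depending on your numpy version, b may need an explicit trailing axis to be treated as a stack of column vectors):
import numpy as np

n = 1000
A = np.random.random((n, 3, 3))
b = np.random.random((n, 3))

# solve all n 3x3 systems in one call
x = np.linalg.solve(A, b[..., np.newaxis])[..., 0]

# sanity check against the per-system definition A[i] @ x[i] = b[i]
assert np.allclose(np.einsum('nij,nj->ni', A, x), b)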
I guess answering yourself is a bit of a faux pas, but this is the Fortran solution I have at the moment, i.e. what the other solutions are effectively competing against, both in speed and brevity.
function pixsolve(A, b) result(x)
  implicit none
  real*8    :: A(:,:,:), b(:,:), x(size(b,1),size(b,2))
  integer*4 :: i, n, m, piv(size(b,1)), err
  n = size(A,3); m = size(A,1)
  x = b
  do i = 1, n
    call dgesv(m, 1, A(:,:,i), m, piv, x(:,i), m, err)
  end do
end function
This would be compiled as:
f2py -c -m foo{,.f90} -llapack -lblas
And called from python as
x = foo.pixsolve(A.T, b.T).T
(The .Ts are needed due to a poor design choice in f2py, which causes unnecessary copying, inefficient memory access patterns and unnatural-looking Fortran indexing if the .Ts are left out.)
This also avoids a setup.py etc. I have no bone to pick with fortran (as long as strings aren't involved), but I was hoping that numpy might have something short and elegant which could do the same thing.
I think you're wrong about the explicit looping being a problem. Usually it's only the innermost loop that's worth optimizing, and I think that holds true here. For example, we can measure the cost of the overhead vs the cost of the actual computation:
import numpy as np
n = 10**6
A = np.random.random(size=(n, 3, 3))
b = np.random.random(size=(n, 3))
x = b*0
def f():
    for i in xrange(n):
        x[i,:] = np.linalg.solve(A[i,:,:],b[i,:])

np.linalg.pseudosolve = lambda a,b: b

def g():
    for i in xrange(n):
        x[i,:] = np.linalg.pseudosolve(A[i,:,:],b[i,:])
which gives me
In [66]: time f()
CPU times: user 54.83 s, sys: 0.12 s, total: 54.94 s
Wall time: 55.62 s
In [67]: time g()
CPU times: user 5.37 s, sys: 0.01 s, total: 5.38 s
Wall time: 5.40 s
IOW, it's only spending 10% of its time doing anything other than actually solving your problem. Now, I could totally believe that np.linalg.solve itself is too slow for you relative to what you could get out of Fortran, and so you want to do something else. That's probably especially true on small problems, come to think of it: IIRC I once found it faster to unroll certain small solutions by hand, although that was a while back.
But by itself, it's not true that using an explicit loop on the first index will make the overall solution quite slow. If np.linalg.solve is fast enough, the loop won't change it much here.
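For what it's worth, hand-unrolling the 3x3 case could look something like the sketch below (a vectorized Cramer's rule; cramer_3x3 is a hypothetical helper, not code from either answer), which replaces the million per-system solve calls with a handful of whole-array operations:
import numpy as np

def cramer_3x3(A, b):
    # A: (n, 3, 3), b: (n, 3) -> x: (n, 3), using x_i = det(A_i)/det(A),
    # where A_i is A with column i replaced by b (requires numpy >= 1.8
    # for stacked determinants)
    det_A = np.linalg.det(A)
    x = np.empty(b.shape)
    for i in range(3):
        Ai = A.copy()
        Ai[:, :, i] = b
        x[:, i] = np.linalg.det(Ai) / det_A
    return x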
I think you can do it in one go, with a (3x100000,3x100000) matrix composed of 3x3 blocks around the diagonal.
Not tested:
n = A.shape[0]                      # number of 3x3 systems
b_new = b.reshape(3*n)              # stack the right-hand sides
A_new = np.zeros(shape=(3*n, 3*n))  # big block-diagonal matrix
for i in range(n):
    A_new[3*i:3*(i+1), 3*i:3*(i+1)] = A[i,:,:]
x_new = np.linalg.solve(A_new, b_new)
x = x_new.reshape(n, 3)

Efficient strings containing each other

I have two sets of strings (A and B), and I want to know all pairs of strings a in A and b in B where a is a substring of b.
The first step of coding this was the following:
for a in A:
    for b in B:
        if a in b:
            print (a,b)
However, I wanted to know: is there a more efficient way to do this with regular expressions (e.g. instead of checking if a in b, check whether the regexp '.*' + a + '.*' matches b)? I thought that maybe using something like this would let me cache the Knuth-Morris-Pratt failure function for all a. Also, using a list comprehension for the inner for b in B loop will likely give a pretty big speedup (and a nested list comprehension may be even better).
I'm not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I'm more concerned with the constant (I just need to do this for several pairs of A and B sets, and I don't want it to run all week).
Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!
Edit:
Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:
import collections
prefix_table = collections.defaultdict(set)
for k, b in enumerate(B):
    for i in xrange(len(b) - 9):
        prefix_table[b[i:i+10]].add(k)
for a in A:
    if len(a) >= 10:
        for k in prefix_table[a[:10]]:
            # check if a is in B[k]
            # (the prefix match is necessary, but not sufficient)
            if a in B[k]:
                print (a, B[k])
    else:
        for k in xrange(len(B)):
            # a is too small to use the table; check if
            # a is in any b
            if a in B[k]:
                print (a, B[k])
Of course you can easily write this as a list comprehension:
[(a, b) for a in A for b in B if a in b]
This might slightly speed up the loop, but don't expect too much. I doubt using regular expressions will help in any way with this one.
Edit: Here are some timings:
import itertools
import timeit
import re
import collections
with open("/usr/share/dict/british-english") as f:
    A = [s.strip() for s in itertools.islice(f, 28000, 30000)]
    B = [s.strip() for s in itertools.islice(f, 23000, 25000)]

def f():
    result = []
    for a in A:
        for b in B:
            if a in b:
                result.append((a, b))
    return result

def g():
    return [(a, b) for a in A for b in B if a in b]

def h():
    res = [re.compile(re.escape(a)) for a in A]
    return [(a, b) for a in res for b in B if a.search(b)]

def ninjagecko():
    d = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i, j in itertools.combinations(range(len(b) + 1), 2):
            d[b[i:j]].add(k)
    return [(a, B[k]) for a in A for k in d[a]]

print "Nested loop", timeit.repeat(f, number=1)
print "List comprehension", timeit.repeat(g, number=1)
print "Regular expressions", timeit.repeat(h, number=1)
print "ninjagecko", timeit.repeat(ninjagecko, number=1)
Results:
Nested loop [0.3641810417175293, 0.36279606819152832, 0.36295199394226074]
List comprehension [0.362030029296875, 0.36148500442504883, 0.36158299446105957]
Regular expressions [1.6498990058898926, 1.6494300365447998, 1.6480278968811035]
ninjagecko [0.06402897834777832, 0.063711881637573242, 0.06389307975769043]
Edit 2: Added a variant of the algorithm suggested by ninjagecko to the timings. You can see it is much better than all the brute force approaches.
Edit 3: Used sets instead of lists to eliminate the duplicates. (I did not update the timings -- they remained essentially unchanged.)
Let's assume your words are bounded at a reasonable size (let's say 10 letters). Do the following to achieve linear(!) time complexity, that is, O(A+B):
Initialize a hashtable or trie.
For each string b in B:
    For every substring of that string:
        Add the substring to the hashtable/trie (this is no worse than 55*O(B)=O(B)), with metadata of which string it belonged to.
For each string a in A:
    Do an O(1) query to your hashtable/trie to find all B-strings it is in, and yield those.
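A small sketch of that table, assuming the bound above (toy A and B for illustration; MAXLEN is a hypothetical cap on the length of the A-strings):
import collections

MAXLEN = 10
A = ["foo", "bar"]
B = ["foobar", "barricade", "nothing"]

table = collections.defaultdict(set)
for k, b in enumerate(B):
    for i in range(len(b)):
        for j in range(i + 1, min(i + MAXLEN, len(b)) + 1):
            table[b[i:j]].add(k)  # substring -> indices of B-strings containing it

pairs = [(a, B[k]) for a in A if len(a) <= MAXLEN for k in table[a]]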
(As of writing this answer, no response yet if OP's "words" are bounded. If they are unbounded, this solution still applies, but there is a dependency of O(maxwordsize^2), though actually it's nicer in practice since not all words are the same size, so it might be as nice as O(averagewordsize^2) with the right distribution. For example if all the words were of size 20, the problem size would grow by a factor of 4 more than if they were of size 10. But if sufficiently few words were increased from size 10->20, then the complexity wouldn't change much.)
Edit: https://stackoverflow.com/q/8289199/711085 is actually a theoretically better answer. I was looking at the linked Wikipedia page before that answer was posted, and was thinking "linear in the string size is not what you want", and only later realized it's exactly what you want. Your intuition to build a regexp (Aword1|Aword2|Aword3|...) is correct since the finite automaton which is generated behind the scenes will perform matching quickly IF it supports simultaneous overlapping matches, which not all regexp engines might. Ultimately what you should use depends on whether you plan to reuse the As or Bs, or if this is just a one-time thing. The above technique is much easier to implement but only works if your words are bounded (and introduces a DoS vulnerability if you don't reject words above a certain size limit), but may be what you are looking for if you don't want the Aho-Corasick string matching finite automaton or similar, or it is unavailable as a library.
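As a rough illustration of that alternation idea (toy data; note that Python's re reports only one, non-overlapping match per position, so overlapping A-words inside the same b can be missed, which is exactly the caveat above):
import re

A = ["cat", "category", "dog"]
B = ["concatenate", "categories", "catalog"]

# longest alternatives first, so longer words are preferred at a given position
pattern = re.compile("|".join(re.escape(a) for a in sorted(A, key=len, reverse=True)))
pairs = {(m.group(0), b) for b in B for m in pattern.finditer(b)}
print(pairs)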
A very fast way to search for a lot of strings is to make use of a finite automaton (so you were not that far off with your regexp guess), namely the Aho-Corasick string matching machine, which is used in tools like grep, virus scanners and the like.
First it compiles the strings you want to search for (in your case the words in A) into a finite-state automaton with a failure function (see the paper from '75 if you are interested in the details). This automaton then reads the input string(s) and outputs all found search strings (probably you want to modify it a bit, so that it outputs the string in which the search string was found as well).
This method has the advantage that it searches all search strings at the same time and thus needs to look at every character of the input string(s) only once (linear complexity)!
There are implementations of the Aho-Corasick pattern matcher on PyPI, but I haven't tested them, so I can't say anything about the performance, usability or correctness of these implementations.
EDIT: I tried this implementation of the Aho-Corasick automaton and it is indeed the fastest of the suggested methods so far, and also easy to use:
import pyahocorasick

def aho(A, B):
    t = pyahocorasick.Trie()
    for a in A:
        t.add_word(a, a)
    t.make_automaton()
    return [(s,b) for b in B for (i,res) in t.iter(b) for s in res]
One thing I observed, though, was that when testing this implementation with @SvenMarnach's script it yielded slightly fewer results than the other methods, and I am not sure why. I wrote a mail to the creator; maybe he will figure it out.
There are specialized index structures for this, see for example
http://en.wikipedia.org/wiki/Suffix_tree
You'd build a suffix-tree or something similar for B, then use A to query it.

Pythonic iterable difference

I've written some code to find all the items that are in one iterable and not the other, and vice versa. I was originally using the built-in set difference, but the computation was rather slow as there were millions of items being stored in each set. Since I know there will be at most a few thousand differences, I wrote the version below:
def differences(a_iter, b_iter):
    a_items, b_items = set(), set()

    def remove_or_add_if_none(a_item, b_item, a_set, b_set):
        if a_item is None:
            if b_item in a_set:
                a_set.remove(b_item)
            else:
                b_set.add(b)

    def remove_or_add(a_item, b_item, a_set, b_set):
        if a in b_set:
            b_set.remove(a)
            if b in a_set:
                a_set.remove(b)
            else:
                b_set.add(b)
            return True
        return False

    for a, b in itertools.izip_longest(a_iter, b_iter):
        if a is None or b is None:
            remove_or_add_if_none(a, b, a_items, b_items)
            remove_or_add_if_none(b, a, b_items, a_items)
            continue
        if a != b:
            if remove_or_add(a, b, a_items, b_items) or \
               remove_or_add(b, a, b_items, a_items):
                continue
            a_items.add(a)
            b_items.add(b)
    return a_items, b_items
However, the above code doesn't seem very pythonic so I'm looking for alternatives or suggestions for improvement.
Here is a more pythonic solution:
a, b = set(a_iter), set(b_iter)
return a - b, b - a
Pythonic does not mean fast, but rather elegant and readable.
Here is a solution that might be faster:
a, b = set(a_iter), set(b_iter)
# Get all the candidate return values
symdif = a.symmetric_difference(b)
# Since symdif has much fewer elements, these might be faster
return symdif - b, symdif - a
Now, about writing custom “fast” algorithms in Python instead of using the built-in operations: it's a very bad idea.
The set operators are heavily optimized, and written in C, which is generally much, much faster than Python.
You could write an algorithm in C (or Cython), but then keep in mind that Python's set algorithms were written and optimized by world-class geniuses.
Unless you're extremely good at optimization, it's probably not worth the effort. On the other hand, if you do manage to speed things up substantially, please share your code; I bet it'd have a chance of getting into Python itself.
For a more realistic approach, try eliminating calls to Python code. For instance, if your objects have a custom equality operator, figure out a way to remove it.
But don't get your hopes up. Working with millions of pieces of data will always take a long time. I don't know where you're using this, but maybe it's better to make the computer busy for a minute than to spend the time optimizing set algorithms?
I think your code is broken - try it with [1,1] and [1,2] and you'll get that 1 is in one set but not the other.
> print differences([1,1],[1,2])
(set([1]), set([2]))
You can trace this back to the effect of the if a != b test (which assumes something about ordering that is not present in simple set differences).
Without that test, which probably discards many values, I don't think your method is going to be any faster than built-in sets. The argument goes something like this: you really do need to create one set in memory to hold all the data (your bug came from not doing that). A naive set approach creates two sets. So the best you can do is save half the time, and you also have to do, in Python, the work of what is probably efficient C code.
I would have thought Python set operations would be the best performance you could get out of the standard library.
Perhaps it's the particular implementation you chose that's the problem, rather than the data structures and attendant operations themselves. Here's an alternate implementation that should give you better performance.
For sequence comparison tasks in which the sequences are large, avoid, if at all possible, putting the objects that comprise the sequences into the containers used for the comparison; it's better to work with indices instead. If the objects in your sequences are unordered, then sort them.
So for instance, I use NumPy, the numerical Python library, for this sort of task:
# a, b are 'fake' index arrays of type boolean
import numpy as NP
a, b = NP.random.randint(0, 2, 10), NP.random.randint(0, 2, 10)
a, b = NP.array(a, dtype=bool), NP.array(b, dtype=bool)
# items a and b have in common:
NP.sum(NP.logical_and(a, b))
# the converse (the differences, i.e. items in exactly one of a or b)
NP.sum(NP.logical_xor(a, b))

How can I, in python, iterate over multiple 2d lists at once, cleanly?

If I'm making a simple grid based game, for example, I might have a few 2d lists. One might be for terrain, another might be for objects, etc. Unfortunately, when I need to iterate over the lists and have the contents of a square in one list affect part of another list, I have to do something like this.
for i in range(len(alist)):
    for j in range(len(alist[i])):
        if alist[i][j].isWhatever:
            blist[i][j].doSomething()
Is there a nicer way to do something like this?
If anyone is interested in performance of the above solutions, here they are for 4000x4000 grids, from fastest to slowest:
Brian: 1.08s (modified, with izip instead of zip)
John: 2.33s
DzinX: 2.36s
ΤΖΩΤΖΙΟΥ: 2.41s (but object initialization took 62s)
Eugene: 3.17s
Robert: 4.56s
Brian: 27.24s (original, with zip)
EDIT: Added Brian's scores with izip modification and it won by a large amount!
John's solution is also very fast, although it uses indices (I was really surprised to see this!), whereas Robert's and Brian's (with zip) are slower than the question creator's initial solution.
So let's present Brian's winning function, as it is not shown in proper form anywhere in this thread:
from itertools import izip

for a_row,b_row in izip(alist, blist):
    for a_item, b_item in izip(a_row,b_row):
        if a_item.isWhatever:
            b_item.doSomething()
I'd start by writing a generator method:
def grid_objects(alist, blist):
    for i in range(len(alist)):
        for j in range(len(alist[i])):
            yield (alist[i][j], blist[i][j])
Then whenever you need to iterate over the lists your code looks like this:
for (a, b) in grid_objects(alist, blist):
    if a.is_whatever():
        b.do_something()
You could zip them, i.e.:
for a_row,b_row in zip(alist, blist):
    for a_item, b_item in zip(a_row,b_row):
        if a_item.isWhatever:
            b_item.doSomething()
However, the overhead of zipping and iterating over the items may be higher than your original method if you rarely actually use the b_item (i.e. a_item.isWhatever is usually False). You could use itertools.izip instead of zip to reduce the memory impact of this, but it's still probably going to be slightly slower unless you always need the b_item.
Alternatively, consider using a 3D list instead, so terrain for cell i,j is at l[i][j][0], objects at l[i][j][1] etc, or even combine the objects so you can do a[i][j].terrain, a[i][j].object etc.
[Edit] DzinX's timings actually show that the impact of the extra check for b_item isn't really significant, next to the performance penalty of re-looking up by index, so the above (using izip) seems to be fastest.
I've now given a quick test for the 3d approach as well, and it seems faster still, so if you can store your data in that form, it could be both simpler and faster to access. Here's an example of using it:
# Initialise 3d list:
alist = [ [[A(a_args), B(b_args)] for i in xrange(WIDTH)] for j in xrange(HEIGHT)]
# Process it:
for row in alist:
    for a,b in row:
        if a.isWhatever():
            b.doSomething()
Here are my timings for 10 loops using a 1000x1000 array, with various proportions of isWhatever being true:
              ( Chance isWhatever is True )
Method       100%     50%     10%      1%
3d          3.422   2.151   1.067   0.824
izip        3.647   2.383   1.282   0.985
original    5.422   3.426   1.891   1.534
When you are operating with grids of numbers and want really good performance, you should consider using Numpy. It's surprisingly easy to use and lets you think in terms of operations with grids instead of loops over grids. The performance comes from the fact that the operations are then run over whole grids with optimised SSE code.
For example, here is some numpy-using code I wrote that does a brute-force numerical simulation of charged particles connected by springs. This code calculates a timestep for a 3d system with 100 nodes and 99 edges in 31ms. That is over 10x faster than the best pure Python code I could come up with.
from itertools import izip
from numpy import array, sqrt, float32, newaxis

def evolve(points, velocities, edges, timestep=0.01, charge=0.1, mass=1., edgelen=0.5, dampen=0.95):
    """Evolve an n-body system of electrostatically repulsive nodes connected by
    springs by one timestep."""
    velocities *= dampen
    # calculate matrix of distance vectors between all points and their lengths squared
    dists = array([[p2 - p1 for p2 in points] for p1 in points])
    l_2 = (dists*dists).sum(axis=2)
    # make the diagonal 1's to avoid division by zero
    for i in xrange(points.shape[0]):
        l_2[i,i] = 1
    l_2_inv = 1/l_2
    l_3_inv = l_2_inv*sqrt(l_2_inv)
    # repulsive force: distance vectors divided by length cubed, summed and multiplied by scale
    scale = timestep*charge*charge/mass
    velocities -= scale*(l_3_inv[:,:,newaxis].repeat(points.shape[1], axis=2)*dists).sum(axis=1)
    # calculate spring contributions for each point
    for idx, (point, outedges) in enumerate(izip(points, edges)):
        edgevecs = point - points.take(outedges, axis=0)
        edgevec_lens = sqrt((edgevecs*edgevecs).sum(axis=1))
        scale = timestep/mass
        velocities[idx] += (edgevecs*((((edgelen*scale)/edgevec_lens - scale))[:,newaxis].repeat(points.shape[1],axis=1))).sum(axis=0)
    # move points to new positions
    points += velocities*timestep
As a slight style change, you could use enumerate:
for i, arow in enumerate(alist):
    for j, aval in enumerate(arow):
        if aval.isWhatever():
            blist[i][j].doSomething()
I don't think you'll get anything significantly simpler unless you rearrange your data structures as Federico suggests. So that you could turn the last line into something like "aval.b.doSomething()".
Generator expressions and izip from the itertools module will do very nicely here:
from itertools import izip

for a, b in (pair for (aline, bline) in izip(alist, blist)
                  for pair in izip(aline, bline)):
    if a.isWhatever:
        b.doSomething()
The expression in the for statement above means:
take each line from combined grids alist and blist and make a tuple from them (aline, bline)
now combine these lists with izip again and take each element from them (pair).
This method has two advantages:
there are no indices used anywhere
you don't have to create lists with zip; you use more efficient generators with izip instead.
Are you sure that the objects in the two matrices you are iterating in parallel are instances of conceptually distinct classes? What about merging the two classes ending up with a matrix of objects that contain both isWhatever() and doSomething()?
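For instance, a minimal sketch of that merge (hypothetical names): one cell object carries both the flag and the behaviour, so a single loop over a single grid suffices.
class Cell:
    def __init__(self, is_whatever=False):
        self.isWhatever = is_whatever

    def doSomething(self):
        pass  # act on this cell directly

grid = [[Cell() for _ in range(10)] for _ in range(10)]
for row in grid:
    for cell in row:
        if cell.isWhatever:
            cell.doSomething()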
If the two 2D-lists remain constant during the lifetime of your game and you can't enjoy Python's multiple inheritance to join the alist[i][j] and blist[i][j] object classes (as others have suggested), you could add a pointer to the corresponding b item in each a item after the lists are created, like this:
for a_row, b_row in itertools.izip(alist, blist):
    for a_item, b_item in itertools.izip(a_row, b_row):
        a_item.b_item = b_item
Various optimisations can apply here, like your classes having __slots__ defined, or the initialization code above being merged with your own initialization code, etc. After that, your loop will become:
for a_row in alist:
    for a_item in a_row:
        if a_item.isWhatever():
            a_item.b_item.doSomething()
That should be more efficient.
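A tiny illustration of the __slots__ remark above (hypothetical class name): without a per-instance __dict__, millions of small cell objects take noticeably less memory.
class ACell(object):
    # fixed attribute set, no per-instance __dict__
    __slots__ = ('isWhatever', 'b_item')

    def __init__(self, isWhatever=False):
        self.isWhatever = isWhatever
        self.b_item = None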
If a.isWhatever is rarely true you could build an "index" once:
a_index = set((i,j)
              for i,arow in enumerate(a)
              for j,aval in enumerate(arow)
              if aval.isWhatever())
and each time you want something to be done:
for (i,j) in a_index:
    b[i][j].doSomething()
If a changes over time, then you will need to keep the index up-to-date. That's why I used a set, so items can be added and removed fast.
A common pattern in other answers here is that they attempt to zip together the two inputs, then zip over elements from each pair of nested "row" lists. I propose to invert this, to get more elegant code. As the Zen of Python tells us, "Flat is better than nested."
I took the following approach to set up a test:
>>> class A:
...     def __init__(self):
...         self.isWhatever = True
...
>>>
>>> class B:
...     def doSomething(self):
...         pass
...
>>> alist = [[A() for _ in range(1000)] for _ in range(1000)]
>>> blist = [[B() for _ in range(1000)] for _ in range(1000)]
Adapting the originally-best-performing code for 3.x, that solution was
def brian_modern():
    for a_row, b_row in zip(alist, blist):
        for a_item, b_item in zip(a_row, b_row):
            if a_item.isWhatever:
                b_item.doSomething()
(since nowadays, zip returns an iterator, and does what itertools.izip used to do).
On my platform (Python 3.8.10 on Linux Mint 20.3; Intel(R) Core(TM) i5-4430 CPU @ 3.00GHz with 8GB of DDR3 RAM @ 1600MT/s), I get this timing result:
>>> import timeit
>>> timeit.timeit(brian_modern, number=100)
10.740317705087364
Rather than this repeated zip, my approach is to flatten each input iterable first, and then zip the results.
from itertools import chain
def karl():
flatten = chain.from_iterable
for a_item, b_item in zip(flatten(alist), flatten(blist)):
if a_item.isWhatever:
b_item.doSomething()
This gives almost as good performance:
>>> karl()
>>> timeit.timeit(karl, number=100)
11.126002880046144
As a baseline, let's try to pare the looping overhead to a minimum:
my_a = A()
my_b = B()

def baseline():
    a = my_a  # avoid repeated global lookup
    b = my_b  # avoid repeated global lookup
    for i in range(1000000):
        if a.isWhatever:
            b.doSomething()
and then check how much of the time is used by the actual object-checking logic:
>>> timeit.timeit(baseline, number=100)
9.41121925599873
So, the pre-flattening approach does incur significantly more overhead (about 18%, vs. about 14% for the repeated-zip approach). However, it is still a fairly small amount of overhead, even for a trivial loop body, and it also allows us to write the code more elegantly.
In my testing, this is the fastest approach to pre-flattening. Splatting out arguments to itertools.chain is slightly slower again, while using a generator expression to flatten the input...
def karl_gen():
    a_flat = (i for row in alist for i in row)
    b_flat = (j for row in blist for j in row)
    for a_item, b_item in zip(a_flat, b_flat):
        if a_item.isWhatever:
            b_item.doSomething()
... is much slower:
>>> timeit.timeit(karl_gen, number=100)
16.904560427879915
Switching to list comprehensions here barely makes a difference to speed vs. the generators, while also doubling up on memory requirements temporarily. So itertools.chain.from_iterable is a clear winner.
for d1 in alist:
    for d2 in d1:
        if d2 == "whatever":
            do_my_thing()
