Suppose I have N items and a multi-hot vector of values {0, 1} that represents inclusion of these items in a result:
N = 4
# items 1 and 3 will be included in the result
vector = [0, 1, 0, 1]
# item 2 will be included in the result
vector = [0, 0, 1, 0]
I'm also provided a matrix of conflicts which indicates which items cannot be included in the result at the same time:
conflicts = [
[0, 1, 1, 0], # any result that contains items 1 AND 2 is invalid
[0, 1, 1, 1], # any result that contains AT LEAST 2 items from {1, 2, 3} is invalid
]
Given this matrix of conflicts, we can determine the validity of the earlier vectors:
# invalid as it triggers conflict 1: [0, 1, 1, 1]
vector = [0, 1, 0, 1]
# valid as it triggers no conflicts
vector = [0, 0, 1, 0]
A naive solution to detect whether a given vector is "valid" (i.e. does not trigger any conflicts) may be done via a dot product and summation operation in numpy:
violation = np.dot(conflicts, vector)
is_valid = np.max(violation) <= 1
Are there are more efficient ways to perform this operation, perhaps either via np.einsum or by bypassing numpy arrays entirely in favour of bit manipulation?
We assume that the number of vectors being checked can be very large (e.g. up to 2^N if we evaluate all possibilities) but that only one vector is likely being checked at a time (to avoid generating a matrix of shape up to (2^N, N) as input).
TL;DR: you can use Numba to optimize np.dot to only operate only on binary values. More specifically, you can perform SIMD-like operations on 8 bytes at once using 64-bit views.
Converting lists to arrays
First of all, the lists can be efficiently converted to relatively-compact arrays using this approach:
vector = np.fromiter(vector, np.uint8)
conflicts = np.array([np.fromiter(conflicts[i], np.uint8) for i in range(len(conflicts))])
This is faster than using the automatic Numpy conversion or np.array (there is less check to perform in the Numpy code internally and Numpy, Numpy know what type of array to build and the resulting one is smaller in memory and thus faster to fill). This step can be used to speed up your np.dot-based solution.
If the input are already a Numpy array, then check they are of type np.uint8 or np.int8. Otherwise, please cast them to such type using conflits = conflits.astype(np.uint8) for example.
First try
Then, one solution could be to use np.packbits to pack the input binary values much as possible in an array of bits in memory, and then perform logical ANDs. But it turns out that np.packbits is pretty slow. Thus, this solution is not a good idea in the end. In fact, any solution creating temporary arrays with a shape similar to conflicts will be slow since writing such an array in memory is generally slower than np.dot (which read conflicts from memory once).
Using Numba
Since np.dot is pretty well optimized, the only solution to defeat it is to use an optimized native code. Numba can be used to generate a native executable code at runtime from a Numpy-based Python code thanks to a just-in-time compiler. The idea is to perform a logical ANDs between vector and rows of conflicts per block. Conflict are check for each block so to stop the computation as early as possible. Blocks can be efficiently compared by groups of 8 octets by comparing the uint64 views of the two arrays (in a SIMD-friendly way).
import numba as nb
#nb.njit('bool_(uint8[::1], uint8[:,::1])')
def check_valid(vector, conflicts):
n, m = conflicts.shape
assert vector.size == m
for i in range(n):
block_size = 128 # In the range: 8,16,...,248
conflicts_row = conflicts[i,:]
gsum = 0 # Global sum of conflicts
m_limit = m // block_size * block_size
for j in range(0, m_limit, block_size):
vector_block = vector[j:j+block_size].view(np.uint64)
conflicts_block = conflicts_row[j:j+block_size].view(np.uint64)
# Matching
lsum = np.uint64(0) # 8 local sums of conflicts
for k in range(block_size//8):
lsum += vector_block[k] & conflicts_block[k]
# Trick to perform the reduction of all the bytes in lsum
lsum += lsum >> 32
lsum += lsum >> 16
lsum += lsum >> 8
gsum += lsum & 0xFF
# Check if there is a conflict
if gsum >= 2:
return False
# Remaining part
for j in range(m_limit, m):
gsum += vector[j] & conflicts_row[j]
if gsum >= 2:
return False
return True
Results
This is about 9 times faster than np.dot on my machine for a large conflicts array of shape (16, 65536) (without conflicts). The time to convert lists is not included in both cases. When there are conflicts, the provided solution is much faster since it can early stop the computation.
Theoretically, the computation should be even faster, but the Numba JIT do not succeed to vectorize the loop using SIMD instructions. That being said, it seems the same issue appears for np.dot. If the arrays are even bigger, you can parallelize the computation of the blocks (at the expense of a slower computation if the function return False).
Related
Is it actually possible to make this run faster? I need to get half of all possible grids (all elements can be either -1 or 1) of size 4*Lx (for counting energies in Ising model).
def get_grid(Lx):
a = list()
count = 0
t = list(product([1,-1], repeat=Lx))
for i in range(len(t)):
for j in range(len(t)):
for k in range(len(t)):
for l in range(len(t)):
count += 1
a.append([t[i], t[j], t[k], t[l]])
if count == 2**(Lx*4)/2:
return np.array(a)
Tried using numba, but that didn't work out.
First of all, Numba does not like lists. If you want an efficient code, then you need to operate on arrays (except when you really do not know the size at runtime and estimating it is hard/slow). Here the size of the output array is already known so it is better to preallocate it and then fill it. Numba does not like much high-level features like generators, you should prefer using basic loops which are faster (as long as they are executed in a JITed function). The Cartesian product can be replaced by the efficient computation of an array based on the bits of an increasing integer. The whole computation is mainly memory-bound so it is better to use small integer datatypes like uint8 which take 4 times less space in RAM (and thus about 4 times faster to fill). Here is the resulting code:
import numpy as np
import numba as nb
#nb.njit('int8[:,:,:](int64,)')
def get_grid_numba(Lx):
t = np.empty((2**Lx, Lx), dtype=np.int8)
for i in range(2**Lx):
for j in range(Lx):
t[i, Lx-1-j] = 1 - 2 * ((i >> j) & 1)
outSize = 2**(Lx*4 - 1)
out = np.empty((outSize, 4, Lx), dtype=np.int8)
cur = 0
for i in range(len(t)):
for j in range(len(t)):
for k in range(len(t)):
for l in range(len(t)):
out[cur, 0, :] = t[i, :]
out[cur, 1, :] = t[j, :]
out[cur, 2, :] = t[k, :]
out[cur, 3, :] = t[l, :]
cur += 1
if cur == outSize:
return out
return out
For Lx=4, the initial code takes 66.8 ms while this Numba code takes 0.36 ms on my i5-9600KF processor. It is thus 185 times faster.
Note that the size of the output array exponentially grows very quickly. For Lx=7, the output shape is (134217728, 4, 7) and it takes 3.5 GiB of RAM. The Numba code takes 2.47 s to generate it, that is 1.4 GiB/s. If this is not enough to you, then you can write specific implementation from Lx=1 to Lx=8, use loops for the out slice assignment and even use multiple threads for Lx>=5. For small arrays, you can pre-compute them once. This should be an order of magnitude faster.
I wanted to code a prime number generator in python - I've only done this in C and Java. I did the following. I used an integer bitmap as an array. Performance of the algorithm should increase nlog(log(n)) but I am seeing exponential increase in cost/time as the problem size n increases. Is this something obvious I am not seeing or don't know about python as integers grow larger than practical? I am using python-3.8.3.
def countPrimes(n):
if n < 3:
return []
arr = (1 << (n-1)) - 2
for i in range(2, n):
selector = 1 << (i - 1)
if (selector & arr) == 0:
continue
# We have a prime
composite = selector
while (composite := composite << i) < arr:
arr = arr & (~composite)
primes = []
for i in range(n):
if (arr >> i) & 1 == 1:
primes.append(i+1)
return primes
Some analysis of my runtime:
A plot of y = nlog(log(n)) (red line which is steeper) and y = x (blue line which is less steep):
I'd normally not use integers with sizes exceeding uint64, because python allows unlimited size integers and I'm just testing, I used the above approach. As I said, I am trying to understand why the algorithm time increases exponentially with problem size n.
I used an integer bitmap as an array
That's extremely expensive. Python ints are immutable. Every time you want to toggle a bit, you're building a whole new gigantic int.
You also need to build other giant ints just to access single bits you're interested in - for example, composite and ~composite are huge in arr = arr & (~composite), even though you're only interested in 1 bit.
Use an actual mutable sequence type. Maybe a list, maybe a NumPy array, maybe some bitvector type off of PyPI, but don't use an int.
Suppose I have a NumPy array arr that I want to element-wise filter (reduce) depending on the truth value of a (broadcastable) function, e.g.
I want to get only values below a certain threshold value k:
def cond(x):
return x < k
There are a couple of methods, e.g.:
Using a generator: np.fromiter((x for x in arr if cond(x)), dtype=arr.dtype) (which is a memory efficient version of using a list comprehension: np.array([x for x in arr if cond(x)]) because np.fromiter() will produce a NumPy array directly, without the need to allocate an intermediate Python list)
Using boolean masking: arr[cond(arr)]
Using integer indexing: arr[np.nonzero(cond(arr))] (or equivalently using np.where() as it defaults to np.nonzero() with only one condition)
Using explicit looping with:
single pass and final copying/resizing
two passes: one to determine the size of the result and one to actually perform the computation
(The last two approaches to be accelerated with Cython or Numba)
Which is the fastest? What about memory efficiency?
(EDITED: To use directly np.nonzero() instead of np.where() as per #ShadowRanger comment)
Summary
Using a loop-based approach with single pass and copying, accelerated with Numba, offers the best overall trade-off in terms of speed, memory efficiency and flexibility.
If the execution of the condition function is sufficiently fast, two-passes (filter2_nb()) may be faster, while they are more memory efficient regardless.
Also, for sufficiently large inputs, resizing instead of copying (filter_resize_xnb()) leads to faster execution.
If the data type (and the condition function) is known ahead of time and can be compiled, the Cython acceleration seems to be faster.
It is likely that a similar hard-coding of the condition would lead to comparable speed-up with Numba accerelation as well.
When it comes to NumPy-only based approaches, boolean masking or integer indexing are of comparable speed, and which one comes out faster depends largely on the filtering factor, i.e. the portion of values that passes through the filtering condition.
The np.fromiter() approach is much slower (it would be off-charts in the plot), but does not produce large temporary objects.
Note that the following tests are meant to give some insights into the different approaches and should be taken with a grain of salt.
The most relevant assumptions are that the condition is broadcastable and that it would eventually compute very fast.
Definitions
Using a generator:
def filter_fromiter(arr, cond):
return np.fromiter((x for x in arr if cond(x)), dtype=arr.dtype)
Using boolean masking:
def filter_mask(arr, cond):
return arr[cond(arr)]
Using integer indexing:
def filter_idx(arr, cond):
return arr[np.nonzero(cond(arr))]
4a. Using explicit looping, with single pass and final copying/resizing
Cython-accelerated with copying (pre-compiled condition)
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np
cdef long NUM = 1048576
cdef long MAX_VAL = 1048576
cdef long K = 1048576 // 2
cdef int cond_cy(long x, long k=K):
return x < k
cdef size_t _filter_cy(long[:] arr, long[:] result, size_t size):
cdef size_t j = 0
for i in range(size):
if cond_cy(arr[i]):
result[j] = arr[i]
j += 1
return j
def filter_cy(arr):
result = np.empty_like(arr)
new_size = _filter_cy(arr, result, arr.size)
return result[:new_size].copy()
Cython-accelerated with resizing (pre-compiled condition)
def filter_resize_cy(arr):
result = np.empty_like(arr)
new_size = _filter_cy(arr, result, arr.size)
result.resize(new_size)
return result
Numba-accelerated with copying
import numba as nb
#nb.njit
def cond_nb(x, k=K):
return x < k
#nb.njit
def filter_nb(arr, cond_nb):
result = np.empty_like(arr)
j = 0
for i in range(arr.size):
if cond_nb(arr[i]):
result[j] = arr[i]
j += 1
return result[:j].copy()
Numba-accelerated with resizing
#nb.njit
def _filter_out_nb(arr, out, cond_nb):
j = 0
for i in range(arr.size):
if cond_nb(arr[i]):
out[j] = arr[i]
j += 1
return j
def filter_resize_xnb(arr, cond_nb):
result = np.empty_like(arr)
j = _filter_out_nb(arr, result, cond_nb)
result.resize(j, refcheck=False) # unsupported in NoPython mode
return result
Numba-accelerated with a generator and np.fromiter()
#nb.njit
def filter_gen_nb(arr, cond_nb):
for i in range(arr.size):
if cond_nb(arr[i]):
yield arr[i]
def filter_gen_xnb(arr, cond_nb):
return np.fromiter(filter_gen_nb(arr, cond_nb), dtype=arr.dtype)
4b. Using explicit looping with two passes: one to determine the size of the result and one to actually perform the computation
Cython-accelerated (pre-compiled condition)
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
cdef size_t _filtered_size_cy(long[:] arr, size_t size):
cdef size_t j = 0
for i in range(size):
if cond_cy(arr[i]):
j += 1
return j
def filter2_cy(arr):
cdef size_t new_size = _filtered_size_cy(arr, arr.size)
result = np.empty(new_size, dtype=arr.dtype)
new_size = _filter_cy(arr, result, arr.size)
return result
Numba-accelerated
#nb.njit
def filter2_nb(arr, cond_nb):
j = 0
for i in range(arr.size):
if cond_nb(arr[i]):
j += 1
result = np.empty(j, dtype=arr.dtype)
j = 0
for i in range(arr.size):
if cond_nb(arr[i]):
result[j] = arr[i]
j += 1
return result
Timing Benchmarks
(The generator-based filter_fromiter() method is much slower than the others -- by approx. 2 orders of magnitude.
Similar (and perhaps slightly worse) performances can be expected from a list comprehension.
This would be true for any explicit looping with non-accelerated code.)
The timing would depend on both the input array size and the percent of filtered items.
As a function of input size
The first graph addresses the timings as a function of the input size (for ~50% filtering factor -- i.e. 50% of the elements appear in the result):
In general, explicit looping with one form of acceleration leads to the fastest execution, with slight variations depending on input size.
Within NumPy, the integer indexing approaches are basically on par with boolean masking.
The benefits of using np.fromiter() (no pre-allocation) can be reaped by writing a Numba-accelerated generator, which would come out slower than the other approaches (within an order of magnitude), but much faster than pure-Python looping.
As a function of filling
The second graph addresses the timings as a function of items passing through the filter (for a fixed input size of ~1 million elements):
The first observation is that all methods are slowest when approaching a ~50% filling and with less, or more filling they are faster, and fastest towards no filling (highest percent of filtered-out values, lowest percent of passing through values as indicated in the x-axis of the graph).
Again, explicit looping with some mean of acceleration leads to the fastest execution.
Within NumPy, the integer indexing and boolean masking approaches are again basically the same.
(Full code available here)
Memory Considerations
The generator-based filter_fromiter() method requires only minimal temporary storage, independently of the size of the input.
Memory-wise this is the most efficient method.
This approach can be effectively speed up with a Numba-accelerated generator.
Of similar memory efficiency are the Cython / Numba two-passes methods, because the size of the output is determined during the first pass.
The caveat here is that computing the condition has to be fast for these methods to be fast.
On the memory side, the single-pass solutions for both Cython and Numba require a temporary array of the size of the input.
Hence, these are not very memory-efficient compared to two-passes or the generator-based one.
Yet they are of similar asymptotic temporary memory footprint compared to masking, but the constant term is typically larger than masking.
The boolean masking solution requires a temporary array of the size of the input but of type bool, which in NumPy is 1 byte, so this is ~8 times smaller than the default size of a NumPy array on a typical 64-bit system.
The integer indexing solution has the same requirement as the boolean mask slicing in the first step (inside np.nonzero() call), which gets converted to a series of ints (typically int64 on a 64-bit system) in the second step (the output of np.nonzero()).
This second step, therefore, has variable memory requirements, depending on the number of filtered elements.
Remarks
both boolean masking and integer indexing require some form of conditioning that is capable of producing a boolean mask (or, alternatively, a list of indices); in the above implementation, the condition is broadcastable
the generator and the Numba-accelerated methods are also the most flexible when it comes to specifying a different filtering condition
the Numba-accelerated methods require the condition to be Numba-compatible to access the Numba acceleration in NoPython mode
the Cython solution requires specifying the data types for it to be fast, or extra effort for multiple types dispatching, and extra effort (not explored here) to get the same level of flexibility as the other methods
for both Numba and Cython, the filtering condition can be hard-coded, leading to marginal but appreaciable speed differences
the single-pass solutions require additional code to handle the unused (but otherwise initially allotted) memory.
the NumPy methods do NOT return a view of the input, but a copy, as a result of advanced indexing:
arr = np.arange(100)
k = 50
print('`arr[arr > k]` is a copy: ', arr[arr > k].base is None)
# `arr[arr > k]` is a copy: True
print('`arr[np.where(arr > k)]` is a copy: ', arr[np.where(arr > k)].base is None)
# `arr[np.where(arr > k)]` is a copy: True
print('`arr[:k]` is a copy: ', arr[:k].base is None)
# `arr[:k]` is a copy: False
(EDITED: various improvements based on #ShadowRanger, #PaulPanzer, #max9111 and #DavidW comments.)
I am trying to expand the numpy "tensordot" such that things like:
K_ijklm = A_ki * B_jml can be written in a clear way like this: K = mytensordot(A,B,[2,0],[1,4,3])
To my understanding, numpy's tensordot (with optional argument 0) would be able to do something like this: K_kijml = A_ki * B_jml, i.e. keeping the order of the indexes. Therefore I would then have to do a number of np.swapaxes() to obtain the matrix `K_ijklm', which in a complicated case can be an easy source of errors (potentially very hard to debug).
The problem is that my implementation is slow (10x slower than tensordot [EDIT: It is actually MUCH slower than that]), even when using numba. I was wondering if anyone would have some insight on what could be done to improve the performance of my algorithm.
MWE
import numpy as np
import numba as nb
import itertools
import timeit
#nb.jit()
def myproduct(dimN):
N=np.prod(dimN)
L=len(dimN)
Product=np.zeros((N,L),dtype=np.int32)
rn=0
for n in range(1,N):
for l in range(L):
if l==0:
rn=1
v=Product[n-1,L-1-l]+rn
rn = 0
if v == dimN[L-1-l]:
v = 0
rn = 1
Product[n,L-1-l]=v
return Product
#nb.jit()
def mytensordot(A,B,iA,iB):
iA,iB = np.array(iA,dtype=np.int32),np.array(iB,dtype=np.int32)
dimA,dimB = A.shape,B.shape
NdimA,NdimB=len(dimA),len(dimB)
if len(iA) != NdimA: raise ValueError("iA must be same size as dim A")
if len(iB) != NdimB: raise ValueError("iB must be same size as dim B")
NdimN = NdimA + NdimB
dimN=np.zeros(NdimN,dtype=np.int32)
dimN[iA]=dimA
dimN[iB]=dimB
Out=np.zeros(dimN)
indexes = myproduct(dimN)
for nidxs in indexes:
idxA = tuple(nidxs[iA])
idxB = tuple(nidxs[iB])
v=A[(idxA)]*B[(idxB)]
Out[tuple(nidxs)]=v
return Out
A=np.random.random((4,5,3))
B=np.random.random((6,4))
def runmytdot():
return mytensordot(A,B,[0,2,3],[1,4])
def runtensdot():
return np.tensordot(A,B,0).swapaxes(1,3).swapaxes(2,3)
print(np.all(runmytdot()==runtensdot()))
print(timeit.timeit(runmytdot,number=100))
print(timeit.timeit(runtensdot,number=100))
Result:
True
1.4962144780438393
0.003484356915578246
You have run into a known issue. numpy.zeros requires a tuple when creating a multidimensional array. If you pass something other than a tuple, it sometimes works, but that's only because numpy is smart about converting the object into a tuple first.
The trouble is that numba does not currently support conversion of arbitrary iterables into tuples. So this line fails when you try to compile it in nopython=True mode. (A couple of others fail too, but this is the first.)
Out=np.zeros(dimN)
In theory you could call np.prod(dimN), create a flat array of zeros, and reshape it, but then you run into the very same problem: the reshape method of numpy arrays requires a tuple!
This is quite a vexing problem with numba -- I had not encountered it before. I really doubt the solution I have found is the correct one, but it is a working solution that allows us to compile a version in nopython=True mode.
The core idea is to avoid using tuples for indexing by directly implementing an indexer that follows the strides of the array:
#nb.jit(nopython=True)
def index_arr(a, ix_arr):
strides = np.array(a.strides) / a.itemsize
ix = int((ix_arr * strides).sum())
return a.ravel()[ix]
#nb.jit(nopython=True)
def index_set_arr(a, ix_arr, val):
strides = np.array(a.strides) / a.itemsize
ix = int((ix_arr * strides).sum())
a.ravel()[ix] = val
This allows us to get and set values without needing a tuple.
We can also avoid using reshape by passing the output buffer into the jitted function, and wrapping that function in a helper:
#nb.jit() # We can't use nopython mode here...
def mytensordot(A, B, iA, iB):
iA, iB = np.array(iA, dtype=np.int32), np.array(iB, dtype=np.int32)
dimA, dimB = A.shape, B.shape
NdimA, NdimB = len(dimA), len(dimB)
if len(iA) != NdimA:
raise ValueError("iA must be same size as dim A")
if len(iB) != NdimB:
raise ValueError("iB must be same size as dim B")
NdimN = NdimA + NdimB
dimN = np.zeros(NdimN, dtype=np.int32)
dimN[iA] = dimA
dimN[iB] = dimB
Out = np.zeros(dimN)
return mytensordot_jit(A, B, iA, iB, dimN, Out)
Since the helper contains no loops, it adds some overhead, but the overhead is pretty trivial. Here's the final jitted function:
#nb.jit(nopython=True)
def mytensordot_jit(A, B, iA, iB, dimN, Out):
for i in range(np.prod(dimN)):
nidxs = int_to_idx(i, dimN)
a = index_arr(A, nidxs[iA])
b = index_arr(B, nidxs[iB])
index_set_arr(Out, nidxs, a * b)
return Out
Unfortunately, this does not wind up generating as much of a speedup as we might like. On smaller arrays it's about 5x slower than tensordot; on larger arrays it's still 50x slower. (But at least it's not 1000x slower!) This is not too surprising in retrospect, since dot and tensordot are both using BLAS under the hood, as #hpaulj reminds us.
After finishing this code, I saw that einsum has solved your real problem -- nice!
But the underlying issue that your original question points to -- that indexing with arbitrary-length tuples is not possible in jitted code -- is still a frustration. So hopefully this will be useful to someone else!
tensordot with scalar axes values can be obscure. I explored it in
How does numpy.tensordot function works step-by-step?
There I deduced that np.tensordot(A, B, axes=0) is equivalent using axes=[[], []].
In [757]: A=np.random.random((4,5,3))
...: B=np.random.random((6,4))
In [758]: np.tensordot(A,B,0).shape
Out[758]: (4, 5, 3, 6, 4)
In [759]: np.tensordot(A,B,[[],[]]).shape
Out[759]: (4, 5, 3, 6, 4)
That in turn is equivalent to calling dot with a new size 1 sum-of-products dimenson:
In [762]: np.dot(A[...,None],B[...,None,:]).shape
Out[762]: (4, 5, 3, 6, 4)
(4,5,3,1) * (6,1,4) # the 1 is the last of A and 2nd to the last of B
dot is fast, using BLAS (or equivalent) code. Swapping axes and reshaping is also relatively fast.
einsum gives us a lot of control over axes
replicating the above products:
In [768]: np.einsum('jml,ki->jmlki',A,B).shape
Out[768]: (4, 5, 3, 6, 4)
and with swapping:
In [769]: np.einsum('jml,ki->ijklm',A,B).shape
Out[769]: (4, 4, 6, 3, 5)
A minor point - the double swap can be written as one transpose:
.swapaxes(1,3).swapaxes(2,3)
.transpose(0,3,1,2,4)
Implementing a Local Search algorithm for the Quadratic Assignment Problem in Python with Numpy, I've found the use of for-loops to be problematic at best, given that CPython is very slow when it comes to heavy mathematical computation.
My code has a 3-level nested for-loop which iterates a solution (numpy ndarray), iterates over a mask (numpy ndarray), and iterates the solution again, then doing some computations that, and this is the troublesome bit, may affect later iterations.
My question is, is it even possible to somehow precompute this? With the nature of the problem, I'm not really sure. As the problem is a Quadratic, for small sizes it's acceptable, but over 100 elements results in a lot of iterations.
The code is as follows:
while has_improved and iterations < 100:
for r in range(n):
for b in dlb:
if b: continue
has_improved = False
for s in range(n):
move_cost = cost_fn_delta((flow, distance), solution, r, s)
if move_cost < 0:
# Swap the indices
solution[r], solution[s] = solution[s], solution[r]
cost += move_cost
dlb[r], dlb[s] = False, False
has_improved = True
iterations += 1
if not has_improved:
dlb[r] = True