Solve large number of small equation systems in numpy - python

I have a large number of small linear equation systems that I'd like to solve efficiently using numpy. Basically, given A[:,:,:] and b[:,:], I wish to find x[:,:] given by A[i,:,:].dot(x[i,:]) = b[i,:]. So if I didn't care about speed, I could solve this as
for i in range(n):
    x[i,:] = np.linalg.solve(A[i,:,:], b[i,:])
But since this involved explicit looping in python, and since A typically has a shape like (1000000,3,3), such a solution would be quite slow. If numpy isn't up to this, I could do this loop in fortran (i.e. using f2py), but I'd prefer to stay in python if possible.

For those coming back to read this question now, I thought I'd save others time and mention that numpy handles this using broadcasting now.
So, in numpy 1.8.0 and higher, the following can be used to solve N linear equations.
x = np.linalg.solve(A,b)
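For example, a minimal sketch of the broadcast call, matching the shapes in the question:
import numpy as np

n = 1_000_000
A = np.random.random((n, 3, 3))   # stack of n 3x3 matrices
b = np.random.random((n, 3))      # stack of n right-hand sides

x = np.linalg.solve(A, b)         # broadcasts over the leading axis, shape (n, 3)
# Spot-check one system against a direct solve
assert np.allclose(A[0] @ x[0], b[0])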

I guess answering yourself is a bit of a faux pas, but this is the fortran solution I have at the moment, i.e. what the other solutions are effectively competing against, both in speed and brevity.
function pixsolve(A, b) result(x)
  implicit none
  real*8    :: A(:,:,:), b(:,:), x(size(b,1),size(b,2))
  integer*4 :: i, n, m, piv(size(b,1)), err
  n = size(A,3); m = size(A,1)
  x = b
  do i = 1, n
    call dgesv(m, 1, A(:,:,i), m, piv, x(:,i), m, err)
  end do
end function
This would be compiled as:
f2py -c -m foo{,.f90} -llapack -lblas
And called from python as
x = foo.pixsolve(A.T, b.T).T
(The .Ts are needed due to a poor design choice in f2py, which causes unnecessary copying, inefficient memory access patterns and unnatural-looking fortran indexing if the .Ts are left out.)
This also avoids a setup.py etc. I have no bone to pick with fortran (as long as strings aren't involved), but I was hoping that numpy might have something short and elegant which could do the same thing.

I think you're wrong about the explicit looping being a problem. Usually it's only the innermost loop that is worth optimizing, and I think that holds true here. For example, we can measure the cost of the overhead vs the cost of the actual computation:
import numpy as np
n = 10**6
A = np.random.random(size=(n, 3, 3))
b = np.random.random(size=(n, 3))
x = b*0
def f():
    for i in xrange(n):
        x[i,:] = np.linalg.solve(A[i,:,:], b[i,:])
np.linalg.pseudosolve = lambda a,b: b
def g():
    for i in xrange(n):
        x[i,:] = np.linalg.pseudosolve(A[i,:,:], b[i,:])
which gives me
In [66]: time f()
CPU times: user 54.83 s, sys: 0.12 s, total: 54.94 s
Wall time: 55.62 s
In [67]: time g()
CPU times: user 5.37 s, sys: 0.01 s, total: 5.38 s
Wall time: 5.40 s
IOW, it's only spending 10% of its time doing anything other than actually solving your problem. Now, I could totally believe that np.linalg.solve itself is too slow for you relative to what you could get out of Fortran, and so you want to do something else. That's probably especially true on small problems, come to think of it: IIRC I once found it faster to unroll certain small solutions by hand, although that was a while back.
But by itself, it's not true that using an explicit loop on the first index will make the overall solution quite slow. If np.linalg.solve is fast enough, the loop won't change it much here.

I think you can do it in one go, with a (3n x 3n) matrix composed of 3x3 blocks along the diagonal.
Not tested :
n = A.shape[0]
b_new = b.reshape(3*n)
A_new = np.zeros(shape=(3*n, 3*n))
for i in range(n):
    A_new[3*i:3*(i+1), 3*i:3*(i+1)] = A[i,:,:]
x = np.linalg.solve(A_new, b_new).reshape(n, 3)
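For the sizes in the question a dense 3n x 3n matrix would not fit in memory, so here is a hedged sparse variant of the same block-diagonal idea (my addition, using scipy.sparse, which the original answer does not mention):
import numpy as np
from scipy.sparse import block_diag
from scipy.sparse.linalg import spsolve

n = 1_000                                     # illustrative size, not the question's 10**6
A = np.random.random((n, 3, 3))
b = np.random.random((n, 3))

# Block-diagonal 3n x 3n matrix storing only the 9n nonzeros
A_sparse = block_diag(list(A), format="csc")
x = spsolve(A_sparse, b.reshape(3 * n)).reshape(n, 3)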

Related

A more efficient way than np.setdiff1d and np.in1d to remove values common to two 1D arrays of unique values

I need a much faster way to remove from a main 1D array (length ~ 1e5-5e5, rarely up to 7e5) the values it has in common with another, much smaller 1D array (length ~ 10-15); both are index arrays containing integers. There are no duplicates in the arrays, they are not sorted, and the order of the values must be kept in the main array after modification. I know this can be achieved using np.setdiff1d or np.in1d (neither of which is supported by numba in no-python mode), and other similar posts (e.g. this) do not offer a much more efficient way, but performance is important here because all the values in the main index array will gradually be removed in loops.
import numpy as np
import numba as nb

n = 500000
r = 10
arr1 = np.random.permutation(n)
arr2 = np.random.randint(0, n, r)

# @nb.jit
def setdif1d_np(a, b):
    return np.setdiff1d(a, b, assume_unique=True)

# @nb.jit
def setdif1d_in1d_np(a, b):
    return a[~np.in1d(a, b)]
There is another related post by norok2, for 2D arrays, with a solution (a hashing-like approach using numba) that is ~15 times faster than the usual methods described there. This solution might be the best if it could be adapted for 1D arrays:
@nb.njit
def mul_xor_hash(arr, init=65537, k=37):
    result = init
    for x in arr.view(np.uint64):
        result = (result * k) ^ x
    return result

@nb.njit
def setdiff2d_nb(arr1, arr2):
    # : build `delta` set using hashes
    delta = {mul_xor_hash(arr2[0])}
    for i in range(1, arr2.shape[0]):
        delta.add(mul_xor_hash(arr2[i]))
    # : compute the size of the result
    n = 0
    for i in range(arr1.shape[0]):
        if mul_xor_hash(arr1[i]) not in delta:
            n += 1
    # : build the result
    result = np.empty((n, arr1.shape[-1]), dtype=arr1.dtype)
    j = 0
    for i in range(arr1.shape[0]):
        if mul_xor_hash(arr1[i]) not in delta:
            result[j] = arr1[i]
            j += 1
    return result
I tried to adapt this for 1D arrays, but I have some problems/questions with it.
First, I don't understand what mul_xor_hash exactly does, and whether init and k are arbitrarily selected or not.
Why does mul_xor_hash not work without nb.njit:
File "C:/Users/Ali/Desktop/test - Copy - Copy.py", line 21, in mul_xor_hash
result = (result * k) ^ x
TypeError: ufunc 'bitwise_xor' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I don't know how to implement mul_xor_hash for 1D arrays (if that is possible), which I guess might make it even faster than for 2D, so I broadcast the input arrays to 2D with [None, :], which gives the following error just for arr2:
print(mul_xor_hash(arr2[0]))
ValueError: new type not compatible with array
And what does delta do?
I am searching for the most efficient way in this regard. In the absence of a better method than norok2's solution, how can this solution be prepared for 1D arrays?
Understanding the hash-based solution
First, I don't understand what mul_xor_hash exactly does, and whether init and k are arbitrarily selected or not
mul_xor_hash is a custom hash function. Functions mixing xor and multiply (possibly with shifts) are known to be relatively fast at computing the hash of a raw data buffer. The multiplication tends to shuffle bits and the xor is used to somehow combine/accumulate the result in a fixed-size small value (ie. the final hash). There are many different hashing functions. Some are faster than others, some cause more collisions than others in a given context. A fast hashing function causing too many collisions can be useless in practice, as it would result in a pathological situation where all conflicting values need to be compared. This is why fast hash functions are hard to implement.
init and k are parameters, certainly chosen so that the hash is reasonably well balanced. This is pretty common for such hash functions. k needs to be sufficiently big for the multiplication to shuffle bits, and it should typically also be a prime number (values like powers of two tend to increase collisions due to modular arithmetic behaviour). init plays a significant role only for very small arrays (eg. with 1 item): it helps to reduce collisions by xoring the final hash with a non-trivial constant. Indeed, if arr.size = 1, then result = (init * k) ^ arr[0] where init * k is a constant. Having an identity hash function equal to arr[0] is known to be bad since it tends to result in many collisions (this is a complex topic, but put shortly, arr[0] is for example reduced modulo the number of buckets of the hash table). Thus, init should be a relatively big number and init * k should also be a big non-trivial value (a prime number is a good target value).
Why does mul_xor_hash not work without nb.njit
It depends on the input. The input needs to be a 1D array and have a raw size in bytes divisible by 8 (eg. 64-bit items, 2n x 32-bit ones, 4n x 16-bit ones or 8n x 8-bit ones). Here are some examples:
mul_xor_hash(np.random.rand(10))
mul_xor_hash(np.arange(10)) # Does not work with 9
and what does delta do
It is a set containing the hashes of the arr2 rows, so as to find matching lines faster than comparing them without hashes.
how to prepare this solution for 1D arrays?
AFAIK, hashes are only used to avoid comparisons of whole rows, but that is because the input is a 2D array. In 1D, there is no such problem.
There is a big catch with this method: it only works if there are no hash collisions. Otherwise, the implementation wrongly assumes that values are equal even if they are not! @norok2 explicitly mentioned it in the comments though:
Note that the collision handling for the hashings should also be implemented
Faster implementation
Using the 2D solution of @norok2 for 1D arrays is not a good idea, since hashes will not make it faster the way they are used. In fact, a set already uses a hash function internally anyway. Not to mention that collisions need to be properly handled (which is done by a set).
Using a set is a relatively good idea since it makes the complexity O(n + m) where n = len(arr1) and m = len(arr2). That being said, if arr1 is converted to a set, then it will be too big to fit in the L1 cache (due to the size of arr1 in your case), resulting in slow cache misses. Additionally, the growing size of the set will cause values to be re-hashed, which is not efficient. If arr2 is converted to a set, then the many hash table fetches will not be very efficient since arr2 is very small in your case. This is why this solution is sub-optimal.
One solution is to split arr1 into chunks and then build a set based on the target chunk. You can then check efficiently whether a value is in the set or not. Building the set is still not very efficient due to the growing size. This problem is due to Python itself, which does not provide a way to reserve space for the data structure as other languages do (eg. C++). One solution to avoid this issue is simply to reimplement a hash table, which is not trivial and is cumbersome. Actually, Bloom filters can be used to speed up this process since they can quickly find, on average, whether there is no collision between the two sets arr1 and arr2 (though they are not trivial to implement).
Another optimization is to use multiple threads to compute the chunks in parallel, since they are independent. That being said, appending to the final array is not easy to do efficiently in parallel, especially since you do not want the order to be modified. One solution is to move the copy out of the parallel loop and do it serially, but this is slow and AFAIK there is no simple way to do that in Numba currently (since the parallelism layer is very limited). Consider using native languages like C/C++ for an efficient parallel implementation.
In the end, hashing can be pretty complex and the speed-up can be quite small compared to a naive implementation with two nested loops, since arr2 only has a few items and modern processors can compare values quickly using SIMD instructions (while hash-based methods can hardly benefit from them on mainstream processors). Unrolling can help to write a pretty simple and fast implementation. Again, unfortunately, Numba uses LLVM-JIT internally, which appears to fail to vectorize such simple code (certainly due to missing optimizations in either LLVM-JIT or even LLVM itself). As a result, the non-vectorized code ends up a bit slower (rather than 4~10 times faster on a modern mainstream processor). One solution is to use C/C++ code instead (or possibly Cython).
Here is a serial implementation using basic Bloom filters:
@nb.njit('uint32(int32)')
def hash_32bit_4k(value):
    return (np.uint32(value) * np.uint32(27_644_437)) & np.uint32(0x0FFF)

@nb.njit(['int32[:](int32[:], int32[:])', 'int32[:](int32[::1], int32[::1])'])
def setdiff1d_nb_faster(arr1, arr2):
    out = np.empty_like(arr1)
    bloomFilter = np.zeros(4096, dtype=np.uint8)
    for j in range(arr2.size):
        bloomFilter[hash_32bit_4k(arr2[j])] = True
    cur = 0
    for i in range(arr1.size):
        # If the bloom-filter value is false, we know arr1[i] is not in arr2.
        # Otherwise, there may be a false positive (collision) and we need to check to be sure.
        if bloomFilter[hash_32bit_4k(arr1[i])] and arr1[i] in arr2:
            continue
        out[cur] = arr1[i]
        cur += 1
    return out[:cur]
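A minimal usage sketch (my addition, not part of the original answer), assuming int32 inputs so the arrays match the Numba signatures above:
import numpy as np

n, r = 500_000, 10
arr1 = np.random.permutation(n).astype(np.int32)
arr2 = np.random.randint(0, n, r).astype(np.int32)

res = setdiff1d_nb_faster(arr1, arr2)
# Same elements and order as the order-preserving NumPy baseline
expected = arr1[~np.in1d(arr1, arr2)]
assert np.array_equal(res, expected)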
Here is an untested variant that should work for 64-bit integers (floating point numbers need memory views and possibly a prime constant too):
@nb.njit('uint64(int64)')
def hash_32bit_4k(value):
    return (np.uint64(value) * np.uint64(67_280_421_310_721)) & np.uint64(0x0FFF)
Note that if all the values in the small array are contained in the main array in each loop, then we can speed up the arr1[i] in arr2 part by removing values from arr2 when we find them. That being said, collisions and hits should be very rare, so I do not expect this to be significantly faster (not to mention it adds some overhead and complexity). If items are computed in chunks, then the last chunks can be copied directly without any check, but the benefit should still be relatively small. Note that this strategy can be effective for the naive (C/C++) SIMD implementation previously mentioned though (it can be about 2x faster).
Generalization and parallel implementation
This section focuses on the algorithm to use regarding the input size. It particularly details a SIMD-based implementation and discusses the use of multiple threads.
First of all, the best algorithm to use depends on the value of r. More specifically:
when r is 0, the best thing to do is to return the input array arr1 unmodified (possibly a copy to avoid issues with in-place algorithms);
when r is 1, we can use one basic loop iterating over the array, but the best implementation is likely to use NumPy's np.where, which is highly optimized for that (a small sketch follows after this list);
when r is small, like < 10, then using a SIMD-based implementation should be particularly efficient, especially if the iteration range of the arr2-based loop is known at compile-time and is unrolled;
for bigger r values that are still relatively small (eg. r < 1000 and r << n), the provided hash-based solution should be one of the best;
for larger r values with r << n, the hash-based solution can be optimized by packing boolean values as bits in bloomFilter and by using multiple hash functions instead of one so as to better handle collisions while being more cache-friendly (in fact, this is what actual Bloom filters do); note that multi-threading can be used to speed up the lookups when r is huge and r << n;
when r is big and not much smaller than n, then the problem is pretty hard to solve efficiently and the best solution is certainly to sort both arrays (typically with a radix sort) and use a merge-based method to remove the duplicates, possibly with multiple threads when both r and n are huge (hard to implement).
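For instance, here is a small sketch of the r == 1 case mentioned in the list above (my illustration, not code from the original answer):
import numpy as np

arr1 = np.random.permutation(500_000).astype(np.int32)
arr2 = np.array([42], dtype=np.int32)          # r == 1

# Keep every element of arr1 that differs from the single value in arr2,
# preserving the original order.
result = arr1[np.where(arr1 != arr2[0])]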
Let's start with the SIMD-based solution. Here is an implementation:
@nb.njit('int32[:](int32[::1], int32[::1])')
def setdiff1d_nb_simd(arr1, arr2):
    out = np.empty_like(arr1)
    limit = arr1.size // 4 * 4
    limit2 = arr2.size // 2 * 2
    cur = 0
    z32 = np.int32(0)

    # Tile (x4) based computation
    for i in range(0, limit, 4):
        f0, f1, f2, f3 = z32, z32, z32, z32
        v0, v1, v2, v3 = arr1[i], arr1[i+1], arr1[i+2], arr1[i+3]

        # Unrolled (x2) loop searching for a match in `arr2`
        for j in range(0, limit2, 2):
            val1 = arr2[j]
            val2 = arr2[j+1]
            f0 += (v0 == val1) + (v0 == val2)
            f1 += (v1 == val1) + (v1 == val2)
            f2 += (v2 == val1) + (v2 == val2)
            f3 += (v3 == val1) + (v3 == val2)

        # Remainder of the previous loop
        if limit2 != arr2.size:
            val = arr2[arr2.size-1]
            f0 += v0 == val
            f1 += v1 == val
            f2 += v2 == val
            f3 += v3 == val

        if f0 == 0: out[cur] = arr1[i+0]; cur += 1
        if f1 == 0: out[cur] = arr1[i+1]; cur += 1
        if f2 == 0: out[cur] = arr1[i+2]; cur += 1
        if f3 == 0: out[cur] = arr1[i+3]; cur += 1

    # Remainder
    for i in range(limit, arr1.size):
        if arr1[i] not in arr2:
            out[cur] = arr1[i]
            cur += 1

    return out[:cur]
It turns out this implementation is always slower than the hash-based one on my machine, since Numba clearly generates inefficient code for the inner arr2-based loop, and this appears to come from broken optimizations related to ==: Numba simply fails to use SIMD instructions for this operation (for no apparent reason). This prevents many alternative SIMD-related codes from being fast as long as they use Numba.
Another issue with Numba is that np.where is slow since it uses a naive implementation, while the NumPy one has been heavily optimized. The optimization done in NumPy can hardly be applied to the Numba implementation due to the previous issue. This prevents any speed-up from using np.where in Numba code.
In practice, the hash-based implementation is pretty fast and the copy already takes a significant amount of time on my machine. The computing part can be sped up using multiple threads. This is not easy since the parallelism model of Numba is very limited. The copy cannot be easily optimized with Numba (one could use non-temporal stores, but this is not yet supported by Numba) unless the computation is possibly done in-place.
To use multiple threads, one strategy is to first split the range into chunks and then:
build a boolean array determining, for each item of arr1, whether the item is found in arr2 or not (fully parallel);
count the number of items found per chunk (fully parallel);
compute the offset of the destination chunk (hard to parallelize, especially with Numba, but fast thanks to chunks);
copy the chunk to the target location without copying found items (fully parallel).
Here is an efficient parallel hash-based implementation:
@nb.njit('int32[:](int32[:], int32[:])', parallel=True)
def setdiff1d_nb_faster_par(arr1, arr2):
    # Pre-computation of the bloom-filter
    bloomFilter = np.zeros(4096, dtype=np.uint8)
    for j in range(arr2.size):
        bloomFilter[hash_32bit_4k(arr2[j])] = True

    chunkSize = 1024  # To tune regarding the kind of input
    chunkCount = (arr1.size + chunkSize - 1) // chunkSize

    # Find for each item of `arr1` if the value is in `arr2` (parallel)
    # and count the number of items found for each chunk on the fly.
    # Note: thanks to page faults, big parts of `found` are not even written in memory if `arr2` is small
    found = np.zeros(arr1.size, dtype=nb.bool_)
    foundCountByChunk = np.empty(chunkCount, dtype=nb.uint16)
    for i in nb.prange(chunkCount):
        start, end = i * chunkSize, min((i + 1) * chunkSize, arr1.size)
        foundCountInChunk = 0
        for j in range(start, end):
            val = arr1[j]
            if bloomFilter[hash_32bit_4k(val)] and val in arr2:
                found[j] = True
                foundCountInChunk += 1
        foundCountByChunk[i] = foundCountInChunk

    # Compute the location of the destination chunks (sequential)
    outChunkOffsets = np.empty(chunkCount, dtype=nb.uint32)
    foundCount = 0
    for i in range(chunkCount):
        outChunkOffsets[i] = i * chunkSize - foundCount
        foundCount += foundCountByChunk[i]

    # Parallel chunk-based copy
    out = np.empty(arr1.size - foundCount, dtype=arr1.dtype)
    for i in nb.prange(chunkCount):
        srcStart, srcEnd = i * chunkSize, min((i + 1) * chunkSize, arr1.size)
        cur = outChunkOffsets[i]
        # Optimization: we can copy the whole chunk if there is nothing found in it
        if foundCountByChunk[i] == 0:
            out[cur:cur+(srcEnd-srcStart)] = arr1[srcStart:srcEnd]
        else:
            for j in range(srcStart, srcEnd):
                if not found[j]:
                    out[cur] = arr1[j]
                    cur += 1

    return out
This implementation is the fastest for the target input on my machine. It is generally fast when n is quite big and the overhead of creating threads is relatively small on the target platform (eg. on PCs, but typically not on computing servers with many cores). The overhead of the parallel implementation is significant, so the number of cores on the target machine needs to be at least 4 for it to be significantly faster than the sequential implementation.
It may be useful to tune the chunkSize variable for the target inputs. If r << n, it is better to use a pretty big chunkSize. That being said, the number of chunks needs to be sufficiently big for multiple threads to operate on many chunks. Thus, chunkSize should be significantly smaller than n / numberOfThreads.
On my machine, most of the time (65-70%) is spent in the final copy, which is mostly memory-bound and can hardly be optimized further with Numba.
Results
Here are results on my i5-9600KF-based machine (with 6 cores):
setdif1d_np: 2.65 ms
setdif1d_in1d_np: 2.61 ms
setdiff1d_nb: 2.33 ms
setdiff1d_nb_simd: 1.85 ms
setdiff1d_nb_faster: 0.73 ms
setdiff1d_nb_faster_par: 0.49 ms
The best provided implementation is about 4~5 times faster than the other ones.
What I found is that hashing does not help. It is just a trick for the 2D case, to convert the 1D rows to single numbers and put those in a set.
Below is norok2's method converted to 1D arrays (with annotations added for faster compilation).
Note that this is only slightly (20-30%) faster than the methods you already have. And of course the first function call is slower, due to compilation.
@nb.njit('int32[:](int32[:], int32[:])')
def setdiff1d_nb(arr1, arr2):
    delta = set(arr2)

    # : build the result
    result = np.empty(len(arr1), dtype=arr1.dtype)
    j = 0
    for i in range(arr1.shape[0]):
        if arr1[i] not in delta:
            result[j] = arr1[i]
            j += 1
    return result[:j]
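For completeness, a hedged timing sketch (my addition, not from the original answers) that mirrors the question's setup and compares this set-based version against the NumPy baselines defined in the question; it assumes int32 inputs to match the Numba signature:
import timeit
import numpy as np

n, r = 500_000, 10
arr1 = np.random.permutation(n).astype(np.int32)
arr2 = np.random.randint(0, n, r).astype(np.int32)

setdiff1d_nb(arr1, arr2)  # warm-up call so compilation time is not measured

for fn in (setdif1d_np, setdif1d_in1d_np, setdiff1d_nb):
    t = timeit.timeit(lambda: fn(arr1, arr2), number=100) / 100
    print(f"{fn.__name__}: {t * 1e3:.2f} ms")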

swap assignment slower than introducing a new variable?

Okay, I hope the title makes some sense. I have written a fast Fourier transform code and want to see how it could go faster. So here is the code ('code1') where I do the swapping as the FFT dictates:
import timeit
stp='''
import numpy as nn
D=2**18
N=12
w=(nn.linspace(0,D-1,D)%2**N<(2**N/2))-.5 #square wave
f=w[0:2**N]+nn.zeros(2**N)*1j #it takes only one wavelength from w
'''
code1='''
for m in range(N):
    for l in range(2**m):
        for k in range(2**(N-1-m)):
            t=f[k+l*2**(N-m)]
            f[k+l*2**(N-m)]=(t+f[k+l*2**(N-m)+2**(N-1-m)])
            f[k+l*2**(N-m)+2**(N-1-m)]=(t-f[k+l*2**(N-m)+2**(N-1-m)])*nn.e**(-2*nn.pi*1j*k*2**(m)/2**N)
'''
print(timeit.timeit(setup=stp,stmt=code1,number=1))
where I introduce a new variable 't'. This outputs a time of 0.132.
So I thought it should go faster if I did (setup is the same as before):
code2='''
for m in range(N):
    for l in range(2**m):
        for k in range(2**(N-1-m)):
            f[k+l*2**(N-m)],f[k+l*2**(N-m)+2**(N-1-m)]=f[k+l*2**(N-m)]+f[k+l*2**(N-m)+2**(N-1-m)],(f[k+l*2**(N-m)]-f[k+l*2**(N-m)+2**(N-1-m)])*nn.e**(-2*nn.pi*1j*k*2**(m)/2**N)
'''
print(timeit.timeit(setup=stp,stmt=code2,number=1))
since now I do two assignments instead of three (that was my line of thinking). But it appears that this is actually slower (0.152). Does anyone have an idea as to why? And does anyone know a faster way to do this swapping than the t=a, a=f(a,b), b=g(t,b) I used before, since I find it hard to believe that this is the most efficient way.
EDIT: added the actual code instead of the pseudo-code.
MORE EDIT:
I tried running the same without using Numpy. Both are faster, so that's positive, but again the t=a,a=f(a,b),b=g(t,b) method appears faster (0.104) than the a,b=f(a,b),g(a,b) method (0.114). So the mystery remains.
new code:
stpsansnumpy='''
import cmath as mm
D=2**18
N=12
w=[0]*D
for i in range(D):
    w[i]=(i%2**N<(2**N/2))-.5 #square wave
f=w[0:2**N]+[0*1j]*2**N #it takes only one wavelength from w
'''
code1math='''
for m in range(N):
    for l in range(2**m):
        for k in range(2**(N-1-m)):
            t=f[k+l*2**(N-m)]
            f[k+l*2**(N-m)]=(t+f[k+l*2**(N-m)+2**(N-1-m)])
            f[k+l*2**(N-m)+2**(N-1-m)]=(t-f[k+l*2**(N-m)+2**(N-1-m)])*mm.exp(-2*mm.pi*1j*k*2**(m)/2**N)
'''
print(timeit.timeit(setup=stpsansnumpy,stmt=code1math,number=1))
and:
code2math='''
for m in range(N):
    for l in range(2**m):
        for k in range(2**(N-1-m)):
            f[k+l*2**(N-m)],f[k+l*2**(N-m)+2**(N-1-m)]=f[k+l*2**(N-m)]+f[k+l*2**(N-m)+2**(N-1-m)],(f[k+l*2**(N-m)]-f[k+l*2**(N-m)+2**(N-1-m)])*mm.exp(-2*mm.pi*1j*k*2**(m)/2**N)
'''
print(timeit.timeit(setup=stpsansnumpy,stmt=code2math,number=1))
It would have been nice if you had shared why you think one should be slower than the other, and if your code had been formatted properly as Python code should be.
Something like this:
from timeit import timeit

def f(a, b):
    return a

def g(a, b):
    return a

def extra_var(a, b):
    t = a
    a = f(a, b)
    b = g(t, b)
    return a, b

def swap_direct(a, b):
    a, b = f(a, b), g(a, b)
    return a, b

print(timeit(lambda: extra_var(1, 2)))
print(timeit(lambda: swap_direct(1, 2)))
However, if you had, you would probably have found the same results I did:
0.2162299
0.21171479999999998
The results are so close that in consecutive runs, either function can appear to be a bit faster or slower.
So, you'd increase the volume:
print(timeit(lambda: extra_var(1, 2), number=10000000))
print(timeit(lambda: swap_direct(1, 2), number=10000000))
And the mystery goes away:
2.1527828999999996
2.1225841
The direct swap is actually slightly faster, as expected. What is different about what you were doing that was giving you other results?
You say you're seeing a difference when you implement it in the context of more complicated code - however, this suggests that your code itself is the likely culprit, which is why StackOverflow suggests you share a minimal, reproducible example, so people can actually try what you say happens, instead of having to take your word for it.
In most cases, it turns out someone made a mistake and everything is as expected. In some cases, you get an interesting answer.
In the first version I see 5 indexing operations, and in the second one I see 6. I’m not surprised that 6 indexing operations (with all the computations you use in them) are more expensive than 5.
Creating a temporary variable, or creating a temporary tuple, is peanuts compared to all the computations you do in these code fragments.
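To illustrate the point about the repeated index arithmetic, here is a hedged sketch (my addition, not from the original answers) that reuses the stp setup from the question and hoists the two index expressions into local variables; under that assumption, either swapping style then does the same amount of indexing work:
import timeit

code3 = '''
for m in range(N):
    for l in range(2**m):
        for k in range(2**(N-1-m)):
            i1 = k + l*2**(N-m)
            i2 = i1 + 2**(N-1-m)
            t = f[i1]
            f[i1] = t + f[i2]
            f[i2] = (t - f[i2])*nn.e**(-2*nn.pi*1j*k*2**m/2**N)
'''
print(timeit.timeit(setup=stp, stmt=code3, number=1))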

Alternatives to lpsolve in Python for fast setup and solve

I have been using lpsolve in Python for a very long time, with generally good results. I have even written my own Cython wrapper to it to overcome the mess that the original Python wrapper is.
My Cython wrapper is much faster in the setup of the problem, but of course the solution time only depends on the lpsolve C code and has got nothing to do with Python.
I am only solving real-valued linear problems, no MIPs. The latest (and one of the biggest) LP I had to solve had a constraint matrix of size about 5,000 x 3,000, and the setup plus solve takes about 150 ms. The problem is, I have to solve a slightly modified version of the same problem many times in my simulation (constraints, RHS, bounds and so on are time-dependent for a simulation with many timesteps). The constraint matrix is usually very sparse, with about 0.1% - 0.5% of NNZ or less.
Using lpsolve, the setup and solve of the problem is as easy as the following:
import numpy
from lp_solve.lp_solve import PRESOLVE_COLS, PRESOLVE_ROWS, PRESOLVE_LINDEP, PRESOLVE_NONE, SCALE_DYNUPDATE, lpsolve
# The constraint_matrix is a 2D NumPy array and it contains
# both equality and inequality constraints
# And I need it to be single precision floating point, like all
# other NumPy arrays from here onwards
m, n = constraint_matrix.shape
n_le = len(inequality_constraints)
n_e = len(equality_constraints)
# Setup RHS vector
b = numpy.zeros((m, ), dtype=numpy.float32)
# Assign equality and inequality constraints (= and <=)
b[0:n_e] = equality_constraints
b[-n_le:] = inequality_constraints
# Tell lpsolve which rows are equalities and which are inequalities
e = numpy.asarray(['LE']*m)
e[0:-n_le] = 'EQ'
# Make the LP
lp = lpsolve('make_lp', m, n)
# Some options for scaling the problem
lpsolve('set_scaling', lp, SCALE_DYNUPDATE)
lpsolve('set_verbose', lp, 'IMPORTANT')
# Use presolve as it is much faster
lpsolve('set_presolve', lp, PRESOLVE_COLS | PRESOLVE_ROWS | PRESOLVE_LINDEP)
# I only care about maximization
lpsolve('set_sense', lp, True)
# Set the objective function of the problem
lpsolve('set_obj_fn', lp, objective_function)
lpsolve('set_mat', lp, constraint_matrix)
# Tell lpsolve about the RHS
lpsolve('set_rh_vec', lp, b)
# Set the constraint type (equality or inequality)
lpsolve('set_constr_type', lp, e)
# Set upper bounds for variables - lower bounds are automatically 0
lpsolve('set_upbo', lp, ub_values)
# Solve the problem
out = lpsolve('solve', lp)
# Retrieve the solution for all the variables
vars_sol = numpy.asarray([lpsolve('get_var_primalresult', lp, i) for i in xrange(m + 1, m + n + 1)], dtype=numpy.float32)
# Delete the problem, timestep done
lpsolve('delete_lp', lp)
For reasons that are too long to explain, my NumPy arrays are all single precision floating point arrays, and I'd like them to stay that way.
Now, after all this painful introduction, I would like to ask: does anyone know of another library (with reasonable Python wrappers) that allows me to setup and solve a problem of this size as fast (or potentially faster) than lpsolve? Most of the libraries I have looked at (PuLP, CyLP, PyGLPK and so on) do not seem to have a straightforward way to say "this is my entire constraint matrix, set it in one go". They mostly seem to be oriented towards being "modelling languages" that allow fancy things like this (CyLP example):
# Add variables
x = s.addVariable('x', 3)
y = s.addVariable('y', 2)
# Create coefficients and bounds
A = np.matrix([[1., 2., 0],[1., 0, 1.]])
B = np.matrix([[1., 0, 0], [0, 0, 1.]])
D = np.matrix([[1., 2.],[0, 1]])
a = CyLPArray([5, 2.5])
b = CyLPArray([4.2, 3])
x_u= CyLPArray([2., 3.5])
# Add constraints
s += A * x <= a
s += 2 <= B * x + D * y <= b
s += y >= 0
s += 1.1 <= x[1:3] <= x_u
I honestly don't care about the flexibility, I just need raw speed in the problem setup and solve. Creating NumPy matrices plus doing all those fancy operations above is definitely going to be a performance killer.
I'd rather stay on Open Source solvers if possible, but any suggestion is most welcome. My apologies for the long post.
Andrea.
@infinity77,
I am looking at using lpsolve now. It is fairly straightforward to generate a .lp file for input, and it did solve several smaller problems. BUT... I am now attempting to solve a node coloring problem. This is a MIP. My first attempt at a 500 node problem ran about 15 minutes as an LP relaxation. lpsolve has been grinding on the true MIP since last night. Still grinding. These coloring problems are difficult, but 14 hours with no end in sight is too much.
The best alternative I am aware of is from coin-or.org (the Coin-OR solvers). Try their Clp and Cbc solvers, depending on which type of problem you are solving. Benchmarks I have seen say they are the best choice outside of CPLEX and Gurobi. They are free, but you will need to be sure they are "legal" for your purposes. There is a good benchmark paper by Bernhard Meindl and Matthias Templ at Open Source Solver Benchmarks.
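Not from the original thread: for readers who want the same one-shot, matrix-oriented setup the question asks for, here is a hedged sketch using scipy.optimize.linprog with a sparse constraint matrix (it assumes a recent SciPy with the HiGHS backend; the data below is a hypothetical stand-in, so the LP may well be infeasible and only the setup pattern matters):
import numpy as np
from scipy.optimize import linprog
from scipy.sparse import random as sparse_random

# Hypothetical stand-in problem data
rng = np.random.default_rng(0)
m_eq, m_ub, n = 100, 200, 300
A_eq = sparse_random(m_eq, n, density=0.005, format="csr", random_state=0)
A_ub = sparse_random(m_ub, n, density=0.005, format="csr", random_state=1)
b_eq = rng.random(m_eq)
b_ub = rng.random(m_ub)
c = rng.random(n)
ub = np.ones(n)

# linprog minimizes, so negate the objective to maximize, as the question needs.
res = linprog(-c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0.0, u) for u in ub], method="highs")
print(res.status, res.message)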

From for loops to matrix computation

I've got a piece of code taking an input and checking if the input meets requirements. The input is composed of a list of objects called S.
class S:
    def __init__(self, f, t, tf, timeline):
        self.f = f
        self.t = t
        self.tf = tf
        self.timeline = timeline
To know if a combination of objects meets the requirement, I have functions taking a list of size N of objects and returning True or False.
input1 = [S_1, ..., S_N]
def c1(input1):
    if condition_c1_valid:
        return True
    else:
        return False
Now let's consider this example:
import itertools

possible_objects = [S(f, t, tf, timeline) for f in [...] for t in [..] ...]
inputs_to_check = list(itertools.combinations_with_replacement(possible_objects, 5))
results = list()

for inp in inputs_to_check:
    if c1(inp):
        results.append(inp)
Right now, my solution uses a for loop over the inputs and checks the N conditions every time.
The code keeps the inputs which meets the condition.
Could this be computed at once in a matrix fashion? (Vectorized)
I was thinking of something like this: (pseudo code)
Data[input, c1, ..., cN]
return where(all(c1, ..., cN) is True)
Can anyone tell me if it is achievable, and could point me towards examples? In the end, my list of inputs to check is very large. Thus it would be interesting to send the computation to the GPU. I thought that maybe this could be achieved through Tensorflow...
Thanks for the tips :)
EDIT: The example above is far from the reality. I'm using nested for loops over a large set, with complexity of the 6th or 7th degree. The current solution is optimized with generators, but I would like to push this further.
In the most general sense, you won't be able to vectorize this. CPython is notoriously bad at parallel processing due to the GIL, and its primary matrix vectorization library (numpy) is for dealing with primitive types (integers, floats, etc.), not Python objects such as S.
There are a few things that could help:
If f, t, tf, timeline are numbers (which they look like they may be), then you could form four numpy arrays of these values and pass those through a vectorized version of c1 which returns a boolean array. You could then do np.asarray(input1)[c1_vec(f_vec, t_vec, tf_vec, timeline_vec)] (a small sketch follows after this list).
You said you've used generators instead of lists, but just to be especially sure, your example should read as:
possible_objects = (S(f, t, tf, timeline) for f in (...) for t in (...) ...)
inputs_to_check = itertools.combinations_with_replacement(possible_objects, 5)
results = [inp for inp in inputs_to_check if c1(inp)]
This saves a lot of time writing objects to memory, which can be avoided.
Use PyPy. It uses a JIT compiler to massively speed up python for loops. For very large loops this will get up to near C speed.
You mention using a GPU. CPython doesn't even run on more than one CPU core; running this on a GPU would be pointless unless you use another implementation.
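Here is the small sketch promised in the first suggestion above. It is a hypothetical illustration: the stand-in data, the c1_vec function and its condition are mine, standing in for whatever c1 actually checks.
import numpy as np

class S:  # as in the question
    def __init__(self, f, t, tf, timeline):
        self.f, self.t, self.tf, self.timeline = f, t, tf, timeline

# Hypothetical stand-in data; in the real code input1 is the existing list of S objects.
input1 = [S(f, t, f + t, None) for f, t in np.random.random((1000, 2))]

# One NumPy array per attribute
f_vec = np.array([s.f for s in input1])
t_vec = np.array([s.t for s in input1])
tf_vec = np.array([s.tf for s in input1])
timeline_vec = np.array([s.timeline for s in input1], dtype=object)

def c1_vec(f, t, tf, timeline):
    # placeholder elementwise condition returning a boolean array
    return tf > 1.0

mask = c1_vec(f_vec, t_vec, tf_vec, timeline_vec)
results = np.asarray(input1, dtype=object)[mask]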

Is MATLAB's bsxfun the best? Python's numpy.einsum?

I have a very large multiply and sum operation that I need to implement as efficiently as possible. The best method I've found so far is bsxfun in MATLAB, where I formulate the problem as:
L = 10000;
x = rand(4,1,L+1);
A_k = rand(4,4,L);
tic
for k = 2:L
    i = 2:k;
    x(:,1,k+1) = x(:,1,k+1) + sum(sum(bsxfun(@times, A_k(:,:,2:k), x(:,1,k+1-i)), 2), 3);
end
toc
Note that L will be larger in practice. Is there a faster method? It's strange that I need to first add the singleton dimension to x and then sum over it, but I can't get it to work otherwise.
It's still much faster than any other method I've tried, but not enough for our application. I've heard rumors that the Python function numpy.einsum may be more efficient, but I wanted to ask here first before I consider porting my code.
I'm using MATLAB R2017b.
I believe both of your summations can be removed, but I only removed the easier one for the time being. The summation over the second dimension is trivial, since it only affects the A_k array:
B_k = sum(A_k,2);
for k = 2:L
    i = 2:k;
    x(:,1,k+1) = x(:,1,k+1) + sum(bsxfun(@times, B_k(:,1,2:k), x(:,1,k+1-i)), 3);
end
With this single change the runtime is reduced from ~8 seconds to ~2.5 seconds on my laptop.
The second summation could also be removed, by transforming times+sum into a matrix-vector product. It needs some singleton fiddling to get the dimensions right, but if you define an auxiliary array that is B_k with the second dimension reversed, you can generate the remaining sum as ~x*C_k with this auxiliary array C_k, give or take a few calls to reshape.
So after a closer look I realized that my original assessment was overly optimistic: you have multiplications in both dimensions in your remaining term, so it's not a simple matrix product. Anyway, we can rewrite that term to be the diagonal of a matrix product. This implies that we're computing a bunch of unnecessary matrix elements, but this still seems to be slightly faster than the bsxfun approach, and we can get rid of your pesky singleton dimension too:
L = 10000;
x = rand(4,L+1);
A_k = rand(4,4,L);
B_k = squeeze(sum(A_k,2)).';
tic
for k = 2:L
    ii = 1:k-1;
    x(:,k+1) = x(:,k+1) + diag(x(:,ii)*B_k(k+1-ii,:));
end
toc
This runs in ~2.2 seconds on my laptop, somewhat faster than the ~2.5 seconds obtained previously.
Since you're using a new version of Matlab, you might try broadcasting / implicit expansion instead of bsxfun:
x(:,1,k+1) = x(:,1,k+1)+sum(sum(A_k(:,:,2:k).*x(:,1,k-1:-1:1),3),2);
I also changed the order of summation and removed the i variable for further improvement. On my machine, and with Matlab R2017b, this was about 25% faster for L = 10000.
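Since the question asks specifically about numpy.einsum, here is a hedged, untested sketch of what the loop might look like in NumPy, building on the B_k reduction from the first answer; the translation to 0-based indexing is mine, not part of the original answers.
import numpy as np

L = 10000
rng = np.random.default_rng(0)
x = rng.random((4, L + 1))      # MATLAB x(:,1,:) flattened to 2-D, 0-based columns
A_k = rng.random((4, 4, L))

# Pre-reduce over the second dimension, as in the first answer:
# B_k[a, j] corresponds to sum_b A_k(a, b, j+1) in MATLAB indexing.
B_k = A_k.sum(axis=1)           # shape (4, L)

for k in range(2, L + 1):       # MATLAB k = 2:L
    # MATLAB: x(:,1,k+1) += sum over i=2:k of B_k(:,1,i) .* x(:,1,k+1-i);
    # with 0-based indexing the lag terms become the reversed slice x[:, k-2::-1].
    x[:, k] += np.einsum('ai,ai->a', B_k[:, 1:k], x[:, k - 2::-1])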
