Related
I need a much faster code to remove values of an 1D array (array length ~ 10-15) that are common with another 1D array (array length ~ 1e5-5e5 --> rarely up to 7e5), which are index arrays contain integers. There is no duplicate in the arrays, and they are not sorted and the order of the values must be kept in the main array after modification. I know that can be achieved using such np.setdiff1d or np.in1d (which both are not supported for numba jitted in no-python mode), and other similar posts (e.g. this) have not much more efficient way to do so, but performance is important here because all the values in the main index array will be gradually be removed in loops.
import numpy as np
import numba as nb
n = 500000
r = 10
arr1 = np.random.permutation(n)
arr2 = np.random.randint(0, n, r)
# #nb.jit
def setdif1d_np(a, b):
return np.setdiff1d(a, b, assume_unique=True)
# #nb.jit
def setdif1d_in1d_np(a, b):
return a[~np.in1d(a, b)]
There is another related post that proposed by norok2 for 2D arrays, that is ~15 times faster solution (hashing-like way using numba) than usual methods described there. This solution may be the best if it could be prepared for 1D arrays:
#nb.njit
def mul_xor_hash(arr, init=65537, k=37):
result = init
for x in arr.view(np.uint64):
result = (result * k) ^ x
return result
#nb.njit
def setdiff2d_nb(arr1, arr2):
# : build `delta` set using hashes
delta = {mul_xor_hash(arr2[0])}
for i in range(1, arr2.shape[0]):
delta.add(mul_xor_hash(arr2[i]))
# : compute the size of the result
n = 0
for i in range(arr1.shape[0]):
if mul_xor_hash(arr1[i]) not in delta:
n += 1
# : build the result
result = np.empty((n, arr1.shape[-1]), dtype=arr1.dtype)
j = 0
for i in range(arr1.shape[0]):
if mul_xor_hash(arr1[i]) not in delta:
result[j] = arr1[i]
j += 1
return result
I tried to prepare that for 1D arrays, but I have some problems/question with that.
At first, IDU what does mul_xor_hash exactly do, and if init and k are arbitrary selected or not
Why mul_xor_hash will not work without nb.njit:
File "C:/Users/Ali/Desktop/test - Copy - Copy.py", line 21, in mul_xor_hash
result = (result * k) ^ x
TypeError: ufunc 'bitwise_xor' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
IDK how to implement mul_xor_hash on 1D arrays (if it could), which I guess may make it faster more than for 2Ds, so I broadcast the input arrays to 2D by [None, :], which get the following error just for arr2:
print(mul_xor_hash(arr2[0]))
ValueError: new type not compatible with array
and what does delta do
I am searching the most efficient way in this regard. In the absence of better method than norok2 solution, how to prepare this solution for 1D arrays?
Understanding the hash-based solution
At first, IDU what does mul_xor_hash exactly do, and if init and k are arbitrary selected or not
mul_xor_hash is a custom hash function. Functions mixing xor and multiply (possibly with shifts) are known to be relatively fast to compute the hash of a raw data buffer. The multiplication tends to shuffle bits and the xor is used to somehow combine/accumulate the result in a fixed size small value (ie. the final hash). There are many different hashing functions. Some are faster than others, some cause more collisions than other in a given context. A fast hashing function causing too many collisions can be useless in practice as it would result in a pathological situation where all conflicting values needs to be compared. This is why fast hash functions are hard to implement.
init and k are parameter certainly causing the hash to be pretty balance. This is pretty common in such a hash function. k needs to be sufficiently big for the multiplication to shuffle bits and it should typically also be a prime number (values like power of two tends to increase collisions due to modular arithmetic behaviours). init plays a significant role only for very small arrays (eg. with 1 item): it helps to reduce collisions by xoring the final hash by a non-trivial constant. Indeed, if arr.size = 1, then result = (init * k) ^ arr[0] where init * k is a constant. Having an identity hash function equal to arr[0] is known to be bad since it tends to result in many collisions (this is a complex topic, but put it shortly, arr[0] can be divided by the number of buckets in the hash table for example). Thus, init should be a relatively big number and init * k should also be a big non-trivial value (a prime number is a good target value).
Why mul_xor_hash will not work without nb.njit
It depends of the input. The input needs to be a 1D array and have a raw size in byte divisible by 8 (eg. 64-bit items, 2n x 32-bit ones, 4n x 16-bit one or 8n 8-bit ones). Here is some examples:
mul_xor_hash(np.random.rand(10))
mul_xor_hash(np.arange(10)) # Do not work with 9
and what does delta do
It is a set containing the hash of the arr2 row so to find matching lines faster than comparing them without hashes.
how to prepare this solution for 1D arrays?
AFAIK, hashes are only use to avoid comparisons of rows but this is because the input is the 2D array. In 1D, there is no such a problem.
There is big catch with this method: it only works if there is no hash collisions. Otherwise, the implementation wrongly assumes that values are equal even if they are not! #norok explicitly mentioned it in the comments though:
Note that the collision handling for the hashings should also be implemented
Faster implementation
Using the 2D solution of #norok2 for 1D is not a good idea since hashes will not make it faster the way they are used. In fact, a set already use a hash function internally anyway. Not to mention collisions needs to be properly implemented (which is done by a set).
Using a set is a relatively good idea since it causes the complexity to be O(n + m) where n = len(arr1) and m = len(arr2). That being said, if arr1 is converted to a set, then it will be too big to fit in L1 cache (due to the size of arr1 in your case) resulting in slow cache misses. Additionally, the growing size of the set will cause values to be re-hashed which is not efficient. If arr2 is converted to a set, then the many hash table fetches will not be very efficient since arr2 is very small in your case. This is why this solution is sub-optimal.
One solution is to split arr1 in chunks and then build a set based on the target chunk. You can then check if a value is in the set or not efficiently. Building the set is still not very efficient due to the growing size. This problem is due to Python itself which do not provide a way to reserve some space for the data structure like other languages do (eg. C++). One solution to avoid this issue is simply to reimplement an hash-table which is not trivial and cumbersome. Actually, Bloom filters can be used to speed up this process since they can quickly find if there is no collision between the two sets arr1 and arr2 in average (though they are not trivial to implement).
Another optimization is to use multiple threads to compute the chunks in parallel since they are independent. That being said, the appending to the final array is not easy to do efficiently in parallel, especially since you do not want the order to be modified. One solution is to move away the copy from the parallel loop and do it serially but this is slow and AFAIK there is no simple way to do that in Numba currently (since the parallelism layer is very limited). Consider using native languages like C/C++ for an efficient parallel implementation.
In the end, hashing can be pretty complex and the speed up can be quite small compared to a naive implementation with two nested loops since arr2 only have few items and modern processors can compare values quickly using SIMD instructions (while hash-based method can hardly benefit from them on mainstream processors). Unrolling can help to write a pretty simple and fast implementation. Again, unfortunately, Numba use LLVM-Jit internally which appear to fail to vectorize such a simple code (certainly due to missing optimizations in either LLVM-Jit or even LLVM itself). As a result, the non vectorized code is finally a bit slower (rather than 4~10 times faster on a modern mainstream processor). One solution is to use a C/C++ code instead to do that (or possibly Cython).
Here is a serial implementation using basic Bloom filters:
#nb.njit('uint32(int32)')
def hash_32bit_4k(value):
return (np.uint32(value) * np.uint32(27_644_437)) & np.uint32(0x0FFF)
#nb.njit(['int32[:](int32[:], int32[:])', 'int32[:](int32[::1], int32[::1])'])
def setdiff1d_nb_faster(arr1, arr2):
out = np.empty_like(arr1)
bloomFilter = np.zeros(4096, dtype=np.uint8)
for j in range(arr2.size):
bloomFilter[hash_32bit_4k(arr2[j])] = True
cur = 0
for i in range(arr1.size):
# If the bloom-filter value is true, we know arr1[i] is not in arr2.
# Otherwise, there is maybe a false positive (conflict) and we need to check to be sure.
if bloomFilter[hash_32bit_4k(arr1[i])] and arr1[i] in arr2:
continue
out[cur] = arr1[i]
cur += 1
return out[:cur]
Here is an untested variant that should work for 64-bit integers (floating point numbers need memory views and possibly a prime constant too):
#nb.njit('uint64(int64)')
def hash_32bit_4k(value):
return (np.uint64(value) * np.uint64(67_280_421_310_721)) & np.uint64(0x0FFF)
Note that if all the values in the small array are contained in the main array in each loop, then we can speed up the arr1[i] in arr2 part by removing values from arr2 when we find them. That being said, collisions and findings should be very rare so I do not expect this to be significantly faster (not to mention it adds some overhead and complexity). If items are computed in chunks, then the last chunks can be directly copied without any check but the benefit should still be relatively small. Note that this strategy can be effective for the naive (C/C++) SIMD implementation previously mentioned though (it can be about 2x faster).
Generalization and parallel implementation
This section focus on the algorithm to use regarding the input size. It particularly details an SIMD-based implementation and discuss about the use of multiple threads.
First of all, regarding the value r, the best algorithm to use can be different. More specifically:
when r is 0, the best thing to do is to return the input array arr1 unmodified (possibly a copy to avoid issue with in-place algorithms);
when r is 1, we can use one basic loop iterating over the array, but the best implementation is likely to use np.where of Numpy which is highly optimized for that
when r is small like <10, then using a SIMD-based implementation should be particularly efficient, especially if the iteration range of the arr2-based loop is known at compile-time and is unrolled
for bigger r values that are still relatively small (eg. r < 1000 and r << n), the provided hash-based solution should be one of the best;
for larger r values with r << n, the hash-based solution can be optimized by packing boolean values as bits in bloomFilter and by using multiple hash-functions instead of one so to better handle collisions while being more cache-friendly (in fact, this is what actual bloom filters does); note that multi-threading can be used so speed up the lookups when r is huge and r << n;
when r is big and not much smaller than n, then the problem is pretty hard to solve efficiently and the best solution is certainly to sort both arrays (typically with a radix sort) and use a merge-based method to remove the duplicates, possibly with multiple threads when both r and n are huge (hard to implement).
Let's start with the SIMD-based solution. Here is an implementation:
#nb.njit('int32[:](int32[::1], int32[::1])')
def setdiff1d_nb_simd(arr1, arr2):
out = np.empty_like(arr1)
limit = arr1.size // 4 * 4
limit2 = arr2.size // 2 * 2
cur = 0
z32 = np.int32(0)
# Tile (x4) based computation
for i in range(0, limit, 4):
f0, f1, f2, f3 = z32, z32, z32, z32
v0, v1, v2, v3 = arr1[i], arr1[i+1], arr1[i+2], arr1[i+3]
# Unrolled (x2) loop searching for a match in `arr2`
for j in range(0, limit2, 2):
val1 = arr2[j]
val2 = arr2[j+1]
f0 += (v0 == val1) + (v0 == val2)
f1 += (v1 == val1) + (v1 == val2)
f2 += (v2 == val1) + (v2 == val2)
f3 += (v3 == val1) + (v3 == val2)
# Remainder of the previous loop
if limit2 != arr2.size:
val = arr2[arr2.size-1]
f0 += v0 == val
f1 += v1 == val
f2 += v2 == val
f3 += v3 == val
if f0 == 0: out[cur] = arr1[i+0]; cur += 1
if f1 == 0: out[cur] = arr1[i+1]; cur += 1
if f2 == 0: out[cur] = arr1[i+2]; cur += 1
if f3 == 0: out[cur] = arr1[i+3]; cur += 1
# Remainder
for i in range(limit, arr1.size):
if arr1[i] not in arr2:
out[cur] = arr1[i]
cur += 1
return out[:cur]
It turns out this implementation is always slower than the hash-based one on my machine since Numba clearly generate an inefficient for the inner arr2-based loop and this appears to come from broken optimizations related to the ==: Numba simply fail use SIMD instructions for this operation (for no apparent reasons). This prevent many alternative SIMD-related codes to be fast as long as they are using Numba.
Another issue with Numba is that np.where is slow since it use a naive implementation while the one of Numpy has been heavily optimized. The optimization done in Numpy can hardly be applied to the Numba implementation due to the previous issue. This prevent any speed up using np.where in a Numba code.
In practice, the hash-based implementation is pretty fast and the copy takes a significant time on my machine already. The computing part can be speed up using multiple thread. This is not easy since the parallelism model of Numba is very limited. The copy cannot be easily optimized with Numba (one can use non-temporal store but this is not yet supported by Numba) unless the computation is possibly done in-place.
To use multiple threads, one strategy is to first split the range in chunk and then:
build a boolean array determining, for each item of arr1, whether the item is found in arr2 or not (fully parallel)
count the number of item found by chunk (fully parallel)
compute the offset of the destination chunk (hard to parallelize, especially with Numba, but fast thanks to chunks)
copy the chunk to the target location without copying found items (fully parallel)
Here is an efficient parallel hash-based implementation:
#nb.njit('int32[:](int32[:], int32[:])', parallel=True)
def setdiff1d_nb_faster_par(arr1, arr2):
# Pre-computation of the bloom-filter
bloomFilter = np.zeros(4096, dtype=np.uint8)
for j in range(arr2.size):
bloomFilter[hash_32bit_4k(arr2[j])] = True
chunkSize = 1024 # To tune regarding the kind of input
chunkCount = (arr1.size + chunkSize - 1) // chunkSize
# Find for each item of `arr1` if the value is in `arr2` (parallel)
# and count the number of item found for each chunk on the fly.
# Note: thanks to page fault, big parts of `found` are not even written in memory if `arr2` is small
found = np.zeros(arr1.size, dtype=nb.bool_)
foundCountByChunk = np.empty(chunkCount, dtype=nb.uint16)
for i in nb.prange(chunkCount):
start, end = i * chunkSize, min((i + 1) * chunkSize, arr1.size)
foundCountInChunk = 0
for j in range(start, end):
val = arr1[j]
if bloomFilter[hash_32bit_4k(val)] and val in arr2:
found[j] = True
foundCountInChunk += 1
foundCountByChunk[i] = foundCountInChunk
# Compute the location of the destination chunks (sequential)
outChunkOffsets = np.empty(chunkCount, dtype=nb.uint32)
foundCount = 0
for i in range(chunkCount):
outChunkOffsets[i] = i * chunkSize - foundCount
foundCount += foundCountByChunk[i]
# Parallel chunk-based copy
out = np.empty(arr1.size-foundCount, dtype=arr1.dtype)
for i in nb.prange(chunkCount):
srcStart, srcEnd = i * chunkSize, min((i + 1) * chunkSize, arr1.size)
cur = outChunkOffsets[i]
# Optimization: we can copy the whole chunk if there is nothing found in it
if foundCountByChunk[i] == 0:
out[cur:cur+(srcEnd-srcStart)] = arr1[srcStart:srcEnd]
else:
for j in range(srcStart, srcEnd):
if not found[j]:
out[cur] = arr1[j]
cur += 1
return out
This implementation is the fastest for the target input on my machine. It is generally fast when n is quite big and the overhead to create threads is relatively small on the target platform (eg. on PCs but typically not computing servers with many cores). The overhead of the parallel implementation is significant so the number of core on the target machine needs to be at least 4 so the implementation can be significantly faster than the sequential implementation.
It may be useful to tune the chunkSize variable for the target inputs. If r << n, it is better to use a pretty big chunkSize. That being said, the number of chunk needs to be sufficiently big for multiple thread to operate on many chunks. Thus, chunkSize should be significantly smaller than n / numberOfThreads.
On my machine most of the time (65-70%) is spent in the final copy which is mostly memory-bound and can hardly be optimized further with Numba.
Results
Here are results on my i5-9600KF-based machine (with 6 cores):
setdif1d_np: 2.65 ms
setdif1d_in1d_np: 2.61 ms
setdiff1d_nb: 2.33 ms
setdiff1d_nb_simd: 1.85 ms
setdiff1d_nb_faster: 0.73 ms
setdiff1d_nb_faster_par: 0.49 ms
The best provided implementation is about 4~5 time faster than the other ones.
What I found is that hashing does not help,. It is just trick for 2D case, to convert 1d arrays to single numbers and put them as such in a set.
Below is method of norok2 I converted to 1d arrays (and added annotations for faster compilation).
Note that this is only slightly (20-30%) faster than the methods you already have. And of course after second function call, on first due to compilation it is slightly slower.
#nb.njit('int32[:](int32[:], int32[:])')
def setdiff1d_nb(arr1, arr2):
delta = set(arr2)
# : build the result
result = np.empty(len(arr1), dtype=arr1.dtype)
j = 0
for i in range(arr1.shape[0]):
if arr1[i] not in delta:
result[j] = arr1[i]
j += 1
return result[:j]
I have a piece of code which iterates through a vector several times, performs some calculation, and averages the result into existing data. This calculation is based on other variables (eg time) as well as the input, so the same input has a different output and the total results cannot be pre-computed. This looks like this:
output = np.zeros(50)
while loop_count < max_loops:
for idx, dat in enumerate(vec):
val = calculate(dat)
averaged = (val + output[idx] * loop_count) / (1 + loop_count)
output[idx] = averaged
loop_count += 1
This works fine but appears to be quite slow (taking around 9s). Is there a better solution, ideally using numpy, scipy or pandas? The length of vec can be quite long so avoiding a copy is also ideal
You could just compute the sum and divide by max_loops in the end, right? This would make it one bit faster as you also could do the summation in-place:
output = np.zeros(50)
for idx, dat in enumerate(vec):
for count in range(max_loop):
output[idx] += calculate(dat)
output[idx] /= max_loop
As you provided a more abstract example (which is helpful), I am not sure whether this restructured version works for your application. If the computation in calculate is truly independent for the various repetitions, I don't see further simplification.
I would like to perform the operation
If had a regular shape, then I could use np.einsum, I believe the syntax would be
np.einsum('ijp,ipk->ijk',X, alpha)
Unfortunately, my data X has a non regular structure on the 1st (if we zero index) axis.
To give a little more context, refers to the p^th feature of the j^th member of the i^th group. Because groups have different sizes, effectively, it is a list of lists of different lengths, of lists of the same length.
has a regular structure and thus can be saved as a standard numpy array (it comes in 1-dimensional and then I use alpha.reshape(a,b,c) where a,b,c are problem specific integers)
I would like to avoid storing X as a list of lists of lists or a list of np.arrays of different dimensions and writing something like
A = []
for i in range(num_groups):
temp = np.empty(group_sizes[i], dtype=float)
for j in range(group_sizes[i]):
temp[i] = np.einsum('p,pk->k',X[i][j], alpha[i,:,:])
A.append(temp)
Is this some nice numpy function/data structure for doing this or am I going to have to compromise with some only partially vectorised implementation?
I know this sounds obvious, but, if you can afford the memory, I'd start just by checking the performance you get simply by padding the data to have a uniform size, that is, simply adding zeros and perform the operation. Sometimes a simpler solution is faster than a more supposedly optimal one that has more Python/C roundtrips.
If that doesn't work, then your best bet, as Tom Wyllie suggested, is probably a bucketing strategy. Assuming X is your list of lists of lists and alpha is an array, you can start by collecting the sizes of the second index (maybe you already have this):
X_sizes = np.array([len(x_i) for x_i in X])
And sort them:
idx_sort = np.argsort(X_sizes)
X_sizes_sorted = X_sizes[idx_sort]
Then you choose a number of buckets, which is the number of divisions of your work. Let's say you pick BUCKETS = 4. You just need to divide the data so that more or less each piece is the same size:
sizes_cumsum = np.cumsum(X_sizes_sorted)
total = sizes_cumsum[-1]
bucket_idx = []
for i in range(BUCKETS):
low = np.round(i * total / float(BUCKETS))
high = np.round((i + 1) * total / float(BUCKETS))
m = sizes_cumsum >= low & sizes_cumsum < high
idx = np.where(m),
# Make relative to X, not idx_sort
idx = idx_sort[idx]
bucket_idx.append(idx)
And then you make the computation for each bucket:
bucket_results = []
for idx in bucket_idx:
# The last index in the bucket will be the biggest
bucket_size = X_sizes[idx[-1]]
# Fill bucket array
X_bucket = np.zeros((len(X), bucket_size, len(X[0][0])), dtype=X.dtype)
for i, X_i in enumerate(idx):
X_bucket[i, :X_sizes[X_i]] = X[X_i]
# Compute
res = np.einsum('ijp,ipk->ijk',X, alpha[:, :bucket_size, :])
bucket_results.append(res)
Filling the array X_bucket will probably be slow in this part. Again, if you can afford the memory, it would be more efficient to have X in a single padded array and then just slice X[idx, :bucket_size, :].
Finally, you can put back your results into a list:
result = [None] * len(X)
for res, idx in zip(bucket_results, bucket_idx):
for r, X_i in zip(res, idx):
result[X_i] = res[:X_sizes[X_i]]
Sorry I'm not giving a proper function, but I'm not sure how exactly is your input or expected output so I just put the pieces and you can use them as you see fit.
I have been fighting against a function giving me a memory error and thanks to your support (Python: how to split and return a list from a function to avoid memory error) I managed to sort the issue; however, since I am not a pro-programmer I would like to ask for your opinion on my method and how to improve its performance (if possible).
The function is a generator function returning all cycles from an n-nodes digraph. However, for a 12 nodes digraph, there are about 115 million cycles (each defined as a list of nodes, e.g. [0,1,2,0] is a cycle). I need all cycles available for further processing even after I have extracted some of their properties when they were first generated, so they need to be stored somewhere. So, the idea is to cut the result array every 10 million cycles to avoid memory error (when an array is too big, python runs out of RAM) and create a new array to store the following results. In the 12 node digraph, I would then have 12 result arrays, 11 full ones (containing 10 million cycles each) and the last containing 5 million cycles.
However, splitting the result array is not enough since the variables stay in RAM. So, I still need to write each one to the disk and delete it afterwards to clear the RAM.
As stated in How do I create a variable number of variables?, using 'exec' to create variable variable names is not very "clean" and dictionary solutions are better. However, in my case, if I store the results in a single dictionary, it will run out of memory due to the size of the arrays. Hence, I went for the 'exec' way. I would be grateful if you could comment on that decision.
Also, to store the arrays I use numpy.savez_compressed which gives me a 43 Mb file for each 10million cycles array. If it is not compressed it creates a 500 Mb file. However, using the compressed version slows the writing process. Any idea how to speed the writing and/or compressing process?
A simplified version of the code I wrote is as follows:
nbr_result_arrays=0
result_array_0=[]
result_lenght=10000000
tmp=result_array_0 # I use tmp to avoid using exec within the for loop (exec slows down code execution)
for cycle in generator:
tmp.append(cycle)
if len(tmp) == result_lenght:
exec 'np.savez_compressed(\'results_' +str(nbr_result_arrays)+ '\', tmp)'
exec 'del result_array_'+str(nbr_result_arrays)
nbr_result_arrays+=1
exec 'result_array_'+str(nbr_result_arrays)+'=[]'
exec 'tmp=result_array_'+str(nbr_result_arrays)
Thanks for reading,
Aleix
How about using itertools.islice?
import itertools
import numpy as np
for i in itertools.count():
tmp = list(itertools.islice(generator, 10000000))
if not tmp:
break
np.savez_compressed('results_{}'.format(i), tmp)
del tmp
thanks to all for your suggestions.
As suggested by #Aya, I believe that to improve performance (and possible space issues) I should avoid to store the results on the HD because storing them adds half of the time than creating the result, so loading and processing it again would get very close to creating the result again. Additionally, if I do not store any result, I save space which can become a big issue for bigger digraphs (a 12 node complete digraphs has about 115 million cycles but a 29 node ones has about 848E27 cycles... and increasing at factorial rate).
The idea is that I first need to find through all cycles going through the weakest arc to find the total probability of all cycles going it. Then, with this total probability I must go again through all those cycles to subtract them from the original array according to the weighted probability (I needed the total probability to be able to calculate the weighted probalility: weighted_prob= prob_of_this_cycle/total_prob_through_this_edge).
Thus, I believe that this is the best approach to do that (but I am open to more discussions! :) ).
However, I have a doubt regarding speed processing regarding two sub-functions:
1st: find whether a sequence contains a specific (smaller) sequence. I am doing that with the function "contains_sequence" which relies on the generator function "window" (as suggested in Is there a Python builtin for determining if an iterable contained a certain sequence? However I have been told that doing it with a deque would be up to 33% faster. Any other ideas?
2nd: I am currently finding the cycle probability of a cycle by sliding through the cycle nodes (which is represented by a list) to find the probability at the output of each arc to stay within the cycle and then multiply them all to find the cycle probability (the function name is find_cycle_probability). Any performance suggestions on this function would be appreciated since I need to run it for each cycle, i.e. countless times.
Any other tips/suggestion/comments will be most welcome! And thanks again for your help.
Aleix
Below follows the simplified code:
def simple_cycles_generator_w_filters(working_array_digraph, arc):
'''Generator function generating all cycles containing a specific arc.'''
generator=new_cycles.simple_cycles_generator(working_array_digraph)
for cycle in generator:
if contains_sequence(cycle, arc):
yield cycle
return
def find_smallest_arc_with_cycle(working_array,working_array_digraph):
'''Find the smallest arc through which at least one cycle flows.
Returns:
- if such arc exist:
smallest_arc_with_cycle = [a,b] where a is the start of arc and b the end
smallest_arc_with_cycle_value = x where x is the weight of the arc
- if such arc does not exist:
smallest_arc_with_cycle = []
smallest_arc_with_cycle_value = 0 '''
smallest_arc_with_cycle = []
smallest_arc_with_cycle_value = 0
sparse_array = []
for i in range(numpy.shape(working_array)[0]):
for j in range(numpy.shape(working_array)[1]):
if working_array[i][j] !=0:
sparse_array.append([i,j,working_array[i][j]])
sorted_array=sorted(sparse_array, key=lambda x: x[2])
for i in range(len(sorted_array)):
smallest_arc=[sorted_array[i][0],sorted_array[i][1]]
generator=simple_cycles_generator_w_filters(working_array_digraph,smallest_arc)
if any(generator):
smallest_arc_with_cycle=smallest_arc
smallest_arc_with_cycle_value=sorted_array[i][2]
break
return smallest_arc_with_cycle,smallest_arc_with_cycle_value
def window(seq, n=2):
"""Returns a sliding window (of width n) over data from the iterable
s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... """
it = iter(seq)
result = list(itertools.islice(it, n))
if len(result) == n:
yield result
for elem in it:
result = result[1:] + [elem]
yield result
def contains_sequence(all_values, seq):
return any(seq == current_seq for current_seq in window(all_values, len(seq)))
def find_cycle_probability(cycle, working_array, total_outputs):
'''Finds the cycle probability of a given cycle within a given array'''
output_prob_of_each_arc=[]
for i in range(len(cycle)-1):
weight_of_the_arc=working_array[cycle[i]][cycle[i+1]]
output_probability_of_the_arc=float(weight_of_the_arc)/float(total_outputs[cycle[i]])#NOTE:total_outputs is an array, thus the float
output_prob_of_each_arc.append(output_probability_of_the_arc)
circuit_probabilities_of_the_cycle=numpy.prod(output_prob_of_each_arc)
return circuit_probabilities_of_the_cycle
def clean_negligible_values(working_array):
''' Cleans the array by rounding negligible values to 0 according to a
pre-defined threeshold.'''
zero_threeshold=0.000001
for i in range(numpy.shape(working_array)[0]):
for j in range(numpy.shape(working_array)[1]):
if working_array[i][j] == 0:
continue
elif 0 < working_array[i][j] < zero_threeshold:
working_array[i][j] = 0
elif -zero_threeshold <= working_array[i][j] < 0:
working_array[i][j] = 0
elif working_array[i][j] < -zero_threeshold:
sys.exit('Error')
return working_array
original_array= 1000 * numpy.random.random_sample((5, 5))
total_outputs=numpy.sum(original_array,axis=0) + 100 * numpy.random.random_sample(5)
working_array=original_array.__copy__()
straight_array= working_array.__copy__()
cycle_array=numpy.zeros(numpy.shape(working_array))
iteration_counter=0
working_array_digraph=networkx.DiGraph(working_array)
[smallest_arc_with_cycle, smallest_arc_with_cycle_value]= find_smallest_arc_with_cycle(working_array, working_array_digraph)
while smallest_arc_with_cycle: # using implicit true value of a non-empty list
cycle_flows_to_be_subtracted = numpy.zeros(numpy.shape((working_array)))
# FIRST run of the generator to calculate each cycle probability
# note: the cycle generator ONLY provides all cycles going through
# the specified weakest arc
generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc_with_cycle)
nexus_total_probs = 0
for cycle in generator:
cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)
nexus_total_probs += cycle_prob
# SECOND run of the generator
# using the nexus_prob_sum calculated before, I can allocate the weight of the
# weakest arc to each cycle going through it
generator = simple_cycles_generator_w_filters(working_array_digraph,smallest_arc_with_cycle)
for cycle in generator:
cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)
allocated_cycle_weight = cycle_prob / nexus_total_probs * smallest_arc_with_cycle_value
# create the array to be substracted
for i in range(len(cycle)-1):
cycle_flows_to_be_subtracted[cycle[i]][cycle[i+1]] += allocated_cycle_weight
working_array = working_array - cycle_flows_to_be_subtracted
clean_negligible_values(working_array)
cycle_array = cycle_array + cycle_flows_to_be_subtracted
straight_array = straight_array - cycle_flows_to_be_subtracted
clean_negligible_values(straight_array)
# find the next weakest arc with cycles.
working_array_digraph=networkx.DiGraph(working_array)
[smallest_arc_with_cycle, smallest_arc_with_cycle_value] = find_smallest_arc_with_cycle(working_array,working_array_digraph)
I'm just starting to play about with OpenCL, and I'm stuck with how to structure the program in a reasonably efficient manner (mainly avoiding lots of transferring of data to/from the GPU or wherever the work is being done)
What I'm trying to do is, given:
v = r*i + b*j + g*k
..I know v for various values of r, g and b, but i, j and k are unknown. I want to calculate reasonable values for i/j/k via brute force
In other words, I have a bunch of "raw" RGB pixel values, and I have a desaturated version of these colours. I do not know the weightings (i/j/k) used calculate the desaturated values.
My initial plan was to:
load the data into a CL buffer (so the input r/g/b values, and the output)
have a kernel which takes the three possible matrix values, and the various pixel-data buffers.
It then performs v = r*i + b*j + g*k, and subtracts the value of v to the known value, and stores this in a "score" buffer
Another kernel calculates the RMS error for that value (if the difference is zero for all input values, the values for i/j/k are "correct")
I have this working (written using Python and PyCL, the code is here), but I'm wondering how I can parallelise this chunk of work more (by try multiple i/j/k values at once)
I issue is, I have the 4 read-only buffers (3 for the input values, 1 for the expected values), but I need a separate "score" buffer for every combination of i/j/k
Another issue is the RMS calculation is the slowest part, since it's effectively single-threaded (total up all the values in "score" and sqrt() the total)
Basically, I'm wondering if there's a sensible way to structure such a program.
It seems like a task well-suited to OpenCL - hopefully the description of my goal wasn't too convoluted! As mentioned, my current code is here, and in case it is clearer, this is the Python version of what I'm trying to do:
import sys
import math
import random
def make_test_data(w = 128, h = 128):
in_r, in_g, in_b = [], [], []
print "Make raw data"
for x in range(w):
for y in range(h):
in_r.append(random.random())
in_g.append(random.random())
in_b.append(random.random())
# the unknown values
mtx = [random.random(), random.random(), random.random()]
print "Secret numbers were: %s" % mtx
out_r = [(r*mtx[0] + g*mtx[1] + b*mtx[2]) for (r, g, b) in zip(in_r, in_g, in_b)]
return {'in_r': in_r, 'in_g': in_g, 'in_b': in_b,
'expected_r': out_r}
def score_matrix(ir, ig, ib, expected_r, mtx):
ms = 0
for i in range(len(ir)):
val = ir[i] * mtx[0] + ig[i] * mtx[1] + ib[i] * mtx[2]
ms += abs(val - expected_r[i]) ** 2
rms = math.sqrt(ms / float(len(ir)))
return rms
# Make random test data
test_data = make_test_data(16, 16)
lowest_rms = sys.maxint
closest = []
divisions = 10
for possible_r in range(divisions):
for possible_g in range(divisions):
for possible_b in range(divisions):
pr, pg, pb = [x / float(divisions-1) for x in (possible_r, possible_g, possible_b)]
rms = score_matrix(
test_data['in_r'], test_data['in_g'], test_data['in_b'],
test_data['expected_r'],
mtx = [pr, pg, pb])
if rms < lowest_rms:
closest = [pr, pg, pb]
lowest_rms = rms
print closest
Are i,j,k sets independent? I assumed that yes. Few things hurts your performance:
running too many small kernels
using global memory for communication between score_matrix and rm_to_rms
You could rewrite both kernels into one with following changes:
make that one OpenCL work-group would work on different i,j,k - you can pre-generate this on CPU
in order to do 1 you need to process multiple elements of array with one thread you can do it like this:
int i = get_thread_id(0);
float my_sum = 0;
for (; i < array_size; i += get_local_size(0)){
float val = in_r[i] * mtx_r + in_g[i] * mtx_g + in_b[i] * mtx_b;
my_sum += pow(fabs(expect_r[i] - val), 2);
}
after this you write my_sum for each thread into local memory and sum it up with reduce (O(log(n)) algorithm).
save result into global memory
alternatively if you need to compute i,j,k sequentially you can look up barrier and memory fence functions in OpenCL specification so you can use these instead of running two kernels, just remember to sum up everything in first step, write into global synchronize all threads, and then sum up again
There are two potential issues:
Kernel launch overhead may be large if the work required to process each of your images is small. This is what you would address by combining the evaluation of multiple i,j,k values in a single kernel.
Serialization of the sum calculation for the RMSE. This is likely the larger issue, currently.
To address (2), notice that summation can be evaluated in parallel, but it is not as trivial as mapping a function separately over every pixel in your input. That is because summation requires communicating values between neighboring elements, rather than treating all elements independently. This pattern is commonly called a reduction.
PyOpenCL includes high-level support for common reductions. What you want here is a sum reduction: pyopencl.array.sum(array).
Looking further into how this is implemented in raw OpenCL, Apple's OpenCL docs include an example of parallel reduction for sum. The pieces most relevant to what you want to do are the kernel and the main and create_reduction_pass_counts functions of the host C program which runs the reduction.