I want to multiply two numeric numpy objects t and speed without knowing a-priori whether each one is scalar or an array. The problem is that 0 is a legal value for t (or for elements of t), and inf is a legal value for speed (or for elements of speed). The model I'm working with has a rule for this case: wherever speed is infinite and t is zero, the product is defined as 0.
My problem is in applying the rule while coping with the possibly-scalar-possibly-not-ness of both operands. So far the best I can come up with is this ugly cascade:
if t.size == 1:
if t.flat[0]:
t = t * speed
else:
nonzero = t != 0
if speed.size == 1:
t[nonzero] *= speed
else:
t[nonzero] *= speed[nonzero]
I can't help but think there has to be a more efficient, more Numpythonic way that I'm missing. Is there?
0 * np.inf throws a RuntimeWarning and evaluates to np.nan
How about just doing the multiplication, and replacing the np.nan after that?
import numpy as np
def multiply(a, b):
temp = np.array(a*b)
temp[np.isnan(temp)] = 0
return temp
Testing:
np.random.seed(123)
t = np.random.rand(10)
t[0] = 0
speed = np.random.rand(10)
speed[0] = np.inf
speed[6] = np.inf
input:
multiply(t, speed)
output:
array([ 0. , 0.09201602, 0.02440781, 0.49796297, 0.06170694,
0.57372377, inf, 0.03658583, 0.53141369, 0.1746193 ])
It's been a couple of days, and nobody has written up the answer given in comments by hpaulj and SubhaneilLahiri, so I had better do it myself. My reason for making this the accepted answer is that it is general enough to address the question's title, even in cases that go beyond the specific model given in the question.
# prepare zeros of the correct shape:
out = np.zeros(np.broadcast(t, speed).shape), dtype=float)
# write the product values into the prepared array, but ONLY
# according to the mask t!=0
np.multiply(t, speed, where=t!=0, out=out)
In the particular model in the question, the pre-prepared default output value of 0 is already correct for all places where the test t!=0 fails: 0 * speed is 0 anyway wherever speed is finite, and it is also 0 by fiat (according to the model's rule) when speed is infinite. But this approach is adaptable to the more general case: in principle, it would be possible to use additional numpy ufunc calls to fill in the masked-out parts of the output array with the results of arbitrarily different rules.
Related
I need a much faster code to remove values of an 1D array (array length ~ 10-15) that are common with another 1D array (array length ~ 1e5-5e5 --> rarely up to 7e5), which are index arrays contain integers. There is no duplicate in the arrays, and they are not sorted and the order of the values must be kept in the main array after modification. I know that can be achieved using such np.setdiff1d or np.in1d (which both are not supported for numba jitted in no-python mode), and other similar posts (e.g. this) have not much more efficient way to do so, but performance is important here because all the values in the main index array will be gradually be removed in loops.
import numpy as np
import numba as nb
n = 500000
r = 10
arr1 = np.random.permutation(n)
arr2 = np.random.randint(0, n, r)
# #nb.jit
def setdif1d_np(a, b):
return np.setdiff1d(a, b, assume_unique=True)
# #nb.jit
def setdif1d_in1d_np(a, b):
return a[~np.in1d(a, b)]
There is another related post that proposed by norok2 for 2D arrays, that is ~15 times faster solution (hashing-like way using numba) than usual methods described there. This solution may be the best if it could be prepared for 1D arrays:
#nb.njit
def mul_xor_hash(arr, init=65537, k=37):
result = init
for x in arr.view(np.uint64):
result = (result * k) ^ x
return result
#nb.njit
def setdiff2d_nb(arr1, arr2):
# : build `delta` set using hashes
delta = {mul_xor_hash(arr2[0])}
for i in range(1, arr2.shape[0]):
delta.add(mul_xor_hash(arr2[i]))
# : compute the size of the result
n = 0
for i in range(arr1.shape[0]):
if mul_xor_hash(arr1[i]) not in delta:
n += 1
# : build the result
result = np.empty((n, arr1.shape[-1]), dtype=arr1.dtype)
j = 0
for i in range(arr1.shape[0]):
if mul_xor_hash(arr1[i]) not in delta:
result[j] = arr1[i]
j += 1
return result
I tried to prepare that for 1D arrays, but I have some problems/question with that.
At first, IDU what does mul_xor_hash exactly do, and if init and k are arbitrary selected or not
Why mul_xor_hash will not work without nb.njit:
File "C:/Users/Ali/Desktop/test - Copy - Copy.py", line 21, in mul_xor_hash
result = (result * k) ^ x
TypeError: ufunc 'bitwise_xor' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
IDK how to implement mul_xor_hash on 1D arrays (if it could), which I guess may make it faster more than for 2Ds, so I broadcast the input arrays to 2D by [None, :], which get the following error just for arr2:
print(mul_xor_hash(arr2[0]))
ValueError: new type not compatible with array
and what does delta do
I am searching the most efficient way in this regard. In the absence of better method than norok2 solution, how to prepare this solution for 1D arrays?
Understanding the hash-based solution
At first, IDU what does mul_xor_hash exactly do, and if init and k are arbitrary selected or not
mul_xor_hash is a custom hash function. Functions mixing xor and multiply (possibly with shifts) are known to be relatively fast to compute the hash of a raw data buffer. The multiplication tends to shuffle bits and the xor is used to somehow combine/accumulate the result in a fixed size small value (ie. the final hash). There are many different hashing functions. Some are faster than others, some cause more collisions than other in a given context. A fast hashing function causing too many collisions can be useless in practice as it would result in a pathological situation where all conflicting values needs to be compared. This is why fast hash functions are hard to implement.
init and k are parameter certainly causing the hash to be pretty balance. This is pretty common in such a hash function. k needs to be sufficiently big for the multiplication to shuffle bits and it should typically also be a prime number (values like power of two tends to increase collisions due to modular arithmetic behaviours). init plays a significant role only for very small arrays (eg. with 1 item): it helps to reduce collisions by xoring the final hash by a non-trivial constant. Indeed, if arr.size = 1, then result = (init * k) ^ arr[0] where init * k is a constant. Having an identity hash function equal to arr[0] is known to be bad since it tends to result in many collisions (this is a complex topic, but put it shortly, arr[0] can be divided by the number of buckets in the hash table for example). Thus, init should be a relatively big number and init * k should also be a big non-trivial value (a prime number is a good target value).
Why mul_xor_hash will not work without nb.njit
It depends of the input. The input needs to be a 1D array and have a raw size in byte divisible by 8 (eg. 64-bit items, 2n x 32-bit ones, 4n x 16-bit one or 8n 8-bit ones). Here is some examples:
mul_xor_hash(np.random.rand(10))
mul_xor_hash(np.arange(10)) # Do not work with 9
and what does delta do
It is a set containing the hash of the arr2 row so to find matching lines faster than comparing them without hashes.
how to prepare this solution for 1D arrays?
AFAIK, hashes are only use to avoid comparisons of rows but this is because the input is the 2D array. In 1D, there is no such a problem.
There is big catch with this method: it only works if there is no hash collisions. Otherwise, the implementation wrongly assumes that values are equal even if they are not! #norok explicitly mentioned it in the comments though:
Note that the collision handling for the hashings should also be implemented
Faster implementation
Using the 2D solution of #norok2 for 1D is not a good idea since hashes will not make it faster the way they are used. In fact, a set already use a hash function internally anyway. Not to mention collisions needs to be properly implemented (which is done by a set).
Using a set is a relatively good idea since it causes the complexity to be O(n + m) where n = len(arr1) and m = len(arr2). That being said, if arr1 is converted to a set, then it will be too big to fit in L1 cache (due to the size of arr1 in your case) resulting in slow cache misses. Additionally, the growing size of the set will cause values to be re-hashed which is not efficient. If arr2 is converted to a set, then the many hash table fetches will not be very efficient since arr2 is very small in your case. This is why this solution is sub-optimal.
One solution is to split arr1 in chunks and then build a set based on the target chunk. You can then check if a value is in the set or not efficiently. Building the set is still not very efficient due to the growing size. This problem is due to Python itself which do not provide a way to reserve some space for the data structure like other languages do (eg. C++). One solution to avoid this issue is simply to reimplement an hash-table which is not trivial and cumbersome. Actually, Bloom filters can be used to speed up this process since they can quickly find if there is no collision between the two sets arr1 and arr2 in average (though they are not trivial to implement).
Another optimization is to use multiple threads to compute the chunks in parallel since they are independent. That being said, the appending to the final array is not easy to do efficiently in parallel, especially since you do not want the order to be modified. One solution is to move away the copy from the parallel loop and do it serially but this is slow and AFAIK there is no simple way to do that in Numba currently (since the parallelism layer is very limited). Consider using native languages like C/C++ for an efficient parallel implementation.
In the end, hashing can be pretty complex and the speed up can be quite small compared to a naive implementation with two nested loops since arr2 only have few items and modern processors can compare values quickly using SIMD instructions (while hash-based method can hardly benefit from them on mainstream processors). Unrolling can help to write a pretty simple and fast implementation. Again, unfortunately, Numba use LLVM-Jit internally which appear to fail to vectorize such a simple code (certainly due to missing optimizations in either LLVM-Jit or even LLVM itself). As a result, the non vectorized code is finally a bit slower (rather than 4~10 times faster on a modern mainstream processor). One solution is to use a C/C++ code instead to do that (or possibly Cython).
Here is a serial implementation using basic Bloom filters:
#nb.njit('uint32(int32)')
def hash_32bit_4k(value):
return (np.uint32(value) * np.uint32(27_644_437)) & np.uint32(0x0FFF)
#nb.njit(['int32[:](int32[:], int32[:])', 'int32[:](int32[::1], int32[::1])'])
def setdiff1d_nb_faster(arr1, arr2):
out = np.empty_like(arr1)
bloomFilter = np.zeros(4096, dtype=np.uint8)
for j in range(arr2.size):
bloomFilter[hash_32bit_4k(arr2[j])] = True
cur = 0
for i in range(arr1.size):
# If the bloom-filter value is true, we know arr1[i] is not in arr2.
# Otherwise, there is maybe a false positive (conflict) and we need to check to be sure.
if bloomFilter[hash_32bit_4k(arr1[i])] and arr1[i] in arr2:
continue
out[cur] = arr1[i]
cur += 1
return out[:cur]
Here is an untested variant that should work for 64-bit integers (floating point numbers need memory views and possibly a prime constant too):
#nb.njit('uint64(int64)')
def hash_32bit_4k(value):
return (np.uint64(value) * np.uint64(67_280_421_310_721)) & np.uint64(0x0FFF)
Note that if all the values in the small array are contained in the main array in each loop, then we can speed up the arr1[i] in arr2 part by removing values from arr2 when we find them. That being said, collisions and findings should be very rare so I do not expect this to be significantly faster (not to mention it adds some overhead and complexity). If items are computed in chunks, then the last chunks can be directly copied without any check but the benefit should still be relatively small. Note that this strategy can be effective for the naive (C/C++) SIMD implementation previously mentioned though (it can be about 2x faster).
Generalization and parallel implementation
This section focus on the algorithm to use regarding the input size. It particularly details an SIMD-based implementation and discuss about the use of multiple threads.
First of all, regarding the value r, the best algorithm to use can be different. More specifically:
when r is 0, the best thing to do is to return the input array arr1 unmodified (possibly a copy to avoid issue with in-place algorithms);
when r is 1, we can use one basic loop iterating over the array, but the best implementation is likely to use np.where of Numpy which is highly optimized for that
when r is small like <10, then using a SIMD-based implementation should be particularly efficient, especially if the iteration range of the arr2-based loop is known at compile-time and is unrolled
for bigger r values that are still relatively small (eg. r < 1000 and r << n), the provided hash-based solution should be one of the best;
for larger r values with r << n, the hash-based solution can be optimized by packing boolean values as bits in bloomFilter and by using multiple hash-functions instead of one so to better handle collisions while being more cache-friendly (in fact, this is what actual bloom filters does); note that multi-threading can be used so speed up the lookups when r is huge and r << n;
when r is big and not much smaller than n, then the problem is pretty hard to solve efficiently and the best solution is certainly to sort both arrays (typically with a radix sort) and use a merge-based method to remove the duplicates, possibly with multiple threads when both r and n are huge (hard to implement).
Let's start with the SIMD-based solution. Here is an implementation:
#nb.njit('int32[:](int32[::1], int32[::1])')
def setdiff1d_nb_simd(arr1, arr2):
out = np.empty_like(arr1)
limit = arr1.size // 4 * 4
limit2 = arr2.size // 2 * 2
cur = 0
z32 = np.int32(0)
# Tile (x4) based computation
for i in range(0, limit, 4):
f0, f1, f2, f3 = z32, z32, z32, z32
v0, v1, v2, v3 = arr1[i], arr1[i+1], arr1[i+2], arr1[i+3]
# Unrolled (x2) loop searching for a match in `arr2`
for j in range(0, limit2, 2):
val1 = arr2[j]
val2 = arr2[j+1]
f0 += (v0 == val1) + (v0 == val2)
f1 += (v1 == val1) + (v1 == val2)
f2 += (v2 == val1) + (v2 == val2)
f3 += (v3 == val1) + (v3 == val2)
# Remainder of the previous loop
if limit2 != arr2.size:
val = arr2[arr2.size-1]
f0 += v0 == val
f1 += v1 == val
f2 += v2 == val
f3 += v3 == val
if f0 == 0: out[cur] = arr1[i+0]; cur += 1
if f1 == 0: out[cur] = arr1[i+1]; cur += 1
if f2 == 0: out[cur] = arr1[i+2]; cur += 1
if f3 == 0: out[cur] = arr1[i+3]; cur += 1
# Remainder
for i in range(limit, arr1.size):
if arr1[i] not in arr2:
out[cur] = arr1[i]
cur += 1
return out[:cur]
It turns out this implementation is always slower than the hash-based one on my machine since Numba clearly generate an inefficient for the inner arr2-based loop and this appears to come from broken optimizations related to the ==: Numba simply fail use SIMD instructions for this operation (for no apparent reasons). This prevent many alternative SIMD-related codes to be fast as long as they are using Numba.
Another issue with Numba is that np.where is slow since it use a naive implementation while the one of Numpy has been heavily optimized. The optimization done in Numpy can hardly be applied to the Numba implementation due to the previous issue. This prevent any speed up using np.where in a Numba code.
In practice, the hash-based implementation is pretty fast and the copy takes a significant time on my machine already. The computing part can be speed up using multiple thread. This is not easy since the parallelism model of Numba is very limited. The copy cannot be easily optimized with Numba (one can use non-temporal store but this is not yet supported by Numba) unless the computation is possibly done in-place.
To use multiple threads, one strategy is to first split the range in chunk and then:
build a boolean array determining, for each item of arr1, whether the item is found in arr2 or not (fully parallel)
count the number of item found by chunk (fully parallel)
compute the offset of the destination chunk (hard to parallelize, especially with Numba, but fast thanks to chunks)
copy the chunk to the target location without copying found items (fully parallel)
Here is an efficient parallel hash-based implementation:
#nb.njit('int32[:](int32[:], int32[:])', parallel=True)
def setdiff1d_nb_faster_par(arr1, arr2):
# Pre-computation of the bloom-filter
bloomFilter = np.zeros(4096, dtype=np.uint8)
for j in range(arr2.size):
bloomFilter[hash_32bit_4k(arr2[j])] = True
chunkSize = 1024 # To tune regarding the kind of input
chunkCount = (arr1.size + chunkSize - 1) // chunkSize
# Find for each item of `arr1` if the value is in `arr2` (parallel)
# and count the number of item found for each chunk on the fly.
# Note: thanks to page fault, big parts of `found` are not even written in memory if `arr2` is small
found = np.zeros(arr1.size, dtype=nb.bool_)
foundCountByChunk = np.empty(chunkCount, dtype=nb.uint16)
for i in nb.prange(chunkCount):
start, end = i * chunkSize, min((i + 1) * chunkSize, arr1.size)
foundCountInChunk = 0
for j in range(start, end):
val = arr1[j]
if bloomFilter[hash_32bit_4k(val)] and val in arr2:
found[j] = True
foundCountInChunk += 1
foundCountByChunk[i] = foundCountInChunk
# Compute the location of the destination chunks (sequential)
outChunkOffsets = np.empty(chunkCount, dtype=nb.uint32)
foundCount = 0
for i in range(chunkCount):
outChunkOffsets[i] = i * chunkSize - foundCount
foundCount += foundCountByChunk[i]
# Parallel chunk-based copy
out = np.empty(arr1.size-foundCount, dtype=arr1.dtype)
for i in nb.prange(chunkCount):
srcStart, srcEnd = i * chunkSize, min((i + 1) * chunkSize, arr1.size)
cur = outChunkOffsets[i]
# Optimization: we can copy the whole chunk if there is nothing found in it
if foundCountByChunk[i] == 0:
out[cur:cur+(srcEnd-srcStart)] = arr1[srcStart:srcEnd]
else:
for j in range(srcStart, srcEnd):
if not found[j]:
out[cur] = arr1[j]
cur += 1
return out
This implementation is the fastest for the target input on my machine. It is generally fast when n is quite big and the overhead to create threads is relatively small on the target platform (eg. on PCs but typically not computing servers with many cores). The overhead of the parallel implementation is significant so the number of core on the target machine needs to be at least 4 so the implementation can be significantly faster than the sequential implementation.
It may be useful to tune the chunkSize variable for the target inputs. If r << n, it is better to use a pretty big chunkSize. That being said, the number of chunk needs to be sufficiently big for multiple thread to operate on many chunks. Thus, chunkSize should be significantly smaller than n / numberOfThreads.
On my machine most of the time (65-70%) is spent in the final copy which is mostly memory-bound and can hardly be optimized further with Numba.
Results
Here are results on my i5-9600KF-based machine (with 6 cores):
setdif1d_np: 2.65 ms
setdif1d_in1d_np: 2.61 ms
setdiff1d_nb: 2.33 ms
setdiff1d_nb_simd: 1.85 ms
setdiff1d_nb_faster: 0.73 ms
setdiff1d_nb_faster_par: 0.49 ms
The best provided implementation is about 4~5 time faster than the other ones.
What I found is that hashing does not help,. It is just trick for 2D case, to convert 1d arrays to single numbers and put them as such in a set.
Below is method of norok2 I converted to 1d arrays (and added annotations for faster compilation).
Note that this is only slightly (20-30%) faster than the methods you already have. And of course after second function call, on first due to compilation it is slightly slower.
#nb.njit('int32[:](int32[:], int32[:])')
def setdiff1d_nb(arr1, arr2):
delta = set(arr2)
# : build the result
result = np.empty(len(arr1), dtype=arr1.dtype)
j = 0
for i in range(arr1.shape[0]):
if arr1[i] not in delta:
result[j] = arr1[i]
j += 1
return result[:j]
So I got a matrix looking something like
0.8, 0.7,nan,...
0.1,nan,0.1,...
0.9,nan,0.3,...
with dimensions N X M, where M >> N
Now I want to go through all columns and change the values of the matrix as follows:
If the column has exactly K values >= 0, do nothing
If the column has L < K values >= 0, set K-L values of this column to X
If the column has S > K values >= 0, set S-K values of this column to nan
I am looking for a fast way to do this with python for a large optimization and my for loop is taking too long still.
The outcome in the above example, with K = 2, should look like:
0.8, 0.7,nan,...
0.1,X,0.1,...
nan, nan,0.3,...
What I am considering at the moment is first sorting each column of the entire matrix (I have to restore the previous order later though) and to somehow apply np.where(num_cols != K,applyChanges(myMat),myMat), but I am not sure how to make np.where return me the columns of the matrix and not the invidivual scalars.
Thanks for any suggestion :)
edit:
So I was able to speed up my code by a factor of slightly more than 5 by first using argsort on the columns. This was I could directly set the appropriate values without requiring anyfurther "argwhere" calls which made my previous code slow.
All I can do now seems to be to move from numpy arrays to python lists as I now do a for loop and access the elements directly which is faster done using lists.
edit2:
Using numba's #jit decorator I was able to increase the speed again by a factor of 5. Wow, I am impressed, pretty cool library (is that even the proper term for it?). Now my code is fast enough for my current requirements.
edit3:
As requested, here is my for loop, it is straight forward
idx_sorted = np.argsort(pred_new, axis=0)
num_bands_selected = np.sum(pred_new >= 0, axis=0)
for idx_col in range(0, num_cols):
if num_bands_selected[idx_col] < N:
pred_new[idx_sorted[num_bands_selected[idx_col]-N:, idx_col], idx_col] = dummy_value
elif num_bands_selected[idx_col] > N:
pred_new[idx_sorted[-num_bands_selected[idx_col]:-N, idx_col], idx_col] = np.nan
pred_new is something a neural network returns, but the data has to be in a specific format for some postprocessing algorithm to work.
This could be sped up to the best of my knowledge in two ways (aside from the #jit which was not included):
Using python lists instead of numpy arrays (they allow faster read access)
First selecting columns that require modification. Here I go through everything, but often only some subset (~50 %) might require modification
Say I have a large array of value 0~255. I wanted every element in this array that is higher than 100 got multiplied by 1.2, otherwise, got multiplied by 0.8.
It sounded simple but I could not find anyway other than iterate through all the variable and multiply it one by one.
If arr is your array, then this should work:
arr[arr > 100] *= 1.2
arr[arr <= 100] *= 0.8
Update: As pointed out in the comments, this could have the undesired effect of the first step affecting what is done in the second step, so we should instead do something like
# first get the indexes we of the elements we want to change
gt_idx = arr > 100
le_idx = arr <= 100
# then update the array
arr[gt_idx] *= 1.2
arr[le_idx] *= 0.8
I have a faster implementation than np.where, also a one-liner improvement on #vindvaki:
a*=((a>100)*1.2+(a<100)*0.8)
With it you don't need to make an extra function call, and you can also add arbitrarily many modifiers using boolean logic multipliers. This one-liner will save you some computational time if your arrays get big (like 10**8 big).
np.where is the answer. I spend time messing with np.place without knowing its existence.
I have a gaussian_kde.resample array. I don't know if it is a numpy array so that I can use numpy functions.
I had the data 0<x<=0.5 of 3000 variables and I used
kde = scipy.stats.gaussian_kde(x) # can also mention bandwidth here (x,bandwidth)
sample = kde.resample(100000) # returns 100,000 values that follow the prob distribution of "x"
This gave me a sample of data that follows the probability distribution of "x". But the problem is, no matter what bandwidth I try to select, I get very few negative values in my "sample". I only want values within the range 0 < sample <= 0.5
I tried to do:
sample = np.array(sample) # to convert this to a numpy array
keep = 0<sample<=0.5
sample = sample[keep] # using the binary conditions
But this does not work! How can I remove the negative values in my array?
Firstly, you can check what type it is by using the 'type' call within python:
x = kde.resample(10000)
type(x)
numpy.ndarray
Secondly, it should be working in the way you wrote, but I would be more explicit in your binary condition:
print x
array([[ 1.42935658, 4.79293343, 4.2725778 , ..., 2.35775067, 1.69647609]])
x.size
10000
y = x[(x>1.5) & (x<4)]
which you can see, does the correct binary conditions and removes the values >1.5 and <4:
print y
array([ 2.95451084, 2.62400183, 2.79426449, ..., 2.35775067, 1.69647609])
y.size
5676
I know I'm answering about 3 years late, but this may be useful for future reference.
The catch is that while kde.resample(100000) technically returns a NumPy array, this array actually contains another array(!), and that gets in the way of all the attempts to use indexing to get subsets of the sample. To get the array that the resample() method probably should have returned all along, do this instead:
sample = kde.resample(100000)[0]
The array variable sample should then have all 100000 samples, and indexing this array should work as expected.
Why SciPy does it this way, I don't know. This misfeature doesn't even appear to be documented.
First of all, the return value of kde.resample is a numpy array, so you do not need to reconvert it.
The problem lies in the line (Edit: No, it doesn't. This should work!)
keep = 0 < sample <= 0.5
It does not do what you would think. Try:
keep = (0 < sample) * (sample <= 0.5)
I have two lists of float numbers, and I want to calculate the set difference between them.
With numpy I originally wrote the following code:
aprows = allpoints.view([('',allpoints.dtype)]*allpoints.shape[1])
rprows = toberemovedpoints.view([('',toberemovedpoints.dtype)]*toberemovedpoints.shape[1])
diff = setdiff1d(aprows, rprows).view(allpoints.dtype).reshape(-1, 2)
This works well for things like integers. In case of 2d points with float coordinates that are the result of some geometrical calculations, there's a problem of finite precision and rounding errors causing the set difference to miss some equalities. For now I resorted to the much, much slower:
diff = []
for a in allpoints:
remove = False
for p in toberemovedpoints:
if norm(p-a) < 0.1:
remove = True
if not remove:
diff.append(a)
return array(diff)
But is there a way to write this with numpy and gain back the speed?
Note that I want the remaining points to still have their full precision, so first rounding the numbers and then do a set difference probably is not the way forward (or is it? :) )
Edited to add an solution based on scipy.KDTree that seems to work:
def remove_points_fast(allpoints, toberemovedpoints):
diff = []
removed = 0
# prepare a KDTree
from scipy.spatial import KDTree
tree = KDTree(toberemovedpoints, leafsize=allpoints.shape[0]+1)
for p in allpoints:
distance, ndx = tree.query([p], k=1)
if distance < 0.1:
removed += 1
else:
diff.append(p)
return array(diff), removed
If you want to do this with the matrix form, you have a lot of memory consumption with larger arrays. If that does not matter, then you get the difference matrix by:
diff_array = allpoints[:,None] - toberemovedpoints[None,:]
The resulting array has as many rows as there are points in allpoints, and as many columns as there are points in toberemovedpoints. Then you can manipulate this any way you want (e.g. calculate the absolute value), which gives you a boolean array. To find which rows have any hits (absolute difference < .1), use numpy.any:
hits = numpy.any(numpy.abs(diff_array) < .1, axis=1)
Now you have a vector which has the same number of items as there were rows in the difference array. You can use that vector to index all points (negation because we wanted the non-matching points):
return allpoints[-hits]
This is a numpyish way of doing this. But, as I said above, it takes a lot of memory.
If you have larger data, then you are better off doing it point by point. Something like this:
return allpoints[-numpy.array([numpy.any(numpy.abs(a-toberemoved) < .1) for a in allpoints ])]
This should perform well in most cases, and the memory use is much lower than with the matrix solution. (For stylistic reasons you may want to use numpy.all instead of numpy.any and turn the comparison around to get rid of the negation.)
(Beware, there may be pritning mistakes in the code.)