I'm looking to remove values that fall within a constant range around values held in a second array. I.e. I have one large np array and I want to remove values within +-3 of the entries of another array of specific values, say [20,50,90,210]. So if my large array was [14,21,48,54,92,215] I would want [14,54,215] returned. The values are double precision, so I'm trying to avoid creating a large mask array of specific values and use a range instead.
You mentioned that you wanted to avoid a large mask array. Unless both your "large array" and your "specific values" array are very large, I wouldn't try to avoid this. Often, with numpy it's best to allow relatively large temporary arrays to be created.
However, if you do need to control memory usage more tightly, you have several options. A typical trick is to only vectorize one part of the operation and iterate over the shorter input (this is shown in the second example below). It saves having nested loops in Python, and can significantly decrease the memory usage involved.
I'll show three different approaches. There are several others (including dropping down to C or Cython if you really need tight control and performance), but hopefully this gives you some ideas.
On a side note, for these small inputs, the overhead of array creation will overwhelm the differences. The speed and memory usage I'm referring to is only for large (>~1e6 elements) arrays.
Fully vectorized, but most memory usage
The easiest way is to calculate all distances at once and then reduce the mask back to the same shape as the initial array. For example:
import numpy as np
vals = np.array([14,21,48,54,92,215])
other = np.array([20,50,90,210])
dist = np.abs(vals[:,None] - other[None,:])
mask = np.all(dist > 3, axis=1)
result = vals[mask]
Partially vectorized, intermediate memory usage
Another option is to build up the mask iteratively for each element in the "specific values" array. This iterates over all elements of the shorter "specific values" array (a.k.a. other in this case):
import numpy as np
vals = np.array([14,21,48,54,92,215])
other = np.array([20,50,90,210])
mask = np.ones(len(vals), dtype=bool)
for num in other:
    dist = np.abs(vals - num)
    mask &= dist > 3
result = vals[mask]
Slowest, but lowest memory usage
Finally, if you really want to reduce memory usage, you could iterate over every item in your large array:
import numpy as np
vals = np.array([14,21,48,54,92,215])
other = np.array([20,50,90,210])
result = []
for num in vals:
    if np.all(np.abs(num - other) > 3):
        result.append(num)
The temporary list in that case is likely to take up more memory than the mask in the previous version. However, you could avoid the temporary list by using np.fromiter if you wanted. The timing comparison below shows an example of this.
Timing Comparisons
Let's compare the speed of these functions. We'll use 10,000,000 elements in the "large array" and 4 values in the "specific values" array. The relative speed and memory usage of these functions depend strongly on the sizes of the two arrays, so you should only consider this as a vague guideline.
import numpy as np
vals = np.random.random(10_000_000)
other = np.array([0.1, 0.5, 0.8, 0.95])
tolerance = 0.05

def basic(vals, other, tolerance):
    dist = np.abs(vals[:,None] - other[None,:])
    mask = np.all(dist > tolerance, axis=1)
    return vals[mask]

def intermediate(vals, other, tolerance):
    mask = np.ones(len(vals), dtype=bool)
    for num in other:
        dist = np.abs(vals - num)
        mask &= dist > tolerance
    return vals[mask]

def slow(vals, other, tolerance):
    def func(vals, other, tolerance):
        for num in vals:
            if np.all(np.abs(num - other) > tolerance):
                yield num
    return np.fromiter(func(vals, other, tolerance), dtype=vals.dtype)
And in this case, the partially vectorized version wins out. That's to be expected in most cases where vals is significantly longer than other. However, the first example (basic) is almost as fast, and is arguably simpler.
In [7]: %timeit basic(vals, other, tolerance)
1 loops, best of 3: 1.45 s per loop
In [8]: %timeit intermediate(vals, other, tolerance)
1 loops, best of 3: 917 ms per loop
In [9]: %timeit slow(vals, other, tolerance)
1 loops, best of 3: 2min 30s per loop
Either way you choose to implement things, these are common vectorization "tricks" that show up in many problems. In high-level languages like Python, Matlab, R, etc., it's often useful to try fully vectorizing first, then mix vectorization and explicit loops if memory usage is an issue. Which one is best usually depends on the relative sizes of the inputs, but this is a common pattern to try when trading off speed against memory usage in high-level scientific programming.
You can try:
def closestmatch(x, y):
    val = np.abs(x - y)
    return val.min() >= 3
Then:
b[np.array([closestmatch(a, x) for x in b])]
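The snippet does not define a and b; assuming a is the array of specific values and b is the large array (so the result matches the question's expected output), usage might look like this:
import numpy as np

a = np.array([20, 50, 90, 210])            # specific values (assumed)
b = np.array([14, 21, 48, 54, 92, 215])    # large array (assumed)

# Keep each element of b whose nearest "specific value" is at least 3 away.
print(b[np.array([closestmatch(a, x) for x in b])])
# [ 14  54 215]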
I need much faster code to remove the values of one 1D array (length ~10-15) that are common with another 1D array (length ~1e5-5e5, rarely up to 7e5); both are index arrays containing integers. There are no duplicates in the arrays, they are not sorted, and the order of the values in the main array must be kept after modification. I know this can be achieved using np.setdiff1d or np.in1d (neither of which is supported by numba in jitted no-python mode), and other similar posts (e.g. this) do not offer a much more efficient way to do it, but performance is important here because all the values in the main index array will gradually be removed in loops.
import numpy as np
import numba as nb
n = 500000
r = 10
arr1 = np.random.permutation(n)
arr2 = np.random.randint(0, n, r)
# @nb.jit
def setdif1d_np(a, b):
    return np.setdiff1d(a, b, assume_unique=True)

# @nb.jit
def setdif1d_in1d_np(a, b):
    return a[~np.in1d(a, b)]
There is another related post, proposed by norok2 for 2D arrays, with a solution (a hashing-like approach using numba) that is ~15 times faster than the usual methods described there. This solution may be the best if it could be prepared for 1D arrays:
@nb.njit
def mul_xor_hash(arr, init=65537, k=37):
    result = init
    for x in arr.view(np.uint64):
        result = (result * k) ^ x
    return result

@nb.njit
def setdiff2d_nb(arr1, arr2):
    # : build `delta` set using hashes
    delta = {mul_xor_hash(arr2[0])}
    for i in range(1, arr2.shape[0]):
        delta.add(mul_xor_hash(arr2[i]))
    # : compute the size of the result
    n = 0
    for i in range(arr1.shape[0]):
        if mul_xor_hash(arr1[i]) not in delta:
            n += 1
    # : build the result
    result = np.empty((n, arr1.shape[-1]), dtype=arr1.dtype)
    j = 0
    for i in range(arr1.shape[0]):
        if mul_xor_hash(arr1[i]) not in delta:
            result[j] = arr1[i]
            j += 1
    return result
I tried to prepare that for 1D arrays, but I have some problems/questions with that.
First, I don't understand what mul_xor_hash exactly does, and whether init and k are arbitrarily selected or not.
Why will mul_xor_hash not work without nb.njit:
File "C:/Users/Ali/Desktop/test - Copy - Copy.py", line 21, in mul_xor_hash
result = (result * k) ^ x
TypeError: ufunc 'bitwise_xor' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I don't know how to implement mul_xor_hash for 1D arrays (if that is possible), which I guess may make it even faster than for 2D, so I broadcast the input arrays to 2D with [None, :], which gets the following error just for arr2:
print(mul_xor_hash(arr2[0]))
ValueError: new type not compatible with array
And what does delta do?
I am searching for the most efficient way in this regard. In the absence of a better method than the norok2 solution, how can this solution be prepared for 1D arrays?
Understanding the hash-based solution
First, I don't understand what mul_xor_hash exactly does, and whether init and k are arbitrarily selected or not
mul_xor_hash is a custom hash function. Functions mixing xor and multiply (possibly with shifts) are known to be relatively fast for computing the hash of a raw data buffer. The multiplication tends to shuffle bits and the xor is used to combine/accumulate the result in a fixed-size small value (i.e. the final hash). There are many different hashing functions. Some are faster than others, some cause more collisions than others in a given context. A fast hashing function causing too many collisions can be useless in practice, as it would result in a pathological situation where all conflicting values need to be compared. This is why fast hash functions are hard to implement.
init and k are parameters, certainly chosen so that the hash is reasonably well balanced. This is pretty common in such a hash function. k needs to be sufficiently big for the multiplication to shuffle bits, and it should typically also be a prime number (values like powers of two tend to increase collisions due to modular-arithmetic behaviours). init plays a significant role only for very small arrays (e.g. with 1 item): it helps to reduce collisions by xoring the final hash with a non-trivial constant. Indeed, if arr.size = 1, then result = (init * k) ^ arr[0], where init * k is a constant. Having an identity hash function equal to arr[0] is known to be bad since it tends to result in many collisions (this is a complex topic, but put shortly, arr[0] can be divisible by the number of buckets in the hash table, for example). Thus, init should be a relatively big number and init * k should also be a big non-trivial value (a prime number is a good target value).
Why will mul_xor_hash not work without nb.njit
It depends on the input. The input needs to be a 1D array and have a raw size in bytes divisible by 8 (e.g. 64-bit items, 2n x 32-bit ones, 4n x 16-bit ones or 8n x 8-bit ones). Here are some examples:
mul_xor_hash(np.random.rand(10))
mul_xor_hash(np.arange(10)) # Does not work with 9
And what does delta do?
It is a set containing the hashes of the arr2 rows, so matching lines can be found faster than by comparing them without hashes.
How can this solution be prepared for 1D arrays?
AFAIK, hashes are only used to avoid comparisons of rows, and that is only needed because the input is a 2D array. In 1D, there is no such problem.
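For instance, in 1D the values themselves can go straight into a set, with no hashing step of your own; a minimal, unoptimized sketch (plain Python membership tests, order of arr1 preserved):
import numpy as np

def setdiff1d_set(arr1, arr2):
    # arr2 is small, so building the set is cheap; arr1 is filtered in order.
    lookup = set(arr2.tolist())
    keep = np.fromiter((v not in lookup for v in arr1.tolist()),
                       dtype=bool, count=arr1.size)
    return arr1[keep]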
There is a big catch with the 2D hash-based method: it only works if there are no hash collisions. Otherwise, the implementation wrongly assumes that values are equal even if they are not! @norok2 explicitly mentioned it in the comments though:
Note that the collision handling for the hashings should also be implemented
Faster implementation
Using the 2D solution of @norok2 for 1D arrays is not a good idea, since hashes will not make it faster the way they are used. In fact, a set already uses a hash function internally anyway. Not to mention that collisions need to be properly handled (which is done by a set).
Using a set is a relatively good idea since it causes the complexity to be O(n + m) where n = len(arr1) and m = len(arr2). That being said, if arr1 is converted to a set, it will be too big to fit in the L1 cache (due to the size of arr1 in your case), resulting in slow cache misses. Additionally, the growing size of the set will cause values to be re-hashed, which is not efficient. If arr2 is converted to a set, then the many hash-table fetches will not be very efficient since arr2 is very small in your case. This is why this solution is sub-optimal.
One solution is to split arr1 into chunks and then build a set based on the target chunk. You can then check efficiently whether a value is in the set or not. Building the set is still not very efficient due to its growing size. This problem is due to Python itself, which does not provide a way to reserve space for the data structure as other languages do (e.g. C++). One solution to avoid this issue is simply to reimplement a hash table, which is not trivial and is cumbersome. Actually, Bloom filters can be used to speed up this process since they can quickly determine, on average, that there is no collision between the two sets arr1 and arr2 (though they are not trivial to implement).
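One way to read that chunking idea, as a sketch only, under the assumption that each arr1 chunk is turned into a small, cache-friendly set and probed against arr2 (the chunk size is illustrative):
import numpy as np

def setdiff1d_chunked(arr1, arr2, chunk_size=4096):
    arr2_list = arr2.tolist()
    pieces = []
    for start in range(0, arr1.size, chunk_size):
        chunk = arr1[start:start + chunk_size]
        chunk_set = set(chunk.tolist())                 # small set per chunk
        to_drop = chunk_set.intersection(arr2_list)
        if to_drop:
            keep = np.fromiter((v not in to_drop for v in chunk.tolist()),
                               dtype=bool, count=chunk.size)
            pieces.append(chunk[keep])
        else:
            pieces.append(chunk)                        # nothing to remove here
    return np.concatenate(pieces) if pieces else arr1.copy()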
Another optimization is to use multiple threads to compute the chunks in parallel, since they are independent. That being said, appending to the final array is not easy to do efficiently in parallel, especially since you do not want the order to be modified. One solution is to move the copy out of the parallel loop and do it serially, but this is slow and AFAIK there is currently no simple way to do that in Numba (since the parallelism layer is very limited). Consider using native languages like C/C++ for an efficient parallel implementation.
In the end, hashing can be pretty complex and the speed-up can be quite small compared to a naive implementation with two nested loops, since arr2 only has a few items and modern processors can compare values quickly using SIMD instructions (while hash-based methods can hardly benefit from them on mainstream processors). Unrolling can help to write a pretty simple and fast implementation. Unfortunately, Numba uses LLVM-JIT internally, which appears to fail to vectorize such simple code (certainly due to missing optimizations in either LLVM-JIT or LLVM itself). As a result, the non-vectorized code ends up a bit slower (rather than 4~10 times faster on a modern mainstream processor). One solution is to use C/C++ code instead (or possibly Cython).
Here is a serial implementation using basic Bloom filters:
@nb.njit('uint32(int32)')
def hash_32bit_4k(value):
    return (np.uint32(value) * np.uint32(27_644_437)) & np.uint32(0x0FFF)

@nb.njit(['int32[:](int32[:], int32[:])', 'int32[:](int32[::1], int32[::1])'])
def setdiff1d_nb_faster(arr1, arr2):
    out = np.empty_like(arr1)
    bloomFilter = np.zeros(4096, dtype=np.uint8)
    for j in range(arr2.size):
        bloomFilter[hash_32bit_4k(arr2[j])] = True
    cur = 0
    for i in range(arr1.size):
        # If the bloom-filter value is false, we know arr1[i] is not in arr2.
        # Otherwise, it may be a false positive (collision) and we need to check arr2 to be sure.
        if bloomFilter[hash_32bit_4k(arr1[i])] and arr1[i] in arr2:
            continue
        out[cur] = arr1[i]
        cur += 1
    return out[:cur]
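For reference, a hypothetical usage sketch mirroring the question's setup (n = 500000, r = 10); the explicit Numba signatures above require int32 inputs, hence the casts:
import numpy as np

# Hypothetical test data; the .astype(np.int32) casts match the signatures above.
arr1 = np.random.permutation(500_000).astype(np.int32)
arr2 = np.random.randint(0, 500_000, 10).astype(np.int32)

res = setdiff1d_nb_faster(arr1, arr2)
assert res.size >= arr1.size - arr2.size   # order of arr1 is preserved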
Here is an untested variant that should work for 64-bit integers (floating point numbers need memory views and possibly a prime constant too):
@nb.njit('uint64(int64)')
def hash_32bit_4k(value):
    return (np.uint64(value) * np.uint64(67_280_421_310_721)) & np.uint64(0x0FFF)
Note that if all the values in the small array are contained in the main array in each loop, then we can speed up the arr1[i] in arr2 part by removing values from arr2 when we find them. That being said, collisions and findings should be very rare so I do not expect this to be significantly faster (not to mention it adds some overhead and complexity). If items are computed in chunks, then the last chunks can be directly copied without any check but the benefit should still be relatively small. Note that this strategy can be effective for the naive (C/C++) SIMD implementation previously mentioned though (it can be about 2x faster).
Generalization and parallel implementation
This section focuses on which algorithm to use depending on the input size. It particularly details a SIMD-based implementation and discusses the use of multiple threads.
First of all, regarding the value r, the best algorithm to use can be different. More specifically:
when r is 0, the best thing to do is to return the input array arr1 unmodified (possibly a copy, to avoid issues with in-place algorithms);
when r is 1, we can use one basic loop iterating over the array, but the best implementation is likely to use np.where of Numpy, which is highly optimized for that (a minimal sketch follows this list);
when r is small, like <10, then using a SIMD-based implementation should be particularly efficient, especially if the iteration range of the arr2-based loop is known at compile time and is unrolled;
for bigger r values that are still relatively small (e.g. r < 1000 and r << n), the provided hash-based solution should be one of the best;
for larger r values with r << n, the hash-based solution can be optimized by packing boolean values as bits in bloomFilter and by using multiple hash functions instead of one, so as to better handle collisions while being more cache-friendly (in fact, this is what actual bloom filters do); note that multi-threading can be used to speed up the lookups when r is huge and r << n;
when r is big and not much smaller than n, then the problem is pretty hard to solve efficiently and the best solution is certainly to sort both arrays (typically with a radix sort) and use a merge-based method to remove the duplicates, possibly with multiple threads when both r and n are huge (hard to implement).
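For instance, a minimal sketch of the r == 1 case mentioned in the list above (the values are purely illustrative):
import numpy as np

arr1 = np.array([5, 1, 4, 8, 2], dtype=np.int32)
arr2 = np.array([4], dtype=np.int32)   # r == 1

# Keep everything that differs from the single value to remove;
# np.where returns the indices of the surviving elements.
result = arr1[np.where(arr1 != arr2[0])]
print(result)   # [5 1 8 2]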
Let's start with the SIMD-based solution. Here is an implementation:
@nb.njit('int32[:](int32[::1], int32[::1])')
def setdiff1d_nb_simd(arr1, arr2):
    out = np.empty_like(arr1)
    limit = arr1.size // 4 * 4
    limit2 = arr2.size // 2 * 2
    cur = 0
    z32 = np.int32(0)

    # Tile (x4) based computation
    for i in range(0, limit, 4):
        f0, f1, f2, f3 = z32, z32, z32, z32
        v0, v1, v2, v3 = arr1[i], arr1[i+1], arr1[i+2], arr1[i+3]

        # Unrolled (x2) loop searching for a match in `arr2`
        for j in range(0, limit2, 2):
            val1 = arr2[j]
            val2 = arr2[j+1]
            f0 += (v0 == val1) + (v0 == val2)
            f1 += (v1 == val1) + (v1 == val2)
            f2 += (v2 == val1) + (v2 == val2)
            f3 += (v3 == val1) + (v3 == val2)

        # Remainder of the previous loop
        if limit2 != arr2.size:
            val = arr2[arr2.size-1]
            f0 += v0 == val
            f1 += v1 == val
            f2 += v2 == val
            f3 += v3 == val

        if f0 == 0: out[cur] = arr1[i+0]; cur += 1
        if f1 == 0: out[cur] = arr1[i+1]; cur += 1
        if f2 == 0: out[cur] = arr1[i+2]; cur += 1
        if f3 == 0: out[cur] = arr1[i+3]; cur += 1

    # Remainder
    for i in range(limit, arr1.size):
        if arr1[i] not in arr2:
            out[cur] = arr1[i]
            cur += 1

    return out[:cur]
It turns out this implementation is always slower than the hash-based one on my machine, since Numba clearly generates inefficient code for the inner arr2-based loop, and this appears to come from broken optimizations related to ==: Numba simply fails to use SIMD instructions for this operation (for no apparent reason). This prevents many alternative SIMD-related codes from being fast as long as they are written with Numba.
Another issue with Numba is that np.where is slow, since it uses a naive implementation while the Numpy one has been heavily optimized. The optimization done in Numpy can hardly be applied to the Numba implementation due to the previous issue. This prevents any speed-up from using np.where in Numba code.
In practice, the hash-based implementation is pretty fast, and the copy already takes a significant amount of time on my machine. The computing part can be sped up using multiple threads. This is not easy since the parallelism model of Numba is very limited. The copy cannot be easily optimized with Numba (one could use non-temporal stores, but this is not yet supported by Numba) unless the computation can possibly be done in-place.
To use multiple threads, one strategy is to first split the range into chunks and then:
build a boolean array determining, for each item of arr1, whether the item is found in arr2 or not (fully parallel)
count the number of items found per chunk (fully parallel)
compute the offsets of the destination chunks (hard to parallelize, especially with Numba, but fast thanks to the chunks)
copy each chunk to its target location, without copying the found items (fully parallel)
Here is an efficient parallel hash-based implementation:
@nb.njit('int32[:](int32[:], int32[:])', parallel=True)
def setdiff1d_nb_faster_par(arr1, arr2):
    # Pre-computation of the bloom-filter
    bloomFilter = np.zeros(4096, dtype=np.uint8)
    for j in range(arr2.size):
        bloomFilter[hash_32bit_4k(arr2[j])] = True

    chunkSize = 1024  # To tune regarding the kind of input
    chunkCount = (arr1.size + chunkSize - 1) // chunkSize

    # Find for each item of `arr1` if the value is in `arr2` (parallel)
    # and count the number of items found for each chunk on the fly.
    # Note: thanks to page faults, big parts of `found` are not even written
    # to memory if `arr2` is small
    found = np.zeros(arr1.size, dtype=nb.bool_)
    foundCountByChunk = np.empty(chunkCount, dtype=nb.uint16)
    for i in nb.prange(chunkCount):
        start, end = i * chunkSize, min((i + 1) * chunkSize, arr1.size)
        foundCountInChunk = 0
        for j in range(start, end):
            val = arr1[j]
            if bloomFilter[hash_32bit_4k(val)] and val in arr2:
                found[j] = True
                foundCountInChunk += 1
        foundCountByChunk[i] = foundCountInChunk

    # Compute the location of the destination chunks (sequential)
    outChunkOffsets = np.empty(chunkCount, dtype=nb.uint32)
    foundCount = 0
    for i in range(chunkCount):
        outChunkOffsets[i] = i * chunkSize - foundCount
        foundCount += foundCountByChunk[i]

    # Parallel chunk-based copy
    out = np.empty(arr1.size - foundCount, dtype=arr1.dtype)
    for i in nb.prange(chunkCount):
        srcStart, srcEnd = i * chunkSize, min((i + 1) * chunkSize, arr1.size)
        cur = outChunkOffsets[i]
        # Optimization: we can copy the whole chunk if there is nothing found in it
        if foundCountByChunk[i] == 0:
            out[cur:cur+(srcEnd-srcStart)] = arr1[srcStart:srcEnd]
        else:
            for j in range(srcStart, srcEnd):
                if not found[j]:
                    out[cur] = arr1[j]
                    cur += 1

    return out
This implementation is the fastest for the target input on my machine. It is generally fast when n is quite big and the overhead of creating threads is relatively small on the target platform (e.g. on PCs, but typically not on computing servers with many cores). The overhead of the parallel implementation is significant, so the number of cores on the target machine needs to be at least 4 for the implementation to be significantly faster than the sequential one.
It may be useful to tune the chunkSize variable for the target inputs. If r << n, it is better to use a pretty big chunkSize. That being said, the number of chunks needs to be sufficiently big for multiple threads to operate on many chunks. Thus, chunkSize should be significantly smaller than n / numberOfThreads.
On my machine most of the time (65-70%) is spent in the final copy which is mostly memory-bound and can hardly be optimized further with Numba.
Results
Here are results on my i5-9600KF-based machine (with 6 cores):
setdif1d_np: 2.65 ms
setdif1d_in1d_np: 2.61 ms
setdiff1d_nb: 2.33 ms
setdiff1d_nb_simd: 1.85 ms
setdiff1d_nb_faster: 0.73 ms
setdiff1d_nb_faster_par: 0.49 ms
The best provided implementation is about 4~5 times faster than the other ones.
What I found is that hashing does not help. It is just a trick for the 2D case, to convert the rows to single numbers and put them as such in a set.
Below is the method of norok2 that I converted to 1D arrays (and added annotations for faster compilation).
Note that this is only slightly (20-30%) faster than the methods you already have. And of course the first function call is slightly slower due to compilation.
@nb.njit('int32[:](int32[:], int32[:])')
def setdiff1d_nb(arr1, arr2):
    delta = set(arr2)
    # : build the result
    result = np.empty(len(arr1), dtype=arr1.dtype)
    j = 0
    for i in range(arr1.shape[0]):
        if arr1[i] not in delta:
            result[j] = arr1[i]
            j += 1
    return result[:j]
Given the following 2-column array, I want to select items from the second column that correspond to "edges" in the first column. This is just an example, as in reality my a has potentially millions of rows. So, ideally I'd like to do this as fast as possible, and without creating intermediate results.
import numpy as np
a = np.array([[1,4],[1,2],[1,3],[2,6],[2,1],[2,8],[2,3],[2,1],
[3,6],[3,7],[5,4],[5,9],[5,1],[5,3],[5,2],[8,2],
[8,6],[8,8]])
i.e. I want to find the result,
desired = np.array([4,6,6,4,2])
which is entries in a[:,1] corresponding to where a[:,0] changes.
One solution is,
b = a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1, 1]
which gives np.array([6,6,4,2]), I could simply prepend the first item, no problem. However, this creates an intermediate array of the indexes of the first items. I could avoid the intermediate by using a list comprehension:
c = [a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y]
This also gives [6,6,4,2]. Assuming a generator-based zip (true in Python 3), this doesn't need to create an intermediate representation and should be very memory efficient. However, the inner loop is not numpy, and it necessitates generating a list which must be subsequently turned back into a numpy array.
Can you come up with a numpy-only version with the memory efficiency of c but the speed efficiency of b? Ideally only one pass over a is needed.
(Note that measuring the speed won't help much here, unless a is very big, so I wouldn't bother with benchmarking this, I just want something that is theoretically fast and memory efficient. For example, you can assume rows in a are streamed from a file and are slow to access -- another reason to avoid the b solution, as it requires a second random-access pass over a.)
Edit: a way to generate a large a matrix for testing:
from itertools import repeat
N, M = 100000, 100
a = np.array(list(zip([x for y in zip(*repeat(np.arange(N),M)) for x in y], np.random.random(N*M))))
I am afraid if you are looking to do this in a vectorized way, you can't avoid an intermediate array, as there's no built-in for it.
Now, let's look for vectorized approaches other than nonzero(), which might be more performant. Going by the same idea of performing differentiation as with the original code of (a[1:,0]-a[:-1,0]), we can use boolean indexing after looking for non-zero differentiations that correspond to "edges" or shifts.
Thus, we would have a vectorized approach like so -
a[np.append(True,np.diff(a[:,0])!=0),1]
Runtime test
The original solution a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1,1] would skip the first row. But, let's just say for the sake of timing purposes, it's a valid result. Here's the runtimes with it against the proposed solution in this post -
In [118]: from itertools import repeat
...: N, M = 100000, 2
...: a = np.array(zip([x for y in zip(*repeat(np.arange(N),M))\
for x in y ], np.random.random(N*M)))
...:
In [119]: %timeit a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1,1]
100 loops, best of 3: 6.31 ms per loop
In [120]: %timeit a[1:][np.diff(a[:,0])!=0,1]
100 loops, best of 3: 4.51 ms per loop
Now, let's say you want to include the first row too. The updated runtimes would look something like this -
In [123]: from itertools import repeat
...: N, M = 100000, 2
...: a = np.array(zip([x for y in zip(*repeat(np.arange(N),M))\
for x in y ], np.random.random(N*M)))
...:
In [124]: %timeit a[np.append(0,(a[1:,0]-a[:-1,0]).nonzero()[0]+1),1]
100 loops, best of 3: 6.8 ms per loop
In [125]: %timeit a[np.append(True,np.diff(a[:,0])!=0),1]
100 loops, best of 3: 5 ms per loop
Ok actually I found a solution, just learned about np.fromiter, which can build a numpy array based on a generator:
d = np.fromiter((a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y), int)
I think this does it, generates a numpy array without any intermediate arrays. However, the caveat is that it does not seem to be all that efficient! Forgetting what I said in the question about testing:
t = [lambda a: a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1, 1],
lambda a: np.array([a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y]),
lambda a: np.fromiter((a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y), int)]
from timeit import Timer
[Timer(x(a)).timeit(number=10) for x in t]
[0.16596235800034265, 1.811289312000099, 2.1662971739997374]
It seems the first solution is drastically faster! I assume this is because even though it generates intermediate data, it is able to perform the inner loop completely in numpy, while in the other it runs Python code for each item in the array.
Like I said, this is why I'm not sure this kind of benchmarking makes sense here -- if accesses to a were much slower, the benchmark wouldn't be CPU-loaded. Thoughts?
Not "accepting" this answer since I am hoping someone can come up with something faster.
If memory efficiency is your concern, that can be solved as such: the only intermediate of the same size-order as the input data can be made of type bool (a[1:,0] != a[:-1, 0]); and if your input data is int32, that is 8 times smaller than a itself. You can count the nonzeros of that boolean array to preallocate the output array as well, though that should not be very significant either if the output of the != is as sparse as your example suggests.
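A minimal sketch of that idea (assuming a as defined in the question, and keeping the first row as the question asks):
import numpy as np

edges = a[1:, 0] != a[:-1, 0]                 # bool intermediate: 1 byte per row
out = np.empty(1 + np.count_nonzero(edges), dtype=a.dtype)
out[0] = a[0, 1]                              # the first row always counts as an "edge"
out[1:] = a[1:, 1][edges]
print(out)                                    # [4 6 6 4 2]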
I need to sort a VERY large genomic dataset using numpy. I have an array of 2.6 billion floats, dimensions = (868940742, 3), which takes up about 20GB of memory on my machine once loaded and just sitting there. I have an early 2015 13" MacBook Pro with 16GB of RAM, a 500GB solid state HD and a 3.1 GHz Intel i7 processor. Just loading the array overflows to virtual memory, but not to the point where my machine suffers or I have to stop everything else I am doing.
I build this VERY large array step by step from 22 smaller (N, 2) subarrays.
Function FUN_1 generates 2 new (N, 1) arrays using each of the 22 subarrays which I call sub_arr.
The first output of FUN_1 is generated by interpolating values from sub_arr[:,0] on array b = array([X, F(X)]) and the second output is generated by placing sub_arr[:, 0] into bins using array r = array([X, BIN(X)]). I call these outputs b_arr and rate_arr, respectively. The function returns a 3-tuple of (N, 1) arrays:
import numpy as np

def FUN_1(sub_arr):
    """interpolate b values and rates based on position in sub_arr"""
    b = np.load(bfile)
    r = np.load(rfile)
    b_arr = np.interp(sub_arr[:,0], b[:,0], b[:,1])
    rate_arr = np.searchsorted(r[:,0], sub_arr[:,0])  # HUGE efficiency gain over np.digitize...
    return r[rate_arr, 1], b_arr, sub_arr[:,1]
I call the function 22 times in a for-loop and fill a pre-allocated array of zeros full_arr = numpy.zeros([868940742, 3]) with the values:
full_arr[:,0], full_arr[:,1], full_arr[:,2] = FUN_1(sub_arr)
In terms of saving memory at this step, I think this is the best I can do, but I'm open to suggestions. Either way, I don't run into problems up through this point and it only takes about 2 minutes.
Here is the sorting routine (there are two consecutive sorts)
for idx in range(2):
    sort_idx = numpy.argsort(full_arr[:,idx])
    full_arr = full_arr[sort_idx]
# ...
# <additional processing, return small (1000, 3) array of stats>
Now this sort had been working, albeit slowly (takes about 10 minutes). However, I recently started using a larger, more fine resolution table of [X, F(X)] values for the interpolation step above in FUN_1 that returns b_arr and now the SORT really slows down, although everything else remains the same.
Interestingly, I am not even sorting on the interpolated values at the step where the sort is now lagging. Here are some snippets of the different interpolation files - the smaller one is about 30% smaller in each case and far more uniform in terms of values in the second column; the slower one has a higher resolution and many more unique values, so the results of interpolation are likely more unique, but I'm not sure if this should have any kind of effect...?
bigger, slower file:
17399307 99.4
17493652 98.8
17570460 98.2
17575180 97.6
17577127 97
17578255 96.4
17580576 95.8
17583028 95.2
17583699 94.6
17584172 94
smaller, more uniform regular file:
1 24
1001 24
2001 24
3001 24
4001 24
5001 24
6001 24
7001 24
I'm not sure what could be causing this issue and I would be interested in any suggestions or just general input about sorting in this type of memory limiting case!
At the moment each call to np.argsort is generating a (868940742, 1) array of int64 indices, which will take up ~7 GB just by itself. Additionally, when you use these indices to sort the columns of full_arr you are generating another (868940742, 1) array of floats, since fancy indexing always returns a copy rather than a view.
One fairly obvious improvement would be to sort full_arr in place using its .sort() method. Unfortunately, .sort() does not allow you to directly specify a row or column to sort by. However, you can specify a field to sort by for a structured array. You can therefore force an inplace sort over one of the three columns by getting a view onto your array as a structured array with three float fields, then sorting by one of these fields:
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)
In this case I'm sorting full_arr in place by the 0th field, which corresponds to the first column. Note that I've assumed that there are three float64 columns ('f8') - you should change this accordingly if your dtype is different. This also requires that your array is contiguous and in row-major format, i.e. full_arr.flags.C_CONTIGUOUS == True.
Credit for this method should go to Joe Kington for his answer here.
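As a small, self-contained illustration of the trick (the array here is purely illustrative):
import numpy as np

x = np.array([[3., 10.],
              [1., 30.],
              [2., 20.]])
assert x.flags.c_contiguous

# Sort in place by the first column; rows stay intact, no index array is built.
x.view('f8, f8').sort(order=['f0'], axis=0)
print(x)
# [[ 1. 30.]
#  [ 2. 20.]
#  [ 3. 10.]]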
Although it requires less memory, sorting a structured array by field is unfortunately much slower compared with using np.argsort to generate an index array, as you mentioned in the comments below (see this previous question). If you use np.argsort to obtain a set of indices to sort by, you might see a modest performance gain by using np.take rather than direct indexing to get the sorted array:
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
x[idx]
# 1 loops, best of 100: 148 µs per loop
%%timeit -n 1 -r 100 x = np.random.randn(10000, 2); idx = x[:, 0].argsort()
np.take(x, idx, axis=0)
# 1 loops, best of 100: 42.9 µs per loop
However I wouldn't expect to see any difference in terms of memory usage, since both methods will generate a copy.
Regarding your question about why sorting the second array is faster - yes, you should expect any reasonable sorting algorithm to be faster when there are fewer unique values in the array because on average there's less work for it to do. Suppose I have a random sequence of digits between 1 and 10:
5 1 4 8 10 2 6 9 7 3
There are 10! = 3628800 possible ways to arrange these digits, but only one in which they are in ascending order. Now suppose there are just 5 unique digits:
4 4 3 2 3 1 2 5 1 5
Now there are 2⁵ = 32 ways to arrange these digits in ascending order, since I could swap any pair of identical digits in the sorted vector without breaking the ordering.
By default, np.ndarray.sort() uses Quicksort. The qsort variant of this algorithm works by recursively selecting a 'pivot' element in the array, then reordering the array such that all the elements less than the pivot value are placed before it, and all of the elements greater than the pivot value are placed after it. Values that are equal to the pivot are already sorted. Having fewer unique values means that, on average, more values will be equal to the pivot value on any given sweep, and therefore fewer sweeps are needed to fully sort the array.
For example:
%%timeit -n 1 -r 100 x = np.random.random_integers(0, 10, 100000)
x.sort()
# 1 loops, best of 100: 2.3 ms per loop
%%timeit -n 1 -r 100 x = np.random.random_integers(0, 1000, 100000)
x.sort()
# 1 loops, best of 100: 4.62 ms per loop
In this example the dtypes of the two arrays are the same. If your smaller array has a smaller item size compared with the larger array then the cost of copying it due to the fancy indexing will also be smaller.
EDIT: In case anyone new to programming and numpy comes across this post, I want to point out the importance of considering the np.dtype that you are using. In my case, I was actually able to get away with using half-precision floating point, i.e. np.float16, which reduced a 20GB object in memory to 5GB and made sorting much more manageable. The default used by numpy is np.float64, which is a lot of precision that you may not need. Check out the doc here, which describes the capacity of the different data types. Thanks to @ali_m for pointing this out in the comments.
I did a bad job explaining this question but I have discovered some helpful workarounds that I think would be useful to share for anyone who needs to sort a truly massive numpy array.
I am building a very large numpy array from 22 "sub-arrays" of human genome data containing the elements [position, value]. Ultimately, the final array must be numerically sorted "in place" based on the values in a particular column and without shuffling the values within rows.
The sub-array dimensions follow the form:
arr1.shape = (N1, 2)
...
arr22.shape = (N22, 2)
sum([N1..N22]) = 868940742 i.e. there are close to 1BN positions to sort.
First I process the 22 sub-arrays with the function process_sub_arrs, which returns a 3-tuple of 1D arrays the same length as the input. I stack the 1D arrays into a new (N, 3) array and insert them into an np.zeros array initialized for the full dataset:
full_arr = np.zeros([868940742, 3])
i, j = 0, 0
for arr in list(arr1..arr22):
    # indices (i, j) incremented at each loop based on sub-array size
    j += len(arr)
    full_arr[i:j, :] = np.column_stack( process_sub_arrs(arr) )
    i = j
return full_arr
EDIT: Since I realized my dataset could be represented with half-precision floats, I now initialize full_arr as follows: full_arr = np.zeros([868940742, 3], dtype=np.float16), which is only 1/4 the size and much easier to sort.
Result is a massive 20GB array:
full_arr.nbytes = 20854577808
As @ali_m pointed out in his detailed post, my earlier routine was inefficient:
sort_idx = np.argsort(full_arr[:,idx])
full_arr = full_arr[sort_idx]
the array sort_idx, which is 33% the size of full_arr, hangs around and wastes memory after sorting full_arr. This sort supposedly generates a copy of full_arr due to "fancy" indexing, potentially pushing memory use to 233% of what is already used to hold the massive array! This is the slow step, lasting about ten minutes and relying heavily on virtual memory.
I'm not sure the "fancy" sort makes a persistent copy however. Watching the memory usage on my machine, it seems that full_arr = full_arr[sort_idx] deletes the reference to the unsorted original, because after about 1 second all that is left is the memory used by the sorted array and the index, even if there is a transient copy.
A more compact usage of argsort() to save memory is this one:
full_arr = full_arr[full_arr[:,idx].argsort()]
This still causes a spike at the time of the assignment, where both a transient index array and a transient copy are made, but the memory is almost instantly freed again.
@ali_m pointed out a nice trick (credited to Joe Kington) for generating a de facto structured array with a view on full_arr. The benefit is that these may be sorted "in place", maintaining stable row order:
full_arr.view('f8, f8, f8').sort(order=['f0'], axis=0)
Views work great for performing mathematical array operations, but for sorting it is far too inefficient for even a single sub-array from my dataset. In general, structured arrays just don't seem to scale very well even though they have really useful properties. If anyone has any idea why this is I would be interested to know.
One good option to minimize memory consumption and improve performance with very large arrays is to build a pipeline of small, simple functions. Functions clear local variables once they have completed so if intermediate data structures are building up and sapping memory this can be a good solution.
This a sketch of the pipeline I've used to speed up the massive array sort:
def process_sub_arrs(arr):
    """process a sub-array and return a 3-tuple of 1D values arrays"""
    return values1, values2, values3

def build_arr():
    """build the initial array by joining processed sub-arrays"""
    full_arr = np.zeros([868940742, 3])
    i, j = 0, 0
    for arr in list(arr1..arr22):
        # indices (i, j) incremented at each loop based on sub-array size
        j += len(arr)
        full_arr[i:j, :] = np.column_stack( process_sub_arrs(arr) )
        i = j
    return full_arr

def sort_arr():
    """return full_arr and sort_idx"""
    full_arr = build_arr()
    sort_idx = np.argsort(full_arr[:, index])
    return full_arr[sort_idx]

def get_sorted_arr():
    """call through nested functions to return the sorted array"""
    sorted_arr = sort_arr()
    <process sorted_arr>
    return statistics

call stack: get_sorted_arr --> sort_arr --> build_arr --> process_sub_arrs
Once each inner function is completed get_sorted_arr() finally just holds the sorted array and then returns a small array of statistics.
EDIT: It is also worth pointing out here that even if you are able to use a more compact dtype to represent your huge array, you will want to use higher precision for summary calculations. For example, since full_arr.dtype = np.float16, the command np.mean(full_arr[:,idx]) tries to calculate the mean in half-precision floating point, but this quickly overflows when summing over a massive array. Using np.mean(full_arr[:,idx], dtype=np.float64) will prevent the overflow.
I posted this question initially because I was puzzled by the fact that a dataset of identical size suddenly began choking up my system memory, although there was a big difference in the proportion of unique values in the new "slow" set. @ali_m pointed out that, indeed, more uniform data with fewer unique values is easier to sort:
The qsort variant of Quicksort works by recursively selecting a
'pivot' element in the array, then reordering the array such that all
the elements less than the pivot value are placed before it, and all
of the elements greater than the pivot value are placed after it.
Values that are equal to the pivot are already sorted, so intuitively,
the fewer unique values there are in the array, the smaller the number
of swaps there are that need to be made.
On that note, the final change I ended up making to attempt to resolve this issue was to round the newer dataset in advance, since there was an unnecessarily high level of decimal precision leftover from an interpolation step. This ultimately had an even bigger effect than the other memory saving steps, showing that the sort algorithm itself was the limiting factor in this case.
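A sketch of that rounding step (the column index and the number of decimals here are illustrative):
# Collapsing the excess interpolation precision reduces the number of
# unique values, which makes the quicksort passes noticeably cheaper.
full_arr[:, 1] = np.round(full_arr[:, 1], decimals=3)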
Look forward to other comments or suggestions anyone might have on this topic, and I almost certainly misspoke about some technical issues so I would be glad to hear back :-)
Given as input a list of floats that is not sorted, what would be the most efficient way of finding the index of the closest element to a certain value? Some potential solutions come to mind:
For:
import random
x = random.sample([float(i) for i in range(1000000)], 1000000)
1) Own function:
def min_val(lst, val):
    min_i = None
    min_dist = 1000000.0
    for i, v in enumerate(lst):
        d = abs(v - val)
        if d < min_dist:
            min_dist = d
            min_i = i
    return min_i
Result:
%timeit min_val(x, 5000.56)
100 loops, best of 3: 11.5 ms per loop
2) Min
%timeit min(range(len(x)), key=lambda i: abs(x[i]-5000.56))
100 loops, best of 3: 16.8 ms per loop
3) Numpy (including conversion)
%timeit np.abs(np.array(x)-5000.56).argmin()
100 loops, best of 3: 3.88 ms per loop
From that test, it seems that converting the list to numpy array is the best solution. However two questions come to mind:
Was that indeed a realistic comparison?
Is the numpy solution the fastest way to achieve this in Python?
Consider the partition algorithm from QuickSort. The partition algorithm rearranges a list such that the pivot element is in its final location after invocation. Based on the value of the pivot, you could then partition the portion of the array that is likely to contain the element closest to your target. Once you've either found the element you're after or have a partition of length 1 that contains your element, you're done.
The general problem you're addressing is a selection problem.
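For reference, numpy exposes this selection primitive as np.argpartition; a sketch for the k closest elements (k and the target value are illustrative, x is the list from the question, and k = 1 reduces to a plain argmin):
import numpy as np

arr = np.asarray(x)                    # convert the list once, up front
target = 5000.56
k = 3                                  # the k closest elements, unordered

# Partial selection: O(n) on average, no full sort of the distances.
idx = np.argpartition(np.abs(arr - target), k)[:k]
print(idx, arr[idx])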
In your question you were wondering about what sort of array/list implementation to use, and that will have an impact on performance. A bigger factor will be the search algorithm as opposed to the list/array representation.
Edit in light of comment from @Andrzej
Ah, then I misunderstood your question. Strictly speaking, linear search is always O(n), so efficiency within the bounds of Big-Oh analysis is the same regardless of underlying data structure. The gotcha here is that for linear search you want a nice simple data structure to make the run-time performance as good as possible.
A Python list is an array of references to objects while (to my understanding) a Numpy array is a contiguous array of objects. The Numpy array will perform better since it doesn't have to dereference the objects to get to the values.
Your comparison technique seems reasonable for Python list vs. Numpy array. I'd be reluctant to say that a Numpy array is the fastest way to solve the problem, but it should perform better than a Python list.
I'm just starting with NumPy so I may be missing some core concepts...
What's the best way to create a NumPy array from a dictionary whose values are lists?
Something like this:
d = { 1: [10,20,30] , 2: [50,60], 3: [100,200,300,400,500] }
Should turn into something like:
data = [
[10,20,30,?,?],
[50,60,?,?,?],
[100,200,300,400,500]
]
I'm going to do some basic statistics on each row, eg:
deviations = numpy.std(data, axis=1)
Questions:
What's the best / most efficient way to create the numpy.array from the dictionary? The dictionary is large; a couple of million keys, each with ~20 items.
The number of values for each 'row' are different. If I understand correctly numpy wants uniform size, so what do I fill in for the missing items to make std() happy?
Update: One thing I forgot to mention - while the python techniques are reasonable (eg. looping over a few million items is fast), it's constrained to a single CPU. Numpy operations scale nicely to the hardware and hit all the CPUs, so they're attractive.
You don't need to create numpy arrays to call numpy.std().
You can call numpy.std() in a loop over all the values of your dictionary. The list will be converted to a numpy array on the fly to compute the standard deviation.
The downside of this method is that the main loop will be in python and not in C. But I guess this should be fast enough: you will still compute std at C speed, and you will save a lot of memory as you won't have to store 0 values where you have variable size arrays.
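A minimal sketch of that loop, using the d from the question:
import numpy as np

d = {1: [10, 20, 30], 2: [50, 60], 3: [100, 200, 300, 400, 500]}

# One std per key; each list is converted to an array on the fly.
deviations = {key: np.std(row) for key, row in d.items()}
print(deviations)   # {1: 8.164..., 2: 5.0, 3: 141.42...}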
If you want to further optimize this, you can store your values into a list of numpy arrays, so that you do the python list -> numpy array conversion only once.
if you find that this is still too slow, try using Psyco to optimize the Python loop.
if this is still too slow, try using Cython together with the numpy module. This Tutorial claims impressive speed improvements for image processing. Or simply program the whole std function in Cython (see this for benchmarks and examples with sum function )
An alternative to Cython would be to use SWIG with numpy.i.
if you want to use only numpy and have everything computed at C level, try grouping all the records of same size together in different arrays and call numpy.std() on each of them. It should look like the following example.
example with O(N) complexity:
import numpy
list_size_1 = []
list_size_2 = []
for row in data.itervalues():
    if len(row) == 1:
        list_size_1.append(row)
    elif len(row) == 2:
        list_size_2.append(row)
list_size_1 = numpy.array(list_size_1)
list_size_2 = numpy.array(list_size_2)
std_1 = numpy.std(list_size_1, axis = 1)
std_2 = numpy.std(list_size_2, axis = 1)
While there are already some pretty reasonable ideas present here, I believe following is worth mentioning.
Filling missing data with any default value would spoil the statistical characteristics (std, etc). Evidently that's why Mapad proposed the nice trick with grouping same sized records.
The problem with it (assuming there isn't any a priori data on record lengths at hand) is that it involves even more computations than the straightforward solution:
at least O(N*logN) 'len' calls and comparisons for sorting with an effective algorithm
O(N) checks on the second pass through the list to obtain the groups (their beginning and end indexes on the 'vertical' axis)
Using Psyco is a good idea (it's strikingly easy to use, so be sure to give it a try).
It seems that the optimal way is to take the strategy described by Mapad in bullet #1, but with a modification - not to generate the whole list, but iterate through the dictionary converting each row into numpy.array and performing required computations. Like this:
for row in data.itervalues():
    np_row = numpy.array(row)
    this_row_std = numpy.std(np_row)
    # compute any other statistic descriptors needed and then save to some list
In any case a few million loops in python won't take as long as one might expect. Besides this doesn't look like a routine computation, so who cares if it takes extra second/minute if it is run once in a while or even just once.
A generalized variant of what was suggested by Mapad:
from numpy import array, mean, std, shape

def get_statistical_descriptors(a):
    ax = len(shape(a)) - 1
    functions = [mean, std]
    return [f(a, axis=ax) for f in functions]

def process_long_list_stats(data):
    import numpy
    groups = {}
    for key, row in data.iteritems():
        size = len(row)
        try:
            groups[size].append(key)
        except KeyError:
            groups[size] = [key]
    results = []
    for gr_keys in groups.itervalues():
        gr_rows = numpy.array([data[k] for k in gr_keys])
        stats = get_statistical_descriptors(gr_rows)
        results.extend(zip(gr_keys, zip(*stats)))
    return dict(results)
numpy dictionary
You can use a structured array to preserve the ability to address a numpy object by a key, like a dictionary.
import numpy as np
dd = {'a':1,'b':2,'c':3}
dtype = [(key, float) for key in dd.keys()]
values = [tuple(dd.values())]
numpy_dict = np.array(values, dtype=dtype)
numpy_dict['c']
will now output
array([ 3.])