I need to count the number of distinct columns in relatively large arrays.
def nodistinctcols(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)
X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])
print "nodistinctcols(X.T)", nodistinctcols(X.T)
The last line takes 20 seconds on my computer, which seems excessively slow. By contrast, X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)]) takes 216 ms. Can nodistinctcols be sped up?
You can use view to change the dtype of M so that an entire row (or column) will be viewed as an array of bytes. Then np.unique can be applied to find the unique values:
import numpy as np
def asvoid(arr):
    """
    View the array as dtype np.void (bytes).
    This views the last axis of ND-arrays as np.void (bytes) so
    comparisons can be performed on the entire row.
    http://stackoverflow.com/a/16840350/190597 (Jaime, 2013-05)
    Some caveats:
    - `asvoid` will work for integer dtypes, but be careful if using asvoid on float
      dtypes, since float zeros may compare UNEQUALLY:
      >>> asvoid([-0.]) == asvoid([0.])
      array([False], dtype=bool)
    - `asvoid` works best on contiguous arrays. If the input is not contiguous,
      `asvoid` will copy the array to make it contiguous, which will slow down the
      performance.
    """
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))
def nodistinctcols(M):
    MT = asvoid(M.T)
    uniqs = np.unique(MT)
    return len(uniqs)
X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])
print("nodistinctcols(X.T) {}".format(nodistinctcols(X.T)))
Benchmark:
In [20]: %timeit nodistinctcols(X.T)
10 loops, best of 3: 63.6 ms per loop
In [21]: %timeit nodistinctcols_orig(X.T)
1 loops, best of 3: 17.4 s per loop
where nodistinctcols_orig is defined by:
def nodistinctcols_orig(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)
Sanity check passes:
In [24]: assert nodistinctcols(X.T) == nodistinctcols_orig(X.T)
By the way, it might make more sense to define
def num_distinct_rows(M):
    return len(np.unique(asvoid(M)))
and simply pass M.T to the function when you wish to count the number of distinct columns. That way, the function would not be slowed down by an unnecessary transpose if you wish to use it to count the number of distinct rows.
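For example, with the X defined above (a usage sketch, not part of the benchmark):
num_distinct_rows(X)    # number of distinct rows of X
num_distinct_rows(X.T)  # number of distinct columns of X (asvoid copies X.T to make it contiguous)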
Just for future reference, don't sleep on old-fashioned approaches like using set. Will it be as fast and memory-efficient as a clever numpy approach? No. But often it's good enough for now, which is nothing to sneeze at when you're on the clock.
In [25]: %time slow = nodistinctcols(X.T)
CPU times: user 28.2 s, sys: 12 ms, total: 28.2 s
Wall time: 28.2 s
In [26]: %time medium = len(set(map(tuple, X)))
CPU times: user 324 ms, sys: 0 ns, total: 324 ms
Wall time: 322 ms
In [27]: slow == medium
Out[27]: True
What was slow wasn't the set part; it was the string conversion.
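A rough way to see that, assuming the same kind of X as in the question (exact timings will vary by machine):
import numpy as np
X = np.random.randint(2, size=(2**16, 16))
n_repr = len({repr(row) for row in X})    # slow: formats 2**16 strings
n_tuple = len({tuple(row) for row in X})  # fast: hashes 2**16 small tuples
assert n_repr == n_tuple                  # the set logic is identical either way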
If you have fewer rows than columns, you can also do multiple stable sorts along the rows and then count the uniques:
def count(x):
    x = x.copy()
    x = x[x[:, 0].argsort()]  # first sort can be unstable
    for i in range(1, x.shape[1]):
        x = x[x[:, i].argsort(kind='mergesort')]  # stable sorts now
    # x is now sorted so that duplicate rows (the original columns) are adjacent
    # -> compare each row with its neighbor and count the all-equal pairs
    return x.shape[0] - np.count_nonzero((x[1:, :] == x[:-1, :]).all(axis=1))
With numpy 1.9.dev it's faster than the void comparison; with older numpy versions the fancy indexing kills the performance (about 4 times slower than the void approach):
X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])
In [6]: %timeit count(X)
10 loops, best of 3: 144 ms per loop
Xt = X.T.copy()
In [8]: %timeit unutbu_void(Xt)
10 loops, best of 3: 161 ms per loop
I have a 4 dimensional array called new_arr. Given a list of indices, I want to update new_arr based on an old array I have stored, old_arr. I am using a for loop to do this, but it's inefficient. My code looks something like this:
update_indices = [(2,33,1,8), (4,9,49,50), ...]  # as an example
for index in update_indices:
    i, j, k, l = index
    new_arr[i][j][k][l] = old_arr[i][j][k][l]
It's taking a very long time because update_indices is large. Is there a way I can update all of the terms at once or do this more efficiently?
Out of curiosity I have benchmarked the various improvements posted in the comments and found that working on flat indices is fastest.
I used the following setup:
import numpy as np
n = 57
d = 4
k = int(1e6)
dt = np.double
new_arr = np.arange(n**d, dtype=dt).reshape(d * (n,))
new_arr2 = np.arange(n**d, dtype=dt).reshape(d * (n,))
old_arr = 2*np.arange(n**d, dtype=dt).reshape(d * (n,))
update_indices = list({tuple(np.random.randint(n, size=d)) for _ in range(int(k*1.1))})[:k]
where update_indices is a list of 1e6 unique index tuples.
Using the original technique from the question
%%timeit
for index in update_indices:
    i, j, k, l = index
    new_arr[i][j][k][l] = old_arr[i][j][k][l]
takes 1.47 s ± 19.3 ms.
Direct tuple-indexing as suggested by @defladamouse
%%timeit
for index in update_indices:
    new_arr[index] = old_arr[index]
indeed gives us a speedup of 2: 778 ms ± 41.8 ms
If update_indices is not given but can instead be constructed as an ndarray, as suggested by @Jérôme Richard,
update_indices_array = np.array(update_indices, dtype=np.uint32)
(the conversion itself takes 1.34 s), the path to much faster implementations is open.
To index a numpy array by a multidimensional list of locations, we cannot use update_indices_array directly as the index; instead we pack its columns into a tuple:
%%timeit
idx = tuple(update_indices_array.T)
new_arr2[idx] = old_arr[idx]
This gives another speedup of roughly 9x: 83.5 ms ± 1.45 ms.
If we don't leave the computation of memory offsets to ndarray.__getitem__ but compute the corresponding flat indices "by hand", we can get even faster:
%%timeit
idx_weights = np.cumprod((1,) + new_arr2.shape[:0:-1])[::-1]
update_flat = update_indices_array @ idx_weights
new_arr2.ravel()[update_flat] = old_arr.ravel()[update_flat]
resulting in 41.6 ms ± 1.04 ms, another factor of 2 and a cumulative speedup factor of 35 compared with the original version.
idx_weights is simply an off-by-one reverse-order cumulative product of the array dimensions.
I assume that this speedup of 2 comes from the fact that the memory offsets / flat indices are computed twice in new_arr2[idx] = old_arr[idx] but only once in update_flat = update_indices_array @ idx_weights.
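As a sanity check (a small sketch reusing the setup above), the hand-computed flat indices agree with numpy's own helper np.ravel_multi_index:
idx_weights = np.cumprod((1,) + new_arr2.shape[:0:-1])[::-1]
update_flat = update_indices_array @ idx_weights
ref_flat = np.ravel_multi_index(tuple(update_indices_array.T), new_arr2.shape)
assert np.array_equal(update_flat, ref_flat)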
Just do:
idx = np.array([(2,33,1,8), (4,9,49,50), ...])
new_arr[idx[:,0],idx[:,1],idx[:,2],idx[:,3]] = old_arr[idx[:,0],idx[:,1],idx[:,2],idx[:,3]]
no need for a loop.
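If you would rather not spell out the four column slices, tuple(idx.T) builds the same index (a minor variation on the same idea):
idx_t = tuple(idx.T)
new_arr[idx_t] = old_arr[idx_t]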
Assume that I have two arrays A and B, where both A and B are m x n. My goal is now, for each row of A and B, to find where I should insert the elements of row i of A in the corresponding row of B. That is, I wish to apply np.digitize or np.searchsorted to each row of A and B.
My naive solution is to simply iterate over the rows. However, this is far too slow for my application. My question is therefore: is there a vectorized implementation of either algorithm that I haven't managed to find?
We can add to each row an offset that increases from row to row, using the same offset for both arrays. The idea is then to use np.searchsorted on the flattened versions of the input arrays, so that each row of b is restricted to finding sorted positions within the corresponding row of a. Additionally, to make it work for negative numbers too, we just need to offset by the minimum values as well.
So, we would have a vectorized implementation like so -
def searchsorted2d(a, b):
    m, n = a.shape
    max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num * np.arange(a.shape[0])[:, None]
    p = np.searchsorted((a + r).ravel(), (b + r).ravel()).reshape(m, -1)
    return p - n * (np.arange(m)[:, None])
Runtime test -
In [173]: def searchsorted2d_loopy(a,b):
...: out = np.zeros(a.shape,dtype=int)
...: for i in range(len(a)):
...: out[i] = np.searchsorted(a[i],b[i])
...: return out
...:
In [174]: # Setup input arrays
...: a = np.random.randint(11,99,(10000,20))
...: b = np.random.randint(11,99,(10000,20))
...: a = np.sort(a,1)
...: b = np.sort(b,1)
...:
In [175]: np.allclose(searchsorted2d(a,b),searchsorted2d_loopy(a,b))
Out[175]: True
In [176]: %timeit searchsorted2d_loopy(a,b)
10 loops, best of 3: 28.6 ms per loop
In [177]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 13.7 ms per loop
The solution provided by @Divakar is ideal for integer data, but beware of precision issues for floating-point values, especially if they span multiple orders of magnitude (e.g. [[1.0, 2.0, 3.0, 1.0e+20], ...]). In some cases r may be so large that applying a+r and b+r wipes out the original values you're trying to run searchsorted on, and you end up just comparing r to r.
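The failure mode is easy to demonstrate with made-up numbers: once the offset dwarfs the data, distinct values become indistinguishable in double precision.
>>> 1.0 + 1e20 == 2.0 + 1e20
True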
To make the approach more robust for floating-point data, you could embed the row information into the arrays as part of the values (as a structured dtype), and run searchsorted on these structured dtypes instead.
def searchsorted_2d(a, v, side='left', sorter=None):
    import numpy as np
    # Make sure a and v are numpy arrays.
    a = np.asarray(a)
    v = np.asarray(v)
    # Augment a with row id
    ai = np.empty(a.shape, dtype=[('row', int), ('value', a.dtype)])
    ai['row'] = np.arange(a.shape[0]).reshape(-1, 1)
    ai['value'] = a
    # Augment v with row id
    vi = np.empty(v.shape, dtype=[('row', int), ('value', v.dtype)])
    vi['row'] = np.arange(v.shape[0]).reshape(-1, 1)
    vi['value'] = v
    # Perform searchsorted on the augmented arrays.
    # The row information is embedded in the values, so only the equivalent rows
    # between a and v are considered.
    result = np.searchsorted(ai.flatten(), vi.flatten(), side=side, sorter=sorter)
    # Restore the original shape and decode the searchsorted indices so they apply to the original data.
    result = result.reshape(vi.shape) - vi['row'] * a.shape[1]
    return result
Edit: The timing on this approach is abysmal!
In [21]: %timeit searchsorted_2d(a,b)
10 loops, best of 3: 92.5 ms per loop
You would be better off just using map over the array:
In [22]: %timeit np.array(list(map(np.searchsorted,a,b)))
100 loops, best of 3: 13.8 ms per loop
For integer data, @Divakar's approach is still the fastest:
In [23]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 7.26 ms per loop
I need to generate a 1D array where repeated sequences of integers are separated by a random number of zeros.
So far I have been using the following code for this:
import numpy as np
from random import normalvariate
regular_sequence = np.array([1,2,3,4,5], dtype=np.int)
n_iter = 10
lag_mean = 10 # mean length of zeros sequence
lag_sd = 1 # standard deviation of zeros sequence length
# Sequence of lags lengths
lag_seq = [int(round(normalvariate(lag_mean, lag_sd))) for x in range(n_iter)]
# Generate list of concatenated zeros and regular sequences
seq = [np.concatenate((np.zeros(x, dtype=np.int), regular_sequence)) for x in lag_seq]
seq = np.concatenate(seq)
It works but looks very slow when I need a lot of long sequences. So, how can I optimize it?
You can pre-compute indices where repeated regular_sequence elements are to be put and then set those with regular_sequence in a vectorized manner. For pre-computing those indices, one can use np.cumsum to get the start of each such chunk of regular_sequence and then add a continuous set of integers extending to the size of regular_sequence to get all indices that are to be updated. Thus, the implementation would look something like this -
# Size of regular_sequence
N = regular_sequence.size
# Use cumsum to pre-compute the start of every occurrence of regular_sequence
offset_arr = np.cumsum(lag_seq)
idx = np.arange(offset_arr.size)*N + offset_arr
# Setup output array
out = np.zeros(idx.max() + N,dtype=regular_sequence.dtype)
# Broadcast the start indices to include entire length of regular_sequence
# to get all positions where regular_sequence elements are to be set
np.put(out,idx[:,None] + np.arange(N),regular_sequence)
Runtime tests -
def original_app(lag_seq, regular_sequence):
    seq = [np.concatenate((np.zeros(x, dtype=np.int), regular_sequence)) for x in lag_seq]
    return np.concatenate(seq)

def vectorized_app(lag_seq, regular_sequence):
    N = regular_sequence.size
    offset_arr = np.cumsum(lag_seq)
    idx = np.arange(offset_arr.size) * N + offset_arr
    out = np.zeros(idx.max() + N, dtype=regular_sequence.dtype)
    np.put(out, idx[:, None] + np.arange(N), regular_sequence)
    return out
In [64]: # Setup inputs
...: regular_sequence = np.array([1,2,3,4,5], dtype=np.int)
...: n_iter = 1000
...: lag_mean = 10 # mean length of zeros sequence
...: lag_sd = 1 # standard deviation of zeros sequence length
...:
...: # Sequence of lags lengths
...: lag_seq = [int(round(normalvariate(lag_mean, lag_sd))) for x in range(n_iter)]
...:
In [65]: out1 = original_app(lag_seq, regular_sequence)
In [66]: out2 = vectorized_app(lag_seq, regular_sequence)
In [67]: %timeit original_app(lag_seq, regular_sequence)
100 loops, best of 3: 4.28 ms per loop
In [68]: %timeit vectorized_app(lag_seq, regular_sequence)
1000 loops, best of 3: 294 µs per loop
The best approach, I think, would be to use convolution. You can figure out the lag lengths, combine that with the length of the sequence, and use that to figure out the starting point of each regular sequence. Set those starting points to 1 in an array of zeros, then convolve with your regular sequence to fill in the values.
import numpy as np
regular_sequence = np.array([1,2,3,4,5], dtype=np.int)
n_iter = 10000000
lag_mean = 10 # mean length of zeros sequence
lag_sd = 1 # standard deviation of zeros sequence length
# Sequence of lags lengths
lag_lens = np.round(np.random.normal(lag_mean, lag_sd, n_iter)).astype(np.int)
lag_lens[1:] += len(regular_sequence)
starts_inds = lag_lens.cumsum()-1
# Generate list of convolved ones and regular sequences
seq = np.zeros(lag_lens.sum(), dtype=np.int)
seq[starts_inds] = 1
seq = np.convolve(seq, regular_sequence)
This approach takes something like 1/20th the time on large sequences, even after changing your version to use the numpy random number generator.
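For reference, here is a sketch of what "your version with the numpy random number generator" could look like; this is my reconstruction of that baseline, reusing the variables defined above, not the exact code that was benchmarked.
lag_seq = np.round(np.random.normal(lag_mean, lag_sd, n_iter)).astype(np.int)
seq_orig = np.concatenate([np.concatenate((np.zeros(x, dtype=np.int), regular_sequence))
                           for x in lag_seq])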
Not a trivial problem, because the data is misaligned. Performance depends on what counts as a long sequence. Take the example of a square problem: a lot of long regular sequences and zero sequences (n_iter == n_reg == lag_mean):
import numpy as np
n_iter = 1000
n_reg = 1000
regular_sequence = np.arange(n_reg, dtype=np.int)
lag_mean = n_reg # mean length of zeros sequence
lag_sd = lag_mean/10 # standard deviation of zeros sequence length
lag_seq=np.int64(np.random.normal(lag_mean,lag_sd,n_iter)) # Sequence of lags lengths
First, your solution:
def seq_hybrid():
    seqs = [np.concatenate((np.zeros(x, dtype=np.int), regular_sequence)) for x in lag_seq]
    seq = np.concatenate(seqs)
    return seq
Then a pure numpy one:
def seq_numpy():
    seq = np.zeros(lag_seq.sum() + n_iter*n_reg, dtype=int)
    cs = np.cumsum(lag_seq + n_reg) - n_reg
    indexes = np.add.outer(cs, np.arange(n_reg))
    seq[indexes] = regular_sequence
    return seq
A for-loop solution:
def seq_python():
    seq = np.empty(lag_seq.sum() + n_iter*n_reg, dtype=int)
    i = 0
    for lag in lag_seq:
        for k in range(lag):
            seq[i] = 0
            i += 1
        for k in range(n_reg):
            seq[i] = regular_sequence[k]
            i += 1
    return seq
And a just-in-time compilation with numba:
from numba import jit
seq_numba=jit(seq_python)
Tests now:
In [96]: %timeit seq_hybrid()
10 loops, best of 3: 38.5 ms per loop
In [97]: %timeit seq_numpy()
10 loops, best of 3: 34.4 ms per loop
In [98]: %timeit seq_python()
1 loops, best of 3: 1.56 s per loop
In [99]: %timeit seq_numba()
100 loops, best of 3: 12.9 ms per loop
Your hybrid solution is about as fast as a pure numpy one in this case, because the performance depends essentially on the inner loop, and yours (zeros and concatenate) already is a numpy one. Predictably, the pure Python solution is slower, by the usual factor of about 40x. But numpy is not optimal here either, because it has to use fancy indexing, which is necessary with misaligned data. This is where numba can help: the minimal operations are done at C level, for a gain of about 120x this time compared to the Python solution.
For other values of n_iter,n_reg the factor gains compared to the python solution are:
n_iter = 1000, n_reg = 1000: seq_numba 124, seq_hybrid 49, seq_numpy 44.
n_iter = 10, n_reg = 100000: seq_numba 123, seq_hybrid 104, seq_numpy 49.
n_iter = 100000, n_reg = 10: seq_numba 127, seq_hybrid 1, seq_numpy 42.
I thought an answer posted on this question had a good approach using a binary mask and np.convolve but the answer got deleted and I don't know why. Here it is with 2 concerns addressed.
def insert_sequence(lag_seq, regular_sequence):
    offsets = np.cumsum(lag_seq)
    start_locs = np.zeros(offsets[-1] + 1, dtype=regular_sequence.dtype)
    start_locs[offsets] = 1
    return np.convolve(start_locs, regular_sequence)
lag_seq = np.random.normal(15,1,10)
lag_seq = lag_seq.astype(np.uint8)
regular_sequence = np.arange(1, 6)
seq = insert_sequence(lag_seq, regular_sequence)
print(repr(seq))
I have a very (very, very) large two dimensional array - on the order of a thousand columns, but a couple of million rows (enough so that it doesn't fit in to memory on my 32GB machine). I want to compute the variance of each of the thousand columns. One key fact which helps: my data is 8-bit unsigned ints.
Here's how I'm planning on approaching this. I will first construct a new two dimensional array called counts with shape (1000, 256), with the idea that counts[i,:] == np.bincount(bigarray[:,i]). Once I have this array, it will be trivial to compute the variance.
Trouble is, I'm not sure how to compute it efficiently (this computation must be run in real-time, and I'd like bandwidth to be limited by how fast my SSD can return the data). Here's something which works, but is god-awful slow:
counts = np.zeros((1000, 256), dtype=int)
for row in iterator_over_bigaray_rows():
    for i, val in enumerate(row):
        counts[i, val] += 1
Is there any way to write this to run faster? Something like this:
counts = np.zeros((1000, 256), dtype=int)
for row in iterator_over_bigaray_rows():
    counts[i, :] = ...  # magic np one-liner to do what I want
I think this is what you want:
counts[np.arange(1000), row] += 1
But if your array has millions of rows, you are still going to have to iterate over millions of those. The following trick gives close to a 5x speed-up on my system:
chunk = np.random.randint(256, size=(1000, 1000))
def count_chunk(chunk):
    rows, cols = chunk.shape
    # Shift the values in column j into the range [256*j, 256*j + 255] so that one
    # bincount over the flattened chunk counts every column separately.
    col_idx = np.arange(cols) * 256
    counts = np.bincount((col_idx[None, :] + chunk).ravel(),
                         minlength=256*cols)
    return counts.reshape(-1, 256)
def count_chunk_by_rows(chunk):
    counts = np.zeros(chunk.shape[1:] + (256,), dtype=np.int)
    indices = np.arange(chunk.shape[-1])
    for row in chunk:
        counts[indices, row] += 1
    return counts
And now:
In [2]: c = count_chunk_by_rows(chunk)
In [3]: d = count_chunk(chunk)
In [4]: np.all(c == d)
Out[4]: True
In [5]: %timeit count_chunk_by_rows(chunk)
10 loops, best of 3: 80.5 ms per loop
In [6]: %timeit count_chunk(chunk)
100 loops, best of 3: 13.8 ms per loop
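For completeness, once counts has been accumulated over all chunks, the variance the question asks for can be read off from the first two moments. This is a minimal sketch, assuming counts has shape (1000, 256) and the histogrammed values are 0-255:
vals = np.arange(256)
n = counts.sum(axis=1).astype(float)   # samples seen per column
mean = counts.dot(vals) / n            # E[x] per column
mean_sq = counts.dot(vals**2) / n      # E[x**2] per column
var = mean_sq - mean**2                # Var[x] = E[x**2] - (E[x])**2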