Generate a 1D numpy array with chunks of random length - python

I need to generate a 1D array where repeated sequences of integers are separated by a random number of zeros.
So far I am using the following code for this:
import numpy as np
from random import normalvariate

regular_sequence = np.array([1,2,3,4,5], dtype=int)
n_iter = 10
lag_mean = 10 # mean length of zeros sequence
lag_sd = 1 # standard deviation of zeros sequence length
# Sequence of lags lengths
lag_seq = [int(round(normalvariate(lag_mean, lag_sd))) for x in range(n_iter)]
# Generate list of concatenated zeros and regular sequences
seq = [np.concatenate((np.zeros(x, dtype=int), regular_sequence)) for x in lag_seq]
seq = np.concatenate(seq)
It works, but it is very slow when I need a lot of long sequences. So, how can I optimize it?

You can pre-compute indices where repeated regular_sequence elements are to be put and then set those with regular_sequence in a vectorized manner. For pre-computing those indices, one can use np.cumsum to get the start of each such chunk of regular_sequence and then add a continuous set of integers extending to the size of regular_sequence to get all indices that are to be updated. Thus, the implementation would look something like this -
# Size of regular_sequence
N = regular_sequence.size
# Use cumsum to pre-compute start of every occurrence of regular_sequence
offset_arr = np.cumsum(lag_seq)
idx = np.arange(offset_arr.size)*N + offset_arr
# Setup output array
out = np.zeros(idx.max() + N,dtype=regular_sequence.dtype)
# Broadcast the start indices to include entire length of regular_sequence
# to get all positions where regular_sequence elements are to be set
np.put(out,idx[:,None] + np.arange(N),regular_sequence)
Runtime tests -
def original_app(lag_seq, regular_sequence):
    seq = [np.concatenate((np.zeros(x, dtype=int), regular_sequence)) for x in lag_seq]
    return np.concatenate(seq)

def vectorized_app(lag_seq, regular_sequence):
    N = regular_sequence.size
    offset_arr = np.cumsum(lag_seq)
    idx = np.arange(offset_arr.size)*N + offset_arr
    out = np.zeros(idx.max() + N,dtype=regular_sequence.dtype)
    np.put(out,idx[:,None] + np.arange(N),regular_sequence)
    return out
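As a quick sanity check, one might first verify on a small toy input that both versions agree (a minimal sketch, not part of the original timing session):
lag_seq_small = np.array([3, 1, 4])   # toy zero-gap lengths
assert np.array_equal(original_app(lag_seq_small, regular_sequence),
                      vectorized_app(lag_seq_small, regular_sequence))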
In [64]: # Setup inputs
...: regular_sequence = np.array([1,2,3,4,5], dtype=int)
...: n_iter = 1000
...: lag_mean = 10 # mean length of zeros sequence
...: lag_sd = 1 # standard deviation of zeros sequence length
...:
...: # Sequence of lags lengths
...: lag_seq = [int(round(normalvariate(lag_mean, lag_sd))) for x in range(n_iter)]
...:
In [65]: out1 = original_app(lag_seq, regular_sequence)
In [66]: out2 = vectorized_app(lag_seq, regular_sequence)
In [67]: %timeit original_app(lag_seq, regular_sequence)
100 loops, best of 3: 4.28 ms per loop
In [68]: %timeit vectorized_app(lag_seq, regular_sequence)
1000 loops, best of 3: 294 µs per loop

The best approach, I think, would be to use convolution. You can work out the lag lengths, combine them with the length of the regular sequence, and use that to find the starting point of each regular sequence. Set those starting points to one in an array of zeros, then convolve with your regular sequence to fill in the values.
import numpy as np
regular_sequence = np.array([1,2,3,4,5], dtype=int)
n_iter = 10000000
lag_mean = 10 # mean length of zeros sequence
lag_sd = 1 # standard deviation of zeros sequence length
# Sequence of lags lengths
lag_lens = np.round(np.random.normal(lag_mean, lag_sd, n_iter)).astype(int)
lag_lens[1:] += len(regular_sequence)
starts_inds = lag_lens.cumsum()-1
# Generate list of convolved ones and regular sequences
seq = np.zeros(lag_lens.sum(), dtype=int)
seq[starts_inds] = 1
seq = np.convolve(seq, regular_sequence)
This approach takes something like 1/20th the time on large sequences, even after changing your version to use the numpy random number generator.
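For reference, the numpy-RNG variant of the question's lag generation used for that comparison would look something like this (a one-line sketch):
lag_seq = np.round(np.random.normal(lag_mean, lag_sd, n_iter)).astype(int)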

This is not a trivial problem because the data is misaligned. Performance depends on what counts as a long sequence. Take the example of a "square" problem: a lot of long regular and zero sequences (n_iter==n_reg==lag_mean):
import numpy as np
n_iter = 1000
n_reg = 1000
regular_sequence = np.arange(n_reg, dtype=int)
lag_mean = n_reg # mean length of zeros sequence
lag_sd = lag_mean/10 # standard deviation of zeros sequence length
lag_seq=np.int64(np.random.normal(lag_mean,lag_sd,n_iter)) # Sequence of lags lengths
First, your solution:
def seq_hybrid():
    seqs = [np.concatenate((np.zeros(x, dtype=int), regular_sequence)) for x in lag_seq]
    seq = np.concatenate(seqs)
    return seq
Then a pure numpy one:
def seq_numpy():
    seq=np.zeros(lag_seq.sum()+n_iter*n_reg,dtype=int)
    cs=np.cumsum(lag_seq+n_reg)-n_reg
    indexes=np.add.outer(cs,np.arange(n_reg))
    seq[indexes]=regular_sequence
    return seq
A for-loop solution:
def seq_python():
    seq=np.empty(lag_seq.sum()+n_iter*n_reg,dtype=int)
    i=0
    for lag in lag_seq:
        for k in range(lag):
            seq[i]=0
            i+=1
        for k in range(n_reg):
            seq[i]=regular_sequence[k]
            i+=1
    return seq
And a just-in-time compilation with numba:
from numba import jit
seq_numba=jit(seq_python)
Tests now:
In [96]: %timeit seq_hybrid()
10 loops, best of 3: 38.5 ms per loop
In [97]: %timeit seq_numpy()
10 loops, best of 3: 34.4 ms per loop
In [98]: %timeit seq_python()
1 loops, best of 3: 1.56 s per loop
In [99]: %timeit seq_numba()
100 loops, best of 3: 12.9 ms per loop
Your hybrid solution is about as fast as the pure numpy one in this case, because the performance depends essentially on the inner loop, and yours (zeros and concatenate) is already a numpy one. Predictably, the Python solution is slower, by the traditional factor of about 40x. But numpy is not optimal here, because it has to use fancy indexing, which the misaligned data makes necessary. In this case numba can help: the elementary operations are done at C level, for a 120x gain this time compared to the Python solution.
For other values of n_iter and n_reg, the speedup factors compared to the Python solution are:
n_iter= 1000, n_reg= 1000 : seq_numba 124, seq_hybrid 49, seq_numpy 44.
n_iter= 10, n_reg= 100000 : seq_numba 123, seq_hybrid 104, seq_numpy 49.
n_iter= 100000, n_reg= 10 : seq_numba 127, seq_hybrid 1, seq_numpy 42.

I thought an answer posted on this question had a good approach using a binary mask and np.convolve, but the answer got deleted and I don't know why. Here it is with two concerns addressed.
def insert_sequence(lag_seq, regular_sequence):
    offsets = np.cumsum(lag_seq)
    start_locs = np.zeros(offsets[-1] + 1, dtype=regular_sequence.dtype)
    start_locs[offsets] = 1
    return np.convolve(start_locs, regular_sequence)
lag_seq = np.random.normal(15,1,10)
lag_seq = lag_seq.astype(np.uint8)
regular_sequence = np.arange(1, 6)
seq = insert_sequence(lag_seq, regular_sequence)
print(repr(seq))

Related

How to efficiently update a numpy ndarray given a list of indices

I have a 4 dimensional array called new_arr. Given a list of indices, I want to update new_arr based on an old array I have stored, old_arr. I am using a for loop to do this, but it's inefficient. My code looks something like this:
update_indices = [(2,33,1,8), (4,9,49,50), ...] #as an example
for index in update_indices:
    i,j,k,l = index
    new_arr[i][j][k][l] = old_arr[i][j][k][l]
It's taking a very long time because update_indices is large. Is there a way I can update all of the terms at once or do this more efficiently?
Out of curiosity I have benchmarked the various improvements posted in the comments and found that working on flat indices is fastest.
I used the following setup:
import numpy as np
n = 57
d = 4
k = int(1e6)
dt = np.double
new_arr = np.arange(n**d, dtype=dt).reshape(d * (n,))
new_arr2 = np.arange(n**d, dtype=dt).reshape(d * (n,))
old_arr = 2*np.arange(n**d, dtype=dt).reshape(d * (n,))
update_indices = list({tuple(np.random.randint(n, size=d)) for _ in range(int(k*1.1))})[:k]
where update_indices is a list of 1e6 unique index tuples.
Using the original technique from the question
%%timeit
for index in update_indices:
    i,j,k,l = index
    new_arr[i][j][k][l] = old_arr[i][j][k][l]
takes 1.47 s ± 19.3 ms.
Direct tuple-indexing as suggested by @defladamouse
%%timeit
for index in update_indices:
    new_arr[index] = old_arr[index]
indeed gives us a speedup of 2: 778 ms ± 41.8 ms
If update_indices is not given but can be constructed as an ndarray, as suggested by @Jérôme Richard,
update_indices_array = np.array(update_indices, dtype=np.uint32)
(the conversion itself takes 1.34 s) the path to much faster implementations is open.
In order to index a numpy array by a multidimensional list of locations, we cannot use update_indices_array directly as the index, but have to pack its columns into a tuple:
%%timeit
idx = tuple(update_indices_array.T)
new_arr2[idx] = old_arr[idx]
Giving another speedup of roughly 9: 83.5 ms ± 1.45 ms
If we don't leave the computation of memory offsets to ndarray.__getitem__,
but compute the corresponding flat indices "by hand", we can get even faster:
%%timeit
idx_weights = np.cumprod((1,) + new_arr2.shape[:0:-1])[::-1]
update_flat = update_indices_array @ idx_weights
new_arr2.ravel()[update_flat] = old_arr.ravel()[update_flat]
resulting in 41.6 ms ± 1.04 ms, another factor of 2 and a cumulative speedup factor of 35 compared with the original version.
idx_weights is simply an off-by-one reverse-order cumulative product of the array dimensions.
I assume that this speedup of 2 comes from the fact that the memory offsets / flat indices are computed twice in new_arr2[idx] = old_arr[idx] and only once in update_flat = update_indices_array @ idx_weights.
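As a small cross-check (a toy sketch with made-up shapes), the hand-built weights reproduce exactly what np.ravel_multi_index computes:
arr = np.zeros((4, 5, 6))                       # toy array
idx = np.array([[1, 2, 3], [0, 4, 5]])          # two index tuples, one per row
weights = np.cumprod((1,) + arr.shape[:0:-1])[::-1]
flat_manual = idx @ weights
flat_numpy = np.ravel_multi_index(tuple(idx.T), arr.shape)
assert np.array_equal(flat_manual, flat_numpy)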
Just do:
idx = np.array([(2,33,1,8), (4,9,49,50), ...])
new_arr[idx[:,0],idx[:,1],idx[:,2],idx[:,3]] = old_arr[idx[:,0],idx[:,1],idx[:,2],idx[:,3]]
no need for a loop.

Index search: trade accuracy for performance

I have a simple two-line block of code that adds values to an array according to the closest elements found in another array. Since it is buried deep inside an MCMC it is executed millions of times, and I need it to be as efficient as possible.
The code below works and is pretty self-explanatory. Basically: the array arr2[0] (the one used to find the closest elements in arr0) contains values in the range (10., 25.). Currently I look for the absolute closest element in arr0 for each element in arr2[0] using np.searchsorted(), taking advantage of the fact that arr0 is already sorted.
I would be willing to trade some accuracy for better performance. That is, I could live with an index that points to a "close" element within a tolerance of, say, ±0.2, instead of the absolute closest element (which is what I do now).
Can this be done? More importantly: can this be done and improve the performance of the code?
import numpy as np
# Random initial data with the actual shapes used by my code.
Nmax = 1000000
arr0 = np.linspace(5., 30., Nmax)
D = np.random.randint(2, 4)
arr1 = np.random.uniform(-3., 3., (D, Nmax))
arr2 = np.random.uniform(10., 25., (10, 1500))
# Can these two lines be made faster?
# Indexes of elements in 'arr0' closest to the elements in 'arr2[0]'
closest_idxs = np.searchsorted(arr0, arr2[0])
# Add elements from 'arr1' to the first dimensions of 'arr2', according
# to the indexes found above.
arr_final = arr2[:arr1.shape[0]] + arr1[:, closest_idxs]
For approximate matching with a given tolerance value, we can use the tolerance to reduce the first argument to searchsorted (by subsampling arr0) and hence optimize, like so -
tol = 0.2 # tolerance value
s = int(np.round(tol/(arr0[1]-arr0[0])))
i = np.searchsorted(arr0[::s], arr2[0])
i -= (arr0[i*s]-arr2[0])>tol/2
closest_idxs_out = i*s
Timings on given setup -
In [123]: %%timeit
...: closest_idxs = np.searchsorted(arr0, arr2[0])
...: arr_final = arr2[:arr1.shape[0]] + arr1[:, closest_idxs]
1000 loops, best of 3: 641 µs per loop
In [125]: %%timeit
...: tol = 0.2 # tolerance value
...: s = int(np.round(tol/(arr0[1]-arr0[0])))
...: i = np.searchsorted(arr0[::s], arr2[0])
...: i -= (arr0[i*s]-arr2[0])>tol/2
...: closest_idxs_out = i*s
10000 loops, best of 3: 63.2 µs per loop
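As a quick sanity check (a small sketch on top of the snippet above), one can confirm that the approximate indices stay within the requested tolerance of the query values:
err = np.abs(arr0[closest_idxs_out] - arr2[0])
print(err.max() <= tol)   # expected to print True for this setup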

how to compare entries in numpy array with each other efficiently?

I have a numpy array embed_vec of length tot_vec in which each entry is a 3d vector:
[[ 0.52483319 0.78015841 0.71117216]
[ 0.53041481 0.79462171 0.67234534]
[ 0.53645428 0.80896727 0.63119403]
...,
[ 0.72283509 0.40070804 0.15220522]
[ 0.71277758 0.38498613 0.16141834]
[ 0.70221445 0.36918032 0.17370776]]
For each of the elements in this array, I want to find out the number of other entries which are "close" to that entry. By close, I mean that the distance between two vectors is less than a specified value R. For this, I must compare all the possible pairs in this array with each other and then find out the number of close vectors for each of the vectors in the array. So I am doing this:
p = np.zeros(tot_vec) # This contains the number of close vectors
for i in range(tot_vec-1):
    for j in range(i+1, tot_vec):
        if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
            p[i] += 1
However, this is extremely inefficient because I have two nested Python loops, and for larger array sizes it takes forever. If this were in C++ or Fortran, it wouldn't have been a great issue. My question is, can one achieve the same thing efficiently with numpy, using some vectorization method? As a side note, I don't mind a solution using Pandas either.
Approach #1: Vectorized approach -
def vectorized_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    r,c = np.triu_indices(tot_vec,1)
    subs = embed_vec[r] - embed_vec[c]
    dists = np.einsum('ij,ij->i',subs,subs)
    return np.bincount(r,dists<R**2,minlength=tot_vec)
Approach #2: With less loop complexity (for very large arrays) -
def loopy_less_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    Rsq = R**2
    out = np.zeros(tot_vec,dtype=int)
    for i in range(tot_vec):
        subs = embed_vec[i] - embed_vec[i+1:tot_vec]
        dists = np.einsum('ij,ij->i',subs,subs)
        out[i] = np.count_nonzero(dists < Rsq)
    return out
Benchmarking
Original approach -
def loopy_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    p = np.zeros(tot_vec) # This contains the number of close vectors
    for i in range(tot_vec-1):
        for j in range(i+1, tot_vec):
            if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
                p[i] += 1
    return p
Timings -
In [76]: # Sample random array
...: embed_vec = np.random.rand(3000,3)
...: R = 0.5
...:
In [77]: %timeit loopy_app(embed_vec, R)
1 loops, best of 3: 50.5 s per loop
In [78]: %timeit loopy_less_app(embed_vec, R)
10 loops, best of 3: 143 ms per loop
350x+ speedup there!
Going with much bigger array with the proposed loopy_less_app -
In [81]: # Sample random array
...: embed_vec = np.random.rand(20000,3)
...: R = 0.5
...:
In [82]: %timeit loopy_less_app(embed_vec, R)
1 loops, best of 3: 4.47 s per loop
I am intrigued by this question and attempted to solve it efficiently using scipy's cKDTree. However, this approach may run out of memory because internally a list of all pairs with distance <= R is maintained. If your R and tot_vec are small enough, it will work:
import numpy as np
from scipy.spatial import cKDTree as KDTree
tot_vec = 60000
embed_vec = np.random.randn(tot_vec, 3)
R = 0.1
tree = KDTree(embed_vec, leafsize=100)
p = np.zeros(tot_vec)
for pair in tree.query_pairs(R):
    p[pair[0]] += 1
    p[pair[1]] += 1
In case memory is an issue, with some effort it is possible to rewrite query_pairs as a generator function in Python at the cost of C performance.
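If memory is the limiting factor, a lighter variant (a sketch assuming SciPy >= 1.6, which added the return_length argument) asks the tree only for neighbour counts instead of materializing the pairs:
import numpy as np
from scipy.spatial import cKDTree as KDTree
tot_vec = 60000
embed_vec = np.random.randn(tot_vec, 3)
R = 0.1
tree = KDTree(embed_vec, leafsize=100)
# Each point matches itself at distance 0, hence the -1.
p = tree.query_ball_point(embed_vec, R, return_length=True) - 1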
First, broadcast the difference:
disp_vecs = embed_vec[:,None,:] - embed_vec[None,:,:]
Now, depending on how big your dataset is, you may want to do a first pass without all the math. If the distance is less than r, the absolute value of every component must be less than r:
first_mask = np.max(np.abs(disp_vecs), axis=-1) < r
Then do the actual calculation:
disps = np.linalg.norm(disp_vecs[first_mask], axis=-1)
second_mask=disps<r
Now reassign
disps=disps[second_mask]
first_mask[first_mask]=second_mask
disps are now the good values, and first_mask is a boolean mask of where they go. You can process from there.
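For example, to turn that mask into per-point counts comparable to p in the question (a short sketch; it keeps only j > i so each pair is counted once, as in the original double loop):
# Drop the diagonal and lower triangle, then count close neighbours per row.
upper = np.triu(first_mask, k=1)
p = upper.sum(axis=1)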

Vectorized searchsorted numpy

Assume that I have two arrays A and B, where both A and B are m x n. My goal is now, for each row of A and B, to find where I should insert the elements of row i of A in the corresponding row of B. That is, I wish to apply np.digitize or np.searchsorted to each row of A and B.
My naive solution is to simply iterate over the rows. However, this is far too slow for my application. My question is therefore: is there a vectorized implementation of either algorithm that I haven't managed to find?
We can add to each row an offset that is larger than the one added to the previous row, using the same offsets for both arrays. The idea is then to use np.searchsorted on flattened versions of the input arrays, so that each row from b is restricted to finding sorted positions in the corresponding row of a. Additionally, to make it work for negative numbers too, we just need to offset for the minimum values as well.
So, we would have a vectorized implementation like so -
def searchsorted2d(a,b):
    m,n = a.shape
    max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num*np.arange(a.shape[0])[:,None]
    p = np.searchsorted( (a+r).ravel(), (b+r).ravel() ).reshape(m,-1)
    return p - n*(np.arange(m)[:,None])
Runtime test -
In [173]: def searchsorted2d_loopy(a,b):
...:     out = np.zeros(a.shape,dtype=int)
...:     for i in range(len(a)):
...:         out[i] = np.searchsorted(a[i],b[i])
...:     return out
...:
In [174]: # Setup input arrays
...: a = np.random.randint(11,99,(10000,20))
...: b = np.random.randint(11,99,(10000,20))
...: a = np.sort(a,1)
...: b = np.sort(b,1)
...:
In [175]: np.allclose(searchsorted2d(a,b),searchsorted2d_loopy(a,b))
Out[175]: True
In [176]: %timeit searchsorted2d_loopy(a,b)
10 loops, best of 3: 28.6 ms per loop
In [177]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 13.7 ms per loop
The solution provided by @Divakar is ideal for integer data, but beware of precision issues for floating point values, especially if they span multiple orders of magnitude (e.g. [[1.0, 2.0, 3.0, 1.0e+20],...]). In some cases r may be so large that applying a+r and b+r wipes out the original values you're trying to run searchsorted on, and you're just comparing r to r.
To make the approach more robust for floating-point data, you could embed the row information into the arrays as part of the values (as a structured dtype), and run searchsorted on these structured dtypes instead.
def searchsorted_2d(a, v, side='left', sorter=None):
    import numpy as np
    # Make sure a and v are numpy arrays.
    a = np.asarray(a)
    v = np.asarray(v)
    # Augment a with row id
    ai = np.empty(a.shape,dtype=[('row',int),('value',a.dtype)])
    ai['row'] = np.arange(a.shape[0]).reshape(-1,1)
    ai['value'] = a
    # Augment v with row id
    vi = np.empty(v.shape,dtype=[('row',int),('value',v.dtype)])
    vi['row'] = np.arange(v.shape[0]).reshape(-1,1)
    vi['value'] = v
    # Perform searchsorted on augmented array.
    # The row information is embedded in the values, so only the equivalent rows
    # between a and v are considered.
    result = np.searchsorted(ai.flatten(),vi.flatten(), side=side, sorter=sorter)
    # Restore the original shape, decode the searchsorted indices so they apply to the original data.
    result = result.reshape(vi.shape) - vi['row']*a.shape[1]
    return result
Edit: The timing on this approach is abysmal!
In [21]: %timeit searchsorted_2d(a,b)
10 loops, best of 3: 92.5 ms per loop
You would be better off just using map over the array:
In [22]: %timeit np.array(list(map(np.searchsorted,a,b)))
100 loops, best of 3: 13.8 ms per loop
For integer data, @Divakar's approach is still the fastest:
In [23]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 7.26 ms per loop

Efficiently indexing numpy array with a numpy array

I have a very (very, very) large two-dimensional array - on the order of a thousand columns, but a couple of million rows (enough so that it doesn't fit into memory on my 32GB machine). I want to compute the variance of each of the thousand columns. One key fact which helps: my data is 8-bit unsigned ints.
Here's how I'm planning on approaching this. I will first construct a new two dimensional array called counts with shape (1000, 256), with the idea that counts[i,:] == np.bincount(bigarray[:,i]). Once I have this array, it will be trivial to compute the variance.
Trouble is, I'm not sure how to compute it efficiently (this computation must be run in real-time, and I'd like bandwidth to be limited by how fast my SSD can return the data). Here's something which works, but is god-awful slow:
counts = np.zeros((1000, 256), dtype=int)
for row in iterator_over_bigaray_rows():
    for i, val in enumerate(row):
        counts[i, val] += 1
Is there any way to write this to run faster? Something like this:
counts = np.zeros((1000, 256), dtype=int)
for row in iterator_over_bigaray_rows():
    counts[i,:] = // magic np one-liner to do what I want
I think this is what you want:
counts[np.arange(1000), row] += 1
But if your array has millions of rows, you are still going to have to iterate over millions of those. The following trick gives close to a 5x speed-up on my system:
chunk = np.random.randint(256, size=(1000, 1000))
def count_chunk(chunk):
    rows, cols = chunk.shape
    col_idx = np.arange(cols) * 256
    counts = np.bincount((col_idx[None, :] + chunk).ravel(),
                         minlength=256*cols)
    return counts.reshape(-1, 256)
def count_chunk_by_rows(chunk):
    counts = np.zeros(chunk.shape[1:]+(256,), dtype=int)
    indices = np.arange(chunk.shape[-1])
    for row in chunk:
        counts[indices, row] += 1
    return counts
And now:
In [2]: c = count_chunk_by_rows(chunk)
In [3]: d = count_chunk(chunk)
In [4]: np.all(c == d)
Out[4]: True
In [5]: %timeit count_chunk_by_rows(chunk)
10 loops, best of 3: 80.5 ms per loop
In [6]: %timeit count_chunk(chunk)
100 loops, best of 3: 13.8 ms per loop
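Once counts has been accumulated over the whole array, the "trivial" variance step the question alludes to can be done column-wise straight from the histogram (a short sketch):
values = np.arange(256)
n_rows = counts.sum(axis=1, keepdims=True)                       # rows seen per column
means = (counts * values).sum(axis=1, keepdims=True) / n_rows
variances = (counts * (values - means)**2).sum(axis=1) / n_rows.ravel()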
