How would you unionize N-arrays with different sizes? - python

The function
np.union1d(a, b)
can unionize two arrays with different sizes.
np.vstack((a, b, c)).T.ravel()
can unionize N arrays of the same size.
How would you unionize N arrays with different sizes?
And of course it should be fast ;)
btw union is not just concatenation...
Still testing, but would this do it:
np.unique(np.concatenate((a,b,c)))

Here's one with array-assignment + masking for non-negative integers -
import numpy as np

def unionize_ndarrays(L, maxnum=None):
    if maxnum is None:
        maxnum = max([np.max(i) for i in L]) + 1
        # for lists : max([max(i) for i in L]) + 1
    id_ar = np.zeros(maxnum, dtype=bool)  # presence mask over 0..maxnum-1
    for i in L:
        id_ar[i] = True  # mark every value seen in this array
    return np.flatnonzero(id_ar)  # indices of True = sorted union
Computing the max number maxnum has noticeable runtime and could be the bottleneck, even with a large number of small arrays. So, if the max is known in advance, feeding it in should help a lot in those scenarios.
Sample run -
In [43]: a = np.array([0, 1, 3, 4, 3])
...: b = np.array([0, 10, 3, 1, 2, 1])
...: c = np.array([6, 3, 4, 2])
In [44]: np.unique(np.concatenate((a,b,c)))
Out[44]: array([ 0, 1, 2, 3, 4, 6, 10])
In [45]: unionize_ndarrays((a,b,c))
Out[45]: array([ 0, 1, 2, 3, 4, 6, 10])
Benchmarking
1) Small-sized arrays -
In [106]: L = [np.random.randint(0,10,n) for n in np.random.randint(4,10,10000)]
In [107]: %timeit unionize_ndarrays(L, maxnum=10)
2.74 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [108]: %timeit np.unique(np.concatenate((L)))
3.06 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Without maxnum fed
In [109]: %timeit unionize_ndarrays(L)
40.4 ms ± 542 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
If order is not important and we are dealing with small-sized arrays, we can also look into pandas.factorize -
In [76]: a = np.array([0, 1, 3, 4, 3])
...: b = np.array([0, 10, 3, 1, 2, 1])
...: c = np.array([6, 3, 4, 2])
In [77]: L = [a,b,c]
In [80]: import pandas as pd
In [81]: pd.factorize(np.concatenate(L))[1]
Out[81]: array([ 0, 1, 3, 4, 10, 2, 6])
Related timings -
In [82]: L = [np.random.randint(0,10,n) for n in np.random.randint(4,10,10000)]
In [84]: %timeit pd.factorize(np.concatenate(L))[1]
2.1 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2) Big-sized arrays (with bigger variation in sizes) -
Timings -
In [2]: L = [np.random.randint(0,1000,n) for n in np.random.randint(10,1000,10000)]
In [3]: %timeit unionize_ndarrays(L, maxnum=1000)
...: %timeit unionize_ndarrays(L)
...: %timeit np.unique(np.concatenate((L)))
14 ms ± 925 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
56.6 ms ± 641 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
242 ms ± 773 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, which one to choose will depend on whether we have a priori info on the max number and on the size variation.
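Putting that together, a minimal dispatch helper (a sketch; union_nd is a hypothetical name, and it assumes arrays of non-negative integers) could look like:
import numpy as np

def union_nd(L, maxnum=None):
    # Use the masking trick when the max value is known;
    # otherwise fall back to general-purpose unique + concatenate.
    if maxnum is not None:
        id_ar = np.zeros(maxnum, dtype=bool)
        for a in L:
            id_ar[a] = True
        return np.flatnonzero(id_ar)
    return np.unique(np.concatenate(L))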

From the NumPy manual:
To find the union of more than two arrays, use functools.reduce:
>>> from functools import reduce
>>> reduce(np.union1d, ([1, 3, 4, 3], [3, 1, 2, 1], [6, 3, 4, 2]))
array([1, 2, 3, 4, 6])
This also works with arrays of different sizes:
>>> reduce(np.union1d, ([0, 1, 3, 4, 3], [0, 10, 3, 1, 2, 1], [6, 3, 4, 2]))
array([ 0, 1, 2, 3, 4, 6, 10])
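Note that each reduce step sorts and deduplicates an intermediate result, so for a long list of arrays a single np.unique(np.concatenate(L)) is usually faster; applied to a list of arrays, the reduce form looks like this (an illustration consistent with the examples above):
>>> L = [np.array([0, 1, 3, 4, 3]), np.array([0, 10, 3, 1, 2, 1]), np.array([6, 3, 4, 2])]
>>> reduce(np.union1d, L)
array([ 0,  1,  2,  3,  4,  6, 10])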

Related

finding zero values in numpy 3-D array [duplicate]

NumPy has the efficient function/method nonzero() to identify the indices of non-zero elements in an ndarray object. What is the most efficient way to obtain the indices of the elements that do have a value of zero?
numpy.where() is my favorite.
>>> x = numpy.array([1,0,2,0,3,0,4,5,6,7,8])
>>> numpy.where(x == 0)[0]
array([1, 3, 5])
The method where returns a tuple of ndarrays, each corresponding to a different dimension of the input. Since the input is one-dimensional, the [0] unboxes the tuple's only element.
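For a 2D input, where returns one index array per axis; for instance (an added illustration, not part of the original answer):
>>> y = numpy.array([[1, 0], [0, 2]])
>>> numpy.where(y == 0)
(array([0, 1]), array([1, 0]))
Here the first array holds the row indices of the zeros and the second their column indices.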
There is np.argwhere,
import numpy as np
arr = np.array([[1,2,3], [0, 1, 0], [7, 0, 2]])
np.argwhere(arr == 0)
which returns all found indices as rows:
array([[1, 0],   # indices of the first zero
       [1, 2],   # indices of the second zero
       [2, 1]],  # indices of the third zero
      dtype=int64)
You can search for any scalar condition with:
>>> a = np.asarray([0,1,2,3,4])
>>> a == 0  # or whatever condition you like
array([ True, False, False, False, False], dtype=bool)
which gives back a boolean mask of the condition over the array.
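The mask can then be used directly for indexing or counting (a small illustration, not part of the original answer):
>>> a[a == 0]       # select the matching elements
array([0])
>>> (a == 0).sum()  # count the matches
1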
You can also use nonzero() by using it on a boolean mask of the condition, because False is also a kind of zero.
>>> x = numpy.array([1,0,2,0,3,0,4,5,6,7,8])
>>> x==0
array([False, True, False, True, False, True, False, False, False, False, False], dtype=bool)
>>> numpy.nonzero(x==0)[0]
array([1, 3, 5])
It's doing exactly the same as mtrw's way, but it is more related to the question ;)
You can use numpy.nonzero to find zeros.
>>> import numpy as np
>>> x = np.array([1,0,2,0,3,0,0,4,0,5,0,6]).reshape(4, 3)
>>> np.nonzero(x==0) # this is what you want
(array([0, 1, 1, 2, 2, 3]), array([1, 0, 2, 0, 2, 1]))
>>> np.nonzero(x)
(array([0, 0, 1, 2, 3, 3]), array([0, 2, 1, 1, 0, 2]))
If you are working with a one-dimensional array, there is syntactic sugar:
>>> x = numpy.array([1,0,2,0,3,0,4,5,6,7,8])
>>> numpy.flatnonzero(x == 0)
array([1, 3, 5])
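flatnonzero also accepts multi-dimensional input, returning indices into the flattened array; numpy.unravel_index can convert them back to coordinates (an added illustration, not part of the original answer):
>>> x2 = numpy.array([[1, 0], [0, 2]])
>>> flat = numpy.flatnonzero(x2 == 0)
>>> flat
array([1, 2])
>>> numpy.unravel_index(flat, x2.shape)
(array([0, 1]), array([1, 0]))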
I would do it the following way:
>>> x = np.array([[1,0,0], [0,2,0], [1,1,0]])
>>> x
array([[1, 0, 0],
       [0, 2, 0],
       [1, 1, 0]])
>>> np.nonzero(x)
(array([0, 1, 2, 2]), array([0, 1, 0, 1]))
# the non-zero values themselves
>>> x[np.nonzero(x)]
array([1, 2, 1, 1])
# if you want it in coordinates
>>> np.transpose(np.nonzero(x))
array([[0, 0],
       [1, 1],
       [2, 0],
       [2, 1]])
import numpy as np
arr = np.arange(10000)
arr[8000:8900] = 0
%timeit np.where(arr == 0)[0]
%timeit np.argwhere(arr == 0)
%timeit np.nonzero(arr==0)[0]
%timeit np.flatnonzero(arr==0)
%timeit np.amin(np.extract(arr != 0, arr))  # note: computes the min non-zero value, a different task
23.4 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
34.5 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
23.2 µs ± 447 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
27 µs ± 506 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
import numpy as np
x = np.array([1,0,2,3,6])
non_zero_arr = np.extract(x > 0, x)
min_value = np.amin(non_zero_arr)    # smallest non-zero value
min_index = np.argmin(non_zero_arr)  # its index within non_zero_arr, not x
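If you need the position of that minimum within the original x (argmin above indexes into the extracted array), one way to map back is (a sketch, not from the original answer):
orig_index = np.flatnonzero(x > 0)[np.argmin(non_zero_arr)]  # -> 0, since x[0] == 1 is the smallest non-zero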

Efficiently apply different permutations for each row of a 2D NumPy array [duplicate]

This question already has answers here:
Randomly shuffle items in each row of numpy array
(6 answers)
Closed 3 years ago.
Given a matrix A, I want to apply different random shuffles to different rows of A; for example,
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
becomes
array([[1, 3, 2],
       [6, 5, 4],
       [7, 9, 8]])
Of course we can loop through the matrix and shuffle each row; however, iteration is slow, and I am asking if there is a more efficient way to do this.
Picked up this neat trick from Divakar, which involves randn and argsort: argsorting a row of i.i.d. random numbers yields a uniformly random permutation of that row's indices.
np.random.seed(0)
s = np.arange(16).reshape(4, 4)
np.take_along_axis(s, np.random.randn(*s.shape).argsort(axis=1), axis=1)
array([[ 1,  0,  3,  2],
       [ 4,  6,  5,  7],
       [11, 10,  8,  9],
       [14, 12, 13, 15]])
For a 2D array, this can be simplified to
s[np.arange(len(s))[:,None], np.random.randn(*s.shape).argsort(axis=1)]
array([[ 1,  0,  3,  2],
       [ 4,  6,  5,  7],
       [11, 10,  8,  9],
       [14, 12, 13, 15]])
You can also apply np.random.permutation over each row independently to return a new array.
np.apply_along_axis(np.random.permutation, axis=1, arr=s)
array([[ 3,  1,  0,  2],
       [ 4,  6,  5,  7],
       [ 8,  9, 10, 11],
       [15, 14, 13, 12]])
Performance -
s = np.arange(10000 * 100).reshape(10000, 100)
%timeit s[np.arange(len(s))[:,None], np.random.randn(*s.shape).argsort(axis=1)]
%timeit np.apply_along_axis(np.random.permutation, 1, s)
84.6 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
842 ms ± 8.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I've noticed it depends on the dimensions of your data, so make sure to test it out first.
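If you are on NumPy >= 1.20, the Generator API also offers permuted, which shuffles each slice along the given axis independently; a sketch worth benchmarking against the options above (not part of the original answers):
import numpy as np

rng = np.random.default_rng(0)
s = np.arange(16).reshape(4, 4)
shuffled = rng.permuted(s, axis=1)  # each row permuted independently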
Codewise you can use numpy's apply_along_axis as
np.apply_along_axis(np.random.shuffle, 1, matrix)
but it doesn't seem to be more efficient than iterating, at least for a 3x3 matrix; for that method I get
%%timeit
np.apply_along_axis(np.random.shuffle, 1, test)
67 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
while the iteration gives
%%timeit
for i in range(test.shape[0]):
    np.random.shuffle(test[i])
20.3 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Repeat a 2D NumPy array N times [duplicate]

This question already has answers here:
Create 3D array from a 2D array by replicating/repeating along the first axis
(4 answers)
Closed 4 years ago.
I need to augment (replicate) a 2d array of shape 32X32 to a 3d array of shape 32X32X3 by duplicating the source array. How can I do this in the best possible way?
Below is a sample of the source and expected arrays. I need to apply this logic over a bigger scope in my application.
Source array:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
Expected array:
array([[[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]],

       [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]],

       [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]])
By my tests, np.repeat is a little faster than np.tile:
X = np.repeat(arr[None,:], 3, axis=0)
Alternatively, use np.concatenate:
X = np.concatenate([[arr]] * 3, axis=0)
arr = np.arange(10000 * 1000).reshape(10000, 1000)
%timeit np.repeat(arr[None,:], 3, axis=0)
%timeit np.tile(arr, (3, 1, 1))
%timeit np.concatenate([[arr]] * 3, axis=0)
# Read-only, array cannot be modified.
%timeit np.broadcast_to(arr, (3, *arr.shape))
# Creating copy of the above.
%timeit np.broadcast_to(arr, (3, *arr.shape)).copy()
170 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
187 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
243 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.9 µs ± 218 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
189 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.array_equal(np.repeat(arr[None,:], 3, axis=0),
               np.tile(arr, (3, 1, 1)))
True
Sounds like a job for np.tile:
In [101]: np.tile(A, (3,1,1))
Out[101]:
array([[[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]],

       [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]],

       [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]])
The second argument specifies the number of copies on each dimension.
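For instance (an added illustration, not from the original answer), reps can repeat along several dimensions at once:
In [102]: np.tile(A, (2, 1, 3)).shape  # new leading axis with 2 copies, rows kept, columns tripled
Out[102]: (2, 3, 9)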
If you don't need to modify the result, make use of broadcast_to:
np.broadcast_to(arr, (3, *arr.shape))
Validation using #coldspeed's answer:
arr = np.arange(10000 * 1000).reshape(10000, 1000)
X = np.repeat(arr[None,:], 3, axis=0)
broadcast_x = np.broadcast_to(arr, (3, *arr.shape))
np.array_equal(X, broadcast_x)
True
If you do need to be able to modify, you can call copy() on the result, which should come close to repeat and tile in terms of speed.
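The read-only restriction is enforced at write time; assigning into the broadcast view raises an error, while copy() produces a writable array (a small illustration, not from the original answer):
view = np.broadcast_to(arr, (3, *arr.shape))
view[0, 0, 0] = 1       # ValueError: assignment destination is read-only
writable = view.copy()  # a real, modifiable copy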

Efficient way to compute the probability distribution of a vector of pairs in python

Suppose we have a numpy array v:
import numpy as np
v = np.array([3, 5])
Now we use the code below to find a new vector, say w:
import itertools
v1 = np.array(range(v[0]+1))
v2 = np.array(range(v[1]+1))
w = np.array(list(itertools.product(v1, v2)))
So w looks like this,
array([[0, 0],
[0, 1],
[0, 2],
[0, 3],
[0, 4],
[0, 5],
[1, 0],
[1, 1],
[1, 2],
[1, 3],
[1, 4],
[1, 5],
[2, 0],
[2, 1],
[2, 2],
[2, 3],
[2, 4],
[2, 5],
[3, 0],
[3, 1],
[3, 2],
[3, 3],
[3, 4],
[3, 5]])
Now, we need to find the probability vector corresponding to each pair in w, knowing that the first element of each pair follows a Binomial distribution Bin(v[0], 0.1) and the second element follows a Binomial distribution Bin(v[1], 0.05). One way to do it is with this one-liner:
import scipy.stats as ss
prob_vector=np.array(list((ss.binom.pmf(i[0],v[0], 0.1) * ss.binom.pmf(i[1],v[1], 0.05)) for i in w))
output:
array([5.64086303e-01, 1.48443764e-01, 1.56256594e-02, 8.22403125e-04,
2.16421875e-05, 2.27812500e-07, 1.88028768e-01, 4.94812547e-02,
5.20855312e-03, 2.74134375e-04, 7.21406250e-06, 7.59375000e-08,
2.08920853e-02, 5.49791719e-03, 5.78728125e-04, 3.04593750e-05,
8.01562500e-07, 8.43750000e-09, 7.73780938e-04, 2.03626563e-04,
2.14343750e-05, 1.12812500e-06, 2.96875000e-08, 3.12500000e-10])
But it takes too much time to compute, especially since I am iterating over several v vectors!
Is there an efficient way to compute prob_vector?
Thanks
You're redoing a lot of pmf calls, as well as doing a lot on the Python side instead of the numpy side. We can save those computations by computing on your v1 and v2 arrays, and then multiplying those instead.
import numpy as np
import scipy.stats as ss
import itertools
def orig(x, y):
    v = np.array([x, y])
    v1 = np.array(range(v[0]+1))
    v2 = np.array(range(v[1]+1))
    w = np.array(list(itertools.product(v1, v2)))
    prob_vector = np.array(list((ss.binom.pmf(i[0], v[0], 0.1) * ss.binom.pmf(i[1], v[1], 0.05)) for i in w))
    return prob_vector

def faster(x, y):
    # evaluate each marginal pmf once over its whole support
    b0 = ss.binom.pmf(np.arange(x+1), x, 0.1)
    b1 = ss.binom.pmf(np.arange(y+1), y, 0.05)
    # outer product via broadcasting; ravel() in row-major order
    # matches the itertools.product ordering of w
    prob_array = b0[:, None] * b1
    prob_vector = prob_array.ravel()
    return prob_vector
which gives me:
In [61]: %timeit orig(3, 5)
4.46 ms ± 82.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [62]: %timeit faster(3, 5)
192 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [63]: %timeit orig(30, 50)
311 ms ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [64]: %timeit faster(30, 50)
209 µs ± 8.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [65]: (orig(30, 50) == faster(30, 50)).all()
Out[65]: True
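If a v vector has more than two components, the same idea generalizes: evaluate each marginal pmf once, then chain outer products (a sketch, not from the original answer; faster_nd and ps are hypothetical names):
from functools import reduce

def faster_nd(v, ps):
    # v[i] is the Binomial n-parameter of component i, ps[i] its success probability
    pmfs = [ss.binom.pmf(np.arange(n + 1), n, p) for n, p in zip(v, ps)]
    # chained outer products build the joint pmf over the full grid;
    # ravel() matches the itertools.product ordering of w
    return reduce(np.multiply.outer, pmfs).ravel()

# e.g. faster_nd([3, 5], [0.1, 0.05]) reproduces faster(3, 5)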

how to create an array of specified dimension of specific type initialized with same value in python?

I want to create an array in Python of a specified dimension and specific type, initialized everywhere with the same value. I can create numpy arrays of a specific size, but I am not sure how to initialize them with a specific value. Of course I don't want to use zeros() or ones().
Thanks a lot.
There are lots of ways to do this. The first one-liner that occurred to me is tile:
>>> numpy.tile(2, 25)
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2])
You can tile a value in any shape:
>>> numpy.tile(2, (5, 5))
array([[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2]])
However, as a number of answers below indicate, this isn't the fastest method. It's designed for tiling arrays of any size, not just single values, so if you really just want to fill an array with a single value, then it's much faster to allocate the array first, and then use slice assignment:
>>> a = numpy.empty((5, 5), dtype=int)
>>> a[:] = 2
>>> a
array([[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2],
[2, 2, 2, 2, 2]])
According to a few tests I did, there aren't any faster approaches. However, two of the approaches mentioned in answers below are equally fast: ndarray.fill and numpy.full.
These tests were all done in ipython, using Python 3.6.1 on a newish mac running OS 10.12.6. Definitions:
def fill_tile(value, shape):
    return numpy.tile(value, shape)

def fill_assign(value, shape, dtype):
    new = numpy.empty(shape, dtype=dtype)
    new[:] = value
    return new

def fill_fill(value, shape, dtype):
    new = numpy.empty(shape, dtype=dtype)
    new.fill(value)
    return new

def fill_full(value, shape, dtype):
    return numpy.full(shape, value, dtype=dtype)

def fill_plus(value, shape, dtype):
    new = numpy.zeros(shape, dtype=dtype)
    new += value
    return new

def fill_plus_oneline(value, shape, dtype):
    return numpy.zeros(shape, dtype=dtype) + value

for f in [fill_assign, fill_fill, fill_full, fill_plus, fill_plus_oneline]:
    assert (fill_tile(2, (500, 500)) == f(2, (500, 500), int)).all()
tile is indeed quite slow:
In [3]: %timeit fill_tile(2, (500, 500))
947 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Slice assignment ties with ndarray.fill and numpy.full for first place:
In [4]: %timeit fill_assign(2, (500, 500), int)
102 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit fill_fill(2, (500, 500), int)
102 µs ± 1.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [6]: %timeit fill_full(2, (500, 500), int)
102 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In-place broadcasted addition is only slightly slower:
In [7]: %timeit fill_plus(2, (500, 500), int)
179 µs ± 3.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
And non-in-place broadcasted addition is only slightly slower than that:
In [8]: %timeit fill_plus_oneline(2, (500, 500), int)
213 µs ± 4.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
How about:
shape = (100,100)
val = 3.14
dt = np.float64  # np.float was removed in NumPy 1.24; use float or np.float64
a = np.empty(shape, dtype=dt)
a.fill(val)
This way you can set the shape, dtype, and fill value via parameters. Also, in terms of timings:
In [35]: %timeit a=np.empty(shape,dtype=dt); a.fill(val)
100000 loops, best of 3: 13 us per loop
In [36]: %timeit a=np.tile(val,shape)
10000 loops, best of 3: 102 us per loop
So using empty with fill seems significantly faster than tile.
As of NumPy 1.8, you can use numpy.full() to achieve this.
>>> import numpy as np
>>> np.full((3,4), 100, dtype = int)
array([[ 100, 100, 100, 100],
[ 100, 100, 100, 100],
[ 100, 100, 100, 100]])
Are you looking for something like this?
>>> [3 for x in range(10)]
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
You can pass the resulting list to numpy.array.
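For instance:
>>> import numpy as np
>>> np.array([3 for x in range(10)])
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])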
