Calculate the empirical distribution of a sequence in NumPy?

Suppose A is a (NumPy) array of length M containing integers in 0, 1, ..., N-1. I would like to compute an array c of length N such that c[i] = sum(A == i). A for-loop based solution is obvious, but is there a faster one?
I am also aware of np.histogram but it sounds like a bit of overkill for this problem.

I think I found a solution.
import numpy as np

N = 10      # just an example
M = 10000
A = np.random.randint(0, N, size=M)

# for-based solution
c1 = [sum(A == i) for i in range(N)]

# using numpy unique
c2 = np.zeros(N, dtype=int)
val, count = np.unique(A, return_counts=True)
c2[val] = count

assert all(c2 == c1)
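For this particular pattern (non-negative integers below a known bound N), np.bincount is another common option; a minimal sketch reusing A, N and c1 from above:
# bincount counts occurrences of each value; minlength=N pads the result to length N
c3 = np.bincount(A, minlength=N)
assert all(c3 == c1)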

Related

Efficient way to perform if condition nested in for loop in python

Is there an efficient pythonic way to perform if conditions in nested for loops:
import numpy as np

big = 3
med = 2
small = 5

# placeholders (immediately overwritten below)
mat1 = np.zeros((big, 3))
mat2 = np.zeros((big, med, 3))
mat3 = np.zeros((big, med, small))

mat1 = np.array([[0, 0, 0],
                 [1.0, 0.5, 0.2],
                 [0.2, 0.1, -0.1]])
mat2 = np.array([[[1.0, 0.5, 0.2],
                  [0.1, 0.1, 0.1]],
                 [[0.2, 0.2, 0.2],
                  [1.0, -0.5, -0.2]],
                 [[1.0, -0.5, -0.2],
                  [-1.0, 0.5, -0.2]]])
mat3 = np.array([[[1, 1, 1, 1, 1],
                  [0, 21, 1, 3, 5]],
                 [[1, 2, 3, 4, 5],
                  [-1, -2, -2, -3, -4]],
                 [[1.0, 1.2, 1.3, 1.4, 1.5],
                  [5, 4, 3, 2, 1]]])
sol = np.zeros((small))
for ii in np.arange(big):
    found = False
    for jj in np.arange(big):
        for kk in np.arange(med):
            if all(abs(mat1[ii, :] - mat2[jj, kk, :]) < 1E-8):
                found = True
                sol = mat3[jj, kk, :]
                print(sol)
                break
        if found:
            break
where big and med can be much bigger. The dummy code above works but is very slow. Is there a way to speed it up?
Note: in practice mat1, mat2 and mat3 contain floats (not integers) and are not zeros.
Solution:
The solution for me was the following (greatly benefiting from @LRRR's answer):
for ii in np.arange(big):
    tmp = mat1[ii, :]
    # tile the current row so it matches one (med, 3) block of mat2
    A = np.tile(tmp[:], (med, 1))
    # repeat that block 'big' times to match mat2's full (big, med, 3) shape
    AA = np.repeat(A[np.newaxis, :], big, 0)
    # element-wise match within tolerance
    sub = abs(AA - mat2) < 1E-8
    # rows of mat3 whose corresponding rows of mat2 match tmp completely
    tmp2 = mat3[sub.all(axis=2)]
    if len(tmp2) > 0:
        # keep only the first occurrence of a match
        val = tmp2[0, :]
Note that because I had other complications I kept the outer loop.
The if statement is required because I want the first occurrence of a match.
Also worth noting: this is already significantly faster, but it could probably be made faster still, since we could stop at the first match rather than computing all matches.
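For reference, the search can also be written without an explicit Python loop over jj and kk by broadcasting mat1 against mat2. A rough sketch under the shapes used above (mat1: (big, 3), mat2: (big, med, 3), mat3: (big, med, small)), taking the first match per row of mat1:
# matches[ii, jj, kk] is True when mat1[ii] equals mat2[jj, kk] within tolerance
matches = np.all(np.abs(mat1[:, None, None, :] - mat2[None, :, :, :]) < 1E-8, axis=-1)
for ii in range(big):
    if matches[ii].any():
        # first matching (jj, kk) in row-major order
        jj, kk = np.unravel_index(np.argmax(matches[ii]), (big, med))
        val = mat3[jj, kk, :]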
If I understand correctly, your goal is: for each row of mat1, subtract each row in each matrix of mat2, check whether the resulting vector is zero to within tolerance, and if so use that index to return the corresponding values from mat3?
Here's an example on smaller data:
import numpy as np

np.random.seed(10)
big = 5
med = 3
small = 2
mat1 = np.random.randint(0, 10, (big, 3))
mat2 = np.random.randint(0, 10, (big, med, 3))
mat3 = np.random.randint(0, 10, (big, med, small))

# Row subtractions
A = abs(np.repeat(mat1[:, np.newaxis], med, 1) - mat2) < 1E-8

# Extract from mat3
mat3[A.all(axis=2)]
Breaking it down: mat1[:, np.newaxis] adds another dimension, and np.repeat() duplicates each row along it, so mat1 and mat2 line up for a simple element-wise subtraction.
Note: with the randomly generated demo data above, the condition abs(mat1[ii, :] - mat2[jj, kk, :]) < 1E-8 is essentially never satisfied, because exact matches between random rows are unlikely; the update below uses the actual data from the question, where a match does exist.
Update:
Here's the redo using the new data added to the original post:
# Repeat each row of mat1 for the rows in mat2
A = np.repeat(mat1, big * med, 0)
# Reshape mat2 to match matrix A
B = mat2.reshape(big * med, 3)
C = np.tile(B, (big, 1))
# Subtract rows within tolerance
sub = abs(A - C) < 1E-8
# Find matching values from tiled mat2
values = C[sub.all(axis=1)]
# Get indices on reshaped mat2
indices = np.all(B == values, axis=1)
# Reshape mat3
M = mat3.reshape(big * med, small)
# Result
M[indices]
output: array([[1., 1., 1., 1., 1.]])

SKlearn Minimum Covariance Determinant (MCD) Function yields different results if applied to whole data array vs looped

I have a repeated experiment (n = K) that measures time series of equal length N, i.e. my data matrix has shape N x K. I now want to compute a robust estimate of the covariance between the experiments, for which I use the Minimum Covariance Determinant (MCD) algorithm implemented in scikit-learn.
One way to apply the algorithm is to directly apply the function to the data array D, i.e.:
import numpy as np
from sklearn.covariance import MinCovDet
N = 300 #number of rows
K = 40 #number of columns
D = np.random.normal(0, 1, size=(N, K)) #create random Data
mcd = MinCovDet().fit(D) #yields a KxK matrix
cov_mat = mcd.covariance_ #covariances between the columns
Another way is to loop over the experiments:
cov_loop = np.zeros((K, K))
for i in range(0, K):
    for j in range(i, K):
        temp_arr = np.zeros((N, 2))
        temp_arr[:, 0] = D[:, i]
        temp_arr[:, 1] = D[:, j]
        mcd_temp = MinCovDet().fit(temp_arr)
        cov_temp = mcd_temp.covariance_  # 2x2 matrix; we only need the [0, 1] element
        cov_loop[i, j] = cov_temp[0, 1]
        cov_loop[j, i] = cov_loop[i, j]
print(cov_loop / cov_mat)
The results differ significantly, which is why I wanted to ask what went wrong here.
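As a point of comparison, the classical covariance estimate is identical whether it is computed on all columns at once or pair by pair, since cov(D)[i, j] depends only on columns i and j; a small sketch reusing D from above:
cov_full = np.cov(D, rowvar=False)             # K x K, all columns jointly
cov_pair = np.cov(D[:, [0, 1]], rowvar=False)  # 2 x 2, columns 0 and 1 only
print(np.allclose(cov_full[0, 1], cov_pair[0, 1]))  # True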

Sparse Scipy/Numpy: an efficient way to implement sum of pairwise mins operation

Computing the sum of pairwise mins between vectors is very popular in natural language processing (NLP), where it is used to compute the histogram intersection kernel [1]. However, in NLP we frequently deal with sparse matrices.
Here is an inefficient way, using slow for loops, to compute this operation:
import numpy as np
from scipy.sparse import csr_matrix

# Initialize sparse matrices
A = csr_matrix(np.clip(np.random.randn(100, 64) - 1, 0, np.inf))
B = csr_matrix(np.clip(np.random.randn(64, 100) - 1, 0, np.inf))

# For each row vector i in A and column vector j in B
G = np.zeros((100, 100))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        G[i, j] = A[i].minimum(B[:, j]).sum()
Is there a way to do this without the for loops?
I wouldn't mind a for loop if it can be compiled, for example with numba's jit.
A fast dense version of this is given here: Numpy: an efficient way to implement sum of pairwise mins operation
Thanks.
[1] http://blog.datadive.net/histogram-intersection-for-change-detection/
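For context, the dense version linked above (also used as the correctness check in the answer below) boils down to a single broadcasted expression. A sketch on small dense arrays, simple but memory-hungry since it materialises the full three-dimensional intermediate:
import numpy as np
A_dense = np.clip(np.random.randn(100, 64) - 1, 0, np.inf)
B_dense = np.clip(np.random.randn(64, 100) - 1, 0, np.inf)
# G_dense[i, j] = sum over k of min(A_dense[i, k], B_dense[k, j]);
# the broadcast intermediate has shape (100, 64, 100)
G_dense = np.minimum(A_dense[:, :, None], B_dense[None, :, :]).sum(axis=1)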
Here is an implementation that should be reasonably efficient, leveraging sparseness as best it can. There is a loop, but only along one dimension (the shared axis of length N), so it should not be too bad. The key observation is that min(a, b) can only be nonzero if at least one of a and b is nonzero: it is min(a, b) when both are nonzero, and it is a when a < 0 and b = 0 (and symmetrically for b); those are exactly the cases handled inside the loop.
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

M, N, K = 640, 100, 650

# test data with both positive and negative entries
B1 = csr_matrix(np.clip(np.random.randn(N, K) - 1, 0, np.inf))
B2 = csr_matrix(np.clip(np.random.randn(N, K) - 1, 0, np.inf))
B = B1 - B2
A1 = csc_matrix(np.clip(np.random.randn(M, N) - 1, 0, np.inf))
A2 = csc_matrix(np.clip(np.random.randn(M, N) - 1, 0, np.inf))
A = A1 - A2

result = np.zeros((M, K))
for j in range(N):
    # nonzero entries of column j of A (csc) and row j of B (csr)
    ia = A.indices[A.indptr[j] : A.indptr[j+1]]
    ib = B.indices[B.indptr[j] : B.indptr[j+1]]
    IA, IB = np.ix_(ia, ib)
    da = A.data[A.indptr[j] : A.indptr[j+1]]
    db = B.data[B.indptr[j] : B.indptr[j+1]]
    # both nonzero
    result[IA, IB] += np.minimum.outer(da, db)
    # one negative ...
    am = da < 0
    iam, dam = ia[am], da[am]
    bm = db < 0
    ibm, dbm = ib[bm], db[bm]
    # ... the other zero
    za = np.ones((M,), dtype=bool)
    za[ia] = False
    zb = np.ones((K,), dtype=bool)
    zb[ib] = False
    IA, IB = np.ix_(iam, zb)
    result[IA, IB] += dam[:, None]
    IA, IB = np.ix_(za, ibm)
    result[IA, IB] += dbm

# compare with dense method
print(np.allclose(result, np.minimum(A.A[..., None], B.A).sum(axis=1)))
Prints
True
Well, at least in recent versions of SciPy there is a function scipy.sparse.csr_matrix.minimum (see the SciPy documentation), which is the equivalent of numpy.minimum in terms of element-wise minimum. However, I don't know how computationally efficient it is.

Filter array, store adjacency information

Let's say I have a 2D array of shape (N, N):
import numpy as np
N = 100  # for example
my_array = np.random.random((N, N))
Now I want to do some computations only on some "cells" of this array, for instance the ones inside the central part of the array. To avoid doing computations on cells I'm not interested in, what I usually do here is create a Boolean mask, in this spirit:
my_mask = np.zeros_like(my_array, bool)
my_mask[40:61,40:61] = True
my_array[my_mask] = some_twisted_computations(my_array[my_mask])
But what if some_twisted_computations() involves values of the neighboring cells when they are inside the mask? Performance-wise, would it be a good idea to create an "adjacency array" of shape (len(my_mask), 4), storing the index of the 4-connected neighbor cells in the flat my_array[mask] array, which I would then use inside some_twisted_computations()? If yes, what are the efficient options for computing such an adjacency array? Should I switch to a lower-level language or to other data structures?
My real-world arrays have shapes around (1000, 1000, 1000), the mask concerns only a small subset (~100000) of these values, and it has a rather complex geometry. I hope my questions make sense...
EDIT: the very dirty and slow solution I've worked out:
wall = mask
i = 0
top_neighbors = []
down_neighbors = []
left_neighbors = []
right_neighbors = []
indices = []
for index, val in np.ndenumerate(wall):
    if not val:
        continue
    indices += [index]
    if wall[index[0] + 1, index[1]]:
        down_neighbors += [(index[0] + 1, index[1])]
    else:
        down_neighbors += [i]
    if wall[index[0] - 1, index[1]]:
        top_neighbors += [(index[0] - 1, index[1])]
    else:
        top_neighbors += [i]
    if wall[index[0], index[1] - 1]:
        left_neighbors += [(index[0], index[1] - 1)]
    else:
        left_neighbors += [i]
    if wall[index[0], index[1] + 1]:
        right_neighbors += [(index[0], index[1] + 1)]
    else:
        right_neighbors += [i]
    i += 1

top_neighbors = [i if type(i) is int else indices.index(i) for i in top_neighbors]
down_neighbors = [i if type(i) is int else indices.index(i) for i in down_neighbors]
left_neighbors = [i if type(i) is int else indices.index(i) for i in left_neighbors]
right_neighbors = [i if type(i) is int else indices.index(i) for i in right_neighbors]
The best answer will probably depend on the nature of the computations you want to do. For example, if they can be expressed as summations over neighboring pixels, then something like np.convolve or scipy.signal.fftconvolve can be a really nice solution.
For your specific question of efficiently generating arrays of neighbor indices, you might try something like this:
x = np.random.rand(100, 100)
mask = x > 0.9
i, j = np.where(mask)
i_neighbors = i[:, np.newaxis] + [0, 0, -1, 1]
j_neighbors = j[:, np.newaxis] + [-1, 1, 0, 0]
# need to do something with the edge cases
# the best choice will depend on your application
# here we'll change out-of-bounds neighbors to the
# central point itself.
i_neighbors = np.clip(i_neighbors, 0, 99)
j_neighbors = np.clip(j_neighbors, 0, 99)
# compute some vectorized result over the neighbors
# as a concrete example, here we'll do a standard deviation
result = x[i_neighbors, j_neighbors].std(axis=1)
The result is an array of values corresponding to the masked region, containing the standard deviation of neighboring values.
Hopefully that approach will work for whatever specific problem you have in mind!
Edit: given the edited question above, here's how my response can be adapted to generate arrays of indices in a vectorized manner:
x = np.random.rand(100, 100)
mask = x > -0.9
i, j = np.where(mask)
i_neighbors = i[:, np.newaxis] + [0, 0, -1, 1]
j_neighbors = j[:, np.newaxis] + [-1, 1, 0, 0]
i_neighbors = np.clip(i_neighbors, 0, 99)
j_neighbors = np.clip(j_neighbors, 0, 99)

# flat index (within the masked array) of every masked cell
indices = np.zeros(x.shape, dtype=int)
indices[mask] = np.arange(len(i))

# for each masked cell: its neighbors' flat indices if they are in the mask,
# otherwise the cell's own flat index
neighbor_in_mask = mask[i_neighbors, j_neighbors]
neighbors = np.where(neighbor_in_mask,
                     indices[i_neighbors, j_neighbors],
                     np.arange(len(i))[:, None])
left_indices, right_indices, top_indices, bottom_indices = neighbors.T
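Each row of neighbors then indexes back into the flattened masked array, so neighborhood operations can be done directly on the masked values. A small usage sketch, reusing x, mask and neighbors from above (out-of-mask neighbors fall back to the central point, as constructed):
masked_values = x[mask]
# 4-neighbor mean for every masked cell, fully vectorized
neighbor_means = masked_values[neighbors].mean(axis=1)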

Compute weighted sums on rolling window with pandas dataframes of different length

I have a large DataFrame (> 5,000,000 rows) on which I am performing a rolling calculation:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 1), columns=['rand'])
sum_abs = df.rolling(5).sum()
I would like to do the same calculation but add in a weighted sum:
df2 = pd.DataFrame(pd.Series([1, 2, 3, 4, 5], name='weight'))
df3 = df.mul(df2.set_index(df.index)).rolling(5).sum()
However, I am getting a "Length mismatch: Expected axis has 5 elements" error.
I know I could do something like [a * b for a, b in zip(L, weight)] if I converted everything to lists, but I would like to keep it in a DataFrame if possible. Is there a way to multiply frames of different sizes, or do I need to repeat the weights to the length of the dataset I'm multiplying against?
An easy way to do this is:
w = np.arange(1, 6)
df.rolling(5).apply(lambda x: (x * w).sum())
A less easy way, using strides:
from numpy.lib.stride_tricks import as_strided as strided

v = df.values
n, m = v.shape
s1, s2 = v.strides
k = 5
w = np.arange(1, 6).reshape(1, 1, k)

pd.DataFrame(
    (strided(v, (n - k + 1, m, k), (s1, s2, s1)) * w).sum(-1),
    df.index[k - 1:], df.columns)
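A quick consistency check, reusing v, n, m, k, s1, s2, w and strided from above: the first k - 1 rows of the rolling result are NaN, and the remaining rows should match the strided computation.
w1 = np.arange(1, 6)
out_apply = df.rolling(5).apply(lambda x: (x * w1).sum())
out_strided = (strided(v, (n - k + 1, m, k), (s1, s2, s1)) * w).sum(-1)
print(np.allclose(out_apply.values[k - 1:], out_strided))  # True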
