Numpy generate binary array with at most N ones - python

I can easily create a random binary array in Numpy by doing this
random_mask = np.random.randint(0,2, (r, c))
But what I actually want is to set the maximum number of 1s that can appear in the array.
For example, if I want a 5,5 binary matrix, I want there to be at most 10 ones randomly placed throughout the matrix, and the rest are 0s.
I was thinking of an approach where I generate the random array as normal, count the number of 1s currently placed, and somehow remove the ones I don't need.
I'm wondering if there's already a way to do this in NumPy.

This is the most basic approach I could think of:
import numpy as np

def binary_mask_random(r, c, n):
    # flat array of r*c zeros
    a = np.zeros((r, c)).flatten()
    # draw a count in [0, n] and set that many (not necessarily distinct) positions
    for _ in range(np.random.randint(0, n + 1)):
        x = np.random.randint(0, r * c)
        a[x] = 1
    return a.reshape((r, c))
It creates a flat array of r*c zeros and fills it with up to n 1s at random positions; because the same position can be drawn twice, the final count may be lower than the drawn number. It returns the result reshaped to r x c.
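For reference, here is a vectorized sketch of the same idea (my own variant, not from the original post). Drawing the positions without replacement guarantees that the drawn count is placed exactly, unlike the loop above where positions can collide:

import numpy as np

def binary_mask_random_vec(r, c, n):
    """Return an r x c binary array containing at most n ones."""
    k = np.random.randint(0, n + 1)  # how many ones to place, 0..n
    mask = np.zeros(r * c, dtype=int)
    # pick k distinct flat positions, so exactly k ones land in the mask
    mask[np.random.choice(r * c, size=k, replace=False)] = 1
    return mask.reshape(r, c)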

Related

If a NumPy array is a vector, why is the addition of a single number possible?

In math you can't add a single number to a vector (scalar + vector is undefined); you can only multiply (scalar * vector = vector). As far as I learned, arrays are one-dimensional vectors, so shouldn't it be impossible to add a single number to one?
>>> x = np.array([1, 2, 3])
>>> x += 1
>>> x
array([2, 3, 4])
Shouldn't this be impossible?
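For context, this is legal in NumPy because of broadcasting: the scalar is conceptually stretched to the array's shape before the element-wise operation, so arrays deliberately relax the strict linear-algebra rule. A minimal demonstration:

import numpy as np

x = np.array([1, 2, 3])
x += 1       # broadcast: behaves like x + np.array([1, 1, 1])
print(x)     # [2 3 4]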

In Python, how can I distribute a set of numbers randomly onto a grid/matrix?

I have the following problem:
I want to generate a 100x100 grid (numpy.ndarray) by filling it with numbers out of a given list ([-1,0,1,2]), distributed randomly on the grid. The numbers must also maintain the following ratios: the number 0 must occupy 10% of the grid, while the remaining numbers have a 30% ratio each, so their sum equals 100%.

Using np.random.choice() I was able to generate random numbers, each distributed with the associated probabilities. However, I run into problems because I have to make sure that the number 0 makes up exactly 10% of the entire grid, and the non-zero numbers exactly 30% each. Using np.random.choice(), this is not always the case (especially if the sample size is small), because I have only assigned probabilities, not ratios:
import numpy as np
numbers = np.random.choice([-1,0,1,2],(100,100),p=[0.3,0.1,0.3,0.3])
print(np.count_nonzero(numbers == 0) / numbers.size)  # must be = 0.1 always, but it varies!
Another idea I had was to initially set the entire matrix as np.zeros((100,100)) and then fill up only 90% of it with non-zero elements; however, I don't know how to approach this problem such that the numbers end up at random locations/indices on the grid.
Edit: The ratio of each individual non-zero number in the grid depends only on how many cells I want to be empty (0 in this case); all non-zero elements must have the same ratio. For example, if I want 20% of the grid to be zeros, the remaining numbers will each have a ratio of (1 - ratio_of_zero)/amount_of_non-zero_elements.
This should do what you want it to (suggested by RemcoGerlich), though I don't know how efficient this method is:
import numpy as np
# Constants
SHAPE = (100, 100)
LENGTH = SHAPE[0] * SHAPE[1]
REST = [-1, 1, 2]
ZERO_PROB = 10
BASE_PROB = (100 - ZERO_PROB) // len(REST)
NUM_ZERO = round(LENGTH * (ZERO_PROB / 100))
NUM_REST = round(LENGTH * (BASE_PROB / 100))
# Make base 1D array
base_arr = [0 for _ in range(NUM_ZERO)]
for i in REST:
    base_arr += [i for _ in range(NUM_REST)]
base_arr = np.array(base_arr)
# Give it a random order
np.random.shuffle(base_arr)
# Finally, reshape the array
result_arr = base_arr.reshape(SHAPE)
Looking at your comment: how much flexibility you need depends on how many of the numbers are to have different probabilities, I suppose. You could have a for loop that builds an array of the right length for each value and appends it to base_arr. Also, this can of course be a function you pass variables into, rather than a script with hard-coded constants like this.
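For instance, a function version might look like this (a sketch; random_grid, values, and zero_ratio are names I am introducing here, not from the original answer):

import numpy as np

def random_grid(shape, values, zero_ratio):
    """Randomly place zeros (zero_ratio of cells) and the non-zero
    values, in equal shares, on a grid of the given shape."""
    length = shape[0] * shape[1]
    num_zero = round(length * zero_ratio)
    num_each = (length - num_zero) // len(values)
    base = [0] * num_zero
    for v in values:
        base += [v] * num_each
    base += [0] * (length - len(base))  # pad if integer division left a gap
    base = np.array(base)
    np.random.shuffle(base)
    return base.reshape(shape)

result_arr = random_grid((100, 100), [-1, 1, 2], 0.1)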
Edited based on comment.

Constructing Sparse CSR Matrix Directly vs Using COO tocsr() - Scipy

My goal here is to build the sparse CSR matrix very fast. It is currently the main bottleneck in my process, and I've already optimized it by constructing the coo matrix relatively fast, and then using tocsr().
However, I would imagine that constructing the csr matrix directly must be faster?
I have a very specific format of sparse matrix that is also large (i.e., on the order of 100000x50000). I've looked at other answers online, but most do not address the question I have.
Efficiently construct FEM/FVM matrix
This looks at constructing a very specifically formatted sparse matrix vs using COO, and led to a merged scipy improvement in the speed of tocsr().
Sparse Matrix Structure:
The sparse matrix, H, is composed of W lists of size N, or equivalently built from an initial array of size NxW; let's call it A. Along the diagonal, each list of size N repeats for N rows. So the first N rows of H contain A[:,0] repeated, sliding along N steps for each row.
Comparison to COO.tocsr()
When I scale up N and W, building the COO matrix first and then running tocsr() is actually faster than building the CSR matrix directly. I'm not sure why this would be the case. I am wondering if I can perhaps take advantage of the structure of my sparse matrix H in some way, since it contains many repeating elements.
Code Sample
Here is a code sample to visualize what is going on for a small sample size:
from scipy.sparse import linalg, dok_matrix, coo_matrix, csr_matrix
import numpy as np
import matplotlib.pyplot as plt

def test_csr(testdata):
    indices = [x for _ in range(W-1) for x in range(N**2)]
    ptrs = [N*i for i in range(N*(W-1))]
    ptrs.append(len(indices))
    data = []
    # loop along the first axis
    for i in range(W-1):
        vec = testdata[:, i].squeeze()
        # repeat the column vector N times
        for _ in range(N):
            data.extend(vec)
    Hshape = (N*(W-1), N**2)
    H = csr_matrix((data, indices, ptrs), shape=Hshape)
    return H

N = 4
W = 8
testdata = np.random.randn(N, W)
print(testdata.shape)
H = test_csr(testdata)
plt.imshow(H.toarray(), cmap='jet')
plt.show()
It looks like your output uses only the first W-1 columns of testdata. I'm not sure whether this is intentional; my solution assumes you want to use all of testdata.
When you construct the COO matrix, are you also constructing the indices and data in a similar way?
One thing that might speed up constructing the csr_matrix is to use built-in NumPy functions to generate its data rather than Python loops and lists. I would expect this to significantly improve the speed of generating the indices. You can adjust the dtype to a different integer type depending on the size of your matrix.
N = 4
W = 8
testdata = np.random.randn(N, W)

# each row of H holds N entries, so the index pointers step by N
ptrs = N*np.arange(W*N + 1, dtype='int')
# row k occupies column block (k mod N)
indices = np.tile(np.arange(N*N, dtype='int'), W)
# row k holds column (k // N) of testdata, i.e. each column repeated N times
# (note the transpose, so the flattened tile lines up column-wise)
data = np.tile(testdata.T, N).flatten()
Hshape = (N*W, N**2)
H = csr_matrix((data, indices, ptrs), shape=Hshape)
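A quick way to eyeball the result (my addition, reusing the H built above): the sparsity pattern should show blocks of N entries stepping one column block to the right on each row, wrapping every N rows.

import matplotlib.pyplot as plt
plt.spy(H, markersize=4)  # visualize the block-diagonal sliding structure
plt.show()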
Another possibility is to first construct the sparse matrix at full size, and then assign each of the N vertical column blocks at once. This means you don't need to make a ton of copies of the original data before putting it into the sparse matrix. However, converting the matrix type afterwards may be slow.
from scipy.sparse import lil_matrix

N = 4
W = 8
testdata = np.random.randn(N, W)
Hshape = (N*W, N**2)
H = lil_matrix(Hshape)
for j in range(N):
    # rows j, N+j, 2N+j, ... get the columns of testdata in column block j
    H[N*np.arange(W) + j, N*j:N*(j+1)] = testdata.T
H = H.tocsr()
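As a quick sanity check (my addition, not part of the original answer), you can verify that the two constructions above produce the same matrix:

from scipy.sparse import csr_matrix, lil_matrix
import numpy as np

N, W = 4, 8
testdata = np.random.randn(N, W)

# vectorized direct-CSR construction
ptrs = N*np.arange(W*N + 1)
indices = np.tile(np.arange(N*N), W)
data = np.tile(testdata.T, N).flatten()
H1 = csr_matrix((data, indices, ptrs), shape=(N*W, N**2))

# LIL construction, block by block
H2 = lil_matrix((N*W, N**2))
for j in range(N):
    H2[N*np.arange(W) + j, N*j:N*(j+1)] = testdata.T
H2 = H2.tocsr()

print((H1 != H2).nnz == 0)  # True: the matrices are identical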

Efficient NumPy rows rotation over variable distances

Given a 2D M x N NumPy array and a list of rotation distances, I want to rotate all M rows over the distances in the list. This is what I currently have:
import numpy as np
M = 6
N = 8
dists = [2,0,2,1,4,2] # for example
matrix = np.random.randint(0,2,(M,N))
for i in range(M):
    matrix[i] = np.roll(matrix[i], -dists[i])
The last two lines are actually part of an inner loop that gets executed hundreds of thousands of times, and cProfile shows it is bottlenecking my performance. Is it possible to avoid the for-loop and do this more efficiently?
We can simulate the rolling behaviour with a modulus operation: adding dists to a range(0...N) array gives, for each row, the column indices from which elements are to be picked and shuffled within the same row. We can vectorize this process across all rows with the help of broadcasting. Thus, we would have an implementation like so -
dists = np.asarray(dists)  # must be an array for the broadcasting step below
M, N = matrix.shape        # store matrix shape
# get column indices of all elements of the rolled version with the modulus operation
col_idx = np.mod(np.arange(N) + dists[:, None], N)
# index into matrix with ranged row indices and col indices to get the final output
out = matrix[np.arange(M)[:, None], col_idx]
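A quick check that the vectorized version matches the np.roll loop (note again that dists must be a NumPy array for the dists[:, None] broadcasting step):

import numpy as np

M, N = 6, 8
dists = np.array([2, 0, 2, 1, 4, 2])
matrix = np.random.randint(0, 2, (M, N))

# reference: row-by-row np.roll
expected = np.array([np.roll(matrix[i], -dists[i]) for i in range(M)])

# vectorized version from the answer above
col_idx = np.mod(np.arange(N) + dists[:, None], N)
out = matrix[np.arange(M)[:, None], col_idx]

print(np.array_equal(out, expected))  # True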

masking operation in python and numpy

I am multiplying two n-d arrays using a set of indices that I obtained from numpy meshgrid. So, I have two n-d arrays called current and previous, a set of indices p and q, and something like the following:
time_lag = numpy.sum(current[p] * previous[q])
Now, I have another n-d array filled with only 0s and 1s, and what I want is for this multiplication to happen only at the indices where the mask is set to 1. So, currently I am just doing:
time_lag = numpy.sum(current[p] * previous[q] * mask[p])
This zeros out the regions I do not want, so they do not contribute to the sum. I wonder if there is a better way to do the multiplication only in the masked regions?
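One alternative is boolean indexing, which multiplies only the selected positions instead of zeroing the rest afterwards (a sketch with toy stand-ins for the arrays and index sets; whether it is actually faster depends on how sparse the mask is):

import numpy as np

# hypothetical stand-ins for current, previous, p, q and mask
current = np.random.randn(4, 4)
previous = np.random.randn(4, 4)
mask = np.random.randint(0, 2, (4, 4))
ii, jj = np.meshgrid(np.arange(4), np.arange(4), indexing='ij')
p = (ii, jj)
q = (jj, ii)

keep = mask[p].astype(bool)  # True where the 0/1 mask is set
# multiply only the kept positions
time_lag = np.sum(current[p][keep] * previous[q][keep])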
