How to randomly sample a matrix in python?

I have a relatively large array, e.g. 200 x 1000. The matrix is sparse, and its elements can be considered weights. The weights range from 0 to 500. I would like to generate a new array of the same size, 200 x 1000, containing only the integers 0 and 1, in which N of the elements are randomly chosen to be 1. The probability of an element in the new matrix being 1 rather than 0 is proportional to the weights from the original array: the higher the weight, the higher the probability of 1 versus 0.
Stated another way: I would like to generate a zero matrix of size 200x1000 and then randomly choose N elements to flip to 1 based on a 200x1000 matrix of weights.

I'll throw my proposed solution in here as well:
# for example
import numpy as np
a = np.random.randint(0, 501, size=(200,1000))  # random_integers is deprecated; randint excludes the upper bound
N = 200
result = np.zeros((200,1000))
ia = np.arange(result.size)
tw = float(np.sum(a.ravel()))
result.ravel()[np.random.choice(ia, p=a.ravel()/tw,
                                size=N, replace=False)]=1
where a is the array of weights: that is, pick the indexes of the items to change to 1 from the array ia, weighted by a.

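This can be done with numpy as follows, where weights is the array of weights and N is the number of elements to set to 1:
import numpy
# Compute probabilities as a 1-D array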
probs = numpy.float64(weights).ravel()
probs /= numpy.sum(probs)
# Pick winner indexes
winners = numpy.random.choice(len(probs), N, False, probs)
# Build result
result = numpy.zeros(weights.shape, numpy.uint8)
result.ravel()[winners] = 1
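
The weighted pick described above can also be sketched with NumPy's newer Generator API. This is only an illustration under stated assumptions: the function name sample_binary_mask, the seeds, and the randomly generated weights are made up for the example and are not part of the original answers.

```python
import numpy as np

def sample_binary_mask(weights, n, seed=None):
    """Return a 0/1 array shaped like `weights` with exactly n ones.

    Hypothetical helper: elements are chosen without replacement,
    with probability proportional to their weight.
    """
    rng = np.random.default_rng(seed)
    flat = np.asarray(weights, dtype=np.float64).ravel()
    probs = flat / flat.sum()
    # Draw n distinct flat indices, weighted by probs.
    winners = rng.choice(flat.size, size=n, replace=False, p=probs)
    mask = np.zeros(flat.size, dtype=np.uint8)
    mask[winners] = 1
    return mask.reshape(np.shape(weights))

# Stand-in weights in the range 0..500 (assumed data, not from the question)
weights = np.random.default_rng(0).integers(0, 501, size=(200, 1000))
result = sample_binary_mask(weights, n=200, seed=1)
```

Elements with weight 0 have probability 0 and are never selected, so at least n weights must be non-zero for the draw to succeed.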

Something like this should work; there is no reason to get too complicated:
>>> import random
>>> weights = [[1,5],[500,0]]
>>> output = []
>>> for row in weights:
...     outRow = []
...     for entry in row:
...         outRow.append(random.choice([0]+[1 for _ in range(entry)]))
...     output.append(outRow)
...
>>> output
[[1, 1], [1, 0]]
This chooses a random entry from a sequence that always has a single zero followed by n 1s, where n is the corresponding entry in your weight matrix. In this implementation, a weight of 1 gives a 50/50 chance of either a 1 or a 0. If you want the 50/50 chance to occur at a weight of 250, use outRow.append(random.choice([0 for _ in range(500-entry)] + [1 for _ in range(entry)]))
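
For larger matrices, the same per-element rule can be sketched without Python loops. The snippet below assumes the first variant of the rule (one 0 and n 1s, so the probability of a 1 is n/(n+1)); the variable name p_one and the seed are illustrative. Like the loop above, it decides each element independently, so the total number of ones is not fixed at N.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([[1, 5], [500, 0]])

# P(1) = w / (w + 1), matching a list holding one 0 and w ones.
p_one = weights / (weights + 1.0)

# One uniform draw per element; the element becomes 1 when the draw falls below p_one.
output = (rng.random(weights.shape) < p_one).astype(int)
```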

Related

randomly sampling arrays - issue with numpy.delete

I have 2 arrays, x_1g and x_2g. I want to randomly sample 10% of each array and remove that 10% and insert it into the other array. This means that my final and initial arrays should have the same shape, but 10% of the data is randomly sampled from the other array. I have been trying this with the code below but my arrays keep increasing in length, meaning I haven't properly deleted the sampled 10% data from each array.
n = len(x_1g)
n2 = round(n/10)
ints1 = np.random.choice(n, n2)
x_1_replace = x_1g[ints1,:]
x_1 = np.delete(x_1g, ints1, 0)
x_2_replace = x_2g[ints1,:]
x_2 = np.delete(x_2g, ints1, 0)
My arrays x_1g and x_2g have shapes (1502983, 10)
x_1g.shape
>> (1502983, 10)
x_1_replace.shape
>> (150298, 10)
so when I remove the 10% data (x_1_replace) from my original array (x_1g) I should get the array shape:
1502983-150298 = 1352685
However when I check the shape of my array x_1 I get:
x_1.shape
>> (1359941, 10)
I'm not sure what is going on here so if anyone has any suggestions please let me know!!

What happens is that by using ints1 = np.random.choice(n, n2) to generate your indices, you draw n2 numbers between 0 and n-1 with replacement. Nothing guarantees that the n2 numbers are all different, so you most likely generate some duplicates. If you pass the same index position to np.delete several times, that row is deleted only once. You can check this by counting the unique values in ints1:
np.unique(ints1).shape
You'll see that it does not match n2 (in your example, you'll get (143042,)).
There is probably more than one way to ensure that you get n2 different indices; here is one example:
n = len(x_1g)
n2 = round(n/10)
ints1 = np.arange(n) # generating an array [0 ... n-1]
np.random.shuffle(ints1) # shuffle it
ints1 = ints1[:n2] # take the first n2 values
x_1_replace = x_1g[ints1,:]
x_1 = np.delete(x_1g, ints1, 0)
x_2_replace = x_2g[ints1,:]
x_2 = np.delete(x_2g, ints1, 0)
Now you can check:
x_1.shape
# (1352685, 10)
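
As a possible alternative to shuffling, np.random.Generator.choice with replace=False also returns distinct indices. The sketch below goes one step further and completes the swap described in the question. The small stand-in arrays, the seed, and the use of separate index sets idx1 and idx2 for the two arrays are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x_1g = rng.normal(size=(1000, 10))      # stand-in data
x_2g = rng.normal(size=(1000, 10)) + 5  # stand-in data

n2 = round(len(x_1g) / 10)
# replace=False guarantees n2 distinct row indices
idx1 = rng.choice(len(x_1g), size=n2, replace=False)
idx2 = rng.choice(len(x_2g), size=n2, replace=False)

# Remove the sampled rows from each array and append the rows sampled from the other.
x_1 = np.concatenate([np.delete(x_1g, idx1, axis=0), x_2g[idx2]])
x_2 = np.concatenate([np.delete(x_2g, idx2, axis=0), x_1g[idx1]])
```

Because every index is unique, np.delete removes exactly n2 rows, and both arrays keep their original shape after the swap.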

Numpy median-of-means computation across unequal-sized array

Assume a numpy array X of shape m x n and type float64. The rows of X need to pass through an element-wise median-of-means computation. Specifically, the m row indices are partitioned into b "buckets", each containing m/b such indices. Next, within each bucket I compute the mean and across the resulting means I do a final median computation.
An example that clarifies it is
import numpy as np
m = 10
n = 10000
# A random data matrix
X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)
# Number of buckets to split rows into
b = 5
# Partition the rows of X into b buckets
row_indices = np.arange(X.shape[0])
buckets = np.array(np.array_split(row_indices, b))
X_bucketed = X[buckets, :]
# Compute the mean within each bucket
bucket_means = np.mean(X_bucketed, axis=1)
# Compute the median-of-means
median = np.median(bucket_means, axis=0)
# Edit - Method 2 (based on answer)
np.random.shuffle(row_indices)
X = X[row_indices, :]
buckets2 = np.array_split(X, b, axis=0)
bucket_means2 = [np.mean(x, axis=0) for x in buckets2]
median2 = np.median(np.array(bucket_means2), axis=0)
This program works fine if b divides m since np.array_split() results in partitioning the indices in equal parts and array buckets is a 2D array.
However, it does not work if b does not divide m. In that case, np.array_split() still splits into b buckets, but of unequal sizes, which is fine for my purposes. For example, if b = 3 it will split the indices {0,1,...,9} into [0 1 2 3], [4 5 6] and [7 8 9]. Those arrays cannot be stacked onto one another, so buckets is not a 2D array and cannot be used to index X.
How can I make this work for unequal-sized buckets, i.e., to have the program compute the mean within each bucket (irrespective of its size) and then the median across the buckets?
I cannot fully grasp masked arrays and I am not sure if those can be used here.

You can compute each bucket's mean separately, then stack the means and compute the median. You can also apply array_split to X directly; there is no need to index it with a split index array (maybe this was your main question?).
import numpy as np
m = 11
n = 10000
# A random data matrix
X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)
# Number of buckets to split rows into
b = 5
# Partition the rows of X into b buckets
buckets = np.array_split(X, b, axis = 0)
# Compute the mean within each bucket
b_means = [np.mean(x, axis=0) for x in buckets]
# Compute the median-of-means
median = np.median(np.array(b_means), axis=0)
print(median) #(10000,) shaped array
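
The approach above can be wrapped in a small reusable function. This is a minimal sketch: the name median_of_means, the optional rng argument, and the row shuffle (borrowed from the question's second method) are choices made for illustration.

```python
import numpy as np

def median_of_means(X, b, rng=None):
    """Element-wise median of the b bucket means of the rows of X.

    Buckets come from np.array_split, so their sizes may differ
    by one row when b does not divide the number of rows.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Shuffle the rows so that bucket membership is random.
    shuffled = X[rng.permutation(X.shape[0])]
    bucket_means = [chunk.mean(axis=0)
                    for chunk in np.array_split(shuffled, b, axis=0)]
    return np.median(np.stack(bucket_means), axis=0)

X = np.random.default_rng(0).uniform(size=(10, 10000))
mom = median_of_means(X, b=3, rng=np.random.default_rng(1))
```

With b=1 this reduces to the plain column mean, and with b equal to the number of rows it reduces to the plain column median, which gives two simple sanity checks.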

How to randomly set elements in numpy array to 0

First I create my array
myarray = np.random.randint(0, 11, size=20)  # random_integers is deprecated; randint's upper bound is exclusive
Then, I want to set 20% of the elements in the array to 0 (or some other number). How should I do this? Apply a mask?

You can calculate the indices with np.random.choice, limiting the number of chosen indices to the percentage:
indices = np.random.choice(np.arange(myarray.size), replace=False,
                           size=int(myarray.size * 0.2))
myarray[indices] = 0

For others looking for the answer in the case of an nd-array, as proposed by user holi:
my_array = np.random.rand(8, 50)
indices = np.random.choice(my_array.shape[1]*my_array.shape[0], replace=False,
                           size=int(my_array.shape[1]*my_array.shape[0]*0.2))
We multiply the dimensions to get the total number of elements, dim1*dim2, and draw flat indices from that range; then we convert these indices back to row and column positions and apply them to our array:
my_array[np.unravel_index(indices, my_array.shape)] = 0
The selected 20% of the elements are now set to 0.
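
A slightly shorter variant of the same idea uses the array's .flat attribute, which accepts flat indices directly. This is a sketch under assumptions: the Generator-based random calls, the seed, and the shift by 1.0 (used only so that the zeros are easy to count) are not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
my_array = rng.random((8, 50)) + 1.0  # strictly positive, so zeros are easy to count

n_zero = int(my_array.size * 0.2)
# Distinct flat positions among all my_array.size elements
flat_idx = rng.choice(my_array.size, size=n_zero, replace=False)

# .flat indexes the array as if it were 1-D and writes into it in place
my_array.flat[flat_idx] = 0
```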

Use np.random.permutation as random index generator, and take the first 20% of the index.
myarray = np.random.randint(0, 11, size=20)  # random_integers is deprecated; randint's upper bound is exclusive
n = len(myarray)
random_idx = np.random.permutation(n)
frac = 20 # [%]
zero_idx = random_idx[:round(n*frac/100)]
myarray[zero_idx] = 0

If you want the 20% to be random:
import random

def zero_random_fifth(myarray):  # wrapper name is illustrative
    random_list = []
    array_len = len(myarray)
    while len(random_list) < (array_len/5):
        random_int = random.randrange(array_len)  # valid index in [0, array_len)
        if random_int not in random_list:
            random_list.append(random_int)
    for position in random_list:
        myarray[position] = 0
    return myarray
This ensures that you definitely get 20% of the values: if the RNG rolls the same number several times, the duplicates are skipped, so you never end up with fewer than 20% of the values set to 0.

Assume your input numpy array is A and p=0.2. The following are a couple of ways to achieve this.
Exact Masking
ones = np.ones(A.size)
idx = int(min(p*A.size, A.size))
ones[:idx] = 0
A *= np.reshape(np.random.permutation(ones), A.shape)
Approximate Masking
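This is commonly done in several denoising objectives, most notably masked language modeling in Transformer pre-training. Here is a compact way of setting approximately a given proportion (say 20%) of the elements to zero; each element is kept independently with probability 0.8.
A *= np.random.binomial(size=A.shape, n=1, p=0.8)
Another alternative, which zeroes each element with probability 0.5 (so about half of the elements rather than 20%):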
A *= np.random.randint(0, 2, A.shape)
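
The exact and approximate strategies above can also be sketched with boolean masks and np.where, which leaves A untouched and avoids the casting error that an in-place multiplication by a float mask raises when A holds integers. The names drop, exact and approx, the seed, and the example array are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.arange(1, 101).reshape(10, 10)  # integer array with no zeros
p = 0.2

# Exact: shuffle a boolean mask that has exactly round(p * A.size) True entries.
drop = np.zeros(A.size, dtype=bool)
drop[:round(p * A.size)] = True
exact = np.where(rng.permutation(drop).reshape(A.shape), 0, A)

# Approximate: each element is dropped independently with probability p.
approx = np.where(rng.random(A.shape) < p, 0, A)
```

Here exact always contains 20 zeros, while the number of zeros in approx varies from run to run around 20.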

Permute rows in "slices" of 3d array to match each other

I have a series of 2d arrays where the rows are points in some space. Many similar points occur across all arrays but in different row order. I want to sort the rows so they have the most similar order. Also the points are too different for clustering with K-means or DBSCAN. The problem can also be cast like this. If I stack the arrays into a 3d array, how do I permute the rows to minimize the average standard deviation (SD) along the 2nd axis? What's a good sorting algorithm for this problem?
I've tried the following approaches.
1. Create a reference 2d array and sort the rows in each array to minimize the mean Euclidean distance to the reference 2d array. I am afraid this gives biased results.
2. Sort rows in arrays pairwise, then pairs of pair-medians, then pairs of those, etc. This doesn't really work and I'm not sure why.
3. A third approach could be brute-force optimization, but I try to avoid that since I have multiple sets of arrays to perform the procedure on.
This is my code for the 2nd approach (Python):
import numpy as np

def reorder_to(A, B):
    """Reorder rows in A to best match rows in B.

    Input
    -----
    A : N x M numpy.array
    B : N x M numpy.array

    Output
    ------
    perm_order : permutation order
    """
    if A.shape != B.shape:
        print("A and B must have the same shape")
        return None
    N = A.shape[0]
    # Create a distance matrix of distance between rows in A and B
    distance_matrix = np.ones((N, N))*np.inf
    for i, a in enumerate(A):
        for ii, b in enumerate(B):
            ba = (b-a)
            distance_matrix[i, ii] = np.sqrt(np.dot(ba, ba))
    # Choose permutation order by smallest distances first
    perm_order = [[] for _ in range(N)]
    for _ in range(N):
        ind = np.argmin(distance_matrix)
        i, ii = ind // N, ind % N  # integer division to recover row and column
        perm_order[ii] = i
        distance_matrix[i, :] = np.inf
        distance_matrix[:, ii] = np.inf
    return perm_order

def permute_tensor_rows(A):
    """Permute 1d rows in 3d array along the 0th axis to minimize average SD along 2nd axis.

    Input
    -----
    A : numpy.3darray
        Each "slice" in the 2nd direction is an independent array whose rows can be permuted
        to decrease the average SD in the 2nd direction.

    Output
    ------
    A : numpy.3darray
        A with sorted rows in each "slice".
    """
    step = 2
    while step <= A.shape[2]:
        for k in range(0, A.shape[2], step):
            # If last, reorder to previous
            if k + step > A.shape[2]:
                A_kk = A[:, :, k:(k+step)]
                kk_order = reorder_to(np.median(A_kk, axis=2), np.median(A_k, axis=2))
                A[:, :, k:(k+step)] = A[kk_order, :, k:(k+step)]
                continue
            k_0, k_1 = k, k+step//2
            kk_0, kk_1 = k+step//2, k+step
            A_k = A[:, :, k_0:k_1]
            A_kk = A[:, :, kk_0:kk_1]
            order = reorder_to(np.median(A_k, axis=2), np.median(A_kk, axis=2))
            A[:, :, k_0:k_1] = A[order, :, k_0:k_1]
        print("Step:", step, "\t ... Average SD:", np.mean(np.std(A, axis=2)))
        step *= 2
    return A

Sorry I should have looked at your code sample; that was very informative.
Seems like this here gives an out-of-the-box solution to your problem:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html#scipy.optimize.linear_sum_assignment
Only really feasible for a few 100 points at most though, in my experience.
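
As a rough sketch of how the linked function could replace the greedy matching in reorder_to: build the pairwise distance matrix and let scipy.optimize.linear_sum_assignment pick the pairing with the smallest total distance. The use of scipy.spatial.distance.cdist for the distances, the test data, and the convention that A[perm] lines up with B are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def reorder_to(A, B):
    """Return perm such that the rows of A[perm] match the rows of B
    with minimal total Euclidean distance."""
    cost = cdist(B, A)  # cost[j, i] = distance between B[j] and A[i]
    rows, cols = linear_sum_assignment(cost)
    return cols         # rows is 0..N-1 for a square cost matrix

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 3))
true_perm = rng.permutation(6)
A = B[true_perm] + 1e-3 * rng.normal(size=(6, 3))  # shuffled, slightly noisy copy
perm = reorder_to(A, B)
```

Unlike picking the smallest remaining distance at each step, the assignment solver minimizes the sum over all pairs at once.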

Simple way to create matrix of random numbers

I am trying to create a matrix of random numbers, but my solution is too long and looks ugly
random_matrix = [[random.random() for e in range(2)] for e in range(3)]
this looks ok, but in my implementation it is
weights_h = [[random.random() for e in range(len(inputs[0]))] for e in range(hiden_neurons)]
which is extremely unreadable and does not fit on one line.

You can drop the range(len()):
weights_h = [[random.random() for e in inputs[0]] for e in range(hiden_neurons)]
But really, you should probably use numpy.
In [9]: numpy.random.random((3, 3))
Out[9]:
array([[ 0.37052381, 0.03463207, 0.10669077],
[ 0.05862909, 0.8515325 , 0.79809676],
[ 0.43203632, 0.54633635, 0.09076408]])

Take a look at numpy.random.rand:
Docstring: rand(d0, d1, ..., dn)
Random values in a given shape.
Create an array of the given shape and propagate it with random
samples from a uniform distribution over [0, 1).
>>> import numpy as np
>>> np.random.rand(2,3)
array([[ 0.22568268, 0.0053246 , 0.41282024],
[ 0.68824936, 0.68086462, 0.6854153 ]])

Use np.random.randint(), as np.random.random_integers() is deprecated. Note that randint excludes max_val, whereas random_integers included it:
random_matrix = np.random.randint(min_val,max_val,(<num_rows>,<num_cols>))

Looks like you are doing a Python implementation of the Coursera Machine Learning Neural Network exercise. Here's what I did for randInitializeWeights(L_in, L_out)
#get a random array of floats between 0 and 1 as Pavel mentioned
W = numpy.random.random((L_out, L_in +1))
#normalize so that it spans a range of twice epsilon
W = W * 2 * epsilon
#shift so that mean is at zero
W = W - epsilon
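
The three steps above amount to drawing uniformly from the interval between -epsilon and epsilon, which can be sketched in a single call. The function name, the default epsilon of 0.12, and the rng argument are assumptions for illustration and are not taken from the original answer.

```python
import numpy as np

def rand_initialize_weights(l_in, l_out, epsilon=0.12, rng=None):
    """Return an (l_out, l_in + 1) array of weights drawn uniformly
    from [-epsilon, epsilon). The extra column is for the bias unit."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(-epsilon, epsilon, size=(l_out, l_in + 1))

W = rand_initialize_weights(3, 4, epsilon=0.12, rng=np.random.default_rng(0))
```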

NumPy provides functions for creating arrays of two kinds of random numbers:
Real numbers
Integers
For creating an array of random real numbers there are 2 options:
random.rand (for a uniform distribution of the generated random numbers)
random.randn (for a normal distribution of the generated random numbers)
random.rand
import numpy as np
arr = np.random.rand(row_size, column_size)
random.randn
import numpy as np
arr = np.random.randn(row_size, column_size)
For creating array using random Integers:
import numpy as np
np.random.randint(low, high=None, size=None, dtype='l')
where
low = Lowest (signed) integer to be drawn from the distribution
high(optional)= If provided, one above the largest (signed) integer to be drawn from the distribution
size(optional) = Output shape i.e. if the given shape is, e.g., (m, n, k), then m * n * k samples are drawn
dtype(optional) = Desired dtype of the result.
e.g.:
The following example produces a 5 by 5 array (25 values) of random integers between 0 and 4. The size must be given as the tuple (5,5), with a comma; size = 5*5 would produce a flat array of 25 integers instead.
arr2 = np.random.randint(0,5,size = (5,5))
[[2 1 1 0 1][3 2 1 4 3][2 3 0 3 3][1 3 1 0 0][4 1 2 0 1]]
e.g. 2:
The following example produces a one-dimensional array of 10 random integers, each either 0 or 1. When only one bound is given, it is treated as the exclusive upper bound and the lower bound is 0.
arr3 = np.random.randint(2, size = 10)
[0 0 0 0 1 1 0 0 1 1]

First create a numpy array, then convert it into a matrix. See the code below:
import numpy
B = numpy.random.random((3, 4))  # B is an ndarray
C = numpy.matrix(B)  # C is a matrix
print(type(B))
print(type(C))
print(C)

x = np.int_(np.random.rand(10) * 10)
This gives 10 random integers from 0 to 9. For integers from 0 to 19, multiply by 20 instead.

When you say "a matrix of random numbers", you can use numpy as Pavel https://stackoverflow.com/a/15451997/6169225 mentioned above; in that case I'm assuming it is irrelevant to you what distribution these (pseudo) random numbers follow.
However, if you require a particular distribution (I imagine you are interested in the uniform distribution), numpy.random has very useful methods for you. For example, let's say you want a 3x2 matrix with a pseudo-random uniform distribution over the half-open interval [low, high). You can do this like so:
numpy.random.uniform(low,high,(3,2))
Note that you can replace uniform with any of the distributions supported by this library.
Further reading: https://docs.scipy.org/doc/numpy/reference/routines.random.html
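
To illustrate swapping one distribution for another, here is a small sketch using NumPy's Generator interface rather than the legacy numpy.random functions used above. The seed, the bounds, and the variable names u, g and k are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
low, high = -1.0, 1.0

u = rng.uniform(low, high, size=(3, 2))          # uniform on [low, high)
g = rng.normal(loc=0.0, scale=1.0, size=(3, 2))  # normal with mean 0, standard deviation 1
k = rng.integers(0, 10, size=(3, 2))             # integers 0..9 (upper bound excluded)
```

Each call takes the same size argument, so changing the distribution only means changing the method name and its parameters.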

A simple way of creating an array of random integers is:
matrix = np.random.randint(maxVal, size=(rows, columns))
The following outputs a 2 by 3 matrix of random integers from 0 to 9 (the upper bound 10 is excluded):
a = np.random.randint(10, size=(2,3))

import random
# rows and columns are the desired dimensions
random_matrix = [[random.random() for j in range(columns)] for i in range(rows)]
for i in range(rows):
    print(random_matrix[i])

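An answer using nested map calls, where ran is assumed to be a zero-argument random function such as random.random (in Python 3, each map is wrapped in list() to get a list of lists):
list(map(lambda x: list(map(lambda y: ran(), range(len(inputs[0])))), range(hiden_neurons)))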

# This function builds a square matrix, so the while loop runs until the number of rows equals cols.
# You can use your own condition, but if you want a square matrix, use this code.
import random
import numpy as np
def random_matrix(R, cols):
    matrix = []
    rows = 0
    while rows < cols:
        N = random.sample(R, cols)
        matrix.append(N)
        rows = rows + 1
    return np.array(matrix)
print(random_matrix(range(10), 5))
# Make sure you understand random.sample: it draws without replacement, so no value repeats within a row.

numpy.random.rand(row, column) generates random numbers in the interval [0, 1) with the given (m,n) shape. So use it to create an (m,n) matrix, multiply the matrix by the width of the range (high - low), and add the low limit.
Analyzing: if zero is generated, the result is the low limit; as the generated value approaches one, the result approaches the high limit. In other words, scaling and shifting the output of rand lets you generate numbers anywhere in the desired range.
import numpy as np
high = 10
low = 5
m,n = 2,2
a = (high - low)*np.random.rand(m,n) + low
Output:
a = array([[5.91580065, 8.1117106 ],
[6.30986984, 5.720437 ]])
