randomly sampling arrays - issue with numpy.delete - python

I have 2 arrays, x_1g and x_2g. I want to randomly sample 10% of each array and remove that 10% and insert it into the other array. This means that my final and initial arrays should have the same shape, but 10% of the data is randomly sampled from the other array. I have been trying this with the code below but my arrays keep increasing in length, meaning I haven't properly deleted the sampled 10% data from each array.
n = len(x_1g)
n2 = round(n/10)
ints1 = np.random.choice(n, n2)
x_1_replace = x_1g[ints1,:]
x_1 = np.delete(x_1g, ints1, 0)
x_2_replace = x_2g[ints1,:]
x_2 = np.delete(x_2g, ints1, 0)
My arrays x_1g and x_2g have shape (1502983, 10)
x_1g.shape
>> (1502983, 10)
x_1_replace.shape
>> (150298, 10)
so when I remove the 10% data (x_1_replace) from my original array (x_1g) I should get the array shape:
1502983-150298 = 1352685
However when I check the shape of my array x_1 I get:
x_1.shape
>> (1359941, 10)
I'm not sure what is going on here so if anyone has any suggestions please let me know!!

What happens is that by using ints1 = np.random.choice(n, n2) to generate your indices, you are drawing n2 numbers between 0 and n-1 with replacement, so there is no guarantee that they are all different. You are most likely generating a number of duplicates, and if you pass the same index position to np.delete several times, it is only deleted once. You can check this by counting the unique values in ints1:
np.unique(ints1).shape
You'll see it doesn't match n2 (in your example, you'll get (143042,), which is consistent with 1502983 - 1359941).
There's probably more than one way to ensure that you'll get n2 different indices, here is one example:
n = len(x_1g)
n2 = round(n/10)
ints1 = np.arange(n) # generating an array [0 ... n-1]
np.random.shuffle(ints1) # shuffle it
ints1 = ints1[:n2] # take the first n2 values
x_1_replace = x_1g[ints1,:]
x_1 = np.delete(x_1g, ints1, 0)
x_2_replace = x_2g[ints1,:]
x_2 = np.delete(x_2g, ints1, 0)
Now you can check:
x_1.shape
# (1352685, 10)
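A simpler alternative is to pass replace=False to np.random.choice, which samples without replacement and therefore guarantees n2 distinct indices (a minimal sketch, equivalent to the shuffle-and-slice approach above):
ints1 = np.random.choice(n, n2, replace=False)  # n2 distinct indices, no duplicates
x_1 = np.delete(x_1g, ints1, 0)  # now removes exactly n2 rows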

Related

efficient setting 1D range values in a DataFrame (or a ndarray) with boolean array

PREREQUISITE
import numpy as np
import pandas as pd
INPUT1: boolean 2d array (a sample array as below)
x = np.array(
    [[False,False,False,False,True],
     [True,False,False,False,False],
     [False,False,True,False,True],
     [False,True,True,False,False],
     [False,False,False,False,False]])
INPUT2: 1D range values (a sample as below)
y = np.array([1,2,3,4])
EXPECTED OUTPUT: 2D ndarray
[[0,0,0,0,1],
 [1,0,0,0,2],
 [2,0,1,0,1],
 [3,1,1,0,2],
 [4,2,2,0,3]]
I want to set a range value (a vertical vector) for each True in the 2d ndarray (INPUT1) efficiently. Are there any useful APIs or solutions for this purpose?
Unfortunately I couldn't come up with an elegant solution, so I came up with multiple inelegant ones. The two main approaches I could think of are
1. brute-force looping over each True value and assigning slices, and
2. using a single indexed assignment to replace the necessary values.
It turns out that the time complexity of these approaches is non-trivial, so depending on the size of your array either can be faster.
Using your example input:
import numpy as np

x = np.array(
    [[False,False,False,False,True],
     [True,False,False,False,False],
     [False,False,True,False,True],
     [False,True,True,False,False],
     [False,False,False,False,False]])
y = np.array([1,2,3,4])
refout = np.array([[0,0,0,0,1],
                   [1,0,0,0,2],
                   [2,0,1,0,1],
                   [3,1,1,0,2],
                   [4,2,2,0,3]])
# alternative input with arbitrary size:
# N = 100; x = np.random.rand(N,N) < 0.2; y = np.arange(1,N)

def looping_clip(x, y):
    """Loop over Trues, use clipped slices"""
    nmax = x.shape[0]
    n = y.size
    # initialize output
    out = np.zeros_like(x, dtype=y.dtype)
    # loop over True values
    for i,j in zip(*x.nonzero()):
        # truncate right-hand side where necessary
        out[i:i+n, j] = y[:nmax-i]
    return out

def looping_expand(x, y):
    """Loop over Trues, use an expanded buffer"""
    n = y.size
    nmax,mmax = x.shape
    ivals,jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax, 0), mmax), dtype=y.dtype)
    # loop over True values
    for i,j in zip(ivals, jvals):
        # slice will always be complete, i.e. of length y.size
        out[i:i+n, j] = y
    return out[:nmax, :].copy()  # rather not return a view to an auxiliary array

def index_2d(x, y):
    """Assign directly with 2d indices, use an expanded buffer"""
    n = y.size
    nmax,mmax = x.shape
    ivals,jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax, 0), mmax), dtype=y.dtype)
    # now we can safely index for each "(ivals:ivals+n, jvals)" so to speak
    upped_ivals = ivals[:,None] + np.arange(n)  # shape (ntrues, n)
    upped_jvals = jvals.repeat(y.size).reshape(-1, n)  # shape (ntrues, n)
    out[upped_ivals, upped_jvals] = y  # right-hand side of shape (n,) broadcasts
    return out[:nmax, :].copy()  # rather not return a view to an auxiliary array

def index_1d(x, y):
    """Assign using linear indices, use an expanded buffer"""
    n = y.size
    nmax,mmax = x.shape
    ivals,jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax, 0), mmax), dtype=y.dtype)
    # grab linear indices corresponding to Trues in a buffed-up array
    inds = np.ravel_multi_index((ivals, jvals), out.shape)
    # now all we need to do is start stepping along rows for each item and assign y
    upped_inds = inds[:,None] + mmax*np.arange(n)  # shape (ntrues, n)
    out.flat[upped_inds] = y  # y of shape (n,) broadcasts to (ntrues, n)
    return out[:nmax, :].copy()  # rather not return a view to an auxiliary array

# check that the results are correct
print(all([np.array_equal(refout, looping_clip(x, y)),
           np.array_equal(refout, looping_expand(x, y)),
           np.array_equal(refout, index_2d(x, y)),
           np.array_equal(refout, index_1d(x, y))]))
I tried to document each function, but here's a synopsis:
looping_clip loops over every True value in the input and assigns to a corresponding slice in the output. We take care on the right-hand side to shorten the assigned array for when part of the slice would go beyond the edge of the array along the first dimension.
looping_expand loops over every True value in the input and assigns to a corresponding full slice in the output after allocating a padded output array ensuring that every slice will be full. We do more work when allocating a larger output array, but we don't have to shorten the right-hand side on assignment. We could omit the .copy() call in the last step, but I prefer not to return a nontrivially strided array (i.e. a view to an auxiliary array rather than a proper copy) as this might lead to obscure surprises for the user.
index_2d computes the 2d indices of every value to be assigned to, and assumes that duplicate indices will be handled in order. This is not guaranteed! (More on this a bit later.)
index_1d does the same using linearized indices and indexing into the flatiter of the output.
Here are the timings of the above methods using random arrays (see the commented line near the start):
What we can see is that for small and large arrays the looping versions are faster, but for linear sizes between roughly 10 and 150 the indexing versions are better. The reason I didn't go to higher sizes is that the indexing cases start to use a lot of memory, and I didn't want to have to worry about this messing with timings.
To make matters worse, note that the indexing versions assume that duplicate indices in a fancy indexing scenario are handled in order, so that when True values "lower" in the array are handled, previously assigned values get overwritten, as per your requirements. There's only one problem: this is not guaranteed:
For advanced assignments, there is in general no guarantee for the iteration order. This means that if an element is set more than once, it is not possible to predict the final result.
This doesn't sound very encouraging. While in my experiments it seems that the indices are handled in order (according to C order), this could also be coincidence, or an implementation detail. So if you want to use the indexing versions, make sure that on your specific version and specific dimensions and shapes this still holds true.
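To illustrate the hazard with a tiny example (this shows the typically observed behaviour; numpy makes no promise about which duplicate wins):
out = np.zeros(3)
out[[0, 0, 0]] = np.array([1, 2, 3])
print(out)  # usually [3. 0. 0.] under C-order iteration, but this is not guaranteed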
We can make the assignment safer by getting rid of duplicate indices ourselves. For this we can make use of this answer by Divakar on a corresponding question:
def index_1d_safe(x, y):
    """Same as index_1d but use Divakar's safe solution for reducing duplicates"""
    n = y.size
    nmax,mmax = x.shape
    ivals,jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax, 0), mmax), dtype=y.dtype)
    # grab linear indices corresponding to Trues in a buffed-up array
    inds = np.ravel_multi_index((ivals, jvals), out.shape)
    # now all we need to do is start stepping along rows for each item and assign y
    upped_inds = inds[:,None] + mmax*np.arange(n)  # shape (ntrues, n)
    # now comes https://stackoverflow.com/a/44672126
    # need additional step: flatten upped_inds and corresponding y values for selection
    upped_flat_inds = upped_inds.ravel()  # shape (ntrues, n) -> (ntrues*n,)
    y_vals = np.broadcast_to(y, upped_inds.shape).ravel()  # shape (ntrues, n) -> (ntrues*n,)
    sidx = upped_flat_inds.argsort(kind='mergesort')
    sindex = upped_flat_inds[sidx]
    idx = sidx[np.r_[np.flatnonzero(sindex[1:] != sindex[:-1]), upped_flat_inds.size-1]]
    out.flat[upped_flat_inds[idx]] = y_vals[idx]
    return out[:nmax, :].copy()  # rather not return a view to an auxiliary array
This still reproduces your expected output. The problem is that now the function takes much longer to finish:
Bummer. Considering how my indexing versions are only faster for an intermediate array size and how their faster versions are not guaranteed to work, perhaps it's simplest to just use one of the looping versions. This is not to say, of course, that there aren't any optimal vectorized solutions that I missed.

How to randomly set elements in numpy array to 0

First I create my array
myarray = np.random.random_integers(0,10, size=20)
Then, I want to set 20% of the elements in the array to 0 (or some other number). How should I do this? Apply a mask?
You can calculate the indices with np.random.choice, limiting the number of chosen indices to the percentage:
indices = np.random.choice(np.arange(myarray.size), replace=False,
                           size=int(myarray.size * 0.2))
myarray[indices] = 0
For others looking for the answer in case of nd-array, as proposed by user holi:
my_array = np.random.rand(8, 50)
indices = np.random.choice(my_array.shape[1]*my_array.shape[0], replace=False, size=int(my_array.shape[1]*my_array.shape[0]*0.2))
We multiply the dimensions to get an array of length dim1*dim2, then we apply these indices to our array:
my_array[np.unravel_index(indices, my_array.shape)] = 0
The array is now masked.
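As a quick sanity check (assuming the snippet above was run; values from np.random.rand are essentially never exactly zero, so every zero comes from the mask):
print(np.count_nonzero(my_array == 0))  # int(8*50*0.2) = 80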
Use np.random.permutation as a random index generator, and take the first 20% of the indices.
myarray = np.random.random_integers(0,10, size=20)
n = len(myarray)
random_idx = np.random.permutation(n)
frac = 20 # [%]
zero_idx = random_idx[:round(n*frac/100)]
myarray[zero_idx] = 0
If you want the 20% to be random:
import random

random_list = []
array_len = len(myarray)
while len(random_list) < (array_len / 5):
    random_int = random.randint(0, array_len - 1)  # randint is inclusive, so cap at array_len - 1
    if random_int not in random_list:
        random_list.append(random_int)
for position in random_list:
    myarray[position] = 0
This would ensure you definitely get 20% of the values, and RNG rolling the same number many times would not result in less than 20% of the values being 0.
Assume your input numpy array is A and p=0.2. The following are a couple of ways to achieve this.
Exact Masking
ones = np.ones(A.size)
idx = int(min(p*A.size, A.size))
ones[:idx] = 0
A *= np.reshape(np.random.permutation(ones), A.shape)
Approximate Masking
This is commonly done in several denoising objectives, most notably the Masked Language Modeling in Transformers pre-training. Here is a more pythonic way of setting a certain proportion (say 20%) of elements to zero.
A *= np.random.binomial(size=A.shape, n=1, p=0.8)
Another Alternative:
A *= np.random.randint(0, 2, A.shape)
(Note that this zeroes roughly half of the elements rather than a chosen proportion p.)

Permute rows in "slices" of 3d array to match each other

I have a series of 2d arrays where the rows are points in some space. Many similar points occur across all arrays but in different row order. I want to sort the rows so they have the most similar order. Also the points are too different for clustering with K-means or DBSCAN. The problem can also be cast like this. If I stack the arrays into a 3d array, how do I permute the rows to minimize the average standard deviation (SD) along the 2nd axis? What's a good sorting algorithm for this problem?
I've tried the following approaches.
Create a reference 2d array and sort the rows in each array to minimize the mean euclidean distance to that reference. I'm afraid this gives biased results.
Sort rows in arrays pairwise, then pairs of pair-medians, then pairs of that, etc... This doesn't really work and I'm not sure why.
A third approach could be just brute force optimization but I try to avoid that since I have multiple sets of arrays to perform the procedure on.
This is my code for the 2nd approach (Python):
import numpy as np

def reorder_to(A, B):
    """Reorder rows in A to best match rows in B.
    Input
    -----
    A : N x M numpy.array
    B : N x M numpy.array
    Output
    ------
    perm_order : permutation order
    """
    if A.shape != B.shape:
        print("A and B must have the same shape")
        return None
    N = A.shape[0]
    # Create a distance matrix of distances between rows in A and B
    distance_matrix = np.ones((N, N)) * np.inf
    for i, a in enumerate(A):
        for ii, b in enumerate(B):
            ba = (b - a)
            distance_matrix[i, ii] = np.sqrt(np.dot(ba, ba))
    # Choose permutation order by smallest distances first
    perm_order = [[] for _ in range(N)]
    for _ in range(N):
        ind = np.argmin(distance_matrix)
        i, ii = ind // N, ind % N
        perm_order[ii] = i
        distance_matrix[i, :] = np.inf
        distance_matrix[:, ii] = np.inf
    return perm_order

def permute_tensor_rows(A):
    """Permute 1d rows in 3d array along the 0th axis to minimize average SD along 2nd axis.
    Input
    -----
    A : numpy.3darray
        Each "slice" in the 2nd direction is an independent array whose rows can be permuted
        to decrease the average SD in the 2nd direction.
    Output
    ------
    A : numpy.3darray
        A with sorted rows in each "slice".
    """
    step = 2
    while step <= A.shape[2]:
        for k in range(0, A.shape[2], step):
            # If last, reorder to previous
            if k + step > A.shape[2]:
                A_kk = A[:, :, k:(k+step)]
                kk_order = reorder_to(np.median(A_kk, axis=2), np.median(A_k, axis=2))
                A[:, :, k:(k+step)] = A[kk_order, :, k:(k+step)]
                continue
            k_0, k_1 = k, k + step//2
            kk_0, kk_1 = k + step//2, k + step
            A_k = A[:, :, k_0:k_1]
            A_kk = A[:, :, kk_0:kk_1]
            order = reorder_to(np.median(A_k, axis=2), np.median(A_kk, axis=2))
            A[:, :, k_0:k_1] = A[order, :, k_0:k_1]
        print("Step:", step, "\t ... Average SD:", np.mean(np.std(A, axis=2)))
        step *= 2
    return A
Sorry I should have looked at your code sample; that was very informative.
Seems like this here gives an out-of-the-box solution to your problem:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html#scipy.optimize.linear_sum_assignment
Only really feasible for a few hundred points at most though, in my experience.
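For reference, here is a minimal sketch of what that could look like as a drop-in replacement for reorder_to above (assuming SciPy is available; cdist builds the pairwise distance matrix the solver needs, and reorder_to_lsa is a name I made up):
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def reorder_to_lsa(A, B):
    """Reorder rows in A to best match rows in B, globally optimally."""
    cost = cdist(B, A)  # cost[i, j] = euclidean distance between B[i] and A[j]
    row_ind, col_ind = linear_sum_assignment(cost)  # Hungarian algorithm
    return col_ind  # A[col_ind] is the reordering of A that best matches B
Unlike the greedy smallest-distance-first loop in reorder_to, this minimizes the total distance over all row pairings at once.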

How to randomly sample a matrix in python?

I have a relatively large array, e.g. 200 x 1000. The matrix is a sparse matrix where the elements can be considered weights. The weights range from 0 to 500. I would like to generate a new array of the same size, 200 x 1000, where N of the elements of the new array are random integers {0,1}. The probability of an element in the new matrix being 0 or 1 is proportional to the weights from the original array: the higher the weight, the higher the probability of 1 versus 0.
Stated another way: I would like to generate a zero matrix of size 200x1000 and then randomly choose N elements to flip to 1 based on a 200x1000 matrix of weights.
I'll throw my proposed solution in here as well:
# for example
a = np.random.random_integers(0, 500, size=(200,1000))
N = 200
result = np.zeros((200,1000))
ia = np.arange(result.size)
tw = float(np.sum(a.ravel()))
result.ravel()[np.random.choice(ia, p=a.ravel()/tw,
                                size=N, replace=False)] = 1
where a is the array of weights: that is, pick the indexes for the items to change to 1 from the array ia, weighted by a.
This can be done with numpy with
# Compute probabilities as a 1-D array
probs = numpy.float64(weights).ravel()
probs /= numpy.sum(probs)
# Pick winner indexes
winners = numpy.random.choice(len(probs), N, False, probs)
# Build result
result = numpy.zeros(weights.shape, numpy.uint8)
result.ravel()[winners] = 1
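A quick usage sketch with the dimensions from the question (weights and N here are stand-ins):
import numpy
weights = numpy.random.randint(0, 501, size=(200, 1000))
N = 300
probs = numpy.float64(weights).ravel()
probs /= numpy.sum(probs)
winners = numpy.random.choice(len(probs), N, False, probs)
result = numpy.zeros(weights.shape, numpy.uint8)
result.ravel()[winners] = 1
print(result.sum())  # N: exactly N elements were set to 1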
Something like this should work, no reason to get too complicated:
>>> import random
>>> weights = [[1,5],[500,0]]
>>> output = []
>>> for row in weights:
...     outRow = []
...     for entry in row:
...         outRow.append(random.choice([0]+[1 for _ in range(entry)]))
...     output.append(outRow)
...
>>> output
[[1, 1], [1, 0]]
This chooses a random entry from a sequence which always has a single zero followed by n ones, where n is the corresponding entry in your weight matrix. In this implementation, a weight of 1 actually gives a 50/50 chance of either a 1 or a 0. If you want the 50/50 point to be at a weight of 250, use outRow.append(random.choice([0 for _ in range(500-entry)] + [1 for _ in range(entry)]))
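If you like this weighting but want it vectorized, the same distribution (an entry with weight w becomes 1 with probability w/(w+1)) can be drawn in one shot with numpy; a sketch, not part of the original answer:
import numpy as np
weights = np.array([[1, 5], [500, 0]])
# P(1) = w/(w+1), matching random.choice([0] + [1]*w) above
output = (np.random.random(weights.shape) < weights/(weights + 1.0)).astype(int)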

Accessing ndarray and discarding invalid positions - Python

I have one question about accessing a matrix position that in fact does not exists.
First, I have a matrix with rows rows and cols columns. From this matrix, I have to get sets of n x n sub matrices. For example, to get 3 x 3 sub matrices, I do the following:
for x, y in product(range(1, matrix.rows-1), range(1, matrix.cols-1)):
    bootstrap_3x3 = npr.choice(matrix.data[x-1:x+2, y-1:y+2].flatten(), size=(3, 3), replace=True)
But, as can be seen, I'm not considering the extremes, and I have to. For x = 0 and y = 0, for example, I should consider matrix.data[x:x+2, y:y+2] (the center should be the current x and y), returning a 3 x 3 with the first row/column = 0.
I know that I can achieve this with some if statements. But I guess Python should have a clever way to do this properly.
Thank you in advance.
I would make a new matrix, padded with (n-1)/2 zeros around it:
import numpy as np

rows, cols = 4, 6
n = 3
d = (n - 1) // 2  # integer pad width (assumes n is odd)
data = np.arange(rows*cols).reshape(rows, cols)
padded = np.pad(data, d, mode='constant')
for x, y in np.indices(data.shape).reshape(2, -1).T:
    sub = padded[x:x+n, y:y+n]
    print(sub)
    bootstrap_nxn = np.random.choice(sub.ravel(), (n, n))
This assumes n is odd, and that the submatrix center is always within the original data matrix. If n is even, the center of the submatrix isn't well defined.
If you actually want to allow the submatrix to overlap the data matrix by as little as one row, then you'd need to pad with n-1 zeros (and in that case even vs odd n won't matter).
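For instance, at the corner of the example data the padded window already behaves the way the question asks for (a quick check, assuming the snippet above was run):
print(padded[0:3, 0:3])
# [[0 0 0]
#  [0 0 1]
#  [0 6 7]]
The first row and column are zeros and the centre is data[0, 0], which is exactly the x = y = 0 case described in the question.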
