select random indices from 2d array - python

I want to generate a 2D random array, select some (m) random indices, and replace the values at those indices with m predefined values.
For example, I want to generate a 4-by-4 matrix, then select 4 random indices and replace their values with [105,110,115,120].
random_matrix = np.random.randint(0,100,(4,4))
# array([[27, 20, 2, 8],
# [43, 88, 14, 63],
# [ 5, 55, 4, 72],
# [59, 49, 84, 96]])
Now I want to randomly select 4 indices and replace their values with those from the predefined p_array = [105,110,115,120].
I tried generating all the indices like this:
[
(i,j)
for i in range(len(random_matrix))
for j in range(len(random_matrix[i]))
]
But how do I select 4 random indices from this list and replace their values with those from the predefined p_array? I couldn't come up with a solution because the 4 random indices must be unique, and plain random sampling doesn't guarantee that.
Can we generate the random matrix and select the indices in a single step? I need that because my current implementation gets slower and slower as m grows, and performance matters here.

Do the following:
import numpy as np
# for reproducibility
np.random.seed(42)
rows, cols = 4, 4
p_array = np.array([105, 110, 115, 120])
# generate random matrix that will always include all the values from p_array
k = rows * cols - len(p_array)
random_matrix = np.concatenate((p_array, np.random.randint(0, 100, k)))
np.random.shuffle(random_matrix)
random_matrix = random_matrix.reshape((rows, cols))
print(random_matrix)
Output
[[115 33 54 27]
[ 3 27 16 69]
[ 33 24 81 105]
[ 62 110 94 120]]
UPDATE
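Assuming the same setup as before, you could do the following to generate a random matrix while knowing the indices of the p_array values. Note that random_matrix here must be the flat array returned by np.concatenate (before the shuffle and reshape above), so that the p_array values occupy its first four positions: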
positions = np.random.permutation(np.arange(rows * cols))
random_matrix = random_matrix[positions].reshape((rows, cols))
print("random-matrix")
print("-------------")
print(random_matrix)
print("-------------")
# get indices in flat array
flat_indices = np.argwhere(np.isin(positions, np.arange(4))).flatten()
# get indices in matrix
matrix_indices = np.unravel_index(flat_indices, (rows, cols))
print("p_array-indices")
print("-------------")
print(matrix_indices)
# verify that indeed those are the values
print(random_matrix[matrix_indices])
Output
random-matrix
-------------
[[ 60 74 20 14]
[105 86 120 82]
[ 74 87 110 51]
[ 92 115 99 71]]
-------------
p_array-indices
-------------
(array([1, 1, 2, 3]), array([0, 2, 2, 1]))
[105 120 110 115]

You can do the following, using your suggested cross-product and random.sample:
import random
from itertools import product
pool = [*product(range(len(random_matrix)), range(len(random_matrix[0])))]
random_indices = random.sample(pool, 4)
# [(3, 1), (1, 2), (2, 0), (2, 3)]
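A vectorized variant of the same idea, as a minimal sketch: draw unique flat indices with replace=False and convert them to row/column pairs. The names rng, flat_idx, row_idx and col_idx are illustrative and do not come from the answers above, and the sketch assumes NumPy 1.17 or later for np.random.default_rng.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only for reproducibility
rows, cols = 4, 4
p_array = np.array([105, 110, 115, 120])

# Random matrix with values in [0, 100), as in the question.
random_matrix = rng.integers(0, 100, size=(rows, cols))

# Sample len(p_array) distinct positions from the flattened matrix.
flat_idx = rng.choice(rows * cols, size=len(p_array), replace=False)

# Convert flat positions to (row, col) index arrays and assign in one step.
row_idx, col_idx = np.unravel_index(flat_idx, (rows, cols))
random_matrix[row_idx, col_idx] = p_array

print(random_matrix)
print(list(zip(row_idx.tolist(), col_idx.tolist())))
```

Because replace=False guarantees distinct flat indices, no value of p_array overwrites another, and the cost does not depend on building a Python list of all index pairs.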

Related

Deleting specific numbers from a (2,60) numpy array?

I have a numpy array that has a shape of (2,60). Some of the numbers in the first row exceed 30 and I want to filter columns for which the value of the first row is less than 30.
I tried array = array[array < 30] #but that doesn't work
An example of my array is
array = np.array([[30,40,12,12,10,2,30,40],[2,5,75,67,89,5,3,4]])
Expected output:
array = [[12 12 10 2]
[75 67 89 5]]
You are looking for this:
array[:,array[0]<30]
output:
array([[12, 12, 10, 2],
[75, 67, 89, 5]])
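To see why the attempt in the question fails while the column mask works, here is a small sketch using the question's own data. The names mask, filtered and flattened are illustrative.

```python
import numpy as np

array = np.array([[30, 40, 12, 12, 10, 2, 30, 40],
                  [2, 5, 75, 67, 89, 5, 3, 4]])

# A 1-D boolean mask built from the first row only: one entry per column.
mask = array[0] < 30

# Apply the mask along the column axis; all rows are kept.
filtered = array[:, mask]

# A 2-D mask over the whole array selects individual elements instead,
# so the result is flattened to 1-D and the column structure is lost.
flattened = array[array < 30]

print(filtered)
print(flattened.ndim)
```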

fastest way to create a matrix who cols are a product of each other

Suppose I have a matrix X, with n columns. I want to create a new matrix, Y such that each column of Y is a product of two different columns of X.
Currently, I am doing a loop, something like this (not my actual code, but captures the essence of the code):
Y = np.empty((X.shape[0], n * (n-1) // 2), dtype=X.dtype)
cnt = 0
for j1 in range(0, n-1):
    for j2 in range(j1+1, n):
        Y[:, cnt] = X[:, j1] * X[:, j2]
        cnt += 1
I was wondering if anyone knows if there is a faster way to generate (populate) matrix Y than the double loop that I am doing ? For instance, any function in numpy that be re-used to generate such a matrix quickly ?
Since you are looking for combinations of columns without repetition (i.e. col 0 * col 1 is the same as col 1 * col 0), I would use itertools since the combination is over something relatively smaller (the indices):
>>> from itertools import combinations
>>> x = np.arange(24).reshape(6,4)
>>> list(combinations(range(x.shape[1]), 2)) # For illustrative purposes. We want all pairs of different columns.
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
>>> np.vstack([x[:, i]*x[:, j] for i, j in combinations(range(x.shape[1]), 2)]).T
array([[ 0, 0, 0, 2, 3, 6],
[ 20, 24, 28, 30, 35, 42],
[ 72, 80, 88, 90, 99, 110],
[156, 168, 180, 182, 195, 210],
[272, 288, 304, 306, 323, 342],
[420, 440, 460, 462, 483, 506]])
Using broadcasting (which, depending on your input, might be faster):
Z = X.T[:,None]*X.T
output = Z[np.triu_indices(X.shape[1],k=1)].T
example input/output:
X = np.arange(24).reshape(6,4)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]
output:
[[ 0 0 0 2 3 6]
[ 20 24 28 30 35 42]
[ 72 80 88 90 99 110]
[156 168 180 182 195 210]
[272 288 304 306 323 342]
[420 440 460 462 483 506]]
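A further variant, offered as a sketch rather than a benchmarked recommendation: index the columns directly with the pair indices from np.triu_indices, which avoids building the full n-by-n intermediate that the broadcasting version creates. The names i, j, Y and Y_loop are illustrative.

```python
import numpy as np
from itertools import combinations

X = np.arange(24).reshape(6, 4)
n = X.shape[1]

# Index pairs (i, j) with i < j, in the same order as combinations(range(n), 2).
i, j = np.triu_indices(n, k=1)

# Fancy-index both operands by column and multiply elementwise.
Y = X[:, i] * X[:, j]

# Reference result from the explicit pairwise loop, for comparison.
Y_loop = np.column_stack([X[:, a] * X[:, b]
                          for a, b in combinations(range(n), 2)])

print(Y)
print(np.array_equal(Y, Y_loop))
```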

Create Numpy array without enumerating array

Starting with this:
x = range(30,60,2)[::-1];
x = np.asarray(x); x
array([58, 56, 54, 52, 50, 48, 46, 44, 42, 40, 38, 36, 34, 32, 30])
I want to create an array like this (note that the first item of each row repeats). If there is a faster way that omits the repeated first item, I can add it back with np.hstack.
[[58 58 56 54 52]
[56 56 54 52 50]
[54 54 52 50 48]
[52 52 50 48 46]
[50 50 48 46 44]
[48 48 46 44 42]
[46 46 44 42 40]
[44 44 42 40 38]
[42 42 40 38 36]
[40 40 38 36 34]
[38 38 36 34 32]
[36 36 34 32 30]
[34 34 32 30 None]
[32 32 30 None None]
[30 30 None None None]]
The code below works, but I want it faster, without the 'for' loop and enumerate.
arr = np.empty((0,5), int)
for i,e in enumerate(x):
    arr2 = np.hstack((x[i], x[i:i+4], np.asarray([None]*5)))[:5]
    arr = np.vstack((arr,arr2))
Approach #1
Here's a vectorized approach using NumPy broadcasting -
N = 4 # width factor
x_ext = np.concatenate((x,[None]*(N-1)))
arr2D = x_ext[np.arange(N) + np.arange(x_ext.size-N+1)[:,None]]
out = np.column_stack((x,arr2D))
Approach #2
Here's another one using hankel -
from scipy.linalg import hankel
N = 4 # width factor
x_ext = np.concatenate((x,[None]*(N-1)))
out = np.column_stack((x,hankel(x_ext[:4], x_ext[3:]).T))
Runtime test
Here's a modified version of @Aaron's benchmarking script. For a fair comparison it uses the same input format for this post's approach as for his, and it focuses on just these two approaches -
import numpy as np
from time import time
upper_limit = 58 # We will edit this to vary the dataset sizes
print("Timings are : ")
t = time()
for _ in range(1000): # 1000 iterations of @Aaron's soln.
    width = 3
    x = np.array(list(range(upper_limit,28,-2)) + [float('nan')]*width)
    arr = np.empty([len(x)-width, width+2])
    arr[:,0] = x[:len(x)-width]
    for i in range(len(x)-width):
        arr[i,1:] = x[i:i+width+1]
print(time()-t)
t = time()
for _ in range(1000):
    N = 4 # width factor
    x_ext = np.array(list(range(upper_limit,28,-2)) + [float('nan')]*(N-1))
    arr2D = x_ext[np.arange(N) + np.arange(x_ext.size-N+1)[:,None]]
    out = np.column_stack((x_ext[:len(x_ext)-N+1],arr2D))
print(time()-t)
Case #1 (upper_limit = 58 ) :
Timings are :
0.0316879749298
0.0322730541229
Case #2 (upper_limit = 1058 ) :
Timings are :
0.680443048477
0.124517917633
Case #3 (upper_limit = 5058 ) :
Timings are :
3.28129291534
0.47504901886
I got about an order of magnitude faster by avoiding _stack() and only using floats...
edit: added @Divakar's post to time trial...
import numpy as np
from time import time
t = time()
for _ in range(1000): # 1000 iterations of my soln.
    width = 3
    x = np.array(list(range(58,28,-2)) + [float('nan')]*width)
    arr = np.empty([len(x)-width, width+2])
    arr[:,0] = x[:len(x)-width]
    for i in range(len(x)-width):
        arr[i,1:] = x[i:i+width+1]
print(time()-t)
t = time()
for _ in range(1000): # 1000 iterations of OP code
    x = range(30,60,2)[::-1]
    x = np.asarray(x)
    arr = np.empty((0,5), int)
    for i,e in enumerate(x):
        arr2 = np.hstack((x[i], x[i:i+4], np.asarray([None]*5)))[:5]
        arr = np.vstack((arr,arr2))
print(time()-t)
t = time()
for _ in range(1000):
    x = np.array(range(58,28,-2))
    N = 4 # width factor
    x_ext = np.hstack((x,[None]*(N-1)))
    arr2D = x_ext[np.arange(N) + np.arange(x_ext.size-N+1)[:,None]]
    out = np.column_stack((x,arr2D))
print(time()-t)
prints out:
>>> runfile('...temp.py', wdir='...')
0.0160000324249
0.374000072479
0.0319998264313
>>>
Starting with Divakar's padded x
N = 4 # width factor
x_ext = np.concatenate((x,[None]*(N-1)))
Since we aren't doing math on it, padding with None (which makes an object array) or np.nan (which makes a float) shouldn't make much difference.
The column stack could be eliminated with a little change to the indexing:
idx = np.r_[0,np.arange(N)] + np.arange(x_ext.size-N+1)[:,None]
this produces
array([[ 0, 0, 1, 2, 3],
[ 1, 1, 2, 3, 4],
[ 2, 2, 3, 4, 5],
[ 3, 3, 4, 5, 6],
[ 4, 4, 5, 6, 7],
...
so the full result is
x_ext[idx]
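Put together as a runnable sketch, using the question's x and the None padding from above (out is an illustrative name):

```python
import numpy as np

x = np.arange(58, 28, -2)  # 58, 56, ..., 30 (15 values)
N = 4                      # width factor

# Pad with None so the trailing windows have something to index into;
# this makes x_ext an object array.
x_ext = np.concatenate((x, [None] * (N - 1)))

# Column offsets [0, 0, 1, 2, 3] repeat the first item of each row;
# the row offsets shift the window by one element per row.
idx = np.r_[0, np.arange(N)] + np.arange(x_ext.size - N + 1)[:, None]

out = x_ext[idx]
print(out.shape)
print(out[0], out[-1])
```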
================
A different approach is to use striding to create a kind of rolling window.
as_strided = np.lib.stride_tricks.as_strided
arr2D = as_strided(x_ext, shape=(15,4), strides=(4,4))
This is one of the easier applications of as_strided. The shape is straightforward: it is the shape of the desired result (without the repeat column), (x.shape[0],N).
In [177]: x_ext.strides
Out[177]: (4,)
For a 1d array of this type, the step to the next item is 4 bytes. If I reshape the array to 2d with 3 columns, the stride to the next row is 12 bytes, i.e. 3*4 (an offset of 3 elements).
In [181]: x_ext.reshape(6,3).strides
Out[181]: (12, 4)
Using strides=(4,4) means that the step to the next row is just 4 bytes, one element in the original.
as_strided(x_ext,shape=(8,4),strides=(8,4))
produces a 2 item overlap
array([[58, 56, 54, 52],
[54, 52, 50, 48],
[50, 48, 46, 44],
[46, 44, 42, 40],
....
The potentially dangerous part of as_strided is that it is possible to create an array that samples memory outside of the original data buffer. Usually that appears as large random numbers where None appears in this example. It's the same sort of error that you would encounter in C code if you were careless in using array pointers and indexing.
The as_strided array is a view (the repeated values are not copied). So writing to that array could be dangerous. The column_stack with x will make a copy, replicating the repeated values as needed.
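As a hedged alternative to hand-computed byte strides: assuming NumPy 1.20 or later, numpy.lib.stride_tricks.sliding_window_view builds the same rolling window as a read-only view and derives the strides from the array itself. The sketch below pads with np.nan (so the array is float) instead of None; windows and out are illustrative names.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(58, 28, -2).astype(float)  # 58.0, 56.0, ..., 30.0
N = 4                                    # width factor

# Pad with NaN so every starting position has a full window.
x_ext = np.concatenate((x, np.full(N - 1, np.nan)))

# Read-only view of shape (15, 4); row k is x_ext[k:k+N].
windows = sliding_window_view(x_ext, N)

# column_stack copies the data, so the result is safe to modify.
out = np.column_stack((x, windows))
print(out.shape)
print(out[0], out[-1])
```

Since the view is read-only and its bounds are checked, it cannot reach outside the original buffer the way a mistaken as_strided call can.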
I suggest constructing an initial matrix with equal columns and then using np.roll() to rotate them:
import numpy as np
import timeit as ti
import numpy.matlib
x = range(30,60,2)[::-1];
x = np.asarray(x);
def sol1():
    # Your solution, for comparison
    arr = np.empty((0,5), int)
    for i,e in enumerate(x):
        arr2 = np.hstack((x[i], x[i:i+4], np.asarray([None]*5)))[:5]
        arr = np.vstack((arr,arr2))
    return arr
def sol2():
    # My proposal
    x2 = np.hstack((x, [None]*3))
    mat = np.matlib.repmat(x2, 5, 1)
    for i in range(3):
        mat[i+2, :] = np.roll(mat[i+2, :], -(i+1))
    return mat[:,:-3].T
print(ti.timeit(sol1, number=100))
print(ti.timeit(sol2, number=100))
which gives:
0.026760146000015084
0.0038611710006080102
It uses a for loop but it only iterates over the shorter axis. Also, it should not be hard to adapt this code for other configurations instead of using hardcoded numbers.

How to bin a 2D array in numpy?

I'm new to numpy and I have a 2D array of objects that I need to bin into a smaller matrix and then get a count of the number of objects in each bin to make a heatmap. I followed the answer on this thread to create the bins and do the counts for a simple array but I'm not sure how to extend it to 2 dimensions. Here's what I have so far:
data_matrix = numpy.ndarray((500,500),dtype=float)
# fill array with values.
bins = numpy.linspace(0,50,50)
digitized = numpy.digitize(data_matrix, bins)
binned_data = numpy.ndarray((50,50))
for i in range(0,len(bins)):
    for j in range(0,len(bins)):
        k = len(data_matrix[digitized == i:digitized == j]) # <- does not work
        binned_data[i:j] = k
P.S. the [digitized == i] notation on an array will return an array of binary values. I cannot find documentation on this notation anywhere. A link would be appreciated.
You can reshape the array to a four dimensional array that reflects the desired block structure, and then sum along both axes within each block. Example:
>>> a = np.arange(24).reshape(4, 6)
>>> a
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
>>> a.reshape(2, 2, 2, 3).sum(3).sum(1)
array([[ 24, 42],
[ 96, 114]])
If a has the shape m, n, the reshape should have the form
a.reshape(m_bins, m // m_bins, n_bins, n // n_bins)
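The reshape rule can be wrapped in a small helper. This is a minimal sketch that assumes the array dimensions are exact multiples of the bin counts; bin_sum is a hypothetical name, not a NumPy function.

```python
import numpy as np

def bin_sum(a, m_bins, n_bins):
    """Sum a 2-D array over an (m_bins, n_bins) grid of equal blocks."""
    m, n = a.shape
    if m % m_bins or n % n_bins:
        raise ValueError("array shape must be divisible by the bin counts")
    # Axes 1 and 3 run over the rows and columns inside each block.
    blocks = a.reshape(m_bins, m // m_bins, n_bins, n // n_bins)
    return blocks.sum(axis=(1, 3))

a = np.arange(24).reshape(4, 6)
binned = bin_sum(a, 2, 2)
print(binned)
```

Passing axis=(1, 3) does both reductions in one call and gives the same result as .sum(3).sum(1).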
At first I was also going to suggest that you use np.histogram2d rather than reinventing the wheel, but then I realized that it would be overkill to use that and would need some hacking still.
If I understand correctly, you just want to sum over submatrices of your input. That's pretty easy to brute force: going over your output submatrix and summing up each subblock of your input:
import numpy as np
def submatsum(data,n,m):
    # return a matrix of shape (n,m)
    bs = data.shape[0]//n,data.shape[1]//m # block size summed over
    return np.reshape(np.array([np.sum(data[k1*bs[0]:(k1+1)*bs[0],k2*bs[1]:(k2+1)*bs[1]]) for k1 in range(n) for k2 in range(m)]),(n,m))
# set up dummy data
N,M = 4,6
data_matrix = np.reshape(np.arange(N*M),(N,M))
# set up size of 2x3-reduced matrix, assume congruity
n,m = N//2,M//3
reduced_matrix = submatsum(data_matrix,n,m)
# check output
print(data_matrix)
print(reduced_matrix)
This prints
print(data_matrix)
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]]
print(reduced_matrix)
[[ 24 42]
[ 96 114]]
which is indeed the result for summing up submatrices of shape (2,3).
Note that I'm using // for integer division to make sure it's python3-compatible, but in case of python2 you can just use / for division (due to the numbers involved being integers).
Another solution is to have a look at the binArray function on the comments here:
Binning a numpy array
To use your example :
data_matrix = numpy.ndarray((500,500),dtype=float)
binned_data = binArray(data_matrix, 0, 10, 10, np.sum)
binned_data = binArray(binned_data, 1, 10, 10, np.sum)
The result sums every 10x10 square in data_matrix (of size 500x500) to obtain a single value per square in binned_data (of size 50x50).
Hope this helps!

In Tensorflow, how to use tf.gather() for the last dimension?

I am trying to gather slices of a tensor in terms of the last dimension for partial connection between layers. Because the output tensor's shape is [batch_size, h, w, depth], I want to select slices based on the last dimension, such as
# L is intermediate tensor
partL = L[:, :, :, [0,2,3,8]]
However, tf.gather(L, [0, 2,3,8]) seems to only work for the first dimension (right?) Can anyone tell me how to do it?
As of TensorFlow 1.3 tf.gather has an axis parameter, so the various workarounds here are no longer necessary.
https://www.tensorflow.org/versions/r1.3/api_docs/python/tf/gather
https://github.com/tensorflow/tensorflow/issues/11223
There's a tracking bug to support this use-case here: https://github.com/tensorflow/tensorflow/issues/206
For now you can:
transpose your matrix so that dimension to gather is first (transpose is expensive)
reshape your tensor into 1d (reshape is cheap) and turn your gather column indices into a list of individual element indices at linear indexing, then reshape back
use gather_nd. Will still need to turn your column indices into list of individual element indices.
With gather_nd you can now do this as follows:
cat_idx = tf.concat([tf.range(0, tf.shape(x)[0]), indices_for_dim1], axis=0)
result = tf.gather_nd(matrix, cat_idx)
Also, as reported by user Nova in a thread referenced by @Yaroslav Bulatov:
x = tf.constant([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
idx = tf.constant([1, 0, 2])
idx_flattened = tf.range(0, x.shape[0]) * x.shape[1] + idx
y = tf.gather(tf.reshape(x, [-1]), # flatten input
              idx_flattened) # use flattened indices
with tf.Session(''):
    print(y.eval()) # [2 4 9]
The gist is flatten the tensor and use strided 1D addressing with tf.gather(...).
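The index arithmetic is easier to inspect outside a session. The sketch below is a NumPy stand-in for the TensorFlow ops above (np.arange for tf.range, reshape(-1) plus indexing for tf.gather), not TensorFlow code; n_rows and n_cols are illustrative names.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
idx = np.array([1, 0, 2])  # one column index per row

n_rows, n_cols = x.shape

# Row r starts at flat position r * n_cols; adding the column index
# gives the flat position of element (r, idx[r]).
idx_flattened = np.arange(n_rows) * n_cols + idx

# Gather from the flattened array.
y = x.reshape(-1)[idx_flattened]
print(idx_flattened, y)
```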
Yet another solution using tf.unstack(...), tf.gather(...) and tf.stack(..)
Code:
import tensorflow as tf
import numpy as np
shape = [2, 2, 2, 10]
L = np.arange(np.prod(shape))
L = np.reshape(L, shape)
indices = [0, 2, 3, 8]
axis = -1 # last dimension
def gather_axis(params, indices, axis=0):
    return tf.stack(tf.unstack(tf.gather(tf.unstack(params, axis=axis), indices)), axis=axis)
print(L)
with tf.Session() as sess:
    partL = sess.run(gather_axis(L, indices, axis))
    print(partL)
Result:
L =
[[[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]]
[[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]]]
[[[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]]
[[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]]]]
partL =
[[[[ 0 2 3 8]
[10 12 13 18]]
[[20 22 23 28]
[30 32 33 38]]]
[[[40 42 43 48]
[50 52 53 58]]
[[60 62 63 68]
[70 72 73 78]]]]
A correct version of @Andrei's answer would read
cat_idx = tf.stack([tf.range(0, tf.shape(x)[0]), indices_for_dim1], axis=1)
result = tf.gather_nd(matrix, cat_idx)
You can try it this way, for instance (this is common in NLP, at least):
The parameter tensor has shape [batch_size, depth] and the indices are [i, j, k, n, m], a list of length batch_size. Then gather_nd can be helpful.
parameters = tf.constant([
[11, 12, 13],
[21, 22, 23],
[31, 32, 33],
[41, 42, 43]])
targets = tf.constant([2, 1, 0, 1])
batch_nums = tf.range(0, limit=parameters.get_shape().as_list()[0])
indices = tf.stack((batch_nums, targets), axis=1) # the axis is the dimension number
items = tf.gather_nd(parameters, indices)
# which is what we want: [13, 22, 31, 42]
This snippet first locates the position along the first dimension through batch_nums and then fetches the item along that dimension using the target number.
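For intuition, the same (row, column) pairing can be written with NumPy integer-array indexing. This is a NumPy analogue of the tf.stack plus tf.gather_nd pattern, not TensorFlow code, and it reuses the data from the snippet above.

```python
import numpy as np

parameters = np.array([[11, 12, 13],
                       [21, 22, 23],
                       [31, 32, 33],
                       [41, 42, 43]])
targets = np.array([2, 1, 0, 1])

# One row number per batch entry, paired with its target column.
batch_nums = np.arange(parameters.shape[0])
indices = np.stack((batch_nums, targets), axis=1)  # shape (4, 2)

# Each (row, col) pair selects a single element.
items = parameters[indices[:, 0], indices[:, 1]]
print(indices.tolist())
print(items)
```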
Tensor doesn't have a shape attribute, only a get_shape() method. The code below runs on Python 2.7:
import tensorflow as tf
import numpy as np
x = tf.constant([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
idx = tf.constant([1, 0, 2])
idx_flattened = tf.range(0, x.get_shape()[0]) * x.get_shape()[1] + idx
y = tf.gather(tf.reshape(x, [-1]), # flatten input
              idx_flattened) # use flattened indices
with tf.Session(''):
    print(y.eval()) # [2 4 9]
Implementing 2. from @Yaroslav Bulatov's answer:
#Your indices
indices = [0, 2, 3, 8]
#Remember for final reshaping
n_indices = tf.shape(indices)[0]
flattened_L = tf.reshape(L, [-1])
#Walk strided over the flattened array
offset = tf.expand_dims(tf.range(0, tf.reduce_prod(tf.shape(L)), tf.shape(L)[-1]), 1)
flattened_indices = tf.reshape(tf.reshape(indices, [-1])+offset, [-1])
selected_rows = tf.gather(flattened_L, flattened_indices)
#Final reshape
partL = tf.reshape(selected_rows, tf.concat([tf.shape(L)[:-1], [n_indices]], axis=0))
Credit to How to select rows from a 3-D Tensor in TensorFlow?
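The offset arithmetic in that answer can be checked with a NumPy stand-in, sketched here under the assumption that L has the shape [2, 2, 2, 10] used earlier in this thread. The ops mirror the TensorFlow ones (reshape, range, gather) but this is not TensorFlow code.

```python
import numpy as np

shape = (2, 2, 2, 10)
L = np.arange(np.prod(shape)).reshape(shape)
indices = np.array([0, 2, 3, 8])

# Start offset of every last-dimension row in the flattened array.
offset = np.arange(0, L.size, L.shape[-1])[:, None]  # shape (8, 1)

# Broadcast: each row offset plus each selected column index.
flattened_indices = (indices + offset).ravel()

# Gather from the flat array, then restore the leading dimensions.
partL = L.ravel()[flattened_indices].reshape(L.shape[:-1] + (len(indices),))

print(partL.shape)
print(np.array_equal(partL, L[..., indices]))
```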
