Accessing ndarray and discarding invalid positions - Python

I have a question about accessing a matrix position that does not actually exist.
First, I have a matrix with rows rows and cols columns. From this matrix, I have to get sets of n x n submatrices. For example, to get 3 x 3 submatrices, I do the following:
for x, y in product(range(1, matrix.rows-1), range(1, matrix.cols-1)):
    bootstrap_3x3 = npr.choice(matrix.data[x-1:x+2, y-1:y+2].flatten(), size=(3, 3), replace=True)
But, as can be seen, I'm not considering the extremes, and I have to. For x = 0 and y = 0, for example, I should consider matrix.data[x:x+2, y:y+2] (the center should be the current x and y), returning a 3 x 3 with the first row/column = 0.
I know that I can achieve this with some if statements. But I guess Python should have a clever way to do this properly.
Thank you in advance.

I would make a new matrix, padded with (n-1)/2 zeros around it:
import numpy as np
rows, cols = 4, 6
n = 3
d = (n - 1) // 2  # integer pad width; np.pad rejects a float
data = np.arange(rows*cols).reshape(rows, cols)
padded = np.pad(data, d, mode='constant')
for x, y in np.indices(data.shape).reshape(2, -1).T:
    sub = padded[x:x+n, y:y+n]  # n x n window centered on data[x, y]
    print(sub)
    bootstrap_nxn = np.random.choice(sub.ravel(), (n, n))
This assumes n is odd, and that the submatrix center is always within the original data matrix. If n is even, the center of the submatrix isn't well defined.
If you actually want submatrices that overlap the data matrix by as little as one row or column, then you'd need to pad with n-1 zeros (and in that case even vs odd n won't matter).
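If NumPy 1.20 or later is available, the same padded windows can be taken all at once instead of one per loop iteration. The following is a minimal sketch under that assumption; it uses numpy.lib.stride_tricks.sliding_window_view, which is not part of the answer above, and the name windows is only illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rows, cols = 4, 6
n = 3                      # window size, assumed odd
d = (n - 1) // 2           # pad width on every side
data = np.arange(rows * cols).reshape(rows, cols)
padded = np.pad(data, d, mode='constant')

# windows[x, y] is the n x n block of `padded` centered on data[x, y]
windows = sliding_window_view(padded, (n, n))

rng = np.random.default_rng(0)
# one bootstrap resample (with replacement) of the window around (0, 0)
bootstrap_nxn = rng.choice(windows[0, 0].ravel(), size=(n, n), replace=True)
```

Here windows has shape (rows, cols, n, n) and is a read-only view of padded, so no per-window copies are made until a window is resampled.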

Related

randomly sampling arrays - issue with numpy.delete

I have 2 arrays, x_1g and x_2g. I want to randomly sample 10% of each array and remove that 10% and insert it into the other array. This means that my final and initial arrays should have the same shape, but 10% of the data is randomly sampled from the other array. I have been trying this with the code below but my arrays keep increasing in length, meaning I haven't properly deleted the sampled 10% data from each array.
n = len(x_1g)
n2 = round(n/10)
ints1 = np.random.choice(n, n2)
x_1_replace = x_1g[ints1,:]
x_1 = np.delete(x_1g, ints1, 0)
x_2_replace = x_2g[ints1,:]
x_2 = np.delete(x_2g, ints1, 0)
My arrays x_1g and x_2g have shape (1502983, 10):
x_1g.shape
>> (1502983, 10)
x_1_replace.shape
>> (150298, 10)
so when I remove the 10% data (x_1_replace) from my original array (x_1g) I should get the array shape:
1502983-150298 = 1352685
However when I check the shape of my array x_1 I get:
x_1.shape
>> (1359941, 10)
I'm not sure what is going on here so if anyone has any suggestions please let me know!!
By generating your indices with ints1 = np.random.choice(n, n2), you draw n2 numbers between 0 and n-1 with replacement, so nothing guarantees that they are all different, and you most likely get some duplicates. If the same index is passed to np.delete several times, that row is deleted only once. You can check this by counting the unique values in ints1:
np.unique(ints1).shape
You'll see that it does not match n2 (in your example, you'll get (143042,)).
There's probably more than one way to ensure that you get n2 different indices; here is one example:
n = len(x_1g)
n2 = round(n/10)
ints1 = np.arange(n) # generating an array [0 ... n-1]
np.random.shuffle(ints1) # shuffle it
ints1 = ints1[:n2] # take the first n2 values
x_1_replace = x_1g[ints1,:]
x_1 = np.delete(x_1g, ints1, 0)
x_2_replace = x_2g[ints1,:]
x_2 = np.delete(x_2g, ints1, 0)
Now you can check:
x_1.shape
# (1352685, 10)
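The choice function also accepts replace=False, which draws distinct indices directly. Below is a small sketch with made-up stand-in data; the array sizes and the names x_1_new and x_2_new are only for illustration, and it uses the newer np.random.default_rng generator rather than np.random.choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x_1g = rng.random((1000, 10))   # stand-in data, much smaller than the real arrays
x_2g = rng.random((1000, 10))

n = len(x_1g)
n2 = round(n / 10)
ints1 = rng.choice(n, n2, replace=False)   # n2 distinct row indices

x_1_replace = x_1g[ints1, :]
x_1 = np.delete(x_1g, ints1, 0)
x_2_replace = x_2g[ints1, :]
x_2 = np.delete(x_2g, ints1, 0)

# append each sampled block to the other array, restoring the original shapes
x_1_new = np.concatenate([x_1, x_2_replace])
x_2_new = np.concatenate([x_2, x_1_replace])
```

Because every index in ints1 is unique, np.delete removes exactly n2 rows, and the arrays after the swap have the same shape as the originals.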

simplify construction of sparse (transition) matrix

I am constructing a transition matrix from a n1 x n2 x ... x nN x nN array. For concreteness let N = 3, e.g.,
import numpy as np
# example with N = 3
n1, n2, n3 = 3, 2, 5
dim = (n1, n2, n3)
arr = np.random.random_sample(dim + (n3,))
Here arr contains transition probabilities between 2 states, where the "from"-state is indexed by the first 3 dimensions, and the "to"-state is indexed by the first 2 and the last dimension. I want to construct a transition matrix, which expresses these probabilities raveled into a sparse (n1*n2*n3) x (n1*n2*n3) matrix.
To clarify, let me provide my current approach that does what I want to do. Unfortunately, it's slow and doesn't work when N and n1, n2, ... are large. So I am looking for a more efficient way of doing the same that scales better for larger problems.
My approach
import numpy as np
from scipy import sparse as sparse
## step 1: get the index corresponding to each dimension of the from and to state
# ravel axes 1 to 3 into single axis and make sparse
spmat = sparse.coo_matrix(arr.reshape(np.prod(dim), -1))
data = spmat.data
row = spmat.row
col = spmat.col
# use unravel to get idx for
row_unravel = np.array(np.unravel_index(row, dim))
col_unravel = np.array(np.unravel_index(col, n3))
## step 2: combine "to" index with rows 1 and 2 of "from"-index to get "to"-coordinates in full state space
row_unravel[-1, :] = col_unravel # first 2 dimensions of state do not change
colnew = np.ravel_multi_index(row_unravel, dim) # ravel back to 1d
## step 3: assemble transition matrix
out = sparse.coo_matrix((data, (row, colnew)), shape=(np.prod(dim), np.prod(dim)))
Final thought
I will be running this code many times. Across iterations, the data of arr may change, but the dimensions will stay the same. So one thing I could do is to save and load row and colnew from a file, skipping everything between the definition of data (line 2) and final line of my code. Do you think this would be the best approach?
Edit: One problem I see with this strategy is that if some elements of arr are zero (which is possible) then the size of data will change across iterations.
Here is one approach that beats the one posted in the question, though I'm not sure it's the most efficient.
import numpy as np
from scipy import sparse
# get col and row indices
idx = np.arange(np.prod(dim))
row = idx.repeat(dim[-1])
col = idx.reshape(-1, dim[-1]).repeat(dim[-1], axis=0).ravel()
# get the data
data = arr.ravel()
# construct the sparse matrix
out = sparse.coo_matrix((data, (row, col)), shape=(np.prod(dim), np.prod(dim)))
Two things that could be improved:
(1) If arr is sparse, the output matrix out will store its zeros as explicit entries.
(2) The approach relies on the new state being the last dimension of dim. It would be nice to generalize it so that the last axis of arr can replace any of the originating axes, not just the last one.
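Point (1) can be handled by masking out the zero entries before building the matrix. The following is a sketch under the same setup as the question; the threshold used to create zeros and the names size and keep are illustrative assumptions.

```python
import numpy as np
from scipy import sparse

n1, n2, n3 = 3, 2, 5
dim = (n1, n2, n3)
rng = np.random.default_rng(0)
arr = rng.random(dim + (n3,))
arr[arr < 0.3] = 0.0            # force some exact zeros, for illustration only

size = int(np.prod(dim))

# row and col depend only on dim, not on the values in arr
idx = np.arange(size)
row = idx.repeat(dim[-1])
col = idx.reshape(-1, dim[-1]).repeat(dim[-1], axis=0).ravel()

data = arr.ravel()
keep = data != 0                # drop zeros so they are not stored explicitly

out = sparse.coo_matrix((data[keep], (row[keep], col[keep])), shape=(size, size))
```

Since row and col depend only on dim, they can be computed once and reused across iterations; only data and keep need to be recomputed when arr changes.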

Numpy median-of-means computation across unequal-sized array

Assume a numpy array X of shape m x n and type float64. The rows of X need to pass through an element-wise median-of-means computation. Specifically, the m row indices are partitioned into b "buckets", each containing m/b such indices. Next, within each bucket I compute the mean and across the resulting means I do a final median computation.
An example that clarifies it is
import numpy as np
m = 10
n = 10000
# A random data matrix
X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)
# Number of buckets to split rows into
b = 5
# Partition the rows of X into b buckets
row_indices = np.arange(X.shape[0])
buckets = np.array(np.array_split(row_indices, b))
X_bucketed = X[buckets, :]
# Compute the mean within each bucket
bucket_means = np.mean(X_bucketed, axis=1)
# Compute the median-of-means
median = np.median(bucket_means, axis=0)
# Edit - Method 2 (based on answer)
np.random.shuffle(row_indices)
X = X[row_indices, :]
buckets2 = np.array_split(X, b, axis=0)
bucket_means2 = [np.mean(x, axis=0) for x in buckets2]
median2 = np.median(np.array(bucket_means2), axis=0)
This program works fine if b divides m since np.array_split() results in partitioning the indices in equal parts and array buckets is a 2D array.
However, it does not work if b does not divide m. In that case, np.array_split() still splits into b buckets but of unequal sizes, which is fine for my purposes. For example, if b = 3 it will split the indices {0,1,...,9} into [0 1 2 3], [4 5 6] and [7 8 9]. Those arrays cannot be stacked onto one another, so buckets is not a 2D array and cannot be used to index X.
How can I make this work for unequal-sized buckets, i.e., to have the program compute the mean within each bucket (irrespective of its size) and then the median across the buckets?
I cannot fully grasp masked arrays and I am not sure if those can be used here.
You can compute each bucket's mean separately, then stack the means and compute the median. You can also apply array_split to X directly; there is no need to index it with a split index array (maybe this was your main question?).
m = 11
n = 10000
# A random data matrix
X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)
# Number of buckets to split rows into
b = 5
# Partition the rows of X into b buckets
buckets = np.array_split(X, b, axis=0)
# Compute the mean within each bucket
b_means = [np.mean(x, axis=0) for x in buckets]
# Compute the median-of-means
median = np.median(np.array(b_means), axis=0)
print(median) #(10000,) shaped array
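If the Python-level loop over buckets becomes a bottleneck, the per-bucket sums can be computed in one call with np.add.reduceat. This is an alternative sketch, not the method from the answer above; the function name median_of_means is hypothetical, and it assumes consecutive row blocks with 1 <= b <= m so that no bucket is empty.

```python
import numpy as np

def median_of_means(X, b):
    """Median across b bucket means; buckets are consecutive row blocks.

    Assumes 1 <= b <= X.shape[0], so that no bucket is empty.
    """
    m = X.shape[0]
    # same bucket sizes as np.array_split: the first m % b buckets get one extra row
    sizes = np.full(b, m // b)
    sizes[: m % b] += 1
    starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    # sum the rows of each bucket in one call, then divide by the bucket size
    bucket_sums = np.add.reduceat(X, starts, axis=0)
    bucket_means = bucket_sums / sizes[:, None]
    return np.median(bucket_means, axis=0)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(11, 1000))
median = median_of_means(X, 5)   # array of shape (1000,)
```

To use random rather than consecutive buckets, shuffle the rows of X first, as in Method 2 of the question.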

way to create a 3d matrix of 2 vectors and 1 matrix

Hello, I have a question regarding a problem I am facing in Python. I was studying tensors and I saw that each row/column of a tensor must have the same size. Is it possible to create a tensor, perhaps a 3D object or matrix, where let's say we have 3 axes: x, y, z?
In the x axis I want to create a vector to work as an index. So let x be from 0 to N.
Then on the y axis I want to have N random integer vectors of size m (where mm
Is it possible?
My first approach was to create a big vector of size Nm and a big matrix of dimensions (Nm, Nm) where I would store all my random vectors and matrices; then, if I wanted to change, for example, my second vector, I would have to play with the indexes. However, is there another way to approach this problem with tensors or numpy that I'm unaware of?
Thank you in advance for your advice.
First vector, N = 3, [1,2, 3]
Second N vectors with length m, m = 2
[[4,5], [6,7], [7,8]]
So, N matrices of size (m,m)
[[[1,1], [2,2]], [[1,1], [2,2]], [[1,1], [2,2]] ]
Let's create numpy arrays from them.
import numpy as np
N = 3
m = 2
a = np.array([1,2,3])
b = np.random.randn(N, m)
c = np.random.randn(N, m, m)
You see the problem here? The last matrix c has already 3 dimensions according to your definitions.
Your argument can be simplified.
Let's say our final matrix is -
a = np.zeros((3,2,2)) # 3 dimensions, x,y,z
1) For first dimension -
a[0,:,:] = 0 # first axis, first index = 0
a[1,:,:] = 1 # first axis, 2nd index = 1
a[2,:,:] = 2 # first axis, 3rd index = 2
2) Now, we need to fill up the rest of the positions, but dimensions don't match up.
So, it's better to create separate tensors for them.
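One way to read "separate tensors" is to keep three arrays that share the same first axis of length N, so that index i selects the i-th scalar, vector, and matrix together. A minimal sketch, assuming random integers in an arbitrary range 0..9 (the range and the names index, vectors, and matrices are illustrative, not from the question):

```python
import numpy as np

N, m = 3, 2
rng = np.random.default_rng(0)

index = np.arange(N)                             # shape (N,)
vectors = rng.integers(0, 10, size=(N, m))       # N vectors of length m
matrices = rng.integers(0, 10, size=(N, m, m))   # N matrices of size (m, m)

# replace only the second vector; nothing else needs re-indexing
vectors[1] = [4, 5]

# everything that belongs to entry i = 1
i = 1
entry = (index[i], vectors[i], matrices[i])
```

Each array keeps its own natural shape, and changing one vector or matrix is a single indexed assignment rather than offset arithmetic in a large flat container.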

Python: Filtering numpy values based on certain columns

I'm trying to create a method for evaluating co-ordinates for a project that's due in about a week.
Assume that I'm working in a 3D Cartesian co-ordinate system whose points are stored as rows in a numpy array. I am trying to check whether 'z' (n[i, 2]) values exist for the corresponding, predetermined 'x' (n[i, 0]) and 'y' (n[i, 1]) values.
In the case where the values that are assigned are scalars, I am content to think that:
# Given that n is some numpy array
x, y = 2,3
out = []
for i in range(0,n.shape[0]):
    if n[i, 0] == x and n[i,1] == y:
        out.append(n[i,2])
However, where the sorrow comes in is having to check if the values in another numpy array are in the original numpy array 'n'.
# Given that n is the numpy array that is to be searched
# Given that x contains the 'search elements'
out = []
for i in range(0,n.shape[0]):
    for j in range(0, x.shape[0]):
        if n[i, 0] == x[j,0] and n[i,1] == x[j,1]:
            out.append(n[i,2])
The issue with doing it this way is that the 'n' matrix in my application may well be in excess of 100 000 lines long.
Is there a more efficient way of performing this function?
This might be more efficient than nested loops:
out = []
for row in x:
    idx = np.equal(n[:,:2], row).all(1)
    out.extend(n[idx,2].tolist())
Note this assumes that x is of shape (?, 2). Otherwise, if it has more than two columns, just change row to row[:2] in the loop body.
Numpythonic solution without loops.
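This solution assumes that the x and y coordinates are non-negative. Note that it compares dot products with squared norms, which is only a necessary condition for equality: a search pair (1, 2) would also match a row with x, y = (5, 0), since both give 5. It works for the sample data below.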
import numpy as np
# Using a for x and b for n, to avoid confusion with x,y coordinates and array names
a = np.array([[1,2],[3,4]])
b = np.array([[1,2,10],[1,2,11],[3,4,12],[5,6,13],[3,4,14]])
# Adjust the shapes by taking the z coordinate as 0 in a and take the dot product with b transposed
a = np.insert(a,2,0,axis=1)
dot_product = np.dot(a,b.T)
# Reshape a**2 to check the dot product values corresponding to exact values in the x, y coordinates
sum_reshaped = np.sum(a**2,axis=1).reshape(a.shape[0],1)
# Match values for individual elements in a. Can be used if you want z coordinates corresponding to some x, y separately
individual_indices = ( dot_product == np.tile(sum_reshaped,b.shape[0]) )
# Take OR of column values and take z if at least one x,y present
indices = np.any(individual_indices, axis=0)
print(b[:,2][indices]) # prints [10 11 12 14]
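An alternative loop-free sketch compares the (x, y) columns directly with broadcasting, so it does not rely on the dot-product test. It uses the question's names n and x; the intermediate names matches and mask are illustrative. It builds a boolean array with len(n) * len(x) * 2 entries, so it assumes that product fits in memory.

```python
import numpy as np

# x holds the (x, y) pairs to search for, n holds (x, y, z) rows
x = np.array([[1, 2], [3, 4]])
n = np.array([[1, 2, 10], [1, 2, 11], [3, 4, 12], [5, 6, 13], [3, 4, 14]])

# matches[i, j] is True when row i of n has the same (x, y) as row j of x
matches = (n[:, None, :2] == x[None, :, :]).all(axis=2)
# keep rows of n that match at least one search pair
mask = matches.any(axis=1)
out = n[mask, 2]
print(out)  # prints [10 11 12 14]
```

The z values come back in the row order of n, and each row of n is reported at most once even if x contains repeated pairs.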
