How to create indicator matrix from very large dataset in python

I'm running into confusingly large memory requirements for a relatively simple problem.
I have an ordered array of length N (index corresponds to sample ID) containing either an integer value or NaN.
I want to generate an indicator matrix of dimension N by N such that position (i, j) is 1 if samples i and j both have a non-NaN value in the original array, and 0 otherwise (because the matrix is symmetric, I don't care about position (j, i)).
To pare back on memory requirements, I've implemented the following code, which instead of generating a square matrix creates an array representing the condensed square matrix (i.e., what squareform would generate). But for an initial list of 66,000 entries, this script requires over 80 GB of memory! I think the map line in get_condensed_indeces is to blame, but I don't know how to fix it. If anyone has suggestions for reducing memory use, please share!
Code below, should work with any input array.
import itertools
import numpy as np

def ind_matrix(x):
    # the condensed form of an N x N symmetric matrix has N * (N - 1) / 2 entries
    ind = np.zeros(len(x) * (len(x) - 1) // 2, dtype=np.float32)
    mask = np.where(~np.isnan(x))[0]
    targets = get_condensed_indeces(len(x), mask)
    ind[targets] += 1
    return ind

def get_condensed_indeces(n, desired_elements):
    # args:
    #   n - number of cells in the current cluster
    #   desired_elements - array of numpy indices that specify
    #                      cells in a given cluster
    # list() is needed in Python 3, where map returns a lazy iterator
    return list(map(
        index_converter,
        [[n, x[0], x[1]] for x in itertools.combinations(desired_elements, 2)]
    ))

def index_converter(x):
    # mapping from position (i, j) in the square matrix to the index
    # in the condensed (squareform) 1D array
    n, i, j = x[0], x[1], x[2]
    return n * i - i * (i + 1) // 2 + j - 1 - i
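One suggestion, not from the original thread: the map call plus the itertools.combinations list builds billions of Python objects, which dwarfs the output array itself. The same condensed indices can be computed with vectorized NumPy arithmetic, and since the entries are only ever 0 or 1, a uint8 output quarters the footprint of float32. A minimal sketch, assuming the same squareform index convention as index_converter:

import numpy as np

def ind_matrix_vectorized(x):
    n = len(x)
    mask = np.flatnonzero(~np.isnan(x))               # positions with valid values
    # all pairs (i, j) with i < j drawn from the valid positions
    a, b = np.triu_indices(len(mask), k=1)
    i, j = mask[a], mask[b]
    # vectorized form of the index_converter formula
    targets = n * i - i * (i + 1) // 2 + j - i - 1
    ind = np.zeros(n * (n - 1) // 2, dtype=np.uint8)  # 0/1 entries, one byte each
    ind[targets] = 1
    return ind

One caveat: with 66,000 mostly valid samples the condensed array still has about 2.2 billion cells (roughly 2.2 GB even at one byte each), and the temporary index arrays are several times that, so if most values are non-NaN it may be cheaper to compute the complement, i.e. mark only the pairs that involve a NaN position.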

Related

Replace rows in a numpy 2d array with rows from another 2d array

I have two 2d arrays, let's call them img (m x n) and means (k x n), and a list of length m, let's call it clusters. I'd like to replace rows in the img array with rows from the means array, selecting the row from means based on the value in the clusters list. For example:
Suppose, img
img = np.array([[0.40784314, 0.48627451, 0.52549022],
                [0.05490196, 0.1254902, 0.2]])  # (m x n) array
And, means
means = np.array([[0.80551694, 0.69010299, 0.17438512],
                  [0.33569541, 0.45309059, 0.52275014]])  # (k x n) array
And, clusters list
clusters = [1, 0] # list of length m
The desired output is
[[0.33569541 0.45309059 0.52275014]
 [0.80551694 0.69010299 0.17438512]]  # (m x n) array, same shape as img
Notice that the first row has been replaced with the second row from the means array because clusters[0] == 1, and the second row has been replaced with the first row from the means array because clusters[1] == 0, and so forth.
I am able to do this using the following line of code, but I was wondering if there is a faster way of doing it.
np.array([means[i] for i in clusters])
What you're looking for is called advanced indexing:
>>> means[clusters]
array([[0.33569541, 0.45309059, 0.52275014],
       [0.80551694, 0.69010299, 0.17438512]])
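One small note beyond the original answer: indexing with a Python list makes NumPy convert it to an array on every call, so if this lookup runs many times it should be slightly cheaper to convert once up front with clusters = np.asarray(clusters) and reuse that. Either way, means[clusters] replaces the Python-level loop of the list comprehension with a single vectorized gather.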

Is there a conventional way to create a tensor with variable elements in tensorflow 2.0?

I'm trying to implement a machine learning model from a research paper in tensorflow, which requires that I create a block diagonal matrix as a tensor. In the paper, they give the formula for the blocks, which looks something like
[[cos(x * theta_1), sin(x * theta_2)], [-sin(x * theta_3), cos(x * theta_4)]]
I've made a function in tf which takes in x and returns the block diagonal matrix, but this matrix is going to be used hundreds of thousands of times over a training cycle, so I'd like to find a way to avoid creating it from scratch every time I need it. Unfortunately, because x could be any real number within a range, I can't just create a matrix for every possible value of x and store them in a list for later use.
I'm wondering if there is a way to create the matrix such that it includes the variable x in some of its elements so that I can do something like
"create_tensor_from_schematic(tensor_with_variables, value_of_x)"
and have it return the tensor evaluated for that value of x, saving me from having to reconstruct the diagonal matrix every time.
This matrix is a key component in a function that sits right in the middle of my model and is utilized by every training and testing sample once every epoch. Here's the code for that matrix:
def build_D(self, n, j):
    def theta(k):
        return (2 * math.pi * k) / n

    def A(k, j):
        j_thetak = j * theta(k)
        return tf.convert_to_tensor([[math.cos(j_thetak), math.sin(j_thetak)],
                                     [-math.sin(j_thetak), math.cos(j_thetak)]],
                                    dtype=tf.float32)

    if n % 2 == 1:  # n is odd
        s = int((n - 1) / 2)
        block_list_A = [tf.reshape(tf.convert_to_tensor([1], dtype=tf.float32), [1, 1])] \
            + [A(k, j) for k in range(1, s + 1)]
    else:  # n is even
        s = int((n - 2) / 2)
        last_term = (-1) ** j
        block_list_A = [tf.reshape(tf.convert_to_tensor([1], dtype=tf.float32), [1, 1])] \
            + [A(k, j) for k in range(1, s + 1)] \
            + [tf.reshape(tf.convert_to_tensor([last_term], dtype=tf.float32), [1, 1])]
    return tf.linalg.LinearOperatorBlockDiag(list(
        map(lambda x: tf.linalg.LinearOperatorFullMatrix(x), block_list_A))).to_dense()
(This version of the code is the one I'm currently using. It only supports integer-valued j, which lets me create the matrix for every j in my range and store the results in a list, but in the future j will be real-valued, and I obviously can't create a matrix for every possible j value.)
j is the only variable that changes, so it would be nice if there were a way to run this once and then just substitute the j values into the matrix whenever I need the matrix corresponding to a certain j value.
I wondered if it was possible to create a tensor with lambda expressions as elements but I can't imagine how I could pass an argument to them.
Is there an inbuilt, conventional way to create something like a tensor schematic in tensorflow? What are my options? Any ideas are appreciated.
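No answer was recorded here, but one common pattern is worth sketching (my suggestion, not from the original thread): express the blocks with TensorFlow ops applied to a j tensor and wrap the builder in tf.function, so the graph is traced once per value of n and then re-executed cheaply for any real-valued j. Note that (-1) ** j only makes sense for integer j; below it is generalized to cos(pi * j), which is an assumption about the intended extension to real j:

import math
import tensorflow as tf

@tf.function
def build_D_traced(j, n=7):
    # j: scalar float32 tensor; n: Python int, fixed when the function is traced
    s = (n - 1) // 2 if n % 2 == 1 else (n - 2) // 2
    k = tf.range(1, s + 1, dtype=tf.float32)
    j_theta = j * 2.0 * math.pi * k / n          # j * theta_k for k = 1..s
    c, sn = tf.cos(j_theta), tf.sin(j_theta)
    blocks = [tf.ones((1, 1))]
    for i in range(s):  # s is a Python int, so this loop unrolls at trace time
        blocks.append(tf.stack([tf.stack([c[i], sn[i]]),
                                tf.stack([-sn[i], c[i]])]))
    if n % 2 == 0:
        # hypothetical real-j generalization of the (-1) ** j last term
        blocks.append(tf.reshape(tf.cos(math.pi * j), (1, 1)))
    return tf.linalg.LinearOperatorBlockDiag(
        [tf.linalg.LinearOperatorFullMatrix(b) for b in blocks]).to_dense()

# traced once for n=7, then reused for any scalar float32 j
D = build_D_traced(tf.constant(0.37))

Passing j as a tensor (rather than a Python float) matters: tf.function retraces on new Python values but reuses the trace for tensors of the same shape and dtype.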

Efficiently setting 1D range values in a DataFrame (or an ndarray) with a boolean array

PREREQUISITE
import numpy as np
import pandas as pd
INPUT1: boolean 2d array (a sample array as below)
x = np.array(
    [[False,False,False,False,True],
     [True,False,False,False,False],
     [False,False,True,False,True],
     [False,True,True,False,False],
     [False,False,False,False,False]])
INPUT2: 1D range values (a sample as below)
y = np.array([1,2,3,4])
EXPECTED OUTPUT: 2D ndarray
[[0,0,0,0,1],
 [1,0,0,0,2],
 [2,0,1,0,1],
 [3,1,1,0,2],
 [4,2,2,0,3]]
I want to efficiently set a range of values (a vertical vector) starting at each True in the 2d ndarray (INPUT1). Are there useful APIs or solutions for this purpose?
Unfortunately I couldn't come up with an elegant solution, so I came up with multiple inelegant ones. The two main approaches I could think of are:
1. brute-force looping over each True value and assigning slices, and
2. using a single indexed assignment to replace the necessary values.
It turns out that the time complexity of these approaches is non-trivial, so depending on the size of your array either can be faster.
Using your example input:
import numpy as np

x = np.array(
    [[False,False,False,False,True],
     [True,False,False,False,False],
     [False,False,True,False,True],
     [False,True,True,False,False],
     [False,False,False,False,False]])
y = np.array([1,2,3,4])
refout = np.array([[0,0,0,0,1],
                   [1,0,0,0,2],
                   [2,0,1,0,1],
                   [3,1,1,0,2],
                   [4,2,2,0,3]])

# alternative input with arbitrary size:
# N = 100; x = np.random.rand(N,N) < 0.2; y = np.arange(1,N)
def looping_clip(x, y):
    """Loop over Trues, use clipped slices"""
    nmax = x.shape[0]
    n = y.size
    # initialize output
    out = np.zeros_like(x, dtype=y.dtype)
    # loop over True values
    for i, j in zip(*x.nonzero()):
        # truncate right-hand side where necessary
        out[i:i+n, j] = y[:nmax-i]
    return out

def looping_expand(x, y):
    """Loop over Trues, use an expanded buffer"""
    n = y.size
    nmax, mmax = x.shape
    ivals, jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax, 0), mmax), dtype=y.dtype)
    # loop over True values
    for i, j in zip(ivals, jvals):
        # slice will always be complete, i.e. of length y.size
        out[i:i+n, j] = y
    return out[:nmax, :].copy()  # rather not return a view to an auxiliary array

def index_2d(x, y):
    """Assign directly with 2d indices, use an expanded buffer"""
    n = y.size
    nmax, mmax = x.shape
    ivals, jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax, 0), mmax), dtype=y.dtype)
    # now we can safely index for each "(ivals:ivals+n, jvals)" so to speak
    upped_ivals = ivals[:, None] + np.arange(n)        # shape (ntrues, n)
    upped_jvals = jvals.repeat(y.size).reshape(-1, n)  # shape (ntrues, n)
    out[upped_ivals, upped_jvals] = y                  # right-hand side of shape (n,) broadcasts
    return out[:nmax, :].copy()  # rather not return a view to an auxiliary array

def index_1d(x, y):
    """Assign using linear indices, use an expanded buffer"""
    n = y.size
    nmax, mmax = x.shape
    ivals, jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax, 0), mmax), dtype=y.dtype)
    # grab linear indices corresponding to Trues in a buffed-up array
    inds = np.ravel_multi_index((ivals, jvals), out.shape)
    # now all we need to do is start stepping along rows for each item and assign y
    upped_inds = inds[:, None] + mmax*np.arange(n)  # shape (ntrues, n)
    out.flat[upped_inds] = y                        # y of shape (n,) broadcasts to (ntrues, n)
    return out[:nmax, :].copy()  # rather not return a view to an auxiliary array

# check that the results are correct
print(all([np.array_equal(refout, looping_clip(x, y)),
           np.array_equal(refout, looping_expand(x, y)),
           np.array_equal(refout, index_2d(x, y)),
           np.array_equal(refout, index_1d(x, y))]))
I tried to document each function, but here's a synopsis:
looping_clip loops over every True value in the input and assigns to a corresponding slice in the output. We take care on the right-hand side to shorten the assigned array for when part of the slice would go beyond the edge of the array along the first dimension.
looping_expand loops over every True value in the input and assigns to a corresponding full slice in the output after allocating a padded output array ensuring that every slice will be full. We do more work when allocating a larger output array, but we don't have to shorten the right-hand side on assignment. We could omit the .copy() call in the last step, but I prefer not to return a nontrivially strided array (i.e. a view to an auxiliary array rather than a proper copy) as this might lead to obscure surprises for the user.
index_2d computes the 2d indices of every value to be assigned to, and assumes that duplicate indices will be handled in order. This is not guaranteed! (More on this a bit later.)
index_1d does the same using linearized indices and indexing into the flatiter of the output.
Here are the timings of the above methods using random arrays (see the commented line near the start); the benchmark plot itself is not reproduced here.
What we can see is that for small and large arrays the looping versions are faster, but for linear sizes between roughly 10 and 150 the indexing versions are better. The reason I didn't go to higher sizes is that the indexing cases start to use a lot of memory, and I didn't want to have to worry about this messing with timings.
Just to make the above worse, note that the indexing versions assume that duplicate indices in a fancy indexing scenario are handled in order, so when True values are handled which are "lower" in the array, previous values will be overwritten as per your requirements. There's only one problem: this is not guaranteed:
For advanced assignments, there is in general no guarantee for the iteration order. This means that if an element is set more than once, it is not possible to predict the final result.
This doesn't sound very encouraging. While in my experiments it seems that the indices are handled in order (according to C order), this could also be coincidence, or an implementation detail. So if you want to use the indexing versions, make sure that on your specific version and with your specific dimensions and shapes this still holds true.
We can make the assignment safer by getting rid of duplicate indices ourselves. For this we can make use of this answer by Divakar on a corresponding question:
def index_1d_safe(x, y):
    """Same as index_1d but use Divakar's safe solution for reducing duplicates"""
    n = y.size
    nmax, mmax = x.shape
    ivals, jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax, 0), mmax), dtype=y.dtype)
    # grab linear indices corresponding to Trues in a buffed-up array
    inds = np.ravel_multi_index((ivals, jvals), out.shape)
    # now all we need to do is start stepping along rows for each item and assign y
    upped_inds = inds[:, None] + mmax*np.arange(n)  # shape (ntrues, n)
    # now comes https://stackoverflow.com/a/44672126
    # need additional step: flatten upped_inds and corresponding y values for selection
    upped_flat_inds = upped_inds.ravel()  # shape (ntrues, n) -> (ntrues*n,)
    y_vals = np.broadcast_to(y, upped_inds.shape).ravel()  # shape (ntrues, n) -> (ntrues*n,)
    sidx = upped_flat_inds.argsort(kind='mergesort')
    sindex = upped_flat_inds[sidx]
    idx = sidx[np.r_[np.flatnonzero(sindex[1:] != sindex[:-1]), upped_flat_inds.size-1]]
    out.flat[upped_flat_inds[idx]] = y_vals[idx]
    return out[:nmax, :].copy()  # rather not return a view to an auxiliary array
This still reproduces your expected output. The problem is that now the function takes much longer to finish (again, the timing plot is not reproduced here).
Bummer. Considering how my indexing versions are only faster for an intermediate array size and how their faster versions are not guaranteed to work, perhaps it's simplest to just use one of the looping versions. This is not to say, of course, that there aren't any optimal vectorized solutions that I missed.

Permute rows in "slices" of 3d array to match each other

I have a series of 2d arrays where the rows are points in some space. Many similar points occur across all arrays, but in different row order. I want to sort the rows so they have the most similar order across arrays. Also, the points are too different for clustering with K-means or DBSCAN. The problem can also be cast like this: if I stack the arrays into a 3d array, how do I permute the rows to minimize the average standard deviation (SD) along the 2nd axis? What's a good sorting algorithm for this problem?
I've tried the following approaches:
1. Create a reference 2d array and sort the rows in each array to minimize the mean euclidean distance to it. This, I am afraid, gives biased results.
2. Sort rows in the arrays pairwise, then pairs of pair-medians, then pairs of those, and so on. This doesn't really work and I'm not sure why.
3. Brute-force optimization, which I try to avoid since I have multiple sets of arrays to run the procedure on.
This is my code for the 2nd approach (Python):
def reorder_to(A, B):
    """Reorder rows in A to best match rows in B.

    Input
    -----
    A : N x M numpy.array
    B : N x M numpy.array

    Output
    ------
    perm_order : permutation order
    """
    if A.shape != B.shape:
        print("A and B must have the same shape")
        return None
    N = A.shape[0]
    # Create a distance matrix of distances between rows in A and B
    distance_matrix = np.ones((N, N)) * np.inf
    for i, a in enumerate(A):
        for ii, b in enumerate(B):
            ba = (b - a)
            distance_matrix[i, ii] = np.sqrt(np.dot(ba, ba))
    # Choose permutation order by smallest distances first
    perm_order = [[] for _ in range(N)]
    for _ in range(N):
        ind = np.argmin(distance_matrix)
        i, ii = ind // N, ind % N
        perm_order[ii] = i
        distance_matrix[i, :] = np.inf
        distance_matrix[:, ii] = np.inf
    return perm_order

def permute_tensor_rows(A):
    """Permute 1d rows in 3d array along the 0th axis to minimize average SD along 2nd axis.

    Input
    -----
    A : numpy.3darray
        Each "slice" in the 2nd direction is an independent array whose rows can be permuted
        to decrease the average SD in the 2nd direction.

    Output
    ------
    A : numpy.3darray
        A with sorted rows in each "slice".
    """
    step = 2
    while step <= A.shape[2]:
        for k in range(0, A.shape[2], step):
            # If last, reorder to previous
            if k + step > A.shape[2]:
                A_kk = A[:, :, k:(k+step)]
                kk_order = reorder_to(np.median(A_kk, axis=2), np.median(A_k, axis=2))
                A[:, :, k:(k+step)] = A[kk_order, :, k:(k+step)]
                continue
            k_0, k_1 = k, k + step // 2
            kk_0, kk_1 = k + step // 2, k + step
            A_k = A[:, :, k_0:k_1]
            A_kk = A[:, :, kk_0:kk_1]
            order = reorder_to(np.median(A_k, axis=2), np.median(A_kk, axis=2))
            A[:, :, k_0:k_1] = A[order, :, k_0:k_1]
        print("Step:", step, "\t ... Average SD:", np.mean(np.std(A, axis=2)))
        step *= 2
    return A
Sorry I should have looked at your code sample; that was very informative.
Seems like this here gives an out-of-the-box solution to your problem:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html#scipy.optimize.linear_sum_assignment
Only really feasible for a few hundred points at most, though, in my experience.
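To make that concrete, here is a sketch (not from the original answer) of how scipy.optimize.linear_sum_assignment could replace the greedy matching in reorder_to; the helper name reorder_to_lsa is made up for illustration:

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def reorder_to_lsa(A, B):
    """Reorder rows of A to optimally match rows of B (hypothetical helper)."""
    cost = cdist(A, B)  # pairwise euclidean distances between rows of A and B
    row_ind, col_ind = linear_sum_assignment(cost)  # globally optimal matching
    # col_ind[i] is the row of B matched to row i of A; invert the matching
    # so that perm[j] is the row of A assigned to row j of B
    perm = np.empty_like(col_ind)
    perm[col_ind] = row_ind
    return A[perm]

Unlike the greedy smallest-distance-first loop in reorder_to, the Hungarian algorithm minimizes the total distance over the whole assignment, at O(N^3) cost, which matches the caveat above about feasibility for at most a few hundred points.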

Accessing ndarray and discarding invalid positions - Python

I have a question about accessing a matrix position that in fact does not exist.
First, I have a matrix with rows rows and cols columns. From this matrix, I have to get sets of n x n submatrices. For example, to get 3 x 3 submatrices, I do the following:
for x, y in product(range(1, matrix.rows-1), range(1, matrix.cols-1)):
    bootstrap_3x3 = npr.choice(matrix.data[x-1:x+2, y-1:y+2].flatten(), size=(3, 3), replace=True)
But, as can be seen, I'm not considering the extremes, and I have to. For x = 0 and y = 0, for example, I should consider matrix.data[x:x+2, y:y+2] (the center should be the current x and y), returning a 3 x 3 with the first row/column = 0.
I know that I can achieve this with some if statements. But I guess Python should have a clever way to do this properly.
Thank you in advance.
I would make a new matrix, padded with (n-1)/2 zeros around it:
import numpy as np

rows, cols = 4, 6
n = 3
d = (n - 1) // 2  # integer division, so this also works as a pad width in Python 3
data = np.arange(rows*cols).reshape(rows, cols)
padded = np.pad(data, d, mode='constant')
for x, y in np.indices(data.shape).reshape(2, -1).T:
    sub = padded[x:x+n, y:y+n]
    print(sub)
    bootstrap_nxn = np.random.choice(sub.ravel(), (n, n))
This assumes n is odd and that the submatrix center is always within the original data matrix. If n is even, the center of the submatrix isn't well defined.
If you actually want the submatrix to overlap the data matrix by only one row, then you'd need to pad with n-1 zeros (and in that case even vs odd n won't matter).
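As an aside not in the original answer: on NumPy 1.20+ the same centered windows can be materialized all at once, as views, with np.lib.stride_tricks.sliding_window_view, avoiding the slicing inside the loop. A minimal sketch under the same odd-n assumption:

import numpy as np

rows, cols, n = 4, 6, 3
d = (n - 1) // 2
data = np.arange(rows * cols).reshape(rows, cols)
padded = np.pad(data, d, mode='constant')
# windows[i, j] is the n x n window of `padded` centered on data[i, j]
windows = np.lib.stride_tricks.sliding_window_view(padded, (n, n))
assert windows.shape == (rows, cols, n, n)
bootstrap_00 = np.random.choice(windows[0, 0].ravel(), (n, n))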
