Centering matrix - python

I want to write a function for centering an input data matrix by multiplying it with the centering matrix. The function shall subtract the row-wise mean from the input.
My code:
import numpy as np
def centering(data):
    n = data.shape()[0]
    centeringMatrix = np.identity(n) - 1/n * (np.ones(n) @ np.ones(n).T)
    data = centeringMatrix @ data
data = np.array([[1,2,3], [3,4,5]])
center_with_matrix(data)
But I get a wrong result matrix, it is not centered.
Thanks!

The centering matrix is
np.eye(n) - np.ones((n, n)) / n
Here is a list of issues in your original formulation:
np.ones(n).T is the same as np.ones(n). The transpose of a 1D array is a no-op in numpy. If you want to turn a row vector into a column vector, add the dimension explicitly:
np.ones((n, 1))
OR
np.ones(n)[:, None]
The normal definition is to subtract the column-wise mean, not the row-wise, so you will have to transpose the input (or, equivalently, right-multiply it by the centering matrix) to get a row-wise operation. Note also that shape is an attribute, not a method:
n = data.shape[1]
...
data = (centeringMatrix @ data.T).T
Your function creates a new array for the output but does not currently return anything. You can either return the result, or perform the assignment in-place:
return (centeringMatrix @ data.T).T
OR
data[:] = (centeringMatrix @ data.T).T
OR
np.matmul(centeringMatrix, data.T, out=data.T)
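Putting the points above together, a minimal sketch could look like this. The function name center_rows is made up for illustration, and the cast to float is an assumption on my part so that integer input such as the example data is not truncated:

```python
import numpy as np

def center_rows(data):
    """Subtract the row-wise mean using the centering matrix (illustrative name)."""
    # Assumption: work in floating point so integer input is not truncated
    data = np.asarray(data, dtype=float)
    n = data.shape[1]  # number of columns, since we center along each row
    centering_matrix = np.eye(n) - np.ones((n, n)) / n
    # The centering matrix is symmetric, so right-multiplying is
    # equivalent to (centering_matrix @ data.T).T
    return data @ centering_matrix

data = np.array([[1, 2, 3], [3, 4, 5]])
centered = center_rows(data)
print(centered)
```

Every row of the result should have zero mean, which is the same as what data - data.mean(axis=1, keepdims=True) gives without building the matrix at all.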

Related

3D tensor of diagonal matrices

I have a matrix A with m rows and n columns. I want a 3D tensor of dimension m*n*n such that the tensor consists of m diagonal matrices formed by each of the columns of A. In other words, every column of A should be converted into a diagonal matrix, and all those matrices together should form a 3D tensor.
This is quite easy to do with a for loop, but I want to do it without a loop to improve speed.
I came up with a bad and inefficient way which works, but I hope someone can help me with finding a better way, which allows for large A matrices.
# I use python
# import numpy as np
n = A.shape[0] # A is an n*k matrix
k = A.shape[1]
holding_matrix = np.repeat(np.identity(k), repeats=n, axis=1) # k rows with n*k columns
identity_stack = np.tile(np.identity(n),k) #k nxn identity matrices stacked together
B = np.array((A@holding_matrix)*identity_stack)
B = np.array(np.hsplit(B,k)) # desired result of k n*n diagonal matrices in a tensor
n = A.shape[0] # A.shape == (n, k)
k = A.shape[1]
B = np.zeros_like(A, shape=(k, n*n)) # to preserve dtype and order of A
B[:, ::(n+1)] = A.T
B = B.reshape(k, n, n)
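As a cross-check, here is a small sketch of an equivalent formulation that writes to the diagonals with index arrays instead of a strided slice, compared against the straightforward loop. The sample matrix A is made up for illustration:

```python
import numpy as np

# Assumed sample input: A.shape == (n, k)
A = np.arange(1, 7).reshape(3, 2)
n, k = A.shape

# Strided-slice version from above
B = np.zeros_like(A, shape=(k, n * n))
B[:, ::(n + 1)] = A.T
B = B.reshape(k, n, n)

# Equivalent version using explicit diagonal indices
idx = np.arange(n)
B_alt = np.zeros((k, n, n), dtype=A.dtype)
B_alt[:, idx, idx] = A.T

# Reference: one np.diag call per column of A
B_loop = np.stack([np.diag(A[:, j]) for j in range(k)])
print(np.array_equal(B, B_loop), np.array_equal(B_alt, B_loop))
```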

Setup sliding windows as columns (IM2COL from MATLAB) in multi-dimensional array - Python

Currently, I have a 4d array, say,
arr = np.arange(48).reshape((2,2,3,4))
I want to apply a function that takes a 2d array as input to each 2d array sliced from arr. I have searched and read this question, which is exactly what I want.
The function I'm using is im2col_sliding_broadcasting() which I get from here. It takes a 2d array and list of 2 elements as input and returns a 2d array. In my case: it takes 3x4 2d array and a list [2, 2] and returns 4x6 2d array.
I considered using apply_along_axis() but as said it only accepts 1d function as parameter. I can't apply im2col function this way.
I want an output of shape 2x2x4x6. I can achieve this with a for loop, but I have heard that this is too slow:
import numpy as np
def im2col_sliding_broadcasting(A, BSZ, stepsize=1):
    # source: https://stackoverflow.com/a/30110497/10666066
    # Parameters
    M, N = A.shape
    col_extent = N - BSZ[1] + 1
    row_extent = M - BSZ[0] + 1
    # Get Starting block indices
    start_idx = np.arange(BSZ[0])[:, None]*N + np.arange(BSZ[1])
    # Get offsetted indices across the height and width of input array
    offset_idx = np.arange(row_extent)[:, None]*N + np.arange(col_extent)
    # Get all actual indices & index into input array for final output
    return np.take(A, start_idx.ravel()[:, None] + offset_idx.ravel()[::stepsize])
arr = np.arange(48).reshape((2,2,3,4))
output = np.empty([2,2,4,6])
for i in range(2):
    for j in range(2):
        temp = im2col_sliding_broadcasting(arr[i, j], [2,2])
        output[i, j] = temp
My arr is in fact a 10000x3x64x64 array, so my question is: is there a more efficient way to do this?
We can leverage np.lib.stride_tricks.as_strided based scikit-image's view_as_windows to get sliding windows. More info on use of as_strided based view_as_windows.
from skimage.util.shape import view_as_windows
W1,W2 = 2,2 # window size
# create sliding windows along last two axes
w = view_as_windows(arr,(1,1,W1,W2))[...,0,0,:,:]
# Merge the window axes (the last two axes) and
# merge the axes along which those windows were created (3rd and 4th axes)
outshp = arr.shape[:-2] + (W1*W2,) + ((arr.shape[-2]-W1+1)*(arr.shape[-1]-W2+1),)
out = w.transpose(0,1,4,5,2,3).reshape(outshp)
The last step forces a copy. So, skip it if possible.
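If scikit-image is not available, the same windows can probably be obtained with NumPy alone. The sketch below assumes NumPy 1.20 or newer, which provides np.lib.stride_tricks.sliding_window_view; it is a substitute for view_as_windows, not the method used above:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(48).reshape((2, 2, 3, 4))
W1, W2 = 2, 2  # window size

# Windows over the last two axes; the window axes are appended at the end,
# giving shape (2, 2, row_extent, col_extent, W1, W2)
w = sliding_window_view(arr, (W1, W2), axis=(-2, -1))

row_extent = arr.shape[-2] - W1 + 1
col_extent = arr.shape[-1] - W2 + 1
outshp = arr.shape[:-2] + (W1 * W2, row_extent * col_extent)

# Move the window axes in front of the position axes, then merge each pair
out = w.transpose(0, 1, 4, 5, 2, 3).reshape(outshp)
print(out.shape)
```

As with the scikit-image version, the final reshape forces a copy, while w itself is only a view.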

efficient setting 1D range values in a DataFrame (or a ndarray) with boolean array

PREREQUISITE
import numpy as np
import pandas as pd
INPUT1:boolean 2d array (a sample array as below)
x = np.array(
[[False,False,False,False,True],
[True,False,False,False,False],
[False,False,True,False,True],
[False,True,True,False,False],
[False,False,False,False,False]])
INPUT2:1D Range values (a sample as below)
y=np.array([1,2,3,4])
EXPECTED OUTPUT:2D ndarray
[[0,0,0,0,1],
[1,0,0,0,2],
[2,0,1,0,1],
[3,1,1,0,2],
[4,2,2,0,3]]
I want to efficiently write the range values (as a vertical vector) starting at each True in the 2D ndarray (INPUT1). Are there any useful APIs or solutions for this purpose?
Unfortunately I couldn't come up with an elegant solution, so I came up with multiple inelegant ones. The two main approaches I could think of are
brute-force looping over each True value and assigning slices, and
using a single indexed assignment to replace the necessary values.
It turns out that the time complexity of these approaches is non-trivial, so depending on the size of your array either can be faster.
Using your example input:
import numpy as np
x = np.array(
[[False,False,False,False,True],
[True,False,False,False,False],
[False,False,True,False,True],
[False,True,True,False,False],
[False,False,False,False,False]])
y = np.array([1,2,3,4])
refout = np.array([[0,0,0,0,1],
[1,0,0,0,2],
[2,0,1,0,1],
[3,1,1,0,2],
[4,2,2,0,3]])
# alternative input with arbitrary size:
# N = 100; x = np.random.rand(N,N) < 0.2; y = np.arange(1,N)
def looping_clip(x, y):
    """Loop over Trues, use clipped slices"""
    nmax = x.shape[0]
    n = y.size
    # initialize output
    out = np.zeros_like(x, dtype=y.dtype)
    # loop over True values
    for i,j in zip(*x.nonzero()):
        # truncate right-hand side where necessary
        out[i:i+n, j] = y[:nmax-i]
    return out
def looping_expand(x, y):
    """Loop over Trues, use an expanded buffer"""
    n = y.size
    nmax,mmax = x.shape
    ivals,jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax,0), mmax), dtype=y.dtype)
    # loop over True values
    for i,j in zip(ivals, jvals):
        # slice will always be complete, i.e. of length y.size
        out[i:i+n, j] = y
    return out[:nmax, :].copy() # rather not return a view to an auxiliary array
def index_2d(x, y):
    """Assign directly with 2d indices, use an expanded buffer"""
    n = y.size
    nmax,mmax = x.shape
    ivals,jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax,0), mmax), dtype=y.dtype)
    # now we can safely index for each "(ivals:ivals+n, jvals)" so to speak
    upped_ivals = ivals[:,None] + np.arange(n) # shape (ntrues, n)
    upped_jvals = jvals.repeat(y.size).reshape(-1, n) # shape (ntrues, n)
    out[upped_ivals, upped_jvals] = y # right-hand side of shape (n,) broadcasts
    return out[:nmax, :].copy() # rather not return a view to an auxiliary array
def index_1d(x,y):
    """Assign using linear indices, use an expanded buffer"""
    n = y.size
    nmax,mmax = x.shape
    ivals,jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax,0), mmax), dtype=y.dtype)
    # grab linear indices corresponding to Trues in a buffed-up array
    inds = np.ravel_multi_index((ivals, jvals), out.shape)
    # now all we need to do is start stepping along rows for each item and assign y
    upped_inds = inds[:,None] + mmax*np.arange(n) # shape (ntrues, n)
    out.flat[upped_inds] = y # y of shape (n,) broadcasts to (ntrues, n)
    return out[:nmax, :].copy() # rather not return a view to an auxiliary array
# check that the results are correct
print(all([np.array_equal(refout, looping_clip(x,y)),
           np.array_equal(refout, looping_expand(x,y)),
           np.array_equal(refout, index_2d(x,y)),
           np.array_equal(refout, index_1d(x,y))]))
I tried to document each function, but here's a synopsis:
looping_clip loops over every True value in the input and assigns to a corresponding slice in the output. We take care on the right-hand side to shorten the assigned array for when part of the slice would go beyond the edge of the array along the first dimension.
looping_expand loops over every True value in the input and assigns to a corresponding full slice in the output after allocating a padded output array ensuring that every slice will be full. We do more work when allocating a larger output array, but we don't have to shorten the right-hand side on assignment. We could omit the .copy() call in the last step, but I prefer not to return a nontrivially strided array (i.e. a view to an auxiliary array rather than a proper copy) as this might lead to obscure surprises for the user.
index_2d computes the 2d indices of every value to be assigned to, and assumes that duplicate indices will be handled in order. This is not guaranteed! (More on this a bit later.)
index_1d does the same using linearized indices and indexing into the flatiter of the output.
I timed the above methods using random arrays (see the commented line near the start).
What we can see is that for small and large arrays the looping versions are faster, but for linear sizes between roughly 10 and 150 the indexing versions are better. The reason I didn't go to higher sizes is that the indexing cases start to use a lot of memory, and I didn't want to have to worry about this messing with timings.
Just to make the above worse, note that the indexing versions assume that duplicate indices in a fancy indexing scenario are handled in order, so when True values are handled which are "lower" in the array, previous values will be overwritten as per your requirements. There's only one problem: this is not guaranteed:
For advanced assignments, there is in general no guarantee for the iteration order. This means that if an element is set more than once, it is not possible to predict the final result.
This doesn't sound very encouraging. While in my experiments it seems that the indices are handled in order (according to C order), this can also be a coincidence or an implementation detail. So if you want to use the indexing versions, make sure that this still holds true on your specific NumPy version and for your specific dimensions and shapes.
We can make the assignment safer by getting rid of duplicate indices ourselves. For this we can make use of this answer by Divakar on a corresponding question:
def index_1d_safe(x,y):
    """Same as index_1d but use Divakar's safe solution for reducing duplicates"""
    n = y.size
    nmax,mmax = x.shape
    ivals,jvals = x.nonzero()
    # initialize buffed-up output
    out = np.zeros((nmax + max(n + ivals.max() - nmax,0), mmax), dtype=y.dtype)
    # grab linear indices corresponding to Trues in a buffed-up array
    inds = np.ravel_multi_index((ivals, jvals), out.shape)
    # now all we need to do is start stepping along rows for each item and assign y
    upped_inds = inds[:,None] + mmax*np.arange(n) # shape (ntrues, n)
    # now comes https://stackoverflow.com/a/44672126
    # need additional step: flatten upped_inds and corresponding y values for selection
    upped_flat_inds = upped_inds.ravel() # shape (ntrues, n) -> (ntrues*n,)
    y_vals = np.broadcast_to(y, upped_inds.shape).ravel() # shape (ntrues, n) -> (ntrues*n,)
    sidx = upped_flat_inds.argsort(kind='mergesort')
    sindex = upped_flat_inds[sidx]
    idx = sidx[np.r_[np.flatnonzero(sindex[1:] != sindex[:-1]), upped_flat_inds.size-1]]
    out.flat[upped_flat_inds[idx]] = y_vals[idx]
    return out[:nmax, :].copy() # rather not return a view to an auxiliary array
This still reproduces your expected output. The problem is that now the function takes much longer to finish.
Bummer. Considering how my indexing versions are only faster for an intermediate array size and how their faster versions are not guaranteed to work, perhaps it's simplest to just use one of the looping versions. This is not to say, of course, that there aren't any optimal vectorized solutions that I missed.
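For what it's worth, here is one loop-free sketch that sidesteps duplicate indices altogether. It is not one of the approaches discussed or timed above, and the function name is made up for illustration. The idea: for every cell, find the row of the nearest True at or above it in the same column (a running maximum over row indices), and use the distance to that row as an index into y. I have not timed it, so I make no claim about how it compares to the looping versions.

```python
import numpy as np

def last_true_lookup(x, y):
    """Illustrative name: fill each cell from the nearest True at or above it."""
    rows = np.arange(x.shape[0])[:, None]          # column vector of row indices
    # Row index of the most recent True in each column, or -1 if none so far
    last = np.maximum.accumulate(np.where(x, rows, -1), axis=0)
    dist = rows - last                             # offset into y for each cell
    valid = (last >= 0) & (dist < y.size)          # a True exists and y still reaches
    out = np.zeros(x.shape, dtype=y.dtype)
    out[valid] = y[dist[valid]]
    return out

x = np.array(
    [[False, False, False, False, True],
     [True, False, False, False, False],
     [False, False, True, False, True],
     [False, True, True, False, False],
     [False, False, False, False, False]])
y = np.array([1, 2, 3, 4])
refout = np.array([[0, 0, 0, 0, 1],
                   [1, 0, 0, 0, 2],
                   [2, 0, 1, 0, 1],
                   [3, 1, 1, 0, 2],
                   [4, 2, 2, 0, 3]])
print(np.array_equal(refout, last_true_lookup(x, y)))
```

Because a lower True always wins over a higher one in the same column, taking the nearest True above each cell gives the same overwrite behaviour as the looping versions, and every output element is written at most once.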

numpy/scipy: fill upper triangle of array with elements of 1d vector?

Say I have a flattened 1D vector that exactly corresponds to the upper triangle elements of a 2D array.
The 1D vector needs to be read into the upper triangle.
I might do the following in python:
triu_flat = ...
row,col = np.triu_indices(50)
D = np.zeros((50,50))
i=0
for r,c in zip(row, col):
    D[r,c] = triu_flat[i]
    i += 1
However, there must be a way to do this via numpy/scipy operations.
You can simply use the indices returned by triu_indices(), no need for a for loop:
import numpy
data = numpy.arange(6)
out = numpy.zeros((3, 3))
inds = numpy.triu_indices(len(out))
out[inds] = data
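One thing worth checking is the ordering: triu_indices returns the indices row by row, so the sketch below assumes the flat vector lists the upper-triangle elements in that same row-major order and has n*(n+1)/2 entries. The sample values and the symmetrization step are my own additions for illustration:

```python
import numpy as np

n = 4
# Assumed sample data: n*(n+1)/2 upper-triangle values in row-major order
triu_flat = np.arange(1, n * (n + 1) // 2 + 1, dtype=float)

rows, cols = np.triu_indices(n)
D = np.zeros((n, n))
D[rows, cols] = triu_flat

# Optional: mirror the upper triangle to get a symmetric matrix
S = D + D.T - np.diag(np.diag(D))
print(D)
```

Reading the values back with D[np.triu_indices(n)] returns the original vector, which is a quick way to confirm the ordering matches.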

Multi-dimensional filtering using scipy.ndimage.generic_filter

I would like to use a generic filter to calculate the mean of values within a given window (or kernel), for values that fulfill a couple of conditions. I expected the following code to produce a mean filter of the first array in a 3-layer window, using the other two arrays to mask values from the mean calculation.
from scipy import ndimage
import numpy as np
#some test data
tstArr = np.random.rand(3,7,7)
tstArr = tstArr*10
tstArr = np.int_(tstArr)
tstArr[1] = tstArr[1]*100
tstArr[2] = tstArr[2] *1000
#mean function
def testFun(tstData,processLayer,nLayers,kernelSize):
    funData= tstData.reshape((nLayers,kernelSize,kernelSize))
    meanLayer = funData[processLayer]
    maskedData = meanLayer[(funData[1]>1)&(funData[2]<9000)]
    returnMean = np.mean(maskedData)
    return returnMean
#number of layers in the array
nLayers = np.shape(tstArr)[0]
#window size
kernelSize = 5
#create a sampling window of 5x5 elements from each array
footprnt = np.ones((nLayers,kernelSize,kernelSize),dtype = int)
# calculate the mean of the first layer in the array (other two are for masking)
processLayer = 0
tstOut = ndimage.generic_filter(tstArr, testFun, footprint=footprnt, extra_arguments = (processLayer,nLayers,kernelSize))
I thought this would yield a 7x7 array of masked mean values from the first layer in the input array. The output is a 3x7x7 array, and I don't understand what the values represent. I'm not sure how to produce the "masked" mean-filtered array, or how to interpret the output as given.
Your code does produce a mean filter of the first array in a 3-layer window, using the other two arrays to mask values from the mean calculation. You will find the result in tstOut[1].
What is going on? When you call ndimage.generic_filter with tstArr of shape (3, 7, 7) and footprint=np.ones((3, 5, 5)), then for all i from 0 to 2, for all j from 0 to 6 and for all k from 0 to 6, testFun is called with the subarray of tstArr centered at (i, j, k) and of shape (3, 5, 5) (the array is reflected at the boundary to supply missing values).
In the end:
tstOut[0] is the mean filter of tstArr[0] with tstArr[0] and tstArr[1] as masks
tstOut[1] is the mean filter of tstArr[0] with tstArr[1] and tstArr[2] as masks
tstOut[2] is the mean filter of tstArr[1] with tstArr[2] and tstArr[2] as masks
Again, the wanted result is in tstOut[1].
I hope this will help you.
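Here is a minimal sketch of how the 7x7 result can be pulled out and spot-checked. The names, the seeded random data, the float dtype and the NaN fallback for an empty mask are my own assumptions, not part of the original code:

```python
import numpy as np
from scipy import ndimage

# Assumed test data, similar in spirit to the question (float so means are not truncated)
rng = np.random.default_rng(0)
tst = rng.integers(0, 10, size=(3, 7, 7)).astype(float)
tst[1] *= 100
tst[2] *= 1000

def masked_mean(buf, layer, n_layers, ksize):
    # buf is the flattened (n_layers, ksize, ksize) neighbourhood
    win = buf.reshape(n_layers, ksize, ksize)
    keep = (win[1] > 1) & (win[2] < 9000)
    # Assumption: return NaN when the mask leaves nothing to average
    return win[layer][keep].mean() if keep.any() else np.nan

n_layers, ksize = tst.shape[0], 5
footprint = np.ones((n_layers, ksize, ksize), dtype=bool)
out = ndimage.generic_filter(tst, masked_mean, footprint=footprint,
                             extra_arguments=(0, n_layers, ksize))

# Only the middle slice sees layers 0, 1, 2 in their original order
masked = out[1]
print(masked.shape)
```

For an interior pixel such as (3, 3), the value in masked should match the masked mean computed by hand from tst[:, 1:6, 1:6].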
