Covariance of Matrix in Python

I want to find the covariance of a 10304*280 matrix (i.e. 280 variables, each with 10304 observations), and I am using the following numpy function to find it.
cov = numpy.cov(matrix)
I expected a 280*280 matrix as a result, but it returned a 10304*10304 matrix.

As suggested in the previous answer, you can change your memory layout.
An easy way to do this in 2d is simply transposing the matrix:
import numpy as np
r = np.random.rand(100, 10)
np.cov(r).shape # is (100,100)
np.cov(r.T).shape # is (10,10)
But you can also specify the rowvar flag; see the numpy.cov documentation:
import numpy as np
r = np.random.rand(100, 10)
np.cov(r).shape # is (100,100)
np.cov(r, rowvar=False).shape # is (10,10)
I think especially for large matrices this might be more performant, since you avoid the swapping/transposing of axes.
UPDATE:
I thought about this and wondered whether the algorithm is actually different depending on rowvar == True or rowvar == False. Well, as it turns out, if you change the rowvar flag, numpy simply transposes the array itself :P. You can see this in the numpy source.
So, in terms of performance, nothing will change between the two versions.

Here is what the numpy.cov(m, y=None, ...) documentation says:
m : array_like A 1-D or 2-D array containing multiple variables and
observations. Each row of m represents a variable, and each column a
single observation of all those variables...
So if your matrix contains 280 variables with 10304 samples each, it is supposed to be a 280*10304 matrix instead of a 10304*280 one. The simple solution is the same as the others suggest:
swap_matrix = numpy.swapaxes(matrix, 0, 1)
cov = numpy.cov(swap_matrix)

Related

How to resize an arbitrary Numpy NDArray to a new shape using interpolation

Remark:
This is more a contribution than a question, since I will answer my own question.
However, I am still interested in how the community would solve this problem.
So feel free to answer.
Story:
So when I was playing around with Qt in Python (i.e., PySide6) and its volume-rendering capabilities, I noticed some problems when setting my data array. Long story short: I didn't know (and I'm not sure it is stated anywhere in the Qt documentation at all) that the provided texture has to be of a shape where each dimension is a power of two.
Thus, I wanted to rescale my array to a shape which fulfills this criterion.
Calculating this shape with numpy is easy:
new_shape = numpy.power(2, numpy.ceil(numpy.log2(old_shape))).astype(int)
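For example, a quick check (the concrete old_shape here is made up for illustration):
import numpy
old_shape = (100, 120, 30)  # hypothetical volume shape
new_shape = numpy.power(2, numpy.ceil(numpy.log2(old_shape))).astype(int)
print(new_shape)  # [128 128  32]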
Now the only problem left is to rescale my array with shape old_shape to the new array with shape new_shape and properly interpolate the values.
And since I am usually only interested in some sort of generic approaches (who knows what this might be good for and for whom in the future), the following question did arise:
Question
How to resize an arbitrary Numpy NDArray of shape old_shape to a Numpy NDArray of shape new_shape with proper interpolation?
I used scipy's RegularGridInterpolator to rescale my array, and it actually worked.
Other interpolators should work as well.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def resample_array_to_shape(array: np.ndarray, new_shape, method="linear"):
    # generate points for each entry in the array
    entries = [np.arange(s) for s in array.shape]
    # the value for each point corresponds to its value in the original array
    interp = RegularGridInterpolator(entries, array, method=method)
    # new entries
    new_entries = [np.linspace(0, array.shape[i] - 1, new_shape[i]) for i in range(len(array.shape))]
    # use 'ij' indexing to avoid swapping axes
    new_grid = np.meshgrid(*new_entries, indexing='ij')
    # interpolate and return
    return interp(tuple(new_grid)).astype(array.dtype)
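A minimal usage sketch (the random volume is just a stand-in for real data):
volume = np.random.rand(100, 120, 30)  # stand-in for the real texture data
new_shape = np.power(2, np.ceil(np.log2(volume.shape))).astype(int)
resampled = resample_array_to_shape(volume, new_shape)
print(resampled.shape)  # (128, 128, 32)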

Finding nearest pixel in defined color space - quick implementation using numpy

I have been working on a task where I implemented median cut for image quantization – representing the whole image by only a limited set of pixels. I implemented the algorithm and now I am trying to implement the part where I assign each pixel to a representative from the set found by median cut. So, I have a variable color_space, which is a 2d ndarray of shape (n, 3), where n is the number of representatives. Then I have a variable img, which is the original image of shape (rows, columns, 3).
Now I want to find the nearest pixel (bin) for each pixel of the image based on euclidean distance. I was able to come up with this solution:
for row in range(img.shape[0]):
    for column in range(img.shape[1]):
        img[row][column] = color_space[np.linalg.norm(color_space - img[row][column], axis=1).argmin()]
What it does is, for each pixel of the image, compute the vector of distances to each of the bins and then take the closest one.
The problem is that this solution is quite slow, and I would like to vectorize it: instead of getting a vector for each pixel, I would like to get a matrix where, for example, the first row is the first vector of distances computed in my code, and so on.
This problem could be converted into a matrix-multiplication-like problem, where instead of the dot product of two vectors I take their euclidean distance. Is there a good approach to such problems? Some general solution in numpy for doing 'matrix multiplication' where the function R^n x R^n -> R is not the dot product but, for example, the euclidean distance? Of course, for the multiplication the original image would be reshaped to (rows*columns, 3), but that is a detail.
I have been studying the documentation and searching internet, but didn't find any good approach.
Please note that I don't want others to solve my assignment, the solution I came up with is totally ok, I am just curious whether I could speed it up as I try to learn numpy properly.
Thanks for any advice!
Below is an MWE vectorizing your problem. See the comments for explanation.
import numpy
# these are just random array declaration to work with.
image = numpy.random.rand(32, 32, 3)
color_space = numpy.random.rand(10,3)
# your code. I modified it to pick indexes
result = numpy.zeros((32,32))
for row in range(image.shape[0]):
    for column in range(image.shape[1]):
        result[row][column] = numpy.linalg.norm(color_space - image[row][column], axis=1).argmin()
result = result.astype(int)  # numpy.int is deprecated; the builtin int works
# here we reshape for broadcasting correctly.
image = image.reshape(1,32,32,3)
color_space = color_space.reshape(10, 1,1,3)
# compute the norm on last axis, which is RGB values
result_norm = numpy.linalg.norm(image-color_space, axis=3)
# now compute the vectorized argmin
result_vectorized = result_norm.argmin(axis=0)
print(numpy.allclose(result, result_vectorized))
Eventually, you can get the correct solution by doing color_space[result]. You may have to remove the extra dimensions that you add in color space to get correct shapes in this final operation.
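A minimal sketch of that last step, continuing from the MWE above:
# undo the broadcasting shape, then map each index back to its representative color
quantized = color_space.reshape(10, 3)[result_vectorized]
print(quantized.shape)  # (32, 32, 3)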
I think this approach might be a bit more numpy-ish/pythonic:
import numpy as np
from typing import *
from numpy import linalg as LA
# assume color_space is defined as a constant somewhere above and is of shape (n,3)
nearest_pixel_idxs: Callable[[np.ndarray], int] = lambda rgb: LA.norm(color_space - rgb, axis=1).argmin()
img: np.ndarray = color_space[np.apply_along_axis(nearest_pixel_idxs, 1, img.reshape((-1, 3)))]
Why this solution might be more efficient:
It pushes the per-pixel work into apply_along_axis with nearest_pixel_idxs() rather than nested for-loops. This is made possible by reshaping img, thereby removing the need for double indexing.
It indexes into color_space only once, at the very end, instead of on every loop iteration.
Let me know if you would like me to go into greater depth on any of this - happy to help.
You could first broadcast to get all the combinations and then calculate each norm. You could then pick the smallest from there.
a = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]])
b = np.array([[1, 2, 3],
              [3, 4, 5]])
a = np.repeat(a.reshape(a.shape[0], 1, 3), b.shape[0], axis=1)
b = np.repeat(b.reshape(1, b.shape[0], 3), a.shape[0], axis=0)
np.linalg.norm(a - b, axis=2)
Each row of the result contains the distances from the corresponding row of a to each of the representatives in b:
array([[0.        , 3.46410162],
       [1.73205081, 1.73205081],
       [3.46410162, 0.        ]])
You can then use argmin to get the final results.
IMO it is better to use numpy's automatic broadcasting (what @Umang Gupta proposed) than repeat.
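For illustration, here is the same computation with broadcasting instead of repeat, using the a and b defined above (a sketch, not benchmarked):
import numpy as np
a = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]])
b = np.array([[1, 2, 3],
              [3, 4, 5]])
# length-1 axes make (3,1,3) and (1,2,3) broadcast to (3,2,3) without explicit copies
dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
nearest = dists.argmin(axis=1)  # index of the closest representative for each row of a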

Defining empty numpy array when we do not know the size

From my understanding, when we want to define a numpy array, we have to define its size.
However, in my case, I want to define a numpy array and then extend it based on values computed in a for loop. The shape of the values might differ in each run, so I cannot define the numpy array's shape in advance.
Is there any way to overcome this?
I would like to avoid using lists.
Thanks
import numpy as np
myArrayShape = 2
myArray = np.empty(shape=myArrayShape)
Note that np.empty leaves the memory uninitialized, so each element starts with an arbitrary value.
I think a numpy array is just like an array in C or C++: when you create a numpy array, you allocate memory depending on your request (size and dtype).
So it is better to create the array after its size is determined.
Or you can try numpy.append
https://numpy.org/doc/stable/reference/generated/numpy.append.html
But I don't think it is the preferable way, because it keeps generating new arrays.
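For illustration, appending in a loop looks like this; every call copies the data into a brand-new array, which is why it scales poorly:
import numpy as np
arr = np.empty((0, 3))
for i in range(5):
    arr = np.append(arr, [[i, i, i]], axis=0)  # allocates and copies each iteration
print(arr.shape)  # (5, 3)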
From the Octave (free-MATLAB) docs, https://octave.org/doc/v6.3.0/Advanced-Indexing.html
In cases where a loop cannot be avoided, or a number of values must be combined to form a larger matrix, it is generally faster to set the size of the matrix first (pre-allocate storage), and then insert elements using indexing commands. For example, given a matrix a,
[nr, nc] = size (a);
x = zeros (nr, n * nc);
for i = 1:n
  x(:,(i-1)*nc+1:i*nc) = a;
endfor
is considerably faster than
x = a;
for i = 1:n-1
  x = [x, a];
endfor
because Octave does not have to repeatedly resize the intermediate result.
The same idea applies in numpy. While you can start with a (0,n) shaped array, and grow by concatenating (1,n) arrays, that is a lot slower than starting with a (m,n) array, and assigning values.
There's a deleted answer that illustrates how to build the data up in a list with append and convert it to an array at the end. That approach is highly recommended.
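A minimal sketch of that pattern (collect rows in a plain list, convert once at the end):
import numpy as np
rows = []
for i in range(5):
    rows.append([i, 2 * i, 3 * i])  # cheap list append, no array copies
myArray = np.array(rows)            # one allocation at the end
print(myArray.shape)  # (5, 3)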

Efficient numpy approach to iterate through elements of numpy arrays

With the following code snippet, I am trying to generate a vector where each element is drawn from a different normal distribution. The "mean" and "standard deviation" values (arguments to random.normal) are obtained from two numpy vectors, meanVect and varVect. Both vectors have the same shape as the vector to be generated.
I am using a list comprehension, which I have used as a quick and dirty fix to achieve my objective. Is there a numpy-specific approach that is more efficient than my current solution?
import numpy as np
from numpy import random
meanVect = np.random.rand(1,100) # using random vectors for MWE
varVect = np.random.rand(1,100) # originally, vectors from a different source are used
newVect = [random.normal(meanVect[0][i], varVect[0][i]) for i in range(len(meanVect[0]))]
Since np.random.normal takes array-like inputs for loc and scale, you can just do:
newVect = np.random.normal(meanVect, varVect)
As long as both input vectors have the same .shape, this should work.
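A quick shape check (random inputs, just to illustrate):
import numpy as np
meanVect = np.random.rand(1, 100)
varVect = np.random.rand(1, 100)
newVect = np.random.normal(meanVect, varVect)
print(newVect.shape)  # (1, 100)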

Solve broadcasting error without for loop, speed up code

I may be misunderstanding how broadcasting works in Python, but I am still running into errors.
scipy offers a number of "special functions" which take in two arguments, in particular the eval_XX(n, x[,out]) functions.
See http://docs.scipy.org/doc/scipy/reference/special.html
My program uses many orthogonal polynomials, so I must evaluate these polynomials at distinct points. Let's take the concrete example scipy.special.eval_hermite(n, x, out=None).
I would like the x argument to be a matrix of shape (50, 50). Then, I would like to evaluate each entry of this matrix at a number of points. Let's define n to be a numpy array narr = np.arange(10) (where we have imported numpy as np, i.e. import numpy as np).
So, calling
scipy.special.eval_hermite(narr, matrix)
should return Hermite polynomials H_0(matrix), H_1(matrix), H_2(matrix), etc. Each H_X(matrix) has shape (50, 50), the shape of the original input matrix.
Then, I would like to sum these values. So, I call
matrix1 = np.sum( [scipy.special.eval_hermite(narr, matrix)], axis=0 )
but I get a broadcasting error!
ValueError: operands could not be broadcast together with shapes (10,) (50,50)
I can solve this with a for loop, i.e.
matrix2 = np.sum( [scipy.special.eval_hermite(i, matrix) for i in narr], axis=0)
This gives me the correct answer, and the output matrix2.shape = (50,50). But using this for loop slows down my code, big time. Remember, we are working with entries of matrices.
Is there a way to do this without a for loop?
eval_hermite broadcasts n with x, then evaluates Hn(x) at each point. Thus, the output shape will be the result of broadcasting n with x. So, if you want to make this work, you'll have to make n and x have compatible shapes:
import scipy.special as ss
import numpy as np
matrix = np.ones([100,100]) # example
narr = np.arange(10) # example
ss.eval_hermite(narr[:,None,None], matrix).shape # => (10, 100, 100)
But note that this might actually be faster:
out = np.zeros_like(matrix)
for n in narr:
    out += ss.eval_hermite(n, matrix)
In testing, it appears to be between 5 and 10% faster than the np.sum(...) version above.
The documentation for these functions is skimpy, and a lot of the code is compiled, so this is just based on experimentation:
special.eval_hermite(n, x, out=None)
n apparently is a scalar or array of integers. x can be an array of floats.
special.eval_hermite(np.ones(5,int)[:,None],np.ones(6)) gives me a (5,6) result. This is the same shape as what I'd get from np.ones(5,int)[:,None] * np.ones(6).
The np.ones(5,int)[:,None] is a (5,1) array, np.ones(6) a (6,), which for this purpose is equivalent of (1,6). Both can be expanded to (5,6).
So as best I can tell, broadcasting rules in these special functions is the same as for operators like *.
Since special.eval_hermite(narr[:,None,None], x) produces a (10,50,50) array, you just apply sum to axis 0 of that to produce the (50,50):
special.eval_hermite(narr[:,None,None], x).sum(axis=0)
Like I wrote before, the same broadcasting (and summing) rules apply for this hermite as they do for a basic operation like *.
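Putting that together, a short sketch that checks the broadcast-and-sum result against the explicit loop (random data, sizes as in the question):
import numpy as np
import scipy.special as special

matrix = np.random.rand(50, 50)
narr = np.arange(10)

# broadcast n against x to get a (10, 50, 50) stack, then reduce over the n axis
summed = special.eval_hermite(narr[:, None, None], matrix).sum(axis=0)

# reference: explicit loop over n
reference = sum(special.eval_hermite(n, matrix) for n in narr)
print(np.allclose(summed, reference))  # True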
