Correlating an array row-wise with a vector - python

I have an array X of shape m x n, and for every row of X I want the correlation with a vector y of length n.
In Matlab this would be possible with the corr function, corr(X,y). In Python, however, this does not seem possible with the np.corrcoef function:
import numpy as np
X = np.random.random([1000, 10])
y = np.random.random(10)
np.corrcoef(X,y).shape
This results in shape (1001, 1001), because np.corrcoef correlates every row with every other row. That fails when X has many rows. In my case I get this error:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 5.93 TiB for an array with shape (902630, 902630) and data type float64
This is because X.shape[0] is 902630.
My question is: how can I get only the row-wise correlations with the vector, giving an array of shape (1000,)?
Of course this could be done via a list comprehension:
np.array([np.corrcoef(X[i, :], y)[0,1] for i in range(X.shape[0])])
Currently I am therefore using numba with a for loop running through the >900000 elements. But I think there could be a much more efficient matrix operation for this problem.
EDIT:
Pandas also provides a method for this problem, the corrwith function:
import pandas as pd
X_df = pd.DataFrame(X)
y_s = pd.Series(y)
X_df.corrwith(y_s, axis=1)  # axis=1 correlates each row of X_df with y_s
The implementation allows for different correlation types, but it does not seem to be implemented as a matrix operation and is therefore really slow. There is probably a more efficient implementation.

This should compute the correlation coefficient of each row with a given y in a vectorized manner.
X = np.random.random([1000, 10])
y = np.random.random(10)
n, sx, sy = len(y), np.sum(X, axis=-1), np.sum(y)
# Pearson r from raw sums: (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
num = n * np.sum(X * y[None, :], axis=-1) - sx * sy
den = np.sqrt((n * np.sum(X**2, axis=-1) - sx**2) * (n * np.sum(y**2) - sy**2))
r = num / den
print(r[0], np.corrcoef(X[0], y)[0, 1])
0.4243951, 0.4243951
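The same coefficient can also be written by centering first. The following is a minimal sketch rather than part of the answer above; the function name `rowwise_corr` is hypothetical. It subtracts the mean from each row of X and from y, then divides the dot product by the product of the norms, which is algebraically the same Pearson coefficient.

```python
import numpy as np

def rowwise_corr(X, y):
    """Pearson correlation of every row of X (m, n) with y (n,); returns shape (m,)."""
    Xc = X - X.mean(axis=1, keepdims=True)  # center each row
    yc = y - y.mean()                       # center the vector
    num = Xc @ yc                           # centered dot products, shape (m,)
    den = np.sqrt((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return num / den

rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = rng.random(10)
r = rowwise_corr(X, y)

# Cross-check against np.corrcoef on a few rows
expected = np.array([np.corrcoef(X[i], y)[0, 1] for i in range(5)])
print(r.shape, np.allclose(r[:5], expected))
```

Centering allocates one temporary the size of X, so for a very large X it may be worth calling the function on blocks of rows.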

Related

dot product of (n,) shaped arrays to give an n by n array

Say you have an array X of shape (n,),
import numpy as np
n = 10
X = np.random.rand(n)
and you want to compute the product XX^T (by X^T I mean the transpose of X). The result should be an n by n matrix. However, using
np.dot(X, X.T)
will give a scalar, as if it computed X^T X instead. It only works if you reshape first:
X = np.reshape(X, (X.shape[0], 1))
np.dot(X, X.T)
Is there a way to do it without having to reshape the numpy vector?
If both a and b are 1-D arrays, numpy.dot(a, b) returns the inner product of vectors (without complex conjugation).
You can use the numpy.outer function instead:
np.outer(X, X)
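A small check of the difference, as a sketch with a fixed vector instead of random data. Transposing a 1-D array is a no-op, which is why np.dot(X, X.T) still returns the inner product; the broadcasting line is an assumed alternative, not something the answer proposes.

```python
import numpy as np

X = np.arange(3.0)                        # shape (3,)

inner = np.dot(X, X.T)                    # X.T is X for 1-D input: a scalar
outer = np.outer(X, X)                    # shape (3, 3), the X X^T matrix
via_broadcast = X[:, None] * X[None, :]   # same matrix via broadcasting

print(inner, outer.shape, np.array_equal(outer, via_broadcast))
```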

Python - y should be a 1d array, got an array of shape instead

Let's consider data :
import numpy as np
from sklearn.linear_model import LogisticRegression
x=np.linspace(0,2*np.pi,80)
x = x.reshape(-1,1)
y = np.sin(x)+np.random.normal(0,0.4,80)
y[y<1/2] = 0
y[y>1/2] = 1
clf=LogisticRegression(solver="saga", max_iter = 1000)
I want to fit a logistic regression where y is the dependent variable and x is the independent variable. But when I call:
clf.fit(x,y)
I get the error
'y should be a 1d array, got an array of shape (80, 80) instead'.
I tried to reshape data by using
y=y.reshape(-1,1)
But I end up with an array of length 6400! (How come?)
Could you please give me a hand with performing this regression ?
Change the order of your operations:
First generate x and y as 1-D arrays:
x = np.linspace(0, 2*np.pi, 80)
y = np.sin(x) + np.random.normal(0, 0.4, 80)
Then (after y was generated) reshape x:
x = x.reshape(-1, 1)
Edit following a comment as of 2022-02-20
The source of the problem in the original code is as follows:
x = np.linspace(0,2*np.pi,80) generates a 1-D array.
x = x.reshape(-1,1) reshapes it into a 2-D array with one column and as many rows as needed.
y = np.sin(x) + np.random.normal(0,0.4,80) operates on a column array of shape (80, 1) and a 1-D array of shape (80,), which is treated here as a single-row array.
The effect is that y is a 2-D array of shape (80, 80).
The attempt to reshape y then gives a single-column array with 6400 rows.
The proper solution is that both x and y should initially be 1-D (single-row) arrays, and my code does just this.
Then both arrays can be reshaped.
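The shapes described above can be checked directly. This is a sketch using NumPy only; the names `x_col`, `y_bad` and `y_good` are made up for illustration, and the thresholding step mirrors the question's 0/1 labels.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 80)

# Original order: x is already a column, so the 1-D noise broadcasts against it
x_col = x.reshape(-1, 1)                         # (80, 1)
y_bad = np.sin(x_col) + rng.normal(0, 0.4, 80)   # (80, 1) + (80,) -> (80, 80)

# Fixed order: build y while x is still 1-D, then reshape only x
y_good = np.sin(x) + rng.normal(0, 0.4, 80)      # (80,)
y_good = (y_good > 0.5).astype(int)              # binary labels, as in the question

print(y_bad.shape, y_good.shape, x_col.shape)
```

With these shapes, x_col and y_good are in the form that clf.fit(x, y) expects: a 2-D feature array and a 1-D label array.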
I encountered this error and tried to solve it with reshape, but it didn't work:
ValueError: y should be a 1d array, got an array of shape () instead.
It was actually caused by the wrong placement of the [] brackets around np.argmax. Below are the wrong code and the correct one; notice the position of [] around np.argmax in both snippets.
Wrong Code
ax[i,j].set_title("Predicted Watch : "+str(le.inverse_transform([pred_digits[prop_class[count]]])) +"\n"+"Actual Watch : "+str(le.inverse_transform(np.argmax([y_test[prop_class[count]]])).reshape(-1,1)))
Correct Code
ax[i,j].set_title("Predicted Watch :"+str(le.inverse_transform([pred_digits[prop_class[count]]]))+"\n"+"Actual Watch : "+str(le.inverse_transform([np.argmax(y_test[prop_class[count]])])))

Raise array to the power of another array - i.e. expanding the dimension of the array

Is it possible to use numpy to raise an array to the power of another array, in a way that yields a result with a larger dimension than the inputs - i.e. not just simple element wise raising to the power of.
As a simple example, I'm looking to compute the following. Below is the "longhand" form - in practice this is implemented by a loop over a large x array, so it's slow.
x = np.arange(4)
t = np.random.rand(3,3)
y = np.empty_like(x)
y[0] = np.sum(x[0]**t)
y[1] = np.sum(x[1]**t)
y[2] = np.sum(x[2]**t)
y[3] = np.sum(x[3]**t)
I'd like a vectorised solution to replace doing y[i] each time. However, since x has shape [4] and t has shape [3,3], when I try to compute x**t I get an error.
Is there a fast optimized solution?
A straight-forward vectorized way would be with broadcasting -
y = (x[:,None,None]**t).sum((1,2)).astype(x.dtype)
Or with the builtin np.power.outer -
y = np.power.outer(x,t).sum((1,2)).astype(x.dtype)
For large arrays, leverage multi-cores with numexpr module -
import numexpr as ne
y = ne.evaluate('sum(x3D**t1D,1)',{'x3D':x[:,None],'t1D':t.ravel()}).astype(x.dtype)
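As a quick sanity check, the broadcasting version can be compared against the loop from the question. This sketch keeps the result as floats; the `.astype(x.dtype)` in the answers above casts back to integers, which matches what assigning into `np.empty_like(x)` does in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(4)
t = rng.random((3, 3))

# Loop version from the question, collected into a float array
y_loop = np.array([np.sum(x[i] ** t) for i in range(len(x))])

# Broadcasting: (4, 1, 1) ** (3, 3) -> (4, 3, 3), then sum over the last two axes
y_vec = (x[:, None, None] ** t).sum(axis=(1, 2))

print(y_vec.shape, np.allclose(y_loop, y_vec))
```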

Numpy - Speed up iteration comparison?

The following use case:
I have a Numpy matrix/array with a few thousand 2d points. Call it A.
Eg:
[1 2]
[300 400]
..
[123 242]
I also have another Numpy matrix with a few 2d points as above. Call it B.
Basically, I want to iterate through A, then iterate through B and compute the distance between A[i] and B[j]. Then assign that back to another array. I could do it like this:
for i, (x0, x1) in enumerate(zip(A[:,0], A[:,1])):
    weight_distance = 0
    # inner loop runs over the points of B, as described above
    for p0, p1 in zip(B[:,0], B[:,1]):
        weight_distance = weight_distance + distance((p0,p1),(x0,x1))
    weight_array[i] = weight_distance
But this is too slow. What might be a Numpy way to approach this?
What you're probably looking for is the code in scipy.spatial.distance, particularly the cdist function. This can efficiently compute the pairwise distances between arrays of points for a wide variety of metrics.
import numpy as np
from scipy.spatial.distance import cdist
A = np.random.random((1000, 2))
B = np.random.random((100, 2))
D = cdist(A, B, metric='euclidean')
print(D.shape) # (1000, 100)
weights = D.sum(1)
print(weights.shape) # (1000,)
Here euclidean is the standard root-sum-square distance that you're probably used to, and D[i, j] holds the distance between A[i] and B[j], and so summing along axis 1 gives the desired weights.
There are ways to do this via broadcasting directly in numpy, but that approach would use several large temporary arrays, and will in general be slower than the scipy cdist approach.
Edit:
I thought I may as well add a note on the NumPy-only approach. It looks like this:
D2 = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
weights2 = D2.sum(1)
np.allclose(weights, weights2) # True
Let's break it down:
A[:, None, :] adds a new dimension to A, so its shape is now [1000, 1, 2]. Similarly, B[None, :, :] becomes [1, 100, 2].
A[:, None, :] - B[None, :, :] is a broadcasting operation which results in an array of differences, with shape [1000, 100, 2].
We square every element of this result.
The sum(-1) method on this result sums across the last dimension, resulting in an array of shape [1000, 100].
We take the square root of the result, which gives the distance matrix.
We sum along axis 1 to get the weights.
Notice that this broadcasting approach creates not one, but two temporary arrays of size 1000 * 100 * 2 along the way, which is why it is less efficient than a purpose-built compiled function like cdist.
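If SciPy is not available and the temporaries are a concern, one option is to run the same broadcasting expression over blocks of A. This is a sketch under that assumption, not something the answer proposes; the function name `summed_distances` and the `chunk` parameter are hypothetical.

```python
import numpy as np

def summed_distances(A, B, chunk=256):
    """Sum of Euclidean distances from each row of A to all rows of B."""
    out = np.empty(len(A))
    for start in range(0, len(A), chunk):
        block = A[start:start + chunk]               # (c, 2)
        diff = block[:, None, :] - B[None, :, :]     # (c, len(B), 2)
        out[start:start + chunk] = np.sqrt((diff ** 2).sum(-1)).sum(1)
    return out

rng = np.random.default_rng(0)
A = rng.random((1000, 2))
B = rng.random((100, 2))
weights = summed_distances(A, B)

# Reference: the unchunked broadcasting expression
full = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)).sum(1)
print(weights.shape, np.allclose(weights, full))
```

The temporaries now have at most chunk * len(B) * 2 elements each, at the cost of a short Python loop over the blocks.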

Fast iteration over vectors in a multidimensional numpy array

I'm writing some python + numpy + cython code, and am trying to find the most elegant and efficient way of doing the following kind of iteration over an array:
Let's say I have a function f(x, y) that takes a vector x of shape (3,) and a vector y of shape (10,) and returns a vector of shape (10,). Now I have two arrays X and Y of shape sx + (3,) and sy + (10,), where the sx and sy are two shapes that can be broadcast together (i.e. either sx == sy, or when an axis differs, one of the two has length 1, in which case it will be repeated). I want to produce an array Z that has the shape zs + (10,), where zs is the shape of the broadcasting of sx with sy. Each 10 dimensional vector in Z is equal to f(x, y) of the vectors x and y at the corresponding locations in X and Y.
I looked into np.nditer and while it plays nice with cython (see bottom of linked page), it doesn't seem to allow iterating over vectors from a multidimensional array, instead of elements. I also looked at index grids, but the problem there is that cython indexing is only fast when the number of indexes is equal to the dimensionality of the array, and are stored as cython integers instead of python tuples.
Any help is greatly appreciated!
You are describing what NumPy calls a Generalized Universal FUNCtion, or gufunc. As its name suggests, it is an extension of ufuncs. You probably want to start by reading these two pages:
Writing your own ufunc
Building a ufunc from scratch
The second example uses Cython and has some material on gufuncs. To fully go down the gufunc road, you will need to read the corresponding section in the numpy C API documentation:
Generalized Universal Function API
I do not know of any example of gufuncs being coded in Cython, although it shouldn't be too hard to do following the examples above. If you want to look at gufuncs coded in C, you can take a look at the source code for np.linalg here, although that can be a daunting experience. A while back I bored my local Python User Group to death giving a talk on extending numpy with C, which was mostly about writing gufuncs in C, the slides of that talk and a sample Python module providing a new gufunc can be found here.
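To get a feel for gufunc semantics before writing any C or Cython, np.vectorize accepts a signature argument that describes the core dimensions. The sketch below is only an illustration: the body of f is a made-up placeholder, and np.vectorize is essentially a Python-level loop, so it gives the broadcasting behaviour but not the speed of a compiled gufunc.

```python
import numpy as np

def f(x, y):
    # Placeholder core function: x has shape (3,), y has shape (10,), result (10,)
    return y * x.sum()

# The signature gives f gufunc-style semantics: the leading (loop) dimensions
# are broadcast, and the core dimensions n and m are passed to f as whole vectors
F = np.vectorize(f, signature='(n),(m)->(m)')

X = np.ones((2, 1, 3))    # sx = (2, 1)
Y = np.ones((1, 4, 10))   # sy = (1, 4)
Z = F(X, Y)
print(Z.shape)
```

Here the loop shapes (2, 1) and (1, 4) broadcast to (2, 4), so Z has shape (2, 4, 10).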
If you want to stick with nditer, here's a way using your example dimensions. It's pure Python here, but shouldn't be hard to implement with Cython (though it still has the tuple iterator). I'm borrowing ideas from ndindex as described in shallow iteration with nditer.
The idea is to find the common broadcasting shape, sz, and construct a multi_index iterator over it.
I'm using as_strided to expand X and Y to usable views, and passing the appropriate vectors (actually (1,n) arrays) to the f(x,y) function.
import numpy as np
from numpy.lib.stride_tricks import as_strided

def f(x, y):
    # sample that takes (1,10) and (1,3) arrays, and returns a (1,10) array
    assert x.shape == (1, 10), x.shape
    assert y.shape == (1, 3), y.shape
    z = x*10 + y.mean()
    return z

def brdcast(X, X1):
    # broadcast X to shape of X1 (keep last dim of X)
    # modeled on np.broadcast_arrays
    shape = X1.shape + (X.shape[-1],)
    strides = X1.strides + (X.strides[-1],)
    X1 = as_strided(X, shape=shape, strides=strides)
    return X1

def F(X, Y):
    # broadcast the leading dimensions only (last axis dropped via [...,0])
    X1, Y1 = np.broadcast_arrays(X[...,0], Y[...,0])
    Z = np.zeros(X1.shape + (10,))
    it = np.nditer(X1, flags=['multi_index'])
    X1 = brdcast(X, X1)
    Y1 = brdcast(Y, Y1)
    while not it.finished:
        # trailing None keeps the selected vector as a (1,n) array
        I = it.multi_index + (None,)
        Z[I] = f(X1[I], Y1[I])
        it.iternext()
    return Z

sx = (2,3) # works with (2,1)
sy = (1,3)
# X, Y = np.ones(sx+(10,)), np.ones(sy+(3,))
X = np.repeat(np.arange(np.prod(sx)).reshape(sx)[...,None], 10, axis=-1)
Y = np.repeat(np.arange(np.prod(sy)).reshape(sy)[...,None], 3, axis=-1)
Z = F(X,Y)
print(Z.shape)
print(Z[...,0])
