scale two matrices with scipy or sklearn

scale two matrices with scipy or sklearn - python

I would like to scale a matrix X1 (by column), and then scale another matrix X2 with mean and standard deviations found when scaling X1.
As far as I know, sklearn does not return mean/variance when scaling a matrix. Is there an alternative approach without me implementing it?
For example:
X1
1 2 3 4
5 6 7 8
9 10 11 12
X2
12 13 14 15
16 17 18 19
replace X2[i][j] with (X2[i][j] - mean[X1[:, i]]) / std[X1[:, i]]
The scale function of sklearn preprocessing cannot be used because it does not return mean and variance.

The Standard Scaler from scikit learn handles this, and corner cases, pretty well.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X1)
output = scaler.transform(X2)
If necessary, you can access the means and standard deviations of the feature columns using
scaler.std_
scaler.mean_
You can also use the StandardScaler in a pipeline as preprocessing preceding an estimator.

Both .std() and .mean() method accept axis parameter to calculate row wise/column wise statistics, the rest will be taken care of by boardcasting:
In [170]:
X1
Out[170]:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
In [171]:
X2
Out[171]:
array([[12, 13, 14, 15],
[16, 17, 18, 19]])
In [172]:
(X2-X1.mean(0))/X1.std(0)
Out[172]:
array([[ 2.14330352, 2.14330352, 2.14330352, 2.14330352],
[ 3.3680484 , 3.3680484 , 3.3680484 , 3.3680484 ]])

Related

Efficient ways to aggregate and replicate values in a numpy matrix

In my work I often need to aggregate and expand matrices of various quantities, and I am looking for the most efficient ways to do these actions. E.g. I'll have an NxN matrix that I want to aggregate from NxN into PxP where P < N. This is done using a correspondence between the larger dimensions and the smaller dimensions. Usually, P will be around 100 or so.
For example, I'll have a hypothetical 4x4 matrix like this (though in practice, my matrices will be much larger, around 1000x1000)
m=np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
>>> m
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]])
and a correspondence like this (schematically):
0 -> 0
1 -> 1
2 -> 0
3 -> 1
that I usually store in a dictionary. This means that indices 0 and 2 (for rows and columns) both get allocated to new index 0 and indices 1 and 3 (for rows and columns) both get allocated to new index 1. The matrix could be anything at all, but the correspondence is always many-to-one when I want to compress.
If the input matrix is A and the output matrix is B, then cell B[0, 0] would be the sum of A[0, 0] + A[0, 2] + A[2, 0] + A[2, 2] because new index 0 is made up of original indices 0 and 2.
The aggregation process here would lead to:
array([[ 1+3+9+11, 2+4+10+12 ],
[ 5+7+13+15, 6+8+14+16 ]])
= array([[ 24, 28 ],
[ 40, 44 ]])
I can do this by making an empty matrix of the right size and looping over all 4x4=16 cells of the initial matrix and accumulating in nested loops, but this seems to be inefficient and the vectorised nature of numpy is always emphasised by people. I have also done it by using np.ix_ to make sets of indices and use m[row_indices, col_indices].sum(), but I am wondering what the most efficient numpy-like way to do it is.
Conversely, what is the sensible and efficient way to expand a matrix using the correspondence the other way? For example with the same correspondence but in reverse I would go from:
array([[ 1, 2 ],
[ 3, 4 ]])
to
array([[ 1, 2, 1, 2 ],
[ 3, 4, 3, 4 ],
[ 1, 2, 1, 2 ],
[ 3, 4, 3, 4 ]])
where the values simply get replicated into the new cells.
In my attempts so far for the aggregation, I have used approaches with pandas methods with groupby on index and columns and then extracting the final matrix with, e.g. df.values. However, I don't know the equivalent way to expand a matrix, without using a lot of things like unstack and join and so on. And I see people often say that using pandas is not time-efficient.
Edit 1: I was asked in a comment about exactly how the aggregation should be done. This is how it would be done if I were using nested loops and a dictionary lookup between the original dimensions and the new dimensions:
>>> m=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
>>> mnew=np.zeros((2,2))
>>> big2small={0:0, 1:1, 2:0, 3:1}
>>> for i in range(4):
... inew = big2small[i]
... for j in range(4):
... jnew = big2small[j]
... mnew[inew, jnew] += m[i, j]
...
>>> mnew
array([[24., 28.],
[40., 44.]])
Edit 2: Another comment asked for the aggregation example towards the start to be made more explicit, so I have done so.

Assuming you don't your indices don't have a regular structure I would do it try sparse matrices.
import scipy.sparse as ss
import numpy as np
# your current array of indices
g=np.array([[0,0],[1,1],[2,0],[3,1]])
# a sparse matrix of (data=ones, (row_ind=g[:,0], col_ind=g[:,1]))
# it is one for every pair (g[i,0], g[i,1]), zero elsewhere
u=ss.csr_matrix((np.ones(len(g)), (g[:,0], g[:,1])))
Aggregate
m=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
u.T # m # u
Expand
m2 = np.array([[1,2],[3,4]])
u # m2 # u.T

matrix: move n-th row by n position efficiently

I have a numpy 2d array and I need to transform it in a way that the first row remains the same, the second row moves by one position to right, (it can wrap around or just have zero padded to the front). Third row shifts 3 positions to the right, etc.
I can do this through a "for loop" but that is not very efficient. I am guessing there should be a filtering matrix that multipled by the original one will have the same effect, or maybe a numpy trick that will help me doing this? Thanks!
I have looked into numpy.roll() but I don't think it can work on each row separately.
import numpy as np
p = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
'''
p = [ 1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16]
desired output:
p'= [ 1 2 3 4
0 5 6 7
0 0 9 10
0 0 0 13]
'''

We can extract sliding windows into a zeros padded version of the input to have a memory efficient approach and hence performant too. To get those windows, we can leverage np.lib.stride_tricks.as_strided based scikit-image's view_as_windows. More info on use of as_strided based view_as_windows.
Hence, the solution would be -
from skimage.util.shape import view_as_windows
def slide_by_one(p):
m,n = p.shape
z = np.zeros((m,m-1),dtype=p.dtype)
a = np.concatenate((z,p),axis=1)
w = view_as_windows(a,(1,p.shape[1]))[...,0,:]
r = np.arange(m)
return w[r,r[::-1]]
Sample run -
In [60]: p # generic sample of size mxn
Out[60]:
array([[ 1, 5, 9, 13, 17],
[ 2, 6, 10, 14, 18],
[ 3, 7, 11, 15, 19],
[ 4, 8, 12, 16, 20]])
In [61]: slide_by_one(p)
Out[61]:
array([[ 1, 5, 9, 13, 17],
[ 0, 2, 6, 10, 14],
[ 0, 0, 3, 7, 11],
[ 0, 0, 0, 4, 8]])
We can leverage the regular rampy pattern to have a more efficient approach with a more raw usage of np.lib.stride_tricks.as_strided, like so -
def slide_by_one_v2(p):
m,n = p.shape
z = np.zeros((m,m-1),dtype=p.dtype)
a = np.concatenate((z,p),axis=1)
s0,s1 = a.strides
return np.lib.stride_tricks.as_strided(a[:,m-1:],shape=(m,n),strides=(s0-s1,s1))
Another one with some masking -
def slide_by_one_v3(p):
m,n = p.shape
z = np.zeros((len(p),1),dtype=p.dtype)
a = np.concatenate((p,z),axis=1)
return np.triu(a[:,::-1],1)[:,::-1].flat[:-m].reshape(m,-1)

Here is a simple method based on zero-padding and reshaping. It is fast because it avoids advanced indexing and other overheads.
def pp(p):
m,n = p.shape
aux = np.zeros((m,n+m-1),p.dtype)
np.copyto(aux[:,:n],p)
return aux.ravel()[:-m].reshape(m,n+m-2)[:,:n].copy()

What are x , y axis values when not specified for kmeans scikit-learn

This code : runs k-means algorithm from scikit-learn package :
from sklearn.cluster import KMeans
import numpy as np
from matplotlib import pyplot
X = np.array([[10, 2 , 9], [1, 4 , 3], [1, 0 , 3],
[4, 2 , 1], [4, 4 , 7], [4, 0 , 5], [4, 6 , 3],[4, 1 , 7],[5, 2 , 3],[6, 3 , 3],[7, 4 , 13]])
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
k = 3
kmeans.fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
for i in range(k):
# select only data observations with cluster label == i
ds = X[np.where(labels==i)]
# plot the data observations
pyplot.plot(ds[:,0],ds[:,1],'o')
# plot the centroids
lines = pyplot.plot(centroids[i,0],centroids[i,1],'kx')
# make the centroid x's bigger
pyplot.setp(lines,ms=15.0)
pyplot.setp(lines,mew=2.0)
pyplot.show()
generates :
As I've not set the x and y axis labels what do these axis values represent ?
scikit-learn utilizes the Euclidian distance measure for computing the distance between each point, so are the axis values representative of the Euclidean distances ?
The doc http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html does not describe this scenario.
Update : it does appear to be just plotting first two two dimension in array as using
X = np.array([[10, 2 , 90], [1, 4 , 35], [1, 0 , 30],
[4, 2 , 1], [4, 4 , 7], [4, 0 , 5], [4, 6 , 3],[4, 1 , 7],[5, 2 , 3],[6, 3 , 3],[7, 4 , 13]])
I've updated the 3'rd dimensions for first 3 parameters to : to 90 , 35 & 40 . This does not have any impact on resultant plot. So in order to visualize dimensions > 2 I should run a PCA analysis on the data.

TL;DR
I think it's simply plotting your first variable on the "x", and your second variable on the "y".
(But "x" and "y" are the wrong terms.)
Detail
In machine learning, the terms x and y are usually used a bit differently. In your case your X matrix contains data points with 3 values:
The first two values are usually called x1 and x2 variables (x with 1 subscript, if I could format it that way).
And the third value is ... I'm not sure yet. I don't see it on the plot.
If you look at your original data in X, you see [10, 2, 9], [1, 4, 3], ...
The first two variables of the first data point are (10, 2).
You can see a point plotted at horizontal 10, vertical 2.
There is a second point plotted at horizontal 1, vertical 4.
And so on ...
So from that you can basically see that the horizontal axis is x1, and the vertical is x2.
I don't know how the third value appears on the plot. It's possible that it's the color, but usually in k-means, the color is used to separate the different values into clusters. So each color is a cluster.
So I don't really see where the third value is. But that wasn't your question! :)
You probably want the documentation for pyplot, not for scikit-learn. Here is pyplot: http://matplotlib.org/api/pyplot_api.html

Features Importance with list attributes from pd dataframe

I'm developing a ML algorithm to get a feature importance with an ExtraTrees.
The problem that I'm trying to solving is that the variables are not scalars but lists with different dimensions, or matrixes but for now I will focus only on lists.
From the momement the only think that I was able to do is a FI with the flat lists concateneted each other.
GOAL:
What I would like to do is to get a score point for each different list instead of a score for each lists' element.
I present here a toy example of the dataset and the current code:
df = pd.DataFrame({"list1": [[10,15,12,14],[20,30,10,43]], "R":[2,2] ,"C":[2,2] , "CLASS":[1,0] , "scalar1":[1,2] , "scalar2":[3,4]})
After PCA ( below ):
df['new'] = pd.Series([a.reshape((c, r)) for (a, c, r) in zip(df.A, df.C, df.R)])
df['pca'] = pd.Series([ pca_volatilities(matrix) for matrix in df.new ])
Becomes:
list1 # C # C1 # C2 # CLASS # R # new # pca # flat_pca
0 [10, 15, 12, 14] 2 1 3 1 2 [[10, 15], [12, 14]] [[-1.11803398875], [1.11803398875]] [-1.11803398875, 1.11803398875]
1 [20, 30, 10, 43] 2 2 4 0 2 [[20, 30], [10, 43]] [[-8.20060973343], [8.20060973343]] [-8.20060973343, 8.20060973343]
Here I present the fit:
X = np.concatenate([np.stack(df.flat_pca,axis=0), [df.C1, df.C2]], axis=0).transpose()
Y = np.array(df.CLASS)
model = ExtraTreesRegressor()
model.fit(X,Y)
model.feature_importances_
This returns:
array([ 0.2, 0.3, 0.2, 0.3]).
What I need is a score for list1 , C1,C2 and flat_pca. I don't know how to do this.
Hoping that someone is able to help me, thanks in advance !!!!!

How to bin a 2D array in numpy?

I'm new to numpy and I have a 2D array of objects that I need to bin into a smaller matrix and then get a count of the number of objects in each bin to make a heatmap. I followed the answer on this thread to create the bins and do the counts for a simple array but I'm not sure how to extend it to 2 dimensions. Here's what I have so far:
data_matrix = numpy.ndarray((500,500),dtype=float)
# fill array with values.
bins = numpy.linspace(0,50,50)
digitized = numpy.digitize(data_matrix, bins)
binned_data = numpy.ndarray((50,50))
for i in range(0,len(bins)):
for j in range(0,len(bins)):
k = len(data_matrix[digitized == i:digitized == j]) # <-not does not work
binned_data[i:j] = k
P.S. the [digitized == i] notation on an array will return an array of binary values. I cannot find documentation on this notation anywhere. A link would be appreciated.

You can reshape the array to a four dimensional array that reflects the desired block structure, and then sum along both axes within each block. Example:
>>> a = np.arange(24).reshape(4, 6)
>>> a
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
>>> a.reshape(2, 2, 2, 3).sum(3).sum(1)
array([[ 24, 42],
[ 96, 114]])
If a has the shape m, n, the reshape should have the form
a.reshape(m_bins, m // m_bins, n_bins, n // n_bins)

At first I was also going to suggest that you use np.histogram2d rather than reinventing the wheel, but then I realized that it would be overkill to use that and would need some hacking still.
If I understand correctly, you just want to sum over submatrices of your input. That's pretty easy to brute force: going over your output submatrix and summing up each subblock of your input:
import numpy as np
def submatsum(data,n,m):
# return a matrix of shape (n,m)
bs = data.shape[0]//n,data.shape[1]//m # blocksize averaged over
return np.reshape(np.array([np.sum(data[k1*bs[0]:(k1+1)*bs[0],k2*bs[1]:(k2+1)*bs[1]]) for k1 in range(n) for k2 in range(m)]),(n,m))
# set up dummy data
N,M = 4,6
data_matrix = np.reshape(np.arange(N*M),(N,M))
# set up size of 2x3-reduced matrix, assume congruity
n,m = N//2,M//3
reduced_matrix = submatsum(data_matrix,n,m)
# check output
print(data_matrix)
print(reduced_matrix)
This prints
print(data_matrix)
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]]
print(reduced_matrix)
[[ 24 42]
[ 96 114]]
which is indeed the result for summing up submatrices of shape (2,3).
Note that I'm using // for integer division to make sure it's python3-compatible, but in case of python2 you can just use / for division (due to the numbers involved being integers).

Another solution is to have a look at the binArray function on the comments here:
Binning a numpy array
To use your example :
data_matrix = numpy.ndarray((500,500),dtype=float)
binned_data = binArray(data_matrix, 0, 10, 10, np.sum)
binned_data = binArray(binned_data, 1, 10, 10, np.sum)
The result sum all square of size 10x10 in data_matrix (of size 500x500) to obtain a single value per square in binned_data (of size 50x50).
Hope this help !

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

scale two matrices with scipy or sklearn - python

Related

Efficient ways to aggregate and replicate values in a numpy matrix

matrix: move n-th row by n position efficiently

What are x , y axis values when not specified for kmeans scikit-learn

Features Importance with list attributes from pd dataframe

How to bin a 2D array in numpy?

Categories

Resources