I have read the manual on the scikit-learn website and I still don't know what the mathematical formula behind this command is.
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
Center to the mean and component wise scale to unit variance.
This means that the mean along each column (axis 0) is subtracted from X and the result is divided by the standard deviation along that column; in formula form, X_scaled = (X - X.mean(axis=0)) / X.std(axis=0).
Andrey's formula in the comments is correct - I'd just add that numpy and scikit-learn use the population formula for calculating the standard deviation, not the sample standard deviation, which is the default in other languages like R. So numpy and scikit-learn divide the sum of squares by n, instead of n-1.
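As a quick sanity check, here is a minimal NumPy-only sketch reproducing what preprocessing.scale does, using the population standard deviation (ddof=0, NumPy's default):
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Center each column to zero mean, then divide by the population std
# (np.std uses ddof=0 by default, matching preprocessing.scale).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled)  # matches the preprocessing.scale output above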
I have the following dataset represented as a numpy array:
direccion_viento_pos
Out[32]:
array([['S'],
['S'],
['S'],
...,
['SO'],
['NO'],
['SO']], dtype=object)
The dimensions of this array are:
direccion_viento_pos.shape
(17249, 1)
I am using Python and scikit-learn to encode these categorical variables in this way:
from __future__ import unicode_literals
import pandas as pd
import numpy as np
# from sklearn import preprocessing
# from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
Then I create a label encoder object:
labelencoder_direccion_viento_pos = LabelEncoder()
I take column position 0 (the only column) of direccion_viento_pos and apply the fit_transform() method across all of its rows:
direccion_viento_pos[:, 0] = labelencoder_direccion_viento_pos.fit_transform(direccion_viento_pos[:, 0])
My direccion_viento_pos now looks like this:
direccion_viento_pos[:, 0]
array([5, 5, 5, ..., 7, 3, 7], dtype=object)
Up to this point, each row/observation of direccion_viento_pos has a numeric value, but I want to avoid the problem that these codes imply a weight or ordering, in the sense that some rows have a higher value than others even though the categories have no inherent order.
Because of this, I create dummy variables, which according to this reference are:
A Dummy variable or Indicator Variable is an artificial variable created to represent an attribute with two or more distinct categories/levels
Then, in my direccion_viento_pos context, I have 8 values:
SO - southwest (Sur oeste)
SE - southeast (Sur este)
S - south (Sur)
N - north (Norte)
NO - northwest (Nor oeste)
NE - northeast (Nor este)
O - west (Oeste)
E - east (Este)
That is, 8 categories.
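As a side note, LabelEncoder assigns integer codes in sorted label order, which is consistent with the values shown earlier (a quick check, assuming the encoder fitted above):
# classes_ holds the labels in sorted order; each label's position is its code:
# ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO'] -> codes 0..7,
# hence 'S' -> 5, 'NO' -> 3 and 'SO' -> 7, matching the encoded array above.
print(labelencoder_direccion_viento_pos.classes_)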
Next, I create a OneHotEncoder object with the categorical_features attribute, which specifies which features will be treated as categorical variables.
onehotencoder = OneHotEncoder(categorical_features = [0])
And apply this onehotencoder to our direccion_viento_pos matrix.
direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()
My direccion_viento_pos with its dummy variables now looks like this:
direccion_viento_pos
array([[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 0., ..., 1., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 1.]])
So far, then, I've created dummy variables for each category.
I wanted to walk through this process to arrive at my question:
If these dummy variables are already in a 0-1 range, is it necessary to apply MinMaxScaler feature scaling?
Some say that it is not necessary to scale these dummy variables. Others say that it is, because we want accuracy in the predictions.
I ask because when I apply MinMaxScaler with feature_range=(0, 1), my values change in some positions ... despite still staying within this scale.
Which is the best option to choose with respect to my dataset direccion_viento_pos?
I don't think scaling them will change the answer at all. They're all on the same scale already. Min 0, max 1, range 1. If some continuous variables were present, you'd want to normalize the continuous variables only, leaving the dummy variables alone. You could use the min-max scaler to give those continuous variables the same minimum of zero, max of one, range of 1. Then your regression slopes would be very easy to interpret. Your dummy variables are already normalized.
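To illustrate that last point, here is a minimal sketch (with a hypothetical array whose first column is continuous and whose remaining columns are dummies) that scales only the continuous column:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: column 0 is continuous, columns 1-2 are 0/1 dummies.
X = np.array([[10., 0., 1.],
              [35., 1., 0.],
              [20., 0., 1.]])

# Scale only the continuous column; leave the dummy columns alone.
X[:, :1] = MinMaxScaler(feature_range=(0, 1)).fit_transform(X[:, :1])
print(X)  # first column now in [0, 1]; dummies unchanged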
Here's a related question asking if one should ever standardize binary variables.
I am using Python 2.7 with SciPy to calculate a distance matrix for an array.
I don't get how to find the wanted distance values in the returned condensed matrix.
See example
from scipy.spatial.distance import pdist
import numpy as np
a = np.array([[1],[4],[0],[5]])
print a
print pdist(a)
will print
[[1]
 [4]
 [0]
 [5]]
[ 3.  1.  4.  4.  1.  5.]
I found here that the ij entry in the condensed matrix should store the distance between the i-th and j-th entries (for i < j), and I was left wondering whether they mean ij as i*j or as the concatenation of i and j, e.g. 1,2 -> 2 or 12.
I can't find a consistent way to work out the wanted index.
See my example: you would expect all of the distances from entry 0 to anywhere else to be stored in entry 0 if the first option were valid.
Can anyone shed some light on how I can extract the wanted distance from entry x to entry y? Which index am I looking for?
Thanks!
This vector is in condensed form. It enumerates all pairs of indices in a natural order (in your example 0,1 0,2 0,3 1,2 1,3 2,3) and yields the distance between the elements at these array entries.
There is also the squareform function, which transforms the condensed form into a square matrix form (and vice versa). The square matrix form is exactly what you expect, i.e. entry ij (row i, column j) stores the distance between the i-th and j-th entries. For example, if you import squareform from scipy.spatial.distance and add print squareform(pdist(a)) at the end of your code, the output will be:
array([[ 0.,  3.,  1.,  4.],
       [ 3.,  0.,  4.,  1.],
       [ 1.,  4.,  0.,  5.],
       [ 4.,  1.,  5.,  0.]])
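If you want to index directly into the condensed vector, the position of the pair (i, j) for n points can be computed with a small helper (a sketch; the function name is mine):
def condensed_index(n, i, j):
    """Position of pair (i, j), i != j, in pdist's condensed vector for n points."""
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)

# Example with the 4 points above: the distance between entries 1 and 3
# is pdist(a)[condensed_index(4, 1, 3)], i.e. pdist(a)[4] == 1.0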
I'm trying to take the exp of nonzero elements in a sparse theano variable. I have the current code:
A = T.matrix("Some matrix with many zeros")
A_sparse = theano.sparse.csc_from_dense(A)
I'm trying to do something that's equivalent to the following numpy syntax:
mask = (A_sparse != 0)
A_sparse[mask] = np.exp(A_sparse[mask])
but Theano doesn't support != masks yet. (And (A_sparse > 0) | (A_sparse < 0) doesn't seem to work either.)
How can I achieve this?
The support for sparse matrices in Theano is incomplete, so some things are tricky to achieve. You can use theano.sparse.structured_exp(A_sparse) in that particular case, but I try to answer your question more generally below.
Comparison
In Theano one would normally use the comparison operators described here: http://deeplearning.net/software/theano/library/tensor/basic.html
For example, instead of A != 0, one would write T.neq(A, 0). With sparse matrices one has to use the comparison operators in theano.sparse. Both operands have to be sparse matrices, and the result is also a sparse matrix:
mask = theano.sparse.neq(A_sparse, theano.sparse.sp_zeros_like(A_sparse))
Modifying a Subtensor
In order to modify part of a matrix, one can use theano.tensor.set_subtensor. With dense matrices this would work:
indices = mask.nonzero()
A = T.set_subtensor(A[indices], T.exp(A[indices]))
Notice that Theano doesn't have a separate boolean type (the mask consists of zeros and ones), so nonzero() has to be called first to obtain the indices of the nonzero elements. Furthermore, this is not implemented for sparse matrices.
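Putting the dense pieces together, a minimal sketch might look like this (assuming a dense matrix; as noted, this path does not work for sparse matrices):
import numpy as np
import theano
import theano.tensor as T

A = T.matrix("A")
mask = T.neq(A, 0)                   # zeros and ones, not booleans
indices = mask.nonzero()             # indices of the nonzero entries
B = T.set_subtensor(A[indices], T.exp(A[indices]))

f = theano.function([A], B)
print(f(np.array([[1., 0.], [0., 2.]], dtype=theano.config.floatX)))
# exp is applied only where A was nonzero; zeros stay zero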
Operating on Nonzero Sparse Elements
Theano provides sparse operations that are said to be structured and operate only on the nonzero elements. See:
http://deeplearning.net/software/theano/tutorial/sparse.html#structured-operation
More precisely, they operate on the data attribute of a sparse matrix, independent of the indices of the elements. Such operations are straightforward to implement. Note that the structured operations will operate on all the values in the data array, also those that are explicitly set to zero.
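For this particular problem, structured_exp (mentioned at the start) is exactly such an operation; a minimal sketch, assuming the same setup as in the question:
import theano
import theano.sparse
import theano.tensor as T

A = T.matrix("Some matrix with many zeros")
A_sparse = theano.sparse.csc_from_dense(A)

# Apply exp only to the stored (nonzero) entries; zeros remain zero.
A_exp = theano.sparse.structured_exp(A_sparse)
f = theano.function([A], theano.sparse.dense_from_sparse(A_exp))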
Here's a way of doing this with the scipy.sparse module. I don't know how Theano implements its sparse matrices, but it's likely to be based on similar ideas (since it uses names like csc):
In [224]: A=sparse.csc_matrix([[1.,0,0,2,0],[0,0,3,0,0],[0,1,1,2,0]])
In [225]: A.A
Out[225]:
array([[ 1., 0., 0., 2., 0.],
[ 0., 0., 3., 0., 0.],
[ 0., 1., 1., 2., 0.]])
In [226]: A.data
Out[226]: array([ 1., 1., 3., 1., 2., 2.])
In [227]: A.data[:]=np.exp(A.data)
In [228]: A.A
Out[228]:
array([[ 2.71828183, 0. , 0. , 7.3890561 , 0. ],
[ 0. , 0. , 20.08553692, 0. , 0. ],
[ 0. , 2.71828183, 2.71828183, 7.3890561 , 0. ]])
The main attributes of the csc format are data, indices, and indptr. It's possible that data has some 0 values if you fiddle with them after creation, but a freshly created matrix shouldn't.
The matrix also has a nonzero method modeled on the numpy one. In practice it converts the matrix to coo format, filters out any zero values, and returns the row and col attributes:
In [229]: A.nonzero()
Out[229]: (array([0, 0, 1, 2, 2, 2]), array([0, 3, 2, 1, 2, 3]))
And the csc format allows indexing just as a dense numpy array:
In [230]: A[A.nonzero()]
Out[230]:
matrix([[ 2.71828183, 7.3890561 , 20.08553692, 2.71828183,
2.71828183, 7.3890561 ]])
T.where works.
A_sparse = T.where(A_sparse == 0, 0, T.exp(A_sparse))
@Seppo Envari's answer seems faster though, so I'll accept his answer.
I'm trying to use the scipy kmeans algorithm.
So I have this really simple example:
from numpy import array
from scipy.cluster.vq import vq, kmeans, whiten
features = array([[3,4],[3,5],[4,2],[4,2]])
book = array((features[0],features[2]))
final = kmeans(features,book)
and the result is
final
(array([[3, 4],
[4, 2]]), 0.25)
What I don't understand is that, to me, the centroid coordinates should be the barycentre of all the points belonging to the cluster, so in this example
[3, 9/2] and [4, 2]
Can anyone explain the result the scipy algorithm is giving?
It looks like it is preserving the data type that you are giving it (int). Try:
features = array([[3., 4.], [3., 5.], [4., 2.], [4., 2.]])
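With float input the centroids come out as the expected barycentres; a quick check of the full example:
from numpy import array
from scipy.cluster.vq import kmeans

features = array([[3., 4.], [3., 5.], [4., 2.], [4., 2.]])
book = array((features[0], features[2]))
final = kmeans(features, book)
# final[0] is now array([[3. , 4.5], [4. , 2. ]]),
# i.e. the barycentres [3, 9/2] and [4, 2] from the question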
I wish to initialize a symmetric matrix in Python and populate it with zeros.
At the moment, I have initiated an array of known dimensions but this is unsuitable for subsequent input into R as a distance matrix.
Are there any 'simple' methods in numpy to create a symmetric matrix?
Edit
I should clarify: creating the 'symmetric' matrix is fine. However, I am interested in generating only the lower triangular form, i.e.,
ar = numpy.zeros((3, 3))
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
I want:
array([[ 0],
       [ 0, 0],
       [ 0., 0., 0.]])
Is this possible?
I don't think it's feasible to work with that kind of triangular array.
So here is for example a straightforward implementation of (squared) pairwise Euclidean distances:
import numpy as np

def pdista(X):
    """Squared pairwise distances between all columns of X."""
    B = np.dot(X.T, X)
    q = np.diag(B)[:, None]
    return q + q.T - 2 * B
Performance-wise it's hard to beat (at the Python level). What would be the main advantage of not using this approach?
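A quick usage sketch (note the function assumes points are stored as columns, per the docstring):
import numpy as np

X = np.array([[1., 4., 0., 5.]])  # four 1-D points, one per column
D = pdista(X)
print(D)  # D[i, j] is the squared distance; e.g. D[0, 1] == 9.0 == (1 - 4)**2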