Difference between R.scale() and sklearn.preprocessing.scale()

Difference between R.scale() and sklearn.preprocessing.scale() - python

I am currently moving my data analysis from R to Python. When scaling a dataset in R i would use R.scale(), which in my understanding would do the following: (x-mean(x))/sd(x)
To replace that function I tried to use sklearn.preprocessing.scale(). From my understanding of the description it does the same thing. Nonetheless I ran a little test-file and found out, that both of these methods have different return-values. Obviously the standard deviations are not the same... Is someone able to explain why the standard deviations "deviate" from one another?
MWE:
# import packages
from sklearn import preprocessing
import numpy
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up R namespaces
R = rpy2.robjects.r
np1 = numpy.array([[1.0,2.0],[3.0,1.0]])
print "Numpy-array:"
print np1
print "Scaled numpy array through R.scale()"
print R.scale(np1)
print "-------"
print "Scaled numpy array through preprocessing.scale()"
print preprocessing.scale(np1, axis = 0, with_mean = True, with_std = True)
scaler = preprocessing.StandardScaler()
scaler.fit(np1)
print "Mean of preprocessing.scale():"
print scaler.mean_
print "Std of preprocessing.scale():"
print scaler.std_
Output:

It seems to have to do with how standard deviation is calculated.
>>> import numpy as np
>>> a = np.array([[1, 2],[3, 1]])
>>> np.std(a, axis=0)
array([ 1. , 0.5])
>>> np.std(a, axis=0, ddof=1)
array([ 1.41421356, 0.70710678])
From numpy.std documentation,
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
Apparently, R.scale() uses ddof=1, but sklearn.preprocessing.StandardScaler() uses ddof=0.
EDIT: (To explain how to use alternate ddof)
There doesn't seem to be a straightforward way to calculate std with alternate ddof, without accessing the variables of the StandardScaler() object itself.
sc = StandardScaler()
sc.fit(data)
# Now, sc.mean_ and sc.std_ are the mean and standard deviation of the data
# Replace the sc.std_ value using std calculated using numpy
sc.std_ = numpy.std(data, axis=0, ddof=1)

The current answers are good, but sklearn has changed a bit meanwhile. The new syntax that makes sklearn behave exactly like R.scale() now is:
from sklearn.preprocessing import StandardScaler
import numpy as np
sc = StandardScaler()
sc.fit(data)
sc.scale_ = np.std(data, axis=0, ddof=1).to_list()
sc.transform(data)
Feature request:
https://github.com/scikit-learn/scikit-learn/issues/23758

R.scale documentation says:
The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values. In the case center = TRUE, this is the same as the standard deviation, but in general it is not. (To scale by the standard deviations without centering, use scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE)).)
However, sklearn.preprocessing.StandardScale always scale with standard deviation.
In my case, I want to replicate R.scale in Python without centered,I followed #Sid advice in a slightly different way:
import numpy as np
def get_scale_1d(v):
# I copy this function from R source code haha
v = v[~np.isnan(v)]
std = np.sqrt(
np.sum(v ** 2) / np.max([1, len(v) - 1])
)
return std
sc = StandardScaler()
sc.fit(data)
sc.std_ = np.apply_along_axis(func1d=get_scale_1d, axis=0, arr=x)
sc.transform(data)

Related

scaling data to specific range in python

I would like to scale an array of size [192,4000] to a specific range. I would like each row (1:192) to be rescaled to a specific range e.g. (-840,840). I run a very simple code:
import numpy as np
from sklearn import preprocessing as sp
sample_mat = np.random.randint(-840,840, size=(192, 4000))
scaler = sp.MinMaxScaler(feature_range=(-840,840))
scaler = scaler.fit(sample_mat)
scaled_mat= scaler.transform(sample_mat)
This messes up my matrix range, even when max and min of my original matrix is exactly the same. I can't figure out what is wrong, any idea?

You can do this manually.
It is a linear transformation of the minmax normalized data.
interval_min = -840
interval_max = 840
scaled_mat = (sample_mat - np.min(sample_mat) / (np.max(sample_mat) - np.min(sample_mat)) * (interval_max - interval_min) + interval_min

MinMaxScaler support feature_range argument on initialization that can produce the output in a certain range.
scaler = MinMaxScaler(feature_range=(1, 2)) will yield output in the (1,2) range

Efficient expanding OLS in pandas

I would like to explore the solutions of performing expanding OLS in pandas (or other libraries that accept DataFrame/Series friendly) efficiently.
Assumming the dataset is large, I am NOT interested in any solutions with a for-loop;
I am looking for solutions about expanding rather than rolling. Rolling functions always require a fixed window while expanding uses a variable window (starting from beginning);
Please do not suggest pandas.stats.ols.MovingOLS because it is deprecated;
Please do not suggest other deprecated methods such as expanding_mean.
For example, there is a DataFrame df with two columns X and y. To make it simpler, let's just calculate beta.
Currently, I am thinking about something like
import numpy as np
import pandas as pd
import statsmodels.api as sm
def my_OLS_func(df, y_name, X_name):
y = df[y_name]
X = df[X_name]
X = sm.add_constant(X)
b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
return b
df = pd.DataFrame({'X':[1,2.5,3], 'y':[4,5,6.3]})
df['beta'] = df.expanding().apply(my_OLS_func, args = ('y', 'X'))
Expected values of df['beta'] are 0 (or NaN), 0.66666667, and 1.038462.
However, this method does not seem to work because the method seems very inflexible. I am not sure how one could pass the two Series as arguments.
Any suggestions would be appreciated.

One option is to use the RecursiveLS (recursive least squares) model from Statsmodels:
# Simulate some data
rs = np.random.RandomState(seed=12345)
nobs = 100000
beta = [10., -0.2]
sigma2 = 2.5
exog = sm.add_constant(rs.uniform(size=nobs))
eps = rs.normal(scale=sigma2**0.5, size=nobs)
endog = np.dot(exog, beta) + eps
# Construct and fit the recursive least squares model
mod = sm.RecursiveLS(endog, exog)
res = mod.fit()
# This is a 2 x 100,000 numpy array with the regression coefficients
# that would be estimated when using data from the beginning of the
# sample to each point. You should usually ignore the first k=2
# datapoints since they are controlled by a diffuse prior.
res.recursive_coefficients.filtered

Are these functions equivalent?

I am building a neural network that makes use of T-distribution noise. I am using functions defined in the numpy library np.random.standard_t and the one defined in tensorflow tf.distributions.StudentT. The link to the documentation of the first function is here and that to the second function is here. I am using the said functions like below:
a = np.random.standard_t(df=3, size=10000) # numpy's function
t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
sess = tf.Session()
b = sess.run(t_dist.sample(10000))
In the documentation provided for the Tensorflow implementation, there's a parameter called scale whose description reads
The scaling factor(s) for the distribution(s). Note that scale is not technically the standard deviation of this distribution but has semantics more similar to standard deviation than variance.
I have set scale to be 1.0 but I have no way of knowing for sure if these refer to the same distribution.
Can someone help me verify this? Thanks

I would say they are, as their sampling is defined in almost the exact same way in both cases. This is how the sampling of tf.distributions.StudentT is defined:
def _sample_n(self, n, seed=None):
# The sampling method comes from the fact that if:
# X ~ Normal(0, 1)
# Z ~ Chi2(df)
# Y = X / sqrt(Z / df)
# then:
# Y ~ StudentT(df).
seed = seed_stream.SeedStream(seed, "student_t")
shape = tf.concat([[n], self.batch_shape_tensor()], 0)
normal_sample = tf.random.normal(shape, dtype=self.dtype, seed=seed())
df = self.df * tf.ones(self.batch_shape_tensor(), dtype=self.dtype)
gamma_sample = tf.random.gamma([n],
0.5 * df,
beta=0.5,
dtype=self.dtype,
seed=seed())
samples = normal_sample * tf.math.rsqrt(gamma_sample / df)
return samples * self.scale + self.loc # Abs(scale) not wanted.
So it is a standard normal sample divided by the square root of a chi-square sample with parameter df divided by df. The chi-square sample is taken as a gamma sample with parameter 0.5 * df and rate 0.5, which is equivalent (chi-square is a special case of gamma). The scale value, like the loc, only comes into play in the last line, as a way to "relocate" the distribution sample at some point and scale. When scale is one and loc is zero, they do nothing.
Here is the implementation for np.random.standard_t:
double legacy_standard_t(aug_bitgen_t *aug_state, double df) {
double num, denom;
num = legacy_gauss(aug_state);
denom = legacy_standard_gamma(aug_state, df / 2);
return sqrt(df / 2) * num / sqrt(denom);
})
So essentially the same thing, slightly rephrased. Here we have also have a gamma with shape df / 2 but it is standard (rate one). However, the missing 0.5 is now by the numerator as / 2 within the sqrt. So it's just moving the numbers around. Here there is no scale or loc, though.
In truth, the difference is that in the case of TensorFlow the distribution really is a noncentral t-distribution. A simple empirical proof that they are the same for loc=0.0 and scale=1.0 is to plot histograms for both distributions and see how close they look.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
np.random.seed(0)
t_np = np.random.standard_t(df=3, size=10000)
with tf.Graph().as_default(), tf.Session() as sess:
tf.random.set_random_seed(0)
t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
t_tf = sess.run(t_dist.sample(10000))
plt.hist((t_np, t_tf), np.linspace(-10, 10, 20), label=['NumPy', 'TensorFlow'])
plt.legend()
plt.tight_layout()
plt.show()
Output:
That looks pretty close. Obviously, from the point of view of statistical samples, this is not any kind of proof. If you were not still convinced, there are some statistical tools for testing whether a sample comes from a certain distribution or two samples come from the same distribution.

Weighted Least Squares in Statsmodels vs. Numpy?

I am trying to replicate the functionality of Statsmodels's weight least squares (WLS) function with Numpy's ordinary least squares (OLS) function (i.e. Numpy refers to OLS as just "least squares").
In other words, I want to compute the WLS in Numpy. I used this Stackoverflow post as reference, but drastically different R² values arise moving from Statsmodel to Numpy.
Take the following example code that replicates this:
import numpy as np
import statsmodels.formula.api as smf
import pandas as pd
# Test Data
patsy_equation = "y ~ C(x) - 1" # Use minus one to get ride of hidden intercept of "+ 1"
weight = np.array([0.37, 0.37, 0.53, 0.754])
y = np.array([0.23, 0.55, 0.66, 0.88])
x = np.array([3, 3, 3, 3])
d = {"x": x.tolist(), "y": y.tolist()}
data_df = pd.DataFrame(data=d)
# Weighted Least Squares from Statsmodel API
statsmodel_model = smf.wls(formula=patsy_equation, weights=weight, data=data_df)
statsmodel_r2 = statsmodel_model.fit().rsquared
# Weighted Least Squares from Numpy API
Aw = x.reshape((-1, 1)) * np.sqrt(weight[:, np.newaxis]) # Multiply two column vectors
Bw = y * np.sqrt(weight)
numpy_model, numpy_resid = np.linalg.lstsq(Aw, Bw, rcond=None)[:2]
numpy_r2 = 1 - numpy_resid / (Bw.size * Bw.var())
print("Statsmodels R²: " + str(statsmodel_r2))
print("Numpy R²: " + str(numpy_r2[0]))
After running such code, I get the following results:
Statsmodels R²: 2.220446049250313e-16
Numpy R²: 0.475486515775414
Clearly something is wrong here! Can anyone point out my flaws here? Am I miss understanding the patsy formula?

What is the pandas equivalent of R's qnorm()

I am moving some code from R to Anaconda Python. The R code uses qnorm, documented as "quantile function for the normal distribution with mean equal to mean and standard deviation equal to sd."
The call and parameters are:
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
p vector of probabilities.
mean vector of means.
sd vector of standard deviations.
log.p logical; if TRUE, probabilities p are given as log(p).
lower.tail logical; if TRUE (default), probabilities are
P[X≤x] otherwise, P[X].
I don't see any equivalent in pandas.Series. Have I missed it, is it on another object, or is there some equivalent in another library?

A lot of this equivalent functionality is found in scipy.stats. In this case, you're looking for scipy.stats.norm.ppf.
qnorm(p, mean = 0, sd = 1) is equivalent to scipy.stats.norm.ppf(q, loc=0, scale=1).
import scipy.stats as st
>>> st.norm.ppf([0.01, 0.99])
array([-2.32634787, 2.32634787])
>>> st.norm.ppf([0.01, 0.99], loc=10, scale=0.1)
array([ 9.76736521, 10.23263479])

Just to expand #miradulo answer. If you want to get also qnorm(lower.tail=FALSE) you can just multiply by -1:
In R:
qnorm(0.8, lower.tail = F)
-0.8416212
In python
from scipy.stats import norm
norm.ppf(0.8) * -1
-0.8416212

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Difference between R.scale() and sklearn.preprocessing.scale() - python

Related

scaling data to specific range in python

Efficient expanding OLS in pandas

Are these functions equivalent?

Weighted Least Squares in Statsmodels vs. Numpy?

What is the pandas equivalent of R's qnorm()

Categories

Resources