Weighted Least Squares in Statsmodels vs. Numpy? - python

I am trying to replicate the functionality of Statsmodels' weighted least squares (WLS) function with Numpy's ordinary least squares (OLS) function (Numpy refers to OLS as just "least squares").
In other words, I want to compute the WLS in Numpy. I used this Stack Overflow post as a reference, but I get drastically different R² values when moving from Statsmodels to Numpy.
Take the following example code that replicates this:
import numpy as np
import statsmodels.formula.api as smf
import pandas as pd
# Test Data
patsy_equation = "y ~ C(x) - 1" # Use minus one to get rid of the hidden intercept of "+ 1"
weight = np.array([0.37, 0.37, 0.53, 0.754])
y = np.array([0.23, 0.55, 0.66, 0.88])
x = np.array([3, 3, 3, 3])
d = {"x": x.tolist(), "y": y.tolist()}
data_df = pd.DataFrame(data=d)
# Weighted Least Squares from Statsmodel API
statsmodel_model = smf.wls(formula=patsy_equation, weights=weight, data=data_df)
statsmodel_r2 = statsmodel_model.fit().rsquared
# Weighted Least Squares from Numpy API
Aw = x.reshape((-1, 1)) * np.sqrt(weight[:, np.newaxis]) # Multiply two column vectors
Bw = y * np.sqrt(weight)
numpy_model, numpy_resid = np.linalg.lstsq(Aw, Bw, rcond=None)[:2]
numpy_r2 = 1 - numpy_resid / (Bw.size * Bw.var())
print("Statsmodels R²: " + str(statsmodel_r2))
print("Numpy R²: " + str(numpy_r2[0]))
After running such code, I get the following results:
Statsmodels R²: 2.220446049250313e-16
Numpy R²: 0.475486515775414
Clearly something is wrong here! Can anyone point out my mistake? Am I misunderstanding the patsy formula?

Related

Efficient expanding OLS in pandas

I would like to explore the solutions of performing expanding OLS in pandas (or other libraries that accept DataFrame/Series friendly) efficiently.
Assuming the dataset is large, I am NOT interested in any solutions with a for-loop;
I am looking for solutions about expanding rather than rolling. Rolling functions always require a fixed window while expanding uses a variable window (starting from beginning);
Please do not suggest pandas.stats.ols.MovingOLS because it is deprecated;
Please do not suggest other deprecated methods such as expanding_mean.
For example, there is a DataFrame df with two columns X and y. To make it simpler, let's just calculate beta.
Currently, I am thinking about something like
import numpy as np
import pandas as pd
import statsmodels.api as sm
def my_OLS_func(df, y_name, X_name):
    y = df[y_name]
    X = df[X_name]
    X = sm.add_constant(X)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
    return b
df = pd.DataFrame({'X':[1,2.5,3], 'y':[4,5,6.3]})
df['beta'] = df.expanding().apply(my_OLS_func, args = ('y', 'X'))
Expected values of df['beta'] are 0 (or NaN), 0.66666667, and 1.038462.
However, this method does not seem to work because the method seems very inflexible. I am not sure how one could pass the two Series as arguments.
Any suggestions would be appreciated.
One option is to use the RecursiveLS (recursive least squares) model from Statsmodels:
# Simulate some data
rs = np.random.RandomState(seed=12345)
nobs = 100000
beta = [10., -0.2]
sigma2 = 2.5
exog = sm.add_constant(rs.uniform(size=nobs))
eps = rs.normal(scale=sigma2**0.5, size=nobs)
endog = np.dot(exog, beta) + eps
# Construct and fit the recursive least squares model
mod = sm.RecursiveLS(endog, exog)
res = mod.fit()
# This is a 2 x 100,000 numpy array with the regression coefficients
# that would be estimated when using data from the beginning of the
# sample to each point. You should usually ignore the first k=2
# datapoints since they are controlled by a diffuse prior.
res.recursive_coefficients.filtered
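For the single-regressor case in the question (y regressed on a constant and one X), the expanding slope can also be computed without a loop from running sums. The following is a minimal sketch and not part of the answer above: `expanding_slope` is a hypothetical helper name, it only handles one regressor plus an intercept, and the running-sum formula can lose precision on long or badly scaled series.

```python
import numpy as np

def expanding_slope(x, y):
    """Slope of y ~ 1 + x fitted on the first k points, for k = 1..n."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = np.arange(1, x.size + 1)
    sx, sy = np.cumsum(x), np.cumsum(y)
    sxy, sxx = np.cumsum(x * y), np.cumsum(x * x)
    num = n * sxy - sx * sy      # n * sum(xy) - sum(x) * sum(y)
    den = n * sxx - sx ** 2      # n * sum(x^2) - sum(x)^2
    with np.errstate(divide="ignore", invalid="ignore"):
        slope = num / den
    slope[den == 0] = np.nan     # slope is undefined when x has no spread
    return slope

# Data from the question
betas = expanding_slope([1, 2.5, 3], [4, 5, 6.3])
print(betas)
```

On the question's data this gives NaN for the first row, followed by the expected 0.66666667 and 1.038462. The result can be assigned straight to a DataFrame column.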

How should I write the code scikit-learn PCA `.transform()` method by using its `.components`?

I thought the PCA .transform() method transforms a 3D point to 2D Point by just applying a matrix M to the 3D point P like below:
np.dot(M, P)
To ensure this is correct, I wrote the following code.
However, I couldn't reproduce the result of the PCA .transform() method.
How should I modify the code? Am I missing something?
from sklearn.decomposition import PCA
import numpy as np
data3d = np.arange(10*3).reshape(10, 3) ** 2
pca = PCA(n_components=2)
pca.fit(data3d)
pca_transformed2d = pca.transform(data3d)
sample_index = 0
sample3d = data3d[sample_index]
# Manually transform `sample3d` to 2 dimensions.
w11, w12, w13 = pca.components_[0]
w21, w22, w23 = pca.components_[1]
my_transformed2d = np.zeros(2)
my_transformed2d[0] = w11 * sample3d[0] + w12 * sample3d[1] + w13 * sample3d[2]
my_transformed2d[1] = w21 * sample3d[0] + w22 * sample3d[1] + w23 * sample3d[2]
print("================ Validation ================")
print("pca_transformed2d:", pca_transformed2d[sample_index])
print("my_transformed2d:", my_transformed2d)
if np.all(my_transformed2d == pca_transformed2d[sample_index]):
    print("My transformation is correct!")
else:
    print("My transformation is not correct...")
Output:
================ Validation ================
pca_transformed2d: [-492.36557212 12.28386702]
my_transformed2d: [ 3.03163093 -2.67255444]
My transformation is not correct...
PCA begins with centering the data: subtracting the average of all observations. In this case, centering is done with
centered_data = data3d - data3d.mean(axis=0)
Averaging out along axis=0 (rows) means only one row will be left, with three components of the mean. After centering, multiply the data by the PCA components; but instead of writing out matrix multiplication by hand, I'd use .dot:
my_transformed2d = pca.components_.dot(centered_data[sample_index])
Finally, verification. Don't use == between floating point numbers; exact equality is rare. Tiny discrepancies appear because of a different order of operations somewhere: for example,
0.1 + 0.2 - 0.3 == 0.1 - 0.3 + 0.2
is False. This is why we have np.allclose, which says "they are close enough".
if np.allclose(my_transformed2d, pca_transformed2d[sample_index]):
    print("My transformation is correct!")
else:
    print("My transformation is not correct...")
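The same idea extends from one sample to the whole dataset. This is a small sketch, assuming the default PCA settings (in particular `whiten=False`); it uses the fitted `mean_` attribute for centering instead of recomputing the mean by hand.

```python
import numpy as np
from sklearn.decomposition import PCA

data3d = np.arange(10 * 3).reshape(10, 3) ** 2
pca = PCA(n_components=2)
pca.fit(data3d)

# Center with the mean learned during fit, then project every row
# onto the principal components (one component per row of components_).
manual_2d = (data3d - pca.mean_) @ pca.components_.T

print(np.allclose(manual_2d, pca.transform(data3d)))
```

Each row of `manual_2d` corresponds to the per-sample result computed above with `pca.components_.dot(...)`.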

Proper Way to Fit a Lognormal Distribution with Weight in Python

Currently I have code to fit a lognormal distribution.
shape, loc, scale = sm.lognorm.fit(dataToLearn, floc = 0)
for b in bounds:
    toPlot.append((b, currCount+sm.lognorm.ppf(b, s = shape, loc = loc, scale = scale)))
I would like to be able to pass in a vector of weights to the fitting. Currently I have a workaround, where I keep all the weights rounded to 2 decimals and then repeat each value w times so that it gets weighted properly.
for i, d in enumerate(dataToLearn):
    dataToLearn2 += int(w[i] * 100) * [d]
This is getting too slow on my computer, so I was hoping for a more correct solution.
Please advise on how to make my workaround faster and more efficient, whether with scipy or numpy.
The SciPy distributions do not implement a weighted fit. For the log-normal distribution, however, there are explicit formulas for the (unweighted) maximum likelihood estimation, and these are easily generalized for weighted data. The explicit formulas are both (in effect) averages, and the generalization to the case of weighted data is to use weighted averages in the formulas.
Here's a script that demonstrates the calculation using a small data set with integer weights, so we know what the exact value of the fitted parameters should be.
import numpy as np
from scipy.stats import lognorm
# Sample data and weights. To enable an exact comparison with
# the method of generating an array with the values repeated
# according to their weight, I use an array of weights that is
# all integers.
x = np.array([2.5, 8.4, 9.3, 10.8, 6.8, 1.9, 2.0])
w = np.array([ 1, 1, 2, 1, 3, 3, 1])
#-----------------------------------------------------------------------------
# Fit the log-normal distribution by creating an array containing the values
# repeated according to their weight.
xx = np.repeat(x, w)
# Use the explicit formulas for the MLE of the log-normal distribution.
lnxx = np.log(xx)
muhat = np.mean(lnxx)
varhat = np.var(lnxx)
shape = np.sqrt(varhat)
scale = np.exp(muhat)
print("MLE using repeated array: shape=%7.5f scale=%7.5f" % (shape, scale))
#-----------------------------------------------------------------------------
# Use the explicit formulas for the weighted MLE of the log-normal
# distribution.
lnx = np.log(x)
muhat = np.average(lnx, weights=w)
# varhat is the weighted variance of ln(x). There isn't a function in
# numpy for the weighted variance, so we compute it using np.average.
varhat = np.average((lnx - muhat)**2, weights=w)
shape = np.sqrt(varhat)
scale = np.exp(muhat)
print("MLE using weights: shape=%7.5f scale=%7.5f" % (shape, scale))
#-----------------------------------------------------------------------------
# Might as well check that we get the same result from lognorm.fit() using the
# repeated array
shape, loc, scale = lognorm.fit(xx, floc=0)
print("MLE using lognorm.fit: shape=%7.5f scale=%7.5f" % (shape, scale))
The output is
MLE using repeated array: shape=0.70423 scale=4.57740
MLE using weights: shape=0.70423 scale=4.57740
MLE using lognorm.fit: shape=0.70423 scale=4.57740
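The weighted formulas from the script can be wrapped in a small function that also accepts non-integer weights. This is a sketch under the same assumption as above (location fixed at 0); `weighted_lognorm_mle` is a hypothetical name, not a SciPy function.

```python
import numpy as np

def weighted_lognorm_mle(x, w):
    """Weighted MLE of the log-normal shape and scale, with loc fixed at 0."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    lnx = np.log(x)
    mu = np.average(lnx, weights=w)                # weighted mean of ln(x)
    var = np.average((lnx - mu) ** 2, weights=w)   # weighted variance of ln(x)
    return np.sqrt(var), np.exp(mu)

x = np.array([2.5, 8.4, 9.3, 10.8, 6.8, 1.9, 2.0])
w = np.array([1, 1, 2, 1, 3, 3, 1])

shape, scale = weighted_lognorm_mle(x, w)
# Rescaling all weights by a constant leaves the estimates unchanged,
# so fractional weights need no rounding or repetition.
shape_frac, scale_frac = weighted_lognorm_mle(x, w / 100.0)
print("shape=%7.5f scale=%7.5f" % (shape, scale))
print("shape=%7.5f scale=%7.5f" % (shape_frac, scale_frac))
```

The resulting shape and scale can be passed to `lognorm.ppf` as `s` and `scale`, with `loc=0`.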
You can use numpy.repeat to make the workaround more efficient:
import numpy as np
dataToLearn = np.array([1,2,3,4,5])
weights = np.array([1,2,1,1,3])
print(np.repeat(dataToLearn, weights))
# Output: array([1, 2, 2, 3, 4, 5, 5, 5])
Very basic performance test of numpy.repeat performance:
import timeit
code_before = """
weights = np.array([1,2,1,1,3] * 1000)
dataToLearn = np.array([1,2,3,4,5] * 1000)
dataToLearn2 = []
for i, d in enumerate(dataToLearn):
    dataToLearn2 += int(weights[i]) * [d]
"""
code_after = """
weights = np.array([1,2,1,1,3] * 1000)
dataToLearn = np.array([1,2,3,4,5] * 1000)
np.repeat(dataToLearn, weights)
"""
print(timeit.timeit(code_before, setup="import numpy as np", number=1000))
print(timeit.timeit(code_after, setup="import numpy as np", number=1000))
As a result, I've got roughly 3.38 for your current approach vs 0.75 for numpy.repeat

Least Squares method in practice

A very simple regression task: I have three variables x1, x2, x3 with some random noise, and I know the target equation: y = q1*x1 + q2*x2 + q3*x3. I want to find the target coefficients q1, q2, q3
and evaluate the performance of the prediction using the mean Relative Squared Error (RSE), (Prediction/Real - 1)^2.
From my research, I see that this is an ordinary least squares problem, but I can't work out from the examples on the internet how to solve this particular problem in Python. Let's say I have this data:
import numpy as np
sourceData = np.random.rand(1000, 3)
koefs = np.array([1, 2, 3])
target = np.dot(sourceData, koefs)
(In real life the data are noisy and not normally distributed.) How can I find these coefficients using a least squares approach in Python? Any library is fine.
@ayhan made a valuable comment.
And there is a problem with your code: Actually there is no noise in the data you collect. The input data is noisy, but after the multiplication, you don't add any additional noise.
I've added some noise to your measurements and used the least squares formula to fit the parameters, here's my code:
data = np.random.rand(1000,3)
true_theta = np.array([1,2,3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
estimated_theta = np.linalg.inv(data.T @ data) @ data.T @ noisy_measurements
The estimated_theta will be close to true_theta. If you don't add noise to the measurements, they will be equal.
I've used the python3 matrix multiplication syntax.
You could use np.dot instead of @
That makes the code longer, so I've split the formula:
MTM_inv = np.linalg.inv(np.dot(data.T, data))
MTy = np.dot(data.T, noisy_measurements)
estimated_theta = np.dot(MTM_inv, MTy)
You can read up on least squares here: https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#The_general_problem
UPDATE:
Or you could just use the builtin least squares function:
np.linalg.lstsq(data, noisy_measurements)
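np.linalg.lstsq returns a tuple, and the coefficients are its first element. Here is a short sketch of unpacking it and checking it against the normal-equations formula. Two things differ from the snippet above and are my own assumptions: a seeded generator for reproducibility, and zero-mean Gaussian noise, so the estimate is not shifted by a nonzero noise mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 3))
true_theta = np.array([1.0, 2.0, 3.0])
noise = rng.normal(scale=0.1, size=1000)   # zero-mean noise (assumption)
noisy_measurements = data @ true_theta + noise

# lstsq returns (solution, residual sum of squares, rank, singular values)
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(
    data, noisy_measurements, rcond=None
)

# Same estimate from the normal equations; solve() avoids forming an explicit inverse
theta_normal = np.linalg.solve(data.T @ data, data.T @ noisy_measurements)

print(theta_lstsq)
print(np.allclose(theta_lstsq, theta_normal))
```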
In addition to @lhk's answer, I found SciPy's least_squares function, which makes it easy to get the requested behavior.
With it we can provide a custom function that returns the residuals, and so minimize the Relative Squared Error instead of the absolute squared difference:
import numpy as np
from scipy.optimize import least_squares
data = np.random.rand(1000,3)
true_theta = np.array([1,2,3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
# noisy_measurements[-1] = data[-1] @ (1000 * true_theta)  # Uncomment this outlier to see how much better the Relative Squared Error estimator works than the default absolute difference in this case.
def my_func(params, x, y):
    # Changing this line to (x @ params) - y gives the same result as np.linalg.lstsq
    res = (x @ params) / y - 1
    return res
x0 = np.ones(3)  # Initial guess for the parameters (an arbitrary choice)
res = least_squares(my_func, x0, args=(data, noisy_measurements))
estimated_theta = res.x
We can also supply a custom loss through the loss argument; it processes the residuals to form the final loss.
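As an illustration of the loss argument, here is a hedged sketch that uses one of SciPy's built-in robust losses, "soft_l1", on data with a single gross outlier. The data setup, the outlier size, and the `f_scale=0.1` value are assumptions chosen for the demonstration, not values from the answer above.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
data = rng.random((1000, 3))
true_theta = np.array([1.0, 2.0, 3.0])
y = data @ true_theta + rng.normal(scale=0.05, size=1000)
y[-1] += 100.0  # one gross outlier (assumption for the demo)

def residuals(params, x, y):
    # Plain (absolute) residuals; the loss function decides how they are penalized
    return x @ params - y

x0 = np.ones(3)  # arbitrary initial guess
plain = least_squares(residuals, x0, args=(data, y))
robust = least_squares(residuals, x0, loss="soft_l1", f_scale=0.1, args=(data, y))

print("plain: ", plain.x)
print("robust:", robust.x)
```

With the default loss the outlier pulls the estimate away from true_theta, while the robust loss limits its influence. `f_scale` sets roughly the residual size above which a point is treated as an outlier.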

Difference between R.scale() and sklearn.preprocessing.scale()

I am currently moving my data analysis from R to Python. When scaling a dataset in R I would use R.scale(), which, as I understand it, does the following: (x-mean(x))/sd(x)
To replace that function I tried to use sklearn.preprocessing.scale(). From my understanding of the description it does the same thing. Nonetheless I ran a little test-file and found out, that both of these methods have different return-values. Obviously the standard deviations are not the same... Is someone able to explain why the standard deviations "deviate" from one another?
MWE:
# import packages
from sklearn import preprocessing
import numpy
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up R namespaces
R = rpy2.robjects.r
np1 = numpy.array([[1.0,2.0],[3.0,1.0]])
print("Numpy-array:")
print(np1)
print("Scaled numpy array through R.scale()")
print(R.scale(np1))
print("-------")
print("Scaled numpy array through preprocessing.scale()")
print(preprocessing.scale(np1, axis = 0, with_mean = True, with_std = True))
scaler = preprocessing.StandardScaler()
scaler.fit(np1)
print("Mean of preprocessing.scale():")
print(scaler.mean_)
print("Std of preprocessing.scale():")
# Note: newer scikit-learn releases expose this attribute as scale_
print(scaler.std_)
Output:
It seems to have to do with how standard deviation is calculated.
>>> import numpy as np
>>> a = np.array([[1, 2],[3, 1]])
>>> np.std(a, axis=0)
array([ 1. , 0.5])
>>> np.std(a, axis=0, ddof=1)
array([ 1.41421356, 0.70710678])
From numpy.std documentation,
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
Apparently, R.scale() uses ddof=1, but sklearn.preprocessing.StandardScaler() uses ddof=0.
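The effect of the two divisors can be seen with plain NumPy on the array from the question. This is a minimal sketch: the `ddof=1` variant is the one the paragraph above attributes to R.scale(), and the `ddof=0` variant matches what np.std computes by default.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 1.0]])
centered = a - a.mean(axis=0)

# Population standard deviation (divide by N)
scaled_ddof0 = centered / a.std(axis=0, ddof=0)
# Sample standard deviation (divide by N - 1)
scaled_ddof1 = centered / a.std(axis=0, ddof=1)

print(scaled_ddof0)
print(scaled_ddof1)
```

For this 2x2 array the first result consists of entries of magnitude 1 and the second of entries of magnitude about 0.7071, so the two differ by a factor of sqrt(N / (N - 1)).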
EDIT: (To explain how to use alternate ddof)
There doesn't seem to be a straightforward way to calculate std with alternate ddof, without accessing the variables of the StandardScaler() object itself.
sc = StandardScaler()
sc.fit(data)
# Now, sc.mean_ and sc.std_ are the mean and standard deviation of the data
# Replace the sc.std_ value using std calculated using numpy
sc.std_ = numpy.std(data, axis=0, ddof=1)
The current answers are good, but sklearn has changed a bit meanwhile. The new syntax that makes sklearn behave exactly like R.scale() now is:
from sklearn.preprocessing import StandardScaler
import numpy as np
sc = StandardScaler()
sc.fit(data)
sc.scale_ = np.std(np.asarray(data), axis=0, ddof=1)
sc.transform(data)
Feature request:
https://github.com/scikit-learn/scikit-learn/issues/23758
R.scale documentation says:
The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values. In the case center = TRUE, this is the same as the standard deviation, but in general it is not. (To scale by the standard deviations without centering, use scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE)).)
However, sklearn.preprocessing.StandardScaler always scales by the standard deviation.
In my case, I want to replicate R.scale in Python without centering, so I followed @Sid's advice in a slightly different way:
import numpy as np
from sklearn.preprocessing import StandardScaler
def get_scale_1d(v):
    # I copied this function from the R source code
    v = v[~np.isnan(v)]
    std = np.sqrt(
        np.sum(v ** 2) / np.max([1, len(v) - 1])
    )
    return std
sc = StandardScaler()
sc.fit(data)
# On newer scikit-learn releases, where std_ was renamed, set sc.scale_ instead
sc.std_ = np.apply_along_axis(func1d=get_scale_1d, axis=0, arr=data)
sc.transform(data)
