Principal component regression using python - python

I have strain and temperature data, and I have read this article:
https://www.idtools.com.au/principal-component-regression-python-2/
I'm trying to build a model to predict the strain from the temperature.
I got the results below, and the cross-validation R^2 score is negative.
I have the data set here:
http://www.mediafire.com/file/r7dg7i9dacvpl2j/curve_fitting_ahmed.xlsx/file
My question is: do these cross-validation results make sense?
My code is the following. The input is a pandas DataFrame.
def pca_analysis(temperature, strain):
    # Import the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn import linear_model
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import mean_squared_error, r2_score

    # Prepare the data
    print("process data")
    T1 = temperature['T1'].tolist()
    W_A1 = strain[0]
    N = len(T1)
    xData = np.reshape(T1, (N, 1))
    yData = np.reshape(W_A1, (N, 1))

    # Step 1: standardize the input and run PCA, keeping the first components
    # (note: with a single input feature, PCA can produce only one component,
    # so the slice below keeps just that one column)
    pca = PCA()
    Xstd = StandardScaler().fit_transform(xData)
    Xreg = pca.fit_transform(Xstd)[:, :2]

    # Step 2: regression on the selected principal components
    regr = linear_model.LinearRegression()
    regr.fit(Xreg, W_A1)

    # Calibration: predictions on the training data
    y_c = regr.predict(Xreg)
    # Cross-validation: 10-fold out-of-sample predictions
    y_cv = cross_val_predict(regr, Xreg, W_A1, cv=10)

    # Calculate R^2 for calibration and cross-validation
    score_c = r2_score(W_A1, y_c)
    score_cv = r2_score(W_A1, y_cv)
    # Calculate mean squared error for calibration and cross-validation
    mse_c = mean_squared_error(W_A1, y_c)
    mse_cv = mean_squared_error(W_A1, y_cv)
    print(mse_c)
    print(mse_cv)
    print(score_c)
    print(score_cv)

    # Regression plot: predicted vs. measured strain
    z = np.polyfit(W_A1, y_c, 1)
    with plt.style.context('ggplot'):
        fig, ax = plt.subplots(figsize=(9, 5))
        ax.scatter(W_A1, y_c, c='red', s=0.4, edgecolors='k')
        ax.plot(W_A1, z[1] + z[0] * W_A1, c='blue', linewidth=1)
        ax.plot(W_A1, W_A1, color='green', linewidth=1)
        plt.title('$R^{2}$ (CV): ' + str(score_cv))
        plt.xlabel(r'Measured $^{\circ}$Strain')
        plt.ylabel(r'Predicted $^{\circ}$Strain')
        plt.show()
Here is the result of the PCR:
[Plot: predicted vs. measured strain, with the cross-validation R^2 in the title]
How would I improve the prediction using this data?

From the scikit-learn documentation, the value given by r2_score can be negative if your model fits worse than a constant model that always predicts the mean of the data. Obviously this is not what one wants from using ML; you expect to beat that baseline.
The first thing I would note is that your data seems like it may be quite nonlinear, in which case PCA struggles to improve model performance.
One potential substitute for PCA which can account for essentially any nonlinearity in the data is the use of autoencoders to preprocess it (good article on these here). They can account for nonlinearities if you use non-linear activation functions on some of the hidden layers of the autoencoder, which may help your model's performance; a rough sketch is below. There are many articles around the web that explain this; let me know if you would like more resources should you choose to pursue this route.
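As an illustration of the idea (my own sketch, assuming Keras is available; the layer sizes, epochs, and stand-in data are all placeholders):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 8    # placeholder input width
encoding_dim = 2  # size of the compressed representation

inputs = keras.Input(shape=(n_features,))
# Non-linear activations let the encoder capture non-linear structure
encoded = layers.Dense(4, activation='relu')(inputs)
encoded = layers.Dense(encoding_dim, activation='relu')(encoded)
decoded = layers.Dense(4, activation='relu')(encoded)
decoded = layers.Dense(n_features, activation='linear')(decoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)
autoencoder.compile(optimizer='adam', loss='mse')

X = np.random.rand(100, n_features)  # stand-in for your standardized features
autoencoder.fit(X, X, epochs=50, batch_size=16, verbose=0)
X_encoded = encoder.predict(X)       # use these codes in place of the PCA scores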
The next thing that I would note is that r2_score is really not the best way to measure error, and that using mean-squared error is much more popular, especially for linear regression. So, if you want to keep your model as simple as this, I would simply ignore the r2_score and move on from there. However, that being said, linear regression is not equipped to solve very complex problems due to its simplicity, and judging by the picture you provided, it's pretty clear to me that linear regression is very rough when applied to this dataset.
I would be interested to know the difference in mean-squared error between the PCA and non-PCA data. Here, the PCA features should give less error than the normal, non-PCA features. If they do not, then either your data is horribly non-linear (maybe?) or there is an error in your code (I looked over it and nothing is immediately, obviously wrong with it). For linear regression, mean-squared error is really almost the unanimous error function of choice, and it is remarkably effective. Hope this answers your question; leave a comment if anything is unclear and I will try to clarify as best as I can. A minimal sketch of the comparison I have in mind follows.
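This is my own sketch with stand-in data; the X and y here are placeholders for your features and strain:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error

X = np.random.rand(200, 3)                                       # stand-in feature matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(200)  # stand-in target

# Cross-validated MSE with the raw (non-PCA) features
y_cv_raw = cross_val_predict(LinearRegression(), X, y, cv=10)
print("raw MSE:", mean_squared_error(y, y_cv_raw))

# Cross-validated MSE with the first two PCA scores
X_pca = PCA(n_components=2).fit_transform(X)
y_cv_pca = cross_val_predict(LinearRegression(), X_pca, y, cv=10)
print("PCA MSE:", mean_squared_error(y, y_cv_pca))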
Also, while answering your question, I came across this other question that I believe explains your problem pretty well (and uses some math, so be prepared). Most notably, there are situations where R^2 is an appropriate metric for your model, but given your results, I would say that R^2 would probably be a poor choice of error function for this data.
Update: Given the values that you get for the mean squared error, my first guess would be that PCA is either 1) not working because of the nature of the data, or 2) implemented incorrectly. While I am not an expert with the libraries you are using, I would make sure that you transform all of the data in the same way, i.e. make sure that the PCA-transformed vectors are being compared with transformed vectors.
For moving on from linear regression, I would look into building a simple neural network or an SVR model (the latter might be a little trickier; a rough sketch follows). Both of these methods are proven to work well for complex data and are very adaptable. There are tons of resources online for both, and I think giving implementation specifics for either might be out of the scope of this question (you might have to ask a separate one about this).
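For SVR, a rough starting point could look like this (my own sketch, not tuned for this problem; the data and hyperparameters are placeholders you would tune by cross-validation):

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 1)                            # stand-in temperature
y = np.sin(6 * X[:, 0]) + 0.1 * np.random.randn(200)  # stand-in strain

# Scaling matters for SVR; the RBF kernel can capture non-linear structure
model = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0, epsilon=0.1))
scores = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')
print("CV MSE:", -scores.mean())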

Related

KNeighborsRegressor as denoising algorithm

On Kaggle I have found algorithms used for signal denoising, such as Savitzky-Golay filters, spline functions, autoregressive modelling, or KNeighborsRegressor itself.
Link:
https://www.kaggle.com/residentmario/denoising-algorithms
How exactly does it work? I cannot find any article explaining its use for signal denoising. What kind of algorithm is it? I would like to understand how it works.
It is a supervised learning algorithm; that is the short answer.
Normally the algorithm is first trained with known data, and it tries to learn a function that best represents that data, such that a new output can be produced for a previously unseen input.
Put simply, it will determine the value for a previously unseen input based on an average of the k nearest points it has previously seen. A better, more detailed explanation can be found below:
https://towardsdatascience.com/the-basics-knn-for-classification-and-regression-c1e8a6c955
In the Kaggle code:
the time vector is:
df.index.values[:, np.newaxis]
and the signal vector is:
df.iloc[:, 0]
It appears the person on Kaggle is using the data to first train the regressor - see below:
## define the KNN regressor
clf = KNeighborsRegressor(n_neighbors=100, weights='uniform')
## train it on the time/signal pairs
clf.fit(df.index.values[:, np.newaxis],
        df.iloc[:, 0])
giving him a function that represents the relationship between time and the signal value. With this, he then passes the time vector back to the regressor to get it to reproduce the signal:
y_pred = clf.predict(df.index.values[:, np.newaxis])
This new signal represents the model's best interpretation of the original. As you can see from the link I posted above, you can adjust certain parameters to produce a 'cleaner' signal, though this could also degrade the original signal.
One thing to note is that, used the way it is in the Kaggle notebook, this method only works for that one signal: since the input is time, the model cannot be used to predict future values:
y_pred = clf.predict(df.index.values[:, np.newaxis] + 400000)
ax = pd.Series(df.iloc[:, 0]).plot(color='lightgray')
pd.Series(y_pred).plot(color='black', ax=ax, figsize=(12, 8))
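For reference, here is a self-contained sketch of the same idea on synthetic data (my own example, not from the Kaggle notebook):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

# Synthetic noisy signal: a slow sine wave plus Gaussian noise
t = np.arange(2000)
signal = np.sin(t / 100.0) + 0.3 * np.random.randn(t.size)

# Fit time -> signal; predicting back the training inputs smooths the signal,
# since each prediction averages the 100 nearest (in time) observations
knn = KNeighborsRegressor(n_neighbors=100, weights='uniform')
knn.fit(t[:, np.newaxis], signal)
denoised = knn.predict(t[:, np.newaxis])

plt.plot(t, signal, color='lightgray', label='noisy')
plt.plot(t, denoised, color='black', label='denoised')
plt.legend()
plt.show()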

Why does Ridge model fitting show a warning when the power of the denominator in the alpha value is raised to 13 or more?

I was trying to create a loop to find out the variations in the accuracy scores of the train and test sets of the Boston housing data set when fitted with a Ridge regression model.
This was the for loop:
for i in range(1, 20):
    Ridge(alpha=1/(10**i)).fit(X_train, y_train)
It showed a warning beginning from i=13.
The warning being:
LinAlgWarning: Ill-conditioned matrix (rcond=6.45912e-17): result may not be accurate.
overwrite_a=True).T
What is the meaning of this warning?
And is it possible to get rid of it?
I tried executing it separately, outside the loop; that still didn't help.
#importing libraries and packages
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
#importing boston housing dataset from mglearn
X,y = mglearn.datasets.load_extended_boston()
#Splitting the dataset
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
#Fitting the training data using Ridge model with alpha = 1/(10**13)
rd = Ridge(alpha = 1/(10**13)).fit(X_train,y_train)
I expected it not to display the warning mentioned above for any value of i.
Try fitting your Ridge model with normalization: Ridge(normalize=True). I ran into the same error as you, and it was because my features included both extremely large and extremely small values, which were causing problems with the underlying linear algebra solver used to fit the model.
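A minimal sketch of the same idea with explicit scaling (my own example, assuming the X_train/y_train split from the question; note that newer scikit-learn versions removed the normalize argument, so a pipeline is the durable approach):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Scale features to zero mean / unit variance before the ridge solve,
# which improves the conditioning of the underlying linear system
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))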
In (kernel) ridge regression you construct a kernel matrix which contains the similarities between all of your training samples; ordinary ridge works analogously on the feature matrix.
The parameters of the ridge fit are found from both this matrix and your training labels.
If you have e.g. two samples that are extremely similar, the matrix to be solved will be nearly singular, i.e. ill-conditioned.
To get around this, a small value can be added to the diagonal, and that value is the alpha parameter that you give.
So what happens is that as your alpha value approaches 0, the matrix is more likely to be ill-conditioned (though this depends on the nature of your data).
But this should show up as poor cross-validation accuracy, so you don't have to worry too much about it.
So, all in all, if you keep your alpha above the warning threshold you will be fine, and in a cross-validation procedure the alpha value would likely be selected above this threshold anyway.
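To make the role of alpha concrete, here is a small numpy sketch of the closed-form ridge solve (my own illustration, not the scikit-learn internals):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 3))
X[:, 2] = X[:, 1] + 1e-9 * rng.random(50)  # two nearly identical columns
y = rng.random(50)

def ridge_coef(X, y, alpha):
    # alpha is added to the diagonal of X^T X, pushing the matrix
    # away from singularity before solving the normal equations
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

print(np.linalg.cond(X.T @ X))                     # huge: nearly singular
print(np.linalg.cond(X.T @ X + 1e-3 * np.eye(3)))  # far better conditioned
print(ridge_coef(X, y, 1e-3))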

transforming "back" scaled coefficients after making a fit using StandardScaler in conjunction with OLS

I'm running a linear model using statsmodels.formula.api.ols after scaling the continuous input variables using sklearn.preprocessing.StandardScaler.
Everything works fine during the model building. My issue is a more technical one: after the fitting is done, I want to convert the fitted coefficients back into the space where they can be used with un-scaled input.
I know how StandardScaler works: scaled = (x - x_mean) / sqrt(x_var). So for any given scaled x value, I can compute the transformation back (by hand if I wanted to): x = scaled * sqrt(x_var) + x_mean.
What I'm unsure of is how to apply this kind of reverse scaling to the fitted, "scaled" linear coefficients.
Can anyone explain how to do this?
As a side note, I'd really love to use sklearn.linear_model.LinearRegression with normalize=True in a pipeline; that way the coefficients are scaled back for you (it's much more elegant and streamlined). Alas, I really need the rich statistical output of the statsmodels ols method so that I can run an ANOVA analysis after the fit.
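For reference, the algebra quoted in the question implies the following back-transformation (a sketch with hypothetical variable names, not tied to any particular fitted model): if y = b0 + sum_i b_i * (x_i - mean_i) / std_i, then the raw-space slopes are b_i / std_i and the raw-space intercept is b0 - sum_i b_i * mean_i / std_i.

import numpy as np

def unscale_coefficients(b0_scaled, b_scaled, means, stds):
    # y = b0 + sum_i b_i * (x_i - mean_i) / std_i
    #   = (b0 - sum_i b_i * mean_i / std_i) + sum_i (b_i / std_i) * x_i
    b_scaled = np.asarray(b_scaled, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    b_raw = b_scaled / stds
    b0_raw = b0_scaled - np.sum(b_scaled * means / stds)
    return b0_raw, b_raw

# Hypothetical usage with a fitted StandardScaler `scaler` and a statsmodels
# result `res` whose first parameter is the intercept:
# b0_raw, b_raw = unscale_coefficients(res.params[0], res.params[1:],
#                                      scaler.mean_, np.sqrt(scaler.var_))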

Python model targeting n variable prediction equation

I am looking to build a predictive model and am working from our current JMP model. Our current approach is to guess an nth-degree polynomial and then look at which terms are not significant model effects. Polynomials are not always the best fit, and this leads to a lot of confusion and bad models. Our data can have between 2 and 7 effects and always has one response.
I want to use python for this, but package documentation and online guides for something like this are hard to find. I know how to fit a specific nth-degree polynomial or do a linear regression in python, but not how to 'guess' the best function type for a data set.
Am I missing something obvious, or should I be writing something that probes through a variety of function types? Precision is the most important thing here. I am working with a small (~2000x100) data set.
Potentially I can do regression on smaller training sets, test them against the validation set, then rank the models and choose the best. Is there something better?
Try using other regression models instead of the vanilla Linear Model.
You can use something like this for polynomial regression:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(input_data)
And you can constrain the weights through Lasso regression:
from sklearn import linear_model

clf = linear_model.Lasso(alpha=0.5, positive=True)
clf.fit(X_, Y_)
where Y_ is the output you want to train against.
Setting alpha to 0 turns it into a simple linear regression. alpha controls the strength of the L1 penalty on the weights; larger values shrink more coefficients to zero. You can also force the weights to be strictly positive. Check this out here.
Run it with a small degree and perform cross-validation to check how well it fits; a sketch of this loop is below.
Increasing the degree of the polynomial generally leads to over-fitting, so if you are forced to use degree 4 or 5, that means you should look for other models.
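A minimal sketch of that degree sweep (my own example; input_data and Y_ are stand-ins for your effects and response):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

input_data = np.random.rand(200, 3)  # stand-in for your 2-7 effects
Y_ = np.random.rand(200)             # stand-in for your response

for degree in range(1, 5):
    X_ = PolynomialFeatures(degree=degree).fit_transform(input_data)
    clf = Lasso(alpha=0.5, positive=True, max_iter=10000)
    scores = cross_val_score(clf, X_, Y_, cv=5,
                             scoring='neg_mean_squared_error')
    print(degree, -scores.mean())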
You should also take a look at this question. This explains how you can curve fit.
ANOVA (analysis of variance) uses covariance to determine which effects are statistically significant... you shouldn't have to choose terms at random.
However, if you are saying that your data is inhomogeneous (i.e., you shouldn't fit a single model to all the data), then you might consider using the scikit-learn toolkit to build a classifier that could choose a subset of the data to fit.

OLS Regression: Scikit vs. Statsmodels? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers. (Closed 2 years ago.)
Short version: I was using scikit's LinearRegression on some data, but I'm used to p-values, so I put the data into statsmodels' OLS, and although the R^2 is about the same, the variable coefficients are all different by large amounts. This concerns me, since the most likely explanation is that I've made an error somewhere, and now I don't feel confident in either output (since I have likely built one model incorrectly but don't know which one).
Longer version: Because I don't know where the issue is, I don't know exactly which details to include, and including everything is probably too much. I am also not sure about including code or data.
I am under the impression that scikit's LR and statsmodels OLS should both be doing OLS, and as far as I know OLS is OLS so the results should be the same.
For scikit's LR, the results are (statistically) the same whether or not I set normalize=True or =False, which I find somewhat strange.
For statsmodels OLS, I normalize the data using StandardScaler from sklearn. I add a column of ones so it includes an intercept (since scikit's output includes an intercept). More on that here: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html (Adding this column did not change the variable coefficients to any notable degree and the intercept was very close to zero.) StandardScaler didn't like that my ints weren't floats, so I tried this: https://github.com/scikit-learn/scikit-learn/issues/1709
That makes the warning go away but the results are exactly the same.
Granted, I'm using 5-fold CV for the sklearn approach (the R^2 values are consistent for both test and training data each time), while for statsmodels I just give it all the data.
R^2 is about 0.41 for both sklearn and statsmodels (this is good for social science). This could be a good sign or just a coincidence.
The data is observations of avatars in WoW (from http://mmnet.iis.sinica.edu.tw/dl/wowah/) which I munged about to make it weekly with some different features. Originally this was a class project for a data science class.
Independent variables include number of observations in a week (int), character level (int), if in a guild (Boolean), when seen (Booleans on weekday day, weekday eve, weekday late, and the same three for weekend), a dummy for character class (at the time for the data collection, there were only 8 classes in WoW, so there are 7 dummy vars and the original string categorical variable is dropped), and others.
The dependent variable is how many levels each character gained during that week (int).
Interestingly, some of the relative order within like variables is maintained across statsmodels and sklearn. So, rank order of "when seen" is the same although the loadings are very different, and rank order for the character class dummies is the same although again the loadings are very different.
I think this question is similar to this one: Difference in Python statsmodels OLS and R's lm
I am good enough at Python and stats to make a go of it, but then not good enough to figure something like this out. I tried reading the sklearn docs and the statsmodels docs, but if the answer was there staring me in the face I did not understand it.
I would love to know:
Which output might be accurate? (Granted they might both be if I missed a kwarg.)
If I made a mistake, what is it and how to fix it?
Could I have figured this out without asking here, and if so how?
I know this question has some rather vague bits (no code, no data, no output), but I am thinking it is more about the general processes of the two packages. Sure, one seems to be more stats and one seems to be more machine learning, but they're both OLS so I don't understand why the outputs aren't the same.
(I even tried some other OLS calls to triangulate, one gave a much lower R^2, one looped for five minutes and I killed it, and one crashed.)
Thanks!
It sounds like you are not feeding the same matrix of regressors X to both procedures (but see below). Here's an example to show you which options you need to use for sklearn and statsmodels to produce identical results.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
# Generate artificial data (2 regressors + constant)
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X)
beta = [1, .1, .5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e
# Fit regression model
sm.OLS(y, X).fit().params
>> array([ 1.4507724 , 0.08612654, 0.60129898])
LinearRegression(fit_intercept=False).fit(X, y).coef_
>> array([ 1.4507724 , 0.08612654, 0.60129898])
As a commenter suggested, even if you are giving both programs the same X, X may not have full column rank, and sm/sk could be taking (different) actions under the hood to make the OLS computation go through (i.e. dropping different columns).
I recommend you use pandas and patsy to take care of this:
import pandas as pd
from patsy import dmatrices
dat = pd.read_csv('wow.csv')
y, X = dmatrices('levels ~ week + character + guild', data=dat)
Or, alternatively, the statsmodels formula interface:
import statsmodels.formula.api as smf
dat = pd.read_csv('wow.csv')
mod = smf.ols('levels ~ week + character + guild', data=dat).fit()
Edit: This example might be useful: http://statsmodels.sourceforge.net/devel/example_formulas.html
If you use statsmodels, I would highly recommend using the statsmodels formula interface instead. You will get the same OLS results from the statsmodels formula interface as you would from sklearn.linear_model.LinearRegression, or R, or SAS, or Excel.
smod = smf.ols(formula ='y~ x', data=df)
result = smod.fit()
print(result.summary())
When in doubt, please
try reading the source code,
try a different language for benchmark, or
try OLS from scratch, which is basic linear algebra (a sketch is below).
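For the last point, here is a minimal from-scratch OLS via the normal equations (a sketch; numerically, an SVD-based solver such as np.linalg.lstsq is preferable):

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.random((100, 2))])  # constant + 2 regressors
y = X @ np.array([1.0, 0.1, 0.5]) + 0.1 * rng.standard_normal(100)

# beta = (X'X)^{-1} X'y, the closed-form OLS solution
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
# the SVD-based least-squares solver gives the same answer more stably
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_normal)
print(beta_lstsq)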
I just wanted to add here that, in terms of sklearn, LinearRegression does not actually fit by gradient descent, contrary to what is sometimes claimed. Under the hood it calls an SVD-based least-squares solver (scipy.linalg.lstsq), which is an analytical closed-form approach just like the OLS in statsmodels; it is sklearn's SGDRegressor that uses a gradient-descent-style numerical method sensitive to initial conditions. Given the same design matrix (including how the intercept is handled), the two libraries should therefore agree up to numerical precision, and large differences are more likely to come from how the data was prepared than from the solver.
