Regression model with statsmodels in Python

This is more of a stats question, as the code is working fine, but I am learning regression modeling in Python. I have some code below using statsmodels to create a simple linear regression model:
import statsmodels.api as sm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
ng = pd.read_csv('C:/Users/ben/ngDataBaseline.csv', thousands=',', index_col='Date', parse_dates=True)
X = ng['HDD']
y = ng['Therm']
# Note the difference in argument order
model = sm.OLS(y, X).fit()
# Print out the statistics
model.summary()
I get an output like the screenshot below. I am trying to judge the goodness of fit, and I know the R^2 is high, but is it possible to find the root mean squared error (RMSE) of the prediction with statsmodels?
I'm also trying to research whether I can estimate the sampling distribution with a confidence interval. If I am interpreting the table correctly for the intercept HDD 5.9309, with standard error 0.220 and a low p-value of 0.000, is it a 97.5% confidence interval that the value of HDD (or is it my dependent variable Therm?) will be between 5.489 and 6.373? Or I think as a percentage that could be expressed as ~ +/- 0.072%?
EDIT: included multiple regression table

Is it possible to calculate the RMSE with statsmodels? Yes, but you'll have to first generate the predictions with your model and then use the rmse function from statsmodels.tools.eval_measures.
from statsmodels.tools.eval_measures import rmse
# fit your model, which you have already done
# now generate predictions
ypred = model.predict(X)
# calculate the RMSE (use a name other than `rmse` so the imported function isn't shadowed)
error = rmse(y, ypred)
As for interpreting the results, HDD isn't the intercept; it's your independent variable. Its coefficient (i.e. the weight) is 5.9309, with a standard error of 0.220. The t-score for this variable is very high, suggesting it is a good predictor, and correspondingly the p-value is very small (close to 0).
The 5.489 and 6.373 values are your confidence bounds for a 95% confidence interval. The bounds are calculated by adding to or subtracting from the coefficient the standard error times the t-statistic associated with the 95% confidence interval.
The t-statistic depends on your sample size, which in your case is 53, so your degrees of freedom is 52. Using a t-table, for df=52 and a confidence level of 95%, the t-statistic is 2.0066. The bounds can therefore be calculated manually as follows:
lower: 5.9309 - (2.0066 x 0.220) = 5.489
upper: 5.9309 + (2.0066 x 0.220) = 6.372
Of course, there's some precision loss due to rounding but you can see the manual calculation is really close to what's reported in the summary.
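For reference, statsmodels can also produce these bounds directly; a minimal sketch, assuming the fitted model from the question's code:
from scipy import stats
# 95% CI bounds as reported in the summary (alpha=0.05 is the default)
print(model.conf_int(alpha=0.05))
# manual equivalent: coefficient +/- t-statistic * standard error
t_crit = stats.t.ppf(0.975, 52)   # two-sided 95% with df = 52
print(5.9309 - t_crit * 0.220)    # ~5.489
print(5.9309 + t_crit * 0.220)    # ~6.372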
Additional response to your comments:
There are several metrics you can use to evaluate goodness of fit. One is the adjusted R-squared statistic; others are RMSE, the F-statistic, and AIC/BIC. It's up to you to decide which metric or metrics to use. For me, I usually use the adjusted R-squared and/or RMSE, though RMSE is more of a relative metric to compare against other models.
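All of these are exposed as attributes on the fitted statsmodels results object; a quick sketch, assuming the model variable from the question:
import numpy as np
print(model.rsquared, model.rsquared_adj)  # R-squared and adjusted R-squared
print(model.fvalue, model.f_pvalue)        # F-statistic and its p-value
print(model.aic, model.bic)                # information criteria
print(np.sqrt(model.mse_resid))            # RMSE of the residuals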
Now looking at your model summaries, both models fit well, especially the first model, given its high adjusted R-squared value. There may be room for improvement in the second model (you might try different combinations of the independent variables), but you won't know unless you experiment. Ultimately, there's no right or wrong model; it comes down to building several models and comparing them to pick the best one. I'll also link an article that explains some of the goodness-of-fit metrics for regression models.
As for confidence intervals, I'll link this SO post, since the person who answered it has code to create the confidence interval. You'll want to look at the predict_mean_ci_low and predict_mean_ci_high variables he created in his code. These two variables give you the confidence interval at each observation, and from there you can calculate the +/- therms/kWh by subtracting the lower CI from your prediction, or your prediction from the upper CI.
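If your statsmodels version has it, get_prediction gives the same per-observation intervals without hand-rolling them; a sketch, assuming the fitted model from above:
pred_frame = model.get_prediction(X).summary_frame(alpha=0.05)
# mean_ci_lower / mean_ci_upper bound the mean prediction at each observation
print(pred_frame[['mean', 'mean_ci_lower', 'mean_ci_upper']].head())
plus_minus = pred_frame['mean_ci_upper'] - pred_frame['mean']  # the +/- at each point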

Related

Calculating percentage error from R-squared error

I created an ML model with scikit-learn and Python. I calculated the R-squared error. Is there a way to convert this error to a percentage error?
For example, if my true values are 100 and 50, and my predicted values are 90 and 40, my average percentage error is 15%, because the error for the first prediction is 10% and the error for the second prediction is 20%.
Is there a way to calculate the percentage error (average percentage error) based on the value that I get for R-squared?
It is not possible. R-squared is calculated via RSS, the residual sum of squares: R-squared = 1 - (RSS of your model) / (RSS of the intercept-only model). From this you can see that R-squared is not really an error per se, but the percentage of variance explained.
We can use an example dataset:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2
import numpy as np
X, y = load_boston(return_X_y=True)
reg = LinearRegression().fit(X, y)
We let the prediction and mean of y be:
ybar = reg.predict(X)
ymean = y.mean()
The R-squared is:
1 - sum((y - ybar)**2) / sum((y - ymean)**2)
# 0.7406426641094095
reg.score(X, y)
# 0.7406426641094095
Whereas your percentage error is:
np.mean(abs(y - ybar) / y)
# 0.16417298806489977
As you can see, it is not possible to get back the mean percentage error from R-squared, because the residuals have already been summed up, whereas the percentage error needs each error relative to its observation.
From your question, it sounds like you're working with a regression model. I would recommend looking into sklearn's built-in regression accuracy metrics rather than trying to convert R^2, which is itself an accuracy metric. For what you are trying to do, I would probably recommend mean_absolute_error or median_absolute_error, but other accuracy metrics can be useful in tuning your model!
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import median_absolute_error
# assumes a fitted model and a held-out test split (X_test, y_test)
y_pred = model.predict(X_test)
MAE = mean_absolute_error(y_test, y_pred)
MEDAE = median_absolute_error(y_test, y_pred)
If you're building a classifier, you should be able to use sklearn's accuracy_score metric. This divides the number of correct predictions by the total number of predictions. Multiplying this number by 100 yields the percentage of correct predictions; to get the percentage of incorrect predictions, use 100 * (1 - accuracy_score).
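A minimal sketch of that classification case, assuming y_test and y_pred hold your true and predicted labels:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print('percent correct:', 100 * acc)
print('percent incorrect:', 100 * (1 - acc))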
The above answers are adequate, but I sensed some confusion in the question, so I'm leaving this here.
R-squared is a metric that answers the question "If I were to just use the average of the target, would that be better than my predictions?" If your model is worse than that baseline (the average of the target), R-squared is below zero; if your model is better, it is somewhere closer to one. As stated before, there are guidelines regarding which error you should use. If you're a beginner, I would suggest mean squared error: fitting finds the slope and the intercept where the mean squared error is minimized (some fancy differentiation takes place there). In MSE, the distance between each data point and your model's prediction is squared, the squared distances (errors) are summed up, and the mean of that sum is taken. Therefore, there is no way to calculate the percentage error from R-squared, as the two are not really related. You can compute MSE with sklearn like this:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
For other metrics in sklearn (including classification and clustering metrics), see the sklearn.metrics documentation. Sklearn's documentation is often better than the tutorials online.
You can also simply run sklearn.metrics.SCORERS.keys() to see the available metrics in sklearn (newer versions replace this with sklearn.metrics.get_scorer_names()).

Different F1 scores for different preprocessing techniques - sklearn

I am building a classification model using sklearn's GradientBoostingClassifier. For the same model, I tried different preprocessing techniques on the same data: StandardScaler, scale, and Normalizer. But I get a different f1_score each time: it is highest for StandardScaler and lowest for Normalizer. Why is that? Is there any other technique for which I can get an even higher score?
The difference lies in their respective definitions:
StandardScaler: Standardize features by removing the mean and scaling to unit variance
Normalizer: Normalize samples individually to unit norm.
Scale: Standardize a dataset along any axis. Center to the mean and component wise scale to unit variance.
The data used to fit your model changes under each transformation, so the F1 score changes too.
Here is a useful link comparing different scalers: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py
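To see the difference concretely, here is a small sketch on made-up data; note that StandardScaler and scale operate per column, while Normalizer operates per row:
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer, scale
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
print(StandardScaler().fit_transform(X))  # each column: zero mean, unit variance
print(scale(X))                           # the function API; same result here
print(Normalizer().fit_transform(X))      # each row scaled to unit L2 norm
Feeding a model data scaled in such different ways naturally produces different F1 scores.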

Limitations of Regression in Machine Learning?

I've been learning some of the core concepts of ML lately and writing code using the Sklearn library. After some basic practice, I tried my hand at the AirBnb NYC dataset from kaggle (which has around 40000 samples) - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png
I tried to make a model that could predict the price of a room/apt given the various features of the dataset. I realised that this was a regression problem and using this sklearn cheat-sheet, I started trying the various regression models.
I used sklearn.linear_model.Ridge as my baseline, and after some basic data cleaning I got an abysmal R^2 score of 0.12 on my test set. Then I thought maybe the linear model was too simplistic, so I tried the 'kernel trick' method adapted for regression (sklearn.kernel_ridge.KernelRidge), but it took too long to fit (>1 hr)! To counter that, I used the sklearn.kernel_approximation.Nystroem function to approximate the kernel map, applied the transformation to the features prior to training, and then used a simple linear regression model. However, even that took a lot of time to transform and fit as I increased the n_components parameter, which I had to do to get any meaningful increase in accuracy.
So I am wondering: what happens when you want to do regression on a huge dataset? The kernel trick is extremely computationally expensive, while linear regression models are too simplistic, as real data is seldom linear. So are neural nets the only answer, or is there some clever solution that I am missing?
P.S. I am just starting on Overflow so please let me know what I can do to make my question better!
This is a great question, but as often happens, there is no simple answer to a complex problem. Regression is not as simple as it is often presented. It involves a number of assumptions and is not limited to linear least-squares models. It takes a couple of university courses to fully understand. Below is a quick (and far from complete) memo about regression:
Nothing will replace proper analysis. This might involve expert interviews to understand limits of your dataset.
Your model (any model, not limited to regressions) is only as good as your features. If home price depends on local tax rate or school rating, even a perfect model would not perform well without these features.
Some features cannot be included in the model by design, so never expect a perfect score in the real world. For example, it is practically impossible to account for access to grocery stores, eateries, clubs, etc. Many of these features are also moving targets, as they tend to change over time. Even an R^2 of 0.12 might be great if human experts perform worse.
Models have their assumptions. Linear regression expects that dependent variable (price) is linearly related to independent ones (e.g. property size). By exploring residuals you can observe some non-linearities and cover them with non-linear features. However, some patterns are hard to spot, while still addressable by other models, like non-parametric regressions and neural networks.
So, why people still use (linear) regression?
it is the simplest and fastest model; this matters a lot for real-time systems and statistical analysis
often it is used as a baseline model: before trying a fancy neural network architecture, it helps to know how much we improve over a naive method
sometimes regressions are used to test certain assumptions, e.g. linearity of effects and relations between variables
To summarize, regression is definitely not the ultimate tool in most cases, but it is usually the cheapest solution to try first.
Update, to illustrate the point about non-linearity:
After building a regression you calculate the residuals, i.e. the regression error predicted_value - true_value. Then, for each feature, you make a scatter plot where the horizontal axis is the feature value and the vertical axis is the error value. Ideally, the residuals have a normal distribution and do not depend on the feature value: errors are more often small than large, and look similar across the plot.
This is how it should look:
This is still normal - it only reflects the difference in density of your samples, but errors have the same distribution:
This is an example of nonlinearity (a periodic pattern, add sin(x+b) as a feature):
Another example of non-linearity (adding squared feature should help):
The above two examples can be described as the residual mean differing depending on the feature value. Other problems include, but are not limited to:
different variance depending on feature value
non-normal distribution of residuals (error is either +1 or -1, clusters, etc)
Some of the pictures above are taken from here:
http://www.contrib.andrew.cmu.edu/~achoulde/94842/homework/regression_diagnostics.html
It is a great read on regression diagnostics for beginners.
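A minimal sketch of such a residual plot, assuming reg is a fitted sklearn regressor and X, y are numpy training arrays:
import matplotlib.pyplot as plt
residuals = reg.predict(X) - y              # predicted_value - true_value, as defined above
plt.scatter(X[:, 0], residuals, alpha=0.3)  # one feature on the horizontal axis
plt.axhline(0, color='red')
plt.xlabel('feature value')
plt.ylabel('residual')
plt.show()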
I'll take a stab at this one. Look at my notes/comments embedded in the code. Keep in mind, these are just a few ideas that I tested; there are all kinds of other things you can try (get more data, test different models, etc.).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
from sklearn.linear_model import LassoCV
# Predicting Continuous Target Variables with Regression Analysis
df = pd.read_csv('C:\\your_path_here\\AB_NYC_2019.csv')
df
# get only 2 fields and convert non-numerics to numerics
df_new = df[['neighbourhood']]
df_new = pd.get_dummies(df_new)
# print(df_new.columns.values)
# df_new.shape
# df.shape
# let's use a feature selection technique so we can see which features (independent variables) have the highest statistical influence on the target (dependent variable).
# price is continuous, so a regressor (not a classifier) is the right tool here
from sklearn.ensemble import RandomForestRegressor
features = df_new.columns.values
clf = RandomForestRegressor()
clf.fit(df_new[features], df['price'])
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
# what kind of object is this
# type(sorted_idx)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
X = df_new[features]
y = df['price']
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
Result:
Best alpha using built-in LassoCV: 0.040582
Best score using built-in LassoCV: 0.103947
Lasso picked 78 variables and eliminated the other 146 variables
Next step...
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
# get the top 25; plotting fewer features so we can actually read the chart
type(imp_coef)
imp_coef = imp_coef.tail(25)
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
X = df_new
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
Result:
MAE 1004799260.0756996
MSE 9.87308783180938e+21
RMSE 99363412943.64531
R squared error -2.603867717517002e+17
This is horrible! Well, we know this doesn't work. Let's try something else. We still need to work with numeric data, so let's try the longitude and latitude coordinates.
X = df[['longitude','latitude']]
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
df1 = pd.DataFrame({'Actual': y_test, 'Predicted':prediction})
df2 = df1.head(10)
df2
df2.plot(kind = 'bar')
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
# better but not awesome
Result:
MAE 85.35438165291622
MSE 36552.6244271195
RMSE 191.18740655994972
R squared error 0.03598346983552425
Let's look at OLS:
import statsmodels.api as sm
# note: sm.OLS does not add an intercept; wrap X in sm.add_constant(X) to include one
model = sm.OLS(y, X).fit()
# run the model and interpret the predictions
predictions = model.predict(X)
# Print out the statistics
model.summary()
I would hypothesize the following:
One-hot encoding is doing exactly what it is supposed to do, but it is not helping you get the results you want. Using longitude/latitude performs slightly better, but this, too, does not achieve the results you want. As you know, you must work with numeric data for a regression problem, but none of these features helps predict price, at least not very well. Of course, I could have made a mistake somewhere; if I did, please let me know!
Check out the links below for a good example of using various features to predict housing prices. Notice: all variables are numeric, and the results are pretty decent (just around 70%, give or take, but still much better than what we're seeing with the Air BNB data set).
https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/
https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155

Regression model / non-linear regression

The dataset has a timestamp and water flow values. I want to model the data such that if an abnormal value (a sudden spike or a very low value) comes in, it sends a notification that something is wrong. I tried an ARIMA model, since the data is a time series, but it doesn't produce relevant results, which means I am doing something wrong. Please guide me. Thanks. The link to the dataset is: https://drive.google.com/open?id=1cFHSVpY0XBxsEayl2k1cK4_qWZ4PvDBd
from sklearn.linear_model import LinearRegression
features = [col for col in x2.columns if 'day' in col]
X = x2['median'].values.reshape(-1, 1)  # Series.reshape was removed; go through .values
y = x2['time']
# create linear regression object
reg = LinearRegression()  # the import above brings in the class directly
# train the model using the training sets
reg.fit(X, y)
# regression coefficients
print('Coefficients: \n', reg.coef_)
I have tried using the median of all the water flow values for each time interval as the target variable, but that produces a negative variance score as well.
The expected result should be the value of water flow at a given time, which tells whether or not it is in the normal range.
Since this seems to be a single-feature problem, I recommend starting by plotting the median water flow with respect to time. The shape of the plot will tell you how best to model the problem.
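A sketch of that starting plot; the file and column names here ('water_flow.csv', 'timestamp', 'flow') are hypothetical stand-ins for the linked dataset:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('water_flow.csv', parse_dates=['timestamp'])  # hypothetical names
df = df.set_index('timestamp')
df['flow'].resample('H').median().plot()  # median water flow per hour
plt.ylabel('median water flow')
plt.show()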

SVM regression ruined by adding polynomial features

I'm trying to get a feel for SVM regression with a toy example. I generated random numbers between 1 and 100 as the predictors, then took their log and added Gaussian noise to create the target variable. Popping this data into sklearn's SVR module produces a reasonable-looking model:
However, when I augment the training data by throwing in the squares of the original predictor variables, everything goes haywire:
I understand that the RBF kernel does something analogous to taking powers of the original features, so throwing in the second feature is mostly redundant. However, is it really the case that SVMs are this bad at handling feature redundancy? Or am I doing something wrong?
Here is the code I used to generate these graphs:
from sklearn.svm import SVR
import numpy as np
import matplotlib.pyplot as plt
# change to highest_power=2 to get the bad model
def create_design_matrix(x_array, highest_power=1):
return np.array([[x**k for k in range(1, highest_power + 1)] for x in x_array])
N = 1000
x_array = np.random.uniform(1, 100, N)
y_array = np.log(x_array) + np.random.normal(0,0.2,N)
model = SVR(C=1.0, epsilon=0.1)
print(model)
X = create_design_matrix(x_array)
# print(X)
# print(y_array)
model = model.fit(X, y_array)
test_x = np.linspace(1.0, 100.0, num=10000)
test_y = model.predict(create_design_matrix(test_x))
plt.plot(x_array, y_array, 'ro')
plt.plot(test_x, test_y)
plt.show()
I'd appreciate any help with this mystery!
It looks like your model is picking up on outliers too heavily, which is a symptom of error from variance. This makes sense, because adding polynomial features increases the variance of a model. You should address the bias-variance tradeoff by tuning the parameters via cross-validation; the ones to modify are C, epsilon, and gamma. The gamma parameter is incredibly important when using an RBF kernel, so I'd start there.
Manually fiddling with these parameters (which is not recommended - see below) gave me the following model:
The parameters used here were C=5, epsilon=0.1, gamma=2**-15.
Choosing these parameters is really a task for a proper model-selection framework. I prefer simulated annealing + cross-validation. The best scikit-learn currently has is randomized grid search + cross-validation. Shameless plug for a simulated annealing module I helped with: https://github.com/skylergrammer/SimulatedAnnealing
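For reference, a sketch of the randomized-search route in scikit-learn, reusing X and y_array from the question's code; the parameter ranges are just guesses:
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform  # needs a reasonably recent scipy
param_dist = {'C': loguniform(1e-2, 1e2),
              'gamma': loguniform(2**-20, 1.0),
              'epsilon': loguniform(1e-3, 1.0)}
search = RandomizedSearchCV(SVR(), param_dist, n_iter=50, cv=5)
search.fit(X, y_array)
print(search.best_params_, search.best_score_)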
Note: polynomial features are actually products of all combinations of the features of size d (with replacement), not just the squares of the features. In the second-degree case, since you have only a single feature, these are equivalent. Scikit-learn has a class that will calculate these: sklearn.preprocessing.PolynomialFeatures.
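A quick sketch of what that class produces for a single feature at degree 2:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
x = np.array([[2.0], [3.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(x))  # columns: x, x**2 -> [[2, 4], [3, 9]]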
