I have been learning about classification techniques and have studied random forests, gradient boosting, etc. With some help from code available online, I tried to write Python 3 code for a random forest and a GBM. My objective is to get the probability values from the models rather than just look at accuracy, because I intend to use the probabilities to build a KS table later.
I used the readily available Titanic data set to start practicing.
These are some of the steps I took:
import pandas as pd

# load train data
train_df = pd.read_csv('***/classification/titanic/train.csv')
# load test data
test_df = pd.read_csv('***/Desktop/classification/titanic/test.csv')
# drop some variables in train data
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
# drop some variables in test data
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
# I calculated the Title variable (again based on multiple threads on Kaggle; code not shown)
train_df = pd.get_dummies(train_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
test_df = pd.get_dummies(test_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
# I checked for missing and IV values next (not including that code here)
# note: the Kaggle column is spelled 'PassengerId'
predictors = [x for x in train_df.columns if x not in ['Survived', 'PassengerId']]
predictors
# create classifier object (GBM)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
# probability of the positive class for each training row
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
# create classifier object (RF)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
# probability of the positive class for each training row
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
When I check the probability values from the two models, I notice that in the random forest output a significant chunk of the rows have a probability score of exactly 0, whereas that is not the case for the GBM model.
I understand that the techniques are different, but how can the results be so far apart? Am I missing something?
With a large chunk of the population scored at a probability of 0, my KS table goes for a toss.
Welcome to SO! Since you don't seem to have a problem with code execution specifically, or with outright incorrect output, this looks more appropriate for Cross Validated, where you can find answers to statistical questions.
In fact, I'd suggest that the answers to this question might give you some good insight into why you are seeing very different values from the predict_proba method. In short: although GradientBoostingClassifier and RandomForestClassifier both use tree methods, what they do is very different, so a direct comparison of the model parameters is not necessarily appropriate.
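To make the difference concrete, here is a minimal sketch on synthetic data (an assumption; it does not use the Titanic files). It relies on two facts about scikit-learn: a random forest averages the per-tree class probabilities, and older releases defaulted to n_estimators=10, so with fully grown trees the training scores land on multiples of 0.1, including exactly 0. Gradient boosting instead passes a sum of tree outputs through a sigmoid, so its scores stay strictly between 0 and 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in for the Titanic features (assumption: any binary problem shows the effect).
X, y = make_classification(n_samples=500, n_features=8, random_state=10)

# n_estimators=10 was the RandomForestClassifier default in older scikit-learn releases.
rf = RandomForestClassifier(n_estimators=10, random_state=10).fit(X, y)
gbm = GradientBoostingClassifier(random_state=10).fit(X, y)

# Score the training data, as in the question.
p_rf = rf.predict_proba(X)[:, 1]
p_gbm = gbm.predict_proba(X)[:, 1]

# Each fully grown tree votes 0 or 1 here, so the forest average is a multiple of 1/10.
print("distinct RF scores:", np.unique(np.round(p_rf, 3)))
print("share of RF scores exactly 0:", np.mean(p_rf == 0.0))

# GBM applies a sigmoid to a sum of tree outputs, so scores never reach exactly 0.
print("smallest GBM score:", p_gbm.min())
print("share of GBM scores exactly 0:", np.mean(p_gbm == 0.0))
```

If smoother random forest scores are needed for a KS table, raising n_estimators or setting min_samples_leaf above 1 are common options, and scoring held-out data rather than the training data usually gives a more honest picture.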
I have a linear regression model and my cost function is a sum-of-squares error function. I've split my full dataset into three datasets: training, validation, and test. I am not sure how to calculate the training error and the validation error (and the difference between the two).
Is the training error the residual sum of squares calculated using the training dataset?
An example of what I'm asking: if I were doing this in Python, and I had 90 data points in the training set, would this be the correct code for the training error?
y_predicted = f(X_train, theta) #predicted y-value at point x, where y_train is the actual y-value at x
training_error = 0
for i in range(90):
    out = y_predicted[i] - y_train[i]
    out = out*out
    training_error+=out
training_error = training_error/2
print('The training error for this regression model is:', training_error)
As mentioned in a comment on the post, you need to divide by the total number of samples to get a number that you can compare across the training, validation, and test sets.
With that change, the code would be:
y_predicted = f(X_train, theta) #predicted y-value at point x, where y_train is the actual y-value at x
training_error = 0
for i in range(90):
    out = y_predicted[i] - y_train[i]
    out = out*out
    training_error+=out
#change 2 to 90
training_error = training_error/90
print('The training error for this regression model is:', training_error)
The goal is to be able to compare two different subsets of data using the same metric. The divide by 2 you had in there is fine as well, as long as you also divide by the number of samples.
Another way to do this in Python is with the scikit-learn library, which already provides this function:
from sklearn.metrics import mean_squared_error
training_error = mean_squared_error(y_train,y_predicted)
Also, for calculations like this it is generally better and faster to use vectorized array operations instead of a for loop. In the context of this question, 90 records is quite small, but when you start working with larger sample sizes you could try something like this using numpy.
import numpy as np
training_error = np.mean(np.square(np.array(y_predicted)-np.array(y_train)))
All 3 ways should get you similar results.
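To tie this back to the original question about the difference between training and validation error, here is a minimal sketch on synthetic data. The data, the 90/30 split, and the helper name mse_loop are assumptions for illustration only. It checks that the loop and the vectorized version agree and then applies the same metric to both sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 3x + 2 plus noise, 90 training and 30 validation points.
x_train, x_val = rng.uniform(0, 10, 90), rng.uniform(0, 10, 30)
y_train = 3 * x_train + 2 + rng.normal(0, 1, 90)
y_val = 3 * x_val + 2 + rng.normal(0, 1, 30)

# Fit a straight line on the training set only.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def mse_loop(y_true, y_pred):
    """Mean squared error with an explicit loop (hypothetical helper)."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        total += (predicted - actual) ** 2
    return total / len(y_true)  # divide by the sample count, not by 2

pred_train = slope * x_train + intercept
pred_val = slope * x_val + intercept

train_mse_loop = mse_loop(y_train, pred_train)
train_mse_vec = np.mean(np.square(pred_train - y_train))
val_mse_vec = np.mean(np.square(pred_val - y_val))

print("training MSE (loop):      ", train_mse_loop)
print("training MSE (vectorized):", train_mse_vec)
print("validation MSE:           ", val_mse_vec)
print("validation - training gap:", val_mse_vec - train_mse_vec)
```

Because both errors are averages per sample, the gap is meaningful even though the two sets have different sizes.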
I've been learning some of the core concepts of ML lately and writing code using the scikit-learn library. After some basic practice, I tried my hand at the Airbnb NYC dataset from Kaggle (which has around 40000 samples) - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png
I tried to make a model that could predict the price of a room/apt given the various features of the dataset. I realised that this was a regression problem and using this sklearn cheat-sheet, I started trying the various regression models.
I used sklearn.linear_model.Ridge as my baseline and, after doing some basic data cleaning, I got an abysmal R^2 score of 0.12 on my test set. Then I thought that maybe the linear model was too simplistic, so I tried the 'kernel trick' method adapted for regression (sklearn.kernel_ridge.KernelRidge), but it took too long to fit (>1hr)! To counter that, I used sklearn.kernel_approximation.Nystroem to approximate the kernel map, applied the transformation to the features prior to training, and then used a simple linear regression model. However, even that took a long time to transform and fit once I increased the n_components parameter, which I had to do to get any meaningful increase in accuracy.
So I am thinking now, what happens when you want to do regression on a huge dataset? The kernel trick is extremely computationally expensive while the linear regression models are too simplistic as real data is seldom linear. So are neural nets the only answer or is there some clever solution that I am missing?
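For readers who want to see the Nystroem approach described above in code, here is a minimal sketch. The synthetic data stands in for the cleaned Airbnb features, and the gamma, n_components, and alpha values are illustrative assumptions rather than tuned settings.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic non-linear data (assumption; stands in for the cleaned Airbnb features).
X = rng.uniform(-3, 3, size=(2000, 2))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, 2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline: plain ridge regression on the raw features.
linear = Ridge(alpha=1.0).fit(X_train, y_train)

# Nystroem maps the inputs to n_components approximate RBF-kernel features,
# then a linear model is fitted on top of them.
approx_kernel = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.5, n_components=100, random_state=0),
    Ridge(alpha=0.1),
)
approx_kernel.fit(X_train, y_train)

r2_linear = linear.score(X_test, y_test)
r2_nystroem = approx_kernel.score(X_test, y_test)
print("R^2, linear ridge:    ", r2_linear)
print("R^2, Nystroem + ridge:", r2_nystroem)
```

The cost of the transform grows with n_components, which matches the slowdown described in the question; the sketch only shows the wiring, not how to choose that value.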
P.S. I am just starting out on Stack Overflow, so please let me know what I can do to make my question better!
This is a great question, but as often happens, there is no simple answer to a complex problem. Regression is not as simple as it is often presented. It involves a number of assumptions and is not limited to linear least-squares models. It takes a couple of university courses to understand it fully. Below is a quick (and far from complete) memo about regression:
Nothing will replace proper analysis. This might involve expert interviews to understand the limits of your dataset.
Your model (any model, not only regression) is only as good as your features. If home price depends on the local tax rate or school rating, even a perfect model would not perform well without these features.
Some features cannot be included in the model by design, so never expect a perfect score in the real world. For example, it is practically impossible to account for access to grocery stores, eateries, clubs, etc. Many of these features are also moving targets, as they tend to change over time. Even an R2 of 0.12 might be great if human experts perform worse.
Models have their assumptions. Linear regression expects the dependent variable (price) to be linearly related to the independent ones (e.g. property size). By exploring residuals you can spot some non-linearities and cover them with non-linear features. However, some patterns are hard to spot but can still be addressed by other models, such as non-parametric regression and neural networks.
So why do people still use (linear) regression?
It is the simplest and fastest model. That has many implications for real-time systems and statistical analysis, so it does matter.
It is often used as a baseline model. Before trying a fancy neural network architecture, it helps to know how much we improve compared to a naive method.
Sometimes regression is used to test certain assumptions, e.g. linearity of effects and relations between variables.
To summarize, regression is definitely not the ultimate tool in most cases, but it is usually the cheapest solution to try first.
Update, to illustrate the point about non-linearity:
After fitting a regression you calculate the residuals, i.e. the regression error predicted_value - true_value. Then, for each feature, you make a scatter plot where the horizontal axis is the feature value and the vertical axis is the error. Ideally, the residuals are normally distributed and do not depend on the feature value: errors are more often small than large, and look similar across the plot.
This is how it should look:
This is still normal - it only reflects the difference in density of your samples, but errors have the same distribution:
This is an example of nonlinearity (a periodic pattern, add sin(x+b) as a feature):
Another example of non-linearity (adding squared feature should help):
The two examples above can be described as the mean of the residuals depending on the feature value. Other problems include, but are not limited to:
different variance depending on feature value
non-normal distribution of residuals (error is either +1 or -1, clusters, etc)
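The residual check described above can also be done numerically. The following is a minimal sketch on synthetic data, assuming a relation that is quadratic in one feature; the helper binned_residual_means is hypothetical. It averages the residuals within bins of the feature: a clear trend in those bin means signals non-linearity, and it disappears once the squared feature is added.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): the true relation is quadratic in the feature x.
x = rng.uniform(-3, 3, 500)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(0, 1, 500)

def binned_residual_means(feature, residuals, n_bins=6):
    """Mean residual per equal-width bin of the feature (hypothetical helper)."""
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
    idx = np.clip(np.digitize(feature, edges) - 1, 0, n_bins - 1)
    return np.array([residuals[idx == b].mean() for b in range(n_bins)])

# Model 1: straight line. Residuals use the convention above: predicted - true.
line = np.polyfit(x, y, deg=1)
res_line = np.polyval(line, x) - y

# Model 2: add the squared feature.
quad = np.polyfit(x, y, deg=2)
res_quad = np.polyval(quad, x) - y

means_line = binned_residual_means(x, res_line)
means_quad = binned_residual_means(x, res_quad)
print("mean residual per bin, linear:   ", np.round(means_line, 2))
print("mean residual per bin, quadratic:", np.round(means_quad, 2))
```

The same bin means are what the eye picks up in a residual scatter plot, so this is a complement to plotting rather than a replacement for it.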
Some of the pictures above are taken from here:
http://www.contrib.andrew.cmu.edu/~achoulde/94842/homework/regression_diagnostics.html
This is a great read on regression diagnostics for beginners.
I'll take a stab at this one. Look at my notes/comments embedded in the code. Keep in mind, this is just a few ideas that I tested. There are all kinds of other things you can try (get more data, test different models, etc.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
import sklearn
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
# Predicting Continuous Target Variables with Regression Analysis
df = pd.read_csv('C:\\your_path_here\\AB_NYC_2019.csv')
df
# keep only the neighbourhood field and one-hot encode it so it becomes numeric
df_new = df[['neighbourhood']]
df_new = pd.get_dummies(df_new)
# print(df_new.columns.values)
# df_new.shape
# df.shape
# let's use a feature selection technique so we can see which features (independent variables) have the highest influence on the target (dependent variable).
# price is a continuous target, so use the regressor variant of the random forest
from sklearn.ensemble import RandomForestRegressor
features = df_new.columns.values
clf = RandomForestRegressor()
clf.fit(df_new[features], df['price'])
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
# what kind of object is this
# type(sorted_idx)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
X = df_new[features]
y = df['price']
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
Result:
Best alpha using built-in LassoCV: 0.040582
Best score using built-in LassoCV: 0.103947
Lasso picked 78 variables and eliminated the other 146 variables
Next step...
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
# get the top 25; plotting fewer features so we can actually read the chart
type(imp_coef)
imp_coef = imp_coef.tail(25)
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
X = df_new
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
Result:
MAE 1004799260.0756996
MSE 9.87308783180938e+21
RMSE 99363412943.64531
R squared error -2.603867717517002e+17
This is horrible! Well, we know this doesn't work. Let's try something else. We still need to work with numeric data, so let's try the longitude and latitude coordinates.
X = df[['longitude','latitude']]
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
df1 = pd.DataFrame({'Actual': y_test, 'Predicted':prediction})
df2 = df1.head(10)
df2
df2.plot(kind = 'bar')
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
# better but not awesome
Result:
MAE 85.35438165291622
MSE 36552.6244271195
RMSE 191.18740655994972
R squared error 0.03598346983552425
Let's look at OLS:
import statsmodels.api as sm
model = sm.OLS(y, X).fit()
# run the model and interpret the predictions
predictions = model.predict(X)
# Print out the statistics
model.summary()
I would hypothesize the following:
One-hot encoding is doing exactly what it is supposed to do, but it is not helping you get the results you want. Using lng/lat performs slightly better, but this too is not helping you achieve the results you want. As you know, you must work with numeric data for a regression problem, but none of the features is helping you predict price, at least not very well. Of course, I could have made a mistake somewhere. If I did, please let me know!
Check out the links below for a good example of using various features to predict housing prices. Note that all the variables are numeric, and the results are pretty decent (around 70%, give or take, which is still much better than what we're seeing with the Airbnb data set).
https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/
https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155
This is more of a stats question, as the code is working fine, but I am learning regression modeling in Python. I have some code below that uses statsmodels to create a simple linear regression model:
import statsmodels.api as sm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
ng = pd.read_csv('C:/Users/ben/ngDataBaseline.csv', thousands=',', index_col='Date', parse_dates=True)
X = ng['HDD']
y = ng['Therm']
# Note the difference in argument order
model = sm.OLS(y, X).fit()
# Print out the statistics
model.summary()
I get an output like the screenshot below. I am trying to judge the goodness of fit, and I know the R^2 is high, but is it possible to find the root mean squared error (RMSE) of the prediction with statsmodels?
I'm also trying to work out whether I can estimate the sampling distribution with a confidence interval. If I am interpreting the table correctly, for the intercept HDD the value is 5.9309, with standard error 0.220 and a low p-value of 0.000, and I think the 97.5% confidence interval means the value of HDD (or is it my dependent variable Therm?) will be between 5.489 and 6.373? I think that could be expressed as a percentage, roughly +/- 0.072%.
EDIT included multiple regression table
Is it possible to calculate the RMSE with statsmodels? Yes, but you'll first have to generate the predictions with your model and then use the rmse function.
from statsmodels.tools.eval_measures import rmse
# fit your model which you have already done
# now generate predictions
ypred = model.predict(X)
# calc rmse
# store the result under a different name so the imported rmse function is not overwritten
rmse_value = rmse(y, ypred)
As for interpreting the results, HDD isn't the intercept. It's your independent variable. The coefficient (i.e. the weight) is 5.9309 with a standard error of 0.220. The t-score for this variable is really high, suggesting that it is a good predictor, and since it is high, the p-value is very small (close to 0).
The 5.489 and 6.373 values are the bounds of a 95% confidence interval. The bounds are calculated by adding to, or subtracting from, the coefficient the standard error times the critical t-value associated with the 95% confidence level.
The critical t-value depends on your sample size, which in your case is 53, so you have 52 degrees of freedom (the model as fitted has a single coefficient and no intercept). Using a t-table, this means that for df=52 and a confidence level of 95%, the critical t-value is 2.0066. Therefore the bounds can be calculated manually as follows:
lower: 5.9309 - (2.0066 x 0.220) = 5.489
upper: 5.9309 + (2.0066 x 0.220) = 6.372
Of course, there's some precision loss due to rounding but you can see the manual calculation is really close to what's reported in the summary.
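The manual calculation can be reproduced with scipy instead of a t-table. This is a minimal sketch that hard-codes the coefficient, standard error, and sample size quoted above; the variable names are assumptions for illustration.

```python
from scipy import stats

coef = 5.9309         # HDD coefficient from the summary table
std_err = 0.220       # its standard error
n_obs = 53            # sample size quoted in the answer
df_resid = n_obs - 1  # one estimated coefficient, no intercept

# A two-sided 95% interval uses the 97.5th percentile of the t distribution.
t_crit = stats.t.ppf(0.975, df_resid)
lower = coef - t_crit * std_err
upper = coef + t_crit * std_err

print(f"critical t: {t_crit:.4f}")
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

The 0.975 here is also where the [0.025 0.975] column headers in the statsmodels summary come from: they mark the lower and upper quantiles of a 95% interval, not a 97.5% confidence level.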
Additional response to your comments:
There are several metrics you can use to evaluate the goodness of fit. One of them is the adjusted R-squared statistic; others are RMSE, the F-statistic, and AIC/BIC. It's up to you to decide which metric or metrics to use. I usually use the adjusted R-squared and/or RMSE, though RMSE is more of a relative metric for comparing against other models.
Now looking at your model summaries, both of the models fit well, especially the first model given the high adjusted R-squared value. There may be potential improvement with the second model (may try different combinations of the independent variables) but you won't know unless you experiment. Ultimately, there's no right and wrong model. It just comes down to building several models and comparing them to get the best one. I'll also link an article that explains some of the goodness of fit metrics for regression models.
As for confidence intervals, I'll link this SO post since the person who answered the question has code to create the confidence interval. You'll want to look at the predict_mean_ci_low and predict_mean_ci_high that he created in his code. These two variables will give you the confidence intervals at each observation and from there, you can calculate the +/- therms/kWh by subtracting the lower CI from your prediction or subtracting your prediction from the upper CI.
I want to use GPR to predict RSS from a deployed access point (AP). Since GPR gives mean RSS and its variance too, GPR could be very useful in positioning and navigation system. I read the GPR related published journals and got the theoretical insight of it. Now, I want to implement it with real data (RSS). In my system, the input and corresponding outputs (observations) are:
X: 2D cartesian coordinates points
y: an array of RSS (-dBm) at the corresponding coordinates
After searching online, I found that I can use sklearn software (using python). I installed sklearn and successfully tested the sample codes. The sample python scripts are for 1D GPR. Since my input sets are 2D coordinates, I wanted to modify the sample code. I found that other people have also tried to do the same, for example : How to correctly use scikit-learn's Gaussian Process for a 2D-inputs, 1D-output regression?, How to make a 2D Gaussian Process Using GPML (Matlab) for regression?, and Is kringing suitable for high dimensional regression problems?.
The expected (predicted) values should be similar to y. The values I got are very different. The testbed where I want to predict the RSS measures 16 m x 16 m, and I want to predict the RSS at points one meter apart. I assume that the Gaussian Process predictor in the sample code is trained with a gradient descent algorithm. I want to optimize the hyperparameter (theta: trained by using y and X) with the Firefly algorithm.
In order to use my own data (2D input), which lines of code am I supposed to edit? Similarly, how can I implement the Firefly algorithm (I've installed a firefly algorithm package using pip)?
Please help me with your kind suggestions and comments.
Thank you very much.
I have simplified the code a bit to illustrate potential issues:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
x_train = np.array([[0,0],[2,0],[4,0],[6,0],[8,0],[10,0],[12,0],[14,0],[16,0],[0,2],
[2,2],[4,2],[6,2],[8,2],[10,2],[12,2],[14,2],[16,2]])
y_train = np.array([-54,-60,-62,-64,-66,-68,-70,-72,-74,-60,-62,-64,-66,
-68,-70,-72,-74,-76])
# This is a test set?
x1min = 0
x1max = 16
x2min = 0
x2max = 16
x1 = np.linspace(x1min, x1max)
x2 = np.linspace(x2min, x2max)
x_test =(np.array([x1, x2])).T
gp = GaussianProcessRegressor()
gp.fit(x_train, y_train)
# predict on training data
y_pred_train = gp.predict(x_train)
print('Avg MSE: ', ((y_train - y_pred_train)**2).mean()) # MSE is 0
# predict on test (?) data
y_pred_test = gp.predict(x_test)
# it is unclear how good this result is without y_test (e.g., held-out labeled test samples)
The expected (predicted) values should be similar to y.
Here, I have renamed y to y_train for clarity. After fitting the GP and predicting on x_train, we see that the model perfectly predicts the training samples, which is possibly what you meant. I am not sure if you mistakenly wrote lowercase x which I call x_test (instead of uppercase X which I call x_train) in the question. If we predict on x_test, we cannot really know how good the prediction is without the corresponding y_test values. So, this basic example is working as I would expect.
It also appears you are trying to create a grid for x_test, however the current code does not do that. Here, x1 and x2 are always the same for each position. If you want a grid, take a look at np.meshgrid.
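A minimal sketch of the np.meshgrid suggestion, assuming one prediction point per meter on the 16 m x 16 m testbed (so 17 values per axis); the variable names g1 and g2 are illustrative:

```python
import numpy as np

# One point per meter on the 16 m x 16 m testbed: 0, 1, ..., 16 along each axis.
x1 = np.linspace(0, 16, 17)
x2 = np.linspace(0, 16, 17)

# meshgrid pairs every x1 with every x2; stack the flattened grids into (n_points, 2).
g1, g2 = np.meshgrid(x1, x2)
x_test = np.column_stack([g1.ravel(), g2.ravel()])

print(x_test.shape)   # (289, 2)
print(x_test[:3])     # first points run along the x1 axis at x2 = 0
```

With the gp fitted as in the code above, gp.predict(x_test, return_std=True) should return the predicted mean and standard deviation at every grid point, and each can be reshaped back to the grid with .reshape(g1.shape).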