I have divided my data into training and validation samples and have successfully fit my model with three types of linear models. What I cannot figure out is how to apply the model to the validation sample to evaluate the fit. When I attempt to apply the model to the holdout sample (I know this isn't a reproducible example, but the issue should be clear; I'm including the snippet for completeness. Please be gentle!):
valid = validation.loc[:, x + ["sale_amt"]]
holdout1 = m1.predict(valid)
I get the following error message:
AttributeError                            Traceback (most recent call last)
in ()
      8
      9 valid = validation.loc[:, x + ["sale_amt"]]
---> 10 holdout1 = m1.predict(valid)
AttributeError: 'OLS' object has no attribute 'predict'
Other Python OLS regression packages have a 'predict' method, but PySAL does not appear to. I realize that the coefficients (betas) are available, and I will pursue applying them to my validation data directly, but I was hoping there is a simple answer that I just missed.
I apologize if it is bad form to answer my own question, but I did come up with a solution. I contacted Daniel Arribas-Bel, one of the PySAL developers, and he helped guide me to the result I was seeking. Note that my PySAL OLS object is named m1, and my validation dataframe is called 'validation':
m1 = ps.model.spreg.OLS(...)
m1.intercept = m1.betas[0]      # get the intercept from the betas array
m1.coefficients = m1.betas[1:]  # get the slope coefficients from the betas array
validation['predicted_price'] = m1.intercept + validation.loc[:, x].dot(m1.coefficients)
Note that this is the method I would use for a non-spatial model, adapted here to the KNN model I built in PySAL, and it might not be technically correct for a spatial model. Caveat emptor.
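For reuse, the same idea can be wrapped in a small helper. This is only a sketch of the manual-prediction approach above; predict_spreg_ols is a hypothetical name, not part of PySAL's API:

import numpy as np

def predict_spreg_ols(model, X):
    # Hypothetical helper: intercept plus X dotted with the slope coefficients,
    # read from the fitted model's betas array (intercept comes first).
    betas = np.asarray(model.betas).ravel()
    return betas[0] + X.values.dot(betas[1:])

validation['predicted_price'] = predict_spreg_ols(m1, validation.loc[:, x])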
I am having problems using sklearn's SequentialFeatureSelector for feature selection prior to clustering.
Please see my code:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import SequentialFeatureSelector

Shooting_ST_array = np.array(Shooting_ST)
kmeans1 = KMeans(n_clusters=4)
kmeans1 = kmeans1.fit(Shooting_ST_array)
sfs = SequentialFeatureSelector(estimator=kmeans1, n_features_to_select=5, direction='backward')
sfs.fit(Shooting_ST_array, y=None)
sfs.get_feature_names_out(Shooting_ST.columns)
Depending on whether I used forward or backward selection, the algorithm returns the first 5 or last 5 columns.
I am also getting the following attribute error:
AttributeError: 'NoneType' object has no attribute 'split'
Does anyone have an idea what I am doing wrong? I have unfortunately not found any example implementations of the method for unlabelled data.
So I am attempting to do some time series analysis with the statsmodels package in Python. I have some code that was given to me in a class, but it doesn't work! I've narrowed the problem down to the function below, but am getting a strange error that I can't solve.
def model_ARIMA_2(ts, order):
    from statsmodels.tsa.arima_model import ARIMA
    from statsmodels.tsa.arima_model import ARIMAResults
    model = ARIMA(ts, order=order)
    model_fit = model.fit(disp=0, method='mle', trend='nc')
    BIC = ARIMAResults(model_fit, order).bic
    print('Testing model of order: ' + str(order) + ' with BIC = ' + str(BIC))
    return (BIC, order, model_fit)
order = (1,1,1)
model_ARIMA_2(decomp.resid[6:-6], order)
And I get the error: AttributeError: 'ARIMAResults' object has no attribute 'endog'
I've tried searching this online and haven't found anything helpful. Does anyone know why this error is cropping up and what the solution may be?
Thanks!
It looks like the error occurs when you are trying to extract the BIC.
When you fit an ARIMA model, in your case model_fit = model.fit(disp=0, method='mle', trend='nc'), Statsmodels returns an ARIMAResults object (see the documentation for the fit method). So you are attempting to create an ARIMAResults object from an ARIMAResults object, which is causing your error.
You should be able to get the BIC directly from the object returned when you fit the model (i.e. BIC = model_fit.bic), along with all the other model-fit statistics statsmodels reports.
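For illustration, the corrected function might look like the following; it is identical to the original except that the BIC is read straight off the fitted results object:

def model_ARIMA_2(ts, order):
    from statsmodels.tsa.arima_model import ARIMA
    model = ARIMA(ts, order=order)
    model_fit = model.fit(disp=0, method='mle', trend='nc')
    BIC = model_fit.bic  # the ARIMAResults object already exposes .bic
    print('Testing model of order: ' + str(order) + ' with BIC = ' + str(BIC))
    return (BIC, order, model_fit)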
It will be useful to become familiar with the methods and attributes of ARIMAResults objects, which can be found here.
Best of luck!
As mentioned in this post, the adjusted R2 score can be calculated via the following equation, where n is the number of samples, p is the number of parameters of the model.
adj_r2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
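For example, with n = 100 samples, p = 10 predictors, and R2 = 0.9, this gives adj_r2 = 1 - 0.1 * 99 / 89 ≈ 0.889.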
According to another post, we can get the number of parameters of our model with model.coef_.
However, for Gradient Boosting (GBM), it seems we cannot obtain the number of parameters in our model:
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
X = np.random.randn(100,10)
y = np.random.randn(100,1)
model = GradientBoostingRegressor()
model.fit(X,y)
model.coef_
output >>>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-4650e3f7c16c> in <module>
----> 1 model.coef_
AttributeError: 'GradientBoostingRegressor' object has no attribute 'coef_'
After checking the documentation, it seems GBM consists of different estimators. Is the number of estimators equals to the number of parameters?
Still, I cannot get the number of parameters for each individual estimator:
model.estimators_[0][0].coef_
output >>>
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-27216ebb4944> in <module>
----> 1 model.estimators_[0][0].coef_
AttributeError: 'DecisionTreeRegressor' object has no attribute 'coef_'
How to calculate the adjusted R2 score for GBM?
Short answer: don't do it (notice that all the posts you link to are about linear regression).
Long answer:
To start with, your definition that
p is the number of parameters of the model
is not correct. p is the number of explanatory variables used by the model (source).
In agreement with this definition, the post you have linked to actually uses X.shape[1] instead of model.coef_; the latter is suggested in a comment, and it is not correct either (see my own comment there).
So, if you insist on computing the adjusted R-squared for your GBM model, you can always adapt the code from the linked post (after getting your predictions y_pred), also taking advantage of scikit-learn's r2_score:
from sklearn.metrics import r2_score

y_pred = model.predict(X)
r_squared = r2_score(y, y_pred)
# n = len(y) samples, p = X.shape[1] explanatory variables
adjusted_r_squared = 1 - (1 - r_squared) * (len(y) - 1) / (len(y) - X.shape[1] - 1)
But why shouldn't you do it? Well, quoting from an answer of mine to another question:

the whole R-squared concept comes in fact directly from the world of statistics, where the emphasis is on interpretative models, and it has little use in machine learning contexts, where the emphasis is clearly on predictive models; at least AFAIK, and beyond some very introductory courses, I have never (I mean never...) seen a predictive modeling problem where the R-squared is used for any kind of performance assessment; nor is it an accident that popular machine learning introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it.
Can Label Propagation be used for semi-supervised regression tasks in scikit-learn?
According to its API, the answer is YES.
http://scikit-learn.org/stable/modules/label_propagation.html
However, I got the error message when I tried to run the following code.
from sklearn import datasets
from sklearn.semi_supervised import label_propagation
import numpy as np

rng = np.random.RandomState(0)
boston = datasets.load_boston()
X = boston.data
y = boston.target
y_30 = np.copy(y)
y_30[rng.rand(len(y)) < 0.3] = -999
label_propagation.LabelSpreading().fit(X, y_30)
It throws "ValueError: Unknown label type: 'continuous'" on the label_propagation.LabelSpreading().fit(X, y_30) line.
How should I solve the problem? Thanks a lot.
It looks like the error is in the documentation; the code itself is clearly classification-only (this is the beginning of the .fit call of the BasePropagation class):
check_classification_targets(y)
# actual graph construction (implementations should override this)
graph_matrix = self._build_graph()
# label construction
# construct a categorical distribution for classification only
classes = np.unique(y)
classes = (classes[classes != -1])
In theory you could remove the check_classification_targets call and use a "regression-like method", but it will not be true regression, since you will never "propagate" any value that is not encountered in the training set; you will simply treat each regression value as a class identifier. And you will be unable to use the value -1, since it is a codename for "unlabeled"...
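If you just want something that runs, a rough sketch of that last idea is to discretize the target and do classification on the bins; this is classification on binned values, not true regression, exactly per the caveat above (note also that sklearn expects -1, not -999, as the "unlabeled" marker):

import numpy as np
from sklearn import datasets
from sklearn.semi_supervised import label_propagation

rng = np.random.RandomState(0)
boston = datasets.load_boston()
X, y = boston.data, boston.target

# Discretize the continuous target into quartile classes (0..3).
bins = np.quantile(y, [0.25, 0.5, 0.75])
y_binned = np.digitize(y, bins)

y_30 = y_binned.copy()
y_30[rng.rand(len(y)) < 0.3] = -1  # -1 marks the unlabeled samples

label_propagation.LabelSpreading().fit(X, y_30)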
I am a complete beginner, and I'm currently doing this tutorial about logit regression models in python 3.4, with statsmodels 0.6.1 and Pycharm community version 4.5.1:
http://blog.yhathq.com/posts/logistic-regression-and-python.html
It runs smoothly, and I am adding my own lines to try out a few things.
After the part where I fit the data
train_cols = data.columns[1:]
logit = sm.Logit(data['admit'], data[train_cols])
result = logit.fit()
and I print out the summary
print(result.summary())
I tried to take a little detour from the tutorial, to print only the goodness-of-fit measure (in this case, a pseudo R-squared value). According to the documentation it is a method of the result object (same as summary), so it should work like this:
print(result.prsquared())
However, running this code results in a TypeError on the line containing print(result.prsquared()):
TypeError: 'numpy.float64' object is not callable
It really bugs me, because if I were to compare several models, pseudo R-squared would be my first choice for doing so.
prsquared is an attribute, not a function. Try:
print(result.prsquared)
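Since comparing several models was the goal, here is a minimal sketch of that; result2 is a hypothetical second fitted Logit result:

# prsquared is a plain attribute on each fitted result, so no parentheses
for name, res in [('model 1', result), ('model 2', result2)]:
    print(name, res.prsquared)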