I am trying to train a logistic regression model with data as follows:
Categorical Variable: either 0 or 1
Numerical Variables: Continuous number between 8 and 20
I have 20 numerical variables and I want to use only one of them at a time in the model, to see which one is the best predictor.
The code I'm using is:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

for variable in numerical_variable:
    X = data[[variable]]
    y = data[categorical_variable]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)
    y_pred = logreg.predict(X_test)
    print(y_pred)
    cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
    print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
    print("Precision:", metrics.precision_score(y_test, y_pred))
    print("Recall:", metrics.recall_score(y_test, y_pred))
The categorical variable is biased towards 1: there are about 800 ones to 200 zeros. I think this is why the model always predicts 1, regardless of the test samples (even when I don't set random_state=0) and regardless of which numerical variable I use.
(using python 3)
Any thoughts on how to fix this?
Thanks
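A minimal sketch of one common way to counter that imbalance, assuming scikit-learn's class_weight option and the variable names from the code above:
# 'balanced' weights classes by inverse frequency, so the minority 0 class is no longer drowned out
logreg = LogisticRegression(class_weight='balanced')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)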
Use the joblib library to save your model:
import joblib
your_model = LogisticRegression()
your_model.fit(X_train, y_train)
filename = 'finalized_model.sav'
joblib.dump(your_model, filename)
This code will save your model as 'finalized_model.sav'. The extension doesn't matter; you can even omit it.
You can then load the exact same fitted model with the code below and get the same predictions every time:
your_loaded_model = joblib.load('finalized_model.sav')
As a prediction example:
your_loaded_model.predict(X_test)
I wanted to do cross-validation on a regression (non-classification) model and ended up getting mean accuracies of about 0.90. However, I don't know what metric the method uses to compute those accuracies. I know how the splitting in k-fold cross-validation works; I just don't know the formula that the scikit-learn library uses to calculate the accuracy of the predictions (I know how it works for a classification model). Can someone give me the metric/formula used by sklearn.model_selection.cross_val_score?
Thanks in advance.
from sklearn.model_selection import cross_val_score

def metrics_of_accuracy(classifier, X_train, y_train):
    accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
    print(accuracies.mean())
    print(accuracies.std())
    return accuracies
By default, sklearn uses accuracy for classification and r2_score for regression when you use the model.score method (the same holds for cross_val_score). So in this case it is r2_score, whose formula is
r2 = 1 - (SSE(y_hat)/SSE(y_mean))
where
SSE(y_hat) is the sum of squared errors of the predictions, and
SSE(y_mean) is the sum of squared errors when every prediction is replaced by the mean of the actual values.
You can also compute the same metric with sklearn.metrics.r2_score(y_true, y_pred). This score is also called the coefficient of determination, or R-squared.
The formula is shown in the image linked below:
https://i.stack.imgur.com/USaWH.png
For more on this -
https://en.wikipedia.org/wiki/Coefficient_of_determination
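As a quick sanity check, a minimal sketch with made-up numbers showing that the formula above matches sklearn.metrics.r2_score (the example values are hypothetical):
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical actual values
y_hat = np.array([2.5, 5.5, 7.5, 8.0])    # hypothetical predictions

sse_pred = np.sum((y_true - y_hat) ** 2)            # SSE(y_hat)
sse_mean = np.sum((y_true - y_true.mean()) ** 2)    # SSE(y_mean)
r2_manual = 1 - sse_pred / sse_mean

print(r2_manual, r2_score(y_true, y_hat))  # both print 0.9125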
I made a neural network to predict future prices of a stock using historical data for training. The data set is arranged in hourly intervals so the goal is to forecast the price at the end of the next hour of trading.
I used 2 hidden layers with sizes (20, 16), 26 inputs, and one output which should be the price. The activation function is 'relu' and the solver is 'adam'.
When the training is done and I try to test it, all the outputs are values between 0 and 1, which I don't understand.
Here's my code:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Defining the target column and predictors
target_column = ['close']
predictors = list(set(list(df.columns)) - set(target_column))
# Standardizing the DataFrame
df[predictors] = df[predictors]/df[predictors].max()
df.describe().transpose()
# Creating the sets and splitting data
X = df[predictors].values
y = df[target_column].values
y = y.astype('int')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
df.reset_index()
#Fitting the training data
mlp = MLPClassifier(hidden_layer_sizes=(20,16), activation='relu', solver='adam', max_iter=1000)
mlp.fit(X_train, y_train.ravel())
#Predicting training and test data
predict_train = mlp.predict(X_train)
predict_test = mlp.predict(X_test)
After I execute the code I also get this warning:
UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
Shouldn't the prediction at least be a continuous numerical variable, not limited to between 0 and 1?
Price is a continuous variable, so this is a regression problem and not a classification one; here you are trying to apply a classifier to a regression problem, which is wrong.
Such classifiers produce a probabilistic output, which, unsurprisingly, lies in [0, 1].
You should use an MLPRegressor model instead of MLPClassifier.
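A minimal sketch of the regression variant, reusing the df, predictors and target_column names from the question (so it assumes those are already defined):
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

X = df[predictors].values
y = df[target_column].values.ravel()   # keep the target continuous; do not cast it to int

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

mlp = MLPRegressor(hidden_layer_sizes=(20, 16), activation='relu', solver='adam', max_iter=1000)
mlp.fit(X_train, y_train)
predict_test = mlp.predict(X_test)     # continuous price predictions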
This is somewhat of a follow-up to my previous question about evaluating my scikit Gaussian process regressor. I am very new to GPRs and I think I may be making a methodological mistake in how I am using training vs. testing data.
Essentially I'm wondering what the difference is between specifying training data by splitting the input between test and training data like this:
import numpy as np
from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.model_selection import train_test_split

X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.33,
                                                    random_state=0)

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)

gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)

gp.fit(X_train, y_train)
score = gp.score(X_test, y_test)
print(score)

x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
vs using the full data set to train like this.
X = np.atleast_2d(some_data).T
Y = np.atleast_2d(other_data).T

kernel = ConstantKernel() + Matern() + WhiteKernel(noise_level=1)

gp = gaussian_process.GaussianProcessRegressor(
    alpha=1e-10,
    copy_X_train=True,
    kernel=kernel,
    n_restarts_optimizer=10,
    normalize_y=False,
    random_state=None)

gp.fit(X, Y)
score = gp.score(X, Y)
print(score)

x_pred = np.atleast_2d(np.linspace(0, 10, 1000)).T
y_pred, sigma = gp.predict(x_pred, return_std=True)
Is one of these options going to result in incorrect predictions?
You split the training data off from the test data to evaluate your model, because otherwise you have no idea whether you are overfitting the data. For example, just place some data in Excel and plot it with a smooth line. Technically, that spline function from Excel is a perfect model, but it is useless for predicting new values.
In your example, your predictions are over a uniform space so you can visualize what your model thinks the underlying function is. But that is useless for understanding how general the model is. Sometimes you can get very high accuracy (> 95%) on training data and less than chance on testing data, which means the model is overfitting.
In addition to plotting a uniform prediction space to visualize the model, you should also predict values from the test set, then see accuracy metrics for both testing and training data.
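A minimal sketch of that last point, reusing the gp, X_train, X_test, y_train and y_test names from the first code block in the question:
gp.fit(X_train, y_train)

# R^2 on data the model has seen vs. data it has not seen
print("train R^2:", gp.score(X_train, y_train))
print("test R^2:", gp.score(X_test, y_test))

# predictions on the held-out test points, not just on a uniform grid
y_test_pred, sigma = gp.predict(X_test, return_std=True)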
I have a data set with a target variable that can have 7 different labels. Each sample in my training set has only one label for the target variable.
For each sample, I want to calculate the probability for each of the target labels. So my prediction would consist of 7 probabilities for each row.
On the sklearn website I read about multi-label classification, but this doesn't seem to be what I want.
I tried the following code, but this only gives me one classification per sample.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
clf = OneVsRestClassifier(DecisionTreeClassifier())
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
Does anyone have some advice on this? Thanks!
You can do that by simply removing the OneVsRestClassifier and using the predict_proba method of the DecisionTreeClassifier. You can do the following:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)
This will give you a probability for each of your 7 possible classes.
Hope that helps!
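As a usage note (assuming scikit-learn's usual convention), the columns of the returned array follow the order of clf.classes_, so you can map each of the 7 probabilities back to its label:
pred = clf.predict_proba(X_test)   # shape (n_samples, 7)
print(clf.classes_)                # column order of the probabilities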
You can try using scikit-multilearn - an extension of sklearn that handles multilabel classification. If your labels are not overly correlated you can train one classifier per label and get all predictions - try (after pip install scikit-multilearn):
from skmultilearn.problem_transform import BinaryRelevance
classifier = BinaryRelevance(classifier = DecisionTreeClassifier())
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
predictions will be a sparse matrix of size (n_samples, n_labels); in your case n_labels = 7, and each column contains the prediction for one label across all samples.
In case your labels are correlated you might need more sophisticated methods for multi-label classification.
Disclaimer: I'm the author of scikit-multilearn, feel free to ask more questions.
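A small usage sketch for inspecting that output, assuming predictions is a SciPy sparse matrix as returned by the code above:
dense = predictions.toarray()   # shape (n_samples, 7)
print(dense[:5])                # first five rows, one column per label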
If you insist on using the OneVsRestClassifier, then you could also call predict_proba(X_test), as it is supported by OneVsRestClassifier as well.
For example:
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(DecisionTreeClassifier())
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)
The order of the labels for which you get the result can be found in:
clf.classes_
While working with a linear regression model I split the data into a training set and test set. I then calculated R^2, RMSE, and MAE using the following:
import numpy as np
from sklearn import metrics

lm.fit(X_train, y_train)
R2 = lm.score(X,y)
y_pred = lm.predict(X_test)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
MAE = metrics.mean_absolute_error(y_test, y_pred)
I thought that I was calculating R^2 for the entire data set (rather than comparing the training data to the original data). However, I learned that you must fit the model before you score it, so I'm not sure whether I'm scoring the original data (as passed to score) or the data I used to fit the model (X_train and y_train). When I run:
lm.fit(X_train, y_train)
lm.score(X_train, y_train)
I get a different result than when I scored X and y. So my question is: are the inputs to .score evaluated against the model that was fitted (so that lm.fit(X, y); lm.score(X, y) gives the R^2 for the original data, while lm.fit(X_train, y_train); lm.score(X, y) gives the R^2 for the original data based on the model created in .fit), or is something else entirely happening?
fit() only fits the data, which is synonymous with training: fitting the data means training the model on it.
score() is more like testing or predicting: it evaluates the already-fitted model on whatever data you pass in.
So you should use different datasets for training the classifier and for testing its accuracy.
You can do it like this:
from sklearn import neighbors
from sklearn.model_selection import train_test_split  # the old cross_validation module has been removed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
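To tie this back to the question about lm.score: whatever you pass to score() is evaluated against the model produced by the last fit() call. A minimal sketch, reusing the lm, X, y and split names from the question:
lm.fit(X_train, y_train)
print(lm.score(X_train, y_train))  # R^2 on the data the model was fitted on
print(lm.score(X_test, y_test))    # R^2 on held-out data
print(lm.score(X, y))              # R^2 on the full data set, still using the model fitted on X_train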