I am trying to evaluate logistic regression using the AUROC curve and cross-validate my scores. When I don't cross-validate I have no issues, but I really want to use cross-validation to help decrease bias in my method.
Anyway, below is the code and error term I get for the beginning part of my code:
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

X = df.drop('Survived', axis=1)
y = df['Survived']
skf = StratifiedKFold(n_splits=5)
logmodel = LogisticRegression()
i = 0
for train, test in skf.split(X, y):
    logmodel.fit(X[train], y[train])  # error occurs here
    predictions = logmodel.predict_proba(X[test])
    # a bunch of code that I haven't included which creates the ROC curve
    i += 1
The error occurs on the logmodel.fit line marked above, and it returns a list of integers followed by 'not in index'.
I don't really understand what the problem is.
This is my understanding of the code: first I create an instance of both StratifiedKFold and LogisticRegression, with the StratifiedKFold instance set to make five folds. Next, for each train and test fold of my dataset X, y, I fit the logistic model to the training data and then create a list of predicted probabilities based on the test data. Later (this part is not shown) I will create a ROC curve for each fold of data.
Again, I don't really understand what the problem is, but maybe somebody can clarify. My work is more or less copied directly from this example in the scikit-learn docs: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py
Please add more details so the problem can be properly examined. Preferably (and actually required): a piece of code that one can run to reproduce the error.
At first glance, you take a pandas DataFrame and feed it into the model, and that is done incorrectly.
The following lines are the correct way to retrieve the data and feed it to the model:
X = df.drop('Survived', axis=1).values
y = df['Survived'].values
The .values attribute accesses the NumPy array stored inside those pandas objects, which makes the positional indexing (X[train], y[train]) in the rest of the code work as intended.
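For completeness, a minimal sketch of the fixed loop, assuming df is the DataFrame with a 'Survived' column from the question (indexing the DataFrame with .iloc[train] would be an equivalent alternative to .values):

from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

X = df.drop('Survived', axis=1).values  # NumPy array, so positional indexing works
y = df['Survived'].values

skf = StratifiedKFold(n_splits=5)
logmodel = LogisticRegression()
for train, test in skf.split(X, y):
    logmodel.fit(X[train], y[train])               # train on the training fold
    predictions = logmodel.predict_proba(X[test])  # probabilities for the test fold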
Hopefully that helps you to solve the error.
Good luck!
I have the following code where I want to use k-fold cross validation for a Linear Regression model:
kf = KFold(n_splits=100)
predi = cross_val_predict(model, train[columns], train[target], cv = kf)
predi = pandas.Series(predi)
model.fit(data[columns], data[target])
pred_test = model.predict(test[columns])
print(mean_squared_error(pred_test, test[target]))
However, I am not sure whether the code does what I would like it to do. Specifically, I am not sure about the model.fit part. Does it even use the cross-validation?
The reason I am not sure is that calculating it like this yields worse results than without cross-validation.
No. Cross-validation is just for checking the performance of the model on the data (or rather on different parts of it).
When you call fit(), it fits on all the data supplied at that time, whereas cross-validation only ever uses parts of the data (leaving one fold out in each iteration). This difference in data may cause the estimator to perform better or worse.
model.fit doesn't have any functionality to divide the data. It just minimizes the cost function and creates a model (i.e. finds the parameters).
Also, if you think that by creating a loop, dividing the data on every iteration, and calling model.fit again and again you get a more generalized model, that's not possible: calling fit a second time on a linear regression model object makes it forget the old data.
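To make the separation concrete, here is a sketch under the question's assumptions (model, train, test, columns and target as defined there; 10 folds instead of 100, purely for illustration, and assuming the final fit is meant to be on the same training frame):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error

kf = KFold(n_splits=10, shuffle=True, random_state=0)

# Cross-validation only *estimates* performance; it does not change `model`
cv_mse = -cross_val_score(model, train[columns], train[target],
                          cv=kf, scoring='neg_mean_squared_error')
print(cv_mse.mean())

# The final model is fitted separately, on all of the training data
model.fit(train[columns], train[target])
pred_test = model.predict(test[columns])
print(mean_squared_error(test[target], pred_test))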
I'm confused about the reason to use cross_val_score.
From what I understood, cross_val_score tells me whether my model is overfitting or underfitting. Moreover, it does not train my model. Since I have only one feature, tf-idf (a sparse matrix), I don't know what to do if it is under- or overfitting.
Q1: Did I use it in the wrong order? I've seen both 'cross->fit' and 'fit->cross' examples.
Q2: What did the scores in '#print1' tell me? Does it mean I have to train my model k-times (with the same training set) where k is the k-fold that give the best score?
My code now:
model1=GaussianNB(priors=None)
score=cross_val_score(model1, X_train.toarray(), y_train,cv=3,scoring='accuracy')
# print1
print (score.mean())
model1.fit(X_train.toarray(),y_train)
predictions1 = model1.predict(X_test.toarray()) #held out data
# print2
print (classification_report(predictions1,y_test))
Here is some information about cross-validation.
The order (cross then fit) seems fine to me.
First you evaluate the performance of your model on known data. Taking the mean of all the CV scores is interesting, but it may be better to look at the raw scores as well, to see whether your model fails on some folds.
If your model works, then you can fit it on your train set and predict on your test set.
Training the same model k times won't change anything.
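As a sketch, reusing the names from the question:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model1, X_train.toarray(), y_train, cv=3, scoring='accuracy')
print(scores)         # raw per-fold scores: look for folds where the model does badly
print(scores.mean())  # the single summary number

# If the CV scores look reasonable, fit once on the full training set
model1.fit(X_train.toarray(), y_train)
predictions1 = model1.predict(X_test.toarray())  # held-out data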
train_index, test_index = next(iter(ShuffleSplit(821, train_size=0.2, test_size=0.80, random_state=42)))
print train_index, len(train_index)
print test_index, len(test_index)
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(features, labels, train_size=0.33, random_state=42)
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test, labels_test)
print pred, len(pred)
A few questions from this code:
Why do I need the cross_validation.train_test_split line in order to fit and predict with my classifier? (I am not doing any preprocessing on my data except for stopword removal, which I have already done.)
Do the test and train indexes correspond to the classified & predicted labels? My goal is to get all my labels, in their original order, after fitting and predicting them. My features and labels used for training and testing are from a pandas dataframe (two columns), and I need the predicted labels, in order, so that I can feed them back into the pandas dataframe.
Is there a way to predict the labels for the whole set, and not just the test set?
tl;dr
Because your decision tree classifier has to be trained before it can predict anything. It's not a magic algorithm. It has to be shown examples of what to do before it can work out what to do on other things.
cross_validation.train_test_split() facilitates this by splitting your data into a test and a training dataset in such a way that you can analyse how well the model performed later on. Without this, you have no way of assessing how well your decision tree classifier actually performed.
You can create your own testing and training data without train_test_split() (and I suspect that is what you were trying to do with ShuffleSplit()), but you will need at least some training data.
test_index and train_index have nothing to do with the split that train_test_split() produces. Full stop. They come from a separate random process that is completely unrelated to what train_test_split() does.
The purpose of ShuffleSplit() is to give you the indices with which to partition your data into training and test sets yourself. train_test_split() will instead choose its own indices and partition based on those. You should use one or the other, and use it sensibly.
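If you do want to go the ShuffleSplit() route, here is a minimal sketch against the current sklearn.model_selection API (not the older sklearn.cross_validation one in the question), assuming features and labels are NumPy arrays:

from sklearn.model_selection import ShuffleSplit

ss = ShuffleSplit(n_splits=1, train_size=0.2, test_size=0.8, random_state=42)
train_index, test_index = next(ss.split(features))  # row indices into your own arrays

features_train, features_test = features[train_index], features[test_index]
labels_train, labels_test = labels[train_index], labels[test_index]

(With a pandas DataFrame you would index with .iloc[train_index] instead.)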
Yes. You can always just call
pred = clf.predict(features), or, if you only have the two splits, pred = clf.predict(np.concatenate([features_train, features_test]))
The Full Story
You need cross-validation if you want to do this right. The whole purpose of cross-validation is to avoid overfitting.
Basically, if you run your model on both the training and the testing data, then your model is going to perform really well on the training set (because, well, that's what you trained it on) and that's going to skew your overall metrics of how well your model will perform on real data.
It's a lot like asking a student to perform in an exam and then in real life: if you want to know whether your student learned from the process of preparing for an exam, you don't give him another exam, you ask him to demonstrate his skills in the real world dealing with unknown and complex data.
If you want to know if your model will be useful, then you want to cross-validate. Wikipedia puts it best:
In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset). The goal of cross-validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.
cross_validation.train_test_split doesn't do anything except split the dataset into training and testing data for you.
But perhaps you don't care about metrics, and that's fine. The question then becomes: is it possible to run a decision tree classifier without a training dataset?
The answer is no. Decision tree classifiers are supervised algorithms: they need to be trained on data before they can generalise their model to new results. If you don't give them any data to train on, it will be unable to do anything with any data you feed it in predict.
Finally, while it is perfectly possible to get the labels for the whole set (see tl;dr), it is a really bad idea if you actually care about whether or not you're getting sensible results.
You already have the labels for the testing and training data. You don't need another column that includes predictions on the training data, because they'll either come out identical to the labels or close enough to identical.
I can't think of a single meaningful reason to get back predicted results for your training data short of trying to optimise how it's performing on your training data. If that's what you are trying to do, then do that. What you are doing right now is definitely not that, and I encourage you to think strongly about what your reasons are for blindly inserting numbers into your table without due cause to believe they actually mean something.
There are ways to improve this: get back an accuracy metric, for example, or try k-fold cross-validation to estimate accuracy, or look at log-loss or AUC or any one of a number of metrics to gauge whether or not your model is performing well.
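For example, a sketch of the k-fold accuracy estimate mentioned above, reusing the features and labels names from the question:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
scores = cross_val_score(clf, features, labels, cv=5, scoring='accuracy')
print(scores)         # one accuracy value per fold
print(scores.mean())  # overall estimate of how well the model generalizes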
Using both ShuffleSplit and train_test_split is redundant. You do not even appear to be using the indices returned by ShuffleSplit.
An example of how to use indices like the ones returned by ShuffleSplit is below (here train_test_split is used to generate them). X and y are NumPy arrays: X is number of instances by number of features, and y contains the label of each row.
train_inds, test_inds = train_test_split(range(len(y)),test_size=0.33, random_state=42)
X_train, y_train = X[train_inds], y[train_inds]
X_test , y_test = X[test_inds] , y[test_inds]
You should not test on your training data! But if you want to see what happens just do
pred = clf.predict(features_train)
Also you do not need to pass the labels to predict. You should be using
score = metrics.accuracy_score(y_test, pred)
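Putting this answer's pieces together, a sketch with the current sklearn imports (reusing the names from the question):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, train_size=0.33, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)               # no labels passed to predict
print(metrics.accuracy_score(labels_test, pred))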
data = df_train.as_matrix(columns=train_vars) # All columns aside from 'output'
target = df_train.as_matrix(columns=['output']).ravel()
# Get training and testing splits
splits = cross_validation.train_test_split(data, target, test_size=0.2)
data_train, data_test, target_train, target_test = splits
# Fit the training data to the model
model = RandomForestRegressor(100)
model.fit(data_train, target_train)
# Make predictions
expected = target_test
predicted = model.predict(data_test)
When I run this code to predict the variable 'output' as a function of all the other variables in this file (https://www.dropbox.com/s/cgyh09q2liew85z/uuu.csv?dl=0), the expected and predicted arrays are exactly the same. It seems like I am overfitting or doing something wrong. How can I fix it?
Kudos for questioning too good results!
Each feature (column) in the data contains only a small number of distinct values. If I counted correctly, there are only 14 uniquely different rows.
This has two implications:
You are very likely to be overfitting because you only have 14 effective samples but 36 features.
The same rows are very likely to appear in the testing set and in the training set again. This means you are testing on the same data that the model was trained on. Since the model is perfectly overfitted to this data you get perfect results.
Edit
I just realized I haven't answered the actual question: how to fix it?
That depends.
If you are lucky, someone made an error in preparing the data.
If the data is correct, things will be more difficult. First, get rid of duplicate rows, for example with np.vstack({tuple(row) for row in data}) (see here) or np.unique(data, axis=0). Then see whether you can do some meaningful work with what is left. But to be honest, I believe 14 samples is a bit low for doing machine learning. Try to get more data :)
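A small sketch of that de-duplication step, assuming data and target are the NumPy arrays from the question (np.unique(..., axis=0) is the modern equivalent of the vstack/set trick):

import numpy as np

# Indices of the first occurrence of each unique feature row
_, idx = np.unique(data, axis=0, return_index=True)
data_unique, target_unique = data[idx], target[idx]
print(len(data), '->', len(data_unique))  # should drop to the handful of unique rows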
I am using python to do a bit of machine learning.
I have a NumPy ndarray with 2000 entries. Each entry has information about a subject and ends with a boolean telling me whether they are a vampire or not.
Each entry in the array looks like this:
[height(cm), weight(kg), stake aversion, garlic aversion, reflectance, shiny, IS_VAMPIRE?]
My goal is to be able to give a probability that a new subject is a vampire given the data shown above for the subject.
I have used sklearn to do some machine learning for me:
clf = tree.DecisionTreeRegressor()
clf=clf.fit(X,Y)
print clf.predict(W)
Where W is an array of data for the new subject. The script I have written returns booleans, but I would like it to return probabilities. How can I modify it?
If you are using DecisionTreeRegressor() then you may use the score function to determine the coefficient of determination R^2 of the prediction.
Please find the below link to the documentation.
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor
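For example, a short sketch using a hypothetical held-out split of the X and Y from the question:

from sklearn.model_selection import train_test_split
from sklearn import tree

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
clf = tree.DecisionTreeRegressor().fit(X_train, Y_train)
print(clf.score(X_test, Y_test))  # coefficient of determination R^2 on held-out data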
You can also list the cross-validation scores (using 10 folds) as below:
from sklearn.model_selection import cross_val_score
clf = tree.DecisionTreeRegressor()
clf=clf.fit(X,Y)
print cross_val_score(clf, X, Y, cv=10)
print clf.predict(W)
Which gives an output something similar to this,
array([ 0.61..., 0.57..., -0.34..., 0.41..., 0.75...,
0.07..., 0.29..., 0.33..., -1.42..., -1.77...])
Use a DecisionTreeClassifier instead of a regressor, and use the predict_proba method. Alternatively, you could use logistic regression (also available in scikit-learn).
The basic idea is this:
clf = tree.DecisionTreeClassifier()
clf=clf.fit(X,Y)
print clf.predict_proba(W)
You want to use a classifier that gives you a probability. Also, you will want to make sure that the data points in your testing array W are not replicates of any of your training data. If a point matches the training data exactly, the tree thinks it's definitely a vampire or definitely not a vampire, so it will give you 0 or 1.
You're using a regressor but you probably want to use a classifier.
You'll also want to use a classifier that can give you posterior probabilities like a decision tree or logistic regression. Other classifiers may give you a score (some kind of confidence measure) which may also work for your needs.
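As a sketch of the logistic-regression alternative (X, Y and W as in the question; the columns of predict_proba follow clf.classes_, here not-vampire then vampire):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X, Y)
print(clf.predict_proba(W))  # one row per subject: [P(not vampire), P(vampire)]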