The assignment is to write a simple ML program that trains and predicts on a dataset of our choice. I want to determine the best model for my data. The response is a class (0/1). I wrote code to try different cross-validation methods (validation set, leave-one-out, and k-fold) on multiple models (linear regression, logistic regression, k-nearest neighbors, linear discriminant analysis). Per model, I report the MSE for each cross-validation method and track the lowest one. I then pick the model with the lowest tracked MSE. This is where I think I went wrong. If I am cross-validating multiple models, should I use the same cross-validation method?
Related
I am a beginner in machine learning in python, and I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%. I have tried numerous ways to improve the accuracy of the model, such as one-hot encoding of categorical variables, scaling of the continuous variables, and I did a grid search to find the best parameters. They all failed to improve the accuracy. So, I looked into unsupervised learning methods in order to improve it.
I tried using KMeans clustering, and I set the n_clusters into 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation but I set the cross-validation to be against the labels predicted by the KMeans:
kmeans = KMeans(n_clusters = 2)
kmeans.fit(X_train)
logreg = LogisticRegression().fit(X_train, y_train)
cross_val_score(logreg, X_train, kmeans.labels_, cv = 5)
When using the cross_val_score, the accuracy is averaging over 95%. However, when I use the .score() method:
logreg.score(X_train, kmeans.labels_)
, the score is in the 60s. My questions are:
What does the significance (or meaning) of the score that is produced when testing the model against the labels predicted by k-means?
How can I use k-means clustering to improve the accuracy of the model? I tried adding a 'cluster' column that contains the clustering labels to the training data and fit the logistic regression, but it also didn't improve the score.
Why is there a huge discrepancy between the score when evaluated via cross_val_predict and the .score() method?
I'm having a hard time understanding the context of your problem based on the snippet you provided. Strong work for providing minimal code, but in this case I feel it may have been a bit too minimal. Regardless, I'm going to read between the lines and state some relevent ideas. I'll then attempt to answer your questions more directly.
I am working on a binary classification problem. I have implemented a logistic regression model with an average accuracy of around 75%
This only tells a small amount of the story. knowing what data your classifying and it's general form is pretty vital, and accuracy doesn't tell us a lot about how innaccuracy is distributed through the problem.
Some natural questions:
Is one class 50% accurate and another class is 100% accurate? are the classes both 75% accurate?
what is the class balance? (is there more of one class than the other)?
how much overlap do these classes have?
I recommend profiling your training and testing set, and maybe running your data through TSNE to get an idea of class overlap in your vector space.
these plots will give you an idea of how much overlap your two classes have. In essence, TSNE maps a high dimensional X to a 2d X while attempting to preserve proximity. You can then plot your flagged Y values as color and the 2d X values as points on a grid to get an idea of how tightly packed your classes are in high dimensional space. In the image above, this is a very easy classification problem as each class exists in it's own island. The more these islands mix together, the harder classification will be.
did a grid search to find the best parameters
hot take, but don't use grid search, random search is better. (source Artificial Intelligence by Jones and Barlett). Grid search repeats too much information, wasting time re-exploring similar parameters.
I tried using KMeans clustering, and I set the n_clusters into 2. I trained the logistic regression model using the X_train and y_train values. After that, I tried testing the model on the training data using cross-validation but I set the cross-validation to be against the labels predicted by the KMeans:
So, to rephrase, you trained your model to predict an output given some input, then tested how it performed predicting the same data and got 75%. This is called training accuracy (as opposed to validation or test accuracy). A low training accuracy is indicative of one of two things:
there's a lot of overlap between your classes. If this is the case, I would look into feature engineering. Find a vector space which better segregates the two classes.
there's not a lot of overlap, but the front between the two classes is complex. You need a model with more parameters to segregate your two classes.
model complexity isn't free though. See the curse of dimensionality and overfitting.
ok, answering more directly
these accuracy scores mean your model isn't complex enough to learn the problem, or there's too much overlap between the two classes to see a better accuracy.
I wouldn't use k-means clustering to try to improve this. k-means attempts to find cluster information based on location in a vector space, but you already have flagged data y_train so you already know which clusters data should belong in. Try modifying X_train in some way to get better segregation, or try a more complex model. you can use things like k-means or TSNE to check your transformed X_train for better segregation, but I wouldn't use them directly. Obligatory reminder that you need to test and validate with holdout data. see another answer I provided for more info.
I'd need more code to figure that one out.
p.s. welcome to stack overflow! Keep at it.
I'm trying to make a model which can predict test scores. I'm currently using a simple linear regression model but receiving an accuracy score of close to 0 due to the fact that it's guessing a single number as the score. I was wondering if there was a way to have the model predict a range of about 10 numbers and if the true number is in that range it is marked as a correct guess.
The dataset I am using
Github page with notebook
It seems like you are using a LogisticRegression, LogisticRegression is in fact not for regression, it is for classification (for example, is the input data class a or b).
use sklearn.linear_model.LinearRegression for linear regression, read this for more details
There are also many other regression algorithms that I cannot list all in an answer. If you want to use regressions other than simple naive linear regression, read this for all available supervised learning algorithms scikit-learn provides, Ridge regression and SVR might be good places to start with.
When using tensorflow to train a neural network I can set the loss function arbitrarily. Is there a way to do the same in sklearn when training a SVM? Let's say I want my classifier to only optimize sensitivity (regardless of the sense of it), how would I do that?
This is not possible with Support Vector Machines, as far as I know. With other models you might either change the loss that is optimized, or change the classification threshold on the predicted probability.
SVMs however minimize the hinge loss, and they do not model the probability of classes but rather their separating hyperplane, so there is not much room for manual adjustements.
If you need to focus on Sensitivity or Specificity, use a different model that allows maximizing that function directly, or that allows predicting the class probabilities (thinking Logistic Regressions, Tree based methods, for example)
I wanted to use KNN on features which were textmined while using another type of regression for the rest of my features. Is it possible to somehow combine both regression models to predict a single label? Should I split my datasets into two different ones?
I am currently using pandas and sklearn.
You can absolutely do that using Ensemble models.
Ensemble models combine decisions from various models in order to improve the overall performance. For regression problems I would suggest the following ensemble models/techniques:
Averaging
Is a fairly simple ensemble technique where you need to take the average of the predictions from all of your models and use it to make the final prediction.
Weighted Averaging
This is similar to simple averaging, but all of the models are now assgined different weights, defining the importance/contribution of each of the models in the final prediction.
Bagging meta-estimator
Is an ensembling technique that can be used in both classification (BaggingClassifier) and regression (BaggingRegressor). Bagging meta-estimator undertakes the following steps in order to reach to the final prediction:
Randomly create subsets out of the original dataset
A base estimator is fitted on each of the subsets created in step 1.
Predictions are combined to get the final predicted label
Below is a very simple example that makes use of BaggingRegressor of sklearn:
from sklearn.ensemble import BaggingRegressor
ensemble_model = BaggingRegressor(tree.DecisionTreeRegressor(random_state=1))
ensemble_model.fit(X_train, Y_train)
ensemble_model.score(X_test,Y_test)
I am performing a machine learning task wherein I am using logistic regression for topic classification.
If this is my code:
model= LogisticRegression()
model= model.fit(mat_tmp, label_tmp)
y_train_pred = model.predict(mat_tmp_test)
print(metrics.accuracy_score(label_tmp_test, y_train_pred))
Is there a way I can output what exactly is happening inside the model. Like probably a working example of what my model is doing? Like maybe displaying 2-3 documents and how they are being classified?
In order to be fully aware of what is happening in your model, you must first take some time to study the logistic regression algorithm (eg. from lecture notes or Wikipedia). As with other supervised techniques, logistic regression has hyper-parameters and parameters. Hyper-parameters basically specify how your algorithm runs, which you must provide at initialisation (ie. before it sees any data). For example, you could have prior information about the distribution of classes, which then would be a hyper-parameter. Parameters are "learnt" from your data.
Once you understand the algorithm, the interesting question will be what the parameters of your model are (recall that these are retrieved from the data). By visiting the documentation, you find in the attributes section, that this classifier has 3 parameters, which you can access by their field names.
If you are not interested in such details, but only want to assess the accuracy of your classifier, a useful technique is cross-validation. You split your labeled data into k equal sized subsets, and train your classifier using k-1 of them. Then you evaluate the trained classifier on the remaining 1 subset and calculate the accuracy (ie. what proportion of the data could be predicted properly). This method has its drawbacks, but proves to be very useful in general.