Finding optimal features using Lasso regression in binary classification - python

I am working with a large dataset and I want to find the important features.
I am a biologist by training, so please forgive any gaps in my knowledge.
My dataset has about 5000 attributes and 500 samples, with binary class labels 0 and 1. The dataset is also imbalanced: about 400 samples are labelled 0 and 100 are labelled 1.
I want to find the features that have the most influence on the class label.
A1 A2 A3 ... Gn Class
S1 1.0 0.8 -0.1 ... 1.0 0
S2 0.8 0.4 0.9 ... 1.0 0
S3 -1.0 -0.5 -0.8 ... 1.0 1
...
Following advice from a previous question, I am trying to identify important features as the attributes with large coefficients in a Lasso regression (L1 penalty), since the L1 penalty drives the coefficients of unimportant features to 0.
I am doing this work with the scikit-learn library.
So, my questions are:
Can I use Lasso regression for an imbalanced binary class problem? If not, is Logistic regression a good alternative, even though it does not use the L1 penalty?
How can I find the optimal value of alpha using LassoCV? The documentation says that LassoCV supports this, but I cannot find the function.
Is there another good way to approach this kind of classification?
Thank you very much.

You should use a classifier instead of a regressor, so either an SVM or Logistic Regression would do the job. In scikit-learn you can use SGDClassifier, where you set the loss parameter to 'log' for Logistic Regression or 'hinge' for a linear SVM.
In SGDClassifier you can set the penalty to 'l1', 'l2' or 'elasticnet', which is a combination of both.
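For instance, here is a minimal sketch of a sparse logistic model fitted with SGDClassifier; X and y stand for your feature matrix and class labels, and the alpha value is only a placeholder:
import numpy as np
from sklearn.linear_model import SGDClassifier

# Logistic regression with an L1 penalty: unimportant features get a zero coefficient.
# ('log' was renamed 'log_loss' in newer scikit-learn versions.)
clf = SGDClassifier(loss='log', penalty='l1', alpha=1e-4, class_weight='balanced')
clf.fit(X, y)

# Inspect how many features survived the L1 penalty.
nonzero = np.flatnonzero(clf.coef_[0])
print(len(nonzero), "features have non-zero coefficients")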
You can find an optimum value of alpha either by looping over different values and evaluating the performance on a validation set, or by using GridSearchCV:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

tuned_parameters = {'alpha': [10 ** a for a in range(-6, -2)]}
sgd = SGDClassifier(loss='hinge', penalty='elasticnet', l1_ratio=0.15,
                    n_iter=5, shuffle=True, verbose=False, n_jobs=10,
                    average=False, class_weight='balanced')
clf = GridSearchCV(sgd, tuned_parameters, cv=10, scoring='f1_macro')
# after fitting, clf behaves like the best classifier found in the search space
clf.fit(X_train, Y_train)
# you can find the best alpha here
print(clf.best_params_)
This searches over the range of alpha values you provided in tuned_parameters and finds the best one. You can change the performance criterion from 'f1_macro' to 'f1_weighted' or another metric.
To address the label imbalance in your dataset, use the class_weight parameter of SGDClassifier and set it to 'balanced'.
To find the top 10 features contributing to each class label, you can find their indices as follows:
import numpy as np

for i in range(clf.best_estimator_.coef_.shape[0]):
    top10 = np.argsort(clf.best_estimator_.coef_[i])[-10:]  # indices of the 10 largest coefficients for class i
Note 1: It is always good to keep part of your dataset aside as a validation/test set, and to evaluate your optimum model on that held-out data after model selection.
Note 2: It is usually good to experiment with different types of feature and sample normalisation, e.g. dividing each row or column by its 'l1' or 'l2' norm with sklearn's Normalizer, and to check the effect on performance (see the sketch after these notes).
Note 3: For elasticnet regularisation, experiment with the l1_ratio parameter.
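A minimal sketch of the sample normalisation mentioned in Note 2; whether it helps depends on your data, so treat it as something to experiment with:
from sklearn.preprocessing import Normalizer

# Scale each sample (row) to unit L2 norm; use norm='l1' for L1 scaling instead.
normalizer = Normalizer(norm='l2')
X_train_norm = normalizer.fit_transform(X_train)
X_test_norm = normalizer.transform(X_test)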

Related

Is it possible to provide a scoring metric to sklearn RFE?

I need to get subsets of the top 1, top 2, top 3, etc. features, and the performance of my model on each of these subsets. Something like this:
| Number of features | Features | Performance |
|--------------------|----------|-------------|
| 1                  | A        | 0.7         |
| 2                  | A, D     | 0.72        |
| 3                  | A, D, B  | 0.75        |
I wanted to use RFE as a possible improvement over simply using feature importances from models.
In sklearn, the RFECV object has a ranking_ attribute, which would let me create the feature subsets. The problem is that every feature within the optimal set found by RFECV is assigned a ranking of 1, so the first k features are not ordered by importance.
I thought of using plain RFE instead, but it doesn't accept a scoring parameter, and the default accuracy is not appropriate in my case, where the classes are very unbalanced.
Is there a way to either somehow provide a scoring metric to sklearn RFE, or to force RFECV (that does accept a scoring parameter) to evaluate ranking below the 'optimal' number of features?
I also considered using SFS but I have about 500 features and it takes days to run.
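One workaround (not from this thread, just a sketch of a common approach) is to run plain RFE all the way down to a single feature, so that ranking_ becomes a full ordering, and then score each nested subset yourself with whatever metric you need; X is assumed to be a NumPy array here and the estimator is only an example:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

est = RandomForestClassifier(random_state=0)

# Eliminate one feature at a time until one is left: ranking_ is then a full ordering.
rfe = RFE(est, n_features_to_select=1, step=1).fit(X, y)
order = np.argsort(rfe.ranking_)  # feature indices from most to least important

# Score the top-1, top-2, top-3, ... subsets with an imbalance-aware metric.
for k in range(1, 11):
    subset = order[:k]
    score = cross_val_score(est, X[:, subset], y, cv=5, scoring='f1').mean()
    print(k, subset, round(score, 3))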

Is Random Forest regression good for this kind of regression problem?

I am working on vehicle occupancy prediction and I am very new to this. I have used random forest regression to predict the occupancy values.
Jupyter notebook_Random forest
I have around 48 M rows and I used all the data for the prediction. The population and occupancy columns were normalized because of their large values. I am not sure the model is good: how can I interpret the RMSE and MAE results? Also, the plot shows that the predictions are not good. Am I approaching the prediction of vehicle occupancy in the correct way?
Kindly help me with the following:
Is Random Forest regression a good method to approach this problem?
How can I improve the model results?
How do I interpret the results?
Is Random Forest regression a good method to approach this problem?
-> The model is just a tool and can of course be used. However, no one can say whether it is suitable or not, because we have not studied the distribution of your data. I suggest you also try logistic regression, support vector regression, etc.
How can I improve the model results?
-> I have several suggestions for improvement: 1. Do not standardize before confirming whether the y column has extreme values. 2. When calculating RMSE and MAE, use the original (un-normalized) y values (see the sketch after this answer). 3. Develop a deep understanding of the business logic and add new features. 4. Read up on data processing and feature engineering.
How do I interpret the results?
-> Bad results do not necessarily mean the model has no value. You need to check whether the model is better than the existing methods and whether it produces more economic value; for example, error is a loss and accuracy is a gain.
I hope this helps.
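A minimal sketch of point 2 above, assuming the target was scaled with a scikit-learn scaler (the scaler object and the y arrays are placeholders for your own variables):
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Undo the scaling so the errors are in the original occupancy units.
y_test_orig = scaler.inverse_transform(y_test.reshape(-1, 1)).ravel()
y_pred_orig = scaler.inverse_transform(y_pred.reshape(-1, 1)).ravel()

rmse = np.sqrt(mean_squared_error(y_test_orig, y_pred_orig))
mae = mean_absolute_error(y_test_orig, y_pred_orig)
print("RMSE:", rmse, "MAE:", mae)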
You were recommended an XGBoost-based regressor, so you could also try a LightGBM-based one: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html
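A minimal LGBMRegressor sketch; the hyperparameters are illustrative, not tuned:
from lightgbm import LGBMRegressor

# Gradient-boosted trees; these parameters are only a starting point.
model = LGBMRegressor(n_estimators=1000, learning_rate=0.05, random_state=34)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='rmse')
print(model.predict(X_test)[:5])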
You are getting an RMSE of 0.002175863553610834, which is really close to zero, so we can say that you have a good model. I don't think the model needs further improvement. If you still want to improve it, you could switch the algorithm to XGBoost and use regularization and early stopping to avoid overfitting.
from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=3000, learning_rate=0.01, reg_alpha=2, reg_lambda=1,
                     n_jobs=-1, random_state=34, verbosity=0)
# Monitor RMSE on both sets and stop once the test score stops improving for 5 rounds.
evalset = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_metric='rmse', eval_set=evalset, early_stopping_rounds=5)

Random Forest and Imbalance

I'm working on a dataset of around 20000 rows.
The aim is to predict whether a person has been hired by a company or not given some features like gender, experience, application date, test score, job skill, etc. The dataset is imbalanced: the classes are either '1' or '0' (hired / not hired) with ratio 1:10.
I chose to train a Random Forest Classifier to work on this problem.
I split the dataset randomly 70%-30% into a training set and a test set.
After careful reading of the different options to tackle the imbalance problem (e.g. Dealing with the class imbalance in binary classification, Unbalanced classification using RandomForestClassifier in sklearn) I got stuck on getting a good score on my test set.
I tried several things:
I trained three different random forests: one on the whole X_train, one on an undersampled training set X_und, and one on an oversampled set X_sm. X_und was generated by simply dropping rows of X_train labelled 0 at random to get 50-50, 66-33 or 75-25 ratios of 0s and 1s; X_sm was generated by SMOTE (see the sketch below).
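Assuming the resampling was done with the imbalanced-learn package (an assumption, the question does not say), a minimal sketch of the two resampled training sets looks like this:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority class with SMOTE up to a 50-50 ratio.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Undersample the majority class instead, e.g. to a 2:1 ratio (sampling_strategy = minority/majority).
X_und, y_und = RandomUnderSampler(sampling_strategy=0.5, random_state=0).fit_resample(X_train, y_train)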
Using scikit-learn's GridSearchCV I tweaked the three models to get the best parameters:
param_grid = {'min_samples_leaf': [3, 5, 7, 10, 15],
              'max_features': [0.5, 'sqrt', 'log2'],
              'max_depth': [10, 15, 20],
              'class_weight': [{0: 1, 1: 1}, {0: 1, 1: 2}, {0: 1, 1: 5}, 'balanced'],
              'criterion': ['entropy', 'gini']}
sss = StratifiedShuffleSplit(n_splits=5)
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=sss,
                    verbose=1, n_jobs=-1, scoring='roc_auc')
grid.fit(X_train, y_train)
The best score was obtained with
rfc = RandomForestClassifier(n_estimators=150, criterion='gini', min_samples_leaf=3,
                             max_features=0.5, n_jobs=-1, oob_score=True,
                             class_weight={0: 1, 1: 5})
trained on the whole X_train, giving this classification report on the test set:
precision recall f1-score support
0 0.9397 0.9759 0.9575 5189
1 0.7329 0.5135 0.6039 668
micro avg 0.9232 0.9232 0.9232 5857
macro avg 0.8363 0.7447 0.7807 5857
weighted avg 0.9161 0.9232 0.9171 5857
With the sampling methods I got similar results, but no better ones. Precision went down with the undersampling and I got almost the same result with the oversampling.
For undersampling:
precision recall f1-score support
0 0.9532 0.9310 0.9420 5189
1 0.5463 0.6452 0.5916 668
For SMOTE:
precision recall f1-score support
0 0.9351 0.9794 0.9567 5189
1 0.7464 0.4716 0.5780 668
I played with the parameter class_weights to give more weight to the 1s and also with sample_weight in the fitting process.
I tried to figure out which score to take into account other than accuracy. Running the GridSearchCV to tweak the forests, I used different scores, focusing especially on f1 and roc_auc hoping to decrease the False Negatives. I got great scores with the SMOTE-oversampling, but this model did not generalize well on the test set. I wasn't able to understand how to change the splitting criterion or the scoring for the random forest in order to lower the number of False Negatives and increase the Recall for the 1s. I saw that cohen_kappa_score is also useful for imbalanced datasets, but it cannot be used in cross validation methods of sklearn like GridSearch.
I selected only the most important features, but this did not change the result; on the contrary, it got worse. I noticed that the feature importances obtained from training a RF after SMOTE were completely different from those obtained on the original sample.
I don't know exactly what to do with the oob_score other than treating it as a free validation score obtained while training the forests. With the oversampling I get the highest oob_score = 0.9535, but this is natural since the training set is balanced in that case; the problem is still that it does not generalize well to the test set.
Right now I ran out of ideas, so I would like to know if I'm missing something or doing something wrong. Or should I just try another model instead of Random Forest?

How to penalize False Negatives more than False Positives

From the business perspective, false negatives lead to about tenfold higher costs (real money) than false positives. Given my standard binary classification models (logit, random forest, etc.), how can I incorporate this into my model?
Do I have to change (weight) the loss function in favor of the 'preferred' error (FP)? If so, how do I do that?
There are several options for you:
As suggested in the comments, class_weight should boost the loss function towards the preferred class. This option is supported by various estimators, including sklearn.linear_model.LogisticRegression, sklearn.svm.SVC, sklearn.ensemble.RandomForestClassifier, and others. Note there's no theoretical limit to the weight ratio, so even if 1 to 100 isn't strong enough for you, you can go on with 1 to 500, etc.
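For instance, a minimal sketch, assuming false negatives are roughly ten times as costly as false positives:
from sklearn.ensemble import RandomForestClassifier

# Penalise mistakes on the positive class ten times more heavily.
clf = RandomForestClassifier(n_estimators=200, class_weight={0: 1, 1: 10}, random_state=0)
clf.fit(X_train, y_train)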
You can also select the decision threshold very low during cross-validation to pick the model that gives the highest recall (though possibly low precision). Recall close to 1.0 effectively means false negatives close to 0.0, which is what you want. For that, use the sklearn.model_selection.cross_val_predict and sklearn.metrics.precision_recall_curve functions:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

y_scores = cross_val_predict(classifier, x_train, y_train, cv=3,
                             method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)
If you plot the precisions and recalls against the thresholds, you should see a picture like this (plot not reproduced here):
After picking the best threshold, you can use the raw scores from the classifier.decision_function() method for your final classification.
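A minimal sketch of applying a hand-picked threshold; the threshold value and the x_test variable are illustrative:
# Classify as positive whenever the decision score clears the chosen (low) threshold.
chosen_threshold = -0.5  # illustrative value read off the precision/recall curve
y_pred = (classifier.decision_function(x_test) >= chosen_threshold).astype(int)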
Finally, try not to over-optimize your classifier, because you can easily end up with a trivial constant classifier that always predicts the positive class (it never misses a positive, but is useless).
As @Maxim mentioned, there are two stages at which to do this kind of tuning: the model-training stage (e.g. custom class weights) and the prediction stage (e.g. lowering the decision threshold).
Another tuning for the model-training stage is using a recall scorer. You can use it in your grid-search cross-validation (GridSearchCV) to tune your classifier's hyper-parameters towards high recall.
The GridSearchCV scoring parameter accepts either the 'recall' string or a scorer built from the recall_score function.
Since you're doing binary classification, both options should work out of the box and call recall_score with its default values, which suit binary classification:
average: 'binary' (i.e. one simple recall value)
pos_label: 1 (like numpy's True value)
Should you need to custom it, you can wrap an existing scorer, or a custom one, with make_scorer, and pass it to the scoring parameter.
For example:
from sklearn.metrics import recall_score, make_scorer

# Custom recall scorer, e.g. for string labels where the positive class is 'yes'.
recall_custom_scorer = make_scorer(
    lambda y, y_pred, **kwargs: recall_score(y, y_pred, pos_label='yes')
)
GridSearchCV(estimator=est, param_grid=param_grid, scoring=recall_custom_scorer, ...)

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I'm working on a sentiment analysis problem; the data looks like this:
label instances
5 1190
4 838
3 239
1 204
2 127
So my data is unbalanced, since 1190 instances are labeled with 5. For the classification I'm using scikit-learn's SVC. The problem is that I do not know how to balance my data in the right way in order to accurately compute the precision, recall, accuracy and f1-score for the multiclass case. So I tried the following approaches:
First:
wclf = SVC(kernel='linear', C= 1, class_weight={1: 10})
wclf.fit(X, y)
weighted_prediction = wclf.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction,average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction,
average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction,
average='weighted')
print '\n clasification report:\n', classification_report(y_test, weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, weighted_prediction)
Second:
auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)
print 'F1 score:', f1_score(y_test, auto_weighted_prediction,
average='weighted')
print 'Recall:', recall_score(y_test, auto_weighted_prediction,
average='weighted')
print 'Precision:', precision_score(y_test, auto_weighted_prediction,
average='weighted')
print '\n clasification report:\n', classification_report(y_test,auto_weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, auto_weighted_prediction)
Third:
clf = SVC(kernel='linear', C= 1)
clf.fit(X, y)
prediction = clf.predict(X_test)
from sklearn.metrics import precision_score, \
recall_score, confusion_matrix, classification_report, \
accuracy_score, f1_score
print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test,prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)
F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
  sample_weight=sample_weight)
(the same DeprecationWarning is printed again for recall_score and precision_score)
0.930416613529
However, I'm getting warnings like this:
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with
multiclass or multilabel data or pos_label=None will result in an
exception. Please set an explicit value for `average`, one of (None,
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for
instance, scoring="f1_weighted" instead of scoring="f1"
How can I deal correctly with my unbalanced data in order to compute the classifier's metrics the right way?
I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you so I am going to cover different topics, bear with me ;).
Class weights
The weights from the class_weight parameter are used to train the classifier.
They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.
Basically in every scikit-learn classifier, the class weights are used to tell your model how important a class is. That means that during the training, the classifier will make extra efforts to classify properly the classes with high weights.
How they do that is algorithm-specific. If you want details about how it works for SVC and the doc does not make sense to you, feel free to mention it.
The metrics
Once you have a classifier, you want to know how well it is performing.
Here you can use the metrics you mentioned: accuracy, recall_score, f1_score...
Usually when the class distribution is unbalanced, accuracy is considered a poor choice as it gives high scores to models which just predict the most frequent class.
I will not detail all these metrics but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this print of a classification report they are defined for each class. They rely on concepts such as true positives or false negatives that require defining which class is the positive one.
precision recall f1-score support
0 0.65 1.00 0.79 17
1 0.57 0.75 0.65 16
2 0.33 0.06 0.10 17
avg / total 0.52 0.60 0.51 50
The warning
F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The
default `weighted` averaging is deprecated, and from version 0.18,
use of precision, recall or F-score with multiclass or multilabel data
or pos_label=None will result in an exception. Please set an explicit
value for `average`, one of (None, 'micro', 'macro', 'weighted',
'samples'). In cross validation use, for instance,
scoring="f1_weighted" instead of scoring="f1".
You get this warning because you are using the f1-score, recall and precision without defining how they should be computed!
The question could be rephrased: from the above classification report, how do you output one global number for the f1-score?
You could:
Take the average of the f1-score for each class: that's the avg / total result above. It's also called macro averaging.
Compute the f1-score using the global count of true positives / false negatives, etc. (you sum the number of true positives / false negatives for each class). Aka micro averaging.
Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class in the computation.
These are 3 of the options in scikit-learn; the warning is there to say you have to pick one, by specifying an average argument for the score function.
Which one you choose depends on how you want to measure the performance of the classifier: for instance, macro-averaging does not take class imbalance into account, so the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging, however, class 5 will get more importance.
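A quick illustration of the three averaging options, using the y_test and prediction variables from your own code:
from sklearn.metrics import f1_score

print(f1_score(y_test, prediction, average='macro'))     # unweighted mean over classes
print(f1_score(y_test, prediction, average='micro'))     # computed from global TP/FP/FN counts
print(f1_score(y_test, prediction, average='weighted'))  # mean weighted by class support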
The whole argument specification in these metrics is not super-clear in scikit-learn right now, it will get better in version 0.18 according to the docs. They are removing some non-obvious standard behavior and they are issuing warnings so that developers notice it.
Computing scores
Last thing I want to mention (feel free to skip it if you're aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen.
This is extremely important as any score you get on data that was used in fitting the classifier is completely irrelevant.
Here's a way to do it using StratifiedShuffleSplit, which gives you random splits of your data (after shuffling) that preserve the label distribution.
from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedShuffleSplit  # sklearn.model_selection in newer versions
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.svm import SVC

# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
svc = SVC(kernel='linear')  # the classifier was not defined in the original snippet

sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
for train_idx, test_idx in sss:
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))
There are a lot of very detailed answers here, but I don't think they are answering the right questions. As I understand the question, there are two concerns:
How do I score a multiclass problem?
How do I deal with unbalanced data?
1.
You can use most of the scoring functions in scikit-learn with multiclass problems as well as binary problems. For example:
from sklearn.metrics import precision_recall_fscore_support as score
predicted = [1,2,3,4,5,1,2,1,1,4,5]
y_test = [1,2,3,4,5,1,2,1,1,4,1]
precision, recall, fscore, support = score(y_test, predicted)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
This way you end up with tangible and interpretable numbers for each of the classes.
| Label | Precision | Recall | FScore | Support |
|-------|-----------|--------|--------|---------|
| 1 | 94% | 83% | 0.88 | 204 |
| 2 | 71% | 50% | 0.54 | 127 |
| ... | ... | ... | ... | ... |
| 4 | 80% | 98% | 0.89 | 838 |
| 5 | 93% | 81% | 0.91 | 1190 |
Then...
2.
... you can tell whether the unbalanced data is even a problem. If the scores for the less represented classes (class 1 and 2) are lower than for the classes with more training samples (class 4 and 5), then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread.
However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representative of the data, and hence, the unbalance is a good thing.
Posed question
Responding to the question 'what metric should be used for multi-class classification with imbalanced data': Macro-F1-measure.
Macro Precision and Macro Recall can also be used, but they are not as easily interpretable as in binary classification, they are already incorporated into the F-measure, and excess metrics complicate method comparison, parameter tuning, and so on.
Micro averaging is sensitive to class imbalance: if your method, for example, works well for the most common labels and totally messes up the others, micro-averaged metrics will still show good results.
Weighted averaging isn't well suited to imbalanced data, because it weights by label counts. Moreover, it is hard to interpret and unpopular: for instance, there is no mention of such an averaging in the following very detailed survey, which I strongly recommend looking through:
Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427-437.
Application-specific question
However, returning to your task, I'd research 2 topics:
metrics commonly used for your specific task - this lets you (a) compare your method with others and see whether you are doing something wrong, and (b) avoid exploring this by yourself by reusing someone else's findings;
cost of different errors of your method - for example, the use-case of your application may rely on 4- and 5-star reviews only; in that case, a good metric should count only these 2 labels.
Commonly used metrics.
As far as I can infer from looking through the literature, there are 2 main evaluation metrics:
Accuracy, which is used, e.g., in:
Yu, April, and Daryl Chang. "Multiclass Sentiment Prediction using Yelp Business." (link) - note that the authors work with almost the same distribution of ratings, see Figure 5.
Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales." Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005. (link)
MSE (or, less often, Mean Absolute Error - MAE) - see, for example:
Lee, Moontae, and R. Grafe. "Multiclass sentiment analysis with restaurant reviews." Final Projects from CS N 224 (2010). (link) - they explore both accuracy and MSE, considering the latter to be better.
Pappas, Nikolaos, Rue Marconi, and Andrei Popescu-Belis. "Explaining the Stars: Weighted Multiple-Instance Learning for Aspect-Based Sentiment Analysis." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. No. EPFL-CONF-200899. 2014. (link) - they use scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can't find it, so if you need it, write to the authors - the work is fairly new and seems to be written in Python.
Cost of different errors.
If you care most about avoiding gross blunders, e.g. assigning 1 star to a 5-star review, look at MSE;
if the difference matters, but not that much, try MAE, since it doesn't square the difference;
otherwise stay with Accuracy.
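A minimal sketch of treating the star labels as numbers for MSE/MAE; the label lists here are illustrative:
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [1, 2, 3, 4, 5, 1, 2, 1, 1, 4, 1]
y_pred = [1, 2, 3, 4, 5, 1, 2, 1, 1, 4, 5]

# A gross blunder (predicting 5 for a true 1) is punished much more by MSE than by MAE.
print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))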
About approaches, not metrics
Try regression approaches, e.g. SVR, since they generally outperform multiclass classifiers like SVC or OVA SVM.
First of all, it's a little harder to tell whether your data is unbalanced just from counting. For example: is 1 positive observation in 1000 just noise, an error, or a scientific breakthrough? You never know.
So it's always better to use all your available domain knowledge and decide wisely whether the data really is unbalanced.
Okay, what if it's really unbalanced?
Once again, look at your data. Sometimes you find one or two observations duplicated hundreds of times. Sometimes it's useful to create such fake minority-class observations.
If the data is clean, the next step is to use class weights in your prediction model.
So what about multiclass metrics?
In my experience, none of the metrics you listed is commonly used. There are two main reasons.
First: it's always better to work with probabilities than with hard class predictions (otherwise, how could you tell apart a model that predicts a class with probability 0.9 from one that predicts it with 0.6, when both output the same class?).
Second: it's much easier to compare your prediction models and build new ones when you rely on a single good metric.
From my experience I would recommend log-loss or MSE (mean squared error).
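A minimal sketch of scoring with probabilities via log-loss, assuming a classifier that exposes predict_proba (for SVC this requires probability=True):
from sklearn.metrics import log_loss
from sklearn.svm import SVC

clf = SVC(kernel='linear', C=1, probability=True)  # probability=True enables predict_proba
clf.fit(X_train, y_train)

# log_loss compares the predicted class probabilities with the true labels.
print("log-loss:", log_loss(y_test, clf.predict_proba(X_test)))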
How to fix sklearn warnings?
Simply (as yangjie noticed) set the average parameter to one of these values: 'micro' (calculate metrics globally), 'macro' (calculate metrics for each label and average them) or 'weighted' (same as macro, but weighted by label support).
f1_score(y_test, prediction, average='weighted')
All your warnings came from calling the metric functions with the default average value 'binary', which is inappropriate for multiclass prediction.
Good luck and have fun with machine learning!
Edit:
I found another answerer recommending a switch to regression approaches (e.g. SVR), which I cannot agree with. As far as I remember, there isn't even such a thing as multiclass regression. Yes, there is multilabel regression, which is quite different, and yes, in some cases it's possible to switch between regression and classification (if the classes are somehow ordered), but it is pretty rare.
What I would recommend (within the scope of scikit-learn) is to try other very powerful classification tools: gradient boosting, random forest (my favourite), KNeighbors and many more.
After that you can compute the arithmetic or geometric mean of the predictions, and most of the time you'll get an even better result. For example, the geometric mean of two models' predictions:
final_prediction = (KNNprediction * RFprediction) ** 0.5
