Which should I use, oversampling or undersampling?

My data is imbalanced.
The class ratio is about 45,000 : 1,500. With oversampling (random oversampling, SMOTE, and SMOTETomek) I get scores above 97%.
However, when I ran the model on the actual test data, all 1,500 cases were predicted as the opposite class.
With undersampling, accuracy is about 85%,
and about 300 of the 1,500 cases are predicted wrongly, yet the difference in accuracy compared with oversampling is large.
Of course, I also checked recall and precision, but they did not differ significantly from the accuracy. Could you explain why I get these results?
sm = SMOTE(random_state=42)
X, y = sm.fit_resample(X, y)
I got these metrics:
accuracy: 0.9784
precision: 0.9718
recall: 0.9854
F1: 0.9786
AUC: 0.9784
But when I actually ran the model on test data containing class 1 samples, I got this result:
test_result = model.predict(test_x)
pd.DataFrame(test_result).value_counts()
0 1589
dtype: int64
All class 1 samples were predicted as 0.
The result is the same with random oversampling, SMOTE, and SMOTETomek.
Undersampling:
Accuracy : 86%
0 354
1 1235
dtype: int64
I don't know which approach is best.
Is there anything else I can try?
Added: I found and consulted imbalanced-learn.org.
from sklearn.metrics import balanced_accuracy_score
balanced_accuracy_score(y_test, y_pred)
Should I check this score when I have an imbalanced data set?
When I evaluate oversampling and undersampling with this score, I get:
oversampling : 50%
undersampling : 84%

For anyone who is looking for the answer:
Random oversampling is the problem.
Because it randomly inserts copies of minority-class samples over and over, the repeated data can cause overfitting.
As a result, when genuinely new test data was fed in, the predictions were very different.
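One way this kind of overfitting can stay hidden is when the resampling is applied before the train/validation split; the original post does not say whether that was the case, so treat this as an assumption. The sketch below uses made-up row ids (not the poster's data) and NumPy only, and measures how many minority rows in the validation part are exact copies of rows the model would have trained on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 majority rows and 30 minority rows, each with a unique id.
ids = np.arange(1030)
y = np.array([0] * 1000 + [1] * 30)

# Random oversampling: repeat minority rows (with replacement) until classes match.
minority_ids = ids[y == 1]
extra = rng.choice(minority_ids, size=1000 - 30, replace=True)
ids_over = np.concatenate([ids, extra])
y_over = np.concatenate([y, np.ones(extra.size, dtype=int)])

# Split AFTER oversampling (the problematic order): 80% train, 20% validation.
perm = rng.permutation(ids_over.size)
n_val = ids_over.size // 5
val_idx, train_idx = perm[:n_val], perm[n_val:]

train_minority = set(ids_over[train_idx][y_over[train_idx] == 1].tolist())
val_minority = ids_over[val_idx][y_over[val_idx] == 1].tolist()

# Share of minority validation rows that are duplicates of training rows.
leaked = float(np.mean([i in train_minority for i in val_minority]))
print(f"minority validation rows also present in training: {leaked:.2%}")
```

If the share is close to 100%, the validation score mostly measures memorization of duplicated rows. Splitting first and resampling only the training part avoids this.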

Related

F1 score reduced after using class weight

I am working on a multi class classification use case and the data is highly imbalanced. By highly imbalanced data I mean that there is a huge difference between class with maximum frequency and the class with minimum frequency. So if I go ahead using SMOTE oversampling then the data size increases tremendously (data size goes from 280k rows to more than 25 billion rows because the imbalance is too high) and it becomes practically impossible to fit a ML model to such a huge dataset. Similarly I can't use undersampling as that would lead to loss of information.
So I thought of using compute_class_weight from sklearn while creating a ML model.
Code:
from sklearn.utils.class_weight import compute_class_weight
class_weight = compute_class_weight(class_weight='balanced',
classes=np.unique(train_df['Label_id']),
y=train_df['Label_id'])
dict_weights = dict(zip(np.unique(train_df['Label_id']), class_weight))
svc_model = LinearSVC(class_weight=dict_weights)
I did predictions on test data and noted the result of metrics like accuracy, f1_score, recall etc.
I tried to replicate the same but by not passing class_weight, like this:
svc_model = LinearSVC()
But the results I obtained were strange. The metrics with class_weight were slightly worse than the metrics without class_weight.
I was hoping for the exact opposite as I am using class_weight to make the model better and hence the metrics.
The difference between metrics for both the models was minimal but f1_score was less for model with class_weight as compared to model without class_weight.
I also tried the below snippet:
svc_model = LinearSVC(class_weight='balanced')
but still the f1_score was less as compared to model without class_weight.
Below are the metrics I obtained:
LinearSVC w/o class_weight
Accuracy: 89.02, F1 score: 88.92, Precision: 89.17, Recall: 89.02, Misclassification error: 10.98
LinearSVC with class_weight='balanced'
Accuracy: 87.98, F1 score: 87.89, Precision: 88.3, Recall: 87.98, Misclassification error: 12.02
LinearSVC with class_weight=dict_weights
Accuracy: 87.97, F1 score: 87.87, Precision: 88.34, Recall: 87.97, Misclassification error: 12.03
I assumed that using class_weight would improve the metrics, but instead it's making them worse. Why is this happening and what should I do? Will it be okay if I don't handle the imbalanced data?

Using class_weight is not guaranteed to improve performance. There is always some uncertainty involved when we're working with stochastic systems.
You can try class_weight = 'auto' (in later scikit-learn releases this option was replaced by 'balanced', which you have already tried). Here's a discussion: https://github.com/scikit-learn/scikit-learn/issues/4324
Finally, you seem to be using the default hyperparameters of the linear SVM, meaning C=1. I would suggest experimenting with the hyperparameters, ideally with a grid search. If class_weight still decreases performance, try normalizing the data.

How I see the Problem
My understanding of your problem is that your class weight approach IS actually improving your model, but you don't see it (probably). Here is why:
Assume you have 10 POS and 1k NEG samples, and two models: M-1 predicts all of the NEG samples correctly (false positive rate = 0) but only 2 of the 10 POS samples correctly. M-2 predicts 700 of the NEG and 8 of the POS samples correctly. From an anomaly detection point of view, the second model might be preferred, while the first model (which has clearly been trapped by the imbalance) has the higher f1 score.
Class weights attempt to solve the imbalance issue by shifting your model from M-1 toward M-2. Thus your f1 score might decrease slightly, but you might end up with a model of better quality.
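To make the M-1 versus M-2 comparison concrete, here is a small sketch that works out the numbers from the counts given above. It assumes the f1 score in question is the support-weighted average over both classes (the averaging used in the question is not stated, so that part is a guess); the helper names are made up for illustration.

```python
def f1(tp, fp, fn):
    """F1 score from raw counts; returns 0.0 when it is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

N_POS, N_NEG = 10, 1000

def report(name, pos_correct, neg_correct):
    pos_missed = N_POS - pos_correct  # POS samples predicted as NEG
    neg_missed = N_NEG - neg_correct  # NEG samples predicted as POS
    f1_pos = f1(tp=pos_correct, fp=neg_missed, fn=pos_missed)
    f1_neg = f1(tp=neg_correct, fp=pos_missed, fn=neg_missed)
    # Support-weighted average of the per-class F1 scores.
    weighted_f1 = (N_POS * f1_pos + N_NEG * f1_neg) / (N_POS + N_NEG)
    pos_recall = pos_correct / N_POS
    print(f"{name}: weighted F1 = {weighted_f1:.3f}, POS recall = {pos_recall:.1f}")
    return weighted_f1, pos_recall

m1_f1, m1_recall = report("M-1", pos_correct=2, neg_correct=1000)
m2_f1, m2_recall = report("M-2", pos_correct=8, neg_correct=700)
```

M-1 comes out ahead on the weighted F1 while finding far fewer of the POS samples, which is the trade-off described above.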
How You can Validate my opinion
You can check my point by looking at the confusion matrix to see whether the f1 score has decreased because of more misclassifications in your major class, and whether your minor class now has more true positives. Plus, you can test other metrics designed for imbalanced classes. I know of Cohen's Kappa. Maybe you will see that class weights actually increase the Kappa score.
And one more thing: do some bootstrapping or cross-validation; the change in f1 score might be due to data variability and mean nothing.
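A minimal sketch of the bootstrapping idea, assuming you have the true labels and the predictions of both models on the same test set. The arrays here are randomly generated stand-ins, not the asker's data, and a binary F1 is used to keep the helper short.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """Binary F1 for the positive class (label 1) from 0/1 arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

rng = np.random.default_rng(0)

# Hypothetical test labels and predictions from two models of similar quality.
y_true = rng.integers(0, 2, size=500)
pred_a = np.where(rng.random(500) < 0.89, y_true, 1 - y_true)
pred_b = np.where(rng.random(500) < 0.88, y_true, 1 - y_true)

# Resample the test set with replacement and record the F1 difference each time.
diffs = []
for _ in range(1000):
    idx = rng.integers(0, y_true.size, size=y_true.size)
    diffs.append(f1_binary(y_true[idx], pred_a[idx]) - f1_binary(y_true[idx], pred_b[idx]))

low, high = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap interval for F1(A) - F1(B): [{low:.3f}, {high:.3f}]")
```

If the interval comfortably includes 0, a gap of about one point between the two models may be nothing more than sampling noise.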

Why is the positive sample size in the confusion matrix smaller than in the actual data?

I was doing logistic regression on my coupon dataset, and coupon.flag.value_counts() shows 22356 negative samples and 2961 positive samples. But after building the logistic regression model, the total number of positive samples in the training confusion matrix is only 51 (test_size = 0.3). Can someone help me figure out what the problem is? Thanks!
coupon=pd.read_csv('L2_Week3.csv')
coupon=pd.get_dummies(coupon)
coupon.flag.value_counts()
0 22356
1 2961
Name: flag, dtype: int64
from sklearn.model_selection import train_test_split
y=coupon['flag']
x=coupon[['coupon_used_in_last_month','job_retired','job_student','marital_single','returned_yes','job_bl
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=100)
from sklearn import linear_model
lr=linear_model.LogisticRegression()
lr.fit(x_train,y_train)
y_pred_train=lr.predict(x_train)
y_pred_test=lr.predict(x_test)
import sklearn.metrics as metrics
metrics.confusion_matrix(y_train,y_pred_train)
Out[96]:
array([[15589, 34],
[ 2081, 17]])
As I said, the number of positive samples in the matrix is way smaller than in the actual data.

Logistic regression produces probabilities, not pure 0 and 1; lr.predict only returns 0 and 1 because it applies a default threshold of 0.5 to those probabilities. For this kind of prediction you should take the probabilities from lr.predict_proba and find a threshold (using a validation dataset) that decides which probabilities are treated as 0 and which as 1, so as to obtain the best value of your metric (accuracy, precision, etc.), and then use this threshold on the test predictions. If you skip this step, the confusion matrix is generated with the default threshold (0.5), which works poorly on imbalanced data. Even without any further optimization, a reasonable starting point is the ratio of 1's in the dataset, i.e. 2961/(2961+22356).
Try this code:
threshold = 2961/(2961+22356)
# predict_proba returns one column per class; column 1 is the probability of flag == 1
proba_train = lr.predict_proba(x_train)[:, 1]
metrics.confusion_matrix(y_train, (proba_train > threshold).astype(int))
Note that this step is not needed to evaluate AUC, which can be computed both from probabilities and from pure 0 and 1. It is worth knowing that predicted probabilities converted to 0 and 1 usually give a much worse AUC than raw probabilities.
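Here is a self-contained sketch of the thresholding idea on synthetic data (the coupon dataset is not available, so make_classification with roughly 12% positives stands in for it; all names below are illustrative). It compares the default 0.5 cut-off used by predict with a cut-off equal to the positive rate in the training set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the coupon data: about 12% positives.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.88, 0.12], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict() thresholds the positive-class probability at 0.5.
default_pred = lr.predict(X_test)

# Alternative: threshold the probabilities at the training positive rate.
proba = lr.predict_proba(X_test)[:, 1]
threshold = y_train.mean()
lowered_pred = (proba > threshold).astype(int)

print("threshold 0.5:")
print(confusion_matrix(y_test, default_pred))
print(f"threshold {threshold:.3f}:")
print(confusion_matrix(y_test, lowered_pred))
```

The lower threshold flags more samples as positive, trading some precision for recall. Whether that trade is worthwhile depends on the metric you care about, so ideally the threshold is tuned on a validation set rather than fixed in advance.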

Random Forest and Imbalance

I'm working on a dataset of around 20000 rows.
The aim is to predict whether a person has been hired by a company or not given some features like gender, experience, application date, test score, job skill, etc. The dataset is imbalanced: the classes are either '1' or '0' (hired / not hired) with ratio 1:10.
I chose to train a Random Forest Classifier to work on this problem.
I split the dataset randomly 70%-30% into a training set and a test set.
After careful reading of the different options to tackle the imbalance problem (e.g. Dealing with the class imbalance in binary classification, Unbalanced classification using RandomForestClassifier in sklearn) I got stuck on getting a good score on my test set.
I tried several things:
I trained three different random forests on the whole X_train, on undersampled training X_und and on oversampled X_sm respectively. X_und was generated by simply cutting down at random the rows of X_train labelled by 0 to get 50-50, 66-33 or 75-25 ratios of 0s and 1s; X_sm was generated by SMOTE.
Using scikit-learn's GridSearchCV I tuned the three models to get the best parameters:
param_grid = {'min_samples_leaf':[3,5,7,10,15],'max_features':[0.5,'sqrt','log2'],
'max_depth':[10,15,20],
'class_weight':[{0:1,1:1},{0:1,1:2},{0:1,1:5},'balanced'],
'criterion':['entropy','gini']}
sss = StratifiedShuffleSplit(n_splits=5)
grid = GridSearchCV(RandomForestClassifier(),param_grid,cv=sss,verbose=1,n_jobs=-1,scoring='roc_auc')
grid.fit(X_train,y_train)
The best score was obtained with
rfc = RandomForestClassifier(n_estimators=150, criterion='gini', min_samples_leaf=3,
max_features=0.5, n_jobs=-1, oob_score=True, class_weight={0:1,1:5})
trained on the whole X_train and giving classification report on the test set
              precision    recall  f1-score   support
           0     0.9397    0.9759    0.9575      5189
           1     0.7329    0.5135    0.6039       668
   micro avg     0.9232    0.9232    0.9232      5857
   macro avg     0.8363    0.7447    0.7807      5857
weighted avg     0.9161    0.9232    0.9171      5857
With the sampling methods I got similar results, but no better ones. Precision went down with the undersampling and I got almost the same result with the oversampling.
For undersampling:
              precision    recall  f1-score   support
           0     0.9532    0.9310    0.9420      5189
           1     0.5463    0.6452    0.5916       668
For SMOTE:
              precision    recall  f1-score   support
           0     0.9351    0.9794    0.9567      5189
           1     0.7464    0.4716    0.5780       668
I played with the parameter class_weights to give more weight to the 1s and also with sample_weight in the fitting process.
I tried to figure out which score to take into account other than accuracy. Running the GridSearchCV to tweak the forests, I used different scores, focusing especially on f1 and roc_auc hoping to decrease the False Negatives. I got great scores with the SMOTE-oversampling, but this model did not generalize well on the test set. I wasn't able to understand how to change the splitting criterion or the scoring for the random forest in order to lower the number of False Negatives and increase the Recall for the 1s. I saw that cohen_kappa_score is also useful for imbalanced datasets, but it cannot be used in cross validation methods of sklearn like GridSearch.
I also tried selecting only the most important features, but this did not improve the result; on the contrary, it got worse. I noticed that the feature importances obtained from training a RF after SMOTE were completely different from those obtained on the normal sample.
I don't know exactly what to do with the oob_score other than considering it as a free validation score obtained when training the forests. With the oversampling I get the highest oob_score = 0.9535 but this is kind of natural since the training set is in this case balanced, the problem is still that it does not generalize well to the test set.
Right now I ran out of ideas, so I would like to know if I'm missing something or doing something wrong. Or should I just try another model instead of Random Forest?
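On the remark above that cohen_kappa_score cannot be used with GridSearchCV: scikit-learn's make_scorer can wrap a metric function so that it is accepted as the scoring argument. The sketch below shows this on synthetic data with a deliberately tiny grid; the dataset, grid, and fold settings are illustrative assumptions, not the asker's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (about 1:10), standing in for the hiring dataset.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Wrap the metric so GridSearchCV can call it on each validation fold.
kappa_scorer = make_scorer(cohen_kappa_score)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid={"class_weight": [None, "balanced", {0: 1, 1: 5}]},
    scoring=kappa_scorer,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

The same pattern should work for other metric functions that take (y_true, y_pred), so the search can optimize whichever score reflects the cost of false negatives best.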

Why cross validation result shows high accuracy while there is overfitting?

I'm using a random forest for a binary classification problem. The training set contains 70k samples of class "0" and only 3k of class "1". In addition, the predictions on X_test should contain the same number of "0" and "1".
clf = RandomForestClassifier(random_state=1, n_estimators=350, min_samples_split=6, min_samples_leaf=2)
scores = cross_validation.cross_val_score(clf, x_train, y_train, cv=cv)
print("Accuracy (random forest): {}+/-{}".format(scores.mean(), scores.std()))
Accuracy (random forest): 0.960755941369+/-1.40500919606e-06
clf.fit(x_train, y_train)
prediction_final = clf.predict(X_test) # this return Target values: 76k Zeroes and only 15 ones
#x_test is 10% of x_train set
preds_test = clf.predict(x_test)
print "precision_score", precision_score(y_test, preds_test)
print "recall_score", recall_score(y_test, preds_test)
precision_score 0.0;
recall_score 0.0
confusion_matrix [[7279 1]
[ 322 0]]
As far as I can see, there is an overfitting problem, but why doesn't cross-validation detect it? Even the standard deviation is very low. So how can I fix this problem?
P.S. I've tried taking 3k rows with "0" and 3k with "1" as the training set; the model is much better, but this is not a solution.

(Overall) Accuracy is a nearly useless measure for unbalanced data sets like yours, since it computes the percentage of correct predictions. In your case, imagine a classifier that would learn nothing, but just always predict "0". Since you have 70k zeroes and only 3k ones, that classifier would reach an accuracy score of 70/73 = 95.9%.
Inspecting the Confusion Matrix is often helpful for disclosing such a "classifier".
Thus, you should definitely use another measure to quantify classification quality. Average Accuracy would be an option, since it computes the average accuracy over all classes. In the case of binary classification, it is also called Balanced Accuracy and is computed as (TP/P + TN/N)/2, so that the classifier imagined above, which always predicts "0", would only score (100% + 0%) / 2 = 50%. When this answer was written, that measure did not seem to be implemented in scikit-learn; later releases provide it as sklearn.metrics.balanced_accuracy_score. Though you could implement such a scoring function yourself, it will probably be easier and faster to use one of the predefined scorers.
For example, you could compute the F1 Score instead of Accuracy by passing scoring = 'f1' to cross_validation.cross_val_score. The F1 Score takes both precision and recall into account.
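The arithmetic for the always-"0" classifier can be checked in a few lines of plain Python. The counts are the 70k/3k figures from the question; the variable names are just for illustration.

```python
# Class sizes from the question: 70k zeroes, 3k ones.
n_neg, n_pos = 70_000, 3_000

# A "classifier" that always predicts 0:
tn, fp = n_neg, 0   # every negative is correct, no false alarms
tp, fn = 0, n_pos   # every positive is missed

accuracy = (tp + tn) / (n_neg + n_pos)
balanced_accuracy = (tp / n_pos + tn / n_neg) / 2

print(f"accuracy:          {accuracy:.3f}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
```

The overall accuracy looks excellent even though not a single positive sample is found, while the balanced accuracy sits at chance level.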

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I'm working in a sentiment analysis problem the data looks like this:
label instances
5 1190
4 838
3 239
1 204
2 127
So my data is unbalanced, since 1190 instances are labeled with 5. For the classification I'm using scikit-learn's SVC. The problem is that I do not know how to balance my data in the right way in order to accurately compute the precision, recall, accuracy and f1-score for the multiclass case. So I tried the following approaches:
First:
wclf = SVC(kernel='linear', C= 1, class_weight={1: 10})
wclf.fit(X, y)
weighted_prediction = wclf.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction,average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction,
average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction,
average='weighted')
print '\n clasification report:\n', classification_report(y_test, weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, weighted_prediction)
Second:
auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)
print 'F1 score:', f1_score(y_test, auto_weighted_prediction,
average='weighted')
print 'Recall:', recall_score(y_test, auto_weighted_prediction,
average='weighted')
print 'Precision:', precision_score(y_test, auto_weighted_prediction,
average='weighted')
print '\n clasification report:\n', classification_report(y_test,auto_weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, auto_weighted_prediction)
Third:
clf = SVC(kernel='linear', C= 1)
clf.fit(X, y)
prediction = clf.predict(X_test)
from sklearn.metrics import precision_score, \
recall_score, confusion_matrix, classification_report, \
accuracy_score, f1_score
print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test,prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)
F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
0.930416613529
However, I'm getting warnings like this:
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with
multiclass or multilabel data or pos_label=None will result in an
exception. Please set an explicit value for `average`, one of (None,
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for
instance, scoring="f1_weighted" instead of scoring="f1"
How can I deal correctly with my unbalanced data in order to compute in the right way classifier's metrics?

I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you so I am going to cover different topics, bear with me ;).
Class weights
The weights from the class_weight parameter are used to train the classifier.
They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.
Basically in every scikit-learn classifier, the class weights are used to tell your model how important a class is. That means that during the training, the classifier will make extra efforts to classify properly the classes with high weights.
How they do that is algorithm-specific. If you want details about how it works for SVC and the doc does not make sense to you, feel free to mention it.
The metrics
Once you have a classifier, you want to know how well it is performing.
Here you can use the metrics you mentioned: accuracy, recall_score, f1_score...
Usually when the class distribution is unbalanced, accuracy is considered a poor choice as it gives high scores to models which just predict the most frequent class.
I will not detail all these metrics but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this print of a classification report they are defined for each class. They rely on concepts such as true positives or false negative that require defining which class is the positive one.
             precision    recall  f1-score   support
          0       0.65      1.00      0.79        17
          1       0.57      0.75      0.65        16
          2       0.33      0.06      0.10        17
avg / total       0.52      0.60      0.51        50
The warning
F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The
default `weighted` averaging is deprecated, and from version 0.18,
use of precision, recall or F-score with multiclass or multilabel data
or pos_label=None will result in an exception. Please set an explicit
value for `average`, one of (None, 'micro', 'macro', 'weighted',
'samples'). In cross validation use, for instance,
scoring="f1_weighted" instead of scoring="f1".
You get this warning because you are using the f1-score, recall and precision without defining how they should be computed!
The question could be rephrased: from the above classification report, how do you output one global number for the f1-score?
You could:
Take the average of the f1-score for each class: this is called macro averaging. (The avg / total row above is weighted by support, but the classes in that report are almost equally sized, so the two nearly coincide.)
Compute the f1-score using the global count of true positives / false negatives, etc. (you sum the number of true positives / false negatives for each class). Aka micro averaging.
Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class in the computation.
These are 3 of the options in scikit-learn, the warning is there to say you have to pick one. So you have to specify an average argument for the score method.
Which one you choose is up to how you want to measure the performance of the classifier: for instance macro-averaging does not take class imbalance into account and the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging however you'll get more importance for the class 5.
The whole argument specification in these metrics is not super-clear in scikit-learn right now, it will get better in version 0.18 according to the docs. They are removing some non-obvious standard behavior and they are issuing warnings so that developers notice it.
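To see how the three averaging options differ, here is a plain-Python sketch on a tiny made-up label set where the majority class is predicted well and the rarest class is always missed. The helper functions are written out by hand for illustration; in scikit-learn the same three numbers correspond to average='macro', 'micro' and 'weighted'.

```python
from collections import Counter

# Hypothetical labels: class 5 dominates, class 1 is rare and always missed.
y_true = [5] * 6 + [4] * 3 + [1]
y_pred = [5] * 6 + [4, 4, 5] + [5]
labels = sorted(set(y_true))

def counts(label):
    """True positives, false positives, false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

per_class = {label: f1(*counts(label)) for label in labels}
support = Counter(y_true)

# Macro: unweighted mean of the per-class scores.
macro = sum(per_class.values()) / len(labels)
# Weighted: per-class scores weighted by how many true samples each class has.
weighted = sum(support[l] * per_class[l] for l in labels) / len(y_true)
# Micro: pool the counts over all classes, then compute one score.
tp_all, fp_all, fn_all = (sum(c) for c in zip(*(counts(l) for l in labels)))
micro = f1(tp_all, fp_all, fn_all)

print(per_class)
print(f"macro={macro:.3f}  weighted={weighted:.3f}  micro={micro:.3f}")
```

The macro average is pulled down the most by the missed rare class, which is why it is the strictest of the three on imbalanced data.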
Computing scores
Last thing I want to mention (feel free to skip it if you're aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen.
This is extremely important as any score you get on data that was used in fitting the classifier is completely irrelevant.
Here's a way to do it using StratifiedShuffleSplit, which gives you random splits of your data (after shuffling) that preserve the label distribution. The code below uses the sklearn.model_selection API, which replaced sklearn.cross_validation in later releases.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.svm import SVC
# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
# The original snippet never defined svc; a linear SVC is assumed here.
svc = SVC(kernel='linear')
for train_idx, test_idx in sss.split(X, y):
    X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f1_score(y_test, y_pred, average="macro"))
    print(precision_score(y_test, y_pred, average="macro"))
    print(recall_score(y_test, y_pred, average="macro"))

Lots of very detailed answers here, but I don't think you are answering the right questions. As I understand the question, there are two concerns:
How do I score a multiclass problem?
How do I deal with unbalanced data?
1.
You can use most of the scoring functions in scikit-learn with multiclass problems as well as with single-class problems. Ex.:
from sklearn.metrics import precision_recall_fscore_support as score
predicted = [1,2,3,4,5,1,2,1,1,4,5]
y_test = [1,2,3,4,5,1,2,1,1,4,1]
precision, recall, fscore, support = score(y_test, predicted)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
This way you end up with tangible and interpretable numbers for each of the classes.
| Label | Precision | Recall | FScore | Support |
|-------|-----------|--------|--------|---------|
| 1 | 94% | 83% | 0.88 | 204 |
| 2 | 71% | 50% | 0.54 | 127 |
| ... | ... | ... | ... | ... |
| 4 | 80% | 98% | 0.89 | 838 |
| 5 | 93% | 81% | 0.91 | 1190 |
Then...
2.
... you can tell if the unbalanced data is even a problem. If the scoring for the less represented classes (class 1 and 2) are lower than for the classes with more training samples (class 4 and 5) then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread.
However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representative of the data, and hence, the unbalance is a good thing.

Posed question
Responding to the question 'what metric should be used for multi-class classification with imbalanced data': Macro-F1-measure.
Macro Precision and Macro Recall can also be used, but they are not as easily interpretable as for binary classification, they are already incorporated into the F-measure, and extra metrics complicate method comparison, parameter tuning, and so on.
Micro averaging is sensitive to class imbalance: if your method, for example, works well for the most common labels and totally messes up the others, micro-averaged metrics still show good results.
Weighted averaging isn't well suited for imbalanced data, because it weights by label counts. Moreover, it is hard to interpret and unpopular: for instance, there is no mention of such averaging in the following very detailed survey, which I strongly recommend looking through:
Sokolova, Marina, and Guy Lapalme. "A systematic analysis of
performance measures for classification tasks." Information Processing
& Management 45.4 (2009): 427-437.
Application-specific question
However, returning to your task, I'd research 2 topics:
metrics commonly used for your specific task - this lets you (a)
compare your method with others and understand if you are doing something
wrong, and (b) reuse someone else's findings instead of exploring
this by yourself;
cost of different errors of your methods - for
example, the use case of your application may rely on 4- and 5-star
reviews only - in this case, a good metric should count only these 2
labels.
Commonly used metrics.
As I can infer after looking through literature, there are 2 main evaluation metrics:
Accuracy, which is used, e.g. in
Yu, April, and Daryl Chang. "Multiclass Sentiment Prediction using
Yelp Business."
(link) - note that the authors work with almost the same distribution of ratings, see Figure 5.
Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class
relationships for sentiment categorization with respect to rating
scales." Proceedings of the 43rd Annual Meeting on Association for
Computational Linguistics. Association for Computational Linguistics,
2005.
(link)
MSE (or, less often, Mean Absolute Error - MAE) - see, for example,
Lee, Moontae, and R. Grafe. "Multiclass sentiment analysis with
restaurant reviews." Final Projects from CS N 224 (2010).
(link) - they explore both accuracy and MSE, considering the latter to be better
Pappas, Nikolaos, Rue Marconi, and Andrei Popescu-Belis. "Explaining
the Stars: Weighted Multiple-Instance Learning for Aspect-Based
Sentiment Analysis." Proceedings of the 2014 Conference on Empirical
Methods In Natural Language Processing. No. EPFL-CONF-200899. 2014.
(link) - they utilize scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can't find it, so if you need it, write a letter to the authors, the work is pretty new and seems to be written in Python.
Cost of different errors.
If you care more about avoiding gross blunders, e.g. assigning 1 star to a 5-star review or something like that, look at MSE;
if difference matters, but not so much, try MAE, since it doesn't square diff;
otherwise stay with Accuracy.
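A tiny worked example of how the three metrics rank errors differently, using made-up star ratings (not from any of the cited papers). Model A is right more often but makes one gross blunder; model B makes two off-by-one mistakes.

```python
# Hypothetical true star ratings and two sets of predictions.
y_true = [5, 5, 4, 1, 3]
pred_a = [1, 5, 4, 1, 3]  # one gross blunder: a 5-star review rated 1 star
pred_b = [4, 5, 4, 2, 3]  # two off-by-one errors

def accuracy(t, p):
    return sum(a == b for a, b in zip(t, p)) / len(t)

def mae(t, p):
    return sum(abs(a - b) for a, b in zip(t, p)) / len(t)

def mse(t, p):
    return sum((a - b) ** 2 for a, b in zip(t, p)) / len(t)

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(f"model {name}: accuracy={accuracy(y_true, pred):.2f} "
          f"MAE={mae(y_true, pred):.2f} MSE={mse(y_true, pred):.2f}")
```

Accuracy prefers model A, while MAE and especially MSE prefer model B, because squaring the 4-star miss makes the single blunder dominate the score.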
About approaches, not metrics
Try regression approaches, e.g. SVR, since they generally outperform multiclass classifiers like SVC or OVA SVM.

First of all, it's a little bit hard to tell from counts alone whether your data is unbalanced or not. For example: is 1 positive observation in 1000 just noise, an error, or a breakthrough in science? You never know.
So it's always better to use all your available knowledge and decide its status wisely.
Okay, what if it's really unbalanced?
Once again, look at your data. Sometimes you can find one or two observations duplicated hundreds of times. Sometimes it's useful to create such fake one-class observations.
If all the data is clean, the next step is to use class weights in the prediction model.
So what about multiclass metrics?
In my experience none of your metrics is commonly used. There are two main reasons.
First: it's always better to work with probabilities than with hard predictions (because how else could you distinguish models with 0.9 and 0.6 predicted probability if they both give you the same class?)
And second: it's much easier to compare your prediction models and build new ones when relying on only one good metric.
From my experience I can recommend logloss or MSE (mean squared error).
How to fix sklearn warnings?
Simply (as yangjie noticed) override the average parameter with one of these
values: 'micro' (calculate metrics globally), 'macro' (calculate metrics for each label and take their unweighted mean) or 'weighted' (same as macro, but weighted by the support of each label).
f1_score(y_test, prediction, average='weighted')
All your warnings came from calling the metric functions without an explicit average value; the default is inappropriate for multiclass prediction.
Good luck and have fun with machine learning!
Edit:
I found another answerer's recommendation to switch to regression approaches (e.g. SVR), with which I cannot agree. As far as I remember there isn't even such a thing as multiclass regression. Yes, there is multilabel regression, which is far different, and yes, in some cases it's possible to switch between regression and classification (if the classes are somehow ordered), but it's pretty rare.
What I would recommend (in the scope of scikit-learn) is to try other very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors and many more.
After that you can calculate the arithmetic or geometric mean of the predictions, and most of the time you'll get an even better result.
final_prediction = (KNNprediction * RFprediction) ** 0.5
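A sketch of how the geometric-mean line above could look in practice, assuming the predictions are class-probability arrays (e.g. what predict_proba returns) rather than hard labels, since averaging label numbers would not be meaningful. The probability values below are invented for illustration.

```python
import numpy as np

# Hypothetical class probabilities from two models
# (rows: samples, columns: classes), e.g. outputs of predict_proba.
knn_proba = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3],
                      [0.1, 0.2, 0.7]])
rf_proba = np.array([[0.5, 0.4, 0.1],
                     [0.3, 0.3, 0.4],
                     [0.2, 0.1, 0.7]])

# Element-wise geometric mean, then renormalize each row to sum to 1.
geo = np.sqrt(knn_proba * rf_proba)
geo /= geo.sum(axis=1, keepdims=True)

# Final class per sample: the column with the highest combined probability.
final_prediction = geo.argmax(axis=1)
print(final_prediction)
```

The geometric mean punishes disagreement more than the arithmetic mean does: a class needs reasonable support from both models to win.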
