Random Forest and Imbalance

Random Forest and Imbalance - python

I'm working on a dataset of around 20000 rows.
The aim is to predict whether a person has been hired by a company or not given some features like gender, experience, application date, test score, job skill, etc. The dataset is imbalanced: the classes are either '1' or '0' (hired / not hired) with ratio 1:10.
I chose to train a Random Forest Classifier to work on this problem.
I splitted the dataset 70%-30% randomly into a training set and a test set.
After careful reading of the different options to tackle the imbalance problem (e.g. Dealing with the class imbalance in binary classification, Unbalanced classification using RandomForestClassifier in sklearn) I got stuck on getting a good score on my test set.
I tried several things:
I trained three different random forests on the whole X_train, on undersampled training X_und and on oversampled X_sm respectively. X_und was generated by simply cutting down at random the rows of X_train labelled by 0 to get 50-50, 66-33 or 75-25 ratios of 0s and 1s; X_sm was generated by SMOTE.
Using scikit-learn GridSearchCV i tweaked the three models to get the best parameters:
param_grid = {'min_samples_leaf':[3,5,7,10,15],'max_features':[0.5,'sqrt','log2'],
'max_depth':[10,15,20],
'class_weight':[{0:1,1:1},{0:1,1:2},{0:1,1:5},'balanced'],
'criterion':['entropy','gini']}
sss = StratifiedShuffleSplit(n_splits=5)
grid = GridSearchCV(RandomForestClassifier(),param_grid,cv=sss,verbose=1,n_jobs=-1,scoring='roc_auc')
grid.fit(X_train,y_train)
The best score was obtained with
rfc = RandomForestClassifier(n_estimators=150, criterion='gini', min_samples_leaf=3,
max_features=0.5, n_jobs=-1, oob_score=True, class_weight={0:1,1:5})
trained on the whole X_train and giving classification report on the test set
precision recall f1-score support
0 0.9397 0.9759 0.9575 5189
1 0.7329 0.5135 0.6039 668
micro avg 0.9232 0.9232 0.9232 5857
macro avg 0.8363 0.7447 0.7807 5857
weighted avg 0.9161 0.9232 0.9171 5857
With the sampling methods I got similar results, but no better ones. Precision went down with the undersampling and I got almost the same result with the oversampling.
For undersampling:
precision recall f1-score support
0 0.9532 0.9310 0.9420 5189
1 0.5463 0.6452 0.5916 668
For SMOTE:
precision recall f1-score support
0 0.9351 0.9794 0.9567 5189
1 0.7464 0.4716 0.5780 668
I played with the parameter class_weights to give more weight to the 1s and also with sample_weight in the fitting process.
I tried to figure out which score to take into account other than accuracy. Running the GridSearchCV to tweak the forests, I used different scores, focusing especially on f1 and roc_auc hoping to decrease the False Negatives. I got great scores with the SMOTE-oversampling, but this model did not generalize well on the test set. I wasn't able to understand how to change the splitting criterion or the scoring for the random forest in order to lower the number of False Negatives and increase the Recall for the 1s. I saw that cohen_kappa_score is also useful for imbalanced datasets, but it cannot be used in cross validation methods of sklearn like GridSearch.
Select only the most important features, but this did not change the result, on the contrary it got worse. I remarked that feature importance obtained from training a RF after SMOTE was completely different from the normal sample one.
I don't know exactly what to do with the oob_score other than considering it as a free validation score obtained when training the forests. With the oversampling I get the highest oob_score = 0.9535 but this is kind of natural since the training set is in this case balanced, the problem is still that it does not generalize well to the test set.
Right now I ran out of ideas, so I would like to know if I'm missing something or doing something wrong. Or should I just try another model instead of Random Forest?

Related

How do I properly compare performance of machine learning models, in the case of a multi-output classification problem?

TL,DR: I'm looking for a good way to compare the output of different scikit learn ML models on a multi-output classification problem: labelling social media messages according to the different disaster response categories they might fall into. I'm currently just using precision_recall_fscore_support on each label and then averaging the results, but I'm not convinced that this is a good solution.
In detail: As part of an exercise I'm doing for an online data science course, I'm looking at a dataset of social media messages that occurred during natural disasters. The goal of the exercise is to train a machine learning model to classify these messages according to the various emergency departments they relate to, such as: aid_related, medical_help, weather_related, floods, etc...
So for example the following message: "UN reports Leogane 80-90 destroyed. Only Hospi..." is classed in my training data as 'medical_products', 'aid_related' and 'request'.
I've started off using scikit-learn's KNeighborsClassifier, and MultiOutputClassifier. I'm also using gridsearch to compare parameters inside the model:
pipeline = pipeline = Pipeline([
('vect', CountVectorizer(tokenizer=tokenize)),
('tfidf', TfidfTransformer()),
('clf', MultiOutputClassifier(KNeighborsClassifier()))
])
parameters = { 'clf__estimator__n_neighbors': [5, 7]}
cv = GridSearchCV(pipeline, parameters)
cv.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
When I finally (it takes forever just with two parameteres to compare) get the model output, I've written the following function to pull out a matrix with the average precision, recall and fscore for each column:
def classify_model_output(y_test, y_pred):
classification_scores = []
for i, column in enumerate(y_test.columns): classification_scores.append(precision_recall_fscore_support(y_test[column], y_pred[:, i]))
df_classification = pd.DataFrame(classification_scores)
df_classification.columns = ['precision', 'recall', 'fscore', 'support']
df_classification.set_index(y_test.columns, inplace=True)
# below loop splits the precision, recall and f-score columns into two, one for negatives and one for positives (0 and 1)
for column in df_classification.columns:
column_1 = df_classification[column].apply(lambda x: x[0]).rename(column+str(0), inplace=True)
column_2 = df_classification[column].apply(lambda x: x[1]).rename(column+str(1), inplace=True)
df_classification.drop([column], axis=1, inplace=True)
df_classification = pd.concat([df_classification, column_1, column_2], axis=1)
# finally, take the average of the dataframe to get a classifier for the model
df_classification_avg = df_classification.mean(axis=0)
return df_classification_avg
The df_classification table which looks like this (top 5 rows):
And here's what I get when I compare the average classification tables (produced by the previous method) for knn with 5 neighbors (avg_knn), knn with 7 neighbors (knn_avg_2), and random forest (rf) - yellow cells represent the max for that row:
But I'm not sure how to interpret this. One the face of it it looks like Random Forest (rf) performed best. But I'm not sure if this is the best way to achieve this, or if using the average even makes sense here.
Does anyone have any advice on the best way to accurately and transparently compare my models, in the case of a multioutput problem like this one?
Edit: updated code block with easier to read function, and added comparison of three models

If your training data is not biased towards any particular output label, then you can go for your accuracy score. i.e. corresponding all labels, there is balanced amount of training data.
However if your data is imbalanced i.e. training data is more towards one or two particular output label then go for precision and recall.
Now between precision and recall , the choice depends on your need . If you are not much considered about accuracy go for recall e.g. on airport there is minimum chance that any bomb would be recovered from luggage, but you check all bags. That is recall.
When you are more considered about how much correct predictions are done from a sample , go for precision.

F1 score reduced after using class weight

I am working on a multi class classification use case and the data is highly imbalanced. By highly imbalanced data I mean that there is a huge difference between class with maximum frequency and the class with minimum frequency. So if I go ahead using SMOTE oversampling then the data size increases tremendously (data size goes from 280k rows to more than 25 billion rows because the imbalance is too high) and it becomes practically impossible to fit a ML model to such a huge dataset. Similarly I can't use undersampling as that would lead to loss of information.
So I thought of using compute_class_weight from sklearn while creating a ML model.
Code:
from sklearn.utils.class_weight import compute_class_weight
class_weight = compute_class_weight(class_weight='balanced',
classes=np.unique(train_df['Label_id']),
y=train_df['Label_id'])
dict_weights = dict(zip(np.unique(train_df['Label_id']), class_weight))
svc_model = LinearSVC(class_weight=dict_weights)
I did predictions on test data and noted the result of metrics like accuracy, f1_score, recall etc.
I tried to replicate the same but by not passing class_weight, like this:
svc_model = LinearSVC()
But the results I obtained were strange. The metrics after passing class_weight were a bit poor than the metrics without class_weight.
I was hoping for the exact opposite as I am using class_weight to make the model better and hence the metrics.
The difference between metrics for both the models was minimal but f1_score was less for model with class_weight as compared to model without class_weight.
I also tried the below snippet:
svc_model = LinearSVC(class_weight='balanced')
but still the f1_score was less as compared to model without class_weight.
Below are the metrics I obtained:
LinearSVC w/o class_weight
Accuracy: 89.02, F1 score: 88.92, Precision: 89.17, Recall: 89.02, Misclassification error: 10.98
LinearSVC with class_weight=’balanced’
Accuracy: 87.98, F1 score: 87.89, Precision: 88.3, Recall: 87.98, Misclassification error: 12.02
LinearSVC with class_weight=dict_weights
Accuracy: 87.97, F1 score: 87.87, Precision: 88.34, Recall: 87.97, Misclassification error: 12.03
I assumed that using class_weight would improve the metrics but instead its deteriorating the metrics. Why is this happening and what should I do? Will it be okay if I don't handle imbalance data?

It's not always guaranteed that if you use class_weight the performance will improve always. There is always some uncertainty involved when we're working with stochastic systems.
You can try with class_weight = 'auto'. Here's a discussion: https://github.com/scikit-learn/scikit-learn/issues/4324
Finally, you seem to use default hyperparameter of linear SVM, meaning C=1 and; I would suggest experimenting with the hyparameters, even do a grid-search if possible to test, if still the class_weight decreases performance, try data-normalization.

How I see the Problem
My understanding of your problem is that your class weight approach IS actually improving your model, but you don't see it (probably). Here is why:
Assume you have 10 POS and 1k NEG samples, and you have two models: M-1 predicts all of the NEG samples correctly (false negative rate = 0) but predicts only 2 out of 10 POS samples correctly. M-2 predicts 700 of the NEG and 8 POS samples correctly. From the anomaly detection point of view, second model might be preferred, while the first model (which clearly has trapped in the imbalance issue) has higher f1 score.
class weights will attempt to solve your imbalance issue, shifting your model from M-1 to M-2. Thus your f1 score might decrease slightly. but you might have a model of better quality.
How You can Validate my opinion
You can check my point by looking at the confusion matrix to see if the f1 score has been decreased due to more misclassification of your major class, and if your minor class is now having more true positives. Plus, you can test other metrics specifically for imbalance classes. I know of Cohen's Kappa Maybe you see that class weights actually increase the Kappa score.
And one more thing: do some bootstrapping or cross validation, the change in f1 score might be due to data variability and mean nothing

How to use trained model for another dataset in Sklearn's Random Forest Classifier?

Update : Some "Terminology "
Sample : row
features: columns
'labels' : classes for the prediction (one column among the features).
Basically I wonder :
I have dataset1 and dataset2 identical in terms of shape and size. After training and testing with dataset1, I am using this model to predict dataset2. (Number of features are also same).
If I predict all the items in the dataset2 , accuracy is close to dataset1 test results. But If I pick 1 item for each class from dataset2, accuracy is around 30%. How it is possible that full dataset2 accuracy is drastically different than the "subsampled" dataset2?
I am using RandomForestClassifier.
I have a data set with 200K sample (rows) having around 90 classes. After Training and testing, accuracy is high enough (around ~96%).
Now since I have a trained model, I am using another different database (again with 200 K samples and 90 classes) to make predictions.
If I submit all samples from this second database, accuracy is close enough to training accuracy (around ~92% ).
But If I select 90 samples (one from each class) from this second database accuracy is not what I have expected. (around ~30%)
.... data preprocessing is done.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(n_estimators=nestimators, bootstrap=False,
class_weight=None, criterion='entropy',
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_weight_fraction_leaf=0.0, n_jobs=6,
oob_score=False, random_state=np.random.seed(1234), verbose=0, warm_start=False)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
and accuracy is around ~96%.
Now I am using this trained model with a new database (identical in shape) :
df2=pd.read_csv("newdata.csv", low_memory=False, skipinitialspace=True, na_filter=False)
features=['col1','col2','col3','col4']
Xnew=df2[features].values
ynew=df2['labels'].values # Labels
y_prednew=clf.predict(Xnew)
Accuracy is above ~90%. Close to first database accuracy. But
If I filter this new data set for 1 sample for each class with this :
df2=pd.read_csv("newdata.csv", low_memory=False, skipinitialspace=True, na_filter=False)
samplesize=1
df2=df2.sample(frac=1)
df2=df2.groupby('labels')
df2=df2.head(samplesize).reset_index(drop=True)
features=['col1','col2','col3','col4']
Xnew=df2[features].values
ynew=df2['labels'].values # Labels
y_prednew=clf.predict(Xnew)
...
accuracy is ~35%. But if I do not filter this new data and submit it to the model accuracy is above ~90%.
First and second data sets are identical in term of shape. If I give all samples from the second data set to this trained model, accurayc is close to the first dataset test results. But If I filter it for 1 sample from each class, accuracy is ~30%.
I dont know where did I make mistake.

Generally the code seems OK. It's hard to know but I would hazard a guess that the classes aren't equally represented in the dataset (at least the second, perhaps also the first), and the more dominant classes are more accurately identified.
The classic example is some extremely imbalanced binary classification task where 99% of the samples are positive. By always predicting positive you can get 99% accuracy, but a sample of 1 datapoint for each class would have 50% accuracy (and while out-of-context the accuracy might seem good, the model isn't very useful).
I would recommend examining the class frequencies, and also using other metrics (see precision, recall and f1) with the appropriate average parameter to more accurately assess your model's performance.
To summarise, a 90%+ accuracy on the entire dataset and 30% accuracy on a sample of 1 datapoint for each class aren't necessarily conflicting, e.g. if the classes aren't balanced in the dataset.
Edit: In short what I'm trying to say is that it could be you're experiencing the Accuracy Paradox.

Why cross validation result shows high accuracy while there is overfitting?

I'm using random tree algo for binary classification problem. Training set contains 70k values as "0" class and only 3k as "1". In addition, result of predicting on X_test should give same amount of "0" and "1".
clf = RandomForestClassifier(random_state=1, n_estimators=350, min_samples_split=6, min_samples_leaf=2)
scores = cross_validation.cross_val_score(clf, x_train, y_train, cv=cv)
print("Accuracy (random forest): {}+/-{}".format(scores.mean(), scores.std()))
Accuracy (random forest): 0.960755941369/1.40500919606e-06
clf.fit(x_train, y_train)
prediction_final = clf.predict(X_test) # this return Target values: 76k Zeroes and only 15 ones
#x_test is 10% of x_train set
preds_test = clf.predict(x_test)
print "precision_score", precision_score(y_test, preds_final)
print "recall_score", recall_score(y_test, preds_final)
precision_score 0.0;
recall_score 0.0
confusion_matrix [[7279 1]
[ 322 0]]
As far I can see, there is the overfitting problem, but why doesn't cross validation detect it? Even standard deviation is very low. So how can I fix that problem?
P.S. I've tried to take 3k rows with "0" and 3k with "1" - as training set, model is much better, but this is not solution.

(Overall) Accuracy is a nearly useless measure for unbalanced data sets like yours, since it computes the percentage of correct predictions. In your case, imagine a classifier that would learn nothing, but just always predict "0". Since you have 70k zeroes and only 3k ones, that classifier would reach an accuracy score of 70/73 = 95.9%.
Inspecting the Confusion Matrix is often helpful for disclosing such a "classifier".
Thus, you should definitely use another measure to quantify classification quality. Average Accuracy would be an option, since it computes the average accuracy over all classes. In the case of binary classification, it is also called Balanced Accuracy and results in computing (TP/P + TN/N)/2, so that the classifier imagined above, which always predicts "0", would only score (100% + 0%) / 2 = 50%. However, that measure seems to be not implemented in scikit-learn. Though you could implement such a scoring function by yourself, it will probably be easier and faster to use one of the other predefined scorers.
For example, you could compute the F1 Score instead of Accuracy by passing scoring = 'f1' to cross_validation.cross_val_score. The F1 Score takes both precision and recall into account.

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I'm working in a sentiment analysis problem the data looks like this:
label instances
5 1190
4 838
3 239
1 204
2 127
So my data is unbalanced since 1190 instances are labeled with 5. For the classification Im using scikit's SVC. The problem is I do not know how to balance my data in the right way in order to compute accurately the precision, recall, accuracy and f1-score for the multiclass case. So I tried the following approaches:
First:
wclf = SVC(kernel='linear', C= 1, class_weight={1: 10})
wclf.fit(X, y)
weighted_prediction = wclf.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, weighted_prediction)
print 'F1 score:', f1_score(y_test, weighted_prediction,average='weighted')
print 'Recall:', recall_score(y_test, weighted_prediction,
average='weighted')
print 'Precision:', precision_score(y_test, weighted_prediction,
average='weighted')
print '\n clasification report:\n', classification_report(y_test, weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, weighted_prediction)
Second:
auto_wclf = SVC(kernel='linear', C= 1, class_weight='auto')
auto_wclf.fit(X, y)
auto_weighted_prediction = auto_wclf.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, auto_weighted_prediction)
print 'F1 score:', f1_score(y_test, auto_weighted_prediction,
average='weighted')
print 'Recall:', recall_score(y_test, auto_weighted_prediction,
average='weighted')
print 'Precision:', precision_score(y_test, auto_weighted_prediction,
average='weighted')
print '\n clasification report:\n', classification_report(y_test,auto_weighted_prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, auto_weighted_prediction)
Third:
clf = SVC(kernel='linear', C= 1)
clf.fit(X, y)
prediction = clf.predict(X_test)
from sklearn.metrics import precision_score, \
recall_score, confusion_matrix, classification_report, \
accuracy_score, f1_score
print 'Accuracy:', accuracy_score(y_test, prediction)
print 'F1 score:', f1_score(y_test, prediction)
print 'Recall:', recall_score(y_test, prediction)
print 'Precision:', precision_score(y_test, prediction)
print '\n clasification report:\n', classification_report(y_test,prediction)
print '\n confussion matrix:\n',confusion_matrix(y_test, prediction)
F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1082: DeprecationWarning: The default `weighted` averaging is deprecated, and from version 0.18, use of precision, recall or F-score with multiclass or multilabel data or pos_label=None will result in an exception. Please set an explicit value for `average`, one of (None, 'micro', 'macro', 'weighted', 'samples'). In cross validation use, for instance, scoring="f1_weighted" instead of scoring="f1".
sample_weight=sample_weight)
0.930416613529
However, Im getting warnings like this:
/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:1172:
DeprecationWarning: The default `weighted` averaging is deprecated,
and from version 0.18, use of precision, recall or F-score with
multiclass or multilabel data or pos_label=None will result in an
exception. Please set an explicit value for `average`, one of (None,
'micro', 'macro', 'weighted', 'samples'). In cross validation use, for
instance, scoring="f1_weighted" instead of scoring="f1"
How can I deal correctly with my unbalanced data in order to compute in the right way classifier's metrics?

I think there is a lot of confusion about which weights are used for what. I am not sure I know precisely what bothers you so I am going to cover different topics, bear with me ;).
Class weights
The weights from the class_weight parameter are used to train the classifier.
They are not used in the calculation of any of the metrics you are using: with different class weights, the numbers will be different simply because the classifier is different.
Basically in every scikit-learn classifier, the class weights are used to tell your model how important a class is. That means that during the training, the classifier will make extra efforts to classify properly the classes with high weights.
How they do that is algorithm-specific. If you want details about how it works for SVC and the doc does not make sense to you, feel free to mention it.
The metrics
Once you have a classifier, you want to know how well it is performing.
Here you can use the metrics you mentioned: accuracy, recall_score, f1_score...
Usually when the class distribution is unbalanced, accuracy is considered a poor choice as it gives high scores to models which just predict the most frequent class.
I will not detail all these metrics but note that, with the exception of accuracy, they are naturally applied at the class level: as you can see in this print of a classification report they are defined for each class. They rely on concepts such as true positives or false negative that require defining which class is the positive one.
precision recall f1-score support
0 0.65 1.00 0.79 17
1 0.57 0.75 0.65 16
2 0.33 0.06 0.10 17
avg / total 0.52 0.60 0.51 50
The warning
F1 score:/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py:676: DeprecationWarning: The
default `weighted` averaging is deprecated, and from version 0.18,
use of precision, recall or F-score with multiclass or multilabel data
or pos_label=None will result in an exception. Please set an explicit
value for `average`, one of (None, 'micro', 'macro', 'weighted',
'samples'). In cross validation use, for instance,
scoring="f1_weighted" instead of scoring="f1".
You get this warning because you are using the f1-score, recall and precision without defining how they should be computed!
The question could be rephrased: from the above classification report, how do you output one global number for the f1-score?
You could:
Take the average of the f1-score for each class: that's the avg / total result above. It's also called macro averaging.
Compute the f1-score using the global count of true positives / false negatives, etc. (you sum the number of true positives / false negatives for each class). Aka micro averaging.
Compute a weighted average of the f1-score. Using 'weighted' in scikit-learn will weigh the f1-score by the support of the class: the more elements a class has, the more important the f1-score for this class in the computation.
These are 3 of the options in scikit-learn, the warning is there to say you have to pick one. So you have to specify an average argument for the score method.
Which one you choose is up to how you want to measure the performance of the classifier: for instance macro-averaging does not take class imbalance into account and the f1-score of class 1 will be just as important as the f1-score of class 5. If you use weighted averaging however you'll get more importance for the class 5.
The whole argument specification in these metrics is not super-clear in scikit-learn right now, it will get better in version 0.18 according to the docs. They are removing some non-obvious standard behavior and they are issuing warnings so that developers notice it.
Computing scores
Last thing I want to mention (feel free to skip it if you're aware of it) is that scores are only meaningful if they are computed on data that the classifier has never seen.
This is extremely important as any score you get on data that was used in fitting the classifier is completely irrelevant.
Here's a way to do it using StratifiedShuffleSplit, which gives you a random splits of your data (after shuffling) that preserve the label distribution.
from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
# We use a utility to generate artificial classification data.
X, y = make_classification(n_samples=100, n_informative=10, n_classes=3)
sss = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=0)
for train_idx, test_idx in sss:
X_train, X_test, y_train, y_test = X[train_idx], X[test_idx], y[train_idx], y[test_idx]
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))

Lot of very detailed answers here but I don't think you are answering the right questions. As I understand the question, there are two concerns:
How to I score a multiclass problem?
How do I deal with unbalanced data?
1.
You can use most of the scoring functions in scikit-learn with both multiclass problem as with single class problems. Ex.:
from sklearn.metrics import precision_recall_fscore_support as score
predicted = [1,2,3,4,5,1,2,1,1,4,5]
y_test = [1,2,3,4,5,1,2,1,1,4,1]
precision, recall, fscore, support = score(y_test, predicted)
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
This way you end up with tangible and interpretable numbers for each of the classes.
| Label | Precision | Recall | FScore | Support |
|-------|-----------|--------|--------|---------|
| 1 | 94% | 83% | 0.88 | 204 |
| 2 | 71% | 50% | 0.54 | 127 |
| ... | ... | ... | ... | ... |
| 4 | 80% | 98% | 0.89 | 838 |
| 5 | 93% | 81% | 0.91 | 1190 |
Then...
2.
... you can tell if the unbalanced data is even a problem. If the scoring for the less represented classes (class 1 and 2) are lower than for the classes with more training samples (class 4 and 5) then you know that the unbalanced data is in fact a problem, and you can act accordingly, as described in some of the other answers in this thread.
However, if the same class distribution is present in the data you want to predict on, your unbalanced training data is a good representative of the data, and hence, the unbalance is a good thing.

Posed question
Responding to the question 'what metric should be used for multi-class classification with imbalanced data': Macro-F1-measure.
Macro Precision and Macro Recall can be also used, but they are not so easily interpretable as for binary classificaion, they are already incorporated into F-measure, and excess metrics complicate methods comparison, parameters tuning, and so on.
Micro averaging are sensitive to class imbalance: if your method, for example, works good for the most common labels and totally messes others, micro-averaged metrics show good results.
Weighting averaging isn't well suited for imbalanced data, because it weights by counts of labels. Moreover, it is too hardly interpretable and unpopular: for instance, there is no mention of such an averaging in the following very detailed survey I strongly recommend to look through:
Sokolova, Marina, and Guy Lapalme. "A systematic analysis of
performance measures for classification tasks." Information Processing
& Management 45.4 (2009): 427-437.
Application-specific question
However, returning to your task, I'd research 2 topics:
metrics commonly used for your specific task - it lets (a) to
compare your method with others and understand if you do something
wrong, and (b) to not explore this by yourself and reuse someone
else's findings;
cost of different errors of your methods - for
example, use-case of your application may rely on 4- and 5-star
reviewes only - in this case, good metric should count only these 2
labels.
Commonly used metrics.
As I can infer after looking through literature, there are 2 main evaluation metrics:
Accuracy, which is used, e.g. in
Yu, April, and Daryl Chang. "Multiclass Sentiment Prediction using
Yelp Business."
(link) - note that the authors work with almost the same distribution of ratings, see Figure 5.
Pang, Bo, and Lillian Lee. "Seeing stars: Exploiting class
relationships for sentiment categorization with respect to rating
scales." Proceedings of the 43rd Annual Meeting on Association for
Computational Linguistics. Association for Computational Linguistics,
2005.
(link)
MSE (or, less often, Mean Absolute Error - MAE) - see, for example,
Lee, Moontae, and R. Grafe. "Multiclass sentiment analysis with
restaurant reviews." Final Projects from CS N 224 (2010).
(link) - they explore both accuracy and MSE, considering the latter to be better
Pappas, Nikolaos, Rue Marconi, and Andrei Popescu-Belis. "Explaining
the Stars: Weighted Multiple-Instance Learning for Aspect-Based
Sentiment Analysis." Proceedings of the 2014 Conference on Empirical
Methods In Natural Language Processing. No. EPFL-CONF-200899. 2014.
(link) - they utilize scikit-learn for evaluation and baseline approaches and state that their code is available; however, I can't find it, so if you need it, write a letter to the authors, the work is pretty new and seems to be written in Python.
Cost of different errors.
If you care more about avoiding gross blunders, e.g. assinging 1-star to 5-star review or something like that, look at MSE;
if difference matters, but not so much, try MAE, since it doesn't square diff;
otherwise stay with Accuracy.
About approaches, not metrics
Try regression approaches, e.g. SVR, since they generally outperforms Multiclass classifiers like SVC or OVA SVM.

First of all it's a little bit harder using just counting analysis to tell if your data is unbalanced or not. For example: 1 in 1000 positive observation is just a noise, error or a breakthrough in science? You never know.
So it's always better to use all your available knowledge and choice its status with all wise.
Okay, what if it's really unbalanced?
Once again — look to your data. Sometimes you can find one or two observation multiplied by hundred times. Sometimes it's useful to create this fake one-class-observations.
If all the data is clean next step is to use class weights in prediction model.
So what about multiclass metrics?
In my experience none of your metrics is usually used. There are two main reasons.
First: it's always better to work with probabilities than with solid prediction (because how else could you separate models with 0.9 and 0.6 prediction if they both give you the same class?)
And second: it's much easier to compare your prediction models and build new ones depending on only one good metric.
From my experience I could recommend logloss or MSE (or just mean squared error).
How to fix sklearn warnings?
Just simply (as yangjie noticed) overwrite average parameter with one of these
values: 'micro' (calculate metrics globally), 'macro' (calculate metrics for each label) or 'weighted' (same as macro but with auto weights).
f1_score(y_test, prediction, average='weighted')
All your Warnings came after calling metrics functions with default average value 'binary' which is inappropriate for multiclass prediction.
Good luck and have fun with machine learning!
Edit:
I found another answerer recommendation to switch to regression approaches (e.g. SVR) with which I cannot agree. As far as I remember there is no even such a thing as multiclass regression. Yes there is multilabel regression which is far different and yes it's possible in some cases switch between regression and classification (if classes somehow sorted) but it pretty rare.
What I would recommend (in scope of scikit-learn) is to try another very powerful classification tools: gradient boosting, random forest (my favorite), KNeighbors and many more.
After that you can calculate arithmetic or geometric mean between predictions and most of the time you'll get even better result.
final_prediction = (KNNprediction * RFprediction) ** 0.5

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.