I'm trying to do multilabel classification with SVM.
I have nearly 8k features and y vectors of length nearly 400. My Y vectors are already binarized, so I didn't use MultiLabelBinarizer(), but when I run it on the raw form of my Y data it still gives the same result.
I'm running this code:
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.genfromtxt('data_X', delimiter=";")
Y = np.genfromtxt('data_y', delimiter=";")

training_X = X[:2600, :]        # first 2600 rows for training
training_y = Y[:2600, :]
test_sample = X[2600:2601, :]   # a single held-out sample
test_result = Y[2600:2601, :]

classif = OneVsRestClassifier(SVC(kernel='rbf'))
classif.fit(training_X, training_y)
print(classif.predict(test_sample))
print(test_result)
After the fitting process, when it comes to the prediction part, it warns "Label not x is present in all training examples" (where x is a few different numbers within the range of my y vector length, which is 400). After that it outputs a predicted y vector that is always the zero vector of length 400.
I'm new to scikit-learn and to machine learning, and I couldn't figure out the problem here. What is going wrong, and what should I do to fix it?
Thanks.
There are 2 problems here:
1) The missing label warning
2) You are getting all 0's for predictions
The warning means that some of your classes are missing from the training data. This is a common problem. If you have 400 classes, then some of them must only occur very rarely, and on any split of the data, some classes may be missing from one side of the split. There may also be classes that simply don't occur in your data at all. You could try Y.sum(axis=0).all() and if that is False, then some classes do not occur even in Y. This all sounds horrible, but realistically, you aren't going to be able to correctly predict classes that occur 0, 1, or any very small number of times anyway, so predicting 0 for those is probably about the best you can do.
As for the all-0 predictions, I'll point out that with 400 classes, probably all of your classes occur much less than half the time. You could check Y.mean(axis=0).max() to get the highest label frequency. With 400 classes, it might only be a few percent. If so, a binary classifier that has to make a 0-1 prediction for each class will probably pick 0 for all classes on all instances. This isn't really an error, it is just because all of the class frequencies are low.
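For example, a quick sketch of both checks, assuming Y is the binarized 400-column label matrix from the question (the cut-off of 5 is just an illustrative choice):

import numpy as np

print(Y.sum(axis=0).all())    # False means some labels never occur in Y at all
print(Y.mean(axis=0).max())   # highest label frequency, e.g. 0.03 means 3%

rare = np.where(Y.sum(axis=0) < 5)[0]
print(len(rare), "labels occur fewer than 5 times")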
If you know that each instance has a positive label (at least one), you could get the decision values (clf.decision_function) and pick the class with the highest one for each instance. You'll have to write some code to do that, though.
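A rough sketch of that idea, reusing the classif object fitted in the question; it assumes predict returns the dense multilabel indicator array it does for dense Y here:

import numpy as np

scores = classif.decision_function(test_sample)   # shape (n_samples, n_classes)
pred = classif.predict(test_sample)               # the usual all-or-nothing 0/1 output

# Force at least one positive label per sample by switching on the class
# with the highest decision value
best = np.argmax(scores, axis=1)
pred[np.arange(pred.shape[0]), best] = 1
print(pred)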
I once had a top-10 finish in a Kaggle contest that was similar to this. It was a multilabel problem with ~200 classes, none of which occurred with even a 10% frequency, and we needed 0-1 predictions. In that case I got the decision values and took the highest one, plus anything that was above a threshold. I chose the threshold that worked the best on a holdout set. The code for that entry is on Github: Kaggle Greek Media code. You might take a look at it.
If you made it this far, thanks for reading. Hope that helps.
I have a pool of samples, s1 to s100, that I want to classify into two different categories, A and B.
In this problem I cannot make predictions for each sample individually, only in groups of 10, and every prediction returns the predicted label and the confidence of that label. Something like:
[s1,s21,s3,s15,s5,s62,s90,s13,s9,s100];A;0.9
[s1,s5,s12,s20,s53,s89,s27,s42,s76,s55];A;0.4
...
Every pool is assembled at random and I can run as many combinations as needed. Also, a sample can appear in more than one pool.
What I would like to accomplish is to rank every sample's importance for each category prediction, using the confidence value.
Searching for similar problems, I ended up thinking that computing Shapley values would be a good solution, but these are usually implemented for features rather than samples.
Any ideas how to implement this?
EDIT:
As suggested, here is a minimal example of the issue, with 4 samples and 2 samples per group:
Sample_group;Prediction;Confidence
[s1,s2];A;0.7
[s3,s4];A;0.6
[s1,s3];A;0.9
[s2,s4];A;0.5
[s1,s4];A;0.7
[s2,s3];A;0.6
Although all pairs give the same prediction, looking at the confidence values shows that the pair [s1,s3] has the highest value and [s2,s4] the lowest. Checking the remaining pairs, one can infer that s1 contributes more to a high confidence than s3 when each is paired with the other two samples. The result, then, should be something like:
Sample;rank
s1;0
s3;1
s2;2
s4;3
First approach:
You can try to reframe your problem: you actually have a model that takes a 100-feature vector and returns a prediction. Each individual feature is boolean (feature i is 1 if sample i is one of the 10 included samples and 0 if not; of course this framework can support any mixture of samples, not just groups of 10).
The fact that your prediction has two components can be dealt with by replacing it with a single value: the confidence multiplied by -1 for prediction A and +1 for prediction B, so that your prediction lies in the range [-1, 1] (where -1 is predicting A with the highest confidence and +1 is predicting B with the highest confidence). That's just one suggestion; there may be other ways of reducing your 2D output to 1D, but this one seems simplest.
Now that you basically have a simple regression model that takes 100 features and returns a single number, you can compute SHAP values for each feature (which in your case translates to the "sample importance", that is, the importance of including the sample for the prediction). As for how to compute the SHAP values, I think that if you implement a class with a .predict method that wraps around your prediction, you could use SHAP's KernelExplainer. Your next problem would be that KernelExplainer gives you SHAP values for each feature for a specific prediction (and runs .predict 100K times to do that, so your method had better be fast). So you might need to do this many times for different sample groups and average the results.
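A very rough sketch of that route; the function my_grouped_predict, the pool construction, and the nsamples budget below are placeholders I'm assuming, not part of any existing API beyond shap.KernelExplainer itself:

import numpy as np
import shap

def predict_signed_confidence(masks):
    # Each row of `masks` is a 0/1 vector of length 100 marking which samples
    # are in the pool; return confidence * (+1 for B, -1 for A) per row.
    # my_grouped_predict is a hypothetical stand-in for your real grouped prediction.
    return np.array([my_grouped_predict(row) for row in masks])

rng = np.random.default_rng(0)
background = np.zeros((50, 100))
for row in background:                         # 50 random legal pools as background data
    row[rng.choice(100, size=10, replace=False)] = 1

explainer = shap.KernelExplainer(predict_signed_confidence, background)
shap_values = explainer.shap_values(background[:10], nsamples=1000)

sample_importance = np.abs(shap_values).mean(axis=0)   # one score per sample s1..s100
ranking = np.argsort(-sample_importance)               # most important samples first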
Second approach:
Another option, which may take more work to implement but is a more direct solution, is to implement your own version of the Shapley computation.
The original game-theory formulation of Shapley values actually seems more in line with your problem than its adaptation for machine learning. That is, if you think of each of your samples as a "contributor" and the final output (reduced to a single number as described above) as the "outcome", then the Shapley formula is exactly intended to estimate the contribution of each contributor being present, across all permutations of the other contributors being present.
In the general case, if you have N contributors there are 2^N combinations of who is present, but in your case you can say that only combinations where exactly 10 of the 100 samples are included are legal. So you can take the Shapley formula and, instead of going over all possible combinations, go over only the legal ones. There are 100 choose 10 of them, which is still a huge number (about 17 trillion), so you'd probably need to sample randomly from them for a reasonable runtime. As far as I understand the idea behind the formula, it will give you exactly what you need.
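For illustration, a Monte Carlo sketch of that sampling idea; signed_confidence is a placeholder for your grouped prediction reduced to one number, and the swap-in/swap-out averaging is only an approximation of the Shapley average restricted to legal size-10 pools:

import numpy as np

N_SAMPLES, POOL_SIZE, N_DRAWS = 100, 10, 2000   # N_DRAWS is a runtime/accuracy trade-off

def signed_confidence(pool):
    # Placeholder: run the grouped prediction on `pool` (array of sample indices)
    # and return confidence * (+1 for B, -1 for A).
    raise NotImplementedError

rng = np.random.default_rng(0)
contribution = np.zeros(N_SAMPLES)
for i in range(N_SAMPLES):
    others = np.setdiff1d(np.arange(N_SAMPLES), [i])
    diffs = []
    for _ in range(N_DRAWS):
        pool_without = rng.choice(others, size=POOL_SIZE, replace=False)
        pool_with = pool_without.copy()
        pool_with[rng.integers(POOL_SIZE)] = i   # swap sample i in for a random member
        diffs.append(signed_confidence(pool_with) - signed_confidence(pool_without))
    contribution[i] = np.mean(diffs)             # average marginal effect of including i

ranking = np.argsort(-np.abs(contribution))      # most influential samples first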
I am training Random Forests with two sets of "true" y values (empirical). I can easily tell which one is better.
However, I was wondering if there is a simple method, other than brute force, to pick the values from each set that would produce the best model. In other words, I would like to automatically mix both y sets to produce a new, ideal one.
Say, for instance, biological activity. Different experiments and different databases provide different values. This is a simple example showing two different sets of y values on columns 3 and 4.
4a50,DQ7,47.6,45.4
3atu,ADP,47.7,30.7
5i9i,5HV,47.7,41.9
5jzn,GUI,47.7,34.2
4bjx,73B,48.0,44.0
4a6c,QG9,48.1,45.5
I know that column 3 is better because I have already trained different models against each of them and also because I checked a few articles to verify which value is correct and 3 is right more often than 4. However, I have thousands of rows and cannot read thousands of papers.
So I would like to know if there is an algorithm that, for instance, would use 3 as a base for the true y values but would pick values from 4 when the model improves by so doing.
It would be useful if it could also report the final y column and be able to use more than 2 sets, but I think I can figure that out.
The idea now is to find out if there is already a solution out there so that I don't need to reinvent the wheel.
Best,
Miro
NOTE: The features (x) are in a different file.
The problem is that an algorithm alone doesn't know which label is better.
What you could do: train a classifier on data which you know is correct, use the classifier to predict a value for each data point, then compare this value to the two lists of labels you already have and choose the label that is closer.
This solution obviously isn't perfect, since the result depends on the quality of the classifier that predicts the value, and you still need enough correctly labeled data to train it. Additionally, there is a chance that the classifier itself predicts a better value than either of your two lists of labels.
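A minimal sketch of that idea, under the assumption that X holds the features, y3 and y4 the two candidate label columns, and trusted is a boolean mask of rows you have verified by hand; since the activity values in the question look continuous, a regressor is used here instead of a classifier:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[trusted], y3[trusted])        # train only on rows whose labels you trust

pred = model.predict(X)                   # the model's own estimate for every row
# keep whichever candidate value is closer to that estimate
y_mixed = np.where(np.abs(y3 - pred) <= np.abs(y4 - pred), y3, y4)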
Use columns 3 and 4 together as the target (y) values when fitting a Random Forest model, and then predict both with the same model. That way the algorithm keeps track of both Y values and their correlation with the predicted values. Your problem seems to be a multi-output classification problem, where there are multiple target variables (multiple y values), as you suggest.
Random Forest supports this multi-output classification directly: the fit(X, y) method accepts y as array-like with shape = [n_samples, n_outputs]. See the sketch after the links below.
multioutput-classification
sklearn.ensemble.RandomForestClassifier.fit
Check multi-class and multi-output classification
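For illustration, a minimal multi-output sketch; X, y3, and y4 are assumed names, and a regressor is used because the activity values in the question look continuous, though RandomForestClassifier accepts the same y shape for discrete labels:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

Y = np.column_stack([y3, y4])             # shape (n_samples, 2): one column per target set

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, Y)                           # fit(X, y) accepts y of shape [n_samples, n_outputs]

pred = model.predict(X)                   # shape (n_samples, 2), one prediction per output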
I had trained my model with the KNN classification algorithm and was getting around 97% accuracy. However, I later noticed that I had forgotten to normalise my data, so I normalised it and retrained the model, and now I am getting an accuracy of only 87%. What could be the reason? Should I stick with the unnormalised data or switch to the normalised version?
To answer your question, you first need to understand how KNN works. Here is a simple diagram:
Suppose the ? is the point you are trying to classify as either red or blue. For this case, let's assume you haven't normalized any of the data. As you can clearly see, the ? is closer to more red dots than blue dots, so this point would be classified as red. Let's also assume the correct label is red, so this is a correct match!
Now, to discuss normalization. Normalization is a way of taking data that is slightly dissimilar and putting it on a common scale (in your case, think of it as making the features more comparable). Assume in the above example that you normalize the ?'s features, and as a result its y value becomes smaller. This would place the question mark below its current position, surrounded by more blue dots. Your algorithm would then label it as blue, and that would be incorrect. Ouch!
Now to answer your questions: sorry, but there is no single answer! Sometimes normalizing data removes important feature differences, causing accuracy to go down. Other times, it helps eliminate noise in your features that causes incorrect classifications. Also, just because accuracy goes up for the data set you are currently working with doesn't mean you will get the same results with a different data set.
Long story short, instead of trying to label normalization as good or bad, consider the feature inputs you are using for classification, determine which ones are important to your model, and make sure differences in those features are reflected accurately in your classification model. Best of luck!
That's a pretty good question, and the result is unexpected at first glance, because normalization usually helps a KNN classifier do better. Good KNN performance generally requires preprocessing the data so that all variables are similarly scaled and centered; otherwise KNN will often be inappropriately dominated by scaling factors.
In this case the opposite effect is seen: KNN gets WORSE with scaling, seemingly.
However, what you may be witnessing could be overfitting. The KNN may be overfit, which is to say it memorized the training data very well but does not work well at all on new data. The first model might have memorized more of the data due to some characteristic of that data, but that's not a good thing. You would need to check your prediction accuracy on a different set of data than the one it was trained on, a so-called validation set or test set.
Then you will know whether the KNN accuracy is OK or not.
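For example, a quick sketch of that check, assuming X and y are your features and labels; the split size and k below are illustrative:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

raw = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

# Accuracy on data neither model saw during training
print("unscaled test accuracy:", raw.score(X_te, y_te))
print("scaled test accuracy:  ", scaled.score(X_te, y_te))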
Look into learning curve analysis in the context of machine learning. Please go learn about bias and variance. It's a deeper subject than can be detailed here. The best, cheapest, and fastest sources of instruction on this topic are videos on the web, by the following instructors:
Andrew Ng, in the online Coursera course Machine Learning
Tibshirani and Hastie, in the online Stanford course Statistical Learning.
If you use normalized feature vectors, the distances between your data points are likely to be different than when you used unnormalized features, particularly when the ranges of the features are different. Since kNN typically uses Euclidean distance to find the k nearest points to any given point, using normalized features may select a different set of k neighbors than the ones chosen with unnormalized features, hence the difference in accuracy.
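A tiny illustration of that effect with made-up numbers, where one feature's range dwarfs the other's:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 1000.0], [1.2, 2000.0], [5.0, 1100.0], [1.1, 5000.0]])
query = np.array([[1.0, 1200.0]])

nn = NearestNeighbors(n_neighbors=2).fit(X)
print(nn.kneighbors(query, return_distance=False))        # second feature dominates the distance

scaler = StandardScaler().fit(X)
nn_s = NearestNeighbors(n_neighbors=2).fit(scaler.transform(X))
print(nn_s.kneighbors(scaler.transform(query), return_distance=False))   # a different neighbor set can win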
I've been using python to experiment with sklearn's BayesianGaussianMixture (and with GaussianMixture, which shows the same issue).
I fit the model with a number of items drawn from a distribution, then tested the model with a held out data set (some from the distribution, some outside it).
Something like:
from sklearn.mixture import BayesianGaussianMixture

X_train = ... # 70x321 matrix of training data
X_in = ...    # 20x321 matrix of held-out data points from the same distribution as X_train
X_out = ...   # 20x321 matrix of data points drawn from a different distribution

model = BayesianGaussianMixture(n_components=1)
model.fit(X_train)
print(model.score_samples(X_in).mean())
print(model.score_samples(X_out).mean())
outputs:
-1334380148.57
-2953544628.45
The score_samples method returns a per-sample log likelihood of the given data, and "in" samples are much more likely than the "out" samples as expected - I'm just wondering why the absolute values are so high?
The documentation for score_samples states "Compute the weighted log probabilities for each sample" - but I'm unclear what the weights are based on.
Do I need to scale my input first? Is my input dimensionality too high? Do I need to do some additional parameter tuning? Or am I just misunderstanding what the method returns?
The weights are based on the mixture weights.
Do I need to scale my input first?
This is usually not a bad idea, but I can't say without knowing more about your data.
Is my input dimensionality too high?
Given the amount of data you are fitting, it does seem too high. Remember the curse of dimensionality: you have very few rows of data (70) and 321 features, roughly a 1:4.6 ratio; that's not really going to work in practice.
Do I need to do some additional parameter tuning? Or am I just misunderstanding what the method returns?
Your outputs are log-probabilities that are very negative. If you raise e to such a large negative number, you get a probability that is very close to zero, so from that perspective your results actually make sense. You may want to check the log-probability in regions where you know there is a higher probability of belonging to that component. You may also want to check the covariances of each component to make sure you don't have a degenerate solution, which is quite likely given the amount of data and the dimensionality in this case. Before any of that, you may want to get more data or see if you can reduce the number of dimensions.
I forgot to mention a rather important point: the output is a density, so keep that in mind too.
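For instance, a hedged sketch of reducing the dimensionality before fitting, reusing the X_train / X_in / X_out names from the question; the 10 PCA components are an arbitrary illustrative choice, not a recommendation:

from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

reducer = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=10))])
Z_train = reducer.fit_transform(X_train)     # 70x321 -> 70x10

model = BayesianGaussianMixture(n_components=1, random_state=0)
model.fit(Z_train)

print(model.score_samples(reducer.transform(X_in)).mean())
print(model.score_samples(reducer.transform(X_out)).mean())
print(model.covariances_.shape)              # inspect for a degenerate (near-singular) fit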
I have a dataset where the classes are unbalanced. The classes are either '1' or '0', and the ratio of class '1' to class '0' is 5:1. How do you calculate the prediction error for each class and then rebalance the weights accordingly in sklearn with Random Forest, along the lines of the following link: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
You can pass a sample_weight argument to the Random Forest fit method:
sample_weight : array-like, shape = [n_samples] or None
Sample weights. If None, then samples are equally weighted. Splits
that would create child nodes with net zero or negative weight are
ignored while searching for a split in each node. In the case of
classification, splits are also ignored if they would result in any
single class carrying a negative weight in either child node.
In older versions there was a preprocessing.balance_weights method to generate balance weights for given samples, such that classes become uniformly distributed. It is still there, in the internal but still usable preprocessing._weights module, but it is deprecated and will be removed in future versions. I don't know the exact reasons for this.
Update
Some clarification, as you seem to be confused. sample_weight usage is straightforward once you remember that its purpose is to balance the target classes in the training dataset. That is, if you have X as observations and y as classes (labels), then len(X) == len(y) == len(sample_weight), and each element of the sample_weight 1-d array represents the weight for the corresponding (observation, label) pair. For your case, if class 1 is represented 5 times as often as class 0 and you want to balance the class distributions, you could simply use
sample_weight = np.array([5 if i == 0 else 1 for i in y])
assigning a weight of 5 to all 0 instances and a weight of 1 to all 1 instances. See the link above for the slightly more crafty balance_weights weight-evaluation function.
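For completeness, a minimal sketch of passing those weights to the classifier; X and y are assumed to be your features and labels:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

sample_weight = np.array([5 if label == 0 else 1 for label in y])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)   # one weight per (observation, label) pair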
It is a real shame that sklearn's fit method does not allow specifying a performance measure to be optimized. No one around seems to understand, question, or be interested in what's actually going on when one calls the fit method on a data sample while solving a classification task.
We (users of the scikit-learn package) are silently left with the suggestion to use cross-validated grid search with a scoring method suitable for unbalanced datasets, in the hope of stumbling upon a set of parameters/hyperparameters that produces an appropriate AUC or F1 score.
But think about it: it looks like the fit method called under the hood always optimizes accuracy. So in effect, if we aim to maximize the F1 score, GridSearchCV gives us the "model with the best F1 among all models with the best accuracy". Isn't that silly? Wouldn't it be better to directly optimize the model's parameters for the maximal F1 score?
Remember the good old Matlab ANNs package, where you could set the desired performance metric to RMSE, MAE, or whatever you want, as long as the gradient-calculating algorithm is defined. Why is the choice of performance metric silently omitted from sklearn?
At least, why is there no simple option to automatically assign class instance weights to remedy unbalanced dataset issues? Why do we have to calculate the weights manually? Besides, in many machine learning books and articles I have seen authors praise sklearn's manual as awesome, if not the best, source of information on the topic. Really? Why, then, is the unbalanced dataset problem (which is obviously of utter importance to data scientists) not covered anywhere in the docs?
I address these questions to the contributors of sklearn, should they read this. Anyone who knows the reasons for this is welcome to comment and clear things up.
UPDATE
Since scikit-learn 0.17, there is a class_weight='balanced' option which you can pass to at least some classifiers:
The “balanced” mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as n_samples / (n_classes * np.bincount(y)).
Use the parameter class_weight='balanced'
From sklearn documentation: The balanced mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
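A minimal sketch of that option (X and y are assumed names), with the same formula written out by hand for comparison:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=0)
clf.fit(X, y)   # per-class weights computed internally as n_samples / (n_classes * np.bincount(y))

manual = len(y) / (len(np.unique(y)) * np.bincount(y))
print(manual)   # with a 5:1 class ratio, the rarer class receives the larger weight under this formula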
If the majority class is 1, and the minority class is 0, and they are in the ratio 5:1, the sample_weight array should be:
sample_weight = np.array([5 if i == 1 else 1 for i in y])
Note that you do not invert the ratios. This also applies to class_weights. The larger number is associated with the majority class.