I am working on classifying texts and images of scientific articles. From the texts I use title and abstract. So far I have achieved good results using an SVM for the texts and not that good using a CNN for the images. I still did a multimodal classification, which did not show any classification improvement.
What I would like to do now is to use the svm and cnn predictions to classify, something like a vote ensemble. However the VotingClassifier from sklearn does not accept mixed inputs. You would have some idea of how I could implement or some guide line.
Thank you!
One simple thing you can do is take the outputs from both your models and just use them as inputs to third linear regression model. This effectively "mixes" your 2 learners into a small ensemble. Of course this is a very simple strategy but it might give you a slight boost over using each model separately.
Related
I am trying to train a model with the following data
The image is the input and 14 features must be predicted.
Could I please know how to go about training such a model?
Thank you.
These are not really features as far as I am concerned. These are classes and if I got it correctly, your images sometimes belong to more than one classes.
This is a very broad question but I think here might be a good start to learn more about multi-label image classification.
Note that your model should not be much different than an image classification model that is used for cifar10 challenge, for example. But you need to structure your data and choose your loss function accordingly.
I have a multi-label classification problem, I want to classify texts with six labels, each text can have one to six labels but this label distribution is not equal. For example, 10 people annotated sentence1 as below:
These labels are the number of votes for that class. I can normalize them like sad 0.7, anger 0.2, fear 0.1, happy 0.0,...
What is the best classifier for this problem? What is the best type for labels I mean I should normalize them or not?
What keywords should I search for this kind of multi-label classification problem where the probability of labels is not equal?
Well, first, to clarify if I understand your problem correctly. You have sentences=[sent1, sent2, ... sentn] and you want to classify them into these six labels labels=[l1,l2,...,l6]. Your data isn't the labels themselves, but the probability of having that label in the text. You also mentioned the six labels comes from human annotation (I don't know what you mean by 10 people commented, I'll guess it is annotation)
If this is the case, you can deal with the problem with multi-label classification or a multi-target regression perspectives. I'll approach what you can do with your data both cases:
Multilabel Classification: In this case, you need to define the classes for each sentence so that you can train your model. Right now you have only the probabilities. You can do that by creating a threshold and the probabilities of labels that are above the threshold can be considered the labels for a sentence. You can read more about the evaluation metrics here.
Multi-target Regression: In this case, you don't need to define the classes, you just use the training input and we use the data to predict the probabilities for each label. I think it is a better and easier problem, given your data collection. If you want to know more about the problem of multi-target regression, you can read more about it here, but the models they used in this tutorial are not the the state-of-the-art (be aware of it).
Training Models: You can use both shallow and deep models for this task. You need a model that can receive a sentence as input and predict six labels or six probabilities. I suggest you take a look into this example, it can be a very good starting point for your work. The author provides a tutorial on how to build a multi-label text classifier using deep neural networks. He basically built a LSTM and a Feed-forward layer in the end to classify the labels. If you decide to use regression instead of classification, you can just drop the activation in the end.
The best results are likely to be obtained by deep neural networks, so the article I sent you can work very well. I also suggest you take a look in the state-of-the-art methods for text classification, such as BERT or XLNET. I implemented a Multi-label classification method using BERT, maybe it can be helpful to you.
So basically I want to classify a lot of labels (200K+).
Are there any recommended models I should try in order to have a relatively good accuracy and not take days to complete?
I have tried to use Sklearn's OneVsRestClassifier for LinearRegression, and I left it overnight and the fitting still didn't finish
I believe that there should be more efficient algorithms for multiclass classification for NLP
Thanks in advance
Given the amount of data that you have available, consider multinomial Naive Bayes. Sklearn has a very straight forward implementation of this: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
This will be quicker than using a neural network. A lot of training data on a simple model will always have more predictive power than a larger model with less data.
amateur machine learning programmer here. I would like to perform a classification task wherein two simultaneous class predictions could occur.
For instance, in flowers image classification. Apart from being able to classify an image of a rose, or an orchid; I would also like to be able to classify if an image contains both roses and orchids simultaneously. Do I have to train my model to distinguish "Rose + Orchid" as an independent class?
Here's an example image of the task.
In scikit learn all classifiers which have prob_a function have your specification. This function returns the probability of assigning each class to the input x. Hence, you can use SVC, logistic regression, naive Bayes, random forest or any explained classifier in scikit learn (if you are seeking the specified classifier in scikit learn) based on your problem.
When you found the prob_a for each class, if the difference between two most probable class is near to each other, you can introduce the input with the two most probable class.
This is called as multi-label classification problem. There are many approaches to solve this problem. Sklearn documentation about multi-label classification.
An example with sklearn
I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10--20 records. The data is labeled with its outcome.
So far I'm thinking logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing the parameter is a difficult task in any data mining project. Because it must customized for your data and problem. Try different algorithm like SVM,Random Forest, Logistic Regression, KNN and... and test Cross Validation for each of them and then compare them.
You can use GridSearch in sickit learn to try different parameters and optimize the parameters for each algorithm. also try this project
witch test a range of parameters with genetic algorithm
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
You should also know that there are Ensemble methods.
A nice cheat sheet what to use is on in the sklearn tutorial you already found:
(source: scikit-learn.org)
Just try it, compare different results. Without more information it is not possible to give you better advice.