I am working on code to classify texts of scientific articles (using the title and the abstract). For this I'm using an SVM, which delivers good accuracy (83%). At the same time, I used a CNN to classify the images of these articles. My idea is to merge the text classifier with the image classifier to improve the accuracy.
Is this possible? If so, do you have an idea of how I could implement it, or some kind of guideline?
Thank you!
You could use the CNN to do both. For this you'd need two (or even three) inputs: one for the text (or two, where one is for the abstract and the other for the title) and a second input for the image. Then you'd have some conv/max-pooling layers before you merge them at one point, after which you plug in some additional conv or dense layers.
You could also have multiple outputs in this model, e.g. a combined one, one for the text, and one for the images. If you're using Keras, you would need the functional API. A picture of an example model can be found here. (They're using an LSTM in the example, but I guess you should stick to a CNN.)
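A minimal sketch of such a two-input model with the Keras functional API (the layer sizes, vocabulary size, and image shape here are placeholder assumptions, not values from your setup):
from tensorflow.keras import layers, Model

num_classes = 5  # replace with your number of classes

# Text branch: token indices -> embedding -> 1D conv -> pooling
text_in = layers.Input(shape=(200,), name="text")
x = layers.Embedding(input_dim=20000, output_dim=128)(text_in)
x = layers.Conv1D(64, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)

# Image branch: 2D convs -> pooling
img_in = layers.Input(shape=(128, 128, 3), name="image")
y = layers.Conv2D(32, 3, activation="relu")(img_in)
y = layers.MaxPooling2D()(y)
y = layers.Conv2D(64, 3, activation="relu")(y)
y = layers.GlobalMaxPooling2D()(y)

# Merge the two branches, then add dense layers on top
z = layers.concatenate([x, y])
z = layers.Dense(64, activation="relu")(z)
out = layers.Dense(num_classes, activation="softmax")(z)

model = Model(inputs=[text_in, img_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])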
If you get probabilities from both classifiers, you can average them and take the combined result. However, taking a weighted average might be a better approach; in that case you can use a validation set to find a suitable value for the weight:
# SVC must be constructed with probability=True for predict_proba to work
prob_svm = svm_clf.predict_proba(X_text)   # probabilities from the SVM text classifier
prob_cnn = cnn_model.predict(X_image)      # probabilities from the CNN image classifier
prob_total = alpha * prob_svm + (1 - alpha) * prob_cnn  # fine-tune alpha with a validation set
If you can get another classifier (maybe a different version of either of these two classifiers), you can also do majority voting, i.e., take the class that two or all three classifiers agree on.
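A quick sketch of the majority vote over hard predictions (the three label arrays here are hypothetical placeholders for your classifiers' integer-encoded outputs):
import numpy as np

preds = np.stack([svm_labels, cnn_labels, third_labels])  # shape (3, n_samples)
# For each sample, pick the label that gets the most votes
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, preds)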
So I'm trying to train a SimCLR network with a custom lightweight ConvNet backbone (I tried it with a ResNet already) on a dataset containing the first 5 letters of the alphabet, of which two are randomly selected and placed at random positions in the image. I am unsure what augmentations to use in such a scenario, so I only use image translation to provide some degree of difference between the augmented samples.
This sounds like an extremely trivial task, but it performs VERY poorly with a multi-label classifier built on top of the frozen pretrained network. I'm quite certain this is because of the poor quality of the learnt representations rather than the linear classifier. This works well with a supervised classifier, obviously.
Variations I've tried till now:
Made the dataset single letter, random position (multi-class) and it performed very well.
Made the dataset with random letters, but same center position, and it performed well. Same augmentation mentioned above for these as well.
Sample image from the dataset (the label here is [1, 1, 0, 0, 0], for the letters that are present)
Can someone please help me figure out how to make this work?
This is not the first time I've heard of someone trying SimCLR and getting horrible results...
I have some questions:
Have you tried other losses for the contrastive pretraining part? How about triplet loss?
Are the representations normalised?
Are you getting good results with contrastive pretraining in the variations that you mention?
Are you getting good supervised classification results with both models (Resnet and custom convnet)?
Have you tried to visualise the features learned by the model in the conv layers?
You could also try to visualise the feature maps with forward hooks and see what the network is "looking at".
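For instance, in PyTorch a forward hook can capture a feature map during a normal forward pass (a minimal sketch; model.conv1 and the input shape are assumptions about your backbone):
import torch

feature_maps = {}

def hook(module, inputs, output):
    feature_maps["conv1"] = output.detach()  # save this layer's activations

handle = model.conv1.register_forward_hook(hook)
model(torch.randn(1, 3, 64, 64))  # one dummy forward pass triggers the hook
handle.remove()

print(feature_maps["conv1"].shape)  # (1, out_channels, H, W); visualise channels as images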
I'm trying to understand how to use Stellargraph's EdgeSplitter class. In particular, the examples in the documentation for training a link prediction model based on Node2Vec split the graph into the following parts:
Distribution of samples across the train, validation, and test sets
Following the examples on the documentation, first you sample 10% of the links of the full graph in order to obtain the test set:
# Define an edge splitter on the original graph:
edge_splitter_test = EdgeSplitter(graph)

# Randomly sample a fraction p=0.1 of all positive links, and the same number of
# negative links, from the graph, and obtain the reduced graph graph_test with
# the sampled links removed:
graph_test, examples_test, labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global"
)
As far as I understand from the docs, graph_test is the original graph but with the test links removed. Then you perform the same operation with the training set,
# Do the same process to compute a training subset from within the test graph
edge_splitter_train = EdgeSplitter(graph_test)
graph_train, examples, labels = edge_splitter_train.train_test_split(
    p=0.1, method="global"
)
Following the previous logic, graph_train corresponds to graph_test with the training links removed.
Further down the code, my understanding is that we use graph_train to train the embedding and the training samples (examples, labels) to train the classifier. So I have several questions here:
Why are we using disjoint sets of training data to train different parts of the model? Shouldn't we train both the embedding and the classifier with the full training set of links?
Why is the test set so big? Wouldn't it be better to have most samples in the training set?
What is the correct way of using the EdgeSplitter class?
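For context, here is my sketch of what the code further down does (following the documentation's Node2Vec example; the walk and embedding parameters are my assumptions):
from stellargraph.data import BiasedRandomWalk
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
import numpy as np

# 1. Train the embedding on graph_train (both test and train links removed)
walks = BiasedRandomWalk(graph_train).run(
    nodes=list(graph_train.nodes()), length=80, n=10, p=1.0, q=1.0
)
w2v = Word2Vec([[str(n) for n in walk] for walk in walks], vector_size=128, window=5)

# 2. Combine the two node embeddings of each candidate edge into a link feature
#    (the Hadamard product is one of several operators used in the docs)
X = np.array([w2v.wv[str(u)] * w2v.wv[str(v)] for u, v in examples])

# 3. Train the link classifier on the sampled (examples, labels)
clf = LogisticRegression(max_iter=1000).fit(X, labels)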
Thank you in advance for your help!
Why disjoint sets:
This may or may not matter depending on the embedding algorithm.
The risk with edges that are seen both by the embedding algorithm and by the classifier as targets is that the embedding algorithm may encode non-generalizable features.
For example, theoretically one feature of the embedding could be the node id, and other features could encode the entire neighborhood of the node. By combining two nodes' embeddings into a link vector in an odd way, or by using a multilayer model, one could therefore create a binary feature which is 1 if the two nodes were connected during embedding training and 0 otherwise.
In this case the classifier would perhaps just learn to use this trivial feature, which is not present (i.e. has value 0) when you move to the test data.
The above would not happen in a real scenario, but more subtle features could have the same effect to a lesser degree.
In the end, this only risks degrading model selection.
That is, the first split is to make the test reliable. The second split is to improve model selection. You can therefore omit the second split if you wish.
Why test set so big:
You are likely to get a higher score with a bigger training set. As long as the experiment is repeated with different splits and the variance is under control, it should be fine to increase the training set size.
What is the correct way to use EdgeSplitter:
I don't know what 'correct' means here; I think graph splitting is still an active research area.
I have a multi-label classification problem: I want to classify texts with six labels, and each text can have one to six labels, but the label distribution is not equal. For example, 10 people annotated sentence1, and the labels are the number of votes for each class; I can normalize them, e.g. sad 0.7, anger 0.2, fear 0.1, happy 0.0, ...
What is the best classifier for this problem? What is the best format for the labels, i.e. should I normalize them or not?
What keywords should I search for this kind of multi-label classification problem, where the label probabilities are not equal?
Well, first, to clarify whether I understand your problem correctly: you have sentences = [sent1, sent2, ..., sentn] and you want to classify them into these six labels labels = [l1, l2, ..., l6]. Your data isn't the labels themselves, but the probability of having each label in the text. You also mentioned that the six labels come from human annotation (I don't know what you mean by 10 people commented; I'll guess it is annotation).
If this is the case, you can approach the problem from either a multi-label classification or a multi-target regression perspective. I'll describe what you can do with your data in both cases:
Multilabel Classification: In this case, you need to define the classes for each sentence so that you can train your model; right now you have only the probabilities. You can do that by setting a threshold: the labels whose probabilities are above the threshold become the labels for that sentence. You can read more about the evaluation metrics here.
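A tiny sketch of that thresholding step (the 0.3 cutoff is an arbitrary assumption for illustration):
import numpy as np

probs = np.array([[0.7, 0.2, 0.1, 0.0, 0.0, 0.0]])  # one sentence, six label fractions
labels = (probs >= 0.3).astype(int)                 # -> [[1, 0, 0, 0, 0, 0]]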
Multi-target Regression: In this case, you don't need to define the classes; you just use the training input and the data to predict the probabilities for each label. I think it is a better and easier framing, given how your data was collected. If you want to know more about multi-target regression, you can read more about it here, but be aware that the models used in that tutorial are not the state-of-the-art.
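A minimal sketch of the regression framing with scikit-learn (TF-IDF features and ridge regression are my assumptions, not the tutorial's models):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline

# y: an (n_sentences, 6) array of normalized vote fractions, e.g. [0.7, 0.2, 0.1, 0, 0, 0]
model = make_pipeline(TfidfVectorizer(), MultiOutputRegressor(Ridge()))
model.fit(sentences, y)                            # sentences: list of raw strings
pred = model.predict(["a new sentence to score"])  # six predicted fractions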
Training Models: You can use both shallow and deep models for this task. You need a model that can receive a sentence as input and predict six labels or six probabilities. I suggest you take a look at this example; it can be a very good starting point for your work. The author provides a tutorial on how to build a multi-label text classifier using deep neural networks: he basically built an LSTM with a feed-forward layer at the end to classify the labels. If you decide to use regression instead of classification, you can just drop the final activation.
The best results are likely to be obtained by deep neural networks, so the article I linked can work very well. I also suggest you take a look at the state-of-the-art methods for text classification, such as BERT or XLNet. I implemented a multi-label classification method using BERT; maybe it can be helpful to you.
I am working on classifying texts and images of scientific articles. From the texts I use the title and abstract. So far I have achieved good results using an SVM for the texts, and not-so-good results using a CNN for the images. I also tried a multimodal classification, which did not show any improvement.
What I would like to do now is use the SVM and CNN predictions to classify, something like a voting ensemble. However, the VotingClassifier from sklearn does not accept mixed inputs. Would you have some idea of how I could implement this, or some guideline?
Thank you!
One simple thing you can do is take the outputs from both your models and use them as inputs to a third linear regression model. This effectively "mixes" your two learners into a small ensemble. Of course this is a very simple strategy, but it might give you a slight boost over using each model separately.
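A hedged sketch of that idea (the variable names are placeholders; I use LogisticRegression as the meta-model, a swapped-in choice suited to classification that plays the same role as linear regression here):
import numpy as np
from sklearn.linear_model import LogisticRegression

# Predicted class probabilities from the two base models on a held-out set
prob_svm = svm_clf.predict_proba(X_text_val)   # (n_samples, n_classes)
prob_cnn = cnn_model.predict(X_image_val)      # (n_samples, n_classes)

meta_X = np.hstack([prob_svm, prob_cnn])       # concatenate the outputs as features
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_X, y_val)                  # fit on held-out labels to avoid leakage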
I have a folder with hundreds or thousands of images, some of which look alike. I would like to create clusters separating those images (those which look alike in the same cluster).
I can't determine the number of clusters that will be needed, it depends on the images.
Does anyone have an idea on how to do this using Python, OpenCV and which algorithm to use?
I've done some research and found that AffinityPropagation or DBSCAN could be useful, but I don't know where to start (how to encode my images, what I should pass to those algorithms, etc.).
Unfortunately it is not that simple with images: naive clustering on raw pixels would result in clusters of images with the same colors, not the same "content". You can use a neural network as a feature extractor for the images; I see two options:
Use a pre-trained network and get the features from an intermediate layer
Train an autoencoder on your dataset, and use the latent features
Option 1 is cheaper, since you can easily find pre-trained models; option 2 is much more computationally expensive but should work better, especially if there is no model pre-trained on your domain.
This tutorial (randomly found on the internet) seems to be a good introduction to option 2.
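A sketch of option 1 with a pre-trained Keras model and DBSCAN (the image size, eps value, and cosine metric are assumptions to tune on your data):
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.cluster import DBSCAN

model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def embed(path):
    img = image.load_img(path, target_size=(224, 224))
    arr = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(arr)[0]  # one 2048-dim feature vector per image

features = np.stack([embed(p) for p in image_paths])  # image_paths: your file list
labels = DBSCAN(eps=0.5, metric="cosine").fit_predict(features)  # -1 marks noise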