Intent classification with a large number of intent classes - Python

I am working on a dataset of approximately 3000 questions and I want to perform intent classification. The dataset is not labelled yet, but from the business perspective there is a requirement to identify approximately 80 different intent classes. Let's assume my training data has an approximately equal number of samples per class and is not heavily skewed towards some of the classes. I intend to convert the text to word2vec or GloVe embeddings and then feed them into my classifier.
I am familiar with cases in which I have a smaller number of intent classes, such as 8 or 10, and with the usual choice of machine learning classifiers such as SVM, Naive Bayes, or deep learning (CNN or LSTM).
My question is: if you have had experience with such a large number of intent classes before, which machine learning algorithm do you think will perform reasonably well? Do you think that, even with deep learning frameworks, such a large number of labels will cause poor performance given the training data above?
We need to start labelling the data, and it would be rather laborious to come up with 80 classes of labels only to realise that the model does not perform well. So I want to make sure I am making the right decision: what is the maximum number of intent classes I should consider, and which machine learning algorithm do you suggest?
Thanks in advance...

First, word2vec and GloVe are all but obsolete. You should probably consider using more recent embeddings like BERT or ELMo (both of which are context-sensitive; in other words, you get different embeddings for the same word in different contexts). Currently, BERT is my own preference, since it is completely open-source and freely available (GPT-2 was released a couple of days ago and is apparently a little better, but it is not fully available to the public).
Second, when you use BERT's pre-trained embeddings, your model has the advantage of having seen a massive amount of text (Google-scale massive) and can therefore be trained on small amounts of data, which will increase its performance drastically.
Finally, if you can group your intents into some coarse-grained classes, you could train a classifier to decide which coarse-grained class an instance belongs to, and then, for each coarse-grained class, train another classifier to pick the fine-grained one. This hierarchical structure will probably improve the results. As for the type of classifier, I believe a simple fully connected layer on top of BERT would suffice.
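If it helps, here is a minimal sketch of that last suggestion, assuming the HuggingFace transformers library and PyTorch (neither is specified in the answer); the model name, the number of intents, and the example sentence are purely illustrative:

```python
# Minimal sketch: a single linear layer on top of BERT's pooled output
# for intent classification. Assumes HuggingFace `transformers` + PyTorch.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

NUM_INTENTS = 80  # hypothetical number of fine-grained intents

class IntentClassifier(nn.Module):
    def __init__(self, num_intents):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.bert.config.hidden_size, num_intents)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the pooled [CLS] representation as the sentence embedding.
        return self.head(out.pooler_output)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = IntentClassifier(NUM_INTENTS)
batch = tokenizer(["what is my account balance?"], return_tensors="pt",
                  padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])  # shape (1, 80)
```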


Is it necessary to mitigate class imbalance problem in multiclass text classification?

I am performing multi-class text classification using BERT in Python. The dataset that I am using for retraining my model is highly imbalanced. Now, I am well aware that class imbalance leads to a poor model and that one should balance the training set by undersampling, oversampling, etc. before model training.
However, it is also a fact that the distribution of the training set should be similar to the distribution of the production data.
Now, if I am sure that the data thrown at me in the production environment will also be imbalanced, i.e., the samples to be classified are more likely to belong to some classes than to others, should I balance my training set?
OR
Should I keep the training set as it is, since I know that its distribution is similar to the distribution of the data I will encounter in production?
Please give me some ideas, or point me to some blogs or papers, for understanding this problem.
Class imbalance is not a problem by itself; the problem is that having too few samples of the minority class makes it harder to describe its statistical distribution, which is especially true for high-dimensional data (and BERT embeddings have 768 dimensions, IIRC).
Additionally, the logistic function tends to underestimate the probability of rare events (see e.g. https://gking.harvard.edu/files/gking/files/0s.pdf for the mechanics), which can be offset by selecting a classification threshold as well as by resampling.
There are quite a few discussions on CrossValidated regarding this (like https://stats.stackexchange.com/questions/357466). TL;DR:
while too few samples of a class may degrade prediction quality, resampling is not guaranteed to give an overall improvement; at the very least, there is no universal recipe for a perfect resampling proportion, so you will have to test it out for yourself;
however, real-life tasks often weigh classification errors unequally: resampling may help improve certain classes' metrics at the cost of overall accuracy. The same applies to classification-threshold selection, however.
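As a concrete illustration of the threshold-selection point, here is a hedged sketch with scikit-learn on synthetic data; the threshold grid and the minority-class F1 criterion are arbitrary choices, not a universal recipe:

```python
# Sketch: instead of resampling, pick a decision threshold that trades
# overall accuracy for minority-class performance. Data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y,
                                            random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_val)[:, 1]

# Sweep thresholds on held-out data; keep the best minority-class F1.
thresholds = np.linspace(0.05, 0.95, 19)
best = max(thresholds, key=lambda t: f1_score(y_val, probs >= t))
print("best threshold:", best)
```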
This depends on the goal of your classification:
Do you want a high probability that a random sample is classified correctly? -> Do not balance your training set.
Do you want a high probability that a random sample from a rare class is classified correctly? -> balance your training set, or apply weighting during training, increasing the weights for rare classes.
For example in web applications seen by clients, it is important that most samples are classified correctly, disregarding rare classes, whereas in the case of anomaly detection/classification, it is very important that rare classes are classified correctly.
Keep in mind that a model trained on a highly imbalanced dataset tends to always predict the majority class; therefore, increasing the number of samples or the weights of rare classes can be a good idea, even without perfectly balancing the training set.
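A hedged sketch of the weighting idea, shown both for scikit-learn's class_weight option and for a PyTorch loss; the class counts are made up:

```python
# Sketch of weighting instead of resampling: give rare classes larger
# weights in the loss, inversely proportional to their frequency.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 900 + [1] * 80 + [2] * 20)  # imbalanced labels

# scikit-learn: built-in per-class reweighting (ready to .fit(X, y)).
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# PyTorch: pass the same kind of weights to the cross-entropy loss.
weights = compute_class_weight("balanced", classes=np.array([0, 1, 2]),
                               y=y_train)
loss_fn = torch.nn.CrossEntropyLoss(
    weight=torch.tensor(weights, dtype=torch.float))
```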
P(label | sample) is not the same as P(label).
P(label | sample) is your training goal.
In the case of gradient-based learning with mini-batches on models with a large parameter space, rare labels leave a small footprint on model training, so your model ends up fitting P(label).
To avoid fitting P(label), you can balance the batches.
Over all the batches of an epoch, the data then looks like an up-sampled minority class. The goal is a better loss function, whose gradients move the parameters toward a better classification objective.
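A minimal sketch of balanced batches in PyTorch, assuming a WeightedRandomSampler approach (one of several ways to do this); the dataset is a random placeholder:

```python
# Sample each example with probability inversely proportional to its
# class frequency, so every mini-batch looks roughly class-balanced.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

features = torch.randn(1000, 16)                      # placeholder data
labels = torch.cat([torch.zeros(950, dtype=torch.long),
                    torch.ones(50, dtype=torch.long)])

class_counts = torch.bincount(labels).float()
sample_weights = (1.0 / class_counts)[labels]          # weight per example

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                replacement=True)
loader = DataLoader(TensorDataset(features, labels), batch_size=32,
                    sampler=sampler)
```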
UPDATE
I don't have any proof to show here, and it is perhaps not an accurate statement. With enough training data (relative to the complexity of the features) and enough training steps, you may not need balancing. But most language tasks are quite complex and there is not enough data for training; that was the situation I had in mind in the statements above.

Optimum number of target class objects for yolov5 custom model training

I am trying to train a custom object detector. Is there a limit on the number of target classes that the YOLOv5 architecture can be trained on?
For example, the COCO dataset has 80 target classes; suppose I have 500 object types to detect, is it advisable to use YOLOv5?
Can this be explained with reasons?
You can add as many classes as you want to any network.
The YOLO architecture is known for prioritizing inference time over raw performance. Although it achieves good results on conventional datasets, the YOLO model is built for speed.
But essentially you want a network with a good backbone (deep and wide) that can extract rich features from your images.
From my experience, there is no straightforward answer. It also depends on your dataset, e.g. whether you have large/medium/small objects to detect. I really recommend trying out different models, because every single model will perform differently on custom datasets; from there you select the best one. State-of-the-art results do not directly translate into the best model for transfer learning and fine-tuning.
YOLO and the other single-shot detectors were, for me, the ones that worked best for fine-tuning purposes (RetinaNet was best for my use cases so far). They are good for hyperparameter tuning because you can train them fast and test what works and what doesn't. With two-stage detectors (Faster R-CNN etc.) I never achieved good overall results, mainly because the training process is different and much slower.
I recommend you read this article; it explains both architecture types, with pros and cons.
Additionally, if you want to train a model for more than 500 classes: the TensorFlow Object Detection API has pre-trained models for the OpenImages dataset (600 classes), and there is Detectron2 on the LVIS dataset (1200 classes). I recommend starting with models that were trained on a high number of classes if you want to fine-tune to a similar number of classes in your dataset.
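For what it's worth, the class count in YOLOv5 comes from the dataset YAML rather than from the architecture itself. Below is a hedged sketch that writes a hypothetical 500-class config; all paths and class names are placeholders, and the train command assumes the ultralytics/yolov5 repository layout:

```python
# Sketch: YOLOv5 reads the number of classes from the dataset config,
# so a 500-class detector is just a matter of the YAML you feed it.
import yaml

data = {
    "train": "datasets/custom/images/train",   # placeholder paths
    "val": "datasets/custom/images/val",
    "nc": 500,                                  # number of classes
    "names": [f"class_{i}" for i in range(500)],  # placeholder names
}
with open("custom500.yaml", "w") as f:
    yaml.safe_dump(data, f)

# Then (assuming the ultralytics/yolov5 repo):
#   python train.py --data custom500.yaml --weights yolov5s.pt --img 640
```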

Machine learning: what are the common techniques for feature engineering and presenting the model?

I am working on an ML language-identification project (Python) that requires a multi-class classification model with high-dimensional feature input.
Currently, all I can do to improve accuracy is trial and error: mindlessly combining the available feature extraction algorithms and ML models and seeing if I get lucky.
I am asking if there is a commonly accepted workflow that finds an ML solution systematically.
This thought might be naive, but I am thinking of somehow visualizing the high-dimensional data and my model's decision boundaries, hoping that this visualization can help me with tuning. In MATLAB, after training, I can choose any two features among all features and MATLAB will plot a decision boundary accordingly. Can I do this in Python?
Also, I am looking for some types of graphs that I can use in the presentation to introduce my model and features. What are the most common graphs used in the field?
Thank you
Feature engineering is more of an art than a technique. It may require domain knowledge, or you can try adding, subtracting, dividing, and multiplying different columns to create features and check whether they add value to the model. If you are using linear regression, the adjusted R-squared value should increase; in tree models, you can look at the feature importances, etc.
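A short sketch of that idea on synthetic data: build arithmetic combinations of columns and inspect a random forest's feature importances (scikit-learn; the column names are invented):

```python
# Derive arithmetic combinations of columns and check whether a tree
# model finds them useful via feature importances.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
df = pd.DataFrame(X, columns=["a", "b", "c", "d"])

# Candidate engineered features: sums, products, ratios.
df["a_plus_b"] = df["a"] + df["b"]
df["c_times_d"] = df["c"] * df["d"]
df["a_over_c"] = df["a"] / (df["c"].abs() + 1e-9)  # avoid division by zero

model = RandomForestClassifier(random_state=0).fit(df, y)
for name, imp in sorted(zip(df.columns, model.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")
```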

twitter/facebook comments classification into various categories

I have a dataset of comments which I want to classify into five categories:
jewelries, clothes, shoes, electronics, food & beverages
So if someone is talking about pork, steak, wine, soda, or eating, it's classified as f&b.
Whereas if somebody is talking about, say, gold, pendants, lockets, etc., it's classified as jewelries.
I want to know what tags/tokens I should be looking for in a comment/tweet so as to classify it into any of these categories, and finally which classifier to use. I just need some guidance and suggestions; I'll take it from there.
Please help. Thanks
This answer may be a bit long, and perhaps I abstract a few things away, but it's just to give you an idea and some advice.
Supervised vs. Unsupervised
As others have already mentioned, in the land of machine learning there are two main roads: supervised and unsupervised learning. As you probably know by now, if your corpus (documents) is labeled, you are talking about supervised learning. The labels are the categories and are, in this case, boolean values.
For instance, if a text is related to clothes and shoes, the labels for those categories should be true.
Since a text can be related to multiple categories (multiple labels), we are looking at multi-label classification.
What to use?
I presume that the dataset is not yet labeled, since Twitter does not do this categorisation for you. So here comes a big decision on your part.
You label the data manually, which means you look at as many tweets/FB messages in your dataset as you can and, for each of them, consider the 5 categories and answer each with True/False.
You decide to use an unsupervised learning algorithm and hope that you discover these 5 categories. Approaches like clustering will just try to find categories on their own, and these don't have to match your 5 predefined categories.
I've used quite a lot of supervised learning in the past and have had good experience with this type of learning, so I will continue explaining this path.
Feature Engineering
You have to come up with the features that you want to use. For text classification, a good approach is to use each possible word in the document as a feature: a value of True represents that the word is present in the document, False represents its absence.
Before doing this, you need to do some preprocessing, which can be done using various utilities provided by the NLTK library (see the sketch below).
Tokenization: this will break your text up into a list of words. You can use this module.
Stopword removal: this will remove common words from the tokens, words like 'a', 'the', ... You can take a look at this.
Stemming: this will transform words into their stem form. For example, the words 'working', 'worked', and 'works' will all be transformed to 'work'. Take a look at this.
Now, once you have preprocessed the data, generate a feature set for each document based on the words it contains. There exist automatic methods and filters for this, but I'm not sure how to do this in Python.
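As mentioned above, here is a sketch of this pipeline with NLTK, from preprocessing down to the boolean word-presence features; it assumes the punkt and stopwords corpora have been downloaded:

```python
# Preprocess text with NLTK and build boolean presence features of the
# kind NLTK-style classifiers expect. Requires:
#   nltk.download('punkt'); nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = word_tokenize(text.lower())                 # tokenization
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation
    tokens = [t for t in tokens if t not in stop_words]  # stopword removal
    return [stemmer.stem(t) for t in tokens]             # stemming

def features(text):
    # Boolean presence features: {stem: True} for each stem in the text.
    return {token: True for token in preprocess(text)}

print(features("I just bought a beautiful gold pendant and a locket!"))
# e.g. {'bought': True, 'beauti': True, 'gold': True, 'pendant': True, ...}
```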
Classification
There are multiple classifiers that you can use for this purpose, and I suggest taking a deeper look at the ones that exist and their benefits. You can use the NLTK classifier, which supports multi-label classification, but to be honest I have never tried that one before. In the past I've used Logistic Regression and SVMs.
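A hedged sketch of the supervised route with scikit-learn instead of NLTK: a one-vs-rest logistic regression over bag-of-words counts, handling the multi-label case. The three-document corpus is obviously a toy:

```python
# One binary logistic-regression classifier per category (one-vs-rest),
# so a text can receive several of the five labels at once.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["gold pendant and matching locket",
         "pork steak with red wine and soda",
         "smartwatch, also a nice bracelet"]
labels = [["jewelries"], ["food & beverages"], ["electronics", "jewelries"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one boolean column per category

clf = make_pipeline(CountVectorizer(),
                    OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(texts, Y)

# With a real, larger corpus this should recover 'food & beverages':
print(mlb.inverse_transform(clf.predict(["cheap soda and pork ribs"])))
```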
Training & testing
You will use one part of your data for training and another part for validating whether the trained model performs well. I suggest using cross-validation, because you will have a small dataset (you have to label the data manually, which is cumbersome). The benefit of cross-validation is that you don't have to commit to a single split into a training set and a testing set. Instead, it runs in multiple rounds, each round using one part of the data for training and another for testing, so that every sample ends up being used for testing exactly once and for training in the remaining rounds.
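A minimal cross-validation sketch with scikit-learn on synthetic data (your actual features and labels would go in place of X and y):

```python
# k-fold cross-validation: each round trains on k-1 folds and tests on
# the remaining one, so every sample is used for testing exactly once.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores, "mean:", scores.mean())
```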
Predicting
Once your model is built and the predictions on the test data look plausible, you can use your model in the wild to predict the categories of new Facebook messages/tweets.
Tools
The NLTK library is great for preprocessing and natural language processing, but I have never used it for classification. I've heard a lot of great things about the scikit-learn Python library, but to be honest I prefer to use Weka, a data mining tool written in Java, which offers a great UI and speeds up your task a lot!
From a different angle: Topic modelling
In your question you state that you want to classify the dataset into five categories. I would like to show you the idea of topic modelling. It might not be useful in your scenario if you are really only targeting those categories (that's why I leave this part for the end of my answer). However, if your goal is to categorise the tweets/FB messages into non-predefined categories, topic modelling is the way to go.
Topic modelling is an unsupervised learning method where you decide in advance the number of topics (categories) you want to 'discover'. This number can be high (e.g. 40). Now, the cool thing is that the algorithm will find 40 topics, each containing words that are related to each other. It will also output, for each document, a distribution indicating which topics the document is related to. This way you can discover a lot more categories than your 5 predefined ones.
I'm not going to go much deeper into this here; just google it if you want more information. In addition, you could consider using MALLET, which is an excellent tool for topic modelling.
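For completeness, a hedged sketch of topic modelling with gensim's LDA (an alternative to MALLET); the two-document corpus and the topic count are purely illustrative:

```python
# Fit an LDA topic model and read off each document's topic distribution.
from gensim import corpora, models

docs = [["gold", "pendant", "locket", "silver"],
        ["steak", "wine", "soda", "pork"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0)  # use e.g. 40 topics on a real corpus
for doc_bow in corpus:
    print(lda.get_document_topics(doc_bow))  # per-document distribution
```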
Well, this is kind of a big subject.
You mentioned Python, so you should have a look at the NLTK library, which allows you to process natural language such as your comments.
After this step, you should have a classifier which maps the words you retrieved to a certain class. NLTK also has tools for classification which are linked to knowledge databases. If you are lucky, the categories you are looking for are already available; otherwise you may have to build them yourself. You can have a look at this example, which uses NLTK and the WordNet database: you can access each Synset, which seems to be pretty broad, and you can also look at the hypernyms (see for example list(dog.closure(hyper))).
Basically, you should consider using a multi-label classifier on the whole tokenized text (comments on Facebook and tweets are usually short; you might also decide to only consider FB comments below 200 characters, your choice). The choice of a multi-label classifier is motivated by the non-orthogonality of your classification set (clothes, shoes and jewelries can be the same object; you could have electronic jewelry, e.g. smartwatches, etc.). This is a fairly simple setup, but it's an interesting first step whose strengths and weaknesses will allow you to iterate easily (if needed).
Good luck!
What you're looking for is in the subject of
Natural Language Processing (NLP) : processing text data and
Machine learning (where the classification models are built)
First I would suggest going through NLP tutorials and then text classification tutorials; the most appropriate being https://class.coursera.org/nlp/lecture
If you're looking for libraries available in python or java, take a look at Java or Python for Natural Language Processing
If you're new to text processing, please take a look at the NLTK library that provides a nice introduction to doing NLP, see http://www.nltk.org/book/ch01.html
Now to the hardcore details:
First, ask yourself whether you have twitter/facebook comments (let's call them documents from now on) that are manually labelled with the categories you want.
1a. If YES, look at supervised machine learning, see http://scikit-learn.org/stable/tutorial/basic/tutorial.html
1b. If NO, look at UNsupervised machine learning; I suggest clustering and topic modelling, http://radimrehurek.com/gensim/
After knowing which kind of machine learning you need, split the documents up into at least a training (70-90%) and a testing (10-30%) set.
Note: I say 'at least' because there are other ways to split up your documents, e.g. for development or cross-validation. (If you don't understand this, it's all right, just follow step 2.)
Finally, train and test your model:
3a. If supervised, use the training set to train your supervised model, then apply the model to the test set and see how well it performed.
3b. If unsupervised, use the training set to generate clusters of documents (that means grouping similar documents), but these clusters still have no labels, so you need to think of some smart way to label the groups of documents correctly. (To this day there is no really good solution to this; even super-effective neural networks cannot tell you what their neurons are firing on, only that each neuron is firing on something specific.)
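A hedged sketch of step 3b with scikit-learn: cluster documents via TF-IDF and k-means, then eyeball each cluster's top terms to assign a label by hand. The corpus is a placeholder:

```python
# Cluster unlabelled documents, then inspect top terms per cluster to
# choose human-readable labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["gold pendant locket", "silver bracelet ring",
        "pork steak wine", "soda beer snacks"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = np.array(vec.get_feature_names_out())
for c in range(2):
    top = km.cluster_centers_[c].argsort()[::-1][:3]
    print(f"cluster {c}: {terms[top]}")  # eyeball these to pick a label
```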

Using Pybrain to detect malicious PDF files

I'm trying to make an ANN to classify a PDF file as either malicious or clean, utilising the 26,000 PDF samples (both clean and malicious) found on contagiodump. For each PDF file, I used PDFid.py to parse the file and return a vector of 42 numbers. The 26,000 vectors are then passed into pybrain: 50% for training and 50% for testing. This is my source code:
https://gist.github.com/sirpoot/6805938
After much tweaking of the dimensions and other parameters, I managed to get a false positive rate of about 0.90%. This is my output:
https://gist.github.com/sirpoot/6805948
My question is: is there any explicit way for me to decrease the false positive rate further? What do I have to do to reduce the rate to, say, 0.05%?
There are several things you can try to increase the accuracy of your neural network.
Use more of your data for training. This will permit the network to learn from a larger set of training samples. The drawback of this is that having a smaller test set will make your error measurements more noisy. As a rule of thumb, however, I find that 80%-90% of your data can be used in the training set, with the rest for test.
Augment your feature representation. I'm not familiar with PDFid.py, but it appears to return only ~40 values for a given PDF file. It's possible that many more than 40 features are relevant in determining whether a PDF is malicious, so you could conceivably use a different feature representation that includes more values to increase the accuracy of your model.
Note that this can potentially involve a lot of work -- feature engineering is difficult! One suggestion I have if you decide to go this route is to look at the PDF files that your model misclassifies, and try to get an intuitive idea of what went wrong with those files. If you can identify a common feature that they all share, you could try adding that feature to your input representation (giving you a vector of 43 values) and re-train your model.
Optimize the model hyperparameters. You could try training several different models using training parameters (momentum, learning rate, etc.) and architecture parameters (weight decay, number of hidden units, etc.) chosen randomly from some reasonable intervals; a sketch follows at the end of this list. This is one way to do what is called "hyperparameter optimization" and, like feature engineering, it can involve a lot of work. However, unlike feature engineering, hyperparameter optimization can largely be done automatically and in parallel, provided you have access to a lot of processing cores.
Try a deeper model. Deep models have become quite "hot" in the machine learning literature recently, especially for speech processing and some types of image classification. By using stacked RBMs, a second-order learning method (PDF), or a different nonlinearity like a rectified linear activation function, then you can add multiple layers of hidden units to your model, and sometimes this will help improve your error rate.
These are the ones that come to mind right off the bat. Good luck !
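As promised after the hyperparameter point above, here is a minimal sketch of such a random search; train_and_score is a hypothetical stand-in for your actual pybrain training-plus-validation routine, and the intervals are illustrative:

```python
# Random hyperparameter search: draw parameters from reasonable
# intervals, train one model per draw, keep the best.
import random

def train_and_score(params):
    # Placeholder: train a network with `params`, return validation score.
    return random.random()

best_score, best_params = -1.0, None
for _ in range(50):
    params = {
        "learning_rate": 10 ** random.uniform(-4, -1),
        "momentum": random.uniform(0.0, 0.99),
        "hidden_units": random.choice([32, 64, 128, 256]),
        "weight_decay": 10 ** random.uniform(-6, -2),
    }
    score = train_and_score(params)
    if score > best_score:
        best_score, best_params = score, params
print(best_params)
```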
Let me first say I am in no way an expert in neural networks. But I played with pybrain once, and I used the .train() method in a loop that runs until the error drops below 0.001 to get the error rate I wanted. So you can try using all of your samples for training with that loop and then test the network on other files.
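That loop looks roughly like the following; pybrain is an old library, so treat this as a sketch from memory rather than a definitive recipe, and note that the network shape and samples are placeholders:

```python
# Train a pybrain network until the per-epoch error drops below 0.001.
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.tools.shortcuts import buildNetwork

ds = SupervisedDataSet(42, 1)       # 42 inputs, 1 target per sample
ds.addSample([0.0] * 42, [0.0])     # placeholder samples
ds.addSample([1.0] * 42, [1.0])

net = buildNetwork(42, 20, 1)       # 42-20-1 feedforward network
trainer = BackpropTrainer(net, ds)

error = trainer.train()             # one epoch; returns the epoch's error
while error > 0.001:
    error = trainer.train()
```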
