I'm fairly new to NLP and trying to learn the techniques that can help me get my job done.
Here is my task: I have to classify stages of a drilling process based on text memos.
I have to classify labels for "Activity", "Activity Detail", "Operation" based on what's written in "Com" column.
I've been reading a lot of articles online and all the different kinds of techniques that I've read really confuses me.
The buzz words that I'm trying to understand are
Skip-gram (prediction based method, Word2Vec)
TF-IDF (frequency based method)
Co-Occurrence Matrix (frequency based method)
I am given about ~40,000 rows of data (pretty small, I know), and I came across an article that says neural-net based models like Skip-gram might not be a good choice if I have small number of training data. So I was also looking into frequency based methods too. Overall, I am unsure which technique is the best for me.
Here's what I understand:
Skip-gram: technique used to represent words in a vector space. But I don't understand what to do next once I vectorized my corpus
TF-IDF: tells how important each word is in each sentence. But I still don't know how it can be applied on my problem
Co-Occurence Matrix: I don'y really understand what it is.
All the three techniques are to numerically represent texts. But I am unsure what step I should take next to actually classify labels.
What approach & sequence of techniques should I use to tackle my problem? If there's any open source Jupyter notebook project, or link to an article (hopefully with codes) that did the similar job done, please share it here.
Let's get things a bit clearer. You task is to create a system that will predict labels for given texts, right? And label prediction (classification) can't be done for unstructured data (texts). So you need to make your data structured, and then train and infer your classifier. Therefore, you need to induce two separate systems:
Text vectorizer (as you said, it helps to numerically represent texts).
Classifier (to predict the labels for numerically represented texts).
Skip-Gram and co-occurrence matrix are ways to vectorize your texts (here is a nice article that explains their difference). In case of skip-gram you could download and use a 3rd party model that already has mapping of vectors to most of the words; in case of co-occurrence matrix you need to build it on your texts (if you have specific lexis, it will be a better way). In this matrix you could use different measures to represent the degree of co-occurrence of words with words or documents with documents. TF-IDF is one of such measures (that gives a score for every word-document pair); there are a lot of others (PMI, BM25, etc). This article should help to implement classification with co-occurrence matrix on your data. And this one gives an idea how to do the same with Word2Vec.
Hope it helped!
Related
I would like to create word embeddings that take context into account, so the vector of the word Jaguar [animal] would be different from the word Jaguar [car brand].
As you know, word2vec only gives one representation for a given word, and I would like to take already pretrained embeddings and enrich them with context. So far I've tried a simple way with taking an average vector of the word and category word, for example like this.
Now I would like to try to create and train a neural network that would take entire sentences, e.g.
Jaguar F-PACE is a great SUV sports car.
Among cats, only tigers and lions are bigger than jaguars.
And then it would undertake the task of text classification (I have a dataset with several categories like animals, cars, etc.), but the result would be new representations for the word jaguar, but in different contexts, so two different embeddings.
Does anyone have any idea how I could create such a network? I don't hide that I'm a beginner and have no idea how to go about it.
If you've already been able to perform sense-disambiguation outside word2vec, then you can change the word-tokens to reflect your external judgement. For example, change some appearances of the token 'jaguar' to 'jaguar*car' and others to 'jaguar*animal'. Proceeding with normal word2vec training will then get your two different tokens two different word-vectors.
If you're hoping for the training to discover these itself, as ~Erwan mentioned in a comment, that seems like an open research question, without a standard or off-the-shelf solution that a beginner could drop-in.
I'd once seen a paper (around the time of the original word2vec papers, but can't find the link now) that tried to do this in a word2vec-compatible way by 1st proceeding with traditional polysemy-oblivious training. Then, for every appearance of a word X, model its surrounding context via some combination of the word-vectors of neighbors within a certain number of positions. (That in itself is very similar to the preparation of a context-vector in the CBOW mode of word2vec.) Perform some clustering on that collection-of-all-contexts to come up with some idea of alternate senses – each associated with one cluster. Then, in a followup pass on the original corpus, replace word-tokens with those that also reflect their nearby-context cluster. (EG: 'jaguar' might be replaced with 'jaguar*1', 'jaguar*2', etc based on which discrete cluster its context suggested.) Then, repeat (or continue) word2vec training to get sense-specific word-vectors. Of course, the devil would be in the details of how contexts are defined, how clusters are deduced, and tough edge-cases (where potentially the text's author is themselves deploying the multiple senses).
Some other interesting efforts to model or deduce polysemy in word2vec models:
"Linear Algebraic Structure of Word Meanings"
"A Simple Approach to Learn Polysemous Word Embeddings"
But per above, I've not seen these sorts of techniques widely implemented/adopted in a form that's easy to drop-in to another project.
Is it possible to use Google BERT for calculating similarity between two textual documents? As I understand BERT's input is supposed to be a limited size sentences. Some works use BERT for similarity calculation for sentences like:
https://github.com/AndriyMulyar/semantic-text-similarity
https://github.com/beekbin/bert-cosine-sim
Is there an implementation of BERT done to use it for large documents instead of sentences as inputs ( Documents with thousands of words)?
BERT is not trained to determine if one sentence follows another. That is just ONE of the GLUE tasks and there are a myriad more. ALL of the GLUE tasks (and superglue) are getting knocked out of the park by ALBERT.
BERT (and Albert for that matter) is the absolute state of the art in Natural Language Understanding. Doc2Vec doesn't come close. BERT is not a bag-of-words method. It's a bi-directional attention based encoder built on the Transformer which is the incarnation of the Google Brain paper Attention is All you Need. Also see this Visual breakdown of the Transformer model.
This is a fundamentally new way of looking at natural language which doesn't use RNN's or LSTMs or tf-idf or any of that stuff. We aren't turning words or docs into vectors anymore. GloVes: Global Vectors for Word Representations with LSTMs are old. Doc2Vec is old.
BERT is reeeeeallly powerful - like, pass the Turing test easily powerful. Take a look at
See superGLUE which just came out. Scroll to the bottom at look at how insane those tasks are. THAT is where NLP is at.
Okay so now that we have dispensed with the idea that tf-idf is state of the art - you want to take documents and look at their similarity? I would use ALBERT on Databricks in two layers:
Perform either Extractive or Abstractive summarization: https://pypi.org/project/bert-extractive-summarizer/ (NOTICE HOW BIG THOSE DOCUMENTS OF TEXT ARE - and reduce your document down to a summary.
In a separate step, take each summary and do the STS-B task from Page 3 GLUE
Now, we are talking about absolutely bleeding edge technology here (Albert came out in just the last few months). You will need to be extremely proficient to get through this but it CAN be done, and I believe in you!!
BERT is a sentence representation model. It is trained to predict words in a sentence and to decide if two sentences follow each other in a document, i.e., strictly on the sentence level. Moreover, BERT requires quadratic memory with respect to the input length which would not be feasible with documents.
It is quite common practice to average word embeddings to get a sentence representation. You can try the same thing with BERT and average the [CLS] vectors from BERT over sentences in a document.
There are some document-level embeddings. For instance doc2vec is a commonly used option.
As far as I know, at the document level, frequency-based vectors such as tf-idf (with a good implementation in scikit-learn) are still close to state of the art, so I would not hesitate using it. Or at least it is worth trying to see how it compares to embeddings.
To add to #jindřich answer, BERT is meant to find missing words in a sentence and predict next sentence. Word embedding based doc2vec is still a good way to measure similarity between docs. If you want to delve deeper into why every best model can't be the best choice for a use case, give this post a read where it clearly explains why not every state-of-the-art model is suitable for a task.
Ya. You would just do each part independently. For summarization you hardly need to do much. Just look on pypi for summarize and you have several packages. Don't even need to train. Now for sentence to sentence similarity there is a fairly complex method for getting loss but it's spelled out in the GLUE website. It's considerd part of the challenge (meeting the metric). Determining that distance (sts) is non trivial and I think they call it "coherence" but I'm not sure.
I have some comments dataset which I want to classify into five categories :-
jewelries, clothes, shoes, electronics, food & beverages
So if someones talking about pork, steak, wine, soda, eat : its classified into f&b
Whereas if somebodys talking about say - gold, pendent, locket etc : its classified into jewelries
I want to know , what tags/tokens should I be looking for in a comment/tweet so as to classify it into any of these categories. Finally which classifier to use. I just need some guidance and suggestions , Ill take it from there.
Please help. Thanks
This answer can be a bit long and perhaps I abstract a few things away, but it's just to give you an idea and some advice.
Supervised Vs Unsupervised
As others already mentioned, in the land of machine learning there are 2 main roads: Supervised and Unsupervised learning. As you probably already know by now, if your corpus(documents) are labeled, you are talking about supervised learning. The labels are the categories and are in this case boolean values.
For instance if a text is related to clothes and shoes the labels for those categories should be true.
Since a text can be related to multiple categories (multiple labels), we are looking at multiclassifiers.
What to use?
I presume that the dataset is not yet labeled, since twitter does not do this categorisation for you. So here comes a big decision on your part.
You label the data manually, which means you try to look at as much tweets/fb messages in your dataset and for each of them you consider the 5 categories and answer them by True/False.
You decide to use a unsupervised learning algorithm and hope that you discover these 5 categories. Since approaches like clustering will just try to find categories on their own and these don't have to match your 5 predefined categories by default.
I've used quite some supervised learning in the past and have had good experience with this type of learning, therefore I will continue explaining this path.
Feature Engineering
You have to come up with the features that you want to use. For text classification, a good approach is to use each possible word in the document as a feature. A value of True represents if the word is present in the document, false represents absence.
Before doing this, you need to do some preprocessing. This can be done by using various features provided by the NLTK library.
Tokenization this will break your text up into a list of words. You can use this module.
Stopword removal this will remove common words out of the tokens. Words likes 'a',the',... You can take a look at this.
Stemming stemming will transform words to their stem-form. For example: the words 'working','worked','works' will be transformed to 'work'. Take a look at this.
Now if you have preprocessed the data, then generate a featureset for each word that exists in the documents. There exist automatic methods and filters for this, but I'm not sure how to do this in Python.
Classification
There are multiple classifiers that you can use for this purpose. I suggest to take a deeper look at the ones that exist and their benefits.You can user the nltk classifier which supports multiclassification, but to be honest I never tried that one before. In the past I've used Logistic Regression and SVM.
Training & testing
You will use a part of your data for training and a part for validating if the trained model performs well. I suggest you to use cross-validation, because you will have a small dataset (you have to manually label the data, which is cumbersome). The benefit of cross-validation is that you don't have to split your dataset in a training set and testing set. Instead it will run in multiple rounds and iterate through the data for a part training data and a part testing data. Resulting in all the data being used at least once in your training data.
Predicting
Once your model is built and the outcome of the predictions on 'test-data' is plausible. You can use your model in the wild to predict the categories of the new Facebook messages/tweets.
Tools
The NLTK library is great for preprocessing and natural language processing, but I never used it before for classification. I've heard a lot of great things about the scikit python library. But to be fair honest, I prefer to use Weka, which is a data mining tool written in java, offering a great UI and which speeds up your task a lot!
From a different angle: Topic modelling
In your question you state that you want to classify the dataset into five categories. I would like to show you the idea of topic modelling. It might not be useful in your scenario if you are really only targeting those categories (that's why I leave this part at the end of my answer). However if your goal is to categorise the tweets/fb messages into non-predefined categories, topic modelling is the way to go.
Topic modeling is an unsupervised learning method, where you decide in advance the amount of topics(categories) you want to 'discover'. This number can be high (e.g. 40) Now the cool thing is that the algorithm will find 40 topics that contain words that have something related. It will also output for each document a distribution that indicates to which topics the document is related. This way you can discover a lot more categories than your 5 predefined ones.
Now I'm not gonna go much deeper into this, but just google it if you want more information. In addition you could consider to use MALLET which is an excellent tool for topic modelling.
Well this is kind of a big subject.
You mentioned Python, so you should have a look at the NLTK library which allows you to process natural language, such as your comments.
After this step, you should have a classifier which will map the words you retrieved to a certain class. NTLK also have tools for classification which is linked to knowledge databases. If you are lucky, the categories you are looking for are already available; otherwise you may have to build them yourself. You can have a look at this example which uses NTLK and the WordNet database. You can have access to the Synset, which seems to be pretty broad; and you can also have a look at the hypersets (see for example list(dog.closure(hyper)) ).
Basically you should consider using a multiclassifier on the whole tokenized text (comments on Facebook and tweets are usually short. You might also decide to only consider FB comments below 200 characters, your choice). The choice of a multiclassifier is motivated by non-orthogonality of your classification set (clothes, shoes and jewelries can be the same object; you could have electronic jewelry [ie smartwatches], etc.). This is a fairly simple setup but it's an interesting first step, whose strengths and weaknesses will allow you to iterate easily (if needed).
Good luck!
What you're looking for is in the subject of
Natural Language Processing (NLP) : processing text data and
Machine learning (where the classification models are built)
First I would suggesting going through NLP tutorials and then text classification tutorials, the most appropriate being https://class.coursera.org/nlp/lecture
If you're looking for libraries available in python or java, take a look at Java or Python for Natural Language Processing
If you're new to text processing, please take a look at the NLTK library that provides a nice introduction to doing NLP, see http://www.nltk.org/book/ch01.html
Now to the hard core details:
First, ask yourself whether you have twitter/facebook comments (let's call them documents from now on) that are manually labelled with the categories you want.
1a. If YES, look at supervised machine learning, see http://scikit-learn.org/stable/tutorial/basic/tutorial.html
1b. If NO, look at UNsupervised machine learning, i suggest clustering and topic modelling, http://radimrehurek.com/gensim/
After knowing which kind of machine learning you need, split the documents up into at least training (70-90%) and testing (10-30%) set, see
Note. I suggest at least because there are other ways to split up your documents, e.g. for development or cross-validation. (if you don't understand this, it's all right, just follow step 2)
Finally, Train and Test your model
3a. If supervised, use the training set to train your supervised model. Apply your model onto the test set and then see how well you performed.
3b. If unsupervised, use the training set to generate documents clusters (that means to group similar documents) but they still have no labels. So you need to think of some smart way to label the groups of documents correctly. (To this date, there is no real good solution to this, even super effective neural networks cannot know what the neurons are firing, they just know each neuron is firing something specific)
The title says it all; I have an SQL database bursting at the seams with online conversation text. I've already done most of this project in Python, so I would like to do this using Python's NLTK library (unless there's a strong reason not to).
The data is organized by Thread, Username, and Post. Each thread more or less focuses on discussing one "product" of the Category that I am interested in analyzing. Ultimately, when this is finished, I would like to have an estimated opinion (like/dislike sort of deal) from each user for any of the products they had discussed at some point.
So, what I would like to know:
1) How can I go about determining what product each thread is about? I was reading about keyword extraction... is that the correct method?
2) How do I determine a specific users sentiment based on their posts? From my limited understanding, I must first "train" NLTK to recognize certain indicators of opinion, and then do I simply determine the context of those words when they appear in the text?
As you may have guessed by now, I have no prior experience with NLP. From my reading so far, I think I can handle learning it though. Even just a basic and crude working model for now would be great if someone can point me in the right direction. Google was not very helpful to me.
P.S. I have permission to analyze this data (in case it matters)
Training any classifier requires a training set of labeled data and a feature extractor to obtain feature sets for each text. After you have a trained classifier, you can apply it to previously unseen text (unlabeled) and obtain a classification based on the machine learning algorithm used. NLTK gives a good explanation and some samples to play around with.
If you are interested in building a classifier for positive/negative sentiment, using your own training dataset, I would avoid simple keyword counts, as they aren't accurate for a number of reasons (eg. negation of positive words: "not happy"). An alternative, where you can still use a large training set without having to manually label anything, is distant supervision. Basically, this approach uses emoticons or other specific text elements as noisy labels. You still have to choose which features are relevant but many studies have had good results with simply using unigrams or bigrams (individual words or pairs of words, respectively).
All of this can be done relatively easily with Python and NLTK. You can also choose to use a tool like NLTK-trainer, which is a wrapper for NLTK and requires less code.
I think this study by Go et al. is one of the easiest to understand. You can also read other studies for distant supervision, distant supervision sentiment analysis, and sentiment analysis.
There are a few built-in classifiers in NLTK with both training and classification methods (Naive Bayes, MaxEnt, etc.) but if you are interested in using Support Vector Machines (SVM) then you should look elsewhere. Technically NLTK provides you with an SVM class but its really just a wrapper for PySVMLight, which itself is a wrapper for SVMLight, written in C. I had numerous problems with this approach though, and would instead recommend LIBSVM.
For determining the topic, many have used simple keywords but there are some more complex methods available.
You could train any classifier with similar datasets and see what the results are when you apply it to your data. For example, the NLTK contains the Movie Reviews Corpus that contains 1000 positive and 1000 negative reviews. Here is an example on how to train a Naive Bayes Classifier with it. Some other review datasets like Amazon Product Review data are available here.
Another possibility is to take a list of positive and negative words like this one and count their frequencies in your dataset. If you want a complete list, use SentiWordNet.
I have a human tagged corpus of over 5000 subject indexed documents in XML. They vary in size from a few hundred kilobytes to a few hundred megabytes. Being short articles to manuscripts. They have all been subjected indexed as deep as the paragraph level. I am lucky to have such a corpus available, and I am trying to teach myself some NLP concepts. Admittedly, I've only begun. Thus far reading only the freely available NLTK book, streamhacker, and skimming jacobs(?) NLTK cookbook. I like to experiment with some ideas.
It was suggested to me, that perhaps, I could take bi-grams and use naive Bayes classification to tag new documents. I feel as if this is the wrong approach. a Naive Bayes is proficient at a true/false sort of relationship, but to use it on my hierarchical tag set I would need to build a new classifier for each tag. Nearly a 1000 of them. I have the memory and processor power to undertake such a task, but am skeptical of the results. However, I will be trying this approach first, to appease someones request. I should likely have this accomplished in the next day or two, but I predict the accuracy to be low.
So my question is a bit open ended. Laregly becuase of the nature of the discipline and the general unfamilirity with my data it will likely be hard to give an exact answer.
What sort of classifier would be appropriate for this task. Was I wrong can a Bayes be used for more than a true/false sort of operation.
what feature extraction should I pursue for such a task. I am not expecting much with the bigrams.
Each document also contains some citational information including, author/s, an authors gender of m,f,mix(m&f),and other (Gov't inst et al.), document type, published date(16th cent. to current), human analyst, and a few other general elements. I'd also appreciate some useful descriptive tasks to help investigate this data better for gender bias, analyst bias, etc. But realize that is a bit beyond the scope of this question.
What sort of classifier would be appropriate for this task. Was I wrong can a Bayes be used for more than a true/false sort of operation.
You can easily build a multilabel classifier by building a separate binary classifier for each class, that can distinguish between that class and all others. The classes for which the corresponding classifier yields a positive value are the combined classifier's output. You can use Naïve Bayes for this or any other algorithm. (You could also play tricks with NB's probability output and a threshold value, but NB's probability estimates are notoriously bad; only its ranking among them is what makes it valuable.)
what feature extraction should I pursue for such a task
For text classification, tf-idf vectors are known to work well, but you haven't specified what the exact task is. Any metadata on the documents might work as well; try doing some simple statistical analysis. If any feature of the data is more frequently present in some classes than in others, it may be a useful feature.
I understand that you have two tasks to solve here. The 1st one is that you want to tag an article based on its topic(?) and thus the article can be classified in more than one categories/classes and thus you have a multi-label classification problem. There are several algorithms proposed for solving a multi-label classification problem - please check the literature. I found this paper quite helpful when I was dealing with a similar problem: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.9401
The 2nd problem you want to solve is to tag the paper with authors, gender, type of document. This is a multi-class problem - each class has more than two potential values but all documents have some values for these classes.
I think as a first step it is important to understand the differences between multi-class and multi-label classification.