Training a model from multiple corpus - python

Imagine I have a fasttext model that had been trained thanks to the Wikipedia articles (like explained on the official website).
Would it be possible to train it again with another corpus (scientific documents) that could add new / more pertinent links between words? especially for the scientific ones ?
To summarize, I would need the classic links that exist between all the English words coming from Wikipedia. But I would like to enhance this model with new documents about specific sectors. Is there a way to do that ? And if yes, is there a way to maybe 'ponderate' the trainings so relations coming from my custom documents would be 'more important'.
My final wish is to compute cosine similarity between documents that can be very scientific (that's why to have better results I thought about adding more scientific documents)

Adjusting more-generic models with your specific domain training data is often called "fine-tuning".
The gensim implementation of FastText allows an existing model to expand its known-vocabulary via what's seen in new training data (via build_vocab(..., update=True)) and then for further training cycles including that new vocabulary to occur (through train()).
But, doing this particular form of updating introduces murky issues of balance between older and newer training data, with no clear best practices.
As just one example, to the extent there are tokens/ngrams in the original model that don't recur in the new data, new training is pulling those in the new data into new positions that are optimal for the new data... but potentially arbitrarily far from comparable compatibility with the older tokens/ngrams.)
Further, it's likely some model modes (like negative-sampling versus hierarchical-softmax), and some mixes of data, have a better chance of net-benefiting from this approach than others – but you pretty much have to hammer out the tradeoffs yourself, without general rules to rely upon.
(There may be better fine-tuning strategies for other kinds models; this is just speaking to the ability of the gensim FastText to update-vocabulary and repeat-train.)
But perhaps, your domain of interest is scientific texts. And maybe you also have a lot of representative texts – perhaps even, at training time, the complete universe of papers you'll want to compare.
In that case, are you sure you want to deal with the complexity of starting with a more-generic word-model? Why would you want to contaminate your analysis with any of the dominant word-senses in generic reference material, like Wikipedia, if in fact you already have sufficiently-varied and representative examples of your domain words in your domain contexts?
So I would recommend 1st trying to train your own model, from your own representative data. And only if you then fear you're missing important words/senses, try mixing in Wikipedia-derived senses. (At that point, another way to mix in that influence would be to mix Wikipedia texts with your other corpus. And you should also be ready to test whether that really helps or hurts – because it could be either.)
Also, to the extent your real goal is comparing full papers, you might want to look into other document-modeling strategies, including bag-of-words representations, the Doc2Vec ('Paragraph Vector') implementation in gensim, or others. Those approaches will not necessarily require per-word vectors as an input, but might still work well for quantifying text-to-text similarities.

Related

Calculating optimal number of topics for topic modeling (LDA)

I am going to do topic modeling via LDA. I run my commands to see the optimal number of topics. The output was as follows: It is a bit different from any other plots that I have ever seen. Do you think it is okay? or it is better to use other algorithms rather than LDA. It is worth mentioning that when I run my commands to visualize the topics-keywords for 10 topics, the plot shows 2 main topics and the others had almost a strong overlap. Is there any valid range for coherence?
Many thanks to share your comments as I am a beginner in topic modeling.
Shameless self-promotion: I suggest you use the OCTIS library: https://github.com/mind-Lab/octis
It allows you to run different topic models and optimize their hyperparameters (also the number of topics) in order to select the best result.
There might be many reasons why you get those results. But here some hints and observations:
Make sure that you've preprocessed the text appropriately. This usually includes removing punctuation and numbers, removing stopwords and words that are too frequent or rare, (optionally) lemmatizing the text. Preprocessing is dependent on the language and the domain of the texts.
LDA is a probabilistic model, which means that if you re-train it with the same hyperparameters, you will get different results each time. A good practice is to run the model with the same number of topics multiple times and then average the topic coherence.
There are a lot of topic models and LDA works usually fine. The choice of the topic model depends on the data that you have. For example, if you are working with tweets (i.e. short texts), I wouldn't recommend using LDA because it cannot handle well sparse texts.
Check how you set the hyperparameters. They may have a huge impact on the performance of the topic model.
The range for coherence (I assume you used NPMI which is the most well-known) is between -1 and 1, but values very close to the upper and lower bound are quite rare.
References: https://www.aclweb.org/anthology/2021.eacl-demos.31/

Concepts to measure text "relevancy" to a subject?

I do side work writing/improving a research project web application for some political scientists. This application collects articles pertaining to the U.S. Supreme Court and runs analysis on them, and after nearly a year and half, we have a database of around 10,000 articles (and growing) to work with.
One of the primary challenges of the project is being able to determine the "relevancy" of an article - that is, the primary focus is the federal U.S. Supreme Court (and/or its justices), and not a local or foreign supreme court. Since its inception, the way we've addressed it is to primarily parse the title for various explicit references to the federal court, as well as to verify that "supreme" and "court" are keywords collected from the article text. Basic and sloppy, but it actually works fairly well. That being said, irrelevant articles can find their way into the database - usually ones with headlines that don't explicitly mention a state or foreign country (the Indian Supreme Court is the usual offender).
I've reached a point in development where I can focus on this aspect of the project more, but I'm not quite sure where to start. All I know is that I'm looking for a method of analyzing article text to determine its relevance to the federal court, and nothing else. I imagine this will entail some machine learning, but I've basically got no experience in the field. I've done a little reading into things like tf-idf weighting, vector space modeling, and word2vec (+ CBOW and Skip-Gram models), but I'm not quite seeing a "big picture" yet that shows me how just how applicable these concepts can be to my problem. Can anyone point me in the right direction?
Framing the Problem
When starting a novel machine learning project like this there are a few fundamental questions to think through that can help you refine the problem and lit review + experiment more effectively.
Do you have the right data to build a model? You have ~10,000 articles that will be your model input, however, to use a supervised learning approach you will need trustworthy labels for all articles that will be used in model training. It sounds like you already have done this.
What metric(s) to use to quantify success. How can you measure if your model is doing what you want? In your specific case this sounds like a binary classification problem - you want to be able to label articles as relevant or not. You could measure your success using a standard binary classification metric like area under the ROC. Or since you have a specific issue with False Positives you could choose a metric like Precision.
How well can you do with a random or naive approach. Once a dataset and metric have been established you can quantify how well you can do at your task with a basic approach. This could be a simple as calculating your metric for a model that chooses at random, but in your case you have your keyword parser model which is a perfect way to set a bench mark. Quantify how well your keyword parsing approach does for your dataset so you can determine when a machine learning model is doing well.
Sorry if this was obvious and basic to you but I wanted to make sure it was in the answer. In an innovative open ended project like this diving straight into machine learning experiments without thinking through these fundamentals can be inefficient.
Machine Learning Approaches
As suggested by Evan Mata and Stefan G, the best approach is to first reduce your articles into features. This could be done without machine learning (eg vector space model) or with machine learning (word2vec and other examples you cited). For your problem I think something like BOW makes sense to try as a starting point.
Once you have a feature representation of your articles you are almost done and there are a number of binary classification models that will do well. Experiment from here to find the best solution.
Wikipedia has a nice example of a simple way to use this two step approach in spam filtering, an analogous problem (See the Example Usage section of the article).
Good luck, sounds like a fun project!
If you have sufficient labeled data - not only for "yes this article is relevant" but also for "no this article is not relevant" (you're basically making a binary model between y/n relevant - so I would research spam filters) then you can train a fair model. I don't know if you actually have a decent quantity of no-data. If you do, you could train a relatively simple supervised model by doing (pesudocode) the following:
Corpus = preprocess(Corpus) #(remove stop words, etc.)
Vectors = BOW(Corpus) #Or TFIDF or Whatever model you want to use
SomeModel.train(Vectors[~3/4 of them], Labels[corresponding 3/4]) #Labels = 1 if relevant, 0 if not
SomeModel.evaluate(Vectors[remainder], Labels[remainder]) #Make sure the model doesn't overfit
SomeModel.Predict(new_document)
The exact model will depend on your data. A simple Naive-Bayes could (probably will) work fine if you can get a decent number of no-documents. One note - you imply that you have two kinds of no-documents - those that are reasonably close (Indian Supreme Court) or those that are completely irrelevant (say Taxes). You should test training with "close" erroneous cases with "far" erroneous cases filtered out as you do now vs both "close" erroneous cases and "far" erroneous cases and see which one comes out better.
There are many many ways to do this, and the best method changes depending on the project. Perhaps the easiest way to do this is to keyword search in your articles and then empirically choose a cut off score. Although simple, this actually works pretty well, especially in a topic like this one where you can think of a small list of words that are highly likely to appear somewhere in a relevant article.
When a topic is more broad with something like 'business' or 'sports', keyword search can be prohibitive and lacking. This is when a machine learning approach might start to become the better idea. If machine learning is the way you want to go, then there are two steps:
Embed your articles into feature vectors
Train your model
Step 1 can be something simple like a TFIDF vector. However, embedding your documents can also be deep learning on its own. This is where CBOW and Skip-Gram come into play. A popular way to do this is Doc2Vec (PV-DM). A fine implementation is in the Python Gensim library. Modern and more complicated character, word, and document embeddings are much more of a challenge to start with, but are very rewarding. Examples of these are ELMo embeddings or BERT.
Step 2 can be a typical model, as it is now just binary classification. You can try a multilayer neural network, either fully-connected or convolutional, or you can try simpler things like logistic regression or Naive Bayes.
My personal suggestion would be to stick with TFIDF vectors and Naive Bayes. From experience, I can say that this works very well, is by far the easiest to implement, and can even outperform approaches like CBOW or Doc2Vec depending on your data.

twitter/facebook comments classification into various categories

I have some comments dataset which I want to classify into five categories :-
jewelries, clothes, shoes, electronics, food & beverages
So if someones talking about pork, steak, wine, soda, eat : its classified into f&b
Whereas if somebodys talking about say - gold, pendent, locket etc : its classified into jewelries
I want to know , what tags/tokens should I be looking for in a comment/tweet so as to classify it into any of these categories. Finally which classifier to use. I just need some guidance and suggestions , Ill take it from there.
Please help. Thanks
This answer can be a bit long and perhaps I abstract a few things away, but it's just to give you an idea and some advice.
Supervised Vs Unsupervised
As others already mentioned, in the land of machine learning there are 2 main roads: Supervised and Unsupervised learning. As you probably already know by now, if your corpus(documents) are labeled, you are talking about supervised learning. The labels are the categories and are in this case boolean values.
For instance if a text is related to clothes and shoes the labels for those categories should be true.
Since a text can be related to multiple categories (multiple labels), we are looking at multiclassifiers.
What to use?
I presume that the dataset is not yet labeled, since twitter does not do this categorisation for you. So here comes a big decision on your part.
You label the data manually, which means you try to look at as much tweets/fb messages in your dataset and for each of them you consider the 5 categories and answer them by True/False.
You decide to use a unsupervised learning algorithm and hope that you discover these 5 categories. Since approaches like clustering will just try to find categories on their own and these don't have to match your 5 predefined categories by default.
I've used quite some supervised learning in the past and have had good experience with this type of learning, therefore I will continue explaining this path.
Feature Engineering
You have to come up with the features that you want to use. For text classification, a good approach is to use each possible word in the document as a feature. A value of True represents if the word is present in the document, false represents absence.
Before doing this, you need to do some preprocessing. This can be done by using various features provided by the NLTK library.
Tokenization this will break your text up into a list of words. You can use this module.
Stopword removal this will remove common words out of the tokens. Words likes 'a',the',... You can take a look at this.
Stemming stemming will transform words to their stem-form. For example: the words 'working','worked','works' will be transformed to 'work'. Take a look at this.
Now if you have preprocessed the data, then generate a featureset for each word that exists in the documents. There exist automatic methods and filters for this, but I'm not sure how to do this in Python.
Classification
There are multiple classifiers that you can use for this purpose. I suggest to take a deeper look at the ones that exist and their benefits.You can user the nltk classifier which supports multiclassification, but to be honest I never tried that one before. In the past I've used Logistic Regression and SVM.
Training & testing
You will use a part of your data for training and a part for validating if the trained model performs well. I suggest you to use cross-validation, because you will have a small dataset (you have to manually label the data, which is cumbersome). The benefit of cross-validation is that you don't have to split your dataset in a training set and testing set. Instead it will run in multiple rounds and iterate through the data for a part training data and a part testing data. Resulting in all the data being used at least once in your training data.
Predicting
Once your model is built and the outcome of the predictions on 'test-data' is plausible. You can use your model in the wild to predict the categories of the new Facebook messages/tweets.
Tools
The NLTK library is great for preprocessing and natural language processing, but I never used it before for classification. I've heard a lot of great things about the scikit python library. But to be fair honest, I prefer to use Weka, which is a data mining tool written in java, offering a great UI and which speeds up your task a lot!
From a different angle: Topic modelling
In your question you state that you want to classify the dataset into five categories. I would like to show you the idea of topic modelling. It might not be useful in your scenario if you are really only targeting those categories (that's why I leave this part at the end of my answer). However if your goal is to categorise the tweets/fb messages into non-predefined categories, topic modelling is the way to go.
Topic modeling is an unsupervised learning method, where you decide in advance the amount of topics(categories) you want to 'discover'. This number can be high (e.g. 40) Now the cool thing is that the algorithm will find 40 topics that contain words that have something related. It will also output for each document a distribution that indicates to which topics the document is related. This way you can discover a lot more categories than your 5 predefined ones.
Now I'm not gonna go much deeper into this, but just google it if you want more information. In addition you could consider to use MALLET which is an excellent tool for topic modelling.
Well this is kind of a big subject.
You mentioned Python, so you should have a look at the NLTK library which allows you to process natural language, such as your comments.
After this step, you should have a classifier which will map the words you retrieved to a certain class. NTLK also have tools for classification which is linked to knowledge databases. If you are lucky, the categories you are looking for are already available; otherwise you may have to build them yourself. You can have a look at this example which uses NTLK and the WordNet database. You can have access to the Synset, which seems to be pretty broad; and you can also have a look at the hypersets (see for example list(dog.closure(hyper)) ).
Basically you should consider using a multiclassifier on the whole tokenized text (comments on Facebook and tweets are usually short. You might also decide to only consider FB comments below 200 characters, your choice). The choice of a multiclassifier is motivated by non-orthogonality of your classification set (clothes, shoes and jewelries can be the same object; you could have electronic jewelry [ie smartwatches], etc.). This is a fairly simple setup but it's an interesting first step, whose strengths and weaknesses will allow you to iterate easily (if needed).
Good luck!
What you're looking for is in the subject of
Natural Language Processing (NLP) : processing text data and
Machine learning (where the classification models are built)
First I would suggesting going through NLP tutorials and then text classification tutorials, the most appropriate being https://class.coursera.org/nlp/lecture
If you're looking for libraries available in python or java, take a look at Java or Python for Natural Language Processing
If you're new to text processing, please take a look at the NLTK library that provides a nice introduction to doing NLP, see http://www.nltk.org/book/ch01.html
Now to the hard core details:
First, ask yourself whether you have twitter/facebook comments (let's call them documents from now on) that are manually labelled with the categories you want.
1a. If YES, look at supervised machine learning, see http://scikit-learn.org/stable/tutorial/basic/tutorial.html
1b. If NO, look at UNsupervised machine learning, i suggest clustering and topic modelling, http://radimrehurek.com/gensim/
After knowing which kind of machine learning you need, split the documents up into at least training (70-90%) and testing (10-30%) set, see
Note. I suggest at least because there are other ways to split up your documents, e.g. for development or cross-validation. (if you don't understand this, it's all right, just follow step 2)
Finally, Train and Test your model
3a. If supervised, use the training set to train your supervised model. Apply your model onto the test set and then see how well you performed.
3b. If unsupervised, use the training set to generate documents clusters (that means to group similar documents) but they still have no labels. So you need to think of some smart way to label the groups of documents correctly. (To this date, there is no real good solution to this, even super effective neural networks cannot know what the neurons are firing, they just know each neuron is firing something specific)

Sentiment Analysis on LARGE collection of online conversation text

The title says it all; I have an SQL database bursting at the seams with online conversation text. I've already done most of this project in Python, so I would like to do this using Python's NLTK library (unless there's a strong reason not to).
The data is organized by Thread, Username, and Post. Each thread more or less focuses on discussing one "product" of the Category that I am interested in analyzing. Ultimately, when this is finished, I would like to have an estimated opinion (like/dislike sort of deal) from each user for any of the products they had discussed at some point.
So, what I would like to know:
1) How can I go about determining what product each thread is about? I was reading about keyword extraction... is that the correct method?
2) How do I determine a specific users sentiment based on their posts? From my limited understanding, I must first "train" NLTK to recognize certain indicators of opinion, and then do I simply determine the context of those words when they appear in the text?
As you may have guessed by now, I have no prior experience with NLP. From my reading so far, I think I can handle learning it though. Even just a basic and crude working model for now would be great if someone can point me in the right direction. Google was not very helpful to me.
P.S. I have permission to analyze this data (in case it matters)
Training any classifier requires a training set of labeled data and a feature extractor to obtain feature sets for each text. After you have a trained classifier, you can apply it to previously unseen text (unlabeled) and obtain a classification based on the machine learning algorithm used. NLTK gives a good explanation and some samples to play around with.
If you are interested in building a classifier for positive/negative sentiment, using your own training dataset, I would avoid simple keyword counts, as they aren't accurate for a number of reasons (eg. negation of positive words: "not happy"). An alternative, where you can still use a large training set without having to manually label anything, is distant supervision. Basically, this approach uses emoticons or other specific text elements as noisy labels. You still have to choose which features are relevant but many studies have had good results with simply using unigrams or bigrams (individual words or pairs of words, respectively).
All of this can be done relatively easily with Python and NLTK. You can also choose to use a tool like NLTK-trainer, which is a wrapper for NLTK and requires less code.
I think this study by Go et al. is one of the easiest to understand. You can also read other studies for distant supervision, distant supervision sentiment analysis, and sentiment analysis.
There are a few built-in classifiers in NLTK with both training and classification methods (Naive Bayes, MaxEnt, etc.) but if you are interested in using Support Vector Machines (SVM) then you should look elsewhere. Technically NLTK provides you with an SVM class but its really just a wrapper for PySVMLight, which itself is a wrapper for SVMLight, written in C. I had numerous problems with this approach though, and would instead recommend LIBSVM.
For determining the topic, many have used simple keywords but there are some more complex methods available.
You could train any classifier with similar datasets and see what the results are when you apply it to your data. For example, the NLTK contains the Movie Reviews Corpus that contains 1000 positive and 1000 negative reviews. Here is an example on how to train a Naive Bayes Classifier with it. Some other review datasets like Amazon Product Review data are available here.
Another possibility is to take a list of positive and negative words like this one and count their frequencies in your dataset. If you want a complete list, use SentiWordNet.

NLTK/NLP buliding a many-to-many/multi-label subject classifier

I have a human tagged corpus of over 5000 subject indexed documents in XML. They vary in size from a few hundred kilobytes to a few hundred megabytes. Being short articles to manuscripts. They have all been subjected indexed as deep as the paragraph level. I am lucky to have such a corpus available, and I am trying to teach myself some NLP concepts. Admittedly, I've only begun. Thus far reading only the freely available NLTK book, streamhacker, and skimming jacobs(?) NLTK cookbook. I like to experiment with some ideas.
It was suggested to me, that perhaps, I could take bi-grams and use naive Bayes classification to tag new documents. I feel as if this is the wrong approach. a Naive Bayes is proficient at a true/false sort of relationship, but to use it on my hierarchical tag set I would need to build a new classifier for each tag. Nearly a 1000 of them. I have the memory and processor power to undertake such a task, but am skeptical of the results. However, I will be trying this approach first, to appease someones request. I should likely have this accomplished in the next day or two, but I predict the accuracy to be low.
So my question is a bit open ended. Laregly becuase of the nature of the discipline and the general unfamilirity with my data it will likely be hard to give an exact answer.
What sort of classifier would be appropriate for this task. Was I wrong can a Bayes be used for more than a true/false sort of operation.
what feature extraction should I pursue for such a task. I am not expecting much with the bigrams.
Each document also contains some citational information including, author/s, an authors gender of m,f,mix(m&f),and other (Gov't inst et al.), document type, published date(16th cent. to current), human analyst, and a few other general elements. I'd also appreciate some useful descriptive tasks to help investigate this data better for gender bias, analyst bias, etc. But realize that is a bit beyond the scope of this question.
What sort of classifier would be appropriate for this task. Was I wrong can a Bayes be used for more than a true/false sort of operation.
You can easily build a multilabel classifier by building a separate binary classifier for each class, that can distinguish between that class and all others. The classes for which the corresponding classifier yields a positive value are the combined classifier's output. You can use Naïve Bayes for this or any other algorithm. (You could also play tricks with NB's probability output and a threshold value, but NB's probability estimates are notoriously bad; only its ranking among them is what makes it valuable.)
what feature extraction should I pursue for such a task
For text classification, tf-idf vectors are known to work well, but you haven't specified what the exact task is. Any metadata on the documents might work as well; try doing some simple statistical analysis. If any feature of the data is more frequently present in some classes than in others, it may be a useful feature.
I understand that you have two tasks to solve here. The 1st one is that you want to tag an article based on its topic(?) and thus the article can be classified in more than one categories/classes and thus you have a multi-label classification problem. There are several algorithms proposed for solving a multi-label classification problem - please check the literature. I found this paper quite helpful when I was dealing with a similar problem: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.9401
The 2nd problem you want to solve is to tag the paper with authors, gender, type of document. This is a multi-class problem - each class has more than two potential values but all documents have some values for these classes.
I think as a first step it is important to understand the differences between multi-class and multi-label classification.

Categories

Resources