I am working on an LDA model to identify the topics of ~100,000 online courses based on their course descriptions and titles. Later in the process, I would like to use these topics to cluster the courses. So far, the topics identified by our model have not been great, and I am looking for ways to improve - and to get some thoughts on my attempts at improvement. Here is a quick summary of our current - pretty standard - approach, as well as some ideas I have for improvement:
Merging title, subtitle, and course description
Removing descriptions of length < 100 words, as well as non-English descriptions
For training, I am using only longer descriptions in English. Of course, this means that courses with non-English descriptions will be classified more or less randomly.
Picking a random sample of 30,000 descriptions
The number is somewhat arbitrary. I have noticed that the topics are "clearer" when using fewer descriptions for training. However, we don't want our topics to be biased by the particular descriptions that happen to be chosen in this step.
Removing stopwords
Both self-defined stopwords and stopwords from a standard library.
Removing punctuation
Lemmatizing words
Removing words that appear in over 50% of documents (a rough code sketch of these steps follows this list)
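As referenced above, here is a minimal sketch of the preprocessing and training pipeline, assuming gensim and NLTK; docs stands in for the merged title/subtitle/description strings, and the example self-defined stopwords and the number of topics are placeholders, not our actual values:

import gensim
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer   # requires the NLTK wordnet data

lemmatizer = WordNetLemmatizer()
custom_stopwords = STOPWORDS.union({"course", "learn", "student"})   # library stopwords plus self-defined examples

def preprocess(text):
    # Lowercase, strip punctuation, drop stopwords, lemmatize
    tokens = simple_preprocess(text, deacc=True)
    return [lemmatizer.lemmatize(t) for t in tokens if t not in custom_stopwords]

tokenized = [preprocess(d) for d in docs]                 # docs: merged title/subtitle/description strings
dictionary = corpora.Dictionary(tokenized)
dictionary.filter_extremes(no_above=0.5)                  # remove words appearing in over 50% of documents
corpus = [dictionary.doc2bow(doc) for doc in tokenized]
lda = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=20, passes=10, random_state=0)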
To identify recurring topics, I ran the model multiple times in a for loop and printed the resulting topics. Based on the topic overlap across those iterations, I am considering adding Wikipedia articles related to the recurring topics to the descriptions we use for training. This way I am hoping to "strengthen" those topics in the training data and make them clearer, in the hope of getting more interpretable topics. Currently, I am adding around 150 Wikipedia articles to a corpus of 30,000 course descriptions, and the results seem promising.
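For reference, a minimal sketch of that stability check, reusing the corpus and dictionary from the sketch above; the number of runs and topics are placeholders:

# Train several LDA models with different seeds and print the top words per topic,
# so recurring topics across runs can be spotted by eye.
for seed in range(5):
    lda = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=20,
                                 passes=10, random_state=seed)
    print(f"--- run {seed} ---")
    for topic_id, words in lda.show_topics(num_topics=-1, num_words=10, formatted=True):
        print(topic_id, words)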
My main question is: Is the approach of adding pre-selected wikipedia articles into our training data valid? What are the implications of this?
I am aware that by using this approach, I'm "pushing" our model in the direction of topics that we saw in initial runs - however, I believe that training on this data set will lead to a better/more interpretable classification of course descriptions.
I collected product reviews from a website, written by different users, and I'm trying to find similarities between products using embeddings of the words the users wrote.
I grouped the reviews per product, so that different reviews follow one another in my dataframe (i.e. different authors for one product). I have also already tokenized the reviews (and applied all the other pre-processing steps). Below is a mock-up of the dataframe I have (the number of tokens per product is actually very large, as is the number of products):
Product      reviews_tokenized
XGame3000    absolutely amazing simulator feel inaccessible ...
Poliamo      production value effect tend cover rather ...
Artemis      absolutely fantastic possibly good oil ...
Ratoiin      ability simulate emergency operator town ...
However, I'm not sure which of Doc2Vec and Word2Vec would be the most effective. I would initially go for Doc2Vec, since it can find similarities at the level of the whole paragraph/sentence and capture its topic (which I'd like, since I'm trying to cluster products by topic), but I'm a bit worried that the reviews come from different authors, which might bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which is giving me quite a good silhouette score (~0.7).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

SEED = 42  # any fixed seed, for reproducibility
# One TaggedDocument per product, tagged with its row index
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
# Infer one vector per product and stack them into an array for clustering
product2vec = [model3.infer_vector(df['tokens'][i].split(' ')) for i in range(len(df['tokens']))]
dtv = np.array(product2vec)
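The clustering step that produces the silhouette score isn't shown above; a minimal sketch, assuming KMeans on dtv with a placeholder number of clusters:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=8, random_state=SEED, n_init=10)   # n_clusters is a placeholder
labels = kmeans.fit_predict(dtv)
print(silhouette_score(dtv, labels))   # the ~0.7 figure would come from a step like this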
What do you think would be the most effective method to tackle this? If something is not clear enough, please tell me.
Thank you for your help.
EDIT: below are the clusters I'm obtaining (cluster plot not reproduced here).
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
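As an illustration of comparing the two approaches the question mentions, here is a minimal sketch of a Word2Vec baseline: train word vectors on the same tokenized reviews and average them per product. The parameters are placeholders, and df['tokens'] refers to the question's dataframe.

from gensim.models import Word2Vec
import numpy as np

# Train word vectors on the tokenized reviews, then average them per product
sentences = [doc.split(' ') for doc in df['tokens']]
w2v = Word2Vec(sentences, vector_size=100, min_count=1, seed=0)

def average_vector(tokens, model):
    # Mean of the vectors of tokens present in the vocabulary
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

product_w2v = np.array([average_vector(doc.split(' '), w2v) for doc in df['tokens']])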
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demo'd interesting results in finding "similar concerns" (even with different wording) in the review domain, eg: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, WMD gets quite costly to calculate in bulk with larger texts.
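For a rough idea, here is what a single WMD comparison looks like in gensim, assuming a Word2Vec model like the w2v sketched above; the example reviews are made up.

# WMD between two tokenized reviews, reusing the w2v model sketched above.
# gensim's wmdistance needs the pyemd (older gensim) or POT (gensim 4.x) package installed.
review_a = "absolutely amazing simulator feel".split()
review_b = "fantastic realistic flight sim".split()
print(w2v.wv.wmdistance(review_a, review_b))   # lower distance = more similar wording/concepts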
I am working on a text dataset containing messages from users on a website. Please check the image in the link, as Stack is not allowing me to post the image directly.
dataframe of the first five rows
Reading those messages, I want to find out the intent of the users: whether they are a buyer, a seller, or neutral. I have tried topic modelling using both LDA and NMF, but it's not giving me answers: I am getting very different topics and cannot find a way to relate them to buyer, seller, or neutral. And I cannot manually label the data because it's a huge dataset containing around 200,000 rows. So which technique or algorithm can I use to solve this problem?
The algorithm you tried, LDA (I'm not familiar with the other one), is, as you said, a topic modelling algorithm, which isn't so helpful in this case...
What I'd suggest you do is try to label a chunk of messages for each category:
seller
buyer
neutral
and transform the problem you're facing into a classification problem; then choose any classification algorithm to classify the messages into one of the categories (a rough sketch follows after the link below)...
For reference, I'd suggest you look at this problem for some inspiration: https://towardsdatascience.com/applied-text-classification-on-email-spam-filtering-part-1-1861e1a83246
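As referenced above, here is a minimal sketch of that workflow, assuming a hand-labelled sample in lists texts and labels (with labels "buyer", "seller", "neutral"); the vectorizer and classifier choices are placeholders, not part of the original suggestion.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# texts: manually labelled messages; labels: 'buyer' / 'seller' / 'neutral'
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=0)
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
# Once the classifier looks good on the labelled sample, apply it to the remaining messages:
# predicted = clf.predict(unlabelled_texts)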
I have more than 100k .txt files of newspaper issues, and I need to define the lexical field of protectionism. However, the newspaper issues cover very diverse subjects, and I cannot know the total number of topics in advance. Can I still use LDA topic modeling to find the lexical field, or is there another method (maybe supervised learning)?
You probably can, but have a look at this CorEx idea. It works very nicely and gives you the opportunity to guide the topics by supplying a set of anchor words (so you could call it semi-supervised learning).
You could give it ["protectionism", "tariffs", "trade wars", ...] as anchors for one topic, and even try to push articles that are not related to the topic you are interested in into a second topic by defining anchor words that have nothing to do with your topic, e.g. ["police protection", "custom features", ...].
The notebooks supplied with it are really excellent and will have you up and running in no time.
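A minimal sketch of anchored CorEx, assuming the corextopic package and a list docs of raw article strings; the anchor lists, number of topics, and anchor_strength are placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

# Binary bag-of-words matrix over the articles
vectorizer = CountVectorizer(max_features=20000, binary=True, stop_words='english')
X = vectorizer.fit_transform(docs)
words = list(vectorizer.get_feature_names_out())

# Anchor one topic towards protectionism and push unrelated uses of "protection" into a second topic
anchors = [["protectionism", "tariffs", "trade"], ["police", "protection"]]
model = ct.Corex(n_hidden=10, seed=1)
model.fit(X, words=words, anchors=anchors, anchor_strength=3)

for i, topic in enumerate(model.get_topics()):
    print(i, [t[0] for t in topic])   # top words per topic, i.e. candidate lexical-field terms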
I am stuck on this issue and am not able to find relevant literature. I am not sure if this is a coding question to begin with.
I have articles related to some disaster, and I want to do a temporal classification of the text. Specifically, I want to extract the sentences/phrases related to information from before the event. I know about classification from my background in ML, but I have no idea how to extract the relevant phrases.
I have tried tokenizing the words and getting their frequencies, and I have also tried POS tagging using a maximum-entropy tagger.
I guess the problem reduces to analyzing the manually classified text and constructing features, but I am not sure how to extract patterns using the POS tags. I also don't know how to determine an exhaustive set of features.
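For what it's worth, a minimal sketch of tokenizing and POS tagging with NLTK's default tagger, which could feed simple POS-pattern features into a sentence classifier; the example sentence and the tense-count feature are assumptions for illustration only.

import nltk
# one-off downloads: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = "The river had already flooded two villages before the warning was issued."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # [('The', 'DT'), ('river', 'NN'), ...]

# Example feature: count of past-tense / past-participle verbs, which may mark pre-event information
past_verbs = sum(1 for _, tag in tagged if tag in ('VBD', 'VBN'))
print(tagged, past_verbs)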
I am dealing with a problem of text summarization, i.e. given a large chunk of text (or several), I want to find the most representative "topics" or the subject of the text. For this, I used various information-theoretic measures such as TF-IDF, residual IDF, and pointwise mutual information to create a "dictionary" for my corpus. This dictionary contains the important words mentioned in the text.
I manually sifted through the entire list of 50,000 phrases, sorted by their TF-IDF score, and hand-picked 2,000 phrases (I know! It took me 15 hours to do this...) that are the ground truth, i.e. these are important for sure. Now when I use this as a dictionary, run a simple frequency analysis on my text, and extract the top-k phrases, I basically see what the subject is, and I agree with what I am seeing.
Now how can I evaluate this approach? There is no machine learning or classification involved here. Basically, I used some NLP techniques to create a dictionary and using the dictionary alone to do simple frequency analysis is giving me the topics I am looking for. However, is there a formal analysis I can do for my system to measure its accuracy or something else?
I'm not an expert in machine learning, but I would use a hold-out evaluation, in the spirit of cross-validation. If you used e.g. 1,000 pages of text to "train" the algorithm (there is a "human in the loop", but that's no problem), then you could take another few hundred test pages and use your top-k-phrases algorithm to find the "topic" or "subject" of each. The ratio of test pages where you agree with the outcome of the algorithm gives you a (somewhat subjective) measure of how well your method performs.
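A minimal sketch of that kind of evaluation, plus a simple precision-at-k check against the 2,000 hand-picked phrases; the names test_pages, ground_truth_phrases, and extract_top_k are hypothetical stand-ins for the author's own pipeline.

# ground_truth_phrases: the 2,000 hand-picked phrases
# extract_top_k(text, k): the dictionary-based frequency extractor described in the question
def precision_at_k(extracted, ground_truth):
    # Fraction of extracted phrases that appear in the hand-picked ground truth
    return len(set(extracted) & set(ground_truth)) / max(len(extracted), 1)

scores = [precision_at_k(extract_top_k(page, k=10), ground_truth_phrases) for page in test_pages]
print(sum(scores) / len(scores))   # average precision@10 over the held-out pages

# For the subjective variant suggested above: count the held-out pages where a human agrees with the output
# agreement_rate = sum(human_agrees(page, extract_top_k(page, k=10)) for page in test_pages) / len(test_pages)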