Unlabeled text data containing messages - python

I am working on a text dataset containing messages from users on a website.
[Image: dataframe of the first five rows]
Reading those messages, I want to find out the intent of the users: whether they are buyers, sellers, or neutral. I have tried topic modelling using both LDA and NMF, but it isn't giving me answers: the topics I get are very different from each other, and I cannot find a way to relate them to buyer, seller, or neutral. I also cannot manually label the data because it is a huge dataset containing 200,000 rows. So which technique or algorithm can I use to solve this problem?

The algorithm you tried, LDA (I'm not familiar with the other one), is, as you said, a topic modelling algorithm, which isn't so helpful in this case...
What I'd suggest you do is:
try to label a chunk of messages for each category -
seller
buyer
neutral
and transform the problem you're facing into a classification problem; then choose any classification algorithm to classify the messages into one of the categories (a sketch follows below)...
For reference, I'd suggest you look at this problem for some inspiration: https://towardsdatascience.com/applied-text-classification-on-email-spam-filtering-part-1-1861e1a83246
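
A minimal sketch of that route with scikit-learn; the labeled.csv file and its message/intent columns are hypothetical stand-ins for your hand-labelled chunk:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hand-labelled chunk: a "message" column and an "intent" column
# with values buyer / seller / neutral
labeled = pd.read_csv("labeled.csv")

X_train, X_test, y_train, y_test = train_test_split(
    labeled["message"], labeled["intent"], test_size=0.2, random_state=42)

# TF-IDF features feeding a simple linear classifier
clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# Once the score looks acceptable, predict intents for the remaining rows:
# all_messages["intent"] = clf.predict(all_messages["message"])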

Related

NLP Text Classification Model with defined categories / context / intent

This is more of a guideline question than a technical query. I am looking to create a classification model that classifies documents based on a specific list of strings. However, as it turns out from the data, the output is more useful when you pass labelled intents / topics instead of allowing the model to guess. Based on my research so far, I found that BERTopic might be the right way to achieve this, as it allows guided topic modeling; the only caveat is that the guided topics should contain words of similar meaning (see link below).
The example below might make it clearer what I want to achieve. Suppose we have a chat text extract from a conversation between a customer and a store associate; below is what the customer is asking for.
Hi, the product I purchased from your store is defective and I would like to get it replaced or refunded. Here is the receipt number and other details...
If my list of intended labels is as follows: ['defective or damaged', 'refund request', 'wrong label description', ...], we can see that the above extract qualifies as 'defective or damaged' and 'refund request'. For the sake of simplicity, I would pick the one the model returns with the highest score, so that we only have 1 label per request. Now, if my data does not have these labels defined, I may be able to use zero-shot classification to "best guess" the intent from this list. However, as I understand the use of zero-shot classification, or even the guided topic modeling of BERTopic, the categories that I want above may not be derived, since the individual words in those categories do not mean the same thing.
For example, in BERTopic, an intended classified label could look like ["space", "launch", "orbit", "lunar"] for a "Space Related" topic, but in my case, say for the 3rd label, it would be ["wrong", "label", "description"], which would not be well suited, as it would try to find all records that mention a wrong address, wrong department, wrong color, etc. I am essentially looking for a combination of those 3 words in context. Also, those 3 words may not always be together or in the same order. For example, in this sentence -
The item had description which was labelled incorrectly.
The same challenge applies to zero-shot classification, where the labels are expected to be one word or a combination of words that mean the same thing. Let me know if this clarifies the question or if I can help clarify it further.
Approach mentioned above:
https://maartengr.github.io/BERTopic/getting_started/guided/guided.html#semi-supervised-topic-modeling
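For context, here is the kind of zero-shot call I have been experimenting with, a minimal sketch assuming the Hugging Face transformers library and its facebook/bart-large-mnli checkpoint:

from transformers import pipeline

# Zero-shot classification scores each candidate label as a whole phrase,
# so multi-word labels like "defective or damaged" are allowed
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = ("Hi, the product I purchased from your store is defective and "
        "I would like to get it replaced or refunded.")
labels = ["defective or damaged", "refund request", "wrong label description"]

result = classifier(text, candidate_labels=labels)
# Labels are returned sorted by score; keep the top one for a single label per request
print(result["labels"][0], result["scores"][0])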

Can I use LDA topic modeling if I do not know the number of topics

I have more than 100k .txt files of newspaper issues and I need to define the lexical field of protectionism. Nonetheless, the newspaper issues cover very diverse subjects and I cannot know the total number of topics. Can I still use LDA topic modeling to find the lexical field, or is there another method (maybe supervised learning)?
You probably can, but have a look at this CorEx idea. It works very nicely and gives you the opportunity to guide the groups by supplying a set of anchor words (so you can call it semi-supervised learning).
You could give it ["protectionism", "tariffs", "trade wars", ...] as anchors for one topic, and even try to push articles that are unrelated to the topic you are interested in into a second topic by defining anchor words that have nothing to do with your topic, e.g. ["police protection", "custom features", ...].
The notebooks supplied are really excellent and will have you up and running in no time.
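
A minimal sketch of the anchored setup, assuming the corextopic package (pip install corextopic) and toy stand-in documents:

from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

# Toy stand-ins for the newspaper issues
docs = [
    "new tariffs spark fears of protectionism in global trade",
    "the government imposed import tariffs to protect domestic industry",
    "police protection was increased around the stadium",
    "the mayor praised police efforts to protect local businesses",
]

vectorizer = CountVectorizer(stop_words="english", binary=True)
X = vectorizer.fit_transform(docs)
words = list(vectorizer.get_feature_names_out())

# One anchored topic for protectionism, one distractor topic to absorb
# unrelated uses of "protection"
anchors = [["protectionism", "tariffs", "trade"], ["police", "protect"]]
model = ct.Corex(n_hidden=2, seed=1)
model.fit(X, words=words, anchors=anchors, anchor_strength=3)

for i, topic in enumerate(model.get_topics()):
    print(i, [w for w, *_ in topic])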

LDA topic modelling improvement

I am working on an LDA model to identify the topic of ~100,000 online courses based on their course descriptions and titles. In the further process, I would like to use these topics to cluster these courses. So far, the topics identified by our model have not been great, and I am looking for ways to improve, and to get some thoughts on my attempts at improvement. Here is a quick summary of our current, pretty standard, approach, as well as some ideas that I have for improvement (a rough gensim sketch of these steps follows the list):
Merging title, subtitle, and course description
Removing descriptions of length < 100 words, and non-English descriptions
For training, I am using only longer descriptions in English. Of course, this means that courses with non-English descriptions will be classified randomly.
Picking 30,000 random descriptions
The number is somewhat arbitrary. I have noticed that the topics are more "clear" when using fewer descriptions for training. However, we don't want our topics to be biased by the random descriptions chosen in this step.
Removing stopwords
Both self-defined and from a library.
Removing punctuation
Lemmatizing words
Removing words that appear in over 50% of documents
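
A rough gensim sketch of those steps; the toy descriptions are placeholders, and the 50% document-frequency cut maps to no_above=0.5 in filter_extremes:

import string
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# import nltk; nltk.download("stopwords"); nltk.download("wordnet")  # first run

descriptions = [  # toy stand-ins for the merged title + subtitle + description texts
    "Learn Python programming from scratch in this beginner course",
    "An introduction to digital marketing strategy for beginners",
    "Advanced Python techniques for data analysis and visualisation",
]

lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words("english")) | {"course", "beginner"}  # library + self-defined

def preprocess(text):
    # lowercase, strip punctuation, drop stopwords, lemmatize
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [lemmatizer.lemmatize(tok) for tok in text.split() if tok not in stop]

texts = [preprocess(d) for d in descriptions]
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=1, no_above=0.5)  # drop words in >50% of documents
corpus = [dictionary.doc2bow(t) for t in texts]

lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                            passes=10, random_state=1)
print(lda_model.print_topics())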
To identify recurring topics, I ran the model multiple times in a for loop and printed the resulting topics. Based on the topic overlap across those iterations, I am considering adding Wikipedia articles related to the recurring topics to the descriptions we use for training. This way I am hoping to "strengthen" those topics in the training data and make them more clear, in the hope of getting more interpretable topics. Currently, I am adding around 150 Wikipedia articles to a corpus of 30,000 course descriptions and the results seem to be promising.
My main question is: Is the approach of adding pre-selected wikipedia articles into our training data valid? What are the implications of this?
I am aware that by using this approach, I'm "pushing" our model in the direction of topics that we saw in initial runs - however, I believe that training on this data set will lead to a better/more interpretable classification of course descriptions.

Classifying text documents using nltk

I'm currently working on a project where I'm taking emails, stripping out the message bodies using the email package, then I want to categorize them using labels like sports, politics, technology, etc...
I've successfully stripped the message bodies out of my emails, and now I'm looking to start classifying. I've done the classic example of sentiment-analysis classification using the movie_reviews corpus, separating documents into positive and negative reviews.
I'm just wondering how I could apply this approach to my project. Can I create multiple classes like sports, technology, politics, entertainment, etc.? I have hit a roadblock here and am looking for a push in the right direction.
If this isn't an appropriate question for SO I'll happily delete it.
Edit: Hello everyone, I see that this post has gained a bit of popularity. I did end up successfully completing this project; here is a link to the code in the project's GitHub repo:
https://github.com/codyreandeau/Email-Categorizer/blob/master/Email_Categorizer.py
The task of text classification is a supervised machine learning problem. This means that you need labelled data. When you approached the movie_reviews problem, you used the +1/-1 labels to train your sentiment analysis system.
Getting back to your problem:
If you have labels for your data, approach the problem in the same manner. I suggest you use the scikit-learn library. You can draw some inspiration from here: Scikit-Learn for Text Classification
If you don't have labels, you can try an unsupervised learning approach. If you have any clue about how many categories you have (call that number K), you can try a KMeans approach. This means grouping the emails into K clusters based on how similar they are; similar emails will end up in similar buckets. Then inspect the clusters by hand, come up with a label, and assign new emails to the most similar cluster (a quick sketch follows below). If you need help with KMeans, check this quick recipe: Text Clustering Recipe
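
A quick sketch of that KMeans route with scikit-learn; the emails list is toy stand-in data:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the stripped email bodies
emails = [
    "the team won the championship game last night",
    "congress passed the new budget bill today",
    "the latest smartphone release features a faster chip",
    "the striker scored twice in the final match",
]

K = 2  # your guess at the number of categories
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(emails)

km = KMeans(n_clusters=K, n_init=10, random_state=1)
labels = km.fit_predict(X)

# Inspect the clusters by hand and name them (e.g. sports, politics);
# assign new emails with km.predict(vectorizer.transform([new_email]))
for label, email in zip(labels, emails):
    print(label, email)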
Suggestion: Getting labels for emails can be easier than you think. For example, Gmail lets you export your emails with folder information. If you have categorised your email, you can take advantage of this.
To create a classifier, you need a training data set with the classes you are looking for. In your case, you may need to either:
create your own data set
use a pre-existing dataset
The Brown corpus is a seminal corpus with many of the categories you are speaking about. It could be a starting point to help classify your emails, using a package like gensim to find semantically similar texts.
Once you classify your emails, you can then train a system to predict a label for each unseen email.
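
One way to sketch that starting point, using NLTK's Naive Bayes classifier on Brown corpus categories rather than gensim; the category names and the 2,000-word vocabulary are arbitrary choices:

import random
import nltk
from nltk.corpus import brown

# nltk.download("brown")  # first run

categories = ["news", "hobbies", "government"]  # stand-ins for your email labels
documents = [(list(brown.words(fileid)), cat)
             for cat in categories
             for fileid in brown.fileids(categories=cat)]

# Presence/absence of the 2,000 most frequent words as features
freq = nltk.FreqDist(w.lower() for w in brown.words(categories=categories))
vocab = [w for w, _ in freq.most_common(2000)]

def features(words):
    present = set(w.lower() for w in words)
    return {w: (w in present) for w in vocab}

featuresets = [(features(doc), cat) for doc, cat in documents]
random.seed(1)
random.shuffle(featuresets)

train_set, test_set = featuresets[20:], featuresets[:20]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))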

Generating a good LDA model of Twitter in Python with correct input data

I'm dealing with topic modelling of Twitter to define profiles of individual Twitter users. I'm using the Gensim module to generate an LDA model. My question is about choosing good input data: I'd like to generate topics which I'd then assign to specific users. Right now I'm using a supervised method of choosing users from different categories on my own (sports, IT, politics, etc.) and putting their tweets into the model, but it's not very efficient or effective.
What would be a good method for generating meaningful topics of the whole Twitter?
Here is one profiling exercise I used to perform when I worked for a social media company.
Let's say you want to profile "sports" followers.
First, using the Twitter API, download all the followers of one famous sports handle, say "ESPN". The result looks like this:
"ESPN": 51879246, #These are IDs who follow ESPN
2361734293,
778094964,
23000618,
2828513313,
2687406674,
2402689721,
2209802017,
Then you also download all handles that 51879246, 2361734293... are following. Those "topics" will be your features.
Now all you need to do is create the matrix X, with one row per follower and one column per feature. Fill the matrix with 1 wherever a follower follows the specific topic (feature) in your feature dictionary.
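A small sketch of that matrix construction; the following dict is hypothetical stand-in data:

import numpy as np

# Hypothetical stand-in data: follower ID -> handles that follower follows
following = {
    51879246: {"ESPN", "NBA", "nytimes"},
    2361734293: {"ESPN", "NFL"},
    778094964: {"NBA", "NFL", "espnradio"},
}

# The feature dictionary: every handle seen across all followers
features = sorted({h for handles in following.values() for h in handles})
col = {h: j for j, h in enumerate(features)}

# Binary follower-by-feature matrix; the lda package expects integer counts
X = np.zeros((len(following), len(features)), dtype=np.int64)
for i, handles in enumerate(following.values()):
    for h in handles:
        X[i, col[h]] = 1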
Then here are the two key lines to start playing with (using the lda package):

import lda  # pip install lda

# Fit a 5-topic model on the binary follower-by-feature matrix
model = lda.LDA(n_topics=5, n_iter=1000, random_state=1)
model.fit(X)
