Classifying text documents using nltk - python

I'm currently working on a project where I take emails, strip out the message bodies using the email package, and then categorize them with labels like sports, politics, technology, etc.
I've successfully stripped the message bodies out of my emails, and now I'm looking to start classifying. I've done the classic sentiment-analysis example using the movie_reviews corpus, separating documents into positive and negative reviews.
I'm just wondering how I can apply this approach to my project. Can I create multiple classes like sports, technology, politics, entertainment, etc.? I've hit a roadblock here and am looking for a push in the right direction.
If this isn't an appropriate question for SO I'll happily delete it.
Edit: Hello everyone, I see that this post has gained a bit of popularity. I did end up successfully completing this project; here is a link to the code in the project's GitHub repo:
https://github.com/codyreandeau/Email-Categorizer/blob/master/Email_Categorizer.py

The task of text classification is a Supervised Machine Learning problem. This means that you need to have labelled data. When you approached the movie_reviews problem, you used the positive/negative labels to train your sentiment analysis system.
Getting back to your problem:
If you have labels for your data, approach the problem in the same manner. I suggest you use the scikit-learn library. You can draw some inspiration from here: Scikit-Learn for Text Classification
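As a rough sketch of what that looks like (the emails, labels, and category names below are placeholders, not your data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder labelled data: one stripped e-mail body per entry, plus its known category.
bodies = [
    "The match went to extra time and the home team won",
    "Parliament votes on the new budget bill tomorrow",
    "The new laptop ships with a faster processor",
]
labels = ["sports", "politics", "technology"]

# TF-IDF bag-of-words features fed into a simple Naive Bayes classifier.
clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
clf.fit(bodies, labels)

print(clf.predict(["The home team won the match"]))  # e.g. ['sports']

With real data you would hold out a test split and compare a few classifiers, but the shape of the pipeline stays the same.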
If you don't have labels, you can try an unsupervised learning approach. If you have any idea how many categories you have (call that number K), you can try a KMeans approach. This means grouping the emails into K clusters based on how similar they are; similar emails will end up in the same bucket. Then inspect the clusters by hand and come up with a label for each. Assign new emails to the most similar cluster. If you need help with KMeans, check this quick recipe: Text Clustering Recipe
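If you go the unsupervised route, a minimal sketch of that KMeans idea with scikit-learn (the emails and K below are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder bodies; in practice these are your stripped e-mail bodies.
emails = [
    "The team won the championship game last night",
    "New tax bill passes the senate vote",
    "Latest smartphone release features a faster chip",
    "Coach announces the starting lineup for the final",
    "Government announces new trade policy",
    "Chipmaker unveils next generation processor",
]
K = 3  # assumed number of categories

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(emails)

km = KMeans(n_clusters=K, random_state=0, n_init=10)
cluster_ids = km.fit_predict(X)

# Inspect each cluster's top terms by hand to decide on a label.
terms = vec.get_feature_names_out()
for k, centroid in enumerate(km.cluster_centers_):
    print(k, [terms[i] for i in centroid.argsort()[::-1][:5]])

# New emails are assigned to the most similar cluster.
print(km.predict(vec.transform(["The striker scored twice in the final"])))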
Suggestion: Getting labels for emails can be easier than you think. For example, Gmail lets you export your emails with folder information. If you have categorised your email, you can take advantage of this.

To create a classifier, you need a training data set with the classes you are looking for. In your case, you may need to either:
create your own data set
use a pre-existing dataset
The Brown corpus is a classic corpus whose documents are already sorted into many of the categories you mention. It could be a starting point for classifying your emails, together with a package like gensim to find semantically similar texts.
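One rough way to use it: build a gensim similarity index with one document per Brown category, and see which category an unseen email looks most like (the email text below is made up; treat the result only as a coarse starting label):

import nltk
from nltk.corpus import brown
from gensim import corpora, models, similarities

nltk.download("brown", quiet=True)

# One "document" per Brown category (news, editorial, government, ...).
categories = brown.categories()
docs = [[w.lower() for w in brown.words(categories=c)] for c in categories]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
tfidf = models.TfidfModel(bow)
index = similarities.SparseMatrixSimilarity(tfidf[bow], num_features=len(dictionary))

# Score an email against every category and pick the closest one.
email_tokens = "the senate passed the new trade bill today".split()
sims = index[tfidf[dictionary.doc2bow(email_tokens)]]
print(categories[sims.argmax()])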
Once you classify your emails, you can then train a system to predict a label for each unseen email.

Related

How to extract unique words and their POS tags into separate columns while working with a dataset

I am working with Indonesian data to use for NER, and as far as I can tell there is no pretrained NLTK model for this language. So, to do this manually, I tried to extract all the unique words used in the entire data frame. I still don't know how to apply tags to the words, but this is what I did so far:
[the code for the four steps from the original post is not reproduced here]
Please let me know if there is a more convenient way to do this than what I did in the code. Also, let me know how to add tags to each row (if possible) and how to do NER on this data.
(I am new to coding, which is why I'm not sure how to ask, but I am trying my best to provide as much information as possible.)
Depending on what you want to do: if results are all that matters, you could use a pretrained transformer model from Hugging Face instead of NLTK. This will be more computationally heavy but will also give you better performance.
There is one fitting model I could find (I obviously don't speak Indonesian, so excuse any errors in the sample sentence):
https://huggingface.co/cahya/xlm-roberta-large-indonesian-NER?text=Nama+saya+Peter+dan+saya+tinggal+di+Berlin.
The easiest way to use this would probably be either the hosted API or an inference-only pipeline; check out this guide. All you would have to do to get this running for the Indonesian model is replace the previous model path (dslim/bert-base-NER) with cahya/xlm-roberta-large-indonesian-NER.
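For example, a minimal inference-only pipeline might look like this (the example sentence is the one from the model page; the exact labels you get back depend on the model's tag set):

from transformers import pipeline

# Same recipe as in the guide, with the Indonesian model swapped in.
ner = pipeline(
    "ner",
    model="cahya/xlm-roberta-large-indonesian-NER",
    aggregation_strategy="simple",  # merge word pieces back into whole entities
)

print(ner("Nama saya Peter dan saya tinggal di Berlin."))
# -> a list of dicts with entity group, score, and character offsets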
Note that this Indonesian model is quite large, so you need some decent hardware. If you don't have it, you could alternatively use a (free) cloud computing service such as Google Colab.

Decide whether a text is about "Topic A" or not - NLP with Python

I'm trying to write a Python program that will decide if a given post is about the topic of volunteering. My datasets are small (only the posts, which are examined one by one), so approaches like LDA do not yield results.
My end goal is a simple True/False, a post is about the topic or not.
I'm trying this approach:
Using Google's word2vec model, I'm creating a "cluster" of words that are similar to the word: "volunteer".
CLUSTER = [x[0] for x in MODEL.most_similar_cosmul("volunteer", topn=120)]
Getting the posts and translating them to English, using Google translate.
Cleaning the translated posts using NLTK (removing stopwords, punctuation, and lemmatize the post)
Making a BOW out of the translated, clean post.
This stage is difficult for me. I want to calculate a "distance" / "similarity" / something that will help me get the True/False answer that I'm looking for, but I can't think of a good way to do that.
Thank you for your suggestions and help in advance.
You are attempting to intuitively improvise a set of steps that, in the end, will classify these posts into the two categories, "volunteering" and "not-volunteering".
You should look for online examples that do "text classification" and are similar to your task, work through them (with their original demo data) for understanding, then adapt them incrementally to work with your data instead.
At some point, word2vec might be a helpful contributor to your task - but I wouldn't start with it. Similarly, eliminating stop-words, performing lemmatization, etc might eventually be helpful, but need not be important up front.
You'll typically want to start by acquiring (by hand-labeling if necessary) a training set of text for which you know the "volunteering" or "not-volunteering" value (known labels).
Then, create feature vectors for the texts. A simple starting approach that offers a quick baseline for later improvements is a "bag of words" representation.
Then, feed those representations, with the known labels, to some existing classification algorithm. The popular scikit-learn package in Python offers many. That is: you don't yet need to be worrying about choosing ways to calculate a "distance" / "similarity" / something that will guide your own ad hoc classifier. Just feed the labeled data into one (or many) existing classifiers, and check how well they're doing. Many will be using various kinds of similarity/distance calculations internally - but that's automatic and implicit from choosing & configuring the algorithm.
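As a concrete illustration of "bag-of-words features into an existing classifier" (the posts and labels below are made up; scikit-learn handles any similarity calculations internally):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labelled training posts (placeholders).
posts = [
    "We are looking for volunteers to help at the animal shelter",
    "Join us this weekend to volunteer at the food bank",
    "Apartment for rent, two bedrooms, close to the city centre",
    "Selling my old bicycle, barely used",
]
labels = [True, True, False, False]  # True = about volunteering

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(posts, labels)

print(clf.predict(["Anyone want to volunteer with us next week?"]))  # e.g. [ True]

Once a baseline like this works end-to-end, the preprocessing and featurizing experiments described next slot in naturally.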
Finally, when you have something working start-to-finish, no matter how modest in results, then try alternate ways of preprocessing text (stop-word removal, lemmatization, etc.), featurizing text, and alternate classifiers/algorithm parameterizations - to compare results, and thus discover what works well given your specific data, goals, and practical constraints.
The scikit-learn "Working With Text Data" guide is worth reviewing & working-through, and their "Choosing the right estimator" map is useful for understanding the broad terrain of alternate techniques and major algorithms, and when different ones apply to your task.
Also, scikit-learn contributors/educators like Jake Vanderplas (github.com/jakevdp) and Olivier Grisel (github.com/ogrisel) have many online notebooks/tutorials/archived-video-presentations which step through all the basics, often including text-classification problems much like yours.

Unlabeled text data containing messages

I am working on a text dataset containing messages from users on a website. Please check the image in the link, as Stack Overflow is not allowing me to post the image directly.
dataframe of the first five rows
Reading those messages, I want to find out the intent of the users: whether they are a buyer, a seller, or neutral. I have tried topic modelling using both LDA and NMF, but it's not giving me answers; I am getting very different topics and I cannot find a way to relate them to buyer, seller, or neutral. And I cannot manually label this data because it's a huge dataset containing 200,000 rows. So which technique or algorithm can I use to solve this problem?
The algorithm you tried, LDA (I'm not familiar with the other one), is, as you said, a topic modelling algorithm, which isn't so helpful in this case.
What I'd suggest you do is:
try to label a chunk of messages for each category:
seller
buyer
neutral
and transform the problem you're facing into a classification problem; then choose any classification algorithm to classify the messages into one of the categories.
For reference, I'd suggest you look at this problem for some inspiration: https://towardsdatascience.com/applied-text-classification-on-email-spam-filtering-part-1-1861e1a83246
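A rough sketch of that workflow (the data frame, column names, and labels below are placeholders standing in for your 200,000-row dataset): hand-label a chunk, train on it, and let the model predict the rest.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

df = pd.DataFrame({
    "message": [
        "Looking to buy a used laptop, budget 300",
        "Selling my camera, almost new",
        "Thanks for the info, have a nice day",
        "Looking to buy a second hand camera",
    ],
    "label": ["buyer", "seller", "neutral", None],  # only a chunk is hand-labelled
})

# Train on the labelled chunk.
labelled = df[df["label"].notna()]
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(labelled["message"], labelled["label"])

# Predict the intent of the remaining, unlabelled messages.
unlabelled = df[df["label"].isna()]
print(clf.predict(unlabelled["message"]))  # e.g. ['buyer']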

Can I use LDA topic modeling if I do not know the number of topics

I have more than 100k .txt files with newspaper issues, and I need to define the lexical field of protectionism. However, the newspaper issues cover very diverse subjects and I cannot know the total number of topics. Can I still use LDA topic modeling to find the lexical field, or is there another method (maybe supervised learning)?
You probably can, but have a look at the CorEx idea. It works very nicely and gives you the opportunity to guide the topics by supplying sets of anchor words (so you could call it semi-supervised learning).
You could give it ["protectionism", "tariffs", "trade wars", ...] as anchors for one topic, and even try to push articles that are unrelated to the topic you are interested in into a second topic by defining anchor words that have nothing to do with your topic, e.g. ["police protection", "custom features", ...].
The notebooks supplied are really excellent and will have you up and running in no time.
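A rough sketch of how the anchored fit is usually wired up (assuming the corextopic package, installed with pip install corextopic; the documents and anchor lists below are made up, and call signatures may differ slightly between versions):

from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

# Toy stand-ins for the newspaper articles.
docs = [
    "The government imposed new tariffs to shield domestic industry",
    "Economists warn that protectionism hurts global trade",
    "The trade war escalates as both countries raise tariffs",
    "Local police increase protection around the stadium",
    "New customs rules delay imports at the border",
    "City council debates funding for police equipment",
]

vectorizer = CountVectorizer(stop_words="english", binary=True)
doc_word = vectorizer.fit_transform(docs)
words = list(vectorizer.get_feature_names_out())

anchors = [
    ["protectionism", "tariffs", "trade"],  # the lexical field you care about
    ["police", "protection", "customs"],    # a distractor topic to absorb the rest
]
anchors = [[w for w in topic if w in words] for topic in anchors]

model = ct.Corex(n_hidden=2, seed=42)
model.fit(doc_word, words=words, anchors=anchors, anchor_strength=3)

for i in range(2):
    print(i, [t[0] for t in model.get_topics(topic=i, n_words=8)])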

NLP project on Comment Summarization

I am planning to do my final year project on Natural Language Processing (using NLTK) and my area of interest is Comment Summarization from Social media websites such as Facebook. For example, I am trying to do something like this:
Random Facebook comments on a picture:
Wow! Beautiful.
Looking really beautiful.
Very pretty, Nice pic.
Now, all these comments will get mapped (using a template based comment summarization technique) into something like this:
3 people find this picture to be "beautiful".
The output will consist of the word "beautiful", since it is used more often in the comments than the word "pretty" (and also because "beautiful" and "pretty" are synonyms). In order to accomplish this task, I am going to use approaches like tracking keyword frequency and keyword scores (in this scenario, "beautiful" and "pretty" have very close scores).
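A minimal sketch of that counting-plus-synonym idea, using NLTK's WordNet (whether "pretty" actually gets linked to "beautiful" depends on WordNet's coverage, so the output is only illustrative):

from collections import Counter

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

comments = ["Wow! Beautiful.", "Looking really beautiful.", "Very pretty, Nice pic."]
keywords = ["beautiful", "pretty"]

def are_synonyms(a, b):
    """Rough WordNet check: a shared synset, or a 'similar to' link between their synsets."""
    syns_a, syns_b = set(wn.synsets(a)), set(wn.synsets(b))
    if syns_a & syns_b:
        return True
    sim_a = {s for syn in syns_a for s in syn.similar_tos()}
    sim_b = {s for syn in syns_b for s in syn.similar_tos()}
    return bool(sim_a & syns_b) or bool(sim_b & syns_a)

# Raw keyword frequencies across the comments.
counts = Counter()
for comment in comments:
    tokens = [t.strip(".,!").lower() for t in comment.split()]
    for kw in keywords:
        counts[kw] += tokens.count(kw)

# Merge synonym counts onto the most frequent keyword and fill the template.
winner = counts.most_common(1)[0][0]
total = sum(n for kw, n in counts.items() if kw == winner or are_synonyms(kw, winner))
print(f'{total} people find this picture to be "{winner}"')  # e.g. 3 ... "beautiful"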
Is this the best way to do it?
So far in my research, I have been able to come up with the following papers, but none of them address this kind of comment summarization:
Automatic Summarization of Events from Social Media
Social Context Summarization
What are the other papers in this field which address a similar issue?
Apart from this, I also want my summarizer to improve with every summarization task. How do I apply machine learning in this regard?
Topic model clustering is what you are looking for.
A search on Google Scholar for "topic model clustering" will give you lots of references on the subject.
To understand them, you need to be familiar with approaches for the following tasks, in addition to the basics of machine learning in general.
Clustering: Cosine distance clustering, k-means clustering
Ranking: PageRank, TF-IDF, Mutual Information Gain, Maximal Marginal Relevance (a minimal clustering-plus-ranking sketch follows below)
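Here is that sketch: the comments are clustered with TF-IDF and k-means, and each cluster is summarised by its size and its highest-weighted term (scikit-learn is used for brevity; the comments are the ones from the question plus two extras):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

comments = [
    "Wow! Beautiful.",
    "Looking really beautiful.",
    "Very pretty, nice pic.",
    "Where was this taken?",
    "Which camera did you use?",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(comments)

km = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = km.fit_predict(X)
terms = vec.get_feature_names_out()

# Summarise each cluster by its size and its most central term.
for k in range(2):
    size = (labels == k).sum()
    top_term = terms[km.cluster_centers_[k].argmax()]
    print(f'{size} comments mention something like "{top_term}"')

From there, the ranking techniques listed above (TF-IDF weights, Maximal Marginal Relevance) can pick a better representative phrase than a single top term.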
