NLP project on Comment Summarization - python

I am planning to do my final year project on Natural Language Processing (using NLTK) and my area of interest is Comment Summarization from Social media websites such as Facebook. For example, I am trying to do something like this:
Random Facebook comments on a picture:
Wow! Beautiful.
Looking really beautiful.
Very pretty, Nice pic.
Now, all these comments will get mapped (using a template based comment summarization technique) into something like this:
3 people find this picture to be "beautiful".
The output will consist of the word "beautiful", since it is used more often in the comments than the word "pretty" (and also because "beautiful" and "pretty" are synonyms). To accomplish this task, I am going to use approaches like tracking keyword frequency and keyword scores (in this scenario, "beautiful" and "pretty" have very close scores).
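To make this concrete, here is a rough sketch of what I have in mind, using NLTK's WordNet to group "pretty" under "beautiful"; the token cleanup is deliberately naive, and nltk.download("wordnet") may be needed first:

from nltk.corpus import wordnet

comments = ["Wow! Beautiful.", "Looking really beautiful.",
            "Very pretty, Nice pic."]

def related_lemmas(word):
    # lemmas of the word's synsets plus their adjective "satellite" synsets,
    # which is where WordNet keeps near-synonyms like "pretty"
    lemmas = set()
    for syn in wordnet.synsets(word):
        lemmas.update(lem.name().lower() for lem in syn.lemmas())
        for sim in syn.similar_tos():
            lemmas.update(lem.name().lower() for lem in sim.lemmas())
    return lemmas

related = related_lemmas("beautiful")  # includes "pretty"
tokens = [[w.strip(".,!?").lower() for w in c.split()] for c in comments]
hits = sum(any(w in related for w in toks) for toks in tokens)
print(f'{hits} people find this picture to be "beautiful".')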
Is this the best way to do it?
So far in my research, I have been able to come up with the following papers, but none of them address this kind of comment summarization:
Automatic Summarization of Events from Social Media
Social Context Summarization
What are the other papers in this field which address a similar issue?
Apart from this, I also want my summarizer to improve with every summarization task. How do I apply machine learning in this regard?

Topic model clustering is what you are looking for.
A search on Google Scholar for "topic model clustering" will give you lots of references.
To understand them, you need to be familiar with approaches to the following tasks, in addition to the basics of machine learning in general.
Clustering: Cosine distance clustering, k-means clustering
Ranking: PageRank, TF-IDF, Mutual Information Gain, Maximal Marginal Relevance
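As a concrete starting point, here is a minimal scikit-learn sketch that combines two of those pieces, TF-IDF vectors plus k-means over L2-normalized vectors (so plain Euclidean k-means approximates cosine-distance clustering); the sample comments are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

comments = ["Wow! Beautiful.", "Looking really beautiful.",
            "Very pretty, nice pic.", "Where was this picture taken?"]

# unit-length TF-IDF vectors make Euclidean k-means mimic cosine distance
X = normalize(TfidfVectorizer().fit_transform(comments))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # comments sharing a label talk about the same thing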

Need suggestions for a NLP use case

I am trying to build a web scraper that can classify the content of a given URL into multiple categories, but I am currently confused about which method is best suited for my use case. Here's the overall use case:
I want to predict a researcher's interest from their biography and categorize them into one or multiple categories based on SDG 17 goals. I have three data points to work with:
The biography of each researcher (can be scraped and tokenized)
A list of keywords that are often associated with each of the SDG categories/goals (here's an example of said keywords)
Hundreds of categorizations done manually by students in the form of binary data (here's an example of said data)
So far, we have students who read each researcher's biography and decide which SDG category/goal each researcher belongs to. One researcher can belong to one or more SDG categories. We usually categorize based on how often the SDG keywords listed in our database appear in each researcher's bio.
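For reference, the current keyword-counting baseline looks roughly like this; the keyword sets are illustrative stand-ins for our real SDG keyword database:

sdg_keywords = {
    "sdg3": {"health", "disease", "wellbeing"},
    "sdg4": {"education", "learning", "school"},
}

def sdg_scores(bio_tokens):
    # count how many bio tokens fall in each goal's keyword set
    return {goal: sum(tok in kws for tok in bio_tokens)
            for goal, kws in sdg_keywords.items()}

bio = "research on education policy and school health programs".split()
print(sdg_scores(bio))  # {'sdg3': 1, 'sdg4': 2}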
I have looked up online machine learning models for NLP but couldn't decide on which method would work best with my use case. Any suggestions and references would be super appreciated because I'm a bit lost here.
The problem you have here is multi-label classification, and you can solve it by applying supervised learning, since you have a labelled dataset.
A labelled dataset should look something like this:
article 1 - sdg1, sdg2, sdg4
article 2 - sdg4
.
.
.
The implementation is explained in detail here - keras - multi-label-classification
This one has many details abstracted away and keeps the implementation simple - fasttext multi-label-classification
More in-depth discussions of these libraries are here:
keras and fasttext
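For orientation, a minimal Keras sketch of the multi-label setup could look like this; the feature size, layer width, and decision threshold are illustrative assumptions, not tuned values:

from tensorflow import keras

num_features = 5000   # e.g. the size of a TF-IDF vocabulary (assumption)
num_labels = 17       # one output per SDG goal

model = keras.Sequential([
    keras.layers.Input(shape=(num_features,)),
    keras.layers.Dense(256, activation="relu"),
    # sigmoid (not softmax): each label is an independent yes/no decision
    keras.layers.Dense(num_labels, activation="sigmoid"),
])
# binary_crossentropy treats each of the 17 outputs as its own binary problem
model.compile(optimizer="adam", loss="binary_crossentropy")

# X: (n_samples, num_features) floats; Y: (n_samples, 17) multi-hot 0/1 rows
# model.fit(X, Y, epochs=10, batch_size=32)
# Probabilities above a threshold (e.g. 0.5) become the assigned SDG labels.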

is there a way to use pretrained doc2vec model to evaluate some document dataset

Lately I have been doing research aimed at unsupervised clustering of a huge text database. First I tried bag-of-words and then several clustering algorithms, which gave me a good result, but now I am trying to move to a doc2vec representation, and it does not seem to be working for me: I cannot load a prepared model and work with it, and training my own does not give any useful result.
I tried to train my model on 10k texts
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=2, epochs=100,workers=8)
(around 20-50 words each), but the similarity score proposed by gensim, like
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
performs much worse than the same query against my bag-of-words model.
By much worse I mean that identical or almost identical texts get a similarity score comparable to texts that have no connection I can think of. So I decided to use a model from Is there pre-trained doc2vec model? - a pretrained model which might capture more connections between words. Sorry for the somewhat long preamble, but the question is: how do I plug it in? Can someone provide some ideas on how, using the loaded gensim model from https://github.com/jhlau/doc2vec, I can convert my own dataset of texts into vectors of the same length? My data is preprocessed (stemmed, no punctuation, lowercase, no nltk.corpus stopwords) and I can deliver it as a list, dataframe, or file if needed; the code question is how to pass my own data to the pretrained model. Any help would be appreciated.
UPD: outputs that make me feel bad
Train Document (6134): «use medium paper examination medium habit one
week must chart daily use medium radio television newspaper magazine
film video etc wake radio alarm listen traffic report commuting get
news watch sport soap opera watch tv use internet work home read book
see movie use data collect journal basis analysis examining
information using us gratification model discussed textbook us
gratification article provided perhaps carrying small notebook day
inputting material evening help stay organized smartphone use note app
track medium need turn diary trust tell tell immediately paper whether
actually kept one begin medium diary soon possible order give ample
time complete journal write paper completed diary need write page
paper use medium functional analysis theory say something best
understood understanding used us gratification model provides
framework individual use medium basis analysis especially category
discussed posted dominick article apply concept medium usage expected
le medium use cognitive social utility affiliation withdrawal must
draw conclusion use analyzing habit within framework idea discussed
text article concept must clearly included articulated paper common
mistake student make assignment tell medium habit fail analyze habit
within context us gratification model must include idea paper»
Similar Document (6130, 0.6926988363265991): «use medium paper examination medium habit one week must chart daily use medium radio
television newspaper magazine film video etc wake radio alarm listen
traffic report commuting get news watch sport soap opera watch tv use
internet work home read book see movie use data collect journal basis
analysis examining information using us gratification model discussed
textbook us gratification article provided perhaps carrying small
notebook day inputting material evening help stay organized smartphone
use note app track medium need turn diary trust tell tell immediately
paper whether actually kept one begin medium diary soon possible order
give ample time complete journal write paper completed diary need
write page paper use medium functional analysis theory say something
best understood understanding used us gratification model provides
framework individual use medium basis analysis especially category
discussed posted dominick article apply concept medium usage expected
le medium use cognitive social utility affiliation withdrawal must
draw conclusion use analyzing habit within framework idea discussed
text article concept must clearly included articulated paper common
mistake student make assignment tell medium habit fail analyze habit
within context us gratification model must include idea paper»
This looks perfectly OK, but looking at other outputs:
Train Document (1185): «photography garry winogrand would like paper
life work garry winogrand famous street photographer also influenced
street photography aim towards thoughtful imaginative treatment detail
referencescite research material academic essay university level»
Similar Document (3449, 0.6901006698608398): «tang dynasty write page
essay tang dynasty essay discus buddhism tang dynasty name artifact
tang dynasty discus them history put heading paragraph information
tang dynasty discussed essay»
This shows that the similarity score between two almost identical texts (the most similar pair in the system) and between two texts that are completely distinct is almost the same, which makes it problematic to do anything with the data.
To get the most similar documents I use
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
The models from https://github.com/jhlau/doc2vec are based on a custom fork of an older version of gensim, so you'd have to find/use that to make them usable.
Models from a generic dataset (like Wikipedia) may not understand the domain-specific words you need, and even where words are shared, the effective senses of those words may vary. Also, to use another model to infer vectors on your data, you should ensure you're preprocessing/tokenizing your text in the same way as the training data was processed.
Thus, it's best to use a model you've trained yourself – so you fully understand it – on domain-relevant data.
10k documents of 20-50 words each is a bit small compared to published Doc2Vec work, but might work. Trying to get 500-dimensional vectors from a smaller dataset could be a problem. (With less data, fewer vector dimensions and more training iterations may be necessary.)
If your results on your self-trained model are unsatisfactory, there could be other problems in your training and inference code (not yet shown in your question). It would also help to see more concrete examples/details of how your results are unsatisfactory, compared to a baseline (like the bag-of-words representations you mention). If you add these details to your question, it might be possible to offer other suggestions.
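For reference, a minimal self-training sketch along those lines could look like this; the tiny corpus, min_count=1, and the 100-dimensional size are illustrative stand-ins (note that in gensim >= 4.0, model.docvecs is accessed as model.dv):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tokenized_docs stands in for your 10k preprocessed documents
tokenized_docs = [["use", "medium", "paper", "examination"],
                  ["photography", "garry", "winogrand", "street"]]
corpus = [TaggedDocument(words=toks, tags=[i])
          for i, toks in enumerate(tokenized_docs)]

model = Doc2Vec(vector_size=100,  # far fewer than 500 for ~10k short docs
                min_count=1,      # 1 only because this toy corpus is tiny
                epochs=100, workers=8)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# infer a vector for a query tokenized the same way as the training data
inferred_vector = model.infer_vector(tokenized_docs[0])
sims = model.docvecs.most_similar([inferred_vector], topn=10)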

Classifying text documents using nltk

I'm currently working on a project where I'm taking emails, stripping out the message bodies using the email package, then I want to categorize them using labels like sports, politics, technology, etc...
I've successfully stripped the message bodies out of my emails, and now I'm looking to start classifying. I've done the classic example of sentiment-analysis classification using the movie_reviews corpus, separating documents into positive and negative reviews.
I'm just wondering how I could apply this approach to my project? Can I create multiple classes like sports, technology, politics, entertainment, etc.? I have hit a road block here and am looking for a push in the right direction.
If this isn't an appropriate question for SO I'll happily delete it.
Edit: Hello everyone, I see that this post has gained a bit of popularity. I did end up successfully completing this project; here is a link to the code in the project's GitHub repo:
https://github.com/codyreandeau/Email-Categorizer/blob/master/Email_Categorizer.py
The task of text classification is a supervised machine learning problem. This means that you need to have labelled data. When you approached the movie_reviews problem, you used the +1/-1 labels to train your sentiment analysis system.
Getting back to your problem:
If you have labels for your data, approach the problem in the same manner. I suggest you use the scikit-learn library. You can draw some inspiration from here: Scikit-Learn for Text Classification
If you don't have labels, you can try an unsupervised learning approach. If you have any clue about how many categories you have (call that number K), you can try a KMeans approach. This means grouping the emails into K categories based on how similar they are. Similar emails will end up in similar buckets. Then inspect the clusters by hand and come up with a label. Assign new emails to the most similar cluster. If you need help with KMeans check this quick recipe: Text Clustering Recipe
Suggestion: Getting labels for emails can be easier than you think. For example, Gmail lets you export your emails with folder information. If you have categorised your email, you can take advantage of this.
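If you do have labels, the supervised route can be as short as this scikit-learn sketch; the email bodies and labels are illustrative:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

bodies = ["the game went into overtime last night",
          "parliament passed the new budget bill",
          "the new phone ships with a faster chip"]
labels = ["sports", "politics", "technology"]

# TF-IDF features feeding a linear SVM, chained in one pipeline
clf = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                ("svm", LinearSVC())])
clf.fit(bodies, labels)
print(clf.predict(["the senator gave a speech on the bill"]))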
To create a classifier, you need a training data set with the classes you are looking for. In your case, you may need to either:
create your own data set
use a pre-existing dataset
The Brown corpus is a seminal corpus with many of the categories you are talking about. It could be a starting point to help classify your emails, using a package like gensim to find semantically similar texts.
Once you classify your emails, you can then train a system to predict a label for each unseen email.
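As a quick look at what the Brown corpus offers (it ships with NLTK; nltk.download("brown") may be needed first):

from nltk.corpus import brown

print(brown.categories())  # includes 'news', 'government', 'hobbies', ...
news_words = brown.words(categories="news")
print(news_words[:10])     # labelled tokens usable as training text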

Semantic Clustering

I am looking for advice on how to find clusters of terms that are all related to a single concept.
The goal is to improve a tag or keyword search for images that describe concepts, processes, or situations. An image may describe a brainstorming session or a particular theme. These images, which are meant to be used in PowerPoint or other presentation material, have user-contributed tags.
The issue is that our tag-based search may bring back completely unrelated images. Our goal is to find the clusters within the tags in order to refine the tags related to a central concept and remove the outliers that are not related to the clusters.
For example, if you had the tags meeting, planning, brainstorming, and round table, ideally we would want to remove round table from the cluster, as it doesn't fit the theme.
I have worked with WordNet Similarity but the results are quite strange. I was wondering if there are any other tools in python's NLTK that could help me solve this.
Thanks!
Your question falls into the area called "topic modeling". You can use:
gensim
https://radimrehurek.com/gensim/
or lda
https://pypi.python.org/pypi/lda
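A minimal gensim LDA sketch over per-image tag lists could look like this; the tag sets are illustrative stand-ins for your user-contributed tags:

from gensim import corpora, models

tag_sets = [["meeting", "planning", "brainstorming"],
            ["round", "table", "dining", "furniture"],
            ["meeting", "brainstorming", "whiteboard"]]

dictionary = corpora.Dictionary(tag_sets)
bow = [dictionary.doc2bow(tags) for tags in tag_sets]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    # tags that load on the same topic form a cluster; tags that load on
    # neither topic are outlier candidates (like "round table" above)
    print(topic_id, words)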

question on sentiment analysis

I have a question regarding sentiment analysis that I need help with.
Right now, I have a bunch of tweets I've gathered through the Twitter search API. Because I chose the search terms, I know which subjects or entities (person names) I want to look at. I want to know how others feel about these people.
For starters, I downloaded a list of English words with known valence/sentiment scores and calculated the sentiment (+/-) based on the presence of these words in the tweet. The problem is that sentiment calculated this way reflects the tone of the tweet rather than the sentiment ABOUT the person.
For instance, I have this tweet:
"lol... Person A is a joke. lmao!"
The message is obviously in a positive tone, but Person A should get a negative score.
To improve my sentiment analysis, I can probably take negation and modifiers from my word list into account. But how exactly can I get my sentiment analysis to look at the subject of the message (and possibly sarcasm) instead?
It would be great if someone can direct me towards some resources....
While awaiting answers from researchers in the AI field, I will give you some clues about what you can do quickly.
Even though this topic requires knowledge from natural language processing, machine learning and even psychology, you don't have to start from scratch unless you're desperate or have no trust in the quality of research going on in the field.
One possible approach to sentiment analysis would be to treat it as a supervised learning problem, where you have a small training corpus that includes human-made annotations (more about that later) and a testing corpus on which you test how well your approach/system is performing. For training you will need some classifier, like an SVM, HMM, or others, but keep it simple. I would start from binary classification: good, bad. You could do the same for a continuous spectrum of opinion ranges, from positive to negative; that is, get a ranking, like Google, where the most valuable results come on top.
For a start check the libsvm classifier; it is capable of doing both classification ({good, bad}) and regression (ranking).
The quality of annotations will have a massive influence on the results you get, but where to get it from?
I found one project about sentiment analysis that deals with restaurants. There is both data and code, so you can see how they extracted features from natural language and which features scored high in the classification or regression.
The corpus consists of opinions of customers about restaurants they recently visited and gave some feedback about the food, service or atmosphere.
The connection between their opinions and the numerical world is expressed in the number of stars they gave to the restaurant. You have natural language on one side and the restaurant's rating on the other.
Looking at this example you can devise your own approach for the problem stated.
Take a look at nltk as well. With nltk you can do part-of-speech tagging and, with some luck, get names too. Having done that, you can add a feature to your classifier that assigns a score to a name if, within n words (a skip n-gram), there are words expressing opinions (look at the restaurant corpus), or use the weights you already have; but it's best to rely on a classifier to learn the weights - that's its job.
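A toy sketch of that name-plus-context-window feature; the lexicon and window size are made-up illustrations (nltk.download("punkt") may be needed for the tokenizer):

import nltk

OPINION_WORDS = {"joke": -1.0, "awful": -1.0, "great": 1.0}  # toy lexicon
WINDOW = 4  # tokens to inspect on either side of the name

def name_sentiment(tweet, name):
    tokens = [t.lower() for t in nltk.word_tokenize(tweet)]
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok == name.lower():
            # only score opinion words near the name, not the whole tweet
            context = tokens[max(0, i - WINDOW):i + WINDOW + 1]
            score += sum(OPINION_WORDS.get(w, 0.0) for w in context)
    return score

print(name_sentiment("lol... Person is a joke. lmao!", "Person"))  # -1.0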
In the current state of technology this is impossible.
English (and any other language) is VERY complicated and cannot be "parsed" yet by programs. Why? Because EVERYTHING has to be special-cased. Saying that someone is a joke is a special case of the word "joke", which becomes another exception in your program. Etcetera, etc, etc.
A good example (posted by ScienceFriction somewhere here on SO):
Similarly, the sentiment word "unpredictable" could be positive in the context of a thriller but negative when describing the brake system of a Toyota.
If you are willing to spend +/-40 years of your life on this subject, go ahead, it will be much appreciated :)
I don't entirely agree with what nightcracker said. I agree that it is a hard problem, but we are making good progress towards a solution.
For example, part-of-speech tagging might help you figure out the subject, verb, and object in a sentence. And n-grams might help you figure out the context in the Toyota vs. thriller example. Look at TagHelperTools. It is built on top of weka and provides part-of-speech and n-gram tagging.
Still, it is difficult to get the results that OP wants, but it won't take 40 years.
