build WORD2VEC words dictionary to indicate feelings - python

I am working with Word2vec (Gensim, in Python) to understand the meaning of sentences (via each word in them).
My goal is to be able to tell whether a sentence indicates something about the speaker's feelings.
Where can I find this kind of dictionary of words?
For example, one dictionary for words that indicate happiness and another for sadness.
Thanks

Try SentiWordNet
"SentiWordNet is a lexical resource for opinion mining that assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity"

Related

TF-IDF algorithm on chinese text

I am running TF-IDF on Chinese text and searching for the top 10 most used words in the text.
When I get the top 10 words, some of them are meaningless words like "成为", "表示" and others.
Is there any way to get only meaningful words?
I am using "jieba" to segment the Chinese sentences into words.
Words like "成为" and "表示" are what we refer to as stop words. In many cases, they are commonly used words that provide little meaning within the sentence; think of the words "a" and "the" in English.
It is often necessary to remove these stop words before performing analysis, especially for TF-IDF, as leaving them in may lead to meaningless results, as you have seen.
It seems that jieba doesn't include functionality to remove stop words, but genediazjr collected a fairly comprehensive list of stop words for the Chinese language. You can import this and remove these stop words from your original text before the TF-IDF analysis.
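A minimal sketch of the filtering step, with a tiny illustrative stop-word set (in practice you would load the full list from genediazjr's collection, and the token list would come from `jieba.lcut(text)`):

```python
from collections import Counter

# Stand-in for jieba's output; in practice: tokens = jieba.lcut(text)
tokens = ['成为', '机器', '学习', '表示', '机器', '学习', '的', '模型']

# Tiny illustrative stop-word set; load the full Chinese list in real use
stopwords = {'成为', '表示', '的'}

# Drop stop words before counting term frequencies
filtered = [t for t in tokens if t not in stopwords]
top_words = Counter(filtered).most_common(10)
print(top_words)  # [('机器', 2), ('学习', 2), ('模型', 1)]
```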

How to Grab meaning of sentence using NLP?

I am new to NLP. My requirement is to parse meaning from sentences.
Example
"Perpetually Drifting is haunting in all the best ways."
"When The Fog Rolls In is a fantastic song."
From the above sentences, I need to extract the following phrases:
"haunting in all the best ways."
"fantastic song"
Is it possible to achieve this in spacy?
It is not possible to extract summarized sentences using spaCy alone. The following methods might work for you:
The simplest one is to extract the noun phrases or verb phrases. Most of the time that should give you the text you want (phrase structure grammar).
You can use dependency parsing and extract the head word's dependencies.
dependency grammar
You can train a sequence model where the input is the full sentence and the output is your summarized sentence.
Sequence models for text summarization
Extracting the meaning of a sentence is quite an arbitrary task. What do you mean by "meaning"? Using spaCy you can extract the dependencies between the words (which encode much of the sentence's meaning), find the POS tags to check how words are used in the sentence, and also find places, organizations, and people using the NER tagger. However, "the meaning of a sentence" is too general a notion, even for humans.
Maybe you are searching for a specific kind of meaning? If that's the case, you have to train your own classifier. This will get you started.
If your task is summarization of a couple of sentences, consider also using gensim. You can have a look here.
Hope it helps :)

Find 'modern' nltk words corpus [duplicate]

This question already has answers here:
nltk words corpus does not contain "okay"?
(2 answers)
Closed 5 years ago.
I'm building a text classifier that will classify text into topics.
In the first phase of my program, as part of cleaning the data, I remove all the non-English words. For this I'm using the nltk.corpus.words.words() corpus. The problem with this corpus is that it lacks 'modern' English words such as Facebook, Instagram, etc., so my cleaning step removes them. Does anybody know of another, more 'modern' corpus which I can replace or union with the present one?
I prefer nltk corpus but I'm open to other suggestions.
Thanks in advance
Rethink your approach. Any collection of English texts will have a "long tail" of words that you have not seen before. No matter how large a dictionary you amass, you'll end up removing words that are not actually non-English. And to what purpose? Leave them in; they won't spoil your classification.
If your goal is to remove non-English text, do it at the sentence or paragraph level using a statistical approach, e.g. ngram models. They work well and need minimal resources.
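A toy illustration of that statistical idea (real systems use language-identification models trained on large corpora): score a sentence by how many of its character trigrams appear in an English reference sample. The reference string here is deliberately tiny and purely illustrative.

```python
def char_trigrams(text):
    """Set of overlapping 3-character substrings, lower-cased."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Tiny English reference sample; a real model would be trained on a large corpus
reference = char_trigrams(
    'the quick brown fox jumps over the lazy dog '
    'this is a sample of ordinary english text for reference'
)

def english_score(sentence):
    """Fraction of the sentence's trigrams seen in the reference sample."""
    grams = char_trigrams(sentence)
    return len(grams & reference) / len(grams) if grams else 0.0

print(english_score('this is english text'))  # high: mostly familiar trigrams
print(english_score('zxqj vwkp qqzz'))        # low: mostly unseen trigrams
```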
I'd use Wikipedia, but it's pretty time-consuming to tokenize the entirety of it. Fortunately, the work has been done for you already: you could use a pretrained Word2Vec model (such as the well-known one trained on roughly 100 billion words of Google News) and just check whether the word is in the model's vocabulary.
I also found this project where Chris made text files of the model's 3-million-word vocabulary.
Note that this project's word list doesn't contain some stop words, so it would be a good idea to take the union of your list from nltk and this one.

R/python: build model from training sentences

What I'm trying to achieve:
I have been looking for an approach for a long while now, but I'm not able to find an (effective) way to do this:
build a model from example sentences while taking word order and synonyms into account.
map a sentence against this model and get a similarity score (thus a score indicating how well this sentence fits the model; in other words, how well it fits the sentences which were used to train the model)
What I tried:
Python: nltk in combination with gensim. As far as I could code and read, it is only capable of using word similarity (not taking word order into account).
R: I used tm to build a TermDocumentMatrix, which looked really promising, but I was not able to map anything against this matrix. Furthermore, this TermDocumentMatrix seems to take order into account but misses the synonyms (I think).
I know the lemmatization didn't go that well hahah :)
Question:
Is there any way to achieve the steps described above using either R or Python? A simple code sample would be great (or references to a good tutorial).
There are many ways to do what you described above, and it will of course take lots of testing to find an optimized solution. But here is some helpful functionality to help solve this using Python/nltk.
build a model from example sentences while taking word order and
synonyms into account.
1. Tokenization
In this step you will want to break down individual sentences into a list of words.
Sample code:
import nltk
nltk.download('punkt')  # one-time download of the tokenizer data
tokenized_sentence = nltk.word_tokenize('this is my test sentence')
print(tokenized_sentence)
['this', 'is', 'my', 'test', 'sentence']
2. Finding synonyms for each word.
Sample code:
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')
synset_list = wn.synsets('motorcar')
print(synset_list)
[Synset('car.n.01')]
Feel free to research synsets if you are unfamiliar, but for now just know that the above returns a list, so multiple synsets may be returned.
From the synset you can get a list of synonyms.
Sample code:
print( wn.synset('car.n.01').lemma_names() )
['car', 'auto', 'automobile', 'machine', 'motorcar']
Great, now you are able to convert your sentence into a list of words, and you're able to find synonyms for all words in your sentences (while retaining the order of your sentence). Also, you may want to consider removing stopwords and stemming your tokens, so feel free to look up those concepts if you think it would be helpful.
You will of course need to write the code to do this for all sentences, and store the data in some data structure, but that is probably outside the scope of this question.
map a sentence against this model and get a similarity score (thus a
score indicating how much this sentence fits the model, in other words
fits the sentences which were used to train the model)
This is difficult to answer since the possibilities to do this are endless, but here are a few examples of how you could approach it.
If you're interested in binary classification you could do something as simple as: have I seen this sentence, or a variation of this sentence, before (a variation being the same sentence but with words replaced by their synonyms)? If so, the score is 1; else the score is 0. This would work, but may not be what you want.
Another example: store each sentence along with its synonyms in a nested Python dictionary and calculate the score depending on how far down the dictionary you can align the new sentence.
Example:
training_sentence1 = 'This is my awesome sentence'
training_sentence2 = 'This is not awesome'
And here is a sample data structure on how you would store those 2 sentences:
my_dictionary = {
    'this': {
        'is': {
            'my': {
                'awesome': {
                    'sentence': {}
                }
            },
            'not': {
                'awesome': {}
            }
        }
    }
}
Then you could write a function that traverses that data structure for each new sentence, and depending how deep it gets, give it a higher score.
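A minimal sketch of such a traversal, using the dictionary above. The scoring scheme here (depth reached divided by sentence length) is just one plausible choice:

```python
my_dictionary = {
    'this': {
        'is': {
            'my': {'awesome': {'sentence': {}}},
            'not': {'awesome': {}},
        }
    }
}

def alignment_score(tokens, tree):
    """Score a tokenized sentence by how deep it aligns into the nested dict."""
    depth = 0
    node = tree
    for token in tokens:
        if token not in node:
            break  # sentence diverges from every training sentence here
        node = node[token]
        depth += 1
    return depth / len(tokens) if tokens else 0.0

print(alignment_score(['this', 'is', 'my', 'awesome', 'sentence'], my_dictionary))  # 1.0
print(alignment_score(['this', 'was', 'awesome'], my_dictionary))  # diverges after one token
```

Synonym handling would slot in at the membership test: instead of `token not in node`, check whether the token or any of its WordNet synonyms is a key.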
Conclusion:
The above two examples are just some possible ways to approach the similarity problem. There are countless articles and whitepapers on computing semantic similarity between texts, so my advice would be to explore the many options.
I purposely excluded supervised classification models, since you never mentioned having access to labelled training data, but of course that route is possible if you do have a gold standard data source.

wordnet similarity calculation in python

I have a list of medical words in a file, and a list of tweets which have been tokenized and lemmatized into words.
I want to check the similarity between each word in a tweet and all the words in the medical file, i.e. how close a word in a tweet is to any medical word.
Can the above be accomplished? Please help me out with all possible ways to accomplish it.
Thanks
Since your terminology is in the medical domain, you might want to consider the use of the UMLS. UMLS::Similarity could help you find similarity between medical terms.
http://umls-similarity.sourceforge.net
Good luck,
Ted
