I have a list of medical words in a file, and a list of tweets that have been tokenized and lemmatized into words.
I want to measure the similarity between each word in a tweet and all the words in the medical file, i.e. find how close a tweet word is to any medical word.
Can this be accomplished? Please help me out with all the possible ways to do it.
Thanks
Since your terminology is in the medical domain, you might want to consider using the UMLS (Unified Medical Language System). UMLS::Similarity can help you compute similarity between medical terms:
http://umls-similarity.sourceforge.net
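UMLS::Similarity is a Perl package. If you want a quick Python baseline in the meantime, here is a minimal sketch (an assumption on my part; this is plain string similarity via the standard library's difflib, not the semantic similarity UMLS::Similarity computes) that scores each tweet token against every medical term:

from difflib import SequenceMatcher

medical_terms = ["diabetes", "hypertension", "insulin"]   # terms loaded from your file
tweet_tokens = ["diabetic", "coffee", "insulin"]          # one tokenized, lemmatized tweet

for tok in tweet_tokens:
    # Find the medical term with the highest character-overlap ratio.
    best = max(medical_terms, key=lambda t: SequenceMatcher(None, tok, t).ratio())
    score = SequenceMatcher(None, tok, best).ratio()
    print(tok, "->", best, round(score, 2))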
Good luck,
Ted
I am working with Word2vec (Gensim, in Python) to understand the meaning of sentences (via each word in them).
My goal is to be able to tell whether a sentence says something about the feeling of the speaker.
Where can I find this kind of dictionary of words?
For example, one dictionary for words that indicate happiness and another for sadness.
Thanks
Try SentiWordNet:
"SentiWordNet is a lexical resource for opinion mining that assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity"
Hello, I'm looking for a solution to my issue:
I want to find lists of similar words across French and English.
For example:
name could be: first name, last name, nom, prénom, username...
postal address could be: city, country, street, ville, pays, code postal...
The other answer and the comments describe how to get synonyms, but I think you want more than that?
I can suggest two broad approaches: WordNet and word embeddings.
Using NLTK and WordNet, you want to explore the adjacent graph nodes. See http://www.nltk.org/howto/wordnet.html for an overview of the functions available. I'd suggest that once you've found your start word in WordNet, you follow all its relations, but also go up to the hypernym and do the same there; a sketch follows.
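A minimal version of that walk with NLTK (assuming nltk.download('wordnet') has been run):

from nltk.corpus import wordnet as wn

for syn in wn.synsets("address"):
    print(syn.name(), [l.name() for l in syn.lemmas()])       # the synset's own lemmas
    for hyper in syn.hypernyms():                             # one step up the graph
        print("  hypernym:", [l.name() for l in hyper.lemmas()])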
Finding the start word is not always easy; "postal address", for instance, returns nothing:
http://wordnetweb.princeton.edu/perl/webwn?s=Postal+address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
Instead it seems I have to use "address": http://wordnetweb.princeton.edu/perl/webwn?s=address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
and then decide which of those is the correct sense here. Then try clicking the hypernym, hyponym, sister term, etc.
To be honest, none of those feels quite right.
Open Multilingual WordNet tries to link different languages: http://compling.hss.ntu.edu.sg/omw/ So you could take your English WordNet code and move to the French WordNet with it, or vice versa.
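NLTK exposes the OMW data too; a sketch, assuming nltk.download('omw-1.4') in addition to the WordNet data:

from nltk.corpus import wordnet as wn

for syn in wn.synsets("address", pos=wn.NOUN):
    print(syn.name(), syn.lemma_names("fra"))   # French lemmas attached to the same synset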
The other approach is to use word embeddings. You find the, say, 300-dimensional vector of your source word, and then hunt for the nearest words in that vector space. This will return words that are used in similar contexts, so they could be similar in meaning, or merely similar syntactically.
spaCy has a good implementation; see https://spacy.io/usage/spacy-101#vectors-similarity and https://spacy.io/usage/vectors-similarity
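A two-line check, assuming a model that ships word vectors (the small "sm" models do not):

import spacy

nlp = spacy.load("en_core_web_md")                 # "md" and "lg" models include vectors
print(nlp("address").similarity(nlp("location")))  # cosine similarity of the two vectors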
Regarding English and French, normally you would work in the two languages independently. But if you search for "multilingual word embeddings" you will find some papers and projects where the vector stays the same for the same concept in different languages.
Note: the API is geared towards telling you how similar two words are, not towards finding similar words. To find similar words you need to take your vector and compare it with every other word vector, which is O(N) in the size of the vocabulary. So you might want to do this offline and build your own "synonyms-and-similar" dictionary for each word of interest, as sketched below.
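A sketch of that offline scan, assuming spaCy v3's Vectors.most_similar helper (which performs the brute-force comparison against the whole vectors table for you):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

def similar_words(word, n=10):
    # Compare the query vector against every row of the vectors table: O(N).
    query = nlp.vocab[word].vector.reshape(1, -1)
    keys, _, scores = nlp.vocab.vectors.most_similar(np.asarray(query), n=n)
    return [(nlp.vocab.strings[int(k)], float(s)) for k, s in zip(keys[0], scores[0])]

print(similar_words("address"))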
from PyDictionary import PyDictionary

dictionary = PyDictionary()
word = "address"                    # the word whose synonyms you want
answer = dictionary.synonym(word)   # returns a list of synonyms
I am new to NLP. My requirement is to parse meaning from sentences.
Example
"Perpetually Drifting is haunting in all the best ways."
"When The Fog Rolls In is a fantastic song
From the above sentences, I need to extract the following phrases:
"haunting in all the best ways."
"fantastic song"
Is it possible to achieve this in spacy?
It is not possible to extract summarized sentences directly with spaCy, but I hope the following methods work for you:
The simplest is to extract the noun phrases or verb phrases; most of the time that should give you the text you want (phrase structure grammar). See the sketch after this list.
You can use dependency parsing and extract the dependencies of the head word (dependency grammar).
You can train a sequence model where the input is the full sentence and the output is your summarized sentence (sequence models for text summarization).
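For the first option, a minimal sketch using spaCy's built-in noun-phrase iterator:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("When The Fog Rolls In is a fantastic song.")
for chunk in doc.noun_chunks:       # noun phrases derived from the parse
    print(chunk.text)               # e.g. "a fantastic song"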
Extracting the meaning of a sentence is quite an arbitrary task. What do you mean by "meaning"? Using spaCy you can extract the dependencies between the words (which encode much of the sentence's meaning), find the POS tags to check how words are used in the sentence, and also find places, organizations, and people using the NER tagger. However, the meaning of a sentence is too general a notion, even for humans.
Maybe you are searching for a specific meaning? If that's the case, you have to train your own classifier. This will get you started.
If your task is summarization of a couple of sentences, consider also using gensim. You can have a look here.
Hope it helps :)
I'm building a text classifier that will classify text into topics.
In the first phase of my program as a part of cleaning the data, I remove all the non-English words. For this I'm using the nltk.corpus.words.words() corpus. The problem with this corpus is that it removes 'modern' English words such as Facebook, Instagram etc. Does anybody know another, more 'modern' corpus which I can replace or union with the present one?
I'd prefer an nltk corpus, but I'm open to other suggestions.
Thanks in advance
Rethink your approach. Any collection of English texts will have a "long tail" of words that you have not seen before. No matter how large a dictionary you amass, you'll end up removing words that are perfectly good English. And to what purpose? Leave them in; they won't spoil your classification.
If your goal is to remove non-English text, do it at the sentence or paragraph level using a statistical approach, e.g. n-gram language models; they work well and need minimal resources. A sketch follows.
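One way to do sentence-level filtering is with the third-party langdetect package (my choice here, an assumption; any n-gram-based language identifier would do):

# pip install langdetect
from langdetect import detect

for sent in ["This is clearly English.", "Ceci n'est pas de l'anglais."]:
    print(sent, "->", detect(sent))   # 'en' / 'fr'; drop sentences not detected as 'en'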
I'd use Wikipedia, but it's pretty time consuming to tokenize the entirety of it. Fortunately, that kind of work has been done for you already. You could use the pre-trained Word2Vec model trained on roughly 100 billion words (the well-known GoogleNews vectors) and just check whether a word is in the model's vocabulary.
I also found this project, where Chris made text files of the model's 3-million-word vocabulary.
Note that this project's list of words doesn't contain some stop words, so it'd be a good idea to take the union of your nltk list and this one.
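Checking vocabulary membership with gensim, as a sketch (the file name assumes the pre-trained GoogleNews vectors downloaded separately; key_to_index is the gensim 4.x attribute):

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print("Facebook" in model.key_to_index)   # True: 'Facebook' is in the 3M-word vocabulary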
Good day,
I'm attempting to write a sentiment analysis application in Python (using a Naive Bayes classifier) with the aim of categorizing phrases from news as being positive or negative.
And I'm having a bit of trouble finding an appropriate corpus for that.
I tried using the "General Inquirer" (http://www.wjh.harvard.edu/~inquirer/homecat.htm), which works OK, but I have one big problem with it.
Since it is a word list, not a phrase list, I observe the following problem when trying to label this sentence:
He is not expected to win.
This sentence is categorized as positive, which is wrong. The reason is that "win" is scored as positive, while "not" carries no score of its own, so the negation in the phrase "not win" is lost.
Can anyone suggest either a corpus or a work around for that issue?
Your help and insight are greatly appreciated.
See for example "What's great and what's not: learning to classify the scope of negation for improved sentiment analysis" by Councill, McDonald, and Velikovich:
http://dl.acm.org/citation.cfm?id=1858959.1858969
and follow-ups,
http://scholar.google.com/scholar?cites=3029019835762139237&as_sdt=5,33&sciodt=0,33&hl=en
e.g. Morante et al. (2011):
http://eprints.pascal-network.org/archive/00007634/
In this case, the word "not" modifies the meaning of the phrase "expected to win", reversing it. To identify this, you would need to POS-tag the sentence and apply the negative adverb "not" to the (I think) verb phrase as a negation. I don't know whether there is a corpus that would tell you that "not" is this type of modifier, however. A dependency parser can surface such negations directly; see the sketch below.
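A minimal sketch with spaCy (my choice of tool, not the answer's): its English models label negation modifiers with the "neg" dependency, which ties "not" to the verb it negates.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("He is not expected to win.")
for tok in doc:
    if tok.dep_ == "neg":                          # negation modifier found by the parser
        print(tok.text, "negates", tok.head.text)  # -> not negates expected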