Hello, I'm looking for a solution to my issue:
I want to find a list of similar words in French and English.
For example:
"name" could be: first name, last name, nom, prénom, username...
"Postal address" could be: city, country, street, ville, pays, code postale...
The other answer, and comments, describe how to get synonyms, but I think you want more than that?
I can suggest two broad approaches: WordNet and word embeddings.
Using NLTK and WordNet, you want to explore the adjacent graph nodes. See http://www.nltk.org/howto/wordnet.html for an overview of the functions available. I'd suggest that once you've found your start word in WordNet, you follow all its relations, but also go up to the hypernym and do the same there.
Finding the start word is not always easy:
http://wordnetweb.princeton.edu/perl/webwn?s=Postal+address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
Instead it seems I have to use "address": http://wordnetweb.princeton.edu/perl/webwn?s=address&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=&o3=&o4=&h=
and then decide which of those is the correct sense here. Then try clicking the hypernym, hyponym, sister term, etc.
To be honest, none of those feels quite right.
Open Multilingual WordNet tries to link different languages. http://compling.hss.ntu.edu.sg/omw/ So you could take your English WordNet code, and move to the French WordNet with it, or vice versa.
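To make the WordNet idea concrete, here is a minimal sketch (assuming the NLTK wordnet and omw-1.4 corpora are downloaded; the one-step traversal and the 'fra' language code are my choices, not something from the question):

from nltk.corpus import wordnet as wn

# Sketch only: assumes nltk.download('wordnet') and nltk.download('omw-1.4')
def related_terms(word, lang='fra'):
    terms = set()
    for syn in wn.synsets(word):
        # the synset's own lemmas, in English and (via OMW) in French
        terms.update(syn.lemma_names())
        terms.update(syn.lemma_names(lang))
        # walk one step up and down the hypernym/hyponym graph
        for rel in syn.hypernyms() + syn.hyponyms():
            terms.update(rel.lemma_names())
            terms.update(rel.lemma_names(lang))
    return terms

print(related_terms('address'))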
The other approach is to use word embeddings. You find the, say, 300-dimensional vector of your source word, and then hunt for the nearest words in that vector space. This will return words that are used in similar contexts, so they could be similar in meaning, or just similar syntactically.
Spacy has a good implementation, see https://spacy.io/usage/spacy-101#vectors-similarity and https://spacy.io/usage/vectors-similarity
Regarding English and French, normally you would work in the two languages independently. But if you search for "multilingual word embeddings" you will find some papers and projects where the vector stays the same for the same concept in different languages.
Note: the API is geared towards telling you how two words are similar, not finding similar words. To find similar words you need to take your vector and compare with every other word vector, which is O(N) in the size of the vocabulary. So you might want to do this offline, and build your own "synonyms-and-similar" dictionary for each word of interest.
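For illustration, here is a rough sketch of that offline nearest-neighbour scan with spaCy (a sketch under the assumption that a vectors model such as en_core_web_md is installed; it is deliberately brute force, exactly the O(N) scan described above):

import numpy as np
import spacy

# Sketch only: assumes "python -m spacy download en_core_web_md" has been run
nlp = spacy.load('en_core_web_md')

def most_similar(word, topn=10):
    target = nlp.vocab[word].vector
    norm = np.linalg.norm(target)
    if norm == 0:
        return []  # the word has no vector in this model
    target = target / norm
    scored = []
    # O(N) scan over every entry in the vectors table
    for key, row in nlp.vocab.vectors.items():
        rnorm = np.linalg.norm(row)
        candidate = nlp.vocab.strings[key]
        if rnorm == 0 or candidate == word:
            continue
        scored.append((float(target @ (row / rnorm)), candidate))
    scored.sort(reverse=True)
    return [w for _, w in scored[:topn]]

print(most_similar('address'))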
from PyDictionary import PyDictionary

dictionary = PyDictionary()
answer = dictionary.synonym(word)
word is the word for which you are finding the synonyms.
My goal is very simple: I have a set of strings or a sentence and I want to find the most similar one within a text corpus.
For example I have the following text corpus: "The front of the library is adorned with the Word of Life mural designed by artist Millard Sheets."
And I'd like to find the substring of the original corpus which is most similar to: "the library facade is painted"
So what I should get as output is: "The front of the library is adorned"
The only thing I came up with is to split the original sentence into substrings of variable lengths (e.g. substrings of 3, 4, or 5 words) and then use something like string.similarity(substring) from the spaCy Python module to assess the similarity of my target text with all the substrings, keeping the one with the highest value.
It seems a pretty inefficient method. Is there anything better I can do?
It probably works to some degree, but I wouldn't expect the spacy similarity method (averaging word vectors) to work particularly well.
The task you're working on is related to paraphrase detection/identification and semantic textual similarity and there is a lot of existing work. It is frequently used for things like plagiarism detection and the evaluation of machine translation systems, so you might find more approaches by looking in those areas, too.
If you want something that works fairly quickly out of the box for English, one suggestion is TERp, which was developed for MT evaluation but has been shown to work well for paraphrase detection:
https://github.com/snover/terp
Most methods are set up to compare two sentences, so this doesn't address your potential partial sentence matches. Maybe it would make sense to find the most similar sentence and then look for substrings within that sentence that match better than the sentence as a whole?
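As a rough sketch of that last suggestion with spaCy (assuming en_core_web_md for vectors; the window sizes and the example strings from the question are arbitrary illustration, and the averaging caveat above still applies):

import spacy

# Sketch only: assumes en_core_web_md is installed
nlp = spacy.load('en_core_web_md')

corpus = ("The front of the library is adorned with the Word of Life mural "
          "designed by artist Millard Sheets.")
query = nlp("the library facade is painted")

doc = nlp(corpus)
# Stage 1: pick the most similar sentence
best_sent = max(doc.sents, key=lambda s: query.similarity(s))

# Stage 2: slide windows of a few sizes over that sentence only
best_span, best_score = best_sent, query.similarity(best_sent)
for size in (3, 5, 7):
    for start in range(len(best_sent) - size + 1):
        span = best_sent[start:start + size]
        score = query.similarity(span)
        if score > best_score:
            best_span, best_score = span, score

print(best_span.text, best_score)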
I have a list of words and would like to keep only nouns.
This is not a duplicate of Extracting all Nouns from a text file using nltk
In the linked question a piece of text is processed. The accepted answer proposes a tagger. I'm aware of the different options for tagging text (nltk, textblob, spacy), but I can't use them, since my data doesn't consist of sentences. I only have a list of individual words:
would
research
part
technologies
size
articles
analyzes
line
nltk has a wide selection of corpora. I found verbnet with a comprehensive list of verbs. But so far I haven't seen anything similar for nouns. Is there something like a dictionary where I can look up whether a word is a noun, verb, adjective, etc.?
This could probably be done by some online service. Microsoft Translator, for example, returns a lot of information in its responses: https://learn.microsoft.com/en-us/azure/cognitive-services/translator/reference/v3-0-dictionary-lookup?tabs=curl
But this is a paid service. I would prefer a Python package.
Regarding the ambiguity of words: ideally I would like a dictionary that can tell me all the functions a word can have. "fish", for example, is both a noun and a verb; "eat" is only a verb, "dog" is only a noun. I'm aware that this is not an exact science. A working solution would simply remove all words that can't be nouns.
Have you tried using WordNet?
from nltk.corpus import wordnet
words = ["would","research","part","technologies","size","articles","analyzes","line"]
for w in words:
    syns = wordnet.synsets(w)
    print(w, syns[0].lexname().split('.')[0] if syns else None)
You should see:
would None
research noun
part noun
technologies noun
size noun
articles noun
analyzes verb
line noun
You can run a POS tagger on individual fragments; it will have lower accuracy, but I suppose that's already a given.
Ideally, find a POS tagger which reveals every possible reading for possible syntactic disambiguation later on in the processing pipeline. This will basically just pick out all the possible readings from the lexicon (perhaps with a probability) and let you take it from there.
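For instance, a minimal sketch with NLTK's default tagger, tagging each word in isolation (assumes the averaged_perceptron_tagger resource is downloaded; the accuracy caveat above applies):

# Sketch only: assumes nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

words = ["would", "research", "part", "technologies", "size",
         "articles", "analyzes", "line"]

# Tag each word in isolation and keep only the ones tagged as nouns (NN*)
nouns = [w for w in words if pos_tag([w])[0][1].startswith('NN')]
print(nouns)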
Even if you use a dictionary, you will always have to deal with ambiguity; for example, the same word can be a noun or a verb depending on the context. Take the word research:
The government will invest in research.
The goal is to research new techniques of POS-tagging.
Most dictionaries will have more than one definition of research, for example:
research as a noun
research as a verb
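For instance, with NLTK's default tagger the surrounding context usually resolves the ambiguity (a small sketch; the exact tags depend on the tagger):

# Sketch only: assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, word_tokenize

for sent in ["The government will invest in research.",
             "The goal is to research new techniques of POS-tagging."]:
    tags = dict(pos_tag(word_tokenize(sent)))
    # expected: a noun tag (NN) in the first sentence, a verb tag (VB) in the second
    print(sent, '->', tags['research'])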
Where do these words come from? Can you maybe POS-tag them within the context where they occur?
As @Triplee and @DavidBatista pointed out, it is really complicated to find out whether a word is a noun or a verb from the word alone, because in most languages the syntactic category of a word depends on context.
Words are just representations of meanings. Because of that, I'd like to add another proposition that might fit what you mean: instead of trying to find out whether a word is a noun or a verb, try to find out whether a Concept is an Object or an Action. This still has the problem of ambiguity, because a concept can carry both the Action and the Object form.
However, you can stick to Concepts that only have object properties (such as TypeOf, HasAsPart, IsPartOf, etc.) or Concepts that have both object and action properties (action properties include Subevents, Effects, Requires).
A good tool for concept searching is ConceptNet. It provides a web API to search for concepts in its network by keyword (it is built from Wikipedia and many other sources and has very good coverage of English), it is open, and it also points to synonyms in other languages, tagged with their common POS; you could average the POS of those synonyms to try to find out whether the word is an object [noun-like] or an action [verb-like].
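For example, a minimal sketch of querying the ConceptNet web API with the requests library (the URL and JSON field names reflect the public ConceptNet 5 API; treat them as assumptions that may change):

# Sketch only: queries the public ConceptNet API over HTTP
import requests

response = requests.get('http://api.conceptnet.io/c/en/skyscraper')
data = response.json()

# Each edge links two concepts via a relation such as IsA, PartOf, HasA...
for edge in data['edges'][:10]:
    print(edge['rel']['label'], ':',
          edge['start']['label'], '->', edge['end']['label'])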
I am working on NLP with Python, and my next step is to gather a large amount of data on specific topics in English.
For example: all words that can define a "Department", say "Accounts".
So can anyone tell me how I can gather such data (if possible, through an API)?
NLTK's WordNet is a great framework for this kind of problem. Here is a brief documentation:
http://www.nltk.org/howto/wordnet.html It uses objects like synsets, which give you words with common meanings. There are also ways to get a numerical score for the similarity of two words. Lemmas will give you words with similar root meanings.
If you are looking more for related words (e.g. "spaghetti" --> "pasta", "ravioli", "Italy"), this database is probably better:
https://www.datamuse.com/api/
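For example, Datamuse's ml ("means like") parameter returns related words as JSON; a minimal sketch with the requests library (parameter names as documented in the public API):

# Sketch only: uses the public Datamuse API ("ml" = "means like")
import requests

response = requests.get('https://api.datamuse.com/words',
                        params={'ml': 'spaghetti', 'max': 10})
for entry in response.json():
    print(entry['word'], entry.get('score'))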
I am wondering if it's possible to calculate the distance/similarity between two related words in Python (like "fraud" and "steal"). These two words are not synonymous per se but they are clearly related. Are there any concepts/algorithms in NLP that can show this relationship numerically? Maybe via NLTK?
I'm not looking for the Levenshtein distance as that relates to the individual characters that make up a word. I'm looking for how the meaning relates.
Would appreciate any help provided.
My suggestion is as follows:
Put each word through the same thesaurus, to get a list of synonyms.
Get the size of the intersection of the two sets of synonyms.
That is a measure of similarity between the words.
If you would like to do a more thorough analysis:
Also get the antonyms for each of the two words.
Get the size of the intersection of the sets of antonyms for the two words.
If you would like to go further!...
Put each word through the same thesaurus, to get a list of synonyms.
Use the top n (=5, or whatever) words from the query result to initiate a new query.
Repeat this to a depth you feel is adequate.
Make a collection of synonyms from the repeated synonym queries.
Get the size of the intersection of the two collections of synonyms.
That is a measure of similarity between the words.
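A minimal sketch of the basic overlap idea, using WordNet lemma names as the thesaurus (my choice; any thesaurus would do, and the score is deliberately crude):

# Sketch only: uses WordNet lemma names as a stand-in for a thesaurus
from nltk.corpus import wordnet

def synonyms(word):
    return {lemma.name() for syn in wordnet.synsets(word)
            for lemma in syn.lemmas()}

def overlap_similarity(word1, word2):
    syns1, syns2 = synonyms(word1), synonyms(word2)
    # size of the intersection of the two synonym sets
    return len(syns1 & syns2)

print(overlap_similarity('fraud', 'steal'))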
NLTK's wordnet is the tool you'd want to use for this. First get the set of all the senses of each word using:
synonymSet = wordnet.synsets(word)
Then loop through each possible sense of each of the 2 words and compare them to each other in a nested loop:
similarity = synonym1.res_similarity(synonym2, semcor_ic)
Either average that value or use the maximum you find; up to you.
This example is using a word similarity comparison that uses "IC" or information content. This will score similarity higher if the word is more specific, or contains more information, so generally it's closer to what we mean when we think about word similarity.
To use this stuff you'll need the imports and variables:
import nltk
from nltk.corpus import wordnet
from nltk.corpus import wordnet_ic
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
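Putting those pieces together, a sketch of the nested loop keeping the maximum score might look like this (assumes the wordnet and wordnet_ic corpora are downloaded; cross-POS comparisons are simply skipped):

# Sketch only: assumes nltk.download('wordnet') and nltk.download('wordnet_ic')
from nltk.corpus import wordnet, wordnet_ic

semcor_ic = wordnet_ic.ic('ic-semcor.dat')

def max_res_similarity(word1, word2):
    best = 0.0
    for syn1 in wordnet.synsets(word1):
        for syn2 in wordnet.synsets(word2):
            try:
                score = syn1.res_similarity(syn2, semcor_ic)
            except Exception:
                # res_similarity only works for synsets with the same POS
                continue
            best = max(best, score)
    return best

print(max_res_similarity('fraud', 'steal'))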
As @jose_bacoy suggested above, the Gensim library can provide a measure of similarity between words using the word2vec technique. The example below is modified from this blog post. You can run it in Google Colab.
Google Colab comes with the Gensim package installed. We can import the part of it we require:
from gensim.models import KeyedVectors
We will download pretrained word vectors (trained on Google News) and load them:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
This gives us a measure of similarity between any two words. To use your example:
word_vectors.similarity('fraud', 'steal')
>>> 0.19978741
A similarity of about 0.2 may seem surprisingly low if you consider these words to be similar. But consider that fraud is a noun and steal is generally a verb. This gives them very different associations as viewed by word2vec.
They become much more similar if you modify the noun to become a verb:
word_vectors.similarity('defraud', 'steal')
>>> 0.43293646
Is there a method in NLTK to find adjective attributes that describe a word? For example, if I typed in the word "skyscraper", attributes such as 'tall', 'structured', etc. would appear. I'm more interested in the reverse, where if I type in the word 'tall' it will list the semantic relations with other words.
I believe the attributes() method in NLTK's WordNet interface is meant for this, but it doesn't quite work the way I described above. This is the code I'm using:
from nltk.corpus import wordnet as wn
synsets = wn.synsets('skyscraper')
print([str(syns.attributes()) for syns in synsets])
I've tried using the part_meronyms and attributes methods, but these don't always return the adjective attributes of a word. I know of other Python tools that would allow me to do this, but I would prefer to use only NLTK for now.
Using purely NLTK, you can achieve this as a two-step process, with your own functions.
Basic Idea
Step 1. Find all meaningful collocations for your target word ("skyscraper" or "tall")
Step 2. For the adjectives identified in those collocations that are of interest to you, parse the POS to get the semantic relations.
For Step 1, this SO question on scoring bigrams has definitions that are very relevant. You'll have to tweak the BigramAssocMeasures for your problem. (It uses the Brown corpus, but you can use many others.)
For Step 2, you could use something like pos_tag() or even Tree.parse() to get the associations you are looking for with your target adjective.
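A rough sketch of Step 1 (plus a quick Step 2 tagging pass) using the Brown corpus; the target word, frequency threshold, and scoring measure are arbitrary choices:

# Sketch only: assumes nltk.download('brown') and nltk.download('averaged_perceptron_tagger')
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import brown

target = 'tall'
finder = BigramCollocationFinder.from_words(w.lower() for w in brown.words())
finder.apply_freq_filter(3)
# keep only bigrams that contain the target word
finder.apply_ngram_filter(lambda w1, w2: target not in (w1, w2))

for bigram in finder.nbest(BigramAssocMeasures().pmi, 20):
    # Step 2: POS-tag the collocation to see what relates to the target
    print(bigram, nltk.pos_tag(list(bigram)))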
For a simpler, alternative approach, this link has examples of text.similar() that should be relevant.
Hope that helps.