I am working on an IR project, I need an alternative to both stemming (which returns unreal words) and lemmatization (which may not change the word at all)
So I looked for a way to get forms of a word.
This python script gives me derivationally_related_forms of a word (e.g. "retrieving"), using NLTK and Wordnet:
from nltk.corpus import wordnet as wn
str = "retrieving"
synsets = wn.synsets(str)
s = set()
result = ""
for synset in synsets:
related = None
lemmas = synset.lemmas()
for lemma in lemmas:
forms = lemma.derivationally_related_forms()
for form in forms:
name = form.name()
s.add(name)
print(list(s))
The output is:
['recollection', 'recovery', 'regaining', 'think', 'retrieval', 'remembering', 'recall', 'recollective', 'thought', 'remembrance', 'recoverer', 'retriever']
But what I really want is only : 'retrieval' , 'retriever' , not 'think' or 'recovery'...etc
and the result is also missing other forms, such as: 'retrieve'
I know that the problem is that "synsets" include words different from my input word, so I get unrelated derivated forms
Is there a way to get the result I am expecting?
You could do what you currently do, then run a stemmer over the word list you get, and only keep the ones that have the same stem as the word you want.
Another approach, not using Wordnet, is to get a large dictionary that contains all derived forms, then do a fuzzy search on it. I just found this: https://github.com/dwyl/english-words/ (Which links back to this question How to get english language word database? )
The simplest algorithm would be an O(N) linear search, doing Levenshtein Distance on each. Or run your stemmer on each entry.
If efficiency starts to be a concern... well, that is really a new question, but the first idea that comes to mind is you could do a one-off indexing of all entries by the stemmer result.
Related
There seems to be an inconsistency when iterating over a spacy document and lemmatizing the tokens compared to looking up the lemma of the word in the Vocab lemma_lookup table.
nlp = spacy.load("en_core_web_lg")
doc = nlp("I'm running faster")
for tok in doc:
print(tok.lemma_)
This prints out "faster" as lemma for the token "faster" instead of "fast". However the token does exist in the lemma_lookup table.
nlp.vocab.lookups.get_table("lemma_lookup")["faster"]
which outputs "fast"
Am I doing something wrong? Or is there another reason why these two are different? Maybe my definitions are not correct and I'm comparing apples with oranges?
I'm using the following versions on Ubuntu Linux:
spacy==2.2.4
spacy-lookups-data==0.1.0
With a model like en_core_web_lg that includes a tagger and rules for a rule-based lemmatizer, it provides the rule-based lemmas rather than the lookup lemmas when POS tags are available to use with the rules. The lookup lemmas aren't great overall and are only used as a backup if the model/pipeline doesn't have enough information to provide the rule-based lemmas.
With faster, the POS tag is ADV, which is left as-is by the rules. If it had been tagged as ADJ, the lemma would be fast with the current rules.
The lemmatizer tries to provide the best lemmas it can without requiring the user to manage any settings, but it's also not very configurable right now (v2.2). If you want to run the tagger but have lookup lemmas, you'll have to replace the lemmas after running the tagger.
aab wrote, that:
The lookup lemmas aren't great overall and are only used as a backup
if the model/pipeline doesn't have enough information to provide the
rule-based lemmas.
This is also how I understood it from the spaCy code, but since I wanted to add my own dictionaries to improve the lemmatization of the pretrained models, I decided to try out the following, which worked:
#load model
nlp = spacy.load('es_core_news_lg')
#define dictionary, where key = lemma, value = token to be lemmatized - not case-sensitive
corr_es = {
"decir":["dixo", "decia", "Dixo", "Decia"],
"ir":["iba", "Iba"],
"pacerer":["parecia", "Parecia"],
"poder":["podia", "Podia"],
"ser":["fuesse", "Fuesse"],
"haber":["habia", "havia", "Habia", "Havia"],
"ahora" : ["aora", "Aora"],
"estar" : ["estàn", "Estàn"],
"lujo" : ["luxo","luxar", "Luxo","Luxar"],
"razón" : ["razon", "razòn", "Razon", "Razòn"],
"caballero" : ["cavallero", "Cavallero"],
"mujer" : ["muger", "mugeres", "Muger", "Mugeres"],
"vez" : ["vèz", "Vèz"],
"jamás" : ["jamas", "Jamas"],
"demás" : ["demas", "demàs", "Demas", "Demàs"],
"cuidar" : ["cuydado", "Cuydado"],
"posible" : ["possible", "Possible"],
"comedia":["comediar", "Comedias"],
"poeta":["poetas", "Poetas"],
"mano":["manir", "Manir"],
"barba":["barbar", "Barbar"],
"idea":["ideo", "Ideo"]
}
#replace lemma with key in lookup table
for key, value in corr_es.items():
for token in value:
correct = key
wrong = token
nlp.vocab.lookups.get_table("lemma_lookup")[token] = key
#process the text
nlp(text)
Hopefully this could help.
I have a script that replaces a word with a synonym using NLTK and WordNet. As far as I can tell, the most effective way to find a synonym by lemmatizing, but that removes conjugation from the process.
For example, say I want to replace "bored" with "drilled"...
word = 'bored'
syns = []
wordNetSynset = wn.synsets(word)
for synSet in wordNetSynset:
for w in synSet.lemma_names():
syns.append(w)
set(syns)
Output:
{'blase', 'bore', 'bored', 'drill', 'tire', 'world-weary'}
I can use some POS filtering to make sure I only return verbs, but they won't be conjugated appropriately. I can get "bore", "drill" and "tire" ... how do I get "bored", "drilled" and "tired"? Or, if I do nouns, what if I want "bores", "drills" or "tires"?
(I will be going over these manually, so meaning is not an issue right now.)
This is a task for surface realisation. After lemmatiziation and finding the appropriate synonym, you can inflect the lemma using e.g. simpleNLG or another surface realiser of your choice. What you need to do is check the inflection type (e.g. 3rd person past) of the original word and restore on the synonym using the functions of the surface realisation module.
When using spacy, the lemma of a token (lemma_) depends on the POS. Therefore, a specific string can have more than one lemmas. For example:
import spacy
nlp = spacy.load('en')
for tok in nlp(u'He leaves early'):
if tok.text == 'leaves':
print (tok, tok.lemma_)
for tok in nlp(u'These are green leaves'):
if tok.text == 'leaves':
print (tok, tok.lemma_)
Will yield that the lemma for 'leaves' can be either 'leave' or 'leaf', depending on context. I'm interested in:
1) Get all possible lemmas for a specific string, regardless of context. Meaning, applying the Lemmatizer without depending on the POS or exceptions, just get all feasible options.
In addition, but independently, I would also like to apply tokenization and get the "correct" lemma.
2) Running over a large corpus only tokenization and lemmatizer, as efficiently as possible, without damaging the lemmatizer at all. I know that I can drop the 'ner' pipeline for example, and shouldn't drop the 'tagger', but didn't receive a straightforward answer regarding parser etc. From a simulation over a corpus, it seems like results yielded the same, but I thought that the 'parser' or 'sentenzicer' should affect? My current code at the moment is:
import multiprocessing
our_num_threads = multiprocessing.cpu_count()
corpus = [u'this is a text', u'this is another text'] ## just an example
nlp = spacy.load('en', disable = ['ner', 'textcat', 'similarity', 'merge_noun_chunks', 'merge_entities', 'tensorizer', 'parser', 'sbd', 'sentencizer'])
nlp.pipe(corpus, n_threads = our_num_threads)
If I have a good answer on 1+2, I can then for my needs use for words that were "lemmatized", consider other possible variations.
Thanks!
I have been going through many Libraries like whoosh/nltk and concepts like word net.
However I am unable to tackle my problem. I am not sure if I can find a library for this or I have to build this using the above mentioned resources.
Question:
My scenario is that I have to search for key words.
Say I have key words like 'Sales Document' / 'Purchase Documents' and have to search for them in a small 10-15 pages book.
The catch is:
Now they can also be written as 'Sales should be documented' or 'company selling should be written in the text files'. (For Sales Document - Keyword) Is there an approach here or will I have to build something?
The code for the POS Tags is as follows. If no library is available I will have to proceed with this.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series
import nltk
from nltk.corpus import wordnet
def tag(x):
return pos_tag(word_tokenize(x))
synonyms = []
antonyms = []
for syn in wordnet.synsets("Sales document"):
#print("Down2")
print (syn)
#print("Down")
for l in syn.lemmas():
print(" \n")
print(l)
synonyms.append(l.name())
if l.antonyms():
antonyms.append(l.antonyms()[0].name())
print(set(synonyms))
print(set(antonyms))
for i in synonyms:
print(tag(i))
Update:
We went ahead and made a python program - Feel free to fork it. (Pun intended)
Further the Git Dhund is very untidy right now will clean it once completed.
Currently it is still in a development phase.
The is the link.
To match occurrences like "Sales should be documented", this can be done by increasing the slop parameter in the Phrase query object of Whoosh.
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
You can also define slop in Query like this: "Sales should be documented"~5
To match the second example "company selling should be written in the text files", this needs a semantic processing for your texts. Whoosh has a low-level implementation for wordnet thesaurus to allow you index synonyms but it has only one-word synonyms.
For example, Suppose the word "happy" is given, I want to generate other forms of happy such as happiness, happily... etc.
I have read some other previous questions on Stackoverflow and NLTK references. However, there are only POS tagging, morph just like identifying the grammatical form of certain words within sentences, not generating a list of different words. Is there anyone who bumped into similar issues? Thank you.
This type of information is included in the Lemma class of NLTK's WordNet implementation. Specifically, it's found in Lemma.derivationally_related_forms().
Here's an example script for finding all possible derivation forms of "happy":
from nltk.corpus import wordnet as wn
forms = set() #We'll store the derivational forms in a set to eliminate duplicates
for happy_lemma in wn.lemmas("happy"): #for each "happy" lemma in WordNet
forms.add(happy_lemma.name()) #add the lemma itself
for related_lemma in happy_lemma.derivationally_related_forms(): #for each related lemma
forms.add(related_lemma.name()) #add the related lemma
Unfortunately, the information in WordNet is not complete. The above script finds "happy" and "happiness" but it fails to find "happily", even though there are multiple "happily" lemmas.