Spacy adds words automatically to vocab? - python

I loaded a regular spaCy language model and tried the following code:
import spacy
nlp = spacy.load("en_core_web_md")
text = "xxasdfdsfsdzz is the first U.S. public company"
if 'xxasdfdsfsdzz' in nlp.vocab:
    print("in")
else:
    print("not")
if 'Apple' in nlp.vocab:
    print("in")
else:
    print("not")
# Process the text
doc = nlp(text)
if 'xxasdfdsfsdzz' in nlp.vocab:
    print("in")
else:
    print("not")
if 'Apple' in nlp.vocab:
    print("in")
else:
    print("not")
It seems like spaCy adds words to the vocab only after they have been processed with nlp(text).
Can someone explain the output? How can I avoid this? Why is "Apple" not in the vocab, and why does "xxasdfdsfsdzz" end up in it?
Output:
not
not
in
not

The spaCy Vocab is mainly an internal implementation detail to interface with a memory-efficient method of storing strings. It is definitely not a list of "real words" or any other thing that you are likely to find useful.
The main thing a Vocab stores by default is strings that are used internally, such as POS and dependency labels. In pipelines with vectors, words in the vectors are also included. You can read more about the implementation details here.
All words an nlp object has seen need storage for their strings, and so will be present in the Vocab. That's what you're seeing with your nonsense string in the example above.
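For reference, a minimal sketch of that behaviour, plus what is usually the more useful check (this assumes spaCy v2.x with the en_core_web_md pipeline; checking the vectors asks "is this a known word?" rather than "has spaCy interned this string?"):

import spacy

nlp = spacy.load("en_core_web_md")

# Membership in nlp.vocab only says whether the string has been interned yet.
print("xxasdfdsfsdzz" in nlp.vocab)            # expected: False
nlp("xxasdfdsfsdzz is the first U.S. public company")
print("xxasdfdsfsdzz" in nlp.vocab)            # expected: True, the string was seen and stored

# If the real question is "is this a known word?", the vectors are closer to the intent:
print(nlp.vocab.has_vector("xxasdfdsfsdzz"))   # expected: False
print(nlp.vocab.has_vector("apple"))           # expected: True in the md pipeline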

Related

Wordnet: Getting derivationally_related_forms of a word

I am working on an IR project and need an alternative to both stemming (which returns unreal words) and lemmatization (which may not change the word at all).
So I looked for a way to get the derived forms of a word.
This python script gives me derivationally_related_forms of a word (e.g. "retrieving"), using NLTK and Wordnet:
from nltk.corpus import wordnet as wn

word = "retrieving"
synsets = wn.synsets(word)
s = set()
for synset in synsets:
    lemmas = synset.lemmas()
    for lemma in lemmas:
        forms = lemma.derivationally_related_forms()
        for form in forms:
            name = form.name()
            s.add(name)
print(list(s))
The output is:
['recollection', 'recovery', 'regaining', 'think', 'retrieval', 'remembering', 'recall', 'recollective', 'thought', 'remembrance', 'recoverer', 'retriever']
But what I really want is only 'retrieval' and 'retriever', not 'think' or 'recovery', etc.
The result is also missing other forms, such as 'retrieve'.
I know the problem is that the synsets include words different from my input word, so I get unrelated derived forms.
Is there a way to get the result I am expecting?
You could do what you currently do, then run a stemmer over the word list you get, and only keep the ones that have the same stem as the word you want.
Another approach, not using Wordnet, is to get a large dictionary that contains all derived forms and then do a fuzzy search on it. I just found this: https://github.com/dwyl/english-words/ (which links back to this question: How to get english language word database?)
The simplest algorithm would be an O(N) linear search, computing the Levenshtein distance to each entry, or running your stemmer on each entry.
If efficiency starts to be a concern... well, that is really a new question, but the first idea that comes to mind is to do a one-off indexing of all entries by their stemmer result.
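A minimal sketch of the first (filter-by-stem) suggestion, assuming NLTK's PorterStemmer is an acceptable stemmer for this purpose (variable names are just illustrative):

from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

word = "retrieving"
stemmer = PorterStemmer()
target_stem = stemmer.stem(word)          # e.g. "retriev"

related_forms = set()
for synset in wn.synsets(word):
    for lemma in synset.lemmas():
        for form in lemma.derivationally_related_forms():
            # keep only forms that share a stem with the input word
            if stemmer.stem(form.name()) == target_stem:
                related_forms.add(form.name())

print(sorted(related_forms))              # expected: ['retrieval', 'retriever']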

spacy lemmatizing inconsistency with lemma_lookup table

There seems to be an inconsistency when iterating over a spacy document and lemmatizing the tokens compared to looking up the lemma of the word in the Vocab lemma_lookup table.
nlp = spacy.load("en_core_web_lg")
doc = nlp("I'm running faster")
for tok in doc:
    print(tok.lemma_)
This prints out "faster" as the lemma for the token "faster" instead of "fast". However, the token does exist in the lemma_lookup table:
nlp.vocab.lookups.get_table("lemma_lookup")["faster"]
which outputs "fast"
Am I doing something wrong? Or is there another reason why these two are different? Maybe my definitions are not correct and I'm comparing apples with oranges?
I'm using the following versions on Ubuntu Linux:
spacy==2.2.4
spacy-lookups-data==0.1.0
With a model like en_core_web_lg that includes a tagger and rules for a rule-based lemmatizer, it provides the rule-based lemmas rather than the lookup lemmas when POS tags are available to use with the rules. The lookup lemmas aren't great overall and are only used as a backup if the model/pipeline doesn't have enough information to provide the rule-based lemmas.
With faster, the POS tag is ADV, which is left as-is by the rules. If it had been tagged as ADJ, the lemma would be fast with the current rules.
The lemmatizer tries to provide the best lemmas it can without requiring the user to manage any settings, but it's also not very configurable right now (v2.2). If you want to run the tagger but have lookup lemmas, you'll have to replace the lemmas after running the tagger.
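If it helps, a minimal sketch of that "replace the lemmas after running the tagger" workaround could look like this (spaCy v2.2.x with spacy-lookups-data installed; it reuses the same table access shown in the question and keeps the rule-based lemma for words with no lookup entry):

import spacy

nlp = spacy.load("en_core_web_lg")
lookup = nlp.vocab.lookups.get_table("lemma_lookup")

doc = nlp("I'm running faster")
for tok in doc:
    try:
        tok.lemma_ = lookup[tok.text]   # overwrite with the lookup lemma
    except KeyError:
        pass                            # no lookup entry: keep the rule-based lemma

print([tok.lemma_ for tok in doc])      # "faster" should now come out as "fast"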
aab wrote that:
"The lookup lemmas aren't great overall and are only used as a backup if the model/pipeline doesn't have enough information to provide the rule-based lemmas."
This is also how I understood it from the spaCy code, but since I wanted to add my own dictionaries to improve the lemmatization of the pretrained models, I decided to try out the following, which worked:
import spacy

# load the model
nlp = spacy.load('es_core_news_lg')

# define a dictionary where key = lemma and value = the list of token variants to be
# mapped to it (the lookup matches the token text as-is, hence the capitalized variants)
corr_es = {
    "decir": ["dixo", "decia", "Dixo", "Decia"],
    "ir": ["iba", "Iba"],
    "parecer": ["parecia", "Parecia"],
    "poder": ["podia", "Podia"],
    "ser": ["fuesse", "Fuesse"],
    "haber": ["habia", "havia", "Habia", "Havia"],
    "ahora": ["aora", "Aora"],
    "estar": ["estàn", "Estàn"],
    "lujo": ["luxo", "luxar", "Luxo", "Luxar"],
    "razón": ["razon", "razòn", "Razon", "Razòn"],
    "caballero": ["cavallero", "Cavallero"],
    "mujer": ["muger", "mugeres", "Muger", "Mugeres"],
    "vez": ["vèz", "Vèz"],
    "jamás": ["jamas", "Jamas"],
    "demás": ["demas", "demàs", "Demas", "Demàs"],
    "cuidar": ["cuydado", "Cuydado"],
    "posible": ["possible", "Possible"],
    "comedia": ["comediar", "Comedias"],
    "poeta": ["poetas", "Poetas"],
    "mano": ["manir", "Manir"],
    "barba": ["barbar", "Barbar"],
    "idea": ["ideo", "Ideo"]
}

# add each variant to the lookup table, mapping it to the corrected lemma
for lemma, variants in corr_es.items():
    for variant in variants:
        nlp.vocab.lookups.get_table("lemma_lookup")[variant] = lemma

# process the text (text is your input string)
doc = nlp(text)
Hopefully this could help.
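A quick way to check that the patched entries are actually used (the sample sentence is made up; this assumes the Spanish pipeline in v2.2 relies on the lookup lemmatizer, as described above):

# the table now maps the historical spellings to the modern lemmas
print(nlp.vocab.lookups.get_table("lemma_lookup")["dixo"])   # expected: "decir"

doc = nlp("Dixo que no podia venir")
print([(tok.text, tok.lemma_) for tok in doc])               # "Dixo" should be lemmatized to "decir"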

Spacy lemmatizer issue/consistency

I'm currently using spaCy for NLP purposes (mainly lemmatization and tokenization). The model used is en_core_web_sm (2.1.0).
The following code is run to retrieve a list of words "cleansed" from a query:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(query)   # query is the input string
list_words = []
for token in doc:
    if token.text != ' ':
        list_words.append(token.lemma_)
However, I face a major issue when running this code.
For example, when the query is "processing of tea leaves", the result stored in list_words can be either ['processing', 'tea', 'leaf'] or ['processing', 'tea', 'leave'].
It seems that the result is not consistent. I cannot change my input/query (adding another word for context is not possible) and I really need to get the same result every time. I think the loading of the model may be the issue.
Why do the results differ? Can I load the model the "same" way every time? Did I miss a parameter to obtain the same result for an ambiguous query?
Thanks for your help.
The issue was analysed by the spaCy team and they've come up with a solution.
Here's the fix : https://github.com/explosion/spaCy/pull/3646
Basically, when the lemmatization rules were applied, a set was used to return a lemma. Since a set has no ordering, the returned lemma could change between Python sessions.
For example, in my case, for the noun "leaves" the potential lemmas were "leave" and "leaf". Without ordering, the result was effectively random: it could be "leave" or "leaf".
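The core of the problem can be illustrated outside spaCy (a toy sketch, not the actual lemmatizer code): picking "the" element out of a set of candidate lemmas is not stable across Python sessions, because string hashing is randomized and the iteration order of a set can therefore differ from run to run.

# Toy illustration of why an unordered set gives non-deterministic results.
candidates = {"leave", "leaf"}           # candidate lemmas produced by the rules
print(next(iter(candidates)))            # may print "leave" in one session, "leaf" in another
print(sorted(candidates)[0])             # a deterministic choice, the same in every session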

Python Spacy's Lemmatizer: getting all options for lemmas with maximum efficiency

When using spaCy, the lemma of a token (lemma_) depends on the POS. Therefore, a specific string can have more than one lemma. For example:
import spacy

nlp = spacy.load('en')
for tok in nlp(u'He leaves early'):
    if tok.text == 'leaves':
        print(tok, tok.lemma_)
for tok in nlp(u'These are green leaves'):
    if tok.text == 'leaves':
        print(tok, tok.lemma_)
This will show that the lemma for 'leaves' can be either 'leave' or 'leaf', depending on context. I'm interested in:
1) Getting all possible lemmas for a specific string, regardless of context. That is, applying the lemmatizer without depending on the POS or exceptions, and just getting all feasible options.
In addition, but independently, I would also like to apply tokenization and get the "correct" lemma.
2) Running only tokenization and the lemmatizer over a large corpus, as efficiently as possible, without hurting the lemmatizer at all. I know that I can drop the 'ner' pipe, for example, and shouldn't drop the 'tagger', but I didn't receive a straightforward answer regarding the parser etc. From a simulation over a corpus it seems like the results are the same, but I thought that the 'parser' or 'sentencizer' should have an effect? My current code at the moment is:
import spacy
import multiprocessing

our_num_threads = multiprocessing.cpu_count()
corpus = [u'this is a text', u'this is another text']  # just an example
nlp = spacy.load('en', disable=['ner', 'textcat', 'similarity', 'merge_noun_chunks', 'merge_entities', 'tensorizer', 'parser', 'sbd', 'sentencizer'])
nlp.pipe(corpus, n_threads=our_num_threads)
If I have a good answer for 1 and 2, I can then, for my needs, take the words that were "lemmatized" and also consider their other possible variations.
Thanks!
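There is no accepted answer here, but for point 1 one possible sketch, assuming spaCy 2.x internals, is to call the pipeline's rule-based lemmatizer directly with each coarse POS (nlp.vocab.morphology.lemmatizer is an internal attribute, so this may change between versions):

import spacy

nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.vocab.morphology.lemmatizer   # internal attribute in spaCy 2.x

word = "leaves"
candidates = set()
for pos in ("noun", "verb", "adj"):
    candidates.update(lemmatizer(word, pos))   # candidate lemmas for that POS, no context needed
print(candidates)                              # expected to contain both "leaf" and "leave"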

NLP Phrase Search in Python

I have been going through many libraries like Whoosh/NLTK and concepts like WordNet.
However, I am unable to tackle my problem. I am not sure whether I can find a library for this or whether I have to build it using the above-mentioned resources.
Question:
My scenario is that I have to search for keywords.
Say I have keywords like 'Sales Document' / 'Purchase Documents' and have to search for them in a small 10-15 page book.
The catch is:
They can also be written as 'Sales should be documented' or 'company selling should be written in the text files' (for the 'Sales Document' keyword). Is there an approach for this, or will I have to build something myself?
The code for the POS tags is as follows. If no library is available, I will have to proceed with this.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from pandas import Series
import nltk
from nltk.corpus import wordnet

def tag(x):
    return pos_tag(word_tokenize(x))

synonyms = []
antonyms = []
for syn in wordnet.synsets("Sales document"):
    #print("Down2")
    print(syn)
    #print("Down")
    for l in syn.lemmas():
        print(" \n")
        print(l)
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(set(synonyms))
print(set(antonyms))
for i in synonyms:
    print(tag(i))
Update:
We went ahead and made a Python program - feel free to fork it. (Pun intended.)
The Git repo (Dhund) is very untidy right now; I will clean it up once it is completed.
Currently it is still in a development phase.
This is the link.
To match occurrences like "Sales should be documented", you can increase the slop parameter in the Phrase query object of Whoosh.
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
You can also define slop in the query string like this: "Sales should be documented"~5
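A small end-to-end sketch of the slop approach (the index directory, field name, and sample document below are just placeholders):

import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.query import Phrase

schema = Schema(content=TEXT(stored=True))
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(content=u"Sales should be documented in the ledger")
writer.commit()

with ix.searcher() as searcher:
    # slop=5 allows a few words between "sales" and "documented"
    query = Phrase("content", [u"sales", u"documented"], slop=5)
    for hit in searcher.search(query):
        print(hit["content"])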
To match the second example, "company selling should be written in the text files", you need semantic processing of your texts. Whoosh has a low-level implementation of a WordNet thesaurus to let you index synonyms, but it only supports one-word synonyms.
