spacy lemmatizing inconsistency with lemma_lookup table - python

There seems to be an inconsistency when iterating over a spacy document and lemmatizing the tokens compared to looking up the lemma of the word in the Vocab lemma_lookup table.
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("I'm running faster")
for tok in doc:
    print(tok.lemma_)
This prints out "faster" as the lemma for the token "faster" instead of "fast". However, the token does exist in the lemma_lookup table:
nlp.vocab.lookups.get_table("lemma_lookup")["faster"]
which outputs "fast"
Am I doing something wrong? Or is there another reason why these two are different? Maybe my definitions are not correct and I'm comparing apples with oranges?
I'm using the following versions on Ubuntu Linux:
spacy==2.2.4
spacy-lookups-data==0.1.0

A model like en_core_web_lg includes a tagger and rules for a rule-based lemmatizer, so it provides the rule-based lemmas rather than the lookup lemmas whenever POS tags are available to use with the rules. The lookup lemmas aren't great overall and are only used as a backup if the model/pipeline doesn't have enough information to provide the rule-based lemmas.
With faster, the POS tag is ADV, which is left as-is by the rules. If it had been tagged as ADJ, the lemma would be fast with the current rules.
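For illustration, here's a minimal sketch (assuming spaCy v2.2, where the model's rule tables are available via nlp.vocab.lookups) that calls the rule-based lemmatizer directly with different POS tags:
import spacy
from spacy.lemmatizer import Lemmatizer

nlp = spacy.load("en_core_web_lg")
# Build a rule-based lemmatizer from the same lookups the loaded pipeline uses.
lemmatizer = Lemmatizer(nlp.vocab.lookups)
print(lemmatizer("faster", "ADV"))  # expected: ['faster'] - no adverb rule applies
print(lemmatizer("faster", "ADJ"))  # expected: ['fast'] - the adjective -er rule applies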
The lemmatizer tries to provide the best lemmas it can without requiring the user to manage any settings, but it's also not very configurable right now (v2.2). If you want to run the tagger but have lookup lemmas, you'll have to replace the lemmas after running the tagger.
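If you do want the lookup lemmas with this pipeline anyway, a rough sketch for v2.2 (not a built-in setting) is to overwrite each token's lemma once the pipeline has run:
import spacy

nlp = spacy.load("en_core_web_lg")
lookup = nlp.vocab.lookups.get_table("lemma_lookup")

doc = nlp("I'm running faster")
for tok in doc:
    # Fall back to the token text if the form isn't in the lookup table.
    tok.lemma_ = lookup.get(tok.text, tok.text)

print([tok.lemma_ for tok in doc])  # 'faster' should now map to 'fast'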

aab wrote that:
The lookup lemmas aren't great overall and are only used as a backup
if the model/pipeline doesn't have enough information to provide the
rule-based lemmas.
This is also how I understood it from the spaCy code, but since I wanted to add my own dictionaries to improve the lemmatization of the pretrained models, I decided to try out the following, which worked:
# load the model
import spacy

nlp = spacy.load('es_core_news_lg')

# define a dictionary where key = lemma, value = forms that should map to it
# (both capitalizations are listed so the replacement is effectively case-insensitive)
corr_es = {
    "decir": ["dixo", "decia", "Dixo", "Decia"],
    "ir": ["iba", "Iba"],
    "parecer": ["parecia", "Parecia"],
    "poder": ["podia", "Podia"],
    "ser": ["fuesse", "Fuesse"],
    "haber": ["habia", "havia", "Habia", "Havia"],
    "ahora": ["aora", "Aora"],
    "estar": ["estàn", "Estàn"],
    "lujo": ["luxo", "luxar", "Luxo", "Luxar"],
    "razón": ["razon", "razòn", "Razon", "Razòn"],
    "caballero": ["cavallero", "Cavallero"],
    "mujer": ["muger", "mugeres", "Muger", "Mugeres"],
    "vez": ["vèz", "Vèz"],
    "jamás": ["jamas", "Jamas"],
    "demás": ["demas", "demàs", "Demas", "Demàs"],
    "cuidar": ["cuydado", "Cuydado"],
    "posible": ["possible", "Possible"],
    "comedia": ["comediar", "Comedias"],
    "poeta": ["poetas", "Poetas"],
    "mano": ["manir", "Manir"],
    "barba": ["barbar", "Barbar"],
    "idea": ["ideo", "Ideo"]
}

# map each listed form to its lemma (the dictionary key) in the lookup table
lookup_table = nlp.vocab.lookups.get_table("lemma_lookup")
for key, value in corr_es.items():
    for token in value:
        lookup_table[token] = key

# process the text
nlp(text)
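As a quick sanity check (using one of the entries above), you can confirm the override is now in the table:
# The custom mapping should now be returned by the lookup table.
assert nlp.vocab.lookups.get_table("lemma_lookup")["dixo"] == "decir"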
Hopefully this could help.

Related

Spacy adds words automatically to vocab?

I loaded a regular spaCy model and tried the following code:
import spacy

nlp = spacy.load("en_core_web_md")
text = "xxasdfdsfsdzz is the first U.S. public company"

if 'xxasdfdsfsdzz' in nlp.vocab:
    print("in")
else:
    print("not")

if 'Apple' in nlp.vocab:
    print("in")
else:
    print("not")

# Process the text
doc = nlp(text)

if 'xxasdfdsfsdzz' in nlp.vocab:
    print("in")
else:
    print("not")

if 'Apple' in nlp.vocab:
    print("in")
else:
    print("not")
It seems like spaCy added the words to the vocab after they were analyzed with nlp(text).
Can someone explain the output? How can I avoid it? Why is "Apple" not in the vocab, and why does "xxasdfdsfsdzz" exist?
Output:
not
not
in
not
The spaCy Vocab is mainly an internal implementation detail to interface with a memory-efficient method of storing strings. It is definitely not a list of "real words" or any other thing that you are likely to find useful.
The main thing a Vocab stores by default is strings that are used internally, such as POS and dependency labels. In pipelines with vectors, words in the vectors are also included. You can read more about the implementation details in the spaCy docs.
All words an nlp object has seen need storage for their strings, and so will be present in the Vocab. That's what you're seeing with your nonsense string in the example above.
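A small demonstration of that behaviour, using a blank pipeline so no vectors are involved:
import spacy

nlp = spacy.blank("en")
print("xxasdfdsfsdzz" in nlp.vocab)   # False: the string has never been seen

nlp("xxasdfdsfsdzz is a nonsense token")
print("xxasdfdsfsdzz" in nlp.vocab)   # True: tokenizing the text interned the string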

Updating spacy.tokenizer._get_regex_pattern(nlp.tokenizer.token_match) so that hashtags are tokenized as a single token

This is my first time using spacy and I am trying to learn how to edit the tokenizer on one of the pretrained models (en_core_web_md) so that when tweets are tokenized, the entire hashtag becomes a single token (e.g. I want one token '#hashtagText', the default would be two tokens, '#' and 'hashtagText').
I know I am not the first person to face this issue. I have tried implementing the advice from other places online, but after using their methods the output remains the same (#hashtagText is two tokens). These articles show the methods I have tried:
https://the-fintech-guy.medium.com/spacy-handling-of-hashtags-and-dollartags-ed1e661f203c
https://towardsdatascience.com/pre-processing-should-extract-context-specific-features-4d01f6669a7e
Shown in the code below, my troubleshooting steps have been:
save the default pattern matching regex (default_token_matching_regex)
save the regex that nlp (the pretrained model) is using before any updates (nlp_token_matching_regex_pre_update)
Note: I originally suspected these would be the same, but they are not. See below for outputs.
Append the regex I need (#\w+) to the pattern nlp is currently using, and save this combination as updated_token_matching_regex
Update the regex nlp is using with the variable created above (updated_token_matching_regex)
Save the new regex used by nlp to verify things were updated correctly (nlp_token_matching_regex_post_update).
See code below:
import spacy
import en_core_web_md
import re

nlp = en_core_web_md.load()

# Spacy's default token matching regex.
default_token_matching_regex = spacy.tokenizer._get_regex_pattern(nlp.Defaults.token_match)

# Verify what regex nlp is using before changing anything.
nlp_token_matching_regex_pre_update = spacy.tokenizer._get_regex_pattern(nlp.tokenizer.token_match)

# Create a new regex that combines the default regex and a term to treat hashtags as a single token.
updated_token_matching_regex = f"({nlp_token_matching_regex_pre_update}|#\w+)"

# Update the token matching regex used by nlp with the regex created in the line above.
nlp.tokenizer.token_match = re.compile(updated_token_matching_regex).match

# Verify that nlp is now using the updated regex.
nlp_token_matching_regex_post_update = spacy.tokenizer._get_regex_pattern(nlp.tokenizer.token_match)

# Now let's try again
s = "2020 can't get any worse #ihate2020 #bestfriend <https://t.co>"
doc = nlp(s)

# Let's look at the lemma and is_stop of each token
print(f"Token\t\tLemma\t\tStopword")
print("=" * 40)
for token in doc:
    print(f"{token}\t\t{token.lemma_}\t\t{token.is_stop}")
As you can see above, the tokenization behavior is not as it should be with the addition of '#\w+'. See below for printouts of all the troubleshooting variables.
Since I feel like I have proven to myself above that I did correctly update the regex nlp is using, the only possible issue I could think of is that the regex itself was wrong. I tested the regex by itself and it seems to behave as intended, see below:
Is anyone able to see the error that is causing nlp to tokenize #hashTagText as two tokens after its nlp.tokenizer.token_match regex was updated to do it as a single token?
Thank you!!
Not sure it is the best possible solution, but I did find a way to make it work. See below for what I did:
Spacy gives us the chart below which shows the order things are processed when tokenization is performed.
I was able to use the tokenizer.explain() method to see that the hashtags were being split apart by a prefix rule. Viewing the tokenizer.explain() output is as simple as running the code below, where "first_tweet" is any string.
tweet_doc = nlp.tokenizer.explain(first_tweet)
for token in tweet_doc:
    print(token)
Next, referencing the chart above, we see that prefix rules are the first things applied during the tokenization process.
This meant that even though I updated the token_match rules with a regular expression that allows for keeping "#Text" as a single token, it didn't matter because by the time the token_match rules were evaluated the prefix rule had already separated the '#' from the text.
Since this is a Twitter project, I will never want "#" treated as a prefix. Therefore my solution was to remove "#" from the list of prefixes considered, which was accomplished with the code below:
default_prefixes = list(nlp.Defaults.prefixes)
default_prefixes.remove('#')
prefix_regex = spacy.util.compile_prefix_regex(default_prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search
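To verify the change, re-running the earlier tweet should now keep each hashtag as a single token:
doc = nlp("2020 can't get any worse #ihate2020 #bestfriend <https://t.co>")
print([t.text for t in doc])
# Expected (roughly): ['2020', 'ca', "n't", 'get', 'any', 'worse',
#                      '#ihate2020', '#bestfriend', '<', 'https://t.co', '>']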
and that's it! Hope this solution helps someone else.
Final thoughts:
Recently spaCy was updated to version 3.0. I am curious whether prior versions of the spaCy pretrained models did not include '#' in the prefix list. That is the only explanation I can come up with for why the code shown in the previously posted articles no longer seems to work as intended. If anyone can explain in detail why my solution seems much more complicated than those posted in the articles I linked to earlier, I would certainly love to learn.
Cheers.
-Braden
The default token_match for English is None (as of v2.3.0, now that the URL pattern is in url_match), so you can just overwrite it with your new pattern:
import re
import spacy

nlp = spacy.blank("en")
nlp.tokenizer.token_match = re.compile(r"^#\w+$").match
assert [t.text for t in nlp("#asdf1234")] == ["#asdf1234"]
Your example in the question ends up with the pattern (None|#\w+), which isn't exactly what you want, but it seems to work fine for this given example with v2.3.5 and v3.0.5:
Token Lemma Stopword
========================================
2020 2020 False
ca ca True
n't n't True
get get True
any any True
worse bad False
#ihate2020 #ihate2020 False
#bestfriend #bestfriend False
< < False
https://t.co https://t.co False
> > False

Wordnet: Getting derivationally_related_forms of a word

I am working on an IR project, and I need an alternative to both stemming (which returns unreal words) and lemmatization (which may not change the word at all).
So I looked for a way to get forms of a word.
This python script gives me derivationally_related_forms of a word (e.g. "retrieving"), using NLTK and Wordnet:
from nltk.corpus import wordnet as wn

word = "retrieving"
synsets = wn.synsets(word)
s = set()
for synset in synsets:
    for lemma in synset.lemmas():
        for form in lemma.derivationally_related_forms():
            s.add(form.name())
print(list(s))
The output is:
['recollection', 'recovery', 'regaining', 'think', 'retrieval', 'remembering', 'recall', 'recollective', 'thought', 'remembrance', 'recoverer', 'retriever']
But what I really want is only 'retrieval' and 'retriever', not 'think' or 'recovery', etc.
The result is also missing other forms, such as 'retrieve'.
I know that the problem is that the synsets include words different from my input word, so I get unrelated derived forms.
Is there a way to get the result I am expecting?
You could do what you currently do, then run a stemmer over the word list you get, and only keep the ones that have the same stem as the word you want.
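For example, a rough sketch of that idea with NLTK's PorterStemmer (the exact output depends on the stemmer you pick):
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "retrieving"
target_stem = stemmer.stem(word)

related = set()
for synset in wn.synsets(word):
    for lemma in synset.lemmas():
        for form in lemma.derivationally_related_forms():
            # Keep only forms that share a stem with the query word.
            if stemmer.stem(form.name()) == target_stem:
                related.add(form.name())

print(related)  # with the Porter stemmer this keeps e.g. 'retrieval' and 'retriever'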
Another approach, not using Wordnet, is to get a large dictionary that contains all derived forms, then do a fuzzy search on it. I just found this: https://github.com/dwyl/english-words/ (Which links back to this question How to get english language word database? )
The simplest algorithm would be an O(N) linear search, doing Levenshtein Distance on each. Or run your stemmer on each entry.
If efficiency starts to be a concern... well, that is really a new question, but the first idea that comes to mind is you could do a one-off indexing of all entries by the stemmer result.
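As a sketch of that one-off indexing idea (the english_words list below is a small placeholder; in practice you would load the full word list from the repository linked above):
from collections import defaultdict
from nltk.stem import PorterStemmer

# Placeholder word list; load the real one from the linked repository.
english_words = ["retrieve", "retrieval", "retriever", "retrieving", "recover", "recovery"]

stemmer = PorterStemmer()
stem_index = defaultdict(set)
for entry in english_words:
    stem_index[stemmer.stem(entry)].add(entry)

# All dictionary entries that share a stem with the query word.
print(stem_index[stemmer.stem("retrieving")])  # the four 'retriev*' entries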

Spacy lemmatizer issue/consistency

I'm currently using spaCy for NLP purposes (mainly lemmatization and tokenization). The model used is en_core_web_sm (2.1.0).
The following code is run to retrieve a list of words "cleansed" from a query
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(query)
list_words = []
for token in doc:
    if token.text != ' ':
        list_words.append(token.lemma_)
However, I face a major issue when running this code.
For example, when the query is "processing of tea leaves".
The result stored in list_words can be either ['processing', 'tea', 'leaf'] or ['processing', 'tea', 'leave'].
It seems that the result is not consistent. I cannot change my input/query (adding another word for context is not possible) and I really need to find the same result every time. I think the loading of the model may be the issue.
Why does the result differ? Can I load the model the "same" way every time? Did I miss a parameter that would give the same result for an ambiguous query?
Thanks for your help
The issue was analysed by the spaCy team and they've come up with a solution.
Here's the fix : https://github.com/explosion/spaCy/pull/3646
Basically, when the lemmatization rules were applied, a set was used to return a lemma. Since a set has no ordering, the returned lemma could change between Python sessions.
For example in my case, for the noun "leaves", the potential lemmas were "leave" and "leaf". Without ordering, the result was random - it could be "leave" or "leaf".
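As a toy illustration of why that was non-deterministic (not the actual spaCy code): picking an arbitrary element from a set of strings is not stable across interpreter runs, because string hashing is randomized unless PYTHONHASHSEED is fixed.
# Toy example: iteration order of a set of strings can differ between Python sessions.
candidates = {"leave", "leaf"}
print(next(iter(candidates)))  # may print 'leave' in one run and 'leaf' in another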

Python Spacy's Lemmatizer: getting all options for lemmas with maximum efficiency

When using spaCy, the lemma of a token (lemma_) depends on the POS. Therefore, a specific string can have more than one lemma. For example:
import spacy

nlp = spacy.load('en')
for tok in nlp(u'He leaves early'):
    if tok.text == 'leaves':
        print(tok, tok.lemma_)
for tok in nlp(u'These are green leaves'):
    if tok.text == 'leaves':
        print(tok, tok.lemma_)
This will yield that the lemma for 'leaves' can be either 'leave' or 'leaf', depending on the context. I'm interested in:
1) Get all possible lemmas for a specific string, regardless of context. Meaning, applying the Lemmatizer without depending on the POS or exceptions, just get all feasible options.
In addition, but independently, I would also like to apply tokenization and get the "correct" lemma.
2) Running only tokenization and the lemmatizer over a large corpus, as efficiently as possible, without hurting the lemmatizer at all. I know that I can drop the 'ner' pipeline, for example, and shouldn't drop the 'tagger', but I didn't get a straightforward answer regarding the parser etc. From a simulation over a corpus, the results seemed to be the same, but I thought that the 'parser' or 'sentencizer' should have an effect? My current code at the moment is:
import multiprocessing
import spacy

our_num_threads = multiprocessing.cpu_count()
corpus = [u'this is a text', u'this is another text']  # just an example
nlp = spacy.load('en', disable=['ner', 'textcat', 'similarity', 'merge_noun_chunks', 'merge_entities', 'tensorizer', 'parser', 'sbd', 'sentencizer'])
docs = list(nlp.pipe(corpus, n_threads=our_num_threads))  # consume the generator so the texts are actually processed
If I have a good answer to 1 and 2, I can then, for words that were "lemmatized", consider the other possible variations as needed.
Thanks!
