Add/remove custom stop words with spacy - python

What is the best way to add/remove stop words with spacy? I am using the token.is_stop function and would like to make some custom changes to the set. I was looking at the documentation but could not find anything regarding stop words. Thanks!

Using Spacy 2.0.11, you can update its stopwords set using one of the following:
To add a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")
To add several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1", "my_new_stopword2"}
To remove a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")
To remove several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}
Note: To see the current set of stopwords, use:
print(nlp.Defaults.stop_words)
Update: It was noted in the comments that this fix only affects the current execution. To update the model, you can use the methods nlp.to_disk("/path") and nlp.from_disk("/path") (further described at https://spacy.io/usage/saving-loading).
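A rough sketch of that save/reload workflow (the path and the extra word below are placeholders, and nlp.vocab[...].is_stop is flagged explicitly, as later answers on this page suggest):
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")
nlp.vocab["my_new_stopword"].is_stop = True   # flag the cached lexeme as well
nlp.to_disk("/tmp/nlp_custom_stopwords")      # hypothetical path
nlp_reloaded = spacy.load("/tmp/nlp_custom_stopwords")
print(nlp_reloaded.vocab["my_new_stopword"].is_stop)  # check the flag survived the round trip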

You can edit them before processing your text like this (see this post):
>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True
Note: This seems to work <=v1.8. For newer versions, see other answers.

Short answer for version 2.0 and above (just tested with 3.4+):
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS) # <- set of Spacy's default stop words
STOP_WORDS.add("your_additional_stop_word_here")
This loads all stop words as a set.
You can add your stop words to STOP_WORDS or use your own list in the first place.
To check that the is_stop attribute is set to True for the stop words, use this (assuming an nlp pipeline has already been loaded):
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    print(lexeme.text, lexeme.is_stop)
In the unlikely case that the stop words for some reason aren't set to is_stop = True, do this:
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True
Detailed explanation step by step with links to documentation.
First we import spacy:
import spacy
To instantiate the Language class as nlp from scratch, we need to import Vocab and Language. Documentation and example here.
from spacy.vocab import Vocab
from spacy.language import Language
# create new Language object from scratch
nlp = Language(Vocab())
stop_words is a default attribute of class Language and can be set to customize the default language data. Documentation here. You can find spacy's GitHub repo folder with defaults for various languages here.
For our instance of nlp we get 0 stop words, which is reasonable since we haven't set any language defaults.
print(f"Language instance 'nlp' has {len(nlp.Defaults.stop_words)} default stopwords.")
>>> Language instance 'nlp' has 0 default stopwords.
Let's import English language defaults.
from spacy.lang.en import English
Now we have 326 default stop words.
print(f"The language default English has {len(spacy.lang.en.STOP_WORDS)} stopwords.")
print(sorted(list(spacy.lang.en.STOP_WORDS))[:10])
>>> The language default English has 326 stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']
Let's create a new instance of Language, now with defaults for English. We get the same result.
nlp = English()
print(f"Language instance 'nlp' now has {len(nlp.Defaults.stop_words)} default stopwords.")
print(sorted(list(nlp.Defaults.stop_words))[:10])
>>> Language instance 'nlp' now has 326 default stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']
To check that all words are set to is_stop = True, we iterate over the stop words, retrieve each lexeme from the vocab and check its is_stop attribute.
[nlp.vocab[word].is_stop for word in nlp.Defaults.stop_words][:10]
>>> [True, True, True, True, True, True, True, True, True, True]
We can add stopwords to the English language defaults.
spacy.lang.en.STOP_WORDS.add("aaaahhh-new-stopword")
print(len(spacy.lang.en.STOP_WORDS))
# these propagate to our instance 'nlp' too!
print(len(nlp.Defaults.stop_words))
>>> 327
>>> 327
Or we can add new stopwords to instance nlp. However, these propagate to our language defaults too!
nlp.Defaults.stop_words.add("_another-new-stop-word")
print(len(spacy.lang.en.STOP_WORDS))
print(len(nlp.Defaults.stop_words))
>>> 328
>>> 328
The new stop words are set to is_stop = True.
print(nlp.vocab["aaaahhh-new-stopword"].is_stop)
print(nlp.vocab["_another-new-stop-word"].is_stop)
>>> True
>>> True
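As a quick sanity check (a minimal sketch using the English() instance from above), the flag also shows up on processed tokens:
doc = nlp("this is a simple example")
print([(t.text, t.is_stop) for t in doc])
# default stop words like "this", "is" and "a" come back flagged as True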

For 2.0 use the following:
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True

This collects the stop words too :)
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
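For example (a small sketch, assuming the en_core_web_sm model is installed), the collected set can be used to filter tokens directly:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords  # same set as above
nlp = spacy.load("en_core_web_sm")
doc = nlp("this is just another small example")
print([t.text for t in doc if t.text.lower() not in spacy_stopwords])
# keeps only the non-stop words, e.g. ['small', 'example']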

In the latest version, the following would remove the word from the list:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.remove('not')

For version 2.3.0
If you want to replace the entire list instead of adding or removing a few stop words, you can do this:
custom_stop_words = set(['the','and','a'])
# First override the stop words set for the language
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words
# Now load your model
nlp = spacy.load('en_core_web_md')
The trick is to assign the stop word set for the language before loading the model. It also ensures that any upper/lower case variations of the stop words are considered stop words.
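A quick way to verify that behaviour (a sketch, assuming the en_core_web_md model from above is installed):
doc = nlp("The and a THE")
print([(t.text, t.is_stop) for t in doc])
# per the claim above, upper/lower case variants of the custom stop words
# should all be flagged as stop words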

See the piece of code below:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
# Print the set of spaCy's default stop words (remember that sets are unordered):
print(nlp.Defaults.stop_words)
len(nlp.Defaults.stop_words)
# Make a list of the words you want to add to the stop words
words_to_add = ['apple', 'ball', 'cat']
# Iterate over it in a loop
for item in words_to_add:
    # Add the word to the set of stop words. Use lowercase!
    nlp.Defaults.stop_words.add(item)
    # Set the stop_word tag on the lexeme
    nlp.vocab[item].is_stop = True
Hope this helps. You can print the length before and after to confirm.
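For instance, a small check of the effect (reusing the nlp object set up above):
doc = nlp("I bought an apple and a ball for my cat")
print([t.text for t in doc if not t.is_stop])
# with 'apple', 'ball' and 'cat' added, only 'bought' should remain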

Related

Add the word "cant" to Spacy stopwords

How do I get spaCy to set words such as "cant" and "wont" as stopwords?
For example, even with tokenisation it will identify "can't" as a stop word, but not "cant".
When it sees "cant", it removes "ca" but leaves "nt". Is it by design? I guess "nt" is not really a word.
Here is some sample code:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load("en_core_web_sm")
text = "cant can't cannot"
doc = nlp(text)
for word in doc:
    print(word, ":", word.is_stop)
ca : True
nt : False
ca : True
n't : True
can : True
not : True
The tokenizer splits "cant" into "ca" and "nt". Adding "cant" to the list won't have any effect, because no token will ever match it. Instead, "nt" should be added, as in the example below (3rd line of code).
Also, it is important to update the stopwords before loading the model, otherwise it won't pick up the changes.
Example:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
STOP_WORDS.add("nt")
nlp = spacy.load("en_core_web_sm")
text = "cant can't cannot"
doc = nlp(text)
for word in doc:
    print(word, ":", word.is_stop)
ca : True
nt : True
ca : True
n't : True
can : True
not : True
As stated in Spacy's documentation, the tokenizer never adds or removes information from the text, so you can always reconstruct the original input (using the whitespace information stored inside the Tokens). This also means that if the text contains spelling errors, they will be kept.
So, there is no error in the tokenization process, since Spacy splits constructs such as can't or don't into two different tokens: do and n't, for example.
cant and wont are two spelling errors here (they also happen to be actual English words), but Spacy still "is able to recognize" them as auxiliaries and splits them just as it would split can't or won't. We could say the split is correct and follows the same rule as the correct versions of these words; the only remaining problem is getting wo and nt recognized as stopwords. You can see here the list of stopwords used by Spacy; for example, ca is present, and that is why it is correctly recognized as a stopword (n't is added at the end among the contractions).
If the split is ok for your use case, you can add wo and nt manually to the list of stopwords.
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
my_stop_words = ["nt", "wo"]
STOP_WORDS.update(my_stop_words)
nlp = spacy.load("en_core_web_sm")
# analyze docs
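For instance, analyzing a doc could then look like this (a sketch; the token texts assume the splits shown earlier):
doc = nlp("I cant go to work. We wont come to your party.")
print([(t.text, t.is_stop) for t in doc if t.text in ("ca", "nt", "wo")])
# with "wo" and "nt" added, all of these pieces should now be flagged as stop words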
If, for some reason, you need to do something with the stopwords in your text and you'd like to get back wont and cant rather than wo, nt and ca, nt, you could concatenate consecutive stopwords by checking whether the previous token's trailing whitespace is empty (meaning that the tokens were attached to each other in the original text):
stop_words_in_text = []
doc = nlp("Today I cant go to work. We wont come to your party.")
for token in doc:
    i = token.i
    if token.is_stop:
        if i > 0 and doc[i-1].whitespace_ == "" and doc[i-1].is_stop:
            stop_words_in_text[-1] += token.text
        else:
            stop_words_in_text.append(token.text)
print(stop_words_in_text)
['I', 'cant', 'go', 'to', 'We', 'wont', 'to', 'your']
Hopefully, this will help you. You can also implement custom Spacy components and check here if you need to add special tokenization cases.

Why does tokeniser break down words that are present in vocab

In my understanding, given a word, the tokeniser will break it down into sub-words only if the word is not present in tokeniser.get_vocab():
from transformers import AutoTokenizer

def checkModel(model):
    tokenizer = AutoTokenizer.from_pretrained(model)
    allList = []
    for word in tokenizer.get_vocab():
        word = word.lower()
        tokens = tokenizer.tokenize(word)
        try:
            if word[0] != '#' and word[0] != '[' and tokens[0] != word:
                allList.append((word, tokens))
                print(word, tokens)
        except:
            continue
    return allList
checkModel('bert-base-uncased')
# ideally should return an empty list
However, what I have observed is that some models on huggingface will break down words into smaller pieces even if the word is present in the vocab.
checkModel('emilyalsentzer/Bio_ClinicalBERT')
output:
welles ['well', '##es']
lexington ['le', '##xing', '##ton']
palestinian ['pale', '##st', '##inian']
...
elisabeth ['el', '##isa', '##beth']
alexander ['ale', '##xa', '##nder']
appalachian ['app', '##ala', '##chia', '##n']
mitchell ['mit', '##chel', '##l']
...
4630 # tokens in vocab got broken down, not supposed to happen
I have checked a few models for this behaviour and was wondering why this is happening.
This is a really interesting question, and I am currently wondering whether it should be considered as a bug report on the Huggingface repo.
EDIT: I realized that it is possible to define a model-specific tokenizer_config.json file to override the default behavior. One example is the bert-base-cased repository, which has the following content for the tokenizer config:
{
"do_lower_case": false
}
Given that this functionality is available, I think the best option would be to contact the original author of the work and ask them to potentially consider this configuration (if appropriate for the general use case).
Original Answer:
As it turns out, the vocabulary word that you are checking for is welles, yet the vocab file itself only contains Welles. Notice the difference in the uppercased first letter?
It turns out you can manually force the tokenizer to specifically check for cased vocabulary words, in which case it works fine.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT",
                                          do_lower_case=False)  # This is different
print(tokenizer.do_lower_case)
# Output: False
# Lowercase input will result in a split word
tokenizer.convert_ids_to_tokens(tokenizer("welles")["input_ids"])
# Output: ['[CLS]', 'well', '##es', '[SEP]']
# Uppercase input will correctly *not split* the word
tokenizer.convert_ids_to_tokens(tokenizer("Welles")["input_ids"])
# Output: ['[CLS]', 'Welles', '[SEP]']
By default, however, this is not the case, and all words will be converted to lowercase, which is why you cannot find the word:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
# Per default, lowercasing is enabled!
print(tokenizer.do_lower_case)
# Output: True
# This time we get the same (lowercased) output both times!
tokenizer.convert_ids_to_tokens(tokenizer("welles")["input_ids"])
['[CLS]', 'well', '##es', '[SEP]']
tokenizer.convert_ids_to_tokens(tokenizer("Welles")["input_ids"])
['[CLS]', 'well', '##es', '[SEP]']
The tokenizer you are calling 'emilyalsentzer/Bio_ClinicalBERT' has tokens that are not present in the original base tokenizer. To add tokens to the tokenizer one can either provide a list of strings or a list of tokenizers.AddedTokens.
The default behavior in both cases is to allow new words to be used as subwords. In my example if we add 'director' and 'cto' to the tokenizer, then 'director' can be broken down into 'dire' + 'cto' + 'r' ('dire' and 'r' are a part of the original tokenizer). To avoid this, one should use:
tokenizer.add_tokens([tokenizers.AddedToken(new_word, single_word = True) for new_word in new_words])
I do think a lot of users would simply use a list of strings (as I did, until half an hour ago). But this would lead to the problem that you saw.
To change this for a customized tokenizer (like 'emilyalsentzer/Bio_ClinicalBERT') without losing much in model performance, I'd recommend extracting the set of words from this tokenizer and comparing it to its base tokenizer (for example 'bert-base-uncased'). This will give you the set of words that were added to the base tokenizer as part of model re-training. Then take the base tokenizer and add these new words to it using AddedToken with single_word set to True. Replace the custom tokenizer with this new tokenizer.
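A rough sketch of that idea (the choice of base model and the filtering of '##' sub-word pieces are assumptions, and the model's embedding matrix would still need to be resized/aligned separately):
from transformers import AutoTokenizer
from tokenizers import AddedToken
base = AutoTokenizer.from_pretrained("bert-base-uncased")
custom = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
# Words present in the custom tokenizer's vocab but not in the base one
new_words = set(custom.get_vocab()) - set(base.get_vocab())
# Re-add them so they only ever match as whole words, never as sub-word pieces
base.add_tokens([AddedToken(w, single_word=True)
                 for w in new_words if not w.startswith("##")])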

word tokenization takes too much time to run

I use the Pythainlp package to tokenize my Thai language data for sentiment analysis.
First, I build a function that adds a new word set and tokenizes the text:
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie
from pythainlp import word_tokenize
def text_tokenize(Mention):
    new_words = {'คนละครึ่ง', 'ยืนยันตัวตน', 'เติมเงิน', 'เราชนะ', 'เป๋าตัง', 'แอปเป๋าตัง'}
    words = new_words.union(thai_words())
    custom_dictionary_trie = dict_trie(words)
    dataa = word_tokenize(Mention, custom_dict=custom_dictionary_trie, keep_whitespace=False)
    return dataa
After that I apply it within my text_process function, which also removes punctuation and stop words.
puncuations = '''.?!,;:-_[]()'/<>{}\##$&%~*ๆฯ'''
from pythainlp import word_tokenize
def text_process(Mention):
    final = "".join(u for u in Mention if u not in puncuations and ('ๆ', 'ฯ'))
    final = text_tokenize(final)
    final = " ".join(word for word in final)
    final = " ".join(word for word in final.split() if word.lower not in thai_stopwords)
    return final
dff['text_tokens'] = dff['Mention'].apply(text_process)
dff
The point is that this function takes too long to run: it ran for 17 minutes and still hadn't finished. I tried to replace
final = text_tokenize(final) with final = word_tokenize(final)
and it took just 2 minutes, but I can no longer use that because I need to add my new custom dictionary. I know there is something wrong but I really don't know how to fix it.
I am new to Python and NLP, so please help.
PS: sorry for my broken English.
I am not familiar with the Thai language, but I assume that for tokenization you can also use language-agnostic tokenization tools.
If you want to perform word tokenization, try the example below:
from nltk.tokenize import word_tokenize
s = '''This is the text I want to tokenize'''
word_tokenize(s)
>>> ['This', 'is', 'the', 'text', 'I', 'want', 'to', 'tokenize']
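If the Thai custom dictionary is actually needed, another thing worth checking in the question's code is that the dictionary trie is rebuilt for every single mention; building it once outside the function should be much cheaper (a sketch, reusing the imports and word set from the question):
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie
from pythainlp import word_tokenize
# Build the custom dictionary trie once, instead of on every call
new_words = {'คนละครึ่ง', 'ยืนยันตัวตน', 'เติมเงิน', 'เราชนะ', 'เป๋าตัง', 'แอปเป๋าตัง'}
custom_dictionary_trie = dict_trie(new_words.union(thai_words()))
def text_tokenize(mention):
    return word_tokenize(mention, custom_dict=custom_dictionary_trie, keep_whitespace=False)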

spacy 3 - lemma_ returned will be empty string

I normalize tens of thousands of docs using spacy 3.
To speed up the process, I tried this:
nlp = spacy.load('en_core_web_sm')
docs = nlp.tokenizer.pipe(doc_list)
return [[word.lemma_ for word in doc if word.is_punct == False and word.is_stop == False] for doc in docs]
but every lemma_ returned is an empty string.
So I use nlp(doc) directly, like the following, but it's too slow.
a = [[word.lemma_ for word in nlp(doc) if word.is_punct == False and word.is_stop == False] for doc in doc_list]
How can I do this properly?
The difference is in how you are creating the docs.
In the first example you use nlp.tokenizer.pipe() - this will only run the tokenizer on all your docs but not the lemmatizer. So, all you get is your docs split into tokens but the lemma_ attribute is not set.
In the second example you use nlp(doc); this runs all the default components (which are ['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']). Since the lemmatizer is part of the pipeline, the lemma_ attribute is set. But it is slower because you are running all the components, even the ones you don't need.
What you should be doing:
import spacy
# Exclude pipeline components that are not needed for lemmatization.
# (The rule-based lemmatizer still needs the tagger and attribute_ruler to set POS tags.)
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])
# Extract lemmas as required.
a = [[word.lemma_ for word in nlp(doc) if word.is_punct == False and word.is_stop == False] for doc in doc_list]
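For many documents, nlp.pipe can additionally be used to process them in batches rather than calling nlp(doc) once per document (a sketch building on the answer above; doc_list is assumed to be the list of raw strings from the question):
import spacy
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])
# nlp.pipe streams Doc objects and batches the work internally
a = [[word.lemma_ for word in doc if not word.is_punct and not word.is_stop]
     for doc in nlp.pipe(doc_list, batch_size=256)]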

Iterate and Lemmatize List

I'm a newbie and struggling with what I'm sure is a simple task.
I have a list of words taken from POS tagging:
words = ['drink', 'drinking']
And I want to lemmatize them and then process them (using set?) to ultimately refine my list to:
refined_list = ['drink']
However, I'm stuck on the next step of lemmatization - my method still returns the following:
refined_list = ['drink', 'drinking']
I tried to reference this but can't figure out what to import so 'lmtzr' works or how to get it to work.
Here's my code so far:
import nltk
words = ['drink', 'drinking']
WNlemma = nltk.WordNetLemmatizer()
refined_list = [WNlemma.lemmatize(t) for t in words]
print(refined_list)
Thank you for helping me.
You need to set the pos parameter of lemmatize to verb ('v'). By default it is noun ('n'), so it treats everything as a noun even if you pass it a verb.
import nltk
words = ['drink', 'drinking']
WNlemma = nltk.WordNetLemmatizer()
refined_list = [WNlemma.lemmatize(t, pos='v') for t in words]
print(refined_list)
Output:
['drink', 'drink']
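To get from there to the refined_list in the question, the lemmatized words can then be de-duplicated with a set, as the question itself suggests (a small sketch):
import nltk
words = ['drink', 'drinking']
WNlemma = nltk.WordNetLemmatizer()
# Lemmatize as verbs, then use a set to drop duplicates
refined_list = sorted(set(WNlemma.lemmatize(t, pos='v') for t in words))
print(refined_list)
# ['drink']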
