How do I get spaCy to treat words such as "cant" and "wont" as stop words?
For example, even with tokenisation it will identify "can't" as a stop word, but not "cant".
When it sees "cant", it removes "ca" but leaves "nt". Is it by design? I guess "nt" is not really a word.
Here is a sample code:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load("en_core_web_sm")
text = "cant can't cannot"
doc = nlp(text)
for word in doc:
    print(word, ":", word.is_stop)
ca : True
nt : False
ca : True
n't : True
can : True
not : True
The tokenizer splits "cant" into "ca" and "nt". Adding "cant" to the list won't surge any effect because not token will be matched. Instead "nt" should be added as in the example (3rd line of code).
Also it is important to update the stopwords before loading the model, otherwise if won't pick the changes.
Example:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
STOP_WORDS.add("nt")
nlp = spacy.load("en_core_web_sm")
text = "cant can't cannot"
doc = nlp(text)
for word in doc:
    print(word, ":", word.is_stop)
ca : True
nt : True
ca : True
n't : True
can : True
not : True
As stated in spaCy's documentation, the tokenizer does not add or remove information from the text, so you can always reconstruct the original input text (using the whitespace information stored on the tokens). This also means that if the text contains spelling errors, they are kept.
So, there is no error in the tokenization process, since Spacy splits constructs such as can't or don't into two different tokens: do and n't, for example.
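As a quick check (a minimal sketch, assuming en_core_web_sm is installed), you can verify this non-destructive behaviour by rebuilding the original string from the tokens and their stored whitespace:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "cant can't cannot"
doc = nlp(text)
# Each token keeps its trailing whitespace, so the input can be rebuilt exactly
print("".join(token.text_with_ws for token in doc) == text)
# True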
cant and wont are two spelling errors here (they are also actual English words, but spaCy "recognizes" them as the auxiliaries and splits them just as it would split can't or won't). We could say the split is correct and follows the same rule as the correctly spelled versions; the only real problem is that wo and nt are not recognized as stopwords. You can see here the list of stopwords used by spaCy; for example, ca is present, which is why it is correctly recognized as a stopword (n't is added at the end among the contractions).
If the split is ok for your use case, you can add wo and nt manually to the list of stopwords.
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
my_stop_words = ["nt", "wo"]
STOP_WORDS.update(my_stop_words)
nlp = spacy.load("en_core_web_sm")
# analyze docs
If, for some reason, you need to do something with the stopwords in your text and you'd like to get wont and cant rather than wo, nt and ca, nt, you could concatenate consecutive stopword tokens whenever the previous token's trailing whitespace is empty (meaning the tokens were attached in the original text):
stop_words_in_text = []
doc = nlp("Today I cant go to work. We wont come to your party.")
for token in doc:
    i = token.i
    if token.is_stop:
        if i > 0 and doc[i-1].whitespace_ == "" and doc[i-1].is_stop:
            stop_words_in_text[-1] += token.text
        else:
            stop_words_in_text.append(token.text)
print(stop_words_in_text)
['I', 'cant', 'go', 'to', 'We', 'wont', 'to', 'your']
Hopefully, this will help you. You can also implement custom spaCy components, and check here if you need to add special tokenization cases.
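For completeness, here is a minimal sketch of such a special tokenization case, assuming you would rather keep "cant" as a single token and flag it as a stop word yourself:
import spacy
from spacy.symbols import ORTH
nlp = spacy.load("en_core_web_sm")
# Keep "cant" as one token instead of letting the tokenizer split it into "ca" + "nt"
nlp.tokenizer.add_special_case("cant", [{ORTH: "cant"}])
# Flag the new token as a stop word
nlp.vocab["cant"].is_stop = True
doc = nlp("cant can't cannot")
print([(t.text, t.is_stop) for t in doc])
# expected: [('cant', True), ('ca', True), ("n't", True), ('can', True), ('not', True)]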
In my understanding, given a word, the tokeniser will break it down into sub-words only if the word is not present in tokeniser.get_vocab():
from transformers import AutoTokenizer

def checkModel(model):
    tokenizer = AutoTokenizer.from_pretrained(model)
    allList = []
    for word in tokenizer.get_vocab():
        word = word.lower()
        tokens = tokenizer.tokenize(word)
        try:
            if word[0] != '#' and word[0] != '[' and tokens[0] != word:
                allList.append((word, tokens))
                print(word, tokens)
        except:
            continue
    return allList

checkModel('bert-base-uncased')
# ideally should return an empty list
However, what I have observed is that some models on huggingface will break down words into smaller pieces even if the word is present in the vocab.
checkModel('emilyalsentzer/Bio_ClinicalBERT')
output:
welles ['well', '##es']
lexington ['le', '##xing', '##ton']
palestinian ['pale', '##st', '##inian']
...
elisabeth ['el', '##isa', '##beth']
alexander ['ale', '##xa', '##nder']
appalachian ['app', '##ala', '##chia', '##n']
mitchell ['mit', '##chel', '##l']
...
4630 # tokens in vocab got broken down, not supposed to happen
I have checked a few models for this behaviour, and I was wondering why this is happening.
This is a really interesting question, and I am currently wondering whether it should be considered as a bug report on the Huggingface repo.
EDIT: I realized that it is possible to define model-specific tokenizer_config.json files to override the default behavior. One example is the bert-base-cased repository, which has the following content for the tokenizer config:
{
"do_lower_case": false
}
Given that this functionality is available, I think the best option would be to contact the original author of the work and ask them to potentially consider this configuration (if appropriate for the general use case).
Original Answer:
As it turns out, the vocabulary word that you are checking for is welles, yet the vocab file itself only contains Welles. Notice the difference in the uppercased first letter?
It turns out you can manually force the tokenizer to specifically check for cased vocabulary words, in which case it works fine.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT",
                                          do_lower_case=False)  # This is different
print(tokenizer.do_lower_case)
# Output: False

# Lowercase input will result in a split word
tokenizer.convert_ids_to_tokens(tokenizer("welles")["input_ids"])
# Output: ['[CLS]', 'well', '##es', '[SEP]']

# Uppercase input will correctly *not split* the word
tokenizer.convert_ids_to_tokens(tokenizer("Welles")["input_ids"])
# Output: ['[CLS]', 'Welles', '[SEP]']
By default, however, this is not the case: all words are converted to lowercase, which is why you cannot find the word:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
# Per default, lowercasing is enabled!
print(tokenizer.do_lower_case)
# Output: True
# This time now we get the same (lowercased) output both times!
tokenizer.convert_ids_to_tokens(tokenizer("welles")["input_ids"])
['[CLS]', 'well', '##es', '[SEP]']
tokenizer.convert_ids_to_tokens(tokenizer("Welles")["input_ids"])
['[CLS]', 'well', '##es', '[SEP]']
The tokenizer you are calling, 'emilyalsentzer/Bio_ClinicalBERT', has tokens that are not present in the original base tokenizer. To add tokens to a tokenizer, one can provide either a list of strings or a list of tokenizers.AddedToken objects.
The default behavior in both cases is to allow new words to be used as subwords. In my example, if we add 'director' and 'cto' to the tokenizer, then 'director' can be broken down into 'dire' + 'cto' + 'r' ('dire' and 'r' are part of the original tokenizer). To avoid this, one should use:
tokenizer.add_tokens([tokenizers.AddedToken(new_word, single_word = True) for new_word in new_words])
I do think a lot of users would simply use a list of strings (as I did, until half an hour ago). But this would lead to the problem that you saw.
To change this for a customized tokenizer (like 'emilyalsentzer/Bio_ClinicalBERT') without losing much model performance, I'd recommend extracting the set of words from this tokenizer and comparing it to its base tokenizer (for example 'bert-base-uncased'). This gives you the set of words that were added to the base tokenizer as part of model re-training. Then take the base tokenizer, add these new words to it using AddedToken with single_word set to True, and replace the custom tokenizer with this new tokenizer.
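A rough sketch of that strategy (hedged: I am assuming 'bert-base-uncased' as the base checkpoint, and you would still have to verify that the resulting token ids line up with the fine-tuned model's embeddings):
from transformers import AutoTokenizer
from tokenizers import AddedToken

base_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
custom_tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Words the custom tokenizer added on top of the base vocabulary
extra_words = set(custom_tok.get_vocab()) - set(base_tok.get_vocab())

# Re-add them as whole words only, so they are never used as subword pieces
base_tok.add_tokens([AddedToken(w, single_word=True) for w in extra_words])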
I use the PyThaiNLP package to tokenize my Thai language data for sentiment analysis.
First, I build a function to add a new word set and tokenize the text:
from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie
from pythainlp import word_tokenize
def text_tokenize(Mention):
    new_words = {'คนละครึ่ง', 'ยืนยันตัวตน', 'เติมเงิน', 'เราชนะ', 'เป๋าตัง', 'แอปเป๋าตัง'}
    words = new_words.union(thai_words())
    custom_dictionary_trie = dict_trie(words)
    dataa = word_tokenize(Mention, custom_dict=custom_dictionary_trie, keep_whitespace=False)
    return dataa
After that, I apply it within my text_process function, which also removes punctuation and stop words.
puncuations = '''.?!,;:-_[]()'/<>{}\##$&%~*ๆฯ'''
from pythainlp import word_tokenize
def text_process(Mention):
    final = "".join(u for u in Mention if u not in puncuations and ('ๆ', 'ฯ'))
    final = text_tokenize(final)
    final = " ".join(word for word in final)
    final = " ".join(word for word in final.split() if word.lower not in thai_stopwords)
    return final
dff['text_tokens'] = dff['Mention'].apply(text_process)
dff
The point is that it takes too long to run this function: it had been running for 17 minutes and still had not finished. I tried to replace
final = text_tokenize(final) with final = word_tokenize(final)
and it took just 2 minutes, but I can no longer use that because I need to add a new custom dictionary. I know there is something wrong but I really don't know how to fix it.
I am new to Python and NLP, so please help.
P.S. Sorry for my broken English.
I am not familiar with the Thai language, but I assume that for tokenization you can also use language-agnostic tokenization tools.
If you want to perform word tokenization, try the example below:
from nltk.tokenize import word_tokenize
s = '''This is the text I want to tokenize'''
word_tokenize(s)
>>> ['This', 'is', 'the', 'text', 'I', 'want', 'to', 'tokenize']
I would like to use spacy for tokenizing Wikipedia scrapes. Ideally it would work like this:
text = 'procedure that arbitrates competing models or hypotheses.[2][3] Researchers also use experimentation to test existing theories or new hypotheses to support or disprove them.[3][4]'
# run spacy
spacy_en = spacy.load("en")
doc = spacy_en(text, disable=['tagger', 'ner'])
tokens = [tok.text.lower() for tok in doc]
# desired output
# tokens = [..., 'models', 'or', 'hypotheses', '.', '[2][3]', 'Researchers', ...
# actual output
# tokens = [..., 'models', 'or', 'hypotheses.[2][3', ']', 'Researchers', ...]
The problem is that the 'hypotheses.[2][3]' is glued together into one token.
How can I prevent spacy from connecting this '[2][3]' to the previous token?
As long as it is split from the word hypotheses and the point at the end of the sentence, I don't care how it is handled. But individual words and grammar should stay apart from syntactical noise.
So for example, any of the following would be a desirable output:
'hypotheses', '.', '[2][', '3]'
'hypotheses', '.', '[2', '][3]'
I think you could try playing around with infix:
import re
import spacy
from spacy.tokenizer import Tokenizer
infix_re = re.compile(r'''[.]''')
def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"hello-world! I am hypothesis.[2][3]")
print([t.text for t in doc])
More on this: https://spacy.io/usage/linguistic-features#native-tokenizers
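If you would rather keep spaCy's default tokenization and only add extra split points, you can extend the default infixes instead of replacing the whole tokenizer. This is a sketch, assuming a recent spaCy with en_core_web_sm; the extra patterns are my own guess at what covers the citation markers:
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Split on "[" and "]" anywhere, and on a "." that sits between a letter and a "["
extra_infixes = [r"(?<=[a-zA-Z])\.(?=\[)", r"[\[\]]"]
infixes = list(nlp.Defaults.infixes) + extra_infixes
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("competing models or hypotheses.[2][3] Researchers also use experimentation")
print([t.text for t in doc])
# expected: [..., 'hypotheses', '.', '[', '2', ']', '[', '3', ']', 'Researchers', ...]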
What is the best way to add/remove stop words with spaCy? I am using the token.is_stop function and would like to make some custom changes to the set. I was looking at the documentation but could not find anything regarding stop words. Thanks!
Using Spacy 2.0.11, you can update its stopwords set using one of the following:
To add a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")
To add several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}
To remove a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")
To remove several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}
Note: To see the current set of stopwords, use:
print(nlp.Defaults.stop_words)
Update: It was noted in the comments that this fix only affects the current execution. To update the model, you can use the methods nlp.to_disk("/path") and nlp.from_disk("/path") (further described at https://spacy.io/usage/saving-loading).
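A minimal sketch of that save/reload step (the path is just a placeholder, and I'm using en_core_web_sm rather than the "en" shortcut):
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.add("my_new_stopword")
nlp.vocab["my_new_stopword"].is_stop = True

# Persist the customized pipeline and load it back later
nlp.to_disk("/path/to/customized_model")
nlp_reloaded = spacy.load("/path/to/customized_model")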
You can edit them before processing your text like this (see this post):
>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True
Note: This seems to work <=v1.8. For newer versions, see other answers.
Short answer for version 2.0 and above (just tested with 3.4+):
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS) # <- set of Spacy's default stop words
STOP_WORDS.add("your_additional_stop_word_here")
This loads all stop words as a set.
You can add your stop words to STOP_WORDS or use your own list in the first place.
To check if the attribute is_stop for the stop words is set to True use this:
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    print(lexeme.text, lexeme.is_stop)
In the unlikely case that stop words for some reason aren't set to is_stop = True do this:
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True
Detailed explanation step by step with links to documentation.
First we import spacy:
import spacy
To instantiate class Language as nlp from scratch we need to import Vocab and Language. Documentation and example here.
from spacy.vocab import Vocab
from spacy.language import Language
# create new Language object from scratch
nlp = Language(Vocab())
stop_words is a default attribute of class Language and can be set to customize the default language data. Documentation here. You can find spacy's GitHub repo folder with defaults for various languages here.
For our instance of nlp we get 0 stop words, which is reasonable since we haven't set any language defaults:
print(f"Language instance 'nlp' has {len(nlp.Defaults.stop_words)} default stopwords.")
>>> Language instance 'nlp' has 0 default stopwords.
Let's import English language defaults.
from spacy.lang.en import English
Now we have 326 default stop words.
print(f"The language default English has {len(spacy.lang.en.STOP_WORDS)} stopwords.")
print(sorted(list(spacy.lang.en.STOP_WORDS))[:10])
>>> The language default English has 326 stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']
Let's create a new instance of Language, now with defaults for English. We get the same result.
nlp = English()
print(f"Language instance 'nlp' now has {len(nlp.Defaults.stop_words)} default stopwords.")
print(sorted(list(nlp.Defaults.stop_words))[:10])
>>> Language instance 'nlp' now has 326 default stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']
To check if all words are set to is_stop = True we iterate over the stop words, retrieve the lexeme from vocab and print out the is_stop attribute.
[nlp.vocab[word].is_stop for word in nlp.Defaults.stop_words][:10]
>>> [True, True, True, True, True, True, True, True, True, True]
We can add stopwords to the English language defaults.
spacy.lang.en.STOP_WORDS.add("aaaahhh-new-stopword")
print(len(spacy.lang.en.STOP_WORDS))
# these propagate to our instance 'nlp' too!
print(len(nlp.Defaults.stop_words))
>>> 327
>>> 327
Or we can add new stopwords to instance nlp. However, these propagate to our language defaults too!
nlp.Defaults.stop_words.add("_another-new-stop-word")
print(len(spacy.lang.en.STOP_WORDS))
print(len(nlp.Defaults.stop_words))
>>> 328
>>> 328
The new stop words are set to is_stop = True.
print(nlp.vocab["aaaahhh-new-stopword"].is_stop)
print(nlp.vocab["_another-new-stop-word"].is_stop)
>>> True
>>> True
For 2.0 use the following:
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True
This collects the stop words too :)
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
In the latest version, the following would remove the word from the list:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.remove('not')
For version 2.3.0
If you want to replace the entire list instead of adding or removing a few stop words, you can do this:
custom_stop_words = set(['the','and','a'])
# First override the stop words set for the language
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words
# Now load your model
nlp = spacy.load('en_core_web_md')
The trick is to assign the stop word set for the language before loading the model. It also ensures that any upper/lower case variations of the stop words are considered stop words.
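For instance (a quick check, continuing the snippet above where 'the' is part of the custom set), different casings should all be flagged:
doc = nlp("The THE the")
print([(t.text, t.is_stop) for t in doc])
# expected: [('The', True), ('THE', True), ('the', True)]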
See the piece of code below:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
# Print the set of spaCy's default stop words (remember that sets are unordered):
print(nlp.Defaults.stop_words)
len(nlp.Defaults.stop_words)
# Make a list of the words you want to add to the stop words
word_list = ['apple', 'ball', 'cat']

# Iterate over it in a loop
for item in word_list:
    # Add the word to the set of stop words. Use lowercase!
    nlp.Defaults.stop_words.add(item)
    # Set the stop_word flag on the lexeme
    nlp.vocab[item].is_stop = True
Hope this helps. You can print the length of the set before and after to confirm.
My goal is to create a system that can take any random text, extract sentences, remove punctuation, and then, on a bare sentence, randomly replace NN- or VB-tagged words with their meronym, holonym or synonym, as well as with a similar word from a WordNet synset. There is a lot of work ahead, but I have a problem at the very beginning.
For this I use the pattern and TextBlob packages. This is what I have done so far...
from pattern.web import URL, plaintext
from pattern.text import tokenize
from pattern.text.en import wordnet
from textblob import TextBlob
import string
s = URL('http://www.fangraphs.com/blogs/the-fringe-five-baseballs-most-compelling-fringe-prospects-35/#more-157570').download()
s = plaintext(s, keep=[])
secam = (tokenize(s, punctuation=""))
simica = secam[15].strip(string.punctuation)
simica = simica.replace(",", "")
simica = TextBlob(simica)
simicaTg = simica.words
synsimica = wordnet.synsets(simicaTg[3])[0]
djidja = synsimica.hyponyms()
Now everything works the way I want, but when I try to extract, for example, a hyponym from this djidja variable, it proves to be impossible, since each element is a Synset object and I can't manipulate it in any obvious way.
Any idea how to extract the very word that is reported in the hyponyms list (e.g. print(djidja[2]) displays Synset(u'bowler'), so how do I extract only 'bowler' from this)?
Recall that a synset is just a list of words marked as synonyms. Given a synset, you can extract the words that form it:
from pattern.text.en import wordnet
s = wordnet.synsets('dog')[0] # a word can belong to many synsets, let's just use one for the sake of argument
print(s.synonyms)
This outputs:
Out[14]: [u'dog', u'domestic dog', u'Canis familiaris']
You can also extract hypernyms and hyponyms:
print(s.hypernyms())
Out[16]: [Synset(u'canine'), Synset(u'domestic animal')]
print(s.hypernyms()[0].synonyms)
Out[17]: [u'canine', u'canid']
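Applied to the djidja variable from your question, the same synonyms attribute should give you the plain word inside each hyponym Synset:
print(djidja[2])              # Synset(u'bowler')
print(djidja[2].synonyms[0])  # u'bowler'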