Why does tokeniser break down words that are present in vocab - python

In my understanding, given a word, the tokeniser will break it down into sub-words only if the word is not present in tokenizer.get_vocab():
from transformers import AutoTokenizer

def checkModel(model):
    tokenizer = AutoTokenizer.from_pretrained(model)
    allList = []
    for word in tokenizer.get_vocab():
        word = word.lower()
        tokens = tokenizer.tokenize(word)
        try:
            if word[0] != '#' and word[0] != '[' and tokens[0] != word:
                allList.append((word, tokens))
                print(word, tokens)
        except:
            continue
    return allList

checkModel('bert-base-uncased')
# ideally should return an empty list
However, what I have observed is that some models on huggingface will break down words into smaller pieces even if the word is present in the vocab.
checkModel('emilyalsentzer/Bio_ClinicalBERT')
output:
welles ['well', '##es']
lexington ['le', '##xing', '##ton']
palestinian ['pale', '##st', '##inian']
...
elisabeth ['el', '##isa', '##beth']
alexander ['ale', '##xa', '##nder']
appalachian ['app', '##ala', '##chia', '##n']
mitchell ['mit', '##chel', '##l']
...
4630 # tokens in vocab got broken down, not supposed to happen
I have checked a few models for this behaviour and was wondering why this is happening?

This is a really interesting question, and I am currently wondering whether it should be reported as a bug on the Huggingface repo.
EDIT: I realized that it is possible to define model-specific tokenizer_config.json files to override the default behavior. One example is the bert-base-cased repository, which has the following content in its tokenizer config:
{
  "do_lower_case": false
}
Given that this functionality is available, I think the best option would be to contact the original author of the work and ask them to potentially consider this configuration (if appropriate for the general use case).
Original Answer:
As it turns out, the vocabulary word that you are checking for is welles, yet the vocab file itself only contains Welles. Notice the difference in the uppercased first letter?
It turns out you can manually force the tokenizer to specifically check for cased vocabulary words, in which case it works fine.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT",
                                          do_lower_case=False)  # This is different
print(tokenizer.do_lower_case)
# Output: False

# Lowercase input will result in a split word
tokenizer.convert_ids_to_tokens(tokenizer("welles")["input_ids"])
# Output: ['[CLS]', 'well', '##es', '[SEP]']

# Uppercase input will correctly *not split* the word
tokenizer.convert_ids_to_tokens(tokenizer("Welles")["input_ids"])
# Output: ['[CLS]', 'Welles', '[SEP]']
By default, however, this is not the case: all words are converted to lowercase, which is why you cannot find the word:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
# By default, lowercasing is enabled!
print(tokenizer.do_lower_case)
# Output: True

# This time we get the same (lowercased) output for both inputs:
tokenizer.convert_ids_to_tokens(tokenizer("welles")["input_ids"])
# Output: ['[CLS]', 'well', '##es', '[SEP]']
tokenizer.convert_ids_to_tokens(tokenizer("Welles")["input_ids"])
# Output: ['[CLS]', 'well', '##es', '[SEP]']

The tokenizer you are calling, 'emilyalsentzer/Bio_ClinicalBERT', has tokens that are not present in the original base tokenizer. To add tokens to a tokenizer, one can provide either a list of strings or a list of tokenizers.AddedToken objects.
The default behavior in both cases is to allow new words to be used as subwords. In my example if we add 'director' and 'cto' to the tokenizer, then 'director' can be broken down into 'dire' + 'cto' + 'r' ('dire' and 'r' are a part of the original tokenizer). To avoid this, one should use:
tokenizer.add_tokens([tokenizers.AddedToken(new_word, single_word = True) for new_word in new_words])
I do think a lot of users would simply use a list of strings (as I did, until half an hour ago), but this would lead to the problem that you saw.
To change this for a customized tokenizer (like 'emilyalsentzer/Bio_ClinicalBERT') without losing much in model performance, I'd recommend extracting the set of words from this tokenizer and comparing it to its base tokenizer (for example 'bert-base-uncased'). This gives you the set of words that were added to the base tokenizer as part of model re-training. Then take the base tokenizer and add these new words to it using AddedToken with single_word set to True, and replace the custom tokenizer with this new tokenizer; a rough sketch follows below.
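A sketch of that procedure, assuming the two vocabularies can simply be diffed (the variable names here are illustrative, not part of any official recipe):

from transformers import AutoTokenizer
from tokenizers import AddedToken

custom = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
base = AutoTokenizer.from_pretrained("bert-base-uncased")

# words present in the custom vocab but missing from the base vocab
new_words = set(custom.get_vocab()) - set(base.get_vocab())

# re-add them so that they only match as whole words, never as subword pieces
base.add_tokens([AddedToken(w, single_word=True) for w in new_words])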

Related

Add the word "cant" to Spacy stopwords

How do I get spaCy to treat words such as "cant" and "wont" as stop words?
For example, even with tokenisation it will identify "can't" as a stop word, but not "cant".
When it sees "cant", it removes "ca" (as a stop word) but leaves "nt". Is this by design? I guess "nt" is not really a word.
Here is a sample code:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")
text = "cant can't cannot"
doc = nlp(text)
for word in doc:
    print(word, ":", word.is_stop)

ca : True
nt : False
ca : True
n't : True
can : True
not : True
The tokenizer splits "cant" into "ca" and "nt". Adding "cant" to the stop word list won't have any effect because no token will ever match it. Instead, "nt" should be added, as in the example below (the STOP_WORDS.add("nt") line).
It is also important to update the stop words before loading the model, otherwise it won't pick up the changes.
Example:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

STOP_WORDS.add("nt")
nlp = spacy.load("en_core_web_sm")
text = "cant can't cannot"
doc = nlp(text)
for word in doc:
    print(word, ":", word.is_stop)

ca : True
nt : True
ca : True
n't : True
can : True
not : True
As stated in spaCy's documentation, the tokenizer cannot add information to or remove information from the text, so you can always reconstruct the original input text (using the whitespace information stored on the tokens). This also means that any spelling errors in the text are kept.
So, there is no error in the tokenization process, since Spacy splits constructs such as can't or don't into two different tokens: do and n't, for example.
cant and wont are two spelling errors (or rather, real English words that spaCy "recognizes" as auxiliaries and therefore splits just as it would split can't or won't). We could say that the split is correct and that it follows the same rule it would follow with the correct versions of these words; the only remaining problem is recognizing wo and nt as stop words. You can see the list of stop words used by spaCy here; for example, ca is present, which is why it is correctly recognized as a stop word (n't is added at the end among the contractions).
If the split is ok for your use case, you can add wo and nt manually to the list of stopwords.
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

my_stop_words = ["nt", "wo"]
STOP_WORDS.update(my_stop_words)
nlp = spacy.load("en_core_web_sm")
# analyze docs
If, for some reason, you need to do something with the stop words in your text and you'd like to obtain wont and cant rather than wo, nt and ca, nt, you could concatenate consecutive stop words by checking whether the preceding token's trailing whitespace is empty (meaning the tokens were attached in the original text):
stop_words_in_text = []
doc = nlp("Today I cant go to work. We wont come to your party.")
for token in doc:
    i = token.i
    if token.is_stop:
        if i > 0 and doc[i-1].whitespace_ == "" and doc[i-1].is_stop:
            stop_words_in_text[-1] += token.text
        else:
            stop_words_in_text.append(token.text)
print(stop_words_in_text)

['I', 'cant', 'go', 'to', 'We', 'wont', 'to', 'your']
Hopefully, this will help you. You can also implement custom spaCy components, and check the documentation if you need to add special tokenization cases; a sketch of the latter follows below.
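If you would rather keep "cant" as a single token in the first place, a special tokenization case is one option. This is a minimal sketch, not a drop-in fix for every pipeline; it overrides the built-in exception for "cant" and then marks the unsplit token as a stop word:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
# keep "cant" as one token instead of splitting it into "ca" + "nt"
nlp.tokenizer.add_special_case("cant", [{ORTH: "cant"}])
# flag the unsplit token as a stop word at the vocabulary level
nlp.vocab["cant"].is_stop = True

doc = nlp("cant can't cannot")
print([(t.text, t.is_stop) for t in doc])
# "cant" now appears as a single token with is_stop == True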

Adding Special Tokens Changes all Embeddings - TF Bert Hugging Face

Given the following,
from transformers import TFAutoModel
from transformers import BertTokenizer
bert = TFAutoModel.from_pretrained('bert-base-cased')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
I expected that if special tokens are added, the embeddings of the remaining tokens would stay the same, and yet they do not. For example, I expected the following two values to be equal, but all the token embeddings change. Why is this?
tokens = tokenizer(['this product is no good'], add_special_tokens=True,return_tensors='tf')
output = bert(tokens)
output[0][0][1]
tokens = tokenizer(['this product is no good'], add_special_tokens=False,return_tensors='tf')
output = bert(tokens)
output[0][0][0]
When setting add_special_tokens=True, you are including the [CLS] token in the front and the [SEP] token at the end of your sentence, which leads to a total of 7 tokens instead of 5:
import tensorflow as tf

tokens = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
print(tokenizer.convert_ids_to_tokens(tf.squeeze(tokens['input_ids'], axis=0)))
# Output: ['[CLS]', 'this', 'product', 'is', 'no', 'good', '[SEP]']
Your sentence-level embeddings are different because these two special tokens become part of your sequence and are propagated through the BERT model. They are not masked like padding tokens ([PAD]). Check out the docs for more information.
If you take a closer look at how BERT's Transformer encoder architecture and attention mechanism work, you will quickly understand why a single difference between two sequences will generate different hidden_states. New tokens are not simply concatenated to existing ones; in a sense, all tokens depend on each other. According to BERT author Jacob Devlin:
I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations.
Or another interesting discussion:
[...] The value of CLS is influenced by other tokens, just like other tokens are influenced by their context (attention).
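To see the effect concretely, the two calls from the question can be compared directly; a small sketch, assuming the TF weights download normally (the embedding of 'this' sits at index 1 with special tokens and at index 0 without them):

import tensorflow as tf
from transformers import TFAutoModel, BertTokenizer

bert = TFAutoModel.from_pretrained('bert-base-cased')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

with_special = tokenizer(['this product is no good'], add_special_tokens=True, return_tensors='tf')
without_special = tokenizer(['this product is no good'], add_special_tokens=False, return_tensors='tf')

emb_with = bert(with_special)[0][0][1]        # 'this', preceded by [CLS]
emb_without = bert(without_special)[0][0][0]  # 'this', no special tokens

# the vectors differ because [CLS] and [SEP] take part in attention
print(float(tf.reduce_max(tf.abs(emb_with - emb_without))))  # > 0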

tensorflow_hub to pull BERT embedding on windows machine - extending to albert

Recently I posted this question and tried to solve my problem. My questions are:
Is my approach correct?
My example sentences are 7 and 6 words long respectively (['New Delhi is the capital of India', 'The capital of India is Delhi']); even after I add the cls and sep tokens, the lengths are 9 and 8. The max_seq_len parameter is 10, so why are the last rows of x1 and x2 not the same?
How do I get embeddings when I have a paragraph of more than 2 sentences? Do I have to pass one sentence at a time? But in that case, won't I lose information because I am not passing all sentences together?
I did some additional research and it seems that I can pass an entire paragraph as a single sentence by using segment_ids of 0 for all words in the paragraph. Is that correct?
How do I get embeddings for ALBERT? I see that ALBERT also has a tokenization.py file, but I don't see a vocab.txt; I see a file 30k-clean.vocab. Could I use 30k-clean.vocab instead of vocab.txt?
@user2543622, you may refer to the official code here; in your case, you can do something like:
import tensorflow as tf
import tensorflow_hub as hub

albert_module = hub.Module("https://tfhub.dev/google/albert_base/2", trainable=True)
print(albert_module.get_signature_names())
# Output: ['tokens', 'tokenization_info', 'mlm']

tokenization_info = albert_module(signature="tokenization_info", as_dict=True)
with tf.Session() as sess:
    vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                          tokenization_info["do_lower_case"]])

print(vocab_file)
# Output: b'/var/folders/v6/vnz79w0d2dn95fj0mtnqs27m0000gn/T/tfhub_modules/098d91f064a4f53dffc7633d00c3d8e87f3a4716/assets/30k-clean.model'
I guess this vocab_file is a binary sentencepiece model file, so you should use it for tokenization as below, instead of using the 30k-clean.vocab:
# you still need the tokenization.py code to perform full tokenization
return tokenization.FullTokenizer(
    vocab_file=vocab_file, do_lower_case=do_lower_case,
    spm_model_file=FLAGS.spm_model_file)
If you only need the embedding matrix values, take a look at albert_module.variable_map, e.g.:
print(albert_module.variable_map['bert/embeddings/word_embeddings'])
# <tf.Variable 'module/bert/embeddings/word_embeddings:0' shape=(30000, 128) dtype=float32>
Your approach seems right.
Could you please check the tokenizations of sentences 1 and 2 using the tokenizer? This should reveal whether there are additional word pieces in one of the sentences. It can be checked as below:
import tokenization

tokenizer = tokenization.FullTokenizer(vocab_file=<PATH to vocab file>, do_lower_case=True)
tokens = tokenizer.tokenize(example.text_a)
print(tokens)
This gives you the word-piece tokenized list, without the [CLS] and [SEP] tokens.
Generally, word piece tokenization splits words that are not in the vocabulary, which can produce more tokens than there are input words.
You can pass both sentences together, provided that the length of the paragraph after word piece tokenization does not exceed max_seq_len; a rough sketch of passing a paragraph as a single segment is given at the end of this answer.
The vocab file for ALBERT is in the ./data/vocab.txt directory, provided you have got the ALBERT code from here.
In case you have got the model from tf-hub, the file is 2/assets/30k-clean.vocab.
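Regarding passing a whole paragraph as a single segment with segment_ids of 0, as asked above: a rough sketch of building the inputs by hand with the FullTokenizer from above. The max_seq_len value and the padding layout follow the conventions of the original BERT run_classifier.py and are assumptions here, not code from the linked repository:

max_seq_len = 32
text = "New Delhi is the capital of India. The capital of India is Delhi."

tokens = ["[CLS]"] + tokenizer.tokenize(text) + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)
segment_ids = [0] * len(input_ids)      # a single paragraph -> one segment, all zeros

pad_len = max_seq_len - len(input_ids)  # pad everything up to max_seq_len
input_ids += [0] * pad_len
input_mask += [0] * pad_len
segment_ids += [0] * pad_len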

Python: Finding out if certain words in a list are actual English words or close to English words

I am working on a problem where I get a lot of words with their frequency of occurrence listed. Here is a sample of what I get:
drqsQAzaQ:1
OnKxnXecCINJ:1
QoGzQpg:1
cordially:1
Sponsorship:1
zQnpzQou:1
Thriving:1
febrero:1
rzaye:1
VseKEX:1
contributed:1
SNfXQoWV:1
hRwzmPR:1
Happening:1
TzJYAMWAQUIJTkWYBX:1
DYeUIqf:1
formats:1
eiizh:1
wIThY:1
infonewsletter:8
BusinessManager:10
MailScanner:12
As you can see, words like 'cordially' are actual English words, while words like 'infonewsletter' are not actual English words by themselves, yet we can see that they are made of English parts and mean something. However, words like 'OnKxnXecCINJ' do not mean anything (actually they are words from another charset, but I am ignoring them in my exercise and sticking to English), so I can discard them as junk.
What would be the best method in Python to detect and eliminate such junk words from a given dictionary such as the one above?
I tried examining each word using nltk.corpus.words.words(), but it is killing my performance as my data set is very large. Moreover, I am not certain whether this will return True for words like 'infonewsletter'.
Please help.
Thanks,
Mahesh.
If the words are from a completely different script within Unicode, like CJK characters or Greek, Cyrillic, or Thai, you can use unicodedata.category to see whether they are letters to begin with (the category starts with L):
>>> import unicodedata
>>> unicodedata.category('a')
'Ll'
>>> unicodedata.category('E')
'Lu'
>>> unicodedata.category('中')
'Lo'
>>> [unicodedata.category(i).startswith('L') for i in 'aE中,']
[True, True, True, False]
Then you can use unicodedata.name to check that they are Latin letters:
>>> 'LATIN' in unicodedata.name('a')
True
>>> 'LATIN' in unicodedata.name('中')
False
Presumably it is not an English-language word if it has non-Latin letters in it.
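For convenience, the two checks above can be wrapped into a single helper; a small sketch, nothing beyond the standard library:

import unicodedata

def is_latin_word(word):
    # every character must be a letter, and that letter must be Latin
    return all(unicodedata.category(ch).startswith('L')
               and 'LATIN' in unicodedata.name(ch)
               for ch in word)

print(is_latin_word('cordially'))  # True
print(is_latin_word('中文'))        # False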
Otherwise, you could use a letter bigram/trigram classifier to find out whether there is a high probability that these are English words. For example, OnKxnXecCINJ contains Kxn, a trigram that is very unlikely to appear in any single English word, or even in a concatenation of two words.
You can build one yourself from the corpus by splitting words into character trigrams, or you can use any of the existing libraries such as langdetect or langid.
Also, make the corpus a set for fast in operations. Only if the algorithm says there is a high probability that the word is English, and the word still cannot be found in the set, consider that it is like 'infonewsletter', a concatenation of several words: split it recursively into smaller chunks and check that each part is found in the corpus.
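A sketch of both ideas, a crude trigram filter plus a recursive split against the vocabulary set; the use of nltk.corpus.words and the zero-tolerance threshold are illustrative assumptions, not a tuned solution:

import nltk

vocab = set(w.lower() for w in nltk.corpus.words.words())
known_trigrams = {w[i:i + 3] for w in vocab for i in range(len(w) - 2)}

def looks_english(word, max_unknown=0):
    # reject words containing character trigrams never seen in the corpus
    trigrams = [word[i:i + 3] for i in range(len(word) - 2)]
    unknown = sum(1 for t in trigrams if t not in known_trigrams)
    return unknown <= max_unknown

def splits_into_vocab(word):
    # True if the word is in the vocabulary, or is a concatenation of vocabulary words
    if word in vocab:
        return True
    return any(word[:i] in vocab and splits_into_vocab(word[i:])
               for i in range(3, len(word) - 2))

print(looks_english('onkxnxeccinj'))        # expected False for junk like this
print(splits_into_vocab('infonewsletter'))  # True if it decomposes into corpus words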
Thank you. I am trying out this approach. However, I have a question. I have a word 'vdgutumvjaxbpz'. I know this is junk. I wrote some code to get all grams of this word, 4-gram and higher. This was the result:
['vdgu', 'dgut', 'gutu', 'utum', 'tumv', 'umvj', 'mvja', 'vjax', 'jaxb', 'axbp', 'xbpz', 'vdgut', 'dgutu', 'gutum', 'utumv', 'tumvj', 'umvja', 'mvjax', 'vjaxb', 'jaxbp', 'axbpz', 'vdgutu', 'dgutum', 'gutumv', 'utumvj', 'tumvja', 'umvjax', 'mvjaxb', 'vjaxbp', 'jaxbpz', 'vdgutum', 'dgutumv', 'gutumvj', 'utumvja', 'tumvjax', 'umvjaxb', 'mvjaxbp', 'vjaxbpz', 'vdgutumv', 'dgutumvj', 'gutumvja', 'utumvjax', 'tumvjaxb', 'umvjaxbp', 'mvjaxbpz', 'vdgutumvj', 'dgutumvja', 'gutumvjax', 'utumvjaxb', 'tumvjaxbp', 'umvjaxbpz', 'vdgutumvja', 'dgutumvjax', 'gutumvjaxb', 'utumvjaxbp', 'tumvjaxbpz', 'vdgutumvjax', 'dgutumvjaxb', 'gutumvjaxbp', 'utumvjaxbpz', 'vdgutumvjaxb', 'dgutumvjaxbp', 'gutumvjaxbpz', 'vdgutumvjaxbp', 'dgutumvjaxbpz', 'vdgutumvjaxbpz']
Now, I compared each gram result with nltk.corpus.words.words() and found the intersection of the 2 sets.
import nltk

vocab = nltk.corpus.words.words()
vocab = set(w.lower().strip() for w in vocab)

def GetGramsInVocab(listOfGrams, vocab):
    text_vocab = set(w.lower() for w in listOfGrams if w.isalpha())
    common = text_vocab & vocab
    return list(common)
However, the intersection contains 'utum', whereas I expected it to be empty.
Also,
print("utum" in vocab)
returned True.
This does not make sense to me. I peeked into the vocabulary and found 'utum' inside a few words like autumnian and metascutum.
However, 'utum' is not a word by itself and I expected nltk to return False. Is there a more accurate corpus I can check against that would do whole-word comparisons?
Also, I did a simple set operations test:
set1 = {"cutums", "acutum"}
print("utum" in set1)
This returned False as expected.
I guess I am confused as to why the code says 'utum' is present in the nltk words corpus.
Thanks,
Mahesh.

Probability tree for sentences in nltk employing both lookahead and lookback dependencies

Does nltk or any other NLP tool allow constructing probability trees based on input sentences, thus storing the language model of the input text in a dictionary tree? The following example gives the rough idea, but I need the same functionality such that a word Wt is not just probabilistically modelled on past input words (history) Wt-n but also on lookahead words such as Wt+m. The lookback and lookahead word counts should also be 2 or more, i.e. bigrams or more. Are there any other libraries in Python which achieve this?
from collections import defaultdict
import math
import nltk

ngram = defaultdict(lambda: defaultdict(int))
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
for token in ngram:
    total = math.log10(sum(ngram[token].values()))
    ngram[token] = {nxt: math.log10(v) - total for nxt, v in ngram[token].items()}
The solution requires both lookahead and lookback, and a specially subclassed dictionary may help in solving this problem. Pointers to relevant resources which talk about implementing such a system would also help. nltk.models seemed to be doing something similar but is no longer available. Are there any existing design patterns in NLP which implement this idea? Skip-gram based models are similar to this idea too, but I feel this should have been implemented somewhere already.
If I understand your question correctly, you are looking for a way to predict the probability of a word given its surrounding context (not just backward context but also the forward context).
One quick hack for your purpose is to train two different language models, one from right to left and the other from left to right; the probability of a word given its context is then the normalized sum of the forward and backward scores.
Extending your code:
from collections import defaultdict
import nltk
import numpy as np

ngram = defaultdict(lambda: defaultdict(int))
ngram_rev = defaultdict(lambda: defaultdict(int))  # reversed n-grams
corpus = "The cat is cute. He jumps and he is happy."
for sentence in nltk.sent_tokenize(corpus):
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    for token, next_token in zip(tokens, tokens[1:]):
        ngram[token][next_token] += 1
    for token, rev_token in zip(tokens[1:], tokens):
        ngram_rev[token][rev_token] += 1
for token in ngram:
    total = np.log(sum(ngram[token].values()))
    ngram[token] = {nxt: np.log(v) - total
                    for nxt, v in ngram[token].items()}
for token in ngram_rev:
    total_rev = np.log(sum(ngram_rev[token].values()))
    ngram_rev[token] = {prv: np.log(v) - total_rev
                        for prv, v in ngram_rev[token].items()}
Now the context is in both ngram and ngram_rev which respectively hold the forward and backward contexts.
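One simple way to turn this into a single score, as a sketch of the "normalized sum" idea (the fallback floor used for unseen pairs is an arbitrary assumption, not a proper smoothing scheme):

import numpy as np

def combined_logprob(prev_token, word, next_token, floor=np.log(1e-12)):
    fwd = ngram.get(prev_token, {}).get(word, floor)      # log P(word | prev_token)
    bwd = ngram_rev.get(next_token, {}).get(word, floor)  # log P(word | next_token)
    return np.logaddexp(fwd, bwd) - np.log(2)             # average of the two probabilities

print(combined_logprob("the", "cat", "is"))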
You should also account for smoothing: if a given phrase is not seen in your training corpus, you will just get zero probabilities. To avoid that, there are many smoothing techniques, the simplest of which is add-one (Laplace) smoothing.
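For illustration, add-one smoothing applied to raw bigram counts could look like the sketch below; here raw_counts is assumed to be a copy of the count dictionaries kept before the log-normalization step above, and vocab_size is the number of distinct words:

def smoothed_prob(token, next_token, raw_counts, vocab_size):
    counts = raw_counts.get(token, {})
    total = sum(counts.values())
    # every possible continuation gets one pseudo-count
    return (counts.get(next_token, 0) + 1) / (total + vocab_size)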
The normal ngram algorithm traditionally works with prior context only, and for good reason: A bigram tagger makes decisions by considering the tags of the last two words, plus the current word. So unless you tag in two passes, the tag of the next word is not yet known. But you are interested in word ngrams, not tag ngrams, so nothing keeps you from training an ngram tagger where the ngram consists of words from both sides. And you can indeed do it easily with the NLTK.
The NLTK's ngram taggers all make tag ngrams, from the left; but you can easily derive your own tagger from their abstract base class, ContextTagger:
import nltk
from nltk.tag import ContextTagger

class TwoSidedTagger(ContextTagger):
    left = 2
    right = 1

    def context(self, tokens, index, history):
        left = self.left
        right = self.right
        tokens = tuple(t.lower() for t in tokens)
        if index < left:
            tokens = ("<start>",) * left + tokens
            index += left
        if index + right >= len(tokens):
            tokens = tokens + ("<end>",) * right
        return tokens[index - left:index + right + 1]
This defines a tetragram tagger (2+1+1) where the current word is third in the ngram, not last as usual. You can then initialize and train a tagger just like the regular ngram taggers (see chapter 5 of the NLTK book, especially sections 5.4ff). Let's see first how you'd build a part-of-speech tagger, using a portion of the Brown corpus as training data:
data = list(nltk.corpus.brown.tagged_sents(categories="news"))
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('NN'))
twosidedtagger._train(train_sents)
Like all ngram taggers in the NLTK, this one will delegate to the backoff tagger if it is asked to tag an ngram it did not see during training.
For simplicity I used a simple "default tagger" as the backoff tagger, but you'll probably need to use something more powerful (see the NLTK chapter again); one common option is sketched below.
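For instance, backing off through a unigram tagger before falling back to the default tag, a pattern from the NLTK book shown here only as an illustration:

# back off to a unigram tagger first, then to the constant 'NN' tag
backoff_chain = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
twosidedtagger = TwoSidedTagger({}, backoff=backoff_chain)
twosidedtagger._train(train_sents)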
You can then use your tagger to tag new text, or evaluate it with an already tagged test set:
>>> print(twosidedtagger.tag("There were dogs everywhere .".split()))
>>> print(twosidedtagger.evaluate(test_sents))
Predicting words:
The above tagger assigns a POS tag by considering nearby words; but your goal is to predict the word itself, so you need different training data and a different default tagger. The NLTK API expects training data in the form (word, LABEL), where LABEL is the value you want to generate. In your case, LABEL is just the current word itself; so make your training data as follows:
data = [list(zip(s, s)) for s in nltk.corpus.brown.sents(categories="news")]
train_sents = data[400:]
test_sents = data[:400]
twosidedtagger = TwoSidedTagger({}, backoff=nltk.DefaultTagger('the')) # most common word
twosidedtagger._train(train_sents)
It makes no sense for the target word to appear in the "context" ngram, so you should also modify the method context() so that the returned ngram does not include it:
def context(self, tokens, index, history):
    ...
    return tokens[index - left:index] + tokens[index + 1:index + right + 1]
This tagger uses trigrams consisting of two words from the left and one from the right of the current word.
With these modifications, you'll build a tagger that outputs the most likely word at any position. Try it and see how you like it.
Prediction:
My expectation is that you'll need a humongous amount of training data before you can get decent performance. The problem is that ngram taggers can only suggest a tag for contexts they saw during training.
To build a tagger that generalizes, consider using the NLTK to train a "sequential classifier". You can use whatever features you want, including the words before and after; of course, how well it works is your problem. The NLTK classifier API is similar to that of the ContextTagger, but the context function (a.k.a. feature function) returns a dictionary, not a tuple. Again, see the NLTK book and the source code; a minimal sketch follows below.
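A minimal sketch of that classifier route with an illustrative feature function; the feature names and the small training slice are assumptions to keep the demo quick, not a recommended setup:

import nltk
from nltk.tag import ClassifierBasedTagger

def two_sided_features(tokens, index, history):
    # two words of left context and one word of right context
    return {
        "left2":  tokens[index - 2].lower() if index >= 2 else "<start>",
        "left1":  tokens[index - 1].lower() if index >= 1 else "<start>",
        "right1": tokens[index + 1].lower() if index + 1 < len(tokens) else "<end>",
    }

data = [list(zip(s, s)) for s in nltk.corpus.brown.sents(categories="news")]
classifier_tagger = ClassifierBasedTagger(feature_detector=two_sided_features,
                                          train=data[400:1000])
print(classifier_tagger.tag("There were dogs everywhere .".split()))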
