I would like to keep the tokenizer that spaCy normally uses, but add a condition.
spaCy usually separates a dot (".") from a word and makes it its own token. I want to keep that, except for the abbreviation "et al.": in that case I would like to get the tokens ['et', 'al.'], without treating the dot as a separate token, just in this case.
I have been reviewing the documentation and it seems the solution is related to the script below; however, I do not know where to place this condition.
import spacy
from spacy.lang.char_classes import ALPHA_LOWER, ALPHA_UPPER, PUNCT, CONCAT_QUOTES
from spacy.lang.char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_ICONS
from spacy.lang.char_classes import CURRENCY, UNITS
from spacy.util import compile_suffix_regex
# Default tokenizer
nlp = spacy.load("pt_core_news_sm")
doc = nlp("Esse é um exemplo. Ramon et al., kcal.")
print([t.text for t in doc]) # ['Esse', 'é', 'um', 'exemplo', '.', 'Ramon', 'et', 'al', '.', ',', 'kcal', '.']
# Modify tokenizer suffix patterns
suffixes = (
    LIST_PUNCT
    + LIST_ELLIPSES
    + LIST_QUOTES
    + LIST_ICONS
    + ["'s", "'S", "’s", "’S", "—", "–"]
    + [
        r"(?<=[0-9])\+",
        r"(?<=°[FfCcKk])\.",
        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
        r"(?<=[0-9])(?:{u})".format(u=UNITS),
        r"(?<=[0-9{al}{e}{p}(?:{q})])\.".format(
            al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
        ),
        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
    ]
)
suffix_regex = compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
doc = nlp("Esse é um exemplo. Ramon et al., kcal.")
print([t.text for t in doc]) # Expected -> ['Esse', 'é', 'um', 'exemplo', '.', 'Ramon', 'et', 'al.', ',', 'kcal', '.']
For this case, I think the easiest thing to do is to add a special case to the tokenizer. The benefit is that you don't have to recreate and recompile all of those tokenizer regexes; you just add this one exception as follows:
import spacy
from spacy.symbols import ORTH
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer.add_special_case("al.", [{ORTH: "al."}])
# Check new tokenization
print([w.text for w in nlp("et al.")]) # ['et', 'al.']
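The same special case carries over to the Portuguese pipeline from the question. A quick check (assuming pt_core_news_sm is installed; the expected output is what the rules should produce, not something verified here):
import spacy
from spacy.symbols import ORTH

nlp_pt = spacy.load("pt_core_news_sm")
nlp_pt.tokenizer.add_special_case("al.", [{ORTH: "al."}])
# "al.," should be split into 'al.' and ',' because the suffix ',' is stripped
# first and the remainder matches the special case; "kcal." is unaffected
# because special cases only match the whole remaining substring.
print([t.text for t in nlp_pt("Esse é um exemplo. Ramon et al., kcal.")])
# expected: ['Esse', 'é', 'um', 'exemplo', '.', 'Ramon', 'et', 'al.', ',', 'kcal', '.']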
It would be good to review how the tokenizer works to understand what this does. The tokenizer handles special cases first, so whenever it sees this substring it will tokenize it that way before applying any of the other rules. That means this solution can produce false positives: if a token other than 'et' precedes al., the period is still merged even when you don't want it to be. For a more precise solution, you could write a small pipeline component that merges the al and . tokens after tokenization; a good model for this is spaCy's merge_noun_chunks (source).
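Here is a minimal sketch of such a component, assuming spaCy v3's @Language.component API; the component name merge_et_al is just a placeholder I chose:
import spacy
from spacy.language import Language

@Language.component("merge_et_al")
def merge_et_al(doc):
    spans = []
    for i, token in enumerate(doc[:-1]):
        # only merge when the pattern is exactly: 'et' 'al' '.'
        if (token.text == "al" and doc[i + 1].text == "."
                and i > 0 and doc[i - 1].text == "et"):
            spans.append(doc[i:i + 2])  # span covering 'al' and '.'
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

nlp = spacy.load("pt_core_news_sm")
nlp.add_pipe("merge_et_al", first=True)
print([t.text for t in nlp("Esse é um exemplo. Ramon et al., kcal.")])
Because the merge only fires when 'et' immediately precedes 'al', it avoids the false-positive issue of the plain special case.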
Related
How do I get spaCy to treat words such as "cant" and "wont" as stop words?
For example, even with tokenisation it will identify "can't" as a stop word, but not "cant".
When it sees "cant", it removes "ca" (as a stop word) but leaves "nt". Is this by design? I guess "nt" is not really a word.
Here is a sample code:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
nlp = spacy.load("en_core_web_sm")
text = "cant can't cannot"
doc = nlp(text)
for word in doc:
    print(word, ":", word.is_stop)
ca : True
nt : False
ca : True
n't : True
can : True
not : True
The tokenizer splits "cant" into "ca" and "nt". Adding "cant" to the list won't have any effect because no token will ever match it. Instead, "nt" should be added, as in the example below (3rd line of code).
It is also important to update the stop words before loading the model, otherwise it won't pick up the changes.
Example:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
STOP_WORDS.add("nt")
nlp = spacy.load("en_core_web_sm")
text = "cant can't cannot"
doc = nlp(text)
for word in doc:
    print(word, ":", word.is_stop)
ca : True
nt : True
ca : True
n't : True
can : True
not : True
As stated in spaCy's documentation, the tokenizer cannot add or remove information from the text, so you can always reconstruct the original input (using the whitespace information stored on the Tokens). This also means that if the text contains spelling errors, they are kept.
So there is no error in the tokenization process: spaCy splits constructs such as can't or don't into two separate tokens, e.g. do and n't.
cant and wont are spelling errors here (well, they are actual English words, but spaCy "recognizes" them as the auxiliaries and splits them just as it would split can't or won't). We could say the split is correct and follows the same rule as for the correctly spelled forms; the only remaining problem is getting wo and nt recognized as stop words. You can see the list of stop words used by spaCy here; ca, for example, is present, which is why it is correctly recognized as a stop word (n't is added at the end, among the contractions).
If the split is ok for your use case, you can add wo and nt manually to the list of stopwords.
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
my_stop_words = ["nt", "wo"]
STOP_WORDS.update(my_stop_words)
nlp = spacy.load("en_core_web_sm")
# analyze docs
If, for some reason, you need to work with the stop words in your text and you'd like to get wont and cant rather than wo, nt and ca, nt, you could concatenate consecutive stop words by checking whether the previous token's trailing whitespace is empty (meaning the tokens were attached in the original text):
stop_words_in_text = []
doc = nlp("Today I cant go to work. We wont come to your party.")
for token in doc:
    i = token.i
    if token.is_stop:
        if i > 0 and doc[i-1].whitespace_ == "" and doc[i-1].is_stop:
            stop_words_in_text[-1] += token.text
        else:
            stop_words_in_text.append(token.text)
print(stop_words_in_text)
['I', 'cant', 'go', 'to', 'We', 'wont', 'to', 'your']
Hopefully this helps. You can also implement custom spaCy components, and check here if you need to add special tokenization cases.
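For example, here is a small sketch of that special-case route (my own variation, assuming en_core_web_sm): override the default exceptions so "cant" and "wont" each stay a single token, then flag the lexemes as stop words.
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
for word in ("cant", "wont"):
    # override the default exception that splits these into ca/nt and wo/nt
    nlp.tokenizer.add_special_case(word, [{ORTH: word}])
    # flag the whole form as a stop word on the lexeme
    nlp.vocab[word].is_stop = True

print([(t.text, t.is_stop) for t in nlp("Today I cant go to work. We wont come.")])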
I seem to be able to add tokens without issue, but if I try to add a suffix (i.e. one that doesn't have the initial character 'Ġ' at the front), the tokenizer doesn't put spaces in the right spots. Here's some very simplified test code.
from copy import deepcopy
from transformers import BartTokenizer
# Get the different tokenizers
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
tokenizer_ext = deepcopy(tokenizer)
# Note that putting Ġ after the token causes the token not to be used
num_added = tokenizer_ext.add_tokens(['-of', '_01', 'WXYZ'])
# Original sentence
print('Original')
serial = ':ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )'
print(serial)
print()
# Baseline tokenizer
print('Bart default tokenizer')
tokens = tokenizer.tokenize(serial)
out_str = tokenizer.convert_tokens_to_string(tokens)
print(tokens)
print(out_str)
print()
# extended tokenizer
print('Extended tokenizer')
tokens = tokenizer_ext.tokenize(serial)
out_str = tokenizer_ext.convert_tokens_to_string(tokens)
print(tokens)
print(out_str)
This gives...
Original
:ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )
Bart default tokenizer
[':', 'AR', 'G', '0', '-', 'of', 'Ġ(', 'Ġsense', '_', '01', 'Ġ:', 'AR', 'G', '1', 'Ġ(', 'Ġurgency', 'W', 'XY', 'Z', 'Ġ)']
:ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )
Extended tokenizer
[':', 'AR', 'G', '0', '-of', '(', 'Ġsense', '_01', ':', 'AR', 'G', '1', 'Ġ(', 'Ġurgency', 'WXYZ', ')']
:ARG0-of( sense_01:ARG1 ( urgencyWXYZ)
Notice that the default Bart tokenizer reproduces the original sentence, but the extended tokenizer doesn't put spaces after the new suffix tokens, i.e. it selects '(' instead of 'Ġ('. Any idea why this is, and what's the right way to add suffix tokens?
The short answer is that there is "behavior" (a bug?) in the handling of added tokens for Bart (and RoBERTa, GPT-2, etc.) that explicitly strips spaces from the tokens adjacent (both left and right) to the added token's location. I don't see a simple workaround for this.
Added tokens are handled differently in the transformers tokenizer code. The text is first split using a Trie to identify any tokens from the added-tokens list (see tokenization_utils.py::tokenize()). After finding any added tokens in the text, the remainder is then tokenized with the existing vocab/BPE encoding scheme (see tokenization_gpt2.py::_tokenize()).
The added tokens are added to the self.unique_no_split_tokens list, which prevents them from being broken down further into smaller chunks. The code that handles this (see tokenization_utils.py::tokenize()) explicitly strips the spaces from the tokens to the left and right.
You could manually remove them from the "no split" list but then they may be broken down into smaller sub-components.
Note that for "special tokens", if you wrap the token in the AddedToken class you can set the lstrip and rstrip behaviors, but this isn't available for non-special tokens.
See https://github.com/huggingface/transformers/blob/v4.12.5-release/src/transformers/tokenization_utils.py#L517 for the else statement where the spaces are stripped.
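As an illustration of that AddedToken route, here is a hedged sketch (assuming a transformers v4.x slow tokenizer; note that this registers the strings as special tokens, which may not be acceptable for your use case):
from transformers import AddedToken, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
# lstrip/rstrip control whether spaces next to the token get stripped;
# turning both off should keep the adjacent spaces intact
new_tokens = [AddedToken(t, lstrip=False, rstrip=False) for t in ["-of", "_01", "WXYZ"]]
tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})

tokens = tokenizer.tokenize(":ARG0-of ( sense_01 :ARG1 ( urgencyWXYZ )")
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))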
I would like to use spacy for tokenizing Wikipedia scrapes. Ideally it would work like this:
text = 'procedure that arbitrates competing models or hypotheses.[2][3] Researchers also use experimentation to test existing theories or new hypotheses to support or disprove them.[3][4]'
# run spacy
spacy_en = spacy.load("en")
doc = spacy_en(text, disable=['tagger', 'ner'])
tokens = [tok.text.lower() for tok in doc]
# desired output
# tokens = [..., 'models', 'or', 'hypotheses', '.', '[2][3]', 'Researchers', ...
# actual output
# tokens = [..., 'models', 'or', 'hypotheses.[2][3', ']', 'Researchers', ...]
The problem is that 'hypotheses.[2][3]' is glued together into one token.
How can I prevent spaCy from attaching this '[2][3]' to the previous token?
As long as it is split from the word hypotheses and from the period at the end of the sentence, I don't care how it is handled. But individual words and sentence punctuation should stay separate from this syntactical noise.
So for example, any of the following would be a desirable output:
'hypotheses', '.', '[2][', '3]'
'hypotheses', '.', '[2', '][3]'
I think you could try playing around with the infix patterns:
import re
import spacy
from spacy.tokenizer import Tokenizer
infix_re = re.compile(r'''[.]''')
def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"hello-world! I am hypothesis.[2][3]")
print([t.text for t in doc])
More on this https://spacy.io/usage/linguistic-features#native-tokenizers
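If you'd rather not replace the whole tokenizer (the snippet above drops the default prefix/suffix handling), an alternative sketch is to extend the default infix patterns so a period sitting between a letter and an opening bracket gets split off. This assumes a loadable pipeline such as en_core_web_sm, and the exact sub-splitting of the bracket part may vary:
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
# add a pattern that splits a '.' between a letter and a '['
infixes = list(nlp.Defaults.infixes) + [r"(?<=[a-zA-Z])\.(?=\[)"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("procedure that arbitrates competing models or hypotheses.[2][3]")
print([t.text for t in doc])
# 'hypotheses' and '.' should now come out separate from the bracketed citations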
Here is my code. I have a sentence and I want to tokenize and stem it before passing it to TfidfVectorizer, to finally get a tf-idf representation of the sentence:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.stem.snowball import SnowballStemmer
stemmer_ita = SnowballStemmer("italian")
def tokenizer_stemmer_ita(text):
    return [stemmer_ita.stem(word) for word in text.split()]

def sentence_tokenizer_stemmer(text):
    return " ".join([stemmer_ita.stem(word) for word in text.split()])
X_train = ['il libro è sul tavolo']
X_train = [sentence_tokenizer_stemmer(text) for text in X_train]
tfidf = TfidfVectorizer(preprocessor=None, tokenizer=None, use_idf=True, stop_words=None, ngram_range=(1,2))
X_train = tfidf.fit_transform(X_train)
# let's see the features
print (tfidf.get_feature_names())
I get as output:
['il', 'il libr', 'libr', 'libr sul', 'sul', 'sul tavol', 'tavol']
if I change the parameter
tokenizer=None
to:
tokenizer=tokenizer_stemmer_ita
and I comment this line:
X_train = [sentence_tokenizer_stemmer(text) for text in X_train]
I expect to get the same result but the result is different:
['il', 'il libr', 'libr', 'libr è', 'sul', 'sul tavol', 'tavol', 'è', 'è sul']
Why? Am I implementing the external stemmer correctly? It seems, at least, that the stop words ("è") are removed in the first run, even though stop_words=None.
[edit]
As suggested by Vivek, the problem seems to be the default token pattern, which is applied whenever tokenizer=None. So if I add these two lines at the beginning of tokenizer_stemmer_ita:
token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')
text = " ".join( token_pattern.findall(text) )
I should get the correct behaviour, and in fact I get it for the simple example above. But for a different example:
X_train = ['0.05%.\n\nVedete?']
I don't; the two outputs are different:
['05', '05 ved', 'ved']
and
['05', '05 vedete', 'vedete']
Why? In this case the question mark seems to be the problem; without it the outputs are identical.
[edit2]
It seems I have to stem first and then apply the regex; in that case the two outputs are identical.
That's because of the default token_pattern used by TfidfVectorizer:
token_pattern : string
Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
So the single-character token è is not selected, because the pattern requires at least two word characters.
import re
token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')
print(token_pattern.findall('il libro è sul tavolo'))
# Output
# ['il', 'libro', 'sul', 'tavolo']
This default token_pattern is used when tokenizer is None, as you are experiencing.
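Following the asker's edit2, here is a sketch of a custom tokenizer that stems first and then applies that same default pattern, which should make the two pipelines agree (get_feature_names matches the question's scikit-learn version; newer releases use get_feature_names_out):
import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer_ita = SnowballStemmer("italian")
token_pattern = re.compile(r"(?u)\b\w\w+\b")  # TfidfVectorizer's default

def tokenizer_stemmer_ita(text):
    # stem the whitespace-split words first, then extract tokens with the
    # default pattern, mirroring the preprocess-then-vectorize pipeline
    stemmed = " ".join(stemmer_ita.stem(word) for word in text.split())
    return token_pattern.findall(stemmed)

tfidf = TfidfVectorizer(tokenizer=tokenizer_stemmer_ita, ngram_range=(1, 2))
X_train = tfidf.fit_transform(['il libro è sul tavolo', '0.05%.\n\nVedete?'])
print(tfidf.get_feature_names())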
There's a ton of material available about removing punctuation, but I can't seem to find anything about keeping it.
If I do:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P', '.']
the last "." is pushed into its own token. However, if instead there is another word at the end, the last "." is preserved:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P. Another Co"
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P.', 'Another', 'Co']
I'd like this to always behave as in the second case. For now, I'm hackishly doing:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str + " |||")
since I feel pretty confident about throwing away "|||" at any given time, but I don't know what other punctuation I might want to preserve that could get dropped. Is there a better way to accomplish this?
It is a quirk of spelling that when a sentence ends with an abbreviated word, we write only one period, not two. NLTK's tokenizer doesn't "remove" it; it splits it off, because sentence structure ("a sentence must end with a period or other suitable punctuation") is more important to NLP tools than consistent representation of abbreviations. The tokenizer is smart enough to recognize most abbreviations, which is why it doesn't separate the period from L.P. mid-sentence.
Your solution with ||| results in inconsistent sentence structure, since you now have no sentence-final punctuation. A better solution would be to add the missing period only after abbreviations. Here's one way to do this, ugly but as reliable as the tokenizer's own abbreviation recognizer:
import nltk

toks = nltk.word_tokenize(test_str + " .")
if len(toks) > 1 and len(toks[-2]) > 1 and toks[-2].endswith("."):
    pass  # the text ended in an abbreviation: keep the added period
else:
    toks = toks[:-1]  # otherwise drop the period we appended
PS. The solution you have accepted will completely change the tokenization, leaving all punctuation attached to the adjacent word (along with other undesirable effects like introducing empty tokens). This is most likely not what you want.
Could you use re?
import re
test_str = "Some Co Inc. Other Co L.P."
print(re.split(r'\s', test_str))
This will split the input string based on spacing, retaining your punctuation.
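As the PS above notes, though, this keeps punctuation glued to the adjacent word and produces empty tokens whenever the input contains consecutive whitespace. A quick illustration (using a double space, which is not in the original test_str):
import re

print(re.split(r'\s', "Some Co Inc.  Other Co L.P."))
# ['Some', 'Co', 'Inc.', '', 'Other', 'Co', 'L.P.']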