I have a question about whether there is a way to keep single white space as an independent token in spaCy tokenization.
For example if I ran:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is easy.")
toks = [w.text for w in doc]
toks
The result is
['This', 'is', 'easy', '.']
Instead, I would like to have something like
['This', ' ', 'is', ' ', 'easy', '.']
Is there are a simple way to do that?
spaCy exposes the token's whitespace as the whitespace_ attribute. So if you only need a list of strings, you could do:
token_texts = []
for token in doc:
token_texts.append(token.text)
if token.whitespace_: # filter out empty strings
token_texts.append(token.whitespace_)
If you want to create an actual Doc object out of those tokens, that's possible, too. Doc objects can be constructed with a words keyword argument (a list of strings to add as tokens). However, I'm not sure how useful that would be.
If you want the whitespaces in the doc object:
import spacy
from spacy.tokens import Doc
class WhitespaceTokenizer(object):
def __init__(self, vocab):
self.vocab = vocab
def __call__(self, text):
words = text.split(' ')
res = [' '] * (2 * len(words) - 1)
res[::2] = words
return Doc(self.vocab, words=res)
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("This is easy.")
print([t.text for t in doc])
Related
I am new to using SpaCy. Can we tell spacy API to ignore symbols while giving tokens?
Example:
For the sentence Hi, Welcome to StackOverflow. The tokens are
Hi
,
Welcome
to
StackOverflow
.
I want spacy to give tokens only for words which have whitespace. For the above example, the tokens should be
Hi,
Welcome
to
StackOverflow.
Try:
import spacy
nlp = spacy.load("en_core_web_sm")
txt = "Hi, Welcome to StackOverflow."
doc = nlp(txt)
tokens = [tok.text for tok in doc if not tok.is_punct]
print(tokens)
['Hi', 'Welcome', 'to', 'StackOverflow']
You may wish to define your own list of punctuation:
punctuation = [".",",","!"]
tokens = [tok.text for tok in doc if tok.text not in punctuation]
print(tokens)
['Hi', 'Welcome', 'to', 'StackOverflow']
or use a readily available one from string package:
from string import punctuation
print(punctuation)
doc_punct = nlp(" ".join([punctuation]))
tokens = [tok.text for tok in doc if tok.text not in punctuation]
print(tokens)
!"#$%&'()*+,-./:;<=>?#[\]^_`{|}~
['Hi', 'Welcome', 'to', 'StackOverflow']
Is there a way to force spacy not to parse punctuation as separate tokens ???
nlp = spacy.load('en')
doc = nlp(u'the $O is in $R')
[ w for w in doc ]
: [the, $, O, is, in, $, R]
I want :
: [the, $O, is, in, $R]
Customize the prefix_search function for the spaCy's Tokenizer class. Refer documentation. Something like:
import spacy
import re
from spacy.tokenizer import Tokenizer
# use any currency regex match as per your requirement
prefix_re = re.compile('''^\$[a-zA-Z0-9]''')
def custom_tokenizer(nlp):
return Tokenizer(nlp.vocab, prefix_search=prefix_re.search)
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u'the $O is in $R')
print([t.text for t in doc])
# ['the', '$O', 'is', 'in', '$R']
Yes, there is. For example,
import spacy
import regex as re
from spacy.tokenizer import Tokenizer
prefix_re = re.compile(r'''^[\[\+\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[\(\-\)\#\.\:\$]''') #you need to change the infix tokenization rules
simple_url_re = re.compile(r'''^https?://''')
def custom_tokenizer(nlp):
return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
suffix_search=suffix_re.search,
infix_finditer=infix_re.finditer,
token_match=simple_url_re.match)
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u'the $O is in $R')
print [w for w in doc] #prints
[the, $O, is, in, $R]
You just need to add '$' character to the infix regex (with an escape character '\' obviously).
Aside: Have included prefix and suffix to showcase the flexibility of spaCy tokenizer. In your case just the infix regex will suffice.
I would like to use spacy for tokenizing Wikipedia scrapes. Ideally it would work like this:
text = 'procedure that arbitrates competing models or hypotheses.[2][3] Researchers also use experimentation to test existing theories or new hypotheses to support or disprove them.[3][4]'
# run spacy
spacy_en = spacy.load("en")
doc = spacy_en(text, disable=['tagger', 'ner'])
tokens = [tok.text.lower() for tok in doc]
# desired output
# tokens = [..., 'models', 'or', 'hypotheses', '.', '[2][3]', 'Researchers', ...
# actual output
# tokens = [..., 'models', 'or', 'hypotheses.[2][3', ']', 'Researchers', ...]
The problem is that the 'hypotheses.[2][3]' is glued together into one token.
How can I prevent spacy from connecting this '[2][3]' to the previous token?
As long as it is split from the word hypotheses and the point at the end of the sentence, I don't care how it is handled. But individual words and grammar should stay apart from syntactical noise.
So for example, any of the following would be a desirable output:
'hypotheses', '.', '[2][', '3]'
'hypotheses', '.', '[2', '][3]'
I think you could try playing around with infix:
import re
import spacy
from spacy.tokenizer import Tokenizer
infix_re = re.compile(r'''[.]''')
def custom_tokenizer(nlp):
return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp(u"hello-world! I am hypothesis.[2][3]")
print([t.text for t in doc])
More on this https://spacy.io/usage/linguistic-features#native-tokenizers
In a sentence containing hashtags, such as a tweet, spacy's tokenizer splits hashtags into two tokens:
import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]
output:
[This, is, a, #, sentence, .]
I'd like to have hashtags tokenized as follows, is that possible?
[This, is, a, #sentence, .]
I also tried several ways to prevent spaCy from splitting hashtags or words with hyphens like "cutting-edge". My experience is that merging tokens afterwards can be problematic, because the pos tagger and dependency parsers already used the wrong tokens for their decisions. Touching the infix, prefix, suffix regexps is kind of error prone / complex, because you don't want to produce side effects by your changes.
The simplest way is indeed, as pointed out by before, to modify the token_match function of the tokenizer. This is a re.match identifying regular expressions that will not be split. Instead of importing the speficic URL pattern I'd rather extend whatever spaCy's default is.
from spacy.tokenizer import _get_regex_pattern
nlp = spacy.load('en')
# get default pattern for tokens that don't get split
re_token_match = _get_regex_pattern(nlp.Defaults.token_match)
# add your patterns (here: hashtags and in-word hyphens)
re_token_match = f"({re_token_match}|#\w+|\w+-\w+)"
# overwrite token_match function of the tokenizer
nlp.tokenizer.token_match = re.compile(re_token_match).match
text = "#Pete: choose low-carb #food #eatsmart ;-) 😋👍"
doc = nlp(text)
This yields:
['#Pete', ':', 'choose', 'low-carb', '#food', '#eatsmart', ';-)', '😋', '👍']
This is more of a add-on to the great answer by #DhruvPathak AND a shameless copy from the below linked github thread (and the even better answer by #csvance). spaCy features (since V2.0) the add_pipe method. Meaning you can define #DhruvPathak great answer in a function and add the step (conveniently) into your nlp processing pipeline, as below.
Citations starts here:
def hashtag_pipe(doc):
merged_hashtag = False
while True:
for token_index,token in enumerate(doc):
if token.text == '#':
if token.head is not None:
start_index = token.idx
end_index = start_index + len(token.head.text) + 1
if doc.merge(start_index, end_index) is not None:
merged_hashtag = True
break
if not merged_hashtag:
break
merged_hashtag = False
return doc
nlp = spacy.load('en')
nlp.add_pipe(hashtag_pipe)
doc = nlp("twitter #hashtag")
assert len(doc) == 2
assert doc[0].text == 'twitter'
assert doc[1].text == '#hashtag'
Citation ends here; Check out how to add hashtags to the part of speech tagger #503 for the full thread.
PS It's clear when reading the code, but for the copy&pasters, don't disable the parser :)
You can do some pre and post string manipulations,which shall make you bypass '#' based tokenization, and is easy to implement. e.g
> >>> import re
> >>> import spacy
> >>> nlp = spacy.load('en')
> >>> sentence = u'This is my twitter update #MyTopic'
> >>> parsed = nlp(sentence)
> >>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'#', u'MyTopic']
> >>> new_sentence = re.sub(r'#(\w+)',r'ZZZPLACEHOLDERZZZ\1',sentence)
> >>> new_sentence u'This is my twitter update ZZZPLACEHOLDERZZZMyTopic'
> >>> parsed = nlp(new_sentence)
> >>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'ZZZPLACEHOLDERZZZMyTopic']
> >>> [x.replace(u'ZZZPLACEHOLDERZZZ','#') for x in [token.text for token in parsed]]
[u'This', u'is', u'my', u'twitter', u'update', u'#MyTopic']
You can try setting custom seperators in spacy's tokenizer.
I am not aware of methods to do that.
UPDATE : You can use a regex to find span of token you would want to stay as single token, and retokenize using span.merge method as mentioned here : https://spacy.io/docs/api/span#merge
Merge example:
>>> import spacy
>>> import re
>>> nlp = spacy.load('en')
>>> my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
>>> parsed = nlp(my_str)
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#', u'NOUN'), (u'MyHashOne', u'NOUN'), (u'#', u'NOUN'), (u'MyHashTwo', u'PROPN')]
>>> indexes = [m.span() for m in re.finditer('#\w+',my_str,flags=re.IGNORECASE)]
>>> indexes
[(15, 25), (26, 36)]
>>> for start,end in indexes:
... parsed.merge(start_idx=start,end_idx=end)
...
#MyHashOne
#MyHashTwo
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#MyHashOne', u'NOUN'), (u'#MyHashTwo', u'PROPN')]
>>>
I found this on github, which uses spaCy's Matcher:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])
doc = nlp('This is a #sentence. Here is another #hashtag. #The #End.')
matches = matcher(doc)
hashtags = []
for match_id, start, end in matches:
hashtags.append(doc[start:end])
for span in hashtags:
span.merge()
print([t.text for t in doc])
outputs:
['This', 'is', 'a', '#sentence', '.', 'Here', 'is', 'another', '#hashtag', '.', '#The', '#End', '.']
A list of hashtags is also available in the hashtags list:
print(hashtags)
output:
[#sentence, #hashtag, #The, #End]
I spent quite a bit of time on this and found I share what I came up with:
Subclassing the Tokenizer and adding the regex for hashtags to the default URL_PATTERN was the easiest solution for me, additionally adding a custom extension to match on hashtags to identify them:
import re
import spacy
from spacy.language import Language
from spacy.tokenizer import Tokenizer
from spacy.tokens import Token
nlp = spacy.load('en_core_web_sm')
def create_tokenizer(nlp):
# contains the regex to match all sorts of urls:
from spacy.lang.tokenizer_exceptions import URL_PATTERN
# spacy defaults: when the standard behaviour is required, they
# need to be included when subclassing the tokenizer
prefix_re = spacy.util.compile_prefix_regex(Language.Defaults.prefixes)
infix_re = spacy.util.compile_infix_regex(Language.Defaults.infixes)
suffix_re = spacy.util.compile_suffix_regex(Language.Defaults.suffixes)
# extending the default url regex with regex for hashtags with "or" = |
hashtag_pattern = r'''|^(#[\w_-]+)$'''
url_and_hashtag = URL_PATTERN + hashtag_pattern
url_and_hashtag_re = re.compile(url_and_hashtag)
# set a custom extension to match if token is a hashtag
hashtag_getter = lambda token: token.text.startswith('#')
Token.set_extension('is_hashtag', getter=hashtag_getter)
return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
suffix_search=suffix_re.search,
infix_finditer=infix_re.finditer,
token_match=url_and_hashtag_re.match
)
nlp.tokenizer = create_tokenizer(nlp)
doc = nlp("#spreadhappiness #smilemore so_great#good.com https://www.somedomain.com/foo")
for token in doc:
print(token.text)
if token._.is_hashtag:
print("-> matches hashtag")
# returns: "#spreadhappiness -> matches hashtag #smilemore -> matches hashtag so_great#good.com https://www.somedomain.com/foo"
There are so many guides on how to tokenize a sentence, but i didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")
result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
Is there any function than reverts the tokenized sentence to the original state. The function tokenize.untokenize() for some reason doesn't work.
Edit:
I know that I can do for example this and this probably solves the problem but I am curious is there an integrated function for this:
result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')
You can use "treebank detokenizer" - TreebankWordDetokenizer:
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'The quick brown'
There is also MosesDetokenizer which was in nltk but got removed because of the licensing issues, but it is available as a Sacremoses standalone package.
To reverse word_tokenize from nltk, i suggest looking in http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and do some reverse engineering.
Short of doing crazy hacks on nltk, you can try this:
>>> import nltk
>>> import string
>>> nltk.word_tokenize("I've found a medicine for my disease.")
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
>>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
>>> "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
"I've found a medicine for my disease."
use token_utils.untokenize from here
import re
def untokenize(words):
"""
Untokenizing a text undoes the tokenizing operation, restoring
punctuation and spaces to the places that people expect them to be.
Ideally, `untokenize(tokenize(text))` should be identical to `text`,
except for line breaks.
"""
text = ' '.join(words)
step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
"can not", "cannot")
step6 = step5.replace(" ` ", " '")
return step6.strip()
tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my','disease', '.']
untokenize(tokenized)
"I've found a medicine for my disease."
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'The quick brown'
I propose to keep offsets in tokenization: (token, offset).
I think, this information is useful for processing over the original sentence.
import re
from nltk.tokenize import word_tokenize
def offset_tokenize(text):
tail = text
accum = 0
tokens = self.tokenize(text)
info_tokens = []
for tok in tokens:
scaped_tok = re.escape(tok)
m = re.search(scaped_tok, tail)
start, end = m.span()
# global offsets
gs = accum + start
ge = accum + end
accum += end
# keep searching in the rest
tail = tail[end:]
info_tokens.append((tok, (gs, ge)))
return info_token
sent = '''I've found a medicine for my disease.
This is line:3.'''
toks_offsets = offset_tokenize(sent)
for t in toks_offsets:
(tok, offset) = t
print (tok == sent[offset[0]:offset[1]]), tok, sent[offset[0]:offset[1]]
Gives:
True I I
True 've 've
True found found
True a a
True medicine medicine
True for for
True my my
True disease disease
True . .
True This This
True is is
True line:3 line:3
True . .
For me, it worked when I installed python nltk 3.2.5,
pip install -U nltk
then,
import nltk
nltk.download('perluniprops')
from nltk.tokenize.moses import MosesDetokenizer
If you are using insides pandas dataframe, then
df['detoken']=df['token_column'].apply(lambda x: detokenizer.detokenize(x, return_str=True))
The reason there is no simple answer is you actually need the span locations of the original tokens in the string. If you don't have that, and you aren't reverse engineering your original tokenization, your reassembled string is based on guesses about the tokenization rules that were used. If your tokenizer didn't give you spans, you can still do this if you have three things:
1) The original string
2) The original tokens
3) The modified tokens (I'm assuming you have changed the tokens in some way, because that is the only application for this I can think of if you already have #1)
Use the original token set to identify spans (wouldn't it be nice if the tokenizer did that?) and modify the string from back to front so the spans don't change as you go.
Here I'm using TweetTokenizer but it shouldn't matter as long as the tokenizer you use doesn't change the values of your tokens so that they aren't actually in the original string.
tokenizer=nltk.tokenize.casual.TweetTokenizer()
string="One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin."
tokens=tokenizer.tokenize(string)
replacement_tokens=list(tokens)
replacement_tokens[-3]="cute"
def detokenize(string,tokens,replacement_tokens):
spans=[]
cursor=0
for token in tokens:
while not string[cursor:cursor+len(token)]==token and cursor<len(string):
cursor+=1
if cursor==len(string):break
newcursor=cursor+len(token)
spans.append((cursor,newcursor))
cursor=newcursor
i=len(tokens)-1
for start,end in spans[::-1]:
string=string[:start]+replacement_tokens[i]+string[end:]
i-=1
return string
>>> detokenize(string,tokens,replacement_tokens)
'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a cute vermin.'
The reason tokenize.untokenize does not work is because it needs more information than just the words. Here is an example program using tokenize.untokenize:
from StringIO import StringIO
import tokenize
sentence = "I've found a medicine for my disease.\n"
tokens = tokenize.generate_tokens(StringIO(sentence).readline)
print tokenize.untokenize(tokens)
Additional Help:
Tokenize - Python Docs |
Potential Problem
I am using following code without any major library function for detokeization purpose. I am using detokenization for some specific tokens
_SPLITTER_ = r"([-.,/:!?\";)(])"
def basic_detokenizer(sentence):
""" This is the basic detokenizer helps us to resolves the issues we created by our tokenizer"""
detokenize_sentence =[]
words = sentence.split(' ')
pos = 0
while( pos < len(words)):
if words[pos] in '-/.' and pos > 0 and pos < len(words) - 1:
left = detokenize_sentence.pop()
detokenize_sentence.append(left +''.join(words[pos:pos + 2]))
pos +=1
elif words[pos] in '[(' and pos < len(words) - 1:
detokenize_sentence.append(''.join(words[pos:pos + 2]))
pos +=1
elif words[pos] in ']).,:!?;' and pos > 0:
left = detokenize_sentence.pop()
detokenize_sentence.append(left + ''.join(words[pos:pos + 1]))
else:
detokenize_sentence.append(words[pos])
pos +=1
return ' '.join(detokenize_sentence)
Use the join function:
You could just do a ' '.join(words) to get back the original string.