Ignore punctuation symbols when producing tokens in spaCy - Python

I am new to spaCy. Can we tell the spaCy API to ignore punctuation symbols when producing tokens?
Example:
For the sentence "Hi, Welcome to StackOverflow." the tokens are:
Hi
,
Welcome
to
StackOverflow
.
I want spaCy to split tokens only on whitespace, keeping punctuation attached to the word. For the above example, the tokens should be:
Hi,
Welcome
to
StackOverflow.

Try:
import spacy
nlp = spacy.load("en_core_web_sm")
txt = "Hi, Welcome to StackOverflow."
doc = nlp(txt)
tokens = [tok.text for tok in doc if not tok.is_punct]
print(tokens)
['Hi', 'Welcome', 'to', 'StackOverflow']
You may wish to define your own list of punctuation:
punctuation = [".",",","!"]
tokens = [tok.text for tok in doc if tok.text not in punctuation]
print(tokens)
['Hi', 'Welcome', 'to', 'StackOverflow']
or use the readily available one from the string module:
from string import punctuation
print(punctuation)
tokens = [tok.text for tok in doc if tok.text not in punctuation]
print(tokens)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['Hi', 'Welcome', 'to', 'StackOverflow']
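Note that both approaches above drop the punctuation entirely ('Hi' rather than 'Hi,'). If you literally want whitespace-delimited chunks with punctuation kept attached, as in the question, a minimal sketch that needs no filtering at all is plain str.split(); optionally you can wrap the chunks in a Doc (reusing the nlp object from above) if you still want spaCy token objects:
from spacy.tokens import Doc

txt = "Hi, Welcome to StackOverflow."
words = txt.split()                    # ['Hi,', 'Welcome', 'to', 'StackOverflow.']
doc_ws = Doc(nlp.vocab, words=words)   # optional: build a Doc from those chunks
print([tok.text for tok in doc_ws])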

Related

Is there a way to generalize the orths inside the argument of Spacy's retokenizer.split?

I'm trying to fix wrongly merged Spanish words from a text file using spaCy's retokenizer.split; however, I would like to generalize the orths argument inside retokenizer.split. I have the following code:
doc= nlp("the wordsare wronly merged and weneed split them") #example
words = ["wordsare"] # Example: words to be split
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in words]
matcher.add("Terminology", None, *patterns)
matches = matcher(doc)
with doc.retokenize() as retokenizer:
for match_id, start, end in matches:
heads = [(doc[start],1), doc[start]]
attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
orths= [str(doc[start]),str(doc[end])]
retokenizer.split(doc[start], orths=orths, heads=heads, attrs=attrs)
token_split=[token.text for token in doc]
print(token_split)
But when I write the orths as orths = [str(doc[start]), str(doc[end])] instead of ["words", "are"], I get this error:
ValueError: [E117] The newly split tokens must match the text of the original token. New orths: wordsarewronly. Old text: wordsare.
I would like some help generalizing this, because I want the code to fix not just the word wordsare but also the word weneed and any others the file could contain.
What I would change in your example:
Change words = ["wordsare"] to words = ["wordsare","weneed"]. That is the list of wrongly merged words.
Add splitting rules that map to that list: splits = {"wordsare":["words","are"], "weneed":["we","need"]}
Change orths= [str(doc[start]),str(doc[end])] to orths= splits[doc[start:end].text]. This is the list of pieces to substitute for the match found; your original [str(doc[start]),str(doc[end])] does not make much sense.
Move retokenizer.split into the loop.
Think about adding another dictionary for attrs (see the sketch after the example's output below).
As soon as you have that in place, you have a working, generalizing example:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("the wordsare wronly merged and weneed split them")  # example
words = ["wordsare", "weneed"]  # example: words to be split
splits = {"wordsare": ["words", "are"], "weneed": ["we", "need"]}
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in words]
matcher.add("Terminology", None, *patterns)
matches = matcher(doc)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        heads = [(doc[start], 1), doc[start]]
        attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]}
        orths = splits[doc[start:end].text]
        retokenizer.split(doc[start], orths=orths, heads=heads, attrs=attrs)
token_split = [token.text for token in doc]
print(token_split)
['the', 'words' ,'are', 'wronly', 'merged', 'and', 'we', 'need', 'split', 'them']
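To flesh out the attrs suggestion from the list above, here is a hedged sketch of a per-word attribute map; the dictionary name split_attrs and the tag/dep values are hypothetical choices, not something the code above prescribes:
# hypothetical per-word attribute map, mirroring the splits dictionary
split_attrs = {
    "wordsare": {"POS": ["NOUN", "AUX"], "DEP": ["nsubj", "aux"]},
    "weneed": {"POS": ["PRON", "VERB"], "DEP": ["nsubj", "aux"]},
}
# inside the retokenizer loop you would then look up:
# attrs = split_attrs[doc[start:end].text]
print(split_attrs["weneed"])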
Note, if you're only concerned with tokenization, there is a much easier, and perhaps faster, way to do the same (flattening the per-token splits into one list):
[t for tok in doc for t in (splits[tok.text] if tok.text in words else [tok.text])]
['the', 'words', 'are', 'wronly', 'merged', 'and', 'we', 'need', 'split', 'them']
Also note that in the first example attrs are fixed and therefore wrongly assigned in some cases. You could take care of that with another dictionary, but a more straightforward and clean way to get a fully functional pipeline is to redefine the tokenizer and let spaCy do the rest for you:
from spacy.tokens import Doc

nlp.make_doc = lambda txt: Doc(
    nlp.vocab,
    [i for l in [splits[tok.text] if tok.text in words else [tok.text]
                 for tok in nlp.tokenizer(txt)] for i in l],
)
doc2 = nlp("the wordsare wronly merged and weneed split them")
for tok in doc2:
    print(f"{tok.text:<10}", f"{tok.pos_:<10}", f"{tok.dep_:<10}")
the DET det
words NOUN nsubjpass
are AUX auxpass
wronly ADV advmod
merged VERB ROOT
and CCONJ cc
we PRON nsubj
need VERB aux
split VERB conj
them PRON dobj

How to convert token list into wordnet lemma list using nltk?

I have a list of tokens extracted from a PDF source. I am able to preprocess the text and tokenize it, but I want to loop through the tokens and convert each token in the list to its lemma in the WordNet corpus. So, my token list looks like this:
['0000', 'Everyone', 'age', 'remembers', 'Þ', 'rst', 'heard', 'contest', 'I', 'sitting', 'hideout', 'watching', ...]
There are no lemmas for words like 'Everyone', '0000', 'Þ' and many more, which I need to eliminate. But for words like 'age', 'remembers', 'heard' etc. the token list is supposed to look like:
['age', 'remember', 'hear', ...]
I am checking the synonyms through this code:
syns = wn.synsets("heard")
print(syns[0].lemmas()[0].name())
At this point I have created the function clean_text() in Python for preprocessing. It looks like this:
def clean_text(text):
    # Eliminating punctuation
    text = "".join([word for word in text if word not in string.punctuation])
    # tokenizing
    tokens = re.split(r"\W+", text)
    # lemmatizing and removing stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # converting token list into synset
    syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
    return text
I am getting this error:
syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
AttributeError: 'list' object has no attribute 'lower'
How to get the token list for each lemma?
The full code:
import string
import re
from wordcloud import WordCloud
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import wordnet
import PyPDF4
import matplotlib
import numpy as np
from PIL import Image

stopwords = nltk.corpus.stopwords.words('english')
moreStopwords = ['clin97803078874365pallr1indd']  # additional stopwords to be removed manually
wn = nltk.WordNetLemmatizer()

data = PyPDF4.PdfFileReader(open('ReadyPlayerOne.pdf', 'rb'))
pageData = ''
for page in data.pages:
    pageData += page.extractText()
# print(pageData)

def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split(r"\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    syns = [text.lemmas()[0].name() for text in wordnet.synsets(text)]
    return syns

print(clean_text(pageData))
You are calling wordnet.synsets(text) with a list of words (check what text is at that point) while it should be called with a single word.
The preprocessing inside wordnet.synsets tries to apply .lower() to its argument, hence the error (AttributeError: 'list' object has no attribute 'lower').
Below is a functional version of clean_text that fixes this problem:
import string
import re
import nltk
from nltk.corpus import wordnet

stopwords = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()

def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split(r"\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    lemmas = []
    for token in text:
        lemmas += [synset.lemmas()[0].name() for synset in wordnet.synsets(token)]
    return lemmas

text = "The grass was greener."
print(clean_text(text))
Returns:
['grass', 'Grass', 'supergrass', 'eatage', 'pot', 'grass', 'grass', 'grass', 'grass', 'grass', 'denounce', 'green', 'green', 'green', 'green', 'fleeceable']
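Note that collecting the lemma of every synset mixes in unrelated senses, as the output above shows. If you want roughly one lemma per surviving token, as in the expected list from the question, a minimal sketch (the helper name first_lemma is mine) is to take only the first synset and skip tokens WordNet does not know:
from nltk.corpus import wordnet

def first_lemma(token):
    # use only the first synset; None when WordNet has no entry for the token
    synsets = wordnet.synsets(token)
    return synsets[0].lemmas()[0].name() if synsets else None

lemmas = [first_lemma(t) for t in ["age", "remembers", "heard"]]
print([l for l in lemmas if l])  # expected: something like ['age', 'remember', 'hear']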

Lemmatize a doc with spacy?

I have a spaCy doc that I would like to lemmatize.
For example:
import spacy
nlp = spacy.load('en_core_web_lg')
my_str = 'Python is the greatest language in the world'
doc = nlp(my_str)
How can I convert every token in the doc to its lemma?
Each token has a number of attributes; you can iterate through the doc to access them.
For example: [token.lemma_ for token in doc]
If you want to reconstruct the sentence you could use: ' '.join([token.lemma_ for token in doc])
For a full list of token attributes see: https://spacy.io/api/token#attributes
If you don't need a particular component of the pipeline (for example, the NER or the parser), you can disable loading it. This can sometimes make a big difference and improve loading speed.
For your case (Lemmatize a doc with spaCy) you only need the tagger component.
So here is a sample code:
import spacy
# keeping only tagger component needed for lemmatization
nlp = spacy.load('en_core_web_lg', disable=["parser", "ner"])
my_str = 'Python is the greatest language in the world'
doc = nlp(my_str)
words_lemmas_list = [token.lemma_ for token in doc]
print(words_lemmas_list)
Output:
['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world']
This answer covers the case where your text consists of multiple sentences.
If you want to obtain a list of all tokens being lemmatized, do:
import spacy
nlp = spacy.load('en')
my_str = 'Python is the greatest language in the world. A python is an animal.'
doc = nlp(my_str)
words_lemmata_list = [token.lemma_ for token in doc]
print(words_lemmata_list)
# Output:
# ['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world', '.',
# 'a', 'python', 'be', 'an', 'animal', '.']
If you want to obtain a list of all sentences with each token being lemmatized, do:
sentences_lemmata_list = [sentence.lemma_ for sentence in doc.sents]
print(sentences_lemmata_list)
# Output:
# ['Python be the great language in the world .', 'a python be an animal .']

Keeping all white spaces as tokens

I have a question about whether there is a way to keep single whitespace characters as independent tokens in spaCy tokenization.
For example if I ran:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is easy.")
toks = [w.text for w in doc]
toks
The result is
['This', 'is', 'easy', '.']
Instead, I would like to have something like
['This', ' ', 'is', ' ', 'easy', '.']
Is there a simple way to do that?
spaCy exposes the token's whitespace as the whitespace_ attribute. So if you only need a list of strings, you could do:
token_texts = []
for token in doc:
    token_texts.append(token.text)
    if token.whitespace_:  # filter out empty strings
        token_texts.append(token.whitespace_)
If you want to create an actual Doc object out of those tokens, that's possible, too. Doc objects can be constructed with a words keyword argument (a list of strings to add as tokens). However, I'm not sure how useful that would be.
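As a rough sketch of that Doc construction (assuming the nlp object and the token_texts list from the snippet above; spaces=False marks every token as having no trailing whitespace, since the whitespace is already carried by the ' ' tokens themselves):
from spacy.tokens import Doc

words = token_texts
spaces = [False] * len(words)  # whitespace is already present as its own tokens
doc_with_ws = Doc(nlp.vocab, words=words, spaces=spaces)
print([t.text for t in doc_with_ws])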
If you want the whitespaces in the doc object:
import spacy
from spacy.tokens import Doc
class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        # interleave the words with single-space tokens
        res = [' '] * (2 * len(words) - 1)
        res[::2] = words
        return Doc(self.vocab, words=res)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("This is easy.")
print([t.text for t in doc])
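Note that with this whitespace tokenizer punctuation stays attached to the preceding word, so the print above should come out roughly as ['This', ' ', 'is', ' ', 'easy.'] rather than with a separate '.' token; if you need the period split off as well, you would have to handle that inside __call__.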

How can spaCy tokenize a hashtag as a whole?

In a sentence containing hashtags, such as a tweet, spacy's tokenizer splits hashtags into two tokens:
import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]
output:
[This, is, a, #, sentence, .]
I'd like to have hashtags tokenized as follows, is that possible?
[This, is, a, #sentence, .]
I also tried several ways to prevent spaCy from splitting hashtags or words with hyphens like "cutting-edge". My experience is that merging tokens afterwards can be problematic, because the POS tagger and dependency parser have already used the wrong tokens for their decisions. Touching the infix, prefix and suffix regexps is error prone / complex, because you don't want your changes to produce side effects.
The simplest way is indeed, as pointed out before, to modify the token_match function of the tokenizer. This is a re.match that identifies strings which will not be split. Instead of importing the specific URL pattern I'd rather extend whatever spaCy's default is.
import re
import spacy
from spacy.tokenizer import _get_regex_pattern

nlp = spacy.load('en')
# get default pattern for tokens that don't get split
re_token_match = _get_regex_pattern(nlp.Defaults.token_match)
# add your patterns (here: hashtags and in-word hyphens)
re_token_match = rf"({re_token_match}|#\w+|\w+-\w+)"
# overwrite token_match function of the tokenizer
nlp.tokenizer.token_match = re.compile(re_token_match).match
text = "#Pete: choose low-carb #food #eatsmart ;-) πŸ˜‹πŸ‘"
doc = nlp(text)
This yields:
['#Pete', ':', 'choose', 'low-carb', '#food', '#eatsmart', ';-)', 'πŸ˜‹', 'πŸ‘']
This is more of an add-on to the great answer by @DhruvPathak and a shameless copy from the GitHub thread linked below (and the even better answer there by @csvance). spaCy (since v2.0) features the add_pipe method, meaning you can wrap @DhruvPathak's great answer in a function and add that step (conveniently) to your nlp processing pipeline, as below.
Citation starts here:
def hashtag_pipe(doc):
    merged_hashtag = False
    while True:
        for token_index, token in enumerate(doc):
            if token.text == '#':
                if token.head is not None:
                    start_index = token.idx
                    end_index = start_index + len(token.head.text) + 1
                    if doc.merge(start_index, end_index) is not None:
                        merged_hashtag = True
                        break
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc

nlp = spacy.load('en')
nlp.add_pipe(hashtag_pipe)

doc = nlp("twitter #hashtag")
assert len(doc) == 2
assert doc[0].text == 'twitter'
assert doc[1].text == '#hashtag'
Citation ends here; check out "How to add hashtags to the part of speech tagger" (#503) on GitHub for the full thread.
PS It's clear when reading the code, but for the copy&pasters, don't disable the parser :)
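Note that doc.merge used in the citation above was deprecated in spaCy v2.1 and removed in v3, so on newer installs the same pipeline step can be written with the retokenizer instead. A minimal sketch, keeping the v2-style add_pipe registration used above (the function name hashtag_retokenize is mine):
import spacy
from spacy.util import filter_spans

def hashtag_retokenize(doc):
    # collect '#' + following-word spans, then merge them in one pass
    spans = [doc[tok.i:tok.i + 2]
             for tok in doc[:-1]
             if tok.text == "#" and not tok.whitespace_]
    with doc.retokenize() as retokenizer:
        for span in filter_spans(spans):  # drop any overlapping spans
            retokenizer.merge(span)
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(hashtag_retokenize)  # v2-style; v3 requires registering a named component
doc = nlp("twitter #hashtag")
print([t.text for t in doc])  # expected: ['twitter', '#hashtag']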
You can do some pre- and post-processing string manipulation, which lets you bypass the '#'-based tokenization and is easy to implement, e.g.:
>>> import re
>>> import spacy
>>> nlp = spacy.load('en')
>>> sentence = u'This is my twitter update #MyTopic'
>>> parsed = nlp(sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'#', u'MyTopic']
>>> new_sentence = re.sub(r'#(\w+)', r'ZZZPLACEHOLDERZZZ\1', sentence)
>>> new_sentence
u'This is my twitter update ZZZPLACEHOLDERZZZMyTopic'
>>> parsed = nlp(new_sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'ZZZPLACEHOLDERZZZMyTopic']
>>> [x.replace(u'ZZZPLACEHOLDERZZZ', '#') for x in [token.text for token in parsed]]
[u'This', u'is', u'my', u'twitter', u'update', u'#MyTopic']
You can try setting custom separators in spaCy's tokenizer.
I am not aware of methods to do that.
UPDATE: You can use a regex to find the span of text you want to stay as a single token, and retokenize it using the span.merge method as documented here: https://spacy.io/docs/api/span#merge
Merge example:
>>> import spacy
>>> import re
>>> nlp = spacy.load('en')
>>> my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
>>> parsed = nlp(my_str)
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#', u'NOUN'), (u'MyHashOne', u'NOUN'), (u'#', u'NOUN'), (u'MyHashTwo', u'PROPN')]
>>> indexes = [m.span() for m in re.finditer('#\w+',my_str,flags=re.IGNORECASE)]
>>> indexes
[(15, 25), (26, 36)]
>>> for start,end in indexes:
... parsed.merge(start_idx=start,end_idx=end)
...
#MyHashOne
#MyHashTwo
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#MyHashOne', u'NOUN'), (u'#MyHashTwo', u'PROPN')]
>>>
I found this on GitHub; it uses spaCy's Matcher:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

doc = nlp('This is a #sentence. Here is another #hashtag. #The #End.')
matches = matcher(doc)
hashtags = []
for match_id, start, end in matches:
    hashtags.append(doc[start:end])
for span in hashtags:
    span.merge()
print([t.text for t in doc])
outputs:
['This', 'is', 'a', '#sentence', '.', 'Here', 'is', 'another', '#hashtag', '.', '#The', '#End', '.']
A list of hashtags is also available in the hashtags list:
print(hashtags)
output:
[#sentence, #hashtag, #The, #End]
I spent quite a bit of time on this, so I'll share what I came up with:
Subclassing the Tokenizer and adding the regex for hashtags to the default URL_PATTERN was the easiest solution for me; I additionally added a custom extension to match on hashtags and identify them:
import re
import spacy
from spacy.language import Language
from spacy.tokenizer import Tokenizer
from spacy.tokens import Token

nlp = spacy.load('en_core_web_sm')

def create_tokenizer(nlp):
    # contains the regex to match all sorts of urls:
    from spacy.lang.tokenizer_exceptions import URL_PATTERN

    # spacy defaults: when the standard behaviour is required, they
    # need to be included when subclassing the tokenizer
    prefix_re = spacy.util.compile_prefix_regex(Language.Defaults.prefixes)
    infix_re = spacy.util.compile_infix_regex(Language.Defaults.infixes)
    suffix_re = spacy.util.compile_suffix_regex(Language.Defaults.suffixes)

    # extending the default url regex with regex for hashtags with "or" = |
    hashtag_pattern = r'''|^(#[\w_-]+)$'''
    url_and_hashtag = URL_PATTERN + hashtag_pattern
    url_and_hashtag_re = re.compile(url_and_hashtag)

    # set a custom extension to match if token is a hashtag
    hashtag_getter = lambda token: token.text.startswith('#')
    Token.set_extension('is_hashtag', getter=hashtag_getter)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=url_and_hashtag_re.match)

nlp.tokenizer = create_tokenizer(nlp)
doc = nlp("#spreadhappiness #smilemore so_great#good.com https://www.somedomain.com/foo")
for token in doc:
    print(token.text)
    if token._.is_hashtag:
        print("-> matches hashtag")
# returns: "#spreadhappiness -> matches hashtag #smilemore -> matches hashtag so_great#good.com https://www.somedomain.com/foo"
