fast filtering of sentences in spacy - python

I'm using SpaCy to divide a text into sentences, match a regex pattern on each sentence, and use some logic based on the results of the match. I started with a naive approach such as:
import re
import spacy

nlp = spacy.load("en_core_web_trf")
regex = re.compile(r'\b(foo|bar)\b')
for text in texts_list:
    doc = nlp(text)
    for sent in doc.sents:
        if re.search(regex, sent.text):
            [...]
        else:
            [...]
and it was very slow. Then I used a pipe:
for doc in nlp.pipe(texts_list, disable=['tagger', 'ner', 'attribute_ruler', 'lemmatizer'], n_process=4):
    for sent in doc.sents:
        if re.search(regex, sent.text):
            [...]
        else:
            [...]
but it's still slow. Am I missing something?

A transformer model is overkill for splitting sentences and will be very slow. Instead, a good option is the fast senter from an sm model:
import spacy
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])
nlp.enable_pipe("senter")
for doc in nlp.pipe(texts, n_process=4):
    ...
The senter should work pretty well if your sentences end with punctuation. If you have a lot of run-on sentences without final punctuation, then the parser might do a better job. To run only the parser, keep the tok2vec and parser components from the original pipeline and don't enable the senter. The parser will be ~5-10x slower than the senter.
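A minimal sketch of that parser-only setup (the component names assume the standard en_core_web_sm pipeline layout):
import spacy

# Keep only tok2vec + parser; doc.sents then comes from the dependency parse.
nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer", "ner"])
for doc in nlp.pipe(texts, n_process=4):
    for sent in doc.sents:
        ...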
If you need this to be even faster, you can use the rule-based sentencizer (start from a blank en model), which is typically a bit worse than the senter because it only splits on the provided punctuation symbols.
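If you want to try that, a minimal sketch of the sentencizer setup looks like this (a blank pipeline, so no trained model is involved):
import spacy

# Blank English pipeline with only the rule-based sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
for doc in nlp.pipe(texts, n_process=4):
    for sent in doc.sents:
        ...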

Related

Writing a tokenizer for an english text in python with nltk and regex

I want to write a tokenizer for an English text and I'm working with the RegExp tokenizer from the nltk module in python.
This is the expression I'm using right now to split the words:
[\w\.]+
(the "." so something like u.s.a doesn't get butchered.)
Problem: At the same time i want to remove the punctuation from the word: usa.
Of course I can do it in separate steps but I thought there has to be a smoother way than iterating over the whole text again just to remove punctuation.
Since it needs to be scalable I want to optimize the runtime as best as I can.
I'm pretty new to regular expressions and having a really hard time, so I'm happy for any help I can get.
The module uses more than regular expressions alone (specifically trained data sets) and does a pretty good job on its own with abbreviations, really:
from nltk import sent_tokenize, word_tokenize

text = """
In recent times, the U.S. has had to endure difficult
political times and many trials and tribulations.
Maybe things will get better soon - but only with the
right punctuation marks. Am I right, Dr.?"""

words = []
for nr, sent in enumerate(sent_tokenize(text), 1):
    print("{}. {}".format(nr, sent))
    for word in word_tokenize(sent):
        words.append(word)
print(words)
Don't reinvent the wheel here with your own regular expressions.

Meaningless Spacy Nouns

I am using Spacy for extracting nouns from sentences. These sentences are grammatically poor and may contain some spelling mistakes as well.
Here is the code that I am using:
Code
import spacy
import re
nlp = spacy.load("en_core_web_sm")
sentence= "HANDBRAKE - slow and fast (SFX)"
string= sentence.lower()
cleanString = re.sub('\W+',' ', string )
cleanString=cleanString.replace("_", " ")
doc= nlp(cleanString)
for token in doc:
if token.pos_=="NOUN":
print (token.text)
Output:
sfx
Similarly, for the sentence "fast foward2", I get the spaCy noun
foward2
which shows that these nouns include some meaningless words like: sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.
I only want to keep phrases that contain sensible single-word nouns like broom, ticker, pool, highway etc.
I have tried WordNet to keep only the nouns that spaCy and WordNet have in common, but it is a bit strict and filters out some sensible nouns as well. For example, it filters out nouns like motorbike, whoosh, trolley, metal, suitcase, zip, etc.
Therefore, I am looking for a solution with which I can keep the sensible nouns from the spaCy noun list I have obtained and drop the meaningless ones.
It seems you can use the pyenchant library:
Enchant is used to check the spelling of words and suggest corrections for words that are miss-spelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.
More information is available on the Enchant website:
https://abiword.github.io/enchant/
Sample Python code:
import spacy, re
import enchant  # pip install pyenchant

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_sm")

sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
cleanString = re.sub(r'[\W_]+', ' ', sentence.lower())  # merging \W and _ into one regex
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN" and d.check(token.text):
        print(token.text)
# => [example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip]
While using the pyenchant spellchecker, I have found it useful to do the check after converting the word fully to uppercase. Also, splitting the sentence/phrase and feeding the words one at a time gives better results.
Example:
enchantChecker.check("landsat".upper()) and enchantChecker.check("wwii".upper()) return True, whereas using the lowercase words returns False.
If you need to mix more than one spellchecker, another good option is to check the spaCy library's is_oov (out of vocabulary) flag after loading en_core_web_lg.
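A hedged sketch of combining the two checks (the sample sentence and the simple OR logic are assumptions for illustration, not from the original answer):
import spacy
import enchant  # pip install pyenchant

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_lg")  # the lg model ships with word vectors, so is_oov is meaningful

def looks_sensible(token):
    # Accept the noun if either checker recognises it; uppercasing helps
    # pyenchant with acronyms such as "wwii" or names such as "landsat".
    return d.check(token.text.upper()) or not token.is_oov

doc = nlp("fast foward2 and a broom near the highway")
print([t.text for t in doc if t.pos_ == "NOUN" and looks_sensible(t)])
# expected (not guaranteed) to keep "broom" and "highway" and drop "foward2"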

NLP: How do I combine stemming and tagging?

I'm trying to write code which passes in text that has been tokenized and had the stop words filtered out, and then stems and tags it. However, I'm not sure in what order I should stem and tag. This is what I have at the moment:
#### Stemming
ps = PorterStemmer()  # PorterStemmer imported from nltk.stem
stemText = []
for word in swFiltText:  # Tagged text w/o stop words
    stemText.append(ps.stem(word))

#### POS Tagging
def tagging():
    tagTot = []
    try:
        for i in stemText:
            words = nltk.word_tokenize(i)  # I need to tokenize again (idk why?)
            tagged = nltk.pos_tag(words)
            tagTot = tagTot + tagged  # Combine tagged words into list
    except Exception as e:
        print(str(e))
    return tagTot

tagText = tagging()
At first glance, this works just fine. However, because I stemmed first, pos_tag often mislabels words. For example, it marked "hous" as an adjective, when the original word was really the noun "house". But when I try to stem after tagging, it gives me an error about how pos_tag can't deal with 'tuples' - I'm guessing this has something to do with the way that the tagger formats the word list as [('come', 'VB'), ('hous', 'JJ'), ...].
Should I be using a different stemmer/tagger? Or is the error in my code?
Thanks in advance!
You should tag the text before you apply stemming or lemmatisation to it.
Removing the endings of words takes away crucial clues about what part-of-speech tag a word can be.
The reason that you got hous as an adjective is that any tagger expects unprocessed tokens, and words ending in -ous in English are usually adjectives (nefarious, serious). If you tag it first, house can be recognised (even without context) as either a noun or a verb. The tagger can then use context (preceded by the? -> noun) to disambiguate which is the most likely one.
A good lemmatiser can take the part of speech into account, e.g. housing could be a noun (lemma: housing) or a verb (lemma: house). With POS information a lemmatiser can make the correct choice there.
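For instance, here is a minimal sketch (not the asker's code) of tagging first and then lemmatising, mapping the Penn Treebank tags to the WordNet POS constants the lemmatiser expects:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map Penn Treebank tags (JJ*, VB*, RB*, NN*, ...) to WordNet POS constants.
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # sensible default

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("They are housing the tools in the new houses")
tagged = nltk.pos_tag(tokens)
print([(word, tag, lemmatizer.lemmatize(word, penn_to_wordnet(tag))) for word, tag in tagged])
# 'housing' tagged as a verb lemmatises to 'house'; 'houses' tagged as a noun also gives 'house'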
Whether you use stemming or lemmatisation depends on your application. For many purposes they will be equivalent. The main differences in my experience are that:
Stemming is a lot faster, as stemmers have a few rules on how to handle various endings
Lemmatisation gives you 'proper' words which you can look up in dictionaries (if you want to get glosses in other languages or definitions)
Stemmed strings sometimes don't look anything like the original word, and if you present them to a human user they might get confused
Stemmers conflate words which have similar meanings but different lemmas, so for information retrieval they might be more useful
Stemmers don't need a word list, so if you want to write your own stemmer, it's quicker than writing a lemmatiser (if you're processing languages for which no ready-made tools exist)
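A quick way to see the difference in practice (a small illustration, not from the original answer):
from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()
for word in ["studies", "housing", "meeting"]:
    print(word, ps.stem(word), wnl.lemmatize(word))
# the stems ('studi', 'hous', 'meet') are not dictionary words,
# while the lemmas ('study', 'housing', 'meeting') are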
I would suggest using lemmatization over stemming. Stemming just chops off letters from the end until the root/stem word is reached, whereas lemmatization also looks at the surrounding text to determine the given word's part of speech.

hello i need a proper example to develop NLP using spacy Python

I searched a lot regarding the spaCy library for NLP and I learned a lot of things about spaCy and NLP.
But now I want to implement spaCy, and I haven't found a proper example of how to do that. Can anyone guide me through the process or provide me an example link for spaCy?
I referred to this: https://spacy.io/usage/
Or give me any other library with a runnable example for developing NLP.
Thanks in advance!
Although your question is highly unclear, from what I understand you want to build an NLP pipeline with spaCy. I can guide you through the basic steps, but this is a vast area and you'll pretty much have to figure stuff out on your own. But it'll be easier after this.
So, you have to take a look at the spaCy API documentation.
Basic steps in any NLP pipeline are the following:
Language detection (self-explanatory: if you're working with some dataset, you know what the language is and you can adapt your pipeline to that). When you know the language, you have to download the correct model from spaCy. The instructions are here. Let's use English for this example. In your command line just type python -m spacy download en_core_web_sm and then import it into the preprocessing script like this:
import spacy
nlp = spacy.load('en_core_web_sm')
Tokenization - this is the process of splitting the text into words. It's not enough to just do text.split() (e.g. there's would be treated as a single word, but it's actually two words, there and is). So here we use tokenizers. In spaCy you can do something like:
nlp_doc = nlp(text)
where text is your dataset corpus or a sample from a dataset. You can read more about the document instance here
Punctuation removal - pretty self explanatory process, done by the method in the previous step. To remove punctuation, just type:
import re
# removing punctuation tokens
text_no_punct = [token.text for token in nlp_doc if not token.is_punct]
# remove punctuation that is left inside the word strings, like 'bye!' -> 'bye'
REPLACE_PUNCT = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])")
text_no_punct = [REPLACE_PUNCT.sub("", tok) for tok in text_no_punct]
POS tagging - short for Part-Of-Speech tagging. It is the process of marking up a word in a text as corresponding to a particular part of speech. For example:
A/DT Part-Of-Speech/NNP Tagger/NNP is/VBZ a/DT piece/NN of/IN
software/NN that/WDT reads/VBZ text/NN in/IN some/DT
language/NN and/CC assigns/VBZ parts/NNS of/IN speech/NN to/TO
each/DT word/NN ,/, such/JJ as/IN noun/NN ,/, verb/NN ,/,
adjective/NN ,/, etc./FW./.
where the uppercase codes after the slash are standard word tags. A list of tags can be found here.
In spaCy, this is already done by putting the text into the nlp instance. You can get the tags with:
for token in nlp_doc:
    print(token.text, token.tag_)
Morphological processing: lemmatization - it's a process of transforming the words into a linguistically valid base form, called the lemma:
nouns → singular nominative form
verbs → infinitive form
adjectives → singular, nominative, masculine, indefinitive, positive form
In spaCy, it's also already done for you by putting the text into the nlp instance. You can get the lemma of every word with:
for token in nlp_doc:
    print(token.text, token.lemma_)
Removing stopwords - stopwords are words that don't bring any new information or meaning to the sentence and can be omitted. You guessed it, this is also already done for you by the nlp instance. To filter out the stopwords just type:
text_without_stopwords = [token.text for token in nlp_doc if not token.is_stop]
nlp_doc = nlp(' '.join(text_without_stopwords))
Now you have a clean dataset. You can now use word2vec or GloVe pretrained models to create word vectors and input your data to some model. But let's leave that for another post. I hope this is clear enough :)
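For reference, a condensed sketch of the steps above in a single pass (the model name en_core_web_sm and the sample sentence are just assumptions):
import spacy

nlp = spacy.load("en_core_web_sm")
nlp_doc = nlp("The striped bats aren't hanging on their feet!")

# POS tags and lemmas come from the same pass
print([(token.text, token.tag_, token.lemma_) for token in nlp_doc])

# lemmatised tokens with punctuation and stopwords removed
cleaned = [token.lemma_.lower() for token in nlp_doc if not token.is_punct and not token.is_stop]
print(cleaned)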

Keras: Text Preprocessing (Stopword Removal, etc.)

I'm using Keras to do a multilabel classification task (Toxic Comment Text Classification on Kaggle).
I'm using the Tokenizer class to do some pre-processing like this:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(train_sentences)
train_sentences_tokenized = tokenizer.texts_to_sequences(train_sentences)
max_len = 250
X_train = pad_sequences(train_sentences_tokenized, maxlen=max_len)
This is a good start, but I haven't removed stop words, stemmed words, etc. For stop word removal, here's what I do before the above:
from nltk.corpus import stopwords

def filter_stop_words(train_sentences, stop_words):
    for i, sentence in enumerate(train_sentences):
        new_sent = [word for word in sentence.split() if word not in stop_words]
        train_sentences[i] = ' '.join(new_sent)
    return train_sentences

stop_words = set(stopwords.words("english"))
train_sentences = filter_stop_words(train_sentences, stop_words)
Shouldn't there be an easier way to do this within Keras? I was hoping there was stemming capability as well, but the docs don't indicate there is:
https://keras.io/preprocessing/text/
Any help on best practices for stop word removal and stemming would be awesome!
Thanks!
No, Keras is not a natural language processing library. You will have to handle any complex processing yourself. At this stage it might make sense to use an actual NLP library with a Python interface, such as NLTK or spaCy. Tokenizer is a small utility class for basic natural language tasks; you can extend it up to a certain point yourself, but the NLP libraries will give you much more, including tokenisation, POS tagging and stemming.
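If it helps, here is a hedged sketch of doing the stop-word removal and stemming with NLTK before handing the cleaned strings to the Keras Tokenizer (the sample sentences and the num_words/maxlen values are just placeholders):
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(sentences):
    # Drop stopwords and stem what remains, sentence by sentence.
    return [" ".join(stemmer.stem(w) for w in s.split() if w not in stop_words) for s in sentences]

train_sentences = ["this comment is perfectly fine", "this one is rather toxic"]  # placeholder data
train_sentences = clean(train_sentences)

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(train_sentences)
X_train = pad_sequences(tokenizer.texts_to_sequences(train_sentences), maxlen=250)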
