I'm using Keras to do a multilabel classification task (Toxic Comment Text Classification on Kaggle).
I'm using the Tokenizer class to do some pre-processing like this:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(train_sentences)
train_sentences_tokenized = tokenizer.texts_to_sequences(train_sentences)
max_len = 250
X_train = pad_sequences(train_sentences_tokenized, maxlen=max_len)
This is a good start, but I haven't removed stop words, stemmed words, etc. For stop word removal, here's what I do before the above:
from nltk.corpus import stopwords

def filter_stop_words(train_sentences, stop_words):
    for i, sentence in enumerate(train_sentences):
        new_sent = [word for word in sentence.split() if word not in stop_words]
        train_sentences[i] = ' '.join(new_sent)
    return train_sentences

stop_words = set(stopwords.words("english"))
train_sentences = filter_stop_words(train_sentences, stop_words)
Shouldn't there be an easier way to do this within Keras? I was hoping there was stemming capability as well, but the docs don't indicate there is:
https://keras.io/preprocessing/text/
Any help on best practices for stop word removal and stemming would be awesome!
Thanks!
No, Keras is not a natural language processing library. You will have to handle any complex processing yourself. At this stage it might make sense to use an actual NLP library with a Python interface, such as NLTK or spaCy. Tokenizer is a small utility class for basic natural language tasks, and you can extend it up to a certain point yourself, but the NLP libraries will give you much more, including tokenisation, POS tagging and stemming.
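For example, here is a minimal sketch of doing stop-word removal and Porter stemming with NLTK before handing the cleaned strings to the Keras Tokenizer; it reuses train_sentences, num_words and max_len from the question, and is only one way to wire this up, not the canonical approach:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(sentence):
    # lowercase, drop stop words, then stem whatever is left
    tokens = [w for w in sentence.lower().split() if w not in stop_words]
    return ' '.join(stemmer.stem(w) for w in tokens)

clean_sentences = [preprocess(s) for s in train_sentences]

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(clean_sentences)
X_train = pad_sequences(tokenizer.texts_to_sequences(clean_sentences), maxlen=250)

Keras then only sees the already-cleaned strings, so the Tokenizer itself stays exactly as in your original code.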
I'm using SpaCy to divide a text into sentences, match a regex pattern on each sentence, and use some logic based on the results of the match. I started with a naive approach such as:
import re
import spacy

nlp = spacy.load("en_core_web_trf")
regex = re.compile(r'\b(foo|bar)\b')
for text in texts_list:
    doc = nlp(text)
    for sent in doc.sents:
        if re.search(regex, str(sent)):
            [...]
        else:
            [...]
and it was very slow. Then I used a pipe:
for doc in nlp.pipe(texts_list, disable=['tagger', 'ner', 'attribute_ruler', 'lemmatizer'], n_process=4):
    for sent in doc.sents:
        if re.search(regex, str(sent)):
            [...]
        else:
            [...]
but it's still slow. Am I missing something?
A transformer model is overkill for splitting sentences and will be very slow. Instead, a good option is the fast senter from an sm model:
import spacy
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])
nlp.enable_pipe("senter")
for doc in nlp.pipe(texts, n_process=4):
    ...
The senter should work pretty well if your sentences end with punctuation. If you have a lot of run-on sentences without final punctuation, then the parser might do a better job. To run only the parser, keep the tok2vec and parser components from the original pipeline and don't enable the senter. The parser will be ~5-10x slower than the senter.
If you need this to be even faster, you can use the rule-based sentencizer (start from a blank en model), which is typically a bit worse than the senter because it only splits on the provided punctuation symbols.
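As a rough sketch of those two alternatives (assuming spaCy v3 and the same texts list as above):

import spacy

# Rule-based sentencizer: fastest option, splits only on punctuation
nlp_fast = spacy.blank("en")
nlp_fast.add_pipe("sentencizer")

# Parser-based splitting: keep tok2vec + parser, don't enable the senter
nlp_parser = spacy.load("en_core_web_sm",
                        exclude=["tagger", "attribute_ruler", "lemmatizer", "ner"])

for doc in nlp_fast.pipe(texts, n_process=4):   # same pattern works with nlp_parser
    for sent in doc.sents:
        ...

Pick the sentencizer if raw speed matters most, the senter for a good speed/quality balance, and the parser when run-on sentences without final punctuation are common.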
I am implementing a simple doc2vec model with gensim, not a word2vec model.
I need to remove stopwords from a list of lists without losing the correct word order.
Each inner list is a document and, as I understand it for doc2vec, the model takes a list of TaggedDocuments as input:
model = Doc2Vec(lst_tag_documents, vector_size=5, window=2, min_count=1, workers=4)
dataset = [['We should remove the stopwords from this example'],
           ['Otherwise the algo'],
           ["will not work correctly"],
           ['dont forget Gensim doc2vec takes list_of_list']]

STOPWORDS = ['we','i','will','the','this','from']

def word_filter(lst):
    lower = [word.lower() for word in lst]
    lst_ftred = [word for word in lower if not word in STOPWORDS]
    return lst_ftred

lst_lst_filtered = list(map(word_filter, dataset))
print(lst_lst_filtered)
Output:
[['we should remove the stopwords from this example'], ['otherwise the algo'], ['will not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]
Expected Output:
[[' should remove the stopwords example'], ['otherwise the algo'], [' not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]
What was my mistake, and how can I fix it?
Are there other efficient ways to solve this issue without losing the proper order?
List of questions I examined before asking:
How to apply a function to each sublist of a list in python?
I studied this and tried to apply it to my specific case.
Removing stopwords from list of lists
The order is important, so I can't use a set.
Removing stopwords from a list of text files
This could be a possible solution; it is similar to what I have implemented.
I understood the difference, but I don't know how to deal with it.
In my case the document is not tokenized (and should not be tokenized, because this is doc2vec, not word2vec).
How to remove stop words using nltk or python
In this question the OP is dealing with a list, not a list of lists.
lower is a list with a single element (the whole sentence), so word in STOPWORDS compares the entire sentence against STOPWORDS and never matches, meaning nothing is ever filtered out. Take the first item in the list by index and split it on whitespace:
lst_ftred = ' '.join([word for word in lower[0].split() if word not in STOPWORDS])
# output: ['should remove stopwords example', 'otherwise algo', 'not work correctly', 'dont forget gensim doc2vec takes list_of_list']
# 'the' is also in STOPWORDS
First, note it's not that important to remove stopwords from Doc2Vec training. Second, note that such tiny toy datasets won't deliver interesting results from Doc2Vec. The algorithm, like Word2Vec, only starts to show its value when trained on large datasets with (1) many, many more unique words than the number of vector dimensions; and (2) lots of varied examples of the usage of each word - at least a few, ideally dozens or hundreds.
Still, if you wanted to strip stopwords, it'd be easiest if you did it after tokenizing the raw strings. (That is, splitting the strings into lists-of-words. That's the format Doc2Vec will need anyway.) And, you don't want your dataset to be a list-of-lists-with-one-string-each. Instead, you want it to be either a list-of-strings (at first), then a list-of-lists-with-many-tokens-each.
The following should work:
string_dataset = [
    'We should remove the stopwords from this example',
    'Otherwise the algo',
    "will not work correctly",
    'dont forget Gensim doc2vec takes list_of_list',
]

STOPWORDS = ['we','i','will','the','this','from']

# Python list comprehension to break into tokens
tokenized_dataset = [s.split() for s in string_dataset]

def filter_words(tokens):
    """lowercase each token, and keep only if not in STOPWORDS"""
    return [token.lower() for token in tokens if token.lower() not in STOPWORDS]

filtered_dataset = [filter_words(tokens) for tokens in tokenized_dataset]
Finally, because, as noted above, Doc2Vec needs multiple varied examples of each word to work well, it's almost always a bad idea to use min_count=1.
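For completeness, here is a sketch of how the filtered token lists would then be wrapped into TaggedDocuments and fed to Doc2Vec; the tag scheme and the toy parameter values are only illustrative:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# each document becomes a TaggedDocument with its token list and a unique tag
tagged_docs = [TaggedDocument(words=tokens, tags=[str(i)])
               for i, tokens in enumerate(filtered_dataset)]

# toy-sized parameters from the question; on a real corpus use a larger
# vector_size and a min_count of at least 2 (usually higher)
model = Doc2Vec(tagged_docs, vector_size=5, window=2, min_count=1, workers=4)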
I am using spaCy for extracting nouns from sentences. These sentences are grammatically poor and may contain some spelling mistakes as well.
Here is the code that I am using:
Code
import spacy
import re

nlp = spacy.load("en_core_web_sm")
sentence = "HANDBRAKE - slow and fast (SFX)"
string = sentence.lower()
cleanString = re.sub(r'\W+', ' ', string)
cleanString = cleanString.replace("_", " ")
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN":
        print(token.text)
Output:
sfx
Similarly, for the sentence "fast foward2", I get the spaCy noun
foward2
This shows that the extracted nouns include some meaningless words like: sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.
I only want to keep phrases that contain sensible single-word nouns like broom, ticker, pool, highway, etc.
I have tried WordNet to keep only the nouns common to WordNet and spaCy, but it is a bit strict and filters out some sensible nouns as well. For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip, etc.
Therefore, I am looking for a solution that filters the spaCy noun list down to mostly sensible nouns.
It seems you can use the pyenchant library:
Enchant is used to check the spelling of words and suggest corrections for words that are miss-spelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.
More information is available on the Enchant website:
https://abiword.github.io/enchant/
Sample Python code:
import spacy, re
import enchant  # pip install pyenchant

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_sm")
sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
cleanString = re.sub(r'[\W_]+', ' ', sentence.lower())  # merging \W and _ into one regex
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN" and d.check(token.text):
        print(token.text)
# => [example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip]
While using the pyenchant spellchecker, I have found it useful to do the check after converting the word fully to uppercase. Also, splitting the sentence/phrase and feeding the words one at a time gives better results.
Example:
enchantChecker.check("landsat".upper()) and enchantChecker.check("wwii".upper()) return True, whereas using the lowercase words returns False.
If you need to mix more than one spellchecker, another good option is to check the spaCy library's is_oov (out of vocabulary) flag after loading en_core_web_lg.
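A rough sketch that combines both ideas; the helper name looks_like_real_word is made up for illustration, and the exact behaviour of is_oov depends on the model you load:

import enchant
import spacy

d = enchant.Dict("en_US")
nlp_lg = spacy.load("en_core_web_lg")

def looks_like_real_word(word):
    # uppercasing helps with acronyms and proper nouns such as "wwii" or "landsat"
    if d.check(word) or d.check(word.upper()):
        return True
    # fall back to spaCy's large vocabulary: is_oov is True for unknown words
    return not nlp_lg.vocab[word].is_oov

print([w for w in "landsat wwii foward2 broom".split() if looks_like_real_word(w)])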
I'm trying to write code which passes in text that has been tokenized and had the stop words filtered out, and then stems and tags it. However, I'm not sure in what order I should stem and tag. This is what I have at the moment:
import nltk
from nltk.stem import PorterStemmer

#### Stemming
ps = PorterStemmer()
stemText = []
for word in swFiltText:  # Tokenized text w/o stop words
    stemText.append(ps.stem(word))

#### POS Tagging
def tagging():
    tagTot = []
    try:
        for i in stemText:
            words = nltk.word_tokenize(i)  # I need to tokenize again (idk why?)
            tagged = nltk.pos_tag(words)
            tagTot = tagTot + tagged  # Combine tagged words into list
    except Exception as e:
        print(str(e))
    return tagTot

tagText = tagging()
At first glance, this works just fine. However, because I stemmed first, pos_tag often mislabels words. For example, it marked "hous" as an adjective, when the original word was really the noun "house". But when I try to stem after tagging, it gives me an error about how pos_tag can't deal with 'tuples' - I'm guessing this has something to do with the way that the stemmer formats the word list as [('come', 'VB'), ('hous', 'JJ'), etc.
Should I be using a different stemmer/tagger? Or is the error in my code?
Thanks in advance!
You should tag the text before you apply stemming or lemmatisation to it.
Removing the endings of words takes away crucial clues about what part-of-speech tag a word can be.
The reason that you got hous as an adjective is that any tagger expects unprocessed tokens, and words ending in -ous in English are usually adjectives (nefarious, serious). If you tag the original word house first, it can be recognised (even without context) as either a noun or a verb. The tagger can then use context (preceded by the? -> noun) to disambiguate which is the most likely one.
A good lemmatiser can take the part-of-speech into account, e.g. housing could be a noun (lemma: housing) or a verb (lemma: house). With POS information a lemmatiser can make the correct choice there.
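As a small sketch of that order with NLTK (tag the raw tokens first, then lemmatise using the tag): the treebank_to_wordnet helper is something you write yourself, and it assumes the usual NLTK data packages are already downloaded:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

def treebank_to_wordnet(tag):
    # map Penn Treebank tags to the WordNet POS constants the lemmatiser expects
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("The houses are housing many families")
tagged = nltk.pos_tag(tokens)      # tag the unprocessed tokens first
lemmas = [lemmatizer.lemmatize(word, treebank_to_wordnet(tag))
          for word, tag in tagged]
print(lemmas)   # e.g. ['The', 'house', 'be', 'house', 'many', 'family']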
Whether you use stemming or lemmatisation depends on your application. For many purposes they will be equivalent. The main differences, from my experience, are:
Stemming is a lot faster, as stemmers have a few rules on how to handle various endings
Lemmatisation gives you 'proper' words which you can look up in dictionaries (if you want to get glosses in other languages or definitions)
Stemmed strings sometimes don't look anything like the original word, and if you present them to a human user they might get confused
Stemmers conflate words which have similar meanings but different lemmas, so for information retrieval they might be more useful
Stemmers don't need a word list, so if you want to write your own stemmer, it's quicker than writing a lemmatiser (if you're processing languages for which no ready-made tools exist)
I would suggest using lemmatization over stemming. Stemming just chops off letters from the end until the root/stem word is reached. Lemmatization also looks at the surrounding text to determine the given word's part of speech.
I have searched a lot regarding the spaCy library for NLP and I have learned a lot about spaCy and NLP.
But now I want to actually implement something with spaCy, and I haven't found a proper example to follow. Can anyone guide me through the process or point me to an example for spaCy?
I referred to this: https://spacy.io/usage/
Or suggest any other library with a runnable example for developing NLP.
Thanks in advance!
Although your question is highly unclear, from what I understand you want to build an NLP pipeline with spaCy. I can guide you through the basic steps, but this is a vast area and you'll pretty much have to figure things out on your own. It'll be easier after this, though.
So, you have to take a look at the spaCy API documentation.
Basic steps in any NLP pipeline are the following:
Language detection (self-explanatory: if you're working with some dataset, you know what the language is and you can adapt your pipeline to that). When you know the language, you have to download the correct model from spaCy. The instructions are here. Let's use English for this example. In your command line just type python -m spacy download en and then import it into the preprocessing script like this:
import spacy
nlp = spacy.load('en')
Tokenization - this is the process of splitting the text into words. It's not enough to just do text.split() (e.g. there's would be treated as a single word, but it's actually the two words there and is). So here we use tokenizers. In spaCy you can do something like:
nlp_doc = nlp(text)
where text is your dataset corpus or a sample from a dataset. You can read more about the document instance here
Punctuation removal - a pretty self-explanatory step, working on the Doc created in the previous step. To remove punctuation tokens, just type:
import re

# removing punctuation tokens
text_no_punct = [token.text for token in nlp_doc if not token.is_punct]

# remove punctuation characters left inside the word strings, e.g. 'bye!' -> 'bye'
REPLACE_PUNCT = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])")
text_no_punct = [REPLACE_PUNCT.sub("", tok) for tok in text_no_punct]
POS tagging - short for Part-Of-Speech tagging. It is the process of marking up a word in a text as corresponding to a particular part of speech. For example:
A/DT Part-Of-Speech/NNP Tagger/NNP is/VBZ a/DT piece/NN of/IN
software/NN that/WDT reads/VBZ text/NN in/IN some/DT
language/NN and/CC assigns/VBZ parts/NNS of/IN speech/NN to/TO
each/DT word/NN ,/, such/JJ as/IN noun/NN ,/, verb/NN ,/,
adjective/NN ,/, etc./FW./.
where the uppercase codes after the slash are standard word tags. A list of tags can be found here.
In spaCy, this is already done by passing the text through the nlp instance. You can get the tags with:
for token in nlp_doc:
    print(token.text, token.tag_)
Morphological processing: lemmatization - it's a process of transforming the words into a linguistically valid base form, called the lemma:
nouns → singular nominative form
verbs → infinitive form
adjectives → singular, nominative, masculine, indefinitive, positive form
In spaCy, it's also already done for you by passing the text through the nlp instance. You can get the lemma of every word with:
for token in nlp_doc:
    print(token.text, token.lemma_)
Removing stopwords - stopwords are words that do not bring any new information or meaning to the sentence and can be omitted. You guessed it: this is also already handled by the nlp instance. To filter out the stopwords just type:
text_without_stopwords = [token.text for token in nlp_doc if not token.is_stop]
doc = nlp(' '.join(text_without_stopwords))
Now you have a clean dataset. You can now use word2vec or GloVe pretrained models to create word vectors and input your data to some model. But let's leave that for another post. I hope this is clear enough :)