Meaningless Spacy Nouns - python

I am using Spacy for extracting nouns from sentences. These sentences are grammatically poor and may contain some spelling mistakes as well.
Here is the code that I am using:
Code
import spacy
import re

nlp = spacy.load("en_core_web_sm")
sentence = "HANDBRAKE - slow and fast (SFX)"
string = sentence.lower()
cleanString = re.sub(r'\W+', ' ', string)
cleanString = cleanString.replace("_", " ")
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN":
        print(token.text)
Output:
sfx
Similarly, for the sentence "fast foward2", the noun spaCy returns is
foward2
This shows that the extracted nouns include meaningless tokens such as sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.
I only want to keep phrases that contain sensible single-word nouns like broom, ticker, pool, highway, etc.
I have tried using WordNet to keep only the nouns that both WordNet and spaCy recognise, but it is a bit strict and filters out some sensible nouns as well, for example motorbike, whoosh, trolley, metal, suitcase, zip, etc.
Therefore, I am looking for a solution that keeps the sensible nouns in the list I obtain from spaCy and discards the meaningless ones.

It seems you can use the pyenchant library:
Enchant is used to check the spelling of words and suggest corrections for words that are misspelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.
More information is available on the Enchant website:
https://abiword.github.io/enchant/
Sample Python code:
import spacy, re
import enchant  # pip install pyenchant

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_sm")

sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
cleanString = re.sub(r'[\W_]+', ' ', sentence.lower())  # merging \W and _ into one regex
doc = nlp(cleanString)

for token in doc:
    if token.pos_ == "NOUN" and d.check(token.text):
        print(token.text)

# => example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip

While using the pyenchant spellchecker, I have found it useful to do the check after fully converting the word to uppercase. Splitting the sentence/phrase and feeding the words one at a time also gives better results.
Example:
enchantChecker.check("landsat".upper()) and enchantChecker.check("wwii".upper()) return True, whereas using the lowercase words returns False.
If you need to mix more than one spellchecker, another good option is to check the spaCy library's is_oov (out of vocabulary) flag after loading en_core_web_lg.
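Putting both ideas together, here is a minimal sketch, assuming pyenchant and the en_core_web_lg model are installed; the helper name and the sample sentence are only illustrative:
import spacy
import enchant  # pip install pyenchant

checker = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_lg")  # the lg model ships word vectors, so is_oov is informative

def looks_like_a_real_word(token):
    # illustrative helper: accept a token if the spellchecker knows the fully
    # uppercased form, or if spaCy's vocabulary has a vector for it (not out-of-vocabulary)
    return checker.check(token.text.upper()) or not token.is_oov

doc = nlp("the landsat image shows a motorbike near the brailledisplayfastmovement")
for token in doc:
    if token.pos_ == "NOUN" and looks_like_a_real_word(token):
        print(token.text)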

Related

Writing a tokenizer for an English text in Python with nltk and regex

I want to write a tokenizer for an English text, and I'm working with the RegExp tokenizer from the nltk module in Python.
This is the expression I currently use to split the words:
[\w\.]+
(the "." is there so something like u.s.a doesn't get butchered.)
Problem: at the same time I want the trailing punctuation removed from a word, so that usa. becomes usa.
Of course I can do it in separate steps, but I thought there has to be a smoother way than iterating over the whole text again just to remove punctuation.
Since it needs to be scalable, I want to optimize the runtime as best I can.
I'm pretty new to regular expressions and am having a really hard time, so I'm happy for any help I can get.
The nltk module uses more than regular expressions alone (specifically, trained models) and does a pretty good job with abbreviations on its own, really:
from nltk import sent_tokenize, word_tokenize

text = """
In recent times, the U.S. has had to endure difficult
political times and many trials and tribulations.
Maybe things will get better soon - but only with the
right punctuation marks. Am I right, Dr.?"""

words = []
for nr, sent in enumerate(sent_tokenize(text), 1):
    print("{}. {}".format(nr, sent))
    for word in word_tokenize(sent):
        words.append(word)
print(words)
Don't reinvent the wheel here with your own regular expressions.
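That said, if you do want a pure-regex approach, a minimal sketch of the two-step idea (keep dots inside tokens, then strip the trailing ones) could look like this; the sample sentence is only for illustration:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'[\w.]+')   # keep dots inside tokens such as u.s.a
sentence = "The u.s.a. team won, and the usa. crowd cheered."
tokens = [tok.strip('.') for tok in tokenizer.tokenize(sentence)]
print(tokens)
# ['The', 'u.s.a', 'team', 'won', 'and', 'the', 'usa', 'crowd', 'cheered']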

NLP: How do I combine stemming and tagging?

I'm trying to write code which passes in text that has been tokenized and had the stop words filtered out, and then stems and tags it. However, I'm not sure in what order I should stem and tag. This is what I have at the moment:
#### Stemming
ps = PorterStemmer()  # PorterStemmer imported from nltk.stem
stemText = []
for word in swFiltText:  # Tagged text w/o stop words
    stemText.append(ps.stem(word))

#### POS Tagging
def tagging():
    tagTot = []
    try:
        for i in stemText:
            words = nltk.word_tokenize(i)  # I need to tokenize again (idk why?)
            tagged = nltk.pos_tag(words)
            tagTot = tagTot + tagged  # Combine tagged words into list
    except Exception as e:
        print(str(e))
    return tagTot

tagText = tagging()
At first glance, this works just fine. However, because I stemmed first, pos_tag often mislabels words. For example, it marked "hous" as an adjective, when the original word was really the noun "house". But when I try to stem after tagging, it gives me an error about how pos_tag can't deal with 'tuples' - I'm guessing this has something to do with the way the word list is formatted after tagging, as [('come', 'VB'), ('hous', 'JJ'), ...].
Should I be using a different stemmer/tagger? Or is the error in my code?
Thanks in advance!
You should tag the text before you apply stemming or lemmatisation to it.
Removing the endings of words takes away crucial clues about what part-of-speech tag a word can be.
The reason that you got hous as an adjective is that any tagger expects unprocessed tokens, and words ending in -ous in English are usually adjectives (nefarious, serious). If you tag it first, it can be recognised (even without context) as either a noun or a verb. The tagger can then use context (preceded by the? -> noun) to disambiguate which is the most likely one.
A good lemmatiser can take the part of speech into account, e.g. housing could be a noun (lemma: housing) or a verb (lemma: house). With part-of-speech information a lemmatiser can make the correct choice there.
Whether you use stemming or lemmatisation depends on your application. For many purposes they will be equivalent. The main differences in my experience are that:
Stemming is a lot faster, as stemmers have a few rules on how to handle various endings
Lemmatisation gives you 'proper' words which you can look up in dictionaries (if you want to get glosses in other languages or definitions)
Stemmed strings sometimes don't look anything like the original word, and if you present them to a human user they might get confused
Stemmers conflate words which have similar meanings but different lemmas, so for information retrieval they might be more useful
Stemmers don't need a word list, so if you want to write your own stemmer, it's quicker than writing a lemmatiser (if you're processing languages for which no ready-made tools exist)
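To make the tag-first order concrete, here is a minimal sketch with NLTK; the sample token list and the small Treebank-to-WordNet mapping helper are illustrative, not taken from the question:
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

def wordnet_pos(treebank_tag):
    # illustrative helper: map Penn Treebank tags to the tags WordNetLemmatizer expects
    if treebank_tag.startswith('J'):
        return 'a'   # adjective
    if treebank_tag.startswith('V'):
        return 'v'   # verb
    if treebank_tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # default to noun

tokens = ['the', 'house', 'is', 'coming', 'along', 'nicely']   # sample tokens
tagged = nltk.pos_tag(tokens)                                   # tag the raw tokens first

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

lemmas = [(word, tag, lemmatizer.lemmatize(word, wordnet_pos(tag))) for word, tag in tagged]
stems = [(word, tag, stemmer.stem(word)) for word, tag in tagged]
print(lemmas)   # e.g. ('coming', 'VBG', 'come') - the lemma uses the tag
print(stems)    # e.g. ('house', 'NN', 'hous') - the stemmer ignores it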
I would suggest using lemmatization over stemming; stemming just chops letters off the end until the root/stem word is reached. Lemmatization also looks at the surrounding text to determine the given word's part of speech.

Hello, I need a proper example to develop NLP using spaCy in Python

I searched a lot regarding the spaCy library for NLP and I learned a lot of things about spaCy and NLP.
But now I want to actually implement something with spaCy, and I haven't found a proper example to follow. Can anyone guide me through the process or provide an example link for spaCy?
I referred to this: https://spacy.io/usage/
Or point me to any other NLP library with a runnable example.
Thanks in advance!
Although your question is highly unclear, from what I understand you want to build an NLP pipeline with spaCy. I can guide you through the basic steps, but this is a vast area and you'll pretty much have to figure stuff out on your own. It'll be easier after this, though.
So, you have to take a look at the spaCy API documentation.
Basic steps in any NLP pipeline are the following:
Language detection (self-explanatory: if you're working with some dataset, you know what the language is and you can adapt your pipeline to that). When you know the language, you have to download the correct model from spaCy. The instructions are here. Let's use English for this example. In your command line just type python -m spacy download en and then import it into the preprocessing script like this:
import spacy
nlp = spacy.load('en')
Tokenization - this is the process of splitting the text into words. It's not enough to just do text.split() (e.g. there's would be treated as a single word, but it's actually the two words there and is). So here we use tokenizers. In spaCy you can do something like:
doc = nlp(text)
where text is your dataset corpus or a sample from a dataset. You can read more about the document instance here
Punctuation removal - a pretty self-explanatory step, handled by the doc object created in the previous step. To remove punctuation, just type:
import re

# removing punctuation tokens
text_no_punct = [token.text for token in doc if not token.is_punct]

# remove punctuation that is attached to a word, like 'bye!' -> 'bye'
REPLACE_PUNCT = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])")
text_no_punct = [REPLACE_PUNCT.sub("", tok) for tok in text_no_punct]
POS tagging - short for Part-Of-Speech tagging. It is the process of marking up a word in a text as corresponding to a particular part of speech. For example:
A/DT Part-Of-Speech/NNP Tagger/NNP is/VBZ a/DT piece/NN of/IN
software/NN that/WDT reads/VBZ text/NN in/IN some/DT
language/NN and/CC assigns/VBZ parts/NNS of/IN speech/NN to/TO
each/DT word/NN ,/, such/JJ as/IN noun/NN ,/, verb/NN ,/,
adjective/NN ,/, etc./FW ./.
where the uppercase codes after the slash are standard word tags. A list of tags can be found here.
In spaCy, this is already done by passing the text to the nlp instance. You can get the tags with:
for token in doc:
    print(token.text, token.tag_)
Morphological processing: lemmatization - it's a process of transforming the words into a linguistically valid base form, called the lemma:
nouns → singular nominative form
verbs → infinitive form
adjectives → singular, nominative, masculine, indefinite, positive form
In spaCy, it's also already done for you by passing the text to the nlp instance. You can get the lemma of every word with:
for token in doc:
    print(token.text, token.lemma_)
Removing stopwords - stopwords are words that do not bring any new information or meaning to the sentence and can be omitted. You guessed it, this is also already done for you by the nlp instance. To filter out the stopwords just type:
text_without_stopwords = [token.text for token in doc if not token.is_stop]
doc = nlp(' '.join(text_without_stopwords))
Now you have a clean dataset. You can now use word2vec or GloVe pretrained models to create word vectors and feed your data into some model. But let's leave that for another post. I hope this is clear enough :)

NLP process for combining common collocations

I have a corpus that I'm using the tm package on in R (and also mirroring the same script in NLTK in Python). I'm working with unigrams, but would like a parser of some kind to combine words that commonly co-occur so they are treated as one word; i.e., I'd like to stop seeing "New" and "York" separately in my data set when they occur together, and instead see this particular pair represented as "New York" as if it were a single word, alongside other unigrams.
What is this process called, of transforming meaningful, common n-grams onto the same footing as unigrams? Is it not a thing? Finally, what would the tm_map look like for this?
mydata.corpus <- tm_map(mydata.corpus, fancyfunction)
And/or in Python?
I recently had a similar question and played around with collocations.
This was the solution I chose to identify pairs of collocated words:
from nltk import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = <a long text read in as a string>
tokenized_text = word_tokenize(text)

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokenized_text)
scored = finder.score_ngrams(bigram_measures.raw_freq)
sorted(scored, key=lambda s: s[1], reverse=True)
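If the goal is to then treat the pairs you keep as single tokens, one possible follow-up (a sketch, not part of the solution above; the example pairs are illustrative and would normally come from the scored collocations) is NLTK's MWETokenizer:
from nltk.tokenize import MWETokenizer, word_tokenize

# merge known multi-word expressions into single tokens
mwe_tokenizer = MWETokenizer([('New', 'York'), ('ice', 'cream')], separator=' ')
tokens = mwe_tokenizer.tokenize(word_tokenize("I ate ice cream in New York last summer."))
print(tokens)
# ['I', 'ate', 'ice cream', 'in', 'New York', 'last', 'summer', '.']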

How can I take a few paragraphs of text, see if any sentence has a pronoun and select all those sentences to make a new paragraph?

Should I use NLTK or regular expressions to split it?
How can I do the selection for pronouns (he/she)? I want to select any sentence that has a pronoun.
This is a part of a larger project and I am new to Python. Could you please point me to any helpful code?
I am working on an NLP project which has similar needs. I suggest you use NLTK since it makes things really easy and gives us a lot of flexibility. Since you need to collect all sentences having pronouns, you can split all sentences in the text and hold them in a list. Then, you can iterate over the list and look for sentences containing pronouns. Also make sure you note down the index of the sentence (in the list), or you can form a new list.
Sample code below:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

sentences = ['alice loves to read crime novels.', 'she also loves to play chess with him']
sentences_with_pronouns = []
for sentence in sentences:
    words = word_tokenize(sentence)
    for word in words:
        word_pos = pos_tag([word])
        if word_pos[0][1] == 'PRP':
            sentences_with_pronouns.append(sentence)
            break
print(sentences_with_pronouns)
Output:
['she also loves to play chess with him']
P.S. Also check the pylucene and whoosh libraries, which are pretty useful Python NLP packages.
NLTK is your best bet. Given a string of sentences as input, you can obtain a list of those sentences containing pronouns by doing:
from nltk import pos_tag, sent_tokenize, word_tokenize

paragraph = "This is a sentence with no pronouns. Take it or leave it."
print([sentence for sentence in sent_tokenize(paragraph)
       if 'PRP' in {pos for _, pos in pos_tag(word_tokenize(sentence))}])
Returns:
['Take it or leave it.']
Basically, we split the string into a list of sentences, split those sentences into lists of words, and convert the list of words for each sentence into a set of part-of-speech tags (this is important: if we don't, then when we have multiple pronouns in a sentence we would get duplicate sentences).
