I have read a lot about the spaCy library for NLP and learned a lot about spaCy and NLP in general.
But now that I want to actually use spaCy, I can't find a proper example to follow. Can anyone guide me through the process or point me to an example for spaCy?
I referred to this: https://spacy.io/usage/
Alternatively, suggest any other library with a runnable example for developing NLP.
Thanks in advance!
Although your question is quite broad, from what I understand you want to build an NLP pipeline with spaCy. I can guide you through the basic steps, but this is a vast area and you'll pretty much have to figure things out on your own. It will be easier after this, though.
So, you should take a look at the spaCy API documentation.
The basic steps in any NLP pipeline are the following:
Language detection (self-explanatory: if you're working with a dataset, you know what the language is and you can adapt your pipeline to it). When you know the language, you have to download the correct model from spaCy. The instructions are here. Let's use English for this example. In your command line just type python -m spacy download en and then load it in the preprocessing script like this:
import spacy
nlp = spacy.load('en')
Tokenization - the process of splitting the text into words. It's not enough to just do text.split() (e.g. "there's" would be treated as a single word, but it's actually two words: "there" and "is"). So we use tokenizers. In spaCy you can do something like:
doc = nlp(text)
where text is your dataset corpus or a sample from a dataset. You can read more about the Doc instance here.
Punctuation removal - a pretty self-explanatory step, mostly handled by the tokenization in the previous step. To remove punctuation, just type:
import re
# remove tokens that are pure punctuation
text_no_punct = [token.text for token in doc if not token.is_punct]
# strip punctuation that is attached to a word string, e.g. 'bye!' -> 'bye'
REPLACE_PUNCT = re.compile(r"[.;:!'?,\"()\[\]]")
# note: text_no_punct holds plain strings at this point, not spaCy tokens
text_no_punct = [REPLACE_PUNCT.sub("", tok) for tok in text_no_punct]
POS tagging - short for Part-Of-Speech tagging. It is the process of marking up a word in a text as corresponding to a particular part of speech. For example:
A/DT Part-Of-Speech/NNP Tagger/NNP is/VBZ a/DT piece/NN of/IN
software/NN that/WDT reads/VBZ text/NN in/IN some/DT
language/NN and/CC assigns/VBZ parts/NNS of/IN speech/NN to/TO
each/DT word/NN ,/, such/JJ as/IN noun/NN ,/, verb/NN ,/,
adjective/NN ,/, etc./FW ./.
where the uppercase codes after the slash are standard word tags. A list of tags can be found here.
In spaCy, this is already done by passing the text to the nlp instance. You can get the tags with:
for token in doc:
    print(token.text, token.tag_)
Morphological processing: lemmatization - the process of transforming words into a linguistically valid base form, called the lemma:
nouns → singular nominative form
verbs → infinitive form
adjectives → singular, nominative, masculine, indefinite, positive form
In spaCy, this is also already done for you by passing the text to the nlp instance. You can get the lemma of every word with:
for token in doc:
    print(token.text, token.lemma_)
Removing stopwords - stopwords are words that do not bring any new information or meaning to the sentence and can be omitted. You guessed it, this is also already handled by the nlp instance. To filter out the stopwords, just type:
text_without_stopwords = [token.text for token in doc if not token.is_stop]
doc = nlp(' '.join(text_without_stopwords))
Now you have a clean dataset. You can now use word2vec or GloVe pretrained models to create word vectors and feed your data to some model. But let's leave that for another post. I hope this is clear enough :)
I'm using spaCy to divide a text into sentences, match a regex pattern on each sentence, and use some logic based on the results of the match. I started with a naive approach such as:
nlp = spacy.load("en_core_web_trf")
regex = re.compile(r'\b(foo|bar)\b')
for text in texts_list:
doc = nlp(text)
for sent in doc.sents:
if re.search(regex, str(s)):
[...]
else:
[...]
and it was very slow. Then I used a pipe:
for doc in nlp.pipe(texts_list, disable=['tagger', 'ner', 'attribute_ruler', 'lemmatizer'], n_process=4):
    for sent in doc.sents:
        if re.search(regex, str(sent)):
            [...]
        else:
            [...]
but it's still slow. Am I missing something?
A transformer model is overkill for splitting sentences and will be very slow. Instead, a good option is the fast senter from an sm model:
import spacy
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])
nlp.enable_pipe("senter")
for doc in nlp.pipe(texts, n_process=4):
    ...
The senter should work pretty well if your sentences end with punctuation. If you have a lot of run-on sentences without final punctuation, then the parser might do a better job. To run only the parser, keep the tok2vec and parser components from the original pipeline and don't enable the senter. The parser will be ~5-10x slower than the senter.
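A minimal sketch of that parser-only setup, assuming en_core_web_sm (the exact component names to disable can vary between pipelines):
import spacy

# keep tok2vec + parser; the parser sets the sentence boundaries
nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer", "ner"])
for doc in nlp.pipe(texts, n_process=4):
    for sent in doc.sents:
        ...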
If you need this to be even faster, you can use the rule-based sentencizer (start from a blank en model), which is typically a bit worse than the senter because it only splits on the provided punctuation symbols.
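And a sketch of the rule-based option, starting from a blank English pipeline:
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # by default splits on sentence-final punctuation such as ., ! and ?
for doc in nlp.pipe(texts, n_process=4):
    for sent in doc.sents:
        ...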
I am using spaCy for extracting nouns from sentences. These sentences are grammatically poor and may contain some spelling mistakes as well.
Here is the code that I am using:
Code
import spacy
import re

nlp = spacy.load("en_core_web_sm")
sentence = "HANDBRAKE - slow and fast (SFX)"
string = sentence.lower()
cleanString = re.sub(r'\W+', ' ', string)
cleanString = cleanString.replace("_", " ")
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN":
        print(token.text)
Output:
sfx
Similarly, for the sentence "fast foward2", I get the spaCy noun
foward2
This shows that the extracted nouns include meaningless words like sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.
I only want to keep phrases that contain sensible single-word nouns like broom, ticker, pool, highway etc.
I have tried using WordNet to keep only the nouns common to WordNet and spaCy, but it is a bit strict and filters out some sensible nouns as well. For example, it filters out nouns like motorbike, whoosh, trolley, metal, suitcase, zip, etc.
Therefore, I am looking for a solution that keeps only the sensible nouns in the spaCy noun list I have obtained.
It seems you can use the pyenchant library:
Enchant is used to check the spelling of words and suggest corrections for words that are mis-spelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.
More information is available on the Enchant website:
https://abiword.github.io/enchant/
Sample Python code:
import spacy, re
import enchant  # pip install pyenchant

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_sm")

sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
cleanString = re.sub(r'[\W_]+', ' ', sentence.lower())  # merging \W and _ into one regex
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN" and d.check(token.text):
        print(token.text)
# => [example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip]
While using the pyenchant spellchecker, I have found it useful to do the check after converting the word fully to uppercase. Also, splitting the sentence/phrase and feeding the words one at a time gives better results.
Example:
enchantChecker.check("landsat".upper()) and enchantChecker.check("wwii".upper()) return True, whereas using lowercase words returns False.
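A small sketch of that approach, assuming pyenchant is installed:
import enchant  # pip install pyenchant

d = enchant.Dict("en_US")
# feed words one at a time and check the uppercased form
for word in "landsat wwii motorbike foward2".split():
    print(word, d.check(word.upper()))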
If you need to mix in more than one spellchecker, another good option is to check the spaCy library's is_oov (out of vocabulary) flag after loading en_core_web_lg.
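For example, a short sketch of the is_oov check, assuming en_core_web_lg is installed:
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("motorbike foward2 trolley sfx")
for token in doc:
    # is_oov is True for tokens that are not in the model's vocabulary/vectors
    print(token.text, token.is_oov)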
I'm trying to write code which passes in text that has been tokenized and had the stop words filtered out, and then stems and tags it. However, I'm not sure in what order I should stem and tag. This is what I have at the moment:
#### Stemming
ps = PorterStemmer()  # PorterStemmer imported from nltk.stem

stemText = []
for word in swFiltText:  # tokenized text w/o stop words
    stemText.append(ps.stem(word))

#### POS Tagging
def tagging():
    tagTot = []
    try:
        for i in stemText:
            words = nltk.word_tokenize(i)  # I need to tokenize again (idk why?)
            tagged = nltk.pos_tag(words)
            tagTot = tagTot + tagged  # combine tagged words into one list
    except Exception as e:
        print(str(e))
    return tagTot

tagText = tagging()
At first glance, this works just fine. However, because I stemmed first, pos_tag often mislabels words. For example, it marked "hous" as an adjective, when the original word was really the noun "house". But when I try to stem after tagging, it gives me an error about how pos_tag can't deal with 'tuples' - I'm guessing this has something to do with the way the tagger formats the word list as [('come', 'VB'), ('hous', 'JJ'), etc.
Should I be using a different stemmer/tagger? Or is the error in my code?
Thanks in advance!
You should tag the text before you apply stemming or lemmatisation to it.
Removing the endings of words takes away crucial clues about what part-of-speech tag a word can be.
The reason that you got hous as an adjective is that any tagger expects unprocessed tokens, and words ending in -ous in English are usually adjectives (nefarious, serious). If you tag it first, it can be recognised (even without context) as either a noun or a verb. The tagger can then use context (preceded by the? -> noun) to disambiguate which is the most likely one.
A good lemmatiser can take the part-of-speech into account, e.g. housing could be a noun (lemma: housing) or a verb (lemma: house). With part-of-speech information, a lemmatiser can make the correct choice there.
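To make the ordering concrete, here is a minimal NLTK sketch that tags the raw tokens first and then lemmatises using the tag; the wordnet_pos helper is just an illustrative mapping from Penn Treebank tags to WordNet POS constants:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'), nltk.download('wordnet')

def wordnet_pos(treebank_tag):
    # rough mapping from Penn Treebank tags to WordNet POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("She is housing the tools in the new houses")
tagged = nltk.pos_tag(tokens)  # tag the unprocessed tokens first
lemmas = [lemmatizer.lemmatize(word, wordnet_pos(tag)) for word, tag in tagged]
print(lemmas)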
Whether you use stemming or lemmatisation depends on your application. For many purposes they will be equivalent. The main differences, in my experience, are the following (see the short comparison sketch after this list):
Stemming is a lot faster, as stemmers have a few rules on how to handle various endings
Lemmatisation gives you 'proper' words which you can look up in dictionaries (if you want to get glosses in other languages or definitions)
Stemmed strings sometimes don't look anything like the original word, and if you present them to a human user they might get confused
Stemmers conflate words which have similar meanings but different lemmas, so for information retrieval they might be more useful
Stemmers don't need a word list, so if you want to write your own stemmer, it's quicker than writing a lemmatiser (if you're processing languages for which no ready-made tools exist)
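As a quick illustration of those differences, here is a minimal NLTK sketch comparing the Porter stemmer with the WordNet lemmatiser; the word list is arbitrary:
from nltk.stem import PorterStemmer, WordNetLemmatizer

# requires: nltk.download('wordnet')
ps = PorterStemmer()
wnl = WordNetLemmatizer()
for word in ["houses", "studies", "meeting", "happily"]:
    # stems are often not dictionary words; lemmas are
    print(word, ps.stem(word), wnl.lemmatize(word, 'n'))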
I would suggest using lemmatization over stemming; stemming just chops off letters from the end until the root/stem word is reached. Lemmatization also looks at the surrounding text to determine the given word's part of speech.
I am using NLTK to extract nouns from a text string. Each word already has a POS tag attached, in the (Ibaloi) language, and these will later be used to create a grammar:
sentence = "this is a tribal language"
words = nltk.word_tokenize(sentence)
taggedWords = nltk.pos_tag(words)
There is no problem in English. Is there a way to make it work for the tribal (Ibaloi) language as well?
(I am new to natural language processing and taking some tutorials, which are great, by the way.)
You may want to refer to this similar question, where the OP also had a wordlist containing words and their parts of speech (noun, verb, etc.) in an Excel file, for a language not in NLTK.
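If you have such a wordlist, one possible sketch (not a ready-made NLTK feature for Ibaloi, just a lookup tagger built from a word-to-tag dictionary) could look like this; the entries below are placeholders you would load from your Excel file, e.g. with pandas:
import nltk

# placeholder Ibaloi wordlist: word -> POS tag (load the real entries from your file)
ibaloi_wordlist = {"word1": "NOUN", "word2": "VERB"}

tagger = nltk.tag.UnigramTagger(model=ibaloi_wordlist)
tokens = "word1 word2 word3".split()
print(tagger.tag(tokens))  # words missing from the wordlist are tagged None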
If I process the sentence
'Return target card to your hand'
with spaCy and the en_core_web_lg model, it recognizes the tokens as below:
Return NOUN target NOUN card NOUN to ADP your ADJ hand NOUN
How can I force 'Return' to be tagged as a VERB? And how can I do it before the parser, so that the parser can better interpret relations between tokens?
There are other situations in which this would be useful. I am dealing with text which contains specific symbols such as {G}. These three characters should be considered a NOUN, as a whole, and {T} should be a VERB. But right now I do not know how to achieve that, without developing a new model for tokenizing and for tagging. If I could "force" a token, I could replace these symbols for something that would be recognized as one token and force it to be tagged appropriately. For example, I could replace {G} with SYMBOLG and force tagging SYMBOLG as NOUN.
EDIT: this solution used spaCy 2.0.12 (IIRC).
To answer the second part of your question, you can add special tokenisation rules to the tokeniser, as stated in the docs here. The following code should do what you want, assuming those symbols are unambiguous:
import spacy
from spacy.symbols import ORTH, POS, NOUN, VERB
nlp = spacy.load('en')
nlp.tokenizer.add_special_case('{G}', [{ORTH: '{G}', POS: NOUN}])
nlp.tokenizer.add_special_case('{T}', [{ORTH: '{T}', POS: VERB}])
doc = nlp('This {G} a noun and this is a {T}')
for token in doc:
    print('{:10}{:10}'.format(token.text, token.pos_))
Output for this is (the tags are not correct, but this shows the special case rules have been applied):
This DET
{G} NOUN
a DET
noun NOUN
and CCONJ
this DET
is VERB
a DET
{T} VERB
As for the first part of your question, the problem with assigning a part-of-speech to individual words is that they are mostly ambiguous out of context (e.g. is "return" a noun or a verb?). So the above method would not allow you to account for use in context and is likely to generate errors. spaCy does allow you to do token-based pattern matching, however, so that is worth having a look at. Maybe there is a way to do what you're after.
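For reference, here is a small sketch of token-based matching with spaCy's Matcher; it assumes a spaCy 3 installation and the en_core_web_sm model, and the pattern is only illustrative:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# illustrative pattern: find every occurrence of "return" and inspect its assigned POS
matcher.add("RETURN", [[{"LOWER": "return"}]])

doc = nlp("Return target card to your hand")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print(span.text, [t.pos_ for t in span])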