This question already has answers here:
Python NLTK pos_tag not returning the correct part-of-speech tag
I've been trying to improve the POS tagger in NLTK for a few days, but I cannot figure it out. Right now the default tagger is really inaccurate and tags most words as 'NN'. How can I improve the tagger to make it more accurate? I've already looked into training the tagger, but I can't get it to work.
Does anybody have a simple method for this? Thanks a lot.
Are you tagging one word at a time or a large corpus? POS tagging algorithms do use the probability that a word takes a given tag (e.g. 'NN'), but they also use the surrounding sentence context to predict it, so the more words you supply, the more accurate the tags are likely to be.
You can also try unigram, bigram, trigram, etc. taggers, chained together with backoff, to get higher accuracy at the cost of performance. You can read about doing that here: http://www.nltk.org/book/ch05.html
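Here is a minimal sketch of that backoff chain, assuming the tagged Penn Treebank sample that ships with NLTK is available locally (nltk.download('treebank')); it's an illustration, not the only way to train a tagger.
import nltk
from nltk.corpus import treebank

tagged_sents = treebank.tagged_sents()
train_sents, test_sents = tagged_sents[:3000], tagged_sents[3000:]

# Each tagger backs off to the previous one when it has not seen the context,
# so unknown words fall back to the default 'NN' instead of breaking the chain.
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

print(t3.evaluate(test_sents))   # accuracy on held-out sentences
print(t3.tag(['I', 'like', 'tomatoes', 'and', 'lettuce']))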
Related
I am new to NLP. My requirement is to parse meaning from sentences.
Example
"Perpetually Drifting is haunting in all the best ways."
"When The Fog Rolls In is a fantastic song
From above sentences, I need to extract the following sentences
"haunting in all the best ways."
"fantastic song"
Is it possible to achieve this in spacy?
It is not possible to extract summarized sentences directly with spaCy, but I hope the following methods might work for you:
The simplest one is to extract the noun phrases or verb phrases; most of the time that should give you the text you want (phrase structure grammar). See the spaCy sketch after this list.
You can use dependency parsing and extract the dependents of the head (root) word.
dependency grammar
You can train a sequence model where the input is the full sentence and the output is your summarized sentence.
Sequence models for text summarization
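For options 1 and 2, here is a minimal sketch with spaCy, assuming the small English model en_core_web_sm has been downloaded (python -m spacy download en_core_web_sm); the exact chunks and dependency labels you get depend on the model, so treat this as a starting point.
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("When The Fog Rolls In is a fantastic song.")

# Option 1: noun phrases (phrase-structure-style chunks); one of them
# should cover "a fantastic song".
print([chunk.text for chunk in doc.noun_chunks])

# Option 2: inspect the dependency parse and keep the relations you care
# about, e.g. the complement of the main verb.
for token in doc:
    print(token.text, token.dep_, token.head.text)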
Extracting the meaning of a sentence is a rather arbitrary task. What do you mean by "meaning"? Using spaCy you can extract the dependencies between the words (which encode much of a sentence's meaning), find the POS tags to check how words are used in the sentence, and also find places, organizations and people using the NER tagger. However, "the meaning of a sentence" is too general a notion, even for humans.
Maybe you are searching for a specific meaning? If that's the case, you have to train your own classifier. This will get you started.
If your task is summarization of a couple of sentences, consider also using gensim. You can have a look here.
Hope it helps :)
This question already has answers here:
How to intrepret Clusters results after using Doc2vec?
I have applied Doc2vec to convert documents into vectors. After that, I used the vectors for clustering and found the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster.
My question is: is there any way to figure out the most dominant or most similar terms/words of a document in Doc2vec? I am using Python's gensim package for the Doc2vec implementation.
#TrnKh's answer is good, but there is an additional option made available when using Doc2Vec.
Some gensim Doc2Vec training modes, either the default PV-DM (dm=1) or PV-DBOW with added word-training (dm=0, dbow_words=1), train both doc-vectors and word-vectors into the same coordinate space, and to some extent that means doc-vectors are near related word-vectors, and vice versa.
So you could take an individual document's vector, or the average/centroid vectors you've synthesized, and feed it to the model to find most_similar() words. (To be clear that this is a raw vector, rather than a list of vector-keys, you should use the form of most_similar() that specifies an explicit list of positive examples.)
For example:
docvec = d2v_model.docvecs['doc77145'] # assuming such a doc-tag exists
similar_words = d2v_model.most_similar(positive=[docvec])
print(similar_words)
To find the most dominant words of your clusters, you can use either of these two classic approaches. I personally found the second one very efficient and effective for this purpose.
Latent Dirichlet Allocation (LDA): a topic modelling algorithm that will give you a set of topics given a collection of documents. You can treat the set of similar documents in each cluster as one document and apply LDA to generate the topics and see the topic distributions across documents.
TF-IDF: TF-IDF calculates the importance of a word to a document given a collection of documents. Therefore, to find the most important keywords/n-grams, you can calculate TF-IDF for every word that appears in the documents. The words with the highest TF-IDF are then your keywords. So:
calculate IDF for every single word that appears in the documents based on the number of documents that contain that keyword
concatenate the text of the similar documents (I'd call it a super-document) and then calculate TF for each word that appears in this super-document
calculate TF*IDF for every word... and then TA DAAA... you have your keywords associated with each cluster.
Take a look at Section 5.1 here for more details on the use of TF-IDF.
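Here is a minimal sketch of that TF-IDF recipe using scikit-learn's TfidfVectorizer (an assumption; any TF-IDF implementation will do). The variables docs and clusters are hypothetical: docs is the full collection of raw texts, and clusters maps a cluster label to the indices of its most similar documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first document text ...", "second document text ...", "third document ..."]
clusters = {0: [0, 2], 1: [1]}

vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(docs)                                   # IDF from every document
terms = vectorizer.get_feature_names_out()             # scikit-learn >= 1.0; use get_feature_names() on older versions

for label, doc_ids in clusters.items():
    super_doc = " ".join(docs[i] for i in doc_ids)     # concatenate the similar docs
    scores = vectorizer.transform([super_doc]).toarray()[0]   # TF * IDF
    top = scores.argsort()[::-1][:10]                  # ten highest-scoring terms
    print(label, [terms[i] for i in top if scores[i] > 0])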
This question already has answers here:
nltk words corpus does not contain "okay"?
I'm building a text classifier that will classify text into topics.
In the first phase of my program, as part of cleaning the data, I remove all the non-English words. For this I'm using the nltk.corpus.words.words() corpus. The problem with this corpus is that it lacks 'modern' English words such as Facebook, Instagram, etc., so those get removed as well. Does anybody know of another, more 'modern' corpus that I can replace or union with the present one?
I'd prefer an NLTK corpus, but I'm open to other suggestions.
Thanks in advance
Rethink your approach. Any collection of English texts will have a "long tail" of words that you have not seen before. No matter how large a dictionary you amass, you'll be removing words that are not "non-English". And to what purpose? Leave them in, they won't spoil your classification.
If your goal is to remove non-English text, do it at the sentence or paragraph level using a statistical approach, e.g. ngram models. They work well and need minimal resources.
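If rolling your own n-gram model is overkill, here is a minimal sketch using the third-party langdetect package (pip install langdetect), assumed acceptable here as an off-the-shelf statistical detector:
from langdetect import detect

sentences = [
    "I love posting on Facebook and Instagram.",
    "Dies ist ein deutscher Satz.",
]
# Keep only sentences the detector labels as English.
english_only = [s for s in sentences if detect(s) == 'en']
print(english_only)   # should keep the first sentence and drop the German one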
I'd use Wikipedia, but it's pretty time-consuming to tokenize the entirety of it. Fortunately, that kind of work has already been done for you: you could use a pretrained Word2Vec model (trained on a corpus of roughly 100 billion words) and just check whether a word is in its vocabulary.
I also found this project where Chris made text files of the model's 3-million-word vocabulary.
Note that this project's list of words doesn't contain some stop words, so it'd be a good idea to find the union of your list from nltk and this one.
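For the vocabulary check itself, here is a minimal sketch with gensim, assuming the commonly shared pretrained binary GoogleNews-vectors-negative300.bin has already been downloaded (the file name/path is an assumption):
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

for word in ['Facebook', 'Instagram', 'lettuce', 'qwzxv']:
    print(word, word in kv)   # True if the word is in the model's vocabulary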
I am stuck on this issue and am not able to find relevant literature. I'm not sure if this is a coding question to begin with.
I have articles related to some disaster and I want to do a temporal classification of the text: I want to get the sentences/phrases that carry information about the period before the event. I know about classification from my background in ML, but I have no idea how to extract the relevant parts.
I have tried tokenizing the words and getting the relevant frequencies, and I have also tried POS tagging using MaxEnt.
I guess the problem reduces to analyzing the manually classified text and constructing features, but I am not sure how to extract patterns using the POS tags, or how to come up with an exhaustive set of features.
This question already has an answer here:
Extracting nouns from Noun Phase in NLP
Does anyone have an example of how to extract all nouns from a string using Python's NLTK?
For example, I have this string: "I Like Tomatoes and Lettuce". I want to build a method that returns "Tomatoes" and "Lettuce".
If not in Python, does anyone know of any other solution?
Get the NLTK package, and either use its built-in parser and then this method, or, much faster, part-of-speech tag the string and pull out all the words that have the tag NN; those are the nouns. Read up on the other part-of-speech tags to find out how you can properly handle words like "I" and "like".
Neither method is flawless, but it's about the best you can do. Accuracy of a good part-of-speech tagger will be above 95% on clean input. I don't think you can reach such accuracy with a WordNet-based method without a lot of extra work.
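Here is a minimal sketch of the tagging route using NLTK's default tagger (it needs the punkt and tagger data via nltk.download); the sentence is a lower-cased version of the one in the question, since unusual capitalization can confuse the tagger:
import nltk

sentence = "I like tomatoes and lettuce"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
print(tagged)

# Keep every word whose tag starts with 'NN' (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)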
Dave Taylor wrote an ad-lib generator in Bash that queried Princeton's WordNet to get this done. You could of course do something very similar in Python with WordNet's help.
Here is the link
Linux Journal - Dave Taylor adlib generator.
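Along the same lines, here is a rough sketch of a WordNet-only check in Python (it needs the NLTK wordnet data). Note that it over-generates, since many words such as "like" and "I" also have noun senses, which is the extra work the first answer alludes to:
from nltk.corpus import wordnet as wn

words = "I like tomatoes and lettuce".split()
# Keep words that have at least one noun synset in WordNet.
print([w for w in words if wn.synsets(w, pos=wn.NOUN)])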