Find 'modern' nltk words corpus [duplicate]

Find 'modern' nltk words corpus [duplicate] - python

This question already has answers here:
nltk words corpus does not contain "okay"?
(2 answers)
Closed 5 years ago.
I'm building a text classifier that will classify text into topics.
In the first phase of my program as a part of cleaning the data, I remove all the non-English words. For this I'm using the nltk.corpus.words.words() corpus. The problem with this corpus is that it removes 'modern' English words such as Facebook, Instagram etc. Does anybody know another, more 'modern' corpus which I can replace or union with the present one?
I prefer nltk corpus but I'm open to other suggestions.
Thanks in advance

Rethink your approach. Any collection of English texts will have a "long tail" of words that you have not seen before. No matter how large a dictionary you amass, you'll be removing words that are not "non-English". And to what purpose? Leave them in, they won't spoil your classification.
If your goal is to remove non-English text, do it at the sentence or paragraph level using a statistical approach, e.g. ngram models. They work well and need minimal resources.

I'd use Wikipedia, but it's pretty time consuming to tokenize the entirety of it. Fortunately, it's been done for you already. You could use a Word2Vec model trained on 100 billion words of wikipedia and just check if the word is in the model.
I also found this project where Chris made text files of the 3 millions word vocabulary of the model.
Note that this project's list of words doesn't contain some stop words, so it'd be a good idea to find the union of your list from nltk and this one.

Related

TF-IDF algorithm on chinese text

I am doing TF-IDF on chinese text and searching for top 10 used words in the text.
when i getting the top 10 words i have some of the meaningless word like "成为", "表示" and other.
Is there is any ways which only get meaningful words?
I am using "jieba" to cut the chinese sentence to words

The words like "成为", "表示" are what we refer to as stop words. In many cases, they are commonly used words that provide little meaning within the sentence, think the words "a", and "the" in English.
It is sometimes necessary to remove these stopwords before performing analysis, especially for TF-IDF as it may lead to meaningless results as you have seen.
It seems that Jieba doesn't include the functionality to remove stop words, but genediazjr collected a fairly comprehensive list of stopwords for the Chinese language. You can import this and remove these stop words from your original text before the TF-IDF analysis.

build WORD2VEC words dictionary to indicate feelings

I am working with Word2vec (Gensim, in python) to understand meaning of sentences (by each word in them).
My goal is to be able to realize if the sentence indicates about the feeling of the speaker.
Where can I find this kind of a dictionary of words?
For example one dictionary for words that indicate happiness and other for sadness.
Thanks

Try SentiWordNet
"SentiWordNet is a lexical resource for opinion mining that assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity"

How to Grab meaning of sentence using NLP?

I am new to NLP. My requirement is to parse meaning from sentences.
Example
"Perpetually Drifting is haunting in all the best ways."
"When The Fog Rolls In is a fantastic song
From above sentences, I need to extract the following sentences
"haunting in all the best ways."
"fantastic song"
Is it possible to achieve this in spacy?

It is not possible to extract the summarized sentences using spacy. I hope the following methods might work for you
Simplest one is extract the noun phrases or verb phrases. Most of the time that should give the text what you want.(Phase struce grammar).
You can use dependency parsing and extract the center word dependencies.
dependency grammar
You can train an sequence model where input is going to be the full sentence and output will be your summarized sentence.
Sequence models for text summaraization

Extracting the meaning of a sentence is a quite arbitrary task. What do you mean by the meaning? Using spaCy you can extract the dependencies between the words (which specify the meaning of the sentence), find the POS tags to check how words are used in the sentence and also find places, organizations, people using NER tagger. However, meaning of the sentence is too general even for the humans.
Maybe you are searching for a specific meaning? If that's the case, you have to train your own classifier. This will get you started.
If your task is summarization of a couple of sentences, consider also using gensim . You can have a look here.
Hope it helps :)

word2vec gensim multiple languages

This problem is going completely over my head. I am training a Word2Vec model using gensim. I have provided data in multiple languages i.e. English and Hindi. When I am trying to find the words closest to 'man', this is what I am getting:
model.wv.most_similar(positive = ['man'])
Out[14]:
[('woman', 0.7380284070968628),
('lady', 0.6933152675628662),
('monk', 0.6662989258766174),
('guy', 0.6513140201568604),
('soldier', 0.6491742134094238),
('priest', 0.6440571546554565),
('farmer', 0.6366012692451477),
('sailor', 0.6297377943992615),
('knight', 0.6290514469146729),
('person', 0.6288090944290161)]
--------------------------------------------
Problem is, these are all English words. Then I tried to find similarity between same meaning Hindi and English words,
model.similarity('man', 'आदमी')
__main__:1: DeprecationWarning: Call to deprecated `similarity` (Method will
be removed in 4.0.0, use self.wv.similarity() instead).
Out[13]: 0.078265618974427215
This accuracy should have been better than all the other accuracies. The Hindi corpus I have has been made by translating the English one. Hence the words appear in similar contexts. Hence they should be close.
This is what I am doing here:
#Combining all the words together.
all_reviews=HindiWordsList + EnglishWordsList
#Training FastText model
cpu_count=multiprocessing.cpu_count()
model=Word2Vec(size=300,window=5,min_count=1,alpha=0.025,workers=cpu_count,max_vocab_size=None,negative=10)
model.build_vocab(all_reviews)
model.train(all_reviews,total_examples=model.corpus_count,epochs=model.iter)
model.save("word2vec_combined_50.bin")

I have been dealing with a very similar problem and came across a reasonably robust solution. This paper shows that a linear relationship can be defined between two Word2Vec models that have been trained on different languages. This means you can derive a translation matrix to convert word embeddings from one language model into the vector space of another language model. What does all of that mean? It means I can take a word from one language, and find words in the other language that have a similar meaning.
I've written a small Python package that implements this for you: transvec. Here's an example where I use pre-trained models to search for Russian words and find English words with a similar meaning:
import gensim.downloader
from transvec.transformers import TranslationWordVectorizer
# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")
# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]
bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)
# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]
Don't worry about the weird POS tags on the Russian words - this is just a quirk of the particular pre-trained model I used.
So basically, if you can provide a list of words with their translations, then you can train a TranslationWordVectorizer to translate any word that exists in your source language corpus into the target language. When I used this for real, I produced some training data by extracting all the individual Russian words from my data, running them through Google Translate and then keeping everything that translated to a single word in English. The results were pretty good (sorry I don't have any more detail for the benchmark yet; it's still a work in progress!).

First of all, you should really use self.wv.similarity().
I'm assuming there are very close to no words that exist in both between your Hindi corpus and English corpus, since Hindi corpus is in Devanagari and English is in, well, English. Simply adding two corpuses together to make a model does not make sense. Corresponding words in the two languages co-occur in two versions of a document, but not in your word embeddings for Word2Vec to figure out most similar.
Eg. Until your model knows that
Man:Aadmi::Woman:Aurat,
from the word embeddings, it can never make out the
Raja:King::Rani:Queen
relation. And for that, you need some anchor between the two corpuses.
Here are a few suggestions that you can try out:
Make an independent Hindi corpus/model
Maintain and lookup data of a few English->Hindi word pairs that you have will have to create manually.
Randomly replace input document words with their counterparts from the corresponding document while training
These might be enough to give you an idea. You can also look into seq2seq if you want only want to do translations. You can also read the Word2Vec theory in detail to understand what it does.

After reading the comments, I think that the problem is in the very different grammatical construction between English and Hindi sentences. I have worked with Hindi NLP models and it is much more difficult to get similar results as English (since you mention it).
In Hindi there's no order between words at all, only when declining them. Moreover, the translation of a sentence between languages that are not even descendants of the same root language is somewhat random and you can not assume that the contexts of both sentences are similar.

Extracting all nouns from a string [duplicate]

This question already has an answer here:
Closed 11 years ago.
Possible Duplicate:
Extracting nouns from Noun Phase in NLP
Do anyone have some examples on how to extract all nouns from a string using Python's NLTK?
For example, i have this string: "I Like Tomatoes and Lettuce". I want to build a method that returns "Tomatoes" and "Lettuce."
If not in Python, does anyone know of any other solution?

Get the NLTK package, and either use its built-in parser then this method; or, much faster, part-of-speech tag the string and get all the words out that have the tag NN; those are the nouns. Read up on other part-of-speech tags to find out how you can properly extract I and like.
Neither method is flawless, but it's about the best you can do. Accuracy of a good part-of-speech tagger will be above 95% on clean input. I don't think you can reach such accuracy with a WordNet-based method without a lot of extra work.

Dave Taylor wrote an adlib generator using Bash that queried Princetons wordnet to get this done. You could do something very similar in python of course with wordnets help.
Here is the link
Linux Journal - Dave Taylor adlib generator.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Find 'modern' nltk words corpus [duplicate] - python

Related

TF-IDF algorithm on chinese text

build WORD2VEC words dictionary to indicate feelings

How to Grab meaning of sentence using NLP?

word2vec gensim multiple languages

Extracting all nouns from a string [duplicate]

Categories

Resources