TF-IDF algorithm on chinese text

TF-IDF algorithm on chinese text - python

I am doing TF-IDF on chinese text and searching for top 10 used words in the text.
when i getting the top 10 words i have some of the meaningless word like "成为", "表示" and other.
Is there is any ways which only get meaningful words?
I am using "jieba" to cut the chinese sentence to words

The words like "成为", "表示" are what we refer to as stop words. In many cases, they are commonly used words that provide little meaning within the sentence, think the words "a", and "the" in English.
It is sometimes necessary to remove these stopwords before performing analysis, especially for TF-IDF as it may lead to meaningless results as you have seen.
It seems that Jieba doesn't include the functionality to remove stop words, but genediazjr collected a fairly comprehensive list of stopwords for the Chinese language. You can import this and remove these stop words from your original text before the TF-IDF analysis.

Related

Python libary to build learning sentences

I need to remember some declinations in another language. I do this by building a sentence where the first letters represent the phrase I have to remember. Then I use a website dictionary to find words starting with this phrase and manually combine it to a sentence which is worth remembering.
I have distilled a Python-dict of word beginnings. One column represents one sentence to build. Now the steps I need tooling or sources for:
I now require a dictionary of English words and tooling to look up all works which match the beginnings in the dictionary.
From this distilled list I need to build a memorable sentence, combining nouns and verbs with the right beginnings.
Bonus: The tool generates easy to remember phrases and not just something random.
Is there any way to automate this in Python?

build WORD2VEC words dictionary to indicate feelings

I am working with Word2vec (Gensim, in python) to understand meaning of sentences (by each word in them).
My goal is to be able to realize if the sentence indicates about the feeling of the speaker.
Where can I find this kind of a dictionary of words?
For example one dictionary for words that indicate happiness and other for sadness.
Thanks

Try SentiWordNet
"SentiWordNet is a lexical resource for opinion mining that assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity"

How to replace similar meaning words with a common synonym in a text

I am doing a project where I have to make a graph out of words of a text. I want to replace similar meaning words with a common synonym. For example, if a text has the occurences of 'Murder', 'Kill' and 'slay'. I want to replace all of these with a common synonym (which may have a little different meaning) like 'Kill'. How can I do this in Python?
I have tried NLTK synsets, but could not come up with a way so that all similar words are replaced with synonyms.

It's a simple machine learning problem. Use any clustering algorithm.
Convert words to vectors, and group similar words based on their vector plot.
Select one word from each cluster to replace with its similar words.

word2vec gensim multiple languages

This problem is going completely over my head. I am training a Word2Vec model using gensim. I have provided data in multiple languages i.e. English and Hindi. When I am trying to find the words closest to 'man', this is what I am getting:
model.wv.most_similar(positive = ['man'])
Out[14]:
[('woman', 0.7380284070968628),
('lady', 0.6933152675628662),
('monk', 0.6662989258766174),
('guy', 0.6513140201568604),
('soldier', 0.6491742134094238),
('priest', 0.6440571546554565),
('farmer', 0.6366012692451477),
('sailor', 0.6297377943992615),
('knight', 0.6290514469146729),
('person', 0.6288090944290161)]
--------------------------------------------
Problem is, these are all English words. Then I tried to find similarity between same meaning Hindi and English words,
model.similarity('man', 'आदमी')
__main__:1: DeprecationWarning: Call to deprecated `similarity` (Method will
be removed in 4.0.0, use self.wv.similarity() instead).
Out[13]: 0.078265618974427215
This accuracy should have been better than all the other accuracies. The Hindi corpus I have has been made by translating the English one. Hence the words appear in similar contexts. Hence they should be close.
This is what I am doing here:
#Combining all the words together.
all_reviews=HindiWordsList + EnglishWordsList
#Training FastText model
cpu_count=multiprocessing.cpu_count()
model=Word2Vec(size=300,window=5,min_count=1,alpha=0.025,workers=cpu_count,max_vocab_size=None,negative=10)
model.build_vocab(all_reviews)
model.train(all_reviews,total_examples=model.corpus_count,epochs=model.iter)
model.save("word2vec_combined_50.bin")

I have been dealing with a very similar problem and came across a reasonably robust solution. This paper shows that a linear relationship can be defined between two Word2Vec models that have been trained on different languages. This means you can derive a translation matrix to convert word embeddings from one language model into the vector space of another language model. What does all of that mean? It means I can take a word from one language, and find words in the other language that have a similar meaning.
I've written a small Python package that implements this for you: transvec. Here's an example where I use pre-trained models to search for Russian words and find English words with a similar meaning:
import gensim.downloader
from transvec.transformers import TranslationWordVectorizer
# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")
# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]
bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)
# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]
Don't worry about the weird POS tags on the Russian words - this is just a quirk of the particular pre-trained model I used.
So basically, if you can provide a list of words with their translations, then you can train a TranslationWordVectorizer to translate any word that exists in your source language corpus into the target language. When I used this for real, I produced some training data by extracting all the individual Russian words from my data, running them through Google Translate and then keeping everything that translated to a single word in English. The results were pretty good (sorry I don't have any more detail for the benchmark yet; it's still a work in progress!).

First of all, you should really use self.wv.similarity().
I'm assuming there are very close to no words that exist in both between your Hindi corpus and English corpus, since Hindi corpus is in Devanagari and English is in, well, English. Simply adding two corpuses together to make a model does not make sense. Corresponding words in the two languages co-occur in two versions of a document, but not in your word embeddings for Word2Vec to figure out most similar.
Eg. Until your model knows that
Man:Aadmi::Woman:Aurat,
from the word embeddings, it can never make out the
Raja:King::Rani:Queen
relation. And for that, you need some anchor between the two corpuses.
Here are a few suggestions that you can try out:
Make an independent Hindi corpus/model
Maintain and lookup data of a few English->Hindi word pairs that you have will have to create manually.
Randomly replace input document words with their counterparts from the corresponding document while training
These might be enough to give you an idea. You can also look into seq2seq if you want only want to do translations. You can also read the Word2Vec theory in detail to understand what it does.

After reading the comments, I think that the problem is in the very different grammatical construction between English and Hindi sentences. I have worked with Hindi NLP models and it is much more difficult to get similar results as English (since you mention it).
In Hindi there's no order between words at all, only when declining them. Moreover, the translation of a sentence between languages that are not even descendants of the same root language is somewhat random and you can not assume that the contexts of both sentences are similar.

Find 'modern' nltk words corpus [duplicate]

This question already has answers here:
nltk words corpus does not contain "okay"?
(2 answers)
Closed 5 years ago.
I'm building a text classifier that will classify text into topics.
In the first phase of my program as a part of cleaning the data, I remove all the non-English words. For this I'm using the nltk.corpus.words.words() corpus. The problem with this corpus is that it removes 'modern' English words such as Facebook, Instagram etc. Does anybody know another, more 'modern' corpus which I can replace or union with the present one?
I prefer nltk corpus but I'm open to other suggestions.
Thanks in advance

Rethink your approach. Any collection of English texts will have a "long tail" of words that you have not seen before. No matter how large a dictionary you amass, you'll be removing words that are not "non-English". And to what purpose? Leave them in, they won't spoil your classification.
If your goal is to remove non-English text, do it at the sentence or paragraph level using a statistical approach, e.g. ngram models. They work well and need minimal resources.

I'd use Wikipedia, but it's pretty time consuming to tokenize the entirety of it. Fortunately, it's been done for you already. You could use a Word2Vec model trained on 100 billion words of wikipedia and just check if the word is in the model.
I also found this project where Chris made text files of the 3 millions word vocabulary of the model.
Note that this project's list of words doesn't contain some stop words, so it'd be a good idea to find the union of your list from nltk and this one.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

TF-IDF algorithm on chinese text - python

I am doing TF-IDF on chinese text and searching for top 10 used words in the text. when i getting the top 10 words i have some of the meaningless word like "成为", "表示" and other. Is there is any ways which only get meaningful words? I am using "jieba" to cut the chinese sentence to words

Related

Python libary to build learning sentences

build WORD2VEC words dictionary to indicate feelings

How to replace similar meaning words with a common synonym in a text

word2vec gensim multiple languages

Find 'modern' nltk words corpus [duplicate]

Categories

Resources