word2vec gensim multiple languages

This problem is going completely over my head. I am training a Word2Vec model using gensim. I have provided data in multiple languages i.e. English and Hindi. When I am trying to find the words closest to 'man', this is what I am getting:
model.wv.most_similar(positive = ['man'])
Out[14]:
[('woman', 0.7380284070968628),
('lady', 0.6933152675628662),
('monk', 0.6662989258766174),
('guy', 0.6513140201568604),
('soldier', 0.6491742134094238),
('priest', 0.6440571546554565),
('farmer', 0.6366012692451477),
('sailor', 0.6297377943992615),
('knight', 0.6290514469146729),
('person', 0.6288090944290161)]
--------------------------------------------
The problem is that these are all English words. I then checked the similarity between a Hindi word and an English word with the same meaning:
model.similarity('man', 'आदमी')
__main__:1: DeprecationWarning: Call to deprecated `similarity` (Method will
be removed in 4.0.0, use self.wv.similarity() instead).
Out[13]: 0.078265618974427215
This similarity should have been higher than all the others. The Hindi corpus was made by translating the English one, so the words appear in similar contexts and should therefore be close.
This is what I am doing:
import multiprocessing
from gensim.models import Word2Vec
# Combine the Hindi and English data into one corpus.
all_reviews = HindiWordsList + EnglishWordsList
# Train the Word2Vec model (gensim 3.x API; gensim 4 renames `size` to
# `vector_size` and `model.iter` to `model.epochs`).
cpu_count = multiprocessing.cpu_count()
model = Word2Vec(size=300, window=5, min_count=1, alpha=0.025, workers=cpu_count, max_vocab_size=None, negative=10)
model.build_vocab(all_reviews)
model.train(all_reviews, total_examples=model.corpus_count, epochs=model.iter)
model.save("word2vec_combined_50.bin")

I have been dealing with a very similar problem and came across a reasonably robust solution. This paper shows that a linear relationship can be defined between two Word2Vec models that have been trained on different languages. This means you can derive a translation matrix to convert word embeddings from one language model into the vector space of another language model. What does all of that mean? It means I can take a word from one language, and find words in the other language that have a similar meaning.
I've written a small Python package that implements this for you: transvec. Here's an example where I use pre-trained models to search for Russian words and find English words with a similar meaning:
import gensim.downloader
from transvec.transformers import TranslationWordVectorizer
# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")
# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]
bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)
# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]
Don't worry about the weird POS tags on the Russian words - this is just a quirk of the particular pre-trained model I used.
So basically, if you can provide a list of words with their translations, then you can train a TranslationWordVectorizer to translate any word that exists in your source language corpus into the target language. When I used this for real, I produced some training data by extracting all the individual Russian words from my data, running them through Google Translate and then keeping everything that translated to a single word in English. The results were pretty good (sorry I don't have any more detail for the benchmark yet; it's still a work in progress!).

First of all, you should really use model.wv.similarity(), as the deprecation warning says.
I'm assuming that almost no words occur in both your Hindi corpus and your English corpus, since the Hindi corpus is in Devanagari and the English one is in the Latin alphabet. Simply concatenating the two corpora to train one model does not make sense. Corresponding words in the two languages co-occur across the two versions of a document, but not within the same sentence, so Word2Vec has nothing from which to learn that they are similar.
E.g., until your model knows from the word embeddings that
Man:Aadmi::Woman:Aurat,
it can never make out the
Raja:King::Rani:Queen
relation. For that, you need some anchor between the two corpora.
Here are a few suggestions that you can try out:
Make an independent Hindi corpus/model.
Maintain and look up a table of a few English->Hindi word pairs, which you will have to create manually.
Randomly replace words in an input document with their counterparts from the corresponding translated document while training.
These might be enough to give you an idea. You can also look into seq2seq if you only want to do translation, and read the Word2Vec theory in detail to understand what it does.
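As a rough illustration of the third suggestion, here is a minimal sketch of randomly swapping words for their translations before training. The function name mix_languages, the swap_prob parameter and the tiny en_to_hi dictionary are all made up for this example; a real seed dictionary would have to be built by hand or taken from an existing resource.

```python
import random

def mix_languages(tokens, bilingual_dict, swap_prob=0.5, rng=None):
    """Randomly replace tokens with their translation from bilingual_dict."""
    rng = rng or random.Random()
    mixed = []
    for token in tokens:
        translation = bilingual_dict.get(token)
        # Only swap words that have a known translation, and only sometimes,
        # so that both languages end up sharing context windows.
        if translation is not None and rng.random() < swap_prob:
            mixed.append(translation)
        else:
            mixed.append(token)
    return mixed

# Hypothetical seed dictionary (English -> Hindi), for illustration only.
en_to_hi = {"man": "आदमी", "woman": "औरत", "king": "राजा", "queen": "रानी"}

sentence = ["the", "king", "greeted", "the", "man"]
print(mix_languages(sentence, en_to_hi, swap_prob=0.5, rng=random.Random(0)))
```

The mixed sentences would then be added to the training corpus, so that Hindi and English words actually appear next to each other and Word2Vec has a chance to place them in the same region of the vector space.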

After reading the comments, I think the problem lies in the very different grammatical structure of English and Hindi sentences. I have worked with Hindi NLP models, and it is much more difficult to get results comparable to English (since you mention it).
Hindi word order is far less fixed than in English; much of the grammatical information is carried by inflection and postpositions instead. Moreover, translating a sentence between two such distantly related languages is somewhat arbitrary, so you cannot assume that the contexts of the two sentences are similar.

Related

Alternatives to NER taggers for long, heterogeneous phrases?

I am looking for ideas/thoughts on the following problem:
I am working with food ingredient data such as: milk, sugar, eggs, flour, may contain nuts
From such a piece of text I want to be able to identify and extract phrases like may contain nuts, so that I can preprocess them separately.
These kinds of phrases can change quite a lot in terms of length and content. I thought of using NER taggers, but I don't know if they will do the job correctly as they are mainly used for identifying single-word entities...
Any ideas on what to use as a phrase-entity-recognition system? Also which package would you use? Cheers

IMHO NER (or model-based entity extraction in general) alone is a poor choice of methodology for this particular problem as it requires LOTS of manual annotation to do it right. Instead I suggest using Word2Vec (https://radimrehurek.com/gensim/models/word2vec.html) with phrasing (https://radimrehurek.com/gensim/models/phrases.html).
The idea is to have an unsupervised model containing phrases and their similarities, which can then be queried with some seed words to list all possible ingredients (e.g. "cat" produces similar words like "dog" or "rat"). The next step would be either to create dictionaries containing the ingredient words and phrases, or to try clustering the vocabulary of the model using the cosine similarity between each word/phrase pair.
Now if you want to take things further you can always match your created dictionaries/clusters back to the corpus the W2V model was trained on and then train a custom entity recognition model using those matches as you now have annotated examples.

I believe this is a Multiword-Expression problem.
There are a few ways you can try to solve this:
Build a named entity recognition model (NER)
Search with Regex for a fixed set of known phrases
Chunking tokens with POS tags
Find collocations of tokens
Let's look at each of these
Build a named entity recognition model (NER)
Named entity recognition labels spans of tokens with an entity type.
For each input token, you have to label whether it is part of a named entity:
Eddy N PERSON
Bonte N PERSON
is V O
woordvoerder N O
van Prep O
diezelfde Pron O
Hogeschool N ORG
. Punc O
This is costly and requires a lot of time for labelling.
It is probably not a good choice for your task.
Search with Regex
This is not a bad idea. Using some known phrases, you could easily search input texts, with word boundaries as minimal context.
import re
re.findall(r"\bmay contain nuts\b", text)
This would require you knowing all phrases you want to search for up front, and might not be possible.
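If you do have such a list, a minimal sketch could compile all known phrases into one pattern. The phrase list and the helper name find_known_phrases are invented for this example.

```python
import re

# Hypothetical list of known phrases; extend it for your own data.
KNOWN_PHRASES = ["may contain nuts", "may contain traces of milk", "contains soy"]

# Longest phrases first, so the alternation prefers the most specific match.
_alternatives = "|".join(
    re.escape(p) for p in sorted(KNOWN_PHRASES, key=len, reverse=True)
)
_pattern = re.compile(r"\b(?:" + _alternatives + r")\b", flags=re.IGNORECASE)

def find_known_phrases(text):
    """Return all known phrases found in text, lowercased."""
    return [m.group(0).lower() for m in _pattern.finditer(text)]

text = "milk, sugar, eggs, flour, May contain nuts"
print(find_known_phrases(text))  # ['may contain nuts']
```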
Chunking tokens with POS tags
This could be a good intermediate step but could give many false positives.
You could do this by knowing the sequences of POS tags you expect:
may MD
contain VB
nuts NNS
Then you could use chunking with the known tag sequence (MD, VB, NNS).
The problem is that you may not know the tag sequences in advance, and you would have to cover many cases. It will also capture many sequences that you don't want (false positives).
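To make the idea concrete, here is a small sketch that scans already-tagged tokens for a fixed tag sequence. The tags are written by hand here; in practice they would come from a POS tagger, and the function name find_tag_sequences is just for this example.

```python
def find_tag_sequences(tagged_tokens, tag_pattern):
    """Return every run of tokens whose POS tags equal tag_pattern."""
    n = len(tag_pattern)
    matches = []
    for start in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[start:start + n]
        if [tag for _, tag in window] == list(tag_pattern):
            matches.append(" ".join(word for word, _ in window))
    return matches

# Hand-tagged example input (assumed tagger output).
tagged = [("milk", "NN"), (",", ","), ("sugar", "NN"), (",", ","),
          ("may", "MD"), ("contain", "VB"), ("nuts", "NNS")]

print(find_tag_sequences(tagged, ("MD", "VB", "NNS")))  # ['may contain nuts']
```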
Find collocations of tokens
This is probably the best way, as it seems you are looking for highly common sequences of words (tokens) in a corpus.
You can do this using:
Word2Vec Phrases
NLTK Collocations
Both do essentially the same thing: they look for sequences of tokens that occur together in a corpus more often than chance would predict.
That can then be used to extract the same collocation phrases from new texts.
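To show what "statistically common" can mean, here is a self-contained sketch that scores adjacent word pairs by pointwise mutual information (PMI). This is one common association measure, not necessarily the exact scoring that gensim or NLTK use by default, and the toy corpus is invented.

```python
import math
from collections import Counter

def score_bigrams(sentences, min_count=2):
    """Score adjacent token pairs by PMI; pairs rarer than min_count are skipped."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / total_bigrams
        p_a = unigrams[a] / total_unigrams
        p_b = unigrams[b] / total_unigrams
        # PMI is high when the pair occurs more often than chance predicts.
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores

# Toy corpus of tokenized ingredient lists (made up).
corpus = [
    ["milk", "sugar", "may", "contain", "nuts"],
    ["flour", "eggs", "may", "contain", "nuts"],
    ["sugar", "flour", "milk"],
    ["eggs", "milk", "sugar"],
]

scores = score_bigrams(corpus)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Pairs with a high score are candidates for being merged into a single phrase token; chaining high-scoring pairs gives longer phrases such as may contain nuts.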

It looks like your ingredient list is easy to split into a list. In that case you don't really need a sequence tagger; I wouldn't treat this problem as phrase extraction or NER. What I would do is train a classifier on different items in the list to label them as "food" or "non-food". You should be able to start with rules and train a basic classifier using anything really.
Before training a model, an even simpler step would be to run each list item through a PoS tagger (say spaCy), and if there's a verb you can guess that it's not a food item.

What's a good way to match text to sets of keywords (NLP)?

I'm trying to match an input text (e.g. a headline of a news article) to sets of keywords, s.t. the best-matching set can be selected.
Let's assume, I have some sets of keywords:
[['democracy', 'votes', 'democrats'], ['health', 'corona', 'vaccine', 'pandemic'], ['security', 'police', 'demonstration']]
and as input the (hypothetical) headline: New Pfizer vaccine might beat COVID-19 pandemic in the next few months. Obviously, it fits the second set of keywords well.
Exact word matching is one way to do it, but more complex situations might arise, for which it might make sense to use the base forms of words (e.g. duck instead of ducks, or run instead of running) to enhance the algorithm. Now we're talking NLP already.
I experimented with Spacy word and document embeddings (example) to determine similarity between a headline and each set of keywords. Is it a good idea to calculate document similarity between a full sentence and a limited number of keywords? Are there other ways?
Related: What NLP tools to use to match phrases having similar meaning or semantics

There is no single correct solution for such a task. You have to try what fits your problem!
Possible ways to solve your problem that I can think of:
Matching: either exact or more elaborate, such as lemmatization/stemming or Levenshtein distance.
Embedding similarity: I guess word-level similarity would outperform document-to-keywords similarity, but again, just experiment with it.
Classification: Your problem looks like a classic classification problem, with each set being one class. If you don't have enough labeled training data, you could try active learning.
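As a baseline for the matching option, here is a minimal sketch that counts overlapping normalized words. The set names ("politics", "health", "security") are labels I made up for the question's three lists, and normalize() is a deliberately crude stand-in for a real lemmatizer or stemmer.

```python
import re

# The question's keyword sets, with hypothetical names added as labels.
KEYWORD_SETS = {
    "politics": ["democracy", "votes", "democrats"],
    "health": ["health", "corona", "vaccine", "pandemic"],
    "security": ["security", "police", "demonstration"],
}

def normalize(word):
    """Crude stand-in for lemmatization: lowercase and strip a plural 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def best_matching_set(text, keyword_sets):
    """Return the name of the set sharing the most words with text, plus all scores."""
    tokens = {normalize(t) for t in re.findall(r"[A-Za-z0-9]+", text)}
    scores = {
        name: len(tokens & {normalize(k) for k in keywords})
        for name, keywords in keyword_sets.items()
    }
    return max(scores, key=scores.get), scores

headline = "New Pfizer vaccine might beat COVID-19 pandemic in the next few months."
best, scores = best_matching_set(headline, KEYWORD_SETS)
print(best, scores)
```

If every set scores zero, the headline shares no surface words with any set; that is the case where embedding similarity or a trained classifier becomes worth the extra effort.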

Need of context while using Word2Vec

I have a large number of strings in a list:
A small example of the list contents is :
["machine learning","Apple","Finance","AI","Funding"]
I wish to convert these into vectors and use them for clustering purpose.
Is the context of these strings in the sentences considered while finding out their respective vectors?
How should I go about getting the vectors of these strings if I have just this list containing the strings?
This is the code I have so far:
from gensim.models import Word2Vec
vec = Word2Vec(mylist)
P.S. Also, can I get a good reference/tutorial on Word2Vec?

To find word vectors using word2vec you need a list of sentences, each of which is itself a list of words, not a flat list of strings.
Word2vec goes through every word in a sentence and, for each word, tries to predict the words around it within a specified window (typically around 5), adjusting the vector associated with that word so that the error is minimized.
Obviously, this means that the order of words matters when finding word vectors. If you just supply a list of strings without a meaningful order, you will not get a good embedding.
I'm not sure, but I think you will find LDA better suited to this case, because your list of strings doesn't have any inherent order.
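To illustrate the window idea, here is a small sketch that lists the (target, context) pairs a window-based model would see for one sentence. It only shows where the training signal comes from; it is not gensim's implementation, and the example sentence is invented.

```python
def context_pairs(tokens, window=2):
    """List (target, context) pairs for every token within the given window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["apple", "raised", "funding", "for", "machine", "learning"]
for pair in context_pairs(sentence, window=1):
    print(pair)

# An isolated one-token "sentence" has no neighbours, hence nothing to learn from.
print(context_pairs(["AI"]))  # []
```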

Answers to your 2 questions:
Is the context of these strings in the sentences considered while finding out their respective vectors?
Yes, word2vec creates one vector per word (or per string, since it can treat a multiword expression such as New York as a single word); this vector describes the word by its context. It assumes that similar words appear in similar contexts. The context is composed of the surrounding words (in a window, with either the continuous bag-of-words or the skip-gram model).
How should I go about with getting the vectors of these strings if i have just this list containing the strings?
You need more words. The quality of Word2Vec's output depends on the size of the training set. Training Word2Vec on just your data makes no sense.
The links provided by @Beta are a good introduction/explanation.

Word2Vec is an artificial neural network method. It creates embeddings that reflect the relationships among words. The links below will help you get complete code for implementing Word2Vec.
Some good links are this and this. For the second link, try his GitHub repo for the detailed code; the blog only explains the major parts. The main article is this.
You can use the following code to convert words to their corresponding integer IDs.
from collections import Counter
words = ["machine", "learning", "AI", "machine"]  # example input tokens
word_counts = Counter(words)
# Most frequent words get the smallest integer IDs.
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

word2vec + context = doc2vec
Build sentences from text you have and tag them with labels.
Train doc2vec on tagged sentences to get vectors for each label embedded in the same space as words.
Then you can do vector inference and get labels for an arbitrary piece of text.

nltk built-in (or easily obtainable) probabilistic parsing models?

So I want to analyze the structure of fairly arbitrary English sentences with nltk. There seem to be lots of classes for doing this (e.g. PCFGs, ProbabilisticProjectiveDependencyParsers), but do they all require data to train on? Does NLTK come with data that can be used to train such parsers for arbitrary English (i.e., I don't need exotic words, but basic English sentences should work)?
The demo for the PPDP, for instance, seems to use a data set for Dutch. Further, this data set seems incomplete: it doesn't seem to be able to parse sentences with 'Ik' ('I' in Dutch according to Google Translate).

How to tokenize a Malayalam word?

ഇതുഒരുസ്ടലംമാണ്
itu oru stalam anu
This is a Unicode string meaning "this is a place".
import nltk
nltk.wordpunct_tokenize('ഇതുഒരുസ്ഥാലമാണ് '.decode('utf8'))
is not working for me.
nltk.word_tokenize('ഇതുഒരുസ്ഥാലമാണ് '.decode('utf8'))
is also not working.
Other examples:
"കണ്ടില്ല " = കണ്ടു +ഇല്ല,
"വലിയൊരു" = വലിയ + ഒരു
Right Split :
ഇത് ഒരു സ്ഥാലം ആണ്
output:
[u'\u0d07\u0d24\u0d4d\u0d12\u0d30\u0d41\u0d38\u0d4d\u0d25\u0d32\u0d02\u0d06\u0d23\u0d4d']
I just need to split the words as shown in the other examples. The "other examples" section is for testing.
The problem is not with Unicode. It is with the morphology of the language. For this purpose you need to use a morphological analyzer.
Have a look at this paper:
http://link.springer.com/chapter/10.1007%2F978-3-642-27872-3_38

After a crash course in the language from Wikipedia (http://en.wikipedia.org/wiki/Malayalam), I see some issues with your question and with the tools you tried to use to get your desired output.
Conflated Task
Firstly, the OP conflated the tasks of morphological analysis, segmentation and tokenization. There is often only a fine distinction between them, especially for agglutinative languages such as Turkish and Malayalam (see http://en.wikipedia.org/wiki/Agglutinative_language).
Agglutinative NLP and best practices
Next, I don't think a tokenizer is appropriate for Malayalam, an agglutinative language. Researchers working on Turkish, one of the most studied agglutinative languages in NLP, have adopted a different strategy when it comes to "tokenization": they found that a full-blown morphological analyzer is necessary (see http://www.denizyuret.com/2006/11/turkish-resources.html, www.andrew.cmu.edu/user/ko/downloads/lrec.pdf).
Word Boundaries
Tokenization is defined as the identification of linguistically meaningful units (LMU) from the surface text (see Why do I need a tokenizer for each language?), and different languages require different tokenizers to identify their word boundaries. People have approached the problem of finding word boundaries in different ways, but in summary the NLP community has settled on the following:
Agglutinative languages require a full-blown morphological analyzer trained with some sort of language model. There is often only a single tier when identifying what a token is, and that is the morphemic level; hence the NLP community has developed different language models for the respective morphological analysis tools.
Polysynthetic languages with specified word boundaries have the choice of a two-tier tokenization, where the system first identifies isolated words and then, if necessary, performs morphological analysis to obtain finer-grained tokens. A coarse-grained tokenizer can split a string on certain delimiters (e.g. NLTK's word_tokenize or wordpunct_tokenize, which use whitespace/punctuation for English). Then, for finer-grained analysis at the morphemic level, people usually use finite-state machines to split words into morphemes (e.g. in German http://canoo.net/services/WordformationRules/Derivation/To-N/N-To-N/Pre+Suffig.html).
Polysynthetic languages without specified word boundaries often require a segmenter first to add whitespace between the tokens, because the orthography doesn't mark word boundaries (e.g. in Chinese https://code.google.com/p/mini-segmenter/). Then, from the delimited tokens, morphemic analysis can be done if necessary to produce finer-grained tokens (e.g. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html). Often these finer-grained tokens are tied to POS tags.
In brief, the answer to the OP's question is that the OP used the wrong tools for the task:
To output tokens for Malayalam, a morphological analyzer is necessary; a simple coarse-grained tokenizer in NLTK will not work.
NLTK's tokenizers are meant for polysynthetic languages with specified word boundaries (e.g. English/European languages), so it is not that the tokenizer is not working for Malayalam; it just wasn't meant to tokenize agglutinative languages.
To achieve the output, a full-blown morphological analyzer needs to be built for the language, and someone has built one (aclweb.org/anthology//O/O12/O12-1028.pdf); the OP should contact the authors of the paper if he/she is interested in the tool.
Short of building a morphological analyzer with a language model, I encourage the OP to first look for common delimiters that split words into morphemes in the language and then use a simple re.split() to get a baseline tokenizer.

A tokenizer is indeed the right tool; certainly this is what the NLTK calls them. A morphological analyzer (as in the article you link to) is for breaking words into smaller parts (morphemes). But in your example code, you tried to use a tokenizer that is appropriate for English: It recognizes space-delimited words and punctuation tokens. Since Malayalam evidently doesn't indicate word boundaries with spaces, or with anything else, you need a different approach.
So the NLTK doesn't provide anything that detects word boundaries for Malayalam. It might provide the tools to build a decent one fairly easily, though.
The obvious approach would be to try dictionary lookup: Try to break up your input into strings that are in the dictionary. But it would be harder than it sounds: You'd need a very large dictionary, you'd still have to deal with unknown words somehow, and since Malayalam has non-trivial morphology, you may need a morphological analyzer to match inflected words to the dictionary. Assuming you can store or generate every word form with your dictionary, you can use an algorithm like the one described here (and already mentioned by @amp) to divide your input into a sequence of words.
A better alternative would be to use a statistical algorithm that can guess where the word boundaries are. I don't know of such a module in the NLTK, but there has been quite a bit of work on this for Chinese. If it's worth your trouble, you can find a suitable algorithm and train it to work on Malayalam.
In short: The NLTK tokenizers only work for the typographical style of English. You can train a suitable tool to work on Malayalam, but the NLTK does not include such a tool as far as I know.
PS. The NLTK does come with several statistical tokenization tools; the PunktSentenceTokenizer can be trained to recognize sentence boundaries using an unsupervised learning algorithm (meaning you don't need to mark the boundaries in the training data). Unfortunately, the algorithm specifically targets the issue of abbreviations, and so it cannot be adapted to word boundary detection.

Maybe the Viterbi algorithm could help?
This answer to another SO question (and the other high-vote answer) could help: https://stackoverflow.com/a/481773/583834
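For reference, here is a minimal sketch of Viterbi-style segmentation over unigram word probabilities. It uses a Latin-script toy string, and the probabilities in word_probs are made up; for Malayalam you would need real frequency data and, as the question's own examples show, some way of handling joined forms that differ from the plain concatenation of their parts.

```python
import math

def viterbi_segment(text, word_probs, max_word_len=20):
    """Split text into the most probable sequence of known words, or return None."""
    # best[i] = (log-probability, start index of the last word) for text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None  # the text cannot be covered with known words only
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Toy unigram probabilities, invented for illustration.
word_probs = {"this": 0.2, "is": 0.2, "a": 0.2, "place": 0.1,
              "thisis": 0.01, "pla": 0.01, "ce": 0.01}

print(viterbi_segment("thisisaplace", word_probs))  # ['this', 'is', 'a', 'place']
```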

It seems like your space is the unicode character u'\u0d41'. So you should split normally with str.split().
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
x = 'ഇതുഒരുസ്ഥാലമാണ്'.decode('utf8')
y = x.split(u'\u0d41')
print " ".join(y)
[out]:
ഇത ഒര സ്ഥാലമാണ്

I tried the following:
# encoding=utf-8
import nltk
cheese = nltk.wordpunct_tokenize('ഇതുഒരുസ്ഥാലമാണ്'.decode('utf8'))
for var in cheese:
    print var.encode('utf8'),
And as output, I got the following:
ഇത ു ഒര ു സ ് ഥ ാ ലമ ാ ണ ്
Is this anywhere close to the output that you want? I'm a little in the dark here, since it's difficult to get this right without understanding the language.

Morphological analysis example
from mlmorph import Analyser
analyser = Analyser()
analyser.analyse("കേരളത്തിന്റെ")
Gives
[('കേരളം<np><genitive>', 179)]
url: mlmorph
If you are using Anaconda, first install git from the Anaconda prompt:
conda install -c anaconda git
Then clone the repository using the following command:
git clone https://gitlab.com/smc/mlmorph.git
