Good evening, I have a relatively simple question that primarily comes from my inexperience with python. I would like to extract word embeddings for a list of words. Here I have created a simple list:
list_word = [['Word'],
             ['ant'],
             ['bear'],
             ['beaver'],
             ['bee'],
             ['bird']]
Then load gensim and other required libraries:
#import tweepy # Obtain Tweets via API
import re # Obtain expressions
from gensim.models import Word2Vec # Import gensim Word2Vec
Now when I use the Word2Vec function I run the following:
#extract embedding length 12
model = Word2Vec(list_word, min_count = 3, size = 12)
print(model)
When the model is run I then see that the vocab size is 1, when it should not be. The output is the following:
Word2Vec(vocab=1, size=12, alpha=0.025)
I imagine that the imported data is not in the correct format, and I could use some advice, or even example code, on how to transform it into the correct format. Thank you for your help.
Your list_word, six 'sentences' each with a single word, is insufficient to train Word2Vec, which requires a lot of varied realistic text data. Among other problems:
words that appear fewer than 3 times will be ignored due to the min_count=3 setting (& it's not a good idea to lower that parameter)
single-word sentences have none of the nearby-words contexts the algorithm uses
getting good 'dense' vectors requires a vocabulary far larger than the vector-dimensionality, and many varied examples of each word's use with other words
Try using a larger dataset, and you'll see more realistic results. Also, enabling Python logging at the INFO level will show a lot of progress as the code runs - and perhaps hint at issues, as you notice steps happening with or without reasonable counts & delays.
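As a rough sketch of the expected input shape (this toy corpus and its sentences are made up purely to illustrate; parameter names follow gensim 3.x as in your snippet, where size is called vector_size in gensim 4.x), you might do something like:
import logging
from gensim.models import Word2Vec

# Show gensim's progress reports (vocab building, training passes, etc.)
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# Each training item is a list of tokens from a real sentence. This toy corpus only
# shows the expected shape; a real corpus needs many thousands of varied sentences.
sentences = [
    ['the', 'ant', 'crawled', 'across', 'the', 'leaf'],
    ['the', 'bear', 'caught', 'a', 'fish', 'in', 'the', 'river'],
    ['the', 'beaver', 'built', 'a', 'dam', 'across', 'the', 'river'],
    ['the', 'bee', 'landed', 'on', 'the', 'flower'],
    ['the', 'bird', 'flew', 'over', 'the', 'river'],
]

# min_count=1 only because this toy corpus is tiny; keep it higher on real data.
model = Word2Vec(sentences, min_count=1, size=12)
print(model.wv['bear'])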
I am trying to understand what is going wrong in the following example.
To train on the 'text8' dataset as described in the docs, one only has to do the following:
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load('text8')
model = Word2Vec(dataset)
Doing this gives very good embedding vectors, as verified by evaluating on a word-similarity task.
However, when manually loading the same text file that is used above, as in
text_path = '~/gensim-data/text8/text'
text = []
with open(text_path) as file:
    for line in file:
        text.extend(line.split())
text = [text]
model = Word2Vec(text)
The model still says it's training for the same number of epochs as above (5), but training is much faster, and the resulting vectors perform very, very badly on the similarity task.
What is happening here? I suppose it could have to do with the number of 'sentences', but the text8 file seems to have only a single line, so does gensim.downloader split the text8 file into sentences? If yes, of which length?
In your second example, you've created a training dataset with just a single text with the entire contents of the file. That's about 1.1 million word tokens, in a single list.
Word2Vec (& other related algorithms) in gensim have an internal implementation limitation, in their optimized paths, of 10,000 tokens per text item. All additional tokens are ignored.
So, in your 2nd case, 99% of your data is being discarded. Training may seem instant, but very little actual training will have occurred. (Word-vectors for words that only appear past the 1st 10,000 tokens won't have been trained at all, having only their initial randomly-set values.) If you enable logging at the INFO level, you'll see more details about each step of the process, and discrepancies like this may be easier to identify.
Yes, the api.load() variant takes extra steps to break the single-line-file into 10,000-token chunks. I believe it's using the LineSentence utility class for this purpose, whose source can be examined here:
https://github.com/RaRe-Technologies/gensim/blob/e859c11f6f57bf3c883a718a9ab7067ac0c2d4cf/gensim/models/word2vec.py#L1209
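For reference, here is a hedged sketch of reproducing that chunking yourself with LineSentence, whose default max_sentence_length is 10,000 tokens (the path reuses the local file from your question):
import os
from gensim.models.word2vec import LineSentence, Word2Vec

text_path = os.path.expanduser('~/gensim-data/text8/text')

# LineSentence splits each line on whitespace and yields it in slices of at most
# max_sentence_length tokens, so the single long line becomes many texts.
sentences = LineSentence(text_path, max_sentence_length=10000)
model = Word2Vec(sentences)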
However, I recommend avoiding the api.load() functionality entirely. It doesn't just download data; it also downloads a shim of additional outside-of-version-control Python code for prepping that data for extra operations. Such code is harder to browse & less well-reviewed than official gensim release code as packaged for PyPI/etc, which also presents a security risk. Each load target (by name like 'text8') might do something different, leaving you with a different object type as the return value.
It's much better for understanding to directly download precisely the data files you need, to known local paths, and do the IO/prep yourself, from those paths, so you know what steps have been applied, and the only code you're running is the officially versioned & released code.
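In that spirit, a minimal sketch of doing the prep by hand, assuming you've downloaded the raw text8 file to a known local path (the path here is a placeholder):
import os
from gensim.models import Word2Vec

text8_path = os.path.expanduser('~/data/text8')  # hypothetical local path to the raw file

# Read the single long line of space-separated tokens...
with open(text8_path) as f:
    tokens = f.read().split()

# ...then break it into 10,000-token texts so nothing exceeds the per-text limit.
chunk_size = 10000
texts = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

model = Word2Vec(texts)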
I am trying to split financial documents into sentences. I have ~50,000 documents containing plain English text. The total file size is ~2.6 GB.
I am using NLTK's PunktSentenceTokenizer with the standard English pickle file. I additionally tweaked it by providing additional abbreviations, but the results are still not accurate enough.
Since NLTK's PunktSentenceTokenizer is based on the unsupervised algorithm by Kiss & Strunk (2006), I am trying to train the sentence tokenizer on my own documents, following the training data format for NLTK Punkt.
import nltk.tokenize.punkt
import pickle
import codecs
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
text = codecs.open("someplain.txt", "r", "utf8").read()
tokenizer.train(text)
out = open("someplain.pk", "wb")
pickle.dump(tokenizer, out)
out.close()
Unfortunately, when running the code, I got an error that there was not sufficient memory. (Mainly because I first concatenated all the files into one big file.)
Now my questions are:
How can I train the algorithm batchwise, and would that lead to lower memory consumption?
Can I use the standard English pickle file and do further training with that already trained object?
I am using Python 3.6 (Anaconda 5.2) on Windows 10, on a Core i7 2600K machine with 16 GB of RAM.
I found this question after running into this problem myself. I figured out how to train the tokenizer batchwise and am leaving this answer for anyone else looking to do this. I was able to train a PunktSentenceTokenizer on roughly 200 GB of biomedical text in around 12 hours with a memory footprint no greater than 20 GB at a time. Nevertheless, I'd like to second @colidyre's recommendation to prefer other tools over the PunktSentenceTokenizer in most situations.
There is a class PunktTrainer you can use to train the PunktSentenceTokenizer in a batchwise fashion.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
Suppose we have a generator that yields a stream of training texts
texts = text_stream()
In my case, each iteration of the generator queries a database for 100,000 texts at a time, then yields all of these texts concatenated together.
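For concreteness, here is a hedged sketch of what such a generator could look like; this version reads batches of plain-text files from a directory rather than a database, and the directory name and batch size are made up:
import os

def text_stream(directory='financial_docs', batch_size=1000):
    """Yield training text in batches: each iteration returns many documents joined together."""
    batch = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), encoding='utf8') as f:
            batch.append(f.read())
        if len(batch) == batch_size:
            yield '\n\n'.join(batch)
            batch = []
    if batch:
        yield '\n\n'.join(batch)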
We can instantiate a PunktTrainer and then begin training
trainer = PunktTrainer()
for text in texts:
    trainer.train(text)
    trainer.freq_threshold()
Notice the call to the freq_threshold method after processing each text. This reduces the memory footprint by cleaning up information about rare tokens that are unlikely to influence future training.
Once this is complete, call the finalize_training() method. Then you can instantiate a new tokenizer using the parameters found during training.
trainer.finalize_training()
tokenizer = PunktSentenceTokenizer(trainer.get_params())
@colidyre recommended using spaCy with added abbreviations. However, it can be difficult to know in advance which abbreviations will appear in your text domain. To get the best of both worlds, you can add the abbreviations found by Punkt. You can get a set of these abbreviations in the following way
params = trainer.get_params()
abbreviations = params.abbrev_types
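If you then want a tokenizer that uses just those learned abbreviations (for example, layered on top of an otherwise default setup), one hedged option is to seed a PunktParameters object with them; the sample sentence below is made up:
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_params = PunktParameters()
# abbrev_types expects lowercase abbreviations without the trailing period.
punkt_params.abbrev_types = set(abbreviations)

abbrev_tokenizer = PunktSentenceTokenizer(punkt_params)
print(abbrev_tokenizer.tokenize('Revenue rose approx. 5 percent in Q3. Margins held steady.'))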
As described in the source code:
Punkt Sentence Tokenizer
This tokenizer divides a text into a list of sentences
by using an unsupervised algorithm to build a model for abbreviation
words, collocations, and words that start sentences. It must be
trained on a large collection of plaintext in the target language
before it can be used.
It is not very clear what a large collection really means. The paper gives no information about learning curves (i.e. when it is sufficient to stop the learning process because enough data has been seen). The Wall Street Journal corpus is mentioned there (it has approximately 30 million words). So it is very unclear whether you can simply trim your training corpus and get a smaller memory footprint.
There is also an open issue on this topic mentioning 200 GB of RAM and more. As you can see there, NLTK probably does not have a good implementation of the algorithm presented by Kiss & Strunk (2006).
I cannot see how to batch it, as you can see from the signature of the train() method (NLTK version 3.3):
def train(self, train_text, verbose=False):
"""
Derives parameters from a given training text, or uses the parameters
given. Repeated calls to this method destroy previous parameters. For
incremental training, instantiate a separate PunktTrainer instance.
"""
But there are probably more options: e.g. if you compare the signature of the released version 3.3 with the git-tagged version 3.3, there is a new parameter, finalize, which might be helpful and indicates a possible batch process or a possible merge with an already trained model:
def train(self, text, verbose=False, finalize=True):
"""
Collects training data from a given text. If finalize is True, it
will determine all the parameters for sentence boundary detection. If
not, this will be delayed until get_params() or finalize_training() is
called. If verbose is True, abbreviations found will be listed.
"""
Anyway, I would strongly recommend not using NLTK's Punkt Sentence Tokenizer if you want to do sentence tokenization beyond playground level. Nevertheless, if you want to stick with that tokenizer, I would simply recommend using the provided models rather than training new ones, unless you have a server with a huge amount of RAM.
I am just playing around with Doc2Vec from gensim, analysing a Stack Exchange dump to assess the semantic similarity of questions and identify duplicates.
The tutorial on Doc2Vec-Tutorial seems to describe the input as tagged sentences.
But the original paper: Doc2Vec-Paper claims that the method can be used to infer fixed length vectors of paragraphs/documents.
Can someone explain the difference between a sentence and a document in this context, and how I would go about inferring paragraph vectors?
Since a question can sometimes span multiple sentences,
I thought that during training I would give sentences arising from the same question the same tags, but then how would I do this for infer_vector on unseen questions?
And this notebook: Doc2Vec-Notebook
seems to be training vectors on both TRAIN and TEST docs. Can someone explain the rationale behind this, and should I do the same?
Gensim's Doc2Vec expects you to provide text examples of the same object-shape as the example TaggedDocument class: having both a words and a tags property.
The words are an ordered sequence of string tokens of the text – they might be a single sentence's worth, or a paragraph, or a long document; it's up to you.
The tags are a list of tags to be learned from the text – such as plain ints, or string-tokens, that somehow serve to name the corresponding texts. In the original 'Paragraph Vectors' paper, they were just unique IDs for each text – such as integers monotonically increasing from 0. (So the first TaggedDocument might have a tags of just [0], the next [1], etc.)
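A minimal sketch of building such a corpus and training on it (the raw texts are placeholders; parameter names follow current gensim, where older releases used size/iter instead of vector_size/epochs):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_texts = [
    'how do i merge two dictionaries in a single expression',
    'what is the difference between a list and a tuple in python',
]  # in practice: one entry per question/document, many thousands of them

# Each training example pairs a list of tokens with a list of tags (here, a unique int ID).
corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(raw_texts)]

# min_count=1 only because this placeholder corpus is tiny.
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=20)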
The algorithm just works on chunks of text, without any idea of what a sentence/paragraph/document etc might be. (Just consider them all 'documents' for the purpose of Doc2Vec, with you deciding what's the right kind of 'document' from your corpus.) It's even common for the tokenization to retain punctuation, such as the periods between sentences, as standalone tokens.
Inference occurs via the infer_vector() method, which takes a mandatory parameter doc_words, which should be a list-of-string-tokens just like those that were supplied as text words during training.
You don't supply any tags on inferred text: Doc2Vec just gives you back a raw vector that, within the relationships learned by the model, fits the text well. (That is: the vector is good at predicting the text's words, in the same way that the vectors and internal model weights learned during bulk training were good at predicting the training texts' words.)
Note that many have found better results from inference by increasing the optional steps parameter (and possibly decreasing the inference starting alpha to be more like the bulk-training starting alpha, 0.025 to 0.05).
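As a hedged sketch (attribute and parameter names here follow gensim 3.x, matching the steps parameter mentioned above; gensim 4.x renames steps to epochs and docvecs to dv):
unseen_tokens = 'how do i convert a string to a date in python'.split()

# More inference passes and a bulk-training-like starting alpha often help.
vector = model.infer_vector(unseen_tokens, steps=50, alpha=0.025)

# Find the training documents whose vectors are most similar to the inferred vector.
print(model.docvecs.most_similar([vector], topn=5))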
The doc2vec-IMDB demo notebook tries to reproduce one of the experiments from the original Paragraph Vectors paper, so it's following what's described there, and a demo script that one of the authors (Mikolov) once released. Since 'test' documents (without their target-labels/known-sentiments) may still be available, at training time, to help improve the text-modelling, it can be reasonable to include their raw texts during the unsupervised Doc2Vec training. (Their known labels are not used when training the classifier which uses the doc-vectors.)
(Note that at the moment, February 2017, the doc2vec-IMDB demo notebook is a little out-of-date compared to the current gensim Doc2Vec defaults & best practices – in particular, the models aren't given the right explicit iter=1 value to make the later manual loop-and-train() do just the right number of training passes.)
A common natural-language-processing practice is to calculate the similarity between two words using WordNet. I start my question with the following Python code:
from nltk.corpus import wordnet
sport = wordnet.synsets("sport")[0]
badminton = wordnet.synsets("badminton")[0]
print(sport.wup_similarity(badminton))
We will get 0.8421
Now what if I look for "haha" and "lol" as following:
haha = wordnet.synsets("haha")
lol = wordnet.synsets("lol")
print(haha)
print(lol)
We will get
[]
[]
Then we cannot compute the similarity between them. What can we do in this case?
You can create a semantic space from cooccurrence matrices using a tool like Dissect (DIStributional SEmantics Composition Toolkit)
and then you are set to measure semantic similarity between words or phrases (if you compose words).
In your case, for 'haha' and 'lol', you'll need to collect those cooccurrences.
Another thing to try is word2vec.
There are two other possible approaches:
CBOW: continuous bag of words
skip-gram model: this is the reverse of the CBOW model (it predicts context words from the target word)
Look at this: https://www.quora.com/What-are-the-continuous-bag-of-words-and-skip-gram-architectures-in-laymans-terms
These models are well presented here: https://www.tensorflow.org/tutorials/word2vec, and gensim is also a good Python library for doing these kinds of things
Try looking at TensorFlow solutions, for example this: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
Or read up on word2vec: https://en.wikipedia.org/wiki/Word2vec
You can use other frameworks. I was also trying NLTK, but finally landed on spaCy (spacy.io), a very fast and functional framework. There is a method for words called similarity, which compares a word to other words (it also works for sentences, docs, etc.). It is implemented using word2vec. Actually, I don't know how big their vocabulary is or how it handles unknown words, but it might be worth trying.
I was also playing a little bit with this one:
https://radimrehurek.com/gensim/models/word2vec.html
Where in two lines you can load Google's big word2vec model (this project ports Google's word2vec C++ library into Python), accessible here:
https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit
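Roughly, those two lines might look like this (the file name is whatever you get after downloading and unpacking the archive linked above; loading it needs several GB of RAM):
from gensim.models import KeyedVectors

# Load the pre-trained GoogleNews vectors in the binary word2vec format.
vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Works only if both words are in the pre-trained vocabulary.
print(vectors.similarity('haha', 'lol'))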
There are different models for measuring similarity, such as word2vec or glove, but you seem to be looking more for a corpus which includes social, informal phrases like 'lol'.
However, I'm going to bring up word2vec because it leads to what I think is an answer to your question.
The foundational concept of word2vec (and other word embedding models like glove) is the representation of words in a vector space which incorporates relationships between words. This lends itself very well to measuring similarity, since vectors have lots of established math to draw from. You can read more about the technical details of word2vec in the original paper, but I quite like this blog post because it is well-written and concise.
Again, since word2vec is just a model, you need to pair it with the right training set to get the kind of scope you seek. There are some pre-trained models floating around on the web, such as this bunch. The training set is really what allows you to query a larger variety of terms, rather than the model.
You can certainly use those pre-trained models if they have social phrases like the ones you're seeking. However, if you don't see a model that has been trained on a suitable corpus, you could easily train a model yourself. I suggest Twitter or Wikipedia for corpora (training sets), and the implementation of word2vec in gensim as a word embedding model.
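If you do train your own, here is a hedged sketch with gensim, assuming a hypothetical file of pre-collected tweets, one per line (the size parameter is named vector_size in gensim 4.x):
import re
from gensim.models import Word2Vec

def tokenize(text):
    # Crude tokenizer: lowercase and keep runs of word characters only.
    return re.findall(r'\w+', text.lower())

with open('tweets.txt', encoding='utf8') as f:  # hypothetical corpus file
    sentences = [tokenize(line) for line in f if line.strip()]

model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

# Assumes both words occur often enough in the corpus to survive min_count.
print(model.wv.similarity('haha', 'lol'))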
I am working on Python NLTK tagging, and my input text is non-Hindi.
In order to tag my input text, a tagger must first be trained.
My question is: how do I train it on my data?
I am having this line of code as suggested to me here on stackoverflow.
train_data = indian.tagged_sents('hindi.pos')
But what about non-Hindi input data?
The short answer is: Training a tagger requires a tagged corpus.
Assigning part of speech tags must be done according to some existing model.
Unfortunately, unlike some problems like finding sentence boundaries, there is no way to choose them out of thin air. There are some experimental approaches that try to assign parts of speech using parallel texts and machine-translation alignment algorithms, but all real POS taggers must be trained on text that has been tagged already.
Evidently you don't have a tagged corpus for your unnamed language, so you'll need to find or create one if you want to build a tagger. Creating a tagged corpus is a major undertaking, since you'll need a lot of training materials to get any sort of decent performance. There may be ways to "bootstrap" a tagged corpus (put together a poor-quality tagger that will make it easier to retag the results by hand), but all that depends on your situation.
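Once you do have (or have bootstrapped) a tagged corpus in NLTK's tagged-sentences format, training a simple backoff tagger is straightforward; here is a hedged sketch that reuses the Hindi corpus from the question as a stand-in for your own data:
from nltk.corpus import indian
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Replace this with tagged sentences for your own language:
# a list of sentences, each a list of (word, tag) pairs.
tagged_sents = list(indian.tagged_sents('hindi.pos'))
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

t0 = DefaultTagger('NN')                     # fallback tag for unseen words
t1 = UnigramTagger(train_sents, backoff=t0)  # most frequent tag per word
t2 = BigramTagger(train_sents, backoff=t1)   # uses the previous tag as context

print(t2.evaluate(test_sents))               # rough accuracy on held-out sentences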