I am new to python and word2vec and keep getting a "you must first build vocabulary before training the model" error. What is wrong with my code?
Here is my code:
file_object=open("SupremeCourt.txt","w")
from gensim.models import word2vec
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)
out=model.most_similar()
print(out[1])
print(out[2])
I can see a couple of things wrong in your code: the file is opened in write mode (which wipes it), and the model you trained may not contain the word whose most similar words you want to find.
I would suggest either loading a pre-trained model such as the Google News vectors into gensim, or building your own word2vec model on intact data, so that you don't get this error.
The usage of most_similar in gensim is out = model.most_similar("word-name")
file_object=open("SupremeCourt.txt","r")
from gensim.models import word2vec
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)#use google news vectors here
out=model.most_similar("word")
print(out)
You're opening that file in write mode with this line:
file_object = open("SupremeCourt.txt", "w")
By doing this, you're erasing the contents of your file, so that when you try to pass the file to the model for training, there is no data to read. That's why that error is thrown.
Remove that line (and also restore your file contents), and it'll work.
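For reference, a minimal sketch of the corrected flow, assuming SupremeCourt.txt still contains plain text and using the older gensim API the question is written against (the query word 'court' is just an example and must actually appear in the corpus):
from gensim.models import word2vec

# Text8Corpus streams the file itself, so no separate open() call is needed.
data = word2vec.Text8Corpus('SupremeCourt.txt')
model = word2vec.Word2Vec(data, size=200)  # in gensim 4.x the parameter is vector_size

# most_similar() needs a query word that occurs in the training corpus.
print(model.most_similar('court'))  # in gensim 4.x: model.wv.most_similar('court')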
I have trained a deep learning model and it got saved in a pickle file. Due to some reasons, I have to slightly change the code from which I got the pickle file. Training took me months, and I want to reuse the last pickle file created in any case, as the weights will remain the same. Is there any way to view and change the content of the pickle file?
Edit: For example, if we have the stylegan2 pre-trained network pickle file and suppose we made changes on the G_synthesis function code (present in https://github.com/NVlabs/stylegan2/blob/master/training/networks_stylegan2.py) then how can we use the old pickled file.
If you just want to change some functions but keep the same weights, you can just copy the weights to the new model like this:
import pickle
from old_model_file import old_model
from new_model_file import new_model

# 1. load the old pickle file
with open('old.pickle', 'rb') as f:
    old_pickle = pickle.load(f)

# 2. create an instance of the model from the new code
new_pickle = new_model()

# 3. copy the weights from the old model to the new one, for example:
#    new_pickle.weight_A = old_pickle.weight_A
#    new_pickle.weight_B = old_pickle.weight_B

# 4. save the new model
with open('new.pickle', 'wb') as f:
    pickle.dump(new_pickle, f)
Is this what you want?
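If it helps, here is step 3 made concrete as a loop over attribute names; the names weight_A/weight_B and the new_model_file import are placeholders for whatever your real classes expose:
import pickle
from new_model_file import new_model  # hypothetical module containing your updated model code

with open('old.pickle', 'rb') as f:
    old_pickle = pickle.load(f)

new_pickle = new_model()

# Copy each weight attribute across; replace these names with your model's real ones.
for name in ('weight_A', 'weight_B'):
    setattr(new_pickle, name, getattr(old_pickle, name))

with open('new.pickle', 'wb') as f:
    pickle.dump(new_pickle, f)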
I am trying to figure out why I can't read the contents of the english.pickle file downloaded via the nltk module.
I first downloaded the nltk file using this code:
import nltk
nltk.download('punkt')
I then looked inside the punkt directory in my home directory and found the english.pickle file. I used the following code to read the file in Python:
import pickle
with open('english.pickle', 'rb') as file:
    x = pickle.load(file)
It all seemed fine; however, when I inspect the variable x (which should be storing the pickled data), I am unable to retrieve the data from it as I would from any other pickled file.
Instead I am only getting the object name and the id:
<nltk.tokenize.punkt.PunktParameters at 0x7f86cf6c0cd0>
The problem is that I need to access the content of the file, and I can't iterate through it as it is not iterable.
Has anyone encountered the same problem?
You have downloaded the punkt tokenizer, for which the documentation says:
This tokenizer divides a text into a list of sentences by using an
unsupervised algorithm to build a model for abbreviation words,
collocations, and words that start sentences. It must be trained on a
large collection of plaintext in the target language before it can be
used.
After this:
with open('english.pickle', 'rb') as file:
    x = pickle.load(file)
You should have a nltk.tokenize.punkt.PunktSentenceTokenizer object. You can call methods on that object to perform tokenization. E.g.:
>>> x.tokenize('This is a test. I like apples. The cow is blue.')
['This is a test.', 'I like apples.', 'The cow is blue.']
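As a side note, you normally don't need to unpickle that file by hand at all; a small sketch using nltk's own loader, which resolves the path inside nltk_data and returns the same tokenizer object:
import nltk

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
print(tokenizer.tokenize('This is a test. I like apples.'))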
The built-in classifier in textblob is pretty dumb. It's trained on movie reviews, so I created a huge set of examples in my context (57,000 stories, categorized as positive or negative) and then trained it using nltk. I tried using textblob to train it but it always failed:
with open('train.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")
That would run for hours and end in a memory error.
I looked at the source and found it was just using nltk and wrapping that, so I used that instead, and it worked.
The structure nltk needs for a training set is a list of tuples: the first element of each tuple is a Counter of the words in the text and their frequency of appearance, and the second element is 'pos' or 'neg' for the sentiment.
>>> train_set = [(Counter(i["text"].split()),i["label"]) for i in data[200:]]
>>> test_set = [(Counter(i["text"].split()),i["label"]) for i in data[:200]] # withholding 200 examples for testing later
>>> cl = nltk.NaiveBayesClassifier.train(train_set) # <-- this is the same thing textblob was using
>>> print("Classifier accuracy percent:",(nltk.classify.accuracy(cl, test_set))*100)
('Classifier accuracy percent:', 66.5)
>>> cl.show_most_informative_features(75)
Then I pickled it.
with open('storybayes.pickle','wb') as f:
    pickle.dump(cl,f)
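To reuse it later, the pickle can be loaded back and queried with the same Counter-style featuresets it was trained on (a small sketch; the example sentence is made up):
import pickle
from collections import Counter

with open('storybayes.pickle', 'rb') as f:
    cl = pickle.load(f)

# classify() expects the same kind of featureset that was used during training.
print(cl.classify(Counter("the hero saves the day".split())))  # -> 'pos' or 'neg'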
Now... I took this pickled file and reopened it to get the nltk classifier back (an nltk.classify.naivebayes.NaiveBayesClassifier) -- and tried to feed it into textblob. Instead of
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
I tried:
blob = TextBlob("I love this library", analyzer=myclassifier)
Traceback (most recent call last):
  File "<pyshell#116>", line 1, in <module>
    blob = TextBlob("I love this library", analyzer=cl4)
  File "C:\python\lib\site-packages\textblob\blob.py", line 369, in __init__
    parser, classifier)
  File "C:\python\lib\site-packages\textblob\blob.py", line 323, in _initialize_models
    BaseSentimentAnalyzer, BaseBlob.analyzer)
  File "C:\python\lib\site-packages\textblob\blob.py", line 305, in _validated_param
    .format(name=name, cls=base_class_name))
ValueError: analyzer must be an instance of BaseSentimentAnalyzer
What now? I looked at the source and both are classes, but not quite the same.
I wasn't able to confirm that an nltk corpus cannot work with textblob, and that would surprise me, since textblob imports all of the nltk functions in its source code and is basically a wrapper.
But what I did conclude after many hours of testing is that nltk offers a better built-in sentiment corpus called "vader" that outperformed all of my trained models.
import nltk
nltk.download('vader_lexicon') # do this once: grab the trained model from the web
from nltk.sentiment.vader import SentimentIntensityAnalyzer
Analyzer = SentimentIntensityAnalyzer()
Analyzer.polarity_scores("I find your lack of faith disturbing.")
{'neg': 0.491, 'neu': 0.263, 'pos': 0.246, 'compound': -0.4215}
CONCLUSION: NEGATIVE
The vader_lexicon and the nltk code do a lot more parsing of negation language in sentences in order to negate positive words. For example, when Darth Vader says "lack of faith", that changes the sentiment to its opposite.
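A quick way to see the negation handling in action (a toy comparison; the exact scores depend on your vader_lexicon version):
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# Compare a sentence with its negated form; the compound score should flip sign.
print(analyzer.polarity_scores("The service was good."))
print(analyzer.polarity_scores("The service was not good."))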
I explained it here, with examples of the better results:
https://chewychunks.wordpress.com/2018/06/19/sentiment-analysis-discovering-the-best-way-to-sort-positive-and-negative-feedback/
That replaces this textblob implementation:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
TextBlob("I find your lack of faith disturbing.", analyzer=NaiveBayesAnalyzer())
{'neg': 0.182, 'pos': 0.817, 'combined': 0.635}
CONCLUSION: POSITIVE
The vader nltk classifier also has additional documentation here on using it for sentiment analysis: http://www.nltk.org/howto/sentiment.html
TextBlob always crashed my computer with as little as 5000 examples.
Going over the error message, it seems like the analyzer must inherit from the abstract class BaseSentimentAnalyzer. As mentioned in the docs here, this class must implement the analyze(text) function. However, while checking the docs of NLTK's implementation, I could not find this method in its main documentation here or in its parent class ClassifierI here. Hence, I believe these two implementations cannot be combined, unless you implement a new analyze function in NLTK's implementation to make it compatible with TextBlob's.
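If you do want to bridge them yourself, a rough sketch of such an adapter might look like this (untested; the class name is mine, and the tokenization in analyze() must match whatever featuresets the classifier was trained on):
from collections import Counter
from textblob.sentiments import BaseSentimentAnalyzer

class NltkClassifierAnalyzer(BaseSentimentAnalyzer):
    """Wraps an already-trained nltk NaiveBayesClassifier so TextBlob accepts it."""
    def __init__(self, nltk_classifier):
        super(NltkClassifierAnalyzer, self).__init__()
        self._classifier = nltk_classifier

    def analyze(self, text):
        # Build the same Counter-of-tokens featureset the classifier saw during training.
        return self._classifier.classify(Counter(text.split()))

blob_analyzer = NltkClassifierAnalyzer(cl)  # cl is the unpickled nltk classifier from above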
Another more forward-looking solution is to use spaCy to build the model instead of textblob or nltk. This is new to me, but seems a lot easier to use and more powerful:
https://spacy.io/usage/spacy-101#section-lightning-tour
"spaCy is the Ruby of Rails of natural language processing."
import spacy
import random

nlp = spacy.load('en')  # loads the trained starter model here
# a toy training set with one annotated ORG entity
train_data = [("Uber blew through $1 million", {'entities': [(0, 4, 'ORG')]})]

with nlp.disable_pipes(*[pipe for pipe in nlp.pipe_names if pipe != 'ner']):
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')
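Once saved, the trained pipeline can be loaded back from that same directory and used directly, e.g.:
import spacy

nlp = spacy.load('/model')  # the directory written by nlp.to_disk('/model') above
doc = nlp("Uber blew through $1 million")
print([(ent.text, ent.label_) for ent in doc.ents])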
I was wondering if it is possible to update spaCy's default vocabulary. What I am trying to do is this:
run word2vec on my own corpus with gensim
load the vectors into my model with nlp.vocab.load_vectors_from_bin_loc(\path)
But since a lot of the words in my corpus aren't in spaCy's default vocabulary, I can't make use of the imported vectors. Is there an (easy) way to add those missing types?
Edit:
I realize it might be problematic to mix vectors. So my question is:
How can I import a custom vocabulary into spacy?
This is much easier in the next version, which should be out this week --- I'm just finishing testing it. For now:
By default spaCy loads a data/vocab/vec.bin file, where the "data" directory is within the spacy.en module directory
Create the vec.bin file from a bz2 file using spacy.vocab.write_binary_vectors
Either replace spaCy's vec.bin file, or call nlp.vocab.load_rep_vectors at run-time, with the path to the binary file.
The above is a bit inconvenient at first, but the binary file format is much smaller and faster to load, and the vectors files are fairly big. Note that GloVe distributes in gzip format, not bzip.
Out of interest: are you using the GloVe vectors, or something you trained on your own data? If your own data, did you use Gensim? I'd like to make this much easier, so I'd appreciate suggestions for what work-flow you'd like to see.
Load new vectors at run-time, optionally converting them
import spacy.vocab

def set_spacy_vectors(nlp, binary_loc, bz2_loc=None):
    if bz2_loc is not None:
        spacy.vocab.write_binary_vectors(bz2_loc, binary_loc)
    nlp.vocab.load_rep_vectors(binary_loc)
Replace the vec.bin, so your vectors will be loaded by default
from spacy.vocab import write_binary_vectors
import spacy.en
from os import path
import plac

def main(bz2_loc):
    bin_loc = path.join(path.dirname(spacy.en.__file__), 'data', 'vocab', 'vec.bin')
    write_binary_vectors(bz2_loc, bin_loc)

if __name__ == '__main__':
    plac.call(main)
Today I started writing a script which trains LDA models on large corpora (at least 30M sentences) using the gensim library.
Here is the current code that I am using:
import logging
import gensim
from gensim import corpora, models, similarities, matutils

def train_model(fname):
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    dictionary = corpora.Dictionary(line.lower().split() for line in open(fname))
    print "DOC2BOW"
    corpus = [dictionary.doc2bow(line.lower().split()) for line in open(fname)]
    print "running LDA"
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100,
                                          update_every=1, chunksize=10000, passes=1)
Running this script on a small corpus (2M sentences), I realized that it needs about 7GB of RAM.
And when I try to run it on larger corpora, it fails because of a memory issue.
The problem is obviously due to the fact that I am loading the corpus using this command:
corpus = [dictionary.doc2bow(line.lower().split()) for line in open(fname)]
But, I think there is no other way because I would need it for calling the LdaModel() method:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=100, update_every=1, chunksize=10000, passes=1)
I searched for a solution to this problem but I could not find anything helpful.
I would imagine that this is a common problem, since we mostly train these models on very large corpora (usually Wikipedia documents). So there should already be a solution for it.
Any ideas about this issue and the solution for it?
Consider wrapping your corpus up as an iterable and passing that instead of a list (a generator will not work).
From the tutorial:
class MyCorpus(object):
    def __iter__(self):
        for line in open(fname):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

corpus = MyCorpus()
lda = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                      id2word=dictionary,
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)
Additionally, Gensim has several different corpus formats readily available, which can be found in the API reference. You might consider using TextCorpus, which should fit your format nicely already:
corpus = gensim.corpora.TextCorpus(fname)
lda = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                      id2word=corpus.dictionary,  # TextCorpus can build the dictionary for you
                                      num_topics=100,
                                      update_every=1,
                                      chunksize=10000,
                                      passes=1)
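Either way, once training finishes you can sanity-check the result by printing a few topics (the exact output format varies across gensim versions):
# Show the top words for the first few learned topics.
for topic in lda.print_topics(5):
    print(topic)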