How to get an iterable for scikit-learn partial_fit - python

I am trying to train the SGDClassifier with text data using the HashingVectorizer. I wonder how I can assemble the batches that are passed to partial_fit() while reading from multiple files.
Is the following code an appropriate way to get the data in batches via an iterable? Is there any best practice or recommended way for doing this?
import fileinput

class MyIterable:
    def __init__(self, files, batch_size):
        self.files = files
        self.batch_size = batch_size

    def __iter__(self):
        batchstartmark = 0
        for line in fileinput.input(self.files):
            if batchstartmark < self.batch_size:
                yield line.split('\t')
                batchstartmark += 1
Thanks in advance!

Just judging the theory of this approach here:
That's a very very bad approach!
As SGDClassifier uses Stochastic Gradient Descent (with mini-batches if you want), you should try to fulfill the assumptions behind SGD's mathematical analysis.
The basic idea of SGD is: pick some random element and descend. Your code deviates from this in two ways:
A) You are picking your elements in the same order in every epoch
B) You are sampling (not really) without replacement
So x17 will not get picked again until every other x has been picked in this epoch.
Ignoring A will, with high probability, lead to very bad performance.
The point B is hard to analyze. There are different theoretical views, mostly dependent on some specific problem (of course there are differences between convex and non-convex problems), and while sampling-with-replacement is the classic one (with the most general convergence proofs), sometimes sampling-without-replacement (aka: shuffle and iterate during epoch / cycling) is used and often it's faster in convergence.
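For example, one common pattern for the shuffle-per-epoch (without-replacement) variant is to reshuffle the examples each epoch and feed fixed-size batches to partial_fit(). A minimal sketch, assuming the tab-separated lines fit in memory and that the label is the second field; the load_lines helper and the file names are illustrative, not from the question:

import random
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def load_lines(files):
    # Hypothetical helper: read (text, label) pairs from tab-separated files.
    for path in files:
        with open(path) as f:
            for line in f:
                text, label = line.rstrip('\n').split('\t')
                yield text, label

vectorizer = HashingVectorizer(n_features=2**18)   # stateless, so safe for streaming
clf = SGDClassifier()

data = list(load_lines(['part1.tsv', 'part2.tsv']))   # illustrative file names
classes = np.unique([label for _, label in data])
batch_size = 1000

for epoch in range(5):
    random.shuffle(data)                               # reshuffle every epoch (point A)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        X = vectorizer.transform([text for text, _ in batch])
        y = [label for _, label in batch]
        clf.partial_fit(X, y, classes=classes)

If the full dataset does not fit in memory, a weaker approximation is to shuffle the file order each epoch and additionally shuffle within a large in-memory buffer.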

Related

Word2vec is not getting better as the number of epochs increases

Running this code gives me back loss values that cycle rather than really decreasing. Could you explain why?
from gensim.test.utils import datapath
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class callback(CallbackAny2Vec):
    '''Callback to print loss after each epoch.'''
    def __init__(self):
        self.epoch = 0
        self.previous_loss = 0

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        print('Loss after epoch {}: {}'.format(self.epoch, loss - self.previous_loss))
        self.epoch += 1
        self.previous_loss = loss

model = Word2Vec(datapath('lee_background.cor'), epochs=10000,
                 compute_loss=True, callbacks=[callback()])
Loss can't decrease forever, unless:
- the model can perfectly memorize the training set; and
- every input – in the case of word2vec, the skip-gram or CBOW context word(s) – always generates the exact same outputs.
The latter definitely isn't the case in natural-language: neither one skip-gram word X, nor a window of CBOW words X1, X2, ... Xn, will always exactly predict a target word. Hence, there will always be loss-against-training-examples.
All that you're doing with training (stochastic-gradient-descent optimization) is driving loss to the smallest that's practical given the mechanism/size of the chosen model.
At some point, at a still non-zero loss, changing the model to be better on some training-examples necessarily worsens it on others.
At this point, often called 'convergence', further training can only cause measured loss to jitter up-and-down around some range-of-approximately-best-value. Which seems to be what you're describing.
Related: a model with lower loss is better at the training task – mechanistically predicting words among the texts of the training set. But it won't necessarily be better at all the other downstream things you want to use word-vectors for.
At a certain point, being superficially better at the training-set – memorizing every detail, even the idiosyncratic non-generalizable things – can make things worse for other out-of-training-set tasks. That's 'overfitting'.
Especially with small training-sets, you can see this for yourself by expanding the vector_size. Some size will do best for specific other tasks – creating word-vectors that reflect what you want about the word's relationships – but an ever-larger size will do worse.
(It'll also, at some point, make the model larger than the training-data – an imbalance that practically ensures overfitting, because all 'learning' typically needs to have some aspect of compression: boiling a smaller number of useful compact lessons from a larger amount of suggestive/noisy data.)
That's why assessing model fitness for your project requires evaluations other than just looking-at-loss. Ideally these evaluations are even project-specific, though often more generic ones – like say the analogy-solving often applied to word2vec models – may point in the roughly right direction to match human-salient word senses. Still: on any project-specific goal, like classification or info-retrieval, the word-vectors 'best' at analogy-solving might not be best for the project's purposes.
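As a quick way to see the vector_size effect for yourself, you can sweep the size and probe each model. A minimal sketch using gensim's bundled lee_background.cor corpus and gensim 4.x parameter names; the most_similar() probe is only a stand-in for whatever project-specific evaluation you actually care about:

from gensim.test.utils import datapath
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Small bundled corpus, one document per line; tiny datasets show the
# oversized-model effect quickly.
sentences = list(LineSentence(datapath('lee_background.cor')))

for size in (20, 50, 100, 300):
    model = Word2Vec(sentences, vector_size=size, epochs=20, min_count=5, seed=1)
    probe = model.wv.index_to_key[10]   # some frequent word to eyeball
    print(size, probe, model.wv.most_similar(probe, topn=3))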

Why does scaling down the parameters many times during training help keep the learning speed the same for all weights in Progressive GAN?

Equalized learning rate is one of the special techniques in Progressive GAN, a paper by the NVIDIA team. Introducing this method, they state:
Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights.
In detail, they initialized all learnable parameters from a normal distribution N(0, 1). During training, on each forward pass, they scale the result by the per-layer normalization constant from He's initializer.
I reproduced the code from the pytorch GAN zoo GitHub repo:
import math
from numpy import prod

def forward(self, x, equalized):
    # Generate the He constant, which depends on the size of the weight tensor W
    size = self.module.weight.size()
    fan_in = prod(size[1:])
    self.weight = math.sqrt(2.0 / fan_in)
    '''
    A module example:
    import torch.nn as nn
    module = nn.Conv2d(nChannelsPrevious, nChannels, kernelSize, padding=padding, bias=bias)
    '''
    x = self.module(x)
    if equalized:
        x *= self.weight
    return x
At first, I thought the He constant would be c = sqrt(2 / fan_in), as in He's paper. I expected the weights to be scaled up by it, which would increase the gradients in backpropagation and, as I read the formula in ProGAN's paper, prevent vanishing gradients. However, the code shows that c = sqrt(2.0 / fan_in) is smaller than 1 for any realistic fan_in, so the weights are effectively scaled down on every forward pass.
In summary, I can't understand why scaling the parameters down many times during training helps keep the learning speed stable.
I have asked this question in some other communities, e.g. Artificial Intelligence and Mathematics, and still haven't had an answer.
Please help me explain it, thank you!
There is already an explanation in the paper for scaling down the parameters in every single pass:
The benefit of doing this dynamically instead of during initialization is somewhat subtle and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time.
I believe that multiplying by He's constant in every pass ensures that the range of the parameters will not be too wide at any point during backpropagation, and therefore they will not take so long to adjust. So if, for example, the discriminator at some point in the learning process adjusts faster than the generator, it will not take the generator long to adjust itself, and consequently their learning speeds will equalize.
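To make the mechanism concrete, here is a minimal self-contained PyTorch sketch of an equalized-learning-rate layer in the same spirit as the snippet above; the class name and layer sizes are illustrative, not taken from the paper or the GAN zoo code:

import math
import torch
import torch.nn as nn

class EqualizedConv2d(nn.Module):
    """Conv layer whose weights stay ~ N(0, 1) and are rescaled by the He constant at runtime."""
    def __init__(self, in_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        # Initialize from N(0, 1) instead of a fan-in-aware init.
        self.conv.weight.data.normal_(0, 1)
        self.conv.bias.data.zero_()
        # Per-layer He constant sqrt(2 / fan_in), computed once from the weight shape.
        fan_in = self.conv.weight[0].numel()   # in_ch * kernel_h * kernel_w
        self.scale = math.sqrt(2.0 / fan_in)

    def forward(self, x):
        # Rescale the output at runtime, as in the snippet above; the stored
        # parameters stay in the N(0, 1) range that Adam/RMSProp "see" when
        # normalizing updates, so all layers keep a comparable dynamic range.
        return self.conv(x) * self.scale

layer = EqualizedConv2d(64, 128, kernel_size=3, padding=1)
out = layer(torch.randn(1, 64, 16, 16))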

Does the gensim `Word2Vec()` constructor make a completely independent model?

I'm testing feeding gensim's Word2Vec different sentences with the same overall vocabulary to see if some sentences carry "better" information than others. My method to train Word2Vec looks like this
import torch
from gensim.models import Word2Vec

def encode_sentences(self, w2v_params, sentences):
    model = Word2Vec(sentences, **w2v_params)
    idx_order = torch.tensor([int(i) for i in model.wv.index2entity], dtype=torch.long)
    X = torch.zeros((idx_order.max() + 1, w2v_params['size']), dtype=torch.float)
    # Put embeddings back in order
    X[idx_order] = torch.tensor(model.wv.vectors)
    return X
What I'm hoping for here, is each time w2v runs, it starts with a fresh model and trains from scratch. However, I'm testing 3 kinds of sentences, so my test code looks like this:
def test(sentence):
    w2v = {'size': 128, 'sg': 1}
    X = encode_sentences(w2v, sentence)
    evaluate(X)  # Basic cluster analysis stuff here

# s1, s2 and s3 are the 3 sets of sentences with the same vocabulary in different order/frequency
[print(test(s)) for s in [s1, s2, s3]]
However, I noticed if I remove one of the test sets, and only test s1 and s2 (or any combination of 2 sets of the three), the overall quality of the clusterings decreases. If I go back into encode_sentences and add del model before the return call, the overall cluster quality also goes down but remains consistent no matter how many datasets are tested.
What gives? Is the constructor not actually building a fresh model each time with new weights? The docs and source code give no indication of this. I'm quite sure it isn't my evaluation method, as everything was fixed after the del model was added. I'm at a loss here... Are these runs actually independent, or is each call to Word2Vec(foo, ...) equivalent to retraining the previous model with foo as new data?
And before you ask: no, model is not referenced anywhere outside the scope of encode_sentences; that's the only place that variable name is used in the whole program. Very odd.
Edit with more details
If it's important, I'm using Word2Vec to build node embeddings on a graph the way Node2Vec does, with different walk strategies. These embeddings are then fed to a Logistic Regression model (evaluate(X)), which calculates the area under the ROC curve.
Here is some sample output of the model before adding the del model call to the encode_sentences method averaged over 5 trials:
Random walks: 0.9153 (+/-) 0.002
Policy walks: 0.9125 (+/-) 0.005
E-greedy walks: 0.8489 (+/-) 0.011
Here is the same output with the only difference being del model in the encoding method:
Random walks: 0.8627 (+/-) 0.005
Policy walks: 0.8527 (+/-) 0.002
E-greedy walks: 0.8385 (+/-) 0.009
As you can see, in each case, the variance is very low (the +/- value is the standard error) but the difference between the two runs is almost a whole standard deviation. It seems odd that if each call to Word2Vec was truly independent that manually freeing the data structure would have such a large effect.
Each call to the Word2Vec() constructor creates an all-new model.
However, runs are not completely deterministic under normal conditions, for a variety of reasons, so results quality for downstream evaluations (like your unshown clustering) will jitter from run-to-run.
If the variance in repeated runs with the same data is very large, there are probably other problems, such as an oversized model prone to overfitting. (Stability from run-to-run can be one indicator that your process is sufficiently specified that the data and model choices are driving results, not the randomness used by the algorithm.)
If this explanation isn't satisfying, try adding more info to your question - such as the actual magnitude of your evaluation scores, in repeated runs, both with and without the changes that you conjecture are affecting results. (I suspect the variations from the steps you think are having effect will be no larger than variations from re-runs or different seed values.)
(More generally, Word2Vec is hungry for as much varied training data as possible; only if texts are non-representative of the relevant domain are they likely to result in a worse model. So I generally wouldn't expect being choosier about which subset of sentences is best to be an important technique, unless some of the sentences are total junk/noise, but of course there's always a chance you'll find some effects in your particular data/goals.)
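One hedged way to separate real effects from run-to-run jitter is to repeat each configuration over several seeds and compare means and spreads. The sketch below reuses the question's encode_sentences/evaluate (assumed, not shown, and assumed to return the AUC being reported); fully deterministic gensim runs would additionally need workers=1 and a fixed PYTHONHASHSEED:

import numpy as np

def repeated_scores(sentences, n_runs=5):
    scores = []
    for seed in range(n_runs):
        # Each run builds a fresh model; only the RNG seed differs.
        w2v_params = {'size': 128, 'sg': 1, 'seed': seed, 'workers': 1}
        X = encode_sentences(w2v_params, sentences)
        scores.append(evaluate(X))
    return np.mean(scores), np.std(scores)

for name, s in [('Random walks', s1), ('Policy walks', s2), ('E-greedy walks', s3)]:
    mean, std = repeated_scores(s)
    print('{}: {:.4f} (+/-) {:.4f}'.format(name, mean, std))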

Inaccurate similarity results from doc2vec using the gensim library

I am working with the Gensim library to train some data files using doc2vec. While trying to test the similarity of one of the files using the method model.docvecs.most_similar("file"), I always get all the results above 91% with almost no difference between them (which is not logical), because the files do not have similarities between them. So the results are inaccurate.
Here is the code for training the model
model = gensim.models.Doc2Vec(vector_size=300, min_count=0, alpha=0.025, min_alpha=0.00025, dm=1)
model.build_vocab(it)
for epoch in range(100):
    model.train(it, epochs=model.iter, total_examples=model.corpus_count)
    model.alpha -= 0.0002
    model.min_alpha = model.alpha
model.save('doc2vecs.model')

model_d2v = gensim.models.doc2vec.Doc2Vec.load('doc2vecs.model')
sim = model_d2v.docvecs.most_similar('file1.txt')
print(sim)
**this is the output result**
[('file2.txt', 0.9279470443725586), ('file6.txt', 0.9258157014846802), ('file3.txt', 0.92499840259552), ('file5.txt', 0.9209873676300049), ('file4.txt', 0.9180108308792114), ('file7.txt', 0.9141069650650024)]
What am I doing wrong? How could I improve the accuracy of the results?
What is your it data, and how is it prepared? (For example, what does print(iter(it).next()) do, especially if you call it twice in a row?)
By calling train() 100 times, and also retaining the default model.iter of 5, you're actually making 500 passes over the data. And the first 5 passes will use train()'s internal, effective alpha-management to lower the learning rate gradually to your declared min_alpha value. Then your next 495 passes will be at your own clumsily-managed alpha rates, first back up near 0.025 and then lower each batch-of-5 until you reach 0.005.
None of that is a good idea. You can just call train() once, passing it your desired number of epochs. A typical number of epochs in published work is 10-20. (A bit more might help with a small dataset, but if you think you need hundreds, something else is probably wrong with the data or setup.)
If it's a small amount of data, you won't get very interesting Word2Vec/Doc2Vec results, as these algorithms depend on lots of varied examples. Published results tend to use training sets with tens-of-thousands to millions of documents, and each document at least dozens, but preferably hundreds, of words long. With tinier datasets, sometimes you can squeeze out adequate results by using more training passes, and smaller vectors. Also using the simpler PV-DBOW mode (dm=0) may help with smaller corpuses/documents.
The values reported by most_similar() are not similarity "percentages". They're cosine-similarity values, from -1.0 to 1.0, and their absolute values are less important than the relative ranks of different results. So it shouldn't matter if there are a lot of results with >0.9 similarities – as long as those documents are more like the query document than those lower in the rankings.
Looking at the individual documents suggested as most-similar is thus the real test. If they seem like nonsense, it's likely there are problems with your data or its preparation, or training parameters.
For datasets with sufficient, real natural-language text, it's typical for higher min_count values to give better results. Real text tends to have lots of low-frequency words that don't imply strong things without many more examples, and thus keeping them during training serves as noise making the model less strong.
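For reference, a minimal corrected version of the training code above; the parameter values are illustrative, it must be a restartable iterable of TaggedDocument objects, and attribute names follow recent gensim versions:

import gensim

model = gensim.models.doc2vec.Doc2Vec(vector_size=300, min_count=2, epochs=20, dm=1)
model.build_vocab(it)
# A single train() call; alpha decays internally from its default to min_alpha.
model.train(it, total_examples=model.corpus_count, epochs=model.epochs)
model.save('doc2vecs.model')

sim = model.docvecs.most_similar('file1.txt')
print(sim)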
Without knowing the contents of the documents, here are two hints that might help you.
Firstly, 100 epochs will probably be too small for the model to learn the differences.
Also, check the contents of the documents against the corpus you are using, and make sure the vocabulary is relevant for your files.

How do I avoid re-training machine learning models

self-learner here.
I am building a web application that predict events.
Let's consider this quick example.
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))
How can I keep the state of neigh so that when I enter a new value like neigh.predict([[1.2]]) I don't need to re-train the model? Is there any good practice, or hint, to start solving the problem?
You've chosen a slightly confusing example for a couple of reasons. First, when you say neigh.predict([[1.2]]), you aren't adding a new training point, you're just doing a new prediction, so that doesn't require any changes at all. Second, KNN algorithms aren't really "trained" -- KNN is an instance-based algorithm, which means that "training" amounts to storing the training data in a suitable structure. As a result, this question has two different answers. I'll try to answer the KNN question first.
K Nearest Neighbors
For KNN, adding new training data amounts to appending new data points to the structure. However, it appears that scikit-learn doesn't provide any such functionality. (That's reasonable enough -- since KNN explicitly stores every training point, you can't just keep giving it new training points indefinitely.)
If you aren't using many training points, a simple list might be good enough for your needs! In that case, you could skip sklearn altogether, and just append new data points to your list. To make a prediction, do a linear search, saving the k nearest neighbors, and then make a prediction based on a simple "majority vote" -- if out of five neighbors, three or more are red, then return red, and so on. But keep in mind that every training point you add will slow the algorithm.
If you need to use many training points, you'll want to use a more efficient structure for nearest neighbor search, like a K-D Tree. There's a scipy K-D Tree implementation that ought to work. The query method allows you to find the k nearest neighbors. It will be more efficient than a list, but it will still get slower as you add more training data.
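A minimal sketch of the incremental-KNN idea using scipy; since scipy's cKDTree is immutable, this sketch simply rebuilds the tree whenever a point is added (fine for modest data sizes, and the class name is illustrative):

from collections import Counter
import numpy as np
from scipy.spatial import cKDTree

class IncrementalKNN:
    def __init__(self, k=3):
        self.k = k
        self.X, self.y = [], []
        self.tree = None

    def add(self, x, label):
        # Store the new training point and rebuild the (immutable) KD-tree.
        self.X.append(x)
        self.y.append(label)
        self.tree = cKDTree(np.asarray(self.X))

    def predict(self, x):
        k = min(self.k, len(self.X))
        _, idx = self.tree.query(np.asarray(x), k=k)
        votes = [self.y[i] for i in np.atleast_1d(idx)]
        return Counter(votes).most_common(1)[0][0]   # simple majority vote

knn = IncrementalKNN(k=3)
for xi, yi in zip([[0], [1], [2], [3]], [0, 0, 1, 1]):
    knn.add(xi, yi)
print(knn.predict([1.1]))   # -> 0
knn.add([1.5], 1)           # new labelled point, no full "retraining" step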
Online Learning
A more general answer to your question is that you are (unbeknownst to yourself) trying to do something called online learning. Online learning algorithms allow you to use individual training points as they arrive, and discard them once they've been used. For this to make sense, you need to be storing not the training points themselves (as in KNN) but a set of parameters, which you optimize.
This means that some algorithms are better suited to this than others. sklearn provides just a few algorithms capable of online learning. These all have a partial_fit method that will allow you to pass training data in batches. The SGDClassifier with 'hinge' or 'log' loss is probably a good starting point.
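For example, a minimal sketch with SGDClassifier and partial_fit, reusing the toy data from the question; note that all class labels must be declared on the first partial_fit call:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='hinge')
classes = np.array([0, 1])   # every class must be known up front

# Initial batch of training data
clf.partial_fit([[0], [1], [2], [3]], [0, 0, 1, 1], classes=classes)
print(clf.predict([[1.1]]))

# Later, a new labelled example arrives: update in place, no retraining from scratch
clf.partial_fit([[1.6]], [1])
print(clf.predict([[1.6]]))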
Or maybe you just want to save your model after fitting
joblib.dump(neigh, FName)
and load it when needed
neigh = joblib.load(FName)
neigh.predict([[1.1]])
