Differences between MiniBatchKMeans.fit and MiniBatchKMeans.partial_fit - python

I am interested in sklearn.cluster.MiniBatchKMeans as a way to handle huge datasets. However, I am a bit confused about the difference between MiniBatchKMeans.partial_fit() and MiniBatchKMeans.fit().
The documentation for fit() states:
Compute the centroids on X by chunking it into mini-batches.
while the documentation for partial_fit() states:
Update k means estimate on a single mini-batch X.
So, as I understand it, fit() splits the dataset into chunks of data with which it trains the k-means (I guess the batch_size argument of MiniBatchKMeans() refers to this), while partial_fit() uses all the data passed to it to update the centres. The term "update" may seem ambiguous as to whether an initial training (using fit()) has to have been performed first, but judging from the example in the documentation this is not necessary (I can also use partial_fit() from the beginning).
Is it true that partial_fit() will use all the data passed to it regardless of size, or is the data size bound by the batch_size passed to the MiniBatchKMeans constructor? Also, if batch_size is set greater than the actual data size, is the result the same as the standard k-means algorithm (I guess efficiency could vary in that case due to the different implementations)?

TL;DR
partial_fit is for online clustering, whereas fit is for offline clustering; however, I think MiniBatchKMeans's partial_fit method is still a little rough.
Long explanation
I dug through old PRs in the repo and found this one, which seems to be the first commit of this implementation. It mentions that this algorithm can implement the partial_fit method as an online clustering method (following the online API discussion).
So, like the BIRCH implementation, this algorithm uses fit for one-time offline clustering and partial_fit for online clustering.
However, I did some tests comparing the ARI of the resulting labels from fit on the entire dataset versus partial_fit and fit on chunks, and didn't seem to get anywhere, since the ARI results were very low (~0.5); and after changing the initialization, the chunked fit apparently beat partial_fit, which doesn't make sense. You can find my notebook here.
So my guess, based on this response in the PR:
I believe that this branch can and should be merged.
The online fitting API (partial_fit) probably needs to mature, but I
think that it is a bit out of scope of this work. The core
contribution, that is a mini-batch K-means, is nice, and does seem to
speed up things.
is that the implementation hasn't changed much since that PR, and the partial_fit method is still a little rough. The implementation has changed between 2011 and now (comparing against the release tag), but both versions call the function _mini_batch_step once in partial_fit (without verbose info) and multiple times in fit (with verbose info).
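For reference, here is a minimal sketch (using a synthetic make_blobs dataset and arbitrary parameter values) contrasting the two usage patterns; note that each partial_fit call performs one mini-batch update on all the rows you pass it, regardless of batch_size:

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

# Offline: fit() chunks X internally into mini-batches of size batch_size.
mbk_offline = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0)
mbk_offline.fit(X)

# Online: one mini-batch update per call, using the whole chunk passed in.
mbk_online = MiniBatchKMeans(n_clusters=5, random_state=0)
for chunk in np.array_split(X, 10):
    mbk_online.partial_fit(chunk)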

Related

How to effectively tune the hyper-parameters of Gensim Doc2Vec to achieve maximum accuracy in Document Similarity problem?

I have around 20k documents with 60–150 words each. Out of these 20k documents, there are 400 documents for which the similar documents are known. These 400 documents serve as my test data.
At present I am removing those 400 documents and using the remaining 19,600 documents for training the doc2vec. Then I extract the vectors of the train and test data. Now, for each test document, I find its cosine distance to all 19,600 train documents and select the top 5 with the least cosine distance. If the marked similar document is present in these top 5, I count the record as accurate. Accuracy % = number of accurate records / total number of records.
The other way I find similar documents is by using doc2vec's most_similar method, then calculating accuracy using the formula above.
The two accuracy figures don't match: with each epoch, one increases while the other decreases.
I am using the code given here for training the Doc2Vec: https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e.
I would like to know how to tune the hyperparameters so that I can get the maximum accuracy using the above-mentioned formula. Should I use cosine distance to find the most similar documents, or should I use gensim's most_similar function?
The article you've referenced has a reasonable exposition of the Doc2Vec algorithm, but its example code includes a very damaging anti-pattern: calling train() multiple times in a loop while manually managing alpha. This is hardly ever a good idea and is very error-prone.
Instead, don't change the default min_alpha; call train() just once with the desired epochs, and let the method manage the alpha decay smoothly itself.
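For illustration, a minimal sketch of that pattern (the parameter values and the tokenized_corpus variable are placeholders, not recommendations):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tokenized_corpus: your own list of token lists (placeholder name)
tagged_docs = [TaggedDocument(words=tokens, tags=[i])
               for i, tokens in enumerate(tokenized_corpus)]

model = Doc2Vec(vector_size=100, min_count=2, epochs=40)  # illustrative values
model.build_vocab(tagged_docs)
# A single train() call; alpha decays smoothly from alpha to min_alpha internally.
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)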
Your general approach is reasonable: develop a repeatable way of scoring your models based on some prior idea of what a good result looks like, then try a wide range of model parameters and pick the one that scores best.
When you say that your two methods of accuracy calculation don't match, that's a little concerning, because the most_similar() method does in fact check your query point against all known doc-vectors and returns those with the greatest cosine similarity. Those should be identical to the ones you've calculated to have the least cosine distance. If you added your exact code to your question – how you're calculating cosine distances, and how you're calling most_similar() – it would probably become clear what subtle differences or errors are causing the discrepancy. (There shouldn't be any essential difference, but given the mismatch, you'll likely want to use the most_similar() results, because they're known to be non-buggy and use efficient bulk array library operations that are probably faster than whatever loop you've authored.)
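To make the comparison concrete, here is a sketch of both routes (assuming gensim >= 4.0, where the document vectors live in model.dv; in older versions it is model.docvecs, and 'some_tag' is a placeholder document tag):

import numpy as np

query_vec = model.dv['some_tag']  # or model.infer_vector(tokens) for an unseen doc

# Library route: top-5 neighbours by cosine similarity.
top5_library = model.dv.most_similar([query_vec], topn=5)

# Manual route: cosine similarity against every stored doc-vector.
all_vecs = model.dv.vectors
sims = all_vecs @ query_vec / (np.linalg.norm(all_vecs, axis=1) * np.linalg.norm(query_vec))
top5_manual = np.argsort(-sims)[:5]  # indices of the 5 most-similar docs

Both routes should agree, up to the query document itself appearing in its own results.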
Note that you don't necessarily have to hold back your set of known-highly-similar document pairs. Since Doc2Vec is an unsupervised algorithm, you're not feeding it the preferred "make sure these documents are similar" results during training. It's fairly reasonable to train on the full set of documents, then pick the model that best captures your desired most-similar relationships, and believe that the inclusion of more documents actually helped you find the best parameters.
(Such a process might, however, slightly over-estimate the expected accuracy on future unseen docs, or some other hypothetical "other 20K" training documents. But it would still be plausibly finding the "best possible" metaparameters given your training data.)
(If you don't feed them all during training, then during testing you'll need to be using infer_vector() for the unseen docs, rather than just looking up the learned vectors from training. You haven't shown your code for such scoring/inference, but that's another step that might be done wrong. If you just train vectors for all available docs together, that possibility for error is eliminated.)
Checking if desired docs are in the top-5 (or top-N) most-similar is just one way to score a model. Another way, which was used in a couple of the original 'Paragraph Vector' (Doc2Vec) papers, is, for each such pair, to also pick another random document. Count the model as accurate each time it reports the known-similar docs as closer to each other than the third, randomly-chosen document. In the original 'Paragraph Vector' papers, existing search-ranking systems (which reported certain text snippets in response to the same probe queries) or hand-curated categories (as in Wikipedia or Arxiv) were used to generate such evaluation pairs: texts in the same search-results page, or the same category, were checked to see if they were 'closer' to each other inside a model than other random docs.
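A rough sketch of that triplet-style scoring (known_pairs and all_tags are placeholder names for your own pair list and document tags; assumes gensim >= 4.0):

import random

def triplet_accuracy(model, known_pairs, all_tags):
    # Fraction of known-similar pairs that are closer to each other
    # than one of them is to a randomly chosen third document.
    hits = 0
    for a, b in known_pairs:
        c = random.choice([t for t in all_tags if t not in (a, b)])
        if model.dv.similarity(a, b) > model.dv.similarity(a, c):
            hits += 1
    return hits / len(known_pairs)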
If your question were expanded to describe more about some of the initial parameters you've tried (such as the full parameters you're supplying to Doc2Vec and train()), and what has seemed to help or hurt, it might then be possible to suggest other ranges of parameters worth checking.

Optimize Random Forest regressor due to computational limits

Model fitting using a Random Forest regressor takes up all the RAM, which causes the online hosted notebook environment (Google Colab or Kaggle kernel) to crash. Could you help me out with optimizing the model?
I have already tried tuning parameters such as reducing the number of estimators, but it doesn't help. df.info() shows 4446965 records for the train data, which takes up ~1 GB of memory.
I can't post the whole notebook code here as it would be too long, but please check this link for reference. I've provided some information below about the dataframe used for training.
clf = RandomForestRegressor(n_estimators=100, min_samples_leaf=2, min_samples_split=3, max_features=0.5, n_jobs=-1)
clf.fit(train_X, train_y)
pred = clf.predict(val_X)
train_x.info() shows 3557572 records taking up almost 542 MB of memory
I'm still getting started with ML and any help would be appreciated. Thank you!
Random Forest by nature puts a massive load on the CPU and RAM, and that's one of its well-known drawbacks! So there is nothing unusual in your case.
Furthermore, and more specifically, there are several factors that contribute to this issue, to name a few:
The Number of Attributes (features) in Dataset.
The Number of Trees (n_estimators).
The Maximum Depth of the Tree (max_depth).
The Minimum Number of Samples required to be at a Leaf Node (min_samples_leaf).
Moreover, Scikit-learn's documentation clearly states this issue, and I am quoting here:
The default values for the parameters controlling the size of the
trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown
and unpruned trees which can potentially be very large on some data
sets. To reduce memory consumption, the complexity and size of the
trees should be controlled by setting those parameter values.
What to Do?
There's not much you can do, especially since Scikit-learn does not offer an option to manage the memory issue on the fly (as far as I am aware).
Instead, you need to change the values of the above-mentioned parameters, for example:
Try to keep only the most important features if the number of features is already high (see Feature Selection in Scikit-learn and Feature importances with forests of trees).
Try to reduce the number of estimators.
max_depth is None by default which means the nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_leaf is 1 by default: A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
So try changing the parameters while understanding their effects on performance; the reference you need is this.
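For example, a sketch of a more memory-conscious configuration (the numbers are purely illustrative, not recommendations):

from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(
    n_estimators=50,       # fewer trees -> less memory
    max_depth=20,          # cap depth instead of growing trees fully
    min_samples_leaf=10,   # larger leaves -> smaller trees
    max_features=0.5,      # subsample features at each split
    n_jobs=2,              # limit parallel workers to limit peak RAM
)
clf.fit(train_X, train_y)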
The final option you have is to create your own customized Random Forest from scratch and load the metadata to hard disk, etc., or do any other optimization. It's awkward, but just to mention such an option, here is an example of a basic implementation!
Side-Note:
In practice, I've found on my Core i7 laptop that setting the parameter n_jobs to -1 overwhelms the machine; I always find it more efficient to keep the default setting n_jobs=None, although theoretically speaking it should be the opposite!

Python Sklearn - Multidimensional Parameter Optimization

I am not too convinced by the parameter optimization classes sklearn provides; in fact, GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) just loops over the parameters I pass in via param_grid. RandomizedSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) seems like a we-gave-up approach to me.
One of my approaches was to reuse the idea of adaptive integration of a function (from numerical mathematics): basically, reduce the search space of a parameter in every iteration until a certain error threshold is reached. The biggest advantage of this method is that the error is also reduced in each iteration.
The problem is that certain parameter values may never be explored if you consider the different score values (precision, roc-auc, etc.) as functions of the parameters. I still got good results in one case, but not so good in other cases.
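A minimal sketch of that refinement idea for a single numeric parameter (score_fn is a placeholder for, e.g., a cross-validation score wrapper):

import numpy as np

def refine_search(score_fn, low, high, n_points=5, tol=1e-3, max_iter=10):
    # Repeatedly shrink the search interval around the best-scoring value.
    best_x, best_score = None, -np.inf
    for _ in range(max_iter):
        grid = np.linspace(low, high, n_points)
        scores = [score_fn(x) for x in grid]
        i = int(np.argmax(scores))
        if scores[i] - best_score < tol:  # improvement below threshold -> stop
            break
        best_x, best_score = grid[i], scores[i]
        low, high = grid[max(i - 1, 0)], grid[min(i + 1, n_points - 1)]
    return best_x, best_score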
What would be a good mathematical approach to optimize values more efficiently than GridSearch and RandomizedSearch?

Dataset with only values (0,1,-1) with LSTM or CNN is giving 50% accuracy, whereas RF, SVM, ELM, neural networks are giving above 90%

I have a dataset with 11k instances containing 0s, 1s and -1s. I heard that deep learning can be applied to feature values, hence I applied it to my dataset, but surprisingly it resulted in lower accuracy (<50%) compared to traditional machine learning algorithms (RF, SVM, ELM). Is it appropriate to apply deep learning algorithms to feature values for a classification task? Any suggestion is greatly appreciated.
First of all, Deep Learning isn't a mythical hammer you can throw at every problem and expect better results. It requires careful analysis of your problem, choosing the right method, crafting your network, properly setting up your training, and only then, with a lot of luck will you see significantly better results than classical methods.
From what you describe (and without any more details about your implementation), it seems to me that there could have been several things going wrong:
Your task is simply not designed for a neural network. Some tasks are still better solved with classical methods, since they manually account for patterns in your data, or distill your advanced reasoning/knowledge into a prediction. You might not be directly aware of it, but sometimes neural networks are just overkill.
You don't describe how your 11,000 instances are distributed with respect to the target classes, how big the input is, what kind of preprocessing you are performing for either method, and so on. Maybe your data is simply processed wrong, your training is diverging due to an unfortunate parameter setup, or any of plenty of other things.
To expect a reasonable answer, you would have to share at least a bit of code regarding the implementation of your task, and parameters you are using for training.

NN: outputting a probability density function instead of a single value

This might sound silly but I'm just wondering about the possibility of modifying a neural network to obtain a probability density function rather than a single value when you are trying to predict a scalar. I know that when you are trying to classify images or words you can get a probability for each class, so I'm thinking there might be a way to do something similar with a continuous value and plot it. (Similar to the posterior plot with bayesian optimisation)
Such details could be interesting when deploying a model for prediction and could provide more flexibility than a single value.
Does anyone know a way to obtain such an output?
Thanks!
OK, so I found a solution to this issue, though it adds a lot of overhead.
Initially I thought a Keras callback could be of use, since it provided the flexibility I wanted (i.e. run only on the test data, or only on a subset, and not for every test), but it seems that callbacks are only given summary data from the logs.
So the first step was to create a custom metric that would do the same calculation as any metric with the two arrays (the true values and the predicted values) and, once those calculations are done, write them out to a file for later use.
Then, once we found a way to gather all the data for every sample, the next step was to implement a method that could give a good measure of error. I'm currently implementing a handful of methods, but the most fitting one seems to be Bayesian bootstrapping (user lmc2179 has a great Python implementation). I also implemented ensemble methods and Gaussian processes as alternatives or to use as other metrics, as well as some other Bayesian methods.
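As a rough illustration of the Bayesian bootstrap idea (a generic sketch, not the linked implementation): resampling the per-sample errors with Dirichlet weights gives a posterior-like distribution over the mean error.

import numpy as np

def bayesian_bootstrap_mean(errors, n_draws=2000, seed=None):
    # Posterior-style draws of the mean of `errors` using Dirichlet weights.
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors, dtype=float)
    weights = rng.dirichlet(np.ones(len(errors)), size=n_draws)  # each row sums to 1
    return weights @ errors  # one weighted mean per draw

# e.g. draws = bayesian_bootstrap_mean(y_true - y_pred)
#      np.percentile(draws, [2.5, 97.5]) gives a credible interval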
I'll try to find out whether there are internals in Keras that are set during the training and testing phases, to see if I can set a trigger for my metric. The main issue with using all the data is that you obtain a lot of unreliable data points at the start, since the network is not yet optimized. Some data filtering could be useful to remove a good portion of those points and improve the results of the error predictors.
I'll update if I find anything interesting.
