Sklearn - model persistence without pkl file


I'm interested in saving the model created in Sklearn (e.g., EmpiricalCovariance, MinCovDet or OneClassSVM) and re-applying later on.
I'm familiar with saving a pickle (PKL) file or using joblib; however, I would prefer to save the model parameters explicitly rather than as a serialized Python object.
The main motivation for this is that it allows easily viewing the model parameters.
I found one reference to doing this:
http://thiagomarzagao.com/2015/12/07/model-persistence-without-pickles/
The question is:
Can I count on this working over time (i.e., new versions of sklearn)? Is this too much of a "hacky" solution?
Does anyone have experience doing this?
Thanks
Jonathan

I don't think it's a hacky solution. A colleague has done a similar thing: he exports a model to be consumed by a scorer written in Go, which is much faster than the scikit-learn scorer. If you're worried about compatibility with future versions of sklearn, you should consider using an environment manager like conda or virtualenv; in any case, this is just good software engineering practice and something you should get used to anyway.
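As a rough illustration of what "saving the model explicitly" can look like, here is a minimal sketch for EmpiricalCovariance. It assumes the fitted attributes location_ and precision_ are all the scorer needs; the JSON key names and the helper squared_mahalanobis are made up for this example. Other estimators (e.g. OneClassSVM) would need a different set of attributes.

```python
import json

import numpy as np
from sklearn.covariance import EmpiricalCovariance

# Fit a model on synthetic data (stand-in for the real training set).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
model = EmpiricalCovariance().fit(X)

# Export only the learned parameters as human-readable JSON.
# The key names are arbitrary choices for this sketch.
params = {
    "location": model.location_.tolist(),
    "precision": model.precision_.tolist(),
}
serialized = json.dumps(params, indent=2)

# Later (possibly in another process or language): reload the parameters.
loaded = json.loads(serialized)
location = np.asarray(loaded["location"])
precision = np.asarray(loaded["precision"])


def squared_mahalanobis(X, location, precision):
    """Hypothetical scorer that needs only the saved parameters."""
    centered = X - location
    # Row-wise quadratic form: centered_i^T @ precision @ centered_i
    return np.einsum("ij,jk,ik->i", centered, precision, centered)


restored_scores = squared_mahalanobis(X, location, precision)
```

Because the scorer uses only the exported numbers, it does not depend on how a future sklearn version lays out its objects, and the JSON can be opened in any text editor to inspect the parameters.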

Related

Refit Python's Surprise recommendation system with new data

I've built a recommender system using the Python Surprise library.
The next step is to update the algorithm with new data, for example when a new user or a new item is added.
I've dug into the documentation and found nothing for this case. The only possible way seems to be retraining a new model from scratch from time to time.
It looks like I've missed something, but I can't figure out what exactly.
Can anybody point out how I can refit an existing algorithm with new data?

Unfortunately Surprise doesn't support partial fit yet.
In this thread there are some workarounds and forks with implemented partial fit.

Seasonal-Trend-Loess Method for Time Series in Python

Does anyone know if there is a Python-based procedure to decompose time series utilizing STL (Seasonal-Trend-Loess) method?
I saw references to a wrapper program to call the stl function
in R, but I found that to be unstable and cumbersome from the environment set-up perspective (Python and R together). Also, link was 4 years old.
Can someone point out something more recent (e.g. sklearn, scipy, etc.)?

I haven't tried STLDecompose but I took a peek at it and I believe it uses a general purpose loess smoother. This is hard to do right and tends to be inefficient. See the defunct STL-Java repo.
The pyloess package provides a python wrapper to the same underlying Fortran that is used by the original R version. You definitely don't need to go through a bridge to R to get this same functionality! This package is not actively maintained and I've occasionally had trouble getting it to build on some platforms (thus the fork here). But once built, it does work and is the fastest one you're likely to find. I've been tempted to modify it to include some new features, but just can't bring myself to modify the Fortran (which is pre-processed RATFOR - very assembly-language like Fortran, and I can't find a RATFOR preprocessor anywhere).
I wrote a native Java implementation, stl-decomp-4j, that can be called from python using the pyjnius package. This started as a direct port of the original Fortran, refactored to a more modern programming style. I then extended it to allow quadratic loess interpolation and to support post-decomposition smoothing of the seasonal component, features that are described in the original paper but that were not put into the Fortran/R implementation. (They apparently are in the S-plus implementation, but few of us have access to that.) The key to making this efficient is that the loess smoothing simplifies when the points are equidistant and the point-by-point smoothing is done by simply modifying the weights that one is using to do the interpolation.
The stl-decomp-4j examples include one Jupyter notebook demonstrating how to call this package from python. I should probably formalize that as a python package but haven't had time. Quite willing to accept pull requests. ;-)
I'd love to see a direct port of this approach to python/numpy. Another thing on my "if I had some spare time" list.

Here you can find an example of Seasonal-Trend decomposition using LOESS (STL), from statsmodels.
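Basically, it works like this:
from statsmodels.tsa.seasonal import STL
# TimeSeries: your data, e.g. a pandas Series with a regular DatetimeIndex (otherwise pass period=...)
stl = STL(TimeSeries, seasonal=13)  # seasonal: length of the seasonal smoother (must be odd)
res = stl.fit()
fig = res.plot()  # plots the observed, trend, seasonal and residual components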

There is indeed:
https://github.com/jrmontag/STLDecompose
In the repo you will find a jupyter notebook for usage of the package.

RSTL is a Python port of R's STL: https://github.com/ericist/rstl. It works pretty well, except that it is 3 to 5 times slower than R's STL according to the author.
If you just want a LOWESS trend line, you can use the lowess function from statsmodels:
https://www.statsmodels.org/dev/generated/statsmodels.nonparametric.smoothers_lowess.lowess.html.

Why are `xavier_initializer()` and `glorot_uniform_initializer()` duplicated to some extent?

xavier_initializer(uniform=True, seed=None, dtype=tf.float32) and glorot_uniform_initializer(seed=None, dtype=tf.float32) refer to the same person Xavier Glorot. Why not consolidate them into one function?
xavier_initializer is in tf.contrib.layers. glorot_uniform_initializer in tf. Will the namespace of contrib eventually go away and things in contrib will be moved to the namespace of tf?

Yes, tf.contrib.layers.xavier_initializer and tf.glorot_uniform_initializer both implement the same concept described in this JMLR paper: Understanding the difficulty of training deep feedforward neural networks, which can be seen in the code:
tf.glorot_uniform_initializer
tf.contrib.layers.xavier_initializer
With mode = FAN_AVG and uniform = True, both implementations sample values from a uniform distribution over [-limit, limit), where limit = sqrt(3 / n) and n = (fan_in + fan_out) / 2, i.e. limit = sqrt(6 / (fan_in + fan_out)). The range is [-sqrt(3), sqrt(3)) only in the special case n = 1.
Because tf.initializers has support for a wide variety of initialization strategies, it's highly likely that it will stay, whereas the initializer from contrib, which just has xavier_initializer, will most probably be deprecated in future versions.
So, yes, it's highly likely that in future versions the tf.contrib.layers.xavier_initializer way of initialization will go away.
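To make the sampling range concrete, here is a small NumPy sketch of the uniform variant. It is not the TensorFlow code; the function name glorot_uniform and the layer sizes are made up for illustration. It uses limit = sqrt(3 / n) with n = (fan_in + fan_out) / 2, which is the same as sqrt(6 / (fan_in + fan_out)) and reduces to sqrt(3) only when n = 1.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Illustrative re-implementation, not the TensorFlow source."""
    n = (fan_in + fan_out) / 2.0          # FAN_AVG mode
    limit = np.sqrt(3.0 / n)              # same as sqrt(6 / (fan_in + fan_out))
    weights = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    return weights, limit


rng = np.random.RandomState(0)
weights, limit = glorot_uniform(300, 100, rng)

# A uniform distribution on [-limit, limit) has variance limit**2 / 3 = 1 / n,
# which is the 2 / (fan_in + fan_out) target from the Glorot paper.
expected_variance = 2.0 / (300 + 100)
```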

Interesting question! I'll start with tf.contrib:
Will the namespace of contrib go away? Only when there are no more unstable community contributions to add to TensorFlow, so never. This question may be of interest. I'll summarize.
The contrib namespace is for user-contributed code that is supported by the community (not TensorFlow). Code in contrib is useful enough to be in the API, and probably will be merged eventually. But, until it's thoroughly tested by the TensorFlow team it stays in contrib. I'm confident the docs used to explain why contrib exists, but I can't find it anymore. The closest thing is in the API stability promise, which explains that contrib functions/classes are subject to change!
In a little more depth: things in contrib generally merge into tf eventually. For example, the entirety of Keras merged from contrib to tf.keras in 1.4. But the exact process of the merge varies. For instance, compare tf.contrib.rnn and the RNN functionality in tf.nn. Quite a bit of tf.nn aliases tf.contrib.rnn. I mean, click anything on the tf.nn.rnn_cell guide. You'll be looking at the tf.contrib.rnn doc! Try it! It seems that using tf.contrib.rnn is very stable, despite the fact that it's migrated into "native" tf. On the other hand, the Datasets merge isn't so clean (contrib'ed in 1.3, merged in 1.4). Because some (very few) bits of code were changed during the merge, using tf.contrib.data.TFRecordDataset will give you a nice deprecation warning. And some things have been in contrib for quite a while and show no signs of merging soon: tf.contrib.training.stratified_sample comes to mind. I believe contrib.keras had been around for a while before merging.
Now onto Xavier/Glorot:
Here are links to the source for contrib.xavier... and tf.glorot.... The source code looks (nearly) the same, but let's follow variance_scaling_initializer. Now things differ: xavier uses a function and glorot uses a class (VarianceScaling is aliased as variance_scaling_initializer). Similar again, yes, but at a glance the "native" tf version gives us some different error messages and some better input validation.
So why not remove contrib.xavier? I don't know. If I had to speculate, it's because contrib.xavier took off. I mean, I still use it and I still see it all the time (citation needed?). Now that I know glorot is basically the same, I'm not sure that I'll keep using contrib.xavier. But I digress. I suspect xavier has stayed around because removing it would break a reasonable amount of code. Sure, there's no stability promise for contrib, but why fix (or break) what's not broken?
Posting an issue or pull request on Github could generate some more interesting responses from actual contributors. I suspect you would get reasons it hasn't and won't be removed, but maybe not. My quick search for "xavier" and then "glorot" in the Issues suggests it hasn't been asked before.
EDIT: To be clear, as kmario points out, they're mathematically identical. I'm pointing out that the implementation, as it is today, differs slightly in the realm of input validation and structure. He seems to think xavier is more likely to be deprecated than I initially thought. I'll happily defer to him because he's probably more experienced than I am.

Python - go beyond RAM limits?

I'm trying to analyze text, but my Mac has only 8 GB of RAM, and the RidgeRegressor just stops after a while with Killed: 9. I reckon this is because it needs more memory.
Is there a way to disable the stack size limiter so that the algorithm could use some kind of swap memory?

You will need to do it manually.
There are probably two different core-problems here:
A: holding your training-data
B: training the regressor
For A, you can try numpy's memmap which abstracts swapping away.
As an alternative, consider preparing your data to HDF5 or some DB. For HDF5, you can use h5py or pytables, both allowing numpy-like usage.
For B: it's a good idea to use some out-of-core ready algorithm. In scikit-learn those are the ones supporting partial_fit.
Keep in mind that this training process then has to satisfy at least two requirements:
Memory efficiency: swapping is slow, so you don't want to use something that holds N^2 auxiliary memory during learning.
Efficient convergence.
Those algorithms in the link above should be okay for both.
SGDRegressor can be parameterized to resemble ridge regression (squared loss with an L2 penalty).
Also: you may need to call partial_fit manually, obeying the rules of the algorithm (often some kind of random ordering is needed for the convergence proofs). The problem with abstracting away swapping is this: if your regressor does a permutation in each epoch without knowing how costly that is, you might be in trouble!
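To make the partial_fit advice concrete, here is a rough sketch that keeps the features in a numpy memmap on disk and feeds them to SGDRegressor batch by batch. The data is synthetic, and the file name X.dat, the batch size, the number of epochs and alpha are arbitrary choices for this example. SGDRegressor with an L2 penalty only approximates Ridge, so treat this as a starting point rather than a drop-in replacement.

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import SGDRegressor

n_samples, n_features, batch_size = 10_000, 20, 1_000
rng = np.random.RandomState(0)
true_coef = rng.normal(size=n_features)

# Features live in a disk-backed array; only the slices we touch are loaded.
tmp_dir = tempfile.mkdtemp()
X = np.memmap(os.path.join(tmp_dir, "X.dat"), dtype="float64",
              mode="w+", shape=(n_samples, n_features))
y = np.empty(n_samples)  # targets are small enough to keep in RAM here

# Write the synthetic data one batch at a time.
for start in range(0, n_samples, batch_size):
    stop = start + batch_size
    X[start:stop] = rng.normal(size=(batch_size, n_features))
    y[start:stop] = X[start:stop] @ true_coef + 0.01 * rng.normal(size=batch_size)
X.flush()

# Squared loss (the default) plus an L2 penalty resembles ridge regression.
model = SGDRegressor(penalty="l2", alpha=1e-4, random_state=0)

batch_starts = np.arange(0, n_samples, batch_size)
for epoch in range(5):
    # Shuffle the batch order (cheap) instead of permuting all rows on disk.
    rng.shuffle(batch_starts)
    for start in batch_starts:
        stop = start + batch_size
        model.partial_fit(np.asarray(X[start:stop]), y[start:stop])
```

Shuffling whole batches rather than individual rows is a compromise: it keeps disk reads sequential, at the cost of weaker randomization than the convergence theory usually assumes.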
Because the problem itself is quite hard, there are some special libraries built for this, while sklearn needs some more manual work as explained. One of the most extreme ones (a lot of crazy tricks) might be vowpal_wabbit (where IO is often the bottleneck!). Of course there are other popular libs like pyspark, serving a slightly different purpose (distributed computing).

How to train a model in C++ with tensorflow?

I tried to train a deep learning model for an experiment.
I found that TensorFlow is the best way to do this.
But there is a problem: TensorFlow code needs to be written in Python.
And my program contains many loops, like this:
for i in range(1, 2001):
    for j in range(1, 2001):
        ...
I know this is a big drawback of Python.
It is much slower than C.
I know TensorFlow has a C++ API, but the documentation is not clear.
https://www.tensorflow.org/api_docs/cc/index.html
(This is the worst specification I have ever read.)
Can someone give me an easy example of using it?
All I need is two simple pieces of code.
One showing how to create a graph.
The other showing how to load this graph and run it.
I really need this. I hope someone can help me out.

It's not so easy, but it is possible.
First, you need to create the TensorFlow graph in Python and save it to a file.
This article may help you
https://medium.com/jim-fleming/loading-a-tensorflow-graph-with-the-c-api-4caaff88463f#.krslipabt
Second, you need to compile libtensorflow, link it to your program (you need the TensorFlow headers as well, so it's a bit tricky), and load the graph from the file.
This article may help you this time
https://medium.com/jim-fleming/loading-tensorflow-graphs-via-host-languages-be10fd81876f#.p9s69rn7u
