Loading saved params (a Pandas series) into a Statsmodels state-space model - python

I'm building a dynamic factor model using the excellent Python package statsmodels, and I would like to pickle an estimated parameter vector so I can build the model again later and load those params into it. (Cf. this notebook by Chad Fulton: https://github.com/ChadFulton/tsa-notebooks/blob/master/dfm_coincident.ipynb.)
In the following block of code, initial parameters are estimated with mod.fit() (using the Powell algo), and then initial_res.params (a Pandas Series) is passed back to mod.fit() as the starting parameters to complete the estimation (using the EM algo).
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)
initial_res = mod.fit(method='powell', disp=False)
res = mod.fit(initial_res.params)
I would like to pickle res.params (again, a small Pandas Series, and a small disk footprint). Then later build the model from scratch again, and load my saved parameters into it without having to re-estimate the model. Anyone know how that can be done?
Examples I have seen suggest pickling the results object res, but that can be a pretty big save. Building it from scratch is pretty simple, but estimation takes a while. It may be that estimation starting from the saved optimal params is quicker; but still, that's pretty amateurish, right?
TIA,
Drew

You can use the smooth method on any state space model to construct a results object from specific parameters. In your example:
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)
initial_res = mod.fit(method='powell', disp=False)
res = mod.fit(initial_res.params)
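res.params.to_csv(...)  # writes the parameter names (index) and values
# ...later...
# read_csv returns a DataFrame by default, so restore the index and squeeze back to a Series
params = pd.read_csv(..., index_col=0).squeeze("columns")
res = mod.smooth(params)  # builds a results object without re-estimating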

Related

How to properly regularize a np.polyval() function?

I'm attempting to recreate and plot this regularization function
based off of Bishop's Pattern Recognition and ML book, but I am getting vastly different results.
I am using np.polyfit() to obtain the coefficients w and np.polyval() to calculate them with an input. I get a correctly fit model (though I'm aware of the overfitting problem) when lambda=0 (meaning no regularization has actually taken place), but I struggle to get anything even resembling the examples in the book when lambda is anything else.
Here's a snippet of the relevant code
def poly_fit(M,x,t):
    poly = np.polyfit(x, t, deg=M)
    return poly
poly = poly_fit(M,x,t)
reg = np.polyval(poly,rand_gen) + ((lamb/2)*(poly@poly))
where rand_gen is a vector with random variables, and in my three attempts I set lambda as lamb=np.exp(-np.inf), lamb=np.exp(-18), and lamb=np.exp(0)
This is what I'm attempting to recreate:
This is what I get:
ln(lambda)=-infinity:
ln(lambda)=-18:
ln(lambda)=0:
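One possible reading of the mismatch: in Bishop's formulation the penalty (lambda/2)*||w||^2 is added to the error function that is minimised when fitting w, not to the predictions afterwards, so the coefficients themselves change with lambda. The sketch below assumes that reading and solves the regularised least-squares problem in closed form, w = (Phi^T Phi + lambda*I)^(-1) Phi^T t. The function name `ridge_polyfit` and the synthetic data are hypothetical, chosen only for illustration.

```python
import numpy as np


def ridge_polyfit(x, t, M, lamb):
    """Minimise 0.5 * sum((Phi @ w - t)**2) + 0.5 * lamb * (w @ w)."""
    # Columns are x**M ... x**0, the coefficient order np.polyval expects.
    Phi = np.vander(x, M + 1)
    A = Phi.T @ Phi + lamb * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)


# Hypothetical data in the spirit of the book's example: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Evaluate the regularised fits on a dense grid, ready for plotting.
x_plot = np.linspace(0.0, 1.0, 200)
curves = {}
for ln_lamb in (-18.0, 0.0):
    w = ridge_polyfit(x, t, M=9, lamb=np.exp(ln_lamb))
    curves[ln_lamb] = np.polyval(w, x_plot)
```

With lamb=0 this reduces to the ordinary least-squares fit that np.polyfit computes (for low degrees, where the normal equations are well conditioned), and larger values of lamb shrink the coefficient vector.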

Scikit-learn QuantileRegressor memory allocation error. No issue with statsmodel QuantReg with the same data

I'm trying to fit a quantile regression model to my input data. I would like to use sklearn, but I am getting a memory allocation error when I try to fit the model. The same data with the statsmodels equivalent function is working fine.
The error I get is the following:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 55.9 GiB for an array with shape (86636, 86636) and data type float64
This doesn't make any sense to me: my X and y have shapes (86636, 4) and (86636, 1) respectively.
Here's my script:
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import QuantileRegressor
training_df = pd.read_csv("/path/to/training_df.csv") # 86,000 rows
FEATURES = [
    "feature_1",
    "feature_2",
    "feature_3",
    "feature_4",
]
TARGET = "target"
# STATSMODELS WORKS FINE WITH 86,000, RUNS IN 2-3 SECONDS.
model_statsmodels = sm.QuantReg(training_df[TARGET], training_df[FEATURES]).fit(q=0.5)
# SKLEARN GIVES A MEMORY ALLOCATION ERROR, OR TAKES MINUTES TO RUN IF I SIGNIFICANTLY TRIM THE DATA TO < 1000 ROWS.
model_sklearn = QuantileRegressor(quantile=0.5, alpha=0)
model_sklearn.fit(training_df[FEATURES], training_df[TARGET])
I've checked the sklearn documentation and I'm pretty sure my inputs are fine as DataFrames; I get the same issues with NumPy ndarrays, so I'm not sure what the issue is. Is it possible there's an issue with something under the hood?
[Here][1] is the scikit-learn documentation for QuantileRegressor.
Many thanks for any help / ideas.
[1]: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.QuantileRegressor.html
The sklearn QuantileRegressor class uses linear programming to solve the quantile regression problem, which is much more computationally expensive than the iteratively reweighted least squares used by the statsmodels QuantReg class.
Here is a github issue for the same problem: https://github.com/scikit-learn/scikit-learn/issues/22922

How to replace sts.LinearRegression with non-linear model for a Tensorflow Probability Structured Time Series model component

I'm creating a time-series forecasting model with external, controllable features similar to the "Forecasting Demand for Electricity" example found at https://medium.com/tensorflow/structural-time-series-modeling-in-tensorflow-probability-344edac24083. In order to model the influence of the external factors, I am using an sts.LinearRegression() as a component of my model, but those external factors are very non-linear in nature and it's causing unwanted negative predictions in my model.
I've tried creating (simpler) forecasting models outside of TFP STS, and found that a RandomForestRegressor works much better than a LinearRegressor for these external features. What I'd LIKE to do is replace the sts.LinearRegression() with an sts.RandomForestRegressor(), but that isn't available from the sts library. In fact, there are hardly any options available from the sts library: https://www.tensorflow.org/probability/api_docs/python/tfp/sts/LinearRegression
I've also tried converting my target variable to log form, but there are numerous instances of zeros (whose log is -inf), so this doesn't turn out to be a useful transformation.
My model architecture for TFP STS looks something like this:
def build_model(observed_time_series):
    season_effect = sts.Seasonal(
        num_seasons = 4, num_steps_per_season = 13, observed_time_series = observed_time_series,
        name = 'season_effect')
    marketing_effect = sts.LinearRegression(
        design_matrix = tf.stack([recent_publicity - np.mean(recent_publicity),
                                  active_ad - np.mean(active_ad)], axis = -1),
        name = 'marketing_effect')
    autoregressive = sts.Autoregressive(order=1,
        observed_time_series = observed_time_series,
        name = 'autoregressive')
    model = sts.Sum([season_effect,
                     marketing_effect,
                     autoregressive],
                    observed_time_series = observed_time_series)
    return model
Where I want to change the "marketing_effect" component of the model to something non-linear.
Is my only option here to clone the TFP STS library and create a custom function to handle non-linear data with something like a Random Forest Regressor? Does anyone know of a better option?
I'm not familiar with the usage of random forests in sts models. Can you point to a system where this exists? The trick with tfp.sts is that the math all works out nice & analytically because everything is marginally gaussian. If we can make that work, I think we're definitely open to bringing in other models.

Is there any way in Tensorflow 2.0 to pass a tensor as parameter of a Matlab function while running Matlab engine?

It is such a long title, but hopefully I will be able to explain myself properly in a few sentences:
I am trying to minimize a given score function using Tensorflow inspired by what was published in Minimize a function of one variable in Tensorflow. The value for such score function is obtained through making a call to a Matlab script which needs to be provided with only one parameter (which is related to the input variable, a tensor).
To do so I am using the beta version of Tensorflow 2.0, which includes a feature known as eager execution that gives access to the contents of each tensor without needing to run any session.
Here is a sketch of my code:
import matlab.engine
import numpy as np
import tensorflow as tf
eng = matlab.engine.start_matlab()
def costFunction():
    z = tf.add(x,y).numpy()
    H = np.asarray(eng.matlabfunction(matlab.double(z.tolist()),...)) # There are other parameters (Python lists) to be passed as arguments to my Matlab script alongside them, not included for the sake of simplicity
    h = tf.convert_to_tensor(...) # Here I retrieve those elements from matrix H which I actually aim to maximize
    return h
x = tf.Variable(initial_value=tf.zeros([6,N], tf.float64), trainable=True)
opt = tf.optimizers.Adam(learning_rate=1e-5, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
iters = 1000
for i in range(iters):
    train = opt.minimize(costFunction, tunedPhases)
    if i % 100 == 0:
        print("Iteration {}, loss: {}".format(i+1, costFunction()))
Sadly, this solution still does not work, as I receive the following error message as output:
ValueError: No gradients provided for any variable: ['Variable:0'].
After an exhaustive search, I think this problem is related to this old post (TensorFlow: 'ValueError: No gradients provided for any variable'), which was solved by applying the cost function's operations directly to the tensors. However, I have no other option but to invoke this matlabfunction and use its output as the output of my cost function.
Do you have any ideas about how to overcome this?
Many thanks in advance, and may you all have a nice week!
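For context, the error arises because calling .numpy() and then an external function takes the computation outside anything TensorFlow can differentiate, so no gradient reaches the variable. One possible workaround, sketched below under stated assumptions, is to treat the external call as a black box and estimate its gradient numerically with central finite differences. The `black_box` function is a hypothetical stand-in for the MATLAB call, and the plain gradient-descent loop is only illustrative; it is not the Adam setup from the question.

```python
import numpy as np


def black_box(z):
    # Hypothetical stand-in for the external MATLAB call: any scalar-valued
    # function of the parameters that cannot be differentiated automatically.
    return float(np.sum((z - 3.0) ** 2))


def finite_difference_grad(f, z, eps=1e-6):
    # Central differences: two function evaluations per parameter.
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step.flat[i] = eps
        grad.flat[i] = (f(z + step) - f(z - step)) / (2 * eps)
    return grad


# Plain gradient descent on the black-box score.
z = np.zeros(4)
learning_rate = 0.1
for _ in range(200):
    z -= learning_rate * finite_difference_grad(black_box, z)
```

Each gradient estimate costs two calls per parameter, which may be slow when every call goes through the MATLAB engine; a gradient-free optimizer is another option in that case.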

Tensorflow - Training on condition

I am training a neural network with tensorflow (1.12) in a supervised fashion. I'd like to only train on specific examples. The examples are created on the fly by cutting out subsequences, hence I want to do the conditioning within tensorflow.
This is my original part of code:
train_step, gvs = minimize_clipped(optimizer, loss,
                                   clip_value=FLAGS.gradient_clip,
                                   return_gvs=True)
gradients = [g for (g,v) in gvs]
gradient_norm = tf.global_norm(gradients)
tf.summary.scalar('gradients/norm', gradient_norm)
eval_losses = {'loss1': loss1,
               'loss2': loss2}
The training step is later executed as:
batch_eval, _ = sess.run([eval_losses, train_step])
I was thinking about inserting something like
train_step_fake = ????
eval_losses_fake = tf.zeros_like(tensor)
train_step_new = tf.cond(my_cond, train_step, train_step_fake)
eval_losses_new = tf.cond(my_cond, eval_losses, eval_losses_fake)
and then doing
batch_eval, _ = sess.run([eval_losses_new, train_step_new])
However, I am not sure how to create a fake train_step.
Also, is this a good idea in general or is there a smoother way of doing this? I am using a tfrecords pipeline, but no other high-level modules (like keras, tf.estimator, eager execution etc.).
Any help is obviously greatly appreciated!
To answer the specific question first: it is certainly possible to run your training step only when a tf.cond condition holds. Note that the 2nd and 3rd arguments must be callables (lambdas, for example), so it would look more like:
train_step_new = tf.cond(my_cond, lambda: train_step, lambda: train_step_fake)
eval_losses_new = tf.cond(my_cond, lambda: eval_losses, lambda: eval_losses_fake)
Your instinct that this may not be the right thing to do is correct, though. tf.cond only makes conditional the ops that are created inside the two callables, so a train_step op built outside them would still run on every sess.run call.
It is far preferable (both for efficiency and for reading and reasoning about your code) to filter out the data you want to ignore before it reaches your model in the first place.
This is something you could achieve using the Dataset API, which has a really useful filter() method. If you are already using the Dataset API to read your TFRecords, this should be as simple as adding something along the lines of:
dataset = dataset.filter(lambda x: {whatever op you were going to use in tf.cond})
If you are not yet using the dataset API, now is probably the time to have a little read up on it and consider it rather than butchering the model with that tf.cond() to act as a filter.
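As a rough illustration of the filter() suggestion, the sketch below drops unwanted examples before batching. It assumes TensorFlow 2.x with eager execution so the result can be inspected directly; the toy features and the "keep" flag are hypothetical stand-ins for the parsed TFRecord fields and for whatever test my_cond performs.

```python
import tensorflow as tf

# Hypothetical toy examples; "keep" stands in for whatever my_cond would test.
features = {
    "x": tf.constant([1.0, 2.0, 3.0, 4.0]),
    "keep": tf.constant([1, 0, 1, 0]),
}
dataset = tf.data.Dataset.from_tensor_slices(features)

# The predicate must return a scalar boolean tensor for each example.
dataset = dataset.filter(lambda example: tf.equal(example["keep"], 1))
dataset = dataset.batch(2)

# TensorFlow 2.x eager iteration; under 1.12 an iterator and sess.run would be used instead.
kept = [batch["x"].numpy().tolist() for batch in dataset]
```

Because the filter runs before batch(), every batch that reaches the model contains only examples that passed the condition, and the training graph itself stays unchanged.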
