I've designed a model using PyMC3, and I'm having trouble optimizing it with multiple data sets.
The model is somewhat similar to the coal-mining disaster example (as in the PyMC3 tutorial, for those who know it), except that there are multiple switchpoints.
The output of the network is a series of real numbers, for instance:
[151, 152, 150, 20, 19, 18, 0, 0, 0]
import numpy as np
from pymc3 import Model, Normal, Poisson, Bernoulli
from pymc3.math import switch

# n_cycles, data_length, n_peaks and data are defined earlier
with Model() as accrochage_model:
    time = np.linspace(0, n_cycles * data_length, n_cycles * data_length)
    poisson = [Normal('poisson_0', 5, 1), Normal('poisson_1', 10, 1)]
    variance = 3
    t = [Normal('t_0', 0.5, 0.01), Normal('t_1', 0.7, 0.01)]
    taux = [Bernoulli('taux_{}'.format(i), t[i]) for i in range(n_peaks)]
    switchpoint = [Poisson('switchpoint_{}'.format(i), poisson[i]) * taux[i] for i in range(n_peaks)]
    peak = [Normal('peak_0', 150, 2), Normal('peak_1', 50, 2), Normal('peak_2', 0, 2)]
    z_init = switch(switchpoint[0] >= time % n_cycles, 0, peak[0])
    z_list = [switch(sum(switchpoint[j] for j in range(i)) >= time % n_cycles, 0, peak[i] - peak[i - 1])
              for i in range(1, n_peaks)]
    z = sum(z_list[i] for i in range(len(z_list)))
    z += z_init
    m = Normal('m', z, variance, observed=data)
I have multiple realisations of the true distribution, and I'd like to take all of them into account when optimizing the parameters of the system.
Right now the "data" that appears in observed=data is just one list of results, such as:
[151, 152, 150, 20, 19, 18, 0, 0, 0]
What I would like to do is pass not just one but several lists of results, for instance:
data = ([151, 152, 150, 20, 19, 18, 0, 0, 0], [145, 152, 150, 21, 17, 19, 1, 0, 0], [151, 149, 153, 17, 19, 18, 0, 0, 1])
I tried using the shape parameter and making data an array of results, but neither seemed to work; a sketch of the array attempt is below.
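Concretely, the array attempt looked roughly like this (only a sketch; whether observed broadcasts the 2-D array row-wise against z is exactly what I'm unsure about):

data = np.array([[151, 152, 150, 20, 19, 18, 0, 0, 0],
                 [145, 152, 150, 21, 17, 19, 1, 0, 0],
                 [151, 149, 153, 17, 19, 18, 0, 0, 1]])
# last line of the model, now observing a (n_realisations, n_cycles * data_length) array
m = Normal('m', z, variance, observed=data)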
Does anyone have an idea of how to do the inference so that the network is optimized over the entire dataset and not just a single output?
I'm a new PyTorch user with moderate experience in TensorFlow/Keras. The PyTorch examples are fantastic. I've worked through the demand-forecasting lab using the Temporal Fusion Transformer (https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/stallion.html).
It all makes sense, but I haven't figured out how to save the predicted values from notebook section #20 to a NumPy array.
Section #20:
new_raw_predictions, new_x = best_tft.predict(new_prediction_data, mode="raw", return_x=True)
I can see the values in the tensors with print(new_raw_predictions), like this:
{'prediction': tensor([[[3.4951e+00, 1.7341e+01, 2.7446e+01, ..., 6.3175e+01,
                         9.0240e+01, 1.2589e+02],
                        [1.1698e+01, 2.3643e+01, 3.3291e+01, ..., 6.6374e+01,
                         9.1148e+01, 1.3173e+02], ...
I've seen some similar questions asked here, but none of the suggestions seem to work. All attempts result in a similar error, so I'm missing something fundamental about PyTorch and the output tensor; I always get: AttributeError: 'dict' object has no attribute 'new_raw_predictions'
A few examples of what's been tried:
new_raw_predictions.cpu().numpy()
new_raw_predictions.detach().cpu().numpy()
new_raw_predictions.numpy()
The goal is to save the predicted output so I can compare changes to the model. Thanks in advance!
It all depends on how you've created your model, because PyTorch can return values however you specify. In your case, it looks like it returns a dictionary of which 'prediction' is a key. You can convert it to NumPy using the command you supplied above, but with one change:
preds = new_raw_predictions['prediction'].detach().cpu().numpy()
Of course, if it's not on the GPU you don't need to use .detach().cpu(), just .numpy().
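Since the goal is to compare the output across model changes, one option is simply to dump the array to disk with NumPy and reload it later; a minimal sketch (the file name is made up):

import numpy as np

preds = new_raw_predictions['prediction'].detach().cpu().numpy()
np.save('tft_predictions.npy', preds)        # write the array to disk
previous = np.load('tft_predictions.npy')    # reload later for comparison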
I have a running model with a pdf in zfit from which I want to generate toys and afterwards also fit the pdf to those toys. However, I was wondering how to exclude certain regions both in the toy generation and in the subsequent fit. More concretely, this means using multiple limits, so that I have several ranges in which my fit and toy generation run (simultaneously).
Does anyone know how to do this?
This can be achieved with multiple limits by adding Spaces (minus a remark on the fitting range, see below).
I assume you define your model and data along these lines:
obs = zfit.Space('x', (..,..))
model = zfit.pdf.Foo(obs=obs,...)
data = zfit.Data....(obs=obs,...)
To define a single Space with multiple limits, do
obs1a = zfit.Space('x', (..., ...))
obs1b = zfit.Space('x', (..., ...))
obs = obs1a + obs1b
Note that the observable 'x' is the same in both cases, so the Spaces are added and the result is not extended to higher dimensions.
"Fitting Range"
Just to clarify, there is actually "no such thing as a fitting range". There are two ranges that matter:
the data range which are the cuts applied to the data
the normalization range: Integral over it is 1 (by definition)
Often, this two coincide and are named fitting range. Sometimes though the normalization range can be different (e.g. from previous fits to an extended range to the left/right) and "fitting range" is often used equivalently for the "data range" since this defines, which data points are used in the likelihood.
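Putting it together, a rough sketch of how the toy generation and the fit could share the same multi-limit Space (the Gauss pdf, parameter names and numbers below are only placeholders, not from your model):

import zfit

obs1a = zfit.Space('x', (-5, -1))
obs1b = zfit.Space('x', (1, 5))
obs = obs1a + obs1b                    # one Space, two limits

mu = zfit.Parameter('mu', 0.3, -1, 1)
sigma = zfit.Parameter('sigma', 1.2, 0.1, 5)
model = zfit.pdf.Gauss(mu=mu, sigma=sigma, obs=obs)

toys = model.sample(n=1000)            # toys are generated only inside the two ranges

nll = zfit.loss.UnbinnedNLL(model=model, data=toys)
result = zfit.minimize.Minuit().minimize(nll)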
I'm having problems implementing simple class balancing for an H2ORandomForestEstimator; I'm trying to reproduce a simple example found in Darren Cook's book written in R ('Practical Machine Learning with H2O', p. 107).
Working on the Iris dataset, I first artificially unbalance the target variable by cutting out a good share of virginica, keeping only the first 120 rows.
Then I build 3 models: a vanilla one, one where I set balance_classes to True, and a last one where I set balance_classes to True and pass a list to class_sampling_factors to oversample virginica. The list is [1.0, 1.0, 2.5], referring to the classes sorted alphabetically.
I train them and then output the confusion matrix on the training frame for each one.
I'm expecting an unbalanced output for the first one and a balanced one for the last two, but I always get the same result. I checked the documentation example in Python and I can't see anything wrong (I may be tired as well).
This is my code:
# the Iris data is assumed to be already loaded into H2O as `data`
data_unb = data[1:120, :]  # messing up with the target variable
train, valid = data_unb.split_frame([0.8], seed=12345)

m1 = h2o.estimators.random_forest.H2ORandomForestEstimator(seed=12345)
m2 = h2o.estimators.random_forest.H2ORandomForestEstimator(balance_classes=True, seed=12345)
m3 = h2o.estimators.random_forest.H2ORandomForestEstimator(balance_classes=True, class_sampling_factors=[1.0, 1.0, 2.5], seed=12345)

m1.train(x=list(range(4)), y=4, training_frame=train, validation_frame=valid, model_id='RF_defaults')
m2.train(x=list(range(4)), y=4, training_frame=train, validation_frame=valid, model_id='RF_balanced')
m3.train(x=list(range(4)), y=4, training_frame=train, validation_frame=valid, model_id='RF_class_sampling')

m1.confusion_matrix(train)
m2.confusion_matrix(train)
m3.confusion_matrix(train)
This is my output:
[image: my confusion matrices (wrong)]
And this is my expected output:
[image: expected confusion matrices]
What am I evidently missing? Thanks in advance.
You're not missing anything. The balance_classes option is available in H2O Random Forest, but it's not actually functional. The bug is documented here and should be fixed in the next stable release of H2O. Sorry about the confusion!
It should work for the rest of the H2O algorithms (except XGBoost). If you wanted to try a GBM, for example, you'd see it working.
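For example, a quick check could look something like this (a sketch only, reusing the train/valid frames and sampling factors from your code):

from h2o.estimators.gbm import H2OGradientBoostingEstimator

# same settings as m3 above, but with a GBM, where the balancing options do take effect
m_gbm = H2OGradientBoostingEstimator(balance_classes=True,
                                     class_sampling_factors=[1.0, 1.0, 2.5],
                                     seed=12345)
m_gbm.train(x=list(range(4)), y=4, training_frame=train,
            validation_frame=valid, model_id='GBM_balanced')
m_gbm.confusion_matrix(train)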
With CNTK I have created a network with 2 input neurons and 1 output neuron.
A line in the training file looks like
|features 1.567518 2.609619 |labels 1.000000
The network was then trained with BrainScript. Now I want to use the network to predict values. For example: the input data is [1.82, 3.57]. What is the output from the net?
I have tried Python with the following code, but I'm new to it and the code does not work. So my question is: how do I pass the input data [1.82, 3.57] to the eval function?
On Stack Overflow there are some hints, here and here, but this is too abstract for me.
Thank you.
import cntk as ct
import numpy as np
z = ct.load_model("LR_reg.dnn", ct.device.cpu())
input_data= np.array([1.82, 3.57], dtype=np.float32)
pred = z.eval({ z.arguments[0] : input_data })
print(pred)
Here's the most defensive way of doing it. CNTK can be forgiving if you omit some of this when the network is specified with V2 constructs; I'm not sure about a network that was created with V1 code.
Basically, you need a pair of square brackets for each axis. Which axes exist in BrainScript? There's a batch axis, a sequence axis, and then the static axes of your network. You have one-dimensional data, so the following should work:
input_data = np.array([[[1.82, 3.57]]], dtype=np.float32)
This specifies a batch of one sequence, of length one, containing one 1-d vector of two elements. You can also try omitting the outermost brackets and see whether you get the same result.
Update: based on more information from the comment below, we should not forget that the V1 code also saved the parts of the network that compute things like the loss and accuracy. If we provide only the features, CNTK will complain that the labels have not been provided. There are two ways to deal with this issue. One possibility is to provide some fake labels so that the network can evaluate these auxiliary operations. Another possibility is to identify the prediction node and use that. If the prediction was called 'p' in V1, this Python code
p = z.find_by_name('p')
should create a CNTK function that only needs the features in order to compute the prediction.
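Putting both pieces together, the whole evaluation could then look roughly like this (a sketch only; it assumes the prediction node in the saved model really is named 'p'):

import cntk as ct
import numpy as np

z = ct.load_model("LR_reg.dnn", ct.device.cpu())
p = z.find_by_name('p')                  # isolate the prediction node

# a batch of one sequence, of length one, holding one 2-element feature vector
input_data = np.array([[[1.82, 3.57]]], dtype=np.float32)
pred = p.eval({p.arguments[0]: input_data})
print(pred)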
I'm new to Theano and I'm trying to adapt the autoencoder script here to work on text data. That code uses the MNIST dataset as training data, in the form of a 2-d NumPy array.
My data is a CSR sparse matrix of about 100,000 instances with about 50,000 features. The matrix is the result of using sklearn's TfidfVectorizer to fit and transform the text data. Since I'm using sparse matrices, I modified the code to use the theano.sparse package to represent my input.
My training set is the symbolic variable:
train_set_x = theano.sparse.shared(train_set)
However, theano.sparse matrices cannot perform all of the operations used in the original script (there is a list of sparse operations here). The code uses dot and sum from the tensor methods on the input. I have changed the dot to sparse.dot, but I can't find what to replace the sum with, so I'm converting the training batches to dense matrices and using the original tensor methods, as shown in this cost function:
def get_cost(self):
    tilde_x = self.get_corrupted_input(self.x, self.corruption)
    y = self.get_hidden_values(tilde_x)
    z = self.get_reconstructed_input(y)
    # make dense, must be a better way to do this
    L = - T.sum(SP.dense_from_sparse(self.x) * T.log(z)
                + (1 - SP.dense_from_sparse(self.x)) * T.log(1 - z), axis=1)
    cost = T.mean(L)
    return cost

def get_hidden_values(self, input):
    # use theano.sparse.dot instead of T.dot
    return T.nnet.sigmoid(theano.sparse.dot(input, self.W) + self.b)
The get_corrupted_input and get_reconstructed_input methods remain as they are in the link above. My question is: is there a faster way to do this?
Converting the matrices to dense makes training very slow; currently it takes 20.67 minutes to run one training epoch with a batch size of 20 training instances.
Any help or tips would be greatly appreciated!
In the most recent master branch of theano.sparse there is an sp_sum method listed.
(see here)
If you're not using the bleeding-edge version, I'd install that and see whether calling it works and whether doing so speeds things up:
pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
(And if it does, noting it here would be nice; it's not always clear that the sparse functionality is much faster than using dense calculations all the way through, especially on the GPU.)
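For reference, a standalone sketch of what an sp_sum call might look like (the toy matrix and the compiled function are made up for illustration; I haven't benchmarked this against the dense version):

import numpy as np
import scipy.sparse as sp
import theano
import theano.sparse as SP

x = SP.csr_matrix('x')                   # symbolic sparse input
row_sums = SP.sp_sum(x, axis=1)          # analogous to T.sum(dense_x, axis=1)
f = theano.function([x], row_sums)

toy = sp.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]], dtype=theano.config.floatX))
print(f(toy))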