How to get Lime predictions vs Actual predictions in a dataframe? - python

I am working on a binary classification problem using a random forest and the LIME explainer to explain its predictions.
I used the code below to generate the LIME explanations:
import lime
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(ord_train_t.values,
                                                   discretize_continuous=True,
                                                   feature_names=feat_names,
                                                   mode="classification",
                                                   feature_selection="lasso_path",
                                                   class_names=rf_boruta.classes_,
                                                   categorical_names=output,
                                                   kernel_width=10, verbose=True)
i = 969
exp = explainer.explain_instance(ord_test_t.iloc[1, :], rf_boruta.predict_proba,
                                 distance_metric='euclidean', num_features=5)
I got an output like the one below:
Intercept 0.29625037124439896
Prediction_local [0.46168824]
Right:0.6911888737552843
However, the above is only printed as a message on screen.
How can I get this information into a dataframe?

LIME doesn't have direct export-to-DataFrame capabilities, so the way to go appears to be appending the predictions to a list and then transforming it into a DataFrame.
Yes, depending on how many predictions you have, this may take a lot of time, since the model has to explain every instance individually.
This is an example I found; the explain_instance call needs to be adjusted to your model's arguments, but it follows the same logic:
import pandas as pd

l = []
for n in range(X_test.shape[0]):  # note: range(0, X_test.shape[0]+1) would run past the last row
    exp = explainer.explain_instance(X_test.values[n], clf.predict_proba, num_features=10)
    a = exp.as_list()
    l.append(a)
df = pd.DataFrame(l)
If you need more than what as_list() provides, the explanation object carries more data. I ran an example to see what else explain_instance would return.
Instead of appending just the as_list() output, you can also append the other values you need:
a = exp.as_list()
a.append(exp.intercept[1])
l.append(a)
Using this approach you can get the intercept and the prediction_local. For the "Right" value I don't know exactly which attribute it corresponds to, but I am certain the explanation object holds it somewhere under another name.
Set a breakpoint in your code and explore the explanation object; there may be other information you want to save as well.
Lime Github: Issue ref 213

To see the intercept and prediction_local of your explanation you can use exp.intercept and exp.local_pred. See this blog for details.
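If you want all of this in one table, here is a minimal sketch assuming the explainer, rf_boruta and ord_test_t objects from the question (attribute names such as local_pred can differ slightly between lime versions, so check yours):
import pandas as pd

rows = []
for n in range(ord_test_t.shape[0]):
    exp = explainer.explain_instance(ord_test_t.iloc[n, :],
                                     rf_boruta.predict_proba,
                                     num_features=5)
    rows.append({
        "intercept": exp.intercept[1],           # intercept of the local surrogate model
        "prediction_local": exp.local_pred[0],   # prediction of the local surrogate
        # the "Right:" value printed with verbose=True appears to be the underlying
        # model's probability for the explained class, so recompute it here:
        "right": rf_boruta.predict_proba(ord_test_t.iloc[[n]])[0, 1],
        "top_features": exp.as_list(),           # (feature, weight) pairs
    })
df = pd.DataFrame(rows)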

Related

Loading saved params (a Pandas series) into a Statsmodels state-space model

I'm building a dynamic factor model using the excellent python package statsmodels, and I would like to pickle an estimated parameter vector so I can build the model again later, and load those params into it. (C.f., this Notebook built by Chad Fulton: https://github.com/ChadFulton/tsa-notebooks/blob/master/dfm_coincident.ipynb.)
In the following block of code, initial parameters are estimated with mod.fit() (using the Powell algo) and then given back to mod.fit() to complete the estimation (using the EM algo) using the initial parameters as initial_res.params. (The latter is a Pandas Series.)
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)
initial_res = mod.fit(method='powell', disp=False)
res = mod.fit(initial_res.params)
I would like to pickle res.params (again, a small Pandas Series, and a small disk footprint). Then later build the model from scratch again, and load my saved parameters into it without having to re-estimate the model. Anyone know how that can be done?
Examples I have seen suggest pickling the results object res, but that can be a pretty big save. Building it from scratch is pretty simple, but estimation takes a while. It may be that estimation starting from the saved optimal params is quicker; but still, that's pretty amateurish, right?
TIA,
Drew
You can use the smooth method on any state space model to construct a results object from specific parameters. In your example:
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)
initial_res = mod.fit(method='powell', disp=False)
res = mod.fit(initial_res.params)
res.params.to_csv(...)
# ...later...
params = pd.read_csv(...)
res = mod.smooth(params)
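If you would rather avoid the CSV round trip (read_csv returns a DataFrame rather than a Series), here is a small sketch using pickle instead, assuming the same mod and res objects as above (the file name is arbitrary):
import pandas as pd
import statsmodels.api as sm

# save only the fitted parameters (a small Pandas Series, small disk footprint)
res.params.to_pickle("dfm_params.pkl")

# ...later, rebuild the model from scratch and reuse the saved parameters...
mod = sm.tsa.DynamicFactor(endog, k_factors=1, factor_order=2, error_order=2)
params = pd.read_pickle("dfm_params.pkl")
res = mod.smooth(params)  # full results object, no re-estimation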

Keras: model.prediction does not match model.evaluation loss

I applied this tutorial https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb (on a different dataset). The tutorial did not compute the mean squared error for the individual outputs, so I added the following line to the comparison function:
mean_squared_error(signal_true, signal_pred)
However, the loss and MSE from the prediction differed from the loss and MSE reported by model.evaluate on the test data. The errors from model.evaluate (loss, MAE, MSE) on the test set:
[0.013499056920409203, 0.07980187237262726, 0.013792216777801514]
the error from individual target (outputs):
Target0 0.167851388666284
Target1 0.6068108648555771
Target2 0.1710370357827747
Target3 2.747463225418181
Target4 1.7965991690103074
Target5 0.9065426398192563
I think it might be a problem in training the model, but I could not find where exactly. I would really appreciate your help.
Thanks
There are a number of reasons that you can have differences between the loss for training and evaluation.
Certain ops, such as batch normalization, behave differently at prediction time; this can make a big difference with certain architectures, although it generally isn't supposed to if you're using batch norm correctly.
MSE for training is averaged over the entire epoch, while evaluation only happens on the latest "best" version of the model.
It could be due to differences in the datasets if the split isn't random.
You may be using different metrics without realizing it.
I'm not sure exactly what problem you're running into, but it can be caused by a lot of different things and it's often difficult to debug.
I had the same problem and found a solution. Hopefully this is the same problem you encountered.
It turns out that model.predict doesn't return predictions in the same order as generator.labels, and that is why the MSE was much larger when I attempted to calculate it manually (using the scikit-learn metric function).
>>> model.evaluate(valid_generator, return_dict=True)['mean_squared_error']
13.17293930053711
>>> mean_squared_error(valid_generator.labels, model.predict(valid_generator)[:,0])
91.1225401637833
My quick and dirty solution:
import numpy as np

valid_generator.reset()  # necessary for starting from the first batch
all_labels = []
all_pred = []
for i in range(len(valid_generator)):  # necessary for avoiding an infinite loop
    x = next(valid_generator)
    pred_i = model.predict(x[0])[:, 0]
    labels_i = x[1]
    all_labels.append(labels_i)
    all_pred.append(pred_i)
    print(np.shape(pred_i), np.shape(labels_i))
cat_labels = np.concatenate(all_labels)
cat_pred = np.concatenate(all_pred)
The result:
>>> mean_squared_error(cat_labels, cat_pred)
13.172956865002352
This can be done much more elegantly, but was enough for me to confirm my hypothesis of the problem and regain some sanity.
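A possibly cleaner alternative, though this is an assumption on my part since it depends on how your generator is built: construct the validation generator with shuffle=False, so that generator.labels stays in the same order as the batches fed to model.predict. A hypothetical sketch (datagen, valid_df and the column names are placeholders):
from sklearn.metrics import mean_squared_error

# shuffle=False is the important part: it keeps .labels aligned with the
# order in which batches are yielded to model.predict
valid_generator = datagen.flow_from_dataframe(valid_df,
                                              x_col="filename", y_col="target",
                                              class_mode="raw",
                                              shuffle=False)
preds = model.predict(valid_generator)[:, 0]
print(mean_squared_error(valid_generator.labels, preds))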

Tensorflow - Training on condition

I am training a neural network with tensorflow (1.12) in a supervised fashion. I'd like to only train on specific examples. The examples are created on the fly by cutting out subsequences, hence I want to do the conditioning within tensorflow.
This is my original part of code:
train_step, gvs = minimize_clipped(optimizer, loss,
                                   clip_value=FLAGS.gradient_clip,
                                   return_gvs=True)
gradients = [g for (g, v) in gvs]
gradient_norm = tf.global_norm(gradients)
tf.summary.scalar('gradients/norm', gradient_norm)
eval_losses = {'loss1': loss1,
               'loss2': loss2}
The training step is later executed as:
batch_eval, _ = sess.run([eval_losses, train_step])
I was thinking about inserting something like
train_step_fake = ????
eval_losses_fake = tf.zeros_like(tensor)
train_step_new = tf.cond(my_cond, train_step, train_step_fake)
eval_losses_new = tf.cond(my_cond, eval_losses, eval_losses_fake)
and then doing
batch_eval, _ = sess.run([eval_losses, train_step])
However, I am not sure how to create a fake train_step.
Also, is this a good idea in general or is there a smoother way of doing this? I am using a tfrecords pipeline, but no other high-level modules (like keras, tf.estimator, eager execution etc.).
Any help is obviously greatly appreciated!
Answering the specific question first: it's certainly possible to only perform your training step based on the tf.cond outcome. Note that the 2nd and 3rd params are lambdas, though, so it's more like:
train_step_new = tf.cond(my_cond, lambda: train_step, lambda: train_step_fake)
eval_losses_new = tf.cond(my_cond, lambda: eval_losses, lambda: eval_losses_fake)
Your instinct that this may not be the right thing to do is correct though.
It's much preferable (both in terms of efficiency and in terms of reading and reasoning about your code) to filter out the data you want to ignore before it reaches your model in the first place.
This is something you could achieve using the Dataset API, which has a really useful filter() method. If you are using the Dataset API to read your TFRecords right now, then this should be as simple as adding something along the lines of:
dataset = dataset.filter(lambda x: {whatever op you were going to use in tf.cond})
If you are not yet using the Dataset API, now is probably the time to read up on it a little and consider it, rather than butchering the model with that tf.cond() acting as a filter.
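For concreteness, a rough sketch of what that could look like in a TFRecords pipeline (parse_fn, the seq_len feature and min_len are placeholders; substitute whatever condition you were going to put in my_cond):
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parse_fn)  # parse_fn: your existing TFRecord parsing function
# keep only the examples you actually want to train on
dataset = dataset.filter(lambda features, label: tf.greater(features["seq_len"], min_len))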

Theano's function() reports that my `givens` value is not needed for the graph

Sorry for not posting entire snippets -- the code is very big and spread out, so hopefully this can illustrate my issue. I have these:
train = theano.function([X], output, updates=update_G,
                        givens={train_mode=:np.cast['int32'](1)})
and
test = theano.function([X], output, updates=update_G,
                       givens={train_mode=:np.cast['int32'](0)})
To my understanding, givens would substitute the value of train_mode (i.e. 1/0) wherever it's needed to compute the output.
The output is computed along these lines:
...
network2 = Net2()
# This is sort of a dummy variable so I don't get a NameError when this
# is called before `theano.function()` is called. Not sure if this is the
# right way to do this.
train_mode = T.iscalar('train_mode')
output = loss(network1.get_outputs(network2.get_outputs(X, train_mode=train_mode)),something).mean()
....
class Net2():
    def get_outputs(self, x, train_mode):
        from theano.ifelse import ifelse
        import theano.tensor as T
        my_flag = ifelse(T.eq(train_mode, 1), 1, 0)
        return something if my_flag else something_else
So train_mode is used as an argument in one of the nested functions, and I use it to distinguish between train and test, as I'd like to handle them slightly differently.
However, when I try to run this, I get this error:
theano.compile.function_module.UnusedInputError: theano.function was
asked to create a function computing outputs given certain inputs, but
the provided input variable at index 1 is not part of the computational
graph needed to compute the outputs: <TensorType(int32, scalar)>.To make
this error into a warning, you can pass the parameter
on_unused_input='warn' to theano.function. To disable it completely, use
on_unused_input='ignore'.
If I delete the givens parameter, the error disappears, so to my understanding Theano believes that my train_mode is not necessary to compute the function(). I can use on_unused_input='ignore' as per their suggestion, but that would just ignore my train_mode if it is considered unused. Am I going about this the wrong way? I basically just want to train a neural network with dropout, but not use dropout when evaluating.
Why do you use the "=" sign? I think it made train_mode unreadable; my code works well when written as:
givens = {train_mode:1}
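For completeness, a small sketch of the corrected calls, using the same variable names as in the question (a plain colon, no "="):
train = theano.function([X], output, updates=update_G,
                        givens={train_mode: np.cast['int32'](1)})
test = theano.function([X], output, updates=update_G,
                       givens={train_mode: np.cast['int32'](0)})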

Fails to fix the seed value in LDA model in gensim

When using the LDA model, I get different topics each time and I want to replicate the same set. I have searched Google for similar questions, such as this one.
I fixed the seed as shown in that article with numpy.random.seed(1000), but it doesn't work. I read ldamodel.py and found the code below:
def get_random_state(seed):
    """
    Turn seed into a np.random.RandomState instance.
    Method originally from maciejkula/glove-python, and written by @joshloyal
    """
    if seed is None or seed is numpy.random:
        return numpy.random.mtrand._rand
    if isinstance(seed, (numbers.Integral, numpy.integer)):
        return numpy.random.RandomState(seed)
    if isinstance(seed, numpy.random.RandomState):
        return seed
    raise ValueError('%r cannot be used to seed a numpy.random.RandomState'
                     ' instance' % seed)
So I use the code:
lda = models.LdaModel(
    corpus_tfidf,
    id2word=dic,
    num_topics=2,
    random_state=numpy.random.RandomState(10)
)
But it's still not working.
The dictionary generated by corpora.Dictionary may differ between runs even for the same corpus (for example, the same words in a different order). So one should fix the dictionary as well as the seed to get the same topics each time. The code below may help to fix the dictionary:
dic = corpora.Dictionary(corpus)
dic.save("filename")
dic=corpora.Dictionary.load("filename")
I agree with @Marcel.Shen's point that you should fix the input dictionary for the LDA model by saving it once and reusing it, rather than generating it again every time. That could also be a possible reason why you are getting different results.
But I think the main reason you are getting different results is how you are setting the random state each time you run. Just set the random_state value to a constant integer, like 1.
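Putting both suggestions together, a minimal sketch assuming the corpus_tfidf and the saved dictionary from above:
from gensim import corpora, models

dic = corpora.Dictionary.load("filename")  # reuse the saved dictionary
lda = models.LdaModel(corpus_tfidf,
                      id2word=dic,
                      num_topics=2,
                      random_state=1)      # constant integer seed for reproducible topics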
