Failing to fix the seed value in the gensim LDA model - python

When using the LDA model, I get different topics each time and I want to replicate the same set. I have searched Google for similar questions, such as this one.
I fixed the seed as shown in that article with numpy.random.seed(1000), but it doesn't work. I read ldamodel.py and found the code below:
def get_random_state(seed):
    """
    Turn seed into a np.random.RandomState instance.
    Method originally from maciejkula/glove-python, and written by @joshloyal
    """
    if seed is None or seed is numpy.random:
        return numpy.random.mtrand._rand
    if isinstance(seed, (numbers.Integral, numpy.integer)):
        return numpy.random.RandomState(seed)
    if isinstance(seed, numpy.random.RandomState):
        return seed
    raise ValueError('%r cannot be used to seed a numpy.random.RandomState'
                     ' instance' % seed)
So I used this code:
lda = models.LdaModel(
    corpus_tfidf,
    id2word=dic,
    num_topics=2,
    random_state=numpy.random.RandomState(10)
)
But it's still not working.

The dictionary generated by corpora.Dictionary may differ between runs even for the same corpus (for example, the same words but in a different order). So you should fix the dictionary as well as the seed to get the same topics each time. The code below may help to fix the dictionary:
dic = corpora.Dictionary(corpus)
dic.save("filename")
dic = corpora.Dictionary.load("filename")

I agree with @Marcel.Shen's point that you should fix the input dictionary to the LDA model by saving it once and reusing it, rather than regenerating it every time. That could also be a possible reason why you are getting different results.
But I think the main reason you are getting different results is that you are setting the random state to a random value between 0 and 10 each time you run. Just set the random seed to a constant like 1.
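Putting both suggestions together, a minimal sketch of a reproducible setup might look like the following (it assumes the corpus and corpus_tfidf objects from the question; the dictionary filename is just an example):
from gensim import corpora, models

dic = corpora.Dictionary(corpus)                       # build the dictionary once...
dic.save("lda_dictionary.dict")
dic = corpora.Dictionary.load("lda_dictionary.dict")   # ...and load the saved copy on later runs

lda = models.LdaModel(
    corpus_tfidf,
    id2word=dic,
    num_topics=2,
    random_state=1,   # a constant integer seed; get_random_state wraps it in a RandomState
)
With the dictionary and the seed both held fixed, repeated runs should produce the same topics.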

Related

How to get Lime predictions vs Actual predictions in a dataframe?

I am working on a binary classification problem using a Random Forest and am using the LIME explainer to explain the predictions.
I used the code below to generate the LIME explanations:
import lime
import lime.lime_tabular

explainer = lime.lime_tabular.LimeTabularExplainer(
    ord_train_t.values,
    discretize_continuous=True,
    feature_names=feat_names,
    mode="classification",
    feature_selection="lasso_path",
    class_names=rf_boruta.classes_,
    categorical_names=output,
    kernel_width=10,
    verbose=True,
)

i = 969
exp = explainer.explain_instance(ord_test_t.iloc[1, :], rf_boruta.predict_proba,
                                 distance_metric='euclidean', num_features=5)
I got an output like the one below:
Intercept 0.29625037124439896
Prediction_local [0.46168824]
Right:0.6911888737552843
However, the above is just printed to the screen as a message.
How can we get this info into a dataframe?
LIME doesn't have a direct export-to-DataFrame capability, so the way to go appears to be appending the explanations to a list and then transforming it into a DataFrame.
Yes, depending on how many predictions you have, this may take a lot of time, since the model has to predict every instance individually.
This is an example I found; the explain_instance call needs to be adjusted to your model's arguments, but it follows the same logic.
l = []
for n in range(X_test.shape[0]):
    exp = explainer.explain_instance(X_test.values[n], clf.predict_proba, num_features=10)
    a = exp.as_list()
    l.append(a)

df = pd.DataFrame(l)
If you need more than what as_list() provides, the explanation object has more data on it. I ran an example to see what else explain_instance returns.
Instead of just using as_list(), you can append the other values you need to that list.
a = exp.as_list()
a.append(exp.intercept[1])
l.append(a)
Using this approach you can get the intercept and the prediction_local. For the Right value I don't really know which attribute it would be, but I am fairly sure the explanation object has it somewhere under another name.
Set a breakpoint in your code and explore the explanation object; there may be other info you want to save as well.
Lime Github: Issue ref 213
To see the intercept and prediction_local of your explanation you can use exp.intercept and exp.local_pred. See this blog for details.
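Putting it together, a small sketch that collects everything into one DataFrame (it assumes the explainer, rf_boruta and ord_test_t objects from the question; the column names are only illustrative):
import pandas as pd

rows = []
for n in range(ord_test_t.shape[0]):
    exp = explainer.explain_instance(
        ord_test_t.iloc[n, :],
        rf_boruta.predict_proba,
        num_features=5,
    )
    rows.append({
        "intercept": exp.intercept[1],          # intercept of the local surrogate model
        "prediction_local": exp.local_pred[0],  # the surrogate's local prediction
        "feature_weights": exp.as_list(),       # (feature, weight) pairs
    })

lime_df = pd.DataFrame(rows)
As far as I can tell, the Right value printed when verbose=True is the underlying model's own predicted probability for that instance, so it can also be recovered directly from rf_boruta.predict_proba.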

Maximum likelihood with bins generated by a Poisson distribution

I am trying to use maximum likelihood and am running into problems. Let me start from the beginning - I am doing this for the first time, so I found code showing how somebody else did it and tried to modify it to meet my needs (here is the link to the page that I used: https://analyticsindiamag.com/maximum-likelihood-estimation-python-guide/). Everything worked perfectly fine until I tried to change the distribution used in the code from normal to Poisson. The program doesn't report it as an error, but the optimization does not work out and it gives me the response "success: False". Does anybody have any idea what is going wrong? All answers appreciated. Here is my modified code:
def max_likelihood(parameters):
    a, b, c = parameters
    prediction = a*x + b*y + c*z
    # calculate the log-likelihood for the Poisson distribution
    likelihood = np.sum(stats.poisson.logpmf(some_data, prediction, loc=0))
    neg_likelihood = -1 * likelihood
    return neg_likelihood

mlm = minimize(max_likelihood, np.array([2, 2, 2]), method='L-BFGS-B')
(x, y, z are known data arrays)
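For comparison, here is a self-contained sketch of the same technique on synthetic data (all names and numbers below are made up for illustration, not the question's variables). Note that stats.poisson.logpmf returns nan for a non-positive mean, so the linear predictor is wrapped in an exponential to keep the rate strictly positive:
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
some_data = rng.poisson(np.exp(1.5 * x + 0.3))       # synthetic Poisson counts

def neg_log_likelihood(parameters):
    a, b = parameters
    rate = np.exp(a * x + b)                         # exp link keeps the Poisson mean > 0
    return -np.sum(stats.poisson.logpmf(some_data, rate))

mlm = minimize(neg_log_likelihood, np.array([1.0, 0.0]), method='L-BFGS-B')
print(mlm.success, mlm.x)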

What are use cases for handing over different numbers to random.seed()?

What are use cases for handing over different numbers to random.seed()?
import random
random.seed(0)
random.random()
For example, using random.seed(17) or random.seed(9001) instead of always using random.seed(0). Either way, the calls reproducibly return the same "pseudo" random numbers on every run, which can be used for testing.
import random
random.seed(17)
random.random()
Why not always use random.seed(0)?
The seed is saying "random, but always the same randomness". If you want to randomize, e.g. search results, but not anew for every search, you could pass the current day.
If you want to randomize per user, you could use a user ID, and so on.
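For example, a small sketch of that idea (the user_id value is just a made-up example):
import random
import datetime

# The same shuffled order for everyone on a given day
random.seed(datetime.date.today().toordinal())
daily_order = random.sample(range(10), 10)

# The same shuffled order for a given user across requests
user_id = 42
random.seed(user_id)
per_user_order = random.sample(range(10), 10)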
An application should specify its own seed (e.g., with random.seed()) only if it needs reproducible "randomness"; examples include unit tests, games that display a "code" based on the seed to players, and simulations. Specifying a seed this way is not appropriate where information security is involved. See also my article on this matter.

GARCH model in Python: Iteration limit exceeded

I have a problem with a GARCH model in Python. My code looks as follows:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from arch import arch_model

sys.setrecursionlimit(1800)
spotmarket = pd.read_excel("./data/external/Spotmarket.xlsx", index=True)
l = spotmarket['Price'].pct_change().dropna()
returns = 100 * l
returns.plot()
plt.show()

model = arch_model(returns, vol='Garch', p=1, o=0, q=1, dist='Normal')
results = model.fit()
print(results.summary())
The first part of the code works well. I have end-of-day prices in a separate Excel table and want to model them with a GARCH model. The problem is that I get the error message "The optimizer returned code 9. The message is:
Iteration limit exceeded
See scipy.optimize.fmin_slsqp for code meaning."
Does someone have an idea how I can handle the problem with the iteration limit? Thank you!
Reading the source code (here), you can pass additional parameters to the fit method. Internally, scipy.optimize.minimize (doc) is called, and the solver options of interest to you are probably maxiter and ftol.
Try manually changing the default values (maxiter=100 and ftol=1e-06) to new ones that might lead to convergence. Example:
results = model.fit(options={'maxiter': 200})

TensorFlow: Resetting the seed to a constant value does not yield repeating results

I'm getting non-repeating results in TensorFlow (version 1.4.0), even though I'm resetting the seed to the same value each time:
import tensorflow as tf

sess = tf.InteractiveSession()
for i in range(5):
    tf.set_random_seed(1234)
    print(sess.run(tf.random_uniform([1])))
The output is:
[ 0.96046877]
[ 0.85591054]
[ 0.20277488]
[ 0.81463408]
[ 0.75180626]
I don't understand how this is consistent with the documentation:
If the graph-level seed is set, but the operation seed is not: The system deterministically picks an operation seed in conjunction with the graph-level seed so that it gets a unique random sequence.
Doesn't this mean that if I set the graph-level seed (as I did, using set_random_seed), and I don't set the operation seed (as in my case, where I didn't specify the seed argument in random_uniform), I should expect to get repeating results?
A similar issue is addressed in this question, but the main emphasis here is to understand what the documentation means.
Additional details:
>>> tf.__version__
'1.4.0-rc0'
>>> tf.__git_version__
'v1.3.0-rc1-3732-g2dd1015'
EDIT 1:
I think I have a conjecture as to why this is the behavior.
The sentence "The system deterministically picks an operation seed in conjunction with the graph-level seed" does not mean that the op seed is a function of the graph seed only. Rather, it is also a function of some internal counter variable. The reason this makes sense is that otherwise, running
tf.set_random_seed(1234)
print(sess.run(tf.random_uniform([1])))
print(sess.run(tf.random_uniform([1])))
would yield two identical results (this is due to the fact(?) that the random number generator is stateless, and each op has its own seed).
To confirm this, I found that when terminating python and opening it again, the whole sequence of code at the beginning of the question does yield repeating results. Also, delving into the source of TensorFlow, I see that the file random_seed.py has the lines
if graph_seed is not None:
    if op_seed is None:
        op_seed = ops.get_default_graph()._last_id
    seeds = _truncate_seed(graph_seed), _truncate_seed(op_seed)
which shows that the seed of an op is a combination of two seeds: the graph seed and the graph's _last_id property, a counter that increases each time an op is added to the graph.
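Consistent with this conjecture, pinning the op-level seed should remove the dependence on _last_id. A quick sketch (same TF 1.x setup as above):
import tensorflow as tf

sess = tf.InteractiveSession()
for i in range(5):
    tf.set_random_seed(1234)
    print(sess.run(tf.random_uniform([1], seed=42)))   # explicit op seed
Here every iteration should print the same number, since each newly created op now gets the same fixed (graph seed, op seed) pair regardless of how many ops were added before it.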
