Apparently the function predict itself works well and the whole model gets 75% success. But when I tried to do a test case for the function to check if it will return the correct outcome (1), I get the error outcomes = np.append(outcomes, y_train[n]) IndexError: index 160 is out of bounds for axis 0 with size 3. Any suggestions to what could be the bug?
This is impossible to debug based on your screenshot alone. Please provide a minimal working example in the future.
In this particular case, I assume that y_train[n] tries to access y_train[160], but your y_train only has three elements [1, 1, 0]. So I assume that get_neighbors does not return the type of data that you expect.
It appears that you are using PyCharm. Try to execute the very same code with the built-in debugger (a bug next to the run icon), wait for the Exception to appear (it will pause the code at that point), and check the content of your neighbors variable in the debugger menu. It should give you more hints.
I ran a different version of Mincer´s equation to estimate salary. Firstly, I ran an OLS version without considering endogeneity and the results are the following:Output summary
Salario is actually the natural log of it.
After that, I wrote next code to get an estimation with 2SLS method to solve endogeneity in variables No_Feliz and Años_Edu using No_Dep and max_edu_padres as instrumental variables for each one. However, the output is a bit confusing and I don´t know how to deal with it.
from statsmodels.sandbox.regression.gmm import IV2SLS
resultIV = IV2SLS(_dfb['Salario'], _dfb[['No_Feliz','Años_Edu']], _dfb[['No_Dep','Estatura','Años_Expe','Edad','Edad_2','Peso','Peso_2','DI','D_Hombre']]).fit()
resultIV.summary()
Results: Output summary IV2SLS
It´s clear the output has some issues (R2 too high compared to ols result, no result for F-statistic, coef of No_Feliz has a positive sign while it's negative in OLS estimation, the other exogenous variables are not taken into account despite the fact I included them)
I´d appreciate if someone could help me to fix it or at least make things a bit more clear to me. Thank you very much!
Don't use statsmodels's sandbox version which is not well tested. Instead use linearmodels. It is fully tested and correct.
Start by installing using pip install linearmodels
Then you need to be explicit about which variables are endogenous and need instrumenting, which are exogenous so in the model but not instrumented, and which are instruments only.
from linearmodels import IV2SLS
dep = _dfb['Salario']
exog = None
endog = _dfb[['No_Feliz','Años_Edu']]
instr = _dfb[['No_Dep','Estatura','Años_Expe','Edad','Edad_2','Peso','Peso_2','DI','D_Hombre']]
resultIV = IV2SLS(dep, exog, endog, instr).fit()
resultIV.summary # Note: this is a property so no ()
You can find the help which has more details. There are also some examples.
The original problem
While translating MATLAB code to python, I have the function [parmhat,parmci] = gpfit(x,alpha). This function fits a Generalized Pareto Distribution and returns the parameter estimates, parmhat, and the 100(1-alpha)% confidence intervals for the parameter estimates, parmci.
MATLAB also provides the function gplike that returns acov, the inverse of Fisher's information matrix. This matrix contains the asymptotic variances on the diagonal when using MLE. I have the feeling this can be coupled to the confidence intervals as well, however my statistics background is not strong enough to understand if this is true.
What I am looking for is Python code that gives me the parmci values (I can get the parmhat values by using scipy.stats.genpareto.fit). I have been scouring Google and Stackoverflow for 2 days now, and I cannot find any approach that works for me.
While I am specifically working with the Generalized Pareto Distribution, I think this question can apply to many more (if not all) distributions that scipy.stats has.
My data: I am interested in the shape and scale parameters of the generalized pareto fit, the location parameter should be fixed at 0 for my fit.
What I have done so far
scipy.stats While scipy.stats provides nice fitting performance, this library does not offer a way to calculate the confidence interval on the parameter estimates of the distribution fitter.
scipy.optimize.curve_fit As an alternative I have seen suggested to use scipy.optimize.curve_fit instead, as this does provide the estimated covariance of the parameter estimated. However that fitting method uses least squares, whereas I need to use MLE and I didn't see a way to make curve_fit use MLE instead. Therefore it seems that I cannot use curve_fit.
statsmodel.GenericLikelihoodModel Next I found a suggestion to use statsmodel.GenericLikelihoodModel. The original question there used a gamma distribution and asked for a non-zero location parameter. I altered the code to:
import numpy as np
from statsmodels.base.model import GenericLikelihoodModel
from scipy.stats import genpareto
# Data contains 24 experimentally obtained values
data = np.array([3.3768732 , 0.19022354, 2.5862942 , 0.27892331, 2.52901677,
0.90682787, 0.06842895, 0.90682787, 0.85465385, 0.21899145,
0.03701204, 0.3934396 , 0.06842895, 0.27892331, 0.03701204,
0.03701204, 2.25411215, 3.01049545, 2.21428639, 0.6701813 ,
0.61671203, 0.03701204, 1.66554224, 0.47953739, 0.77665706,
2.47123239, 0.06842895, 4.62970341, 1.0827188 , 0.7512669 ,
0.36582134, 2.13282122, 0.33655947, 3.29093622, 1.5082936 ,
1.66554224, 1.57606579, 0.50645878, 0.0793677 , 1.10646119,
0.85465385, 0.00534871, 0.47953739, 2.1937636 , 1.48512994,
0.27892331, 0.82967374, 0.58905024, 0.06842895, 0.61671203,
0.724393 , 0.33655947, 0.06842895, 0.30709881, 0.58905024,
0.12900442, 1.81854273, 0.1597266 , 0.61671203, 1.39384127,
3.27432715, 1.66554224, 0.42232511, 0.6701813 , 0.80323855,
0.36582134])
params = genpareto.fit(data, floc=0, scale=0)
# HOW TO ESTIMATE/GET ERRORS FOR EACH PARAM?
print(params)
print('\n')
class Genpareto(GenericLikelihoodModel):
nparams = 2
def loglike(self, params):
# params = (shape, loc, scale)
return genpareto.logpdf(self.endog, params[0], 0, params[2]).sum()
res = Genpareto(data).fit(start_params=params)
res.df_model = 2
res.df_resid = len(data) - res.df_model
print(res.summary())
This gives me a somewhat reasonable fit:
Scipy stats fit: (0.007194143471555344, 0, 1.005020562073944)
Genpareto fit: (0.00716650293, 8.47750397e-05, 1.00504535)
However in the end I get an error when it tries to calculate the covariance:
HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
If I do return genpareto.logpdf(self.endog, *params).sum() I get a worse fit compared to scipy stats.
Bootstrapping Lastly I found mentions to bootstrapping. While I did sort of understand what's the idea behind it, I have no clue how to implement it. What I understand is that you should resample N times (1000 for example) from your data set (24 points in my case). Then do a fit on that sub-sample, and register the fit result. Then do a statistical analysis on the N results, i.e. calculating mean, std_dev and then confidence interval, like Estimate confidence intervals for parameters of distribution in python or Compute a confidence interval from sample data assuming unknown distribution. I even found some old MATLAB documentation on the calculations behind gpfit explaining this.
However I need my code to run fast, and I am not sure if any implementation that I make will do this calculation fast.
Conclusions Does anyone know of a Python function that calculates this in an efficient manner, or can point me to a topic where this has been explained already in a way that it works for my case at least?
I had the same issue with GenericLikelihoodModel and I came across this post (https://pystatsmodels.narkive.com/9ndGFxYe/mle-error-warning-inverting-hessian-failed-maybe-i-cant-use-matrix-containers) which suggests using different starting parameter values to get a result with positive hessian. Solved my problem.
First of all, I tried to perform dimensionality reduction on my n_samples x 53 data using scikit-learn's Kernel PCA with precomputed kernel. The code worked without any issues when I tried using 50 samples at first. However, when I increased the number of samples into 100, suddenly I got the following message.
Process finished with exit code -1073740940 (0xC0000374)
Here's the detail of what I want to do:
I want to obtain the optimum value of kernel function hyperparameter in my Kernel PCA function, defined as the following.
from sklearn.decomposition.kernel_pca import KernelPCA as drm
from somewhere import costfunction
from somewhere_else import customkernel
def kpcafun(w,X):
# X is sample
# w is hyperparam
n_princomp = 2
drmodel = drm(n_princomp,kernel='precomputed')
k_matrix = customkernel (X,X,w)
transformed_x = drmodel.fit_transform(k_matrix)
cost = costfunction(transformed_x)
return cost
Therefore, to optimize the hyperparams I used the following code.
from scipy.optimize import minimize
# assume that wstart and optimbound are already defined
res = minimize(kpcafun, wstart, method='L-BFGS-B', bounds=optimbound, args=(X))
The strange thing is when I tried to debug the first 10 iterations of the optimization process, nothing strange has happened all values of the variables seemed normal. But, when I turned off the breakpoints and let the program continue the message appeared without any error notification.
Does anyone know what might be wrong with my code? Or anyone has some tips to resolve a problem like this?
Thanks
When using LDA model, I get different topics each time and I want to replicate the same set. I have searched for the similar question in Google such as this.
I fix the seed as shown in the article by num.random.seed(1000) but it doesn't work. I read the ldamodel.py and find the code below:
def get_random_state(seed):
"""
Turn seed into a np.random.RandomState instance.
Method originally from maciejkula/glove-python, and written by #joshloyal
"""
if seed is None or seed is numpy.random:
return numpy.random.mtrand._rand
if isinstance(seed, (numbers.Integral, numpy.integer)):
return numpy.random.RandomState(seed)
if isinstance(seed, numpy.random.RandomState):
return seed
raise ValueError('%r cannot be used to seed a numpy.random.RandomState'
' instance' % seed)
So I use the code:
lda = models.LdaModel(
corpus_tfidf,
id2word=dic,
num_topics=2,
random_state=numpy.random.RandomState(10)
)
But it's still not working.
The dictionary generated by corpora.Dictionary may be different to the same corpus(such as same words but different order).So one should fix the dictionary as well as seed to get tht same topic each time.The code below may help to fix the dictionary:
dic = corpora.Dictionary(corpus)
dic.save("filename")
dic=corpora.Dictionary.load("filename")
I agree with #Marcel.Shen point that you should fix your input dictionary to LDA model by saving it once and reusing it again rather than generating it again every time. That could also be a possible reason why you are getting a different result.
But I think the main reason you are getting different results is that you are randomly setting the random state between 0-10 each time you run. Just set the random seed value to a constant like 1.