Fitting with Monte Carlo in Python

I use a Python package called emcee to fit a function to some data points. The fit looks great, but when I plot the value of each parameter at each step I get this:
In their example (with a different function and data points) they get this:
Why does my function converge so fast, and why does it have that weird shape at the beginning? I apply MCMC using the likelihood and the posterior probability. And even though the fit looks very good, the errors on the function parameters are tiny (about 10^10 times smaller than the actual values), and I think it is because of the random walks. Any idea how to fix it? Here is their code for fitting: http://dan.iel.fm/emcee/current/user/line/ I used the same code with the obvious modifications for my data points and fitting function.

I would not say that your function is converging faster than the emcee line-fitting example you linked to. In the example, the walkers start exploring the most likely values in the parameter space almost immediately, whereas in your case it takes more than 200 iterations to reach the high-probability region.
The jump in your trace plots looks like burn-in. It is a common feature of MCMC sampling algorithms: the walkers are given starting points away from the bulk of the posterior and must then find their way to it. In your case the likelihood function appears to be fairly smooth, so it only takes a couple of hundred iterations to get there, which produces the "weird shape" you're talking about.
If you can constrain the starting points better, do so; if not, consider discarding this initial stretch before doing any further analysis (see here and here for discussions of burn-in lengths).
As for whether the errors are realistic, you need to inspect the resulting posterior distributions for that; the ratio of a parameter's value to its uncertainty says nothing about it on its own. For example, if you take the linked example and change the true value of b to 10^10, the resulting errors will be about ten orders of magnitude smaller than the parameter value while remaining perfectly valid.
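For reference, here is a minimal sketch of discarding the burn-in, assuming the emcee 3.x API and a sampler that has already been run as in the linked example:

import numpy as np

# `sampler` is assumed to be an emcee.EnsembleSampler that has already been run,
# as in the linked example (emcee 3.x API). Drop the first ~200 steps as burn-in
# and flatten the walkers into one set of samples.
flat_samples = sampler.get_chain(discard=200, flat=True)

# Report each parameter as the median with the 16th/84th percentile spread.
for i in range(flat_samples.shape[1]):
    lo, med, hi = np.percentile(flat_samples[:, i], [16, 50, 84])
    print(f"param {i}: {med:.4g} +{hi - med:.4g} / -{med - lo:.4g}")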

Related

Approximate maximum of an unknown curve

I have a data set that looks like this:
I used the scipy.signal.find_peaks function to determine the peaks of the data set, and it works well enough, but since this function finds the local maxima of the raw data, it does not ignore the noise, which causes overshoot. So what I'm determining isn't actually the location of the most likely maximum, but rather the location of an 'outlier'.
Is there another, more exact way to approximate the local maxima?
I'm not sure you can dismiss those points as outliers so easily, as they look close to where I would expect them to be. But if you don't think they are a valid approximation, here are three other approaches you can use.
First option
I would construct a physical model of these peaks (a mathematical formula) and do a fitting analysis around them. For instance, you can assume the shape of the plot is the sum of some background model (maybe constant, maybe more complicated) plus some Gaussian (or Lorentzian) peaks.
This is what we usually do in physics. Of course, it will be more accurate if you use knowledge of the underlying processes, which I don't have.
Given a good model, this approach is definitely better than just taking the maximum values, because even if those points are not outliers, they still carry errors that you want to reduce.
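As a rough illustration, here is a minimal sketch of this idea with scipy.optimize.curve_fit, using a single Gaussian plus a constant background and synthetic stand-in data (the model and initial guesses are assumptions, not the asker's actual physics):

import numpy as np
from scipy.optimize import curve_fit

# One Gaussian peak on top of a constant background; x and y stand in for the
# data restricted to a window around a single peak.
def peak_model(x, amp, mu, sigma, bg):
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2) + bg

# Synthetic stand-in data for the real measurements.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = peak_model(x, 3.0, 5.0, 0.4, 1.0) + rng.normal(0, 0.1, x.size)

# Initial guesses: peak height, location of the raw maximum, a rough width, baseline.
p0 = [y.max() - np.median(y), x[np.argmax(y)], 1.0, np.median(y)]
popt, pcov = curve_fit(peak_model, x, y, p0=p0)
perr = np.sqrt(np.diag(pcov))  # 1-sigma uncertainties on the fitted parameters
print("fitted (amp, mu, sigma, bg):", popt, "+/-", perr)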
Second option
But if you want an easier way and only need a rough estimate, and you have already located the three peaks programmatically, you can average a few points around each maximum. If you do it that way, np.where or np.argwhere (or plain index slicing) tend to be useful for this kind of thing.
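For instance, a rough sketch along these lines, using index slicing around the peaks returned by find_peaks and synthetic two-peak data as a stand-in:

import numpy as np
from scipy.signal import find_peaks

# Synthetic noisy two-peak data as a stand-in for the real data set.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)
y = np.exp(-0.5 * ((x - 3) / 0.3) ** 2) + np.exp(-0.5 * ((x - 7) / 0.3) ** 2)
y += rng.normal(0, 0.05, x.size)

# Locate the peaks, then refine each location by averaging the x positions of a
# few neighbouring points, weighted by the y values.
peaks, _ = find_peaks(y, height=0.5, distance=50)
half_window = 5  # number of points on each side to average over
for p in peaks:
    sl = slice(max(p - half_window, 0), p + half_window + 1)
    x_refined = np.average(x[sl], weights=y[sl])
    print(f"raw peak at x = {x[p]:.3f}, refined estimate x = {x_refined:.3f}")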
Third option
The easiest option is reading the values off by hand. That may sound unacceptable for academic purposes, and it probably is. Even worse, it is not a programmatic approach, and this is SO. But in the end, it depends on why and for what you need those values, and on the confidence interval you need for your measurement.

Approaches for using statistics packages for maximum likelihood estimation for hundreds of covariates

I am trying to investigate the distribution of maximum likelihood estimates, specifically for a large number of covariates p in a high-dimensional regime (meaning that p/n, with n the sample size, is about 1/5). I am generating the data and then using statsmodels.api.Logit to fit the parameters to my model.
The problem is, this only seems to work in a low dimensional regime (like 300 covariates and 40000 observations). Specifically, I get that the maximum number of iterations has been reached, the log likelihood is inf i.e. has diverged, and a 'singular matrix' error.
I am not sure how to remedy this. Initially, when I was still working with smaller values (say 80 covariates, 4000 observations), and I got this error occasionally, I set a maximum of 70 iterations rather than 35. This seemed to help.
However, it clearly will not help now, because my log-likelihood function is diverging. It is not just a matter of non-convergence within the maximum number of iterations.
It would be easy to answer that these packages are simply not meant to handle such numbers, however there have been papers specifically investigating this high dimensional regime, say here where p=800 covariates and n=4000 observations are used.
Granted, this paper used R rather than python. Unfortunately I do not know R. However I should think that python optimisation should be of comparable 'quality'?
My questions:
Might it be the case that R is better suited to handle data in this high p/n regime than python statsmodels? If so, why and can the techniques of R be used to modify the python statsmodels code?
How could I modify my code to work for numbers around p=800 and n=4000?
In the code you currently use (from several other questions), you implicitly use the Newton-Raphson method, which is the default for the sm.Logit model. It computes and inverts the Hessian matrix to speed up estimation, but that is incredibly costly for large matrices and often results in numerical instability when the matrix is near singular, as you have already witnessed. This is briefly explained in the relevant Wikipedia entry.
You can work around this by using a different solver, e.g. bfgs (or lbfgs), like so:
import statsmodels.api as sm  # assuming the usual statsmodels import alias
model = sm.Logit(y, X)
result = model.fit(method='bfgs')  # quasi-Newton solver instead of the default Newton-Raphson
This runs perfectly well for me even with n = 10000, p = 2000.
Aside from estimation, and more problematically, your code for generating samples results in data that suffer from a large degree of quasi-separability, in which case the whole MLE approach is questionable at best. You should urgently look into this, as it suggests your data may not be as well-behaved as you might like them to be. Quasi-separability is very well explained here.
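As a rough, non-authoritative check, one can at least look for complete separation along individual covariates (this misses separation along linear combinations of covariates, so it is only a partial diagnostic); the data below are a toy stand-in:

import numpy as np

# A crude check: flag covariates along which the two outcome classes do not
# overlap at all (complete separation). X is (n, p), y is a 0/1 vector; both
# are toy stand-ins here.
def single_covariate_separation(X, y):
    flagged = []
    for j in range(X.shape[1]):
        x0, x1 = X[y == 0, j], X[y == 1, j]
        if x0.max() < x1.min() or x1.max() < x0.min():
            flagged.append(j)  # the class ranges on this covariate do not overlap
    return flagged

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)  # outcome determined entirely by covariate 0
print("covariates with complete separation:", single_covariate_separation(X, y))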

How can I statistically compare a lightcurve data set with the simulated lightcurve?

With Python I want to compare a simulated light curve with the real light curve. It should be mentioned that the measured data contain gaps and outliers and the time steps are not constant; the model, however, uses constant time steps.
In a first step I would like to compare with a statistical method how similar the two light curves are. Which method is best suited for this?
In a second step I would like to fit the model to my measurement data. However, the model data is not calculated in Python but in an independent piece of software. Basically, the model data depends on four parameters, all of which are limited to a certain range, which I am currently feeding manually to the software (automating this is planned).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link "https://imgur.com/a/zZ5xoqB" provides three different plots. The simulated lightcurve, the actual measurement and lastly both together. The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
If I understand this correctly, you're asking a more foundational question that could be better answered in https://datascience.stackexchange.com/, rather than something specific to Python.
That said, speaking as a data science layperson, I'd say this may be a problem suited for gradient descent with a mean-squared-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the squared error at your known points.
Then you make tiny changes to each parameter in turn, and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
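A toy sketch of this finite-difference approach, in which a simple sine model stands in for the external light-curve simulator (all names and parameters here are illustrative assumptions):

import numpy as np

# Toy sketch of finite-difference gradient descent on a mean-squared-error cost.
# The sine `model` and its parameters are placeholders for the external
# light-curve simulator, which this answer cannot reproduce.
def model(t, params):
    amp, period, phase = params
    return amp * np.sin(2 * np.pi * t / period + phase)

def mse(params, t, y_obs):
    return np.mean((model(t, params) - y_obs) ** 2)

def fit_by_gradient_descent(t, y_obs, params, lr=1e-3, eps=1e-6, steps=5000):
    params = np.asarray(params, dtype=float)
    for _ in range(steps):
        base = mse(params, t, y_obs)
        grad = np.zeros_like(params)
        for i in range(params.size):
            bumped = params.copy()
            bumped[i] += eps
            grad[i] = (mse(bumped, t, y_obs) - base) / eps  # forward difference
        params -= lr * grad  # step downhill on the cost surface
    return params

# Toy usage: recover parameters from noisy synthetic data, starting nearby
# (a poor starting point can end up in a local minimum, as noted above).
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)
y_obs = model(t, (1.5, 4.0, 0.3)) + rng.normal(0, 0.05, t.size)
print(fit_by_gradient_descent(t, y_obs, [1.0, 3.8, 0.0]))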
Edit: I overlooked this part
The simulation is not good, but by playing with the parameters one can get an acceptable result, meaning the phase and period are the same, the magnitude is of the same order, and even the specular flashes should occur with the same period.
Is the simulated curve just a sum of sine waves, and are the parameters just phase/period/amplitude of each? In this case what you're looking for is the Fourier transform of your signal, which is very easy to calculate with numpy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
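A small sketch of that idea, assuming the curve is sampled on a uniform time grid (the measured data, with gaps and uneven sampling, would instead call for interpolation or a Lomb-Scargle periodogram):

import numpy as np

# Synthetic uniformly sampled signal: two sine components plus noise.
rng = np.random.default_rng(0)
dt = 0.1
t = np.arange(0, 100, dt)
signal = 2.0 * np.sin(2 * np.pi * t / 5.0) + 0.5 * np.sin(2 * np.pi * t / 1.3)
signal += rng.normal(0, 0.2, t.size)

# Real FFT: power spectrum and the corresponding frequencies.
freqs = np.fft.rfftfreq(t.size, d=dt)
power = np.abs(np.fft.rfft(signal))

# Strongest non-zero frequency and its period (the true dominant period is 5.0).
peak = np.argmax(power[1:]) + 1
print(f"dominant period ~ {1 / freqs[peak]:.2f}")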

P value for Normality test very small despite normal histogram

I've looked over the normality tests in scipy.stats, both scipy.stats.mstats.normaltest and scipy.stats.shapiro, and it looks like they both take as the null hypothesis that the data they're given are normal.
I.e., a p-value less than 0.05 would indicate that the data are not normal.
I'm doing a regression with LassoCV in scikit-learn, and in order to get better results I log-transformed the target values, which gives a histogram that looks like this:
Looks normal to me.
However, when I run the data through either of the two tests mentioned above I get very small p values that would indicate the data is not normal, and in a big way.
This is what I get when I use scipy.stats.shapiro
scipy.stats.shapiro(y)
Out[69]: (0.9919402003288269, 3.8889791653673456e-07)
And I get this when I run scipy.stats.mstats.normaltest:
scipy.stats.mstats.normaltest(y)
NormaltestResult(statistic=25.755128535282189, pvalue=2.5547293546709236e-06)
It seems implausible to me that my data would test out as being so far from normality with the histogram it has.
Is there something causing this discrepancy, or am I not interpreting the results correctly?
If the numbers on the vertical axis are the counts of observations in the respective bins, then the sample size is about 1500. For such a large sample size, goodness-of-fit tests are rarely useful. But is it really necessary that your data be perfectly normally distributed? If you want to analyse the data with a statistical method, is that method perhaps robust to ("small") deviations from the normality assumption?
In practice the question is usually "Is the normal distribution assumption acceptable for my statistical analysis?" A perfectly normal distribution is very rarely available.
An additional comment on histograms: one has to be careful when interpreting them, because whether the data "look normal" may depend on the bin width. Histograms are only hints and should be treated with caution.
If you run this many times and take the mean of the p-values, you will get what you expect. Run it in a loop, in a Monte Carlo fashion.
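A rough sketch of such a loop, assuming the Shapiro-Wilk test and an arbitrary choice of "slightly non-normal" comparison data:

import numpy as np
from scipy import stats

# At n ~ 1500, compare the Shapiro-Wilk p-values for exactly normal data with
# those for data containing a small, arbitrary contamination (5% of points
# shifted by two standard deviations).
rng = np.random.default_rng(0)
n, trials = 1500, 200
p_normal, p_contaminated = [], []
for _ in range(trials):
    _, p = stats.shapiro(rng.normal(size=n))
    p_normal.append(p)
    mixed = np.concatenate([rng.normal(size=int(0.95 * n)),
                            rng.normal(loc=2.0, size=n - int(0.95 * n))])
    _, p = stats.shapiro(mixed)
    p_contaminated.append(p)

print("mean p-value, exactly normal data:       ", np.mean(p_normal))
print("mean p-value, slightly contaminated data:", np.mean(p_contaminated))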

Curve_fit not converging means...?

I need to crossmatch a list of astronomical coordinates with different catalogues, and I want to decide a maximum radius for the crossmatch. This will avoid mismatches between my list and the catalogues.
To do this, I compute the separation to the best catalogue match for each object in my list. My initial list is supposed to give the positions of known objects, but it can happen that an object is not detected in the catalogue, and my coordinates may suffer from small offsets.
The way I am computing the maximum radius is by fitting the Gaussian kernel density of the separations with a Gaussian, and using the centre + 3 sigma value. The method works nicely in most cases, but when a small subsample of my list has an offset, I get two Gaussians instead. In those cases, I will specify the maximum radius in a different way.
My problem is that when this happens, curve_fit usually can't do the fit with one Gaussian. For a scientific publication, I will need to justify the "no fit" in curve_fit, and in which cases the "different way" is used. Could someone give me a hand with what this means in mathematical terms?
There are varying lengths to which you can go in justifying this or that fitting ansatz, and how far you go depends strongly on the details of your specific case (e.g. why do you expect a Gaussian to work in the first place? how deeply do you need or want to delve into why exactly a certain fitting procedure fails, and what exactly counts as a failure? etc.).
If the question is really about the curve_fit and its failure to converge, then show us some code and some input data which demonstrate the problem.
If the question is about how to evaluate the goodness-of-fit, you're best off going back to the library and picking a good book on statistics.
If all you are looking for is a way of justifying why, in a certain case, a Gaussian is not a good fitting ansatz, one approach would be to calculate the moments: for a Gaussian distribution the 1st, 2nd, 3rd and higher moments are related to each other in a very precise way. If you can demonstrate that for your underlying data the relation between the moments is very different, it seems reasonable to conclude that these data can't be fit by a Gaussian.
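A small sketch of that moment check with scipy.stats, using a toy two-population stand-in for the separation data:

import numpy as np
from scipy import stats

# For a Gaussian, skewness and excess kurtosis are both zero, so strong
# deviations in the sample argue against a single-Gaussian ansatz.
# `separations` is a toy stand-in: a main population plus a small offset subsample.
rng = np.random.default_rng(0)
separations = np.concatenate([rng.normal(0.5, 0.1, 900),
                              rng.normal(1.5, 0.1, 100)])

print("skewness:       ", stats.skew(separations))
print("excess kurtosis:", stats.kurtosis(separations))  # Fisher definition, 0 for a Gaussian
# skewtest and kurtosistest give p-values for consistency with a Gaussian.
print(stats.skewtest(separations))
print(stats.kurtosistest(separations))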
