In Python, there are several ways of doing kernel density estimation; I want to know the differences between them and make a good choice.
They are:
scipy.stats.gaussian_kde: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html
sklearn.neighbors.KernelDensity: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity
statsmodels: http://statsmodels.sourceforge.net/stable/nonparametric.html#kernel-density-estimation
I think we can compare them on: 1-d and 2-d use, bandwidth selection, implementation, and performance.
I only have experience with sklearn.neighbors.KernelDensity. Here is what I know:
It is generally fast and can be used in multiple dimensions, but it has no helper for choosing the bandwidth.
I have looked over scipy.stats.gaussian_kde; it seems to have a bandwidth selection method.
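For example, here is a minimal sketch of what I mean (the data is synthetic):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=200)              # synthetic 1-d sample

# bw_method can be 'scott' (the default), 'silverman', a scalar, or a callable
kde = gaussian_kde(data, bw_method='silverman')

grid = np.linspace(-4, 4, 100)
density = kde(grid)                      # evaluate the estimated density
print(kde.factor)                        # the bandwidth factor that was used
```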
It looks like the article Kernel Density Estimation in Python is precisely what you are looking for:
I'm going to focus here on comparing the actual implementations of KDE currently available in Python. (...) four KDE implementations I'm aware of in the SciPy/Scikits stack:
In SciPy: gaussian_kde.
In Statsmodels: KDEUnivariate and KDEMultivariate.
In Scikit-learn: KernelDensity.
Each has advantages and disadvantages, and each has its area of applicability.
The "sklearn way" of choosing model hyperparameters is grid-search, with cross-validation to choose the best values. Take a look at http://mark-kay.net/2013/12/24/kernel-density-estimation/ for an example of how to apply this to Kernel Density Estimation.
Related
I'm trying to fit my data and have so far used sp.optimize.leastsq. I changed to sp.optimize.least_squares to add bounds to the parameters, but both with and without bounds the search does not converge, even on data sets that sp.optimize.leastsq fitted just fine.
Shouldn't these functions work the same?
What could be the difference between them that makes the newer one fail to find solutions the older one did?
leastsq is a wrapper around MINPACK’s lmdif and lmder algorithms.
least_squares implements other methods in addition to the MINPACK algorithm.
method : {‘trf’, ‘dogbox’, ‘lm’}, optional
Algorithm to perform minimization.
‘trf’ : Trust Region Reflective algorithm, particularly suitable for large sparse problems with bounds. Generally robust method.
‘dogbox’ : dogleg algorithm with rectangular trust regions, typical use case is small problems with bounds. Not recommended for problems with rank-deficient Jacobian.
‘lm’ : Levenberg-Marquardt algorithm as implemented in MINPACK. Doesn’t handle bounds and sparse Jacobians. Usually the most efficient method for small unconstrained problems.
Default is ‘trf’. See Notes for more information.
For some problems it is possible that the ‘lm’ method does not converge while ‘trf’ does (or vice versa), so switching the method is worth trying.
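A minimal sketch of trying both methods on the same residual function (the exponential toy model is made up for illustration):

```python
import numpy as np
from scipy.optimize import least_squares

# Toy problem: fit y = a * exp(b * x) to noisy synthetic data.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * np.exp(1.5 * x) + 0.05 * rng.normal(size=x.size)

def residuals(p):
    a, b = p
    return a * np.exp(b * x) - y

# method='lm' reproduces leastsq (MINPACK); it does not accept bounds.
res_lm = least_squares(residuals, x0=[1.0, 1.0], method='lm')

# method='trf' (the default) supports bounds; if it stalls, experiment
# with x_scale, the loss function, or the tolerances.
res_trf = least_squares(residuals, x0=[1.0, 1.0],
                        bounds=([0.0, 0.0], [10.0, 10.0]), method='trf')
print(res_lm.x, res_trf.x)
```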
We are using scipy.optimize.differential_evolution through lmfit for an optimization problem with large differences in parameter magnitudes.
We know that leastsq handles this type of problem, but we are wondering whether differential_evolution does as well, or whether we should normalize the parameters in advance?
Best regards
I think the answer to both of your questions is No.
First No: scipy's (and so lmfit's) differential_evolution() does not automatically rescale parameter values to have similar orders of magnitude.
Second No: you should not have to rescale the values.
Basically, for some methods (say, leastsq) it is important to rescale because the algorithm constructs and uses derivative matrices that include all of the variables. If there are widely varying scales, the manipulation of the matrices can be numerically unstable.
Differential Evolution does not use such derivatives (even where they would be helpful, hence its incredible slowness). So, it will not have that problem.
That said, purely from a stability-of-numerical-calculations perspective, I would recommend against having some variables on the scale of 1e-12 with others on the scale of 1e12. If you know the scale is outside 1e-6 to 1e6, rescale; one way to do that is sketched below.
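If you do rescale, a common hand-rolled trick is to let the optimizer work with the log of an extreme-magnitude parameter. This is a hedged sketch, not anything lmfit does for you; the model and the parameter names ('amp', 'log10_rate') are invented for illustration:

```python
import numpy as np
from lmfit import Parameters, minimize

# Invented model: y = amp * exp(-rate * t) with rate of order 1e-9.
# Fitting log10(rate) keeps the value the optimizer sees near 1.
t = np.linspace(0.0, 5e9, 100)
data = 3.0 * np.exp(-2e-9 * t)

def residual(params, t, data):
    amp = params['amp'].value
    rate = 10.0 ** params['log10_rate'].value   # undo the rescaling here
    return amp * np.exp(-rate * t) - data

params = Parameters()
params.add('amp', value=1.0, min=0.1, max=10.0)
params.add('log10_rate', value=-8.0, min=-12.0, max=-6.0)

# differential_evolution needs finite min/max on every varying parameter.
result = minimize(residual, params, args=(t, data),
                  method='differential_evolution')
print(result.params['log10_rate'].value)
```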
Is there a function or library in Python to automatically compute the best polynomial fit for a set of data points? I am not really interested in the ML use case of generalizing to a set of new data; I am just focusing on the data I have. I realize that the higher the degree, the better the fit, but I want something that penalizes complexity or looks at where the error curve elbows, i.e. where adding another degree stops reducing the error much (although usually the elbow is not so drastic or obvious).
One idea I had was to use Numpy's polyfit (https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.polyfit.html) to compute polynomial regression for a range of orders/degrees. Polyfit requires the user to specify the degree of the polynomial, which poses a challenge because I don't have any assumptions or preconceived notions. The higher the degree of the fit, the lower the error will be, but eventually it plateaus. Therefore, if I want to automatically compute the degree at which the error curve elbows: if my error is E and d is my degree, I want to maximize something like the second difference (E[d-1] - E[d]) - (E[d] - E[d+1]).
Is this even a valid approach? Are there other tools and approaches in well-established Python libraries like Numpy or Scipy that can help with finding the appropriate polynomial fit (without me having to specify the order/degree)? I would appreciate any thoughts or suggestions! Thanks!
To select the "right" fit and prevent over-fitting, you can use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). Note that your fitting procedure can be non-Bayesian and you can still use these to compare fits. Here is a quick comparison between the two methods.
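As a rough sketch of how AIC-based degree selection could look with np.polyfit (the Gaussian-error AIC formula below is a standard form; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1 - 2 * x + 3 * x**3 + 0.1 * rng.normal(size=x.size)   # true degree 3

def aic(y, y_pred, k):
    # Gaussian-error AIC for a least-squares fit with k free parameters.
    n = len(y)
    rss = np.sum((y - y_pred) ** 2)
    return n * np.log(rss / n) + 2 * k

degrees = range(1, 10)
scores = []
for d in degrees:
    coeffs = np.polyfit(x, y, d)
    scores.append(aic(y, np.polyval(coeffs, x), d + 1))  # d+1 coefficients

best_degree = degrees[int(np.argmin(scores))]
print(best_degree)
```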
I wish to use scikit-learn's SVM with a chi-squared kernel, as shown here. In this scenario, the kernel is on histograms, which is what my data is represented as. However, I can't find an example of these used with histograms. What is the proper way to do this?
Is the correct approach to just treat the histogram as a vector, where each element in the vector corresponds to a bin of the histogram?
Thank you in advance
There is an example of using an approximate feature map here. It is for the RBF kernel but it works just the same.
The example above uses a pipeline, but you can also just apply the transform to your data before handing it to a linear classifier, as AdditiveChi2Sampler doesn't actually fit to the data in any way.
Keep in mind that this is just an approximation of the kernel map (which I found to work quite well), and if you want to use the exact kernel, you should go with ogrisel's answer.
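A minimal sketch of both variants (pipeline and standalone transform), using made-up histogram data; AdditiveChi2Sampler only requires the input to be non-negative:

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up "histograms": one row per sample, one column per bin.
rng = np.random.default_rng(0)
X = rng.random((100, 16))
y = rng.integers(0, 2, size=100)

# Variant 1: approximate chi-squared feature map inside a pipeline.
clf = make_pipeline(AdditiveChi2Sampler(sample_steps=2), LinearSVC())
clf.fit(X, y)

# Variant 2: transform first, then hand the result to a linear classifier.
X_mapped = AdditiveChi2Sampler(sample_steps=2).fit_transform(X)
clf2 = LinearSVC().fit(X_mapped, y)
```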
sklearn.svm.SVC accepts custom kernels in two ways:
an arbitrary Python function passed as the kernel argument to the constructor
a precomputed kernel matrix passed as the first argument to fit, with kernel='precomputed' in the constructor
The former can be much slower but does not require allocating the whole kernel matrix in advance (which can be prohibitive for large n_samples).
There are more details and links to examples in the documentation on custom kernels; a minimal sketch of both ways follows.
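Here is a sketch using sklearn's built-in chi2_kernel with made-up histogram data (each row is one histogram, one column per bin):

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((100, 16))          # histograms as row vectors
y_train = rng.integers(0, 2, size=100)
X_test = rng.random((10, 16))

# Way 1: pass the kernel function itself to the constructor.
clf = SVC(kernel=chi2_kernel).fit(X_train, y_train)

# Way 2: precompute the Gram matrix and declare kernel='precomputed'.
clf2 = SVC(kernel='precomputed').fit(chi2_kernel(X_train), y_train)
pred = clf2.predict(chi2_kernel(X_test, X_train))
```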
I have some data that are the integrals of an unknown curve within bins. For your interest, the data is ocean wave energy and the bins are for directions, e.g. 0-15 degrees. If possible, I would like to fit a curve onto the data that conserves the integrals within the bins. I've tried sketching it on a notepad with a pencil and it seems like it could be possible. Does anyone know of any curve-fitting tool in Python to do this, for example in the scipy interpolation sub-package?
Thanks in advance
Edit:
Thanks for the help. If I do it, it looks like I will try the method recommended in section 4 of this paper: http://journals.ametsoc.org/doi/abs/10.1175/1520-0485%281996%29026%3C0136%3ATIOFFI%3E2.0.CO%3B2. In essence, it uses matrices to construct some 'fake' data from the known integrals within each band. When plotted, this data produces an interpolated line graph that preserves the integrals.
It's a little outside my bailiwick, but I can suggest having a look at SciKits to see if there's anything there that might be useful. Other packages to browse would be pandas and StatsModels. Good luck!
If you have a curve f(x) which is an approximation to the integral of another curve g(x), i.e. f = int(g, x), then the two are related by the Fundamental Theorem of Calculus: your original function is the derivative of the first curve, g = df/dx. As such, you can use numpy.diff or any of the higher-order methods to approximate df/dx and obtain an estimate of your original curve.
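For instance, a tiny sketch with np.gradient (a close relative of np.diff that keeps the same array length), using a known antiderivative so the result can be checked:

```python
import numpy as np

# f samples F(x) = integral of g; by the Fundamental Theorem of Calculus,
# g = dF/dx, approximated here with finite differences.
x = np.linspace(0, np.pi, 50)
f = 1.0 - np.cos(x)            # exact integral of g(x) = sin(x)
g_est = np.gradient(f, x)      # should be close to sin(x)
print(np.max(np.abs(g_est - np.sin(x))))
```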
One possibility: calculate the cumulative sum of the bin volumes (np.cumsum), fit an interpolating spline to it, and then take the derivative to get the curve.
scipy splines have methods to calculate the derivatives.
One limitation, in case it is relevant in your case: the spline through the cumulative sum might not be monotonic, and so the derivative might be negative over some intervals.
I guess that the literature on smoothing a histogram looks at similar constraints on the integral within each bin, but I don't have any references ready.
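A minimal sketch of this cumulative-sum-plus-spline idea, with made-up bin edges and integrals:

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

# Made-up directional bins (degrees) and the integral over each bin.
edges = np.array([0.0, 15.0, 30.0, 45.0, 60.0, 75.0, 90.0])
bin_integrals = np.array([1.2, 3.4, 5.1, 4.8, 2.9, 1.0])

# Cumulative integral F(x), anchored at 0 at the left edge.
F = np.concatenate([[0.0], np.cumsum(bin_integrals)])

# Cubic spline through the cumulative values (k=3 needs >= 4 points).
spline = InterpolatedUnivariateSpline(edges, F, k=3)

# The fitted curve is the derivative of the cumulative spline.  Because
# the spline matches F exactly at the edges, integrating this curve over
# any bin reproduces the original bin integral.
x = np.linspace(edges[0], edges[-1], 200)
curve = spline.derivative()(x)
```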
1/ fit2histogram
Your question is about fitting a histogram. I just came across the documentation of PyMVPA, a Python package for Multi-Variate Pattern Analysis, which offers a function for histogram fitting. An example is here: PyMVPA.
However, I suspect the set of available distributions is limited to the well-known ones.
2/ integral computation
As already mentioned, the next solution is to approximate the integral values and fit a model to the resulting set of data. Either you know an explicit expression for the derivative, or you compute it: by finite differences or by analytical methods.