Discretizing a lognormal distribution in Python

Suppose I have a grid given by
import numpy as np
grid = np.linspace(0,20,1000)
I want to get a 1000-by-1 vector p so that if one were to plot points
(grid[i], p[i]) the graph would look like the density of a lognormal distribution.

Use scipy.stats for obtaining the PDFs of probability distributions!
NumPy, in most (all?) cases, only supports sampling methods, not PDF calculations; what is needed surely depends on the use case.
Often the PDF plays no role in practical sampling-only implementations, as in this case, where lognormal sampling reduces to normal-distribution sampling (itself often reduced to uniform sampling combined with other functions) followed by the exponential function (code):
double rk_lognormal(rk_state *state, double mean, double sigma)
{
    return exp(rk_normal(state, mean, sigma));
}
Make sure to read the scipy.stats docs to learn how to use these!
Example code:
import numpy as np
import scipy.stats as spt
import matplotlib.pyplot as plt
rv = spt.lognorm(0.954)  # "frozen" RV (shape parameter fixed)
x_points = np.linspace(1e-3, 20, 1000)  # strictly positive grid (0 excluded)
plt.scatter(x_points, rv.pdf(x_points))
plt.show()
Output: a scatter plot tracing the lognormal PDF over the grid.

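To get the vector p on the asker's original grid directly, a minimal sketch (scipy evaluates the lognormal PDF as 0 at x = 0, so the full grid is safe):
import numpy as np
import scipy.stats as spt
grid = np.linspace(0, 20, 1000)
p = spt.lognorm(0.954).pdf(grid)  # pdf(0) evaluates to 0, so no special handling is needed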
Related

How to get the maximum log-likelihood in Python and fit the parameters of a normal distribution and a Student's t-distribution?

I am working in Python and I have some performance data for some stocks:
DailyReturn = [0.325, -0.287, ...]
I've been trying to fit a normal distribution and a Student's t-distribution to the density histogram of that data to use as a PDF. I would like to get the fitted parameters, the standard errors of the parameters, and the value of the log-likelihood by the method of maximum likelihood (MLE). But I have run into some issues. At the moment I have this idea:
import numpy as np
import math
import scipy.optimize as optimize
import statistics
def llnorm(par, data):
    n = len(data)
    mu, sigma = par
    # negative log-likelihood of N(mu, sigma**2)
    ll = n / 2 * math.log(2 * math.pi * sigma ** 2) + np.sum((data - mu) ** 2) / (2 * sigma ** 2)
    return ll
data = np.asarray(DailyReturn)
result = optimize.minimize(llnorm, [statistics.mean(data), statistics.stdev(data)], args=(data,))
But I'm not sure, and I'm lost with the Student's t-distribution. Is there an easier way to do it?
In scipy.stats you find several distributions, among them Student's t and the normal.
These distributions have a fit method. You can see an example here for the normal distribution.
Your approach seems correct for the normal distribution, but there is little point in optimizing in this case, since the solution will essentially be the mu and sigma you are already passing as the starting values.
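A minimal sketch of the fit-based route (assuming DailyReturn is defined as above): fit returns the MLE parameter estimates, and the log-likelihood can be recovered by summing logpdf. Note that fit does not report standard errors; those would have to come from, e.g., the inverse Hessian of the negative log-likelihood.
import numpy as np
from scipy import stats
data = np.asarray(DailyReturn)
# Normal distribution: fit returns the MLE estimates (mu, sigma)
mu, sigma = stats.norm.fit(data)
ll_norm = np.sum(stats.norm.logpdf(data, mu, sigma))
# Student's t: fit returns (df, loc, scale)
df, loc, scale = stats.t.fit(data)
ll_t = np.sum(stats.t.logpdf(data, df, loc, scale))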

Create unequally spaced values from (superimposed) distributions

I want to create an array with unequally spaced values. The spacing should be determined by the superposition of (for example) two normal distributions with different mean and width values. For a single (normal) distribution I managed to get what I want with the help of this post: python, weighted linspace
Using this code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
dist = stats.norm(loc=1.2, scale=0.6)
bounds = dist.cdf([0, 2])
pp = np.linspace(*bounds, num=21)
vals = dist.ppf(pp)
plt.plot(vals, [1]*vals.size, 'o')
plt.show()
I get the result I want for a single distribution:
However, I need exactly the same for a superposition of two normal distributions like:
dist1 = stats.norm(loc=3, scale=2)
dist2 = stats.norm(loc=1.2, scale=0.6)
This is what a histogram of the superimposed distributions looks like:
As a temporary solution I created the arrays for each distribution individually and added them together. However, this is not exactly what I want, because combining the two individual arrays leads to fluctuating step sizes between them (for example, it might happen that two values from the two different arrays are almost or exactly identical).
I also tried to define a new distribution that inherits from the rv_continuous class of scipy.stats, but I failed to implement two different mean/width parameters.
I am pretty sure it should work by adding the individual probability density functions, but unfortunately I also failed with this approach.
Thanks in advance for any help and/or comment!
You could subclass rv_continuous and provide a pdf that is the mean of the two given pdfs.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
class sum_gaussians_gen(stats.rv_continuous):
    def _pdf(self, x):
        return (stats.norm.pdf(x, loc=3, scale=2) + stats.norm.pdf(x, loc=1.2, scale=0.6)) / 2
dist = sum_gaussians_gen()
bounds = dist.cdf([0, 7])
pp = np.linspace(*bounds, num=21)
vals = dist.ppf(pp)
plt.plot(vals, [0.5] * vals.size, 'o')
xs = np.linspace(0, 7, 500)
plt.plot(xs, dist.pdf(xs))
plt.ylim(ymin=0)
plt.show()
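If the generic rv_continuous ppf solver turns out to be slow, an alternative sketch of the same idea (my addition, not from the original answer) is to tabulate the mixture CDF on a fine grid and invert it by interpolation:
import numpy as np
from scipy import stats
# Mixture CDF = average of the two component CDFs (equal weights)
xs = np.linspace(0, 7, 2000)
cdf = 0.5 * (stats.norm.cdf(xs, loc=3, scale=2) + stats.norm.cdf(xs, loc=1.2, scale=0.6))
pp = np.linspace(cdf[0], cdf[-1], num=21)
vals = np.interp(pp, cdf, xs)  # numerical inverse CDF via interpolation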

Quantile functions in Python

I'm having trouble finding quantile functions for well-known probability distributions in Python, do they exist? In particular, is there an inverse normal distribution function? I couldn't find anything in either Numpy or Scipy.
Check the .ppf() method of any distribution class in scipy.stats.
This is the equivalent of a quantile function (also called the percent point function or inverse CDF).
An example with the exponential distribution from scipy.stats:
# analysis libs
import scipy.stats
import numpy as np
# plotting libs
import matplotlib.pyplot as plt
# Example with the exponential distribution
c = 0
lamb = 2
# Create a frozen exponential distribution instance with specified loc and scale parameters
exp_obj = scipy.stats.expon(c, 1 / float(lamb))
x_in = np.linspace(0, 0.99, 200)  # 200 probabilities in [0, 0.99]; ppf(1) would be inf
y_out = exp_obj.ppf(x_in)
plt.plot(x_in, y_out)  # graphically check the results of the inverse CDF
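Since the question asks about the inverse normal in particular, a quick illustration (my addition):
from scipy import stats
stats.norm.ppf(0.975)                # ~1.96, inverse of the standard normal CDF
stats.norm.ppf(0.5, loc=2, scale=3)  # 2.0, the median of N(2, 3**2)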
It seems new, but I've found np.quantile in numpy. Maybe you can have a look (not tested).
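For completeness, np.quantile computes empirical quantiles of a data sample, not a distribution's inverse CDF; a small sketch:
import numpy as np
samples = np.random.normal(size=10000)
print(np.quantile(samples, [0.025, 0.5, 0.975]))  # roughly [-1.96, 0.0, 1.96]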

MATLAB ksdensity equivalent in Python

I've looked online and have yet to find an answer or a way to figure out the following.
I'm translating some MATLAB code to Python, where in MATLAB I'm looking to find the kernel density estimation with the function:
[p,x] = ksdensity(data)
where p is the probability at point x in the distribution.
SciPy has a function, but it only returns p.
Is there a way to find the probability at values of x?
Thanks!
That form of the ksdensity call automatically generates an arbitrary x. scipy.stats.gaussian_kde() returns a callable function that can be evaluated with any x of your choosing. The equivalent x would be np.linspace(data.min(), data.max(), 100).
import numpy as np
from scipy import stats
data = ...  # your 1-D data array
kde = stats.gaussian_kde(data)
x = np.linspace(data.min(), data.max(), 100)
p = kde(x)
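Since kde is callable, it can also be evaluated at any specific points of your choosing, for example:
p_at = kde(np.array([0.5, 1.0, 2.5]))  # density estimates at x = 0.5, 1.0, 2.5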
Another option is the kernel density estimator in the Scikit-Learn Python package, sklearn.neighbors.KernelDensity
Here is a little example similar to the Matlab documentation for ksdensity for a Gaussian distribution:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
np.random.seed(12345)
# similar to MATLAB ksdensity example x = [randn(30,1); 5+randn(30,1)];
Vecvalues = np.concatenate((np.random.normal(0, 1, 30), np.random.normal(5, 1, 30)))[:, None]
Vecpoints = np.linspace(-8, 12, 100)[:, None]
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(Vecvalues)
logkde = kde.score_samples(Vecpoints)
plt.plot(Vecpoints, np.exp(logkde))
plt.show()
The plot this produces shows the expected bimodal density estimate.
Matlab is orders of magnitude faster than KernelDensity when it comes to finding the optimal bandwidth. Any idea of how to make KernelDensity faster? – Yuca Jul 16 '18 at 20:58
Hi, Yuca. MATLAB uses Scott's rule to estimate the bandwidth, which is fast but assumes the data come from a normal distribution. For more information, please see this post.
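A sketch of that idea (assuming 1-D data): compute Scott's rule-of-thumb bandwidth yourself and pass it to KernelDensity, skipping any cross-validated bandwidth search:
import numpy as np
from sklearn.neighbors import KernelDensity
data = np.random.normal(size=500)[:, None]
n = data.shape[0]
bw = data.std(ddof=1) * n ** (-1 / 5)  # Scott's rule for a 1-D Gaussian kernel
kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(data)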

scipy/numpy FFT on data from file

I looked into many examples of scipy.fft and numpy.fft. Specifically this example Scipy/Numpy FFT Frequency Analysis is very similar to what I want to do. Therefore, I used the same subplot positioning and everything looks very similar.
I want to import data from a file, which contains just one column to make my first test as easy as possible.
My code looks like this:
import numpy as np
import scipy.fft as sfft
import matplotlib.pyplot as plt
# Read in data from file here
array = np.loadtxt("data.csv")
length = len(array)
# Create time data for x axis based on array length
x = np.linspace(0.00001, length * 0.00001, num=length)
# Do FFT analysis of array
FFT = sfft.fft(array)
# Getting the related frequencies
freqs = sfft.fftfreq(array.size, d=(x[1] - x[0]))
# Create subplot windows and show plot
plt.subplot(211)
plt.plot(x, array)
plt.subplot(212)
plt.plot(freqs, np.log10(np.abs(FFT)), 'x')  # log magnitude spectrum (FFT itself is complex)
plt.show()
The problem is that I always get my peak at exactly zero, which should not be the case at all. It really should appear at around 200 Hz.
With a smaller plotting range: still the biggest peak at zero.
As already mentioned, it seems like your signal has a DC component, which will cause a peak at f=0. Try removing the mean with, e.g., arr2 = array - np.mean(array).
Furthermore, for analyzing signals, you might want to try plotting the power spectral density:
import matplotlib.pyplot as plt
import matplotlib.mlab as mlb
Fs = 1. / (x[1] - x[0])  # sampling frequency, from the time axis above
plt.psd(array, Fs=Fs, detrend=mlb.detrend_mean)
plt.show()
Take a look at the documentation of plt.psd(), since there are quite a lot of options to fiddle with. For investigating the change of the spectrum over time, plt.specgram() comes in handy.
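A small sketch tying both hints together (assuming array and x from the question's code): remove the DC component, then read off the dominant frequency:
import numpy as np
arr2 = array - np.mean(array)          # drop the DC component
spectrum = np.abs(np.fft.rfft(arr2))   # one-sided magnitude spectrum
freqs = np.fft.rfftfreq(arr2.size, d=x[1] - x[0])
print("dominant frequency [Hz]:", freqs[np.argmax(spectrum)])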
