Python generate random right skewed gaussian with constraints

I need to generate a unit curve that is going to look like a right-skewed Gaussian, and I have the following constraints:
The X axis is Days (variable but usually 45+)
All values on the Y axis sum to 1
The peak will always occur around day 4 or 5
Is there a way to do this programmatically in Python?

As noted by @Severin, a gamma distribution looks to be a reasonable fit, e.g.:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sps
x = np.linspace(0, 75, 76)  # one point per day
plt.plot(x, sps.gamma.pdf(x, 4), '.-')  # a gamma with shape a peaks at a - 1
plt.show()
If the values really need to sum to 1, rather than integrate to 1, I'd evaluate the cdf and then use np.diff on the result.
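A minimal sketch of that approach, keeping the shape parameter of 4 from above (the 75-day window is an assumption): differencing the cdf at the day boundaries gives the probability mass per day, which can then be renormalized to sum to exactly 1.
import numpy as np
import scipy.stats as sps
edges = np.arange(76)                     # day boundaries 0, 1, ..., 75
daily = np.diff(sps.gamma.cdf(edges, 4))  # probability mass in each one-day bin
daily /= daily.sum()                      # force the values to sum to exactly 1
print(daily.sum())                        # 1.0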

Related

Least Square fit for Gaussian in Python

I have tried to implement a Gaussian fit in Python with the given data. However, I am unable to obtain the desired fit. Any suggestions would help.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit
from numpy import asarray as ar, exp  # note: scipy no longer re-exports these numpy functions
xData=ar([-7.66E-06,-7.60E-06,-7.53E-06,-7.46E-06,-7.40E-06,-7.33E-06,-7.26E-06,-7.19E-06,-7.13E-06,-7.06E-06,-6.99E-06,
-6.93E-06,-6.86E-06,-6.79E-06,-6.73E-06,-6.66E-06,-6.59E-06,-6.52E-06,-6.46E-06,-6.39E-06,-6.32E-06,-6.26E-06,-6.19E-06,
-6.12E-06,-6.06E-06,-5.99E-06,-5.92E-06,-5.85E-06,-5.79E-06,-5.72E-06])
yData=ar([17763,2853,3694,4203,4614,4984,5080,7038,6905,8729,11687,13339,14667,16175,15953,15342,14340,15707,13001,10982,8867,6827,5262,4760,3869,3232,2835,2746,2552,2576])
#plot the data points
plt.plot(xData,yData,'bo',label='experimental_data')
plt.show()
# Estimate initial parameters and define the Gaussian function to fit
n = len(xData)
mean = sum(xData*yData)/n
sigma = np.sqrt(sum(yData*(xData-mean)**2)/n)
def Gauss(x, I0, x0, sigma, Background):
    return I0 * exp(-(x - x0)**2 / (2 * sigma**2)) + Background
popt,pcov = curve_fit(Gauss,xData,yData,p0=[1,mean,sigma, 0.0])
print(popt)
plt.plot(xData,yData,'b+:',label='data')
plt.plot(xData,Gauss(xData,*popt),'ro:',label='fit')
plt.legend()
plt.title('Gaussian_Fit')
plt.xlabel('x-axis')
plt.ylabel('PL Intensity')
plt.show()
When computing mean and sigma, divide by sum(yData), not n.
mean = sum(xData*yData)/sum(yData)
sigma = np.sqrt(sum(yData*(xData-mean)**2)/sum(yData))
The reason is that, say for the mean, you need to compute the average of xData weighted by yData. For this, you need to normalize yData to have sum 1, i.e., you need to multiply xData by yData / sum(yData) and take the sum.
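As a side note (not part of the original answer), np.average computes the same weighted quantities directly; this sketch reuses xData and yData from the question:
import numpy as np
mean = np.average(xData, weights=yData)                        # weighted mean
sigma = np.sqrt(np.average((xData - mean)**2, weights=yData))  # weighted std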
With the correction by j1-lee, and after removing the first point, which clearly doesn't agree with the Gaussian model, the fit looks like this:
Removing the bin that clearly doesn't belong in the fit reduces the fitted width by some 20% and the (fitted) noise to background ratio by some 30%. The mean is only marginally affected.
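For reference, dropping that first point is a one-line change to the fit call; this is a sketch using the names from the code above:
# Fit again without the first data point, which is an outlier under the Gaussian model
popt, pcov = curve_fit(Gauss, xData[1:], yData[1:], p0=[1, mean, sigma, 0.0])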

Create unequally spaced values from (superimposed) distributions

I want to create an array with unequally spaced values. The spacing should be determined by the superposition of (for example) two normal distributions with different mean and width values. For a single (normal) distribution I managed to get what I want with the help of this post: python, weighted linspace
Using this code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
dist = stats.norm(loc=1.2, scale=0.6)
bounds = dist.cdf([0, 2])
pp = np.linspace(*bounds, num=21)
vals = dist.ppf(pp)
plt.plot(vals, [1]*vals.size, 'o')
plt.show()
I get the result I want for a single distribution:
However, I need exactly the same for a superposition of two normal distributions like:
dist1 = stats.norm(loc=3, scale=2)
dist2 = stats.norm(loc=1.2, scale=0.6)
This is what a histogram of the superimposed distributions looks like:
As a temporary solution I created the arrays for each distribution individually and added them together. However, this is not exactly what I want, because combining the two individual arrays leads to fluctuating step sizes where they meet (for example, it can happen that two values from the two different arrays are almost or exactly identical).
I also tried to define a new distribution that inherits from the rv_continuous class of scipy.stats, but I failed to implement two different mean/width parameters.
I am pretty sure that it should work adding the individual probability density functions, but unfortunately I also failed with this approach.
Thanks in advance for any help and/or comment!
You could subclass rv_continuous and provide a pdf that is the mean of the two given pdfs.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
class sum_gaussians_gen(stats.rv_continuous):
    def _pdf(self, x):
        return (stats.norm.pdf(x, loc=3, scale=2) + stats.norm.pdf(x, loc=1.2, scale=0.6)) / 2
dist = sum_gaussians_gen()
bounds = dist.cdf([0, 7])
pp = np.linspace(*bounds, num=21)
vals = dist.ppf(pp)
plt.plot(vals, [0.5] * vals.size, 'o')
xs = np.linspace(0, 7, 500)
plt.plot(xs, dist.pdf(xs))
plt.ylim(ymin=0)
plt.show()

integrating an array using np trapz

I have been using np.trapz for integration over arrays for a while and have not had any problems with it, until now. I have obtained a distribution which clearly has an area of less than 1, because its maximum is about 0.16 and the width of the distribution is roughly 6, but np.trapz seems to return that the area underneath the distribution is >60.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
data = np.load('dist.npy')
thetavals=np.linspace(0,2*np.pi,1000)
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta)$')
plt.plot(thetavals,data[0:1000])
plt.show()
integralvalue=np.trapz(data)
print('The integral of this distribution results in: ',integralvalue)
Using numpy's trapz without the x parameter, the samples of the distribution are assumed to be evenly spaced one unit apart. They should instead be spaced according to the theta values that produced the distribution in the first place, using the following code:
import numpy as np
import matplotlib.pyplot as plt
data = np.load('dist.npy')
thetavals=np.linspace(0,2*np.pi,1001)
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta)$')
plt.plot(thetavals,data)
plt.show()
integralvalue=np.trapz(data,thetavals)
print('The integral of this distribution results in: ',integralvalue)
Now a number less than 1 is obtained, as expected.
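To make the difference reproducible without the asker's dist.npy, here is a sketch with a synthetic curve; without x, np.trapz assumes the samples are one unit apart, which inflates the result by roughly a factor of 1/dx:
import numpy as np
thetavals = np.linspace(0, 2 * np.pi, 1001)
data = np.exp(np.cos(thetavals))   # synthetic smooth curve standing in for dist.npy
data /= np.trapz(data, thetavals)  # normalize so the true area is 1
print(np.trapz(data))              # ~159: assumes unit spacing between samples
print(np.trapz(data, thetavals))   # ~1.0: uses the actual theta spacing
print(np.trapz(data, dx=thetavals[1] - thetavals[0]))  # equivalent, via dx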

PDF of the sum of Gaussian distributions using FFT

I am trying to derive the PDF of the sum of independent random variables. At first I would like to do this for a simple case: the sum of Gaussian random variables.
I was surprised to see that I don't get a Gaussian density function when I sum an even number of Gaussian random variables. I actually get:
which looks like two halves of a Gaussian distribution.
On the other hand, when I sum an odd number of Gaussian distributions I get the right distribution:
Below is the code I used to produce the results above:
import numpy as np
from scipy.stats import norm
from scipy.fftpack import fft,ifft
import matplotlib.pyplot as plt
%matplotlib inline
a=10**(-15)
end=norm(0,1).ppf(a)
sample=np.linspace(end,-end,1000)
pdf=norm(0,1).pdf(sample)
plt.subplot(211)
plt.plot(np.real(ifft(fft(pdf)**2)))
plt.subplot(212)
plt.plot(np.real(ifft(fft(pdf)**3)))
Could someone help me understand why I get odd results for even sums of Gaussian distributions?
Even though your code creates a zero-mean Gaussian PDF:
sample=np.linspace(end,-end,1000)
pdf=norm(0,1).pdf(sample)
the FFT does not know about sample, and only sees pdf with samples at 0, 1, 2, 3, ... 999. The FFT expects the origin to be the first sample of the signal. To the FFT function, your PDF is not zero mean, but has a mean of 500.
Thus, what is going on here is that you are convolving two PDFs that each have a mean of 500, producing one with a mean of 1000. And because the FFT imposes periodicity on the spatial-domain signal, you are seeing the PDF exit the graph on the right and come back in on the left.
Convolving 3 PDFs shifts the mean to 1500, which due to periodicity is the same as 500, meaning it ends up in the same place as the original PDF.
The solution is to shift the origin to the first sample for the FFT, and shift the result back:
from scipy.fftpack import fftshift, ifftshift
pdf2 = fftshift(ifft(fft(ifftshift(pdf))**2))
ifftshift shifts the signal so that the center sample ends up at the first sample, and fftshift shifts it back to where you wanted it for display.
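A toy illustration of what the shift pair does to sample order (not part of the original answer):
import numpy as np
from scipy.fftpack import fftshift, ifftshift
x = np.array([0, 1, 2, 3, 4])
print(ifftshift(x))            # [2 3 4 0 1]: the center sample moves to the front
print(fftshift(ifftshift(x)))  # [0 1 2 3 4]: the round trip restores the order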
But do note that the way you generate the PDF, the origin is not at a sample, and so the above will not work exactly. Instead, use:
sample=np.linspace(end,-end,1001)
pdf=norm(0,1).pdf(sample)
By picking 1001 samples instead of 1000, zero is exactly at the middle sample.
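Putting the two fixes together, a corrected version of the original script might look like this sketch (the convolution results would still need a factor of the sample spacing per convolution to stay normalized as densities; the shape and position are what is corrected here):
import numpy as np
from scipy.stats import norm
from scipy.fftpack import fft, ifft, fftshift, ifftshift
import matplotlib.pyplot as plt
a = 10**(-15)
end = norm(0, 1).ppf(a)
sample = np.linspace(end, -end, 1001)  # odd count puts zero exactly on a sample
pdf = norm(0, 1).pdf(sample)
# Shift the origin to sample 0 before the FFT, then shift the result back.
pdf2 = np.real(fftshift(ifft(fft(ifftshift(pdf))**2)))  # sum of two Gaussians
pdf3 = np.real(fftshift(ifft(fft(ifftshift(pdf))**3)))  # sum of three Gaussians
plt.subplot(211)
plt.plot(sample, pdf2)
plt.subplot(212)
plt.plot(sample, pdf3)
plt.show()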
Use R!
library(ggplot2)
f <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  # build a data frame so ggplot gets both the sums and a sample-size label
  ds <- data.frame(X = x1 + x2, n = as.factor(n))
  return(ds)
}
ds.list <- lapply(10^(2:5),f)
ds <- Reduce(rbind,ds.list)
ggplot(ds,aes(X,fill = n)) + geom_density(alpha = 0.5) + xlab("")
Here's the distribution plot:

scipy.pdist() returns NaN values

I'm trying to cluster time series. The intra-cluster elements have the same shapes but different scales. Therefore, I would like to use a correlation measure as the metric for clustering; I'm trying the correlation distance or the Pearson coefficient distance (any suggestion or alternative is welcome).
However, the following code returns an error when I run Z = linkage(dist), because there are some NaN values in dist. There are no NaN values in time_series; this is confirmed by
np.any(np.isnan(time_series))
which returns False:
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import dendrogram, linkage
dist = pdist(time_series, metric='correlation')
Z = linkage(dist)
fig = plt.figure()
dn = dendrogram(Z)
plt.show()
As an alternative, I tried the Pearson distance:
from scipy.stats import pearsonr
def pearson_distance(a, b):
    return 1 - pearsonr(a, b)[0]
dist = pdist(time_series, pearson_distance)
but this generates some runtime warnings and takes a lot of time.
scipy.pdist(time_series, metric='correlation')
If you take a look at the manual, the correlation metric divides by the norms of the mean-centered vectors. So if one of your time series is constant, its centered vector is all zeros, and dividing zero by zero gives NaN.
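To reproduce the failure mode with hypothetical data (not the asker's series), a constant row yields NaN under the correlation metric:
import numpy as np
from scipy.spatial.distance import pdist
time_series = np.array([
    [1.0, 2.0, 3.0, 4.0],  # varying series: fine
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the first
    [5.0, 5.0, 5.0, 5.0],  # constant series: all zeros after centering
])
print(pdist(time_series, metric='correlation'))
# -> [ 0. nan nan]: every pair involving the constant row is NaN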
