I want to create an array with unequally spaced values. The spacing should be determined by the superposition of (for example) two normal distributions with different mean and width values. For a single (normal) distribution I managed to get what I want with the help of this post: python, weighted linspace
Using this code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
dist = stats.norm(loc=1.2, scale=0.6)
bounds = dist.cdf([0, 2])
pp = np.linspace(*bounds, num=21)
vals = dist.ppf(pp)
plt.plot(vals, [1]*vals.size, 'o')
plt.show()
I get the result I want for a single distribution:
However, I need exactly the same for a superposition of two normal distributions like:
dist1 = stats.norm(loc=3, scale=2)
dist2 = stats.norm(loc=1.2, scale=0.6)
This is what a histogram of the superimposed distributions looks like:
As a temporary solution I created the arrays for each distribution individually and combined them. However, this is not exactly what I want, because combining the two individual arrays leads to fluctuating step sizes in the merged array (for example, two values from the two individual arrays may be almost or exactly identical).
I also tried to define a new distribution that inherits from rv_continuous class from scipy.stats, but I failed to implement two different mean/width parameters.
I am pretty sure that it should work adding the individual probability density functions, but unfortunately I also failed with this approach.
Thanks in advance for any help and/or comment!
You could subclass rv_continuous and provide a pdf that is the mean of the two given pdfs.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
class sum_gaussians_gen(stats.rv_continuous):
    def _pdf(self, x):
        return (stats.norm.pdf(x, loc=3, scale=2) + stats.norm.pdf(x, loc=1.2, scale=0.6)) / 2
dist = sum_gaussians_gen()
bounds = dist.cdf([0, 7])
pp = np.linspace(*bounds, num=21)
vals = dist.ppf(pp)
plt.plot(vals, [0.5] * vals.size, 'o')
xs = np.linspace(0, 7, 500)
plt.plot(xs, dist.pdf(xs))
plt.ylim(bottom=0)  # the old ymin= keyword is not accepted by recent matplotlib
plt.show()
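The generic rv_continuous machinery integrates the pdf numerically to get the cdf and then root-finds to get the ppf, which can be slow. As a hedged alternative sketch, the mixture cdf can be written directly as the average of the two normal cdfs and inverted with a bracketing root finder. The helper names mixture_cdf and mixture_ppf are made up for this illustration, and the equal weights of 0.5 are assumed to match the answer above.

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

dist1 = stats.norm(loc=3, scale=2)
dist2 = stats.norm(loc=1.2, scale=0.6)

def mixture_cdf(x):
    # cdf of an equal-weight mixture is the average of the component cdfs
    return 0.5 * (dist1.cdf(x) + dist2.cdf(x))

def mixture_ppf(q, lo, hi):
    # invert the cdf on [lo, hi]; the cdf is monotonic, so the root is unique
    return brentq(lambda x: mixture_cdf(x) - q, lo, hi)

lo, hi = 0.0, 7.0
pp = np.linspace(mixture_cdf(lo), mixture_cdf(hi), num=21)

# the end points are known exactly, so only the interior quantiles are solved for
interior = [mixture_ppf(q, lo, hi) for q in pp[1:-1]]
vals = np.concatenate(([lo], interior, [hi]))
```

The resulting vals array plays the same role as dist.ppf(pp) above: points are dense where the combined density is high and sparse where it is low.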
I have been using np.trapz for integration over arrays for a while and have not had any problems with it, until now. I have obtained a distribution which clearly has an area of less than 1, because its maximum is about 0.16 and its width is roughly 6, yet np.trapz reports an area greater than 60.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
data = np.load('dist.npy')
thetavals=np.linspace(0,2*np.pi,1000)
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta)$')
plt.plot(thetavals,data[0:1000])
plt.show()
integralvalue=np.trapz(data)
print('The integral of this distribution results in: ',integralvalue)
When np.trapz is called without the x argument, it assumes the samples are spaced one unit apart (dx=1). The samples should instead be spaced according to the theta values that produced the distribution in the first place. Using the following code:
import numpy as np
import matplotlib.pyplot as plt
data = np.load('dist.npy')
thetavals=np.linspace(0,2*np.pi,1001)
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta)$')
plt.plot(thetavals,data)
plt.show()
integralvalue=np.trapz(data,thetavals)
print('The integral of this distribution results in: ',integralvalue)
a number less than 1 is obtained, as expected.
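Since dist.npy is not available here, the effect can be reproduced with a synthetic curve. The sketch below assumes a Gaussian bump on the same theta grid (the width 0.5 is arbitrary) and compares the integral with and without the x argument. Newer NumPy releases name the function np.trapezoid, so the sketch picks whichever is present.

```python
import numpy as np

# np.trapz was renamed to np.trapezoid in NumPy 2.0; use whichever exists
trapezoid = np.trapezoid if hasattr(np, "trapezoid") else np.trapz

theta = np.linspace(0, 2 * np.pi, 1001)
width = 0.5  # assumed width of the synthetic bump
p = np.exp(-(theta - np.pi) ** 2 / (2 * width ** 2)) / (width * np.sqrt(2 * np.pi))

area_unit_spacing = trapezoid(p)          # treats samples as 1 apart
area_theta_spacing = trapezoid(p, theta)  # uses the real sample positions

dx = theta[1] - theta[0]
print(area_unit_spacing, area_theta_spacing, area_unit_spacing * dx)
```

The first result is larger than the second by a factor of 1/dx, which is why a curve with a true area near 1 appears to have an area in the tens or hundreds.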
I want a normal curve to fit the histogram I already have.
navf2 is a list of normalized random numbers and the histogram is based on those, and I want a curve to show the general trend of the histogram.
while len(navf2)<252:
    number=np.random.normal(0,1,None)
    navf2.append(number)
bin_edges=np.arange(70,130,1)
plt.style.use(["dark_background",'ggplot'])
plt.hist(navf2, bins=bin_edges, alpha=1)
plt.ylabel("Frequency of final NAV")
plt.xlabel("Ranges")
ymin=0
ymax=100
plt.ylim([ymin,ymax])
plt.show()
Here You go:
=^..^=
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
# create raw data
data = np.random.uniform(size=252)
# distribution fitting
mu, sigma = norm.fit(data)
# fitting distribution
x = np.linspace(-0.5,1.5,100)
y = norm.pdf(x, loc=mu, scale=sigma)
# plot data
plt.plot(x, y,'r-')
plt.hist(data, density=1, alpha=1)
plt.show()
Output:
Here is another solution using your code as mentioned in the question. We can achieve the expected result without the scipy library. We have to do three things: compute the mean of the data set, compute the standard deviation of the set, and create a function that generates the normal (Gaussian) curve.
To compute the mean we can use the function within the numpy library, i.e. mu = np.mean(your_data_set_here)
The standard deviation of the set is the square root of the mean of the squared differences between the values and the mean (https://en.wikipedia.org/wiki/Standard_deviation). We can express it in code as follows, using the numpy library again:
data_set = np.asarray(your_data_set_here) # an array, so that subtracting mu works element-wise
sigma = np.sqrt(1/(len(data_set))*sum((data_set-mu)**2))
Finally we have to build the function for the normal curve or Gaussian https://en.wikipedia.org/wiki/Gaussian_function, it relies on both the mean (mu) and the standard deviation (sigma), so we will use those as parameters in our function:
def Gaussian(x,sigma,mu): # sigma is the standard deviation and mu is the mean
    return ((1/(np.sqrt(2*np.pi)*sigma))*np.exp(-(x-mu)**2/(2*sigma**2)))
Putting it all together looks like this:
import numpy as np
import matplotlib.pyplot as plt
navf2 = []
while len(navf2)<252:
    number=np.random.normal(0,1,None) # values cluster around 0, so the original bin edges (70 to 130) stay empty
    navf2.append(number)
navf2 = np.asarray(navf2) # convert to array for better results
mu = np.mean(navf2) #the avg of all values in navf2
sigma = np.sqrt(1/(len(navf2))*sum((navf2-mu)**2)) # standard deviation of navf2
x_vals = np.arange(min(navf2),max(navf2),0.001) # create a flat range based off data
# to build the curve
gauss = [] #store values for normal curve here
def Gaussian(x,sigma,mu): # defining the normal curve
    return ((1/(np.sqrt(2*np.pi)*sigma))*np.exp(-(x-mu)**2/(2*sigma**2)))
for val in x_vals :
    gauss.append(Gaussian(val,sigma,mu))
plt.style.use(["dark_background",'ggplot'])
plt.hist(navf2, density = 1, alpha=1) # add density = 1 to fix the scaling issues
plt.ylabel("Frequency of final NAV")
plt.xlabel("Ranges")
plt.plot(x_vals,gauss)
plt.show()
Here is a picture of an output:
Hope this helps, I tried to keep it as close to your original code as possible!
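The same steps can also be written without explicit loops. The sketch below is one possible vectorized version, not the only way to do it: np.std with its default ddof=0 gives the same population standard deviation as the formula above, and the curve is evaluated on the whole array at once. The seed and the bin width of 0.25 are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded only so the sketch is reproducible
navf2 = rng.normal(0, 1, size=252)  # same kind of data as in the question

mu = navf2.mean()
sigma = navf2.std()                 # ddof=0, i.e. divide by N as in the formula above

def gaussian(x, mu, sigma):
    # normal density, evaluated element-wise on an array
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

x_vals = np.linspace(navf2.min(), navf2.max(), 500)
density_curve = gaussian(x_vals, mu, sigma)

# To overlay the curve on a histogram of raw counts (no density=1),
# scale the density by the number of samples times the bin width.
bin_width = 0.25                    # assumed bin width
count_curve = density_curve * navf2.size * bin_width
```

Plotting density_curve matches plt.hist(navf2, density=1), while count_curve matches a frequency histogram whose bins are bin_width wide.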
I need to generate a unit curve that is going to look like a right skewed gaussian and I have the following constraints:
The X axis is Days (variable but usually 45+)
All values on the Y axis sum to 1
The peak will always occur around day 4 or 5
Example:
Is there a way to do this programmatically in python?
As noted by @Severin, a gamma distribution looks to be a reasonable fit, e.g.:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sps
x = np.arange(75)  # day indices 0..74
plt.plot(x, sps.gamma.pdf(x, 4), '.-')
plt.show()
If the values really need to sum to 1, rather than integrate to 1, I'd evaluate the cdf and then use np.diff on the result.
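A minimal sketch of that cdf-and-diff idea follows. The number of days and the gamma parameters are assumptions chosen for illustration (shape 3 with scale 2 puts the mode of the density at day 4); they are not taken from the question. Because the curve is cut off after the last day, the weights are renormalized so they sum to exactly 1.

```python
import numpy as np
import scipy.stats as sps

n_days = 45              # assumed length of the curve
shape, scale = 3.0, 2.0  # assumed parameters; density mode is (shape - 1) * scale = 4

edges = np.arange(n_days + 1)  # day boundaries 0, 1, ..., n_days
# probability mass falling into each one-day bin
weights = np.diff(sps.gamma.cdf(edges, shape, scale=scale))
weights /= weights.sum()       # renormalize for the tail beyond the last day

peak_day = int(np.argmax(weights))
```

Changing shape and scale moves the peak and the length of the right tail, so they can be tuned until the curve matches the example.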
I have a list ordered by some quality function from which I'd like to take elements, preferring the good elements at the beginning of the list.
Currently, my function to generate the random indices looks essentially as follows:
def pick():
    p = 0.2
    for i in itertools.count():
        if random.random() < p:
            break
    return i
It does a good job, but I wonder:
What's the name of the generated random distribution?
Is there a built-in function in Python for that distribution?
What you are describing sounds a lot like the exponential distribution. It already exists in the random module.
Here is some code that takes just the integer part of samples from an exponential distribution with a rate parameter of 1/100 (that is, a mean of 100).
import random
import numpy as np
import matplotlib.pyplot as plt
d = [int(random.expovariate(1/100)) for i in range(10000)]
h,b = np.histogram(d, bins=np.arange(0,max(d)))
plt.bar(b[:-1], h, ec='none', width=1)  # bar positions and heights passed positionally
plt.show()
You could simulate it via the exponential distribution, but that is like forcing a square peg into a round hole. As Mark said, it is the geometric distribution: discrete, and shifted by 1. It is available directly in numpy:
import numpy as np
import random
import itertools
import matplotlib.pyplot as plt
p = 0.2
def pick():
    for i in itertools.count():
        if random.random() < p:
            break
    return i
q = np.random.geometric(p, size = 100000) - 1
z = [pick() for i in range(100000)]
bins = np.linspace(-0.5, 30.5, 32)
plt.hist(q, bins, alpha=0.2, label='geom')
plt.hist(z, bins, alpha=0.2, label='pick')
plt.legend(loc='upper right')
plt.show()
Output:
random.random() defaults to a uniform distribution, but there are other methods within random that would also work. For your given use case, I would suggest random.expovariate(2) (Documentation, Wikipedia). This is an exponential distribution that will heavily prefer lower values. If you google some of the other methods listed in the documentation, you can find some other built-in distributions.
Edit: Be sure to play around with the argument value for expovariate. Also note that it doesn't guarantee a value less than 1, so you might need to ensure that you only use values less than 1.
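The exponential and geometric answers are connected: taking the integer part of an exponential sample with rate -log(1 - p) gives the same distribution as the pick() loop in the question. The sketch below uses only the standard library and draws the index in one step by inverse transform. The name pick_geometric is made up for this illustration.

```python
import math
import random

def pick_geometric(p, rng=random):
    # Number of failures before the first success, each trial succeeding with
    # probability p. P(result >= k) = (1 - p) ** k, like the loop-based pick().
    u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is always defined
    return int(math.log(u) / math.log(1.0 - p))

rng = random.Random(0)  # seeded only so the sketch is reproducible
samples = [pick_geometric(0.2, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)  # should be close to the theoretical mean (1 - p) / p
```

When indexing a list with the result, it still needs to be clamped or redrawn if it exceeds the last valid index, since the distribution has no upper bound.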
Suppose I have a grid given by
import numpy as np
grid = np.linspace(0,20,1000)
I want to get a 1000-by-1 vector p so that if one were to plot points
(grid[i], p[i]) the graph would look like the density of a lognormal distribution.
Use scipy's stats for obtaining pdf's of probability-distributions!
Numpy, in most (all?) cases, only supports sampling methods, not pdf calculations. What's needed surely depends on the use case.
Often the pdf plays no role in practical sampling-only implementations, like in this case, where sampling is reduced to normal-distribution sampling (often reduced to uniform-sampling combined with other functions) followed by the exponential-function (code):
double rk_lognormal(rk_state *state, double mean, double sigma)
{
    return exp(rk_normal(state, mean, sigma));
}
Make sure to read above docs to learn how to use these!
Example code:
import numpy as np
import scipy.stats as spt
import matplotlib.pyplot as plt
rv = spt.lognorm(0.954) # "frozen" RV (shape-param fixed)
x_points = np.linspace(1,20,1000) # 0 excluded; float grid, so all 1000 points stay distinct
plt.scatter(x_points, rv.pdf(x_points))
plt.show()
Output:
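For the exact grid from the question, a hedged sketch is shown below. It assumes the underlying normal has mean mu = 0 and standard deviation sigma = 0.954 (the shape value used above); in scipy's parameterization these map to s=sigma and scale=exp(mu). The closed-form density is evaluated alongside as a cross-check, skipping x = 0 where the formula divides by zero.

```python
import numpy as np
from scipy import stats

grid = np.linspace(0, 20, 1000)

mu, sigma = 0.0, 0.954  # assumed parameters of the underlying normal
p = stats.lognorm.pdf(grid, s=sigma, scale=np.exp(mu))

# closed-form lognormal density for comparison (only defined for x > 0)
x = grid[1:]
p_manual = np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (x * sigma * np.sqrt(2 * np.pi))

p_column = p.reshape(-1, 1)  # 1000-by-1 vector, if that shape is required
```

Plotting grid against p with plt.plot(grid, p) then gives the lognormal density curve.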