I have been using np.trapz for integration over arrays for a while and have not had any problems with it, until now. I have obtained a distribution which clearly has an area of less than 1, because its maxima are 0.16 and the width of the distribution is roughly 6 but it seems to return that the area underneath the distribution is >60.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
data = np.load('dist.npy')
thetavals=np.linspace(0,2*np.pi,1000)
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta)$')
plt.plot(thetavals,data[0:1000])
plt.show()
integralvalue=np.trapz(data)
print('The integral of this distribution results in: ',integralvalue)
Using numpy trapz, without the choice of the x parameter, the spacing of our distribution is assumed to be evenly spaced apart, these however should be spaced apart in relation to the theta values that formed the distribution in the first place, using the following code:
import numpy as np
import matplotlib.pyplot as plt
data = np.load('dist.npy')
thetavals=np.linspace(0,2*np.pi,1001)
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta)$')
plt.plot(thetavals,data)
plt.show()
integralvalue=np.trapz(data,thetavals)
print('The integral of this distribution results in: ',integralvalue)
a number less than 1 is obtained, as expected.
Related
I want to create an array with unequally spaced values. The spacing should be determined by the superposition of (for example) two normal distributions with different mean and width values. For a single (normal) distribution I managed to get what I want with the help of this post: python, weighted linspace
Using this code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
dist = stats.norm(loc=1.2, scale=0.6)
bounds = dist.cdf([0, 2])
pp = np.linspace(*bounds, num=21)
vals = dist.ppf(pp)
plt.plot(vals, [1]*vals.size, 'o')
plt.show()
I get the result I want for a single distribution:
However, I need exactly the same for a superposition of two normal distributions like:
dist1 = stats.norm(loc=3, scale=2)
dist2 = stats.norm(loc=1.2, scale=0.6)
This is how a histrogramm of the superimposed distributions looks like:
As a temporary solution I created the arrays for each distribution individually and added them together. However, this is not exactly what I want, because adding the the two individual arrays leads to fluctuating step sizes between the added arrays (for example it might happen that two values from the two different (individual) arrays are almost or exactly identical).
I also tried to define a new distribution that inherits from rv_continuous class from scipy.stats, but I failed to implement two different mean/width parameters.
I am pretty sure that it should work adding the individual probability density functions, but unfortunately I also failed with this approach.
Thanks in advance for any help and/or comment!
You could subclass rv_continuous and provide a pdf that is the mean of the two given pdfs.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
class sum_gaussians_gen(stats.rv_continuous):
def _pdf(self, x):
return (stats.norm.pdf(x, loc=3, scale=2) + stats.norm.pdf(x, loc=1.2, scale=0.6)) / 2
dist = sum_gaussians_gen()
bounds = dist.cdf([0, 7])
pp = np.linspace(*bounds, num=21)
vals = dist.ppf(pp)
plt.plot(vals, [0.5] * vals.size, 'o')
xs = np.linspace(0, 7, 500)
plt.plot(xs, dist.pdf(xs))
plt.ylim(ymin=0)
plt.show()
When randomly generating random numbers using the numpy.random package, and the scipy.stats package, why is the histogram (total probabilities) generated by the former package have such large values with a maximum near 4, whereas the latter's histogram is more reasonable with a maximum much less than 1.
Probability distributions are supposed to only sum to 1, with no individual probability exceeding 1. Even though the scipy generator looks more tame, it still doesn't sum to 1. How can I make both generators from numpy.random and scipy.stats behave the same way, i.e. have no single probability exceeding the maximum of 1?
import numpy as np
import pandas as pd
from numpy.random import rand, randn
from scipy.stats import norm, johnsonsu
n = 100
x = randn(n)*.1
y = johnsonsu.rvs(a = 2.55, b= 2.25, size=n)
for i in [x, y]:
print(sum(i))
pd.Series(i).plot.kde()
Besides the plot, the output from a single run shows the sums of the randomly generated vectors to be wildly different:
0.9035925193845973
-144.49886490879146
I need to generate a unit curve that is going to look like a right skewed gaussian and I have the following constraints:
The X axis is Days (variable but usually 45+)
All values on the Y axis sum to 1
The peak will always occur around day 4 or 5
Example:
Is there a way to do this programmatically in python?
as noted by #Severin, a gamma looks to be a reasonable fit. e.g:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sps
x = np.linspace(75)
plt.plot(x, sps.gamma.pdf(x, 4) '.-')
plt.show()
if they really need to sum to 1, rather than integrate, I'd use the cdf and then use np.diff on the result
Suppose I have a grid given by
import numpy as np
grid = np.linspace(0,20,1000)
I want to get a 1000-by-1 vector p so that if one were to plot points
(grid[i], p[i]) the graph would look like the density of a lognormal distribution.
Use scipy's stats for obtaining pdf's of probability-distributions!
Numpy, in most (all?) cases only support sampling-methods, not pdf-calculations. What's needed surely depends on the use-case.
Often the pdf plays no role in practical sampling-only implementations, like in this case, where sampling is reduced to normal-distribution sampling (often reduced to uniform-sampling combined with other functions) followed by the exponential-function (code):
double rk_lognormal(rk_state *state, double mean, double sigma)
{
return exp(rk_normal(state, mean, sigma));
}
Make sure to read above docs to learn how to use these!
Example code:
import numpy as np
import scipy.stats as spt
import matplotlib.pyplot as plt
rv = spt.lognorm(0.954) # "frozen" RV (shape-param fixed)
x_points = np.linspace(1,20,1000, dtype=int) # 0 excluded
plt.scatter(x_points, rv.pdf(x_points))
plt.show()
Output:
I'm trying to fit a GEV distribution to temperature data to help identify extreme values. I have data sets for different regions - for some regions the fit works fine but for others it breaks down. It appears that it is setting the location parameter close to the maximum of the distribution range. All data sets are large, of the same size, complete and have no particularly strange values.
Could you please suggest what might be happening or how I can investigate the genextreme function process to work out what the problem is?
Here's the relevant bits of code (values are read in from NetCDF without any problem):
import pandas as pd
import numpy as np
import netCDF4 as nc
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import genextreme as gev
# calculate GEV fit
fit = gev.fit(season_temp)
# GEV parameters from fit
c, loc, scale = fit
fit_mean= loc
min_extreme,max_extreme = gev.interval(0.99,c,loc,scale)
# evenly spread x axis values for pdf plot
x = np.linspace(min(season_temp),max(season_temp),200)
# plot distribution
fig,ax = plt.subplots(1, 1)
plt.plot(x, gev.pdf(x, *fit))
plt.hist(season_temp,30,normed=True,alpha=0.3)
And here are two examples of outputs from different regions, successful and not:
Successful fit
Unsuccessful fit
The successfully fitted distribution has mean location parameter of 1.066 compared to data mean of 2.395. The one that failed has calculated a location parameter of 12.202 compared to data mean of 2.138.
Thanks in advance for your help!