I have a set of data values, and I want to get the CDF (cumulative distribution function) for that data set.
Since this is a continuous variable, we can't use a binning approach as mentioned in (How to get cumulative distribution function correctly for my data in python?). So I came up with the following approach.
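import numpy as np
import scipy.stats as st

def trapezoidal_2(ag, a, b, n):
    # composite trapezoidal rule for the KDE ag on [a, b] with n sub-intervals
    h = float(b - a) / n  # np.float was removed in NumPy 1.24; the builtin float is equivalent
    s = 0.0
    s += ag(a)[0]/2.0
    for i in range(1, n):
        s += ag(a + i*h)[0]
    s += ag(b)[0]/2.0
    return s * h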
def get_cdf(data):
    a = np.array(data)
    ag = st.gaussian_kde(a)
    cdf = [0]
    x = []
    k = 0
    max_data = max(data)
    while (k < max_data):
        x.append(k)
        k = k + 1
    sum_integral = 0
    for i in range(1, len(x)):
        sum_integral = sum_integral + (trapezoidal_2(ag, x[i - 1], x[i], 2))
        cdf.append(sum_integral)
    return x, cdf
This is how I use this method.
b = 1
data = st.pareto.rvs(b, size=10000)
data = list(data)
x_cdf, y_cdf = get_cdf(data)
Ideally I should get a value close to 1 at the end of y_cdf list. But I get a value close to 0.57.
What is going wrong here? Is my approach correct?
Thanks.
The value of the cdf at x is the integral of the pdf between -inf and x, but you are computing it between 0 and x. Maybe you are assuming that the pdf is 0 for x < 0 but it is not:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st

rs = np.random.RandomState(seed=52221829)
b = 1
data = st.pareto.rvs(b, size=10000, random_state=rs)
ag = st.gaussian_kde(data)
x = np.linspace(-100, 100)
plt.plot(x, ag.pdf(x))
So this is probably what's going wrong here: you are not checking your assumptions.
Your code for computing the integral is also painfully slow. There are better ways to do this with scipy, and gaussian_kde itself provides the method integrate_box_1d to integrate the pdf. If you take the integral from -inf, everything looks right.
cdf = np.vectorize(lambda x: ag.integrate_box_1d(-np.inf, x))
plt.plot(x, cdf(x))
Integrating between 0 and x gives what you are seeing now (to the right of 0), but that is not a cdf at all:
wrong_cdf = np.vectorize(lambda x: ag.integrate_box_1d(0, x))
plt.plot(x, wrong_cdf(x))
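To put a number on this, here is a minimal sketch that asks the KDE how much probability mass it places below 0. The seed and the sample size of 2000 are arbitrary choices for illustration, not values from the question, so the exact figures will differ from the 0.57 reported there.

```python
import numpy as np
import scipy.stats as st

# Assumption: seed and sample size are arbitrary, chosen only for reproducibility.
rs = np.random.RandomState(seed=0)
data = st.pareto.rvs(1, size=2000, random_state=rs)
ag = st.gaussian_kde(data)

# Mass the KDE places on x < 0, even though every Pareto sample is >= 1
mass_below_zero = ag.integrate_box_1d(-np.inf, 0)
# Upper limit of what an integral started at 0 can ever reach
mass_from_zero = ag.integrate_box_1d(0, np.inf)
# Total mass over the whole real line, which should be numerically 1
total_mass = ag.integrate_box_1d(-np.inf, np.inf)

print(mass_below_zero, mass_from_zero, total_mass)
```

Whatever mass_below_zero turns out to be is exactly the amount that an integral starting at 0 can never recover.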
I'm not sure exactly why your function is not working, but one way of calculating the CDF is as follows:
def get_cdf_1(data):
    # start with sorted list of data
    x = [i for i in sorted(data)]
    cdf = []
    for xs in x:
        # get the sum of the values less than each data point and store that value
        # this is normalised by the sum of all values
        cum_val = sum([i for i in data if i <= xs])/sum(data)
        cdf.append(cum_val)
    return x, cdf
There is no doubt a faster way of computing this using numpy arrays rather than appending values to a list, but this returns values in the same format as your original example.
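A possible NumPy version of the same calculation is sketched below. The name get_cdf_1_numpy is made up for this example. Note that, like the loop above, it accumulates the values themselves and divides by their total, so what it returns is the cumulative share of the summed values, not the fraction of data points at or below x.

```python
import numpy as np

def get_cdf_1_numpy(data):
    """Vectorised equivalent of get_cdf_1 (hypothetical helper name)."""
    x = np.sort(np.asarray(data, dtype=float))
    running = np.cumsum(x)                      # running sum of the sorted values
    # index of the last element <= each x, so ties behave like the loop version
    last = np.searchsorted(x, x, side="right") - 1
    return x, running[last] / x.sum()

x_vals, shares = get_cdf_1_numpy([3, 1, 2, 2])
print(x_vals, shares)
```

This replaces the quadratic-time nested summation with one sort and one cumulative sum.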
I think it's just:
def get_cdf(data):
    return sorted(data), np.linspace(0, 1, len(data))
but I might be misinterpreting the question!
When I compare this to the analytic result, I get the same:
x_cdf, y_cdf = get_cdf(st.pareto.rvs(1, size=10000))
import matplotlib.pyplot as plt
plt.semilogx(x_cdf, y_cdf)
plt.semilogx(x_cdf, st.pareto.cdf(x_cdf, 1))
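For a check that does not rely on reading a plot, the sketch below measures the largest gap between an empirical CDF and the analytic Pareto CDF. It is an illustration under my own assumptions: the samples come from NumPy's generator (shifted by 1 to get the classical Pareto form with scale 1), the seed is arbitrary, and the empirical CDF uses i/n rather than linspace(0, 1, n), which differs only slightly for large n.

```python
import numpy as np

def empirical_cdf(data):
    """Return sorted data and the fraction of points <= each sorted value."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Assumption: shape b = 1 and scale 1. Generator.pareto samples the Lomax form,
# so adding 1 shifts it to the classical Pareto distribution on x >= 1.
rng = np.random.default_rng(0)
samples = rng.pareto(1.0, size=10000) + 1.0

x_cdf, y_cdf = empirical_cdf(samples)
analytic = 1.0 - 1.0 / x_cdf            # Pareto CDF for b = 1: F(x) = 1 - x**(-b)
max_gap = np.max(np.abs(y_cdf - analytic))
print(max_gap)
```

The last value of y_cdf is 1 by construction, which is the property the question was missing.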
Given this Matlab Code created by my teacher:
function [] = explicitWave(T,L,N,J)
% Explicit method for the wave eq.
% T: Length time-interval
% L: Length x-interval
% N: Number of time-intervals
% J: Number of x-intervals
k=T/N;
h=L/J;
r=(k*k)/(h*h);
k/h
x=linspace(0,L,J+1); % number of points = number of intervals + 1
uOldOld=f(x); % solution two time-steps backwards. Initial condition
disp(uOldOld)
uOld=zeros(1,length(x)); % solution at previous time-step
uNext=zeros(1,length(x));
% First time-step
for j=2:J
    uOld(j)=(1-r)*f(x(j))+r/2*(f(x(j+1))+f(x(j-1)))+k*g(x(j));
end
% Remaining time-steps
for n=0:N-1
    for j=2:J
        uNext(j)=2*(1-r)*uOld(j)+r*(uOld(j+1)+uOld(j-1))-uOldOld(j);
    end
    uOldOld=uOld;
    uOld=uNext;
end
plot(x,uNext,'r')
end
I tried to implement this in Python by using this code:
import numpy as np
import matplotlib.pyplot as plt

def explicit_wave(f, g, T, L, N, J):
    """
    :param T: Length of Time Interval
    :param L: Length of X-interval
    :param N: Number of time intervals
    :param J: Number of X-intervals
    :return:
    """
    k = T/N
    h = L/J
    r = (k**2) / (h**2)
    x = np.linspace(0, L, J+1)
    Uoldold = f(x)
    Uold = np.zeros(len(x))
    Unext = np.zeros(len(x))
    for j in range(1, J):
        Uold[j] = (1-r)*f(x[j]) + (r/2)*(f(x[j+1]) + f(x[j-1])) + k*g(x[j])
    for n in range(N-1):
        for j in range(1, J):
            Unext[j] = 2*(1-r) * Uold[j]+r*(Uold[j+1]+Uold[j-1]) - Uoldold[j]
        Uoldold = Uold
        Uold = Unext
    plt.plot(x, Unext)
    plt.show()
    return Unext, x
However when I run the code with the same inputs, I get different results when plotting them. My inputs:
g = lambda x: -np.sin(2*np.pi*x)
f = lambda x: 2*np.sin(np.pi*x)
T = 8.0
L = 1.0
J = 60
N = 480
Python plot result compared to exact result. The x-es represent the actual solution, and the red line is the function:
Matlab plot result. The x-es represent the exact solution, and the red line is the function:
Could you see any obvious errors I might have made when translating this code?
In case anyone needs the exact solution:
exact = lambda x,t: 2*np.sin(np.pi*x)*np.cos(np.pi*t) - (1/(2*np.pi))*np.sin(2*np.pi*x)*np.sin(2*np.pi*t)
I found the error through debugging. The main problem here is the code:
Uoldold = Uold
Uold = Unext
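In Python, assigning one variable to another does not copy the data: both names then refer to the same object, so a change made through one name is visible through the other. After the first pass through the loop, Uold and Unext are the same array, so writing to Unext[j] also overwrites Uold. Let me illustrate this with an example using lists: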
a = [1,2,3,4]
b = a
b[1] = 10
print(a)
>> [1, 10, 3, 4]
So the solution here was to use .copy(), resulting in this:
Uoldold = Uold.copy()
Uold = Unext.copy()
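Another way to avoid the aliasing, sketched below, is to build a fresh array for every time step and only rebind the names. This is my own variant, not the code from the question: it uses array slices instead of the inner loop over j, returns (x, u) in that order, and omits the plotting. The inputs f, g, T, L, N, J and the exact solution are the ones given in the question.

```python
import numpy as np

def explicit_wave(f, g, T, L, N, J):
    """Explicit scheme for the wave equation with fixed (zero) end points."""
    k = T / N
    h = L / J
    r = (k * k) / (h * h)
    x = np.linspace(0, L, J + 1)

    u_oldold = f(x)                       # solution at t = 0
    u_old = np.zeros_like(x)              # solution at t = k (first time step)
    u_old[1:-1] = ((1 - r) * u_oldold[1:-1]
                   + r / 2 * (u_oldold[2:] + u_oldold[:-2])
                   + k * g(x[1:-1]))

    for _ in range(N - 1):
        u_next = np.zeros_like(x)         # fresh array each step, so nothing is aliased
        u_next[1:-1] = (2 * (1 - r) * u_old[1:-1]
                        + r * (u_old[2:] + u_old[:-2])
                        - u_oldold[1:-1])
        u_oldold, u_old = u_old, u_next   # rebinding names is safe here
    return x, u_old                       # u_old now holds the solution at t = N*k = T

g = lambda x: -np.sin(2*np.pi*x)
f = lambda x: 2*np.sin(np.pi*x)
exact = lambda x, t: (2*np.sin(np.pi*x)*np.cos(np.pi*t)
                      - (1/(2*np.pi))*np.sin(2*np.pi*x)*np.sin(2*np.pi*t))

x, u = explicit_wave(f, g, T=8.0, L=1.0, N=480, J=60)
max_error = np.max(np.abs(u - exact(x, 8.0)))
print(max_error)
```

Because u_next is a new object on every pass, the previous two time levels are never overwritten, and no .copy() calls are needed.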
def fac(n):
    value = 1
    for i in range(2,n+1):
        value = value * i
    return value

def C(n,k):
    return fac(n)/(fac(k) * (fac(n-k)))

for k in range(1,100):
    for n in [10,20,30]:
        F=C(n,k)
        plt.plot(k,F)
plt.legend()
plt.show()
I want to plot the binomial function as a function of k for certain values of n, say n = 10, 20, 30, ...
However, I do not know how to plot that.
I suspect there's a specific reason not to use NumPy, so here's a solution using plain Python.
If you're talking about the binomial distribution, then the formula needs to incorporate the probability p. I added that in the code below. (If you actually only want the binomial coefficient, delete the two pow terms.)
In general, for plotting, you need to collect your x and y values in some arrays (say X and Y), so that you can use plt.plot(X, Y) to plot the whole function.
In your example, you also need to switch the two loops, because you want to have three functions, each for k = [1 ... 100].
That'd be my solution:
from matplotlib import pyplot as plt

def fac(n):
    value = 1
    for i in range(2, n+1):
        value = value * i
    return value

def C(n, k, p):
    return fac(n)/(fac(k) * (fac(n-k))) * pow(p, k) * pow(1-p, n-k)

for N in [10, 20, 30]:
    X = []
    Y = []
    for K in range(1, 100):
        X.append(K)
        Y.append(C(N, K, 0.5))
    plt.plot(X, Y)
plt.legend(('N = 10', 'N = 20', 'N = 30'))
plt.show()
The generated output then looks like this:
Hope that helps!
----------------------------------------
System information
----------------------------------------
Platform: Windows-10-10.0.16299-SP0
Python: 3.8.1
Matplotlib: 3.2.0rc1
----------------------------------------
Good question!
The "typical" way to plot a function is to compute two vectors (lists), one of the x values and one of f(x), and then plot them. You can either type in your x values or use one of several convenient functions, such as numpy.linspace, to make them. You can (and should) also use a list comprehension to make the y values. Here is a toy example:
from matplotlib import pyplot as plt
def f(x):  # just return x squared
    return x**2
x = range(10)
y = [f(t) for t in x]
plt.plot(x,y)
produces:
If you want to make the graph smoother and use a lot of values for x, then use numpy.linspace perhaps like so:
import numpy as np
x = np.linspace(0,20,1000) # low, high, number of pts
y = [np.sin(4*t) for t in x]
plt.plot(x,y)
You should be able to use whatever function you want to compute your y-values and use this structure.
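Applied to the binomial case from the question, the same "two lists" structure might look like the sketch below. It assumes Python 3.8 or newer for math.comb, and the names binomial_pmf and curves are mine. Unlike the loops above, k only runs from 0 to n, since the distribution has no mass beyond n.

```python
from math import comb

def binomial_pmf(n, k, p):
    """P(K = k) for a binomial distribution with n trials and success probability p."""
    if k < 0 or k > n:
        return 0.0                      # no probability mass outside 0..n
    return comb(n, k) * p**k * (1 - p)**(n - k)

# One (x values, y values) pair per n, ready to hand to a plotting call
curves = {}
for n in [10, 20, 30]:
    ks = list(range(n + 1))
    curves[n] = (ks, [binomial_pmf(n, k, 0.5) for k in ks])

print(curves[10][1][5])
```

Each entry can then be drawn with plt.plot(ks, values, label='N = 10') and so on, followed by a single plt.legend() and plt.show().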
Suppose, I have a function that has three inputs prob(x,mu,sig).
With sizes:
x = 1 x 3
mu = 1 x 3
sig = 3 x 3
Now, I have a dataset X, mean matrix M and std. deviation matrix sigma.
Sizes are:-
X : m x 3.
mean : k x 3.
sigma : k x 3 x 3
For each of the m rows of X, I want to pass all k (mean, sigma) pairs to the function prob to calculate my responsibility values.
I can pass the values one by one using for loops.
What would be a better way of doing this in NumPy?
The related code for reference:
responsibility = np.zeros((X.shape[0],k))
s = np.zeros(k)
for i in np.arange(X.shape[0]):
    for j in np.arange(k):
        s[j] = prob(X[i],MU[j],SIGMA[j])
    s = s/s.sum()
    responsibility[i] = s
responsibility = np.transpose(responsibility)
If using a single for loop is acceptable then you can probably use the following,
import itertools
import numpy as np

m, k = X.shape[0], mean.shape[0]
responsibility = np.zeros((m, k))
# a single flat loop over all m * k (row of X, component) index pairs
for i, j in itertools.product(range(m), range(k)):
    responsibility[i, j] = prob(X[i], mean[j], sigma[j])
# normalise each row so that the k values for one sample sum to 1
responsibility = responsibility / responsibility.sum(axis=1, keepdims=True)
responsibility = np.transpose(responsibility)
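If prob happens to be a multivariate Gaussian density, the loops can be removed entirely with broadcasting. That is an assumption on my part, since the question does not show prob, and the sketch also treats each 3 x 3 matrix in sigma as a covariance matrix. The function name gaussian_responsibilities and the test arrays are made up for illustration.

```python
import numpy as np

def gaussian_responsibilities(X, MU, SIGMA):
    """Row-normalised Gaussian densities for all (sample, component) pairs.

    X: (m, d) data, MU: (k, d) means, SIGMA: (k, d, d) covariance matrices.
    Returns an (m, k) array whose rows sum to 1.
    """
    m, d = X.shape
    diff = X[:, None, :] - MU[None, :, :]                  # (m, k, d)
    inv = np.linalg.inv(SIGMA)                             # (k, d, d), batched inverse
    # squared Mahalanobis distance for every pair: diff^T inv diff
    maha = np.einsum('mkd,kde,mke->mk', diff, inv, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(SIGMA))  # (k,)
    dens = np.exp(-0.5 * maha) / norm                      # (m, k) densities
    return dens / dens.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)             # arbitrary example data
X = rng.normal(size=(5, 3))
MU = rng.normal(size=(2, 3))
SIGMA = np.stack([np.eye(3), 2.0 * np.eye(3)])
resp = gaussian_responsibilities(X, MU, SIGMA)
print(resp.shape)
```

Use resp.T to get the k x m orientation that the original code produces with np.transpose.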
I have the following code below that prints the PDF graph for a particular mean and standard deviation.
http://imgur.com/a/oVgML
Now I need to find the actual probability of a particular value. So for example, if my mean is 0 and my value is 0, my probability is 1. This is usually done by calculating the area under the curve, similar to this:
http://homepage.divms.uiowa.edu/~mbognar/applets/normal.html
I am not sure how to approach this problem
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

def normal(power, mean, std, val):
    a = 1/(np.sqrt(2*np.pi)*std)
    diff = np.abs(np.power(val-mean, power))
    b = np.exp(-(diff)/(2*std*std))
    return a*b

pdf_array = []
array = np.arange(-2,2,0.1)
print(array)
for i in array:
    print(i)
    pdf = normal(2, 0, 0.1, i)
    print(pdf)
    pdf_array.append(pdf)
plt.plot(array, pdf_array)
plt.ylabel('some numbers')
plt.axis([-2, 2, 0, 5])
plt.show()
print()
Unless you have a reason to implement this yourself, all these functions are available in scipy.stats.norm.
I think you are asking for the cdf, in which case use this code:
from scipy.stats import norm
print(norm.cdf(x, mean, std))
If you want to write it from scratch:
import math

class PDF():
    def __init__(self,mu=0, sigma=1):
        self.mean = mu
        self.stdev = sigma
        self.data = []

    def calculate_mean(self):
        # true division: // would floor the mean
        self.mean = sum(self.data) / len(self.data)
        return self.mean

    def calculate_stdev(self,sample=True):
        # sample=True divides by n - 1, otherwise by n
        if sample:
            n = len(self.data)-1
        else:
            n = len(self.data)
        mean = self.mean
        sigma = 0
        for el in self.data:
            sigma += (el - mean)**2
        sigma = math.sqrt(sigma / n)
        self.stdev = sigma
        return self.stdev

    def pdf(self, x):
        return (1.0 / (self.stdev * math.sqrt(2*math.pi))) * math.exp(-0.5*((x - self.mean) / self.stdev) ** 2)
The area under a curve y = f(x) from x = a to x = b is the integral of f(x) dx from a to b, and SciPy has a quick, easy way to compute integrals. Just so you understand: for a continuous distribution, the probability of a single exact value cannot be 1 (it is in fact 0), because the total area under the whole curve is 1 (unless MAYBE it's a delta function). A probability only becomes meaningful for an interval, and it always lies between 0 and 1. There may be different ways of doing it, but a conventional way is to pick an interval along the x-axis, for example a confidence interval around the mean, and integrate over it. I would read up on Gaussian curves and normalization before continuing to code it.
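A minimal sketch of that idea, using scipy.integrate.quad, is shown below. The helper names are mine, and the mean of 0 and standard deviation of 0.1 are taken from the call normal(2, 0, 0.1, i) in the question. The closed-form CDF based on math.erf is there only as a cross-check on the numerical integral.

```python
import math
from scipy import integrate

def normal_pdf(x, mean, std):
    """Density of the normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def interval_probability(a, b, mean=0.0, std=1.0):
    """P(a <= X <= b) as the area under the normal pdf between a and b."""
    area, _abs_err = integrate.quad(normal_pdf, a, b, args=(mean, std))
    return area

def normal_cdf(x, mean=0.0, std=1.0):
    """Closed-form CDF via the error function, used here only as a cross-check."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2))))

# Probability of landing within one standard deviation of the mean
p_one_sigma = interval_probability(-0.1, 0.1, mean=0.0, std=0.1)
p_check = normal_cdf(0.1, 0.0, 0.1) - normal_cdf(-0.1, 0.0, 0.1)
# Shrinking the interval around 0 drives the probability towards 0, not 1
p_tiny = interval_probability(-1e-6, 1e-6, mean=0.0, std=0.1)

print(p_one_sigma, p_check, p_tiny)
```

The one-sigma interval comes out at roughly 0.68, while the tiny interval around the mean carries almost no probability, even though the pdf is at its peak there.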
I have a function, roughness, that is called quite often in a larger piece of code. I need some help with replacing its double for-loop with a simpler vectorized version. Here is the code:
def roughness(c,d,e,f,z,ndim,half_tile,dx):
    imin=0-half_tile
    imax=half_tile
    z_calc = np.zeros((ndim,ndim), dtype=float)
    for j in range(ndim):
        y=(j-half_tile)*dx
        for i in range(ndim):
            x=(i-half_tile)*dx
            z_calc[i,j] = c*x*y + d*x + e*y + f - z[i,j]
    z_min=z_calc[z_calc!=0].min()
    z_max=z_calc[z_calc!=0].max()
    # Calculate some statistics for the difference tile
    difference = np.reshape(z_calc,ndim*ndim)
    mean = np.mean(difference)
    var = stats.tvar(difference,limits=None)
    skew = stats.skew(difference,axis=None)
    kurt = stats.kurtosis(difference, axis=None)
    return(z_min,z_max,mean,var,skew,kurt)
After the main calculations, the various stats are calculated on the result. The values of c, d, e, f, ndim and half_tile are all single integer values, and the variable z is an array of size ndim x ndim. I have tried to vectorize this before, but the values do not come out correctly, although the code does run.
Here is my attempt:
def roughness(c,d,e,f,z,ndim,half_tile,dx):
    z_calc = np.zeros((ndim,ndim), dtype=float)
    x = np.zeros((ndim,ndim), dtype=float)
    y = np.zeros((ndim,ndim), dtype=float)
    x,y = np.mgrid[1:ndim+1,1:ndim+1]
    x = (x-half_tile)*dx
    y = (y-half_tile)*dx
    z_calc = c*x*y + d*x + e*y + f - z
    z_min=z_calc[z_calc!=0].min()
    z_max=z_calc[z_calc!=0].max()
    # Calculate some statistics for the difference tile
    difference = np.reshape(z_calc,ndim*ndim)
    mean = np.mean(difference)
    var = stats.tvar(difference,limits=None)
    skew = stats.skew(difference,axis=None)
    kurt = stats.kurtosis(difference, axis=None)
    return(z_min,z_max,mean,var,skew,kurt)
Aside from getting the correct values, I would really like to know if I did the vectorization of the nested for loop correctly, which I'm assuming I didn't.
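One visible difference between the two versions is the index range: the loops use range(ndim), which starts at 0, while np.mgrid[1:ndim+1, 1:ndim+1] starts at 1, so every x and y in the attempt is shifted by dx. The sketch below keeps the indices 0-based and compares a broadcast version against the original double loop. Only the z_calc part is covered (the statistics are unchanged), the two function names are mine, and the test tile is arbitrary random data.

```python
import numpy as np

def z_calc_loop(c, d, e, f, z, ndim, half_tile, dx):
    """Reference: the double loop from the question, unchanged in logic."""
    out = np.zeros((ndim, ndim), dtype=float)
    for j in range(ndim):
        y = (j - half_tile) * dx
        for i in range(ndim):
            x = (i - half_tile) * dx
            out[i, j] = c*x*y + d*x + e*y + f - z[i, j]
    return out

def z_calc_vectorized(c, d, e, f, z, ndim, half_tile, dx):
    """Same values via broadcasting, with 0-based indices like range(ndim)."""
    coords = (np.arange(ndim) - half_tile) * dx
    x = coords[:, np.newaxis]   # varies along axis 0, matching index i
    y = coords[np.newaxis, :]   # varies along axis 1, matching index j
    return c*x*y + d*x + e*y + f - z

rng = np.random.default_rng(0)  # arbitrary test tile, not from the question
ndim, half_tile, dx = 5, 2, 0.5
z = rng.normal(size=(ndim, ndim))
same = np.allclose(z_calc_loop(1, 2, 3, 4, z, ndim, half_tile, dx),
                   z_calc_vectorized(1, 2, 3, 4, z, ndim, half_tile, dx))
print(same)
```

Using np.mgrid[0:ndim, 0:ndim] in the original attempt should give the same grid, since mgrid's first output also varies along axis 0.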