Why does this Kernel Density Estimation have values over 1.0? [duplicate] - python

This question already has an answer here:
Why does the y-axis of my distplot contain values up to 150?
(1 answer)
Closed 7 days ago.
I'm trying to analyse the features of the Pima Indians Diabetes Data Set (follow the link to get the dataset) by plotting their probability density distributions. I haven't yet removed the invalid 0 entries, so the plots sometimes show a spike at the far left. For the most part, the distributions look accurate.
I have a problem with the look of the plot for DiabetesPedigree, which shows probabilities over 1.0 (for x roughly between 0.1 and 0.5). As I understand it, the combined probabilities should add up to 1.0.
I've isolated the code for the DiabetesPedigree plot below, but the same code works for the other features by changing the dataset_index value:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

dataset_index = 6
feature_name = "DiabetesPedigree"
filename = 'pima-indians-diabetes.data.csv'
data = pd.read_csv(filename)
feature_data = data.iloc[:, dataset_index]  # .ix is deprecated; use .iloc
graph_min = feature_data.min()
graph_max = feature_data.max()

# build the KDE and force a fixed bandwidth factor of 0.25
density = gaussian_kde(feature_data)
density.covariance_factor = lambda: .25
density._compute_covariance()

# evaluate the density on 200 evenly spaced points
xs = np.arange(graph_min, graph_max, (graph_max - graph_min) / 200)
ys = density(xs)

plt.xlim(graph_min, graph_max)
plt.title(feature_name)
plt.plot(xs, ys)
plt.show()

As rightly noted, a continuous PDF is not bounded above by 1: for a continuous random variable, the density p(x) is not itself a probability. Probabilities correspond to areas under the curve, so it is the integral of the density over its support that must equal 1, and the curve can rise above 1 wherever the distribution is concentrated on a narrow range. You can refer to any text on continuous random variables and their distributions.
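You can check this numerically with the xs and ys arrays already computed in the question's code; a minimal sketch:
import numpy as np

# the density may exceed 1.0 pointwise, but the area under the curve
# should still be close to 1.0 (up to truncation of the KDE's tails)
area = np.trapz(ys, xs)
print(area)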

Related

Barplot with significant differences and interactions in python?

I started using Python 6 months ago, so my question may be a naive one. I would like to visualize my data and ANOVA statistics. It is common to do this using a barplot with added lines indicating significant differences and interactions. How do you make a plot like this using Python?
Here is a simple dataframe with 3 columns (A, B, and the p_values already calculated with a t-test):
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

ar = np.array([[565.0, 81.0, 1.630947e-02],
               [1006.0, 311.0, 1.222740e-27],
               [2929.0, 1292.0, 5.559912e-12],
               [3365.0, 1979.0, 2.507474e-22],
               [2260.0, 1117.0, 1.540305e-01]])
df = pd.DataFrame(ar, columns=['A', 'B', 'p_value'])

ax = plt.subplot()
# convert the counts to percentages before plotting
(df.iloc[:, 0:2] / df.iloc[:, 0:2].sum() * 100).plot.bar(ax=ax)
# attempt to add "**" labels for significant p-values
# (note: this zip pairs each column's container with a single p-value,
# which is why the labels don't come out per row)
for container, p_val in zip(ax.containers, df['p_value']):
    labels = [f"{round(v, 1)}%" if p_val > 0.05 else f"(**)\n{round(v, 1)}%"
              for v in container.datavalues]
    ax.bar_label(container, labels=labels, fontsize=10, padding=8)
plt.show()
Initially I just wanted to add "**" each time a significant difference is observed between the two columns A and B, but the initial code above is not really working.
Now I would prefer to have the added lines indicating significant differences and interactions between the A and B columns, but I have no idea how to make it happen.
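A minimal, hedged sketch of one way to do this with plain matplotlib: draw each bracket with ax.plot and place the significance marker with ax.text. The stars() helper and the vertical offsets below are illustrative assumptions, not from the original post:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

ar = np.array([[565.0, 81.0, 1.630947e-02],
               [1006.0, 311.0, 1.222740e-27],
               [2929.0, 1292.0, 5.559912e-12],
               [3365.0, 1979.0, 2.507474e-22],
               [2260.0, 1117.0, 1.540305e-01]])
df = pd.DataFrame(ar, columns=['A', 'B', 'p_value'])

pct = df.iloc[:, 0:2] / df.iloc[:, 0:2].sum() * 100
ax = pct.plot.bar()

def stars(p):
    # conventional significance markers
    return '***' if p < 0.001 else '**' if p < 0.01 else '*' if p < 0.05 else 'ns'

for i, p in enumerate(df['p_value']):
    # centers of the two bars of row i (one from column A, one from column B)
    a = ax.containers[0].patches[i]
    b = ax.containers[1].patches[i]
    x1 = a.get_x() + a.get_width() / 2
    x2 = b.get_x() + b.get_width() / 2
    y = max(a.get_height(), b.get_height()) + 1  # start just above the taller bar
    h = 1                                        # bracket height
    ax.plot([x1, x1, x2, x2], [y, y + h, y + h, y], lw=1.2, c='k')
    ax.text((x1 + x2) / 2, y + h, stars(p), ha='center', va='bottom')

plt.show()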

Gaussian Mixture Model with discrete data

I have 136 numbers which come from an overlapping mixture of 8 Gaussian distributions. I want to find the mean and variance of each Gaussian component. Can you find any mistakes in my code?
file = open("1.txt", 'r')  # data in 1.txt looks like 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y = [int(i) for i in file.read().split(',')]  # the data as a list of ints
x = list(range(1, len(y) + 1))  # the x values
z = list(zip(x, y))  # z elements are (1, 0), (2, 0), ...
So, through the above process, I obtained a list z of the 136 points (x, y) on the xy plane, with the given data as the y values.
Now I want to obtain each Gaussian component's mean and variance, under the basic assumption that the given data consist of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to write the code so that it prints the 8 means and variances... Can anyone help me?
You can use this; I have made some sample code for your visualizations -
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline

# sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2

# fit a model onto the data
data = np.array(x).reshape(-1, 1)
model = GaussianMixture(n_components=num_components).fit(data)

# get lists of means and standard deviations
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))

# plotting
extend_window = 50  # for zooming into or out of the graph; the higher it is, the more zoomed out
x_values = np.arange(data.min() - extend_window, data.max() + extend_window, 0.1)  # for plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize=10.0, marker='o')  # plot the data on the x axis

# plot the fitted component densities (in this case 2 of them)
for i in range(num_components):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
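To actually print the 8 means and variances the question asks for, a small hedged addition reusing the y list from the question: fit the raw values rather than the (x, y) pairs, since reshaping the pairs to (-1, 1) mixes the two axes, which is the mistake in the question's code.
import numpy as np
from sklearn.mixture import GaussianMixture

model = GaussianMixture(n_components=8).fit(np.array(y).reshape(-1, 1))
for mean, var in zip(model.means_.flatten(), model.covariances_.flatten()):
    print(f"mean = {mean:.2f}, variance = {var:.2f}")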

mono-energetic gamma ray mean free path

I am writing code about a mono-energetic gamma beam for which the dominant interaction is photoelectric absorption, with mu = 2 cm^-1. I need to generate 50000 random numbers and sample the interaction depth (which I am not sure I did correctly).
I know that the mean free path = 1/mu, but I need to find the mean free path from the simulation as well as from mu and compare them. Is what I did in the code right or not?
import random
import matplotlib.pyplot as plt
import numpy as np
mu=(2)
random.seed=()
data = np.random.randn(50000)*10
bins = np.arange(data.min(), data.max()+1e-8, 0.1)
meanfreepath = 1/mu
print(meanfreepath)
plt.hist(data, bins=bins)
plt.show()
Well, the interaction depth distribution is an exponential one, not a Gaussian.
So the code would be
import numpy as np

lmbda = 2  # attenuation coefficient mu, in cm^-1
beta = 1.0 / lmbda  # scale parameter = theoretical mean free path
data = np.random.exponential(scale=beta, size=50000)
mfp = np.mean(data)  # mean free path estimated from the simulation
print(mfp)
# build histogram
More details at https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.exponential.html
The code above produced
0.4977168417102998
which looks like 2^-1 = 0.5 to me.
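For the histogram step left open above, a minimal sketch reusing the data array; the overlay of the theoretical density is an addition, not part of the original answer:
import numpy as np
import matplotlib.pyplot as plt

lmbda = 2  # cm^-1
# histogram of the sampled interaction depths, normalized to a density
plt.hist(data, bins=100, density=True, alpha=0.5, label='sampled depths')
# overlay the theoretical exponential density p(x) = lmbda * exp(-lmbda * x)
xs = np.linspace(0, data.max(), 200)
plt.plot(xs, lmbda * np.exp(-lmbda * xs), 'r-', label='theoretical pdf')
plt.xlabel('interaction depth (cm)')
plt.legend()
plt.show()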

Trying to fit Pearson3 probability distribution using scipy

I am trying to understand how to fit a probability distribution function, such as Pearson type 3, to a data set (specifically, mean annual rainfall in an area). I've read some questions about this, but I'm still missing something and the fit doesn't come out right. For now my code is this (the specific data file can be downloaded from here):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearson3
year,mm = np.loadtxt('yearly_mm_sde_boker_month_1960_2016.csv',delimiter=',').T
fig,ax=plt.subplots(1,2,figsize=(2*1.62*3,3))
ax[0].plot(year,mm)
dump=ax[1].hist(mm)
size = len(year)
param = pearson3.fit(mm)
pdf_fitted = pearson3.pdf(year, *param[:-2], loc=param[-2], scale=param[-1]) * size
plt.plot(pdf_fitted, label=dist_name)
plt.xlim(0,len(year))
plt.legend(loc='upper right')
plt.show()
What am I missing?
Well, this works:
param = pearson3.fit(mm)  # distribution fitting
# param holds (skew, loc, scale): the shape, location and scale
# of the fitted distribution (not the mean and standard deviation)
x = np.linspace(0, 200, 100)
# fitted distribution
pdf_fitted = pearson3.pdf(x, *param[:-2], loc=param[-2], scale=param[-1])
# original distribution for comparison (optional):
# pdf = norm.pdf(x)
plt.title('Pearson 3 distribution')
plt.plot(x, pdf_fitted, 'r-')  # , x, pdf, 'b--'
dump = plt.hist(mm, density=True, alpha=.3)  # normed=1 is deprecated; use density=True
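If you also want the moments of the fitted distribution, scipy's generic stats method can compute them from the fitted parameters; this step is an addition to the original answer:
# mean and variance of the fitted Pearson III distribution
mean, var = pearson3.stats(*param[:-2], loc=param[-2], scale=param[-1], moments='mv')
print("fitted mean =", float(mean), "fitted std =", float(var) ** 0.5)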

Single Component Metropolis-Hastings

So, let's say I have the following 2-dimensional target distribution that I would like to sample from (a mixture of bivariate normal distributions) -
import numba
import numpy as np
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
%matplotlib inline
def targ_dist(x):
    # equal-weight mixture of three bivariate normals
    target = (stats.multivariate_normal.pdf(x, [0, 0], [[1, 0], [0, 1]])
              + stats.multivariate_normal.pdf(x, [-6, -6], [[1, 0.9], [0.9, 1]])
              + stats.multivariate_normal.pdf(x, [4, 4], [[1, -0.9], [-0.9, 1]])) / 3
    return target
and the following proposal distribution (a bivariate random walk) -
def T(x, y, sigma):
    # isotropic Gaussian random-walk proposal density
    return stats.multivariate_normal.pdf(y, x, [[sigma**2, 0], [0, sigma**2]])
The following is the Metropolis Hastings code for updating the "entire" state in every iteration -
# initialising
n_iter = 30000
# tuning parameter, i.e. the standard deviation of the proposal distribution
sigma = 2
# initial state
X = stats.uniform.rvs(loc=-5, scale=10, size=2, random_state=None)
# count the number of acceptances
accept = 0
# store the samples
MHsamples = np.zeros((n_iter, 2))

# MH sampler
for t in range(n_iter):
    # proposal
    Y = X + stats.norm.rvs(0, sigma, 2)
    # accept or reject
    u = stats.uniform.rvs(loc=0, scale=1, size=1)
    # acceptance probability
    r = (targ_dist(Y) * T(Y, X, sigma)) / (targ_dist(X) * T(X, Y, sigma))
    if u < r:
        X = Y
        accept += 1
    MHsamples[t] = X
However, I would like to update "per component" (i.e. component-wise updating) in every iteration. Is there a simple way of doing this?
Thank you for your help!
From the tone of your question I assume you are looking for performance improvements.
Monte Carlo algorithms are quite compute intensive. You will get better results if you implement the algorithm at a lower level than an interpreted language like Python, e.g. by writing a C extension.
There are also ready-made implementations available for Python (PyStan, PyMC3).
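That said, component-wise updating itself is straightforward: propose and accept one coordinate at a time. Below is a minimal sketch adapted from the question's code, not from the original answer; since the Gaussian random walk is symmetric, the proposal densities cancel in the acceptance ratio.
import numpy as np
import scipy.stats as stats

def targ_dist(x):
    # the same mixture of three bivariate normals as in the question
    return (stats.multivariate_normal.pdf(x, [0, 0], [[1, 0], [0, 1]])
            + stats.multivariate_normal.pdf(x, [-6, -6], [[1, 0.9], [0.9, 1]])
            + stats.multivariate_normal.pdf(x, [4, 4], [[1, -0.9], [-0.9, 1]])) / 3

n_iter = 30000
sigma = 2
X = stats.uniform.rvs(loc=-5, scale=10, size=2)
MHsamples = np.zeros((n_iter, 2))
accept = np.zeros(2)  # acceptance count per component

for t in range(n_iter):
    for i in range(2):  # update one coordinate at a time
        Y = X.copy()
        Y[i] = X[i] + stats.norm.rvs(0, sigma)  # propose a change to component i only
        r = targ_dist(Y) / targ_dist(X)  # proposal densities cancel (symmetric)
        if stats.uniform.rvs() < r:
            X = Y
            accept[i] += 1
    MHsamples[t] = X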
