Estimate joint density with 2d Gaussian kernel - python

I have the following data set where I have to estimate the joint density of 'bwt' and 'age' using kernel density estimation with a 2-dimensional Gaussian kernel and width h=5. I can't use modules such as scipy where there are ready-made functions to do this, so I have to build the functions to calculate the density myself. Here's what I've gotten so far.
import numpy as np
import pandas as pd
babies_full = pd.read_csv("https://www2.helsinki.fi/sites/default/files/atoms/files/babies2.txt", sep='\t')
#Getting the columns I need
babies_full1=babies_full[['gestation', 'age']]
x=np.array(babies_full1,'int')
#2d Gaussian kernel
def k_2dgauss(x):
    return np.exp(-np.sum(x**2, 1)/2) / np.sqrt(2*np.pi)
#Multivariate kernel density
def mv_kernel_density(t, x, h):
    d = x.shape[1]
    return np.mean(k_2dgauss((t - x)/h))/h**d
t = np.linspace(1.0, 5.0, 50)
h=5
print(mv_kernel_density(t, x, h))
However, I get a value error, 'ValueError: operands could not be broadcast together with shapes (50,) (1173,2)', which I think is because of the different shapes of the arrays. I also don't understand why k_2dgauss(x) returns an array of zeros for me, since it should only return one value. In general, I am new to the concept of kernel density estimation, so I don't really know if I've written the functions right; any hints would help!

Following on from my comments on your original post, I think this is what you want to do, but if not then come back to me and we can try again.
# info supplied by OP
import numpy as np
import pandas as pd
babies_full = pd.read_csv("https://www2.helsinki.fi/sites/default/files/atoms/files/babies2.txt", sep='\t')
#Getting the columns I need
babies_full1=babies_full[['gestation', 'age']]
x=np.array(babies_full1,'int')
# my contributions
from math import floor, ceil
def binMaker(arr, base):
    """Function I already use for this sort of thing.
    arr is the array I want to make bins for;
    base is the bin separation. This does require you to import floor and ceil,
    otherwise you can make these bins manually yourself."""
    binMin = floor(arr.min() / base) * base
    binMax = ceil(arr.max() / base) * base
    return np.arange(binMin, binMax + base, base)
bins1 = binMaker(x[:,0], 20.) # bins from 140. to 360. spaced 20 apart
bins2 = binMaker(x[:,1], 5.) # bins from 15. to 45. spaced 5. apart
counts = np.zeros((len(bins1)-1, len(bins2)-1)) # empty array for counts to go in
for i in range(0, len(bins1)-1): # loop over the intervals, hence the -1
    boo = (x[:,0] >= bins1[i]) * (x[:,0] < bins1[i+1])
    for j in range(0, len(bins2)-1): # loop over the intervals, hence the -1
        counts[i,j] = np.count_nonzero((x[boo,1] >= bins2[j]) *
                                       (x[boo,1] < bins2[j+1]))
# if you want your PDF to be a fraction of the total
# rather than the number of counts, do the next line
counts /= x.shape[0]
# plotting
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm
# setting the levels so that each number in counts has its own colour
levels = np.linspace(-0.5, counts.max()+0.5, int(counts.max())+2)
cmap = plt.get_cmap('viridis') # or any colormap you like
norm = BoundaryNorm(levels, ncolors=cmap.N, clip=True)
fig, ax = plt.subplots(1, 1, figsize=(6,5), dpi=150)
pcm = ax.pcolormesh(bins2, bins1, counts, ec='k', lw=1)
fig.colorbar(pcm, ax=ax, label='Counts (%)')
ax.set_xlabel('Age')
ax.set_ylabel('Gestation')
ax.set_xticks(bins2)
ax.set_yticks(bins1)
plt.title('Manually making a 2D (joint) PDF')
If this is what you wanted, then there is an easier way with np.histogram2d, although I think you specified that it had to use your own methods and not built-in functions. I've included it anyway for completeness' sake.
pdf = np.histogram2d(x[:,0], x[:,1], bins=(bins1,bins2))[0]
pdf /= x.shape[0] # again for normalising and making a percentage
levels = np.linspace(-0.5, pdf.max()+0.5, int(pdf.max())+2)
cmap = plt.get_cmap('viridis') # or any colormap you like
norm = BoundaryNorm(levels, ncolors=cmap.N, clip=True)
fig, ax = plt.subplots(1, 1, figsize=(6,5), dpi=150)
pcm = ax.pcolormesh(bins2, bins1, pdf, ec='k', lw=1)
fig.colorbar(pcm, ax=ax, label='Counts (%)')
ax.set_xlabel('Age')
ax.set_ylabel('Gestation')
ax.set_xticks(bins2)
ax.set_yticks(bins1)
plt.title('using np.histogram2d to make a 2D (joint) PDF')
Final note - in this example, the only place where counts doesn't equal pdf is the bin between 40 <= age < 45 and 280 <= gestation < 300, which I think is due to how, in my manual case, I've used >= and < (half-open bins), whereas np.histogram2d, like np.histogram, treats every bin as half-open except the last one, which also includes its right edge, so an age of exactly 45 still lands in the final age bin there. We can see the element of x that is responsible:
>>> print(x[1011])
[280 45]
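For completeness, since the question itself asks for a kernel density estimate rather than binned counts, here is a minimal sketch of a 2D Gaussian KDE evaluated on a grid, assuming the same data matrix x as above and h = 5 (note that for a 2-dimensional kernel the normalising constant is (2*pi)^(d/2) = 2*pi rather than sqrt(2*pi)):
import numpy as np

def k_2dgauss(u):
    # u has shape (n, 2): one row per (scaled) difference between the
    # evaluation point and a data point
    return np.exp(-np.sum(u**2, axis=1) / 2) / (2 * np.pi)

def mv_kernel_density(t, x, h):
    # t is a single evaluation point of shape (2,); x is the (n, 2) data matrix
    d = x.shape[1]
    return np.mean(k_2dgauss((t - x) / h)) / h**d

h = 5
# evaluate the joint density on a 50 x 50 grid of (gestation, age) points
g_vals = np.linspace(x[:, 0].min(), x[:, 0].max(), 50)
a_vals = np.linspace(x[:, 1].min(), x[:, 1].max(), 50)
density = np.array([[mv_kernel_density(np.array([g, a]), x, h)
                     for a in a_vals] for g in g_vals])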

Related

Can't set a, b in cv2.convertScaleAbs for a linear transform of the input

I'm learning OpenCV, and I'm looking at convertScaleAbs to transform the original values to the range [0, 255], quite similar to what normalize does in the NORM_MINMAX mode.
As far as I understand, values are transformed according to y = a*x + b, then the resulting values are clipped and converted to uint8. If this is correct, then selecting a and b this way:
a = (255.0 - 0) / (x_max - x_min)
b = -x_min * a
should linearly transform the original values to [0,255] and the final step, the clipping, should not change the values (only the type). However I cannot obtain this correct result with the a and b values above. I create random original values, then show the result of normalize (the ones expected), then the results of convertScaleAbs (wrong, everything converted to 255):
Here is my code:
import numpy as np
import random as rnd
import matplotlib.pyplot as plt
import cv2
x_values = range(100)
y_values = [None]*3
# Original values
a, b = rnd.randint(0, 10), rnd.randint(0, int(1e4))  # randint needs integer bounds
y_values[0] = np.array([a*i+b for i in x_values])
np.random.shuffle(y_values[0])
# Transformed values, fist method
y_values[1] = np.zeros(y_values[0].shape)
y_values[1] = cv2.normalize(
    y_values[0], y_values[1],
    0, 255, cv2.NORM_MINMAX, cv2.CV_8U)
# Transformed values, alternate method
ymin, ymax = y_values[0].min(), y_values[0].max()
a = (255.0 - 0) / (ymax - ymin)
b = -ymin * a
y_values[2] = cv2.convertScaleAbs(y_values[0], a, b)
# Check visually
titles = ['Original', 'cv2.normalize', 'cv2.convertScaleAbs']  # titles assumed; not defined in the original snippet
fig, ax = plt.subplots(3, 1, figsize=(6,4), sharex=True)
for i, values in enumerate(y_values):
    ax[i].set_ylim(y_values[i].min(), y_values[i].max())
    ax[i].tick_params(axis='both', which='major', labelsize=8)
    ax[i].set_title(titles[i], fontsize=8)
    ax[i].grid(axis='both', ls=':')
    ax[i].scatter(x_values, y_values[i], marker='.', s=1)
fig.tight_layout()
plt.locator_params(axis='y', nbins=5)
plt.ioff()
plt.show()
According to the documentation at convertScaleAbs, the second argument to the function should be the destination matrix: y_values[2] here.
After changing the call to
y_values[2] = cv2.convertScaleAbs(y_values[0], y_values[2], a, b)
it seems fine now.
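As an aside (not part of the original answer), the positional-argument mix-up can also be avoided entirely by passing alpha and beta as keyword arguments, so no destination placeholder is needed; a minimal sketch:
import numpy as np
import cv2

x = np.random.uniform(0, 1e4, size=100)       # some example data
a = 255.0 / (x.max() - x.min())               # scale so that x_max maps to 255
b = -x.min() * a                              # shift so that x_min maps to 0
y = cv2.convertScaleAbs(x, alpha=a, beta=b)   # dst is created and returned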

How to get value of area under multiple peaks

I have some data from a bioanalyzer which gives me time (x-axis) and absorbance values (y-axis). The time step is 0.05 seconds and it runs from 32 s to 138 s, so you can imagine how many data points I have. I've created a graph using both plotly and matplotlib, just so that I have more libraries to work with to find a solution, so a solution in either library is OK! What I'm trying to do is make my script find the area under each peak and return those values.
import numpy as np
import peakutils

def create_plot(sheet_name):
    # assumes `book` is an xlrd workbook opened elsewhere
    sample = book.sheet_by_name(sheet_name)
    data = [[sample.cell_value(r, c) for r in range(sample.nrows)] for c in range(sample.ncols)]
    y = data[2][18:len(data[2]) - 2]
    x = np.arange(32, 138.05, 0.05)
    indices = peakutils.indexes(y, thres=0.35, min_dist=0.1)
    peaks = [y[i] for i in indices]
This snippet gets my Y values, X values and indices of the peaks. Now is there a way to get the area under each curve? Let's say that there are 15 indices.
Here's what the graph looks like:
An automated answer
Given a set of x and y values as well as a set of peaks (the x-coordinates of the peaks), here's how you can automatically find the area under each of the peaks. I'm assuming that x, y, and peaks are all Numpy arrays:
import numpy as np
# find the minima between each peak
ixpeak = x.searchsorted(peaks)
ixmin = np.array([np.argmin(i) for i in np.split(y, ixpeak)])
ixmin[1:] += ixpeak
mins = x[ixmin]
# split up the x and y values based on those minima
xsplit = np.split(x, ixmin[1:-1])
ysplit = np.split(y, ixmin[1:-1])
# find the areas under each peak
areas = [np.trapz(ys, xs) for xs,ys in zip(xsplit, ysplit)]
Output:
The example data has been set up so that the area under each peak is (more-or-less) guaranteed to be 1.0, so the results in the bottom plot are correct. The green X marks are the locations of the minimum between each two peaks. The part of the curve "belonging" to each peak is determined as the part of the curve in-between the minima adjacent to each peak.
Complete code
Here's the complete code I used to generate the example data:
import scipy as sp
import scipy.stats
prec = 1e5
n = 10
N = 150
r = np.arange(0, N+1, N//n)
# generate some reasonable fake data
peaks = np.array([np.random.uniform(s, e) for s,e in zip(r[:-1], r[1:])])
x = np.linspace(0, N + n, num=int(prec))
y = np.max([sp.stats.norm.pdf(x, loc=p, scale=.4) for p in peaks], axis=0)
and the code I used to make the plots:
import matplotlib.pyplot as plt
# plotting stuff
plt.figure(figsize=(5,7))
plt.subplots_adjust(hspace=.33)
plt.subplot(211)
plt.plot(x, y, label='trace 0')
plt.plot(peaks, y[ixpeak], '+', c='red', ms=10, label='peaks')
plt.plot(mins, y[ixmin], 'x', c='green', ms=10, label='mins')
plt.xlabel('dep')
plt.ylabel('indep')
plt.title('Example data')
plt.ylim(-.1, 1.6)
plt.legend()
plt.subplot(212)
plt.bar(np.arange(len(areas)), areas)
plt.xlabel('Peak number')
plt.ylabel('Area under peak')
plt.title('Area under the peaks of trace 0')
plt.show()

Fit the gamma distribution only to a subset of the samples

I have the histogram of my input data (in black) given in the following graph:
I'm trying to fit a Gamma distribution, not to the whole data set but just to the first curve of the histogram (the first mode). The green plot in the previous graph corresponds to fitting the Gamma distribution to all the samples, using the following Python code which makes use of scipy.stats.gamma:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma
from geotiff.io import IO

img = IO.read(input_file)
data = img.flatten() + abs(np.min(img)) + 1
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins, patches = plt.hist(data, 1000, density=True)
# slice histogram here
# estimation of the parameters of the gamma distribution
fit_alpha, fit_loc, fit_beta = gamma.fit(data, floc=0)
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, fit_loc, fit_beta)
print('(alpha, beta): (%f, %f)' % (fit_alpha, fit_beta))
# plot estimated model
plt.plot(x, y, linewidth=2, color='g')
plt.show()
How can I restrict the fitting only to the interesting subset of this data?
Update1 (slicing):
I sliced the input data by keeping only values below the max of the previous histogram, but the results were not really convincing:
This was achieved by inserting the following code below the # slice histogram here comment in the previous code:
max_data = bins[np.argmax(n)]
data = data[data < max_data]
Update2 (scipy.optimize.minimize):
The code below shows how scipy.optimize.minimize() is used to minimize an energy function to find (alpha, beta):
import matplotlib.pyplot as plt
import numpy as np
from geotiff.io import IO
from scipy.stats import gamma
from scipy.optimize import minimize
def truncated_gamma(x, max_data, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)
# read image
img = IO.read(input_file)
# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1
# data histogram
n, bins = np.histogram(data, 100, density=True)
# using minimize on a slice data below max of histogram
max_data = bins[np.argmax(n)]
data = data[data < max_data]
data = np.random.choice(data, 1000)
energy = lambda p: -np.sum(np.log(truncated_gamma(data, max_data, *p)))
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
# plot data histogram and model
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, 0, fit_beta)
plt.hist(data, 30, density=True)
plt.plot(x, y, linewidth=2, color='g')
plt.show()
The algorithm above converged for a subset of data, and the output in o was:
x: array([ 16.66912781, 6.88105559])
But as can be seen on the screenshot below, the gamma plot doesn't fit the histogram:
You can use a general optimization tool such as scipy.optimize.minimize to fit a truncated version of the desired function, resulting in a nice fit:
First, the modified function:
def truncated_gamma(x, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)
This selects values from the gamma distribution where x < max_data, and zero elsewhere. The np.where part is not actually important here, because the data is exclusively to the left of max_data anyway. The key is normalization, because varying alpha and beta will change the area to the left of the truncation point in the original gamma.
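Written out, this is just the formula the function above implements, with $f_\Gamma$ the gamma density and $F_\Gamma$ the gamma CDF:
$$
f_{\text{trunc}}(x;\alpha,\beta) =
\begin{cases}
\dfrac{f_\Gamma(x;\alpha,\beta)}{F_\Gamma(x_{\max};\alpha,\beta)} & x < x_{\max},\\
0 & \text{otherwise.}
\end{cases}
$$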
The rest is just optimization technicalities.
It's common practice to work with logarithms, so I used what's sometimes called "energy", i.e. the logarithm of the inverse of the probability density.
energy = lambda p: -np.sum(np.log(truncated_gamma(data, *p)))
Minimize:
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x
My output is (alpha, beta): (11.595208, 824.712481). Like the original, it is a maximum likelihood estimate.
If you're not happy with the convergence rate, you may want to
Select a sample from your rather big dataset:
data = np.random.choice(data, 10000)
Try different algorithms using the method keyword argument.
Some optimization routines output a representation of the inverse Hessian, which is useful for uncertainty estimation. Enforcing nonnegativity of the parameters may also be a good idea.
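For instance, a sketch of enforcing nonnegativity with the bounds argument (supported by SLSQP and L-BFGS-B), reusing energy and initial_guess from above:
from scipy.optimize import minimize

# keep alpha and beta strictly positive during the search
bounds = [(1e-6, None), (1e-6, None)]
o = minimize(energy, initial_guess, method='SLSQP', bounds=bounds)
fit_alpha, fit_beta = o.x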
A log-scaled plot without truncation shows the entire distribution:
Here's another possible approach, using a manually created dataset in Excel that more or less matched the plot given.
Raw Data
Outline
Import the data into a Pandas dataframe.
Mask the indices after the max-response index.
Create a mirror image of the remaining data.
Append the mirror image while leaving a buffer of empty space.
Fit the desired distribution to the modified data. Below I do a normal fit by the method of moments and adjust the amplitude and width.
Working Script
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import data to dataframe.
df = pd.read_csv('sample.csv', header=0, index_col=0)
# Mask indices after index at max Y.
mask = df.index.values <= df.Y.argmax()
df = df.loc[mask, :]
scaled_y = 100*df.Y.values
# Create new df with mirror image of Y appended.
sep = 6
app_zeroes = np.append(scaled_y, np.zeros(sep, dtype=float))
mir_y = np.flipud(scaled_y)
new_y = np.append(app_zeroes, mir_y)
# Using Scipy-cookbook to fit a normal by method of moments.
idxs = np.arange(new_y.size) # idxs=[0, 1, 2,...,len(data)]
mid_idxs = idxs.mean() # len(data)/2
# idxs - mid_idxs is [-53.5, -52.5, ..., 52.5, 53.5]
scaling_param = np.sqrt(np.abs(np.sum((idxs-mid_idxs)**2*new_y)/np.sum(new_y)))
# adjust amplitude
fmax = new_y.max()*1.2 # adjusted function max to 120% max y.
# adjust width
scaling_param = scaling_param*.7 # adjusted by 70%.
# Fit normal.
fit = lambda t: fmax*np.exp(-(t-mid_idxs)**2/(2*scaling_param**2))
# Plot results.
plt.plot(new_y, '.')
plt.plot(fit(idxs), '--')
plt.show()
Result
See the scipy-cookbook fitting data page for more on fitting a normal using method of moments.
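For reference, a compact sketch of those method-of-moments estimates, treating the curve values as weights over their indices (the script above fixes the mean at the midpoint because of the mirroring, but the general form is):
import numpy as np

def normal_moments(weights):
    """Weighted mean and standard deviation over the indices 0..len(weights)-1."""
    idxs = np.arange(weights.size)
    mu = np.sum(idxs * weights) / np.sum(weights)
    sigma = np.sqrt(np.sum((idxs - mu)**2 * weights) / np.sum(weights))
    return mu, sigma

# e.g. mu, sigma = normal_moments(new_y)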

Finding first derivative using DFT in Python

I want to find the first derivative of exp(sin(x)) on the interval [0, 2*pi] using a discrete Fourier transform. The basic idea is to first evaluate the DFT of exp(sin(x)) on the given interval, giving you say v_k, and then compute the inverse DFT of ik*v_k, giving you the desired answer. In reality, due to the implementations of Fourier transforms in programming languages, you might need to reorder the output somewhere and/or multiply by different factors here and there.
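In equation form, for an N-point grid on a 2*pi-periodic interval (which is the case here), the idea is
$$
f(x_n) = \frac{1}{N}\sum_{k} \hat{f}_k\, e^{i k x_n}
\quad\Longrightarrow\quad
f'(x_n) = \frac{1}{N}\sum_{k} i k\, \hat{f}_k\, e^{i k x_n},
$$
with the wavenumbers ordered as $k = 0, 1, \dots, N/2-1, -N/2, \dots, -1$ to match the layout that np.fft.fft uses.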
I first did it in Mathematica, where there is an option FourierParameters, which lets you specify a convention for the transform. I first obtained the Fourier series of a Gaussian, in order to see what normalisation factors I have to multiply by, and then went on to find the derivative. Unfortunately, when I then translated my Mathematica code into Python (again doing the Fourier series of a Gaussian first, which was successful), I didn't get the same results. Here is my code:
import numpy as np
import matplotlib.pyplot as plt

N=1000
xmin=0
xmax=2.0*np.pi
step = (xmax-xmin)/(N)
xdata = np.linspace(xmin, xmax-step, N)
v = np.exp(np.sin(xdata))
derv = np.cos(xdata)*v
vhat = np.fft.fft(v)
kvals1 = np.arange(0, N/2.0, 1)
kvals2 = np.arange(-N/2.0, 0, 1)
what1 = np.zeros(kvals1.size+1)
what2 = np.empty(kvals2.size)
it = np.nditer(kvals1, flags=['f_index'])
while not it.finished:
    np.put(what1, it.index, 1j*(2.0*np.pi)/((xmax-xmin))*it[0]*vhat[[int(it[0])]])
    it.iternext()
it = np.nditer(kvals2, flags=['f_index'])
while not it.finished:
    np.put(what2, it.index, 1j*(2.0*np.pi)/((xmax-xmin))*it[0]*vhat[[int(it[0])]])
    it.iternext()
xdatafull = np.concatenate((xdata, [2.0*np.pi]))
what = np.concatenate((what1, what2))
w = np.real(np.fft.ifft(what))
fig = plt.figure()
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data',0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))
plt.plot(xdata, derv, color='blue')
plt.plot(xdatafull, w, color='red')
plt.show()
I can post the Mathematica code, if people want me to.
It turns out the problem is that np.zeros gives you an array of real zeros, not complex ones, so the assignments after it don't change anything, because the values being assigned are purely imaginary.
Thus the solution is quite simply
import numpy as np
N=100
xmin=0
xmax=2.0*np.pi
step = (xmax-xmin)/(N)
xdata = np.linspace(step, xmax, N)
v = np.exp(np.sin(xdata))
derv = np.cos(xdata)*v
vhat = np.fft.fft(v)
what = 1j*np.zeros(N)
what[0:N//2] = 1j*np.arange(0, N/2.0, 1)
what[N//2+1:] = 1j*np.arange(-N/2.0 + 1, 0, 1)
what = what*vhat
w = np.real(np.fft.ifft(what))
# Then plotting
where the key change is that np.zeros is replaced by 1j*np.zeros, so the array can hold complex values.
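As an aside (not part of the original answer), the wavenumber array can also be built with np.fft.fftfreq, which produces the positive/negative ordering for you; a minimal sketch on the same grid:
import numpy as np

N = 100
step = 2.0 * np.pi / N
xdata = np.linspace(step, 2.0 * np.pi, N)
v = np.exp(np.sin(xdata))

# fftfreq returns frequencies in cycles per unit length, in FFT order;
# multiplying by 2*pi*i turns them into the ik factors for a 2*pi-periodic interval
ik = 2j * np.pi * np.fft.fftfreq(N, d=step)
w = np.real(np.fft.ifft(ik * np.fft.fft(v)))   # approximates cos(x)*exp(sin(x))
The handling of the Nyquist mode differs slightly from the manual version above, but for real input its contribution is purely imaginary and np.real discards it.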

Referencing Data From a 2D Histogram

I have the following code that reads data from a CSV file and creates a 2D histogram:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
#Read in CSV data
filename = 'Complete_Storms_All_US_Only.csv'
df = pd.read_csv(filename)
min_85 = df.min85
min_37 = df.min37
verification = df.one_min_15
#Numbers
x = min_85
y = min_37
H = verification
#Estimate the 2D histogram
nbins = 33
H, xedges, yedges = np.histogram2d(x,y,bins=nbins)
#Rotate and flip H
H = np.rot90(H)
H = np.flipud(H)
#Mask zeros
Hmasked = np.ma.masked_where(H==0,H)
#Calculate Averages
avgarr = np.zeros((nbins, nbins))
xbins = np.digitize(x, xedges[1:-1])
ybins = np.digitize(y, yedges[1:-1])
for xb, yb, v in zip(xbins, ybins, verification):
    avgarr[yb, xb] += v
divisor = H.copy()
divisor[divisor==0.0] = np.nan
avgarr /= divisor
binavg = np.around((avgarr * 100), decimals=1)
binper = np.ma.array(binavg, mask=np.isnan(binavg))
#Plot 2D histogram using pcolor
fig1 = plt.figure()
plt.pcolormesh(xedges,yedges,binper)
plt.title('1 minute at +/- 0.15 degrees')
plt.xlabel('min 85 GHz PCT (K)')
plt.ylabel('min 37 GHz PCT (K)')
cbar = plt.colorbar()
cbar.ax.set_ylabel('Probability of CG Lightning (%)')
plt.show()
Each pixel in the histogram contains the probability of lightning for a given range of temperatures at two different frequencies on the x and y axes (min_85 on the x axis and min_37 on the y axis). I am trying to look up the probability of lightning from the histogram based on temperatures that vary on an individual basis for any given storm. Each storm has a min_85 and min_37 that corresponds to a probability from the 2D histogram. I know there is a brute-force method where you can create a ridiculous number of if statements, one for each pixel, but this is tedious and inefficient when working across multiple 2D histograms. Is there a more efficient way to look up the probability from the histogram based on the given min_85 and min_37? I have a separate file with the min_85 and min_37 data for a large number of storms; I just need to assign the corresponding probability of lightning from the histogram to each one.
It sounds like all you need to do is turn the min_85 and min_37 values into indices. Something like this will work:
# min85data and min37data from your file
dx = xedges[1] - xedges[0]
dy = yedges[1] - yedges[0]
min85inds = np.floor((min85data - xedges[0]) / dx).astype(int)
min37inds = np.floor((min37data - yedges[0]) / dy).astype(int)
# Pretend you didn't do all that flipping of H, or make a copy of it first
hvals = h_orig[min85inds, min37inds]
But do make sure that the resulting indices are valid before you extract them.
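Alternatively (a sketch, not from the original answer), np.digitize together with np.clip maps the values onto bin indices and keeps out-of-range values inside the table; min85data and min37data are the assumed names for the per-storm values, as above:
import numpy as np

# bin index along each axis, clipped so out-of-range storms land in the edge bins
xb = np.clip(np.digitize(min85data, xedges) - 1, 0, len(xedges) - 2)
yb = np.clip(np.digitize(min37data, yedges) - 1, 0, len(yedges) - 2)

# binavg from the question is indexed [ybin, xbin], matching how avgarr was filled
probs = binavg[yb, xb]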
