I have 1000 large numbers, randomly distributed in range 37231 to 56661.
I am trying to use the stats.gaussian_kde but something does not work.
(maybe because of my poor knowledge of statistics?).
Here is the code:
from numpy import linspace
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
# 'data' is a 1D array containing the initial numbers, in the range 37231 to 56661
xmin = min(data)
xmax = max(data)
# get evenly distributed numbers for X axis.
x = linspace(xmin, xmax, 1000) # get 1000 points on x axis
nPoints = len(x)
# get actual kernel density.
density = gaussian_kde(data)
y = density(x)
# print the output data
for i in range(nPoints):
    print("%s %s" % (x[i], y[i]))
plt.plot(x, density(x))
plt.show()
In the printout, I get the x values in column 1 and zeros in column 2.
The plot shows a flat line.
I simply cannot find the solution.
I tried a very wide range of x values, with the same result.
What is the problem? What am I doing wrong?
Could the large numbers be the cause?
I think what's happening is that your data array is made up of integers, which leads to problems:
>>> import numpy, scipy.stats
>>>
>>> data = numpy.random.randint(37231, 56661,size=10)
>>> xmin, xmax = min(data), max(data)
>>> x = numpy.linspace(xmin, xmax, 10)
>>>
>>> density = scipy.stats.gaussian_kde(data)
>>> density.dataset
array([[52605, 45451, 46029, 40379, 48885, 41262, 39248, 38247, 55987,
44019]])
>>> density(x)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
but if we use floats:
>>> density = scipy.stats.gaussian_kde(data*1.0)
>>> density.dataset
array([[ 52605., 45451., 46029., 40379., 48885., 41262., 39248.,
38247., 55987., 44019.]])
>>> density(x)
array([ 4.42201513e-05, 5.51130237e-05, 5.94470211e-05,
5.78485526e-05, 5.21379448e-05, 4.43176188e-05,
3.66725694e-05, 3.06297511e-05, 2.56191024e-05,
2.01305127e-05])
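In other words, the fix for the original code is just to make the data floats before building the KDE. A minimal, self-contained sketch (the random data here stands in for the question's array):
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

# stand-in for the question's integer data
data = np.random.randint(37231, 56661, size=1000)

density = gaussian_kde(data.astype(float))  # cast to float before the KDE
x = np.linspace(data.min(), data.max(), 1000)
plt.plot(x, density(x))
plt.show()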
I have made a function to do this. You can vary the bandwidth as a parameter of the function: a smaller value gives a more peaked curve, a larger value a smoother one. The default is 0.3.
It works in the IPython notebook with --pylab=inline
The number of bins is computed with the Freedman-Diaconis rule, so it varies with the size and spread of your data.
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
def hist_with_kde(data, bandwidth=0.3):
    # set the number of bins using the Freedman-Diaconis rule
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    n = len(data) ** (1. / 3)
    rng = max(data) - min(data)
    iqr = 2 * (q3 - q1)
    bins = int((n * rng) / iqr)
    x = np.linspace(min(data), max(data), 200)
    kde = stats.gaussian_kde(data)
    kde.covariance_factor = lambda: bandwidth
    kde._compute_covariance()
    plt.plot(x, kde(x), 'r')  # distribution function
    plt.hist(data, bins=bins, density=True)  # histogram
data = np.random.randn(500)
hist_with_kde(data,0.25)
Related
How to generate a random integer as with np.random.randint(), but with a normal distribution around 0.
np.random.randint(-10, 10) returns integers with a discrete uniform distribution
np.random.normal(0, 0.1, 1) returns floats with a normal distribution
What I want is a kind of combination between the two functions.
One other way to get a discrete distribution that looks like the normal distribution is to draw from a multinomial distribution where the probabilities are calculated from a normal distribution.
import scipy.stats as ss
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-10, 11)
xU, xL = x + 0.5, x - 0.5
prob = ss.norm.cdf(xU, scale = 3) - ss.norm.cdf(xL, scale = 3)
prob = prob / prob.sum() # normalize the probabilities so their sum is 1
nums = np.random.choice(x, size = 10000, p = prob)
plt.hist(nums, bins=len(x))
plt.show()
Here, np.random.choice picks an integer from [-10, 10]. The probability for selecting an element, say 0, is calculated by p(-0.5 < x < 0.5) where x is a normal random variable with mean zero and standard deviation 3. I chose a std. dev. of 3 because this way p(-10 < x < 10) is almost 1.
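As a quick check of that last claim (my addition, reusing the imports above):
p_covered = ss.norm.cdf(10, scale=3) - ss.norm.cdf(-10, scale=3)
print(p_covered)  # about 0.9991, i.e. almost 1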
The result looks like this:
It may be possible to generate a similar distribution from a Truncated Normal Distribution that is rounded to integers. Here's an example with scipy's truncnorm().
import numpy as np
from scipy.stats import truncnorm
import matplotlib.pyplot as plt
scale = 3.
bound = 10  # half-width of the integer range; avoids shadowing the built-in range()
size = 100000
X = truncnorm(a=-bound/scale, b=+bound/scale, scale=scale).rvs(size=size)
X = X.round().astype(int)
Let's see what it looks like:
bins = 2 * bound + 1
plt.hist(X, bins)
plt.show()
The accepted answer here works, but I tried Will Vousden's solution and it works well too:
import numpy as np
import matplotlib.pyplot as plt
# Generate Distribution:
randomNums = np.random.normal(scale=3, size=100000)
randomInts = np.round(randomNums)
# Plot:
axis = np.arange(start=min(randomInts), stop=max(randomInts) + 1)
plt.hist(randomInts, bins=axis)
plt.show()
Old question, new answer:
For a bell-shaped distribution on the integers {-10, -9, ..., 9, 10}, you can use the binomial distribution with n=20 and p=0.5, and subtract 10 from the samples.
For example,
In [167]: import numpy as np
In [168]: import matplotlib.pyplot as plt
In [169]: rng = np.random.default_rng()
In [170]: N = 5000000 # Number of samples to generate
In [171]: samples = rng.binomial(n=20, p=0.5, size=N) - 10
In [172]: samples.min(), samples.max()
Out[172]: (-10, 10)
Note that the probability of -10 or 10 is pretty low, so you won't necessarily see them in any given sample, especially if you use a smaller N.
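For reference, the endpoint probability can be computed exactly; a quick check with scipy.stats (not needed for the answer itself):
from scipy.stats import binom
p_edge = binom.pmf(0, 20, 0.5)  # same as binom.pmf(20, 20, 0.5) by symmetry
print(p_edge)  # 0.5**20, roughly 9.5e-07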
np.bincount() is an efficient way to generate a histogram for a collection of small nonnegative integers:
In [173]: counts = np.bincount(samples + 10, minlength=21)
In [174]: counts
Out[174]:
array([ 4, 104, 889, 5517, 22861, 73805, 184473, 369441,
599945, 800265, 881140, 801904, 600813, 370368, 185082, 73635,
23325, 5399, 931, 95, 4])
In [175]: plt.bar(np.arange(-10, 11), counts)
Out[175]: <BarContainer object of 21 artists>
This version is not mathematically correct (because it crops the bell), but it will do the job quickly and is easy to understand if high precision is not needed:
import numpy as np

def draw_random_normal_int(low: int, high: int):
    # generate a random normal number (float)
    normal = np.random.normal(loc=0, scale=1)
    # clip to -3..3 (where the bell with mean 0 and std 1 is very close to zero)
    normal = -3 if normal < -3 else normal
    normal = 3 if normal > 3 else normal
    # scale range of 6 (-3..3) to range of low..high
    scaling_factor = (high - low) / 6
    normal_scaled = normal * scaling_factor
    # center around the mean of the low..high range
    normal_scaled += low + (high - low) / 2
    # then round and return
    return int(np.round(normal_scaled))
Drawing 100000 numbers results in this histogram:
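The plotting code isn't shown above; a minimal sketch of how such a histogram could be produced (my assumption, not the author's exact code):
import matplotlib.pyplot as plt

samples = [draw_random_normal_int(0, 100) for _ in range(100000)]
plt.hist(samples, bins=101)  # one bin per integer in 0..100
plt.show()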
Here we start by getting values from the bell curve.
CODE:
#--------*---------*---------*---------*---------*---------*---------*---------*
# Desc: Discretize a normal distribution centered at 0
#--------*---------*---------*---------*---------*---------*---------*---------*
import sys
import random
from math import sqrt, pi
import numpy as np
import matplotlib.pyplot as plt

def gaussian(x, var):
    k1 = np.power(x, 2)
    k2 = -k1/(2*var)
    return (1./(sqrt(2. * pi * var))) * np.exp(k2)

#--------*---------*---------*---------*---------*---------*---------*---------#
while 1:  #                      M A I N   L I N E                             #
#--------*---------*---------*---------*---------*---------*---------*---------#
    # # probability density function
    # # for discrete normal RV
    pdf_DGV = []
    pdf_DGW = []
    var = 9
    tot = 0
    # # create 'rough' gaussian, folding the tails into the endpoints
    for i in range(-var - 1, var + 2):
        if i == -var - 1:
            r_pdf = + gaussian(i, 9) + gaussian(i - 1, 9) + gaussian(i - 2, 9)
        elif i == var + 1:
            r_pdf = + gaussian(i, 9) + gaussian(i + 1, 9) + gaussian(i + 2, 9)
        else:
            r_pdf = gaussian(i, 9)
        tot = tot + r_pdf
        pdf_DGV.append(i)
        pdf_DGW.append(r_pdf)
        print(i, r_pdf)
    # # amusing how close tot is to 1!
    print('\nRough total = ', tot)
    # # random.choices (Python 3.6+) doesn't need normalized weights,
    # # but we can't help ourselves
    for i in range(0, len(pdf_DGW)):
        pdf_DGW[i] = pdf_DGW[i]/tot
    # # print out pdf weights
    # # for our discrete gaussian
    print('\npdf:\n')
    print(pdf_DGW)
    # # plot random variable action
    rv_samples = random.choices(pdf_DGV, pdf_DGW, k=10000)
    plt.hist(rv_samples, bins=100)
    plt.show()
    sys.exit()
OUTPUT:
-10 0.0007187932912256041
-9 0.001477282803979336
-8 0.003798662007932481
-7 0.008740629697903166
-6 0.017996988837729353
-5 0.03315904626424957
-4 0.05467002489199788
-3 0.0806569081730478
-2 0.10648266850745075
-1 0.12579440923099774
0 0.1329807601338109
1 0.12579440923099774
2 0.10648266850745075
3 0.0806569081730478
4 0.05467002489199788
5 0.03315904626424957
6 0.017996988837729353
7 0.008740629697903166
8 0.003798662007932481
9 0.001477282803979336
10 0.0007187932912256041
Rough total = 0.9999715875468381
pdf:
[0.000718813714486599, 0.0014773247784004072, 0.003798769940305483, 0.008740878047691289, 0.017997500190860556, 0.033159988420867426, 0.05467157824565407, 0.08065919989878699, 0.10648569402724471, 0.12579798346031068, 0.13298453855078374, 0.12579798346031068, 0.10648569402724471, 0.08065919989878699, 0.05467157824565407, 0.033159988420867426, 0.017997500190860556, 0.008740878047691289, 0.003798769940305483, 0.0014773247784004072, 0.000718813714486599]
I'm not sure if there is an option in the scipy generators to choose the output dtype, but a common way to generate samples is with scipy.stats:
# Generate pseudodata from a single normal distribution
import scipy
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
dist_mean = 0.0
dist_std = 0.5
n_events = 500
toy_data = scipy.stats.norm.rvs(dist_mean, dist_std, size=n_events)
toy_data2 = [[i, j] for i, j in enumerate(toy_data)]
arr = np.array(toy_data2)
print("sample:\n", arr[:, 0])
print("bin:\n", arr[:, 1])
plt.scatter(arr[:, 1], arr[:, 0])
plt.xlabel("bin")
plt.ylabel("sample")
plt.show()
or in this way (again, a dtype option is absent):
import numpy as np
import matplotlib.pyplot as plt
mu, sigma = 0, 0.1 # mean and standard deviation
s = np.random.normal(mu, sigma, 500)
count, bins, ignored = plt.hist(s, 30, density=True)
plt.show()
print(bins) # <<<<<<<<<<
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
linewidth=2, color='r')
plt.show()
Without visualization, the most common way (again, with no way to specify the dtype):
bins = np.random.normal(3, 2.5, size=(10, 1))
A wrapper could be written to produce samples with a given dtype (e.g. by rounding the floats to integers, as mentioned above).
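For instance, a minimal sketch of such a wrapper (the function name and interface are my own invention):
import numpy as np

def normal_samples(mu, sigma, size, dtype=float):
    """Draw normal samples and cast them to the requested dtype."""
    samples = np.random.normal(mu, sigma, size)
    if np.issubdtype(dtype, np.integer):
        samples = np.round(samples)  # round before casting to integers
    return samples.astype(dtype)

ints = normal_samples(0, 2.5, size=1000, dtype=np.int64)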
I have the following data set, where I have to estimate the joint density of 'bwt' and 'age' using kernel density estimation with a 2-dimensional Gaussian kernel and width h=5. I can't use modules such as scipy that provide ready-made functions for this; I have to build the functions to calculate the density myself. Here's what I've gotten so far.
import numpy as np
import pandas as pd
babies_full = pd.read_csv("https://www2.helsinki.fi/sites/default/files/atoms/files/babies2.txt", sep='\t')
#Getting the columns I need
babies_full1=babies_full[['gestation', 'age']]
x=np.array(babies_full1,'int')
#2d Gaussian kernel
def k_2dgauss(x):
    return np.exp(-np.sum(x**2, 1)/2) / np.sqrt(2*np.pi)
#Multivariate kernel density
def mv_kernel_density(t, x, h):
    d = x.shape[1]
    return np.mean(k_2dgauss((t - x)/h))/h**d
t = np.linspace(1.0, 5.0, 50)
h=5
print(mv_kernel_density(t, x, h))
However, I get a ValueError: 'operands could not be broadcast together with shapes (50,) (1173,2)', which I think is because of the different shapes of the matrices. I also don't understand why k_2dgauss(x) returns an array of zeros, since it should only return one value. In general, I am new to the concept of kernel density estimation and don't really know whether I've written the functions correctly, so any hints would help!
Following on from my comments on your original post, I think this is what you want to do, but if not then come back to me and we can try again.
# info supplied by OP
import numpy as np
import pandas as pd

babies_full = \
    pd.read_csv("https://www2.helsinki.fi/sites/default/files/atoms/files/babies2.txt", sep='\t')
#Getting the columns I need
babies_full1=babies_full[['gestation', 'age']]
x=np.array(babies_full1,'int')
# my contributions
from math import floor, ceil
def binMaker(arr, base):
    """Function I already use for this sort of thing.
    arr is the array I want to make bins for.
    base is the bin separation; this requires floor and ceil from math,
    otherwise you can make these bins manually yourself."""
    binMin = floor(arr.min() / base) * base
    binMax = ceil(arr.max() / base) * base
    return np.arange(binMin, binMax + base, base)
bins1 = binMaker(x[:,0], 20.) # bins from 140. to 360. spaced 20 apart
bins2 = binMaker(x[:,1], 5.) # bins from 15. to 45. spaced 5. apart
counts = np.zeros((len(bins1)-1, len(bins2)-1))  # empty array for counts to go in
for i in range(0, len(bins1)-1):  # loop over the intervals, hence the -1
    boo = (x[:,0] >= bins1[i]) * (x[:,0] < bins1[i+1])
    for j in range(0, len(bins2)-1):  # loop over the intervals, hence the -1
        counts[i,j] = np.count_nonzero((x[boo,1] >= bins2[j]) *
                                       (x[boo,1] < bins2[j+1]))
# if you want your PDF to be a fraction of the total
# rather than the number of counts, do the next line
counts /= x.shape[0]
# plotting
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm
# setting the levels so that each number in counts has its own colour
levels = np.linspace(-0.5, counts.max()+0.5, int(counts.max())+2)
cmap = plt.get_cmap('viridis') # or any colormap you like
norm = BoundaryNorm(levels, ncolors=cmap.N, clip=True)
fig, ax = plt.subplots(1, 1, figsize=(6,5), dpi=150)
pcm = ax.pcolormesh(bins2, bins1, counts, cmap=cmap, norm=norm, ec='k', lw=1)
fig.colorbar(pcm, ax=ax, label='Counts (%)')
ax.set_xlabel('Age')
ax.set_ylabel('Gestation')
ax.set_xticks(bins2)
ax.set_yticks(bins1)
plt.title('Manually making a 2D (joint) PDF')
If this is what you wanted, then there is an easier way with np.histogram2d, although I think you specified that it had to use your own methods rather than built-in functions. I've included it anyway for completeness' sake.
pdf = np.histogram2d(x[:,0], x[:,1], bins=(bins1,bins2))[0]
pdf /= x.shape[0] # again for normalising and making a percentage
levels = np.linspace(-0.5, pdf.max()+0.5, int(pdf.max())+2)
cmap = plt.get_cmap('viridis') # or any colormap you like
norm = BoundaryNorm(levels, ncolors=cmap.N, clip=True)
fig, ax = plt.subplots(1, 1, figsize=(6,5), dpi=150)
pcm = ax.pcolormesh(bins2, bins1, pdf, cmap=cmap, norm=norm, ec='k', lw=1)
fig.colorbar(pcm, ax=ax, label='Counts (%)')
ax.set_xlabel('Age')
ax.set_ylabel('Gestation')
ax.set_xticks(bins2)
ax.set_yticks(bins1)
plt.title('using np.histogram2d to make a 2D (joint) PDF')
Final note - in this example, the only place where counts doesn't equal pdf is the bin between 40 <= age < 45 and 280 <= gestation < 300, which I think is due to how, in my manual case, I've used <= and <; I'm a little unsure how np.histogram2d handles values outside the bin ranges, or on the bin edges, etc. We can see the element of x that is responsible:
>>> print(x[1011])
[280 45]
I am categorizing a quantitative variable (e.g. price) and I would like to bin it so that the bins are much more frequent around the mean and less frequent away from the mean.
I have seen that it's possible to cut() in a linear manner and, thanks to numpy.logspace, in a logarithmic manner, but binning around the mean seems to be missing, and my ideas so far haven't worked and seem inefficient.
You can make bins that increase in size linearly:
import numpy as np
def make_progressive_bins(min_x, max_x, mean_x, num_bins=10):
    x_rel_lim = max(mean_x - min_x, max_x - mean_x)
    num_bins_half = num_bins // 2
    bins_right = np.arange(0, num_bins_half + 1)
    if num_bins % 2 == 1:
        bins_right = bins_right + 0.5
    bins_right = np.cumsum(bins_right)
    bins = np.concatenate([-bins_right[bins_right > 0][::-1], bins_right])
    bins = bins * (float(x_rel_lim) / bins[-1]) + mean_x
    return bins
And then you can use it like:
import numpy as np
import matplotlib.pyplot as plt
bins = make_progressive_bins(-20, 50, 10, 15)
plt.bar(bins - 0.1, np.ones_like(bins), 0.2)
I made a script that might do what you want to achieve, but I'm not sure how to convert the resulting cut object into a histogram to see whether it does what I want it to do, so please check and tell me if it works :).
import numpy as np
import pandas as pd

# Make normally distributed price with mean 50.
df = pd.DataFrame(data=np.random.normal(50, size=1000), columns=['price'])
df.hist(bins=30)
num_bins = 100
# I used a square function to distribute the bins more around 0 and
# less at the outskirts of the range.
shape_func = lambda x: x**2
bin_loc = [shape_func(i) for i in range(num_bins//2)]
mirrored_bin_loc = [-x for x in bin_loc[::-1]]
bin_loc = mirrored_bin_loc + bin_loc[1:]
# Rescale and translate bins
data_mean = df.price.mean()
data_range = df.price.max() - df.price.min()
final_bin_loc = [(x + data_mean) / (data_range * num_bins) for x in bin_loc]
# display(final_bin_loc)
binned = pd.cut(df.price, final_bin_loc)
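One way to inspect the result (my suggestion, not part of the original script) is to count the values per bin and plot them:
binned.value_counts().sort_index().plot.bar()  # bins that catch no data show zero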
The following code generates a sample of size 100 from truncated normal distributions with different intervals. Is there any efficient (vectorised) way of doing this?
from scipy.stats import truncnorm
import numpy as np
sample=[]
a_s=np.random.uniform(0,1,size=100)
b_s=a_s+0.2
for i in range(100):
    sample.append(truncnorm.rvs(a_s[i], b_s[i], size=100))
print(sample)
One day in the not so distant future, all NumPy/SciPy functions will broadcast all their arguments, and you will be able to do truncnorm.rvs(a_s, b_s, size=100), but since we are not there yet, you could manually generate your random samples from a uniform distribution and the CDF and PPF of a normal distribution:
import numpy as np
from scipy.stats import truncnorm, norm
a_s = np.random.uniform(0, 1, size=100)
b_s = a_s + 0.2
cdf_start = norm.cdf(a_s)
cdf_stop = norm.cdf(b_s)
cdf_samples = np.random.uniform(0, 1, size=(100, 100))
cdf_samples *= (cdf_stop - cdf_start)[:, None]
cdf_samples += cdf_start[:, None]
truncnorm_samples = norm.ppf(cdf_samples)
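As a quick sanity check (my addition), every sample in row i should lie inside [a_s[i], b_s[i]]:
assert np.all(truncnorm_samples >= a_s[:, None])
assert np.all(truncnorm_samples <= b_s[:, None])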
I have some data of a particle moving in a corridor with closed boundary conditions.
Plotting the trajectory leads to a zig-zag trajectory.
I would like to know how to prevent plot() from connecting the points where the particle comes back to the start. Something like in the upper part of the pic, but without the "." markers.
The first idea I had was to find the index where the numpy array a[:-1]-a[1:] becomes positive and then plot from 0 to that index. But how would I get the index of the first occurrence of a positive element of a[:-1]-a[1:]?
Maybe there are some other ideas.
I'd take a different approach. First, I'd determine the jump points not by looking at the sign of the derivative, since the movement might go up or down, or even have some periodicity in it. I'd look at the points with the biggest derivative.
Second, an elegant approach to have breaks in a plot line is to mask one value on each jump. Then matplotlib will make segments automatically. My code is:
import matplotlib.pyplot as plt
import numpy as np
xs = np.linspace(0., 100., 1000)
data = (xs*0.03 + np.sin(xs) * 0.1) % 1
plt.subplot(2,1,1)
plt.plot(xs, data, "r-")
#Make a masked array with jump points masked
abs_d_data = np.abs(np.diff(data))
mask = np.hstack([ abs_d_data > abs_d_data.mean()+3*abs_d_data.std(), [False]])
masked_data = np.ma.MaskedArray(data, mask)
plt.subplot(2,1,2)
plt.plot(xs, masked_data, "b-")
plt.show()
And gives us as result:
The disadvantage of course is that you lose one point at each break - but with the sampling rate you seem to have I guess you can trade this in for simpler code.
To find where the particle has crossed the upper boundary, you can do something like this:
>>> import numpy as np
>>> a = np.linspace(0, 10, 50) % 5 # some sample data
>>> np.nonzero(np.diff(a) < 0)[0] + 1
array([25, 49])
>>> a[24:27]
array([ 4.89795918, 0.10204082, 0.30612245])
>>> a[48:]
array([ 4.79591837, 0. ])
np.diff(a) calculates the discrete difference of a, while np.nonzero finds the indices where the condition np.diff(a) < 0 holds, i.e., where the difference is negative because the particle has moved downward.
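To get the index of the first occurrence, which the question asked about, just take the first element of that result:
>>> np.nonzero(np.diff(a) < 0)[0][0] + 1
25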
To avoid the connecting line you will have to plot by segments.
Here's a quick way to plot by segments when the derivative of a changes sign:
import numpy as np
import matplotlib.pyplot as plt

a = np.linspace(0, 20, 50) % 5  # similar to Micheal's sample data
x = np.arange(50)  # x scale
indices = np.where(np.diff(a) < 0)[0] + 1  # the same as Micheal's np.nonzero
for n, i in enumerate(indices):
    if n == 0:
        plt.plot(x[:i], a[:i], 'b-')
    else:
        plt.plot(x[indices[n - 1]:i], a[indices[n - 1]:i], 'b-')
plt.plot(x[indices[-1]:], a[indices[-1]:], 'b-')  # final segment after the last break
plt.show()
Based on Thorsten Kranz's answer, here is a version which adds points to the original data where 'y' crosses the period. This is important if the density of data points isn't very high, e.g. np.linspace(0., 100., 100) vs. the original np.linspace(0., 100., 1000). The x positions of the curve transitions are linearly interpolated. Wrapped up in a function:
import numpy as np
def periodic2plot(x, y, period=np.pi*2.):
    indexes = np.argwhere(np.abs(np.diff(y)) > .5*period).flatten()
    index_shift = 0
    for i in indexes:
        i += index_shift
        index_shift += 3  # in every loop it adds 3 elements
        if y[i] > .5*period:
            x_transit = np.interp(period, np.unwrap(y[i:i+2], period=period), x[i:i+2])
            add = np.ma.array([period, 0., 0.], mask=[0, 1, 0])
        else:
            # np.interp needs sorted xp = np.unwrap(y[i:i+2], period=period)
            x_transit = np.interp(0, np.unwrap(y[i:i+2], period=period)[::-1], x[i:i+2][::-1])
            add = np.ma.array([0., 0., period], mask=[0, 1, 0])
        x_add = np.ma.array([x_transit]*3, mask=[0, 1, 0])
        x = np.ma.hstack((x[:i+1], x_add, x[i+1:]))
        y = np.ma.hstack((y[:i+1], add, y[i+1:]))
    return x, y
The code below compares this with Thorsten Kranz's original answer at the lower data-point density.
import matplotlib.pyplot as plt
x = np.linspace(0., 100., 100)
y = (x*0.03 + np.sin(x) * 0.1) % 1
# Thorsten Kranz: make a masked array with jump points masked
abs_d_data = np.abs(np.diff(y))
mask = np.hstack([abs_d_data > .5, [False]])
masked_y = np.ma.MaskedArray(y, mask)
# Plot
plt.figure()
plt.plot(*periodic2plot(x, y, period=1), label='This answer')
plt.plot(x, masked_y, label='Thorsten Kranz')
plt.autoscale(enable=True, axis='both', tight=True)
plt.legend(loc=1)
plt.tight_layout()