Here is my code so far; I'm very new to programming and have been trying for a while.
I apply the Box-Muller transform to approximate two Gaussian normal distributions, starting from uniform random samples. Then I create a histogram for each of them.
Now I would like to compare the obtained histograms with "the real thing": a standard bell curve. How do I draw such a curve to match the histograms?
import numpy as np
import matplotlib.pyplot as plt
N = 10000
z1 = np.random.uniform(0, 1.0, N)
z2 = np.random.uniform(0, 1.0, N)
R_sq = -2 * np.log(z1)
theta = 2 * np.pi * z2
z1 = np.sqrt(R_sq) * np.cos(theta)
z2 = np.sqrt(R_sq) * np.sin(theta)
fig = plt.figure()
ax = fig.add_subplot(2, 1, 1)
ax.hist(z1, bins=40, range=(-4, 4), color='red')
plt.title("Histgram")
plt.xlabel("z1")
plt.ylabel("frequency")
ax2 = fig.add_subplot(2, 1, 2)
ax2.hist(z2, bins=40, range=(-4, 4), color='blue')
plt.xlabel("z2")
plt.show()
To obtain a kernel density estimate, scipy.stats.gaussian_kde calculates a function that smoothly fits the data.
To just draw a Gaussian normal curve, there is scipy.stats.norm. Subtracting the mean and dividing by the standard deviation adapts its position to the given data.
Both curves would be drawn such that the area below the curve is one. To match the size of the histogram, the curves need to be scaled by the number of data points times the bin width. Alternatively, the curves can stay as they are and the histogram be normalized instead, by adding the parameter hist(..., density=True).
In the demo code below, the data for z1 is deliberately distorted to illustrate the difference between the kde and the Gaussian normal.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.linspace(-4,4,1000)
N = 10000
z1 = np.random.randint(1, 3, N) * np.random.uniform(0, .4, N)
z2 = np.random.uniform(0, 1, N)
R_sq = -2 * np.log(z1)
theta = 2 * np.pi * z2
z1 = np.sqrt(R_sq) * np.cos(theta)
z2 = np.sqrt(R_sq) * np.sin(theta)
fig = plt.figure(figsize=(12,4))
for ind_subplot, zi, col in zip((1, 2), (z1, z2), ('crimson', 'dodgerblue')):
    ax = fig.add_subplot(1, 2, ind_subplot)
    ax.hist(zi, bins=40, range=(-4, 4), color=col, label='histogram')
    ax.set_xlabel("z" + str(ind_subplot))
    ax.set_ylabel("frequency")
    binwidth = 8 / 40  # range of 8 divided into 40 bins
    scale_factor = len(zi) * binwidth
    gaussian_kde_zi = stats.gaussian_kde(zi)
    ax.plot(x, gaussian_kde_zi(x) * scale_factor, color='springgreen', linewidth=3, label='kde')
    std_zi = np.std(zi)
    mean_zi = np.mean(zi)
    ax.plot(x, stats.norm.pdf(x, mean_zi, std_zi) * scale_factor, color='black', linewidth=2, label='normal')
    ax.legend()
plt.show()
The original values for z1 and z2 very much resemble a normal distribution, and so the black line (the Gaussian normal fitted to the data) and the green line (the kde) closely resemble each other.
The current code calculates the real mean and the real standard deviation of the data. As you want to mimic a perfect Gaussian normal, you could instead compare to the curve with mean zero and standard deviation one; you'll see they're almost identical on the plot.
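As a minimal sketch of the density=True alternative mentioned above (reusing the Box-Muller sampling from the question), the histogram is normalized to area one, so the unscaled standard normal pdf can be drawn directly on top:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

N = 10000
u1 = np.random.uniform(0, 1, N)
u2 = np.random.uniform(0, 1, N)
z1 = np.sqrt(-2 * np.log(u1)) * np.cos(2 * np.pi * u2)  # Box-Muller

x = np.linspace(-4, 4, 1000)
plt.hist(z1, bins=40, range=(-4, 4), color='red', density=True)  # area normalized to 1
plt.plot(x, stats.norm.pdf(x), color='black')  # standard normal pdf, no scaling needed
plt.show()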
I am plotting a polar 2d histogram in Python 3.7 using matplotlib and the following code (adapted from this answer to another question):
import numpy as np
import matplotlib.pyplot as plt
# input data
azimut = np.random.rand(3000)*2*np.pi
radius = np.random.rayleigh(9, size=3000)
# binning
rbins = np.linspace(0, radius.max(), 10)
abins = np.linspace(0, 2*np.pi, 10)
# histogram
hist, _, _ = np.histogram2d(azimut, radius, bins=(abins, rbins))
A, R = np.meshgrid(abins, rbins)
# plot
fig, ax = plt.subplots(subplot_kw=dict(projection="polar"))
pc = ax.pcolormesh(A, R, hist.T, cmap='inferno')
fig.colorbar(pc)
plt.show()
To produce the following plot:
Due to the large bin sizes, the polar projection appears more like a polygon than a circle.
Is there any way to plot this so that the bins appear curved rather than straight, i.e. so that the plot is always circular, regardless of bin size, and doesn't become polygon-like when the bins are larger?
A matplotlib solution would be preferable, but others are welcome.
Thanks very much for any help.
To get a rounded look, the mesh can be subdivided into more angles. Note that np.linspace(0, 2 * np.pi, 10) creates 9 bins (and 10 boundaries). For the subdivided mesh you need e.g. 90 bins, so 91 boundaries. The histogram values need to be repeated by the same factor.
The code below uses a different colormap for debugging purposes. An optional grid highlights the original boundaries.
import numpy as np
import matplotlib.pyplot as plt
# input data
azimut = np.random.rand(3000) * 2 * np.pi
radius = np.random.rayleigh(9, size=3000)
# binning
rbins = np.linspace(0, radius.max(), 7)
abins = np.linspace(0, 2 * np.pi, 10)
subdivs = 10
abins2 = np.linspace(0, 2 * np.pi, (len(abins) - 1) * subdivs + 1)
# histogram
hist, _, _ = np.histogram2d(azimut, radius, bins=(abins, rbins))
A1, R1 = np.meshgrid(abins, rbins)
A2, R2 = np.meshgrid(abins2, rbins)
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 4), subplot_kw=dict(projection="polar"))
# plot with original mesh
pc1 = ax1.pcolormesh(A1, R1, hist.T, cmap='hsv')
ax1.tick_params(axis='y', labelcolor='white')
ax1.set_xticks(abins[:-1])
fig.colorbar(pc1, ax=ax1)
# plot with subdivided mesh
pc2 = ax2.pcolormesh(A2, R2, np.repeat(hist.T, subdivs, axis=1), cmap='hsv')
ax2.tick_params(axis='y', labelcolor='white')
ax2.set_xticks(abins[:-1])
ax2.set_yticks(rbins, minor=True)
ax2.grid(axis='x', color='white')
ax2.grid(axis='y', which='minor', color='white')
fig.colorbar(pc2, ax=ax2)
plt.tight_layout()
plt.show()
I am using pyplot.hist2d to plot a 2D histogram (x vs. y) weighted by a third variable, z. Instead of summing the z values in a given pixel [x_i, y_i], as hist2d does, I'd like to obtain the average z of all data points falling in that pixel.
Is there a Python way of doing that?
Thanks.
Numpy's histogram2d() can calculate both the counts (a standard histogram) and the sums (via the weights parameter). Dividing the two gives the mean value per cell.
The example below shows the 3 histograms together with a colorbar. The number of samples is chosen relatively small to demonstrate what would happen for cells with a count of zero (the division gives NaN, so the cell is left blank).
import numpy as np
import matplotlib.pyplot as plt
N = 1000
x = np.random.uniform(0, 10, N)
y = np.random.uniform(0, 10, N)
z = np.cos(x) * np.sin(y)
counts, xbins, ybins = np.histogram2d(x, y, bins=(30, 20))
sums, _, _ = np.histogram2d(x, y, weights=z, bins=(xbins, ybins))
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(15, 4))
m1 = ax1.pcolormesh(ybins, xbins, counts, cmap='coolwarm')
plt.colorbar(m1, ax=ax1)
ax1.set_title('counts')
m2 = ax2.pcolormesh(ybins, xbins, sums, cmap='coolwarm')
plt.colorbar(m2, ax=ax2)
ax2.set_title('sums')
with np.errstate(divide='ignore', invalid='ignore'):  # suppress possible divide-by-zero warnings
    m3 = ax3.pcolormesh(ybins, xbins, sums / counts, cmap='coolwarm')
plt.colorbar(m3, ax=ax3)
ax3.set_title('mean values')
plt.tight_layout()
plt.show()
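As an alternative sketch (assuming scipy is available), scipy.stats.binned_statistic_2d can compute the per-cell mean directly, without the manual division; empty cells again become NaN:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic_2d

N = 1000
x = np.random.uniform(0, 10, N)
y = np.random.uniform(0, 10, N)
z = np.cos(x) * np.sin(y)

# statistic='mean' averages the z values per cell
means, xbins, ybins, _ = binned_statistic_2d(x, y, z, statistic='mean', bins=(30, 20))

fig, ax = plt.subplots()
m = ax.pcolormesh(ybins, xbins, means, cmap='coolwarm')
plt.colorbar(m, ax=ax)
ax.set_title('mean values via binned_statistic_2d')
plt.show()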
The purpose of this code is to demonstrate the Central Limit Theorem (CLT).
If I do the following:
num_samples = 10000
sample_means = np.empty(num_samples)
for i in range(num_samples):
    mean = np.mean(st.bernoulli.rvs(p=0.5, size=100))
    sample_means[i] = mean
sample_demeaned = np.subtract(sample_means, 0.5)
denominator = np.divide(0.5, np.sqrt(100))
z_ed = np.divide(sample_demeaned, denominator)
plt.hist(z_ed, bins=40, edgecolor='k', density=True)
x = np.linspace(st.norm.ppf(0.001), st.norm.ppf(0.999), 10000)
y = st.norm.pdf(x)
plt.plot(x, y, color='red')
I get the expected plot, with the histogram closely following the standard normal curve.
However, if I try to do it with a for loop for different sample sizes:
num_samples = 10000
sample_sizes = np.array([5, 20, 75, 100])
sample_std_means = np.empty(shape=(num_samples, len(sample_sizes)))
for col, size in enumerate(sample_sizes):
    sample_means = np.empty(num_samples)
    for i in range(num_samples):
        mean = np.mean(st.bernoulli.rvs(p=0.5, size=size))
        sample_means[i] = mean
    sample_demeaned = np.subtract(sample_means, 0.5)
    denominator = np.divide(0.5, np.sqrt(size))
    z_ed = np.divide(sample_demeaned, denominator)
    sample_std_means[:, col] = sample_means
And then plot each of them in a 2x2 grid:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 7))
x = np.linspace(st.norm.ppf(0.001), st.norm.ppf(0.999), 10000)
y = st.norm.pdf(x)
for i, ax in enumerate(axes.flatten()):
    ax.hist(sample_std_means[i], bins=40, edgecolor='k', color='midnightblue')
    ax.set_ylabel('Density')
    ax.set_xlabel(f'n = {sample_sizes[i]}')
    ax.plot(x, y, color='red')
    ax.set_xlim((-3, 3))
plt.show()
I get an image where the histograms no longer match the normal curve, and I cannot debug the discrepancy. Any help is highly appreciated.
Please note that scipy.stats and numpy have been imported as st and np respectively in both code blocks.
First, note that one of numpy's strong points is that it allows operations mixing arrays and single numbers. This is called broadcasting. So, for example, sample_demeaned = np.subtract(sample_means, 0.5) can be written more concisely as sample_demeaned = sample_means - 0.5.
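A minimal illustration of broadcasting:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
print(a - 0.5)         # [0.5 1.5 2.5]: the scalar is broadcast over all elements
print(a / np.sqrt(4))  # [0.5 1.  1.5]: the same works for division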
Several issues are going wrong:
sample_std_means[:, col] = sample_means should use the just calculated z_ed instead of sample_means.
ax.hist(sample_std_means[i], ...) uses the i'th row of the array. That row only contains 4 elements. You'd want sample_std_means[:, i] to take the i'th column.
The pdf is drawn in its normalized form (with an area below the curve equal to one). However, the histogram's height is proportional to the number of samples. Its total area is num_samples * bin_width, where the histogram's default bin width is the length from the first to the last element divided by the number of bins. To get both the pdf and histogram with similar sizes, either the histogram should be normalized (using density=True) or the pdf should be multiplied by the expected area of the histogram.
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
num_samples = 10000
sample_sizes = np.array([5, 20, 75, 100])
sample_std_means = np.empty(shape=(num_samples, len(sample_sizes)))
for col, size in enumerate(sample_sizes):
    sample_means = np.empty(num_samples)
    for i in range(num_samples):
        sample_means[i] = np.mean(st.bernoulli.rvs(p=0.5, size=size))
    sample_demeaned = sample_means - 0.5
    z_ed = sample_demeaned / (0.5 / np.sqrt(size))
    sample_std_means[:, col] = z_ed
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 7))
x = np.linspace(st.norm.ppf(0.001), st.norm.ppf(0.999), 1000)
y = st.norm.pdf(x)
for i, ax in enumerate(axes.flatten()):
    ax.hist(sample_std_means[:, i], bins=40, edgecolor='k', color='midnightblue', density=True)
    ax.set_ylabel('Density')
    ax.set_xlabel(f'n = {sample_sizes[i]}')
    # alternatively, scale the pdf to the histogram's counts instead of normalizing the histogram:
    # bin_width = (sample_std_means[:, i].max() - sample_std_means[:, i].min()) / 40
    # ax.plot(x, y * num_samples * bin_width, color='red')
    ax.plot(x, y, color='red')
    ax.set_xlim((-3, 3))
plt.show()
Now note the weird empty bars in the histograms. A histogram works best for continuous distributions. But the mean of n Bernoulli trials can have at most n+1 different outcomes. When all trials would be True, the mean would be n/n = 1. When all would be False, the mean would be 0. Combined, the possible means are 0, 1/n, 2/n, ..., 1. The histogram of such a discrete distribution should take these values into account for the boundaries between the bins.
The following code creates a scatter plot, using the position of the means and a random y-value to visualize how many there are per x. Also, the position of the bin boundaries is calculated and visualized by dotted vertical lines.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 7))
for i, ax in enumerate(axes.flatten()):
    ax.scatter(sample_std_means[:, i], np.random.uniform(0, 1, num_samples), color='r', alpha=0.5, lw=0, s=1)
    # there are n+1 possible mean values for n bernoulli trials
    # n+2 boundaries will be needed to separate the bins
    bins = np.arange(-1, sample_sizes[i] + 1) / sample_sizes[i]
    bins += (bins[1] - bins[0]) / 2  # shift half a bin
    bins -= 0.5  # subtract the mean
    bins /= (0.5 / np.sqrt(sample_sizes[i]))  # correction factor
    for b in bins:
        ax.axvline(b, color='g', ls=':')
    ax.set_xlabel(f'n = {sample_sizes[i]}')
    ax.set_xlim((-3, 3))
And here are the histograms using these bins, replacing the scatter call in the loop above:
ax.hist(sample_std_means[:, i], bins=bins, edgecolor='k', color='midnightblue', density=True)
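For completeness, a sketch of the full plotting loop with these aligned bins (same variables as above):

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 7))
for i, ax in enumerate(axes.flatten()):
    # bin boundaries aligned with the n+1 possible standardized means
    bins = np.arange(-1, sample_sizes[i] + 1) / sample_sizes[i]
    bins += (bins[1] - bins[0]) / 2
    bins = (bins - 0.5) / (0.5 / np.sqrt(sample_sizes[i]))
    ax.hist(sample_std_means[:, i], bins=bins, edgecolor='k', color='midnightblue', density=True)
    ax.plot(x, y, color='red')
    ax.set_xlabel(f'n = {sample_sizes[i]}')
    ax.set_xlim((-3, 3))
plt.show()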
Hello, I have come across a problem where I need to generate a dataset from a distribution given on a scatter plot, where the data points are mostly centred around the centre of a circle and also surround it within a particular radius. Any ideas for generating such datasets in Python?
One way of producing a distribution over a circular shape is to sample a one-dimensional distribution for the radius and then stretch it over the 2π circumference of a circle.
One could then decide to use a uniform or a normal distribution.
import matplotlib.pyplot as plt
import numpy as np
def dist(R=4., width=1., num=1000, uniform=True):
    if uniform:
        r = np.random.rand(num) * width + R
    else:
        r = np.random.normal(R, width, num)
    phi = np.linspace(0, 2. * np.pi, len(r))
    x = r * np.sin(phi)
    y = r * np.cos(phi)
    return x, y
fig, ax = plt.subplots(ncols=2, figsize=(9, 4))
ax[0].set_title("uniform")
x, y = dist()
ax[0].plot(x, y, linestyle="", marker="o", markersize=2)
x, y = dist(0, 1.2, 400)
ax[0].plot(x, y, linestyle="", marker="o", markersize=2)
ax[1].set_title("normal")
x, y = dist(4, 0.4, uniform=False)
ax[1].plot(x, y, linestyle="", marker="o", markersize=2)
x, y = dist(0, 0.6, uniform=False)
ax[1].plot(x, y, linestyle="", marker="o", markersize=2)
for a in ax:
    a.set_aspect("equal")
plt.show()
You can easily generate random numbers with some distribution centered on a point, for example a normal distribution centered on (0, 0):
x = np.random.normal(size=1000)
y = np.random.normal(size=1000)
plt.plot(x, y, 'o', alpha=0.6)
EDIT:
What we do is generate random points in polar coordinates. First we draw a random angle (between 0 and 2π), and then we add noise to the radius by multiplying by some random number.
n = 300
theta_out = np.random.uniform(low=0, high=2*np.pi, size=n)
noise_out = np.random.uniform(low=0.9, high=1.1, size=n)
x_out = np.cos(theta_out) * noise_out
y_out = np.sin(theta_out) * noise_out
theta_in = np.random.uniform(low=0, high=2*np.pi, size=n)
noise_in = np.random.uniform(low=0, high=0.5, size=n)
x_in = np.cos(theta_in) * noise_in
y_in = np.sin(theta_in) * noise_in
ax = plt.gca()
ax.set_aspect('equal')
plt.plot(x_out, y_out, 'o')
plt.plot(x_in, y_in, 'o')
Note that the density of points increases as the radius decreases.
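If a uniform density over the inner disk is wanted instead, a common trick (a sketch, not part of the original answer; assumes numpy as np as above) is to take the square root of a uniform sample for the radius, which compensates for the area growing with the radius:

n = 300
theta_in = np.random.uniform(low=0, high=2 * np.pi, size=n)
# the square root makes the points uniformly dense over the disk of radius 0.5
r_in = 0.5 * np.sqrt(np.random.uniform(low=0, high=1, size=n))
x_in = np.cos(theta_in) * r_in
y_in = np.sin(theta_in) * r_in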
I'm trying to generate a single array that follows an exact gaussian distribution. np.random.normal sort of does this by randomly sampling from a gaussian, but how can I reproduce an exact gaussian given some mean and sigma, so that the array produces a histogram that follows an exact gaussian, not just an approximate gaussian as shown below?
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 10, 1
s = np.random.normal(mu, sigma, 1000)
fig = plt.figure()
ax = plt.axes()
totaln, bbins, patches = ax.hist(s, 10, density=True, histtype='stepfilled', linewidth=1.2)
plt.show()
If you'd like an exact gaussian histogram, don't generate points. You can never get an "exact" gaussian distribution from observed points, simply because you can't have a fraction of a point within a histogram bin.
Instead, plot the curve in the form of a bar graph.
import numpy as np
import matplotlib.pyplot as plt
def gaussian(x, mean, std):
    scale = 1.0 / (std * np.sqrt(2 * np.pi))
    return scale * np.exp(-(x - mean)**2 / (2 * std**2))

mean, std = 2.0, 5.0
nbins = 30
npoints = 1000
x = np.linspace(mean - 3 * std, mean + 3 * std, nbins + 1)
centers = np.vstack([x[:-1], x[1:]]).mean(axis=0)
binwidth = x[1] - x[0]
y = npoints * binwidth * gaussian(centers, mean, std)  # expected number of points per bin
fig, ax = plt.subplots()
ax.bar(x[:-1], y, width=np.diff(x), align='edge', color='lightblue')
# Optional...
ax.margins(0.05)
ax.set_ylim(bottom=0)
plt.show()
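If an actual array of sample values is needed rather than just the plotted curve, a sketch of a common approximation (an addition, assuming scipy is available) is to pass evenly spaced probabilities through the inverse CDF; the resulting array fills the bins as regularly as floating-point values allow:

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

mu, sigma = 10, 1
n = 1000
# evenly spaced quantiles, offset by half a step to avoid 0 and 1 where ppf is infinite
s = st.norm.ppf((np.arange(n) + 0.5) / n, loc=mu, scale=sigma)

fig, ax = plt.subplots()
ax.hist(s, 10, density=True, histtype='stepfilled', linewidth=1.2)
plt.show()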