I am new to Python. In the following code, I would like to plot a bell curve to show how the data follows a normal distribution. How would I go about it? Also, can anyone explain why, when showing the hist, I have values on the x-axis greater than 100? I would assume that by defining the Randels to 100, it would not show anything above it. If I am not mistaken, the x-axis represents what "floor" I am on and the y-axis represents how many observations matched that floor. By the way, this is a DataCamp project.
"""
Let's say I roll a dice to determine if I go up or down a step in a building with
100 floors (1 step = 1 floor). If the dice is 2 or less, I go down a step. If
the dice is 3 to 5, I go up a step, and if the dice is 6, I go up a number of
steps given by another random integer between 1 and 6. What is the probability
I will be higher than floor 60?
"""
import numpy as np
import matplotlib.pyplot as plt
# Set the seed
np.random.seed(123)
# Simulate random walk
all_walks = []
for i in range(1000):
    random_walk = [0]
    for x in range(100):
        step = random_walk[-1]
        dice = np.random.randint(1, 7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1, 7)
        if np.random.rand() <= 0.001:  # There's a 0.1% chance I fall and have to start at 0
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)
# Create and plot np_aw_t
np_aw_t = np.transpose(np.array(all_walks))
# Select last row from np_aw_t: ends
ends = np_aw_t[-1,:]
# Plot histogram of ends, display plot
plt.style.use('fivethirtyeight')
plt.hist(ends, bins=10, edgecolor='k', alpha=0.65)
plt.xlabel("Floor")
plt.ylabel("# of times in floor")
plt.show()
You can use scipy.stats.norm to get a normal distribution (see its documentation). To fit any function to a data set you can use scipy.optimize.curve_fit() (see its documentation as well). My suggestion would be something like the following:
import scipy.stats as ss
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt
#Making a figure with two y-axes (one for the hist, one for the pdf)
#An alternative would be to multiply the pdf by the sum of counts if you just want to show the fit.
fig, ax = plt.subplots(1,1)
twinx = ax.twinx()
rands = ss.norm.rvs(loc = 1, scale = 1, size = 1000)
#hist returns the bins and the value of each bin, plot to the y-axis ax
hist = ax.hist(rands)
vals, bins = hist[0], hist[1]
#calculating the center of each bin
bin_centers = [(bins[i] + bins[i+1])/2 for i in range(len(bins)-1)]
#finding the best fit coefficients, note vals/sum(vals) to get the probability in each bin instead of the count
coeff, cov = opt.curve_fit(ss.norm.pdf, bin_centers, vals/sum(vals), p0 = [0,1] )
#loc and scale are the mean and standard deviation, I believe
loc, scale = coeff
#x-values to plot the normal distribution curve
x = np.linspace(min(bins), max(bins), 100)
#Evaluating the pdf with the best fit mean and std
p = ss.norm.pdf(x, loc = loc, scale = scale)
#plot the pdf to the other axis and show
twinx.plot(x,p)
plt.show()
There are likely more elegant ways to do this, but if you are new to Python and are going to use it for calculations and such, getting to know curve_fit and scipy.stats is recommended. I'm not sure I understand what you mean by "defining the Randels"; hist will plot a "standard" histogram with bins on the x-axis and the count in each bin on the y-axis. When using these counts to fit a pdf, we can just divide all the counts by the total number of counts.
Hope that helps, just ask if anything is unclear :)
Edit: compact version
vals, bins,_ = ax.hist(my_histogram_data)
bin_centers = [(bins[i] + bins[i+1])/2 for i in range(len(bins)-1)]
coeff, cov = opt.curve_fit(ss.norm.pdf, bin_centers, vals/sum(vals), p0 = [0,1] )
x = np.linspace(min(bins), max(bins), 100)
p = ss.norm.pdf(x, loc = coeff[0], scale = coeff[1])
#p is now the fitted normal distribution
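If you don't actually need curve_fit, a simpler route is to estimate loc and scale directly from your data with ss.norm.fit and overlay the scaled pdf on the histogram. A minimal sketch, assuming ends is the array of final floors from your simulation:
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
#plot the counts and keep the bin edges
vals, bins, _ = plt.hist(ends, bins=10, edgecolor='k', alpha=0.65)
#estimate mean and standard deviation directly from the data
loc, scale = ss.norm.fit(ends)
#scale the pdf by (number of observations * bin width) so it is comparable to the counts
x = np.linspace(min(bins), max(bins), 100)
bin_width = bins[1] - bins[0]
plt.plot(x, len(ends) * bin_width * ss.norm.pdf(x, loc=loc, scale=scale))
plt.show()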
Let us assume a loop like below:
import numpy as np
ax = []; ay = []
for n in range(N):
    avgC = np.zeros(M)
    for m in range(M):
        ...
        Cost = aFuncation
        avgC[m] = Cost
    ax.append(n); ay.append(np.mean(avgC))
I would like to use ax and ay to plot a live time series which shows how np.mean(avgC) evolves over different iterations of n. At the same time, I would like to plot the standard deviation of the distribution according to avgC (a figure like the example below).
First, you should think about what the term "confidence interval" actually means in your case. To construct a confidence interval, you must specify for which quantity you construct it, and you should give more background information on how the values are distributed in your case. I assume for now that your "Cost" values are normally distributed and that you want the mean and standard deviation of the distribution plotted at each point n. Note that this is not the confidence interval on the mean. If you are unsure about this, you should probably edit your question and include more detailed information on the statistical properties of your investigation.
That being said, with this code you can plot the mean and a standard deviation band at each point n:
import numpy as np
import matplotlib.pyplot as plt
N = 25
M = 10
def aFuncation(x):
    return np.random.normal(100*np.exp(-x), 10.0)

ax = np.zeros(N)
ay = np.zeros(N)
astd = np.zeros(N)

for n in range(N):
    avgC = np.zeros(M)
    for m in range(M):
        Cost = aFuncation(n)
        avgC[m] = Cost
    ax[n] = n
    ay[n] = np.mean(avgC)
    astd[n] = np.std(avgC)
plt.fill_between(ax, ay-astd, ay+astd, alpha=0.3, color='black')
plt.plot(ax,ay,color='red')
plt.show()
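If what you actually want is a confidence interval on the mean rather than a standard-deviation band, one common choice is to use the standard error of the mean. A rough sketch, assuming the ax, ay, and astd arrays from above and an approximate 95% normal interval:
# Approximate 95% confidence interval on the mean: mean +/- 1.96 * std / sqrt(M)
sem = astd / np.sqrt(M)
plt.fill_between(ax, ay - 1.96*sem, ay + 1.96*sem, alpha=0.3, color='black')
plt.plot(ax, ay, color='red')
plt.show()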
I'm running a model of particles, and I want to have initial conditions for the particle locations mimicking a gaussian distribution.
If I have N number of particles on 1D grid from -10 to 10, I want them to be distributed on the grid according to a gaussian with a known mean and standard deviation. It's basically creating a histogram where each bin width is 1 (the x-axis of locations resolution is 1), and the frequency of each bin should be how many particles are in it, which should all add up to N.
My strategy was to plot a gaussian function on the x-axis grid, and then just approximate the value of each point for the number of particles:
def gaussian(x, mu, sig):
    return 1./(np.sqrt(2.*np.pi)*sig)*np.exp(-np.power((x - mu)/sig, 2.)/2)
mean = 0
sigma = 1
x_values = np.arange(-10, 10, 1)
y = gaussian(x_values, mean, sigma)
However, I have normalization issues (the sum doesn't add up to N), and the number of particles at each point should be an integer (I thought about converting the y array to integers, but because of the normalization issue I get a flat line).
Usually, the problem is fitting a gaussian to histogram, but in my case, I need to do the reverse - and I couldn't find a solution for it yet. I will appreciate any help!
Thank you!!!
You can use numpy.random.normal to sample this distribution. You can get N points inside the range (-10, 10) that follow a Gaussian distribution with the following code.
import numpy as np
import matplotlib.pyplot as plt
N = 10000
mean = 5
sigma = 3
bin_edges = np.arange(-10, 11, 1)
x_values = (bin_edges[1:] + bin_edges[:-1]) / 2
points = np.random.normal(mean, sigma, N * 10)
mask = np.logical_and(points < 10, points > -10)
points = points[mask] # drop points outside range
points = points[:N] # only use the first N points
y, _ = np.histogram(points, bins=bin_edges)
plt.scatter(x_values, y)
plt.show()
The idea is to generate a lot of random numbers (10 N in the code) and ignore the points outside your desired range.
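If you would rather keep a deterministic assignment (closer to your original strategy), one option is to turn the Gaussian into per-bin probabilities with the CDF, scale by N, and round to integers. A sketch along those lines, reusing bin_edges, x_values, mean, sigma, and N from above; note the rounded counts may not sum to exactly N, so a small correction is applied to the largest bin:
from scipy.stats import norm
# probability mass in each unit-wide bin, from the normal CDF
probs = norm.cdf(bin_edges[1:], mean, sigma) - norm.cdf(bin_edges[:-1], mean, sigma)
counts = np.rint(probs / probs.sum() * N).astype(int)
# fix any rounding mismatch so the counts add up to exactly N
counts[np.argmax(counts)] += N - counts.sum()
plt.bar(x_values, counts, width=1.0, edgecolor='k')
plt.show()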
I'm using the normal distribution from scipy.stats and having a hard time understanding its documentation. Let's say I have a normal distribution with a mean of 5 and a standard deviation of 0.25:
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm
mean = 5
std = 0.25
x = np.linspace(mean - 3*std, mean + 3*std, 1000)
y = norm(loc=mean, scale=std).pdf(x)
plt.plot(x,y)
The resulting chart is the familiar bell curve, but with its peak at around 1.6. How can the probability of any value exceed 1? If I multiply it by scale, the probabilities are correct.
There is no such problem when std (and scale) are greater than 1, however:
mean = 5
std = 10
x = np.linspace(mean - 3*std, mean + 3*std, 1000)
y = norm(loc=mean, scale=std).pdf(x)
plt.plot(x,y)
The documentation for norm says loc is the mean and scale is the standard deviation. Why does it behave so differently when scale is less than 1 versus greater than 1?
Python 3.8.2. Scipy 1.4.1
The "bell curve" you are plotting is a probability density function (PDF). This means that the probability for a random variable with that distribution falling in any interval [a, b] is the area under the curve between a and b. Thus the whole area under the curve (from -infinity to +infinity) must be 1. So when the standard deviation is small, the maximum of the PDF may well be greater than 1, there is nothing strange about that.
Follow-up question: Is the area under the curve in the first plot really 1?
Yes, it is. One way to confirm this is to approximate the area under the curve by calculating the total area of a series of rectangles whose heights are defined by the curve:
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm
import matplotlib.patches as patches
mean = 5
std = 0.25
x = np.linspace(4, 6, 1000)
y = norm(loc=mean, scale=std).pdf(x)
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_aspect('equal')
ax.set_xlim([4, 6])
ax.set_ylim([0, 1.7])
# Approximate area under the curve by summing over rectangles:
xlim_approx = [4, 6] # locations of left- and rightmost rectangle
n_approx = 17 # number of rectangles
# width of one rectangle:
width_approx = (xlim_approx[1] - xlim_approx[0]) / n_approx
# x-locations of rectangles:
x_approx = np.linspace(xlim_approx[0], xlim_approx[1], n_approx)
# heights of rectangles:
y_approx = norm(loc=mean, scale=std).pdf(x_approx)
# plot approximation rectangles:
for i, xi in enumerate(x_approx):
    ax.add_patch(patches.Rectangle((xi - width_approx/2, 0), width_approx,
                                   y_approx[i], facecolor='gray', alpha=.3))
# areas of the rectangles:
areas = y_approx * width_approx
# total area of the rectangles:
print(sum(areas))
0.9411599204607589
Okay, that's not quite 1, but let's get a better approximation by extending the x-limits and increasing the number of rectangles:
xlim_approx = [0, 10]
n_approx = 100_000
width_approx = (xlim_approx[1] - xlim_approx[0]) / n_approx
x_approx = np.linspace(xlim_approx[0], xlim_approx[1], n_approx)
y_approx = norm(loc=mean, scale=std).pdf(x_approx)
areas = y_approx * width_approx
print(sum(areas))
0.9999899999999875
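A quicker sanity check, without rectangles, is to use the distribution's CDF, which integrates the PDF directly; a small sketch (the bounds 4 and 6 are just an illustration):
from scipy.stats import norm
dist = norm(loc=5, scale=0.25)
# probability mass between 4 and 6, essentially 1 (it is within 4 standard deviations of the mean)
print(dist.cdf(6) - dist.cdf(4))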
I am having trouble with my Digital Signal Processing homework. Using Python, I need to create a function that is able to determine the frequency of a sinusoid. I am given random frequencies from 0-4000 Hz with Fs=8000. Can someone please help?
import numpy as np
import matplotlib.pyplot as plt

def freqfinder(signal):
    """REPLACE"""
    x = np.fft.fft(signal)
    x = np.abs(x)
    x = np.max(x)
    return x

t = np.linspace(0, 2*np.pi, 8*8000)
y = np.sin(2*t)
print(freqfinder(y))
z = np.fft.fft(y)
zz = np.abs(z)
plt.plot(zz)
I tried this as a test for the fft.
Your code is off to a good start. A few things to note:
You should only look at the first half of your FFT -- For a REAL input, the output is symmetric around 0 and you only care about the frequencies greater than 0 (the first half of the fft output).
You want the magnitude of each frequency - so you should then take the absolute value of the resulting fft.
The max you are locating is NOT the frequency, but is related to the index of the frequency. It is the strength of the strongest frequency.
Here is a little script demonstrating these ideas:
import numpy as np
import matplotlib.pyplot as plt
fs = 8000
t = np.linspace(0, 2*np.pi, fs)
freqs = [ 2, 152, 423, 2423, 3541] # Frequencies to test
amps = [0.5, 0.5, 1.0, 0.8, 0.3] # Amplitude for each freq
y = np.zeros(len(t))
for freq, amp in zip(freqs, amps):
    y += amp*np.sin(freq*t)
fig, ax = plt.subplots(1, 2)
ax = ax.flatten()
ax[0].plot(t, y)
ax[0].set_title("Original signal")
y_fft = np.fft.fft(y) # Original FFT
y_fft = y_fft[:round(len(t)/2)] # First half ( pos freqs )
y_fft = np.abs(y_fft) # Absolute value of magnitudes
y_fft = y_fft/max(y_fft) # Normalized so max = 1
freq_x_axis = np.linspace(0, fs/2, len(y_fft))
ax[1].plot(freq_x_axis, y_fft, "o-")
ax[1].set_title("Frequency magnitudes")
ax[1].set_xlabel("Frequency")
ax[1].set_ylabel("Magnitude")
plt.grid()
plt.tight_layout()
plt.show()
f_loc = np.argmax(y_fft) # Finds the index of the max
f_val = freq_x_axis[f_loc] # The strongest frequency value
print(f"The strongest frequency is f = {f_val}")
The output:
The strongest frequency is f = 423.1057764441111
You can see on the right graph that there is a peak at each of the frequencies we specified in freqs, which is what is expected.
This kind of setup is fine if you only have one frequency you're looking for, but otherwise you may need a peak-finding algorithm to locate the indices of all the frequency peaks in y_fft and then correlate them with the frequencies in freq_x_axis; one option is sketched below.
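A minimal sketch using scipy.signal.find_peaks, assuming the normalized y_fft and freq_x_axis from the script above (the height threshold is just an illustrative value):
from scipy.signal import find_peaks
# indices of all peaks in the normalized spectrum above the threshold
peak_indices, _ = find_peaks(y_fft, height=0.1)
peak_freqs = freq_x_axis[peak_indices]
print(peak_freqs)  # should be close to the frequencies in freqs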
I'm trying to smooth and interpolate some periodic data in Python using scipy.fftpack. I have managed to take the FFT of the data, remove the higher-order frequencies above wn (by doing myfft[wn:-wn] = 0), and then reconstruct a "smoothed" version of the data with ifft(myfft). The array created by the ifft has the same number of points as the original data. How can I use that FFT to create an array with more points?
import numpy as np
from scipy import fftpack as fftp

wn = 3  # cutoff index (example value)

x = [i*2*np.pi/360 for i in range(0, 360, 30)]
data = np.sin(x)
#get fft
myfft = fftp.fft(data)
#kill freqs above wn
myfft[wn:-wn] = 0
#make new series
newdata = fftp.ifft(myfft)
I've also been able to manually recreate the series at the same resolution as demonstrated here
Recreating time series data using FFT results without using ifft
but when I tried upping the resolution of the x-values array it didn't give me the right answer either.
Thanks in advance
Niall
What np.fft.fft returns has the DC component at position 0, followed by all positive frequencies, then the Nyquist frequency (only if the number of elements is even), then the negative frequencies in reverse order. So to add more resolution you could add zeros at both sides of the Nyquist frequency:
import numpy as np
import matplotlib.pyplot as plt
y = np.sin(np.linspace(0, 2*np.pi, 32, endpoint=False))
f = np.fft.fft(y)
n = len(f)
f_ = np.concatenate((f[0:(n+1)//2],
                     np.zeros(n//2),
                     [] if n % 2 != 0 else f[(n+1)//2:(n+3)//2],
                     np.zeros(n//2),
                     f[(n+3)//2:]))
y_ = np.fft.ifft(f_)
plt.plot(y, 'ro')
plt.plot(y_.real, 'bo')  # the imaginary part is numerical noise, so plot the real part
plt.show()
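A caveat worth noting: because np.fft.ifft divides by the new, longer length, the zero-padded reconstruction comes out scaled down by the ratio of the two lengths; if you want the amplitude of y_ to match y, rescale it (a one-line tweak):
# compensate for the length change: ifft divides by len(f_) instead of n
y_ = np.fft.ifft(f_) * (len(f_) / n)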