Live time series with mean and standard deviation of the distribution - python

Let us assume a loop like below:
import numpy as np

ax = []; ay = []
for n in range(N):
    avgC = np.zeros(M)
    for m in range(M):
        ...
        Cost = aFuncation
        avgC[m] = Cost
    ax.append(n); ay.append(np.mean(avgC))
I would like to use ax and ay to plot a live time series which shows how np.mean(avgC) evolves over different iterations of n. At the same time, I would like to plot the standard deviation of the distribution of avgC (a figure like the example below).

First you should think about what the term "confidence interval" actually means in your case. To construct confidence intervals, you must specify for what quantity you construct the confidence interval, and you should give more background information about how the values are distributed in your case. For now I assume that your "Cost" values are normally distributed and that you want the mean and standard deviation of the distribution plotted at each point n. Note that this is not the confidence interval on the mean. If you are unsure about this, you should probably edit your question and include more detailed information on the statistical properties of your investigation.
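If it turns out that a confidence interval on the mean is what you actually want, a minimal sketch of the usual normal-approximation interval for one iteration's samples avgC (M values, as in the question) would be:
# sketch: 95% normal-approximation confidence interval on the mean of one iteration's avgC
sem = np.std(avgC, ddof=1) / np.sqrt(M)   # standard error of the mean
ci_low = np.mean(avgC) - 1.96 * sem
ci_high = np.mean(avgC) + 1.96 * sem
Collecting ci_low/ci_high per iteration and passing them to fill_between gives a band that shrinks as M grows, unlike the standard-deviation band plotted below.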
That being said, with this code you can plot the mean and a standard deviation band at each point n:
import numpy as np
import matplotlib.pyplot as plt

N = 25
M = 10

def aFuncation(x):
    return np.random.normal(100*np.exp(-x), 10.0)

ax = np.zeros(N)
ay = np.zeros(N)
astd = np.zeros(N)

for n in range(N):
    avgC = np.zeros(M)
    for m in range(M):
        Cost = aFuncation(n)
        avgC[m] = Cost
    ax[n] = n
    ay[n] = np.mean(avgC)
    astd[n] = np.std(avgC)

plt.fill_between(ax, ay-astd, ay+astd, alpha=0.3, color='black')
plt.plot(ax, ay, color='red')
plt.show()
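If you need the figure to update while the loop is running (the "live" part of the question), one possible sketch is to redraw inside the loop using matplotlib's interactive mode. Clearing and re-plotting every iteration is not the fastest approach, but it is simple; this assumes the same N, M, aFuncation and preallocated ax/ay/astd arrays as above:
plt.ion()
fig, axis = plt.subplots()
for n in range(N):
    avgC = np.zeros(M)
    for m in range(M):
        avgC[m] = aFuncation(n)
    ax[n] = n
    ay[n] = np.mean(avgC)
    astd[n] = np.std(avgC)
    # redraw the mean and the standard deviation band up to the current n
    axis.clear()
    axis.fill_between(ax[:n+1], ay[:n+1]-astd[:n+1], ay[:n+1]+astd[:n+1],
                      alpha=0.3, color='black')
    axis.plot(ax[:n+1], ay[:n+1], color='red')
    plt.pause(0.01)  # let the GUI process the redraw
plt.ioff()
plt.show()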

Related

Converting gaussian to histogram

I'm running a model of particles, and I want to have initial conditions for the particle locations mimicking a Gaussian distribution.
If I have N particles on a 1D grid from -10 to 10, I want them to be distributed on the grid according to a Gaussian with a known mean and standard deviation. It's basically creating a histogram where each bin width is 1 (the resolution of the location axis is 1), and the frequency of each bin should be the number of particles in it, which should all add up to N.
My strategy was to evaluate a Gaussian function on the x-axis grid, and then use the value at each point as the number of particles:
def gaussian(x, mu, sig):
    return 1./(np.sqrt(2.*np.pi)*sig)*np.exp(-np.power((x - mu)/sig, 2.)/2)

mean = 0
sigma = 1
x_values = np.arange(-10, 10, 1)
y = gaussian(x_values, mean, sigma)
However, I have normalization issues (the sum doesn't add up to N), and the number of particles at each point should be an integer (I thought about converting the y array to integers, but again, because of the normalization issue I get a flat line).
Usually the problem is fitting a Gaussian to a histogram, but in my case I need to do the reverse, and I couldn't find a solution for it yet. I will appreciate any help!
Thank you!!!
You can use numpy.random.normal to sample this distribution. You can get N points inside the range (-10, 10) that follow a Gaussian distribution with the following code.
import numpy as np
import matplotlib.pyplot as plt
N = 10000
mean = 5
sigma = 3
bin_edges = np.arange(-10, 11, 1)
x_values = (bin_edges[1:] + bin_edges[:-1]) / 2
points = np.random.normal(mean, sigma, N * 10)
mask = np.logical_and(points < 10, points > -10)
points = points[mask] # drop points outside range
points = points[:N] # only use the first N points
y, _ = np.histogram(points, bins=bin_edges)
plt.scatter(x_values, y)
plt.show()
The idea is to generate a lot of random numbers (10*N in the code) and ignore the points outside your desired range.
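If you would rather keep the analytic curve from the question instead of sampling, one sketch of a fix for the normalization issue is to scale the pdf by N times the bin width before rounding; the resulting integer counts then sum to approximately N (exact equality is not guaranteed because of rounding, and mass outside (-10, 10) is ignored):
import numpy as np

def gaussian(x, mu, sig):
    return 1./(np.sqrt(2.*np.pi)*sig)*np.exp(-np.power((x - mu)/sig, 2.)/2)

N = 1000
mean, sigma = 0, 1
bin_width = 1
x_values = np.arange(-10, 10, bin_width)
# the pdf integrates to 1, so pdf * N * bin_width approximates the expected count per bin
counts = np.rint(gaussian(x_values, mean, sigma) * N * bin_width).astype(int)
print(counts.sum())  # close to N, up to rounding error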

Plot normal distribution over histogram

I am new to Python and, in the following code, I would like to plot a bell curve to show how the data follows a normal distribution. How would I go about it? Also, can anyone explain why, when showing the hist, I have values (x-axis) greater than 100? I would assume that by defining the Randels to 100, it would not show anything above it. If I am not mistaken, the x-axis represents what "floor" I am on and the y-axis represents how many observations matched that floor. By the way, this is a DataCamp project.
"""
Let's say I roll a dice to determine if I go up or down a step in a building with
100 floors (1 step = 1 floor). If the dice is less than 2, I go down a step. If
the dice is less than or equal to 5, I go up a step, and if the dice is equal to 6,
I go up x steps based on a random integer generator between 1 and 6. What is the probability
I will be higher than floor 60?
"""
import numpy as np
import matplotlib.pyplot as plt
# Set the seed
np.random.seed(123)
# Simulate random walk
all_walks = []
for i in range(1000):
    random_walk = [0]
    for x in range(100):
        step = random_walk[-1]
        dice = np.random.randint(1, 7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1, 7)
        if np.random.rand() <= 0.001:  # There's a 0.1% chance I fall and have to start at 0
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)

# Create and plot np_aw_t
np_aw_t = np.transpose(np.array(all_walks))
# Select last row from np_aw_t: ends
ends = np_aw_t[-1, :]

# Plot histogram of ends, display plot
plt.hist(ends, bins=10, edgecolor='k', alpha=0.65)
plt.style.use('fivethirtyeight')
plt.xlabel("Floor")
plt.ylabel("# of times in floor")
plt.show()
You can use scipy.stats.norm to get a normal distribution. Documentation for it here. To fit any function to a data set you can use scipy.optimize.curve_fit(), documentation for that here. My suggestion would be something like the following:
import scipy.stats as ss
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt
#Making a figure with two y-axis (one for the hist, one for the pdf)
#An alternative would be to multiply the pdf by the sum of counts if you just want to show the fit.
fig, ax = plt.subplots(1,1)
twinx = ax.twinx()
rands = ss.norm.rvs(loc = 1, scale = 1, size = 1000)
#hist returns the bins and the value of each bin, plot to the y-axis ax
hist = ax.hist(rands)
vals, bins = hist[0], hist[1]
#calculating the center of each bin
bin_centers = [(bins[i] + bins[i+1])/2 for i in range(len(bins)-1)]
#finding the best fit coefficients, note vals/sum(vals) to get the probability in each bin instead of the count
coeff, cov = opt.curve_fit(ss.norm.pdf, bin_centers, vals/sum(vals), p0 = [0,1] )
#loc and scale are the mean and standard deviation, I believe
loc, scale = coeff
#x-values to plot the normal distribution curve
x = np.linspace(min(bins), max(bins), 100)
#Evaluating the pdf with the best fit mean and std
p = ss.norm.pdf(x, loc = loc, scale = scale)
#plot the pdf to the other axis and show
twinx.plot(x,p)
plt.show()
There are likely more elegant ways to do this, but if you are new to Python and are going to use it for calculations and such, getting to know curve_fit and scipy.stats is recommended. I'm not sure what you mean by "defining the Randels"; hist will plot a "standard" histogram with bins on the x-axis and the count in each bin on the y-axis. When using these counts to fit a pdf, we can just divide all the counts by the total number of counts.
Hope that helps, just ask if anything is unclear :)
Edit: compact version
vals, bins,_ = ax.hist(my_histogram_data)
bin_centers = [(bins[i] + bins[i+1])/2 for i in range(len(bins)-1)]
coeff, cov = opt.curve_fit(ss.norm.pdf, bin_centers, vals/sum(vals), p0 = [0,1] )
x = np.linspace(min(bins), max(bins), 100)
p = ss.norm.pdf(x, loc = coeff[0], scale = coeff[1])
#p is now the fitted normal distribution
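As a side note, if a least-squares fit is not required, scipy can also estimate the parameters directly from the raw data with ss.norm.fit (maximum likelihood), and the pdf can be scaled to the count axis instead of using a twin axis. A rough sketch, applied to the ends array from the question:
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt

loc, scale = ss.norm.fit(ends)                    # MLE estimates of mean and std from the raw data
counts, bins, _ = plt.hist(ends, bins=10, edgecolor='k', alpha=0.65)
x = np.linspace(bins[0], bins[-1], 100)
bin_width = bins[1] - bins[0]
# multiply the pdf by (number of observations * bin width) so it lines up with the counts
plt.plot(x, ss.norm.pdf(x, loc, scale) * len(ends) * bin_width)
plt.show()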

How to randomly generate continuous functions

My objective is to randomly generate good looking continuous functions, "good looking" meaning functions which can be recovered from their plots.
Essentially I want to generate random time series data for 1 second with 1024 samples per second. If I randomly choose 1024 values, the plot looks very noisy and nothing meaningful can be extracted from it. At the end I have attached plots of two sinusoids, one with a frequency of 3 Hz and another with a frequency of 100 Hz. I consider the 3 Hz cosine a good function because I can extract back the time series by looking at the plot. But the 100 Hz sinusoid is bad for me because I can't recover the time series from the plot. So, in the above-mentioned sense of goodness of a time series, I want to randomly generate good looking continuous functions/time series.
The method I am thinking of using is as follows (in Python):
(1) Choose 32 points on the x-axis between 0 and 1 using x=linspace(0,1,32).
(2) For each of these 32 points choose a random value using y=np.random.rand(32).
(3) Then I need an interpolation or curve-fitting method which takes (x,y) as input and outputs a continuous function, something like func=curve_fit(x,y).
(4) I can then obtain the time series by sampling from the func function.
Following are the questions that I have:
1) What is the best curve-fitting or interpolation method that I can use? It should also be available in Python.
2) Is there a better method to generate good looking functions, without using curve fitting or interpolation?
Edit
Here is the code I am currently using for generating random time series of length 1024. In my case I need to scale the function between 0 and 1 on the y-axis, hence l=0 and h=1 for me. If that scaling is not needed, you just need to uncomment a line in each function to randomize the scaling.
import numpy as np
from scipy import interpolate
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

## Curve fitting technique
def random_poly_fit():
    l = 0
    h = 1
    degree = np.random.randint(2, 11)
    c_points = np.random.randint(2, 32)
    cx = np.linspace(0, 1, c_points)
    cy = np.random.rand(c_points)
    z = np.polyfit(cx, cy, degree)
    f = np.poly1d(z)
    y = f(x)
    # l, h = np.sort(np.random.rand(2))
    y = MinMaxScaler(feature_range=(l, h)).fit_transform(y.reshape(-1, 1)).reshape(-1)
    return y

## Cubic Spline Interpolation technique
def random_cubic_spline():
    l = 0
    h = 1
    c_points = np.random.randint(4, 32)
    cx = np.linspace(0, 1, c_points)
    cy = np.random.rand(c_points)
    z = interpolate.CubicSpline(cx, cy)
    y = z(x)
    # l, h = np.sort(np.random.rand(2))
    y = MinMaxScaler(feature_range=(l, h)).fit_transform(y.reshape(-1, 1)).reshape(-1)
    return y

func_families = [random_poly_fit, random_cubic_spline]
func = np.random.choice(func_families)

x = np.linspace(0, 1, 1024)
y = func()

plt.plot(x, y)
plt.show()
Add sin and cosine signals
from numpy.random import randint

x = np.linspace(0, 1, 1000)
for i in range(10):
    y = randint(0, 100)*np.sin(randint(0, 100)*x) + randint(0, 100)*np.cos(randint(0, 100)*x)
    y = MinMaxScaler(feature_range=(-1, 1)).fit_transform(y.reshape(-1, 1)).reshape(-1)
    plt.plot(x, y)
plt.show()
Output:
Convolve sine and cosine signals
for i in range(10):
    y = np.convolve(randint(0, 100)*np.sin(randint(0, 100)*x),
                    randint(0, 100)*np.cos(randint(0, 100)*x), 'same')
    y = MinMaxScaler(feature_range=(-1, 1)).fit_transform(y.reshape(-1, 1)).reshape(-1)
    plt.plot(x, y)
plt.show()
Output:
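Another possibility, if you don't want curve fitting or interpolation at all, is to synthesize the signal directly from a few low-frequency Fourier components with random amplitudes and phases; the band limit alone guarantees a smooth, recoverable curve. A sketch (the cutoff of 10 cycles per signal is an arbitrary choice, not something from the question):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

n_samples = 1024
max_freq = 10                                  # keep only frequencies up to 10 cycles per signal
spectrum = np.zeros(n_samples//2 + 1, dtype=complex)
# random amplitudes and random phases for the low-frequency bins only
spectrum[1:max_freq+1] = np.random.rand(max_freq) * np.exp(2j*np.pi*np.random.rand(max_freq))
y = np.fft.irfft(spectrum, n=n_samples)
y = MinMaxScaler(feature_range=(0, 1)).fit_transform(y.reshape(-1, 1)).reshape(-1)
plt.plot(np.linspace(0, 1, n_samples), y)
plt.show()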

Beginner Python Monte Carlo Simulation

I'm a beginner at Python and am working through exercises set by our instructor. I am struggling with this question.
In the Python editor, write a Monte Carlo simulation to estimate the value of the number π.
Specifically, follow these steps:
A. Produce two arrays, one called x, one called y, which contain 100 elements each, which are randomly and uniformly distributed real numbers between -1 and 1.
B. Plot y versus x as dots in a plot. Label your axes accordingly.
C. Write down a mathematical expression that defines which (x, y) pairs of data points are located in a circle with radius 1, centred on the (0, 0) origin of the graph.
D. Use Boolean masks to identify the points inside the circle, and overplot them in a different colour and marker size on top of the data points you already plotted in B.
This is what I have at the moment.
import numpy as np
import math
import matplotlib.pyplot as plt

np.random.seed(12345)
x = np.random.uniform(-1, 1, 100)
y = np.random.uniform(-1, 1, 100)
plt.plot(x, y)  # this works

for i in x:
    newarray = (1 > math.sqrt(y[i]*y[i] + x[i]*x[i]))
plt.plot(newarray)
Any suggestions?
As pointed out in the comment, the error in your code is that for i in x should be for i in xrange(len(x)).
If you want to actually use a Boolean mask, as the exercise statement says, you could do something like this:
import pandas as pd
allpoints = pd.DataFrame({'x':x, 'y':y})
# this is your boolean mask
mask = pow(allpoints.x, 2) + pow(allpoints.y, 2) < 1
circlepoints = allpoints[mask]
plt.scatter(allpoints.x, allpoints.y)
plt.scatter(circlepoints.x, circlepoints.y)
Increasing the number of points to 10000, you would get something like this:
To estimate pi you can use the famous Monte Carlo derivation:
>>> n = 10000
>>> (len(circlepoints) * 4) / float(n)
3.1464
You are close to the solution. I slightly reshape your MCVE:
import numpy as np
import math
import matplotlib.pyplot as plt
np.random.seed(12345)
N = 10000
x = np.random.uniform(-1, 1, N)
y = np.random.uniform(-1, 1, N)
Now we compute a criterion that makes sense in this context, such as the squared distance of each point from the origin:
d = x**2 + y**2
Then we use Boolean Indexing to discriminate between points within and outside the Unit Circle:
q = (d <= 1)
At this point lies the Monte Carlo hypothesis: we assume that the ratio of uniformly distributed points falling inside the circle to all points drawn from the square U(-1,1)xU(-1,1) is representative of the ratio of the area of the unit circle to the area of the square. We can then statistically estimate pi = 4*(Ac/As) from the fraction of points inside the circle. This leads to:
pi = 4*q.sum()/q.size # 3.1464
Finally we plot the result:
fig, axe = plt.subplots()
axe.plot(x[q], y[q], '.', color='green', label=r'$d \leq 1$')
axe.plot(x[~q], y[~q], '.', color='red', label=r'$d > 1$')
axe.set_aspect('equal')
axe.set_title(r'Monte Carlo: $\pi$ Estimation')
axe.set_xlabel('$x$')
axe.set_ylabel('$y$')
axe.legend(bbox_to_anchor=(1, 1), loc='upper left')
fig.savefig('MonteCarlo.png', dpi=120)
It outputs:

Fast, elegant way to calculate empirical/sample covariogram

Does anyone know a good method to calculate the empirical/sample covariogram, if possible in Python?
This is a screenshot of a book which contains a good definition of the covariogram:
If I understood it correctly, for a given lag/width h, I'm supposed to get all the pairs of points that are separated by h (or less than h), multiply their values, and for each of these points calculate its mean, which in this case is defined as m(x_i). However, according to the definition of m(x_i), if I want to compute m(x1), I need to obtain the average of the values located within distance h from x1. This looks like a very intensive computation.
First of all, am I understanding this correctly? If so, what is a good way to compute this, assuming a two-dimensional space? I tried to code this in Python (using numpy and pandas), but it takes a couple of seconds and I'm not even sure it is correct, which is why I will refrain from posting that code here. Here is another attempt at a very naive implementation:
from scipy.spatial.distance import pdist, squareform

distances = squareform(pdist(np.array(coordinates)))  # coordinates is an n x 2 array
z = np.array(z)  # z are the values
cutoff = np.max(distances)/3.0  # somewhat arbitrary cutoff
width = cutoff/15.0
widths = np.arange(0, cutoff + width, width)

Z = []
Cov = []
for w in np.arange(len(widths)-1):  # for each width
    # for each pairwise distance
    for i in np.arange(distances.shape[0]):
        for j in np.arange(distances.shape[1]):
            if distances[i, j] <= widths[w+1] and distances[i, j] > widths[w]:
                m1 = []
                m2 = []
                # when a distance is within a given width, calculate the means of
                # the points involved
                for x in np.arange(distances.shape[1]):
                    if distances[i, x] <= widths[w+1] and distances[i, x] > widths[w]:
                        m1.append(z[x])
                for y in np.arange(distances.shape[1]):
                    if distances[j, y] <= widths[w+1] and distances[j, y] > widths[w]:
                        m2.append(z[y])
                mean_m1 = np.array(m1).mean()
                mean_m2 = np.array(m2).mean()
                Z.append(z[i]*z[j] - mean_m1*mean_m2)
    Z_mean = np.array(Z).mean()  # calculate covariogram for width w
    Cov.append(Z_mean)  # collect covariances for all widths
However, now I have confirmed that there is an error in my code. I know that because I used the variogram to calculate the covariogram (covariogram(h) = covariogram(0) - variogram(h)) and I get a different plot:
And it is supposed to look like this:
Finally, if you know a Python/R/MATLAB library to calculate empirical covariograms, let me know. At least, that way I can verify what I did.
One could use scipy.cov, but if one does the calculation directly (which is very easy), there are more ways to speed this up.
First, make some fake data that has some spatial correlations. I'll do this by first making the spatial correlations, and then generating random data points from them, where the data is positioned according to the underlying map and also takes on the values of the underlying map.
Edit 1:
I changed the data point generator so positions are purely random, but z-values are proportional to the spatial map. And I changed the map so that the left and right sides are shifted relative to each other, to create negative correlation at large h.
from numpy import *
import random
import math
import matplotlib.pyplot as plt

S = 1000
N = 900

# first, make some fake data, with correlations on two spatial scales
# density map
x = linspace(0, 2*pi, S)
sx = sin(3*x)*sin(10*x)
density = .8*abs(outer(sx, sx))
density[:, :S//2] += .2

# make a point cloud motivated by this density
random.seed(10)  # so this can be repeated
points = []
while len(points) < N:
    v, ix, iy = random.random(), random.randint(0, S-1), random.randint(0, S-1)
    if True:  # v < density[ix, iy]:
        points.append([ix, iy, density[ix, iy]])
locations = array(points).transpose()
print(locations.shape)

plt.imshow(density, alpha=.3, origin='lower')
plt.plot(locations[1, :], locations[0, :], '.k')
plt.xlim((0, S))
plt.ylim((0, S))
plt.show()

# build these into the main data: all pairs into distances and z0 z1 values
L = locations
m = array([[math.sqrt((L[0, i]-L[0, j])**2 + (L[1, i]-L[1, j])**2), L[2, i], L[2, j]]
           for i in range(N) for j in range(N) if i > j])
Which gives:
The above is just the simulated data, and I made no attempt to optimize its production, etc. I assume this is where the OP starts, with the task below, since the data already exists in a real situation.
Now calculate the "covariogram" (which is much easier than generating the fake data, btw). The idea here is to sort all the pairs and associated values by h, and then index into these using ihvals. That is, summing up to index ihval is the sum over N(h) in the equation, since this includes all pairs with hs below the desired values.
Edit 2:
As suggested in the comments below, N(h) is now only the pairs that are between h-dh and h, rather than all pairs between 0 and h (where dh is the spacing of h-values in ihvals; i.e., S/1000 was used below).
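In other words, for each lag bin the code below estimates (my paraphrase of the quantity being computed, not the book's exact notation):

\hat{C}(h) = \frac{1}{|N(h)|}\sum_{(i,j)\in N(h)} z_i z_j \;-\; \bar{z}_h^{(1)}\,\bar{z}_h^{(2)}

where N(h) is the set of point pairs whose separation falls in the bin ending at h, and \bar{z}_h^{(1)}, \bar{z}_h^{(2)} are the means of the first and second members of those pairs (mnh and mph in the code).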
# now do the real calculations for the covariogram
# sort by h and give clear names
i = argsort(m[:, 0])  # h sorting
h = m[i, 0]
zh = m[i, 1]
zsh = m[i, 2]
zz = zh*zsh

hvals = linspace(0, S, 1000)  # the values of h to use (S should be in the units of distance, here I just used ints)
ihvals = searchsorted(h, hvals)
result = []
for i, ihval in enumerate(ihvals[1:]):
    start, stop = ihvals[i], ihval  # consecutive bin edges: pairs with separation in (hvals[i], hvals[i+1]]
    N = stop - start
    if N > 0:
        mnh = sum(zh[start:stop])/N
        mph = sum(zsh[start:stop])/N
        szz = sum(zz[start:stop])/N
        C = szz - mnh*mph
        result.append([h[ihval], C])
result = array(result)

plt.plot(result[:, 0], result[:, 1])
plt.grid()
plt.show()
which looks reasonable to me, as one can see bumps or troughs at the expected h values, but I haven't done a careful check.
The main speedup here over scipy.cov is that one can precalculate all of the products, zz. Otherwise, one would feed zh and zsh into cov for every new h, and all the products would be recalculated. This calculation could be sped up even more by doing partial sums, i.e., from ihvals[n-1] to ihvals[n] at each step n, but I doubt that will be necessary.
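For completeness, here is a sketch of the partial-sums idea (my addition, not part of the original answer): with the pair arrays already sorted by h, each bin's sums can be read off from cumulative sums, which removes the Python loop over bins entirely. It assumes the zz, zh, zsh, ihvals and hvals arrays defined above.
import numpy as np

# prefix sums with a leading zero, so each bin sum is just a difference of two entries
czz = np.concatenate(([0.], np.cumsum(zz)))
czh = np.concatenate(([0.], np.cumsum(zh)))
czsh = np.concatenate(([0.], np.cumsum(zsh)))

counts = np.diff(ihvals)          # number of pairs falling in each h bin
ok = counts > 0                   # skip empty bins, as the loop above does
C = (np.diff(czz[ihvals])[ok]/counts[ok]
     - (np.diff(czh[ihvals])[ok]/counts[ok]) * (np.diff(czsh[ihvals])[ok]/counts[ok]))

# bin upper edges used as the x-axis, slightly different from h[ihval] in the loop above
plt.plot(hvals[1:][ok], C)
plt.grid()
plt.show()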
