Sampling from bivariate normal in python

I'm trying to create two random variables which are correlated with one another, and I believe the best way is to draw from a bivariate normal distribution with given parameters (open to other ideas). The uncorrelated version looks like this:
import numpy as np
sigma = np.random.uniform(.2, .3, 80)
theta = np.random.uniform( 0, .5, 80)
However, for each one of the 80 draws, I want the sigma value to be related to the theta value. Any thoughts?

Use the built-in numpy.random.multivariate_normal: http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multivariate_normal.html
>>> import numpy as np
>>> mymeans = [13,5]
>>> # stdevs = sqrt(5), sqrt(2)
>>> # corr = .3 / (sqrt(5)*sqrt(2)) ≈ .095
>>> mycov = [[5,.3], [.3,2]]
>>> np.cov(np.random.multivariate_normal(mymeans,mycov,500000).T)
array([[ 4.99449936,  0.30506976],
       [ 0.30506976,  2.00213264]])
>>> np.corrcoef(np.random.multivariate_normal(mymeans,mycov,500000).T)
array([[ 1.        ,  0.09629313],
       [ 0.09629313,  1.        ]])
As shown, things get a little hairier if you have to adjust for non-unit variances.
More reference: http://www.riskglossary.com/link/correlation.htm
To be real-world meaningful, the covariance matrix must be symmetric and positive semidefinite (and positive definite if it needs to be invertible). Particular anti-correlation structures might not be possible.
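For the original question, a minimal sketch of the same idea applied to sigma and theta might look like the following (the means, standard deviations, and the 0.5 correlation are made-up illustration values, not something implied by the question):
import numpy as np

# Assumed example parameters, roughly matching the uniform ranges in the question.
mean = [0.25, 0.25]           # mean of sigma, mean of theta
std = np.array([0.03, 0.15])  # std of sigma, std of theta
rho = 0.5                     # desired correlation between sigma and theta

# Build the 2x2 covariance matrix from the stds and the correlation.
cov = np.array([[std[0]**2,         rho*std[0]*std[1]],
                [rho*std[0]*std[1], std[1]**2        ]])

draws = np.random.multivariate_normal(mean, cov, size=80)
sigma, theta = draws[:, 0], draws[:, 1]
Note that the result is Gaussian, not uniform, so individual values can fall outside the original [.2, .3] and [0, .5] ranges.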

multivariate_normal from scipy.stats can also be used. Suppose we create random variables x and y:
from scipy.stats import multivariate_normal
rv_mean = [0, 1] # mean of x and y
rv_cov = [[1.0,0.5], [0.5,2.0]] # covariance matrix of x and y
rv = multivariate_normal.rvs(rv_mean, rv_cov, size=10000)
You have x from rv[:,0] and y from rv[:,1]. Correlation coefficients can be obtained from
import numpy as np
np.corrcoef(rv.T)

Related

Euclidean distance between two points using a vectorized approach

I have two large numpy arrays for which I want to calculate the Euclidean distance using sklearn. The following MRE achieves what I want in the final result, but since my real-life usage is large, I really want a vectorized solution as opposed to using a for loop.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
n = 3
sample_size = 5
X = np.random.randint(0, 10, size=(sample_size, n))
Y = np.random.randint(0, 10, size=(sample_size, n))
lst = []
for f in range(0, sample_size):
    ed = euclidean_distances([X[f]], [Y[f]])
    lst.append(ed[0][0])
print(lst)
euclidean_distances computes the distance for each combination of X,Y points; this will grow large in memory and is totally unnecessary if you just want the distance between each respective row. Sklearn includes a different function called paired_distances that does what you want:
from sklearn.metrics.pairwise import paired_distances
d = paired_distances(X,Y)
# array([5.83095189, 9.94987437, 7.34846923, 5.47722558, 4. ])
If you need the full pairwise distances, you can get the same result from the diagonal (as pointed out in the comments):
d = euclidean_distances(X,Y).diagonal()
Lastly: the arrays are a numpy type, so it is useful to know the numpy API itself (probably what sklearn calls under the hood). Here are two equivalent examples:
d = np.linalg.norm(X-Y, axis=1)
d = np.sqrt(np.sum((X-Y)**2, axis=1))
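As a quick sanity check (assuming the X and Y arrays from the question), all of these approaches should agree:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, paired_distances

# X and Y as in the question: paired rows of random points.
X = np.random.randint(0, 10, size=(5, 3))
Y = np.random.randint(0, 10, size=(5, 3))

d_paired = paired_distances(X, Y)
d_diag = euclidean_distances(X, Y).diagonal()
d_numpy = np.linalg.norm(X - Y, axis=1)

# All three give the same row-wise distances.
assert np.allclose(d_paired, d_diag)
assert np.allclose(d_paired, d_numpy)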

Python scipy.sparse: how to efficiently set a set of entries to 0?

Let a be a big scipy.sparse matrix and IJ={(i0,j0),(i1,j1),...} a set of positions. How can I efficiently set all the entries in a in positions IJ to 0? Something like a[IJ]=0.
In Mathematica, I would create a new sparse matrix b with background value 1 (instead of 0) and all entries in IJ. Then, I would use a=a*b (entry-wise multiplication). That does not seem to be an option here.
A toy example:
import scipy.sparse as sp
import numpy as np
np.set_printoptions(linewidth=200,edgeitems=5,precision=4)
m = n = 10**1
a = sp.random(m, n, 4/m, format='csr'); print(a.toarray())
IJ = np.array([range(0, n, 2), range(0, n, 2)]); print(IJ)  # every second diagonal
You are almost there. To go by your definitions, all you'd need to do is:
a[IJ[0],IJ[1]] = 0
Note that scipy will warn you:
SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
You can read more about that here.
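If that warning matters for your use case, one possible pattern (just a sketch, not the only way) is to do the assignment in LIL format and then convert back, dropping any explicitly stored zeros:
# Continuing the toy example: a is CSR, IJ holds the row/column indices to zero out.
b = a.tolil()            # LIL supports cheap changes to the sparsity structure
b[IJ[0], IJ[1]] = 0
b = b.tocsr()
b.eliminate_zeros()      # make sure no explicit zeros remain stored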
The scipy sparse matrices can't have a non-zero background value. While it is possible to make a "sparse" matrix with lots of non-zero values, the performance (speed & memory) would be far worse than just using a dense matrix.
A possible work-around is to rewrite every sparse matrix to have a default value of zero. For example, if matrix Y' contains mostly ones, I can replace Y' by I - Y, where Y = I - Y' and I is the identity matrix.
import scipy.sparse as sp
import numpy as np
size = (100, 100)
x = np.random.uniform(-1, 1, size=size)
y = sp.random(*size, 0.001, format='csr')
# Z = (I - Y)X = X - YX
z = x - y.multiply(x)
# A = X(I - Y) = X - XY = X - transpose(YX)
a = x - y.multiply(x).T

Calculate state transition matrix in python

Is there a direct way to calculate the state transition matrix (i.e. e^(A*t), where A is a matrix)?
I planned to calculate it symbolically, via the inverse Laplace transform of (sI - A)^-1, but failed. And if I directly calculate A*t first and then use expm(), it still cannot work, since expm() does not accept a symbolic variable.
I hope I have illustrated my problem clearly :)
EDIT: Here is the code I think should be useful to solve my problem:
import numpy as np
import sympy
import scipy.linalg
from scipy.integrate import quad
Ts=0.02
s=sympy.symbols('s')
t=sympy.symbols('t')
T0 = np.matrix([[1, 0, 0],
                [0, 1, 0],
                [0, -1, 1]])
M0 = np.matrix([[1.735, 0.15851, 0.042262],
                [0.123728, 0.07019322, 0.02070838],
                [0.042262, 0.0243628, 0.014375212]])
F0 = np.matrix([[-22.915, 0, 0],
                [0, -0.00969, 0.00264],
                [0, 0.00264, -0.00264]])
N0 = np.matrix([[0, 0, 0],
                [0, 1.553398, 0],
                [0, 0, 0.4141676]])
G0 = np.matrix([[11.887], [0], [0]])
Ky=np.matrix([1.0121,4.5728,6.3652,0.9117,1.5246,0.9989])
A21=T0*(M0.I)*N0*(T0.I)
A22=T0*(M0.I)*F0*(T0.I)
Z=np.zeros((3,3))
Y=(np.matrix([0,0,0])).T
by1=np.row_stack((Z,A21))
by2=np.row_stack((np.identity(3),A22))
A=np.column_stack((by1,by2))
G=scipy.linalg.expm(A*Ts)
B2=T0*(M0.I)*G0
B=np.row_stack((Y,B2))
S1=sympy.Matrix((s*np.identity(6))-A)
S2=S1.inv()
S=S2
for (i, j), orinm in np.ndenumerate(S2):
    S[i, j] = sympy.inverse_laplace_transform(orinm, s, t)
# integral
H = np.zeros(S2.shape, dtype=float)
for (i, j), func_sympy in np.ndenumerate(S2):
    func = sympy.lambdify(t, func_sympy, 'math')
    H[i, j] = quad(func, 0, 0.02)[0]
print(H)
You can directly calculate the matrix exponential using scipy.
import numpy as np
from scipy.linalg import expm
A = np.random.random((10, 10))
exp_A = expm(A)
The documentation for this is here. It uses the Padé approximation.
Here is an example using the 2x2 identity matrix.
>>> expm(np.eye(2))
array([[2.71828183, 0.        ],
       [0.        , 2.71828183]])
If you need the matrix exponential of a symbolic matrix (as per your comment) then you can do this with Sympy:
import sympy
t = sympy.symbols('t')
A = sympy.Matrix([[t, 0], [0, t]])
>>> sympy.exp(A)
Matrix([
[exp(t),      0],
[     0, exp(t)]])
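For the discretization step in the question (the integral evaluated with quad), one common alternative is to get both expm(A*Ts) and the integral of expm(A*tau)*B from a single matrix exponential of an augmented matrix. This is only a sketch: A, B, and Ts below are small placeholders, not the matrices from the question.
import numpy as np
from scipy.linalg import expm

# Placeholder system; substitute the 6x6 A, the input column B, and Ts from the question.
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
B = np.array([[0.0],
              [1.0]])
Ts = 0.02

n, m = A.shape[0], B.shape[1]
# expm([[A, B], [0, 0]] * Ts) = [[Phi, Gamma], [0, I]],
# where Phi = expm(A*Ts) and Gamma = integral_0^Ts expm(A*tau) dtau @ B.
M = np.zeros((n + m, n + m))
M[:n, :n] = A
M[:n, n:] = B
EM = expm(M * Ts)
Phi = EM[:n, :n]      # state transition matrix over one sample period
Gamma = EM[:n, n:]    # discretized input matrix (the integral term applied to B)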

Numpy statistics, creating an array with certain statistics properties

If I have some numpy array, I can measure its mean, median, standard deviation, and so on with numpy routines, http://docs.scipy.org/doc/numpy/reference/routines.statistics.html
For example, for array arr, I would run
import numpy as np
print(np.mean(arr))    # prints the mean
print(np.median(arr))  # prints the median
However, for my purposes, instead of measuring the statistical properties after an array is created, I would like to create an array whose data have specific statistical properties.
So, for example, I would like to create an array of shape (1000,) with mean 2.5, variance 10, and data points that are i.i.d. Gaussian draws, etc.
How could one do this with numpy?
You can use numpy.random.randn(size) which gives you normal(0,1) samples of length size. So multiply by the standard deviation and add the mean:
import numpy as np
m = 2.5
std = np.sqrt(10)
v = m + std*np.random.randn(1000)
print(np.mean(v))  # 2.43375955445
print(np.var(v))   # 9.9049376296
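The same draw can be written with the newer Generator API (np.random.default_rng), which takes the mean and standard deviation directly; a quick sketch of the equivalent call:
import numpy as np

rng = np.random.default_rng()
# loc = mean, scale = standard deviation (sqrt of the desired variance 10)
v = rng.normal(loc=2.5, scale=np.sqrt(10), size=1000)
print(v.mean(), v.var())  # approximately 2.5 and 10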
Yes, you can do this with the numpy library:
>>> import numpy as np
>>> import math
>>> mean = 2.5
>>> deviation = math.sqrt(10)
>>> s = np.random.normal(mean, deviation, 1000)
This gives you an array of 1000 data points with mean 2.5 and variance 10.
For more information, see http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html

Out of memory when using numpy's multivariate_normal random sampling

I tried to use numpy.random.multivariate_normal to draw random samples over some 30000+ variables, but it always consumed all of my memory (32G) and then terminated. Actually, the correlation is spherical and every variable is correlated with only about 2500 other variables. Is there another way to specify the spherical covariance matrix, rather than the full covariance matrix, or any other way to reduce memory usage?
My code is like this:
cm = []  # covariance matrix
for i in range(width*height):
    cm.append([])
    for j in range(width*height):
        cm[i].append(corr_calc())  # corr is inversely proportional to the distance
mean = [vth]*(width*height)
cache_vth = numpy.random.multivariate_normal(mean, cm)
If your correlation is spherical, that is the same as saying that the value along each dimension is uncorrelated with the other dimensions, and that the variance along every dimension is the same. You don't need to build the covariance matrix at all: drawing one sample from your 30,000-D multivariate normal is the same as drawing 30,000 samples from a 1-D normal. That is, instead of doing:
n = 30000
mu = 0
corr = 1
cm = np.eye(n) * corr
mean = np.ones((n,)) * mu
np.random.multivariate_normal(mean, cm)
That fails when trying to build the cm array. Instead, try the following:
n = 30000
mu = 0
corr = 1
>>> np.random.normal(mu, corr, size=n)
array([ 0.88433649, -0.55460098, -0.74259886, ...,  0.66459841,
        0.71225572,  1.04012445])
If you want more than one random sample, say 3, try
>>> np.random.normal(mu, corr, size=(3, n))
array([[-0.97458499,  0.05072532, -0.0759601 , ..., -0.31849315,
        -2.17552787, -0.36884723],
       [ 1.5116701 ,  2.53383547,  1.99921923, ..., -1.2769304 ,
         0.36912488,  0.3024549 ],
       [-1.12615267,  0.78125589,  0.67133243, ..., -0.45441239,
        -1.21083007,  1.45696714]])
