PyMC3 Normal with variance per column

PyMC3 Normal with variance per column - python

I am trying to define a pymc3.Normal variable with the following as mu:
import numpy as np
import pymc3 as pm
mx = np.array([[0.25 , 0.5 , 0.75 , 1. ],
[0.25 , 0.333, 0.25 , 0. ],
[0.25 , 0.167, 0. , 0. ],
[0.25 , 0. , 0. , 0. ]])
epsilon = pm.Gamma('epsilon', alpha=10, beta=10)
p_ = pm.Normal('p_', mu=mx, shape = mx.shape, sd = epsilon)
The problem is that all random variables in p_ get the same std (epsilon). I would like the first row to use epsilon1, the second row epsilon2 etc.
How Can I do that?

One can pass an argument for the shape parameter to achieve this. To demonstrate this, let's make some fake data to pass as observed, where we use fixed values for epsilon that we can compare against the inferred ones.
Example Model
import numpy as np
import pymc3 as pm
import arviz as az
# priors
mu = np.array([[0.25 , 0.5 , 0.75 , 1. ],
[0.25 , 0.333, 0.25 , 0. ],
[0.25 , 0.167, 0. , 0. ],
[0.25 , 0. , 0. , 0. ]])
alpha, beta = 10, 10
# fake data
np.random.seed(2019)
# row vector will use a different sd for each column
sd = np.random.gamma(alpha, 1.0/beta, size=(1,4))
# generate 100 fake observations of the (4,4) random variables
Y = np.random.normal(loc=mu, scale=sd, size=(100,4,4))
# true column sd's
print(sd)
# [[0.90055471 1.24522079 0.85846659 1.19588367]]
# mean sd's per column
print(np.mean(np.std(Y, 0), 0))
# [0.92028042 1.24437592 0.83383181 1.22717313]
# model
with pm.Model() as model:
# use a (1,4) matrix to pool variance by columns
epsilon = pm.Gamma('epsilon', alpha=10, beta=10, shape=(1, mu.shape[1]))
p_ = pm.Normal('p_', mu=mu, sd=epsilon, shape=mu.shape, observed=Y)
trace = pm.sample(random_seed=2019)
This samples well, and gives the following summary
which clearly bound the true values of the standard deviations within the HPDs.

Related

Replace column by 0 based on probability

How to replace column in the numpy array be certain number based on probability, if it is (1,X,X) shape.
I found code to replace rows, but cannot figure out how to modify it, so it is applicable for columns replacement.
grid_example = np.random.rand(1,5,5)
probs = np.random.random((1,5))
grid_example[probs < 0.25] = 0
grid_example
Thanks!

Use:
import numpy as np
rng = np.random.default_rng(42)
grid_example = rng.random((1, 5, 5))
probs = rng.random((1, 5))
grid_example[..., (probs < 0.25).flatten()] = 0
print(grid_example)
Output
[[[0. 0.43887844 0. 0. 0.09417735]
[0. 0.7611397 0. 0. 0.45038594]
[0. 0.92676499 0. 0. 0.4434142 ]
[0. 0.55458479 0. 0. 0.6316644 ]
[0. 0.35452597 0. 0. 0.7783835 ]]]
The notation [..., (probs < 0.25).flatten()] applies the boolean indexing to the last index. More on the documentation.

Pythonic way for double for loop

I have the following code:
import numpy as np
epsilon = np.array([[0. , 0.00172667, 0.00071437, 0.00091779, 0.00154501],
[0.00128983, 0. , 0.00028139, 0.00215905, 0.00094862],
[0.00035811, 0.00018714, 0. , 0.00029365, 0.00036993],
[0.00035631, 0.00112175, 0.00022906, 0. , 0.00291149],
[0.00021527, 0.00017653, 0.00010341, 0.00104458, 0. ]])
Sii = np.array([19998169., 14998140., 9997923., 7798321., 2797958.])
n = len(Sii)
epsilonijSjj = np.zeros((n,n))
for i in range(n):
for j in range(n):
epsilonijSjj[i,j] = epsilon[i][j]*Sii[j]
print (epsilonijSjj)
How can I avoid the double for loop and write the code in a fast Pythonic way?
Thank you in advance

Numpy allow you to multiply 2 arrays directly.
So rather than define a 0 based array and populating it with the altered elements of the other array, you can simply create a copy of the other array and apply the multiplication directly like so:
import numpy as np
epsilon = np.array([[0. , 0.00172667, 0.00071437, 0.00091779, 0.00154501],
[0.00128983, 0. , 0.00028139, 0.00215905, 0.00094862],
[0.00035811, 0.00018714, 0. , 0.00029365, 0.00036993],
[0.00035631, 0.00112175, 0.00022906, 0. , 0.00291149],
[0.00021527, 0.00017653, 0.00010341, 0.00104458, 0. ]])
Sii = np.array([19998169., 14998140., 9997923., 7798321., 2797958.])
epsilonijSjj = epsilon.copy()
epsilonijSjj *= Sii
print(epsilonijSjj)
Output:
[[ 0. 25896.8383938 7142.21625351 7157.22103059
4322.87308958]
[25794.23832127 0. 2813.31555297 16836.96495505
2654.19891796]
[ 7161.54430059 2806.7519196 0. 2289.97696165
1035.04860294]
[ 7125.54759639 16824.163545 2290.12424238 0.
8146.22673742]
[ 4305.00584063 2647.6216542 1033.88521743 8145.97015018
0. ]]
Or, just do this, which is faster because it doesn't require creating a copy of an array:
import numpy as np
epsilon = np.array([[0. , 0.00172667, 0.00071437, 0.00091779, 0.00154501],
[0.00128983, 0. , 0.00028139, 0.00215905, 0.00094862],
[0.00035811, 0.00018714, 0. , 0.00029365, 0.00036993],
[0.00035631, 0.00112175, 0.00022906, 0. , 0.00291149],
[0.00021527, 0.00017653, 0.00010341, 0.00104458, 0. ]])
Sii = np.array([19998169., 14998140., 9997923., 7798321., 2797958.])
epsilonijSjj = epsilon * Sii

Compute the B-spline basis of a Bivariate spline

I need to compute uv queries on a bivariate spline in the B-spline basis. With this answer I have a good function (copied below) that leverages scipy.dfitpack.bispeu to get the results I need.
import numpy as np
import scipy.interpolate as si
def fitpack_bispeu(cv, u, v, count_u, count_v, degree_u, degree_v):
# cv = grid of control vertices
# u,v = list of u,v component queries
# count_u, count_v = grid counts along the u and v directions
# degree_u, degree_v = curve degree along the u and v directions
# Calculate knot vectors for both u and v
tck_u = np.clip(np.arange(count_u+degree_u+1)-degree_u,0,count_u-degree_u) # knot vector in the u direction
tck_v = np.clip(np.arange(count_v+degree_v+1)-degree_v,0,count_v-degree_v) # knot vector in the v direction
# Compute queries
positions = np.empty((u.shape[0], cv.shape[1]))
for i in range(cv.shape[1]):
positions[:, i] = si.dfitpack.bispeu(tck_u, tck_v, cv[:,i], degree_u, degree_v, u, v)[0]
return positions
This function works, but it occurred to me I could get better performance by calculating the bivariate basis ahead of time and then just get my result via a dot product. Here's what i wrote to compute the basis.
def basis_bispeu(cv, u, v, count_u, count_v, degree_u, degree_v):
# Calculate knot vectors for both u and v
tck_u = np.clip(np.arange(count_u+degree_u+1)-degree_u,0,count_u-degree_u) # knot vector in the u direction
tck_v = np.clip(np.arange(count_v+degree_v+1)-degree_v,0,count_v-degree_v) # knot vector in the v direction
# Compute basis for each control vertex
basis = np.empty((u.shape[0], cv.shape[0]))
cv_ = np.identity(len(cv))
for i in range(cv.shape[0]):
basis[:,i] = si.dfitpack.bispeu(tck_u, tck_v, cv_[i], degree_u, degree_v, u, v)[0]
return basis
lets compare and profile with cProfile:
# A test grid of control vertices
cv = np.array([[-0.5 , -0. , 0.5 ],
[-0.5 , -0. , 0.33333333],
[-0.5 , -0. , 0. ],
[-0.5 , 0. , -0.33333333],
[-0.5 , 0. , -0.5 ],
[-0.16666667, 1. , 0.5 ],
[-0.16666667, -0. , 0.33333333],
[-0.16666667, 0.5 , 0. ],
[-0.16666667, 0.5 , -0.33333333],
[-0.16666667, 0. , -0.5 ],
[ 0.16666667, -0. , 0.5 ],
[ 0.16666667, -0. , 0.33333333],
[ 0.16666667, -0. , 0. ],
[ 0.16666667, 0. , -0.33333333],
[ 0.16666667, 0. , -0.5 ],
[ 0.5 , -0. , 0.5 ],
[ 0.5 , -0. , 0.33333333],
[ 0.5 , -0.5 , 0. ],
[ 0.5 , 0. , -0.33333333],
[ 0.5 , 0. , -0.5 ]])
count_u = 4
count_v = 5
degree_u = 3
degree_v = 3
n = 10**6 # make 1 million random queries
u = np.random.random(n) * (count_u-degree_u)
v = np.random.random(n) * (count_v-degree_v)
# get the result from fitpack_bispeu
result_bispeu = fitpack_bispeu(cv,u,v,count_u,count_v,degree_u,degree_v) # 0.482 seconds
# precompute the basis for the same grid
basis = basis_bispeu(cv,u,v,count_u,count_v,degree_u,degree_v) # 2.124 seconds
# get results via dot product
result_basis = np.dot(basis,cv) # 0.028 seconds (17x faster than fitpack_bispeu)
# all close?
print np.allclose(result_basis, result_bispeu) # True
With a 17x speed increase, pre-calculating the basis seems like the way to go, but basis_bispeu is rather slow.
QUESTION
Is there a faster way to compute the basis of a bivariate spline? I know of deBoor's algorithm to compute a similar basis on a curve. Are there similar algorithms for bivariates that once written with numba or cython could yield better performance?
Otherwise can the basis_bispeu function above be improved to compute the basis faster? Maybe there are built in numpy functions I'm not aware of that could help.

Creating random Variables in Python with one third of the array to be zero

I want to create random variables in python and used the following below code
weights = np.random.random(10) but I want to create random variables such that one third of the weights should be zero. Is there any way possible? I have also tried below code but this is not what I want
weights = np.random.random(7)
weights.append(0, 0, 0)

With the clarification that you want the 0's to appear randomly, you can just use shuffle:
weights = np.random.random(7)
weights = np.append(weights,[0, 0, 0])
np.random.shuffle(weights)

One simple way:
>>> import numpy as np
>>>
>>> a = np.clip(np.random.uniform(-0.5, 1, (100,)), 0, np.inf)
>>> a
array([0.39497669, 0.65003362, 0. , 0. , 0. ,
0.75545815, 0.30772786, 0.1805628 , 0. , 0. ,
0. , 0.82527704, 0. , 0.63983682, 0.89283051,
0.25173721, 0.18409163, 0.63631959, 0.59095185, 0. ,
0.85817311, 0. , 0.06769175, 0. , 0.67807471,
0.29805637, 0.03429861, 0.53077809, 0.32317273, 0.52346321,
0.22966515, 0.98175502, 0.54615167, 0. , 0.88853359,
0. , 0.70622272, 0.08106305, 0. , 0.8767082 ,
0.52920044, 0. , 0. , 0.29394736, 0.4097331 ,
0.77977164, 0.62860222, 0. , 0. , 0.14899124,
0.81880283, 0. , 0.1398242 , 0. , 0.50113732,
0. , 0.68872893, 0.15582668, 0. , 0.34789122,
0.18510949, 0.60281713, 0.21097922, 0.77419626, 0.29588479,
0.18890799, 0.9781896 , 0.96220508, 0.52201816, 0.71087763,
0. , 0.43540516, 0.99297503, 0. , 0.69248893,
0.05157044, 0. , 0.75131066, 0. , 0. ,
0.25627591, 0.53367521, 0.58151298, 0.85662171, 0.455367 ,
0. , 0. , 0.21293519, 0.52337335, 0. ,
0.68644488, 0. , 0. , 0.39695189, 0. ,
0.40860821, 0.84549468, 0. , 0.21247807, 0.59054669])
>>> np.count_nonzero(a)
67
It draws uniformly from [-0.5, 1] and then sets everything below zero to zero.

Set Approximately 1/3 of weights
This will guarantee that approximately one third of your weights are 0:
weights = np.random.random(10)/np.random.choice([0,1],10,p=[0.3,0.7])
weights[np.isinf(weights)] = 0
# or
# weights[weights == np.inf] = 0
>>> weights
array([0. , 0.25715864, 0. , 0.80958258, 0.12880619,
0.48781856, 0.52278911, 0.76541417, 0.87736431, 0. ])
What it does is divides about 1/3 of your values by 0, giving you inf, then just replace the inf by 0
Set Exactly 1/3 of weights
Alternatively, if you need it to be exactly 1/3 (or in your case, 3 out of 10), you can replace 1/3 of your weights with 0:
weights = np.random.random(10)
# Replace 3 with however many indices you want changed...
weights[np.random.choice(range(len(weights)),3,replace=False)] = 0
>>> weights
array([0. , 0.36839012, 0. , 0.51468295, 0.45694205,
0.23881473, 0.1223229 , 0.68440171, 0. , 0.15542469])
That selects 3 random indices from weights and replaces them with 0

size = 10
v = np.random.random(size)
v[np.random.randint(0, size, size // 3)] = 0
A little bit more optimized (because random number generation is not "cheap"):
v = np.zeros(size)
nnonzero = size - size // 3
idx = np.random.choice(size, nnonzero, replace=False)
v[idx] = np.random.random(nnonzero)

What about replacing the first third of items with 0 then shuffle it as following
weights = np.random.random(10)
weights[: weights.size / 3] = 0
np.random.shuffle(weights)

speeding up moving time delta with irregular time intervals in a numpy array

I want to calculate the 10 second difference of a dataset where the time increments are irregular.The data exists in 2 1-D arrays of equal length, one for the time, and the other for the data value.
After some poking around I was able to come up with a solution, but it's too slow based on (i suspect) having to iterate through every item in the array.
My general method is to iterate through the time array, and for each time value i find the index of the time value that is x seconds earlier. I then use those indices on the data array to calculate the difference.
The code is shown below.
First, the find_closest function from Bi Rico
def find_closest(A, target):
#A must be sorted
idx = A.searchsorted(target)
idx = np.clip(idx, 1, len(A)-1)
left = A[idx-1]
right = A[idx]
idx -= target - left < right - target
return idx
Which I then use in the following manner
def trailing_diff(time_array,data_array,seconds):
trailing_list=[]
for i in xrange(len(time_array)):
now=time_array[i]
if now<seconds:
trailing_list.append(0)
else:
then=find_closest(time_array,now-seconds)
trailing_list.append(data_array[i]-data_array[then])
return np.asarray(trailing_list)
unfortunately this doesn't scale particularly well, and I'd like to be able to calculate this (and plot it) on the fly.
Any thoughts on how I can make it more expedient?
EDIT: input/output
In [48]:time1
Out[48]:
array([ 0.57200003, 0.579 , 0.58800006, 0.59500003,
0.5999999 , 1.05999994, 1.55900002, 2.00900006,
2.57599998, 3.05599999, 3.52399993, 4.00699997,
4.09599996, 4.57299995, 5.04699993, 5.52099991,
6.09299994, 6.55999994, 7.04099989, 7.50900006,
8.07500005, 8.55799985, 9.023 , 9.50699997,
9.59399986, 10.07200003, 10.54200006, 11.01999998,
11.58899999, 12.05699992, 12.53799987, 13.00499988,
13.57599998, 14.05599999, 14.52399993, 15.00199985,
15.09299994, 15.57599998, 16.04399991, 16.52199984,
17.08899999, 17.55799985, 18.03699994, 18.50499988,
19.0769999 , 19.5539999 , 20.023 , 20.50099993,
20.59099984, 21.07399988])
In [49]:weight1
Out[49]:
array([ 82.268, 82.268, 82.269, 82.272, 82.275, 82.291, 82.289,
82.288, 82.287, 82.287, 82.293, 82.303, 82.303, 82.314,
82.321, 82.333, 82.356, 82.368, 82.386, 82.398, 82.411,
82.417, 82.419, 82.424, 82.424, 82.437, 82.45 , 82.472,
82.498, 82.515, 82.541, 82.559, 82.584, 82.607, 82.617,
82.626, 82.626, 82.629, 82.63 , 82.636, 82.651, 82.663,
82.686, 82.703, 82.728, 82.755, 82.773, 82.8 , 82.8 ,
82.826])
In [50]:trailing_diff(time1,weight1,10)
Out[50]:
array([ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0.169, 0.182, 0.181, 0.209, 0.227, 0.254, 0.272,
0.291, 0.304, 0.303, 0.305, 0.305, 0.296, 0.274, 0.268,
0.265, 0.265, 0.275, 0.286, 0.309, 0.331, 0.336, 0.35 ,
0.35 , 0.354])

Use a ready-made interpolation routine. If you really want nearest neighbor behavior, I think it will have to be scipy's scipy.interpolate.interp1d, but linear interpolation seems a better option, and then you could use numpy's numpy.interp:
def trailing_diff(time, data, diff):
ret = np.zeros_like(data)
mask = (time - time[0]) >= diff
ret[mask] = data[mask] - np.interp(time[mask] - diff,
time, data)
return ret
time = np.arange(10) + np.random.rand(10)/2
weight = 82 + np.random.rand(10)
>>> time
array([ 0.05920317, 1.23000929, 2.36399981, 3.14701595, 4.05128494,
5.22100886, 6.07415922, 7.36161563, 8.37067107, 9.11371986])
>>> weight
array([ 82.14004969, 82.36214992, 82.25663272, 82.33764514,
82.52985723, 82.67820915, 82.43440796, 82.74038368,
82.84235675, 82.1333915 ])
>>> trailing_diff(time, weight, 3)
array([ 0. , 0. , 0. , 0.18093749, 0.20161107,
0.4082712 , 0.10430073, 0.17116831, 0.20691594, -0.31041841])
To get nearest neighbor, you would do
from scipy.interpolate import interp1d
def trailing_diff(time, data, diff):
ret = np.zeros_like(data)
mask = (time - time[0]) >= diff
interpolator = interp1d(time, data, kind='nearest')
ret[mask] = data[mask] - interpolator(time[mask] - diff)
return ret

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

PyMC3 Normal with variance per column - python

Related

Replace column by 0 based on probability

Pythonic way for double for loop

Compute the B-spline basis of a Bivariate spline

Creating random Variables in Python with one third of the array to be zero

speeding up moving time delta with irregular time intervals in a numpy array

Categories

Resources