Random walk on Markov Chain Transition matrix - python

I have a cumulative transition matrix and need to build a simple random-walk algorithm to generate, say, 500 values from the matrix as efficiently as possible (the actual matrix is 1000 x 1000).
cum_sum
[array([0.3, 0.5, 0.7, 0.9, 1. ]),
array([0.5 , 0.5 , 0.5 , 0.75, 1. ]),
array([0.66666667, 1. , 1. , 1. , 1. ]),
array([0.4, 0.6, 0.8, 1. , 1. ]),
array([0.5, 0.5, 0.5, 1. , 1. ])]
Select the initial state i in the matrix randomly
Produce a random value between 0 and 1
The value of the random number is compared with the elements of the i-th row of the cumulative matrix. If the random number is greater than the cumulative probability of the previous state but less than or equal to the cumulative probability of the following state, the following state is adopted.
Something I tried:
def random_walk(cum_sum):
    start_point = random.choice([item[0] for item in cum_sum])
    r = np.random.uniform(0, 1)  # naming this `random` would shadow the random module used above
    if r > start_point:
        ...  # this is where I'm stuck

You can use numpy.random.choice at each stage to simulate a transition.
The probabilities go in the argument p, and the first positional argument defines the sample space. To convert the cumulative probability distribution back to a probability distribution, insert a zero at the beginning of each row and use numpy.diff to compute the increase at each step.
Preparing your example probabilities
import numpy as np

P = np.array([
    [0.3, 0.5, 0.7, 0.9, 1. ],
    [0.5, 0.5, 0.5, 0.75, 1. ],
    [0.66666667, 1., 1., 1., 1. ],
    [0.4, 0.6, 0.8, 1., 1. ],
    [0.5, 0.5, 0.5, 1., 1. ]])
P = np.diff(np.hstack([np.zeros((len(P), 1)), P]), axis=1)
Then run a few steps
i = 0
path = [0]
I = np.arange(len(P))
for _ in range(10):
    i = np.random.choice(I, p=P[i])
    path.append(i)
print(path)
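Since you already have the cumulative matrix, you can also skip the diff conversion and sample by comparing a uniform draw against each cumulative row, exactly as described in the question. A minimal sketch of that approach (np.searchsorted does the comparison step; names here are illustrative, not from the original post):
import numpy as np

cum_P = np.array([
    [0.3, 0.5, 0.7, 0.9, 1. ],
    [0.5, 0.5, 0.5, 0.75, 1. ],
    [0.66666667, 1., 1., 1., 1. ],
    [0.4, 0.6, 0.8, 1., 1. ],
    [0.5, 0.5, 0.5, 1., 1. ]])

def random_walk(cum_P, n_steps=500):
    state = np.random.randint(len(cum_P))  # random initial state
    path = [state]
    for r in np.random.uniform(size=n_steps):
        # first index whose cumulative probability is >= r
        state = np.searchsorted(cum_P[state], r)
        path.append(state)
    return path

print(random_walk(cum_P, 10))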

Related

How to choose a random row from 2d np array with np array of probabilites?

I have some difficulties choosing a random row (a point, in my case) from my np array. I want to do that with a probability for each point (so I have a P_i np array in which each row is the probability for a point). I tried np.random.choice and got an error saying the array must be 1-D, so I used np.random.choice on the number of rows instead, which gives me a random row index. But how do I do it with a probability for each point?
You can use np.random.choice with a probability distribution that sums to 1.
Getting probabilities that sum up to 1
Reshaping
If your probabilities already sum to 1, then you simply want to squeeze your probability vector:
# Example of probability vector
probs = np.array([[0.1, 0.2, 0.5, 0.2]])
# array([[0.1, 0.2, 0.5, 0.2]])
probs.shape
# > (1, 4)
p_squeezed = probs.squeeze()
# > array([0.1, 0.2, 0.5, 0.2])
p_squeezed.shape
# > (4,)
Getting a proper probability distribution
If your probs don't add up to 1, you can normalize them by dividing by their sum, or apply a softmax.
Just generating random data:
import numpy as np
# Random 2D points
points = np.random.randint(0,10, size=(10,2))
# random independent probabilities
probs = np.random.rand(10).reshape(-1, 1)
data = np.hstack((probs, points))
print(data)
# > array([[0.01402932, 5. , 5. ],
# [0.01454579, 5. , 6. ],
# [0.43927214, 1. , 7. ],
# [0.36369286, 3. , 7. ],
# [0.09703463, 9. , 9. ],
# [0.56977406, 1. , 4. ],
# [0.0453545 , 4. , 2. ],
# [0.70413767, 4. , 4. ],
# [0.72133774, 7. , 1. ],
# [0.27297051, 3. , 6. ]])
Applying softmax:
from scipy.special import softmax
scale_softmax = softmax(data[:,0])
# > array([0.07077797, 0.07081454, 0.1082876 , 0.10040494, 0.07690364,
# 0.12338291, 0.0730302 , 0.14112644, 0.14357482, 0.09169694])
Applying division by the sum:
scale_divsum = data[: ,0] / data[:, 0].sum()
# > array([0.00432717, 0.00448646, 0.13548795, 0.11217647, 0.02992911,
# 0.17573962, 0.01398902, 0.21718238, 0.22248752, 0.08419431])
Comparing the cumulative distributions of the two scalings (plot not reproduced here): softmax makes every point closer to equally likely to be picked than division by the sum does, but the latter probably better fits your needs.
Picking a random row
Now you can use np.random.choice and pass your probability distribution as the parameter p:
rand_idx = np.random.choice(np.arange(len(data)), p=scale_softmax)
data[rand_idx]
# > array([0.70413767, 4. , 4. ])
# or just the point:
data[rand_idx, 1:]
# > array([4., 4.])

Compute the B-spline basis of a Bivariate spline

I need to compute uv queries on a bivariate spline in the B-spline basis. With this answer I have a good function (copied below) that leverages scipy.dfitpack.bispeu to get the results I need.
import numpy as np
import scipy.interpolate as si
def fitpack_bispeu(cv, u, v, count_u, count_v, degree_u, degree_v):
    # cv = grid of control vertices
    # u, v = lists of u and v component queries
    # count_u, count_v = grid counts along the u and v directions
    # degree_u, degree_v = curve degree along the u and v directions

    # Calculate knot vectors for both u and v
    tck_u = np.clip(np.arange(count_u+degree_u+1)-degree_u, 0, count_u-degree_u)  # knot vector in the u direction
    tck_v = np.clip(np.arange(count_v+degree_v+1)-degree_v, 0, count_v-degree_v)  # knot vector in the v direction

    # Compute queries
    positions = np.empty((u.shape[0], cv.shape[1]))
    for i in range(cv.shape[1]):
        positions[:, i] = si.dfitpack.bispeu(tck_u, tck_v, cv[:, i], degree_u, degree_v, u, v)[0]
    return positions
This function works, but it occurred to me that I could get better performance by calculating the bivariate basis ahead of time and then getting my result via a dot product. Here's what I wrote to compute the basis.
def basis_bispeu(cv, u, v, count_u, count_v, degree_u, degree_v):
    # Calculate knot vectors for both u and v
    tck_u = np.clip(np.arange(count_u+degree_u+1)-degree_u, 0, count_u-degree_u)  # knot vector in the u direction
    tck_v = np.clip(np.arange(count_v+degree_v+1)-degree_v, 0, count_v-degree_v)  # knot vector in the v direction

    # Compute basis for each control vertex
    basis = np.empty((u.shape[0], cv.shape[0]))
    cv_ = np.identity(len(cv))
    for i in range(cv.shape[0]):
        basis[:, i] = si.dfitpack.bispeu(tck_u, tck_v, cv_[i], degree_u, degree_v, u, v)[0]
    return basis
Let's compare and profile with cProfile:
# A test grid of control vertices
cv = np.array([[-0.5 , -0. , 0.5 ],
[-0.5 , -0. , 0.33333333],
[-0.5 , -0. , 0. ],
[-0.5 , 0. , -0.33333333],
[-0.5 , 0. , -0.5 ],
[-0.16666667, 1. , 0.5 ],
[-0.16666667, -0. , 0.33333333],
[-0.16666667, 0.5 , 0. ],
[-0.16666667, 0.5 , -0.33333333],
[-0.16666667, 0. , -0.5 ],
[ 0.16666667, -0. , 0.5 ],
[ 0.16666667, -0. , 0.33333333],
[ 0.16666667, -0. , 0. ],
[ 0.16666667, 0. , -0.33333333],
[ 0.16666667, 0. , -0.5 ],
[ 0.5 , -0. , 0.5 ],
[ 0.5 , -0. , 0.33333333],
[ 0.5 , -0.5 , 0. ],
[ 0.5 , 0. , -0.33333333],
[ 0.5 , 0. , -0.5 ]])
count_u = 4
count_v = 5
degree_u = 3
degree_v = 3
n = 10**6 # make 1 million random queries
u = np.random.random(n) * (count_u-degree_u)
v = np.random.random(n) * (count_v-degree_v)
# get the result from fitpack_bispeu
result_bispeu = fitpack_bispeu(cv,u,v,count_u,count_v,degree_u,degree_v) # 0.482 seconds
# precompute the basis for the same grid
basis = basis_bispeu(cv,u,v,count_u,count_v,degree_u,degree_v) # 2.124 seconds
# get results via dot product
result_basis = np.dot(basis,cv) # 0.028 seconds (17x faster than fitpack_bispeu)
# all close?
print np.allclose(result_basis, result_bispeu) # True
With a 17x speed increase, pre-calculating the basis seems like the way to go, but basis_bispeu is rather slow.
QUESTION
Is there a faster way to compute the basis of a bivariate spline? I know of de Boor's algorithm for computing a similar basis on a curve. Are there similar algorithms for bivariate splines that, once written with numba or cython, could yield better performance?
Otherwise, can the basis_bispeu function above be improved to compute the basis faster? Maybe there are built-in numpy functions I'm not aware of that could help.
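One direction worth exploring (a sketch under assumptions, not from the original post): a tensor-product B-spline basis factors into the outer product of the univariate u and v bases, so only count_u + count_v 1-D evaluations are needed instead of count_u*count_v bivariate ones. The snippet below assumes fitpack's u-major coefficient ordering, so it should be verified against basis_bispeu before relying on it.
import numpy as np
from scipy.interpolate import BSpline

def basis_outer(u, v, count_u, count_v, degree_u, degree_v):
    # same open-uniform knot vectors as above
    tck_u = np.clip(np.arange(count_u+degree_u+1)-degree_u, 0, count_u-degree_u).astype(float)
    tck_v = np.clip(np.arange(count_v+degree_v+1)-degree_v, 0, count_v-degree_v).astype(float)
    # univariate basis functions evaluated at every query, one column each
    # (SciPy >= 1.8 also offers BSpline.design_matrix, which may do this step in one call)
    B_u = np.column_stack([BSpline(tck_u, np.eye(count_u)[i], degree_u)(u) for i in range(count_u)])
    B_v = np.column_stack([BSpline(tck_v, np.eye(count_v)[j], degree_v)(v) for j in range(count_v)])
    # tensor-product basis: row-wise outer product, flattened to (n, count_u*count_v)
    return (B_u[:, :, None] * B_v[:, None, :]).reshape(len(u), -1)
If this matches basis_bispeu, the result plugs straight into the same np.dot(basis, cv) step.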

PyMC3 Normal with variance per column

I am trying to define a pymc3.Normal variable with the following as mu:
import numpy as np
import pymc3 as pm
mx = np.array([[0.25 , 0.5 , 0.75 , 1. ],
[0.25 , 0.333, 0.25 , 0. ],
[0.25 , 0.167, 0. , 0. ],
[0.25 , 0. , 0. , 0. ]])
epsilon = pm.Gamma('epsilon', alpha=10, beta=10)
p_ = pm.Normal('p_', mu=mx, shape = mx.shape, sd = epsilon)
The problem is that all the random variables in p_ get the same std (epsilon). I would like the first row to use epsilon1, the second row epsilon2, and so on.
How can I do that?
One can pass an argument for the shape parameter to achieve this. To demonstrate this, let's make some fake data to pass as observed, where we use fixed values for epsilon that we can compare against the inferred ones.
Example Model
import numpy as np
import pymc3 as pm
import arviz as az
# priors
mu = np.array([[0.25 , 0.5 , 0.75 , 1. ],
[0.25 , 0.333, 0.25 , 0. ],
[0.25 , 0.167, 0. , 0. ],
[0.25 , 0. , 0. , 0. ]])
alpha, beta = 10, 10
# fake data
np.random.seed(2019)
# row vector will use a different sd for each column
sd = np.random.gamma(alpha, 1.0/beta, size=(1,4))
# generate 100 fake observations of the (4,4) random variables
Y = np.random.normal(loc=mu, scale=sd, size=(100,4,4))
# true column sd's
print(sd)
# [[0.90055471 1.24522079 0.85846659 1.19588367]]
# mean sd's per column
print(np.mean(np.std(Y, 0), 0))
# [0.92028042 1.24437592 0.83383181 1.22717313]
# model
with pm.Model() as model:
    # use a (1,4) matrix to pool variance by columns
    epsilon = pm.Gamma('epsilon', alpha=10, beta=10, shape=(1, mu.shape[1]))
    p_ = pm.Normal('p_', mu=mu, sd=epsilon, shape=mu.shape, observed=Y)
    trace = pm.sample(random_seed=2019)
This samples well and gives a posterior summary (not reproduced here) whose HPD intervals clearly bound the true values of the standard deviations.
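For the per-row standard deviations the question originally asks for, the same broadcasting trick should work with a column-shaped prior; a sketch (an untested assumption, not part of the original answer, and Y would need to be regenerated with per-row sd to match):
with pm.Model() as model_rows:
    # one epsilon per row, broadcast across the columns
    epsilon = pm.Gamma('epsilon', alpha=10, beta=10, shape=(mu.shape[0], 1))
    p_ = pm.Normal('p_', mu=mu, sd=epsilon, shape=mu.shape, observed=Y)
    trace = pm.sample(random_seed=2019)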

speeding up moving time delta with irregular time intervals in a numpy array

I want to calculate the 10-second difference of a dataset where the time increments are irregular. The data exists in two 1-D arrays of equal length, one for the time and the other for the data value.
After some poking around I was able to come up with a solution, but it's too slow, based on (I suspect) having to iterate through every item in the array.
My general method is to iterate through the time array, and for each time value find the index of the time value that is x seconds earlier. I then use those indices on the data array to calculate the difference.
The code is shown below.
First, the find_closest function from Bi Rico
def find_closest(A, target):
    # A must be sorted
    idx = A.searchsorted(target)
    idx = np.clip(idx, 1, len(A)-1)
    left = A[idx-1]
    right = A[idx]
    idx -= target - left < right - target
    return idx
Which I then use in the following manner
def trailing_diff(time_array, data_array, seconds):
    trailing_list = []
    for i in xrange(len(time_array)):
        now = time_array[i]
        if now < seconds:
            trailing_list.append(0)
        else:
            then = find_closest(time_array, now-seconds)
            trailing_list.append(data_array[i]-data_array[then])
    return np.asarray(trailing_list)
Unfortunately this doesn't scale particularly well, and I'd like to be able to calculate it (and plot it) on the fly.
Any thoughts on how I can make it more expedient?
EDIT: input/output
In [48]:time1
Out[48]:
array([ 0.57200003, 0.579 , 0.58800006, 0.59500003,
0.5999999 , 1.05999994, 1.55900002, 2.00900006,
2.57599998, 3.05599999, 3.52399993, 4.00699997,
4.09599996, 4.57299995, 5.04699993, 5.52099991,
6.09299994, 6.55999994, 7.04099989, 7.50900006,
8.07500005, 8.55799985, 9.023 , 9.50699997,
9.59399986, 10.07200003, 10.54200006, 11.01999998,
11.58899999, 12.05699992, 12.53799987, 13.00499988,
13.57599998, 14.05599999, 14.52399993, 15.00199985,
15.09299994, 15.57599998, 16.04399991, 16.52199984,
17.08899999, 17.55799985, 18.03699994, 18.50499988,
19.0769999 , 19.5539999 , 20.023 , 20.50099993,
20.59099984, 21.07399988])
In [49]:weight1
Out[49]:
array([ 82.268, 82.268, 82.269, 82.272, 82.275, 82.291, 82.289,
82.288, 82.287, 82.287, 82.293, 82.303, 82.303, 82.314,
82.321, 82.333, 82.356, 82.368, 82.386, 82.398, 82.411,
82.417, 82.419, 82.424, 82.424, 82.437, 82.45 , 82.472,
82.498, 82.515, 82.541, 82.559, 82.584, 82.607, 82.617,
82.626, 82.626, 82.629, 82.63 , 82.636, 82.651, 82.663,
82.686, 82.703, 82.728, 82.755, 82.773, 82.8 , 82.8 ,
82.826])
In [50]:trailing_diff(time1,weight1,10)
Out[50]:
array([ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0.169, 0.182, 0.181, 0.209, 0.227, 0.254, 0.272,
0.291, 0.304, 0.303, 0.305, 0.305, 0.296, 0.274, 0.268,
0.265, 0.265, 0.275, 0.286, 0.309, 0.331, 0.336, 0.35 ,
0.35 , 0.354])
Use a ready-made interpolation routine. If you really want nearest neighbor behavior, I think it will have to be scipy's scipy.interpolate.interp1d, but linear interpolation seems a better option, and then you could use numpy's numpy.interp:
def trailing_diff(time, data, diff):
    ret = np.zeros_like(data)
    mask = (time - time[0]) >= diff
    ret[mask] = data[mask] - np.interp(time[mask] - diff, time, data)
    return ret
time = np.arange(10) + np.random.rand(10)/2
weight = 82 + np.random.rand(10)
>>> time
array([ 0.05920317, 1.23000929, 2.36399981, 3.14701595, 4.05128494,
5.22100886, 6.07415922, 7.36161563, 8.37067107, 9.11371986])
>>> weight
array([ 82.14004969, 82.36214992, 82.25663272, 82.33764514,
82.52985723, 82.67820915, 82.43440796, 82.74038368,
82.84235675, 82.1333915 ])
>>> trailing_diff(time, weight, 3)
array([ 0. , 0. , 0. , 0.18093749, 0.20161107,
0.4082712 , 0.10430073, 0.17116831, 0.20691594, -0.31041841])
To get nearest neighbor, you would do
from scipy.interpolate import interp1d

def trailing_diff(time, data, diff):
    ret = np.zeros_like(data)
    mask = (time - time[0]) >= diff
    interpolator = interp1d(time, data, kind='nearest')
    ret[mask] = data[mask] - interpolator(time[mask] - diff)
    return ret

scipy smart optimize

I need to fit some points from different datasets with straight lines. From every dataset I want to fit a line, so I get the parameters ai and bi that describe the i-th line: ai + bi*x. The problem is that I want to impose that every ai is equal, because I want the same intercept. I found a tutorial here: http://www.scipy.org/Cookbook/FittingData#head-a44b49d57cf0165300f765e8f1b011876776502f. The difference is that I don't know a priori how many datasets I have. My code is this:
from numpy import *
from scipy import optimize
# here I have 3 datasets, but in general I don't know how many datasets there are
ypoints = [array([0, 2.1, 2.4]), # first dataset, 3 points
array([0.1, 2.1, 2.9]), # second dataset
array([-0.1, 1.4])] # only 2 points
xpoints = [array([0, 2, 2.5]), # first dataset
array([0, 2, 3]), # second, also x coordinates are different
array([0, 1.5])] # the first coordinate is always 0
fitfunc = lambda a, b, x: a + b * x
errfunc = lambda p, xs, ys: array([yi - fitfunc(p[0], p[i+1], xi)
                                   for i, (xi, yi) in enumerate(zip(xs, ys))])
p_arrays = [r_[0.]] * len(xpoints)
pinit = r_[[ypoints[0][0]] + p_arrays]
fit_parameters, success = optimize.leastsq(errfunc, pinit, args = (xpoints, ypoints))
I got
Traceback (most recent call last):
File "prova.py", line 19, in <module>
fit_parameters, success = optimize.leastsq(errfunc, pinit, args = (xpoints, ypoints))
File "/usr/lib64/python2.6/site-packages/scipy/optimize/minpack.py", line 266, in leastsq
m = check_func(func,x0,args,n)[0]
File "/usr/lib64/python2.6/site-packages/scipy/optimize/minpack.py", line 12, in check_func
res = atleast_1d(thefunc(*((x0[:numinputs],)+args)))
File "prova.py", line 14, in <lambda>
for i, (xi,yi) in enumerate(zip(xs, ys)) ])
ValueError: setting an array element with a sequence.
If you just need a linear fit, then it is better to estimate it with linear regression instead of a non-linear optimizer.
More fit statistics could be obtained by using scikits.statsmodels instead.
import numpy as np
from numpy import array
ypoints = np.r_[array([0, 2.1, 2.4]), # first dataset, 3 points
array([0.1, 2.1, 2.9]), # second dataset
array([-0.1, 1.4])] # only 2 points
xpoints = [array([0, 2, 2.5]), # first dataset
array([0, 2, 3]), # second, also x coordinates are different
array([0, 1.5])] # the first coordinate is always 0
xp = np.hstack(xpoints)
indicator = []
for i, a in enumerate(xpoints):
    indicator.extend([i]*len(a))
indicator = np.array(indicator)
x = xp[:,None]*(indicator[:,None]==np.arange(3)).astype(int)
x = np.hstack((np.ones((xp.shape[0],1)),x))
print np.dot(np.linalg.pinv(x), ypoints)
# [ 0.01947973 0.98656987 0.98481549 0.92034684]
The matrix of regressors has a common intercept, but different columns for each dataset:
>>> x
array([[ 1. , 0. , 0. , 0. ],
[ 1. , 2. , 0. , 0. ],
[ 1. , 2.5, 0. , 0. ],
[ 1. , 0. , 0. , 0. ],
[ 1. , 0. , 2. , 0. ],
[ 1. , 0. , 3. , 0. ],
[ 1. , 0. , 0. , 0. ],
[ 1. , 0. , 0. , 1.5]])
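For the statsmodels route mentioned above, the same design matrix can be fed to an OLS model to get the coefficients plus standard errors and other fit statistics. A sketch using the modern statsmodels package (scikits.statsmodels is the old package name; this reuses x and ypoints from above):
import statsmodels.api as sm

results = sm.OLS(ypoints, x).fit()
print(results.params)   # same coefficients as the pinv solution above
print(results.bse)      # standard errors per coefficient
print(results.summary())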
(Side note: use def, not lambda assigned to a name -- that's utterly silly and has nothing but downsides, lambda's only use is making anonymous functions!).
Your errfunc should return a flat sequence (array or otherwise) of floating-point numbers, but it doesn't: the items of the array it builds are themselves arrays, namely the differences between each y array (remember, ypoints aka ys is a list of arrays!) and the corresponding fit-function results. So you need to "collapse" each expression yi - fitfunc(p[0], p[i+1], xi) to a single floating-point number, e.g. norm(yi - fitfunc(p[0], p[i+1], xi)).
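A sketch of one way to make the original leastsq call run (an assumption, not from either answer): collapsing each dataset to a single norm would leave only 3 residuals for 4 parameters, which leastsq rejects, so this version instead flattens every pointwise residual into one 1-D array while keeping the shared intercept p[0]. It reuses fitfunc, pinit, xpoints, and ypoints from the question.
import numpy as np

def errfunc(p, xs, ys):
    # one flat array of residuals across all datasets,
    # sharing the intercept p[0] and using slope p[i+1] for dataset i
    return np.concatenate([yi - fitfunc(p[0], p[i+1], xi)
                           for i, (xi, yi) in enumerate(zip(xs, ys))])

fit_parameters, success = optimize.leastsq(errfunc, pinit, args=(xpoints, ypoints))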
