Building N-th order Markovian transition matrix from a given sequence - python

I am trying to create a function which can transform a given input sequence to a transition matrix of the requested order. I found an implementation for the first-order Markovian transition matrix.
Now, I want to be able to come up with a solution which can calculate 2nd and 3rd order transition matrices.
Example of the 1st order matrix implementation:
import numpy as np
# sequence with 3 states -> 0, 1, 2
a = [0, 1, 0, 0, 0, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 2]
def transition_matrix_first_order(seq):
M = np.full((3, 3), fill_value = 1/3, dtype= np.float64)
for (i,j) in zip(seq, seq[1:]):
M[i, j] += 1
M = M / M.sum(axis = 1, keepdims = True)
return M
print(transition_matrix_first_order(a))
Which gives me this:
[[0.61111111 0.19444444 0.19444444]
[0.38888889 0.38888889 0.22222222]
[0.22222222 0.22222222 0.55555556]]
When making a 2nd order matrix, it should have unique_state_count ** order rows and unique_state_count columns. In the example above, I have 3 unique states, so the matrix will have 9x3 structure.
Desirable function sample:
cal_tr_matrix(seq, unique_state_count, order)

I think you have a slight misunderstanding about the Markov chains and their transition matrices.
First of all, the estimated transition matrix your function produces is unfortunately not correct. Why? Let's refresh.
A discrete Markov chain in discrete time with N different states has a transition matrix P of size N x N, where a (i, j) element is P(X_1=j|X_0=i), i.e. the probability of transition from state i to state j in a single time step.
Now a transition matrix of order n, denoted P^{n}is once again a matrix of size N x N where a (i, j) element is P(X_n=j|X_0=i), i.e. the probability of transition from state i to state j in n time steps.
A wonderful result says: P^{n} = P^n, i.e. taking n powers of single-step transition matrix gives you the n-step transition matrix.
Now with this recap, all that is needed is to estimate P from the given sequence, then to estimate P^{n} one can just use the already estimated P and take a n-th power of the matrix. So how to estimate the matrix P? Well if we denote N_{ij} the number of observations of transition from state i to state j and N_{i*} the number of observations being in state i, then P_{ij} = N_{ij} / N_{i*}.
Overall here in Python:
import numpy as np
def transition_matrix(arr, n=1):
""""
Computes the transition matrix from Markov chain sequence of order `n`.
:param arr: Discrete Markov chain state sequence in discrete time with states in 0, ..., N
:param n: Transition order
"""
M = np.zeros(shape=(max(arr) + 1, max(arr) + 1))
for (i, j) in zip(arr, arr[1:]):
M[i, j] += 1
T = (M.T / M.sum(axis=1)).T
return np.linalg.matrix_power(T, n)
transition_matrix(arr=a, n=1)
>>> array([[0.63636364, 0.18181818, 0.18181818],
>>> [0.4 , 0.4 , 0.2 ],
>>> [0.2 , 0.2 , 0.6 ]])
transition_matrix(arr=a, n=2)
>>> array([[0.51404959, 0.22479339, 0.26115702],
>>> [0.45454545, 0.27272727, 0.27272727],
>>> [0.32727273, 0.23636364, 0.43636364]])
transition_matrix(arr=a, n=3)
>>> array([[0.46927122, 0.23561232, 0.29511645],
>>> [0.45289256, 0.24628099, 0.30082645],
>>> [0.39008264, 0.24132231, 0.36859504]])
Interesting thing, when you set the order n to a fairly high number, the higher and higher powers of the P matrix seem to converge to some very specific values. That's known as stationary/invariant distribution of the Markov chain and it gives a very good indication of how the chain behaves over a long period of time/transitions. Also:
P = transition_matrix(a, 1)
P111 = transition_matrix(a, 111)
print(P)
print(P111.dot(P))
EDIT: Now to the tweaked solution based on your comment, I'd suggest to have higher dimensional matrices for higher orders instead of exploding the number of rows. One way would be like this:
def cal_tr_matrix(arr, order):
_shape = (max(arr) + 1,) * (order + 1)
M = np.zeros(_shape)
for _ind in zip(*[arr[_x:] for _x in range(order + 1)]):
M[_ind] += 1
return M
res1 = cal_tr_matrix(a, 1)
res2 = cal_tr_matrix(a, 2)
Now the element res1[i, j] says how many times transition i->j happened, while the element res2[i, j, k] says how many times transition i->j->k happened.

Related

SKlearn Minimum Covariance Determinant (MCD) Function yields different results if applied to whole data array vs looped

I have a repeated experiment (n=K) which measures time series of equal length N, i.e. my data matrix has the shape NxK. I now want to compute a robust estimate of the covariance between the experiments for which I use the Minimum Covariance Determinent algorithm implemented in Sci Kit Learn.
One way to apply the algorithm is to directly apply the function to the data array D, i.e.:
import numpy as np
from sklearn.covariance import MinCovDet
N = 300 #number of rows
K = 40 #number of columns
D = np.random.normal(0, 1, size=(N, K)) #create random Data
mcd = MinCovDet().fit(D) #yields a KxK matrix
cov_mat = mcd.covariance_ #covariances between the columns
another way is to loop over the experiments
cov_loop = np.zeros((K, K))
for i in range(0, K):
for j in range(i, K):
temp_arr = np.zeros((N, 2))
temp_arr[:, 0] = D[:, i]
temp_arr[:, 1] = D[:, j]
mcd_temp = MinCovDet().fit(temp_arr)
cov_temp = mcd_temp.covariance_ #yields 2x2 matrix, we are only interested in the [0,1] element
cov_loop[i, j] = cov_temp[0, 1]
cov_loop[j, i] = cov_loop[i, j]
print(cov_loop/cov_mat)
The results differ significantly, which is why I wanted to ask what went wrong here.

Generate binary random matrix with upper and lower limit on number of ones in each row?

I want to generate a binary matrix of numbers with M rows and N columns. Each row must sum to <=p and >=q. In other words, each row must have at most p and at least q ones.
This is the code I have been using.
import numpy as np
def randbin(M, N, P):
return np.random.choice([0, 1], size=(M, N), p=[P, 1 - P])
MyMatrix = randbin(200, 7, 0.5)
Notice that row 0 is all zeros:
I noticed that some rows have all zeros and some rows have all ones. How can I modify this to get what I want? Is there an efficient way of achieving this solution?
You can generate a random number in [q, p] for each row and then set that many random ones in each row. If by efficient you mean vectorized, then yes, there is an efficient way. The trick is to simulate sampling without replacement in one axis but with the the other. This can be done with np.argsort. You can select a variable number of indices by turning a random vector into a mask.
def randbin(m, n, p, q):
# output to assign ones into
result = np.zeros((m, n), dtype=bool)
# simulate sampling with replacement in one axis
col_ind = np.argsort(np.random.random(size=(m, n)), axis=1)
# figure out how many samples to take in each row
count = np.random.randint(p, q + 1, size=(m, 1))
# turn it into a mask over col_ind using a clever broadcast
mask = np.arange(n) < count
# apply the mask not only to col_ind, but also the corresponding row_ind
col_ind = col_ind[mask]
row_ind = np.broadcast_to(np.arange(m).reshape(-1, 1), (m, n))[mask]
# Set the corresponding elements to 1
result[row_ind, col_ind] = 1
return result
The selection is made so that each run of equal values in row_ind is between p and q elements long. The corresponding elements of col_ind are unique and uniformly distributed within each row.
An alternative is #Prunes solution. It requires np.argsort to shuffle the rows independently, since np.random.shuffle would keep the rows together:
def randbin(m, n, p, q):
# make the unique rows
options = np.arange(n) < np.arange(p, q + 1).reshape(-1, 1)
# select random unique row to go into each output row
selection = np.random.choice(options.shape[0], size=m, replace=True)
# perform the selection
result = options[selection]
# create indices to shuffle each row independently
col_ind = np.argsort(np.random.random(result.shape), axis=1)
row_ind = np.arange(m).reshape(-1, 1)
# perform the shuffle
result = result[row_ind, col_ind]
return result
Okay, then: a uniform distribution is easy enough. Let's take that case with [2,5] 1s required. Use a list of the allowable combinations:
[ [1, 1, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0] ]
For each of your rows, choose a random element from these four, and then shuffle it. There is your row.

Shortest distance between a point and a line in 3 d space

I am trying to find the minimum distance from a point (x0,y0,z0) to a line joined by (x1,y1,z1) and (x2,y2,z2) using numpy or anything in python. Unfortunately, all i can find on the net is related to 2d spaces and i am fairly new to python. Any help will be appreciated. Thanks in advance!
StackOverflow doesn't support Latex, so I'm going to gloss over some of the math. One solution comes from the idea that if your line spans the points p and q, then every point on that line can be represented as t*(p-q)+q for some real-valued t. You then want to minimize the distance between your given point r and any point on that line, and distance is conveniently a function of the single variable t, so standard calculus tricks work fine. Consider the following example, which calculates the minimum distance between r and the line spanned by p and q. By hand, we know the answer should be 1.
import numpy as np
p = np.array([0, 0, 0])
q = np.array([0, 0, 1])
r = np.array([0, 1, 1])
def t(p, q, r):
x = p-q
return np.dot(r-q, x)/np.dot(x, x)
def d(p, q, r):
return np.linalg.norm(t(p, q, r)*(p-q)+q-r)
print(d(p, q, r))
# Prints 1.0
This works fine in any number of dimensions, including 2, 3, and a billion. The only real constraint is that p and q have to be distinct points so that there is a unique line between them.
I broke the code down in the above example in order to show the two distinct steps arising from the way I thought about it mathematically (finding t and then computing the distance). That isn't necessarily the most efficient approach, and it certainly isn't if you want to know the minimum distance for a wide variety of points and the same line -- doubly so if the number of dimensions is small. For a more efficient approach, consider the following:
import numpy as np
p = np.array([0, 0, 0]) # p and q can have shape (n,) for any
q = np.array([0, 0, 1]) # n>0, and rs can have shape (m,n)
rs = np.array([ # for any m,n>0.
[0, 1, 1],
[1, 0, 1],
[1, 1, 1],
[0, 2, 1],
])
def d(p, q, rs):
x = p-q
return np.linalg.norm(
np.outer(np.dot(rs-q, x)/np.dot(x, x), x)+q-rs,
axis=1)
print(d(p, q, rs))
# Prints array([1. , 1. , 1.41421356, 2. ])
There may well be some simplifications I'm missing or other things that could speed that up, but it should be a good start at least.
This duplicates #Hans Musgrave solution, but imagine we know nothing of
'standard calculus tricks' that 'work fine' and also very bad at linear algebra.
All we know is:
how to calculate a distance between two points
a point on line can be represented as a function of two points and a paramater
we know find a function minimum
(lists are not friends with code blocks)
def distance(a, b):
"""Calculate a distance between two points."""
return np.linalg.norm(a-b)
def line_to_point_distance(p, q, r):
"""Calculate a distance between point r and a line crossing p and q."""
def foo(t: float):
# x is point on line, depends on t
x = t * (p-q) + q
# we return a distance, which also depends on t
return distance(x, r)
# which t minimizes distance?
t0 = sci.optimize.minimize(foo, 0.1).x[0]
return foo(t0)
# in this example the distance is 5
p = np.array([0, 0, 0])
q = np.array([2, 0, 0])
r = np.array([1, 5, 0])
assert abs(line_to_point_distance(p, q, r) - 5) < 0.00001
You should not use this method for real calculations, because it uses
approximations wher eyou have a closed form solution, but maybe it helpful
to reveal some logic begind the neighbouring answer.

Standard deviation from center of mass along Numpy array axis

I am trying to find a well-performing way to calculate the standard deviation from the center of mass/gravity along an axis of a Numpy array.
In formula this is (sorry for the misalignment):
The best I could come up with is this:
def weighted_com(A, axis, weights):
average = np.average(A, axis=axis, weights=weights)
return average * weights.sum() / A.sum(axis=axis).astype(float)
def weighted_std(A, axis):
weights = np.arange(A.shape[axis])
w1com2 = weighted_com(A, axis, weights)**2
w2com1 = weighted_com(A, axis, weights**2)
return np.sqrt(w2com1 - w1com2)
In weighted_com, I need to correct the normalization from sum of weights to sum of values (which is an ugly workaround, I guess). weighted_std is probably fine.
To avoid the XY problem, I still ask for what I actually want, (a better weighted_std) instead of a better version of my weighted_com.
The .astype(float) is a safety measure as I'll apply this to histograms containing ints, which caused problems due to integer division when not in Python 3 or when from __future__ import division is not active.
You want to take the mean, variance and standard deviation of the vector [1, 2, 3, ..., n] — where n is the dimension of the input matrix A along the axis of interest —, with weights given by the matrix A itself.
For concreteness, say you want to consider these center-of-mass statistics along the vertical axis (axis=0) — this is what corresponds to the formulas you wrote. For a fixed column j, you would do
n = A.shape[0]
r = np.arange(1, n+1)
mu = np.average(r, weights=A[:,j])
var = np.average(r**2, weights=A[:,j]) - mu**2
std = np.sqrt(var)
In order to put all of the computations for the different columns together, you have to stack together a bunch of copies of r (one per column) to form a matrix (that I have called R in the code below). With a bit of care, you can make things work for both axis=0 and axis=1.
import numpy as np
def com_stats(A, axis=0):
A = A.astype(float) # if you are worried about int vs. float
n = A.shape[axis]
m = A.shape[(axis-1)%2]
r = np.arange(1, n+1)
R = np.vstack([r] * m)
if axis == 0:
R = R.T
mu = np.average(R, axis=axis, weights=A)
var = np.average(R**2, axis=axis, weights=A) - mu**2
std = np.sqrt(var)
return mu, var, std
For example,
A = np.array([[1, 1, 0], [1, 2, 1], [1, 1, 1]])
print(A)
# [[1 1 0]
# [1 2 1]
# [1 1 1]]
print(com_stats(A))
# (array([ 2. , 2. , 2.5]), # centre-of-mass mean by column
# array([ 0.66666667, 0.5 , 0.25 ]), # centre-of-mass variance by column
# array([ 0.81649658, 0.70710678, 0.5 ])) # centre-of-mass std by column
EDIT:
One can avoid creating in-memory copies of r to build R by using numpy.lib.stride_tricks: swap the line
R = np.vstack([r] * m)
above with
from numpy.lib.stride_tricks import as_strided
R = as_strided(r, strides=(0, r.itemsize), shape=(m, n))
The resulting R is a (strided) ndarray whose underlying array is the same as r's — absolutely no copying of any values occurs.
from numpy.lib.stride_tricks import as_strided
FMT = '''\
Shape: {}
Strides: {}
Position in memory: {}
Size in memory (bytes): {}
'''
def find_base_nbytes(obj):
if obj.base is not None:
return find_base_nbytes(obj.base)
return obj.nbytes
def stats(obj):
return FMT.format(obj.shape,
obj.strides,
obj.__array_interface__['data'][0],
find_base_nbytes(obj))
n=10
m=1000
r = np.arange(1, n+1)
R = np.vstack([r] * m)
S = as_strided(r, strides=(0, r.itemsize), shape=(m, n))
print(stats(r))
print(stats(R))
print(stats(S))
Output:
Shape: (10,)
Strides: (8,)
Position in memory: 4299744576
Size in memory (bytes): 80
Shape: (1000, 10)
Strides: (80, 8)
Position in memory: 4304464384
Size in memory (bytes): 80000
Shape: (1000, 10)
Strides: (0, 8)
Position in memory: 4299744576
Size in memory (bytes): 80
Credit to this SO answer and this one for explanations on how to get the memory address and size of the underlying array of a strided ndarray.

Decrease array size by averaging adjacent values with numpy

I have a large array of thousands of vals in numpy. I want to decrease its size by averaging adjacent values.
For example:
a = [2,3,4,8,9,10]
#average down to 2 values here
a = [3,9]
#it averaged 2,3,4 and 8,9,10 together
So, basically, I have n number of elements in array, and I want to tell it to average down to X number of values, and it averages like above.
Is there some way to do that with numpy (already using it for other things, so I'd like to stick with it).
Using reshape and mean, you can average every m adjacent values of an 1D-array of size N*m, with N being any positive integer number. For example:
import numpy as np
m = 3
a = np.array([2, 3, 4, 8, 9, 10])
b = a.reshape(-1, m).mean(axis=1)
#array([3., 9.])
1)a.reshape(-1, m) will create a 2D image of the array without copying data:
array([[ 2, 3, 4],
[ 8, 9, 10]])
2)taking the mean in the second axis (axis=1) will then calculate the mean value of each row, resulting in:
array([3., 9.])
Try this:
n_averaged_elements = 3
averaged_array = []
a = np.array([ 2, 3, 4, 8, 9, 10])
for i in range(0, len(a), n_averaged_elements):
slice_from_index = i
slice_to_index = slice_from_index + n_averaged_elements
averaged_array.append(np.mean(a[slice_from_index:slice_to_index]))
>>>> averaged_array
>>>> [3.0, 9.0]
Looks like a simple non-overlapping moving window average to me, how about:
In [3]:
import numpy as np
a = np.array([2,3,4,8,9,10])
window_sz = 3
a[:len(a)/window_sz*window_sz].reshape(-1,window_sz).mean(1)
#you want to be sure your array can be reshaped properly, so the [:len(a)/window_sz*window_sz] part
Out[3]:
array([ 3., 9.])
In this example, I presume that a is the 1D numpy array that needs to be averaged. In the method that I give below, we first find the factors of the length of this array a. And, then we choose the an appropriate factor as the step size to average the array with.
Here is the code.
import numpy as np
from functools import reduce
''' Function to find factors of a given number 'n' '''
def factors(n):
return list(set(reduce(list.__add__,
([i, n//i] for i in range(1, int(n**0.5) + 1) if n % i == 0))))
a = [2,3,4,8,9,10] #Given array.
'''fac: list of factors of length of a.
In this example, len(a) = 6. So, fac = [1, 2, 3, 6] '''
fac = factors(len(a))
'''step: choose an appropriate step size from the list 'fac'.
In this example, we choose one of the middle numbers in fac
(3). '''
step = fac[int( len(fac)/3 )+1]
'''avg: initialize an empty array. '''
avg = np.array([])
for i in range(0, len(a), step):
avg = np.append( avg, np.mean(a[i:i+step]) ) #append averaged values to `avg`
print avg #Prints the final result
[3.0, 9.0]

Categories

Resources