I wrote this code:
import numpy
import matplotlib.pyplot as plt
from tslearn.clustering import KShape
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
seed = 0  # for reproducibility
ks = KShape(n_clusters=3, n_init=10, verbose=True, random_state=seed)
y_pred = ks.fit_predict(stack_data)  # stack_data is my time-series array
plt.figure(figsize=(16,9))
for yi in range(3):
plt.subplot(3, 1, 1 + yi)
for xx in stack_data[y_pred == yi]:
plt.plot(xx.ravel(), "k-", alpha=.2)
plt.title("Cluster %d" % (yi + 1))
plt.tight_layout()
plt.show()
I want to divide the data using KShape clustering. The plot is shown, but I cannot tell which data belong to each of the 3 clusters.
The data are ordered by kind (A, B, C, D), so I want to add labels to the plot or output the clustering result. I searched the KShape documentation (http://tslearn.readthedocs.io/en/latest/auto_examples/plot_kshape.html) but could not find the information to do what I want. How should I do it?
Why there are no perfect solutions
K-Shape works randomly, and without setting a seed you might get different clusters and centroids on every run. There is no deterministic way to know a priori whether a given class is completely described by a given centroid, but you can proceed offline, in a fuzzy way, by checking to which cluster most of a given class's points are assigned.
Also any given class, A for instance, could contain elements that are part of two clusters in the space of the features you are considering.
Suppose you have 3 classes but your dataset is best described (for example by maximal average density) by 4 clusters: you'd surely have some points of at least one class that go in the 4th cluster.
Alternatively, suppose your classes do not overlap with the centroids generated by the distance metric you are considering. An obvious example: you have 3 classes, numbers from 0 to 100, from 100 to 1000, and from 1000 to 1100, but your dataset contains only numbers from 0 to 150 and from 950 to 1100. A clustering algorithm would find its optimum with 2 clusters and put the points of class A in either one of the two.
Once you have determined that, for example, class A goes mostly to cluster 1, class B to cluster 2 etc... you can proceed to assign that cluster to the given class.
A possible fuzzy approach
We will determine the clusters' classes by assigning to each cluster the class whose points it mostly contains:
Simple example: classes that actually fit clusters
For this example we use one of tslearn.datasets. This code is partially taken from this K-Shape example on tslearn.
import numpy as np
import matplotlib.pyplot as plt
from tslearn.clustering import KShape
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from seaborn import heatmap
We set the seed, for code reproducibility:
seed = 0
np.random.seed(seed)
Firstly we prepare the dataset, selecting the first classes_number=3 classes:
classes_number = 3
X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")
mask = y_train <= classes_number
X_train, y_train = X_train[mask], y_train[mask] # Keep first 3 classes
X_train = TimeSeriesScalerMeanVariance().fit_transform(X_train)  # Rescale the time series to zero mean and unit variance
sz = X_train.shape[1]
Now we find the clusters, with clusters_number=3:
# K-Shape clustering
clusters_number = 3
ks = KShape(n_clusters=clusters_number, verbose=False, random_state=seed)
y_pred = ks.fit_predict(X_train)
We now count how many elements of each class are assigned to each cluster, and add 0 padding where no element of a given class was assigned to a given cluster (surely there is a more pythonic way to do this, but I've yet to find it):
data = [np.unique(y_pred[y_train==i+1], return_counts=True) for i in range(classes_number)]
>>>[(array([2]), array([26])),
(array([0]), array([21])),
(array([1]), array([22]))]
Adding the padding:
padded_data = np.array([[
data[j][1][data[j][0] == i][0] if np.any(data[j][0] == i) else 0
for i in range(clusters_number)
] for j in range(classes_number)])
>>> array([[ 0, 0, 26],
[21, 0, 0],
[ 0, 22, 0]])
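As an aside, a more compact way to build the same class-by-cluster count table (just a sketch; it reuses classes_number, clusters_number, y_train and y_pred from above, and assumes, as in this dataset, that class labels start at 1 and cluster labels at 0):
counts = np.zeros((classes_number, clusters_number), dtype=int)
for true_label, cluster in zip(y_train, y_pred):
    counts[true_label - 1, cluster] += 1  # equivalent to padded_data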
Normalising the obtained matrix:
normalized_data = padded_data / np.sum(padded_data, axis=-1)[:, np.newaxis]
>>> array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])
We can visualise the obtained matrix using seaborn.heatmap:
xticklabels = ["Cluster n. %s" % (1+i) for i in range(clusters_number)]
yticklabels = ["Class n. %s" % (1+i) for i in range(classes_number)]
heatmap(
normalized_data,
cbar=False,
square=True,
annot=True,
cmap="YlGnBu",
xticklabels=xticklabels,
yticklabels=yticklabels)
plt.yticks(rotation=0)
Obtaining:
In this optimal situation, every cluster contains only and exactly one class, so with absolute precision we obtain:
classes_clusters = np.argmax(normalized_data, axis=1)
>>> array([2, 0, 1])
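If you then want to label the plots or report which class each series was assigned to (the original question), you can invert this mapping; a small sketch, assuming the one-to-one class/cluster correspondence found above:
cluster_to_class = {cluster: class_idx + 1
                    for class_idx, cluster in enumerate(classes_clusters)}
predicted_class_labels = np.array([cluster_to_class[c] for c in y_pred])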
Second example: classes that do not overlap with clusters
For simplicity's sake, to simulate classes that do not overlap completely with the clusters, I'm just going to shuffle part of the labels, but there is a vast range of examples: most clustering problems end up with classes that do not exactly coincide with a cluster.
tmp = y_train[:20]
np.random.shuffle(tmp)
y_train[:20] = tmp
Now, when we execute the script again we get quite a different matrix:
But we are still able to determine the classes clusters:
classes_clusters = np.argmax(normalized_data, axis=1)
>>> array([2, 0, 1])
Third example: classes that do not exist in the dataset
Suppose we were led to believe that the dataset contained 4 classes, but after running with different values of k we find that the best number of clusters for our current dataset is k=3. How would we proceed to assign the classes to the clusters? Which class could be thrown away?
We proceed to simulate such a situation by arbitrarily assigning a fourth class to some of our labels:
y_train[:20] = 4
Running again our script we would obtain:
Clearly the 4th class has got to go. We can proceed by thresholding on the mean variance:
threshold = np.mean(np.var(normalized_data, axis=1))
result = np.argmax(normalized_data[np.var(normalized_data, axis=1)>threshold], axis=1)
And we obtain yet again:
array([2, 0, 1])
I hope this explanation has cleared most of your doubts!
I have an arbitrary input curve, given as a numpy array. I want to create a smoothed version of it, similar to a rolling mean, but which is strictly greater than the original and strictly smooth. I could use the rolling mean, but if the input curve has a negative peak, the smoothed version will drop below the original around that peak. I could then simply take the maximum of this and the original, but that would introduce non-smooth spots where the transition occurs.
Furthermore, I would like to be able to parameterize the algorithm with a look-ahead and a look-behind for this resulting curve, so that given a large look-ahead and a small look-behind the resulting curve would rather stick to the falling edges, and with a large look-behind and a small look-ahead it would rather be close to rising edges.
I tried using the pandas.Series(a).rolling() facility to get rolling means, rolling maxima, etc., but so far I have found no way to generate a smoothed version of my input which in all cases stays above the input.
I guess there is a way to combine rolling maxima and rolling means somehow to achieve what I want, so here is some code for computing these:
import pandas as pd
import numpy as np
my input curve:
original = np.array([ 5, 5, 5, 8, 8, 8, 2, 2, 2, 2, 2, 3, 3, 7 ])
This can be padded left (pre) and right (post) with the edge values as a preparation for any rolling function:
pre = 2
post = 3
padded = np.pad(original, (pre, post), 'edge')
Now we can apply a rolling mean:
smoothed = pd.Series(padded).rolling(
pre + post + 1).mean().get_values()[pre+post:]
But now the smoothed version is below the original, e. g. at index 4:
print(original[4], smoothed[4]) # 8 and 5.5
To compute a rolling maximum, you can use this:
maximum = pd.Series(padded).rolling(
pre + post + 1).max().get_values()[pre+post:]
But a rolling maximum alone would of course not be smooth in many cases and would display a lot of flat tops around the peaks of the original. I would prefer a smooth approach to these peaks.
If you also have pyqtgraph installed, you can easily plot such curves:
import pyqtgraph as pg
p = pg.plot(original)
p.plotItem.plot(smoothed, pen=(255,0,0))
(Of course, other plot libraries would do as well.)
What I would like to have as a result is a curve which is e. g. like the one formed by these values:
goal = np.array([ 5, 7, 7.8, 8, 8, 8, 7, 5, 3.5, 3, 4, 5.5, 6.5, 7 ])
Here is an image of the curves. The white line is the original (input), the red the rolling mean, the green is about what I would like to have:
EDIT: I just found the functions baseline() and envelope() of a module named peakutils. These two functions can compute polynomials of a given degree fitting the lower and upper peaks of the input, respectively. For small samples this can be a good solution. I'm looking for something which can also be applied to very large samples with millions of values; the degree would then need to be very high and the computation would take a considerable amount of time. Doing it piecewise (section by section) opens up a bunch of new questions and problems (like how to stitch properly while staying smooth and guaranteed above the input, performance when processing a massive number of pieces, etc.), so I'd like to avoid that if possible.
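For reference, this is roughly how those two functions would be used on the small example array from earlier (a sketch only; the parameter names follow the peakutils documentation, so check them against your installed version):
import numpy as np
import peakutils

original = np.array([5, 5, 5, 8, 8, 8, 2, 2, 2, 2, 2, 3, 3, 7], dtype=float)
lower = peakutils.baseline(original, deg=3)   # polynomial following the lower peaks
upper = peakutils.envelope(original, deg=3)   # polynomial following the upper peaks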
EDIT 2: I have a promising approach: repetitively applying a filter which creates a rolling mean, shifts it slightly to the left and to the right, and then takes the maximum of these two and the original sample. Applying this several times smooths out the curve in the way I wanted. Some unsmooth spots can remain, though, in deep valleys. Here is the code for this:
pre = 30
post = 30
margin = 10
s = [ np.array(sum([[ x ] * 100 for x in
[ 5, 5, 5, 8, 8, 8, 2, 2, 2, 2, 2, 3, 3, 7 ]], [])) ]
for _ in range(30):
s.append(np.max([
pd.Series(np.pad(s[-1], (margin+pre, post), 'edge')).rolling(
1 + pre + post).mean().get_values()[pre+post:-margin],
pd.Series(np.pad(s[-1], (pre, post+margin), 'edge')).rolling(
1 + pre + post).mean().get_values()[pre+post+margin:],
s[-1]], 0))
This runs 30 iterations of the filter; plotting them can be done with pyqtgraph like so:
p = pg.plot(original)
for q in s:
p.plotItem.plot(q, pen=(255, 100, 100))
The resulting image looks like this:
There are two aspects I don't like about this approach: ① It needs iterating lots of time (which slows me down), ② it still has unsmooth parts in the valleys (although in my usecase this might be acceptable).
I have now played around quite a bit and I think I found two main answers which solve my direct need. I will give them below.
import numpy as np
import pandas as pd
from scipy import signal
import pyqtgraph as pg
These are just the necessary imports, used in all the code below. pyqtgraph is only used for displaying, of course, so you do not really need it.
Symmetrical Smoothing
This can be used to create a smooth line which is always above the signal, but it cannot distinguish between rising and falling edges, so the curve around a single peak will look symmetrical. In many cases this might be quite okay, and it is far less complex than the asymmetrical solution below (it also does not have any quirks I know about).
s = np.repeat([5, 5, 5, 8, 8, 8, 2, 2, 2, 2, 2, 3, 3, 7], 400) + 0.1
s *= np.random.random(len(s))
pre = post = 400
x = pd.Series(np.pad(s, (pre, post), 'edge')).rolling(
pre + 1 + post).max().get_values()[pre+post:]
y = pd.Series(np.pad(x, (pre, post), 'edge')).rolling(
pre + 1 + post, win_type='blackman').mean().get_values()[pre+post:]
p = pg.plot(s, pen=(100,100,100))
for c, pen in ((x, (0, 200, 200)),
(y, pg.mkPen((255, 255, 255), width=3, style=3))):
p.plotItem.plot(c, pen=pen)
Create a rolling maximum (x, cyan), and
create a windowed rolling mean of this (y, white dotted).
Asymmetrical Smoothing
My use case called for a version which can distinguish between rising and falling edges: the speed of the output should be different when falling and when rising.
Comment: Used as an envelope for a compressor/expander, a quickly rising curve would mean the effect of a sudden loud noise is dampened almost completely, while a slowly rising curve would mean the signal is slowly compressed for a long time before the loud noise, keeping the dynamics when the bang appears. On the other hand, if the curve falls quickly after a loud noise, quiet material shortly after a bang becomes audible, while a slowly falling curve keeps the dynamics there as well and only slowly expands the signal back to normal levels.
s = np.repeat([5, 5, 5, 8, 8, 8, 2, 2, 2, 2, 2, 3, 3, 7], 400) + 0.1
s *= np.random.random(len(s))
pre, post = 100, 1000
t = pd.Series(np.pad(s, (post, pre), 'edge')).rolling(
pre + 1 + post).max().get_values()[pre+post:]
g = signal.get_window('boxcar', pre*2)[pre:]
g /= g.sum()
u = np.convolve(np.pad(t, (pre, 0), 'edge'), g)[pre:]
g = signal.get_window('boxcar', post*2)[:post]
g /= g.sum()
v = np.convolve(np.pad(t, (0, post), 'edge'), g)[post:]
u, v = u[:len(v)], v[:len(u)]
w = np.min(np.array([ u, v ]),0)
pre = post = max(100, min(pre, post)*3)
x = pd.Series(np.pad(w, (pre, post), 'edge')).rolling(
pre + 1 + post).max().get_values()[pre+post:]
y = pd.Series(np.pad(x, (pre, post), 'edge')).rolling(
pre + 1 + post, win_type='blackman').mean().get_values()[pre+post:]
p = pg.plot(s, pen=(100,100,100))
for c, pen in ((t, (200, 0, 0)),
(u, (200, 200, 0)),
(v, (0, 200, 0)),
(w, (200, 0, 200)),
(x, (0, 200, 200)),
(y, pg.mkPen((255, 255, 255), width=3))):
p.plotItem.plot(c, pen=pen)
This sequence ruthlessly combines several signal-processing methods.
The input signal is shown in grey. It is a noisy version of the input mentioned above.
A rolling maximum is applied to this (t, red).
Then a specially designed convolution curve for the falling edges is created (g) and the convolution is computed (u, yellow).
This is repeated for the rising edges with a different convolution curve (g again) and the convolution is computed (v, green).
The minimum of u and v is a curve having the desired slopes but is not very smooth yet; especially it has ugly spikes when the falling and the rising slope reach into each other (w, purple).
On this the symmetrical method above is applied:
Create a rolling maximum of this curve (x, cyan).
Create a windowed rolling mean of this curve (y, white dotted).
As an initial stab at part of the problem, I've produced a function which fits a polynomial to the data by minimising the integral subject to constraints that the polynomial be strictly above the points. I suspect if you do this piecewise over your data, it may work for you.
import numpy as np
import scipy.optimize
def upperpoly(xdata, ydata, order):
def objective(p):
"""Minimize integral"""
pint = np.polyint(p)
integral = np.polyval(pint, xdata[-1]) - np.polyval(pint, xdata[0])
return integral
def constraints(p):
"""Polynomial values be > data at every point"""
return np.polyval(p, xdata) - ydata
p0 = np.polyfit(xdata, ydata, order)
y0 = np.polyval(p0, xdata)
shift = (ydata - y0).max()
p0[-1] += shift
result = scipy.optimize.minimize(objective, p0,
constraints={'type':'ineq',
'fun': constraints})
return result.x
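For illustration, this is how it could be called on the sample data from the question (a sketch; order is a free parameter to tune, and for long signals you would apply it piecewise):
original = np.array([5, 5, 5, 8, 8, 8, 2, 2, 2, 2, 2, 3, 3, 7], dtype=float)
xdata = np.arange(len(original), dtype=float)

coeffs = upperpoly(xdata, original, order=5)  # polynomial coefficients, highest degree first
upper = np.polyval(coeffs, xdata)             # smooth curve lying (approximately) on or above the points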
As pointed out in my note, the behaviour of your green line is inconsistent in the regions before and after the eight-high plateau. If the left region behavior is what you want, you could do something like this:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from scipy.spatial import ConvexHull
# %matplotlib inline # for interactive notebooks
y=np.array([ 5, 5, 5, 8, 8, 8, 2, 2, 2, 2, 2, 3, 3, 7])
x=np.array(range(len(y)))
#######
# This essentially selects the vertices that you'd touch when stretching a
# rubber band over the top of the function
vs = ConvexHull(np.asarray([x,y]).transpose()).vertices
indices_of_upper_hull_verts = list(reversed(np.concatenate([vs[np.where(vs == len(x)-1)[0][0]: ],vs[0:1]])))
newX = x[indices_of_upper_hull_verts]
newY = y[indices_of_upper_hull_verts]
#########
x_smooth = np.linspace(newX.min(), newX.max(),500)
f = interp1d(newX, newY, kind='quadratic')
y_smooth=f(x_smooth)
plt.plot (x,y)
plt.plot (x_smooth,y_smooth)
plt.scatter (x, y)
which yields:
UPDATE:
Here's an alternative that might better suit you. If instead of a rolling average you use a simple convolution centered around 1, the resulting curve will always be larger than the input. Wings of the convolution kernel can be adjusted for look-ahead/look-behind.
Like this:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from scipy.ndimage import convolve
## For interactive notebooks
#%matplotlib inline
y=np.array([ 5, 5, 5, 8, 8, 8, 2, 2, 2, 2, 2, 3, 3, 7]).astype(float)
preLength = 1
postLength = 1
preWeight = 0.2
postWeight = 0.2
kernel = [preWeight/preLength for i in range(preLength)] + [1] + [postWeight/postLength for i in range(postLength)]
output = convolve(y, kernel)
x=np.array(range(len(y)))
plt.plot (x,y)
plt.plot (x,output)
plt.scatter (x, y)
A drawback is that, because the integrated kernel will typically be larger than one (which is what ensures that the output curve is smooth and never below the input), the output curve will always be somewhat larger than the input curve, e.g. floating above the large peak rather than sitting right on top of it as you drew.
This is my first question on Stack Overflow, so I apologize if I word it poorly. I am writing code to take raw acceleration data from an IMU and then integrate it to update the position of an object. Currently this code takes a new accelerometer reading every millisecond and uses it to update the position. My system has a lot of noise, which results in crazy readings due to compounding error, even with the ZUPT scheme I implemented. I know that a Kalman filter is theoretically ideal for this scenario, and I would like to use the pykalman module instead of building one myself.
My first question is: can pykalman be used in real time like this? From the documentation it looks to me like you have to have a record of all measurements and then perform the smooth operation, which would not be practical, as I want to filter recursively every millisecond.
My second question is, for the transition matrix can I only apply pykalman to the acceleration data by itself, or can I somehow include the double integration to position? What would that matrix look like?
If pykalman is not practical for this situation, is there another way I can implement a Kalman Filter? Thank you in advance!
You can use a Kalman filter in this case, but your position estimation will strongly depend on the precision of your acceleration signal. The Kalman filter is actually most useful for fusing several signals, so that the error of one signal can be compensated by another. Ideally you use sensors based on different physical effects (for example an IMU for acceleration, GPS for position, odometry for velocity).
In this answer I'm going to use readings from two acceleration sensors (both in the X direction). One of these sensors is expensive and precise; the second one is much cheaper. You will see the influence of the sensor precision on the position and velocity estimates.
You already mentioned the ZUPT scheme. I just want to add some notes: it is very important to have a good estimation of the pitch angle, to get rid of the gravitation component in your X-acceleration. If you use Y- and Z-acceleration you need both pitch and roll angles.
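Just to make that concrete, here is a minimal sketch of the X-axis gravity compensation (the sign depends on how your IMU defines its axes, so treat it as an assumption to verify against your setup):
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def remove_gravity_x(acc_x_measured, pitch_rad):
    # Subtract the gravity component projected onto the body X axis.
    # Assumes a nose-up pitch projects +G*sin(pitch) onto X; flip the sign
    # if your convention differs.
    return acc_x_measured - G * np.sin(pitch_rad)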
Let's start with modelling. Assume you have only acceleration readings in the X-direction, so your observation is simply z = [AccX].
Now you need to define the smallest data set which completely describes your system at each point in time. This will be the system state: x = [PosX, VelX, AccX].
The mapping between the measurement and state domains is defined by the observation matrix H = [0, 0, 1].
Now you need to describe the system dynamics. Based on this information the filter will predict a new state from the previous one. The transition matrix is
F = [[1, dt, 0.5*dt^2],
     [0,  1,       dt],
     [0,  0,        1]]
In my case dt = 0.01 s. Using this matrix the filter will integrate the acceleration signal to estimate the velocity and position.
The observation covariance R can be described by the variance of your sensor readings. In my case I have only one signal in my observation, so the observation covariance is equal to the variance of the X-acceleration (the value can be calculated based on your sensor's datasheet).
Through the transition covariance Q you describe the system noise. The smaller the matrix values, the smaller the assumed system noise: the filter becomes stiffer, the estimation is delayed, and the weight of the system's past is higher compared to a new measurement. With larger values the filter is more flexible and reacts more strongly to each new measurement.
Now everything is ready to configure pykalman. In order to use it in real time, you have to use the filter_update function.
from pykalman import KalmanFilter
import numpy as np
import matplotlib.pyplot as plt
load_data()  # placeholder for loading the arrays described below
# Data description
# Time
# AccX_HP - high precision acceleration signal
# AccX_LP - low precision acceleration signal
# RefPosX - real position (ground truth)
# RefVelX - real velocity (ground truth)
# switch between two acceleration signals
use_HP_signal = 1
if use_HP_signal:
AccX_Value = AccX_HP
AccX_Variance = 0.0007
else:
AccX_Value = AccX_LP
AccX_Variance = 0.0020
# time step
dt = 0.01
# transition_matrix
F = [[1, dt, 0.5*dt**2],
[0, 1, dt],
[0, 0, 1]]
# observation_matrix
H = [0, 0, 1]
# transition_covariance
Q = [[0.2, 0, 0],
[ 0, 0.1, 0],
[ 0, 0, 10e-4]]
# observation_covariance
R = AccX_Variance
# initial_state_mean
X0 = [0,
0,
AccX_Value[0, 0]]
# initial_state_covariance
P0 = [[ 0, 0, 0],
[ 0, 0, 0],
[ 0, 0, AccX_Variance]]
n_timesteps = AccX_Value.shape[0]
n_dim_state = 3
filtered_state_means = np.zeros((n_timesteps, n_dim_state))
filtered_state_covariances = np.zeros((n_timesteps, n_dim_state, n_dim_state))
kf = KalmanFilter(transition_matrices = F,
observation_matrices = H,
transition_covariance = Q,
observation_covariance = R,
initial_state_mean = X0,
initial_state_covariance = P0)
# iterative estimation for each new measurement
for t in range(n_timesteps):
if t == 0:
filtered_state_means[t] = X0
filtered_state_covariances[t] = P0
else:
filtered_state_means[t], filtered_state_covariances[t] = (
kf.filter_update(
filtered_state_means[t-1],
filtered_state_covariances[t-1],
AccX_Value[t, 0]
)
)
f, axarr = plt.subplots(3, sharex=True)
axarr[0].plot(Time, AccX_Value, label="Input AccX")
axarr[0].plot(Time, filtered_state_means[:, 2], "r-", label="Estimated AccX")
axarr[0].set_title('Acceleration X')
axarr[0].grid()
axarr[0].legend()
axarr[0].set_ylim([-4, 4])
axarr[1].plot(Time, RefVelX, label="Reference VelX")
axarr[1].plot(Time, filtered_state_means[:, 1], "r-", label="Estimated VelX")
axarr[1].set_title('Velocity X')
axarr[1].grid()
axarr[1].legend()
axarr[1].set_ylim([-1, 20])
axarr[2].plot(Time, RefPosX, label="Reference PosX")
axarr[2].plot(Time, filtered_state_means[:, 0], "r-", label="Estimated PosX")
axarr[2].set_title('Position X')
axarr[2].grid()
axarr[2].legend()
axarr[2].set_ylim([-10, 1000])
plt.show()
When using the better IMU-sensor, the estimated position is exactly the same as the ground truth:
The cheaper sensor gives significantly worse results:
I hope I could help you. If you have some questions, I will try to answer them.
UPDATE
If you want to experiment with different data you can generate them easily (unfortunately I don't have the original data any more).
Here is a simple matlab script to generate reference, good and poor sensor set.
clear;
dt = 0.01;
t=0:dt:70;
accX_var_best = 0.0005; % (m/s^2)^2
accX_var_good = 0.0007; % (m/s^2)^2
accX_var_worst = 0.001; % (m/s^2)^2
accX_ref_noise = randn(size(t))*sqrt(accX_var_best);
accX_good_noise = randn(size(t))*sqrt(accX_var_good);
accX_worst_noise = randn(size(t))*sqrt(accX_var_worst);
accX_basesignal = sin(0.3*t) + 0.5*sin(0.04*t);
accX_ref = accX_basesignal + accX_ref_noise;
velX_ref = cumsum(accX_ref)*dt;
distX_ref = cumsum(velX_ref)*dt;
accX_good_offset = 0.001 + 0.0004*sin(0.05*t);
accX_good = accX_basesignal + accX_good_noise + accX_good_offset;
velX_good = cumsum(accX_good)*dt;
distX_good = cumsum(velX_good)*dt;
accX_worst_offset = -0.08 + 0.004*sin(0.07*t);
accX_worst = accX_basesignal + accX_worst_noise + accX_worst_offset;
velX_worst = cumsum(accX_worst)*dt;
distX_worst = cumsum(velX_worst)*dt;
subplot(3,1,1);
plot(t, accX_ref);
hold on;
plot(t, accX_good);
plot(t, accX_worst);
hold off;
grid minor;
legend('ref', 'good', 'worst');
title('AccX');
subplot(3,1,2);
plot(t, velX_ref);
hold on;
plot(t, velX_good);
plot(t, velX_worst);
hold off;
grid minor;
legend('ref', 'good', 'worst');
title('VelX');
subplot(3,1,3);
plot(t, distX_ref);
hold on;
plot(t, distX_good);
plot(t, distX_worst);
hold off;
grid minor;
legend('ref', 'good', 'worst');
title('DistX');
The simulated data looks pretty much the same as the data above.
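If you prefer to stay in Python, here is a rough NumPy sketch of the same generator; mapping the generated arrays onto the variable names used by the pykalman snippet above is my assumption:
import numpy as np

dt = 0.01
t = np.arange(0, 70 + dt, dt)
rng = np.random.default_rng(0)

base = np.sin(0.3 * t) + 0.5 * np.sin(0.04 * t)

accX_ref = base + rng.normal(scale=np.sqrt(0.0005), size=t.shape)
accX_good = base + rng.normal(scale=np.sqrt(0.0007), size=t.shape) + 0.001 + 0.0004 * np.sin(0.05 * t)
accX_worst = base + rng.normal(scale=np.sqrt(0.001), size=t.shape) - 0.08 + 0.004 * np.sin(0.07 * t)

velX_ref = np.cumsum(accX_ref) * dt
distX_ref = np.cumsum(velX_ref) * dt

# Names expected by the filter example above (AccX_* as column vectors):
Time = t
AccX_HP = accX_good.reshape(-1, 1)
AccX_LP = accX_worst.reshape(-1, 1)
RefVelX = velX_ref
RefPosX = distX_ref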
I'm a new learner of Python, and recently I've been working on a project to compute the joint distribution of a Markov process.
An example of a stochastic kernel is the one used in a recent study by Hamilton (2005), who investigates a nonlinear statistical model of the business cycle based on US unemployment data. As part of his calculation he estimates the kernel
pH := 0.971  0.029  0
      0.145  0.778  0.077
      0      0.508  0.492
Here S = {x1, x2, x3} = {NG, MR, SR}, where NG corresponds to normal growth, MR to mild recession, and SR to severe recession. For example, the probability of transitioning from severe recession to mild recession in one period is 0.508. The length of the period is one month.
The exercise based on the above Markov process is:
With regard to Hamilton's kernel pH, and using the same initial condition ψ = (0.2, 0.2, 0.6), compute the probability that the economy starts and remains in recession through periods 0, 1, 2 (i.e., that x_t ≠ NG for t = 0, 1, 2).
My script is like this:
import numpy as np
## In this case, X should be a matrix rather than vector
## and we compute w.r.t P rather than merely its element [i][j]
path = []
def path_prob2 (p, psi , x2): # X a sequence giving the path
prob = psi # initial distribution is an row vector
for t in range(x2.shape[1] -1): # .shape[1] grasp # of columns
prob = np.dot(prob , p) # prob[t]: marginal distribution at period t
ression = np.dot(prob, x2[:,t])
path.append(ression)
return path,prob
p = ((0.971, 0.029, 0 ),
(0.145, 0.778, 0.077),
(0 , 0.508, 0.492))
# p must be a 2-D numpy array
p = np.array(p)
psi = (0.2, 0.2, 0.6)
psi = np.array(psi)
x2 = ((0,0,0),
(1,1,1),
(1,1,1))
x2 = np.array(x2)
path_prob2(p,psi,x2)
During execution, two problems arise. The first one is: in the first round of the loop I don't need the initial distribution psi to post-multiply the transition matrix p, so the probability of "remaining in recession" should be 0.2 + 0.6 = 0.8, but I don't know how to write the if-statement.
The second one is, as you may have noticed, that I use a list named path to collect the probability of "remaining in recession" in each period. At the end I need to multiply every element in the list together, like path[0]*path[1]*path[2], but I haven't found a method to do that (np.multiply can only take two arguments as far as I know). Please give me some clues if such a method exists.
Additionally, please give me any suggestions that you think could make the code more efficient. Thank you.
If I understood you correctly, this should work (I'd love to see manual calculations for some of the steps/outcomes). Note that I didn't use an if/else statement but instead started iterating from the second column:
import numpy as np
# In this case, X should be a matrix rather than vector
# and we compute w.r.t P rather than merely its element [i][j]
path = []
def path_prob2(p, psi, x2): # X a sequence giving the path
path.append(np.dot(psi, x2[:, 0])) # first step
prob = psi # initial distribution is an row vector
for t in range(1, x2.shape[1]): # .shape[1] grasp # of columns
prob = np.dot(prob, p) # prob[t]: marginal distribution at period t
path.append(np.dot(prob, x2[:, t]))  # probability of being in recession at period t
return path, prob
# p must be a 2-D numpy array
p = np.array([[0.971, 0.029, 0],
[0.145, 0.778, 0.077],
[0, 0.508, 0.492]])
psi = np.array([0.2, 0.2, 0.6])
x2 = np.array([[0, 0, 0],
[1, 1, 1],
[1, 1, 1]])
print(path_prob2(p, psi, x2))
For your second question, I believe numpy.prod will give you the product of all elements of a list/array.
You can use the prod as such:
>>> np.prod([15,20,31])
9300
I'm using statsmodels' weighted least squares regression, but getting some really huge values.
Here's my code:
import numpy as np
import statsmodels.api as sm

X = np.array([[1,2,3],[1,2,3],[4,5,6],[1,2,3],[4,5,6],[1,2,3],[1,2,3],[4,5,6],[4,5,6],[1,2,3]])
y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])
w = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
temp_g = sm.WLS(y, X, w).fit()
Now, what I understand is that in WLS regression, just like in any linear regression problem, we provide the endog vector and the exog matrix, and the function finds the line of best fit and tells us what the coefficients/regression parameters ought to be. For example, in my data, where each observation consists of 3 features, I'm expecting 3 parameters.
So I fetch them like this:
parameters = temp_g.params # I'm hoping I've got this right! Or do I need to use "fittedvalues" instead?
The issue is that I'm getting really huge values like this:
temp g params :
[ -7.66645036e+198 -9.01935337e+197 5.86257969e+198]
or this:
temp g params :
[-2.77777778 -0.44444444 1.88888889]
This creates problems in further usage of these parameters, especially since I also have some exponents to work with: I need to raise e to the power of some of the regression parameters, which is proving impossible with such big numbers, because I keep getting overflow errors when using exp().
Is this normal? Am I doing something wrong? Or is there a specific way to make them useful?