I am trying to use Kalman filtering on my one-dimensional data. So, assume that I have the following dataset:
Variable
250.1
248.5
262.3
265.3
270.2
I know there is noise in my data, so I want to clean it using Kalman filtering. What is the most effective way to do that?
I run the following code:
from pykalman import KalmanFilter
import numpy as np
kf = KalmanFilter(transition_matrices = [[1, 1], [0, 1]],
                  observation_matrices = [[0.1, 0.5], [-0.3, 0.0]])
measurements = np.asarray([(250.1), (248.5), (262.3), (265.3), (270.2)])
kf = kf.em(measurements, n_iter=5)
(filtered_state_means, filtered_state_covariances) = kf.filter(measurements)
(smoothed_state_means, smoothed_state_covariances) = kf.smooth(measurements)
As you can see, I am trying to use pykalman, but I cannot install the module. I tried easy_install pykalman, and the error is "invalid syntax". Another problem is that I have a huge data set, more than one hundred thousand rows in my variable column, so I cannot write out all the observations one by one.
To install pykalman I used:
pip install pykalman --user
The --user flag installs to my home directory, avoiding the need for sudo. I was told that scipy was missing, so I pip-installed that as well. The project's GitHub page lists the dependencies, so you may need to install some of those too.
You are using a single value for each of your readings. Most examples have more than this, for instance position and velocity for each reading. To get something to plot with the transition and observation matrices you supplied, I added a second bogus reading of 1 to each of your measurements. The following Jupyter notebook script will produce a plot, but the output is poor because the matrix values need to be adjusted to your data set.
%matplotlib inline
from pykalman import KalmanFilter
import numpy as np
import matplotlib.pyplot as pl  # needed for the plotting calls below

kf = KalmanFilter(transition_matrices = [[1, 1], [0, 1]],
                  observation_matrices = [[0.1, 0.5], [-0.3, 0.0]])
# measurements = np.asarray([(250.1),(248.5),(262.3),(265.3), (270.2)])
measurements = np.array([[250.1, 1], [248.5, 1], [262.3, 1], [265.3, 1], [270.2, 1]])
kf = kf.em(measurements, n_iter=5)
filtered_state_estimates = kf.filter(measurements)[0]
(smoothed_state_estimates, smoothed_state_covariances) = kf.smooth(measurements)

# draw estimates
pl.figure()
lines_true = pl.plot(measurements, color='b')
lines_filt = pl.plot(filtered_state_estimates, color='r')
lines_smooth = pl.plot(smoothed_state_estimates, color='g')
pl.legend((lines_true[0], lines_filt[0], lines_smooth[0]),
          ('true', 'filt', 'smooth'),
          loc='lower right')
pl.show()
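Regarding the hundred-thousand-row concern from the question: rather than typing the observations in, load them from wherever they live. A minimal sketch, assuming a hypothetical one-column text file data.txt with one reading per line:

import numpy as np

values = np.loadtxt('data.txt')  # hypothetical file: one reading per line
# add the same bogus second reading used above
measurements = np.column_stack([values, np.ones_like(values)])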
For the data set that you describe, a faster and simpler way to produce a filtered output would be a one-minus-alpha (exponentially weighted moving average) filter. Have a look at this link for more details on this type of filter:
http://stats.stackexchange.com/questions/44650/a-simpler-way-to-calculate-exponentially-weighted-moving-average
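For reference, a minimal sketch of such a filter; the smoothing factor alpha is a hand-tuned assumption (0.3 here is arbitrary, larger values track the data more closely):

import numpy as np

def ewma(x, alpha=0.3):
    # y[t] = alpha * x[t] + (1 - alpha) * y[t-1]
    y = np.empty(len(x))
    y[0] = x[0]
    for t in range(1, len(x)):
        y[t] = alpha * x[t] + (1 - alpha) * y[t - 1]
    return y

print(ewma(np.array([250.1, 248.5, 262.3, 265.3, 270.2])))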
I am trying to do a simple linear curve fit with scipy. Normally this method works fine for me, but this time, for a reason unknown to me, it doesn't.
(I suspect that maybe the numbers are so big that it reaches the limit of what can be stored under a given data type.)
Regardless of the reason, the idea is to make a plot that looks like this (image not shown). As you can see on its axes, the numbers there are of a common order of magnitude. This time, however, I tried to fit much bigger data points, on the order of 1E10. For this I used the following code (here I present only the code for making a scatter plot and then fitting one data set).
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

ucrt_T = 2/np.sqrt(3)
ucrt_U = 0.1/np.sqrt(3)

T = [314.1, 325.1, 335.1, 345.1, 355.1, 365.1, 374.1, 384.1, 393.1]
T_to_4th = [9733560790.61, 11170378213.80, 12609495509.84, 14183383217.88, 15900203737.92, 17768359469.96, 19586229219.65, 21765930026.49, 23878782252.31]
ucrt_T_lst = [143130823.11, 158701221.00, 173801148.95, 189829733.26, 206814686.75, 224783722.22, 241820148.88, 261735288.93, 280568229.17]
UBlack = [1.9, 3.1, 4.4, 5.6, 7.0, 8.7, 10.2, 11.8, 13.4]

def lin_function(x, a, b):
    return a*x + b

def line_fit_2():
    # Add the remaining points to the plot
    plt.scatter(UBlack, T_to_4th, color='blue')
    plt.errorbar(UBlack, T_to_4th, yerr=ucrt_T, fmt='o')

    # BLACK series
    VltBlack = np.array(UBlack)
    Tt4 = np.array(T_to_4th)

    popt, pcov = curve_fit(lin_function, VltBlack, Tt4, absolute_sigma=False)
    perr = np.sqrt(np.diag(pcov))
    y = lin_function(VltBlack, *popt)

    # Plot styling and appearance
    #plt.plot(Pressure1, y, '--', color = 'g', label="fit with: $a={:.3f}\pm{:.3f}$, $b={:.3f}\pm{:.3f}$" .format(popt[0], perr[0], popt[1], perr[1]))
    plt.plot(VltBlack, y, '--', color='green')
    plt.ylabel(r'$T^4$ in $[K^4]$')
    plt.xlabel(r'Thermometer voltage U in [mV]')
    plt.legend(['Fit', 'Data points'])
    plt.grid()
    plt.show()

line_fit_2()
If you run it, you will find that the scatter plot is created but the fit isn't executed properly; only a horizontal line is added. Additionally, a warning is raised: OptimizeWarning: Covariance of the parameters could not be estimated.
I would be very happy to know what I am doing wrong or how to resolve this problem. All help is appreciated!
You've pretty much already answered your question, so I'll just confirm your suspicion: the OptimizeWarning is raised because the underlying optimization algorithm doesn't work properly (it diverges) when the parameter values are so large.
The solution is very simple, just scale your input parameters before using the fitting tool. Just keep the scaling in mind when you add labels to your x/y axis:
T_to_4th = np.array([9733560790.61, 11170378213.80, 12609495509.84, 14183383217.88, 15900203737.92, 17768359469.96, 19586229219.65, 21765930026.49, 23878782252.31])/10e6
ucrt_T_lst = np.array([143130823.11, 158701221.00, 173801148.95, 189829733.26, 206814686.75, 224783722.22, 241820148.88, 261735288.93, 280568229.17])/10e6
What I did is just divide the lists with big numbers by 10e6 (i.e. 1e7). Think of it as a unit change: if the values were in kPa, for example, dividing by 1e6 would express them in GPa.
To divide the entire list by the same value, first convert it to a numpy array.
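Putting it together, a minimal sketch of the scaled fit using the numbers from the question (the scale factor 10e6 is just the one chosen above):

import numpy as np
from scipy.optimize import curve_fit

def lin_function(x, a, b):
    return a * x + b

UBlack = np.array([1.9, 3.1, 4.4, 5.6, 7.0, 8.7, 10.2, 11.8, 13.4])
T_to_4th = np.array([9733560790.61, 11170378213.80, 12609495509.84,
                     14183383217.88, 15900203737.92, 17768359469.96,
                     19586229219.65, 21765930026.49, 23878782252.31]) / 10e6

popt, pcov = curve_fit(lin_function, UBlack, T_to_4th)
perr = np.sqrt(np.diag(pcov))
# the fit now converges; remember the y values are scaled by 1/10e6
print(popt, perr)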
Hope this helps :)
I'm trying to understand this example of a Bayesian network. Figured I'd dumb it down even more such that it's only looking at three variables: D1, D2, and D3. Each is categorical, with their probability tables given at the top of the code below. I'd like to set D3 = 0 and then compute the posterior probabilities of D1 and D2, like a simpler version of what's done at the bottom of this page. I've tried to do this by playing with the code from the first source but have been unsuccessful and I don't understand the error messages.
Any assistance in this would be greatly appreciated - I've really been struggling to implement Bayesian inference. I've tried looking at the PYMC3 Categorical documentation but it's pretty bare-bones. And the example of inference I could find uses continuous variables and seems to be doing a different thing than what I'm trying to do. Or if it isn't, I'm not smart enough to make the connection and use whatever they're demonstrating to meet my needs.
I'm not sure whether posting large sections of code is approved here, but I don't know how else to show this. Here is my code (a much shorter, simpler version of the code in the first source):
import networkx as nx
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pymc3 as pm
import theano
import theano.tensor as T
from theano.compile.ops import as_op
d1_prob = np.array([0.3, 0.7])       # 2 choices
d2_prob = np.array([0.6, 0.3, 0.1])  # 3 choices
d3_prob = np.array([[[0.1, 0.9],     # (2x3)x2 choices
                     [0.3, 0.7],
                     [0.4, 0.6]],
                    [[0.6, 0.4],
                     [0.8, 0.2],
                     [0.9, 0.1]]])

BN = nx.DiGraph()
BN.add_node('D1', dtype='Discrete', prob=d1_prob)
BN.add_node('D2', dtype='Discrete', prob=d2_prob)
BN.add_node('D3', dtype='Discrete', prob=d3_prob, observe=np.array([0.]))
BN.add_edges_from([('D1', 'D3'), ('D2', 'D3')])
#print(BN.nodes(data=True))
#print(BN.pred['D3'])

def gpm(BN, node, num=0):
    return BN.node[BN.predecessors(node)[num]]['dist_obj']

with pm.Model() as mod2:
    BN.node['D1']['dist_obj'] = pm.Categorical('D1', p=BN.node['D1']['prob'])
    BN.node['D2']['dist_obj'] = pm.Categorical('D2', p=BN.node['D2']['prob'])
    BN.node['D3']['dist_obj'] = pm.Categorical('D3', p=BN.node['D3']['prob'][
                                    gpm(BN, 'D3', num=1),
                                    gpm(BN, 'D3', num=0)
                                ], observed=BN.node['D3']['observe'])

with mod2:
    trace = pm.sample(10000)
    pm.summary(trace, varnames=['D3'], start=1000)
    pm.traceplot(trace[1000:], varnames=['D3'])
I can't help you with PyMC3, sorry. But maybe you just need the numbers.
Actually, I don't understand why you need an inference algorithm at all here. The probability tables are fully specified and there is no missing data, so you can just apply Bayes' rule. Admittedly, I don't want to do this with pencil and paper even for such a simple example, so I used SamIam, a Java-based GUI tool, to apply Bayes' rule for me.
When nothing is observed, SamIam shows the prior marginals (screenshot not included).
Interpreting your gpm() and observe() code, you observe D3 = 1. The CPT values then change to the posterior values shown in SamIam (screenshot not included).
(The stateX labels are arbitrary defaults assigned by SamIam; the row position in the CPT is what matters.)
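If you just want the numbers without a GUI, here is a minimal numpy sketch of the same Bayes'-rule computation, conditioning on D3 = 0 as in the question; the assumption that d3_prob is indexed as [d1, d2, d3] matches the array shape in the question, but check it against your intent:

import numpy as np

d1_prob = np.array([0.3, 0.7])
d2_prob = np.array([0.6, 0.3, 0.1])
d3_prob = np.array([[[0.1, 0.9],
                     [0.3, 0.7],
                     [0.4, 0.6]],
                    [[0.6, 0.4],
                     [0.8, 0.2],
                     [0.9, 0.1]]])  # assumed indexing: [d1, d2, d3]

# joint P(D1, D2, D3=0) = P(D1) * P(D2) * P(D3=0 | D1, D2)
joint = d1_prob[:, None] * d2_prob[None, :] * d3_prob[:, :, 0]
evidence = joint.sum()        # P(D3=0)
posterior = joint / evidence  # P(D1, D2 | D3=0)
print("P(D1 | D3=0):", posterior.sum(axis=1))
print("P(D2 | D3=0):", posterior.sum(axis=0))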
I am new to working with pymc3 and I am having trouble generating an easy-to-read traceplot.
I'm fitting a mixture of 4 multivariate gaussians to some (x, y) points in a dataset. The model runs fine. My question is with regard to manipulating the pm.traceplot() command to make the output more user-friendly.
Here's my code:
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm          # needed for the model below
import theano.tensor as tt  # needed for tt.switch / tt.min / tt.all

# points: an (N, 2) array of (x, y) observations, defined elsewhere

model = pm.Model()
N_CLUSTERS = 4
with model:
    # cluster prior
    w = pm.Dirichlet('w', np.ones(N_CLUSTERS))
    # latent cluster of each observation
    category = pm.Categorical('category', p=w, shape=len(points))
    # make sure each cluster has some values:
    w_min_potential = pm.Potential('w_min_potential', tt.switch(tt.min(w) < 0.1, -np.inf, 0))
    # multivariate normal means
    mu = pm.MvNormal('mu', [0, 0], cov=[[1, 0], [0, 1]], shape=(N_CLUSTERS, 2))
    # break symmetry
    pm.Potential('order_mu_potential', tt.switch(
        tt.all([mu[i, 0] < mu[i+1, 0] for i in range(N_CLUSTERS - 1)]),
        -np.inf, 0))
    # multivariate observations
    data = pm.MvNormal('data', mu=mu[category], cov=[[1, 0], [0, 1]], observed=points)

with model:
    trace = pm.sample(1000)
A call to pm.traceplot(trace, ['w', 'mu']) produces a plot (not shown here) in which it is ambiguous which mean peak corresponds to an x or y value, and which peaks are paired together. I have managed a workaround as follows:
from cycler import cycler

# plot the x-means and y-means of our data!
fig, (ax0, ax1) = plt.subplots(nrows=2)
plt.xlabel('$\mu$')
plt.ylabel('frequency')
# set the color cycle before plotting so it takes effect on the histograms
ax0.set_prop_cycle(cycler('color', ['c', 'm', 'y', 'k']))
ax1.set_prop_cycle(cycler('color', ['c', 'm', 'y', 'k']))
for i in range(4):
    ax0.hist(trace['mu'][:, i, 0], bins=100, label='x{}'.format(i), alpha=0.6)
    ax1.hist(trace['mu'][:, i, 1], bins=100, label='y{}'.format(i), alpha=0.6)
ax0.legend()
ax1.legend()
This produces a much more legible plot (not shown here). I have looked through the pymc3 documentation and recent questions here, but to no avail. My question is this: is it possible to do what I have done here with matplotlib via built-in methods in pymc3, and if so, how?
Better differentiation between multidimensional variables and the different chains was recently added to ArviZ (the library PyMC3 relies on for plotting).
In ArviZ latest version, you should be able to do:
az.plot_trace(trace, compact=True, legend=True)
to get the different dimensions of each variable distinguished by color and the different chains distinguished by linestyle. By default it uses matplotlib's color cycle and four linestyles: solid, dashed, dotted, and dash-dotted. Both properties can be set to custom values: use compact_prop to customize the dimension representation and chain_prop to customize the chain representation. In addition, when using compact, it may be a good idea to pass combined=True to reduce clutter in the first column. As an example:
az.plot_trace(trace, compact=True, combined=True, legend=True, chain_prop=("ls", "-"))
would plot the KDEs in the first column using the data from all chains, and would plot all chains using a solid linestyle (due to combined arg, only relevant for the second column). Two legends will be shown, one for the chain info and another for the compact info.
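For instance, to pick the dimension colors yourself, a sketch following the same tuple convention shown for chain_prop above (depending on your ArviZ version, compact_prop may instead expect a dict such as {"color": [...]}):

az.plot_trace(trace, compact=True, combined=True, legend=True,
              compact_prop=("color", ["C0", "C1", "C2", "C3"]))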
At least in recent versions, you can use compact=True, as in:
pm.traceplot(trace, var_names=['parameters'], compact=True)
to get one graph with all your params combined.
Docs: https://arviz-devs.github.io/arviz/_modules/arviz/plots/traceplot.html
However, I haven't been able to get the colors to differ between lines.
I'm using SciPy instead of MATLAB in a control systems class to plot the step responses of LTI systems. It's worked great so far, but I've run into an issue with a very specific system. With this code:
from numpy import min
from scipy import linspace
from scipy.signal import lti, step
from matplotlib import pyplot as p
# Create an LTI transfer function from coefficients
tf = lti([64], [1, 16, 64])
# Step response (redo it to get better resolution)
t, s = step(tf)
t, s = step(tf, T = linspace(min(t), t[-1], 200))
# Plotting stuff
p.plot(t, s)
p.xlabel('Time / s')
p.ylabel('Displacement / m')
p.show()
The code as-is displays a flat line. If I modify the final coefficient of the denominator to 64.0000001 (i.e., tf = lti([64], [1, 16, 64.0000001])), then it works as it should, showing an underdamped step response. Setting the coefficient to 63.9999999 also works. Changing all the coefficients to have explicit decimal places (i.e., tf = lti([64.0], [1.0, 16.0, 64.0])) doesn't affect anything, so I guess it's not a case of integer division messing things up.
Is this a bug in SciPy, or am I doing something wrong?
This is a limitation of the implementation of the step function. It uses a matrix exponential to find the step response, and it doesn't handle repeated poles well. (Your system has a repeated pole at -8.)
Instead of using step, you can use the function scipy.signal.step2:
In [253]: from scipy.signal import lti, step2
In [254]: sys = lti([64], [1, 16, 64])
In [255]: t, y = step2(sys)
In [256]: plot(t, y)
Out[256]: [<matplotlib.lines.Line2D at 0x5ec6b90>]
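For a standalone script (outside IPython, where plot is not defined), the same idea with explicit imports; note that newer SciPy releases have deprecated step2 in favor of an improved step, so check your version:

import matplotlib.pyplot as plt
from scipy.signal import lti, step2

sys = lti([64], [1, 16, 64])  # repeated pole at -8
t, y = step2(sys)             # ODE-based solver handles the repeated pole
plt.plot(t, y)
plt.xlabel('Time / s')
plt.ylabel('Displacement / m')
plt.show()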
I have a data set that I know follows a Pareto distribution. Can someone point me to how to fit it in SciPy? I got the code below to run, but I have no idea what is being returned to me (a, b, c). Also, after obtaining a, b, c, how do I calculate the variance using them?
import scipy.stats as ss
import scipy as sp
a,b,c=ss.pareto.fit(data)
Be very careful fitting power laws! Many reported power laws are actually poorly fit by a power law. See Clauset et al. for all the details (also on arXiv if you don't have access to the journal). They have a companion website for the article, which now links to a Python implementation. I don't know whether it uses SciPy, because I used their R implementation when I last tried it.
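If you want that methodology in Python today, here is a minimal sketch using the powerlaw package (pip install powerlaw), which implements the Clauset et al. approach; whether this is the exact implementation linked from their companion site is an assumption on my part:

import numpy as np
import powerlaw

# synthetic stand-in for your data set
data = np.random.pareto(2.5, size=1000) + 1

fit = powerlaw.Fit(data)  # estimates the exponent alpha and cutoff xmin
print(fit.power_law.alpha, fit.power_law.xmin)
# sanity-check the power law against an alternative before trusting it
R, p = fit.distribution_compare('power_law', 'lognormal')
print(R, p)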
Here's a quickly written version, taking some hints from the Reference page that Rupert gave.
This is currently work in progress in scipy and statsmodels and requires MLE with some fixed or frozen parameters, which is only available in the trunk versions.
No standard errors on the parameter estimates or other result statistics are available yet.
'''estimating pareto with 3 parameters (shape, loc, scale) with nested
minimization, MLE inside minimizing Kolmogorov-Smirnov statistic

running some examples looks good

Author: josef-pktd
'''
import numpy as np
from scipy import stats, optimize
# the following adds my frozen fit method to the distributions
# scipy trunk also has a fit method with some parameters fixed.
import scikits.statsmodels.sandbox.stats.distributions_patch

true = (0.5, 10, 1.)  # try different values
shape, loc, scale = true
rvs = stats.pareto.rvs(shape, loc=loc, scale=scale, size=1000)
rvsmin = rvs.min()  # for starting value to fmin

def pareto_ks(loc, rvs):
    est = stats.pareto.fit_fr(rvs, 1., frozen=[np.nan, loc, np.nan])
    args = (est[0], loc, est[1])
    return stats.kstest(rvs, 'pareto', args)[0]

locest = optimize.fmin(pareto_ks, rvsmin*0.7, (rvs,))
est = stats.pareto.fit_fr(rvs, 1., frozen=[np.nan, locest, np.nan])
args = (est[0], locest[0], est[1])
print('estimate')
print(args)
print('kstest')
print(stats.kstest(rvs, 'pareto', args))
print('estimation error', args - np.array(true))
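On the original question of what scipy's fit returns and how to compute the variance: fit returns the tuple (shape b, loc, scale), and the variance of the fitted distribution comes from pareto.var. A minimal sketch with plain scipy.stats:

import scipy.stats as ss

data = ss.pareto.rvs(2.5, size=1000)  # stand-in for your data set
b, loc, scale = ss.pareto.fit(data)   # your (a, b, c) are (shape b, loc, scale)
variance = ss.pareto.var(b, loc=loc, scale=scale)
# note: the Pareto variance is infinite for shape b <= 2
print(b, loc, scale, variance)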
Let's say your data is formatted like this:
import openturns as ot

data = [
    2.7018013,
    8.53280352,
    1.15643882,
    1.03359467,
    1.53152735,
    32.70434285,
    12.60709624,
    2.012235,
    1.06747063,
    1.41394096,
]
sample = ot.Sample([[v] for v in data])
You can easily fit a Pareto distribution using ParetoFactory of OpenTURNS library:
distribution = ot.ParetoFactory().build(sample)
You can of course print it:
print(distribution)
>>> Pareto(beta = 0.00317985, alpha = 0.147365, gamma = 1.0283)
or plot its PDF:
from openturns.viewer import View
pdf_graph = distribution.drawPDF()
pdf_graph.setTitle(str(distribution))
View(pdf_graph, add_legend=False)
More details on the ParetoFactory are provided in the documentation.
Before passing the data to the build() function in OpenTURNS, make sure to convert it this way:
data = [[i] for i in data]
because the Sample() function may otherwise raise an error.
FYI @Tropilio