PYMC3 - Random Walk Forecasting - python

I was hoping someone could clarify something for me. I am trying to do time series forecasting with the GaussianRandomWalk distribution in PyMC3. It has been suggested to me that my code is wrong because I have modeled it so that the standard deviation of the latent walk is the same as the observation noise. Is that a mistake? If so, how would I change it?
import pymc3 as pm
#import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# generate a random walk
sd = .1
N = 200
deltas = np.random.normal(scale=sd, size=N)
y = np.cumsum(deltas)
x = np.arange(N)
df = pd.DataFrame({'y': y})
df = df.reindex(np.arange(250))
with pm.Model() as model:
    sd = pm.HalfNormal('sd')
    mu = pm.Uniform("mu", 0, 100)
    prior = pm.GaussianRandomWalk('prior', mu=mu, sd=sd, shape=len(df))
    obs = pm.Normal("obs", mu=prior, sd=sd, observed=df["y"])
    # graph = pm.model_to_graphviz(model)
    # print(graph)
    trace = pm.sample(2000, chains=1)
    pm.traceplot(trace)
    plt.show()
with model:
    ppc = pm.sample_posterior_predictive(trace)
    pm.traceplot(ppc)
    plt.show()
    print(ppc)

GaussianRandomWalk is a pure random walk, without any trend or inertia (apart from the constant drift mu). You might want to look into tfp.sts.LocalLinearTrend or pm.AR, which have some "inertia" built in.
I don't know much more about how to model time series.
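On the original question about sharing one standard deviation: the step size of the latent walk and the measurement noise are conceptually different quantities, so they would usually get their own parameters (in the model above, for instance, two separate HalfNormal priors instead of the single sd). The following is a minimal NumPy-only sketch of that generative story, not a PyMC3 model. The names sigma_walk and sigma_obs and their values are assumptions chosen for illustration. It also shows how forecast uncertainty grows with the horizon for a driftless walk.

```python
import numpy as np

rng = np.random.default_rng(42)

sigma_walk = 0.1   # assumed scale of the latent random-walk steps
sigma_obs = 0.3    # assumed scale of the measurement noise (a separate parameter)
n_obs, horizon, n_paths = 200, 50, 4000

# Latent state: cumulative sum of Gaussian steps (no drift).
steps = rng.normal(scale=sigma_walk, size=(n_paths, n_obs + horizon))
latent = np.cumsum(steps, axis=1)

# Observations: latent state plus independent measurement noise.
observed = latent + rng.normal(scale=sigma_obs, size=latent.shape)

# Error of the naive forecast "future observation = last latent state",
# evaluated h steps past the end of the observed window.
h = np.arange(1, horizon + 1)
forecast_error = observed[:, n_obs:] - latent[:, [n_obs - 1]]
empirical_sd = forecast_error.std(axis=0)

# For a random walk the h-step variance is h * sigma_walk**2,
# and the observation noise adds sigma_obs**2 on top.
theoretical_sd = np.sqrt(h * sigma_walk**2 + sigma_obs**2)

print(np.round(empirical_sd[[0, 9, 49]], 3))
print(np.round(theoretical_sd[[0, 9, 49]], 3))
```

If the two scales are forced to be equal, the model cannot distinguish a smooth latent path with noisy measurements from a jumpy latent path with precise measurements, which is why the shared sd looks suspicious.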

Related

How to plot a smooth curve in python for a list of values?

I have created a list of Shannon entropy values for a pair of multiple-sequence-aligned sequences. Plotting the values gives a simple line plot, and I want to draw a smooth curve over the lines. Can anyone suggest the right way to do this? Basically, I want a smooth curve that touches the tip of every bar and goes to zero where the y-axis value is zero.
Link to image: https://i.stack.imgur.com/SY3jH.png
#importing the relevant packages
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.interpolate import make_interp_spline
from Bio import AlignIO
import warnings
warnings.filterwarnings("ignore")
# function to calculate the Shannon entropy of one MSA column
# H = -sum[p(x) * log2(p(x))]
def shannon_entropy(list_input):
    unique_aa = set(list_input)
    # Number of residues in column
    M = len(list_input)
    entropy_list = []
    for aa in unique_aa:
        n_i = list_input.count(aa)
        P_i = n_i/float(M)
        entropy_i = P_i*(math.log(P_i,2))
        entropy_list.append(entropy_i)
    sh_entropy = -(sum(entropy_list))
    #print(sh_entropy)
    return sh_entropy
# importing the MSA (Clustal) file
align_clustal1 = AlignIO.read("/home/clustal.aln", "clustal")
def shannon_entropy_list_msa(alignment_file):
    shannon_entropy_list = []
    for col_no in range(len(list(alignment_file[0]))):
        list_input = list(alignment_file[:, col_no])
        shannon_entropy_list.append(shannon_entropy(list_input))
    return shannon_entropy_list
clustal_omega1 = shannon_entropy_list_msa(align_clustal1)
# Plotting the data
plt.figure(figsize=(18,10))
plt.plot(clustal_omega1, 'r')
plt.xlabel('Residue', fontsize=16)
plt.ylabel("Shannon's entropy", fontsize=16)
plt.show()
Edit 1:
Here is what my graph looks like after implementing the "pchip" method. link for the pchip output: https://i.stack.imgur.com/hA3KW.png
One approach would be to use PCHIP interpolation. It is shape-preserving: the curve passes through every data point, stays monotonic between neighbouring points, and does not overshoot, so it remains at zero across runs of zero values on the y-axis.
We can't run your exact code example on our machines because it points to a local Clustal file in your home directory.
Here's a simple working example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import pchip
mylist = [10,0,0,0,0,9,9,0,0,0,11,11,11,0,0]
mylist_np = np.array(mylist)
samples = np.array(range(len(mylist)))
xnew = np.linspace(samples.min(), samples.max(), 100)
plt.plot(xnew,pchip(samples, mylist_np )(xnew))
plt.show()
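To see why PCHIP fits the "goes to zero where the value is zero" requirement, it can help to compare it numerically with an ordinary cubic spline on the same toy list. This is only a sketch: PchipInterpolator and CubicSpline are the SciPy classes, while the variable names are made up for the illustration.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator, CubicSpline

values = np.array([10, 0, 0, 0, 0, 9, 9, 0, 0, 0, 11, 11, 11, 0, 0], dtype=float)
positions = np.arange(len(values))
fine_x = np.linspace(positions.min(), positions.max(), 500)

# Shape-preserving interpolation: no overshoot between data points.
pchip_curve = PchipInterpolator(positions, values)(fine_x)

# Ordinary cubic spline: smoother, but free to swing past the data.
cubic_curve = CubicSpline(positions, values)(fine_x)

print("PCHIP range:", pchip_curve.min(), pchip_curve.max())
print("cubic range:", cubic_curve.min(), cubic_curve.max())
```

The PCHIP curve stays within the range of the data, whereas the cubic spline dips below zero next to the sharp drops, which would look wrong for an entropy plot.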

generating uniform distribution of integers with python

I tried to generate a uniform distribution of random integers on a given interval (it's unimportant whether it contains its upper limit or not) with Python. I used the following snippet of code to do so and plot the result:
import numpy as np
import matplotlib.pyplot as plt
from random import randint
propsedPython = np.random.randint(0,32767,8388602)%2048
propsedPythonNoMod = np.random.randint(0,2048,8388602)
propsedPythonNoModIntegers = np.random.random_integers(0,2048,8388602)
propsedPythonNoModRandInt = np.empty(8388602)
for i in range(8388602):
    propsedPythonNoModRandInt[i] = randint(0,2048)
plt.figure(figsize=[16,10])
plt.title(r'distribution $\rho_{prop}$ off all the python simulated proposed indices')
plt.xlabel(r'indices')
plt.ylabel(r'$\rho_{prop}$')
plt.yscale('log')
plt.hist(propsedPython,bins=1000,histtype='step',label=r'np.random.randint(0,32767,8388602)%2048')
plt.hist(propsedPythonNoMod,bins=1000,histtype='step',label=r'np.random.randint(0,2048,8388602')
plt.hist(propsedPythonNoModIntegers,bins=1000,histtype='step',label=r'np.random.random_integers(0,2048,8388602)')
plt.hist(propsedPythonNoModRandInt,bins=1000,histtype='step',label=r'for i in range(8388602):propsedPythonNoModRandInt[i] = randint(0,2048)')
plt.legend(loc=0)
The resulting plot shows spikes in every case. Could somebody point me in the right direction as to why these spikes appear in all the different cases, and/or give some advice on which routine to use to get uniformly distributed random integers?
Thanks a lot!
Hmm...
I used the new NumPy random Generator facility (np.random.default_rng), and the graph looks OK to me.
Code
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng()
N = 1024*500
hist = np.zeros(2048, dtype=np.int32)
q = rng.integers(0, 2048, dtype=np.int32, size=N, endpoint=False)
for k in range(0, N):
    hist[q[k]] += 1
x = np.arange(0, 2048, dtype=np.int32)
fig, ax = plt.subplots()
ax.stem(x, hist, markerfmt=' ')
plt.show()
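A likely explanation for the spikes in the question is the binning rather than the generator: about 2048 distinct integer values are spread over 1000 equal-width bins, so some bins collect three integers while most collect two. The sketch below checks this with NumPy only. The sample size and seed are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.integers(0, 2048, size=500_000)

# How many distinct integers fall into each of 1000 equal-width bins?
values_per_bin, _ = np.histogram(np.arange(2048), bins=1000)
print("integers per bin:", np.unique(values_per_bin))        # mix of 2 and 3
print("bins holding 3 integers:", (values_per_bin == 3).sum())

# The same binning applied to uniform samples shows the spikes.
counts_1000, _ = np.histogram(samples, bins=1000)
print("max/min count with 1000 bins:", counts_1000.max() / counts_1000.min())

# One bin per integer avoids the artefact; only sampling noise remains.
counts_exact = np.bincount(samples, minlength=2048)
print("mean count per integer:", counts_exact.mean())
```

Using one bin per integer (as the answer's manual histogram does) or a bin count that divides the number of values evenly, such as 1024 or 2048, should remove the spikes.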

How to pull samples with a tweedie distribution using numpy

I'm trying to plot a CDF of random samples to compare to a target within a dataset that follows a Tweedie distribution. I know the following code will draw random samples from a Poisson distribution:
import numpy as np
import matplotlib.pyplot as plt
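# coll_df is the asker's DataFrame (not shown here).
# np.sort returns the sorted array; ndarray.sort() sorts in place and returns None.
x_r = np.sort(np.random.poisson(lam = coll_df['pure_premium'].mean(), size = len(coll_df['pure_premium'])))
y_r = np.arange(1, len(x_r)+1)/len(x_r)
_ = plt.plot(x_r, y_r, color = 'red')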
_ = plt.xlabel('Percent of Pure Premium')
_ = plt.ylabel('ECDF')
However, there is no Tweedie distribution option in NumPy's random sampling. Does anyone know how to hack this together?
PyPI has a tweedie package. A minimal example drawing a sample would be:
import tweedie, seaborn as sns, matplotlib.pyplot as plt
tvs = tweedie.tweedie(mu=10, p=1.5, phi=20).rvs(100000)
sns.distplot(tvs)
plt.show()
The package's GitHub pages have a fancier example. The package implements rv_continuous, so one gets a bunch of other functionality besides rvs(). Also, while there seem to be no nice online docs, help(tweedie.tweedie) gives lots of detail.
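If adding a dependency is not an option, the case 1 < p < 2 can be hacked together with NumPy alone, because a Tweedie variable in that range is a compound Poisson-gamma sum: a Poisson number of gamma-distributed amounts. The following is a minimal sketch under that assumption. The function name sample_tweedie is hypothetical, and the parameters mirror the mu, p, phi of the example above.

```python
import numpy as np

def sample_tweedie(mu, p, phi, size, rng):
    """Draw Tweedie samples for 1 < p < 2 via the compound Poisson-gamma form."""
    if not 1 < p < 2:
        raise ValueError("this sketch only covers 1 < p < 2")
    lam = mu ** (2 - p) / (phi * (2 - p))    # Poisson rate (number of claims)
    alpha = (2 - p) / (p - 1)                # gamma shape of one claim
    scale = phi * (p - 1) * mu ** (p - 1)    # gamma scale of one claim
    n_claims = rng.poisson(lam, size=size)
    samples = np.zeros(size)
    positive = n_claims > 0
    # A sum of n gamma(alpha, scale) draws is gamma(n * alpha, scale).
    samples[positive] = rng.gamma(shape=n_claims[positive] * alpha, scale=scale)
    return samples

rng = np.random.default_rng(0)
tvs = sample_tweedie(mu=10, p=1.5, phi=20, size=200_000, rng=rng)

# Empirical CDF of the draws, ready for plt.plot(x_r, y_r).
x_r = np.sort(tvs)
y_r = np.arange(1, len(x_r) + 1) / len(x_r)

print("mean:", tvs.mean(), "variance:", tvs.var(), "share of zeros:", (tvs == 0).mean())
```

The sample mean should be close to mu and the variance close to phi * mu**p, with a point mass at zero, which is the typical shape of pure-premium data.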

Replace outliers with neighbour-Value

I have a plot with some outliers (wrong measurements):
The base data is good, though. I just want to delete everything that is too far off the "current average". I tried using df.rolling().mean(), but without a satisfactory result:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = np.genfromtxt('shard_height_plot.csv', delimiter = ',')
df = pd.DataFrame(data)
df.set_index(0, inplace = True)
df2 = df.rolling(20).mean()
plt.plot(df)
plt.plot(df2)
plt.show()
I tried to search the web for a good solution but couldn't find one. It shouldn't be that hard to delete data points that jump through the roof, should it?
Edit:
data file can be downloaded here: https://ufile.io/pviuc
Edit2:
I tackled this problem of too many outliers by improving my data set creation.
The core of it:
if abs(D - D_List[-2]) > 30:
    D = D_List[-2]
    D_List.pop()
    D_List.append(D)
Basically, this checks whether the change of a value is larger than 30; if so, it deletes the last value and replaces it with the second-to-last one. Not very spectacular, but just what I need. I used one of the answers though, because it is so much prettier. Thank you guys very much.
Let's try using scipy.signal (see its docs), here with a low-pass Butterworth filter applied forwards and backwards:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import signal
data = np.genfromtxt('shard_height_plot.csv', delimiter = ',')
df = pd.DataFrame(data)
df.set_index(0, inplace = True)
df2 = df.rolling(20).mean()
b, a = signal.butter(3, 0.05)
y = signal.filtfilt(b,a, df[1].values)
df3 = pd.DataFrame(y, index=df2.index)
plt.plot(df, alpha=.3)
plt.plot(df2, alpha=.3)
plt.plot(df3)
plt.show()
Use medfilt:
y = signal.medfilt(df[1].values)
There are many ways to smooth a curve (rolling mean, GAM, smoothing spline, etc.); my favorite is the Savitzky–Golay method.
It works as follows: a low-degree polynomial is fitted by least squares to a small window around a data point y, and the value of that polynomial at the point is taken as the estimate ŷ. The window is then shifted forward by one data point and the fit is repeated.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
x = np.linspace(0,5,150)
y = np.cos(x) + np.random.random(150) * 0.15
yhat = savgol_filter(y, 49, 3)
plt.plot(x,y)
plt.plot(x,yhat, color='red')
plt.show()
Note that a rolling mean can't work in your case with a window as small as 20, since each outlier carries a non-negligible weight (1/20 = 5%) in every window that contains it and will always induce a big bias.
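The filters above smooth every point. If the goal is to leave good measurements untouched and replace only the outliers with a neighbouring value, one option is to use the median filter as a robust baseline and swap in the baseline only where a point deviates too far from it. This is a minimal sketch on synthetic data: the function name replace_outliers and the kernel size are assumptions, and the threshold of 30 is borrowed from the asker's own edit.

```python
import numpy as np
from scipy.signal import medfilt

def replace_outliers(values, kernel_size=5, threshold=30.0):
    """Replace points that deviate from the local median by more than threshold."""
    values = np.asarray(values, dtype=float)
    baseline = medfilt(values, kernel_size=kernel_size)   # robust local level
    is_outlier = np.abs(values - baseline) > threshold
    cleaned = np.where(is_outlier, baseline, values)      # good points stay as-is
    return cleaned, is_outlier

# Synthetic stand-in for the measurement series: slow trend plus small noise.
rng = np.random.default_rng(1)
signal_clean = np.linspace(100, 120, 200) + rng.normal(scale=1.0, size=200)

# Inject a few wrong measurements, including two adjacent ones.
signal_noisy = signal_clean.copy()
spike_idx = np.array([20, 75, 76, 150])
signal_noisy[spike_idx] += 200.0

cleaned, mask = replace_outliers(signal_noisy)
print("flagged indices:", np.flatnonzero(mask))
```

The kernel size has to be more than twice the longest run of consecutive outliers, otherwise the median itself is pulled onto the spikes.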

messy fitted line plot

I generated some data y with a linear relationship to log(x). I put y and x in a DataFrame, sorted it by the value of x, fitted a model, and then tried to plot the data points along with the fitted line. However, what I got was a very messy fitted line. I must have done something wrong. This can be done easily in R, but with statsmodels I still cannot figure out why. Help needed. Thanks in advance.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
%matplotlib inline
B0 = 3
B1 = 0.5
X = np.random.rand(1000)
epsilon = np.random.normal(0,0.1, size=1000)
y=B0 + B1*np.log(X)+epsilon
df1 = pd.DataFrame({'Y':y, 'X':X})
df1.sort_values('X', inplace=True)
model1 = smf.ols ('Y~np.log(X)', data=df1).fit()
plt.scatter(df1.X, df1.Y)
plt.plot(df1.X, model1.predict(np.log(df1.X)), 'r-')
This is what I got:
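One thing worth checking is the argument passed to predict: with the formula interface, the formula already applies np.log to X, so predict is normally given the untransformed data (for example the original DataFrame), or model1.fittedvalues can be plotted directly. As a library-independent sanity check of what the fitted line should look like, here is a minimal NumPy sketch that fits the same relationship with np.polyfit. It is a stand-in for the statsmodels fit, not a reproduction of it, and the seed and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B0, B1 = 3.0, 0.5

# Same data-generating process as in the question (X kept strictly positive).
X = rng.uniform(0.001, 1.0, size=1000)
Y = B0 + B1 * np.log(X) + rng.normal(0, 0.1, size=1000)

# Least-squares line in log(X); polyfit returns the highest degree first.
slope, intercept = np.polyfit(np.log(X), Y, deg=1)

# Evaluate the fit on X sorted in ascending order, so a line plot
# connects neighbouring points instead of jumping back and forth.
order = np.argsort(X)
x_sorted = X[order]
y_line = intercept + slope * np.log(x_sorted)

print("intercept:", intercept, "slope:", slope)
# To draw it: plt.scatter(X, Y); plt.plot(x_sorted, y_line, 'r-')
```

If the statsmodels line still looks messy after predicting from the untransformed, sorted data, comparing it against y_line should show where the two diverge.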
