I'm trying to plot a CDF of random samples to compare against a target in a dataset that follows a Tweedie distribution. I know the following code will draw random samples from a Poisson distribution:
import numpy as np
import matplotlib.pyplot as plt
x_r = np.sort(np.random.poisson(lam = coll_df['pure_premium'].mean(), size = len(coll_df['pure_premium'])))
y_r = np.arange(1, len(x_r)+1)/len(x_r)
_ = plt.plot(x_r, y_r, color = 'red')
_ = plt.xlabel('Percent of Pure Premium')
_ = plt.ylabel('ECDF')
However, there is no Tweedie distribution option among NumPy's random sampling functions. Does anyone know how to hack this together?
PyPI has a tweedie package. A minimal example drawing a sample would be:
import tweedie, seaborn as sns, matplotlib.pyplot as plt
tvs = tweedie.tweedie(mu=10, p=1.5, phi=20).rvs(100000)
sns.distplot(tvs)
plt.show()
The package's GitHub page has a fancier example. The package implements rv_continuous, so one gets a lot of other functionality besides rvs(). Also, while there seem to be no nice online docs, help(tweedie.tweedie) gives lots of detail.
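If installing an extra package is not an option, a Tweedie variable with power parameter 1 < p < 2 can be simulated as a compound Poisson-gamma sum using NumPy alone. The sketch below is a minimal illustration under that assumption; the helper name tweedie_rvs and the parameter values are hypothetical, and the sorted draws are turned into ECDF coordinates in the same way as the Poisson example in the question.

```python
import numpy as np

def tweedie_rvs(mu, p, phi, size, rng=None):
    """Draw Tweedie samples for 1 < p < 2 (compound Poisson-gamma). Hypothetical helper."""
    if not 1 < p < 2:
        raise ValueError("this sketch only covers 1 < p < 2")
    rng = np.random.default_rng() if rng is None else rng
    lam = mu ** (2 - p) / (phi * (2 - p))   # Poisson rate for the number of gamma terms
    alpha = (2 - p) / (p - 1)               # gamma shape of each term
    scale = phi * (p - 1) * mu ** (p - 1)   # gamma scale of each term
    n = rng.poisson(lam, size=size)
    samples = np.zeros(size)
    positive = n > 0
    # A sum of n i.i.d. gamma(alpha, scale) terms is gamma(n * alpha, scale).
    samples[positive] = rng.gamma(shape=n[positive] * alpha, scale=scale)
    return samples

rng = np.random.default_rng(0)
samples = tweedie_rvs(mu=10.0, p=1.5, phi=20.0, size=200_000, rng=rng)

# ECDF coordinates, ready for plt.plot(x_r, y_r)
x_r = np.sort(samples)
y_r = np.arange(1, len(x_r) + 1) / len(x_r)
```

With these parameterisations the sample mean should be close to mu and the sample variance close to phi * mu**p, which gives a quick sanity check on the draws.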
I have created a list of Shannon entropy values for a pair of multiple-sequence-aligned sequences. Plotting the values gives a simple line plot, and I want to draw a smooth curve over it. Can anyone suggest the right way to do this? Basically, I want a smooth curve that touches the tip of every bar and goes to zero where the y-axis value is zero.
link for image: [1]: https://i.stack.imgur.com/SY3jH.png
#importing the relevant packages
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.interpolate import make_interp_spline
from Bio import AlignIO
import warnings
warnings.filterwarnings("ignore")
#function to calculate the Shannon Entropy of a MSA
# H = -sum[p(x).log2(px)]
def shannon_entropy(list_input):
    unique_aa = set(list_input)
    M = len(list_input)
    entropy_list = []
    # Number of residues in column
    for aa in unique_aa:
        n_i = list_input.count(aa)
        P_i = n_i/float(M)
        entropy_i = P_i*(math.log(P_i,2))
        entropy_list.append(entropy_i)
    sh_entropy = -(sum(entropy_list))
    #print(sh_entropy)
    return sh_entropy
#importing the MSA file
#importing the clustal file
align_clustal1 =AlignIO.read("/home/clustal.aln", "clustal")
def shannon_entropy_list_msa(alignment_file):
    shannon_entropy_list = []
    for col_no in range(len(list(alignment_file[0]))):
        list_input = list(alignment_file[:, col_no])
        shannon_entropy_list.append(shannon_entropy(list_input))
    return shannon_entropy_list
clustal_omega1 = shannon_entropy_list_msa(align_clustal1)
# Plotting the data
plt.figure(figsize=(18,10))
plt.plot(clustal_omega1, 'r')
plt.xlabel('Residue', fontsize=16)
plt.ylabel("Shannon's entropy", fontsize=16)
plt.show()
Edit 1:
Here is what my graph looks like after implementing the "pchip" method. link for the pchip output: https://i.stack.imgur.com/hA3KW.png
One approach would be to use PCHIP interpolation, which will give you the monotonic curve with the required behaviour for zero values on the y-axis.
We can't run your exact code example on our machines because you point to a local Clustal file in your 'home' directory.
Here's a simple working example, with link to output image:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import pchip
mylist = [10,0,0,0,0,9,9,0,0,0,11,11,11,0,0]
mylist_np = np.array(mylist)
samples = np.array(range(len(mylist)))
xnew = np.linspace(samples.min(), samples.max(), 100)
plt.plot(xnew,pchip(samples, mylist_np )(xnew))
plt.show()
According to the Python docs, random.paretovariate(alpha) simulates from the Pareto distribution where alpha is the shape parameter. But the Pareto distribution takes both a shape and scale parameter.
How can I simulate from this distribution specifying both parameters?
You can use NumPy instead:
from numpy import random
pareto = random.pareto(a=4, size=(4, 8))
print(pareto)
>>>[[0.32803729 0.03626127 0.73736579 0.53301595 0.33443536 0.12561402
0.00816275 0.0134468 ]
[0.21536643 0.15798882 0.52957712 0.06631794 0.03728101 0.80383849
0.01727098 0.03910042]
[0.24481661 0.13497905 0.00665971 0.41875676 0.20252262 0.13701287
0.06929994 0.05350275]
[0.93898544 0.02621125 0.0873763 0.15660287 0.31329102 3.95332518
0.09149938 0.08415795]]
You can also nicely graph the data using matplotlib and seaborn:
from numpy import random
import matplotlib.pyplot as plt
import seaborn
seaborn.distplot(random.pareto(a=4, size=1000), kde=False)
plt.show()
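Note that numpy.random.pareto draws from the Lomax (Pareto II) distribution and has no scale argument. According to the NumPy documentation, the classical Pareto distribution with scale x_m is obtained by adding 1 and multiplying by x_m; the standard-library random.paretovariate can be scaled the same way by multiplying its result. A minimal sketch, with the shape and scale values chosen purely for illustration:

```python
import random
import numpy as np

alpha = 4.0   # shape (illustrative value)
x_m = 2.0     # scale, i.e. the smallest possible value (illustrative value)

rng = np.random.default_rng(0)
# Generator.pareto samples Lomax; shift by 1 and scale to get classical Pareto.
numpy_samples = (rng.pareto(alpha, size=200_000) + 1) * x_m

# random.paretovariate returns values >= 1, so multiplying by x_m sets the scale.
random.seed(0)
stdlib_samples = [x_m * random.paretovariate(alpha) for _ in range(1000)]
```

For alpha > 1 the theoretical mean is alpha * x_m / (alpha - 1), which can be compared against numpy_samples.mean() as a rough check.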
I was hoping someone could clarify something for me. I am trying to do time series forecasting with the GaussianRandomWalk function in PyMC3. It has been suggested to me that my code is wrong because I've modeled it so that the standard deviation of the latent walk is the same as the observation noise, which seems like it might be a mistake. Is it a mistake? How would I change it?
import pymc3 as pm
#import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# generate a random walk
sd = .1
N = 200
deltas = np.random.normal(scale=sd, size=N)
y = np.cumsum(deltas)
x = np.arange(N)
df = pd.DataFrame({'y': y})
df = df.reindex(np.arange(250))
with pm.Model() as model:
    sd = pm.HalfNormal('sd')
    mu = pm.Uniform("mu", 0, 100)
    prior = pm.GaussianRandomWalk('prior', mu=mu, sd=sd, shape=len(df))
    obs = pm.Normal("obs", mu=prior, sd=sd, observed=df["y"])
    # graph = pm.model_to_graphviz(model)
    # print(graph)
    trace = pm.sample(2000, chains=1)
pm.traceplot(trace)
plt.show()
with model:
    ppc = pm.sample_posterior_predictive(trace)
pm.traceplot(ppc)
plt.show()
print(ppc)
GaussianRandomWalk is a pure random walk, without any trend/inertia. You might want to look into tfp.sts.LocalLinearTrend or pm.AR, which have some "inertia" in them.
I don't know much more about how to model time series.
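On the question of the shared standard deviation: in the model above, the same sd variable feeds both the GaussianRandomWalk innovations and the Normal observation noise, so the two scales are forced to be equal. A common alternative is to give each its own parameter. The NumPy-only sketch below simulates that generative process so the two roles are easy to see; the names sigma_walk and sigma_obs and their values are assumptions for illustration, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20_000
sigma_walk = 0.1   # scale of the latent random-walk steps (assumed value)
sigma_obs = 0.3    # scale of the observation noise (assumed value)

# Latent state: cumulative sum of Gaussian innovations.
latent = np.cumsum(rng.normal(scale=sigma_walk, size=N))
# Observations: latent state plus independent measurement noise.
observed = latent + rng.normal(scale=sigma_obs, size=N)

# First differences of the observations mix both scales:
# Var(diff) = sigma_walk**2 + 2 * sigma_obs**2
diff_var = np.var(np.diff(observed))
expected = sigma_walk**2 + 2 * sigma_obs**2
```

In the PyMC3 model, the analogous change would presumably be to define two separate HalfNormal variables and pass one to GaussianRandomWalk and the other to Normal, instead of reusing sd for both.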
I would like to use fill_between to shade a subsection of a normal distribution, say the left 5th percentile.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as stats
plt.style.use('ggplot')
mean=1000
std=250
x=np.linspace(mean-3*std, mean+3*std,1000)
iq=stats.norm(mean,std)
plt.plot(x,iq.pdf(x),'b')
Great so far.
Then I set px to fill the area between x=0 to 500
px=np.arange(0,500,10)
plt_fill_between(px,iq.pdf(px),color='r')
The problem is that the above will only show the pdf from 0 to 500 in red.
I want to show the full pdf from 0 to 2000 where the 0 to 500 is shaded?
Any idea how to create this?
As commented, you need to use plt.fill_between instead of plt_fill_between. With that change, the output seems to be exactly what you're looking for.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as stats
plt.style.use('ggplot')
mean=1000
std=250
x=np.linspace(mean-3*std, mean+3*std,1000)
iq=stats.norm(mean,std)
plt.plot(x,iq.pdf(x),'b')
px=np.arange(0,500,10)
plt.fill_between(px,iq.pdf(px),color='r')
plt.show()
You are only using the x values from 0 to 500 in your np.arange. If you want to go up to 2000, write:
px=np.arange(0,2000,10)
plt.fill_between(px,iq.pdf(px),color='r')
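To keep the full curve visible while shading only part of it, one option is to pass the whole x grid to fill_between and restrict the shaded region with its where argument. The sketch below also computes the 5th-percentile cutoff with ppf rather than hard-coding 500; the variable name cutoff and the Agg backend (used here so the snippet runs without a display) are choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show a window
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

mean, std = 1000, 250
x = np.linspace(mean - 3 * std, mean + 3 * std, 1000)
iq = stats.norm(mean, std)

cutoff = iq.ppf(0.05)  # x value below which 5% of the probability mass lies

fig, ax = plt.subplots()
ax.plot(x, iq.pdf(x), 'b')  # full pdf over the whole grid
# Shade only the part of the curve to the left of the cutoff.
shaded = ax.fill_between(x, iq.pdf(x), where=(x <= cutoff), color='r')
```

Replacing the condition with (x <= 500) reproduces the fixed range used in the question.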
I am trying to do a Kernel Density Estimation (KDE) plot with seaborn and locate the median. The code looks something like this:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
sns.set_palette("hls", 1)
data = np.random.randn(30)
sns.kdeplot(data, shade=True)
# x_median, y_median = magic_function()
# plt.vlines(x_median, 0, y_median)
plt.show()
As you can see I need a magic_function() to fetch the median x and y values from the kdeplot. Then I would like to plot them with e.g. vlines. However, I can't figure out how to do that. The result should look something like this (obviously the black median bar is wrong here):
I guess my question is not strictly related to seaborn and also applies to other kinds of matplotlib plots. Any ideas are greatly appreciated.
You need to:
1. Extract the data of the KDE line
2. Integrate it to calculate the cumulative distribution function (CDF)
3. Find the value at which the CDF equals 1/2; that is the median
import numpy as np
from scipy.integrate import cumulative_trapezoid
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_palette("hls", 1)
data = np.random.randn(30)
p=sns.kdeplot(data, shade=True)
x,y = p.get_lines()[0].get_data()
# careful with the argument order: y comes first, then x
# initial=0 prepends a 0 so the result has the same length as x
cdf = cumulative_trapezoid(y, x, initial=0)
nearest_05 = np.abs(cdf-0.5).argmin()
x_median = x[nearest_05]
y_median = y[nearest_05]
plt.vlines(x_median, 0, y_median)
plt.show()
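The same three steps can also be carried out without reading the curve back from the plot, which may be useful because the artists seaborn creates can differ between versions. The sketch below uses scipy.stats.gaussian_kde directly; the grid size, padding, and seed are arbitrary choices for this example, and the KDE bandwidth may not match seaborn's defaults exactly.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import cumulative_trapezoid

rng = np.random.default_rng(0)
data = rng.standard_normal(30)

kde = gaussian_kde(data)
# Evaluate the density on a grid padded well beyond the data range.
pad = 3 * data.std()
x = np.linspace(data.min() - pad, data.max() + pad, 2000)
y = kde(x)

# Integrate the density to get the CDF, then find where it is closest to 0.5.
cdf = cumulative_trapezoid(y, x, initial=0)
nearest_05 = np.abs(cdf - 0.5).argmin()
x_median, y_median = x[nearest_05], y[nearest_05]
# plt.vlines(x_median, 0, y_median) would then draw the median bar.
```

Note that this is the median of the estimated density, which can differ slightly from np.median(data).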