According to the Python docs, random.paretovariate(alpha) simulates from the Pareto distribution, where alpha is the shape parameter. But the Pareto distribution takes both a shape and a scale parameter.
How can I simulate from this distribution specifying both parameters?
You can use NumPy instead:
from numpy import random
pareto = random.pareto(a=4, size=(4, 8))
print(pareto)
[[0.32803729 0.03626127 0.73736579 0.53301595 0.33443536 0.12561402
0.00816275 0.0134468 ]
[0.21536643 0.15798882 0.52957712 0.06631794 0.03728101 0.80383849
0.01727098 0.03910042]
[0.24481661 0.13497905 0.00665971 0.41875676 0.20252262 0.13701287
0.06929994 0.05350275]
[0.93898544 0.02621125 0.0873763 0.15660287 0.31329102 3.95332518
0.09149938 0.08415795]]
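Note that numpy.random.pareto draws from the Lomax (Pareto II) distribution, so it only takes the shape parameter a. Per the NumPy docs, samples from the classical Pareto with scale m can be obtained by adding 1 and multiplying by m; and since random.paretovariate draws with a scale of 1, you can likewise multiply its result by your scale. A small sketch (the shape and scale values here are only illustrative):
import random
import numpy as np

alpha, m = 4, 2.5   # shape and scale (illustrative values)

# NumPy: shift the Lomax sample by 1 and multiply by the scale
np_samples = m * (np.random.pareto(alpha, size=1000) + 1)

# Standard library: paretovariate has scale 1, so just multiply
std_sample = m * random.paretovariate(alpha)

print(np_samples.min(), std_sample)   # all values are >= m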
You can also nicely graph the data using matplotlib and seaborn:
from numpy import random
import matplotlib.pyplot as plt
import seaborn
seaborn.distplot(random.pareto(a=4, size=1000), kde=False)
plt.show()
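Note that seaborn.distplot is deprecated in newer seaborn releases (0.11 and later). Assuming you are on a recent version, histplot is the closest replacement for this kind of plot:
from numpy import random
import matplotlib.pyplot as plt
import seaborn

seaborn.histplot(random.pareto(a=4, size=1000))
plt.show()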
I have created a list of Shannon entropy values for a pair of multiple-sequence-aligned sequences. When I plot the values I get a simple plot, and I want to draw a smooth curve over the lines. Can anyone suggest the right way to do this? Basically, I want a smooth curve that touches the tip of every bar and drops to zero wherever the y-axis value is zero.
Link to image: https://i.stack.imgur.com/SY3jH.png
#importing the relevant packages
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.interpolate import make_interp_spline
from Bio import AlignIO
import warnings
warnings.filterwarnings("ignore")
#function to calculate the Shannon Entropy of a MSA
# H = -sum[p(x).log2(px)]
def shannon_entropy(list_input):
    unique_aa = set(list_input)
    M = len(list_input)  # number of residues in the column
    entropy_list = []
    for aa in unique_aa:
        n_i = list_input.count(aa)
        P_i = n_i / float(M)
        entropy_i = P_i * (math.log(P_i, 2))
        entropy_list.append(entropy_i)
    sh_entropy = -(sum(entropy_list))
    #print(sh_entropy)
    return sh_entropy
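# Quick sanity check (illustrative example, not part of the original script):
# a column with three 'A' and one 'C' gives p(A) = 0.75 and p(C) = 0.25,
# so H = -(0.75*log2(0.75) + 0.25*log2(0.25)), which is roughly 0.811
print(shannon_entropy(list("AAAC")))  # ~0.811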
#importing the MSA file
#importing the clustal file
align_clustal1 = AlignIO.read("/home/clustal.aln", "clustal")

def shannon_entropy_list_msa(alignment_file):
    # compute the entropy of every column in the alignment
    shannon_entropy_list = []
    for col_no in range(len(list(alignment_file[0]))):
        list_input = list(alignment_file[:, col_no])
        shannon_entropy_list.append(shannon_entropy(list_input))
    return shannon_entropy_list
clustal_omega1 = shannon_entropy_list_msa(align_clustal1)
# Plotting the data
plt.figure(figsize=(18,10))
plt.plot(clustal_omega1, 'r')
plt.xlabel('Residue', fontsize=16)
plt.ylabel("Shannon's entropy", fontsize=16)
plt.show()
Edit 1:
Here is what my graph looks like after implementing the "pchip" method (monotonic spline output): https://i.stack.imgur.com/hA3KW.png
One approach would be to use PCHIP interpolation, which gives you a monotone curve with the required behaviour for zero values on the y-axis.
We can't run your exact code example because it points to a local Clustal file in your home directory.
Here's a simple working example, with link to output image:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import pchip
mylist = [10, 0, 0, 0, 0, 9, 9, 0, 0, 0, 11, 11, 11, 0, 0]
mylist_np = np.array(mylist)
samples = np.array(range(len(mylist)))  # x positions of the original points

# evaluate the monotone PCHIP interpolant on a denser grid
xnew = np.linspace(samples.min(), samples.max(), 100)
plt.plot(xnew, pchip(samples, mylist_np)(xnew))
plt.show()
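Applied to your own data, the same idea would look roughly like this (a sketch that assumes clustal_omega1 is the per-column entropy list computed in your script; I can't verify it against your alignment file):
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import pchip

entropy = np.array(clustal_omega1)        # entropy per residue column
positions = np.arange(len(entropy))       # residue positions
xnew = np.linspace(positions.min(), positions.max(), 10 * len(entropy))

plt.figure(figsize=(18, 10))
plt.bar(positions, entropy, color='lightgrey')         # original values
plt.plot(xnew, pchip(positions, entropy)(xnew), 'r')   # smooth monotone curve
plt.xlabel('Residue', fontsize=16)
plt.ylabel("Shannon's entropy", fontsize=16)
plt.show()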
I have a distribution that changes over time, and I would like to plot a violin plot for each time step side by side using seaborn. My initial attempt failed because violinplot cannot handle an np.ndarray for the y argument:
import numpy as np
import seaborn as sns
time = np.arange(0, 10)
samples = np.random.randn(10, 200)
ax = sns.violinplot(x=time, y=samples) # Exception: Data must be 1-dimensional
The seaborn documentation has an example for a vertical violinplot grouped by a categorical variable. However, it uses a DataFrame in long format.
Do I need to convert my time series into a DataFrame as well? If so, how do I achieve this?
A closer look at the documentation made me realize that omitting the x and y arguments altogether causes the data argument to be interpreted as wide-form data:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
samples = np.random.randn(20, 10)
ax = sns.violinplot(data=samples)
plt.show()
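If you want the x-axis labelled with your actual time values rather than the default column positions 0..9, one option (my own addition, not from the seaborn docs example) is to wrap the array in a DataFrame whose columns are the time values:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

time = np.arange(0, 10)
samples = np.random.randn(200, 10)   # one column per time step
ax = sns.violinplot(data=pd.DataFrame(samples, columns=time))
plt.show()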
The violin plot documentation says that the x and y inputs do not have to come from a DataFrame, but they must have matching dimensions, and y must be one-dimensional. The samples array you created has 10 rows and 200 columns, which is what causes the dimension error when plotting.
I tested it, and the following code runs without problems:
import numpy as np
import seaborn as sns
import pandas as pd

time = np.arange(0, 200)
samples = np.random.randn(10, 200)
for sample in samples:
    ax = sns.violinplot(x=time, y=sample)
You can then group the resulting graphs; this link shows one approach to styling and combining matplotlib figures:
https://python-graph-gallery.com/199-matplotlib-style-sheets/
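For example, one way to place the per-time-step violins side by side in a single figure (a sketch of my own using matplotlib subplots, not taken from the linked page):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

samples = np.random.randn(10, 200)   # 10 time steps, 200 draws each
fig, axes = plt.subplots(1, 10, figsize=(20, 4), sharey=True)
for ax, sample in zip(axes, samples):
    sns.violinplot(y=sample, ax=ax)  # one violin per time step
plt.show()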
Converting your data into a DataFrame is also possible; you just need pandas. For example:
import pandas as pd
x = [1,2,3,4]
df = pd.DataFrame(x)
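To answer the original question more directly: here is a sketch (my own, assuming each row of samples is one time step) that converts the array into a long-format DataFrame and passes it to violinplot:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

time = np.arange(0, 10)
samples = np.random.randn(10, 200)   # one row per time step

# long format: one row per (time, value) pair
df = pd.DataFrame(samples.T, columns=time).melt(var_name='time', value_name='value')
ax = sns.violinplot(x='time', y='value', data=df)
plt.show()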
I want to fit a power law to my data points because I need to calculate the value of v. The curve seems to pass through all the data points, but the errors on the fitted parameters are too large. How can I reduce this error?
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import math
import scipy
from scipy import optimize

x_data = np.array([30, 45, 60, 75])
y_data = np.array([0.42597867, 0.26249343, 0.19167837, 0.08116507])

fig = plt.figure()
ax = fig.add_subplot(111)

# power-law model with exponent -1/v and offset c
def ff(L, v, c):
    return (L**(-1/v) + c)

ax.scatter(x_data, y_data, marker='s', s=4**2)
pfit, pcov = optimize.curve_fit(ff, x_data, y_data)
print("pfit: ", pfit)
print("pcov: ", pcov.shape)
#print(pcov)
perr = np.sqrt(np.diag(pcov))  # one-sigma errors on the fitted parameters
x = np.linspace(20, 85, 1000)
ax.plot(x, ff(x, *pfit), color='red')
I'm trying to plot a CDF of random samples to compare against a target within a dataset that follows a Tweedie distribution. I know the following code will pull random samples from a Poisson distribution:
import numpy as np
import matplotlib.pyplot as plt
x_r = np.sort(np.random.poisson(lam=coll_df['pure_premium'].mean(), size=len(coll_df['pure_premium'])))
y_r = np.arange(1, len(x_r) + 1) / len(x_r)
_ = plt.plot(x_r, y_r, color='red')
_ = plt.xlabel('Percent of Pure Premium')
_ = plt.ylabel('ECDF')
However, there is no tweedie distribution option on the random sampling. Anyone know how to hack this together?
PyPI has a tweedie package. A minimal example drawing a sample would be:
import tweedie, seaborn as sns, matplotlib.pyplot as plt
tvs = tweedie.tweedie(mu=10, p=1.5, phi=20).rvs(100000)
sns.distplot(tvs)
plt.show()
The package's GitHub page has a fancier example. The package implements rv_continuous, so you get a lot of other functionality besides rvs(). Also, while there don't seem to be any online docs, help(tweedie.tweedie) gives plenty of detail.
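Since the original goal was a CDF plot, the samples can be turned into an empirical CDF the same way as in the Poisson snippet above (a sketch; the mu, p and phi values are only illustrative, you would use values estimated from your own data):
import numpy as np
import tweedie
import matplotlib.pyplot as plt

tvs = tweedie.tweedie(mu=10, p=1.5, phi=20).rvs(100000)

x_r = np.sort(tvs)                            # sorted samples
y_r = np.arange(1, len(x_r) + 1) / len(x_r)   # cumulative fraction

plt.plot(x_r, y_r, color='red')
plt.xlabel('Pure Premium')
plt.ylabel('ECDF')
plt.show()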
I am trying to do a Kernel Density Estimation (KDE) plot with seaborn and locate the median. The code looks something like this:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
sns.set_palette("hls", 1)
data = np.random.randn(30)
sns.kdeplot(data, shade=True)
# x_median, y_median = magic_function()
# plt.vlines(x_median, 0, y_median)
plt.show()
As you can see I need a magic_function() to fetch the median x and y values from the kdeplot. Then I would like to plot them with e.g. vlines. However, I can't figure out how to do that. The result should look something like this (obviously the black median bar is wrong here):
I guess my question is not strictly related to seaborn and also applies to other kinds of matplotlib plots. Any ideas are greatly appreciated.
You need to:
1. Extract the data of the KDE line.
2. Integrate it to calculate the cumulative distribution function (CDF).
3. Find the value where the CDF equals 1/2; that is the median.
import numpy as np
from scipy import integrate
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_palette("hls", 1)
data = np.random.randn(30)
p = sns.kdeplot(data, shade=True)

# extract the x and y values of the KDE curve
x, y = p.get_lines()[0].get_data()

# cumulative integral of the density; initial=0 keeps the result the same length as x
# (in newer SciPy this function is named integrate.cumulative_trapezoid)
cdf = integrate.cumtrapz(y, x, initial=0)

# index of the point where the CDF is closest to 0.5
nearest_05 = np.abs(cdf - 0.5).argmin()
x_median = x[nearest_05]
y_median = y[nearest_05]

plt.vlines(x_median, 0, y_median)
plt.show()