What is the pandas equivalent of R's qnorm()?

I am moving some code from R to Anaconda Python. The R code uses qnorm, documented as "quantile function for the normal distribution with mean equal to mean and standard deviation equal to sd."
The call and parameters are:
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
p vector of probabilities.
mean vector of means.
sd vector of standard deviations.
log.p logical; if TRUE, probabilities p are given as log(p).
lower.tail logical; if TRUE (default), probabilities are
P[X ≤ x], otherwise P[X > x].
I don't see any equivalent in pandas.Series. Have I missed it, is it on another object, or is there some equivalent in another library?

A lot of this equivalent functionality is found in scipy.stats. In this case, you're looking for scipy.stats.norm.ppf.
qnorm(p, mean = 0, sd = 1) is equivalent to scipy.stats.norm.ppf(q, loc=0, scale=1).
>>> import scipy.stats as st
>>> st.norm.ppf([0.01, 0.99])
array([-2.32634787, 2.32634787])
>>> st.norm.ppf([0.01, 0.99], loc=10, scale=0.1)
array([ 9.76736521, 10.23263479])
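For completeness, SciPy's ppf has no counterpart to R's log.p argument; if your probabilities arrive on the log scale, you can exponentiate them first (a small sketch, reusing the st alias from above):
>>> import numpy as np
>>> log_p = np.log(0.01)          # a probability given on the log scale
>>> st.norm.ppf(np.exp(log_p))    # like qnorm(log_p, log.p = TRUE) in R
-2.3263478740408408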

Just to expand on @miradulo's answer: if you also want the equivalent of qnorm(lower.tail = FALSE), you can simply multiply the result by -1, since the standard normal is symmetric about zero:
In R:
qnorm(0.8, lower.tail = F)
-0.8416212
In Python:
from scipy.stats import norm
norm.ppf(0.8) * -1
-0.8416212
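Alternatively, SciPy exposes the upper tail directly via the inverse survival function, so no sign flip is needed: norm.isf(q) returns the x with P[X > x] = q, matching qnorm(q, lower.tail = FALSE):
from scipy.stats import norm
# inverse survival function: the upper-tail quantile
norm.isf(0.8)
-0.8416212335729143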

Related

Drawing sample and calculating sample probability from multivariate normal distribution using scipy.stats.multivariate_normal

I would like to do something that is likely very simple, but it is giving me difficulty. I am trying to draw N samples from a multivariate normal distribution and calculate the probability of each of those randomly drawn samples. Here I attempt to use scipy, but I am open to using np.random.multivariate_normal as well. Whichever is easiest.
>>> import numpy as np
>>> from scipy.stats import multivariate_normal
>>> num_samples = 10
>>> num_features = 6
>>> std = np.random.rand(num_features)
# define distribution
>>> mvn = multivariate_normal(mean = np.zeros(num_features), cov = np.diag(std), allow_singular = False, seed = 42)
# draw samples
>>> sample = mvn.rvs(size = num_samples)
# determine probability of each drawn sample
>>> prob = mvn.pdf(x = sample)
# print samples
>>> print(sample)
[[ 0.04816243 -0.00740458 -0.00740406 0.04967142 -0.01382643 0.06476885]
...
[-0.00977815 0.01047547 0.03084945 0.10309995 0.09312801 -0.08392175]]
# print probability of all samples
>>> print(prob)
[26861.56848337 17002.29353025 2182.26793265 3749.65049331
42004.63147989 3700.70037411 5569.30332186 16103.44975393
14760.64667235 19148.40325233]
This is confusing for me for a number of reasons:
For the rvs sampling function: I don't use the keyword arguments mean and cov per the docs because it seems odd to define a distribution with a mean and cov in mvn = multivariate_normal(mean = np.zeros(num_features), cov = np.diag(std), allow_singular = False, seed = 42) and then repeat that definition in the rvs call. Am I missing something?
For the mvn.pdf call, the probability density values are obviously much greater than 1, which isn't impossible for a continuous multivariate normal, but I would like to convert these numbers to approximate probabilities at that particular point. How can I do this?
Thanks!
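(A note on the first point: a frozen scipy.stats distribution does remember its parameters, so the rvs call needs no repetition. On the second: densities are not probabilities; a single point has probability zero, and only integrals of the pdf over a region yield probabilities. A minimal sketch of the frozen-distribution usage:)
from scipy.stats import multivariate_normal
import numpy as np

num_features = 6
# A frozen distribution: mean and cov are stored on the object,
# so they are not repeated in later calls.
mvn = multivariate_normal(mean=np.zeros(num_features),
                          cov=np.eye(num_features), seed=42)
sample = mvn.rvs(size=10)    # uses the frozen mean/cov
density = mvn.pdf(sample)    # pdf values; these can exceed 1 when variances are small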

Random Samples from Gamma distribution with two parameters / Python

If I would like to generate 10 random samples from a gamma distribution of the form

f(x; α, β) = (β^α / Γ(α)) x^(α−1) e^(−βx)

with alpha = 2 and beta = 3, how would I do it?
The documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gamma.html
is a bit unclear to me.
My guess is that it would be like:
a = 2
b = 3
scipy.stats.gamma.rvs(a, loc = 0, scale = 1/b, size = 10)
Can anyone verify whether this is correct or provide the correct solution?
Yes, that is correct. In the formula that you show, β is often called the rate parameter. The gamma distribution in SciPy uses a scale parameter, which corresponds to 1/β. You can see the formulas for these two common parameterizations side by side in the Wikipedia article on the gamma distribution.
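As a quick sanity check on the scale = 1/β convention (a sketch, not part of the original answer): the mean of a Gamma(α, rate β) variable is α/β, so the sample mean should be close to 2/3:
import scipy.stats as st

a, b = 2, 3
samples = st.gamma.rvs(a, scale=1/b, size=100_000)
print(samples.mean())  # approximately a/b = 0.667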
If all you need is the generation of random samples (and not all the other methods provided by scipy.stats.gamma), you can use the gamma method of the NumPy class numpy.random.Generator. It uses the same parameter conventions as the SciPy gamma distribution, except that it does not have the loc parameter:
In [26]: import numpy as np
In [27]: rng = np.random.default_rng()
In [28]: a = 2
In [29]: b = 3
In [30]: rng.gamma(a, scale=1/b, size=10)
Out[30]:
array([0.637065 , 0.18436688, 1.36876183, 0.74692619, 0.12608862,
0.38395668, 0.81947237, 0.63437319, 0.47902819, 0.39094079])

Python: Kernel Density Estimation for positive values

I want to get kernel density estimation for positive data points. Using Python Scipy Stats package, I came up with the following code.
import numpy as np
import scipy.stats as st

def get_pdf(data):
    a = np.array(data)
    ag = st.gaussian_kde(a)
    # one grid point per unit of the data range
    x = np.linspace(0, max(data), int(max(data)))
    y = ag(x)
    return x, y
This works perfectly for most data sets, but it gives an erroneous result for "all positive" data points. To make sure this works correctly, I use numerical integration to compute the area under this curve.
def trapezoidal_2(ag, a, b, n):
    # composite trapezoidal rule with n subintervals
    h = float(b - a) / n
    s = 0.0
    s += ag(a)[0] / 2.0
    for i in range(1, n):
        s += ag(a + i * h)[0]
    s += ag(b)[0] / 2.0
    return s * h
Since the data is spread in the region (0, int(max(data))), we should get a value close to 1 when executing the following lines.
b = 1
data = st.pareto.rvs(b, size=10000)
data = list(data)
a = np.array(data)
ag = st.gaussian_kde(a)
trapezoidal_2(ag, 0, int(max(data)), int(max(data))*2)
But it gives a value close to 0.5 when I test.
However, when I integrate from -100 to max(data), it gives a value close to 1.
trapezoidal_2(ag, -100, int(max(data)), int(max(data))*2+200)
The reason is that ag (the KDE) assigns density to values less than 0, even though the original data set contains only positive values.
So how can I get a kernel density estimate that considers only positive values, such that the area under the curve in the region (0, max(data)) is close to 1?
The choice of bandwidth is quite important when performing kernel density estimation. Scott's rule and Silverman's rule work well for distributions similar to a Gaussian, but they do not work well for a heavy-tailed distribution such as the Pareto.
Quote from the docs:
Bandwidth selection strongly influences the estimate obtained from the KDE (much more so than the actual shape of the kernel). Bandwidth selection can be done by a "rule of thumb", by cross-validation, by "plug-in methods" or by other means; see [3], [4] for reviews. gaussian_kde uses a rule of thumb, the default is Scott's Rule.
Try with different bandwidth values, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
b = 1
sample = stats.pareto.rvs(b, size=3000)
kde_sample_scott = stats.gaussian_kde(sample, bw_method='scott')
kde_sample_scalar = stats.gaussian_kde(sample, bw_method=1e-3)
# Compute the integral:
print('integral scott:', kde_sample_scott.integrate_box_1d(0, np.inf))
print('integral scalar:', kde_sample_scalar.integrate_box_1d(0, np.inf))
# Graph:
x_span = np.logspace(-2, 1, 550)
plt.plot(x_span, stats.pareto.pdf(x_span, b), label='theoretical pdf')
plt.plot(x_span, kde_sample_scott(x_span), label="estimated pdf 'scott'")
plt.plot(x_span, kde_sample_scalar(x_span), label="estimated pdf 'scalar'")
plt.xlabel('X'); plt.legend();
gives:
integral scott: 0.5572130540733236
integral scalar: 0.9999999999968957
and a plot of the theoretical pdf against the two estimates. We see that the KDE using the Scott rule is wrong.
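A different way to deal specifically with the positive support (rather than the bandwidth) is the boundary-reflection trick: fit the KDE on the sample mirrored about zero and keep twice the density on the positive half-line. This is a generic technique, not a scipy.stats option; a minimal sketch:
import numpy as np
from scipy import stats

sample = stats.pareto.rvs(1, size=3000)
# Mirror the data about 0 so no mass leaks below the boundary,
# then double the estimate for x >= 0.
kde = stats.gaussian_kde(np.concatenate([sample, -sample]))

def positive_kde(x):
    x = np.asarray(x)
    return np.where(x >= 0, 2 * kde(x), 0.0)
By symmetry, 2 * kde.integrate_box_1d(0, np.inf) is exactly 1, although a sensible bandwidth still matters for the heavy Pareto tail.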

Binomial test in Python vs R

I am trying to re-implement in Python a binomial test initially developed in R. However, I am not sure if I am using the right functionality.
In R, I get:
> binom.test (2, 8, 11/2364, alternative = "greater")
0.25
With Python & SciPy, I use
from scipy.stats import binom
binom.sf(2, 8, float(11)/float(2364))
5.5441613055814931e-06
In fact I have to call binom.sf(2, 8, float(11)/float(2364)) to make sure the third parameter is not rounded to 0 by integer division.
Why do the values differ? Do I have to specify the moments for Scipy / binom.sf?
Should I use some other library?
Here's what I get in R:
> binom.test(2, 8, 11/2364, alternative = "greater")
Exact binomial test
data: 2 and 8
number of successes = 2, number of trials = 8, p-value = 0.0005951
alternative hypothesis: true probability of success is greater than 0.00465313
95 percent confidence interval:
0.04638926 1.00000000
sample estimates:
probability of success
0.25
>
Note that the p-value is 0.0005951.
Compare that to the result of scipy.stats.binom_test (which returns just the p-value):
In [25]: from scipy.stats import binom_test
In [26]: binom_test(2, 8, 11/2364, alternative='greater')
Out[26]: 0.00059505960517880572
So that agrees with R.
To use the survival function of scipy.stats.binom, you have to adjust the first argument (as noted in a comment by Marius): sf(k) gives the strict inequality P(X > k), so the upper-tail probability P(X >= 2) is binom.sf(1, ...):
In [27]: from scipy.stats import binom
In [28]: binom.sf(1, 8, 11/2364)
Out[28]: 0.00059505960517880572
(I am using Python 3, so 11/2364 equals 0.004653130287648054. If you are using Python 2, be sure to write that fraction as 11.0/2364 or float(11)/2364.)
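A note for newer SciPy versions: scipy.stats.binom_test was deprecated and later removed in favor of scipy.stats.binomtest, which returns a result object rather than a bare p-value:
from scipy.stats import binomtest

result = binomtest(2, 8, 11/2364, alternative='greater')
print(result.pvalue)  # 0.000595..., matching R's binom.test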

Difference between R.scale() and sklearn.preprocessing.scale()

I am currently moving my data analysis from R to Python. When scaling a dataset in R, I would use R.scale(), which in my understanding does the following: (x - mean(x)) / sd(x)
To replace that function I tried to use sklearn.preprocessing.scale(). From my understanding of the description it does the same thing. Nonetheless I ran a little test file and found out that the two methods have different return values. Obviously the standard deviations are not the same... Can someone explain why the standard deviations "deviate" from one another?
MWE:
# import packages
from sklearn import preprocessing
import numpy
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up R namespaces
R = rpy2.robjects.r
np1 = numpy.array([[1.0,2.0],[3.0,1.0]])
print "Numpy-array:"
print np1
print "Scaled numpy array through R.scale()"
print R.scale(np1)
print "-------"
print "Scaled numpy array through preprocessing.scale()"
print preprocessing.scale(np1, axis = 0, with_mean = True, with_std = True)
scaler = preprocessing.StandardScaler()
scaler.fit(np1)
print "Mean of preprocessing.scale():"
print scaler.mean_
print "Std of preprocessing.scale():"
print scaler.std_
Output: (not reproduced here; the two scaled arrays differ, and so do the reported standard deviations)
It seems to have to do with how standard deviation is calculated.
>>> import numpy as np
>>> a = np.array([[1, 2],[3, 1]])
>>> np.std(a, axis=0)
array([ 1. , 0.5])
>>> np.std(a, axis=0, ddof=1)
array([ 1.41421356, 0.70710678])
From numpy.std documentation,
ddof : int, optional
Means Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. By default ddof is zero.
Apparently, R.scale() uses ddof=1, but sklearn.preprocessing.StandardScaler() uses ddof=0.
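If you just need numbers that match R's scale() and not a fitted scaler object, you can also standardize directly with NumPy using ddof=1 (a minimal sketch using the np1 array from the question):
import numpy as np

np1 = np.array([[1.0, 2.0], [3.0, 1.0]])
# (x - mean) / sample standard deviation, i.e. R's (x - mean(x)) / sd(x)
scaled = (np1 - np1.mean(axis=0)) / np1.std(axis=0, ddof=1)
print(scaled)
# [[-0.70710678  0.70710678]
#  [ 0.70710678 -0.70710678]]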
EDIT: (To explain how to use alternate ddof)
There doesn't seem to be a straightforward way to calculate std with alternate ddof, without accessing the variables of the StandardScaler() object itself.
sc = StandardScaler()
sc.fit(data)
# Now, sc.mean_ and sc.std_ are the mean and standard deviation of the data
# Replace the sc.std_ value using std calculated using numpy
sc.std_ = numpy.std(data, axis=0, ddof=1)
The current answers are good, but sklearn has changed a bit in the meantime. The new syntax that makes sklearn behave exactly like R.scale() is:
from sklearn.preprocessing import StandardScaler
import numpy as np
sc = StandardScaler()
sc.fit(data)
# Overwrite the fitted scale with the sample standard deviation (ddof=1),
# which is what R's scale() divides by; no list conversion is needed
sc.scale_ = np.std(data, axis=0, ddof=1)
sc.transform(data)
Feature request:
https://github.com/scikit-learn/scikit-learn/issues/23758
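As a quick consistency check (a sketch using the np1 array from the question as data): overriding scale_ this way makes transform agree with direct ddof=1 standardization:
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[1.0, 2.0], [3.0, 1.0]])  # np1 from the question
sc = StandardScaler().fit(data)
sc.scale_ = np.std(data, axis=0, ddof=1)
# Agrees elementwise with the direct NumPy standardization
assert np.allclose(sc.transform(data),
                   (data - data.mean(axis=0)) / data.std(axis=0, ddof=1))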
R.scale documentation says:
The root-mean-square for a (possibly centered) column is defined as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing values and n is the number of non-missing values. In the case center = TRUE, this is the same as the standard deviation, but in general it is not. (To scale by the standard deviations without centering, use scale(x, center = FALSE, scale = apply(x, 2, sd, na.rm = TRUE)).)
However, sklearn.preprocessing.StandardScaler always scales with the standard deviation.
In my case, I wanted to replicate R's scale(x, center = FALSE) in Python, so I followed @Sid's advice in a slightly different way:
import numpy as np
from sklearn.preprocessing import StandardScaler

def get_scale_1d(v):
    # I copied this logic from the R source code:
    # root-mean-square of the non-missing values, sqrt(sum(x^2) / (n - 1))
    v = v[~np.isnan(v)]
    std = np.sqrt(np.sum(v ** 2) / np.max([1, len(v) - 1]))
    return std

# with_mean=False so the data is scaled but not centered (center = FALSE)
sc = StandardScaler(with_mean=False)
sc.fit(data)
sc.scale_ = np.apply_along_axis(func1d=get_scale_1d, axis=0, arr=data)
sc.transform(data)
