I've closely followed this book (http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter2_MorePyMC/MorePyMC.ipynb) but have found myself running into problems when trying to use PyMC for my own problem.
I've got a bunch of order values from customers who have placed an order, and they look reasonably like a Gamma distribution. I'm running an AB test and want to see how the distribution of order values changes - enter PyMC. I was following the example in the book but found it didn't really work for me - my first attempt was this:
import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
from pylab import savefig
## Replace these with the actual order values in the test set
## Have made slightly different to be able to see differing distributions
observations_A = pm.rgamma(3.5, 0.013, size=1000)
observations_B = pm.rgamma(3.45, 0.016, size=2000)
## Identical prior assumptions
prior_a = pm.Gamma('prior_a', 3.5, 0.015)
prior_b = pm.Gamma('prior_b', 3.5, 0.015)
## The difference in the test groups is the most important bit
@pm.deterministic
def delta(p_A=prior_a, p_B=prior_b):
    return p_A - p_B
## Add observations
observation_a = pm.Gamma('observation_a', prior_a, value=observations_A, observed=True)
observation_b = pm.Gamma('observation_b', prior_b, value=observations_B, observed=True)
mcmc = pm.MCMC([prior_a, prior_b, delta, observation_a, observation_b])
mcmc.sample(20000,1000)
Looking at the mean of the trace for prior_a and prior_b I see values of around 3.97/3.98 and when I look at the stats of these priors I see a similar story. However, upon defining the priors, calling the rand() method on the prior gives me the kind of values I would expect (between 100 and 400). Basically, one of the updating stages (I'm least certain about the observation stages) is doing something I don't expect.
Having struggled with this for a bit I found this page (http://matpalm.com/blog/2012/12/27/dead_simple_pymc/) and decided a different approach may be in order:
import pymc as pm
import numpy as np
from matplotlib import pyplot as plt
from pylab import savefig
## Replace these with the actual order values in the test set
observations_A = pm.rgamma(3.5, 0.013, size=1000)
observations_B = pm.rgamma(3.45, 0.016, size=2000)
## Initial assumptions
A_Rate = pm.Uniform('A_Rate', 2, 4)
B_Rate = pm.Uniform('B_Rate', 2, 4)
A_Shape = pm.Uniform('A_Shape', 0.005, 0.05)
B_Shape = pm.Uniform('B_Shape', 0.005, 0.05)
p_A = pm.Gamma('p_A', A_Rate, A_Shape, value=observations_A, observed=True)
p_B = pm.Gamma('p_B', B_Rate, B_Shape, value=observations_B, observed=True)
## Sample
mcmc = pm.MCMC([p_A, p_B, A_Rate, B_Rate, A_Shape, B_Shape])
mcmc.sample(20000, 1000)
## Plot the A_Rate, B_Rate, A_Shape, B_Shape
## Using those, determine the Gamma distribution
## Plot both - and draw 1000000... samples from each.
## Perform statistical tests on these.
So instead of going straight for the Gamma distribution, we're looking to find the parameters (I think). This seems to work a treat in that it gives me values in the traces of the right order of magnitude. However, now I can plot a histogram of samples for alpha for both test groups and for beta but that's not really what I'm after. I want to be able to plot each of the test group's 'gamma-like' distributions, calculated from a prior and the values I supply. I also want to be able to plot a 'delta' as the AB testing example shows. I feel a deterministic variable on the second example is going to be my best bet but I don't really know the best way to go about constructing this.
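For what it's worth, this is roughly what I have in mind for the deterministic on the second approach - tracking the difference in each group's implied mean order value (shape/rate for a gamma) - though I'm not at all sure it's the right construction. The variable names below are mine; treat it as a rough sketch rather than code I trust:
import pymc as pm
import numpy as np

observations_A = pm.rgamma(3.5, 0.013, size=1000)
observations_B = pm.rgamma(3.45, 0.016, size=2000)

## Priors on the gamma parameters of each group
A_shape = pm.Uniform('A_shape', 2, 4)
B_shape = pm.Uniform('B_shape', 2, 4)
A_rate = pm.Uniform('A_rate', 0.005, 0.05)
B_rate = pm.Uniform('B_rate', 0.005, 0.05)

## Observed order values
obs_A = pm.Gamma('obs_A', A_shape, A_rate, value=observations_A, observed=True)
obs_B = pm.Gamma('obs_B', B_shape, B_rate, value=observations_B, observed=True)

## Difference in the implied mean order value (mean of a gamma is shape/rate)
@pm.deterministic
def delta_mean(a_s=A_shape, a_r=A_rate, b_s=B_shape, b_r=B_rate):
    return a_s / a_r - b_s / b_r

mcmc = pm.MCMC([A_shape, A_rate, B_shape, B_rate, delta_mean, obs_A, obs_B])
mcmc.sample(20000, 1000)

delta_trace = mcmc.trace('delta_mean')[:]
That would at least give me a delta trace to histogram, like the AB example in the book, but I'm not sure whether the difference in means is the right 'delta' to be looking at.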
Long story short - I've got data drawn from a Gamma distribution that I'd like to AB test. I've got a gamma prior view of the data, though could be persuaded that I've got a normal prior view if that's easier. I'd like to update identical priors with the data I've collected, in a sensible way, and plot the distributions and the difference between them.
Cheers,
Matt
I am illustrating hyperopt's TPE algorithm for my master's project and can't seem to get the algorithm to converge. From what I understand from the original paper and the YouTube lecture, the TPE algorithm works in the following steps:
(in the following, x=hyperparameters and y=loss)
Start by creating a search history of [x,y], say 10 points.
Sort the hyperparameters according to their loss and divide them into two sets using some quantile γ (γ = 0.5 means the sets will be equally sized)
Make a kernel density estimation for both the poor hyperparameter group (g(x)) and good hyperparameter group (l(x))
Good estimations will have low probability in g(x) and high probability in l(x), so we propose to evaluate the function at argmin(g(x)/l(x))
Evaluate (x,y) pair at the proposed point and repeat steps 2-5.
I have implemented this in python on the objective function f(x) = x^2, but the algorithm fails to converge to the minimum.
import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde
def objective_func(x):
    return x**2

def measure(x):
    noise = np.random.randn(len(x))*0
    return x**2 + noise

def split_meassures(x_obs, y_obs, gamma=1/2):
    #split x and y observations into two sets and return a separation threshold (y_star)
    size = int(len(x_obs)//(1/gamma))
    l = {'x': x_obs[:size], 'y': y_obs[:size]}
    g = {'x': x_obs[size:], 'y': y_obs[size:]}
    y_star = (l['y'][-1] + g['y'][0])/2
    return l, g, y_star
#sample objective function values for illustration
x_obj = np.linspace(-5,5,10000)
y_obj = objective_func(x_obj)
#start by sampling a parameter search history
x_obs = np.linspace(-5,5,10)
y_obs = measure(x_obs)
nr_iterations = 100
for i in range(nr_iterations):
    #sort observations according to loss
    sort_idx = y_obs.argsort()
    x_obs, y_obs = x_obs[sort_idx], y_obs[sort_idx]
    #split sorted observations into two groups (l and g)
    l, g, y_star = split_meassures(x_obs, y_obs)
    #approximate distributions for both groups using kernel density estimation
    kde_l = gaussian_kde(l['x']).evaluate(x_obj)
    kde_g = gaussian_kde(g['x']).evaluate(x_obj)
    #define our evaluation measure for sampling a new point
    eval_measure = kde_g/kde_l
    if i%10 == 0:
        plt.figure()
        plt.subplot(2, 2, 1)
        plt.plot(x_obj, y_obj, label='Objective')
        plt.plot(x_obs, y_obs, '*', label='Observations')
        plt.plot([-5, 5], [y_star, y_star], 'k')
        plt.subplot(2, 2, 2)
        plt.plot(x_obj, kde_l)
        plt.subplot(2, 2, 3)
        plt.plot(x_obj, kde_g)
        plt.subplot(2, 2, 4)
        plt.semilogy(x_obj, eval_measure)
        plt.draw()
    #find point to evaluate and add the new observation
    best_search = x_obj[np.argmin(eval_measure)]
    x_obs = np.append(x_obs, [best_search])
    y_obs = np.append(y_obs, [measure(np.asarray([best_search]))])
plt.show()
I suspect this happens because we keep sampling where we are most certain, thus making l(x) more and more narrow around this point, which doesn't change where we sample at all. So where is my understanding lacking?
So, I am still learning about TPE as well, but here are the two problems in this code:
1. The code will only evaluate a few unique points, because the next location is always the one recommended by the kernel density estimates; there is no way for the code to explore the search space (which is what acquisition functions normally provide).
2. Because the code simply appends new observations to x and y, it accumulates a whole lot of duplicates. The duplicates skew the set of observations, which leads to a very strange split; you can easily see that in the later plots, where eval_measure starts out looking like the objective function but diverges later on.
Removing the duplicates in x_obs and y_obs fixes problem 2. The first problem, however, can only be fixed by adding some way of exploring the search space, as sketched below.
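Here is a rough, self-contained sketch of both fixes combined. The epsilon-greedy style exploration and the np.unique de-duplication are my own choices for illustration, not something prescribed by the TPE paper:
import numpy as np
from scipy.stats import gaussian_kde

def measure(x):
    return np.asarray(x)**2

np.random.seed(0)
x_candidates = np.linspace(-5, 5, 10000)
x_obs = np.linspace(-5, 5, 10)
y_obs = measure(x_obs)
epsilon = 0.3  # chance of proposing a purely random (exploratory) point

for i in range(100):
    # drop duplicate observations so the KDEs are not skewed by repeats
    x_obs, keep = np.unique(x_obs, return_index=True)
    y_obs = y_obs[keep]
    # sort by loss and split into a good (l) and a bad (g) half
    order = y_obs.argsort()
    x_obs, y_obs = x_obs[order], y_obs[order]
    half = len(x_obs) // 2
    kde_l = gaussian_kde(x_obs[:half]).evaluate(x_candidates)
    kde_g = gaussian_kde(x_obs[half:]).evaluate(x_candidates)
    ratio = kde_g / (kde_l + 1e-12)  # small constant avoids division by zero
    # exploit most of the time, explore occasionally
    if np.random.rand() < epsilon:
        new_x = np.random.uniform(-5, 5)
    else:
        new_x = x_candidates[np.argmin(ratio)]
    x_obs = np.append(x_obs, new_x)
    y_obs = np.append(y_obs, measure(new_x))

print(x_obs[np.argmin(y_obs)])  # should end up close to the minimum at 0
With the duplicates gone the split stays meaningful, and the occasional random proposal keeps the search from collapsing onto a single point.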
I have built an XGBoostRegressor model using around 200 categorical features predicting a continuous time variable.
But I would want to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the prediction itself and P(Y|X) as output. Any idea how to do this?
There is no probability in regression. In regression the only output you get is a predicted value; that's why it is called regression. So for any regressor, a probability for a prediction is not possible - that only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time-based phenomenon. Specifically, you have the temperature (y) inside an oven after a time (x), in seconds for instance. At x = 0s it is at 20°C, and you start heating it, and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
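As a quick visual check for heteroscedasticity, you can plot residuals against fitted values and look for a funnel shape. Here is a self-contained sketch with simulated data (not the oven example) where the noise grows over time:
import numpy as np
import matplotlib.pyplot as plt

# simulated time series where the noise grows with x (heteroscedastic)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 500)
y = 20 + 3 * x + rng.normal(scale=0.5 * x, size=x.size)

# fit a simple line and inspect the residuals versus the fitted values
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

plt.scatter(fitted, residuals, s=5)
plt.axhline(0, color='k')
plt.xlabel('fitted value')
plt.ylabel('residual')
plt.show()  # a widening funnel shape indicates heteroscedasticity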
You can probably try to get the distribution of your known outputs, compare the prediction against that curve, and check the p-value. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000 # The number of sample
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want a normalized histogram (since this is a PDF, it must integrate to 1)
nbins = N / 10
n = int(N / nbins)
p, x = np.histogram(outputs, bins=n, density=True)
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0])/2  # converting bin edges to centers
# Now we want to interpolate :
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
plt.show()
# To check :
area = integrate.quad(f, x[0], x[-1])
print(area) # (should be close to 1)
Now, the interpolation is not great for outliers: if a predicted value is extremely far (more than 3 times the std) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
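For example, continuing from the code above, you could integrate the tail of the interpolated PDF beyond a prediction to get a rough "how unusual is this value" score (y_pred here is just a made-up example value, not output from a real model):
y_pred = 1.5
tail_area, _ = integrate.quad(f, y_pred, x[-1])
print(tail_area)  # fraction of the interpolated density above y_pred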
It is not perfect, but it is the best I came up with at the time. I'm sure there are better ways to do it. If your data follow a normal law, it becomes trivial.
I suggest you look into NGBoost (essentially a wrapper around XGBoost which eventually gives you a probabilistic model).
Here you can find slides on how NGBoost works, and the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default the Gaussian distribution) and fit an XGBoost model to estimate the best parameters of the distribution (for the Gaussian, $\mu$ and $\sigma$). The model will split the variables' space into different regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you're provided with the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
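A minimal usage sketch, assuming the default Gaussian output distribution (the synthetic data and the .params attribute are my own illustration; check the NGBoost docs for the exact API of your version):
from ngboost import NGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ngb = NGBRegressor().fit(X_train, y_train)

point_pred = ngb.predict(X_test)   # usual point prediction
dist_pred = ngb.pred_dist(X_test)  # estimated P(Y|X=x) per test row
print(dist_pred.params['loc'][:5])    # per-sample mean of the Normal
print(dist_pred.params['scale'][:5])  # per-sample standard deviation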
I generated two distributions using the following code:
rand_num1 = 2*np.random.randn(10000) + 1
rand_num2 = 2*np.random.randn(10000) + 1
stats.ks_2samp(rand_num1, rand_num2)
My question is: why do these two samples not test as coming from the same distribution, according to both the KS test and the chi-square test?
When I run a KS test on the two samples I get:
Ks_2sampResult(statistic=0.019899999999999973, pvalue=0.037606196570126725)
which implies that the two distributions are statistically different. I use the following code to plot the CDF of the two distributions:
count1, bins = np.histogram(rand_num1, bins = 100)
count2, _ = np.histogram(rand_num2, bins = bins)
plt.plot(np.cumsum(count1), 'g-')
plt.plot(np.cumsum(count2), 'b.')
This is how the CDF of two distributions looks.
When I run a chisquare test I get the following:
stats.chisquare(count1, count2) # Gives an nan output
stats.chisquare(count1+1, count2+1) # Outputs "Power_divergenceResult(statistic=180.59294741316694, pvalue=1.0484033143507713e-06)"
I have 3 questions below:
Even though the CDFs look the same and the data comes from the same distribution, why do the KS test and the chi-square test both reject the hypothesis that the samples come from the same distribution? Is there an underlying assumption that I am missing here?
Some counts are 0, and hence the first chisquare() gives a nan output. Is it accepted practice to just add a non-zero number to all counts to get a correct estimate?
Is there a KS test to test against non-standard distributions, say a normal with a non-zero mean and std != 1?
The CDF, in my humble opinion, is not a good curve to look at. It hides a lot of detail, due to the fact that it is an integral: basically, a region of the distribution that sits too low can be compensated by another region that sits too high.
OK, let's take a look at the distribution of K-S results. I've run the test 100 times and plotted the statistic vs the p-value, and, as expected, in some cases there are (small p, large stat) points.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
np.random.seed(12345)
x = []
y = []
for k in range(0, 100):
    rand_num1 = 2.0*np.random.randn(10000) + 1.0
    rand_num2 = 2.0*np.random.randn(10000) + 1.0
    q = stats.ks_2samp(rand_num1, rand_num2)
    x.append(q.statistic)
    y.append(q.pvalue)
plt.scatter(x, y, alpha=0.1)
plt.show()
[Scatter plot of the K-S statistic vs p-value over the 100 runs]
UPDATE
In reality, if I run a test and see the test vs control distribution of my metric as shown in my plot, then I would want to be able to say that they are the same - are there any statistics or parameters around these tests that can tell me how close these distributions are?
Of course there are - and you're using one of such tests! K-S is the most general but also the weakest test. And as with any test you might use, there are ALWAYS cases where the test will say the samples come from different distributions even though you deliberately sampled them from the same routine. It is just the nature of these things: you'll get a yes or no with some confidence, but not much more. Look at the graph again for illustration.
Concerning your exercises with chi2, I was skeptical from the beginning about using chi2 for such a task. For me, given the problem of making a decision about two samples, the test used should be explicitly symmetric. K-S is ok, but looking at the definition of chi2, it is NOT symmetric. A simple modification of your code
count1, bins = np.histogram(rand_num1, bins = 40, range=(-2.,2.))
count2, _ = np.histogram(rand_num2, bins = bins, range=(-2.,2.))
q = stats.chisquare(count2, count1)
print(q)
q = stats.chisquare(count1, count2)
print(q)
produces something like
Power_divergenceResult(statistic=87.645335824746468, pvalue=1.3298580128472864e-05)
Power_divergenceResult(statistic=77.582358201839526, pvalue=0.00023275129585256563)
Basically, it means that the test may pass if you run it as (1,2) but fail if you run it as (2,1), which is not good, IMHO. Chi2 is ok with me as soon as you test against expected values from a known distribution curve - there the asymmetry of the test makes sense.
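For illustration, this is the kind of one-sided setup where chi2 feels natural to me: observed bin counts tested against expected counts computed from the known N(1, 2) curve. The choice of 20 bins is arbitrary, and for a real test you would merge the sparse tail bins:
import numpy as np
from scipy import stats

np.random.seed(12345)
sample = 2.0*np.random.randn(10000) + 1.0  # known to come from N(mu=1, sigma=2)

# observed counts in 20 bins
observed, edges = np.histogram(sample, bins=20)

# expected counts under the known N(1, 2) distribution for the same bins
cdf = stats.norm.cdf(edges, loc=1.0, scale=2.0)
expected = np.diff(cdf) * len(sample)
expected *= observed.sum() / expected.sum()  # make the totals match exactly

print(stats.chisquare(observed, expected))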
I would also advise trying the Anderson-Darling test, along the lines of
q = stats.anderson_ksamp([np.sort(rand_num1), np.sort(rand_num2)])
print(q)
But remember, it is the same as with K-S, some samples may fail to pass the test even if they are drawn from the same underlying distribution - this is just the nature of the beast.
UPDATE: Some reading material
https://stats.stackexchange.com/questions/187016/scipy-chisquare-applied-on-continuous-data
I'm trying to use pyMC to provide a Bayesian estimate of a covariance matrix given some data. I'm roughly following the stock covariance example provided in this online guide (link here), but I have a more simplistic example model that I made up. I've got two values that I draw from a multivariate normal distribution, and I've constructed it in such a way that I know the covariance/correlation between the two variables.
I've posted my short code below. Essentially what I'm doing is constructing an artificial data set where the correlation matrix should be [[1, -0.5], [-0.5, 1]]. At the end of the mcmc sampling, I get a predicted value for the off-diagonal term that is quite a bit different. I've looked at the convergence criteria, and it looks like the autocorrelation is low and the distribution is stationary. However, I will admit I'm still wrapping my head around all the nuances here and there could be aspects of this that are still beyond my grasp.
This question is related to and very much based on these other two SO questions (One and Two). I felt the need to ask my own question despite the similarity because I'm not getting the answer I expect to get. If any of you computational statisticians out there can help provide insight into this problem it would be greatly appreciated!
import numpy as np
import pandas as pd
import pymc as pm
import matplotlib.pyplot as plt
import seaborn as sns
p=2
prior_mu=np.ones(p)
prior_sdev=np.ones(p)
prior_corr_inv=np.eye(p)
def cov2corr(A):
    """
    covariance matrix to correlation matrix.
    """
    d = np.sqrt(A.diagonal())
    A = ((A.T / d).T) / d
    #A[ np.diag_indices(A.shape[0]) ] = np.ones( A.shape[0] )
    return A
# construct artificial data set
muVector=[10,5]
sdevVector=[3.,5.]
corrMatrix=np.matrix([[1,-0.5],[-0.5, 1]])
cov_matrix=np.diag(sdevVector)*corrMatrix*np.diag(sdevVector)
n_obs = 500
x = np.random.multivariate_normal(muVector,cov_matrix,n_obs)
prior_mu = np.array(muVector)
prior_std = np.array(sdevVector)
inv_cov_matrix = pm.Wishart( "inv_cov_matrix", n_obs, np.diag(prior_std**2) )
mu = pm.Normal( "returns", prior_mu, 1, size = 2)
# create the model and sample
obs = pm.MvNormal( "observed returns", mu, inv_cov_matrix, observed = True, value = x )
model = pm.Model( [obs, mu, inv_cov_matrix] )
mcmc = pm.MCMC(model)
mcmc.use_step_method(pm.AdaptiveMetropolis,inv_cov_matrix)
mcmc.sample( 1e5, 2e4, 10)
# Determine prediction - Does not equal corrMatrix!
inv_cov_samples = mcmc.trace("inv_cov_matrix")[:]
mean_covariance_matrix = np.linalg.inv( inv_cov_samples.mean(axis=0) )
prediction = cov2corr(mean_covariance_matrix*n_obs)
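For reference, the empirical correlation of the simulated data itself (separate from the MCMC output) can be checked directly, continuing from the code above:
# sanity check on the artificial data; off-diagonal should be close to -0.5
print(np.corrcoef(x, rowvar=False))
print(cov2corr(np.cov(x, rowvar=False)))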
I have some data I have sampled from a radar satellite image and wanted to perform some statistical tests on it. Before this I wanted to conduct a normality test, so I could be sure my data was normally distributed. My data appears to be normally distributed, but when I perform the test I'm getting a p-value of 0, suggesting my data is not normally distributed.
I have attached my code along with the output and a histogram of the distribution (I'm relatively new to Python, so apologies if my code is clunky in any way). Can anyone tell me if I'm doing something wrong? I find it hard to believe from my histogram that my data is not normally distributed.
import h5py
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

values = 'inputfile.h5'
f = h5py.File(values,'r')
dset = f['/DATA/DATA']
array = dset[...,0]
print('normality =', scipy.stats.normaltest(array))
max = np.amax(array)
min = np.amin(array)
histo = np.histogram(array, bins=100, range=(min, max))
freqs = histo[0]
rangebins = (max - min)
numberbins = (len(histo[1])-1)
interval = (rangebins/numberbins)
newbins = np.arange((min), (max), interval)
histogram = plt.bar(newbins, freqs, width=0.2, color='gray')
plt.show()
This prints: (41099.095955202931, 0.0). The first element is the chi-squared statistic and the second is the p-value.
I have made a graph of the data, which I have attached. I thought that maybe, as I'm dealing with negative values, it was causing a problem, so I normalised the values, but the problem persists.
This question explains why you're getting such a small p-value. Essentially, normality tests almost always reject the null on very large sample sizes (in yours, for example, you can see just some skew in the left side, which at your enormous sample size is way more than enough).
What would be much more practically useful in your case is to plot a normal curve fit to your data. Then you can see how the normal curve actually differs (for example, you can see whether the tail on the left side does indeed go too long). For example:
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm

n, bins, patches = plt.hist(array, 50, density=True)
mu = np.mean(array)
sigma = np.std(array)
plt.plot(bins, norm.pdf(bins, mu, sigma))
(Note the density=True argument: this ensures that the histogram is normalized to have a total area of 1, which makes it comparable to a density like the normal distribution.)
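Another quick visual check, using the same array as above, is a Q-Q style probability plot, which makes tail deviations easier to see than a histogram. scipy's probplot plots the ordered data against standard normal quantiles with a least-squares fit line:
from scipy import stats
import matplotlib.pyplot as plt

stats.probplot(array, dist='norm', plot=plt)
plt.show()  # points bending away from the line show where the data departs from normality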
In general, when the number of samples is less than about 50, you should be careful about using tests of normality. These tests need enough evidence to reject the null hypothesis, which is "the distribution of the data is normal", and when the number of samples is small they are not able to find that evidence.
Keep in mind that when you fail to reject the null hypothesis, it does not mean that the alternative hypothesis is correct.
There is another possibility:
Some implementations of statistical tests for normality compare the distribution of your data to the standard normal distribution. To avoid this, I suggest you standardize the data and then apply the test of normality.
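For example, with scipy, stats.kstest(data, 'norm') compares against the standard normal, so standardizing first (or passing the estimated mean and std as args) is what makes the comparison meaningful. A small sketch with made-up data:
import numpy as np
from scipy import stats

data = 3.0*np.random.randn(500) + 10.0  # normal, but not standard normal

print(stats.kstest(data, 'norm'))  # rejects: compared against N(0, 1)

z = (data - data.mean()) / data.std()  # standardized data
print(stats.kstest(z, 'norm'))  # typically no longer rejects

# equivalently, pass the estimated parameters to the test
print(stats.kstest(data, 'norm', args=(data.mean(), data.std())))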