Problems with computing the joint probability mass function with np.histogram2d

Problems with computing the joint probability mass function with np.histogram2d - python

I currently have a 4024 by 10 array - where column 0 represent the 4024 different returns of stock 1, column 1 the 4024 returns of stock 2 and so on - for an assignment for my masters where I'm asked to compute the entropy and joint entropy of the different random variables (each random variable obviously being the stock returns). However, these entropy calculations both require the calculation of P(x) and P(x,y). So far I've managed to successfully compute the individual empirical probabilities using the following code:
def entropy(ret,t,T,a,n):
returns=pd.read_excel(ret)
returns_df=returns.iloc[t:T,:]
returns_mat=returns_df.as_matrix()
asset_returns=returns_mat[:,a]
hist,bins=np.histogram(asset_returns,bins=n)
empirical_prob=hist/hist.sum()
entropy_vector=np.empty(len(empirical_prob))
for i in range(len(empirical_prob)):
if empirical_prob[i]==0:
entropy_vector[i]=0
else:
entropy_vector[i]=-empirical_prob[i]*np.log2(empirical_prob[i])
shannon_entropy=np.sum(entropy_vector)
return shannon_entropy, empirical_prob
P.S. ignore the whole entropy part of the code
As you can see I've simply done the 1d histogram and then divided each count by the total sum of the histogram results in order to find the individual probabilities. However, I'm really struggling with how to go about computing P(x,y) using
np.histogram2d()
Now, obviously P(x,y)=P(x)*P(y) if the random variables are independent, but in my case they are not, as these stocks belong to the same index, and therefore posses some positive correlation, i.e. they're dependent, so taking the product of the two individual probabilities does not hold. I've tried following the suggestions of my professor, where he said:
"We had discussed how to get the empirical pdf for a univariate distribution: one defines the bins and then counts simply how many observations are in the respective bin (relative to the total number of observations). For bivariate distributions you can do the same, but now you make 2-dimensional binning (check for example the histogram2 command in matlab)"
As you can see he's referring to the 2d histogram function of MATLAB, but I've decided to do this assignment on Python, and so far I've elaborated the following code:
def jointentropy(ret,t,T,a,b,n):
returns=pd.read_excel(ret)
returns_df=returns.iloc[t:T,:]
returns_mat=returns_df.as_matrix()
assetA=returns_mat[:,a]
assetB=returns_mat[:,b]
hist,bins1,bins2=np.histogram2d(assetA,assetB,bins=n)
But I don't know what to do from here, because
np.histogram2d()
returns a 4025 by 4025 array as well as the two separate bins, so I don't know what I can do to compute P(x,y) for my two dependent random variables.
I've tried to figure this out for hours without any luck or success, so any kind of help would be highly appreciated! Thank you very much in advance!

Looks like you've got a clear case of conditional or Bayesian probability on your hands. You can look it up, for example, here, http://www.mathgoodies.com/lessons/vol6/dependent_events.html, which gives the probability of both events occurring as P(x,y) = P(x) · P(x|y), where P(x|y) is "probability of event x given y". This should apply in your situation because, if two stocks are from the same index, one price cannot happen without the other. Just build two separate bins like you did for one and calculate probabilities as above.

Related

Creating vector with intervals drawn from Poisson process

I'm looking for some advice on how to implement some statistical models in Python. I'm interested in constructing a sequence of z values (z_1,z_2,z_3,...,z_n) where the number of jumps in an interval (z_1,z_2] is distributed according to the Poisson distribution with parameter lambda(z_2-z_1)
and the numbers of random jumps over disjoint intervals are independent random variables. I want my piecewise constant plot to look something like the two images below, where the y axis is Y(z), where Y(z) consists of N(0,1) random variables in each interval say.
To construct the z data, what would be the best way to tackle this? I have tried sampling values via np.random.poisson and then taking a cumulative sum, but the values drawn are repeated for small intensity values. Please any help or thoughts would be really helpful. Thanks.

np.random.poisson is used to sample the count of events that occured in [z_i, z_j). if you want to sample the events as they occur, then you just want the exponential distribution. for example:
import numpy as np
n = 50
z = np.cumsum(np.random.exponential(1/n, size=n))
y = np.random.normal(size=n)
plotting these (using step in matplotlib) gives something similar to your plots:
note the 1/n sets a "lambda" so on average we expect n points within [0,1]. in this case we got slightly less so it overshoot. feel free to rescale if that's important to you

Convert independent sklearn GaussianMixture log probability scores to probabilities summing to 1

I have labeled 2D data. There are 4 labels in the set, and I know the correspondence of every point to its label. I'd like to, given a new arbitrary data point, find the probability that it has each of the 4 labels. It must belong to one and only one of the labels, so the probabilities should sum to 1.
What I've done so far is to train 4 independent sklearn GMMs (sklearn.mixture.GaussianMixture) on the data points associated with each label. It should be noted that I do not wish to train a single GMM with 4 components because I already know the labels, and don't want to re-cluster in a way that is worse than my known labels. (It would appear that there is a way to provide Y= labels to the fit() function, but I can't seem to get it to work).
In the above plot, points are colored by their known labels, and the contours represent the four independent GMMs fitted to these 4 sets of points.
For a new point, I attempted to compute the probability of its label in a couple ways:
GaussianMixture.predict_proba(): Since each independent GMM has only one distribution, this simply returns a probability of 1 for all models.
GaussianMixture.score_samples(): According to documentation, this one returns the "weighted log probabilities for each sample". My procedure is, for a single new point, I make four calls to this function from each of the four independently trained GMMs represenenting each distribution above. I do get semi sensible results here--typically a positive number for the correct model and negative numbers for each of the three incorrect models, with more muddled results for points near intersecting distribution boundaries. Here's a typical clear-cut result:
2.904136, -60.881554, -20.824841, -30.658509
This point is actually associated with the first label and is least likely to be the second label (is farthest from the second distribution). My issue is how to convert the above scores into probabilities that sum to 1 and accurately represent the chance that the given point belongs to one and only one of the four distributions? Given that these are 4 independent models, is this possible? If not, is there another method I have overlooked that could allow me to train GMM(s) based on known labels and will provide probabilities that sum to 1?

In general, if you don't know how the scores are calculated but you know that there is a monotonic relationship between the scores and the probability, you can simply use the softmax function to approximate a probability, with an optional temperature variable that controls the spikiness of the distribution.
Let V be your list of scores and tau be the temperature. Then,
p = np.exp(V/tau) / np.sum(np.exp(V/tau))
is your answer.
PS: Luckily, we know how sklearn GMM scoring works and softmax with tau=1 is your exact answer.

Multiple linear regression: appending an array on ones to Matrix of features (Python)

I'm currently learning basics of Data Science online. In one of the session on Multiple Linear Regression using Python, the tutor executed below step to add an array on ones to the Matrix of features ; I did not understand why it is being added. From online forums, it is mentioned that it is added so that model (equation) have a constant offset. But why 1 and not any other values. Does the number of independent variables (3) have any impact on this value
X -> Matrix of features ; number of rows in data set : 50 ; Number of
X = np.append(arr = np.ones([50,1]).astype(int), values = X,axis=1)

To better explain, let's imagine you have only 1 feature stored, and let's say 3
training examples.
Then, your parameters are:
And your input variables are:
If you want to realize a linear classification, you must compute the cost function for each training example i:
And if you need to vectorize the calculus (for efficiency and code readability), you want to compute the following matricial product:
However, by definition of the matricial product, the number of columns of matrix X should be the same than the number of rows of matrix Theta. Thus, to compute the product but leave the result unchanged, you add a column of ones to the left of matrix X:
Then, the result for each sample i is the following:
TLDR: You need to append a column of ones to X for the matricial product X*Theta to be defined. If you were adding any other coefficient c instead of 1, then your constant offset theta_0 would be multiplied by your coefficient c.

I think the cost function is the summation of errors between the predicted label and the actual label which we want to minimise. The J function given above is the hypothesis function.

How to compute the shannon entropy and mutual information of N variables

I need to compute the mutual information, and so the shannon entropy of N variables.
I wrote a code that compute shannon entropy of certain distribution.
Let's say that I have a variable x, array of numbers.
Following the definition of shannon entropy I need to compute the probability density function normalized, so using the numpy.histogram is easy to get it.
import scipy.integrate as scint
from numpy import*
from scipy import*
def shannon_entropy(a, bins):
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
x=binedg[:-1]
g=-p*log2(p)
g[isnan(g)]=0.
return scint.simps(g,x=x)
Choosing inserting x, and carefully the bin number this function works.
But this function is very dependent on the bin number: choosing different values of this parameter I got different values.
Particularly if my input is an array of values constant:
x=[0,0,0,....,0,0,0]
the entropy of this variables obviously has to be 0, but if I choose the bin number equal to 1 I got the right answer, if I choose different values I got strange non sense (negative) answers.. what I am feeling is that numpy.histogram have the arguments normed=True or density= True that (as said in the official documentation) they should give back the histogram normalized, and probably I do some error in the moment that I swich from the probability density function (output of numpy.histogram) to the probability mass function (input of shannon entropy), I do:
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
I would like to find a way to solve these problems, I would like to have an efficient method to compute the shannon entropy independent of the bin number.
I wrote a function to compute the shannon entropy of a distribution of more variables, but I got the same error.
The code is this, where the input of the function shannon_entropydd is the array where at each position there is each variable that has to be involved in the statistical computation
def intNd(c,axes):
assert len(c.shape) == len(axes)
assert all([c.shape[i] == axes[i].shape[0] for i in range(len(axes))])
if len(axes) == 1:
return scint.simps(c,axes[0])
else:
return intNd(scint.simps(c,axes[-1]),axes[:-1])
def shannon_entropydd(c,bins=30):
hist,ax=histogramdd(c,bins,normed=True)
for i in range(len(ax)):
ax[i]=ax[i][:-1]
p=-hist*log2(hist)
p[isnan(p)]=0
return intNd(p,ax)
I need these quantities in order to be able to compute the mutual information between certain set of variables:
M_info(x,y,z)= H(x)+H(z)+H(y)- H(x,y,z)
where H(x) is the shannon entropy of the variable x
I have to find a way to compute these quantities so if some one has a completely different kind of code that works I can switch on it, I don't need to repair this code but find a right way to compute this statistical functions!

The result will depend pretty strongly on the estimated density. Can you assume a specific form for the density? You can reduce the dependence of the result on the estimate if you avoid histograms or other general-purpose estimates such as kernel density estimates. If you can give more detail about the variables involved, I can make more specific comments.
I worked with estimates of mutual information as part of the work for my dissertation [1]. There is some stuff about MI in section 8.1 and appendix F.
[1] http://riso.sourceforge.net/docs/dodier-dissertation.pdf

I think that if you choose bins = 1, you will always find an entropy of 0, as there is no "uncertainty" over the possible bin the values are in ("uncertainty" is what entropy measures). You should choose an number of bins "big enough" to account for the diversity of the values that your variable can take. If you have discrete values: for binary values, you should take such that bins >= 2. If the values that can take your variable are in {0,1,2}, you should have bins >= 3, and so on...
I must say that I did not read your code, but this works for me:
import numpy as np
x = [0,1,1,1,0,0,0,1,1,0,1,1]
bins = 10
cx = np.histogram(x, bins)[0]
def entropy(c):
c_normalized = c/float(np.sum(c))
c_normalized = c_normalized[np.nonzero(c_normalized)]
h = -sum(c_normalized * np.log(c_normalized))
return h
hx = entropy(cx)

Pseudoexperiments in PyMC

Is it possible to perform "pseudoexperiments" using PyMC?
By pseudoexperiments, I mean generating random "observations" by sampling from the prior, and then, given each pseudoexperiment, drawing samples from the posterior. Afterwards, one would compare the trace for each parameter to the sample (obtained from the prior) used in sampling from the posterior.
A more concrete example: Suppose that I want to know the rate of process X. I count how many occurrences there are in a certain period of time. However, I know that process Y also sometimes occurs and will contaminate my count. The rate of process Y is known with some uncertainty. So, I build a model, include my observations, and sample from the posterior:
import pymc
class mymodel:
rate_x = pymc.Uniform('rate_x', lower=0, upper=100)
rate_y = pymc.Normal('rate_y', mu=150, tau=1./(15**2))
total_rate = pymc.LinearCombination('total_rate', [1,1], [rate_x, rate_y])
data = pymc.Poisson('data', mu=total_rate, value=193, observed=True)
Mod = pymc.Model(mymodel)
MCMC = pymc.MCMC(Mod)
MCMC.sample(100000, burn=5000, thin=5)
print MCMC.stats()['rate_x']['quantiles']
However, before I do my experiment (or before I "unblind" my analysis and look at my data), I would like to know how sensitive I expect to be -- what will be the uncertainty on my measurement of rate_x?
To answer this, I could sample from the prior
Mod.draw_from_prior()
but this only samples rate_x, rate_y, and calculates total_rate. But once the values of those are set by draw_from_prior(), I can draw a pseudoexperiment:
Mod.data.random()
This just returns a number, so I have to set the value of Mod.data to a random sample. Because Mod.data has the observed flag set, I have to also "force" it:
Mod.data.set_value(Mod.data.random(), force=True)
Now I can sample from the posterior again
MCMC.sample(100000, burn=500, thin=5)
print MCMC.stats()['rate_x']['quantiles']
All this works, so I suppose the simple answer to my question is "yes". But it feels very hacky. Is there a better or more natural way to accomplish this?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.