I'm looking for some advice on how to implement some statistical models in Python. I'm interested in constructing a sequence of z values (z_1, z_2, z_3, ..., z_n) where the number of jumps in an interval (z_1, z_2] follows a Poisson distribution with parameter lambda*(z_2 - z_1), and the numbers of jumps over disjoint intervals are independent random variables. I want my piecewise constant plot to look something like the two images below, where the y-axis is Y(z), and Y(z) is, say, an independent N(0,1) random variable on each interval.
To construct the z data, what would be the best way to tackle this? I have tried sampling values via np.random.poisson and then taking a cumulative sum, but the values drawn are repeated for small intensity values. Any help or thoughts would be really appreciated. Thanks.
np.random.poisson samples the count of events that occurred in [z_i, z_j). If you want to sample the event times as they occur, you just want the exponential distribution (the inter-arrival times of a Poisson process are exponentially distributed). For example:
import numpy as np
n = 50
z = np.cumsum(np.random.exponential(1/n, size=n))
y = np.random.normal(size=n)
Plotting these (using step in matplotlib) gives something similar to your plots:
Note that the 1/n sets the scale (i.e. a "lambda" of n), so on average we expect n points within [0, 1]. In this run we got slightly fewer, so the last point overshoots 1. Feel free to rescale if that's important to you.
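For completeness, a minimal plotting sketch along these lines (the step/where='post' choice and the axis labels are just one reasonable option, not something from the question):

import numpy as np
import matplotlib.pyplot as plt

n = 50
z = np.cumsum(np.random.exponential(1/n, size=n))  # event times of a rate-n Poisson process, roughly on [0, 1]
y = np.random.normal(size=n)                        # one N(0,1) level per interval

# piecewise-constant plot: the level y[i] holds from z[i] until z[i+1]
plt.step(z, y, where='post')
plt.xlabel('z')
plt.ylabel('Y(z)')
plt.show()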
I am currently gathering X gyro data from a device. I need to find the differences between them in terms of their y values in graphs such as this one.
I have written an algorithm that finds the min and max y values, but when there are minor fluctuations like the ones shown, it returns faulty answers. I have written it in Python and it is as follows:
import numpy as np
x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")
max_value = max(y)
min_value = min(y)
print(min_value)
borders = max_value - min_value
I now need to write an algorithm that will:
Determine the max and min y value and draw their borders.
If it sees minor fluctuations it will ignore them.
Would writing such an algorithm be possible, and if so, how could I go about writing one? Are there any libraries or reading material you would recommend? Thanks for your help.
1. Generally, maths like this should be done in pure code, with little or no help from external APIs, so it's easier to debug the algorithm and the process.
2. Now for a possible answer:
Since you do not want any outliers (those buggy and irritating minor fluctuations), you need to calculate the standard deviation of your data.
What is the standard deviation, you might ask?
It represents how far your data typically is from the mean (symbol: µ, the average) of your data set.
If you do not know what it is, here is a quick tutorial on the standard deviation and its partner, variance.
First, the MEAN:
It should not be that hard to calculate the mean of your x and y arrays/lists. Just loop (a for loop would be optimal) through each list, add up all the values, and divide the total by the length of the list itself. There you have the mean.
Second, the VARIANCE (σ²):
If you followed the website above: to calculate the variance, loop through the x and y lists again, subtract each value from its respective mean to get the difference, square this difference, add all the squared differences up, and divide by the length of the respective list, and there you have the variance.
For the Standard Deviation (Symbol: σ), just take the square root of the variance.
Now the standard deviation can be used to find the real max and min (leaving out those buggy outliers).
Use this graph as an approximate reference to find where most of your values may be:
Since your data is mostly uniform, you should get a pretty accurate answer.
Do test different ranges, e.g. µ ± σ or µ ± 2σ, to find the optimum max and min.
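As a rough sketch of that idea in NumPy (the file name is taken from the question; the 2σ cutoff is just an assumption to experiment with):

import numpy as np

x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")

mu = np.mean(y)      # mean of the y values
sigma = np.std(y)    # standard deviation of the y values

# keep only values within mu +/- 2*sigma, treating the rest as minor fluctuations/outliers
filtered = y[np.abs(y - mu) <= 2 * sigma]

min_value = filtered.min()
max_value = filtered.max()
borders = max_value - min_value
print(min_value, max_value, borders)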
Edit:
Sorry, only y now:
Sorry for the horrible representation and drawing, but this is roughly what it should look like graphically. Also, do experiment by yourself with the standard deviation from the mean (as above: µ ± σ or µ ± 2σ) to find the max and min that suit you.
I have this task:
1. Choose your favorite continuous distribution (the less it looks normal, the more interesting; try to choose one of the distributions we have not discussed in the course).
2. Generate a sample of 1000 from it, build a histogram of the sample, and draw the theoretical density of your random variable's distribution on top of it (so that the values are on the same scale, don't forget to set the histogram to normed=True). Your task is to estimate the distribution of the sample average of your random variable for different sample sizes.
3. To do this, generate 1000 samples of size n and build histograms of their sample averages for three or more values of n (for example, 5, 10, 50).
4. Using the information on the mean and variance of the original distribution (easily found on Wikipedia), calculate the parameters of the normal distribution which, according to the central limit theorem, approximates the distribution of the sample averages. Note: to calculate these parameters, it is the theoretical mean and variance of your random variable that should be used, not their sample estimates.
5. On top of each histogram, draw the density of the corresponding normal distribution (be careful with the parameters of the function: it takes the standard deviation as input, not the variance).
6. Describe the difference between the distributions obtained for the different values of n. How does the accuracy of the approximation of the distribution of the sample averages change as n grows?
So if I want to pick the exponential distribution in Python, do I go about it like this?
import numpy as np
from scipy.stats import expon
import matplotlib.pyplot as plt

exdist = expon(loc=2, scale=3)  # loc and scale - shift and scale parameters, default values 0 and 1
mean, var, skew, kurt = exdist.stats(moments='mvsk')  # let's see the moments of our distribution

fig, ax = plt.subplots()
x = np.linspace(0, 2, 100)
ax.plot(x, exdist.pdf(x))  # let's draw it

arr = exdist.rvs(size=1000)  # generation of a thousand random numbers (is it for task 3?)
And I kept getting errors; here is a screenshot from Jupyter:
https://i.stack.imgur.com/zwUtu.png
Could you please explain to me how to write the right code? I can't figure out where to start or where I'm making a mistake. Do I have to use arr.mean() to find the sample mean and plt.hist(arr,bins=) to build a histogram? I would be very grateful for an explanation.
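For what it's worth, here is a rough sketch of the sample-average part of the task (assuming the exponential distribution with scale=3, so the theoretical mean is 3 and the theoretical variance is 9; treat it as one possible layout, not the required solution, and note it uses density=True, the newer name for normed=True):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon, norm

scale = 3                           # Exponential(scale=3): mean = 3, variance = 9
mean_theory, var_theory = scale, scale**2

for n in (5, 10, 50):
    # 1000 samples of size n, then the average of each sample
    sample_means = expon(scale=scale).rvs(size=(1000, n)).mean(axis=1)
    plt.hist(sample_means, bins=30, density=True, alpha=0.5, label=f'n={n}')

    # CLT approximation: normal with the theoretical mean and sqrt(variance/n) as standard deviation
    x = np.linspace(sample_means.min(), sample_means.max(), 200)
    plt.plot(x, norm(loc=mean_theory, scale=np.sqrt(var_theory / n)).pdf(x))

plt.legend()
plt.show()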
I have a huge dataset with 271116 rows of data. I normalized the data using z-score normalization. I have no way of knowing whether the data actually follows a normal distribution, so I plotted a simple density graph using matplotlib:
hdf = df['Height'].plot(kind = 'kde', stacked = False)
plt.show()
I got this for a result:
Though the data seems somewhat normal, can I apply the Central Limit Theorem, taking the means of different random samples (say, 10000 times), to get a smooth bell curve?
Any help in python is appreciated, thanks.
Something like:
import numpy as np

sampleMeans = []
for _ in range(100000):
    samples = df['Height'].sample(n=100)
    sampleMean = np.mean(samples)
    sampleMeans.append(sampleMean)

# Now you have a list of sample means to plot - should be normally distributed
The mean of the distribution should equal the mean of the original data, and the standard deviation should be a factor of ten smaller than that of the original data. If the result isn't smooth enough, increase .sample(n=100) to a higher figure. This will also decrease the standard deviation of the resulting bell curve. The general rule is that the CLT standard deviation is the data standard deviation divided by sqrt(n).
It's important to note that the resulting distribution is different from the original. It is not merely smoothed out using the CLT.
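For example, a quick check of that rule (continuing with the sampleMeans list built above; df['Height'] and n=100 are taken from the snippet, and the normal overlay uses scipy.stats):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# parameters predicted by the CLT: same mean as the data, std divided by sqrt(n)
mu = df['Height'].mean()
sigma = df['Height'].std() / np.sqrt(100)

plt.hist(sampleMeans, bins=50, density=True)
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)
plt.plot(x, norm(loc=mu, scale=sigma).pdf(x))
plt.show()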
I have 4 different distributions which I've fitted to a sample of observations. Now I want to compare my results and find the best solution. I know there are a lot of different methods to do that, but I'd like to use a quantile-quantile (q-q) plot.
The formulas for my 4 distributions are:
where K0 is the modified Bessel function of the second kind and zeroth order, and Γ is the gamma function.
My sample style looks roughly like this: (0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7 ...), so I have multiple identical values and also gaps in between them.
I've read the instructions on this site and tried to implement them in python. So, like in the link:
1) I sorted my data from the smallest to the largest value.
2) I computed "n" evenly spaced points on the interval (0,1), where "n" is my sample size.
3) And this is the point I can't manage.
As far as I understand, I should now use the values I calculated beforehand (those evenly spaced values), put them in the inverse functions of my above distributions and thus compute the theoretical quantiles of my distributions.
For reference, here are the inverse functions (partly calculated with wolframalpha, and as far it was possible):
where W is the Lambert W-function and everything in brackets afterwards is the argument.
The problem is that there apparently is no inverse function for the first distribution. The next one would probably produce complex values (a negative under the root, because b = 0.55 according to the fit), and the last two contain a Lambert W-function (which I'm unsure how to implement in Python).
So my question is, is there a way to calculate the q-q plots without the analytical expressions of the inverse distribution functions?
I'd appreciate any help you could give me very much!
A simpler and more conventional way to go about this is to compute the log likelihood for each model and choose the one that has the greatest log likelihood. You don't need the cdf or quantile function for that, only the density function, which you already have.
The log likelihood is just the sum of log p(x|model) where p(x|model) is the probability density of datum x under a given model. Here "model" = model with parameters selected by maximizing the log likelihood over the possible values of the parameters.
You can be more careful about this by integrating the log likelihood over the parameter space, taking into account also any prior probability assigned to each model; that would be a Bayesian approach.
It sounds like you are essentially looking to choose a model by minimizing the Kolmogorov-Smirnov (KS) statistic, which despite its heavy name is pretty simple -- it is the largest difference between the fitted CDF and the empirical CDF. That's defensible, but I think comparing log likelihoods is more conventional, and also simpler, since you need only the pdf.
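As a rough sketch of the log-likelihood comparison (using a few scipy.stats distributions as stand-ins for your four models, since I don't have your densities; "observations.txt" is a hypothetical file holding your sample):

import numpy as np
from scipy import stats

data = np.loadtxt("observations.txt")          # hypothetical 1-D array of observations

candidates = [stats.gamma, stats.lognorm, stats.weibull_min]
for dist in candidates:
    params = dist.fit(data)                        # maximum-likelihood fit of the parameters
    loglik = np.sum(dist.logpdf(data, *params))    # sum of log p(x | model)
    print(dist.name, loglik)
# the model with the greatest log likelihood is the preferred one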
It happens that there is an easier way. It took me a day or two of digging around until I was pointed toward the right method in scipy.stats. I was looking for the wrong sort of name!
First, build a subclass of rv_continuous to represent one of your distributions. We know the pdf for your distributions, so that's what we define. In this case there's just one parameter. If more are needed just add them to the def statement and use them in the return statement as required.
>>> from scipy import stats
>>> param = 3/2
>>> from math import exp
>>> class NoName(stats.rv_continuous):
... def _pdf(self, x, param):
... return param*exp(-param*x)
...
Now create an instance of this object, declare the lower end of its support (ie, the lowest value that the r.v. can assume), and what the parameters are called.
>>> noname = NoName(a=0, shapes='param')
I don't have an actual sample of values to play with. I'll create a pseudo-random sample.
>>> sample = noname.rvs(size=100, param=param)
Sort it to make it into the so-called 'empirical cdf'.
>>> empirical_cdf = sorted(sample)
The sample has 100 elements, so generate 100 points at which to sample the inverse cdf, or quantile function, as discussed in the paper you referenced.
>>> theoretical_points = [(_-0.5)/len(sample) for _ in range(1, 1+len(sample))]
Get the quantile function values at these points.
>>> theoretical_cdf = [noname.ppf(_, param=param) for _ in theoretical_points]
Plot it all.
>>> from matplotlib import pyplot as plt
>>> plt.plot([0,3.5], [0, 3.5], 'b-')
[<matplotlib.lines.Line2D object at 0x000000000921B400>]
>>> plt.scatter(empirical_cdf, theoretical_cdf)
<matplotlib.collections.PathCollection object at 0x000000000921BD30>
>>> plt.show()
Here's the Q-Q plot that results.
Darn it ... Sorry, I was fixated on a slick solution that somehow bypasses the missing inverse CDF and calculates the quantiles directly (and avoids any numerical approaches). But it can also be done by simple brute force.
First you have to define a fine grid of candidate quantiles for your distributions yourself (for instance ten times finer than the original/empirical quantiles). Then you calculate the corresponding CDF values. Then you compare these values one by one with the evenly spaced points calculated in step 2 of the question. The grid quantiles whose CDF values show the smallest deviations are the theoretical quantiles you were looking for.
The precision of this solution is limited by the resolution of the quantiles you defined yourself.
But maybe I'm wrong and there is a more elegant way to solve this problem, then I would be happy to hear it!
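For what it's worth, here is a rough sketch of that brute-force inversion (an exponential CDF with a made-up parameter stands in for one of your distributions, and "observations.txt" is a hypothetical file with the sample):

import numpy as np

sample = np.sort(np.loadtxt("observations.txt"))      # step 1: sorted observations
n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n               # step 2: evenly spaced points in (0, 1)

lam = 1.5                                             # stand-in parameter
cdf = lambda q: 1 - np.exp(-lam * q)                  # stand-in CDF; replace with your own

grid = np.linspace(0, 1.5 * sample.max(), 10 * n)     # fine grid of candidate quantiles
grid_cdf = cdf(grid)

# for each target probability, pick the grid point whose CDF value is closest
idx = np.abs(grid_cdf[None, :] - probs[:, None]).argmin(axis=1)
theoretical_quantiles = grid[idx]
# scatter sample against theoretical_quantiles to get the Q-Q plot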
I currently have a 4024 by 10 array - where column 0 represent the 4024 different returns of stock 1, column 1 the 4024 returns of stock 2 and so on - for an assignment for my masters where I'm asked to compute the entropy and joint entropy of the different random variables (each random variable obviously being the stock returns). However, these entropy calculations both require the calculation of P(x) and P(x,y). So far I've managed to successfully compute the individual empirical probabilities using the following code:
import numpy as np
import pandas as pd

def entropy(ret, t, T, a, n):
    returns = pd.read_excel(ret)
    returns_df = returns.iloc[t:T, :]
    returns_mat = returns_df.to_numpy()
    asset_returns = returns_mat[:, a]
    hist, bins = np.histogram(asset_returns, bins=n)
    empirical_prob = hist / hist.sum()
    entropy_vector = np.empty(len(empirical_prob))
    for i in range(len(empirical_prob)):
        if empirical_prob[i] == 0:
            entropy_vector[i] = 0
        else:
            entropy_vector[i] = -empirical_prob[i] * np.log2(empirical_prob[i])
    shannon_entropy = np.sum(entropy_vector)
    return shannon_entropy, empirical_prob
P.S. ignore the whole entropy part of the code
As you can see I've simply done the 1d histogram and then divided each count by the total sum of the histogram results in order to find the individual probabilities. However, I'm really struggling with how to go about computing P(x,y) using
np.histogram2d()
Now, obviously P(x,y) = P(x)*P(y) if the random variables are independent, but in my case they are not, as these stocks belong to the same index and therefore possess some positive correlation, i.e. they're dependent, so taking the product of the two individual probabilities does not hold. I've tried following the suggestions of my professor, who said:
"We had discussed how to get the empirical pdf for a univariate distribution: one defines the bins and then counts simply how many observations are in the respective bin (relative to the total number of observations). For bivariate distributions you can do the same, but now you make 2-dimensional binning (check for example the histogram2 command in matlab)"
As you can see, he's referring to the 2d histogram function of MATLAB, but I've decided to do this assignment in Python, and so far I've come up with the following code:
def jointentropy(ret, t, T, a, b, n):
    returns = pd.read_excel(ret)
    returns_df = returns.iloc[t:T, :]
    returns_mat = returns_df.to_numpy()
    assetA = returns_mat[:, a]
    assetB = returns_mat[:, b]
    hist, bins1, bins2 = np.histogram2d(assetA, assetB, bins=n)
But I don't know what to do from here, because
np.histogram2d()
returns a 4025 by 4025 array as well as the two separate bins, so I don't know what I can do to compute P(x,y) for my two dependent random variables.
I've tried to figure this out for hours without any luck or success, so any kind of help would be highly appreciated! Thank you very much in advance!
Looks like you've got a clear case of conditional or Bayesian probability on your hands. You can look it up, for example, here, http://www.mathgoodies.com/lessons/vol6/dependent_events.html, which gives the probability of both events occurring as P(x,y) = P(y) · P(x|y), where P(x|y) is the "probability of event x given y". This should apply in your situation because, if two stocks are from the same index, one price cannot happen without the other. Just build two separate bins like you did for one and calculate probabilities as above.
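For reference, the counts from np.histogram2d can also be normalized the same way as in the 1-D case to get an empirical joint pmf; a sketch continuing from the jointentropy function in the question (hist is the n-by-n count matrix it computes):

import numpy as np

joint_prob = hist / hist.sum()     # empirical P(x, y): relative frequency of each 2-D bin

# joint entropy, skipping empty bins just like the 1-D function does
nonzero = joint_prob > 0
joint_entropy = -np.sum(joint_prob[nonzero] * np.log2(joint_prob[nonzero]))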