Testing for stationarity - python

I have a question regarding the "correct" Augmented Dickey–Fuller (ADF) test with "sm.tsa.stattools.adfuller" in Python / IPython. I'm testing ADF on a time series of temperatures.
According to the documentation, the test provides an autolag option in its arguments:
autolag: {‘AIC’, ‘BIC’, ‘t-stat’, None}
if None, then maxlag lags are used
if ‘AIC’ or ‘BIC’, then the number of lags is chosen to minimize the corresponding information criterion
‘t-stat’: based choice of maxlag. Starts with maxlag and drops a lag until the t-statistic on the last lag length is significant at the 95% level.
Which one is scientifically correct / accepted? (And what does AIC/BIC mean?)
Also, is it possible to calculate an approximate p-value for Z(t), like in some Stata tables? See Link
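As a minimal sketch (the temperature series below is a made-up placeholder), adfuller can be called with autolag='AIC', which is also the statsmodels default; AIC and BIC are the Akaike and Bayesian information criteria, which trade goodness of fit against the number of lags. Note that the second return value is already a MacKinnon approximate p-value, comparable to the approximate p-value Stata reports for Z(t):
import numpy as np
from statsmodels.tsa.stattools import adfuller

temps = np.random.default_rng(0).normal(15.0, 3.0, 500)  # placeholder temperature series

adf_stat, pvalue, usedlag, nobs, crit_values, icbest = adfuller(temps, autolag='AIC')
print('ADF statistic:', adf_stat)
print('approximate p-value:', pvalue)
print('critical values:', crit_values)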

Optimization method selection & dealing with convergence and variability

The Problem
I am looking to tackle a minimization problem using scipy's optimization utilities.
Specifically, I've been using this function:
result = spo.minimize(s21_mag, goto_start, options={"disp": True}, bounds=bnds)
My s21_mag function takes a couple of seconds to return an output (due to physically moving motors). It consists of 3 parameters (3 moving parts), with no constraints - just three bounds (identical for all 3 parameters):
bnds = ((0,45000),(0,45000),(0,45000))
The limit on the number of iterations is not a tight constraint (1000 is probably a good enough upper limit for me), but I expect the optimizer to try many configurations within that budget to identify an optimal value. So far, some methods I've tried just seem to converge somewhere without meaningful progress.
Here's progress beyond the 50th iteration (full code here) - the goal is the maximization of S21 at a specific frequency (purple vertical line):
This is with no method passed to spo.minimize(), so it uses the default (and it looks like it applies the exact same movement to each motor).
Questions
Although scipy's minimization function offers a wide variety of optimization methods/algorithms, how could I (as a beginner in optimization math) select the one that would work best for my application? What kind of aspects of my problem should I take into account to jump to such conclusions? Assume I have no idea about the initial value of each parameter and want the optimizer to figure that out (I usually just set it to the midpoint, i.e. initial: x1=x2=x3=22500).
The same set of parameters as an input to my s21_mag function could yield different results at different times the function is called.
This happens for two reasons:
(a) The parameter step of the optimizer can get extremely small (particularly as the number of iterations increases and convergence is approached), whereas the motor expects a minimum value of ~100 to make a step.
Is there a way to somehow set a minimum step? Otherwise, it tries to step from e.g. 1234.0 to 1234.0001 and eventually gets "stuck" making tiny changes.
(b) The output of the function goes through a measuring instrument, which exhibits a little bit of noise (e.g. one measurement may yield 5.42 dB, while another measurement (with the exact same parameters) may yield 5.43 dB).
Is there a way to deal with these kinds of small variabilities/errors so they don't confuse the optimizer?
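One possible way to illustrate handling the two issues described in (a) and (b) is to snap the parameters to the motor's ~100-count resolution and average a few noisy readings inside the objective, then use a derivative-free method such as Powell. This is only a sketch: the s21_mag below is a dummy stand-in, not the real motor/VNA measurement.
import numpy as np
from scipy.optimize import minimize

MOTOR_STEP = 100   # assumed minimum motor increment
N_REPEATS = 3      # average a few readings to damp instrument noise

def s21_mag(x):    # dummy stand-in for the real (slow, noisy) measurement
    x = np.asarray(x, dtype=float)
    return -np.exp(-np.sum((x - 30000.0) ** 2) / 1e9) + np.random.normal(0, 0.01)

def quantized_objective(x):
    x_snapped = np.round(np.asarray(x) / MOTOR_STEP) * MOTOR_STEP  # enforce the minimum step
    return np.mean([s21_mag(x_snapped) for _ in range(N_REPEATS)])

bnds = ((0, 45000), (0, 45000), (0, 45000))
x0 = [22500, 22500, 22500]

result = minimize(quantized_objective, x0, method="Powell", bounds=bnds,
                  options={"xtol": MOTOR_STEP, "disp": True})
print(result.x)
Because quantizing creates flat plateaus in the objective, a stochastic derivative-free method such as scipy.optimize.differential_evolution may also be worth trying.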

Backtest a Probability of default model

Background
Financial institutions use Probability of Default (PD) models for purposes such as client acceptance, provisioning and regulatory capital calculation as required by the Basel accords and the European Capital requirements regulation and directive (CRR/CRD IV).
A PD model is supposed to calculate the probability that a client defaults on its obligations within a one year horizon.
Backtests
To test whether a model is performing as expected, so-called backtests are performed. One such backtest would be to calculate how likely it is to observe the actual number of defaults at or beyond their actual deviation from the expected value (the sum of the client PD values). If this probability turns out to be below a certain threshold, the model will be rejected.
For this procedure one would need the CDF of the distribution of the sum of n Bernoulli experiments, each with an individual, potentially unique PD.
The Question
Does Python have a built-in distribution that describes the sum of a number of Bernoulli draws each with its own probability?
Addendum
Since many financial institutions divide their portfolios in buckets in which clients have identical PDs, can we optimize the calculation for this situation?
Note: This question has been asked on Mathematica Stack Exchange, where an answer has already been provided. I need the equivalent in Python code. Here is the link to the Mathematica solution:
https://mathematica.stackexchange.com/questions/131347/backtesting-a-probability-of-default-pd-model
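The distribution in question is the Poisson binomial distribution. As a sketch (the PDs below are made up), its exact PMF and CDF can be built in NumPy by convolving the individual two-point Bernoulli PMFs:
import numpy as np

def poisson_binomial_pmf(pds):
    # Exact PMF of the number of defaults, given one PD per client.
    pmf = np.array([1.0])
    for p in pds:
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf

pds = np.array([0.01, 0.02, 0.015, 0.03, 0.02])   # toy portfolio PDs
pmf = poisson_binomial_pmf(pds)
cdf = np.cumsum(pmf)

observed_defaults = 2
print('P(defaults >= observed) =', pmf[observed_defaults:].sum())
For the bucketed case in the addendum, each bucket's default count is Binomial(n_b, p_b), so the same convolution can be run over the bucket-level binomial PMFs (scipy.stats.binom.pmf) instead of over individual clients, which is far cheaper for large portfolios.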

One sample & Two Sample t-tests in Python

I'm not sure how to work this out; I'm new to the world of stats and Python. A student is trying to decide between two Processing Units. He wants to use the Processing Unit for his research to run high performance algorithms, so the only thing he is concerned with is speed. He picks a high performance algorithm on a large data set and runs it on both Processing Units 10 times, timing each run in hours. The results are given in the lists TestSample1 and TestSample2 below.
from scipy import stats
import numpy as np
TestSample1 = np.array([11,9,10,11,10,12,9,11,12,9])
TestSample2 = np.array([11,13,10,13,12,9,11,12,12,11])
Assumption: Both the dataset samples above are random, independent, parametric & normally distributed
Hint: You can import ttest function from scipy to perform t tests
First T test
One sample t-test
Check if the mean of the TestSample1 is equal to zero.
Null Hypothesis is that mean is equal to zero.
Alternate hypothesis is that it is not equal to zero.
Question 2
Given,
1. Null Hypothesis : There is no significant difference between datasets
2. Alternate Hypothesis : There is a significant difference
Do two-sample testing and check whether to reject Null Hypothesis or not.
Question 3 -
Do two-sample testing and check whether there is a significant difference between the speeds of two samples: TestSample1 & TestSample3
He is trying a third Processing Unit - TestSample3.
TestSample3 = np.array([9,10,9,11,10,13,12,9,12,12])
Assumption: Both the datasets (TestSample1 & TestSample3) are random, independent, parametric & normally distributed
Question 1
The way to do this with SciPy would be this:
stats.ttest_1samp(TestSample1, popmean=0)
It is not a useful test to perform in this context though, because we already know that the null hypothesis must be false. Negative times are impossible, so the only way for the population mean of times to be zero would be if every time measured were always zero, which is clearly not the case.
Question 2
Here's how to do a two-sample t-test for independent samples with SciPy:
stats.ttest_ind(TestSample1, TestSample2)
Output:
Ttest_indResult(statistic=-1.8325416653445783, pvalue=0.08346710398411555)
So the t-statistic is -1.8, but its deviation from zero is not formally significant (p = 0.08). This result is inconclusive. Of course it would be better to have more precise measurements, not rounded to hours.
In any case, I would argue that given your stated setting, you do not really need this test either. It is highly unlikely that two different CPUs perform exactly the same, and you just want to decide which one to go with. Simply choosing the one with the lower average time, regardless of significance test results, is clearly the right decision here.
Question 3
This is analogous to Question 2.
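For completeness, a minimal sketch reusing the arrays from the question:
from scipy import stats
import numpy as np

TestSample1 = np.array([11, 9, 10, 11, 10, 12, 9, 11, 12, 9])
TestSample3 = np.array([9, 10, 9, 11, 10, 13, 12, 9, 12, 12])

t_stat, p_value = stats.ttest_ind(TestSample1, TestSample3)
print(t_stat, p_value)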

How can I statistically compare a lightcurve data set with the simulated lightcurve?

Using Python, I want to compare a simulated light curve with the real light curve. It should be mentioned that the measured data contain gaps and outliers and the time steps are not constant. The model, however, uses constant time steps.
In a first step I would like to compare with a statistical method how similar the two light curves are. Which method is best suited for this?
In a second step I would like to fit the model to my measurement data. However, the model data is not calculated in Python but in an independent piece of software. Basically, the model data depends on four parameters, all of which are limited to a certain range, which I am currently feeding manually to the software (automating this is planned).
What is the best method to create a suitable fit?
A "Brute-Force-Fit" is currently an option that comes to my mind.
This link "https://imgur.com/a/zZ5xoqB" provides three different plots. The simulated lightcurve, the actual measurement and lastly both together. The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
If I understand this correctly, you're asking a more foundational question that could be better answered in https://datascience.stackexchange.com/, rather than something specific to Python.
That said, as a data science layperson, this may be a problem suited for gradient descent with a mean-square-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the square error at your known points.
Then you make tiny changes to each parameter in turn, and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing.
(Note that this might trap you in a local minimum and not work.)
More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
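As an illustration only (the real model lives in external software; the sine model, parameter names, and data below are all invented), a numerical gradient descent on a mean-square-error cost might look like this:
import numpy as np

def model(params, t):                       # hypothetical 4-parameter lightcurve model
    amp, period, phase, offset = params
    return amp * np.sin(2 * np.pi * t / period + phase) + offset

def mse(params, t_obs, y_obs):
    return np.mean((model(params, t_obs) - y_obs) ** 2)

def numerical_gradient(params, t_obs, y_obs, eps=1e-5):
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (mse(params + step, t_obs, y_obs) -
                   mse(params - step, t_obs, y_obs)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
t_obs = np.sort(rng.uniform(0, 10, 100))    # irregular sampling, noise added below
y_obs = model(np.array([1.0, 5.0, 0.3, 0.1]), t_obs) + rng.normal(0, 0.05, t_obs.size)

params = np.array([0.8, 4.8, 0.0, 0.0])     # rough initial guess
for _ in range(5000):
    params -= 0.02 * numerical_gradient(params, t_obs, y_obs)
print(params)                               # should drift toward [1.0, 5.0, 0.3, 0.1]
The learning rate and starting point matter a lot here, and (as noted above) the descent can stall in a local minimum, especially for the period parameter.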
Edit: I overlooked this part
The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period.
Is the simulated curve just a sum of sine waves, and are the parameters just phase/period/amplitude of each? In this case what you're looking for is the Fourier transform of your signal, which is very easy to calculate with numpy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
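If so, a rough NumPy sketch (assuming the measurements have first been interpolated onto a regular grid, since the FFT needs even sampling; the signal below is synthetic):
import numpy as np

dt = 0.1
t = np.arange(0, 50, dt)
signal = 2.0 * np.sin(2 * np.pi * t / 5.0 + 0.4) + np.random.normal(0, 0.1, t.size)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, d=dt)

k = np.argmax(np.abs(spectrum[1:])) + 1                   # skip the zero-frequency (mean) bin
print('period    ~', 1.0 / freqs[k])
print('amplitude ~', 2.0 * np.abs(spectrum[k]) / t.size)
print('phase     ~', np.angle(spectrum[k]) + np.pi / 2)   # sine convention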

Fitting data to distributions?

I am not a statistician (more of a researchy web developer) but I've been hearing a lot about scipy and R these days. So out of curiosity I wanted to ask this question (though it might sound silly to the experts around here) because I am not sure of the advances in this area and want to know how people without a sound statistics background approach these problems.
Given a set of real numbers observed from an experiment, let us say they belong to one of the many distributions out there (like Weibull, Erlang, Cauchy, Exponential etc.), are there any automated ways of finding the right distribution and the distribution parameters for the data? Are there any good tutorials that walk me through the process?
Real-world Scenario:
For instance, let us say I initiated a small survey and recorded information about how many people a person talks to every day for say 300 people and I have the following information:
1 10
2 5
3 20
...
...
where X Y tells me that person X talked to Y people during the period of the survey. Now using the information from the 300 people, I want to fit this into a model. The question boils down to are there any automated ways of finding out the right distribution and distribution parameters for this data or if not, is there a good step-by-step procedure to achieve the same?
This is a complicated question, and there are no perfect answers. I'll try to give you an overview of the major concepts, and point you in the direction of some useful reading on the topic.
Assume that you have a one-dimensional set of data, and a finite set of probability distribution functions that you think the data may have been generated from. You can consider each distribution independently, and try to find parameters that are reasonable given your data.
There are two methods for setting parameters for a probability distribution function given data:
Least Squares
Maximum Likelihood
In my experience, Maximum Likelihood has been preferred in recent years, although this may not be the case in every field.
Here's a concrete example of how to estimate parameters in R. Consider a set of random points generated from a Gaussian distribution with mean of 0 and standard deviation of 1:
x = rnorm( n = 100, mean = 0, sd = 1 )
Assume that you know the data were generated using a Gaussian process, but you've forgotten (or never knew!) the parameters for the Gaussian. You'd like to use the data to give you reasonable estimates of the mean and standard deviation. In R, there is a standard library that makes this very straightforward:
library(MASS)
params = fitdistr( x, "normal" )
print( params )
This gave me the following output:
mean sd
-0.17922360 1.01636446
( 0.10163645) ( 0.07186782)
Those are fairly close to the right answer, and the numbers in parentheses are the estimated standard errors of the parameters. Remember that every time you generate a new set of points, you'll get a new answer for the estimates.
Mathematically, this is using maximum likelihood to estimate both the mean and standard deviation of the Gaussian. Likelihood means (in this case) "the probability of the data given values of the parameters", and maximum likelihood estimation finds the parameter values that maximize the probability of generating the observed data; for some distributions this requires numerical optimization. In R, most of the work is done by fitdistr, which in certain cases will call optim.
You can extract the log-likelihood from your parameters like this:
print( params$loglik )
[1] -139.5772
It's more common to work with the log-likelihood rather than the likelihood to avoid rounding errors: estimating the joint probability of your data involves multiplying probabilities that are all less than 1, so even for a small data set the joint probability heads toward 0 very quickly, and adding the log-probabilities of your data is equivalent to multiplying the probabilities. Larger (less negative) log-likelihoods correspond to higher likelihoods, so more negative numbers indicate worse fits to your data.
With computational tools like this, it's easy to estimate parameters for any distribution. Consider this example:
x = x[ x >= 0 ]
distributions = c("normal","exponential")
for ( dist in distributions ) {
    print( paste( "fitting parameters for ", dist ) )
    params = fitdistr( x, dist )
    print( params )
    print( summary( params ) )
    print( params$loglik )
}
The exponential distribution doesn't generate negative numbers, so I removed them in the first line. The output (which is stochastic) looked like this:
[1] "fitting parameters for normal"
mean sd
0.72021836 0.54079027
(0.07647929) (0.05407903)
Length Class Mode
estimate 2 -none- numeric
sd 2 -none- numeric
n 1 -none- numeric
loglik 1 -none- numeric
[1] -40.21074
[1] "fitting parameters for exponential"
rate
1.388468
(0.196359)
Length Class Mode
estimate 1 -none- numeric
sd 1 -none- numeric
n 1 -none- numeric
loglik 1 -none- numeric
[1] -33.58996
The exponential distribution is actually slightly more likely to have generated this data than the normal distribution, likely because the exponential distribution doesn't have to assign any probability density to negative numbers.
All of these estimation problems get worse when you try to fit your data to more distributions. Distributions with more parameters are more flexible, so they'll fit your data better than distributions with fewer parameters. Also, some distributions are special cases of other distributions (for example, the Exponential is a special case of the Gamma). Because of this, it's very common to use prior knowledge to constrain your choice of models to a subset of all possible models.
One trick to get around some problems in parameter estimation is to generate a lot of data, and leave some of the data out for cross-validation. To cross-validate your fit of parameters to data, leave some of the data out of your estimation procedure, and then measure each model's likelihood on the left-out data.
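Since the question also mentions scipy, roughly the same maximum-likelihood workflow in Python might look like this (a sketch on synthetic data):
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100)
x = x[x >= 0]                       # the exponential cannot produce negative values

candidates = {'normal': stats.norm, 'exponential': stats.expon}
for name, dist in candidates.items():
    params = dist.fit(x)            # maximum likelihood fit
    loglik = np.sum(dist.logpdf(x, *params))
    print(name, params, 'log-likelihood =', round(loglik, 2))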
Take a look at fitdistrplus (http://cran.r-project.org/web/packages/fitdistrplus/index.html).
A couple of quick things to note:
Try the function descdist, which provides a plot of skew vs. kurtosis of the data and also shows some common distributions.
fitdist allows you to fit any distributions you can define in terms of density and cdf.
You can then use gofstat which computes the KS and AD stats which measure distance of the fit from the data.
This is probably a bit more general than you need, but might give you something to go on.
One way to estimate a probability density function from random data is to use an Edgeworth or Gram–Charlier expansion. These approximations use density function properties known as cumulants (the unbiased estimators for which are the k-statistics) and express the density function as a perturbation from a Gaussian distribution.
These both have some rather dire weaknesses such as producing divergent density functions, or even density functions that are negative over some regions. However, some people find them useful for highly clustered data, or as starting points for further estimation, or for piecewise estimated density functions, or as part of a heuristic.
M. G. Kendall and A. Stuart, The advanced theory of statistics, vol. 1,
Charles Griffin, 1963, was the most complete reference I found for this, with a whopping whole page dedicated to the topic; most other texts had a sentence on it at most or listed the expansion in terms of the moments instead of the cumulants which is a bit useless. Good luck finding a copy, though, I had to send my university librarian on a trip to the archives for it... but this was years ago, so maybe the internet will be more helpful today.
The most general form of your question is the topic of a field known as non-parametric density estimation, where given:
data from a random process with an unknown distribution, and
constraints on the underlying process
...you produce a density function that is the most likely to have produced the data. (More realistically, you create a method for computing an approximation to this function at any given point, which you can use for further work, e.g. comparing the density functions from two sets of random data to see whether they could have come from the same process.)
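One concrete non-parametric option, as a rough illustration on synthetic data, is a kernel density estimate:
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 200)])

kde = gaussian_kde(data)            # bandwidth chosen automatically (Scott's rule)
grid = np.linspace(-5, 7, 200)
density = kde(grid)                 # estimated density over a grid of points
print(grid[np.argmax(density)])     # location of the estimated mode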
Personally, though, I have had little luck in using non-parametric density estimation for anything useful, but if you have a steady supply of sanity you should look into it.
I'm not a scientist, but if you were doing it with a pencil and paper, the obvious way would be to make a graph, then compare the graph to that of a known standard distribution.
Going further with that thought, "comparing" means checking whether the curves of a standard distribution and yours look similar.
Trigonometry, tangents... would be my last thought.
I'm not an expert, just another humble Web Developer =)
You are essentially wanting to compare your real-world data to a set of theoretical distributions. There is the function qqnorm() in base R, which will do this for the normal distribution, but I prefer the probplot function in e1071, which allows you to test other distributions. Here is a code snippet that will plot your real data against each of the theoretical distributions that we put into the list. We use plyr to iterate through the list, but there are several other ways to do so as well.
library("plyr")
library("e1071")
realData <- rnorm(1000) #Real data is normally distributed
distToTest <- list(qnorm = "qnorm", lognormal = "qlnorm", qexp = "qexp")
#function to test real data against list of distributions above. Output is a jpeg for each distribution.
testDist <- function(x, data){
    jpeg(paste(x, ".jpeg", sep = ""))
    probplot(data, qdist = x)
    dev.off()
}
l_ply(distToTest, function(x) testDist(x, realData))
For what it's worth, it seems like you might want to look at the Poisson distribution.
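For the count data in the question (how many people each person talks to per day), a quick Python sketch of that idea (the counts below are invented):
import numpy as np
from scipy import stats

counts = np.array([10, 5, 20, 7, 12, 9, 15, 8])   # made-up survey responses
lam = counts.mean()                                # Poisson MLE for the rate
print(stats.poisson.pmf(np.arange(0, 25), lam))    # fitted probability of each count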
