Time series - correlation and lag time - python
I am studying the correlation between a set of input variables and a response variable, price. These are all in time series.
1) Is it necessary that I smooth out the curve where the input variable is cyclical (autoregressive)? If so, how?
2) Once a correlation is established, I would like to quantify exactly how the input variable affects the response variable.
Eg: "Once X increases >10% then there is an 2% increase in y 6 months later."
Which python libraries should I be looking at to implement this - in particular to figure out the lag time between two correlated occurrences?
Example:
I already looked at statsmodels.tsa.ARMA, but it seems to deal with predicting only one variable over time. In scipy, the covariance matrix can tell me about the correlation, but it does not help with figuring out the lag time.
While part of the question is more statistics based, the bit about how to do it in Python seems at home here. I see that you've since decided to do this in R from looking at your question on Cross Validated, but in case you decide to move back to Python, or for the benefit of anyone else finding this question:
I think you were in the right area looking at statsmodels.tsa, but there's a lot more to it than just the ARMA package:
http://statsmodels.sourceforge.net/devel/tsa.html
In particular, have a look at statsmodels.tsa.vector_ar for modelling multivariate time series. The documentation for it is available here:
http://statsmodels.sourceforge.net/devel/vector_ar.html
The page above specifies that it's for working with stationary time series - I presume this means removing both trend and any seasonality or periodicity. The following link is ultimately about preparing a model for forecasting, but it discusses the Box-Jenkins approach to building a model, including making the series stationary:
http://www.colorado.edu/geography/class_homepages/geog_4023_s11/Lecture16_TS3.pdf
You'll notice that link discusses looking for autocorrelations (ACF) and partial autocorrelations (PACF), and then using the Augmented Dickey-Fuller test to test whether the series is now stationary. Tools for all three can be found in statsmodels.tsa.stattools. Likewise, statsmodels.tsa.arma_process has ACF and PACF.
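For example, here is a minimal sketch of checking stationarity with statsmodels.tsa.stattools; the random-walk series is just dummy data standing in for your price series:

```python
# Minimal sketch: ADF test plus ACF/PACF inspection on a dummy random walk.
import numpy as np
from statsmodels.tsa.stattools import adfuller, acf, pacf

series = np.cumsum(np.random.randn(500))      # dummy non-stationary data

adf_stat, p_value = adfuller(series)[:2]
print("ADF p-value:", p_value)                # large p-value -> likely non-stationary

# Differencing is the usual first step towards stationarity
diff = np.diff(series)
print("ADF p-value after differencing:", adfuller(diff)[1])

print("ACF :", acf(diff, nlags=10))
print("PACF:", pacf(diff, nlags=10))
```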
The above link also discusses using metrics like AIC to determine the best model; both statsmodels.tsa.var_model and statsmodels.tsa.ar_model include AIC (amongst other measures). The same measures seem to be used for calculating lag order in var_model, using select_order.
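As a rough sketch of that part (the column names and data below are made up), fitting a VAR and letting AIC choose the lag order might look like this:

```python
# Sketch: fit a VAR to two (already stationary) series, lag order chosen by AIC.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "input_var": rng.standard_normal(300),
    "price":     rng.standard_normal(300),
})

model = VAR(data)
results = model.fit(maxlags=12, ic="aic")   # AIC picks the lag order up to 12
print("Chosen lag order:", results.k_ar)
print(results.summary())
# The fitted coefficients show how lagged input_var values feed into price.
```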
In addition, the pandas library is at least partially integrated into statsmodels and has a lot of time series and data analysis functionality itself, so will probably be of interest. The time series documentation is located here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html
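Coming back to the lag-time part of your question: one simple, rough approach with pandas is to shift one series by each candidate lag, record the correlation, and take the lag with the highest correlation. The series below are synthetic, with a lag of 6 built in:

```python
# Rough sketch: estimate the lag between two series by scanning shifted correlations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400
x = pd.Series(rng.standard_normal(n))
price = x.shift(6).fillna(0) * 0.5 + rng.standard_normal(n) * 0.1   # true lag = 6

lags = range(0, 25)
corrs = [price.corr(x.shift(lag)) for lag in lags]   # pairwise NaN handling is automatic
best_lag = int(np.argmax(corrs))
print("Best lag:", best_lag, "correlation:", corrs[best_lag])
```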
Related
Wavelet for time series
I am trying to use wavelet coefficients as features for a neural network on time series data, and I am a bit confused about how to use them. Do I need to compute the coefficients on the entire time series at once, or should I use a sliding window? In other words, will computing the coefficients over the entire time series at once include future data points in those coefficients? What is the right approach to using wavelets on time series data without introducing look-ahead bias?
It is hard to provide a detailed answer without knowing what you are trying to achieve. In a nutshell, you first need to decide whether you want to apply a discrete (DWT) or a continuous (CWT) wavelet transform to your time series. A DWT will allow you to decompose your input data into a set of discrete levels, providing information about the frequency content of the signal, i.e. determining whether the signal contains high-frequency variations or low-frequency trends. Think of it as applying several band-pass filters to your input data. I do not think you should apply a DWT to your entire time series at once. Since you are working with financial data, decomposing your input signal into 1-day windows and applying a DWT on these subsets might do the trick for you. In any case, I would suggest:
1. Installing the pywt toolbox and playing with a dummy time series to understand how wavelet decomposition works (see the sketch below).
2. Checking out the abundant literature available on wavelet analysis of financial data. For instance, if you are interested in financial time series forecasting, you might want to read this paper.
3. Posting your future questions on the DSP Stack Exchange, unless you have a specific coding-related question.
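To illustrate the first suggestion, here is a minimal pywt sketch on a dummy signal (the wavelet choice and window length are arbitrary); a causal, sliding-window version would apply the same call to past-only windows to avoid look-ahead bias:

```python
# Minimal sketch of a multi-level discrete wavelet decomposition with PyWavelets.
import numpy as np
import pywt

signal = np.sin(np.linspace(0, 20 * np.pi, 1024)) + 0.3 * np.random.randn(1024)

# 3-level DWT with a Daubechies-4 wavelet: returns [cA3, cD3, cD2, cD1]
coeffs = pywt.wavedec(signal, "db4", level=3)
for i, c in enumerate(coeffs):
    print("coefficient array", i, "->", len(c), "values")

# Causal, windowed version: decompose only the most recent 128 samples,
# so no future data points enter the coefficients.
window = signal[-128:]
window_coeffs = pywt.wavedec(window, "db4", level=3)
```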
Why does the mne resample method not resample the data point by point?
My understanding of downsampling is that it is an operation to decrease the sample rate of x by keeping the first sample and then every nth sample after the first. The example provided for the resample method of the scipy package clearly illustrates this operation, as depicted in the figure accessible from the link (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.resample.html). In an enlarged view, it is evident that the original data points were resampled point by point. However, in the mne example of downsampling, which is accessible via the link https://mne.tools/dev/auto_examples/preprocessing/plot_resample.html, I notice that the data points were not resampled point by point. This is despite the fact that mne's resample is based on the resample method of the scipy package, as indicated in the mne resample function: https://github.com/mne-tools/mne-python/blob/607fb4613fb5a80dd225132a4a53fe43b8fde0fb/mne/filter.py#L1342 May I know whether this issue is due to ringing artifacts or to other problems? Also, are there remedies to mitigate this problem? Thanks for any insight. The same question has been asked in the mne discussion repo, but it was unanswered at the time of writing.
"My understanding of downsampling is that it is an operation to decrease the sample rate of x by keeping the first sample and then every nth sample after the first."
Resampling typically consists of two steps: low-pass filtering to avoid aliasing, then sample rate reduction (subselecting samples from the resulting signal). The low-passing actually changes the values, so the subselection-of-filtered-data step will not necessarily yield points that were "on" the original signal.
"May I know whether this issue is due to the ringing artifacts or due to other problems?"
In this case it's likely due to the (implicit) low-pass filtering in the frequency-domain resampling of the signal. It looks pretty reasonable to me. If you want to play around with it a bit, you can:
1. Call scipy.signal.resample directly on your data and see how closely it matches.
2. Pad your signal, call scipy.signal.resample, and remove the (now reduced-length) padding - this is what MNE does internally.
3. Use scipy.signal.resample_poly directly on your data.
4. Manually low-pass filter and then directly subselect samples from the low-passed signal, which is what resample_poly does internally.
Also, scipy.signal.resample does frequency-domain resampling, so it implicitly uses a brick-wall filter at Nyquist when downsampling (unless you specify something for the window argument, which gets applied in the frequency domain in addition to the effective brick-wall filter).
P.S. This answer is an extract from a discussion with the MNE folks, namely Eric Larson, Clemens Brunner and Phillip Alday. Credit should go to them.
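As a small, self-contained illustration of points 1 and 3 (a toy signal with arbitrary frequencies, not your data), you can compare the two scipy approaches against naive decimation:

```python
# Sketch: compare FFT-based resampling and polyphase resampling with naive decimation.
import numpy as np
from scipy.signal import resample, resample_poly

fs = 1000                                      # original sampling rate, Hz
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 7 * t) + 0.2 * np.random.randn(t.size)

factor = 4                                     # downsample 1000 Hz -> 250 Hz
x_fft = resample(x, x.size // factor)          # frequency-domain (brick-wall at Nyquist)
x_poly = resample_poly(x, up=1, down=factor)   # FIR low-pass + decimation
x_naive = x[::factor]                          # no anti-alias filtering at all

print("max |fft  - naive|:", np.max(np.abs(x_fft - x_naive)))
print("max |poly - naive|:", np.max(np.abs(x_poly - x_naive)))
```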
How can I statistically compare a light curve dataset with a simulated light curve?
Using Python, I want to compare a simulated light curve with the real light curve. It should be mentioned that the measured data contain gaps and outliers, and the time steps are not constant. The model, however, uses constant time steps. As a first step, I would like to use a statistical method to compare how similar the two light curves are. Which method is best suited for this? As a second step, I would like to fit the model to my measurement data. However, the model data is not calculated in Python but in an independent piece of software. Basically, the model depends on four parameters, all of which are limited to a certain range, and which I am currently feeding manually to the software (automation is planned). What is the best method to create a suitable fit? A brute-force fit is currently an option that comes to mind. This link, https://imgur.com/a/zZ5xoqB, provides three different plots: the simulated light curve, the actual measurement, and finally both together. The simulation is not good, but by playing with the parameters one can get an acceptable result, meaning the phase and period are the same, the magnitude is of the same order, and even the specular flashes should occur at the same period.
If I understand this correctly, you're asking a more foundational question that could be better answered at https://datascience.stackexchange.com/, rather than something specific to Python. That said, as a data science layperson, this may be a problem suited for gradient descent with a mean-square-error cost function. You initialize the parameters of the curve (possibly randomly), then calculate the squared error at your known points. Then you make tiny changes to each parameter in turn and calculate how the cost function is affected. Then you change all the parameters (by a tiny amount) in the direction that decreases the cost function. Repeat this until the parameters stop changing. (Note that this might trap you in a local minimum and not work.) More information: https://towardsdatascience.com/implement-gradient-descent-in-python-9b93ed7108d1
Edit: I overlooked this part: "The simulation is not good, but by playing with the parameters one can get an acceptable result. Which means the phase and period are the same, magnitude is in the same order and even the specular flashes should occur at the same period." Is the simulated curve just a sum of sine waves, and are the parameters just the phase/period/amplitude of each? In that case, what you're looking for is the Fourier transform of your signal, which is very easy to calculate with numpy/scipy: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html
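If the sum-of-sines reading is right, a minimal sketch of the FFT approach looks like this (toy, evenly sampled data; real measurements with gaps and uneven time steps would need preprocessing first):

```python
# Sketch: recover the dominant period and amplitude of an evenly sampled light curve.
import numpy as np

dt = 0.01                                    # time step in days (assumption)
t = np.arange(0, 10, dt)
flux = 2.0 * np.sin(2 * np.pi * t / 2.0) + 0.3 * np.random.randn(t.size)

spectrum = np.fft.rfft(flux)
freqs = np.fft.rfftfreq(flux.size, d=dt)

peak = np.argmax(np.abs(spectrum[1:])) + 1   # skip the zero-frequency bin
print("Dominant period:", 1 / freqs[peak], "days")
print("Amplitude estimate:", 2 * np.abs(spectrum[peak]) / flux.size)
```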
Machine Learning for optimizing parameters
For my master's thesis I am using a 3rd-party program (SExtractor), in addition to a Python pipeline, to work with astronomical image data. SExtractor takes a configuration file with numerous parameters as input, which (after some intermediate steps) influences the statistics of my data. I've already spent way too much time playing around with the parameters, so I've looked a little into machine learning and have gained a very basic understanding. What I am wondering now is: is it reasonable to use a machine learning algorithm to optimize the parameters of SExtractor when the only way to judge the performance or quality of the parameters is via the final statistics of the analysis run (which takes at least an hour on my machine), and there are more than 6 parameters which influence the statistics? As an example, I have included 2 different versions of the statistics I am referring to, made from slightly different SExtractor parameters. The red line in the left image is the median value of the standard deviation (as it should be). The blue line is the median of the standard deviation as I get it. The right images display the differences between the objects in the 2 datasets. I know this is a very specific question, but as I am new to machine learning I can't really judge whether this is possible. So it would be great if someone could tell me whether this is a pointless endeavor and point me in the right direction.
You can try an educated guess based on the data that you already have. You are trying to optimize the parameters such that the median of the standard deviation has the desired value. You could assume various models and try to estimate the parameters based on the model and the estimated data. But I think you should have a good understanding of machine learning to do so - by good I mean beyond an undergraduate course.
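If you do go down that road, one rough way to frame it is as a black-box search, where each evaluation is one full pipeline run. Everything in this sketch - the run_pipeline stand-in, the target value and the parameter names/ranges - is a hypothetical placeholder, not a working recipe:

```python
# Rough sketch of a black-box parameter search over an expensive pipeline run.
import random

TARGET_MEDIAN_STD = 1.0          # desired median standard deviation (placeholder)

def run_pipeline(params):
    """Placeholder for the real, hour-long SExtractor + pipeline run.
    Here it just returns a dummy number so the sketch executes."""
    return 1.0 + 0.2 * (params["DETECT_THRESH"] - 2.5)

param_ranges = {                 # hypothetical parameters and ranges
    "DETECT_THRESH": (1.0, 5.0),
    "DEBLEND_MINCONT": (0.001, 0.1),
}

best_params, best_score = None, float("inf")
for _ in range(20):              # each real evaluation costs ~1 hour, so keep this small
    params = {k: random.uniform(lo, hi) for k, (lo, hi) in param_ranges.items()}
    score = abs(run_pipeline(params) - TARGET_MEDIAN_STD)
    if score < best_score:
        best_params, best_score = params, score

print(best_params, best_score)
```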
Utilising a genetic algorithm to overcome different-sized datasets in a model
So I realise the question I am asking here is large and complex: a potential solution to variances in the sizes of datasets. In all of my searching through statistical forums and posts I haven't come across a scientifically sound method of taking into account the type of data that I am encountering, but I have thought up a (novel?) potential solution to account perfectly (in my mind) for large and small datasets within the same model. The proposed method involves using a genetic algorithm to alter two numbers defining a relationship between the size of the dataset making up an implied strike rate and the percentage of the implied strike rate to be used, with the target of the model being to maximise the homology of the number 1 in two columns of the following CSV (ultra-simplified, but hopefully it demonstrates the principle).
Example data:
Date,PupilName,Unique class,Achieved rank,x,y,x/y,Average xy
12/12/2012,PupilName1,UniqueClass1,1,3000,9610,0.312174818,0.08527
12/12/2012,PupilName2,UniqueClass1,2,300,961,0.312174818,0.08527
12/12/2012,PupilName3,UniqueClass1,3,1,3,0.333333333,0.08527
13/12/2012,PupilName1,UniqueClass2,1,2,3,0.666666667,0.08527
13/12/2012,PupilName2,UniqueClass2,2,0,1,0,0.08527
13/12/2012,PupilName3,UniqueClass2,3,0,5,0,0.08527
13/12/2012,PupilName4,UniqueClass2,4,0,2,0,0.08527
13/12/2012,PupilName5,UniqueClass2,5,0,17,0,0.08527
14/12/2012,PupilName1,UniqueClass3,1,1,2,0.5,0.08527
14/12/2012,PupilName2,UniqueClass3,2,0,1,0,0.08527
14/12/2012,PupilName3,UniqueClass3,3,0,5,0,0.08527
14/12/2012,PupilName4,UniqueClass3,4,0,6,0,0.08527
14/12/2012,PupilName5,UniqueClass3,5,0,12,0,0.08527
15/12/2012,PupilName1,UniqueClass4,1,0,0,0,0.08527
15/12/2012,PupilName2,UniqueClass4,2,1,25,0.04,0.08527
15/12/2012,PupilName3,UniqueClass4,3,1,29,0.034482759,0.08527
15/12/2012,PupilName4,UniqueClass4,4,1,38,0.026315789,0.08527
16/12/2012,PupilName1,UniqueClass5,1,12,24,0.5,0.08527
16/12/2012,PupilName2,UniqueClass5,2,1,2,0.5,0.08527
16/12/2012,PupilName3,UniqueClass5,3,13,59,0.220338983,0.08527
16/12/2012,PupilName4,UniqueClass5,4,28,359,0.077994429,0.08527
16/12/2012,PupilName5,UniqueClass5,5,0,0,0,0.08527
17/12/2012,PupilName1,UniqueClass6,1,0,0,0,0.08527
17/12/2012,PupilName2,UniqueClass6,2,2,200,0.01,0.08527
17/12/2012,PupilName3,UniqueClass6,3,2,254,0.007874016,0.08527
17/12/2012,PupilName4,UniqueClass6,4,2,278,0.007194245,0.08527
17/12/2012,PupilName5,UniqueClass6,5,1,279,0.003584229,0.08527
So I have created a tiny model dataset, which contains some good examples of where my current methods fall short and how I feel a genetic algorithm can be used to fix this. If we look at the dataset above, it contains 6 unique classes. The ultimate objective of the algorithm is to create as high as possible a correspondence between the rank of an adjusted x/y and the achieved rank in column 3 (zero-based referencing). In UniqueClass1 we have two identical x/y values; these are comparatively large x/y values compared with the average (note the average isn't calculated from this dataset), but it would be common sense to expect that 3000/9610 is more significant, and therefore more likely to have an achieved rank of 1, than 300/961. So what I want to do is make an adjusted x/y to overcome these differences in dataset sizes using a logarithmic growth relationship defined by the equation:
adjusted_xy = (1 - exp(-y*α)) * (x/y) + (1 - (1 - exp(-y*α))) * Average_xy
where α is the only dynamic number. Let me explain my logic a little and open myself up to (hopefully) constructive criticism.
The relationship is one of exponential growth between the size of the dataset and the percentage of x/y contributing to the adjusted x/y. Essentially, what the above equation says is that as the dataset gets larger, the percentage of the original x/y used in the adjusted x/y gets larger, and whatever percentage is left is made up by the average xy. Hypothetically this could be 75% x/y and 25% average xy for 300/961, and 95%/5% for 3000/9610, creating an adjusted x/y that reflects the relative significance of the two datasets. To help with understanding: lowering α would produce a relationship whereby a larger dataset would be required to achieve the same "% of xy contributed"; conversely, increasing α would produce a relationship whereby a smaller dataset would be required to achieve the same "% of xy contributed". So I have explained my logic. I am also open to code snippets to help me overcome the problem. I have plans to make a multitude of genetic/evolutionary algorithms in the future and could really use a working example to pick apart and play with in order to help my understanding of how to utilise such abilities of Python. If additional detail or further clarification about the problem or methods is required, please do ask; I really want to be able to solve this problem and future problems of this nature.
So, after much discussion about the methods available to overcome the problem presented here, I have come to the conclusion that the best method would be a genetic algorithm to iterate α in order to maximise the homology/correspondence between the rank of an adjusted x/y and the achieved rank in column 3. It would be greatly appreciated if anyone were able to help in that department. So, to clarify, this post is no longer a discussion about methodology. I am hoping someone can help me produce a genetic algorithm to maximise the homology between the results of the equation
adjusted_xy = (1 - exp(-y*α)) * (x/y) + (1 - (1 - exp(-y*α))) * Average_xy
where adjusted_xy applies to each row of the CSV. Maximising homology could be achieved by minimising the difference between the rank of the adjusted xy (where the rank is taken within each Unique class only) and the Achieved rank. Minimising this value would maximise the homology and essentially solve the problem presented to me of different-sized datasets. If any more information is required please ask; I check this post about 20 times a day at the moment, so I should reply rather promptly. Many thanks, SMNALLY.
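To make the objective concrete, here is a hedged sketch (assuming the CSV above is saved as example.csv) that computes the adjusted x/y for a given α and measures how far its per-class ranks are from the achieved ranks; a GA would then simply search over α to minimise this value:

```python
# Sketch of the objective a genetic algorithm would optimise: adjusted x/y,
# ranked within each Unique class, compared against the Achieved rank.
import numpy as np
import pandas as pd

df = pd.read_csv("example.csv")           # the CSV shown above (file name is an assumption)

def rank_mismatch(alpha, df):
    w = 1 - np.exp(-df["y"] * alpha)                      # weight given to the raw x/y
    adjusted = w * df["x/y"] + (1 - w) * df["Average xy"]
    # Rank within each class; the highest adjusted x/y gets rank 1
    ranks = adjusted.groupby(df["Unique class"]).rank(ascending=False, method="first")
    return float((ranks - df["Achieved rank"]).abs().sum())

print(rank_mismatch(0.01, df))            # smaller is better (more homology)
```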
The problem you are facing sounds to me like the "bias-variance dilemma" from a general point of view. In a nutshell, a more precise model favours variance (sensitivity to changes in a single training set), while a more general model favours bias (the model works for many training sets). May I suggest not focusing on GAs but looking at instance-based learning and advanced regression techniques. The Andrew Moore page at CMU is a good entry point, and particularly those slides.
[EDIT] After a second reading, here is my understanding: You have a set of example data with two related attributes X and Y. You do not want X/Y to dominate when Y is small (considered as less representative). As a consequence, you want to "weight" the examples with an adapted value, adjusted_xy. You want adjusted_xy to be related to a third attribute R (rank), in such a way that, per class, adjusted_xy is sorted like R. To do so you suggest framing it as an optimization problem, searching for the PARAMS of a given function F(X, Y, PARAMS) = adjusted_xy, with the constraint that D = Distance(achieved rank for this class, rank of adjusted_xy for this class) is minimal. Your question, at least for me, is in the field of attribute selection/attribute adaptation (I guess the dataset will later be used for supervised learning). One problem that I see in your approach (if I have understood it well) is that, in the end, rank will be highly related to adjusted_xy, which will therefore bring no interesting supplementary information.
That said, I think you surely know how a GA works. You have to:
1. Define the content of the chromosome: this appears to be your alpha parameter.
2. Define an appropriate fitness function: the fitness for one individual can be a sum of distances over all examples in the dataset.
As you are dealing with real values, other metaheuristics such as Evolution Strategies (ES) or Simulated Annealing may be better suited than a GA. As solving optimization problems is CPU intensive, you might eventually consider C or Java instead of Python (as the fitness at least will be interpreted and thus costly). Alternatively, I would look at using Y as a weight in some supervised learning algorithm (if supervised learning is the target).
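For the "chromosome plus fitness function" part, a minimal (1+1)-style evolution strategy over a single real-valued α could look like the sketch below; the fitness here is a stand-in with a known minimum, and in practice it would be the per-class rank mismatch over the whole dataset:

```python
# Minimal (1+1) evolution-strategy loop over one real-valued parameter, alpha.
import random

def fitness(alpha):
    # Stand-in objective with a known minimum at alpha = 0.02;
    # replace with the rank-mismatch over the full dataset.
    return (alpha - 0.02) ** 2

alpha, step = 0.5, 0.1                     # initial guess and mutation size
best = fitness(alpha)

for generation in range(200):
    candidate = abs(alpha + random.gauss(0, step))   # mutate (keep alpha >= 0)
    score = fitness(candidate)
    if score < best:                        # select: keep the better individual
        alpha, best = candidate, score
    else:
        step *= 0.99                        # slowly shrink the search radius

print("alpha ~", alpha, "fitness:", best)
```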
Let's start with the problem. You consider that some features lead to some of your classes (a 'strike'). You take a subset of your data and try to establish a rule for the strikes. You do establish one, but then you notice that the accuracy of your rule depends on the volume of the dataset that was used to establish the 'strike' rate. You also comment on the effect of some samples in biasing your 'strike' estimate.
The immediate answer is that it looks like you have a lot of variation in your data, so one way or another you will need to collect more data to account for that variation (that is, variation that is inherent to the problem). The fact that in some cases the numbers end up as 'unusable cases' could also be down to outliers - that is, measurements that are 'out of bounds' for a number of reasons, which you would have to find a way to either exclude or re-adjust. But this depends a lot on the context of the problem. 'Strike rates' on their own will not help, but they are perhaps a step in the right direction. In any case, you cannot compare strike rates if they come from samples of different sizes, as you have found out too. If your problem is purely to determine the size of your sample so that your results conform to some specific accuracy, then I would recommend that you have a look at statistical power and how sample size affects it. But still, to determine the sample size you need to know a bit more about your data, which brings us back to the point about inherent variation.
Therefore, my attempt at an answer is this: if I have understood your question correctly, you are dealing with a classification problem in which you seek to assign a number of items (patients) to a number of classes (types of cancer) on the evidence of some features (existence of genetic markers, frequency of their appearance, or any other quantity) about these items. But some features might not exist for all items, or there is a core group of features plus some more that do not appear all the time. The question now is which classifier you use to achieve this. Logistic regression was mentioned previously and has not helped. Therefore, what I would suggest is going for a Naive Bayes classifier. The classifier can be trained with the datasets you have used to derive the 'strike rates', which will provide the a-priori probabilities. When the classifier is 'running', it will use the features of new data to construct the likelihood that the patient who provided this data should be assigned to each class. Perhaps the most common example of such a classifier is the spam-email detector, where the likelihood that an email is spam is judged by the existence of specific words in the email (and a suitable training dataset that provides a good starting point, of course).
Now, in terms of trying this out practically (and since your post is tagged with Python-related tags :) ), I would like to recommend Weka. Weka contains a lot of related functionality, including bootstrapping, which could potentially help you with the differences in the sizes of the datasets. Although Weka is Java, bindings exist for it in Python too. I would definitely give it a go; the Weka package, book and community are very helpful.
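If you'd rather stay in pure Python instead of going through Weka bindings, a hedged alternative is scikit-learn's Gaussian Naive Bayes; the features and labels below are made up purely for illustration:

```python
# Sketch of a Naive Bayes classifier with scikit-learn (alternative to Weka).
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))                 # 200 items, 4 numeric features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # two classes, for illustration only

clf = GaussianNB()
clf.fit(X[:150], y[:150])                         # train on the historical data

print("hold-out accuracy:", clf.score(X[150:], y[150:]))
print("class probabilities for one new item:", clf.predict_proba(X[150:151]))
```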
No. Don't use a genetic algorithm.
The bigger the search space of models and parameters, the better your chances of finding a good fit for your data points - but the less this fit will mean, especially since for some groups your sample sizes are small and the measurements therefore have a large random component to them. This is why, somewhat counterintuitively, it is often actually harder to find a good model for your data after collecting it than before.
You have taken the question to the programmer's lair. This is not the place for it. We solve puzzles, and this is not a puzzle of finding the best line through the dots. You are searching for a model that makes sense and brings understanding of the subject matter. A genetic algorithm is very creative at line-through-dot drawing but will bring you little understanding. Take the problem back where it belongs and ask the statisticians instead. A good model should be based on the theory behind the data. It will have to match the points on the right side of the graph, where (if I understand you right) most of the samples are. It will be able to explain, in hard probabilities, how likely the deviations on the left are, and tell you whether they are significant or not.
If you do want to do some programming, I'd suggest you take the simplest linear model, add some random noise, and do a couple of simulation runs for a population like your subjects. See if the data looks like the data you're looking at, or if it generally 'looks' different, in which case there really is something nonlinear (and possibly interesting) going on on the left.
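One possible reading of that simulation suggestion (interpreting "the simplest model plus noise" as a single constant strike rate with sampling noise, and borrowing group sizes like those in the example data) is sketched below:

```python
# Sketch: simulate groups from one constant underlying rate and see how much
# the small groups swing around it purely by chance.
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.085                           # roughly the "Average xy" from the data
group_sizes = [9610, 961, 3, 1, 5, 2, 17, 25, 29, 38, 24, 2, 59, 359, 200]

for n in group_sizes:
    simulated_rate = rng.binomial(n, true_rate) / n
    print(f"group size {n:5d}: simulated x/y = {simulated_rate:.3f}")
# Small groups scatter wildly around 0.085 even under this null model,
# which is exactly why raw rates from different sample sizes are hard to compare.
```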
I once tackled a similar problem (as similar as problems like this ever are), in which there were many classes and high variance in the features per data point. I personally used a Random Forest classifier (which I wrote in Java). Since your data is highly variable, and therefore hard to model, you could create multiple forests from different random samples of your large dataset, put a control layer on top to classify the data against all the forests, and then take the best score. I don't write Python, but I found this link, http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html, which may give you something to play with.
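For reference, a minimal example of the linked scikit-learn class on made-up data:

```python
# Minimal example of sklearn.ensemble.RandomForestClassifier on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))                              # 500 samples, 10 features
y = (X[:, 0] - X[:, 3] + 0.5 * rng.standard_normal(500) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X[:400], y[:400])

print("hold-out accuracy:", forest.score(X[400:], y[400:]))
print("feature importances:", forest.feature_importances_.round(2))
```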
Following Occam's razor, you should select a simpler model for a small dataset and may want to switch to a more complex model as your dataset grows. There are no [good] statistical tests that show you whether a given model, in isolation, is a good predictor of your data; or rather, a test may tell you that a given model's fitness is N, but you can never tell what an acceptable value of N is. Thus, build several models and pick the one with the better trade-off between predictive power and simplicity using the Akaike information criterion (AIC). It has useful properties and is not too hard to understand. :) There are other tests of course, but AIC should get you started. For a simple test, check out the p-value.
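As a quick illustration of AIC-based selection (toy data, with polynomial degree standing in for "simpler vs more complex"):

```python
# Sketch: fit polynomial models of increasing complexity and compare their AIC.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 3, 60)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.3, x.size)   # true model is quadratic

for degree in range(1, 6):
    X = sm.add_constant(np.column_stack([x**d for d in range(1, degree + 1)]))
    fit = sm.OLS(y, X).fit()
    print(f"degree {degree}: AIC = {fit.aic:.1f}")
# The quadratic model should come out lowest: higher-degree terms improve the fit
# slightly but are penalised for the added complexity.
```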