T-test of stock data - python

I have a dataset consisting of 4 portfolios × 120 rows of monthly return data for stocks in the US market (over ten years).
I want to compare the means of the different portfolios and tell whether they are significantly different from each other.
Should I use the monthly stock data or the cumulative data (over the 120 months) for an unpaired two-sided t-test?
When I use the monthly data I get no significance; when I use the cumulative data I get some significant p-values between some portfolios.
At this point I don't know which data to use for this sort of t-test in order to obtain meaningful results.

You may want to use ANOVA instead of a t-test since you have more than two groups. Before using ANOVA, you need to run a homogeneity of variance (HOV) test, which measures whether there is a difference in variance between the groups. For the HOV test you can use the Bartlett test or the Fligner-Killeen test: the Bartlett test gives more accurate results if the data are normal; if the data are not normal, the Fligner-Killeen test is the better choice. Only if the null hypothesis (no difference in variance) is accepted can ANOVA be used. If the null hypothesis is rejected, then multiple t-tests can be used in place of ANOVA, as you have done already.
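A minimal sketch of that workflow with scipy.stats; the four portfolio series below are placeholder random data standing in for your 120 monthly returns per portfolio:

    import numpy as np
    from scipy import stats

    # Placeholder data: four portfolios of 120 monthly returns each.
    rng = np.random.default_rng(0)
    p1, p2, p3, p4 = (rng.normal(loc=m, scale=0.05, size=120)
                      for m in (0.010, 0.012, 0.008, 0.011))

    # Homogeneity of variance: Bartlett if the returns look normal,
    # Fligner-Killeen otherwise.
    print(stats.bartlett(p1, p2, p3, p4))
    print(stats.fligner(p1, p2, p3, p4))

    # If the HOV null hypothesis (equal variances) is not rejected,
    # run a one-way ANOVA across all four portfolios.
    print(stats.f_oneway(p1, p2, p3, p4))

    # Otherwise, fall back to pairwise Welch t-tests (equal_var=False).
    print(stats.ttest_ind(p1, p2, equal_var=False))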
As far as which dataset to use is concerned, the monthly stock data sounds more plausible. An alternative would be to use correlations between portfolio returns and to carry out a significance test of those correlations.

Related

Univariate time series forecasting based (also) on other univariate series

I am given energy consumption data for some households in some given areas over a period of several months. For some households I have the complete dataset; for others the last month is missing and I have to estimate it in Python. I am trying to find an estimation technique that allows me to combine two aspects:
historical time-series analysis: I want to consider past consumption in order to estimate the last month;
"similar" households' consumption: I also want to include information from similar households that have a complete dataset, on the assumption that there exist external variables (like temperature) that affect all households in similar ways.
The dataset for each household is univariate (consisting of just month/consumption).
Is there a Python module that allows me to forecast considering both aspects?
Thanks!

How to correlate partial signal with dataset

I have some large datasets of sensor values, each consisting of a single sensor value sampled at a one-minute interval, like a waveform. The total dataset spans a few years.
I wish to (using Python) select an arbitrary stretch of sensor data (for instance 600 values, i.e. 10 hours' worth of data) and find all time stamps where roughly the same shape occurred in these datasets.
The matches should be made by shape (relative differences), not by actual values, as different sensors are used with different biases and in different environments. Also, I wish to retrieve multiple matches within a single dataset for further analysis.
I’ve been looking into pandas, but I’m stuck at the moment... any guru here?
I don't know much about the functionality available in pandas. I think you first need to decide the typical time span T over which the correlation is supposed to occur. What I would do is split all your time series into (possibly overlapping) segments of duration T using NumPy (see here for instance). This will lead to a long list of segments. I would then compute the correlation between all pairs of segments using e.g. corrcoef. You get a large correlation matrix in which you can spot the pairs of similar segments by applying a threshold on the absolute value of the correlation. You can estimate the correct threshold by applying this algorithm to a data set where you don't expect any correlation, or by randomizing your data.
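A minimal sketch of that idea in NumPy; the series, window length, step, and threshold below are all made-up illustration values:

    import numpy as np

    # Placeholder series standing in for one sensor's 1-minute readings.
    rng = np.random.default_rng(0)
    series = rng.normal(size=5000).cumsum()

    window = 600   # segment length, e.g. 10 hours at 1-minute sampling
    step = 60      # start a new (overlapping) segment every hour

    # Build an (n_segments, window) array of overlapping segments.
    starts = np.arange(0, len(series) - window + 1, step)
    segments = np.stack([series[s:s + window] for s in starts])

    # corrcoef over the rows gives an (n_segments, n_segments) correlation matrix.
    corr = np.corrcoef(segments)

    # Spot similar pairs by thresholding the absolute correlation,
    # skipping the diagonal and duplicate (j, i) pairs.
    threshold = 0.9
    i_idx, j_idx = np.where(np.triu(np.abs(corr), k=1) > threshold)
    for i, j in zip(i_idx, j_idx):
        print(f"segments at t={starts[i]} and t={starts[j]} correlate at {corr[i, j]:.2f}")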

Need an approach recommendation in the field of ML or DL for a value-forecasting task

I am trying to forecast a value 30 days ahead. I have time series data with some parameters. An example of the data is attached at the bottom.
The main idea is that Y is the target variable to predict 30 days from today. The f1-f5 variables influence the Y value, so I need to predict Y using the Date and f1-f5 columns. New data arrives every day.
Please recommend some ML and DL approaches to predict the "Y" value.
My thoughts:
As I understand it, this is time series data and the task is regression. But I am a bit disappointed because time series approaches, as I understand them, predict a value based only on the date, using seasonality and so on. I am afraid that if I use XGBoost or linear regression approaches I will lose the time series effect in this data.
Date,f1,f2,f3,f4,f5,Y
2015-01-01,183,34,15,1166,50,3251
2015-01-02,364,173,5,739,32,8132
2015-01-03,83,72,38,551,49,6271
2015-01-04,183,81,7,937,32,3334
2015-01-05,324,61,73,554,71,3742
2015-01-06,183,97,15,337,17,5543
2015-01-07,38,152,83,883,32,9143
2015-01-08,78,72,5,551,11,6435
2015-01-09,183,30,21,443,92,4353
...,...,...,...,...,...,...
2018-06-08,924,9,53,897,88,7446
Time series are traditionally modeled with AR (auto-regression) and MA (moving average). Trend and seasonality should also be accounted for. So why not use ARIMA or Prophet? Here's some theory on the subject - https://otexts.com/fpp2/
There are some ML/DL implementations based on RNNs/LSTMs, but they are really complex, often hard to explain, and tend to suffer from the vanishing gradient problem. If you must use ML/DL, you may want to have a look at LSTNet.
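A hedged sketch of the ARIMA suggestion using statsmodels' SARIMAX, here also passing the f1-f5 columns as exogenous regressors; the file name "data.csv", the model order, and the train/test split are assumptions for illustration:

    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # "data.csv" stands in for the CSV shown above.
    df = pd.read_csv("data.csv", parse_dates=["Date"], index_col="Date")

    exog_cols = ["f1", "f2", "f3", "f4", "f5"]
    train, test = df.iloc[:-30], df.iloc[-30:]

    # ARIMA(1,0,1) on Y with f1-f5 as exogenous regressors; the order is a
    # placeholder and should be chosen via ACF/PACF plots or a grid search.
    model = SARIMAX(train["Y"], exog=train[exog_cols], order=(1, 0, 1))
    result = model.fit(disp=False)

    # Forecasting 30 days ahead needs future values of f1-f5 (known or forecast).
    forecast = result.forecast(steps=30, exog=test[exog_cols])
    print(forecast.head())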

Is there a way to cluster transaction (journal) data using Python if a transaction is represented by two or more rows?

In accounting, the data set representing the transactions is called a 'general ledger' and takes the following form:
Note that a 'journal', i.e. a transaction, consists of two line items. E.g. transaction (Journal Number) 1 has two lines: the receipt of cash and the income. Companies could also have transactions (journals) consisting of 3 line items or even more.
Will I first need to cleanse the data so that there is only one line item per journal, i.e. cleanse the above 8 rows into 4?
Are there any Python machine learning algorithms which will allow me to cluster the above data without further manipulation?
The aim is to detect anomalies in transaction data. I do not know what anomalies look like, so this would need to be unsupervised learning.
Fit a Gaussian to each dimension of the data to determine what is an anomaly. The mean and variance are estimated per dimension, and if the probability of a new data point on that dimension is below a threshold, it is considered an outlier. This gives one Gaussian per dimension. You can use some feature engineering here, rather than just fitting Gaussians to the raw data.
If features don't look Gaussian (plot their histograms), use data transformations like log(x) or sqrt(x) to change them until they look better.
Use anomaly detection if supervised learning is not available, or if you want to find new, previously unseen kinds of anomalies (such as the failure of a power plant, or someone acting suspiciously) rather than a known classification task (such as whether someone is male or female).
Error analysis: what if p(x), the probability that an example is not an anomaly, is large for all examples? Add another dimension and hope it helps to expose the anomaly. You could create this dimension by combining some of the others.
To fit the Gaussian more closely to the shape of your data, you can make it multivariate. It then takes a mean vector and a covariance matrix, and you can vary the parameters to change its shape. It will also capture feature correlations, if your features are not all independent.
https://stats.stackexchange.com/questions/368618/multivariate-gaussian-distribution
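A rough sketch of both the per-dimension and the multivariate Gaussian approach; the feature matrix X below is made-up placeholder data standing in for numeric features engineered from the journal lines (e.g. amount, number of line items, posting hour):

    import numpy as np
    from scipy.stats import multivariate_normal

    # Placeholder feature matrix: 1000 journals x 3 engineered features.
    rng = np.random.default_rng(0)
    X = rng.normal(loc=[100.0, 2.0, 12.0], scale=[20.0, 0.5, 3.0], size=(1000, 3))

    # Per-dimension Gaussians: multiply the per-feature densities,
    # which treats the features as independent.
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    per_dim = np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    p_independent = per_dim.prod(axis=1)

    # Multivariate Gaussian: uses the full covariance matrix,
    # so correlated features are handled properly.
    mvn = multivariate_normal(mean=mu, cov=np.cov(X, rowvar=False))
    p_multivariate = mvn.pdf(X)

    # Journals whose probability falls below a chosen threshold are flagged.
    thr_ind = np.quantile(p_independent, 0.01)
    thr_mvn = np.quantile(p_multivariate, 0.01)
    print("independent-Gaussian anomalies:", int(np.sum(p_independent < thr_ind)))
    print("multivariate-Gaussian anomalies:", int(np.sum(p_multivariate < thr_mvn)))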

numeric value for time series data with frequency every 2 minutes

I have a dataset covering two months in which I get a reading every 2 minutes. The statsmodel.tsa.seasonal_decompose method asks for a numeric value for the frequency. What would be the numeric value for such data, and what is the proper method to calculate the frequency for this kind of time series data?
You need to identify the frequency of the seasonality yourself. Often this is done using knowledge of the dataset or by visually inspecting a partial autocorrelation plot, which the statsmodels library provides:
statsmodels - partial autocorrelation
If the data has an hourly seasonality, you might see a significant partial autocorrelation at lag 30 (as there are 30 data points between this hour's first 2 minutes and last hour's first 2 minutes). I'm assuming statsmodels would expect this value; likewise, with monthly data it would expect 12, and with daily data it would expect 7 for the weekly shape, etc.
It sounds like you have multiple seasonalities to consider, judging by your other post. You might see a significant lag that corresponds to the same 2 minutes in the previous hours, previous days, and/or the previous weeks. This approach to seasonal decomposition is considered naive and only addresses one seasonality, as described in its documentation:
Seasonal Decomposition
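For illustration, a small sketch of inspecting the partial autocorrelation and then running the single-seasonality decomposition with a period of 30; the file and column names are placeholders:

    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_pacf
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Placeholder file/column names for the 2-minute sensor readings.
    series = pd.read_csv("readings.csv", parse_dates=["timestamp"],
                         index_col="timestamp")["value"]

    # A significant spike at lag 30 would suggest an hourly cycle
    # (30 two-minute intervals per hour); lag 720 would suggest a daily cycle.
    plot_pacf(series, lags=60)
    plt.show()

    # seasonal_decompose handles one seasonality; recent statsmodels versions
    # call the argument `period` (older ones call it `freq`).
    result = seasonal_decompose(series, model="additive", period=30)
    result.plot()
    plt.show()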
If you would like to continue down the seasonal decomposition path, you could try the relatively recently released double-seasonal model from Facebook. It is specifically designed to work well with daily data, where it models intra-year and intra-week seasonalities. Perhaps it can be adapted to your problem.
fbprophet
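A minimal Prophet sketch under the same placeholder file and column assumptions; whether it adapts well to 2-minute data is, as said above, an open question:

    import pandas as pd
    from prophet import Prophet  # older releases ship as `fbprophet`

    # Prophet expects a two-column frame: `ds` (timestamps) and `y` (values).
    df = (pd.read_csv("readings.csv", parse_dates=["timestamp"])
            .rename(columns={"timestamp": "ds", "value": "y"}))

    m = Prophet(daily_seasonality=True, weekly_seasonality=True)
    m.fit(df)

    # Forecast one day ahead in 2-minute steps (720 intervals).
    future = m.make_future_dataframe(periods=720, freq="2min")
    forecast = m.predict(future)
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())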
The shortcoming of seasonal decomposition models is that they do not capture how a season might change over time. For example, the characteristics of a week of electricity demand in the summer are much different from those in the winter. This method determines an average seasonal pattern and leaves the remaining information in the residuals. So given that your characteristics vary by day of week (mentioned in your other post), this won't be captured.
If you'd like to send me your data, I'd be interested in taking a look. In my experience, you've jumped into the deep end of time-series forecasting, which doesn't necessarily have an easy-to-use off-the-shelf solution. If you do provide it, please also clarify what your objective is:
Are you attempting to forecast ahead, and if so, how many 2 minute intervals?
Do you require confidence intervals, monte carlo results, or neither?
How do you measure accuracy of the model's performance? How 'good' does it need to be?
