Univariate time series forecasting based (also) on other univariate series - python

I am given energy consumption data for a number of households in some given areas, covering a period of several months. For some households I have the complete dataset; for others the last month is missing and I have to estimate it in Python. I am trying to find an estimation technique that allows me to merge two aspects:
historical time-series analysis: I want to use past consumption to estimate the last month;
"similar" households' consumption: I also want to bring into the analysis information from similar households that have a complete dataset, under the assumption that there exist external variables (like temperature) that affect all households in similar ways.
The dataset for each household is univariate (consisting of just month/consumption).
Is there a Python module that allows me to forecast while taking both aspects into account?
Thanks!

Related

T-test of stock data

I have a dataset consisting of 4 (four different portfolios) × 120 rows of monthly return data for stocks in the US market (over ten years).
I want to compare the means of the different portfolios and tell whether they are significantly different from each other.
Should I use the monthly stock data or the cumulative data (over the 120 months) for an unpaired two-sided t-test?
When I use the monthly data I get no significance; when I use the cumulative data I get some significant p-values between some portfolios.
At this point I don't know which data I should use for this sort of t-test in order to obtain meaningful results.
You may try to use ANOVA instead of a t-test, since you have more than two groups. Before using ANOVA, you need to run a homogeneity of variance (HOV) test, which measures whether there is a difference in variance between the groups. As the HOV test, you can use Bartlett's test or the Fligner-Killeen test. Bartlett's test gives more accurate results if the data are normal; if the data are not normal, the Fligner-Killeen test is a better choice. Only if the null hypothesis (no difference in variance) is not rejected can ANOVA be used. If the null hypothesis is rejected, then multiple t-tests can be used in place of ANOVA, as you have done already.
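A minimal sketch of that procedure with scipy.stats, assuming each portfolio's 120 monthly returns sit in a separate array (the arrays below are simulated placeholders, not your data):

    import numpy as np
    from scipy import stats

    # Placeholder data: four portfolios, 120 monthly returns each (replace with your own).
    rng = np.random.default_rng(0)
    portfolios = [rng.normal(loc=mu, scale=0.05, size=120) for mu in (0.004, 0.005, 0.006, 0.007)]

    # Homogeneity-of-variance tests: Bartlett if the returns look normal, Fligner-Killeen otherwise.
    bartlett_stat, bartlett_p = stats.bartlett(*portfolios)
    fligner_stat, fligner_p = stats.fligner(*portfolios)   # base the decision on this p-value if the data are not normal

    if bartlett_p > 0.05:
        # Equal-variance hypothesis not rejected: one-way ANOVA across the four portfolios.
        f_stat, anova_p = stats.f_oneway(*portfolios)
        print(f"one-way ANOVA: F={f_stat:.3f}, p={anova_p:.3f}")
    else:
        # Otherwise fall back to pairwise t-tests (consider a multiple-comparison correction).
        for i in range(len(portfolios)):
            for j in range(i + 1, len(portfolios)):
                t_stat, p = stats.ttest_ind(portfolios[i], portfolios[j], equal_var=False)
                print(f"portfolios {i} vs {j}: t={t_stat:.3f}, p={p:.3f}")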
As far as which dataset to use is concerned, monthly stock data sounds more plausible. An alternative solution could be to use correlations between portfolio returns and to carry out a significance test of those correlations.

Create an ML model with TensorFlow that predicts values at any given time range at hourly intervals

I am pretty new to ML and completely new to creating my own models. I have gone through TensorFlow's time-series forecasting tutorial and other LSTM time-series examples on how to predict with multivariate inputs. After trying multiple examples, I think I have realized that this is not what I want to achieve.
My problem involves a dataset that is in hourly intervals and includes 4 different variables, with the possibility of more in the future; one column is the datetime. I want to train a model with data that can range from one month to many years. I will then create an input set that includes some of the variables used during training, with at least one of them missing, which is what I want the model to predict.
For example, if I had 2 years' worth of solar panel data and weather conditions from Jan-12-2016 to Jan-24-2018, I would like to train the model on that and then have it predict the solar panel output from May-05-2021 to May-09-2021, given the weather conditions during that date range. So in my case I am not necessarily forecasting, but rather using existing data to point at a certain day of any year given the conditions of that day at each hour. This would mean I should be able to go back in time as well.
Is this possible to achieve using TensorFlow, and if so, are there any resources or tutorials I should be looking at?
See Best practice for encoding datetime in machine learning.
Relevant variables for solar power seem to be hour-of-day and day-of-year. (I don't think separating into month and day-of-month is useful, as they are part of the same natural cycle; if you had data on the spending habits of people who get paid monthly, then it would make sense to model the monthly cycle, but the Sun doesn't do anything particular monthly.) Divide them by hours-in-day and days-in-year respectively to map them to [0, 1), multiply by 2*PI to get onto a circle, and create a sine column and a cosine column for each of them. This gives you four features that capture the cyclic nature of the day and the year.
CyclicalFeatures discussed in the link is a nice convenience, but you should still pre-process your day-of-year into the [0, 1) interval manually, as otherwise you can't get it to handle leap years correctly: in some years days-in-year is 365 and in others 366, and CyclicalFeatures only accepts a single max value per column.
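A minimal sketch of that encoding with pandas and NumPy, assuming a DataFrame indexed by timestamp (the index below is a made-up placeholder for your own datetimes):

    import numpy as np
    import pandas as pd

    # Placeholder hourly index; replace with the datetime column of your own data.
    df = pd.DataFrame(index=pd.date_range("2016-01-12", "2018-01-24", freq="H"))

    hour = df.index.hour                         # 0..23
    day_of_year = df.index.dayofyear - 1         # 0..364 (0..365 in leap years)
    days_in_year = np.where(df.index.is_leap_year, 366, 365)

    hour_frac = hour / 24.0                      # hour-of-day mapped to [0, 1)
    doy_frac = day_of_year / days_in_year        # day-of-year mapped to [0, 1), leap years handled per row

    # Two sine/cosine pairs: four features capturing the daily and yearly cycles.
    df["hour_sin"] = np.sin(2 * np.pi * hour_frac)
    df["hour_cos"] = np.cos(2 * np.pi * hour_frac)
    df["doy_sin"] = np.sin(2 * np.pi * doy_frac)
    df["doy_cos"] = np.cos(2 * np.pi * doy_frac)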

How can we measure the variability in time-series power consumption patterns?

Figure 1: the X-axis represents the time (0-23 hours) and the Y-axis represents the electricity consumption in kWh. I am working with time-series power consumption data. Figure 1 shows the load shapes of one user on Wednesday, Thursday and Friday. All the shapes look different, which indicates the stochastic behavior of that user. I want to measure this variability in load shapes. My ultimate goal is to quantify the difference in consumption patterns across days. What kind of statistical techniques can I use for this? Is there any library/package available in Python/R?

Is there a way to cluster transactions (journals) data using python if a transaction is represented by two or more rows?

In accounting, the data set representing the transactions is called a 'general ledger' and takes the following form:
Note that a 'journal', i.e. a transaction, consists of two line items. E.g. transaction (Journal Number) 1 has two lines: the receipt of cash and the income. Companies could also have transactions (journals) consisting of 3 line items or even more.
Will I first need to cleanse the data to only have one line item for each journal? I.e. cleanse the above 8 rows into 4.
Are there any Python machine learning algorithms that will allow me to cluster the above data without further manipulation?
The aim of this is to detect anomalies in transactions data. I do not know what anomalies look like so this would need to be unsupervised learning.
Fit a Gaussian on each dimension of the data to determine what is an anomaly. Mean and variance are estimated per dimension, and if the density of a new data point on that dimension is below a threshold, it is considered an outlier. This gives one Gaussian per dimension. You can use some feature engineering here, rather than just fitting Gaussians on the raw data.
If features don't look Gaussian (plot their histograms), use data transformations like log(x) or sqrt(x) to change them until they look better.
Use anomaly detection if supervised learning is not available, or if you want to find new, previously unseen kinds of anomalies (such as the failure of a power plant, or someone acting suspiciously), rather than discriminating between known classes (such as male/female).
Error analysis: what if p(x), the probability that an example is not an anomaly, is large for all examples? Add another dimension and hope it helps to reveal the anomaly. You could create this dimension by combining some of the others.
To fit the Gaussian a bit more closely to the shape of your data, you can make it multivariate. It then takes a mean vector and a covariance matrix, and you can vary those parameters to change its shape. It will also capture feature correlations, if your features are not all independent.
https://stats.stackexchange.com/questions/368618/multivariate-gaussian-distribution
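A minimal sketch of this approach with NumPy and scipy.stats, assuming each journal has already been collapsed into a single row of numeric features (the feature matrix below is simulated and the threshold value is only illustrative):

    import numpy as np
    from scipy import stats

    # Simulated feature matrix: one row per journal, e.g. log(total amount),
    # number of line items, posting hour (replace with your engineered features).
    rng = np.random.default_rng(1)
    X_train = rng.normal(size=(1000, 3))   # journals assumed to be mostly normal behaviour
    X_new = rng.normal(size=(10, 3))       # journals to score

    # One Gaussian per dimension (apply log/sqrt transforms first if a
    # feature's histogram does not look roughly Gaussian).
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)

    # p(x) = product of the per-dimension densities (treats features as independent).
    p_x = stats.norm.pdf(X_new, loc=mu, scale=sigma).prod(axis=1)

    epsilon = 1e-4                         # illustrative threshold; tune it if you obtain labels
    is_anomaly = p_x < epsilon

    # Multivariate variant: a single Gaussian with a mean vector and covariance
    # matrix, which also captures correlations between features.
    mvn = stats.multivariate_normal(mean=mu, cov=np.cov(X_train, rowvar=False))
    p_x_multivariate = mvn.pdf(X_new)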

numeric value for time series data with frequency every 2 minutes

I have a dataset covering two months in which I get a reading every 2 minutes. The statsmodels.tsa.seasonal_decompose method asks for a numeric value for the frequency. What would be the numeric value for such data, and what is the proper method to calculate the frequency for such a time series?
You need to identify the frequency of the seasonality yourself. Often this is done using knowledge of the dataset or by visually inspecting a partial autocorrelation plot, which the statsmodels library provides:
statsmodels - partial autocorrelation
If the data has an hourly seasonality to it, you might see a significant partial autocorrelation at lag 30 (as there are 30 data points between this hour's first 2 minutes and last hour's first 2 minutes). I'm assuming statsmodels would expect this value; similarly, if you had monthly data it would expect 12, or if you had daily data it would expect 7 for a weekly shape, etc.
It sounds like you have multiple seasonalities to consider, judging by your other post. You might see a significant lag that corresponds to the same 2 minutes in the previous hours, previous days, and/or the previous weeks. This approach to seasonal decomposition is considered naive and only addresses one seasonality, as described in its documentation:
Seasonal Decomposition
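A minimal sketch of inspecting the partial autocorrelations and then decomposing with the chosen period, assuming the 2-minute readings live in a CSV with a timestamp and a value column (the file and column names are made up):

    import pandas as pd
    from statsmodels.graphics.tsaplots import plot_pacf
    from statsmodels.tsa.seasonal import seasonal_decompose

    # Hypothetical 2-minute readings indexed by timestamp.
    series = pd.read_csv("readings.csv", index_col="timestamp", parse_dates=True)["value"]

    # A spike at lag 30 would point to an hourly cycle (30 two-minute steps per hour).
    plot_pacf(series, lags=60)

    # Decompose with that period; newer statsmodels versions use `period=`, older ones `freq=`.
    result = seasonal_decompose(series, model="additive", period=30)
    result.plot()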
If you would like to continue down the seasonal decomposition path, you could try the relatively recently released double-seasonal model from Facebook. It is specifically designed to work well with daily data, where it models intra-year and intra-week seasonalities. Perhaps it can be adapted to your problem.
fbprophet
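A minimal sketch of fitting that model, assuming the same hypothetical readings file; Prophet expects the columns to be named ds and y (in newer releases the package is imported as prophet rather than fbprophet):

    import pandas as pd
    from fbprophet import Prophet   # in newer releases: from prophet import Prophet

    # Hypothetical 2-minute readings with 'timestamp' and 'value' columns.
    df = pd.read_csv("readings.csv")
    df = df.rename(columns={"timestamp": "ds", "value": "y"})

    m = Prophet(daily_seasonality=True, weekly_seasonality=True)
    m.fit(df)

    # Forecast one day ahead at the same 2-minute resolution (720 steps).
    future = m.make_future_dataframe(periods=720, freq="2min")
    forecast = m.predict(future)
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())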
The shortcoming of seasonal decomposition models is that they do not capture how a season might change through time. For example, the characteristics of a week of electricity demand in the summer are much different from those in the winter. This method will determine an average seasonal pattern and leave the remaining information in the residuals. So given that your characteristics vary by day of week (mentioned in your other post), this won't be captured.
If you'd like to send me your data, I'd be interested in taking a look. In my experience, you've jumped into the deep end of time-series forecasting, which doesn't necessarily have an easy-to-use off-the-shelf solution. If you do provide it, please also clarify what your objective is:
Are you attempting to forecast ahead, and if so, how many 2 minute intervals?
Do you require confidence intervals, monte carlo results, or neither?
How do you measure accuracy of the model's performance? How 'good' does it need to be?
