calculating a weighted daily average for each DOY in xarray across a decade - python

I have a few years of sea level height data with variables of both absolute height and sea level anomaly. I want to calculate an improved anomaly dataset that takes into account seasonal changes in absolute height. Towards that goal I'm trying to calculate the mean height at every point on the grid for each day of the year. Ideally I'd like to take into account the previous two weeks and following two weeks with the closer days carrying more weight in the final mean. I think a normal distribution of weights would be ideal. There is a nice example in the xarray documentation of how to calculate seasonal averages, but I've yet to find a suitable approach for this weighted mean of each day.
My initial ds looks like:
I'm able to calculate this daily average via:
ds_daily_avg = ds.groupby('time.dayofyear').mean(dim='time')
The output of ds_daily_avg:
But there is too much variation in the daily averages, because I only have a decade of data. I've thought of just doing a rolling average of ~14 days, and while good enough, this doesn't properly do the weighting I'm hoping to implement:
ds_daily_avg.sla.rolling(dayofyear=14).mean()
Any advice for properly doing this weighted mean through time?
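One possible approach, sketched below: keep the groupby climatology, then take a Gaussian-weighted rolling mean over day of year using rolling(...).construct(...) and a dot product with a weight vector. This reuses ds and sla from the question; the 29-day window matches the +/- 14 days asked for, while sigma = 5 days is an assumed, tunable width, not something from the question.

import numpy as np
import xarray as xr
from scipy import stats

# Day-of-year climatology, as computed in the question.
ds_daily_avg = ds.groupby('time.dayofyear').mean(dim='time')

# Normalized Gaussian weights over a centered 29-day window (+/- 14 days).
# sigma = 5 days is an assumed, tunable width.
window = 29
half = window // 2
weights = xr.DataArray(
    stats.norm.pdf(np.arange(window) - half, scale=5.0),
    dims=['window'],
)
weights = weights / weights.sum()

# Wrap the dayofyear axis so windows reach across the year boundary,
# take the weighted mean, then trim the padding off again.
padded = ds_daily_avg.sla.pad(dayofyear=half, mode='wrap')
sla_smooth = (
    padded.rolling(dayofyear=window, center=True)
          .construct('window')
          .dot(weights)
          .isel(dayofyear=slice(half, -half))
)

The improved anomaly would then be ds.sla.groupby('time.dayofyear') - sla_smooth.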

Related

Unsupervised learning: Anomaly detection on discrete time series

I am working on a final year project on an unlabelled dataset consisting of vibration data from multiple components inside a wind turbine.
Datasets:
I have data from 4 wind turbines each consisting of 415 10-second intervals.
About the 10 second interval data:
Each of the 415 10-second intervals consists of vibration data for the generator, gearbox, etc. (14 features in total).
The vibration data (the 14 features) are sampled at 25.6 kHz (262144 rows in each interval).
The 10-second intervals are recorded once every day, at different times => a little more than 1 year's worth of data.
Head of dataframe with some of the features shown:
Plan:
My current plan is to
Do a Fast Fourier Transform (FFT) from the time domain for each of the different sensors (gearbox, generator, etc.) for each of the 415 intervals. From the FFT I am able to extract frequency information to put in a dataframe (statistical data from the FFT, like spectral RMS per bin); see the sketch after this list.
Build different data sets for different components.
Add features such as wind speed, wind direction, power produced etc.
I will then build unsupervised ML models that can detect anomalies.
Unsupervised models I am considering are encoder-decoder architectures and clustering.
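As a rough sketch of the FFT step in this plan (the sketch referenced above), assuming intervals is an iterable of (interval_id, 1-D vibration array) pairs for one sensor, and using 10 frequency bands as an arbitrary choice:

import numpy as np
import pandas as pd

def spectral_features(signal, n_bins=10):
    # One-sided magnitude spectrum of the 10-second vibration signal.
    spectrum = np.abs(np.fft.rfft(signal))
    # Split the spectrum into n_bins bands and take the RMS of each
    # ("spectral RMS per bin", as in the plan).
    bands = np.array_split(spectrum, n_bins)
    return {f'rms_bin_{i}': np.sqrt(np.mean(b ** 2)) for i, b in enumerate(bands)}

# One feature row per 10-second interval; repeat per sensor/component.
features = pd.DataFrame(
    [{'interval': k, **spectral_features(x)} for k, x in intervals]
)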
Questions:
Does it look like I have enough data for this type of task? 415 intervals x 4 different turbines = 1660 rows and approx. 20 features.
Should the data be treated as a time series? (It is sampled for 10 seconds once a day, at random times.)
What other unsupervised ML models/approaches could be good for this task?
I hope this was clearly written. Thanks in advance for any input!

Forecasting product return rates based on past returns

I have a waterfall dataset which shows the returns of laptops over different weeks due to defects. For the sales of each month, it shows the returns over the following months.
I transformed the data to a weekly level: for example, the returns column will be the returns of the first month for each sale month (it will be the sum of the first diagonal of the waterfall).
I fit a two-parameter Weibull distribution, which models failure rates, to the empirical pdf (to approximate a bathtub curve).
I then used the fitted curve to estimate reliability rates from the predicted cdf; from that, the unreliability rate is predicted.
However, the two-parameter Weibull distribution does not accurately model a bathtub curve, which is common for failure rates, and hence the accuracy is quite low.
Is there a better approach to this problem of predicting return rates?
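For reference, the two-parameter fit described above might look like the following with scipy.stats.weibull_min, fixing the location at zero; the weeks_to_return values here are made up purely for illustration:

import numpy as np
from scipy import stats

# Assumed input: one time-to-return value (in weeks) per returned unit.
weeks_to_return = np.array([1, 1, 2, 3, 5, 8, 9, 13, 21, 34], dtype=float)

# Two-parameter Weibull: fix location (floc) at 0, fit shape and scale.
shape, loc, scale = stats.weibull_min.fit(weeks_to_return, floc=0)

# Unreliability = CDF = probability a unit has been returned by week t.
t = np.arange(1, 53)
unreliability = stats.weibull_min.cdf(t, shape, loc=loc, scale=scale)
reliability = 1.0 - unreliability

Note that a single two-parameter Weibull has a monotonic hazard rate (decreasing for shape < 1, increasing for shape > 1), which is why it cannot reproduce a full bathtub curve; a mixture of two or three Weibulls is one common way around this.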

fbprophet yearly seasonality values too high

I have recently started using the fbprophet package in Python. I have monthly data for the last 2 years and am forecasting the next 9 months.
Since I have monthly data, I have only included yearly seasonality (Prophet(yearly_seasonality = True)).
When I plot the components, the trend values seem fine; however, the yearly seasonality values are far too high, and I don't understand why.
The seasonality component shows an increase of 300 or a decrease of -200, yet nothing like that happens in any month of the actual past data. What can I do to correct this?
Code Used is as follows:
m = Prophet(yearly_seasonality = True)
m.fit(df_bu_country1)
future = m.make_future_dataframe(periods=9, freq='M')
forecast = m.predict(future)
m.plot(forecast)
m.plot_components(forecast)
There is no seasonality at all in your data. For there to be yearly seasonality, you should have a pattern that repeats year after year, but the shape of your time series from 10/2015 to 10/2016 is completely different from the shape between 10/2016 and 10/2017. So forcing a yearly seasonality is going to give you strange results; you should switch it off (i.e., just use Prophet's default settings).
There is an inconsistency in the seasonality of your data; there seems to be a little yearly seasonality between 2017-04 and 2018-10. The first answer is absolutely right, but in case you feel there is some seasonality, you can reduce its impact by altering the Fourier order.
https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html#fourier-order-for-seasonalities
That page explains how; the default Fourier order is 10, and reducing the value dampens the effect.
Try this; I hope it helps.
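Following the linked page, a minimal sketch of lowering the Fourier order: disable the built-in yearly term and re-add it with a smaller order (3 here is just an illustrative value; the default is 10):

from fbprophet import Prophet

m = Prophet(yearly_seasonality=False)
# Re-add yearly seasonality with a lower Fourier order to damp it.
m.add_seasonality(name='yearly', period=365.25, fourier_order=3)
m.fit(df_bu_country1)
forecast = m.predict(m.make_future_dataframe(periods=9, freq='M'))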

Python - How do I check time series stationarity?

I have a car speed dataset from a highway. The observations are collected in 15-minute steps, which means I have 96 observations per day and 672 per week.
I have a whole month of data (2976 observations).
My goal is to predict future values using an Autoregressive AR(p) model.
Here's the distribution of my data over the month.
In addition, here's the autocorrelation plot (ACF).
The two plots above lead me to think there is a seasonal component and hence a non-stationary time series, which to me leaves no doubt.
However, to make sure of the non-stationarity, I applied a Dickey-Fuller test to the series. Here are the results.
Results of Dickey-Fuller Test:
Test Statistic                -1.666334e+01
p-value                        1.567300e-29
#Lags Used                     3.000000e+00
Number of Observations Used    2.972000e+03
Critical Value (1%)           -3.432552e+00
Critical Value (5%)           -2.862513e+00
Critical Value (10%)          -2.567288e+00
dtype: float64
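For reference, output in this shape typically comes from statsmodels' adfuller; a hypothetical reproduction, assuming the observations are in a pandas Series called speed:

import pandas as pd
from statsmodels.tsa.stattools import adfuller

stat, pvalue, lags, nobs, crit, _ = adfuller(speed, autolag='AIC')
out = pd.Series(
    [stat, pvalue, lags, nobs],
    index=['Test Statistic', 'p-value', '#Lags Used',
           'Number of Observations Used'],
)
for key, value in crit.items():
    out[f'Critical Value ({key})'] = value
print(out)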
The results clearly show that the absolute value of the test statistic is greater than the critical values; therefore we reject the null hypothesis, which means we have a stationary series!
So I'm very confused about the seasonality and stationarity of my time series.
Any help about that would be appreciated.
Thanks a lot
Actually, stationarity and seasonality are not contradictory qualities. Stationarity means constancy (no variation) of the series' moments (such as the mean and variance, for weak stationarity), while seasonality is a periodic component of the series that can be extracted with filters.
Seasonality and cyclical patterns are not exactly the same thing, but they are very close. You can think of the series in the images you show as containing a sum of sines and cosines that repeats itself with a weekly (or monthly, yearly, ...) period. That has no bearing on whether the mean of the series stays constant over the period, or even the variance.

Holt-Winters for multi-seasonal forecasting in Python

My data: I have two seasonal patterns in my hourly data... daily and weekly. For example... each day in my dataset has roughly the same shape based on hour of the day. However, certain days like Saturday and Sunday exhibit increases in my data, and also slightly different hourly shapes.
(Using Holt-Winters, as I found here: https://gist.github.com/andrequeiroz/5888967)
I ran the algorithm using 24 as my periods per season and forecasting out 7 seasons (1 week). I noticed that it would over-forecast the weekdays and under-forecast the weekend, since it estimates Saturday's curve from Friday's curve rather than from a combination of Friday's curve and the previous Saturday's curve. What would be a good way to include a secondary period in my data, i.e., both 24 and 7? Is there a different algorithm that I should be using?
One obvious way to account for the different shapes would be to use just one sort of period, but make it have a periodicity of 7*24 = 168, so you would be forecasting the entire week as a single shape.
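A minimal sketch of that single-season workaround, using statsmodels' Holt-Winters implementation rather than the linked gist, and assuming the hourly data is a pandas Series y with a DatetimeIndex:

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Treat the whole week as one season of 7 * 24 = 168 hourly steps.
model = ExponentialSmoothing(
    y,
    trend='add',
    seasonal='add',
    seasonal_periods=168,
)
fit = model.fit()
forecast = fit.forecast(168)  # one week ahead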
Have you tried linear regression, in which the predicted value is a linear trend plus a contribution from dummy variables? The simplest example to explain would be a trend plus only a day-of-week contribution. Then you would have
Y = X*t + c + A*D1 + B*D2 + ... + F*D6 (+ noise)
Here you use linear regression to find the best-fitting values of X, c, and A...F. t is the time, counting up 0, 1, 2, 3, ... indefinitely, so the fitted value of X gives you a trend. c is a constant value, so it moves all the predicted Ys up or down. D1 is set to 1 on Tuesdays and 0 otherwise, D2 is set to 1 on Wednesdays and 0 otherwise, ... D6 is set to 1 on Sundays and 0 otherwise, so the A...F terms give the contributions of days other than Monday. We don't fit a term for Mondays because if we did, we could not distinguish it from the c term: adding 1 to c and subtracting 1 from each of the day terms would leave the predictions unchanged.
Hopefully you can now see that we could add 23 terms to account for a shape across the 24 hours of each day, and a total of 46 terms to account for one shape across the 24 hours of a weekday and a different one across the 24 hours of a weekend day.
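A sketch of this dummy-variable regression with statsmodels (the answer suggests R, but the idea carries over directly); y is assumed to be an hourly pandas Series with a DatetimeIndex, and only the day-of-week dummies are shown, since the hourly dummies would be added the same way:

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({'y': y})
df['t'] = np.arange(len(df))  # linear trend term: 0, 1, 2, 3, ...

# Day-of-week dummies D1..D6; Monday is dropped as the baseline so
# that the constant c stays identifiable.
dow = pd.get_dummies(df.index.dayofweek, prefix='dow', drop_first=True)
dow.index = df.index

X = sm.add_constant(pd.concat([df[['t']], dow], axis=1).astype(float))
fit = sm.OLS(df['y'].astype(float), X).fit()
print(fit.params)  # const (c), t (X), and dow_1..dow_6 (A..F)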
You would be best to look for a statistical package to handle this for you, such as the free R package (http://www.r-project.org/). It does have a bit of a learning curve, but you can probably find books or articles that take you through using it for just this sort of prediction.
Whatever you do, I would keep on checking forecasting methods against your historical data - people have found that the most accurate forecasting methods in practice are often surprisingly simple.
