Predict number of rows using Machine Learning - Python

I have bank data from around 4 years across different branches. I am trying to predict the number of rows at the daily and hourly level. I have issue_datetime (year, month, day, hour) as the important features. I applied different regression techniques (linear, decision trees, random forest, XGBoost) using GraphLab but could not get good accuracy.
I was also thinking of setting a threshold based on past data, e.g. taking the mean of the counts at the daily and monthly level after removing outliers and using that as the threshold.
What is the best approach?

Since you have 1-D time series data, it should be relatively easy to graph your data and look for interesting patterns.
Once you establish that there are some non-stationary aspects to your data, the class of models you probably want to check out first is auto-regressive models, possibly with seasonal additions. ARIMA models are pretty standard for time-series data. http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/
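If you want to try this in Python, below is a minimal sketch using statsmodels' SARIMAX (a seasonal ARIMA implementation); the file name, column name, and model orders are assumptions, not recommendations:

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical file: one row per bank record with an issue_datetime column
df = pd.read_csv("bank_rows.csv", parse_dates=["issue_datetime"])
hourly_counts = df.set_index("issue_datetime").resample("H").size()

# seasonal_order=(1, 1, 1, 24) adds a daily cycle on top of the hourly data
model = SARIMAX(hourly_counts, order=(1, 1, 1), seasonal_order=(1, 1, 1, 24))
fit = model.fit(disp=False)
print(fit.forecast(steps=24))  # predicted row counts for the next 24 hours

The orders (p, d, q) should be chosen from the ACF/PACF plots and stationarity checks described in the linked tutorial, not copied from this sketch.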

Related

Predict / Forecast exchange rate based on financial indicators

This is the dataset. I want to create a time series to forecast the last row (EURUSD).
Is it possible to forecast the last variable based on the other financial indicators present in the dataset?
You can use multiple linear regression for the prediction.
With your independent variables (interest rate, etc.) you can estimate the dependent variable (EURUSD in your case).
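As a concrete starting point, here is a minimal scikit-learn sketch of that idea; the file name and column layout are hypothetical:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("indicators.csv")  # hypothetical file with the indicators + EURUSD
X = df.drop(columns=["EURUSD"])     # independent variables (interest rates, etc.)
y = df["EURUSD"]                    # dependent variable

# shuffle=False keeps the chronological order intact for time-series data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out portion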

Backcast time series using Python

I have time-series data from 2016 to 2021; how could I backcast to get the data from 2010 to 2015 using ARIMA in Python?
Could you give me some sample Python code?
Thank you very much.
The only possibility I see here is to simply reverse your time series: the last observation becomes the first, the second-to-last becomes the second, and so on. You then have a series running from 2021 to 2016.
You can do that by:
df = df.reindex(index=df.index[::-1])  # reverse the row order: last observation first
You can then train an ARIMA model on this data and predict the "next" six years, from 2015 back to 2010. Remember that the first prediction will be for 2015-12-31, so you need to reverse the output again to get the series from 2010 to 2015.
Keep in mind that the ARIMA predictions will be very, very bad, since your forecasts will be based on forecasts and so on. ARIMA is not made for predictions over such long time frames, so the results will be useless anyway, I guess. It is very likely that the predictions will become a straight line after 30 or 40 steps. And you can only use the autoregressive part in such a case, since the order of the moving-average model limits the number of steps you can forecast into the future.
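To make the procedure concrete, here is a minimal sketch using statsmodels' ARIMA on made-up yearly values; the numbers and the AR(1) order are placeholders:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical yearly observations for 2016..2021
series = pd.Series(
    [110.0, 115.0, 112.0, 120.0, 118.0, 125.0],
    index=pd.period_range("2016", "2021", freq="Y"),
)

reversed_values = series[::-1].to_numpy()            # 2021 first, 2016 last
fit = ARIMA(reversed_values, order=(1, 0, 0)).fit()  # AR-only, per the caveat above
backcast = fit.forecast(steps=6)                     # steps run 2015 back to 2010
backcast = backcast[::-1]                            # flip so it reads 2010..2015
print(backcast)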
Forecasting from a reversed time series would be the solution if you had more data.
However, only having 6 observations is problematic. Creating a forecasting (or backcasting) model requires using some of the observations to train the model and others to validate it. If you train with 4 observations, you only have 2 observations left for validation. Is a model good if it forecasts those two well, or did you just get lucky? Is it bad if it forecasts one observation well and the other poorly? If you increase the validation set to 3 observations, you get more confidence in whether the model is good or bad, but creating the model (with only 3 observations) gets even harder than before.
Like others have stated, regardless of which machine learning model you choose, the results are likely to be poor with so little data. If you had monthly data it might be more fruitful.
If you can't get monthly data, since you are backcasting into the past, it might be better to estimate the values manually based on some related variables that you have data for (if any). E.g. if your time series is about a company's sales, then maybe you could estimate based on the company's annual revenue (or company size, or something else) if you can get the historical data for that variable. This is not precise, but it can still be more precise than what ARIMA or similar methods would give with the data you have.

Time Series Anomaly Detection Across Multiple Dimensions

I'm working on a project where I'm tasked with finding anomalous data (counts of people) across different dimensions (categorical, i.e. country, occupation and a few more) and different days.
Below is a sample of the data
count is the count of people per day, per country and occupation
How do I go about this? Any recommended Python libraries or models? I found lots of tutorials on multivariate time series analysis, but my data isn't a multivariate time series, as the categorical variables in this dataset do not depend on time.
You can try an LSTM, BiRNN, or GRU for multivariate time-series prediction.
You can use TensorFlow or PyTorch to build the model; a sketch follows below.
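If you go that route, here is a minimal Keras (TensorFlow) sketch of a GRU regressor; all shapes and arrays are hypothetical stand-ins for real windowed data:

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import GRU, Dense

X = np.random.rand(200, 30, 4)  # 200 windows, 30 time steps, 4 features
y = np.random.rand(200, 1)      # next-step count to predict per window

model = Sequential([GRU(32, input_shape=(30, 4)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)

Large prediction errors on held-out windows can then be flagged as anomalies.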
Sklearn has multiple possibilities. You could take a look at Isolation Forest.
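For example, a minimal sketch of that approach, assuming hypothetical column names and a one-hot encoding of the categorical dimensions:

import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("people_counts.csv")  # hypothetical file name
# one-hot encode the categorical dimensions, keep the count as a numeric feature
X = pd.get_dummies(df[["country", "occupation"]]).assign(count=df["count"])

iso = IsolationForest(contamination=0.01, random_state=0)
df["anomaly"] = iso.fit_predict(X)  # -1 flags an anomaly, 1 means normal
print(df[df["anomaly"] == -1])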

Input data necessary for forecasting/ estimating trends for a given variable

This could be more of a theoretical question than a code-related one. In my current job I find myself estimating/predicting (the latter being more opportunistic) the water level of a given river in Africa.
The point is that I am developing a simplistic multiple regression model that takes more than 15 years of historical water levels and precipitation (from different locations) to generate water level estimates.
I am not that used to working with machine learning, or whatever the correct name is. I am more used to modelling data and generating fits (the current data can be perfectly described with asymmetric Gaussians and sigmoid functions combined with low-order polynomials).
So the point is: once I have a multiple regression model, my colleagues advised me not to use fitted data for the estimation but all the raw data instead. Since they couldn't explain the reason to me, I attempted to use the fitted data as raw inputs (in my defense, a median of all the fitting models has a very low deviation error == nice fits). But what I don't understand is why I should use just the raw data, which could be noisy, inaccurate, and influenced by factors that are not directly related (biasing the regression?). What is the advantage of that?
My lack of theoretical knowledge in the field is what makes me wonder about this. Should I always use all the raw data to determine the variables of my multiple regression, or can I use the fitted values (i.e. take a median of the different fitting models of each historical year)?
Thanks a lot!
Here are my 2 cents.
I think your colleagues are saying that because it would be better for the model to learn the correlations between the raw data and the actual water levels.
In the field you will start with the raw data, so being able to predict directly from it is very useful. Any processing you add on top of the raw data is work you will have to repeat every time you want to make a prediction.
However, if a simpler model (asymmetric Gaussians and sigmoid functions combined with low-order polynomials) describes the data perfectly, then I would recommend doing that, as long as your (y_pred - y_true) ** 2 is very small.

How to analyse and predict (machine learning) a time series data set using scikit-learn for Python

I got a data set like this.
I need to analyse and predict the status column. These are just 2 entries from the training data set. In this data set there is a heart rate pattern (collected at 1-second intervals, 10 numbers altogether); it's a time series array (correct me if I'm wrong). I just need to know the best way to analyse this data and get a prediction from it. I'm using scikit-learn for my data mining and machine learning.
What I want to know is: what is the best way to analyse this time series data? Should I use a vector-based approach or something else? If you can give me example code, that would be great for my understanding.
Feed in each point of the heart rate time series as a separate column, along with a separate column (feature) for each of the other data points. Do feature normalization (subtract the mean, divide by the standard deviation) for each column over the entire dataset, and feed the result into a classifier.
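A minimal sketch of that layout with scikit-learn, using random stand-in arrays in place of the real data set:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins: 100 rows, each with a 10-point heart rate series
heart_rate = np.random.randint(60, 120, size=(100, 10))
other = np.random.rand(100, 3)         # the other data points as features
X = np.hstack([heart_rate, other])     # one column per value
y = np.random.randint(0, 2, size=100)  # the status labels

# StandardScaler subtracts the mean and divides by the standard deviation
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)
print(clf.predict(X[:5]))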
