I have converted the date into numerical values, but I am stuck on the next step: how do I prepare the data for prediction? How do I use the date for prediction in Python code? How do I count the Eventhappen attribute? Please guide me and improve my code where it does not make sense. Below is my code.
# Here is the dataset
date        Eventhappen
2016-01-14  A
2016-01-15  C
2016-01-16  B
2016-01-17  A
2016-01-18  C
2016-02-18  B
# Converting the date into numerical values
df['Dispatch_Date_Time'] = pd.to_datetime(df['Dispatch_Date_Time'])
df.set_index('Dispatch_Date_Time', inplace=True)
df.sort_index(inplace=True)
df['month'] = df.index.month
df['year'] = df.index.year
df['day'] = df.index.day
df['eventhappen'] = 1
# Preparing the data
X = df[['year']]
y = df['eventhappen']
# Training the algorithm (note: X_train/y_train are never created; a train/test split is missing)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Making the predictions
y_pred = regressor.predict(X_test)
# Plotting the least squares line
sns.pairplot(df, x_vars=['year'], y_vars='eventhappen', height=7, aspect=0.7, kind='reg')
There is a lot of confusion in your code, at least for me. The column names used in the processing are not the same as the ones in the dataset. You have two scenarios to consider:
SN-A: If you want to predict which event happens on some future date, the target column 'Eventhappen' is categorical, so you have a multi-class classification task, not a regression one. You should encode your target column, then split your dataset with a train/test split, and finally fit a classifier to predict the event on some future date.
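For illustration, here is a minimal sketch of SN-A (assuming scikit-learn; the feature choice and the classifier are placeholders, not the only option):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

# Encode the categorical target (A/B/C -> 0/1/2)
le = LabelEncoder()
y = le.fit_transform(df['Eventhappen'])
X = df[['year', 'month', 'day']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
# Predictions come back as encoded labels; map them back to A/B/C
y_pred = le.inverse_transform(clf.predict(X_test))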
SN-B: If you want to predict the number of events happening on some future date, then you are on the right track: you need a numerical column to predict, namely the count. That means this line of code should not assign a constant:
df['eventhappen'] = 1
Once you have the count, you should consider some time series techniques (power transformations, lags, ...), then split into train/test datasets, and finally fit and evaluate your regression model.
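A minimal sketch of building such a count target (assuming the datetime index from your code; the daily granularity and the single lag feature are illustrative choices):

# Aggregate the raw events into a daily count: this becomes the regression target
daily = df.resample('D').size().rename('event_count').to_frame()
# A simple lag feature, one of the time series techniques mentioned above
daily['lag_1'] = daily['event_count'].shift(1)
daily = daily.dropna()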
Use the function below to extract the needed features from the date column and then use them directly in your machine learning model. You can also encode the cyclic features, which gives the model the ability to extract cyclic insights from the data (see the sketch after the function).
def transform_col_date(data, date_col):
    '''
    data : DataFrame (your dataset).
    date_col : String (name of the date column).
    '''
    data_ = data.copy()
    data_.reset_index(inplace=True)
    data_[date_col] = pd.to_datetime(data_[date_col])
    data_['day'] = data_[date_col].dt.day
    data_['month'] = data_[date_col].dt.month
    data_['dayofweek'] = data_[date_col].dt.dayofweek
    data_['dayofyear'] = data_[date_col].dt.dayofyear
    data_['quarter'] = data_[date_col].dt.quarter
    # dt.weekofyear is deprecated; isocalendar().week is the current equivalent
    data_['weekofyear'] = data_[date_col].dt.isocalendar().week
    data_['year'] = data_[date_col].dt.year
    return data_
#in your case
data = transform_col_date(df, 'date')
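And here is a minimal sketch of the cyclic encoding mentioned above (a sin/cos transform; encode_cyclic is a hypothetical helper, and the period values assume the columns produced by transform_col_date):

import numpy as np

def encode_cyclic(data, col, max_val):
    # Map a cyclic feature (e.g. month 1..12) onto the unit circle
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / max_val)
    return data

data = encode_cyclic(data, 'month', 12)
data = encode_cyclic(data, 'dayofweek', 7)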
I am currently in the process of creating a deep learning model to forecast stock prices. I am using AutoGluon since I am relatively new to deep learning/ML.
My pandas dataframe consists of the historical daily close price of about 1800 different stocks. This is about 8 million rows total. Here is how I have preprocessed my data:
import pandas as pd
import matplotlib.pyplot as plt
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor
from datetime import datetime, timedelta, time
import random
import numpy as np
df = pd.read_csv(
    "/content/drive/MyDrive/stock_data/training_data_formatted.csv",
    parse_dates=["Date"],
)
# drop rows that merely repeat the header strings (an artifact of concatenated CSVs)
values_to_remove = ['Date', 'Close', 'ID']
df["Date"] = pd.to_datetime(df["Date"], errors="coerce", dayfirst=True)
df = df[~df['Date'].isin(values_to_remove) & ~df['ID'].isin(values_to_remove) & ~df['Close'].isin(values_to_remove)]
df = df.drop(columns=['Unnamed: 0'])
# add time and microseconds to make a unique index, since stocks have overlapping dates
df['Date'] = df['Date'].apply(lambda x: x.replace(hour=12, minute=0, second=0))
df['Date'] = pd.to_datetime(df['Date'])
microseconds = df.groupby(['Date']).cumcount()
df['Date'] += np.array(microseconds, dtype='m8[us]')
ts_dataframe = TimeSeriesDataFrame.from_data_frame(
    df,
    id_column="ID",          # column that contains the unique ID of each time series
    timestamp_column="Date",
)
# add a frequency so the model understands the data
def forward_fill_missing(ts_dataframe: TimeSeriesDataFrame, freq="D") -> TimeSeriesDataFrame:
    original_index = ts_dataframe.index.get_level_values("timestamp")
    start = original_index.min()
    end = original_index.max()
    filled_index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    return ts_dataframe.droplevel("item_id").reindex(filled_index, method="ffill")

ts_dataframe = ts_dataframe.groupby("item_id").apply(forward_fill_missing)
I first trained the model with the best_quality preset and no hyperparameter adjustment. This trained all the models available in AutoGluon, and I used a weighted ensemble of them. However, the mean of my predictions was simply a straight line (I assume this has to do with overfitting). Here is the code:
prediction_length = 30
test_data = ts_dataframe
# last prediction_length timesteps of each time series are excluded, akin to `x[:-48]`
train_data = ts_dataframe.slice_by_timestep(None, -prediction_length)
predictor = TimeSeriesPredictor(
    path="/content/drive/MyDrive/TrainedModel",
    target="Close",
    prediction_length=prediction_length,
    eval_metric="MAPE",
)
predictor.fit(
    train_data,
    presets="best_quality",
    num_gpus=1,
)
I then decided to use only DeepAR from AutoGluon, with configured hyperparameters. I have tried a bunch of different configurations; however, I don't get a model that gives even a halfway decent forecast. This is the configuration I last used:
predictor.fit(
    train_data,
    hyperparameters={
        "DeepAR": {
            "num_layers": 5,
            "hidden_size": 100,
            "learning_rate": 0.1,
            "batch_size": 128,
            "num_batches_per_epoch": 80,
            "epochs": 300,
        },
    },
    num_gpus=1,
)
This configuration gives slightly better results than the weighted ensemble; however, it still produces an almost linear mean that doesn't capture the ups and downs in the trend.
I don't know whether this is because of the way I'm preprocessing the data or because I'm doing something extremely wrong in configuring the hyperparameters.
I should note that to test the model, I am passing it the entire historical data of just one stock.
I'm relatively new to ML, so I apologize if I've made an extremely rookie error, but I'd really appreciate it if someone could guide me as to where I'm going wrong.
If there's something I've left out, or any more info I could give to make things clearer, please let me know.
Thanks!
I am training a DeepAR model (arXiv) in a Jupyter Notebook, following this tutorial.
I create a collection of time series (concat_df), as needed by the DeepAR method:
Each row is a time series, and this collection is used to train the DeepAR model. The input format expected by DeepAR is a list of series, so I create one from the above data frame:
time_series = []
for index, row in concat_df.iterrows():
    time_series.append(row)
With this list of time series I then set the freq, prediction_length and context_length (note I am setting frequency to Daily in this first example):
freq = "D"
prediction_length = 3
context_length = 3
...as well as the T-Naught, Data Length, Number of Time Series and Period:
t0 = concat_df.columns[0]
data_length = concat_df.shape[1]
num_ts = concat_df.shape[0]
period = 12
I create the training set:
time_series_training = []
for ts in time_series:
    time_series_training.append(ts[:-prediction_length])
...and visualize this with the test set:
time_series[0].plot(label="test")
time_series_training[0].plot(label="train", ls=":")
plt.legend()
plt.show()
So far so good, everything seems to agree with the tutorial.
I then use the remaining code to invoke the deployed model as explained in the referenced article:
list_of_df = predictor.predict(time_series_training[:5], content_type="application/json")
BUT, if I change the frequency to monthly (freq = "M"), I get the following error:
ValueError: Units 'M', 'Y', and 'y' are no longer supported, as they do not represent unambiguous timedelta values durations.
Why would monthly data not be accepted? How can I train monthly data on DeepAR? Is there a way to specify some daily equivalent to monthly data?
This appears to be some kind of Pandas error, as shown here.
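If it helps to locate the failure, here is a minimal reproduction of the underlying pandas behavior, independent of DeepAR (a sketch; the exact message varies by pandas version):

import pandas as pd

pd.to_timedelta(1, unit="D")    # fine: a day is an unambiguous duration
# pd.to_timedelta(1, unit="M")  # raises ValueError: a month is not a fixed-length duration
# Month-based steps are expressed with offsets/frequencies instead of timedeltas:
pd.date_range("2020-01-31", periods=3, freq="M")  # month-end timestamps work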
I am working on a small machine learning project.
The dataset I use has 56 input parameters and one categorical response variable (0/1). My problem is that the response variable is not evenly distributed. Now my question: I want to prepare the training dataset so that the responses are evenly distributed. How can this be done?
This is what the data looks like:
-> the training dataset should contain the same number of 1s and 0s in the response.
Thanks for your help; as you can imagine, I am really a beginner...
I am the same person who asked the question; sorry for that.
First, I load the data from a CSV file (not shown in the code here); this is stored as data. Next, I create a new column named "response_class" based on the value in the column "response": if it is below 0.045, response_class = 1, otherwise 0. Second, I randomly sample 10000 rows from the data (due to computation limits), and here I want to make sure that I get the same number of 1s and 0s in response_class. At the end, I split the data to make it ready for a correlation matrix and for the test and train data.
Here is my code:
data = data[data.response != 0]
data['response_class'] = np.where(data['response'] <= 0.045, 1, 0)
# 1 = below 0.045 / 0 = above 0.045
# reduce the amount of data by picking random samples
data = data.sample(n=10000)
# split the data
data.drop(['response'], axis=1, inplace=True)
y = data['response_class']
X = data.drop('response_class', axis=1)
X_names = X.columns
data.head()
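One way to get the balanced 10000-row sample directly at this step (a sketch; requires pandas >= 1.1 for GroupBy.sample and assumes each class has at least 5000 rows):

# Draw 5000 rows per class so the combined sample is balanced
data = data.groupby('response_class', group_keys=False).sample(n=5000, random_state=123)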
Found a solution:
from sklearn.utils import resample

# separate based on the response variable response_class
df_zero = pd.DataFrame(data[data.response_class == 0])
df_one = pd.DataFrame(data[data.response_class == 1])
# upsample the minority class
df_zero_min = resample(df_zero,
                       replace=True,
                       n_samples=len(df_one),
                       random_state=123)
df_upsampled = pd.concat([df_one, df_zero_min])
df_upsampled.response_class.value_counts()
I have a time series index with a few variables and a humidity reading. I have already trained an ML model to predict Humidity values based on X, Y, and Z. Now, when I load the saved model using pickle, I would like to fill in the missing Humidity values using X, Y, and Z. However, it should take into account that X, Y, and Z themselves must not be missing.
Time            X    Y   Z    Humidity
1/2/2017 13:00  31   22  21   48
1/2/2017 14:00  NaN  12  NaN  NaN
1/2/2017 15:00  25   55  33   NaN
In this example, the Humidity in the last row should be filled in using the model, whereas the second row should not be predicted by the model since X and Z are also missing.
I have tried this so far:
with open('model_pickle', 'rb') as f:
    mp = pickle.load(f)

for i, value in enumerate(df['Humidity'].values):
    if np.isnan(value):
        df['Humidity'][i] = mp.predict(df['X'][i], df['Y'][i], df['Z'][i])
This gave me the error 'predict() takes from 2 to 5 positional arguments but 6 were given', and I also did not take the X, Y, and Z column values into account. Below is the code I used to train the model and save it to a file:
df = df.dropna()
dfTest = df.loc['2017-01-01':'2019-02-28']
dfTrain = df.loc['2019-03-01':'2019-03-18']
features = [ 'X', 'Y', 'Z']
train_X = dfTrain[features]
train_y = dfTrain.Humidity
test_X = dfTest[features]
test_y = dfTest.Humidity
model = xgb.XGBRegressor(max_depth=10,learning_rate=0.07)
model.fit(train_X,train_y)
predXGB = model.predict(test_X)
mae = mean_absolute_error(predXGB,test_y)
import pickle
with open('model_pickle', 'wb') as f:
    pickle.dump(model, f)
I had no errors during training and saving the model.
For prediction, since you want to make sure you have all the X, Y, and Z values, you can do:
df = df.dropna(subset=["X", "Y", "Z"])
And now you can predict the values for the remaining valid examples:
# where features = ["X", "Y", "Z"]
df['Humidity'] = mp.predict(df[features])
mp.predict will return prediction for all the rows, so there is no need to predict iteratively.
Edit:
For inference, say you have a dataframe df; you can do:
# Get rows with missing Humidity where it can be predicted.
df_inference = df[df.Humidity.isnull()]
# remaining rows
df = df[df.Humidity.notnull()]
# df_inference might still have rows with missing features.
# Since you cannot infer with missing features, remove them too and add them to the remaining rows
df = pd.concat([df, df_inference[df_inference[features].isnull().any(axis=1)]])
# and remove them from df_inference
df_inference = df_inference[~df_inference[features].isnull().any(axis=1)]
# Now you can infer on these rows
df_inference['Humidity'] = mp.predict(df_inference[features])
# Now merge this back into the remaining rows to recover the original number of rows, and sort by index
df = pd.concat([df, df_inference])
df = df.sort_index()
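A more compact alternative to the split-and-concat approach above is a boolean mask (a sketch of the same logic, not a different result):

# Rows where Humidity is missing but all the features are present
mask = df['Humidity'].isnull() & df[features].notnull().all(axis=1)
df.loc[mask, 'Humidity'] = mp.predict(df.loc[mask, features])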
Can you report the error?
Anyway, if you have missing values, you have different options to deal with them. You can either discard the data point entirely or try to infer the missing parts with a method of choice: mean, interpolation, etc.
Pandas documentation has a nice guide on how to deal with them:
https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
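For example, a quick sketch of those options (the column name 'X' is just a placeholder):

df = df.dropna()                           # discard incomplete rows entirely
df['X'] = df['X'].fillna(df['X'].mean())   # or impute with the column mean
df['X'] = df['X'].interpolate()            # or interpolate between neighbouring values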
Try
df['Humidity'][i] = mp.predict(df[['X', 'Y', 'Z']].iloc[[i]])
This way the data is passed as a single argument, as the function expects. The way you wrote it, you split your data into three arguments. (Note the .iloc[[i]]: plain df[['X', 'Y', 'Z']][i] would raise a KeyError.)
Here I am attempting to forecast a year's worth of values in a time series (ts) using an ARIMA model, but I can't actually get the forecasted values: the predicted values are on a different scale (you can see the last one from the dataset is 339 and the predicted ones are very small), and I am not sure where to tweak the code. I tried changing fill_value to different values, but I don't know if this is the proper method.
I suppose this might also have something to do with this line:
predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
Is there a way to extend the index to cover forecasted values?
The code is below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_model import ARIMA  # the old statsmodels API, matching fit(disp=-1)

ts_log = np.log(ts)
ts_log_diff = ts_log - ts_log.shift()
model = ARIMA(ts_log, order=(2, 1, 2))
results_ARIMA = model.fit(disp=-1)
plt.plot(ts_log_diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f' % sum((results_ARIMA.fittedvalues - ts_log_diff)**2))
predictions_ARIMA_diff = pd.Series(results_ARIMA.predict('1949-02-01', '1961-12-01'), copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=ts_log.index)  # .ix is removed; use .iloc
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.plot(ts)
plt.plot(predictions_ARIMA)
plt.title('RMSE: %.4f' % np.sqrt(sum((predictions_ARIMA - ts)**2) / len(ts)))
Here you can see how the results look: the first row is the last value I have, and beginning with 1961-01-01 these are the predicted values.
1960-12-01 339.216967
1961-01-01 3.111950
1961-02-01 3.295407
1961-03-01 3.540066
1961-04-01 3.789093
1961-05-01 3.980322
1961-06-01 4.068641
1961-07-01 4.045327
1961-08-01 3.939715
1961-09-01 3.802622
1961-10-01 3.684713
1961-11-01 3.622262
1961-12-01 3.632668
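Regarding the question above about extending the index: the base series is built only on ts_log.index, so the 1961 forecast dates never receive the starting log value (they fall back to fill_value=0), which is why the predictions land on a much smaller scale after np.exp. A hedged sketch of the fix, assuming monthly-start data like the air passengers series:

# Build an index that also covers the forecast horizon
full_index = pd.date_range(start=ts_log.index[0], end='1961-12-01', freq='MS')
# Base every cumulative sum on the first observed log value over the FULL index
predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=full_index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)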