Converting new dataset into a previous patsy dmatrix form

Converting new dataset into a previous patsy dmatrix form - python

I have divided my data set into 2 sets, training and test set. I have 6 types of dummy variable in my data. Every time I try to run the model on my training set I get error.
This is my code:
X = dmatrix('sfdc_tier + poc_image + sub_segment + Product_Set + Volume_2019_Product', data = data)
x = dmatrix('sfdc_tier + poc_image + sub_segment + Product_Set + Volume_2019_Product', data = Training_Set)
Y = data['Discount_Total')
model = sm.OLS(Y,X).fit()
y_pred = model.predict(x)
Note that 'Volume_2019_Product' is the only numerical data and rest of the data is categorical.
The error I am getting is the following:
ValueError: shapes (662,69) and (90,) not aligned: 69 (dim 1) != 90 (dim 0)
How do I resolve this error?? I need my Training data matrix to look exactly like the original dmatrix of X. Training data contains the same column heads as the other dataset on which I trained the model but it doesn't contain each and every categorical variable under heads which is creating an error in model prediction.

Related

Predicting new data with statsmodels gives ValueError: shapes

I built a multiple regression model using Python statsmodels.
X = df[['var1','var2','var3','var4']]
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model
y = df['target_trait']
model = sm.OLS(y, X).fit() #argument order: sm.OLS(output, input), see (https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9)
predictions = model.predict(X)
model.summary()
Now, I want to predict new data. the dataframe for my new data has 4 columns (var1, var2, var3, var4) and 143 rows. Below is how I proceeded.
X_new = df_new[['var1','var2','var3','var4']] #df_new has other variables not to be used. I am extracting the relevant variables.
y_new = model.predict(X_new)
y_new
Running the code above gave me ValueError: shapes (143,4) and (5,) not aligned: 4 (dim 1) != 5 (dim 0).
I am not sure how to fix it. I really would appreciate your help. Thank you in advance for your time

I think I found the issue. When fitting the model I added a constant to the X matrix by doing X = sm.add_constant(X). By doing the same to X_new, the algorithm worked. Anyway, thank you for taking a look.

Linear Regression TypeError: float() argument must be a string or a number, not 'NaTType'

I am currently working on a linear regression model to try and predict prices of certain shoes. After loading in the data frames for the training data and the data to test and dropping the unwanted columns, I am beginning to work with independent and dependent variable for the regression coefficients. Here was the set of code I ran and was able to get working:
SourceData_train_independent = Training_Data.drop(['Sale Price'], axis = 1) #
Drop depedent variable from training dataset
SourceData_train_dependent = Training_Data['Sale Price'].copy() # New dataframe
with only Dependent variable value for training dataset
SourceData_test_independent = Test_Data.drop(['Sale Price'], axis = 1)
SourceData_test_dependent = Test_Data['Sale Price'].copy()
When I ran the next block of code to scale the data:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(SourceData_train_independent.values) #scale the
independent variables
y_train = SourceData_train_dependent # scaling is not required for dependent
variable
X_test = sc_X.transform(SourceData_test_independent)
y_test = SourceData_test_dependent
I get the following error:
TypeError: float() argument must be a string or a number, not 'NaTType'
on the line:
X_train = sc_X.fit_transform(SourceData_train_independent.values) #scale the
independent variables
Currently, the sale price variable is a float64, do I just need to change the datatype to a number value or is it a more complicated fix? I have already run the code:
Training_Data["Sale Price"] = Training_Data["Sale
Price"].astype(str).str.strip().replace("",0).astype(float)
Test_Data["Sale Price"] = Training_Data["Sale
Price"].astype(str).str.strip().replace("",0).astype(float)
Hopefully someone can help as I'm struggling with this project. I used the code from: https://towardsdatascience.com/linear-regression-in-python-sklearn-vs-excel-6790187dc9ca
Thank you so much!

CNN: Generate a confusion matrix for entire test dataset

I'm using following code to predict my model output on dataset.
correct = 0
total_predictions = []
actual_labels = []
with torch.no_grad():
for images, labels in testloader:
images, labels = images.to(device), labels.to(device)
outputs = model(images)
_, predicted = torch.max(outputs.data, 1)
actual_labels.append(labels)
total_predictions.append(final_pred)
final_pred = torch.FloatTensor(final_pred).to(device)
correct += (predicted == labels).sum().item()
Now to generate a confusion matrix of entire dataset, I tried storing my predictions and test labels in a list and pass it to confusion_matrix in sklearn, but that fails with following error:
ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
Can someone help me in calculating confusion matrix for my entire dataset?
The following code only calculates it for the last batch:
cf = confusion_matrix(predicted.cpu(), labels.cpu())
Update-1
Using #CutePoison's template, I'm getting this.
You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
labels={}
labels['healthy_wheat'] = 0
labels['leaf_rust'] = 1
labels['stem_rust'] = 2
def conf_mat(y_true,y_pred,columns,**kwargs):
conf_mat = confusion_matrix(y_true,y_pred,labels = columns,**kwargs)
df = pd.DataFrame(conf_mat,columns = columns, index = columns)
df.columns.name="pred"
df.index.name="true"
return df
conf_mat(actual_labels,total_predictions ,columns =labels,normalize="true")

I use this snippet for createing confusion matrices, which works for multiple classes
from sklearn.metrics import confusion_matrix
def conf_mat(y_true,y_pred,columns,**kwargs):
"""
Creates a "pretty" confusion matrix
"""
conf_mat = confusion_matrix(y_true,y_pred,labels = columns,**kwargs)
df = pd.DataFrame(conf_mat,columns = columns, index = columns)
df.columns.name="pred"
df.index.name="true"
return df
conf_mat(actual_labels,final_pred ,columns =np.unique(actual_labels),normalize="true")
Note, you might want change the columns depending on how your labels are created.
Furthmore your final_pred has to contain your class-prediction and not the score i.e final_pred = [0,1,2,0...] and not final_pred= [[0.8,0.1,0.1], [0.1,0.7,0.2],[0.05,0.05,0.9],[0.75,0.2,0.05],...]

How to interpret the output of a RNN with Keras?

I would like to use a RNN for time series prediction to use 96 backwards steps to predict 96 steps into the future. For this I have the following code:
#Import modules
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from tensorflow import keras
# Define the parameters of the RNN and the training
epochs = 1
batch_size = 50
steps_backwards = 96
steps_forward = 96
split_fraction_trainingData = 0.70
split_fraction_validatinData = 0.90
randomSeedNumber = 50
helpValueStrides = int(steps_backwards /steps_forward)
#Read dataset
df = pd.read_csv('C:/Users1/Desktop/TestValues.csv', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0]}, index_col=['datetime'])
# standardize data
data = df.values
indexWithYLabelsInData = 0
data_X = data[:, 0:3]
data_Y = data[:, indexWithYLabelsInData].reshape(-1, 1)
scaler_standardized_X = StandardScaler()
data_X = scaler_standardized_X.fit_transform(data_X)
data_X = pd.DataFrame(data_X)
scaler_standardized_Y = StandardScaler()
data_Y = scaler_standardized_Y.fit_transform(data_Y)
data_Y = pd.DataFrame(data_Y)
# Prepare the input data for the RNN
series_reshaped_X = np.array([data_X[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
series_reshaped_Y = np.array([data_Y[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
timeslot_x_train_end = int(len(series_reshaped_X)* split_fraction_trainingData)
timeslot_x_valid_end = int(len(series_reshaped_X)* split_fraction_validatinData)
X_train = series_reshaped_X[:timeslot_x_train_end, :steps_backwards]
X_valid = series_reshaped_X[timeslot_x_train_end:timeslot_x_valid_end, :steps_backwards]
X_test = series_reshaped_X[timeslot_x_valid_end:, :steps_backwards]
Y_train = series_reshaped_Y[:timeslot_x_train_end, steps_backwards:]
Y_valid = series_reshaped_Y[timeslot_x_train_end:timeslot_x_valid_end, steps_backwards:]
Y_test = series_reshaped_Y[timeslot_x_valid_end:, steps_backwards:]
# Build the model and train it
np.random.seed(randomSeedNumber)
tf.random.set_seed(randomSeedNumber)
model = keras.models.Sequential([
keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 3]),
keras.layers.SimpleRNN(10, return_sequences=True),
keras.layers.Conv1D(16, helpValueStrides, strides=helpValueStrides),
keras.layers.TimeDistributed(keras.layers.Dense(1))
])
model.compile(loss="mean_squared_error", optimizer="adam", metrics=['mean_absolute_percentage_error'])
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_valid, Y_valid))
#Predict the test data
Y_pred = model.predict(X_test)
prediction_lastValues_list=[]
for i in range (0, len(Y_pred)):
prediction_lastValues_list.append((Y_pred[i][0][1 - 1]))
# Create thw dataframe for the whole data
wholeDataFrameWithPrediciton = pd.DataFrame((X_test[:,1]))
wholeDataFrameWithPrediciton.rename(columns = {indexWithYLabelsInData:'actual'}, inplace = True)
wholeDataFrameWithPrediciton.rename(columns = {1:'Feature 1'}, inplace = True)
wholeDataFrameWithPrediciton.rename(columns = {2:'Feature 2'}, inplace = True)
wholeDataFrameWithPrediciton['predictions'] = prediction_lastValues_list
wholeDataFrameWithPrediciton['difference'] = (wholeDataFrameWithPrediciton['predictions'] - wholeDataFrameWithPrediciton['actual']).abs()
wholeDataFrameWithPrediciton['difference_percentage'] = ((wholeDataFrameWithPrediciton['difference'])/(wholeDataFrameWithPrediciton['actual']))*100
# Inverse the scaling (traInv: transformation inversed)
data_X_traInv = scaler_standardized_X.inverse_transform(data_X)
data_Y_traInv = scaler_standardized_Y.inverse_transform(data_Y)
series_reshaped_X_notTransformed = np.array([data_X_traInv[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
X_test_notTranformed = series_reshaped_X_notTransformed[timeslot_x_valid_end:, :steps_backwards]
predictions_traInv = scaler_standardized_Y.inverse_transform(wholeDataFrameWithPrediciton['predictions'].values.reshape(-1, 1))
edictions_traInv = wholeDataFrameWithPrediciton['predictions'].values.reshape(-1, 1)
# Create thw dataframe for the inversed transformed data
wholeDataFrameWithPrediciton_traInv = pd.DataFrame((X_test_notTranformed[:,0]))
wholeDataFrameWithPrediciton_traInv.rename(columns = {indexWithYLabelsInData:'actual'}, inplace = True)
wholeDataFrameWithPrediciton_traInv.rename(columns = {1:'Feature 1'}, inplace = True)
wholeDataFrameWithPrediciton_traInv['predictions'] = predictions_traInv
wholeDataFrameWithPrediciton_traInv['difference_absolute'] = (wholeDataFrameWithPrediciton_traInv['predictions'] - wholeDataFrameWithPrediciton_traInv['actual']).abs()
wholeDataFrameWithPrediciton_traInv['difference_percentage'] = ((wholeDataFrameWithPrediciton_traInv['difference_absolute'])/(wholeDataFrameWithPrediciton_traInv['actual']))*100
wholeDataFrameWithPrediciton_traInv['difference'] = (wholeDataFrameWithPrediciton_traInv['predictions'] - wholeDataFrameWithPrediciton_traInv['actual'])
Here you can have some test data (don't care about the actual values as I made them up, just the shape is important) Download test data
How can the output of the Y_pred data be interpreted? Which of those values yields me the predicted values 96 steps into the future? I have attached a screenshot of the 'Y_pred' data. One time with 5 output neurons in the last layer and one time only with 1. Can anyone tell me, how to interpret the 'Y_pred' data meaning what exactly is the RNN predicting? I can use any values in the output (last layer ) of the RNN model. The 'Y_pred' data always has the shape (Batch size of X_test, timesequence, Number of output neurons). My question is targeting at the last dimension. I thought that these might be the features, but this is not true in my case, as I only have 1 output features (you can see that in the shape of the Y_train, Y_test and Y_valid data).
**Reminder **: The bounty is expiring soon and unfortunately I still have not received any answer. So I would like to remind you on the question and the bounty. I'll highly appreciate every comment.

It may be useful to step through the model inputs/outputs in detail.
When using the keras.layers.SimpleRNN layer with return_sequences=True, the output will return a 3-D tensor where the 0th axis is the batch size, the 1st axis is the timestep, and the 2nd axis is the number of hidden units (in the case for both SimpleRNN layers in your model, 10).
The Conv1D layer will produce an output tensor where the last dimension becomes the number of hidden units (in the case for your model, 16), as it's just being convolved with the input.
keras.layers.TimeDistributed, the layer supplied (in the example provided, Dense(1)) will be applied to each timestep in the batch independently. So with 96 timesteps, we have 96 outputs for each record in the batch.
So stepping through your model:
model = keras.models.Sequential([
keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 3]), # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 10)
keras.layers.SimpleRNN(10, return_sequences=True), # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 10)
keras.layers.Conv1D(16, helpValueStrides, strides=helpValueStrides) # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 16),
keras.layers.TimeDistributed(keras.layers.Dense(1)) # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1)
])
To answer your question, the output tensor from your model contains the predicted values for 96 steps into the future, for each sample. If it's easier to conceptualize, for the case of 1 output, you can apply np.squeeze to the result of model.predict, which will make the output 2-D:
Y_pred = model.predict(X_test) # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1)
Y_pred_squeezed = np.squeeze(Y_pred) # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS)
In that way, you have a rectangular matrix where each row corresponds to a sample in the batch, and each column i corresponds to the prediction for the timestep i.
In the loop after the prediction step, all the timestep predictions are being discarded except for the first one:
for i in range(0, len(Y_pred)):
prediction_lastValues_list.append((Y_pred[i][0][1 - 1]))
which means the end result is just a list of predictions for the first timestep for each sample in the batch. If you wanted the prediction for the 96th timestep, you could do:
for i in range(0, len(Y_pred)):
prediction_lastValues_list.append((Y_pred[i][-1][1 - 1]))
Notice the -1 instead of 0 for the second bracket, to ensure we grab the last predicted timestep instead of the first.
As a side note, to replicate the results, I had to make one change to your code, specifically when creating series_reshaped_X and series_reshaped_Y. I hit an exception when using np.array to create the array from the list: ValueError: cannot copy sequence with size 192 to array axis with dimension 3 , but looking at what you were doing (joining tensors along a new axis), I changed it to np.stack, which will accomplish the same goal (https://numpy.org/doc/stable/reference/generated/numpy.stack.html):
series_reshaped_X = np.stack([data_X[i:i + (steps_backwards + steps_forward)].copy() for i in
range(len(data) - (steps_backwards + steps_forward))])
series_reshaped_Y = np.stack([data_Y[i:i + (steps_backwards + steps_forward)].copy() for i in
range(len(data) - (steps_backwards + steps_forward))])
Update
"What are those 5 values representing when I only have 1 target feature?"
That's actually just the broadcasting feature of the Tensorflow API (which is also a feature of NumPy). If you perform an arithmetic operation on two tensors with differing shapes, it will try to make them compatible. In this case, if you change the output layer size to be "5" instead of "1" (keras.layers.Dense(5)), the output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 5) instead of (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1), which just means the output from the convolutional layer is going into 5 neurons instead of 1. When the loss (mean squared error) is computed between the two, the size of the label tensor ((BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1)) is broadcast to the size of the prediction tensor ((BATCH_SIZE, NUMBER_OF_TIMESTEPS, 5)). In this case, the broadcasting is accomplished by replicating the column. For example, if Y_train had [-1.69862224] in the first row for the first timestep, and Y_pred had [-0.6132075 , -0.6621697 , -0.7712653 , -0.60011995, -0.48753992] in the first row for the first timestep, to perform the subtraction operation, the entry in Y_train is converted to [-1.69862224, -1.69862224, -1.69862224, -1.69862224, -1.69862224].
And which of those 5 values is the "correct" value to choose for the 96 time step ahead prediciton?
There is no real "correct" value when trained this way - as detailed above, this just a feature of the API. All output should converge to the single target value for the timestep, they're all being compared to that value, so you could technically train that way, but it's just adding parameters and complexity to the model (and you would just have to choose one to be the "real" prediction). The correct approach for getting the prediction for 96 timesteps ahead is detailed in the original answer, but just to reiterate, the output of the model contains future timestep predictions for each sample in the batch. The output tensor could be iterated over to retrieve the predictions for each timestep, for each sample. Furthermore, ensure the number of neurons in the final dense layer matches the number of target values you are trying to predict, otherwise you'll hit the broadcasting issue (and the "correct" output will be unclear).
Just to be exhaustive (and I am not recommending this), if you really wanted to incorporate several neurons in the output despite only having one target value, you could do something like averaging the results:
for i in range(0, len(Y_pred)):
prediction_lastValues_list.append(np.mean(Y_pred[i][0]))
But there is absolutely no benefit to this approach, so I would recommend just sticking with the previous suggestion.
Update 2
Is my model only predicting one time slot which is 96 time steps into the future or is it also predicting everything in between?
The model is predicting everything in between. So for a sample at timestep t, the output of the model are predictions [t + 1, t + 2, ..., t + NUMBER_OF_TIMESTEPS]. Per my original answer, "the output tensor from your model contains the predicted values for 96 steps into the future, for each sample". To specify that in your evaluation code, you can do something like:
Y_pred = np.squeeze(Y_pred)
predictions_for_all_samples_and_timesteps = Y_pred.tolist()
This results in a list of length BATCH_SIZE, and each element in the list is a list of length NUMBER_OF_TIMESTEPS (to be clear, predictions_for_all_samples_and_timesteps is a list of lists). The element at index i in predictions_for_all_samples_and_timesteps contains the predictions for each timestep from 1-96 for the i^th sample (row) in X_test.
As a side note, you could omit np.squeeze, but then you will have a list of lists of lists, where each element in the inner list is a list of one item (instead of [[1, 2, 3, ...], ], the output would look like [[[1], [2], [3], ...], ].
Update 3
Y_test and Y_pred are both 3-D numpy arrays of size (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 1). To compare them, you can take the absolute (or squared) difference between the two:
abs_diff = np.abs(Y_pred - Y_test)
This results in an array of the same dimensions, (BATCH_SIZE, NUMBER_OF_TIMESTEPS). You can then iterate over the rows and generate a plot of the timestep error for each row.
for diff in abs_diff:
print(diff.shape)
plt.plot(list(range(diff)), diff)
It may get a bit unwieldy with a large batch size (as you can see in the image), so maybe you plot a subset of the rows. You can also transform the absolute difference to an error percentage if you would prefer to plot that:
percentage_diff = abs_diff / Y_test
which would be the absolute difference over the actual value, as I see you were originally doing in Pandas. This numpy array will have the same dimensions, so you can iterate over it and generate plots in the same fashion.
For future inquiries, instead of posting the comments, please open a new question and just provide the link - I would be happy to continue helping, but I would like to continue gaining reputation from it.

I disagree with #danielcahall on just one point:
The output tensor from your model contains the predicted values for 96 steps into the future, for each sample
The output does contain 96 time steps, one for each input time step, and you can take an output to mean whatever you want. But this is just not a good model for what you're trying to do. The main reason is that the RNNs you're using are single direction.
x x x x x x # input
| | | | | |
x-->x-->x-->x-->x-->x # SimpleRNN
| | | | | |
x-->x-->x-->x-->x-->x # SimpleRNN
| /|\ /|\ /|\ /|\ |
| / | \ | \ | \ | \ |
x x x x x x # Conv
| | | | | |
x x x x x x # Dense -> output
So the first time index of the output only sees the first 2 input times (thanks to the Conv), it can't see the later times. The first prediction is based only on old data. It's only the last few outputs that can see all the inputs.
use 96 backwards steps to predict 96 steps into the future
Most of the outputs just can't see all the data.
This model would be appropriate if you were trying to predict 1 step into the future from each of the input times.
To predict 96 steps into the future it would be much more reasonable to drop the return_sequences=True and the Conv layer. Then expand the Dense layer to make the prediction:
model = keras.models.Sequential([
keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 3]), # output size is (BATCH_SIZE, NUMBER_OF_TIMESTEPS, 10)
keras.layers.SimpleRNN(10), # output size is (BATCH_SIZE, 10)
keras.layers.Dense(96) # output size is (BATCH_SIZE, 96)
])
That way all 96 predictions see all 96 inputs.
See https://www.tensorflow.org/tutorials/structured_data/time_series for more details.
Also SimpleRNN is terrible. Never use it over more than a couple of steps.

how to solve ValueError: Expected 2D array, got 1D array instead

i have using metrics on sklearn to compare metrics accuracy with this model :
# Model Accuracy, how often is the classifier correct?
a = y_test
b = df_train['Interest_Rate']
k = a.reshape(1, -1)
y = b.reshape(1, -1)
print("Accuracy:",metrics.accuracy_score(k, y))
but when i run it, why it become error
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[1 3 3 ... 1 3 2].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Converting new dataset into a previous patsy dmatrix form - python

Related

Predicting new data with statsmodels gives ValueError: shapes

Linear Regression TypeError: float() argument must be a string or a number, not 'NaTType'

CNN: Generate a confusion matrix for entire test dataset

How to interpret the output of a RNN with Keras?

how to solve ValueError: Expected 2D array, got 1D array instead

Categories

Resources