I am using Keras on some data. Here are the details:
8,000 customers, each with a varying number of time steps ranging from 2 to 41, so I am using zero padding to ensure all customers have 41 time steps. All 8,000 customers have 2 features, and the data comes with multiclass labels, 0-4. Each timestep has a label.
After training the model, in the test part of the process I'd like to feed in the features and labels for timesteps 1-40, then have it predict the label at the 41st timestep. Does anyone know if this is possible? I've found Keras to be somewhat of a black box when interpreting what it is actually predicting (e.g. when it gives an accuracy score, what is this the accuracy of? What is it trying to predict: the last timestep's label or all timestep labels?).
Is there a particular sub-division of model that should be used within sequential Keras LSTM models? I've read 'A many-to-one model (f(...)) produces one output (y(t)) value after receiving multiple input values (X(t), X(t+1), ...).' (Brownlee 2017). However, this doesn't seem to accommodate the fact that my input is Xt & Yt for all time steps except the last one, which is the one I want to predict. I'm not sure how I would set up my code to instruct the model to predict the last timestep (which I do have the data for, so that I can compare the predicted category with the actual one).
To predict the next timestep for each feature you would want your final Dense layer to be the same width as the number of features:
model.add(Dense(n_features))
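For example, a minimal many-to-one sketch along those lines, roughly matching the shapes in the question (40 input timesteps, 2 features); the LSTM width of 50 and the optimizer/loss choices are placeholders, not prescriptions:

from keras.models import Sequential
from keras.layers import LSTM, Dense

n_timesteps, n_features = 40, 2   # feed timesteps 1-40, predict timestep 41

model = Sequential()
model.add(LSTM(50, input_shape=(n_timesteps, n_features)))  # many-to-one: summarises the sequence
model.add(Dense(n_features))                                # one output value per feature
model.compile(optimizer='adam', loss='mse')

# X has shape (n_samples, 40, 2); y has shape (n_samples, 2), i.e. the 41st step
# model.fit(X, y, ...)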
There's a good example of a similar problem here under Multiple Parallel Series
https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
The accuracy is just a metric for measuring the effectiveness of your model. For accuracy, it's correct_predictions / total_predictions
https://keras.io/metrics/
I have a dataset that consists of different features, like "gender". The task of the model is to determine if the annual income is above or below 50k.
Let's say I have a trained network that does the classification.
Now I want to see how often the classifier makes false positive and false negative predictions, respectively, by grouping them according to the gender feature.
The basic idea is a confusion matrix of sorts, but not a class-vs-class matrix, rather a class-vs-feature one.
The image below illustrates the result I would like to have.
The basic idea is as follows (a short code sketch follows after the steps):
1) Make a prediction with the network.
2) Set the predicted values as a new column in your dataset; you now have a new dataset, data_new. Your dataset now has two columns, one for the predicted and one for the true values. You can calculate the overall accuracy by boolean comparison (1 vs. 1 is a correct prediction, while 0 vs. 1 and 1 vs. 0 are wrong predictions).
3) Now you can filter the new data on any column you want, so in my case on the specific gender.
4) Now you can calculate the accuracy w.r.t. the chosen gender.
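A minimal pandas sketch of these steps; the DataFrame df, the array predictions, and the column names 'gender' and 'income_label' are placeholders, not taken from the question:

import pandas as pd

def accuracy_by_feature(df, predictions, feature='gender', label='income_label'):
    # 1) + 2) add the network's 0/1 predictions as a new column -> data_new
    data_new = df.copy()
    data_new['predicted'] = predictions
    # overall accuracy by boolean comparison of predicted vs. true labels
    print('overall accuracy:', (data_new['predicted'] == data_new[label]).mean())
    # 3) + 4) group by the chosen feature and compute accuracy / FP / FN per group
    for value, group in data_new.groupby(feature):
        acc = (group['predicted'] == group[label]).mean()
        fp = ((group['predicted'] == 1) & (group[label] == 0)).sum()
        fn = ((group['predicted'] == 0) & (group[label] == 1)).sum()
        print(value, 'accuracy:', acc, 'FP:', fp, 'FN:', fn)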
For my master's thesis I have to build an ANN to predict the level of consumer complaints from the in-process parameters of the production chain. Unfortunately, the firm gave me data that was not collected in a regulated way, and there is a lot of missing data. It's about a year of data grouped by open day, so I have 17 columns of physical values for 260 days. To infer the missing values, I tried to model a denoising autoencoder, but it doesn't give good results. For training the model, I have only 113 days with complete data. The values are real-valued, with different units and ranges (some are in the range (100, 150) and others in (90.03, 90.35)).
To simulate noise, and since the missing dynamic is Not Random At All, I modify a value with this condition (Random.random()
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras import regularizers

def DAE(train, l1, l2, num_layer):
    input_size = train.shape[1]
    theta = 1  # each layer is narrower than the previous one by theta units
    code_size = input_size - theta * (num_layer + 1)
    lrr = 0.01

    autoencoder = Sequential()
    # input layer with L2 kernel and L1 activity regularization
    autoencoder.add(Dense(input_size, input_shape=(input_size,),
                          kernel_regularizer=regularizers.l2(0.01),
                          activity_regularizer=regularizers.l1(0.01)))
    # encoder: progressively narrower layers
    for index in range(num_layer):
        layer_size = input_size - (index + 1) * theta
        autoencoder.add(Dense(layer_size, activation='linear'))
        print(layer_size)
    # bottleneck (code) layer
    autoencoder.add(Dense(code_size, activation='linear'))
    print(code_size)
    # decoder: progressively wider layers
    for index in range(num_layer):
        layer_size = input_size - (num_layer - index) * theta
        autoencoder.add(Dense(layer_size, activation='linear'))
        print(layer_size)
    # output layer reconstructs the input
    autoencoder.add(Dense(input_size, activation='linear'))

    autoencoder.compile(Adam(lr=lrr), loss='mean_squared_error', metrics=['accuracy'])
    return autoencoder

autoencoder = DAE(AE_train, l1, l2, 3)
history = autoencoder.fit(AE_train, AE_target, epochs=1000, validation_split=0.2)
On the train and test loss plots, the loss converges very fast, but after a certain number of epochs a big peak appears, followed immediately by a log-like decay. I don't understand why it rises.
When I try to predict the missing values, I replace the NaNs with the mean of the column. The predictions are always outside the min/max range of the specific physical values.
So here are my questions: how can I deal with missing data in such a small set of values? Since I have values of different types and units, should I normalize them? But if I do that, how do I reconstruct them, given that I want to infer real values? Is there a better ML-family technique for missing data imputation than an autoencoder?
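For concreteness, a rough sketch of the mean-imputation workflow described above, plus the scaling/reconstruction step I am asking about; MinMaxScaler is only one possible choice here, and df stands for the 260 x 17 table of physical values:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# current workaround: NaN -> column mean before feeding the autoencoder
filled = df.fillna(df.mean())

# scale each column to [0, 1] so the different units/ranges are comparable
scaler = MinMaxScaler()
scaled = scaler.fit_transform(filled.values)

# reconstruct with the trained denoising autoencoder, then map back to real units
reconstructed_scaled = autoencoder.predict(scaled)
reconstructed = scaler.inverse_transform(reconstructed_scaled)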
Thanks for reading my problem, and even more for bringing me an answer.
Loss plot for test and train sets
I'm trying to build an NN to do regression with Keras in TensorFlow.
I'm trying to predict the chart ranking of a song based on a set of features. I've identified a strong correlation between having a low feature 1, a high feature 2 and a high feature 3, and having a high position on the chart (a low output ranking, e.g. position 1).
However, after training my model, the MAE comes out at about 3500 (very, very high) on both the training and testing sets. Throwing some values in, it seems to give the lowest output rankings to observations with low values in all 3 features.
I think this could be something to do with the way I'm normalising my data. After bringing it into a pandas dataframe with a column for each feature, I use the following code to normalise:
def normalise_dataset(df):
    return df-(df.mean(axis=0))/df.std()
I'm using a sequential model with one Dense input layer with 64 neurons and one dense output layer with one neuron. Here is the definition code for that:
model = keras.Sequential([
    keras.layers.Dense(64, activation=tf.nn.relu, input_dim=3),
    keras.layers.Dense(1)
])

optimizer = tf.train.RMSPropOptimizer(0.001)
model.compile(loss='mse', optimizer=optimizer, metrics=['mae'])
I'm a software engineer, not a data scientist, so I don't know if this model set-up is the correct configuration for my problem. I'm very open to advice on how to make it better fit my use case.
Thanks
EDIT: Here are the first few entries of my training data; there are ~100,000 entries in total. The final column (finalPos) contains the labels, the field I'm trying to predict.
chartposition,tagcount,artistScore,finalPos
256,191,119179,4625
256,191,5902650,292
256,191,212156,606
205,1480523,5442
256,195,5675757,179
256,195,933171,7745
The first obvious thing is that you are normalizing your data in the wrong way. The correct way is
return (df - df.mean(axis=0))/df.std()
I just moved the bracket: it should be (data - mean) divided by the standard deviation, whereas your version divides the mean by the standard deviation and subtracts that from the data (the division binds tighter than the subtraction).
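A quick toy check of the difference (the numbers are just for illustration):

import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0]})

wrong = df - (df.mean(axis=0)) / df.std()   # subtracts mean/std, leaves the scale untouched
right = (df - df.mean(axis=0)) / df.std()   # zero mean, unit standard deviation

print(right['x'].mean(), right['x'].std())  # ~0.0 and 1.0
print(wrong['x'].mean(), wrong['x'].std())  # mean shifted, std still ~1.29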
Let me first explain the data sets that I am using.
I have three sets:
Train set with shape (1277, 927); the target is present about 12% of the time.
Eval set with shape (174, 927); the target is present about 11.5% of the time.
Hold-out set with shape (414, 927); the target is present about 10% of the time.
These sets are built using time slices: the train set is the oldest data, the hold-out set is the newest data, and the eval set is in the middle.
Now I am building two models.
Model1:
# Initialize CatBoostClassifier
model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time= True,
    iterations=300,
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)

## Fitting catboost model
model.fit(
    train_set.values, Y_train.values,
    cat_features=categorical_features_indices,
    eval_set=(test_set.values, Y_test)
    # logging_level='Verbose' # you can uncomment this for text output
)
Then I predict on the hold-out set.
Model2:
model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time= True,
    iterations='bestIteration from model1',  # placeholder: the best iteration found by model1
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)

## Fitting catboost model
model.fit(
    train.values, Y.values,
    cat_features=categorical_features_indices,
    # logging_level='Verbose' # you can uncomment this for text output
)
Both models are identical except for the iterations. The first model has a fixed 300 rounds, but it shrinks the model to bestIteration. The second model uses that bestIteration from model1.
However, when I compare the feature importances, they look drastically different.
Feature Score_m1 Score_m2 delta
0 x0 3.612309 2.013193 -1.399116
1 x1 3.390630 3.121273 -0.269357
2 x2 2.762750 1.822564 -0.940186
3 x3 2.553052 NaN NaN
4 x4 2.400786 0.329625 -2.071161
As you can see, feature x3, which was in the top 3 in the first model, dropped out in the second model. Not only that, but there is a large shift in weights between the models for a given feature. There are about 60 features present in model1 that are not present in model2, and about 60 features present in model2 that are not present in model1. delta is the difference between Score_m1 and Score_m2. I have seen models change scores a little, but not this drastically. AUC and LogLoss don't change that much whether I use model1 or model2.
Now I have the following questions regarding this situation:
Are these models unstable due to the small number of samples and the large number of features? If so, how can I check for this?
Are there features in this model that just don't give much information about the model outcome, so that it is random chance that they create a split? If that is the case, how can I check for this situation?
Is CatBoost the right model for this situation?
Any help regarding this issue will be appreciated.
Yes. Trees in general are somewhat unstable. If you remove the least important feature, you can get a very different model.
Having more data reduces this tendency.
Having more features increases this tendency.
Tree algorithms are random by nature, so the results will be different.
Things to try:
Run the model a large number of times but with different random seeds, and use the results to determine which feature seems to be the least important (a sketch follows after this list). How many features do you have?
Try to balance your training set. This might require you to upsample the rarer cases.
Get more data. Maybe you'll have to combine your train and test set and use the holdout as the test.
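A minimal sketch of the repeated-seed idea, reusing the variable names from the question; the number of runs and the hyperparameters are placeholders:

import numpy as np
from catboost import CatBoostClassifier

importances = []
for seed in range(20):                              # many runs, only the seed changes
    m = CatBoostClassifier(depth=9, l2_leaf_reg=1, iterations=300,
                           learning_rate=0.05, loss_function='Logloss',
                           random_seed=seed, logging_level='Silent')
    m.fit(train_set.values, Y_train.values,
          cat_features=categorical_features_indices,
          eval_set=(test_set.values, Y_test))
    importances.append(m.get_feature_importance())

# mean and spread of each feature's importance across seeds:
# features with a low mean and a large spread are the least trustworthy
imp = np.array(importances)
print(imp.mean(axis=0))
print(imp.std(axis=0))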
I have figured out how to train an LSTM using just values, but what would the data look like if I wanted to include the time? Perhaps an input dimension of 2, with time as epoch seconds and normalized values? There may be time gaps in the data and I want the training to reflect that.
Assuming I only want to train the LSTM periodically, since this is an expensive operation, how would you predict values in the future with a gap between the last training time and the first predicted time? For example, let's say I trained the LSTM 3 days ago, but now I want to predict the values for the next day.
All my work so far is based on this article: http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/. But it doesn't cover these kinds of questions.
I think you can handle this situation when constructing your training set, at least if the time delay between the last value (in the input sequence) and the value to predict is fixed.
Let X_train have dimension (nb_samples, timesteps, input_dim) and y_train have dimension (nb_samples, output_dim). Let x be one training input sample. It corresponds to a multivariate time series with dimension (timesteps, input_dim). Its corresponding output is y with dimension (output_dim).
In y you put the value to predict, which can be 3 days after the last value in x; the LSTM "should" grasp the temporal dependency. So if the time delay between the last value in the input and the value to predict is fixed, this should work.
That was the case for such a problem: https://challengedata.ens.fr/en/challenge/9/prediction_of_transaction_volumes_in_financial_markets.html
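A minimal sketch of that construction for a univariate series, where gap is the fixed number of steps between the last input value and the value to predict; all names here are placeholders:

import numpy as np

def make_dataset(series, timesteps, gap):
    # series: 1-D array of values ordered in time
    # each sample uses `timesteps` consecutive values as input; its target is the
    # value `gap` steps after the last input value, so the model learns that delay
    X, y = [], []
    for start in range(len(series) - timesteps - gap + 1):
        X.append(series[start:start + timesteps])
        y.append(series[start + timesteps + gap - 1])
    X = np.array(X).reshape(-1, timesteps, 1)   # (nb_samples, timesteps, input_dim)
    y = np.array(y).reshape(-1, 1)              # (nb_samples, output_dim)
    return X, y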