Missing data imputation with an autoencoder on a small dataset - Python

I have to build an ANN to predict the level of consumer complaints from in-process parameters on a production chain for my master's thesis. Unfortunately, the firm gave me data that was collected without any standard procedure, and there are a lot of missing values. It covers about a year of data grouped by working day, so I have 17 columns of physical values for 260 days. To infer the missing values, I tried to build a denoising autoencoder, but it doesn't give good results. For training the model, I have only 113 days with complete data. The values are real-valued, with different units and ranges (some are in the range (100, 150) and others in (90.03, 90.35)).
To simulate noise, and because the missing-data mechanism is Not Random At All, I modify a value with this condition (Random.random()
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras import regularizers

def DAE(train, l1, l2, num_layer):
    input_size = train.shape[1]
    theta = 1  # int(input_size / num_layer)
    code_size = input_size - theta * (num_layer + 1)
    lrr = 0.01

    autoencoder = Sequential()
    autoencoder.add(Dense(input_size, input_shape=(input_size,),
                          kernel_regularizer=regularizers.l2(0.01),
                          activity_regularizer=regularizers.l1(0.01)))
    # encoder: shrink the layer width by theta units per layer
    for index in range(num_layer):
        layer_size = input_size - (index + 1) * theta
        autoencoder.add(Dense(layer_size, activation='linear'))
        print(layer_size)
    # bottleneck
    autoencoder.add(Dense(code_size, activation='linear'))
    print(code_size)
    # decoder: grow back towards the input size
    for index in range(num_layer):
        layer_size = input_size - (num_layer - index) * theta
        autoencoder.add(Dense(layer_size, activation='linear'))
        print(layer_size)
    autoencoder.add(Dense(input_size, activation='linear'))

    autoencoder.compile(Adam(lr=lrr), loss='mean_squared_error', metrics=['accuracy'])
    return autoencoder

autoencoder = DAE(AE_train, l1, l2, 3)
history = autoencoder.fit(AE_train, AE_target, epochs=1000, validation_split=0.2)
On the train and test loss plot, it converges really fast, but after a certain number of epochs a big peak appears, followed by a roughly logarithmic decay. I don't understand why it rises.
When I try to predict the missing values, I replace the NaNs with the column mean. The predictions are always outside the min-max range of the corresponding physical values.
So here are my questions: how can I deal with missing data in such a small dataset? Since the columns have different units, should I normalize the values? And if I do, how do I reconstruct them, given that I want to infer real values? Is there a better ML technique than an autoencoder for missing data imputation?
Thanks for reading my problem, and even more for bringing me an answer.
Loss plot for test and train sets
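Regarding the normalization and reconstruction part of the question, one common approach is to fit a scaler on the complete training rows only, train the autoencoder in the scaled space, and map its output back to physical units with the scaler's inverse transform. A minimal sketch assuming scikit-learn is available; X_incomplete is a hypothetical array holding the incomplete days with NaNs already replaced by column means:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# fit the scaler on the 113 complete days only
scaler = MinMaxScaler()
AE_train_scaled = scaler.fit_transform(AE_train)

# ... train the autoencoder on AE_train_scaled instead of AE_train ...

# scale the incomplete days, reconstruct, then map back to physical units
X_incomplete_scaled = scaler.transform(X_incomplete)
reconstructed = scaler.inverse_transform(autoencoder.predict(X_incomplete_scaled))

# optionally clip each column to the range observed in training,
# so imputed values cannot leave the physical min-max range
reconstructed = np.clip(reconstructed, AE_train.min(axis=0), AE_train.max(axis=0))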

Related

How can I use the test_proportion data in a machine learning model?

I have data with 4000 CNN features and it is a binary classification problem. All I know about the test data is the proportions of 1 and 0. How can I tell my model to predict the test labels using these proportions? (For example, is there a way to say: in order to reach these proportions, I will assign this instance 0?)
How can I use it to increase accuracy? In my case the training data consists mostly of 1s (85%) and 0s (15%).
However, in my test data the proportion of 1 is given as 38%, so it is much different from the training data.
I worked a little bit with balancing the data and it helped. However, my model still predicts 1 for nearly all of the data. It may also occur because of the adaptation problem.
As @birdwatch suggested, I decreased the threshold for the 0 value and tried to increase the 0 label count in the predictions.
# Predicting the Test set results
y_pred = classifier.predict_proba(X_test)
threshold = 0.3
y_pred[:, 0] = (y_pred[:, 0] < threshold).astype('int')
Before changing the threshold, the predicted class counts were:
1: 8906
0: 2968
After changing the threshold they are:
1: 3221
0: 8653
However, is there any other way I can use test_proportions that ensures the result?
There isn't any sensible way to do that; it would create a weird bias in the model. One thing you could do is accept the less likely outcome only if it has a high enough score. Normally you'd use a 0.5 threshold, but here you might take e.g. 0.7.
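A minimal sketch of that idea, assuming (as in the question's snippet) that classifier.predict_proba returns the class-0 probability in the first column; the 0.7 value is only an example threshold:

import numpy as np

proba = classifier.predict_proba(X_test)
threshold = 0.7  # stricter than the usual 0.5

# accept the less likely class (here 0) only when its score clears the threshold
y_pred = np.where(proba[:, 0] >= threshold, 0, 1)
print(np.bincount(y_pred))  # counts of predicted 0s and 1s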

Keras time series, how to predict the next time period

I am using Keras on some data. Here are the details:
8,000 customers, each with a varying number of time steps ranging from 2 to 41, so I am using zero padding to ensure all customers have 41 time steps. All 8,000 customers have 2 features, and the data comes with multiclass labels, 0-4. Each timestep has a label.
After training the model, in the test part of the process I'd like to feed in the features and labels for timesteps 1-40, then have it predict the label of the 41st timestep. Does anyone know if this is possible? I've found Keras to be somewhat of a black box in interpreting what it is actually predicting (e.g. when it gives an accuracy score, what is that the accuracy of? What is it trying to predict: the last timestep's label or all timestep labels?).
Is there a particular kind of model that should be used within sequential Keras LSTM models? I've read 'A many-to-one model (f(...)) produces one output (y(t)) value after receiving multiple input values (X(t), X(t+1), ...)' (Brownlee 2017). However, this doesn't seem to accommodate the fact that my input is Xt and Yt for all time steps except the last one, which is the one I want to predict. I'm not sure how to set up my code to instruct the model to predict the last timestep (for which I have the data, so I can compare the predicted category with the actual one).
To predict the next timestep for each feature you would want your final Dense layer to be the same width as the number of features:
model.add(Dense(n_features))
There's a good example of a similar problem here under Multiple Parallel Series
https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
The accuracy is just a metric for measuring the effectiveness of your model. For accuracy, it's correct_predictions / total_predictions
https://keras.io/metrics/
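For the setup described in the question (feed timesteps 1-40, predict the label of timestep 41), a minimal many-to-one sketch could look like the following; the layer sizes, the Masking layer for the zero padding, and the 5-class softmax head are assumptions based on the description, not code from the original post:

from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

n_timesteps, n_features, n_classes = 40, 2, 5

model = Sequential()
# Masking skips the zero-padded timesteps used to equalize sequence lengths
model.add(Masking(mask_value=0.0, input_shape=(n_timesteps, n_features)))
# return_sequences=False -> many-to-one: one output per customer sequence
model.add(LSTM(32, return_sequences=False))
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# X: (n_customers, 40, 2) holding timesteps 1-40; y: the label of timestep 41
# model.fit(X, y, epochs=10, batch_size=32)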

How to handle Shift in Forecasted value

I implemented a forecasting model using an LSTM in Keras. The dataset is sampled at 15-minute intervals and I am forecasting 12 future steps.
The model performs well on the problem, but there is a small issue with the forecasts: they show a small shift effect. See the attached figure for a clearer picture.
How can I handle this problem? How must the data be transformed to deal with this kind of issue?
The model I used is given below:
init_lstm = RandomUniform(minval=-.05, maxval=.05)
init_dense_1 = RandomUniform(minval=-.03, maxval=.06)
model = Sequential()
model.add(LSTM(15, input_shape=(X.shape[1], X.shape[2]), kernel_initializer=init_lstm, recurrent_dropout=0.33))
model.add(Dense(1, kernel_initializer=init_dense_1, activation='linear'))
model.compile(loss='mae', optimizer=Adam(lr=1e-4))
history = model.fit(X, y, epochs=1000, batch_size=16, validation_data=(X_valid, y_valid), verbose=1, shuffle=False)
I made the forecasts like this
my_forecasts = model.predict(X_valid, batch_size=16)
The time series data is transformed into a supervised learning problem to feed the LSTM, using this function:
from pandas import DataFrame, concat

# convert a time series into a supervised learning problem
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

super_data = series_to_supervised(data, 12, 1)
My time series is multivariate; var2 is the one I need to forecast. I dropped the future var1 like this:
del super_data['var1(t)']
I separated the train and validation sets like this:
features = super_data[feat_names]
values = super_data[val_name]
n_test = 3444
train_feats, test_feats = features[0:-n_test], features[-n_test:]
train_vals, test_vals = values[0:-n_test], values[-n_test:]
X, y = train_feats.values, train_vals.values
X = X.reshape(X.shape[0], 1, X.shape[1])
X_valid, y_valid = test_feats.values, test_vals.values
X_valid = X_valid.reshape(X_valid.shape[0], 1, X_valid.shape[1])
I haven't made the data stationary for this forecast. I also tried differencing and making the series as stationary as I could, but the issue remains the same.
I have also tried different scaling ranges for the min-max scaler, hoping it might help the model, but the forecasts only got worse.
Other Things I have tried
=> Tried other optimizers
=> Tried mse loss and custom log-mae loss functions
=> Tried varying batch_size
=> Tried adding more past timesteps
=> Tried training with sliding window and TimeSeriesSplit
I understand that the model is replicating the last value it has seen, thereby minimizing the loss as well as it can.
The validation and training loss remain low enough throughout the training process. This makes me wonder whether I need to come up with a new loss function for this purpose.
Is that necessary? If so, what loss function should I go for?
I have tried all the methods that I stumbled upon. I can't find any resource at all that points to this kind of issue. Is this a problem with the data? Is it because the problem is very hard for an LSTM to learn?
You asked for my help at:
stock prediction : GRU model predicting same given values instead of future stock price
Hope it's not too late. What you can try is to divert the numerical explicitness of your features. Let me explain:
Similar to my answer in the previous topic: the regression algorithm will use the values from the time window you give as a sample to minimize the error. Let's assume you are trying to predict the closing price of BTC at time t. One of your features consists of previous closing prices, and you are giving a time-series window of the last 20 inputs, from t-20 to t-1. A regressor will probably learn to choose the closing value at time step t-1 or t-2, or another close value, which is cheating. Think of it like this: if the closing price was $6340 at t-1, predicting $6340 or something close at t+1 will minimize the error the most. But the algorithm did not actually learn any patterns; it just replicates, so it basically does nothing but accomplish its optimization duty.
Think analogously from my example. By diverting the explicitness, what I mean is: do not give the closing prices directly, but scale them, or do not use explicit prices at all. Do not use any features that explicitly show the closing prices to the algorithm; do not use open, high, low, etc. for every time step. You will need to be creative here and engineer the features to get rid of the explicit ones; you can give squared close differences (a regressor can still steal from the past with linear differences, with experience), or their ratio to volume. Or you can make the features categorical by digitizing them in a way that makes sense. The point is: do not give direct intuition about what it should predict, only provide patterns for the algorithm to work on.
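A minimal sketch of that kind of feature engineering, assuming a pandas DataFrame with hypothetical 'close' and 'volume' columns (the column names and bin count are illustrative, not from the original post):

import pandas as pd

def make_indirect_features(df):
    # df is a hypothetical DataFrame with raw 'close' and 'volume' columns
    out = pd.DataFrame(index=df.index)
    # squared one-step close differences instead of raw prices
    out['close_diff_sq'] = df['close'].diff() ** 2
    # ratio of the close change to traded volume
    out['diff_to_volume'] = df['close'].diff() / df['volume']
    # percentage change digitized into coarse bins (a categorical view)
    out['pct_change_bin'] = pd.cut(df['close'].pct_change(), bins=5, labels=False)
    return out.dropna()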
Depending on your task, a faster approach may be possible. You can do multi-class classification if predicting the percentage change is enough of a label for you; just be careful about class-imbalance situations. If even just the up/down fluctuations are enough, you can go directly for binary classification. Replication or shifting problems are only seen in regression tasks, provided you are not leaking data from the training to the test set. If possible, avoid regression for windowed time-series applications.
If anything is misunderstood or missing, I will be around. Hope I could help. Good luck.
Most likely your LSTM is learning to guess roughly what its previous input value was (modulated a bit). That's why you see a "shift".
So let's say your data looks like:
x = [1, 1, 1, 4, 5, 4, 1, 1]
And your LSTM learned to just output the previous input for the current timestep. Then your output would look like:
y = [?, 1, 1, 1, 4, 5, 4, 1]
Because your network has some complicated machinery, it is not quite this straightforward, but in principle the "shift" you see is caused by this phenomenon.
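One way to confirm this is to compare the model's error against a naive persistence baseline that simply repeats the previous observed value; if the two errors are close, the LSTM has effectively collapsed to that behaviour. A minimal sketch using the names from the question (y_valid, my_forecasts):

import numpy as np

# LSTM error on the validation set
lstm_mae = np.mean(np.abs(y_valid - my_forecasts.ravel()))
# naive persistence baseline: predict y[t] = y[t-1]
naive_mae = np.mean(np.abs(y_valid[1:] - y_valid[:-1]))
print('LSTM MAE:', lstm_mae, ' persistence MAE:', naive_mae)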

Tensorflow neural network has very high error with simple regression

I'm trying to build an NN to do regression with Keras in TensorFlow.
I'm trying to predict the chart ranking of a song based on a set of features. I've identified a strong correlation between having a low feature 1, a high feature 2 and a high feature 3, and having a high position on the chart (a low output ranking, e.g. position 1).
However, after training my model, the MAE comes out at about 3500 (very, very high) on both the training and test sets. Throwing some values in, it seems to give the lowest output rankings for observations with low values in all 3 features.
I think this could be something to do with the way I'm normalising my data. After bringing it into a pandas DataFrame with a column for each feature, I use the following code to normalise:
def normalise_dataset(df):
    return df-(df.mean(axis=0))/df.std()
I'm using a Sequential model with one Dense input layer of 64 neurons and one Dense output layer with a single neuron. Here is the definition code for that:
model = keras.Sequential([
    keras.layers.Dense(64, activation=tf.nn.relu, input_dim=3),
    keras.layers.Dense(1)
])

optimizer = tf.train.RMSPropOptimizer(0.001)
model.compile(loss='mse', optimizer=optimizer, metrics=['mae'])
I'm a software engineer, not a data scientist so I don't know if this model set-up is the correct configuration for my problem, I'm very open to advice on how to make it better fit my use case.
Thanks
EDIT: Here are the first few entries of my training data; there are ~100,000 entries in total. The final column (finalPos) contains the labels, the field I'm trying to predict.
chartposition,tagcount,artistScore,finalPos
256,191,119179,4625
256,191,5902650,292
256,191,212156,606
205,1480523,5442
256,195,5675757,179
256,195,933171,7745
The first obvious thing is that you are normalizing your data in the wrong way. The correct way is
return (df - df.mean(axis=0))/df.std()
I just changed the brackets, but basically it should be (data - mean) divided by the standard deviation, whereas you are dividing the mean by the standard deviation and subtracting that from the data.
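A quick sanity check for the fix is to apply the corrected function and confirm that every column ends up with mean ≈ 0 and standard deviation ≈ 1 (the example data below is made up purely to illustrate the check):

import numpy as np
import pandas as pd

def normalise_dataset(df):
    # (data - mean) / std, applied column-wise
    return (df - df.mean(axis=0)) / df.std()

df = pd.DataFrame(np.random.rand(100, 3) * [1, 1000, 1e6],
                  columns=['feature1', 'feature2', 'feature3'])
normed = normalise_dataset(df)
print(normed.mean(axis=0))  # each column should be ~0
print(normed.std(axis=0))   # each column should be ~1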

Drastically different feature importance between very same data and very similar model for catboost

Let me first explain the data sets that I am using.
I have three sets:
Train set with shape (1277, 927); the target is present about 12% of the time.
Eval set with shape (174, 927); the target is present about 11.5% of the time.
Hold-out set with shape (414, 927); the target is present about 10% of the time.
The sets are built using time slices: the train set is the oldest data, the hold-out set is the newest data, and the eval set sits in between.
Now I am building two models.
Model1:
# Initialize CatBoostClassifier
model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time=True,
    iterations=300,
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)
## Fitting catboost model
model.fit(
    train_set.values, Y_train.values,
    cat_features=categorical_features_indices,
    eval_set=(test_set.values, Y_test)
    # logging_level='Verbose' # you can uncomment this for text output
)
I then predict on the hold-out set.
Model2:
model = CatBoostClassifier(
    # custom_loss=['Accuracy'],
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    # has_time=True,
    iterations='bestIteration from model1',
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)
## Fitting catboost model
model.fit(
    train.values, Y.values,
    cat_features=categorical_features_indices,
    # logging_level='Verbose' # you can uncomment this for text output
)
Both models are identical except for the iterations: the first model has a fixed 300 rounds but, with the eval set, it shrinks to bestIteration; the second model uses that bestIteration from model1 directly.
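As a side note, the bestIteration from model1 can also be read off the fitted model programmatically rather than hard-coded; a small sketch, assuming a catboost version that provides get_best_iteration():

# after fitting model1 with eval_set, retrieve the iteration it shrank to
best_iter = model.get_best_iteration()

model2 = CatBoostClassifier(
    depth=9,
    random_seed=42,
    l2_leaf_reg=1,
    iterations=best_iter,
    learning_rate=0.05,
    loss_function='Logloss',
    logging_level='Verbose',
)
model2.fit(train.values, Y.values, cat_features=categorical_features_indices)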
However, when I compare the feature importances, they look drastically different.
Feature Score_m1 Score_m2 delta
0 x0 3.612309 2.013193 -1.399116
1 x1 3.390630 3.121273 -0.269357
2 x2 2.762750 1.822564 -0.940186
3 x3 2.553052 NaN NaN
4 x4 2.400786 0.329625 -2.071161
As you can see, feature x3, which was in the top 3 in the first model, dropped out in the second model. Not only that, there is a large shift in weights between the models for a given feature. About 60 features that are present in model1 are not present in model2, and about 60 features present in model2 are not present in model1. delta is the difference between Score_m1 and Score_m2. I have seen models change a score a little bit, but not this drastically. AUC and LogLoss don't change much whether I use model1 or model2.
Now I have the following questions regarding this situation:
Are these models unstable due to the small number of samples and the large number of features? If so, how can I check for this?
Are there features in this model that just don't give much information about the outcome, so that random chance determines where splits are created? If so, how can I check for this situation?
Is CatBoost the right model for this situation?
Any help regarding this issue will be appreciated.
Yes. Trees in general are somewhat unstable. If you remove the least important feature, you can get a very different model.
Having more data reduces this tendency.
Having more features increases this tendency.
Tree algorithms are random by nature, so the results will be different.
Things to try:
Run the model a large number of times but with different random seeds, and use the results to determine which feature seems to be the least important (see the sketch after this list). How many features do you have?
Try to balance your training set. This might require you to upsample the rarer cases.
Get more data. Maybe you'll have to combine your train and test set and use the holdout as the test.
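A minimal sketch of the repeated-seed check from the first point, averaging feature importances over several runs; names such as train_set, Y_train and categorical_features_indices are taken from the question, and the exact settings are illustrative:

import numpy as np
from catboost import CatBoostClassifier

importances = []
for seed in range(10):  # a handful of different random seeds
    m = CatBoostClassifier(depth=9, l2_leaf_reg=1, iterations=300,
                           learning_rate=0.05, loss_function='Logloss',
                           random_seed=seed, logging_level='Silent')
    m.fit(train_set.values, Y_train.values,
          cat_features=categorical_features_indices)
    importances.append(m.get_feature_importance())

importances = np.array(importances)
mean_imp, std_imp = importances.mean(axis=0), importances.std(axis=0)
# features whose importance varies a lot across seeds are the unstable ones
for name, mu, sd in sorted(zip(train_set.columns, mean_imp, std_imp),
                           key=lambda t: -t[1])[:10]:
    print(f'{name}: {mu:.3f} +/- {sd:.3f}')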
