Use neural network to learn distribution of values for classification

Use neural network to learn distribution of values for classification - python

Use neural network to learn distribution of values for classification
The aim is to classify 1-D inputs using a neural network. There are two classes that should be classified, A and B. Each input, used to determine the class, is a number between 0.0 and 1.0.
The input values for class A are evenly distributed between 0 and 1 like so:
The input values for class B are all in the range of 0.4 to 0.6 like so:
Now I want to train a neural network that can learn to classify values in the range of 0.4 to 0.6 as B and the rest as A. So I need a neural network that can approximate the upper and lower bounds of a class. My previous attemps at doing so have been unsuccessful - the neural network always returns a 50% probability for any input across the board, and the loss does not decrease during epochs.
Using Tensorflow and Keras in Python I have trained simple models such as the following:
model = keras.Sequential([
keras.layers.Dense(1),
keras.layers.Dense(5, activation=tf.nn.relu),
keras.layers.Dense(5, activation=tf.nn.relu),
keras.layers.Dense(2, activation=tf.nn.softmax)
])
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
(full training script linked below)
On a side note, I would imagine the neural network to work like this: Some neurons fire only below 0.4, some only above 0.6. If either of those groups of neuron fires, it's class A, if neither fires, it's class B. Unfortunately, that's not what is happening.
How does one go about classifying the inputs described above using neural networks?
--
Example script: https://pastebin.com/xNJUqXyU

Several things could be changed in your model architecture here.
First, the loss should not be loss='mean_squared_error', it is better to use loss='binary_crossentropy', which is better suited for binary classification problems. I will not explain the difference here, this is something that can be looked up easily in the Keras documentation.
You also need to change the definition of your last layer. You only need to have one last node, which will be the probability of belonging to class 1 (hence having a node for the probability of belonging to class 0 is redundant), and you should be using activation=tf.nn.sigmoid instead of softmax.
Something else you can do is define class weights to deal with the imbalance of your data. It seems like given how you define your sample here, weighting class 0 to be 4 times as much as class 1 would make sense.
Once all these changes are made, you should be left with something that looks like this:
model = keras.Sequential([
keras.layers.Dense(1),
keras.layers.Dense(5, activation=tf.nn.relu),
keras.layers.Dense(5, activation=tf.nn.relu),
keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(np.array(inputs_training), np.array(targets_training), epochs=5, verbose=1, class_weight = {0:4, 1:1})
This gives me 96% accuracy on the validation set, and each epoch does reduce the loss.
(On a side note, it seems to me like a Decision Tree would be much better suited here, as it would behave explicitely like you described to perform the classification)

Related

SVC Classifier to Keras CNN with probabilities or confidence to distinguish untrained classes

This question is pretty similar to this one and based on this post over GitHub, in the sense that I am trying to convert an SVM multiclass classification model (e.g., using sklearn) to a Keras model.
Specifically, I am looking for a way of retrieving probabilities (similar to SVC probability=True) or confidence value at the end so that I can define some sort of threshold and be able to distinguish between trained classes and non-trained ones. That is if I train my model with 3 or 4 classes, but then use a 5th that it wasn't trained with, it will still output some prediction, even if totally wrong. I want to avoid that in some way.
I got the following working reasonably well, but it relies on picking the maximum value at the end (argmax), which I would like to avoid:
model = Sequential()
model.add(Dense(30, input_shape=(30,), activation='relu', kernel_initializer='he_uniform'))
# output classes
model.add(Dense(3, kernel_regularizer=regularizers.l2(0.1)))
# the activation is linear by default, which works; softmax makes the accuracy be stuck 33% if targeting 3 classes, or 25% if targeting 4.
#model.add(Activation('softmax'))
model.compile(loss='categorical_hinge', optimizer=keras.optimizers.Adam(lr=1e-3), metrics=['accuracy'])
Any ideas on how to tackle this untrained-class problem? Something like Plat scaling or Temperature scaling would work, if I can still save the model as onnx.

As I suspected, got softmax to work by scaling the features (input) of the model. No need for stop gradient or anything. I was specifically using really big numbers, which despite training well, were preventing softmax (logistic regression) to work properly. The scaling of the features can be done, for instance, through the following code:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
By doing this the output of the SVM-like model using keras is outputting probabilities as originally intended.

Need help defining a simple neural network

I am very new to this and I have several question. I have code snippets of a neural network created python with keras. The model is used for sentiment anaylsis. A training dataset of labeled data (sentiment = 1 or 0) was used.
Now I have several questions on how to describe the neural network.
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(train_x, train_y,
batch_size=32,
epochs=5,
verbose=1,
validation_split=0.1,
shuffle=True)
I am not very clear on many of the following terms so don't be too hard on me.
1: Is there anything that makes this a typical model for sentiment anaylsis?
2: Is it "bag of words"? (My guess is yes, since the data was pre-processed using a tokenizer)
3: Is it "convolusional"?
4: Is it deep?
5: Is it dense - How dense is it?
6: What is the reason for the density(?)-numbers: 512, 256, 2
7: How many layers does it have (input and output layer included/excluded?)
8: Is it supervised / unsupervised?
9: What is the reason behind the three different activation functions 'relu', 'sigmoid', 'softmax' in the used order?
I appreciate any help!

Categorical Cross Entropy, which is the loss function for this neural network, makes it usable for Sentiment Analysis. Cross Entropy loss returns probabilities for different classes. In your case, you need probabilities for two possible classes (0 or 1).
I am not sure if you are using a tokenizer since it is not apparent from the code you provided but if you are, then yes, it is a Bad of words model. A Bag of words model essentially creates a storage for the word roots you have in your text.
From Wikipedia, if the following is your text:
John likes to watch movies. Mary likes movies too.
then, a BoW for this text would be:
{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
The network architecture you are using is not Convolutional, rather it is a feedforward model, which connects all units from one layer to all the units in the next, providing a dot product of the values from the two layers.
There is no one accepted definition of a network being deep. But, as a rule of thumb, if a network has more than 2 middle layers (layers excluding the input and output layer), then it can be considered as a deep network.
In the code provided above, Dense reflects to the fact that all units in the first layer (512) are connected to every other unit in the next layer, i.e., a total of 512x256 connections between first layer and the second.
Yes, the connections between the 512 units in the first layer to the 256 units in the second layer resulting in a 512x256 dimensional matrix of parameters makes it dense. But the usage of Dense here is more from an API perspective rather than semantic. Similarly, the parameter matrix between the second and third layer would be 256x2 dimensional.
If you exclude the input layer (having 512 units) and output layer (having 2 possible outputs, i.e., 0/1), then your network here has one layer, with 256 units.
This model is supervised, since the sentiment analysis task has an output (positive or negative) associated with every input data point. You can see this output as being a supervisor to the network indicating it whether a data point has a positive or negative sentiment. An unsupervised task does not have an output signal associated with the data points.
The activation functions being used here serve the purpose of providing nonlinearity to the network's computations. In a little more detail, sigmoid has a nice property that its output can be interpreted as probabilities. So if the network is outputting 0.89 for a data point, then it would mean that the model evaluates that data point to be positive with a probability of 0.89 .
The usage of sigmoid is probably for teaching purposes since ReLU activation units are favored over sigmoid/tanh because of better convergence properties and I don't see a convincing reason to use sigmoid instead of ReLU.

Tensorflow Loss & Acc remain constant in CNN model

I have just started to learn CNN on Tensorflow. However, when I train the model the Loss and accuracy don't change.
I am using images of size 128x128x3 and the images are normalized (in [0,1]). And here is the compiler that I am using.
model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.000001), loss='binary_crossentropy', metrics=['accuracy'])
And here is the summary of my model
I tried the following things but I always have the same values:
Change the learning rate from 0.00000001 to 10
Change the convolution kernel I tried 5x5 and 3x3
I added another fully connected layer and a Conv layer.
update
The layers' weights didn't change after fitting the model. I have the same initial weights.

You could try this,
model.compile(optimizer='adam',
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics=['accuracy'])
Also, remove the softmax activation from the last layer, a binary classification problem does not need a softmax. So, what softmax does in this case is clip the value always to 1 since there is only one probability and thus the network doesn't train. This link might help you understand softmax.
Additionally, you can try using sigmoid activation at the final node. That clips the output to a value in the range of 0 to 1 and the network weights wont blow up because of a very high loss.

Predicting the percentage accuracy based on limited features

A practice problem based on whether or not and with what accuracy/probability an uber ride gets completed after being ordered has the following features:
Available Drivers int64
Placed Time float64
Response Distance float64
Car Type int32
Day Of Week int64
Response Delay float64
Order Completion int32 [target]
My approach has been to use tf.Keras Sequential to predict the target. Here's what it looks like:
model = tf.keras.models.Sequential([
tf.keras.layers.Dense(16, activation='relu', input_shape=input_shape),
tf.keras.layers.Dense(16, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(1, activation='sigmoid')
])
adam_optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
binary_crossentropy_loss = tf.keras.losses.BinaryCrossentropy()
model.compile(optimizer=adam_optimizer,
loss=binary_crossentropy_loss,
metrics=['accuracy'])
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=ES_PATIENCE)
history = model.fit(train_dataset, validation_data=validation_dataset, epochs=EPOCHS, verbose=2,
callbacks=[early_stop])
I normalize the data like this (note that train_data is a dataframe):
train_data = tf.keras.utils.normalize(train_data)
And then for predicting,
predictions = model.predict_proba(prediction_dataset, batch_size=None)
Training results:
loss: 0.3506 - accuracy: 0.8817 - val_loss: 0.3493 - val_accuracy: 0.8773
But this still gives me a poor quality probability for the corresponding occurrence. Is this the wrong approach ?
What approach would you suggest for a problem like this and am I doing it completely wrong ? Are Neural Networks a bad idea for this solution? Thanks a lot!

As you framed the problem, this is a classic machine learning classification problem.
Given N features(independent variables) you have to predict 1(one) dependent variable.
The way in which you constructed the neural network is good.
Since you have a binary classification problem, the sigmoid activation is the correct one.
With respect to the complexity of your model (number of layers, number of neurons per layer) it depends very much on your dataset.
If you have a comprehensive dataset with a lot of features and a lot of examples(an example is a row in dataframe with X1,X2,X3... Y), where X are the features and Y the dependent variable, your model can vary in complexity.
If you have a small dataset with a few features, a small model is recommended. Always begin with a small model.
If you run into the issue of underfitting (poor accuracy on the training set and also on the validation and test set), you can gradually increase the complexity of the model (add more layers, add more neurons per layer).
If you run into the issue of overfitting, implementing regularisation techniques may help (Dropout, L1/L2 Regularisation, Noise Addition, Data Augmentation).
What you have to take into consideration is that, if you have a small dataset, then a classical machine learning algorithm could outperform the deep learning model. This happens because neural networks are very 'hungry' ---> as compared to machine learning models, they require much more data in order to properly work. You could choose SVM/Kernel SVM/Random Forest/ XGBoost and other similar algorithms.
EDIT!
Whether or not and with what accuracy/probability automatically splits the problem into two parts, not only a simple classification one.
What I would personally do is the following: Since the probabilities occur between 0% and 100%, if you had probability as a feature in your X columns (which you don't), then, according to the number of data points(rows) you have you could do the following: I would assign a label to each probability section: 1 to (0%,25%), 2 to (25%, 50%), 3 to (50%,75%), 4 to (75%, 100%). But that depends exclusively on the prior probability information(if you had the probability as a feature). Then if you inferred and you get label 3, you would know the probability of the ride being completed.
Otherwise, you cannot frame your current problem as both a classification and a probablity one.
I hope that I have given you an introductory insight. Happy coding.

If you are doing classification, you may want to look into ensemble methods (forests, boosts, etc.)
If you are calculating probability, you may want to look into probabilistic graphical models (Bayesian networks, etc.)

Training different outputs at different epochs

Is it possible in Keras that the training of each or some of outputs in multi-output training start at different epochs? For example one of the outputs takes some other outputs as its input. But those outputs at the beginning are quite premature and it brings huge computational burdens to the model. This output that I would like its training to be postponed to some time later is a custom layer that has to apply some image processing operations to its input which is an image generated by another output but at the beginning that the generated image is quite meaningless, I think it's just waste of time for first epochs to apply this custom layer. Is there a way to do that? Like we have weights over each output's loss, do we have different starting point for calculating each output's loss?

Build a model that does not contain the later output.
Train that model to the degree you want.
Build a new model that incorporates the old model into it.
Compile the new model with the new loss functions you want.
Train that model.
To elaborate on step 3: Keras models can be used like layers in Keras' functional API.
You can build a normal model like so:
input = Input((100,))
x = Dense(50)(input)
x = Dense(1, activation='sigmoid')(x)
model = Model(input, x)
However, if you have another standard Keras model, it can be used just like any other layer. For example, if we have a model (created with Sequential(), Model(), or keras.models.load_model()) called model1, we can put it in like this:
input = Input((100,))
x = model1(input)
x = Dense(1, activation='sigmoid')(x)
model = Model(input, x)
This would be the equivalent of putting in each layer in model1 individually.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.