I have the following data
feat_1 feat_2 ... feat_n label
gene_1 100.33 10.2 ... 90.23 great
gene_2 13.32 87.9 ... 77.18 soso
....
gene_m 213.32 63.2 ... 12.23 quitegood
The size of M is large ~30K rows, and N is much smaller ~10 columns.
My question is what is the appropriate Deep Learning structure to learn
and test the data like above.
At the end of the day, the user will give a vector of genes with expression.
gene_1 989.00
gene_2 77.10
...
gene_N 100.10
The system should then predict which label applies to each gene, e.g. great or soso.
By structure I mean one of these:
Convolutional Neural Network (CNN)
Autoencoder
Deep Belief Network (DBN)
Restricted Boltzmann Machine (RBM)
To expand a little on @sung-kim's comment:
CNNs are used primarily for computer vision problems, such as
classifying images. They are modelled on the animal visual cortex: their
connectivity is arranged so that each neuron sees a small tile of the
input, and neighbouring tiles overlap. They typically require a lot of
data, more than 30k examples.
Autoencoders are used for feature generation and dimensionality reduction. They start with many neurons in the first layer, reduce this number towards the middle, and then increase it again. Each example is trained to reproduce itself, so the narrow middle layers end up providing a meaningful low-dimensional projection of the feature space.
While I don't know much about DBNs, they appear to be a supervised extension of the autoencoder, with lots of parameters to train.
Again, I don't know much about Boltzmann machines, but to my knowledge they aren't widely used for this sort of problem.
As with all modelling problems though, I would suggest starting from the most basic model to look for signal. Perhaps a good place to start is Logistic Regression before you worry about deep learning.
If you have got to the point where you want to try deep learning, for whatever reason, then a basic feed-forward network is the best place to start for this type of data. In deep learning terms, 30k data points is not a large number, so it is best to start with a small network (1-3 hidden layers, 5-10 neurons each) and then grow it. Make sure you have a decent validation set when performing parameter optimisation, though. If you're a fan of the scikit-learn API, I suggest Keras as a good place to start.
One further comment, you will want to use a OneHotEncoder on your class labels before you do any training.
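To make the one-hot step concrete, here is a minimal hand-rolled sketch in plain Python. The function name `one_hot_encode` is made up for illustration; in practice scikit-learn's `LabelEncoder` followed by `OneHotEncoder` does the same job.

```python
def one_hot_encode(labels):
    """Return (classes, vectors): the sorted unique labels and one 0/1 vector per input label."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    vectors = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1  # a single 1 in the column of the label's class
        vectors.append(row)
    return classes, vectors


classes, y = one_hot_encode(["great", "soso", "quitegood", "soso"])
print(classes)  # ['great', 'quitegood', 'soso']
print(y)        # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
```

Each row has exactly one 1, which is the target format a softmax output layer with a categorical cross-entropy loss expects.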
EDIT
I see from the bounty and the comments that you want to see a bit more about how these networks work. Please see the example below of how to build a feed-forward model and do some simple parameter optimisation.
import numpy as np
from sklearn import preprocessing
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

# Create some random data: 10 samples with 50 features each
np.random.seed(42)
X = np.random.random((10, 50))

# Similar labels
labels = ['good', 'bad', 'soso', 'amazeballs', 'good']
labels += labels
labels = np.array(labels)
np.random.shuffle(labels)

# Change the labels to the required format
numericalLabels = preprocessing.LabelEncoder().fit_transform(labels)
numericalLabels = numericalLabels.reshape(-1, 1)
# .toarray() turns the encoder's sparse output into a dense matrix
y = preprocessing.OneHotEncoder().fit_transform(numericalLabels).toarray()

# Simple Keras model builder
def buildModel(nFeatures, nClasses, nLayers=3, nNeurons=10, dropout=0.2):
    model = Sequential()
    model.add(Dense(nNeurons, input_dim=nFeatures))
    model.add(Activation('sigmoid'))
    model.add(Dropout(dropout))
    for i in range(nLayers - 1):
        model.add(Dense(nNeurons))
        model.add(Activation('sigmoid'))
        model.add(Dropout(dropout))
    model.add(Dense(nClasses))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='sgd')
    return model

# Do an exhaustive search over a given parameter space
for nLayers in range(2, 4):
    for nNeurons in range(5, 8):
        model = buildModel(X.shape[1], y.shape[1], nLayers, nNeurons)
        modelHist = model.fit(X, y, batch_size=32, epochs=10,
                              validation_split=0.3, shuffle=True, verbose=0)
        # Epoch with the lowest validation loss for this configuration
        minLoss = min(modelHist.history['val_loss'])
        epochNum = modelHist.history['val_loss'].index(minLoss)
        print('{0} layers, {1} neurons best validation at epoch {2} loss = {3:.2f}'.format(
            nLayers, nNeurons, epochNum, minLoss))
Which outputs
2 layers, 5 neurons best validation at epoch 0 loss = 1.18
2 layers, 6 neurons best validation at epoch 0 loss = 1.21
2 layers, 7 neurons best validation at epoch 8 loss = 1.49
3 layers, 5 neurons best validation at epoch 9 loss = 1.83
3 layers, 6 neurons best validation at epoch 9 loss = 1.91
3 layers, 7 neurons best validation at epoch 9 loss = 1.65
A deep learning structure would be recommended if you were dealing with raw data and wanted to automatically find features that work towards your classification goal. But based on the names of your columns and their number (only 10), it seems that your features are already engineered.
For this reason you could just go with a standard multi-layer neural network and use supervised learning (backpropagation). Such a network would have a number of inputs matching the number of your columns (10), followed by a number of hidden layers, and then an output layer with a number of neurons matching the number of your labels. You could experiment with different numbers of hidden layers and neurons, different activation functions (sigmoid, tanh, rectified linear, etc.) and so on.
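To get a feel for how small such a network is, the sketch below counts the trainable parameters of a fully connected network from its layer widths. The function name and the example sizes (two hidden layers of 8 neurons, 3 labels) are assumptions chosen for illustration, not values from the question.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first and output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per neuron
    return total


# Assumed sizes: 10 input features, two hidden layers of 8 neurons, 3 labels.
print(dense_parameter_count([10, 8, 8, 3]))  # 187
```

With roughly 30k rows, a model of a few hundred parameters is comfortably constrained by the data, which is one reason to start small before adding layers or neurons.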
Alternatively you could use the raw data (if it's available) and then go with DBNs (they're known to be robust and achieve good results across different problems) or auto-encoders.
If you expect the output to be interpreted as a score per label (as I understood from your question), try a supervised multi-class logistic regression classifier (the label with the highest score wins).
If you're bound to use deep learning, a simple feed-forward ANN trained with supervised learning through backpropagation should do. Use an input layer with N neurons and one or two hidden layers, not more than that. There is no need to go 'deep' and add more layers for this data: with more layers you risk overfitting easily, it can be tricky to figure out what the problem is, and the test accuracy will suffer greatly.
Simply plotting or visualising the data, e.g. with t-SNE, can be a good start if you need to figure out which features are important (or whether any correlation exists).
You can then play with higher powers of those feature dimensions, or give them increased weight.
For problems like this, deep learning probably isn't very well suited, but a simple ANN architecture like the one above should work well, depending on the data.
The reason I am specifically trying to overfit is that I am following the steps for designing a network from "Deep Learning with Python" by François Chollet. This is important because it is for the final project of my degree.
At this stage, I need to make a network large enough to overfit my data in order to determine a maximal capacity, an upper bound for the size of the networks that I will optimise.
However, as the title suggests, I am struggling to make my network overfit. Perhaps my approach is naïve, but let me explain my model:
I am using this dataset to train a model to classify stars. Each star must be classified into two classes at once: its spectral class (100 classes) and its luminosity class (10 classes).
For example, our sun is a 'G2V': its spectral class is 'G2' and its luminosity class is 'V'.
To this end, I have built a double-headed network that takes this input data:
DataFrame containing input data
It then splits into two parallel networks.
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Model

# Create our input layer (shape must be a tuple, hence the trailing comma):
inputs = keras.Input(shape=(3,), name='observation_data')

# Build our spectral class
s_class_branch = layers.Dense(100000, activation='relu', name='s_class_branch_dense_1')(inputs)
s_class_branch = layers.Dense(500, activation='relu', name='s_class_branch_dense_2')(s_class_branch)
# Spectral class prediction
s_class_prediction = layers.Dense(100,
                                  activation='softmax',
                                  name='s_class_prediction')(s_class_branch)

# Build our luminosity class
l_class_branch = layers.Dense(100000, activation='relu', name='l_class_branch_dense_1')(inputs)
l_class_branch = layers.Dense(500, activation='relu', name='l_class_branch_dense_2')(l_class_branch)
# Luminosity class prediction
l_class_prediction = layers.Dense(10,
                                  activation='softmax',
                                  name='l_class_prediction')(l_class_branch)

# Now we instantiate our model using the layer setup above
scaled_model = Model(inputs, [s_class_prediction, l_class_prediction])
optimizer = keras.optimizers.RMSprop(learning_rate=0.004)
scaled_model.compile(optimizer=optimizer,
                     loss={'s_class_prediction': 'categorical_crossentropy',
                           'l_class_prediction': 'categorical_crossentropy'},
                     metrics=['accuracy'])

logdir = os.path.join("logs", "2raw100k")
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)
scaled_model.fit(
    input_data,
    {
        's_class_prediction': spectral_targets,
        'l_class_prediction': luminosity_targets
    },
    epochs=20,
    batch_size=1000,
    validation_split=0.0,
    callbacks=[tensorboard_callback])
In the code above you can see me attempting a model with two hidden layers in both branches, one layer with a shape of 100 000, following into another layer with 500, before going to the output layer. The training targets are one-hot encoded, so there is one node for every class.
I have tried a wide range of sizes with one to four hidden layers, ranging from a shape of 500 to 100 000, only stopping because I ran out of RAM. I have only used dense layers, with the exception of trying a normalisation layer, to no effect.
Graph of losses
They will all happily train and slowly lower the loss, but they never seem to overfit. I have run networks out to 100 epochs and they still will not overfit.
What can I do to make my network fit the data better? I am fairly new to machine learning, having only been doing this for a year now, so I am sure there is something that I am missing. I really appreciate any help and would be happy to provide the logs shown in the graph.
After a lot more training I think I have this answered. Basically, the network did not have adequate capacity and needed more layers. I had tried more layers earlier but because I was not comparing it to validation data the overfitting was not apparent!
The proof is in the pudding:
So thank you to @Aryagm for their comment, because that let me work it out. As you can see, the validation curves (grey and blue) clearly show the overfitting, while the training curves (green and orange) do not.
If anything, this goes to show why a separate validation set is so important and I am a fool for not having used it in the first place! Lesson learned.
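The lesson above can be turned into a simple check on the logged loss curves. The following is a minimal sketch with made-up numbers; the function name `overfitting_onset` and the `patience` threshold are illustrative assumptions, not part of Keras or TensorBoard.

```python
def overfitting_onset(train_loss, val_loss, patience=3):
    """Return the epoch with the best validation loss if overfitting is visible, else None.

    Overfitting is flagged when the validation loss has not improved for
    `patience` epochs while the training loss kept decreasing.
    """
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    epochs_since_best = len(val_loss) - 1 - best_epoch
    train_still_improving = train_loss[-1] < train_loss[best_epoch]
    if epochs_since_best >= patience and train_still_improving:
        return best_epoch
    return None


# Made-up loss curves for illustration only.
train = [2.0, 1.5, 1.1, 0.8, 0.6, 0.4, 0.3]
val = [2.1, 1.7, 1.4, 1.3, 1.4, 1.6, 1.9]
print(overfitting_onset(train, val))  # 3
```

Without a validation curve (for example with `validation_split=0.0`) there is nothing to compare the training loss against, so this divergence cannot be observed at all.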
I started PyTorch with image recognition. Now I want to test (very basically) with pure NumPy arrays. I struggle with getting the setup to work, so basically I have vectors with values between 0 and 1 (normalized curves). Those vectors are always of length 1500 and I want to find e.g. "high values at the beginning" or "sine wave-like function", "convex", "concave" etc. stuff like that, so just shapes of those curves.
My training set consists of many vectors with their classes; I have chosen 7 classes. The net should be trained to classify a vector into one or more of those 7 classes (not one hot).
I'm struggling with multiple issues, but first my very basic Net
class Net(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(Net, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim)
        self.fc = nn.Linear(self.hidden_dim, output_dim)

    def forward(self, x):
        h0 = torch.zeros(self.layer_dim, x.size(1), self.hidden_dim).requires_grad_()
        out, h0 = self.rnn(x, h0.detach())
        out = out[:, -1, :]
        out = self.fc(out)
        return out

network = Net(1500, 70, 20, 7)
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice? It is a time series, but then again it is an image recognition problem when plotting the curve.
Now, this here is an attempt to batch the data. The data object contains all training curves together with the correct classifiers.
def train(epoch):
    network.train()
    network.float()
    batching = True
    index = 0
    # monitor the cummulative loss for an epoch
    cummloss = []
    # start batching some curves
    while batching:
        optimizer.zero_grad()
        # here I start clustering come curves to a batch and normalize the curves
        _input = []
        batch_size = min(len(data)-1, index+batch_size_train) - index
        for d in data[index:min(len(data)-1, index+batch_size_train)]:
            y = np.array(d['data']['y'], dtype='d')
            y = np.multiply(y, y.max())
            y = y[0:1500]
            y = np.pad(y, (0, max(1500-len(y), 0)), 'edge')
            if len(_input) == 0:
                _input = y
            else:
                _input = np.vstack((_input, y))
        input = torch.from_numpy(_input).float()
        input = torch.reshape(input, (1, batch_size, len(y)))
        target = np.zeros((1,7))
        # the correct classes have indizes, to I create a vector with 1 at the correct locations
        for _index in np.array(d['classifier']):
            target[0,_index-1] = 1
        target = torch.from_numpy(target)
        # get the result form the network
        output = network(input)
        # is this a good loss function?
        loss = F.l1_loss(output, target)
        loss.backward()
        cummloss.append(loss.item())
        optimizer.step()
        index = index + batch_size_train
        if index > len(data):
            print(np.mean(cummloss))
            batching = False

for e in range(1, n_epochs):
    print('Epoch: ' + str(e))
    train(0)
The problem I'm facing right now is that the loss changes very little, even after hundreds of epochs.
Are there existing examples of this kind of problem? I didn't find any, just pure PNG/JPG image recognition. When I convert the curves to PNG, I have little trouble training a net: I took DenseNet and it worked just fine, but it seems to be super overkill for this simple task.
This is just a copy-paste from an RNN demo. Here is my first issue. Is an RNN the right choice?
In theory, the model you choose does not matter as much as how you formulate your problem.
But in your case the most obvious limitation you're going to face is your sequence length: 1500. RNNs store information across steps and typically run into trouble over long sequences because of vanishing or exploding gradients.
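To illustrate why 1500 steps is a problem, here is a toy sketch of the gradient flowing back through a scalar RNN. The recurrence `h_t = tanh(w * h_{t-1} + x_t)`, the weight 0.9 and the constant input 0.1 are assumptions picked only to show the effect; a real `nn.RNN` has weight matrices instead of a single scalar.

```python
import math


def gradient_through_time(w, inputs):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = 0.0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule: one factor per time step
    return grad


# Assumed toy setting: recurrent weight 0.9 and a constant input of 0.1.
for steps in (10, 100, 1500):
    print(steps, gradient_through_time(0.9, [0.1] * steps))
```

Every factor here is below 1 in magnitude, so the product shrinks exponentially with the sequence length and the earliest time steps receive essentially no learning signal. With factors above 1 the same product would explode instead.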
LSTM networks were developed to circumvent this limitation with a memory cell, but even then, for long sequences, they are still limited by the amount of information that can be stored in the cell.
You could try using a CNN as well and think of the curve as an image.
Are there existing examples of this kind of problem?
I don't know, but I might have some suggestions. If I understood your problem correctly, you're going from a (1500, 1) input to a (7, 1) output in which every position is 0 except for the corresponding class, where it is 1.
I don't see any activation function. Usually, when dealing with multiple classes, you don't compute the loss directly on the output of the dense layer; you first apply a normalising function such as softmax and then compute the loss.
From your description, with features in the form of sine-like structures, the closest thing that comes to mind is the frequency domain. As such, if you have an input image, just transform it to the frequency domain with a Fourier transform and use that as your feature input.
It might be best to look for such projects on the internet. One such project you might want to look at is the research paper or video from this group (they have some Jupyter notebooks for you to try), or any similar work. They use Fourier features that go through a multi-layer perceptron (MLP).
I am not sure what exactly you want to do, but it seems like a classification task. You would use an RNN if you wanted your neural network to work with a sequence. To me it seems like the 1500 dimensions are independent, and as such can just be treated as input.
Regarding the last layer: for a classification problem, the output is usually a probability distribution obtained by applying softmax (only if the classes are mutually exclusive, i.e. the probabilities sum to 1), so that, given an input, the net gives the probability of it belonging to each class. If an input can belong to multiple classes at once, use sigmoid as the last layer of the neural network instead.
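The difference between the two output activations is easy to see numerically. This is a plain-Python sketch with made-up logits; in PyTorch the corresponding modules are `torch.nn.Softmax` and `torch.nn.Sigmoid`.

```python
import math


def softmax(logits):
    """Mutually exclusive classes: outputs are positive and sum to 1."""
    shifted = [z - max(logits) for z in logits]  # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Independent classes: each output is its own probability in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


logits = [2.0, 1.0, -1.0]  # made-up raw outputs of the last linear layer
print(softmax(logits))  # sums to 1, so the classes compete with each other
print(sigmoid(logits))  # several entries can exceed 0.5 at once
```

For the multi-label setting in the question (a curve may belong to more than one of the 7 classes), the sigmoid variant is the one that matches the target vectors.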
Regarding your loss, there are many losses you can try to see whether they work better. Once again, for different features you have to know what exactly the measure of distance is (i.e. how different two things are). Check out this website, or any explanation of loss functions on the net.
So you should try a simple MLP on top of Fourier features as a starting point, assuming that is your feature vector.
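As a rough sketch of what "Fourier features" could mean here, the code below computes the magnitudes of the first few discrete Fourier transform coefficients of a curve. The function name, the number of features and the test curve are assumptions for illustration; for real data a library routine such as `numpy.fft.rfft` would be much faster than this naive loop.

```python
import cmath
import math


def fourier_magnitudes(signal, n_features):
    """Return the magnitudes of the first n_features DFT coefficients of a real signal."""
    n = len(signal)
    features = []
    for k in range(n_features):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        features.append(abs(coeff) / n)
    return features


# Made-up curve: three full sine periods over 150 samples, rescaled to [0, 1].
n = 150
curve = [0.5 + 0.5 * math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
features = fourier_magnitudes(curve, 8)
print([round(f, 3) for f in features])
```

For this curve only the constant term (index 0) and the third frequency bin stand out, so a "sine wave-like" shape becomes a very simple pattern in the feature vector that a small MLP can pick up.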
Image recognition is different from time-series data. Within the imaging domain, your dataset has more in common with problems like activity recognition or video recognition, which have a temporal component. So I'd recommend looking into some models for those.
As for the current model, I'd recommend using an LSTM instead of a plain RNN. For classification you also need an activation function in your final layer: this should be softmax with a cross-entropy-based loss, or sigmoid with an MSE loss.
Keras has a TimeDistributed wrapper which makes it easy to handle the time component. You can use a similar approach in PyTorch by applying linear layers followed by an LSTM.
Look into these for a better understanding:
Activity Recognition : https://www.narayanacharya.com/vision/2019-12-30-Action-Recognition-Using-LSTM
https://discuss.pytorch.org/t/any-pytorch-function-can-work-as-keras-timedistributed/1346
How to implement time-distributed dense (TDD) layer in PyTorch
Activation function:
https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html
Introduction
As will become clear, I am not a machine learning expert, but I work in a management position in data science and I am studying ML to understand its potential.
Mostly as an exercise, I am training a neural network to predict the resulting points in a Bejeweled-like game given the initial board (let's say R rows, C columns, L colors). Following the deep-chess paper (https://arxiv.org/abs/1711.09667), I am not defining features but letting the NN find them. As input I therefore use R*C*L binary input neurons, plus 2*R*C additional input neurons for the 2 "special" gems (obtained by destroying more than 3 gems at a time). For an 8x8 board with the classic 5 colors, this amounts to 448 input neurons. I use two hidden layers (100, 20) and a sigmoid function between layers.
I train this using a database I have built with a thousand games: starting board as input and points obtained after 5 moves as output.
Question
In general any suggestion is welcome of course, as I do not have professional experience in Machine Learning. My question is more theoretical however.
I was wondering how I could exploit the symmetry between the five colors. Indeed, I know that permuting the colors (i.e. switching the corresponding input neurons) would change nothing. An option I am thinking about now while writing this question (as usual!) would be to simply multiply the training set by adding all the color permutations (5! = 120) of each input board to the set, with the same output.
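The augmentation idea can be sketched as follows, assuming the board is stored as rows of color indices. This representation and the function name are hypothetical; the one-hot encoding and the two special-gem planes from the description above are left out for brevity and would simply be carried along unchanged.

```python
from itertools import permutations


def color_permutations(board, n_colors):
    """Yield every relabelling of the colors on a board.

    board is a list of rows, each cell holding a color index in range(n_colors).
    """
    for mapping in permutations(range(n_colors)):
        yield [[mapping[cell] for cell in row] for row in board]


# Hypothetical 2x3 board using all 5 colors; a real board would be 8x8.
board = [[0, 1, 2],
         [3, 4, 0]]
augmented = list(color_permutations(board, 5))
print(len(augmented))  # 120
```

Every generated board would be paired with the same target score as the original, so a database of a thousand games would grow to 120,000 training examples.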
A more refined and conceptually more appealing idea, however, would be to constrain the weights or the network structure somehow to reflect this theoretical symmetry, so that any learning would automatically update the network in a symmetric way. Is this feasible or advisable? How would I implement it?
Present Implementation
I use a very standard implementation of supervised learning in pybrain. The dataset DS has R*C*(L+2) binary input neurons and the output layer is normalized (subtracting mean and dividing by standard deviation). There are two hidden layers.
Everything is fully connected.
The dataset DS is split 80-20 in TrainDS and TestDS.
nn=FeedForwardNetwork()
inLayer = LinearLayer(R*C*(L+2))
hidden1 = SigmoidLayer(100)
hidden2 = SigmoidLayer(20)
outLayer = SigmoidLayer(1)
nn.addInputModule(inLayer)
nn.addModule(hidden1)
nn.addModule(hidden2)
nn.addOutputModule(outLayer)
in_to_hidden = FullConnection(inLayer, hidden1)
hidden1_to_hidden2 = FullConnection(hidden1, hidden2)
hidden2_to_out = FullConnection(hidden2,outLayer)
nn.addConnection(in_to_hidden)
nn.addConnection(hidden1_to_hidden2)
nn.addConnection(hidden2_to_out)
nn.sortModules()
TrainDS, TestDS = DS.splitWithProportion(0.8)
trainer = BackpropTrainer( nn, dataset=DS, momentum=0.1, verbose=True, weightdecay=0.01)
print sum(np.abs(nn.activateOnDataset(TrainDS) - TrainDS.data['target'][:len(TrainDS)]))/len(TrainDS)
print sum(np.abs(nn.activateOnDataset(TestDS) - TestDS.data['target'][:len(TestDS)]))/len(TestDS)
I get an average error on the train and test set of respectively 0.70 and 0.77.
I'm trying to build a NN to do regression with Keras in TensorFlow.
I'm trying to predict the chart ranking of a song based on a set of features. I've identified a strong correlation between having a low feature 1, a high feature 2 and a high feature 3, and having a high position on the chart (a low output ranking, e.g. position 1).
However, after training my model, the MAE is coming out at about 3500 (very, very high) on both the training and testing sets. Throwing some values in, it seems to give the lowest output rankings for observations with low values in all 3 features.
I think this could have something to do with the way I'm normalising my data. After bringing it into a pandas DataFrame with a column for each feature, I use the following code to normalise:
def normalise_dataset(df):
    return df-(df.mean(axis=0))/df.std()
I'm using a sequential model with one Dense input layer with 64 neurons and one dense output layer with one neuron. Here is the definition code for that:
model = keras.Sequential([
    keras.layers.Dense(64, activation=tf.nn.relu, input_dim=3),
    keras.layers.Dense(1)
])
optimizer = tf.train.RMSPropOptimizer(0.001)
model.compile(loss='mse', optimizer=optimizer, metrics=['mae'])
I'm a software engineer, not a data scientist so I don't know if this model set-up is the correct configuration for my problem, I'm very open to advice on how to make it better fit my use case.
Thanks
EDIT: Here are the first few entries of my training data; there are ~100,000 entries. The final column (finalPos) contains the labels, the field I'm trying to predict.
chartposition,tagcount,artistScore,finalPos
256,191,119179,4625
256,191,5902650,292
256,191,212156,606
205,1480523,5442
256,195,5675757,179
256,195,933171,7745
The first obvious thing is that you are normalizing your data in the wrong way. The correct way is
return (df - df.mean(axis=0))/df.std()
I just moved the bracket, but basically it should be (data - mean) divided by the standard deviation, whereas your version divides only the mean by the standard deviation and subtracts that quotient from the data.
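The effect of the misplaced bracket can be checked on a single column. This is a small sketch using the standard library instead of pandas; the helper names are made up, and `statistics.stdev` is used because it computes the sample standard deviation, as pandas' `DataFrame.std()` does by default.

```python
from statistics import mean, stdev


def standardise(values):
    """Return (x - mean) / std for every x, using the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(x - mu) / sigma for x in values]


def buggy_normalise(values):
    """Mirror of the original expression: x - (mean / std), which only shifts the data."""
    shift = mean(values) / stdev(values)
    return [x - shift for x in values]


column = [119179, 5902650, 212156, 5675757, 933171]  # artistScore values from the question
print(mean(standardise(column)), stdev(standardise(column)))  # close to 0 and 1
print(stdev(buggy_normalise(column)), stdev(column))          # spread is unchanged
```

The buggy version leaves the features on their original scale (millions for artistScore), which makes training a network on them much harder.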
I am building a classifying ANN with Python and the Keras library. I am training the NN on an imbalanced dataset with 3 different classes. Class 1 is about 7.5 times as prevalent as Classes 2 and 3. As a remedy, I took the advice of this Stack Overflow answer and set my class weights as such:
class_weight = {0 : 1,
                1 : 6.5,
                2: 7.5}
However, here is the problem: The ANN is predicting the 3 classes at equal rates!
This is not useful because the dataset is imbalanced, and predicting the outcomes as each having a 33% chance is inaccurate.
Here is the question: How do I deal with an imbalanced dataset so that the ANN does not predict Class 1 every time, but also so that the ANN does not predict the classes with equal probability?
Here is my code I am working with:
class_weight = {0 : 1,
                1 : 6.5,
                2: 7.5}

# Making the ANN
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

classifier = Sequential()

# Adding the input layer and the first hidden layer with dropout
classifier.add(Dense(activation = 'relu',
                     input_dim = 5,
                     units = 3,
                     kernel_initializer = 'uniform'))
#Randomly drops 0.1, 10% of the neurons in the layer.
classifier.add(Dropout(rate= 0.1))

#Adding the second hidden layer
classifier.add(Dense(activation = 'relu',
                     units = 3,
                     kernel_initializer = 'uniform'))
#Randomly drops 0.1, 10% of the neurons in the layer.
classifier.add(Dropout(rate = 0.1))

# Adding the output layer
classifier.add(Dense(activation = 'sigmoid',
                     units = 2,
                     kernel_initializer = 'uniform'))

# Compiling the ANN
classifier.compile(optimizer = 'adam',
                   loss = 'binary_crossentropy',
                   metrics = ['accuracy'])

# Fitting the ANN to the training set
classifier.fit(X_train, y_train, batch_size = 100, epochs = 100, class_weight = class_weight)
The most evident problem that I see with your model is that it is not properly structured for classification.
If your samples can belong to only one class at a time, then you should not overlook this fact by having a sigmoid activation as your last layer.
Ideally, the last layer of a classifier should output the probability of a sample belonging to each class, i.e. (in your case) an array [a, b, c] where a + b + c == 1.
If you use a sigmoid output, then the output [1, 1, 1] is a possible one, although it is not what you are after. This is also the reason why your model is not generalizing properly: given that you're not specifically training it to prefer "unbalanced" outputs (like [1, 0, 0]), it will default to predicting the average values that it sees during training, accounting for the reweighting.
Try changing the activation of your last layer to 'softmax' and the loss to 'categorical_crossentropy':
# Adding the output layer
# (one unit per class; the question describes 3 classes)
classifier.add(Dense(activation='softmax',
                     units=3,
                     kernel_initializer='uniform'))

# Compiling the ANN
classifier.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])
If this doesn't work, see my other comment and get back to me with that info, but I'm pretty confident that this is the main problem.
Cheers
Imbalanced datasets (where classes are uneven or unequally distributed) are a prevalent problem in classification. For example, one class label has a very high number of observations, and the other has a pretty low number of observations. Significant causes of data imbalance include:
Faulty data collection
Domain peculiarity – some domains naturally produce imbalanced data.
Imbalanced datasets can create many problems in classification, hence the need to rebalance them to obtain robust models and better performance.
Here are several methods to bring balance to imbalanced datasets:
Undersampling – works by resampling the majority class points in a dataset so that their number matches that of the minority class points. It brings equilibrium between the majority and minority classes so that the classifier gives equal importance to both. However, it's important to note that undersampling discards data, and this loss of information may weaken the results.
Oversampling – Also known as upsampling, oversampling resamples the minority class to equal the total number of majority class points. It replicates the observations from minority class points to balance datasets.
Synthetic Minority Oversampling Technique – As the name suggests, the SMOTE technique uses oversampling to create artificial data points for minority classes. It creates new instances between the attributes of the minority class, which are synthesized from existing data.
Searching for an optimal threshold on a grid – This technique involves computing the predicted probabilities for a particular class label and then finding the optimum threshold for mapping those probabilities to the correct class label.
Using the BalancedBaggingClassifier – The BalancedBaggingClassifier resamples each bootstrap subset of the dataset to balance it before training each base estimator of the ensemble.
Use different algorithms – Some algorithms cope poorly with imbalanced datasets. Sometimes it's wise to try different algorithms to stand a better chance of improving performance. For instance, you can employ regularized or penalized models that punish wrong predictions on the minority class more heavily.
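As an illustration of the simplest of these methods, here is a minimal sketch of random oversampling in plain Python. The function name and the toy data are assumptions for illustration; it duplicates existing minority samples and does not synthesize new ones the way SMOTE does.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Made-up data: class 0 is much more frequent than classes 1 and 2.
X = [[i] for i in range(19)]
y = [0] * 15 + [1] * 2 + [2] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # Counter({0: 15, 1: 15, 2: 15})
```

Resampling of this kind is normally applied to the training split only, so that the validation and test sets keep the original class distribution.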
The effects of imbalanced datasets can be significant. Hopefully, one of the approaches above can help you get in the right direction.
To test which approach works best for you, I'd suggest using deepchecks, an open-source Python package for validating data and models quickly.