The situation: I have a big dataset with more than 18 million examples.
I train several models and want to track the accuracy.
When I forward all examples and compute accuracy, I get approximately 83 percent, but this takes a long time.
So I try to sample a small subset of the whole dataset and compute accuracy on that. I expect to see approximately the same number (around 80 percent):
total = 4096
N = dataset.shape[0]
indices = np.random.randint(N-1, size=total)
batch = dataset[indices,:]
However, the output now looks like this when running it for 10 'random' batches:
> satisfied 4096/4096
> 1.0
> satisfied 4095/4096
> 0.999755859375
> satisfied 4095/4096
> 0.999755859375
> satisfied 4094/4096
> 0.99951171875
> satisfied 4095/4096
> 0.999755859375
> satisfied 4095/4096
> 0.999755859375
> satisfied 4094/4096
> 0.99951171875
> satisfied 4096/4096
> 1.0
> satisfied 4095/4096
> 0.999755859375
> satisfied 4096/4096
> 1.0
So here it always performs far too well and seems to sample almost exclusively from the 80 percent of good examples. What can I do to make the sampling really random, so that it gives a good picture of the accuracy?
This also makes the training go wrong, because only the good examples are sampled for the next training batch.
EDIT: so this is not about the training itself! I have a trained model with 83 percent accuracy, and I use only this model for testing accuracy. When testing accuracy on small subsets it always gives 99 or 100 percent, even for 100 random batches.
Edit:
This is the code I generate the output with, which gives 99 or 100 percent:
def constraints_satisfied_v3(sess, model, dataset, pointclouds, instructions, trajectories, distances, is_training=0):
    satisfied = 0
    total = 4096
    # Pick random examples
    N = dataset.shape[0]
    indices = np.random.randint(N-1, size=total)
    batch = dataset[indices,:]
    pdb.set_trace()
    # Fill a feed dictionary with the actual set of images and labels
    feed_dict = {model.input_pointcloud: pointclouds[batch[:,0],:],
                 model.input_language: instructions[batch[:,1],:],
                 model.input_traj: trajectories[batch[:,2],:],
                 model.input_traj_mv: trajectories[batch[:,3],:],
                 model.distances: distances[batch[:,2], batch[:,3]],
                 model.is_training: is_training}
    loss_value,emb_pl,emb_t,emb_t_mv,sim_mv,sim = sess.run([model.loss,model.embeddings_pl,model.embeddings_t,model.embeddings_t_mv,model.sim_mv,model.sim],
                                                           feed_dict=feed_dict)
    result = np.greater_equal(sim, distances[batch[:,2], batch[:,3]]+sim_mv)
    satisfied = satisfied + np.sum(result)
    print 'satisfied %d/%d' % (satisfied, total)
    percentage = float(satisfied)/float(total)
    #pdb.set_trace()
    return percentage
Edit: Okay, you have a point. When training batches are sampled the same way, the model is only trained on that data, which is why it does almost perfectly on it. But the question remains: how do I sample from the whole dataset?
So this is the version that gets 83 percent accuracy:
def constraints_satisfied_v2(sess, model, dataset, pointclouds, instructions, trajectories, distances, is_training=0):
    satisfied = 0
    total = 0
    N = dataset.shape[0]
    #indices = np.random.randint(N-1, size=int(total))
    #batch = dataset[indices,:]
    i = 10000
    while i < N:
        indices = np.arange(i-10000, i)
        if i+10000 < N:
            i = i+10000
        else:
            i = N
        batch = dataset[indices,:]
        # Fill a feed dictionary with the actual set of images and labels
        feed_dict = {model.input_pointcloud: pointclouds[batch[:,0],:],
                     model.input_language: instructions[batch[:,1],:],
                     model.input_traj: trajectories[batch[:,2],:],
                     model.input_traj_mv: trajectories[batch[:,3],:],
                     model.distances: distances[batch[:,2], batch[:,3]],
                     model.is_training: is_training}
        loss_value,emb_pl,emb_t,emb_t_mv,sim_mv,sim = sess.run([model.loss,model.embeddings_pl,model.embeddings_t,model.embeddings_t_mv,model.sim_mv,model.sim],
                                                               feed_dict=feed_dict)
        result = np.greater_equal(sim, distances[batch[:,2], batch[:,3]]+sim_mv)
        satisfied = satisfied + np.sum(result)
        total = total + batch.shape[0]
    print 'satisfied %d/%d' % (satisfied, total)
    percentage = float(satisfied)/float(total)
    return percentage
Edit: it seems the difference between constraints_satisfied_v2 and constraints_satisfied_v3 has to do with the use of batch normalization. In v3, random samples are picked whose batch mean and std statistics match those seen during training, hence the high performance. In v2 the data is not in a random order, which makes each batch's mean and std not very representative.
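For reference, a minimal sketch (not from the original post) of one way to reconcile the two: walk over the whole dataset in fixed-size chunks of a random permutation, so every example is counted exactly once while each chunk still looks like a random batch to the batch-normalization layers. It assumes dataset and the evaluation body of constraints_satisfied_v2.

# Sketch only: evaluate the whole dataset in randomly permuted chunks.
import numpy as np

chunk_size = 10000
perm = np.random.permutation(dataset.shape[0])   # random order over all examples
for start in range(0, len(perm), chunk_size):
    batch = dataset[perm[start:start + chunk_size], :]
    # ... build feed_dict and accumulate 'satisfied' exactly as in v2 ...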
To my mind, this has nothing to do with randomness; NumPy's random is perfectly fine.
In this answer I'm assuming you're training some sort of a neural network, abbreviated as ANN (Artificial Neural Network) here.
The problem is, it's almost always easier to 'understand' common patterns in a small batch of objects than in a large one. That's how our brain works and that's how ANNs work as well. Although it doesn't mean that what your ANN has 'understood' from a smaller batch applies to any other collection of objects of the same nature.
For example, you can teach the computer to distinguish between cats and dogs in photos using 6 photos of cats and 7 of dogs. Now, when you give it a photo of a cat it has never seen before, it may fail to say anything specific because it has not 'seen' enough cats and dogs; it was unable to generalize.
ANNs can also simply memorize the data and give you fascinating accuracy on the training set but fail horribly on test data.
So, this is a very important and difficult problem of ANNs, which you may try to solve... by increasing the number of objects in a set, thus allowing the computer to generalize.
Related
I have a very simple Neural Network classifier built using sklearn (see code below). The input is t time windows (currently 8) of a simple 1D time-series signal. The signal itself, as can be seen from the full code below, is a simple Brownian-type random motion, with a Gaussian (mean zero) delta movement at each time step.
This is where the setup gets a little less standard, as I then create the output class labels of 0 and 1 completely at random. The model is then trained on these random output labels, using standard params (adam optimiser etc.).
I have then built an accuracy metric which is similar to Positive predictive value (PPV), and is calculated thus: For each set of contiguous "predicted" 1s in the testing data (so ' .. 0 0 0 1 1 1 0 0 .. ' contains just one set), if the underlying time series signal increases during that period then this is a "hit" or a "true positive". The accuracy is then the proportion of these "1" or positive sets which are "hits".
Now, statistically this should be 50% as far as I am aware, assuming that the signal is random enough that a simple shallow NN couldn't detect the underlying pattern, and that the label assignment is random as well.
The crux of all of this is that I'm not getting 50%. I'm fairly consistently getting around 52-53%. This is done fairly archaically through a simple for loop which creates a fresh signal/label set each time and then averages the results over, say, 20 iterations (it only takes a few for the 53% trend to show itself).
My question is, does anyone know why is the accuracy not 50%?
Code:
import pandas as PD
import autograd.numpy as np
import random
from random import gauss
from sklearn.neural_network import MLPClassifier

list_of_acc = []
for k in range(10):
    ## create the data
    time_steps = 8
    data_size = 6000
    signal_value = 500
    time_series = []
    for j in range(data_size):
        delta = gauss(0, 0.5)
        signal_value = signal_value + delta
        time_series.append(signal_value)
    all_inputs = np.array([time_series[i-time_steps:i] for i in range(time_steps, len(time_series))])
    all_data = PD.DataFrame(np.array(all_inputs))
    all_data[len(all_data.columns)] = np.array([random.randint(0, 1) for i in range(len(all_data))])

    train_prop = 0.6
    test_final_index = int(train_prop*len(all_data))
    x_train = all_data.loc[:test_final_index, :len(all_data.columns) - 2]
    x_test = all_data.loc[test_final_index:, :len(all_data.columns) - 2]
    y_train = all_data.loc[:test_final_index, len(all_data.columns) - 1:]

    # create model
    number_classes = np.unique(y_train)
    model = MLPClassifier(random_state=0,
                          hidden_layer_sizes=(20,),
                          learning_rate_init=0.000025,  # 0.001 gave 0.55
                          momentum=0.85,
                          solver='adam',
                          batch_size=32,
                          max_iter=80)
    model.fit(x_train, y_train)

    ### now test fitted model for accuracy of +- movement
    tot_play = 0
    total_hits = 0
    in_play = False  # this will be True whenever pred == 1
    for i in range(test_final_index, len(all_data)-1):
        pred = model.predict(np.array([all_data.loc[i, :len(all_data.columns) - 2]]))
        if in_play:
            if pred == 1:
                in_play = True  # continue to the next time step, still in_play
            else:
                if all_data.loc[i, len(all_data.columns) - 2] > current_signal_value:
                    total_hits = total_hits + 1
                in_play = False
        else:
            if pred == 1:
                in_play = True
                tot_play = tot_play + 1
                current_signal_value = all_data.loc[i, len(all_data.columns) - 2]
    accuracy = total_hits / tot_play
    print(accuracy)
    list_of_acc.append(accuracy)
print(np.mean(list_of_acc))
You're confusing the input concept "random" with the result of "has no pattern".
Avoiding pattern usually requires (1) knowledge of the patterns that will be recognized, and (2) careful construction of data to counter each potential pattern with the opposing patterns.
Just as surely as the Central Limit Theorem says that we will get close to 50%, it also says that hitting exactly 50% becomes less and less likely as the quantities grow large.
When you generate random data, the sample deviates slightly from a totally balanced distribution. Your model is good enough to find some patterns in those deviations, and it learns to bias its predictions in favor of those imbalances.
A simple example is coin-tossing. If you toss a coin 100 times, you are likely to find that, rather than 50 heads and 50 tails, you will have a slightly off-center split, such as 52 tails - 48 heads. Take this as your random input.
Train your model. It will learn that tails are more prevalent. A horridly simplistic model will simply predict "tails" on every toss, and achieve 52% accuracy. A more sophisticated model will find other ways to predict, but will still lean toward tails, and tend toward 52% accuracy.
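A toy illustration of that coin-toss argument (a standalone sketch, not the asker's code): always predicting the majority label of a randomly generated 0/1 sample already scores slightly above 50% on that same sample.

import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=100)    # 100 random "coin tosses"
majority = int(labels.mean() >= 0.5)     # the slightly over-represented side
print((labels == majority).mean())       # e.g. 0.52 rather than exactly 0.50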
I have data with 4000 CNN features and it is a binary classification problem. All I know about the test data is the proportions of 1 and 0. How can I tell my model to predict test labels using these proportions? (Like, is there a way to say: in order to reach these proportions, I will give this instance a 0.)
How can I use it to increase accuracy? In my case the training data mostly consists of 1s (85%) and 0s (15%).
However, in my test data the proportion of 1s is given as 38%, so it is much different from the training data.
I worked a little bit with balancing the data and it helped. However, my model still predicts 1 for nearly all of the data. This may also be due to the adaptation problem.
As @birdwatch suggested, I decreased the threshold for the 0 value to try to increase the 0 label count in the predictions.
# Predicting the Test set results
y_pred = classifier.predict_proba(X_test)
threshold = 0.3
y_pred[:, 0] = (y_pred[:, 0] < threshold).astype('int')
Before, the class counts were as follows:
1 : 8906
0 : 2968
After changing the threshold they are:
1 : 3221
0 : 8653
However, is there any other way I can use the test proportions that ensures the desired result?
There isn't any sensible way to do that; it would create a weird bias in the model. One thing you could do is accept the less likely outcome only if it has a high enough score. Normally you'd use a 0.5 threshold, but here you might take e.g. 0.7.
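A minimal sketch of that idea (classifier and X_test are assumed to be the fitted model and test features from the question):

import numpy as np

proba = classifier.predict_proba(X_test)           # column 0 = P(y=0), column 1 = P(y=1)
threshold = 0.7
y_pred = np.where(proba[:, 0] >= threshold, 0, 1)  # predict 0 only when it is clearly more likely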
To get to grips with PyTorch (and deep learning in general) I started by working through some basic classification examples. One such example was classifying a non-linear dataset created using sklearn (full code available as notebook here)
import torch
import torch.nn as nn
from sklearn import datasets

n_pts = 500
X, y = datasets.make_circles(n_samples=n_pts, random_state=123, noise=0.1, factor=0.2)
x_data = torch.FloatTensor(X)
y_data = torch.FloatTensor(y.reshape(500, 1))
This is then accurately classified using a pretty basic neural net
class Model(nn.Module):
    def __init__(self, input_size, H1, output_size):
        super().__init__()
        self.linear = nn.Linear(input_size, H1)
        self.linear2 = nn.Linear(H1, output_size)

    def forward(self, x):
        x = torch.sigmoid(self.linear(x))
        x = torch.sigmoid(self.linear2(x))
        return x

    def predict(self, x):
        pred = self.forward(x)
        if pred >= 0.5:
            return 1
        else:
            return 0
As I have an interest in health data, I then decided to try to use the same network structure to classify a basic real-world dataset. I took heart rate data for one patient from here, and altered it so that all values > 91 are labelled as anomalies (i.e. a 1) and everything <= 91 is labelled a 0. This is completely arbitrary, but I just wanted to see how the classification would work. The complete notebook for this example is here.
What is not intuitive to me is why the first example reaches a loss of 0.0016 after 1,000 epochs, whereas the second example only reaches a loss of 0.4296 after 10,000 epochs.
Perhaps I am being naive in thinking that the heart rate example would be much easier to classify. Any insights to help me understand why this is not what I am seeing would be great!
TL;DR
Your input data is not normalized.
Use x_data = (x_data - x_data.mean()) / x_data.std() and increase the learning rate: optimizer = torch.optim.Adam(model.parameters(), lr=0.01).
You'll get convergence in only 1000 iterations.
More details
The key difference between the two examples you have is that the data x in the first example is centered around (0, 0) and has very low variance.
On the other hand, the data in the second example is centered around 92 and has relatively large variance.
This initial bias in the data is not taken into account when you randomly initialize the weights, since the initialization assumes the inputs are roughly normally distributed around zero.
It is almost impossible for the optimization process to compensate for this gross deviation - thus the model gets stuck in a sub-optimal solution.
Once you normalize the inputs, by subtracting the mean and dividing by the std, the optimization process becomes stable again and rapidly converges to a good solution.
For more details about input normalization and weights initialization, you can read section 2.2 in He et al Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (ICCV 2015).
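Putting the two TL;DR changes together, a minimal training sketch (the Model instance, x_data and y_data are assumed to come from the question's notebook):

import torch
import torch.nn as nn

x_data = (x_data - x_data.mean()) / x_data.std()           # normalize the inputs
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # larger learning rate
criterion = nn.BCELoss()                                   # forward() still ends with a sigmoid

for epoch in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(x_data), y_data)
    loss.backward()
    optimizer.step()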
What if I cannot normalize the data?
If, for some reason, you cannot compute the mean and std of the data in advance, you can still use nn.BatchNorm1d to estimate and normalize the data as part of the training process. For example:
class Model(nn.Module):
    def __init__(self, input_size, H1, output_size):
        super().__init__()
        self.bn = nn.BatchNorm1d(input_size)  # adding batchnorm
        self.linear = nn.Linear(input_size, H1)
        self.linear2 = nn.Linear(H1, output_size)

    def forward(self, x):
        x = torch.sigmoid(self.linear(self.bn(x)))  # batchnorm the input x
        x = torch.sigmoid(self.linear2(x))
        return x
This modification, without any change to the input data, yields similar convergence after only 1000 epochs.
A minor comment
For numerical stability, it is better to use nn.BCEWithLogitsLoss instead of nn.BCELoss. To this end, you need to remove the torch.sigmoid from the forward() output; the sigmoid will be computed inside the loss.
See, for example, this thread regarding the related sigmoid + cross entropy loss for binary predictions.
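A minimal sketch of that change (it assumes forward() has been edited to return the raw output of self.linear2, without the final sigmoid):

criterion = nn.BCEWithLogitsLoss()
logits = model(x_data)              # raw scores, no sigmoid in forward()
loss = criterion(logits, y_data)    # the sigmoid is applied inside the loss
probs = torch.sigmoid(logits)       # only where probabilities are actually needed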
Let's start by understanding how neural networks work: they observe patterns, hence the necessity for large datasets. In the second example, the pattern you intend to find is if HR < 91: label = 0. This if-condition can be represented by the formula sigmoid((HR - 91) * 1); if you plug various values into the formula you can see that values below 91 tend towards label 0 and the others towards label 1. I have inferred this formula, and it could be anything as long as it gives the correct values.
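For illustration, plugging a few heart-rate values into that inferred formula (a standalone sketch, not part of the original answer):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for hr in (80, 90, 91, 92, 100):
    print(hr, sigmoid((hr - 91) * 1))   # well below 91 -> close to 0, well above 91 -> close to 1, soft near the boundary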
Basically, we apply the formula wx + b, where x is our input data, and we learn the values for w and b. Initially the values are all random, so getting the b value from 1030131190 (a random value) to, say, 98 is fast: since the loss is large, the learning rate allows the values to jump quickly. But once you reach 98, the loss is already decreasing, and with the same learning rate it takes more time to get closer to 91, hence the slow decrease in loss. As the values get closer, the steps taken become even smaller.
This can be confirmed from the loss values: they are constantly decreasing, but initially the decrease per epoch is larger and then it becomes smaller. Your network is still learning, but slowly.
Hence in deep learning you use a technique called a stepped learning rate, where you decrease the learning rate as the number of epochs increases so that learning is faster.
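In PyTorch, a stepped schedule might look like the following sketch (model, criterion, x_data and y_data are assumed from the earlier examples):

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.5)

for epoch in range(10000):
    optimizer.zero_grad()
    loss = criterion(model(x_data), y_data)
    loss.backward()
    optimizer.step()
    scheduler.step()   # halve the learning rate every 1000 epochs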
I am currently trying to vary the threshold of a Random Forest Classifier in order to plot a ROC curve. I was under the impression that the only way to do this for a Random Forest is through the use of the class_weight parameter. I have been able to do this successfully, increasing and decreasing precision, recall, true positive and false positive rates; however, I am not sure what I am actually doing. Currently I have the following:
rfc = RandomForestClassifier(n_jobs=-1, oob_score=True, n_estimators=50,max_depth=40,min_samples_split=100,min_samples_leaf=80, class_weight={0:.4, 1:.9})
What are the .4 and .9 actually referring to? I thought it meant 40% of the dataset is 0's and 90% is 1's; however, this obviously makes no sense (over 100%). What is it actually doing? Thanks!
Class weights typically do not need to be normalised to sum to 1 (it's only the ratio of the class weights that is important, so demanding that they sum to 1 would not actually be a restriction).
So setting the class weights to 0.4 and 0.9 is equivalent to assuming a split of class labels in the data of 0.4 / (0.4+0.9) to 0.9 / (0.4+0.9) [roughly ~30% belonging to class 0 and ~70% belonging to class 1].
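As a quick standalone check of that arithmetic:

w0, w1 = 0.4, 0.9
print(w0 / (w0 + w1), w1 / (w0 + w1))   # ~0.31 and ~0.69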
An alternative way to view differing class weights is as a way of more strongly penalising mistakes in one class compared to another, but still assuming balanced numbers of labelings in the data. In your example, it would be 9/4 times worse to misclassify a 1 as a 0 than it would be to misclassify a 0 as a 1.
The easiest (in my experience) way to vary the discrimination threshold of any of the scikit-learn classifiers is to use the predict_proba() function. Rather than returning a single output class, this returns the probabilities for membership in each class (concretely, it outputs the proportion of samples in the leaf nodes reached during classification, averaged over all trees in the random forest). Once you have these probabilities, it is easy to implement your own final classification step by comparing the probability for each class to some threshold which you can change.
probs = RF.predict_proba(X)  # output dimensions: [num_samples x num_classes]
for threshold in range(0, 100):
    threshold = threshold / 100.0
    classes = (probs > threshold).astype(int)
    # further analysis here as desired
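If the end goal is specifically a ROC curve, scikit-learn can also sweep the thresholds for you; a minimal sketch (y is assumed to hold the true binary labels):

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y, probs[:, 1])   # use P(class 1) from predict_proba as the score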
There are 350 samples for each of 50 letters. The neural network has 3 layers: an input layer of 400 (20*20 images), a hidden layer of 200 and an output layer of 50. The training parameters I've used are:
max_steps = 1000
max_err = 0.000001
condition = cv2.TERM_CRITERIA_COUNT | cv2.TERM_CRITERIA_EPS
criteria = (condition, max_steps, max_err)
train_params = dict(term_crit = criteria,
                    train_method = cv2.ANN_MLP_TRAIN_PARAMS_BACKPROP,
                    bp_dw_scale = 0.1,
                    bp_moment_scale = 0.1)
What are the optimal values I can use for this situation?
I fear you'll have to choose them manually by trial & error.
These values depend on lots of factors and, as far as I know, there's no formula to compute them. When I start training a new ANN, I just run it over and over again changing these parameters slightly each time.
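A sketch of what that trial-and-error loop might look like (train_and_evaluate is a hypothetical helper that trains the cv2 ANN with the given backprop parameters and returns validation accuracy):

# 'train_and_evaluate' is a hypothetical wrapper around the cv2 ANN training above.
best_params, best_acc = None, 0.0
for bp_dw_scale in (0.05, 0.1, 0.2):
    for bp_moment_scale in (0.0, 0.1, 0.5):
        acc = train_and_evaluate(bp_dw_scale, bp_moment_scale)
        if acc > best_acc:
            best_params, best_acc = (bp_dw_scale, bp_moment_scale), acc
print(best_params, best_acc)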