Is this a logical bug or strange statistical behavior? - python

I have a very simple neural network classifier built using sklearn (see code below). The input is a window of the previous t time steps (currently 8) of a simple 1D time-series signal. The signal itself, as can be seen from the full code below, is a simple Brownian-type random walk, with a Gaussian (mean-zero) delta movement at each time step.
This is where the setup gets a little less standard, as I then create the output class labels of 0 and 1 completely randomly. The model is then trained using these random output labels, with standard params (Adam optimiser, etc.).
I have then built an accuracy metric which is similar to Positive predictive value (PPV), and is calculated thus: For each set of contiguous "predicted" 1s in the testing data (so ' .. 0 0 0 1 1 1 0 0 .. ' contains just one set), if the underlying time series signal increases during that period then this is a "hit" or a "true positive". The accuracy is then the proportion of these "1" or positive sets which are "hits".
Now, statistically this should be 50% as far as I am aware, assuming that the signal is random enough that a simple shallow NN couldn't detect the underlying pattern, and that the label assignment is random as well.
The crux of all of this is that I'm not getting 50%. I'm fairly consistently getting around 52-53%. This is done fairly archaically through a simple for loop which creates fresh signal/label sets each time and then averages the results over, say, 20 iterations (it only takes a few for the 53% trend to show itself).
My question is, does anyone know why the accuracy is not 50%?
Code:
import pandas as pd
import numpy as np
import random
from random import gauss
from sklearn.neural_network import MLPClassifier

list_of_acc = []
for k in range(10):
    ## create the data
    time_steps = 8
    data_size = 6000
    signal_value = 500
    time_series = []
    for t in range(data_size):
        delta = gauss(0, 0.5)
        signal_value = signal_value + delta
        time_series.append(signal_value)
    all_inputs = np.array([time_series[i - time_steps:i] for i in range(time_steps, len(time_series))])
    all_data = pd.DataFrame(np.array(all_inputs))
    # completely random 0/1 output labels in the last column
    all_data[len(all_data.columns)] = np.array([random.randint(0, 1) for i in range(len(all_data))])
    train_prop = 0.6
    test_final_index = int(train_prop * len(all_data))
    x_train = all_data.loc[:test_final_index, :len(all_data.columns) - 2]
    x_test = all_data.loc[test_final_index:, :len(all_data.columns) - 2]
    y_train = all_data.loc[:test_final_index, len(all_data.columns) - 1:]
    # create model
    number_classes = np.unique(y_train)
    model = MLPClassifier(random_state=0,
                          hidden_layer_sizes=(20,),
                          learning_rate_init=0.000025,  # 0.001 gave 0.55
                          momentum=0.85,
                          solver='adam',
                          batch_size=32,
                          max_iter=80)
    model.fit(x_train, y_train.values.ravel())
    ### now test fitted model for accuracy of +- movement
    tot_play = 0
    total_hits = 0
    in_play = False  # this will be True whenever pred == 1
    for i in range(test_final_index, len(all_data) - 1):
        pred = model.predict(np.array([all_data.loc[i, :len(all_data.columns) - 2]]))
        if in_play:
            if pred == 1:
                in_play = True  # continue to the next step, still in_play
            else:
                if all_data.loc[i, len(all_data.columns) - 2] > current_signal_value:
                    total_hits = total_hits + 1
                in_play = False
        else:
            if pred == 1:
                in_play = True
                tot_play = tot_play + 1
                current_signal_value = all_data.loc[i, len(all_data.columns) - 2]
    accuracy = total_hits / tot_play
    print(accuracy)
    list_of_acc.append(accuracy)
print(np.mean(list_of_acc))

You're confusing the input concept "random" with the result of "has no pattern".
Avoiding pattern usually requires (1) knowledge of the patterns that will be recognized, and (2) careful construction of data to counter each potential pattern with the opposing patterns.
Just as surely as the Central Limit Theorem says that we will get close to 50%, it also says that hitting exactly 50% becomes less likely as the quantities grow large.
When you generate random data, the sample deviates slightly from a totally balanced distribution. Your model is good enough to find some patterns in those deviations, and it learns to bias its predictions in favor of those imbalances.
A simple example is coin-tossing. If you toss a coin 100 times, you are likely to find that, rather than 50 heads and 50 tails, you will have a slightly off-centre split, such as 52 tails to 48 heads. Take this as your random input.
Train your model. It will learn that tails are more prevalent. A horridly simplistic model will simply predict "tails" on every toss, and achieve 52% accuracy. A more sophisticated model will find other ways to predict, but will still lean toward tails, and tend toward 52% accuracy.
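A quick way to see this effect, as a minimal sketch (not the question's code): generate random 0/1 labels and score the trivial "always predict the majority class" strategy. It lands noticeably above 50%:
import numpy as np

rng = np.random.default_rng(0)
accs = []
for _ in range(1000):
    labels = rng.integers(0, 2, size=100)     # 100 random coin tosses
    majority = int(labels.mean() >= 0.5)      # whichever class happens to be more frequent
    accs.append((labels == majority).mean())  # accuracy of always guessing that class
print(np.mean(accs))  # roughly 0.54 on average, not 0.50
A trained model that picks up on such sample imbalances behaves like a softened version of this majority-class predictor, which is consistent with the 52-53% observed in the question.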

Related

Support Vector Machine training: Is sklearn SGDClassifier.partial_fit able to train an SVM incrementally?

I am trying to train an SVM model through sklearn to apply as a binary classifier to get an audio's Ideal Binary Mask (IBM), applied after a neural network that I am developing for my graduation thesis. However, as shown in this graph, the accuracy never converges. The mean accuracy is always about 50%, no matter how many audio files are used, which amounts to random guessing considering we've got only two choices.
#SVM instance
import os
import time
from random import shuffle

import numpy as np
import soundfile as sf
from sklearn.linear_model import SGDClassifier

SVM = SGDClassifier(loss='hinge', penalty='l2', warm_start=True, shuffle=True)
#Start training
CLEAN_DATA_PATH = r"D:\clean_trainset_56spk_wav/"
NOISY_DATA_PATH = r"D:\noisy_trainset_56spk_wav/"
audio_files = os.listdir(CLEAN_DATA_PATH)
shuffle(audio_files)
count = 0
for filename in audio_files:
    if count == 1000:
        break
    start = time.time()
    count += 1
    Clean, Sr = sf.read(CLEAN_DATA_PATH + filename, dtype='float32')
    Noisy, Sr = sf.read(NOISY_DATA_PATH + filename, dtype='float32')
    print("Audio " + filename)
    Features, ibm = Extract_Features(Clean, Sr, Noisy)  # helper defined elsewhere in the project
    y = ibm.reshape(-1, 1)
    y = np.ravel(y)
    Features = sc.fit_transform(Features)  # sc: scaler defined elsewhere; see the answer below
    SVM.partial_fit(Features, y, classes=np.unique(y))
    end = time.time()
    print("Files training duration: " + str(round(end - start, 2)) + " seconds")
    print("Done: " + str(round((count / len(audio_files)) * 100, 2)) + "%")
As far as I know, SGDClassifier.partial_fit changes the weights in small batches, which would allow us to use different files as batches (since each audio file contains thousands of samples for classification). Is that right?
Thanks a lot!
At least one of your problems is that at every iteration, the samples are on a different scale, because you fit sc to every new batch.
for filename in audio_files:
    ...
    Features = sc.fit_transform(Features)
sc should be defined outside of the loop, and used as such:
Features = sc.transform(Features)
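A minimal sketch of that pattern, assuming sc is a scikit-learn StandardScaler and feature_batches is a hypothetical stand-in for iterating over the per-file feature arrays: first accumulate the global statistics with partial_fit, then apply one consistent transform.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
for features in feature_batches:    # pass 1: learn global mean/std incrementally
    sc.partial_fit(features)
for features in feature_batches:    # pass 2: apply the same scaling everywhere
    scaled = sc.transform(features)
    # SVM.partial_fit(scaled, y, classes=np.unique(y)) as in the question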

Large dataset with many weights causing an extremely slow training process with Tensorflow

I have a background in biology and am currently experimenting with and learning machine learning, in order to train a microarray dataset I have that consists of 140 cell lines with 54871 gene expressions for each cell line. Essentially, I have 140 rows, each comprised of 54871 columns representing a value that is the gene expression level of that cell line. Basically, a 140*54871 matrix. Within the 140 cell lines, I have labeled each row (cell line) as either group 1 or group 2, for my code to learn to discern and predict, if I were to input a 1*54871 matrix, which group it belongs to.
I have divided the dataset into two parts for training and testing. My question is: since I have 54871 weights, one for each gene expression, my training is extremely slow, as in every 1000 iterations my cost function (mean squared error) only goes from 0.3057 to 0.3047, and this takes around 2-3 minutes. Also, as the iterations increase you can see that it kind of plateaus, making it seem like it would take forever to train until the model reaches a cost of even 0.1. I left it overnight and woke up to an MSE value of 0.3014, when it began at 0.3103.
Is there anything I can do to speed up the training process? Or is there something I am doing wrong? Thanks!
This is my code, sorry if it is a little messy:
import pandas as pd
import tensorflow as tf
import numpy

# download csv data sheet of all cell lines
input_data = pd.read_csv(
    'C:/Users/lalalalalalala.csv',
    index_col=[0, 1],
    header=0,
    na_values='---')
matrix_data = input_data.as_matrix()

# user-defined cell lines of interest for supervised training
group1 = input(
    "Please enter cell lines that make up your cluster of interest with spaces in between (case sensitive):")
group_split1 = group1.split(sep=" ")

# assign label of each: input cluster = 1
# rest of cluster = 0
# extract data of input group
# split training and test set
# all these if/else statements handle the split when the input group1 is not an even number
split = len(group_split1)
g1_train = input_data.loc[:, group_split1[0:int(split / 2) if len(group_split1) % 2 == 0 else (int(split / 2) + 1)]]
g1_test = input_data.loc[:,
          group_split1[(int(split / 2) if len(group_split1) % 2 == 0 else (int(split / 2) + 1)):split]]
g2 = input_data.loc[:, [x for x in list(input_data) if x not in group_split1]]
split2 = g2.shape[1]
g2_train = g2.iloc[:, 0:int(split2 / 2) if len(group_split1) % 2 == 0 else (int(split2 / 2) + 1)]
g2_test = g2.iloc[:, (int(split2 / 2) if len(group_split1) % 2 == 0 else (int(split2 / 2) + 1)):split2]

# amplify the input data if the input data is too small:
amp1 = (int((g2_train.shape[1] - split) / int(split / 2))) if g2_train.shape[1] >= split else 1  # if g1 is smaller than g2, amplify
g1_train = pd.DataFrame(pd.np.tile(g1_train, (1, amp1)), index=g2_train.index)
amp2 = (int((g2_test.shape[1] - split) / int(split / 2))) if g2_test.shape[1] >= split else 1
g1_test = pd.DataFrame(pd.np.tile(g1_test, (1, amp2)), index=g2_test.index)
regroup_train = pd.concat([g1_train, g2_train], axis=1, join_axes=[g1_train.index])
regroup_train = numpy.transpose(regroup_train.as_matrix())
regroup_test = pd.concat([g1_test, g2_test], axis=1, join_axes=[g1_test.index])
regroup_test = numpy.transpose(regroup_test.as_matrix())

# create labels
split3 = g1_train.shape[1]
labels_train = numpy.zeros(shape=[len(regroup_train), 1])
labels_train[0:split3] = 1
split4 = g1_test.shape[1]
labels_test = numpy.zeros(shape=[len(regroup_test), 1])
labels_test[0:split4] = 1

# change all nan to 0
regroup_train = numpy.nan_to_num(regroup_train)
regroup_test = numpy.nan_to_num(regroup_test)
labels_train = numpy.nan_to_num(labels_train)
labels_test = numpy.nan_to_num(labels_test)

#######################################################################################################################
#####################################################NEURAL NETWORK####################################################
#######################################################################################################################
# define variables
trainingtimes = 1000

# create model (input width must match the 54871 gene-expression features)
x = tf.placeholder(tf.float32, [None, 54871])
w = tf.Variable(tf.zeros([54871, 1]))
b = tf.Variable(tf.zeros([1]))
# define the model (logistic/sigmoid output) and loss function
y = tf.nn.sigmoid(tf.matmul(x, w) + b)
# define correct training group
ytt = tf.placeholder(tf.float32, [None, 1])
# define optimizer and cost function
mse = tf.reduce_mean(tf.losses.mean_squared_error(labels=ytt, predictions=y))
# train step
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.3).minimize(mse)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for i in range(trainingtimes):
    sess.run(train_step, feed_dict={x: regroup_train, ytt: labels_train})
    if i % 100 == 0:
        print(sess.run(mse, feed_dict={x: regroup_train, ytt: labels_train}))
A few key issues here. You're trying to define a 1-layer neural network, which sounds good for this problem. But your hidden layer is much larger than it should be. Experiment with smaller weight sizes. Try 128, 256, 512, numbers like this (powers of two are not required).
Also, your input dimensionality is quite high. I know someone working on a very similar gene expression problem for cancer with something like 60,000 gene expressions and 10,000 samples. She has used PCA to reduce the dimensionality of the data while maintaining ~90% of the variance (she experimented with different values and found this about optimal).
That improved the results. Neural networks can overfit, so the PCA dimensionality reduction was beneficial. The 1-layer fully connected network also outperformed logistic regression and XGBoost in her experiments.
A couple of other things that she's working on with this problem, which may also apply to you:
Multi-task learning proved to improve the results. She originally had 4 different neural networks (4 outputs given the same data); when she combined them into 1 neural network with 4 loss functions, it improved the results of all 4.
Instead of PCA you can use auto-encoders as an alternative dimensionality reduction technique. It's entirely possible to connect an auto-encoder to this network and train it in conjunction with a loss function. I haven't actually experimented with this (yet) though, so I can only say that I expect it to improve the results in theory. The PCA approach will be quicker to test so I'd start there.
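A minimal sketch of the PCA step described above, using the variable names from the question (regroup_train/regroup_test): scikit-learn's PCA accepts a float n_components and keeps however many components are needed to retain that fraction of the variance.
from sklearn.decomposition import PCA

# keep as many components as needed to retain ~90% of the variance;
# scikit-learn accepts a float n_components for exactly this
pca = PCA(n_components=0.90)
train_reduced = pca.fit_transform(regroup_train)  # fit the projection on training data only
test_reduced = pca.transform(regroup_test)        # reuse the same projection for test data
print(train_reduced.shape[1], "components retained")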

Keras variable length input for regression

Hello everyone!
I am trying to develop a neural network using Keras and TensorFlow which should be able to take variable-length arrays as input and give either some single value (see the toy example below) or classify them (that is a problem for later and will not be touched in this question).
The idea is fairly simple.
We have variable length arrays. I am currently using very simple toy data, which is generated by the following code:
import numpy as np
import pandas as pd
from keras import models as kem
from keras import activations as kea
from keras import layers as kel
from keras import regularizers as ker
from keras import optimizers as keo
from keras import losses as kelo
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

n = 100
x = pd.DataFrame(columns=['data', 'res'])
mms = MinMaxScaler(feature_range=(-1, 1))
for i in range(n):
    k = np.random.randint(20, 100)
    ss = np.random.randint(0, 100, size=k)
    idres = np.sum(ss[np.arange(0, k, 2)]) - np.sum(ss[np.arange(1, k, 2)])
    x.loc[i, 'data'] = ss
    x.loc[i, 'res'] = idres
# the scaler expects a 2D array, so reshape and flatten back
x.res = mms.fit_transform(x.res.values.reshape(-1, 1)).ravel()
x_train, x_test, y_train, y_test = train_test_split(x.data, x.res, test_size=0.2)
x_train = sliding_window(x_train.as_matrix(), 2, 2)  # helper not shown; see the sketch below
x_test = sliding_window(x_test.as_matrix(), 2, 2)
To put it simply, I generate arrays with random length, and the result (output) for each array is the sum of the even elements minus the sum of the odd elements. Obviously, it can be negative or positive. The output is then scaled to the range [-1,1] to fit with the tanh activation function.
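The sliding_window helper is not shown in the question; a plausible minimal sketch, assuming it splits each 1D array into length-window chunks taken every step elements (so the LSTM sees variable-length sequences of 2-element vectors), would be:
import numpy as np

def sliding_window(arrays, window, step):
    # split each 1-D array into consecutive chunks of length `window`,
    # advancing `step` elements each time (non-overlapping when step == window)
    return np.array([np.array([a[i:i + window]
                               for i in range(0, len(a) - window + 1, step)])
                     for a in arrays], dtype=object)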
The Sequential model is generated as follows:
model = kem.Sequential()
model.add(kel.LSTM(20, return_sequences=False, input_shape=(None, 2), recurrent_activation='tanh'))
model.add(kel.Dense(20, activation='tanh'))
model.add(kel.Dense(10, activation='tanh'))
model.add(kel.Dense(5, activation='tanh'))
model.add(kel.Dense(1, activation='tanh'))
sgd = keo.SGD(lr=0.1)
mseloss = kelo.mean_squared_error
model.compile(optimizer=sgd, loss=mseloss, metrics=['accuracy'])
And the training of the model is done in the following way:
def calcMSE(model, x_test, y_test):
    nTest = len(x_test)
    total = 0
    for i in range(nTest):
        restest = model.predict(np.reshape(x_test[i], (1, -1, 2)))
        total += (restest - y_test[0, i]) ** 2
    return total / nTest

i = 1
mse = calcMSE(model, x_test, np.reshape(y_test.values, (1, -1)))
lrPar = 0
lrSteps = 30
while mse > 0.04:
    print("Epoch %i" % (i))
    print(mse)
    for j in range(len(x_train)):
        ntrain = j
        model.train_on_batch(np.reshape(x_train[ntrain], (1, -1, 2)),
                             np.reshape(y_train.values[ntrain], (-1, 1)))
    i += 1
    mse = calcMSE(model, x_test, np.reshape(y_test.values, (1, -1)))
The problem is that the optimiser usually gets stuck around MSE=0.05 (on the test set). The last time I tested, it actually got stuck around MSE=0.12 (on test data).
Moreover, if you look at what the model gives on the test data (left column) in comparison with the correct output (right column):
[[-0.11888303]] 0.574923547401
[[-0.17038491]] -0.452599388379
[[-0.20098214]] 0.065749235474
[[-0.22307695]] -0.437308868502
[[-0.2218809]] 0.371559633028
[[-0.2218741]] 0.039755351682
[[-0.22247596]] -0.434250764526
[[-0.17094387]] -0.151376146789
[[-0.17089397]] -0.175840978593
[[-0.16988073]] 0.025993883792
[[-0.16984619]] -0.117737003058
[[-0.17087571]] -0.515290519878
[[-0.21933308]] -0.366972477064
[[-0.09379648]] -0.178899082569
[[-0.17016701]] -0.333333333333
[[-0.17022927]] -0.195718654434
[[-0.11681376]] 0.452599388379
[[-0.21438009]] 0.224770642202
[[-0.12475857]] 0.151376146789
[[-0.2225963]] -0.380733944954
And on training set the same is:
[[-0.22209576]] -0.00764525993884
[[-0.17096499]] -0.247706422018
[[-0.22228305]] 0.276758409786
[[-0.16986915]] 0.340978593272
[[-0.16994311]] -0.233944954128
[[-0.22131597]] -0.345565749235
[[-0.17088912]] -0.145259938838
[[-0.22250554]] -0.792048929664
[[-0.17097935]] 0.119266055046
[[-0.17087702]] -0.2874617737
[[-0.1167363]] -0.0045871559633
[[-0.08695849]] 0.159021406728
[[-0.17082921]] 0.374617737003
[[-0.15422876]] -0.110091743119
[[-0.22185338]] -0.7125382263
[[-0.17069265]] -0.678899082569
[[-0.16963181]] -0.00611620795107
[[-0.17089556]] -0.249235474006
[[-0.17073657]] -0.414373088685
[[-0.17089497]] -0.351681957187
[[-0.17138508]] -0.0917431192661
[[-0.22351067]] 0.11620795107
[[-0.17079701]] -0.0795107033639
[[-0.22246087]] 0.22629969419
[[-0.17044055]] 1.0
[[-0.17090379]] -0.0902140672783
[[-0.23420531]] -0.0366972477064
[[-0.2155242]] 0.0366972477064
[[-0.22192241]] -0.675840978593
[[-0.22220723]] -0.354740061162
[[-0.1671907]] -0.10244648318
[[-0.22705412]] 0.0443425076453
[[-0.22943887]] -0.249235474006
[[-0.21681401]] 0.065749235474
[[-0.12495813]] 0.466360856269
[[-0.17085686]] 0.316513761468
[[-0.17092516]] 0.0275229357798
[[-0.17277785]] -0.325688073394
[[-0.22193027]] 0.139143730887
[[-0.17088208]] 0.422018348624
[[-0.17093034]] -0.0886850152905
[[-0.17091317]] -0.464831804281
[[-0.22241674]] -0.707951070336
[[-0.1735626]] -0.337920489297
[[-0.16984227]] 0.00764525993884
[[-0.16756304]] 0.515290519878
[[-0.22193302]] -0.414373088685
[[-0.22419722]] -0.351681957187
[[-0.11561158]] 0.17125382263
[[-0.16640976]] -0.321100917431
[[-0.21557514]] -0.313455657492
[[-0.22241823]] -0.117737003058
[[-0.22165506]] -0.646788990826
[[-0.22238114]] -0.261467889908
[[-0.1709189]] 0.0902140672783
[[-0.17698884]] -0.626911314985
[[-0.16984172]] 0.587155963303
[[-0.22226149]] -0.590214067278
[[-0.16950315]] -0.469418960245
[[-0.22180589]] -0.133027522936
[[-0.2224243]] -1.0
[[-0.22236891]] 0.152905198777
[[-0.17089345]] 0.435779816514
[[-0.17422611]] -0.233944954128
[[-0.17177556]] -0.324159021407
[[-0.21572633]] -0.347094801223
[[-0.21509495]] -0.646788990826
[[-0.17086846]] -0.34250764526
[[-0.17595944]] -0.496941896024
[[-0.16803505]] -0.382262996942
[[-0.16983894]] -0.348623853211
[[-0.17078683]] 0.363914373089
[[-0.21560851]] -0.186544342508
[[-0.22416025]] -0.374617737003
[[-0.1723443]] -0.186544342508
[[-0.16319042]] -0.0122324159021
[[-0.18837349]] -0.181957186544
[[-0.17371364]] -0.539755351682
[[-0.22232121]] -0.529051987768
[[-0.22187822]] -0.149847094801
As you can see, the model outputs are actually all quite close to each other, unlike the training set, where the variability is much bigger (although, I should admit, negative values are dominant in both the training and test sets).
What am I doing wrong here? Why does training get stuck, or is this a normal process that I should leave running for much longer? (I ran several hundred epochs a couple of times and it still stayed stuck.) I also tried to use a variable learning rate, for example cosine annealing with restarts, as in I. Loshchilov and F. Hutter, "SGDR: Stochastic Gradient Descent with Warm Restarts", arXiv preprint arXiv:1608.03983, 2016.
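For reference, the SGDR schedule from that paper anneals the learning rate with a cosine within each restart period; a minimal sketch, with lr_min/lr_max as placeholder bounds:
import numpy as np

def sgdr_lr(t_cur, T_i, lr_min=0.001, lr_max=0.1):
    # cosine-anneal from lr_max down to lr_min as t_cur runs from 0 to T_i;
    # at each warm restart, t_cur is reset to 0
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * t_cur / T_i))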
I would appreciate any suggestions, both on the network structure and training approach and on the coding/detail side.
Thank you very much in advance for help.

Numpy randint not really random?

The situation: I have a big dataset with more than 18 million examples.
I train several models and want to track the accuracy.
When forwarding all examples and computing the accuracy, it is approximately 83 percent. But this takes a long time.
So I try to sample a small subset of the whole dataset and compute the accuracy on that. I expect to see approximately the same number (around 80 percent):
total = 4096
N = dataset.shape[0]
indices = np.random.randint(N-1, size=total)
batch = dataset[indices,:]
However, now the output looks like this, when running it for 10 'random' batches:
satisfied 4096/4096
1.0
satisfied 4095/4096
0.999755859375
satisfied 4095/4096
0.999755859375
satisfied 4094/4096
0.99951171875
satisfied 4095/4096
0.999755859375
satisfied 4095/4096
0.999755859375
satisfied 4094/4096
0.99951171875
satisfied 4096/4096
1.0
satisfied 4095/4096
0.999755859375
satisfied 4096/4096
1.0
So here it always performs way too well and seems to sample almost exclusively from the 80 percent of good examples. What can I do to make it really random, so that it gives a good view of the accuracy?
This also makes the training go wrong, because for the next training batch only the good examples are sampled.
EDIT: so this is not about the training itself! I have a trained model with 83 percent accuracy. I use this model only for testing accuracy. When testing accuracy on small subsets, it always gives 99 or 100 percent, even over 100 random batches.
Edit:
And this is the code I generate the output with, which gets 99 or 100 percent:
def constraints_satisfied_v3(sess, model, dataset, pointclouds, instructions, trajectories, distances, is_training=0):
    satisfied = 0
    total = 4096
    # Pick random examples
    N = dataset.shape[0]
    indices = np.random.randint(N-1, size=total)
    batch = dataset[indices,:]
    # pdb.set_trace()
    # Fill a feed dictionary with the actual set of images and labels
    feed_dict = {model.input_pointcloud: pointclouds[batch[:,0],:],
                 model.input_language: instructions[batch[:,1],:],
                 model.input_traj: trajectories[batch[:,2],:],
                 model.input_traj_mv: trajectories[batch[:,3],:],
                 model.distances: distances[batch[:,2], batch[:,3]],
                 model.is_training: is_training}
    loss_value, emb_pl, emb_t, emb_t_mv, sim_mv, sim = sess.run(
        [model.loss, model.embeddings_pl, model.embeddings_t, model.embeddings_t_mv, model.sim_mv, model.sim],
        feed_dict=feed_dict)
    result = np.greater_equal(sim, distances[batch[:,2], batch[:,3]] + sim_mv)
    satisfied = satisfied + np.sum(result)
    print 'satisfied %d/%d' % (satisfied, total)
    percentage = float(satisfied)/float(total)
    #pdb.set_trace()
    return percentage
Edit: Okay, you have a point. When training batches are sampled the same way, the model is only trained on that data, and that is why it is doing almost perfectly on that data. But the issue remains: how to sample from the whole dataset.
So this is the version that gets 83 percent accuracy:
def constraints_satisfied_v2(sess, model, dataset, pointclouds, instructions, trajectories, distances, is_training=0):
    satisfied = 0
    total = 0
    N = dataset.shape[0]
    #indices = np.random.randint(N-1, size=int(total))
    #batch = dataset[indices,:]
    i = 10000
    while i < N:
        indices = np.arange(i-10000, i)
        if i+10000 < N:
            i = i+10000
        else:
            i = N
        batch = dataset[indices,:]
        # Fill a feed dictionary with the actual set of images and labels
        feed_dict = {model.input_pointcloud: pointclouds[batch[:,0],:],
                     model.input_language: instructions[batch[:,1],:],
                     model.input_traj: trajectories[batch[:,2],:],
                     model.input_traj_mv: trajectories[batch[:,3],:],
                     model.distances: distances[batch[:,2], batch[:,3]],
                     model.is_training: is_training}
        loss_value, emb_pl, emb_t, emb_t_mv, sim_mv, sim = sess.run(
            [model.loss, model.embeddings_pl, model.embeddings_t, model.embeddings_t_mv, model.sim_mv, model.sim],
            feed_dict=feed_dict)
        result = np.greater_equal(sim, distances[batch[:,2], batch[:,3]] + sim_mv)
        satisfied = satisfied + np.sum(result)
        total = total + batch.shape[0]
    print 'satisfied %d/%d' % (satisfied, total)
    percentage = float(satisfied)/float(total)
    return percentage
Edit: it seems the difference between constraints_satisfied_v2 and constraints_satisfied_v3 has to do with the use of batch normalization. In v3, random samples are picked whose statistics correspond to the training mean and std, hence the high performance. In v2, the data is not in a random order, which makes the batch mean and std unrepresentative.
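One way to sample from the whole dataset in random order without replacement (a minimal sketch; N, batch_size, and dataset stand in for the values used above) is to shuffle all indices once with np.random.permutation and walk through them in chunks, so every example is evaluated exactly once:
import numpy as np

perm = np.random.permutation(N)          # one random ordering of all N examples
for start in range(0, N, batch_size):
    indices = perm[start:start + batch_size]
    batch = dataset[indices, :]
    # evaluate this batch as in constraints_satisfied_v2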
This has nothing to do with randomness, to my mind, and NumPy's random is perfectly fine.
In this answer I'm assuming you're training some sort of a neural network, abbreviated as ANN (Artificial Neural Network) here.
The problem is, it's almost always easier to 'understand' common patterns in a small batch of objects than in a large one. That's how our brain works and that's how ANNs work as well. Although it doesn't mean that what your ANN has 'understood' from a smaller batch applies to any other collection of objects of the same nature.
For example, you can teach the computer to distinguish between cats and dogs in photos using 6 photos of cats and 7 of dogs. Now, when you give it a photo of a cat it has never seen before, it may fail to tell anything specific because it has not 'seen' enough cats and dogs; it was unable to generalize.
They can also simply memorize the data and give you fascinating accuracy on the training set but fail horribly on test data.
So, this is a very important and difficult problem of ANNs, which you may try to solve... by increasing the number of objects in a set, thus allowing the computer to generalize.

RFE giving same accuracy for different number of features selected

In the program, I am scanning a number of brain samples taken as a time series of 40 x 64 x 64 images, one every 2.5 seconds. The number of 'voxels' (3D pixels) in each image is thus ~164,000 (40 * 64 * 64), each of which is a 'feature' for an image sample.
I thought of using Recursive Feature Elimination (RFE), and then following this up with Principal Component Analysis (PCA) to perform dimensionality reduction, because of the ridiculously high n.
There are 9 classes to predict; thus, a multi-class classification problem. Starting with RFE:
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

estimator = SVC(kernel='linear')
rfe = RFE(estimator, n_features_to_select=20000, step=0.05)
rfe = rfe.fit(X_train, y_train)
X_best = rfe.transform(X_train)
Now perform PCA:
import numpy as np
from numpy.linalg import svd
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

X_best = scale(X_best)

def get_optimal_number_of_components():
    cov = np.dot(X_best, X_best.transpose()) / float(X_best.shape[0])
    U, s, v = svd(cov)
    print 'Shape of S = ', s.shape
    S_nn = sum(s)
    for num_components in range(0, s.shape[0]):
        temp_s = s[0:num_components]
        S_ii = sum(temp_s)
        if (1 - S_ii / float(S_nn)) <= 0.01:
            return num_components
    return s.shape[0]

n_comp = get_optimal_number_of_components()
print 'optimal number of components = ', n_comp
pca = PCA(n_components=n_comp)
pca = pca.fit(X_best)
X_pca_reduced = pca.transform(X_best)
Train on the reduced-component dataset with an SVM:
svm = SVC(kernel='linear',C=1,gamma=0.0001)
svm = svm.fit(X_pca_reduced,y_train)
Now transform the test set with the RFE-PCA reduction and make the predictions:
X_test = scale(X_test)
X_rfe = rfe.transform(X_test)
X_pca = pca.transform(X_rfe)
predictions = svm.predict(X_pca)
print 'predictions = ',predictions
print 'actual = ',y_test
I trained it on a subset of my data and got 76.92%. I'm not too worried about the low number, because it was trained on only 1/12 of my dataset.
I tried doubling the training size and got 92% accuracy, which is pretty good. But then I trained on the entire dataset and saw an accuracy of 92.5%.
So I got a 0.5% increase in accuracy for a 6-fold increase in dataset size. Furthermore, the data samples aren't noisy, so nothing is wrong with the samples.
Also, for 1/12th of the dataset as training size, I get the same 76.92% when I choose n_features_to_select = 1000 while performing RFE (and the same for 20000!). There must be something wrong here. Why do I get the same performance when selecting so few features?
