RFE giving same accuracy for different number of features selected - python

In the program, I am scanning a number of brain samples taken as a time series of 40 x 64 x 64 images, one every 2.5 seconds. Each image therefore has ~164,000 'voxels' (3D pixels; 40 * 64 * 64), each of which is a 'feature' of an image sample.
I thought of using Recursive Feature Elimination (RFE), followed by Principal Component Analysis (PCA) for dimensionality reduction, because the number of features is ridiculously high.
There are 9 classes to predict, so this is a multi-class classification problem. Starting with RFE:
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

estimator = SVC(kernel='linear')
rfe = RFE(estimator, n_features_to_select=20000, step=0.05)
rfe = rfe.fit(X_train, y_train)
X_best = rfe.transform(X_train)
Now perform PCA:
import numpy as np
from numpy.linalg import svd
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

X_best = scale(X_best)

def get_optimal_number_of_components():
    # retain enough components that the discarded variance is <= 1%
    cov = np.dot(X_best, X_best.transpose()) / float(X_best.shape[0])
    U, s, v = svd(cov)
    print('Shape of S = ', s.shape)
    S_nn = sum(s)
    for num_components in range(0, s.shape[0]):
        temp_s = s[0:num_components]
        S_ii = sum(temp_s)
        if (1 - S_ii / float(S_nn)) <= 0.01:
            return num_components
    return s.shape[0]

n_comp = get_optimal_number_of_components()
print('optimal number of components = ', n_comp)
pca = PCA(n_components=n_comp)
pca = pca.fit(X_best)
X_pca_reduced = pca.transform(X_best)
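(For reference, a minimal alternative to the function above, using scikit-learn's explained variance ratios; this is a sketch of the same 99%-variance rule, not the code I actually ran.)
import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(X_best)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_comp_alt = int(np.searchsorted(cumulative, 0.99) + 1)  # smallest count reaching 99%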
Train an SVM on the reduced-component dataset:
svm = SVC(kernel='linear', C=1, gamma=0.0001)
svm = svm.fit(X_pca_reduced, y_train)
Now apply the same RFE and PCA transformations to the test set and make predictions:
X_test = scale(X_test)
X_rfe = rfe.transform(X_test)
X_pca = pca.transform(X_rfe)
predictions = svm.predict(X_pca)
print('predictions = ', predictions)
print('actual = ', y_test)
I trained this on a subset of my data and got 76.92%. I'm not too worried about that number being low, because it was trained on only 1/12 of my dataset.
I tried doubling the training size and got 92% accuracy, which is pretty good. But when I then trained on the entire dataset, the accuracy was 92.5%.
So I got a 0.5% increase in accuracy for a 6x increase in training data. Furthermore, the data samples aren't noisy, so nothing is wrong with the samples themselves.
Also, for 1/12 of the dataset as training size, I get the same 76.92% when I choose n_features_to_select = 1000 in RFE (the same as for 20000!). Something must be wrong here. Why do I get the same performance when selecting so few features?
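One check I could do (a sketch of the idea, not something I have run) is to evaluate the whole RFE -> PCA -> SVM chain under cross-validation for each value of n_features_to_select, so that the accuracies being compared come from held-out folds rather than a single split:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

for n_feat in (1000, 20000):
    pipe = Pipeline([
        ('rfe', RFE(SVC(kernel='linear'), n_features_to_select=n_feat, step=0.05)),
        ('scale', StandardScaler()),
        ('pca', PCA(n_components=0.99)),   # keep 99% of the variance
        ('svm', SVC(kernel='linear', C=1)),
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print(n_feat, scores.mean())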

Related

PCA: Why is my validation data's SPE smaller than my training data's SPE?

My understanding is that SPE is the reconstruction error when using PCA (principal component analysis). Therefore, when I obtain the loading matrix from the training data and use it to calculate the SPE for both the training data and the validation data, the validation SPE should in general be bigger than, or close to, the training SPE. However, in my results the validation SPE is sometimes smaller than the training SPE.
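To be explicit about what I mean by SPE: if $\hat{x}_i$ is the reconstruction of sample $x_i$ from the retained components, then
$$\mathrm{SPE}_i = \lVert x_i - \hat{x}_i \rVert^2 = \sum_j \left(x_{ij} - \hat{x}_{ij}\right)^2,$$
which is what the last two lines of the code below compute row by row.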
Below is my code in Python. The PCA class is from sklearn. train_x and valid_x are datasets standardized by the mean and standard deviation of train_x. P50_cols is the total number of columns.
Is there anything wrong with my code?
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=P50_cols)
pca.fit(train_x)
sumratio = 0
k = 0
for k, r in enumerate(pca.explained_variance_ratio_):
    sumratio += r
    if sumratio > 0.9:
        break
num_component = k + 1

pca = PCA(n_components=num_component)
pca.fit(train_x)
train_x_restore = pca.inverse_transform(pca.transform(train_x))
valid_x_restore = pca.inverse_transform(pca.transform(valid_x))
spe_train = np.sum((train_x_restore - train_x.values)**2, 1)
spe_valid = np.sum((valid_x_restore - valid_x.values)**2, 1)

Is this a logical bug or strange statistical behavior?

I have a very simple neural network classifier built using sklearn (see code below). The input is t time windows (currently 8) of a simple 1D time-series signal. The signal itself, as can be seen from the full code below, is a simple Brownian-type random walk, with a Gaussian (zero-mean) delta movement at each time step.
This is where the setup gets a little less standard: I then create the output class labels of 0 and 1 completely randomly. The model is then trained on these random output labels, using standard params (adam optimiser etc.).
I have then built an accuracy metric which is similar to positive predictive value (PPV), calculated as follows: for each set of contiguous "predicted" 1s in the testing data (so '.. 0 0 0 1 1 1 0 0 ..' contains just one such set), if the underlying time-series signal increases during that period then this is a "hit" or a "true positive". The accuracy is then the proportion of these "1" (positive) sets which are "hits".
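In code terms, a compact paraphrase of this metric (not the exact loop I use in the full code below) would be:
import numpy as np

def run_hit_rate(preds, signal):
    # preds: 0/1 predictions per step; signal: the signal value at each step
    hits, runs = 0, 0
    i = 0
    while i < len(preds):
        if preds[i] == 1:
            start = i
            while i < len(preds) and preds[i] == 1:
                i += 1
            runs += 1
            if signal[min(i, len(signal) - 1)] > signal[start]:
                hits += 1
        else:
            i += 1
    return hits / runs if runs else float('nan')

# one run of three 1s, during which the signal rises -> hit rate 1.0
print(run_hit_rate(np.array([0, 0, 1, 1, 1, 0, 0]), np.array([5, 5, 5, 6, 7, 8, 8])))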
Now, statistically this should be 50% as far as I am aware, assuming that the signal is random enough that a simple shallow NN couldn't detect an underlying pattern, and that the label assignment is random as well.
The crux of all of this is that I'm not getting 50%. I'm fairly consistently getting around 52-53%. This is done fairly archaically through a simple for loop which creates fresh signal/label sets each time and then averages the results over, say, 20 iterations (it only takes a few for the 53% trend to show itself).
My question is: does anyone know why the accuracy is not 50%?
Code:
import random
import pandas as PD
import autograd.numpy as np
from random import gauss
from sklearn.neural_network import MLPClassifier

list_of_acc = []
for k in range(10):
    ## create the data
    time_steps = 8
    data_size = 6000
    signal_value = 500
    time_series = []
    for k in range(data_size):
        delta = gauss(0, 0.5)
        signal_value = signal_value + delta
        time_series.append(signal_value)
    all_inputs = np.array([time_series[i-time_steps:i] for i in range(time_steps, len(time_series))])
    all_data = PD.DataFrame(np.array(all_inputs))
    # completely random 0/1 labels in the last column
    all_data[len(all_data.columns)] = np.array([random.randint(0, 1) for i in range(len(all_data))])
    train_prop = 0.6
    test_final_index = int(train_prop*len(all_data))
    x_train = all_data.loc[:test_final_index, :len(all_data.columns) - 2]
    x_test = all_data.loc[test_final_index:, :len(all_data.columns) - 2]
    y_train = all_data.loc[:test_final_index, len(all_data.columns) - 1:]
    # create model
    number_classes = np.unique(y_train)
    model = MLPClassifier(random_state=0,
                          hidden_layer_sizes=(20,),
                          learning_rate_init=0.000025,  # 0.001 gave 0.55
                          momentum=0.85,
                          solver='adam',
                          batch_size=32,
                          max_iter=80)
    model.fit(x_train, y_train)
    ### now test fitted model for accuracy of +- movement
    tot_play = 0
    total_hits = 0
    in_play = False  # this will be True whenever pred == 1
    for i in range(test_final_index, len(all_data)-1):
        pred = model.predict(np.array([all_data.loc[i, :len(all_data.columns) - 2]]))
        if in_play:
            if pred == 1:
                in_play = True  # continue to next t steps, still in_play
            else:
                if all_data.loc[i, len(all_data.columns) - 2] > current_signal_value:
                    total_hits = total_hits + 1
                in_play = False
        else:
            if pred == 1:
                in_play = True
                tot_play = tot_play + 1
                current_signal_value = all_data.loc[i, len(all_data.columns) - 2]
    accuracy = total_hits/tot_play
    print(accuracy)
    list_of_acc.append(accuracy)
print(np.mean(list_of_acc))
You're confusing the input concept "random" with the result of "has no pattern".
Avoiding a pattern usually requires (1) knowledge of the patterns that will be recognized, and (2) careful construction of the data to counter each potential pattern with an opposing pattern.
Just as surely as the law of large numbers says that we will get close to 50%, it also tells us that hitting exactly 50% becomes less and less likely as the quantities grow large.
When you generate random data, the sample deviates slightly from a perfectly balanced distribution. Your model is good enough to find some patterns in those deviations, and it learns to bias its predictions in favor of those imbalances.
A simple example is coin tossing. If you toss a coin 100 times, you are likely to find that, rather than 50 heads and 50 tails, you have a slightly off-center split, such as 52 tails to 48 heads. Take this as your random input.
Train your model. It will learn that tails are more prevalent. A horridly simplistic model will simply predict "tails" on every toss and achieve 52% accuracy. A more sophisticated model will find other ways to predict, but will still lean toward tails and tend toward 52% accuracy.
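A quick illustrative sketch of that coin-toss effect (not your model, just the majority-class bias measured on the same sample that produced it):
import numpy as np

rng = np.random.default_rng(0)
accs = []
for _ in range(1000):
    tosses = rng.integers(0, 2, size=100)     # 100 fair coin tosses
    majority = int(tosses.mean() >= 0.5)      # the "model" learns the prevalent side
    accs.append((tosses == majority).mean())  # accuracy measured on the same tosses
print(np.mean(accs))  # typically around 0.54, never below 0.5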

Support Vector Machine training: Is sklearn SGDClassifier.partial_fit able to train an SVM incrementally?

I am trying to train an SVM model with sklearn, to be used as a binary classifier that produces an audio's Ideal Binary Mask (IBM), applied after a neural network that I am developing for my graduation thesis. However, as shown in my accuracy plot (image not reproduced here), the accuracy never converges. The mean accuracy is always around 50%, no matter how many audio files are used, which is no better than random given that there are only two classes.
# SVM instance
import os
import time
import numpy as np
import soundfile as sf
from random import shuffle
from sklearn.linear_model import SGDClassifier

SVM = SGDClassifier(loss='hinge', penalty='l2', warm_start=True, shuffle=True)

# Start training
CLEAN_DATA_PATH = r"D:\clean_trainset_56spk_wav/"
NOISY_DATA_PATH = r"D:\noisy_trainset_56spk_wav/"
audio_files = os.listdir(CLEAN_DATA_PATH)
shuffle(audio_files)
count = 0
for filename in audio_files:
    if count == 1000:
        break
    start = time.time()
    count += 1
    Clean, Sr = sf.read(CLEAN_DATA_PATH + filename, dtype='float32')
    Noisy, Sr = sf.read(NOISY_DATA_PATH + filename, dtype='float32')
    print("Audio " + filename)
    Features, ibm = Extract_Features(Clean, Sr, Noisy)
    y = ibm.reshape(-1, 1)
    y = np.ravel(y)
    Features = sc.fit_transform(Features)  # Scale
    SVM.partial_fit(Features, y, classes=np.unique(y))
    end = time.time()
    print("Files training duration: " + str(round(end - start, 2)) + " seconds")
    print("Done: " + str(round((count / len(audio_files)) * 100, 2)) + "%")
As far as I know, SGDClassifier.partial_fit updates the weights from small batches, which would allow us to use different files as batches (since each audio file contains thousands of samples to classify). Is that right?
Thanks a lot!
At least one of your problems is that at every iteration the samples are on a different scale, because you fit sc to every new batch:
for filename in audio_files:
    ...
    Features = sc.fit_transform(Features)
sc should be defined and fitted outside of the loop, and only used for transforming inside it:
Features = sc.transform(Features)
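A minimal sketch of that fix, assuming the Extract_Features helper, file paths and SVM object from your question, and using StandardScaler as the scaler (an assumption; its partial_fit lets you learn the scaling incrementally in a first pass):
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# First pass: learn the feature scaling (here from the first 100 files only,
# as an arbitrary subsample; fitting on all files also works).
for filename in audio_files[:100]:
    Clean, Sr = sf.read(CLEAN_DATA_PATH + filename, dtype='float32')
    Noisy, Sr = sf.read(NOISY_DATA_PATH + filename, dtype='float32')
    Features, ibm = Extract_Features(Clean, Sr, Noisy)
    sc.partial_fit(Features)

# Second pass: train incrementally on consistently scaled features.
for filename in audio_files:
    Clean, Sr = sf.read(CLEAN_DATA_PATH + filename, dtype='float32')
    Noisy, Sr = sf.read(NOISY_DATA_PATH + filename, dtype='float32')
    Features, ibm = Extract_Features(Clean, Sr, Noisy)
    Features = sc.transform(Features)                        # transform only, no refitting
    SVM.partial_fit(Features, np.ravel(ibm), classes=[0, 1])  # IBM labels assumed to be 0/1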

How to balance a training set in Python?

I'm trying to apply a baseline model to my data set, but the data set is imbalanced and only 11% of the data belongs to the positive category. When I split the data without any resampling, the recall for positive records is very low. I want to balance the training data (0.5 negative, 0.5 positive) without balancing the test data. Does anyone know how to do that?
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

# splitting train and test data
train, test = train_test_split(coupon, test_size=0.3, random_state=100)
## separating dependent and independent variables
cols = [i for i in coupon.columns if i not in target_col]
train_X = train[cols]
train_Y = train[target_col]
test_X = test[cols]
test_Y = test[target_col]

# Function attributes
# dataframe      - processed dataframe
# algorithm      - algorithm used
# training_x     - predictor variables dataframe (training)
# testing_x      - predictor variables dataframe (testing)
# training_y     - target variable (training)
# testing_y      - target variable (testing)
# cf             - ["coefficients","features"] (coefficients for logistic
#                  regression, feature importances for tree-based models)
# threshold_plot - if True, returns threshold plot for model
def coupon_use_prediction(algorithm, training_x, testing_x,
                          training_y, testing_y, cols, cf, threshold_plot):
    # model
    algorithm.fit(training_x, training_y)
    predictions = algorithm.predict(testing_x)
    probabilities = algorithm.predict_proba(testing_x)
    # coeffs
    if cf == "coefficients":
        coefficients = pd.DataFrame(algorithm.coef_.ravel())
    elif cf == "features":
        coefficients = pd.DataFrame(algorithm.feature_importances_)
    column_df = pd.DataFrame(cols)
    coef_sumry = (pd.merge(coefficients, column_df, left_index=True,
                           right_index=True, how="left"))
    coef_sumry.columns = ["coefficients", "features"]
    coef_sumry = coef_sumry.sort_values(by="coefficients", ascending=False)
    print(algorithm)
    print("\n Classification report : \n", classification_report(testing_y, predictions))
    print("Accuracy Score : ", accuracy_score(testing_y, predictions))
You have two ways of balancing data: upsampling or downsampling.
Upsampling: duplicate the under-represented data.
Downsampling: subsample the over-represented data.
Upsampling is pretty easy to do by hand. For downsampling you can use sklearn.utils.resample and provide the number of samples you want to keep; a sketch of both, applied to the training split only, is given below.
Please note that, as #paritosh-singh mentioned, changing the distribution may not be the only solution. There are machine learning algorithms that can:
- support imbalanced data
- take the class distribution into account through a built-in weighting option (e.g. class_weight)
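For example, a sketch applied to your training split only (assuming target_col is the name of the binary target column, 1 being the positive class, and reusing train and cols from your code):
import pandas as pd
from sklearn.utils import resample

pos = train[train[target_col] == 1]
neg = train[train[target_col] == 0]

# Downsampling: shrink the majority (negative) class to the minority size...
neg_down = resample(neg, replace=False, n_samples=len(pos), random_state=100)
train_bal = pd.concat([pos, neg_down])

# ...or upsampling: duplicate minority (positive) rows up to the majority size.
# pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=100)
# train_bal = pd.concat([pos_up, neg])

train_X = train_bal[cols]
train_Y = train_bal[target_col]
# test_X / test_Y stay untouched, so evaluation reflects the real class distribution.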

Large dataset with many weights causing an extremely slow training process with Tensorflow

I have a background in biology and am currently experimenting with and learning machine learning, in order to train on a microarray dataset I have. It consists of 140 cell lines with 54,871 gene expression values for each cell line. Essentially, I have 140 rows, each comprising 54,871 columns whose values are the gene expression levels of that cell line: basically a 140 x 54,871 matrix. Within the 140 cell lines, I have labeled each row (cell line) as either group 1 or group 2, so that my code can learn to discern and, given a 1 x 54,871 input, predict which group it belongs to.
I have divided the dataset into two parts for training and testing. My question is this: since I have 54,871 weights, one per gene expression value, training is extremely slow. In every 1000 iterations, my cost function (mean squared error) only goes from 0.3057 to 0.3047, and this takes around 2-3 minutes. Also, as the iterations increase, it kind of plateaus, making it seem like it would take forever to train the model down to a cost of even 0.1. I left it overnight and woke up to an MSE of 0.3014, having started from 0.3103.
Is there anything I can do to speed up the training process, or is there something I am doing wrong? Thanks!
This is my code, sorry if it is a little messy:
import pandas as pd
import tensorflow as tf
import numpy

# download csv data sheet of all cell lines
input_data = pd.read_csv(
    'C:/Users/lalalalalalala.csv',
    index_col=[0, 1],
    header=0,
    na_values='---')
matrix_data = input_data.as_matrix()

# user define cell lines of interest for supervised training
group1 = input(
    "Please enter cell lines that makes up the your cluster of interest with spaces in between(case sensitive):")
group_split1 = group1.split(sep=" ")

# assign label of each: input cluster = 1
# rest of cluster = 0
# extract data of input group
# split training and test set
# all these if else statement represents split when the input group1 is not a even number
split = len(group_split1)
g1_train = input_data.loc[:, group_split1[0:int(split / 2) if len(group_split1) % 2 == 0 else (int(split / 2) + 1)]]
g1_test = input_data.loc[:, group_split1[(int(split / 2) if len(group_split1) % 2 == 0 else (int(split / 2) + 1)):split]]
g2 = input_data.loc[:, [x for x in list(input_data) if x not in group_split1]]
split2 = g2.shape[1]
g2_train = g2.iloc[:, 0:int(split2 / 2) if len(group_split1) % 2 == 0 else (int(split2 / 2) + 1)]
g2_test = g2.iloc[:, (int(split2 / 2) if len(group_split1) % 2 == 0 else (int(split2 / 2) + 1)):split2]

# amplify the input data if the input data is too small:
amp1 = (int((g2_train.shape[1] - split) / int(split / 2))) if g2_train.shape[1] >= split else 1  # if g1 is less than g2 amplify
g1_train = pd.DataFrame(pd.np.tile(g1_train, (1, amp1)), index=g2_train.index)
amp2 = (int((g2_test.shape[1] - split) / int(split / 2))) if g2_test.shape[1] >= split else 1
g1_test = pd.DataFrame(pd.np.tile(g1_test, (1, amp2)), index=g2_test.index)

regroup_train = pd.concat([g1_train, g2_train], axis=1, join_axes=[g1_train.index])
regroup_train = numpy.transpose(regroup_train.as_matrix())
regroup_test = pd.concat([g1_test, g2_test], axis=1, join_axes=[g1_test.index])
regroup_test = numpy.transpose(regroup_test.as_matrix())

# create labels
split3 = g1_train.shape[1]
labels_train = numpy.zeros(shape=[len(regroup_train), 1])
labels_train[0:split3] = 1
split4 = g1_test.shape[1]
labels_test = numpy.zeros(shape=[len(regroup_test), 1])
labels_test[0:split4] = 1

# change all nan to 0
regroup_train = numpy.nan_to_num(regroup_train)
regroup_test = numpy.nan_to_num(regroup_test)
labels_train = numpy.nan_to_num(labels_train)
labels_test = numpy.nan_to_num(labels_test)

#######################################################################
# NEURAL NETWORK
#######################################################################
# define variables
trainingtimes = 1000

# create model
x = tf.placeholder(tf.float32, [None, 54781])
w = tf.Variable(tf.zeros([54781, 1]))
b = tf.Variable(tf.zeros([1]))
# define linear regression model, loss function
y = tf.nn.sigmoid((tf.matmul(x, w) + b))
# define correct training group
ytt = tf.placeholder(tf.float32, [None, 1])
# define cross optimizer and cost function
mse = tf.reduce_mean(tf.losses.mean_squared_error(y, ytt))
# train step
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.3).minimize(mse)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for i in range(trainingtimes):
    sess.run(train_step, feed_dict={x: regroup_train, ytt: labels_train})
    if i % 100 == 0:
        print(sess.run(mse, feed_dict={x: regroup_train, ytt: labels_train}))
A few key issues here. You're trying to define a 1-layer neural network, which sounds good for this problem. But your hidden layer is much larger than it should be. Experiment with smaller weight sizes. Try 128, 256, 512, numbers like this (powers of two are not required).
Also, your input dimensionality is quite high. I know someone working on a very similar gene expression problem for cancer with something like 60,000 gene expressions and 10,000 samples. She has used PCA to reduce the dimensionality of the data while maintaining ~90% of the variance (she experimented with different values and found this about optimal).
That improved the results; neural networks can overfit, so the PCA dimensionality reduction was beneficial. The 1-layer fully connected network also outperformed logistic regression and XGBoost in her experiments (a concrete sketch of the PCA approach is given at the end of this answer).
A couple of other things that she's working on with this problem, which may also apply to you:
Multi-task learning proved to improve the results. She originally had 4 different neural networks (4 outputs given the same data); when she combined them into 1 neural network with 4 loss functions, it improved the results of all 4.
Instead of PCA you can use auto-encoders as an alternative dimensionality-reduction technique. It's entirely possible to connect an auto-encoder to this network and train it in conjunction with a loss function. I haven't actually experimented with this (yet), though, so I can only say that I expect it to improve the results in theory. The PCA approach will be quicker to test, so I'd start there.
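To make the PCA suggestion concrete, here is a minimal sketch using scikit-learn rather than your TensorFlow graph, assuming your regroup_train / labels_train / regroup_test / labels_test arrays:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

scaler = StandardScaler().fit(regroup_train)
X_train = scaler.transform(regroup_train)
X_test = scaler.transform(regroup_test)

# keep enough components to explain ~90% of the variance
pca = PCA(n_components=0.90).fit(X_train)
X_train_red = pca.transform(X_train)
X_test_red = pca.transform(X_test)

# a small single-hidden-layer network on the reduced features
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
clf.fit(X_train_red, labels_train.ravel())
print(clf.score(X_test_red, labels_test.ravel()))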
