Random Forest Classifier Batch Learning Python Dimension Error - python

I have a large dataframe with around a million records and 19 features (+1 target variable). Since I was unable to train my RF classifier in one go due to a memory error (it is a multi-class classification with around 750 classes), I resorted to batch learning. The model trains fine, but when I run the model.predict command, it gives me the following ValueError:
ValueError: operands could not be broadcast together with shapes (231106,628) (231106,620) (231106,628).
My code is the following :
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Splitting into dependent and independent variables
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

# Train-test split
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=1234)

data_splits = zip(np.array_split(train_X, 6), np.array_split(train_y, 6))

rf_clf = RandomForestClassifier(warm_start=True, n_estimators=1, criterion='entropy', random_state=1234)
for i in range(10):  # 10 passes through the data
    for X, y in data_splits:
        rf_clf.fit(X, y)
        rf_clf.n_estimators += 1  # increment by one, so the next fit adds one tree

y_preds = rf_clf.predict(test_X)
I would be highly grateful for any help. Any other suggestions are also welcome.

Found the answer. This was happening because the target classes were not consistent across the data batches: with warm_start=True, every call to fit must see the same set of classes.
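For anyone hitting the same thing, here is a minimal sketch (assuming the train_X/train_y splits from the question) that verifies every batch contains the full set of classes before fitting with warm_start:
import numpy as np

# Sanity check (sketch): with warm_start=True every call to fit() must see the
# same set of classes, so verify that each batch contains them all before training.
all_classes = np.unique(train_y)
for X_batch, y_batch in zip(np.array_split(train_X, 6), np.array_split(train_y, 6)):
    missing = np.setdiff1d(all_classes, np.unique(y_batch))
    if missing.size:
        raise ValueError("a batch is missing %d classes; re-split the data so "
                         "every class appears in every batch" % missing.size)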

Related

Keras Validation Accuracy is Zero but other metrics are normal

I am working on a computer vision problem in Keras and have run into an interesting problem. My val_acc is 0.0000e+00. This is especially strange because my other metrics, such as loss, acc, and val_loss, are all behaving normally.
This started happening when I switched from the Sequence data_generator to a custom one that I'm pretty sure is working as intended. My issue is very similar to this one (validation accuracy is 0 with Keras fit_generator), but no answer was reached in that thread.
I have checked to make sure my activations and loss metrics are appropriate for my particular problem. I am using loss='categorical_crossentropy' and metrics=['accuracy'], and I am attempting to predict the month that a given spectrogram comes from. The validation data is loaded in exactly the same way as the training data, so I really can't figure out what is happening. Even random guessing should give a val_acc of about 1/12, right? It can't be zero.
Here is my model architecture:
x = (Convolution2D(32,5,5,activation='relu',input_shape=(501,501,1)))(input_img)
x = (MaxPooling2D(pool_size=(2,2)))(x)
x = (Convolution2D(32,5,5,activation='relu'))(x)
x = (MaxPooling2D(pool_size=(2,2)))(x)
x = (Dropout(0.25))(x)
x = (Flatten())(x)
x = (Dense(128,activation='relu'))(x)
x = (Dropout(0.5))(x)
classify = (Dense(12,activation='softmax', kernel_regularizer=regularizers.l1_l2(l1 = 0.001,l2 = 0.001)))(x)
model = Model(input_img,classify)
model.compile(loss='categorical_crossentropy',optimizer='nadam',metrics=['accuracy'])
and here is my call to fit_generator:
model.fit_generator(generator=pd.data_generator(folder, 'train'),
                    validation_data=pd.data_generator(folder, 'test'),
                    steps_per_epoch=120,
                    validation_steps=24,
                    nb_epoch=20,
                    verbose=1,
                    shuffle=True,
                    callbacks=[tensorboard_callback, early_stop_callback])
and finally here is the important part of my data generator:
if mode == 'test':
    print('test')
    while True:
        for things in up.unpickle_batch(folder, 50, 6000, 7200):  # the last 1200 things in batches of 50
            random.shuffle(things)
            test_spect = []
            test_months = []
            for thing in things:
                test_spect.append(thing.spect)       # get batch data
                test_months.append(thing.month - 1)  # months run 1-12 but must be 0-11 for to_categorical
            x_test = np.asarray(test_spect)          # prepare batch data
            x_test = x_test.astype('float32')
            x_test /= np.amax(x_test)  # - 0.5
            X_test = np.reshape(x_test, (-1, 501, 501, 1))
            Y_test = np_utils.to_categorical(test_months, 12)
            yield X_test, Y_test                     # return batch data
Check for bad data.
Make sure your data is what you think it is -- shuffled, distributed the same as your validation and/or test set, free of misleading/erroneous/contradictory samples. You can probably generate a failproof dataset (e.g. distinguish dark images from light ones, or sharp versus blurry) and prove that everything but the data is OK. If you can't, then look more closely at your code. This, however, sounds like a data problem.
I just fixed a similar problem in a simple 3-layer MLP network for which training loss & accuracy were heading in reasonable directions, validation loss was following training loss (but lagging) yet validation accuracy hovered at zero. There was an off-by-one error in my training dataset generation (a sampling script from a larger set) that meant that 1 sample in the entire block of samples for one type had the label for the next block for a different type. 499 correct samples out of 500 was insufficient to keep the training on track.
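One quick way to rule out a generator problem is to pull a single batch and inspect it by hand. A minimal sketch, assuming pd.data_generator and folder are as in the question:
# Quick sanity check (sketch): pull one batch from the generator and inspect
# shapes and label counts before handing it to fit_generator.
gen = pd.data_generator(folder, 'test')
X_batch, Y_batch = next(gen)
print(X_batch.shape, Y_batch.shape)          # expect (batch, 501, 501, 1) and (batch, 12)
print(Y_batch.sum(axis=0))                   # how many samples of each month are in this batch
assert X_batch.shape[0] == Y_batch.shape[0]  # inputs and labels must stay aligned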

Random Forest warm_start = True gives value error when running the scoring function - operands could not be broadcast together

I am implementing a random forest forecast as a baseline for my ML model. Since my X_train_split_xgb has shape (48195, 300), I need to do batch training (memory). To do that I set up the random forest with warm_start=True, but when I enable this I get an error on the rf.predict(X_train_split_xgb) line, namely: ValueError: operands could not be broadcast together with shapes (48195,210) (48195,187) (48195,210). If warm_start=False I do not get this error and the code runs through. Does anybody know why I get this ValueError and how to fix it? I have already tried lots of things. I appreciate your help!
X_batch has shape (1000,300)
y_batch has shape 1000
X_train_split_xgb has shape (48195, 300)
y_train_split_xgb_encoded has shape 48195
I don't even know how it tries to broadcast (48195,210), (48195,187) and (48195,210) together; where are 210 and 187 coming from?
from sklearn.ensemble import RandomForestClassifier

errors = []
rf = RandomForestClassifier(n_estimators=5, random_state=0, warm_start=True)

for X_batch, y_batch in get_batches(X_train_split_xgb, y_train_split_xgb_encoded, 1000):
    # Run training and evaluate accuracy
    rf.fit(X_batch, y_batch)  # warm_start=True
    print(X_batch.shape)
    print(rf.predict(X_train_split_xgb))
    print(rf.score(X_train_split_xgb, y_train_split_xgb_encoded))
    # pred = rf.predict(X_batch)
    # errors.append(MSE(y_batch, rf.predict(X_batch)))
    rf.n_estimators += 1
Error:
ValueError: operands could not be broadcast together with shapes (48195,210) (48195,187) (48195,210)
Expected: code runs through and gives the scores at each iteration.
Actual: the code stops in the second loop iteration, i.e. when the prediction/scoring is done for the second time; it stops in rf.predict().
Very late to answer, but leaving a response in case someone else runs into the same issue.
The reason this is happening is that the set of class labels is not consistent between the different calls to the fit method.
I performed a simple test where I fed the same X and y to the fit method in a loop, and that seems to work.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(warm_start=True)
for _ in range(10):
    X = df.head(100).drop(columns='class')
    y = df.head(100)['class'].values
    rf.fit(X, y)

rf.score(X_test, y_test)
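If the data set is large enough that every class has at least a handful of samples, one way to guarantee consistent classes per batch is to build the batches with StratifiedKFold. A rough sketch, assuming the arrays from the question are NumPy arrays (use .iloc indexing for DataFrames):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Sketch: form batches with StratifiedKFold so that every batch contains every
# class (this requires each class to have at least n_batches samples).
n_batches = 5
skf = StratifiedKFold(n_splits=n_batches, shuffle=True, random_state=0)

rf = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=0)
for _, batch_idx in skf.split(X_train_split_xgb, y_train_split_xgb_encoded):
    rf.fit(X_train_split_xgb[batch_idx], y_train_split_xgb_encoded[batch_idx])
    rf.n_estimators += 5  # grow the forest by another 5 trees on the next batch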

Keras Input Shape Issue

I can find many questions and answers related to mine, but somehow they did not solve my problem. I have data with shape (10584, 56) and specified input_shape=(10584,56) in the code, but I am getting the following error:
ValueError: Error when checking input: expected dense_1_input to have 3 dimensions, but got array with shape (10584, 56).
I have some idea that I need to reshape my input data frame, but I am not sure how. Following is my code:
y = df['Target']
x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.2)

model = keras.models.Sequential()
model.add(keras.layers.Dense(64, input_shape=(10584, 56), activation='relu'))
Any help/suggestion will be much appreciated.
There is always an additional dimension for the batch size that you need to add, even if you want to use a batch size of 1.
Another possibility: if in fact your samples are not 2-D arrays but 1-D vectors of size 56, and 10584 is the number of samples you have, then the number of samples is not part of the input shape. You only provide the size of a single sample; Keras will take care of splitting your data into batches and setting the network up the right way.
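A minimal sketch of that second case, assuming each of the 10584 rows is one sample with 56 features (the output layer here is hypothetical, since the type of Target is not shown):
import keras  # or: from tensorflow import keras

# 10584 samples, each a 1-D vector of 56 features, so a single sample has
# shape (56,). The batch dimension is not part of input_shape.
model = keras.models.Sequential()
model.add(keras.layers.Dense(64, input_shape=(56,), activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))  # hypothetical output layer; adapt to your Target
model.compile(optimizer='adam', loss='binary_crossentropy')

# x_train has shape (n_samples, 56); Keras splits it into batches internally.
# model.fit(x_train, y_train, batch_size=32, epochs=10)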

Keras predict not working for multiple GPU's

I recently implemented this make_parallel code (https://github.com/kuza55/keras-extras/blob/master/utils/multi_gpu.py) for testing on multiple GPUs. After implementing it, the predict_classes() function did not work with the new model structure, so after some reading I switched to using the predict function instead. This function only works with certain batch sizes; for example, a batch size of 750 works, while 500, 100 and 350 fail with the following error:
ValueError: could not broadcast input array from shape (348,15) into shape
(350,15)
The training was completed with a batch_size of 75. Any idea why this is happening or how I can fix it?
pointFeatures = np.zeros((batchSize,featureSize))
libfeatures.getBatchOfFeatures(i,batchSize,pointFeatures)
pointFeatures = pointFeatures.reshape(batchSize, FeatureShape.img_rows,
FeatureShape.img_cols, FeatureShape.img_width,
FeatureShape.img_channels)
pointFeatures = pointFeatures.astype('float32')
results = model.predict(pointFeatures, verbose=True,
batch_size=FeatureShape.mini_batch_size)
If you are using the make_parallel function, you need to make sure the number of samples is divisible by batch_size*N, where N is the number of GPUs you are using. For example:
nb_samples = X.shape[0] - X.shape[0]%(batch_size*N)
X = X[:nb_samples]
You can use different batch_size for training and testing as long as the number of samples is divisible by batch_size*N.
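If you cannot afford to drop the trailing samples at prediction time, a sketch of an alternative in plain NumPy is to pad the input up to a multiple of batch_size*N and discard the extra predictions afterwards; predict_padded here is a hypothetical helper, not part of keras-extras:
import numpy as np

# Sketch: pad X so its length is a multiple of batch_size * n_gpus, predict,
# then drop the predictions that correspond to the padding.
def predict_padded(model, X, batch_size, n_gpus):
    multiple = batch_size * n_gpus
    n = X.shape[0]
    remainder = n % multiple
    if remainder:
        pad = np.repeat(X[-1:], multiple - remainder, axis=0)  # repeat the last sample
        X = np.concatenate([X, pad], axis=0)
    preds = model.predict(X, batch_size=batch_size)
    return preds[:n]  # keep only the predictions for the real samples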

Preparing Time-Series Data for Keras LSTM - Network Trains with Extremely High Loss

I am running into issues preparing my data for use in Keras's LSTM layer. The data is a 1,600,000 item time-series csv consisting of a date and three features:
Date F1 F2 F3
2016-03-01 .252 .316 .690
2016-03-02 .276 .305 .691
2016-03-03 .284 .278 .687
...
My goal is to predict the value of F1 prediction_period timesteps in the future. Understanding that Keras's LSTM layer takes input data in the format (samples, timesteps, dimensions), I wrote the following function to convert my data into a 3D numpy array in this format (using 2016-03-03 as an example):
[[[.284, .278, .687], [.276, .305, .691], [.252, .316, .690]],...other samples...]
This function creates the array by stacking copies of the data, with each copy shifted one step further back in time. Lookback is the number of "layers" in the stack and trainpercent is train/test split:
def loaddata(path):
    df = pd.read_csv(path)
    df.drop(['Date'], axis=1, inplace=True)
    df['label'] = df.F1.shift(periods=-prediction_period)
    df.dropna(inplace=True)
    df_train, df_test = df.iloc[:int(trainpercent * len(df))], df.iloc[int(trainpercent * len(df)):]
    train_X, train_Y = df_train.drop('label', axis=1).copy(), df_train[['label']].copy()
    test_X, test_Y = df_test.drop('label', axis=1).copy(), df_test[['label']].copy()
    train_X, train_Y, test_X, test_Y = train_X.as_matrix(), train_Y.as_matrix(), test_X.as_matrix(), test_Y.as_matrix()
    train_X, train_Y, test_X, test_Y = train_X.astype('float32'), train_Y.astype('float32'), test_X.astype('float32'), test_Y.astype('float32')
    train_X, test_X = stackit(train_X), stackit(test_X)
    train_X, test_X = train_X[:, lookback:, :], test_X[:, lookback:, :]
    train_Y, test_Y = train_Y[lookback:, :], test_Y[lookback:, :]
    train_X = np.reshape(train_X, (train_X.shape[1], train_X.shape[0], train_X.shape[2]))
    test_X = np.reshape(test_X, (test_X.shape[1], test_X.shape[0], test_X.shape[2]))
    train_Y, test_Y = np.reshape(train_Y, (train_Y.shape[0])), np.reshape(test_Y, (test_Y.shape[0]))
    return train_X, train_Y, test_X, test_Y

def stackit(thearray):
    thelist = []
    for i in range(lookback):
        thelist.append(np.roll(thearray, shift=i, axis=0))
    thelist = tuple(thelist)
    thestack = np.stack(thelist)
    return thestack
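For comparison, a minimal sliding-window version (assuming lookback and a 2-D array data of shape (rows, features)) that can be used to cross-check the shapes produced by stackit:
import numpy as np

# Cross-check sketch: build (samples, timesteps, features) windows directly,
# oldest row first; reverse each window if the newest row should come first,
# as in the example above.
def make_windows(data, lookback):
    n_samples = data.shape[0] - lookback + 1
    return np.stack([data[i:i + lookback] for i in range(n_samples)])

# e.g. data.shape == (1000, 3) -> make_windows(data, 3).shape == (998, 3, 3)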
While the network accepted the data and did train, the loss values were exceptionally high, which was very surprising considering that the data has a definite periodic trend. To try and isolate the problem, I replaced my dataset and network structure with a sin-wave dataset and structure from this example:
http://www.jakob-aungiers.com/articles/a/LSTM-Neural-Network-for-Time-Series-Prediction.
Even with the sine-wave dataset, the loss was still orders of magnitude higher than that produced by the example function. I went through the function piece by piece, using a one-column sequential dataset, and compared expected values with actual values. I didn't find any errors.
Am I structuring my input data incorrectly for Keras's LSTM layer? If so, what is the proper way to do this? If not, what would you expect to cause these symptoms (extremely high loss which does not decrease over time, even with 40+ epochs) in my function or otherwise?
Thanks in advance for any advice you can provide!
Here are some things you can do to improve your predictions:
First, make sure your input data is centered, i.e. apply some standardization or normalization. You can either use MinMaxScaler or StandardScaler from the sklearn library or implement some custom scaling based on your data (see the sketch after this answer).
Make sure your network (LSTM/GRU/RNN) is big enough to capture the complexity in your data.
Use the TensorBoard callback in Keras to monitor your weight matrices and loss functions.
Use an adaptive optimizer instead of setting custom learning parameters, e.g. 'adam' or 'adagrad'.
Using these will at least make sure that your network is training. You should see a gradual decrease of the loss over time. After you've solved this problem, you are free to experiment with your initial hyper-parameters and different regularization techniques.
Good luck!
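For the first point, a minimal sketch of scaling, assuming the 2-D train_X/test_X feature arrays from loaddata before they are stacked into 3-D:
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training features only, then apply the same transform
# to the test features (before building the 3-D LSTM input).
scaler = MinMaxScaler()
train_X_scaled = scaler.fit_transform(train_X)  # train_X: (n_rows, n_features)
test_X_scaled = scaler.transform(test_X)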
A "high loss" is a very subjective thing. We can not assess this without seeing your model.
It can come from multiple reasons:
training loss can be influenced by regularization techniques. For example, the whole point of L2 regularization is to add a penalty on the model's weights to the loss.
the loss is defined by an objective function, so it depends on what objective you are using.
the optimizer you are using for that objective function might not be well suited. Some optimizers do not guarantee convergence of the loss.
your time series might not be predictable (but apparently this is not your case).
your model might not be adequate for the task you are trying to achieve.
your training data is not correctly prepared (but you have investigated this).
You see that there are plenty of possibilities. A high loss doesn't mean anything in itself: you could have a really small loss, add 1000 to it, and your loss would be high even though the problem is solved.
