I want to train a 1D CNN on PhysioNet 2017 ECG data. Each row in the training data has a variable length, i.e., some rows are 9000 columns long and some are 18286 columns long. To make them the same length, I have zero-padded each row up to the maximum length, 18286.
Now I have 20200 rows, each 18286 columns long, so the data shape is (20200, 18286). I want to reshape this data in order to train the 1D CNN. I have used the following code for splitting the data into training and validation sets:
Xt, Xv, Yt, Yv = train_test_split(trainX_bal, trainY_bal, random_state=42, test_size=0.2)
print("Train shape: ", Xt.shape)
print("Valdation shape: ", Xv.shape)
and i have output:
Train shape: (16160, 18286)
Valdation shape: (4040, 18286)
I then reshaped the training and validation data using the following code:
samples_train = list()
samples_val = list()
length = 8
# group the rows into consecutive chunks of 8
for i in range(0, Xt.shape[0], length):
    sample = Xt[i:i+length]
    samples_train.append(sample)
for i in range(0, Xv.shape[0], length):
    sample_val = Xv[i:i+length]
    samples_val.append(sample_val)
data = np.array(samples_train).astype(np.float32)
data_val = np.array(samples_val).astype(np.float32)
print("Training new shape: ", data.shape)
print("Validation new shape: ", data_val.shape)
Xt_cnn = data.reshape((len(samples_train), length, data.shape[2]))
Xv_cnn = data_val.reshape((len(samples_val), length, data_val.shape[2]))
Yt = to_categorical(Yt, num_classes=4)
Yv = to_categorical(Yv, num_classes=4)
The output is:
Training new shape: (2020, 8, 18286)
Validation new shape: (505, 8, 18286)
I then fit this data to the CNN model using the following code:
mod = cnn_model(Xt_cnn)
cnn_history = mod.fit(Xt_cnn, Yt, batch_size=64, validation_data = (Xv_cnn, Yv),
epochs=20)
I get an error.
Your reshaping is wrong. You are altering the number of samples, so your data becomes incompatible with your labels. As I understand it, you are trying to reshape each (1, 18286) row into (8, 18286/8) values, which is impossible since 18286/8 = 2285.75. If you increase your padding to make the shape 18288, it becomes possible, since 18288/8 = 2286 (an integer).
You can do this reshaping as in the following code:
Arr = []
for samp in range(number_of_samples):
    # split each padded row of length 18288 into 8 chunks of 2286
    new_array = Xt[samp, :].reshape(8, 2286)
    Arr.append(new_array)
Arr = np.array(Arr)
Arr's shape becomes (number_of_samples, 8, 2286).
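As an aside, the same reshaping can be done without a Python loop; a minimal sketch, assuming Xt has already been padded to a width of 18288:
Xt_cnn = Xt.reshape(Xt.shape[0], 8, 2286)
reshape keeps the number of rows unchanged, so the labels in Yt stay aligned with the samples.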
I am running an LSTM model on simple stock market data.
When training, the y_train values are a simple float64 array of shape (985,). But upon using lstm.predict(X_test), the predicted values form a float32 array of shape (246, 2, 1).
Basically, it is giving me two predictions per input X_test value. Ideally I would expect the output to be an array of shape (246,).
Please help; here is the code:
def lstm_split(data, n_steps):
    X, y = [], []
    for i in range(len(data) - n_steps + 1):
        X.append(data[i:i+n_steps, :-1])
        y.append(data[i+n_steps-1, -1])
    return np.array(X), np.array(y)
stock_data_ft = X_ft
X1, y1 = lstm_split(stock_data_ft.values, n_steps=2)
train_split = 0.8
split_idx = int(np.ceil(len(X1) * train_split))
date_index = stock_data_ft.index
X_train, X_test = X1[:split_idx], X1[split_idx:]
y_train, y_test = y1[:split_idx], y1[split_idx:]
X_train_date, X_test_date = date_index[:split_idx], date_index[split_idx:]
print(X1.shape, X_train.shape, X_test.shape, y_test.shape)
print(X_train)
lstm = Sequential()
lstm.add(LSTM(32, input_shape=(X_train.shape[1], X_train.shape[2]), activation='relu', return_sequences=True))
lstm.add(Dense(1))
lstm.compile(loss='mean_squared_error',optimizer='adam')
lstm.summary()
history = lstm.fit(X_train,y_train,epochs=100,batch_size=4,verbose=2,shuffle=False)
y_pred = lstm.predict(X_test)
I tried to get predicted values from the model:
y_pred = lstm.predict(X_test)
I was expecting an output array of shape (246,) but instead got a float32 array of shape (246, 2, 1).
Some additional clarifications:
X_train.shape[1] is 2 and X_train.shape[2] is 3. These indicate the dimensions of the input features.
Basically, the X values in the training data form an array of dimension (985, 2, 3).
Some samples below:
[[[ 1.53055021, 1.52204214, 1.53825887], [ 1.5526797 , 1.56142366, 1.56073994]],
 [[ 1.5526797 , 1.56142366, 1.56073994], [ 1.58880785, 1.59418392, 1.6166433 ]]]
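The (246, 2, 1) output shape is what return_sequences=True produces: the LSTM emits an output for each of the 2 timesteps, and Dense(1) is then applied to each one. A minimal sketch of one possible fix, returning only the last timestep's output (my assumption about the intended behavior):
lstm = Sequential()
lstm.add(LSTM(32, input_shape=(X_train.shape[1], X_train.shape[2]),
              activation='relu', return_sequences=False))  # emit only the last timestep
lstm.add(Dense(1))
lstm.compile(loss='mean_squared_error', optimizer='adam')
y_pred = lstm.predict(X_test).ravel()  # (246, 1) -> (246,)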
In principle I'd like to do the opposite of what was done here https://datascience.stackexchange.com/questions/45916/loading-own-train-data-and-labels-in-dataloader-using-pytorch.
I have a PyTorch dataloader train_dataloader with shape (2000, 3). I want to store the 3 dataloader columns in 3 separate NumPy arrays. (The first column of the dataloader contains the data, the second column contains the labels.)
I managed to do it for the last batch of the train_dataloader (see below), but unfortunately couldn't make it work for the whole train_dataloader.
for X, y, ind in train_dataloader:
    pass  # after the loop, X and y hold only the last batch
train_X = np.asarray(X, dtype=np.float32)
train_y = np.asarray(y, dtype=np.float32)
Any help would be very much appreciated!
You can collect all the data across batches and concatenate at the end:
import torch
all_X = []
all_y = []
for X, y, ind in train_dataloader:
    all_X.append(X)
    all_y.append(y)
train_X = torch.cat(all_X, dim=0).numpy()
train_y = torch.cat(all_y, dim=0).numpy()
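One caveat, not from the original answer: if the loader yields CUDA tensors or tensors that require grad, .numpy() will raise an error; detach and move to CPU first:
train_X = torch.cat(all_X, dim=0).detach().cpu().numpy()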
I am writing a neural network that takes Mel-frequency coefficients as inputs. My dataset contains 100 samples - each sample is an array of 12 values corresponding to the coefficients. After splitting this data into train and test sets, I have created the X input corresponding to the array and the y input corresponding to the label.
Data array containing the coefficients
Here is a small sample of my data containing 5 elements in the X_train array:
['[107.59366 -14.153783 24.799461 -8.244417 20.95272\n -4.375943 12.77285 -0.92922235 3.9418116 7.3581047\n -0.30066165 5.441765 ]'
'[ 96.49664 2.0689797 21.557552 -32.827045 7.348135 -23.513977\n 7.9406714 -16.218931 10.594619 -21.4381 0.5903044 -10.569035 ]'
'[105.98041 -2.0483367 12.276348 -27.334534 6.8239 -23.019623\n 7.5176797 -21.884727 11.349695 -22.734652 3.0335162 -11.142375 ]'
'[ 7.73094559e+01 1.91073620e+00 6.72225571e+00 -2.74525508e-02\n 6.60858107e+00 5.99264860e-01 1.96265772e-01 -3.94772577e+00\n 7.46383286e+00 5.42239428e+00 1.21432066e-01 2.44894314e+00]']
When I create the neural network, I want to use the 12 coefficients as the input to the network. In order to do this, I need to use each row of my X_train dataset, which contains these arrays, as the input. However, when I try to use the array index as an input, I get shape errors when trying to fit the model. My model is as follows:
def build_model_graph():
    model = Sequential()
    model.add(Input(shape=(12,)))
    model.add(Dense(12))
    model.add(Activation('relu'))
    model.add(Dense(10))
    model.add(Activation('relu'))
    model.add(Dense(num_labels))
    model.add(Activation('softmax'))
    # Compile the model
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
    return model
Here, I want to use every row of the X_train array as an input, which would correspond to shape (12,). When I use something like this:
num_epochs = 50
num_batch_size = 32
model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs,
validation_data=(x_test, y_test), verbose=1)
I get an error for the shape which makes sense to me.
For reference, the error is as follows:
ValueError: Exception encountered when calling layer "sequential_20" (type Sequential).
Input 0 of layer "dense_54" is incompatible with the layer: expected min_ndim=2, found ndim=1. Full shape received: (None,)
But I am not exactly sure how I can extract the array of 12 coefficients present at each index of X_train and then use it as the model input. Indexing x_train and y_train did not work either. If anyone could point me in a relevant direction, it would be extremely helpful. Thanks!
Edit: My code for the dataframe is as follows:
clapdf = pd.read_csv("clapsdf.csv")
clapdf.drop('Unnamed: 0', inplace=True, axis=1)
clapdf.head()
nonclapdf = pd.read_csv("nonclapsdf.csv")
nonclapdf.drop('Unnamed: 0', inplace=True, axis=1)
sound_df = clapdf.append(nonclapdf)
sound_df.head()
d = sound_data.tolist()
df = pd.DataFrame(data=d)
data = df[0].to_numpy()
print("Before-->", data.shape)
dat = np.array([np.array(d) for d in data])
print('After-->', dat.shape)
Here, the shape remains the same because the values of each of the 80 samples are not in a comma-separated format but instead in the form of a series.
If your data looks like this:
samples = 2
features = 12
x_train = tf.random.normal((samples, 1, features))
tf.Tensor(
[[[-2.5988803 -0.629626 -0.8306641 -0.78226614 0.88989156
-0.3851106 -0.66053045 1.0571191 -0.59061646 -1.1602987
0.69124466 -0.04354193]]
[[-0.86917496 2.2923143 -0.05498986 -0.09578358 0.85037625
-0.54679644 -1.2213608 -1.3766612 0.35416105 -0.57801914
-0.3699728 0.7884727 ]]], shape=(2, 1, 12), dtype=float32)
You will have to reshape it to (2, 12) in order to fit your model with the input shape (batch_size, 12):
import tensorflow as tf
def build_model_graph():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Input(shape=(12,)))
    model.add(tf.keras.layers.Dense(12))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Dense(10))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Dense(2))
    model.add(tf.keras.layers.Activation('softmax'))
    # Compile the model
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
    return model
model = build_model_graph()
samples = 2
features = 12
x_train = tf.random.normal((samples, 1, features))
x_train = tf.reshape(x_train, (samples, features))
y = tf.random.uniform((samples, 1), maxval=2, dtype=tf.int32)
y_train = tf.keras.utils.to_categorical(y, 2)
model.fit(x_train, y_train, batch_size=1, epochs=2)
Also, you usually need to convert your labels to one-hot encoded vectors if you plan to use categorical_crossentropy.
y_train looks like this:
[[0. 1.]
[1. 0.]]
Update 1:
If your data is coming from a dataframe, try something like this:
import numpy as np
import pandas as pd
d = {'features': [[0.18525402, 0.92130125, 0.2296906, 0.75818471, 0.69813222, 0.47147329,
0.03560711, 0.06583931, 0.90921289, 0.76002148, 0.50413995, 0.36099004],
[0.18525402, 0.92130125, 0.2296906, 0.75818471, 0.69813222, 0.47147329,
0.03560711, 0.06583931, 0.90921289, 0.76002148, 0.50413995, 0.36099004]]}
df = pd.DataFrame(data=d)
data = df['features'].to_numpy()
print('Before -->', data.shape)
data = np.array([np.array(d) for d in data])
print('After -->', data.shape)
Before --> (2,)
After --> (2, 12)
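Judging by the sample printed in the question, the entries of X_train appear to be strings (each row is quoted) rather than arrays, in which case they need to be parsed before any reshaping will help. A minimal sketch, where parse_row is a hypothetical helper and each entry is assumed to look like '[107.59 -14.15 ...]' with whitespace-separated values:
import numpy as np
def parse_row(s):
    # strip the surrounding brackets and parse the whitespace-separated floats
    return np.array(s.strip('[]').split(), dtype=np.float32)
x_train = np.array([parse_row(s) for s in X_train])  # shape (n_samples, 12)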
I start with my train and test sets. They are NumPy arrays.
Then I create a variable history = [x for x in train]. This is a Python list.
For i in range(len(test)), I make a forecast. In the forecast() function, history is turned into an array with the shape (30, 72, 7) and then flattened to (2160, 7). Within the loop, before the iteration moves to the next step, test is appended to history like so: history.append(test[i, :]).
When the iteration runs the next time, it stops when trying to run forecast() again, because the array history now has the shape (31,) and cannot be flattened.
I suspect the problem is either the types or the history.append(test[i, :]). But which is it? And how do I fix it?
Here are the relevant functions:
# evaluate a single model
def evaluate_model(train, test, n_input):
    print("ENTERED EVALUATE_MODEL!")
    # fit model
    model = build_model(train, n_input)
    # history is a list of weekly data
    history = [x for x in train]
    print("HISTORY TYPE after declaration: ", type(history))
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        print("Test.shape(): ", test.shape)
        print("Round: ", i+1)
        # predict the week
        yhat_sequence = forecast(model, history, n_input)
        print("yhat_sequence type: ", type(yhat_sequence))
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        #test = np.array(test)
        # print("TEST SHAPE :", test.shape)
        print("TEST ", test)
        print("History ", history)
        print("HISTORY TYPE BEFORE APPEND: ", type(history))
        print("TEST TYPE BEFORE APPEND: ", type(test))
        history.append(test[i, :])
        #test = test.tolist()
        print("TEST after: ", type(test))
        print("HISTORY after: ", type(history))
    # evaluate predictions days for each week
    predictions = array(predictions)
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores
# Make a forecast.
def forecast(model, history, n_input):
    print("forecast()")
    # flatten data
    data = array(history)  # history is passed in each time, but on the second round its shape is (31,)...
    print("data(history) shape in forecast(): ")
    print(data.shape)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))  # ...so this reshape doesn't work.
    print(data.shape)
    # retrieve last observations for input data
    # For multivariate, make sure to use all features.
    input_x = data[-n_input:, :]
    # reshape into [1, n_input, n_features] to take all features
    input_x = input_x.reshape((1, input_x.shape[0], input_x.shape[1]))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat
If you have a list (of lists or arrays) that can be turned into a (30, 72, 7) array, the only thing you can append to that list is a (72, 7)-shaped array, which will make a (31, 72, 7) array.
Anything else will result in a (31,) object-dtype array (or an error), and if your NumPy is new enough I'd also expect a "ragged array" warning.
So when checking the shape, also check the dtype.
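A minimal sketch demonstrating the point, with shapes chosen to match the question (not from the original post):
import numpy as np
history = [np.zeros((72, 7)) for _ in range(30)]
print(np.array(history).shape)      # (30, 72, 7)
history.append(np.zeros((72, 7)))   # matching shape: still a clean 3-D array
print(np.array(history).shape)      # (31, 72, 7)
history.append(np.zeros((5, 7)))    # mismatched shape: the result becomes ragged
ragged = np.array(history, dtype=object)
print(ragged.shape, ragged.dtype)   # (32,) object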
I have a series of sine waves that I have loaded in using a custom dataloader. The data is converted to a torch tensor using from_numpy. I then try to load the data using an enumerator over the train_loader. The iterator is shown below.
for epoch in range(epochs):
    for i, data in enumerate(train_loader):
        input = np.array(data)
        train(epoch)
The error I receive is:
RuntimeError: input must have 3 dimensions, got 2
I know I need to have my input data in the shape [sequence length, batch_size, input_size] for an LSTM, but I have no idea how to format my array data of 1000 sine waves of length 10000.
Below is my training method.
def train(epoch):
    model.train()
    train_loss = 0
    def closure():
        optimizer.zero_grad()
        print(input.shape)
        output = model(Variable(input))
        loss = loss_function(output)
        print('epoch: ', epoch.item(), 'loss:', loss.item())
        loss.backward()
        return loss
    optimizer.step(closure)
I thought I would try adding (seq_length, batch_size, input_size) as a tuple, but this can't be fed into the network. Furthermore, my assumption was that the dataloader fed the batch size into the system. Any help would be appreciated.
Edit:
Here is my sample data:
T = 20
L = 1000
N = 100
x = np.empty((N, L), 'int64')
x[:] = np.array(range(L)) + np.random.randint(-4 * T, 4 * T, N).reshape(N, 1)
data = np.sin(x / 1.0 / T).astype('float64')
torch.save(data, open('traindata.pt', 'wb'))
Can you share a simple example of your data just to confirm?
Also, note the order of the dimensions in your shape: a PyTorch LSTM expects [sequence_length, batch_size, input_size] by default, or [batch_size, sequence_length, input_size] if it is constructed with batch_first=True.
One way to achieve this, if you have a batch size of 1, is to use torch.unsqueeze(). This allows you to create a "fake" dimension:
import torch as t
x = t.Tensor([1, 2, 3])
print(x.shape)          # torch.Size([3])
x = x.unsqueeze(dim=0)  # adds a 0-th dimension of size 1
print(x.shape)          # torch.Size([1, 3])
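Applied to the sample data above, a minimal sketch (my own illustration, assuming an LSTM built with batch_first=True): the (N, L) array of sine waves only needs a trailing feature dimension of size 1.
import numpy as np
import torch
T, L, N = 20, 1000, 100
x_np = np.empty((N, L), 'int64')
x_np[:] = np.array(range(L)) + np.random.randint(-4 * T, 4 * T, N).reshape(N, 1)
data = np.sin(x_np / 1.0 / T).astype('float64')   # shape (100, 1000), as in the snippet above
x = torch.from_numpy(data).float().unsqueeze(-1)  # (batch, seq_len, input_size) = (100, 1000, 1)
lstm = torch.nn.LSTM(input_size=1, hidden_size=32, batch_first=True)
out, (h, c) = lstm(x)
print(out.shape)  # torch.Size([100, 1000, 32])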