Data preparation for neural network in Python

I want to learn how to prepare training samples in Python. I found a simple example of a neural network that predicts a stock price. At the moment I am not interested in the accuracy of the trained network; I am interested in how to take arbitrary data and prepare it for feeding into a neural network.
As an example, I took this stock's data over the past 5 years. The idea is that the neural network takes the data for the last 50 days as input and predicts the price for the next 5 days. To do this, I read the .csv file and processed the data so that after the transformation I get two dataframes: the first holds the input data and the second the expected output.
The problem is that no matter what I do, I keep getting errors and cannot complete the training. What am I doing wrong? The code is shown below:
import matplotlib.pylab as plt
import torch
import random
import numpy as np
import pandas as pd
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
import pandas_profiling as pprf
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, BatchNormalization, LeakyReLU
from tensorflow.keras.layers import Activation, Input, MaxPooling1D, Dropout
from tensorflow.keras.layers import AveragePooling1D, Conv1D, Flatten
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam, RMSprop, SGD
from tensorflow.keras.utils import plot_model
from IPython.display import display, Image
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
data = pd.read_csv('F:\\YNDX_ME.csv')[::]
data = data.drop('Date',axis=1)
data = data.drop('Adj Close',axis=1)
data = data.drop(np.where(data['Volume'] == 0)[0])
data = data.reset_index(drop=True)
#profiler = pprf.ProfileReport(data)
#profiler.to_file(r'F:\profiling.html')
days_edu = 50
days_pred = 5
df_edu_list = []
for i in range(len(data.index)-days_edu-days_pred+1):
    df_temp = []
    for j in range(days_edu):
        df_temp.extend(data.loc[i+j,:].tolist())
    df_edu_list.append(df_temp)
df_edu_out_list = []
for i in range(len(data.index)-days_edu-days_pred+1):
    df_temp = []
    for j in range(5):
        df_temp.extend(data.loc[i+j+days_edu,:].tolist())
    df_edu_out_list.append(df_temp)
df_edu_train = pd.DataFrame(df_edu_list[:int(len(df_edu_list)*0.8)])
df_edu_val = pd.DataFrame(df_edu_list[int(len(df_edu_list)*0.8):])
df_edu_train_out = pd.DataFrame(df_edu_out_list[:int(len(df_edu_out_list)*0.8)])
df_edu_val_out = pd.DataFrame(df_edu_out_list[int(len(df_edu_out_list)*0.8):])
df_edu_train = normalize(df_edu_train.values)
df_edu_val = normalize(df_edu_val.values)
df_edu_train_out = normalize(df_edu_train_out.values)
df_edu_val_out = normalize(df_edu_val_out.values)
df_edu_train = np.expand_dims(df_edu_train,axis=0)
df_edu_train_out = np.expand_dims(df_edu_train_out,axis=0)
model = Sequential()
model.add(Conv1D(filters=32, kernel_size=5, padding="same", strides=1, input_shape= (959,250),data_format='channels_first'))
model.add(Conv1D(32, 5))
model.add(Dropout(0.3))
model.add(Conv1D(16, 5))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(25, activation=None))
optimizer = Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(optimizer=optimizer, loss='mae', metrics=['accuracy'])
EPOCHS = 1000
model.fit(df_edu_train, df_edu_train_out, epochs=EPOCHS)
Error:
InvalidArgumentError: Conv2DCustomBackpropFilterOp only supports NHWC.
[[node gradient_tape/sequential/conv1d/Conv1D/Conv2DBackpropFilter
(defined at C:\Users\nick0\anaconda3\lib\site-packages\keras\optimizer_v2\optimizer_v2.py:464)
]] [Op:__inference_train_function_1046]
Errors may have originated from an input operation.
Input Source operations connected to node gradient_tape/sequential/conv1d/Conv1D/Conv2DBackpropFilter:
In[0] sequential/conv1d/Conv1D/ExpandDims (defined at C:\Users\nick0\anaconda3\lib\site-packages\keras\layers\convolutional.py:231)
In[1] gradient_tape/sequential/conv1d/Conv1D/ShapeN:
In[2] gradient_tape/sequential/conv1d/Conv1D/Reshape:
Update:
I changed data_format='channels_first' to data_format='channels_last'. Training started, but as far as I understand it ran on the whole training set as a single sample, i.e. the neural network thought there was only one example and trained on that alone. How do I make the network take each row in turn, since each row is essentially a separate example?
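For reference, a minimal sketch of one way the windows could be fed as separate samples, assuming 5 feature columns per day (Open, High, Low, Close, Volume after dropping Date and Adj Close): reshape each flattened 250-value window back to (50 days, 5 features) instead of adding a batch dimension of 1 with np.expand_dims(..., axis=0).
# Minimal sketch, not the original code: feed each 50-day window as its own
# sample of shape (timesteps, features) instead of one batch of size 1.
# Assumption: 5 feature columns per day remain after dropping Date and Adj Close.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

n_features = 5

# Undo the flattening done above: (num_samples, 250) -> (num_samples, 50, 5).
X_train = df_edu_train.reshape(-1, days_edu, n_features)
X_val = df_edu_val.reshape(-1, days_edu, n_features)
y_train = df_edu_train_out.reshape(-1, days_pred * n_features)
y_val = df_edu_val_out.reshape(-1, days_pred * n_features)

model = Sequential([
    Conv1D(32, 5, padding="same", input_shape=(days_edu, n_features)),
    Flatten(),
    Dense(days_pred * n_features, activation=None),  # 25 target values
])
model.compile(optimizer=Adam(learning_rate=1e-4), loss="mae")
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=10, batch_size=32)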

Related

The accuracy problem of hand sign gestures recognition with using CNN in Python

I'm working on my senior project at my university and I have only 2 days to fix this problem. I built a hand gesture recognition model using a CNN in Python, trained on 78,000 images of 50x50 px. But I'm stuck on the last part of my model: I cannot improve the accuracy. When I train for 100 epochs, the first 15 epochs show an accuracy of 0.039, which is horrible, so I don't wait for training to finish. Maybe it happens because of the parameters of the Conv2D or pooling layers; I don't know how to choose the correct values for Conv2D, pooling, etc.
I'm new to this and could not fix the problem. I would be grateful for any help.
The code I wrote is given below:
from keras.models import Sequential
from keras.layers import Convolution2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers import Dropout
from keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
import pickle
import cv2
import os
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from PIL import Image
from numpy import asarray
DATADIR = "asl_alphabet_train"
CATEGORIES = ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]
X_train = []
y_train = []
X_test=[]
y_test=[]
IMG_SIZE=50
def create_training_data():
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)  # path to this letter's folder
        class_num = CATEGORIES.index(category)  # numeric class label (0-25)
        for img in tqdm(os.listdir(path)):      # iterate over each image in the folder
            try:
                img_array = cv2.imread(os.path.join(path, img))  # read image as array
                #new_array = cv2.resize(img_array, (28, 50))     # resize to normalize data size
                X_train.append(img_array)  # add the image to our training data
                y_train.append(class_num)  # add the matching label
            except Exception as e:         # keep the output clean on unreadable files
                pass
create_training_data()
X_train = asarray(X_train)
y_train = asarray(y_train)
"""
nsamples, nx, ny = X_train.shape
X_train = X_train.reshape((nsamples,nx*ny))
"""
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2,random_state=0)
N = y_train.size
M = y_train.max()+1
resultArray = np.zeros((N,M),int)
idx = (np.arange(N)*M) + y_train
resultArray.ravel()[idx] = 1
y_train=resultArray
classifier=Sequential()
#convolution step
classifier.add(Convolution2D(filters=96, input_shape=(50,50,3), kernel_size=(11,11), padding='valid',activation="relu"))
#pooling step
classifier.add(MaxPooling2D(pool_size=(2,2)))
#convolution step
classifier.add(Convolution2D(filters=256,kernel_size=(11,11),padding="valid",activation="relu"))
#pooling step
classifier.add(MaxPooling2D(pool_size=(2,2)))
classifier.add(Convolution2D(filters=384,kernel_size=(3,3),padding="valid",activation="relu"))
classifier.add(MaxPooling2D(pool_size=(2,2)))
#flatten step
classifier.add(Flatten())
#Dense(Fully connected step)
classifier.add(Dense(output_dim=128,activation="relu"))
#Dropout to decrease the possibility of overfitting
classifier.add(Dropout(0.5))
#Dense to determine the output
classifier.add(Dense(output_dim=26,activation="softmax"))
#compile step
classifier.compile(optimizer="adam",loss="categorical_crossentropy",metrics=["accuracy"])
classifier.fit(X_train,y_train,epochs=100,batch_size=32)
filename="CNN_TEST.sav"
pickle.dump(classifier, open(filename, 'wb'))
y_pred=classifier.predict(X_test)
print(y_pred)
I would recommend the following:
1) Reduce the kernel size in the first two convolutional layers of your model.
2) I believe the MaxPooling layer is not necessary after every convolutional layer; do verify this.
3) A Dropout of 0.5 could drop a large number of essential neurons; you might want to lower that.
4) Vary the number of epochs and see how your model performs each time.
Plot "train accuracy vs val accuracy" and "train loss vs val loss" at each attempt and see whether your model overfits or underfits.

Tensorboard callback not writing the training metrics

When the model takes sufficiently long to run (i.e. it has enough parameters and the data is big enough), and when profile_batch is on, the TensorBoard callback fails to write the training metrics to the log events (at least they are not visible in TensorBoard).
Here is the code used to get that failure:
import os.path as op
import time
import numpy as np
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.layers import Conv2D, Input
from tensorflow.keras.models import Model
size = 512
im = Input((size, size, 1))
im_conv = Conv2D(512, 3, padding='same', activation='relu')(im)
im_conv = Conv2D(1, 3, padding='same', activation='linear')(im_conv)
model = Model(im, im_conv)
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
data = np.random.rand(1, size, size, 1)
run_id = f'{int(time.time())}'
log_dir = op.join('logs', run_id)
tboard_cback = TensorBoard(
    log_dir=log_dir,
    histogram_freq=0,
    write_graph=False,
    write_images=False,
    profile_batch=2,
)
model.fit(
    x=data,
    y=data,
    validation_data=[data, data],
    callbacks=[tboard_cback,],
    epochs=100,
    verbose=0,
)
In the TensorBoard visualization I get (screenshot not reproduced here), the training metrics do not appear.
Is there something wrong with the way I am using this callback?
I use Python 3.6.8, tensorflow 2.0.0 on GPU (but the behaviour is the same on CPU).
So apparently, this is due to the profiling done in the callback. We can disable it via profile_batch=0. The issue is ongoing and to be followed here: https://github.com/tensorflow/tensorboard/issues/2084
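A minimal sketch of the workaround, reusing the log_dir from the snippet above and changing only profile_batch:
# Sketch of the workaround: same callback as above, with profiling disabled
# via profile_batch=0 so the training metrics are written as expected.
from tensorflow.keras.callbacks import TensorBoard

tboard_cback = TensorBoard(
    log_dir=log_dir,
    histogram_freq=0,
    write_graph=False,
    write_images=False,
    profile_batch=0,  # disable the profiler
)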

LSTM Keras confusion

@enumaris, thank you for your answer. I'll try to explain my approach a bit:
I pushed the video frames through a ResNet model and got feature arrays of shape (k, 2048). I have split the data into train/validation and test folders. Then I wrote this script:
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Activation, Dropout, Dense
import tensorflow as tf
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import argparse
import random
import cv2
import os
dataTrain = []
labelsTrain = []
# Prepare the training data. The .txt files contain the name of the .npy file
# and the label, which is 0, 1, or 2 depending on which class the video
# belongs to (e.g. "nameVideo.npy 0").
with open('D:...\Data\/train_files.txt') as f:
    trainingList = f.readlines()
for line in trainingList:
    npyFiles = line.split()
    loadTrainingData = np.load(npyFiles[0])
    dataTrain.append(loadTrainingData)
    labelsTrain.append(npyFiles[1])
dataNp = np.array(dataTrain, dtype=object)
labelsNp = np.array(labelsTrain, dtype=object)
f.close()
dataVal = []
labelsVal = []
# Prepare the Validation Data
with open('D:\...\Data\/val_files.txt') as f:
    valList = f.readlines()
for line in valList:
    npyValFiles = line.split()
    loadValData = np.load(npyValFiles[0])
    dataVal.append(loadValData)
    labelsVal.append(npyValFiles[1])
f.close()
print(len(dataVal))
model = Sequential()
model.add(LSTM(32,
               batch_input_shape=(None, None, 1),
               return_sequences=True))
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(32))
model.add(Dense(10, activation='softmax'))
model.compile(loss='mean_absolute_error',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
history = model.fit(dataTrain, labelsTrain,
                    epochs=10,
                    validation_data=(dataVal, labelsVal))
Which results in the following error:
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 array(s), but instead got the following list of 3521 arrays.
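For reference, one possible way (a sketch under assumptions, not from the original post) to combine the variable-length (k, 2048) feature arrays into the single 3-D array Keras expects is to pad them to a common length; the model would then also need batch_input_shape=(None, None, 2048) rather than (None, None, 1), and integer labels.
# Sketch only. Assumptions: dataTrain is a list of float arrays of shape
# (k, 2048) with varying k, and labelsTrain holds label strings like "0".
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad every video's feature sequence to the same length so the whole set
# stacks into one array of shape (num_videos, max_frames, 2048).
X_train = pad_sequences(dataTrain, padding='post', dtype='float32')
y_train = np.array(labelsTrain, dtype='int32')  # labels as integers, not strings

print(X_train.shape)  # (num_videos, max_frames, 2048)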

OneHotEncoding for LSTM Categorical Sequence in TensorFlow

From a set of categories labelled by numbers, I am predicting the next category in the sequence. I have modelled this on a text generator (hence the random titles!).
I assigned a number to each category, via the enumerate function, so that Keras and TensorFlow could interpret it as numerical information. This threw up an error suggesting I should use one-hot encoding for the outputs, and I don't know how to proceed.
I have sampled what one-hot encoding of the information would look like, but I don't know how to work it into the body of the code going forward, or conversely how to change my code so that the input works without one-hot encoding.
I don't think I understand machine learning well enough just yet; I am teaching myself.
import numpy as np
from numpy import array
from numpy import argmax
import tensorflow as tf
import keras
from keras.utils import to_categorical
from keras.utils import np_utils
from keras.layers import LSTM
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import Input, Dense
from keras.layers import TimeDistributed
from keras.models import Model
data= ['10001426', '10001426','10001426','5121550', '5431000', '10001426', '10001426', '10001466','10001426','5121550', '10001426', '10001426', '10001426','10001426','5431000', '10001426', '10001426', '10001466','10001426','5121550', '5431000', '10001426', '10001426', '10001466','10001426','5121550', '5431000', '10001426', '10001426', '10001466','10001426','5121550', '5431000', '10001426', '10001426', '10001466','10001426','5121550']
data= array(data)
chars=['10001426','5121550','5431000','10001466']
chars= array(chars)
"""
#OneHotEncode - turns the category into an encoded array
encoded = to_categorical(data)
print(encoded)
encoded2 = to_categorical(chars)
print(encoded2)
#Invert OneHotEncode
inverted = argmax(encoded[0])
print inverted
inverted2 = argmax(encoded[0])
print inverted2
"""
#Parameters
SEQ_LENGTH = 2 # Learn in steps of 2
VOCAB_SIZE = len(chars) # number of features - how many categories of fault
#Prepare training data
ix_to_char={ix:char for ix, char in enumerate(chars)}
char_to_ix={char:ix for ix, char in enumerate(chars)}
X= np.zeros((len(data)/SEQ_LENGTH, SEQ_LENGTH, VOCAB_SIZE))
y= np.zeros((len(data)/SEQ_LENGTH, SEQ_LENGTH, VOCAB_SIZE))
for i in range((len(data)/SEQ_LENGTH)):
    if (i+1)*SEQ_LENGTH < len(data):
        X_sequence = data[(i)*SEQ_LENGTH:(i+1)*SEQ_LENGTH]
        X_sequence_ix = [char_to_ix[value] for value in X_sequence]
        input_sequence = np.zeros((SEQ_LENGTH, VOCAB_SIZE))
        print ((i+1)*SEQ_LENGTH, len(data))
        print input_sequence
        for j in range(SEQ_LENGTH):
            input_sequence[j][X_sequence_ix[j]] = 1.
        X[i] = input_sequence
        y_sequence = data[i*SEQ_LENGTH+1:(i+1)*(SEQ_LENGTH+1)]
        y_sequence_ix = [char_to_ix[value] for value in y_sequence]
        target_sequence = np.zeros((SEQ_LENGTH, VOCAB_SIZE))
        for j in range(SEQ_LENGTH):
            if (i+1)*(SEQ_LENGTH+1) < (SEQ_LENGTH):
                target_sequence[j][y_sequence_ix[j]] = 1
        y[i] = target_sequence
        print y[i]
#Create the network
HIDDEN_DIM=1
LAYER_NUM= 1
model = Sequential()
model.add(LSTM(HIDDEN_DIM, input_shape=(None, VOCAB_SIZE),
               return_sequences=True))
for i in range(LAYER_NUM-1):
    model.add(LSTM(HIDDEN_DIM, return_sequences=True))
model.add(Activation('softmax'))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
#Train the network
nb_epoch = 0
BATCH_SIZE = 5
GENERATE_LENGTH = 7
while True:
    print('\n\n')
    model.fit(X, y, batch_size=BATCH_SIZE, verbose=1, epochs=1)
    nb_epoch += 1
    generate_text(model, GENERATE_LENGTH)
    if nb_epoch % 5 == 0:
        model.save_weights('checkpoint_{}_epoch_{}.hdf5'.format(HIDDEN_DIM, nb_epoch))
model.summary()
You forgot that your final layer should have an output of size VOCAB_SIZE. You could either do this by adding a special Dense layer:
for i in range(LAYER_NUM-1):
    model.add(LSTM(HIDDEN_DIM, return_sequences=True))
model.add(Dense(VOCAB_SIZE))
model.add(Activation('softmax'))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
or by setting an appropriate output size on the last LSTM layer (I will skip the code for this part as it is a little tedious).
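One possible sketch of that second option (illustrative only, reusing the names and imports from the question): size the final LSTM layer itself to VOCAB_SIZE so that no extra Dense layer is needed.
# Sketch: make the last recurrent layer emit VOCAB_SIZE values per timestep.
model = Sequential()
model.add(LSTM(HIDDEN_DIM, input_shape=(None, VOCAB_SIZE), return_sequences=True))
for i in range(LAYER_NUM - 1):
    model.add(LSTM(HIDDEN_DIM, return_sequences=True))
# Final recurrent layer sized to the vocabulary, one prediction per timestep:
model.add(LSTM(VOCAB_SIZE, return_sequences=True))
model.add(Activation('softmax'))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")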

theano error from keras

I am running a keras script (no direct call to theano in my script) and I get the following error:
TypeError: ('An update must have the same type as the original shared
variable (shared_var=<TensorType(float32, matrix)>,
shared_var.type=TensorType(float32, matrix),
update_val=Elemwise{add,no_inplace}.0,
update_val.type=TensorType(float64, matrix)).',
'If the difference is related to the broadcast pattern,
you can call the tensor.unbroadcast(var, axis_to_unbroadcast[, ...])
function to remove broadcastable dimensions.')
I have seen the error from folks running theano directly, but not through keras. Not sure what I should do, since I am not dealing with tensors directly.
The problem was that there is a change in the Keras version (I am currently using Keras 0.3.2 with Theano 0.8.0), and what used to work fine no longer works with the new Keras version.
The following was the original code; the fix is shown below.
from keras.models import Sequential
import keras.optimizers
from keras.layers.core import Dense, Dropout
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU
from keras.layers.core import Activation
from keras.optimizers import SGD, Adam
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, RegressorMixin
class NnRegression(BaseEstimator, RegressorMixin):
    def __init__(self, apply_standart_scaling=True,
                 dropx=[0.2, 0.5, 0.5], nb_neuronx=[50, 30], nb_epoch=105, validation_split=0.,
                 verbose=1):
        self.apply_standart_scaling = apply_standart_scaling
        self.dropx = dropx
        self.nb_neuronx = nb_neuronx
        self.nb_epoch = nb_epoch
        self.validation_split = validation_split
        self.verbose = verbose

    def fit(self, X, y):
        nb_features = X.shape[1]
        self.standart_scaling = StandardScaler() if self.apply_standart_scaling else None
        if self.standart_scaling:
            X = self.standart_scaling.fit_transform(X)
        model = Sequential()
        model.add(Dropout(input_shape=(nb_features,), p=self.dropx[0]))
        model.add(Dense(output_dim=self.nb_neuronx[0], init='glorot_uniform'))
        model.add(PReLU())
        model.add(BatchNormalization(self.nb_neuronx[0],))
        model.add(Dropout(self.dropx[1]))
        model.add(Dense(self.nb_neuronx[1], init='glorot_uniform'))
        model.add(PReLU())
        model.add(BatchNormalization(self.nb_neuronx[0],))
        model.add(Dropout(self.dropx[2]))
        model.add(Dense(1, init='glorot_uniform'))
        nn_verbose = 1 if self.verbose > 0 else 0
        optz = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
        model.compile(optimizer=Adam(), loss='mse')
        model.fit(X, y, batch_size=16,
                  nb_epoch=self.nb_epoch, validation_split=self.validation_split, verbose=nn_verbose)
        self.model = model

    def predict(self, X):
        if self.standart_scaling:
            X = self.standart_scaling.transform(X)
        return self.model.predict_proba(X, verbose=0)
Well, it turns out that the problem is this single line of code:
model.add(BatchNormalization(self.nb_neuronx[0],))
It should actually be:
model.add(BatchNormalization())
because the number of neurons has no business being inside the normalization layer (however, this did not cause a problem in a previous Keras version).
This apparently causes Theano to generate new weights that are float64 rather than float32, and that triggers the message above.
