I have the SEED-IV dataset and I want to use it for emotion recognition with an MLP. I converted the preprocessed data into separate .npy arrays and saved them all into a single file. After loading the data I want to convert the x and y labels into tensors and then build x and y training datasets to train the model. The problem is that all of my arrays have different shapes, so I'm not able to convert them into tensors. How do I overcome this issue? Please help. Thank you!
I have tried these two methods. The first:
reshaped_arrays = {}
for key in loaded_data.keys():
    # Skip the 'labels' array
    if key == 'labels':
        continue
    # Reshape the array to the desired shape
    reshaped_array = loaded_data[key].reshape(new_shape)
    # Store the reshaped array in the dictionary
    reshaped_arrays[key] = reshaped_array

# Extract x and y data
x = np.concatenate(list(reshaped_arrays.values()), axis=0)
y = loaded_data['labels']

x_data_tensor = tf.convert_to_tensor(x, dtype=tf.float32)
y_data_tensor = tf.convert_to_tensor(y, dtype=tf.float32)

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_data_tensor, y_data_tensor, test_size=0.2, random_state=42)
And the second:

x_data, y_data = [], []
for arr_name in loaded_data:
    arr = loaded_data[arr_name]
    x_data.append(arr[:-1])
    y_data.append(arr[-1])

# Convert the lists to numpy arrays
x_data = np.array(x_data)
y_data = np.array(y_data)

x_data_tensor = tf.convert_to_tensor(x_data, dtype=tf.float32)
y_data_tensor = tf.convert_to_tensor(y_data, dtype=tf.float32)

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_data_tensor, y_data_tensor, test_size=0.2, random_state=42)
Once your arrays share a common shape, you can reshape the resulting tensor into whatever layout you want with torch.Tensor.reshape, if using torch is not a problem in your environment.
Reference
https://pytorch.org/docs/stable/generated/torch.Tensor.reshape.html
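If the real problem is that the arrays are ragged (different numbers of rows), no reshape will make them stack directly; a common workaround is to zero-pad them to a common length first. A minimal numpy sketch, assuming each entry in loaded_data other than 'labels' is a 2D array with the same feature dimension but a varying number of rows:

import numpy as np

# Collect every array except the labels and find the longest one.
arrays = [loaded_data[k] for k in loaded_data.keys() if k != 'labels']
max_len = max(a.shape[0] for a in arrays)

# Zero-pad each array along the first axis up to max_len.
padded = [np.pad(a, ((0, max_len - a.shape[0]), (0, 0)), mode='constant')
          for a in arrays]

# Every array now has the same shape, so stacking (and tensor conversion) works.
x = np.stack(padded)  # shape: (n_arrays, max_len, n_features)

If you would rather not pad, TensorFlow's ragged tensors (tf.ragged) are another way to hold variable-length data.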
Related
I'm working on a project with a very big dataset, NF-UQ-NIDS. It couldn't even fit in a pandas DataFrame, so I decided to use dask, but I'm having problems.
I might be doing something else wrong, but when I try to train_test_split X and y, I can't do it without converting them to dask arrays. The train_test_split results in the wrong shape for y, which should have 7 columns since I use 7 classification labels, but it comes out with shape (x, 42), the same shape as X.
Here is a reproducible sample; the dataset is at the link above:
df = dd.read_hdf(root_folder + "hdf/" + hdf_name, hdf_name.split(".")[0])

def encode_numeric_zscore(df, name, mean=None, standard_deviation=None):
    if mean is None:
        mean = df[name].mean()
    if standard_deviation is None:
        standard_deviation = df[name].std()
    df[name] = (df[name] - mean) / standard_deviation

for column in df.columns:
    if column != 'attack_map':
        encode_numeric_zscore(df, column)

X_columns = df.columns.drop('attack_map')
X = df[X_columns].values
y = dd.get_dummies(df['attack_map'].to_frame().categorize()).values

print(type(X))
print(type(y))

X = df.to_dask_array(lengths=True)
y = df.to_dask_array(lengths=True)

print(type(X))
print(type(y))

X.compute()
y.compute()

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=2)

print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)
If you are facing a problem with the train/test split, use the one from dask-ml rather than scikit-learn's train_test_split whenever you are working with a dask DataFrame / Series / array.
Link: https://ml.dask.org/modules/generated/dask_ml.model_selection.train_test_split.html
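For example, a minimal sketch using the X and y dask arrays from the question:

from dask_ml.model_selection import train_test_split

# dask-ml's split operates lazily on dask collections instead of
# materializing everything in memory like sklearn's version would.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=2)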
I am currently using Keras to fit a sequential model to my data, but I think my data is skewed too much because one of my categorical columns contains 7 values and one of those accounts for 85% of the data. I am thinking I need to standardize the columns by adjusting the weights. Will the preprocessing functions from sklearn be able to help with this?
Below is the current code I have so far:
# load the dataset as a pandas DataFrame
data = read_csv(filename)
data = pd.DataFrame(data, columns=['A', 'B', 'C'])
datas = data
# retrieve numpy array
dataset = data.values
# split into input (X) and output (y) variables
X = dataset[:, :-1]
y = dataset[:, -1]
# format all fields as string
X = X.astype(str)
# reshape target to be a 2d array
y = y.reshape((len(y), 1))
# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # format all fields as string
    X = X.astype(str)
    # reshape target to be a 2d array
    y = y.reshape((len(y), 1))
    return X, y
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

stdscaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = stdscaler.transform(X)
X_train_scaled = stdscaler.transform(X_train)
X_test_scaled = stdscaler.transform(X_test)

print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)
# prepare input data
def prepare_inputs(X_train, X_test):
    ohe = OneHotEncoder(handle_unknown='ignore')
    ohe.fit(X_train)
    X_train_enc = ohe.transform(X_train)
    X_test_enc = ohe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train_scaled, X_test_scaled)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# define the model
model = Sequential()
model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)
# evaluate the keras model
_, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
print('Accuracy: %.2f' % (accuracy*100))
I highly suggest going through the relevant example in the Keras documentation: you are facing a class imbalance problem.
You need to use the class_weight argument when fitting your model.
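For instance, a sketch that derives balanced weights from the training labels (assuming y_train_enc holds the encoded targets from the code above):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weight each class inversely to its frequency so the ~85% majority
# class doesn't dominate the loss.
classes = np.unique(y_train_enc)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train_enc)
class_weight = {int(c): w for c, w in zip(classes, weights)}

model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16,
          class_weight=class_weight, verbose=2)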
I am trying to train a model using KNeighborsClassifier. I split the data as follows:
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.30, random_state=42)
y_train = y_train.astype('int')
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
When I try to test it, I get a ValueError.
pre = neigh.predict(y_test)
Expected 2D array, got 1D array instead:
array=[0. 1. 1. ... 0. 0. 0.].
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
My y_test is of type pandas.core.series.Series.
So how do I convert a pandas.core.series.Series to a 2D array to make this test work?
I have tried converting y_test to a DataFrame and then to an array, but I get another ValueError and I am stuck.
y_test = pd.DataFrame(y_test)
y_test = y_test.as_matrix().reshape(-1,1)
pre = neigh.predict(y_test)
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 1 while Y.shape[1] == 6038
I guess you need to use your X_test variable / array, not y_test.
X_test holds the independent variables / features used to test the accuracy of the model, while y_test holds the actual target values, which are compared against the predicted values.
Example:
pre = neigh.predict(X_test)
To measure accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pre)
I wrote a function to split numpy ndarrays x_data and y_data into training and test data based on a percentage of the total size.
Here is the function:
def split_data_into_training_testing(x_data, y_data, percentage_split):
    number_of_samples = x_data.shape[0]
    p = int(number_of_samples * percentage_split)
    x_train = x_data[0:p]
    y_train = y_data[0:p]
    x_test = x_data[p:]
    y_test = y_data[p:]
    return x_train, y_train, x_test, y_test
In this function, the top portion of the data goes to the training set and the bottom portion goes to the test set, based on percentage_split. How can this split be made more randomized before the data is fed to the machine learning model?
Assuming there's a reason you're implementing this yourself instead of using sklearn's train_test_split, you can shuffle an array of indices (this leaves the original data untouched) and index on that:
def split_data_into_training_testing(x_data, y_data, split, shuffle=True):
    idx = np.arange(len(x_data))
    if shuffle:
        np.random.shuffle(idx)
    p = int(len(x_data) * split)
    x_train = x_data[idx[:p]]
    x_test = x_data[idx[p:]]
    y_train = y_data[idx[:p]]  # same index sets for y, so pairs stay aligned
    y_test = y_data[idx[p:]]
    return x_train, x_test, y_train, y_test
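Hypothetical usage on toy arrays:

import numpy as np

x = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)                 # 10 matching labels
x_train, x_test, y_train, y_test = split_data_into_training_testing(x, y, 0.8)
print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)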
You can create a mask with p randomly selected true elements and index the arrays that way. I would create the mask by shuffling an array of the available indices:
ind = np.arange(number_of_samples)
np.random.shuffle(ind)
ind_train = np.sort(ind[:p])
ind_test = np.sort(ind[p:])
x_train = x_data[ind_train]
y_train = y_data[ind_train]
x_test = x_data[ind_test]
y_test = y_data[ind_test]
Sorting the indices is only necessary if your original data is monotonically increasing or decreasing in x and you'd like to keep it that way. Otherwise, ind_train = ind[:p] is just fine.
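For completeness, here is the literal boolean-mask variant of the same idea (a sketch reusing number_of_samples and p from the question):

# True marks a training sample; everything else goes to the test set.
mask = np.zeros(number_of_samples, dtype=bool)
mask[np.random.choice(number_of_samples, p, replace=False)] = True

x_train, y_train = x_data[mask], y_data[mask]
x_test, y_test = x_data[~mask], y_data[~mask]

A nice side effect: boolean indexing preserves the original order, so no sorting step is needed.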
I am doing a machine learning project to recognize handwritten digits. Actually, I just want to add a few more samples to MNIST, but I am unable to do so.
I have done the following:
n_samples = len(mnist.data)
x = mnist.data.reshape((n_samples, -1))  # flatten each image into a feature vector
y = mnist.target  # class labels 0-9, one per digit
img_temp_train = cv2.imread('C:/Users/amuly/Desktop/Soap/crop/2.jpg', 0)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Now I want to add img_temp_train to my dataset for training.
X_train = np.append(X_train, img_temp_train.reshape(-1))
y_train = np.append(y_train, [4.0])
The lengths after appending are:
43904784 (X_train)
56001 (y_train)
But it should be 56001 for both.
Try this:
X_train = np.append(X_train, [img_temp_train], axis=0)
You shouldn't be reshaping things willy-nilly without thinking about what you're doing first!
Also, it's usually a better idea to use concatenate:
X_train = np.concatenate((X_train, [img_temp_train]), axis=0)
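Either way, the image has to be flattened into one row with the same number of features as X_train, and y_train must be extended in the same step. A sketch, assuming the image has already been resized to 28x28 so it flattens to 784 features:

new_sample = img_temp_train.reshape(1, -1)       # shape (1, 784)
X_train = np.concatenate((X_train, new_sample), axis=0)
y_train = np.append(y_train, 4.0)
assert X_train.shape[0] == y_train.shape[0]      # rows and labels stay in sync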