I am doing a machine learning project to recognize handwritten digits. Actually, I just want to add a few more samples to MNIST, but I am unable to do so.
I have done the following:
n_samples = len(mnist.data)
x = mnist.data.reshape((n_samples, -1))  # array of features, 64 pixels per image
y = mnist.target  # class labels 0-9, one per digit
img_temp_train=cv2.imread('C:/Users/amuly/Desktop/Soap/crop/2.jpg',0)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# Now I want to add img_temp_train to my dataset for training.
X_train=np.append(X_train,img_temp_train.reshape(-1))
y_train=np.append(y_train,[4.0])
The lengths after appending are:
43904784 (X_train)
56001 (y_train)
But it should be 56001 for both.
Try this:
X_train = np.append(X_train, img_temp_train.reshape(1, -1), axis=0)
You shouldn't reshape and append willy-nilly without thinking about the shapes first: np.append without an axis argument flattens both arrays into one long 1-D array, which is exactly why your X_train ended up with 43,904,784 elements (= 56,001 × 784) instead of 56,001 rows. With axis=0, the image has to be a single row of the same width as X_train, hence the reshape(1, -1).
Also, it's usually a better idea to use concatenate:
X_train = np.concatenate((X_train, img_temp_train.reshape(1, -1)), axis=0)
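Putting it together, a minimal sketch (the cv2.resize call is my assumption about how you'd force the image to match X_train's feature width; adapt as needed):
import cv2
import numpy as np

side = int(np.sqrt(X_train.shape[1]))  # e.g. 28 for 784 MNIST features, 8 for sklearn digits
img = cv2.imread('C:/Users/amuly/Desktop/Soap/crop/2.jpg', 0)
img = cv2.resize(img, (side, side))    # make the image match the feature width
X_train = np.concatenate((X_train, img.reshape(1, -1)), axis=0)
y_train = np.append(y_train, [4.0])
print(X_train.shape, y_train.shape)    # the first dimensions now agree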
I'm facing a problem I can't solve. I'm trying to create an LSTM model with Keras, but I don't understand what the input data format should be.
My training data and my test data look like this:
date/value/value/value/value/value_i_want_to_predict
I've seen some people doing this:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)
X_train = []
y_train = []
for i in range(60, len(training_set_scaled)):
    X_train.append(training_set_scaled[i-60: i, 0])
    y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
But if I do that, how do I predict my features without modifying the test data set?
I have a hard time understanding why we do this. Moreover, I would like to use the values to predict the target in the last column. With this method I feel like I have to change the format of the test data, and it's important that I can test the model on different test data without having to change them.
Can someone help me?
EDIT
scaler.fit(df_train_x)
X_train = scaler.fit_transform(df_train_x)
X_test = scaler.transform(df_test_x)
y_train = np.array(df_train_y)
y_train = np.insert(y_train, 0, 0)
y_train = np.delete(y_train, -1)
The shape of the data is: (2420, 7)
That's what I did, but the shape still remains 2D. So I used:
generator = TimeseriesGenerator(X_train, y_train, length=n_input, batch_size=32)
And the input shape of the first layer is:
model.add(LSTM(150, activation='relu', return_sequences=True,input_shape=(2419, 7)))
but when I fit the generator to the model:
ValueError: Error when checking target: expected dense_10 to have 3 dimensions, but got array with shape (1, 1)
I really don't understand.
I'm not sure I fully understand your question, but I will try my best.
I think the code you provided is problem-specific, meaning it may not be suitable for your implementation.
For an LSTM (and for pretty much any neural network) you always want to scale your data before feeding it to the model. This helps avoid having completely different value ranges across your features. The MinMaxScaler scales your features to the range provided. For an explanation of why you need scaling, you can have a look at this article.
Usually, you want to first split your dataset into training and testing sets, using for example the train_test_split function of sklearn, then scale your features.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = data.drop("feature_I_want_to_predict",axis=1)
y = data["feature_I_want_to_predict"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
That way, X_train represents your training data and y_train the labels for the training data (and similarly for the test data).
I used the StandardScaler here instead of the MinMaxScaler. The standard scaler subtracts the mean of the feature and then divides by the standard deviation.
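As for the LSTM input format: the network expects 3-D input of shape (samples, timesteps, features). Here is a minimal sketch of turning the scaled 2-D arrays into sliding windows (make_windows is a hypothetical helper; the 60-step window just mirrors the snippet you quoted):
import numpy as np

def make_windows(X, y, window=60):
    # Convert to plain arrays so positional indexing works for pandas inputs too
    X, y = np.asarray(X), np.asarray(y)
    # Stack `window` consecutive rows as one sample -> (samples, timesteps, features)
    Xs = np.array([X[i - window:i] for i in range(window, len(X))])
    ys = np.array([y[i] for i in range(window, len(y))])
    return Xs, ys

X_train_seq, y_train_seq = make_windows(X_train, y_train)
X_test_seq, y_test_seq = make_windows(X_test, y_test)
# input_shape of the first LSTM layer is then (60, n_features);
# the test set is windowed with the same function, without refitting anything.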
I wrote a function to split numpy ndarrays x_data and y_data into training and test data based on a percentage of the total size.
Here is the function:
def split_data_into_training_testing(x_data, y_data, percentage_split):
    number_of_samples = x_data.shape[0]
    p = int(number_of_samples * percentage_split)
    x_train = x_data[0:p]
    y_train = y_data[0:p]
    x_test = x_data[p:]
    y_test = y_data[p:]
    return x_train, y_train, x_test, y_test
In this function, the top part of the data goes to the training dataset and the bottom part of the data samples go to the testing dataset based on percentage_split. How can this data split be made more randomized before being fed to the machine learning model?
Assuming there's a reason you're implementing this yourself instead of using sklearn's train_test_split, you can shuffle an array of indices (this leaves the original data untouched) and index with it.
def split_data_into_training_testing(x_data, y_data, split, shuffle=True):
    idx = np.arange(len(x_data))
    if shuffle:
        np.random.shuffle(idx)
    p = int(len(x_data) * split)
    x_train = x_data[idx[:p]]
    x_test = x_data[idx[p:]]
    y_train = y_data[idx[:p]]  # same index split for the labels
    y_test = y_data[idx[p:]]
    return x_train, x_test, y_train, y_test
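Usage would then look like:
x_train, x_test, y_train, y_test = split_data_into_training_testing(x_data, y_data, split=0.8)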
You can create a mask with p randomly selected true elements and index the arrays that way. I would create the mask by shuffling an array of the available indices:
ind = np.arange(number_of_samples)
np.random.shuffle(ind)
ind_train = np.sort(ind[:p])
ind_test = np.sort(ind[p:])
x_train = x_data[ind_train]
y_train = y_data[ind_train]
x_test = x_data[ind_test]
y_test = y_data[ind_test]
Sorting the indices is only necessary if your original data is monotonically increasing or decreasing in x and you'd like to keep it that way. Otherwise, ind_train = ind[:p] is just fine.
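For completeness, a boolean-mask version of the same idea (a sketch; p and number_of_samples come from your function):
mask = np.zeros(number_of_samples, dtype=bool)
mask[np.random.choice(number_of_samples, p, replace=False)] = True  # p random True entries
x_train, y_train = x_data[mask], y_data[mask]
x_test, y_test = x_data[~mask], y_data[~mask]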
I don't know why the results I have obtained aren't scaled well.
As you can see in the pictures below, there is a problem with scaling.
There are two issues:
There are no negative values
There is a problem with predicting the maximum values
I have no idea why I have these problems.
Do you have any ideas how I can fix this issue?
I would be very grateful for your help.
CODE:
# Read inputs
X = dataset.iloc[0:20000, [1, 4, 10]].values
# Read output
y = dataset.iloc[0:20000, 5].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Output matrix conversion
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
y_train = sc.fit_transform(y_train)
y_test = sc.transform(y_test)
# Import the Keras libraries and package
from keras.models import Sequential
from keras.layers import Dense
# building model
classifier = Sequential()
classifier.add(Dense(activation="sigmoid", input_dim=3, units=64, kernel_initializer="uniform"))
classifier.add(Dense(activation="sigmoid", units=32, kernel_initializer="uniform"))
classifier.add(Dense(activation="sigmoid", units=16, kernel_initializer="uniform"))
classifier.add(Dense(activation="sigmoid", units=1, kernel_initializer="uniform"))
classifier.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
# Fitting the ANN to the training set
results = classifier.fit(X_train, y_train, batch_size=16, epochs=25)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
If your blue curves show your initial output y and your orange ones the output of your model (you have not cared to clarify this...), then there is nothing strange here...
There is a problem with predicting the maximum values
Looking more carefully at your code, you will realize that you don't actually feed your initial y into your network, but its scaled version, i.e. the result of sc.transform(); hence, your output is also scaled, and you should use the inverse_transform method to get it back to the initial scale:
y_final = sc.inverse_transform(y_pred)
BTW, this will happen to work now, but in general it is not a good idea to use the same scaler (sc here) for two different datasets (i.e. your X's and y's) - you should define two different scalers instead, say sc_X and sc_y.
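A minimal sketch of that two-scaler setup (reusing the sc_X / sc_y names suggested above):
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()
sc_y = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
y_train = sc_y.fit_transform(y_train)  # y was already reshaped to (-1, 1) above
y_test = sc_y.transform(y_test)
# ...
y_final = sc_y.inverse_transform(y_pred)  # predictions back on the original scale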
There are no negative values
That is because the sigmoid function you have used as activation in your output layer produces only values in (0, 1), so you may want to change it to something that can give the required value range (linear is a candidate), and possibly also change your other sigmoids to tanh.
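Concretely, the model definition might become something like this (just a sketch of the suggested activation changes, not a tuned architecture):
classifier = Sequential()
classifier.add(Dense(activation="tanh", input_dim=3, units=64, kernel_initializer="uniform"))
classifier.add(Dense(activation="tanh", units=32, kernel_initializer="uniform"))
classifier.add(Dense(activation="tanh", units=16, kernel_initializer="uniform"))
# linear output, so predictions can be negative and exceed the sigmoid's (0, 1) range
classifier.add(Dense(activation="linear", units=1, kernel_initializer="uniform"))
classifier.compile(optimizer='adam', loss='mean_squared_error')  # accuracy is meaningless for regression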
I have a dataset with 155 features and 40143 samples. It is sorted by date (oldest to newest); I then deleted the date column from the dataset.
The label is in the first column.
Cross-validation gives c. 65% (mean accuracy of scores +/- 0.01) with the code below:
def cross(dataset):
    dropz = ["result"]
    X = dataset.drop(dropz, axis=1)
    X = preprocessing.normalize(X)
    y = dataset["result"]
    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
    return scores.mean()  # mean CV accuracy
I also get similar accuracy with the code below:
def train(dataset):
    dropz = ["result"]
    X = dataset.drop(dropz, axis=1)
    X = preprocessing.normalize(X)
    y = dataset["result"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000, random_state=42)
    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
    return clf.score(X_test, y_test)
But if I don't use shuffle in the code below, it results in c. 49%.
If I use shuffle, it results in c. 65%.
I should mention that I tried every 1000-sample consecutive split of the whole set, from end to beginning, and the result is the same.
dataset = pd.read_csv("./dataset.csv", header=0,sep=";")
dataset = shuffle(dataset) #!!!???
X_train = dataset.iloc[:-1000,1:]
X_train = preprocessing.normalize(X_train)
y_train = dataset.iloc[:-1000,0]
X_test = dataset.iloc[-1000:,1:]
X_test = preprocessing.normalize(X_test)
y_test = dataset.iloc[-1000:,0]
clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1).fit(X_train, y_train)
clf.score(X_test, y_test)
Assuming your question is "Why does this happen":
Both your first and second code snippets have underlying shuffling (inside the cross-validation and train_test_split methods), therefore they are equivalent (both in score and algorithm) to your last snippet with shuffling "on".
Since your original dataset is ordered by date, there might be (and likely is) some drift in the data over time, which means that because your classifier never sees data from the last 1000 time points, it is unaware of the change in the underlying distribution and therefore fails to classify it.
Addendum, to address the further data in the comments:
This suggests that there might be some indicative process that is captured in smaller time frames. Two interesting ways to explore it are:
reduce the size of the test set until you find a window size at which the difference between shuffle/no-shuffle becomes negligible (see the sketch below);
this process essentially manifests as some dependence between your features, so you could check whether there is a dependence between your features within a small time frame.
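A minimal sketch of the first experiment (assuming dataset as loaded in your last snippet, with the label in column 0; tail_split_score is a hypothetical helper):
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.utils import shuffle

def tail_split_score(df, test_size):
    # Train on everything except the last `test_size` rows, test on that tail.
    X_train = preprocessing.normalize(df.iloc[:-test_size, 1:])
    y_train = df.iloc[:-test_size, 0]
    X_test = preprocessing.normalize(df.iloc[-test_size:, 1:])
    y_test = df.iloc[-test_size:, 0]
    clf = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)
    return clf.fit(X_train, y_train).score(X_test, y_test)

for size in (1000, 500, 250, 100, 50):
    print(size, tail_split_score(dataset, size), tail_split_score(shuffle(dataset), size))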
I started learning how to use Theano with Lasagne and began with the MNIST example. Now I want to try my own example: I have a train.csv file in which every row starts with 0 or 1, representing the correct answer, followed by 773 0s and 1s, which represent the input. I didn't understand how to turn this file into the wanted numpy arrays in the load_database() function. This is the part of the original function for the MNIST database:
...
with gzip.open(filename, 'rb') as f:
    data = pickle_load(f, encoding='latin-1')
# The MNIST dataset we have here consists of six numpy arrays:
# Inputs and targets for the training set, validation set and test set.
X_train, y_train = data[0]
X_val, y_val = data[1]
X_test, y_test = data[2]
...
# We just return all the arrays in order, as expected in main().
# (It doesn't matter how we do this as long as we can read them again.)
return X_train, y_train, X_val, y_val, X_test, y_test
and I need to get X_train (the input) and y_train (the first value of every row) from my csv file.
Thanks!
You can use numpy.genfromtxt() or numpy.loadtxt() as follows:
import numpy
from sklearn.model_selection import KFold  # sklearn.cross_validation is deprecated
Xy = numpy.genfromtxt('yourfile.csv', delimiter=",")
# the next section provides the required
# training-validation set splitting but
# you can do it manually too, if you want
skf = KFold(n_splits=3)  # the old KFold(len(Xy)) API defaulted to 3 folds
for train_index, valid_index in skf.split(Xy):
    ind_train, ind_valid = train_index, valid_index
    break
Xy_train, Xy_valid = Xy[ind_train], Xy[ind_valid]
X_train = Xy_train[:, 1:]
y_train = Xy_train[:, 0]
X_valid = Xy_valid[:, 1:]
y_valid = Xy_valid[:, 0]
...
# you can simply ignore the test sets in your case
return X_train, y_train, X_val, y_val #, X_test, y_test
In the code snippet above we skipped returning the test set.
Now you can import your dataset into the main module or script, but be aware that you need to remove all the test-set parts from it too.
Or alternatively, you can simply pass the valid sets as the test set:
# you can simply pass the valid sets as `test` set
return X_train, y_train, X_val, y_val, X_val, y_val
In the latter case you don't have to care about the sections of the main module that refer to the expected test set, but note that you will get the validation scores twice, i.e. reported also as test scores.
Note: I don't know which MNIST example that is, but probably, after you have prepared your data as above, you will have to make further modifications in your trainer module too, to suit your data. For example: the input shape of the data and the output shape, i.e. the number of classes; in your case the former is 773 and the latter is 2.
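As a rough sketch of what those shape changes might look like in Lasagne (layer names here are hypothetical; adapt them to the build function used in your example):
import lasagne

# 773 binary inputs per sample; batch size left flexible
l_in = lasagne.layers.InputLayer(shape=(None, 773))
l_hid = lasagne.layers.DenseLayer(l_in, num_units=256,
                                  nonlinearity=lasagne.nonlinearities.rectify)
# 2 output classes (0 or 1)
l_out = lasagne.layers.DenseLayer(l_hid, num_units=2,
                                  nonlinearity=lasagne.nonlinearities.softmax)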