I am a beginner in ML. I am helping my math-major friend build a stock predictor with TensorFlow based on a .csv file he provided.
I have a few problems. The first is his .csv file: it contains only dates and closing values, which are not separated, so I had to split the dates and values manually. I've managed to do that, and now I'm having trouble with MinMaxScaler(). I was told I could essentially disregard the dates, work only with the closing values, normalize them, and make a prediction based on them.
I keep getting this error:
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a
minimum of 1 is required by MinMaxScaler()
I honestly have never used scikit-learn or TensorFlow before, and this is my first time working on such a project. All the guides I see on the topic use pandas, but in my case the .csv file is a mess and I don't believe I can use pandas for it.
I'm following this DataCamp tutorial:
But unfortunately, due to my lack of experience, some things are not really working for me, and I would appreciate a little more clarity on how I should proceed in my case.
Attached below is my (messy) code:
import pandas as pd
import numpy as np
import tensorflow as tf
import sklearn
from sklearn.model_selection import KFold
from sklearn.preprocessing import scale
from sklearn.preprocessing import MinMaxScaler
import matplotlib
import matplotlib.pyplot as plt
from dateutil.parser import parse
from datetime import datetime, timedelta
from collections import deque
stock_data = []
stock_date = []
stock_value = []
f = open("s&p500closing.csv","r")
data = f.read()
rows = data.split("\n")
rows_noheader = rows[1:len(rows)]
#Separating values from messy `.csv`, putting each value in its own list and also a combined list of both
for row in rows_noheader:
    [date, value] = row[1:len(row)-1].split('\t')
    stock_date.append(date)
    stock_value.append((value))
    stock_data.append((date, value))
#Numpy array of all closing values converted to floats and normalized against the maximum
stock_value = np.array(stock_value, dtype=np.float32)
normvalue = [i/max(stock_value) for i in stock_value]
#Number of closing values and days. Since there is one closing value for each, they both match and there are 4528 of them (each)
nclose_and_days = 0
for i in range(len(stock_data)):
    nclose_and_days += 1
train_data = stock_value[:2264]
test_data = stock_value[2264:]
scaler = MinMaxScaler()
train_data = train_data.reshape(-1,1)
test_data = test_data.reshape(-1,1)
# Train the Scaler with training data and smooth data
smoothing_window_size = 1100
for di in range(0,4400,smoothing_window_size):
    # error occurs here
    scaler.fit(train_data[di:di+smoothing_window_size,:])
    train_data[di:di+smoothing_window_size,:] = scaler.transform(train_data[di:di+smoothing_window_size,:])

# You normalize the last bit of remaining data
scaler.fit(train_data[di+smoothing_window_size:,:])
train_data[di+smoothing_window_size:,:] = scaler.transform(train_data[di+smoothing_window_size:,:])
# Reshape both train and test data
train_data = train_data.reshape(-1)
# Normalize test data
test_data = scaler.transform(test_data).reshape(-1)
# Now perform exponential moving average smoothing
# So the data will have a smoother curve than the original ragged data
EMA = 0.0
gamma = 0.1
for ti in range(1100):
    EMA = gamma*train_data[ti] + (1-gamma)*EMA
    train_data[ti] = EMA
# Used for visualization and test purposes
all_mid_data = np.concatenate([train_data,test_data],axis=0)
window_size = 100
N = train_data.size
std_avg_predictions = []
std_avg_x = []
mse_errors = []
for pred_idx in range(window_size,N):
    std_avg_predictions.append(np.mean(train_data[pred_idx-window_size:pred_idx]))
    mse_errors.append((std_avg_predictions[-1]-train_data[pred_idx])**2)
    std_avg_x.append(date)
print('MSE error for standard averaging: %.5f'%(0.5*np.mean(mse_errors)))
I know this post is old, but since I stumbled here, others will too.
After running into the same problem and googling quite a bit, I found this post:
https://github.com/llSourcell/Make_Money_with_Tensorflow_2.0/issues/7
It seems that if the dataset you download is too small, it will throw that error.
Download a .csv going back to 1962 and it'll be big enough ;).
Now I just have to find the right parameters for my dataset, as I'm adapting this to another type of prediction.
Hope it helps.
The train_data variable has a length of 2264:
train_data = stock_value[:2264]
Then, when you go to fit the scaler, the loop steps outside train_data's bounds: once di reaches 3300, the slice train_data[di:di+smoothing_window_size,:] is empty, which is exactly what the error reports:
smoothing_window_size = 1100
for di in range(0, 4400, smoothing_window_size):
Notice the size of the dataset in the tutorial. The training and testing chunks each have length 11,000 and the smoothing_window_size is 2500, so the loop never exceeds train_data's boundaries.
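One way to avoid the empty slice is to derive the loop bound and window size from the length of train_data rather than hard-coding 4400 and 1100. This is only a sketch of the idea, not the tutorial's exact code, and the random array here merely stands in for the question's data:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the question's train_data: 2264 closing values as a column vector.
train_data = np.random.rand(2264, 1).astype(np.float32)

n_train = train_data.shape[0]
smoothing_window_size = n_train // 4          # four full windows instead of a hard-coded 1100

scaler = MinMaxScaler()
for di in range(0, n_train - smoothing_window_size + 1, smoothing_window_size):
    window = train_data[di:di + smoothing_window_size, :]
    scaler.fit(window)
    train_data[di:di + smoothing_window_size, :] = scaler.transform(window)

# Scale whatever is left over, but only if the remaining slice is non-empty.
remainder = train_data[di + smoothing_window_size:, :]
if remainder.shape[0] > 0:
    scaler.fit(remainder)
    train_data[di + smoothing_window_size:, :] = scaler.transform(remainder)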
You have a column of all 0's in your data. If you try to scale it, MinMaxScaler can't assign a scale and it trips up. You need to filter out empty/all-zero columns before you scale the data. Try:
stock_value = stock_value[:, ~np.all(np.isnan(stock_value), axis=0)]
to filter out NaN columns in your data.
I have to apologize: while you were all trying to figure out a solution to my issue, I ended up finding a decent guide and taking a much less sophisticated approach (this was my first ever taste of AI and statistics). The funny thing is, I was breaking my head over this for months, until I went to a conference in Florida last November and ended up finishing it in less than two hours, at 3 am, in my hotel room.
Here is the finished code I wrote back then and ended up presenting to my colleague as a working example:
import tensorflow as tf
from keras import backend as K
from tensorflow.python.saved_model import builder as saved_model_builder
from tensorflow.python.saved_model import tag_constants, signature_constants, signature_def_utils_impl
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD
import numpy as np
import matplotlib.pyplot as plt
stock_data = []
stock_date = []
stock_value = []
f = open("s&p500closing.csv","r")
data = f.read()
rows = data.split("\n")
rows_noheader = rows[1:len(rows)]
#Separating values from messy CSV, putting each value in its own list and also a combined list of both
for row in rows_noheader:
    [date, value] = row[1:len(row)-1].split('\t')
    stock_date.append(date)
    stock_value.append((value))
    stock_data.append((date, value))
#Making an array of arrays ready for use with TF,
#slicing array of data to smaller train data
#and normalizing the values against the max for training
stock_value = np.array(stock_value, dtype=np.float32)
normvalue = [i/max(stock_value) for i in stock_value]
normvalue = np.array(normvalue)
train_data = [np.array(i) for i in normvalue[:500]]
train_data = np.array(train_data)
train_labels = train_data
#First plotting the actual values
plt.plot(normvalue)
#Creating TF session
sess = tf.Session()
K.set_session(sess)
K.set_learning_phase(0)
model_version = "2"
#Declaring the amount of epochs, the amount of periods the machine will learn
#(can play around with it)
epoch = 20
#Building the model
####################
model = Sequential()
model.add(Dense(8, input_dim=1))
model.add(Activation('tanh'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
sgd = SGD(lr=0.1)
#Compiling and fitting our data to the model
model.compile(loss='binary_crossentropy', optimizer=sgd)
model.fit(train_data, train_labels, batch_size=1, nb_epoch=epoch)
#Declaring variables for the model's input and output to make sure they are all valid
x = model.input
y = model.output
prediction_signature = tf.saved_model.signature_def_utils.predict_signature_def({"inputs": x}, {"prediction":y})
valid_prediction_signature = tf.saved_model.signature_def_utils.is_valid_signature(prediction_signature)
if(valid_prediction_signature == False):
    raise ValueError("Error: Prediction signature not valid!")
#Here the actual prediction of the real values occurs
predictions = model.predict(normvalue)
#Plotting the prediction values
plt.xlabel("Blue: Actual Orange: Prediction")
plt.plot(predictions)
Please feel free to make changes and experiment around with it as you deem fit.
I would like to thank you all for taking the time to examine my issue and provide a variety of solutions, and I am looking forward to learning more in the future :)
This is my first time answering on Stack Overflow, so if you find errors in how I answer, or if you find mistakes, please correct me.
To keep the calculation simple, here is how you can avoid the ValueError above.
mid_prices = (high_prices+low_prices)/2.0
print(len(mid_prices))#length 2024
train_data = mid_prices[:1012]
test_data = mid_prices[1012:]
scaler = MinMaxScaler()
train_data = train_data.reshape(-1,1)
test_data = test_data.reshape(-1,1)
smoothing_window_size = 200
for di in range(0,1000,smoothing_window_size):
    scaler.fit(train_data[di:di+smoothing_window_size,:])
    train_data[di:di+smoothing_window_size,:] = scaler.transform(train_data[di:di+smoothing_window_size,:])
The code above works: my mid_prices variable has a length of 2024, so
train_data = mid_prices[:1012]
test_data = mid_prices[1012:]
splits it into two chunks of 1012 each.
Now, if you look at the code the tutorial provides, the total size there is 22,000, which is split into two 11,000-element chunks for train and test; then for the scaler the author uses a range from 0 to 10,000 in the for loop with a smoothing window size of 2500, which makes four iterations through that 10k set.
Using the same logic the author used, I did this:
smoothing_window_size = 200
for di in range(0,1000,smoothing_window_size):
    scaler.fit(train_data[di:di+smoothing_window_size,:])
    train_data[di:di+smoothing_window_size,:] = scaler.transform(train_data[di:di+smoothing_window_size,:])
which worked perfectly with my data and with the provided example.
I hope this answer suffices to solve the issue.
Your issue isn't your CSV or pandas. You can actually read the CSV with pandas straight into a DataFrame, which is what I recommend you do: df = pd.read_csv(path)
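For a file like the one described in the question (a header row followed by tab-separated date/closing-value lines), a sketch of the pandas route could look like the following; the column names and the tab separator are assumptions inferred from the split('\t') in the question's code, not something confirmed in the thread:
import pandas as pd

# Assumption: a header row, then tab-separated "date<TAB>close" lines.
df = pd.read_csv("s&p500closing.csv", sep="\t", names=["date", "close"],
                 header=0, parse_dates=["date"])
stock_value = df["close"].to_numpy(dtype="float32")
print(df.head())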
I am having the same issue with the same code.
What's happening is: scaler = MinMaxScaler(), then in the for di in range(...) part you are fitting the scaler on a chunk of the training set, transforming it, and reassigning it back to itself.
The problem is that it is trying to find more data in your training set to fit the scaler on, and it runs out of data, which is odd given the way the tutorial you are following presented it.
Your window is 1100, and your for loop goes from 0 to 4400 in steps of 1100. With that, there is nothing left over, which in turn leaves 0 items to normalize, so this part of your code:
# You normalize the last bit of remaining data
scaler.fit(train_data[di+smoothing_window_size:,:])
train_data[di+smoothing_window_size:,:] = scaler.transform(train_data[di+smoothing_window_size:,:])
You do not need those lines; just comment them out. It should work after that.
I ran into the same error message while writing a unit test for a text classification package based on logistic regression, and I realized it was due to trying to apply a model to an empty DataFrame.
In my case, this happened because my model was actually a tree of models mirroring the category tree in my (huge) training data, but only a few of those subcases actually occurred in the tiny test DataFrame in my unit test.
Long story short: I think the error probably happens because at some point
train_data[di:di+smoothing_window_size,:]
ends up having length 0 in your loop.
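If that is the cause, a defensive sketch (not the tutorial's code) is to stop fitting once a window comes back empty; the random array below stands in for the question's data:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_data = np.random.rand(2264, 1).astype(np.float32)   # stand-in for the question's array
smoothing_window_size = 1100
scaler = MinMaxScaler()

for di in range(0, 4400, smoothing_window_size):
    window = train_data[di:di + smoothing_window_size, :]
    if window.shape[0] == 0:
        break  # ran past the end of train_data; nothing left for the scaler
    scaler.fit(window)
    train_data[di:di + smoothing_window_size, :] = scaler.transform(window)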
I had a similar exception, only mine read:
"ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required by StandardScaler."
The shape denotes the structure of the given dataset. I realized that my code might not be finding any data to analyse and work on (hence the "0" rows in the shape). So I simply pointed it at the directory where my dataset actually lay, and voila, everything worked as expected! I used the glob() function to point to my directory and indicate the files of interest, in my case .wav audio files. Below is the line of code I had to correct:
for wav_file in glob.glob("/home/directory_1/directory_2/*.wav"):
I have been trying to find a way to load the EMNIST-letters dataset, but without much success. I have found some interesting things in its structure and can't wrap my head around what is happening. Here is what I mean:
I downloaded the .mat format here.
I can load the data using
import scipy.io
mat = scipy.io.loadmat('letter_data.mat') # renamed for convenience
It is a dictionary with the following keys:
dict_keys(['__header__', '__version__', '__globals__', 'dataset'])
The only key of interest is dataset, which I haven't been able to extract data from. Printing its shape gives this:
>>>print(mat['dataset'].shape)
(1, 1)
I dug deeper and deeper to find a shape that looks somewhat like a real dataset and came across this:
>>>print(mat['dataset'][0][0][0][0][0][0].shape)
(124800, 784)
which is exactly what I wanted, but I can't find the labels or the test data. I've tried many things but can't seem to understand the structure of this dataset.
If someone could tell me what is going on with this, I would appreciate it.
Because of the way the dataset is structured, the array of image arrays can be accessed with mat['dataset'][0][0][0][0][0][0] and the array of label arrays with mat['dataset'][0][0][0][0][0][1]. For instance, print(mat['dataset'][0][0][0][0][0][0][0]) will print out the pixel values of the first image, and print(mat['dataset'][0][0][0][0][0][1][0]) will print the first image's label.
For a less...convoluted dataset, I'd actually recommend using the CSV version of the EMNIST dataset on Kaggle: https://www.kaggle.com/crawford/emnist, where each row is a separate image, there are 785 columns where the first column = class_label and each column after represents one pixel value (784 total for a 28 x 28 image).
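If you go the CSV route, a sketch of loading it with pandas might look like this; the filename and the assumption that there is no header row are based on how the Kaggle archive is usually laid out, so adjust as needed:
import numpy as np
import pandas as pd

# Assumption: 'emnist-letters-train.csv' with no header, the label in column 0,
# and 784 pixel values in the remaining columns.
df = pd.read_csv("emnist-letters-train.csv", header=None)
labels = df.iloc[:, 0].to_numpy()
images = df.iloc[:, 1:].to_numpy(dtype=np.uint8).reshape(-1, 28, 28)
print(labels.shape, images.shape)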
Josh Payne's answer is correct, but I'll expand on it for those who want to use the .mat file, with an emphasis on typical data splits.
The data itself has already been split up in to a training and test set. Here's how I accessed the data:
from scipy import io as sio
mat = sio.loadmat('emnist-letters.mat')
data = mat['dataset']
X_train = data['train'][0,0]['images'][0,0]
y_train = data['train'][0,0]['labels'][0,0]
X_test = data['test'][0,0]['images'][0,0]
y_test = data['test'][0,0]['labels'][0,0]
There is an additional field 'writers' (e.g. data['train'][0,0]['writers'][0,0]) that distinguishes the original sample writer. Finally, there is another field data['mapping'], but I'm not sure what it is mapping the digits to.
In addition, in Section II-D, the EMNIST paper states that "the last portion of the training set, equal in size to the testing set, is set aside as a validation set". Strangely, the .mat file training/testing size does not match the number listed in Table II, but it does match the size in Fig. 2.
val_start = X_train.shape[0] - X_test.shape[0]
X_val = X_train[val_start:X_train.shape[0],:]
y_val = y_train[val_start:X_train.shape[0]]
X_train = X_train[0:val_start,:]
y_train = y_train[0:val_start]
If you don't want a validation set it is fine to leave these samples in the training set.
Also, if you would like to reshape the data into 2D 28x28 images instead of a 1D 784-element array, then to get the correct image orientation you'll need to do a numpy reshape using Fortran ordering (MATLAB uses column-major ordering, just like Fortran), e.g.:
X_train = X_train.reshape( (X_train.shape[0], 28, 28), order='F')
An alternative solution is to use the EMNIST python package. (Full details at https://pypi.org/project/emnist/)
This lets you pip install emnist in your environment then import the datasets (they will download when you run the program for the first time).
Example from the site:
>>> from emnist import extract_training_samples
>>> images, labels = extract_training_samples('digits')
>>> images.shape
(240000, 28, 28)
>>> labels.shape
(240000,)
You can also list the datasets
>>> from emnist import list_datasets
>>> list_datasets()
['balanced', 'byclass', 'bymerge', 'digits', 'letters', 'mnist']
And replace 'digits' in the first example with your choice.
This gives you all the data in numpy arrays which I have found makes things easy to work with.
I suggest downloading the 'Binary format as the original MNIST dataset' from the Yann LeCun website.
Unzip the downloaded file and then, in Python:
import idx2numpy
X_train = idx2numpy.convert_from_file('./emnist-letters-train-images-idx3-ubyte')
y_train = idx2numpy.convert_from_file('./emnist-letters-train-labels-idx1-ubyte')
X_test = idx2numpy.convert_from_file('./emnist-letters-test-images-idx3-ubyte')
y_test = idx2numpy.convert_from_file('./emnist-letters-test-labels-idx1-ubyte')
Hello everyone!
I am trying to develop a neural network using Keras and TensorFlow, which should be able to take variable-length arrays as input and give either a single value (see the toy example below) or classify them (that is a problem for later and will not be touched on in this question).
The idea is fairly simple.
We have variable length arrays. I am currently using very simple toy data, which is generated by the following code:
import numpy as np
import pandas as pd
from keras import models as kem
from keras import activations as kea
from keras import layers as kel
from keras import regularizers as ker
from keras import optimizers as keo
from keras import losses as kelo
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.preprocessing import MinMaxScaler  # needed for the mms scaler below
n = 100
x = pd.DataFrame(columns=['data','res'])
mms = MinMaxScaler(feature_range=(-1,1))
for i in range(n):
    k = np.random.randint(20,100)
    ss = np.random.randint(0,100,size=k)
    idres = np.sum(ss[np.arange(0,k,2)])-np.sum(ss[np.arange(1,k,2)])
    x.loc[i,'data'] = ss
    x.loc[i,'res'] = idres
x.res = mms.fit_transform(x.res)
x_train,x_test,y_train, y_test = train_test_split(x.data,x.res,test_size=0.2)
# sliding_window is a helper function defined elsewhere in my code (not shown here)
x_train = sliding_window(x_train.as_matrix(),2,2)
x_test = sliding_window(x_test.as_matrix(),2,2)
To put it simply, I generate arrays of random length, and the result (output) for each array is the sum of the even-indexed elements minus the sum of the odd-indexed elements. Obviously, it can be negative or positive. The output is then scaled to the range [-1, 1] to fit the tanh activation function.
The Sequential model is built as follows:
model = kem.Sequential()
model.add(kel.LSTM(20,return_sequences=False,input_shape=(None,2),recurrent_activation='tanh'))
model.add(kel.Dense(20,activation='tanh'))
model.add(kel.Dense(10,activation='tanh'))
model.add(kel.Dense(5,activation='tanh'))
model.add(kel.Dense(1,activation='tanh'))
sgd = keo.SGD(lr=0.1)
mseloss = kelo.mean_squared_error
model.compile(optimizer=sgd,loss=mseloss,metrics=['accuracy'])
And the model is trained in the following way:
def calcMSE(model, x_test, y_test):
    nTest = len(x_test)
    sum = 0
    for i in range(nTest):
        restest = model.predict(np.reshape(x_test[i],(1,-1,2)))
        sum += (restest-y_test[0,i])**2
    return sum/nTest

i = 1
mse = calcMSE(model,x_test,np.reshape(y_test.values,(1,-1)))
lrPar = 0
lrSteps = 30
while mse > 0.04:
    print("Epoch %i" % (i))
    print(mse)
    for j in range(len(x_train)):
        ntrain = j
        model.train_on_batch(np.reshape(x_train[ntrain],(1,-1,2)),np.reshape(y_train.values[ntrain],(-1,1)))
    i += 1
    mse = calcMSE(model,x_test,np.reshape(y_test.values,(1,-1)))
The problem is that the optimiser usually gets stuck around MSE=0.05 (on the test set). The last time I tested, it actually got stuck around MSE=0.12 (on the test data).
Moreover, if you look at what the model gives on the test data (left column) in comparison with the correct output (right column):
[[-0.11888303]] 0.574923547401
[[-0.17038491]] -0.452599388379
[[-0.20098214]] 0.065749235474
[[-0.22307695]] -0.437308868502
[[-0.2218809]] 0.371559633028
[[-0.2218741]] 0.039755351682
[[-0.22247596]] -0.434250764526
[[-0.17094387]] -0.151376146789
[[-0.17089397]] -0.175840978593
[[-0.16988073]] 0.025993883792
[[-0.16984619]] -0.117737003058
[[-0.17087571]] -0.515290519878
[[-0.21933308]] -0.366972477064
[[-0.09379648]] -0.178899082569
[[-0.17016701]] -0.333333333333
[[-0.17022927]] -0.195718654434
[[-0.11681376]] 0.452599388379
[[-0.21438009]] 0.224770642202
[[-0.12475857]] 0.151376146789
[[-0.2225963]] -0.380733944954
And on the training set the same comparison is:
[[-0.22209576]] -0.00764525993884
[[-0.17096499]] -0.247706422018
[[-0.22228305]] 0.276758409786
[[-0.16986915]] 0.340978593272
[[-0.16994311]] -0.233944954128
[[-0.22131597]] -0.345565749235
[[-0.17088912]] -0.145259938838
[[-0.22250554]] -0.792048929664
[[-0.17097935]] 0.119266055046
[[-0.17087702]] -0.2874617737
[[-0.1167363]] -0.0045871559633
[[-0.08695849]] 0.159021406728
[[-0.17082921]] 0.374617737003
[[-0.15422876]] -0.110091743119
[[-0.22185338]] -0.7125382263
[[-0.17069265]] -0.678899082569
[[-0.16963181]] -0.00611620795107
[[-0.17089556]] -0.249235474006
[[-0.17073657]] -0.414373088685
[[-0.17089497]] -0.351681957187
[[-0.17138508]] -0.0917431192661
[[-0.22351067]] 0.11620795107
[[-0.17079701]] -0.0795107033639
[[-0.22246087]] 0.22629969419
[[-0.17044055]] 1.0
[[-0.17090379]] -0.0902140672783
[[-0.23420531]] -0.0366972477064
[[-0.2155242]] 0.0366972477064
[[-0.22192241]] -0.675840978593
[[-0.22220723]] -0.354740061162
[[-0.1671907]] -0.10244648318
[[-0.22705412]] 0.0443425076453
[[-0.22943887]] -0.249235474006
[[-0.21681401]] 0.065749235474
[[-0.12495813]] 0.466360856269
[[-0.17085686]] 0.316513761468
[[-0.17092516]] 0.0275229357798
[[-0.17277785]] -0.325688073394
[[-0.22193027]] 0.139143730887
[[-0.17088208]] 0.422018348624
[[-0.17093034]] -0.0886850152905
[[-0.17091317]] -0.464831804281
[[-0.22241674]] -0.707951070336
[[-0.1735626]] -0.337920489297
[[-0.16984227]] 0.00764525993884
[[-0.16756304]] 0.515290519878
[[-0.22193302]] -0.414373088685
[[-0.22419722]] -0.351681957187
[[-0.11561158]] 0.17125382263
[[-0.16640976]] -0.321100917431
[[-0.21557514]] -0.313455657492
[[-0.22241823]] -0.117737003058
[[-0.22165506]] -0.646788990826
[[-0.22238114]] -0.261467889908
[[-0.1709189]] 0.0902140672783
[[-0.17698884]] -0.626911314985
[[-0.16984172]] 0.587155963303
[[-0.22226149]] -0.590214067278
[[-0.16950315]] -0.469418960245
[[-0.22180589]] -0.133027522936
[[-0.2224243]] -1.0
[[-0.22236891]] 0.152905198777
[[-0.17089345]] 0.435779816514
[[-0.17422611]] -0.233944954128
[[-0.17177556]] -0.324159021407
[[-0.21572633]] -0.347094801223
[[-0.21509495]] -0.646788990826
[[-0.17086846]] -0.34250764526
[[-0.17595944]] -0.496941896024
[[-0.16803505]] -0.382262996942
[[-0.16983894]] -0.348623853211
[[-0.17078683]] 0.363914373089
[[-0.21560851]] -0.186544342508
[[-0.22416025]] -0.374617737003
[[-0.1723443]] -0.186544342508
[[-0.16319042]] -0.0122324159021
[[-0.18837349]] -0.181957186544
[[-0.17371364]] -0.539755351682
[[-0.22232121]] -0.529051987768
[[-0.22187822]] -0.149847094801
As you can see, the model outputs are all quite close to each other, unlike the actual values, where the variability is much bigger (although, I should admit, negative values dominate in both the training and test sets).
What am I doing wrong here? Why does training get stuck, or is this a normal process and I should just leave it running much longer? (I ran several hundred epochs a couple of times and it still stayed stuck.) I also tried using a variable learning rate (for example, cosine annealing with restarts, as in I. Loshchilov and F. Hutter, "SGDR: Stochastic Gradient Descent with Warm Restarts", arXiv preprint arXiv:1608.03983, 2016).
I would appreciate any suggestions, both on the network structure and training approach and on the coding/implementation side.
Thank you very much in advance for your help.
I want to make a program that recognizes the digit in an image. I am following the tutorial in scikit-learn.
I can train and fit the SVM classifier as follows.
First, I import the libraries and the dataset:
from sklearn import datasets, svm, metrics
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
Second, I create the SVM model and train it with the dataset.
classifier = svm.SVC(gamma = 0.001)
classifier.fit(data[:n_samples], digits.target[:n_samples])
And then, I try to read my own image and use the function predict() to recognize the digit.
Here is my image:
I resize the image to (8, 8) and then convert it to a 1D array.
img = misc.imread("w1.jpg")
img = misc.imresize(img, (8, 8))
img = img[:, :, 0]
Finally, when I print out the prediction, it returns [1]
predicted = classifier.predict(img.reshape((1,img.shape[0]*img.shape[1] )))
print predicted
Whatever other images I use, it still returns [1].
When I print out the "default" dataset of number "9", it looks like:
My image number "9" :
You can see the non-zero number is quite large for my image.
I dont know why. I am looking for help to solve my problem. Thanks
My best bet would be that there is a problem with your data types and array shapes.
It looks like you are training on numpy arrays that are of the type np.float64 (or possibly np.float32 on 32 bit systems, I don't remember) and where each image has the shape (64,).
Meanwhile your input image for prediction, after the resizing operation in your code, is of type uint8 and shape (1, 64).
I would first try changing the shape of your input image since dtype conversions often just work as you would expect. So change this line:
predicted = classifier.predict(img.reshape((1,img.shape[0]*img.shape[1] )))
to this:
predicted = classifier.predict(img.reshape(img.shape[0]*img.shape[1]))
If that doesn't fix it, you can always try recasting the data type as well with
img = img.astype(digits.images.dtype).
I hope that helps. Debugging by proxy is a lot harder than actually sitting in front of your computer :)
Edit: According to the scikit-learn documentation, the training data contains integer values from 0 to 16. The values in your input image should be scaled to fit the same interval. (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits)
1) You need to create your own training set, based on data similar to what you will be making predictions on. The call to datasets.load_digits() in scikit-learn loads a preprocessed handwritten-digits dataset (the UCI ML hand-written digits data, 8x8 images), which, for all we know, could contain very different images from the ones that you are trying to recognise.
2) You need to set the parameters of your classifier properly. The call to svm.SVC(gamma = 0.001) just picks an arbitrary value for the gamma parameter of SVC, which may not be the best option. In addition, you are not configuring the C parameter, which is pretty important for SVMs. I'd bet that this is one of the reasons why your output is 'always 1'.
3) Whatever final settings you choose for your model, you'll need to use a cross-validation scheme to ensure that the algorithm is learning effectively.
There's a lot of machine learning theory behind this, but, as a good start, I would really recommend having a look at SVM - scikit-learn for a more in-depth description of how the SVC implementation in scikit-learn works, and GridSearchCV for a simple technique for parameter setting; a minimal sketch of such a grid search follows.
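To make the parameter-setting point concrete, here is a rough sketch of a grid search over C and gamma on the built-in digits data; the grid values are illustrative choices, not recommendations from the answer above:
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Cross-validated search over a small grid of C and gamma values.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.0001, 0.001, 0.01, 0.1]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV score:", search.best_score_)
print("Held-out test score:", search.score(X_test, y_test))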
It's just a guess, but... the training set from scikit-learn is black numbers on a white background, and you are trying to predict numbers which are white on a black background...
I think you should either build your own training set, or train on the negative (inverted) version of your pictures; a small sketch of the inversion is shown below.
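For what it's worth, a quick sketch of that inversion on an 8-bit grayscale array (the random image here is just a stand-in for the question's picture):
import numpy as np

# Stand-in for the question's grayscale image (uint8, values 0-255).
img = np.random.randint(0, 256, size=(8, 8), dtype=np.uint8)

# Invert: white-on-black becomes black-on-white.
img_inverted = 255 - img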
I hope this helps!
If you look at:
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits
you can see that each point in the matrix has a value between 0 and 16.
You can try transforming the values of your image into the 0-16 range. I did that, and now the prediction works well for the digit 9, but not for 8 and 6. It no longer gives 1 every time.
from sklearn import datasets, svm, metrics
import cv2
import numpy as np
# Load digit database
digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
# Train SVM classifier
classifier = svm.SVC(gamma = 0.001)
classifier.fit(data[:n_samples], digits.target[:n_samples])
# Read image "9"
img = cv2.imread("w1.jpg")
img = img[:,:,0];
img = cv2.resize(img, (8, 8))
# Normalize the values in the image to the 0-16 range
minValueInImage = np.min(img)
maxValueInImage = np.max(img)
normalizedImg = np.floor(np.divide((img - minValueInImage).astype(np.float),(maxValueInImage-minValueInImage).astype(np.float))*16)
# Predict
predicted = classifier.predict(normalizedImg.reshape((1, normalizedImg.shape[0]*normalizedImg.shape[1])))
print predicted
I have solved this problem using the methods below:
1) Check the number of attributes: too large or too small.
2) Check the scale of your gray values: I changed mine to [0, 16].
3) Check the data type: I changed mine to uint8.
4) Check the amount of training data: too small or not.
I hope it helps. ^.^
Hi, in addition to @carrdelling's response, I will add that you may use the same training set if you normalize your images to have the same range of values.
For example, you could binarize your data (1 if > 0, 0 otherwise), or you could divide by the maximum intensity in your image to get values in the interval [0, 1].
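A quick sketch of both options on a NumPy image array (the array here is a made-up stand-in for the question's image):
import numpy as np

# Stand-in for an 8x8 grayscale image with values 0-255.
img = np.random.randint(0, 256, size=(8, 8)).astype(np.float64)

binarized = (img > 0).astype(np.float64)   # 1 if > 0, 0 otherwise
scaled = img / img.max()                   # values in [0, 1]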
You probably want to extract features relevant to your data set from the images and train your model on them.
One example I copied from here.
surf = cv2.SURF(400)
kp, des = surf.detectAndCompute(img,None)
But the SURF features may not be the most useful or relevant to your dataset and training task. You should try other features too, such as HOG.
Remember: the more high-level the features you extract, the more general and error-tolerant your model will be on unseen images. However, you may be sacrificing accuracy on your known samples and test cases.
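For instance, a rough sketch of extracting HOG features with scikit-image; the parameter values are just illustrative defaults, not tuned for this task:
import numpy as np
from skimage.feature import hog

# Stand-in for a grayscale input image; in practice this would be your digit image.
img = np.random.rand(64, 64)

features = hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
print(features.shape)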
I have a dataset (71,094 train images and 17,000 test images) for which I need to train a CNN. During preprocessing, I tried creating a matrix using numpy that turns out to be ridiculously large (71094 x 100 x 100 x 3 for the train data), since all images are RGB, 100 by 100. Hence I get a memory error. How do I tackle this situation? Please help.
This is my code:
import numpy as np
import cv2
from matplotlib import pyplot as plt
data_dir = './fashion-data/images/'
train_data = './fashion-data/train.txt'
test_data = './fashion-data/test.txt'
f = open(train_data, 'r').read()
ims = f.split('\n')
print len(ims)
train = np.zeros((71094, 100, 100, 3)) #this line causes the error..
for ix in range(train.shape[0]):
    i = cv2.imread(data_dir + ims[ix] + '.jpg')
    label = ims[ix].split('/')[0]
    train[ix, :, :, :] = cv2.resize(i, (100, 100))
print train[0]
train_labels = np.zeros((71094, 1))
for ix in range(train_labels.shape[0]):
    l = ims[ix].split('/')[0]
    train_labels[ix] = int(l)
print train_labels[0]
np.save('./data/train', train)
np.save('./data/train_labels', train_labels)
I recently ran into the same problem, and I believe it's a common problem when working with image data.
There are a number of methods you can use to tackle this problem depending on what you would like to do.
1) It can make sense to sample the data from each image when training, so as not to train on ALL 71094 x 100 x 100 pixels. This can be done simply by creating a function which loads one image at a time and samples your pixels. There is some argument that doing this randomly for each epoch can reduce overfitting, but again this depends on the exact problem. Stratified sampling may also help to balance the classes, should you be working with pixel classification.
2) Mini-batch training: split your training data into small "mini-batches" and train on each of these individually. Your epoch will end after you have completed training on all of your mini-batches, i.e. on all of your data. Here you should, each epoch, randomise the order of the data in order to avoid overfitting.
3) Load and train on one image at a time: similar to mini-batch training, but just use one image as a "mini-batch" for each iteration, and run a for-loop through all of the images in the folder. This way only 1x100x100x3 is stored in memory at a time. Depending on the size of your memory, you could perhaps use more than one image per mini-batch, i.e. Nx100x100x3, and run for 71094/N iterations to go over all the training data. A minimal generator sketch for this is shown below.
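As a rough illustration of options 2 and 3, here is a sketch of a generator that loads and resizes images a small batch at a time instead of building the full 71094x100x100x3 array up front; the paths and label parsing mirror the question's code, but treat it as a template rather than a drop-in fix:
import numpy as np
import cv2

def image_batches(image_ids, data_dir, batch_size=32):
    # Yield (images, labels) batches, keeping only batch_size images in memory at once.
    for start in range(0, len(image_ids), batch_size):
        batch_ids = image_ids[start:start + batch_size]
        images = np.zeros((len(batch_ids), 100, 100, 3), dtype=np.float32)
        labels = np.zeros((len(batch_ids),), dtype=np.int64)
        for k, im_id in enumerate(batch_ids):
            img = cv2.imread(data_dir + im_id + '.jpg')
            images[k] = cv2.resize(img, (100, 100))
            labels[k] = int(im_id.split('/')[0])
        yield images, labels

# Usage sketch: feed one batch at a time to whatever model you train.
# for x_batch, y_batch in image_batches(ims, './fashion-data/images/'):
#     model.train_on_batch(x_batch, y_batch)   # assuming a Keras-style model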
I hope this was clear.. and that it helps somewhat!