I am following through the tutorial here:
https://pythonprogramming.net/train-test-tensorflow-deep-learning-tutorial/
I can get the Neural Network trained and print out the accuracy.
However, I do not know how to use the Neural Network to make a prediction.
Here is my attempt. Specifically the issue is this line - I believe my issue is that I cannot get my input string into the format the model expects:
features = get_features_for_input("This was the best store i've ever seen.")
result = (sess.run(tf.argmax(prediction.eval(feed_dict={x:features}),1)))
Here is a larger listing:
def train_neural_network(x):
prediction = neural_network_model(x)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
optimizer = tf.train.AdamOptimizer().minimize(cost)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for epoch in range(hm_epochs):
epoch_loss = 0
i = 0
while i < len(train_x):
start = i
end = i + batch_size
batch_x = np.array(train_x[start:end])
batch_y = np.array(train_y[start:end])
_, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
epoch_loss += c
i+=batch_size
print('Epoch', epoch, 'completed out of', hm_epochs, 'loss:', epoch_loss)
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct,'float'))
print('Accuracy', accuracy.eval({x:test_x, y:test_y}))
# pos: [1,0] , argmax: 0
# neg: [0,1] , argmax: 1
features = get_features_for_input("This was the best store i've ever seen.")
result = (sess.run(tf.argmax(prediction.eval(feed_dict={x:features}),1)))
if result[0] == 0:
print('Positive:',input_data)
elif result[0] == 1:
print('Negative:',input_data)
def get_features_for_input(input):
current_words = word_tokenize(input.lower())
current_words = [lemmatizer.lemmatize(i) for i in current_words]
features = np.zeros(len(lexicon))
for word in current_words:
if word.lower() in lexicon:
index_value = lexicon.index(word.lower())
# OR DO +=1, test both
features[index_value] += 1
features = np.array(list(features))
train_neural_network(x)
Following your comment above, it feels like your error ValueError: Cannot feed value of shape () is due to the fact that features is None, because your function get_features_for_input doesn't return anything.
I added the return features line and gave features a correct shape of [1, len(lexicon)] to match the shape of the placeholder.
def get_features_for_input(input):
current_words = word_tokenize(input.lower())
current_words = [lemmatizer.lemmatize(i) for i in current_words]
features = np.zeros((1, len(lexicon)))
for word in current_words:
if word.lower() in lexicon:
index_value = lexicon.index(word.lower())
# OR DO +=1, test both
features[0, index_value] += 1
return features
Your get_features_for_input function returns a single list representing features of a sentences but for feed_dict, the input needs to be of size [num_examples, features_size], here num_examples is 1.
The following code should work.
def get_features_for_input(input):
current_words = word_tokenize(input.lower())
current_words = [lemmatizer.lemmatize(i) for i in current_words]
features = np.zeros(len(lexicon))
for word in current_words:
if word.lower() in lexicon:
index_value = lexicon.index(word.lower())
# OR DO +=1, test both
features[index_value] += 1
features = np.array(list(features))
batch_features = []
batch_features[0] = features
return np.array(batch_features)
Basic funda for any machine learning algorithm is dimention should be same during training and testing.
During training you created matrix shape number of training samples, len(lexicon). Here you are trying bag of words approach and lexicons are nothing but the unique word in your training data.
During testing your input vector size should be same as your vector size for training. And it is just the size of lexicon created during training. Also each element in test vector defines the corresponding index word in lexicons.
Now come to your problem, in get_features_for_input(input) you used the lexicon, you must have defined somewhere in program. Given the error what I conclude is your lexicon list is empty, so in get_features_for_input function features = np.zeros(len(lexicon)) will produce array of zero shape and also never enters in loop.
Few expected modifications:
You can find function create_feature_sets_and_labels in your tutorial. That returns your cleaned formatted training data. Change return statement to return the lexicon list along with data.
return train_x,train_y,test_x,test_y,lexicon
Make small change to collect lexicon list, ref:here
train_x,train_y,test_x,test_y,lexicon = create_feature_sets_and_labels('/path/to/pos.txt','/path/to/neg.txt')
And just pass this lexicon list alongwith your input to get_features_for_input function
features = get_features_for_input("This was the best store i've ever seen.",lexicon)
Make small change in get_features_for_input function
def get_features_for_input(text,lexicon):
featureset = []
current_words = word_tokenize(text.lower())
current_words = [lemmatizer.lemmatize(i) for i in current_words]
features = np.zeros(len(lexicon))
for word in current_words:
if word.lower() in lexicon:
index_value = lexicon.index(word.lower())
features[index_value] += 1
featureset.append(features)
return np.asarray(featureset)
I am writing a KNN classifier taken from here for character recognition from accelerometer and gyroscopic data.But, the below functions are not working correctly and prediction is not happening.Are there any mistakes in below code? kindly guide me.
trainingset-> training data with 20 samples(10=A,10=B).
testset-> live reading taken for recognition.
#-- KNN Classifier Functions ----------
def loaddataset():
global trainingset
with open('imudata.csv','rb') as csvfile:
lines = csv.reader(csvfile)
dataset = list(lines)
for x in range(len(dataset)):
trainingset.append(dataset[x])
def euclideandistance(instance1,instance2,length):
distance = 0
for x in range(length-1):
instance1[x] = float(instance1[x])
instance2[x] = float(instance2[x])
for x in range(length-1):
distance += pow((instance1[x]-instance2[x]),2)
return math.sqrt(distance)
def getneighbours(trainingset,testinstance,k):
distances = []
length = len(testinstance)-1
for x in range(len(trainingset)):
dist = euclideandistance(testinstance, trainingset[x],length)
#print(trainingset[x][-1],dist)
distances.append((trainingset[x],dist))
#print(distances)
distances.sort(key=operator.itemgetter(1))
#print(distances)
neighbours = []
print('k='+repr(k)+'length of distances='+repr(len(distances)))
for x in range(k):
neighbours.append(distances[x][0])
return neighbours
def getresponse(neighbours):
classvotes = {}
for x in range(len(neighbours)):
response = neighbours[x][-1]
if response in classvotes:
classvotes[response] += 1
else:
classvotes[response] = 1
sortedvotes = sorted(classvotes.iteritems(), key=operator.itemgetter(1), reverse=True)
return sortedvotes[0][0]
def getaccuracy(testset, predictions):
correct = 0
for x in range(len(testset)):
if testset[x][-1] is predictions[x]:
correct +=1
return ((correct/float(len(testset))) * 100.0)
#------- END of KNN Classifier Functions -------------
My main compare function is
def compare():
loaddataset()
testset.append(testdata)
print 'Train set: '+ repr(len(trainingset))
print 'Test set: '+ repr(len(testset))
predictions=[]
k = len(trainingset)
for x in range(len(testset)):
neighbours = getneighbours(trainingset,testset[x],k)
result = getresponse(neighbours)
predictions.append(result)
print('>Predicted=' +repr(result)+', actual=' + repr(testset[x][-1]))
accuracy = getaccuracy(testset, predictions)
print('Accuracy: '+repr(accuracy)+'%')
My output is
Train set: 20
Test set: 1
k=20 length of distance=20
>Predicted='A', actual='B'
Accuracy: 0.0%
My sample data packet:
-1.1945864763443935e-16,1.0000000000000031,0.81335962823925234,1.2678119727931405,4.6396523259663871,3,1.0000000000000013,108240.99999999988,328.99999999999966,4.3008487686466931e-16,1.000000000000002,0.73006871826334618,0.88693535629714804,4.3903300136708818,15,1.0000000000000011,108240.99999999977,328.99999999999932,1.990977460573989e-16,1.0000000000000009,0.8120281400849243,1.3556881217171162,4.2839744646260876,9,1.0000000000000004,108240.99999999994,328.99999999999983,-3.4217816017322454e-16,1.0000000000000009,0.7842111273340705,1.0882622268942712,4.4762484049613418,4,1.0000000000000004,108241.00000000038,329.00000000000114,2.6996304550155782e-18,1.000000000000004,0.76504908035654873,1.1890598964371606,4.2138613873737967,7,1.000000000000002,108241.0000000001,329.00000000000028,7.154020705791282e-17,1.0,0.83945423805187047,1.4309844267934049,3.7008217934312198,6,1.0,108240.99999999983,328.99999999999949,-0.66014932688009009,0.48967404184734276,0.083592048161537938,A
I am from hardware and dont know much about KNN, thatswhy I am asking for corrections in my code if any.I added my dataset here.
I could see from your data that number of samples is very less that no. of features, that may affect the accuracy of prediction and the number of samples need to be very high. You can't expect to predict everything correctly, algorithms have their own accuracies. Try to check the correctness of this code by using any other well-known datasets like iris Or try to use built-in knn classifier from scikit learn python.
I am quite new to using python for machine learning. I come from a background of programming in Fortran, so as you may imagine, python is quite a leap. I work in chemistry and have become involved in chemiformatics (applying data science techniques to chemistry). As such, the application of pythons extensive machine learning libraries is important. I also need my codes to be efficent. I have written a code which runs and seems to work OK. What I would like to know is:
1 How best to improve it/make it more efficient.
2 Any suggestions on alternative formulations to those I have used and if possible a reason why another route maybe superior?
I tend to work with continuous data and regression models.
Any suggestions would be great and thank you in advance for those.
import scipy
import math
import numpy as np
import pandas as pd
import plotly.plotly as py
import os.path
import sys
from time import time
from sklearn import preprocessing, metrics, cross_validation
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import KFold
fname = str(raw_input('Please enter the input file name containing total dataset and descriptors (assumes csv file, column headings and first column are labels\n'))
if os.path.isfile(fname) :
SubFeAll = pd.read_csv(fname, sep=",")
else:
sys.exit("ERROR: input file does not exist")
#SubFeAll = pd.read_csv(fname, sep=",")
SubFeAll = SubFeAll.fillna(SubFeAll.mean()) # replace the NA values with the mean of the descriptor
header = SubFeAll.columns.values # Use the column headers as the descriptor labels
SubFeAll.head()
# Set the numpy global random number seed (similar effect to random_state)
np.random.seed(1)
# Random Forest results initialised
RFr2 = []
RFmse = []
RFrmse = []
# Predictions results initialised
RFpredictions = []
metcount = 0
# Give the array from pandas to numpy
npArray = np.array(SubFeAll)
print header.shape
npheader = np.array(header[1:-1])
print("Array shape X = %d, Y = %d " % (npArray.shape))
datax, datay = npArray.shape
# Print specific nparray values to check the data
print("The first element of the input data set, as a minial check please ensure this is as expected = %s" % npArray[0,0])
# Split the data into: names labels of the molecules ; y the True results ; X the descriptors for each data point
names = npArray[:,0]
X = npArray[:,1:-1].astype(float)
y = npArray[:,-1] .astype(float)
X = preprocessing.scale(X)
print X.shape
# Open output files
train_name = "Training.csv"
test_name = "Predictions.csv"
fi_name = "Feature_importance.csv"
with open(train_name,'w') as ftrain, open(test_name,'w') as fpred, open(fi_name,'w') as ffeatimp:
ftrain.write("This file contains the training information for the Random Forest models\n")
ftrain.write("The code use a ten fold cross validation 90% training 10% test at each fold so ten training sets are used here,\n")
ftrain.write("Interation %d ,\n" %(metcount+1))
fpred.write("This file contains the prediction information for the Random Forest models\n")
fpred.write("Predictions are made over a ten fold cross validation hence training on 90% test on 10%. The final prediction are return iteratively over this ten fold cros validation once,\n")
fpred.write("optimised parameters are located via a grid search at each fold,\n")
fpred.write("Interation %d ,\n" %(metcount+1))
ffeatimp.write("This file contains the feature importance information for the Random Forest model,\n")
ffeatimp.write("Interation %d ,\n" %(metcount+1))
# Begin the K-fold cross validation over ten folds
kf = KFold(datax, n_folds=10, shuffle=True, random_state=0)
print "------------------- Begining Ten Fold Cross Validation -------------------"
for train, test in kf:
XTrain, XTest, yTrain, yTest = X[train], X[test], y[train], y[test]
ytestdim = yTest.shape[0]
print("The test set values are : ")
i = 0
if ytestdim%5 == 0:
while i < ytestdim:
print round(yTest[i],2),'\t', round(yTest[i+1],2),'\t', round(yTest[i+2],2),'\t', round(yTest[i+3],2),'\t', round(yTest[i+4],2)
ftrain.write(str(round(yTest[i],2))+','+ str(round(yTest[i+1],2))+','+str(round(yTest[i+2],2))+','+str(round(yTest[i+3],2))+','+str(round(yTest[i+4],2))+',\n')
i += 5
elif ytestdim%4 == 0:
while i < ytestdim:
print round(yTest[i],2),'\t', round(yTest[i+1],2),'\t', round(yTest[i+2],2),'\t', round(yTest[i+3],2)
ftrain.write(str(round(yTest[i],2))+','+str(round(yTest[i+1],2))+','+str(round(yTest[i+2],2))+','+str(round(yTest[i+3],2))+',\n')
i += 4
elif ytestdim%3 == 0 :
while i < ytestdim :
print round(yTest[i],2),'\t', round(yTest[i+1],2),'\t', round(yTest[i+2],2)
ftrain.write(str(round(yTest[i],2))+','+str(round(yTest[i+1],2))+','+str(round(yTest[i+2],2))+',\n')
i += 3
elif ytestdim%2 == 0 :
while i < ytestdim :
print round(yTest[i],2), '\t', round(yTest[i+1],2)
ftrain.write(str(round(yTest[i],2))+','+str(round(yTest[i+1],2))+',\n')
i += 2
else :
while i< ytestdim :
print round(yTest[i],2)
ftrain.write(str(round(yTest[i],2))+',\n')
i += 1
print "\n"
# random forest grid search parameters
print "------------------- Begining Random Forest Grid Search -------------------"
rfparamgrid = {"n_estimators": [10], "max_features": ["auto", "sqrt", "log2"], "max_depth": [5,7]}
rf = RandomForestRegressor(random_state=0,n_jobs=2)
RfGridSearch = GridSearchCV(rf,param_grid=rfparamgrid,scoring='mean_squared_error',cv=10)
start = time()
RfGridSearch.fit(XTrain,yTrain)
# Get best random forest parameters
print("GridSearchCV took %.2f seconds for %d candidate parameter settings" %(time() - start,len(RfGridSearch.grid_scores_)))
RFtime = time() - start,len(RfGridSearch.grid_scores_)
#print(RfGridSearch.grid_scores_) # Diagnos
print("n_estimators = %d " % RfGridSearch.best_params_['n_estimators'])
ne = RfGridSearch.best_params_['n_estimators']
print("max_features = %s " % RfGridSearch.best_params_['max_features'])
mf = RfGridSearch.best_params_['max_features']
print("max_depth = %d " % RfGridSearch.best_params_['max_depth'])
md = RfGridSearch.best_params_['max_depth']
ftrain.write("Random Forest")
ftrain.write("RF search time, %s ,\n" % (str(RFtime)))
ftrain.write("Number of Trees, %s ,\n" % str(ne))
ftrain.write("Number of feature at split, %s ,\n" % str(mf))
ftrain.write("Max depth of tree, %s ,\n" % str(md))
# Train random forest and predict with optimised parameters
print("\n\n------------------- Starting opitimised RF training -------------------")
optRF = RandomForestRegressor(n_estimators = ne, max_features = mf, max_depth = md, random_state=0)
optRF.fit(XTrain, yTrain) # Train the model
RFfeatimp = optRF.feature_importances_
indices = np.argsort(RFfeatimp)[::-1]
print("Training R2 = %5.2f" % optRF.score(XTrain,yTrain))
print("Starting optimised RF prediction")
RFpreds = optRF.predict(XTest)
print("The predicted values now follow :")
RFpredsdim = RFpreds.shape[0]
i = 0
if RFpredsdim%5 == 0:
while i < RFpredsdim:
print round(RFpreds[i],2),'\t', round(RFpreds[i+1],2),'\t', round(RFpreds[i+2],2),'\t', round(RFpreds[i+3],2),'\t', round(RFpreds[i+4],2)
i += 5
elif RFpredsdim%4 == 0:
while i < RFpredsdim:
print round(RFpreds[i],2),'\t', round(RFpreds[i+1],2),'\t', round(RFpreds[i+2],2),'\t', round(RFpreds[i+3],2)
i += 4
elif RFpredsdim%3 == 0 :
while i < RFpredsdim :
print round(RFpreds[i],2),'\t', round(RFpreds[i+1],2),'\t', round(RFpreds[i+2],2)
i += 3
elif RFpredsdim%2 == 0 :
while i < RFpredsdim :
print round(RFpreds[i],2), '\t', round(RFpreds[i+1],2)
i += 2
else :
while i< RFpredsdim :
print round(RFpreds[i],2)
i += 1
print "\n"
RFr2.append(optRF.score(XTest, yTest))
RFmse.append( metrics.mean_squared_error(yTest,RFpreds))
RFrmse.append(math.sqrt(RFmse[metcount]))
print ("Random Forest prediction statistics for fold %d are; MSE = %5.2f RMSE = %5.2f R2 = %5.2f\n\n" % (metcount+1, RFmse[metcount], RFrmse[metcount],RFr2[metcount]))
ftrain.write("Random Forest prediction statistics for fold %d are, MSE =, %5.2f, RMSE =, %5.2f, R2 =, %5.2f,\n\n" % (metcount+1, RFmse[metcount], RFrmse[metcount],RFr2[metcount]))
ffeatimp.write("Feature importance rankings from random forest,\n")
for i in range(RFfeatimp.shape[0]) :
ffeatimp.write("%d. , feature %d , %s, (%f),\n" % (i + 1, indices[i], npheader[indices[i]], RFfeatimp[indices[i]]))
# Store prediction in original order of data (itest) whilst following through the current test set order (j)
metcount += 1
ftrain.write("Fold %d, \n" %(metcount))
print "------------------- Next Fold %d -------------------" %(metcount+1)
j = 0
for itest in test :
RFpredictions.append(RFpreds[j])
j += 1
lennames = names.shape[0]
lenpredictions = len(RFpredictions)
lentrue = y.shape[0]
if lennames == lenpredictions == lentrue :
fpred.write("Names/Label,, Prediction Random Forest,, True Value,\n")
for i in range(0,lennames) :
fpred.write(str(names[i])+",,"+str(RFpredictions[i])+",,"+str(y[i])+",\n")
else :
fpred.write("ERROR - names, prediction and true value array size mismatch. Dumping arrays for manual inspection in predictions.csv\n")
fpred.write("Array printed in the order names/Labels, predictions RF and true values\n")
fpred.write(names+"\n")
fpred.write(RFpredictions+"\n")
fpred.write(y+"\n")
sys.exit("ERROR - names, prediction and true value array size mismatch. Dumping arrays for manual inspection in predictions.csv")
print "Final averaged Random Forest metrics : "
RFamse = sum(RFmse)/10
RFmse_sd = np.std(RFmse)
RFarmse = sum(RFrmse)/10
RFrmse_sd = np.std(RFrmse)
RFslope, RFintercept, RFr_value, RFp_value, RFstd_err = scipy.stats.linregress(RFpredictions, y)
RFR2 = RFr_value**2
print "Average Mean Squared Error = ", RFamse, " +/- ", RFmse_sd
print "Average Root Mean Squared Error = ", RFarmse, " +/- ", RFrmse_sd
print "R2 Final prediction against True values = ", RFR2
fpred.write("\n")
fpred.write("FINAL PREDICTION STATISTICS,\n")
fpred.write("Random Forest average MSE, %s, +/-, %s,\n" %(str(RFamse), str(RFmse_sd)))
fpred.write("Random Forest average RMSE, %s, +/-, %s,\n" %(str(RFarmse), str(RFrmse_sd)))
fpred.write("Random Forest slope, %s, Random Forest intercept, %s,\n" %(str(RFslope), str(RFintercept)))
fpred.write("Random Forest standard error, %s,\n" %(str(RFstd_err)))
fpred.write("Random Forest R, %s,\n" %(str(RFr_value)))
fpred.write("Random Forest R2, %s,\n" %(str(RFR2)))
ftrain.close()
fpred.close()
ffeatimp.close()
you can also add Feature Selection to your data:
sickit learn feature selection
some feature selection techniques are provided in sickit learn and you can use it to improve some aspect of your DM project