numpy array from csv file for lasagne - python

I started learning how to use Theano with Lasagne, and started with the MNIST example. Now I want to try my own example: I have a train.csv file in which every row starts with 0 or 1, representing the correct answer, followed by 773 0s and 1s representing the input. I don't understand how to turn this file into the numpy arrays expected by the load_database() function. This is the relevant part of the original function for the MNIST database:
...
with gzip.open(filename, 'rb') as f:
    data = pickle_load(f, encoding='latin-1')
# The MNIST dataset we have here consists of six numpy arrays:
# Inputs and targets for the training set, validation set and test set.
X_train, y_train = data[0]
X_val, y_val = data[1]
X_test, y_test = data[2]
...
# We just return all the arrays in order, as expected in main().
# (It doesn't matter how we do this as long as we can read them again.)
return X_train, y_train, X_val, y_val, X_test, y_test
I need to get X_train (the inputs) and y_train (the first value of every row) from my CSV file.
Thanks!

You can use numpy.genfromtxt() or numpy.loadtxt() as follows:
import numpy
from sklearn.model_selection import KFold  # the old sklearn.cross_validation module has been removed

Xy = numpy.genfromtxt('yourfile.csv', delimiter=",")
# the next section provides the required
# training-validation set splitting, but
# you can do it manually too, if you want
kf = KFold(n_splits=3)
ind_train, ind_valid = next(kf.split(Xy))  # take the first fold only
Xy_train, Xy_valid = Xy[ind_train], Xy[ind_valid]
X_train = Xy_train[:, 1:]
y_train = Xy_train[:, 0]
X_valid = Xy_valid[:, 1:]
y_valid = Xy_valid[:, 0]
...
# you can simply ignore the test sets in your case
return X_train, y_train, X_valid, y_valid  #, X_test, y_test
In this code snippet we skipped returning a test set.
Now you can import your dataset into your main module or script, but be aware that you have to remove everything that refers to the test set from it as well.
Or, alternatively, you can simply pass the validation sets as the test set:
# you can simply pass the validation sets as the `test` set
return X_train, y_train, X_valid, y_valid, X_valid, y_valid
In the latter case you don't have to touch the parts of the main module that refer to the expected test set, but any scores it reports (if any) will show the validation scores twice, i.e. also as test scores.
Note: I don't know which MNIST example that is, but after preparing your data as above you will probably have to make further modifications in your trainer module as well, to suit your data: for example the input shape and the output shape, i.e. the number of classes. In your case the former is 773 and the latter is 2, as sketched below.
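For instance, in Lasagne's MNIST example the network is created by a build_mlp() function; here is a minimal sketch of the adaptation (assuming that is the example you started from; the hidden-layer size is arbitrary):
import lasagne

def build_mlp(input_var=None):
    # Input layer: None allows a variable batch size; 773 input features
    network = lasagne.layers.InputLayer(shape=(None, 773), input_var=input_var)
    # One hidden layer; 256 units chosen arbitrarily for this sketch
    network = lasagne.layers.DenseLayer(
        network, num_units=256, nonlinearity=lasagne.nonlinearities.rectify)
    # Output layer: 2 classes with softmax for class probabilities
    network = lasagne.layers.DenseLayer(
        network, num_units=2, nonlinearity=lasagne.nonlinearities.softmax)
    return network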

Related

Python (sklearn) train_test_split: choosing which data to train and which data to test

I want to use sklearn's train_test_split to manually split data into train and test sets. Specifically, in my .csv file, I want to use every row of data except the last for training, and the last row for testing. The reason I'm doing this is that I need to launch a machine learning model but am incredibly short on time, so I thought the best way would be to use predictions rather than deploying it with IBM Watson. I don't need it to be live. My code so far looks like this:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('Book5.csv', names=['Amiability', 'Email'])
df_x = df['Amiability']
df_y = df['Email']
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)
Then,
len(df)
Produces
331
I want to train with every row except the last, and test with only the last row. How can I do this?
If you don't absolutely need the test row to be the last row you should be able to do:
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=1, random_state=4)
When test_size is an integer, it specifies the absolute number of sample rows in the test set.
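If you do need the last row as the test row, here is a minimal sketch (assuming df_x and df_y as defined in the question): train_test_split() shuffles by default, but with shuffle=False the split is sequential, so the test set is taken from the end of the data:
from sklearn.model_selection import train_test_split

# test_size=1 with shuffle=False leaves exactly the last row as the test set
x_train, x_test, y_train, y_test = train_test_split(
    df_x, df_y, test_size=1, shuffle=False)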

Using separate test and train files with train_test_split()

I have two .csv files: one of them is test.csv and the other is train.csv. However, as you might guess, the test file does not have the target column ('y' in this case) while the train file does.
What I want to do is first train the system entirely on the train file, then use the test file just to see the predictions.
I'm using from sklearn.model_selection import train_test_split to create train and test examples, but it accepts only one dataset. I want to train the system with the train file first, and when that is finished, get the test data from the test.csv file and make the predictions.
So first I tried the classic way, but with a tiny test size, so it is effectively "this file is used for training only":
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

dataset = pd.read_csv(r'path\train.csv', sep=",")
X = dataset.drop('y', axis=1)  # features
y = dataset['y']               # target column
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.001, random_state=45)
clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)
But then, when it comes to the real test part (where I want to use the data in test.csv, which doesn't have target values), how can I import test.csv so that I can use its data in the trained model above?
# get the data from test.csv as X_test, somehow
clfPredict = clf.predict(X_test)
If this is not possible using train_test_split(), what's the proper way to accomplish this task?
You need to load the train CSV and split it into:
y_train = df1['y']
X_train = df1.drop('y', axis=1)
And regarding the test file:
X_test = df2
and y_test will be the result of clf.predict(X_test).
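Put together, a minimal sketch (assuming the target column is named 'y' as in the question; the file names train.csv and test.csv stand in for your actual paths):
import pandas as pd
from sklearn.svm import SVC

# train on the complete training file
df1 = pd.read_csv('train.csv')
y_train = df1['y']
X_train = df1.drop('y', axis=1)
clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)

# the test file has no target column, so all of it is features
df2 = pd.read_csv('test.csv')
X_test = df2
y_test = clf.predict(X_test)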

Pandas: TypeError: float() argument must be a string or a number, not 'pandas._libs.interval.Interval'

I am trying to do the machine learning practice problem on heart disease, with a dataset from Kaggle.
I tried to split the data into a train set and a test set, and after combining the models into a single function and predicting, this error shows up in my Jupyter notebook.
Here's my code:
# Split data into X and y
X = df.drop("target", axis=1)
y = df["target"]
Splitting
# Split data into train and test sets
np.random.seed(42)
# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
Prediction function
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier()}

# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data (no labels)
    X_test : testing data (no labels)
    y_train : training labels
    y_test : test labels
    """
    # Set random seed
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and append its score to model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores
And when I run this code, the error shows up:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)
model_scores
The error is the TypeError quoted in the title.
Your X_train, y_train, or both, seem to have entries that are not float numbers.
At some point in the code, try using
X_train = X_train.astype(float)
y_train = y_train.astype(float)
X_test = X_test.astype(float)
y_test = y_test.astype(float)
Either this will work and the error will go away, or one of the conversions will fail, at which point you will need to decide how (or if) you want to encode the data as a float.
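The pandas._libs.interval.Interval in the message suggests that some column holds bins produced by pd.cut() (or pd.qcut()). Here is a sketch of one way to make such a column numeric (the column names here are hypothetical):
import pandas as pd

df = pd.DataFrame({"age": [23, 45, 61, 37]})
df["age_bin"] = pd.cut(df["age"], bins=3)  # categorical column of Interval objects
# replace each interval with its integer category code,
# which float() and scikit-learn can handle
df["age_bin"] = df["age_bin"].cat.codes
print(df.dtypes)  # age_bin is now an integer dtype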

Add data to MNIST dataset

I am doing a machine learning project to recognize handwritten digits. Actually, I just want to add a few more samples to MNIST, but I am unable to do so.
I have done the following:
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = len(mnist.data)
x = mnist.data.reshape((n_samples, -1))  # feature array, one flattened 784-pixel image per row
y = mnist.target  # class labels 0-9, one per digit
img_temp_train = cv2.imread('C:/Users/amuly/Desktop/Soap/crop/2.jpg', 0)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Now I want to add img_temp_train to my dataset for training.
X_train = np.append(X_train, img_temp_train.reshape(-1))
y_train = np.append(y_train, [4.0])
The lengths after appending are:
43904784 (X_train)
56001(y_train)
But it should be 56001 for both.
Try this:
X_train = np.append(X_train, [img_temp_train], axis=0)
You shouldn't be reshaping things willy-nilly without thinking about what you're doing first!
Also, it's usually a better idea to use concatenate:
X_train = np.concatenate((X_train, [img_temp_train.reshape(-1)]), axis=0)
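As a sanity check, a minimal sketch (assuming MNIST-style 784-pixel rows and that the extra image is 28x28 grayscale), including the matching label:
import numpy as np

img_flat = img_temp_train.reshape(-1)                    # shape (784,)
X_train = np.concatenate((X_train, [img_flat]), axis=0)  # stack as a new row
y_train = np.append(y_train, [4.0])                      # don't forget the label
print(X_train.shape, y_train.shape)                      # (56001, 784) (56001,)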

How can I do K fold cross-validation for splitting the train and test set?

I have a set of documents and a set of labels.
Right now, I am using train_test_split to split my dataset in a 90:10 ratio. However, I wish to use KFold cross-validation.
train = []
with open("/Users/rte/Documents/Documents.txt") as f:
    for line in f:
        train.append(line.strip().split())

labels = []
with open("/Users/rte/Documents/Labels.txt") as t:
    for line in t:
        labels.append(line.strip().split())

X_train, X_test, Y_train, Y_test = train_test_split(train, labels, test_size=0.1, random_state=42)
When I try the method provided in the documentation of scikit-learn, I receive an error:
kf = KFold(len(train), n_folds=3)
for train_index, test_index in kf:
    X_train, X_test = train[train_index], train[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
The error:
X_train, X_test = train[train_index],train[test_index]
TypeError: only integer arrays with one element can be converted to an index
How can I perform a 10-fold cross-validation on my documents and labels?
There are two ways to solve this error:
First way:
Cast your data to numpy arrays:
import numpy as np
[...]
train = np.array(train)
labels = np.array(labels)
Then it should work with your current code.
Second way:
Use a list comprehension to index the train and labels lists with the train_index and test_index lists:
for train_index, test_index in kf:
    X_train, X_test = [train[i] for i in train_index], [train[j] for j in test_index]
    y_train, y_test = [labels[i] for i in train_index], [labels[j] for j in test_index]
(For this solution also see related question index list with another list)
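Note that in recent scikit-learn versions the KFold API has changed: the constructor takes n_splits and the indices come from the .split() method. A sketch with the current API (assuming train and labels as in the question):
import numpy as np
from sklearn.model_selection import KFold

train = np.array(train)
labels = np.array(labels)

kf = KFold(n_splits=10)  # 10 folds for 10-fold cross-validation
for train_index, test_index in kf.split(train):
    X_train, X_test = train[train_index], train[test_index]
    y_train, y_test = labels[train_index], labels[test_index]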
