Using separate test and train files with train_test_split() - python

I have two .csv files: test.csv and train.csv. As you might expect, the test file does not have the target column ('y' in this case), while the train file does.
What I want to do is train the model entirely on the train file first, then use the test file only to get predictions.
I'm using from sklearn.model_selection import train_test_split to create train and test examples, but it accepts only one dataset. I want to train the model using the train file first, and once that is finished, load the test data from test.csv and make the predictions.
So first I tried the classic way, but with a tiny test size so the file is effectively used for training only:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

dataset = pd.read_csv(r'path\train.csv', sep=",")
X = dataset.drop('y', axis=1)  # features; 'y' is the target column
y = dataset['y']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.001, random_state=45)
clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)
but then, when it comes to the real test part (where I want to use the data in test.csv, which has no target values), how can I import test.csv so that I can use its data with the trained model above?
# get the data from test.csv as X_test somehow
clfPredict = clf.predict(X_test)
If this is not possible using train_test_split(), what's the proper way to accomplish this task?

You need to load the train CSV and split it into:
y_train = df1['Y column']
X_train = df1.drop('Y column', axis=1)
And for the test set:
X_test = df2
and y_test will be the result of clf.predict(X_test)
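Putting the pieces together, a minimal sketch (assuming the target column in train.csv is named 'y', as in the question):
import pandas as pd
from sklearn.svm import SVC

df1 = pd.read_csv(r'path\train.csv')  # training data, includes the target
df2 = pd.read_csv(r'path\test.csv')   # test data, no target column

y_train = df1['y']
X_train = df1.drop('y', axis=1)
X_test = df2  # must have the same feature columns, in the same order, as X_train

clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)  # predictions for the unlabeled test set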

Related

Python (sklearn) train_test_split: choosing which data to train and which data to test

I want to use sklearn's train_test_split to split data manually into train and test sets. Specifically, in my .csv file, I want to use every row except the last for training and the last row for testing. The reason is that I need to launch a machine learning model but am incredibly short on time, so I thought the quickest route would be to serve precomputed predictions rather than deploying the model with IBM Watson. It doesn't need to be live. My code so far looks like this:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('Book5.csv', names=['Amiability', 'Email'])
df_x = df['Amiability']
df_y = df['Email']
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)
Then len(df) produces 331, so the rows are indexed 0-330. I want to train on all rows except the last and test on the last row only. How can I do this?
If you don't absolutely need the test row to be the last row, you should be able to do:
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=1, random_state=4)
When test_size is an integer, it specifies the absolute number of rows in the test set.
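If the test row does have to be the last row, a plain slice does the job without train_test_split (a sketch; the slices keep pandas objects, so reshape to 2-D before fitting an sklearn model):
# train on everything except the last row, test on the last row only
x_train, y_train = df_x.iloc[:-1], df_y.iloc[:-1]
x_test, y_test = df_x.iloc[-1:], df_y.iloc[-1:]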

How to pass different sets of data to train and test without splitting a dataframe (Python)?

I have gone through multiple questions that help divide a dataframe into train and test, with scikit and without. But my question is: I have 2 different CSVs (2 different dataframes from different years). I want to use one as train and the other as test.
How do I do that for LinearRegression / any model?
Load the datasets individually.
Make sure they are in the same format of rows and columns (features).
Use the train set to fit the model.
Use the test set to predict the output after training.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Split features and target when trying to predict column "target"
X_train, y_train = train.drop("target", axis=1), train["target"]
X_test, y_test = test.drop("target", axis=1), test["target"]

# Fit (train) model
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict
pred = reg.predict(X_test)

# Score (R^2 on the test set)
score = reg.score(X_test, y_test)
@skillsmuggler what about X_train and X_test, how can I define them? Because when I try to do that it says NameError: name 'X_train' is not defined
I couldn't edit the first answer, which is almost there. There is some code missing, though...
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Separate the target from the features
# (assuming y is only one column, the first one,
# and the test file has the same layout)
y_train = train.iloc[:, 0]
X_train = train.iloc[:, 1:]
y_test = test.iloc[:, 0]
X_test = test.iloc[:, 1:]

# Fit (train) model
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict
pred = reg.predict(X_test)

# Score (R^2 on the test set)
score = reg.score(X_test, y_test)

How can we give explicit test data and train data to SVM instead of using train_test_split function?

I'm planning to provide the test and train datasets explicitly to the algorithm rather than using the train_test_split method to split the data randomly into test and train.
I also want to keep the reviews and labels in the same file for both testing and training the model.
Can anyone suggest how to do this?
My code:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score
from sklearn.metrics import confusion_matrix

with open("/Users/xyz/Desktop/reviews.txt") as f:
    reviews = f.read().split("\n")
with open("/Users/xyz/Desktop/labels.txt") as f:
    labels = f.read().split("\n")

reviews_tokens = [review.split() for review in reviews]

onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)

X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=None)

lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)
accuracy_score = lsvm.score(onehot_enc.transform(X_test), y_test)
print("Accuracy score of SVM:", accuracy_score)
Test.txt:
review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative
Train.txt:
review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
You can do exactly what you want. The solution is pretty simple:
X_train = reviews_tokens[:number_of_rows_of_train_data]
X_test = reviews_tokens[number_of_rows_of_train_data:]
Do the same for y_train and y_test.
Of course you need to know which rows in your file are for training and which are for testing.
If you want to keep features and labels in the same file - no problem with that. You will need one additional step to separate labels from features. It would be a lot easier with pandas.
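For example, a pandas-based sketch (assuming the review,label layout of the files shown in the question, with no commas inside the review text):
import pandas as pd

train = pd.read_csv('train.txt')  # columns: review, label
test = pd.read_csv('test.txt')

X_train = [review.split() for review in train['review']]
y_train = train['label']
X_test = [review.split() for review in test['review']]
y_test = test['label']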
EDIT
Having the files you provided, you can get what you want like this:
def load_data(filename):
    X = list()
    y = list()
    with open(filename) as file:
        file.readline()  # skip the header line
        for line in file:
            line = line.strip().split(',')
            y.append(line[1])
            X.append(line[0].split())
    return X, y

X_train, y_train = load_data('train.txt')
X_test, y_test = load_data('test.txt')
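One caveat, going back to the question's encoder: MultiLabelBinarizer should be fitted on every token it will later be asked to transform, so with a manual split it is safest to fit it on the union of both sets (a sketch):
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(X_train + X_test)  # fit on all tokens so transform() sees nothing unknown

lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)
score = lsvm.score(onehot_enc.transform(X_test), y_test)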

How do I accept a non-csv input for my machine learning model?

Language: Python.
I have created a model and saved it with joblib. Now I want to load it to make predictions for new data, but the data comes in as a string: numerical in value, but with the features on a single line separated by "," instead of in columns as one big dataframe. Can I do that? I know I can send in single inputs and get a single prediction, but I'm not sure how to do it.
I used
https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
as a reference, but I'm not too clear about the last bit (loading the model)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting K-NN to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Save the model to disk
import joblib
filename = 'test_model.sav'
joblib.dump(classifier, filename)

# Load the model and score it on the test set
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, y_test)
print(result)
*I did not post the data preprocessing part of the code
If your problem is about how to load the input vector X_test from a string input, you can use np.fromstring:
import numpy as np

input_string = '34,144,13'
X_test = np.fromstring(input_string, dtype=int, sep=',').reshape(1, -1)  # 2-D: one sample with n features
To get the model's prediction for the above X_test, you can use:
loaded_model = joblib.load(filename)
prediction = loaded_model.predict(X_test)
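One more caveat, assuming the training pipeline above is used as-is: the features were scaled with StandardScaler, so the same fitted scaler (saved alongside the model, or still in scope) should be applied to any new input before predicting:
X_test_scaled = sc.transform(X_test)  # reuse the scaler fitted on the training data
prediction = loaded_model.predict(X_test_scaled)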

numpy array from csv file for lasagne

I started learning how to use theano with lasagne, beginning with the mnist example. Now I want to try my own example: I have a train.csv file in which every row starts with 0 or 1, representing the correct answer, followed by 773 0s and 1s representing the input. I don't understand how to turn this file into the desired numpy arrays in the load_database() function. This is the relevant part of the original function for the mnist database:
...
with gzip.open(filename, 'rb') as f:
    data = pickle_load(f, encoding='latin-1')
# The MNIST dataset we have here consists of six numpy arrays:
# Inputs and targets for the training set, validation set and test set.
X_train, y_train = data[0]
X_val, y_val = data[1]
X_test, y_test = data[2]
...
# We just return all the arrays in order, as expected in main().
# (It doesn't matter how we do this as long as we can read them again.)
return X_train, y_train, X_val, y_val, X_test, y_test
and I need to get X_train (the inputs) and y_train (the first value of every row) from my csv file.
Thanks!
You can use numpy.genfromtxt() or numpy.loadtxt() as follows:
import numpy
from sklearn.model_selection import KFold

Xy = numpy.genfromtxt('yourfile.csv', delimiter=",")

# the next section provides the required
# training-validation split, but you can
# do it manually too, if you want
skf = KFold(n_splits=3)
for train_index, valid_index in skf.split(Xy):
    ind_train, ind_valid = train_index, valid_index
    break

Xy_train, Xy_valid = Xy[ind_train], Xy[ind_valid]
X_train = Xy_train[:, 1:]  # everything after the first value of each row
y_train = Xy_train[:, 0]   # the first value (the label)
X_val = Xy_valid[:, 1:]
y_val = Xy_valid[:, 0]
...
# you can simply ignore the test sets in your case
return X_train, y_train, X_val, y_val  # , X_test, y_test
In the code snippet above we skipped returning the test set.
Now you can import your dataset in the main module or script, but be aware that you also have to remove everything that refers to the test set there.
Alternatively, you can simply pass the validation sets as the test set:
# you can simply pass the valid sets as `test` set
return X_train, y_train, X_val, y_val, X_val, y_val
In the latter case you don't have to touch the parts of the main module that expect a test set, but any scores it reports will include the validation scores twice, i.e. also as test scores.
Note: I don't know which mnist example that is, but after preparing your data as above you will probably have to make further modifications to your trainer module as well, to suit your data: for example the input shape and the output shape (the number of classes), which in your case are 773 and 2 respectively.
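A minimal sketch of that kind of adjustment, assuming a simple dense network like the one in the lasagne MNIST example (the layer sizes here are illustrative, not taken from the question):
import lasagne
import theano.tensor as T

input_var = T.matrix('inputs')  # 2-D inputs instead of MNIST's 4-D image tensor
network = lasagne.layers.InputLayer(shape=(None, 773), input_var=input_var)
network = lasagne.layers.DenseLayer(network, num_units=256,
                                    nonlinearity=lasagne.nonlinearities.rectify)
# two output units, one per class (0 or 1)
network = lasagne.layers.DenseLayer(network, num_units=2,
                                    nonlinearity=lasagne.nonlinearities.softmax)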
