I am new to machine learning and am trying to run a simple classification model, which I trained and saved using pickle, on another dataset of the same format. I have the following Python code.
Code
import numpy as np
import pandas as pd
from termcolor import colored
from sklearn import model_selection

#Training set
features = pd.read_csv('../Data/Train_sop_Computed.csv')
#Testing set
testFeatures = pd.read_csv('../Data/Test_sop_Computed.csv')
print(colored('\nThe shape of our features is:','green'), features.shape)
print(colored('\nThe shape of our Test features is:','green'), testFeatures.shape)
features = pd.get_dummies(features)
testFeatures = pd.get_dummies(testFeatures)
features.iloc[:,5:].head(5)
testFeatures.iloc[:,5:].head(5)
labels = np.array(features['Truth'])
testlabels = np.array(testFeatures['Truth'])
features= features.drop('Truth', axis = 1)
testFeatures = testFeatures.drop('Truth', axis = 1)
feature_list = list(features.columns)
testFeature_list = list(testFeatures.columns)
def add_missing_dummy_columns(d, columns):
    missing_cols = set(columns) - set(d.columns)
    for c in missing_cols:
        d[c] = 0

def fix_columns(d, columns):
    add_missing_dummy_columns(d, columns)
    # make sure we have all the columns we need
    assert (set(columns) - set(d.columns) == set())
    extra_cols = set(d.columns) - set(columns)
    if extra_cols:
        print("extra columns:", extra_cols)
    d = d[columns]
    return d
testFeatures = fix_columns(testFeatures, features.columns)
features = np.array(features)
testFeatures = np.array(testFeatures)
train_samples = 100
X_train, X_test, y_train, y_test = model_selection.train_test_split(features, labels, test_size = 0.25, random_state = 42)
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)
print(colored('\n TRAINING SET','yellow'))
print(colored('\nTraining Features Shape:','magenta'), X_train.shape)
print(colored('Training Labels Shape:','magenta'), X_test.shape)
print(colored('Testing Features Shape:','magenta'), y_train.shape)
print(colored('Testing Labels Shape:','magenta'), y_test.shape)
print(colored('\n TESTING SETS','yellow'))
print(colored('\nTraining Features Shape:','magenta'), testX_train.shape)
print(colored('Training Labels Shape:','magenta'), textX_test.shape)
print(colored('Testing Features Shape:','magenta'), testy_train.shape)
print(colored('Testing Labels Shape:','magenta'), testy_test.shape)
from sklearn.metrics import precision_recall_fscore_support
import pickle
loaded_model_RFC = pickle.load(open('../other/SOPmodel_RFC', 'rb'))
result_RFC = loaded_model_RFC.score(textX_test, testy_test)
print(colored('Random Forest Classifier: ','magenta'),result_RFC)
loaded_model_SVC = pickle.load(open('../other/SOPmodel_SVC', 'rb'))
result_SVC = loaded_model_SVC.score(textX_test, testy_test)
print(colored('Support Vector Classifier: ','magenta'),result_SVC)
loaded_model_GPC = pickle.load(open('../other/SOPmodel_Gaussian', 'rb'))
result_GPC = loaded_model_GPC.score(textX_test, testy_test)
print(colored('Gaussian Process Classifier: ','magenta'),result_GPC)
loaded_model_SGD = pickle.load(open('../other/SOPmodel_SGD', 'rb'))
result_SGD = loaded_model_SGD.score(textX_test, testy_test)
print(colored('Stochastic Gradient Descent: ','magenta'),result_SGD)
I am able to get the results for the test set.
But the problem I am facing is that I need to run the model on the entire Test_sop_Computed.csv dataset, whereas it is only being run on the portion of the test data produced by the split.
I would sincerely appreciate any suggestions on how I can run the loaded models on the entire dataset. I know that I'm going wrong with the following line of code.
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)
Both the train and test datasets have the Subject, Predicate, Object, Computed and Truth columns, with Truth being the predicted class. The testing dataset has the actual values for this Truth column; I drop it using testFeatures = testFeatures.drop('Truth', axis = 1) and intend to use the various loaded classifiers to predict this Truth as 0 or 1 for the entire dataset and then get the predictions as an array.
This is what I have done so far, but I think I am splitting my test dataset as well. Is there a way to pass the entire test dataset even if it is in another file?
This test dataset is in the same format as the training set. I have checked the shape of the two and I get the following.
Confirming the Features and Shape
Shape of the Train features is: (1860, 5)
Shape of the Test features is: (1386, 5)
TRAINING SET
Training Features Shape: (1395, 1045)
Training Labels Shape: (465, 1045)
Testing Features Shape: (1395,)
Testing Labels Shape: (465,)
TEST SETS
Training Features Shape: (1039, 1045)
Training Labels Shape: (347, 1045)
Testing Features Shape: (1039,)
Testing Labels Shape: (347,)
Any suggestions in this regard will be highly appreciated.
Your question is a bit unclear, but as I understand it, you want to run your model on testX_train and on testX_test (which are just testFeatures split into two sub-datasets).
So, either you can run your model on testX_train the same way you did for testX_test, e.g.:
result_RFC_train = loaded_model_RFC.score(testX_train, testy_train)
or you can just remove the following line:
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)
That way you don't split your data at all and you score the model on the full test dataset:
result_RFC_train = loaded_model_RFC.score(testFeatures, testlabels)
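If you also want the predicted 0/1 labels for every row rather than only the accuracy score, you can call predict on the full, unsplit test matrix. A minimal sketch, assuming the testFeatures / testlabels arrays and the loaded models from the code above:
# Predict the Truth class (0 or 1) for every row of the full, unsplit test set
predictions_RFC = loaded_model_RFC.predict(testFeatures)   # numpy array of 0/1 labels
predictions_SVC = loaded_model_SVC.predict(testFeatures)

# Compare against the true labels that were kept aside earlier
from sklearn.metrics import accuracy_score
print('RFC accuracy on the full test file:', accuracy_score(testlabels, predictions_RFC))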
Related
I'm trying to perform LassoCV feature selection on my miRNA expression dataset, and after finding the 100 best features (miRNAs in this case) I want to build some classification models (like SVM, RF, KNN, etc.) for prediction using those 100 miRNAs. I can use the following code for my data without any problems if I don't do a train-test split.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
feature_names = df.columns[0:2565]
clf = LassoCV().fit(X, y)
importance = np.abs(clf.coef_)
idx_third = importance.argsort()[-3]
threshold = importance[idx_third] + 0.01
idx_features = (-importance).argsort()[:100]
name_features = np.array(feature_names)[idx_features]
print('Selected features: {}'.format(name_features))
sfm = SelectFromModel(clf, threshold=threshold)
sfm.fit(X, y)
X = sfm.transform(X)
But my goal is to select the features after the split, and I think I'm having trouble handling X_train and X_test after applying LassoCV. Here's the code after train_test_split:
clf = LassoCV().fit(X_train, y_train)
importance = np.abs(clf.coef_)
idx_third = importance.argsort()[-3]
threshold = importance[idx_third] + 0.01
idx_features = (-importance).argsort()[:100]
name_features = np.array(feature_names)[idx_features]
print('Selected features: {}'.format(name_features))
sfm = SelectFromModel(clf, threshold=threshold)
sfm.fit(X_train, y_train)
and the output:
Selected features: ['MIMAT0019071' 'MIMAT0019947' 'MIMAT0005951' 'MIMAT0025458'
'MIMAT0019710' 'MIMAT0005880' 'MIMAT0004810' 'MIMAT0026481'
'MIMAT0016904' 'MIMAT0003340' 'MIMAT0016851' 'MIMAT0019033'
'MIMAT0004508' 'MIMAT0024615' 'MIMAT0022478' 'MIMAT0019004'
'MIMAT0004948' 'MIMAT0005898' 'MIMAT0000064' 'MIMAT0015087'
'MIMAT0005942' 'MIMAT0004602' 'MIMAT0027666' 'MIMAT0003250'
'MIMAT0022289' 'MIMAT0005866' 'MIMAT0004903' 'MIMAT0004592'
'MIMAT0021040' 'MIMAT0003237' 'MIMAT0018954' 'MIMAT0019858'
'MIMAT0003270' 'MIMAT0030416' 'MIMAT0019361' 'MIMAT0018083'
'MIMAT0000440' 'MIMAT0018070' 'MIMAT0016863' 'MIMAT0015066'
'MIMAT0027576' 'MIMAT0017997' 'MIMAT0000421' 'MIMAT0003165'
'MIMAT0027587' 'MIMAT0004603' 'MIMAT0003330' 'MIMAT0019948'
'MIMAT0004978' 'MIMAT0018951' 'MIMAT0016872' 'MIMAT0019203'
'MIMAT0015005' 'MIMAT0003319' 'MIMAT0003316' 'MIMAT0022265'
'MIMAT0011159' 'MIMAT0016898' 'MIMAT0003240' 'MIMAT0004925'
'MIMAT0027580' 'MIMAT0019067' 'MIMAT0018121' 'MIMAT0028112'
'MIMAT0019714' 'MIMAT0000685' 'MIMAT0019742' 'MIMAT0027627'
'MIMAT0003277' 'MIMAT0019737' 'MIMAT0003284' 'MIMAT0020925'
'MIMAT0022929' 'MIMAT0022938' 'MIMAT0020924' 'MIMAT0020603'
'MIMAT0020602' 'MIMAT0020956' 'MIMAT0020601' 'MIMAT0020600'
'MIMAT0022719' 'MIMAT0020300' 'MIMAT0022939' 'MIMAT0022940'
'MIMAT0019984' 'MIMAT0019983' 'MIMAT0019982' 'MIMAT0019981'
'MIMAT0019980' 'MIMAT0019979' 'MIMAT0019978' 'MIMAT0019977'
'MIMAT0019976' 'MIMAT0022941' 'MIMAT0020541' 'MIMAT0019985'
'MIMAT0020958' 'MIMAT0019975' 'MIMAT0021036' 'MIMAT0021037']
SelectFromModel(estimator=LassoCV(), threshold=0.041810456987634005)
So, no problems until here and we can see the 100 miRNAs to be selected. I try to select these features by applying X = sfm.transform(X) to the split dataset like this:
X_train = sfm.transform(X_train)
X_test = sfm.transform(X_test)
But when I check the X_train.shape and X_test.shape the output is like this:
((164, 0), (55, 0))
So, of course when I try to train my model:
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)
it gives me this error:
ValueError: Found array with 0 feature(s) (shape=(164, 0)) while a minimum of 1 is required.
I'm new to machine learning, especially the feature selection part. If anyone can tell me how to develop models with the selected features in this particular case, I would greatly appreciate it.
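For reference, a minimal sketch of the usual pattern for selecting a fixed number of features after the split: fit the selector on the training data only and transform both splits with that same fitted selector. The max_features / threshold combination below is an assumption about how you want to keep exactly 100 miRNAs; X_train, X_test, y_train are the arrays from your train_test_split.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC

# Fit the selector on the training split only, keeping the 100 largest coefficients
sfm = SelectFromModel(LassoCV(), threshold=-np.inf, max_features=100)
sfm.fit(X_train, y_train)

# Transform both splits with the same fitted selector
X_train_sel = sfm.transform(X_train)
X_test_sel = sfm.transform(X_test)
print(X_train_sel.shape, X_test_sel.shape)  # both should now have 100 columns

# Train a classifier on the reduced feature set
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(X_train_sel, y_train)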
I am trying to do the machine learning practice problem on heart disease, with the dataset from Kaggle.
Then I tried to split the data into train and test sets, and after combining the models into a single function and predicting, this error shows up in the Jupyter notebook.
Here's my code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Split data into X and y
X = df.drop("target", axis=1)
y = df["target"]
Splitting
# Split data into train and test sets
np.random.seed(42)
# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
Prediction function
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier()}

# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data (no labels)
    X_test : testing data (no labels)
    y_train : training labels
    y_test : test labels
    """
    # Set random seed
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and append its score to model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores
And when I run this code, the error shows up:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)
model_scores
This is the error I get.
Your X_train, y_train, or both, seem to have entries that are not float numbers.
At some point in the code, try using
X_train = X_train.astype(float)
y_train = y_train.astype(float)
X_test = X_test.astype(float)
y_test = y_test.astype(float)
Either this will work and the error will go away, or one of the conversions will fail, at which point you will need to decide how (or if) you want to encode the data as a float.
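If one of the conversions does fail, a quick way to find the offending columns is to inspect the dtypes, and one common fix is to one-hot encode the non-numeric columns before splitting. A minimal sketch, assuming the df, X_train and "target" names from the question; the get_dummies step is a suggestion, not part of the original notebook:
import pandas as pd
from sklearn.model_selection import train_test_split

# Show which columns are still non-numeric (object dtype)
print(X_train.dtypes[X_train.dtypes == "object"])

# One option: one-hot encode any non-numeric columns before splitting
X_encoded = pd.get_dummies(df.drop("target", axis=1))
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)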
I have gone through multiple questions that explain how to divide your dataframe into train and test sets, with scikit-learn, without it, etc.
But my question is: I have 2 different CSVs (2 different dataframes from different years), and I want to use one as the train set and the other as the test set.
How can I do this for LinearRegression / any model?
Load the datasets individually.
Make sure they are in the same format of rows and columns (features).
Use the train set to fit the model.
Use the test set to predict the output after training.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Split features and target
# when trying to predict column "target"
X_train, y_train = train.drop("target", axis=1), train["target"]
X_test, y_test = test.drop("target", axis=1), test["target"]

# Fit (train) model
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict
pred = reg.predict(X_test)

# Score (R^2 for LinearRegression)
score = reg.score(X_test, y_test)
Please, skillsmuggler: what about X_train and X_test? How can I define them? When I try to run this it says NameError: name 'X_train' is not defined.
I couldn't edit the first answer, which is almost there. There is some code missing though...
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Positional slicing on a DataFrame needs .iloc
y_train = train.iloc[:, 0]   # if y is the first (and only label) column
X_train = train.iloc[:, 1:]
y_test = test.iloc[:, 0]
X_test = test.iloc[:, 1:]

# Fit (train) model
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict
pred = reg.predict(X_test)

# Score (R^2 for LinearRegression)
score = reg.score(X_test, y_test)
This question already has answers here:
Keep same dummy variable in training and testing data
I am using pandas get_dummies to convert categorical variables into dummy/indicator variables, which introduces new features in the dataset. Then we fit/train a model on this dataset.
Since the dimensions of X_train and X_test are the same, prediction works well on the test data X_test.
Now let's say we have test data in another CSV file (with unknown output). When we transform this test data using get_dummies, the resulting dataset may not have the same number of features as the one we trained our model with. When we later use our model on this dataset, it fails, because the number of features in the testing set does not match the model's.
Any idea how we can handle this?
Code:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Load the dataset
in_file = 'train.csv'
full_data = pd.read_csv(in_file)
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)
features = pd.get_dummies(features_raw)
features = features.fillna(0.0)
X_train, X_test, y_train, y_test = train_test_split(features, outcomes,
                                                     test_size=0.2, random_state=42)
model = DecisionTreeClassifier(max_depth=50, min_samples_leaf=6, min_samples_split=2)
model.fit(X_train,y_train)
y_train_pred = model.predict(X_train)
#print (X_train.shape)
y_test_pred = model.predict(X_test)
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)
# Doing it again to test another set of data
test_data = 'test.csv'
test_data1 = pd.read_csv(test_data)
test_data2 = pd.get_dummies(test_data1)
test_data3 = test_data2.fillna(0.0)
print(test_data2.shape)
print (model.predict(test_data3))
It seems a similar question has been asked before, but the most efficient/easiest way would be to follow the approach by Thibault Clement described here:
# Get missing columns in the training set
missing_cols = set(X_train.columns) - set(X_test.columns)
# Add each missing column to the test set with a default value of 0
for c in missing_cols:
    X_test[c] = 0
# Ensure the order of columns in the test set is the same as in the train set
X_test = X_test[X_train.columns]
It's also worth noting that your model can only use the features it was trained on, so if there are extra columns in X_test compared to X_train (rather than missing ones), these will have to be removed before predicting.
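A more compact way to get the same effect, assuming X_train holds exactly the columns the model was trained on, is pandas' reindex, which adds the missing columns (filled with 0), drops the extra ones and fixes the order in a single call:
# Align the new test frame to the training columns in one step
test_data3 = test_data3.reindex(columns=X_train.columns, fill_value=0)
print(model.predict(test_data3))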
I'm attempting to build a classification model for electric vehicle charging event data. I want to predict whether the charging station will be available at a given point in time. I have the following code working:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
raw_data = pd.read_csv('C:/temp/sample_dataset.csv')
raw_test = pd.read_csv('C:/temp/sample_dataset_test.csv')
print ('raw data shape: ', raw_test.shape)
#choose which columns to dummify
X_vars = ['station_id', 'day_of_week', 'epoch', 'station_city',
'station_county', 'station_zip', 'port_level', 'perc_local_occupancy',
'ports_at_station', 'avg_charge_duration']
y_var = ['target_variable']
categorical_vars = ['station_id','station_city','station_county']
#split X and y in training and test
X_train = raw_data.loc[:,X_vars]
y_train = raw_data.loc[:,y_var]
X_test = raw_test.loc[:,X_vars]
y_test = raw_test.loc[:,y_var]
#make dummy variables
X_train = pd.get_dummies(X_train, columns = categorical_vars )
X_test = pd.get_dummies(X_test, columns=categorical_vars)
print('train size', X_train.shape, '\ntest size', X_test.shape)
# Train uncalibrated random forest classifier on whole train and evaluate on test data
clf = RandomForestClassifier(n_estimators=100, max_depth=2)
clf.fit(X_train, y_train.values.ravel())
print ('RF accuracy: TRAINING', clf.score(X_train,y_train))
print ('RF accuracy: TESTING', clf.score(X_test,y_test))
Results
raw data shape: (1000000, 15)
train size (1000000, 125)
test size (1000000, 125)
RF accuracy: TRAINING 0.831456
RF accuracy: TESTING 0.831456
My question is: why are the training and testing accuracies EXACTLY the same? I've run this many, many times and it's always exactly the same. Any ideas? (I've checked to make sure the original data IS different.)
Well, there is simply a typo in your code, because each time you select all rows:
#split X and y in training and test
X_train = raw_data.loc[:,X_vars]
y_train = raw_data.loc[:,y_var]
X_test = raw_test.loc[:,X_vars]
y_test = raw_test.loc[:,y_var]
You should index them separately with some row index, for example: X_train = raw_data.loc[:idx, X_vars]
Is it possible that you are using the same set of data in train and test files?
If it's the same data, then it might be better to split it into train and test sets using the train_test_split module.
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
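A minimal sketch of that suggestion, assuming a single file (sample_dataset.csv) should be the source of both sets; it reuses the raw_data, X_vars, y_var and categorical_vars names from the question:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Dummify first, then split the single file into disjoint train and test portions
X = pd.get_dummies(raw_data.loc[:, X_vars], columns=categorical_vars)
y = raw_data.loc[:, y_var]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, max_depth=2)
clf.fit(X_train, y_train.values.ravel())
print('RF accuracy: TESTING', clf.score(X_test, y_test))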