How to normalize the Train and Test data using MinMaxScaler sklearn

How to normalize the Train and Test data using MinMaxScaler sklearn - python

So, I have this doubt and have been looking for answers. So the question is when I use,
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)
After which I will train and test the model (A,B as features, C as Label) and get some accuracy score. Now my doubt is, what happens when I have to predict the label for new set of data. Say,
df = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})
Because when I normalize the column the values of A and B will be changed according to the new data, not the data which the model will be trained on.
So, now my data after the data preparation step that is as below, will be.
data[['A','B']] = min_max_scaler.fit_transform(data[['A','B']])
Values of A and B will change with respect to the Max and Min value of df[['A','B']]. The data prep of df[['A','B']] is with respect to Min Max of df[['A','B']].
How can the data preparation be valid with respect to different numbers relate? I don't understand how the prediction will be correct here.

You should fit the MinMaxScaler using the training data and then apply the scaler on the testing data before the prediction.
In summary:
Step 1: fit the scaler on the TRAINING data
Step 2: use the scaler to transform the TRAINING data
Step 3: use the transformed training data to fit the predictive model
Step 4: use the scaler to transform the TEST data
Step 5: predict using the trained model (step 3) and the transformed TEST data (step 4).
Example using your data:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
#training data
df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
#fit and transform the training data and use them for the model training
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
df['C'] = df['C'].apply(lambda x: 0 if x.strip()=='N' else 1)
#fit the model
model.fit(df['A','B'])
#after the model training on the transformed training data define the testing data df_test
df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})
#before the prediction of the test data, ONLY APPLY the scaler on them
df_test[['A','B']] = min_max_scaler.transform(df_test[['A','B']])
#test the model
y_predicted_from_model = model.predict(df_test['A','B'])
Example using iris data:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
data = datasets.load_iris()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
model = SVC()
model.fit(X_train_scaled, y_train)
X_test_scaled = scaler.transform(X_test)
y_pred = model.predict(X_test_scaled)
Hope this helps.
See also by post here: https://towardsdatascience.com/everything-you-need-to-know-about-min-max-normalization-in-python-b79592732b79

Best way is train and save MinMaxScaler model and load the same when it's required.
Saving model:
df = pd.DataFrame({'A':[1,2,3,7,9,15,16,1,5,6,2,4,8,9],'B':[15,12,10,11,8,14,17,20,4,12,4,5,17,19],'C':['Y','Y','Y','Y','N','N','N','Y','N','Y','N','N','Y','Y']})
df[['A','B']] = min_max_scaler.fit_transform(df[['A','B']])
pickle.dump(min_max_scaler, open("scaler.pkl", 'wb'))
Loading saved model:
scalerObj = pickle.load(open("scaler.pkl", 'rb'))
df_test = pd.DataFrame({'A':[25,67,24,76,23],'B':[2,54,22,75,19]})
df_test[['A','B']] = scalerObj.transform(df_test[['A','B']])

Related

how to use numpy mutual information correctly

i want to use principal component analysis-mutual information (PCA-MI) to have data representation from source which has source relevance of (value from smartinsole) and ouput variable (value from force plate). PCA was used to determine the principal component of Ni provided that the cumulative variance is greater than 98% of the source information measured from 89 insole sensors. MI is generally used in the selection of input variables for predictive models because it is a good indicator of the relationship between input variables and output variables. here I want to get results like a flowchart as below
then I try to make code like below. but I can't generate like what's in the flowchart
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# load the dataset
def load_dataset(filename):
# load the dataset as a pandas DataFrame
data = read_csv(filename, header=None)
# retrieve numpy array
dataset = data.values
y = dataset
return y
def load_dataset2(filename):
# load the dataset as a pandas DataFrame
data2 = read_csv(filename, header=None)
# retrieve numpy array
dataset2 = data2.values
X = dataset2
return X
# feature selection
def select_features(X_train, y_train, X_test):
# configure to select a subset of features
fs = SelectKBest(score_func=mutual_info_classif, k=4)
# learn relationship from training data
fs.fit(X_train, y_train)
# transform train input data
X_train_fs = fs.transform(X_train)
# transform test input data
X_test_fs = fs.transform(X_test)
return X_train_fs, X_test_fs, fs
# load the dataset
Insole = pd.read_csv('1119_Rwalk40s1_list.txt', header=None, low_memory=False)
SIData = np.asarray(Insole)
df = pd.read_csv('1119_Rwalk40s1.csv', low_memory=False)
columns = ['Fx','Fy','Fz','Mx','My','Mz']
selected_df = df[columns]
FCDatas = selected_df
SmartInsole = np.array(SIData)
FCData = np.array(FCDatas)
scaler_x = MinMaxScaler(feature_range=(0, 1))
scaler_x.fit(SmartInsole)
xscale = scaler_x.transform(SmartInsole)
scaler_y = MinMaxScaler(feature_range=(0, 1))
scaler_y.fit(FCData)
yscale = scaler_y.transform(FCData)
SIDataPCA = xscale
pca = PCA(n_components=89)
pca.fit(SIDataPCA)
SIdata_pca = pca.transform(SIDataPCA)
X = SIdata_pca
y = yscale
X = SIdata_pca
y = yscale
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# fit the model
model = LogisticRegression(solver='liblinear')
model.fit(X_train_fs, y_train)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))
how can I get the correct PCA-MI result data?

Large difference in performance on validation and test data

I have scraped some data from spotify to see if I can classify the music genre of different songs.
I have split my data up into a test set and a remaining set, which I have then further divided into training and validation set.
When I run the model (I try to classify between 112 genres) I get 30% accuracy in the validation set. Of course this is not great, but to be expected with 112 genres and limited data. What really confuses me is that when I apply the model to the test data, accuracy goes down to 1%.
I am not sure why that is: as far as I can see the validation and test data should be comparable. I train the model on the training data which should be completely independent.
I must be making some mistake either allowing the model to peak into the validation data (better performance there) or mess up my test data.
Or maybe applying the model twice messes things up?
Any idea what could be going on or how to debug it?
Thanks a lot!
Franka
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
# re-read data
track_df = pd.read_csv('track_df_corr.csv')
features = [ 'acousticness', 'speechiness',
'key', 'liveness', 'instrumentalness', 'energy', 'tempo',
'loudness', 'danceability', 'valence',
'duration_mins', 'year', 'genre']
track_df = track_df[features]
#First make a big split of all the data into test and train.
train, test = train_test_split(track_df, test_size=0.2, random_state = 0)
#Then create training and validation data set from the traindata.
# Read the data. Assign train and test data
# "full" is the data before preprocessing
X_full = train
X_test_full = test
# select to be predicted data
y = X_full.genre # just the target for the test data
y = pd.factorize(y)[0] # just keep the number - get rid of name by using [0] numbers needed for classifier
#Since we later on want to validate our data on the testdata, we also need to make sure we have a #y_test.
# select to be predicted data
y_test = X_test_full.genre # just the target for the test data
y_test = pd.factorize(y_test)[0] # just keep the number - get rid of name by using [0]
# numbers needed for classifier
# remove to be predicted variable
X_full.drop(['genre'], axis=1, inplace=True) # rest of training free of target, which is now stored in y
X_test_full.drop(['genre'], axis=1, inplace=True) # not sure if necessary but cannot hurt
# Break off validation set from training data (X_full)
# Remember we still have X_test_full as an entirely independend test set.
# Here we just create our training and validation sets from X_full.
X_train_full, X_valid_full, y_train, y_valid = \
train_test_split(X_full, y, train_size=0.8, test_size=0.2, random_state=0)
# General preprocessing steps: take care of categorical data (does not apply here).
categorical_cols = [cname for cname in X_train_full.columns if
X_train_full[cname].nunique() < 10 and
X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if
X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
#Time to run the model.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
#Run our model on the TRAINING data
# FRR set up input values that are passed to the Bundle below
# Preprocessing for NUMERICAL data
numerical_transformer = SimpleImputer(strategy='median')
# Preprocessing for CATEGORICAL data
categorical_transformer = Pipeline(steps=[ # FRR Pipeline of transforms with a "final estimator", here "categorical_transformer".
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# FRR Run the numerical_transformer and categorical_transformer defined above here:
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer( # frr Applies transformers to columns of an array or pandas DataFrame.
transformers=[ #frr List of (name,transformer,cols) tuples specifying the transformer objects to
#be applied to subsets of the data.
('num', numerical_transformer, numerical_cols),
('cat', categorical_transformer, categorical_cols)
])
# Define model
model = RandomForestClassifier(n_estimators=100, random_state=0)
# Bundle preprocessing and modeling code in a pipeline
# clf stands for clasiifier.
# Pipeline can be used to chain multiple estimators into one
# Preprocessing of training data, fit model
clf = Pipeline(steps=[('preprocessor', preprocessor),
('model', model)
])
# "Calling fit on the pipeline is the same as calling *fit* on each estimator (here: prepoc and model)
clf.fit(X_train, y_train)
# --------------------------------------------------------
#Test our model on the VALIDATION data
# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)
# Return the mean accuracy on the given test data and labels.
clf.score(X_valid, y_valid) # this is correct!
# The code yields a value around 30%.
# --------------------------------------------------------
Apply our model on the TESTING data
# Preprocessing of training data, fit model
preds_test = clf.predict(X_test)
clf.score(X_test, y_test)
#The code yields a value around 1%.

The problem that I see is that you're encoding the train and test labels using pd.factorize. Since you're using pd.factorize on y and y_test independently, the resulting encodings will not correspond to one another. You want to use a LabelEncoder, so that when you fit the encoder using the train data, you then transform y_test using the same encoding scheme.
Here's an example to illustrate this:
from sklearn.preprocessing import LabelEncoder
l = [1,4,6,1,4]
le = LabelEncoder()
le.fit(l)
le.transform(l)
# array([0, 1, 2, 0, 1], dtype=int64)
le.transform([1,6,4])
# array([0, 2, 1], dtype=int64)
Here we get the correct encodings. However if we apply a pd.factorize, obviously pandas can't guess which are the correct encodings:
pd.factorize(l)[0]
# array([0, 1, 2, 0, 1], dtype=int64)
pd.factorize([1,6,4])[0]
# array([0, 1, 2], dtype=int64)

MAE using Pipeline and GridSearchCV

I am facing a challenge finding Mean Average Error (MAE) using Pipeline and GridSearchCV
Background:
I have worked on a Data Science project (MWE as below) where a MAE value would be returned of a classifier as it's performance metric.
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity
#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
#RandomForest Modeling
RF_model = RandomForestClassifier(n_estimators=100, random_state=0)
RF_model.fit(X_train, y_train)
#RandomForest Prediction
y_predict = RF_model.predict(X_valid)
#MAE
print(mean_absolute_error(y_valid, y_predict))
#Output:
# 0.38727149627623564
Challenge:
Now I am trying to implement the same using Pipeline and GridSearchCV (MWE as below). The expectation is the same MAE value would be returned as above. Unfortunately I could not get it right using the 3 approaches below.
#Library
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
#Data import and preparation
data = pd.read_csv("data.csv")
data_features = ['location','event_type_count','log_feature_count','total_volume','resource_type_count','severity_type']
X = data[data_features]
y = data.fault_severity
#Train Validation Split for Cross Validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
#RandomForest Modeling via Pipeline and Hyper-parameter tuning
steps = [('rf', RandomForestClassifier(random_state=0))]
pipeline = Pipeline(steps) # define the pipeline object.
parameters = {'rf__n_estimators':[100]}
grid = GridSearchCV(pipeline, param_grid=parameters, scoring='neg_mean_squared_error', cv=None, refit=True)
grid.fit(X_train, y_train)
#Approach 1:
print(grid.best_score_)
# Output:
# -0.508130081300813
#Approach 2:
y_predict=grid.predict(X_valid)
print("score = %3.2f"%(grid.score(y_predict, y_valid)))
# Output:
# ValueError: Expected 2D array, got 1D array instead:
# array=[0. 0. 0. ... 0. 1. 0.].
# Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
#Approach 3:
y_predict_df = pd.DataFrame(y_predict.reshape(len(y_predict), -1),columns=['fault_severity'])
print("score = %3.2f"%(grid.score(y_predict_df, y_valid)))
# Output:
# ValueError: Number of features of the model must match the input. Model n_features is 6 and input n_features is 1
Discussion:
Approach 1:
As in GridSearchCV() the scoring variable is set to neg_mean_squared_error, tried to read the grid.best_score_. But it did not get the same MAE result.
Approach 2:
Tried to get the y_predict values using grid.predict(X_valid). Then tried to get the MAE using grid.score(y_predict, y_valid) as the scoring variable in GridSearchCV() is set to neg_mean_squared_error. It returned a ValueError complaining "Expected 2D array, got 1D array instead".
Approach 3:
Tried to reshape y_predict and it did not work either. This time it returned "ValueError: Number of features of the model must match the input."
It would be helpful if you can assist to point where I could have made the error?
If you need, the data.csv is available at https://www.dropbox.com/s/t1h53jg1hy4x33b/data.csv
Thank you very much

You are trying to compare mean_absolute_error with neg_mean_squared_error which is very different refer here for more details. You should have used neg_mean_absolute_error in your GridSearchCV object creation like shown below:
grid = GridSearchCV(pipeline, param_grid=parameters,scoring='neg_mean_absolute_error', cv=None, refit=True)
Also, the score method in sklearn takes (X,y) as inputs, where x is your input feature of shape (n_samples, n_features) and y is the target labels, you need to change your grid.score(y_predict, y_valid) into grid.score(X_valid, y_valid).

How do I accept a non-csv input for my machine learning model?

Language: Python.
I have created a model and saved it with joblib. Now I want to load it to make predictions for new data---but the data is in a form of string(numerical in value but the features are a single line separated by "," instead of in columns as one big dataframe) Can I do that? I know I can send in single inputs and get a single prediction but I'm not sure how to do it.
I used
https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
as reference but I'm not too clear about the last bit (loading the model)
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fitting K-NN to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# save the model to disk
filename = 'test_model.sav'
joblib.dump(classifier, filename)
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, y_test)
print(result)
*I did not post the data preprocessing part of the code

If your problem is about how to load the input vector X_test from a string input, you can use np.fromstring:
input_string = '34,144,13'
X_test=np.fromstring(input_string, dtype=int, sep=',')
To get the model's prediction for the above X_test, you can use:
loaded_model = joblib.load(filename)
prediction= loaded_model.predict(X_test)

Get panda Series from csv

I am totally new to machine learning, I am currently playing with MNIST machine learning, using RandomForestClassifier.
I use sklearn and panda.
I have a training CSV data set.
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
train = pd.read_csv("train.csv")
features = train.columns[1:]
X = train[features]
y = train['label']
user_train = pd.read_csv("input.csv")
user_features = user_train.columns[1:]
y_train = user_train[user_features]
user_y = user_train['label']
X_train, X_test, y_train, y_test = model_selection.train_test_split(X/255.,y,test_size=1,random_state=0)
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, y_train)
y_pred_rf = clf_rf.predict(X_test)
acc_rf = accuracy_score(y_test, y_pred_rf)
print("pred : ", y_pred_rf)
print("random forest accuracy: ",acc_rf)
I have the current code, which works well. It takes the training set, split and take one element for testing, and does the prediction.
What I want now is to use the testing data from an input, I have a new csv called "input.csv", and I want to predict the value inside this csv.
How can I replace the model_selection.train_test_split with my input data ?
I am sure the response is very obvious, and I didn't find anything.

The following part of your code is unused
user_train = pd.read_csv("input.csv")
user_features = user_train.columns[1:]
y_train = user_train[user_features]
user_y = user_train['label']
If input.csv has the same structure of train.csv you may want to:
train a classifier and test it on a split of the input.csv dataset: (please refer to http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html to know how to set the test size)
input_train = pd.read_csv("input.csv")
input_features = user_train.columns[1:]
input_data = user_train[input_features]
input_labels = user_train['label']
data_train, data_test, labels_train, labels_test = model_selection.train_test_split(input_data/255.,input_labels,test_size=1,random_state=0)
clf_rf = RandomForestClassifier()
clf_rf.fit(data_train, labels_train)
labels_pred_rf = clf_rf.predict(data_test)
acc_rf = accuracy_score(labels_test, labels_pred_rf)
test the previously trained classifier on the whole input.csv file
input_train = pd.read_csv("input.csv")
input_features = user_train.columns[1:]
input_data = user_train[input_features]
input_labels = user_train['label']
labels_pred_rf = clf_rf.predict(input_data)
acc_rf = accuracy_score(input_labels, labels_pred_rf)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to normalize the Train and Test data using MinMaxScaler sklearn - python

Related

how to use numpy mutual information correctly

Large difference in performance on validation and test data

MAE using Pipeline and GridSearchCV

How do I accept a non-csv input for my machine learning model?

Get panda Series from csv

Categories

Resources