I built a rudimentary XGBoost model to predict whether a student will enter college, based on a number of features (string, integer, and boolean). To encode the categorical labels, I used OneHotEncoder after splitting the data into X and y. The model works and I am also using it to make new predictions; however, I'm having trouble adapting the model to a Flask app. My intention is to build the app, run it locally, and then use Postman to make predictions by POSTing a JSON payload. The issue is that I do not know where to add the OneHotEncoder step for the new JSON I'm posting from Postman. Do I add OneHotEncoder as part of the GridSearchCV I'm using, or should I reconfigure the model as a pipeline that chains one-hot encoding, grid search, and fitting, so that when I use the saved model, the JSON sent via Postman goes through the OneHotEncoder step? Also, in the part of the Flask app that reads content['type_school'] = float(session['type_school']), can I specify bool or str instead of float before session?
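For reference, the pipeline variant I'm considering would look roughly like this (a minimal sketch: X is assumed to be a pandas DataFrame with the column names used later in this post, y the binary target, and the grid values are placeholders):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
import joblib

# string columns to encode -- substitute your real ones
categorical_cols = ['type_school', 'school_accreditation', 'gender',
                    'interest', 'residence']

preprocess = ColumnTransformer(
    [('ohe', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough')  # numeric/boolean columns pass through unchanged

pipe = Pipeline([('prep', preprocess),
                 ('model', xgb.XGBClassifier(objective='binary:logistic'))])

# hyperparameter names are prefixed with the pipeline step name
param_grid = {'model__max_depth': [3, 5],
              'model__learning_rate': [0.1, 0.3]}

search = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=3)
search.fit(X, y)  # X: DataFrame, y: target -- assumed defined elsewhere

# saving the whole fitted pipeline means new JSON rows are encoded automatically
joblib.dump(search.best_estimator_, 'college_model.pkl')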
Flask app code:
def return_prediction(model, sample_json):
type_school = sample_json["type_school"]
school_accreditation = sample_json["school_accreditation"]
gender = sample_json['gender']
interest = sample_json['interest']
residence = sample_json['residence']
parent_age = sample_json['parent_age']
    parent_salary = sample_json['parent_salary']
house_area = sample_json['house_area']
average_grades = sample_json['average_grades']
parent_was_in_college = sample_json['parent_was_in_college']
college = [[type_school, school_accreditation, gender, interest,
residence, parent_age, parent_salary, house_area,
                average_grades, parent_was_in_college]]
classes = np.array(['TRUE','FALSE'])
    class_ind = model.predict(college)
return classes[class_ind][0]
app = Flask(__name__)
app.config['SECRET_KEY'] = 'mysecretkey'
class CollegeForm(FlaskForm):
type_school = TextField("type_school")
school_accreditation = TextField("school_accreditation")
gender = TextField('gender')
interest = TextField('interest')
residence = TextField('residence')
parent_age = TextField('parent_age')
parent_salary = TextField('parent salary')
house_area = TextField('house_area')
average_grades = TextField('average_grades')
parent_was_in_college = TextField('parent_was_in_college')
submit = SubmitField('Predict')
#app.route("/",methods=['GET', 'POST'])
def index():
form = CollegeForm()
if form.validate_on_submit():
session['type_school'] = form.type_school.data
session['school_accreditation'] = form.school_accreditation.data
        session['gender'] = form.gender.data
session['interest'] = form.interest.data
session['residence'] = form.residence.data
session['parent_age'] = form.parent_age.data
session['parent_salary'] = form.parent_salary.data
session['house_area'] = form.house_area.data
session['average_grades'] = form.average_grades.data
session['parent_was_in_college'] = form.parent_was_in_college.data
return redirect(url_for("predictions"))
return render_template('home_2.html',form=form)
college_model = joblib.load("college_model.pkl")
@app.route('/prediction')
def prediction():
content = {}
content['type_school'] = float(session['type_school'])
content['school_accreditation'] = float(session['school_accreditation'])
content['gender'] = float(session['gender'])
content['interest'] = float(session['interest'])
content['residence'] = float(session['residence'])
content['parent_age'] = float(session['parent_age'])
content['parent_salary'] = float(session['parent_salary'])
content['house_area'] = float(session['house_area'])
content['average_grades'] = float(session['average_grades'])
content['parent_was_in_college'] = float(session['parent_was_in_college'])
results = return_prediction(college_model, content)
return render_template('predictions.html',results=results)
if __name__=='__main__':
app.run()
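On the float(session[...]) side question: everything a TextField stores in the Flask session comes back as a string, so the cast has to match each feature's type rather than being float across the board. A small sketch of what I mean (same field names as above; the boolean parsing is my assumption about the data format):

content = {}
# string features stay strings -- the encoder will handle them
content['type_school'] = session['type_school']
content['gender'] = session['gender']
# numeric features get an explicit cast
content['parent_age'] = float(session['parent_age'])
content['parent_salary'] = float(session['parent_salary'])
# booleans need explicit parsing: bool('False') is True in Python
content['parent_was_in_college'] = session['parent_was_in_college'].lower() == 'true'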
Related
I have an XGBoost model that predicts whether a student will enter college based on a number of features. Part of the model uses OneHotEncoder to transform a few columns with string values. There's nothing wrong with the model, but I've run into issues building a rudimentary Flask app that takes in a JSON payload to make a prediction. My confusion is where to add the OneHotEncoder step. Would I need to rebuild the model using a pipeline for one-hot encoding, model parameters, and fitting, save the model again, and then, when I send the JSON via Postman, the saved model would put the data through the OneHotEncoder step? Or can I add OneHotEncoder as part of the GridSearchCV step?
optimal_params = GridSearchCV(
estimator = xgb.XGBClassifier(objective='binary:logistic'),
param_grid=param_grid,
scoring='roc_auc',
verbose=2,
n_jobs=10,
cv=3
)
optimal_params.fit(X,
y,
early_stopping_rounds=10,
eval_metric='auc',
eval_set=[(X_test, y_test)],
verbose=False)
Flask code:
def return_prediction(college_model, sample_json):
type_school = sample_json["type_school"]
school_accreditation = sample_json["school_accreditation"]
gender = sample_json['gender']
interest = sample_json['interest']
residence = sample_json['residence']
parent_age = sample_json['parent_age']
    parent_salary = sample_json['parent_salary']
house_area = sample_json['house_area']
average_grades = sample_json['average_grades']
parent_was_in_college = sample_json['parent_was_in_college']
college = [[type_school, school_accreditation, gender, interest,
residence, parent_age, parent_salary, house_area,
                average_grades, parent_was_in_college]]
class_ind = college_model.predict(college)
return class_ind
app = Flask(__name__)
#app.route("/")
def index():
    return '<h1>Flask Running</h1>'
college_model = joblib.load("college_model.pkl")
column_trans = joblib.load("ohe.pkl")
@app.route('/college', methods=['POST'])
def prediction():
content = request.json
    results = return_prediction(college_model, content)
results = results.tolist()
return jsonify(results)
if __name__=='__main__':
app.run()
After GridSearchCV, take the final fitted estimator and save the model to a .pkl file.
Then you have to put the preprocessing step into your Flask code:
the incoming data must be one-hot encoded the same way as the training data before it can be predicted on.
def return_prediction(college_model, sample_json):
type_school = sample_json["type_school"]
school_accreditation = sample_json["school_accreditation"]
gender = sample_json['gender']
interest = sample_json['interest']
residence = sample_json['residence']
parent_age = sample_json['parent_age']
    parent_salary = sample_json['parent_salary']
house_area = sample_json['house_area']
average_grades = sample_json['average_grades']
parent_was_in_college = sample_json['parent_was_in_college']
    college = pd.DataFrame([[type_school, school_accreditation, gender, interest,
                             residence, parent_age, parent_salary, house_area,
                             average_grades, parent_was_in_college]],
                           columns=['type_school', 'school_accreditation', 'gender',
                                    'interest', 'residence', 'parent_age',
                                    'parent_salary', 'house_area',
                                    'average_grades', 'parent_was_in_college'])
    # one-hot encode your string columns here (requires import pandas as pd)
    college = pd.get_dummies(college, columns=['your string columns'])
class_ind = college_model.predict(college)
return class_ind
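Note that pd.get_dummies on a single incoming row only creates columns for the values present in that row, so the column layout generally won't match what the model saw during training. A more robust variant (a sketch, assuming ohe.pkl holds the fitted encoder loaded as column_trans in the Flask code above) reuses the fitted transformer instead:

# inside return_prediction, after building the one-row DataFrame
college_encoded = column_trans.transform(college)  # same encoding as training
class_ind = college_model.predict(college_encoded)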
I need to run a Python script from a .NET Core Web API repository to get the cluster of the current user and recommend a list of courses based on his ratings. (I'm implementing a collaborative-filtering recommender system, using sklearn's NearestNeighbors to help with user clustering.)
List Course Repository
public async Task<IEnumerable<Course>> GetList(int? userId, string name, ContextSession session, bool includeDeleted = false)
{
var entity = GetEntities(session, includeDeleted).AsQueryable();
var courses = await entity.Where(obj => obj.Id > 0).ToListAsync();
SaveToCsv<Course>(courses, Environment.CurrentDirectory + "/csv/course.csv");
var ratings = _dbContext.Set<CourseRating>().AsQueryable();
if (!includeDeleted)
{
ratings = ratings.Where(obj => !obj.IsDeleted);
}
var i = await ratings.Where(obj => obj.Id > 0).ToListAsync();
SaveToCsv<CourseRating>(i, Environment.CurrentDirectory + "/csv/ratings_Final.csv");
ScriptRuntimeSetup setup = Python.CreateRuntimeSetup(null);
ScriptRuntime runtime = new ScriptRuntime(setup);
ScriptEngine engine = Python.GetEngine(runtime);
ScriptSource source = engine.CreateScriptSourceFromFile(Environment.CurrentDirectory + "/CB_RS.py");
ScriptScope scope = engine.CreateScope();
List<String> argv = new List<String>();
argv.Add(session.UserId.ToString());
argv.Add(Environment.CurrentDirectory + "/csv/course.csv");
argv.Add(Environment.CurrentDirectory + "/csv/ratings_Final.csv");
engine.GetSysModule().SetVariable("argv", argv);
try
{
source.Execute(scope);
}
catch(Exception ex)
{
var error = ex;
}
return await entity.Where(obj => obj.Id > 0).ToListAsync();
}
The Python Script
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
import operator
def get_rating_course_user(courses_path, ratings_path):
course_df = pd.read_csv(courses_path)
rating_df = pd.read_csv(ratings_path)
#drop useless columns
#course_df.drop(columns_list, axis = 1)
#rating_df.drop([columns_list], axis = 1)
#merge 2 dataset on courseID
df = rating_df.merge(course_df, on = 'courseId')
# Get the ratings per course per user
pvt = df.pivot_table(index='userId',columns='name',values='rating')
#convert all NaN to -1.
pvt.fillna(-1,inplace=True)
#convert pivot table to sparce matrix
spm = csr_matrix(pvt)
return spm, pvt
#returns a list of recommended courses according to similar user's ratings
def make_prediction(userId, courses_path, ratings_path, n_neighbors):
recommended = []
spm, pvt = get_rating_course_user(courses_path, ratings_path)
# Creating, training the model
nn = NearestNeighbors(algorithm = 'brute')
nn.fit(spm)
#testing the model
test = pvt.iloc[userId,:].values.reshape(1,len(pvt.columns))
distance, suggested = nn.kneighbors(test,n_neighbors=n_neighbors)
for i in range(len(suggested[0])):
x = pvt.iloc[suggested[0][i],:].values.reshape(1,len(pvt.columns))
result = np.where(x >= 3)
for j in range(len(result[1])):
course = pvt.iloc[:,result[1][j]].name
if course not in recommended:
recommended.append(course)
return recommended
#return dictionary {id_course: average_rating} of k courses having the best average_rating
def recommend_K_best_courses(recommendedList, courses_path, k):
courses = pd.read_csv(courses_path)
courseIds = {}
for i in (recommendedList):
        ID = (courses.loc[courses.name == i].courseId).to_string(index=False)
        rate = (courses.loc[courses.name == i].average).to_string(index=False)
courseIds[int(ID)] = float(rate)
sort = dict(sorted(courseIds.items(), key=operator.itemgetter(1), reverse=True)[:k])
return sort
def main(userId, courses_path, ratings_path, n_neighbors=7, k=10){
recommended_list = make_prediction(userId, courses_path, ratings_path, n_neighbors)
return recommend_K_best_courses(recommended_list, courses_path, k)
}
if __name__ == "__main__":
main(userId, courses_path, ratings_path, n_neighbors=7, k=10)
I got the following error:
unexpected token '{'
My current problem is running the Python script and then using the result returned from the script's main method.
I made a C# developer's mistake: I opened the main method with '{' instead of ':'.
def main(userId, courses_path, ratings_path, n_neighbors=7, k=10):
recommended_list = make_prediction(userId, courses_path, ratings_path, n_neighbors)
return recommend_K_best_courses(recommended_list, courses_path, k)
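To actually consume the result on the .NET side, one option (a sketch; recommendations is a hypothetical variable name) is to read the arguments the host injected via engine.GetSysModule().SetVariable("argv", argv) and leave the return value in a module-level variable, which C# can then fetch with scope.GetVariable("recommendations") after source.Execute(scope):

import sys

if __name__ == "__main__":
    # argv was set from C#: [userId, courses_path, ratings_path]
    user_id = int(sys.argv[0])
    courses_path = sys.argv[1]
    ratings_path = sys.argv[2]
    # module-level variable the host can read back from the scope
    recommendations = main(user_id, courses_path, ratings_path,
                           n_neighbors=7, k=10)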
I have trained a deep learning model (an LSTM) with Keras, saved it as an .h5 file, and now I want to hit a web service in order to get back a category. This is the first time I have tried to do this, so I am a little confused: I cannot figure out how to get the categories back. Also, when I send a request to http://localhost:8000/predict I get the following error,
The server encountered an internal error and was unable to complete your
request. Either the server is overloaded or there is an error in the
application.
and in the notebook
ValueError: Tensor Tensor("dense_3/Softmax:0", shape=(?, 6), dtype=float32)
is not an element of this graph.
I tried the solution from a linked answer, but it is not working.
The code so far is below
from flask import Flask, request, jsonify  # jsonify will return the data
import os
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import LabelEncoder, StandardScaler
from keras.models import load_model
app = Flask(__name__)
model=load_model('lstm-final-five-Copy1.h5')
@app.route('/predict', methods=["GET", "POST"])
def predict():
df_final = pd.read_csv('flask.csv')
activities = df_final['activity'].value_counts().index
label = LabelEncoder()
df_final['label'] = label.fit_transform(df_final['activity'])
X = df_final[['accx', 'accy', 'accz', 'gyrx', 'gyry', 'gyrz']]
y = df_final['label']
scaler = StandardScaler()
X = scaler.fit_transform(X)
df_final = pd.DataFrame(X, columns = ['accx', 'accy', 'accz', 'gyrx',
'gyry', 'gyrz'])
df_final['label'] = y.values
Fs = 50
frame_size = Fs*2 # 200 samples
hop_size = frame_size # 40 samples
def get_frames(df_final, frame_size, hop_size):
N_FEATURES = 6 #x,y,z (acc,gut)
frames = []
labels = []
for i in range(0, len(df_final) - frame_size, hop_size):
accx = df_final['accx'].values[i: i + frame_size]
accy = df_final['accy'].values[i: i + frame_size]
accz = df_final['accz'].values[i: i + frame_size]
gyrx = df_final['gyrx'].values[i: i + frame_size]
gyry = df_final['gyry'].values[i: i + frame_size]
gyrz = df_final['gyrz'].values[i: i + frame_size]
# Retrieve the most often used label in this segment
label = stats.mode(df_final['label'][i: i + frame_size])[0][0]
frames.append([accx, accy, accz, gyrx, gyry, gyrz])
labels.append(label)
# Bring the segments into a better shape
frames = np.asarray(frames).reshape(-1, frame_size, N_FEATURES)
labels = np.asarray(labels)
return frames, labels
X, y = get_frames(df_final, frame_size, hop_size)
pred = model.predict_classes(X)
return jsonify({"Prediction": pred}), 201
if __name__ == '__main__':
app.run(host="localhost", port=8000, debug=False)
It seems that in your '/predict' POST endpoint you aren't returning any values, which is why you aren't getting back a category as you expect.
If you wanted to add a GET method, you could add something like the following:
@app.route('/', methods=['GET'])
def check_server_status():
return ("Server Running!")
And in your case, the POST method could return the prediction from the endpoint:
@app.route('/predict', methods=['POST'])
def predict():
# Add in other steps here
pred = model.predict_classes(X)
return jsonify({"Prediction": pred}), 201
As far as I can see, you also need to install pandas if you haven't (pip install pandas) and import it as import pandas as pd.
You can also add the "GET" method to your /predict endpoint like:
@app.route("/predict", methods=["GET", "POST"])
I am getting a KeyError while converting the variables using LabelEncoder. This is the code that I used:
def preprocessor(df):
res_df = df.copy()
le = preprocessing.LabelEncoder()
res_df['"job"'] = le.fit_transform(res_df['"job"'])
res_df['"marital"'] = le.fit_transform(res_df['"marital"'])
res_df['"education"'] = le.fit_transform(res_df['"education"'])
res_df['"default"'] = le.fit_transform(res_df['"default"'])
res_df['"housing"'] = le.fit_transform(res_df['"housing"'])
res_df['"month"'] = le.fit_transform(res_df['"month"'])
res_df['"loan"'] = le.fit_transform(res_df['"loan"'])
res_df['"contact"'] = le.fit_transform(res_df['"contact"'])
res_df['"day_of_week"'] = le.fit_transform(res_df['"day"'])
res_df['"poutcome"'] = le.fit_transform(res_df['"poutcome"'])
res_df['"y"'] = le.fit_transform(res_df['"y"'])
return res_df
While executing the function, I am getting a KeyError:
encoded_df = preprocessor(df1)
x = encoded_df.drop(['"y"'],axis =1).values
y = encoded_df['"y"'].values
While executing the function I am still getting a KeyError, although I have split the columns using sep=';'. Can anyone please help?
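For what it's worth, a KeyError here usually means the keys (with their literal double quotes) don't match the DataFrame's actual column names. A quick check along these lines may help (a sketch; the file name is a placeholder):

import pandas as pd

df1 = pd.read_csv('bank.csv', sep=';')  # hypothetical file name
print(df1.columns.tolist())             # inspect the real column names

# if the headers carry literal quotes, strip them once and use plain keys
df1.columns = df1.columns.str.strip('"')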
I need help with reading from a local directory when running streaming k-means with PySpark. There is no good answer on this topic on Stack Overflow.
Here is my code:
if __name__ == "__main__":
ssc = StreamingContext(sc, 1)
training_data_raw, training_data_df = prepare_data(TRAINING_DATA_SET)
trainingData = parse2(training_data_raw)
testing_data_raw, testing_data_df = prepare_data(TEST_DATA_SET)
testingData = testing_data_raw.map(parse1)
#print(testingData)
trainingQueue = [trainingData]
testingQueue = [testingData]
trainingStream = ssc.queueStream(trainingQueue)
testingStream = ssc.queueStream(testingQueue)
# We create a model with random clusters and specify the number of clusters to find
model = StreamingKMeans(k=2, decayFactor=1.0).setRandomCenters(3, 1.0, 0)
# Now register the streams for training and testing and start the job,
# printing the predicted cluster assignments on new data points as they arrive.
model.trainOn(trainingStream)
result = model.predictOnValues(testingStream.map(lambda lp: (lp.label, lp.features)))
result.pprint()
ssc.textFileStream('file:///Users/userrname/PycharmProjects/MLtest/training/data/')
ssc.start()
ssc.awaitTermination()
Thanks!!
from pyspark.mllib.linalg import Vectors
trainingData = ssc.textFileStream("/training/data/dir").map(Vectors.parse)
And for the test examples:
from pyspark.mllib.regression import LabeledPoint
def parse(lp):
label = float(lp[lp.find('(') + 1: lp.find(',')])
vec = Vectors.dense(lp[lp.find('[') + 1: lp.find(']')].split(','))
return LabeledPoint(label, vec)
testData = ssc.textFileStream("/testing/data/dir").map(parse)
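Putting it together with the original job (a sketch: the training path is the local one from the question, and parse is the helper above):

from pyspark.mllib.clustering import StreamingKMeans

trainingData = ssc.textFileStream(
    "file:///Users/userrname/PycharmProjects/MLtest/training/data/").map(Vectors.parse)
testingData = ssc.textFileStream("/testing/data/dir").map(parse)

model = StreamingKMeans(k=2, decayFactor=1.0).setRandomCenters(3, 1.0, 0)
model.trainOn(trainingData)

result = model.predictOnValues(testingData.map(lambda lp: (lp.label, lp.features)))
result.pprint()

ssc.start()
ssc.awaitTermination()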