Ensemble of machine learning models in scikit-learn - python

group feature_1 feature_2 year dependent_variable
group_a 12 19 2010 0.4
group_a 11 13 2011 0.9
group_a 10 5 2012 1.2
group_a 16 9 2013 3.2
group_b 8 29 2010 0.6
group_b 9 33 2011 0.1
group_b 111 15 2012 2.1
group_b 16 19 2013 12.2
In the dataframe above, I want to use feature_1, feature_2 to predict dependent_variable. To do this, I want to construct two models: In the first model, I want to construct a separate model for each group. In the second model, I want to use all the available data. In both cases, data from the years 2010 to 2012 will be used for training and 2013 will be used for testing.
How can I construct an ensemble model using the two models outlined above? The data is a toy dataset but in the real dataset, there will be a lot more groups, years and features. In particular, I am interested in an approach that will work with scikit-learn compatible models.

There will be multiple steps to creating an ensemble model.
Start by creating the two models individually. For the first model, split the data by group and train two individual models, then join the two models together in a function. For the second model, the data can be left in its entirety (aside from removing the testing data). Finally, create another function that joins the two models into one ensemble model.
To demonstrate, I'll start by importing the necessary modules and loading in the dataframe:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
data_str = """group,feature_1,feature_2,year,dependent_variable
group_a,12,19,2010,0.4
group_a,11,13,2011,0.9
group_a,10,5,2012,1.2
group_a,16,9,2013,3.2
group_b,8,29,2010,0.6
group_b,9,33,2011,0.1
group_b,111,15,2012,2.1
group_b,16,19,2013,12.2"""
data_list = [row.split(",") for row in data_str.split("\n")]
data = pd.DataFrame(data_list[1:], columns=data_list[0])
# Convert the numeric columns from strings so the models train on numbers
num_cols = ["feature_1", "feature_2", "year", "dependent_variable"]
data[num_cols] = data[num_cols].apply(pd.to_numeric)
train = data.loc[data["year"] != 2013]
test = data.loc[data["year"] == 2013]
This example uses RandomForestRegressor, but any regression model can be used. Also note that the dataframe used here differs from the given dataframe in that its rows are indexed from 0 rather than by group, and group is instead a column within the dataframe.
To construct the first model:
split the data into data for group a and for group b
train two independent models
join the models
The first two steps are done below:
# Splitting Data
train_a = train.loc[train["group"] == "group_a"]
train_b = train.loc[train["group"] == "group_b"]
test_a = test.loc[test["group"] == "group_a"]
test_b = test.loc[test["group"] == "group_b"]
# Training Two Models
model_a = RandomForestRegressor()
model_a.fit(train_a.drop(["dependent_variable", "year", "group"], axis = "columns"), train_a.dependent_variable)
model_b = RandomForestRegressor()
model_b.fit(train_b.drop(["dependent_variable", "year", "group"], axis = "columns"), train_b.dependent_variable)
Then, their predict methods can be joined together:
def individual_predictor(group, feature_1, feature_2):
    if group == "group_a":
        return model_a.predict([[feature_1, feature_2]])[0]
    elif group == "group_b":
        return model_b.predict([[feature_1, feature_2]])[0]
This will take in a group and two features individually and return the prediction. This can be adapted to whatever input and output type is necessary.
To create the second model, leave the data whole and train only one model, which also removes the need to join models:
model = RandomForestRegressor()
model.fit(train.drop(["dependent_variable", "year", "group"], axis = "columns"), train.dependent_variable)
Finally, you can join the models together into an ensemble model by averaging the result of their predict methods:
def ensemble_predict(group, feature_1, feature_2):
    return (individual_predictor(group, feature_1, feature_2)
            + model.predict([[feature_1, feature_2]])[0]) / 2
Again, this takes in a group and two features and returns the result. This will likely need to be adapted to another format, such as taking in a list of inputs and outputting a list of predictions, as in the sketch below.
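For example, here is a minimal sketch (not part of the original answer) of a vectorized version that takes a dataframe of rows like test and returns an array of ensemble predictions, reusing model, model_a, model_b and individual_predictor from above:
def ensemble_predict_df(df):
    # Average the per-group model with the pooled model for every row.
    feats = df[["feature_1", "feature_2"]].astype(float)
    per_group = np.array([
        individual_predictor(g, f1, f2)
        for g, f1, f2 in zip(df["group"], feats["feature_1"], feats["feature_2"])
    ])
    pooled = model.predict(feats)
    return (per_group + pooled) / 2

print(ensemble_predict_df(test))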

This one uses two regressors, RandomForestRegressor and GradientBoostingRegressor.
I added more 2013 data so that r2_score can be calculated (it needs more than one test sample), and also added data from other years. Copy the text below and save it to a txt file.
First we process the data file and separate the train and test sets by dataframe manipulation. We then create a model for each regressor: models 1.1 and 1.2 for groups "a" and "b" respectively, and model 2 for all the data. After creating each model we save it to disk for later use.
After the models are created we make predictions using all the test data and a single row. The r2_score and MAE metrics are also printed.
The last part tests the saved model file by loading it and letting it predict on a test input. Predictions from the model in memory and from disk should be the same. There are also examples of the supported input types and how to use them in a custom prediction function.
See also the docstring and comments in the code for how this works.
data.txt
group feature_1 feature_2 year dependent_variable
group_a 12 19 2010 0.4
group_a 7 15 2010 1.5
group_a 11 13 2011 0.9
group_a 8 8 2011 2.1
group_a 10 5 2012 1.2
group_a 11 9 2012 2.6
group_a 16 9 2013 3.2
group_a 8 10 2013 2.6
group_b 8 29 2010 0.6
group_b 11 18 2010 1.5
group_b 9 33 2011 0.1
group_b 20 15 2011 2.8
group_b 111 15 2012 2.1
group_b 99 10 2012 3.6
group_b 16 19 2013 12.2
group_b 4 8 2013 5.1
Code
myensemble.py
"""sklearn ensemble modeling.
Dependencies:
* sklearn
* pandas
* numpy
References:
* https://scikit-learn.org/stable/modules/classes.html?highlight=ensemble#module-sklearn.ensemble
* https://pandas.pydata.org/docs/user_guide/indexing.html
"""
from typing import List, Union, Optional
import pickle # for saving file to disk
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
import pandas as pd
import numpy as np
def make_model(regressor, regname: str, modelfn: str, dfX: pd.DataFrame, dfy: pd.DataFrame):
    """Creates a model.

    Args:
        regressor: Can be RandomForestRegressor or GradientBoostingRegressor.
        regname: Regressor name.
        modelfn: The filename where the model will be saved to disk.
        dfX: The features in pandas dataframe.
        dfy: The target in pandas dataframe.

    Returns:
        Model
    """
    X = dfX.to_numpy()
    y = dfy.to_numpy()

    model = regressor(random_state=0)
    model.fit(X, y)

    # Save model.
    with open(f'{regname}_{modelfn}', 'wb') as f:
        pickle.dump(model, f)

    return model


def get_prediction(model, test: Union[List, pd.DataFrame, np.ndarray]) -> Optional[np.ndarray]:
    """Returns prediction based on model and test input or None."""
    if isinstance(test, (list, np.ndarray)):
        return model.predict([test])
    if isinstance(test, pd.DataFrame):
        return model.predict(np.array(test))
    return None


def model_and_prediction(df: pd.DataFrame, regressor, regname: str, modelfn: str):
    """Build model and show prediction and metrics.

    To build a model we need training data X with features
    and data y with target or dependent values.

    Args:
        df: A dataframe.
        regressor: Can be RandomForestRegressor or GradientBoostingRegressor.
        regname: The regressor name.
        modelfn: The filename where model will be saved to disk.

    Returns:
        None
    """
    features = ['feature_1', 'feature_2']

    # 1. Get the train dataframe
    train = df.loc[df.year != 2013]  # exclude 2013 in training data
    train_feature = train[features]  # select the features columns
    train_target = train.dependent_variable  # select the dependent column

    model = make_model(regressor, regname, modelfn, train_feature, train_target)

    # 2. Get the test dataframe
    test = df.loc[df.year == 2013]  # only include 2013 in test data
    test_feature = test[features]
    test_target = test.dependent_variable

    # 3. Get the prediction from all rows in test feature. See step 5
    # for single data prediction.
    prediction: np.ndarray = model.predict(np.array(test_feature))
    print(f'test feature:\n{np.array(test_feature)}')
    print(f'test prediction: {prediction}')  # prediction[0] ...
    print(f'test target: {np.array(test_target)}')

    # 4. metrics
    print(f'r2_score: {r2_score(test_target, prediction)}')
    print(f'mean_absolute_error: {mean_absolute_error(test_target, prediction)}\n')

    # 5. Get prediction from the first row of test features.
    prediction_1: np.ndarray = model.predict(np.array(test_feature.iloc[[0]]))
    print(f'1st row test:\n{test_feature.iloc[[0]]}')
    print(f'1st row test prediction array: {prediction_1}')
    print(f'1st row test prediction value: {prediction_1[0]}\n')  # get the element value


def main():
    datafn = 'data.txt'
    df = pd.read_fwf(datafn)
    print(df.to_string(index=False))

    # A. Create models for each type of regressor.
    regressors = [(RandomForestRegressor, 'RandomForrest'),
                  (GradientBoostingRegressor, 'GradientBoosting')]

    for (r, name) in regressors:
        print(f'::: Regressor: {name} :::\n')

        # Model 1 using group_a
        print(':: MODEL 1.1 ::')
        grp = 'group_a'
        modelfn = f'{grp}.pkl'  # filename of model to be saved to disk
        dfa = df.loc[df.group == grp]  # select group
        model_and_prediction(dfa, r, name, modelfn)

        # Model 1 using group_b
        print(':: MODEL 1.2 ::')
        grp = 'group_b'
        modelfn = f'{grp}.pkl'
        dfb = df.loc[df.group == grp]
        model_and_prediction(dfb, r, name, modelfn)

        # Model 2 using group a and b
        print(':: MODEL 2 ::')
        grp = 'group_ab'
        modelfn = f'{grp}.pkl'
        dfab = df.loc[(df.group == 'group_a') | (df.group == 'group_b')]
        model_and_prediction(dfab, r, name, modelfn)

    # B. Test saved model file prediction.
    print('::: Prediction from loaded model :::')
    mfn = 'GradientBoosting_group_ab.pkl'
    print(f'model: gradient boosting model 2, {mfn}')
    with open(mfn, 'rb') as f:
        loaded_model = pickle.load(f)

    # test: group_b 4 8 2013 5.1
    test = [4, 8]
    prediction = loaded_model.predict([test])
    print(f'test: {test}')
    print(f'prediction: {prediction[0]}\n')

    # C. Use get_prediction().
    # input from list
    test = [4, 8]
    prediction = get_prediction(loaded_model, test)
    print(f'test from list input:\n{test}')
    print(f'prediction from get_prediction() with list input: {prediction}\n')

    # input from dataframe
    testdata = {
        'feature_1': [8, 12],
        'feature_2': [19, 15],
    }
    testdf = pd.DataFrame(testdata)

    testrow = testdf.iloc[[0]]  # first row [8, 19]
    prediction = get_prediction(loaded_model, testrow)
    print(f'test from df input:\n{testrow}')
    print(f'prediction from get_prediction() with df input: {prediction}\n')

    testrow = testdf.iloc[[1]]  # second row [12, 15]
    prediction = get_prediction(loaded_model, testrow)
    print(f'test from df input:\n{testrow}')
    print(f'prediction from get_prediction() with df input: {prediction}\n')

    # input from numpy
    test = [8, 9]
    testnp = np.array(test)
    prediction = get_prediction(loaded_model, testnp)
    print(f'test from numpy input:\n{testnp}')
    print(f'prediction from get_prediction() with numpy input: {prediction}\n')


if __name__ == '__main__':
    main()
Output
group feature_1 feature_2 year dependent_variable
group_a 12 19 2010 0.4
group_a 7 15 2010 1.5
group_a 11 13 2011 0.9
group_a 8 8 2011 2.1
group_a 10 5 2012 1.2
group_a 11 9 2012 2.6
group_a 16 9 2013 3.2
group_a 8 10 2013 2.6
group_b 8 29 2010 0.6
group_b 11 18 2010 1.5
group_b 9 33 2011 0.1
group_b 20 15 2011 2.8
group_b 111 15 2012 2.1
group_b 99 10 2012 3.6
group_b 16 19 2013 12.2
group_b 4 8 2013 5.1
::: Regressor: RandomForrest :::
:: MODEL 1.1 ::
test feature:
[[16 9]
[ 8 10]]
test prediction: [1.811 2.186]
test target: [3.2 2.6]
r2_score: -10.67065000000004
mean_absolute_error: 0.9015000000000026
1st row test:
feature_1 feature_2
6 16 9
1st row test prediction array: [1.811]
1st row test prediction value: 1.8109999999999986
:: MODEL 1.2 ::
test feature:
[[16 19]
[ 4 8]]
test prediction: [2.116 2.408]
test target: [12.2 5.1]
r2_score: -3.3219170799444546
mean_absolute_error: 6.388
1st row test:
feature_1 feature_2
14 16 19
1st row test prediction array: [2.116]
1st row test prediction value: 2.116000000000001
:: MODEL 2 ::
test feature:
[[16 9]
[ 8 10]
[16 19]
[ 4 8]]
test prediction: [2.425 2.145 1.01 1.958]
test target: [ 3.2 2.6 12.2 5.1]
r2_score: -1.3250936994738867
mean_absolute_error: 3.8905000000000016
1st row test:
feature_1 feature_2
6 16 9
1st row test prediction array: [2.425]
1st row test prediction value: 2.4249999999999985
::: Regressor: GradientBoosting :::
:: MODEL 1.1 ::
test feature:
[[16 9]
[ 8 10]]
test prediction: [2.59996945 2.21271005]
test target: [3.2 2.6]
r2_score: -1.8335008778823685
mean_absolute_error: 0.4936602458577084
1st row test:
feature_1 feature_2
6 16 9
1st row test prediction array: [2.59996945]
1st row test prediction value: 2.59996945439128
:: MODEL 1.2 ::
test feature:
[[16 19]
[ 4 8]]
test prediction: [1.99807124 2.63511811]
test target: [12.2 5.1]
r2_score: -3.3703627491779713
mean_absolute_error: 6.333405322236132
1st row test:
feature_1 feature_2
14 16 19
1st row test prediction array: [1.99807124]
1st row test prediction value: 1.9980712422931164
:: MODEL 2 ::
test feature:
[[16 9]
[ 8 10]
[16 19]
[ 4 8]]
test prediction: [3.60257456 2.26208935 0.402739 2.10950224]
test target: [ 3.2 2.6 12.2 5.1]
r2_score: -1.538939968014979
mean_absolute_error: 3.882060991360607
1st row test:
feature_1 feature_2
6 16 9
1st row test prediction array: [3.60257456]
1st row test prediction value: 3.6025745572622014
::: Prediction from loaded model :::
model: gradient boosting model 2, GradientBoosting_group_ab.pkl
test: [4, 8]
prediction: 2.1095022367629728
test from list input:
[4, 8]
prediction from get_prediction() with list input: [2.10950224]
test from df input:
feature_1 feature_2
0 8 19
prediction from get_prediction() with df input: [0.50307204]
test from df input:
feature_1 feature_2
1 12 15
prediction from get_prediction() with df input: [1.46058714]
test from numpy input:
[8 9]
prediction from get_prediction() with numpy input: [2.30007317]
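The answer above builds and saves the component models but stops short of blending them; as a rough sketch (not from the answer), the per-group model and the all-data model can be averaged using the pickle files it creates:
import pickle
import numpy as np

def load(fn):
    with open(fn, 'rb') as f:
        return pickle.load(f)

model_a = load('RandomForrest_group_a.pkl')    # model 1.1 (group_a only)
model_ab = load('RandomForrest_group_ab.pkl')  # model 2 (all data)

row = np.array([[16, 9]])                      # the group_a 2013 row from data.txt
blended = (model_a.predict(row) + model_ab.predict(row)) / 2
print(blended[0])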

Firstly, create models using time-series algorithms (using only the date variable and the dependent variable), fbprophet (which uses features + date + dependent variable), and tree-based regression algorithms like CatBoost/XGBoost/LightGBM (which use features + date + dependent variable).
Using each of the mentioned algorithms, create models for each group (a bottom-up approach). Different models will perform well for different groups, so take a weighted mean based on the models' performance. Suppose group_a predictions perform best with CatBoost, then with fbprophet, and then with an exponential moving average: use weights in proportion to the accuracies derived from these models.
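A minimal sketch of that weighted mean; all of the names here (the per-model prediction arrays and their validation accuracies) are hypothetical placeholders:
import numpy as np

# Hypothetical per-group predictions from three models and their validation
# accuracies (e.g. from a hold-out year); a higher accuracy gets a larger weight.
preds = np.array([catboost_pred, prophet_pred, ema_pred])   # shape (3, n_rows)
accs = np.array([acc_catboost, acc_prophet, acc_ema])
weights = accs / accs.sum()                                  # proportional weights
weighted_mean_pred = weights @ preds                         # ensemble prediction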
You can aggregate the results of the group-level models to get aggregated results. You can also create separate models on the aggregated data (summing up by year).

If I understand the last line of your question correctly, in the context of the years, you are looking to capture the trend within a given calendar year via model 1 and the trend across multiple years via model 2. Model 2 is where there could be an issue, because you mentioned scikit-learn compatible models.
So I'll try to explain the approach I would take.
Model 1 is pretty straightforward: it is a regression problem, so selecting the best regression model should not be an issue. You can find that by looking at results within a given calendar year.
Model 2 is where you would like to capture time-series features, a kind of year-over-year effect. While there isn't any model in scikit-learn that directly captures the time parameter the way ARIMA or RNNs do, there are ways to use scikit-learn models to do the forecasting, and a lot of it depends on feature engineering. You could take features 1 and 2, sort them, shift them, and then take a diff to create new features, say 1a and 2a, which could then be used with any regression model; these new features would capture the time essence. I could write a lengthy post on that here, but I think you'll find this link much better written.
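As a rough sketch of that feature-engineering idea on the question's dataframe (column names taken from the question; the first year of each group will get NaN for the new features):
# Sort within each group by year, then build year-over-year difference features.
df = df.sort_values(["group", "year"])
df["feature_1a"] = df.groupby("group")["feature_1"].diff()
df["feature_2a"] = df.groupby("group")["feature_2"].diff()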
Now coming to ensembling the two models together: as this is a regression problem, the best way, I feel, would be to assign weights to the outputs of both models, let's say alpha for model 1 and beta for model 2. Treat alpha and beta as hyperparameters and tune them using the data (see the sketch below).
This should make a pretty good ensemble with SKLearn models.
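A minimal sketch of tuning that blend weight on held-out data; pred1, pred2 and y_val are hypothetical arrays holding the two models' validation predictions and the validation targets:
import numpy as np
from sklearn.metrics import mean_absolute_error

# Constrain beta = 1 - alpha and grid-search alpha on a validation set.
alphas = np.linspace(0, 1, 21)
best_alpha = min(alphas,
                 key=lambda a: mean_absolute_error(y_val, a * pred1 + (1 - a) * pred2))
final_pred = best_alpha * pred1 + (1 - best_alpha) * pred2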

Related

How to determine the cause of not achieving a target in using machine learning model [closed]

Please, I want to know if it is possible to determine which specific variables influenced the prediction for a given test sample. The model below clarifies the question;
Given a dataset to predict the score of students.
ID Studies hours Games hours lectures hours social Activities Score
0 1 20 5 15 2 78
1 2 15 6 13 3 69
2 3 31 2 16 1 95
3 4 22 2 15 2 80
4 5 19 7 15 4 71
5 6 10 8 10 8 52
6 7 13 7 11 6 59
7 8 34 1 16 1 96
8 9 25 6 15 1 83
9 10 22 3 16 2 76
10 11 17 7 15 1 66
11 12 28 2 14 2 87
12 13 21 3 16 3 77
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from numpy import absolute
from xgboost import XGBModel
import pickle
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
from xgboost import plot_importance
data = pd.read_csv("student_score.csv")
def perfomance(data):
    X = data.iloc[:, :-1]
    y = data.iloc[:, -1:]

    model = XGBModel(booster='gbtree')
    # model = XGBModel(booster='gblinear')
    model.fit(X, y)

    cv = RepeatedKFold(n_splits=3, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # force scores to be positive
    scores = np.absolute(scores)
    metrics = ('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))

    # save the model to disk
    filename = 'score.sav'
    pickle.dump(model, open(filename, 'wb'))

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

    # load the model from disk
    loaded_model = pickle.load(open('score.sav', 'rb'))
    result = loaded_model.predict(X_test)
    print(result)

    plt.rcParams["figure.figsize"] = (20, 15)
    plot_importance(model)
    plt.show()
Feature Importances :
[5.6058721e-04 6.7560148e-01 3.1960118e-01 4.2312010e-03 5.4962843e-06]
The feature importance is the general importance ranked by the model.
What I need now is:
When I pick a sample test, say test = pd.DataFrame([{"Studies hours": 15, "Games hours": 6, "lectures hours": 13, "social Activities": 3}]),
and predict with loaded_model.predict(test), I get a score like 68. Which of the variables specifically (not the general importance) made this specific sample score 68 rather than 100?
For example, the model should tell me that the study hours were bad or less than expected.
Can a machine learning model do that?
The topic you're describing is called model explainability or interpretability. The more sophisticated the model, the more accurate it tends to be, but the harder it is to explain (generally speaking). SHAP values are the most common way I see people explain the effect of each feature on predictions in general, and of each feature value on the prediction for a given observation. The most common visualization of SHAP values is the force plot.
The blog post Explain Any Models with the SHAP Values — Use the KernelExplainer explains how to build a force plot for any model.
You can explain the model's decision for a specific example using SHAP: a waterfall or force plot can show why the model scored 68 for that example based on the input variables.
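As a minimal sketch of that, using the KernelExplainer approach from the blog above (model, X and test are assumed to be the fitted XGBModel, the training features and the one-row dataframe from the question):
import shap

# Background data = training features; explain the single test row.
explainer = shap.KernelExplainer(model.predict, X)
shap_values = explainer.shap_values(test)   # per-feature contributions for this row
shap.force_plot(explainer.expected_value, shap_values[0], test.iloc[0], matplotlib=True)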

predict future demand using limited variable past data

I have the past demand in kilometers travelled by customers from 2003-2020 in an Excel file named transport_demand.xlsx, and I have to predict the demand in 2050 using a linear regression model.
The data looks like this:
Year Transport Demand
0 2003 1070000000000
1 2004 1090000000000
2 2005 1090000000000
3 2006 109900000
4 2007 1100000000000
5 2008 1110000000000
6 2009 1120000000000
7 2010 1120000000000
8 2011 1130000000000
9 2012 1140000000000
10 2013 1140000000000
11 2014 1160000000000
12 2015 1180000000000
13 2016 1200000000000
14 2017 1160000000000
15 2018 1160000000000
16 2019 1170000000000
17 2020 943000000000
I am OK with statistics, so I thought of taking a 5-year average, but since the data has so few rows, a 5- or 10-year average is difficult. I am new to Python, so training and testing with such small data is very confusing to me. I am confused about how to go forward or what to code.
import pandas as pd
demand=pd.read_excel('transport_demand.xlsx')
Then I used the following code to check for outliers:
demand.describe()
Since all the data lie within the mean plus or minus the standard deviation, I assume that there are no outliers here.
Then I used a graph to see the trend:
# making Date value a true date-time
demand["Year"] = pd.to_datetime(demand["Year"], format="%Y")
# plot demand dataframe
ax = demand.plot("Year", "Transport Demand",color='green', marker='o')
The different ups and downs made me confused about how to predict, but I still tried to go forward.
import numpy as np
import seaborn as sns
# Equation: Demand = β0 + β1*Year + e
#Setting the value for X and Y
x = demand[['Year']]
y = demand['Transport Demand']
#Splitting the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 100)
#Fitting the Linear Regression model
from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr.fit(x_train, y_train)
I hope I am correct so far, but I want the model to predict the 2050 demand. How do I do this?
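There is no answer recorded here, but as a minimal sketch of one way to extrapolate (scikit-learn expects numeric features, so this keeps Year as a plain integer instead of converting it to a datetime):
import pandas as pd
from sklearn.linear_model import LinearRegression

demand = pd.read_excel('transport_demand.xlsx')
X = demand[['Year']].astype(int)            # year as a numeric feature
y = demand['Transport Demand']

model = LinearRegression().fit(X, y)        # with so few rows, fit on all of them
future = pd.DataFrame({'Year': [2050]})     # same column name as the training data
print(model.predict(future)[0])             # extrapolated 2050 demand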

How to do Multi label classification or Multi class classification of the below problem? Pandas Python

My original data looks like this.
id season home_team away_team home_goals away_goals result winner
0 0 2006-07 Shu Liv 1 1 D NaN
1 1 2006-07 Ars Avl 1 1 D NaN
2 2 2006-07 Eve Wat 2 1 H Eve
3 3 2006-07 New Wig 2 1 H New
4 4 2006-07 Por Bla 3 0 H Por
The purpose is to build a model that predicts outcome probabilities, e.g.:
Home Team Win 55%
Draw 13%
Away Team Win 32%
I selected these 3 columns and label encoded them:
home_team, away_team, winner
Then I created these new classes/labels:
df.loc[df["winner"]==df["home_team"],"home_team_win"]=1
df.loc[df["winner"]!=df["home_team"],"home_team_win"]=0
df.loc[df["result"]=='D',"draw"]=1
df.loc[df["result"]!='D',"draw"]=0
df.loc[df["winner"]==df["away_team"],"away_team_win"]=1
df.loc[df["winner"]!=df["away_team"],"away_team_win"]=0
Now the encoded data looks like this:
home_team away_team home_team_win away_team_win draw
0 28 19 0 0 1
1 1 2 0 0 1
2 14 34 1 0 0
3 23 37 1 0 0
4 25 4 1 0 0
Initially, I used the code below for the single label 'home_team_win' and it worked fine, but it doesn't support multiple classes/labels.
X = prediction_df.drop(['home_team_win'] ,axis=1)
y = prediction_df['home_team_win']
logReg=LogisticRegression(solver='lbfgs')
rfe = RFE(logReg, 20)
rfe = rfe.fit(X, y.values.ravel())
How to do Multi label classification or Multi class classification of this problem?
The target binary variables home_team_win, away_team_win, and draw are mutually exclusive. It does not seem to be a good idea to use multi-label methods for this problem since, in general, they are designed to exploit dependencies among labels, which do not exist in this dataset.
I suggest modelling it as a multi-class problem in its most common form, where there is a single column with three classes: 0, 1, and 2 (representing home team loss, draw, and home team win).
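A minimal sketch (not part of the original answer) of collapsing the three binary columns from the question into one such target column, using the encoding above (0 = home team loss, 1 = draw, 2 = home team win):
import numpy as np

Y = np.select(
    [prediction_df["home_team_win"] == 1, prediction_df["draw"] == 1],
    [2, 1],
    default=0,                         # everything else is a home team loss
)
X = prediction_df[["home_team", "away_team"]]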
Many implementations of classifiers in scikit-learn can work directly in this manner. Logistic Regression is one of them:
from sklearn.linear_model import LogisticRegression
logReg=LogisticRegression(solver='lbfgs', multi_class='ovr')
logReg.fit(X,Y)
logReg.predict_proba(X)
This code will output the desired probabilities for each class of each row of X.
In particular, this code trains one Logistic Regression for each class separately (this is what the multi_class='ovr' parameter does).
Take a look at https://scikit-learn.org/stable/supervised_learning.html for other classifiers that directly work in this multi-class dataset form that I suggested.

Online Logistic Regression by Month with Sklearn

I would like to train a Logistic Regression classifier in an online fashion with sklearn. I know about the 'sag' and 'saga' solvers, but I am not sure how to implement this.
Specifically, my goal is to have the algorithm train on the months from t-x to t (e.g. x=3), where t is a month in the year, and then make a prediction over the set of examples for the following month (time t+1).
Here is my df:
X.head()
year month age job marital
0 2008 5 56 3 1
1 2008 5 57 7 1
2 2008 5 37 7 1
3 2008 5 40 0 1
4 2008 5 56 7 1
y.head()
0 0
1 1
2 0
3 0
4 0
Name: y, dtype: int8
Say I have my clf as in the code below (in this example I have trained it on the entire dataset in batch):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

clf = LogisticRegression(C=1, max_iter=100, class_weight='balanced')
clf.fit(X, y)
y_pred = clf.predict(X)

cmx = pd.DataFrame(confusion_matrix(y, y_pred),
                   index=['No', 'Yes'],
                   columns=['No', 'Yes'])
Note that I am not just looking to create a separate model for each month in the dataset, but to have a model train itself in an online (technically minibatch) fashion across the entire dataset.
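No answer is recorded for this question here, but as a hedged sketch of one common approach: scikit-learn's SGDClassifier with a logistic loss supports partial_fit, so the model can be updated month by month and evaluated on the following month (X and y as shown above):
from sklearn.linear_model import SGDClassifier

# loss='log_loss' gives logistic regression (called 'log' in older sklearn versions).
clf = SGDClassifier(loss='log_loss', class_weight='balanced', random_state=0)
classes = y.unique()

periods = sorted(X[['year', 'month']].drop_duplicates().itertuples(index=False))
for prev, curr in zip(periods, periods[1:]):
    train_mask = (X['year'] == prev.year) & (X['month'] == prev.month)
    test_mask = (X['year'] == curr.year) & (X['month'] == curr.month)
    clf.partial_fit(X[train_mask], y[train_mask], classes=classes)  # online update
    y_pred = clf.predict(X[test_mask])   # predictions for the following month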

Why accuracy doesn't get printed?

So the following code never prints the accuracy.
#!/usr/bin/python

"""
This is the code to accompany the Lesson 2 (SVM) mini-project.

Use a SVM to identify emails from the Enron corpus by their authors:
    Sara has label 0
    Chris has label 1
"""

import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess
from sklearn import svm
from sklearn.metrics import accuracy_score


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

clf = svm.SVC(kernel='linear')
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print(accuracy_score(labels_test, pred))
I am trying to find out why the line print(accuracy_score(labels_test, pred)) does not print anything at all. It should print some value. What could be the issue?
I added max_iter to the SVC, which makes it finish and print something. I have seen people normally use 1000 iterations:
clf=svm.SVC(kernel='linear',max_iter=100)
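A commonly suggested alternative (not from the answer above): the linear SVC on the full Enron text features is simply very slow, so the script appears to hang rather than fail; training on a slice of the data also makes it finish and print, at some cost in accuracy:
# Train on 1% of the data so the linear SVM finishes quickly.
features_train_small = features_train[:len(features_train) // 100]
labels_train_small = labels_train[:len(labels_train) // 100]
clf = svm.SVC(kernel='linear')
clf.fit(features_train_small, labels_train_small)
print(accuracy_score(labels_test, clf.predict(features_test)))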
