Predicting claim number through GLM model - python

I'm conducting a case study where I have to predict the number of claims per policy. Since my target variable ClaimNb is a count rather than binary, I can't use logistic regression and have to use Poisson regression instead.
My code for GLM model:
import statsmodels.api as sm
import statsmodels.formula.api as smf
formula= 'ClaimNb ~ BonusMalus+VehAge+Freq+VehGas+Exposure+VehPower+Density+DrivAge'
model = smf.glm(formula=formula, data=df, family=sm.families.Poisson())
I have also split my data
# train-test-split
from sklearn.model_selection import train_test_split
train , test = train_test_split(data,test_size=0.2,random_state=0)
# separate the target and independent variables
train_x = train.drop(columns=['ClaimNb'],axis=1)
train_y = train['ClaimNb']
test_x = test.drop(columns=['ClaimNb'],axis=1)
test_y = test['ClaimNb']
My problem now is the prediction. I tried the following, but it did not work:
from sklearn.linear_model import PoissonRegressor
model = PoissonRegressor(alpha=1e-3, max_iter=1000).fit(train_x, train_y)
predict = model.predict(test_x)
Is there any other way to predict and check the accuracy of the model?

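You need to assign the fitted result (the return value of model.fit()) to a variable and predict with that; this is different from sklearn. Also, if you are using the formula interface, it is better to split your DataFrame into train and test sets and predict on the test DataFrame directly. For example: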
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,(50,4)),columns=['ClaimNb','BonusMalus','VehAge','Freq'])
#X = df[['BonusMalus','VehAge','Freq']]
#y = df['ClaimNb']
df_train = df.sample(round(len(df)*0.8))
df_test = df.drop(df_train.index)
formula= 'ClaimNb ~ BonusMalus+VehAge+Freq'
model = smf.glm(formula = formula, data=df_train,family=sm.families.Poisson())
result = model.fit()
And we can do the prediction:
result.predict(df_test)
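The scikit-learn attempt from the question can also be made to work. The following is a minimal sketch, assuming synthetic data (the generated columns and the scaling step are illustrative assumptions, not the asker's dataset). Since ClaimNb is a count, "accuracy" is not really defined; mean Poisson deviance compared against a constant-mean baseline is one reasonable way to judge the fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the policy data (assumption for illustration only).
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "BonusMalus": rng.integers(50, 100, n),
    "VehAge": rng.integers(0, 20, n),
})
data["ClaimNb"] = rng.poisson(np.exp(-3.0 + 0.03 * data["BonusMalus"]))

train, test = train_test_split(data, test_size=0.2, random_state=0)
train_x, train_y = train.drop(columns=["ClaimNb"]), train["ClaimNb"]
test_x, test_y = test.drop(columns=["ClaimNb"]), test["ClaimNb"]

# Scaling the features helps the solver converge; fit() must receive X and y.
model = make_pipeline(StandardScaler(),
                      PoissonRegressor(alpha=1e-3, max_iter=1000))
model.fit(train_x, train_y)
pred = model.predict(test_x)  # expected claim counts, always positive

# Lower deviance is better; compare with always predicting the training mean.
deviance = mean_poisson_deviance(test_y, pred)
baseline = mean_poisson_deviance(test_y, np.full(len(test_y), train_y.mean()))
print(f"model deviance: {deviance:.4f}, baseline deviance: {baseline:.4f}")
```

If the model deviance is not clearly below the baseline, the features are adding little beyond the overall claim rate.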


How to interpret base_value of multi-class classification problem when using SHAP?

I am using the shap library for ML interpretability to better understand the clusters produced by k-means segmentation. In a nutshell, I make some blobs, use k-means to cluster them, then take the cluster assignments as labels and use xgboost to try to predict them. I have 5 clusters, so it is a single-label multi-class classification problem.
import numpy as np
from sklearn.datasets import make_blobs
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import xgboost as xgb
import shap
X, y = make_blobs(n_samples=500, centers=5, n_features=5, random_state=0)
data = pd.DataFrame(np.concatenate((X, y.reshape(500,1)), axis=1), columns=['var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'cluster_id'])
data['cluster_id'] = data['cluster_id'].astype(int).astype(str)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.iloc[:,:-1])
kmeans = KMeans(n_clusters=5, **kmeans_kwargs)
kmeans.fit(scaled_features)
data['predicted_cluster_id'] = kmeans.labels_.astype(int).astype(str)
clf = xgb.XGBClassifier().fit(scaled_data.iloc[:,:-1], scaled_data['predicted_cluster_id'])
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(scaled_data.iloc[0,:-1].values.reshape(1,-1))
shap.force_plot(explainer.expected_value[0], shap_values[0], link='logit') # repeat changing 0 for i in range(0, 5)
The plots above make sense, as the class is '3'. But why this base_value? Shouldn't it be 1/5? I asked a similar question a while ago, but this time I have already set link='logit'.
link="logit" does not seem right for multiclass, as it's only suitable for binary output. This is why you do not see probabilities summing up to 1.
Let's streamline your code:
import numpy as np
from sklearn.datasets import make_blobs
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import xgboost as xgb
import shap
from scipy.special import softmax, logit, expit
X, y_true = make_blobs(n_samples=500, centers=5, n_features=3, random_state=0)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=5)
y_predicted = kmeans.fit_predict(X_scaled)
clf = xgb.XGBClassifier().fit(X_scaled, y_predicted)
Then, what you see as expected values in:
explainer = shap.TreeExplainer(clf)
explainer.expected_value
array([0.67111245, 0.60223354, 0.53357694, 0.50821152, 0.50145331])
are base scores in raw space.
The multi-class raw scores can be converted to probabilities with softmax:
softmax(explainer.expected_value)
array([0.22229282, 0.20749694, 0.19372895, 0.18887673, 0.18760457])
shap.force_plot(..., link="logit") doesn't make sense for multiclass, and it seems impossible to switch from raw to probability and still maintain additivity (because softmax(x+y) ≠ softmax(x) + softmax(y)).
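The softmax conversion and the loss of additivity can be checked with plain NumPy. This is a small sketch using the raw base scores quoted above; the `contrib` vector is a made-up set of per-class contributions, used only to illustrate the point.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of raw scores."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Raw-space base scores reported by TreeExplainer in the answer above.
raw_base = np.array([0.67111245, 0.60223354, 0.53357694, 0.50821152, 0.50145331])
base_prob = softmax(raw_base)
print(base_prob, base_prob.sum())  # probabilities that sum to 1

# Hypothetical per-class contributions for one sample (made-up numbers).
contrib = np.array([-1.0, -0.5, -0.8, 3.0, -0.7])

# In raw space the explanation is additive: raw prediction = base + contributions.
raw_pred = raw_base + contrib

# In probability space that additivity is lost.
lhs = softmax(raw_pred)
rhs = softmax(raw_base) + softmax(contrib)
print(np.allclose(lhs, rhs))  # False
```

The right-hand side sums to 2 rather than 1, which is why raw-space SHAP values cannot simply be pushed through softmax term by term.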
Should you wish to analyze your data in probability space try KernelExplainer:
from shap import KernelExplainer
masker = shap.maskers.Independent(X_scaled, 100)
ke = KernelExplainer(clf.predict_proba, data=masker.data)
ke.expected_value
# array([0.18976762, 0.1900516 , 0.20042894, 0.19995041, 0.21980143])
shap_values = ke.shap_values(masker.data)
shap.force_plot(ke.expected_value[0], shap_values[0][0])
or summary plot:
from shap import Explanation
which are now additive for shap values in probability space and align well with both base probabilities (see above) and predicted probabilities for 0th datapoint:
array([[2.2844513e-04, 8.1287889e-04, 6.5225776e-04, 9.9737883e-01,
9.2762709e-04]], dtype=float32)

How can I forecast a y-variable based on multiple x-variables?

I'm testing code like this.
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tabulate import tabulate
#Seaborn for easier visualization
import seaborn as sns
# Load data
df = pd.read_csv('C:\\path_to_file\\train.csv')
# the model can only handle numeric values so filter out the rest
# data = df.select_dtypes(include=[np.number]).interpolate().dropna()
df1 = df.select_dtypes(include=[np.number])
df1 = df1.fillna(0)
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
#Split train/test sets
# y = df1.SalePrice
X = df1.drop(['index'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
# Train model
clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)
# Feature Importance
headers = ['name', 'score']
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt='plain'))
(pd.Series(model.feature_importances_, index=X.columns)
This works fine on some sample data that I found online. Now, rather than predicting a sale price as my y variable, I'm trying to figure out how to get the model to make a prediction like target = True or target = False. Or maybe my approach is wrong.
It's a bit confusing for me because of this line: df1 = df.select_dtypes(include=[np.number]). Only numeric columns are included, which makes sense for a RandomForestRegressor. I'm just looking for some guidance on how to deal with a non-numeric prediction here.
You are dealing with a classification problem here with 2 classes (True, False). To get started take a look at a simple logistic regression model.
Since you are using sklearn try:
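As a minimal sketch of that suggestion, the example below fits a logistic regression on a boolean target. The data are synthetic and the column names (`x1`, `x2`, `x3`, `target`) are assumptions made for illustration; with real data you would replace the `target` column with your own True/False label.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic numeric features plus a hypothetical boolean target.
rng = np.random.default_rng(42)
df1 = pd.DataFrame(rng.normal(size=(300, 3)), columns=["x1", "x2", "x3"])
noise = rng.normal(scale=0.5, size=300)
df1["target"] = (df1["x1"] + 0.5 * df1["x2"] + noise) > 0

X = df1.drop(columns=["target"])  # features stay numeric
y = df1["target"]                 # the label may be boolean or categorical

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.33)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)               # array of True/False
proba = clf.predict_proba(X_test)[:, 1]  # probability of the True class
acc = accuracy_score(y_test, pred)
print(f"accuracy: {acc:.3f}")
```

Swapping `LogisticRegression` for `RandomForestClassifier` keeps the rest of the code unchanged, so the feature-importance step from the question would still apply.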

Added StandardScaler but receive errors in cross-validation and the correlation matrix

This is the code I built to fit a multiple linear regression. I added StandardScaler to fix the p-value of the Y intercept, which was not significant. The problem is that the CV RMSE results at the end changed and no longer make sense, and the code for plotting the correlation matrix now raises an error: AttributeError: 'numpy.ndarray' object has no attribute 'corr'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from scipy.stats import pearsonr
# Import Excel File
data = pd.read_excel("C:\\Users\\AchourAh\\Desktop\\Multiple_Linear_Regression\\SP Level Reasons Excels\\SP000273701_PL14_IPC_03_09_2018_Reasons.xlsx",'Sheet1') #Import Excel file
# Replace null values of the whole dataset with 0
data1 = data.fillna(0)
# Extraction of the independent and dependent variables
X = data1.iloc[0:len(data1),[1,2,3,4,5,6,7]] #Extract the column of the COPCOR SP we are going to check its impact
Y = data1.iloc[0:len(data1),9] #Extract the column of the PAUS SP
# Data Splitting to train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.25,random_state=1)
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# Statistical Analysis of the training set with Statsmodels
X = sm.add_constant(X_train) # add a constant to the model
est = sm.OLS(Y_train, X).fit()
print(est.summary()) # print the results
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math
lm = LinearRegression() # create an lm object of LinearRegression Class
lm.fit(X_train,Y_train) # train our LinearRegression model using the training set of data - dependent and independent variables as parameters. Teaching lm that Y_train values are all corresponding to X_train.
mse_test = mean_squared_error(Y_test, lm.predict(X_test))
# Data Splitting to train and test set of the reduced data
X_1 = data1.iloc[0:len(data1),[1,2]] #Extract the column of the COPCOR SP we are going to check its impact
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X_1, Y, test_size =0.25,random_state=1)
X_train2 = ss.fit_transform(X_train2)
X_test2 = ss.transform(X_test2)
# Statistical Analysis of the reduced model with Statsmodels
X_reduced = sm.add_constant(X_train2) # add a constant to the model
est_reduced = sm.OLS(Y_train2, X_reduced).fit()
print(est_reduced.summary()) # print the results
# Fitting a Linear Model for the reduced model with Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math
lm1 = LinearRegression() #create an lm object of LinearRegression Class
lm1.fit(X_train2, Y_train2)
mse_test1 = mean_squared_error(Y_test2, lm1.predict(X_test2))
#Cross Validation and Training again the model
from sklearn.model_selection import KFold
from sklearn import model_selection
kf = KFold(n_splits=6, shuffle=True, random_state=1)  # random_state only takes effect with shuffle=True
for train_index, test_index in kf.split(X_train2):
    print("Train:", train_index, "Validation:",test_index)
    X_train1, X_test1 = X[train_index], X[test_index]
    Y_train1, Y_test1 = Y[train_index], Y[test_index]
results = -1 * model_selection.cross_val_score(lm1, X_train1, Y_train1,scoring='neg_mean_squared_error', cv=kf)
#RMSE values interpretation
#Good model built no overfitting or underfitting (Barely Same for test and training :Goal of Cross validation but low prediction accuracy = Value is big
import seaborn
seaborn.heatmap(Corr,cmap='RdYlGn_r',vmax=1.0,vmin=-1.0,mask=mask, linewidths=2.5)
Do you have any idea how to fix this issue?
I'm guessing the problem lies with:
.corr is a pandas dataframe method but X_train2 is a numpy array at that stage. If a dataframe/series is passed into StandardScaler, a numpy array is returned. Try replacing the above with:
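or use numpy.corrcoef directly on the array (pass rowvar=False so that columns are treated as variables). Note that numpy.correlate computes a 1-D cross-correlation of two sequences, not a correlation matrix.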
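A minimal sketch of both options, assuming a small synthetic DataFrame in place of the Excel data (the column names `feat_a` and `feat_b` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for the two selected feature columns (assumed data).
rng = np.random.default_rng(1)
X_train2 = pd.DataFrame(rng.normal(size=(40, 2)), columns=["feat_a", "feat_b"])

ss = StandardScaler()
scaled = ss.fit_transform(X_train2)  # returns a NumPy array, so no .corr()

# Option 1: wrap the array in a DataFrame again, keeping the column names.
scaled_df = pd.DataFrame(scaled, columns=X_train2.columns, index=X_train2.index)
Corr = scaled_df.corr()

# Option 2: stay in NumPy; rowvar=False treats columns as variables.
Corr_np = np.corrcoef(scaled, rowvar=False)

print(Corr)
```

Either result can then be passed to `seaborn.heatmap`; the DataFrame version has the advantage of carrying the column names onto the axis labels.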

How to use pandas to create a crosstab to show the prediction result of random forest predictor?

I'm new to random forests (as well as Python).
I'm using a random forest classifier; the dataset is named 't2002'.
Here are the columns:
Index(['IndividualID', 'ES2000_B01ID', 'NSSec_B03ID', 'Vehicle',
'IndIncome2002_B02ID', 'MarStat_B01ID', 'EcoStat_B03ID',
'MainMode_B03ID', 'TripStart_B02ID', 'TripEnd_B02ID',
'TripDisIncSW_B01ID', 'TripTotalTime_B01ID', 'TripTravTime_B01ID',
'TripPurpFrom_B01ID', 'TripPurpTo_B01ID'],
I'm using the code below to run the classifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
X_all = t2002.drop(['MainMode_B03ID'],axis=1)
y_all = t2002['MainMode_B03ID']
p = 0.2
X_train,X_test, y_train, y_test = train_test_split(X_all,y_all,test_size=p,
clf = RandomForestClassifier()
acc_scorer = make_scorer(accuracy_score)
parameters = {
} # parameter is blank
grid_obj = GridSearchCV(clf,parameters,scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train,y_train)
clf = grid_obj.best_estimator_
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
In this case, how could I use pandas to generate a crosstab (like a table) to show the detailed prediction results?
Thanks in advance!
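You can first create a confusion matrix using sklearn and then convert it to a pandas DataFrame. Compare the predictions with y_test, since they were made on X_test:
import pandas as pd
from sklearn.metrics import confusion_matrix
# class labels in sorted order, matching the row/column order of the matrix
labels = sorted(t2002['MainMode_B03ID'].unique())
# creating confusion matrix as array (held-out targets vs. predictions)
confusion = confusion_matrix(y_test, predictions, labels=labels)
# converting to df (rows = actual, columns = predicted)
new_df = pd.DataFrame(confusion, index=labels, columns=labels)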
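Since the question asks for a crosstab specifically, `pd.crosstab` can build the same table in one call. This is a sketch using the iris dataset as a stand-in for t2002 (an assumption for illustration); with the question's variables, only the last two lines would be needed.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: iris instead of the t2002 DataFrame.
X_all, y_all = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
predictions = clf.predict(X_test)

# Rows are the actual classes, columns the predicted ones.
ct = pd.crosstab(y_test, predictions, rownames=["Actual"], colnames=["Predicted"])
print(ct)
```

Passing `margins=True` to `pd.crosstab` adds row and column totals, and `normalize="index"` turns the counts into per-class proportions.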
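It's easy to tabulate the grid-search results using pandas: use cv_results_ as described in the docs. Note that it holds the cross-validation scores for each parameter setting, not per-sample predictions.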
import pandas as pd
results = pd.DataFrame(clf.cv_results_) # clf is the GridSearchCV object

RandomForest Regressor: Predict and check performance

I am trying to predict the price for 5 days into the future. I followed this tutorial, which is about predicting a categorical variable and therefore uses a RandomForest Classifier. I am using the same approach but with a RandomForest Regressor, since I have to predict the last price for 5 days into the future. I am confused about how to make the prediction.
Here is my code:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import roc_curve, auc, roc_auc_score
priceTrainData = pd.read_csv('trainPriceData.csv')
#read test data set
priceTestData = pd.read_csv('testPriceData.csv')
priceTrainData['Type'] = 'Train'
priceTestData['Type'] = 'Test'
target_col = "last"
features = ['low', 'high', 'open', 'last', 'annualized_volatility', 'weekly_return',
'daily_average_volume_10',# try to use log in 10, 30,
'daily_average_volume_30', 'market_cap']
priceTrainData['is_train'] = np.random.uniform(0, 1, len(priceTrainData)) <= .75
Train, Validate = priceTrainData[priceTrainData['is_train']==True], priceTrainData[priceTrainData['is_train']==False]
x_train = Train[list(features)].values
y_train = Train[target_col].values
x_validate = Validate[list(features)].values
y_validate = Validate[target_col].values
x_test = priceTestData[list(features)].values
rf = RandomForestRegressor(n_estimators = 1000).fit(x_train, y_train)
status = rf.predict(x_validate)
My first question is how to specify that I want 5 values for the prediction, and my second question is how to check the performance of the RandomForest Regressor. Kindly assist me.
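Your x_validate and y_validate are already NumPy arrays (you took .values), so you can pass them straight to the model's score method:
rf.score(x_validate, y_validate)
This answers your second question: for a regressor, score returns the R² value on the data you pass in.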
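Beyond R², error metrics in the units of the target are often easier to interpret for prices. The sketch below uses synthetic regression data as a stand-in for the price features (an assumption for illustration) and computes R², RMSE and MAE on a validation split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the price features and the "last" target.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
x_train, x_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.25, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)
pred = rf.predict(x_validate)

r2 = r2_score(y_validate, pred)                       # same value as rf.score(...)
rmse = np.sqrt(mean_squared_error(y_validate, pred))  # penalises large errors more
mae = mean_absolute_error(y_validate, pred)           # average absolute error
print(f"R2: {r2:.3f}  RMSE: {rmse:.3f}  MAE: {mae:.3f}")
```

For time-ordered price data, a chronological split (training on earlier rows, validating on later ones) is usually a more honest check than a random one.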

