Online Logistic Regression by Month with Sklearn - python

I would like to train a logistic regression classifier in an online fashion with sklearn. I know about the 'sag' and 'saga' solvers, but I am not sure how to implement this.
Specifically, my goal is to have the algorithm train on the last x months (e.g. x=3) at time t, where t is a month of the year, and then make a prediction over the set of examples for the following month (time t+1).
Here is my df:
X.head()
   year  month  age  job  marital
0  2008      5   56    3        1
1  2008      5   57    7        1
2  2008      5   37    7        1
3  2008      5   40    0        1
4  2008      5   56    7        1
y.head()
0    0
1    1
2    0
3    0
4    0
Name: y, dtype: int8
Say I have my clf as in the code below (in this example I have trained it on the entire dataset in batch):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

clf = LogisticRegression(C=1, max_iter=100, class_weight='balanced')
clf.fit(X, y)  # batch fit on the full dataset
y_pred = clf.predict(X)
cmx = pd.DataFrame(confusion_matrix(y, y_pred),
                   index=['No', 'Yes'],
                   columns=['No', 'Yes'])
Note that I am not just looking to create a separate model for each month in the dataset, but to have a single model train itself in an online (technically minibatch) fashion throughout the entire dataset.
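For reference, a minimal sketch of such a rolling monthly loop (an assumption on my part, not from the question): LogisticRegression with 'sag'/'saga' does not expose partial_fit, but SGDClassifier with a logistic loss does, so it can be updated one month at a time.
import numpy as np
from sklearn.linear_model import SGDClassifier

# loss='log_loss' gives logistic regression trained by SGD (use loss='log' on sklearn < 1.1)
clf = SGDClassifier(loss='log_loss', class_weight='balanced')
classes = np.unique(y)  # partial_fit needs the full set of classes up front

months = X[['year', 'month']].drop_duplicates().sort_values(['year', 'month'])
for _, row in months.iterrows():
    mask = (X['year'] == row['year']) & (X['month'] == row['month'])
    clf.partial_fit(X[mask], y[mask], classes=classes)  # one minibatch = one month
    # after this update, clf can score the following month's examples
Note that partial_fit never forgets earlier months; for a strict last-x-months window you would instead refit on a sliding window of the data at each step.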

How to determine the cause of not achieving a target when using a machine learning model [closed]

I want to know whether it is possible to determine the specific variables' influence when testing a sample against a model. The example below clarifies the question.
Given a dataset to predict the score of students.
    ID  Studies hours  Games hours  lectures hours  social Activities  Score
0    1             20            5              15                  2     78
1    2             15            6              13                  3     69
2    3             31            2              16                  1     95
3    4             22            2              15                  2     80
4    5             19            7              15                  4     71
5    6             10            8              10                  8     52
6    7             13            7              11                  6     59
7    8             34            1              16                  1     96
8    9             25            6              15                  1     83
9   10             22            3              16                  2     76
10  11             17            7              15                  1     66
11  12             28            2              14                  2     87
12  13             21            3              16                  3     77
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, RepeatedKFold, train_test_split
from xgboost import XGBModel, plot_importance
%matplotlib inline

data = pd.read_csv("student_score.csv")

def performance(data):
    X = data.iloc[:, :-1]
    y = data.iloc[:, -1]          # 1-D target, not a single-column frame
    model = XGBModel(booster='gbtree')
    # model = XGBModel(booster='gblinear')
    model.fit(X, y)
    # evaluate the model with repeated k-fold cross-validation
    cv = RepeatedKFold(n_splits=3, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # force scores to be positive
    scores = np.absolute(scores)
    print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
    # save the model to disk
    pickle.dump(model, open('score.sav', 'wb'))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
    # load the model from disk and predict on the held-out split
    loaded_model = pickle.load(open('score.sav', 'rb'))
    result = loaded_model.predict(X_test)
    print(result)
    plt.rcParams["figure.figsize"] = (20, 15)
    plot_importance(model)
    plt.show()

performance(data)
Feature Importances:
[5.6058721e-04 6.7560148e-01 3.1960118e-01 4.2312010e-03 5.4962843e-06]
The feature importances above are the general importances ranked by the model.
What I need now is this: when I pick a sample test, say
test = pd.DataFrame([{"Studies hours": 15, "Games hours": 6, "lectures hours": 13, "social Activities": 3}])
and predict with loaded_model.predict(test), getting a score like 68, which of the variables specifically (not the general importance) caused this particular sample to score 68 rather than 100?
For example, the model should tell me the study hours were lower than expected.
Can a machine learning model do that?
The topic you're describing is called model explainability or interpretability. Generally speaking, the more sophisticated the model, the more accurate it is but the harder it is to explain. SHAP values are the most common way I see folks explaining the effect of each feature on predictions in general, and of each feature value on the prediction for a given observation. The most common visualization of SHAP values is the force plot; the blog post Explain Any Models with the SHAP Values — Use the KernelExplainer explains how to build one for any model.
So you can explain the model's decision for a specific example using SHAP: a waterfall or force plot can show why the model scored 68 for that example based on the input variables.
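A minimal sketch of what that looks like in code (my assumption of the setup: it reuses loaded_model and the test frame from the question, and assumes the shap package is installed):
import shap

# TreeExplainer works for gradient-boosted tree models such as XGBoost
explainer = shap.TreeExplainer(loaded_model)
shap_values = explainer.shap_values(test)  # per-feature contributions to this prediction

# visualize how each feature value pushed the score above or below the expected value
shap.force_plot(explainer.expected_value, shap_values[0], test.iloc[0], matplotlib=True)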

Predict future demand from limited past data

I have past demand (kilometres travelled by customers) from 2003-2020 in an Excel file named transport_demand.xlsx, and I have to predict the demand in 2050 using a linear regression model.
The figures look like this:
    Year  Transport Demand
0   2003     1070000000000
1   2004     1090000000000
2   2005     1090000000000
3   2006         109900000
4   2007     1100000000000
5   2008     1110000000000
6   2009     1120000000000
7   2010     1120000000000
8   2011     1130000000000
9   2012     1140000000000
10  2013     1140000000000
11  2014     1160000000000
12  2015     1180000000000
13  2016     1200000000000
14  2017     1160000000000
15  2018     1160000000000
16  2019     1170000000000
17  2020      943000000000
I am OK with statistics, so I thought of taking a 5-year rolling average, but since the data has only 18 rows, a 5- or 10-year average is difficult (a rolling-average sketch follows the plotting code below). I am new to Python, so training and testing with such small data is very confusing to me, and I am not sure how to go forward or what to code.
import pandas as pd
demand=pd.read_excel('transport_demand.xlsx')
Then I used the following code to check for outliers:
demand.describe()
Since all the data lies within one standard deviation of the mean, I assume there are no outliers here.
Then I plotted the data to see the trend:
# making Date value a true date-time
demand["Year"] = pd.to_datetime(demand["Year"], format="%Y")
# plot demand dataframe
ax = demand.plot("Year", "Transport Demand",color='green', marker='o')
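As an aside, the rolling average mentioned above is easy to compute in pandas even on a short series (a sketch, not from the question):
# centered 5-year rolling mean of demand; the first and last two years come out as NaN
demand["Rolling 5yr"] = demand["Transport Demand"].rolling(window=5, center=True).mean()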
The ups and downs made me unsure how to predict, but I still tried to go forward.
import numpy as np
import seaborn as sns

# Equation: Demand = β0 + β1*Year + e
# Setting the values for X and Y; Year was converted to datetime above,
# so convert it back to a plain numeric year for the regression
x = demand[["Year"]].copy()
x["Year"] = x["Year"].dt.year
y = demand["Transport Demand"]

# Splitting the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=100)

# Fitting the Linear Regression model
from sklearn.linear_model import LinearRegression
slr = LinearRegression()
slr.fit(x_train, y_train)
I hope I am correct up to now, but I want the model to predict the 2050 demand. How do I do this?
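For the prediction itself, a minimal sketch (my addition, assuming the fitted slr and the numeric Year feature above):
import pandas as pd

# predict demand for a future year the model has never seen
future = pd.DataFrame({"Year": [2050]})
print(slr.predict(future))  # extrapolated 2050 transport demand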

How to do multi-label or multi-class classification of the problem below? (Pandas / Python)

My original data looks like this:
   id   season home_team away_team  home_goals  away_goals result winner
0   0  2006-07       Shu       Liv           1           1      D    NaN
1   1  2006-07       Ars       Avl           1           1      D    NaN
2   2  2006-07       Eve       Wat           2           1      H    Eve
3   3  2006-07       New       Wig           2           1      H    New
4   4  2006-07       Por       Bla           3           0      H    Por
The goal is to build a model that predicts the probability of each outcome, i.e.:
Home Team Win  55%
Draw           13%
Away Team Win  32%
I selected these three columns and label encoded them: home_team, away_team, winner.
Then I created these new classes/labels:
df.loc[df["winner"]==df["home_team"],"home_team_win"]=1
df.loc[df["winner"]!=df["home_team"],"home_team_win"]=0
df.loc[df["result"]=='D',"draw"]=1
df.loc[df["result"]!='D',"draw"]=0
df.loc[df["winner"]==df["away_team"],"away_team_win"]=1
df.loc[df["winner"]!=df["away_team"],"away_team_win"]=0
Now the encoded data looks like this:
   home_team  away_team  home_team_win  away_team_win  draw
0         28         19              0              0     1
1          1          2              0              0     1
2         14         34              1              0     0
3         23         37              1              0     0
4         25          4              1              0     0
Initially, I used the code below for the single label home_team_win and it worked fine, but it does not support multiple classes/labels.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

X = prediction_df.drop(['home_team_win'], axis=1)
y = prediction_df['home_team_win']
logReg = LogisticRegression(solver='lbfgs')
rfe = RFE(logReg, n_features_to_select=20)
rfe = rfe.fit(X, y.values.ravel())
How to do Multi label classification or Multi class classification of this problem?
The target binary variables home_team_win, away_team_win, and draw are mutually exclusive, so it does not seem a good idea to use multi-label methods here: in general they are designed to exploit dependencies among labels, and there are none in this dataset.
I suggest modelling it as a multi-class problem in its most common form: a single target column with three classes, 0, 1, and 2 (representing home_team_win, draw, and away_team_win).
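A minimal sketch of building that single column from the binary indicators created in the question (column names assumed from the encoded frame above):
import numpy as np

# collapse the three mutually exclusive indicators into one class column:
# 0 = home win, 1 = draw, 2 = away win
Y = np.select(
    [df["home_team_win"] == 1, df["draw"] == 1],
    [0, 1],
    default=2,  # away_team_win
)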
Many implementations of classifiers in scikit-learn can work directly in this manner. Logistic Regression is one of them:
from sklearn.linear_model import LogisticRegression
logReg=LogisticRegression(solver='lbfgs', multi_class='ovr')
logReg.fit(X,Y)
logReg.predict_proba(X)
This code will output the desired probabilities for each class for each row of X.
In particular, it trains one logistic regression per class (this is what the multi_class='ovr' parameter does).
Take a look at https://scikit-learn.org/stable/supervised_learning.html for other classifiers that directly work in this multi-class dataset form that I suggested.

Different shapes between new data and training dataset

I have a dataframe that looks something like the one below.
   Spent       Products bought  Target Variable
0   2300  Car/Mortgage/Leisure                0
1   1500         Car/Education                0
2    150             Groceries                1
3    700   Groceries/Education                1
4    900              Mortgage                1
5    180      Education/Sports                1
6   1800   Car/Mortgage/Others                0
7    900      Sports/Groceries                1
8   1000   Self-Enrichment/Car                1
9    140         Car/Groceries                1
I used pd.get_dummies to one-hot encode the "Products bought" column, so now I have a shape of (5000, 150).
I train/test split my data and then applied PCA: fit_transform on the training set, and only transform on the test set. After that I used a decision tree classifier, which got me 90% accuracy.
Now here comes the problem. I have a new set of data. My model was trained on a shape of (, 150), but this new data only has a shape of (150, 28) after applying pd.get_dummies.
I know merging the new data with the old dataset is not a solution. I'm kind of stuck and not sure how to solve this. Does anyone have any input? (A possible fix is sketched after the table below.)
Edit: I tried reindexing the new dataset, but it did not work. There are more unique values in the "Products bought" column in my training set than in my new dataset.
The new dataframe looks something like the one below.
   Spent   Products bought  Target Variable
0    230           Leisure                1
1    150            Others                1
2    100         Groceries                1
3    700         Education                1
4    900          Mortgage                0
5    180  Education/Sports                1
6   1800      Car/Mortgage                0
7    400         Groceries                1
8   4000               Car                1
9    140     Car/Groceries                1
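A minimal sketch of the usual fix (an assumption, not from the question: it presumes the training column list was saved, e.g. train_cols = X_train.columns, and that new_df is the new raw dataframe): encode the new data, then align it to the training columns.
import pandas as pd

# one-hot encode the slash-separated products of the new data
new_encoded = new_df["Products bought"].str.get_dummies(sep="/")

# align to the columns the model was trained on:
# missing training columns are added as 0, columns unseen in training are dropped
new_encoded = new_encoded.reindex(columns=train_cols, fill_value=0)

# the already-fitted PCA from training must then be applied with pca.transform(...)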

Can one predict a variable with scikit-learn rather than doing binary classification, and if yes, how?

I work in the field of pharmaceutical sciences, on chemical compounds; by calculating their chemical properties or descriptors we can predict certain biological functions of those compounds. I use the Python and R programming languages for this, and also the Weka machine learning tool. Weka provides binary prediction using SVM and other supporting algorithms.
Example data set (training set):
Chem_ID   MW  LogP  HbD  HbE  IC50  Class_label
    001  232     5    0    2    20            0
    002  280     2    1    4    41            1
    003  240     5    0    2    22            0
    004  300     4    1    5    48            1
    005  245     2    0    2    24            0
    006  255     1    0    2    20            0
    007  299     5    1    4    49            1
Test set:
Chem_ID   MW  LogP  HbD  HbE  IC50  Class_label
    000  255     1    0    2    20
In Weka there are a few algorithms with which we can predict the class_label, or we can also predict a specific variable (we usually predict IC50 values). Does scikit-learn or any other machine learning library in Python have that capability? If yes, how can we use it? Thanks.
Yes, this is a regression problem. There are many different models to solve a regression problem, from simple linear regression to support vector regression or decision tree regressors (and many more).
They work similarly to binary classifiers: you give them your training data, but instead of 0/1 labels you give them target values to train on. In your case you would take the feature you want to predict as the target value and delete it from the training data.
Short example:
from sklearn.linear_model import LinearRegression

target_values = training_set['IC50']
training_data = training_set.drop(columns='IC50')  # remove the target from the features

clf = LinearRegression()
clf.fit(training_data, target_values)

test_data = test_set.drop(columns='IC50')
predicted_values = clf.predict(test_data)
