Ordinal Ridge and Lasso Regression in Python

Thank you for taking the time to read my problem.
I have to run Ordinal Ridge and Lasso regression on my dataset. The values that I want to predict are ordinal (5 levels) and I have many predictors (over 60) that are continuous but not all of them are logically significant. So, I would like to run the Ordinal Regression using Lasso and Ridge to find the significant ones.
I am very new to python and I don't know really what to do and appreciate any help from the community.
I have found the mord module (though I am not sure I am using it correctly), but it doesn't provide an ordinal Lasso.
Could anyone help me with this, please?
Thanks in advance.
Update:
I have written the following code. I don't get any errors, but the accuracy is lower than in previous analyses, so I assume I am making a mistake somewhere in how I am doing it. I would appreciate it if someone could help me with it. I suspect the problem is in the scaling, but I don't know where.
"rel" has five values: 1,2,3,4,5 which are my predicted values.
import numpy as np
import pandas as pd
import mord
from sklearn.preprocessing import scale, StandardScaler
from sklearn.metrics import mean_squared_error
import csv
#defining a function to left-rotate an array by one position
def leftRotatebyOne(arr, n):
    temp = arr[0]
    for i in range(n - 1):
        arr[i] = arr[i + 1]
    arr[n - 1] = temp
#defining OR to do Ordinal Ridge Regression
OR = mord.OrdinalRidge()
#defining the loop to go through all participants
for s in range(17):
    #reading the data for each participant
    df = pd.read_csv("Complete{0}.csv".format(s+1), index_col=0, header=None).dropna()
    df.index.name = 'subject{0}'.format(s+1)
    df.columns = ["ch{0}".format(i+1) for i in range(64)] + ["irrel", "rel"]
    #defining output and predictors
    y = df.rel
    X = df.drop(['rel', 'irrel'], axis=1).astype('float64')
    #an array containing trial numbers
    T = np.array(range(480))
    #a matrix to hold the results of all runs (480 leave-one-out folds) for each participant
    out = np.empty((67, 480))
    #running the model for all trials (each time leaving one out)
    for t in range(480):
        T1 = T[:479]
        T2 = T[479:]  #the last one, which is going to be left out
        ## The last trial is always the one left out; rotating T changes which trial that is
        #train samples
        X_train = X.iloc[T1, :]
        y_train = np.array(y.iloc[T1])
        scaler = StandardScaler().fit(X_train)
        #test sample
        X_test = X.iloc[T2, :]
        y_test = np.array(y.iloc[T2])
        #rotating T
        leftRotatebyOne(T, 480)
        #running ordinal ridge regression from the mord module
        OR.fit(scaler.transform(X_train), y_train)
        predicted = OR.predict(scaler.transform(X_test))
        error = mean_squared_error(y_test, predicted)
        coeff = pd.Series(OR.coef_, index=X.columns)
        #getting the accuracy of each prediction (1 if the held-out trial is classified correctly)
        accuracy = 1 if predicted[0] == y_test[0] else 0
        #collecting all results in a matrix (each column corresponds to leaving out one trial)
        out[:, t] = np.hstack((coeff, predicted, error, accuracy))
    #saving the results for each participant
    np.savetxt("reg{0}.csv".format(s+1), out, delimiter=',')
#saving all results in one file
filenames = ["reg{0}.csv".format(i+1) for i in range(17)]
dataframes = [pd.read_csv(p, header=None) for p in filenames]  #files written by np.savetxt have no header row
merged_dataframe = pd.concat(dataframes, axis=1)
merged_dataframe.to_csv("merged.csv", index=False, header=False)
#reading the file that contains all the models for all the participants
cl = pd.read_csv("merged.csv", header=None).dropna()
#naming the rows
cl.index = ["ch{0}".format(i+1) for i in range(64)] + ["predicted", "error", "accuracy"]
#calculating the mean of each row
print(cl.mean(axis=1))
#getting the mean accuracy for each participant
for s in range(17):
    regg = pd.read_csv("reg{0}.csv".format(s+1), header=None).dropna()
    regg.index = ["ch{0}".format(i+1) for i in range(64)] + ["predicted", "error", "accuracy"]
    print(regg.mean(axis=1)["accuracy"])
I didn't find anything other than the mord module.
I want to do leave-one-out cross-validation, so I have to keep just one sample out for the test each time.
PS.
I am following instructions in this link:
http://nbviewer.jupyter.org/github/JWarmenhoven/ISL-python/blob/master/Notebooks/Chapter%206.ipynb
I get the following error when doing exactly what they have done:
module 'glmnet' has no attribute 'ElasticNet'
However, they do not cover ordinal regression.

You can use sklearn for this:
from sklearn import linear_model
regr_lasso = linear_model.Lasso(alpha=0.1)
regr_ridge = linear_model.Ridge(alpha=1.0)
regr_elasticnet = linear_model.ElasticNet(random_state=0)
Refer to the link below for more details:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_coordinate_descent_path.html
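If you specifically need a lasso-style penalty for the ordinal target, one workaround (a rough sketch, not something mord provides) is to treat the 1-5 labels as numeric and fit sklearn's Lasso, which is in the same spirit as what mord.OrdinalRidge does with an L2 penalty. The X, y, and alpha below are assumptions based on the question's setup, and sklearn's LeaveOneOut replaces the manual rotation:
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import StandardScaler
# X (trials x 64 channels) and y (ratings 1-5) are assumed to be the question's
# df.drop(['rel', 'irrel'], axis=1) and df.rel
loo = LeaveOneOut()
accuracies = []
for train_idx, test_idx in loo.split(X):
    scaler = StandardScaler().fit(X.iloc[train_idx])
    lasso = Lasso(alpha=0.1)  # alpha is a guess; tune it (e.g. with LassoCV)
    lasso.fit(scaler.transform(X.iloc[train_idx]), y.iloc[train_idx])
    # round and clip the continuous prediction back onto the 1-5 ordinal scale
    pred = np.clip(np.rint(lasso.predict(scaler.transform(X.iloc[test_idx]))), 1, 5)
    accuracies.append(int(pred[0] == y.iloc[test_idx].values[0]))
print("LOO accuracy:", np.mean(accuracies))
# channels with non-zero coefficients (from the last fold) are the ones the lasso keeps
print("selected channels:", X.columns[lasso.coef_ != 0].tolist())
The non-zero coefficients give a rough indication of which predictors survive the penalty; a smaller alpha keeps more of them.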

Related

How to loop through multiple polynomial fits changing the degree

My code functions properly but I am repeating a block several times to vary the polynomial variable, degree. I assume this can and should be looped to allow quicker iterations, but I'm not sure how to do it. Prior to the code below I generate the train_test split which I keep for plotting.
After several iterations, I use np.vstack on the y_predictions to create a single array.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
### degree 1 ####
poly1 = PolynomialFeatures(degree=1)
x1_poly = poly1.fit_transform(X_train)
linreg1 = LinearRegression().fit(x1_poly, y_train)
pred_1= poly1.transform(x_prediction_data)
y1_poly_pred=linreg1.predict(pred_1)
### degree 3 #####
poly3 = PolynomialFeatures(degree=3)
x3_poly = poly3.fit_transform(X_train)
linreg3 = LinearRegression().fit(x3_poly, y_train)
pred_3= poly3.transform(x_prediction_data)
y3_poly_pred=linreg3.predict(pred_3)
#### etc... will make several others with degree = 6, 9 ...
I would recommend collecting your answers in a dictionary, but I created a list for simplicity.
The code iterates over i, which is the degree of your polynomials, trains the model, and then collects its answers.
prediction_collector = []
for i in [1, 3, 6, 9]:
    poly = PolynomialFeatures(degree=i)
    x_poly = poly.fit_transform(X_train)
    linreg = LinearRegression().fit(x_poly, y_train)
    pred = poly.transform(x_prediction_data)
    y_poly_pred = linreg.predict(pred)
    # collect the answer after each iteration/increase of degree
    prediction_collector.append(y_poly_pred)
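If you prefer the dictionary version mentioned above, a sketch (reusing the question's X_train, y_train, and x_prediction_data names) keeps each prediction keyed by its degree and still feeds into the np.vstack step from the question:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
predictions_by_degree = {}
for i in [1, 3, 6, 9]:
    poly = PolynomialFeatures(degree=i)
    linreg = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    predictions_by_degree[i] = linreg.predict(poly.transform(x_prediction_data))
# one row per degree, matching the np.vstack step mentioned in the question
stacked = np.vstack(list(predictions_by_degree.values()))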

Any workaround to do forward forecasting for estimating time series in Python?

I want to make forward forecasts for a monthly air pollution time series, such as an estimate of the air pollution index 3 to 6 months ahead. I tried scikit-learn models, and fitting the data to a model works fine, but I don't know how to produce a forward-period estimate, i.e. what the air pollution index will be 6 months from now. In my current attempt I can train the model with scikit-learn, but I don't know how the forward forecasting can be done in Python. What should I do, and can anyone suggest a possible workaround?
my attempt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import BayesianRidge
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url, parse_dates=['dates'])
df.drop(columns=['Unnamed: 0'], inplace=True)
resultsDict={}
predictionsDict={}
split_date ='2017-12-01'
df_training = df.loc[df.index <= split_date]
df_test = df.loc[df.index > split_date]
df_tr = df_training.drop(['pollution_index'],axis=1)
df_te = df_test.drop(['pollution_index'],axis=1)
scaler = StandardScaler()
scaler.fit(df_tr)
X_train = scaler.transform(df_tr)
y_train = df_training['pollution_index']
X_test = scaler.transform(df_te)
y_test = df_test['pollution_index']
X_train_df = pd.DataFrame(X_train,columns=df_tr.columns)
X_test_df = pd.DataFrame(X_test,columns=df_te.columns)
reg = BayesianRidge()
reg.fit(X_train, y_train)
yhat = reg.predict(X_test)
resultsDict['BayesianRidge'] = accuracy_score(df_test['pollution_index'], yhat)
New update 2:
This is my attempt using the ARMA model:
from statsmodels.tsa.arima_model import ARMA
from tqdm import tqdm
index = len(df_training)
yhat = list()
for t in tqdm(range(len(df_test['pollution_index']))):
    temp_train = df[:len(df_training)+t]
    model = ARMA(temp_train['pollution_index'], order=(1, 1))
    model_fit = model.fit(disp=False)
    predictions = model_fit.predict(start=len(temp_train), end=len(temp_train), dynamic=False)
    yhat = yhat + [predictions]
yhat = pd.concat(yhat)
resultsDict['ARMA'] = evaluate(df_test['pollution_index'], yhat.values)
But this doesn't help me make forward forecasts for my time series. What I want is the estimated values of pollution_index 3 to 6 months ahead. Can anyone suggest a possible workaround, or a better way to overcome the limitation of my current attempt?
update: goal
To clarify: I am not asking which model or approach works best. What I am trying to figure out is how to make a reliable forward forecast for a given time series (the pollution index), and how to correct my current attempt if it is not ready for forward-period estimation. Can anyone suggest a possible way to do this?
Update: desired output
Here is a sketch of the forecasting plot that I want to make:
In order to obtain your desired output, I think you need to use a model that can return the standard deviation in the predicted value. Therefore, I adopt Gaussian process regression. From the code you provided in your post, I don't see how this is a time series forecasting task, so in my solution below, I also treat this task as a usual regression task.
First, prepare the data
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process import GaussianProcessRegressor
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# drop the date column
df_training.drop(columns=['date'],axis=1,inplace=True)
df_test.drop(columns=['date'],axis=1,inplace=True)
y_train = df_training['pollution_index']
y_test = df_test['pollution_index']
df_training = df_training.drop(['pollution_index'],axis=1)
df_test = df_test.drop(['pollution_index'],axis=1)
scaler = StandardScaler()
scaler.fit(df_training)
X_train = scaler.transform(df_training)
X_test = scaler.transform(df_test)
X_train_df = pd.DataFrame(X_train,columns=df_training.columns)
X_test_df = pd.DataFrame(X_test,columns=df_test.columns)
with the dataframes prepared above, you can train a GaussianProcessRegressor and make predictions by
gpr = GaussianProcessRegressor(normalize_y=True).fit(X_train_df,y_train)
pred,std = gpr.predict(X_test_df,return_std=True)
in which std is an array of standard deviations in the predicted values. Then, you can plot the data by
import numpy as np
from matplotlib import pyplot as plt
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 225
# plot the training data
ax.plot(y_train.index[plot_start:],y_train.values[plot_start:],'navy',marker='o',label='observed')
# plot the test data
ax.plot(y_test.index,y_test.values,'navy',marker='o')
ax.plot(y_test.index,pred,'darkgreen',marker='o',label='pred')
sigma = np.sqrt(std)
ax.fill(np.concatenate([y_test.index,y_test.index[::-1]]),
np.concatenate([pred-1.960*sigma,(pred+1.9600*sigma)[::-1]]),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
ax.legend(loc='upper left',prop={'size':16})
the output plot looks like
UPDATE
I thought pollution_index was something that could be predicted from 'dew', 'temp', 'press', 'wnd_spd', and 'rain'. If you want one-step-ahead forecasting, here is what you can do:
import numpy as np
import pandas as pd
from statsmodels.tsa.arima_model import ARIMA
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# extract the relevant info
train_date,train_polltidx = df_training['date'].values,df_training['pollution_index'].values
test_date,test_polltidx = df_test['date'].values,df_test['pollution_index'].values
# train an ARIMA model
model = ARIMA(train_polltidx,order=(1,1,1))
model_fit = model.fit(disp=0)
# you can predict as many steps ahead as you want; here I only predict len(test_date) steps
forecast,stderr,conf = model_fit.forecast(len(test_date))
# plot the result
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 225
# plot the training data
plt.plot(train_date[plot_start:],train_polltidx[plot_start:],'navy',marker='o',label='observed')
# plot the test data
plt.plot(test_date,test_polltidx,'navy',marker='o')
plt.plot(test_date,forecast,'darkgreen',marker='o',label='pred')
# ax.errorbar(np.arange(len(pred)),pred,std,fmt='r')
plt.fill(np.concatenate([test_date,test_date[::-1]]),
np.concatenate((conf[:,0],conf[:,1][::-1])),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.set_xlim([df_training['date'].values[plot_start],df_test['date'].values[-1]])
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
The output figure is
Clearly, the prediction is very bad, because I haven't done any preprocessing to the training data.
UPDATE 2
Since I'm not familiar with ARIMA, I implement one-step forecasting using GaussianProcessRegressor with the help of this wonderful post.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.preprocessing import StandardScaler
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# extract the relevant info
train_date,train_polltidx = df_training['date'].values,df_training['pollution_index'].values[:,None]
test_date,test_polltidx = df_test['date'].values,df_test['pollution_index'].values[:,None]
# preprocessing
scalar = StandardScaler()
scalar.fit(train_polltidx)
train_polltidx = scalar.transform(train_polltidx)
test_polltidx = scalar.transform(test_polltidx)
def series_to_supervised(data, n_in, n_out):
    df = pd.DataFrame(data)
    cols = list()
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    agg = pd.concat(cols, axis=1)
    agg.dropna(inplace=True)
    return agg.values
months_look_back = 1
# train
pollt_series = series_to_supervised(train_polltidx,months_look_back,1)
x_train,y_train = pollt_series[:,:months_look_back],pollt_series[:,-1]
# test
pollt_series = series_to_supervised(test_polltidx,months_look_back,1)
x_test,y_test = pollt_series[:,:months_look_back],pollt_series[:,-1]
print("The first %i months in the test set won't be predicted." % months_look_back)
def walk_forward_validation(x_train, y_train, x_test, y_test):
    predictions = []
    history_x = x_train.tolist()
    history_y = y_train.tolist()
    for rep, target in zip(x_test, y_test):
        # train a model on all history seen so far
        gpr = GaussianProcessRegressor(alpha=1e-4, normalize_y=False).fit(history_x, history_y)
        pred, std = gpr.predict([rep], return_std=True)
        predictions.append([pred, std])
        history_x.append(rep)
        history_y.append(target)
    return predictions
predictions = walk_forward_validation(x_train,y_train,x_test,y_test)
pred_test,pred_std = zip(*predictions)
# put back
pred_test = scalar.inverse_transform(pred_test)
pred_std = scalar.inverse_transform(pred_std)
train_polltidx = scalar.inverse_transform(train_polltidx)
test_polltidx = scalar.inverse_transform(test_polltidx)
# plot the result
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 100
# plot the training data
plt.plot(train_date[plot_start:],train_polltidx[plot_start:],'navy',marker='o',label='observed')
# plot the test data
plt.plot(test_date[months_look_back:],test_polltidx[months_look_back:],'navy',marker='o')
plt.plot(test_date[months_look_back:],pred_test,'darkgreen',marker='o',label='pred')
sigma = np.sqrt(pred_std)
ax.fill(np.concatenate([test_date[months_look_back:],test_date[months_look_back:][::-1]]),
np.concatenate([pred_test-1.960*sigma,(pred_test+1.9600*sigma)[::-1]]),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.set_xlim([df_training['date'].values[plot_start],df_test['date'].values[-1]])
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
The idea of this script is to cast the time series forecasting task into a supervised regression task. The plot_start is a parameter that controls from which year we want to plot, clearly plot_start cannot be greater than the length of the training data. The output figure of the script is
as you can see, the first month in the test dataset is not predicted, because we need to look back one month to make a prediction.
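To make that framing concrete, here is a small sanity check of series_to_supervised on a toy series (the numbers are illustrative, not from the pollution data):
import numpy as np
toy = np.arange(10, 16).reshape(-1, 1)      # a fake series: 10, 11, 12, 13, 14, 15
print(series_to_supervised(toy, 1, 1))
# each row is [value one step back, value to predict], i.e.
# [[10. 11.]
#  [11. 12.]
#   ...
#  [14. 15.]]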
In order to further make predictions about unseen data, based on this post on the CV site, you can train a new model using the predicted value from the last step. Here is how you can do it:
unseen_dates = pd.date_range(test_date[-1],periods=180,freq='D').values
months_to_predict = 1  # assumed to be 1 here, matching the n_out=1 used earlier
all_data = series_to_supervised(df['pollution_index'].values,months_look_back,months_to_predict)
def predict_unseen(unseen_dates, all_data, days_look_back):
    predictions = []
    history_x = all_data[:, :days_look_back].tolist()
    history_y = all_data[:, -1].tolist()
    inds = np.arange(unseen_dates.shape[0])
    for ind in inds:
        # train a model on everything observed or predicted so far
        gpr = GaussianProcessRegressor(alpha=1e-2, normalize_y=False).fit(history_x, history_y)
        rep = np.array(history_y[-days_look_back:]).reshape(days_look_back, 1)
        pred, std = gpr.predict(rep, return_std=True)
        predictions.append([pred, std])
        history_x.append(history_y[-days_look_back:])
        history_y.append(pred[0])  # append a scalar so the history stays one-dimensional
    return predictions
predictions = predict_unseen(unseen_dates,all_data,days_look_back=1)
pred_test,pred_std = zip(*predictions)
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 100
# plot the test data
plt.plot(unseen_dates,pred_test,'navy',marker='o')
sigma = np.sqrt(pred_std)
ax.fill(np.concatenate([unseen_dates,unseen_dates[::-1]]),
np.concatenate([pred_test-1.960*sigma,(pred_test+1.9600*sigma)[::-1]]),
alpha=.5,fc='silver',ec='tomato',label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
One very important thing to note: The timestep of the real data is a month, using such data to make predictions about days may not be correct.
The model you have built links what you are trying to model, 'pollution_index', to some input variables, in your case ['dew', 'temp', 'press', 'wnd_spd', 'rain']. So to predict pollution_index into the future using your model, at the high level, you need to estimate what these variables would be over the next 3-6 months, and then run your model on that. Practically, you need to come up with something that looks like X_test but has your projections for these variables for the future, and then call:
yhat = reg.predict(X_test)
... to produce the model estimate of where the pollution_index will be. Hope this makes sense. This gives you a "mechanical" ability to use your model for prediction.
For example, following up on your main example where reg is BayesianRidge() that you fit, we would do the following:
import sys
from io import StringIO
import matplotlib.pyplot as plt
# Here we load your predictions for input variables
# I stubbed it with some random data
df_predict_data = StringIO(
"""
date,dew,temp,press,wnd_spd,rain
2021-01-01,59,28,16,0.78,98.7
2021-02-01,68,32,18,0.79,46.1
2021-03-01,75,34,20,0.81,91.5
2021-04-01,63,31,16,0.83,19.1
2021-05-01,74,38,19,0.83,21.8
2021-06-01,65,32,17,0.85,35.4
""")
df_predict = pd.read_csv(df_predict_data, index_col = 'date')
# scale it using the same scaler you used in training
X_predict = scaler.transform(df_predict)
# predict pollution_index
y_predict = reg.predict(X_predict)
# plot it
plt.plot(df_predict.index, y_predict, '.-')
So we get this:
Whether the linear regression you built is a good model for such prediction is a completely different question. As @Sergey Bushmanov mentioned, there is vast literature on forecasting and on which models are best for this or that, and this thread is probably not the right place to debate that aspect of your question.

How to get the prediction probability in Machine Learning

I have this ML model trained and dumped so I can use it anywhere, and I need to get not just the score and predicted values, but also the predict_proba values.
I could get them, but the problem is that I was expecting the probabilities to be between 0 and 1, and instead I get something like the output below.
array([[1.00000000e+00, 2.46920929e-12],
[1.00000000e+00, 9.89834607e-11],
[9.99993281e-01, 6.71853451e-06],
...,
[1.22327143e-01, 8.77672857e-01],
[9.99999653e-01, 3.47049875e-07],
[1.00000000e+00, 3.79462343e-10]])
And this is the python code I am using.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# dataframe = pd.read_csv("hr_dataset.csv")
dataframe = pd.read_csv("formodel.csv")
dataframe.head(2)
# separate input and target variables
inputs = dataframe.drop('PerformanceRating', axis='columns')
target = dataframe['PerformanceRating']
MaritalStatus_ = LabelEncoder()
JobRole_ = LabelEncoder()
Gender_ = LabelEncoder()
EducationField_ = LabelEncoder()
Department_ = LabelEncoder()
BusinessTravel_ = LabelEncoder()
Attrition_ = LabelEncoder()
OverTime_ = LabelEncoder()
Over18_ = LabelEncoder()
inputs['MaritalStatus_'] = MaritalStatus_.fit_transform(inputs['MaritalStatus'])
inputs['JobRole_'] = JobRole_.fit_transform(inputs['JobRole'])
inputs['Gender_'] = Gender_.fit_transform(inputs['Gender'])
inputs['EducationField_'] = EducationField_.fit_transform(inputs['EducationField'])
inputs['Department_'] = Department_.fit_transform(inputs['Department'])
inputs['BusinessTravel_'] = BusinessTravel_.fit_transform(inputs['BusinessTravel'])
inputs['Attrition_'] = Attrition_.fit_transform(inputs['Attrition'])
inputs['OverTime_'] = OverTime_.fit_transform(inputs['OverTime'])
inputs['Over18_'] = Over18_.fit_transform(inputs['Over18'])
inputs.drop(['MaritalStatus', 'JobRole', 'Attrition' , 'OverTime' , 'EmployeeCount', 'EmployeeNumber',
'Gender', 'EducationField', 'Department', 'BusinessTravel', 'Over18'], axis='columns', inplace=True)
inputsNew = inputs
inputs.head(2)
# inputs = scaled_df
X_train, X_testt, y_train, y_testt = train_test_split(inputs, target, test_size=0.2)
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_testt, y_testt)
print(result)
loaded_model.predict_proba(inputs)  # this produces the result above; I will put it below as well
Output produced by loaded_model.predict_proba(inputs):
array([[1.00000000e+00, 2.46920929e-12],
[1.00000000e+00, 9.89834607e-11],
[9.99993281e-01, 6.71853451e-06],
...,
[1.22327143e-01, 8.77672857e-01],
[9.99999653e-01, 3.47049875e-07],
[1.00000000e+00, 3.79462343e-10]])
How can I convert these values or get an output like a percentage? (eg: 12%, 50%, 96%)
loaded_model.predict_proba(inputs) outputs the probability of 1st class as well as 2nd class (as you have 2 classes). That's why you see 2 outputs for each occurrence of the data. The total probability for each occurrence sums up to 1.
If you just care about the probability of the second class, you can use the line below to fetch it.
loaded_model.predict_proba(inputs)[:,1]
I am not sure if this is what you are looking for, apologies if I misunderstood your question.
To convert the probability array from decimal to percentage, you can write (loaded_model.predict_proba(inputs)) * 100.
EDIT: The format that is outputted by loaded_model.predict_proba(inputs) is just scientific notation, i.e. all of those numbers are between 0 and 1, but many of them are extremely small probabilities and so are represented in scientific notation.
The reason that you see such small probabilities is that loaded_model.predict_proba(inputs)[:,0] (the first column of the probability array) represents the probabilities of the data belonging to one class, and loaded_model.predict_proba(inputs)[:,1] represents the probabilities of the data belonging to the other class.
In other words, this means that each row of the probability array should add up to 1.
I hope this helps!
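As a small illustration (proba below simply stands in for the output of loaded_model.predict_proba(inputs)), each row sums to 1, and multiplying by 100 turns the values into readable percentages:
import numpy as np
proba = loaded_model.predict_proba(inputs)
print(proba.sum(axis=1))                        # every entry should be 1.0
percent_second_class = np.round(proba[:, 1] * 100, 2)
print(percent_second_class[:5])                 # probabilities of the 2nd class, as percentages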
Try this if you want the percentage probability of only the predicted (highest-probability) class for each row.
pred_prob = []
pred_labels = loaded_model.predict_proba(inputs)
for each_pred in pred_labels:
    each_pred_max = max(each_pred)  # probability of the predicted class
    pred_prob.append(each_pred_max)
probability_list = [item * 100 for item in pred_prob]  # as percentages

Outlier detection with Local Outlier Factor (LOF)

I am working with healthcare insurance claims data and would like to identify fraudulent claims. Have been reading online to try and find a better method. I came across the following code on scikit-learn.org
Does anyone know how to select the outliers? The code plots them in a graph, but I would like to extract those outliers if possible.
I have tried appending the y_predictions to the X dataframe, but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code I tried.
X['outliers'] = y_pred
The first 200 data points are inliers while the last 20 are outliers. When you call fit_predict on X, you get either outlier (-1) or inlier (1) in y_pred. So to get the predicted outliers, you need to find the entries where y_pred == -1 and take the corresponding values in X. The script below will give you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
I zip y_pred and X together and check whether y is -1; if so, I collect the X values.
However, there are eight prediction errors (8 out of 220): -1 values appear in y_pred[:200] and 1 values in y_pred[200:]. Please be aware of these errors as well.
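Since X here is a NumPy array, boolean masking is an equivalent and arguably simpler way to pull out the flagged rows (a small sketch using the X and y_pred defined above):
import numpy as np
outlier_mask = y_pred == -1                     # True where LOF flagged an outlier
X_pred_outliers = X[outlier_mask]               # rows of X predicted as outliers
outlier_indices = np.where(outlier_mask)[0]     # their positions, handy if X were a dataframe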

How can I increase the accuracy of my Linear Regression model? (machine learning with Python)

I have a machine learning project in Python using the scikit-learn library. I have two separate datasets for training and testing, and I am trying to do linear regression. I use the code block shown below:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import LinearRegression
df =pd.read_csv("TrainingData.csv")
df2=pd.read_csv("TestingData.csv")
df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]
df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]
X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']
X_test=df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test=df2['Effort']
lr = LinearRegression().fit(X_train, Y_train)
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.7f}".format(lr.score(X_test, Y_test)))
My results are:
lr.coef_: [ 2.32088001e+00 2.07441948e-12 -4.73338567e-05 6.79658129e+02]
lr.intercept_: 2166.186033098048
Training set score: 0.63
Test set score: 0.5732999
What do you suggest? How can I increase my accuracy (by adding code, parameters, etc.)?
My datasets is here: https://yadi.sk/d/JJmhzfj-3QCV4V
I'll elaborate a bit on @GeorgiKaradjov's answer with some examples. Your question is very broad, and there are multiple ways to gain improvements. In the end, having domain knowledge (context) will give you the best possible chance of getting improvements.
Normalise your data, i.e., shift it to have a mean of zero, and a spread of 1 standard deviation
Turn categorical data into variables via, e.g., OneHotEncoding
Do feature engineering:
Are my features collinear?
Do any of my features have cross terms/higher-order terms?
Regularisation of the features to reduce possible overfitting
Look at alternative models given the underlying features and the aim of the project
1) Normalise data
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
afp = np.append(X_train['AFP'].values, X_test['AFP'].values)
std.fit(afp.reshape(-1, 1))  # StandardScaler expects a 2D array
X_train[['AFP']] = std.transform(X_train[['AFP']])
X_test[['AFP']] = std.transform(X_test[['AFP']])
Gives
0 0.752395
1 0.008489
2 -0.381637
3 -0.020588
4 0.171446
Name: AFP, dtype: float64
2) Categorical Feature Encoding
def feature_engineering(df):
    dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
    df[dev_plat.columns] = dev_plat
    df = df.drop('Development_platform', axis=1)
    lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
    df[lang_type.columns] = lang_type
    df = df.drop('Language_Type', axis=1)
    resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
    df[resource_level.columns] = resource_level
    df = df.drop('Resource_Level', axis=1)
    return df
X_train = feature_engineering(X_train)
X_train.head(5)
Gives
AFP dev_plat_077070 dev_plat_077082 dev_plat_077117108116105 dev_plat_080067 lang_type_051071076 lang_type_052071076 lang_type_065112071 resource_level_1 resource_level_2 resource_level_4
0 0.752395 1 0 0 0 1 0 0 1 0 0
1 0.008489 0 0 1 0 0 1 0 1 0 0
2 -0.381637 0 0 1 0 0 1 0 1 0 0
3 -0.020588 0 0 1 0 1 0 0 1 0 0
3) Feature Engineering; collinearity
import seaborn as sns
corr = X_train.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True)
You want the red line for y=x because values should be correlated with themselves. However, any red or blue columns show there is a strong correlation/anti-correlation that requires more investigation. For example, Resource=1 and Resource=4 might be highly correlated in the sense that if people have 1 there is less chance of having 4, etc. Regression assumes that the parameters used are independent of one another.
3) Feature engineering; higher-order terms
Maybe your model is too simple, you could consider adding higher order and cross terms:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2, interaction_only=True)
output_nparray = poly.fit_transform(df)
target_feature_names = ['x'.join(['{}^{}'.format(pair[0], pair[1]) for pair in zip(df.columns, p) if pair[1] != 0]) for p in poly.powers_]
output_df = pd.DataFrame(output_nparray, columns=target_feature_names)
I had a quick try at this, and I don't think the higher-order terms help much. It's also possible your data is non-linear: a quick logarithm of the Y-output gives a worse fit, suggesting it is linear. You could also look at the actuals, but I was too lazy....
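If you want to check the non-linearity point yourself, a quick sketch is to fit against a log-transformed target and compare the scores (log1p is an assumption here so that small Effort values don't cause problems; the variable names follow the question's code):
import numpy as np
from sklearn.linear_model import LinearRegression
lr_log = LinearRegression().fit(X_train, np.log1p(Y_train))
# R^2 on the log scale; compare against the untransformed fit above
print("Training set score (log target): {:.2f}".format(lr_log.score(X_train, np.log1p(Y_train))))
print("Test set score (log target): {:.2f}".format(lr_log.score(X_test, np.log1p(Y_test))))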
4) Regularisation
Try using sklearn's RidgeCV and playing with alpha:
lr = RidgeCV(alphas=np.arange(70,100,0.1), fit_intercept=True)
5) Alternative models
Sometimes linear regression is not always suited. For example, Random Forest Regressors can perform very well, and are usually insensitive to data being standardised, and being categorical/continuous. Other models include XGBoost, and Lasso (Linear regression with L1 regularisation).
lr = RandomForestRegressor(n_estimators=100)
Putting it all together
I got carried away and started looking at your problem, but couldn't improve it too much without knowing all the context of the features:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import RidgeCV, LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import GridSearchCV
def feature_engineering(df):
    dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
    df[dev_plat.columns] = dev_plat
    df = df.drop('Development_platform', axis=1)
    lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
    df[lang_type.columns] = lang_type
    df = df.drop('Language_Type', axis=1)
    resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
    df[resource_level.columns] = resource_level
    df = df.drop('Resource_Level', axis=1)
    return df
df = pd.read_csv("TrainingData.csv")
df2 = pd.read_csv("TestingData.csv")
df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]
df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]
X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']
X_test = df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test = df2['Effort']
std = StandardScaler()
afp = np.append(X_train['AFP'].values, X_test['AFP'].values)
std.fit(afp.reshape(-1, 1))  # StandardScaler expects a 2D array
X_train[['AFP']] = std.transform(X_train[['AFP']])
X_test[['AFP']] = std.transform(X_test[['AFP']])
X_train = feature_engineering(X_train)
X_test = feature_engineering(X_test)
lr = RandomForestRegressor(n_estimators=50)
lr.fit(X_train, Y_train)
y_pred = lr.predict(X_test)  # predictions needed for the plot below
print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, Y_test)))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(Y_test, y_pred, fmt='o')
ax.errorbar([1, Y_test.max()], [1, Y_test.max()])
Resulting in:
Training set score: 0.90
Test set score: 0.61
You can look at the importance of the variables (higher value, more important).
Importance
AFP 0.882295
dev_plat_077070 0.020817
dev_plat_077082 0.001162
dev_plat_077117108116105 0.016334
dev_plat_080067 0.004077
lang_type_051071076 0.012458
lang_type_052071076 0.021195
lang_type_065112071 0.001118
resource_level_1 0.012644
resource_level_2 0.006673
resource_level_4 0.021227
You could start looking at the hyperparameters to get improvements on this also: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
Here are some tips:
Data preparation (exploration) is one of the most important steps in a machine learning project; you need to start with it.
Did you clean your data? If not, start with that step!
As said in this tutorial:
There are no shortcuts for data exploration. If you are in a state of mind that machine learning can sail you away from every data storm, trust me, it won't. After some time, you'll realize that you are struggling to improve the model's accuracy. In such a situation, data exploration techniques will come to your rescue.
Here are some steps for data exploration:
missing values treatment,
outlier removal,
feature engineering.
After that, try to perform univariate and bivariate analysis of your features.
Use one-hot encoding to transform your categorical features into numeric ones; this is what you need, based on what we discussed in the comments.
Here is a tutorial on how to deal with categorical variables; one-hot encoding from scikit-learn is the best technique for your problem.
Using an ASCII representation is not the best practice for handling categorical features.
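A minimal sketch of that suggestion, assuming you work from the original string columns (before converting them to ASCII codes) in the df/df2 frames from the question:
from sklearn.preprocessing import OneHotEncoder
cat_cols = ['Development_platform', 'Language_Type']
enc = OneHotEncoder(handle_unknown='ignore')
X_train_cats = enc.fit_transform(df[cat_cols]).toarray()   # fit the encoder on the training frame only
X_test_cats = enc.transform(df2[cat_cols]).toarray()       # reuse the same encoder on the test frame
pd.get_dummies (as in the other answer) works too, but a fitted encoder guarantees the train and test sets end up with the same columns.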
You can find more about data exploration here.
Follow the suggestions I gave you and thank me later.
Normalize your data.
Depending on the type of input features, you can extract different features from them (feature combinations are possible too).
If your data is not linearly separable, you won't be able to predict it well. You may need to use another model: logistic regression, SVR, a neural network, or whatever fits.
