I ran a logistic regression model and made predictions of the logit values. I used this to get the points on the ROC curve:
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(Y_test, p)
I know metrics.roc_auc_score gives the area under the ROC curve. Can anyone tell me what command will find the optimal cut-off point (threshold value)?
You can do this with the Epi package in R, but I could not find a similar package or example in Python.
The optimal cut-off point would be where the "true positive rate" is high and the "false positive rate" is low. Based on this logic, I have pulled together an example below to find the optimal threshold.
Python code:
import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
from sklearn.metrics import roc_curve, auc
# read the data in
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# rename the 'rank' column because there is also a DataFrame method called 'rank'
df.columns = ["admit", "gre", "gpa", "prestige"]
# dummify rank
dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])
# manually add the intercept
data['intercept'] = 1.0
train_cols = data.columns[1:]
# fit the model
result = sm.Logit(data['admit'], data[train_cols]).fit()
print(result.summary())
# Add prediction to dataframe
data['pred'] = result.predict(data[train_cols])
fpr, tpr, thresholds = roc_curve(data['admit'], data['pred'])
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)
####################################
# The optimal cut off would be where tpr is high and fpr is low
# the optimal cut-off point is where tpr - (1-fpr) is zero or nearly zero
####################################
i = np.arange(len(tpr)) # index for df
roc = pd.DataFrame({
    'fpr': pd.Series(fpr, index=i),
    'tpr': pd.Series(tpr, index=i),
    '1-fpr': pd.Series(1 - fpr, index=i),
    'tf': pd.Series(tpr - (1 - fpr), index=i),
    'thresholds': pd.Series(thresholds, index=i)
})
roc.iloc[(roc.tf-0).abs().argsort()[:1]]
# Plot tpr vs 1-fpr
fig, ax = pl.subplots()
pl.plot(roc['tpr'], label='TPR')
pl.plot(roc['1-fpr'], color='red', label='1 - FPR')
pl.xlabel('Threshold index')
pl.ylabel('Rate')
pl.title('Receiver operating characteristic')
pl.legend()
ax.set_xticklabels([])
The optimal cut-off point is 0.317628, so anything above this can be labeled 1, else 0. You can see from the output/chart that where TPR crosses 1-FPR, TPR is 63%, FPR is 36%, and TPR - (1-FPR) is nearest to zero in the current example.
Output:
        1-fpr       fpr        tf  thresholds       tpr
171  0.637363  0.362637  0.000433    0.317628  0.637795
Hope this is helpful.
Edit
To simplify the above and make it reusable, I have made a function to find the optimal probability cutoff point.
Python Code:
def Find_Optimal_Cutoff(target, predicted):
    """ Find the optimal probability cutoff point for a classification model related to the event rate

    Parameters
    ----------
    target : array-like
        True binary labels, one row per observation
    predicted : array-like
        Predicted probabilities, one row per observation

    Returns
    -------
    list type, with optimal cutoff value
    """
    fpr, tpr, threshold = roc_curve(target, predicted)
    i = np.arange(len(tpr))
    roc = pd.DataFrame({'tf': pd.Series(tpr - (1 - fpr), index=i),
                        'threshold': pd.Series(threshold, index=i)})
    roc_t = roc.iloc[(roc.tf - 0).abs().argsort()[:1]]
    return list(roc_t['threshold'])
# Add prediction probability to dataframe
data['pred_proba'] = result.predict(data[train_cols])
# Find optimal probability threshold
threshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba'])
print(threshold)
# [0.31762762459360921]
# Apply the threshold to map predicted probabilities to class labels
# (Find_Optimal_Cutoff returns a list, hence threshold[0])
data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold[0] else 0)
# Print confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(data['admit'], data['pred'])
# array([[175, 98],
# [ 46, 81]])
Given tpr, fpr, thresholds from your question, the answer for the optimal threshold is just:
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
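This maximizes Youden's J statistic (TPR - FPR), the vertical distance between the ROC curve and the chance line. Note that it is a slightly different criterion from the tpr - (1-fpr) crossing point used above, although the two usually land on similar thresholds. A quick self-contained check, with made-up labels and scores standing in for your own data:
import numpy as np
from sklearn.metrics import roc_curve
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = roc_curve(y_true, y_score)
optimal_idx = np.argmax(tpr - fpr)
print(thresholds[optimal_idx])  # 0.8 here (ties broken in favor of the first maximum)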
Vanilla Python Implementation of Youden's J-Score
def cutoff_youdens_j(fpr, tpr, thresholds):
    j_scores = tpr - fpr
    j_ordered = sorted(zip(j_scores, thresholds))
    return j_ordered[-1][1]
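A quick usage sketch with the arrays returned by sklearn's roc_curve (y_true and y_score here stand in for your own labels and predicted probabilities); note that when several thresholds tie on the J score, this variant returns the largest of them:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(cutoff_youdens_j(fpr, tpr, thresholds))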
Another possible solution.
I'll create some random data.
import numpy as np
import pandas as pd
import scipy.stats as sps
from sklearn import linear_model
from sklearn.metrics import roc_curve, RocCurveDisplay, auc
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
# define data distributions
N0 = 300
N1 = 250
dist0 = sps.gamma(a=8, scale=1/10)
x0 = np.linspace(dist0.ppf(0), dist0.ppf(1-1e-5), 100)
y0 = dist0.pdf(x0)
dist1 = sps.gamma(a=15, scale=1/10)
x1 = np.linspace(dist1.ppf(0), dist1.ppf(1-1e-5), 100)
y1 = dist1.pdf(x1)
with plt.style.context("bmh"):
plt.plot(x0, y0, label="NEG")
plt.plot(x1, y1, label="POS")
plt.legend()
plt.title("Gamma distributions")
# create a random dataset
rvs0 = dist0.rvs(N0, random_state=0)
rvs1 = dist1.rvs(N1, random_state=1)
with plt.style.context("bmh"):
plt.hist(rvs0, alpha=.5, label="NEG")
plt.hist(rvs1, alpha=.5, label="POS")
plt.legend()
plt.title("Random dataset")
Initialize a dataframe with observations (x feature and y target)
df = pd.DataFrame({
    "y": np.concatenate((np.repeat(0, N0), np.repeat(1, N1))),
    "x": np.concatenate((rvs0, rvs1)),
})
and display it with a box plot
# plot the data
with plt.style.context("bmh"):
g = sns.catplot(
kind="box",
data=df,
x="y", y="x"
)
ax = g.axes.flat[0]
sns.stripplot(
data=df,
x="y", y="x",
ax=ax, color='k',
alpha=.25
)
plt.show()
Now we can split the dataframe into train and test sets, fit a logistic regression, compute the ROC curve, the AUC and Youden's index, find the cut-off, and plot everything, all using pandas.
# split dataset into train-test
X_train, X_test, y_train, y_test = train_test_split(
    df[["x"]], df.y.values, test_size=0.5, random_state=1)
# init and fit Logistic Regression on train set
clf = linear_model.LogisticRegression()
clf.fit(X_train, y_train)
# predict probabilities on x test set
y_proba = clf.predict_proba(X_test)
# compute FPR and TPR from y test set and predicted probabilities
fpr, tpr, thresholds = roc_curve(
    y_test, y_proba[:, 1], drop_intermediate=False)
# compute ROC AUC
roc_auc = auc(fpr, tpr)
# init a dataframe for results
df_test = pd.DataFrame({
    "x": X_test.x.values.flatten(),
    "y": y_test,
    "proba": y_proba[:, 1]
})
# sort by predicted probabilities, because thresholds[1:]
# equals the predicted probabilities sorted in descending order
df_test.sort_values(by="proba", inplace=True)
# add reversed TPR and FPR
df_test["tpr"] = tpr[1:][::-1]
df_test["fpr"] = fpr[1:][::-1]
# optional: add thresholds to check
#df_test["thresholds"] = thresholds[1:][::-1]
# add Youden's j index
df_test["youden_j"] = df_test.tpr - df_test.fpr
# define the cut-off and display it
cut_off = df_test.sort_values(
    by="youden_j", ascending=False, ignore_index=True).iloc[0]
print("CUT-OFF:")
print(cut_off)
# plot everything
with plt.style.context("bmh"):
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
RocCurveDisplay(
fpr=df_test.fpr, tpr=df_test.tpr,
roc_auc=roc_auc).plot(ax=ax[0])
ax[0].set_title("ROC curve")
ax[0].axline(xy1=(0,0), slope=1, color="r", ls=":")
ax[0].plot(cut_off.fpr, cut_off.tpr, 'ko', ms=10)
df_test.plot(
x="youden_j", y="proba", ax=ax[1],
ylabel="Predicted Probabilities", xlabel="Youden j",
title="Youden's index", legend=False
)
ax[1].axvline(cut_off.youden_j, color="k", ls="--")
ax[1].axhline(cut_off.proba, color="k", ls="--")
df_test.plot(
x="x", y="proba", ax=ax[2],
ylabel="Predicted Probabilities", xlabel="X Feature",
title="Cut-Off", legend=False
)
ax[2].axvline(cut_off.x, color="k", ls="--")
ax[2].axhline(cut_off.proba, color="k", ls="--")
plt.show()
and we get
CUT-OFF:
x 1.065712
y 1.000000
proba 0.378543
tpr 0.852713
fpr 0.143836
youden_j 0.708878
We can finally check
# check results
TP = df_test[(df_test.x>=cut_off.x)&(df_test.y==1)].index.size
FP = df_test[(df_test.x>=cut_off.x)&(df_test.y==0)].index.size
TN = df_test[(df_test.x< cut_off.x)&(df_test.y==0)].index.size
FN = df_test[(df_test.x< cut_off.x)&(df_test.y==1)].index.size
print("True Positive Rate: ", TP / (TP + FN))
print("False Positive Rate:", 1 - TN / (TN + FP))
True Positive Rate: 0.8527131782945736
False Positive Rate: 0.14383561643835618
Although I am late to the party, you can also use the geometric mean to determine the optimal threshold, as stated here: threshold tuning for imbalanced classification.
It can be computed as:
import numpy as np
# calculate the G-mean for each threshold
gmeans = np.sqrt(tpr * (1 - fpr))
# locate the index of the largest G-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
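For completeness, a self-contained sketch, with toy labels and scores standing in for your own data:
import numpy as np
from sklearn.metrics import roc_curve
y_true = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.3, 0.6, 0.4, 0.9])
fpr, tpr, thresholds = roc_curve(y_true, y_score)
gmeans = np.sqrt(tpr * (1 - fpr))
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))  # 0.4 for this toy data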
Related
I am playing around with a DBSCAN example in order to see if it will work for me. In my case, I have clusters of a few points (3-5) close together with a fairly long distance in between clusters. I have tried to replicate the situation in the following code. I figured that with a low epsilon and low min_samples this should work, but instead it is telling me that it only sees 1 group (and 20 noise points?). Am I using this incorrectly, or is DBSCAN not good for this type of problem? I went with DBSCAN instead of k-means because I don't know beforehand exactly how many clusters there will be (1-5).
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt
# Configuration options
num_samples_total = 20
cluster_centers = [(3,3), (7,7),(7,3),(3,7),(5,5)]
num_classes = len(cluster_centers)
#epsilon = 1.0
epsilon = 1e-5
#min_samples = 13
min_samples = 2
# Generate data
X, y = make_blobs(n_samples=num_samples_total, centers=cluster_centers,
                  n_features=num_classes, center_box=(0, 1), cluster_std=0.05)
np.save('./clusters.npy', X)
X = np.load('./clusters.npy')
# Compute DBSCAN
db = DBSCAN(eps=epsilon, min_samples=min_samples).fit(X)
labels = db.labels_
no_clusters = len(np.unique(labels))
no_noise = np.sum(np.array(labels) == -1, axis=0)
print('Estimated no. of clusters: %d' % no_clusters)
print('Estimated no. of noise points: %d' % no_noise)
# Generate scatter plot for training data
colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', labels)) #only set for 2 colors
plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True)
plt.title('Two clusters with data')
plt.xlabel('Axis X[0]')
plt.ylabel('Axis X[1]')
plt.show()
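As an aside, the eps above (1e-5) is far smaller than the spread of the generated blobs (cluster_std=0.05 puts neighbouring points roughly 0.05-0.2 apart), so every point is labeled noise (-1), and np.unique(labels) then counts that single noise label as the one "group". A sketch with an eps larger than the within-blob spacing but smaller than the ~2.8 gap between the closest centers:
db = DBSCAN(eps=0.3, min_samples=2).fit(X)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # don't count noise as a cluster
print('Estimated no. of clusters: %d' % n_clusters)  # should report 5 for this data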
I ended up going with k-means and doing a modified elbow method:
print(__doc__)
# Author: Phil Roth <mr.phil.roth@gmail.com>
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Configuration options
num_samples_total = 20
cluster_centers = [(3,3), (7,7),(7,3),(3,7),(5,5)]
num_classes = len(cluster_centers)
#epsilon = 1.0
epsilon = 1e-5
#min_samples = 13
min_samples = 2
# Generate data
X, y = make_blobs(n_samples=num_samples_total, centers=cluster_centers,
                  n_features=num_classes, center_box=(0, 1), cluster_std=0.05)
random_state = 170
#y_pred = KMeans(n_clusters=5, random_state=random_state).fit_predict(X)
#plt.scatter(X[:, 0], X[:, 1], c=y_pred)
#kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
# maybe I don't have to look for an elbow, just go until the value drops below 1.
# also if I do go too far, it just means that the same shape will be shown twice.
clusterIdx = 0
inertia = 100
while inertia > 1:
    clusterIdx = clusterIdx + 1
    kmeans = KMeans(n_clusters=clusterIdx, random_state=0).fit(X)
    inertia = kmeans.inertia_
    print(inertia)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
print(clusterIdx)
plt.show()
On each cell of the confusion-matrix heatmap I would like to have the actual number of predictions. I don't really care if it's percentages or counts. I would also like to label them with True Positive and False Negative.
The Code:
sns.heatmap(pd.crosstab(ytest, classifier.predict(xtest)), cmap='Spectral')
plt.xlabel('predicted')
plt.ylabel('actual')
plt.show()
I use the below to do what you want, though a Google search will also give you the answer:
def find_best_threshold(threshold, fpr, tpr):
    t = threshold[np.argmax(tpr * (1 - fpr))]
    ### TPR * TNR ---> we are trying to maximize both TNR and TPR
    print("the maximum value of tpr*(1-fpr)", max(tpr * (1 - fpr)), "for threshold", np.round(t, 3))
    return t
def predict_with_best_thresh(prob, t):
    pred = [1 if i >= t else 0 for i in prob]
    return pred
### https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea
def conf_matrix_plot(cf_matrix, title):
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    #sns.set(font_scale=1.5)
    sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='coolwarm').set_title(title + ' Confusion Matrix for TFIDF')
    plt.xlabel('Predicted')  # confusion_matrix puts predicted labels on the columns
    plt.ylabel('Actual')     # and actual labels on the rows
from sklearn.metrics import confusion_matrix
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
best_t = find_best_threshold(tr_thresholds, train_fpr, train_tpr)
cf_matrix_train = confusion_matrix(y_train, predict_with_best_thresh(y_train_pred[:, 1], best_t))
cf_matrix_test = confusion_matrix(y_test, predict_with_best_thresh(y_test_pred[:, 1], best_t))
conf_matrix_plot(cf_matrix_train, 'Train')
conf_matrix_plot(cf_matrix_test, 'Test')
Result: annotated confusion-matrix heatmaps for the train and test sets.
I have a classifier that outputs a proportion X between 0 and 1. I also have an associated ground truth which is the real proportion.
I want to predict 1 when the output of the classifier is greater than some threshold, and 0 otherwise.
From data visualization I know that a good threshold is around 0.5.
How can I estimate the best threshold from the data ?
Here is an example of my data
predicted = [0.13675214, 0.31400966, 0.28037383, 0.18337408, 0.10043668, 0.6,
             0.74242424, 0.30853994, 0.30588235, 0.24766355, 0.19806763, 0.20512821,
             0.29752066, 0.23504274, 0.14133333, 0.52733119, 0.46039604, 0.56306306,
             0.29059829, 0.02890173, 0.2962963,  0.47008547, 0.54545455, 0.58119658,
             0.3, 0.66242038, 0.42066421]
ground_truth = [0.11111111, 0.647343, 0.21028037, 0.20293399, 0., 0.93333333,
                1., 0.07162534, 0.61176471, 0.21028037, 0.647343, 0.11111111,
                0.07162534, 0.5, 0.08, 0.88424437, 0.58415842, 0.74774775,
                0.11111111, 0.03468208, 0., 0.5, 0., 0.91168091,
                1., 0.96178344, 0.10701107]
desired_output = [0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
Thank you
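One way to estimate it, assuming you first binarize the ground-truth proportions at 0.5 (an assumption; use whatever rule actually produced desired_output) and then pick the threshold that maximizes Youden's J on the ROC curve:
import numpy as np
from sklearn.metrics import roc_curve
y_true = (np.array(ground_truth) >= 0.5).astype(int)  # assumed binarization rule
fpr, tpr, thresholds = roc_curve(y_true, predicted)
best = thresholds[np.argmax(tpr - fpr)]
print(best)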
Precision, recall and f1-score depend on the probability threshold: changing the cut-off used to decide that a sample belongs to the positive class changes precision and recall, and therefore the f1-score. Below is my attempt to plot precision, recall and f1-score as a function of the discrimination threshold. The plot also marks the optimal threshold for the dataset and model for classifying a sample as a member of the positive class; by default, this is the threshold at which the f1-score is highest.
import pandas as pd
import pathlib
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator)
from sklearn.metrics import confusion_matrix as cm_sklearn
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
def plot_discrimination_threshold(clf, X_test, y_test, argmax='f1',
                                  title='Metrics vs Discriminant Threshold',
                                  fig_size=(10, 8), dpi=100, save_fig_path=None):
    """
    Plot precision, recall and f1-score vs discriminant threshold for the given pipeline model.

    Parameters
    ----------
    clf : estimator instance (either sklearn.Pipeline, imblearn.Pipeline or a classifier)
        PRE-FITTED classifier or a PRE-FITTED Pipeline in which the last estimator is a classifier.
    X_test : pandas.DataFrame of shape (n_samples, n_features)
        Test features.
    y_test : pandas.Series of shape (n_samples,)
        Target values.
    argmax : str, default: 'f1'
        Annotate the threshold maximized by the supplied metric. Options: 'f1', 'precision', 'recall'
    title : str, default: 'Metrics vs Discriminant Threshold'
        Plot title.
    fig_size : tuple, default: (10, 8)
        Size (inches) of the plot.
    dpi : int, default: 100
        Image DPI.
    save_fig_path : str, default: None
        Full path where to save the plot. Will generate the folders if they don't exist already.

    Returns
    -------
    fig : matplotlib.pyplot.Figure
        Figure from matplotlib.
    ax : matplotlib.pyplot.Axes
        Axes object from matplotlib.
    disc_threshold : float
        The threshold that maximizes the metric given in argmax.
    """
    thresholds = np.linspace(0, 1, 100)
    precision_ls = []
    recall_ls = []
    f1_ls = []
    fpr_ls = []
    fnr_ls = []

    # obtain probabilities
    probs = clf.predict_proba(X_test)[:, 1]

    for threshold in thresholds:
        # obtain class prediction based on threshold
        y_predictions = np.where(probs >= threshold, 1, 0)
        # obtain confusion matrix
        tn, fp, fn, tp = cm_sklearn(y_test, y_predictions).ravel()
        # obtain FPR and FNR
        FPR = fp / (tn + fp)
        FNR = fn / (tp + fn)
        # obtain precision, recall and f1 scores
        precision = precision_score(y_test, y_predictions, average='binary')
        recall = recall_score(y_test, y_predictions, average='binary')
        f1 = f1_score(y_test, y_predictions, average='binary')
        precision_ls.append(precision)
        recall_ls.append(recall)
        f1_ls.append(f1)
        fpr_ls.append(FPR)
        fnr_ls.append(FNR)

    metrics = pd.concat([
        pd.Series(precision_ls),
        pd.Series(recall_ls),
        pd.Series(f1_ls),
        pd.Series(fpr_ls),
        pd.Series(fnr_ls)], axis=1)
    metrics.columns = ['precision', 'recall', 'f1', 'fpr', 'fnr']
    metrics.index = thresholds

    plt.rcParams["figure.facecolor"] = 'white'
    plt.rcParams["axes.facecolor"] = 'white'
    plt.rcParams["savefig.facecolor"] = 'white'
    fig, ax = plt.subplots(1, 1, figsize=fig_size, dpi=dpi)
    ax.plot(metrics['precision'], label='Precision')
    ax.plot(metrics['recall'], label='Recall')
    ax.plot(metrics['f1'], label='f1')
    ax.plot(metrics['fpr'], label='False Positive Rate (FPR)', linestyle='dotted')
    ax.plot(metrics['fnr'], label='False Negative Rate (FNR)', linestyle='dotted')

    # Draw a threshold line
    disc_threshold = round(metrics[argmax].idxmax(), 2)
    ax.axvline(x=metrics[argmax].idxmax(), color='black', linestyle='dashed', label="$t_r$=" + str(disc_threshold))

    ax.xaxis.set_major_locator(MultipleLocator(0.1))
    ax.xaxis.set_major_formatter('{x:.1f}')
    ax.yaxis.set_major_locator(MultipleLocator(0.1))
    ax.yaxis.set_major_formatter('{x:.1f}')
    ax.xaxis.set_minor_locator(MultipleLocator(0.05))
    ax.yaxis.set_minor_locator(MultipleLocator(0.05))
    ax.tick_params(which='both', width=2)
    ax.tick_params(which='major', length=7)
    ax.tick_params(which='minor', length=4, color='black')
    plt.grid(True)
    plt.xlabel('Probability Threshold', fontsize=18)
    plt.ylabel('Scores', fontsize=18)
    plt.title(title, fontsize=18)
    leg = ax.legend(loc='best', frameon=True, framealpha=0.7)
    leg_frame = leg.get_frame()
    leg_frame.set_color('gold')
    plt.show()

    if save_fig_path is not None:
        path = pathlib.Path(save_fig_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        fig.savefig(save_fig_path, dpi=dpi)

    return fig, ax, disc_threshold
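A usage sketch (clf stands in for any pre-fitted classifier or pipeline with predict_proba, and X_test/y_test for your held-out data):
fig, ax, disc_threshold = plot_discrimination_threshold(clf, X_test, y_test, argmax='f1')
print(disc_threshold)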
You seem to have a naive ~90% accuracy already:
delt = predicted - ground_truth  # all but 2 of the entries fall within .4
More examples of the model's predicted values would perhaps illustrate the ranges better.
My dataset can be found on Kaggle: https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python. I'm running k-means on my dataset, which has 4 columns and 200 rows, with k = 5. I wanted to find the cluster radius, so I measured the average distance of each data point from their respective cluster center, but whenever I re-run my program the values change. My cluster centers don't change with each iteration, so what's going on exactly? How do I fix this?
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances
from sklearn.preprocessing import StandardScaler
import numpy as np
import scipy.spatial.distance as sdist
df = pd.read_csv(r'D:\Mall_Customers.csv', usecols=['Spending Score (1-100)', 'Annual Income (k$)'])
x = StandardScaler().fit_transform(df)
kmeans = KMeans(n_clusters=5, max_iter=100, random_state=0)
y_kmeans= kmeans.fit_predict(x)
centroids = kmeans.cluster_centers_
print(centroids)
df["cluster"] = kmeans.labels_
n_clusters = 5
clusters = [x[y_kmeans == i] for i in range(n_clusters)]
for i, c in enumerate(clusters):
    print('Cluster {} has {} observations: {}...'.format(i, len(c), c[0]))
df["cluster"] = kmeans.labels_
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
#cluster radius
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2) for (x, y) in data[cluster_labels == i_centroid]]
    return np.mean(distances)
t_data = PCA(n_components=2).fit_transform(x)
k_means = KMeans()
clusters = k_means.fit_predict(t_data)
centroids = kmeans.cluster_centers_
c_mean_distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(t_data, cx, cy, i, clusters)
    c_mean_distances.append(mean_distance)
print("mean distances are", c_mean_distances)
Iteration 1: [1.5381892556224435, 1.796763983963032, 1.5144402423920744, 3.4372440532366753, 1.6533031213582314]
Iteration 2: [3.180393284279158, 2.809194267986748, 0.7823704675079582, 3.4929008204149365, 1.8109097594336663]
Iteration 3: [1.9461073260609538, 3.2032294269352155, 2.447917517713439, 3.4372440532366753, 2.197239028470577]
I'll add an answer to document the issue.
First, when you are doing a lower-dimensional embedding, make sure that it doesn't need a random seed to ensure repeatability. In this case (PCA) I think it is okay, but other lower-dimensional embeddings may vary.
Second, KMeans does not always converge to the global optimum, so separate runs can converge to different clusterings. To keep KMeans repeatable, scikit-learn has the random_state input parameter.
You set this the first time you ran KMeans, which kept the first portion of your code repeatable. To ensure repeatability of the clustering after the PCA embedding, set the random state in the same way:
k_means = KMeans(n_clusters=5, max_iter=100, random_state=0)
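As a side note, once k_means has been fit and clusters holds the labels from k_means.fit_predict(t_data) as above, the per-cluster mean distances can be computed without the manual loop, using the scipy.spatial.distance module the question already imports as sdist:
# distance of every point to every centroid, then average each point's
# distance to its own cluster's centroid
dists = sdist.cdist(t_data, k_means.cluster_centers_)
c_mean_distances = [dists[clusters == i, i].mean() for i in range(k_means.n_clusters)]
print("mean distances are", c_mean_distances)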
I'm having difficulty getting the weighting array in sklearn's Linear Regression to affect the output.
Here's an example with no weighting.
import numpy as np
import seaborn as sns
from sklearn import linear_model
x = np.arange(0,100.)
y = (x**2.0)
xr = np.array(x).reshape(-1, 1)
yr = np.array(y).reshape(-1, 1)
regr = linear_model.LinearRegression()
regr.fit(xr, yr)
y_pred = regr.predict(xr)
sns.scatterplot(x=x, y = y)
sns.lineplot(x=x, y = y_pred.T[0].tolist())
Now when adding weights, I get the same best fit line back. I expected to see the regression favor the steeper part of the curve. What am I doing wrong?
w = [p**2 for p in x.reshape(-1)]
wregr = linear_model.LinearRegression()
wregr.fit(xr,yr, sample_weight=w)
yw_pred = regr.predict(xr)
wregr = linear_model.LinearRegression(fit_intercept=True)
wregr.fit(xr,yr, sample_weight=w)
yw_pred = regr.predict(xr)
sns.scatterplot(x=x, y = y) #plot curve
sns.lineplot(x=x, y = y_pred.T[0].tolist()) #plot non-weighted best fit line
sns.lineplot(x=x, y = yw_pred.T[0].tolist()) #plot weighted best fit line
This is due to an error in your code. Prediction with your weighted model should be:
yw_pred = wregr.predict(xr)
rather than
yw_pred = regr.predict(xr)
With this you get a third, distinctly different line: the x**2 weights pull the fit toward the steeper, large-x part of the curve.
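For reference, a consolidated sketch of the corrected weighted fit, reusing x, y, xr, yr and y_pred from the code above:
w = [p**2 for p in x.reshape(-1)]   # weight points by x^2, emphasizing the steep end
wregr = linear_model.LinearRegression()
wregr.fit(xr, yr, sample_weight=w)
yw_pred = wregr.predict(xr)         # predict with wregr, not regr
sns.scatterplot(x=x, y=y)                      # data
sns.lineplot(x=x, y=y_pred.T[0].tolist())      # unweighted fit
sns.lineplot(x=x, y=yw_pred.T[0].tolist())     # weighted fit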