I did k-means clustering by running the code below:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X_std = StandardScaler().fit_transform(df_logret)
km = KMeans(n_clusters=2, max_iter=100)
km.fit(X_std)
centroids = km.cluster_centers_
I'd like to put cluster 1 in x_1 and cluster 2 in x_2 and run a regression of the form y = a*x_1 + b*x_2.
I've been searching for a way to do this all day but can't find one.
The dataset 'df_logret' looks like this:
Any help would be greatly appreciated!
You've just applied KMeans clustering to X_std. With scikit-learn, you can pull out the fitted labels (km.labels_) and use them to split the rows into the appropriate clusters.
Assuming your X_std is a 2-column NumPy array (e.g. np.array([[1, 2], [3, 4], [4, 5]])):
cluster_1 = []
cluster_2 = []
for i in range(len(X_std)):
    if km.labels_[i] == 0:
        cluster_1.append(X_std[i])
    else:
        cluster_2.append(X_std[i])

cluster_1_array = np.array(cluster_1)
cluster_2_array = np.array(cluster_2)
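A shorter, equivalent way to get the same two arrays (a minimal sketch, assuming km is the fitted scikit-learn KMeans object and X_std is a NumPy array as above) is to index with a boolean mask over the labels:

import numpy as np

# rows whose fitted label is 0 form cluster 1, rows with label 1 form cluster 2
cluster_1_array = X_std[km.labels_ == 0]
cluster_2_array = X_std[km.labels_ == 1]

cluster_1_array and cluster_2_array then hold the standardized rows assigned to each cluster.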
I have used nltk to perform k-means clustering because I would like to change the distance metric to cosine distance. However, how do I obtain the centroids of all the clusters?
kclusterer = KMeansClusterer(8, distance=nltk.cluster.util.cosine_distance, repeats=1)
predict = kclusterer.cluster(features, assign_clusters=True)

centroids = kclusterer._centroid
df_clustering['cluster'] = predict
# df_clustering['centroid'] = centroids[df_clustering['cluster'] - 1].tolist()
df_clustering['centroid'] = centroids
I am trying to perform k-means clustering on a pandas dataframe, and would like the coordinates of each data point's cluster centroid to end up in the dataframe column 'centroid'.
Thank you in advance!
import pandas as pd
import numpy as np
import nltk
from nltk.cluster import KMeansClusterer

# create a dummy dataframe with 3 features
df = pd.DataFrame([[1, 2, 3], [50, 51, 52], [2.0, 6.0, 8.5], [50.11, 53.78, 52]],
                  columns=['feature1', 'feature2', 'feature3'])
print(df)

obj = KMeansClusterer(2, distance=nltk.cluster.util.cosine_distance)  # number of clusters = 2
vectors = [np.array(f) for f in df.values]
df['predicted_cluster'] = obj.cluster(vectors, assign_clusters=True)
print(obj.means())
# Output:
# [array([50.055, 52.39 , 52. ]), array([1.5 , 4. , 5.75])]
# i.e. the mean of the three features for each of the 2 clusters

# Now, if you want the cluster centers in the pandas dataframe:
df['centroid'] = df['predicted_cluster'].apply(lambda x: obj.means()[x])
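For reference, obj.means() returns the list of cluster centroids indexed by cluster id, and cluster(vectors, assign_clusters=True) returns those same 0-based ids, so the lambda lookup above should line up row by row.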
How do I print the confusion matrix for a logistic regression while changing the decision threshold over [0.5, 0.6, 0.9], i.e. once with 0.5, once with 0.6, and so on?
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

X = [[0.7, 0.2], [0.9, 0.4]]
y = [1, -1]

model = LogisticRegression()
model = model.fit(X, y)

threshold = [0.5, 0.6, 0.9]

# y_pred should be the thresholded predictions for each threshold value,
# which is the part I don't know how to build
CM = confusion_matrix(y_true, y_pred)
TN = CM[0][0]
FN = CM[1][0]
TP = CM[1][1]
FP = CM[0][1]
Let's try this!
for i in threshold:
    y_predicted = model.predict_proba(X)[:, 1] > i
    print(confusion_matrix(y, y_predicted))
predict_proba() returns a NumPy array with two columns: the first column is the probability of the first class (here -1) and the second is the probability of target=1. That is why we add [:, 1] after predict_proba(), to get the probabilities of target=1.
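One caveat worth flagging (a small sketch under the assumption that y stays encoded as {1, -1} as above): the boolean comparison hands True/False to confusion_matrix while y contains 1/-1, so mapping the thresholded probabilities back onto {1, -1} keeps the label sets consistent:

import numpy as np
from sklearn.metrics import confusion_matrix

for t in threshold:
    proba_pos = model.predict_proba(X)[:, 1]       # P(target = 1) per sample
    y_predicted = np.where(proba_pos > t, 1, -1)   # map back onto the original {1, -1} labels
    print('threshold =', t)
    print(confusion_matrix(y, y_predicted, labels=[-1, 1]))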
I think an easy approach, in pseudo-code (loosely based on Python), would be:
1 - Predict on a set of known values X: y_prob = model.predict_proba(X), so you get the probability for each input in X.
2 - Then, for each threshold, compute the output: 1 if y_prob > threshold else 0.
3 - Now get the confusion matrix for each prediction vector obtained.
If you need a deeper explanation on any point let me know!
def predict_y_from_threshold(model, X, threshold):
    return np.array([1 if p > threshold else 0 for p in model.predict_proba(X)[:, 1]])
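For example (a sketch, assuming the true targets are also re-encoded as 0/1 rather than the {1, -1} used earlier; y01 below is that hypothetical re-encoding):

from sklearn.metrics import confusion_matrix

y01 = [1, 0]  # hypothetical 0/1 re-encoding of the earlier y = [1, -1]
for t in threshold:
    print(t)
    print(confusion_matrix(y01, predict_y_from_threshold(model, X, t)))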
I am working with healthcare insurance claims data and would like to identify fraudulent claims. I have been reading online to try to find a better method and came across the following code on scikit-learn.org.
Does anyone know how to select the outliers? The code plots them in a graph, but I would like to select those outliers if possible.
I have tried appending the y predictions to the X dataframe, but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code I tried:
X['outliers'] = y_pred
The first 200 data points are inliers, while the last 20 are outliers. When you call fit_predict on X, you get either outlier (-1) or inlier (1) for each row in y_pred. So to get the predicted outliers, you need to find where y_pred == -1 and take the corresponding rows of X. The script below gives you the outliers in X.
X_pred_outliers = [x for flag, x in zip(y_pred, X.tolist()) if flag == -1]
I zip y_pred and X together, check whether the label is -1, and if so collect the corresponding X values.
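Since X is already a NumPy array at this point, a boolean mask (a minimal sketch, assuming the same X and y_pred as in the scikit-learn snippet above) does the same thing more directly:

import numpy as np

outlier_mask = (y_pred == -1)                 # True where LocalOutlierFactor flagged an outlier
X_pred_outliers = X[outlier_mask]             # rows of X predicted as outliers
outlier_indices = np.where(outlier_mask)[0]   # their row indices, if you need them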
However, there are some errors in the predictions (here, 8 out of 220): -1 values within y_pred[:200] and 1 values within y_pred[200:]. Please be aware of these errors as well.
I have this dataframe (text_df):
There are 10 different authors with 13834 rows of text.
I then created a bag of words and used a TfidfVectorizer like so:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm='l2',
                          smooth_idf=True)

X = tfidf_v.fit_transform(corpus).toarray()  # corpus --> bag of words
y = text_df.iloc[:, 1].values
The shape of X is (13834, 2701).
I decided to use 7 clusters for KMeans:
from sklearn.cluster import KMeans

km = KMeans(n_clusters=7, random_state=42)
I'd like to extract the authors of the texts in each cluster to see whether the authors are consistently grouped into the same cluster. I'm not sure about the best way to go about this. Thanks!
Update:
Trying to visualize the author count per cluster using nested dictionary like so:
author_cluster = {}
for i in range(len(y_kmeans)):
    # sample a random index and tally its author under its predicted cluster
    j = np.random.randint(0, 13833, 1)[0]
    if y_kmeans[j] not in author_cluster:
        author_cluster[y_kmeans[j]] = {}
    if y[j] not in author_cluster[y_kmeans[j]]:
        author_cluster[y_kmeans[j]][y[j]] = 1
    else:
        author_cluster[y_kmeans[j]][y[j]] += 1
Output:
There should be a larger count per cluster and probably more than one author per cluster. I'd like to use all of the predictions to get an accurate count instead of using a random subset, but I'm open to alternative solutions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm='l2',
                          smooth_idf=True)

X = tfidf_v.fit_transform(corpus)  # I removed .toarray() - not sure why it was there except maybe for print debugging?
y = text_df.iloc[:, 1].values

km = KMeans(n_clusters=7, random_state=42)
model = km.fit(X)
result = model.predict(X)

for i in range(20):
    # check 20 random predictions
    j = np.random.randint(low=0, high=13833, size=1)[0]
    print(f'Author {y[j]} wrote {X[j]} and was put in cluster {result[j]}')
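To count authors per cluster over all of the predictions rather than a random sample, one option (a short sketch, assuming y holds the author labels and result the cluster ids from the code above) is a pandas cross-tabulation:

import pandas as pd

# rows = author, columns = cluster id, cells = how many of that author's texts landed in that cluster
author_per_cluster = pd.crosstab(pd.Series(y, name='author'),
                                 pd.Series(result, name='cluster'))
print(author_per_cluster)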
I have test and train sets with the following dimensions, with all features (i.e. columns) as integers.
X_train.shape
(990188L, 19L)
X_test.shape
(424367L, 19L)
I want to find the Euclidean distance between every row of the train set and every row of the test set.
I also have to remove the rows from the train set that fall within a distance threshold of 0.005.
I have the following loop-based code, which is too slow but works correctly:
for a in range(X_test.shape[0]):
    a_test = np_Test[a]
    for b in range(X_train.shape[0]):
        a_train = np_Train[b]
        if a != b:
            dst = distance.euclidean(a_test, a_train)
            if dst <= 0.005:
                train.append(b)
where I note down the indexes of the rows that lie within the distance threshold.
Is there any way to parallelize this code?
I tried using from sklearn.metrics.pairwise import euclidean_distances, but as the dataset is huge, I get a memory error.
I then tried to speed this up by calling euclidean_distances in batches, but somehow I think the following code is not working correctly.
Please help me if there is any way to parallelize the code.
rows = X_train.shape[0]
rem = rows % 1000
no = rows // 1000  # integer division so the slice bounds stay integers

i = 0
while i <= no * 1000:
    dst_mat = euclidean_distances(X_train[i:i+1000, :], X_test)
    condition = np.any(dst_mat <= 0.005, axis=1)
    index = np.where(condition)[0]
    index = np.add(index, i)
    print(index)
    print(dst_mat)
    i += 1000
Use scipy.spatial.distance.cdist. It will calculate the pairwise distances.
Thanks to Warren Weckesser for pointing out this solution.
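For example, a minimal batched sketch (assuming X_train and X_test are NumPy arrays; call .values on them first if they are DataFrames, and tune chunk to your available memory):

import numpy as np
from scipy.spatial.distance import cdist

threshold = 0.005
chunk = 100  # each block computes a chunk x n_test distance matrix

close_train_rows = []
# For each block of train rows, compute distances to every test row and
# record the train-row indices that come within the threshold of any test row.
for start in range(0, X_train.shape[0], chunk):
    dst = cdist(X_train[start:start + chunk], X_test, metric='euclidean')
    hits = np.where(np.any(dst <= threshold, axis=1))[0] + start
    close_train_rows.extend(hits.tolist())

X_train_filtered = np.delete(X_train, close_train_rows, axis=0)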