I have a dataset containing 61 rows (users) and 26 columns, on which I apply clustering with k-means and other algorithms.
As a first step, I ran k-means on the dataset after normalizing it and identified 10 clusters.
In parallel, I also tried to visualize these clusters, which is why I use PCA to reduce the number of features.
Here is a sample of the data, followed by the code I have written:
UserID Communication_dur Lifestyle_dur Music & Audio_dur Others_dur Personnalisation_dur Phone_and_SMS_dur Photography_dur Productivity_dur Social_Media_dur System_tools_dur ... Music & Audio_Freq Others_Freq Personnalisation_Freq Phone_and_SMS_Freq Photography_Freq Productivity_Freq Social_Media_Freq System_tools_Freq Video players & Editors_Freq Weather_Freq
1 63 219 9 10 99 42 36 30 76 20 ... 2 1 11 5 3 3 9 1 4 8
2 9 0 0 6 78 0 32 4 15 3 ... 0 2 4 0 2 1 2 1 0 0
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
Sc = StandardScaler()
X = Sc.fit_transform(df)
pca = PCA(3)
pca.fit(X)
pca_data = pd.DataFrame(pca.transform(X))
print(pca_data.head())
This gives the following results:
0 1 2
0 8 -4 5
1 -2 -2 1
2 1 1 -0
3 2 -1 1
4 3 -1 -3
I want to plot the clusters of my dataset using PCA and interpret the results. How can I do this?
I am really new to this area, and any advice would be greatly appreciated!
Thanks in advance.
Using an example dataset:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
df, y = make_blobs(n_samples=70, centers=10,n_features=26,random_state=999,cluster_std=1)
Perform scaling, PCA and put the PC scores into a dataframe:
Sc = StandardScaler()
X = Sc.fit_transform(df)
pca = PCA(2)
pca_data = pd.DataFrame(pca.fit_transform(X),columns=['PC1','PC2'])
Run k-means, store the cluster labels in the data frame, and plot the result with seaborn:
kmeans = KMeans(n_clusters=10).fit(X)
pca_data['cluster'] = pd.Categorical(kmeans.labels_)
sns.scatterplot(x="PC1",y="PC2",hue="cluster",data=pca_data)
Or matplotlib:
fig,ax = plt.subplots()
scatter = ax.scatter(pca_data['PC1'], pca_data['PC2'],c=pca_data['cluster'],cmap='Set3',alpha=0.7)
legend1 = ax.legend(*scatter.legend_elements(),
                    loc="upper left", title="")
ax.add_artist(legend1)
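To help with interpreting the plot, here is a minimal sketch of two quantities people commonly look at after PCA: the share of variance each component explains, and the loadings (how strongly each original feature contributes to a component). It reuses the synthetic data from the example above; the `feature_N` column names are made up for illustration and would be replaced by the real column names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same synthetic data as in the example above
data, _ = make_blobs(n_samples=70, centers=10, n_features=26,
                     random_state=999, cluster_std=1)
# Hypothetical column names, used only to make the loadings readable
feature_names = [f"feature_{i}" for i in range(data.shape[1])]

X = StandardScaler().fit_transform(data)
pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each component
explained = pca.explained_variance_ratio_
print("Explained variance ratio:", explained)

# Loadings: the weight of each original feature in each component
loadings = pd.DataFrame(pca.components_.T, index=feature_names,
                        columns=["PC1", "PC2"])

# Features that contribute most to PC1, ranked by absolute weight
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False).head(5)
print(top_pc1)
```

If the two components explain only a small share of the variance, clusters that overlap in the 2D plot may still be well separated in the full 26-dimensional space, so the plot should be read with that in mind.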
I am doing a binary classification. How can I extract the original indices of the misclassified (or correctly classified) instances of the training data frame during k-fold cross-validation? I found no answer to this question here.
I got the values in folds as described here:
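from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_val_predict
# Recent scikit-learn versions reject random_state when shuffle=False, so shuffle is enabled
skf = StratifiedKFold(n_splits=10, random_state=111, shuffle=True)
cv_results = cross_val_score(model, X_train, y_train, cv=skf, scoring='roc_auc')
pred = cross_val_predict(model, X_train, y_train, cv=skf)
fold_pred = [pred[j] for i, j in skf.split(X_train, y_train)]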
fold_pred
Is there any method to get the indices of the misclassified (or correctly classified) instances? The output should be a data frame that contains only those instances from cross-validation.
Desired output:
Misclassified instances in the data frame, with their original indices.
col1 col2 col3 col4 target
13 0 1 0 0 0
14 0 1 0 0 0
18 0 1 0 0 1
22 0 1 0 0 0
Here the input has 100 instances, of which 4 are misclassified (indices 13, 14, 18 and 22) during CV.
From cross_val_predict you already have the predictions. It's a matter of subsetting your data frame where the predictions are not the same as your true label, for example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.datasets import load_breast_cancer
import pandas as pd
data = load_breast_cancer()
df = pd.DataFrame(data.data[:,:5],columns=data.feature_names[:5])
df['label'] = data.target
rfc = RandomForestClassifier()
skf = StratifiedKFold(n_splits=10,random_state=111,shuffle=True)
pred = cross_val_predict(rfc, df.iloc[:,:5], df['label'], cv=skf)
df[df['label']!=pred]
mean radius mean texture ... mean smoothness label
3 11.42 20.38 ... 0.14250 0
5 12.45 15.70 ... 0.12780 0
9 12.46 24.04 ... 0.11860 0
22 15.34 14.26 ... 0.10730 0
31 11.84 18.70 ... 0.11090 0
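If you also need to know which fold each misclassified row came from, one option is to loop over the splits yourself instead of calling cross_val_predict. The following is a sketch under that assumption, on the same breast cancer data as above; the `fold` column and the `misclassified` name are introduced here for illustration.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

data = load_breast_cancer()
df = pd.DataFrame(data.data[:, :5], columns=data.feature_names[:5])
df["label"] = data.target
features = list(df.columns[:5])

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=111)
rfc = RandomForestClassifier(random_state=0)

wrong = []
for fold, (train_idx, test_idx) in enumerate(skf.split(df[features], df["label"])):
    # Fit a fresh copy of the model on the training part of this fold
    train_part = df.iloc[train_idx]
    model = clone(rfc).fit(train_part[features], train_part["label"])
    # Predict the held-out part and keep the rows whose prediction is wrong
    held_out = df.iloc[test_idx]
    pred = model.predict(held_out[features])
    missed = held_out[held_out["label"].to_numpy() != pred].assign(fold=fold)
    wrong.append(missed)

# The original data frame index is preserved, so rows can be traced back
misclassified = pd.concat(wrong)
print(misclassified.head())
```

Because `iloc` is used only to select rows, the resulting frame keeps the original index labels, which is what the question asks for.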
My dataset looks like this:
age address freetime goout Dalc Walc G1 G2 G3 AverageG
17 U 1 1 3 5 7 7 7 7
15 X 3 2 6 3 5 4 2 3.6666
20 T 1 5 4 1 3 2 1 2
What I'm trying to do in Python is predict the value of AverageG, which is the average of G1, G2 and G3.
I know that AverageG can be calculated by averaging G1, G2 and G3, but in my case it has to be predicted using the scikit-learn library.
For this toy example you can use linear regression.
I will give the general idea, then you can translate it for your specific dataframe:
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.random.randint(0,10,(1000,3))
y = X.mean(axis=1)
model = LinearRegression()
model.fit(X, y)
new_data = np.array([1,2,3]).reshape(1, -1)
model.predict(new_data)
and the model correctly predicts:
array([2.])
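To see why the prediction is exact, you can inspect the fitted model. The mean of three columns is itself a linear function with weights of 1/3 each and no offset, so the regression should recover those weights. A small sketch (using a seeded generator, which the original snippet does not do):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Seeded generator so the run is reproducible (an addition for this sketch)
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(1000, 3))
y = X.mean(axis=1)

model = LinearRegression().fit(X, y)

# Each learned weight should be close to 1/3 and the intercept close to 0
print(model.coef_)
print(model.intercept_)
```

On real data the target would not be an exact linear function of the inputs, so the fit would not be perfect like this.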
I'm trying to use mean shift from sklearn to find anomalies and outliers in a dataset. The datasets are signal values from sensors. I have a training dataset to train the algorithm and a test dataset containing dummy anomalies. My problem is that when I use the predict method on the test dataset, mean shift doesn't label anomalies with -1 or any other value that indicates an anomaly or outlier, but assigns them to a valid cluster.
Here is the code:
import pandas as pd
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn import preprocessing
if __name__ == '__main__':
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")
    scaler = preprocessing.StandardScaler().fit(train)
    bandwidth = estimate_bandwidth(train, n_jobs=-1)
    ms = MeanShift(bandwidth=bandwidth, n_jobs=-1)
    ms.fit(scaler.transform(train))
    prediction = ms.predict(scaler.transform(test))
    test["cluster"] = prediction
    print(np.unique(prediction))
Here are the first 5 rows of the training dataset:
A B C
0 300 0 200
1 300 0 200
2 300 0 350
3 300 1 350
4 400 1 350
Here are the first 5 rows of the test dataset, with dummy anomalies:
A B C
0 300 0 200
1 300 0 200
2 300 0 350
3 100000000 100000000 100000000
4 400 1 350
What can I do to detect anomalies in the test dataset?
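One possible approach, offered as a sketch rather than an established recipe: MeanShift.predict always returns the nearest cluster center, so a new point never gets -1. You can add your own rule on top, for example flagging any test point that lies farther from its nearest center than every training point does. The data below is synthetic and only loosely modeled on the samples in the question, and the threshold rule is an assumption you would want to tune.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical sensor readings: two operating regimes plus a little noise
train = np.vstack([
    rng.normal([300, 0, 200], [5, 0.1, 5], size=(100, 3)),
    rng.normal([400, 1, 350], [5, 0.1, 5], size=(100, 3)),
])
test = np.array([
    [300, 0, 200],
    [400, 1, 350],
    [1e8, 1e8, 1e8],  # dummy anomaly, as in the question
])

scaler = StandardScaler().fit(train)
train_s = scaler.transform(train)
test_s = scaler.transform(test)

# Estimate the bandwidth on the same (scaled) data the model is fitted on
bandwidth = estimate_bandwidth(train_s)
ms = MeanShift(bandwidth=bandwidth).fit(train_s)

# Distance from each point to its nearest cluster center
_, train_dist = pairwise_distances_argmin_min(train_s, ms.cluster_centers_)
labels, test_dist = pairwise_distances_argmin_min(test_s, ms.cluster_centers_)

# Assumed rule: farther away than any training point means anomaly (-1)
threshold = train_dist.max()
labels = np.where(test_dist > threshold, -1, labels)
print(labels)
```

Dedicated outlier detectors in scikit-learn, such as IsolationForest or LocalOutlierFactor, may be a better fit if clustering is not otherwise needed.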
I have some csv data in the following format.
Ln Dr Tag Lab 0:01 0:02 0:03 0:04 0:05 0:06 0:07 0:08 0:09
L0 St vT 4R 0 0 0 0 0 0 0 0 0
L2 Tx st 4R 8 8 8 8 8 8 8 8 8
L2 Tx ss 4R 1 1 9 6 1 0 0 6 7
I want to plot a time-series graph using the columns (Ln, Dr, Tag, Lab) as the keys and the 0:0n fields as the values.
I have the following code.
import pandas as pd
import matplotlib.pyplot as plt
a = pd.read_csv('yourfile.txt', sep=r'\s+')
fig, ax = plt.subplots()
ax.set_ylabel('time')
ax.set_xlabel('events')
ax.grid(True)
ax.set_xlim((0, 150))
ax.set_ylim((0, 200))
for _, row in a.iterrows():
    # The first four columns form the key; the remaining columns are the values
    key = ''.join(str(v) for v in row.iloc[:4])
    row.iloc[4:].astype(float).plot(ax=ax, label=key)
ax.legend()
fig.savefig('test.pdf')
I have only shown a subset of my data here; the full data set has around 200 rows. The above code plots all rows in a single figure. I would prefer each row to be plotted in a separate graph.
Use subplot()
import matplotlib.pyplot as plt
fig = plt.figure()
plt.subplot(221) # 2 rows, 2 columns, plot 1
plt.plot([1,2,3])
plt.subplot(222) # 2 rows, 2 columns, plot 2
plt.plot([3,1,3])
plt.subplot(223) # 2 rows, 2 columns, plot 3
plt.plot([3,2,1])
plt.subplot(224) # 2 rows, 2 columns, plot 4
plt.plot([1,3,1])
fig.savefig('test.pdf')  # save before show(), otherwise the saved file may be empty
plt.show()
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot.html#matplotlib.pyplot.subplot
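Applied to the data from the question, the same idea could look like the sketch below. It uses plt.subplots to build the whole grid at once instead of calling subplot() per panel, and inlines the three sample rows so it runs on its own. The grid width of 2 is an arbitrary choice; for around 200 rows you would probably want a wider grid or several figures.

```python
import io
import math

import matplotlib.pyplot as plt
import pandas as pd

# Sample rows from the question, inlined so the sketch is self-contained
raw = """Ln Dr Tag Lab 0:01 0:02 0:03 0:04 0:05 0:06 0:07 0:08 0:09
L0 St vT 4R 0 0 0 0 0 0 0 0 0
L2 Tx st 4R 8 8 8 8 8 8 8 8 8
L2 Tx ss 4R 1 1 9 6 1 0 0 6 7
"""
a = pd.read_csv(io.StringIO(raw), sep=r"\s+")

key_cols = ["Ln", "Dr", "Tag", "Lab"]
values = a.drop(columns=key_cols).astype(float)

ncols = 2  # assumed grid width
nrows = math.ceil(len(a) / ncols)
fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows),
                         squeeze=False)

# One panel per row of the data frame, titled with the row's key columns
for ax, (idx, row) in zip(axes.flat, values.iterrows()):
    ax.plot(values.columns, row.to_numpy())
    ax.set_title(" ".join(a.loc[idx, key_cols].astype(str)))
    ax.grid(True)

# Hide any unused axes at the end of the grid
for ax in axes.flat[len(a):]:
    ax.set_visible(False)

fig.tight_layout()
```

Call fig.savefig('test.pdf') and then plt.show() afterwards to write the figure to disk and display it.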
How can I plot the following data with matplotlib? The goal is to visualize the interval from column 2 to column 3, so that the result looks like a Gantt chart.
0 0 0.016 19.833
1 0 19.834 52.805
2 0 52.806 84.005
5 0 84.012 107.305
8 0 107.315 128.998
10 0 129.005 138.956
11 0 138.961 145.587
13 0 145.594 163.863
15 0 163.872 192.118
16 0 192.127 193.787
17 0 193.796 197.106
20 0 236.099 246.223
25 1 31.096 56.180
27 1 58.097 64.857
28 1 64.858 66.494
29 1 66.496 89.908
31 1 89.918 111.606
34 1 129.007 137.371
35 1 137.372 145.727
39 1 176.097 209.461
42 1 209.476 226.207
44 1 226.217 259.317
46 1 259.329 282.488
47 1 282.493 298.905
I need two colors, one for each value in column 1. Column 0 goes on the y-axis, and columns 2 and 3 go on the x-axis. One line should be plotted per row: column 2 is the start time and column 3 is the stop time.
If I have understood you correctly, you want to plot a horizontal line between the x-values in the 3rd and 4th columns, with a y-value equal to that in column 0. To plot a horizontal line at a given y-value between two x-values, you can use hlines. I believe the code below is a possible solution.
import numpy as np
import matplotlib.pyplot as plt
# Read data from file into variables
y, c, x1, x2 = np.loadtxt('data.txt', unpack=True)
# Map each value in column 1 to a color (loadtxt returns floats, hence int())
colors = [{0: 'red', 1: 'blue'}[int(v)] for v in c]
# Plot one horizontal line for every line of data in your file
plt.hlines(y, x1, x2, colors=colors)
plt.show()
You can read the text file with numpy.loadtxt and then plot it with matplotlib. For example:
import numpy as np
import matplotlib.pyplot as plt
x, y = np.loadtxt('file.txt', usecols=(2,3), unpack=True)
plt.plot(x,y)
See the matplotlib documentation for more options.