Plot class probabilities using matplotlib - python

I have two numpy arrays y_prob and dataY whose values correspond. dataY is a one dimensional array where each value is a 1 or a 0. y_prob is a two dimensional array. I wish to plot a scatter plot using y_prob to determine the location and dataY to determine the color of the point. How can I do this?
Sample data:
y_prob = [[0.5,0.5], [0.3,0.7], [0.2,0.8], [0.1,0.9]]
dataY = [1,0,0,0]

You can use the standard packages numpy & matplotlib
import numpy as np
import matplotlib.pyplot as plt
y_prob = np.array([[0.5,0.5], [0.3,0.7], [0.2,0.8], [0.1,0.9]])
dataY = [1,0,0,0]
fig = plt.figure()
plt.scatter(x=y_prob[:,0], y=y_prob[:,1], c=dataY)
plt.show()
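If a legend entry per class is wanted, one option is to issue one scatter call per class instead of passing c=dataY. The following is only a minimal sketch of that idea: the helper name plot_by_class, the axis labels and the legend strings are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_by_class(y_prob, dataY):
    """Scatter the two probability columns, one scatter call per class label."""
    y_prob = np.asarray(y_prob, dtype=float)
    dataY = np.asarray(dataY)
    fig, ax = plt.subplots()
    for cls in np.unique(dataY):
        sel = dataY == cls  # boolean mask selecting the rows of this class
        ax.scatter(y_prob[sel, 0], y_prob[sel, 1], label=f"class {cls}")
    # Assumption: column 0 holds P(class 0) and column 1 holds P(class 1)
    ax.set_xlabel("P(class 0)")
    ax.set_ylabel("P(class 1)")
    ax.legend()
    return ax

ax = plot_by_class([[0.5, 0.5], [0.3, 0.7], [0.2, 0.8], [0.1, 0.9]], [1, 0, 0, 0])
# Call plt.show() afterwards to display the figure.
```

Each class gets its own color from the default color cycle, so no explicit color list is needed.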

Related

How to perform Spectral Clustering on 3 circles dataset with three different classes

I want to perform spectral clustering on the 3 circles dataset that I generated using make_circles, as shown in the figure. All three circles belong to different classes.
from sklearn.datasets import make_circles
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.cluster import SpectralClustering
import matplotlib.pyplot as plt
import pylab as pl
import networkx as nx
X_small, y_small = make_circles(n_samples=(100,200), random_state=3,
noise=0.07, factor = 0.7)
X_large, y_large = make_circles(n_samples=(100,200), random_state=3,
noise=0.07, factor = 0.4)
y_large[y_large==1] = 2
df = pd.DataFrame(np.vstack([X_small,X_large]),columns=['x1','x2'])
df['label'] = np.hstack([y_small,y_large])
df.label.value_counts()
sns.scatterplot(data=df,x='x1',y='x2',hue='label',style='label',palette="bright")
Since I can't flag this question as duplicate (the similar question has no accepted answer), here is a working example of Spectral Clustering on 3 circles using your code:
X_small, y_small = make_circles(n_samples=(1000,2000), random_state=3,
noise=0.07, factor = 0.1)
X_large, y_large = make_circles(n_samples=(1000,2000), random_state=3,
noise=0.07, factor = 0.6)
y_large[y_large==1] = 2
df = pd.DataFrame(np.vstack([X_small,X_large]),columns=['x1','x2'])
df['label'] = np.hstack([y_small,y_large])
df.label.value_counts()
sns.scatterplot(data=df,x='x1',y='x2',hue='label',style='label',palette="bright")
Then adapt the slightly modified 3 circles dataset (added samples and spread the circles) to the code of this SO answer:
x1 = np.expand_dims(df['x1'].values,axis=1)
x2 = np.expand_dims(df['x2'].values,axis=1)
X = np.concatenate((x1,x2),axis=1)
y = df['label'].values
from sklearn.cluster import SpectralClustering
clustering = SpectralClustering(n_clusters=3, gamma=1000).fit(X)
colors = ['r','g','b']
colors = np.array([colors[label] for label in clustering.labels_])
plt.scatter(X[y==0, 0], X[y==0, 1], c=colors[y==0], marker='X')
plt.scatter(X[y==1, 0], X[y==1, 1], c=colors[y==1], marker='o')
plt.scatter(X[y==2, 0], X[y==2, 1], c=colors[y==2], marker='*')
plt.show()
The np.expand_dims(...,axis=1) is necessary to create the dimension along which to concatenate features with np.concatenate() (we initially have 1D vectors, and we don't want to concatenate along the existing initial dimension which is the samples index dimension). Each plt.scatter() line plots the points of a single true data class (hence the y==y_true index selection) using the associated marker, the colors indicating the class provided by the clustering.
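To see what the np.expand_dims() and np.concatenate() step does, here is a small sketch on toy data (the three-row DataFrame is a stand-in, not the circles dataset). It also shows two shorter ways that should build the same (n_samples, 2) feature matrix, in case they are preferred.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the circles DataFrame (values are arbitrary)
df = pd.DataFrame({"x1": [0.0, 1.0, 2.0], "x2": [3.0, 4.0, 5.0]})

# Approach from the answer: add a trailing axis to each 1D vector,
# then join the two (n_samples, 1) columns along that new axis
x1 = np.expand_dims(df["x1"].values, axis=1)
x2 = np.expand_dims(df["x2"].values, axis=1)
X_concat = np.concatenate((x1, x2), axis=1)

# Alternatives: stack 1D vectors as columns, or take the columns from pandas
X_stack = np.column_stack((df["x1"].values, df["x2"].values))
X_frame = df[["x1", "x2"]].to_numpy()

print(X_concat.shape)  # (3, 2)
```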
Resulting dataset:
Resulting clusters:
Edit: different markers now identify the true classes (the colors already indicate the clustering classes), as asked by OP in the comments. Unfortunately, we cannot pass an array of markers (as we can for colors) to produce the plot in a single line of code, because marker does not accept a list as input (discussed here).
Edit2: added motivation for the use of np.expand_dims(...,axis=1) and some explanation for the plt.scatter() lines, as asked by OP in the comments.
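Since marker takes a single value per scatter call, the three nearly identical plt.scatter() lines can also be written as a loop over the true classes. This is a hedged sketch, not the answer's own code: the function name scatter_true_vs_cluster and the small demo arrays are made up for illustration, and the cluster labels are written by hand rather than coming from SpectralClustering.

```python
import numpy as np
import matplotlib.pyplot as plt

def scatter_true_vs_cluster(X, y_true, cluster_labels,
                            markers=("X", "o", "*"), colors=("r", "g", "b")):
    """Marker encodes the true class, color encodes the cluster label."""
    point_colors = np.asarray(colors)[cluster_labels]  # one color per point
    fig, ax = plt.subplots()
    for cls, marker in zip(np.unique(y_true), markers):
        sel = y_true == cls  # points of one true class
        ax.scatter(X[sel, 0], X[sel, 1], c=point_colors[sel], marker=marker)
    return ax

# Hypothetical demo data: 6 points, 3 true classes, hand-written cluster labels
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   [1.0, 1.0], [2.0, 2.0], [2.0, 0.0]])
y_demo = np.array([0, 0, 1, 1, 2, 2])
labels_demo = np.array([1, 1, 0, 2, 2, 2])
ax = scatter_true_vs_cluster(X_demo, y_demo, labels_demo)
# Call plt.show() afterwards to display the figure.
```

With the real data, clustering.labels_ would be passed as cluster_labels.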

Numpy make binary matrix outline continuous and fill it with 1s

I have a file containing 0 and 1s here: https://easyupload.io/wvoryj.
How can I fill the shape of these structures with 1s? I think binary_fill_holes is not working because the outline is not continuous.
plot showing structures
import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage
mask = np.loadtxt('mask.txt', dtype=int)
mask = ndimage.binary_fill_holes(mask).astype(int)
fig, ax = plt.subplots()
plt.imshow(mask)
plt.show()
This would be my approach:
1. Fill the gaps with a 2D convolution.
2. Run a cumulative sum (cumsum) over every row to fill the outlines.
3. Divide by the last (or highest) number of the row.
4. Set everything larger than 1 back to 1.
I hope that helps.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import convolve2d
mask = np.loadtxt('mask.txt', dtype=int)
# run a convolution over the mask to fill out the empty spaces
conv_mat = np.array(3*[[0,1,0]])
mask_continuous = convolve2d(mask, conv_mat)
# add up all numbers from left to right...
# ...and divide by the last value of the row
mask_filled = np.array([np.cumsum(i) / np.cumsum(i)[-1] for i in mask_continuous])
# reset everything larger than 1 to 1
mask_filled[mask_filled>1] = 1
fig, ax = plt.subplots()
plt.imshow(mask_filled)
plt.show()
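A different technique that may be worth trying, since the question suspects binary_fill_holes fails because of gaps in the outline, is to close the gaps morphologically first and then fill. The sketch below swaps in scipy.ndimage.binary_closing for the convolution step; it is not the method of the answer above. The helper name close_and_fill, the gap parameter and the synthetic 12x12 mask are assumptions for illustration, and the structuring element has to be at least as wide as the largest gap in the real data.

```python
import numpy as np
from scipy import ndimage

def close_and_fill(mask, gap=1):
    """Close outline gaps up to `gap` pixels wide, then fill enclosed regions."""
    size = 2 * gap + 1
    structure = np.ones((size, size), dtype=bool)  # square structuring element
    closed = ndimage.binary_closing(mask, structure=structure)
    return ndimage.binary_fill_holes(closed).astype(int)

# Synthetic test mask: a square outline with a one-pixel gap in its top edge
demo = np.zeros((12, 12), dtype=int)
demo[2, 2:10] = demo[9, 2:10] = 1
demo[2:10, 2] = demo[2:10, 9] = 1
demo[2, 5] = 0  # the gap

filled = close_and_fill(demo)
print(filled.sum())  # 64, the full 8x8 square
```

Outlines that touch the image border may be eroded by the closing step, so padding the mask with zeros beforehand could be necessary there.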

How to plot the principal vectors of each variable after performing PCA?

My question mainly comes from this post: https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance
In the article, the author plotted the vector direction and length of each variable. Based on my understanding, after performing PCA, all we get are the eigenvectors and eigenvalues. For a dataset of dimension M x N, each eigenvector should be a 1 x N vector. So my question is: maybe the length of the vector is the eigenvalue, but how do I find the direction of the vector for each variable mathematically? And what is the physical meaning of the length of the vector?
Also, if it is possible, can I do similar work with scikit PCA function in python?
Thanks!
This plot is called a biplot, and it is very useful for understanding PCA results. The length of each vector is determined by the values that each feature/variable has on each principal component, a.k.a. the PCA loadings.
Example:
These loadings are accessible through print(pca.components_). Using the Iris dataset, the loadings are:
[[ 0.52106591, -0.26934744, 0.5804131 , 0.56485654],
[ 0.37741762, 0.92329566, 0.02449161, 0.06694199],
[-0.71956635, 0.24438178, 0.14212637, 0.63427274],
[-0.26128628, 0.12350962, 0.80144925, -0.52359713]]
Here, each row is one PC and each column corresponds to one variable/feature. So feature/variable 1 has a value of 0.52106591 on PC1 and 0.37741762 on PC2. These are the values used to plot the vectors that you saw in the biplot: the coordinates of Var1 in the plot are exactly those values.
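To make the direction and length of each arrow concrete, the following sketch takes the loadings matrix printed above and computes, for every variable, the arrow tip in the PC1/PC2 plane, its length and its angle. The variable names (arrows, lengths, angles) are illustrative. Note that this uses the unscaled rows of pca.components_; some biplot conventions additionally scale the arrows by the square root of the eigenvalues, which is not done here.

```python
import numpy as np

# Loadings as printed above: rows are PCs, columns are variables
components = np.array([
    [ 0.52106591, -0.26934744, 0.5804131 ,  0.56485654],
    [ 0.37741762,  0.92329566, 0.02449161,  0.06694199],
    [-0.71956635,  0.24438178, 0.14212637,  0.63427274],
    [-0.26128628,  0.12350962, 0.80144925, -0.52359713],
])

# One arrow per variable: (loading on PC1, loading on PC2)
arrows = components[:2].T                    # shape (4, 2)
lengths = np.linalg.norm(arrows, axis=1)     # Euclidean length of each arrow
angles = np.degrees(np.arctan2(arrows[:, 1], arrows[:, 0]))  # direction in degrees

for i, (tip, length, angle) in enumerate(zip(arrows, lengths, angles), start=1):
    print(f"Var{i}: tip=({tip[0]:.3f}, {tip[1]:.3f}), "
          f"length={length:.3f}, angle={angle:.1f} deg")
```

Because the rows of pca.components_ form an orthonormal set, no arrow drawn this way can be longer than 1; a long arrow means the variable is well represented by the first two components.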
Finally, to create this plot in Python you can use the following sklearn-based code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
pca.fit(X,y)
x_new = pca.transform(X)
def myplot(score,coeff,labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    plt.scatter(xs ,ys, c = y) #without scaling
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()
#Call the function.
myplot(x_new[:,0:2], pca.components_.T)
plt.show()
See also this post: https://stackoverflow.com/a/50845697/5025009
and
https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
Try the pca library. This will plot the explained variance, and create a biplot.
pip install pca
A small example:
from pca import pca
# Initialize to reduce the data to the number of components that explains 95% of the variance.
model = pca(n_components=0.95)
# Or reduce the data towards 2 PCs
model = pca(n_components=2)
# Load example dataset
import pandas as pd
import sklearn
from sklearn.datasets import load_iris
X = pd.DataFrame(data=load_iris().data, columns=load_iris().feature_names, index=load_iris().target)
# Fit transform
results = model.fit_transform(X)
# Plot explained variance
fig, ax = model.plot()
# Scatter first 2 PCs
fig, ax = model.scatter()
# Make biplot with the number of features
fig, ax = model.biplot(n_feat=4)
Here, results is a dict containing many statistics of the PCs, loadings, etc.

matplotlib convert real to categorical

If I have a set of data:
x_data = array([-0.5597064565292805, -0.6044992007582148, 0.22877491676881043,
-1.2332817779977419, 0.42077626119484773, 1.825509016838052,
0.3476645527864688, -0.35439666443655543, 0.8783711637081933,
-0.438777582274935], dtype=object)
I can't get matplotlib to draw a bar chart with x as categorical values. No matter what I do, it forces a conversion to real numbers. Any ideas on how to make each number categorical?
You can plot your y data as a function of the numbers 0,1,2,3, etc and then set the ticklabels as the strings from the x_data.
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
x_data = [-0.5597064565292805, -0.6044992007582148, 0.22877491676881043,
-1.2332817779977419, 0.42077626119484773, 1.825509016838052,
0.3476645527864688, -0.35439666443655543, 0.8783711637081933,
-0.438777582274935]
y_data = np.random.rand(len(x_data))
x_strings=["{:.3f}".format(x) for x in x_data]
plt.figure(figsize=(5,5), dpi=64./5)
plt.bar(range(len(x_data)), y_data )
plt.xticks(range(len(x_data)), x_strings)
plt.savefig(__file__+".png", dpi="figure")
plt.show()
The result as a 64x64 pixel image would look like
where nothing is readable. So it may not make too much sense.
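As another option, matplotlib 2.1 and later treat a list of strings passed as x as categories, so the formatted numbers can be handed to bar directly instead of setting tick labels by hand. This is a sketch under that version assumption; the figure size, the label rotation and the random y values are placeholders chosen for readability, not taken from the question.

```python
import numpy as np
import matplotlib.pyplot as plt

x_data = [-0.5597064565292805, -0.6044992007582148, 0.22877491676881043,
          -1.2332817779977419, 0.42077626119484773, 1.825509016838052,
          0.3476645527864688, -0.35439666443655543, 0.8783711637081933,
          -0.438777582274935]
rng = np.random.default_rng(42)
y_data = rng.random(len(x_data))  # placeholder bar heights

# Strings are plotted as categories, in the order given
x_strings = ["{:.3f}".format(x) for x in x_data]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(x_strings, y_data)
ax.tick_params(axis="x", labelrotation=45)  # keep the labels readable
fig.tight_layout()
# Call plt.show() afterwards to display the figure.
```

Identical strings would be merged into one category, so the formatting precision should be high enough to keep the values distinct.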

matplotlib 2d numpy array

I have created a 2d numpy array as:
for line in finp:
    tdos = []
    for _ in range(250):
        sdata = finp.readline()
        tdos.append(sdata.split())
    break
tdos = np.array(tdos)
Which results in:
[['-3.463' '0.0000E+00' '0.0000E+00' '0.0000E+00' '0.0000E+00']
['-3.406' '0.0000E+00' '0.0000E+00' '0.0000E+00' '0.0000E+00']
['-3.349' '-0.2076E-29' '-0.3384E-30' '-0.1181E-30' '-0.1926E-31']
...,
['10.594' '0.2089E+02' '0.3886E+02' '0.9742E+03' '0.9664E+03']
['10.651' '0.1943E+02' '0.3915E+02' '0.9753E+03' '0.9687E+03']
['10.708' '0.2133E+02' '0.3670E+02' '0.9765E+03' '0.9708E+03']]
Now, I need to plot $0:$1 and $0:-$2 using matplotlib, so that on the x axis I will have:
tdata[i][0] (i.e. -3.463, -3.406,-3.349, ..., 10.708)
and on the y axis I will have:
tdata[i][1] (i.e. 0.0000E+00,0.0000E+00,-0.2076E-29,...,0.2133E+02)
How can I define the x axis and y axis from the numpy array?
Just try the following recipe and see if it is what you want (two image plot methods followed by the same methods but with cropped image):
import matplotlib.pyplot as plt
import numpy as np
X, Y = np.meshgrid(range(100), range(100))
Z = X**2+Y**2
plt.imshow(Z,origin='lower',interpolation='nearest')
plt.show()
plt.pcolormesh(X,Y,Z)
plt.show()
plt.imshow(Z[20:40,30:70],origin='lower',interpolation='nearest')
plt.show()
plt.pcolormesh(X[20:40,30:70],Y[20:40,30:70],Z[20:40,30:70])
plt.show()
This produces four plots, one per plt.show() call.
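For the line plot the question actually describes, the array of strings first has to be converted to floats; the columns can then be sliced and passed to plot. The sketch below assumes that "$0:$1 and $0:-$2" means column 0 on the x axis against column 1 and against the negated column 2. The four sample rows are copied from the output shown in the question, and the legend labels are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# A few rows from the question; the real array has 250 rows of strings
tdos = np.array([
    ['-3.463', '0.0000E+00', '0.0000E+00', '0.0000E+00', '0.0000E+00'],
    ['-3.349', '-0.2076E-29', '-0.3384E-30', '-0.1181E-30', '-0.1926E-31'],
    ['10.594', '0.2089E+02', '0.3886E+02', '0.9742E+03', '0.9664E+03'],
    ['10.708', '0.2133E+02', '0.3670E+02', '0.9765E+03', '0.9708E+03'],
])

data = tdos.astype(float)  # parse the strings, including the E notation
x = data[:, 0]             # first column goes on the x axis

fig, ax = plt.subplots()
ax.plot(x, data[:, 1], label="column 1")
ax.plot(x, -data[:, 2], label="-(column 2)")  # negated, as in $0:-$2
ax.legend()
# Call plt.show() afterwards to display the figure.
```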
