How do I use tsne to visualize data with high dimensionality? - python

I am trying to use TSNE to visualize data based on a Category to show me if the data is separable.
I have been trying to do this for the past two days but I am not getting a scatter plot showing the different categories plotted to enable me to see the relationship.
Instead, it is plotting all the data in a straight linear line, which cannot be correct as there are 5 different distinct attributes with the column I am trying to use as a label and legend.
What do I do to correct this?
import label as label
import pandas as pd
from matplotlib.cm import get_cmap
from matplotlib.colors import rgb2hex
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
import numpy as np
# #region Loading Data
filename = 'Dataset/test.csv'
df = pd.read_csv(filename)
label = df.pop('Activity')
label_counts = label.value_counts()
# # Scale Data
scale = StandardScaler()
tsne_data= scale.fit_transform(df)
fig, axa = plt.subplots(2, 1, figsize=(15,10))
group = label.unique()
for i , labels in label.iteritems():
# mask =(label = group)
axa[0].scatter(x = tsne_data, y = tsne_data, label = group)
plt.legend
plt.show()

Related

Create legends in scatter plt

I create a mask of my dataset for plotting only No Animals materials, and when I draw this mask I have problems with the legends, because only the first material defines me and I don't know how to add the other 2 materials.
import numpy as np
import umap
import umap.plot
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display,HTML
import cufflinks as cf
cf.set_config_file(sharing='public',theme='ggplot',offline=True)
import seaborn as sns
palpations = np.load('big_matrix_16384.npz',allow_pickle=True)
X = palpations['arr_0']
embedding = umap.UMAP(n_neighbors=50,
min_dist=0.2,
metric='correlation').fit(X)
emb = embedding.transform(X)
mask_1 = Data["Tipo"]=="Animal"
emb_tipo_1 = emb[mask_1]
cmap = plt.cm.Spectral
c =[sns.color_palette("Set2")[x] for x in data_tipo_1.Material.map({"bone":0, "cartilage":1, "liver_raw_piece1":2})]
plt.scatter(emb_tipo_1[:,0],
emb_tipo_1[:,1],
c=c,
label=np.unique(data_tipo_1.Material),s=10)
plt.gca().set_aspect("equal","datalim")
plt.title("UMAP muestras Animales.")
plt.legend()
enter image description here

Trouble creating scatter plot

I'm having trouble using the scatter to create a scatter plot. Can someone help me? I've highlighted the line causing the error:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import numpy as np
from sklearn.preprocessing import StandardScaler
data = pd.read_csv('vetl8.csv')
df = pd.DataFrame(data=data)
clusterNum = 3
X = df.iloc[:, 1:].values
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)
k_means = KMeans(init="k-means++", n_clusters=clusterNum, n_init=12)
k_means.fit(X)
labels = k_means.labels_
df["Labels"] = labels
df.to_csv('dfkmeans.csv')
plt.scatter(df[2], df[1], c=labels) **#Here**
plt.xlabel('K', fontsize=18)
plt.ylabel('g', fontsize=16)
plt.show()
#data set correct
You are close, just a minor adjustment to access the x-y columns by number should fix it:
plt.scatter(df[df.columns[2]], df[df.columns[1]], c=df["Labels"])

No runtime error, but wrong iris PCA plotting

I am using following code to perform PCA on iris dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# get iris data to a dataframe:
from sklearn import datasets
iris = datasets.load_iris()
varnames = ['SL', 'SW', 'PL', 'PW']
irisdf = pd.DataFrame(data=iris.data, columns=varnames)
irisdf['Species'] = [iris.target_names[a] for a in iris.target]
# perform pca:
from sklearn.decomposition import PCA
model = PCA(n_components=2)
scores = model.fit_transform(irisdf.iloc[:,0:4])
loadings = model.components_
# plot results:
scoredf = pd.DataFrame(data=scores, columns=['PC1','PC2'])
scoredf['Grp'] = irisdf.Species
sns.lmplot(fit_reg=False, x="PC1", y='PC2', hue='Grp', data=scoredf) # plot point;
loadings = loadings.T
for e, pt in enumerate(loadings):
plt.plot([0,pt[0]], [0,pt[1]], '--b')
plt.text(x=pt[0], y=pt[1], s=varnames[e], color='b')
plt.show()
I am getting following plot:
However, when I compare with plots from other sites (e.g. at http://marcoplebani.com/pca/ ), my plot is not correct. Following differences seem to be present:
Petal length and petal width lines should have similar lengths.
Sepal length line should be closer to petal length and petal width lines rather than closer to sepal width line.
All 4 lines should be on the same side of x-axis.
Why is my plot not correct. Where is the error and how can it be corrected?
It depends on whether you scale the variance or not. The "other site" uses scale=TRUE. If you want to do this with sklearn, add StandardScaler before fitting the model and fit the model with scaled data, like this:
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(irisdf.iloc[:,0:4])
scores = model.fit_transform(X)
Edit: Difference between StandardScaler and normalize
Here is an answer which pointed out a key difference (row vs column). Even you use normalize here, you might want to consider X = normalize(X.T).T. The following code shows some differences after transformation:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, normalize
iris = datasets.load_iris()
varnames = ['SL', 'SW', 'PL', 'PW']
fig, ax = plt.subplots(2, 2, figsize=(16, 12))
irisdf = pd.DataFrame(data=iris.data, columns=varnames)
irisdf.plot(kind='kde', title='Raw data', ax=ax[0][0])
irisdf_std = pd.DataFrame(data=StandardScaler().fit_transform(irisdf), columns=varnames)
irisdf_std.plot(kind='kde', title='StandardScaler', ax=ax[0][1])
irisdf_norm = pd.DataFrame(data=normalize(irisdf), columns=varnames)
irisdf_norm.plot(kind='kde', title='normalize', ax=ax[1][0])
irisdf_norm = pd.DataFrame(data=normalize(irisdf.T).T, columns=varnames)
irisdf_norm.plot(kind='kde', title='normalize', ax=ax[1][1])
plt.show()
I'm not sure how deep I can go with the algorithm/math. The point for StandardScaler is to get uniform/consistent mean and variance across features. The assumption is that variables with large measurement units are not necessarily (and should not be) dominant in PCA. In other word, StandardScaler makes features contribute equally to PCA. As you can see, normalize won't give consistent mean or variance.

Subplot 3 graph in one figure

I am Having trouble with the last subplot. The last Crosstab plot appears by itself, and then the subplot has the first 2 subplots but the 3rd one is empty and contains no data. How can I graph it so that all 3 graphs come up in one figure and they share they same Y axis or 'Frequency'
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from sklearn.cross_validation import cross_val_score
#Data Exploration
data = sm.datasets.fair.load_pandas().data
data['affair'] = np.where(data['affairs'] > 0 , 1,0)
print(data)
print(data.groupby('affair').mean())
print(data.groupby('rate_marriage').mean())
plt.subplot(331)
data['educ'].hist()
plt.title('Histogram of Education')
plt.xlabel('Education Level')
plt.ylabel('Frequency')
plt.subplot(332)
data['rate_marriage'].hist()
plt.title('Histogram of Marriage Rating')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
plt.subplot(333)
pd.crosstab(data['rate_marriage'], data['affair'].astype(bool)).plot(kind='bar')
plt.title('Marriage Rating distribution by affair Status')
plt.xlabel('Marriage Rating')
plt.ylabel('Frequency')
plt.show()
You need to tell the pandas plotting function where to plot the data.
This can be achieved through the ax keyword.
ax= plt.subplot(333)
pd.crosstab(data['rate_marriage'], data['affair'].astype(bool)).plot(kind='bar', ax=ax)

Plotting PCA results including original data with scatter plot using Python

I have conducted PCA on iris data as an exercise. Here is my code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA # as sklearnPCA
import pandas as pd
#=================
df = pd.read_csv('iris.csv');
# Split the 1st 4 columns comprising values
# and the last column that has species
X = df.ix[:,0:4].values
y = df.ix[:,4].values
X_std = StandardScaler().fit_transform(X); # standardization of data
# Fit the model with X_std and apply the dimensionality reduction on X_std.
pca = PCA(n_components=2) # 2 PCA components;
Y_pca = pca.fit_transform(X_std)
# How to plot my results???? I am struck here!
Please advise on how to plot my original iris data and the PCAs derived using a scatter plot.
Here is the way I think you can visualize it. I'll put PC1 on X-Axis and PC2 in Y-Axis and color each point based on its category. Here is the code:
#first we need to map colors on labels
dfcolor = pd.DataFrame([['setosa','red'],['versicolor','blue'],['virginica','yellow']],columns=['Species','Color'])
mergeddf = pd.merge(df,dfcolor)
#Then we do the graph
plt.scatter(Y_pca[:,0],Y_pca[:,1],color=mergeddf['Color'])
plt.show()

Categories

Resources