Project variables in PCA plot in Python - python

After performing a PCA analysis in R we can do:
ggbiplot(pca, choices=1:2, groups=factor(row.names(df_t)))
That will plot the data in the 2 PC space, and the direction and weight of the variables in such space as vectors (with different length and direction).
In Python I can plot the data in the 2 PC space, and I can get the weights of the variables, but how do I know the direction.
In other words, how could I plot the variable contribution to both PC (weight and direction) in Python?

I am not aware of any pre-made implementation of this kind of plot, but it can be created using matplotlib.pyplot.quiver. Here's an example I quickly put together. You can use this as a basis to create a nice plot that works well for your data.
Example Data
This generates some example data. It is reused from this answer.
# User input
n_samples = 100
n_features = 5
# Prep
data = np.empty((n_samples,n_features))
np.random.seed(42)
# Generate
for i,mu in enumerate(np.random.choice([0,1,2,3], n_samples, replace=True)):
data[i,:] = np.random.normal(loc=mu, scale=1.5, size=n_features)
PCA
pca = PCA().fit(data)
Variables Factor Map
Here we go:
# Get the PCA components (loadings)
PCs = pca.components_
# Use quiver to generate the basic plot
fig = plt.figure(figsize=(5,5))
plt.quiver(np.zeros(PCs.shape[1]), np.zeros(PCs.shape[1]),
PCs[0,:], PCs[1,:],
angles='xy', scale_units='xy', scale=1)
# Add labels based on feature names (here just numbers)
feature_names = np.arange(PCs.shape[1])
for i,j,z in zip(PCs[1,:]+0.02, PCs[0,:]+0.02, feature_names):
plt.text(j, i, z, ha='center', va='center')
# Add unit circle
circle = plt.Circle((0,0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)
# Ensure correct aspect ratio and axis limits
plt.axis('equal')
plt.xlim([-1.0,1.0])
plt.ylim([-1.0,1.0])
# Label axes
plt.xlabel('PC 0')
plt.ylabel('PC 1')
# Done
plt.show()
Being Uncertain
I struggled a bit with the scaling of the arrows. Please make sure they correctly reflect the loadings for your data. A quick check of whether feature 4 really correlates strongly with PC 1 (as this example would suggest) looks promising:
data_pca = pca.transform(data)
plt.scatter(data_pca[:,1], data[:,4])
plt.xlabel('PC 2') and plt.ylabel('feature 4')
plt.show()

Thanks to WhoIsJack for the earlier answer.
I adapted there code to a function below that takes in a fitted PCA object and the data it was based on. It produces the figure similar to above, but I substituted out real column names for the column index, and then pruned it to only show a certain number of contributing columns.
def plot_pca_vis(pca, df: pd.DataFrame, pc_x: int = 0, pc_y: int = 1, num_dims: int = 5):
"""
https://stackoverflow.com/questions/45148539/project-variables-in-pca-plot-in-python
Adapted into function by Tim Cashion
"""
# Get the PCA components (loadings)
PCs = pca.components_
PC_x_index = PCs[pc_x, : ].argsort()[-num_dims:][::-1]
PC_y_index = PCs[pc_y, : ].argsort()[-num_dims:][::-1]
combined_index = set(list(PC_x_index) + list(PC_y_index))
PCs = PCs[:, list(combined_index)]
# Use quiver to generate the basic plot
fig = plt.figure(figsize=(5,5))
plt.quiver(np.zeros(PCs.shape[1]), np.zeros(PCs.shape[1]),
PCs[pc_x,:], PCs[pc_y,:],
angles='xy', scale_units='xy', scale=1)
# Add labels based on feature names (here just numbers)
feature_names = df.columns
for i,j,z in zip(PCs[pc_y,:]+0.02, PCs[pc_x,:]+0.02, feature_names):
plt.text(j, i, z, ha='center', va='center')
# Add unit circle
circle = plt.Circle((0,0), 1, facecolor='none', edgecolor='b')
plt.gca().add_artist(circle)
# Ensure correct aspect ratio and axis limits
plt.axis('equal')
plt.xlim([-1.0,1.0])
plt.ylim([-1.0,1.0])
# Label axes
plt.xlabel('PC ' + str(pc_x))
plt.ylabel('PC ' + str(pc_y))
# Done
plt.show()
Hope this helps someone!

Related

Histogram line of best fit line is jagged and not smooth?

I can't quite seem to figue out how to get my curve to be displayed smoothly instead of having so many sharp turns.
I am hoping to show a boltzmann probability distribution. With a nice smooth curve.
I'll expect it is a simple fix but I can't see it. Can someone please help?
My code is below:
from matplotlib import pyplot as plt
import numpy as np
import scipy.stats
dE = 1
N = 500
n = 10000
# This is creating an array filled with all twos
def Create_Array(N):
Particle_State_List_set = np.ones(N, dtype = int)
Particle_State_List_twos = Particle_State_List_set + 1
return(Particle_State_List_twos)
Array = Create_Array(N)
def Select_Random_index(N):
Seed = np.random.default_rng()
Partcle_Index = Seed.integers(low=0, high= N - 1)
return(Partcle_Index)
def Exchange(N):
Particle_Index_A = Select_Random_index(N) #Selects a particle to be used as particle "a"
Particle_Index_B = Select_Random_index(N) #Selects a particle to be used as particle "b"
# Checks to see if the energy on particle "a" is zero, if so it selects anbother until it isn't.
while Array[Particle_Index_A] == 1:
Particle_Index_A = Select_Random_index(N)
#This loop is making sure that Particle "a" and "b" aren't the same particle, it chooses again until the are diffrent.
while Particle_Index_B == Particle_Index_A:
Particle_Index_B = Select_Random_index(N)
# This assignes variables to the chosen particle's energy values
a = Array[Particle_Index_A]
b = Array[Particle_Index_B]
# This updates the values of the Energy levels of the interacting particles
Array[Particle_Index_A] = a - dE
Array[Particle_Index_B] = b + dE
return (Array[Particle_Index_A], Array[Particle_Index_B])
for i in range(n):
Exchange(N)
# This part is making the histogram the curve will be made from
_, bins, _ = plt.hist(Array, 12, density=1, alpha=0.15, color="g")
# This is using scipy to find the mean and standard deviation in order to plot the curve
mean, std = scipy.stats.norm.fit(Array)
# This part is drawing the best fit line, using the established bins value and the std and mean from before
best_fit = scipy.stats.norm.pdf(bins, mean, std)
# Plotting the best fit curve
plt.plot(bins, best_fit, color="r", linewidth=2.5)
#These are instructions on how python with show the graph
plt.title("Boltzmann Probablitly Curve")
plt.xlabel("Energy Value")
plt.ylabel('Percentage at this Energy Value')
plt.tick_params(top=True, right=True)
plt.tick_params(direction='in', length=6, width=1, colors='0')
plt.grid()
plt.show()
Whats happening is that in these lines:
best_fit = scipy.stats.norm.pdf(bins, mean, std)
plt.plot(bins, best_fit, color="r", linewidth=2.5)
'bins' the histogram bin edges is being used as the x coordinates of the data points forming the best fit line. The resulting plot is jagged because they are so widely spaced. Instead you can define a tighter packed set of x coordinates and use that:
bfX = np.arange(bins[0],bins[-1],.05)
best_fit = scipy.stats.norm.pdf(bfX, mean, std)
plt.plot(bfX, best_fit, color="r", linewidth=2.5)
For me that gives a nice smooth curve, but you can always use a tighter packing than .05 if its not to your liking yet.

Adding histogram bins together and plotting a figure

I have a histogram with 8192 bins, each bin imported from a line from a text file. To cut things short, it makes an awful fit and it was suggested to mee I could reduce the statistical errors by adding counts from adjacent bins. e.g. add bins 0-7 to make a new first bin, 8 times as wide, but 8 times(roughly) as high.
Ideally, would like to just be able to output a histogram of a binwidth controlled by a single constant in the code. However my attempts to do this, instead of producing something like the first image below (which was born of the version of my code which can only do a binwidth of 1, produce something like the second image below, missing fit lines and with a second empty graph in the same image file (born of my attempts to generalise the code for any bin width).
The following a histogram plotted directly from the original data i.e. binwidth = 1
Original code output, only works for binwidth 1 though.
Example for trying bin width 8 with come code modifications
I also need it to return a fit report, and the area under the gaussian as this is plotted later on in the code, in an exponential decay curve.
Here is the section of code I think is relevant:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from numpy import exp, loadtxt, pi, sqrt, random, linspace
from lmfit import Model
import glob, os
## Load text file
x = np.linspace(0, 8191, 8192)
finalprefix = str(n).zfill(3)
fullprefix = folderToAnalyze + prefix + finalprefix
y = loadtxt(fullprefix + ".Spe", skiprows= 12, max_rows = 8192)
## Make figure and label
fig, ax = plt.subplots(figsize=(15,8))
fig.suptitle('Photon coincidence detections from $β^+$ + $β^-$ annhilation', fontsize=18)
plt.xlabel('Bins', fontsize=14)
plt.ylabel('Counts', fontsize=14)
## Plot data
ax.bar(x, y)
ax.set_xlim(600,960)
## Adding Bins Together
y = y.astype(int)
x = x.astype(int)
## create the data
data = np.repeat(x, y)
## determine the range of x
x_range = range(min(data), max(data)+1)
## determine the length of x
x_len = len(x_range)
## plot
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(10, 10))
ax1.hist(data, bins=x_len) # outliers are not plotted
plt.show()
## given x_len determine how many bins for a given bin width
width = 8
bins = int(np.round(x_len / width))
## determine new x and y for the histogram
y, x = np.histogram(data, bins=bins)
## Fit data to Gaussian
gmodel = Model(gaussian)
result = gmodel.fit(y, x=x[:-1], amp=8, cen=approxcen, wid=1)
## result
print(result.fit_report())
fig.savefig("abw_" + finalprefix + ".png")
## Append to list if error in amplitude and amplitude itself is within reasonable bounds
if result.params['amp'].stderr < stderrThreshold and result.params['amp'] > minimumAmplitude:
amps.append(result.params['amp'].value)
ampserr.append(result.params['amp'].stderr)
ts.append(MaestroT*n)
## Plot decay curve
fig, ax = plt.subplots()
ax.errorbar(ts, amps, yerr= 2*np.array(ampserr), fmt="ko-", capsize = 5, capthick= 2, elinewidth=3, markersize=5)
plt.xlabel('Time', fontsize=14)
plt.ylabel('Peak amplitude', fontsize=14)
plt.title("Decay curve of P-31 by $β^+$ emission", fontsize=14)
Some synthetic data: {1,2,1,0,0,0,0,0,6,0,0,0,0,0,0,0,7,0,0,1,0,1,0,0,6,6,0,0,0,3,0,0,3,3,3,5,4,0,4,3,1,4,0,5,6,4,0,2,0,0,0,9,6,1,1,1,0,0,3,2,2,3,0,0,0,2,4,0,0,0,0,0,0,4,10,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0}
I think this should create 2 very different shaped histograms when the bin width is 1 and when it is 8. Though I have just made them up, the fit may not be good, and it is worth mentioning one of the problems I was having is related to being able to add together the information read in from the text file
In case it's useful:
-Here is the full original code
-Here is the data for that histogram

matplotlib pyplot 2 plots with different axes in same figure

I have a small issue with matplotlib.pyplot and I hope someone might have come across it before.
I have data that contain X,Y,e values that are the X, Y measurements of a variable and e are the errors of the measurements in Y. I need to plot them in a log log scale.
I use the plt.errorbars function to plot them and then set yscale and xscale to log and this works fine. But I need to also plot a line on the same graph that needs to be in linear scale.
I am able to have the plots done separately just fine but I would like to have them in the same image if possible. Do you have any ideas? I am posting what I have done for now.
Cheers,
Kimon
tdlist = np.array([0.01,0.02,0.05,0.1,0.2,0.3,0.4,0.5,0.8,1,2,5,10,15,20,25,30,40,60,80,100,150,200,250,300,400])
freqlist=np.array([30,40,50,60,70,80,90,100,110,120,140,160,180,200,220,250,300,350,400,450])
filename=opts.filename
data = reader(filename)
data2 = logconv(data)
#x,y,e the data. Calculating usefull sums
x = data2[0]
y = data2[1]
e = data2[2]
xoe2 = np.sum(x/e**2)
yoe2 = np.sum(y/e**2)
xyoe2 = np.sum(x*y/e**2)
oe2 = np.sum(1/e**2)
x2oe2 = np.sum(x**2/e**2)
aslope = (xoe2*yoe2-xyoe2*oe2)/(xoe2**2-x2oe2*oe2)
binter = (xyoe2-aslope*x2oe2)/xoe2
aerr = np.sqrt(oe2/(x2oe2*oe2-xoe2**2))
berr = np.sqrt(x2oe2/(x2oe2*oe2-xoe2**2))
print('slope is ',aslope,' +- ', aerr)
print('inter is ',binter,' +- ', berr)
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)
ax2 = fig.add_axes(ax1.get_position(), frameon=False)
ax1.errorbar(data[0],data[1],yerr=data[2],fmt='o')
ax1.set_xscale('log',basex=10)
ax1.set_yscale('log',basey=10)
ax1.set_yticks([])
ax1.set_xticks([])
ax2.plot(x,aslope*x+binter,'r')
ax2.plot(x,(aslope-aerr)*x+(binter+berr),'--')
ax2.plot(x,(aslope+aerr)*x+(binter-berr),'--')
ax2.set_xscale('linear')
ax2.set_yscale('linear')
plt.xticks(np.log10(freqlist),freqlist.astype('int'))
plt.yticks(np.log10(tdlist),tdlist.astype('float'))
plt.xlabel('Frequency (MHz)')
plt.ylabel('t_s (msec)')
fitndx1 = 'Fit slope '+"{0:.2f}".format(aslope)+u"\u00B1"+"{0:.2f}".format(aerr)
plt.legend(('Data',fitndx1))
plt.show()
Following Molly's suggestion I managed to get closer to my goal but still not there. I am adding a bit more info for what I am trying to do and it might clarify things a bit.
I am setting ax1 to the errobar plot that uses loglog scale. I need to use errorbar and not loglog plot so that I can display the errors with my points.
I am using ax2 to plot the linear fit in linealinear scale.
Moreover I do not want the x and y axes to display values that are 10,100,1000 powers of ten but my own axes labels that have the spacing I want therefore I am using the plt.xticks. I tried ax1.set_yticks and ax1.set_yticklabes but with no success. Below is the image I am getting.
I do not have enough reputation to post an image but here is the link of it uploaded
http://postimg.org/image/uojanigab/
The values of my points should be x range = 40 - 80 and y range = 5 -200 as the fit lines are now.
You can create two overlapping axes using the add_suplot method of figure. Here's an example:
from matplotlib import pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)
ax2 = fig.add_axes(ax1.get_position(), frameon=False)
ax1.loglog([1,10,100,1000],[1000,1,100,10])
ax2.plot([5,10,11,13],'r')
plt.show()
You can then turn off the x and y ticks for the linear scale plot like this:
ax2.set_xticks([])
ax2.set_yticks([])
I was not able to get two sets of axis working with the errorbar function so I had to convert everything to log scale including my linear plot. Below is the code I use to get it might be useful to someone.
plt.errorbar(data[0],data[1],yerr=data[2],fmt='o')
plt.xscale('log',basex=10)
plt.yscale('log',basey=10)
plt.plot(data[0],data[0]**aslope*10**binter,'r')
plt.plot(data[0],data[0]**(aslope-aerr)*10**(binter+berr),'--')
plt.plot(data[0],data[0]**(aslope+aerr)*10**(binter-berr),'--')
plt.xticks(freqlist,freqlist.astype('int'))
plt.yticks(tdlist,tdlist.astype('float'))
plt.xlabel('Frequency (MHz)')
plt.ylabel('t_s (msec)')
fitndx1 = 'Fit slope '+"{0:.2f}".format(aslope)+u"\u00B1"+"{0:.2f}".format(aerr)
plt.legend(('Data',fitndx1))
plt.show()
And here is the link to the final image
http://postimg.org/image/bevj2k6nf/

Plot histogram normalized by fixed parameter

I need to plot a plot a normalized histogram (by normalized I mean divided by a fixed value) using the histtype='step' style.
The issue is that plot.bar() doesn't seem to support that style and if I use instead plot.hist() which does, I can't (or at least don't know how) plot the normalized histogram.
Here's a MWE of what I mean:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
return np.random.uniform(low=10., high=20., size=(200,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Plot histogram normalized by the parameter defined above.
plt.ylim(0, 3)
plt.bar(edges1[:-1], hist1 / param, width=binwidth, color='none', edgecolor='r')
plt.show()
(notice the normalization: hist1 / param) which produces this:
I can generate a histtype='step' histogram using:
plt.hist(x1, bins=bin_n, histtype='step', color='r')
and get:
but then it wouldn't be normalized by the param value.
The step plot will generate the appearance that you want from a set of bins and the count (or normalized count) in those bins. Here I've used plt.hist to get the counts, then plot them, with the counts normalized. It's necessary to duplicate the first entry in order to get it to actually have a line there.
(a,b,c) = plt.hist(x1, bins=bin_n, histtype='step', color='r')
a = np.append(a[0],a[:])
plt.close()
step(b,a/param,color='r')
This is not quite right, because it doesn't finish the plot correctly. the end of the line is hanging in free space rather than dropping down the x axis.
you can fix that by adding a 0 to the end of 'a' and one more bin point to b
a=np.append(a[:],0)
b=np.append(b,(2*b[-1]-b[-2]))
step(b,a/param,color='r')
lastly, the ax.step mentioned would be used if you had used
fig, ax = plt.subplots()
to give you access to the figure and axis directly. For examples, see http://matplotlib.org/examples/ticks_and_spines/spines_demo_bounds.html
Based on tcaswell's comment (use step) I've developed my own answer. Notice that I need to add elements to both the x (one zero element at the beginning of the array) and y arrays (one zero element at the beginning and another at the end of the array) so that step will plot the vertical lines at the beginning and the end of the bars.
Here's the code:
import matplotlib.pyplot as plt
import numpy as np
def rand_data():
return np.random.uniform(low=10., high=20., size=(5000,))
# Generate data.
x1 = rand_data()
# Define histogram params.
binwidth = 0.25
x_min, x_max = x1.min(), x1.max()
bin_n = np.arange(int(x_min), int(x_max + binwidth), binwidth)
# Obtain histogram.
hist1, edges1 = np.histogram(x1, bins=bin_n)
# Normalization parameter.
param = 5.
# Create arrays adding elements so plt.bar will plot the first and last
# vertical bars.
x2 = np.concatenate((np.array([0.]), edges1))
y2 = np.concatenate((np.array([0.]), (hist1 / param), np.array([0.])))
# Plot histogram normalized by the parameter defined above.
plt.xlim(min(edges1) - (min(edges1) / 10.), max(edges1) + (min(edges1) / 10.))
plt.bar(x2, y2, width=binwidth, color='none', edgecolor='b')
plt.step(x2, y2, where='post', color='r', ls='--')
plt.show()
and here's the result:
The red lines generated by step are equal to those blue lines generated by bar as can be seen.

plotting results of hierarchical clustering on top of a matrix of data

How can I plot a dendrogram right on top of a matrix of values, reordered appropriately to reflect the clustering, in Python? An example is the following figure:
This is Figure 6 from: A panel of induced pluripotent stem cells from chimpanzees: a resource for comparative functional genomics
I use scipy.cluster.dendrogram to make my dendrogram and perform hierarchical clustering on a matrix of data. How can I then plot the data as a matrix where the rows have been reordered to reflect a clustering induced by the cutting the dendrogram at a particular threshold, and have the dendrogram plotted alongside the matrix? I know how to plot the dendrogram in scipy, but not how to plot the intensity matrix of data with the right scale bar next to it.
The question does not define matrix very well: "matrix of values", "matrix of data". I assume that you mean a distance matrix. In other words, element D_ij in the symmetric nonnegative N-by-N distance matrix D denotes the distance between two feature vectors, x_i and x_j. Is that correct?
If so, then try this (edited June 13, 2010, to reflect two different dendrograms).
Tested in python 3.10 and matplotlib 3.5.1
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform
# Generate random features and distance matrix.
np.random.seed(200) # for reproducible data
x = np.random.rand(40)
D = np.zeros([40, 40])
for i in range(40):
for j in range(40):
D[i,j] = abs(x[i] - x[j])
condensedD = squareform(D)
# Compute and plot first dendrogram.
fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_axes([0.09, 0.1, 0.2, 0.6])
Y = sch.linkage(condensedD, method='centroid')
Z1 = sch.dendrogram(Y, orientation='left')
ax1.set_xticks([])
ax1.set_yticks([])
# Compute and plot second dendrogram.
ax2 = fig.add_axes([0.3, 0.71, 0.6, 0.2])
Y = sch.linkage(condensedD, method='single')
Z2 = sch.dendrogram(Y)
ax2.set_xticks([])
ax2.set_yticks([])
# Plot distance matrix.
axmatrix = fig.add_axes([0.3, 0.1, 0.6, 0.6])
idx1 = Z1['leaves']
idx2 = Z2['leaves']
D = D[idx1,:]
D = D[:,idx2]
im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=plt.cm.YlGnBu)
axmatrix.set_xticks([]) # remove axis labels
axmatrix.set_yticks([]) # remove axis labels
# Plot colorbar.
axcolor = fig.add_axes([0.91, 0.1, 0.02, 0.6])
plt.colorbar(im, cax=axcolor)
plt.show()
fig.savefig('dendrogram.png')
Edit: For different colors, adjust the cmap attribute in imshow. See the scipy/matplotlib docs for examples. That page also describes how to create your own colormap. For convenience, I recommend using a preexisting colormap. In my example, I used YlGnBu.
Edit: add_axes (see documentation here) accepts a list or tuple: (left, bottom, width, height). For example, (0.5,0,0.5,1) adds an Axes on the right half of the figure. (0,0.5,1,0.5) adds an Axes on the top half of the figure.
Most people probably use add_subplot for its convenience. I like add_axes for its control.
To remove the border, use add_axes([left,bottom,width,height], frame_on=False). See example here.
If in addition to the matrix and dendrogram it is required to show the labels of the elements, the following code can be used, that shows all the labels rotating the x labels and changing the font size to avoid overlapping on the x axis. It requires moving the colorbar to have space for the y labels:
axmatrix.set_xticks(range(40))
axmatrix.set_xticklabels(idx1, minor=False)
axmatrix.xaxis.set_label_position('bottom')
axmatrix.xaxis.tick_bottom()
pylab.xticks(rotation=-90, fontsize=8)
axmatrix.set_yticks(range(40))
axmatrix.set_yticklabels(idx2, minor=False)
axmatrix.yaxis.set_label_position('right')
axmatrix.yaxis.tick_right()
axcolor = fig.add_axes([0.94,0.1,0.02,0.6])
The result obtained is this (with a different color map):

Categories

Resources