Scatter plot for a matrix of a given form

Suppose I have a matrix where the first column holds all the x coordinates, the second column holds all the y coordinates, and the third and fourth columns are indicator variables telling whether the point belongs to a particular 'cluster'. Each indicator is either 1 or 0, so a 1 in column 3 of the third row means that the point in the third row belongs to, say, cluster 1, which is represented by column 3.
My question is: how do I create a figure, scatter plot all the points belonging to cluster 1, and then on the same plot scatter the remaining points in another color? In Matlab, I would just say figure, then hold on, and write out my commands. I am new to plotting in Python and not sure how this would be performed.
EDIT:
I think I made it work. However, how would I change the marker size depending on which cluster the point belongs to?

Let's start with how we'd do this in MATLAB.
Supposing you have N unique clusters, you can simply loop through as many clusters as you have and plot the points in a different colour. Also, we can change the marker size at each iteration. You'll need to use logical indexing to extract out the points that belong to each cluster. Given that your matrix is stored in M, something like this comes to mind:
rng(123); %// Set random seeds
%// Total number of clusters
N = max(M(:,3));
%// Create a colour map
cmap = rand(N,3);
%// Store point sizes per cluster
sizes = [10 14 18];
figure; hold on; %// Create a blank figure and hold for changes
for ii = 1 : N
    %// Determine those points belonging to the ith cluster
    ind = M(:,3) == ii;
    %// Get the x and y coordinates
    x = M(ind,1);
    y = M(ind,2);
    %// Plot the points in a different colour
    plot(x,y,'.','Color', cmap(ii,:), 'MarkerSize', sizes(ii));
end
%// Create labels
labels = sprintfc('Label %d', 1:N);
%// Make our legend
legend(labels{:});
The code is fairly self-explanatory. You need to define your matrix M, and we determine the total number of clusters by taking the maximum of the third column, which is assumed to hold an integer cluster label from 1 to N. Next we create a random colour map with as many rows as there are clusters and three columns, so that each row is a unique RGB colour for one cluster, which we'll use when plotting.
Next we create an array of sizes that stores the marker size to use for each cluster. We create a blank figure and hold it so that later plot calls add to it, then we iterate over each cluster. For each cluster, we figure out which rows of M belong to it through logical indexing, extract the x and y coordinates for those rows, then plot them as unconnected markers, manually specifying the colour as an RGB triple as well as the desired marker size.
We then create a cell array of labels, one per cluster, and show a legend illustrating which points belong to which cluster.
To test this, we generate random data: 20 points with x and y uniformly distributed in [0,1], each given a random label from 1 to 3:
rng(123);
M = [rand(20,2) randi(3,20,1)];
I get this plot when I run the above code:
Getting the equivalent in Python is pretty easy. It's essentially a transcription from MATLAB to Python, and the plotting mechanisms are almost the same. You're using matplotlib, so I'm assuming numpy can be used as it's a dependency.
As such, the equivalent code would look something like this:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
# M is assumed to exist already (see the data generation line further down)
# Total number of clusters
N = int(np.max(M[:,2]))
# Create a colour map
cmap = np.random.rand(N, 3)
# Store point sizes per cluster
sizes = np.array([10, 14, 18])
plt.figure() # Create blank figure. No need to hold on
for ii in range(N):
    # Determine those points belonging to the ith cluster
    ind = M[:,2] == (ii+1)
    # Get the x and y coordinates
    x = M[ind,0]
    y = M[ind,1]
    # Plot the points in a different colour
    # Also add in labels for legend
    plt.plot(x,y,'.',color=tuple(cmap[ii]), markersize=sizes[ii], label='Cluster #' + str(ii+1))
# Make our legend
plt.legend()
# Show the image
plt.show()
I won't bother explaining this one because it's pretty much the same as what you see in the MATLAB code. There are some nuances, such as the way hold on works in matplotlib. You don't need hold on because successive plotting calls draw onto the current figure, and everything you add is kept until you show the figure. The other nuance is that numpy and Python start indexing at 0 instead of 1, which is why the loop compares against ii+1 and the columns are 0, 1 and 2.
Using the same data generation code as in MATLAB:
M = np.column_stack([np.random.rand(20,2), np.random.randint(1,4,size=(20,1))])
I get this figure:
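The question itself describes 0/1 indicator columns rather than a single label column, and asks about a per-cluster marker size. A minimal sketch of that case follows, assuming the layout x, y, indicator for cluster 1, indicator for cluster 2. The function name plot_indicator_clusters, the demo matrix, and the colours and sizes are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_indicator_clusters(M, sizes=(20, 80), colours=("tab:blue", "tab:red")):
    """Scatter each indicator column of M with its own colour and marker size.

    Assumes columns 0 and 1 hold x and y, and every later column is a 0/1
    indicator for one cluster (hypothetical layout taken from the question).
    """
    fig, ax = plt.subplots()
    for k in range(M.shape[1] - 2):
        mask = M[:, 2 + k] == 1  # rows flagged for cluster k+1
        ax.scatter(M[mask, 0], M[mask, 1], s=sizes[k], color=colours[k],
                   label=f"Cluster {k + 1}")
    ax.legend()
    return fig, ax


# Small made-up matrix: x, y, indicator for cluster 1, indicator for cluster 2
M_demo = np.array([
    [0.1, 0.2, 1, 0],
    [0.4, 0.9, 0, 1],
    [0.7, 0.3, 1, 0],
    [0.5, 0.5, 0, 1],
    [0.9, 0.8, 0, 1],
])
fig, ax = plot_indicator_clusters(M_demo)
```

Call plt.show() afterwards to display the figure. Note that the s argument of scatter is a marker area in points squared, whereas markersize in plot is a diameter in points, so the two numbers are not directly comparable.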

Related

what do these commands do in the digits dataset clustering demonstration?

I have been looking at a Python tutorial on fitting a k-means model to the digits dataset, and some of the code is confusing me.
I do understand this part, where we train our model using 10 clusters.
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
digits = load_digits()
digits.data.shape
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape
The following displays the 10 cluster centroids. It first creates a figure with two rows of five subplots each; plt.subplots returns the figure and the array of axes, and (8, 3) is the size of the figure in inches.
But after that I just do not understand how the for loop displays the cluster centroids.
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
Also, this part checks how accurate the clustering was at finding similar digits within the data. I know that we need to create a labels array of the same size as clusters, filled with zeros, so we can place our predicted labels in there.
But again, I just do not understand how they implement it inside the for loop.
from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
Can someone please explain what each line of these commands does? Thank you.

Question 1: How does the code plot the centroids?
It's important to see that each centroid is a point in the feature space. In other words, a centroid looks like one of the training samples. In this case, each training sample is an 8 × 8 image, although the images have been flattened into rows with 64 elements because sklearn always wants the input X to be a two-dimensional array. So each centroid also represents an 8 × 8 image.
The loop steps over the axes (a 2 × 5 matrix) and the centroids (kmeans.cluster_centers_) together. The purpose of zip is to pair each Axes object with a corresponding center (this is a common way to plot a bunch of n things into a bunch of n subplots). The centroids have been reshaped into a 10 × 8 × 8 array, so that each of the 10 centroids is the 8 × 8 image we're expecting.
Since each centroid is now a 2D array, you can use imshow to plot it.
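To see the reshape and the zip pairing without fitting a model or drawing anything, here is a small sketch. The array flat_centers and the string grid are made-up stand-ins for kmeans.cluster_centers_ and for the 2 × 5 array of Axes objects; only the shapes matter.

```python
import numpy as np

# Stand-in for kmeans.cluster_centers_: 10 flattened "images" of 64 values each.
# The values are made up; only the shapes matter here.
flat_centers = np.arange(10 * 64, dtype=float).reshape(10, 64)

# Same reshape as in the tutorial: one 8 x 8 image per centroid
centers = flat_centers.reshape(10, 8, 8)

# Stand-in for the 2 x 5 array of Axes objects returned by plt.subplots(2, 5)
grid = np.array([[f"axes[{r},{c}]" for c in range(5)] for r in range(2)])

# grid.flat walks the 2 x 5 grid row by row, so zip pairs slot k with centroid k
pairs = list(zip(grid.flat, centers))
for slot, center in pairs[:3]:
    print(slot, center.shape)
```

In the tutorial code, each slot is a real Axes object, and the loop body calls imshow on it with the matching 8 × 8 array.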
Question 2: How does the code assign labels?
The easiest thing might be to take the code apart and run bits of it on their own. For example, take a look at clusters == 0. This is a Boolean array. You can use Boolean arrays to index other arrays of the same shape. The first line of code in the loop assigns this array to mask so we can use it.
Then we index into labels using the Boolean array (try it!) to say, "Change these values to the mode average of the corresponding elements of the label vector, i.e. digits.target." The index [0] is just needed because of what the scipy.stats.mode() function returns (again, try it out).
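The masking step is easier to follow on a tiny example. The sketch below uses made-up cluster ids and true digits, and swaps in np.bincount(...).argmax() for scipy.stats.mode so that it only needs NumPy; for non-negative integer labels it likewise picks the most common value.

```python
import numpy as np

# Made-up cluster ids from k-means and the true digit for each sample
clusters = np.array([0, 0, 1, 1, 1, 2, 2, 0])
targets = np.array([7, 7, 3, 3, 9, 4, 4, 1])

labels = np.zeros_like(clusters)
for i in range(3):
    mask = clusters == i  # True where a sample is in cluster i
    most_common = np.bincount(targets[mask]).argmax()  # mode of the true labels
    labels[mask] = most_common  # written to every masked position

print(labels)  # [7 7 3 3 3 4 4 7]
```

Every sample in a cluster receives the most common true digit of that cluster, so the one sample in cluster 1 whose true digit is 9 ends up labelled 3.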

Identify the grid particles belong to

A square box of size 10,000*10,000 has 1,000,000 particles distributed uniformly. The box is divided into grid cells, each of size 100*100, so there are 10,000 cells in total. At every time-step (for a total of 2016 steps), I would like to identify the cell to which each particle belongs. Is there an efficient way to implement this in Python? My implementation is below and currently takes approximately 83s for one run.
import numpy as np
import time
start=time.time()
# Size of the layout
Layout = np.array([0,10000])
# Total Number of particles
Population = 1000000
# Array to hold the cell number
cell_number = np.zeros((Population),dtype=np.int32)
# Limits of each cell
boundaries = np.arange(0,10100,step=100)
cell_boundaries = np.dstack((boundaries[0:100],boundaries[1:101]))
# Position of Particles
points = np.random.uniform(0,Layout[1],size = (Population,2))
# Generating a list with the x,y boundaries of each cell in the grid
x = []
limit_list = cell_boundaries
for i in range(0,Layout[1]//100):
    for j in range(0,Layout[1]//100):
        x.append([limit_list[0][i,0],limit_list[0][i,1],limit_list[0][j,0],limit_list[0][j,1]])
# Identifying the cell to which the particles belong
i=0
for y in (x):
    cell_number[(points[:,1]>y[0])&(points[:,1]<y[1])&(points[:,0]>y[2])&(points[:,0]<y[3])]=i
    i+=1
print(time.time()-start)

I am not sure about your code. You seem to be accumulating the i variable globally, whereas it should be accumulated on a per-cell basis, correct? Something like cell_number[???] += 1, maybe?
Anyhow, I see this from a different perspective. You could start by assigning each point a cell id, then invert the resulting array with a kind of counter function. I have implemented the following in PyTorch; you will most likely find equivalent utilities in NumPy.
The conversion from 2D coordinates to cell ids corresponds to dividing the coordinates by the cell size (100), applying floor to get integer cell indices, then unfolding them according to the number of cells per row (100).
>>> import torch
>>> p = (torch.from_numpy(points) / 100).floor()
>>> p_unfold = p[:, 0]*100 + p[:, 1]
Then you can "invert" the statistics, i.e. find out how many particles there are in each cell based on the cell ids. This can be done using PyTorch's histogram counter torch.histc, with one bin per cell:
>>> torch.histc(p_unfold, bins=10000, min=0, max=10000)
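The same idea can be sketched with NumPy alone. This is a minimal version under stated assumptions, not a benchmarked solution: the helper name assign_cells is made up, and the flat id follows the numbering of the loop in the question (row index from column 1, column index from column 0).

```python
import numpy as np


def assign_cells(points, cell_size=100, cells_per_side=100):
    """Return a flat cell id for each (x, y) point.

    Follows the numbering of the loop in the question: the row index comes
    from column 1 of points and the column index from column 0.
    """
    idx = np.floor_divide(points, cell_size).astype(np.int64)
    # Points lying exactly on the upper edge are clipped into the last cell
    idx = np.clip(idx, 0, cells_per_side - 1)
    return idx[:, 1] * cells_per_side + idx[:, 0]


rng = np.random.default_rng(0)
points = rng.uniform(0, 10_000, size=(1_000_000, 2))
cell_number = assign_cells(points)

# Number of particles in each of the 10,000 cells
counts = np.bincount(cell_number, minlength=100 * 100)
```

Because the work is a single floor division and one multiply-add over the whole array, there is no Python loop over cells. Unlike the strict inequalities in the question, a point lying exactly on a cell boundary is assigned to the cell above it rather than being left at 0.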

Reshape data into 'closest square'

I'm fairly new to Python. Currently, using matplotlib, I have a script that returns a variable number of subplots to make, which I pass to another script to do the plotting. I want to arrange these subplots into a nice arrangement, i.e., 'the closest thing to a square.' So that the answer is unique, let's say I prefer more columns than rows.
Examples: Let's say I have 6 plots to make; the grid I would need is 2x3. If I have 9, it's 3x3. If I have 12, it's 3x4. If I have 17, it's 4x5, but only two plots in the last row are created.
Attempt at a solution: I can easily find the closest square that's large enough:
from math import ceil, sqrt
num_plots = 6
square_size = ceil(sqrt(num_plots))**2
But this will leave empty plots. Is there a way to make the correct grid size?

This is what I have done in the past:
from matplotlib import pyplot
num_plots = 6
nr = int(num_plots**0.5)
nc = num_plots // nr  # integer division; subplots needs integer counts
if nr*nc < num_plots:
    nr += 1
fig, axs = pyplot.subplots(nr, nc, sharex=True, sharey=True)
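For the exact shapes listed in the question (2x3, 3x3, 3x4, 4x5), one possible sketch is to take the number of columns as the ceiling of the square root and then use just enough rows. The function name grid_shape is made up for illustration.

```python
from math import isqrt


def grid_shape(num_plots):
    """Return (rows, cols) for a near-square grid with cols >= rows."""
    if num_plots < 1:
        raise ValueError("num_plots must be at least 1")
    cols = isqrt(num_plots - 1) + 1  # ceil(sqrt(num_plots)) without floats
    rows = -(-num_plots // cols)     # ceil(num_plots / cols)
    return rows, cols


for n in (6, 9, 12, 17):
    print(n, grid_shape(n))
# 6 (2, 3)
# 9 (3, 3)
# 12 (3, 4)
# 17 (4, 5)
```

The result can be unpacked straight into the plotting call, e.g. pyplot.subplots(*grid_shape(n)); any unused axes in the last row still need to be hidden separately.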

If you have a prime number of plots like 5 or 7, there's no way to fill the grid completely unless you use a single row or a single column. If there are 9 or 15 plots, it should work.
The example below shows how to:
- Blank the extra empty plots
- Force the axes array to be 2D so you can index it generally even if there's only one plot or one row of plots
- Find the correct row and column for each plot as you loop through
Here it is:
from numpy import array, ceil, mod
from matplotlib import pyplot
nplots=13
#find number of columns, rows, and empty plots
nc=int(nplots**0.5)
nr=int(ceil(nplots/float(nc)))
empty=nr*nc-nplots
#make the plot grid
f,ax=pyplot.subplots(nr,nc,sharex=True)
#force ax to have two axes so we can index it properly
if nplots==1:
    ax=array([ax])
if nc==1:
    ax=ax.reshape(nr,1)
if nr==1:
    ax=ax.reshape(1,nc)
#hide the unused subplots
for i in range(empty):
    ax[-(1+i),-1].axis('off')
#loop through subplots and make output
for i in range(nplots):
    ic=i//nr #find which column we're on (integer division). If the definitions of ir and ic are switched, the indices for empty (above) should be switched, too.
    ir=mod(i,nr) #find which row we're on
    axx=ax[ir,ic] #get a pointer to the subplot we're working with
    axx.set_title(i)

How can I account for identical data points in a scatter plot?

I'm working with some data that has several identical data points. I would like to visualize the data in a scatter plot, but scatter plotting doesn't do a good job of showing the duplicates.
If I change the alpha value, then the identical data points become darker, which is nice, but not ideal.
Is there some way to map the color of a dot to how many times it occurs in the data set? What about size? How can I assign the size of the dot to how many times it occurs in the data set?

As it was pointed out, whether this makes sense depends a bit on your dataset. If you have reasonably discrete points and exact matches make sense, you can do something like this:
import numpy as np
import matplotlib.pyplot as plt
test_x=[2,3,4,1,2,4,2]
test_y=[1,2,1,3,1,1,1] # I am just generating some test x and y values. Use your data here
#Generate a list of unique points
points=list(set(zip(test_x,test_y)))
#Generate a list of point counts
count=[len([x for x,y in zip(test_x,test_y) if x==p[0] and y==p[1]]) for p in points]
#Now for the plotting:
plot_x=[i[0] for i in points]
plot_y=[i[1] for i in points]
count=np.array(count)
plt.scatter(plot_x,plot_y,c=count,s=100*count**0.5,cmap='Spectral_r')
plt.colorbar()
plt.show()
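Notice: You will need to adjust the marker scale (the value 100 in the s argument) according to your point density. The s argument of scatter is already a marker area, so scaling it by the square root of the count makes the area grow with the square root of the count, which keeps frequent points from becoming too large; use the count itself if you want the area to be proportional to the count.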
Also note: If you have very dense points, it might be more appropriate to use a different kind of plot. Histograms for example (I personally like hexbin for 2d data) are a decent alternative in these cases.
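As an alternative to the list comprehension for counting, np.unique can return the unique points and their counts in one call. This is a small sketch on the same test data; the variable names xy, unique_points and sizes are made up here.

```python
import numpy as np

test_x = [2, 3, 4, 1, 2, 4, 2]
test_y = [1, 2, 1, 3, 1, 1, 1]

# One row per point, so that identical (x, y) pairs become identical rows
xy = np.column_stack([test_x, test_y])

# axis=0 treats each row as one item; rows come back sorted
unique_points, counts = np.unique(xy, axis=0, return_counts=True)

# Same marker scaling as in the answer above
sizes = 100 * np.sqrt(counts)

print(unique_points.tolist())  # [[1, 3], [2, 1], [3, 2], [4, 1]]
print(counts.tolist())         # [1, 3, 1, 2]
```

The columns unique_points[:, 0] and unique_points[:, 1] can then be passed to plt.scatter together with c=counts and s=sizes.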

Visualization of scatter plots with overlapping points in matplotlib

I have to represent about 30,000 points in a scatter plot in matplotlib. These points belong to two different classes, so I want to depict them with different colors.
I succeeded in doing so, but there is an issue. The points overlap in many regions, and the class that I plot last is drawn on top of the other one, hiding it. Furthermore, with a scatter plot it is not possible to show how many points lie in each region.
I have also tried to make a 2d histogram with histogram2d and imshow, but it's difficult to show the points belonging to both classes in a clear way.
Can you suggest a way to make clear both the distribution of the classes and the concentration of the points?
EDIT: To be more clear, this is the link to my data file in the format "x,y,class".

One approach is to plot the data as a scatter plot with a low alpha, so you can see the individual points as well as a rough measure of density. (The downside to this is that the approach has a limited range of overlap it can show -- i.e., a maximum density of about 1/alpha.)
Here's an example:
As you can imagine, because of the limited range of overlaps that can be expressed, there's a tradeoff between visibility of the individual points and the expression of amount of overlap (and the size of the marker, plot, etc).
import numpy as np
import matplotlib.pyplot as plt
N = 10000
mean = [0, 0]
cov = [[2, 2], [0, 2]]
x,y = np.random.multivariate_normal(mean, cov, N).T
plt.scatter(x, y, s=70, alpha=0.03)
plt.ylim((-5, 5))
plt.xlim((-5, 5))
plt.show()
(I'm assuming here you meant 30e3 points, not 30e6. For 30e6, I think some type of averaged density plot would be necessary.)

You could also colour the points by first computing a kernel density estimate of the distribution of the scatter, and using the density values to specify a colour for each point of the scatter. To modify the code in the earlier example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde as kde
from matplotlib.colors import Normalize
from matplotlib import cm
N = 10000
mean = [0,0]
cov = [[2,2],[0,2]]
samples = np.random.multivariate_normal(mean,cov,N).T
densObj = kde( samples )
def makeColours( vals ):
    colours = np.zeros( (len(vals),3) )
    norm = Normalize( vmin=vals.min(), vmax=vals.max() )
    #Can put any colormap you like here.
    colours = [cm.ScalarMappable( norm=norm, cmap='jet').to_rgba( val ) for val in vals]
    return colours
colours = makeColours( densObj.evaluate( samples ) )
plt.scatter( samples[0], samples[1], color=colours )
plt.show()
I learnt this trick a while ago when I noticed the documentation of the scatter function --
c : color or sequence of color, optional, default : 'b'
c can be a single color format string, or a sequence of color specifications of length N, or a sequence of N numbers to be mapped to colors using the cmap and norm specified via kwargs (see below). Note that c should not be a single numeric RGB or RGBA sequence because that is indistinguishable from an array of values to be colormapped. c can be a 2-D array in which the rows are RGB or RGBA, however, including the case of a single row to specify the same color for all points.
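Since c accepts a sequence of N numbers, the density values can be passed directly instead of building RGBA tuples by hand. The sketch below swaps the kernel density estimate for a plain 2D histogram count so that it only needs NumPy; this is a coarser estimate, not the KDE used above, and the helper name binned_density is made up.

```python
import numpy as np


def binned_density(x, y, bins=50):
    """Return, for each point, the number of points in its 2D histogram bin."""
    hist, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    # np.digitize returns 1-based bin indices; clip so that points on the
    # outer edge fall into the last bin, as histogram2d does
    ix = np.clip(np.digitize(x, x_edges) - 1, 0, hist.shape[0] - 1)
    iy = np.clip(np.digitize(y, y_edges) - 1, 0, hist.shape[1] - 1)
    return hist[ix, iy]


rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 10_000))
density = binned_density(x, y)

# Sort so that the densest points come last and are drawn on top
order = np.argsort(density)
```

A call such as plt.scatter(x[order], y[order], c=density[order]) then colours each point by its local count through the default colormap.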

My answer may not perfectly answer your question, however, I too tried to plot overlapping points, but mine were perfectly overlapped. I therefore came up with this function in order to offset identical points.
import numpy as np
def dodge_points(points, component_index, offset):
    """Dodge every point by a multiplicative offset (multiplier is based on frequency of appearance)
    Args:
        points (array-like (2D)): Array containing the points
        component_index (int): Index / column on which the offset will be applied
        offset (float): Offset amount. Effective offset for each point is `index of appearance` * offset
    Returns:
        array-like (2D): Dodged points
    """
    # Extract uniques points so we can map an offset for each
    uniques, inv, counts = np.unique(
        points, return_inverse=True, return_counts=True, axis=0
    )
    for i, num_identical in enumerate(counts):
        # Prepare dodge values
        dodge_values = np.array([offset * i for i in range(num_identical)])
        # Find where the dodge values must be applied, in order
        points_loc = np.where(inv == i)[0]
        #Apply the dodge values
        points[points_loc, component_index] += dodge_values
    return points
Here is an example of before and after.
Before:
After:
This method only works for EXACTLY overlapping points (or if you are willing to round points off in a way that np.unique finds matching points).
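To make the "index of appearance" idea concrete, here is a small sketch that computes it per row and applies the offset to a float copy, leaving the input untouched. The helper name occurrence_index and the demo points are made up; this is a variation on the function above, not a drop-in replacement.

```python
import numpy as np


def occurrence_index(points):
    """For each row, return how many identical rows appeared before it."""
    _, inv = np.unique(points, axis=0, return_inverse=True)
    inv = inv.ravel()  # guard against a 2D inverse in some NumPy versions
    seen = {}
    out = np.empty(len(inv), dtype=int)
    for row, key in enumerate(inv):
        key = int(key)
        out[row] = seen.get(key, 0)   # 0 for the first copy, 1 for the second, ...
        seen[key] = out[row] + 1
    return out


# Made-up demo data: the point (1, 1) appears three times
pts = np.array([[1, 1], [1, 1], [2, 2], [1, 1]])
occ = occurrence_index(pts)

# Shift the x coordinate (column 0) by 0.1 per earlier duplicate
dodged = pts.astype(float)
dodged[:, 0] += 0.1 * occ
print(occ.tolist())  # [0, 1, 0, 2]
```

Working on a float copy avoids modifying the caller's array and avoids a casting error when the input has an integer dtype.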
