How to calculate the distance of each cluster in a scatter plot

I have 2 clusters plotted in a scatter plot and I need to find their standard deviations and the distance from the center of one cluster to the center of the other. I was not able to find any guide or documentation that simplifies finding the centers of 2 clusters in a scatter plot. The reason is that I need to compare the scatter of each cluster with the distance between the cluster centers. The code for my actual scatter plot looks like this:
import matplotlib.pyplot as plt
import numpy as np
vector1 = [
2.8238,
3.0284,
5.9333,
2.0156,
2.2467,
2.0092,
4.7983,
4.3554,
3.6372,
1.3159,
2.6174,
2.2336,
0.9625,
5.6285,
5.4040,
2.7887,
0,
3.4632,
0,
2.7370
]
vector5 = [
1.2994,
7.4469,
3.6503,
2.1667,
4.1975,
3.3006,
10.4082,
3.4112,
2.2395,
1.5653,
4.3237,
1.8679,
1.2622,
14.1372,
6.1686,
3.8903,
2.2873,
6.2559,
0.2132,
7.2303,
]
plt.rcParams['figure.figsize'] = (16.0, 10.0)
plt.style.use('ggplot')
data = [vector1, vector5]
plt.plot(vector1 , marker='.', linestyle='none', markersize=20, label='Vector 1')
plt.plot(vector5, marker='.', linestyle='none', markersize=20, label='Vector 5')
plt.xticks(range(1, 20, 1))
plt.yticks(range(1, 20, 1))
plt.ylabel('Sizes')
plt.xlabel('Index')
plt.legend()
plt.show()
For the sake of pre-visualization:

You can compute the mean by converting them to arrays
vector1 = np.array([...])
vector5 = np.array([...])
mean1 = np.mean(vector1)
mean5 = np.mean(vector5)
# Rest of the code
plt.plot((vector1+vector5)/2, marker='x', linestyle='none', markersize=12, label='Mean')
plt.axhline(mean1)
plt.axhline(mean5, c='b')
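The question also asks for the standard deviation of each cluster and the distance between the centers. A minimal sketch with NumPy follows, assuming both clusters share the same x positions (the index), so the centers differ only along the y axis. The name `separation_ratio` is a hypothetical summary introduced here for illustration, not something from the question.

```python
import numpy as np

vector1 = np.array([2.8238, 3.0284, 5.9333, 2.0156, 2.2467, 2.0092, 4.7983,
                    4.3554, 3.6372, 1.3159, 2.6174, 2.2336, 0.9625, 5.6285,
                    5.4040, 2.7887, 0, 3.4632, 0, 2.7370])
vector5 = np.array([1.2994, 7.4469, 3.6503, 2.1667, 4.1975, 3.3006, 10.4082,
                    3.4112, 2.2395, 1.5653, 4.3237, 1.8679, 1.2622, 14.1372,
                    6.1686, 3.8903, 2.2873, 6.2559, 0.2132, 7.2303])

# Center (mean) and spread (sample standard deviation) of each cluster
mean1, mean5 = vector1.mean(), vector5.mean()
std1, std5 = vector1.std(ddof=1), vector5.std(ddof=1)

# Both clusters use the same index on the x axis, so the distance
# between the centers reduces to the difference of the means.
center_distance = abs(mean5 - mean1)

# Hypothetical summary: center separation relative to the average spread
separation_ratio = center_distance / ((std1 + std5) / 2)

print(f"mean1={mean1:.3f} std1={std1:.3f}")
print(f"mean5={mean5:.3f} std5={std5:.3f}")
print(f"distance={center_distance:.3f} ratio={separation_ratio:.3f}")
```

If the clusters had different x positions, the centers would be 2D points and the distance could be computed with `np.hypot` on the coordinate differences instead.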

Related

How to fill intervals under KDE curve with different colors

I am looking for a way to color the intervals below the curve with different colors; on the interval x < 0, I would like to fill the area under the curve with one color and on the interval x >= 0 with another color, like the following image:
This is the code for basic kde plot:
fig, (ax1) = plt.subplots(1, 1, figsize = ((plot_size + 1.5) * 1,(plot_size + 1.5)))
sns.kdeplot(data=pd.DataFrame(w_contrast, columns=['contrast']), x="contrast", ax=ax1);
ax1.set_xlabel(f"Dry Yield Posterior Contrast (kg)");
Is there a way to fill the area under the curve with different colors using seaborn?

seaborn is a high-level API for matplotlib, so the values of the curve have to be calculated separately; similar to, but simpler than this answer.
Calculate the values for the kde curve with scipy.stats.gaussian_kde
Use matplotlib.pyplot.fill_between to fill the areas.
Use scipy.integrate.simpson to calculate the area under the curve, which will be passed to matplotlib.pyplot.annotate to annotate.
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gaussian_kde
from scipy.integrate import simpson  # named simps in older SciPy releases
import numpy as np
# load sample data
df = sns.load_dataset('planets')
# create the kde model
kde = gaussian_kde(df.mass.dropna())
# plot
fig, ax = plt.subplots(figsize=(9, 6))
g = sns.kdeplot(data=df.mass, ax=ax, color='k')
# remove margins; optional
g.margins(x=0, y=0)
# get the min and max of the x-axis
xmin, xmax = g.get_xlim()
# create points between the min and max
x = np.linspace(xmin, xmax, 1000)
# calculate the y values from the model
kde_y = kde(x)
# select x values below 0
x0 = x[x < 0]
# get the len, which will be used for slicing the other arrays
x0_len = len(x0)
# slice the arrays
y0 = kde_y[:x0_len]
x1 = x[x0_len:]
y1 = kde_y[x0_len:]
# calculate the area under the curves, as a percentage
area0 = np.round(simpson(y0, x=x0) * 100, 0)
area1 = np.round(simpson(y1, x=x1) * 100, 0)
# fill the areas
g.fill_between(x=x0, y1=y0, color='r', alpha=.5)
g.fill_between(x=x1, y1=y1, color='b', alpha=.5)
# annotate
g.annotate(f'{area0:.0f}%', xy=(-1, 0.075), xytext=(10, 0.150), arrowprops=dict(arrowstyle="->", color='r', alpha=.5))
g.annotate(f'{area1:.0f}%', xy=(1, 0.05), xytext=(10, 0.125), arrowprops=dict(arrowstyle="->", color='b', alpha=.5))
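As a cross-check on the numerically integrated areas, `gaussian_kde` can also integrate itself between two bounds with `integrate_box_1d`. The sketch below uses synthetic data as a stand-in (an assumption, not the planets dataset) and splits the density at 0:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in data, centered slightly above 0
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.5, scale=1.0, size=500)

kde = gaussian_kde(sample)

# Probability mass of the KDE on each side of 0
area_below = kde.integrate_box_1d(-np.inf, 0)
area_above = kde.integrate_box_1d(0, np.inf)

print(f"{area_below:.1%} below 0, {area_above:.1%} at or above 0")
```

Because the bounds here cover the whole real line, the two areas sum to 1, whereas integrating over the visible x-range only can leave a small remainder.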

How to mask data that appears in the ocean using cartopy and matplotlib

I'm not sure what I'm doing wrong, other than perhaps the order in which I plot the ocean. I want the ocean feature to mask the data over the ocean: ax.add_feature(cfeature.OCEAN) should be drawn on top of the temperature data I am plotting, so that I see ocean and no data. This is similar to what happens in the Great Lakes region, where you see the lakes and no temperature data.
proj_map = ccrs.Mercator(central_longitude=cLon)
proj_data = ccrs.PlateCarree()
fig = plt.figure(figsize=(30,20))
ax = fig.add_subplot(1,1,1, projection=proj_map)
ax.set_extent([-84,-66,37,47.5])
CT = ax.contourf(Tlat, Tlon, tempF, transform=temp.metpy.cartopy_crs, levels=clevs,
cmap=cmap)
ax.add_feature(cfeature.COASTLINE.with_scale('10m'), linewidth=0.5)
ax.add_feature(cfeature.OCEAN)
ax.add_feature(cfeature.LAKES)
ax.add_feature(cfeature.BORDERS, linewidth=0.5)
ax.add_feature(cfeature.STATES.with_scale('10m'), linewidth=0.5)
ax.add_feature(USCOUNTIES.with_scale('20m'), linewidth=0.25)
cbar = fig.colorbar(CT, orientation='horizontal', shrink=0.5, pad=0.05)
cbar.ax.tick_params(labelsize=14)
cbar.set_ticks([-50, -40, -30, -20, -10, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
110, 120])
cbar.ax.set_xlabel(r"Temp ($^\circ$F)",fontsize=20)
Here is what the image looks like

You need to use the zorder option to control the drawing order on the map. Features with larger zorder values are plotted on top of those with lower values. In your case, the zorder of OCEAN needs to be larger than that of the filled contour.
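The same rule can be seen without cartopy. The following is a minimal matplotlib-only sketch with made-up data, where a rectangle (standing in for the ocean) is given a larger zorder than a filled contour and therefore hides it:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

fig, ax = plt.subplots()

# Made-up field standing in for the temperature data
xx, yy = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
ax.contourf(xx, yy, np.sin(xx) + np.cos(yy), zorder=1)

# Stand-in for the ocean: a patch covering the right half of the axes.
# zorder=2 > 1, so it is drawn on top of the filled contour.
mask = Rectangle((5, 0), 5, 10, facecolor="lightblue", zorder=2)
ax.add_patch(mask)
```

Swapping the two zorder values would draw the contour over the rectangle instead, regardless of the order of the calls.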
Here is a runnable demo code and its sample plot. Read comments in the code for explanation.
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import numpy as np
fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(projection=ccrs.PlateCarree()))
extent = [-84, -66, 37, 47.5]
# generate (x, y), centered at the middle of the `extent`
mean = [(extent[0]+extent[1])/2, (extent[2]+extent[3])/2] #mean
cov = [[7, 3.5], [3.5, 6]] #co-variance matrix
x, y = np.random.multivariate_normal(mean, cov, 4000).T
# make a 2D histogram
# set the edges of the bins in x and y directions
bin_size = 40
lonrange = np.linspace(extent[0], extent[1], bin_size)
latrange = np.linspace(extent[2], extent[3], bin_size)
# half the cell sizes of the bins:
dx = (lonrange[1]- lonrange[0])/2
dy = (latrange[1]- latrange[0])/2
# compute array of center points of the bins' grid
# the dimensions of mesh-grid < the edges by 1
lonrange2 = np.linspace(extent[0]+dx, extent[1]-dx, bin_size-1)
latrange2 = np.linspace(extent[2]+dy, extent[3]-dy, bin_size-1)
x2d, y2d = np.meshgrid(lonrange2, latrange2)
# create 2d-histogram
# zorder is set = 10
h = ax.hist2d(x, y, bins=[lonrange, latrange], zorder=10, alpha=0.75)
#h: (counts, xedges, yedges, image)
ax.add_feature(cfeature.OCEAN, zorder=12) #zorder > 10
ax.add_feature(cfeature.BORDERS, linewidth=0.5)
ax.gridlines(draw_labels=True, xlocs=list(range(-85, -60, 5)), ylocs=list(range(35, 50, 5)),
linewidth=1.8, color='gray', linestyle='--', alpha=0.8, zorder=20)
# plot colorbar, using image from hist2d's result
plt.colorbar(h[3], ax=ax, shrink=0.45)
# finally, show the plot.
plt.show()
The output plot:
If zorder option is not specified:
ax.add_feature(cfeature.OCEAN)
the plot will be:

Python matplotlib polar coordinate is not plotting as it is supposed to be

I am plotting from a CSV file that contains Cartesian coordinates and I want to change it to Polar coordinates, then plot using the Polar coordinates.
Here is the code
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv('test_for_plotting.csv',index_col = 0)
x_temp = df['x'].values
y_temp = df['y'].values
df['radius'] = np.sqrt( np.power(x_temp,2) + np.power(y_temp,2) )
df['theta'] = np.arctan2(y_temp,x_temp)
df['degrees'] = np.degrees(df['theta'].values)
df['radians'] = np.radians(df['degrees'].values)
ax = plt.axes(polar = True)
ax.set_aspect('equal')
ax.axis("off")
sns.set(rc={'axes.facecolor':'white', 'figure.facecolor':'white','figure.figsize':(10,10)})
# sns.scatterplot(data = df, x = 'x',y = 'y', s= 1,alpha = 0.1, color = 'black',ax = ax)
sns.scatterplot(data = df, x = 'radians',y = 'radius', s= 1,alpha = 0.1, color = 'black',ax = ax)
plt.tight_layout()
plt.show()
Here is the dataset
If you run this code with polar = False and plot with sns.scatterplot(data = df, x = 'x',y = 'y', s= 1,alpha = 0.1, color = 'black',ax = ax), it results in this picture:
After setting polar = True and plotting with sns.scatterplot(data = df, x = 'radians',y = 'radius', s= 1,alpha = 0.1, color = 'black',ax = ax), it is supposed to give this:
But it does not work: when I run the actual code, the shape in the polar plot is the same as in the Cartesian one, which does not make sense, and it does not match the polar picture I showed. (If you are wondering where the second picture came from, I plotted it using R.)
I would appreciate your help and insights and thanks in advance!

For a polar plot, the "x-axis" represents the angle in radians. So, you need to switch x and y, and convert the angles to radians (I also added ax=ax, as the axes was created explicitly):
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
data = {'radius': [0, 0.5, 1, 1.5, 2, 2.5], 'degrees': [0, 25, 75, 155, 245, 335]}
df_temp = pd.DataFrame(data)
ax = plt.axes(polar=True)
sns.scatterplot(x=np.radians(df_temp['degrees']), y=df_temp['radius'].to_numpy(),
s=100, alpha=1, color='black', ax=ax)
for deg, y in zip(df_temp['degrees'], df_temp['radius']):
    x = np.radians(deg)
    ax.axvline(x, color='skyblue', ls=':')
    ax.text(x, y, f' {deg}', color='crimson')
ax.set_rlabel_position(-15) # Move radial labels away from plotted dots
plt.tight_layout()
plt.show()
About your new question: if you have an xy plot, and you convert these xy values to polar coordinates, and then plot these on a polar plot, you'll get again the same plot.
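The reason is that a polar axes maps (theta, r) back to the position (r cos theta, r sin theta), which undoes the conversion. A small sketch with random stand-in points (not the CSV from the question) shows the round trip:

```python
import numpy as np

# Random stand-in points; the question's CSV is not used here
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = rng.uniform(-1, 1, 100)

# Cartesian -> polar, as in the question
radius = np.hypot(x, y)
theta = np.arctan2(y, x)

# Where a polar axes would place each (theta, radius) pair
x_back = radius * np.cos(theta)
y_back = radius * np.sin(theta)

print(np.allclose(x, x_back), np.allclose(y, y_back))
```

So the polar plot of the converted values shows the same shape as the original xy plot, up to the orientation settings of the polar axes.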
After some more testing with the data, I decided to create the plot directly with matplotlib, as seaborn makes some changes that don't have exactly equal effects across seaborn and matplotlib versions.
What seems to be happening in R:
The angles (given by "x") are spread out to fill the range (0, 2 pi). This requires either rescaling x or changing how the x-values are mapped to angles. One way to do this is to subtract the minimum, then divide the result by the new maximum and multiply by 2 pi.
The 0 of the angles is at the top, and the angles go clockwise.
The following code should create the plot with Python. You might want to experiment with alpha and with s in the scatter plot options. (By default the scatter dots get an outline, which often isn't desired when working with very small dots, and can be removed by lw=0.)
ax = plt.axes(polar=True)
ax.set_aspect('equal')
ax.axis('off')
x_temp = df['x'].to_numpy()
y_temp = df['y'].to_numpy()
x_temp = x_temp - x_temp.min()  # new array, so df['x'] is not modified in place
x_temp = x_temp / x_temp.max() * 2 * np.pi
ax.scatter(x=x_temp, y=y_temp, s=0.05, alpha=1, color='black', lw=0)
ax.set_rlim(y_temp.min(), y_temp.max())
ax.set_theta_zero_location("N") # set zero at the north (top)
ax.set_theta_direction(-1) # go clockwise
plt.show()
At the left the resulting image, at the right using the y-values for coloring (ax.scatter(..., c=y_temp, s=0.05, alpha=1, cmap='plasma_r', lw=0)):

Specify values on x axis for a matplotlib.pyplot histogram

Given a certain dataset, I would like to create three histograms in one plot. The data (just a small snippet of a huge dataset, which would break the mold) looks like this:
x, y1, y2, y3
2.0466115, 0, 0, 0
2.349824, 0, 0, 0
2.697959, 0, 0, 0
3.097671, 0.195374, 0.191008, 0.167979
3.5566025, 0.522926, 0.511492, 0.426324
4.083526, 0.691916, 0.6774083,0.5790586666666666
4.688515, 0.8181206,0.801901, 0.6795873333333334
5.3831355, 0.8489766,0.833376, 0.707486
6.1806665, 0.809022, 0.795524, 0.6750806666666667
All my x values are the same; y1, y2 and y3 represent the three different y values. I'm creating a separate list for each column and passing each one as an argument to pyplot.hist. You can see my code here:
import numpy as np
from matplotlib import pyplot
from excel_to_csv import coordinates
y1 = coordinates(1) #another method, which creates the list out of the column
y2 = coordinates(2)
y3 = coordinates(3)
bins = np.linspace(0, 10, 150)
pyplot.hist(y1, bins, alpha=0.5, label='y1')
pyplot.hist(y2, bins, alpha=0.5, label='y2')
pyplot.hist(y3, bins, alpha=0.5, label='y3')
pyplot.legend(loc='upper right')
pyplot.show()
This code results in the following plot (regarding the actual dataset):
As far as I have researched, you create bins for the range of the x axis. Instead of doing so, I would like to put my x values there.
My goal is a plot that looks like this, but as a histogram (once again, for the huge dataset):

You can use np.histogram and then plot the values of the histogram:
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data
y1 = np.random.normal(3,1,10000)
y2 = np.random.normal(5,1,10000)
y3 = np.random.normal(7,1,10000)
bins = np.linspace(0, 10, 150)
x = np.linspace(0,10000,149)
# Plot regular histograms
plt.figure()
plt.hist(y1, bins, alpha=0.5, label='y1')
plt.hist(y2, bins, alpha=0.5, label='y2')
plt.hist(y3, bins, alpha=0.5, label='y3')
plt.ylabel('Frequency')
plt.xlabel('Bins')
plt.legend(loc='upper right')
plt.show()
# Compute histogram data
h1 = np.histogram(y1, bins)
h2 = np.histogram(y2, bins)
h3 = np.histogram(y3, bins)
# Compute bin centers (left edge plus half the bin width)
bin_avg = bins[:-1] + (bins[1] - bins[0]) / 2
# Plot histogram data as a line with markers
plt.figure()
plt.plot(bin_avg, h1[0], alpha=0.5, label='y1', marker='o')
plt.plot(bin_avg, h2[0], alpha=0.5, label='y2', marker='o')
plt.plot(bin_avg, h3[0], alpha=0.5, label='y3', marker='o')
plt.ylabel('Frequency')
plt.xlabel('Bins')
plt.legend(loc='upper right')
plt.show()
It wouldn't make sense to plot the binned data versus x because once the data has been transformed by the histogram the relationship it had with x is no longer the same.
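For the line-plot variant, the counts returned by np.histogram have one entry fewer than the bin edges, so they are usually plotted against the bin centers. A minimal sketch with synthetic data and coarse bins (both assumptions for illustration):

```python
import numpy as np

# Synthetic data and coarse bins, for illustration only
rng = np.random.default_rng(2)
values = rng.normal(5, 1, 1000)
edges = np.linspace(0, 10, 11)   # 11 edges -> 10 bins

counts, edges = np.histogram(values, bins=edges)

# Midpoint of each bin; works for unequal bin widths too
centers = (edges[:-1] + edges[1:]) / 2

print(len(edges), len(counts), len(centers))
```

Averaging neighbouring edges avoids assuming a constant bin width, which matters if the bins are ever made non-uniform.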

Trying to plot a system of linear equation using matplotlib in a 2D plane

As the title says, I am trying to plot a system of linear equations to get the intersection point of the 2 equations.
8a-b = 9
4a+9b = 7.
Below is the code I have tried.
import matplotlib.pyplot as plt
from numpy.linalg import inv
import numpy as np
a = np.array([[8,-1],[4,9]])
b = np.array([9,7])
c = np.linalg.solve(a,b)
plt.figure()
# Set x-axis range
plt.xlim((-10,10))
# Set y-axis range
plt.ylim((-10,10))
# Draw lines to split quadrants
plt.plot([-10,-10],[10,10], linewidth=4, color='blue' )
#draw the equations
plt.plot(a[0][0],a[0][1], linewidth=2, color='red' )
plt.plot(a[1][0],a[1][1], linewidth=2, color='red' )
plt.plot(c[0],c[1], marker='x', color="black")
plt.title('Quadrant plot')
plt.show()
I get only the intersection point, but not the lines on the 2D plane as shown in the below graph.
I want something like this.

To plot the lines, it's easiest to rearrange your equations in terms of b. This way 8a-b=9 becomes b=8a-9 and 4a+9b=7 becomes b=(7-4a)/9.
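As a quick sanity check on the rearrangement, both rearranged forms should pass through the point returned by np.linalg.solve. The sketch below uses the illustrative names `coeffs`, `rhs`, `a_sol` and `b_sol`, which are not from the question:

```python
import numpy as np

# System: 8a - b = 9 and 4a + 9b = 7
coeffs = np.array([[8, -1], [4, 9]])
rhs = np.array([9, 7])
a_sol, b_sol = np.linalg.solve(coeffs, rhs)

# Evaluate each rearranged line at the solved value of a
line1_b = 8 * a_sol - 9          # from 8a - b = 9
line2_b = (7 - 4 * a_sol) / 9    # from 4a + 9b = 7

print(a_sol, b_sol)
print(np.isclose(line1_b, b_sol), np.isclose(line2_b, b_sol))
```

Both checks agreeing with `b_sol` confirms that the two plotted lines cross at the solved point.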
It also looks like you were trying to draw the axes of the graph; I've fixed this in the code below too.
The following should do the trick:
import matplotlib.pyplot as plt
import numpy as np
a = np.array([[8,-1],[4,9]])
b = np.array([9,7])
c = np.linalg.solve(a,b)
plt.figure()
# Set x-axis range
plt.xlim((-10,10))
# Set y-axis range
plt.ylim((-10,10))
# Draw lines to split quadrants
plt.plot([-10, 10], [0, 0], color='C0')
plt.plot([0, 0], [-10, 10], color='C0')
# Draw line 8a-b=9 => b=8a-9
x = np.linspace(-10, 10)
y = 8 * x - 9
plt.plot(x, y, color='C2')
# Draw line 4a+9b=7 => b=(7-4a)/9
y = (7 - 4*x) / 9
plt.plot(x, y, color='C2')
# Add solution
plt.scatter(c[0], c[1], marker='x', color='black')
# Annotate solution
plt.annotate('({:0.3f}, {:0.3f})'.format(c[0], c[1]), c+0.5)
plt.title('Quadrant plot')
plt.show()
This gave me the following plot:

x1 = np.arange(-10, 10, 0.01) # between -10 and 10, 0.01 stepsize
y1 = 8*x1-9
x2 = np.arange(-10, 10, 0.01) # between -10 and 10, 0.01 stepsize
y2 = (7-4*x2)/9
These are the equations of your lines.
Now plot them using plt.plot(x1,y1) etc.
plt.figure()
# Set x-axis range
plt.xlim((-10,10))
# Set y-axis range
plt.ylim((-10,10))
# Draw lines to split quadrants
plt.plot([-10, 10], [0, 0], linewidth=4, color='blue')
plt.plot([0, 0], [-10, 10], linewidth=4, color='blue')
# draw the equations
plt.plot(x1, y1, linewidth=2, color='red')
plt.plot(x2, y2, linewidth=2, color='red')
# mark the intersection (c is the solution from np.linalg.solve in the question)
plt.plot(c[0],c[1], marker='x', color="black")
plt.title('Quadrant plot')
plt.show()
