I can't figure out how to get my curve to display smoothly instead of having so many sharp turns.
I am hoping to show a Boltzmann probability distribution with a nice smooth curve.
I expect it is a simple fix, but I can't see it. Can someone please help?
My code is below:
from matplotlib import pyplot as plt
import numpy as np
import scipy.stats
dE = 1
N = 500
n = 10000
# This creates an array filled with all twos
def Create_Array(N):
    Particle_State_List_set = np.ones(N, dtype = int)
    Particle_State_List_twos = Particle_State_List_set + 1
    return Particle_State_List_twos
Array = Create_Array(N)
def Select_Random_index(N):
    Seed = np.random.default_rng()
    Partcle_Index = Seed.integers(low=0, high= N - 1)
    return Partcle_Index
def Exchange(N):
    Particle_Index_A = Select_Random_index(N) # Selects a particle to be used as particle "a"
    Particle_Index_B = Select_Random_index(N) # Selects a particle to be used as particle "b"
    # Checks to see if the energy on particle "a" is zero; if so, it selects another until it isn't.
    while Array[Particle_Index_A] == 1:
        Particle_Index_A = Select_Random_index(N)
    # This loop makes sure that particles "a" and "b" aren't the same particle; it chooses again until they are different.
    while Particle_Index_B == Particle_Index_A:
        Particle_Index_B = Select_Random_index(N)
    # This assigns variables to the chosen particles' energy values
    a = Array[Particle_Index_A]
    b = Array[Particle_Index_B]
    # This updates the values of the energy levels of the interacting particles
    Array[Particle_Index_A] = a - dE
    Array[Particle_Index_B] = b + dE
    return (Array[Particle_Index_A], Array[Particle_Index_B])
for i in range(n):
    Exchange(N)
# This part is making the histogram the curve will be made from
_, bins, _ = plt.hist(Array, 12, density=1, alpha=0.15, color="g")
# This is using scipy to find the mean and standard deviation in order to plot the curve
mean, std = scipy.stats.norm.fit(Array)
# This part is drawing the best fit line, using the established bins value and the std and mean from before
best_fit = scipy.stats.norm.pdf(bins, mean, std)
# Plotting the best fit curve
plt.plot(bins, best_fit, color="r", linewidth=2.5)
# These are instructions on how Python will show the graph
plt.title("Boltzmann Probability Curve")
plt.xlabel("Energy Value")
plt.ylabel('Percentage at this Energy Value')
plt.tick_params(top=True, right=True)
plt.tick_params(direction='in', length=6, width=1, colors='0')
What's happening is that in these lines:
best_fit = scipy.stats.norm.pdf(bins, mean, std)
plt.plot(bins, best_fit, color="r", linewidth=2.5)
'bins', the array of histogram bin edges, is being used as the x coordinates of the data points forming the best fit line. The resulting plot is jagged because the bin edges are so widely spaced. Instead, you can define a more tightly packed set of x coordinates and use that:
bfX = np.arange(bins[0],bins[-1],.05)
best_fit = scipy.stats.norm.pdf(bfX, mean, std)
plt.plot(bfX, best_fit, color="r", linewidth=2.5)
For me that gives a nice smooth curve, but you can always use a spacing smaller than .05 if it's not to your liking yet.
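As a minimal, self-contained sketch of the same idea, the snippet below compares the fitted PDF evaluated at the bin edges with the same PDF on a dense grid. The sample data are made up, and np.linspace is used here in place of np.arange (an assumption of this sketch) so that the last bin edge is included.

```python
import numpy as np
import scipy.stats

# Hypothetical sample standing in for the energy array in the question.
rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.5, size=500)

# 12 bins give 13 edges, matching plt.hist(Array, 12, ...) in the question.
_, bins = np.histogram(sample, bins=12, density=True)
mean, std = scipy.stats.norm.fit(sample)

# Coarse curve: only 13 points, so the straight segments between them show.
coarse = scipy.stats.norm.pdf(bins, mean, std)

# Dense curve: 400 evenly spaced x values over the same range.
dense_x = np.linspace(bins[0], bins[-1], 400)
dense = scipy.stats.norm.pdf(dense_x, mean, std)

# Plot with plt.plot(dense_x, dense) instead of plt.plot(bins, coarse).
```

With 400 points the segments are too short to see individually, so the line looks smooth.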
I have a large amount of hue values expressed in degrees (0 to 360) that I wish to plot on a circle.
Here is some 'test' data. My real data is similar to this.
# create values with a normal distributions.
mu = 0.5
sigma = 0.02
values = np.random.normal(mu,sigma,10000)
values = values*360
Now I create a simple circle.
# create a circle
circle = np.linspace(0,2*np.pi,1000)
x = np.sin(circle)
y = np.cos(circle)
Next, I wish to plot my data onto this circle.
# plot values on circle
x = []
y = []
for i in values:
Hmmm. Okay, so the values are plotted onto the circle. But now it looks as if the points are more-or-less equally likely within the spread. I would like to show the data in such a way, that you can see the distribution of the data too. Something like a normal bell curve. That is, I would like something like this (don't mind the bad paint skills)
In this image, the further away from the black circle, the more often we find these data points. Basically, a circular normal bell curve.
I tried to multiply each data point by a value that increases as the likelihood of that value increases, i.e. the more likely the data point, the further away it is from the black circle (just like a bell curve, but on a circle). However, it is giving me these weird results.
uniqueX = set(x)
uniqueY = set(y)
countx = max([x.count(i) for i in set(x)])
county = min([y.count(i) for i in set(y)])
ofset = [((1/countx*x.count(i))+1) for i in x]
x = [x*ofset[ii] for ii,x in enumerate(x)]
y = [x*ofset[ii] for ii,x in enumerate(y)]
This output is not what I had intended. I'm not sure where I am going wrong (geometry and math have never been my strong suit). How can I make my desired plot?
My data looks like this:
Perhaps you would rather show a kernel density estimate of your distribution?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
mu = 0.5
sigma = 0.1
values = np.random.normal(mu,sigma,10000)
values = values
phi = np.linspace(-np.pi,np.pi,1000)
xc = np.sin(phi)
yc = np.cos(phi)
kde = gaussian_kde(values)
r = kde(phi)
# scale the kde by 1/10 to make it fit to the screen
x = (1+r/10.)*np.cos(phi)
y = (1+r/10.)*np.sin(phi)
plt.plot(x,y,color='red', zorder=0)
Possibly you also want to show this on a polar plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
mu = 0.5
sigma = 0.1
values = np.random.normal(mu,sigma,10000)
values = values
phi = np.linspace(-np.pi,np.pi,1000)
r0 = np.ones_like(phi)
fig, ax = plt.subplots(subplot_kw=dict(projection="polar"))
kde = gaussian_kde(values)
r = kde(phi)
# scale the kde by 1/10 to make it fit to the screen
ax.plot(phi,(1+r/10),color='red', zorder=0)
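Note that the snippets above treat the values as angles in radians. If the hue values are in degrees, as in the question, they need converting first. A rough sketch with made-up data, using a plain histogram in place of the KDE (the bin count and the 0.5 scale factor are arbitrary choices of this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical hue data in degrees, clustered around 180.
hues_deg = rng.normal(180.0, 7.2, 10000) % 360.0

# Convert to radians so the values can be used directly as angles.
theta = np.deg2rad(hues_deg)

# Histogram over the full circle; density=True normalises the area to 1.
n_bins = 90
density, edges = np.histogram(theta, bins=n_bins,
                              range=(0.0, 2 * np.pi), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Radius 1 is the reference circle; more frequent angles are pushed outwards.
radius = 1.0 + 0.5 * density / density.max()
x = radius * np.cos(centers)
y = radius * np.sin(centers)
# On polar axes this would be ax.plot(centers, radius).
```

A histogram does not smooth across the 0/360 boundary any better than the KDE does, so data clustered near 0 degrees would need extra care in either approach.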
Currently, I am trying to fill the area under a histogram with the fill_between function in Python, below the 10th percentile and above the 90th percentile of the original data.
However, the problem is that the histogram curve is not a "function" but a series of discrete values spaced by the bin size, so I can't fill exactly up to the 10th or 90th percentile. I have made several attempts, but they all failed.
The code below is what I tried:
S1 = [0.34804491 0.18036933 0.41111951 0.31947523 .........
0.46212255 0.39229157 0.28937502 0.22095423 0.52415083]
N, bins = np.histogram(S1, bins=np.linspace(0.1,0.7,20), normed=False)
bincenters = 0.5*(bins[1:]+bins[:-1])
ax.fill_between(bincenters,N,0,where=bincenters<=np.percentile(S1,10),interpolate=True,facecolor='r', alpha=0.5)
ax.fill_between(bincenters,N,0,where=bincenters>=np.percentile(S1,90),interpolate=True, facecolor='r', alpha=0.5,label = "Summer 10 P")
It seems to fill only up to the bin center just before or after the given percentile, not up to the percentile itself.
Any idea or help would be really appreciated.
Try changing your last two lines to:
ax.fill_between(bincenters, 0, N, interpolate=True,
where=((bincenters>=np.percentile(bincenters, 10)) &
(bincenters<=np.percentile(bincenters, 90))))
I believe you want to call np.percentile on bincenters since that is your effective x-axis.
The other difference is that you want to fill the region between the 10th and 90th percentiles, which necessitates the use of & in the where parameter.
Edit based on comment from OP:
I think to achieve what you want, you have to do some minimal interpolation of your own. See my example below using a random, normal distribution in which I'm using interp1d from scipy.interpolate to interpolate over bincenters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
# create normally distributed random data
n = 10000
data = np.random.normal(0, 1, n)
bins = np.linspace(-data.max(), data.max(), 20)
hist = np.histogram(data, bins=bins)[0]
bincenters = 0.5 * (bins[1:] + bins[:-1])
# create interpolation function and dense x-axis to interpolate over
f = interp1d(bincenters, hist, kind='cubic')
x = np.linspace(bincenters.min(), bincenters.max(), n)
plt.plot(bincenters, hist, '-o')
# calculate greatest bincenter < 10th percentile
bincenter_under10thPerc = bincenters[bincenters < np.percentile(bincenters, 10)].max()
bincenter_10thPerc = np.percentile(bincenters, 10)
bincenter_90thPerc = np.percentile(bincenters, 90)
# calculate smallest bincenter > 90th percentile
bincenter_above90thPerc = bincenters[bincenters > np.percentile(bincenters, 90)].min()
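# fill between 10th percentile region using dense x-axis array, x
plt.fill_between(x, 0, f(x), interpolate=True,
                 where=((x>=bincenter_under10thPerc) &
                        (x<=bincenter_10thPerc)))  # upper bound inferred from the variables defined above
# fill between 90th percentile region using dense x-axis array, x
plt.fill_between(x, 0, f(x), interpolate=True,
                 where=((x>=bincenter_90thPerc) &
                        (x<=bincenter_above90thPerc)))  # upper bound inferred from the variables defined above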
The figure I get out is below. Note that I changed the percentiles from 10/90% to 30/70% so that they show up better in the plot. Again, I hope that this is what you're trying to do
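If the goal is to stop the fill exactly at the 10th and 90th percentiles of the data itself (rather than of the bin centers), one option is to insert those x values into the curve by linear interpolation. This is only a sketch with made-up data; the names p10, p90, x_ext and y_ext are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(0.4, 0.1, 5000)   # hypothetical stand-in for S1

counts, bins = np.histogram(data, bins=np.linspace(0.1, 0.7, 20))
bincenters = 0.5 * (bins[1:] + bins[:-1])

# Percentiles of the data, not of the bin centers.
p10, p90 = np.percentile(data, [10, 90])

# Add the two percentile positions to the x grid and interpolate the
# counts there, so that the filled regions can end exactly on them.
x_ext = np.sort(np.concatenate([bincenters, [p10, p90]]))
y_ext = np.interp(x_ext, bincenters, counts)

lower_tail = x_ext <= p10
upper_tail = x_ext >= p90
# ax.fill_between(x_ext, y_ext, 0, where=lower_tail, facecolor='r', alpha=0.5)
# ax.fill_between(x_ext, y_ext, 0, where=upper_tail, facecolor='r', alpha=0.5)
```

Because p10 and p90 are now points of x_ext, the where masks switch exactly at the percentiles instead of at the nearest bin center.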
I have a version of this that uses axvspan to make a Rectangle and then uses the hist as a clip_path:
def hist(sample, low=None, high=None):
# draw the histogram
options = dict(alpha=0.5, color='C0')
xs, ys, patches = plt.hist(sample,
# fill in the histogram, if desired
if low is not None:
x1 = low
if high is not None:
x2 = high
x2 = np.max(sample)
fill = plt.axvspan(x1, x2,
Would something like that work for you?
I've been playing with scikit-learn's GMM class. To start with, I've just created a distribution along the line x=y.
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)
#Create a distribution that's centred along y=x
plt.plot(xs, ys)
This produces the expected distribution:
Next I fit a GMM to it, and plot the results:
#Create the x,y mesh that will be used to make a 3D plot
x_y_grid = []
for x in xs:
    for y in ys:
        x_y_grid.append([x, y])
#Calculate a probability for each point in the x,y grid.
x_y_z_grid = []
for x,y in x_y_grid:
    z = line_model.score([[x,y]])
    x_y_z_grid.append([x, y, z])
x_y_z_grid = np.array(x_y_z_grid)
#Plot probabilities on the Z axis.
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot(x_y_z_grid[:,0], x_y_z_grid[:,1], 2.72**x_y_z_grid[:,2])
The resulting probability distribution has some weird tails along x=0 and x=1 and also extra probability in the corners (x=1, y=1 and x=0,y=0).
Using n_components=5 also shows this behaviour:
Is this something inherent with GMMs, or is there an issue with the implementation, or am I doing something wrong?
Edit: getting scores from the model seems to get rid of this behaviour -- should this be?
I'm training both the models on the same dataset (x=y from x=0 to x=1). Simply checking the probability via the score method of the gmm seems to eliminate this boundary effect. Why is this? I've attached the plots and code below.
# Creates a line of 'observations' between (x_small_start, x_small_end)
# and (y_small_start, y_small_end). This is the data both gmms are trained on.
x_small_start = 0
x_small_end = 1
y_small_start = 0
y_small_end = 1
# These are the range of values that will be plotted
x_big_start = -1
x_big_end = 2
y_big_start = -1
y_big_end = 2
shorter_eval_range_gmm = mixture.GMM(n_components = 5)
longer_eval_range_gmm = mixture.GMM(n_components = 5)
x_small = np.linspace(x_small_start, x_small_end, 100)
y_small = np.linspace(y_small_start, y_small_end, 100)
x_big = np.linspace(x_big_start, x_big_end, 100)
y_big = np.linspace(y_big_start, y_big_end, 100)
#Train both gmms on a distribution that's centered along y=x
#Create the x,y meshes that will be used to make a 3D plot
x_y_evals_grid_big = []
for x in x_big:
    for y in y_big:
        x_y_evals_grid_big.append([x, y])
x_y_evals_grid_small = []
for x in x_small:
    for y in y_small:
        x_y_evals_grid_small.append([x, y])
#Calculate a probability for each point in the x,y grid.
x_y_z_plot_grid_big = []
for x,y in x_y_evals_grid_big:
    z = longer_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_big.append([x, y, z])
x_y_z_plot_grid_big = np.array(x_y_z_plot_grid_big)
x_y_z_plot_grid_small = []
for x,y in x_y_evals_grid_small:
    z = shorter_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_small.append([x, y, z])
x_y_z_plot_grid_small = np.array(x_y_z_plot_grid_small)
#Plot probabilities on the Z axis.
fig = plt.figure()
fig.suptitle("Probability of different x,y pairs")
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot(x_y_z_plot_grid_big[:,0], x_y_z_plot_grid_big[:,1], np.exp(x_y_z_plot_grid_big[:,2]))
ax1.set_xlabel('X Label')
ax1.set_ylabel('Y Label')
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
ax2.plot(x_y_z_plot_grid_small[:,0], x_y_z_plot_grid_small[:,1], np.exp(x_y_z_plot_grid_small[:,2]))
ax2.set_xlabel('X Label')
ax2.set_ylabel('Y Label')
There is no problem with the fit, but with the visualisation you're using. A hint should be the straight line connecting (0,1,5) to (0,1,0), which is actually just a rendering of the connection of two points (which is due to the order in which the points are read). Although the two points at its extrema are in your data, no other point on this line actually is.
Personally, I think it is a rather bad idea to use 3d plots (wires) to represent a surface for the reason mentioned above, and I would recommend surface plots or contour plots instead.
Try this:
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.atleast_2d(np.linspace(0, 1, 100)).T
ys = np.atleast_2d(np.linspace(0, 1, 100)).T
#Create a distribution that's centred along y=x
line_model.fit(np.concatenate([xs, ys], axis=1))
plt.scatter(xs, ys)
#Create the x,y mesh that will be used to make a 3D plot
X, Y = np.meshgrid(xs, ys)
x_y_grid = np.c_[X.ravel(), Y.ravel()]
#Calculate a probability for each point in the x,y grid.
z = line_model.score(x_y_grid)
z = z.reshape(X.shape)
#Plot probabilities on the Z axis.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, z)
From an academic point of view, I am quite uncomfortable with the goal of fitting a 1D line in a 2D space with a 2D mixture model. Manifold learning with GMMs requires the variance in the normal direction to be zero, which reduces the model to a Dirac distribution. Numerically and analytically this is unstable and should be avoided (there seems to be some stabilising trick in the GMM fit, since the variance of the model is rather large in the direction normal to the straight line).
It is also recommended to use plt.scatter rather than plt.plot when drawing data, since there is no reason to connect the dots when you're fitting their joint distribution.
Hope this helps to shed some light on your problem.
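The grid-and-reshape pattern used above does not depend on the GMM. As a sketch, here it is with a hand-written 2D Gaussian density standing in for the fitted model (the function density is hypothetical and not part of scikit-learn):

```python
import numpy as np

def density(points, mean=(0.5, 0.5), sigma=0.2):
    """Isotropic 2D Gaussian density evaluated at an (n, 2) array of points."""
    diff = points - np.asarray(mean)
    r2 = np.sum(diff ** 2, axis=1)
    return np.exp(-0.5 * r2 / sigma ** 2) / (2 * np.pi * sigma ** 2)

xs = np.linspace(0, 1, 50)
ys = np.linspace(0, 1, 40)

# X and Y have shape (len(ys), len(xs)); each row holds one y value.
X, Y = np.meshgrid(xs, ys)

# Flatten to a list of (x, y) pairs, evaluate once, then restore the grid shape.
grid_points = np.c_[X.ravel(), Y.ravel()]
Z = density(grid_points).reshape(X.shape)

# Z can now be passed to ax.plot_surface(X, Y, Z) or plt.contourf(X, Y, Z).
# Both draw a surface from the grid instead of joining points in list order.
```

Keeping X, Y and Z as 2D arrays of the same shape is what lets the plotting function know which points are neighbours, which a flat list of points passed to ax.plot cannot express.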
This is not correct. Talking with Ronald P., you can't get Gibbs effects because the Gaussians cannot compensate each other by "going negative", as probability is strictly > 0. This seems to be a simple plotting issue... see his answer instead! Either way, I would recommend using 2D data to test GMMs, rather than a 1D line.
The GMM is fitting to the data you gave it - specifically:
xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)
Because the data ends at 0 and 1, the GMM is attempting to model that fact: -.01 and 1.01 are technically outside the trained data range and should be scored with very low probabilities. In doing so it ends up creating a gaussian with smaller spread (smaller covariance/higher precision) to cover the ends of the data and model the fact that the data stops.
I would expect that adding enough Gaussians would lead to a pseudo-Gibbs phenomenon, and you can kind of see that happening in the change from 5 to 99 components. To model the edges exactly, you would need an infinite mixture model. This is analogous to infinite frequency components: in a GMM you are also representing a "signal" with a set of basis functions (in this case, Gaussians).
I have successfully read in data from a catalog, and I have graphed what I need. However, I need one more thing. I would like to correspond the different "standard_deviation" values with the "number" via "half-light radius." In the graph shown, the "number" is not an axis on the graph, however, there will in this case be ten "number 9's" for example. I would like a way to match the points of these same-numbered points with some sort of line, as I showed in the image below (I just drew lines randomly to give you an idea of what I want).
In this example, assume that every point on one of the drawn lines has the same "number." Each "number" will have ten different "standard_deviation" values, 1 through 10, and ten different "half_light radius" values, which are the values I would like to match. I've pasted my read/plot code below. How would I do this?
newvid = asciitable.read('user4.cat')
n_new = newvid['n']
re_new = newvid['re']
number = newvid['number']
standard_deviation = newvid['standard_deviation']
plt.title('sersic parameter vs. standard deviation distribution of noise')
plt.xlabel('standard deviation')
plt.ylabel('sersic parameter')
plt.scatter(standard_deviation, n_new)
plt.title('half-light radius vs. standard deviation distribution of noise')
plt.xlabel('standard deviation')
plt.ylabel('half-light radius')
To do what I think you want, you'll have to use the plot function instead of scatter in order to connect the lines. Depending on how your data is arranged, you may have to split or sort your data, so that you can plot all points of each number at once, sorted by standard deviation.
Try this:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
newvid = asciitable.read('user4.cat')
n_new = newvid['n']
re_new = newvid['re']
number = newvid['number']
std_dev = newvid['standard_deviation']
n_max = float(number.max()) # for coloring later
plt.title('sersic parameter vs. standard deviation distribution of noise')
plt.xlabel('standard deviation')
plt.ylabel('sersic parameter')
for n in np.unique(number):
    n_mask = number == n                 # pick out just the rows where number is the current n
    order = np.argsort(std_dev[n_mask])  # sort by std_dev, so the line doesn't zigzag
    plt.plot(std_dev[n_mask][order], n_new[n_mask][order],
             label=str(n), color=cm.jet(n/n_max))  # label and color by n
plt.title('half-light radius vs. standard deviation distribution of noise')
plt.xlabel('standard deviation')
plt.ylabel('half-light radius')
# do one plot per number
for n in np.unique(number):
    n_mask = number == n                 # pick out just the rows where number is the current n
    order = np.argsort(std_dev[n_mask])  # sort by std_dev, so the line doesn't zigzag
    plt.plot(std_dev[n_mask][order], re_new[n_mask][order],
             label=str(n), color=cm.jet(n/n_max))  # label and color by n
With random data:
To do a colorbar instead of a legend:
m = cm.ScalarMappable(cmap=cm.jet)
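The mask-and-sort step can be checked without the catalog. A small sketch with made-up arrays (the values below are hypothetical and not taken from user4.cat):

```python
import numpy as np

# Hypothetical catalog columns: two objects (1 and 2), rows in mixed order.
number = np.array([2, 1, 2, 1, 1, 2])
std_dev = np.array([3.0, 2.0, 1.0, 1.0, 3.0, 2.0])
re_new = np.array([0.9, 0.5, 0.7, 0.4, 0.6, 0.8])

lines = {}
for n in np.unique(number):
    mask = number == n                  # rows belonging to object n
    order = np.argsort(std_dev[mask])   # sort those rows by std_dev
    lines[int(n)] = (std_dev[mask][order], re_new[mask][order])
    # plt.plot(*lines[int(n)], label=str(n)) would draw one line per object
```

Each entry of lines holds the x and y values for one object, already ordered from the lowest to the highest standard deviation.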
I have a complicated curve defined as a set of points in a table like so (the full table is here):
# x y
1.0577 12.0914
1.0501 11.9946
1.0465 11.9338
If I plot this table with the commands:
plt.plot(x_data, y_data, c='b',lw=1.)
plt.scatter(x_data, y_data, marker='o', color='k', s=10, lw=0.2)
I get the following:
where I've added the red points and segments manually. What I need is a way to calculate those segments for each of those points, that is: a way to find the minimum distance from a given point in this 2D space to the interpolated curve.
I can't use the distance to the data points themselves (the black dots that generate the blue curve) since they are not located at equal intervals: sometimes they are close together and sometimes far apart, and this strongly affects my results further down the line.
Since this is not a well behaved curve I'm not really sure what I could do. I've tried interpolating it with a UnivariateSpline but it returns a very poor fit:
# Sort data according to x.
temp_data = sorted(zip(x_data, y_data))
# Unpack sorted data.
x_sorted, y_sorted = zip(*temp_data)
# Generate univariate spline.
s = UnivariateSpline(x_sorted, y_sorted, k=5)
xspl = np.linspace(0.8, 1.1, 100)
yspl = s(xspl)
# Plot.
plt.scatter(xspl, yspl, marker='o', color='r', s=10, lw=0.2)
I also tried increasing the number of interpolating points but got a mess:
# Sort data according to x.
temp_data = sorted(zip(x_data, y_data))
# Unpack sorted data.
x_sorted, y_sorted = zip(*temp_data)
t = np.linspace(0, 1, len(x_sorted))
t2 = np.linspace(0, 1, 100)
# One-dimensional linear interpolation.
x2 = np.interp(t2, t, x_sorted)
y2 = np.interp(t2, t, y_sorted)
plt.scatter(x2, y2, marker='o', color='r', s=10, lw=0.2)
Any ideas/pointers will be greatly appreciated.
If you're open to using a library for this, have a look at shapely: https://github.com/Toblerity/Shapely
As a quick example (points.txt contains the data you linked to in your question):
import shapely.geometry as geom
import numpy as np
coords = np.loadtxt('points.txt')
line = geom.LineString(coords)
point = geom.Point(0.8, 10.5)
# Note that "line.distance(point)" would be identical
print(point.distance(line))
As an interactive example (this also draws the line segments you wanted):
import numpy as np
import shapely.geometry as geom
import matplotlib.pyplot as plt
class NearestPoint(object):
    def __init__(self, line, ax):
        self.line = line
        self.ax = ax
        ax.figure.canvas.mpl_connect('button_press_event', self)
    def __call__(self, event):
        x, y = event.xdata, event.ydata
        point = geom.Point(x, y)
        distance = self.line.distance(point)
        self.draw_segment(point)
        print('Distance to line:', distance)
    def draw_segment(self, point):
        # Nearest point on the line: project the click onto it, then interpolate
        point_on_line = self.line.interpolate(self.line.project(point))
        self.ax.plot([point.x, point_on_line.x], [point.y, point_on_line.y],
                     color='red', marker='o', scalex=False, scaley=False)
        self.ax.figure.canvas.draw()
if __name__ == '__main__':
    coords = np.loadtxt('points.txt')
    line = geom.LineString(coords)
    fig, ax = plt.subplots()
    ax.plot(*coords.T)
    ax.axis('equal')
    NearestPoint(line, ax)
    plt.show()
Note that I've added ax.axis('equal'). shapely operates in the coordinate system that the data is in. Without the equal axis plot, the view will be distorted, and while shapely will still find the nearest point, it won't look quite right in the display:
The curve is by nature parametric, i.e. for each x there isn't necessarily a unique y and vice versa. So you shouldn't interpolate a function of the form y(x) or x(y). Instead, you should do two interpolations, x(t) and y(t), where t is, say, the index of the corresponding point.
Then you use scipy.optimize.fminbound to find the optimal t such that (x(t) - x0)^2 + (y(t) - y0)^2 is the smallest, where (x0, y0) are the red dots in your first figure. For fminbound, you can set the min/max bounds for t to the first and last index, i.e. 0 and len(x_data) - 1.
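A minimal sketch of this approach, assuming linear interpolation via np.interp for x(t) and y(t) and a made-up straight curve so the result is easy to check (sq_dist is a hypothetical helper name):

```python
import numpy as np
from scipy.optimize import fminbound

# Hypothetical curve points; t is simply the index of each point.
x_data = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_data = np.array([0.0, 0.0, 0.0, 0.0, 0.0])
t_data = np.arange(len(x_data))

def sq_dist(t, x0, y0):
    """Squared distance from (x0, y0) to the interpolated curve at parameter t."""
    xt = np.interp(t, t_data, x_data)
    yt = np.interp(t, t_data, y_data)
    return (xt - x0) ** 2 + (yt - y0) ** 2

x0, y0 = 2.3, 1.5
t_best = fminbound(sq_dist, t_data[0], t_data[-1], args=(x0, y0))
min_dist = np.sqrt(sq_dist(t_best, x0, y0))
```

fminbound is a local, bounded scalar minimiser, so on a strongly folded curve it can settle on a local minimum. Evaluating sq_dist on a coarse grid of t first and then bounding the search around the best grid value is one way to reduce that risk.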
You could try implementing a calculation of distance from point to line on incremental pairs of points on the curve and finding that minimum. This will introduce a small bit of error from the curve as drawn, but it should be very small, as the points are relatively close together.
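A sketch of that idea using NumPy only: each consecutive pair of points is treated as a segment, the query point is projected onto every segment (clamped to the segment ends), and the smallest distance is kept. The function name distance_to_polyline and the example curve are made up for illustration.

```python
import numpy as np

def distance_to_polyline(point, curve):
    """Minimum distance from `point` to the polyline through the rows of `curve`."""
    p = np.asarray(point, dtype=float)
    pts = np.asarray(curve, dtype=float)
    a = pts[:-1]                      # segment start points
    b = pts[1:]                       # segment end points
    ab = b - a
    seg_len2 = np.sum(ab ** 2, axis=1)
    # Guard against zero-length segments (repeated points).
    safe_len2 = np.where(seg_len2 > 0, seg_len2, 1.0)
    # Position of the projection along each segment, clamped to [0, 1].
    s = np.clip(np.sum((p - a) * ab, axis=1) / safe_len2, 0.0, 1.0)
    closest = a + s[:, None] * ab
    dists = np.sqrt(np.sum((closest - p) ** 2, axis=1))
    k = np.argmin(dists)
    return dists[k], closest[k]

curve = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0]])
d, foot = distance_to_polyline((0.5, 1.0), curve)
```

The second return value is the nearest point on the curve, which is the other end of the red segment to draw. Distances are measured in data coordinates, so axes with very different scales may need rescaling first.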
You can easily use the package trjtrypy in PyPI: https://pypi.org/project/trjtrypy/
All needed computations and visualizations are available in this package. You can get your answer within a line of code like:
to get the minimum distance use: trjtrypy.basedists.distance(points, curve)
to visualize the curve and points use: trjtrypy.visualizations.draw_landmarks_trajectory(points, curve)