Gaussian mixture model (GMM) gives a bad fit - python

I've been playing with Scikit-learn's GMM function. To start with, I've just created a distribution along the line x=y.
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)
#Create a distribution that's centred along y=x
line_model.fit(zip(xs,ys))
plt.plot(xs, ys)
plt.show()
This produces the expected distribution:
Next I fit a GMM to it, and plot the results:
#Create the x,y mesh that will be used to make a 3D plot
x_y_grid = []
for x in xs:
    for y in ys:
        x_y_grid.append([x,y])
#Calculate a probability for each point in the x,y grid.
x_y_z_grid = []
for x,y in x_y_grid:
    z = line_model.score([[x,y]])
    x_y_z_grid.append([x,y,z])
x_y_z_grid = np.array(x_y_z_grid)
#Plot probabilities on the Z axis.
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot(x_y_z_grid[:,0], x_y_z_grid[:,1], 2.72**x_y_z_grid[:,2])
plt.show()
The resulting probability distribution has some weird tails along x=0 and x=1 and also extra probability in the corners (x=1, y=1 and x=0,y=0).
Using n_components=5 also shows this behaviour:
Is this something inherent with GMMs, or is there an issue with the implementation, or am I doing something wrong?
Edit: getting scores from the model seems to get rid of this behaviour -- should this be the case?
I'm training both models on the same dataset (x=y from x=0 to x=1). Simply checking the probability via the gmm's score method seems to eliminate this boundary effect. Why is this? I've attached the plots and code below.
# Creates a line of 'observations' between (x_small_start, x_small_end)
# and (y_small_start, y_small_end). This is the data both gmms are trained on.
x_small_start = 0
x_small_end = 1
y_small_start = 0
y_small_end = 1
# These are the range of values that will be plotted
x_big_start = -1
x_big_end = 2
y_big_start = -1
y_big_end = 2
shorter_eval_range_gmm = mixture.GMM(n_components = 5)
longer_eval_range_gmm = mixture.GMM(n_components = 5)
x_small = np.linspace(x_small_start, x_small_end, 100)
y_small = np.linspace(y_small_start, y_small_end, 100)
x_big = np.linspace(x_big_start, x_big_end, 100)
y_big = np.linspace(y_big_start, y_big_end, 100)
#Train both gmms on a distribution that's centered along y=x
shorter_eval_range_gmm.fit(zip(x_small,y_small))
longer_eval_range_gmm.fit(zip(x_small,y_small))
#Create the x,y meshes that will be used to make a 3D plot
x_y_evals_grid_big = []
for x in x_big:
    for y in y_big:
        x_y_evals_grid_big.append([x,y])
x_y_evals_grid_small = []
for x in x_small:
    for y in y_small:
        x_y_evals_grid_small.append([x,y])
#Calculate a probability for each point in the x,y grid.
x_y_z_plot_grid_big = []
for x,y in x_y_evals_grid_big:
    z = longer_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_big.append([x, y, z])
x_y_z_plot_grid_big = np.array(x_y_z_plot_grid_big)
x_y_z_plot_grid_small = []
for x,y in x_y_evals_grid_small:
    z = shorter_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_small.append([x, y, z])
x_y_z_plot_grid_small = np.array(x_y_z_plot_grid_small)
#Plot probabilities on the Z axis.
fig = plt.figure()
fig.suptitle("Probability of different x,y pairs")
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot(x_y_z_plot_grid_big[:,0], x_y_z_plot_grid_big[:,1], np.exp(x_y_z_plot_grid_big[:,2]))
ax1.set_xlabel('X Label')
ax1.set_ylabel('Y Label')
ax1.set_zlabel('Probability')
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
ax2.plot(x_y_z_plot_grid_small[:,0], x_y_z_plot_grid_small[:,1], np.exp(x_y_z_plot_grid_small[:,2]))
ax2.set_xlabel('X Label')
ax2.set_ylabel('Y Label')
ax2.set_zlabel('Probability')
plt.show()

There is no problem with the fit, but with the visualisation you're using. A hint is the straight line connecting (0,1,5) to (0,1,0): it is just a rendering of the connection between two points (an artefact of the order in which the points are read). Although the two points at its extrema are in your data, no other point on this line actually is.
Personally, I think it is a rather bad idea to use 3d plots (wires) to represent a surface for the reason mentioned above, and I would recommend surface plots or contour plots instead.
Try this:
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.atleast_2d(np.linspace(0, 1, 100)).T
ys = np.atleast_2d(np.linspace(0, 1, 100)).T
#Create a distribution that's centred along y=x
line_model.fit(np.concatenate([xs, ys], axis=1))
plt.scatter(xs, ys)
plt.show()
#Create the x,y mesh that will be used to make a 3D plot
X, Y = np.meshgrid(xs, ys)
x_y_grid = np.c_[X.ravel(), Y.ravel()]
#Calculate a probability for each point in the x,y grid.
z = line_model.score(x_y_grid)
z = z.reshape(X.shape)
#Plot probabilities on the Z axis.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, z)
plt.show()
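Since contour plots were mentioned above as another option, here is a minimal sketch of the same evaluation drawn as a filled contour (reusing X, Y and z from the block above; the number of levels is an arbitrary choice):
#Draw the evaluated log-probabilities as a filled contour instead of a surface
fig, ax = plt.subplots()
cs = ax.contourf(X, Y, np.exp(z), levels=20)
fig.colorbar(cs, ax=ax, label='probability density')
plt.show()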
From an academic point of view I am quite uncomfortable with the goal of fitting a 1D line in a 2D space by a 2D mixture model. Manifold learning with GMMs requires at least the normal direction to have zero variance, thus reducing to a Dirac distribution. Numerically and analytically this is unstable and should be avoided (there seems to be some stabilising trick in the GMM fit, since the variance of the model is rather large in the direction of the normal to the straight line).
It is also recommended to use plt.scatter rather than plt.plot when drawing data, since there is no reason to connect the dots when you're fitting their joint distribution.
Hope this helps to shed some light on your problem.

EDIT:
This is not correct. After talking with Ronald P., it's clear you can't get Gibbs effects here, because the Gaussians cannot compensate for each other by "going negative": probability is strictly > 0. This seems to be a simple plotting issue... see his answer instead! Either way, I would recommend using 2D data to test GMMs, rather than a 1D line (see the sketch at the end of this answer).
The GMM is fitting to the data you gave it - specifically:
xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)
Because the data ends at 0 and 1, the GMM is attempting to model that fact: -0.01 and 1.01 are technically outside the trained data range and should be scored with very low probabilities. In doing so it ends up creating Gaussians with smaller spread (smaller covariance / higher precision) to cover the ends of the data and model the fact that the data stops.
I would expect that adding enough Gaussians would lead to a pseudo-Gibbs phenomenon, and you can kind of see that happening in the change from 5 to 99 components. To exactly model the edges, you would need an infinite mixture model. This is analogous to infinite frequency components - in a GMM you are likewise representing a "signal" with a set of basis functions (in this case, Gaussians)!
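For what it's worth, and following the recommendation above to test with genuinely 2D data, here is a minimal sketch (it assumes a recent scikit-learn, where GMM has been replaced by GaussianMixture; the blob parameters are arbitrary):
import numpy as np
from sklearn.mixture import GaussianMixture
#Two 2D Gaussian blobs instead of a degenerate 1D line
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(200, 2))
blob_b = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(200, 2))
data = np.vstack([blob_a, blob_b])
gmm = GaussianMixture(n_components=2).fit(data)
print(gmm.means_)                   #component means, near (0, 0) and (1, 1)
print(gmm.score_samples(data[:5]))  #per-sample log-likelihoods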

Related

Histogram line of best fit is jagged and not smooth?

I can't quite seem to figure out how to get my curve to be displayed smoothly instead of having so many sharp turns.
I am hoping to show a Boltzmann probability distribution with a nice smooth curve.
I expect it is a simple fix, but I can't see it. Can someone please help?
My code is below:
from matplotlib import pyplot as plt
import numpy as np
import scipy.stats
dE = 1
N = 500
n = 10000
# This is creating an array filled with all twos
def Create_Array(N):
    Particle_State_List_set = np.ones(N, dtype = int)
    Particle_State_List_twos = Particle_State_List_set + 1
    return(Particle_State_List_twos)
Array = Create_Array(N)
def Select_Random_index(N):
    Seed = np.random.default_rng()
    Particle_Index = Seed.integers(low=0, high= N - 1)
    return(Particle_Index)
def Exchange(N):
    Particle_Index_A = Select_Random_index(N) # Selects a particle to be used as particle "a"
    Particle_Index_B = Select_Random_index(N) # Selects a particle to be used as particle "b"
    # Checks whether particle "a" is already at its minimum energy; if so, it selects another until it isn't.
    while Array[Particle_Index_A] == 1:
        Particle_Index_A = Select_Random_index(N)
    # This loop makes sure that particles "a" and "b" aren't the same particle; it chooses again until they are different.
    while Particle_Index_B == Particle_Index_A:
        Particle_Index_B = Select_Random_index(N)
    # This assigns variables to the chosen particles' energy values
    a = Array[Particle_Index_A]
    b = Array[Particle_Index_B]
    # This updates the values of the energy levels of the interacting particles
    Array[Particle_Index_A] = a - dE
    Array[Particle_Index_B] = b + dE
    return (Array[Particle_Index_A], Array[Particle_Index_B])
for i in range(n):
    Exchange(N)
# This part is making the histogram the curve will be made from
_, bins, _ = plt.hist(Array, 12, density=1, alpha=0.15, color="g")
# This is using scipy to find the mean and standard deviation in order to plot the curve
mean, std = scipy.stats.norm.fit(Array)
# This part is drawing the best fit line, using the established bins value and the std and mean from before
best_fit = scipy.stats.norm.pdf(bins, mean, std)
# Plotting the best fit curve
plt.plot(bins, best_fit, color="r", linewidth=2.5)
#These are instructions on how python will show the graph
plt.title("Boltzmann Probability Curve")
plt.xlabel("Energy Value")
plt.ylabel('Percentage at this Energy Value')
plt.tick_params(top=True, right=True)
plt.tick_params(direction='in', length=6, width=1, colors='0')
plt.grid()
plt.show()
What's happening is that in these lines:
best_fit = scipy.stats.norm.pdf(bins, mean, std)
plt.plot(bins, best_fit, color="r", linewidth=2.5)
'bins', the histogram bin edges, is being used as the x coordinates of the data points forming the best-fit line. The resulting plot is jagged because those edges are so widely spaced. Instead you can define a more tightly packed set of x coordinates and use that:
bfX = np.arange(bins[0],bins[-1],.05)
best_fit = scipy.stats.norm.pdf(bfX, mean, std)
plt.plot(bfX, best_fit, color="r", linewidth=2.5)
For me that gives a nice smooth curve, but you can always use a tighter spacing than 0.05 if it's not to your liking yet.

How to plot the graph of a non-linear function

I want to plot graph of this function:
y = 2[1 - e^(-x+1)]^2 - 2
When I plotted a linear function, I used this code:
import matplotlib.pyplot as plt
import numpy as np
x = np.array(...)
y = np.array(...)
z = np.polyfit(x, y, 2)
p = np.poly1d(z)
xp = np.linspace(...)
_ = plt.plot(x, y, '.', xp, p(xp), '-')
plt.ylim(0, 200)
plt.show()
When the function is non-linear, this does not work, because it is hard to find each x,y value.
How can I plot a non-linear function?
I hate to be the one to break this news to you, but polynomials of order greater than one are technically nonlinear too.
When you plot in matplotlib, you're really supplying discrete x and y values at a resolution sufficient to be visually pleasing. In this case, you've chosen xp to determine the points you plot for the parabola. You then call p(xp) to generate an array of y-values at those locations.
There's nothing stopping you from generating y-values for your formula of interest using simple numpy functions:
y = 2 * (1 - np.exp(1 - xp))**2 - 2
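For instance, a minimal end-to-end sketch (the x-range here is an arbitrary choice):
import numpy as np
import matplotlib.pyplot as plt
xp = np.linspace(0, 5, 200)   #dense grid of x values
y = 2 * (1 - np.exp(1 - xp))**2 - 2
plt.plot(xp, y)
plt.show()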

linear fit on log-log plot isn't linear

I'm trying to analyse the reproducibility of an experiment. I replaced 0 values with 0.1 and plotted the data from both experiments with log-log axes. So far, so good.
Next, I got rows where values in both columns are > 0 and I calculated a linear regression on the log10 of those values. I got the slope and the intercept of the linear fit and then I tried to plot it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
table = pd.read_csv("data.csv")
data = table.replace(0, 0.1)
plt.plot(data["run1"], data["run2"], color="#03012d", marker=".", ls="None", markersize=3, label="")
plt.xscale('log')
plt.yscale('log')
plt.axis('square')
plt.xlabel("1st experiment")
plt.ylabel("2nd experiment")
from scipy.stats import linregress
df = table.loc[(table['run1'] >0) & (table['run2'] >0)]
stats = linregress(np.log10(df["run1"]),np.log10(df["run2"]))
m = stats.slope
b = stats.intercept
r = stats.rvalue
x = np.logspace(-1, 5, base=10)
y = (m*x+b)
plt.plot(x, y, c='orange', label="fit")
plt.legend()
But this is what I get and it's definitely not linear:
I don't know what I am doing wrong.
EDIT:
Link to the initial dataset
You are confusing things here. The problem is that np.logspace(-1, 5, base=10) simply returns logarithmically spaced values, but you still need to take the base-10 log of your x-values (np.log10(x)) because your x-axis in the plot is logarithmic. Do the following:
x = np.log10(np.logspace(-1, 5, base=10))
y = (m*x + b)
plt.plot(x, y, c='orange', label="fit")
This will give you what you expect, a straight linear regression prediction.
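If instead you want to overlay the fitted line on the original log-scaled scatter, where the axes show the raw values, an equivalent sketch evaluates the fit in log space and transforms back (this reuses m, b and plt from the code above):
x = np.logspace(-1, 5, base=10)
y = 10**(m * np.log10(x) + b)   #the fit was computed on log10 of the data
plt.plot(x, y, c='orange', label="fit")
plt.legend()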
When I visually inspect a scatterplot of the data, I see no utility in taking logs. A straight line through the raw data looks like it is probably the best you can do here, see the attached images.

Fit a distribution to a histogram

I want to know the distribution of my data points, so first I plotted the histogram of my data. My histogram looks like the following:
Second, in order to fit them to a distribution, here's the code I wrote:
import scipy
import scipy.stats
import matplotlib.pyplot as plt
size = 20000
x = scipy.arange(size)
# fit
param = scipy.stats.gamma.fit(y)
pdf_fitted = scipy.stats.gamma.pdf(x, *param[:-2], loc = param[-2], scale = param[-1]) * size
plt.plot(pdf_fitted, color = 'r')
# plot the histogram
plt.hist(y)
plt.xlim(0, 0.3)
plt.show()
The result is:
What am I doing wrong?
Your data does not appear to be gamma-distributed, but assuming it is, you could fit it like this:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
gamma = stats.gamma
a, loc, scale = 3, 0, 2
size = 20000
y = gamma.rvs(a, loc, scale, size=size)
x = np.linspace(0, y.max(), 100)
# fit
param = gamma.fit(y, floc=0)
pdf_fitted = gamma.pdf(x, *param)
plt.plot(x, pdf_fitted, color='r')
# plot the histogram
plt.hist(y, density=True, bins=30)
plt.show()
The area under the pdf (over the entire domain) equals 1.
The area under the histogram equals 1 if you use density=True (called normed=True in older matplotlib versions).
x has length size (i.e. 20000), and pdf_fitted has the same shape as x. If we call plot and specify only the y-values, e.g. plt.plot(pdf_fitted), then values are plotted over the x-range [0, size].
That is much too large an x-range. Since the histogram is going to use an x-range of [min(y), max(y)], we must choose x to span a similar range: x = np.linspace(0, y.max()), and call plot with both the x- and y-values specified, e.g. plt.plot(x, pdf_fitted).
As Warren Weckesser points out in the comments, for most applications you know the gamma distribution's domain begins at 0. If that is the case, use floc=0 to hold the loc parameter to 0. Without floc=0, gamma.fit will try to find the best-fit value for the loc parameter too, which given the vagaries of data will generally not be exactly zero.
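As a quick illustration of the difference (a sketch reusing gamma and y from the code above):
free_params = gamma.fit(y)            #loc estimated freely; rarely exactly 0
pinned_params = gamma.fit(y, floc=0)  #loc held at 0
print("free loc:  ", free_params[1])
print("pinned loc:", pinned_params[1])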

Find minimum distance from point to complicated curve

I have a complicated curve defined as a set of points in a table like so (the full table is here):
# x y
1.0577 12.0914
1.0501 11.9946
1.0465 11.9338
...
If I plot this table with the commands:
plt.plot(x_data, y_data, c='b',lw=1.)
plt.scatter(x_data, y_data, marker='o', color='k', s=10, lw=0.2)
I get the following:
where I've added the red points and segments manually. What I need is a way to calculate those segments for each of those points, that is: a way to find the minimum distance from a given point in this 2D space to the interpolated curve.
I can't use the distance to the data points themselves (the black dots that generate the blue curve) since they are not located at equal intervals, sometimes they are close and sometimes they are far apart and this deeply affects my results further down the line.
Since this is not a well-behaved curve, I'm not really sure what I could do. I've tried interpolating it with a UnivariateSpline, but it returns a very poor fit:
# Sort data according to x.
temp_data = sorted(zip(x_data, y_data))
# Unpack sorted data.
x_sorted, y_sorted = zip(*temp_data)
# Generate univariate spline.
s = UnivariateSpline(x_sorted, y_sorted, k=5)
xspl = np.linspace(0.8, 1.1, 100)
yspl = s(xspl)
# Plot.
plt.scatter(xspl, yspl, marker='o', color='r', s=10, lw=0.2)
I also tried increasing the number of interpolating points but got a mess:
# Sort data according to x.
temp_data = sorted(zip(x_data, y_data))
# Unpack sorted data.
x_sorted, y_sorted = zip(*temp_data)
t = np.linspace(0, 1, len(x_sorted))
t2 = np.linspace(0, 1, 100)
# One-dimensional linear interpolation.
x2 = np.interp(t2, t, x_sorted)
y2 = np.interp(t2, t, y_sorted)
plt.scatter(x2, y2, marker='o', color='r', s=10, lw=0.2)
Any ideas/pointers will be greatly appreciated.
If you're open to using a library for this, have a look at shapely: https://github.com/Toblerity/Shapely
As a quick example (points.txt contains the data you linked to in your question):
import shapely.geometry as geom
import numpy as np
coords = np.loadtxt('points.txt')
line = geom.LineString(coords)
point = geom.Point(0.8, 10.5)
# Note that "line.distance(point)" would be identical
print(point.distance(line))
As an interactive example (this also draws the line segments you wanted):
import numpy as np
import shapely.geometry as geom
import matplotlib.pyplot as plt
class NearestPoint(object):
    def __init__(self, line, ax):
        self.line = line
        self.ax = ax
        ax.figure.canvas.mpl_connect('button_press_event', self)

    def __call__(self, event):
        x, y = event.xdata, event.ydata
        point = geom.Point(x, y)
        distance = self.line.distance(point)
        self.draw_segment(point)
        print('Distance to line:', distance)

    def draw_segment(self, point):
        point_on_line = self.line.interpolate(self.line.project(point))
        self.ax.plot([point.x, point_on_line.x], [point.y, point_on_line.y],
                     color='red', marker='o', scalex=False, scaley=False)
        self.ax.figure.canvas.draw()

if __name__ == '__main__':
    coords = np.loadtxt('points.txt')
    line = geom.LineString(coords)
    fig, ax = plt.subplots()
    ax.plot(*coords.T)
    ax.axis('equal')
    NearestPoint(line, ax)
    plt.show()
Note that I've added ax.axis('equal'). shapely operates in the coordinate system that the data is in. Without the equal axis plot, the view will be distorted, and while shapely will still find the nearest point, it won't look quite right in the display:
The curve is by nature parametric, i.e. for each x there isn't necessarily a unique y and vice versa. So you shouldn't interpolate a function of the form y(x) or x(y). Instead, you should do two interpolations, x(t) and y(t), where t is, say, the index of the corresponding point.
Then you use scipy.optimize.fminbound to find the optimal t such that (x(t) - x0)^2 + (y(t) - y0)^2 is smallest, where (x0, y0) are the red dots in your first figure. For fminbound, you could specify the min/max bounds for t to be 0 and len(x_data) - 1.
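A minimal sketch of that idea, using linear interpolation for x(t) and y(t) (the function name and arguments are placeholders):
import numpy as np
from scipy.optimize import fminbound

def min_distance_to_curve(x_data, y_data, x0, y0):
    t = np.arange(len(x_data))                   #parameter = index of each point
    x_of_t = lambda s: np.interp(s, t, x_data)   #x(t)
    y_of_t = lambda s: np.interp(s, t, y_data)   #y(t)
    sq_dist = lambda s: (x_of_t(s) - x0)**2 + (y_of_t(s) - y0)**2
    t_best = fminbound(sq_dist, 0, len(x_data) - 1)  #note: finds a local minimum
    return np.sqrt(sq_dist(t_best)), x_of_t(t_best), y_of_t(t_best)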
You could try computing the distance from the point to the line through each consecutive pair of points on the curve and taking the minimum. This will introduce a small amount of error relative to the curve as drawn, but it should be very small, as the points are relatively close together.
http://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line
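A sketch of that approach (the helper names here are hypothetical):
import numpy as np

def point_segment_distance(p, a, b):
    """Distance from point p to the segment from a to b (all 1D arrays of length 2)."""
    ab = b - a
    #Project p onto the segment and clamp to its endpoints
    t = np.clip(np.dot(p - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def min_distance_to_polyline(p, pts):
    """Minimum distance from p to the polyline through the rows of pts (N x 2)."""
    return min(point_segment_distance(p, pts[i], pts[i + 1])
               for i in range(len(pts) - 1))

#e.g. min_distance_to_polyline(np.array([0.8, 10.5]), np.loadtxt('points.txt'))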
You can use the trjtrypy package from PyPI: https://pypi.org/project/trjtrypy/
All the needed computations and visualizations are available in this package. You can get your answer with a single line of code:
to get the minimum distance use: trjtrypy.basedists.distance(points, curve)
to visualize the curve and points use: trjtrypy.visualizations.draw_landmarks_trajectory(points, curve)
