I'm trying to analyse the reproducibility of an experiment. I replaced 0 values with 0.1 and plotted the data from both runs on log-log axes. So far, so good.
Next, I took the rows where the values in both columns are > 0 and calculated a linear regression on the log10 of those values. I got the slope and intercept of the linear fit and then tried to plot it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
table = pd.read_csv("data.csv")
data = table.replace(0, 0.1)
plt.plot(data["run1"], data["run2"], color="#03012d", marker=".", ls="None", markersize=3, label="")
plt.xscale('log')
plt.yscale('log')
plt.axis('square')
plt.xlabel("1st experiment")
plt.ylabel("2nd experiment")
from scipy.stats import linregress
df = table.loc[(table['run1'] > 0) & (table['run2'] > 0)]
stats = linregress(np.log10(df["run1"]), np.log10(df["run2"]))
m = stats.slope
b = stats.intercept
r = stats.rvalue
x = np.logspace(-1, 5, base=10)
y = (m*x+b)
plt.plot(x, y, c='orange', label="fit")
plt.legend()
But this is what I get and it's definitely not linear:
I don't know what I am doing wrong.
EDIT:
Link to the initial dataset
You are confusing things here. np.logspace(-1, 5, base=10) simply returns logarithmically spaced values, but you still need to take the base-10 log of those x values (np.log10(x)), since the slope and intercept were obtained from a fit on log10 values. Do the following:
x = np.log10(np.logspace(-1, 5, base=10))
y = (m*x + b)
plt.plot(x, y, c='orange', label="fit")
This will give you what you expect, a straight linear regression prediction.
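If you would rather overlay the fitted line on the original data coordinates (keeping the log-scaled axes from the question), an equivalent option is to undo the log10 transform when evaluating the fit. A minimal sketch, assuming m and b come from the log10/log10 fit above:
x = np.logspace(-1, 5, base=10)
y = 10**(m * np.log10(x) + b)  # evaluate the fit in log space, then transform back
plt.plot(x, y, c='orange', label="fit")
Either way, the key point is that m and b only describe a straight line in log10 coordinates.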
When I visually inspect a scatterplot of the data, I see no utility in taking logs. A straight line through the raw data looks like it is probably the best you can do here; see the attached images.
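A minimal sketch of that alternative, reusing the table DataFrame and column names from the question:
from scipy.stats import linregress
# fit a straight line to the raw (untransformed) values
raw_fit = linregress(table["run1"], table["run2"])
xs = np.linspace(table["run1"].min(), table["run1"].max(), 100)
plt.plot(table["run1"], table["run2"], ".", markersize=3)
plt.plot(xs, raw_fit.slope * xs + raw_fit.intercept, c="orange", label="raw-data fit")
plt.legend()
plt.show()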
I'm trying to interpolate some data to make a smooth curve through the data points I have, but interp1d doesn't work with the x-values I have. It works if I just make a new x-vector of [1,2,3,4,5,6,7,8], but with the original x-values I get a curve that does not fit the data at all.
I am wondering if it's the large span of my x-vector that causes the problem?
My data is this:
y = np.array([0.768, 0.901, 1.790, 1.213, 0.543, 0.261, 0.121, 0.049])
x = np.array([1.2e-17, 3.7e-16, 1.2e-14, 2.8e-13, 4.8e-12, 9.2e-11, 2.0e-9, 5.0e-8])
Trying interpolation and plotting:
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt

xnew = np.linspace(np.min(x), np.max(x), 100)
f = interp1d(x, y, kind='quadratic')
y_smooth = f(xnew)
plt.scatter(x,y)
plt.plot(xnew, y_smooth)
plt.ylim(0,2)
plt.xscale('log')
plt.show()
This gives a figure that doesn't make sense at all. I have googled and searched for a solution for hours now, trying different methods such as curve fitting instead, but nothing seems to work.
Changing the x-data vector gives the desired curve, but obviously with the wrong x-values:
new_x_data = [1,2,3,4,5,6,7,8]
plt.scatter(new_x_data,y)
Any help will be deeply appreciated.
The smallest change to your code would be:
x = np.log(np.array([1.2e-17, 3.7e-16, 1.2e-14, 2.8e-13, 4.8e-12, 9.2e-11, 2.0e-9, 5.0e-8]))
y = np.array([0.768, 0.901, 1.790, 1.213, 0.543, 0.261, 0.121, 0.049])
xnew = np.linspace(np.min(x),np.max(x),100)
f = interp1d(x, y, kind='quadratic')
y_smooth=f(xnew)
plt.scatter(np.exp(x), y)
plt.plot(np.exp(xnew), y_smooth)
plt.ylim(0,2)
plt.xscale('log')
plt.show()
This gives me something sensible-looking. Note the log and exp calls that move into log space and back out again, so that the interpolation is consistent with the plot.
I think this is just a matter of a discrepancy in how you are viewing the result.
Try using a logspace instead of a linspace -- it should follow the points better. As is, the vast majority of points in xnew are clustered to the right.
Edit:
This shows what is happening a bit better. It is interpolating between the points; it just does not look great.
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import interp1d
y = np.array([0.768, 0.901, 1.790, 1.213, 0.543, 0.261, 0.121, 0.049])
x = np.array([1.2e-17, 3.7e-16, 1.2e-14, 2.8e-13, 4.8e-12, 9.2e-11, 2.0e-9, 5.0e-8])
xnew = np.logspace(-16, -8, 100)
f = interp1d(x, y, kind='quadratic')
y_smooth=f(xnew)
plt.scatter(x,y)
plt.plot(xnew, y_smooth)
plt.ylim(0,4)
plt.xscale('log')
plt.show()
I have plotted a seaborn scatter plot. My data consists of 5000 data points. Looking at the plot, I am definitely not seeing 5000 points, so I'm pretty sure some kind of sampling is performed by the seaborn scatterplot function. I want to know how many data points each point in the plot represents. If it depends on the code, the code is the following:
g = sns.scatterplot(x=data['x'], y=data['y'],hue=data['P'], s=40, edgecolor='k', alpha=0.8, legend="full")
Nothing would really suggest to me that seaborn is sampling your data. However, you can check the data in your axes g to be sure. Query the children of the axes for a PathCollection (scatter plot) object:
g.get_children()
It's probably the first item in the list that is returned. From there you can use get_offsets to retrieve the data and check its shape.
g.get_children()[0].get_offsets().shape
As far as I know, no sampling is performed. In the picture you have posted, you can see that most of the data points are simply overlapping, and that might be the reason why you cannot see 5000 points. Try with fewer points and you will see that all of them get plotted.
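A rough sanity check of the overlap (a sketch, assuming the data DataFrame with columns x and y from your snippet): round the coordinates to roughly the plot's resolution and count the distinct positions.
import numpy as np
# points that land at (almost) the same position collapse into one visible marker
approx = np.round(np.column_stack([data['x'], data['y']]), 2)
print(len(data), "rows,", len(np.unique(approx, axis=0)), "visually distinct positions")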
In order to check whether or not Seaborn's scatter removes points, here is a way to see 5000 different points. No points seem to be missing.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
x = np.linspace(1, 100, 100)
y = np.linspace(1, 50, 50)
X, Y = np.meshgrid(x, y)
Z = (X * Y) % 25
X = np.ravel(X)
Y = np.ravel(Y)
Z = np.ravel(Z)
sns.scatterplot(x=X, y=Y, s=15, hue=Z, palette=plt.cm.plasma, legend=False)
plt.show()
I have some data that I've plotted on a log-log plot, and now I want to fit a straight line through these points. I have tried various methods and can't get what I'm after. Example code:
import numpy as np
import matplotlib.pyplot as plt
import random
x= np.linspace(1,100,10)
y = np.log10(x)+np.log10(np.random.uniform(0,10))
coefficients = np.polyfit(np.log10(x),np.log10(y),1)
polynomial=np.poly1d(coefficients)
y_fit = polynomial(y)
plt.plot(x,y,'o')
plt.plot(x,y_fit,'-')
plt.yscale('log')
plt.xscale('log')
This gives me an ideal 'straight' line in log-log space, offset by a random number, to which I then fit a 1-d polynomial. The output is:
So ignoring the offset, which I can deal with, it is not quite what I require: it has basically plotted a straight line between each point and then joined them up, whereas I need a 'line of best fit' through the middle of them all so that I can measure its gradient.
What is the best way to achieve this?
One problem is
y_fit = polynomial(y)
You must plug in the x values, not y, to get y_fit.
Also, you fit log10(y) against log10(x), so to evaluate the fitted polynomial you must plug in log10(x), and the result will be the base-10 log of the fitted y values.
Here's a modified version of your script, followed by the plot it generates.
import numpy as np
import matplotlib.pyplot as plt
import random
x = np.linspace(1,100,10)
y = np.log10(x) + np.log10(np.random.uniform(0,10))
coefficients = np.polyfit(np.log10(x), np.log10(y), 1)
polynomial = np.poly1d(coefficients)
log10_y_fit = polynomial(np.log10(x)) # <-- Changed
plt.plot(x, y, 'o-')
plt.plot(x, 10**log10_y_fit, '*-') # <-- Changed
plt.yscale('log')
plt.xscale('log')
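And since the original goal was to measure the gradient: for this degree-1 fit, the slope of the line in log-log space is simply the leading coefficient returned by polyfit.
slope = coefficients[0]  # gradient of log10(y) versus log10(x)
print(slope)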
I've been playing with scikit-learn's GMM function. To start with, I've just created a distribution along the line x=y.
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)
#Create a distribution that's centred along y=x
line_model.fit(zip(xs,ys))
plt.plot(xs, ys)
plt.show()
This produces the expected distribution:
Next I fit a GMM to it, and plot the results:
#Create the x,y mesh that will be used to make a 3D plot
x_y_grid = []
for x in xs:
    for y in ys:
        x_y_grid.append([x, y])
#Calculate a probability for each point in the x,y grid.
x_y_z_grid = []
for x, y in x_y_grid:
    z = line_model.score([[x, y]])
    x_y_z_grid.append([x, y, z])
x_y_z_grid = np.array(x_y_z_grid)
#Plot probabilities on the Z axis.
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot(x_y_z_grid[:,0], x_y_z_grid[:,1], 2.72**x_y_z_grid[:,2])
plt.show()
The resulting probability distribution has some weird tails along x=0 and x=1 and also extra probability in the corners (x=1, y=1 and x=0,y=0).
Using n_components=5 also shows this behaviour:
Is this something inherent with GMMs, or is there an issue with the implementation, or am I doing something wrong?
Edit: getting the scores from the model seems to get rid of this behaviour -- should it?
I'm training both models on the same dataset (x=y from x=0 to x=1). Simply checking the probability via the score method of the GMM seems to eliminate this boundary effect. Why is this? I've attached the plots and code below.
# Creates a line of 'observations' between (x_small_start, x_small_end)
# and (y_small_start, y_small_end). This is the data both gmms are trained on.
x_small_start = 0
x_small_end = 1
y_small_start = 0
y_small_end = 1
# These are the range of values that will be plotted
x_big_start = -1
x_big_end = 2
y_big_start = -1
y_big_end = 2
shorter_eval_range_gmm = mixture.GMM(n_components = 5)
longer_eval_range_gmm = mixture.GMM(n_components = 5)
x_small = np.linspace(x_small_start, x_small_end, 100)
y_small = np.linspace(y_small_start, y_small_end, 100)
x_big = np.linspace(x_big_start, x_big_end, 100)
y_big = np.linspace(y_big_start, y_big_end, 100)
#Train both gmms on a distribution that's centered along y=x
shorter_eval_range_gmm.fit(zip(x_small,y_small))
longer_eval_range_gmm.fit(zip(x_small,y_small))
#Create the x,y meshes that will be used to make a 3D plot
x_y_evals_grid_big = []
for x in x_big:
    for y in y_big:
        x_y_evals_grid_big.append([x, y])
x_y_evals_grid_small = []
for x in x_small:
    for y in y_small:
        x_y_evals_grid_small.append([x, y])
#Calculate a probability for each point in the x,y grid.
x_y_z_plot_grid_big = []
for x, y in x_y_evals_grid_big:
    z = longer_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_big.append([x, y, z])
x_y_z_plot_grid_big = np.array(x_y_z_plot_grid_big)
x_y_z_plot_grid_small = []
for x, y in x_y_evals_grid_small:
    z = shorter_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_small.append([x, y, z])
x_y_z_plot_grid_small = np.array(x_y_z_plot_grid_small)
#Plot probabilities on the Z axis.
fig = plt.figure()
fig.suptitle("Probability of different x,y pairs")
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot(x_y_z_plot_grid_big[:,0], x_y_z_plot_grid_big[:,1], np.exp(x_y_z_plot_grid_big[:,2]))
ax1.set_xlabel('X Label')
ax1.set_ylabel('Y Label')
ax1.set_zlabel('Probability')
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
ax2.plot(x_y_z_plot_grid_small[:,0], x_y_z_plot_grid_small[:,1], np.exp(x_y_z_plot_grid_small[:,2]))
ax2.set_xlabel('X Label')
ax2.set_ylabel('Y Label')
ax2.set_zlabel('Probability')
plt.show()
There is no problem with the fit, but there is one with the visualisation you're using. A hint is the straight line connecting (0,1,5) to (0,1,0), which is just a rendering of the connection between two points (due to the order in which the points are read). Although the two points at its extrema are in your data, no other point on this line actually is.
Personally, I think it is a rather bad idea to use 3d plots (wires) to represent a surface for the reason mentioned above, and I would recommend surface plots or contour plots instead.
Try this:
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.atleast_2d(np.linspace(0, 1, 100)).T
ys = np.atleast_2d(np.linspace(0, 1, 100)).T
#Create a distribution that's centred along y=x
line_model.fit(np.concatenate([xs, ys], axis=1))
plt.scatter(xs, ys)
plt.show()
#Create the x,y mesh that will be used to make a 3D plot
X, Y = np.meshgrid(xs, ys)
x_y_grid = np.c_[X.ravel(), Y.ravel()]
#Calculate a probability for each point in the x,y grid.
z = line_model.score(x_y_grid)
z = z.reshape(X.shape)
#Plot probabilities on the Z axis.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, z)
plt.show()
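For the contour-plot alternative mentioned above, a minimal sketch reusing X, Y and z from the block above:
# 2D contour view of the same scores
fig, ax = plt.subplots()
cs = ax.contourf(X, Y, z, 20)
fig.colorbar(cs)
plt.show()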
From an academic point of view I am quite uncomfortable with the goal of fitting a 1D line in a 2D space with a 2D mixture model. Manifold learning with GMMs requires at least the normal direction to have zero variance, thus reducing to a Dirac distribution. Numerically and analytically this is unstable and should be avoided (there seems to be some stabilising trick in the GMM fit, since the variance of the model is rather large in the direction of the normal to the straight line).
It is also recommended to use plt.scatter rather than plt.plot when drawing data, since there is no reason to connect the dots when you're fitting their joint distribution.
Hope this helps to shed some light on your problem.
EDIT:
This is not correct. After talking with Ronald P., it turns out you can't get Gibbs effects because the Gaussians cannot compensate for each other by "going negative", as probability is strictly > 0. This seems to be a simple plotting issue... see his answer instead! Either way, I would recommend using 2D data to test GMMs rather than a 1D line.
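For example, a minimal sketch of genuinely 2-D training data (using the same deprecated mixture.GMM API as the rest of this thread; the blob parameters are arbitrary):
# an isotropic 2-D blob around (0.5, 0.5) instead of a degenerate line
data_2d = 0.5 + 0.1 * np.random.randn(500, 2)
gmm_2d = mixture.GMM(n_components=5)
gmm_2d.fit(data_2d)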
The GMM is fitting to the data you gave it - specifically:
xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)
Because the data ends at 0 and 1, the GMM is attempting to model that fact: -0.01 and 1.01 are technically outside the trained data range and should be scored with very low probabilities. In doing so it ends up creating a Gaussian with smaller spread (smaller covariance / higher precision) to cover the ends of the data and model the fact that the data stops.
I would expect that adding enough Gaussians would lead to a pseudo-Gibbs-phenomenon effect, and you can kind of see that happening in the change from 5 to 99 components. To exactly model the edges, you would need an infinite mixture model. This is analogous to infinite frequency components: you are representing a "signal" with a set of basis functions (in this case, Gaussians) in a GMM as well!
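A quick way to see this, reusing line_model from the question (and the older sklearn.mixture.GMM API used there, where score returns per-sample log-likelihoods):
# log-likelihood of a point on the training line vs. points just outside its range
print(line_model.score([[0.5, 0.5]]))                     # inside the trained range
print(line_model.score([[-0.01, -0.01], [1.01, 1.01]]))   # outside: expected to be lower, per the reasoning above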
I have 2 lists, the first with dates (datetime objects) and the second with some values for these dates.
When I create a simple plot:
plt.plot_date(x=dates, y=dur, fmt='r-')
I get a very ugly image like this.
How can I smooth this line? I have thought about extrapolation, but have not found a simple function for this. SciPy has very complicated tools for this, but I don't understand what I must add to my data for extrapolation.
You can make it smooth using sp.polyfit
Code:
import scipy as sp
import numpy as np
import matplotlib.pyplot as plt
# sampledata
x = np.arange(199)
r = np.random.rand(100)
y = np.convolve(r, r)
# plot sampledata
plt.plot(x, y, color='grey')
# smoothen sampledata using a 50 degree polynomial
p = sp.polyfit(x, y, deg=50)
y_ = sp.polyval(p, x)
# plot smoothened data
plt.plot(x, y_, color='r', linewidth=2)
plt.show()
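To apply the same idea to the date-valued x axis from the question, one option (a sketch: dates and dur are the lists from the question, and the polynomial degree of 10 is just a guess) is to convert the datetimes to matplotlib's float day numbers first:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# convert datetimes to float day numbers so polyfit can work with them
xnum = mdates.date2num(dates)
p = np.polyfit(xnum, dur, deg=10)   # np.polyfit is the function sp.polyfit aliases
plt.plot_date(x=dates, y=dur, fmt='r-')
plt.plot(dates, np.polyval(p, xnum), 'b-', linewidth=2)
plt.show()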