Interpolating values that have a very high spread - Python

I'm trying to interpolate some data to make a smooth curve through the data points I have, but interp1d doesn't work with my original x-values. It works if I replace them with a new x-vector [1,2,3,4,5,6,7,8], but with the original x-values I get a curve that does not fit the data at all.
I am wondering if it's the large span of my x-vector that causes the problem?
My data is this:
import numpy as np

y = np.array([0.768, 0.901, 1.790, 1.213, 0.543, 0.261, 0.121, 0.049])
x = np.array([1.2e-17, 3.7e-16, 1.2e-14, 2.8e-13, 4.8e-12, 9.2e-11, 2.0e-9, 5.0e-8])
Trying interpolation and plotting:
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt

xnew = np.linspace(np.min(x), np.max(x), 100)
f = interp1d(x, y, kind='quadratic')
y_smooth = f(xnew)
plt.scatter(x, y)
plt.plot(xnew, y_smooth)
plt.ylim(0, 2)
plt.xscale('log')
plt.show()
This gives a figure that doesn't make sense at all. I have googled and searched for a solution for hours now, trying different methods such as curve fitting instead, but nothing seems to work.
Changing the x-data vector gives the desired curve, but obviously with the wrong x-values:
new_x_data = [1,2,3,4,5,6,7,8]
plt.scatter(new_x_data,y)
Any help will be deeply appreciated.

The smallest change to your code would be:
x = np.log(np.array([1.2e-17, 3.7e-16, 1.2e-14, 2.8e-13, 4.8e-12, 9.2e-11, 2.0e-9, 5.0e-8]))
y = np.array([0.768, 0.901, 1.790, 1.213, 0.543, 0.261, 0.121, 0.049])
xnew = np.linspace(np.min(x), np.max(x), 100)
f = interp1d(x, y, kind='quadratic')
y_smooth = f(xnew)
plt.scatter(np.exp(x), y)
plt.plot(np.exp(xnew), y_smooth)
plt.ylim(0,2)
plt.xscale('log')
plt.show()
This gives me something sensible looking. Note the log and exp calls that move the data into log space and back out again, to keep the interpolation consistent with the log-scaled plot.
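The same trick can be packaged in a small helper so callers never touch log space directly. A minimal sketch, assuming the raw x and y arrays from the question (not the log-transformed x above); the name interp_log_x is mine, and the base of the log is irrelevant as long as it is undone consistently:
import numpy as np
from scipy.interpolate import interp1d

def interp_log_x(x, y, kind='quadratic'):
    """Interpolate y against log(x); the returned callable accepts raw x."""
    f = interp1d(np.log(x), y, kind=kind)
    return lambda x_new: f(np.log(x_new))

f = interp_log_x(x, y)
xnew = np.logspace(np.log10(x.min()), np.log10(x.max()), 100)
xnew = np.clip(xnew, x.min(), x.max())  # guard against round-off at the endpoints
y_smooth = f(xnew)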

I think this is just a matter of a discrepancy in how you are viewing the result.
Try using a logspace instead of a linspace -- it should follow the points better. As is, the vast majority of the points in xnew are clustered at the far right of the range.
Edit:
This shows what is happening a bit better. It is interpolating between them, it just does not look great.
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import interp1d
y = np.array([0.768, 0.901, 1.790, 1.213, 0.543, 0.261, 0.121, 0.049])
x = np.array([1.2e-17, 3.7e-16, 1.2e-14, 2.8e-13, 4.8e-12, 9.2e-11, 2.0e-9, 5.0e-8])
xnew = np.logspace(-16, -8, 100)
f = interp1d(x, y, kind='quadratic')
y_smooth = f(xnew)
plt.scatter(x, y)
plt.plot(xnew, y_smooth)
plt.ylim(0,4)
plt.xscale('log')
plt.show()
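To quantify that clustering with the question's data, a quick check shows that only 2 of the 100 linearly spaced points fall below 1e-9, even though 6 of the 8 data points sit there:
xnew = np.linspace(x.min(), x.max(), 100)
print(np.sum(xnew < 1e-9))  # -> 2; the linspace barely samples the small-x region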

Related

Check if seaborn scatterplot function is sampling data

I have plotted a seaborn scatter plot. My data consists of 5000 data points. Looking at the plot, I am definitely not seeing 5000 points, so I'm pretty sure some kind of sampling is performed by the seaborn scatterplot function. I want to know how many data points each point in the plot represents. If it depends on the code, the code is the following:
g = sns.scatterplot(x=data['x'], y=data['y'],hue=data['P'], s=40, edgecolor='k', alpha=0.8, legend="full")
Nothing would really suggest to me that seaborn is sampling your data. However, you can check the data in your axes g to be sure. Query the children of the axes for a PathCollection (scatter plot) object:
g.get_children()
It's probably the first item in the list that is returned. From there you can use get_offsets to retrieve the data and check its shape.
g.get_children()[0].get_offsets().shape
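A slightly more robust variant (a sketch) filters the axes children for the scatter's PathCollection instead of assuming it is the first item:
from matplotlib.collections import PathCollection
pc = next(c for c in g.get_children() if isinstance(c, PathCollection))
print(pc.get_offsets().shape)  # e.g. (5000, 2) if no point was dropped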
As far as I know, no sampling is performed. In the picture you have posted, you can see that most of the data points are simply overlapping, which might be the reason why you cannot see 5000 points. Try with fewer points and you will see that all of them get plotted.
In order to check whether or not Seaborn's scatter removes points, here is a way to see 5000 different points. No points seem to be missing.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
x = np.linspace(1, 100, 100)
y = np.linspace(1, 50, 50)
X, Y = np.meshgrid(x, y)
Z = (X * Y) % 25
X = np.ravel(X)
Y = np.ravel(Y)
Z = np.ravel(Z)
sns.scatterplot(x=X, y=Y, s=15, hue=Z, palette=plt.cm.plasma, legend=False)
plt.show()

Linear fit on log-log plot isn't linear

I'm trying to analyse the reproducibility of an experiment. I replaced 0 values with 0.1 and plotted the data from both experiments on log-log axes. So far, so good.
Next, I got rows where values in both columns are > 0 and I calculated a linear regression on the log10 of those values. I got the slope and the intercept of the linear fit and then I tried to plot it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
table = pd.read_csv("data.csv")
data = table.replace(0, 0.1)
plt.plot(data["run1"], data["run2"], color="#03012d", marker=".", ls="None", markersize=3, label="")
plt.xscale('log')
plt.yscale('log')
plt.axis('square')
plt.xlabel("1st experiment")
plt.ylabel("2nd experiment")
from scipy.stats import linregress
df = table.loc[(table['run1'] >0) & (table['run2'] >0)]
stats = linregress(np.log10(df["run1"]), np.log10(df["run2"]))
m = stats.slope
b = stats.intercept
r = stats.rvalue
x = np.logspace(-1, 5, base=10)
y = (m*x+b)
plt.plot(x, y, c='orange', label="fit")
plt.legend()
But this is what I get and it's definitely not linear:
I don't know what I am doing wrong.
EDIT:
Link to the initial dataset
You are confusing things here. np.logspace(-1, 5, base=10) simply returns logarithmically spaced values, but the regression was computed on the base-10 logs of the data, so you still need to take the base-10 log of those x-values before evaluating the fit:
x = np.log10(np.logspace(-1, 5, base=10))
y = (m*x + b)
plt.plot(x, y, c='orange', label="fit")
This will give you what you expect, a straight linear regression prediction.
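Alternatively, you can stay in data units and undo the log on the prediction instead; a sketch using the m and b computed above, which draws a straight line directly on the log-log axes:
x = np.logspace(-1, 5, base=10)
log10_y = m * np.log10(x) + b   # the fit was computed on log10 of both runs
plt.plot(x, 10**log10_y, c='orange', label="fit")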
When I visually inspect a scatterplot of the data, I see no utility in taking logs. A straight line through the raw data looks like it is probably the best you can do here, see the attached images.

Plot straight line of best fit on log-log plot

I have some data that I've plotted on a log-log plot and now I want to fit a straight line through these points. I have tried various methods and can't get what I'm after. Example code:
import numpy as np
import matplotlib.pyplot as plt
import random
x = np.linspace(1, 100, 10)
y = np.log10(x) + np.log10(np.random.uniform(0, 10))
coefficients = np.polyfit(np.log10(x), np.log10(y), 1)
polynomial = np.poly1d(coefficients)
y_fit = polynomial(y)
plt.plot(x,y,'o')
plt.plot(x,y_fit,'-')
plt.yscale('log')
plt.xscale('log')
This gives me an ideal 'straight' line in log-log space, offset by a random number, to which I then fit a 1d poly. The output is:
So, ignoring the offset, which I can deal with, it is not quite what I require: it has basically plotted a straight line between each point and joined them up, whereas I need a 'line of best fit' through the middle of them all so I can measure its gradient.
What is the best way to achieve this?
One problem is
y_fit = polynomial(y)
You must plug in the x values, not y, to get y_fit.
Also, you fit log10(y) with log10(x), so to evaluate the linear interpolator, you must plug in log10(x), and the result will be the base-10 log of the y values.
Here's a modified version of your script, followed by the plot it generates.
import numpy as np
import matplotlib.pyplot as plt
import random
x = np.linspace(1,100,10)
y = np.log10(x) + np.log10(np.random.uniform(0,10))
coefficients = np.polyfit(np.log10(x), np.log10(y), 1)
polynomial = np.poly1d(coefficients)
log10_y_fit = polynomial(np.log10(x)) # <-- Changed
plt.plot(x, y, 'o-')
plt.plot(x, 10**log10_y_fit, '*-') # <-- Changed
plt.yscale('log')
plt.xscale('log')
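Since the stated goal was to measure the gradient, note that it is just the leading coefficient returned by np.polyfit (highest power first):
slope, intercept = coefficients
print(f"gradient in log-log space: {slope:.3f}")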

Gaussian mixture model (GMM) gives a bad fit

I've been playing with scikit-learn's GMM function. To start with, I've just created a distribution along the line x=y.
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)
#Create a distribution that's centred along y=x
line_model.fit(zip(xs, ys))  # Python 2 era; on Python 3, pass np.column_stack([xs, ys]) instead
plt.plot(xs, ys)
plt.show()
This produces the expected distribution:
Next I fit a GMM to it, and plot the results:
#Create the x,y mesh that will be used to make a 3D plot
x_y_grid = []
for x in xs:
    for y in ys:
        x_y_grid.append([x, y])

#Calculate a probability for each point in the x,y grid.
x_y_z_grid = []
for x, y in x_y_grid:
    z = line_model.score([[x, y]])
    x_y_z_grid.append([x, y, z])
x_y_z_grid = np.array(x_y_z_grid)

#Plot probabilities on the Z axis.
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot(x_y_z_grid[:, 0], x_y_z_grid[:, 1], 2.72**x_y_z_grid[:, 2])
plt.show()
The resulting probability distribution has some weird tails along x=0 and x=1 and also extra probability in the corners (x=1, y=1 and x=0,y=0).
Using n_components=5 also shows this behaviour:
Is this something inherent with GMMs, or is there an issue with the implementation, or am I doing something wrong?
Edit: getting scores from the model over the training range only seems to get rid of this behaviour -- should it?
I'm training both models on the same dataset (x=y from x=0 to x=1). Simply checking the probability via the GMM's score method over that range seems to eliminate this boundary effect. Why is this? I've attached the plots and code below.
# Creates a line of 'observations' between (x_small_start, x_small_end)
# and (y_small_start, y_small_end). This is the data both gmms are trained on.
x_small_start = 0
x_small_end = 1
y_small_start = 0
y_small_end = 1
# These are the range of values that will be plotted
x_big_start = -1
x_big_end = 2
y_big_start = -1
y_big_end = 2
shorter_eval_range_gmm = mixture.GMM(n_components = 5)
longer_eval_range_gmm = mixture.GMM(n_components = 5)
x_small = np.linspace(x_small_start, x_small_end, 100)
y_small = np.linspace(y_small_start, y_small_end, 100)
x_big = np.linspace(x_big_start, x_big_end, 100)
y_big = np.linspace(y_big_start, y_big_end, 100)
#Train both gmms on a distribution that's centered along y=x
shorter_eval_range_gmm.fit(zip(x_small,y_small))
longer_eval_range_gmm.fit(zip(x_small,y_small))
#Create the x,y meshes that will be used to make a 3D plot
x_y_evals_grid_big = []
for x in x_big:
    for y in y_big:
        x_y_evals_grid_big.append([x, y])

x_y_evals_grid_small = []
for x in x_small:
    for y in y_small:
        x_y_evals_grid_small.append([x, y])

#Calculate a probability for each point in the x,y grid.
x_y_z_plot_grid_big = []
for x, y in x_y_evals_grid_big:
    z = longer_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_big.append([x, y, z])
x_y_z_plot_grid_big = np.array(x_y_z_plot_grid_big)

x_y_z_plot_grid_small = []
for x, y in x_y_evals_grid_small:
    z = shorter_eval_range_gmm.score([[x, y]])
    x_y_z_plot_grid_small.append([x, y, z])
x_y_z_plot_grid_small = np.array(x_y_z_plot_grid_small)
#Plot probabilities on the Z axis.
fig = plt.figure()
fig.suptitle("Probability of different x,y pairs")
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot(x_y_z_plot_grid_big[:,0], x_y_z_plot_grid_big[:,1], np.exp(x_y_z_plot_grid_big[:,2]))
ax1.set_xlabel('X Label')
ax1.set_ylabel('Y Label')
ax1.set_zlabel('Probability')
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
ax2.plot(x_y_z_plot_grid_small[:,0], x_y_z_plot_grid_small[:,1], np.exp(x_y_z_plot_grid_small[:,2]))
ax2.set_xlabel('X Label')
ax2.set_ylabel('Y Label')
ax2.set_zlabel('Probability')
plt.show()
The problem is not with the fit, but with the visualisation you're using. A hint is the straight line connecting (0,1,5) to (0,1,0), which is just a rendering of the connection between two consecutive points (an artefact of the order in which the points are read). Although the two points at its extrema are in your data, no other point on this line actually is.
Personally, I think it is a rather bad idea to use 3d plots (wires) to represent a surface for the reason mentioned above, and I would recommend surface plots or contour plots instead.
Try this:
from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.atleast_2d(np.linspace(0, 1, 100)).T
ys = np.atleast_2d(np.linspace(0, 1, 100)).T
#Create a distribution that's centred along y=x
line_model.fit(np.concatenate([xs, ys], axis=1))
plt.scatter(xs, ys)
plt.show()
#Create the x,y mesh that will be used to make a 3D plot
X, Y = np.meshgrid(xs, ys)
x_y_grid = np.c_[X.ravel(), Y.ravel()]
#Calculate a probability for each point in the x,y grid.
z = line_model.score(x_y_grid)
z = z.reshape(X.shape)
#Plot probabilities on the Z axis.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, z)
plt.show()
From an academic point of view, I am quite uncomfortable with the goal of fitting a 1D line in a 2D space with a 2D mixture model. Manifold learning with GMMs requires at least the normal direction to have zero variance, thus reducing to a Dirac distribution. Numerically and analytically this is unstable and should be avoided (there seems to be some stabilising trick in the GMM fit, since the variance of the model is rather large in the direction of the normal to the straight line).
It is also recommended to use plt.scatter rather than plt.plot when drawing data, since there is no reason to connect the dots when you're fitting their joint distribution.
Hope this helps to shed some light on your problem.
EDIT:
This is not correct. After talking with Ronald P., it turns out you can't get Gibbs effects here because the Gaussians cannot compensate each other by "going negative", as probability is strictly > 0. This seems to be a simple plotting issue... see his answer instead! Either way, I would recommend using 2D data to test GMMs, rather than a 1D line.
The GMM is fitting to the data you gave it - specifically:
xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)
Because the data ends at 0 and 1, the GMM is attempting to model that fact: -0.01 and 1.01 are technically outside the trained data range and should be scored with very low probabilities. In doing so, it ends up creating a Gaussian with smaller spread (smaller covariance / higher precision) to cover the ends of the data and model the fact that the data stops.
I would expect that adding enough Gaussians would lead to a pseudo-Gibbs phenomenon, and you can kind of see that happening in the change from 5 to 99 components. To model the edges exactly, you would need an infinite mixture model. This is analogous to infinite frequency components -- you are representing a "signal" with a set of basis functions (in this case, Gaussians) in the GMM as well!
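For reference, mixture.GMM has since been removed from scikit-learn. With a current version, the same experiment might look like this (a sketch using GaussianMixture, whose score_samples returns per-sample log-likelihoods):
import numpy as np
from sklearn.mixture import GaussianMixture

xs = np.linspace(0, 1, 100)
data = np.column_stack([xs, xs])              # points along y = x
gmm = GaussianMixture(n_components=5).fit(data)

# evaluate the fitted density on a grid for a surface/contour plot
X, Y = np.meshgrid(xs, xs)
log_density = gmm.score_samples(np.c_[X.ravel(), Y.ravel()]).reshape(X.shape)
density = np.exp(log_density)                 # ready for plot_surface / contourf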

Plotting mplot3d / axes3D xyz surface plot with log scale?

I've been looking high and low for a solution to this simple problem but I can't find it anywhere! There are loads of posts detailing semilog/loglog plotting of data in 2D, e.g. plt.xscale('log'), but I'm interested in using log scales on a 3D plot (mplot3d).
I don't have the exact code to hand and so can't post it here; however, the simple example below should be enough to explain the situation. I'm currently using Matplotlib 0.99.1 but should shortly be updating to 1.0.0 - I know I'll have to update my code for the mplot3d implementation.
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FixedLocator, FormatStrFormatter
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = Axes3D(fig)
X = np.arange(-5, 5, 0.025)
Y = np.arange(-5, 5, 0.025)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.jet, extend3d=True)
ax.set_zlim3d(-1.01, 1.01)
ax.w_zaxis.set_major_locator(LinearLocator(10))
ax.w_zaxis.set_major_formatter(FormatStrFormatter('%.03f'))
fig.colorbar(surf)
plt.show()
The above code will plot fine in 3D, but the three scales (X, Y, Z) are all linear. My 'Y' data spans several orders of magnitude (like 9!), so it would be very useful to plot it on a log scale. I can work around this by taking the log of 'Y', recreating the numpy array and plotting log(Y) on a linear scale, but in true Python style I'm looking for a smarter solution which will plot the data on a log scale.
Is it possible to produce a 3D surface plot of my XYZ data using log scales? Ideally I'd like X and Z on linear scales and Y on a log scale.
Any help would be greatly appreciated. Please forgive any obvious mistakes in the above example; as mentioned, I don't have my exact code to hand and so have altered a matplotlib gallery example from memory.
Thanks
Since I encountered the same question and Alejandro's answer did not produce the desired results, here is what I have found out so far.
Log scaling for axes in 3D is an ongoing issue in matplotlib. Currently you can only relabel the axes with:
ax.yaxis.set_scale('log')
However, this will not cause the axis to be scaled logarithmically, only labelled logarithmically.
ax.set_yscale('log') will cause an exception in 3D; see GitHub issue 209.
Therefore you still have to recreate the numpy array.
I came up with a nice and easy solution taking inspiration from Issue 209. You define a small formatter function in which you set your own notation.
import matplotlib.ticker as mticker

# My axis should display 10⁻¹ but you can switch to e-notation 1.00e+01
def log_tick_formatter(val, pos=None):
    return f"$10^{{{int(val)}}}$"  # remove int() if you don't use MaxNLocator
    # return f"{10**val:.2e}"      # e-notation

ax.zaxis.set_major_formatter(mticker.FuncFormatter(log_tick_formatter))
ax.zaxis.set_major_locator(mticker.MaxNLocator(integer=True))
set_major_locator restricts the exponents to integers (10⁻¹, 10⁻², without 10^-1.5 etc.).
Important: remove the int() cast in the return statement if you don't use set_major_locator and you want to display 10^-1.5; otherwise it will still print 10⁻¹ instead of 10^-1.5.
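If you do want fractional exponents shown, a variant of the formatter without the cast could look like this (a sketch):
def log_tick_formatter(val, pos=None):
    return f"$10^{{{val:g}}}$"  # keeps fractional exponents such as 10^{-1.5}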
Example:
Try it yourself!
from mpl_toolkits.mplot3d import axes3d
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
fig = plt.figure(figsize=(11,8))
ax1 = fig.add_subplot(121,projection="3d")
# Grab some test data.
X, Y, Z = axes3d.get_test_data(0.05)
# Now Z has a range from 10⁻³ until 10³, so 6 magnitudes
Z = (np.full((120, 120), 10)) ** (Z / 20)
ax1.plot_wireframe(X, Y, Z, rstride=10, cstride=10)
ax1.set(title="Linear z-axis (small values not visible)")
def log_tick_formatter(val, pos=None):
    return f"$10^{{{int(val)}}}$"
ax2 = fig.add_subplot(122,projection="3d")
# You still have to take log10(Z), but that's just one operation
ax2.plot_wireframe(X, Y, np.log10(Z), rstride=10, cstride=10)
ax2.zaxis.set_major_formatter(mticker.FuncFormatter(log_tick_formatter))
ax2.zaxis.set_major_locator(mticker.MaxNLocator(integer=True))
ax2.set(title="Logarithmic z-axis (much better)")
plt.savefig("LinearLog.png", bbox_inches='tight')
plt.show()
On macOS: running ax.zaxis._set_scale('log') (notice the underscore) worked for me.
There is no solution because of issue 209. However, you can try doing this:
ax.plot_surface(X, np.log10(Y), Z, cmap='jet', linewidth=0.5)
If there is a 0 in Y, a warning will appear, but it still works. Because of this warning, colour maps don't work, so try to avoid 0 and negative numbers. For example:
Y[Y != 0] = np.log10(Y[Y != 0])
ax.plot_surface(X, Y, Z, cmap='jet', linewidth=0.5)
I wanted a symlog plot and, since I fill the data array by hand, I just made a custom function to calculate the log, to avoid having negative bars in the bar3d when the data is < 1:
import math

def manual_log(data):
    if data < 10:  # linear scaling up to 1
        return data / 10
    else:          # log scale above 1
        return math.log10(data)
Since I have no negative values, I did not implement handling them in this function, but it should not be hard to add.
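A vectorized variant of the same idea (a sketch, again assuming non-negative data) avoids the per-element Python call when transforming whole arrays:
import numpy as np

def manual_log(data):
    data = np.asarray(data, dtype=float)
    out = data / 10.0                  # linear part below 10
    mask = data >= 10
    out[mask] = np.log10(data[mask])   # log part at and above 10
    return out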
