Python: Plot residuals on a fitted model

I want to plot the lines (residuals; cyan lines) between data points and the estimated model. Currently I'm doing so by iterating over all data points in my income pandas.DataFrame and adding vertical lines. x, y are the points' coordinates and predicted are the predictions (here the blue line).
plt.scatter(income["Education"], income["Income"], c='red')
plt.ylim(0,100)
for indx, (x, y, _, _, predicted) in income.iterrows():
    plt.axvline(x, y/100, predicted/100) # /100 because it needs floats [0,1]
Is there a more efficient way? This doesn't seem like a good approach for more than a few rows.

First of all, note that axvline only works here by coincidence. In general, the ymin and ymax values taken by axvline are fractions of the axes height (0 to 1), not data coordinates; dividing by 100 happens to match only because the y-limits are set to (0, 100).
In contrast, vlines uses data coordinates and also accepts arrays of values. It then creates a single LineCollection, which is more efficient than many individual lines.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1.2,1.2,20)
y = np.sin(x)
dy = (np.random.rand(20)-0.5)*0.5
fig, ax = plt.subplots()
ax.plot(x,y)
ax.scatter(x,y+dy)
ax.vlines(x,y,y+dy)
plt.show()
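Applied to the DataFrame from the question, this could look roughly like the sketch below. The column names "Education" and "Income" come from the question; the "predicted" column name, the synthetic data, and the straight-line fit are assumptions made only so the example runs on its own.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's `income` DataFrame.
rng = np.random.default_rng(0)
education = np.linspace(10, 22, 30)
income = pd.DataFrame({
    "Education": education,
    "Income": 5 * education - 40 + rng.normal(0, 6, education.size),
})

# Assumed model: a least-squares straight line; any array of predictions works.
slope, intercept = np.polyfit(income["Education"].to_numpy(),
                              income["Income"].to_numpy(), 1)
income["predicted"] = slope * income["Education"] + intercept

x = income["Education"].to_numpy()
y = income["Income"].to_numpy()
y_hat = income["predicted"].to_numpy()

fig, ax = plt.subplots()
ax.scatter(x, y, c="red")
ax.plot(x, y_hat, c="blue")
# One call draws every residual, in data coordinates, as a single LineCollection.
residual_lines = ax.vlines(x, y_hat, y, colors="cyan")
```

No loop over rows and no rescaling to [0, 1] is needed; call plt.show() afterwards to display the figure.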

Related

Check if seaborn scatterplot function is sampling data

I have plotted a seaborn scatter plot. My data consists of 5000 data points. Looking at the plot, I am definitely not seeing 5000 points, so I'm pretty sure some kind of sampling is performed by the seaborn scatterplot function. I want to know how many data points each point in the plot represents. In case it depends on the code, the code is as follows:
g = sns.scatterplot(x=data['x'], y=data['y'],hue=data['P'], s=40, edgecolor='k', alpha=0.8, legend="full")

Nothing would really suggest to me that seaborn is sampling your data. However, you can check the data in your axes g to be sure. Query the children of the axes for a PathCollection (scatter plot) object:
g.get_children()
It's probably the first item in the list that is returned. From there you can use get_offsets to retrieve the data and check its shape.
g.get_children()[0].get_offsets().shape
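Rather than relying on the scatter artist being the first child, one can filter the axes' collections by type. The following is a minimal sketch under assumed data (5000 random points standing in for the asker's data); it counts the points held by every PathCollection on the axes.

```python
import numpy as np
import seaborn as sns
from matplotlib.collections import PathCollection

# Hypothetical data standing in for the asker's 5000 points.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = rng.normal(size=5000)

ax = sns.scatterplot(x=x, y=y, s=40, edgecolor="k", alpha=0.8)

# Sum the points held by every scatter artist on the axes.
n_drawn = sum(len(c.get_offsets()) for c in ax.collections
              if isinstance(c, PathCollection))
print(n_drawn)
```

If the printed count equals the number of rows passed in, nothing was sampled away and any "missing" points are simply hidden by overlap.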

As far as I know, no sampling is performed. In the picture you have posted, you can see that most of the data points simply overlap, which might be the reason why you cannot see 5000 points. Try with fewer points and you will see that all of them get plotted.

In order to check whether or not Seaborn's scatter removes points, here is a way to see 5000 different points. No points seem to be missing.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
x = np.linspace(1, 100, 100)
y = np.linspace(1, 50, 50)
X, Y = np.meshgrid(x, y)
Z = (X * Y) % 25
X = np.ravel(X)
Y = np.ravel(Y)
Z = np.ravel(Z)
sns.scatterplot(x=X, y=Y, s=15, hue=Z, palette=plt.cm.plasma, legend=False)
plt.show()

Plotting dataset using griddata without cancelling outliers

I have an x, y, z dataset which contains a rather large number of points.
x and y are the positions while z is the actual observable at those coordinates.
Most coordinates have a zero value for z, while only a few of them define lines (with smoothly changing z) in the 2D map.
If I plot it with
scatter(x,y,c=z)
I get only very faint lines as the scatterpoints with color defined by z=0 are overlapping with the nonzero values of z. If I decrease the size of the points to reduce overlap, I can't see them anymore.
Here is an example of the best I could get using scatter (blue is zero z, other colors are non-zero z).
So, I thought of instead using
data = np.genfromtxt('data')
x=data[:,0]
y=data[:,1]*3.0
z=data[:,2]
grid_x, grid_y = np.mgrid[min(x):max(x):100, min(y):max(y):1000]
from scipy.interpolate import griddata
grid_z0 = griddata((x, y),z, (grid_x, grid_y), method='cubic')
im = imshow(grid_z0,origin="lower",extent=[0,0.175,-0.15,0.15]) # zoom in on specific part of data
to get a denser grid of points and possibly get wider lines due to the cubic interpolation of points around them.
However, it then seems like griddata is removing the non-zero z values, treating them as outliers, thus hiding any possible features, and the whole grid plots as zero z.
Is there any python/matplotlib/... feature or trick I am missing to plot this in a nice way?
I am trying to make plots that would look something like the ones you can see in Fig. 2 (2) of https://journals.aps.org/prb/abstract/10.1103/PhysRevB.93.0854092 (you can see the figure without downloading the paper), possibly with some kind of glow around the lines.
The data I used is in this dropbox link.

Of course you may change the scatter, e.g. to set the size of the points without energy to 0.
import matplotlib.pyplot as plt
import numpy as np
data = np.genfromtxt('data/some_solidstate_physics_data.txt')
x=data[:,0]; y=data[:,1]*3.0; z=data[:,2]
plt.scatter(x,y,c=z, s=np.log10(z+1), cmap="PuRd", vmin=-500)
plt.show()
Since the data is already gridded, there is no need to use griddata; it will only smooth out the data. Instead, just reshaping the data into a grid is enough.
import matplotlib.pyplot as plt
import numpy as np
data = np.genfromtxt('data/some_solidstate_physics_data.txt')
x=data[:,0]; y=data[:,1]*3.0; z=data[:,2]
ux = np.unique(x); uy = np.unique(y)
Z = z.reshape(len(ux),len(uy)).T
dx = np.diff(ux[:2])[0]; dy = np.diff(uy[:2])[0]
ext = [ux.min()-dx/2.,ux.max()+dx/2.,uy.min()-dy/2., uy.max()+dy/2.]
plt.imshow(Z, extent=ext, aspect="auto", cmap="magma", origin="lower") # origin="lower" puts the first row (smallest y) at the bottom, matching ext
plt.show()
Since the grid is very dense, it looks somewhat pixelated.
You may of course also bin your data into larger chunks, for example by joining the data of 3x3 pixels into one and taking the maximum value, using scipy.stats.binned_statistic_2d:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binned_statistic_2d
data = np.genfromtxt('data/some_solidstate_physics_data.txt')
x=data[:,0]; y=data[:,1]*3.0; z=data[:,2]
ux = np.unique(x); uy = np.unique(y)
h, ex, ey,_ = binned_statistic_2d(x, y, z, bins=[ux[::3],uy[::3]], statistic='max')
dx = np.diff(ex[:2])[0]; dy = np.diff(ey[:2])[0]
ext = [ux.min()-dx/2.,ux.max()+dx/2.,uy.min()-dy/2., uy.max()+dy/2.]
plt.imshow(h.T, extent=ext, aspect="auto", cmap="magma", origin="lower") # origin="lower" puts the first row (smallest y) at the bottom, matching ext
plt.show()
Having those techniques at your disposal you may then decide to beautify your result at the expense of quantitative accuracy.
E.g. applying a Gaussian filter, scipy.ndimage.gaussian_filter (the older scipy.ndimage.filters namespace is deprecated in recent SciPy versions), as well as interpolation="gaussian" in the plotting.
import matplotlib.pyplot as plt
import numpy as np
from scipy.ndimage import gaussian_filter
data = np.genfromtxt('data/some_solidstate_physics_data.txt')
x=data[:,0]; y=data[:,1]*3.0; z=data[:,2]
ux = np.unique(x); uy = np.unique(y)
Z = z.reshape(len(ux),len(uy)).T
Z = gaussian_filter(Z, 3)
dx = np.diff(ux[:2])[0]; dy = np.diff(uy[:2])[0]
ext = [ux.min()-dx/2.,ux.max()+dx/2.,uy.min()-dy/2., uy.max()+dy/2.]
plt.imshow(Z, extent=ext, aspect="auto", cmap="magma", interpolation="gaussian", origin="lower") # origin="lower" puts the first row (smallest y) at the bottom, matching ext
plt.show()
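The reshape used in the snippets above relies on the file listing the points with x varying slowest and y fastest. A minimal sketch that does not depend on the row order, as long as the grid is complete, sorts the points first. The helper name to_grid and the small shuffled test grid are hypothetical and not part of the original answer.

```python
import numpy as np

def to_grid(x, y, z):
    """Return (ux, uy, Z) with Z[i, j] holding the value at (ux[j], uy[i])."""
    ux, uy = np.unique(x), np.unique(y)
    if z.size != ux.size * uy.size:
        raise ValueError("points do not fill a complete regular grid")
    # Sort primarily by x, then by y, so that y varies fastest.
    order = np.lexsort((y, x))
    Z = z[order].reshape(ux.size, uy.size).T
    return ux, uy, Z

# Hypothetical test data: a 4 x 3 grid stored in shuffled order.
gx, gy = np.meshgrid(np.arange(4.0), np.arange(3.0), indexing="ij")
x, y = gx.ravel(), gy.ravel()
z = 10 * x + y
perm = np.random.default_rng(0).permutation(x.size)
ux, uy, Z = to_grid(x[perm], y[perm], z[perm])
```

The resulting Z has one row per unique y and one column per unique x, which is the layout imshow expects.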

Python - 3D Plotting, horizontal lines missing and incorrect gradients showing

I am new to 3D surface plots and I am trying to make a 3D plot of temperature as a function of distance and time using the following:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
t = np.arange(0,60,1)
z = np.arange(5,85,5)
fig=plt.figure(1)
ax = fig.add_subplot(projection='3d') # fig.gca(projection='3d') no longer works in recent Matplotlib
X, Y = np.meshgrid(z, t)
surface=ax.plot_surface(X,Y,T1, linewidth=1,cmap=cm.coolwarm, antialiased=False)
fig.colorbar(surface, shrink=0.5, aspect=5)
ax.view_init(ax.elev, ax.azim+90)
ax.set_title("Temperature Distribution 1")
ax.set_xlabel('z (cm)')
ax.set_ylabel('Time (min)')
ax.set_zlabel(r'Temperature ($^\circ$C)') # raw string so that \c is not treated as an escape sequence
ax.set_xticks(np.arange(0,80, 15))
plt.savefig("3D_1.png",format='png',dpi=1000,bbox_inches='tight')
plt.show()
T1 is 2D data. This produces the following:
There is only 1 horizontal line showing at around 60 cm, however I would like a horizontal line every 5 cm (data was taken every 5 cm). It seems there are only 2 sections to the plot along the distance axis. The colour gradients are showing in large blocks instead of varying as a function of temperature all along the length.
E.g. for times between 50 and 60 min and distances from 0 to ~40 cm, the temperature goes from ~180 to ~20 degrees, but the colour of that block is dark red all the way along it; instead it should start as dark red and reduce down to blue. How do I get the temperature to show the correct gradient along the whole length axis?
Also, the temperature legend is in %, instead of the temperature values in degrees. How do I fix this?

Looking at the documentation of plot_surface we find
The rstride and cstride kwargs set the stride used to sample the input data to generate the graph. If 1k by 1k arrays are passed in, the default values for the strides will result in a 100x100 grid being plotted. Defaults to 10.
Thus, using
surface=ax.plot_surface(X,Y,T1, rstride=8, cstride=8)
you get
While using
surface=ax.plot_surface(X,Y,T1, rstride=5, cstride=1)
you get
This is an example of how to create reproducible data for this case:
t = np.arange(0,60,1)
z = np.arange(5,85,5)
f = lambda z, t, z0, t0, sz, st: 180.*np.exp(-(z-z0)**2/sz**2 -(t-t0)**2/st**2)
X, Y = np.meshgrid(z, t)
T1 =f(X,Y,-20.,56, 40.,30.)
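Putting the pieces together, a self-contained sketch could look like the following. It uses the synthetic T1 from above as a stand-in for the measured data and sets both strides to 1 so that every row and column of T1 is drawn; this choice of stride is an assumption about what the asker wants, and larger strides trade detail for speed.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

t = np.arange(0, 60, 1)    # time (min)
z = np.arange(5, 85, 5)    # distance (cm)
X, Y = np.meshgrid(z, t)
# Synthetic stand-in for the measured temperatures (same formula as above).
T1 = 180.0 * np.exp(-(X + 20.0)**2 / 40.0**2 - (Y - 56.0)**2 / 30.0**2)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
# A stride of 1 uses every sample, so each 5 cm step gets its own band of faces.
surface = ax.plot_surface(X, Y, T1, rstride=1, cstride=1, cmap=cm.coolwarm)
fig.colorbar(surface, shrink=0.5, aspect=5)
ax.set_xlabel("z (cm)")
ax.set_ylabel("Time (min)")
ax.set_zlabel("Temperature (°C)")
```

Because every face is now coloured from its own temperature values, the colour changes smoothly along the distance axis and the colorbar is labelled in the units of T1. Call plt.show() to display the figure.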

matplotlib 2d histogram heatmap-- how do I use my dataset to make one?

I am new to python.
I have a dataset like
import numpy as np
from matplotlib import pyplot as plt
dats = np.array([[r1,x1,y1],[r2,x2,y2],...])
I would like to plot color intensity associated with r1,r2,... at the position (x1,y1), (x2,y2), et cetera respectively.
How can I get this data set manipulated in a format which matplotlib can use in a 2D histogram?
Any help much appreciated. I'll help others in return once I've gained some skill : o

In order to make a 2D histogram, your data set has to comprise two data values rather than one data value and two indices. Thus, you need two arrays: one containing the r1 values and one containing the r2 values. Your data does not have any r2 values; therefore, you cannot compute a two-dimensional histogram.
Regarding your question, you do not even want a histogram. You just want to visualise your r1 values on a plane. This is easy. Say, your array dats has a length of 100, then:
rs = dats[:, 0] # retrieve r-values from dats
plt.imshow(rs.reshape(10, 10), cmap='Greys', interpolation='None')
plt.colorbar()

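You can create interpolated data from a set of points using scipy.interpolate.griddata (the older matplotlib.mlab.griddata has been removed from Matplotlib). Assuming x = [x1, x2, etc], y = [y1, y2, etc] and r = [r1, r2, etc] are NumPy arrays, then,
from scipy.interpolate import griddata
# Set up a grid
xi = np.linspace(x.min(), x.max(), 100)
yi = np.linspace(y.min(), y.max(), 100)
Xi, Yi = np.meshgrid(xi, yi)
# Interpolate r onto the grid (NaN outside the convex hull of the points)
zi = griddata((x, y), r, (Xi, Yi), method='linear')
# Plot the colormap
cm = plt.pcolormesh(xi, yi, zi)
plt.colorbar()
plt.show()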
Other options include colouring scatter plots,
plt.scatter(x,y,c=r)
or there is a 2D histogram function in NumPy where you could set the weights based on r,
H, xedges, yedges = np.histogram2d(x, y, weights=r)
I haven't used the last one personally.
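For the weighted histogram, a small sketch may make the behaviour concrete. The four rows below are made-up values in the asker's r, x, y column layout; the bin count and range are likewise chosen only for illustration.

```python
import numpy as np

# Hypothetical data in the asker's layout: one row per point, columns r, x, y.
dats = np.array([
    [1.0, 0.1, 0.1],
    [2.0, 0.2, 0.3],
    [4.0, 0.8, 0.9],
    [8.0, 0.9, 0.7],
])
r, x, y = dats.T

# Each point adds its r value (instead of 1) to the bin it falls into.
H, xedges, yedges = np.histogram2d(x, y, bins=2,
                                   range=[[0, 1], [0, 1]], weights=r)
print(H)
```

Here H holds the sum of r per bin, with the first index running along x. To draw it as a heatmap, plt.pcolormesh(xedges, yedges, H.T) should work; dividing H by the unweighted histogram gives the mean r per bin instead of the sum.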

I think what you are looking for is not a histogram but a contour plot (a histogram would count the number of occurrences of a coordinate (x,y) falling into a bin).
If your data is not on a grid, you can use tricontourf:
plt.tricontourf(dats[:,1],dats[:,2],dats[:,0],cmap='hot')
plt.colorbar()
plt.show()
There are more ways to plot this, such as scatter plots etc.

Making a contour plot with solutions from systems of differential equations with pylab

So, I'm solving a system of differential equations numerically and I have x, y, z, each a solution. Each array is one-dimensional and, for example, x[0], y[0], z[0] go with a point in space. I want to graph these in a contour plot like the usual x, y, z coordinates, but it says I need z to be a 2D array. I know how to make a mesh from x and y, but how do I do this for z?
I have made a mesh out of x and y, but for z I don't know what to do.
If someone could give me insight it would be much appreciated.

It is not enough to just mesh in x and y; you need to grid your data on a regular grid to be able to do a contour plot. To do this you should look into griddata. The original example at http://matplotlib.org/examples/pylab_examples/griddata_demo.html used matplotlib.mlab.griddata, which has since been removed from Matplotlib; scipy.interpolate.griddata does the same job.
I'll paste the example code from the link below with some extra comments, adapted to the SciPy function:
from numpy.random import uniform, seed
from scipy.interpolate import griddata
import matplotlib.pyplot as plt
import numpy as np
# Here the code generates some x and y coordinates and some corresponding z values.
seed(0)
npts = 200
x = uniform(-2,2,npts)
y = uniform(-2,2,npts)
z = x*np.exp(-x**2-y**2)
# Here you define a grid (of arbitrary dimensions, but equal spacing) onto which your data will be mapped
xi = np.linspace(-2.1,2.1,100)
yi = np.linspace(-2.1,2.1,200)
# Map the data to the grid to get a 2D array of remapped z values (NaN outside the convex hull of the data)
zi = griddata((x, y), z, (xi[None,:], yi[:,None]), method='linear')
zmax = np.nanmax(np.abs(zi))
# contour the gridded data, plotting dots at the nonuniform data points.
CS = plt.contour(xi,yi,zi,15,linewidths=0.5,colors='k')
CS = plt.contourf(xi,yi,zi,15,cmap=plt.cm.rainbow,
                  vmax=zmax, vmin=-zmax)
plt.colorbar() # draw colorbar
# Plot the original sampling
plt.scatter(x,y,marker='o',c='b',s=5,zorder=10)
plt.xlim(-2,2)
plt.ylim(-2,2)
plt.title('griddata test (%d points)' % npts)
plt.show()

It looks like you are looking for line or scatter plots instead of contour.
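If the three solution arrays describe a trajectory, a 3D line plot shows it directly without any gridding. The sketch below uses a helix as a stand-in for the solver output; the arrays x, y, z here are assumed data, not the asker's solution.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the ODE solution: three 1D arrays of equal length.
s = np.linspace(0, 6 * np.pi, 300)
x, y, z = np.cos(s), np.sin(s), 0.1 * s

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
# Each (x[i], y[i], z[i]) is one point along the trajectory.
ax.plot(x, y, z)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
```

Replacing ax.plot with ax.scatter draws the individual points instead of a connected line. Call plt.show() to display the figure.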
