binned_statistic_2d producing unexpected negative values

I'm using scipy.stats.binned_statistic_2d and then plotting the output. With stat="count" I have no problems. With stat="mean" (or np.max, for that matter) I end up with negative values in each bin, as shown by the color bar, which should not happen because I constructed the z values so that they are always greater than zero. Does anyone know why this is the case? I've included the minimal code I use to generate the plots. I also get an "invalid value" RuntimeWarning, which makes me think that something strange is going on in binned_statistic_2d. The following code should copy and run as is.
From the documentation:
'count' : compute the count of points within each bin. This is
identical to an unweighted histogram. `values` array is not
referenced.
which leads me to believe that there might be something going on in binned_statistic_2d and how it handles z-values.
import numpy as _np
import scipy as _scipy
import matplotlib as _mpl
import scipy.stats
from matplotlib import pyplot as _plt
norm_args = (0, 3, int(1e5)) # loc, scale, size
x = _np.random.random(norm_args[-1]) # xvals can be log scaled.
y = _np.random.normal(*norm_args) #_np.random.random(norm_args[-1]) #
z = _np.abs(_np.random.normal(1e2, *norm_args[1:]))
nbins = int(1e2) # use an integer bin count
kwargs = {}
stat = _np.max
fig, ax = _plt.subplots()
binned_stats = _scipy.stats.binned_statistic_2d(x, y, z, stat,
                                                nbins)
H, xedges, yedges, binnumber = binned_stats
Hplot = H
if isinstance(stat, str):
    cbar_title = stat.title()
elif callable(stat): # functions such as np.max
    cbar_title = stat.__name__.title()
XX, YY = _np.meshgrid(xedges, yedges)
Image = ax.pcolormesh(XX, YY, Hplot.T) #norm=norm,
ax.autoscale(tight=True)
grid_kargs = {'orientation': 'vertical'}
cax, kw = _mpl.colorbar.make_axes_gridspec(ax, **grid_kargs)
cbar = fig.colorbar(Image, cax=cax)
cbar.set_label(cbar_title)
Here's the runtime warning:
/Users/balterma/Library/Enthought/Canopy_64bit/User/lib/python2.7/sitepackages/matplotlib/colors.py:584: RuntimeWarning: invalid value encountered in less
  cbook._putmask(xa, xa < 0.0, -1)
Image with mean:
Image with max:
Image with count:

Turns out the problem was interfacing with plt.pcolormesh. I had to convert the output array from binned_statistic_2d to a masked array that masked the NaNs.
Here's the question that gave me the answer:
pcolormesh with missing values?
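A minimal sketch of that fix, assuming the same kind of data as in the question: bins that receive no points come back as NaN for "mean", and np.ma.masked_invalid hides them before plotting. The variable names, the seed, and the headless backend are choices made for this illustration, not part of the original post.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic_2d

rng = np.random.default_rng(0)
n = 10_000
x = rng.random(n)
y = rng.normal(0, 3, n)
z = np.abs(rng.normal(100, 3, n))  # strictly positive values

H, xedges, yedges, _ = binned_statistic_2d(x, y, z, statistic="mean", bins=100)

# Empty bins are NaN for "mean"; mask them so they are skipped when coloring.
H_masked = np.ma.masked_invalid(H)

fig, ax = plt.subplots()
mesh = ax.pcolormesh(xedges, yedges, H_masked.T)
fig.colorbar(mesh, ax=ax, label="Mean")
```

With the mask in place the color scale is computed from the populated bins only, so it should start near the smallest bin mean rather than at a spurious negative value.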

Related

Using streamplot function in Python for stretched grid

I am trying to plot (magnetic) field lines. I learned that streamplot is the tool used to plot field lines. However, I get the error ValueError: 'x' values must be equally spaced.
Please note that the code uses Fourier basis along x so it is equally spaced along x and uses Chebyshev basis along y and hence it is non-uniform grid along y. I do not understand why the code says "x must be uniformly spaced" when it's uniform along x.
The other thing I want to know is, how to plot field lines for a non-uniform grid?
Also is there any other method to plot field lines apart from streamplot?
Code that I am running is attached below for reference:
import numpy as np
import h5py
import matplotlib.pyplot as plt
f = h5py.File('./B_mm/Re_10k/Bx_v24/Bx_v24_s1.h5', 'r')
y = f['/scales/y/1.0'][:]
x = f['/scales/x/1.0'][:]
Bxtotal = f['/tasks/Bxtotal'][:]
Bytotal = f['/tasks/Bytotal'][:]
t = f['scales']['sim_time'][:]
print(np.shape(y))
print('y = ', y)
print(np.shape(x))
print('x= ', x)
X, Y = np.meshgrid(y,x)
print(np.shape(X))
print(np.shape(Y))
Bx = Bxtotal[len(Bxtotal)-1, :, :]
By = Bytotal[len(Bytotal)-1, :, :]
print(np.shape(Bx))
print(np.shape(By))
plt.figure(figsize=(4,4),facecolor="w")
plt.streamplot(X, Y, Bx, By, color='k', density=1.3, minlength=0.9, arrowstyle='-')
plt.show()
And this is my error message: ValueError: 'x' values must be equally spaced.
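One possible workaround, sketched here and not taken from the thread: since np.meshgrid(y, x) puts the non-uniform Chebyshev coordinate into the first argument of streamplot, the error probably comes from that swap. The sketch below resamples the fields onto a grid that is uniform in both directions before calling streamplot. It assumes the field arrays have shape (len(x), len(y)); the helper name to_uniform_grid and the synthetic data are hypothetical stand-ins for the HDF5 output.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch; omit for interactive use
import matplotlib.pyplot as plt
from scipy.interpolate import RegularGridInterpolator


def to_uniform_grid(x, y, Bx, By, ny_uniform=None):
    """Resample fields given on (uniform x, non-uniform y) onto a uniform grid.

    Assumes Bx and By have shape (len(x), len(y)); this layout is a guess
    about the data in the question.
    """
    order = np.argsort(y)                 # make the y coordinate ascending
    y_sorted = y[order]
    ny_uniform = ny_uniform or len(y)
    y_uniform = np.linspace(y_sorted[0], y_sorted[-1], ny_uniform)

    # streamplot expects 2-D arrays of shape (len(y), len(x))
    X, Y = np.meshgrid(x, y_uniform)
    points = np.column_stack([X.ravel(), Y.ravel()])

    def resample(field):
        interp = RegularGridInterpolator((x, y_sorted), field[:, order])
        return interp(points).reshape(X.shape)

    return X, Y, resample(Bx), resample(By)


# Synthetic stand-in for the simulation output (hypothetical data).
x = np.linspace(0.0, 2.0 * np.pi, 32)
y = -np.cos(np.pi * (np.arange(24) + 0.5) / 24)   # Chebyshev-like, non-uniform
xx, yy = np.meshgrid(x, y, indexing="ij")           # shape (len(x), len(y))
Bx, By = 1.0 + yy, 0.5 * xx

X, Y, Bx_u, By_u = to_uniform_grid(x, y, Bx, By)
fig, ax = plt.subplots(figsize=(4, 4))
ax.streamplot(X, Y, Bx_u, By_u, color="k", density=1.3, arrowstyle="-")
```

Linear interpolation is used here for simplicity; for a spectral code a finer uniform grid or a higher-order interpolant may be preferable.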

Simulate the compound random variable S

Let S = X_1 + X_2 + ... + X_N, where N is a nonnegative integer-valued random variable and X_1, X_2, ... are i.i.d. random variables. (If N = 0, we set S = 0.)
Simulate S in the case where N ~ Poi(100) and X_i ~ Exp(0.5) (draw histograms and use the NumPy or SciPy built-in functions), and check the equations E(S) = E(N)*E(X_1) and Var(S) = E(N)*Var(X_1) + E(X_1)^2*Var(N).
I was trying to solve it, but I'm not sure of everything yet and got stuck on the histogram part. Note: I'm new to Python and, more generally, new to programming.
My work:
import scipy.stats as stats
import matplotlib as plt
N = stats.poisson(100)
X = stats.expon(0.5)
arr = X.rvs(N.rvs())
S = 0
for i in arr:
    S=S+i
print(arr)
print("S=",S)
expected_S = (N.mean())*(X.mean())
variance_S = (N.mean()*X.var()) + (X.mean()*X.mean()*N.var())
print("E(X)=",expected_S)
print("Var(S)=",variance_S)

Your existing code mostly looks sensible, but I'd simplify:
arr = X.rvs(N.rvs())
S = 0
for i in arr:
    S=S+i
down to:
S = X.rvs(N.rvs()).sum()
To draw a histogram, you need many samples from this distribution, which is now easily accomplished via:
arr = []
for _ in range(10_000):
    arr.append(X.rvs(N.rvs()).sum())
or, equivalently, using a list comprehension:
arr = [X.rvs(N.rvs()).sum() for _ in range(10_000)]
To plot these in a histogram, you need the pyplot module from Matplotlib, so your import should be:
import matplotlib.pyplot as plt
plt.hist(arr, 50)
The 50 above says to use that number of "bins" when drawing the histogram. We can also compare these to the mean and variance you calculated by assuming the distribution is well approximated by a normal:
import numpy as np
approx = stats.norm(expected_S, np.sqrt(variance_S))
_, x, _ = plt.hist(arr, 50, density=True)
plt.plot(x, approx.pdf(x))
This works because the second value returned by Matplotlib's hist method is the array of bin edges. I used density=True so I could work with probability densities, but another option would be to multiply the densities by the number of samples and the bin width to get expected counts like the previous histogram.
Running this gives me:
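As a separate numerical check of the two identities in the question, here is a minimal sketch using NumPy's random generator. It assumes Exp(0.5) means rate 0.5, i.e. mean 2; in SciPy's parameterization that would be stats.expon(scale=2), whereas stats.expon(0.5) shifts a unit-rate exponential by 0.5. The seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

lam = 100.0          # N ~ Poisson(lam)
rate = 0.5           # X_i ~ Exp(rate); assumed to be a rate, so the mean is 1 / rate
scale = 1.0 / rate

n_samples = 10_000
counts = rng.poisson(lam, size=n_samples)
# One draw of S per simulated N; an empty draw sums to 0, matching S = 0 when N = 0.
samples = np.array([rng.exponential(scale, size=k).sum() for k in counts])

mean_theory = lam * scale                      # E(N) * E(X_1)
var_theory = lam * scale**2 + scale**2 * lam   # E(N) Var(X_1) + E(X_1)^2 Var(N)

print("E(S):   simulated", samples.mean(), "theory", mean_theory)
print("Var(S): simulated", samples.var(ddof=1), "theory", var_theory)
```

Under these assumptions the theoretical values are E(S) = 200 and Var(S) = 800, and the simulated values should land close to them.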

get bins coordinates with hexbin in matplotlib

I use matplotlib's method hexbin to compute 2d histograms on my data.
But I would like to get the coordinates of the centers of the hexagons in order to further process the results.
I got the values using get_array() method on the result, but I cannot figure out how to get the bins coordinates.
I tried to compute them from the number of bins and the extent of my data, but I don't know the exact number of bins in each direction. gridsize=(10,2) should do the trick, but it does not seem to work.
Any idea?

I think this works.
from __future__ import division
import numpy as np
import math
import matplotlib.pyplot as plt
def generate_data(n):
    """Make random, correlated x & y arrays"""
    points = np.random.multivariate_normal(mean=(0,0),
        cov=[[0.4,9],[9,10]],size=int(n))
    return points
if __name__ =='__main__':
    color_map = plt.cm.Spectral_r
    n = 1e4
    points = generate_data(n)
    xbnds = np.array([-20.0,20.0])
    ybnds = np.array([-20.0,20.0])
    extent = [xbnds[0],xbnds[1],ybnds[0],ybnds[1]]
    fig=plt.figure(figsize=(10,9))
    ax = fig.add_subplot(111)
    x, y = points.T
    # Set gridsize just to make them visually large
    image = plt.hexbin(x,y,cmap=color_map,gridsize=20,extent=extent,mincnt=1,bins='log')
    # Note that mincnt=1 adds 1 to each count
    counts = image.get_array()
    ncnts = np.count_nonzero(np.power(10,counts))
    verts = image.get_offsets()
    # range works on both Python 2 and 3 (the original used xrange)
    for offc in range(verts.shape[0]):
        binx,biny = verts[offc][0],verts[offc][1]
        if counts[offc]:
            plt.plot(binx,biny,'k.',zorder=100)
    ax.set_xlim(xbnds)
    ax.set_ylim(ybnds)
    plt.grid(True)
    cb = plt.colorbar(image,spacing='uniform',extend='max')
    plt.show()

I would love to confirm that the code by Hooked using get_offsets() works, but I tried several iterations of the code above to retrieve the center positions and, as Dave mentioned, get_offsets() remains empty. The workaround I found is to use the non-empty image.get_paths() instead. My code takes the mean of the vertices to find the centers, which makes it a smidge longer, but it does work.
get_paths() returns a set of paths, each holding the x,y coordinates of one hexagon's vertices; these can be looped over and averaged to give the center position of each hexagon.
The code that I have is as follows:
counts=image.get_array() #counts in each hexagon, works great
verts=image.get_offsets() #empty, don't use this
b=image.get_paths() #this does work, gives Path([[]][]) which can be plotted
for x in range(len(b)):
    xav=np.mean(b[x].vertices[0:6,0]) #center in x (RA)
    yav=np.mean(b[x].vertices[0:6,1]) #center in y (DEC)
    plt.plot(xav,yav,'k.',zorder=100)
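Whether get_offsets() or get_paths() carries the centers appears to depend on the Matplotlib version, which would explain the conflicting reports above. A minimal sketch that checks which one applies before relying on it (the data, seed, and gridsize are made up for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch; omit for interactive use
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = rng.normal(size=2000)

fig, ax = plt.subplots()
hb = ax.hexbin(x, y, gridsize=15, mincnt=1)

counts = np.asarray(hb.get_array())     # one value per drawn hexagon
centers = np.asarray(hb.get_offsets())  # (n_hexagons, 2) centers, if populated

if len(centers) == len(counts):
    # Print the first few (center_x, center_y, count) triples.
    for (cx, cy), c in list(zip(centers, counts))[:3]:
        print(cx, cy, c)
else:
    print("offsets not populated in this Matplotlib version; fall back to get_paths()")
```

In recent releases the offsets should line up one-to-one with get_array(); if they do not, averaging the vertices from get_paths() as shown above remains an option.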

I had this same problem. I think what needs to be developed is a framework with a HexagonalGrid object that can be applied to many different data sets (and it would be awesome to do it for N dimensions). This is possible, and it surprises me that neither SciPy nor NumPy has anything for it (furthermore, there seems to be nothing else like it except perhaps binify).
That said, I assume you want to use hexbinning to compare multiple binned data sets. This requires some common base. I got this to work using matplotlib's hexbin the following way:
import numpy as np
import matplotlib.pyplot as plt
def get_data (mean,cov,n=1e3):
    """
    Quick fake data builder
    """
    np.random.seed(101)
    points = np.random.multivariate_normal(mean=mean,cov=cov,size=int(n))
    x, y = points.T
    return x,y
def get_centers (hexbin_output):
    """
    about 40% faster than previous post only cause you're not calculating the
    min/max every time
    """
    paths = hexbin_output.get_paths()
    v = paths[0].vertices[:-1] # adds a value [0,0] to the end
    vx,vy = v.T
    idx = [3,0,5,2] # index for [xmin,xmax,ymin,ymax]
    xmin,xmax,ymin,ymax = vx[idx[0]],vx[idx[1]],vy[idx[2]],vy[idx[3]]
    half_width_x = abs(xmax-xmin)/2.0
    half_width_y = abs(ymax-ymin)/2.0
    centers = []
    # range works on both Python 2 and 3 (the original used xrange)
    for i in range(len(paths)):
        cx = paths[i].vertices[idx[0],0]+half_width_x
        cy = paths[i].vertices[idx[2],1]+half_width_y
        centers.append((cx,cy))
    return np.asarray(centers)
# important parts ==>
class Hexagonal2DGrid (object):
    """
    Used to fix the gridsize, extent, and bins
    """
    def __init__ (self,gridsize,extent,bins=None):
        self.gridsize = gridsize
        self.extent = extent
        self.bins = bins
def hexbin (x,y,hexgrid):
    """
    To hexagonally bin the data in 2 dimensions
    """
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Note mincnt=0 so that it will return a value for every point in the
    # hexgrid, not just those with count>mincnt
    # Basically you fix the gridsize, extent, and bins to keep them the same
    # then the resulting count array is the same
    hexbin = plt.hexbin(x,y, mincnt=0,
                        gridsize=hexgrid.gridsize,
                        extent=hexgrid.extent,
                        bins=hexgrid.bins)
    # you could close the figure if you don't want it
    # plt.close(fig.number)
    counts = hexbin.get_array().copy()
    return counts, hexbin
# Example ===>
if __name__ == "__main__":
    hexgrid = Hexagonal2DGrid((21,5),[-70,70,-20,20])
    x_data,y_data = get_data((0,0),[[-40,95],[90,10]])
    x_model,y_model = get_data((0,10),[[100,30],[3,30]])
    counts_data, hexbin_data = hexbin(x_data,y_data,hexgrid)
    counts_model, hexbin_model = hexbin(x_model,y_model,hexgrid)
    # if you want the centers, they will be the same for both
    centers = get_centers(hexbin_data)
    # if you want to ignore the cells with zeros then use the following mask.
    # But if want zeros for some bins and not others I'm not sure an elegant way
    # to do this without using the centers
    nonzero = counts_data != 0
    # now you can compare the two data sets
    variance_data = counts_data[nonzero]
    square_diffs = (counts_data[nonzero]-counts_model[nonzero])**2
    chi2 = np.sum(square_diffs/variance_data)
    print(" chi2={}".format(chi2))

Python interp1d vs. UnivariateSpline

I'm trying to port some MATLAB code over to SciPy, and I've tried two different functions from scipy.interpolate, interp1d and UnivariateSpline. The interp1d results match the interp1d MATLAB function, but the UnivariateSpline numbers come out different, and in some cases very different.
f = interp1d(row1,row2,kind='cubic',bounds_error=False,fill_value=numpy.max(row2))
return f(interp)
f = UnivariateSpline(row1,row2,k=3,s=0)
return f(interp)
Could anyone offer any insight? My x vals aren't equally spaced, although I'm not sure why that would matter.

I just ran into the same issue.
Short answer
Use InterpolatedUnivariateSpline instead:
f = InterpolatedUnivariateSpline(row1, row2)
return f(interp)
Long answer
UnivariateSpline is a 'one-dimensional smoothing spline fit to a given set of data points' whereas InterpolatedUnivariateSpline is a 'one-dimensional interpolating spline for a given set of data points'. The former smooths the data, whereas the latter is a more conventional interpolation method and reproduces the results expected from interp1d. The figure below illustrates the difference.
The code to reproduce the figure is shown below.
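import scipy.interpolate as ip
from numpy import linspace, pi, sin
from matplotlib.pyplot import (subplot, plot, ylim, legend, ylabel, xlabel,
                               tight_layout)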
#Define independent variable
sparse = linspace(0, 2 * pi, num = 20)
dense = linspace(0, 2 * pi, num = 200)
#Define function and calculate dependent variable
f = lambda x: sin(x) + 2
fsparse = f(sparse)
fdense = f(dense)
ax = subplot(2, 1, 1)
#Plot the sparse samples and the true function
plot(sparse, fsparse, label = 'Sparse samples', linestyle = 'None', marker = 'o')
plot(dense, fdense, label = 'True function')
#Plot the different interpolation results
interpolate = ip.InterpolatedUnivariateSpline(sparse, fsparse)
plot(dense, interpolate(dense), label = 'InterpolatedUnivariateSpline', linewidth = 2)
smoothing = ip.UnivariateSpline(sparse, fsparse)
plot(dense, smoothing(dense), label = 'UnivariateSpline', color = 'k', linewidth = 2)
ip1d = ip.interp1d(sparse, fsparse, kind = 'cubic')
plot(dense, ip1d(dense), label = 'interp1d')
ylim(.9, 3.3)
legend(loc = 'upper right', frameon = False)
ylabel('f(x)')
#Plot the fractional error
subplot(2, 1, 2, sharex = ax)
plot(dense, smoothing(dense) / fdense - 1, label = 'UnivariateSpline')
plot(dense, interpolate(dense) / fdense - 1, label = 'InterpolatedUnivariateSpline')
plot(dense, ip1d(dense) / fdense - 1, label = 'interp1d')
ylabel('Fractional error')
xlabel('x')
ylim(-.1,.15)
legend(loc = 'upper left', frameon = False)
tight_layout()
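The difference between smoothing and interpolating can also be checked numerically, without a figure. The sketch below uses made-up noisy samples and compares the residuals at the data points; per the SciPy documentation, InterpolatedUnivariateSpline is equivalent to UnivariateSpline with s=0, so that case is included as well.

```python
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline, UnivariateSpline

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 20)
y = np.sin(x) + 2 + rng.normal(0.0, 0.5, x.size)    # noisy samples (made-up data)

interpolating = InterpolatedUnivariateSpline(x, y)  # passes through every point
smoothing = UnivariateSpline(x, y)                  # default s: allowed to miss points
forced = UnivariateSpline(x, y, s=0)                # s=0 forces interpolation

for name, spline in [("InterpolatedUnivariateSpline", interpolating),
                     ("UnivariateSpline (default s)", smoothing),
                     ("UnivariateSpline (s=0)", forced)]:
    print(name, "max |residual| at the data points:", np.abs(spline(x) - y).max())
```

Only the default-s spline should show a residual noticeably above rounding error, which is the smoothing behavior described in the long answer.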

The reason why the results are different (but both likely correct) is that the interpolation routines used by UnivariateSpline and interp1d are different.
interp1d constructs a smooth B-spline using the x-points you gave to it as knots
UnivariateSpline is based on FITPACK, which also constructs a smooth B-spline. However, FITPACK tries to choose new knots for the spline, to fit the data better (probably to minimize chi^2 plus some penalty for curvature, or something similar). You can find out what knot points it used via g.get_knots().
So the reason why you get different results is that the interpolation algorithm is different. If you want B-splines with knots at data points, use interp1d or splmake. If you want what FITPACK does, use UnivariateSpline. In the limit of dense data, both methods give same results, but when data is sparse, you may get different results.
(How do I know all this: I read the code :-)

Works for me,
from numpy import allclose, linspace
from scipy.interpolate import interp1d, UnivariateSpline
from numpy.random import normal
from pylab import plot, show
n = 2**5
x = linspace(0,3,n)
y = (2*x**2 + 3*x + 1) + normal(0.0,2.0,n)
i = interp1d(x,y,kind=3)
u = UnivariateSpline(x,y,k=3,s=0)
m = 2**4
t = linspace(1,2,m)
plot(x,y,'r,')
plot(t,i(t),'b')
plot(t,u(t),'g')
print(allclose(i(t),u(t))) # evaluates to True
show()
This gives me,

UnivariateSpline: A more recent
wrapper of the FITPACK routines.
this might explain the slightly different values? (I also experienced that UnivariateSpline is much faster than interp1d.)

Use matplotlib.contour with complex data

I'm trying to show a contour plot using matplotlib from a complex array. The array is a 2x2 complex matrix, generated by the (C like) method:
for i in max_y:
    for j in max_x:
        pos_x = pos_x + step
        z = complex(pos_x,pos_y)
        c_arr[i][j] = complex_function(z)
    pos_y = pos_y + step
I would like to plot this c_arr (real part) using contourplot, but so far the only thing that I can get from contour is
TypeError: Input z must be a 2D array.
c_arr.real is a 2D array, and it doesn't matter whether I make a grid with x, y, or pos_x, or pos_y; the result is always the same. The matplotlib docs tell me how to use contour, but not which data types it needs, so I feel left in the dark.
EDIT: Thanks for the answer. My problem now is that I have to get the complex values from a function in this form:
def f(z):
    return np.sum(np.arange(n)*np.sqrt(z-1)**np.arange(n))
where the sum must be added up. How can this be accomplished using the meshgrid form that contour needs? Thanks again.

matplotlib.pyplot.contour() allows complex-valued input arrays. It extracts real values from the array implicitly:
#!/usr/bin/env python
import numpy as np
from matplotlib import pyplot as plt
# generate data
x = np.r_[0:100:30j]
y = np.r_[0:1:20j]
X, Y = np.meshgrid(x, y)
Z = X*np.exp(1j*Y) # some arbitrary complex data
# plot it
def plotit(z, title):
    plt.figure()
    cs = plt.contour(X,Y,z) # contour() accepts complex values
    plt.clabel(cs, inline=1, fontsize=10) # add labels to contours
    plt.title(title)
    plt.savefig(title+'.png')
plotit(Z, 'real')
plotit(Z.real, 'explicit real')
plotit(Z.imag, 'imaginary')
plt.show()
real
explicit real
imaginary
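For the follow-up in the question's EDIT, one way to evaluate the sum on a whole grid is to give the term index its own trailing axis and sum over it with broadcasting. This is a sketch under assumptions: the name f_grid, the default n=5, and the grid limits are placeholders, since the question does not say what n or the plotting range are.

```python
import numpy as np


def f_grid(Z, n=5):
    """Evaluate sum_{k=0}^{n-1} k * sqrt(Z - 1)**k element-wise on an array Z."""
    k = np.arange(n)
    root = np.sqrt(np.asarray(Z, dtype=complex) - 1)
    # Add a trailing axis for k, broadcast against it, then sum that axis away.
    return np.sum(k * root[..., np.newaxis] ** k, axis=-1)


x = np.linspace(-2, 2, 40)
y = np.linspace(-2, 2, 30)
X, Y = np.meshgrid(x, y)
C = f_grid(X + 1j * Y)   # complex array with the same shape as X and Y
```

The real part can then be passed to contour as plt.contour(X, Y, C.real), which is a 2D array of the shape contour expects.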
