interpolate.griddata shifts data northwards, is it a bug? - python

I observe unexpected results from scipy.interpolate.griddata. I am trying to visualize a set of irregularly spaced points using matplotlib.basemap and scipy.interpolate.griddata.
The data is given as three lists: latitudes, longitudes and values. To get them on the map I interpolate the data onto a regular grid and visualize it using Basemap's imshow function.
I observe that the interpolated data is shifted northwards from true positions.
Here is an example. Here I want to highlight a cell formed by two meridians and two parallels. I expect to get something like this:
However what I get is something like this:
You can see that the red rectangle is visibly shifted northwards.
I have tried to vary the grid resolution and the number of points, however this does not seem to have any effect on this observed shift.
Here is an IPython notebook that illustrates the issue.
Also below is the complete code:
import numpy as np
from numpy import random
from scipy import interpolate
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# defining the region of interest
r = {'lon':[83.0, 95.5], 'lat':[48.5,55.5]}
# initializing Basemap
m = Basemap(projection='merc',
llcrnrlon=r['lon'][0],
llcrnrlat=r['lat'][0],
urcrnrlon=r['lon'][1],
urcrnrlat=r['lat'][1],
lon_0=r['lon'][0],
ellps='WGS84',
fix_aspect=True,
resolution='h')
# defining the highlighted block
block = {'lon':[89,91],'lat':[50.5,52.5]}
# generating the data
npixels = 100000
lat_range = r['lat'][1] - r['lat'][0]
lats = lat_range * random.random(npixels) + r['lat'][0]
lon_range = r['lon'][1] - r['lon'][0]
lons = lon_range * random.random(npixels) + r['lon'][0]
values = np.zeros(npixels)
for p in range(npixels):
if block['lat'][0] < lats[p] < block['lat'][1] \
and block['lon'][0] < lons[p] < block['lon'][1]:
values[p] = 1.0
# plotting the original data without interpolation
plt.figure(figsize=(5, 5))
m.drawparallels(np.arange(r['lat'][0], r['lat'][1] + 0.25, 2.0),
labels=[True,False,True,False])
m.drawmeridians(np.arange(r['lon'][0], r['lon'][1] + 0.25, 2.0),
labels=[True,True,False,True])
m.scatter(lons,lats,c=values,latlon=True,edgecolors='none')
# interpolating on the regular grid
nx = ny = 500
mapx = np.linspace(r['lon'][0],r['lon'][1],nx)
mapy = np.linspace(r['lat'][0],r['lat'][1],ny)
mapgridx,mapgridy = np.meshgrid(mapx,mapy)
mapdata = interpolate.griddata(list(zip(lons,lats)),values,
(mapgridx,mapgridy),method='nearest')
# plotting the interpolated data
plt.figure(figsize=(5, 5))
m.drawparallels(np.arange(r['lat'][0], r['lat'][1] + 0.25, 2.0),
labels=[True,False,True,False])
m.drawmeridians(np.arange(r['lon'][0], r['lon'][1] + 0.25, 2.0),
labels=[True,True,False,True])
m.imshow(mapdata)
I am seeing this with SciPy 0.17.0

Pauli Virtanen on SciPy bugtracker answered the question.
The issue goes away if one replaces basemap.imshow() with matplotlib.pyplot.pcolormesh()
Replacing above
m.imshow(mapdata)
with
meshx,meshy = m(mapx,mapy)
plt.pcolormesh(meshx,meshy,mapdata)
produces correctly aligned image.
It is not clear what I am doing wrong with basemap.imshow, but that is probably another question.

Related

Python's Basemap doesn't align with correct coordinates?

I have a .dat file containing a list of coordinates (~100k) and a temperature at each coordinate. It has a structure like this:
-59.083 -26.583 0.2
-58.417 -26.250 0.6
-58.412 -26.417 0.4
...
To visually display the temperature ranges, I created a numpy array and plotted the datasets using the Basemap module for Python. The code I wrote is the following:
from matplotlib import pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np
m = Basemap(projection='mill',llcrnrlat=-90,urcrnrlat=90,\
llcrnrlon=-180,urcrnrlon=180,resolution='c')
m.drawcoastlines(linewidth=0.15)
data = np.loadtxt('gridly.dat')
xcoordlist = []
ycoordlist = []
tempvallist = []
for i in data:
xcoord = i[0]
ycoord = i[1]
tempval = i[2]
xcoord2 = xcoord*111139 #<--- Multiplying converts each coordinate's degrees to meters)
ycoord2 = ycoord*111139
xcoordlist.append(xcoord2)
ycoordlist.append(ycoord2)
tempvallist.append(tempval)
xco = np.array(xcoordlist)
yco = np.array(ycoordlist)
tval = np.array(tempvallist)
gridsize = 100
m.hexbin(yco, xco, C=tval, gridsize=gridsize)
cb = m.colorbar()
plt.show()
When I plot the data, I'm getting almost exactly what I want, however, the hexagonal heatmap is offset for some reason, giving me the following chart:
I've been searching online for what might be wrong but unfortunately couldn't find answers or troubleshoot. Does anyone know how I can fix this issue?
After hours of digging around, I finally figured it out! What was wrong with my code was that I was trying to manually convert the geographic coordinates into point coordinates for the displaying chart (by multiplying by 111139).
While the logic for doing this makes sense, I believe this process broke down when I began to plot the data onto different kinds of charts (i.e. orthogonal, miller projection etc.) because the different projections/charts will have different point coordinates (kind of like how the pixel locations on your computer screen may not align with the pixel locations on a different computer screen).
Instead, the Basemap module has a built-in function that will convert real-world coordinates into coordinates that can be plotted on the chart, for you: m(x, y).
So, the improved and correct script would be:
from matplotlib import pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np
m = Basemap(projection='mill',llcrnrlat=-90,urcrnrlat=90,\
llcrnrlon=-180,urcrnrlon=180,resolution='c')
m.drawcoastlines(linewidth=0.15)
data = np.loadtxt('gridly.dat')
xcoordlist = []
ycoordlist = []
tempvallist = []
for i in data:
lat = i[0]
lon = i[1]
tempval = i[2]
xpt, ypt = m(lon, lat)
xcoordlist.append(xpt)
ycoordlist.append(ypt)
tempvallist.append(tempval)
xco = np.array(xcoordlist)
yco = np.array(ycoordlist)
tval = np.array(tempvallist)
gridsize = 100
m.hexbin(xco, yco, C=tval, gridsize=gridsize)
cb = m.colorbar()
plt.show()
As you can see where it says xpt, ypt = m(lon, lat), the function converts the real world longitudes (lon) and latitudes (lat) from the .dat file into pottable points. Hope this helps anyone else that may have this problem in the future!

Path 'contains_points' yields incorrect results with Bezier curve

I am trying to select a region of data based on a matplotlib Path object, but when the path contains a Bezier curve (not just straight lines), the selected region doesn't completely fill in the curve. It looks like it's trying, but the far side of the curve gets chopped off.
For example, the following code defines a fairly simple closed path with one straight line and one cubic curve. When I look at the True/False result from the contains_points method, it does not seem to match either the curve itself or the raw vertices.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.patches import PathPatch
# Make the Path
verts = [(1.0, 1.5), (-2.0, 0.25), (-1.0, 0.0), (1.0, 0.5), (1.0, 1.5)]
codes = [Path.MOVETO, Path.CURVE4, Path.CURVE4, Path.CURVE4, Path.CLOSEPOLY]
path1 = Path(verts, codes)
# Make a field with points to select
nx, ny = 101, 51
x = np.linspace(-2, 2, nx)
y = np.linspace(0, 2, ny)
yy, xx = np.meshgrid(y, x)
pts = np.column_stack((xx.ravel(), yy.ravel()))
# Construct a True/False array of contained points
tf = path1.contains_points(pts).reshape(nx, ny)
# Make a PathPatch for display
patch1 = PathPatch(path1, facecolor='c', edgecolor='b', lw=2, alpha=0.5)
# Plot the true/false array, the patch, and the vertices
fig, ax = plt.subplots()
ax.imshow(tf.T, origin='lower', extent=(x[0], x[-1], y[0], y[-1]))
ax.add_patch(patch1)
ax.plot(*zip(*verts), 'ro-')
plt.show()
This gives me this plot:
It looks like there is some sort of approximation going on - is this just a fundamental limitation of the calculation in matplotlib, or am I doing something wrong?
I can calculate the points inside the curve myself, but I was hoping to not reinvent this wheel if I don't have to.
It's worth noting that a simpler construction using quadratic curves does appear to work properly:
I am using matplotlib 2.0.0.
This has to do with the space in which the paths are evaluated, as explained in GitHub issue #6076. From a comment by mdboom there:
Path intersection is done by converting the curves to line segments
and then converting the intersection based on the line segments. This
conversion happens by "sampling" the curve at increments of 1.0. This
is generally the right thing to do when the paths are already scaled
in display space, because sampling the curve at a resolution finer
than a single pixel doesn't really help. However, when calculating the
intersection in data space as you've done here, we obviously need to
sample at a finer resolution.
This is discussing intersections, but contains_points is also affected. This enhancement is still open so we'll have to see if it is addressed in the next milestone. In the meantime, there are a couple options:
1) If you are going to be displaying a patch anyway, you can use the display transformation. In the example above, adding the following demonstrates the correct behavior (based on a comment by tacaswell on duplicate issue #8734, now closed):
# Work in transformed (pixel) coordinates
hit_patch = path1.transformed(ax.transData)
tf1 = hit_patch.contains_points(ax.transData.transform(pts)).reshape(nx, ny)
ax.imshow(tf2.T, origin='lower', extent=(x[0], x[-1], y[0], y[-1]))
2) If you aren't using a display and just want to calculate using a path, the best bet is to simply form the Bezier curve yourself and make a path out of line segments. Replacing the formation of path1 with the following calculation of path2 will produce the desired result.
from scipy.special import binom
def bernstein(n, i, x):
coeff = binom(n, i)
return coeff * (1-x)**(n-i) * x**i
def bezier(ctrlpts, nseg):
x = np.linspace(0, 1, nseg)
outpts = np.zeros((nseg, 2))
n = len(ctrlpts)-1
for i, point in enumerate(ctrlpts):
outpts[:,0] += bernstein(n, i, x) * point[0]
outpts[:,1] += bernstein(n, i, x) * point[1]
return outpts
verts1 = [(1.0, 1.5), (-2.0, 0.25), (-1.0, 0.0), (1.0, 0.5), (1.0, 1.5)]
nsegments = 31
verts2 = np.concatenate([bezier(verts1[:4], nsegments), np.array([verts1[4]])])
codes2 = [Path.MOVETO] + [Path.LINETO]*(nsegments-1) + [Path.CLOSEPOLY]
path2 = Path(verts2, codes2)
Either method yields something that looks like the following:

How to separate two regions of latlon data on python

I am currently working with BUFR files with wind data. When I read this file on python I get 4 large vectors, latitude vector, longitude vector, wind_direction vector, and wind_speed vector.
Both wind vectors are masked python arrays because there is non-valid data. This happens because the data comes from a non-geostationary satellite. In fact I successfully generated the following image from this BUFR file to show you the general shape that the data takes.
In this image I have plotted a color field to represent the wind speed, while the arrows obviously represent the wind direction.
Please notice the two bands of actual data. Unfortunately the way I am plotting the data, generates a third band (where the color field is smooth), in-between the actual data bands. This is an artefact of the function pcolormesh. If I could superimpose two `pcolormesh plots, each one representing one of the bands, this problem would disappear.
Unfortunately, I do not know how I could separate the data "regions". I have thought about clustering techniques but do not know how to cluster along latlon data using ANOTHER array (the wind data) as the clustering rule.
This is my current code:
#!/usr/bin/python
import bufr
import numpy as np
import sys
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot as plt
from matplotlib import mlab
WIND_DIR_INDEX = 97
WIND_SPEED_INDEX = 96
bfrfile = sys.argv[1]
print bfrfile
bfr = bufr.BUFRFile(bfrfile)
lon = []
lat = []
wind_d = []
wind_s = []
for record in bfr:
for entry in record:
if entry.index == WIND_DIR_INDEX:
wind_d.append(entry.data)
if entry.index == WIND_SPEED_INDEX:
wind_s.append(entry.data)
if entry.name.find("LONGITUDE") == 0:
lon.append(entry.data)
if entry.name.find("LATITUDE") == 0:
lat.append(entry.data)
lons = np.concatenate(lon)
lats = np.concatenate(lat)
winds_d = np.concatenate(wind_d)
winds_s = np.concatenate(wind_s)
winds_d = np.ma.masked_greater(winds_d,1.0e+6)
winds_s = np.ma.masked_greater(winds_s,1.0e+6)
windu = np.cos((winds_d-180)*(np.pi/180))
windv = np.sin((winds_d-180)*(np.pi/180))
# Data interpolation for pcolormesh (needs gridded data)
xi = np.linspace(lons.min(),lons.max(),lons.size/10)
yi = np.linspace(lats.min(),lats.max(),lats.size/10)
Z = mlab.griddata(lons,lats,winds_s,xi,yi)
X,Y = np.meshgrid(xi,yi)
mydpi = 96
fig = plt.figure(frameon=True)
fig.set_size_inches(1600/mydpi,1200/mydpi)
ax = plt.Axes(fig,[0,0,1,1])
#ax.set_axis_off()
fig.add_axes(ax)
plt.hold(True);
plt.quiver(lons[::5],lats[::5],windu[::5],windv[::5],linewidths=0)
for method in (ax.set_xticks,ax.set_xticklabels,ax.set_yticks,ax.set_yticklabels):
method([])
fig.savefig('/home/cendas/bin/python/bufr_ascat.png',bbox_inches=0,dpi=5*mydpi)
mydpi = 96
fig = plt.figure(frameon=True)
fig.set_size_inches(1600/mydpi,1200/mydpi)
ax = plt.Axes(fig,[0,0,1,1])
#ax.set_axis_off()
fig.add_axes(ax)
plt.hold(True);
try:
plt.pcolormesh(X,Y,Z,alpha=None)
plt.clim(0,10)
except ValueError:
pass
print "Warning: Empty data array."
for method in (ax.set_xticks,ax.set_xticklabels,ax.set_yticks,ax.set_yticklabels):
method([])
fig.savefig('/home/cendas/bin/python/bufr_ascat_color.png',bbox_inches=0,dpi=5*mydpi)
I then usually follow this python code with the following terminal commands to combine the images:
convert bufr_ascat.png -transparent white bufr_ascat.png
convert bufr_ascat_color.png -transparent white bufr_ascat_color.png
composite bufr_ascat.png bufr_ascat_color.png bufrascat.png
Don't abuse clustering for this.
What you need is a simple selection / filtering; not a structure discovery process.
Choose the mean of the masked data. All non-masked data left of that mean is the left part, all non-masked data on the right is the other?
Clustering is the wrong tool for this task.

get bins coordinates with hexbin in matplotlib

I use matplotlib's method hexbin to compute 2d histograms on my data.
But I would like to get the coordinates of the centers of the hexagons in order to further process the results.
I got the values using get_array() method on the result, but I cannot figure out how to get the bins coordinates.
I tried to compute them given number of bins and the extent of my data but i don't know the exact number of bins in each direction. gridsize=(10,2) should do the trick but it does not seem to work.
Any idea?
I think this works.
from __future__ import division
import numpy as np
import math
import matplotlib.pyplot as plt
def generate_data(n):
"""Make random, correlated x & y arrays"""
points = np.random.multivariate_normal(mean=(0,0),
cov=[[0.4,9],[9,10]],size=int(n))
return points
if __name__ =='__main__':
color_map = plt.cm.Spectral_r
n = 1e4
points = generate_data(n)
xbnds = np.array([-20.0,20.0])
ybnds = np.array([-20.0,20.0])
extent = [xbnds[0],xbnds[1],ybnds[0],ybnds[1]]
fig=plt.figure(figsize=(10,9))
ax = fig.add_subplot(111)
x, y = points.T
# Set gridsize just to make them visually large
image = plt.hexbin(x,y,cmap=color_map,gridsize=20,extent=extent,mincnt=1,bins='log')
# Note that mincnt=1 adds 1 to each count
counts = image.get_array()
ncnts = np.count_nonzero(np.power(10,counts))
verts = image.get_offsets()
for offc in xrange(verts.shape[0]):
binx,biny = verts[offc][0],verts[offc][1]
if counts[offc]:
plt.plot(binx,biny,'k.',zorder=100)
ax.set_xlim(xbnds)
ax.set_ylim(ybnds)
plt.grid(True)
cb = plt.colorbar(image,spacing='uniform',extend='max')
plt.show()
I would love to confirm that the code by Hooked using get_offsets() works, but I tried several iterations of the code mentioned above to retrieve center positions and, as Dave mentioned, get_offsets() remains empty. The workaround that I found is to use the non-empty 'image.get_paths()' option. My code takes the mean to find centers but which means it is just a smidge longer, but it does work.
The get_paths() option returns a set of x,y coordinates embedded that can be looped over and then averaged to return the center position for each hexagram.
The code that I have is as follows:
counts=image.get_array() #counts in each hexagon, works great
verts=image.get_offsets() #empty, don't use this
b=image.get_paths() #this does work, gives Path([[]][]) which can be plotted
for x in xrange(len(b)):
xav=np.mean(b[x].vertices[0:6,0]) #center in x (RA)
yav=np.mean(b[x].vertices[0:6,1]) #center in y (DEC)
plt.plot(xav,yav,'k.',zorder=100)
I had this same problem. I think what needs to be developed is a framework to have a HexagonalGrid object which can then be applied to many different data sets (and it would be awesome to do it for N dimensions). This is possible and it surprises me that neither Scipy or Numpy has anything for it (furthermore there seems to be nothing else like it except perhaps binify)
That said, I assume you want to use hexbinning to compare multiple binned data sets. This requires some common base. I got this to work using matplotlib's hexbin the following way:
import numpy as np
import matplotlib.pyplot as plt
def get_data (mean,cov,n=1e3):
"""
Quick fake data builder
"""
np.random.seed(101)
points = np.random.multivariate_normal(mean=mean,cov=cov,size=int(n))
x, y = points.T
return x,y
def get_centers (hexbin_output):
"""
about 40% faster than previous post only cause you're not calculating the
min/max every time
"""
paths = hexbin_output.get_paths()
v = paths[0].vertices[:-1] # adds a value [0,0] to the end
vx,vy = v.T
idx = [3,0,5,2] # index for [xmin,xmax,ymin,ymax]
xmin,xmax,ymin,ymax = vx[idx[0]],vx[idx[1]],vy[idx[2]],vy[idx[3]]
half_width_x = abs(xmax-xmin)/2.0
half_width_y = abs(ymax-ymin)/2.0
centers = []
for i in xrange(len(paths)):
cx = paths[i].vertices[idx[0],0]+half_width_x
cy = paths[i].vertices[idx[2],1]+half_width_y
centers.append((cx,cy))
return np.asarray(centers)
# important parts ==>
class Hexagonal2DGrid (object):
"""
Used to fix the gridsize, extent, and bins
"""
def __init__ (self,gridsize,extent,bins=None):
self.gridsize = gridsize
self.extent = extent
self.bins = bins
def hexbin (x,y,hexgrid):
"""
To hexagonally bin the data in 2 dimensions
"""
fig = plt.figure()
ax = fig.add_subplot(111)
# Note mincnt=0 so that it will return a value for every point in the
# hexgrid, not just those with count>mincnt
# Basically you fix the gridsize, extent, and bins to keep them the same
# then the resulting count array is the same
hexbin = plt.hexbin(x,y, mincnt=0,
gridsize=hexgrid.gridsize,
extent=hexgrid.extent,
bins=hexgrid.bins)
# you could close the figure if you don't want it
# plt.close(fig.number)
counts = hexbin.get_array().copy()
return counts, hexbin
# Example ===>
if __name__ == "__main__":
hexgrid = Hexagonal2DGrid((21,5),[-70,70,-20,20])
x_data,y_data = get_data((0,0),[[-40,95],[90,10]])
x_model,y_model = get_data((0,10),[[100,30],[3,30]])
counts_data, hexbin_data = hexbin(x_data,y_data,hexgrid)
counts_model, hexbin_model = hexbin(x_model,y_model,hexgrid)
# if you want the centers, they will be the same for both
centers = get_centers(hexbin_data)
# if you want to ignore the cells with zeros then use the following mask.
# But if want zeros for some bins and not others I'm not sure an elegant way
# to do this without using the centers
nonzero = counts_data != 0
# now you can compare the two data sets
variance_data = counts_data[nonzero]
square_diffs = (counts_data[nonzero]-counts_model[nonzero])**2
chi2 = np.sum(square_diffs/variance_data)
print(" chi2={}".format(chi2))

How to create a grid from LiDAR points (X,Y,Z) with GDAL python?

I'm new really to python programming, and I was just wondering if you can create a regular grid of 0.5 by o.5 m of resolution using LiDAR points.
My data are in LAS format (reading with from liblas import file as lasfile) and they have the following format: X,Y,Z. Where X and Y are coordinates.
The points are randomly positioned and some pixel are empty (NAN value) and in some pixel there are more of one points. Where there are more of one point, I wish to obtain a mean value. In the end i need to save the data in a TIF format or Ascii format.
I am studying osgeo module and GDAL but I honest to say that i don't know if osgeo module is the best solution.
I am really glad for help with some code that i can study and implement,
Thanks in Advance for the help, I really need.
I don't know the best way to get a grid with these parameters.
It's a bit late but maybe this answer will be useful for others, if not for you...
I have done this with Numpy and Pandas, and it's pretty fast. I was using TLS data and could do this with several million data points without any trouble on a decent 2009-vintage laptop. The key is 'binning' by rounding the data, and then using Pandas' GroupBy methods to do the aggregating and calculate the means.
If you need to round to a power of 10 you can use np.round, otherwise you can round to an arbitrary value by making a function to do so, which I have done by modifying this SO answer.
import numpy as np
import pandas as pd
# make rounding function:
def round_to_val(a, round_val):
return np.round( np.array(a, dtype=float) / round_val) * round_val
# load data
data = np.load( 'shape of ndata, 3')
n_d = data.shape[0]
# round the data
d_round = np.empty( [n_d, 5] )
d_round[:,0] = data[:,0]
d_round[:,1] = data[:,1]
d_round[:,2] = data[:,2]
del data # free up some RAM
d_round[:,3] = round_to_val( d_round[:,0], 0.5)
d_round[:,4] = round_to_val( d_round[:,1], 0.5)
# sorting data
ind = np.lexsort( (d_round[:,4], d_round[:,3]) )
d_sort = d_round[ind]
# making dataframes and grouping stuff
df_cols = ['x', 'y', 'z', 'x_round', 'y_round']
df = pd.DataFrame( d_sort)
df.columns = df_cols
df_round = df[['x_round', 'y_round', 'z']]
group_xy = df_round.groupby(['x_round', 'y_round'])
# calculating the mean, write to csv, which saves the file with:
# [x_round, y_round, z_mean] columns. You can exit Python and then start up
# later to clear memory if that's an issue.
group_mean = group_xy.mean()
group_mean.to_csv('your_binned_data.csv')
# Restarting...
import numpy as np
from scipy.interpolate import griddata
binned_data = np.loadtxt('your_binned_data.csv', skiprows=1, delimiter=',')
x_bins = binned_data[:,0]
y_bins = binned_data[:,1]
z_vals = binned_data[:,2]
pts = np.array( [x_bins, y_bins])
pts = pts.T
# make grid (with borders rounded to 0.5...)
xmax, xmin = 640000.5, 637000
ymax, ymin = 6070000.5, 6067000
grid_x, grid_y = np.mgrid[640000.5:637000:0.5, 6067000.5:6070000:0.5]
# interpolate onto grid
data_grid = griddata(pts, z_vals, (grid_x, grid_y), method='cubic')
# save to ascii
np.savetxt('data_grid.txt', data_grid)
When I've done this, I have saved the output as a .npy and converted to a tiff with the Image library, and then georeferenced in ArcMap. There is probably a way to do that with osgeo but I haven't used it.
Hope this helps someone at least...
You can use the histogram function in Numpy to do binning, for instance:
import numpy as np
points = np.random.random(1000)
#create 10 bins from 0 to 1
bins = np.linspace(0, 1, 10)
means = (numpy.histogram(points, bins, weights=data)[0] /
numpy.histogram(points, bins)[0])
Try LAStools, particularly lasgrid or las2dem.

Categories

Resources