Parallelize nested looping of gridded data in Python

I'm struggling to find an easy example of parallelizing nested loops over N-dimensional data in Python.
As a simple example, suppose we have gridded precipitation data of dimensions (time, lat, lon) and want to find the temporal mean at each lat-lon grid point, i.e. to obtain the same result as data.mean(axis=0).
def mean(data):
    result = np.zeros(data[0, :, :].shape, dtype=float)
    for i in range(data.shape[1]):
        for j in range(data.shape[2]):
            result[i, j] = data[:, i, j].mean()
    return result
What would be the most elegant way to parallelize this function?
Update: The precip data can be downloaded at: https://www.esrl.noaa.gov/psd/data/gridded/data.gpcp.html.
Test Code:
%matplotlib inline
import xarray
import numpy as np
import matplotlib.pyplot as plt
#Load data:
ds = xarray.open_dataset('precip.mon.mean.nc')
# Select a small subset, shape is now (442, 50,50)
data = ds.precip[:,:50,:50].to_masked_array()
#define the function to compute the temporal mean at each grid point:
def mean(data):
    result = np.zeros(data[0, :, :].shape, dtype=float)
    for i in range(data.shape[1]):
        for j in range(data.shape[2]):
            result[i, j] = data[:, i, j].mean()
    return result
#Call the function
result = mean(data)
#A quick plot for visual reference
plt.figure()
plt.imshow(result, origin='upper', interpolation='none'); plt.colorbar()
My working code involves more complex techniques (than just taking the mean), but the basic code structure is similar: a nested double loop that accesses each grid point in order to perform some analysis, saving the result as a 2D or ND array. So being able to parallelize this would be immensely beneficial.
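One possible approach (a sketch I'm adding here, not from the thread): since each grid point is independent, the outer loop can be farmed out to worker processes with the standard library's multiprocessing, mapping a per-row worker over latitude rows. This assumes the array fits in memory and the worker function is picklable (on some platforms it must be defined in an importable module rather than a notebook cell):
import numpy as np
from multiprocessing import Pool

def analyze_row(row):
    # row has shape (time, lon); replace the body with any
    # per-gridpoint analysis along the time axis -- here, the mean
    return row.mean(axis=0)

def parallel_mean(data, processes=4):
    # split the (time, lat, lon) cube into latitude rows and map
    # them across a pool of worker processes
    rows = [data[:, i, :] for i in range(data.shape[1])]
    with Pool(processes) as pool:
        result_rows = pool.map(analyze_row, rows)
    return np.stack(result_rows)
For the plain mean, an alternative is to open the file with xarray's chunks= argument (which backs the variable with dask) and call ds.precip.mean(dim='time'); dask then computes the chunks in parallel without any explicit loop.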

Related

Fastest way to convert a set of 3D points into image of heights in python

I am trying to convert a set of 3D points into a heightmap (a 2D image that shows the largest displacements of the points from the floor).
The only way I can come up with is writing a for loop that iterates through all points and updates the heightmap; this method is quite slow.
import numpy as np
heightmap_resolution = 0.02
# generate some random 3D points
points = np.array([[x, y, z] for x in np.random.uniform(0, 2, 100) for y in np.random.uniform(0, 2, 100) for z in np.random.uniform(0, 2, 100)])
heightmap = np.zeros((int(np.max(points[:, 1]) / heightmap_resolution) + 1,
                      int(np.max(points[:, 0]) / heightmap_resolution) + 1))
for point in points:
    y = int(point[1] / heightmap_resolution)
    x = int(point[0] / heightmap_resolution)
    if point[2] > heightmap[y][x]:
        heightmap[y][x] = point[2]
I wonder if there is a better way of doing this. Any improvement is greatly appreciated!
The intuition:
If you find yourself writing a for loop with numpy, you probably need to check again whether numpy has an operation for it. I saw you wanted to compare items to get the max, and I wasn't sure whether the structure was important, so I changed it.
The second point is that heightmap pre-allocates a lot of memory you aren't going to use. Try using a dictionary with a tuple (x, y) as the key, or a dataframe:
import numpy as np
import pandas as pd
heightmap_resolution = 0.02
# generate some random 3D points
points = np.array([[x, y, z] for x in np.random.uniform(0, 2, 100) for y in np.random.uniform(0, 2, 100) for z in np.random.uniform(0, 2, 100)])
points_df = pd.DataFrame(points, columns=['x', 'y', 'z'])
# didn't know if you wanted to keep the x and y columns, so I made new ones
points_df['x_normalized'] = (points_df['x'] / heightmap_resolution).astype(int)
points_df['y_normalized'] = (points_df['y'] / heightmap_resolution).astype(int)
points_df.groupby(['x_normalized', 'y_normalized'])['z'].max()
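As a numpy-only alternative (my addition, not part of the original answer), the scatter-max can also be done directly with the unbuffered ufunc method np.maximum.at, avoiding both the Python loop and pandas. A minimal sketch, assuming the points array from the question:
import numpy as np

heightmap_resolution = 0.02
# convert point coordinates to integer cell indices
x_idx = (points[:, 0] / heightmap_resolution).astype(int)
y_idx = (points[:, 1] / heightmap_resolution).astype(int)
heightmap = np.zeros((y_idx.max() + 1, x_idx.max() + 1))
# unbuffered in-place maximum: keeps the largest z per (y, x) cell,
# even when several points fall into the same cell
np.maximum.at(heightmap, (y_idx, x_idx), points[:, 2])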

understanding pyresample to regrid irregular grid data to a regular grid

I need to regrid data on an irregular grid (Lambert conic) to a regular grid. I think pyresample is my best bet. In fact my original lat, lon are not 1D (which seems to be needed to use basemap.interp or scipy.interpolate.griddata).
I found this SO answer helpful. However, I get empty interpolated data. I think it has to do with the choice of my radius of influence and with the fact that my data are wrapped (??).
This is my code:
import numpy as np
from matplotlib import pyplot as plt
import netCDF4
%matplotlib inline
url = "http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/NARR/Dailies/monolevel/hlcy.2009.nc"
SRHtemp = netCDF4.Dataset(url).variables['hlcy'][0,::]
Y_n = netCDF4.Dataset(url).variables['y'][:]
X_n = netCDF4.Dataset(url).variables['x'][:]
T_n = netCDF4.Dataset(url).variables['time'][:]
lat_n = netCDF4.Dataset(url).variables['lat'][:]
lon_n = netCDF4.Dataset(url).variables['lon'][:]
lat_n and lon_n are irregular 2D arrays holding the latitude and longitude that correspond to the projected coordinates x, y.
Because of the way lon_n is defined, I added:
lon_n[lon_n<0] = lon_n[lon_n<0]+360
so that now, if I plot them, they look fine.
Then I create my new set of regular coordinates:
XI = np.arange(148,360)
YI = np.arange(0,87)
XI, YI = np.meshgrid(XI,YI)
Following the answer above I wrote the following code:
from pyresample.geometry import SwathDefinition
from pyresample.kd_tree import resample_nearest
def_a = SwathDefinition(lons=XI, lats=YI)
def_b = SwathDefinition(lons=lon_n, lats=lat_n)
interp_dat = resample_nearest(def_b, SRHtemp, def_a, radius_of_influence=70000, fill_value=-9.96921e+36)
The resolution of the data is about 30 km, so I put 70 km; the fill_value I used is the one from the data, but of course I could just put zero or NaN.
However, I get an empty array.
What am I doing wrong? Also, if there is another way of doing it, I am interested in knowing it. The pyresample documentation is a bit thin, and I need a bit more help.
I did find this answer suggesting the use of another griddata function:
import matplotlib.mlab as ml
resampled_data = ml.griddata(lon_n.ravel(), lat_n.ravel(), SRHtemp.ravel(), XI, YI, interp="linear")
and it seems to work.
But I would like to understand more about pyresample, since it seems so powerful.
The problem is that XI and YI are integers, not floats. You can fix this by simply doing
XI = np.arange(148,360.)
YI = np.arange(0,87.)
XI, YI = np.meshgrid(XI,YI)
The inability to handle integer datatypes is an undocumented, unintuitive, and possibly buggy behavior from pyresample.
A few more notes on your coding style:
It's not necessary to overwrite the XI and YI variables; you don't gain much by doing so.
You should load the netCDF dataset just once and then access each variable via that object (see the sketch below).
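For the second note, a minimal sketch of what that looks like:
import netCDF4

url = "http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/NARR/Dailies/monolevel/hlcy.2009.nc"
nc = netCDF4.Dataset(url)  # open the remote dataset once
SRHtemp = nc.variables['hlcy'][0, :, :]
lat_n = nc.variables['lat'][:]
lon_n = nc.variables['lon'][:]
nc.close()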

How do I subset a 2D grid from another 2D grid in python?

I have gridded data over the contiguous United States and I'm trying to select a chunk of it over a specific area.
import numpy as np
from netCDF4 import Dataset
import matplotlib.pyplot as plt
filename = '/Users/me/myfile.nc'
full_data = Dataset(filename,'r')
latitudes = full_data.variables['latitude'][0,:,:]
longitudes = full_data.variables['longitude'][0,:,:]
temperature = full_data.variables['temperature'][0,:,:]
All three variables are 2-dimensional matrices of shape (337,451). I'm trying to do the following to get a sub-selection of the data over a specific region.
index = (latitudes>=44.0)&(latitudes<=45.0)&(longitudes>=-91.0)&(longitudes<=-89.0)
temp_subset = temperature[index]
lat_subset = latitudes[index]
lon_subset = longitudes[index]
I would expect all three of these variables to be 2-dimensional, but instead they all return a flattened array with a shape of (102,). I've tried another approach:
index2 = np.where((latitudes>=44.0)&(latitudes<=45.0)&(longitudes>=-91.0)&(longitudes<=-89.0))
temp = temperature[index2[0],:]
temp2 = temp[:,index2[1]]
plt.imshow(temp2, origin='lower')
plt.colorbar()
But my data looks quite incorrect. Is there a better way to get a 2D subset grid from a larger grid?
Edub,
I suggest looking at numpy's indexing documentation, specifically http://docs.scipy.org/doc/numpy-1.10.1/user/basics.indexing.html#other-indexing-options. Currently you are providing two index arrays but no slicing information, so you only get one-dimensional results. I hope this proves useful!
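A sketch of what that can look like here (my addition; it assumes the region of interest maps to a contiguous block of rows and columns in the grid):
import numpy as np

mask = ((latitudes >= 44.0) & (latitudes <= 45.0) &
        (longitudes >= -91.0) & (longitudes <= -89.0))
rows, cols = np.where(mask)
# slice out the bounding box of the matching cells; slicing (rather
# than boolean indexing) keeps the result two-dimensional
temp_subset = temperature[rows.min():rows.max() + 1,
                          cols.min():cols.max() + 1]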

Drawing a 2D function in matplotlib

Dear fellow coders and science guys :)
I am using Python with numpy and matplotlib to simulate a perceptron, and I'm proud to say it works pretty well.
I used Python even though I'd never seen it before, because I heard matplotlib offered amazing graph visualisation capabilities.
Using the functions below I get a 2D array that looks like this:
[[alpha_1, 900], [alpha_2, 600], ..., [alpha_99, 900]]
So I get this 2D array and would love to write a function that lets me analyze the convergence.
I am looking for something that will easily and intuitively (I don't have time to study a whole new library for 5 hours now) draw a convergence curve like a hand-drawn sketch. This is the function that computes the convergence for one alpha:
def get_convergence_for_alpha(self, _alpha):
    epochs = []
    for i in range(0, 5):
        epochs.append(self.perceptron_algorithm())
        self.weights = self.generate_weights()
    avg = sum(epochs, 0) / len(epochs)
    res = [_alpha, avg]
    return res
And this is the whole generation function.
def alpha_convergence_function(self):
    res = []
    for i in range(1, 100):
        res.append(self.get_convergence_for_alpha(i / 100))
    return res
Is this easily doable?
You can convert your nested list to a 2D numpy array and then use slicing to get the alphas and epoch counts (just like in MATLAB).
import numpy as np
import matplotlib.pyplot as plt
# code to simulate the perceptron goes here...
res = your_object.alpha_convergence_function()
res = np.asarray(res)
print('array size:', res.shape)
plt.xkcd() # so you get the sketchy look :)
# first column -> x-axis, second column -> y-axis
plt.plot(res[:,0], res[:,1])
plt.show()
Remove the plt.xkcd() line if you don't actually want the plot to look like a sketch...

How to create a grid from LiDAR points (X,Y,Z) with GDAL python?

I'm really new to Python programming, and I was just wondering if you can create a regular grid of 0.5 by 0.5 m resolution using LiDAR points.
My data are in LAS format (read with from liblas import file as lasfile) and have the following format: X, Y, Z, where X and Y are coordinates.
The points are randomly positioned: some pixels are empty (NaN value) and some pixels contain more than one point. Where there is more than one point, I wish to obtain the mean value. In the end I need to save the data in TIF or ASCII format.
I am studying the osgeo module and GDAL, but I honestly don't know whether the osgeo module is the best solution.
I would be really glad for help with some code that I can study and implement.
Thanks in advance for the help, I really need it. I don't know the best way to get a grid with these parameters.
It's a bit late but maybe this answer will be useful for others, if not for you...
I have done this with Numpy and Pandas, and it's pretty fast. I was using TLS data and could do this with several million data points without any trouble on a decent 2009-vintage laptop. The key is 'binning' by rounding the data, and then using Pandas' GroupBy methods to do the aggregating and calculate the means.
If you need to round to a power of 10 you can use np.round, otherwise you can round to an arbitrary value by making a function to do so, which I have done by modifying this SO answer.
import numpy as np
import pandas as pd
# make a rounding function:
def round_to_val(a, round_val):
    return np.round(np.array(a, dtype=float) / round_val) * round_val
# load data (placeholder path; expects an array of shape (n_d, 3))
data = np.load('your_point_cloud.npy')
n_d = data.shape[0]
# round the data
d_round = np.empty([n_d, 5])
d_round[:, 0] = data[:, 0]
d_round[:, 1] = data[:, 1]
d_round[:, 2] = data[:, 2]
del data  # free up some RAM
d_round[:, 3] = round_to_val(d_round[:, 0], 0.5)
d_round[:, 4] = round_to_val(d_round[:, 1], 0.5)
# sorting data
ind = np.lexsort((d_round[:, 4], d_round[:, 3]))
d_sort = d_round[ind]
# making dataframes and grouping stuff
df_cols = ['x', 'y', 'z', 'x_round', 'y_round']
df = pd.DataFrame(d_sort)
df.columns = df_cols
df_round = df[['x_round', 'y_round', 'z']]
group_xy = df_round.groupby(['x_round', 'y_round'])
# calculating the mean, write to csv, which saves the file with:
# [x_round, y_round, z_mean] columns. You can exit Python and then start up
# later to clear memory if that's an issue.
group_mean = group_xy.mean()
group_mean.to_csv('your_binned_data.csv')
# Restarting...
import numpy as np
from scipy.interpolate import griddata
binned_data = np.loadtxt('your_binned_data.csv', skiprows=1, delimiter=',')
x_bins = binned_data[:,0]
y_bins = binned_data[:,1]
z_vals = binned_data[:,2]
pts = np.array([x_bins, y_bins]).T
# make grid (with borders rounded to 0.5...); note the slices must run
# from min to max with a positive step, or the grid comes out empty
xmin, xmax = 637000, 640000.5
ymin, ymax = 6067000, 6070000.5
grid_x, grid_y = np.mgrid[xmin:xmax:0.5, ymin:ymax:0.5]
# interpolate onto grid
data_grid = griddata(pts, z_vals, (grid_x, grid_y), method='cubic')
# save to ascii
np.savetxt('data_grid.txt', data_grid)
When I've done this, I've saved the output as a .npy file, converted it to a TIFF with the Image library, and then georeferenced it in ArcMap. There is probably a way to do that with osgeo, but I haven't used it.
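For reference, a minimal sketch of the osgeo route (my addition, untested; the EPSG code and nodata value are placeholders for your actual CRS and convention):
from osgeo import gdal, osr

ny, nx = data_grid.shape
driver = gdal.GetDriverByName('GTiff')
out = driver.Create('data_grid.tif', nx, ny, 1, gdal.GDT_Float32)
# pixel size 0.5 m; origin at the upper-left corner of the grid
out.SetGeoTransform((xmin, 0.5, 0, ymax, 0, -0.5))
srs = osr.SpatialReference()
srs.ImportFromEPSG(32755)  # placeholder EPSG code -- use your CRS
out.SetProjection(srs.ExportToWkt())
band = out.GetRasterBand(1)
band.WriteArray(data_grid)
band.SetNoDataValue(-9999.0)  # placeholder nodata value
out.FlushCache()
out = None  # closing the dataset writes the file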
Hope this helps someone at least...
You can use the histogram function in Numpy to do binning, for instance:
import numpy as np
# positions, and the values measured at those positions
# (the `values` array is added here; the original snippet referenced
# an undefined `data` variable)
points = np.random.random(1000)
values = np.random.random(1000)
# 10 bin edges from 0 to 1, i.e. 9 bins
bins = np.linspace(0, 1, 10)
# per-bin mean: sum of the values in each bin divided by the bin count
means = (np.histogram(points, bins, weights=values)[0] /
         np.histogram(points, bins)[0])
Try LAStools, particularly lasgrid or las2dem.
