I'd like to create 2D histograms in Python from large datasets (100,000+ samples) stored in an HDF5 file. I came up with the following code:
import sys
import h5py
import numpy as np
import matplotlib as mpl
import matplotlib.pylab
f = h5py.File(sys.argv[1], 'r')
A = f['A']
T = f['T']
at_hist, xedges, yedges = np.histogram2d(T, A, bins=500)
extent = [yedges[0], yedges[-1], xedges[0], xedges[-1]]
fig = mpl.pylab.figure()
at_plot = fig.add_subplot(111)
at_plot.imshow(at_hist, extent=extent, origin='lower', aspect='auto')
mpl.pylab.show()
f.close()
It takes about 15 s to execute (100,000 data points). CERN's ROOT, however (using its own tree data structure instead of HDF5), can do this in less than 1 s. Do you have any idea how I could speed up the code? I could also change the structure of the HDF5 data if that would help.
I would try a few different things.
Load your data from the hdf file instead of passing in what are effectively memory-mapped arrays.
If that doesn't fix the problem, you can exploit a scipy.sparse.coo_matrix to make the 2D histogram. With older versions of numpy, digitize (which all of the various histogram* functions use internally) could use excessive memory under some circumstances. It's no longer the case with recent (>1.5??) versions of numpy, though.
As an example of the first suggestion, you'd do something like:
f = h5py.File(sys.argv[1], 'r')
A = np.empty(f['A'].shape, f['A'].dtype)
T = np.empty(f['T'].shape, f['T'].dtype)
f['A'].read_direct(A)
f['T'].read_direct(T)
The difference here is that the entirety of the arrays will be read into memory, instead of being h5py's array-like objects, which are basically efficient memory-mapped arrays stored on disk.
As for the second suggestion, don't try it unless the first suggestion didn't help your problem.
It probably won't be significantly faster (and is likely slower for small arrays), and with recent versions of numpy, it's only slightly more memory-efficient. I do have a piece of code where I deliberately do this, but I wouldn't recommend it in general. It's a very hackish solution. In very select circumstances (many points and many bins), it can perform better than histogram2d, though.
All those caveats aside, though, here it is:
import numpy as np
import scipy.sparse
import timeit
def generate_data(num):
    x = np.random.random(num)
    y = np.random.random(num)
    return x, y
def crazy_histogram2d(x, y, bins=10):
    try:
        nx, ny = bins
    except TypeError:
        nx = ny = bins
    xmin, xmax = x.min(), x.max()
    ymin, ymax = y.min(), y.max()
    dx = (xmax - xmin) / (nx - 1.0)
    dy = (ymax - ymin) / (ny - 1.0)
    weights = np.ones(x.size)
    # Basically, this is just doing what np.digitize does with one less copy
    xyi = np.vstack((x, y)).T
    xyi -= [xmin, ymin]
    xyi /= [dx, dy]
    xyi = np.floor(xyi, xyi).T.astype(int)  # cast to int so the sparse constructor accepts them as indices
    # Now, we'll exploit a sparse coo_matrix to build the 2D histogram...
    grid = scipy.sparse.coo_matrix((weights, xyi), shape=(nx, ny)).toarray()
    return grid, np.linspace(xmin, xmax, nx), np.linspace(ymin, ymax, ny)
if __name__ == '__main__':
    num = int(1e6)
    numruns = 10
    x, y = generate_data(num)
    t1 = timeit.timeit('crazy_histogram2d(x, y, bins=500)',
                       setup='from __main__ import crazy_histogram2d, x, y',
                       number=numruns)
    t2 = timeit.timeit('np.histogram2d(x, y, bins=500)',
                       setup='from __main__ import np, x, y',
                       number=numruns)
    print('Average of %i runs, using %.1e points' % (numruns, num))
    print('Crazy histogram', t1 / numruns, 'sec')
    print('numpy.histogram2d', t2 / numruns, 'sec')
On my system, this yields:
Average of 10 runs, using 1.0e+06 points
Crazy histogram 0.104092288017 sec
numpy.histogram2d 0.686891794205 sec
You need to figure out if the bottleneck is in the data load or in histogram2d. Try inserting some time measurement into your code.
Are A and T arrays, or are they generator objects? If the latter, then more care is needed to distinguish where the bottleneck is; you might have to unpack them into numpy arrays first to test.
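As a rough illustration of that kind of timing instrumentation, here is a minimal sketch using time.perf_counter; it assumes the same f['A'] and f['T'] datasets as in the question:
import sys
import time

import h5py
import numpy as np

f = h5py.File(sys.argv[1], 'r')

t0 = time.perf_counter()
A = f['A'][...]  # force the full read from disk into memory
T = f['T'][...]
t1 = time.perf_counter()

at_hist, xedges, yedges = np.histogram2d(T, A, bins=500)
t2 = time.perf_counter()

print('load:        %.3f s' % (t1 - t0))
print('histogram2d: %.3f s' % (t2 - t1))
f.close()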
Does the whole thing run in 15 s, or just the call to histogram2d? Importing the pylab family can take plenty of time. The numpy histogram2d function should be implemented in C if I recall correctly, so I doubt there are performance issues there. You can speed up Python by calling your script with the optimization flag -OO:
python -OO script.py
Also consider using Psyco to improve performance.
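For completeness, a minimal sketch of how Psyco was typically enabled; note that Psyco only works on Python 2 and is long unmaintained, so this is mostly of historical interest:
# Psyco is Python 2 only; the classic usage pattern was simply:
import psyco
psyco.full()  # JIT-compile as much of the module's code as possible

# ... rest of the script unchanged ...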
I am new to Python. For some reason, when I look at the plot it displays all the data as if Y = 0 except for the last value, which is weird since when I ask it to print Y it displays the right values. What am I doing wrong?
import math
import numpy as np
import matplotlib.pyplot as plt
y0=2 # [m]
g=9.81 # [m/s^2]
v=20 # initial speed [m/s]
y_target=1 # [m]
x=35 # [m]
n_iter=50
theta=np.linspace(0,0.5*math.pi,n_iter) # theta input [rad]
Y=np.zeros(n_iter) # y output [m]
for i in range(n_iter):
    Y[i] = math.tan(theta[i])*x - g/(2*(v*math.cos(theta[i]))**2)*x**2 + y0
plt.plot(theta,Y)
plt.ylabel('y [m]')
plt.xlabel('theta [rad]')
plt.ylim(top=max(Y),bottom=min(Y))
plt.show()
The problem is that the function blows up a bit as theta approaches π/2. Notice the little 1e33 at the top of the y-axis in the plot: the scale of that axis is huge, because the last value of y is essentially minus infinity (because of dividing by almost zero). If you change the limits of the y-axis, e.g. to (-1000, +1000), the plot looks correct.
But I can't resist helping you with something you didn't ask for help on... You are not using NumPy correctly. NumPy gives you two things: n-dimensional arrays as a data structure, and fast, optimized code for 'vectorized' computing with those arrays. In essence, you never need a loop in NumPy — you just compute with everything at once. Try doing 10 * np.array([1, 2, 3]) and you will get the idea.
So I would write your code like this:
import numpy as np
import matplotlib.pyplot as plt
# Problem parameters.
y0 = 2 # [m]
g = 9.81 # [m/s^2]
v = 20 # initial speed [m/s]
x = 35 # [m]
# Make theta [rad].
steps = 50
theta = np.linspace(0, 0.5*np.pi, steps)
# Compute y.
y = np.tan(theta) * x - g / (2 * (v * np.cos(theta))**2) * x**2 + y0
# Plot.
plt.plot(theta, y)
plt.ylabel('y [m]')
plt.xlabel('theta [rad]')
plt.ylim(-1000, 1000)
plt.show()
Notice that there's no loop — you just use the vector theta as if it were a scalar. And the math library (which can't handle NumPy's arrays, only scalars) is not needed at all when you're using NumPy.
Most of my programming experience is in MATLAB, and I recently started getting familiar with Python.
I came across some great MATLAB code here that pertains to some things I'd like to work with, so I've tried to recreate it in Python:
import numpy as np
import math
import matplotlib.pyplot as plt
x = np.linspace(-2, 2, 100) # seconds
y = np.linspace(-3, 3, 200) # seconds
xFreq = 2; # Hz
yFreq = -3; # Hz
a = np.matrix(np.matrix(np.exp(2j * np.pi * y * yFreq)))
b = np.matrix(np.exp(2j * np.pi * np.matrix(x).T * xFreq))
c = np.dot(b,a).T
plt.imshow(c.real, cmap='gray', extent = [min(x), max(x), min(y), max(y)], aspect=2/3);
plt.colorbar()
plt.xlabel('x (Sec)')
plt.ylabel('y (Sec)')
plt.show()
nfftx = len(x);
fs = 1/np.diff(x)[0];
fx = np.linspace(-1,1,nfftx) * fs/2;
nffty = len(y);
fs = 1/np.diff(y)[0];
fy = np.linspace(-1,1,nffty) * fs/2;
imF = np.fft.fftshift(np.fft.fft2(c))/np.size(c)
plt.figure()
plt.title("FFT (real)")
plt.imshow(np.real(imF), cmap='gray', extent = [min(fx), max(fx), min(fy), max(fy)], aspect=2/3)
plt.xlabel('fx (Hz)')
plt.ylabel('fy (Hz)')
Any idea why the y frequency is shown at 3 Hz instead of -3 Hz?
I couldn't understand what the original commentator was doing in MATLAB with these two lines:
Nfft = 4 * 2 .^ nextpow2(size(im));
imF = fftshift(fft2(im, Nfft(1), Nfft(2))) / numel(im);
which is likely why my FFT signal is so intense relative to background. Thoughts on how I could adjust my FFT in Python?
I only have a partial answer.
If you look closely, the sinusoidal image generated with your Python code and the one generated with the MATLAB code you linked have inverted colors (check the colors of the stripes closer to the edges, and the colors on the color bar).
That explains why you have inverted colors on the FFT plot, and may be why you got 3 Hz instead of -3 Hz. Unfortunately, I do not have access to a computer with Python right now and won't be able to verify this. I guess this may be a good thing to start troubleshooting with.
EDIT:
Yes, you are right. I completely missed the flipud in the MATLAB script. I do not think your c calculation is wrong. The easiest way to check that is to save the MATLAB data and import it into Python.
In MATLAB:
save('data.mat', 'im')
Then import it into Python using scipy:
import numpy as np
import scipy.io
im_matlab = scipy.io.loadmat('data.mat')['im']
np.all(np.isclose(im_matlab, im))
If the last line gives you True, then that means the calculations are correct.
About the plot: imshow assumes that the origin (the 0th index of the numpy array) is in the top left corner, which is the norm for images. You can change this by passing the origin='lower' keyword to plt.imshow.
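For example, a minimal tweak to the FFT plot from the question, with only the origin keyword added:
plt.imshow(np.real(imF), cmap='gray', origin='lower',
           extent=[min(fx), max(fx), min(fy), max(fy)], aspect=2/3)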
About fftshift, I think this answer to a different StackOverflow question is what you need.
I'm really new to Python programming, and I was just wondering if you can create a regular grid of 0.5 by 0.5 m resolution using LiDAR points.
My data are in LAS format (read with from liblas import file as lasfile) and have the following format: X, Y, Z, where X and Y are coordinates.
The points are randomly positioned, some pixels are empty (NaN value), and in some pixels there is more than one point. Where there is more than one point, I wish to obtain a mean value. In the end I need to save the data in TIF or ASCII format.
I am studying the osgeo module and GDAL, but I honestly don't know if the osgeo module is the best solution.
I would really be glad for help with some code that I can study and implement.
Thanks in advance for the help, I really need it.
I don't know the best way to get a grid with these parameters.
It's a bit late but maybe this answer will be useful for others, if not for you...
I have done this with Numpy and Pandas, and it's pretty fast. I was using TLS data and could do this with several million data points without any trouble on a decent 2009-vintage laptop. The key is 'binning' by rounding the data, and then using Pandas' GroupBy methods to do the aggregating and calculate the means.
If you need to round to a power of 10 you can use np.round, otherwise you can round to an arbitrary value by making a function to do so, which I have done by modifying this SO answer.
import numpy as np
import pandas as pd
# make rounding function:
def round_to_val(a, round_val):
    return np.round(np.array(a, dtype=float) / round_val) * round_val
# load data
data = np.load('your_data.npy')  # placeholder filename; expects an array of shape (n_points, 3): x, y, z
n_d = data.shape[0]
# round the data
d_round = np.empty( [n_d, 5] )
d_round[:,0] = data[:,0]
d_round[:,1] = data[:,1]
d_round[:,2] = data[:,2]
del data # free up some RAM
d_round[:,3] = round_to_val( d_round[:,0], 0.5)
d_round[:,4] = round_to_val( d_round[:,1], 0.5)
# sorting data
ind = np.lexsort( (d_round[:,4], d_round[:,3]) )
d_sort = d_round[ind]
# making dataframes and grouping stuff
df_cols = ['x', 'y', 'z', 'x_round', 'y_round']
df = pd.DataFrame( d_sort)
df.columns = df_cols
df_round = df[['x_round', 'y_round', 'z']]
group_xy = df_round.groupby(['x_round', 'y_round'])
# calculating the mean, write to csv, which saves the file with:
# [x_round, y_round, z_mean] columns. You can exit Python and then start up
# later to clear memory if that's an issue.
group_mean = group_xy.mean()
group_mean.to_csv('your_binned_data.csv')
# Restarting...
import numpy as np
from scipy.interpolate import griddata
binned_data = np.loadtxt('your_binned_data.csv', skiprows=1, delimiter=',')
x_bins = binned_data[:,0]
y_bins = binned_data[:,1]
z_vals = binned_data[:,2]
pts = np.array( [x_bins, y_bins])
pts = pts.T
# make grid (with borders rounded to 0.5...)
xmin, xmax = 637000, 640000.5
ymin, ymax = 6067000, 6070000.5
grid_x, grid_y = np.mgrid[xmin:xmax:0.5, ymin:ymax:0.5]
# interpolate onto grid
data_grid = griddata(pts, z_vals, (grid_x, grid_y), method='cubic')
# save to ascii
np.savetxt('data_grid.txt', data_grid)
When I've done this, I have saved the output as a .npy and converted to a tiff with the Image library, and then georeferenced in ArcMap. There is probably a way to do that with osgeo but I haven't used it.
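For what it's worth, a minimal sketch of that last step, saving the gridded array and writing it out as a 32-bit float TIFF with the Image library (Pillow); the filenames are just placeholders:
import numpy as np
from PIL import Image

np.save('data_grid.npy', data_grid)                  # keep a copy of the raw array
img = Image.fromarray(data_grid.astype(np.float32))  # mode 'F': 32-bit float pixels
img.save('data_grid.tif')                            # georeference afterwards (e.g. in ArcMap)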
Hope this helps someone at least...
You can use the histogram function in Numpy to do the binning, for instance:
import numpy as np
points = np.random.random(1000)
values = np.random.random(1000)  # the quantity to average per bin (e.g. your z values)
# create 9 bins between 0 and 1 (10 edges)
bins = np.linspace(0, 1, 10)
means = (np.histogram(points, bins, weights=values)[0] /
         np.histogram(points, bins)[0])
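The same idea extends to the 2D grid the question asks for. A minimal sketch (not part of the original answer, and using made-up sample points): bin the x/y coordinates on a 0.5 m grid with np.histogram2d, once weighted by z and once unweighted, then divide to get the per-cell mean; empty cells come out as NaN.
import numpy as np

# x, y, z stand in for the LiDAR point coordinates and elevations
x = np.random.uniform(0, 100, 10000)
y = np.random.uniform(0, 100, 10000)
z = np.random.random(10000)

# 0.5 m cell edges covering the data extent
xedges = np.arange(x.min(), x.max() + 0.5, 0.5)
yedges = np.arange(y.min(), y.max() + 0.5, 0.5)

sums, _, _ = np.histogram2d(x, y, bins=[xedges, yedges], weights=z)
counts, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])

with np.errstate(divide='ignore', invalid='ignore'):
    mean_z = sums / counts  # cells with no points become NaN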
Try LAStools, particularly lasgrid or las2dem.
The Problem:
I'm currently loading column data from text files into numpy arrays, then plotting them and saving the resulting image. Because the values will always lie on an equally spaced grid, it seemed an appropriate time to use pcolorfast. Each array is necessarily square, usually between 1024x1024 and 8192x8192. At present, I'm only concerned with this working up to and including 4096x4096 sizes. This needs to be done for hundreds of files, and while it successfully completes the first image, subsequent images crash with a MemoryError.
Unsuccessful solutions:
I've ensured, as per here, that I have hold = False in rc.
Limitations:
The images must be saved using all 4096x4096 values, and cannot be scaled down to 1024x1024 (as suggested here).
Notes:
After watching memory usage during each phase (create empty array, load values, plot, save), the array A is still sitting in memory after makeFrame is complete. Is an explicit call to delete it required? Does fig need to be explicitly deleted, or should pylab take care of that? The ideal situation (probably obvious) would be to have memory usage return to ~the same level as it was before the call to makeFrame().
Any and all advice is greatly appreciated. I've been trying to resolve this for a few days, so it's not unlikely I've missed something obvious, and an obvious solution would be exciting (more so than the alternative, which is that this is a more complicated problem).
Current code sample:
import numpy
import matplotlib
matplotlib.use("AGG")
import matplotlib.pylab as plt
def makeFrame(srcName, dstName, coloring, sideLength,
              dataRanges, delim, dpi):
    v, V, cmap = coloring
    n = sideLength
    xmin, xmax, ymin, ymax = dataRanges
    A = numpy.empty((n, n), float)
    dx = (xmax - xmin) / (n - 1)
    dy = (ymax - ymin) / (n - 1)
    srcfile = open(srcName, 'r')  # text mode so split(delim) works on str
    for line in srcfile:
        lineVals = line[:-1].split(delim)
        x = float(lineVals[0])
        y = float(lineVals[1])
        c = float(lineVals[2])
        # Find index from float value, adjust for rounding
        i = (x - xmin) / dx
        if (i - int(i)) > .05: i += 1
        j = (y - ymin) / dy
        if (j - int(j)) > .05: j += 1
        A[int(i), int(j)] = c  # indices must be ints for numpy indexing
    srcfile.close()
    print("loaded vals")
    fig = plt.figure(1)
    fig.clf()
    ax = fig.gca()
    ScalarMap = ax.pcolorfast(A, vmin=v, vmax=V, cmap=cmap)
    fig.colorbar(ScalarMap)
    ax.axis('image')
    fig.savefig(dstName, dpi=dpi)
    plt.close(1)
    print("saved image")
Caveats:
There might be a better way to deal with this memory problem that I don't know about.
I haven't been able to reproduce this error. When I use matplotlib.cbook.report_memory(), my memory usage seems to level out as expected.
Despite the caveats, I thought I'd mention a general, cheap method of dealing with problems caused by a program refusing to release memory: Use the multiprocessing module to spawn the problematic function in a separate process. Wait for the function to end, then call it again. Each time a subprocess ends, you regain the memory it used.
So I suggest trying something like this:
import matplotlib.cbook as mc
import multiprocessing as mp
import matplotlib.cm as cm
if __name__ == '__main__':
    for _ in range(10):
        srcName = 'test.data'
        dstName = 'test.png'
        vmin = 0
        vmax = 5
        cmap = cm.jet
        sideLength = 500
        dataRanges = (0.0, 1.0, 0.0, 1.0)
        delim = ','
        dpi = 72
        proc = mp.Process(target=makeFrame, args=(
            srcName, dstName, (vmin, vmax, cmap), sideLength,
            dataRanges, delim, dpi))
        proc.start()
        proc.join()
        usage = mc.report_memory()
        print(usage)
In my application, the data is sampled on a distorted grid, and I would like to resample it to an undistorted grid. In order to test this, I wrote this program with exemplary distortions and a simple function as data:
from __future__ import division
import numpy as np
import scipy.interpolate as intp
import pylab as plt
# Defining some variables:
quadratic = -3/128
linear = 1/16
pn = np.poly1d([quadratic, linear,0])
pixels_x = 50
pixels_y = 30
frame = np.zeros((pixels_x,pixels_y))
x_width = np.concatenate((np.linspace(8, 7.8, 57), np.linspace(7.8, 8, max(pixels_y - 57, 0))))
def data(x, y):
    z = y*(np.exp(-(x-5)**2/3) + np.exp(-(x)**2/5) + np.exp(-(x+5)**2))
    return z
# Generating grid coordinates
yt = np.arange(380,380+pixels_y*4,4)
xt = np.linspace(-7.8,7.8,pixels_x)
X, Y = np.meshgrid(xt,yt)
Y=Y.T
X=X.T
Y_m = np.zeros((pixels_x,pixels_y))
X_m = np.zeros((pixels_x,pixels_y))
# generating distorted grid coordinates:
for i in range(pixels_y):
    Y_m[:,i] = Y[:,i] - pn(xt)
    X_m[:,i] = np.linspace(-x_width[i], x_width[i], pixels_x)
# Sample data:
for i in range(pixels_y):
    for j in range(pixels_x):
        frame[j,i] = data(X_m[j,i], Y_m[j,i])
Y_m = Y_m.flatten()
X_m = X_m.flatten()
frame = frame.flatten()
##
Y = Y.flatten()
X = X.flatten()
ipf = intp.interp2d(X_m,Y_m,frame)
interpolated_frame = ipf(xt,yt)
At this point, I have two questions:
The code works, but I get the following warning:
Warning: No more knots can be added because the number of B-spline coefficients
already exceeds the number of data points m. Probably causes: either
s or m too small. (fp>s)
kx,ky=1,1 nx,ny=54,31 m=1500 fp=0.000006 s=0.000000
Also, some interpolation artifacts appear, and I assume that they are related to the warning. Do you know what I am doing wrong?
For my actual application, the frames need to be around 500x100, but when doing this I get a MemoryError. Is there something I can do to help that, apart from splitting the frame into several parts?
Thanks!
This problem is most likely related to the usage of bisplrep and bisplev within interp2d. The docs mention that it uses a smoothing factor of s=0.0 and that bisplrep and bisplev should be used directly if more control over s is needed. The related docs mention that s should be found between (m-sqrt(2*m), m+sqrt(2*m)), where m is the number of points used to construct the splines. I had a similar problem and found it solved when using bisplrep and bisplev directly, where s is only optional.
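As a rough sketch of that approach with the flattened arrays from the question, choosing s at the upper end of the recommended range (this is an untested starting point, not a drop-in fix):
import numpy as np
import scipy.interpolate as intp

# X_m, Y_m, frame are the flattened scattered samples from the question,
# and xt, yt are the regular target grid coordinates.
m = len(frame)
s = m + np.sqrt(2 * m)  # pick s in the recommended (m - sqrt(2m), m + sqrt(2m)) range

tck = intp.bisplrep(X_m, Y_m, frame, s=s)
interpolated_frame = intp.bisplev(xt, yt, tck)  # shape (len(xt), len(yt))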
For 2D interpolation, griddata is solid, local, and fast.
Take a look at problem-with-2d-interpolation-in-scipy-non-rectangular-grid on SO.
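A minimal sketch of what griddata would look like with the arrays from the question (the method and fill value are assumptions, adjust as needed):
import numpy as np
import scipy.interpolate as intp

# X_m, Y_m, frame are the flattened distorted-grid samples from the question;
# X, Y are the flattened regular grid coordinates to resample onto.
interpolated_frame = intp.griddata((X_m, Y_m), frame, (X, Y),
                                   method='linear', fill_value=np.nan)
interpolated_frame = interpolated_frame.reshape(pixels_x, pixels_y)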
You might want to look at the following interp method in basemap:
mpl_toolkits.basemap.interp
http://matplotlib.sourceforge.net/basemap/doc/html/api/basemap_api.html
unless you really need spline-based interpolation.