python xarray indexing/slicing very slow

I'm currently processing some ocean model outputs. At each time step, it has 42*1800*3600 grid points.
I found that the bottleneck in my program is the slicing, and calling xarray's built-in method to extract the values. What's more interesting, the same syntax sometimes requires a vastly different amount of time.
import xarray
ds = xarray.open_dataset(filename, decode_times=False)
vvel0=ds.VVEL.sel(lat=slice(-60,-20),lon=slice(0,40))/100 #in CCSM output, unit is cm/s convert to m/s
uvel0=ds.UVEL.sel(lat=slice(-60,-20),lon=slice(0,40))/100 ## why the speed is that different? now it's regional!!
temp0=ds.TEMP.sel(lat=slice(-60,-20),lon=slice(0,40)) #de
Take this for example: reading VVEL and UVEL took ~4 s, while reading TEMP only needed ~6 ms. Without slicing, VVEL and UVEL took ~1 s, and TEMP needed 120 ns.
I always thought that when I only read in part of the full array, I would need less memory and therefore less time. It turned out that xarray seems to load the full array, and any extra slicing takes more time. But could somebody please explain why reading different variables from the same netCDF file takes such different amounts of time?
The program is designed to extract a stepwise section and calculate the cross-sectional heat transport, so I need to pick out either UVEL or VVEL and multiply it by TEMP along the section. So it may seem that loading TEMP that fast is good, doesn't it?
Unfortunately, that's not the case. When I loop through about ~250 grid points along the prescribed section...
# Calculate VT flux orthogonal to the chosen grid cells, which is the heat transport across GOODHOPE line
vtflux=[]
utflux=[]
vap = vtflux.append
uap = utflux.append
#for i in range(idx_north,idx_south+1):
for i in range(10):
    yidx=gh_yidx[i]
    xidx=gh_xidx[i]
    lon_next=ds_lon[i+1].values
    lon_current=ds_lon[i].values
    lat_next=ds_lat[i+1].values
    lat_current=ds_lat[i].values
    tt=np.squeeze(temp[:,yidx,xidx].values) #<< calling values is slow
    if (lon_next<lon_current) and (lat_next==lat_current): # The condition is incorrect
        dxlon=Re*np.cos(lat_current*np.pi/180.)*0.1*np.pi/180.
        vv=np.squeeze(vvel[:,yidx,xidx].values)
        vt=vv*tt
        vtdxdz=np.dot(vt[~np.isnan(vt)],layerdp[0:len(vt[~np.isnan(vt)])])*dxlon
        vap(vtdxdz)
        #del vtdxdz
    elif (lon_next==lon_current) and (lat_next<lat_current):
        #ut=np.array(uvel[:,gh_yidx[i],gh_xidx[i]].squeeze().values*temp[:,gh_yidx[i],gh_xidx[i]].squeeze().values) # slow
        uu=np.squeeze(uvel[:,yidx,xidx]).values # slow
        ut=uu*tt
        utdxdz=np.dot(ut[~np.isnan(ut)],layerdp[0:len(ut[~np.isnan(ut)])])*dxlat
        uap(utdxdz) #m/s*degC*m*m ## looks fine, something wrong with the sign
        #del utdxdz
total_trans=(np.nansum(vtflux)-np.nansum(utflux))*3996*1026/1e15
Especially this line:
tt=np.squeeze(temp[:,yidx,xidx].values)
It takes ~3.65 s, and it has to be repeated ~250 times. If I remove .values, the time drops to ~4 ms. But I need to multiply tt by vv to get vt, so I have to extract the values. What's weird is that the similar expression, vv=np.squeeze(vvel[:,yidx,xidx].values), requires much less time, only about 1.3 ms.
To summarize my questions:
Why does loading different variables from the same netCDF file take different amounts of time?
Is there a more efficient way to pick out a single column in a multidimensional array? (Not necessarily an xarray structure; a numpy.ndarray would also do.)
Why does extracting values from xarray structures take different amounts of time for the exact same syntax?
Thank you!

When you index a variable loaded from a netCDF file, xarray doesn't load it into memory immediately. Instead, we create a lazy array that supports any number of further deferred indexing operations. This is true even if you aren't using dask.array (triggered by setting chunks= in open_dataset or using open_mfdataset).
This explains the surprising performance you observe. Calculating temp0 is fast, because it doesn't load any data from disk. vvel0 is slow, because dividing by 100 requires loading the data into memory as a numpy array.
Later, it's slower to index temp0 because each operation loads data from disk, instead of indexing a numpy array already in memory.
The work-around is to explicitly load the portion of your dataset that you need into memory first, e.g., by writing temp0.load(). The netCDF section of the xarray docs also gives this tip.
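Once the subset is in memory (for example after temp0.load()), all columns along the section can also be pulled out in a single indexing call rather than one .values call per grid point. The following is a minimal sketch with plain NumPy, assuming a (depth, lat, lon) array; the array contents and the index values are made up for illustration and are not taken from the question's data.

```python
import numpy as np

# Hypothetical stand-in for the in-memory regional array of shape
# (depth, lat, lon), e.g. what temp0.load().values would hold.
rng = np.random.default_rng(0)
temp = rng.normal(size=(42, 400, 400))

# Hypothetical section indices: one (y, x) pair per grid point on the section.
gh_yidx = np.array([10, 11, 12, 12, 13])
gh_xidx = np.array([5, 5, 5, 6, 6])

# Pointwise fancy indexing extracts every column at once.
# Result shape: (depth, n_points).
columns = temp[:, gh_yidx, gh_xidx]

# Equivalent per-point loop, shown only to check the result.
looped = np.stack([temp[:, y, x] for y, x in zip(gh_yidx, gh_xidx)], axis=1)
```

Each column columns[:, k] corresponds to tt for the k-th grid point, so the per-point products can then be computed on NumPy arrays that are already in memory.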

Related

How to make a for loop in python, which is called multiple times consecutively, execute faster?

I'd like to state off the bat that I don't have a lot of experience with NumPy, so deeper explanations would be appreciated (even obvious ones).
Here's my issue:
converted_X = X
for col in X:
    curr_data = X[col]
    i = 0
    for pix in curr_data:
        inv_pix = 255.0 - pix
        curr_data[i] = inv_pix
        i+=1
    converted_X[col] = curr_data.values
Context: X is a DataFrame with images of handwritten digits (70k images, 784 pixels/image).
The entire point of doing this is to change the black background to white and white numbers to black.
The only problem I'm facing with this is that it's taking a ridiculously long time. I tried using rich.Progress() to track its execution, and it's an astonishing 4 hour ETA.
Also, I'm executing this code block in the jupyter notebook extension of VSCode (Might help).
I know it probably has to do with a ton of inefficiencies and under-usage of numPy functionality, but I need guidance.
Thanks in advance.

Never write a for loop in Python over NumPy data; avoiding such loops is how you make the code faster.
Most of the time, there is a way to have NumPy do the loop for you, meaning it processes the data in batches. Obviously there is still a for loop, but not one you wrote in Python.
Here, it seems you are trying to compute an inverted image, whose pixels are 255 minus the original pixels.
Just write inverted_image = 255-image
Addition: note that NumPy arrays are quite inefficient when used like plain Python arrays. If you use them just as 2D arrays that you read and write with low-level instructions (setting values individually), then, most of the time, even good old Python lists are faster. For example, in your case (I've just tried), on my machine, your code is 9 times slower with ndarrays than the exact same code using a Python list of lists of values directly.
The whole point of ndarrays is that they are faster because you can use them with numpy functions that deal with the whole data in batch for you. And that would not be feasible as easily with python lists.

If X is a numpy array, you can do the following, without any loops:
converted_X = 255.0 - X
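As a small illustration of the point above, here is a sketch that inverts a made-up batch of images in one expression and checks it against an explicit loop on a few rows. The array X here is random data standing in for the question's pixel values, not the real dataset.

```python
import numpy as np

# Made-up stand-in for the pixel data: 1000 images of 784 pixels each.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(1000, 784)).astype(np.float64)

# Vectorized inversion: NumPy applies the subtraction to every element.
inverted = 255.0 - X

# Explicit Python loop over the first three rows, for comparison only.
expected = np.array([[255.0 - pix for pix in row] for row in X[:3]])
```

The same expression also works elementwise when X is a pandas DataFrame, so no conversion is needed before inverting.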

What is the most efficient way to repeatedly search a large text file (800 MB) for certain numbers?

The large file is 12 million lines of text such as this:
81.70, 89.86, 717.985
81.74, 89.86, 717.995
81.78, 89.86, 718.004
81.82, 89.86, 718.014
81.86, 89.86, 718.024
81.90, 89.86, 718.034
This is latitude, longitude, and distance from the nearest coastline (respectively).
My code uses coordinates of known places (for example, Mexico City: "-99.1, 19.4") and searches the large file, line by line, to output the distance from the nearest coastline of that coordinate.
I put each line into a list because many lines meet the long/lat criteria. I later average the distances from the coastline.
Each coordinate takes about 12 seconds to retrieve. My entire script takes 14 minutes to complete.
Here's what I have been using:
long = "-99.1"  # kept as strings so they can be matched against each text line
lat = "19.4"
country_d2s = []
# outputs all list items with specified long and lat values
with open(r"C:\Users\jason\OneDrive\Desktop\s1186prXbF0O", 'r') as dist2sea:
    for line in dist2sea:
        if long in line and lat in line and line.startswith(long):
            country_d2s.append(line)
I am looking for a way to search through the file much quicker and/or rewrite the file to make it easier to work with.

Use a database with a key comprised of the latitude and longitude. If you're looking for a lightweight DB that can be shared as a file, there's SqliteDict or bsddb3. That would be much faster than reading a text file each time the program is run.

Import your data into an SQLite database, then create an index on (latitude, longitude). An index lookup should take milliseconds. To read the data, use Python's sqlite3 module.
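A minimal sketch of that approach with the standard-library sqlite3 module follows. The table name, column names, and the helper mean_distance are made up for illustration, and the column order (lon, lat, dist) is an assumption based on how the question's code matches lines. An in-memory database is used here; passing a file path to sqlite3.connect would keep the table between runs.

```python
import sqlite3

# A few rows copied from the question's sample, assumed to be (lon, lat, dist).
rows = [
    (81.70, 89.86, 717.985),
    (81.74, 89.86, 717.995),
    (81.78, 89.86, 718.004),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dist2sea (lon REAL, lat REAL, dist REAL)")
conn.executemany("INSERT INTO dist2sea VALUES (?, ?, ?)", rows)
# The composite index is what makes coordinate lookups fast.
conn.execute("CREATE INDEX idx_lon_lat ON dist2sea (lon, lat)")
conn.commit()

def mean_distance(lon_min, lon_max, lat_min, lat_max):
    """Average distance to coast over all rows inside a lon/lat box."""
    cur = conn.execute(
        "SELECT AVG(dist) FROM dist2sea "
        "WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ?",
        (lon_min, lon_max, lat_min, lat_max),
    )
    return cur.fetchone()[0]  # None when no row matches

avg = mean_distance(81.70, 81.75, 89.80, 89.90)
```

A range query is used instead of exact equality because the question averages every line whose coordinates start with the given one-decimal values.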

Comments:
It's unclear if you are using the fact that your long/lat are XX.Y and you are searching against XX.YY as some kind of fuzzy matching technique.
I also cannot tell how you plan to execute this: load + [run] x 1000 vs [load + run] x 1000, which would inform the solution you want to use.
That being said, if you want to do very fast exact lookups one option is to load the entire thing into memory as a mapping, e.g. {(long, lat): coast_distance, ...}. Since floats are not good keys, it would be better to use strings, integers, or fractions for this.
If you want to do fuzzy matching, there are data structures (and a number of packages) that would solve that issue:
1D: https://pypi.org/project/intervaltree/
2D: https://pypi.org/project/Quadtree/
3+D: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html#scipy.spatial.KDTree
If you want the initial load time to be faster you can do things like writing a binary pickle and loading that directly instead of parsing a file. A database is also a simple solution to this.
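The exact-lookup mapping mentioned above could look roughly like this. It is only a sketch: the helper to_key and the choice of hundredths of a degree as integer keys are assumptions made for illustration, and the lines are the sample rows from the question rather than the real file.

```python
# Sample lines in the question's format; the real code would iterate the file.
lines = [
    "81.70, 89.86, 717.985",
    "81.74, 89.86, 717.995",
    "81.78, 89.86, 718.004",
]

def to_key(first, second):
    """Turn two coordinates into integer hundredths of a degree.

    Integers are used as keys because floats make unreliable dict keys.
    """
    return (round(first * 100), round(second * 100))

# Build the mapping once; every later lookup is a constant-time dict access.
table = {}
for line in lines:
    first, second, dist = (float(field) for field in line.split(","))
    table[to_key(first, second)] = dist

found = table.get(to_key(81.74, 89.86))
missing = table.get(to_key(0.0, 0.0))
```

Building the dictionary costs one pass over the file, so this pays off when many coordinates are looked up in the same run.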

You could partition the file into 10 by 10 degree patches. This would reduce the search space by a factor of 648, yielding 648 files of about 18,500 lines each, and would cut the search time to about 0.02 seconds.
As you are doing exact matches of lat-long, you could instead use any on-disk key-value store. Python has at least one of them built in. If you were doing nearest neighbor or metric space searches, there are spatial databases that support those.

If you are using Python, I recommend PySpark.
In this particular case you can use the mapPartitions function and join the results.
This could help: How does the pyspark mapPartitions function work?
PySpark is useful when working with very large amounts of data because it splits the data into N partitions and uses your processor's full power.
Hope it helps.

Fill up a 2D array while iterating through it

An example what I want to do is instead of doing what is shown below:
Z_old = [[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0]]
for each_axes in range(len(Z_old)):
    for each_point in range(len(Z_old[each_axes])):
        Z_old[len(Z_old)-1-each_axes][each_point] = arbitrary_function(each_point, each_axes)
Now I want to avoid initializing the Z_old array with zeroes and instead fill it with values while iterating through it. It would be something like what is written below; the syntax is horribly wrong, but that's what I want to reach in the end.
Z = np.zeros((len(x_list), len(y_list))) for Z[len(x_list) -1 - counter_1][counter_2] is equal to power_at_each_point(counter_1, counter_2] for counter_1 in range(len(x_list)) and counter_2 in range(len(y_list))]

As I explained in my answer to your previous question, you really need to vectorize arbitrary_function.
You can do this by just calling np.vectorize on the function, something like this:
Z = np.vectorize(arbitrary_function)(np.arange(3), np.arange(5).reshape(5, 1))
But that will only give you a small speedup. In your case, since arbitrary_function is doing a huge amount of work (including opening and parsing an Excel spreadsheet), it's unlikely to make enough difference to even notice, much less to solve your performance problem.
The whole point of using NumPy for speedups is to find the slow part of the code that operates on one value at a time, and replace it with something that operates on the whole array (or at least a whole row or column) at once. You can't do that by looking at the very outside loop, you need to look at the very inside loop. In other words, at arbitrary_function.
In your case, what you probably want to do is read the Excel spreadsheet into a global array, structured in such a way that each step in your process can be written as an array-wide operation on that array. Whether that means multiplying by a slice of the array, indexing the array using your input values as indices, or something completely different, it has to be something NumPy can do for you in C, or NumPy isn't going to help you.
If you can't figure out how to do that, you may want to consider not using NumPy, and instead compiling your inner loop with Cython, or running your code under PyPy. You'll still almost certainly need to move the "open and parse a whole Excel spreadsheet" outside of the inner loop, but at least you won't have to figure out how to rethink your problem in terms of vectorized operations, so it may be easier for you.
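To make the idea of an array-wide operation concrete, here is a sketch that fills the whole grid with one broadcast expression. The function power_at is a hypothetical stand-in for arbitrary_function and assumes the real computation can be written with array arithmetic, which may not hold for the asker's actual function.

```python
import numpy as np

def power_at(point, axis):
    # Hypothetical stand-in for arbitrary_function, built only from
    # arithmetic that works on scalars and on arrays alike.
    return point ** 2 + 10 * axis

n_axes, n_points = 3, 5
axes = np.arange(n_axes).reshape(n_axes, 1)  # column vector, shape (3, 1)
points = np.arange(n_points)                 # row vector, shape (5,)

# Broadcasting evaluates every (axis, point) pair at once, giving shape (3, 5).
# Reversing the rows reproduces the len(Z_old)-1-each_axes indexing.
Z = power_at(points, axes)[::-1]

# The question's nested loop, kept here only to check the result.
Z_old = [[0] * n_points for _ in range(n_axes)]
for each_axes in range(n_axes):
    for each_point in range(n_points):
        Z_old[n_axes - 1 - each_axes][each_point] = power_at(each_point, each_axes)
```

No zero-filled array is needed in the broadcast version, since the result is created directly with its final values.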

import numpy
rows = 10
cols = 10
Z = numpy.array([arbitrary_function(each_point, each_axes)
                 for each_axes in range(cols)
                 for each_point in range(rows)]).reshape((rows, cols))
maybe?

Efficient way of computing statistics for large/imprecise amount of data

I have over 65 million numeric values stored in a text file. I need to compute the maximum, minimum, average, standard deviation, as well as the 25, 50, and 75 percentiles.
Normally I would use the attached code, but I need a more efficient way to compute these metrics because I cannot store all the values p in a list. How can I calculate these values more efficiently in Python?
import numpy as np
np.average(obj)
np.min(mylist)
np.max(mylist)
np.std(mylist)
np.percentile(obj, 25)
np.percentile(obj, 50)
np.percentile(obj, 75)
maxx = float('-inf')
minx = float('+inf')
sumz = 0
for index, p in enumerate(open("foo.txt", "r")):
    maxx = max(maxx, float(p))
    minx = min(minx, float(p))
    sumz += float(p)
    index += 1
my_max = maxx
my_min = minx
my_avg = sumz/index

Use a binary file. Then you can use numpy.memmap to map it to memory and perform all sorts of algorithms on it, even if the dataset is larger than RAM.
You can even use the numpy.memmap to create a memory mapped array, and read your data in from the text file... you can work on it and when you are done, you also have the data in binary format.
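A rough sketch of the memory-mapped approach is shown below. The data are made up (the integers 1 to 1000), the file name values.bin is hypothetical, and the values are assumed to be stored as float64. Note that np.percentile may still need working memory of its own, so for data truly larger than RAM the percentile step could need a different method.

```python
import os
import tempfile
import numpy as np

# Made-up data written once in binary form; in practice this file would be
# produced by converting the text file.
values = np.arange(1, 1001, dtype=np.float64)
path = os.path.join(tempfile.mkdtemp(), "values.bin")
values.tofile(path)

# Map the file into memory; pages are read from disk only when touched.
data = np.memmap(path, dtype=np.float64, mode="r")

stats = {
    "min": float(data.min()),
    "max": float(data.max()),
    "mean": float(data.mean()),
    "std": float(data.std()),
    "p25": float(np.percentile(data, 25)),
    "p50": float(np.percentile(data, 50)),
    "p75": float(np.percentile(data, 75)),
}
```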

I think you are on the right track, by iterating over the file and keeping track of max and min values. To calculate the std, you should keep a sum of squares inside the loop: sum_of_squares += z**2. You then can calculate std = sqrt(sum_of_squares / n - (sumz / n)**2) after the loop, see formula here (but this formula might suffer from numerical problems). For performance, you might want to iterate over the file in some decent size chunks of data.
To calculate the median and percentiles in a 'continuous' way, you could build up a histogram inside your loop. After the loop, you can get approximate percentiles and median by converting the histogram to the CDF, the error will depend on the number of bins.
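The single-pass idea can be sketched as follows. This assumes the value range (lo, hi) is known in advance or found in an earlier pass; the function name streaming_stats and its parameters are made up for illustration. Percentiles are approximate, with an error on the order of one bin width, and the sum-of-squares formula carries the numerical caveat mentioned above.

```python
import math
import numpy as np

def streaming_stats(chunks, lo, hi, bins=1000):
    """One pass over chunks of numbers: min, max, mean, std, approx percentiles."""
    n = 0
    total = 0.0
    total_sq = 0.0
    vmin, vmax = math.inf, -math.inf
    edges = np.linspace(lo, hi, bins + 1)
    hist = np.zeros(bins, dtype=np.int64)
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        n += chunk.size
        total += float(chunk.sum())
        total_sq += float(np.square(chunk).sum())
        vmin = min(vmin, float(chunk.min()))
        vmax = max(vmax, float(chunk.max()))
        # Values outside [lo, hi] are silently dropped by np.histogram.
        hist += np.histogram(chunk, bins=edges)[0]
    mean = total / n
    std = math.sqrt(max(total_sq / n - mean ** 2, 0.0))
    cdf = np.cumsum(hist) / n

    def percentile(q):
        # Upper edge of the first bin where the CDF reaches q percent.
        idx = int(np.searchsorted(cdf, q / 100.0))
        return float(edges[min(idx + 1, bins)])

    return {"min": vmin, "max": vmax, "mean": mean, "std": std,
            "p25": percentile(25), "p50": percentile(50), "p75": percentile(75)}

# Made-up data, processed in ten chunks as if read piecewise from a file.
data = np.linspace(0.0, 100.0, 10001)
stats = streaming_stats(np.array_split(data, 10), lo=0.0, hi=100.0)
```

Only the running totals and the histogram are kept in memory, so the cost is independent of the number of values.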

As Antti Haapala says, the easiest and most efficient way to do this will be to stick with numpy, and just use a memmapped binary file instead of a text file. Yes, converting from one format to the other will take a bit of time—but it'll almost certainly save more time than it costs (because you can use numpy vectorized operations instead of loops), and it will also make your code a lot simpler.
If you can't do that, Python 3.4 will come with a statistics module. A backport to 2.6+ will hopefully be available at some point after the PEP is finalized; at present I believe you can only get stats, the earlier module it's based on, which requires 3.1+. Unfortunately, while stats does do single-pass algorithms on iterators, it doesn't have any convenient way to run multiple algorithms in parallel on the same iterator, so you have to be clever with itertools.tee and zip to force it to interleave the work instead of pulling the whole thing into memory.
And of course there are plenty of other modules out there if you search PyPI for "stats" and/or "statistics" and/or "statistical".
Either way, using a pre-built module will mean someone's already debugged all the problems you're going to run into, and they may have also optimized the code (maybe even ported it to C) to boot.

To get the percentiles, sort the text file using a command line program. Use the line count (index in your program) to find the line numbers of the percentiles (index // 4, etc.) Then retrieve those lines from the file.

Most of these operations can be expressed easily in terms of simple arithmetic. In that case, it can actually (surprisingly) be quite efficient to process simple statistics directly from the Linux command line using awk and sed, e.g. as in this post: < http://www.unixcl.com/2008/09/sum-of-and-group-by-using-awk.html >.
If you need to generalize to more advanced operations, like weighted percentiles, then I'd recommend using Python Pandas (notably the HDFStore capabilities for later retrieval). I've used Pandas with a DataFrame of over 25 million records before (10 columns by 25 million distinct rows). If you're more memory constrained, you could read the data in in chunks, calculate partial contributions from each chunk, and store out intermediate results, then finish off the calculation by just loading the intermediate results, in a serialized sort of map-reduce kind of framework.

Python: Fastest way to iterate this through a large file

Right, I'm iterating through a large binary file
I need to minimise the time of this loop:
def NB2(self, ID_LEN):
    r1=np.fromfile(ReadFile.fid,dTypes.NB_HDR,1)
    num_receivers=r1[0][0]
    num_channels=r1[0][1]
    num_samples=r1[0][5]
    blockReturn = np.zeros((num_samples,num_receivers,num_channels))
    for rec in range(0,num_receivers):
        for chl in range(0,num_channels):
            for smpl in range(0,num_samples):
                r2_iq=np.fromfile(ReadFile.fid,np.int16,2)
                blockReturn[smpl,rec,chl] = np.sqrt(math.fabs(r2_iq[0])*math.fabs(r2_iq[0]) + math.fabs(r2_iq[1])*math.fabs(r2_iq[1]))
    return blockReturn
So, what's going on is as follows:
r1 is the header of the file, dTypes.NB_HDR is a type I made:
NB_HDR= np.dtype([('f3',np.uint32),('f4',np.uint32),('f5',np.uint32),('f6',np.int32),('f7',np.int32),('f8',np.uint32)])
That gets all the information about the forthcoming data block, and nicely puts us in the right position within the file (the start of the data block!).
In this data block there is:
4096 samples per channel,
4 channels per receiver,
9 receivers.
So num_receivers, num_channels, num_samples will always be the same (at the moment anyway), but as you can see this is a fairly large amount of data. Each 'sample' is a pair of int16 values that I want to find the magnitude of (hence Pythagoras).
This NB2 code is executed for each 'Block' in the file. For a 12GB file (which is how big they are) there are about 20,900 Blocks, and I've got to iterate through 1000 of these files (so, 12TB overall). Any speed advantage, even if it's milliseconds, would be massively appreciated.
EDIT: Actually it might be of help to know how I'm moving around inside the file. I have a function as follows:
def navigateTo(self, blockNum, indexNum):
    ReadFile.fid.seek(ReadFile.fileIndex[blockNum][indexNum],0)
    ReadFile.currentBlock = blockNum
    ReadFile.index = indexNum
Before I run all this code I scan the file and make a list of index locations at ReadFile.fileIndex that I browse using this function and then 'seek' to the absolute location - is this efficient?
Cheers

Because you know the length of a block after you read the header, read the whole block at once. Then reshape the array (very fast, only affects metadata) and use the np.hypot ufunc:
blockData = np.fromfile(ReadFile.fid, np.int16, num_receivers*num_channels*num_samples*2)
blockData = blockData.reshape((num_receivers, num_channels, num_samples, 2))
return np.hypot(blockData[:,:,:,0], blockData[:,:,:,1])
On my machine it runs in 11ms per block.
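For anyone who wants to try this without the original data, here is a self-contained sketch that writes one made-up block to a temporary file and reads it back in a single call. The block dimensions come from the question; the random samples, the file name block.bin, and the explicit float64 cast are assumptions added for the illustration.

```python
import os
import tempfile
import numpy as np

num_receivers, num_channels, num_samples = 9, 4, 4096

# Made-up I/Q samples laid out as (receiver, channel, sample, I/Q pair).
rng = np.random.default_rng(0)
raw = rng.integers(-32768, 32768,
                   size=(num_receivers, num_channels, num_samples, 2),
                   dtype=np.int16)
path = os.path.join(tempfile.mkdtemp(), "block.bin")
raw.tofile(path)

# One read for the whole block instead of one read per sample.
with open(path, "rb") as fid:
    block = np.fromfile(fid, np.int16,
                        num_receivers * num_channels * num_samples * 2)
block = block.reshape((num_receivers, num_channels, num_samples, 2))

# Magnitude of each I/Q pair; the cast makes the output float64.
magnitude = np.hypot(block[..., 0].astype(np.float64),
                     block[..., 1].astype(np.float64))
```

The result has shape (receiver, channel, sample), which differs from the (sample, receiver, channel) order in the question's blockReturn; np.transpose can reorder the axes if the old layout is needed.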

import numpy as np
def NB2(self, ID_LEN):
    r1=np.fromfile(ReadFile.fid,dTypes.NB_HDR,1)
    num_receivers=r1[0][0]
    num_channels=r1[0][1]
    num_samples=r1[0][5]
    # first, match your array bounds to the way you are walking the file
    blockReturn = np.zeros((num_receivers,num_channels,num_samples))
    for rec in range(0,num_receivers):
        for chl in range(0,num_channels):
            # second, read in all the samples at once if you have enough memory
            r2_iq=np.fromfile(ReadFile.fid,np.int16,2*num_samples)
            r2_iq.shape = (-1,2) # tell numpy that it is an array of two values
            # widen to float64 first: squaring int16 values directly can overflow
            r2_iq = r2_iq.astype(np.float64)
            # create dot product vector by squaring data elementwise, and then
            # adding those elements together. Results is of length num_samples
            r2_iq = r2_iq * r2_iq
            r2_iq = r2_iq[:,0] + r2_iq[:,1]
            # get the distance by performing the square root "into" blockReturn
            np.sqrt(r2_iq, out=blockReturn[rec,chl,:])
    return blockReturn
This should help your performance. Two main ideas in numpy work. First, your result arrays dimensions should match how your loop dimensions are crafted, for memory locality.
Second, NumPy is FAST. I've beaten hand-coded C with NumPy, simply because it uses LAPACK and vector acceleration. However, to get that power, you have to let it manipulate more data at a time. That is why your sample loop has been collapsed to read in all the samples for the receiver and channel in one large read. Then use the vector powers of NumPy to calculate your magnitude by dot product.
There is a little more optimization to be had in the magnitude calculation, but numpy recycles buffers for you, making it less important than you might think. I hope this helps!

I'd try to use as few loops and as much constants as possible.
Everything that can be done in a linear fashion should be done so.
If values don't change, use constants to reduce lookups and such,
because that eats up cpu cycles.
This is from a theoretical point of view ;-)
If possible, use highly optimised libraries. I don't know exactly what you are trying to achieve, but I'd rather use an existing FFT library than write one myself :>
One more thing: http://en.wikipedia.org/wiki/Big_O_notation (can be an eye-opener)

Most importantly, you shouldn't do file access at the lowest level of a triple nested loop, whether you do this in C or Python. You've got to read in large chunks of data at a time.
So to speed this up, read in large chunks of data at a time, and process that data using numpy indexing (that is, vectorize your code). This is particularly easy in your case since all your data is int16. Just read in big chunks of data, reshape the data into an array that reflects the (receiver, channel, sample) structure, and then use the appropriate indexing to multiply and add things for Pythagoras, and the 'sum' command to add up the terms in the resulting array.

This is more of an observation than a solution, but porting that function to C++ and loading it in with the Python API would get you a lot of speed gain to begin with before loop optimization.
