Why does netCDF4 give different results depending on how data is read? - python

I am coding in Python and trying to use netCDF4 to read in some floating point netCDF data. My original code looked like:
from netCDF4 import Dataset
import numpy as np
infile='blahblahblah'
ds = Dataset(infile)
start_pt = 5 # or whatever
x = ds.variables['thedata'][start_pt:start_pt+2,:,:,:]
Because of various and sundry other things, I now have to read 'thedata' one slice at a time:
x = np.zeros([2, I, J, K])  # I, J, K match the size of the input array
for n in range(2):
    x[n, :, :, :] = ds.variables['thedata'][start_pt+n, :, :, :]
The thing is that the two methods of reading give slightly different results. Nothing big, on the order of one part in 10^5, but still ...
So can anyone tell me why this is happening, and how I can guarantee the same results from the two methods? My thought was that the first method perhaps automatically establishes x as the same type as the input data, while the second method establishes x as the default type for a numpy array. However, the input data is 64-bit and I thought the default for a numpy array was also 64-bit, so that doesn't explain it. Any ideas? Thanks.

The first example pulls the data into a NetCDF4 Variable object, while the second example pulls the data into a numpy array. Is it possible that the Variable object is just displaying the data with a different amount of precision?
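If you want to rule out a dtype mismatch, here is a minimal diagnostic sketch, reusing the placeholder names from the question (infile, 'thedata', start_pt, I, J, K); it is only an illustration, not a fix:

from netCDF4 import Dataset
import numpy as np

ds = Dataset(infile)
var = ds.variables['thedata']
print(var.dtype)                               # dtype actually stored in the file (e.g. float32 vs float64)

whole = var[start_pt:start_pt+2, :, :, :]      # method 1: read both slices in one go
print(type(whole), whole.dtype)

x = np.zeros([2, I, J, K], dtype=var.dtype)    # method 2: force x to match the file's dtype
for n in range(2):
    x[n, :, :, :] = var[start_pt + n, :, :, :]

print(np.array_equal(np.asarray(whole), x))    # do the two read methods now agree?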

Related

Python quick data read-in and slice

I've got the following code in Python, and I think I need some help optimizing it.
I'm reading in a few million lines of data, but then throwing out most of them if one coordinate per line does not fit my criterion.
The code is as follows:
def loadFargoData(dataname, thlimit):
    temp = np.loadtxt(dataname)
    return temp[np.abs(temp[:, 1]) < thlimit]
I've coded it as if it were C-style code, and of course in Python this is crazy slow.
Can I throw out my temp object somehow? Or what other optimization can the Pythonian population help me with?
The data reader included in pandas might speed up your script, since it reads faster than numpy. Pandas produces a DataFrame object that is easy to view as a numpy array (and easy to convert if preferred), so you can apply your condition in numpy (which already looks efficient enough in your question).
import pandas as pd
import numpy as np

def loadFargoData(dataname, thlimit):
    temp = pd.read_csv(dataname)  # returns a DataFrame
    temp = temp.values            # returns a numpy array
    # the two lines above can be replaced by: temp = pd.read_csv(dataname).values
    return temp[np.abs(temp[:, 1]) < thlimit]
You might want to check Pandas' documentation to learn the function arguments you may need to read the file correctly (header, separator, etc.).
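For example, if the file turns out to be whitespace-delimited with no header row (an assumption about the format; adjust to whatever the file actually contains), the read could look like:

temp = pd.read_csv(dataname, sep=r'\s+', header=None).values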

python xarray indexing/slicing very slow

I'm currently processing some ocean model outputs. At each time step, it has 42*1800*3600 grid points.
I found that the bottleneck in my program is the slicing, and calling the xarray built-in method to extract the values. And what's more interesting, the same syntax sometimes requires a vastly different amount of time.
ds = xarray.open_dataset(filename, decode_times=False)
vvel0 = ds.VVEL.sel(lat=slice(-60, -20), lon=slice(0, 40))/100  # in CCSM output the unit is cm/s; convert to m/s
uvel0 = ds.UVEL.sel(lat=slice(-60, -20), lon=slice(0, 40))/100  # why is the speed that different? now it's regional!!
temp0 = ds.TEMP.sel(lat=slice(-60, -20), lon=slice(0, 40))
Take this for example: reading VVEL and UVEL took ~4 s, while reading TEMP needed only ~6 ms. Without slicing, VVEL and UVEL took ~1 s, and TEMP needed ~120 ns.
I always thought that when I only read in part of the full array, I would need less memory and therefore less time. It turned out that xarray loads in the full array, and any extra slicing takes more time. But could somebody please explain why reading different variables from the same netCDF file takes such different amounts of time?
The program is designed to extract a stepwise section and calculate the cross-sectional heat transport, so I need to pick out either UVEL or VVEL and multiply it by TEMP along the section. So it may seem that loading TEMP that fast is good, isn't it?
Unfortunately, that's not the case. When I loop through about ~250 grid points along the prescribed section...
# Calculate VT flux orthogonal to the chosen grid cells, which is the heat transport across GOODHOPE line
vtflux = []
utflux = []
vap = vtflux.append
uap = utflux.append
#for i in range(idx_north, idx_south+1):
for i in range(10):
    yidx = gh_yidx[i]
    xidx = gh_xidx[i]
    lon_next = ds_lon[i+1].values
    lon_current = ds_lon[i].values
    lat_next = ds_lat[i+1].values
    lat_current = ds_lat[i].values
    tt = np.squeeze(temp[:, yidx, xidx].values)  # << calling values is slow
    if (lon_next < lon_current) and (lat_next == lat_current):  # The condition is incorrect
        dxlon = Re*np.cos(lat_current*np.pi/180.)*0.1*np.pi/180.
        vv = np.squeeze(vvel[:, yidx, xidx].values)
        vt = vv*tt
        vtdxdz = np.dot(vt[~np.isnan(vt)], layerdp[0:len(vt[~np.isnan(vt)])])*dxlon
        vap(vtdxdz)
        #del vtdxdz
    elif (lon_next == lon_current) and (lat_next < lat_current):
        #ut = np.array(uvel[:, gh_yidx[i], gh_xidx[i]].squeeze().values*temp[:, gh_yidx[i], gh_xidx[i]].squeeze().values)  # slow
        uu = np.squeeze(uvel[:, yidx, xidx]).values  # slow
        ut = uu*tt
        utdxdz = np.dot(ut[~np.isnan(ut)], layerdp[0:len(ut[~np.isnan(ut)])])*dxlat
        uap(utdxdz)  # m/s * degC * m * m  ## looks fine, something wrong with the sign
        #del utdxdz
total_trans = (np.nansum(vtflux) - np.nansum(utflux))*3996*1026/1e15
Especially this line:
tt = np.squeeze(temp[:, yidx, xidx].values)
It takes ~3.65 s, and it has to be repeated ~250 times. If I remove .values, the time drops to ~4 ms, but I need to multiply tt by vv to get vt, so I have to extract the values. What's weird is that the similar expression vv = np.squeeze(vvel[:, yidx, xidx].values) requires much less time, only about ~1.3 ms.
To summarize my questions:
Why does loading different variables from the same netCDF file take different amounts of time?
Is there a more efficient way to pick out a single column from a multidimensional array (not necessarily an xarray structure, but also a numpy.ndarray)?
Why does extracting values from xarray structures take a different amount of time for the exact same syntax?
Thank you!
When you index a variable loaded from a netCDF file, xarray doesn't load it into memory immediately. Instead, we create a lazy array that supports any number of further deferred indexing operations. This is true even if you aren't using dask.array (which is triggered by setting chunks= in open_dataset or by using open_mfdataset).
This explains the surprising performance you observe: calculating temp0 is fast because it doesn't load any data from disk, while vvel0 is slow because dividing by 100 requires loading the data into memory as a numpy array.
Later, indexing temp0 is slower because each operation loads data from disk, instead of indexing a numpy array that is already in memory.
The work-around is to explicitly load the portion of your dataset that you need into memory first, e.g., by writing temp0.load(). The netCDF section of the xarray docs also gives this tip.
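A minimal sketch of that work-around, reusing the variable and coordinate names from the question (yidx and xidx as in the loop above):

import xarray
import numpy as np

ds = xarray.open_dataset(filename, decode_times=False)

# load the regional slices into memory once, up front
temp0 = ds.TEMP.sel(lat=slice(-60, -20), lon=slice(0, 40)).load()
vvel0 = (ds.VVEL.sel(lat=slice(-60, -20), lon=slice(0, 40))/100).load()
uvel0 = (ds.UVEL.sel(lat=slice(-60, -20), lon=slice(0, 40))/100).load()

# subsequent point extraction now indexes numpy data already in memory
tt = np.squeeze(temp0[:, yidx, xidx].values)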

Regridding NetCDF4 in Python

I'm working with various climate models, but right now I'm working on regridding the latitudes and longitudes of these files from 2.5x2.5 to 0.5x0.5, and I am completely lost. I've been relying on the Anaconda distribution for all of my netCDF4 needs, and I've made good progress; it's just regridding that baffles me completely. I have three main arrays that I'm using:
The first is the data_array, a numpy array that contains the information for precipitation.
The second is the lat_array, a numpy array containing the latitude information.
The third is the lon_array, a numpy array containing the longitude information.
All this data came from the netCDF4 file.
Again, my data is currently on a 2.5x2.5 grid, meaning the lon x lat size is currently 144x72. I use np.meshgrid(lon_array, lat_array) to put lon and lat onto that 144x72 grid; my data_array has the same shape, thus matching up perfectly.
This is where I get stuck and I have no idea how to proceed.
My thoughts: I want my 144x72 to convert to 720x360 in order for it to be 0.5x0.5.
I know one way of creating the lon and lat axes that I want is np.arange(-89.75, 90.25, 0.5) and np.arange(-179.75, 180.25, 0.5). But I don't know how to match up the data_array with that.
Can anyone please offer any assistance? Any help is much appreciated!
Note: I also have ESMF modules available to me.
An easy option would be nctoolkit (https://nctoolkit.readthedocs.io/en/latest/installing.html). It has a built-in method called to_latlon that easily achieves what you want. Just do the following for bilinear interpolation (and see the user guide for other methods):
import nctoolkit as nc
data = nc.open("infile.nc")
data.to_latlon(lon = [-179.75, 179.75], lat = [-89.75, 89.75], res = [0.5, 0.5])
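If you would rather stay with the numpy arrays you already have, here is a hedged sketch of bilinear-style interpolation onto the 0.5-degree grid with scipy, assuming data_array is shaped (72, 144) as (lat, lon) and that both coordinate axes are ascending (adjust to the actual ordering in your file):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

# target 0.5-degree axes
lat_out = np.arange(-89.75, 90.25, 0.5)    # 360 points
lon_out = np.arange(-179.75, 180.25, 0.5)  # 720 points

# interpolator over the original 2.5-degree axes
interp = RegularGridInterpolator((lat_array, lon_array), data_array,
                                 method='linear', bounds_error=False, fill_value=None)

# evaluate at every point of the 0.5-degree grid
lon2d, lat2d = np.meshgrid(lon_out, lat_out)
points = np.column_stack([lat2d.ravel(), lon2d.ravel()])
data_05 = interp(points).reshape(360, 720)

Note that this is plain interpolation, not conservative regridding, so for quantities like precipitation totals the ESMF/xESMF route you mention may be more appropriate.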

How to read NetCDF variable float data into a Numpy array with the same precision and scale as the original NetCDF float values?

I have a NetCDF file which contains a variable with float values with precision/scale == 7/2, i.e. there are possible values from -99999.99 to 99999.99.
When I take a slice of the values from the NetCDF variable and look at them in my debugger, I see that the values now in my array have more precision/scale than what I see in the original NetCDF. For example, when I look at the values in the ToolsUI/ncdump viewer they display as '-99999.99' or '12.45', but when I look at the values in the slice array they look like '-99999.9921875' (a greater scale length). So if I'm using '-99999.99' as the expected value to indicate a missing data point, then I won't get a match with what gets pulled into the slice array, since those values have a greater scale length and the additional digits in the scale are not just zeros for padding.
For example I see this if I do a ncdump on a point within the NetCDF dataset:
Variable: precipitation(0:0:1, 40:40:1, 150:150:1)
float precipitation(time=1348, lat=180, lon=360);
:units = "mm/month";
:long_name = "precipitation totals";
data:
{
{
{-99999.99}
}
}
However if I get a slice of the data from the variable like so:
value = precipitationVariable[0:1:1, 40:41:1, 150:151:1]
then I see it like this in my debugger (Eclipse/PyDev):
value == ndarray: [[[-99999.9921875]]]
So it seems as if the NetCDF dataset values that I read into a numpy array are not being read with the same precision/scale as the original values in the NetCDF file. Or perhaps the values within the NetCDF are actually the same as what I'm seeing when I read them, and what's shown to me via ncdump is being truncated due to some format settings in the ncdump program itself.
Can anyone advise as to what's happening here? Thanks in advance for your help.
BTW I'm developing this code using Python 2.7.3 on a Windows XP machine and using the Python module for the NetCDF4 API provided here: https://code.google.com/p/netcdf4-python/
There is no simple way of doing what you want, because the data are stored as single-precision (32-bit) floats and -99999.99 cannot be represented exactly at that precision, so you will always see trailing digits after the .99.
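You can see this directly; for example (a quick illustration, not specific to your file):

import numpy as np
print(float(np.float32(-99999.99)))  # -99999.9921875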
However, netCDF already provides a mechanism for missing data (see the best practices guide). How was the netCDF file written in the first place? missing_value is a special variable attribute that records the sentinel value used to indicate missing points. In the C and Fortran interfaces, when the file is created all variable values are initially set to the fill value, so any points you never write keep that value (see more about fill values in the C and Fortran interfaces). This is the recommended approach. The Python netCDF4 module plays well with these missing values, and such variables are read as masked arrays in numpy.
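For illustration, here is a minimal sketch with the netCDF4 Python module of writing a variable with a fill value and reading it back as a masked array (the file and variable names are made up for the example):

from netCDF4 import Dataset
import numpy as np

ds = Dataset("example.nc", "w")
ds.createDimension("x", 5)
# fill_value sets the _FillValue attribute; points that are never written keep this value
var = ds.createVariable("precip", "f4", ("x",), fill_value=-99999.0)
var[0:3] = [1.0, 2.0, 3.0]  # indices 3 and 4 are left missing
ds.close()

ds = Dataset("example.nc")
data = ds.variables["precip"][:]  # comes back as a numpy masked array
print(np.ma.is_masked(data))      # True: the unwritten points are masked
ds.close()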
If you must work with the file you currently have, then I'd suggest creating a mask to cover values around your missing value:
import numpy as np
value = precipitationVariable[:]
mask = (value < -99999.98) & (value > -100000.00)
value = np.ma.MaskedArray(value, mask=mask)

Reconstructing the original data from detrended data -- Python

I have obtained the detrended data from the following python code:
Detrended_Data = signal.detrend(Original_Data)
Is there a function in python wherein the "Original_Data" can be reconstructed using the "Detrended_Data" and some "correction factor"?
Are you referring to scipy.signal.detrend? If so, the answer is no -- there is no (and can never be an) un-detrend function. detrend maps many arrays to the same array. For example,
import numpy as np
import scipy.signal as signal
t = np.linspace(0, 5, 100)
assert np.allclose(signal.detrend(t), signal.detrend(2*t))
If there were an undetrend function, it would have to map signal.detrend(t) back to t, and also map signal.detrend(2*t) back to 2*t. That's impossible, since signal.detrend(t) is the same array as signal.detrend(2*t).
I guess you could use numpy to fit a trend to your data. That wouldn't properly give you the original data, but it would make it less 'noisy'.
Read this question, as it goes into much more detail on this.
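If you can keep the fitted trend yourself, you can add it back later; here is a hedged sketch using numpy.polyfit, relying on the fact that scipy.signal.detrend (with the default type='linear') removes a least-squares linear fit over the sample index:

import numpy as np
import scipy.signal as signal

t = np.linspace(0, 5, 100)
Original_Data = 3.0*t + np.random.randn(100)   # example data with a linear trend

Detrended_Data = signal.detrend(Original_Data)

# fit and store the linear trend separately
idx = np.arange(len(Original_Data))
coeffs = np.polyfit(idx, Original_Data, 1)
trend = np.polyval(coeffs, idx)

# detrended data plus the stored trend recovers the original
reconstructed = Detrended_Data + trend
print(np.allclose(reconstructed, Original_Data))  # True (up to floating-point error)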
