Fast/efficient way to extract data from multiple large NetCDF files - python

I need to extract data from a global grid only for a specific set of nodes, given by lat/lon coordinates (in the order of 5000-10000). The data are time-series of hydraulic parameters, for example wave height.
The global data set is huge, so it is divided into many NetCDF files. Each NetCDF file is around 5 GB and contains data for the entire global grid, but only for one variable (e.g. wave height) and one year (e.g. 2020). Say I want to extract the full time series (42 years) of 6 variables at a certain location: I then need to extract data from 6 x 42 = 252 NC files, each 5 GB in size.
My current approach is a triple loop through years, variables, and nodes. I use Xarray to open each NC file, extract the data for all the required nodes and store it in a dictionary. Once I've extracted all the data in the dictionary I create one pd.dataframe for each location, which I store as a pickle file. With 6 variables and 42 years, this results in a pickle file of around 7-9 MB for each location (so not very large actually).
My approach works perfectly fine if I have a small amount of locations, but as soon as it grows to a few hundred, this approach takes extremely long. My gut feeling is that it is a memory problem (since all the extracted data is first stored in a single dictionary, until every year and variable are extracted). But one of my colleagues said that Xarray is actually quite inefficient and that this might lead to the long duration.
Does anyone here have experience with similar issues or know of an efficient way to extract data from a multitude of NC files? I put the code I currently use below. Thanks for any help!
# set conditions
vars = {...dictionary which contains variables}
years = np.arange(y0, y1 + 1)  # year range
ndata = {}  # dictionary which will contain all data

# loop through all the desired variables
for v in vars.keys():
    ndata[v] = {}
    # for each variable, loop through each year, open the nc file and extract the data
    for y in years:
        # open file with xarray
        fname = 'xxx.nc'
        data = xr.open_dataset(fname)
        # loop through the locations and load the data for each node as temp
        for n in range(len(nodes)):
            node = nodes.node_id.iloc[n]
            lon = nodes.lon.iloc[n]
            lat = nodes.lat.iloc[n]
            temp = data.sel(longitude=lon, latitude=lat)
            # for the first year, store the data in the ndata dict
            if y == years[0]:
                ndata[v][node] = temp
            # for subsequent years, concatenate onto the existing array in ndata
            else:
                ndata[v][node] = xr.concat([ndata[v][node], temp], dim='time')

# merge the variables for each location into one dataset
for n in range(len(nodes)):
    node = nodes.node_id.iloc[n]
    dset = xr.merge(ndata[v][node] for v in vars.keys())
    df = dset.to_dataframe()
    # save dataframe as pickle file, named by the node id
    df.to_pickle('%s.xz' % node)

This is a pretty common workflow, so I'll give a few pointers. A few suggested changes, with the most important ones first:
Use xarray's advanced indexing to select all points at once
It looks like you're using a pandas DataFrame nodes with columns 'lat', 'lon', and 'node_id'. As with nearly everything in python, remove an inner for loop whenever possible, leveraging array-based operations written in C. In this case:
# create an xr.Dataset indexed by node_id with arrays `lat` and `lon`
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()
# select all points from each file simultaneously, reshaping to be
# indexed by `node_id`
node_data = data.sel(lat=node_indexer.lat, lon=node_indexer.lon)
# dump this reshaped data to pandas, with each variable becoming a column
node_df = node_data.to_dataframe()
Only reshape arrays once
In your code, you are looping over many years, and every year after
the first one you are allocating a new array with enough memory to
hold as many years as you've stored so far.
# For the first year, store the data into the ndata dict
if y == years[0]:
    ndata[v][node] = temp
# For subsequent years, concatenate the existing array in ndata
else:
    ndata[v][node] = xr.concat([ndata[v][node], temp], dim='time')
Instead, just gather all the years' worth of data and concatenate
them at the end. This way the array needed for all the data is allocated only once.
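As a rough sketch (reusing the names from your snippet: vars, years, nodes, and the placeholder filename), the loop could collect each year's selection in a plain list and call xr.concat once per node at the very end:

for v in vars.keys():
    ndata[v] = {node: [] for node in nodes.node_id}
    for y in years:
        data = xr.open_dataset('xxx.nc')
        for n in range(len(nodes)):
            node = nodes.node_id.iloc[n]
            temp = data.sel(longitude=nodes.lon.iloc[n], latitude=nodes.lat.iloc[n])
            ndata[v][node].append(temp)
    # one concatenation (and one allocation) per node instead of one per year
    ndata[v] = {node: xr.concat(pieces, dim='time') for node, pieces in ndata[v].items()}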
Use dask, e.g. with xr.open_mfdataset, to leverage multiple cores. If you do this, you may want to consider using a format that supports multithreaded writes, e.g. zarr.
All together, this could look something like this:
# build nested filepaths
filepaths = [
    ['xxx.nc'.format(year=y, variable=v) for y in years]
    for v in variables
]

# build node indexer
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()

# I'm not sure if you have conflicting variable names - you'll need to
# tailor this line to your data setup. It may be that you want to just
# concatenate along years and then use `xr.merge` to combine the
# variables, or just handle one variable at a time
ds = xr.open_mfdataset(
    filepaths,
    combine='nested',
    concat_dim=['variable', 'year'],
    parallel=True,
)

# this only schedules the operation - no work is done until the write below
ds_nodes = ds.sel(lat=node_indexer.lat, lon=node_indexer.lon)

# this triggers the operation using a dask LocalCluster, leveraging
# multiple threads on your machine (or a distributed Client if you have
# one set up)
ds_nodes.to_zarr('all_the_data.zarr')

# alternatively, you could still dump to pandas:
df = ds_nodes.to_dataframe()

Related

Spatially joining a series of netCDF data to a polygon object in python

I have a set of Polygon instances (loaded via geopandas) comprising historical severe-weather watches and warnings, and netCDF data at 15-minute intervals containing parameters such as tornado/hail forecast parameters and updraft helicity (these are point data). Each polygon has a set start time and an end time. What I'm trying to do is, for each 15-minute interval within the polygon's time range (start time -> end time), spatially join in from the netCDF files the highest value of each of these forecast parameters within the polygon.
I already have the code that pulls the range of time steps (time_list) required to analyze from the netCDF files for each polygon (TOR):
# Pull the warning issue time, locate the closest index in our subhour files
issue_time = datetime.strptime(str(TOR.ISSUED), "%Y%m%d%H%M")
expire_time = datetime.strptime(str(TOR.EXPIRED), "%Y%m%d%H%M")
closest_first = min(time_list, key=lambda d: abs(d - issue_time))
closest_last = min(time_list, key=lambda d: abs(d - expire_time))
# Pull the timesteps in between the issue time and the expire time
before = time_list.index(closest_first) if closest_first < issue_time else time_list.index(closest_first)-1
after = time_list.index(closest_last) if closest_last > issue_time else time_list.index(closest_last)+1
The TOR object is generated by slicing the geoDataFrame of the full polygon data. Right now I am using itertuples on the below slice to get the TOR object:
TORNADOES_IN_WINDOW = df_slice_TO_WRN[(df_slice_TO_WRN.ISSUED > int(first.strftime("%Y%m%d%H%M")))
                                      & (df_slice_TO_WRN.ISSUED < int(last.strftime("%Y%m%d%H%M")))]
Inside the loop, I then iterate over the list of netCDF files in the range of timesteps found (before -> after) and load these into geopandas as well, so that I can perform a spatial join. This is done with this block of code:
xr_i = xr.open_dataset(subhr_files[i])
ds_i = xr_i.to_dataframe()
ds_i = ds_i.reset_index()
geom = [Point(x,y) for x, y in zip(ds_i['XLONG'], ds_i['XLAT'])]
gdf = gpd.GeoDataFrame(ds_i, geometry=geom)
The set of problems I'm encountering right now comprises the following. When running itertuples on the GeoDataFrame object (TORNADOES_IN_WINDOW), each row comes back as a plain pandas object, not a GeoPandas object, and therefore I cannot run sjoin to join the attributes. Also, this is not a very efficient methodology, and I'm wondering if perhaps there is a better way to do it.

Is there a more efficient tool than iterrows() in this situation?

Okay so, here's the thing. I'm working with a lot of pandas data frames and arrays. Oftentimes, I need to pair up a value from one frame with a value from another, ideally combining the information into one frame in the end.
Say I'm looking at image files. There's a set of information specific to each file, and sometimes certain types of image files share the same kind of information. Simple example:
FILEPATH,   TYPE, COLOR,   VALUE_I
/img2.jpg,  A,    'green', 0.6294
/img45.jpg, B,    'green', 0.1846
/img87.jpg, A,    'blue',  34.78
Often, this information is indexed out by type/color/value etc and fed into some other function that gives me another important output, let's say VALUE_II. But I can't concatenate it directly onto the original dataframe because the indices won't match, either because of the nature of the output or because I only fed part of the frame.
Or another situation: I learn that images of a certain TYPE have a specific value attached to them, so I make a dictionary of types and their value. Again, this column doesn't exist, so in this case I would use iterrows() to march down the frame, see if the type matches a specific key, and if it does append it to an array. Then in the end, I convert that array to a dataframe and concatenate it onto the original.
Here's the worst offender. With up to 1800 rows in each frame, it takes FOREVER:
newColumn = []
for index, row in originalDataframe.iterrows():
    for indx, rw in otherDataframe.iterrows():
        if row['filename'] in rw['filepath']:
            newColumn.append([rw['VALUE_I'], rw['VALUE_II'], rw['VALUE_III']])
newColumn = pd.DataFrame(newColumn, columns=['VALUE_I', 'VALUE_II', 'VALUE_III'])
originalDataframe = pd.concat([originalDataframe, newColumn], axis=1)
Solutions would be appreciated!
If you can split the filename off otherDataframe["filepath"], you can then just compare it for equality with originalDataframe's filename, without needing the in check. After that you can simplify the calculation with pandas.DataFrame.join, which for each filename in originalDataframe will find the same filename in otherDataframe and add all the other columns from it.
import os
otherDataframe["filename"] = otherDataframe["filepath"].map(os.path.basename)
joinedDataframe = originalDataframe.join(otherDataframe.set_index("filename"), on="filename")
If there are columns with the same name in originalDataframe and otherDataframe you should set lsuffix or rsuffix.
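If, for example, both frames already had a VALUE_I column, handling the overlap could look like this (the suffix names here are just placeholders):

joinedDataframe = originalDataframe.join(
    otherDataframe.set_index("filename"),
    on="filename",
    lsuffix="_orig",
    rsuffix="_other",
)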
Focusing on the second half of your question, as that's what you provided code for: your program is checking every row of one frame against every row of the other, yielding potentially 1800 * 1800 = 3,240,000 combinations. If there is only one possible match for each row, then adding a break will help some, but it is not ideal:
            newColumn.append([rw['VALUE_I'], rw['VALUE_II'], rw['VALUE_III']])
            break
If the structure of your data allows it, I would try something like:
ref = {}
for i, path in enumerate(otherDataframe['filepath']):
    *_, file = path.split('\\')
    ref[file] = i

originalDataframe['VALUE_I'] = None
originalDataframe['VALUE_II'] = None
originalDataframe['VALUE_III'] = None

for i, file in enumerate(originalDataframe['filename']):
    try:
        j = ref[file]
        originalDataframe.loc[i, 'VALUE_I'] = otherDataframe.loc[j, 'VALUE_I']
        originalDataframe.loc[i, 'VALUE_II'] = otherDataframe.loc[j, 'VALUE_II']
        originalDataframe.loc[i, 'VALUE_III'] = otherDataframe.loc[j, 'VALUE_III']
    except KeyError:
        pass
Here we iterate through the paths in otherDataframe (I assume they follow a pattern of C:\asdf\asdf\file), split each path on \ to pull out the file name, and then construct a dictionary mapping file names to row numbers. Next we initialize the 3 columns in originalDataframe that you want to write to.
Lastly we iterate through the files in originalDataframe, check to see if that file exists in our dictionary of files in otherDataframe (done inside a try to catch errors), and pull the row number (out of the dictionary) which we then use to write the values from other to original.
Side note: you describe your paths as being in the vein of 'C:/asd/fdg/img2.jpg', in which case you should use:
*_, file = path.split('/')

Appending to h5py groups

I have files with the following structure:
time 1
index 1
value x
value y
time 1
index 2
value x
value y
time 2
index 1
...
I wish to convert the file to the hdf5 format using h5py, and sort the values from each index into separate groups.
My approach is
f = h5py.File(filename1, 'a')
trajfile = open(filename2, 'rb')
for i in range(length_of_file):
    time = struct.unpack('>d', trajfile.read(8))[0]
    index = struct.unpack('>i', trajfile.read(4))[0]
    x = struct.unpack('>d', trajfile.read(8))[0]
    y = struct.unpack('>d', trajfile.read(8))[0]
    f.create_dataset('/' + str(index), data=[time, x, y])
But in this way I am not able to append to the groups (I am only able to write to each group once...). The error message is "RuntimeError: Unable to create link (name already exists)".
Is there a way to append to the groups?
You can write to a dataset as many times as you want - you just can't create a dataset with the same name twice. That is the error you're getting. Note that you are creating a dataset and at the same time putting some data inside of it. In order to write other data to it later, it has to be large enough to accommodate it.
Anyway, I believe you are confusing groups and datasets.
Groups are created with e.g.
grp = f.create_group('bar')  # this creates the group '/bar'
and you want to store your data in datasets, created like you said with:
dst = f.create_dataset('foo', shape=(100,))  # this creates the dataset 'foo', with enough space for 100 elements
You only need to create groups and datasets once, but you can refer to them through their handles (grp and dst) in order to write to them.
I suggest you first go through your file once, create your desired groups and datasets (using the shape parameter to size them properly), and then populate the datasets with the actual data.
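For illustration, here is a minimal two-pass sketch along those lines. It reuses filename1/filename2 from your snippet and assumes each record is a big-endian double time, int index, double x and double y (28 bytes), storing one (time, x, y) row per record in a dataset named after the index:

import struct
import h5py

record = struct.Struct('>didd')  # time (f64), index (i32), x (f64), y (f64)

# first pass: count how many records belong to each index, so each dataset
# can be created with its final size up front
counts = {}
with open(filename2, 'rb') as traj:
    while chunk := traj.read(record.size):
        _, index, _, _ = record.unpack(chunk)
        counts[index] = counts.get(index, 0) + 1

# second pass: create one fixed-size dataset per index, then fill them
with h5py.File(filename1, 'a') as f, open(filename2, 'rb') as traj:
    dsets = {idx: f.create_dataset(str(idx), shape=(n, 3), dtype='f8')
             for idx, n in counts.items()}
    row = dict.fromkeys(counts, 0)
    while chunk := traj.read(record.size):
        time, index, x, y = record.unpack(chunk)
        dsets[index][row[index]] = (time, x, y)
        row[index] += 1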

I/O efficiency in Python

I am writing a program that:
Reads the content of an Excel sheet row by row (90,000 rows in total)
Compares each row's content with every row of another Excel sheet (600,000 rows in total)
If a match occurs, writes the matching entry into a new Excel sheet
I have written the script and everything works fine. However, the computational time is huge: in an hour it has processed just 200 rows from the first sheet, resulting in 200 different files being written.
I was wondering if there is a different way to save the matches, since I am going to use them later on. Is there any way to save them in a matrix or something?
import xlrd
import xlsxwriter
import os, itertools
from datetime import datetime

# choose the incident excel sheet
book_1 = xlrd.open_workbook('D:/Users/d774911/Desktop/Telstra Internship/Working files/Incidents.xlsx')
# choose the trap excel sheet
book_2 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Traps.xlsx")
# choose the features sheet
book_3 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Features.xlsx")

# select the working sheets, either by name or by index
Traps = book_2.sheet_by_name('Sheet1')
Incidents = book_1.sheet_by_name('Sheet1')
Features_Numbers = book_3.sheet_by_name('Sheet1')

# total number of rows for the traps and incidents sheets
Total_Number_of_Rows_Traps = Traps.nrows
Total_Number_of_Rows_Incidents = Incidents.nrows
print(Total_Number_of_Rows_Traps, Total_Number_of_Rows_Incidents)

# open a file to write down the non-matching incident numbers
write_no_matching = open('C:/Users/d774911/PycharmProjects/GlobalData/No_Matching.txt', 'w')

# loop over all rows of the incidents sheet
for Rows_Incidents in range(Total_Number_of_Rows_Incidents):
    # store the comparable cells for the incident sheet
    Incidents_Content_Affected_resources = Incidents.cell_value(Rows_Incidents, 47)
    Incidents_Content_Product_Type = Incidents.cell_value(Rows_Incidents, 29)
    # convert the Excel date into a python datetime (year, month, day)
    Incidents_Content_Date = xlrd.xldate_as_tuple(Incidents.cell_value(Rows_Incidents, 2), book_1.datemode)
    Incidents_Content_Date = str(Incidents_Content_Date[0]) + ' ' + str(Incidents_Content_Date[1]) + ' ' + str(Incidents_Content_Date[2])
    Incidents_Content_Date = datetime.strptime(Incidents_Content_Date, '%Y %m %d')
    # extract the incident number
    Incident_Name = Incidents.cell_value(Rows_Incidents, 0)
    # create a workbook for the selected incident
    Incident_Name_Book = xlsxwriter.Workbook(os.path.join('C:/Users/d774911/PycharmProjects/GlobalData/Test/', Incident_Name + '.xlsx'))
    # create a sheet in the created workbook
    Incident_Name_Sheet = Incident_Name_Book.add_worksheet('Sheet1')
    # insert the first row, which contains the features
    Incident_Name_Sheet.write_row(0, 0, Features_Numbers.row_values(0))
    Insert_Row_to_Incident_Sheet = 0
    # loop over all rows of the traps sheet
    for Rows_Traps in range(Total_Number_of_Rows_Traps):
        # store the comparable cells for the traps sheet
        Traps_Content_Node_Name = Traps.cell_value(Rows_Traps, 3)
        Traps_Content_Event_Type = Traps.cell_value(Rows_Traps, 6)
        Traps_Content_Date_temp = Traps.cell_value(Rows_Traps, 10)
        Traps_Content_Date = datetime.strptime(Traps_Content_Date_temp[0:10], '%Y-%m-%d')
        # if the content matches partially or fully
        if len(str(Traps_Content_Node_Name)) * len(str(Incidents_Content_Affected_resources)) != 0 and \
                str(Incidents_Content_Affected_resources).lower().find(str(Traps_Content_Node_Name).lower()) != -1 and \
                len(str(Traps_Content_Event_Type)) * len(str(Incidents_Content_Product_Type)) != 0 and \
                str(Incidents_Content_Product_Type).lower().find(str(Traps_Content_Event_Type).lower()) != -1 and \
                len(str(Traps_Content_Date)) * len(str(Incidents_Content_Date)) != 0 and \
                Traps_Content_Date <= Incidents_Content_Date:
            # counter for writing inside the new incident sheet
            Insert_Row_to_Incident_Sheet = Insert_Row_to_Incident_Sheet + 1
            # write the incident information
            Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 0, Incidents.row_values(Rows_Incidents))
            # write the traps information
            Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 107, Traps.row_values(Rows_Traps))
    Incident_Name_Book.close()
Thanks
What you're doing is seeking/reading a little bit of data for each cell. This is very inefficient.
Try reading all the information in one go into basic Python data structures (lists, dicts, etc.), do your comparisons/operations on that data set in memory, and write all the results in one go. If not all of the data fits into memory, try to partition it into sub-tasks.
Even having to read the data set 10 times to extract a tenth of the data each time will likely still be hugely faster than reading each cell independently.
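A rough sketch of that approach, keeping only the node-name comparison from your script and using placeholder paths (column 47 is the affected resources and column 3 the trap node name, as in your code):

import xlrd
import xlsxwriter

# read both sheets fully into memory once
incidents = xlrd.open_workbook('Incidents.xlsx').sheet_by_name('Sheet1')
traps = xlrd.open_workbook('Traps.xlsx').sheet_by_name('Sheet1')
incident_rows = [incidents.row_values(r) for r in range(incidents.nrows)]
trap_rows = [traps.row_values(r) for r in range(traps.nrows)]

# do all the comparisons on the in-memory lists
matches = []
for inc in incident_rows:
    for trap in trap_rows:
        if str(trap[3]).lower() in str(inc[47]).lower():
            matches.append(inc + trap)

# write all the results in one go at the end
book = xlsxwriter.Workbook('Matches.xlsx')
sheet = book.add_worksheet('Sheet1')
for r, row in enumerate(matches):
    sheet.write_row(r, 0, row)
book.close()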
As originally posted, I didn't see how your code could work: the second loop uses variables that change for every row of the first loop, but it didn't appear to be nested inside the first one.
That said, comparing files in this way has a complexity of O(N*M), which means that the runtime explodes quickly. In your case you are trying to execute 54,000,000,000 (54 billion) loop iterations.
If you run into these kind of problems, the solution is always a three step process:
Transform the data to make it easier to process
Put the data into efficient structures (sorted lists, dict)
Search the data with the efficient structures
You have to find a way to get rid of the find(). Try to get rid of all the junk in the cells that you want to compare so that you can use plain equality. When you have that, you can put the rows into a dict to find matches (see the sketch below). Or you could load everything into a SQL database and use SQL queries (don't forget to add indexes!).
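For instance, a dict keyed on the cleaned-up node name, assuming the sheets have already been read into in-memory row lists (here called incident_rows and trap_rows, with the question's column numbers):

# build a lookup from cleaned trap node name -> list of trap rows
traps_by_node = {}
for trap in trap_rows:
    traps_by_node.setdefault(str(trap[3]).strip().lower(), []).append(trap)

# each incident now needs one dictionary lookup instead of a full scan
matches = []
for inc in incident_rows:
    key = str(inc[47]).strip().lower()
    matches.extend((inc, trap) for trap in traps_by_node.get(key, []))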
One last trick is to use sorted lists (there is a sketch after this list). If you can sort the data in the same way, then you can use two lists:
Sort the data from the two sheets into two lists
Use two row counters (one per list)
If the current item from the first list is less than the current one from the second list, then there is no match and you have to advance the first row counter.
If the current item from the first list is bigger than the current one from the second list, then there is no match and you have to advance the second row counter.
If the items are the same, you have a match. Process the match and advance both counters.
This allows you to process all the items in one go.
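A sketch of that two-pointer walk, assuming incidents and traps are lists of (key, row) pairs built from the two sheets and that keys are unique (duplicate keys need a small extension):

incidents.sort(key=lambda kr: kr[0])
traps.sort(key=lambda kr: kr[0])

i = j = 0
matches = []
while i < len(incidents) and j < len(traps):
    if incidents[i][0] < traps[j][0]:
        i += 1      # no trap for this incident key, advance the first counter
    elif incidents[i][0] > traps[j][0]:
        j += 1      # no incident for this trap key, advance the second counter
    else:
        matches.append((incidents[i][1], traps[j][1]))  # match: process and advance both
        i += 1
        j += 1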
I would suggest that you use pandas. This module provides a huge amount of functionality for comparing datasets. It also has very fast import/export routines for Excel files.
IMHO you should use the merge function and provide the arguments how='inner' and on=[list of your columns to compare]. That will create a new dataset with only the rows that occur in both tables (having the same values in the defined columns). You can then export this new dataset to your Excel file.
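For example (the file paths and the column names in on= are placeholders for whatever you actually compare on):

import pandas as pd

incidents = pd.read_excel('Incidents.xlsx', sheet_name='Sheet1')
traps = pd.read_excel('Traps.xlsx', sheet_name='Sheet1')

# keep only the rows that have the same values in the listed columns
matched = incidents.merge(traps, how='inner', on=['node_name', 'event_type'])
matched.to_excel('Matches.xlsx', index=False)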

python pandas native select_as_multiple

Suppose I have a DataFrame that is block sparse. By this I mean that there are groups of rows that have disjoint sets of non-null columns. Storing this as one huge table uses more memory in the values (nan filling), and unstacking the table to rows creates a large index (at least it appears that way on saving to disk ... I'm not 100% clear if there is some efficient MultiIndexing that is supposed to be going on).
Typically, I store the blocks as separate DataFrames in a dict or list (dropping the nan columns) and make a class that has almost the same API as a DataFrame, 'manually' passing the queries to the blocks and concatenating the results. This works well but involves some special code to store and handle these objects.
Recently, I've noticed that pytables provides a feature similar to this but only for the pytables query api.
Is there some way of handling this natively in pandas? Or am I missing some simpler way of getting a solution that is similar in performance?
EDIT: Here is a small example dataset
import pandas, string, itertools
from pylab import *
# create some data and put it in a list of blocks (d)
m = 10; n = 6;
s = list(string.ascii_uppercase)
A = array([s[x] * (1 + mod(x, 3)) for x in randint(0, 26, m*n)]).reshape(m, n)
df = pandas.DataFrame(A)
d = list()
d += [df.ix[0:(m/2)].T.ix[0:(n/2)].T]
d += [df.ix[(m/2):].T.ix[(n/2):].T]
# 1. use lots of memory, fill with na
d0 = pandas.concat(d) # this is just the original df
# 2. maybe ok, not sure how this is handled across different pandas versions
d1 = pandas.concat([x.unstack() for x in d])
# want this to work however the blocks are stored
print(d0.ix[[0, 8]][[2,5]])
# this raises exception
sdf = pandas.SparseDataFrame(df)
You could use HDFStore this way:
Store the different blocks as separate tables with a common index that is itself a column.
Only the non-all-nan rows would be stored, so if you group your columns intelligently (e.g. put the ones that tend to have lots of sparseness in the same places together), I think you could achieve a 'sparse'-like layout.
You can compress the tables if necessary.
You can then query individual tables and get the coordinates to pull from the other tables (this is what select_as_multiple does).
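A minimal sketch of that layout, assuming df is a frame whose columns split into two groups as described (the table names, column names, and query are placeholders):

import pandas as pd

store = pd.HDFStore('blocks.h5', complevel=9, complib='blosc')

# split the columns across two tables that share the same index;
# the selector table holds the columns you usually query on
store.append_to_multiple(
    {'block_a': ['A', 'B'], 'block_b': None},  # None means "all remaining columns"
    df,
    selector='block_a',
    data_columns=['A'],
)

# query the selector table, then pull the matching rows from both tables
result = store.select_as_multiple(
    ['block_a', 'block_b'],
    where='A > 0',
    selector='block_a',
)
store.close()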
Can you provide a small example, and the rough size of the data set, e.g. number of rows, columns, disjoint groups, etc.?
What do your queries look like? This is generally how I approach the problem: figure out how you are going to query; that will define how you store the data layout.
