How do I aggregate 50 datasets within an HDF5 file? - Python

I have an HDF5 file with 2 groups, each containing 50 datasets of 4D numpy arrays of the same type. I want to combine all 50 datasets in each group into a single dataset. In other words, instead of 2 x 50 datasets I want 2 x 1 datasets. How can I accomplish this? The file is 18.4 GB in size. I am a novice at working with large datasets. I am working in Python with h5py.
Thanks!

Look at this answer: How can I combine multiple .h5 file? - Method 3b: Merge all data into 1 Resizeable Dataset. It describes a way to copy data from multiple HDF5 files into a single dataset. You want to do something similar. The only difference is that all of your datasets are in 1 HDF5 file.
You didn't say how you want to stack the 4D arrays. In my first answer I stacked them along axis=3. As noted in my comment, it's easier (and cleaner) to create the merged dataset as a 5D array and stack the data along the 5th axis (axis=4). I prefer this for 2 reasons: 1) the code is simpler/easier to follow, and 2) it's more intuitive (to me) that axis=4 represents a unique dataset (instead of slicing on axis=3).
I wrote a self-contained example to demonstrate the procedure. First it creates some data and closes the file. Then it reopens the file (read only) and creates a new file for the copied datasets. It loops over the groups and datasets in the first file and copies the data into a merged dataset in the second file. The 5D example is first, and my original 4D example follows.
Note: this is a simple example that will work for your specific case. If you are writing a general solution, it should check for consistent shapes and dtypes before blindly merging the data (which I don't do).
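For a general solution, a minimal sketch of such a check could look like this (the helper is not part of the code below, and it assumes the group contains only datasets):
def check_consistent(group):
    # Collect the shapes and dtypes of every dataset in the group
    shapes = {group[name].shape for name in group.keys()}
    dtypes = {group[name].dtype for name in group.keys()}
    # Refuse to merge if the datasets don't all match
    if len(shapes) != 1 or len(dtypes) != 1:
        raise ValueError(f'Inconsistent datasets in {group.name}: '
                         f'shapes={shapes}, dtypes={dtypes}')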
Code to create the Example data (2 groups, 5 datasets each):
import h5py
import numpy as np

# Create a simple H5 file with 2 groups and 5 datasets (shape=a0,a1,a2,a3)
with h5py.File('SO_69937402_2x5.h5','w') as h5f1:
    a0, a1, a2, a3 = 100, 20, 20, 10
    grp1 = h5f1.create_group('group1')
    for ds in range(1,6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0,a1,a2,a3)
        grp1.create_dataset(f'dset_{ds:02d}', data=arr)
    grp2 = h5f1.create_group('group2')
    for ds in range(1,6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0,a1,a2,a3)
        grp2.create_dataset(f'dset_{ds:02d}', data=arr)
Code to merge the data (2 groups, 1 5D dataset each -- my preference):
with h5py.File('SO_69937402_2x5.h5','r') as h5f1, \
     h5py.File('SO_69937402_2x1_5d.h5','w') as h5f2:
    # Loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:', grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        ds_cnt = len(h5f1[grp].keys())
        for i, ds in enumerate(h5f1[grp].keys()):
            print('working on dataset:', ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # If dataset doesn't exist in group, create it
                # Set maxshape so dataset is resizable
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', dtype=h5f1[grp][ds].dtype,
                               shape=(ds_shape+(ds_cnt,)), maxshape=(ds_shape+(None,)))
            # Now add data to the merged dataset
            merge_ds[:,:,:,:,i] = h5f1[grp][ds]
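As a quick sanity check (not part of the procedure itself), you can re-open the merged file and confirm each group now holds a single 5D dataset of shape (100, 20, 20, 10, 5):
with h5py.File('SO_69937402_2x1_5d.h5', 'r') as h5f2:
    for grp in h5f2.keys():
        print(grp, h5f2[grp]['merged_ds'].shape)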
Code to merge the data (2 groups, 1 4D dataset each):
with h5py.File('SO_69937402_2x5.h5','r') as h5f1, \
     h5py.File('SO_69937402_2x1_4d.h5','w') as h5f2:
    # Loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:', grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        for ds in h5f1[grp].keys():
            print('working on dataset:', ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # If dataset doesn't exist in group, create it
                # Set maxshape so dataset is resizable
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', data=h5f1[grp][ds],
                               maxshape=[ds_shape[0], ds_shape[1], ds_shape[2], None])
            else:
                # Otherwise, resize the merged dataset to hold the new values
                ds1_shape = h5f1[grp][ds].shape
                ds2_shape = merge_ds.shape
                merge_ds.resize(ds1_shape[3]+ds2_shape[3], axis=3)
                merge_ds[:,:,:, ds2_shape[3]:ds2_shape[3]+ds1_shape[3]] = h5f1[grp][ds]

Related

Vacuum HDF5 dataset (to remove rows of data and resize)

Let's say I have an HDF5 dataset with maxshape=(None,1000), chunks=(1,1000).
Then whenever I need to delete some row I just zero it (many of them):
ds[ix,:] = 0
What is the fastest way to vacuum the zeroed rows and resize the array?
Now let's add a twist. I have a dict that resolves symbols to ds_ix:
{ name : ds_ix }
What is the fastest way to vacuum and keep the correct ds_ix?
Did you mean resize the dataset when you asked 'resize the array?' (Also, I assume you meant maxshape=(None,1000).) If so, you use the .resize() method. However, if you aren't removing the last row(s), you will have to rearrange the non-zero data, then resize. (And you really don't need to zero out the row(s) since you are going to overwrite them.)
I can think of 2 approaches to rearrange the data: 1) use slice notation to define FROM and TO indices, or 2) read the dataset into a numpy array, delete the rows, and copy it back. Both involve disk I/O so it's not clear which would be faster without testing. It probably doesn't matter for small datasets and only a few deleted rows. I suspect the second method will be better if you plan to delete a lot of rows from large datasets. However, benchmark tests are required to confirm.
Note: be careful setting the chunk size. Remember this controls the I/O size, and you will be doing a lot of I/O when you move rows. Setting it too small (or too large) can degrade performance. Setting it to (1,1000) is probably too small. The recommended chunk size is 10 KiB to 1 MiB; a (1,1000) float32 chunk is only about 4 KiB.
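For example, here is a minimal sketch of creating a dataset with an explicit chunk shape in that range (the 64-row chunk, float32 dtype, and file name are just illustrative):
import h5py

with h5py.File('SO_73353006_chunked.h5', 'w') as h5f:
    # 64 rows x 1000 cols x 4 bytes = 250 KiB per chunk -- inside the 10 KiB to 1 MiB range
    ds = h5f.create_dataset('test', shape=(0, 1000), maxshape=(None, 1000),
                            dtype='f4', chunks=(64, 1000))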
Here are both approaches with a very small dataset.
Create an HDF5 file:
import h5py
import numpy as np

with h5py.File('SO_73353006.h5','w') as h5f:
    a0, a1 = 10, 5
    arr = np.arange(a0*a1).reshape(a0,a1)
    ds = h5f.create_dataset('test', data=arr, maxshape=(None,a1))
Method 1: move data, then resize dataset
with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5
    ds = h5f['test']
    #ds[idx,:] = 0  # Not required since we will overwrite the row
    a0 = ds.shape[0]
    ds[idx:a0-1] = ds[idx+1:a0]
    ds.resize(a0-1, axis=0)
Method 2: extract array, delete row and copy data to resized dataset
with h5py.File('SO_73353006.h5','r+') as h5f:
    idx = 5
    ds = h5f['test']
    a0 = ds.shape[0]
    a1 = ds.shape[1]
    # Read dataset into array and delete row
    ds_arr = ds[()]
    ds_arr = np.delete(ds_arr, obj=idx, axis=0)
    # Resize dataset and load array
    ds.resize(a0-1, axis=0)  # same as above
    ds[:] = ds_arr[:]
    # Create a new dataset for comparison
    ds2 = h5f.create_dataset('test2', data=ds_arr, maxshape=(None,a1))
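A quick check (not required) that both datasets hold the same values after running Method 2:
with h5py.File('SO_73353006.h5', 'r') as h5f:
    print(np.array_equal(h5f['test'][()], h5f['test2'][()]))  # expect True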

Dask: parallel group by with sequential saving

To summarize: how can I perform groupby operations in parallel for a limited number of groups at a time, while writing the result of each group's apply function to disk?
My problem: I'm trying to create a supervised structure for regression models from information of a lot of clients separated into years. From the same clients I have to build different models, with different inputs X and labels Y, thus my idea is to create a single X and Y dataframe holding all variables at once, and slicing each one according to the task. For example, X could hold information from the salary, age or sex, but model 1 would use only age and sex, while model 2 only use salary.
As clients are not present every year, I can only use clients that are present from one period to the next one.
Instead of selecting the intersection of clients for each pair of contiguous years, I'm trying to concatenate the whole information and perform groupby operations by client ID (and then filter by year sequence, for example using the rows where the difference of periods is 1). The problem with using Dask for this task is that the distributed workers are running low on memory (even after increasing the limit to 30 GB each). Note that for each group I'm creating a new dataframe, so I'm not reducing the calculation to a single number per group, hence the memory-intensive operation.
What I'm currently doing is performing a groupby operation, then iterating over the groupby object and writing to disk sequentially, for example like this:
x_file = open('X.csv', 'w')
for name, group in concatenated_data.groupby('ID'):
    data_x = my_func(group)  # In my real code, my_func returns x and y dataframes
    data_x.to_csv(x_file, header=None)
x_file.close()
which writes the data sequentially, applying my_func, which selects the x and y for each group.
What I want is to perform the operation for a controlled number of groups (let's say 3 at a time), and write the result of each group to disk (maybe with data_x.to_csv(x_file, single_file=True)).
Of course I can do the same for a dask dataframe, and iterate over the groupby object using get_group(), but I don't believe it will run in parallel while also keeping the memory in check.
EDIT: Example
# Let's say I have 3 csv files:
data = ['./data_2016', './data_2017', './data_2018']  # Each file contains millions of rows (1 per client ID) and ~85 columns
# and certain variables
x_vars = ['x1', 'x2', 'x3']  # x variables
y_vars = ['y1', 'y2', 'x1']  # note that some variables can be among both x and y (like using today's salary to predict tomorrow's salary)
data = [pd.read_csv(x) for x in data]

def func1(df_):
    # do some preprocessing stuff
    return df_

data = map(func1, data)  # Some preprocessing and adding some columns (for example a column for the year)
concatenated_data = pd.concat(data, axis=0)  # Big file, all clients from 2016-2018, stacked row-wise

def my_func(df_):  # function applied above
    df_ = df_.sort_values('year')        # order by year
    df_['Diff'] = df_.year.diff()        # difference between consecutive years
    df_['shifted'] = df_.Diff.shift(-1)  # shift of the difference
    # For example, client z may be present in 2016 and 2018, so his year difference is 2.
    # I can't use client z's x_vars to predict y (only a single-period-ahead regression)
    x = df_.loc[df_['shifted'] == 1, x_vars]  # select only contiguous years
    y = df_.loc[df_['Diff'] == 1, y_vars]     # the same, but a year ahead of x
    return (x, y)

# ... Iteration over groupby object
Instead of using groupby() to reduce, I'm expanding the single, big file into x and y dataframes, in which y holds information a period ahead of x.
As you can see, using a dask dataframe groupby (omitted for simplicity) would parallelize the my_func operation, but as I understand it, it would also wait until all operation nodes are completed, thus depleting my memory. What I would like is to perform my_func for certain groups (ideally as many as memory can hold), finish them, save to disk (without problems related to parallel saving) and then proceed to the next batch of groups.
Maybe I can use some dask delayed objects, but I don't think it will make good use of my memory if I set the batches manually.
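Something like the following rough sketch is what I have in mind (assuming concatenated_data and my_func from the example above; the batch size and output file names are just illustrative):
import dask

groups = list(concatenated_data.groupby('ID'))   # list of (name, group) pairs
batch_size = 3

with open('X.csv', 'w') as x_file, open('Y.csv', 'w') as y_file:
    for start in range(0, len(groups), batch_size):
        batch = groups[start:start + batch_size]
        # build delayed tasks for this batch only
        tasks = [dask.delayed(my_func)(group) for _, group in batch]
        results = dask.compute(*tasks)           # run the batch in parallel
        for x, y in results:                     # write sequentially, then move to the next batch
            x.to_csv(x_file, header=None)
            y.to_csv(y_file, header=None)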
I'm not sure if this is what you are looking for.
Generate data
import pandas as pd
import numpy as np
import dask.dataframe as dd
import os

n = 200
df = pd.DataFrame({"grp": np.random.choice(list("abcd"), n),
                   "x": np.random.randn(n),
                   "y": np.random.randn(n),
                   "z": np.random.randn(n)})
df.to_csv("file.csv", index=False)

# we will need this later on
df.to_parquet("file.parquet", index=False)
Pandas solution
# we save our files in a given folder
fldr = "output1"
os.makedirs(fldr, exist_ok=True)

# we read only the columns we need
cols2read = ["grp", "x", "y"]
df = pd.read_csv("file.csv")
df = df[cols2read]

def write_file(x, fldr):
    name = x["grp"].iloc[0]
    x.to_csv(f"{fldr}/{name}.csv", index=False)

df.groupby("grp")\
  .apply(lambda x: write_file(x, fldr))
Dask solution
This is basically the same, but we need to read the data with dask, add a meta argument to the apply, and call compute at the end.
# we save our files in a given folder
fldr = "output2"
os.makedirs(fldr, exist_ok=True)

# we read only the columns we need
cols2read = ["grp", "x", "y"]
df = dd.read_csv("file.csv")
df = df[cols2read]

def write_file(x, fldr):
    name = x["grp"].iloc[0]
    x.to_csv(f"{fldr}/{name}.csv", index=False)

df.groupby("grp")\
  .apply(lambda x: write_file(x, fldr), meta='f8')\
  .compute()
Working with parquet
Here I suggest you work with parquet, as it's going to be way more efficient.
cols2read = ["grp", "x", "y"]
df = dd.read_parquet("file.parquet", columns=cols2read)
df.to_parquet("output3/", partition_on="grp")
Inside output3 you will find several folders called grp=a and so on. Each of them may contain several files, but you can read all of them with pd.read_parquet("output3/grp=a").
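For example, a short sketch of reading a partition back (assuming the output3/ layout above and a parquet engine such as pyarrow):
import pandas as pd
import dask.dataframe as dd

# read one partition with pandas
df_a = pd.read_parquet("output3/grp=a")

# or read the whole partitioned dataset lazily and filter with dask
df_a_dd = dd.read_parquet("output3/", filters=[("grp", "==", "a")]).compute()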

On efficiently separating one data file into two files for model building work

There is a data file, consisting of 100K rows, where each row stores a data point. I would like to randomly select 10K rows and save them into a validation file, and save the remaining rows into a training file. Instead of writing code to do this, is there an existing function in scikit-learn, pandas, or generic Python to efficiently separate a data file into two?
There is really only one reason not to use sklearn's train_test_split: you probably don't want to take the labels out of the features. You simply want to split the DataFrame into two sections without separating features and labels.
If you don't want to use train_test_split from sklearn, you can do it in pandas too:
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
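Note that the random mask gives an approximately 80/20 split. If you need an exact 10K-row validation set as in the question, here is a sketch with either library (the file names are illustrative):
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data.csv')  # the 100K-row file

# Option 1: sklearn, keeping features and labels together
train, validation = train_test_split(df, test_size=10_000, random_state=42)

# Option 2: pandas only, sampling an exact number of rows
validation = df.sample(n=10_000, random_state=42)
train = df.drop(validation.index)

train.to_csv('train.csv', index=False)
validation.to_csv('validation.csv', index=False)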

Merge CountVectorizer output from 4 text columns back into one dataset

I have a collection of ~100,000 documents in a dataset with a unique doc_id and four columns containing text (like below).
[screenshot: original dataset]
I want to vectorize each of the four text columns individually and then combine all of those features back together to create one large dataset for the purpose of building a model for prediction. I approached the vectorization for each text feature using code like below:
import nltk
from sklearn.feature_extraction.text import CountVectorizer

stopwords = nltk.corpus.stopwords.words("english")

subject_transformer = CountVectorizer(stop_words=stopwords)
subject_vectorized = subject_transformer.fit_transform(full_docs['subject'])

body_transformer = CountVectorizer(stop_words=stopwords)
body_vectorized = body_transformer.fit_transform(full_docs['body'])

excerpt_transformer = CountVectorizer(stop_words=stopwords)
excerpt_vectorized = excerpt_transformer.fit_transform(full_docs['excerpt'])

regex_transformer = CountVectorizer(stop_words=stopwords)
regex_vectorized = regex_transformer.fit_transform(full_docs['regex'])
Each vectorization yields a sparse matrix like below where column one is the document number, column two is the column number (one for each word in the original text column), and the last column is the actual count.
[screenshot: sparse matrix]
What I want to do is the following:
Transpose each sparse matrix to a full dataframe of dimensions n x p, where n = number of documents and p = number of words in that corpus
Merge each of these matrices/dataframes back together for the purpose of building a model for prediction
I initially tried the following:
regex_vectorized_df = pd.DataFrame(regex_vectorized.toarray())
Then I could merge the four individual dataframes back together. This doesn't work because toarray() is too memory intensive. What is the best way to merge these four sparse matrices into one dataset with one unique line per document?
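One way to avoid toarray() entirely is to keep everything sparse: scipy.sparse.hstack can combine the four matrices column-wise, and most scikit-learn estimators accept sparse input directly. A minimal sketch (assuming the *_vectorized matrices from the code above):
from scipy import sparse
import pandas as pd

# Stack the four document-term matrices side by side (rows stay aligned by document)
combined = sparse.hstack([subject_vectorized, body_vectorized,
                          excerpt_vectorized, regex_vectorized]).tocsr()

# If a DataFrame is really needed, use a sparse-backed one instead of toarray()
combined_df = pd.DataFrame.sparse.from_spmatrix(combined,
                                                index=full_docs['doc_id'])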

Combine multiple NetCDF files into timeseries multidimensional array python

I am using data from multiple netCDF files (in a folder on my computer). Each file holds data for the entire USA, for a time period of 5 years. Locations are referenced based on the index of an x and y coordinate. I am trying to create a time series for multiple locations (grid cells), compiling the 5-year periods into a 20-year period (this would be combining 4 files). Right now I am able to extract the data from all files for one location and compile this into an array using numpy append. However, I would like to extract the data for multiple locations, placing this into a matrix where the rows are the locations and the columns contain the time series precipitation data. I think I have to create a list or dictionary, but I am not really sure how to allocate the data to the list/dictionary within a loop.
I am new to python and netCDF, so forgive me if this is an easy solution. I have been using this code as a guide, but haven't figured out how to format it for what I'd like to do: Python Reading Multiple NetCDF Rainfall files of variable size
Here is my code:
import glob
from netCDF4 import Dataset
import numpy as np

# Define x & y index for grid cell of interest
# Pittsburgh is 37,89
yindex = 37  # first number
xindex = 89  # second number

# Path
path = '/Users/LMC/Research Data/NARCCAP/'
folder = 'MM5I_ccsm/'

## load data file names
all_files = glob.glob(path + folder + '*.nc')
all_files.sort()

## initialize np arrays of time periods and locations
yindexlist = [yindex, 38, 39]          # y indices for all grid cells of interest
xindexlist = [xindex, xindex, xindex]  # x indices for all grid cells of interest
ngridcell = len(yindexlist)
ntimestep = 58400                      # This is for 4 files of 14600 timesteps

## Initialize np array
timeseries_per_gridcell = np.empty(0)

## START LOOP FOR FILE IMPORT
for timestep, datafile in enumerate(all_files):
    fh = Dataset(datafile, mode='r')
    days = fh.variables['time'][:]
    lons = fh.variables['lon'][:]
    lats = fh.variables['lat'][:]
    precip = fh.variables['pr'][:]
    for i in range(1):
        timeseries_per_gridcell = np.append(timeseries_per_gridcell,
                                            precip[:, yindexlist[i], xindexlist[i]] * 10800)
    fh.close()
print(timeseries_per_gridcell)
I put 3 files on dropbox so you could access them, but I am only allowed to post 2 links. Here are they are:
https://www.dropbox.com/s/rso0hce8bq7yi2h/pr_MM5I_ccsm_2041010103.nc?dl=0
https://www.dropbox.com/s/j56undjvv7iph0f/pr_MM5I_ccsm_2046010103.nc?dl=0
Nice start, I would recommend the following to help solve your issues.
First, check out ncrcat to quickly concatenate your individual netCDF files into a single file. I highly recommend downloading NCO for netCDF manipulations, especially in this instance where it will ease your Python coding later on.
Let's say the files are named precip_1.nc, precip_2.nc, precip_3.nc, and precip_4.nc. You could concatenate them along the record dimension to form a new precip_all.nc with a record dimension of length 58400 with
ncrcat precip_1.nc precip_2.nc precip_3.nc precip_4.nc -O precip_all.nc
In Python we now just need to read in that new single file and then extract and store the time series for the desired grid cells. Something like this:
import netCDF4
import numpy as np
yindexlist = [1, 2, 3]
xindexlist = [4, 5, 6]
ngridcell = len(xindexlist)
ntimestep = 58400

# Define an empty 2D array to store time series of precip for a set of grid cells
timeseries_per_grid_cell = np.zeros([ngridcell, ntimestep])

ncfile = netCDF4.Dataset('path/to/file/precip_all.nc', 'r')

# Note that precip is 3D, so need to read in all dimensions
precip = ncfile.variables['precip'][:,:,:]

for i in range(ngridcell):
    timeseries_per_grid_cell[i,:] = precip[:, yindexlist[i], xindexlist[i]]

ncfile.close()
If you have to use Python only, you'll need to keep track of the chunks of time indices that the individual files form to make the full time series. 58400/4 = 14600 time steps per file. So you'll have another loop to read in each individual file and store the corresponding slice of times, i.e. the first file will populate 0-14599, the second 14600-29199, etc.
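A rough sketch of that Python-only approach (the file pattern is illustrative, and the variable name 'pr' and 14600 steps per file follow the original question):
import glob
import numpy as np
from netCDF4 import Dataset

all_files = sorted(glob.glob('/path/to/files/*.nc'))   # 4 files, in time order
yindexlist = [37, 38, 39]
xindexlist = [89, 89, 89]
ngridcell = len(yindexlist)
steps_per_file = 14600
timeseries_per_grid_cell = np.zeros([ngridcell, steps_per_file * len(all_files)])

for f, datafile in enumerate(all_files):
    start = f * steps_per_file              # file 0 fills 0-14599, file 1 fills 14600-29199, ...
    with Dataset(datafile, 'r') as fh:
        precip = fh.variables['pr'][:]      # shape: (time, y, x)
        for i in range(ngridcell):
            timeseries_per_grid_cell[i, start:start + steps_per_file] = \
                precip[:, yindexlist[i], xindexlist[i]]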
You can easily merge multiple netCDF files into one using the netCDF4 package in Python. See the example below:
I have four netCDF files like 1.nc, 2.nc, 3.nc, 4.nc.
Using the command below, all four files will be merged into one dataset.
import netCDF4
from netCDF4 import Dataset
dataset = netCDF4.MFDataset(['1.nc','2.nc','3.nc','4.nc'])
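You can then read the merged variable as if it came from a single file, for example (assuming the precipitation variable is named 'pr' as in the original files):
precip = dataset.variables['pr'][:]  # all time steps from the four files, concatenated along the record dimension
print(precip.shape)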
Complementing N1B4's answer, you can also concatenate the 4 files along their time dimension using CDO from the command line:
cdo mergetime precip1.nc precip2.nc precip3.nc precip4.nc merged_file.nc
or with wildcards
cdo mergetime precip?.nc merged_file.nc
and then proceed to read it in as per that answer.
You can add another step from the command line to extract the location of choice by using
cdo remapnn,lon=X/lat=Y merged_file.nc my_location.nc
this picks out the gridcell nearest to your specified lon/lat (X,Y) coordinate, or you can use bilinear interpolation if you prefer:
cdo remapbil,lon=X/lat=Y merged_file.nc my_location.nc
I prefer xarray's approach
import xarray as xr

ds = xr.open_mfdataset('nc_*.nc', combine='by_coords')
ds.to_netcdf('nc_combined.nc')  # Export netCDF file
Note that open_mfdataset requires the dask library to work. So, if for some reason you can't use dask (as in my case), then this method is useless.
