Combine multiple NetCDF files into a timeseries multidimensional array in Python

I am using data from multiple netCDF files (in a folder on my computer). Each file holds data for the entire USA for a time period of 5 years. Locations are referenced based on the index of an x and y coordinate. I am trying to create a time series for multiple locations (grid cells), compiling the 5 year periods into a 20 year period (this would be combining 4 files). Right now I am able to extract the data from all files for one location and compile this into an array using numpy append. However, I would like to extract the data for multiple locations, placing this into a matrix where the rows are the locations and the columns contain the time series precipitation data. I think I have to create a list or dictionary, but I am not really sure how to allocate the data to the list/dictionary within a loop.
I am new to Python and netCDF, so forgive me if this is an easy solution. I have been using this code as a guide, but haven't figured out how to adapt it for what I'd like to do: Python Reading Multiple NetCDF Rainfall files of variable size
Here is my code:
import glob
from netCDF4 import Dataset
import numpy as np

# Define x & y index for grid cell of interest
# Pittsburgh is 37,89
yindex = 37  # first number
xindex = 89  # second number

# Path
path = '/Users/LMC/Research Data/NARCCAP/'
folder = 'MM5I_ccsm/'

## load data file names
all_files = glob.glob(path + folder + '*.nc')
all_files.sort()

## initialize np arrays of timeperiods and locations
yindexlist = [yindex, 38, 39]          # y indices for all grid cells of interest
xindexlist = [xindex, xindex, xindex]  # x indices for all grid cells of interest
ngridcell = len(yindexlist)
ntimestep = 58400  # This is for 4 files of 14600 timesteps

## Initialize np array
timeseries_per_gridcell = np.empty(0)

## START LOOP FOR FILE IMPORT
for timestep, datafile in enumerate(all_files):
    fh = Dataset(datafile, mode='r')
    days = fh.variables['time'][:]
    lons = fh.variables['lon'][:]
    lats = fh.variables['lat'][:]
    precip = fh.variables['pr'][:]
    for i in range(1):
        timeseries_per_gridcell = np.append(timeseries_per_gridcell,
                                            precip[:, yindexlist[i], xindexlist[i]] * 10800)
    fh.close()

print(timeseries_per_gridcell)
I put 3 files on Dropbox so you could access them, but I am only allowed to post 2 links. Here they are:
https://www.dropbox.com/s/rso0hce8bq7yi2h/pr_MM5I_ccsm_2041010103.nc?dl=0
https://www.dropbox.com/s/j56undjvv7iph0f/pr_MM5I_ccsm_2046010103.nc?dl=0

Nice start, I would recommend the following to help solve your issues.
First, check out ncrcat to quickly concatenate your individual netCDF files into a single file. I highly recommend downloading NCO for netCDF manipulations, especially in this instance where it will ease your Python coding later on.
Let's say the files are named precip_1.nc, precip_2.nc, precip_3.nc, and precip_4.nc. You could concatenate them along the record dimension to form a new precip_all.nc with a record dimension of length 58400 with
ncrcat precip_1.nc precip_2.nc precip_3.nc precip_4.nc -O precip_all.nc
In Python we now just need to read in that new single file and then extract and store the time series for the desired grid cells. Something like this:
import netCDF4
import numpy as np

yindexlist = [1, 2, 3]
xindexlist = [4, 5, 6]
ngridcell = len(xindexlist)
ntimestep = 58400

# Define an empty 2D array to store time series of precip for a set of grid cells
timeseries_per_grid_cell = np.zeros([ngridcell, ntimestep])

ncfile = netCDF4.Dataset('path/to/file/precip_all.nc', 'r')

# Note that precip is 3D, so need to read in all dimensions
# (use the precipitation variable name from your files, 'pr' in the question)
precip = ncfile.variables['precip'][:, :, :]

for i in range(ngridcell):
    timeseries_per_grid_cell[i, :] = precip[:, yindexlist[i], xindexlist[i]]

ncfile.close()
If you have to use Python only, you'll need to keep track of the chunks of time indices that the individual files form to make the full time series. 58400/4 = 14600 time steps per file. So you'll have another loop to read in each individual file and store the corresponding slice of times, i.e. the first file will populate 0-14599, the second 14600-29199, etc.
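A minimal sketch of that pure-Python route, assuming four files of 14600 time steps each, the 'pr' variable name from the question, and a placeholder path:

import glob

import numpy as np
from netCDF4 import Dataset

files = sorted(glob.glob('/path/to/files/*.nc'))  # placeholder path
yindexlist = [37, 38, 39]
xindexlist = [89, 89, 89]
ngridcell = len(yindexlist)
steps_per_file = 14600

timeseries_per_grid_cell = np.zeros([ngridcell, steps_per_file * len(files)])

for f, datafile in enumerate(files):
    start = f * steps_per_file     # first time index covered by this file
    stop = start + steps_per_file  # one past the last time index
    with Dataset(datafile, mode='r') as fh:
        precip = fh.variables['pr'][:]  # shape (time, y, x)
        for i in range(ngridcell):
            timeseries_per_grid_cell[i, start:stop] = precip[:, yindexlist[i], xindexlist[i]]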

You can easily merge multiple netCDF files into one using the netCDF4 package in Python. See the example below:
I have four netCDF files like 1.nc, 2.nc, 3.nc, 4.nc.
Using the command below, all four files will be merged into one dataset.
import netCDF4
from netCDF4 import Dataset
dataset = netCDF4.MFDataset(['1.nc','2.nc','3.nc','4.nc'])
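Continuing from the snippet above, the resulting dataset behaves like a single file whose record (time) dimension spans all four inputs, so you can slice it directly. A small sketch, assuming the precipitation variable is named 'pr' and using the Pittsburgh indices from the question:

precip = dataset.variables['pr'][:]    # shape (time, y, x); time spans all four files
pittsburgh_series = precip[:, 37, 89]  # full 20-year series for one grid cell
dataset.close()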

In parallel to N1B4's answer, you can also concatenate the 4 files along their time dimension using CDO from the command line:
cdo mergetime precip1.nc precip2.nc precip3.nc precip4.nc merged_file.nc
or with wildcards
cdo mergetime precip?.nc merged_file.nc
and then proceed to read it in as per that answer.
You can add another step from the command line to extract the location of choice by using
cdo remapnn,lon=X/lat=Y merged_file.nc my_location.nc
This picks out the grid cell nearest to your specified lon/lat (X, Y) coordinate, or you can use bilinear interpolation if you prefer:
cdo remapbil,lon=X/lat=Y merged_file.nc my_location.nc
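After that, reading the single-location file back in Python is straightforward; a minimal sketch, assuming the precipitation variable is named 'pr' as in the question's files:

from netCDF4 import Dataset

with Dataset('my_location.nc', mode='r') as nc:
    time = nc.variables['time'][:]
    precip_series = nc.variables['pr'][:].squeeze()  # drop the singleton lat/lon dimensions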

I prefer xarray's approach:
import xarray as xr

ds = xr.open_mfdataset('nc_*.nc', combine='by_coords')  # merge along shared coordinates
ds.to_netcdf('nc_combined.nc')  # export the combined netCDF file

open_mfdataset needs the dask library to work, so if for some reason you can't use dask (as in my case), this method won't help you.
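If dask is not installed, one possible workaround is to open the files eagerly one by one and concatenate them in memory; a hedged sketch, reusing the 'nc_*.nc' pattern and 'time' dimension from the example above:

import glob

import xarray as xr

files = sorted(glob.glob('nc_*.nc'))
datasets = [xr.open_dataset(f) for f in files]  # plain open_dataset does not need dask
ds = xr.concat(datasets, dim='time')            # stack along the time dimension
ds.to_netcdf('nc_combined.nc')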


Which file format uses less memory in python?

I wrote code for point generation which generates a dataframe every second and keeps on generating. Each dataframe has 1000 rows and 7 columns. It is implemented using a while loop, so on every iteration one dataframe is generated and must be appended to a file. Which file format should I use to manage memory efficiently? Which file format takes less memory? Can anyone give me a suggestion? Is it okay to use CSV? If so, what datatype should I prefer? Currently my dataframe has int16 values. Should I append them as-is, or should I convert them into a binary or byte format?
numpy arrays can be stored in binary format. Since you have a single int16 data type, you can create a numpy array and write that. You would have 2 bytes per int16 value, which is fairly good for size. The trick is that you need to know the dimensions of the stored data when you read it later. In this example it's hard coded. This is a bit fragile - if you change your mind and start using different dimensions later, old data would have to be converted.
Assuming you want to read a bunch of 1000x7 dataframes later, you could do something like the example below. The writer keeps appending 1000x7 int16s and the reader chunks them back into dataframes. If you don't use anything specific to pandas itself, you would be better off just sticking with numpy for all of your operations and skipping the demonstrated conversions.
import os

import numpy as np
import pandas as pd


def write_df(filename, df):
    with open(filename, "ab") as fp:
        np.array(df, dtype="int16").tofile(fp)


def read_dfs(filename, dim=(1000, 7)):
    """Sequentially reads dataframes from a file formatted as raw int16
    with dimension 1000x7"""
    size = dim[0] * dim[1]
    with open(filename, "rb") as fp:
        while True:
            arr = np.fromfile(fp, dtype="int16", count=size)
            if not len(arr):
                break
            yield pd.DataFrame(arr.reshape(*dim))


# ready for test
test_filename = "test123"
if os.path.exists(test_filename):
    os.remove(test_filename)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# write test file
for _ in range(5):
    write_df(test_filename, df)

# read and verify test file
return_data = [df for df in read_dfs(test_filename, dim=(3, 2))]
assert len(return_data) == 5

Organize flow data from multiple Excel sheets into one Excel file by iterating through each column

I have the paths to each Excel file in 'files' (using this thread).
Then I was trying to use a for loop to iterate through each file, gather the flow data, and combine it into a new matrix 'val' by adding it to a new column each time. 'Flow' is also the column name in the Excel file, so I use that on line 5 to select the column I want.
For example:

Excel 1:
Flow data
1
2

Excel 2:
Flow data
3
4

The val matrix should then hold:

Excel 1  Excel 2
1        3
2        4
I keep getting this error, however:
could not broadcast input array from shape (105408,1) into shape (105408,)
It seems like a common error, but I haven't been able to solve it from similar questions on here.
val = np.zeros((105408, 50), int)

for x in range(len(files)+1):
    dt = pd.read_csv(files[x])
    flow_data = dt[['Flow']]
    val[:, x] = flow_data
    # print(val)
I think you are running into an issue due to the extra pair of brackets surrounding 'Flow'; removing them should make it function as you intended: dt[['Flow']] --> dt['Flow']
Using a DataFrame might be a better approach for aggregating results, though: a numpy.ndarray will throw an error if len(files) turns out to be larger than the preset array width (50 in this case). A DataFrame is more flexible for varying file counts and row counts, which seems to be your situation given that you are using len(files) and not a specific file count.
Working example (using pd.DataFrame):
aggregate_df = pd.DataFrame()

for x in range(len(files)):
    dt = pd.read_csv(files[x])
    flow_data = dt['Flow']
    aggregate_df.loc[:, x] = flow_data  # using a df to aggregate results

# print(aggregate_df)
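An alternative, shown below as a sketch under the same assumptions about 'files' and the 'Flow' column, is to collect the columns in a list and build the frame in one call to pd.concat, which avoids growing the DataFrame inside the loop:

import pandas as pd

columns = []
for x in range(len(files)):
    dt = pd.read_csv(files[x])
    # keep each file's series with a clean index, named after its source file
    columns.append(dt['Flow'].reset_index(drop=True).rename(files[x]))

aggregate_df = pd.concat(columns, axis=1)  # one column per file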

How do I aggregate 50 datasets within an HDF5 file?

I have an HDF5 file with 2 groups, each containing 50 datasets of 4D numpy arrays of the same type per group. I want to combine all 50 datasets in each group into a single dataset. In other words, instead of 2 x 50 datasets I want 2 x 1 datasets. How can I accomplish this? The file is 18.4 GB in size. I am a novice at working with large datasets. I am working in Python with h5py.
Thanks!
Look at this answer: How can I combine multiple .h5 file? - Method 3b: Merge all data into 1 Resizeable Dataset. It describes a way to copy data from multiple HDF5 files into a single dataset. You want to do something similar. The only difference is all of your datasets are in 1 HDF5 file.
You didn't say how you want to stack the 4D arrays. In my first answer I stacked them along axis=3. As noted in my comment, it's easier (and cleaner) to create the merged dataset as a 5D array and stack the data along the 5th axis (axis=4). I like this for 2 reasons: 1) the code is simpler and easier to follow, and 2) it's more intuitive (to me) that axis=4 represents a unique dataset (instead of slicing on axis=3).
I wrote a self-contained example to demonstrate the procedure. First it creates some data and closes the file. Then it reopens the file (read only) and creates a new file for the copied datasets. It loops over the groups and datasets in the first file and copies the data into a merged dataset in the second file. The 5D example is first, and my original 4D example follows.
Note: this is a simple example that will work for your specific case. If you are writing a general solution, it should check for consistent shapes and dtypes before blindly merging the data (which I don't do).
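A minimal sketch of such a check, comparing every dataset in a group against the first one before merging (check_consistency is a hypothetical helper, not part of the code below):

def check_consistency(group):
    """Return True if all datasets in an h5py group share one shape and dtype."""
    names = list(group.keys())
    ref_shape, ref_dtype = group[names[0]].shape, group[names[0]].dtype
    return all(group[n].shape == ref_shape and group[n].dtype == ref_dtype
               for n in names)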
Code to create the Example data (2 groups, 5 datasets each):
import h5py
import numpy as np

# Create a simple H5 file with 2 groups and 5 datasets (shape=a0,a1,a2,a3)
with h5py.File('SO_69937402_2x5.h5', 'w') as h5f1:
    a0, a1, a2, a3 = 100, 20, 20, 10
    grp1 = h5f1.create_group('group1')
    for ds in range(1, 6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0, a1, a2, a3)
        grp1.create_dataset(f'dset_{ds:02d}', data=arr)
    grp2 = h5f1.create_group('group2')
    for ds in range(1, 6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0, a1, a2, a3)
        grp2.create_dataset(f'dset_{ds:02d}', data=arr)
Code to merge the data (2 groups, 1 5D dataset each -- my preference):
with h5py.File('SO_69937402_2x5.h5', 'r') as h5f1, \
     h5py.File('SO_69937402_2x1_5d.h5', 'w') as h5f2:
    # loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:', grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        ds_cnt = len(h5f1[grp].keys())
        for i, ds in enumerate(h5f1[grp].keys()):
            print('working on dataset:', ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # If dataset doesn't exist in group, create it
                # Set maxshape so dataset is resizable
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', dtype=h5f1[grp][ds].dtype,
                                                    shape=(ds_shape + (ds_cnt,)), maxshape=(ds_shape + (None,)))
            # Now add data to the merged dataset
            merge_ds[:, :, :, :, i] = h5f1[grp][ds]
Code to merge the data (2 groups, 1 4D dataset each):
with h5py.File('SO_69937402_2x5.h5', 'r') as h5f1, \
     h5py.File('SO_69937402_2x1_4d.h5', 'w') as h5f2:
    # loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:', grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        for ds in h5f1[grp].keys():
            print('working on dataset:', ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # if dataset doesn't exist in group, create it
                # Set maxshape so dataset is resizable
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', data=h5f1[grp][ds],
                                                    maxshape=[ds_shape[0], ds_shape[1], ds_shape[2], None])
            else:
                # otherwise, resize the merged dataset to hold new values
                ds1_shape = h5f1[grp][ds].shape
                ds2_shape = merge_ds.shape
                merge_ds.resize(ds1_shape[3] + ds2_shape[3], axis=3)
                merge_ds[:, :, :, ds2_shape[3]:ds2_shape[3] + ds1_shape[3]] = h5f1[grp][ds]

Extracting the sensing date/time over a particular lat/lng in a netCDF file in python

I am currently working with multiple netCDF files in Python. I am using Sentinel-5P NO2 tropospheric columns over Greater London. I want to plot the individual files as a time series, titled with the passover time over London for each individual swath, but I am unsure how to extract this.
Is there a simple way in which I can extract the passover time of the satellite over a particular lat/lng for each file?
EDIT:
Some more information on the files. They are netCDF files meaning they contain Dimensions, Variables, Attributes and Coordinates. They contain information on Vertical Column Densities of NO2 over London at a spatial resolution of 3.5x7km. I have opened the files with xarray in PyCharm and have further attached an image to provide more information on the variables.
I essentially need to find the value of delta_time when latitude=51.2 or 51.8. Below is what I have developed so far, however I have around 50 files all with over 100,000 pixels so this is very very slow. Does anyone know how I can improve this?
for i in file_list:
    # Open product - GROUP PRODUCT
    with xr.open_dataset(i, group='PRODUCT') as file:
        print(colored('\nPRODUCT Group:\n', 'blue'), file)
        no2 = file['nitrogendioxide_tropospheric_column'][0]
        for row in no2.coords['latitude']:
            for cell in row:
                if cell == 51.2 or cell == 51.8:
                    print(cell)
                    print(cell['scanline'])
                    scanpoint = (cell['scanline'])
                    scantime = no2['delta_time'].sel(scanline=scanpoint)
                    print(scantime)
                    return scantime
                else:
                    continue
You should be able to use vectorised NumPy functions to do what you want. Now, I'm not so sure about comparing floats for equality, but this should be similar to yours. I haven't specifically used xarray but have used netCDF4, so where it says <array> I mean get a numpy (or equivalent) array for that variable/coordinate. Also, note that I haven't selected an individual time value, which it looks like you have, but am simply using the whole 3D array of latitudes.
import numpy as np
latitude = <3D latitudes array>
delta_time = <2D delta_time array>
# 3D boolean array with our required condition
condition = (latitude == 51.2) | (latitude == 51.8)
# Expand tuple of indices, one for each of the 3 dims, but ignore ground_pixel dim
# Each of these idx arrays is 1D
time_idx, scanline_idx, _ = condition.nonzero()
# Get 1D array of delta_times by using time and scanline indices
delta_times = delta_time[time_idx, scanline_idx]
This should leave you with the co-ordinates (condition.nonzero()) of all relevant cells in all three dimensions, as well as the delta_times of these cells.
Note that you don't need the actual no2 array if you're not using the actual values and only care about latitude and delta_time, but you can always get the values of the relevant cells with something like no2[condition].

Find big differences in numpy array

I have a csv file that contains data from two LED measurements. There are some mistakes in the file that give huge sparks in the graph. I want to locate the places where this happens.
I have this code that makes two arrays that I plot.
x625 = np.array(df['LED Group 625'].dropna(axis=0, how='all'))
x940 = np.array(df['LED Group 940'].dropna(axis=0, how='all'))
I will provide an answer with some artificial data since you have not posted any data yet.
So after you convert the pandas columns into a numpy array, you can do something like this:
import numpy as np
# some random data. 100 lines and 1 column
x625 = np.random.rand(100,1)
# Assume that the maximum value in `x625` is a spark.
spark = x625.max()
# Find where these spark are in the `x625`
np.where(x625==spark)
#(array([64]), array([0]))
The above means that a value equal to spark is located at row 64 of column 0.
Similarly, you can use np.where(x625 > any_number_here)
If instead of the location you need to create a boolean mask use this:
boolean_mask = (x625==spark)
# verify
np.where(boolean_mask)
# (array([64]), array([0]))
EDIT 1
You can use numpy.diff() to get the element-wise differences of consecutive elements in the array (variable).
diffs = np.diff(x625.ravel())
This will have at index 0 the result of element1 - element0.
If the values in diffs are large at a specific index position, then a spark occurred at that position.
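A short sketch of turning those differences into locations, with an arbitrary threshold you would tune to your data:

import numpy as np

diffs = np.diff(x625.ravel())
threshold = 5.0  # hypothetical cut-off for what counts as a spark
spark_positions = np.where(np.abs(diffs) > threshold)[0]
# each index i in spark_positions marks a large jump between element i and element i+1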
