I know how to calculate the mean of a variable in one netcdf file. But, I have 40 netcdf files. In each file I have 4000 data values for mixing layer height. I want to create a list of mean mixing layer height for the multiple netcdf files.
In the end the size of my list should be 40.
Can someone help me with Python code to create this list?
Thank you so much.
Here is the code I used to calculate the mean mixing layer height for one layer in a single netcdf file:
import numpy as np
import netCDF4
f = netCDF4.Dataset('niv.nc')
#the shape of my data set is (5760,3)
#5760 is the number of lists of time
#In each list I have 3 mixing layer heights for 3 layers.
#I'm going to call all the mixing layer height data for the first layer
a = f.variables['pbl'][:, 0]
print(np.mean(a))
You have to get the list of filenames somehow. Here I'll assume you have all your files in one folder, and there are no other netCDF files in that folder. To do this using netCDF4, with a separate mean for each file:
import numpy as np
import netCDF4
from glob import glob
# you want to modify this to use your actual data directory
filename_list = glob('/home/user/data_dir/*.nc')
mean_list = []
for filename in filename_list:  # filename_list could also be built with something like os.listdir
    with netCDF4.Dataset(filename) as ds:
        mean_list.append(np.mean(ds.variables['pbl'][:, 0]))
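Two details are worth checking here. glob does not guarantee any particular file order, so wrapping the call in sorted() keeps the means in filename order. Also, netCDF4 usually returns masked arrays when a variable has a fill value. The sketch below uses made-up masked arrays in place of real files (the values, the -999.0 fill value and the helper name first_layer_means are all hypothetical) to show that the masked entries are skipped when the mean is taken:

```python
import numpy as np

def first_layer_means(arrays):
    """Return one mean per array, taken over column 0 and ignoring masked values."""
    return [float(np.ma.mean(arr[:, 0])) for arr in arrays]

# Synthetic stand-ins for ds.variables['pbl'][:] read from two files.
# -999.0 plays the role of the fill value (an assumption for this sketch).
file_a = np.ma.masked_values([[100.0, 1.0, 2.0],
                              [300.0, 1.0, 2.0],
                              [-999.0, 1.0, 2.0]], -999.0)
file_b = np.ma.masked_values([[500.0, 1.0, 2.0],
                              [700.0, 1.0, 2.0]], -999.0)

mean_list = first_layer_means([file_a, file_b])
print(mean_list)  # one entry per "file"; the masked -999.0 is not counted
```

With 40 real files the list would have 40 entries, one per file.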
To do the same thing with xarray:
import xarray as xr
from glob import glob
# you want to modify this to use your actual data directory
filename_list = glob('/home/user/data_dir/*.nc')
mean_list = []
for filename in filename_list:  # filename_list could also be built with something like os.listdir
    with xr.open_dataset(filename) as ds:
        mean_list.append(np.mean(ds['pbl'][:, 0].values))
If, instead of getting the average for each file, you want a single average over all the files (say the first dimension is time), you could use open_mfdataset in xarray like so:
import numpy as np
import xarray as xr
from glob import glob
# you want to modify this to use your actual data directory
filename_list = glob('/home/user/data_dir/*.nc')
# current xarray versions need combine='nested' whenever concat_dim is given
ds = xr.open_mfdataset(filename_list, combine='nested', concat_dim='time')
mean = np.mean(ds['pbl'][:, 0].values)
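Note that the overall mean is not necessarily the same as the mean of the per-file means: the two only agree when every file holds the same number of valid samples. A minimal sketch with made-up numbers (the array names and values are hypothetical) shows the difference:

```python
import numpy as np

# Hypothetical first-layer values from two files of different length.
file_a = np.array([100.0, 200.0, 300.0, 400.0])
file_b = np.array([1000.0, 2000.0])

per_file_means = [file_a.mean(), file_b.mean()]   # one value per file
mean_of_means = float(np.mean(per_file_means))    # every file weighted equally
pooled_mean = float(np.concatenate([file_a, file_b]).mean())  # every sample weighted equally

print(per_file_means, mean_of_means, pooled_mean)
```

Which one is appropriate depends on whether each file or each time step should carry equal weight.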
I am working with satellite data. Reading the variable creates a masked array in which the fill values are masked, but now I can't extract any values from that masked array.
How can I unmask the array?
The code is shown below:
import glob
import os
import netCDF4 as nc
import numpy as np
#listing all nc files in this folder
os.chdir(r'I:\Data\AOD\MODIS_AQUA_AOD\2004')
myfiles = glob.glob('*.nc')
ds=nc.Dataset(myfiles[0])
#now myfiles has every nc file name saved as an array
#reading lat and lon one time because same band in every file
lats=ds.variables['lat'][:]
lons=ds.variables['lon'][:]
aods=[] #will save the AOD data in this array
# Creating Loop and reading AOD Data
for i in range(len(myfiles)):
    dt = nc.Dataset(myfiles[i], 'r')
    aod = dt.variables['MYD08_D3_6_1_AOD_550_Dark_Target_Deep_Blue_Combined_Mean'][:]
    aods.append(aod)
Now I want to extract the values at a specific latitude and longitude from the list of masked arrays (aods), but I am unable to do so.
Any solutions?
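One possible approach, sketched here on a small synthetic grid rather than the real MODIS files: find the index of the nearest grid cell with argmin, and replace masked entries with NaN using np.ma.filled so that a plain number comes back. The grid values, the -9999.0 fill value and the helper name value_at are assumptions for illustration, and the field is assumed to be 2-D (lat, lon); a leading time axis would need an extra index.

```python
import numpy as np

# Synthetic 1-degree grid standing in for the lat/lon/AOD variables (values are made up).
lats = np.array([10.0, 11.0, 12.0])
lons = np.array([70.0, 71.0, 72.0, 73.0])
aod = np.ma.masked_values(
    np.array([[0.1, 0.2, -9999.0, 0.4],
              [0.5, -9999.0, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]]), -9999.0)

def value_at(field, lats, lons, lat, lon):
    """Return the value at the grid cell nearest to (lat, lon), or NaN if it is masked."""
    i = int(np.abs(lats - lat).argmin())   # nearest latitude index
    j = int(np.abs(lons - lon).argmin())   # nearest longitude index
    return float(np.ma.filled(field, np.nan)[i, j])

valid = value_at(aod, lats, lons, 11.2, 72.9)    # nearest cell is (11.0, 73.0)
missing = value_at(aod, lats, lons, 11.0, 71.0)  # this cell is masked
print(valid, missing)
```

Applied to the list, something like [value_at(a, lats, lons, lat0, lon0) for a in aods] would give one value per file, under the same 2-D assumption.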
I have some ensemble files in grib format that I would like to lazy load in Python using dask and xarray. Based on https://climate-cms.org/2018/09/14/dask-era-interim.html, I managed to lazy load the files as intended, but now I want to slice and select the dimensions to plot the data for some time and level.
UPDATE: I recently came back to this issue and finally figured out that instead of using da.concatenate, I should use da.stack. This simple change solved my problem. The question is updated accordingly, in case anyone needs an example of how to create an ensemble of grib files using Python (with dask arrays for lazy loading), and to load and plot data in the same fashion as one would with software like GrADS.
My program looks like:
import dask
import dask.array as da
import xarray as xr
import pandas as pd
import numpy as np
from glob import glob
from datetime import date, datetime, timedelta
import matplotlib.pyplot as plt
bpath = '/some/path/to/my/data'
# pressure levels
levels =['1000', '925', '850', '700', '500', '300', '250', '200', '50']
# ensemble member names
ensm = ['M01', 'M02', 'M03', 'M04', 'M05']
@dask.delayed
def open_file_delayed(file, vname):
    # decode_cf must be the boolean False; the string 'False' is truthy
    ds = xr.open_dataset(file, decode_cf=False, engine='pynio')
    return ds
def open_file(file, vname, nlevs, nlats, nlons, ftype):
    file_data = open_file_delayed(file, vname)[vname].data
    return da.from_delayed(file_data, (nlevs, nlats, nlons), ftype)
def open_ensemble(date, member, vname):
    # list of files to open (sorted by date)
    # filename mask: PREFIXMEMYYYYiMMiDDiHHiYYYYfMMfDDfHHf.grb
    # MEM: member name (see the ensm list)
    # YYYYiMMiDDiHHi: analysis date (passed as the date argument of this function)
    # YYYYfMMfDDfHHf: forecast date
    files = sorted(glob(bpath + '/%(dateanl)s/%(mem)s/PREFIX%(mem)s%(dateanl)s*.grb' %
                        {'dateanl': date, 'mem': member}))
    ntime = len(files)
    # open the first file in the list to get dimensions and coordinates
    ds0 = xr.open_dataset(files[0], decode_cf=False, engine='pynio')
    var0 = ds0[vname]
    levs = ds0.lv_ISBL2.data
    lats = ds0.g4_lat_0.data
    lons = ds0.g4_lon_1.data
    nlevs = ds0.lv_ISBL2.size
    nlats = ds0.g4_lat_0.size
    nlons = ds0.g4_lon_1.size
    ftype = var0.dtype
    ds0.close()
    # calculate the date range of the forecasts, in my case len(date_fmt) = 61 (every grib file has 61 times and 9 levels)
    date_fmt = pd.date_range(start=datetime.strptime(date, "%Y%m%d%H"), freq="6H", periods=ntime)
    # call the function 'open_file' for all files contained in the list 'files' and stack them up
    dask_var = da.stack([open_file(file, vname, nlevs, nlats, nlons, ftype) for file in files], axis=0)
    # xda is the data array with all files; stacking adds the leading time axis
    xda = xr.DataArray(dask_var, dims=['time', 'lev', 'lat', 'lon'])
    # set coordinates values
    xda.coords['time'] = ('time', date_fmt)
    xda.coords['lev'] = ('lev', levs)
    xda.coords['lat'] = ('lat', lats)
    xda.coords['lon'] = ('lon', lons)
    return xda
To use this code, I do the following (for a single analysis date, 2020053000, i.e. May 30, 2020, and a variable called ZGEO):
Note: this part is very fast (it takes milliseconds), as we are just creating a map structure to the actual data, similar to a GrADS control file.
lens_zgeo = [open_ensemble('2020053000', ens, 'ZGEO') for ens in ensm]
dens_zgeo = xr.concat(lens_zgeo, dim='ens')
dens_zgeo.coords['ens'] = ('ens', ensm)
dens_zgeo is a data array with the following structure:
[image: data array structure]
From this point, I can slice the dimensions of the data array and plot (which is what I originally intended):
Note: this part takes longer because the data needs to be read from the disk.
dens_zgeo.isel(ens=0,time=0,lev=0).plot()
BOOM, case closed. Thanks!
I've edited the question with the modifications I needed in order to get the result I wanted. For this case, the main point is the use of da.stack instead of da.concatenate. By doing so, I've got the resulting data array to get concatenated in the ensemble dimension I needed.
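For anyone wondering why the switch matters: dask.array.stack and dask.array.concatenate follow the NumPy functions of the same name, so the difference can be seen with plain NumPy arrays. The sketch below uses made-up shapes (3 files, 9 levels, a 4 x 5 grid) and is only meant to illustrate the axis behaviour, not the actual grib data:

```python
import numpy as np

# Three hypothetical forecast files, each holding a (lev, lat, lon) field.
fields = [np.zeros((9, 4, 5)) for _ in range(3)]

joined = np.concatenate(fields, axis=0)  # extends the existing lev axis
stacked = np.stack(fields, axis=0)       # adds a new leading axis (time, in my case)

print(joined.shape)   # levels from all files end up mixed along one axis
print(stacked.shape)  # one entry per file along the new axis
```

With concatenate the time and level information end up on the same axis, which is why selecting a single time and level was not possible before.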
I'm having some trouble trying to get a monthly average with Sentinel 3 images on... everything, really. Python, Matlab: we are two people stuck on this problem.
The main reason is that the information for these images is not in a single netcdf file, neatly put together with coordinates and products. Instead, each one-day folder contains several separate .nc files, each holding different information about one single satellite image. SNAP uses an xmlxs file to work with all of these separate .nc files, as I understand it.
Now, I thought it would be a good idea to merge and create/edit the .nc files so as to build a new daily .nc file that includes the chlorophyll, the coordinates and, while I'm at it, time. Later on, I would merge these new files to be able to compute a monthly mean with xarray. At least that was my idea, but I can't get the first part done. There may be an obvious solution; however, here's what I tried, using the xarray module:
import os
import numpy as np
import xarray as xr
import netCDF4
from netCDF4 import Dataset
nc_folder = df_try.iloc[0] #folder where the image files are
#open dataset in xarray
nc_chl = xr.open_dataset(str(nc_folder['path']) + '/' + 'chl_nn.nc') #path to chlorophyll file
nc_chl
n_coord =xr.open_dataset(str(nc_folder['path'])+ '/'+ 'geo_coordinates.nc') #path to coordinates file
n_time = xr.open_dataset(str(nc_folder['path'])+ '/' + 'time_coordinates.nc') #path to time file
ds_grid = [[nc_chl], [n_coord], [n_time]]
combined = xr.combine_nested(ds_grid, concat_dim=[None, None])
combined #dataset with all but not recognizing coordinates
ds = combined.rename({'latitude': 'lat', 'longitude': 'lon', 'time_stamp' : 'time'}).set_coords(['lon', 'lat', 'time']) #dataset recognizing coordinates as coordinates
ds
which gives a dataset with
Dimensions: columns: 4865, rows: 4091
3 coordinates (lat, lon and time) and the chl variable.
Now, it doesn't save to netCDF4 (I tried, but there was an error), and I was also wondering whether anyone knows another way to compute the average. I have images from three years (from the beginning of 2017 to the end of 2019) that I need to average in different ways (monthly, seasonally...). My main problem at the moment is that the chlorophyll values are stored separately from the geographical coordinates, so working directly with only the chlorophyll files should not work and would just make a mess.
Any suggestions?
Two options here:
Using xarray
In xarray you can add them as coordinates. It is a bit tricky as the coordinates in the geo_coordinates.nc file are multidimensional as well.
A possible solution is the following:
import netCDF4
import xarray as xr
import matplotlib.pyplot as plt
# paths
root = r'C:<your_path>\S3B_OL_2_WFR____20201015.SEN3\chl_nn.nc' #set path to chl file
coor = r'C:<your_path>\S3B_OL_2_WFR____20201015.SEN3\geo_coordinates.nc' #set path to the coordinates file
# loading xarray datasets
ds = xr.open_dataset(root)
olci_geo_coords = xr.open_dataset(coor)
# extracting coordinates
lat = olci_geo_coords.latitude.data
lon = olci_geo_coords.longitude.data
# assign coordinates to the chl dataset (needs to refer to both the dimensions of our dataset)
ds = ds.assign_coords({"lon":(["rows","columns"], lon), "lat":(["rows","columns"], lat)})
# clip the image (add your own coordinates)
area_of_interest = ds.where((10 < ds.lon) & (ds.lon < 12) & (58 < ds.lat) & (ds.lat < 59), drop=True)
# simple plot with coordinates as axis
plt.figure(figsize=(15,15))
area_of_interest["CHL_NN"].plot(x="lon",y="lat")
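To make the clipping step less of a black box, here is a rough NumPy-only sketch of what the where(...) call does with 2-D coordinates. The 3 x 4 arrays are invented for illustration; a real scene is far larger, and lat/lon vary in both directions:

```python
import numpy as np

# Hypothetical 3x4 swath: lat/lon are 2-D, one value per (row, column) pixel.
lat = np.array([[58.8, 58.8, 58.8, 58.8],
                [58.5, 58.5, 58.5, 58.5],
                [57.9, 57.9, 57.9, 57.9]])
lon = np.array([[9.5, 10.5, 11.5, 12.5],
                [9.5, 10.5, 11.5, 12.5],
                [9.5, 10.5, 11.5, 12.5]])
chl = np.arange(12, dtype=float).reshape(3, 4)

# Same bounding-box test as in the xarray call above.
inside = (10 < lon) & (lon < 12) & (58 < lat) & (lat < 59)
clipped = np.where(inside, chl, np.nan)   # pixels outside the box become NaN

# drop=True additionally removes rows/columns where the condition is False everywhere.
rows = inside.any(axis=1)
cols = inside.any(axis=0)
trimmed = clipped[np.ix_(rows, cols)]
print(trimmed)
```

On a real swath the box is usually not aligned with the rows and columns, so some NaN pixels remain inside the trimmed array.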
Even simpler is to add them as variables in a new dataset:
# path to the folder
root = r'C:<your_path>\S3B_OL_2_WFR____20201015.SEN3\*.nc' #set path pattern matching all .nc files in the folder
# create a dataset by combining nc files (coordinates will become variables)
ds = xr.open_mfdataset(root,combine = 'by_coords')
But in this case when you plot the image or clip it you cannot use the coordinates directly.
Using snappy
In Python the snappy package is available; it is based on the SNAP toolbox (which is implemented in Java). Check: https://senbox.atlassian.net/wiki/spaces/SNAP/pages/19300362/How+to+use+the+SNAP+API+from+Python
Once it is installed (unfortunately snappy supports only Python 2.7, 3.3 or 3.4), you can use the available SNAP functions directly in Python to aggregate your satellite images and create weekly/monthly averages. You then do not need to merge the lon, lat netcdf file, as you will work on the xfdumanifest.xml and SNAP will take care of that.
This is an example. It performs aggregation as well (mean calculated on two chl nc files):
import snappy  # needed because the code below calls snappy.jpy and snappy.GPF
from snappy import ProductIO, WKTReader
from snappy import jpy
from snappy import GPF
from snappy import HashMap
# setting the aggregator method
aggregator_average_config = snappy.jpy.get_type('org.esa.snap.binning.aggregators.AggregatorAverage$Config')
agg_avg_chl = aggregator_average_config('CHL_NN')
# creating the hashmap to store the parameters
HashMap = snappy.jpy.get_type('java.util.HashMap')
parameters = HashMap()
#creating the aggregator array
aggregators = snappy.jpy.array('org.esa.snap.binning.aggregators.AggregatorAverage$Config', 1)
#adding my aggregators in the list
aggregators[0] = agg_avg_chl
# set parameters
# output directory
dir_out = 'level-3_py_dynamic.dim'
parameters.put('outputFile', dir_out)
# number of rows (directly linked with resolution)
parameters.put('numRows', 66792) # to have about 300 meters spatial resolution
# aggregators list
parameters.put('aggregators', aggregators)
# Region to clip the aggregation on
wkt="POLYGON ((8.923302175377243 59.55648108694149, 13.488748662344074 59.11388968719029,12.480488185001589 56.690625338725155, 8.212366327767503 57.12425256476263,8.923302175377243 59.55648108694149))"
geom = WKTReader().read(wkt)
parameters.put('region', geom)
# Source product path
path_15 = r"C:<your_path>\S3B_OL_2_WFR____20201015.SEN3\xfdumanifest.xml"
path_16 = r"C:\<your_path>\S3B_OL_2_WFR____20201016.SEN3\xfdumanifest.xml"
path = path_15 + "," + path_16
parameters.put('sourceProductPaths', path)
#result = snappy.GPF.createProduct('Binning', parameters, (source_p1, source_p2))
# create results
result = snappy.GPF.createProduct('Binning', parameters) #to be used with product paths specified in the parameters hashmap
print("results stored in: {0}".format(dir_out) )
I am quite new and interested in the topic and would be happy to hear your/other solutions!
I know this topic has been asked about before, but as I'm new to Python I couldn't fully understand how to do it, and I would appreciate an explanation.
I have an ndarray cube (a stack of images of the same location, with the same size and shape, which differ in the wavelength at which they were taken).
I want to convert this cube into a pandas dataframe in order to be able to iterate through specific rows.
I'm really confused by the large number of columns: I have 1024 columns in each image, and that confuses me when I need to index those images.
My end goal is to have the images in a dataframe structure, which may mean having some kind of image collection in which I can iterate over the rows of each image.
This is the code I have written so far:
import spectral.io.envi as envi
import matplotlib.pyplot as plt
import os
from spectral import *
import numpy as np
#Create the image path
#the path
img_path = r'N:\this\is\a\path\capture'
cali_path=r'N:\location\Image_Python'
#the specific file
img_file = 'emptyname_2019-08-13_11-05-46.hdr'
img_dark= 'DARKREF_emptyname_2019-08-13_11-05-46.hdr'
cali_hdr= 'Radiometric_1x1.hdr'
cali_img = 'Radiometric_1x1.cal'
img= envi.open(os.path.join(img_path,img_file)).load()
img_dark= envi.open(os.path.join(img_path,img_dark)).load()
img_cali= envi.open(os.path.join(cali_path,cali_hdr), image = os.path.join(cali_path,cali_img)).load()
cali_shape=img_cali.shape
dark_shape=img_dark.shape
img_shape=img.shape
print('shape image:',img_shape,'shape dark:',dark_shape,'calibration shape:',cali_shape)
wavelength=[float(i) for i in img.metadata['wavelength']]
#get the exposure time
tint=float(img.metadata['tint'])
print(tint)
#goal: need to subtract the dark reference from the DN image.
#step 1: for each column in the dark reference, calculate the mean. then subtract this mean line from the DN image.
#averaging along axis=0 calculates the mean over each whole column, so we get one row.
dark_1024=img_dark.mean(axis=0)
from numpy import asarray
import pandas as pd
img_np=asarray(img)
dark_np=asarray(img_dark)
cali_np=asarray(img_cali)
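A possible next step, sketched on small random cubes instead of the real ENVI files: subtract the mean dark line by broadcasting, then reshape the cube so that every pixel becomes one dataframe row and every band one column. This assumes the arrays are ordered (rows, columns, bands); the sizes, the random values and the band_0, band_1, ... column names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical small cubes in (rows, columns, bands) order; real ones would have 1024 columns.
img_np = rng.integers(100, 200, size=(5, 4, 3)).astype(float)
dark_np = rng.integers(0, 10, size=(2, 4, 3)).astype(float)

# Mean dark value per (column, band); broadcasting subtracts it from every image row.
dark_mean = dark_np.mean(axis=0)   # shape (4, 3)
corrected = img_np - dark_mean     # shape (5, 4, 3)

# One dataframe row per pixel, one column per band, indexed by (row, col).
n_rows, n_cols, n_bands = corrected.shape
index = pd.MultiIndex.from_product([range(n_rows), range(n_cols)], names=["row", "col"])
df = pd.DataFrame(corrected.reshape(-1, n_bands), index=index,
                  columns=[f"band_{b}" for b in range(n_bands)])

print(df.loc[2])  # all pixels of image row 2, one line per column position
```

With this layout df.loc[r] selects one image row and df.loc[(r, c)] a single pixel spectrum, so the 1024 columns never have to be spelled out as dataframe columns.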
There are 192 x 144 pixel images. They should be imported to a Python list so that the items in the list are NDArray instances. New dataframe should be created from the list and that dataframe should be given to Isomap. iso.fit(df) fails with the errors
array = array.astype(np.float64)
ValueError: setting an array element with a sequence.
I have spent more than one day trying to figure out how the NDArrays should be processed and the dataframe loaded with them. No luck. Any help would be appreciated.
import pandas as pd
from scipy import misc
import glob
from sklearn import manifold
samples = []
for filename in glob.glob('Datasets/ALOI/32/*.png'):
    img = misc.imread(filename, mode='I')
    samples.append(img)
df = pd.DataFrame.from_records(samples, coerce_float=True)
iso = manifold.Isomap(n_neighbors=6, n_components=3)
iso.fit(df)
If those are gray scale images from the ALOI, you probably want to treat each pixel's brightness as a feature. Therefore, you should flatten the img array with img.reshape(-1). The revised code follows:
import pandas as pd
from scipy import misc
import glob
from sklearn import manifold
samples = []
for filename in glob.glob('Datasets/ALOI/32/*.png'):
    img = misc.imread(filename, mode='I')
    # the following line changed
    samples.append(img.reshape(-1))
df = pd.DataFrame.from_records(samples, coerce_float=True)
iso = manifold.Isomap(n_neighbors=6, n_components=3)
iso.fit(df)
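To see why the flattening helps: scikit-learn estimators expect a 2-D table of shape (n_samples, n_features), whereas a list of 2-D images presumably leaves whole arrays inside the dataframe cells, which cannot be cast to float64. A small sketch with synthetic images (three constant-valued 144 x 192 arrays, invented for illustration) shows the shape that results after flattening:

```python
import numpy as np

# Three hypothetical grayscale images, 144 rows by 192 columns each.
images = [np.full((144, 192), fill, dtype=np.int32) for fill in (0, 1, 2)]

# Flatten each image to a 1-D feature vector, then stack into (n_samples, n_features).
X = np.stack([img.reshape(-1) for img in images]).astype(np.float64)
print(X.shape)  # one row per image, one column per pixel
```

An array like X (or a dataframe built from it) is the layout Isomap's fit method works with.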