Plotting Sentinel-5P data in xarray

I'm trying to plot a grid of air pollution data from a netCDF file in Python using xarray. However, I'm facing a couple of roadblocks.
To start off, here is the data that can be used to reproduce my code:
Data
When you try to import this data using xarray.open_dataset, you end up with a file that has zero coordinates or variables, and lots of attributes:
import xarray as xr
FILE_NAME = "test2.nc"  # I changed the name to make it shorter
xr.open_dataset(FILE_NAME)
So I created variables of the data and tried to import those into xarray:
prd='PRODUCT'
metdata = "METADATA"
lat= ds.groups[prd].variables['latitude']
lon= ds.groups[prd].variables['longitude']
no2 = ds.groups[prd].variables['nitrogendioxide_tropospheric_column']
scanline = ds.groups[prd].variables['scanline']
time = ds.groups[prd].variables['time']
ground_pixel = ds.groups[prd].variables['ground_pixel']
ds = xr.DataArray(no2,
                  dims=["time","x","y"],
                  coords={
                      "lon":(["time","x", "y"], lon)
                  }
                  # coords=[("time", time), ("x", scanline),("y", ground_pixel)]
                  )
As you can see above, I tried multiple ways of creating the coordinates, but I'm still getting an error. The data in this netCDF file is on an irregular grid, and I just want to be able to plot that accurately and quickly using xarray.
Does someone know how I can do this?
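One possible way to attach the 2D latitude/longitude arrays as coordinates is sketched below. It uses small synthetic arrays as stand-ins for the variables read from the PRODUCT group; the shapes and the dimension names scanline and ground_pixel are assumptions chosen to mirror the variable names in the question. For the real file, xr.open_dataset also accepts a group argument (for example group="PRODUCT"), which may make the manual construction unnecessary.

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for lat, lon and no2 from the PRODUCT group.
# Shapes are hypothetical; a real swath is far larger.
n_scan, n_pix = 4, 3
lat2d, lon2d = np.meshgrid(
    np.linspace(50.0, 53.0, n_scan),
    np.linspace(4.0, 6.0, n_pix),
    indexing="ij",
)
lat = lat2d[np.newaxis, :, :]  # shape (time, scanline, ground_pixel)
lon = lon2d[np.newaxis, :, :]
no2_values = np.random.default_rng(0).random((1, n_scan, n_pix))

# Every 2D coordinate must name the dimensions it spans.
dims = ("time", "scanline", "ground_pixel")
no2 = xr.DataArray(
    no2_values,
    dims=dims,
    coords={"lat": (dims, lat), "lon": (dims, lon)},
    name="no2",
)
```

With real data, something like no2.isel(time=0).plot(x="lon", y="lat") should then draw the swath on its irregular grid.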

Related

Optimize plane of array (POA) irradiance calculation using WRF (netCDF) data

I need to calculate the plane of array (POA) irradiance using python's pvlib package (https://pvlib-python.readthedocs.io/en/stable/). For this I would like to use the output data from the WRF model (GHI, DNI, DHI). The output data is in netCDF format, which I open using the netCDF4 package and then I extract the necessary variables using the wrf-python package.
With that I get an xarray.Dataset with the variables I will use. I then use the xarray.Dataset.to_dataframe() method to transform it into a pandas dataframe, and then I transform the dataframe into a numpy array using dataframe.values. Finally, I run a loop where each iteration calculates the POA for one grid point using the function irradiance.get_total_irradiance (https://pvlib-python.readthedocs.io/en/stable/auto_examples/plot_ghi_transposition.html).
That's the way I've been doing it so far, however I have over 160000 grid points in the WRF domain, the data is hourly and spans 365 days. This gives a very large amount of data. I believe if pvlib could work directly with xarray.dataset it could be faster. However, I could only do it this way, transforming the data into a numpy.array and looping through the rows. Could anyone tell me how I can optimize this calculation? Because the code I developed is very time-consuming.
If anyone can help me with this I would appreciate it. Maybe an improvement to the code, or another way to calculate the POA from the WRF data...
I'm providing the code I've built so far:
from pvlib import location
from pvlib import irradiance
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import netCDF4
import wrf
from tqdm import tqdm  # progress bar used in the loop further down
Getting WRF data
variaveis = ['T2',
'U10',
'V10',
'SWDDNI',
'SWDDIF',
'SWDOWN']
netcdf_data = netCDF4.Dataset('wrfout_d02_2003-11-01_00_00_00')
first = True
for v in variaveis:
    var = wrf.getvar(netcdf_data, v, timeidx=wrf.ALL_TIMES)
    if first:
        met_data = var
        first = False
    else:
        met_data = xr.merge([met_data, var])
met_data = xr.Dataset.reset_coords(met_data, ['XTIME'], drop=True)
met_data['T2'] = met_data['T2'] - 273.15
WS10 = (met_data['U10']**2 + met_data['V10']**2)**0.5
met_data['WS10'] = WS10
df = met_data[['SWDDIF',
               'SWDDNI',
               'SWDOWN',
               'T2',
               'WS10']].to_dataframe().reset_index().drop(columns=['south_north',
                                                                   'west_east'])
df.rename(columns={'SWDOWN': 'ghi',
                   'SWDDNI':'dni',
                   'SWDDIF':'dhi',
                   'T2':'temp_air',
                   'WS10':'wind_speed',
                   'XLAT': 'lat',
                   'XLONG': 'lon',
                   'Time': 'time'}, inplace=True)
df.set_index(['time'], inplace=True)
df = df[df.ghi>0]
df.index = df.index.tz_localize('America/Recife')
Function to get POA irradiance
def get_POA_irradiance(lon, lat, date, dni, dhi, ghi, tilt=10, surface_azimuth=0):
    site_location = location.Location(lat, lon, tz='America/Recife')
    # Get solar azimuth and zenith to pass to the transposition function
    solar_position = site_location.get_solarposition(times=date)
    # Use the get_total_irradiance function to transpose the GHI to POA
    POA_irradiance = irradiance.get_total_irradiance(
        surface_tilt = tilt,
        surface_azimuth = surface_azimuth,
        dni = dni,
        ghi = ghi,
        dhi = dhi,
        solar_zenith = solar_position['apparent_zenith'],
        solar_azimuth = solar_position['azimuth'])
    # Return DataFrame with only GHI and POA
    return pd.DataFrame({'lon': lon,
                         'lat': lat,
                         'GHI': ghi,
                         'POA': POA_irradiance['poa_global']}, index=[date])
Loop in each row (time) of the array
array = df.reset_index().values
list_poa = []
def loop_POA():
    for i in tqdm(range(len(array) - 1)):
        POA = get_POA_irradiance(lon=array[i,6],
                                 lat=array[i,7],
                                 dni=array[i,2],
                                 dhi=array[i,1],
                                 ghi=array[i,3],
                                 date=str(array[i,0]))
        list_poa.append(POA)
    return list_poa
# run the loop and concatenate the per-row results
poa_final = pd.concat(loop_POA())

Thanks both for a good question and for using pvlib! You're right that pvlib is intended for modeling single locations and is not designed for use with xarray datasets, although some functions might coincidentally work with them.
I strongly suspect that the majority of the runtime you're seeing is for the solar position calculations. You could switch to a faster method (see the method options here), as the default solar position method is very accurate but also quite slow when calculating bulk positions. Installing numba will help, but it still might be too slow for you, so you might check the other models (ephemeris, pyephem). There are also some fast but low-precision methods, but you will need to change your code a bit to use them. See the list under "Correlations and analytical expressions for low precision solar position calculations" here.
Like Michael Delgado suggests in the comments, parallel processing is an option. But that can be a headache in python. You will probably want multiprocessing, not multithreading.
Another idea is to use atlite, a python package designed for this kind of spatial modeling. But its solar modeling capabilities are not nearly as detailed as pvlib, so it might not be useful for your case.
One other note: I don't know if the WRF data are interval averages or instantaneous values, but if you care about accuracy you should handle them differently for transposition. See this example.
Edit to add: after looking at your code again, there might be another significant speedup to be had. Are you calling get_POA_irradiance for single combinations of position and timestamp? If so, that is unnecessary and very slow. It would be much faster to pass in the full time series for each location, i.e. scalar lat/lon but vector irradiance.

Is there a simple way of getting an xyz array from xarray dataset?

Is there a simple way of getting an array of xyz values (i.e. an array of 3 cols and nrows = number of pixels) from an xarray dataset? Something like what we get from the rasterToPoints function in R.
I'm opening a netcdf file with values for a certain variable (chl). I'm not able to add images here directly, but here is a screenshot of the output:
Xarray dataset structure
I need to end up with an array that has this structure:
[[lon1, lat1, val],
[lon1, lat2, val]]
And so on, getting the combination of lon/lat for each point. I'm sorry if I'm missing something really obvious, but I'm new to Python.

The natural format you are probably looking for here is a pandas dataframe, where lon, lat and chl are columns. This can be easily created using xarray's to_dataframe method, as follows.
import xarray as xr
ds = xr.open_dataset("infile.nc")
df = (
ds
.to_dataframe()
.reset_index()
)
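To get from that dataframe to the plain N x 3 array asked for, the columns can be selected in the desired order and converted with to_numpy. The sketch below uses a small synthetic dataset in place of infile.nc; the variable and coordinate names chl, lat and lon are taken from the question, and the values are made up.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the netCDF file: a 2 x 3 grid of chl values
ds = xr.Dataset(
    {"chl": (("lat", "lon"), np.arange(6, dtype=float).reshape(2, 3))},
    coords={"lat": [10.0, 20.0], "lon": [1.0, 2.0, 3.0]},
)

# Long-format table: one row per (lat, lon) pixel
df = ds.to_dataframe().reset_index()

# Array with columns lon, lat, value
xyz = df[["lon", "lat", "chl"]].to_numpy()
```

Here xyz has one row per pixel, which is roughly what rasterToPoints returns in R.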

Here is a small sketch:
import numpy as np
lons = ds['lon'].values
lats = ds['lat'].values
# chl is assumed to be 2D with dimensions (lat, lon)
chl = ds['chl'].values
xm,ym = np.meshgrid(lons,lats)
# one row per pixel: [lon, lat, value]
dataout = np.column_stack((xm.ravel(), ym.ravel(), chl.ravel()))
It might not work out of the box, but a solution could look similar to this.

Sentinel3 OLCI (chl) Average of netcdf files on Python

I'm having trouble getting a monthly average from Sentinel 3 images in... everything, really. Python, Matlab: two of us are stuck on this problem.
The main reason is that the information for these images is not in a single netCDF file, neatly stored with coordinates and products. Instead, each one-day folder contains several separate .nc files, each holding different information about one single satellite image. As I understand it, SNAP uses an xmlxs file to work with all of these separate .nc files.
Now, I thought it would be a good idea to merge and create/edit the .nc files so as to create a new daily .nc file that includes the chlorophyll, the coordinates and, while I'm at it, the time. Later on, I would merge these new files to be able to compute a monthly mean with xarray. At least that was my idea, but I can't get the first part to work. The solution might be obvious; here's what I tried, using the xarray module:
import os
import numpy as np
import xarray as xr
import netCDF4
from netCDF4 import Dataset
nc_folder = df_try.iloc[0] #folder where the image files are
#open dataset in xarray
nc_chl = xr.open_dataset(str(nc_folder['path']) + '/' + 'chl_nn.nc') #path to chlorophyll file
nc_chl
n_coord =xr.open_dataset(str(nc_folder['path'])+ '/'+ 'geo_coordinates.nc') #path to coordinates file
n_time = xr.open_dataset(str(nc_folder['path'])+ '/' + 'time_coordinates.nc') #path to time file
ds_grid = [[nc_chl], [n_coord], [n_time]]
combined = xr.combine_nested(ds_grid, concat_dim=[None, None])
combined #dataset with all but not recognizing coordinates
ds = combined.rename({'latitude': 'lat', 'longitude': 'lon', 'time_stamp' : 'time'}).set_coords(['lon', 'lat', 'time']) #dataset recognizing coordinates as coordinates
ds
which gives a dataset with
Dimensions: columns 4865 rows: 4091
3 coordinates (lat, lon and time) and the chl variable.
Now, it doesn't save to netCDF4 (I tried, but there was an error), and I was also wondering whether anyone knows of another way to compute an average. I have images from three years (from the beginning of 2017 to the end of 2019) that I need to average in different ways (monthly, seasonally...). My main problem at the moment is that the chlorophyll values are stored separately from the geographical coordinates, so working with the chlorophyll files alone would not work and would just make a mess.
Any suggestions?

Two options here:
Using xarray
In xarray you can add them as coordinates. It is a bit tricky as the coordinates in the geo_coordinates.nc file are multidimensional as well.
A possible solution is the following:
import netCDF4
import xarray as xr
import matplotlib.pyplot as plt
# paths
root = r'C:<your_path>\S3B_OL_2_WFR____20201015.SEN3\chl_nn.nc' #set path to chl file
coor = r'C:<your_path>\S3B_OL_2_WFR____20201015.SEN3\geo_coordinates.nc' #set path to the coordinates file
# loading xarray datasets
ds = xr.open_dataset(root)
olci_geo_coords = xr.open_dataset(coor)
# extracting coordinates
lat = olci_geo_coords.latitude.data
lon = olci_geo_coords.longitude.data
# assign coordinates to the chl dataset (needs to refer to both the dimensions of our dataset)
ds = ds.assign_coords({"lon":(["rows","columns"], lon), "lat":(["rows","columns"], lat)})
# clip the image (add your own coordinates)
area_of_interest = ds.where((10 < ds.lon) & (ds.lon < 12) & (58 < ds.lat) & (ds.lat < 59), drop=True)
# simple plot with coordinates as axis
plt.figure(figsize=(15,15))
area_of_interest["CHL_NN"].plot(x="lon",y="lat")
Even simpler is to add them as variables in a new dataset:
# path to the folder
root = r'C:<your_path>\S3B_OL_2_WFR____20201015.SEN3\*.nc' #glob pattern matching all .nc files in the folder
# create a dataset by combining nc files (coordinates will become variables)
ds = xr.open_mfdataset(root,combine = 'by_coords')
But in this case when you plot the image or clip it you cannot use the coordinates directly.
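The first approach (assigning the 2D latitude/longitude arrays as coordinates and clipping with where) can be tried without the Sentinel-3 files. The sketch below is a toy version with a made-up 4 x 5 grid; the names CHL_NN, rows and columns follow the code above, while the coordinate values are invented for illustration.

```python
import numpy as np
import xarray as xr

# Made-up 2D coordinate arrays standing in for geo_coordinates.nc
rows, cols = 4, 5
lat2d, lon2d = np.meshgrid(
    np.linspace(59.5, 58.0, rows),  # 59.5, 59.0, 58.5, 58.0
    np.linspace(9.5, 11.5, cols),   # 9.5, 10.0, 10.5, 11.0, 11.5
    indexing="ij",
)

# Made-up chlorophyll values standing in for chl_nn.nc
ds = xr.Dataset(
    {"CHL_NN": (("rows", "columns"), np.arange(rows * cols, dtype=float).reshape(rows, cols))}
)

# Attach the 2D coordinates; each one names both dimensions it spans
ds = ds.assign_coords(
    lon=(("rows", "columns"), lon2d),
    lat=(("rows", "columns"), lat2d),
)

# Clip to a bounding box; drop=True removes rows/columns that are entirely outside
area_of_interest = ds.where(
    (10 < ds.lon) & (ds.lon < 12) & (58 < ds.lat) & (ds.lat < 59), drop=True
)
```

Only the pixels strictly inside the box survive, so the clipped dataset keeps a single row and three columns here.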
Using snappy
In Python, the snappy package is available; it is based on the SNAP toolbox (which is implemented in Java). Check: https://senbox.atlassian.net/wiki/spaces/SNAP/pages/19300362/How+to+use+the+SNAP+API+from+Python
Once it is installed (unfortunately snappy supports only Python 2.7, 3.3 or 3.4), you can use the available SNAP functions directly from Python to aggregate your satellite images and create weekly/monthly averages. You then do not need to merge the lon/lat netCDF files yourself, as you will work on the xfdumanifest.xml and SNAP will take care of that.
This is an example. It performs aggregation as well (mean calculated on two chl nc files):
import snappy  # needed for the snappy.jpy and snappy.GPF calls below
from snappy import ProductIO, WKTReader
from snappy import jpy
from snappy import GPF
from snappy import HashMap
# setting the aggregator method
aggregator_average_config = snappy.jpy.get_type('org.esa.snap.binning.aggregators.AggregatorAverage$Config')
agg_avg_chl = aggregator_average_config('CHL_NN')
# creating the hashmap to store the parameters
HashMap = snappy.jpy.get_type('java.util.HashMap')
parameters = HashMap()
#creating the aggregator array
aggregators = snappy.jpy.array('org.esa.snap.binning.aggregators.AggregatorAverage$Config', 1)
#adding my aggregators in the list
aggregators[0] = agg_avg_chl
# set parameters
# output directory
dir_out = 'level-3_py_dynamic.dim'
parameters.put('outputFile', dir_out)
# number of rows (directly linked with resolution)
parameters.put('numRows', 66792) # to have about 300 meters spatial resolution
# aggregators list
parameters.put('aggregators', aggregators)
# Region to clip the aggregation on
wkt="POLYGON ((8.923302175377243 59.55648108694149, 13.488748662344074 59.11388968719029,12.480488185001589 56.690625338725155, 8.212366327767503 57.12425256476263,8.923302175377243 59.55648108694149))"
geom = WKTReader().read(wkt)
parameters.put('region', geom)
# Source product path
path_15 = r"C:<your_path>\S3B_OL_2_WFR____20201015.SEN3\xfdumanifest.xml"
path_16 = r"C:\<your_path>\S3B_OL_2_WFR____20201016.SEN3\xfdumanifest.xml"
path = path_15 + "," + path_16
parameters.put('sourceProductPaths', path)
#result = snappy.GPF.createProduct('Binning', parameters, (source_p1, source_p2))
# create results
result = snappy.GPF.createProduct('Binning', parameters) #to be used with product paths specified in the parameters hashmap
print("results stored in: {0}".format(dir_out) )
I am quite new and interested in the topic and would be happy to hear your/other solutions!

How to grid data points onto a shapefile (grid)?

So I am working with a shapefile and a csv file of data points. I have used a raster function to convert the shapefile to a grid based on latitudes/longitudes. Now I need to put data points onto the same grid so that I can see where they fall in comparison to the "shape" produced by the other file. I have opened the csv file using the below code, and removed where the latitude/longitudes are null and made two new numpy arrays of lat/lons. But now I am confused about the next step, so if anyone has done anything similar, how should I proceed?
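import csv
import numpy as np
# `places` is the path to the csv file; text mode is what the csv module expects in Python 3
with open(places, newline='') as f:
    x = list(csv.reader(f, delimiter=','))
List1 = zip(*x)
dataDict1 = {}
for column in List1:
    # first entry of each column is its header
    dataDict1[column[0]] = column[1:]
lats = np.array(dataDict1['Latitude'])
lons = np.array(dataDict1['Longitude'])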

Runs out of memory when plotting, Python

I'm retrieving a large number of data from a database, which I later plot using a scatterplot. However, I run out of memory, and the program aborts when I am using my full data. Just for the record it takes >30 minutes to run this program, and the length of the data list is about 20-30 million.
import sqlite3 as lite  # assumed: the `lite` alias is not defined in the original snippet
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
map = Basemap(projection='merc',
              resolution = 'c', area_thresh = 10,
              llcrnrlon=-180, llcrnrlat=-75,
              urcrnrlon=180, urcrnrlat=82)
map.drawcoastlines(color='black')
# map.fillcontinents(color='#27ae60')
with lite.connect('database.db') as con:
    start = 1406851200
    end = 1409529600
    cur = con.cursor()
    cur.execute('SELECT latitude, longitude FROM plot WHERE unixtime >= {start} AND unixtime < {end}'.format(start = start, end = end))
    data = cur.fetchall()
y,x = zip(*data)
x,y = map(x,y)
plt.scatter(x,y, s=0.05, alpha=0.7, color="#e74c3c", edgecolors='none')
plt.savefig('Plot.pdf')
plt.savefig('Plot.png')
I think my problem may be in the zip(*) function, but I really have no clue. I'm interested both in how I can save memory by rewriting my existing code and in how to split up the plotting process. My idea is to split the time period in half, then do the same thing twice for the two time periods before saving the figure; however, I am unsure whether this will help at all. If the problem is the plotting itself, I have no idea what to do.

If you think the problem lies in the zip function, why not use a numpy array to massage your data into the right format? Something like this:
data = numpy.array(cur.fetchall())
lat = data[:,0]
lon = data[:,1]
x,y = map(lon, lat)
Also, your generated PDF will be very large and slow to render by the various PDF readers because it is a vectorized format by default. All your millions of data points will be stored as floats and rendered when the user opens the document. I recommend that you add the rasterized=True argument to your plt.scatter() call. This will save the result as a bitmap inside your PDF (see the docs here)
If this all doesn't help, I would investigate further by commenting out lines starting at the back. That is, first comment out plt.savefig('Plot.png') and see if the memory use goes down. If not, comment out the line before that, etc.
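A minimal sketch of the rasterized scatter suggested above, using random points instead of the database query and plain matplotlib axes instead of Basemap (both substitutions are assumptions made to keep the example self-contained):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Random (latitude, longitude) pairs standing in for cur.fetchall()
rng = np.random.default_rng(0)
data = np.column_stack(
    (rng.uniform(-75, 82, 10_000), rng.uniform(-180, 180, 10_000))
)
lat = data[:, 0]
lon = data[:, 1]

fig, ax = plt.subplots()
# rasterized=True embeds the points as a bitmap in vector outputs such as PDF
points = ax.scatter(lon, lat, s=0.05, alpha=0.7, color="#e74c3c",
                    edgecolors="none", rasterized=True)

out_path = os.path.join(tempfile.mkdtemp(), "Plot.pdf")
fig.savefig(out_path, dpi=300)  # dpi sets the resolution of the embedded bitmap
plt.close(fig)  # release the figure's memory
```

Keeping the data in a single numpy array also avoids building the two large tuples that zip(*data) creates.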
