Extracting data from multiple NetCDF files from multiple locations at specific date - python

I am trying to extract, from multi-temporal netCDF files, the value of the pixel at a specific location and time.
Each file is named T2011, T2012, and so on until T2017.
Each file contains 365 layers; each layer corresponds to one day of the year and holds the temperature for that day.
My goal is to extract information according to my input dataset.
I have a csv (locd.csv) with my targets, and it looks like this:
id  lat          lon          DateFin   DateCount
1   46.63174271  7.405986324  02-02-18  43,131
2   46.64972969  7.484352537  25-01-18  43,123
3   47.27028727  7.603811832  20-01-18  43,118
4   46.99994455  7.063905466  05-02-18  43,134
5   47.08125481  7.19501811   20-01-18  43,118
6   47.37833814  7.432005368  11-12-18  43,443
7   47.43230354  7.445253182  30-12-18  43,462
8   46.73777711  6.777871255  09-04-18  43,197
69  47.42285191  7.113934735  09-04-18  43,197
The id is the location I am interested in, lat and lon are its latitude and longitude, DateFin is the date for which I want the temperature at that location, and DateCount is the number of days from 01-01-1900 to that date (that's how the layers are indexed in the files).
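As a sanity check, the DateCount values line up with plain date arithmetic in Python; a minimal snippet:

from datetime import date

# Days elapsed since 1900-01-01, matching the DateCount column
print((date(2018, 2, 2) - date(1900, 1, 1)).days)  # 43131, as in the row with id 1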
To extract those values, I have something like this:
import glob
import os
from datetime import date

import numpy as np
import pandas as pd
from netCDF4 import Dataset

# Record all the years of the netCDF files into a Python list
all_years = []
for file in glob.glob('*.nc'):
    print(file)
    data = Dataset(file, 'r')
    time = data.variables['time']  # that's how the days are stored
    year = file[0:4]
    all_years.append(year)

# define my input data
cities = pd.read_csv('locd.csv')

# extracting the data
for index, row in cities.iterrows():
    id_row = row['id']  # id from the database
    location_latitude = row['lat']
    location_longitude = row['lon']
    location_date = row['DateCount']  # number of days counting since 1900-01-01

    # Sorting the all_years python list
    all_years.sort()
    for yr in all_years:
        # Reading in the data
        data = Dataset(str(yr) + '.nc', 'r')
        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]
        # Squared difference between the specified lat/lon and the lat/lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2
        # Identify the index of the minimum value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()
        # Accessing the precipitation data
        prec = data.variables['precipi']  # that's how the variable is called
        for p_index in np.arange(0, len(location_date)):
            print('Recording the value for ' + id_row + ': ' + str(location_date[p_index]))
            df.loc[id_row[location_date]]['Precipitation'] = prec[location_date, min_index_lat, min_index_lon]

# to record it in a new archive
df.to_csv('locationNew.csv')
My issues:
I can't get it to work; every time I fix one thing a new error appears, and right now it says that "id_row" must be a string.
Does anybody have a hint, or experience working with this type of file?
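For what it's worth, the "must be a string" error most likely comes from concatenating the integer id read from the csv with a string; wrapping it in str() avoids that. Beyond that, here is a minimal reworked sketch of the loop. It is untested and assumes files named T2011.nc ... T2017.nc with one layer per day of the year, the variable name 'precipi' from the code above, and DateCount values carrying a thousands separator (note the sample dates fall in 2018, so a matching T2018 file would be needed for them):

from datetime import date, timedelta

import pandas as pd
from netCDF4 import Dataset

cities = pd.read_csv('locd.csv')
records = []

for _, row in cities.iterrows():
    # Strip the thousands separator shown in the question, e.g. '43,131' -> 43131
    day_count = int(str(row['DateCount']).replace(',', ''))
    target = date(1900, 1, 1) + timedelta(days=day_count)

    # Zero-based index of the target day within that year's file
    layer = day_count - (date(target.year, 1, 1) - date(1900, 1, 1)).days

    data = Dataset('T{}.nc'.format(target.year), 'r')
    lat = data.variables['lat'][:]
    lon = data.variables['lon'][:]

    # Nearest grid point to the requested location
    i_lat = ((lat - row['lat'])**2).argmin()
    i_lon = ((lon - row['lon'])**2).argmin()

    value = data.variables['precipi'][layer, i_lat, i_lon]
    data.close()

    print('Recording the value for ' + str(row['id']) + ': ' + str(day_count))
    records.append({'id': row['id'], 'Temperature': float(value)})

pd.DataFrame(records).to_csv('locationNew.csv', index=False)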

Related

How to save two variables in the same csv file with different time steps

I am trying to save time series of two variables that have different forecast steps. How can I modify the code below so that both variables, with their different time steps, are saved in the same csv file? One of them starts its cycle at 000 h of the forecast and the other at 003 h.
When I try to save, the following error occurs: IndexError: index 112 is out of bounds for axis 0 with size 112, because the other variable has 114 time steps.
import datetime
import os

import pandas as pd

lat = GFS.variables['latitude'][:]
lon = GFS.variables['longitude'][:]
times = GFS['valid_time'][:]
time_cycle = radiation['valid_time'][:]
unit = GFS['time'].units
step = GFS['step']

for key, value in stations.iterrows():
    #print(key, value[0], value[1], value[2])
    station = value[0]
    file_name = "{}{}".format(station, ".csv")
    #print(file_name)
    lon_point = value[1]
    lat_point = value[2]
    ########################################
    # Find the latitude/longitude grid point closest to each station
    # Squared difference of lat and lon
    sq_diff_lat = (lat - lat_point)**2
    sq_diff_lon = (lon - lon_point)**2
    # Identifying the index of the minimum value for lat and lon
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()
    print("Generating time series for station {}".format(station))
    ref_date = datetime.datetime(int(unit[14:18]), int(unit[19:21]), int(unit[22:24]), int(unit[25:27]))
    date_range = list()
    step_data = list()
    rad_data = list()
    pblh_data = list()
    for index, time in enumerate(times):
        date_time = ref_date + datetime.timedelta(seconds=int(time))
        date_range.append(date_time)
        step_data.append(step[index].values)
        pblh_data.append(hpbl[index, min_index_lat, min_index_lon].values)
    for index_rad, time_rad in enumerate(time_cycle):
        rad_data.append(radiation[index_rad, min_index_lat, min_index_lon].values)
    #print(date_range)
    df = pd.DataFrame(date_range, columns=["Date-Time"])
    df = df.set_index(["Date-Time"])
    df["Forecast ({})".format('valid time')] = step_data
    df["RAD ({})".format('W m**-2')] = rad_data
    df["PBLH ({})".format('m')] = pblh_data
    print("The following time series is being saved as .csv files")
    df.to_csv(os.path.join(dir_out, file_name), sep=';', encoding="utf-8", index=True)
    #df.to_parquet(os.path.join(dir_out, file_name),
    #              engine='auto',
    #              compression='default',
    #              write_index=True,
    #              overwrite=True,
    #              append=False)

print("\n!!Successfully saved all the time series to the output directory!!\n{}".format(dir_out))
That is, the PBLH variable has 114 time steps while the RAD variable has 112, but I would like to save both in the same csv file. How should I modify the time loop (PBLH) and the time_cycle loop (RAD) to put them in the same csv?
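One way to handle the mismatch (a sketch, untested): give each variable its own DataFrame indexed by its datetimes and outer-join them, so the two steps missing from RAD simply become NaN. Here rad_dates is a hypothetical list built from time_cycle the same way date_range is built from times:

import pandas as pd

pblh_df = pd.DataFrame({"PBLH (m)": pblh_data}, index=pd.DatetimeIndex(date_range))
rad_df = pd.DataFrame({"RAD (W m**-2)": rad_data}, index=pd.DatetimeIndex(rad_dates))

# The outer join aligns the two series on time; steps present in only one become NaN
merged = pblh_df.join(rad_df, how="outer")
merged.to_csv(file_name, sep=";")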

Truncate time series files and extract some descriptive variables

I have two major problems, and I can't picture the solution in Python. Let me explain the context.
On the one hand I have a dataset containing date points with IDs (1 ID = 1 patient), like this:
ID    Date point
0001  25/12/2022 09:00
0002  29/12/2022 16:00
0003  30/12/2022 18:00
...
And on the other hand, I have a folder with many text files containing the time series, like this:
0001.txt
0002.txt
0003.txt
...
The files all have the same structure: the ID (the same as in the dataset) is in the name of the file, and inside, each file looks like this (the first column contains the date, the second the value):
25/12/2022 09:00 155
25/12/2022 09:01 156
25/12/2022 09:02 157
25/12/2022 09:03 158
...
1/ I would like to truncate the text files and keep only the values within 48 hours of the dataset's Date point.
2/ For some statistical analysis, I want to take summary values such as the mean or the maximum of these values and add them to a dataframe like this:
ID    Mean  Maximum
0001
0002
0003
...
I know that for you this will be a trivial problem, but for me (a beginner in Python) it is a challenge!
Thank you, everybody.
You could do something along these lines using pandas (I've not been able to test this fully):
import pandas as pd
from pathlib import Path

# I'll create a limited version of your initial table
data = {
    "ID": ["0001", "0002", "0003"],
    "Date point": ["25/12/2022 09:00", "29/12/2022 16:00", "30/12/2022 18:00"]
}

# put it in a Pandas DataFrame
df = pd.DataFrame(data)

# convert the "Date point" column to datetime objects (the dates are day-first)
df["Date point"] = pd.to_datetime(df["Date point"], dayfirst=True)

# provide the path to the folder containing the files
folder = Path("/path_to_files")

# an empty dictionary that you'll fill with the required statistical info
newdata = {"ID": [], "Mean": [], "Maximum": []}

# loop through the IDs and read in the files
for i, date in zip(df["ID"], df["Date point"]):
    inputfile = folder / f"{i}.txt"  # construct the file name
    if inputfile.exists():
        # read in the file
        subdata = pd.read_csv(
            inputfile,
            sep=r"\s+",            # columns are separated by spaces
            header=None,           # there's no header information
            parse_dates=[[0, 1]],  # combine the first and second columns into datetime objects
            dayfirst=True,
        )
        # get the values within 48 hours after the current date point
        td = pd.Timedelta(value=48, unit="hours")
        mask = (subdata["0_1"] > date) & (subdata["0_1"] <= date + td)
        # add in the required info
        newdata["ID"].append(i)
        newdata["Mean"].append(subdata[2].loc[mask].mean())
        newdata["Maximum"].append(subdata[2].loc[mask].max())

# put newdata into a DataFrame
dfnew = pd.DataFrame(newdata)
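The summary table can then be written out in the usual way, e.g.:

dfnew.to_csv("summary_stats.csv", index=False)  # hypothetical output name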

Plot based on different dates

I've written the following code, which plots the bpm against the hours contained in the input csv, for a given date.
import pandas as pd
import matplotlib.pyplot as plt

dfMonday['date'] = pd.to_datetime(dfMonday['date'])
df_temp = dfMonday.loc[dfMonday['date'] == '2021/04/26']
bpmMon = df_temp.tempo
hMon = df_temp.time
meanBpmMon = bpmMon.mean()

second = plt.figure(figsize=(10, 5))
plt.title('HOURS - BPM (MONDAY - 26/04/2021)')
plt.scatter(hMon, bpmMon, c=bpmMon)
plt.xticks(rotation=45)
The CSV contains other dates, all different from each other, which refer to other days of listening to music. What I would like to do is create more charts based on the dates contained in the csv: if there are n distinct dates, I would like to output n graphs, one per date.
Csv file have the following structure:
artist_name;ms_played;track_name;...date;time;week_day
Taylor Swift;35260;Wildest Dreams;...;2021-01-25;07:55;0
Edward Sharpe & The Magnetic Zeros;...2021-01-25;15:34;0
Kanye West; 127964; ...; 2021-02-21;08:08;0
Billie Eilish; 125412; ...; 2021-15-2; 15:02; 0
......
As you can see from the date column, I have several dates inside the csv
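A sketch of one way to do this (untested; it reuses the tempo and time columns from the snippet above and assumes a hypothetical file name): group the rows by date and draw one figure per group.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('listening.csv', sep=';')  # hypothetical file name
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # malformed dates become NaT

for day, group in df.groupby('date'):
    plt.figure(figsize=(10, 5))
    plt.title('HOURS - BPM ({})'.format(day.date()))
    plt.scatter(group['time'], group['tempo'], c=group['tempo'])
    plt.xticks(rotation=45)
    plt.savefig('bpm_{}.png'.format(day.date()))  # one chart per date
    plt.close()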

How do I create a Mean Annual Rainfall table for various durations from a NetCDF4 using Python?

I have downloaded a NetCDF4 file of total hourly precipitation across Sierra Leone from 1974 to present, and have started writing code to analyze it.
I'm trying to build a table in Python that displays the average annual rainfall for different rainfall durations, rather like the one below:
I'm wondering if anyone has done anything similar before and could possibly help me out, as I'm very new to programming?
Here is the script I've written so far, which records the hourly data for each year. From here I need to find a way to store this information in a table, then change the duration to, say, 2 hours, and repeat until I have a complete table:
import glob

import numpy as np
import pandas as pd
import xarray as xr
from netCDF4 import Dataset

all_years = []
for file in glob.glob('*.nc'):
    data = Dataset(file, 'r')
    time = data.variables['time']
    year = time.units[11:16]
    all_years.append(year)

year_start = '01-01-1979'
year_end = '31-12-2021'
date_range = pd.date_range(start=str(year_start),
                           end=str(year_end),
                           freq='H')
df = pd.DataFrame(0.0, columns=['tp'], index=date_range)

lat_freetown = 8.4657
lon_freetown = 13.2317

all_years.sort()
for yr in range(1979, 2021):
    data = Dataset('era5_year' + str(yr) + '.nc', 'r')
    lat = data.variables['latitude'][:]
    lon = data.variables['longitude'][:]
    sq_diff_lat = (lat - lat_freetown)**2
    sq_diff_lon = (lon - lon_freetown)**2
    min_index_lat = sq_diff_lat.argmin()
    min_index_lon = sq_diff_lon.argmin()
    tp = data.variables['tp']
    start = str(yr) + '-01-01'
    end = str(yr) + '-12-31'
    d_range = pd.date_range(start=start, end=end, freq='H')
    for t_index in np.arange(0, len(d_range)):
        print('Recording the value for: ' + str(d_range[t_index]) + str(tp[t_index, min_index_lat, min_index_lon]))
        # .loc with a (row, column) pair avoids chained indexing, which doesn't assign
        df.loc[d_range[t_index], 'tp'] = tp[t_index, min_index_lat, min_index_lon]
I gave this a try, I hope it helps.
I downloaded two years of coarse US precip data here:
https://downloads.psl.noaa.gov/Datasets/cpc_us_hour_precip/precip.hour.2000.nc
https://downloads.psl.noaa.gov/Datasets/cpc_us_hour_precip/precip.hour.2001.nc
import pandas as pd
import xarray as xr

# Read two datasets and concatenate them so there are multiple years of hourly data
precip_full1 = xr.open_dataset('precip.hour.2000.nc') * 25.4
precip_full2 = xr.open_dataset('precip.hour.2001.nc') * 25.4
precip_full = xr.concat([precip_full1, precip_full2], dim='time')

# Select only the western half of the US
precip = precip_full.where(precip_full.lon < 257, drop=True)

# Initialize the output
output = []

# Select the numbers of hours to sum over
# This assumes that the data is hourly
intervals = [1, 2, 6, 12, 24]

# Loop through each desired interval
for interval in intervals:
    # Take a rolling sum
    # This means the value at any time is the sum of the preceding times
    # So when interval is 6, it's the sum of the previous six values
    roll = precip.rolling(time=interval, center=False).sum()
    # Take the annual mean and average over all space
    annual = roll.groupby('time.year').mean('time').mean(['lat', 'lon'])
    # Convert the output to a pandas dataframe
    # and rename the column to correspond to the interval length
    tab = annual.to_dataframe().rename(columns={'precip': str(interval)})
    # Keep track of the output by appending it to the output list
    output.append(tab)

# Combine the dataframes into one, by columns
output = pd.concat(output, axis=1)
The output looks like this:
1 2 6 12 24
year
2000 0.014972 0.029947 0.089856 0.179747 0.359576
2001 0.015610 0.031219 0.093653 0.187290 0.374229
Again, this assumes the data is already hourly. It also averages over every (for example) 6-hour window, so it's not just 00:00-06:00, 06:00-12:00, etc., but 00:00-06:00, 01:00-07:00, etc., before taking the annual mean. If you wanted the former, you could use xarray's resample function instead of the rolling sum.
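For the non-overlapping case, a sketch of the resample variant (untested, using the same precip object as above):

# Non-overlapping 6-hour windows (00:00-06:00, 06:00-12:00, ...) instead of a rolling sum
six_hour_totals = precip.resample(time='6h').sum()
annual_6h = six_hour_totals.groupby('time.year').mean('time').mean(['lat', 'lon'])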

Convert emission inventory (.csv) to (.nc4) - 2D Variable Issue

Thank you in advance for assistance with this issue. I have been working tirelessly to get this working, but as a non-expert I've found myself stuck with the following issue.
Goal: I would like to convert a gridded emission inventory in .csv format to netCDF (.nc4). I am able to do so with the framework below, but I am unable to specify that 'Out_tonnes_per_year_by_cell' is a plottable 2D variable referencing lat and lon.
Here is all of the information I believe you'll need:
Metadata for .csv file:
s1
Out[2]:
Long Lat Out_tonnes_per_year_by_cell
0 -179.5 -89.5 0.0
1 -178.5 -89.5 0.0
2 -177.5 -89.5 0.0
3 -176.5 -89.5 0.0
4 -175.5 -89.5 0.0
... ... ...
64795 175.5 89.5 0.0
64796 176.5 89.5 0.0
64797 177.5 89.5 0.0
64798 178.5 89.5 0.0
64799 179.5 89.5 0.0
[64800 rows x 3 columns]
Body of Python Script:
# Fetches internal data of target .csv (Excel) or .xlsx (Excel) files and writes to .nc (NetCDF)
# By doing so, this will allow for the import of key climate variable data directly into HEMCO within GEOS-Chem GCHP
# Key Module Import(s)
import numpy as np
import netCDF4 as nc
import pandas as pd
import xarray as xr
# Key Variable(s)
kw = dict(sep = r'\s*', parse_dates = {'dates':[0,1]}, header = None, index_col = 0, squeeze = True, engine = 'python')
GEO_C2H6_Dir = ('E:/PSU/CCAR REU/Emission(s) Inventories (GEOS-CHEM)/Simulation Data/Geo-CH4_emission_grid_files (Geologic)/Gridded Geologic C2H6 - Emissions Inventory/') # Geologic C2H6 Data Directory
# Load csv file into Python
GEO = (GEO_C2H6_Dir + 'Total_geoC2H6_output_2018.csv') # Location of Global Geologic C2H6 Emission Data
# Read into Pandas Series
s1 = pd.read_csv(GEO, sep = ",", skipinitialspace = True)
# Name of Each Pandas Series
s1.name = 'GEOCH4_Total'
# Concatenate Pandas Series into an Aggregated Pandas DataFrame
df1 = pd.concat([s1], axis = 1)
# Create Xarray Dataset from Pandas DataFrame
xds1 = xr.Dataset(df1)
# Addition of Variable Attribute Metadata
xds1['Long'].attrs = {'units':'1', 'Long_Name':'Longitudinal coordinate in decimal degrees'}
xds1['Lat'].attrs = {'units':'1', 'Long_Name':'Latitudinal coordinate in decimal degrees'}
xds1['Out_tonnes_per_year_by_cell'].attrs = {'units':'1', 'Long_Name':'Total Output of Cumulative Methane in units of Tonnes per Year in each Cell'}
# Addition of Global Attribute Metadata
xds1.attrs = {'Conventions':'CF-1.0', 'Title':'Total_GEOCH4_Output_2018', 'summary':'Total output of Cumulative Geologic Emitted Methane from 2018'}
# Save to Output NetCDF
xds1.to_netcdf('E:/PSU/CCAR REU/Emission(s) Inventories (GEOS-CHEM)/Simulation Data/Geo-CH4_emission_grid_files (Geologic)/Gridded Geologic C2H6 - Emissions Inventory/Total_GEOCH4_Output_2018.nc')
Output (attached as screenshots):
1. Variable type comparison between the output file and an example emission inventory
2. 2D variable information
3. 'Out_tonnes_per_year_by_cell' variable information
My hunch is that it has to do with the different format(s). At this point, I am unsure what I can do to correct for this. ANY help would be greatly appreciated. Once again, thanks in advance and kind regards!
You can probably solve this by replacing your lines setting up the xarray dataset, as follows:
df1 = df1.set_index(["Lat", "Long"])
xds1 = xr.Dataset(df1)
This will ensure that Lat/Long are the coordinates/dimensions of the xarray dataset, and this should guarantee you have a valid netCDF file in the end.
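Put together, the tail of the script would then look roughly like this (a sketch, untested, reusing the names from the question; DataFrame.to_xarray is an equivalent way to build the Dataset from the indexed frame):

# Index by the coordinate columns so they become dimensions
df1 = df1.set_index(["Lat", "Long"])
xds1 = df1.to_xarray()  # 'Out_tonnes_per_year_by_cell' is now a 2D (Lat, Long) variable
xds1.to_netcdf("Total_GEOCH4_Output_2018.nc")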
