Convert emission inventory (.csv) to (.nc4) - 2D Variable Issue - python

Thank you in advance for any assistance. I have been working on this for a while, but as a non-expert I've found myself stuck on the following issue.
Goal: I would like to convert a gridded emission inventory in .csv format to netCDF (.nc4). I am able to do so with the framework below, but I am unable to make 'Out_tonnes_per_year_by_cell' a plottable 2D variable referencing lat and lon.
Here is all of the information I believe you'll need:
Metadata for .csv file:
s1
Out[2]:
Long Lat Out_tonnes_per_year_by_cell
0 -179.5 -89.5 0.0
1 -178.5 -89.5 0.0
2 -177.5 -89.5 0.0
3 -176.5 -89.5 0.0
4 -175.5 -89.5 0.0
...      ...    ...    ...
64795 175.5 89.5 0.0
64796 176.5 89.5 0.0
64797 177.5 89.5 0.0
64798 178.5 89.5 0.0
64799 179.5 89.5 0.0
[64800 rows x 3 columns]
Body of Python Script:
# Fetches internal data of target .csv (Excel) or .xlsx (Excel) files and writes to .nc (NetCDF)
# By doing so, this will allow for the import of key climate variable data directly into HEMCO within GEOS-Chem GCHP
# Key Module Import(s)
import numpy as np
import netCDF4 as nc
import pandas as pd
import xarray as xr
# Key Variable(s)
kw = dict(sep = r'\s*', parse_dates = {'dates':[0,1]}, header = None, index_col = 0, squeeze = True, engine = 'python') # preset read_csv keyword arguments (not actually used below)
GEO_C2H6_Dir = ('E:/PSU/CCAR REU/Emission(s) Inventories (GEOS-CHEM)/Simulation Data/Geo-CH4_emission_grid_files (Geologic)/Gridded Geologic C2H6 - Emissions Inventory/') # Geologic C2H6 Data Directory
# Load csv file into Python
GEO = (GEO_C2H6_Dir + 'Total_geoC2H6_output_2018.csv') # Location of Global Geologic C2H6 Emission Data
# Read into Pandas Series
s1 = pd.read_csv(GEO, sep = ",", skipinitialspace = True)
# Name of Each Pandas Series
s1.name = 'GEOCH4_Total'
# Concatenate Pandas Series into an Aggregated Pandas DataFrame
df1 = pd.concat([s1], axis = 1)
# Create Xarray Dataset from Pandas DataFrame
xds1 = xr.Dataset(df1)
# Addition of Variable Attribute Metadata
xds1['Long'].attrs = {'units':'1', 'Long_Name':'Longitudinal coordinate in decimal degrees'}
xds1['Lat'].attrs = {'units':'1', 'Long_Name':'Latitudinal coordinate in decimal degrees'}
xds1['Out_tonnes_per_year_by_cell'].attrs = {'units':'1', 'Long_Name':'Total Output of Cumulative Methane in units of Tonnes per Year in each Cell'}
# Addition of Global Attribute Metadata
xds1.attrs = {'Conventions':'CF-1.0', 'Title':'Total_GEOCH4_Output_2018', 'summary':'Total output of Cumulative Geologic Emitted Methane from 2018'}
# Save to Output NetCDF
xds1.to_netcdf('E:/PSU/CCAR REU/Emission(s) Inventories (GEOS-CHEM)/Simulation Data/Geo-CH4_emission_grid_files (Geologic)/Gridded Geologic C2H6 - Emissions Inventory/Total_GEOCH4_Output_2018.nc')
Output (screenshots, described here only by their captions):
1. Variable type comparison between output file and example emission inventory
2. 2D variable information
3. 'Out_tonnes_per_year_by_cell' variable information
My hunch is that it has to do with the different format(s). At this point, I am unsure what I can do to correct for this. ANY help would be greatly appreciated. Once again, thanks in advance and kind regards!

You can probably solve this by replacing your lines setting up the xarray dataset, as follows:
df1 = df1.set_index(["Lat", "Long"])
xds1 = xr.Dataset(df1)
This will ensure that Lat/Long are the coordinates/dimensions of the xarray dataset, which should give you a valid netCDF file in the end.
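For reference, here is a minimal end-to-end sketch of that approach (untested; the column names are taken from the table above, while the file names, units, and attribute text are placeholders). It uses DataFrame.to_xarray(), which unstacks the Lat/Long index into grid dimensions:
import pandas as pd

df = pd.read_csv('Total_geoC2H6_output_2018.csv', skipinitialspace=True)
# make Lat/Long the index, then unstack them into the dimensions of a gridded Dataset
xds = df.set_index(['Lat', 'Long']).to_xarray()   # requires xarray to be installed
# 'Out_tonnes_per_year_by_cell' is now a 2D (Lat, Long) variable
xds['Lat'].attrs = {'units': 'degrees_north', 'long_name': 'Latitude'}
xds['Long'].attrs = {'units': 'degrees_east', 'long_name': 'Longitude'}
xds['Out_tonnes_per_year_by_cell'].attrs = {'units': 'tonnes yr-1', 'long_name': 'Total geologic C2H6 emission per cell'}
xds.attrs = {'Conventions': 'CF-1.0', 'Title': 'Total_GEOCH4_Output_2018'}
xds.to_netcdf('Total_GEOCH4_Output_2018.nc')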

Related

Truncate time series files and extract some descriptive variables

I have two major problems, and I can't work out the solution in Python, so let me explain the context.
On the one hand, I have a dataset containing a date point for each ID (1 ID = 1 patient), like this:
ID      Date point
0001    25/12/2022 09:00
0002    29/12/2022 16:00
0003    30/12/2022 18:00
...     ...
On the other hand, I have a folder with many text files containing the time series, like this:
0001.txt
0002.txt
0003.txt
...
The files all have the same structure: the ID (same as in the dataset) is in the file name, and the inside of each file looks like this (the first column contains the date and the second the value):
25/12/2022 09:00 155
25/12/2022 09:01 156
25/12/2022 09:02 157
25/12/2022 09:03 158
...
1/ I would like to truncate the text files and retrieve only the values within 48 hours of the dataset's Date point.
2/ For some statistical analysis, I want to take values like the mean or the maximum of those truncated series and add them to a dataframe like this:
ID      Mean    Maximum
0001
0002
0003
...
I know this will be a trivial problem for you, but for me (a beginner in Python) it is a challenge!
Thank you, everybody.
Manage time series with a dataframe containing date points and take some statistical values.
You could do something along these lines using pandas (I've not been able to test this fully):
import pandas as pd
from pathlib import Path

# I'll create a limited version of your initial table
data = {
    "ID": ["0001", "0002", "0003"],
    "Date point": ["25/12/2022 09:00", "29/12/2022 16:00", "30/12/2022 18:00"]
}

# put it in a pandas DataFrame
df = pd.DataFrame(data)

# convert the "Date point" column to datetime objects
df["Date point"] = pd.to_datetime(df["Date point"])

# provide the path to the folder containing the files
folder = Path("/path_to_files")

# an empty dictionary that you'll fill with the required statistical info
newdata = {"ID": [], "Mean": [], "Maximum": []}

# loop through the IDs and read in the files
for i, date in zip(df["ID"], df["Date point"]):
    inputfile = folder / f"{i}.txt"  # construct file name
    if inputfile.exists():
        # read in the file
        subdata = pd.read_csv(
            inputfile,
            sep="\s+",             # columns are separated by spaces
            header=None,           # there's no header information
            parse_dates=[[0, 1]],  # the first and second columns should be combined and converted to datetime objects
            infer_datetime_format=True
        )

        # get the values 48 hours after the current date point
        td = pd.Timedelta(value=48, unit="hours")
        mask = (subdata["0_1"] > date) & (subdata["0_1"] <= date + td)

        # add in the required info
        newdata["ID"].append(i)
        newdata["Mean"].append(subdata[2].loc[mask].mean())
        newdata["Maximum"].append(subdata[2].loc[mask].max())

# put newdata into a DataFrame
dfnew = pd.DataFrame(newdata)
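You can then inspect dfnew or save it, for example with dfnew.to_csv("summary.csv", index=False), to keep the summary table alongside the raw files (the file name here is just a placeholder).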

Read CSV, Change 2 Columns using pyproj Transform and save to new CSV

I'm very new to Python so apologies for my lack of understanding.
I need to read in 2 columns (latitude & longitude) of a 4 column CSV file.
Example below.
ShopName Misc latitude longitude
XXX 3 999999 999999
I then have to change the latitude and longitude using a pyproj transform script that I have checked. I then need to save the transformed latitude and longitude data into a new csv such that the column format is the same as in the existing csv.
Example below.
ShopName Misc latitude longitude
XXX 3 49.12124 -2.32131
I'm a bit lost, but this is where I got to. Thank you in advance.
import csv
from pyproj import Transformer

#2.2 Define function
def transformer = Transformer.from_crs("epsg:12345", "epsg:9999")
    result = transformer.transform(old_longitude, old_latitude)
    return new_longitude, new latitude

#2.3 Set destination file to variable
with open('new.csv', 'w') as csv_new2:
    #2.4 Instruct write method to new file
    fileWriter2 = csv.writer(csv_new2)
    #2.5 Set existing file to variable
    with open('old.csv', 'r') as csv_old2:
        #2.6 Instruct read method to new file
        fileReader2 = csv.reader(csv_old2)
        for row in fileReader2:
Here are some options for you.
pandas + pyproj:
How to speed up projection conversion?
geopandas:
https://geopandas.org/gallery/create_geopandas_from_pandas.html
https://geopandas.org/projections.html
from pyproj import Transformer
import pandas

pdf = pandas.read_csv("old.csv")
transformer = Transformer.from_crs("epsg:12345", "epsg:9999", always_xy=True)
xx, yy = transformer.transform(pdf["longitude"].values, pdf["latitude"].values)
pdf = pdf.assign(longitude=xx, latitude=yy)
pdf.to_csv("new.csv", index=False)  # index=False keeps the column layout the same as the input file
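If you go the geopandas route from the links above instead, a rough sketch (untested; the EPSG codes are the same placeholders as in the snippet above) could look like this:
import geopandas as gpd
import pandas as pd

pdf = pd.read_csv("old.csv")
# build point geometries in the source CRS
gdf = gpd.GeoDataFrame(
    pdf,
    geometry=gpd.points_from_xy(pdf["longitude"], pdf["latitude"]),
    crs="EPSG:12345",
)
# reproject to the target CRS
gdf = gdf.to_crs("EPSG:9999")
# write the transformed coordinates back into the original columns and save
pdf["longitude"] = gdf.geometry.x
pdf["latitude"] = gdf.geometry.y
pdf.to_csv("new.csv", index=False)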

Extracting data from multiple NetCDF files from multiple locations at specific date

I am trying to extract from multi-temporal files (netCDF) the value of the pixel at a specific location and time.
Each file is named: T2011, T2012, and so on until T2017.
Each file contains 365 layers; each layer corresponds to one day of the year and gives the temperature of that day.
My goal is to extract information according to my input dataset.
I have a csv (locd.csv) with my targets and it looks like this:
id lat lon DateFin DateCount
1 46.63174271 7.405986324 02-02-18 43,131
2 46.64972969 7.484352537 25-01-18 43,123
3 47.27028727 7.603811832 20-01-18 43,118
4 46.99994455 7.063905466 05-02-18 43,134
5 47.08125481 7.19501811 20-01-18 43,118
6 47.37833814 7.432005368 11-12-18 43,443
7 47.43230354 7.445253182 30-12-18 43,462
8 46.73777711 6.777871255 09-04-18 43,197
69 47.42285191 7.113934735 09-04-18 43,197
The id is the location I am interested in, lat and lon are the latitude and longitude, DateFin corresponds to the date at which I want to know the temperature at that location, and DateCount corresponds to the number of days from 01-01-1900 to that date (that's how the layers are indexed in the files).
To do that I have something like this:
import glob
from netCDF4 import Dataset
import pandas as pd
import numpy as np
from datetime import date
import os

# Record all the years of the netCDF files into a Python list
all_years = []
for file in glob.glob('*.nc'):
    print(file)
    data = Dataset(file, 'r')
    time = data.variables['time']  # that's how the days are stored
    year = file[0:4]
    all_years.append(year)

# define my input data
cities = pd.read_csv('locd.csv')

# extracting the data
for index, row in cities.iterrows():
    id_row = row['id']  # id from the database
    location_latitude = row['lat']
    location_longitude = row['lon']
    location_date = row['DateCount']  # number of days counting since 1900-01-01

    # Sorting the all_years python list
    all_years.sort()
    for yr in all_years:
        # Reading-in the data
        data = Dataset(str(yr) + '.nc', 'r')
        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]
        # Squared difference between the specified lat,lon and the lat,lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2
        # Identify the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()
        # Accessing the precipitation data
        prec = data.variables['precipi']  # that's how the variable is called
        for p_index in np.arange(0, len(location_date)):
            print('Recording the value for ' + id_row + ': ' + str(location_date[p_index]))
            df.loc[id_row[location_date]]['Precipitation'] = prec[location_date, min_index_lat, min_index_lon]

# to record it in a new archive
df.to_csv('locationNew.csv')
My issue:
I can't manage to make it work. Every time, something new comes up; now it says that "id_row" must be a string.
Does anybody have a hint, or experience working with this type of file?
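One rough sketch of how the per-row extraction could be structured (untested; it assumes the variable names 'lat', 'lon', 'time', and 'precipi' used in the code above, and that the 'time' variable holds day counts since 01-01-1900):
import glob
import numpy as np
import pandas as pd
from netCDF4 import Dataset

cities = pd.read_csv('locd.csv')
results = []

for _, row in cities.iterrows():
    id_row = str(row['id'])  # force the id to a string so it can be concatenated/printed safely
    target_day = int(str(row['DateCount']).replace(',', ''))  # day count since 01-01-1900

    for fname in sorted(glob.glob('*.nc')):
        data = Dataset(fname, 'r')
        time = data.variables['time'][:]
        if target_day < time.min() or target_day > time.max():
            data.close()
            continue  # the requested date is not covered by this file

        # nearest grid cell and nearest time step
        lat_idx = np.abs(data.variables['lat'][:] - row['lat']).argmin()
        lon_idx = np.abs(data.variables['lon'][:] - row['lon']).argmin()
        t_idx = np.abs(time - target_day).argmin()

        value = data.variables['precipi'][t_idx, lat_idx, lon_idx]
        results.append({'id': id_row, 'value': float(value)})
        data.close()
        break

pd.DataFrame(results).to_csv('locationNew.csv', index=False)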

KeyError when trying to get a value in a 2D array (imported from a csv file)

I need code that loads data from several csv files (containing distance, altitude, angle, wavelength, etc.).
import numpy as np
import pandas as pd

date = 20180710

# import csv files
geometry = pd.read_csv('20180710_geo.csv', sep=';')
TWOb = pd.read_csv('l2b.csv', sep=';')
calib = pd.read_csv('calibration.csv', sep=';')
obs = pd.read_csv('observation.csv', sep=';')

###### extract position of date in obs #########
idxtupple = np.where(obs == date)
listidx = list(zip(idxtupple[0], idxtupple[1]))

for idx in listidx:
    print(idx)
    D = obs[idx[0], 7]
    print(D)
Everything seems fine: I have the correct number of rows and columns, and correct float numbers at each position. But when I try to get an element of the 2D array (for instance obs[18, 7] or geometry[2, 5]), I get "KeyError: (18, 7)" (or "KeyError: (2, 5)", etc.) and I don't understand why.
Here's what I get from the csv file (rows 15 to 19 of my obs file):
Date of acq. [yyyymmdd] Operation ... Altitude min [km] Comments
15 20180630 Check out ... 19,51 NaN
16 20180703 Check out ... NaN NaN
17 20180705 Dark ... NaN NaN
19 20180711 Box-A ... 19,52 Global mapping
Thank you for your answer, Serge Ballesta!
I managed to get the wanted value in the table with s = obs.loc[idx[0], 'Solar distance [au]'] (using s rather than str, which would shadow the built-in) and then converted it with D = float(s.replace(',', '.')), and it works.
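To illustrate the underlying point (a small sketch with made-up values, not from the thread): plain obs[row, col] indexing on a DataFrame raises a KeyError because [] expects column labels, whereas .loc (label-based) and .iloc (position-based) can address individual cells:
import pandas as pd

obs = pd.DataFrame({'Date of acq. [yyyymmdd]': [20180630, 20180703],
                    'Solar distance [au]': ['19,51', '19,52']})

# obs[0, 1] would raise KeyError: (0, 1) because [] looks for a column named (0, 1)
by_label = obs.loc[0, 'Solar distance [au]']  # label-based access
by_position = obs.iloc[0, 1]                  # position-based access
D = float(by_label.replace(',', '.'))         # convert the comma-decimal string to a float
print(D)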

Write a function from csv using dataframes to read and return column values in python

I have the following data set in a csv file:
vehicle---time-----aspd[m/s]------gspd[m/s]----hdg---alt[m-msl]
veh_1---17:19.5---0.163471505---0.140000001---213---273.8900146
veh_2---17:19.5---0.505786836---0.170000002---214---273.9100037
veh_3---17:19.8---0.173484877---0.109999999---213---273.980011
veh_4---44:12.4---18.64673424---19.22999954---316---388.9299927
veh_5---44:13.0---18.13533401---19.10000038---316---389.1700134
I am trying to write a function launch_time() with two inputs (dataframe, vehicle name) that returns the first time the gspd is reported above 10.0 m/s.
The output time must be converted from a string (HH:MM:SS.SS) to a minutes after 12:00 format.
It should look something like this:
>>> launch_time(df, veh_1)
30.0
I will use this function to iterate through each vehicle and then need to record the results into a list of tuples with the format (v_name, launch time) in launch sequence order.
It should look something like this:
[('veh_1', 30.0), ('veh_2', 15.0)]
Disclosure: my python/pandas knowledge is very entry-level.
You can use read_csv with the separator -{3,}, a regex that matches 3 or more - characters:
import pandas as pd
from io import StringIO
temp=u"""vehicle---time-----aspd[m/s]------gspd[m/s]----hdg---alt[m-msl]
veh_1---17:19.5---0.163471505---0.140000001---213---273.8900146
veh_2---17:19.5---0.505786836---0.170000002---214---273.9100037
veh_3---17:19.8---0.173484877---0.109999999---213---273.980011
veh_4---44:12.4---18.64673424---19.22999954---316---388.9299927
veh_5---45:13.0---18.13533401---19.10000038---316---389.1700134"""
#after testing replace StringIO(temp) to filename
df = pd.read_csv(StringIO(temp), sep="-{3,}", engine='python')
print (df)
vehicle time aspd[m/s] gspd[m/s] hdg alt[m-msl]
0 veh_1 17:19.5 0.163472 0.14 213 273.890015
1 veh_2 17:19.5 0.505787 0.17 214 273.910004
2 veh_3 17:19.8 0.173485 0.11 213 273.980011
3 veh_4 44:12.4 18.646734 19.23 316 388.929993
4 veh_5 45:13.0 18.135334 19.10 316 389.170013
Then convert the time column with to_timedelta, filter all rows above 10 m/s by boolean indexing, sort_values by time, group on vehicle using groupby and take the first row of each group, and finally zip the vehicle and time columns together and convert to a list:
df.time = pd.to_timedelta('00:' + df.time, unit='h').\
astype('timedelta64[m]').astype(int)
req = df[df['gspd[m/s]'] > 10].\
sort_values('time', ascending=True).\
groupby('vehicle', as_index=False).head(1)
print(req)
vehicle time aspd[m/s] gspd[m/s] hdg alt[m-msl]
4 veh_5 45 18.135334 19.10 316 389.170013
3 veh_4 44 18.646734 19.23 316 388.929993
L = list(zip(req['vehicle'],req['time']))
print (L)
[('veh_5', 45), ('veh_4', 44)]
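Building on that, a possible sketch of the launch_time() function the question asks for (untested; it assumes df is the DataFrame as read in above, before the time column is converted to integer minutes, and that the time strings are minutes:seconds past 12:00):
import pandas as pd

def launch_time(df, vehicle):
    # rows for this vehicle where ground speed exceeds 10.0 m/s
    rows = df[(df['vehicle'] == vehicle) & (df['gspd[m/s]'] > 10.0)]
    if rows.empty:
        return None  # this vehicle never exceeds 10 m/s
    # interpret the time strings as minutes:seconds past 12:00 and convert to minutes
    minutes = pd.to_timedelta('00:' + rows['time']).dt.total_seconds() / 60.0
    return minutes.min()

# build the launch sequence as a list of (vehicle, launch time) tuples, ordered by time
launches = [(v, launch_time(df, v)) for v in df['vehicle'].unique()]
launches = sorted([x for x in launches if x[1] is not None], key=lambda x: x[1])
print(launches)  # roughly [('veh_4', 44.21), ('veh_5', 45.22)] for the sample data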
