Convert time series data from csv to netCDF python

The main problem in this process is the line below:
precip[:] = orig
which produces this error:
ValueError: cannot reshape array of size 5732784 into shape (39811,144,144)
I have two CSV files. The first contains the actual data of one variable (precipitation), with each column being a station; the second contains the corresponding station coordinates.
My sample data is on Google Drive.
The first CSV file has the shape (39811, 144) and the second has the shape (171, 10), but note that I'm only using a slice of the second as a (144, 2) dataframe.
This is the code:
stations = pd.read_csv(stn_precip)
stncoords = stations.iloc[:,[0,1]][:144]
orig = pd.read_csv(orig_precip, skiprows = 1, names = stations['Code'][:144])
lons = stncoords['X']
lats = stncoords['Y']
ncout = netCDF4.Dataset('Precip_1910-2018_homomod.nc', 'w')
ncout.createDimension('longitude',lons.shape[0])
ncout.createDimension('latitude',lats.shape[0])
ncout.createDimension('precip',orig.shape[1])
ncout.createDimension('time',orig.shape[0])
lons_out = lons.tolist()
lats_out = lats.tolist()
time_out = orig.index.tolist()
lats = ncout.createVariable('latitude',np.dtype('float32').char,('latitude',))
lons = ncout.createVariable('longitude',np.dtype('float32').char,('longitude',))
time = ncout.createVariable('time',np.dtype('float32').char,('time',))
precip = ncout.createVariable('precip',np.dtype('float32').char,('time', 'longitude','latitude'))
lats[:] = lats_out
lons[:] = lons_out
time[:] = time_out
precip[:] = orig
ncout.close()
I'm mostly basing my code on this post: convert-csv-to-netcdf
but it does not include the variable 'TIME' as a 3rd dimension, so that's where I'm failing.
I think I should expect the precipitation variable to have the shape (39811, 144, 144), but the error suggests otherwise.
I'm not exactly sure how to deal with this; any input is appreciated.

As you have data from different stations, I would suggest using a station dimension in your netCDF file rather than separate lon and lat dimensions. You can still save the longitude and latitude of each station in separate variables.
Here is one possible solution, using your code as an example:
#!/usr/bin/env ipython
import pandas as pd
import numpy as np
import netCDF4
stn_precip='Precip_1910-2018_stations.csv'
orig_precip='Precip_1910-2018_origvals.csv'
stations = pd.read_csv(stn_precip)
stncoords = stations.iloc[:,[0,1]][:144]
orig = pd.read_csv(orig_precip, skiprows = 1, names = stations['Code'][:144])
lons = stncoords['X']
lats = stncoords['Y']
nstations = np.size(lons)
ncout = netCDF4.Dataset('Precip_1910-2018_homomod.nc', 'w')
ncout.createDimension('station',nstations)
ncout.createDimension('time',orig.shape[0])
lons_out = lons.tolist()
lats_out = lats.tolist()
time_out = orig.index.tolist()
lats = ncout.createVariable('latitude',np.dtype('float32').char,('station',))
lons = ncout.createVariable('longitude',np.dtype('float32').char,('station',))
time = ncout.createVariable('time',np.dtype('float32').char,('time',))
precip = ncout.createVariable('precip',np.dtype('float32').char,('time', 'station'))
lats[:] = lats_out
lons[:] = lons_out
time[:] = time_out
precip[:] = orig
ncout.close()
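The size in the error message is consistent with this layout: 39811 × 144 = 5732784, so the data fill a (time, station) array exactly, whereas a (time, longitude, latitude) variable would need 144 times as many values. A minimal NumPy sketch of the same mismatch, using a small made-up array in place of the real data:

```python
import numpy as np

# Shapes quoted in the question: 39811 time steps, 144 stations.
n_time, n_station = 39811, 144
total_values = n_time * n_station  # number of values in the CSV data

# Small stand-in for the real data: 5 time steps, 3 stations (made-up sizes).
small = np.zeros((5, 3), dtype=np.float32)

# Reshaping to (time, station) works because the element count matches.
as_time_station = small.reshape(5, 3)

# Reshaping to (time, lon, lat) fails: it would need 5 * 3 * 3 values.
try:
    small.reshape(5, 3, 3)
    reshape_failed = False
except ValueError:
    reshape_failed = True
```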
The header of the output file can then be inspected with ncdump -h Precip_1910-2018_homomod.nc.

Related

How do I write a function that reads a .data file and returns an np array in python?

I have a data file that can be downloaded from here: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data
I want to define a function that loads the data and returns it as a NumPy array. The dataset should have 14 columns: the 13 housing attributes x and the housing price y.
def loadData(filename):
    dataset = None
    file = open(filename, "r")
    data = file.read()
    print(data)
    x = np.genfromtxt(filename, usecols = [0,1,2,3,4,5,6,7,8,9,10,11,12])
    y = np.genfromtxt(filename, usecols = 13)
    print("x: ", x)
    print("y: ", y)
    dataset = np.concatenate((x,y), axis = 1)
    return dataset
My y output seems to be alright. However, my x output is wrong as seen below:
Part of the output of x should contain the values below, as part of an np array:
What am I doing wrong?
edit: the above question has been answered and resolved. However, I just wanted to ask how would I ensure that the output is in float64.
My output is
but my expected is
I have edited the np.genfromtxt lines to have dtype = np.float64 as shown:
x = np.genfromtxt(filename, usecols = [0,1,2,3,4,5,6,7,8,9,10,11,12], dtype = np.float64)
y = np.genfromtxt(filename, usecols = 13, dtype = np.float64)
I have also tried dataset.astype(float64) but neither has worked. Would appreciate some help again. Thank you!

Your code is almost correct. The problem is that after loading, x is a two-dimensional array of shape (506, 13), while y is a one-dimensional array of shape (506,). So, after loading y, you have to add a new dimension to make it two-dimensional. NumPy offers the np.newaxis index for that. The code that solves your problem is:
import numpy as np
def loadData(filename):
    x = np.genfromtxt(filename, usecols = [0,1,2,3,4,5,6,7,8,9,10,11,12])
    y = np.genfromtxt(filename, usecols = 13)
    y = y[:, np.newaxis].astype(np.float64) # Add new axis and convert to float64
    dataset = np.concatenate((x,y), axis = 1)
    return dataset
if __name__ == "__main__":
    dataset = loadData("housing.data")
    """
    print(type(dataset[0, 0]))
    >>> <class 'numpy.float64'>
    """
Hope it helps!
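As a possible alternative, and assuming the file holds only whitespace-separated numbers, np.genfromtxt can read all 14 columns in one call and the result can be split afterwards. The sketch below uses two made-up rows in an io.StringIO buffer in place of housing.data, so the values are illustrative only:

```python
import io
import numpy as np

# Two made-up rows with 14 whitespace-separated columns (not real housing data).
sample = io.StringIO(
    "0.1 18.0 2.3 0 0.5 6.5 65.2 4.0 1 296.0 15.3 396.9 4.9 24.0\n"
    "0.2 0.0 7.0 0 0.4 6.4 78.9 4.9 2 242.0 17.8 396.9 9.1 21.6\n"
)

# With no usecols, genfromtxt returns every column; the default dtype is float64.
dataset = np.genfromtxt(sample)

# Split into attributes (first 13 columns) and target (last column) if needed.
x = dataset[:, :13]
y = dataset[:, 13]
```

Reading once avoids the concatenation step entirely, since the array already has the 14-column layout.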

You have already read the file contents into the data variable.
You can pass that to genfromtxt() instead of filename, but wrap it in io.StringIO first, because genfromtxt() treats a plain string as a file name. y also needs a second axis before it can be concatenated with x:
import io
import numpy as np
def loadData(filename):
    with open(filename, "r") as file:
        data = file.read()
    print(data)
    x = np.genfromtxt(io.StringIO(data), usecols = [0,1,2,3,4,5,6,7,8,9,10,11,12])
    y = np.genfromtxt(io.StringIO(data), usecols = 13)
    print("x: ", x)
    print("y: ", y)
    dataset = np.concatenate((x, y[:, np.newaxis]), axis = 1)
    return dataset

How to extract efficiently time-series data from a netCDF file?

I want to extract time series of data from a single netCDF file.
I have to extract three time series of daily temperatures for each of more than 500 cities from 2004 to 2016 (more precisely, one time series at each of 3 coordinate points per city).
The following program works, but it is very slow (more than 8 hours to obtain the time series for one location). I have already tried dividing the coordinates into several CSV files and running the program separately for each of them, but that is not very efficient.
Maybe I should chunk the netCDF file (5 GB) into smaller files to speed up reading, but I don't know how to do that.
from netCDF4 import Dataset
from datetime import datetime
from netCDF4 import Dataset
import pandas as pd
import os
import numpy as np
os.chdir('D:PATH/tmp/')
date_range = pd.date_range(start = "2004-01-01", end = "2016-12-31", freq ='D')
df = pd.DataFrame(0.0, columns = ['Temp1','Temp2','Temp3'], index = date_range)
cities = pd.read_csv(r'D:\PATH\cities_coordinates.csv', sep =',')
cities['NUTS_ID']= cities['NUTS_ID'].map(str)
for index, row in cities.iterrows():
    location = row['NUTS_ID']
    location_latitude1 = row['lat1']
    location_longitude1 = row['lon1']
    location_latitude2 = row['lat2']
    location_longitude2 = row['lon2']
    location_latitude3 = row['lat3']
    location_longitude3 = row['lon3']
    for day in date_range:
        data = Dataset("D:/PATH/temperature.nc",'r')
        # Store the lat and lon data of the netCDF file in variables
        lat = data.variables['latitude'][:]
        lon = data.variables['longitude'][:]
        # Squared difference between the specified lat, lon and the lat, lon of the netCDF
        sq_diff_lat1 = (lat - location_latitude1)**2
        sq_diff_lon1 = (lon - location_longitude1)**2
        sq_diff_lat2 = (lat - location_latitude2)**2
        sq_diff_lon2 = (lon - location_longitude2)**2
        sq_diff_lat3 = (lat - location_latitude3)**2
        sq_diff_lon3 = (lon - location_longitude3)**2
        # Identify the index of the min value for lat and lon
        min_index_lat1 = sq_diff_lat1.argmin()
        min_index_lon1 = sq_diff_lon1.argmin()
        min_index_lat2 = sq_diff_lat2.argmin()
        min_index_lon2 = sq_diff_lon2.argmin()
        min_index_lat3 = sq_diff_lat3.argmin()
        min_index_lon3 = sq_diff_lon3.argmin()
        # Accessing the temperature data
        tx = data.variables['tx']
        start = '2004-01-01'
        end = '2016-12-31'
        d_range = pd.date_range(start = start, end = end, freq='D')
        for t_index in np.arange(0, len(d_range)):
            print('Recording the value for: '+str(d_range[t_index]))
            df.loc[d_range[t_index]]['Temp1']=tx[t_index, min_index_lat1, min_index_lon1]
            df.loc[d_range[t_index]]['Temp2']=tx[t_index, min_index_lat2, min_index_lon2]
            df.loc[d_range[t_index]]['Temp3']=tx[t_index, min_index_lat3, min_index_lon3]
    df.to_csv(location +'.csv')
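Much of the run time in the program above probably comes from reopening the file inside the loop and reading one value per day. One possible way to speed it up, sketched below, is to find the nearest grid indices once per point and then read the whole time axis with a single slice. This assumes tx is indexed as (time, latitude, longitude), as in the code above. The helper names nearest_index and extract_series are hypothetical, and small NumPy arrays stand in for the netCDF variables, which accept the same slicing syntax:

```python
import numpy as np

def nearest_index(axis_values, target):
    """Return the index of the axis value closest to target."""
    return int(np.abs(np.asarray(axis_values) - target).argmin())

def extract_series(tx, lat, lon, points):
    """Read one full time series per (lat, lon) point with a single slice each."""
    series = {}
    for name, (point_lat, point_lon) in points.items():
        i = nearest_index(lat, point_lat)
        j = nearest_index(lon, point_lon)
        # One slice over the whole time axis instead of one read per day.
        series[name] = np.asarray(tx[:, i, j])
    return series

# Made-up stand-ins for the 'latitude', 'longitude' and 'tx' variables.
lat = np.array([50.0, 50.5, 51.0])
lon = np.array([1.0, 1.5, 2.0, 2.5])
tx = np.arange(5 * 3 * 4, dtype=float).reshape(5, 3, 4)

points = {"Temp1": (50.4, 1.9), "Temp2": (51.2, 1.0)}
series = extract_series(tx, lat, lon, points)
```

With a real file, the dataset would be opened once before the loop over cities, and each returned array could be assigned to a DataFrame column in one step.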

How to write The values of Latitude, Longitude and another variable's values in a csv file in three different columns?

I want to write the values of latitude, longitude and Air_flux to a CSV file in three different columns.
Here is the Python 3 code I have so far.
The file "path" holds all the values of "Air_Flux" across the specified lat and lon range.
Code:
import numpy as np
import csv
LAT_MIN = 34.675
LAT_MAX = 38.275
LON_MIN = 124.625
LON_MAX = 130.795
path = 'BESS_PAR_Daily.A2015004.nc_output.csv' # "File That contains the Values Of Air_Flux"
flux = np.genfromtxt(path, delimiter=',') # Reading Data from File
latData = np.arange(LAT_MIN, LAT_MAX, 0.05)
lonData = np.arange(LON_MIN, LON_MAX, 0.05)
with open('data.csv', 'w') as file:
    writer = csv.writer(file, delimiter=',')
    for x in np.nditer(latData.T, order='C'):
        for y in np.nditer(lonData.T, order='C'):
            file.write(str(x))
            file.write("\n")
            file.write(str(y))
            file.write("\n")
    for fl in np.nditer(flux):
        file.write(str(fl))
        file.write("\n")
file.close()
I only know how to store values in one column.
However, I want to write the values in such a way that the first column holds the latitude values, the second column the longitude values, and the third column the "Air_flux" values.

My understanding is that your data is required in the format
LAT1 LON1 FLUX1
LAT2 LON2 FLUX2
In that case you don't need multiple for loops: you can pass all three arrays to np.nditer and call writer.writerow once per iteration to write one line per triple.
Here is an example based on your scenario:
import numpy as np
import csv
LAT_MIN = 34.675
LAT_MAX = 38.275
LON_MIN = 124.625
LON_MAX = 130.795
# path = 'BESS_PAR_Daily.A2015004.nc_output.csv' # "File That contains the Values Of Air_Flux"
# flux = np.genfromtxt(path, delimiter=',') # Reading Data from File
# latData = np.arange(LAT_MIN, LAT_MAX, 0.05)
# lonData = np.arange(LON_MIN, LON_MAX, 0.05)
flux = np.array([1,2,3,4,5])
latData = np.array([1,2,3,4,5])
lonData = np.array([1,2,3,4,5])
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    for x,y,z in np.nditer([latData.T, lonData.T, flux], order='C'):
        writer.writerow([x,y,z])
Also, you don't need file.close(), since the with block takes care of it.

As the values of flux are stored across those Lat and Lons, after iterating Lat values across Lon, I fetched the indices of lat and lon across flux:
writer.writerow([x, y, flux[lat.index, lon.index]])
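If flux is a 2-D grid with one row per latitude and one column per longitude (an assumption, since the layout of the input file is not shown), the three columns can also be built without explicit loops. A minimal sketch with made-up values, writing to an in-memory buffer rather than data.csv:

```python
import csv
import io
import numpy as np

# Made-up axes and grid; flux[i, j] is assumed to belong to (lat[i], lon[j]).
lat = np.array([34.0, 34.5])
lon = np.array([124.0, 124.5, 125.0])
flux = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Expand the axes to full grids so every flux cell gets its own lat/lon pair.
lat_grid, lon_grid = np.meshgrid(lat, lon, indexing="ij")
rows = np.column_stack([lat_grid.ravel(), lon_grid.ravel(), flux.ravel()])

# Write a header and one line per grid cell.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["lat", "lon", "flux"])
writer.writerows(rows.tolist())
csv_text = buffer.getvalue()
```

To write a real file, the buffer can be replaced with open('data.csv', 'w', newline='').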

How to create netcdf file from excel sheet contain one variable values with corresponding lat long?

I am trying to create an nc file from an Excel file containing values in three columns (lat, long and precipitation). The data are gridded values at 0.05 degree resolution. The total number of grid cells with data is 1694. The data extent is 57 columns and 52 rows. I do not understand how to load the precipitation variable.
This is what I tried:
from netCDF4 import *
from numpy import *
from openpyxl import *
import time
#Load the data sheet
wb = load_workbook('D:\\R_Workspace\\UB_Try.xlsx')
ws = wb['UB_details']
#To get the prec variable
ppt = []
for i in range(2,1696):
    ppt.append(ws.cell(row=i,column=3).value)
ppt_arr = asarray(ppt)
#Writing nc file
nc_file = Dataset('D:\\R_Workspace\\Test.nc','w',format='NETCDF4_CLASSIC')
nc_file.description = 'Example dataset'
nc_file.history = 'Created on ' +time.ctime(time.time())
#Defining dimensions
lon = nc_file.createDimension('lon',57)
lat = nc_file.createDimension('lat',52)
#Creating variables
latitude = nc_file.createVariable('Latitude',float32,('lat',))
latitude.units = 'Degree_North'
longitude = nc_file.createVariable('Longitude',float32,('lon',))
longitude.units = 'degree_East'
prec = nc_file.createVariable('prec',float32,('lon','lat'),fill_value = -9999.0)
prec.units = 'mm'
#Writing data to variables
lats = arange(16.875,19.425,0.05)
lat_reverse = lats[::-1]
lons = arange(73.325,76.175,0.05)
latitude[:] = lat_reverse
longitude[:] = lons
prec[:] = ppt_arr
nc_file.close()
I got the error
Traceback (most recent call last):
File "D:\R_Workspace\Try.py", line 45, in <module>
prec[:] = ppt_arr
File "netCDF4\_netCDF4.pyx", line 4358, in netCDF4._netCDF4.Variable.__setitem__
ValueError: cannot reshape array of size 1694 into shape (57,52)

Replace this line:
prec[:] = ppt_arr
with:
temp = np.zeros(52 * 57, dtype=np.float32)
temp[:ppt_arr.size] = ppt_arr
prec[:] = temp
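This creates a temporary array of zeros and fills the first ppt_arr.size elements with your values from ppt_arr before assigning it to the netCDF variable. Note that it needs import numpy as np, since your script only uses from numpy import *.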
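Padding at the end puts the values in consecutive cells regardless of their coordinates. If each spreadsheet row carries its own lat and long, a possible alternative is to place every value at the grid cell computed from its coordinates and leave the remaining cells at the fill value. The sketch below assumes regularly spaced axes and reuses the -9999.0 fill value from the question. The helper name grid_points and the sample numbers are made up, and the result is oriented (lat, lon), so it would match a variable declared with dimensions ('lat', 'lon'):

```python
import numpy as np

FILL = -9999.0  # same fill value as in the question's createVariable call

def grid_points(lats, lons, values, lat_axis, lon_axis, fill=FILL):
    """Scatter point values onto a regular (lat, lon) grid, filling gaps with fill."""
    grid = np.full((lat_axis.size, lon_axis.size), fill, dtype=np.float32)
    step_lat = lat_axis[1] - lat_axis[0]
    step_lon = lon_axis[1] - lon_axis[0]
    # Convert each coordinate to the nearest integer index on its axis.
    i = np.rint((lats - lat_axis[0]) / step_lat).astype(int)
    j = np.rint((lons - lon_axis[0]) / step_lon).astype(int)
    grid[i, j] = values
    return grid

# Made-up example: a 3 x 2 grid with only three cells holding data.
lat_axis = np.array([10.0, 10.5, 11.0])
lon_axis = np.array([70.0, 70.5])
lats = np.array([10.0, 11.0, 10.5])
lons = np.array([70.5, 70.0, 70.5])
values = np.array([1.0, 2.0, 3.0])

grid = grid_points(lats, lons, values, lat_axis, lon_axis)
```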

Python: passing coordinates from list to function

I am using some code from a workshop to extract data from netCDF files by the coordinates closest to my specified coordinates. When using just one set of coordinates I am able to extract the values I need without trouble as below:
import numpy as np
import netCDF4
from math import pi
from numpy import cos, sin
def tunnel_fast(latvar,lonvar,lat0,lon0):
    '''
    Find closest point in a set of (lat,lon) points to specified point
    latvar - 2D latitude variable from an open netCDF dataset
    lonvar - 2D longitude variable from an open netCDF dataset
    lat0,lon0 - query point
    Returns iy,ix such that the square of the tunnel distance
    between (latval[iy,ix],lonval[iy,ix]) and (lat0,lon0)
    is minimum.
    '''
    rad_factor = pi/180.0 # for trigonometry, need angles in radians
    # Read latitude and longitude from file into numpy arrays
    latvals = latvar[:] * rad_factor
    lonvals = lonvar[:] * rad_factor
    ny,nx = latvals.shape
    lat0_rad = lat0 * rad_factor
    lon0_rad = lon0 * rad_factor
    # Compute numpy arrays for all values, no loops
    clat,clon = cos(latvals),cos(lonvals)
    slat,slon = sin(latvals),sin(lonvals)
    delX = cos(lat0_rad)*cos(lon0_rad) - clat*clon
    delY = cos(lat0_rad)*sin(lon0_rad) - clat*slon
    delZ = sin(lat0_rad) - slat
    dist_sq = delX**2 + delY**2 + delZ**2
    minindex_1d = dist_sq.argmin() # 1D index of minimum element
    iy_min,ix_min = np.unravel_index(minindex_1d, latvals.shape)
    return iy_min,ix_min
ncfile = netCDF4.Dataset('E:\wind_level2_1.nc', 'r')
latvar = ncfile.variables['latitude']
lonvar = ncfile.variables['longitude']
#_________GG turbine_________GAD10 Latitude 51.735516, GAD10 Longitude 1.942656
iy,ix = tunnel_fast(latvar, lonvar, 51.735516, 1.942656)
print('Closest lat lon:', latvar[iy,ix], lonvar[iy,ix])
refLAT=latvar[iy,ix]
refLON = lonvar[iy,ix]
#try to find the data for this location
SARwind = ncfile.variables['sar_wind'][:,:]
ModelWind = ncfile.variables['model_speed'][:,:]
print 'iy,ix' #appears to be the index of the value of Lat,lon
print SARwind[iy,ix]
ncfile.close()
Now I am trying to loop through a text file containing coordinates, coord_list, to extract each set of coordinates, find the data, then move on to the next set of coordinates in the list. This code works on its own as below:
import csv
from decimal import Decimal
with open('Turbine_locs_no_header.csv','rb') as f:
    reader = csv.reader(f)
    #coord_list = list(reader)
    coord_list = [reader]
end_row = len(coord_list)
lon_ind=1
lat_ind=2
for row in range(0, end_row-1):#end_row - 1 due to the 0 index
    turbine_lat = coord_list[row][lat_ind]
    turbine_lon = coord_list[row][lon_ind]
    turbine_lat = [Decimal(turbine_lat)]
    print 'lat',turbine_lat, 'lon',turbine_lon, row
However, I want to pass coordinates from the text file to this part of the original code, iy,ix = tunnel_fast(latvar, lonvar, 51.94341, 1.922094888), replacing the numbers with variables: iy, ix = tunnel_fast(latvar, lonvar, turbine_lat, turbine_lon). When I try to combine the two pieces of code by creating a function get_coordinates, I get the following errors:
File "C:/Users/mm/test_nc_bycoords_GG_turbines_AGW.py", line 65, in <module>
get_coordinates(coord_list, latvar, lonvar)
File "C:/Users/mm/test_nc_bycoords_GG_turbines_AGW.py", line 51, in get_coordinates
iy, ix = tunnel_fast(latvar, lonvar, turbine_lat, turbine_lon)
File "C:/Users/mm/test_nc_bycoords_GG_turbines_AGW.py", line 27, in tunnel_fast
lat0_rad = lat0 * rad_factor
TypeError: can't multiply sequence by non-int of type 'float'
I thought this is because the turbine_lat and turbine_lon are list items so cannot be used, but this doesn't seem to be connected to the errors. I know this code needs more work anyway, but if anyone could help me spot where I am going wrong that would be very helpful. My attempt to combine the two codes is below.
import numpy as np
import netCDF4
from math import pi
from numpy import cos, sin
import csv
# edited from https://github.com/Unidata/unidata-python-workshop/blob/a56daa50d7b343c7debe93968683613642d6b9f7/notebooks/netcdf-by-coordinates.ipynb
def tunnel_fast(latvar,lonvar,lat0,lon0):
    '''
    Find closest point in a set of (lat,lon) points to specified point
    latvar - 2D latitude variable from an open netCDF dataset
    lonvar - 2D longitude variable from an open netCDF dataset
    lat0,lon0 - query point
    Returns iy,ix such that the square of the tunnel distance
    between (latval[iy,ix],lonval[iy,ix]) and (lat0,lon0)
    is minimum.
    '''
    rad_factor = pi/180.0 # for trigonometry, need angles in radians
    # Read latitude and longitude from file into numpy arrays
    latvals = latvar[:] * rad_factor
    lonvals = lonvar[:] * rad_factor
    ny,nx = latvals.shape
    lat0_rad = lat0 * rad_factor
    lon0_rad = lon0 * rad_factor
    # Compute numpy arrays for all values, no loops
    clat,clon = cos(latvals),cos(lonvals)
    slat,slon = sin(latvals),sin(lonvals)
    delX = cos(lat0_rad)*cos(lon0_rad) - clat*clon
    delY = cos(lat0_rad)*sin(lon0_rad) - clat*slon
    delZ = sin(lat0_rad) - slat
    dist_sq = delX**2 + delY**2 + delZ**2
    minindex_1d = dist_sq.argmin() # 1D index of minimum element
    iy_min,ix_min = np.unravel_index(minindex_1d, latvals.shape)
    return iy_min,ix_min
#________________my edits___________________________________________________
def get_coordinates(coord_list, latvar, lonvar):
    "this takes coordinates from a .csv and assigns them to variables"
    end_row = len(coord_list)
    lon_ind=1
    lat_ind=2
    for row in range(0, end_row-1):#end_row - 1 due to the 0 index
        turbine_lat = coord_list[row][lat_ind]
        turbine_lon = coord_list[row][lon_ind]
        iy, ix = tunnel_fast(latvar, lonvar, turbine_lat, turbine_lon)
        print('Closest lat lon:', latvar[iy, ix], lonvar[iy, ix])
#________________________________________________________________________________________________________________________
ncfile = netCDF4.Dataset('NOGAPS_wind_level2_1.nc', 'r')
latvar = ncfile.variables['latitude']
lonvar = ncfile.variables['longitude']
#____added in to pass to get coordinates function
with open('Turbine_locs_no_header.csv','rb') as f:
    reader = csv.reader(f)
    coord_list = list(reader)
#_________take latitude from coordinateas function
get_coordinates(coord_list, latvar, lonvar)
#iy,ix = tunnel_fast(latvar, lonvar, turbine_lat, turbine_lon)#get these from the 'assign_coordinates_fromlist.py
#print('Closest lat lon:', latvar[iy,ix], lonvar[iy,ix])
SARwind = ncfile.variables['sar_wind'][:,:]
ModelWind = ncfile.variables['model_speed'][:,:]
print 'iy,ix' #appears to be the index of the value of Lat,lon
print SARwind[iy,ix]
ncfile.close()
When I try to convert

You can unpack an argument list using *args (see the docs). In your case you could do tunnel_fast(latvar, lonvar, *coord_list[row]). You need to make sure that the order of arguments in coord_list[row] is correct and if coord_list[row] contains more than the two values then you need to slice it appropriately.

Thanks to help from a_guest
It was a simple problem of lat0 and lon0 being passed as
<type 'str'> to tunnel_fast when it requires <type 'float'>. This appears to come from loading the coord_list as a list.
with open('Turbine_locs_no_header.csv','rb') as f:
    reader = csv.reader(f)
    coord_list = list(reader)
The workaround I used was to convert lat0 and lon0 to floats at the beginning of tunnel_fast
lat0 = float(lat0)
lon0 = float(lon0)
I am sure there is a more elegant way to do this, but it works.
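One possibly tidier option is to convert the values to float while reading the CSV, so that tunnel_fast only ever receives numbers. csv.reader always yields strings, and multiplying a string by a float raises the TypeError shown in the traceback. The sketch below is written for Python 3 (the code in the question is Python 2) and assumes the column order name, longitude, latitude implied by lon_ind=1 and lat_ind=2. The helper name read_coordinates and the second turbine label are made up, and an in-memory buffer stands in for Turbine_locs_no_header.csv:

```python
import csv
import io

def read_coordinates(handle, lon_ind=1, lat_ind=2):
    """Return a list of (lat, lon) float tuples from an open CSV handle."""
    coords = []
    for row in csv.reader(handle):
        # csv.reader yields strings, so convert before any arithmetic.
        coords.append((float(row[lat_ind]), float(row[lon_ind])))
    return coords

# Illustrative rows: name, longitude, latitude.
sample = io.StringIO(
    "GAD10,1.942656,51.735516\n"
    "GAD11,1.922094888,51.94341\n"
)
coords = read_coordinates(sample)

# A string cannot be multiplied by a float, which is what the traceback reports.
try:
    "51.735516" * 0.5
    string_multiply_failed = False
except TypeError:
    string_multiply_failed = True
```

Each (lat, lon) tuple can then be passed on as tunnel_fast(latvar, lonvar, lat, lon), or unpacked with *coords[k].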
