Reading and manipulating multiple netCDF files in Python

I need help reading multiple netCDF files. Despite a few examples on here, none of them works properly.
I am using Python(x,y) version 2.7.5 with these packages: netcdf4 1.0.7-4, matplotlib 1.3.1-4, numpy 1.8, pandas 0.12,
basemap 1.0.2...
There are a few things I am used to doing with GrADS that I need to start doing in Python.
I have several files of 2 meter temperature data from ECMWF (four samples per day, one file per year). Each file contains 2 meter temperature with Xsize=480, Ysize=241,
Zsize(level)=1, Tsize(time) = 1460, or 1464 for leap years.
The file names look like this: t2m.1981.nc, t2m.1982.nc, t2m.1983.nc, etc.
Based on this page:
( Loop through netcdf files and run calculations - Python or R )
Here is where I am now:
from pylab import *
import netCDF4 as nc
from netCDF4 import *
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np
f = nc.MFDataset('d:/data/ecmwf/t2m.????.nc') # as '????' being the years
t2mtr = f.variables['t2m']
ntimes, ny, nx = shape(t2mtr)
temp2m = zeros((ny,nx),dtype=float64)
print ntimes
for i in xrange(ntimes):
    temp2m += t2mtr[i,:,:] #I'm not sure how to slice this, just wanted to get the 00Z values.
    # is it possible to assign to a new array,...
    #... (for eg.) the average values of 00z for January only from 1981-2000?
#creating a NetCDF file
nco = nc.Dataset('d:/data/ecmwf/t2m.00zJan.nc','w',clobber=True)
nco.createDimension('x',nx)
nco.createDimension('y',ny)
temp2m_v = nco.createVariable('t2m', 'i4', ( 'y', 'x'))
temp2m_v.units='Kelvin'
temp2m_v.long_name='2 meter Temperature'
temp2m_v.grid_mapping = 'Lambert_Conformal' # can it be something else or ..
#... eliminated?).This is straight from the solution on that webpage.
lono = nco.createVariable('longitude','f8')
lato = nco.createVariable('latitude','f8')
xo = nco.createVariable('x','f4',('x')) #not sure if this is important
yo = nco.createVariable('y','f4',('y')) #not sure if this is important
lco = nco.createVariable('Lambert_Conformal','i4') #not sure
#copy all the variable attributes from original file
for var in ['longitude','latitude']:
    for att in f.variables[var].ncattrs():
        setattr(nco.variables[var],att,getattr(f.variables[var],att))
# copy variable data for lon,lat,x and y
lono=f.variables['longitude'][:]
lato=f.variables['latitude'][:]
#xo[:]=f.variables['x']
#yo[:]=f.variables['y']
# write the temp at 2 m data
temp2m_v[:,:]=temp2m
# copy Global attributes from original file
for att in f.ncattrs():
    setattr(nco,att,getattr(f,att))
nco.Conventions='CF-1.6' #not sure what is this.
nco.close()
#attempt to plot the 00zJan mean
file=nc.Dataset('d:/data/ecmwf/t2m.00zJan.nc','r')
t2mtr=file.variables['t2m'][:]
lon=file.variables['longitude'][:]
lat=file.variables['latitude'][:]
clevs=np.arange(0,500.,10.)
map = Basemap(projection='cyl',llcrnrlat=0.,urcrnrlat=10.,llcrnrlon=97.,urcrnrlon=110.,resolution='i')
x,y=map(*np.meshgrid(lon,lat))
cs = map.contourf(x,y,t2mtr,clevs,extend='both')
map.drawcoastlines()
map.drawcountries()
plt.plot(cs)
plt.show()
My first question is about the line temp2m += t2mtr[i,:,:]. I am not sure how to slice the data to get only the 00Z values (say, for January only) from all files.
Second, while running the test, an error came up at cs = map.contourf(x,y,t2mtr,clevs,extend='both') saying "shape does not match that of z: found (1,1) instead of (241,480)". I suspect the output data is wrong because of a mistake in writing the values, but I can't figure out what or where.
Thanks for your time. I hope this is not confusing.

So t2mtr is a 3d array
ntimes, ny, nx = shape(t2mtr)
This sums all values across the 1st axis:
for i in xrange(ntimes):
    temp2m += t2mtr[i,:,:]
A better way to do this is:
temp2m = np.sum(t2mtr, axis=0)
temp2m = t2mtr.sum(axis=0) # alt
If you want the average, use np.mean instead of np.sum.
To average across a subset of the times, jan_times, use an expression like:
jan_avg = np.mean(t2mtr[jan_times,:,:], axis=0)
This is simplest if you just want a plain range, e.g. the first 31 times. For simplicity I'm assuming the data is daily and every year has the same length. You can adjust things for the sub-daily sampling and for leap years.
t2mtr[0:31,:,:]
A simple way of getting Jan data for several years is to construct an index like:
yr_starts = np.arange(0,3)*365 # can adjust for leap years
jan_times = (yr_starts[:,None]+ np.arange(31)).flatten()
# array([ 0, 1, 2, ... 29, 30, 365, ..., 756, 757, 758, 759, 760])
Another option would be to reshape t2mtr (doesn't work well for leap years).
t2mtr.reshape(nyrs, 365, ny, nx)[:,0:31,:,:].mean(axis=1)
You could test the time sampling with something like:
np.arange(5*365).reshape(5,365)[:,0:31].mean(axis=1)
Doesn't the data set have a time variable? You might be able to extract the desired time indices from that. I worked with ECMWF data a number of years ago, but don't remember a lot of the details.
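If the files are concatenated in year order, each year starts at 00Z on 1 January, and there are four samples per day (which is what the 1460/1464 time steps in the question suggest), the January 00Z indices can be built with plain index arithmetic, leap years included. This is only a sketch under those assumptions: january_00z_indices is a made-up helper name, and the small synthetic array stands in for t2mtr. If the files carry a usable time variable, decoding it (netCDF4 has num2date for that) would probably be a more robust way to pick the indices.

```python
import calendar
import numpy as np


def january_00z_indices(years, steps_per_day=4):
    """Return the time indices of the January 00Z samples for each year.

    Assumes every year starts at 00Z on 1 January and that the yearly
    files are concatenated in the order given by `years`.
    """
    indices = []
    offset = 0  # index of 1 January 00Z for the current year
    for year in years:
        # 31 January days, one 00Z sample every `steps_per_day` steps
        indices.append(offset + np.arange(31) * steps_per_day)
        days = 366 if calendar.isleap(year) else 365
        offset += days * steps_per_day
    return np.concatenate(indices)


years = range(1981, 2001)
jan_00z = january_00z_indices(years)

# Synthetic stand-in for t2mtr with shape (ntimes, ny, nx)
ntimes = sum((366 if calendar.isleap(y) else 365) * 4 for y in years)
data = np.arange(ntimes, dtype=float)[:, None, None] * np.ones((1, 2, 3))

# Mean over all January 00Z samples, one value per grid point
jan_mean = data[jan_00z].mean(axis=0)
print(jan_00z.size, jan_mean.shape)
```

With the real data, data would be replaced by the 't2m' variable read from the MFDataset.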
As for your contourf error, I would check the shape of the 3 main arguments: x,y,t2mtr. They should match. I haven't worked with Basemap.
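One possible reading of the (1,1) shape, not confirmed anywhere in this thread: in the question, 'longitude' and 'latitude' are created without dimensions, which makes them scalar variables, and the lines lono=... and lato=... only rebind the Python names instead of writing into those variables. Meshgrid of two scalars gives (1,1) arrays. The sketch below shows the shape check with NumPy only; the grid values are assumed (a regular 0.75 degree grid matching 480 x 241 points), not read from the files.

```python
import numpy as np

ny, nx = 241, 480
field = np.zeros((ny, nx))  # stand-in for the 2-D temperature field

# 1-D coordinate arrays (assumed values for a regular 0.75 degree grid)
lon = np.linspace(0.0, 359.25, nx)
lat = np.linspace(90.0, -90.0, ny)
x, y = np.meshgrid(lon, lat)
print(x.shape, y.shape, field.shape)  # all three should agree

# Scalar coordinates, as read back from dimensionless variables
x_bad, y_bad = np.meshgrid(np.array(0.0), np.array(0.0))
print(x_bad.shape)
```

If that is what happened, creating the two variables with a dimension (for example ('x',) and ('y',)) and assigning with lono[:] = ... would be the thing to try.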

Related

Replacing masked data with previous values of same dataset

I am working on filling in missing data in a large (4GB) netcdf datafile (3 dimensions: time, longitude and latitude). The method is to fill in the masked values in data1 either with:
1) previous values from data1 or
2) with data from another (also masked dataset, data2) if the found value from data1 < the found value from data2.
So far I have tried a couple of things. One was a very complex script with long for loops, which had not finished running after 24 hours. I have tried to reduce it, but I think it is still far too complicated. I believe there is a much simpler way to do this; I just can't see it.
I have made a script where masked data is first replaced with zeros so that I can use np.where to get the indices of my masked data (I did not find a function that returns the coordinates of masked data, so this is my workaround). My problem is that the code is very long and, I think, too slow for large datasets. I believe there is a simpler way of doing it, but I haven't found one.
Here is what I have so far (the first part just generates some matrices that are easy to work with):
if __name__ == '__main__':
    import numpy as np
    import numpy.ma as ma
    from sortdata_helpers import decision_tree
    # Generating some (easy) test data to try the algorithm on:
    # data1
    rand1 = np.random.randint(10, size=(10, 10, 10))
    rand1 = ma.masked_where(rand1 > 5, rand1)
    rand1 = ma.filled(rand1, fill_value=0)
    rand1[0,:,:] = 1
    #data2
    rand2 = np.random.randint(10, size=(10, 10, 10))
    rand2[0, :, :] = 1
    coordinates1 = np.asarray(np.where(rand1 == 0)) # gives the locations of where in the data there are zeros
    filled_data = decision_tree(rand1, rand2, coordinates1)
    print(filled_data)
The functions that I defined to be called in the main script are these, in the same order as they are used:
def decision_tree(data1, data2, coordinates):
    # This is the main function,
    # where the decision between data1 or data2 is chosen.
    import numpy as np
    from sortdata_helpers import generate_vector
    from sortdata_helpers import find_value
    for i in range(coordinates.shape[1]):
        coordinate = [coordinates[0, i], coordinates[1,i], coordinates[2,i]]
        AET_vec = generate_vector(data1, coordinate) # makes vector to go back in time
        AET_value = find_value(AET_vec) # Takes the vector and finds the closest day with data
        PET_vec = generate_vector(data2, coordinate)
        PET_value = find_value(PET_vec)
        if PET_value > AET_value:
            data1[coordinate[0], coordinate[1], coordinate[2]] = AET_value
        else:
            data1[coordinate[0], coordinate[1], coordinate[2]] = PET_value
    return(data1)
def generate_vector(data, coordinate):
    # This one generates the vector to go back in time.
    vector = data[0:coordinate[0], coordinate[1], coordinate[2]]
    return(vector)
def find_value(vector):
    # Walking backwards through vector, the first value that is not zero
    # (i.e. the most recent one) is chosen as "value"
    from itertools import dropwhile
    value = list(dropwhile(lambda x: x == 0, reversed(vector)))[0]
    return(value)
I hope someone has a good idea or suggestions on how to improve my code. I am still struggling to understand indexing in Python, and I think this can definitely be done more smoothly than I have done here.
Thanks for any suggestions or comments,
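For what it's worth, the "take the most recent earlier value" step can be done without Python loops. The sketch below is a plain forward fill along the time axis and is not a drop-in replacement for decision_tree above: it does not reproduce the data1/data2 comparison, and it assumes the first time step is never missing (as in the test data, where row 0 is set to 1). forward_fill_time is a made-up helper name. Note also that np.ma.getmaskarray returns the mask of a masked array directly, so the replace-with-zero detour is not needed to locate masked cells.

```python
import numpy as np


def forward_fill_time(data, missing):
    """Fill missing cells with the latest earlier value along axis 0.

    `missing` is a boolean array of the same shape as `data`.
    Assumes no cell is missing at time index 0.
    """
    nt = data.shape[0]
    # Time index of every cell, shaped so it broadcasts against data
    idx = np.arange(nt).reshape((nt,) + (1,) * (data.ndim - 1))
    # Missing cells contribute index 0, so the running maximum below
    # carries forward the index of the last valid time step
    idx = np.where(missing, 0, idx)
    idx = np.maximum.accumulate(idx, axis=0)
    return np.take_along_axis(data, idx, axis=0)


# Tiny example: 4 time steps, 2 grid cells, values above 8 are masked
raw = np.array([[1, 1], [9, 5], [9, 9], [7, 9]])
masked = np.ma.masked_where(raw > 8, raw)
missing = np.ma.getmaskarray(masked)
filled = forward_fill_time(masked.filled(0), missing)
print(filled)
```

The same function works unchanged on a (time, lat, lon) array, since the index array broadcasts over all trailing axes.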

Problem with converting octave code to python/pandas - wrong signal processing due to incorrect float64 values

I did not find answers covering my whole problem; each existing question deals with only part of it. After a few days of trying I decided to post a question.
I am doing biomechanical research that involves computing the maximum velocity of kicks. Three kicks are captured in each file, so simply finding the overall maximum won't do. I need to find the maximum value for each of those kicks. With some help I managed to do it using MATLAB/Octave, but for future work I decided to stick with Python for data processing.
The point is that I have time,x,y,z data of a specific marker and I need to compute its velocity for each registered frame and pick the maximum velocity from each kick.
This is the code in Octave:
pkg load signal
txyz=importdata('295ltoe.txt',',',8); % read the text file
txyz=txyz.data; % all data in array time,x,y,z
dxyz=diff(txyz); % first differences of all columns
vxyz=dxyz(:,2:end)./dxyz(:,1)/1000; % compute velocity components
v=sqrt(sum(vxyz.^2,2)); % and the total velocity
[pks,locs]=findpeaks(v,'minpeakheight',6 )
I tried to convert it to pandas with this code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
data1 = pd.read_csv("B0264_dollyo_air_P_T01 rtoe.txt") # example of a txt file imported from a c3d file
df = data1.diff()
dfx = (df['X'] /1000) / df['T']
dfy = (df['Y'] /1000) / df['T']
dfz = (df['Z'] /1000) / df['T']
dfx1 = dfx**2
dfy1 = dfy**2
dfz1 = dfz**2
v = (dfx1 + dfy1 + dfz1)**1/2
peaks, _ = find_peaks(v, height=6)
plt.plot(v)
plt.plot(peaks, v[peaks], "x")
plt.show()
The problem is that the velocity gets values like this:
0 NaN
1 6.450000e-07
2 8.237500e-07
3 1.159062e-06
4 1.250312e-06
5 1.657500e-06
Instead of the correct values, it gives me values of more than 60. I am attaching the plots that I got versus the correct Excel plot (computing in Excel is time consuming due to constant copy-pasting).
My overall aim is to get 3 maximum peaks and 3 minimum peaks to compute the time of kick execution, but I do not know how to obtain them.
For now, if anyone is willing to help me, I can provide the files I have used.
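One thing worth checking in the pandas version: in Python, ** binds tighter than /, so (dfx1 + dfy1 + dfz1)**1/2 is evaluated as the sum divided by 2, not as a square root. Whether that explains the whole discrepancy cannot be told from what is posted here. The sketch below uses made-up sample values and assumes positions in millimetres and time in seconds (which is what the division by 1000 suggests).

```python
import numpy as np

# Made-up marker data: time in s, position in mm (assumed units)
t = np.array([0.0, 0.01, 0.02, 0.03])
xyz = np.array([[0.0, 0.0, 0.0],
                [30.0, 40.0, 0.0],
                [60.0, 80.0, 0.0],
                [90.0, 120.0, 0.0]])

dt = np.diff(t)
# Velocity components per frame, converted from mm/s to m/s
v_components = np.diff(xyz, axis=0) / dt[:, None] / 1000.0
sum_sq = (v_components ** 2).sum(axis=1)

speed = np.sqrt(sum_sq)      # what the Octave code computes
halved = sum_sq ** 1 / 2     # parsed as (sum_sq ** 1) / 2

print(speed)
print(halved)
```

Writing the exponent as **0.5, or using np.sqrt, gives the same quantity as the sqrt in the Octave code.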

How to get months from np.datetime64 object, NOT using pandas

I have an array like this one:
dt64 = array(['1970-01-01', '1970-01-02', '1970-02-03', '1970-02-04',
'1970-03-05', '1970-03-06', '1970-04-07', '1970-04-08',
'1970-05-09', '1970-05-10', '1970-06-11', '1970-06-12',
'1970-07-13', '1970-07-14'], dtype='datetime64[D]')
Now I want to plot some data associated with each element of the array. In the figure, which I want to plot using matplotlib, I need to draw a line that changes color for some months.
I want to draw the months Mar to Aug in orange and the others in blue.
I think I have to do two plt.plot lines, one for the orange line and one for the blue line.
My problem is that I struggle to slice these datetime64 objects in a way that returns the month, so that I can compare it with the required months.
So far I have:
import numpy as np
from matplotlib import pyplot as plt
def md_plot(dt64=np.array, md=np.array):
    """Plot the Mars distance (y-axis) against time (x-axis)."""
    plt.style.use('seaborn-whitegrid')
    y, m, d = dt64.astype(int) // np.c_[[10000, 100, 1]] % np.c_[[10000, 100, 100]]
    dt64 = y.astype('U4').astype('M8') + (m-1).astype('m8[M]') + (d-1).astype('m8[D]')
    plt.plot(dt64, md, color='orange', label='Half-year of rising temperatures')
    plt.plot(dt64, md, color='blue', label='Half-year of falling temperatures')
    plt.xlabel("Time in years\n")
    plt.xticks(rotation = 45)
    plt.ylabel("Mars distance in AU\n(1 AU = 149,597,870.7 km)")
plt.figure('viewed globally...') # comment out this block if needed
#plt.style.use('seaborn-whitegrid')
md_plot(master_array[:,0], master_array[:,1]) # Graph
plt.show()
plt.close()
This idea seemed to work, but won't work for a whole array:
In [172]: dt64[0].astype(datetime.datetime).month
Out[172]: 1
I really try to avoid Pandas because I don't want to bloat my script when there is a way to get the task done by using the modules I am already using. I also read it would decrease the speed here.
If I understand you correctly, this would do it:
[np.datetime64(i,'M') for i in dt64]
Converting to python datetime in an intermediate step:
from datetime import datetime
import numpy as np
datestrings = np.array(["18930201", "19840404"])
months = np.array([datetime.strptime(d, "%Y%m%d").month for d in datestrings])
print(months)
# out: [2 4]
My version of numpy may be dated, but when I ran np.datetime64(dt64[0]) I got numpy.datetime64('1970-01')
To get just the month (if that’s what you are looking for) try:
np.datetime_as_string(dt64[0]).split('-')[1]
This solution fits the best for me:
dt64[(dt64.astype('M8[M]') - dt64.astype('M8[Y]')).view(int) == 2]
Thanks to Paul Panzer.
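Building on that last expression, here is a minimal sketch of getting month numbers for the whole array and splitting the data into two series for the two colours. The dates and md values are stand-ins, not the data from the question; only NumPy is used.

```python
import numpy as np

# Stand-in data: one value per day of 1970
dt64 = np.arange('1970-01-01', '1971-01-01', dtype='datetime64[D]')
md = np.sin(np.linspace(0.0, 2.0 * np.pi, dt64.size))

# Months elapsed since the start of each date's year, shifted to 1..12
months = (dt64.astype('datetime64[M]') - dt64.astype('datetime64[Y]')).astype(int) + 1

# True for March through August
warm = (months >= 3) & (months <= 8)

# NaN breaks the line, so each series only draws its own months
md_warm = np.where(warm, md, np.nan)
md_cold = np.where(~warm, md, np.nan)
print(months[:3], warm.sum())
```

Passing md_warm and md_cold to two plt.plot calls on the same dt64 axis (orange and blue) should give a single curve that changes colour, since matplotlib skips NaN values when drawing lines.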

Determining if data in a txt file obeys certain statistics

I'm working with a Geiger counter which can be hooked up to a computer and which records its output in the form of a .txt file, NC.txt, where it records the time since starting and the 'value' of the radiation it recorded. It looks like
import pylab
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt
x1 = []
y1 = []
#Define a dictionary: counts
f = open("NC.txt", "r")
for line in f:
    line = line.strip()
    parts = line.split(",") #the columns are separated by commas and spaces
    time = float(parts[1]) #time is recorded in the second column of NC.txt
    value = float(parts[2]) #and the value it records is in the third
    x1.append(time)
    y1.append(value)
f.close()
xv = np.array(x1)
yv = np.array(y1)
#Statistics
m = np.mean(yv)
d = np.std(yv)
#Strip out background radiation
trueval = yv - m
#Basic plot of counts
num_bins = 10000
plt.hist(trueval,num_bins)
plt.xlabel('Value')
plt.ylabel('Count')
plt.show()
So this code so far will just create a simple histogram of the radiation counts centred at zero, so the background radiation is ignored.
What I want to do now is perform a chi-squared test to see how well the data fits, say, Poisson statistics (and then go on to compare it with other distributions later). I'm not really sure how to do that. I have access to scipy and numpy, so I feel like this should be a simple task, but just learning python as I go here, so I'm not a terrific programmer.
Does anyone know of a straightforward way to do this?
Edit for clarity: I'm not asking so much about if there is a chi-squared function or not. I'm more interested in how to compare it with other statistical distributions.
Thanks in advance.
You can use the SciPy library; see its documentation and examples.
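To make that a bit more concrete, here is a rough sketch of a chi-squared goodness-of-fit test against a Poisson distribution with scipy.stats. The counts are simulated stand-ins for the values in NC.txt, and the cut-off kmax is an arbitrary choice. It assumes the raw integer counts are used (not the mean-subtracted values from the histogram above), since a Poisson comparison needs non-negative integers.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for the recorded counts (assumption, not real data)
rng = np.random.default_rng(0)
counts = rng.poisson(lam=4.0, size=2000)

# Estimate the Poisson mean from the data
lam = counts.mean()

# Bin the observations as 0, 1, ..., kmax-1, and ">= kmax" in the last bin
kmax = 9
observed = np.bincount(np.minimum(counts, kmax), minlength=kmax + 1)

# Expected frequencies under Poisson(lam), with the tail lumped together
probs = stats.poisson.pmf(np.arange(kmax), lam)
probs = np.append(probs, 1.0 - probs.sum())
expected = probs * counts.size

# ddof=1 because one parameter (lam) was estimated from the data
chi2, p = stats.chisquare(observed, expected, ddof=1)
print(chi2, p)
```

A small p-value would suggest the counts are not well described by a Poisson distribution. Comparing against another distribution follows the same pattern: compute that distribution's expected frequency per bin and pass it as the second argument.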

Drawing a 2D function in matplotlib

Dear fellow coders and science guys :)
I am using python with numpy and matplotlib to simulate a perceptron, proud to say it works pretty well.
I used Python even though I've never seen it before, because I heard matplotlib offers amazing graph visualisation capabilities.
Using the functions below I get a 2D array that looks like this:
[[alpha_1, 900], [alpha_2, 600], ..., [alpha_99, 900]]
So I get this 2D array and would love to write a function that would enable me to analyze the convergence.
I am looking for something that will easily and intuitively (don't have time to study a whole new library for 5 hours now) draw a function like this sketch:
def get_convergence_for_alpha(self, _alpha):
    epochs = []
    for i in range(0, 5):
        epochs.append(self.perceptron_algorithm())
        self.weights = self.generate_weights()
    avg = sum(epochs, 0) / len(epochs)
    res = [_alpha, avg]
    return res
And this is the whole generation function.
def alpha_convergence_function(self):
    res = []
    for i in range(1, 100):
        res.append(self.get_convergence_for_alpha(i / 100))
    return res
Is this easily doable?
You can convert your nested list to a 2d numpy array and then use slicing to get the alphas and epoch counts (just like in matlab).
import numpy as np
import matplotlib.pyplot as plt
# code to simulate the perceptron goes here...
res = your_object.alpha_convergence_function()
res = np.asarray(res)
print('array size:', res.shape)
plt.xkcd() # so you get the sketchy look :)
# first column -> x-axis, second column -> y-axis
plt.plot(res[:,0], res[:,1])
plt.show()
Remove the plt.xkcd() line if you don't actually want the plot to look like a sketch...
