I am constructing a NetCDF file that will be used with xarray. It will consist of many groups that use dimensions that are defined in the root group. In my current example, xarray's plot function is unable to put the proper values on the axes. Tools like panoply or ncview produce plots that do put the proper values of the dimensions at the axes. The script below creates a file which allows me to reproduce the problem. Do I construct the NetCDF file in an incorrect way, or is this a bug in xarray?
import numpy as np
import netCDF4 as nc
import xarray as xr
import matplotlib.pyplot as plt
# Three series, two variables that contain the axis values and the 2D field.
z = np.arange(0., 1000., 50.)
time = np.arange(0., 86400., 3600.)
a = np.random.rand(time.size, z.size)
# The two dimensions are stored in the root group.
nc_file = nc.Dataset("test.nc", mode="w", datamodel="NETCDF4", clobber=False)
nc_file.createDimension("z" , z.size )
nc_file.createDimension("time", time.size)
nc_z = nc_file.createVariable("z" , "f8", ("z") )
nc_time = nc_file.createVariable("time", "f8", ("time"))
nc_z [:] = z [:]
nc_time[:] = time[:]
# The 2D field is created and stored in a group called test_group.
nc_group = nc_file.createGroup("test_group")
nc_a = nc_group.createVariable("a", "f8", ("time", "z"))
nc_a[:,:] = a[:,:]
nc_file.close()
# Opening the file in xarray shows a plot that misses the axes values.
xr_file = xr.open_dataset("test.nc", group="test_group")
xr_a = xr_file['a']
xr_a.plot()
plt.show()
The resulting figure, which has just the count rather than the dimension values on the axes, is:
I was working with NumPy and Pandas to create some artificial data for testing models.
First, I coded this:
# Constructing some random data for experiments
import math
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(42)
# Rectangular Data
total_n = 500
x = np.random.rand(total_n)*10
y = np.random.rand(total_n)*10
divider = 260
# Two lambda functions are for shifting the data, the numbers are chosen arbitrarily
f = lambda a: a*2
x[divider:] = f(x[divider:])
y[divider:] = f(y[divider:])
g = lambda a: a*3 + 5
x[:divider] = g(x[:divider])
y[:divider] = g(y[:divider])
# Colours array for separating the data
colors = ['blue']*divider + ['red']*(total_n-divider)
squares = np.array([x,y])
plt.scatter(squares[0],squares[1], c=colors, alpha=0.5)
I got what I wanted:
The Data I wanted
But I wanted to add the colors array to the numpy array, to take it as a Label variable so I added this to the code:
# Constructing some random data for experiments
import math
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(42)
# Rectangular Data
total_n = 500
x = np.random.rand(total_n)*10
y = np.random.rand(total_n)*10
divider = 260
# Two lambda functions are for shifting the data, the numbers are chosen arbitrarily
f = lambda a: a*2
x[divider:] = f(x[divider:])
y[divider:] = f(y[divider:])
g = lambda a: a*3 + 5
x[:divider] = g(x[:divider])
y[:divider] = g(y[:divider])
# Colours array for separating the data
colors = ['blue']*divider + ['red']*(total_n-divider)
squares = np.array([x,y,colors])
plt.scatter(squares[0],squares[1], c=colors, alpha=0.5)
And everything just blows out:
The Blown out Data
I worked around this by keeping the labels separate from the NumPy array. But still, what is going on here?
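Alright, so I think I have the answer. A NumPy array can only hold one data type, which is inferred when the array is created if it is not given explicitly. When you create squares with colors in it, squares.dtype is '<U32', which means every value, including the numbers in x and y, is converted to a Unicode string of up to 32 characters (the '<' marks little-endian byte order). plt.scatter then receives strings rather than floats, so the points are no longer placed on a numeric axis.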
To avoid that you can:
use a simple list
use a pandas dataframe, as they accept columns of different types
if you want to use NumPy, you can use a structured array as follows:
#input must be a list of tuples, one per row; zip builds them
zipped = list(zip(x, y, colors))
#describe the type of each field; 'U10' is a string of up to 10 characters
dtype = np.dtype([('x', float), ('y', float), ('colors', 'U10')])
#create the array, specifying the dtype explicitly
squares = np.array(zipped, dtype=dtype)
#when plotting, select each field by name, just as in a dataframe
plt.scatter(squares["x"], squares["y"], c=squares["colors"], alpha=0.5)
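The type coercion described above can be checked directly. This is a minimal sketch with three made-up points (the values are illustrative, not the data from the question); it compares the dtype of a plain array built from mixed rows with the field dtypes of a structured array.

```python
import numpy as np

# Illustrative values only (assumption: not the data from the question)
x = [1.5, 2.5, 3.5]
y = [10.0, 20.0, 30.0]
colors = ["blue", "blue", "red"]

# Mixing floats and strings forces a single string dtype for the whole array
mixed = np.array([x, y, colors])
print(mixed.dtype)        # a Unicode string dtype
print(type(mixed[0, 0]))  # the numbers are now strings too

# A structured array keeps a separate dtype per field
dtype = np.dtype([("x", float), ("y", float), ("colors", "U10")])
structured = np.array(list(zip(x, y, colors)), dtype=dtype)
print(structured["x"].dtype, structured["colors"].dtype)
```

The numeric fields of the structured array can be passed to plotting functions as floats, while the plain mixed array would first need its rows converted back with astype(float).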
I have a NetCDF file with a spatial resolution of 0.05° and I want to regrid it to a spatial resolution of 0.01°, like this other NetCDF file. I tried using scipy.interpolate.griddata, but I am not really getting there; I think I am missing something.
original_dataset = xr.open_dataset('to_regrid.nc')
target_dataset= xr.open_dataset('SSTA_L4_MED_0_1dg_2022-01-18.nc')
According to the scipy.interpolate.griddata documentation, I need to construct my interpolation pipeline as follows:
grid = griddata(points, values, (grid_x_new, grid_y_new),
method='nearest')
So in my case, I assume it would be as follows:
#Saving in variables the old and new grids
grid_x_new = target_dataset['lon']
grid_y_new = target_dataset['lat']
grid_x_old = original_dataset ['lon']
grid_y_old = original_dataset ['lat']
points = (grid_x_old,grid_y_old)
values = original_dataset['analysed_sst'] #My variable in the netcdf is the sea surface temp.
Now, when I run griddata:
from scipy.interpolate import griddata
grid = griddata(points, values, (grid_x_new, grid_y_new),method='nearest')
I am getting the following error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
I assume it has something to do with the lat/lon array shapes. I am quite new to NetCDF and don't really know what the issue could be. Any help would be much appreciated!
griddata expects one coordinate pair per data value, so grid_x_old and grid_y_old must list the longitude and latitude of every grid cell, not just the 1-D coordinate axes of the dataset. Something like the following should work:
import xarray as xr
from scipy.interpolate import griddata
original_dataset = xr.open_dataset('to_regrid.nc')
target_dataset= xr.open_dataset('SSTA_L4_MED_0_1dg_2022-01-18.nc')
#Flatten both datasets to one row per grid cell
old_df = original_dataset.to_dataframe().reset_index()
new_df = target_dataset.to_dataframe().reset_index()
#Saving in variables the old and new grids
grid_x_old = old_df["lon"]
grid_y_old = old_df["lat"]
grid_x_new = new_df["lon"]
grid_y_new = new_df["lat"]
values = old_df["analysed_sst"]
points = (grid_x_old, grid_y_old)
grid = griddata(points, values, (grid_x_new, grid_y_new), method='nearest')
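The shape requirement behind this answer can be illustrated without the actual files. The sketch below is NumPy-only and uses a made-up 3 x 5 grid (the axis values and the sst array are assumptions for illustration); it expands 1-D lon/lat axes into one coordinate pair per grid cell with np.meshgrid, which is the layout griddata needs for its points and values arguments.

```python
import numpy as np

# Hypothetical 1-D coordinate axes and a 2-D field with dims (lat, lon)
lon = np.linspace(0.0, 1.0, 5)
lat = np.linspace(40.0, 41.0, 3)
sst = np.arange(lat.size * lon.size, dtype=float).reshape(lat.size, lon.size)

# Expand the axes so that every grid cell has its own (lon, lat) pair
lon2d, lat2d = np.meshgrid(lon, lat)  # both have shape (3, 5)

# Flatten to the "one row per data value" layout
points = np.column_stack([lon2d.ravel(), lat2d.ravel()])  # shape (15, 2)
values = sst.ravel()                                      # shape (15,)

print(points.shape, values.shape)
# These could then be passed on as griddata(points, values, (new_lon2d, new_lat2d)),
# with the target grid expanded by np.meshgrid in the same way.
```

Passing the 1-D axes directly gives griddata 5 longitudes, 3 latitudes and 15 values, which is one way to end up with a shape mismatch like the one in the question.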
I recommend using xESMF for regridding xarray datasets. The code below will regrid your dataset:
import xarray as xr
import xesmf as xe
original_dataset = xr.open_dataset('to_regrid.nc')
target_dataset= xr.open_dataset('SSTA_L4_MED_0_1dg_2022-01-18.nc')
regridder = xe.Regridder(original_dataset, target_dataset, "bilinear")
dr_out = regridder(original_dataset)
I'm quite new to python and am working on an assignment. I am using pandas to read a csv into a dataframe. One of the fields in that dataframe is Car-Type. I want to get a total of each of the different Car-Types (sedan, hatch-back, wagon, etc.) in the data frame, then use matplotlib to make graphs of the Car-Types vs. the type-totals. Depending on the type of graph I try to make, I get different errors about the x and y variables.
import pandas as pd
from matplotlib import pyplot as plt
path_to_csv = r'C:\Automobile_data.csv'
# use pandas to read the csv and assign it to a variable, df
df = pd.read_csv(path_to_csv, encoding='iso-8859-1')
# Where the car-type value is null, set the value to Not Available
df.loc[df['Car-Type'].isnull(), 'Car-Type'] = 'Not_Available'
# create new dataframe with just the car-type counts
countType = df['Car-Type'].value_counts().astype('int64')
print(countType)
# use matplotlib to create a graph
# assign values to the x and y variables
x = df['Car-Type']
y = countType
# create a graph with x and y
plt.plot(x, y)
# # display the graph
plt.show()
line: (plt.plot) - ValueError: x and y must have same first dimension, but have shapes (205,) and (6,)
bar: (plt.bar) - ValueError: shape mismatch: objects cannot be broadcast to a single shape
scatter: (plt.scatter) - ValueError: x and y must be the same size
I understand that there's an error regarding the size and/or shape of the data I'm assigning to the x and y variables, but I'm not experienced enough, and haven't been able to work out from reading, how to fix those errors.
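The shapes in the error messages (205 rows versus 6 categories) suggest that x holds one entry per car while y holds one entry per car type. One possible fix, sketched here on a small made-up Series rather than the real CSV (the values are assumptions for illustration), is to take both x and y from the result of value_counts, so that they have the same length.

```python
import pandas as pd

# Stand-in for df['Car-Type']; the real data comes from the CSV in the question
car_types = pd.Series(["sedan", "hatchback", "sedan", None, "sedan", "hatchback"])
car_types = car_types.fillna("Not_Available")

# value_counts returns one row per distinct car type
counts = car_types.value_counts()

x = counts.index.tolist()  # the category labels
y = counts.tolist()        # the matching totals
print(x, y)
```

With x and y built this way, a call such as plt.bar(x, y) receives two sequences of equal length.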
I am very new to coding python and I am working with a .CSV file that gives me a 32x32 matrix in a 1024 column row with a time stamp. I reshaped the data to give me 32x32 arrays and looped through each row appending the matrices to a numpy array.
i = 0
while i < len(df_array):
    if i == 0:
        spec = np.reshape(df_array[i][np.arange(1,1025)], (32,32))
        spectrum_matrix = spec
    else:
        spec = np.reshape(df_array[i][np.arange(1,1025)], (32,32))
        spectrum_matrix = np.concatenate((spectrum_matrix, spec), axis = 0)
    i = i + 1
print("job done")
What I would like to do is attach the time stamp from the original data file to each of the matrices, which would allow me to resample the data to a 5 minute average. I would also like to plot the bins to get a plot similar to this Drop size distribution.
As a reference I am reading in the data .CSV with pandas and here is an example of a portion of the raw data: 01.06.2017;18:22:20;0.122;0.00;51;7.401;10375;18745;57;27;0.00;23.6;0.110;0;
<SPECTRUM>;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
The ;'s after <SPECTRUM> are the 32x32 matrix.
Thanks in advance for any help!
Python and associated packages can do many things without loops
From my understanding of your data you have a (8640 x 32 x 32) Data Structure (time x size x velocity).
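As a small illustration of working without loops, the while loop from the question can be replaced by a single reshape. This is a sketch under the assumption that df_array is a 2-D numeric array in which column 0 holds the time stamp and columns 1 to 1024 hold the spectrum; the raw array below is a made-up stand-in for it.

```python
import numpy as np

# Stand-in for df_array (assumption: 2-D numeric, column 0 = time stamp)
n_rows = 6
raw = np.arange(n_rows * 1025, dtype=float).reshape(n_rows, 1025)

# Slice off the time stamp column and reshape every row to 32 x 32 at once
spectra = raw[:, 1:1025].reshape(n_rows, 32, 32)

print(spectra.shape)  # (time, 32, 32)
```

Unlike concatenating along axis 0, this keeps time as a separate first dimension, which makes it easier to attach the time stamps afterwards.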
Pandas works very well with 2D data structures, however for higher dimensional data I would recommend you get familiar with xarray. With this package along with pandas you can create and manipulate your data without having to resort to loops.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import seaborn as sns
%matplotlib inline
#create random data
data = (np.random.binomial(n =5, p =0.2, size =(8640,32,32))*1000).astype(int)
#create labels for data
sizes= np.linspace(1,5,32)
velocities = np.linspace(1,1000, num = 32)
#make time range of 24 hours with 10sec intervals
ind = pd.date_range(start='2014-01-01', periods=8640, freq='10s')
#convert data to xarray 3D data structure
df = xr.DataArray(data, coords = [ind, sizes, velocities],
dims = ['time', 'size', 'speed'])
#make a 5 min average of the data
#(older xarray versions used df.resample('300s', dim='time', how='mean'))
min_average = df.resample(time='300s').mean()
#plot sample of data and 5 min average
my1d = min_average.isel(size = 5, speed= 10)
my1d.plot(label = '5 min avg')
plt.gca()
df.isel(size = 5, speed =10).plot(alpha = 0.3, c = 'r', label = 'raw_data')
plt.legend()
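For evenly spaced samples, the 5 minute average can also be written in plain NumPy by averaging blocks of consecutive time steps. This is a swapped-in technique, not the xarray resample used above, and it assumes exactly one sample every 10 seconds with no gaps; the array sizes are made up to keep the sketch small.

```python
import numpy as np

rng = np.random.default_rng(0)

# One hour of made-up data at 10 s intervals: (time, size, speed)
data = rng.integers(0, 5, size=(360, 4, 4)).astype(float)

steps_per_window = 300 // 10                   # 30 samples per 5 minutes
n_windows = data.shape[0] // steps_per_window  # 12 complete windows

# Group the time axis into (window, step) and average over the steps
trimmed = data[: n_windows * steps_per_window]
five_min_avg = trimmed.reshape(n_windows, steps_per_window, 4, 4).mean(axis=1)

print(five_min_avg.shape)
```

If samples are missing or irregular, the block reshape no longer lines up with the clock, and a time-aware resample is the safer choice.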
Making a distribution plot like the one you linked is a bit trickier, but it is possible:
#transform your data to only have mean speed for each time and size
#and convert to pandas dataframe
mean_speed =min_average.mean(dim = ['speed'])
#for some reason xarray makes you name the new column when you convert
#to a pandas dataframe. I then get rid of the extra empty level with
#a list comprehension
df= mean_speed.to_dataframe('').unstack().T
df.index = np.array([np.array(i)[1].astype(float) for i in df.index])
#make a contourplot of your new data
plt.contourf(df.columns, df.index, df.values, cmap ='PuBu_r')
plt.title('mean speed')
plt.ylabel('size')
plt.xlabel('time')
plt.colorbar()
I am currently working with BUFR files containing wind data. When I read such a file in Python I get 4 large vectors: a latitude vector, a longitude vector, a wind_direction vector, and a wind_speed vector.
Both wind vectors are NumPy masked arrays because there is non-valid data. This happens because the data comes from a non-geostationary satellite. In fact I successfully generated the following image from this BUFR file to show you the general shape that the data takes.
In this image I have plotted a color field to represent the wind speed, while the arrows obviously represent the wind direction.
Please notice the two bands of actual data. Unfortunately, the way I am plotting the data generates a third band (where the color field is smooth) in between the actual data bands. This is an artefact of the function pcolormesh. If I could superimpose two pcolormesh plots, each one representing one of the bands, this problem would disappear.
Unfortunately, I do not know how I could separate the data "regions". I have thought about clustering techniques but do not know how to cluster along latlon data using ANOTHER array (the wind data) as the clustering rule.
This is my current code:
#!/usr/bin/python
import bufr
import numpy as np
import sys
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot as plt
from matplotlib import mlab
WIND_DIR_INDEX = 97
WIND_SPEED_INDEX = 96
bfrfile = sys.argv[1]
print(bfrfile)
bfr = bufr.BUFRFile(bfrfile)
lon = []
lat = []
wind_d = []
wind_s = []
for record in bfr:
    for entry in record:
        if entry.index == WIND_DIR_INDEX:
            wind_d.append(entry.data)
        if entry.index == WIND_SPEED_INDEX:
            wind_s.append(entry.data)
        if entry.name.find("LONGITUDE") == 0:
            lon.append(entry.data)
        if entry.name.find("LATITUDE") == 0:
            lat.append(entry.data)
lons = np.concatenate(lon)
lats = np.concatenate(lat)
winds_d = np.concatenate(wind_d)
winds_s = np.concatenate(wind_s)
winds_d = np.ma.masked_greater(winds_d,1.0e+6)
winds_s = np.ma.masked_greater(winds_s,1.0e+6)
windu = np.cos((winds_d-180)*(np.pi/180))
windv = np.sin((winds_d-180)*(np.pi/180))
# Data interpolation for pcolormesh (needs gridded data)
xi = np.linspace(lons.min(),lons.max(),lons.size//10)
yi = np.linspace(lats.min(),lats.max(),lats.size//10)
Z = mlab.griddata(lons,lats,winds_s,xi,yi)
X,Y = np.meshgrid(xi,yi)
mydpi = 96
fig = plt.figure(frameon=True)
fig.set_size_inches(1600/mydpi,1200/mydpi)
ax = plt.Axes(fig,[0,0,1,1])
#ax.set_axis_off()
fig.add_axes(ax)
plt.hold(True);
plt.quiver(lons[::5],lats[::5],windu[::5],windv[::5],linewidths=0)
for method in (ax.set_xticks,ax.set_xticklabels,ax.set_yticks,ax.set_yticklabels):
    method([])
fig.savefig('/home/cendas/bin/python/bufr_ascat.png',bbox_inches=0,dpi=5*mydpi)
mydpi = 96
fig = plt.figure(frameon=True)
fig.set_size_inches(1600/mydpi,1200/mydpi)
ax = plt.Axes(fig,[0,0,1,1])
#ax.set_axis_off()
fig.add_axes(ax)
plt.hold(True);
try:
    plt.pcolormesh(X,Y,Z,alpha=None)
    plt.clim(0,10)
except ValueError:
    print("Warning: Empty data array.")
for method in (ax.set_xticks,ax.set_xticklabels,ax.set_yticks,ax.set_yticklabels):
    method([])
fig.savefig('/home/cendas/bin/python/bufr_ascat_color.png',bbox_inches=0,dpi=5*mydpi)
I then usually follow this python code with the following terminal commands to combine the images:
convert bufr_ascat.png -transparent white bufr_ascat.png
convert bufr_ascat_color.png -transparent white bufr_ascat_color.png
composite bufr_ascat.png bufr_ascat_color.png bufrascat.png
Don't abuse clustering for this.
What you need is a simple selection / filtering; not a structure discovery process.
Take the mean position of the masked data, which should lie in the gap between the two bands. All non-masked data to the left of that mean is one band; all non-masked data to the right is the other.
Clustering is the wrong tool for this task.
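A minimal sketch of that selection, on a made-up 1-D example (the longitudes and speeds are assumptions for illustration, with 1e7 standing in for the fill value). It assumes the gap between the bands can be separated by a single longitude threshold, which is a simplification for an inclined satellite track.

```python
import numpy as np

# Made-up data: two valid bands separated by invalid (fill) values
lons = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
speed_raw = np.array([5.0, 6.0, 7.0, 1e7, 1e7, 1e7, 4.0, 3.0, 2.0])
speed = np.ma.masked_greater(speed_raw, 1.0e6)

# Mean longitude of the masked points marks the middle of the gap
mask = np.ma.getmaskarray(speed)
split_lon = lons[mask].mean()

# Plain boolean filtering, no clustering involved
left = ~mask & (lons < split_lon)
right = ~mask & (lons > split_lon)

print(lons[left], lons[right])
```

Each boolean selection can then be gridded and drawn with its own pcolormesh call, so nothing is interpolated across the gap.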