indexing issues when extracting specified latitude and longitude - python

I want to extract a specified latitude and longitude from a netCDF file. In the past, I have never had issues with extracting the data. I am assuming that the reason it is not working this time is that I read in my data differently (see below):
import netCDF4
import numpy as n

data = netCDF4.Dataset('/home/eburrows/metr173/regional_cm/Lab1/air.mon.mean.nc', mode='r')
lat = data.variables['lat'][:]        # 90 through -90
lon = data.variables['lon'][:]        # 0 through 360
air_temp = data.variables['air'][:]   # degrees C
air_temp[air_temp > 10000] = n.NaN
Previously I have been able to do the following:
us_lat = n.ravel(n.where((lat>=___)&(lat<=___)))
us_lon = n.ravel(n.where((lon>=___)&(lon<=___)))
us_annual_temp = n.nanmean(air_temp[:,us_lat, us_lon],0)
This time, however, it returns a TypeError stating that list indices must be integers, not tuple.
I then forced the tuple into a list by wrapping us_lat and us_lon in list(...), i.e. list(n.ravel(n.where(...))), but it still returns the same error. In the past I have been able to index this way, and I am not entirely sure why it is not working this time around.

The result of the where command, lat_us, is a tuple of index arrays, not the array of indices needed to slice air_temp. To fix this, index the first element of lat_us to access the array of latitude indices.
For instance,
>>> import numpy as np
>>> lat = np.arange(-90,91,10)
>>> lat
array([-90, -80, -70, -60, -50, -40, -30, -20, -10, 0, 10, 20, 30,
40, 50, 60, 70, 80, 90])
>>> lat_us = np.where((lat >= -30) & (lat <= 30))
>>> lat_us
(array([ 6, 7, 8, 9, 10, 11, 12]),)
>>> lat_us[0]
array([ 6, 7, 8, 9, 10, 11, 12])
So the line
us_lat = n.ravel(n.where((lat>=___)&(lat<=___)))
should be modified to (note: I don't think you need the ravel here either):
us_lat = n.where((lat>=___) & (lat<=___))[0]
Also, you are currently reading in only one dimension for the variable air_temp, but it appears to be 3D (time x lat x lon), so you need to modify the read-in of this variable to include all three dimensions:
air_temp = data.variables['air'][:,:,:]
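Putting both fixes together, a minimal sketch of the extraction (the latitude/longitude bounds are hypothetical placeholders; us_lat[:, None] and us_lon[None, :] broadcast the two index arrays into a grid so the whole lat/lon box is selected rather than paired points):
us_lat = n.where((lat >= 25) & (lat <= 50))[0]      # hypothetical latitude bounds
us_lon = n.where((lon >= 235) & (lon <= 295))[0]    # hypothetical longitude bounds

us_temp = air_temp[:, us_lat[:, None], us_lon[None, :]]   # shape: (time, n_lat, n_lon)
us_annual_temp = n.nanmean(us_temp, axis=0)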

Related

Numpy: given a set of ranges, is there an efficient way to find the set of ranges that are disjoint with all other ranges?

Is there an elegant way to find the set of disjoint ranges from a set of ranges in numpy?
ranges = [[0, 3], [2, 4], [5, 10]]  # there are about 50 000 elements
disjoint_ranges = []  # these are all disjoint
adjoint_ranges = []   # these do not all have to be mutually adjoint
for index, range_1 in enumerate(ranges):
    i, j = range_1  # all ranges are ordered s.t. i < j
    for range_2 in ranges[index+1:]:  # the list of ranges is ordered by increasing i
        a, b = range_2
        if a < j and a > i:
            adjoint_ranges.append(range_1)
            adjoint_ranges.append(range_2)
        else:
            if range_1 not in adjoint_ranges:
                disjoint_ranges.append(range_1)
print(adjoint_ranges)
print(disjoint_ranges)
Looping over a numpy array kinda defeats the purpose of using numpy. You can detect disjoint ranges by leveraging np.maximum.accumulate.
With your ranges sorted in order of their lower bound, you can accumulate the maximum of the upper bounds to determine the coverage of previous ranges over subsequent ones. Then compare the lower bound of each range to the reach of the previous ones to know if there is a forward overlap. Then you only need to compare the upper bound of each range with the next one's lower bound to detect backward overlaps. The combination of forward and backward overlaps will allow you to flag all overlapping ranges and, by elimination, find the ones that are completely disjoint from others:
import numpy as np
ranges = np.array( [ [1,8], [10,15], [2,5], [18,24], [7,10] ] )
ranges.sort(axis=0)
overlaps = np.zeros(ranges.shape[0], dtype=bool)
overlaps[1:] = ranges[1:,0] < np.maximum.accumulate(ranges[:-1,1])
overlaps[:-1] |= ranges[1:,0] < ranges[:-1,1]
disjoints = ranges[overlaps==False]
print(disjoints)
[[10 15]
[18 24]]
I'm not sure about numpy, but there is the following with pandas:
from functools import reduce
import pandas as pd
ranges = [
    pd.RangeIndex(10, 20),
    pd.RangeIndex(15, 25),
    pd.RangeIndex(30, 50),
    pd.RangeIndex(40, 60),
]
disjoints = reduce(lambda x, y : x.symmetric_difference(y), ranges)
disjoints
Int64Index([10, 11, 12, 13, 14, 20, 21, 22, 23, 24, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
dtype='int64')

Renumber/Relabel a Numpy array based on coordinates

I have a segmentation map (numpy.ndarray) that contains objects labeled with unique numbers. I want to combine objects across multiple slices by labeling them with the same number. Specifically, I want to renumber objects based on a DataFrame containing centroid positions and the desired label value.
First, I created some mock labels and a DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "slice": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    "number": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3],
    "x": [10, 20, 30, 40, 11, 21, 31, 12, 22, 32],
    "y": [10, 20, 30, 40, 11, 21, 31, 12, 22, 32]
})

def make_segmap(df):
    x, y = np.indices((50, 50))
    maps = []
    # Iterate over slices and coordinates
    for n_slice in df["slice"].unique():
        masks = []
        for row in df[df["slice"] == n_slice].iterrows():
            # Create circle
            mask_circle = (x - row[1]["x"])**2 + (y - row[1]["y"])**2 < 5**2
            # Random index number (here just a multiple)
            masks.append(mask_circle * row[1]["number"]*3)
        maps.append(np.max(masks, axis=0))
    return np.stack(maps, axis=0)

segmap = make_segmap(df)
For renumbering, this is what I came up with so far:
new_maps = []
# Iterate over slices
for n_slice in df["slice"].unique():
    new_labels = []
    for row in df[df["slice"] == n_slice].iterrows():
        # Find current value at position
        original_label = segmap[n_slice, row[1]["y"], row[1]["x"]]
        # Replace all label occurrences with the desired label from the DataFrame
        replaced_label = np.where(segmap[n_slice] == original_label, row[1]["number"], 0)
        new_labels.append(replaced_label)
    new_maps.append(np.max(new_labels, axis=0))

new_segmap = np.stack(new_maps, axis=0)
This works reasonably well but doesn't scale to larger datasets. The real dataset has thousands of objects across hundreds of slices, and this approach takes a very long time to run (an hour or so). Are there any suggestions on how to replace multiple values at once to improve performance?
Thanks in advance.
You can use groupby to replace the current quadratic search algorithm with a (quasi) linear one. Moreover, you can take advantage of Numpy's vectorization and broadcasting to remove the inner loop and make the computation faster.
Here is a faster implementation:
def make_segmap_fast(df):
    x, y = np.indices((50, 50))
    maps = []
    # Iterate over slices and coordinates
    for n_slice, subDf in df.groupby("slice"):
        subDf_x = subDf["x"].to_numpy()[:, None, None]
        subDf_y = subDf["y"].to_numpy()[:, None, None]
        subDf_number = subDf["number"].to_numpy()[:, None, None]
        # Create circle
        mask_circle = (x - subDf_x)**2 + (y - subDf_y)**2 < 5**2
        # Random index number (here just a multiple)
        masks = mask_circle * subDf_number
        maps.append(np.max(masks, axis=0) * 3)
    return np.stack(maps, axis=0)
On my machine, this is 2 times faster on the very small example (much more on bigger dataframes).
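As a quick sanity check, the two versions can be compared directly on the mock DataFrame df from the question (a sketch; the exact timings will vary by machine and data size):
# Verify both implementations produce the same segmentation map, then time them roughly
print(np.array_equal(make_segmap(df), make_segmap_fast(df)))   # expected: True

import timeit
print(timeit.timeit(lambda: make_segmap(df), number=100))
print(timeit.timeit(lambda: make_segmap_fast(df), number=100))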

how to include first element of a list before computing the difference

I'm creating a histogram. I currently have this block of code:
g = [479, 481, 503, 525, 554, 586, 614, 669, 683]
and then I've written this for the x and y axes:
from numpy import diff

x = [28, 27, 26, 25, 24, 23, 22, 21, 20]
y = diff(g)
This is what it computes y as:
array([ 2, 22, 22, 29, 32, 28, 55, 14])
However, I realized that my histogram doesn't include 479 (the first element of g) ahead of the differences computed from there onwards, which is what I was hoping for. My desired output is
array([ 479, 2, 22, 22, 29, 32, 28, 55, 14])
Is there a way that I can do this? I don't want to manually append it as I need to automate it for various files.
There are two main ways of prepending elements to a diff: before or after the fact. If you want to prepend a zero before computing the difference, you can use the prepend argument, available as of numpy v1.16.0:
y = np.diff(g, prepend=0)
This is equivalent to manually inserting a zero into your array (in case your version of numpy is older):
y = np.diff(np.insert(g, 0, 0))
You can do something very similar after the diff, by inserting g[0] at the beginning:
y = np.insert(np.diff(g), 0, g[0])
However, all the options shown here are inefficient because they copy all your data (g or the diff). A space-efficient solution would allocate an output buffer, and compute the difference manually:
g = np.asarray(g)  # g in the question is a list; make it an array for vectorized arithmetic
y = np.empty_like(g)
y[1:] = g[1:] - g[:-1]
y[0] = g[0]
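For reference, on the g from the question the variants above reproduce the desired array (a quick check, assuming numpy >= 1.16 for the prepend argument):
import numpy as np

g = [479, 481, 503, 525, 554, 586, 614, 669, 683]
print(np.diff(g, prepend=0))            # [479   2  22  22  29  32  28  55  14]
print(np.insert(np.diff(g), 0, g[0]))   # same result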

Subsetting A Pytorch Tensor Using Square-Brackets

I came across a line of code used to reduce a 3D Tensor to a 2D Tensor in PyTorch. The 3D tensor x is of size torch.Size([500, 50, 1]) and this line of code:
x = x[lengths - 1, range(len(lengths))]
was used to reduce x to a 2D tensor of size torch.Size([50, 1]). lengths is also a tensor of shape torch.Size([50]) containing values.
Please can anyone explain how this works? Thank you.
After being quite stumped by this behavior, I did some more digging and found that it is consistent with the indexing of multi-dimensional NumPy arrays. What makes it counter-intuitive is the less obvious fact that both index arrays have to have the same length, i.e. in this case len(lengths).
In fact, it works as the following:
* lengths determines the order in which you access the first dimension. I.e., if you have a 1D array a = [0, 1, 2, ..., 500] and access it with the list b = [300, 200, 100], then a[b] = [300, 200, 100]: b selects which positions along the first dimension are taken, and in what order (see the short check after this list). In the question, lengths - 1 converts the 1-based lengths into the zero-based indices of the last entries along that dimension.
* range(len(lengths)) then simply chooses the i-th element in the i-th selected row. If you had a square matrix, you could interpret this as taking its diagonal. Since you only access a single element for each position along the first two dimensions, the result can be stored in a single dimension (thus reducing your 3D tensor to 2D). The last dimension is simply kept as is.
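A short check of the 1D indexing described in the first bullet (a minimal numpy sketch with made-up values):
import numpy as np

a = np.arange(501)        # [0, 1, 2, ..., 500]
b = [300, 200, 100]
print(a[b])               # [300 200 100] -- b picks which elements are taken, and in what order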
If you want to play around with this, I strongly recommend changing the range() value to something longer or shorter, which will result in the following error:
IndexError: shape mismatch: indexing arrays could not be broadcast
together with shapes (x,) (y,)
where x and y are your specific length values.
To write this indexing out in long form and see what happens "under the hood", consider the example below:
import torch

x = torch.randn(500, 50, 1)              # same shape as in the question
lengths = torch.tensor([2, 30, 1, 4])    # random examples to explore
diag = list(range(len(lengths)))         # [0, 1, 2, 3]

result = []
for i, row in enumerate(lengths):
    temp_tensor = x[row, :, :]           # temp_tensor.shape = [50, 1]
    temp_tensor = temp_tensor[diag[i]]   # temp_tensor.shape = [1]
    result.append(temp_tensor)

# back to pytorch
result = torch.stack(result)
result.shape                             # torch.Size([4, 1])
The key feature here is passing the values of the tensor lengths as indices for x.
Here is a simplified example; I swapped the dimensions of the container so that the index dimension goes first:
container = torch.arange(0, 50)
container = container.reshape((5, 10))
>>>tensor([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49]])
indices = torch.arange( 2, 7, dtype=torch.long )
>>>tensor([2, 3, 4, 5, 6])
print( container[ range( len(indices) ), indices] )
>>>tensor([ 2, 13, 24, 35, 46])
Note: we get one element from each row (range(len(indices)) produces the sequential row numbers), with the column number given by indices[row_number].
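Tying this back to the shapes in the question (a sketch with random stand-in tensors):
import torch

x = torch.randn(500, 50, 1)                 # same shape as in the question
lengths = torch.randint(1, 501, (50,))      # one (1-based) length per column
out = x[lengths - 1, range(len(lengths))]   # for column j, take row lengths[j] - 1
print(out.shape)                            # torch.Size([50, 1])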

Interpolation of datetimes for smooth matplotlib plot in python

I have lists of datetimes and values like this:
import datetime
x = [datetime.datetime(2016, 9, 26, 0, 0), datetime.datetime(2016, 9, 27, 0, 0),
datetime.datetime(2016, 9, 28, 0, 0), datetime.datetime(2016, 9, 29, 0, 0),
datetime.datetime(2016, 9, 30, 0, 0), datetime.datetime(2016, 10, 1, 0, 0)]
y = [26060, 23243, 22834, 22541, 22441, 23248]
And can plot them like this:
import matplotlib.pyplot as plt
plt.plot(x, y)
I would like to be able to plot a smooth version using more x-points. So first I do this:
delta_t = max(x) - min(x)
N_points = 300
xnew = [min(x) + i*delta_t/N_points for i in range(N_points)]
Then attempting a spline fit with scipy:
from scipy.interpolate import spline
ynew = spline(x, y, xnew)
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
What is the best way to proceed? I am open to solutions involving other libraries such as pandas or plotly.
You're trying to pass a list of datetimes to the spline function, but datetimes are Python objects (hence dtype('O')). You need to convert them to a numeric format first, and then convert back afterwards if you wish:
int_x = [i.total_seconds() for i in x]
ynew = spline(int_x, y, xnew)
Edit: total_seconds() is actually a timedelta method, not a datetime one. However, it looks like you sorted it out, so I'll leave this answer as is.
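For completeness, total_seconds() does work once you take differences against a reference datetime (a sketch along the lines of this answer, reusing x, y and xnew from the question):
t0 = min(x)
int_x = [(t - t0).total_seconds() for t in x]       # datetime - datetime -> timedelta
int_xnew = [(t - t0).total_seconds() for t in xnew]
ynew = spline(int_x, y, int_xnew)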
Figured something out:
x_ts = [x_.timestamp() for x_ in x]
xnew_ts = [x_.timestamp() for x_ in xnew]
ynew = spline(x_ts, y, xnew_ts)
plt.plot(xnew, ynew)
This works very nicely, but I'm still open to ideas for simpler methods.
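Note that scipy.interpolate.spline was later deprecated and removed from SciPy; with current versions the same idea can be written with make_interp_spline (a minimal sketch, reusing x, y and xnew from above):
import numpy as np
from scipy.interpolate import make_interp_spline

x_ts = np.array([x_.timestamp() for x_ in x])
xnew_ts = np.array([x_.timestamp() for x_ in xnew])

ynew = make_interp_spline(x_ts, y, k=3)(xnew_ts)    # cubic B-spline through the points
plt.plot(xnew, ynew)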
