Applying numpy masks with multiple matching criteria - python

I am a python newbie, trying to understand how to work with numpy masks better.
I have two 2D data arrays plus axis values, so something like
import numpy as np
data1=np.arange(50).reshape(10,5)
data2=np.random.rand(10,5)
x=5*np.arange(5)+15
y=2*np.arange(10)
Here x gives the coordinates along the second axis (the 5 columns) of data1 and data2, and y gives the coordinates along the first axis (the 10 rows).
I want to identify and count all the points in data1 for which
data1 > D1min,
the corresponding x values are inside a given range, XRange, and
the corresponding y values are inside a given range, YRange.
Then, when I am all done, I also need to check that none of the corresponding data2 values are less than another limit, D2Max.
so if
XRange = [27,38]
YRange = [2,12]
D1min = 23
D2Max = 0.8
I would want to include indices 3 and 4 in the x direction (columns) and indices 1 through 6 in the y direction (rows), assuming I want to include the limiting values.
That means I would only consider data1[1:7, 3:5].
Then the limits of the values in the 2D arrays come into it, so I want to identify and count points for which data1[1:7, 3:5] > 23.
Once I have done that I want to take those data locations and check to see if any of those locations have values <0.8 in data2.
In reality I don't have formulas for x and y, and the arrays are much larger. Also, x and y might not even be monotonic.
I figure I should use numpy masks for this and I have managed to do it, but the result seems really tortured - I think the code would be clearer if I just looped through the values in the 2D arrays.
I think the main problem is that I have trouble combining masks with boolean operations. The ideas I get from searching online often don't seem to work on arrays.
I assume there is an elegant and (hopefully) understandable way to do this in just a few lines with masks. Would anyone care to explain it to me?

Well I eventually came up with something, so I thought I'd post it. I welcome suggested improvements.
#expand x and y into 2D arrays so that they can more
#easily be used for masking using tile
x2D = np.tile(x,(len(y),1))
y2D = np.tile(y,(len(x),1)).T
#mask these based on the ranges in X and Y
Xmask = np.ma.masked_outside(x2D,XRange[0],XRange[1]).mask
Ymask = np.ma.masked_outside(y2D,YRange[0],YRange[1]).mask
#then combine them
#Not sure I need the shrink=False, but it seems safer
XYmask = np.ma.mask_or(Xmask, Ymask,shrink=False)
#now mask the data1 array based on D1min.
highdat = np.ma.masked_less(data1,D1min)
#combine with XYmask
data1mask = np.ma.mask_or(highdat.mask, XYmask,shrink=False)
#apply to data1
data1masked = np.ma.masked_where(data1mask,data1)
#number of points fulfilling my criteria
print('Number of points: ',np.ma.count(data1masked))
#transfer mask from data1 to data2
data2masked = np.ma.masked_where(data1mask, data2)
#do my check based on data2
if data2masked.min() < D2Max: print('data2 values are low!')
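For comparison, the same logic can be written with plain boolean arrays and broadcasting, with no masked arrays at all. This is only a sketch of an alternative, assuming (as in the code above) that x labels the columns of data1/data2 and y labels the rows:
#boolean conditions, broadcast against the (10,5) shape of data1
xok = (x >= XRange[0]) & (x <= XRange[1])   #shape (5,), one flag per column
yok = (y >= YRange[0]) & (y <= YRange[1])   #shape (10,), one flag per row
keep = xok[None, :] & yok[:, None] & (data1 > D1min)
#use >= D1min instead if you want to match masked_less above, which keeps values equal to D1min
print('Number of points: ', keep.sum())
#check data2 at the same locations
if keep.any() and data2[keep].min() < D2Max:
    print('data2 values are low!')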

Related

SavGol Filter gives same length error on 3-D array

Hi, I'm trying to apply a Savitzky-Golay filter to a 3D array of data that I have (magnetic field data with x, y, z coordinates). When I run my program, I get the error: TypeError: expected x and y to have same length. My array is 460798 units long, with each unit being a list of coordinates [x y z]. I think it has something to do with the window size parameter. When I set it to three, it runs fine, but my data points aren't smoothed. Higher than three, it does not work.
I am trying to get the function to smooth the 3-D array.
mag = cdf['Mag'][start_ind:stop_ind] #mag is a 3-D array with coordinate elements [x y z]
mag_smoothed = signal.savgol_filter(x=mag, window_length=5, polyorder=2)
print mag_smoothed[1]
I'm supposed to get a smoothed 3-D array back, I believe.
File "/Users/sosa/research/Python Files/MagnometerPlot.py", line 33, in plot
mag_smoothed = signal.savgol_filter(x=mag, window_length=7, polyorder=2,axis=1)
File "/Users/sosa/anaconda/lib/python2.7/site-packages/scipy/signal/_savitzky_golay.py", line 339, in savgol_filter
_fit_edges_polyfit(x, window_length, polyorder, deriv, delta, axis, y)
File "/Users/sosa/anaconda/lib/python2.7/site-packages/scipy/signal/_savitzky_golay.py", line 217, in _fit_edges_polyfit
polyorder, deriv, delta, y)
File "/Users/sosa/anaconda/lib/python2.7/site-packages/scipy/signal/_savitzky_golay.py", line 187, in _fit_edge
xx_edge, polyorder)
File "/Users/sosa/anaconda/lib/python2.7/site-packages/numpy/lib/polynomial.py", line 559, in polyfit
raise TypeError("expected x and y to have same length")
TypeError: expected x and y to have same length
Do you think that if I separate the x, y, z components of the mag array and apply the filter to each component separately, the result would be equivalent?
I think it could be a reasonable approximation, but that's highly subjective and depends on what you're planning to do with your data. If you're trying to do precision measurements this might not be the best way to process your data.
Since I'm not sure whether you're working with volume data or surface data (with z being the magnitude at x, y), I'll use a 3D surface as an example. (Let's say it's a 2D array of magnitudes, arr1.)
What we want to do: Smooth the surface with SG.
What we can do with scipy's SG-Filter: Smooth a 1D line.
But a surface is just a set of lines side by side, so to work around it we might do the following:
1) Smooth every row in arr1 (filter along axis=1). We put all the smoothed rows into a new array, arr2.
2) Now we do the same with every column of arr2 (filter along axis=0) and generate arr3, which is, nominally, the "2D-smoothed" surface.
But it isn't, not quite. For a given data point, the 1D filter calculates a new value by taking into account the point itself and several adjacent values. But in a 2D set, that data point has more adjacent values which the 1D filter doesn't see, because those values are in the wrong row (or column). It would probably arrive at a different value if it could see them.
The easiest way to convince yourself that the step-wise smoothing isn't perfect is to do it twice, but the second time you reverse the order. First you work along columns, then rows. In a perfect world the final results should agree, whether you started with rows or columns. As it is you'll probably find they're slightly different.
If your data is quite uniform without many 'jagged' peaks or jumps (e.g. noise) you probably wouldn't have any problems. Otherwise you may see more significant differences between the two results.
A quick google search did show up various discussions about 2D-Savitzky-Golay filters, so investigating that might be worthwhile for you.
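For reference, here is a minimal sketch of the row-then-column approach described above; arr1 is just a made-up 2D array of magnitudes standing in for your data:
import numpy as np
from scipy.signal import savgol_filter

arr1 = np.random.rand(50, 60)   #placeholder surface data

#1) smooth along each row (the filter runs along axis=1)
arr2 = savgol_filter(arr1, window_length=7, polyorder=2, axis=1)
#2) smooth the result along each column (the filter runs along axis=0)
arr3 = savgol_filter(arr2, window_length=7, polyorder=2, axis=0)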
If your data are organized in columns (one sample per row, shape (N, 3)), you have to filter along the first axis. The default axis=-1 (or axis=1, as in your traceback) filters along the length-3 coordinate axis, which is shorter than your window_length and is most likely what triggers the polyfit length error:
mag_smoothed = signal.savgol_filter(x=mag, window_length=5, polyorder=2, axis=0)
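Regarding the earlier question about filtering each component separately: for an (N, 3) array, filtering along axis=0 is the same as filtering each column on its own, so the results match. A small check, using a random stand-in for mag:
import numpy as np
from scipy.signal import savgol_filter

mag = np.random.rand(1000, 3)   #stand-in for the (N, 3) magnetometer array

smoothed = savgol_filter(mag, window_length=5, polyorder=2, axis=0)
per_component = np.column_stack(
    [savgol_filter(mag[:, k], window_length=5, polyorder=2) for k in range(3)])
print(np.allclose(smoothed, per_component))   #True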

fill Numpy array with axisymmetric values

I'm trying to find a fast way to fill a Numpy array with rotation symmetric values. Imagine an array of zeros containing a cone shaped area. I have a 1D array of values and want to rotate it 360° around the center of the array. There is no 2D function like z=f(x,y), so I can't calculate the 2D values explicitly. I have something that works, but the for-loop is too slow for big arrays. This should make a circle:
values = np.ones(100)
x = np.arange(values.size)-values.size/2+0.5
y = values.size/2-0.5-np.arange(values.size)
x,y = np.meshgrid(x,y)
grid = np.rint(np.sqrt(x**2+y**2))
arr = np.zeros_like(grid)
for i in np.arange(values.size//2):
    arr[grid==i] = values[i+values.size//2]
My 1D array is of course not as simple. Can someone think of a way to get rid of the for-loop?
Update: I want to make a circular filter for convolutional blurring. Previously I used np.outer(values,values), which gave me a rectangular filter. David's hint allows me to create a circular filter very quickly. See below:
(Images: square filter with np.outer(); circular filter with David's answer.)
You can use fancy indexing to achieve this:
values = np.ones(100)
x = np.arange(values.size)-values.size/2+0.5
y = values.size/2-0.5-np.arange(values.size)
x,y = np.meshgrid(x,y)
grid = np.rint(np.sqrt(x**2+y**2)).astype(int)
arr = np.zeros(grid.shape)   #float output array, so the values from `values` are not truncated to int
size_half = values.size // 2
inside = (grid < size_half)
arr[inside] = values[grid[inside] + size_half]
Here, inside selects the indices that lie inside the circle, since only these items can be derived from values.
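If the goal is the convolutional blur mentioned in the question's update, one possible follow-up (a sketch only; scipy.ndimage.convolve and the names kernel/image are my assumptions, not part of the answer) is to normalize the rotated profile and convolve with it:
import numpy as np
from scipy.ndimage import convolve

#small flat disc as a stand-in for the rotated 1D profile built above
yy, xx = np.mgrid[-10:11, -10:11]
disc = (np.hypot(xx, yy) <= 10).astype(float)

kernel = disc / disc.sum()          #normalize so the blur preserves the overall level
image = np.random.rand(100, 100)    #placeholder for the array you want to blur
blurred = convolve(image, kernel)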
You can do something like this:
import numpy as np
import matplotlib.pyplot as plt

x = y = np.arange(-500, 501)
r = np.random.randint(0, 256, int(len(x)/np.sqrt(2)) + 1)
X, Y = np.meshgrid(x, y)
im = np.sqrt(X*X + Y*Y)
circles = r.take(np.int64(im))
plt.imshow(circles)

Counting the number of times a threshold is met or exceeded in a multidimensional array in Python

I have a numpy array that I read in from a netCDF file with the shape (930, 360, 720), organized as (time, latitudes, longitudes).
At each lat/lon pair for each of the 930 time stamps, I need to count the number of times that the value meets or exceeds a threshold "x" (such as 0.2 or 0.5 etc.) and ultimately calculate the percentage that the threshold was exceeded at each point, then output the results so they can be plotted later on.
I have attempted numerous methods but here is my most recent:
lat_length = len(lats)
#where lats has been defined earlier when unpacked from the netCDF dataset
lon_length = len(lons)
#just as lats; also these were defined before using np.meshgrid(lons, lats)
for i in range(0, lat_length):
    for j in range(0, lon_length):
        if ice[:,i,j] >= x:
            #code to count number of occurrences here
            #code to calculate percentage here
            percent_ice[i,j] += count / len(time) #calculation
#then go on to plot percent_ice
I hope this makes sense! I would greatly appreciate any help. I'm self taught in Python so I may be missing something simple.
Would this be a time to use the any() function? What would be the most efficient way to count the number of times the threshold was exceeded and then calculate the percentage?
You can compare the input 3D array against the threshold x and then sum along the first axis with ndarray.sum(axis=0) to get the count, and from that the percentages, like so -
# Count after thresholding with x and summing along the first axis
count = (ice >= x).sum(axis=0)
# Get percentages (ratios) by dividing by the first axis length
percent_ice = np.true_divide(count, ice.shape[0])
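A quick usage sketch with a small made-up array, just to show the shapes involved (the real array comes from the netCDF file):
import numpy as np

ice = np.random.rand(93, 36, 72)    #smaller stand-in for the (time, lat, lon) variable
x = 0.5                             #threshold

count = (ice >= x).sum(axis=0)      #(36, 72) array of exceedance counts
percent_ice = count / ice.shape[0]  #fraction of the time steps at each lat/lon point
print(percent_ice.shape)            #(36, 72)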
Ah, look, another meteorologist!
There are probably multiple ways to do this and my solution is unlikely to be the fastest since it uses numpy's MaskedArray, which is known to be slow, but this should work:
Numpy has a data type called a MaskedArray which actually contains two normal numpy arrays. It contains a data array as well as a boolean mask. I would first mask all data that are greater than or equal to my threshold (use np.ma.masked_greater() for just greater than):
ice = np.ma.masked_greater_equal(ice, x)
You can then use ice.count() to determine how many values are below your threshold for each lat/lon point by specifying that you want to count along a specific axis:
n_good = ice.count(axis=0)
This should return a 2-dimensional array containing the number of good points. You can then calculate the number of bad by subtracting n_good from ice.shape[0]:
n_bad = ice.shape[0] - n_good
and calculate the percentage that are bad using:
perc_bad = n_bad/float(ice.shape[0])
There are plenty of ways to do this without using MaskedArray. This is just the easy way that comes to mind for me.
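Putting the MaskedArray steps together, reusing the same small stand-in shapes as above and passing the threshold explicitly:
import numpy as np

ice = np.random.rand(93, 36, 72)               #stand-in for the real data
x = 0.5

masked = np.ma.masked_greater_equal(ice, x)    #mask values that meet or exceed the threshold
n_good = masked.count(axis=0)                  #unmasked (below-threshold) values per lat/lon
n_bad = ice.shape[0] - n_good                  #times the threshold was met or exceeded
perc_bad = n_bad / float(ice.shape[0])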

From 1D graph to 2D mask

I have calculated the boundaries in which I want to sample points.
For example, one dataset looks like this (image not included here).
Now I want to find points in the red area, which I do in the following way:
The plot consists of 10 lines, so I reshape to get the region limits per value of x.
limits = data.reshape(data.shape + (5, 2))
Now, for a particular value of x, the limits look like:
limits[20] = array([[ 5.65624197, 6.70331962],
[ 13.68248989, 14.77227669],
[ 15.50973796, 16.61491606],
[ 24.03948128, 25.14907398],
[ 26.41541777, 27.53475798]])
I thought to make a mesh and mask the area as follows:
X, Y = np.meshgrid(xs, ys)
bool_array = np.zeros(Y.shape)
for j, y in enumerate(limits):
    for min_y, max_y in y:
        inds = np.where(np.logical_and(ys >= min_y, ys <= max_y))[0]
        bool_array[inds, j] = True
plt.imshow(bool_array[::-1])
(I don't know why the graph needs to be plotted inverted.)
This results in the mask shown in the image (not included here), which is indeed the data I'm looking for; now I could use the True values to take points with a different function.
The problem is that this code is very slow, and my datasets will get much bigger.
I would like to find a more efficient way of building this "mask".
I tried several things and ended up with the following, which worked for my simple cases:
low_bound = limits[:,:,0]
upp_bound = limits[:,:,1]
mask = np.any((low_bound[:,None,:] <= Y.T[:,:,None]) & ( Y.T[:,:,None] <= upp_bound[:,None,:]),axis=-1).T
I know it looks ugly. What I do is introduce an additional dimension, along which I check whether each y value lies between a pair of lower and upper bounds. At the end I collapse the additional dimension using np.any.
I don't know how much faster it is compared to your code, but given that I don't use a single for loop, there should be a performance boost.
Check the code with your data and tell me if something goes wrong.
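To make the broadcasting explicit, here is a tiny self-contained example of the same idea (the shapes are made up; it also uses the 1D ys directly instead of Y.T, which is equivalent because every row of Y.T is just ys):
import numpy as np

ys = np.linspace(0, 30, 7)                     #sample y coordinates, shape (n_y,)
#two x positions, each with two (low, high) bands: shape (n_x, n_bands, 2)
limits = np.array([[[2., 5.], [10., 14.]],
                   [[1., 3.], [20., 25.]]])

low_bound = limits[:, :, 0]                    #(n_x, n_bands)
upp_bound = limits[:, :, 1]

#broadcast (n_x, 1, n_bands) against (1, n_y, 1) -> (n_x, n_y, n_bands)
in_band = (low_bound[:, None, :] <= ys[None, :, None]) & (ys[None, :, None] <= upp_bound[:, None, :])
mask = np.any(in_band, axis=-1)                #(n_x, n_y): True where y falls inside any band
print(mask.astype(int))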
Edit:
plt.imshow plots (0,0) in the lower-left corner when you use
plt.imshow(mask,origin='lower')

Avoid for-loops in assignment of data values

So this is a little follow-up question to my earlier question, Generate coordinates inside Polygon, and my answer there: https://stackoverflow.com/a/15243767/1740928
In fact, I want to bin polygon data to a regular grid. Therefore, I calculate a couple of coordinates within the polygon and translate their lat/lon combination to their respective column/row combo of the grid.
Currently, the row/column information is stored in a numpy array with its number of rows corresponding to the number of data polygons and its number of columns corresponding to the coordinates in the polygon.
The rest of the code takes less than a second, but this part is the bottleneck at the moment (~7 sec):
for ii in np.arange(len(data)):
    for cc in np.arange(data_lats.shape[1]):
        final_grid[ row[ii,cc], col[ii,cc] ] += data[ii]
        final_grid_counts[ row[ii,cc], col[ii,cc] ] += 1
The array "data" simply contains the data values for each polygon (80000,). The arrays "row" and "col" contain the row and column number of a coordinate in the polygon (shape: (80000,16)).
As you can see, I am summing up all data values within each grid cell and count the number of matches. Thus, I know the average for each grid cell in case different polygons intersect it.
Still, how can these two for loops take around 7 seconds? Can you think of a faster way?
I think numpy should add an nd-bincount function. I had one lying around from a project I was working on some time ago:
import numpy as np

def two_d_bincount(row, col, weights=None, shape=None):
    if shape is None:
        shape = (row.max() + 1, col.max() + 1)
    row = np.asarray(row, 'int')
    col = np.asarray(col, 'int')
    x = np.ravel_multi_index([row, col], shape)
    out = np.bincount(x, weights, minlength=np.prod(shape))
    return out.reshape(shape)

weights = np.column_stack([data] * row.shape[1])
final_grid = two_d_bincount(row.ravel(), col.ravel(), weights.ravel())
final_grid_counts = two_d_bincount(row.ravel(), col.ravel())
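To get the per-cell average the question asks about, you can then divide the two grids, guarding against cells that no polygon touched (this extra step is not part of the original answer):
#avoid division by zero for empty grid cells
mean_grid = np.divide(final_grid, final_grid_counts,
                      out=np.zeros_like(final_grid), where=final_grid_counts > 0)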
I hope this helps.
I might not fully understand the shapes of your different grids, but you can maybe eliminate the cc loop using something like this:
final_grid = np.empty((nrows,ncols))
for ii in xrange(len(data)):
    final_grid[row[ii,:],col[ii,:]] = data[ii]
This of course assumes that final_grid starts with no other info (that the count you're incrementing starts at zero). And I'm not sure how to test whether it works without understanding how your row and col arrays are built.
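A side note on the duplicate-index issue: plain fancy-indexed assignment (and even +=) only applies once per unique (row, col) pair, so it cannot accumulate contributions from repeated cells. One way around that, which is not in the original answers, is np.add.at, which does unbuffered in-place accumulation. A small sketch with made-up shapes:
import numpy as np

nrows, ncols = 5, 6
data = np.arange(4.0)                           #one value per polygon
row = np.random.randint(0, nrows, (4, 3))       #cell indices hit by each polygon
col = np.random.randint(0, ncols, (4, 3))

final_grid = np.zeros((nrows, ncols))
final_grid_counts = np.zeros((nrows, ncols))
np.add.at(final_grid, (row, col), data[:, None])   #data broadcast over each polygon's cells
np.add.at(final_grid_counts, (row, col), 1)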
