Avoid for-loops in assignment of data values - python

So this is a little follow up question to my earlier question: Generate coordinates inside Polygon and my answer https://stackoverflow.com/a/15243767/1740928
In fact, I want to bin polygon data to a regular grid. Therefore, I calculate a couple of coordinates within the polygon and translate their lat/lon combination to their respective column/row combo of the grid.
Currently, the row/column information is stored in a numpy array with its number of rows corresponding to the number of data polygons and its number of columns corresponding to the coordinates in the polygon.
The rest of the code runs in under a second, but this part is the bottleneck at the moment (~7 seconds):
for ii in np.arange(len(data)):
    for cc in np.arange(data_lats.shape[1]):
        final_grid[ row[ii,cc], col[ii,cc] ] += data[ii]
        final_grid_counts[ row[ii,cc], col[ii,cc] ] += 1
The array "data" simply contains the data values for each polygon (80000,). The arrays "row" and "col" contain the row and column number of a coordinate in the polygon (shape: (80000,16)).
As you can see, I am summing up all data values within each grid cell and count the number of matches. Thus, I know the average for each grid cell in case different polygons intersect it.
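For reference, the averaging step afterwards is just one vectorized division (guarding against cells that no polygon touches), e.g.:
cell_mean = np.divide(final_grid, final_grid_counts,
                      out=np.zeros_like(final_grid, dtype=float),
                      where=final_grid_counts > 0)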
Still, how can these two for loops take around 7 seconds? Can you think of a faster way?

I think numpy should add an nd-bincount function; I had one lying around from a project I was working on some time ago.
import numpy as np
def two_d_bincount(row, col, weights=None, shape=None):
    if shape is None:
        shape = (row.max() + 1, col.max() + 1)
    row = np.asarray(row, 'int')
    col = np.asarray(col, 'int')
    x = np.ravel_multi_index([row, col], shape)
    out = np.bincount(x, weights, minlength=np.prod(shape))
    return out.reshape(shape)
weights = np.column_stack([data] * row.shape[1])
final_grid = two_d_bincount(row.ravel(), col.ravel(), weights.ravel())
final_grid_counts = two_d_bincount(row.ravel(), col.ravel())
I hope this helps.

I might not fully understand the shapes of your different grids, but you can maybe eliminate the cc loop using something like this:
final_grid = np.empty((nrows,ncols))
for ii in xrange(len(data)):
    final_grid[row[ii,:],col[ii,:]] = data[ii]
This of course assumes that final_grid starts with no other info (that the count you're incrementing starts at zero). And I'm not sure how to test whether it works without understanding how your row and col arrays are laid out.
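One caveat with the fancy-indexed assignment above (and this also applies to +=): repeated (row, col) pairs within a single ii are written only once, so duplicate cells are not accumulated. If that matters, np.add.at performs an unbuffered accumulation; a sketch of the same loop with it, assuming final_grid and final_grid_counts start at zero:
final_grid = np.zeros((nrows,ncols))
final_grid_counts = np.zeros((nrows,ncols))
for ii in range(len(data)):
    # accumulate this polygon's value into every cell it touches, counting duplicates correctly
    np.add.at(final_grid, (row[ii,:], col[ii,:]), data[ii])
    np.add.at(final_grid_counts, (row[ii,:], col[ii,:]), 1)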

Related

Applying numpy masks with multiple matching criteria

I am a python newbie, trying to understand how to work with numpy masks better.
I have two 2D data arrays plus axis values, so something like
import numpy as np
data1=np.arange(50).reshape(10,5)
data2=np.random.rand(10,5)
x=5*np.arange(5)+15
y=2*np.arange(10)
Where x contains the coordinate values for one axis of data1 and data2 (the length-5 axis), and y gives the coordinates for the other (length-10) axis.
I want to identify and count all the points in data1 for which
data1>D1min,
the corresponding x values are inside a given range, XRange, and
the corresponding y values are inside a given range, YRange.
Then, when I am all done, I also need to do a check to make sure none of the corresponding data2 values are less than another limit, D2Max
so if
XRange = [27,38]
YRange = [2,12]
D1min = 23
D2Max = 0.8
I would want to include cells 3:4 in the x direction and 1:6 in the y direction (assuming I want to include the limiting values).
In numpy index terms (y along the first axis, x along the second, with exclusive upper bounds) that means I would only consider data1[1:7,3:5].
Then the limits of the values in the 2D arrays come into it, so I want to identify and count the points for which data1[1:7,3:5] > 23.
Once I have done that I want to take those data locations and check to see if any of those locations have values <0.8 in data2.
In reality I don't have formulas for x and y, and the arrays are much larger. Also, x and y might not even be monotonic.
I figure I should use numpy masks for this, and I have managed to do it, but the result seems really tortured - I think the code would be clearer if I just looped through the values in the 2D arrays.
I think the main problem is that I have trouble combining masks with boolean operations. The ideas I get from searching online often don't seem to work on arrays.
I assume there is an elegant and (hopefully) understandable way to do this in just a few lines with masks. Would anyone care to explain it to me?
Well I eventually came up with something, so I thought I'd post it. I welcome suggested improvements.
#expand x and y into 2D arrays so that they can more
#easily be used for masking using tile
x2D = np.tile(x,(len(y),1))
y2D = np.tile(y,(len(x),1)).T
#mask these based on the ranges in X and Y
Xmask = np.ma.masked_outside(x2D,XRange[0],XRange[1]).mask
Ymask = np.ma.masked_outside(y2D,YRange[0],YRange[1]).mask
#then combine them
#Not sure I need the shrink=False, but it seems safer
XYmask = np.ma.mask_or(Xmask, Ymask,shrink=False)
#now mask the data1 array based on D1min.
highdat = np.ma.masked_less(data1,D1min)
#combine with XYmask
data1mask = np.ma.mask_or(highdat.mask, XYmask,shrink=False)
#apply to data1
data1masked = np.ma.masked_where(data1mask,data1)
#number of points fulfilling my criteria
print('Number of points: ',np.ma.count(data1masked))
#transfer mask from data1 to data2
data2masked = np.ma.masked_where(data1mask, data2)
#do my check based on data2
if data2masked.min() < D2Max: print('data2 values are low!')
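For comparison, the selection described in the question can probably also be written with plain boolean arrays and broadcasting, without np.ma at all. A sketch using the example values above (as in the tile-based code, x varies along the columns and y along the rows):
#1D range masks for each axis
x_in = (x >= XRange[0]) & (x <= XRange[1])
y_in = (y >= YRange[0]) & (y <= YRange[1])
#combine with the data1 threshold via broadcasting to the (10,5) shape
sel = (data1 > D1min) & y_in[:,None] & x_in[None,:]
print('Number of points: ', np.count_nonzero(sel))
#check the selected locations in data2
if (data2[sel] < D2Max).any(): print('data2 values are low!')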

Identify the grid particles belong to

A square box of size 10,000*10,000 has 1,000,000 particles distributed uniformly. The box is divided into grids, each of size 100*100, so there are 10,000 grid cells in total. At every time-step (for a total of 2016 steps), I would like to identify the grid cell to which each particle belongs. Is there an efficient way to implement this in Python? My implementation is below and currently takes approximately 83 s for one run.
import numpy as np
import time
start=time.time()
# Size of the layout
Layout = np.array([0,10000])
# Total Number of particles
Population = 1000000
# Array to hold the cell number
cell_number = np.zeros((Population),dtype=np.int32)
# Limits of each cell
boundaries = np.arange(0,10100,step=100)
cell_boundaries = np.dstack((boundaries[0:100],boundaries[1:101]))
# Position of Particles
points = np.random.uniform(0,Layout[1],size = (Population,2))
# Generating a list with the x,y boundaries of each cell in the grid
x = []
limit_list = cell_boundaries
for i in range(0,Layout[1]//100):
    for j in range(0,Layout[1]//100):
        x.append([limit_list[0][i,0],limit_list[0][i,1],limit_list[0][j,0],limit_list[0][j,1]])
# Identifying the cell to which the particles belong
i=0
for y in (x):
    cell_number[(points[:,1]>y[0])&(points[:,1]<y[1])&(points[:,0]>y[2])&(points[:,0]<y[3])]=i
    i+=1
print(time.time()-start)
I am not sure about your code. You seem to be accumulating the i variable globally, while it should be accumulated on a per-cell basis, correct? Something like cell_number[???] += 1, maybe?
Anyhow, the way I see it is from a different perspective. You could start by assigning each point a cell id, then invert the resulting array with a kind of counter function. I have implemented the following in PyTorch; you will most likely find equivalent utilities in NumPy.
The conversion from 2D point coordinates to cell ids corresponds to applying floor to the coordinates and then unfolding them according to your grid's width.
>>> p = torch.from_numpy(points).floor()
>>> p_unfold = p[:, 0]*10000 + p[:, 1]
Then you can "inverse" the statistics, i.e. find out how many particles there are in each respective cell based on the cell ids. This can be done using PyTorch's histogram counter, torch.histc:
>>> torch.histc(p_unfold, bins=Population)
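The answer mentions that NumPy has equivalent utilities; a hedged NumPy sketch of the same idea, here dividing by the 100-unit cell width first so that the ids correspond to the question's 100x100 grid of cells, might look like this:
# integer cell coordinates of every particle, then a flat cell id
# (x-major id ordering is a convention choice; adjust if you need a different labeling)
ix = (points[:,0] // 100).astype(np.int64)
iy = (points[:,1] // 100).astype(np.int64)
cell_number = ix * 100 + iy
# number of particles per cell (10,000 cells in total)
counts = np.bincount(cell_number, minlength=100*100)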

Shuffling the lower triangle in a large matrix

I have a large, sparse adjacency matrix (64948 x 64948) that is symmetrical across the diagonal. What I need to do is randomize the locations of the nonzero elements in the upper or lower triangle of the matrix (which will then get transposed.) I have code that does this below. It works with 10x10 matrices, but not with 64948x64948 (I run into memory errors on a cluster). I realize my method might be flawed, and I would appreciate if anyone has any insight about how I could do this in a more efficient way!
First, I create "mask," which is essentially an array of every location in the lower triangle.
import numpy as np
import random
from scipy import sparse

mask_mtx = np.ones([10,10]) #all ones
mask_mtx = np.tril(mask_mtx,-1) #lower triangle ones
mask_mtx = sparse.csr_matrix(mask_mtx)
mask = sparse.find(mask_mtx) #indices of ones
np.save('struc_conn_mat_mask.npy',mask) #cluster fails here when n=64948. I'm trying to save it out from a cluster so I can use the mask on my local machine with code below
len_mask = len(mask[0]) #how many indices there are
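As an aside, the same list of lower-triangle positions can likely be obtained without building a dense matrix at all, e.g. with np.tril_indices (a sketch; note that for n = 64948 the two index arrays still hold roughly 2.1 billion entries each, so memory remains a concern):
n = 10 # or 64948
rows, cols = np.tril_indices(n, k=-1) #row/column indices of the strict lower triangle
len_mask = len(rows) #how many lower-triangle positions there are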
I create mtx as an array for the purposes of this example, but usually I will be reading in a 65k x 65k csr_matrix. I then find the number of nonzero elements in the mtx lower triangle and randomly pick that many locations from the mask. I then put 1s at these locations in an empty tmp_mtx. Finally, I transpose the lower triangle to upper triangle.
mtx = sparse.random(10,10,format='csr',density=0.1) #for the purposes of this example, create random matrix
lmtx = sparse.tril(mtx,-1,format='csr') #lower triangle
tmp_mtx = np.zeros((10,10)) #empty lower triangle to set
lvals = sparse.csr_matrix.count_nonzero(lmtx) #how many 1s in lmtx?
coordinate_indices = random.sample(range(len_mask),lvals) #choose n=lvals random indices to fill with ones
for idx in coordinate_indices:
    tmp_mtx[mask[0][idx]][mask[1][idx]] = 1 #at randomly chosen index from mask, put a 1
tmp_mtx = sparse.csr_matrix(tmp_mtx)
mtx = tmp_mtx + tmp_mtx.T #transpose to upper triangle
Again, this works fine with 10x10 matrices, but fails at several places with larger matrices. Ultimately, what I want to do is a seemingly simple operation--shuffle the triangle--but I can't think of how to do it in a more efficient way. Perhaps there is some way of shuffling the columns and rows (but just for one of the triangles?)
Any help would be so, so appreciated! Thank you.

Counting the number of times a threshold is met or exceeded in a multidimensional array in Python

I have a numpy array that I brought in from a netCDF file with the shape (930, 360, 720), organized as (time, latitudes, longitudes).
At each lat/lon pair for each of the 930 time stamps, I need to count the number of times that the value meets or exceeds a threshold "x" (such as 0.2 or 0.5 etc.) and ultimately calculate the percentage that the threshold was exceeded at each point, then output the results so they can be plotted later on.
I have attempted numerous methods but here is my most recent:
lat_length = len(lats)
#where lats has been defined earlier when unpacked from the netCDF dataset
lon_length = len(lons)
#just as lats; also these were defined before using np.meshgrid(lons, lats)
for i in range(0, lat_length):
    for j in range(0, lon_length):
        if ice[:,i,j] >= x:
            #code to count number of occurrences here
            #code to calculate percentage here
            percent_ice[i,j] += count / len(time) #calculation

#then go on to plot percent_ice
I hope this makes sense! I would greatly appreciate any help. I'm self taught in Python so I may be missing something simple.
Would this be a time to use the any() function? What would be the most efficient way to count the number of times the threshold was exceeded and then calculate the percentage?
You can compare the input 3D array against the threshold x and then sum along the first axis with ndarray.sum(axis=0) to get the count, and from that the percentages, like so:
# Count after thresholding with x, summing along the first (time) axis
count = (ice >= x).sum(axis=0)
# Get percentages (ratios) by dividing by the first axis length
percent_ice = np.true_divide(count,ice.shape[0])
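Equivalently, since the mean of a boolean array is just the fraction of True values, the ratio can presumably be computed in a single step:
percent_ice = (ice >= x).mean(axis=0)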
Ah, look, another meteorologist!
There are probably multiple ways to do this and my solution is unlikely to be the fastest since it uses numpy's MaskedArray, which is known to be slow, but this should work:
Numpy has a data type called a MaskedArray which actually contains two normal numpy arrays. It contains a data array as well as a boolean mask. I would first mask all data that are greater than or equal to my threshold (use np.ma.masked_greater() for just greater than):
ice = np.ma.masked_greater_equal(ice, x)
You can then use ice.count() to determine how many values are below your threshold for each lat/lon point by specifying that you want to count along a specific axis:
n_good = ice.count(axis=0)
This should return a 2-dimensional array containing the number of good points. You can then calculate the number of bad points by subtracting n_good from ice.shape[0]:
n_bad = ice.shape[0] - n_good
and calculate the percentage that are bad using:
perc_bad = n_bad/float(ice.shape[0])
There are plenty of ways to do this without using MaskedArray. This is just the easy way that comes to mind for me.

Fastest way to get bounding boxes around segments in a label map

A 3D label map is a matrix in which every pixel (voxel) has an integer label. The segments are expected to be contiguous, meaning that a segment with label k will not be fragmented.
Given such label map (segmentation), what is the fastest way to obtain the coordinates of a minimum bounding box around each segment, in Python?
I have tried the following:
Iterate through the matrix using a multi-index iterator (np.nditer) and construct a reverse-index dictionary. This means that for every label you get the 3 coordinates of every voxel where the label is present.
For every label get the max and min of each coordinate.
The good thing is that you get all the location information in one O(N) pass. The bad thing is that I don't need this detailed information. I just need the extremities, so there might be a faster way to do this, using some numpy functions that are faster than so many list appends. Any suggestions?
The one pass through the matrix takes about 8 seconds on my machine, so it would be great to get rid of it. To give an idea of the data, there are a few hundred labels in a label map. Sizes of the label map can be 700x300x30 or 300x300x200 or something similar.
Edit: I now store only the running max and min per coordinate for every label. This removes the need to maintain and store all those large appended lists.
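For reference, a minimal sketch of the reverse-index pass described above (the slow O(N) approach the question wants to speed up) might look roughly like this, with label_map as a placeholder name for the 3D array:
import numpy as np
from collections import defaultdict

def bounding_boxes_naive(label_map):
    # collect the coordinates of every voxel, grouped by label
    coords = defaultdict(list)
    it = np.nditer(label_map, flags=['multi_index'])
    for val in it:
        coords[int(val)].append(it.multi_index)
    # per label: (min, max) index along each of the three axes
    boxes = {}
    for lab, c in coords.items():
        c = np.array(c)
        boxes[lab] = (c.min(axis=0), c.max(axis=0))
    return boxes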
If I understood your problem correctly, you have groups of voxels, and you would like to have the extremes of a group in each axis.
Let's define:
arr: 3D array of integer labels
labels: list of labels (integers 0..labmax)
The code:
import numpy as np

# number of the highest label:
labmax = np.max(labels)

# first and last positions along each axis, both initialized to a sentinel (maximum int32)
b_first = np.iinfo('int32').max * np.ones((3, labmax + 1), dtype='int32')
b_last = np.iinfo('int32').max * np.ones((3, labmax + 1), dtype='int32')

# run through all of the dimensions making 2D slices and marking all existing labels to b
for dim in range(3):
    # create a generic slice object to make the slices
    sl = [slice(None), slice(None), slice(None)]
    bf = b_first[dim]
    bl = b_last[dim]
    # go through all slices in this dimension
    for k in range(arr.shape[dim]):
        # create the slice object
        sl[dim] = k
        # update the last "seen" vector
        bl[arr[tuple(sl)].flatten()] = k
        # if we have smaller values in "last" than in "first", update "first"
        bf[:] = np.clip(bf, None, bl)
After this operation we have six vectors giving the smallest and largest indices for each axis. For example, the bounding values along second axis of label 13 are b_first[1][13] and b_last[1][13]. If some label is missing, all corresponding b_first and b_last will be the maximum int32 value.
I tried this with my computer, and for a (300,300,200) array it takes approximately 1 sec to find the values.
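As a small usage illustration (assuming the arrays above), the full bounding box of, say, label 13 can then be read off as:
# (first, last) slice index along each of the three axes for label 13
bbox_13 = [(int(b_first[d][13]), int(b_last[d][13])) for d in range(3)]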
