Shuffling the lower triangle in a large matrix - python

I have a large, sparse adjacency matrix (64948 x 64948) that is symmetrical across the diagonal. What I need to do is randomize the locations of the nonzero elements in the upper or lower triangle of the matrix (which will then get transposed.) I have code that does this below. It works with 10x10 matrices, but not with 64948x64948 (I run into memory errors on a cluster). I realize my method might be flawed, and I would appreciate if anyone has any insight about how I could do this in a more efficient way!
First, I create "mask," which is essentially an array of every location in the lower triangle.
import numpy as np
import random
from scipy import sparse

mask_mtx = np.ones([10,10]) #all ones
mask_mtx = np.tril(mask_mtx,-1) #lower triangle ones
mask_mtx = sparse.csr_matrix(mask_mtx)
mask = sparse.find(mask_mtx) #indices of ones
np.save('struc_conn_mat_mask.npy',mask) #cluster fails here when n=64948. I'm trying to save it out from the cluster so I can use the mask on my local machine with the code below
len_mask = len(mask[0]) #how many indices there are
I create mtx as a random sparse matrix for the purposes of this example, but usually I will be reading in a 65k x 65k csr_matrix. I then count the nonzero elements in the lower triangle of mtx, randomly pick that many locations from the mask, and put 1s at those locations in an empty tmp_mtx. Finally, I add the transpose so the lower triangle is mirrored into the upper triangle.
mtx = sparse.random(10,10,format='csr',density=0.1) #for the purposes of this example, create random matrix
lmtx = sparse.tril(mtx,-1,format='csr') #lower triangle
tmp_mtx = np.zeros((10,10)) #empty lower triangle to set
lvals = sparse.csr_matrix.count_nonzero(lmtx) #how many 1s in lmtx?
coordinate_indices = random.sample(range(len_mask),lvals) #choose n=lvals random indices to fill with ones
for idx in coordinate_indices:
    tmp_mtx[mask[0][idx]][mask[1][idx]] = 1 #at randomly chosen index from mask, put a 1
tmp_mtx = sparse.csr_matrix(tmp_mtx)
mtx = tmp_mtx + tmp_mtx.T #transpose to upper triangle
Again, this works fine with 10x10 matrices, but fails at several places with larger matrices. Ultimately, what I want to do is a seemingly simple operation--shuffle the triangle--but I can't think of how to do it in a more efficient way. Perhaps there is some way of shuffling the columns and rows (but just for one of the triangles?)
Any help would be so, so appreciated! Thank you.
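One direction that avoids materializing the mask entirely (a sketch only, not tested at the 64948 scale): sample lvals distinct linear indices from the n*(n-1)/2 strictly-lower-triangle slots and convert each index to its (row, col) pair analytically, then build the shuffled matrix directly in COO format. The helper name shuffle_lower_triangle is made up for illustration.

import math
import random
import numpy as np
from scipy import sparse

def shuffle_lower_triangle(lmtx, n):
    lvals = lmtx.count_nonzero()              # how many nonzeros to re-place
    total = n * (n - 1) // 2                  # number of strictly-lower-triangle slots
    ks = random.sample(range(total), lvals)   # distinct slots, without building the mask
    rows = np.empty(lvals, dtype=np.int64)
    cols = np.empty(lvals, dtype=np.int64)
    for out, k in enumerate(ks):
        i = (1 + math.isqrt(8 * k + 1)) // 2  # row of the k-th lower-triangle slot
        j = k - i * (i - 1) // 2              # column within that row
        rows[out], cols[out] = i, j
    data = np.ones(lvals, dtype=np.int8)
    tmp = sparse.coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()
    return tmp + tmp.T                        # mirror into the upper triangle

The per-index loop can be vectorized with numpy if it turns out too slow, but either way memory stays proportional to the number of nonzeros rather than to the full triangle.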

Related

Trim an array with respect to another array with numpy

I am handling a set of data recorded by a 2D detector. Therefore, the data are represented by three arrays: x and y labelling the coordinate of a pixel and intensity storing the measured signal.
For example, a 6x6 grid will give a set of data:
xraw = np.array([0,1,2,3,4,5,0,1,2,3,4,5,...])
yraw = np.array([0,0,0,0,0,0,1,1,1,1,1,1,...])
intensity = np.array([i_00,i_01,i_02,i_03,i_04,i_05,i_10,i_11,...])
Due to various reasons, such as pixel defects, some of the data points are discarded in the raw data. Therefore, xraw, yraw, intensity have a size smaller than 36 (if that's a 6x6 grid), with, say, the point at (2,3) missing.
The intensity data needs further treatment by an element-wise multiplication with another array. This treatment array is from theoretical calculation and so it has a size of nxn (6x6 in this case). However, as some of the points in the true data are missing, the two arrays have different sizes.
I can use a loop to check for the missing points and eliminate the corresponding element in the treatment array. I wonder if there are some methods in numpy that take care of such operations. Thanks
First, construct the indices of available and all possible pixel positions by
avail_ind = yraw * w + xraw
all_ind = np.arange(0, h * w)
where h and w are the image's height and width in pixels.
Then, find the indices of the missing pixels by
missing_ind = all_ind[~np.in1d(all_ind, avail_ind)]
Once you have the missing indices, use np.delete to construct a copy of the treatment_array with the elements at those indices removed, then simply multiply that with your intensity array.
result = intensity * np.delete(treatment_array, missing_ind)
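Put together, a minimal runnable sketch (the example data here is made up: a 6x6 grid with the pixel at (x=2, y=3) missing; the intensities and treatment_array are random placeholders):

import numpy as np

h, w = 6, 6
xraw = np.array([x for y in range(h) for x in range(w) if not (x == 2 and y == 3)])
yraw = np.array([y for y in range(h) for x in range(w) if not (x == 2 and y == 3)])
intensity = np.random.rand(xraw.size)            # placeholder measured signal
treatment_array = np.random.rand(h, w).ravel()   # placeholder theoretical correction, flattened to match

avail_ind = yraw * w + xraw                      # linear indices of the recorded pixels
all_ind = np.arange(h * w)
missing_ind = all_ind[~np.in1d(all_ind, avail_ind)]

result = intensity * np.delete(treatment_array, missing_ind)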

Calculating the nearest neighbour in a 2d grid using a multilevel solution

I have a problem where, in a grid of x*y size, I am provided a single dot and I need to find the nearest neighbour. In practice, I am trying to find the closest dot to the cursor in pygame that crosses a color distance threshold, calculated as follows:
sqrt(((rgb1[0]-rgb2[0])**2)+((rgb1[1]-rgb2[1])**2)+((rgb1[2]-rgb2[2])**2))
So far I have a function that calculates the different resolutions of the grid, reducing it by a factor of two each time while always keeping the darkest pixel. It looks as follows:
from PIL import Image
from typing import Dict
import numpy as np

#we input a pillow image object and retrieve a dictionary with every grid version of the 3 dimensional array:
def calculate_resolutions(image: Image) -> Dict[int, np.ndarray]:
    resolutions = {}
    #we start with the highest resolution image, the size of which we initially divide by 1, then 2, then 4 etc.:
    divisor = 1
    #reduce the grid by 5 iterations
    resolution_iterations = 5
    for i in range(resolution_iterations):
        pixel_lookup = image.load() #convert image to PixelValues object, which allows for pixel lookup via [x,y] index
        #calculate the resolution of the new grid, round upwards:
        resolution = (int((image.size[0] - 1) // divisor + 1), int((image.size[1] - 1) // divisor + 1))
        #generate 3d array with new grid resolution, fill in values that are darker than white:
        new_grid = np.full((resolution[0],resolution[1],3),np.array([255,255,255]))
        for x in range(image.size[0]):
            for y in range(image.size[1]):
                if not x%divisor and not y%divisor:
                    darkest_pixel = (255,255,255)
                    x_range = divisor if x+divisor<image.size[0] else (0 if image.size[0]-x<0 else image.size[0]-x)
                    y_range = divisor if y+divisor<image.size[1] else (0 if image.size[1]-y<0 else image.size[1]-y)
                    for x_ in range(x,x+x_range):
                        for y_ in range(y,y+y_range):
                            if pixel_lookup[x_,y_][0]+pixel_lookup[x_,y_][1]+pixel_lookup[x_,y_][2] < darkest_pixel[0]+darkest_pixel[1]+darkest_pixel[2]:
                                darkest_pixel = pixel_lookup[x_,y_]
                    if darkest_pixel != (255,255,255):
                        new_grid[int(x/divisor)][int(y/divisor)] = np.array(darkest_pixel)
        resolutions[i] = new_grid
        divisor = divisor*2
    return resolutions
This is the most performance efficient solution I was able to come up with. If this function is run on a grid that continually changes, like a video with x fps, it will be very performance intensive. I also considered using a kd-tree algorithm that simply adds and removes any dots that happen to change on the grid, but when it comes to finding individual nearest neighbours on a static grid this solution has the potential to be more resource efficient. I am open to any kinds of suggestions in terms of how this function could be improved in terms of performance.
Now, I am in a position where, for example, I try to find the nearest neighbour of the current cursor position in a 100x100 grid. The resulting reduced grids are 50^2, 25^2, 13^2, and 7^2. Suppose a part of the grid looks like the image in the original question (not reproduced here): on an aggregation step, the visible part of the grid consists of six large squares, the black one being the current cursor position and the orange dots being dots where the color distance threshold is crossed. At that level I would not know which diagonally located closest neighbour I would want to pick to search next. In this case, going one aggregation step down shows that the lower left would be the right choice. Depending on how many grid layers I have, this could result in a very large error in the nearest neighbour search. Is there a good way to solve this problem? If there are multiple squares that show they have a relevant location, do I have to search them all in the next step to be sure? And if that is the case, the further away I get, the more I would need math such as the Pythagorean theorem to check whether the two positive squares I find overlap in terms of distance and could potentially contain the closest neighbour, which would start to be performance intensive again if the function is called frequently. Would it still make sense to pursue this solution over a regular k-d tree? For now the grid size is still fairly small (~800x600), but if the grid gets larger the performance may start suffering again. Is there a good, scalable solution to this problem that could be applied here?
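For comparison, a sketch of the plain k-d tree route mentioned above, using scipy.spatial.cKDTree (the dot coordinates below are made-up example data; in practice they would be the pixels that cross the color distance threshold):

import numpy as np
from scipy.spatial import cKDTree

qualifying_dots = np.array([[12, 40], [55, 17], [80, 81], [33, 90]])  # example dot positions
tree = cKDTree(qualifying_dots)        # rebuild (or update) whenever the grid changes
cursor = np.array([50, 50])            # current cursor position
distance, index = tree.query(cursor)   # nearest qualifying dot
nearest_dot = qualifying_dots[index]

Building the tree costs roughly O(n log n) per grid change and each query is O(log n), which may be a useful baseline to compare the multilevel approach against.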

Drawing 5 elements uniformly from a list of lists

There's some context to this, so bear with me please.
I have a list of lists, call it nested_lists, where each list is of the form [[1,2,3,...], [4,3,1,...]] (i.e. each list contains two lists of integers). Now, in each of these lists, the two lists of integers have the same length and two integers corresponding to the same index represent a coordinate in R^2.
So for example, (1,4) would be one coordinate from the above example.
Now, my task is to draw 5 unique coordinates from nested_lists uniformly (i.e. each coordinate has the same probability of being chosen), without replacement. That is, from all of the coordinates from the lists in nested_lists, I am trying to draw 5 unique coordinates uniformly without replacement.
One very straightforward way to do this would be to:
1. Create a list of ALL the unique coordinates in nested_lists.
2. Use numpy.random.choice to sample 5 elements uniformly without replacement.
The code would be something like this:
import numpy as np

coordinates = []
#Get list of all unique coordinates
for lst in nested_lists:
    l = len(lst[0])
    for i in range(0, l):
        coordinate = (lst[0][i], lst[1][i])
        if coordinate not in coordinates:
            coordinates += [coordinate]
#np.random.choice needs a 1-D array, so draw indices uniformly and map them back to coordinates
draw_idx = np.random.choice(len(coordinates), 5, replace=False)
draws = [coordinates[i] for i in draw_idx]
But getting a set of all the unique coordinates can be very computationally expensive, especially if nested_lists contains millions of lists, each with thousands of coordinates in them. So I'm looking for methods to perform the same draws without having to get a list of all the coordinates first.
One method I thought of would be to sample with weighted probabilities from each list in nested_lists.
So get a list of the sizes (number of coordinates) of each list, then go through each list and draw each coordinate with probability (size/sum(sizes))*(1/sum(sizes)). Repeating the process until 5 unique coordinates are drawn should then correspond to what we wanted to draw. The code would be something like this:
no_coordinates = lambda x: len(x[0])
sizes = list(map(no_coordinates, nested_lists))
i = 0
sum_sizes = sum(sizes)
draws = []
while i != 5: #to make sure we get 5 draws
    for lst in nested_lists:
        size = len(lst[0])
        p = size/(sum_sizes**2)
        for j in range(0, size):
            if i >= 5: #exit the for loop when we reach 5 draws
                break
            if np.random.random() < p and (lst[0][j], lst[1][j]) not in draws:
                draws.append((lst[0][j], lst[1][j]))
                i += 1
The code above seems to be more computationally efficient, but I am not sure if it actually draws with the same probability that would be required overall. From my calculation, the overall probability would be sum(sizes)/sum_sizes**2, which is the same as 1/sum_sizes (our required probability), but again, I'm not sure if this is correct.
So I was wondering if there are more efficient approaches to drawing like I want, and if my approach is actually correct or not.
You can use bootstrapping. Basically, the idea is to draw some large (but fixed) number of coordinates with replacement to estimate the probability of each coordinate. Then, you can subsample from this list using the transformed densities.
import random
from collections import Counter

bootstrap_sample_size = 1000
total_lists = len(nested_lists)
list_len = len(nested_lists[0][0])  # coordinates per list (assumes all lists have the same length)
# a set would make more sense in this example;
# I used a Counter to allow for future statistical manipulations
c = Counter()
for _ in range(bootstrap_sample_size):
    x, y = random.randrange(total_lists), random.randrange(list_len)
    random_point = nested_lists[x][0][y], nested_lists[x][1][y]
    c.update((random_point,))
# now c contains counts for 1000 points drawn with replacement
# let's just ignore these probabilities to get a uniform sample
result = random.sample(list(c.keys()), 5)
This will not be exactly uniform, but the bootstrap provides statistical guarantees that it gets arbitrarily close to the uniform distribution as bootstrap_sample_size is increased. 1000 samples is usually enough for most real-life applications.
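For comparison, a sketch of an exact approach that also avoids building the full coordinate list: draw global indices uniformly over the total number of stored coordinates and map each one back to a (list, position) pair via the cumulative sizes, rejecting duplicates in the same spirit as the question's second snippet. The helper name draw_unique_coordinates is made up for illustration, and the result is only uniform over unique coordinates if the same coordinate does not appear in several lists.

import numpy as np

def draw_unique_coordinates(nested_lists, k=5):
    sizes = np.array([len(lst[0]) for lst in nested_lists])
    cum = np.cumsum(sizes)                             # cumulative sizes, for index-to-list mapping
    total = int(cum[-1])
    draws = set()
    while len(draws) < k:
        g = np.random.randint(total)                   # uniform over all stored coordinates
        li = np.searchsorted(cum, g, side='right')     # which list the global index falls into
        j = g - (cum[li - 1] if li > 0 else 0)         # position within that list
        draws.add((nested_lists[li][0][j], nested_lists[li][1][j]))
    return list(draws)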

Numpy griddata interpolation up to certain radius

I'm using griddata() to interpolate my (irregular) 2-dimensional depth measurements: x, y, depth. The method does a great job, but it interpolates over the entire grid wherever it can find two opposing points. I don't want that behaviour. I'd like the interpolation to stay around the existing measurements, say up to the extent of a certain radius.
Is it possible to tell numpy/scipy: don't interpolate if you're too far from an existing measurement, and return a NODATA value instead? Ideally something like: ideal = griddata(.., .., .., radius=5.0)
edit example:
In the image below (not reproduced here), black dots are the measurements. Shades of blue are the cells interpolated by numpy. The area marked in green is in fact part of the picture but is considered NODATA by numpy (because there are no points in between). Now, the red areas are interpolated, but I want to get rid of them. Any ideas?
Ok cool. I don't think there is a built-in option for griddata() that does what you want, so you will need to write it yourself.
This comes down to calculating the distances between N input data points and M interpolation points. This is simple enough to do, but if you have a lot of points it can be slow at ~O(M*N). Here's an example that calculates the distances to all N data points for each interpolation point. If the number of data points within the radius is at least neighbors, it keeps the value. Otherwise it writes the value of NODATA.
neighbors is 4 because griddata() will use bilinear interpolation, which needs points bounding the interpolant in each dimension (2*2 = 4).
import numpy as np

#invec - input points, Nx2 numpy array
#mvec - interpolation points, Mx2 numpy array
#just some random points for example
N = 100
invec = 10*np.random.random([N,2])
M = 50
mvec = 10*np.random.random([M,2])
# --- here you would put your griddata() call, returning interpolated_values
interpolated_values = np.zeros(M)
NODATA = np.nan
radius = 5.0
neighbors = 4
for m in range(M):
    data_in_radius = np.sqrt(np.sum( (invec - mvec[m])**2, axis=1)) <= radius
    if np.sum(data_in_radius) < neighbors:
        interpolated_values[m] = NODATA
Edit:
Ok re-read and noticed the input is really 2D. Example modified.
Just as an additional comment, this could be greatly accelerated if you first build a coarse mapping from each point mvec[m] to a subset of the relevant data points.
The costliest step in the loop would change from
np.sqrt(np.sum( (invec - mvec[m])**2, axis=1))
to something like
np.sqrt(np.sum( (invec[subset[m]] - mvec[m])**2, axis=1))
There are plenty of ways to do this, for example using a quadtree, a hashing function, or a 2D index. Whether this gives a performance advantage depends on the application, how your data is structured, etc.
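As one concrete way to build such a subset (an assumption on my part, not from the original answer), scipy.spatial.cKDTree can return the indices of all data points within the radius of every interpolation point in one call, which also replaces the explicit distance computation:

from scipy.spatial import cKDTree

tree = cKDTree(invec)                           # index the N data points once
subset = tree.query_ball_point(mvec, r=radius)  # per interpolation point: indices of data points within the radius
for m in range(M):
    if len(subset[m]) < neighbors:
        interpolated_values[m] = NODATA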

Avoid for-loops in assignment of data values

So this is a little follow-up question to my earlier question, Generate coordinates inside Polygon, and my answer https://stackoverflow.com/a/15243767/1740928
In fact, I want to bin polygon data to a regular grid. Therefore, I calculate a couple of coordinates within the polygon and translate their lat/lon combination to their respective column/row combo of the grid.
Currently, the row/column information is stored in a numpy array with its number of rows corresponding to the number of data polygons and its number of columns corresponding to the coordinates in the polygon.
The rest of the code takes less than a second, but this part is the bottleneck at the moment (~7 seconds):
for ii in np.arange(len(data)):
    for cc in np.arange(data_lats.shape[1]):
        final_grid[ row[ii,cc], col[ii,cc] ] += data[ii]
        final_grid_counts[ row[ii,cc], col[ii,cc] ] += 1
The array "data" simply contains the data values for each polygon (80000,). The arrays "row" and "col" contain the row and column number of a coordinate in the polygon (shape: (80000,16)).
As you can see, I am summing up all data values within each grid cell and count the number of matches. Thus, I know the average for each grid cell in case different polygons intersect it.
Still, how can these two for loops take around 7 seconds? Can you think of a faster way?
I think numpy should add an nd-bincount function; I had one lying around from a project I was working on some time ago.
import numpy as np

def two_d_bincount(row, col, weights=None, shape=None):
    if shape is None:
        shape = (row.max() + 1, col.max() + 1)
    row = np.asarray(row, 'int')
    col = np.asarray(col, 'int')
    x = np.ravel_multi_index([row, col], shape)
    out = np.bincount(x, weights, minlength=np.prod(shape))
    return out.reshape(shape)

weights = np.column_stack([data] * row.shape[1])
final_grid = two_d_bincount(row.ravel(), col.ravel(), weights.ravel())
final_grid_counts = two_d_bincount(row.ravel(), col.ravel())
I hope this helps.
I might not fully understand the shapes of your different grids, but you can maybe eliminate the cc loop using something like this:
final_grid = np.empty((nrows,ncols))
for ii in range(len(data)):
    final_grid[row[ii,:],col[ii,:]] = data[ii]
This of course assumes that final_grid starts with no other info (that the count you're incrementing starts at zero). And I'm not sure how to test whether it works without understanding how your row and col arrays work.
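One caveat worth noting (my addition, not part of the original answer): fancy-indexed assignment writes each duplicate (row, col) pair only once instead of accumulating, so sums and counts can be lost when a polygon hits the same cell more than once or when the ii loop is removed. np.add.at performs an unbuffered accumulation that handles duplicates; a sketch, assuming nrows and ncols are as in the question:

import numpy as np

final_grid = np.zeros((nrows, ncols))
final_grid_counts = np.zeros((nrows, ncols))
# accumulate every (row, col) occurrence, duplicates included
np.add.at(final_grid, (row.ravel(), col.ravel()), np.repeat(data, row.shape[1]))
np.add.at(final_grid_counts, (row.ravel(), col.ravel()), 1)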
