I have a very large numpy array of 1s and 0s. I want to go row by row and look for all the 1s. Once I encounter a 1, I want to store the indices of entries which fall inside a radius of five rows. This is better illustrated in the picture:
(in the picture I only show half a circle, in the real case I need the indices of the values that fall inside the entire circle)
Once I collect the indices, I go to the next 1 in the array and do the same. Once I finish looping through the array I want to set all the values of the collected indices which are not 1 to 1. In a sense, I am creating a buffer around all 1s with a radius of 5 columns.
for row in range(myarray.shape[0]):
    for column in range(myarray.shape[1]):
        dist = math.sqrt(row**2 + column**2)
        if dist <= 5:
            # ......... store the indices of the neighbouring cells
Can you please give me a suggestion how to accomplish this?
The operation you are describing is called dilation. If you have SciPy, you can use ndimage.binary_dilation to obtain the result:
import numpy as np
import scipy.ndimage as ndimage
import matplotlib.pyplot as plt
arr = np.zeros((21, 21))
arr[5, 5] = arr[15, 15] = 1
# Build an 11x11 structuring element centred at (5, 5)
i, j = np.ogrid[:11, :11]
# Euclidean alternative: struct = (i-5)**2 + (j-5)**2 <= 40
struct = np.abs(i - 5) + np.abs(j - 5) <= 8  # diamond-shaped (Manhattan-distance) neighbourhood
result = ndimage.binary_dilation(arr, structure=struct)
plt.imshow(result, interpolation='nearest')
plt.show()
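If you want the buffer to be a true Euclidean circle of radius 5, as described in the question, one possible variation is to build the structuring element from the squared Euclidean distance instead:
i, j = np.ogrid[:11, :11]
struct = (i - 5)**2 + (j - 5)**2 <= 25  # disc of radius 5 around the centre
result = ndimage.binary_dilation(arr, structure=struct)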
I am looking for an efficient way to do the following calculation on millions of arrays. For the values in each array, I want to calculate the mean of the values in the bin with the highest frequency, as demonstrated below. Some of the arrays may contain nan values, and the remaining values are floats. The loop over my actual data takes too long to finish.
import numpy as np
# creating a sample array to test the algorithm
array = np.random.uniform(0, 10, 800)
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
bin_values = np.linspace(0, 10, 21)
# frequency of each bin, per row
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
# left edge of the most frequent bin, per row
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + abs(bin_values[1] - bin_values[0])
values = np.zeros(array.shape[0])
for i in range(array.shape[0]):
    values[i] = np.nanmean(array[i][(array[i] >= bin_start[i]) * (array[i] < bin_end[i])])
Also, when I run the above code I get three warnings. The first is 'RuntimeWarning: Mean of empty slice' on the line where I calculate the values variable. I added a condition to skip this line when a row is all nan, but the warning did not go away, and I was wondering why. The other two warnings come from the less and greater_equal comparisons when they involve nan values, which makes sense to me.
The arrays that I want to run this algorithm on are independent, but I am already processing them with 12 separate scripts. Running the code in parallel would be an option; however, for now I am looking to improve the algorithm itself.
The reason I am using a lambda function is to run numpy.histogram over an axis, since the histogram function does not take an axis argument. I was able to use a mask and remove the loop from the code (see below); it is about twice as fast now, but I think it can still be improved.
I can explain what I want to do in more detail with an example, if that helps. Imagine I have 36 numbers that are greater than 0 and smaller than 20, and bins of equal width 0.5 over the same interval (0.0-0.5, 0.5-1.0, 1.0-1.5, ..., 19.5-20.0). If I put the 36 numbers into their corresponding bins, I want the mean of the numbers in the bin that contains the most of them.
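To make that concrete, here is a minimal 1-D sketch of the idea (the variable names are only illustrative):
import numpy as np
values = np.random.uniform(0, 20, 36)             # 36 numbers in (0, 20)
bins = np.arange(0, 20.5, 0.5)                    # bin edges of width 0.5
counts, edges = np.histogram(values, bins=bins)
k = np.argmax(counts)                             # index of the fullest bin
in_bin = (values >= edges[k]) & (values < edges[k + 1])
mean_of_fullest_bin = values[in_bin].mean()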
Please post your solution if you can think of a faster algorithm.
import numpy as np
# creating an array to test the algorithm
array = np.random.uniform(0, 10, 800)
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
# the algorithm
bin_values = np.linspace(0, 10, 21)
# calculating the frequency of each bin
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + (abs(bin_values[1]-bin_values[0]))
# creating a mask to get the mean over the bin with maximum frequency
mask = (array>=bin_start) * (array<bin_end)
mask_nan = np.tile(np.nan, (mask.shape[0], mask.shape[1]))
mask_nan[mask] = 1
v = np.nanmean(array * mask_nan, axis = 1)
I would like to create a NumPy array whose values are all between -1 and 1 and whose elements sum to zero.
Could you please tell me how I can create this NumPy array?
You can accomplish this by creating two arrays, one filled with 1s and one with -1s, and concatenating them. To finish it off, shuffle the concatenated array.
import numpy as np
size = 10
array_size = size // 2                      # half the elements will be 1, half -1
ones = np.ones(array_size)
minus_ones = np.full(array_size, -1)
sum_zero = np.concatenate((ones, minus_ones))
np.random.shuffle(sum_zero)                 # shuffle in place so the order is random
print(sum_zero)
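Note that this assumes size is even; with an odd number of elements the 1s and -1s cannot cancel exactly. A quick sanity check:
assert sum_zero.sum() == 0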
I'm new to Python.
I have a NumPy array of 3 columns and 50 rows. I want to add a value drawn from a normal distribution to every number in the array except those in the first row. Is there a cleaner, but still readable, way to do this than what I am currently doing? At the moment I'm using this rather inelegant approach:
nRows = np.shape(data)[0]
nCols = np.shape(data)[1]
x = data[0, :].copy()  # Copy the first row
# Add a random number to all rows but 0
for i in range(nCols):
    data[:, i] += np.random.normal(0, 0.8, nRows)
data[0, :] = x  # Copy the first row back
You can assign values directly to an indexed (sliced) array. For your case, generate the 2-D random array first and then add it to the sliced data:
data[1:] += np.random.normal(0, 0.8, (nRows - 1, nCols))
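For completeness, a minimal self-contained sketch of that one-liner (the 50 x 3 shape mirrors the question; data here is just a placeholder array):
import numpy as np
data = np.zeros((50, 3))                 # placeholder for the real data
nRows, nCols = data.shape
# add Gaussian noise to every row except the first
data[1:] += np.random.normal(0, 0.8, (nRows - 1, nCols))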
I've been struggling for days trying to resolve this problem: I have Cartesian coordinates on the y-axis (depth, from 0 to 1) and values on the x-axis (the firing rates of different cell populations at the given depth, so they vary randomly).
I would like the markers in the scatter plot to be bigger for bigger x-axis values (firing rates).
Thank you for any suggestions.
This is the code (it does not work as intended).
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.cbook as cbook
x = np.genfromtxt('x_dex.csv', delimiter=',')
y = np.genfromtxt('z_dex.csv', delimiter=',')
array = [i for i in x if i > 4]
array.sort()
s = [30*2**n for n in range(len(array))];
plt.subplot(212)
plt.scatter(x,y,s=s)
plt.show()
Unfortunately, this does not show the correct relation between marker size and depth.
The line where you compute your 'size' values looks incorrect to me:
s = [30*2**n for n in range(len(array))];
This will give you a list containing:
s = [30*2**0, 30*2**1, 30*2**2, ..., 30*2**(len(array) - 1)]
The values bear no relation to y, so I assume this is not what you intended. Maybe you meant something more like this:
s = 30 * 2 ** y
There are actually several other issues here:
Don't give your variables names like array - this can lead to confusion with numpy.array. It's even worse in this case, since array is actually not an array but a Python list!
Since you're dealing with numpy arrays, it's much faster to use vectorization rather than list comprehensions. For example, you could use:
array = x[x > 4]
rather than
array = [i for i in x if i > 4]
After your list comprehension array = [i for i in x if i > 4], array will have fewer elements than y if any elements of x are less than or equal to 4.
array.sort() will sort the list in place, which means that the order of the elements in array will no longer match the order of elements in y.
In fact, sorting seems rather pointless in this situation - since you're making a scatter plot the order of the points should not matter.
You're not writing MATLAB code any more, so there's no need to end lines on a semicolon (although it won't do any harm if you do).
Here's my educated guess at what you're trying to do:
import matplotlib.pyplot as plt
import numpy as np
x = np.genfromtxt('x_dex.csv', delimiter=',')
y = np.genfromtxt('z_dex.csv', delimiter=',')
# get the set of indices that will sort x in ascending order, apply these
# to both x & y
order = np.argsort(x)
x_sorted = x[order]
y_sorted = y[order]
# keep only xy pairs where x > 4
valid = x_sorted > 4
x_valid = x_sorted[valid]
y_valid = y_sorted[valid]
# compute the sizes
s = 30 * 2 ** y_valid
# plot
plt.subplot(212)
plt.scatter(x_valid, y_valid, s=s)
plt.show()
I'm trying to reduce noise in a binary Python array by removing all completely isolated single cells, i.e. setting "1" cells to 0 if they are completely surrounded by "0"s. I have been able to get a working solution by removing blobs of size 1 with a loop, but this seems like a very inefficient approach for large arrays:
import numpy as np
import scipy.ndimage as ndimage
import matplotlib.pyplot as plt
# Generate sample data
square = np.zeros((32, 32))
square[10:-10, 10:-10] = 1
np.random.seed(12)
x, y = (32*np.random.random((2, 20))).astype(int)
square[x, y] = 1
# Plot original data with many isolated single cells
plt.imshow(square, cmap=plt.cm.gray, interpolation='nearest')
# Assign unique labels
id_regions, number_of_ids = ndimage.label(square, structure=np.ones((3,3)))
# Set blobs of size 1 to 0
for i in range(number_of_ids + 1):
    if id_regions[id_regions == i].size == 1:
        square[id_regions == i] = 0
# Plot desired output, with all isolated single cells removed
plt.imshow(square, cmap=plt.cm.gray, interpolation='nearest')
In this case, eroding and dilating my array won't work as it will also remove features with a width of 1. I feel the solution lies somewhere within the scipy.ndimage package, but so far I haven't been able to crack it. Any help would be greatly appreciated!
A belated thanks to both Jaime and Kazemakase for their replies. The manual neighbour-checking method did remove all isolated patches, but also removed patches attached to others by one corner (i.e. to the upper-right of the square in the sample array). The summed area table works perfectly and is very fast on the small sample array, but slows down on larger arrays.
I ended up following an approach using ndimage that seems to work efficiently for very large and sparse arrays (0.91 s for a 5000 x 5000 array vs. 1.17 s for the summed-area-table approach). I first generate a labelled array of unique IDs for each discrete region, calculate the size of each region, mask the size array to keep only regions of size 1, then index back into the original array and set those size-1 regions to 0:
def filter_isolated_cells(array, struct):
    """Return array with completely isolated single cells removed.

    :param array: Array with completely isolated single cells
    :param struct: Structure array for generating unique regions
    :return: Array with minimum region size > 1
    """
    filtered_array = np.copy(array)
    id_regions, num_ids = ndimage.label(filtered_array, structure=struct)
    id_sizes = np.array(ndimage.sum(array, id_regions, range(num_ids + 1)))
    area_mask = (id_sizes == 1)
    filtered_array[area_mask[id_regions]] = 0
    return filtered_array
# Run function on sample array
filtered_array = filter_isolated_cells(square, struct=np.ones((3,3)))
# Plot output, with all isolated single cells removed
plt.imshow(filtered_array, cmap=plt.cm.gray, interpolation='nearest')
You can manually check the neighbors and avoid the loop using vectorization.
has_neighbor = np.zeros(square.shape, bool)
has_neighbor[:, 1:] = np.logical_or(has_neighbor[:, 1:], square[:, :-1] > 0) # left
has_neighbor[:, :-1] = np.logical_or(has_neighbor[:, :-1], square[:, 1:] > 0) # right
has_neighbor[1:, :] = np.logical_or(has_neighbor[1:, :], square[:-1, :] > 0) # above
has_neighbor[:-1, :] = np.logical_or(has_neighbor[:-1, :], square[1:, :] > 0) # below
square[np.logical_not(has_neighbor)] = 0
That way the looping over the square is performed internally by NumPy, which is rather more efficient than looping in Python. There are two drawbacks to this solution:
If your array is very sparse there may be more efficient ways to check the neighborhood of non-zero points.
If your array is very large the has_neighbor array might consume too much memory. In this case you could loop over sub-arrays of smaller size (trade-off between python loops and vectorization).
I have no experience with ndimage, so there may be a better solution built in somewhere.
The typical way of getting rid of isolated pixels in image processing is to do a morphological opening, for which you have a ready-made implementation in scipy.ndimage.morphology.binary_opening. This would affect the contours of your larger areas as well, though.
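For reference, a minimal sketch of that ready-made call applied to the square array built above (with the caveat that, as noted, it also erodes single-pixel-wide parts of larger regions):
import numpy as np
import scipy.ndimage as ndimage
# 'square' is the binary sample array from the question
opened = ndimage.binary_opening(square, structure=np.ones((3, 3)))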
As for a DIY solution, I would use a summed area table to count the number of items in every 3x3 subimage, subtract from that the value of the central pixel, then zero all center points where the result came out to zero. To properly handle the borders, first pad the array with zeros:
sat = np.pad(square, pad_width=1, mode='constant', constant_values=0)
sat = np.cumsum(np.cumsum(sat, axis=0), axis=1)
sat = np.pad(sat, ((1, 0), (1, 0)), mode='constant', constant_values=0)
# These are all the possible overlapping 3x3 windows sums
sum3x3 = sat[3:, 3:] + sat[:-3, :-3] - sat[3:, :-3] - sat[:-3, 3:]
# This takes away the central pixel value
sum3x3 -= square
# This zeros all the isolated pixels
square[sum3x3 == 0] = 0
The implementation above works, but is not especially careful about not creating intermediate arrays, so you can probably shave off some execution time by refactoring adequately.