I'm currently doing some work to extract NoData values from a gridded satellite image. The image is presented as a 2D array, where the inner array is every pixel value in a given row from left to right, and the outer array is every row in the image from top to bottom.
Any advice on this?
I have built the following functions:
from more_itertools import locate
def find_indices(liste, item):
    indices = locate(liste, lambda x: x == item)
    return list(indices)

def find_indices2(liste, item):
    indices = locate(liste, lambda x: item in x)
    return list(indices)
and I have built two separate arrays of the index positions of:
a) the rows containing a '0' value in them (all of them). This is a 1D array marked as 'f'
b) the pixels with a '0' value within their given row. This is a 2D array, marked as 'g'
Finally, I carried out the following to merge my two arrays.
h = np.dstack((g, f))
This gives me a 3D array of the form [g, list([f])], i.e. [[0, list([0, 1, 2, 3, 4, 5...])], [1, list([0, 1, 2, 3, 4, 5...])]].
I want to convert this array into the form [[g, f]], i.e. [[0, 0], [0, 1], [0, 2], [0, 3], [0, 4]...]. This will essentially give me a set of 2D coordinates for each NoData pixel, which I can then apply to a second satellite image to mask it, turn both satellite images into arrays of the same length, and run a regression on them.
Assuming I understood correctly what you mean, you could do something like this to convert your data:
import numpy as np

# dtype=object is needed because the inner lists have different lengths
data = np.array([[0, list([0, 1, 2, 3])], [1, list([0, 1, 2])]], dtype=object)
for i in range(data.shape[0]):
    converted = np.asarray(np.meshgrid(data[i][0], data[i][1])).T.reshape(-1, 2)
    print(converted)
    # and you could vstack here, for example
This would give the output:
[[0 0]
 [0 1]
 [0 2]
 [0 3]]
[[1 0]
 [1 1]
 [1 2]]
This can surely be done faster and more efficiently, but you didn't provide exact information on the data you start with, so I'm just trying to address the conversion part of the question. I think it's a bad idea to store data as lists inside a NumPy array in the first place, especially if their lengths vary.
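If you still have access to the original 2D image array, a more direct route (a sketch, assuming the NoData value really is 0; image is a hypothetical stand-in for your grid) is np.argwhere, which returns one (row, column) pair per matching pixel in a single call:
import numpy as np
image = np.array([[0, 7, 0],
                  [5, 0, 9]])      # toy stand-in for the satellite grid
coords = np.argwhere(image == 0)   # one [row, col] pair per NoData pixel
print(coords)                      # [[0 0], [0 2], [1 1]]
mask = (image == 0)                # boolean mask, reusable on a second image of the same shape
This sidesteps building and merging the intermediate index lists entirely.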
Let's suppose I have two arrays that represent pixels in pictures.
I want to build an array of tensordot products of pixels of a smaller picture with a bigger picture as it "scans" the latter. By "scanning" I mean iteration over rows and columns while creating overlays with the original picture.
For instance, a 2x2 picture can be overlaid on top of 3x3 in four different ways, so I want to produce a four-element array that contains tensordot products of matching pixels.
Tensordot is calculated by multiplying a[i,j] with b[i,j] element-wise and summing the terms.
Please examine this code:
import numpy as np

a = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8]])
b = np.array([[0, 1],
              [2, 3]])

shape_diff = (a.shape[0] - b.shape[0] + 1,
              a.shape[1] - b.shape[1] + 1)

def compute_pixel(x, y):
    sub_matrix = a[x : x + b.shape[0],
                   y : y + b.shape[1]]
    return np.tensordot(sub_matrix, b, axes=2)

def process():
    arr = np.zeros(shape_diff)
    for i in range(shape_diff[0]):
        for j in range(shape_diff[1]):
            arr[i, j] = compute_pixel(i, j)
    return arr

print(process())
Computing a single pixel is very easy: all I need is the starting location coordinates within a. From there I take a sub-matrix matching the size of b and compute the tensordot product.
However, because I need to do this all over again for each x and y location as I iterate over rows and columns, I've had to use a loop, which is of course suboptimal.
In the next piece of code I have tried to utilize a handy feature of tensordot, which also accepts tensors as arguments. In other words, I can feed in an array of arrays for different sub-matrices of a, while keeping b the same.
However, to create an array of those combinations, I couldn't think of anything better than using another loop, which sounds rather silly in this case.
def try_vector():
    tensor = np.zeros(shape_diff + b.shape)
    for i in range(shape_diff[0]):
        for j in range(shape_diff[1]):
            tensor[i, j] = a[i : i + b.shape[0],
                             j : j + b.shape[1]]
    return np.tensordot(tensor, b, axes=2)

print(try_vector())
Note: the tensor shape is the concatenation of the two tuples, which in this case gives (2, 2, 2, 2).
Yet even if I produced such an array, it would be prohibitively large to be of any practical use; doing this for a 1000x1000 picture could consume all available memory.
So, are there any other ways to avoid loops in this problem?
In [111]: process()
Out[111]:
array([[19., 25.],
       [37., 43.]])
tensordot with axes=2 is the same as an element-wise multiply and sum:
In [116]: np.tensordot(a[0:2,0:2],b, axes=2)
Out[116]: array(19)
In [126]: (a[0:2,0:2]*b).sum()
Out[126]: 19
A lower-memory way of generating your tensor is sliding_window_view (added in NumPy 1.20), which returns a view into a rather than a copy:
In [121]: np.lib.stride_tricks.sliding_window_view(a,(2,2))
Out[121]:
array([[[[0, 1],
         [3, 4]],

        [[1, 2],
         [4, 5]]],


       [[[3, 4],
         [6, 7]],

        [[4, 5],
         [7, 8]]]])
We can do a broadcasted multiply, and sum on the last 2 axes:
In [129]: (Out[121]*b).sum((2,3))
Out[129]:
array([[19, 25],
       [37, 43]])
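If you also want to skip materializing the broadcasted product, an equivalent contraction can be written with np.einsum (a sketch reusing the window view from above; the prompt numbers are illustrative):
In [130]: w = np.lib.stride_tricks.sliding_window_view(a, b.shape)
In [131]: np.einsum('ijkl,kl->ij', w, b)
Out[131]:
array([[19, 25],
       [37, 43]])
einsum contracts the last two window axes against b directly, so the (2, 2, 2, 2) intermediate product array is never allocated.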
Forgive me for the vague title; I honestly don't know which title will suit this question. If you have a better one, let's change it so that it is apt for the problem at hand.
The problem.
Let's say result is a 2D array and values is a 1D array. values holds some values associated with each element in result. The mapping of an element in values to result is stored in x_mapping and y_mapping. A position in result can be associated with different values. Now, I have to find the sum of the values grouped by associations.
An example for better clarification.
result array:
[[0, 0],
 [0, 0],
 [0, 0],
 [0, 0]]
values array:
[ 1., 2., 3., 4., 5., 6., 7., 8.]
Note: Here result and values have the same number of elements. But it might not be the case. There is no relation between the sizes at all.
x_mapping and y_mapping have mappings from 1D values to 2D result. The sizes of x_mapping, y_mapping and values will be the same.
x_mapping - [0, 1, 0, 0, 0, 0, 0, 0]
y_mapping - [0, 3, 2, 2, 0, 3, 2, 1]
Here, the 1st value (values[0]) has x as 0 and y as 0 (x_mapping[0] and y_mapping[0]) and hence is associated with result[0, 0]. If we are counting the number of associations, then the element at result[0, 0] will be 2, as the 1st and 5th values are both associated with result[0, 0]. If we are taking the sum, result[0, 0] = values[0] + values[4], which is 6.
Current solution
# Initialisation. No connection with the solution.
result = np.zeros([4,2], dtype=np.int16)
values = np.linspace(start=1, stop=8, num=8)
y_mapping = np.random.randint(low=0, high=values.shape[0], size=values.shape[0])
x_mapping = np.random.randint(low=0, high=values.shape[1], size=values.shape[0])
# Summing the values associated with x,y (current solution.)
for i in range(values.size):
    x = x_mapping[i]
    y = y_mapping[i]
    result[-y, x] = result[-y, x] + values[i]
The result:
[[ 6,  0],
 [ 6,  2],
 [14,  0],
 [ 8,  0]]
Failed solution; but why?
test_result = np.zeros_like(result)
test_result[-y_mapping, x_mapping] = test_result[-y_mapping, x_mapping] + values # solution
To my surprise, elements are overwritten in test_result. The values in test_result:
[[5, 0],
 [6, 2],
 [7, 0],
 [8, 0]]
Question
1. Why is every element overwritten in the second solution?
As @Divakar has pointed out in the comment on his answer -
NumPy doesn't assign accumulated/summed values when the indices are repeated in test_result[-y_mapping, x_mapping] =. It randomly assigns from one of the instances.
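A minimal demonstration of that buffering behaviour (a toy example, separate from the question's arrays):
buf = np.zeros(3)
buf[[0, 0, 1]] += 1           # fancy indexing: the repeated write to index 0 lands only once
print(buf)                    # [1. 1. 0.]
buf = np.zeros(3)
np.add.at(buf, [0, 0, 1], 1)  # unbuffered: both additions at index 0 are applied
print(buf)                    # [2. 1. 0.]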
2. Is there any NumPy way to do this, i.e. without looping? I'm looking for some speed optimization.
Approach #2 in @Divakar's answer gives me good results. For 23315 associations, the for loop took 50 ms while Approach #1 took 1.85 ms. Beating both, Approach #2 took 668 µs.
Side note
I'm using NumPy 1.14.3 with Python 3.5.2 on an i7 processor.
Approach #1
The most intuitive one would be with np.add.at for those repeated indices -
np.add.at(result, [-y_mapping, x_mapping], values)
Approach #2
We need to perform binned summations owing to the possibly repeated nature of the x, y indices. Hence, another way could be to use NumPy's binned summation function np.bincount and have an implementation like so -
# Get linear index equivalents off the x and y indices into result array
m,n = result.shape
out_dtype = result.dtype
lidx = ((-y_mapping)%m)*n + x_mapping
# Get binned summations off values based on linear index as bins
binned_sums = np.bincount(lidx, values, minlength=m*n)
# Finally add into result array
result += binned_sums.astype(result.dtype).reshape(m,n)
If you are always starting off with a zeros array for result, the last step could be made more performant with -
result = binned_sums.astype(out_dtype).reshape(m,n)
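As a quick sanity check (a sketch reusing m, n, the mappings and values from above), the binned sums should match an np.add.at accumulation:
chk = np.zeros((m, n))
np.add.at(chk, [(-y_mapping) % m, x_mapping], values)
assert np.allclose(chk.ravel(), binned_sums)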
I guess you meant to write
y_mapping = np.random.randint(low=0, high=result.shape[0], size=values.shape[0])
x_mapping = np.random.randint(low=0, high=result.shape[1], size=values.shape[0])
With that correction, the code works for me as expected.
While I have already found the documentation on the scipy.ndimage.convolve function, and I "practically know what it does", when I try to calculate the resulting array by hand I can't follow the mathematical formula. Let's take for example:
import numpy as np
from scipy import ndimage

a = np.array([[1, 2, 0, 0],
              [5, 3, 0, 4],
              [0, 0, 0, 7],
              [9, 3, 0, 0]])
k = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 0]])

ndimage.convolve(a, k, mode='constant', cval=0.0)
# Why is the result like this?
array([[11, 10,  7,  4],
       [10,  3, 11, 11],
       [15, 12, 14,  7],
       [12,  3,  7,  0]])
I would appreciate a step by step calculation.
Details on NDImage.convolve
I stumbled on this ndimage convolution even though I know the basic np.convolve, and the documentation is not very self-explanatory, so I took the effort to crunch through it and supplement the earlier explanatory post:
A. Basics:
Reference: refer to the following if your concept of convolution is not well grounded:
https://en.wikipedia.org/wiki/Kernel_(image_processing),
https://en.wikipedia.org/wiki/Convolution
Essentially ndimage.convolve has several boundary modes; this post focuses on the constant mode, for which you use the value specified by cval=0 (or whatever you choose) and add padded rows and columns as needed (explained in a little bit).
The convolution essentially slides the kernel from left to right, then steps down a row and goes left to right again, until the needed (same) number of convolved elements is produced.
The function will calculate the padded rows/columns needed. In this case the filter k is a 3 x 3 matrix and the source image a is 4 x 4, so you need one padded row each at the top and bottom and one padded column each at the left and right (4 + 2 = 6; equivalently, the number of rows or columns needed is 3 + 1 + 1 + 1 = 6, since each extra slide needs one extra row or column).
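A quick way to see the padded matrix the function effectively works on (a sketch for inspection only; ndimage does the padding internally):
import numpy as np
padded = np.pad(a, 1, mode='constant', constant_values=0)
print(padded)
# [[0 0 0 0 0 0]
#  [0 1 2 0 0 0]
#  [0 5 3 0 4 0]
#  [0 0 0 0 7 0]
#  [0 9 3 0 0 0]
#  [0 0 0 0 0 0]]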
B. Operations:
Add a row and column of zeros to the top and left of array a (to convolve a 3 x 3 kernel with a 4 x 4 image evenly, you need an extra padded row/column for the 1st and 4th sliding windows), and also one row/column of padded zeros at the bottom and right.
Flip the kernel K as Kflip: [[0,0,1], [0,1,1], [1,1,1]].
You can use np.flip (why it needs to be flipped relates to the distinction between convolution and correlation, which are like twins pointing in opposite directions).
Slide the flipped K matrix over this 6 x 6 expanded matrix: [[0,0,0,0,0,0], [0,1,2,0,0,0], [0,5,3,0,4,0], [0,0,0,0,7,0], [0,9,3,0,0,0], [0,0,0,0,0,0]].
For the first position of the sliding window (note that the first row and first column of the kernel meet only padded zeros), you get:
Flipped K dot sum [[0,0,0], [0,1,2], [0,5,3]] = 11 (1*1 + 1*2 + 1*5 + 1*3, the others are zeros).
(Dot sum refers to the sum of the element-wise multiplication: multiply the corresponding elements in the same positions of the two matrices and add the products.)
Slide K one step to the right and you get 10 (the first row contributes nothing since it meets only padded zeros; the second row gives 1*2; the third row gives 1*5 + 1*3).
Likewise, slide to the right another two steps to get the remaining elements of this row of the convolved matrix (note that the 4th element of the row again partially overlaps the padded columns).
Then slide the K filter one row down and reset to the far left of the expanded/padded matrix.
You will again get 10 (first row: 1*2; second row: 1*5 + 1*3), and so on and so forth.
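Putting the steps above into code (a sketch of the manual procedure for the constant mode, not how ndimage implements it internally) reproduces the library result:
import numpy as np
from scipy import ndimage

a = np.array([[1, 2, 0, 0],
              [5, 3, 0, 4],
              [0, 0, 0, 7],
              [9, 3, 0, 0]])
k = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 0]])

padded = np.pad(a, 1, mode='constant', constant_values=0)  # the 6 x 6 expanded matrix
kflip = np.flip(k)                                         # flip along both axes
out = np.zeros_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        # dot sum of the flipped kernel with the current 3 x 3 window
        out[i, j] = (padded[i:i+3, j:j+3] * kflip).sum()

assert (out == ndimage.convolve(a, k, mode='constant', cval=0.0)).all()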
Just to warm up consider
k = np.array([[1,0,0],[0,1,0],[0,0,0]])
instead of your k, then if you
ndimage.convolve(a, k, mode='constant', cval=0.0)
you get
array([[4, 2, 4, 0],
       [5, 3, 7, 4],
       [3, 0, 0, 7],
       [9, 3, 0, 0]])
and note that any element is the sum of its own value (due to the 2nd 1 in k) and the one below and to the right (due to the 1st 1 in k), i.e. the 4 in the top corner is from the original 1 in the top corner plus the 3 diagonally down from it.
The (possibly) confusing part is that the effect of k is the opposite of what you might expect, i.e. for the k above you might expect the first 1 to add the value above and to the left, instead of below and to the right.
Now back to yours: the 12 (3 down and 2 across) is the sum of 9+3+0+0+0+0.
Note that anything outside the matrix is assumed to be 0.
To index the middle points of a numpy array, you can do this:
x = np.arange(10)
middle = x[len(x) // 4 : len(x) * 3 // 4]
Is there a shorthand for indexing the middle of the array? E.g., the n or 2n elements closest to len(x)/2? Is there a nice n-dimensional version of this?
As cge said, the simplest way is by turning it into a lambda function, like so:
x = np.arange(10)
middle = lambda x: x[len(x) // 4 : len(x) * 3 // 4]
or the n-dimensional way (note the integer slice bounds and the tuple of slices, both of which modern NumPy requires):
middle = lambda x: x[tuple(slice(int(np.floor(d / 4.)), int(np.ceil(3 * d / 4.))) for d in x.shape)]
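For example, on a 4 x 4 array the n-dimensional version picks out the central 2 x 2 block:
x = np.arange(16).reshape(4, 4)
print(middle(x))
# [[ 5  6]
#  [ 9 10]]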
Late, but for everyone else running into this issue:
A much smoother way is to use numpy's take or put.
To address the middle of an array you can use put to index an n-dimensional array with a single (flat) index. The same goes for getting values from an array with take.
Assuming your array has an odd number of elements, the middle of the array will be at half its size. By using integer division (// instead of /) you won't get any problems here.
import numpy as np
arr = np.array([[0, 1, 2],
                [3, 4, 5],
                [6, 7, 8]])
# put a value to the center
np.put(arr, arr.size // 2, 999)
print(arr)
# take a value from the center
center = np.take(arr, arr.size // 2)
print(center)
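If you need the center as n-dimensional coordinates rather than a flat index (a small addition beyond take/put), np.unravel_index converts it:
# coordinates of the flat center index in the array's own shape
center_idx = np.unravel_index(arr.size // 2, arr.shape)
print(center_idx)        # (1, 1)
print(arr[center_idx])   # the same element np.take returned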