smart assignment in 2D numpy array based on numpy 1D array - python

I have a 2D numpy array and I want to turn it into -1/1 values based on the following logic:
a. find the argmax() of each row
b. set the value at each row's argmax position to 1
c. set every other value to -1
Example:
arr2D = np.random.randint(10, size=(3, 3))
idx = np.argmax(arr2D, axis=1)
arr2D = [[5 4 1]
         [0 9 4]
         [4 2 6]]
idx = [0 1 2]
arr2D[idx] = 1
arr2D[~idx] = -1
what I get is this:
arr2D = [[-1 -1 -1]
         [-1 -1 -1]
         [-1 -1 -1]]
while I wanted:
arr2D = [[1 -1 -1]
         [-1 1 -1]
         [-1 -1 1]]
I'd appreciate some help, thanks.

Approach #1
Create a mask with those argmax -
mask = idx[:,None] == np.arange(arr2D.shape[1])
Then, use the mask to create the array of 1s and -1s -
out = 2*mask-1
Alternatively, we could use np.where -
out = np.where(mask,1,-1)
Approach #2
Another way to create the mask would be -
mask = np.zeros(arr2D.shape, dtype=bool)
mask[np.arange(len(idx)),idx] = 1
Then, get out using one of the methods as listed in approach #1.
Approach #3
One more way would be like so -
out = np.full(arr2D.shape, -1)
out[np.arange(len(idx)),idx] = 1
Alternatively, we could use np.put_along_axis for the assignment -
np.put_along_axis(out,idx[:,None],1,axis=1)
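Putting approach #1 together as a runnable sketch, with the array fixed to the values from the example above so the output is reproducible:

```python
import numpy as np

arr2D = np.array([[5, 4, 1],
                  [0, 9, 4],
                  [4, 2, 6]])
idx = np.argmax(arr2D, axis=1)  # [0 1 2]

# broadcast the per-row argmax against the column indices to get a boolean mask
mask = idx[:, None] == np.arange(arr2D.shape[1])
out = np.where(mask, 1, -1)
print(out)
# [[ 1 -1 -1]
#  [-1  1 -1]
#  [-1 -1  1]]
```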

Related

Python Numpy. Delete an element (or elements) in a 2D array if said element is located between a pair of specified elements

I have a 2D NumPy array exclusively filled with 1s and 0s.
a = [[0 0 0 0 1 0 0 0 1]
     [1 1 1 1 1 1 1 1 1]
     [1 1 1 1 1 1 1 1 1]
     [1 1 1 1 0 0 0 0 1]
     [1 1 1 1 1 1 1 1 1]
     [1 1 1 0 1 1 1 1 1]
     [1 1 1 1 1 1 0 0 1]
     [1 1 1 1 1 1 1 1 1]]
To get the location of the 0s I used the following code:
new_array = np.transpose(np.nonzero(a==0))
As expected, I get the following result showing the location of the 0s within the array
new_array = [[0 0]
             [0 1]
             [0 2]
             [0 3]
             [0 5]
             [0 6]
             [0 7]
             [3 4]
             [3 5]
             [3 6]
             [3 7]
             [5 3]
             [6 6]
             [6 7]]
Now comes my question:
Is there a way to get the locations of only the first and last 0 of each horizontal group of 0s, if said group is larger than 2?
EDIT: If a group finishes at the end of a row and continues on the row below it, it counts as 2 separate groups.
My first thought was to implement a process that would delete 0s if they are located in-between 0s but I was not able to figure out how to do that.
I would like "new_array" output to be:
new_array = [[0 0]
             [0 3]
             [0 5]
             [0 7]
             [3 4]
             [3 7]
             [5 3]
             [6 6]
             [6 7]]
Thanks beforehand!!
EDIT 2:
Thank you all for your very helpful insights; I was able to solve the problem I had.
To satisfy the curiosity, this data represents musical information. The purpose of the program I'm working on is to create a musical score based on an image (one that consists exclusively of horizontal lines).
Once the image conversion to 1s and 0s is done, I needed to extract the following information from it: Onset, Pitch, and Duration. This translates into position in the "x" axis, position on the "y" axis and total length of group.
Since X and Y locations are fairly easy to get, I decided to process them separately from the "Duration" calculation (which was the main problem to solve in this post).
Thanks to your help I was able to solve the Duration problem and create a new array with all necessary information:
[[0 0 4]
 [5 0 3]
 [4 3 4]
 [6 6 2]]
Note that the 1st column represents Onset, the 2nd column represents Pitch, and the 3rd column represents Duration.
It has also come to my attention that a comment suggested adding an identifier to each event. Eventually I will need to implement that to differentiate between instruments (and later send them to individual MIDI channels). However, for this first iteration of the program, which only aims to create a music score for a single instrument, it is not necessary, since all events belong to a single instrument.
I have very little experience with programming, so I don't know if this was the most efficient way of achieving my goal. Any suggestions are welcome.
Thanks!
One possible solution that is easier to follow is:
b = np.diff(a, prepend=1) # prepend a column of 1s and detect
# jumps between adjacent columns (left to right)
y, x = np.where(b > 0) # find positions of the jumps 0->1 (left to right)
# shift positive jumps to the left by 1 position while filling gaps with 0:
b[y, x - 1] = 1
b[y, x] = 0
new_array = list(zip(*np.where(b)))
Another one is:
new_array = list(zip(*np.where(np.diff(a, n=2, prepend=1, append=1) > 0)))
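As a quick check, running the one-liner on the array from the question reproduces the desired output (a minimal sketch):

```python
import numpy as np

a = np.array([[0, 0, 0, 0, 1, 0, 0, 0, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 0, 0, 0, 0, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 0, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1, 0, 0, 1],
              [1, 1, 1, 1, 1, 1, 1, 1, 1]])

# second difference along each row; padding with 1s makes edge groups detectable
new_array = list(zip(*np.where(np.diff(a, n=2, prepend=1, append=1) > 0)))
print(new_array)
# [(0, 0), (0, 3), (0, 5), (0, 7), (3, 4), (3, 7), (5, 3), (6, 6), (6, 7)]
```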
Both solutions are based on the np.diff that computes differences between consecutive columns (when axis=-1 for 2D arrays).
A flaw in the other solution is that it reports all sequences of zeroes, regardless of their length. Your expected output also contains such groups, composed of 1 or 2 zeroes, but in my opinion it shouldn't.
My solution is free of the above flaw.
An elegant tool to process groups of adjacent equal elements is
itertools.groupby, so start from:
import itertools
Then generate your intended result as:
res = []
for rowIdx, row in enumerate(a):
    colIdx = 0                       # Start column index
    for k, grp in itertools.groupby(row):
        vals = list(grp)             # Values in the group
        lgth = len(vals)             # Length of the group
        colIdx2 = colIdx + lgth - 1  # End column index
        if k == 0 and lgth > 2:      # Record this group
            res.append([rowIdx, colIdx])
            res.append([rowIdx, colIdx2])
        colIdx = colIdx2 + 1         # Advance column index
result = np.array(res)
The result, for your source data, is:
array([[0, 0],
       [0, 3],
       [0, 5],
       [0, 7],
       [3, 4],
       [3, 7]])
As you can see, it doesn't include the shorter sequences of zeroes in rows 5 and 6.

Remove repeated rows in 2D numpy array, maintaining first instance and ordering

I have a 2-dimensional Numpy array where some rows are not unique, i.e., when I do:
import numpy as np
data.shape #number of rows X columns in data
# (75000, 8)
np.unique(data, axis=0).shape #number of unique rows is fewer than above
# (74801, 8)
Starting with the first row of data, I would like to remove any row that is a duplicate of a previous row, maintaining the original order of the rows. In the above example, the final shape of the new Numpy array should be (74801, 8).
E.g., given the below data array
data = np.array([[1,2,1],[2,2,3],[3,3,2],[2,2,3],[1,1,2],[0,0,0],[3,3,2]])
print(data)
[[1 2 1]
 [2 2 3]
 [3 3 2]
 [2 2 3]
 [1 1 2]
 [0 0 0]
 [3 3 2]]
I'd like to have the unique rows in their original order, i.e.,
[[1 2 1]
 [2 2 3]
 [3 3 2]
 [1 1 2]
 [0 0 0]]
Any tips on an efficient solution would be greatly appreciated!
Try numpy.unique with the "return_index" parameter:
data[np.sort(np.unique(data, axis = 0, return_index = True)[1])]
As its name indicates, the parameter makes np.unique return the unique rows together with the index of each row's first occurrence, inside a tuple (that's why there's a [1] at the end).
You can also use pandas:
import pandas as pd
pd.DataFrame(data).drop_duplicates().values
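Applied to the small example from the question, the numpy approach gives (a minimal sketch):

```python
import numpy as np

data = np.array([[1, 2, 1], [2, 2, 3], [3, 3, 2],
                 [2, 2, 3], [1, 1, 2], [0, 0, 0], [3, 3, 2]])

# first-occurrence index of each unique row, sorted back into original order
first_idx = np.sort(np.unique(data, axis=0, return_index=True)[1])
result = data[first_idx]
print(result)
# [[1 2 1]
#  [2 2 3]
#  [3 3 2]
#  [1 1 2]
#  [0 0 0]]
```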

How do I subtract two columns from the same array and put the value in their own single column array with numpy?

Let's say I have a single 3x4 array (3 rows, 4 columns), for example
import numpy as np
data = [[0,5,0,1], [0,5,0,1], [0,5,0,1]]
data = np.array(data)
print(data)
[[0 5 0 1]
 [0 5 0 1]
 [0 5 0 1]]
and I want to subtract column 4 from column 2 and have the values in their own, named, 3x1 array like this
print(subtraction)
[[4]
 [4]
 [4]]
How would I go about this in numpy?
result = (data[:, 1] - data[:, 3]).reshape((3, 1))
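A slight variant of the same idea, as a minimal sketch using the data from the question: indexing with a list keeps the column dimension, so the reshape can be avoided.

```python
import numpy as np

data = np.array([[0, 5, 0, 1], [0, 5, 0, 1], [0, 5, 0, 1]])

# data[:, [1]] has shape (3, 1) instead of (3,), so the result is already a column
subtraction = data[:, [1]] - data[:, [3]]
print(subtraction)
# [[4]
#  [4]
#  [4]]
```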

Masking nested array with value at index with a second array

I have a nested array with some values, and another array where the lengths of both arrays are equal. I'd like to get an output consisting of a nested array of 1s and 0s, such that it is 1 wherever the value in the second array equals the value in that nested array.
I've taken a look on existing stack overflow questions but have been unable to construct an answer.
masks_list = []
for i in range(len(y_pred)):
    mask = (y_pred[i] == y_test.values[i]) * 1
    masks_list.append(mask)
masks = np.array(masks_list)
Essentially, that's the code I currently have and it works, but I think it's probably not the most efficient way of doing it.
YPRED:
[[4 0 1 2 3 5 6]
 [0 1 2 3 5 6 4]]
YTEST:
8    1
5    4
Masks:
[[0 0 1 0 0 0 0]
 [0 0 0 0 0 0 1]]
Another good solution, with fewer lines of code -
a = set(y_pred).intersection(y_test)
f = [1 if i in a else 0 for i, j in enumerate(y_pred)]
After that you can check performance, like in this answer, as follows:
from time import perf_counter as pc

t0 = pc()
a = set(y_pred).intersection(y_test)
f = [1 if i in a else 0 for i, j in enumerate(y_pred)]
t1 = pc() - t0

t0 = pc()
masks_list = []
for i in range(len(y_pred)):
    mask = (y_pred[i] == y_test[i]) * 1
    masks_list.append(mask)
t2 = pc() - t0

val = t1 - t2
Generally it means that if val is positive, then the first solution is slower.
If you have an np.array instead of a list, you can convert it as described in this answer:
type(y_pred)
>> numpy.ndarray
y_pred = y_pred.tolist()
type(y_pred)
>> list
Idea (fewest loops): compare the array and the nested array directly:
masks = np.equal(y_pred, y_test.values)
you can look at this too:
np.array_equal(A,B) # test if same shape, same elements values
np.array_equiv(A,B) # test if broadcastable shape, same elements values
np.allclose(A,B,...) # test if same shape, elements have close enough values
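The broadcasting idea can be sketched end-to-end on the data from the question. Note this is a minimal sketch: y_test here is assumed to be a plain 1-D array; with a pandas Series you would use y_test.values instead.

```python
import numpy as np

y_pred = np.array([[4, 0, 1, 2, 3, 5, 6],
                   [0, 1, 2, 3, 5, 6, 4]])
y_test = np.array([1, 4])

# y_test[:, None] has shape (2, 1), so each row of y_pred is compared
# against its matching scalar via broadcasting
masks = (y_pred == y_test[:, None]).astype(int)
print(masks)
# [[0 0 1 0 0 0 0]
#  [0 0 0 0 0 0 1]]
```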

numpy moving window percent cover

I have a classified raster (n classes) that I am reading into a numpy array.
I want to use a 2D moving window (e.g. 3 by 3) to create an n-dimensional vector that stores the % cover of each class within the window. Because the raster is large, it would be useful to store this information so as not to re-compute it each time... therefore I think the best solution is creating a 3D array to act as the vector. A new raster will be created based on these %/count values.
My idea is to:
1) create a 3d array with n+1 'bands'
2) band 1 = the original classified raster; each other 'band' = the count of cells of one class value within the window (i.e. one band per class) ... for example:
[[2 0 1 2 1]
 [2 0 2 0 0]
 [0 1 1 2 1]
 [0 2 2 1 1]
 [0 1 2 1 1]]
[[2 2 3 2 2]
 [3 3 3 2 2]
 [3 3 2 2 2]
 [3 3 0 0 0]
 [2 2 0 0 0]]
[[0 1 1 2 1]
 [1 3 3 4 2]
 [1 2 3 4 3]
 [2 3 5 6 5]
 [1 1 3 4 4]]
[[2 3 2 2 1]
 [2 3 3 3 2]
 [2 4 4 3 1]
 [1 3 5 3 1]
 [1 3 3 2 0]]
4) read these bands into a vrt so it only needs to be created once ... and can be read in for further modules.
Question: what is the most efficient 'moving window' method to 'count' within the window?
Currently I am trying, and failing, with the following code:
def lcc_binary_vrt(raster, dim, bands):
    footprint = np.zeros(shape=(dim, dim), dtype=int) + 1
    g = gdal.Open(raster)
    data = gdal_array.DatasetReadAsArray(g)
    # loop through the band values
    for i in bands:
        print(i)
        # create a duplicate '0' array of the raster
        a_band = data * 0
        # we create the binary dataset for the band
        a_band = np.where(data == i, 1, a_band)
        count_a_band_fname = raster[:-4] + '_' + str(i) + '.tif'
        # run the moving window (footprint) across the band to create a 'count'
        count_a_band = ndimage.generic_filter(a_band, np.count_nonzero, footprint=footprint, mode='constant')
        geoTiff.create(count_a_band_fname, g, data, count_a_band, gdal.GDT_Byte, np.nan)
Any suggestions very much appreciated.
Becky
I don't know anything about the spatial sciences stuff, so I'll just focus on the main question :)
what is the most efficient 'moving window' method to 'count' within the window?
A common way to do moving window statistics with Numpy is to use numpy.lib.stride_tricks.as_strided, see for example this answer. Basically, the idea is to make an array containing all the windows, without any increase in memory usage:
from numpy.lib.stride_tricks import as_strided
...
m, n = a_band.shape
newshape = (m - dim + 1, n - dim + 1, dim, dim)
newstrides = a_band.strides * 2  # strides is a tuple
count_a_band = as_strided(a_band, newshape, newstrides).sum(axis=(2, 3))
However, for your use case this method is inefficient, because you're summing the same numbers over and over again, especially if the window size increases. A better way is to use a cumsum trick, like in this answer:
def windowed_sum_1d(ar, ws, axis=None):
    if axis is None:
        ar = ar.ravel()
    else:
        ar = np.swapaxes(ar, axis, 0)
    ans = np.cumsum(ar, axis=0)
    ans[ws:] = ans[ws:] - ans[:-ws]
    ans = ans[ws-1:]
    if axis:
        ans = np.swapaxes(ans, 0, axis)
    return ans

def windowed_sum(ar, ws):
    for axis in range(ar.ndim):
        ar = windowed_sum_1d(ar, ws, axis)
    return ar
...
count_a_band = windowed_sum(a_band, dim)
Note that in both codes above it would be tedious to handle edge cases. Luckily, there is an easy way to include these and get the same efficiency as the second code:
count_a_band = ndimage.uniform_filter(a_band, size=dim, mode='constant') * dim**2
Though very similar to what you already had, this will be much faster! The downside is that you may need to round to integers to get rid of floating point rounding errors.
As a final note, your code
# create a duplicate '0' array of the raster
a_band = data*0
# we create the binary dataset for the band
a_band = np.where(data == i, 1, a_band)
is a bit redundant: You can just use a_band = (data == i).
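As a concrete check of the uniform_filter route, here is a minimal sketch counting class 0 in 3x3 windows of the 5x5 example raster from the question; the result reproduces the second 'band' listed there.

```python
import numpy as np
from scipy import ndimage

a = np.array([[2, 0, 1, 2, 1],
              [2, 0, 2, 0, 0],
              [0, 1, 1, 2, 1],
              [0, 2, 2, 1, 1],
              [0, 1, 2, 1, 1]])
dim = 3

# binary band for class 0; cast to float so uniform_filter averages exactly
a_band = (a == 0).astype(float)
# mean over each dim x dim window (zero-padded edges), rescaled to a count
count = np.rint(ndimage.uniform_filter(a_band, size=dim, mode='constant') * dim**2).astype(int)
print(count)
# [[2 2 3 2 2]
#  [3 3 3 2 2]
#  [3 3 2 2 2]
#  [3 3 0 0 0]
#  [2 2 0 0 0]]
```

Rounding with np.rint before casting is what guards against the floating point errors mentioned above.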
