Masking nested array with value at index with a second array - python

I have a nested array with some values, and another array of equal length. I'd like to get an output that is a nested array of 1's and 0's, such that it is 1 wherever the value in the second array equals the value in that row of the nested array.
I've taken a look at existing Stack Overflow questions but have been unable to construct an answer.
masks_list = []
for i in range(len(y_pred)):
    mask = (y_pred[i] == y_test.values[i]) * 1
    masks_list.append(mask)
masks = np.array(masks_list)
Essentially, that's the code I currently have and it works, but I think it's probably not the most efficient way of doing it.
YPRED:
[[4 0 1 2 3 5 6]
[0 1 2 3 5 6 4]]
YTEST:
8 1
5 4
Masks:
[[0 0 1 0 0 0 0]
[0 0 0 0 0 0 1]]

Another good solution with fewer lines of code:
a = set(y_pred).intersection(y_test)
f = [1 if v in a else 0 for v in y_pred]
After that you can check performance, like in this answer, as follows:
from time import perf_counter as pc

t0 = pc()
a = set(y_pred).intersection(y_test)
f = [1 if v in a else 0 for v in y_pred]
t1 = pc() - t0
t0 = pc()
masks_list = []
for i in range(len(y_pred)):
    mask = (y_pred[i] == y_test[i]) * 1
    masks_list.append(mask)
t2 = pc() - t0
val = t1 - t2
Generally, if the value is positive, then the first solution is slower.
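As a cross-check, here is a sketch using the standard timeit module; the wrapper functions and flat-list inputs are hypothetical (the set-based solution requires hashable elements):

import timeit

# hypothetical flat-list inputs; the set approach needs hashable elements
y_pred = list(range(1000))
y_test = [i * 2 for i in range(1000)]

def set_based():
    a = set(y_pred).intersection(y_test)
    return [1 if v in a else 0 for v in y_pred]

def loop_based():
    return [(p == t) * 1 for p, t in zip(y_pred, y_test)]

print(timeit.timeit(set_based, number=100))
print(timeit.timeit(loop_based, number=100))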
If you have an np.array instead of a list, you can convert it as described in this answer:
type(y_pred)
>> numpy.ndarray
y_pred = y_pred.tolist()
type(y_pred)
>> list

Idea (fewest loops): compare the array and the nested array directly:
masks = np.equal(y_pred, y_test.values)
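If y_test holds one expected value per row of y_pred, a minimal sketch (assuming y_test.values is 1-D) that broadcasts the comparison and casts the booleans to 0/1:

import numpy as np

y_pred = np.array([[4, 0, 1, 2, 3, 5, 6],
                   [0, 1, 2, 3, 5, 6, 4]])
y_test_values = np.array([1, 4])  # assumed: one expected value per row

# compare each row against its expected value, then cast booleans to 0/1
masks = (y_pred == y_test_values[:, None]).astype(int)
print(masks)
# [[0 0 1 0 0 0 0]
#  [0 0 0 0 0 0 1]]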
You can also look at these:
np.array_equal(A,B) # test if same shape, same elements values
np.array_equiv(A,B) # test if broadcastable shape, same elements values
np.allclose(A,B,...) # test if same shape, elements have close enough values

Related

One Mask subtracting another mask on numpy

I am new to numpy, so any help is appreciated. Say I have two 1-0 masks A and B as 2D numpy arrays of the same dimensions.
Now I would like to do a logical operation to subtract B from A:
A B Expected Result
1 1 0
1 0 1
0 1 0
0 0 0
But I am not sure A = A - B works when a = 0 and b = 1, where a and b are elements from A and B respectively.
So I do something like
A = np.where(B == 0, A, 0)
But this is not very readable. Is there a better way to do that?
Because for logical or, I can do something like
A = A | B
Is there a similar operator that I can do the subtraction?
The operation that you have described can be given by the following boolean operation
A = A & (~B)
where & is the element-wise AND operation and ~ is the element-wise NOT operation.
For each pair of elements a and b from A and B respectively, we have:
a = 1 and b = 1 => a & (~b) = 0
a = 1 and b = 0 => a & (~b) = 1
a = 0 and b = 1 => a & (~b) = 0
a = 0 and b = 0 => a & (~b) = 0
Intuitively, this can be understood as follows. We interpret each array A and B as a set containing only the indices at which the value is 1 (in your case A = {0, 1} and B = {0, 2}). Then the result we want is the set of elements that are in A AND NOT in B.
Note that boolean algebra proves that any binary boolean operation can be achieved using AND, NOT, and OR gates (strictly, you need only NOT and either the AND or the OR gate), so naturally, the operation you have specified is no exception.
Since subtraction is not supported for booleans, you need to cast at least one of the arrays to an integer dtype before subtracting. If you want to make sure that the result can't be negative, you can use numpy.maximum.
np.maximum(A.astype(int) - B, 0)
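A minimal sketch with boolean arrays, using the values from the truth table above:

import numpy as np

A = np.array([1, 1, 0, 0], dtype=bool)
B = np.array([1, 0, 1, 0], dtype=bool)

# element-wise: a AND (NOT b)
result = (A & ~B).astype(int)
print(result)  # [0 1 0 0]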

Find number of positives in segment of list

I have an unsorted list. I'm asked to print, k times, the number of positive values in a segment. The segment boundaries are 1-based, not 0-based. So if the segment is [1, 3], we should count the positives among the elements with indices [0, 1, 2].
For example,
2 -1 2 -2 3
4
1 1
1 3
2 4
1 5
The answer needs to be:
1
2
1
3
Currently, I'm creating a list of the same length as the original, whose elements equal 1 where the original element is positive and 0 where it is negative or zero. After that I sum over the segment:
lst = list(map(int, input().split()))
k = int(input())
neg = [1 if x > 0 else 0 for x in lst]
for i in range(k):
    l, r = map(int, input().split())
    l = l - 1
    print(sum(neg[l:r]))
Despite the fact that it's the fastest code I've created so far, it is still too slow for this task. How would I optimize it (or make it faster)?
If I understand you correctly, there doesn't seem to be a lot of room for optimization. The only thing that comes to mind really is that the lst and neg steps could be combined, which would save one loop:
positive = [int(x) > 0 for x in input().split()]
k = int(input())
for i in range(k):
    l, r = map(int, input().split())
    print(sum(positive[l-1:r]))
We can just have bools in the positive list, because bool is a subclass of int, meaning True is treated like 1 and False like 0. (Also, I would call the list positive instead of neg.)
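For example:
sum([True, False, True])
>> 2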
The complexity is still O(n) per query, though.

smart assignment in 2D numpy array based on numpy 1D array

I have a numpy 2D array and I want to turn it into -1/1 values based on the following logic:
a. find the argmax() of each row
b. based on that 1D array of indices (a), assign the value 1 at those positions
c. assign the value -1 everywhere else
Example:
arr2D = np.random.randint(10,size=(3,3))
idx = np.argmax(arr2D, axis=1)
arr2D = [[5 4 1]
[0 9 4]
[4 2 6]]
idx = [0 1 2]
arr2D[idx] = 1
arr2D[~idx] = -1
what I get is this:
arr2D = [[-1 -1 -1]
[-1 -1 -1]
[-1 -1 -1]]
while I wanted:
arr2D = [[1 -1 -1]
[-1 1 -1]
[-1 -1 1]]
appreciate some help,
Thanks
Approach #1
Create a mask from those argmax indices -
mask = idx[:,None] == np.arange(arr2D.shape[1])
Then use the mask to create the array of 1s and -1s -
out = 2*mask-1
Alternatively, we could use np.where -
out = np.where(mask,1,-1)
Approach #2
Another way to create the mask would be -
mask = np.zeros(arr2D.shape, dtype=bool)
mask[np.arange(len(idx)),idx] = 1
Then, get out using one of the methods as listed in approach #1.
Approach #3
One more way would be like so -
out = np.full(arr2D.shape, -1)
out[np.arange(len(idx)),idx] = 1
Alternatively, we could use np.put_along_axis for the assignment -
np.put_along_axis(out,idx[:,None],1,axis=1)
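A quick end-to-end sketch of Approach #1 on the sample data from the question:

import numpy as np

arr2D = np.array([[5, 4, 1],
                  [0, 9, 4],
                  [4, 2, 6]])
idx = np.argmax(arr2D, axis=1)                    # [0 1 2]
mask = idx[:, None] == np.arange(arr2D.shape[1])  # True at each row's argmax
out = np.where(mask, 1, -1)
print(out)
# [[ 1 -1 -1]
#  [-1  1 -1]
#  [-1 -1  1]]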

Comparing rows of two pandas dataframes?

This is a continuation of my question: Fastest way to compare rows of two pandas dataframes?
I have two dataframes A and B:
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
For a condensed example:
A B C D E
0 0 0 0 1 0
1 1 1 1 1 0
2 1 0 0 1 1
3 0 1 1 1 0
B is 1024 rows x 10 columns, and is a full iteration from 0 to 1023 in binary form.
Example:
0 1 2
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I am trying to find which rows in A, at a particular 10 columns of A, correspond with each row of B.
Each row of A[My_Columns_List] is guaranteed to be somewhere in B, but not every row of B will match up with a row in A[My_Columns_List].
For example, I want to show that for columns [B,D,E] of A,
rows [1,3] of A match up with row [6] of B,
row [0] of A matches up with row [2] of B,
row [2] of A matches up with row [3] of B.
I have tried using:
pd.merge(B.reset_index(), A.reset_index(),
         left_on=B.columns.tolist(),
         right_on=A.columns[My_Columns_List].tolist(),
         suffixes=('_B', '_A'))
This works, but I was hoping that this method would be faster:
S = 2**np.arange(10)
A_ID = np.dot(A[My_Columns_List],S)
B_ID = np.dot(B,S)
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
But when I do this, out_row_idx returns an array containing all the indices of A, which doesn't tell me anything.
I think this method will be faster, but I don't know why it returns an array from 0 to 999.
Any input would be appreciated!
Also, credit goes to @jezrael and @Divakar for these methods.
I'll stick by my initial answer but maybe explain better.
You are asking to compare 2 pandas dataframes. Because of that, I'm going to build dataframes. I may use numpy, but my inputs and outputs will be dataframes.
Setup
You said we have a 1000 x 500 array of ones and zeros. Let's build that.
A_init = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A_init.columns = pd.MultiIndex.from_product([range(A_init.shape[1] // 10), range(10)])
A = A_init
In addition, I gave A a MultiIndex to easily group by columns of 10.
Solution
This is very similar to @Divakar's answer, with one minor difference that I'll point out.
For one group of 10 ones and zeros, we can treat it as a bit array of length 10. We can then calculate its integer value by taking the dot product with an array of powers of 2.
twos = 2 ** np.arange(10)
I can execute this for every group of 10 ones and zeros in one go like this
AtB = A.stack(0).dot(twos).unstack()
I stack to get the 50 groups of 10 into columns, in order to do the dot product more elegantly, then bring it back with unstack.
I now have a 1000 x 50 dataframe of numbers that range from 0-1023.
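As a tiny sketch of the encoding itself, on a hypothetical bit pattern:

import numpy as np

bits = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 1])  # hypothetical group of 10
twos = 2 ** np.arange(10)                        # [1, 2, 4, ..., 512]
print(bits.dot(twos))                            # 946, this group's integer ID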
Assume B is a dataframe with each row one of the 1024 unique combinations of ones and zeros. B should be sorted, like B = B.sort_values(list(B.columns)).reset_index(drop=True).
This is the part I think I failed at explaining last time. Look at
AtB.loc[:2, :2]
The value in the (0, 0) position, 951, means that the first group of 10 ones and zeros in the first row of A matches the row in B with index 951. That's what you want! Funny thing is, I never looked at B. You know why? B is irrelevant! It's just a goofy way of representing the numbers from 0 to 1023. This is the difference in my answer: I'm ignoring B. Skipping this useless step should save time.
Below are functions that take two dataframes A and B and return a dataframe of indices where A matches B. Spoiler alert: I'll ignore B completely.
def FindAinB(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    A.columns = pd.MultiIndex.from_product([range(A.shape[1] // 10), range(10)])
    twos = 2 ** np.arange(10)
    return A.stack(0).dot(twos).unstack()
def FindAinB2(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    A.columns = pd.MultiIndex.from_product([range(A.shape[1] // 10), range(10)])
    # use clever bit shifting instead of dot product with powers
    # questionable improvement
    return (A.stack(0) << np.arange(10)).sum(1).unstack()
I'm channelling my inner @Divakar (read: this is stuff I've learned from Divakar)
def FindAinB3(A, B):
    assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
    a = A.values.reshape(-1, 10)
    a = np.einsum('ij->i', a << np.arange(10))
    return pd.DataFrame(a.reshape(A.shape[0], -1), A.index)
Minimalist One Liner
f = lambda A: pd.DataFrame(np.einsum('ij->i', A.values.reshape(-1, 10) << np.arange(10)).reshape(A.shape[0], -1), A.index)
Use it like
f(A)
Timing
FindAinB3 is an order of magnitude faster

numpy moving window percent cover

I have a classified raster (n classes) that I am reading into a numpy array.
I want to use a 2D moving window (e.g. 3 by 3) to create an n-dimensional vector that stores the % cover of each class within the window. Because the raster is large, it would be useful to store this information so as not to re-compute it each time... therefore I think the best solution is creating a 3D array to act as the vector. A new raster will be created based on these %/count values.
My idea is to:
1) create a 3d array with n+1 'bands'
2) band 1 = the original classified raster; each other 'band' = the count of cells of one class value within the window (i.e. one band per class). For example:
[[2 0 1 2 1]
[2 0 2 0 0]
[0 1 1 2 1]
[0 2 2 1 1]
[0 1 2 1 1]]
[[2 2 3 2 2]
[3 3 3 2 2]
[3 3 2 2 2]
[3 3 0 0 0]
[2 2 0 0 0]]
[[0 1 1 2 1]
[1 3 3 4 2]
[1 2 3 4 3]
[2 3 5 6 5]
[1 1 3 4 4]]
[[2 3 2 2 1]
[2 3 3 3 2]
[2 4 4 3 1]
[1 3 5 3 1]
[1 3 3 2 0]]
4) read these bands into a vrt so it only needs to be created once... and can be read in for further modules.
Question: what is the most efficient 'moving window' method to 'count' within the window?
Currently, I am trying, and failing, with the following code:
def lcc_binary_vrt(raster, dim, bands):
    footprint = np.ones(shape=(dim, dim), dtype=int)
    g = gdal.Open(raster)
    data = gdal_array.DatasetReadAsArray(g)
    # loop through the band values
    for i in bands:
        print(i)
        # create a duplicate '0' array of the raster
        a_band = data * 0
        # we create the binary dataset for the band
        a_band = np.where(data == i, 1, a_band)
        count_a_band_fname = raster[:-4] + '_' + str(i) + '.tif'
        # run the moving window (footprint) across the band to create a 'count'
        count_a_band = ndimage.generic_filter(a_band, np.count_nonzero,
                                              footprint=footprint, mode='constant')
        geoTiff.create(count_a_band_fname, g, data, count_a_band, gdal.GDT_Byte, np.nan)
Any suggestions very much appreciated.
Becky
I don't know anything about the spatial sciences stuff, so I'll just focus on the main question :)
what is the most efficient 'moving window' method to 'count' within the window?
A common way to do moving window statistics with Numpy is to use numpy.lib.stride_tricks.as_strided, see for example this answer. Basically, the idea is to make an array containing all the windows, without any increase in memory usage:
from numpy.lib.stride_tricks import as_strided
...
m, n = a_band.shape
newshape = (m - dim + 1, n - dim + 1, dim, dim)
newstrides = a_band.strides * 2  # strides is a tuple
count_a_band = as_strided(a_band, newshape, newstrides).sum(axis=(2, 3))
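As a tiny sanity check of the windowing, on a hypothetical 4x4 array with 3x3 windows:

import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(16).reshape(4, 4)
windows = as_strided(a, (2, 2, 3, 3), a.strides * 2)
print(windows[0, 0])             # the top-left 3x3 window
print(windows.sum(axis=(2, 3)))  # [[45 54] [81 90]]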
However, for your use case this method is inefficient, because you're summing the same numbers over and over again, especially if the window size increases. A better way is to use a cumsum trick, like in this answer:
def windowed_sum_1d(ar, ws, axis=None):
    if axis is None:
        ar = ar.ravel()
    else:
        ar = np.swapaxes(ar, axis, 0)
    ans = np.cumsum(ar, axis=0)
    ans[ws:] = ans[ws:] - ans[:-ws]
    ans = ans[ws-1:]
    if axis:
        ans = np.swapaxes(ans, 0, axis)
    return ans

def windowed_sum(ar, ws):
    for axis in range(ar.ndim):
        ar = windowed_sum_1d(ar, ws, axis)
    return ar
...
count_a_band = windowed_sum(a_band, dim)
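A quick 1-D sanity check of the cumsum step, on hypothetical data with window size 3:

import numpy as np

ar = np.array([1, 0, 1, 1, 0, 1])
ws = 3
ans = np.cumsum(ar)
ans[ws:] = ans[ws:] - ans[:-ws]
print(ans[ws-1:])  # [2 2 2 2], one sum per length-3 window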
Note that with both approaches above (as_strided and the cumsum trick) it would be tedious to handle edge cases. Luckily, there is an easy way to include them and get the same efficiency as the second approach:
count_a_band = ndimage.uniform_filter(a_band, size=dim, mode='constant') * dim**2
Though very similar to what you already had, this will be much faster! The downside is that you may need to round to integers to get rid of floating point rounding errors.
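A minimal sketch of that one-liner on hypothetical data (class labels 0-2, 3x3 window):

import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
data = rng.integers(0, 3, size=(5, 5))  # hypothetical classified raster
dim = 3

a_band = (data == 1).astype(float)      # binary band for class 1
counts = ndimage.uniform_filter(a_band, size=dim, mode='constant') * dim**2
counts = np.rint(counts).astype(int)    # round away floating point error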
As a final note, your code
# create a duplicate '0' array of the raster
a_band = data*0
# we create the binary dataset for the band
a_band = np.where(data == i, 1, a_band)
is a bit redundant: You can just use a_band = (data == i).
