Change dataset entries based on a boolean mask - python

As part of a wider workflow, I need to perform the following operation: given 3 datasets with the same shape, one of them contains only boolean values and will be referred to as the "mask".
Essentially I need a function that changes each entry of the first dataset, using values from the second one, if the corresponding entry in the mask equals 1.
The following function does the job:
def swap(a, b, c):
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            if c.iloc[i, j] == 1:
                a.iloc[i, j] = b.iloc[i, j]
    return a
but I doubt very much that this is efficient, to say the least.
For starters, it would certainly be best not to iterate over all entries, but only over the indices corresponding to 1s in the mask.
Still, in general, are there any pandas/numpy functions or implementations I should be considering? I could not find much at all. Thanks.

You can use np.copyto:
import numpy as np
a, b, c = np.random.randint([0, 10, 0], [10, 20, 2], [10, 3]).T
a
# array([4, 5, 2, 6, 3, 6, 3, 1, 0, 7])
b
# array([19, 10, 17, 17, 18, 13, 15, 17, 14, 16])
c
# array([0, 1, 1, 1, 0, 1, 0, 1, 1, 1])
np.copyto(a,b,where=c.astype(bool))
a
# array([ 4, 10, 17, 17, 3, 13, 3, 17, 14, 16])

Using NumPy will be better:
import numpy as np
a = b.values*c.values + a.values*np.logical_not(c.values)
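For instance, with small 1-D arrays (an illustrative sketch I added; the values are made up):
import numpy as np

a = np.array([4, 5, 2])
b = np.array([19, 10, 17])
c = np.array([0, 1, 1])
b*c + a*np.logical_not(c)
# array([ 4, 10, 17])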

You can use boolean array indexing in numpy. Here is a simple example:
import numpy as np
A = np.random.randn(5, 5)
B = np.ones((5, 5))
C = np.random.randint(2, size=(5, 5))  # random 0/1 mask
A[C == 1] = B[C == 1]
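Since the question uses .iloc, the inputs are presumably pandas DataFrames; here is a minimal pandas-native sketch (my addition, assuming a, b and c are DataFrames of the same shape):
import numpy as np
import pandas as pd

a = pd.DataFrame(np.zeros((3, 3), dtype=int))
b = pd.DataFrame(np.arange(9).reshape(3, 3))
c = pd.DataFrame(np.random.randint(2, size=(3, 3)))

# DataFrame.mask replaces entries where the condition is True
a = a.mask(c == 1, b)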

Related

Numpy 1D array - find indices of boundaries of subsequences of the same number [duplicate]

I have a numpy.array made of zeros and ones, e.g.:
import numpy
a = numpy.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1])
And now I need to get the first and last index of 1 in each sequence of ones. If I use where, I can get indices of each 1 in the array:
ones = numpy.where(a == 1)
# ones = (array([ 3, 4, 5, 6, 9, 10, 14, 15, 16, 17], dtype=int64),)
But I would like to get only the boundaries, i.e.:
# desired:
ones = (array([ 3, 6, 9, 10, 14, 17], dtype=int64),)
Could you please help me figure out how to achieve this result? Thank you.
You can find the beginning and end of these sequences by shifting and comparing using bitwise operators, and np.where to get the corresponding indices:
def first_and_last_seq(x, n):
    a = np.r_[n-1, x, n-1]
    a = a == n
    start = np.r_[False, ~a[:-1] & a[1:]]
    end = np.r_[a[:-1] & ~a[1:], False]
    return np.where(start | end)[0] - 1
Checking with the proposed example:
a = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1])
first_and_last_seq(a, 1)
# array([ 3, 6, 9, 10, 14, 17])
Or with the following array:
a = np.array([5,5,5,6,2,3,5,5,5,2,3,5,5])
first_and_last_seq(a, 5)
# array([ 0,  2,  6,  8, 11, 12])
Further details:
A simple way to check for consecutive values in numpy is to use bitwise operators to compare shifted versions of an array. Note that ~a[:-1] & a[1:] is doing precisely that. The first term is the array sliced up to the last element, and the second term is a slice from the second element onwards.
Note that a is a boolean array, given a = a==n. In the above case we are taking a NOT of the first shifted boolean array (since we want a True where the value is False). And by taking a bitwise AND with the next value, we will only have True where the next sample is True. This way we set to True only the indices where the sequences start (i.e. we've matched the subsequence [False, True]).
Now the same logic applies for end. And by taking an OR of both arrays and np.where on the result, we get all start and end indices.
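An aside I would add (not from the answer): for a plain 0/1 array like the one in the question, the same boundaries can also be read off np.diff of a padded copy:
import numpy as np

a = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1])
# pad with 0 on both sides so runs touching the edges are detected too
d = np.diff(np.r_[0, a, 0])     # +1 at run starts, -1 just past run ends
starts = np.where(d == 1)[0]
ends = np.where(d == -1)[0] - 1
np.sort(np.r_[starts, ends])
# array([ 3,  6,  9, 10, 14, 17])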

Get the relative extrema from 1D numpy array

I'm writing code that includes an algorithm to find local maximum/minimum values in an array, but I failed to find the proper function.
At first, I used argrelextrema in scipy.signal.
b = [6, 1, 3, 5, 5, 3, 1, 2, 2, 3, 2, 1, 1, 9, 10, 10, 9, 8, 7, 7, 13, 10]
scipy.signal.argrelextrema(np.array(b), np.greater)
scipy.signal.argrelextrema(np.array(b), np.greater_equal)
scipy.signal.argrelextrema(np.array(b), np.greater_equal, order=2)
The result is
(array([ 9, 20], dtype=int64),)
(array([ 0, 3, 4, 7, 9, 14, 15, 20], dtype=int64),)
(array([ 0, 3, 4, 9, 14, 15, 20], dtype=int64),)
The first one didn't catch b[3] (or b[4]). So I modified it into the second one, using np.greater_equal. However, in this case the first value b[0] is also treated as a local maximum, and the value 2 at b[7] is included. By using the third one, I could throw away b[7]. But order=2 still has a problem when the data is like [1, 3, 1, 4, 1] (it can't catch the 3).
My expected result is
[3(or 4), 9, 14(or 15), 20]
I want to catch only one among b[3] and b[4] (same value), and I want the problems of argrelextrema mentioned above to be solved. The code below succeeded:
scipy.signal.find_peaks(b)
The result is [3, 9, 14, 20].
The code I'm writing treats pairs of local maxima and local minima, so I want to find the local minima in the same way. Is there any function like scipy.signal.find_peaks for finding local minima?
You could simply apply find_peaks to the negated version of your array:
from scipy.signal import find_peaks
min_idx, _ = find_peaks([-x for x in b])  # find_peaks returns (indices, properties)
Even more convenient when using numpy arrays:
import numpy as np
b = np.array(b)
min_idx, _ = find_peaks(-b)
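Putting both together (a small check I added, not from the answer; note that for flat peaks find_peaks reports the middle index, rounded down):
import numpy as np
from scipy.signal import find_peaks

b = np.array([6, 1, 3, 5, 5, 3, 1, 2, 2, 3, 2, 1, 1, 9, 10, 10, 9, 8, 7, 7, 13, 10])
max_idx, _ = find_peaks(b)    # local maxima
min_idx, _ = find_peaks(-b)   # local minima, as peaks of the negated signal
# max_idx -> array([ 3,  9, 14, 20])
# min_idx -> array([ 1,  6, 11, 18])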

Always add all values before each entry in obj in numpy.insert

I have a 1D array a, and every time there is a zero value in this array I would like to copy a complete array b before this entry, for example:
b=np.ones(30)
I got the locations of the needed entries with:
c = np.nonzero(a == 0)
In doing so:
len(c[0]) > len(b)
If I then use
np.insert(arr, obj, values, axis=None)
that is,
np.insert(a, c, b)
np.insert will only copy one value of b before each position specified in c.
Question
How do I have to modify the np.insert call so that it will always copy all entries in values before each entry in obj?
A list approach (with b and c assumed as below, to match the output):
In [196]: a = [0, 0, 1, 0, 1, 2, 0, 1, 2, 3]
In [197]: b = [10, 11, 12]; c = [0, 1, 3, 6]   # c: indices of the zeros in a
In [198]: for i in c[::-1]:
     ...:     a[i:i+1] = b + [a[i]]
     ...:
In [199]: a
Out[199]: [10, 11, 12, 0, 10, 11, 12, 0, 1, 10, 11, 12, 0, 1, 2, 10, 11, 12, 0, 1, 2, 3]
insert can insert a single (different) value before each position:
In [200]: arr
Out[200]: array([0, 0, 1, 0, 1, 2, 0, 1, 2, 3])
In [201]: np.insert(arr, c, np.arange(10, 14))
Out[201]: array([10, 0, 11, 0, 1, 12, 0, 1, 2, 13, 0, 1, 2, 3])
insert is written in Python, but is a bit complex, especially since it can handle multidimensional inserts (axis). But essentially it does:
res = np.empty(result_size)
idx1 = <where original a values go>
# e.g. [ 3, 7, 8, 12, 13, 14, 18, 19, 20, 21]
res[idx1] = a
idx2 = <where insert values go>
# e.g. [0,1,2, 4,5,6, 9,10,11, 15,16,17]
res[idx2] = values
idx1 can be derived from idx2 with:
mask = np.ones(res.shape, bool)
mask[idx2] = False
idx1 = np.where(mask)[0]
So you only have to construct one or the other, whichever is easier.
Expanding that logic to your case could get messy. I suspect @Divakar could quickly come up with something using cumsum or np.maximum.accumulate.
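Here is a sketch of that expansion for this case (my addition; it reproduces the idx1/idx2 example values above):
import numpy as np

a = np.array([0, 0, 1, 0, 1, 2, 0, 1, 2, 3])
b = np.array([10, 11, 12])
c = np.where(a == 0)[0]                      # insert points: [0, 1, 3, 6]

res = np.empty(len(a) + len(c) * len(b), dtype=a.dtype)

# each original index is shifted right by len(b) for every insert point
# at or before it
shift = np.zeros(len(a), dtype=int)
shift[c] = len(b)
idx1 = np.arange(len(a)) + np.cumsum(shift)  # [3, 7, 8, 12, 13, 14, 18, 19, 20, 21]

mask = np.ones(len(res), dtype=bool)
mask[idx1] = False
res[idx1] = a
res[mask] = np.tile(b, len(c))               # fill the gaps with copies of b
# res -> [10 11 12  0 10 11 12  0  1 10 11 12  0  1  2 10 11 12  0  1  2  3]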
A different iterative approach - add the values of b one by one:
In [234]: arr1 = arr.copy()
In [236]: for i in b:
     ...:     idx = np.where(arr1==0)[0]
     ...:     print(idx)
     ...:     arr1 = np.insert(arr1, idx, i)
     ...:
[0 1 3 6]
[ 1  3  6 10]
[ 2  5  9 14]
In [237]: arr1
Out[237]:
array([10, 11, 12,  0, 10, 11, 12,  0,  1, 10, 11, 12,  0,  1,  2, 10, 11,
       12,  0,  1,  2,  3])
Given the overhead of creating arrays (and I'm sure insert has such overhead), I suspect this will be slower than the pure-list iteration - unless the number of insert points in c is much larger than the length of b. This also depends on being able to identify the original insert points at each iteration (here with arr1==0).
A possible solution would be to duplicate the obj array as often as len(values) using np.tile(obj, len(values)), re-sort it afterwards using np.sort(), and then use this new array instead of the initial obj array in the np.insert call.
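A sketch of that idea (my addition; np.insert places the tiled values in order before each repeated position):
import numpy as np

arr = np.array([0, 0, 1, 0, 1, 2, 0, 1, 2, 3])
b = np.array([10, 11, 12])
c = np.nonzero(arr == 0)[0]

np.insert(arr, np.sort(np.tile(c, len(b))), np.tile(b, len(c)))
# array([10, 11, 12,  0, 10, 11, 12,  0,  1, 10, 11, 12,  0,  1,  2,
#        10, 11, 12,  0,  1,  2,  3])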

`np.average()` format option

I'm trying to understand a Python code, and a specific line of it has troubled me a bit:
mean = np.average(data[:,index])
I understand that this is an average calculation of the data declared earlier, but what does [:,index] indicate?
I apologise if this question is a duplicate, but please link me a solution before you mark it down. This is my first day exposed to Python, so please excuse my ignorance. I appreciate any kind advice!
Below is part of the original code:
data = np.genfromtxt(args.inputfile)

def doBlocking(data, index):
    ndata = data.shape[0]
    ncols = data.shape[1] - 1
    # things unimportant
    mean = np.average(data[:,index])
    # more unimportance
This is so-called slicing. In your case, the average of a specific column (the one whose index equals the variable named index) of a 2-dimensional array is calculated.
In this case data is a two-dimensional numpy.array. Numpy supports slicing similar to that of Matlab:
In [1]: import numpy as np
In [2]: data = np.arange(15)
In [3]: data
Out[3]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
In [4]: data = data.reshape([5,3])
In [5]: data
Out[5]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
In [6]: data[:, 1]
Out[6]: array([ 1, 4, 7, 10, 13])
As you can see, it selects the second column.
Your code above will get the mean of column index. It basically says: "Compute the mean for the data in every row, within column index".
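To tie this back to the array above (a quick check I added):
import numpy as np

data = np.arange(15).reshape(5, 3)
index = 1
np.average(data[:, index])   # 7.0, the mean of column 1: (1+4+7+10+13)/5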

Set numpy array elements to zero if they are above a specific threshold

Say I have a numpy array consisting of 10 elements, for example:
a = np.array([2, 23, 15, 7, 9, 11, 17, 19, 5, 3])
Now I want to efficiently set all values of a higher than 10 to 0, so I'll get:
[2, 0, 0, 7, 9, 0, 0, 0, 5, 3]
Currently I use a for loop, which is very slow:
# Zero values above the threshold value.
def flat_values(sig, tv):
    """
    :param sig: signal.
    :param tv: threshold value.
    :return:
    """
    for i in np.arange(np.size(sig)):
        if sig[i] > tv:
            sig[i] = 0
    return sig
How can I achieve that in the most efficient way, having in mind big arrays of, say, 10^6 elements?
In [7]: a = np.array([2, 23, 15, 7, 9, 11, 17, 19, 5, 3])
In [8]: a[a > 10] = 0
In [9]: a
Out[9]: array([2, 0, 0, 7, 9, 0, 0, 0, 5, 3])
Generally, list comprehensions are faster than for loops in Python (because Python knows it doesn't need to care about a lot of things that might happen in a regular for loop):
a = [0 if a_ > thresh else a_ for a_ in a]
but, as @unutbu correctly pointed out, numpy allows boolean indexing, and element-wise comparisons give you boolean index arrays, so:
super_threshold_indices = a > thresh
a[super_threshold_indices] = 0
would be even faster.
Generally, when applying methods on vectors of data, have a look at numpy.ufuncs, which often perform much better than python functions that you map using any native mechanism.
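For instance, a vectorized one-liner in that spirit (my addition, multiplying by a boolean mask; not from the answer itself):
import numpy as np

a = np.array([2, 23, 15, 7, 9, 11, 17, 19, 5, 3])
a * (a <= 10)   # array([2, 0, 0, 7, 9, 0, 0, 0, 5, 3])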
If you don't want to change your original array:
In [2]: a = np.array([2, 23, 15, 7, 9, 11, 17, 19, 5, 3])
In [3]: b = np.where(a > 10, 0, a)
In [4]: b
Out[4]: array([2, 0, 0, 7, 9, 0, 0, 0, 5, 3])
In [5]: a
Out[5]: array([ 2, 23, 15, 7, 9, 11, 17, 19, 5, 3])
From the Neural Networks from Scratch series by sentdex on YouTube, he used np.maximum(0, [your array]) to turn all values less than 0 into 0.
For your question I tried np.minimum(10, [your array]) and it seemed to work incredibly fast. I even did it on an array of 10^7 elements (a uniform distribution generated using 50 * np.random.rand(10000000)), and it ran in 0.039571 seconds. I hope this is fast enough. Note, though, that np.minimum clips values to the threshold rather than setting them to 0, so it does not reproduce the exact output asked for.
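A quick comparison of the two behaviours (my addition):
import numpy as np

a = np.array([2, 23, 15, 7, 9, 11, 17, 19, 5, 3])
np.minimum(10, a)        # clips:  array([ 2, 10, 10,  7,  9, 10, 10, 10,  5,  3])
np.where(a > 10, 0, a)   # zeroes: array([2, 0, 0, 7, 9, 0, 0, 0, 5, 3])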
