Boolean Indexing numpy Array with or logical operator - python

I am trying to do Boolean indexing on a NumPy array with a logical "or" condition, but I cannot find a good way to do it.
The "and" operator & works as expected:
X = np.arange(25).reshape(5, 5)
# We print X
print()
print('Original X = \n', X)
print()
X[(X > 10) & (X < 17)] = -1
# We print X
print()
print('X = \n', X)
print()
Original X =
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
X =
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 -1 -1 -1 -1]
[-1 -1 17 18 19]
[20 21 22 23 24]]
But when I try with:
X = np.arange(25).reshape(5, 5)
# We use Boolean indexing to assign the value 0 to the elements that are less than 10 or greater than 20
X[ (X < 10) or (X > 20) ] = 0 # No or condition possible!?!
I got the error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Is there a good way to use the logical "or" operator here?

You can use numpy.logical_or for that task in the following way:
import numpy as np
X = np.arange(25).reshape(5,5)
X[np.logical_or(X<10,X>20)] = 0
print(X)
Output:
[[ 0 0 0 0 0]
[ 0 0 0 0 0]
[10 11 12 13 14]
[15 16 17 18 19]
[20 0 0 0 0]]
There is also numpy.logical_and, numpy.logical_xor and numpy.logical_not
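As a side note, the ValueError in the question arises because Python's `or` needs a single truth value for each operand, which an array with more than one element cannot provide. The element-wise operator `|` is the counterpart of the `&` used in the question, so a minimal sketch of the same assignment could look like this (the name `mask` is only illustrative; the parentheses are needed because `|` binds more tightly than the comparisons):

```python
import numpy as np

X = np.arange(25).reshape(5, 5)

# Element-wise "or": parentheses are required around each comparison
mask = (X < 10) | (X > 20)
X[mask] = 0

print(X)
```

This should give the same array as the np.logical_or version above.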

I would use something with np.logical_and and np.where.
For your given example, I believe this would work.
X = np.arange(25).reshape(5, 5)
i = np.where(np.logical_and(X > 10 , X < 17))
X[i] = -1
This is not the most Pythonic answer, but it is pretty clear.

Related

Faster Way to Translate DataFrame Column to Feature and Target Matrix

I have a column (binary) in a dataframe (df) of the form:
Vector
0
1
0
1
0
.
.
.
I am using this in a binary classification model. My objective is to take these 0's and 1's and move them into two separate lists, which then get translated into numpy arrays. As an example, I would like to move the first 5 items from Vector into X, then the 6th item into Y. Then the next 5 items into X, and the following 6th item into Y, and so on until the end of the df (currently 200k rows).
My first instinct is to write a for loop for this (but I know this is hugely inefficient):
X, Y = [], []
i_cnt = 1
for i in range(df.shape[0]):
    # as we iterate through the df,
    # 5 items go to X and every 6th item goes to Y
    if i_cnt > 5:
        y = df['Vector'].iloc[i]
        Y.append(y)
        i_cnt = 1
    else:
        x = df['Vector'].iloc[i]
        X.append(x)
        i_cnt += 1
There is definitely a faster way to do this, and I am hoping someone knows how I can do that.
Create an array of positions with the length of the index, take it modulo 6, and compare the result to select X and Y:
#sample data for easy verify
df = pd.DataFrame({'Vector': range(20)})
idx = np.arange(len(df)) % 6
X = df.loc[idx < 5, 'Vector']
print (X)
0 0
1 1
2 2
3 3
4 4
6 6
7 7
8 8
9 9
10 10
12 12
13 13
14 14
15 15
16 16
18 18
19 19
Y = df.loc[idx == 5, 'Vector']
print (Y)
5 5
11 11
17 17
If a different output format is needed, where X is a 2D array, use reshape with -1 (so the number of rows is computed automatically) and 6 columns, then select by indexing:
df = pd.DataFrame({'Vector': range(18)})
arr = df['Vector'].to_numpy().reshape(-1, 6)
X = arr[:, :-1]
Y = arr[:, -1]
print (X)
[[ 0 1 2 3 4]
[ 6 7 8 9 10]
[12 13 14 15 16]]
print (Y)
[ 5 11 17]
For k = 5 + 1 = 6,
k = 6
n_rows = len(df.index)
n_samples = n_rows // k
X_and_y = df.Vector.to_numpy().reshape(n_samples, k)
X = X_and_y[:, :-1]
y = X_and_y[:, -1]
We reshape the column to a (n_samples, 5 + 1) array where n_samples = n_rows / 6, then we take all but last column into X and last column into y.
e.g.
>>> df = pd.DataFrame(np.random.randint(0, 2, size=18), columns=["Vector"])
>>> df
Vector
0 0
1 0
2 1
3 1
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 1
12 0
13 0
14 1
15 0
16 0
17 1
>>> # after
>>> X
array([[0, 0, 1, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0]])
>>> y
array([0, 1, 1])
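One caveat on the reshape approaches above: reshape raises an error when the number of rows is not a multiple of 6. A minimal sketch, assuming it is acceptable to drop the incomplete trailing group (the names `values` and `trimmed` and the 20-row sample frame are only illustrative):

```python
import numpy as np
import pandas as pd

k = 6  # 5 features + 1 target per sample

# 20 rows: not a multiple of 6, so the last 2 rows form an incomplete group
df = pd.DataFrame({"Vector": np.arange(20) % 2})

values = df["Vector"].to_numpy()
n_samples = len(values) // k

# Drop the incomplete trailing group before reshaping
trimmed = values[: n_samples * k]
X_and_y = trimmed.reshape(n_samples, k)

X = X_and_y[:, :-1]  # first 5 items of each group
y = X_and_y[:, -1]   # 6th item of each group

print(X.shape, y.shape)
```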
You can try
X = list(df[df.index % 6 < 5]["Vector"])
y = list(df[df.index % 6 == 5]["Vector"])

How to recursively split a 2D array into a tensor?

I have turned a dataframe that has tuples of length 2 as its index
         1  2  -1
(0, 1)   0  1   0
(0, 2)   1  0   0
(0, -1)  0  0   0
(1, 1)   1  0   0
(1, 2)   0  1   0
(1, -1)  1  1   1
into a numpy 2D array and managed to split it into a 3D array (with regard to the first value of the index) using the split function:
arr = np.array(np.array_split(arr,2))
with result
[[[0 1 0]
[1 0 0]
[0 0 0]]
[[1 0 0]
[0 1 0]
[1 1 1]]]
I want to make a function to do the split even further, for example, to create 5D tensor from (0,0,0,0) (length 4) indices.
Any idea on how to do this recursively?
Use the following code to generate sample data:
import pandas as pd
import numpy as np
import itertools
def create_fake_data_frame(nlevels = 2, ncols = 3):
    result = pd.DataFrame(
        index=itertools.product(*(nlevels * [[0, 1]])),
        data=np.arange(ncols*2**nlevels).reshape(2**nlevels, ncols)
    )
    result = convert_index_of_tuples_to_multiindex(result)
    return result
def convert_index_of_tuples_to_multiindex(df):
    return df.set_index(pd.MultiIndex.from_tuples(df.index))
# Increase nlevels to get dataframes with more levels in their MultiIndex
df = create_fake_data_frame(nlevels=3)
print(df)
This is the result:
        0   1   2
0 0 0   0   1   2
    1   3   4   5
  1 0   6   7   8
    1   9  10  11
1 0 0  12  13  14
    1  15  16  17
  1 0  18  19  20
    1  21  22  23
Then, modify the dataframe in such a way that each row contains a single
column, whose value is a list of the values in the corresponding row of
the original dataframe:
def data_frame_with_single_column_of_lists(df):
    if len(df.columns) <= 1:
        return df
    result = df.apply(collapse_columns_into_lists, axis=1)
    return result
def collapse_columns_into_lists(s):
    result = s.copy()
    result['lists'] = result.values.tolist()
    result = result[['lists']]
    return result
df = data_frame_with_single_column_of_lists(df)
print(df)
The output will be like this:
              lists
0 0 0     [0, 1, 2]
    1     [3, 4, 5]
  1 0     [6, 7, 8]
    1   [9, 10, 11]
1 0 0  [12, 13, 14]
    1  [15, 16, 17]
  1 0  [18, 19, 20]
    1  [21, 22, 23]
Finally, use the following code to get a tensor
def increase_list_nesting_by_removing_an_index_level(df):
    def list_of_lists(series):
        result = series.to_frame().set_index(series.index.droplevel(-1))
        result = result.apply(lambda x: x['lists'], axis=1).to_frame()
        result = [x[0] for x in result.values.tolist()]
        return result
    grouped = df.groupby(df.index.droplevel(-1))
    result = grouped.agg(list_of_lists)
    if type(result.index[0]) == tuple:
        result = convert_index_of_tuples_to_multiindex(result)
    return result
def tensor_from_data_frame(df):
    if df.index.nlevels <= 1:
        return np.array([i[0] for i in df.values])
    result = increase_list_nesting_by_removing_an_index_level(df)
    result = tensor_from_data_frame(result)
    return result
tensor = tensor_from_data_frame(df)
print(tensor)
The result will be like this:
[[[[ 0 1 2]
[ 3 4 5]]
[[ 6 7 8]
[ 9 10 11]]]
[[[12 13 14]
[15 16 17]]
[[18 19 20]
[21 22 23]]]]
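If the MultiIndex contains every combination of its level values exactly once, the recursion can probably be avoided with a single reshape. The following is only a sketch under that assumption, not the method used above; the helper name `tensor_from_full_multiindex` and the sample frame `demo` are hypothetical:

```python
import numpy as np
import pandas as pd

def tensor_from_full_multiindex(df):
    """Reshape a frame whose MultiIndex is a complete Cartesian product."""
    # Sort so that rows follow the lexicographic order of the index levels
    df = df.sort_index()
    level_sizes = [len(level) for level in df.index.levels]
    # Assumption check: one row per combination of level values
    if int(np.prod(level_sizes)) != len(df):
        raise ValueError("index is not a complete product of its levels")
    return df.to_numpy().reshape(*level_sizes, df.shape[1])

# Sample data equivalent to the 3-level frame printed above
index = pd.MultiIndex.from_product([[0, 1]] * 3)
demo = pd.DataFrame(np.arange(24).reshape(8, 3), index=index)

tensor = tensor_from_full_multiindex(demo)
print(tensor.shape)
```

For an index with missing combinations the helper raises an error, and the recursive approach above remains the safer choice.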

How to efficiently subtract values from each column with numpy

I have a 2D array of shape (50,50). I need to subtract a value from each column of this array (skipping the first), where the value is calculated from the index of the column. For example, using a for loop it would look something like this:
for idx in range(1, A[0, :].shape[0]):
    A[0, idx] -= idx * (...) # simple calculations with idx
Now, of course this works fine, but it's very slow and performance is critical for my application. I've tried computing the values to be subtracted using np.fromfunction() and then subtracting them from the original array, but the results are different from those obtained by the iterative for loop subtraction:
func = lambda i, j: j * (...) #some simple calculations
subtraction_matrix = np.fromfunction(np.vectorize(func), (1,50))
A[0, 1:] -= subtraction_matrix
What am I doing wrong? Or is there some other method that would be better? Any help is appreciated!
All your code snippets indicate that you require the subtraction to happen only in the first row of A (though you've not explicitly mentioned that). So, I'm proceeding with that understanding.
Referring to your use of np.fromfunction(), you can use the subtraction_matrix as below:
A[0,1:] -= subtraction_matrix[1:]
Testing it out (assuming shape (5,5) instead of (50,50)):
import numpy as np
A = np.arange(25).reshape(5,5)
print (A)
func = lambda j: j * 10 #some simple calculations
subtraction_matrix = np.fromfunction(np.vectorize(func), (5,), dtype=A.dtype)
A[0,1:] -= subtraction_matrix[1:]
print (A)
Output:
[[ 0 1 2 3 4] # print(A), before subtraction
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]
[20 21 22 23 24]]
[[ 0 -9 -18 -27 -36] # print(A), after subtraction
[ 5 6 7 8 9]
[ 10 11 12 13 14]
[ 15 16 17 18 19]
[ 20 21 22 23 24]]
If you want the subtraction to happen in all the rows of A, you just need to use the line A[:,1:] -= subtraction_matrix[1:], instead of the line A[0,1:] -= subtraction_matrix[1:]
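When the per-column value is plain arithmetic on the column index, np.fromfunction and np.vectorize may not be needed at all. A minimal sketch, assuming the same stand-in calculation `j * 10` as in the test above (the real calculation is elided in the question) and using the illustrative name `offsets`:

```python
import numpy as np

A = np.arange(25).reshape(5, 5)

# Column indices 1..4; index 0 is skipped, as in the question
col_idx = np.arange(1, A.shape[1])

# Stand-in for the elided "simple calculations with idx"
offsets = col_idx * 10

A[0, 1:] -= offsets  # first row only; use A[:, 1:] for every row

print(A[0])
```

The first row should match the output of the np.fromfunction version shown above.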

How does NumPy Sum (with axis) work?

I've taken it upon myself to learn how NumPy works for my own curiosity.
It seems that the simplest function is the hardest to translate to code (I understand things best through code). It's easy to hard-code each axis for each case, but I want to find a dynamic algorithm that can sum along any axis of an n-dimensional array.
The documentation on the official website is not helpful (it only shows the result, not the process), and it's hard to navigate through the Python/C code.
Note: I did figure out that when an array is summed, the axis specified is "removed", i.e. Sum of an array with a shape of (4, 3, 2) with axis 1 yields an answer of an array with a shape of (4, 2)
Setup
consider the numpy array a
a = np.arange(30).reshape(2, 3, 5)
print(a)
[[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]
[[15 16 17 18 19]
[20 21 22 23 24]
[25 26 27 28 29]]]
Where are the dimensions?
The dimensions and positions are highlighted by the following
          p  p  p  p  p
          o  o  o  o  o
          s  s  s  s  s
dim 2     0  1  2  3  4
          |  |  |  |  |
dim 0     ↓  ↓  ↓  ↓  ↓
----> [[[ 0  1  2  3  4]    <---- dim 1, pos 0
pos 0   [ 5  6  7  8  9]    <---- dim 1, pos 1
        [10 11 12 13 14]]   <---- dim 1, pos 2
dim 0
---->  [[15 16 17 18 19]    <---- dim 1, pos 0
pos 1   [20 21 22 23 24]    <---- dim 1, pos 1
        [25 26 27 28 29]]]  <---- dim 1, pos 2
          ↑  ↑  ↑  ↑  ↑
          |  |  |  |  |
dim 2     p  p  p  p  p
          o  o  o  o  o
          s  s  s  s  s
          0  1  2  3  4
Dimension examples:
This becomes more clear with a few examples
a[0, :, :] # dim 0, pos 0
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]
a[:, 1, :] # dim 1, pos 1
[[ 5 6 7 8 9]
[20 21 22 23 24]]
a[:, :, 3] # dim 2, pos 3
[[ 3 8 13]
[18 23 28]]
sum
explanation of sum and axis
a.sum(0) is the sum of all slices along dim 0
a.sum(0)
[[15 17 19 21 23]
[25 27 29 31 33]
[35 37 39 41 43]]
same as
a[0, :, :] + \
a[1, :, :]
[[15 17 19 21 23]
[25 27 29 31 33]
[35 37 39 41 43]]
a.sum(1) is the sum of all slices along dim 1
a.sum(1)
[[15 18 21 24 27]
[60 63 66 69 72]]
same as
a[:, 0, :] + \
a[:, 1, :] + \
a[:, 2, :]
[[15 18 21 24 27]
[60 63 66 69 72]]
a.sum(2) is the sum of all slices along dim 2
a.sum(2)
[[ 10 35 60]
[ 85 110 135]]
same as
a[:, :, 0] + \
a[:, :, 1] + \
a[:, :, 2] + \
a[:, :, 3] + \
a[:, :, 4]
[[ 10 35 60]
[ 85 110 135]]
default axis is None
this means all axes are summed, i.e. all numbers are added together.
a.sum()
435
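To address the request for a dynamic algorithm, the slice-by-slice picture above can be written for any number of dimensions. This is only an illustrative sketch, not how NumPy implements sum internally; the function name `sum_along_axis` is hypothetical and a non-negative `axis` is assumed:

```python
import numpy as np

def sum_along_axis(a, axis):
    """Sum an n-dimensional array along one non-negative axis with loops."""
    a = np.asarray(a)
    # The summed axis is removed from the output shape
    out_shape = a.shape[:axis] + a.shape[axis + 1:]
    out = np.zeros(out_shape, dtype=a.dtype)
    # Visit every position of the output ...
    for idx in np.ndindex(*out_shape):
        # ... and add up the input values along the removed axis
        for k in range(a.shape[axis]):
            full_idx = idx[:axis] + (k,) + idx[axis:]
            out[idx] += a[full_idx]
    return out

a = np.arange(30).reshape(2, 3, 5)
print(sum_along_axis(a, 1))
```

For the array a above, the result for each axis should agree with a.sum(axis).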
I use a nested loop operation to explain it.
import numpy as np
n = np.array(
    [[[1, 2, 3],
      [4, 5, 6],
      [7, 8, 9]],
     [[2, 4, 6],
      [8, 10, 12],
      [14, 16, 18]],
     [[1, 3, 5],
      [7, 9, 11],
      [13, 15, 17]]])
print(n)
print("============ sum axis=None=============")
sum = 0
for i in range(3):
    for j in range(3):
        for k in range(3):
            sum += n[k][i][j]
print(sum) # 216
print('------------------')
print(np.sum(n)) # 216
print("============ sum axis=0 =============")
for i in range(3):
    for j in range(3):
        sum = 0
        for axis in range(3):
            sum += n[axis][i][j]
        print(sum,end=' ')
    print()
print('------------------')
print("sum[0][0] = %d" % (n[0][0][0] + n[1][0][0] + n[2][0][0]))
print("sum[1][1] = %d" % (n[0][1][1] + n[1][1][1] + n[2][1][1]))
print("sum[2][2] = %d" % (n[0][2][2] + n[1][2][2] + n[2][2][2]))
print('------------------')
print(np.sum(n, axis=0))
print("============ sum axis=1 =============")
for i in range(3):
    for j in range(3):
        sum = 0
        for axis in range(3):
            sum += n[i][axis][j]
        print(sum,end=' ')
    print()
print('------------------')
print("sum[0][0] = %d" % (n[0][0][0] + n[0][1][0] + n[0][2][0]))
print("sum[1][1] = %d" % (n[1][0][1] + n[1][1][1] + n[1][2][1]))
print("sum[2][2] = %d" % (n[2][0][2] + n[2][1][2] + n[2][2][2]))
print('------------------')
print(np.sum(n, axis=1))
print("============ sum axis=2 =============")
for i in range(3):
    for j in range(3):
        sum = 0
        for axis in range(3):
            sum += n[i][j][axis]
        print(sum,end=' ')
    print()
print('------------------')
print("sum[0][0] = %d" % (n[0][0][0] + n[0][0][1] + n[0][0][2]))
print("sum[1][1] = %d" % (n[1][1][0] + n[1][1][1] + n[1][1][2]))
print("sum[2][2] = %d" % (n[2][2][0] + n[2][2][1] + n[2][2][2]))
print('------------------')
print(np.sum(n, axis=2))
print("============ sum axis=(0,1)) =============")
for i in range(3):
    sum = 0
    for axis1 in range(3):
        for axis2 in range(3):
            sum += n[axis1][axis2][i]
    print(sum,end=' ')
print()
print('------------------')
print("sum[1] = %d" % (n[0][0][1] + n[0][1][1] + n[0][2][1] +
                       n[1][0][1] + n[1][1][1] + n[1][2][1] +
                       n[2][0][1] + n[2][1][1] + n[2][2][1] ))
print('------------------')
print(np.sum(n, axis=(0,1)))
result:
[[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]]
[[ 2 4 6]
[ 8 10 12]
[14 16 18]]
[[ 1 3 5]
[ 7 9 11]
[13 15 17]]]
============ sum axis=None=============
216
------------------
216
============ sum axis=0 =============
4 9 14
19 24 29
34 39 44
------------------
sum[0][0] = 4
sum[1][1] = 24
sum[2][2] = 44
------------------
[[ 4 9 14]
[19 24 29]
[34 39 44]]
============ sum axis=1 =============
12 15 18
24 30 36
21 27 33
------------------
sum[0][0] = 12
sum[1][1] = 30
sum[2][2] = 33
------------------
[[12 15 18]
[24 30 36]
[21 27 33]]
============ sum axis=2 =============
6 15 24
12 30 48
9 27 45
------------------
sum[0][0] = 6
sum[1][1] = 30
sum[2][2] = 45
------------------
[[ 6 15 24]
[12 30 48]
[ 9 27 45]]
============ sum axis=(0,1)) =============
57 72 87
------------------
sum[1] = 72
------------------
[57 72 87]

Python: select function

With this code:
import scipy
from scipy import *
x = r_[1:15]
print x
a = select([x > 7, x >= 4],[x,x+10])
print a
I get this answer:
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
[ 0 0 0 14 15 16 17 8 9 10 11 12 13 14]
But why do I have zeros in the beginning and not in the end? Thanks in advance.
You seem to be using numpy.
From the documentation for numpy.select():
numpy.select(condlist, choicelist, default=0)
...
default: The element inserted in output when all conditions evaluate to False.
The conditions are checked in order, and for each element the first condition that is true selects the choice. Since your conditions are x > 7 and x >= 4, the output array will have elements from x where x > 7, and elements from x+10 where only x >= 4 holds (that is, for 4 through 7). When both conditions are false, i.e., when x < 4, you will get default, which is 0. So you get 3 zeros at the beginning.
You don't get any zeros at the end because at least one of the conditions is true there (both are true, in fact).
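A small sketch of the same call written against numpy directly, with an explicit default to make the fallback visible. The value -1 and the names `a` and `b` are only illustrative choices:

```python
import numpy as np

x = np.r_[1:15]

# Same call as in the question; default is 0 where no condition holds
a = np.select([x > 7, x >= 4], [x, x + 10])

# Same conditions, but a visible marker where no condition holds
b = np.select([x > 7, x >= 4], [x, x + 10], default=-1)

print(a)
print(b)
```

The first three entries of b should be -1 instead of 0, while the remaining entries are the same as in a.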
