Having a matrix with d features and n samples, I would like to compare each feature of a sample (row) against the mean of the column corresponding to that feature and then assign a corresponding label 1 or 0.
E.g., for a matrix X = [x11, x12; x21, x22] I compute the means of the two columns (mu1, mu2) and then compare each entry with its column mean (x11 and x21 with mu1, and so on) to check whether it is greater or smaller than that mean, and assign a label according to the if statement (see below).
I have the mean vector for the columns, i.e. a vector of length d.
I am currently using for-loops, but these are not computationally efficient.
X_copy = X_train.copy()  # copy, so that X_train itself is not overwritten
mu = np.mean(X_train, axis=0)
for i in range(X_train.shape[0]):
    for j in range(X_train.shape[1]):
        if X_train[i, j] < mu[j]:  # less than the column mean: assign 0
            X_copy[i, j] = 0
        else:
            X_copy[i, j] = 1  # greater than or equal to the column mean: assign 1
Is there any better alternative?
I don't have much experience with python hence thank you for understanding.
Compare the array directly with the vector of column means; the mean vector is then compared against each row of the original array. Then convert the data type of the result to int:
>>> X_train = np.random.rand(3, 4)
>>> X_train
array([[0.4789953 , 0.84095907, 0.53538172, 0.04880835],
[0.64554335, 0.50904539, 0.34069036, 0.5290601 ],
[0.84664389, 0.63984867, 0.66111495, 0.89803495]])
>>> (X_train >= X_train.mean(0)).astype(int)
array([[0, 1, 1, 0],
[0, 0, 0, 1],
[1, 0, 1, 1]])
Update:
NumPy has a broadcasting mechanism for operations between arrays. For example, when an array is compared with a number, the number is stretched across all elements of the array and compared with them one by one:
>>> X_train > 0.5
array([[False, True, True, False],
[ True, True, False, True],
[ True, True, True, True]])
>>> X_train > np.full(X_train.shape, 0.5) # Equivalent effect.
array([[False, True, True, False],
[ True, True, False, True],
[ True, True, True, True]])
Similarly, you can compare a vector with a 2D array, as long as the length of the vector is the same as that of the last dimension of the array (here, the number of columns):
>>> mu = X_train.mean(0)
>>> X_train > mu
array([[False, True, True, False],
[False, False, False, True],
[ True, False, True, True]])
>>> X_train > np.tile(mu, (X_train.shape[0], 1)) # Equivalent effect.
array([[False, True, True, False],
[False, False, False, True],
[ True, False, True, True]])
How do you compare along other axes? That is difficult for me to explain briefly, so here is the official NumPy explanation. I hope it helps you get started: Broadcasting
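As a rough illustration of comparing along the other axis, here is a minimal sketch. The small array `X` and the names `col_labels` and `row_labels` are made up for this example. The idea is that `keepdims=True` keeps the reduced axis as a length-1 dimension, so the row means line up with the rows instead of the columns:

```python
import numpy as np

# Hypothetical 2 x 3 example array (not from the question).
X = np.array([[1.0, 8.0, 3.0],
              [4.0, 2.0, 9.0]])

# Column means have shape (3,), which matches the last axis of X,
# so they broadcast across the rows without any reshaping.
col_labels = (X >= X.mean(axis=0)).astype(int)

# Row means need shape (2, 1) to be compared along the other axis;
# keepdims=True keeps the reduced axis as a length-1 dimension.
row_labels = (X >= X.mean(axis=1, keepdims=True)).astype(int)

print(col_labels)
print(row_labels)
```

Without `keepdims=True`, the row means would have shape (2,), which cannot be broadcast against a (2, 3) array.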
Assume I have n = 3 lists of the same length, for example:
R1 = [7,5,8,6,0,6,7]
R2 = [8,0,2,2,0,2,2]
R3 = [1,7,5,9,0,9,9]
I need to find the first index t that satisfies the following n = 3 conditions for a period p = 2.
Edit: the meaning of period p is the number of consecutive "boxes".
R1[t] >= 5, R1[t+1] >= 5. Here t + p - 1 = t + 1, so we only need to check two boxes, t and t+1. If p were equal to 3, we would need to check t, t+1 and t+2. Note that the threshold is always the same number within a list: for R1 we always test whether the value is at least 5, for every index. The condition is the same for all the "boxes".
R2[t] >= 2, R2[t+1] >= 2
R3[t] >= 9, R3[t+1] >= 9
In total there are 3 * p conditions.
Here the t I am looking for is 5 (indexing is starting from 0).
The basic way to do this is by looping on all the indexes using a for loop. If the condition is found for some index t we store it in some local variable temp and we verify the conditions still hold for every element whose index is between t+1 and t+p -1. If while checking, we find an index that does not satisfy a condition, we forget about the temp and we keep going.
What is the most efficient way to do this in Python if I have large lists (like of 10000 elements)? Is there a more efficient way than the for loop?
Since all your conditions are the same (>=), we could leverage this.
This solution will work for any number of conditions and any size of analysis window, and no for loop is used.
You have an array:
>>> R = np.array([R1, R2, R3]).T
>>> R
array([[7, 8, 1],
       [5, 0, 7],
       [8, 2, 5],
       [6, 2, 9],
       [0, 0, 0],
       [6, 2, 9],
       [7, 2, 9]])
and you have thresholds:
>>> thresholds = [5, 2, 9]
So you can check where the conditions are met:
>>> R >= thresholds
array([[ True, True, False],
[ True, False, False],
[ True, True, False],
[ True, True, True],
[False, False, False],
[ True, True, True],
[ True, True, True]])
And where they are all met at the same time:
>>> R_cond = np.all(R >= thresholds, axis=1)
>>> R_cond
array([False, False, False, True, False, True, True])
From there, you want the conditions to be met for a given window.
We'll use the fact that booleans can sum together, and convolution to apply the window:
>>> win_size = 2
>>> R_conv = np.convolve(R_cond, np.ones(win_size), mode="valid")
>>> R_conv
array([0., 0., 1., 1., 1., 2.])
The resulting array will have values equal to win_size at the indices where all conditions are met on the window range.
So let's retrieve the first of those indices:
>>> index = np.where(R_conv == win_size)[0][0]
>>> index
5
If no such index exists, this raises an IndexError; I'm leaving it to you to handle that.
So, as a one-liner function, it gives:
def idx_conditions(arr, thresholds, win_size, condition):
    return np.where(
        np.convolve(
            np.all(condition(arr, thresholds), axis=1),
            np.ones(win_size),
            mode="valid"
        )
        == win_size
    )[0][0]
I added the condition as an argument to the function, to be more general.
>>> from operator import ge
>>> idx_conditions(R, thresholds, win_size, ge)
5
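One possible way to handle the no-match case is sketched below. The function name `first_window_index` and the choice to return None are assumptions made for this example, not part of the answer above. It also guards against inputs shorter than the window, because `np.convolve` swaps its arguments in that case instead of failing:

```python
import numpy as np
from operator import ge

def first_window_index(arr, thresholds, win_size, condition=ge):
    """Return the first row index t such that every column meets its
    condition on rows t .. t + win_size - 1, or None if there is none."""
    # One boolean per row: True where all columns satisfy their condition.
    ok = np.all(condition(arr, thresholds), axis=1).astype(int)
    if len(ok) < win_size:
        return None  # fewer rows than the window: no window fits
    # Sliding sum over the window; equals win_size only on a full run.
    counts = np.convolve(ok, np.ones(win_size, dtype=int), mode="valid")
    hits = np.flatnonzero(counts == win_size)
    return int(hits[0]) if hits.size else None

R = np.array([[7, 5, 8, 6, 0, 6, 7],
              [8, 0, 2, 2, 0, 2, 2],
              [1, 7, 5, 9, 0, 9, 9]]).T
print(first_window_index(R, [5, 2, 9], 2))
print(first_window_index(R, [5, 2, 9], 3))
```

Returning None keeps the caller's check to a simple `if result is None`, but raising a custom exception would work just as well.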
This could be a way:
R1 = [7,5,8,6,0,6,7]
R2 = [8,0,2,2,0,2,2]
R3 = [1,7,5,9,0,9,9]
for i, inext in zip(range(len(R1)), range(len(R1))[1:]):
    if (R1[i] >= 5 and R1[inext] >= 5) and (R2[i] >= 2 and R2[inext] >= 2) and (R3[i] >= 9 and R3[inext] >= 9):
        print(i)
Output:
5
Edit: Generalization could be:
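def foo(ls, conditions):
    n = len(ls[0])  # length of the input lists, not of the global R1
    for i, inext in zip(range(n), range(n)[1:]):
        if all((ls[j][i] >= conditions[j] and ls[j][inext] >= conditions[j]) for j in range(len(ls))):
            return i  # first index where all conditions hold at i and i+1
    return None  # no such index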
R1 = [7,5,8,6,0,6,7]
R2 = [8,0,2,2,0,2,2]
R3 = [1,7,5,9,0,9,9]
R4 = [1,7,5,9,0,1,1]
R5 = [1,7,5,9,0,3,3]
conditions=[5,2,9,1,3]
ls=[R1,R2,R3,R4,R5]
print(foo(ls,conditions))
Output:
5
And, maybe if the arrays match the conditions multiple times, you could return a list of the indexes:
def foo(ls, conditions):
    index = []
    n = len(ls[0])  # length of the input lists, not of the global R1
    for i, inext in zip(range(n), range(n)[1:]):
        if all((ls[j][i] >= conditions[j] and ls[j][inext] >= conditions[j]) for j in range(len(ls))):
            index.append(i)
    return index
R1 = [6,7,8,6,0,6,7]
R2 = [2,2,2,2,0,2,2]
R3 = [9,9,5,9,0,9,9]
R4 = [1,1,5,9,0,1,1]
R5 = [3,3,5,9,0,3,3]
conditions=[5,2,9,1,3]
ls=[R1,R2,R3,R4,R5]
print(foo(ls,conditions))
Output:
[0, 5]
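The loops above only cover p = 2. A single-pass sketch for an arbitrary period p could look like the following. The name `first_index` and the None return value are assumptions for this example. It keeps a running count of consecutive indices that satisfy all conditions, which is close to the "temp" idea described in the question:

```python
def first_index(ls, conditions, p=2):
    """First t such that every list meets its threshold at t .. t + p - 1."""
    run = 0  # number of consecutive indices, ending at t, that pass
    for t in range(len(ls[0])):
        if all(row[t] >= c for row, c in zip(ls, conditions)):
            run += 1
            if run == p:
                return t - p + 1  # start of the run of length p
        else:
            run = 0  # a failing index breaks the run
    return None  # no run of length p was found

R1 = [7, 5, 8, 6, 0, 6, 7]
R2 = [8, 0, 2, 2, 0, 2, 2]
R3 = [1, 7, 5, 9, 0, 9, 9]
print(first_index([R1, R2, R3], [5, 2, 9], p=2))
```

Each index is checked only once, so the cost is proportional to the list length times the number of lists, whatever the value of p.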
Here is a solution using numpy, without for loops:
import numpy as np
R1 = np.array([7,5,8,6,0,6,7])
R2 = np.array([8,0,2,2,0,2,2])
R3 = np.array([1,7,5,9,0,9,9])
a = np.logical_and(np.logical_and(R1>=5,R2>=2),R3>=9)
np.where(np.logical_and(a[:-1],a[1:]))[0].item()
output:
5
Edit:
Generalization
Say you have a list of lists R and a list of conditions c:
R = [[7,5,8,6,0,6,7],
[8,0,2,2,0,2,2],
[1,7,5,9,0,9,9]]
c = [5,2,9]
First we convert them to numpy arrays. The reshape(-1,1) converts c to a column vector so that numpy's broadcasting applies in the >= comparison:
R = np.array(R)
c = np.array(c).reshape(-1,1)
R>=c
output:
array([[ True, True, True, True, False, True, True],
[ True, False, True, True, False, True, True],
[False, False, False, True, False, True, True]])
Then we perform a logical AND across all rows using the reduce function:
a = np.logical_and.reduce(R>=c)
a
output:
array([False, False, False, True, False, True, True])
Next we create two arrays, one by dropping the last element of a and one by dropping the first, and perform a logical AND between them. The result shows where two consecutive elements satisfy the conditions in all lists:
np.logical_and(a[:-1],a[1:])
output:
array([False, False, False, False, False, True])
Now np.where just gives the index of the True element:
np.where(np.logical_and(a[:-1],a[1:]))[0].item()
output:
5
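The slicing trick above is specific to a period of 2. For a general period p, one option is a sliding window over the boolean array. This is a sketch that assumes NumPy 1.20 or newer, where `sliding_window_view` is available; the names `windows` and `hits` are made up for this example:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

R = np.array([[7, 5, 8, 6, 0, 6, 7],
              [8, 0, 2, 2, 0, 2, 2],
              [1, 7, 5, 9, 0, 9, 9]])
c = np.array([5, 2, 9]).reshape(-1, 1)
p = 2  # number of consecutive positions that must satisfy all conditions

# True where every list satisfies its own threshold at that position.
a = np.logical_and.reduce(R >= c)

# Row t of `windows` is a[t : t + p]; the shape is (len(a) - p + 1, p).
windows = sliding_window_view(a, p)

# Start positions where the whole window is True.
hits = np.flatnonzero(windows.all(axis=1))
print(hits)
```

Here `hits` holds every matching start index, so `hits[0]` is the first one when the array is non-empty. Note that `sliding_window_view` raises a ValueError if p is larger than the length of a.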
I have two tensors like this:
1st tensor
[[0,0],[0,1],[0,2],[1,3],[1,4],[2,1],[2,4]]
2nd tensor
[[0,1],[0,2],[1,4],[2,4]]
I want the result tensor to be like this:
[[0,0],[1,3],[2,1]] # differences between 1st tensor and 2nd tensor
I have tried to use set, list, torch.where,.. and couldn't find any good way to achieve this. Is there any way to get the different rows between two different sizes of tensors? (need to be efficient)
You can perform a pairwise comparison to see which rows of the first tensor are present in the second tensor.
a = torch.as_tensor([[0,0],[0,1],[0,2],[1,3],[1,4],[2,1],[2,4]])
b = torch.as_tensor([[0,1],[0,2],[1,4],[2,4]])
# Expand a to (7, 1, 2) to broadcast to all b
a_exp = a.unsqueeze(1)
# c: (7, 4, 2)
c = a_exp == b
# Since we want to know whether all components of the vector are equal, we reduce over the last dim
# c: (7, 4)
c = c.all(-1)
print(c)
# Out: row i compares the i-th element of a against all elements of b.
# Therefore, if a whole row is False, that element of a is not present in b.
tensor([[False, False, False, False],
[ True, False, False, False],
[False, True, False, False],
[False, False, False, False],
[False, False, True, False],
[False, False, False, False],
[False, False, False, True]])
non_repeat_mask = ~c.any(-1)
# Apply the mask to a
print(a[non_repeat_mask])
tensor([[0, 0],
[1, 3],
[2, 1]])
If you like, you can do it as a one-liner :)
a[~a.unsqueeze(1).eq(b).all(-1).any(-1)]
In case someone is looking for a solution for a 1-D tensor, this is an adaptation of @Guillem's solution:
a = torch.tensor(list(range(0, 10)))
b = torch.tensor(list(range(5,15)))
a[~a.unsqueeze(1).eq(b).any(1)]
outputs:
tensor([0, 1, 2, 3, 4])
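Here is another solution for when you want the symmetric difference (the elements that appear in only one of the two tensors), and not just the elements of the first that are missing from the second. Be careful when using it: the order of the two tensors doesn't matter here, and it assumes that neither tensor contains duplicates of its own.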
combined = torch.cat((a, b))
uniques, counts = combined.unique(return_counts=True)
difference = uniques[counts == 1]
outputs
tensor([ 0, 1, 2, 3, 4, 10, 11, 12, 13, 14])
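The counting trick can probably be carried over to the 2-D tensors from the question by asking `torch.unique` for unique rows with `dim=0`. The following is only a sketch under the same assumption as before, namely that neither tensor contains duplicate rows of its own. The variable names are made up for this example:

```python
import torch

a = torch.tensor([[0, 0], [0, 1], [0, 2], [1, 3], [1, 4], [2, 1], [2, 4]])
b = torch.tensor([[0, 1], [0, 2], [1, 4], [2, 4]])

# Stack both tensors and count how often each distinct row occurs.
combined = torch.cat((a, b), dim=0)
rows, counts = torch.unique(combined, dim=0, return_counts=True)

# Rows seen exactly once belong to only one of the two tensors.
difference = rows[counts == 1]
print(difference)
```

Because every row of b here also occurs in a, the rows that occur once are exactly the rows of a that are missing from b. In general the result would also contain rows that appear only in b.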
I need your help. I want to walk over a three dimensional array and check in one direction the distance between two elements, if it is smaller the value should be True. As soon as the distance gets higher than a certain value the rest of the values in this dimension should be set to False.
Here is an example in 1D:
a = np.array([1,2,2,1,2,5,2,7,1,2])
b = magic_check_fct(a, threshold=3, axis=0)
print(b)
# The expected output is :
> b = [True, True, True, True, True, False, False, False, False, False]
A simple check with a <= threshold gives the following, which is not the expected output:
> b = [True, True, True, True, True, False, True, False, True, True]
Is there an efficient way to do this with numpy? This whole thing is performance critical.
Thanks for your help!
One way would be to use np.minimum.accumulate along that axis -
np.minimum.accumulate(a<=threshold,axis=0)
Sample run -
In [515]: a
Out[515]: array([1, 2, 2, 1, 2, 5, 2, 7, 1, 2])
In [516]: threshold = 3
In [518]: print(np.minimum.accumulate(a<=threshold,axis=0))
[ True True True True True False False False False False]
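Since the question is about a three-dimensional array, here is a small sketch of the same accumulate idea along a chosen axis. It uses `np.logical_and.accumulate` in place of `np.minimum.accumulate`; on boolean input the two should give the same result. The helper name `prefix_within` and the sample `cube` are assumptions made for this example:

```python
import numpy as np

def prefix_within(a, threshold, axis=0):
    """True up to (not including) the first value above threshold along axis."""
    # Cumulative AND: once one element fails, everything after it is False.
    return np.logical_and.accumulate(a <= threshold, axis=axis)

a = np.array([1, 2, 2, 1, 2, 5, 2, 7, 1, 2])
b = prefix_within(a, threshold=3)

# Hypothetical 3-D input of shape (2, 2, 3), checked along the last axis.
cube = np.array([[[1, 5, 2], [4, 1, 1]],
                 [[2, 2, 9], [1, 1, 1]]])
last = prefix_within(cube, threshold=3, axis=-1)
print(b)
print(last)
```

Each 1-D slice along the chosen axis is handled independently, so no Python-level loop over the other two dimensions is needed.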
Another with thresholding and then slicing for 1D arrays -
out = a<=threshold
if not out.all():
    out[out.argmin():] = False
Here's one more approach using 1st discrete difference:
In [126]: threshold = 3
In [127]: mask = np.diff(a, prepend=a[0]) < threshold
In [128]: mask[mask.argmin():] = False
In [129]: mask
Out[129]:
array([ True, True, True, True, True, False, False, False, False,
False])
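One thing to watch with the `argmin` step: if every element passes the check, `mask.argmin()` returns 0 and the whole mask would be set to False. A guarded version might look like this sketch (the function name `diff_prefix_mask` is made up for this example):

```python
import numpy as np

def diff_prefix_mask(a, threshold):
    """Mask that is True until the first step whose difference reaches threshold."""
    # Difference to the previous element; the first element gets difference 0.
    mask = np.diff(a, prepend=a[0]) < threshold
    # Only cut the tail if there is at least one failing position;
    # otherwise argmin() would be 0 and wipe out the whole mask.
    if not mask.all():
        mask[mask.argmin():] = False
    return mask

print(diff_prefix_mask(np.array([1, 2, 2, 1, 2, 5, 2, 7, 1, 2]), 3))
print(diff_prefix_mask(np.array([1, 2, 3]), 3))
```

Note that this compares the signed difference between neighbours, so a large drop never counts as exceeding the threshold. Wrapping the difference in `np.abs` would treat both directions the same, if that is what "distance" means in your case.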
I'm trying to rewrite a function using numpy which is originally in MATLAB. There's a logical indexing part which is as follows in MATLAB:
X = reshape(1:16, 4, 4).';
idx = [true, false, false, true];
X(idx, idx)
ans =
1 4
13 16
When I try to make it in numpy, I can't get the correct indexing:
X = np.arange(1, 17).reshape(4, 4)
idx = [True, False, False, True]
X[idx, idx]
# Output: array([6, 1, 1, 6])
What's the proper way of getting a grid from the matrix via logical indexing?
You could also write:
>>> X[np.ix_(idx,idx)]
array([[ 1, 4],
[13, 16]])
In [1]: X = np.arange(1, 17).reshape(4, 4)
In [2]: idx = np.array([True, False, False, True])  # note that here idx has to be an array (not a list)
                                                    # or boolean values will be interpreted as integers
In [3]: X[idx][:,idx]
Out[3]:
array([[ 1, 4],
[13, 16]])
In numpy this is called fancy indexing. To get the items you want you should use a 2D array of indices.
You can use an outer product to build a proper 2D array of indices from your 1D idx. An outer operation, applied to two 1D sequences, combines each element of one sequence with each element of the other. Recalling that True*True=True and False*True=False, np.multiply.outer(), which for 1D inputs gives the same result as np.outer(), can give you the 2D indices:
idx_2D = np.outer(idx,idx)
#array([[ True, False, False, True],
# [False, False, False, False],
# [False, False, False, False],
# [ True, False, False, True]], dtype=bool)
Which you can use:
X[idx_2D]
array([ 1, 4, 13, 16])
In your real code you can use X[np.outer(idx,idx)] directly, but it does not save memory; it works the same as if you included a del idx_2D after doing the slice.
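Note that indexing with a 2D boolean mask returns a flat 1D array, not the 2 x 2 grid that MATLAB gives. A short sketch of how the grid could be restored, next to the `np.ix_` form for comparison (the names `grid`, `flat` and `k` are made up for this example):

```python
import numpy as np

X = np.arange(1, 17).reshape(4, 4)
idx = np.array([True, False, False, True])

# np.ix_ builds an open mesh from the two masks and keeps the 2D shape.
grid = X[np.ix_(idx, idx)]

# A 2D boolean mask selects the same elements but flattens them.
flat = X[np.outer(idx, idx)]

# Reshape back to a k x k grid, where k is the number of True entries.
k = int(idx.sum())
restored = flat.reshape(k, k)
print(grid)
print(restored)
```

The reshape relies on the mask being an outer product of a 1D mask with itself, so that the selected elements form a full k x k block in row-major order.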