I am doing logistic regression on the iris dataset from sklearn. I know the math and am trying to implement it myself. At the final step I get a prediction vector that holds, for each data point, the probability of belonging to class 1 or class 2 (binary classification).
Now I want to turn this prediction vector into a target vector: if the probability is greater than 50%, the corresponding data point belongs to class 1, otherwise to class 2. I use 0 to represent class 1 and 1 to represent class 2.
I know I could loop over the whole vector with a for loop, but that gets expensive as the vector grows. I would like to do it more efficiently, the way NumPy's matrix operations are faster than the same operations written as a for loop.
Any suggestions for a faster method?
import numpy as np
a = np.matrix('0.1 0.82')
print(a)
a[a > 0.5] = 1
a[a <= 0.5] = 0
print(a)
Output:
[[ 0.1 0.82]]
[[ 0. 1.]]
Update:
import numpy as np
a = np.matrix('0.1 0.82')
print(a)
a = np.where(a > 0.5, 1, 0)
print(a)
A more general solution to a 2D array which has many vectors with many classes:
import numpy as np
a = np.array( [ [.5, .3, .2],
[.1, .2, .7],
[ 1, 0, 0] ] )
idx = np.argmax(a, axis=-1)
a = np.zeros( a.shape )
a[ np.arange(a.shape[0]), idx] = 1
print(a)
Output:
[[1. 0. 0.]
[0. 0. 1.]
[1. 0. 0.]]
Option 1: If you are doing binary classification and have a 1D prediction vector, you can use numpy.round:
prob = model.predict(X_test)
Y = np.round(prob)
Option 2: If you have an n-dimensional one-hot prediction matrix but want labels, you can use numpy.argmax, which returns a 1D vector of labels:
prob = model.predict(X_test)
y = np.argmax(prob, axis=1)
In case you want to proceed with a confusion matrix etc. afterwards and get back the original scikit-learn format of the target variable, array([1 0 ... 1]), you can use:
a = clf.predict_proba(X_test)[:,1]
a = np.where(a>0.5, 1, 0)
The [:,1] refers to the second class (in my case: 1); the first class in my case was 0.
For multi-class problems, or as a more general solution, use
np.argmax(y_hat, axis=1)
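As a rough illustration of how the two approaches above relate, the sketch below applies both to a small made-up probability array. The `proba` values are hypothetical and only stand in for what `predict_proba` might return for a two-class model; in the binary case, thresholding the second column at 0.5 and taking the argmax across columns give the same labels.

```python
import numpy as np

# Hypothetical two-class probabilities (each row sums to 1);
# these numbers are invented for illustration only.
proba = np.array([[0.90, 0.10],
                  [0.18, 0.82],
                  [0.40, 0.60]])

# Threshold the probability of the second class (label 1).
by_threshold = np.where(proba[:, 1] > 0.5, 1, 0)

# Pick the column with the highest probability in each row.
by_argmax = np.argmax(proba, axis=1)

print(by_threshold)
print(by_argmax)
```

The argmax form extends to more than two classes without modification, whereas the threshold form only makes sense for binary problems.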
Related
I am trying to calculate the Pearson correlation coefficient between two 2-dimensional vectors using np.corrcoef. When the dimension of the vectors is anything other than two, it works fine, see for example:
import numpy as np
x = np.random.uniform(-10, 10, 3)
y = np.random.uniform(-10, 10, 3)
print(x, y)
print(np.corrcoef(x,y))
Output:
[-6.59840638 -1.81100446 5.6158669 ] [ 6.7200348 -7.0373677 -2.11395157]
[[ 1. -0.53299763]
[-0.53299763 1. ]]
However, when the dimension is exactly two, the correlation is wrong: the only values are 1 or -1:
import numpy as np
x = np.random.uniform(-10, 10, 2)
y = np.random.uniform(-10, 10, 2)
print(x, y)
print(np.corrcoef(x,y))
Output 1:
[-2.61268708 8.32602293] [6.42020314 3.43806504]
[[ 1. -1.]
[-1. 1.]]
Output 2:
[ 5.04249697 -3.6599369 ] [6.12936665 3.15827974]
[[1. 1.]
[1. 1.]]
Output 3:
[7.33503682 7.7145613 ] [-9.54304108 7.43840944]
[[1. 1.]
[1. 1.]]
Question: What's happening and how to solve it?
There are a couple of misunderstandings leading to your confusion:
I'll use row-major order, as NumPy does: "Each row of x represents a variable, and each column a single observation of all those variables."
The Pearson correlation coefficient describes the linear relationship between 2 variables. If you only have 2 data points for each variable, you can always draw a straight line through them, so the relationship is perfectly linear. After normalization, you will always get 1 or -1.
A covariance or correlation matrix is usually calculated among the components of a random vector X=(X1,....,Xn).T. When you say you want the correlation between 2 vectors, it is unclear whether you actually want the cross-correlation between X and Y, in which case you need np.correlate.
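The claim about two data points can be checked numerically. The following is a minimal sketch, assuming a seeded `np.random.default_rng` generator (my choice, not part of the question): with only two observations per variable, the coefficient reduces to the sign of the product of the two differences.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

results = []
for _ in range(5):
    x = rng.uniform(-10, 10, 2)
    y = rng.uniform(-10, 10, 2)
    r = np.corrcoef(x, y)[0, 1]
    # With two observations, r is +1 if x and y move in the same
    # direction between the two points, and -1 otherwise.
    expected = np.sign((x[1] - x[0]) * (y[1] - y[0]))
    results.append((r, expected))
    print(r, expected)
```

So the values of 1 and -1 are not a bug in np.corrcoef; they are the only possible outcomes for two observations (barring identical values, which give nan).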
I have a 3D NumPy array of size (9,9,200) and a 2D array of size (200,200).
I want to take each channel of shape (9,9,1) and generate an array of shape (9,9,200), with every channel multiplied by one scalar from a single row of the 2D array, and then average it so that the resulting array is (9,9,1).
Basically, if there are n channels in an input array, I want each channel multiplied n times and averaged - and this should happen for all channels. Is there an efficient way to do so?
So far what I have is this -
import numpy as np
arr = np.random.rand(9,9,200)
nchannel = arr.shape[-1]
transform = np.array([np.random.uniform(low=0.0, high=1.0, size=(nchannel,)) for i in range(nchannel)])
for channel in range(nchannel):
    # The below line needs optimization
    temp = [arr[:,:,i] * transform[channel][i] for i in range(nchannel)]
    arr[:,:,channel] = np.sum(temp, axis=0)/nchannel
Edit :
A sample image demonstrating what I am looking for. Here nchannel = 3.
The input image is arr. The final image is the transformed arr.
EDIT:
import numpy as np
n_channels = 3
scalar_size = 2
t = np.ones((n_channels,scalar_size,scalar_size)) # scalar array
m = np.random.random((n_channels,n_channels)) # letters array
print(m)
print(t)
m_av = np.mean(m, axis=1)
print(m_av)
for i in range(n_channels):
    t[i] = t[i]*m_av[i]
print(t)
output:
[[0.04601533 0.05851365 0.03893352]
[0.7954655 0.08505869 0.83033369]
[0.59557455 0.09632997 0.63723506]]
[[[1. 1.]
[1. 1.]]
[[1. 1.]
[1. 1.]]
[[1. 1.]
[1. 1.]]]
[0.04782083 0.57028596 0.44304653]
[[[0.04782083 0.04782083]
[0.04782083 0.04782083]]
[[0.57028596 0.57028596]
[0.57028596 0.57028596]]
[[0.44304653 0.44304653]
[0.44304653 0.44304653]]]
What you're asking for is a simple matrix multiplication along the last axis:
import numpy as np
arr = np.random.rand(9,9,200)
transform = np.random.uniform(size=(200, 200)) / 200
arr = arr @ transform
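To see that the matrix product matches the loop, here is a small sketch on a reduced problem size (the shapes and the seeded generator are my own choices for illustration). Two assumptions are worth flagging: the loop result is written into a separate array so that later channels are not computed from already overwritten ones, and because the loop indexes `transform[channel][i]`, the equivalent product uses the transpose of `transform`. For a random `transform` the transpose makes no practical difference, but it matters for a fixed one.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((4, 4, 5))          # small stand-in for the (9, 9, 200) array
n = arr.shape[-1]
transform = rng.random((n, n))

# Loop version, writing into a separate output array.
looped = np.empty_like(arr)
for channel in range(n):
    weighted = [arr[:, :, i] * transform[channel, i] for i in range(n)]
    looped[:, :, channel] = np.sum(weighted, axis=0) / n

# Vectorized version: a matrix product along the last axis.
vectorized = arr @ (transform.T / n)

print(np.allclose(looped, vectorized))
```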
I understand the concept of vectorization and how you can avoid a loop when you want to adjust each element individually. What I can't figure out is how to do this when the condition depends on the neighbouring values of a pixel.
For example, if I have a mask:
mask = np.array([[0,0,0,0],
[1,0,0,0],
[0,0,0,1],
[1,0,0,0]])
And I wanted to change an element by evaluating neighboring components in the mask, like so:
if sum(mask[j-1:j+2,i-1:i+2].flatten())>1 and mask[j,i]!=1:
    out[j,i]=1
How can I vectorize the operation when I specifically need to access the neighboring elements?
Thanks in advance.
Full loop:
import numpy as np
mask = np.array([[0,0,0,0], [1,0,0,0], [0,0,0,1], [1,0,0,0]])
out = np.zeros(mask.shape)
for j in range(len(mask)):
    for i in range(len(mask[0])):
        if sum(mask[j-1:j+2,i-1:i+2].flatten())>1 and mask[j,i]!=1:
            out[j,i]=1
Output:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 0.]]
Such a 'neighborhood sum' operation is often called a 2D convolution. In your case, since you don't have any weighting, it is efficiently implemented in the (IMO somewhat poorly named) scipy.ndimage.uniform_filter, which can compute the mean of a neighborhood (and the sum is just the mean multiplied by the size).
import numpy as np
from scipy.ndimage import uniform_filter
mask = np.array([[0,0,0,0], [1,0,0,0], [0,0,0,1], [1,0,0,0]])
neighbor_sum = 9 * uniform_filter(mask.astype(np.float32), 3, mode="constant")
neighbor_sum = np.rint(neighbor_sum).astype(int)
out = ((neighbor_sum > 1) & (mask != 1)).astype(int)
print(out)
Output (which differs from your example but, checked by hand, is correct, assuming you don't want the edges to wrap around):
[[0 0 0 0]
[0 0 0 0]
[1 1 0 0]
[0 0 0 0]]
If you do want the edges to wrap around (or other edge behavior), look at the mode argument of uniform_filter.
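If SciPy is not available, the same zero-padded 3x3 neighborhood sum can be sketched in plain NumPy by padding the mask and adding up the nine shifted views. This is only an illustrative alternative to uniform_filter, not a replacement for it; the variable names are my own.

```python
import numpy as np

mask = np.array([[0, 0, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 0, 1],
                 [1, 0, 0, 0]])

# Surround the mask with a border of zeros so edge pixels have
# a full 3x3 neighborhood (no wrap-around).
padded = np.pad(mask, 1, mode="constant")
rows, cols = mask.shape

# Sum the nine shifted windows; each window is the mask seen
# through one offset of the 3x3 neighborhood.
neighbor_sum = sum(
    padded[dj:dj + rows, di:di + cols]
    for dj in range(3)
    for di in range(3)
)

out = ((neighbor_sum > 1) & (mask != 1)).astype(int)
print(out)
```

For larger neighborhoods the filter-based approach is likely the better choice, since the number of shifted views here grows with the window area.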
I have read most related questions here, but I cannot seem to figure out how to use np.pad in this case. Maybe it is not meant for this particular problem.
Let's say I have a list of Numpy arrays. Every array is the same length, e.g. 2. The list itself has to be padded to be e.g. 5 arrays and can be transformed into a numpy array as well. The padded elements should be arrays filled with zeroes. As an example
arr = [array([0, 1]), array([1, 0]), array([1, 1])]
expected_output = array([array([0, 1]), array([1, 0]), array([1, 1]), array([0, 0]), array([0, 0])])
The following seems to work, but I feel there must be a better and more efficient way. In reality this is run hundreds of thousands if not millions of times so speed is important. Perhaps with np.pad?
import numpy as np
def pad_array(l, item_size, pad_size=5):
    s = len(l)
    if s < pad_size:
        zeros = np.zeros(item_size)
        for _ in range(pad_size-s):
            # not sure if I need a `copy` of zeros here?
            l.append(zeros)
    return np.array(l)
B = [np.array([0,1]), np.array([1,0]), np.array([1,1])]
AB = pad_array(B, 2)
print(AB)
It seems like you want to pad zeros at the end of the axis 0, speaking in numpy terms. So what you need is,
output = numpy.pad(arr, ((0,2),(0,0)), 'constant')
The trick is the pad_width parameter, which you need to specify as pad_width=((0,2),(0,0)) to get your expected output. This tells pad() to insert 0 rows of padding at the beginning and 2 at the end of axis 0, and 0 at both the beginning and the end of axis 1. According to the documentation, the format of pad_width is ((before_1, after_1), … (before_N, after_N)).
mode='constant' tells pad() to pad with the value specified by parameter constant_values which defaults to 0.
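Put together as a runnable snippet, the call above might look like the sketch below. The `pad_size` and `missing` names are my own additions so that the amount of padding is computed from the list length rather than hard-coded as 2.

```python
import numpy as np

arr = [np.array([0, 1]), np.array([1, 0]), np.array([1, 1])]
pad_size = 5  # desired number of rows after padding

# Number of zero rows to append; never negative.
missing = max(pad_size - len(arr), 0)

# Pad only at the end of axis 0; leave axis 1 untouched.
output = np.pad(arr, ((0, missing), (0, 0)), mode="constant")
print(output)
```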
You could re-write your function like this:
import numpy as np
def pad_array(l, item_size, pad_size=5):
    if pad_size < len(l):
        return np.array(l)
    s = len(l)
    res = np.zeros((pad_size, item_size))  # create a zero array of shape (pad_size, item_size)
    res[:s] = l  # set the first rows equal to the elements of l
    return res
B = [np.array([0, 1]), np.array([1, 0]), np.array([1, 1])]
AB = pad_array(B, 2)
print(AB)
Output
[[0. 1.]
[1. 0.]
[1. 1.]
[0. 0.]
[0. 0.]]
The idea is to create an array of zeroes and then fill the first rows with the values from the input list.
If X = [[ 1. 1. 1. 1. 1. 1.]] and Y = [[ 0. 0. 0. 0.]] - how can I concatenate the two vectors to form a single vector along column?
I did the following but it didn't work:
import tensorflow as tf
X = tf.constant(1.0, shape=[1, 6])
Y = tf.zeros(shape=[1,4])
XY = tf.concat((X,Y), axis = 0)
sess = tf.Session()
print(sess.run(XY))
If you want to concatenate them along axis 0, their sizes along the other axis must be equal, which is not the case here (6 vs. 4).
Assuming that is not what you want, you need to set axis = 1 in the tf.concat call.
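The shape rule is easy to see with a NumPy analogue. Note that this sketch deliberately swaps in np.concatenate rather than tf.concat so that it runs without TensorFlow; it only illustrates the rule that all axes other than the concatenation axis must match.

```python
import numpy as np

X = np.ones((1, 6))
Y = np.zeros((1, 4))

# Along axis 1 only the row counts (axis 0) must match: 1 == 1.
XY = np.concatenate((X, Y), axis=1)
print(XY.shape)

# Along axis 0 the column counts (axis 1) would have to match: 6 != 4.
try:
    np.concatenate((X, Y), axis=0)
except ValueError as err:
    print("axis=0 fails:", err)
```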