I have 2 NumPy arrays like the ones below
array_1 = np.array([1.2, 2.3, -1.0, -0.5])
array_2 = np.array([-0.5, 1.3, 2.5, -0.9])
We can do simple element-wise arithmetic (addition, subtraction, division, etc.) easily using different np functions:
array_sum = np.add(array_1, array_2)
print(array_sum) # [ 0.7  3.6  1.5 -1.4]
array_sign = np.sign(array_1 * array_2)
print(array_sign) # [-1.  1. -1.  1.]
However, I need to check multiple element-wise conditions on the 2 arrays and save the results in 2 new arrays (say X and Y).
For example, if the two elements have different signs (e.g. the 1st and 3rd element pairs of the given example), then X will contain 0 and Y will be the sum of the positive element and abs(negative element)
X = [0]
Y = [1.7]
When both elements are positive (e.g. the 2nd element pair of the given example), then X will contain the lower value and Y will contain the greater value
X = [1.3]
Y = [2.3]
If both elements are negative (e.g. the 4th element pair of the given example), then X will be 0 and Y will be the sum of abs(first element) and abs(second element)
So, the final X and Y will be something like
X = [0, 1.3, 0, 0]
Y = [1.7, 2.3, 3.5, 1.4]
I have gone through some posts (this, and this) that describe comparison procedures between 2 arrays, but I am not getting an idea for multiple conditions. Here, the 2 arrays are very small, but my real arrays are very large (e.g. 2097152 elements per array).
Any ideas are highly appreciated.
Try with numpy.select:
conditions = [(array_1>0)&(array_2>0), (array_1<0)&(array_2<0)]
choiceX = [np.minimum(array_1, array_2), np.zeros(len(array_1))]
choiceY = [np.maximum(array_1, array_2), -np.add(array_1,array_2)]
X = np.select(conditions, choiceX)
Y = np.select(conditions, choiceY, np.add(np.abs(array_1), np.abs(array_2)))
>>> X
array([0. , 1.3, 0. , 0. ])
>>> Y
array([1.7, 2.3, 3.5, 1.4])
This will do it. It does require vertically stacking the two arrays. I'm sure someone will pipe up if there is a more efficient solution.
import numpy as np
array_1 = np.array([1.2, 2.3, -1.0, -0.5])
array_2 = np.array([-0.5, 1.3, 2.5, -0.9])
def pick(t):
    if t[0] < 0 or t[1] < 0:
        return (0, abs(t[0]) + abs(t[1]))
    return (t.min(), t.max())

print(np.apply_along_axis(pick, 0, np.vstack((array_1, array_2))))
Output:
[[0. 1.3 0. 0. ]
[1.7 2.3 3.5 1.4]]
The second line of the function can also be written:
return (0,np.abs(t).sum())
But since these will only be two-element arrays, I doubt that saves anything at all.
For machine learning, I'm applying the Parzen window algorithm.
I have an array (m,n). I would like to check on each row if any of the values is > 0.5 and if each of them is, then I would return 0, otherwise 1.
I would like to know if there is a way to do this without a loop thanks to numpy.
You can use np.all with axis=1 on a boolean array.
import numpy as np
arr = np.array([[0.8, 0.9], [0.1, 0.6], [0.2, 0.3]])
print(np.all(arr>0.5, axis=1))
# [ True False False]
import numpy as np
# Value Initialization
a = np.array([0.75, 0.25, 0.50])
y_predict = np.zeros((1, a.shape[0]))
#If the value is greater than 0.5, the value is 1; otherwise 0
y_predict = (a > 0.5).astype(float)
I have an array (m,n). I would like to check on each row if any of the values is > 0.5
That will be stored in b:
import numpy as np
a = ...  # some np.array of shape (m, n)
b = np.any(a > 0.5, axis=1)
and if each of them is, then I would return 0, otherwise 1.
I'm assuming you mean 'and if this is the case for all rows'. In this case:
c = 1 - 1 * np.all(b)
c contains your return value, either 0 or 1.
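A small end-to-end illustration of the two steps above (the array here is made up):
import numpy as np

a = np.array([[0.8, 0.9], [0.1, 0.6], [0.2, 0.3]])  # example input of shape (3, 2)
b = np.any(a > 0.5, axis=1)  # [ True  True False]
c = 1 - 1 * np.all(b)        # 1, because the last row has no value > 0.5
print(c)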
I read that numpy is unbiased in rounding and that it works the way it's designed. That "if you always round 0.5 up to the next largest number, then the average of a bunch of rounded numbers is likely to be slightly larger than the average of the unrounded numbers: this bias or drift can have very bad effects on some numerical algorithms and make them inaccurate."
Disregarding this information and assuming that I always want to round up, how can I do it in numpy? Assuming my array can be quite large.
For simplicity, let's assume I have the array:
import numpy as np
A = [ [10, 15, 30], [25, 134, 41], [134, 413, 51]]
A = np.array(A, dtype=np.int16)
decimal = A * .1
whole = np.round(decimal)
decimal looks like:
[[ 1. 1.5 3. ]
[ 2.5 13.4 4.1]
[ 13.4 41.3 5.1]]
whole looks like:
[[ 1. 2. 3.]
[ 2. 13. 4.]
[ 13. 41. 5.]]
As you can see, 1.5 rounded to 2 and 2.5 also rounded to 2. How can I force it to always round up for XX.5? I know I can loop through the array and use Python's round(), but that would definitely be much slower. I was wondering if there is a way to do it using numpy functions.
The answer is almost never np.vectorize. You can, and should, do this in a fully vectorized manner. Let's say that for x >= 0, you want r = floor(x + 0.5). If you want negative numbers to round towards zero, the same formula applies for x < 0. So let's say that you always want to round away from zero. In that case, you are looking for ceil(x - 0.5) for x < 0.
To implement that for an entire array without calling np.vectorize, you can use masking:
def round_half_up(x):
    mask = (x >= 0)
    out = np.empty_like(x)
    out[mask] = np.floor(x[mask] + 0.5)
    out[~mask] = np.ceil(x[~mask] - 0.5)
    return out
Notice that you don't need to use a mask if you round all in one direction:
def round_up(x):
    return np.floor(x + 0.5)
Now if you want to make this really efficient, you can get rid of all the temp arrays. This will use the full power of ufuncs:
def round_half_up(x):
    out = x.copy()
    mask = (out >= 0)
    np.add(out, 0.5, where=mask, out=out)
    np.floor(out, where=mask, out=out)
    np.invert(mask, out=mask)
    np.subtract(out, 0.5, where=mask, out=out)
    np.ceil(out, where=mask, out=out)
    return out
And:
def round_up(x):
    out = x + 0.5
    np.floor(out, out=out)
    return out
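For a quick check of the round_half_up above (just an illustration, comparing it with NumPy's default round-half-to-even):
x = np.array([1.5, 2.5, -2.5, -1.5])
print(round_half_up(x))  # [ 2.  3. -3. -2.]  halves rounded away from zero
print(np.round(x))       # [ 2.  2. -2. -2.]  halves rounded to the nearest even value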
import numpy as np
A = [ [1.0, 1.5, 3.0], [2.5, 13.4, 4.1], [13.4, 41.3, 5.1]]
A = np.array(A)
print(A)
def rounder(x):
    if (x - int(x)) >= 0.5:
        return np.ceil(x)
    else:
        return np.floor(x)
rounder_vec = np.vectorize(rounder)
whole = rounder_vec(A)
print(whole)
Alternatively, you can also look at numpy.ceil, numpy.floor, numpy.trunc for other rounding styles
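For example (a small illustration of the different styles on a made-up array):
import numpy as np

x = np.array([1.5, 2.5, -2.5, 3.7])
print(np.ceil(x))   # [ 2.  3. -2.  4.]  towards +inf
print(np.floor(x))  # [ 1.  2. -3.  3.]  towards -inf
print(np.trunc(x))  # [ 1.  2. -2.  3.]  towards zero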
I have a numpy.array that holds some time-series data, where data[:,0] is the time, and the other columns are some measurements. I also have a list_of_peaks which is a list of times where there is something interesting in the data.
My goal is to calculate a certain measure for each point in list_of_peaks which is based on the points in data closer to it than any other peak, i.e. I want to partition data halfway between each point in list_of_peaks.
My current (very slow) algorithm is this:
def measure(d, t_m, t_p):
    radius = d[(d[:,0] > t_m) * (d[:,0] < t_p)]
    return np.max(radius) - np.min(radius)

list_of_measures = []
for i in range(len(list_of_peaks)):
    if i == 0:
        list_of_measures.append(measure(data, data[0,0], (list_of_peaks[i+1] - list_of_peaks[i])/2 + list_of_peaks[i]))
    elif i == len(list_of_peaks) - 1:
        list_of_measures.append(measure(data, list_of_peaks[i] - (list_of_peaks[i] - list_of_peaks[i-1])/2, data[-1,0]))
    else:
        list_of_measures.append(measure(data, list_of_peaks[i] - (list_of_peaks[i] - list_of_peaks[i-1])/2, (list_of_peaks[i+1] - list_of_peaks[i])/2 + list_of_peaks[i]))
I haven't found any nice built-in numpy function that would serve my purpose, but I am pretty sure this can be done a LOT better, I just don't see how.
You can use numpy.where() (np.where()):
x = np.array([
[0.1, 0.4, 0.7],
[0.3, 0.5, 0.2],
[0.9, 0.1, 0.8],
])
y = x[np.where(x[:,1] == 0.5)]
y
[[0.3 0.5 0.2]]
# or with multiple condition
y = x[np.where((x[:, 1] > 0.1 ) & (x[:, 1] < 0.5))]
y
[[0.1 0.4 0.7]]
As Brenlla pointed out, np.split can do most of what I want, so it came down to finding the indices the fastest way. Fortunately, there is also a built-in numpy function that is very fast for a time series, since it is a sorted list by definition. The final map may have a faster solution, but the slow part of this algorithm was the splitting anyway:
splitter = np.ediff1d(list_of_peaks)/2 + list_of_peaks[:-1]
splitter_ind = np.searchsorted(data[:,0],splitter,side='right')
split_data = np.split(data[:,1],splitter_ind)
measures = np.array(list(map(lambda x: np.max(x) - np.min(x),split_data)))
I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function:
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm
This function handles the situation where vector v has the norm value of 0.
Are there any similar functions provided in sklearn or numpy?
If you're using scikit-learn you can use sklearn.preprocessing.normalize:
import numpy as np
from sklearn.preprocessing import normalize
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = normalize(x[:,np.newaxis], axis=0).ravel()
print(np.all(norm1 == norm2))
# True
I agree that it would be nice if such a function were part of the included libraries. But it isn't, as far as I know. So here is a version for arbitrary axes that gives optimal performance.
import numpy as np
def normalized(a, axis=-1, order=2):
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)
A = np.random.randn(3,3,3)
print(normalized(A,0))
print(normalized(A,1))
print(normalized(A,2))
print(normalized(np.arange(3)[:,None]))
print(normalized(np.arange(3)))
This might also work for you
import numpy as np
normalized_v = v / np.sqrt(np.sum(v**2))
but fails when v has length 0.
In that case, introducing a small constant to prevent the zero division solves this.
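For example (the exact size of the constant is a matter of taste):
eps = 1e-12  # small constant to avoid division by zero (value chosen arbitrarily here)
normalized_v = v / (np.sqrt(np.sum(v**2)) + eps)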
As proposed in the comments one could also use
v/np.linalg.norm(v)
To avoid zero division I use eps, but that's maybe not great.
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        norm = np.finfo(v.dtype).eps
    return v / norm
If you have multidimensional data and want each axis normalized to its max or its sum:
def normalize(_d, to_sum=True, copy=True):
    # d is an (n x dimension) np array
    d = _d if not copy else np.copy(_d)
    d -= np.min(d, axis=0)
    d /= (np.sum(d, axis=0) if to_sum else np.ptp(d, axis=0))
    return d
This uses numpy's peak-to-peak function, np.ptp.
a = np.random.random((5, 3))
b = normalize(a, copy=False)
b.sum(axis=0) # array([1., 1., 1.]), the columns sum to 1
c = normalize(a, to_sum=False, copy=False)
c.max(axis=0) # array([1., 1., 1.]), the max of each column is 1
If you don't need utmost precision, your function can be reduced to:
v_norm = v / (np.linalg.norm(v) + 1e-16)
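With this, a zero vector simply stays a zero vector instead of producing a division-by-zero warning and NaNs (a quick illustration):
import numpy as np

v = np.zeros(3)
print(v / (np.linalg.norm(v) + 1e-16))  # [0. 0. 0.]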
You mentioned scikit-learn, so I want to share another solution.
scikit-learn MinMaxScaler
In scikit-learn, there is an API called MinMaxScaler which can customize the value range as you like.
It also deals with NaN issues for us.
NaNs are treated as missing values: disregarded in fit, and maintained
in transform. ... see reference [1]
Code sample
The code is simple, just type
# Let's say X_train is your input dataframe
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# call MinMaxScaler object
min_max_scaler = MinMaxScaler()
# feed in a numpy array
X_train_norm = min_max_scaler.fit_transform(X_train.values)
# wrap it up if you need a dataframe
df = pd.DataFrame(X_train_norm)
Reference
[1] sklearn.preprocessing.MinMaxScaler
There is also the function unit_vector() to normalize vectors in the popular transformations module by Christoph Gohlke:
import transformations as trafo
import numpy as np
data = np.array([[1.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0],
                 [1.0, 2.0, 3.0]])
print(trafo.unit_vector(data, axis=1))
If you work with multidimensional arrays, the following fast solution is possible.
Say we have a 2D array that we want to normalize along the last axis, while some rows have zero norm.
import numpy as np

arr = np.array([
    [1, 2, 3],
    [0, 0, 0],
    [5, 6, 7]
], dtype=float)
lengths = np.linalg.norm(arr, axis=-1)
print(lengths) # [ 3.74165739 0. 10.48808848]
arr[lengths > 0] = arr[lengths > 0] / lengths[lengths > 0][:, np.newaxis]
print(arr)
# [[0.26726124 0.53452248 0.80178373]
# [0. 0. 0. ]
# [0.47673129 0.57207755 0.66742381]]
If you want to normalize n dimensional feature vectors stored in a 3D tensor, you could also use PyTorch:
import numpy as np
from torch import FloatTensor
from torch.nn.functional import normalize
vecs = np.random.rand(3, 16, 16, 16)
norm_vecs = normalize(FloatTensor(vecs), dim=0, eps=1e-16).numpy()
If you're working with 3D vectors, you can do this concisely using the toolbelt vg. It's a light layer on top of numpy and it supports single values and stacked vectors.
import numpy as np
import vg
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = vg.normalize(x)
print(np.all(norm1 == norm2))
# True
I created the library at my last startup, where it was motivated by uses like this: simple ideas which are way too verbose in NumPy.
Without sklearn and using just numpy.
Just define a function, assuming that the rows are the variables and the columns the samples (axis=1):
import numpy as np
# Example array
X = np.array([[1,2,3],[4,5,6]])
def stdmtx(X):
    means = X.mean(axis=1)
    stds = X.std(axis=1, ddof=1)
    X = X - means[:, np.newaxis]
    X = X / stds[:, np.newaxis]
    return np.nan_to_num(X)
output:
X
array([[1, 2, 3],
[4, 5, 6]])
stdmtx(X)
array([[-1., 0., 1.],
[-1., 0., 1.]])
For a 2D array, you can use the following one-liner to normalize across rows. To normalize across columns, simply set axis=0.
a / np.linalg.norm(a, axis=1, keepdims=True)
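A quick sanity check of the one-liner (just an illustration with a made-up array; the result is 1 up to floating-point rounding):
import numpy as np

a = np.array([[3.0, 4.0], [1.0, 1.0]])
rows_normed = a / np.linalg.norm(a, axis=1, keepdims=True)
print(np.linalg.norm(rows_normed, axis=1))  # [1. 1.]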
If you want all values in [0; 1] for a 1d-array, then just use
(a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
Where a is your 1d-array.
An example:
>>> a = np.array([0, 1, 2, 4, 5, 2])
>>> (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
array([0. , 0.2, 0.4, 0.8, 1. , 0.4])
A note on this method: to preserve the proportions between values, there is a restriction: the 1d-array must contain at least one 0 and consist only of 0 and positive numbers.
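A small illustration of that restriction (made-up numbers):
import numpy as np

a = np.array([0, 1, 2, 4])
print((a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0)))  # 0, 0.25, 0.5, 1 -- the 2:1 ratios are kept
b = np.array([1, 2, 4])
print((b - b.min(axis=0)) / (b.max(axis=0) - b.min(axis=0)))  # 0, 0.333..., 1 -- 4 is twice 2, but 1 is three times 0.333...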
A simple dot product would do the job. No need for any extra package.
x = x/np.sqrt(x.dot(x))
By the way, if the norm of x is zero, it is inherently a zero vector, and cannot be converted to a unit vector (which has norm 1). If you want to catch the case of np.array([0,0,...0]), then use
norm = np.sqrt(x.dot(x))
x = x/norm if norm != 0 else x