I have an array of length approximately 12000, something like array([0.3, 0.6, 0.3, 0.5, 0.1, 0.9, 0.4...]). I also have a column in a dataframe with values like 2, 3, 7, 3, 2, 7, .... The column has length 48, and its values sum to 36.
I want to split the array proportionally to those values. For example, the first value in the column (= 2) gets its own slice of length 12000*(2/36) (maybe [0.3, 0.6, 0.3]), the second value (= 3) gets a slice of length 12000*(3/36) that continues where the first one ended (something like [0.5, 0.1, 0.9, 0.4]), and so on.
import pandas as pd
import numpy as np

# mock some data
a = np.random.random(12000)
df = pd.DataFrame({'col': np.random.randint(1, 5, 48)})

# cumulative proportional boundaries, rounded to integer indices
indices = (len(a) * df.col.to_numpy() / df.col.sum()).cumsum()
indices = np.concatenate(([0], indices)).round().astype(int)

res = []
for s, e in zip(indices[:-1], indices[1:]):
    res.append(a[s:e])
# some tests
target_pcts = df.col.to_numpy() / df.col.sum()
realized_pcts = np.array([len(sl) / len(a) for sl in res])
diffs = target_pcts / realized_pcts
assert 0.99 < np.min(diffs) and np.max(diffs) < 1.01
assert np.array_equal(np.concatenate(res), a)
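As a minimal alternative sketch (assuming the indices array computed above), np.split produces the same list of slices without an explicit loop:
# np.split expects interior split points, so drop the leading 0 and the trailing len(a)
res = np.split(a, indices[1:-1])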
For machine learning, I'm applying the Parzen window algorithm.
I have an array (m,n). I would like to check on each row if any of the values is > 0.5 and if each of them is, then I would return 0, otherwise 1.
I would like to know if there is a way to do this without a loop, using numpy.
You can use np.all with axis=1 on a boolean array.
import numpy as np
arr = np.array([[0.8, 0.9], [0.1, 0.6], [0.2, 0.3]])
print(np.all(arr > 0.5, axis=1))
>> [ True False False]
import numpy as np

# value initialization
a = np.array([0.75, 0.25, 0.50])

# if a value is greater than 0.5 it maps to 1.0; otherwise 0.0
y_predict = (a > 0.5).astype(float)
I have an array (m,n). I would like to check on each row if any of the values is > 0.5
That will be stored in b:
import numpy as np

a = np.random.rand(4, 3)  # stand-in for your (m, n) array
b = np.any(a > 0.5, axis=1)
and if each of them is, then I would return 0, otherwise 1.
I'm assuming you mean 'and if this is the case for all rows'. In this case:
c = 1 - 1 * np.all(b)
c contains your return value, either 0 or 1.
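If instead you want the 0/1 decision per row (0 where any value in the row exceeds 0.5, otherwise 1), a minimal sketch using the same b:
result = np.where(b, 0, 1)  # b = np.any(a > 0.5, axis=1) from above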
I want to use the function InterX to find the intersection of two curves; however, the function does not return the expected result. The function is available here.
The function always returns the point of intersection as P = None, None when a valid point was expected.
import numpy as np
import pandas as pd
from InterX import InterX
x_t = np.linspace(0, 10, 10, True)
z_t = np.array((0, 0, 0, 0, 0, 0, 0.055, 0.41, 1.23, 4))
X_P = np.array((2,4))
Z_P = np.array((3,-1))
Line = pd.DataFrame(np.array((X_P,Z_P)))
Curve = pd.DataFrame(np.array([x_t,z_t]))
Curve = Curve.T
P = InterX(Line[0],Line[1],Curve[0],Curve[1])
In this script the expected result was P = [3.5, 0]; however, the function returns P = [None, None].
The short answer: use
P = InterX(L1, L1, L2, L2)
or
P = InterX(L1.iloc[:,0].to_frame(),L1.iloc[:,1].to_frame(),L2.iloc[:,0].to_frame(),L2.iloc[:,1].to_frame())
For a detailed answer, see the following, which refers to the code of the original question:
You need to pass two dataframes each for the x and the y values (it would of course be much more logical if InterX accepted 4 Series or 2 DataFrames). InterX then extracts the x and y values from these dataframes in a very convoluted way in lines 90 through 119 of its source (which could be done far more easily). So the working solution is:
import numpy as np
import pandas as pd
from InterX import InterX
x_t = np.linspace(0, 10, 10, True)
z_t = np.array((0, 0, 0, 0, 0, 0, 0.055, 0.41, 1.23, 4))
x_P = np.array((2,4))
z_P = np.array((3,-1))
curve_x = pd.DataFrame(x_t)
curve_z = pd.DataFrame(z_t)
line_x = pd.DataFrame(x_P)
line_z = pd.DataFrame(z_P)
p = InterX(line_x, line_z, curve_x, curve_z)
Output of print(p):
xs ys
0 3.5 0.0
Please note that according to the Python naming convention (PEP 8), function and variable names should be lowercase, with words separated by underscores.
I find the code of InterX very difficult to understand; a much cleaner solution (along with a nice plot) is this one.
With
import numpy as np
import matplotlib.pyplot as plt
# intersection() is the helper from the linked solution; assuming it is
# importable, e.g. via the "intersect" package on PyPI
from intersect import intersection

x_t = np.linspace(0, 10, 10, True)
z_t = np.array((0, 0, 0, 0, 0, 0, 0.055, 0.41, 1.23, 4))
X_P = np.array((2, 4))
Z_P = np.array((3, -1))

x, y = intersection(x_t, z_t, X_P, Z_P)
print(x, y)
plt.plot(x_t, z_t, c='r')
plt.plot(X_P, Z_P, c='g')
plt.plot(x, y, '*k')
plt.show()
we get [3.5] [-0.] and a plot of the two curves with the intersection point marked.
If I have two lists
a = [2,5,1,9]
b = [4,9,5,10]
How can I find the mean of each pair of elements, so that the resulting list would be:
[3,7,3,9.5]
>>> a = [2,5,1,9]
>>> b = [4,9,5,10]
>>> [(g + h) / 2 for g, h in zip(a, b)]
[3.0, 7.0, 3.0, 9.5]
Referring to the title of your question, you can achieve this simply with:
import numpy as np
multiple_lists = [[2,5,1,9], [4,9,5,10]]
arrays = [np.array(x) for x in multiple_lists]
[np.mean(k) for k in zip(*arrays)]
The above script will handle multiple lists, not just two. If you want to compare the performance of the two approaches, try:
%%time
import random
import statistics

random.seed(33)
multiple_list = []
for seed in random.sample(range(100), 100):
    random.seed(seed)
    multiple_list.append(random.sample(range(100), 100))
result = [statistics.mean(k) for k in zip(*multiple_list)]
or alternatively:
%%time
import random
import numpy as np

random.seed(33)
multiple_list = []
for seed in random.sample(range(100), 100):
    random.seed(seed)
    multiple_list.append(np.array(random.sample(range(100), 100)))
result = [np.mean(k) for k in zip(*multiple_list)]
In my experience, the numpy approach is much faster.
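A minimal sketch of a fully vectorized variant (assuming the multiple_list built above), which avoids the per-element zip entirely and is typically faster still:
result = np.mean(np.array(multiple_list), axis=0)  # column-wise means in one call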
What you want is the mean of two arrays (or vectors in math).
Since Python 3.4, there is a statistics module which provides a mean() function:
statistics.mean(data)
Return the sample arithmetic mean of data, a sequence or iterator of real-valued numbers.
You can use it like this:
import statistics
a = [2, 5, 1, 9]
b = [4, 9, 5, 10]
result = [statistics.mean(k) for k in zip(a, b)]
# -> [3.0, 7.0, 3.0, 9.5]
Note: this solution can be used for more than two lists, because zip() accepts multiple arguments.
An alternative to using lists and a for loop is to use numpy arrays.
import numpy as np

# arrays perform element-wise calculations, unlike lists
a, b = np.array([2, 5, 1, 9]), np.array([4, 9, 5, 10])
mean = (a + b) / 2
print(mean)
>>> [3.  7.  3.  9.5]
Put the two lists into a numpy array using vstack and then take the mean (using 'tolist' to get back from the numpy array):
import numpy as np
a = [2,5,1,9]
b = [4,9,5,10]
np.mean(np.vstack([a,b]), axis=0).tolist()
[3.0, 7.0, 3.0, 9.5]
It seems you are looking for an element-wise mean. Setting axis=0 in np.mean is what you need.
>>> import numpy as np
>>> a = [2,5,1,9]
>>> b = [4,9,5,10]
Create a list containing all your lists
>>> a_b = [a,b]
>>> a_b
[[2, 5, 1, 9], [4, 9, 5, 10]]
Use np.mean and set the axis to 0
>>> np.mean(a_b, axis=0)
array([3. , 7. , 3. , 9.5])
I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function:
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    return v / norm
This function handles the situation where vector v has the norm value of 0.
Are there any similar functions provided in sklearn or numpy?
If you're using scikit-learn you can use sklearn.preprocessing.normalize:
import numpy as np
from sklearn.preprocessing import normalize
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = normalize(x[:,np.newaxis], axis=0).ravel()
print(np.all(norm1 == norm2))
# True
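For a 2D array, a minimal usage sketch (assuming you want each row normalized independently; axis=1 is the default):
from sklearn.preprocessing import normalize
import numpy as np

X = np.random.rand(10, 3)
X_rows = normalize(X, norm='l2', axis=1)  # each row now has unit L2 norm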
I agree that it would be nice if such a function were part of the included libraries. But it isn't, as far as I know. So here is a version for arbitrary axes that gives optimal performance.
import numpy as np

def normalized(a, axis=-1, order=2):
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2 == 0] = 1
    return a / np.expand_dims(l2, axis)

A = np.random.randn(3, 3, 3)
print(normalized(A, 0))
print(normalized(A, 1))
print(normalized(A, 2))
print(normalized(np.arange(3)[:, None]))
print(normalized(np.arange(3)))
This might also work for you
import numpy as np
normalized_v = v / np.sqrt(np.sum(v**2))
but it fails when v has norm 0.
In that case, introducing a small constant to prevent the zero division solves this.
As proposed in the comments, one could also use
v/np.linalg.norm(v)
To avoid zero division I use eps, but that's maybe not great.
def normalize(v):
    norm = np.linalg.norm(v)
    if norm == 0:
        norm = np.finfo(v.dtype).eps
    return v / norm
If you have multidimensional data and want each axis normalized to its max or its sum:
def normalize(_d, to_sum=True, copy=True):
    # d is an (n x dimension) np array
    d = _d if not copy else np.copy(_d)
    d -= np.min(d, axis=0)
    d /= (np.sum(d, axis=0) if to_sum else np.ptp(d, axis=0))
    return d
This uses numpy's peak-to-peak function, np.ptp.
a = np.random.random((5, 3))
b = normalize(a, copy=False)
b.sum(axis=0) # array([1., 1., 1.]): each column sums to 1
c = normalize(a, to_sum=False, copy=False)
c.max(axis=0) # array([1., 1., 1.]): the max of each column is 1
If you don't need utmost precision, your function can be reduced to:
v_norm = v / (np.linalg.norm(v) + 1e-16)
You mentioned scikit-learn, so I want to share another solution.
scikit-learn MinMaxScaler
In scikit-learn there is an API called MinMaxScaler, which lets you customize the value range as you like.
It also deals with NaN issues for us.
NaNs are treated as missing values: disregarded in fit, and maintained
in transform. ... see reference [1]
Code sample
The code is simple; just type:
# Let's say X_train is your input dataframe
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# create a MinMaxScaler object
min_max_scaler = MinMaxScaler()
# feed in a numpy array
X_train_norm = min_max_scaler.fit_transform(X_train.values)
# wrap it up in a dataframe if you need one
df = pd.DataFrame(X_train_norm)
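If you later need the original scale back, the fitted scaler can invert the transform; a small usage sketch:
# recover the original values from the scaled array
X_train_orig = min_max_scaler.inverse_transform(X_train_norm)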
Reference
[1] sklearn.preprocessing.MinMaxScaler
There is also the function unit_vector() to normalize vectors in the popular transformations module by Christoph Gohlke:
import numpy as np
import transformations as trafo

data = np.array([[1.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0],
                 [1.0, 2.0, 3.0]])
print(trafo.unit_vector(data, axis=1))
If you work with multidimensional arrays, the following fast solution is possible.
Say we have a 2D array which we want to normalize along the last axis, while some rows have zero norm.
import numpy as np

arr = np.array([
    [1, 2, 3],
    [0, 0, 0],
    [5, 6, 7]
], dtype=float)  # note: np.float was removed in NumPy 1.24; use the builtin float

lengths = np.linalg.norm(arr, axis=-1)
print(lengths)  # [ 3.74165739  0.         10.48808848]
arr[lengths > 0] = arr[lengths > 0] / lengths[lengths > 0][:, np.newaxis]
print(arr)
# [[0.26726124 0.53452248 0.80178373]
#  [0.         0.         0.        ]
#  [0.47673129 0.57207755 0.66742381]]
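A minimal alternative sketch that avoids the boolean indexing, using np.divide's where argument (assuming the original arr and lengths from above):
# rows with zero norm keep the zeros supplied via out=
unit = np.divide(arr, lengths[:, np.newaxis],
                 out=np.zeros_like(arr),
                 where=lengths[:, np.newaxis] > 0)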
If you want to normalize n dimensional feature vectors stored in a 3D tensor, you could also use PyTorch:
import numpy as np
from torch import FloatTensor
from torch.nn.functional import normalize
vecs = np.random.rand(3, 16, 16, 16)
norm_vecs = normalize(FloatTensor(vecs), dim=0, eps=1e-16).numpy()
If you're working with 3D vectors, you can do this concisely using the toolbelt vg. It's a light layer on top of numpy and it supports single values and stacked vectors.
import numpy as np
import vg
x = np.random.rand(1000)*10
norm1 = x / np.linalg.norm(x)
norm2 = vg.normalize(x)
print(np.all(norm1 == norm2))
# True
I created the library at my last startup, where it was motivated by uses like this: simple ideas which are way too verbose in NumPy.
Without sklearn, using just numpy:
Just define a function, assuming that the rows are the variables and the columns the samples (axis=1):
import numpy as np

# example array
X = np.array([[1, 2, 3], [4, 5, 6]])

def stdmtx(X):
    means = X.mean(axis=1)
    stds = X.std(axis=1, ddof=1)
    X = X - means[:, np.newaxis]
    X = X / stds[:, np.newaxis]
    return np.nan_to_num(X)
output:
X
array([[1, 2, 3],
       [4, 5, 6]])
stdmtx(X)
array([[-1.,  0.,  1.],
       [-1.,  0.,  1.]])
For a 2D array, you can use the following one-liner to normalize across rows. To normalize across columns, simply set axis=0.
a / np.linalg.norm(a, axis=1, keepdims=True)
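If some rows may be all zeros, a minimal guarded sketch avoids the division by zero (zero rows simply stay zero):
norms = np.linalg.norm(a, axis=1, keepdims=True)
a_unit = a / np.where(norms == 0, 1, norms)  # divide by 1 where the norm is 0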
If you want all values in [0, 1] for a 1d array, then just use
(a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
where a is your 1d array.
An example:
>>> a = np.array([0, 1, 2, 4, 5, 2])
>>> (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))
array([0. , 0.2, 0.4, 0.8, 1. , 0.4])
A note on this method: to preserve the proportions between values, there is a restriction: the 1d array must contain at least one 0 and consist only of 0 and positive numbers.
A simple dot product would do the job. No need for any extra package.
x = x/np.sqrt(x.dot(x))
By the way, if the norm of x is zero, it is inherently a zero vector, and cannot be converted to a unit vector (which has norm 1). If you want to catch the case of np.array([0,0,...0]), then use
norm = np.sqrt(x.dot(x))
x = x/norm if norm != 0 else x