I have the following numpy array:
foo = np.array([[0.0, 10.0], [0.13216, 12.11837], [0.25379, 42.05027], [0.30874, 13.11784]])
which yields:
[[ 0.      10.     ]
 [ 0.13216 12.11837]
 [ 0.25379 42.05027]
 [ 0.30874 13.11784]]
How can I normalize the Y component of this array so that it gives me something like:
[[ 0.       0.   ]
 [ 0.13216  0.06 ]
 [ 0.25379  1.   ]
 [ 0.30874  0.097]]
Referring to this Cross Validated question, "How to normalize data to 0-1 range?", it looks like you can perform min-max normalisation on the last column of foo:
v = foo[:, 1] # foo[:, -1] for the last column
foo[:, 1] = (v - v.min()) / (v.max() - v.min())
foo
array([[ 0.        ,  0.        ],
       [ 0.13216   ,  0.06609523],
       [ 0.25379   ,  1.        ],
       [ 0.30874   ,  0.09727968]])
Another option for performing normalisation (as suggested by the OP) is sklearn.preprocessing.normalize, which yields slightly different results:
from sklearn.preprocessing import normalize
foo[:, [-1]] = normalize(foo[:, -1, None], norm='max', axis=0)
foo
array([[ 0.        ,  0.2378106 ],
       [ 0.13216   ,  0.28818769],
       [ 0.25379   ,  1.        ],
       [ 0.30874   ,  0.31195614]])
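The difference comes from what norm='max' does: it divides each value by the column's maximum absolute value, without first subtracting the minimum. A quick sketch to confirm:

import numpy as np

foo = np.array([[0.0, 10.0], [0.13216, 12.11837],
                [0.25379, 42.05027], [0.30874, 13.11784]])

# norm='max' only rescales by the column's maximum absolute value;
# there is no shift by the minimum, hence the non-zero first entry above.
v = foo[:, 1]
print(v / np.abs(v).max())
# [0.2378106  0.28818769 1.         0.31195614]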
sklearn.preprocessing.MinMaxScaler can also be used (feature_range=(0, 1) is default):
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
v = foo[:, 1].reshape(-1, 1)  # MinMaxScaler expects a 2-D array
v_scaled = min_max_scaler.fit_transform(v)
foo[:, 1] = v_scaled.ravel()
print(foo)
Output:
[[ 0.          0.        ]
 [ 0.13216     0.06609523]
 [ 0.25379     1.        ]
 [ 0.30874     0.09727968]]
The advantage is that scaling to any range can be done via the feature_range argument.
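For example, a minimal sketch that scales the second column into [-1, 1] instead:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))  # any (min, max) target range works
foo[:, 1] = scaler.fit_transform(foo[:, 1].reshape(-1, 1)).ravel()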
I think you want this:
foo[:,1] = (foo[:,1] - foo[:,1].min()) / (foo[:,1].max() - foo[:,1].min())
You are trying to min-max scale only the second column between 0 and 1. Using sklearn.preprocessing.minmax_scale should easily solve your problem. For example:
from sklearn.preprocessing import minmax_scale
column_1 = foo[:,0] #first column you don't want to scale
column_2 = minmax_scale(foo[:,1], feature_range=(0,1)) #second column you want to scale
foo_norm = np.stack((column_1, column_2), axis=1) #stack both columns to get a 2d array
Should yield
array([[0.        , 0.        ],
       [0.13216   , 0.06609523],
       [0.25379   , 1.        ],
       [0.30874   , 0.09727968]])
Maybe you want to min-max scale both columns between 0 and 1. In this case, use:
foo_norm = minmax_scale(foo, feature_range=(0,1), axis=0)
Which yields
array([[0.        , 0.        ],
       [0.42806245, 0.06609523],
       [0.82201853, 1.        ],
       [1.        , 0.09727968]])
Note: not to be confused with the operation that scales the norm (length) of a vector to a certain value (usually 1), which is also commonly referred to as normalization.
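For contrast, a minimal sketch of that other sense of normalization, scaling the column so its Euclidean (L2) norm is 1:

import numpy as np

v = np.array([10.0, 12.11837, 42.05027, 13.11784])  # the original Y column
unit = v / np.linalg.norm(v)   # rescale so the vector has L2 norm 1
print(np.linalg.norm(unit))    # 1.0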
I have a numpy array with 2 columns. The second column represents the keys that I want to reduce on.
>>> x
array([[0.1 , 1.  ],
       [0.25, 1.  ],
       [0.45, 0.  ],
       [0.55, 0.  ]])
I want to sum up all the values which share a key, like this.
>>> sum_key(x)
array([[0.35, 1.  ],
       [1.  , 0.  ]])
This seems like a relatively universal task, but I can't find a good name for it or see it discussed.
Any ideas?
This is kinda overcomplicated, but it should do the job:
import numpy as np
x = np.array([[0.1 , 1. ],
              [0.25, 1. ],
              [0.45, 0. ],
              [0.55, 0. ]])
keys = x[:,1]
values = x[:,0]
keys_unique = np.unique(keys)
print([[sum(values[keys == k]), k] for k in keys_unique])
Output:
[[1.0, 0.0], [0.35, 1.0]]
If the indices (keys) are ascending integers (or can easily be cast to them, as in your case), the most convenient way is to use np.bincount.
import numpy as np
x = np.array([[0.1 , 1. ],
              [0.25, 1. ],
              [0.45, 0. ],
              [0.55, 0. ]])
v = x[:, 0]
i = x[:, 1]
counts = np.bincount(i.astype(int), v)  # weighted sum of v per integer key
print(counts)
# returns [1. 0.35]
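If the keys are not small non-negative integers, a sketch that remaps them first with np.unique and then applies the same bincount trick:

import numpy as np

x = np.array([[0.1, 1.], [0.25, 1.], [0.45, 0.], [0.55, 0.]])

# Map arbitrary keys to 0..n-1 via return_inverse, then sum per key.
keys, inv = np.unique(x[:, 1], return_inverse=True)
sums = np.bincount(inv, weights=x[:, 0])
print(np.column_stack((sums, keys)))
# [[1.   0.  ]
#  [0.35 1.  ]]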
A solution without numpy.
Grouping elements by key is typically done with a python dict.
Be careful if your keys are floating-point numbers. For instance, 1.000000001 and 1.0 will be distinct keys; I suggest rounding to int first.
Using a dict
x = [[0.1 , 1],
     [0.25, 1],
     [0.45, 0],
     [0.55, 0]]
y = {}
for v, k in x:
    y[k] = y.get(k, 0) + v
print(y)
{1: 0.35, 0: 1.0}
You can get an array again from dict y if you want:
z = np.array([(v,k) for k,v in y.items()])
print(z)
# [[0.35 1.  ]
#  [1.   0.  ]]
import numpy as np
import pandas as pd

data = np.array([[0.1 , 1. ],
                 [0.25, 1. ],
                 [0.45, 0. ],
                 [0.55, 0. ]])
df = pd.DataFrame(data)
gr = df.groupby([1])[0].agg('sum')  # sum column 0, grouped by column 1
print(gr.keys().values)             # the unique keys: [0. 1.]
data1 = np.array([[gr[k], k] for k in gr.keys().values])  # back to (sum, key) rows
print(data1)
# [[1.   0.  ]
#  [0.35 1.  ]]
My code so far is:
import numpy as np
data=np.genfromtxt('filename')
print(data)
which prints:
[[ 0.723  1.   ]
 [ 0.433  2.   ]
 [ 0.258  1.   ]
 [ 1.52   2.   ]
 [ 0.083  2.   ]
 [ 2.025  1.   ]
 [ 3.928  1.   ]]
How do I split the data into two groups, based on whether the row has a 1 or a 2 in the second column?
A simple solution is to use np.where, which returns the indices satisfying a condition as a tuple of arrays. That result can be used directly with numpy's advanced indexing to slice the data into new variables.
import numpy as np
data = np.array(
    [[ 0.723, 1. ],
     [ 0.433, 2. ],
     [ 0.258, 1. ],
     [ 1.52 , 2. ],
     [ 0.083, 2. ],
     [ 2.025, 1. ],
     [ 3.928, 1. ]])
data1 = data[np.where(data[:,1] == 1)]
data2 = data[np.where(data[:,1] == 2)]
print(data1)
print(data2)
How about something like this:
import numpy as np
data = np.asarray([[0.723, 1.],
                   [0.433, 2.],
                   [0.258, 1.],
                   [1.520, 2.],
                   [0.083, 2.],
                   [2.025, 1.],
                   [3.928, 1.]])
split_data = [data[data[:,1] == 1.], data[data[:,1] == 2.]]
print(f'data:\n{data}')
print(f'split_data:\n{split_data}')
Explanation:
data[:,1] selects the values in the second "column", i.e. the group label of each row.
Output:
data:
[[0.723 1.   ]
 [0.433 2.   ]
 [0.258 1.   ]
 [1.52  2.   ]
 [0.083 2.   ]
 [2.025 1.   ]
 [3.928 1.   ]]
split_data:
[array([[0.723, 1.   ],
        [0.258, 1.   ],
        [2.025, 1.   ],
        [3.928, 1.   ]]),
 array([[0.433, 2.   ],
        [1.52 , 2.   ],
        [0.083, 2.   ]])]
Your question was rather brief, so I didn't quite catch the data format, but I tried replicating it with:
foo = [[0.723, 1], [0.433, 2], [0.258, 1], [1.52, 2],
       [0.083, 2], [2.025, 1], [3.928, 1]]
In case you want to filter this list foo to only contain entries whose second element matches a certain number, you could use the following list comprehensions:
foo_is_1 = [e for e in foo if e[1] == 1]
foo_is_2 = [e for e in foo if e[1] == 2]
print(foo_is_1)
print(foo_is_2)
In case you know nothing about the second element and just want to split your list up into a list of lists, one per unique second element, you could use:
list_of_lists = [[e for e in foo if e[1] == a] for a in list(set([a[1] for a in foo]))]
for entry in list_of_lists:
    print(entry)
This is basically two nested list comprehensions: the outer one iterates over each unique second element a, and the inner one over each entry e in foo.
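A single-pass alternative (a sketch, not from the original answer): group the rows into a dict of lists so foo is only scanned once, however many unique second elements there are.

groups = {}
for e in foo:
    groups.setdefault(e[1], []).append(e)  # key on the second element
for key, rows in groups.items():
    print(key, rows)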
I have an arbitrary row vector "u" and an arbitrary matrix "e" as follows:
u = np.resize(np.array([8,3]),[1,2])
e = np.resize(np.array([[2,2,5,5],[1, 6, 7, 4]]),[4,2])
np.cov(u,e)
array([[ 12.5,   0. ,   0. , -12.5,   7.5],
       [  0. ,   0. ,   0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ],
       [-12.5,   0. ,   0. ,  12.5,  -7.5],
       [  7.5,   0. ,   0. ,  -7.5,   4.5]])
The matrix that this returns is 5x5. This is confusing to me because the largest dimension of the inputs is only 4.
Thus, this may be less of a numpy question and more of a math question...not sure...
Please refer to the official numpy documentation (https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.cov.html) and check whether your usage of numpy.cov is consistent with what you are trying to achieve.
When looking at the signature
numpy.cov(m, y=None, rowvar=True, bias=False, ddof=None, fweights=None, aweights=None)
m : array_like
    A 1-D or 2-D array containing multiple variables and observations.
    Each row of m represents a variable, and each column a single
    observation of all those variables. Also see rowvar below.
y : array_like, optional
    An additional set of variables and observations. y has the same form
    as that of m.
Note how m and y are combined, as shown in the last example on the page. In your case, u contributes one row (one variable) and e contributes four more, so np.cov(u, e) sees five variables in total and returns a 5x5 covariance matrix.
>>> x = [-2.1, -1, 4.3]
>>> y = [3, 1.1, 0.12]
>>> X = np.stack((x, y), axis=0)
>>> print(np.cov(X))
[[ 11.71       -4.286     ]
 [ -4.286       2.14413333]]
>>> print(np.cov(x, y))
[[ 11.71       -4.286     ]
 [ -4.286       2.14413333]]
>>> print(np.cov(x))
11.71
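To see this with u and e, here is a small sketch: np.cov(u, e) should match the covariance of the row-wise stack of u and e, i.e. five variables with two observations each.

import numpy as np

u = np.resize(np.array([8, 3]), [1, 2])
e = np.resize(np.array([[2, 2, 5, 5], [1, 6, 7, 4]]), [4, 2])

# u contributes 1 row and e contributes 4, so np.cov sees 5 variables
# (each observed twice, once per column), hence the 5x5 result.
stacked = np.vstack((u, e))                        # shape (5, 2)
print(np.allclose(np.cov(u, e), np.cov(stacked)))  # True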
I recently posted a question here which was answered exactly as I asked. However, I think I overestimated my ability to manipulate the answer further. I read the broadcasting doc, and followed a few links that led me way back to 2002 about numpy broadcasting.
I've used the second method of array creation using broadcasting:
N = 10
out = np.zeros((N**3,4),dtype=int)
out[:,:3] = (np.arange(N**3)[:,None]/[N**2,N,1])%N
which outputs:
[[0,0,0,0]
 [0,0,1,0]
 ...
 [0,1,0,0]
 [0,1,1,0]
 ...
 [9,9,8,0]
 [9,9,9,0]]
but I do not understand from the docs how to manipulate that. I would ideally like to be able to set the increment by which each individual column changes. E.g. column A changes by 0.5 up to 2, column B changes by 0.2 up to 1, and column C changes by 1 up to 10.
[[0,0,0,0]
 [0,0,1,0]
 ...
 [0,0,9,0]
 [0,0.2,0,0]
 ...
 [0,0.8,9,0]
 [0.5,0,0,0]
 ...
 [1.5,0.8,9,0]]
Thanks for any help.
You can adjust your current code just a little bit to make it work.
>>> out = np.zeros((4*5*10,4))
>>> out[:,:3] = (np.arange(4*5*10)[:,None]//(5*10, 10, 1)*(0.5, 0.2, 1)%(2, 1, 10))
>>> out
array([[ 0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  1. ,  0. ],
       [ 0. ,  0. ,  2. ,  0. ],
       ...
       [ 0. ,  0. ,  8. ,  0. ],
       [ 0. ,  0. ,  9. ,  0. ],
       [ 0. ,  0.2,  0. ,  0. ],
       ...
       [ 0. ,  0.8,  9. ,  0. ],
       [ 0.5,  0. ,  0. ,  0. ],
       ...
       [ 1.5,  0.8,  9. ,  0. ]])
The changes are:
No int dtype on the array, since we need it to hold floats in some columns. You could specify a float dtype if you want (or even something more complicated that only allows floats in the first two columns).
Rather than N**3 total values, figure out the number of distinct values for each column, and multiply them together to get our total size. This is used for both zeros and arange.
Use the floor division // operator in the first broadcast operation because we want integers at this point, but later we'll want floats.
The values to divide by are again based on the number of values for the later columns (e.g. for A,B,C numbers of values, divide by B*C, C, 1).
Add a new broadcast operation to multiply by various scale factors (how much each value increases at once).
Change the values in the broadcast mod % operation to match the bounds on each column.
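Putting those changes together, here is a sketch that derives the divisors and sizes from per-column (step, stop) pairs; the name specs and its layout are illustrative, with values matching the bounds in the question.

import numpy as np

specs = [(0.5, 2), (0.2, 1), (1, 10)]                # (increment, exclusive bound) per column
counts = [int(stop / step) for step, stop in specs]  # distinct values per column: [4, 5, 10]
divisors = np.cumprod([1] + counts[:0:-1])[::-1]     # [50, 10, 1]
steps = np.array([step for step, stop in specs])
total = np.prod(counts)                              # 200 rows

out = np.zeros((total, 4))
out[:, :3] = np.arange(total)[:, None] // divisors % counts * steps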
This small example helps me understand what is going on:
In [123]: N=2
In [124]: np.arange(N**3)[:,None]/[N**2, N, 1]
Out[124]:
array([[ 0.  ,  0. ,  0. ],
       [ 0.25,  0.5,  1. ],
       [ 0.5 ,  1. ,  2. ],
       [ 0.75,  1.5,  3. ],
       [ 1.  ,  2. ,  4. ],
       [ 1.25,  2.5,  5. ],
       [ 1.5 ,  3. ,  6. ],
       [ 1.75,  3.5,  7. ]])
So we generate a range of numbers (0 to 7) and divide them by 4, 2, and 1. The rest of the calculation just changes each value without further broadcasting. Apply %N to each element:
In [126]: np.arange(N**3)[:,None]/[N**2, N, 1]%N
Out[126]:
array([[ 0.  ,  0. ,  0. ],
       [ 0.25,  0.5,  1. ],
       [ 0.5 ,  1. ,  0. ],
       [ 0.75,  1.5,  1. ],
       [ 1.  ,  0. ,  0. ],
       [ 1.25,  0.5,  1. ],
       [ 1.5 ,  1. ,  0. ],
       [ 1.75,  1.5,  1. ]])
Assigning to an int array is the same as converting the floats to integers:
In [127]: (np.arange(N**3)[:,None]/[N**2, N, 1]%N).astype(int)
Out[127]:
array([[0, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 1, 1],
       [1, 0, 0],
       [1, 0, 1],
       [1, 1, 0],
       [1, 1, 1]])
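In other words, each row is just the row index written in base N. A sketch that makes the digit extraction explicit with integer operations:

import numpy as np

N = 2
idx = np.arange(N**3)
# Extract the base-N digits of each index: hundreds, tens, units (in base N).
digits = np.stack([idx // N**2 % N, idx // N % N, idx % N], axis=1)
print(digits)  # the same 8x3 table of base-2 digits as Out[127]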
I have a numpy array that I want to alter by scaling all of the columns (e.g. all the values in a column are divided by the maximum value in that column so that all values are <1).
A sample output of the array is
[ 2. 0. 367.877 ..., -0.358 51.547 -32.633]
[ 2. 0. 339.824 ..., -0.33 52.562 -27.581]
[ 3. 0. 371.438 ..., -0.406 55.108 -35.573]
I've tried scaling the array (data_in) by the following code:
#normalize the data_in array
data_in_normalized = data_in / data_in.max(axis=0)
However, the output of data_in_normalized is:
[ 0.5 0. 0.95437199 0.89363654 0.80751792 ]
[ 0.46931238 0.50660904 0.5003812 0.91250444 0.625 ]
[ 0.96229214 0.89483109 0.86989432 0.86491407 0.71287646 ]
[ -23.90909091 0.34346373 1.25110652 0. 0.8537859 1. 1.]
Clearly, it didn't normalize: there are multiple places where the maximum value is >1. Is there a better way to scale the data, or am I using the max() function incorrectly (e.g. is the max() value being shared between columns?)
IIUC, it's not that the maximum value is shared between columns; it's that you probably want to divide by the maximum absolute value instead, because you have elements of both signs. After all, 1 > -100, so if you divide a column like [1, -100] by its maximum value, nothing changes.
For example:
>>> data_in = np.array([[-3,-2],[2,1]])
>>> data_in
array([[-3, -2],
       [ 2,  1]])
>>> data_in.max(axis=0)
array([2, 1])
>>> data_in / data_in.max(axis=0)
array([[-1.5, -2. ],
       [ 1. ,  1. ]])
but
>>> data_in / np.abs(data_in).max(axis=0)
array([[-1.        , -1.        ],
       [ 0.66666667,  0.5       ]])
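And if the goal is strictly all values in [0, 1] per column, a sketch using min-max scaling per column instead of dividing by the (absolute) maximum:

import numpy as np

data_in = np.array([[-3, -2], [2, 1]])

# Shift each column by its minimum, then divide by the column's range.
col_min = data_in.min(axis=0)
col_max = data_in.max(axis=0)
print((data_in - col_min) / (col_max - col_min))
# [[0. 0.]
#  [1. 1.]]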