Unexpected behavior in scikit-learn's normalizer

I have a pandas DataFrame and want to normalize a single column, here 'col3'.
This is what my data looks like:
test1['col3']
1 73.506
2 73.403
3 74.038
4 73.980
5 74.295
6 72.864
7 74.013
8 73.748
9 74.536
10 74.926
11 74.355
12 75.577
13 75.563
Name: col3, dtype: float64
When I use the normalizer function (I hope that I am just using it incorrectly), I get:
import numpy as np
from sklearn import preprocessing
preprocessing.normalize(test1['col3'].to_numpy()[:, np.newaxis], axis=0)
array([[ 0.27468327],
[ 0.27429837],
[ 0.27667129],
[ 0.27645455],
[ 0.27763167],
[ 0.27228419],
[ 0.27657787],
[ 0.27558759],
[ 0.27853226],
[ 0.27998964],
[ 0.27785588],
[ 0.28242235],
[ 0.28237003]])
But for normalization (not standardization), I would usually want to scale the values to a range 0 to 1, right? E.g., via the equation
$X' = \frac{X \; - \; X_{min} }{X_{max} - X_{min}}$
So, when I do it "manually", I get completely different results (but results I would expect)
(test1['col3'] - test1['col3'].min()) / (test1['col3'].max() - test1['col3'].min())
1 0.236638
2 0.198673
3 0.432731
4 0.411353
5 0.527460
6 0.000000
7 0.423516
8 0.325839
9 0.616292
10 0.760044
11 0.549576
12 1.000000
13 0.994840
Name: col3, dtype: float64

This is not at all what sklearn.preprocessing.normalize does. In fact, it scales its input vectors to unit L2 norm (or L1 norm if requested), i.e.
>>> import numpy as np
>>> from sklearn.preprocessing import normalize
>>> rng = np.random.RandomState(42)
>>> x = rng.randn(2, 5)
>>> x
array([[ 0.49671415, -0.1382643 , 0.64768854, 1.52302986, -0.23415337],
[-0.23413696, 1.57921282, 0.76743473, -0.46947439, 0.54256004]])
>>> normalize(x)
array([[ 0.28396232, -0.07904315, 0.37027159, 0.87068807, -0.13386116],
[-0.12251149, 0.82631858, 0.40155802, -0.24565113, 0.28389299]])
>>> x / np.linalg.norm(x, axis=1).reshape(-1, 1)
array([[ 0.28396232, -0.07904315, 0.37027159, 0.87068807, -0.13386116],
[-0.12251149, 0.82631858, 0.40155802, -0.24565113, 0.28389299]])
>>> np.linalg.norm(normalize(x), axis=1)
array([ 1., 1.])
(normalize uses a faster way of computing the norm than np.linalg and deals with zeros gracefully, but otherwise these two expressions are the same.)
What you were expecting is called min-max scaling in scikit-learn.
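As a rough sketch of the difference, the snippet below runs the question's 'col3' values through scikit-learn's MinMaxScaler and through normalize. The variable names (col3, scaled, manual, unit) are made up for illustration, and the values are copied by hand from the question.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# Values of test1['col3'] copied from the question (illustrative only).
col3 = np.array([73.506, 73.403, 74.038, 73.980, 74.295, 72.864, 74.013,
                 73.748, 74.536, 74.926, 74.355, 75.577, 75.563])

# scikit-learn estimators expect a 2-D array: one column, many rows.
X = col3.reshape(-1, 1)

# Min-max scaling: each column is mapped onto the range [0, 1].
scaled = MinMaxScaler().fit_transform(X)

# The "manual" formula from the question, for comparison.
manual = (X - X.min()) / (X.max() - X.min())

# normalize(..., axis=0) instead rescales the column to unit L2 norm.
unit = normalize(X, axis=0)

print(np.allclose(scaled, manual))
print(np.linalg.norm(unit))
```

With axis=0, normalize divides the whole column by its Euclidean length, which is why all the values in the question end up close to 0.277 rather than spread over 0 to 1.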

Related

Understanding xarray groupby

I am trying to count the number of members in each group, akin to pandas.DataFrame.groupby.count. However, it doesn't seem to be working. Here is an example:
In [1]: xr_test = xr.DataArray(np.random.rand(6), coords=[[10,10,11,12,12,12]], dims=['dim0'])
xr_test
Out[1]: <xarray.DataArray (dim0: 6)>
array([ 0.92908804, 0.15495709, 0.85304435, 0.24039265, 0.3755476 ,
0.29261274])
Coordinates:
* dim0 (dim0) int32 10 10 11 12 12 12
In [2]: xr_test.groupby('dim0').count()
Out[2]: <xarray.DataArray (dim0: 6)>
array([1, 1, 1, 1, 1, 1])
Coordinates:
* dim0 (dim0) int32 10 10 11 12 12 12
However, I expect this output:
Out[2]: <xarray.DataArray (dim0: 3)>
array([2, 1, 3])
Coordinates:
* dim0 (dim0) int32 10 11 12
What's going on?
In other words:
In [3]: xr_test.to_series().groupby(level=0).count()
Out[3]: dim0
10 2
11 1
12 3
dtype: int64

This is a bug! Xarray currently makes the (in this case mistaken) assumption that coordinates corresponding to dimensions have all unique values. This is usually a good idea, but shouldn't be required. If you make another coordinate, this should work properly, e.g.,
xr_test = xr.DataArray(np.random.rand(6), coords={'aux': ('x', [10,10,11,12,12,12])}, dims=['x'])
xr_test.groupby('aux').count()

Divide each row by a vector element with floating value precision

Suppose I have
a = np.arange(9).reshape((3,3))
and I want to divide each row by a vector
n = np.array([1.1,2.2,3.3])
I tried the proposed solution in this question but the fractional value is not taken into account.

I understand your question differently from the comments above:
import numpy as np
a = np.arange(12).reshape((4,3))
print(a)
n = np.array([[1.1,2.2,3.3]])
print(n)
print(a/n)
Output:
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
[[ 1.1 2.2 3.3]]
[[ 0. 0.45454545 0.60606061]
[ 2.72727273 1.81818182 1.51515152]
[ 5.45454545 3.18181818 2.42424242]
[ 8.18181818 4.54545455 3.33333333]]
I also changed from a square matrix (3x3) to a 4x3 one to point out that rows vs. columns matter. Also, the divisor is now a 2-D row vector of shape (1, 3) (double brackets), so each column of a is divided by the matching element of n.
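Since the question can be read both ways, here is a small sketch of the two broadcasting directions using the question's original 3x3 array. The names by_column and by_row are only illustrative.

```python
import numpy as np

a = np.arange(9).reshape((3, 3))
n = np.array([1.1, 2.2, 3.3])

# n has shape (3,), so it lines up with the last axis of a:
# every row is divided element-wise, i.e. a[i, j] / n[j].
by_column = a / n

# Adding a trailing axis gives n shape (3, 1), so it lines up with the rows:
# row i is divided by the single value n[i], i.e. a[i, j] / n[i].
by_row = a / n[:, np.newaxis]

print(by_column)
print(by_row)
```

Both results are float arrays, because dividing an integer array by a float array promotes to float64.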

Create three arrays from pandas series

For example, I have pandas data series like this:
df = pd.DataFrame({'A': ['foo', 'bar', 'ololo'] * 4,
'B': np.random.randn(12),
'C': np.random.randint(0, 2, 12)})
ga = df.groupby(['A'])['C'].value_counts()
print(ga)
A
bar 1 3
0 1
foo 0 3
1 1
ololo 0 4
I want to create three arrays, like this:
First array
bar, foo, ololo
Second array (number of '1')
2 3 1
Third array (number of '0')
2 1 3
What's the simplest way to do this?

Starting with:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': ['foo', 'bar', 'ololo'] * 4,
'B': np.random.randn(12),
'C': np.random.randint(0, 2, 12)
})
counts = df.groupby('A')['C'].value_counts()
Gives (for counts):
A
bar 1 4
foo 1 4
ololo 0 3
1 1
dtype: int64
So, effectively we want to unstack and transpose so that 0/1 are the index, which we do by:
reshaped = counts.unstack().T.reindex([0, 1]).fillna(0)
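DSM points out that it was possible to avoid .reindex by doing the following (note that recent pandas versions raise a KeyError from .loc when a requested label is missing, so .reindex is the safer choice):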
reshaped = counts.unstack().T.loc[[0, 1]].fillna(0)
Which gives:
A bar foo ololo
0 0 0 3
1 4 4 1
We force a .reindex to ensure the index always contains both 0 and 1 (the random data may contain no 0s or no 1s at all), and use .fillna(0) to set the resulting missing values to 0. You can then get your arrays by doing the following:
arrays = reshaped.columns.values, reshaped.loc[1].values, reshaped.loc[0].values
Which gives you:
(array(['bar', 'foo', 'ololo'], dtype=object),
array([ 4., 4., 1.]),
array([ 0., 0., 3.]))
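For a reproducible check, the same idea can be sketched with a fixed 'C' column instead of random data. The values of C below are made up for illustration, and this variant passes fill_value=0 to unstack and reindex so the counts stay integers; it is one possible variation, not the method from the answer above.

```python
import pandas as pd

# Fixed (made-up) data so the result is deterministic.
df = pd.DataFrame({
    "A": ["foo", "bar", "ololo"] * 4,
    "C": [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1],
})

counts = df.groupby("A")["C"].value_counts()

# Move the 0/1 level into columns; fill_value=0 avoids NaN (and floats),
# and reindex guarantees both columns exist even if one value never occurs.
reshaped = counts.unstack(fill_value=0).reindex(columns=[0, 1], fill_value=0)

names = reshaped.index.to_numpy()
ones = reshaped[1].to_numpy()
zeros = reshaped[0].to_numpy()

print(names, ones, zeros)
```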

numpy random not working with seed

import random
seed = random.random()
random_seed = random.Random(seed)
random_vec = [ random_seed.random() for i in range(10)]
The above is essentially:
np.random.randn(10)
But I am not able to figure out how to set the seed?

I'm not sure why you want to set the seed—especially to a random number, even more especially to a random float (note that random.seed wants a large integer).
But if you do, it's simple: call the numpy.random.seed function.
Note that NumPy's seeds are arrays of 32-bit integers, while Python's seeds are single arbitrary-sized integers (although see the docs for what happens when you pass other types).
So, for example:
In [1]: np.random.seed(0)
In [2]: s = np.random.randn(10)
In [3]: s
Out[3]:
array([ 1.76405235, 0.40015721, 0.97873798, 2.2408932 , 1.86755799,
-0.97727788, 0.95008842, -0.15135721, -0.10321885, 0.4105985 ])
In [4]: np.random.seed(0)
In [5]: s = np.random.randn(10)
In [6]: s
Out[6]:
array([ 1.76405235, 0.40015721, 0.97873798, 2.2408932 , 1.86755799,
-0.97727788, 0.95008842, -0.15135721, -0.10321885, 0.4105985 ])
Same seed used twice (I took the shortcut of passing a single int, which NumPy will internally convert into an array of 1 int32), same random numbers generated.

To put it simply, random.seed(value) does not affect NumPy's random number generator.
For example,
import random
import numpy as np
random.seed(10)
print( np.random.randint(1,10,10)) # generates 10 random integers from 1 to 9 (the upper bound is exclusive)
[4 1 5 7 9 2 9 5 2 4]
random.seed(10)
print( np.random.randint(1,10,10))
[7 6 4 7 2 5 3 7 8 9]
However, if you want to seed the numpy generated values, you have to use np.random.seed(value).
If I revisit the above example,
import numpy as np
np.random.seed(10)
print( np.random.randint(1,10,10))
[5 1 2 1 2 9 1 9 7 5]
np.random.seed(10)
print( np.random.randint(1,10,10))
[5 1 2 1 2 9 1 9 7 5]
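As an alternative sketch, newer NumPy versions (1.17 and later) also provide np.random.default_rng, which returns a self-contained generator object instead of relying on the global seed. The names rng_a, rng_b, draw_a and draw_b are illustrative.

```python
import numpy as np

# Two independent generators created from the same seed.
rng_a = np.random.default_rng(10)
rng_b = np.random.default_rng(10)

# Each generator carries its own state, so the same seed
# reproduces the same stream of numbers.
draw_a = rng_a.standard_normal(10)
draw_b = rng_b.standard_normal(10)

same = np.array_equal(draw_a, draw_b)
print(same)
```

Because the state lives in the generator object, seeding one generator does not disturb random numbers drawn elsewhere in the program.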

Get dot-product of dataframe with vector, and return dataframe, in Pandas

I am unable to find the entry on the method dot() in the official documentation. However the method is there and I can use it. Why is this?
On this topic, is there a way to compute an element-wise multiplication of every row in a data frame with another vector (and obtain a dataframe back)? That is, something similar to dot(), but computing the element-wise product rather than the dot product.

mul is essentially an element-wise (broadcast) product, while dot is an inner product. Let me expand on the accepted answer:
In [13]: df = pd.DataFrame({'A': [1., 1., 1., 2., 2., 2.], 'B': np.arange(1., 7.)})
In [14]: v1 = np.array([2,2,2,3,3,3])
In [15]: v2 = np.array([2,3])
In [16]: df.shape
Out[16]: (6, 2)
In [17]: v1.shape
Out[17]: (6,)
In [18]: v2.shape
Out[18]: (2,)
In [24]: df.mul(v2)
Out[24]:
A B
0 2 3
1 2 6
2 2 9
3 4 12
4 4 15
5 4 18
In [26]: df.dot(v2)
Out[26]:
0 5
1 8
2 11
3 16
4 19
5 22
dtype: float64
So:
df.mul takes a matrix of shape (6, 2) and a vector of shape (2,) (or of shape (6,) with axis=0) and returns a matrix of shape (6, 2).
While:
df.dot takes a matrix of shape (6, 2) and a vector of shape (2,) and returns a vector of shape (6,).
These are not the same operation: they are an element-wise product and an inner product, respectively.

Here is an example of how to multiply a DataFrame by a vector:
In [60]: df = pd.DataFrame({'A': [1., 1., 1., 2., 2., 2.], 'B': np.arange(1., 7.)})
In [61]: vector = np.array([2,2,2,3,3,3])
In [62]: df.mul(vector, axis=0)
Out[62]:
A B
0 2 2
1 2 4
2 2 6
3 6 12
4 6 15
5 6 18
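To tie the two answers together, here is a short sketch using the same DataFrame that shows both directions of mul next to dot. The names per_row, per_col, scaled_rows, scaled_cols and dotted are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1., 1., 1., 2., 2., 2.], "B": np.arange(1., 7.)})
per_row = np.array([2, 2, 2, 3, 3, 3])   # one factor per row
per_col = np.array([2, 3])               # one factor per column

# Element-wise: row i is multiplied by per_row[i]; result is a (6, 2) DataFrame.
scaled_rows = df.mul(per_row, axis=0)

# Element-wise: column j is multiplied by per_col[j]; result is a (6, 2) DataFrame.
scaled_cols = df.mul(per_col, axis=1)

# Inner product: each row is multiplied by per_col and summed; result is a Series.
dotted = df.dot(per_col)

print(scaled_rows)
print(scaled_cols)
print(dotted)
```

Summing scaled_cols across the columns gives the same numbers as dotted, which is one way to see how the two operations relate.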

It's quite hard to say with a degree of accuracy.
Often, a method exists and is undocumented because it's considered internal by the vendor, and may be subject to change.
It could, of course, be a simple oversight by the folks who put together the documentation.
Regarding your second question; I don't really know about that - but it might be better to make a new S/O question for it.
Just scanning the API, could you do something with the DataFrame's .applymap(function) feature?
