I am trying to count the number of members in each group, akin to pandas.DataFrame.groupby.count. However, it doesn't seem to be working. Here is an example:
In [1]: xr_test = xr.DataArray(np.random.rand(6), coords=[[10,10,11,12,12,12]], dims=['dim0'])
xr_test
Out[1]: <xarray.DataArray (dim0: 6)>
array([ 0.92908804, 0.15495709, 0.85304435, 0.24039265, 0.3755476 ,
0.29261274])
Coordinates:
* dim0 (dim0) int32 10 10 11 12 12 12
In [2]: xr_test.groupby('dim0').count()
Out[2]: <xarray.DataArray (dim0: 6)>
array([1, 1, 1, 1, 1, 1])
Coordinates:
* dim0 (dim0) int32 10 10 11 12 12 12
However, I expect this output:
Out[2]: <xarray.DataArray (dim0: 3)>
array([2, 1, 3])
Coordinates:
* dim0 (dim0) int32 10 11 12
What's going on?
In other words:
In [3]: xr_test.to_series().groupby(level=0).count()
Out[3]: dim0
10 2
11 1
12 3
dtype: int64
This is a bug! Xarray currently makes the (here mistaken) assumption that a coordinate corresponding to a dimension has all-unique values. This is usually a good idea, but it shouldn't be required. If you group by a separate, non-dimension coordinate instead, this works properly, e.g.,
xr_test = xr.DataArray(np.random.rand(6), coords={'aux': ('x', [10,10,11,12,12,12])}, dims=['x'])
xr_test.groupby('aux').count()
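As a cross-check outside xarray, the same group sizes can be computed with plain numpy (a minimal sketch using the labels from the question):

```python
import numpy as np

# Group labels from the question; np.unique can return the size of each group.
labels = np.array([10, 10, 11, 12, 12, 12])
uniques, counts = np.unique(labels, return_counts=True)
print(uniques)  # [10 11 12]
print(counts)   # [2 1 3]
```

This matches the expected output in the question: groups 10, 11, 12 with sizes 2, 1, 3.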
Related
I see many questions on SO about how to filter an array (usually using pandas or numpy). The examples are very simple:
df = pd.DataFrame({ 'val': [1, 2, 3, 4, 5, 6, 7] })
a = df[df.val > 3]
Intuitively I understand the df[df.val > 3] statement, but it confuses me from a syntax point of view. In other languages, I would expect a lambda function instead of df.val > 3.
The question is: why is this style possible, and what is going on under the hood?
Update 1:
To be clearer about the confusing part: in other languages I have seen syntax like this:
df.filter(row => row.val > 3)
I understand what is going on under the hood there: on every iteration, the lambda function is called with the row as an argument and returns a boolean value. But here df.val > 3 doesn't make sense to me, because df.val looks like a column.
Moreover, I can write df[df > 3] and it runs successfully, and it drives me crazy because I don't understand how a DataFrame object can be compared to a number.
Create an array and dataframe from it:
In [104]: arr = np.arange(1,8); df = pd.DataFrame({'val':arr})
In [105]: arr
Out[105]: array([1, 2, 3, 4, 5, 6, 7])
In [106]: df
Out[106]:
val
0 1
1 2
2 3
3 4
4 5
5 6
6 7
numpy arrays have methods and operators that operate on the whole array. For example you can multiply the array by a scalar, or add a scalar to all elements. Or in this case compare each element to a scalar. That's all implemented by the class (numpy.ndarray), not by Python syntax.
In [107]: arr>3
Out[107]: array([False, False, False, True, True, True, True])
Similarly, pandas implements these methods (or uses the numpy methods on the underlying arrays). Selecting a column of the frame, with `df['val']` or `df.val`:
In [108]: df.val
Out[108]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
Name: val, dtype: int32
This is a pandas Series (slight difference in display).
It can be compared to a scalar - as with the array:
In [110]: df.val>3
Out[110]:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
Name: val, dtype: bool
And the boolean array can be used to index the frame:
In [111]: df[arr>3]
Out[111]:
val
3 4
4 5
5 6
6 7
The boolean Series also works:
In [112]: df[df.val>3]
Out[112]:
val
3 4
4 5
5 6
6 7
Boolean array indexing works the same way:
In [113]: arr[arr>3]
Out[113]: array([4, 5, 6, 7])
Here I use the indexing to fetch values; setting values is analogous.
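The machinery behind df.val > 3 is ordinary Python operator overloading: `>` calls `__gt__`, and `[]` calls `__getitem__`. A stripped-down toy class (hypothetical, just to illustrate the protocol that numpy and pandas implement for real) shows both hooks:

```python
class ToyColumn:
    """Minimal stand-in for a Series: overloads > and boolean indexing."""
    def __init__(self, values):
        self.values = list(values)

    def __gt__(self, scalar):
        # `col > 3` builds a list of booleans, one per element.
        return [v > scalar for v in self.values]

    def __getitem__(self, mask):
        # `col[mask]` keeps the elements where the mask is True.
        return [v for v, keep in zip(self.values, mask) if keep]

col = ToyColumn([1, 2, 3, 4, 5, 6, 7])
print(col > 3)       # [False, False, False, True, True, True, True]
print(col[col > 3])  # [4, 5, 6, 7]
```

So df.val > 3 is evaluated first and produces a boolean Series; no lambda is needed because the comparison itself is a method call on the column object.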
I would like to slice a pandas value_counts():
>sur_perimetre[col].value_counts()
44341006.0 610
14231009.0 441
12131001.0 382
12222009.0 364
12142001.0 354
But I get an error:
> sur_perimetre[col].value_counts()[:5]
KeyError: 5.0
The same with ix:
> sur_perimetre[col].value_counts().ix[:5]
KeyError: 5.0
How would you deal with that?
EDIT
Maybe:
pd.DataFrame(sur_perimetre[col].value_counts()).reset_index()[:5]
Method 1:
You need to observe that value_counts() returns a Series object. You can process it like any other Series and get its values; you can even construct a new dataframe out of it.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: vc = df.C1.value_counts()
In [4]: type(vc)
Out[4]: pandas.core.series.Series
In [5]: vc.values
Out[5]: array([2, 1, 1, 1, 1])
In [6]: vc.values[:2]
Out[6]: array([2, 1])
In [7]: vc.index.values
Out[7]: array([3, 5, 4, 2, 1])
In [8]: df2 = pd.DataFrame({'value':vc.index, 'count':vc.values})
In [8]: df2
Out[8]:
count value
0 2 3
1 1 5
2 1 4
3 1 2
4 1 1
Method 2:
I then tried to reproduce the error you mentioned, but using a single column in a DataFrame I didn't get any error with the same notation.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: df['C1'].value_counts()[:3]
Out[3]:
3 2
5 1
4 1
Name: C1, dtype: int64
In [4]: df.C1.value_counts()[:5]
Out[4]:
3 2
5 1
4 1
2 1
1 1
Name: C1, dtype: int64
In [5]: pd.__version__
Out[5]: u'0.17.1'
Hope it helps!
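The KeyError in the question comes from label-based slicing on a float index; a positional slice via .iloc (or .head()) avoids label lookup entirely. A sketch with hypothetical data shaped like the question's:

```python
import pandas as pd

# A column of repeated float codes, like sur_perimetre[col] in the question.
s = pd.Series([44341006.0] * 3 + [14231009.0] * 2 + [12131001.0])
vc = s.value_counts()

# Positional slicing: the 2 is never interpreted as an index label.
print(vc.iloc[:2])
print(vc.head(2))  # equivalent convenience method
```

With a float index, plain `[:5]` can fall back to label-based lookup and raise KeyError: 5.0, whereas .iloc is always positional.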
I have a pandas DataFrame and want to normalize a single column, here 'col3'.
This is what my data looks like:
test1['col3']
1 73.506
2 73.403
3 74.038
4 73.980
5 74.295
6 72.864
7 74.013
8 73.748
9 74.536
10 74.926
11 74.355
12 75.577
13 75.563
Name: col3, dtype: float64
When I use the normalizer function (I hope that I am just using it incorrectly), I get:
from sklearn import preprocessing
preprocessing.normalize(test1['col3'][:, np.newaxis], axis=0)
array([[ 0.27468327],
[ 0.27429837],
[ 0.27667129],
[ 0.27645455],
[ 0.27763167],
[ 0.27228419],
[ 0.27657787],
[ 0.27558759],
[ 0.27853226],
[ 0.27998964],
[ 0.27785588],
[ 0.28242235],
[ 0.28237003]])
But for normalization (not standardization), I would usually want to scale the values to a range 0 to 1, right? E.g., via the equation
$X' = \frac{X \; - \; X_{min} }{X_{max} - X_{min}}$
(Hm, somehow the Latex doesn't work today...)
So, when I do it "manually", I get completely different results (but the results I would expect):
(test1['col3'] - test1['col3'].min()) / (test1['col3'].max() - test1['col3'].min())
1 0.236638
2 0.198673
3 0.432731
4 0.411353
5 0.527460
6 0.000000
7 0.423516
8 0.325839
9 0.616292
10 0.760044
11 0.549576
12 1.000000
13 0.994840
Name: col3, dtype: float64
This is not at all what sklearn.preprocessing.normalize does. In fact, it scales its input vectors to unit L2 norm (or L1 norm if requested), i.e.
>>> from sklearn.preprocessing import normalize
>>> rng = np.random.RandomState(42)
>>> x = rng.randn(2, 5)
>>> x
array([[ 0.49671415, -0.1382643 , 0.64768854, 1.52302986, -0.23415337],
[-0.23413696, 1.57921282, 0.76743473, -0.46947439, 0.54256004]])
>>> normalize(x)
array([[ 0.28396232, -0.07904315, 0.37027159, 0.87068807, -0.13386116],
[-0.12251149, 0.82631858, 0.40155802, -0.24565113, 0.28389299]])
>>> x / np.linalg.norm(x, axis=1).reshape(-1, 1)
array([[ 0.28396232, -0.07904315, 0.37027159, 0.87068807, -0.13386116],
[-0.12251149, 0.82631858, 0.40155802, -0.24565113, 0.28389299]])
>>> np.linalg.norm(normalize(x), axis=1)
array([ 1., 1.])
(normalize uses a faster way of computing the norm than np.linalg and deals with zeros gracefully, but otherwise these two expressions are the same.)
What you were expecting is called min-max scaling in scikit-learn.
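Min-max scaling is available in scikit-learn as sklearn.preprocessing.MinMaxScaler, but the formula from the question is also a one-liner in plain numpy (a sketch using a few of the question's values):

```python
import numpy as np

x = np.array([73.506, 73.403, 74.038, 72.864, 75.577])

# X' = (X - X_min) / (X_max - X_min): rescales values into the range [0, 1].
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled.min(), x_scaled.max())  # 0.0 1.0
```

This reproduces the "manual" computation in the question: the smallest value maps to 0 and the largest to 1.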
I would like to load a table in numpy, so that the first row and first column would be considered text labels. Something equivalent to this R code:
read.table("filename.txt", header=TRUE, row.names=1)
Where the file is a delimited text file like this:
A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5
So that read in I will have an array:
[[5,4,3,2],
[1,0,9,9],
[8,7,6,5]]
With some sort of:
rownames ["X","Y","Z"]
colnames ["A","B","C","D"]
Is there such a class / mechanism?
Numpy arrays aren't perfectly suited to table-like structures. However, pandas.DataFrames are.
For what you're wanting, use pandas.
For your example, you'd do
data = pandas.read_csv('filename.txt', delim_whitespace=True, index_col=0)
As a more complete example (using StringIO to simulate your file):
from io import StringIO
import pandas as pd
f = StringIO("""A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5""")
x = pd.read_csv(f, delim_whitespace=True, index_col=0)
print('The DataFrame:')
print(x)
print('Selecting a column')
print(x['D'])  # or "x.D" if there aren't spaces in the name
print('Selecting a row')
print(x.loc['Y'])
This yields:
The DataFrame:
A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5
Selecting a column
X 2
Y 9
Z 5
Name: D, dtype: int64
Selecting a row
A 1
B 0
C 9
D 9
Name: Y, dtype: int64
Also, as @DSM pointed out, it's very useful to know about things like DataFrame.values or DataFrame.to_records() if you do need a "raw" numpy array. (pandas is built on top of numpy. In a simple, non-strict sense, each column of a DataFrame is stored as a 1D numpy array.)
For example:
In [2]: x.values
Out[2]:
array([[5, 4, 3, 2],
[1, 0, 9, 9],
[8, 7, 6, 5]])
In [3]: x.to_records()
Out[3]:
rec.array([('X', 5, 4, 3, 2), ('Y', 1, 0, 9, 9), ('Z', 8, 7, 6, 5)],
dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8'), ('D', '<i8')])
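If pandas isn't an option, the same file can be pulled apart with a few lines of plain Python plus numpy (a minimal sketch: take the header line as column names, peel the first field of each data row off as the row name):

```python
import numpy as np
from io import StringIO

# StringIO stands in for the file, as in the answer above.
f = StringIO("""A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5""")

lines = f.read().splitlines()
colnames = lines[0].split()                                # ['A', 'B', 'C', 'D']
rows = [line.split() for line in lines[1:]]
rownames = [r[0] for r in rows]                            # ['X', 'Y', 'Z']
data = np.array([[int(v) for v in r[1:]] for r in rows])   # the 3x4 numeric block
print(data)
```

This gives exactly the array plus rownames/colnames layout the question asks for, at the cost of doing the bookkeeping yourself.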
I am unable to find the entry on the method dot() in the official documentation. However the method is there and I can use it. Why is this?
On this topic, is there a way to compute an element-wise multiplication of every row in a data frame with another vector (and obtain a dataframe back)? I.e. similar to dot(), but rather than computing the dot product, one computes the element-wise product.
mul performs an element-wise (broadcast) multiplication, while dot is an inner product. Let me expand on the accepted answer:
In [13]: df = pd.DataFrame({'A': [1., 1., 1., 2., 2., 2.], 'B': np.arange(1., 7.)})
In [14]: v1 = np.array([2,2,2,3,3,3])
In [15]: v2 = np.array([2,3])
In [16]: df.shape
Out[16]: (6, 2)
In [17]: v1.shape
Out[17]: (6,)
In [18]: v2.shape
Out[18]: (2,)
In [24]: df.mul(v2)
Out[24]:
A B
0 2 3
1 2 6
2 2 9
3 4 12
4 4 15
5 4 18
In [26]: df.dot(v2)
Out[26]:
0 5
1 8
2 11
3 16
4 19
5 22
dtype: float64
So:
df.mul takes a matrix of shape (6, 2) and a vector of shape (2,), broadcasts it across the rows, and returns a matrix of shape (6, 2).
While:
df.dot takes a matrix of shape (6, 2) and a vector of shape (2,) and returns a vector of shape (6,).
These are not the same operation: the first is an element-wise broadcast product, the second an inner product.
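The shape bookkeeping above can be double-checked with raw numpy, which mirrors what pandas does under the hood (a sketch using the same numbers as the transcript):

```python
import numpy as np

m = np.column_stack([np.array([1., 1., 1., 2., 2., 2.]),
                     np.arange(1., 7.)])   # shape (6, 2), like df
v2 = np.array([2, 3])                      # shape (2,)

elementwise = m * v2   # broadcast over columns -> shape (6, 2), like df.mul(v2)
inner = m @ v2         # matrix-vector product -> shape (6,), like df.dot(v2)
print(inner)           # one inner product per row
```

The inner products come out as 5, 8, 11, 16, 19, 22, matching Out[26] above.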
Here is an example of how to multiply a DataFrame by a vector:
In [60]: df = pd.DataFrame({'A': [1., 1., 1., 2., 2., 2.], 'B': np.arange(1., 7.)})
In [61]: vector = np.array([2,2,2,3,3,3])
In [62]: df.mul(vector, axis=0)
Out[62]:
A B
0 2 2
1 2 4
2 2 6
3 6 12
4 6 15
5 6 18
It's quite hard to say with any degree of accuracy.
Often, a method exists and is undocumented because it's considered internal by the vendor, and may be subject to change.
It could, of course, be a simple oversight by the folks who put together the documentation.
Regarding your second question: I don't really know about that, but it might be better to ask it as a new S/O question.
Just scanning the API, could you do something with the DataFrame's .applymap(function) feature?