The result of dataframe.mean() is incorrect - python

I am working in Python 2.7. I have a data frame, and I want the average of the column 'c', but only over the rows where another column equals a given value.
When I run the code, the mean is unexpected, yet the same selection with median gives the correct result.
Why is the output of the mean incorrect?
The code is the following:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
              ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]]),
    columns=['a', 'b', 'c', 'd']
)
df
mean1 = df[df.a == 'A'].c.mean()
mean2 = df[df.a == 'B'].c.mean()
median1 = df[df.a == 'A'].c.median()
median2 = df[df.a == 'B'].c.median()
The output:
df
Out[1]:
   a  b  c    d
0  A  1  2    3
1  A  4  5  nan
2  A  7  8    9
3  B  3  2  nan
4  B  5  6  nan
5  B  5  6  nan
mean1
Out[2]: 86.0
mean2
Out[3]: 88.66666666666667
median1
Out[4]: 5.0
median2
Out[5]: 6.0
It is obvious that the output of the mean is incorrect.
Thanks.

Pandas is doing string concatenation for the "sum" when calculating the mean; this is plain to see from your example frame.
>>> df[df.a == 'B'].c
3    2
4    6
5    6
Name: c, dtype: object
>>> 266 / 3
88.66666666666667
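You can check the concatenation directly; summing an object-dtype Series of digit strings concatenates them (the explicit float conversion below is my illustration of the division, not necessarily pandas' exact internals):
>>> s = df[df.a == 'B'].c
>>> s.sum()
'266'
>>> float(s.sum()) / s.count()
88.66666666666667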
If you look at the dtypes of your DataFrame, you'll notice that all of them are object, even though no single Series contains mixed types. This is due to the declaration of your numpy array: arrays are not meant to contain heterogeneous types, so the array defaults to dtype object, which is then passed to the DataFrame constructor. You can avoid this behavior by passing the constructor a list instead, which can hold differing dtypes with no issues.
df = pd.DataFrame(
    [['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
     ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]],
    columns=['a', 'b', 'c', 'd']
)
df[df.a == 'B'].c.mean()
4.666666666666667
In [17]: df.dtypes
Out[17]:
a     object
b      int64
c      int64
d    float64
dtype: object
I still can't imagine that this behavior is intended, so I believe it's worth opening an issue on the pandas issue tracker; but in general, you shouldn't be using object-dtype Series for numeric calculations.
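As a side note (my suggestion, not part of the fix above): if you are stuck with an already-constructed object-dtype frame, you can convert the numeric columns explicitly with pd.to_numeric before aggregating:
for col in ['b', 'c', 'd']:
    df[col] = pd.to_numeric(df[col])

df[df.a == 'A'].c.mean()  # 5.0, the expected numeric mean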

Related

Why is the result of a Pandas' melt Fortran contiguous and not C-contiguous?

I ran into some pandas melt behavior that undermines my mental model of that function, and I wonder if somebody could explain why this is sane/logical/desirable behavior.
The following snippet melts down a dataframe and then converts the result into a numpy array. Since I'm melting all columns, I would have expected the result to be similar to what np.ndarray.ravel() would do, i.e., create a 1D view into the data and add a column with the respective column names (var names). However, to my surprise, melt actually makes a copy of the data and reorders it as F-contiguous. Why is F-contiguity a good idea here?
import numpy as np
import pandas as pd

expected_flat = np.arange(100 * 3)
expected_full = expected_flat.reshape(100, 3)
# expected_full is view into flat array
assert expected_full.base is expected_flat
assert expected_flat.flags["C_CONTIGUOUS"]
test_df = pd.DataFrame(
    expected_flat.reshape(100, 3),
    columns=["a", "b", "c"],
)
# test_df, too, is a view into flat array
reconstructed = test_df.to_numpy()
assert reconstructed.base is expected_flat
flatten_melt = test_df.melt(var_name="col", value_name="foobar")
flatten_melt_numpy = flatten_melt.foobar.to_numpy()
# flatten_melt is NOT a view and reordered
assert flatten_melt_numpy.base is not expected_flat
assert np.allclose(flatten_melt_numpy, expected_flat) == False
# the confusing part is that the array is now F-contiguous
reconstructed_melt = flatten_melt_numpy.reshape(100, 3, order="F")
assert np.allclose(reconstructed_melt, expected_full)
Construct a frame from a pair of "series":
In [322]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [323]: df
Out[323]:
   a  b
0  1  4
1  2  5
2  3  6
In [324]: arr = df.to_numpy()
In [325]: arr
Out[325]:
array([[1, 4],
       [2, 5],
       [3, 6]])
In [326]: arr.flags
Out[326]:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
...
In [327]: arr.strides
Out[327]: (8, 24)
The resulting array is F_CONTIGUOUS.
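A rough numpy analogue of why (my simplification, not pandas' actual code path): same-dtype columns live in one 2-D block with the columns stored as rows, and the block is transposed on the way out; the transpose of a C-order array is F-order:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
block = np.array([a, b])   # 2 x 3, C-contiguous: columns stored as rows
arr = block.T              # 3 x 2 view; transpose of C-order is F-order
arr.flags['F_CONTIGUOUS']  # True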
If I make a frame from a 2d array, the values array is the same as the input, in this case order 'C':
In [328]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [329]: df1
Out[329]:
   a  b
0  1  2
1  3  4
2  5  6
In [330]: df1.to_numpy().strides
Out[330]: (16, 8)
Create it with order 'F', and the result is the same as in the first case:
In [332]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2, order="F"),
     ...:                    columns=["a", "b"])
In [333]: df1
Out[333]:
   a  b
0  1  4
1  2  5
2  3  6
In [334]: df1.to_numpy().strides
Out[334]: (8, 24)
melt
Going back to the frame created from an order C:
In [335]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [336]: df2 = df1.melt()
In [337]: df2
Out[337]:
  variable  value
0        a      1
1        a      3
2        a      5
3        b      2
4        b      4
5        b      6
Notice how the value column is a vertical concatenation of the 'a' and 'b' columns. This is what the method's documentation examples show. I don't use melt enough to know whether this is a natural interpretation of it or not.
Converting the melted frame back to an array:
In [338]: df2.to_numpy()
Out[338]:
array([['a', 1],
       ['a', 3],
       ['a', 5],
       ['b', 2],
       ['b', 4],
       ['b', 6]], dtype=object)
In [339]: _.strides
Out[339]: (8, 48)
In df1 both columns are int dtype, and can be stored as a 2d array:
In [340]: df1.dtypes
Out[340]:
a    int64
b    int64
dtype: object
df2's columns have different dtypes, object (string) and int, so they are stored as separate arrays. to_numpy constructs an object-dtype array from them, and it is order 'F':
In [341]: df2.dtypes
Out[341]:
variable    object
value        int64
dtype: object
We get a hint of this storage from:
In [352]: df1._mgr
Out[352]:
BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: int64
In [353]: df2._mgr
Out[353]:
BlockManager
Items: Index(['variable', 'value'], dtype='object')
Axis 1: RangeIndex(start=0, stop=6, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 6, dtype: object
NumericBlock: slice(1, 2, 1), 1 x 6, dtype: int64
How a dataframe stores its values is a complex subject, and I have not read a comprehensive description. I've only gathered bits and pieces from experimenting like this.
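If downstream code needs C-contiguity, you can always force a C-order copy after the fact (a minimal sketch, reusing df2 from above):
arr = df2.to_numpy()               # F-contiguous, as shown above
arr_c = np.ascontiguousarray(arr)  # explicit C-order copy
assert arr_c.flags['C_CONTIGUOUS']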

Return a value in Pandas by index row number and column name?

I have a DataFrame where all the index labels are the same string.
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=['a', 'a', 'a'], columns=['A', 'B', 'C'])
>>> df
    A   B   C
a   0   2   3
a   0   4   1
a  10  20  30
Let's say I am trying to access the value in col 'B' at the first row. I am using something like this:
>>> df.iloc[0]['B']
2
Reading the post here, it seems .at is recommended for efficiency. Is there any better way in my example to return the value by index row number and column name?
Try iat with get_indexer:
df.iat[0, df.columns.get_indexer(['B'])[0]]
Out[124]: 2
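Since you are looking up a single label, get_loc is a slightly more direct variant of the same idea (get_indexer is meant for lists of labels):
df.iat[0, df.columns.get_loc('B')]  # 2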

How to replace values in multiple categoricals in a pandas DataFrame

I want to replace certain values in a dataframe containing multiple categoricals.
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
If I apply .replace on a single column, the result is as expected:
>>> df.s1.replace('a', 1)
0    1
1    b
2    c
Name: s1, dtype: object
If I apply the same operation to the whole dataframe, an error is shown (short version):
>>> df.replace('a', 1)
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
During handling of the above exception, another exception occurred:
ValueError: Wrong number of dimensions
If the dataframe contains integers as categories, the following happens:
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
>>> df.replace(1, 3)
  s1 s2
0  3  3
1  2  3
2  3  4
But,
>>> df.replace(1, 2)
ValueError: Wrong number of dimensions
What am I missing?
Without digging, that seems to be buggy to me.
My Work Around
pd.DataFrame.apply with pd.Series.replace
This has the advantage that you don't need to mess with changing any types.
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
df.apply(pd.Series.replace, to_replace=1, value=2)
  s1 s2
0  2  2
1  2  3
2  3  4
Or
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.apply(pd.Series.replace, to_replace='a', value=1)
  s1 s2
0  1  1
1  b  c
2  c  d
cᴏʟᴅsᴘᴇᴇᴅ's Work Around
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.applymap(str).replace('a', 1)
  s1 s2
0  1  1
1  b  c
2  c  d
The reason for this behavior is that each column has a different set of categories:
In [224]: df.s1.cat.categories
Out[224]: Index(['a', 'b', 'c'], dtype='object')
In [225]: df.s2.cat.categories
Out[225]: Index(['a', 'c', 'd'], dtype='object')
So if you replace with a value that is present in both columns' categories, it works:
In [226]: df.replace('d','a')
Out[226]:
  s1 s2
0  a  a
1  b  c
2  c  a
As a solution, you might want to make your columns categorical manually, using:
pd.Categorical(..., categories=[...])
where categories contains all possible values for all columns.
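A minimal sketch of that suggestion (building the union of values is my construction):
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']})
all_cats = sorted(set(df.s1) | set(df.s2))  # ['a', 'b', 'c', 'd']
for col in df.columns:
    df[col] = pd.Categorical(df[col], categories=all_cats)

df.replace('d', 'a')  # now works: both values are categories in every column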

Pandas Groupby and Orderby with Rank and Summary Statistics

I'm looking to use pandas to group, rank, and compute summary statistics on a column of values. Say I have data like this:
df = pd.DataFrame({'g_one': [1, 2, 3, 1, 2, 3],
                   'g_two': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'g_three': [10, 5, 8, 12, 3, 9]})
I'd like to group by g_one and g_two, rank by g_three, and then get averages and other summary statistics for g_three.
I've tried grouping and sorting, but haven't had success ranking the data.
Try this:
df.groupby(['g_one', 'g_two'], as_index=False).mean().sort_values(by='g_three')
Output:
   g_one g_two  g_three
1      2     B      4.0
2      3     C      8.5
0      1     A     11.0
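The line above only aggregates; if you also want the per-group rank of g_three that the question mentions, groupby().rank() provides it, and agg() can produce several statistics at once (a sketch; the column name 'rank' is mine):
df['rank'] = df.groupby(['g_one', 'g_two'])['g_three'].rank(ascending=False)
df.groupby(['g_one', 'g_two'])['g_three'].agg(['mean', 'median', 'min', 'max'])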

Pandas div using index

I sometimes struggle a bit to understand pandas data structures, and it seems to be the case again. Basically, I've got:
a pivot table whose major axis is a serial number
a Series using the same index
I would like to divide each column of my pivot table by the value in the Series, using the index to match the rows. I've tried plenty of combinations... without success so far :/
import pandas as pd
df = pd.DataFrame([['123', 1, 1, 3], ['456', 2, 3, 4], ['123', 4, 5, 6]],
                  columns=['A', 'B', 'C', 'D'])
pt = pd.pivot_table(df, index=['A', 'B'], columns='C', values='D', fill_value=0)
serie = pd.Series([5, 5, 5], index=['123', '678', '345'])
pt.div(serie, axis='index')
But I am only getting NaN. I guess it's because the column names don't match, but that's why I was using index as the axis. Any ideas on what I am doing wrong?
Thanks
You say "using the same index", but they're not the same: pt has a MultiIndex, while serie has only a flat Index:
>>> pt.index
MultiIndex(levels=[[u'123', u'456'], [1, 2, 4]],
           labels=[[0, 0, 1], [0, 2, 1]],
           names=[u'A', u'B'])
And you haven't told the division that you want to align on the A part of the index. You can pass that information using level:
>>> pt.div(serie, level='A', axis='index')
C        1    3    5
A   B
123 1  0.6    0  0.0
    4  0.0    0  1.2
456 2  NaN  NaN  NaN
[3 rows x 3 columns]
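Note the NaN row for '456': alignment on level 'A' finds no divisor because '456' is missing from serie's index. Extending the Series removes it (the divisor value 2 here is illustrative):
serie = pd.Series([5, 5, 5, 2], index=['123', '678', '345', '456'])
pt.div(serie, level='A', axis='index')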
