Why is the result of pandas' melt Fortran-contiguous and not C-contiguous?

I ran into some pandas melt behavior that undermines my mental model of that function and I wonder if somebody could explain why this is sane/logical/desirable behavior.
The following snippet melts down a dataframe and then converts the result into a numpy array. Since I'm melting all columns, I would have expected the result to be similar to what np.ndarray.ravel() would do, i.e., create a 1D view into the data and add a column with the respective column names (var names). However, to my surprise, melt actually makes a copy of the data and reorders it as F-contiguous. Why is F-contiguity a good idea here?
import numpy as np
import pandas as pd

expected_flat = np.arange(100 * 3)
expected_full = expected_flat.reshape(100, 3)
# expected_full is view into flat array
assert expected_full.base is expected_flat
assert expected_flat.flags["C_CONTIGUOUS"]
test_df = pd.DataFrame(
    expected_flat.reshape(100, 3),
    columns=["a", "b", "c"],
)
# test_df, too, is a view into the flat array
reconstructed = test_df.to_numpy()
assert reconstructed.base is expected_flat
flatten_melt = test_df.melt(var_name="col", value_name="foobar")
flatten_melt_numpy = flatten_melt.foobar.to_numpy()
# flatten_melt is NOT a view, and the data has been reordered
assert flatten_melt_numpy.base is not expected_flat
assert not np.allclose(flatten_melt_numpy, expected_flat)
# the confusing part is that the array is now F-contiguous
reconstructed_melt = flatten_melt_numpy.reshape(100, 3, order="F")
assert np.allclose(reconstructed_melt, expected_full)

Construct a frame from a pair of "series":
In [322]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [323]: df
Out[323]:
   a  b
0  1  4
1  2  5
2  3  6
In [324]: arr = df.to_numpy()
In [325]: arr
Out[325]:
array([[1, 4],
       [2, 5],
       [3, 6]])
In [326]: arr.flags
Out[326]:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
...
In [327]: arr.strides
Out[327]: (8, 24)
The resulting array is F_CONTIGUOUS.
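A minimal sketch of why that happens (my own illustration, consistent with the _mgr output shown further down): pandas consolidates same-dtype columns into a single (n_columns, n_rows) block, and to_numpy hands back that block's transpose. Transposing a C-contiguous array yields an F-contiguous view with exactly the strides above:
import numpy as np

block = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)  # (2 cols, 3 rows), C order
arr = block.T                             # shape (3, 2), what to_numpy hands back
assert arr.flags["F_CONTIGUOUS"]
assert arr.strides == (8, 24)             # matches Out[327] for 8-byte ints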
If I make a frame from a 2d array, the values array keeps the order of the input, in this case order 'C':
In [328]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [329]: df1
Out[329]:
   a  b
0  1  2
1  3  4
2  5  6
In [330]: df1.to_numpy().strides
Out[330]: (16, 8)
Create it with order 'F', and the result is the same as in the first case:
In [332]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2, order="F"), columns=[
...: "a", "b"])
In [333]: df1
Out[333]:
   a  b
0  1  4
1  2  5
2  3  6
In [334]: df1.to_numpy().strides
Out[334]: (8, 24)
melt
Going back to the frame created from an order C:
In [335]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [336]: df2 = df1.melt()
In [337]: df2
Out[337]:
  variable  value
0        a      1
1        a      3
2        a      5
3        b      2
4        b      4
5        b      6
Notice how the value column is a vertical concatenation of the 'a' and 'b' columns. This is what the method examples show. I don't use pivot enough to know if this is a natural interpretation of that or not.
Converted to numpy, the melted frame again comes out order 'F':
In [338]: df2.to_numpy()
Out[338]:
array([['a', 1],
       ['a', 3],
       ['a', 5],
       ['b', 2],
       ['b', 4],
       ['b', 6]], dtype=object)
In [339]: _.strides
Out[339]: (8, 48)
In df1 both columns are int dtype, and can be stored as a 2d array:
In [340]: df1.dtypes
Out[340]:
a int64
b int64
dtype: object
df2's columns have different dtypes, object (string) and int, so they are stored as separate arrays. to_numpy constructs an object dtype array from them, but it is order 'F':
In [341]: df2.dtypes
Out[341]:
variable object
value int64
dtype: object
We get a hint of this storage from:
In [352]: df1._mgr
Out[352]:
BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: int64
In [353]: df2._mgr
Out[353]:
BlockManager
Items: Index(['variable', 'value'], dtype='object')
Axis 1: RangeIndex(start=0, stop=6, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 6, dtype: object
NumericBlock: slice(1, 2, 1), 1 x 6, dtype: int64
How a dataframe stores its values is a complex subject, and I have not read a comprehensive description. I've only gathered bits and pieces from experimenting like this.
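To round this off, np.shares_memory is a more direct view-versus-copy test than comparing .base, since .base may point at an intermediate view rather than the original array. A small sketch (the printed results assume a pandas version without copy-on-write, as in the snippets above, where the constructor does not copy a 2d numpy array):
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)
df1 = pd.DataFrame(arr, columns=["a", "b"])
print(np.shares_memory(df1.to_numpy(), arr))  # True: single block, a view
df2 = df1.melt()
print(np.shares_memory(df2.to_numpy(), arr))  # False: melt copied the data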

Related

Access segment of row with correct dtype

I have a dataframe with a few different types in it. For example:
df = pd.DataFrame({'A': ['A', 'B', 'C', 'D'],
                   'B': np.random.randint(10, size=4),
                   'C': np.random.randint(10, size=4),
                   'D': np.random.rand(4),
                   'E': np.random.rand(4)})
The dtypes are
>>> df.dtypes
A object
B int32
C int32
D float64
E float64
dtype: object
I want to be able to extract the values of B and C in a numpy array of dtype np.int32 directly from the third row of df. Seems straightforward:
>>> df.iloc[2][['B', 'C']].to_numpy()
array([9, 9], dtype=object)
This is consistent with the fact that the Series is of type object:
>>> df.iloc[2]
A C
B 9
C 9
D 0.211487
E 0.857848
Name: 2, dtype: object
So maybe I shouldn't get the row first:
>>> df.loc[df.index[2], ['B', 'C']].to_numpy()
array([9, 9], dtype=object)
Still no luck. Of course I can always post-process and do
df.loc[df.index[2], ['B', 'C']].to_numpy().astype(np.int32)
However, is there a way to extract a set of columns of the same dtype with their native dtype into a numpy array using just indexing?
V1
The answer was, of course, to go in the opposite direction: extract the columns with a consistent dtype first, so that the row comes from a single contiguous block:
>>> df[['B', 'C']].iloc[2].to_numpy()
array([9, 9])
Which tells me that I shouldn't be using pandas directly except to load my data to begin with.
V2
It turns out that pd.DataFrame.to_numpy and pd.Series.to_numpy have a dtype argument that you can use to do the conversion. That means that the loc/iloc approaches can work too, although this still requires an additional conversion and a priori knowledge of the dtype:
>>> df.loc[df.index[2], ['B', 'C']].to_numpy(dtype=np.int32)
array([9, 9])
and
>>> df.iloc[2][['B', 'C']].to_numpy(dtype=np.int32)
array([9, 9])
As an addendum to Mad's answer:
In [107]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       4 non-null      object
 1   B       4 non-null      int64
 2   C       4 non-null      int64
 3   D       4 non-null      float64
 4   E       4 non-null      float64
dtypes: float64(2), int64(2), object(1)
memory usage: 288.0+ bytes
I stumbled upon the _mgr attribute, which apparently manages how the data is actually stored. It looks like it tries to group columns of like dtype together, storing the data as (#col, #row) arrays:
In [108]: df._mgr
Out[108]:
BlockManager
Items: Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
FloatBlock: slice(3, 5, 1), 2 x 4, dtype: float64
IntBlock: slice(1, 3, 1), 2 x 4, dtype: int64
ObjectBlock: slice(0, 1, 1), 1 x 4, dtype: object
Selecting the 2 int columns:
In [109]: df[['B','C']]._mgr
Out[109]:
BlockManager
Items: Index(['B', 'C'], dtype='object')
Axis 1: RangeIndex(start=0, stop=4, step=1)
IntBlock: slice(0, 2, 1), 2 x 4, dtype: int64
and hence we get an int dtype array without further arguments:
In [110]: df[['B','C']].values
Out[110]:
array([[5, 0],
       [5, 0],
       [0, 5],
       [9, 9]])
For single-block cases (e.g. all int columns) the values array is (or at least can be) a view of the frame's data, but that doesn't appear to be the case here.
For a single row:
In [116]: df.iloc[2]._mgr
Out[116]:
SingleBlockManager
Items: Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
ObjectBlock: 5 dtype: object
The row selection is a Series, so it can't preserve the mixed dtypes of the dataframe.
But a "multirow" selection is a frame:
In [128]: df.iloc[2][['B','C']].values
Out[128]: array([0, 5], dtype=object)
In [129]: df.iloc[[2]][['B','C']].values
Out[129]: array([[0, 5]])
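To summarize, a short sketch putting the two working approaches side by side (a fresh frame mirroring the one above; the random values are only illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': list('ABCD'),
                   'B': np.random.randint(10, size=4),
                   'C': np.random.randint(10, size=4)})

# 1. Select the homogeneous columns first, then the row: the two int
#    columns form a single block, so the native dtype survives.
print(df[['B', 'C']].iloc[2].to_numpy().dtype)  # int64 (int32 on Windows)

# 2. Or index in any order and force the dtype on conversion:
print(df.iloc[2][['B', 'C']].to_numpy(dtype=np.int64).dtype)  # int64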

The result of dataframe.mean() is incorrect

I am working in Python 2.7. I have a dataframe and I want to get the average of the column called 'c', but only for the rows where the values in another column are equal to some value.
When I execute the code, the mean is unexpected, but when I do the same calculation with the median, the result is correct.
Why is the output of the mean incorrect?
The code is the following:
df = pd.DataFrame(
    np.array([['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
              ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]]),
    columns=['a', 'b', 'c', 'd']
)
df
mean1 = df[df.a == 'A'].c.mean()
mean2 = df[df.a == 'B'].c.mean()
median1 = df[df.a == 'A'].c.median()
median2 = df[df.a == 'B'].c.median()
The output:
df
Out[1]:
   a  b  c    d
0  A  1  2    3
1  A  4  5  nan
2  A  7  8    9
3  B  3  2  nan
4  B  5  6  nan
5  B  5  6  nan
mean1
Out[2]: 86.0
mean2
Out[3]: 88.66666666666667
median1
Out[4]: 5.0
median2
Out[5]: 6.0
It is obvious that the output of the mean is incorrect.
Thanks.
Pandas is doing string concatenation for the "sum" when calculating the mean; this is plain to see from your example frame.
>>> df[df.a == 'B'].c
3 2
4 6
5 6
Name: c, dtype: object
The "sum" of the strings '2', '6' and '6' is the string '266', which pandas then coerces to a number and divides by the count:
>>> '2' + '6' + '6'
'266'
>>> 266 / 3
88.66666666666667
If you look at the dtypes of your DataFrame, you'll notice that all of them are object, even though no single Series contains mixed types. This is due to the declaration of your numpy array: arrays are not meant to contain heterogeneous types, so numpy falls back to a common string dtype, and the DataFrame constructor then stores every column as object dtype holding strings. You can avoid this behavior by passing the constructor a list instead, which can hold differing dtypes with no issues.
df = pd.DataFrame(
    [['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9],
     ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B', 5, 6, np.nan]],
    columns=['a', 'b', 'c', 'd']
)
df[df.a == 'B'].c.mean()
4.666666666666667
In [17]: df.dtypes
Out[17]:
a object
b int64
c int64
d float64
dtype: object
I still can't imagine that this behavior is intended, so I believe it's worth opening an issue report on the pandas development page, but in general, you shouldn't be using object dtype Series for numeric calculations.
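If you are already stuck with the all-object frame, one common repair (a sketch, not the only option) is to coerce the numeric columns with pd.to_numeric before aggregating:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([['A', 1, 2, 3], ['A', 4, 5, np.nan], ['B', 3, 2, np.nan]]),
    columns=['a', 'b', 'c', 'd']
)
# every column holds strings at this point, including the string 'nan'
df[['b', 'c', 'd']] = df[['b', 'c', 'd']].apply(pd.to_numeric, errors='coerce')
print(df.dtypes)                  # b, c, d are numeric now; 'nan' became NaN
print(df[df.a == 'A'].c.mean())   # 3.5, a numeric mean again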

Python Pandas - How to get the (iloc) position of one or more filtered rows in a dataframe

Using this example
df = pd.DataFrame({'letters': ['A', 'B', 'C', 'D', 'E', 'F']},
                  index=[10, 20, 30, 40, 50, 30])
With df.iloc[x] I can get row x of the dataframe. For example,
df.iloc[3]
returns
letters D
Name: 40, dtype: object
When I filter the dataframe like
df2 = df.iloc[1:3]
I get for df2
   letters
20       B
30       C
Now assume that I don't know how the filter was applied, and I need to find the positions of the filtered rows (1 and 2) in the original dataframe.
What's the best way to get the list of positions that lets me access this filtered result via the original dataframe using df.iloc? How do I get the position numbers?
I am looking for the result
[1, 2]
Note: I had a good suggestion,
df.index.get_indexer_for(df2.index)
which doesn't work if the index is not unique; here it returns
Int64Index([1, 2, 5], dtype='int64')
Because we have to incorporate the row values as well if we want to handle cases like df.iloc[[1, 5]], where you'd need to recover position 5 from "30 F", I think the easiest way is to leverage a merge:
In [172]: df.reset_index().reset_index().merge(df.iloc[1:3].reset_index())
Out[172]:
   level_0  index letters
0        1     20       B
1        2     30       C
In [173]: df.reset_index().reset_index().merge(df.iloc[1:3].reset_index())["level_0"].values
Out[173]: array([1, 2], dtype=int64)
In [174]: df.reset_index().reset_index().merge(df.iloc[[1,5]].reset_index())
Out[174]:
   level_0  index letters
0        1     20       B
1        5     30       F
In [175]: df.reset_index().reset_index().merge(df.iloc[[1,5]].reset_index())["level_0"].values
Out[175]: array([1, 5], dtype=int64)
In the case where it's not possible to uniquely recover original positions because of duplicate rows, you'll get all of them:
In [179]: df.iloc[-1, 0] = "C"
In [180]: df.reset_index().reset_index().merge(df.iloc[[1,2]].reset_index())
Out[180]:
   level_0  index letters
0        1     20       B
1        2     30       C
2        5     30       C
In [181]: df.reset_index().reset_index().merge(df.iloc[[1,2]].reset_index())["level_0"].values
Out[181]: array([1, 2, 5], dtype=int64)
but you can decide how you want to drop duplicates after the merge.
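For convenience, the trick wraps naturally into a small helper (a hypothetical function, not pandas API):
import pandas as pd

def iloc_positions(df, df2):
    # Recover the iloc positions of the rows of df2 within df.
    lhs = df.reset_index().reset_index()   # 'level_0' holds the positions
    return lhs.merge(df2.reset_index())["level_0"].values

df = pd.DataFrame({'letters': ['A', 'B', 'C', 'D', 'E', 'F']},
                  index=[10, 20, 30, 40, 50, 30])
print(iloc_positions(df, df.iloc[1:3]))     # [1 2]
print(iloc_positions(df, df.iloc[[1, 5]]))  # [1 5]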

Pandas ValueError when calling apply with axis=1 and setting lists of varying length as cell-value

While calling apply on a pandas dataframe with axis=1, I get a ValueError when trying to set a list as a cell value.
Note: lists in different rows are of varying lengths, and this seems to be the cause, but I am not sure how to overcome it.
import numpy as np
import pandas as pd
data = [{'a': 1, 'b': '3412', 'c': 0}, {'a': 88, 'b': '56\t23', 'c': 1},
        {'a': 45, 'b': '412\t34\t324', 'c': 2}]
df = pd.DataFrame.from_dict(data)
print("df: ")
print(df)
def get_rank_array(ids):
    ids = list(map(int, ids))
    return np.random.randint(0, 10, len(ids))

def get_rank_list(ids):
    ids = list(map(int, ids))
    return np.random.randint(0, 10, len(ids)).tolist()
df['rank'] = df.apply(lambda row: get_rank_array(row['b'].split('\t')), axis=1)
ValueError: could not broadcast input array from shape (2) into shape (3)
df['rank'] = df.apply(lambda row: get_rank_list(row['b'].split('\t')), axis=1)
print("df: ")
print(df)
df:
    a             b  c       rank
0   1          3412  0        [6]
1  88        56\t23  1     [0, 0]
2  45  412\t34\t324  2  [3, 3, 6]
get_rank_list works, but get_rank_array does not, even though both should produce the expected result above.
I understand the (3,) shape comes from the number of columns in the dataframe, and (2,) is from the length of the list after splitting 56\t23 in the second row.
But I do not get the reason behind the error itself.
When
data = [{'a': 45, 'b': '412\t34\t324', 'c': 2},
        {'a': 1, 'b': '3412', 'c': 0}, {'a': 88, 'b': '56\t23', 'c': 1}]
the error occurs with lists too.
Observe -
df.apply(lambda x: [0, 1, 2])
   a  b  c
0  0  0  0
1  1  1  1
2  2  2  2
df.apply(lambda x: [0, 1])
a    [0, 1]
b    [0, 1]
c    [0, 1]
dtype: object
Pandas does two things inside apply:
it special-cases np.arrays and lists, and
it attempts to snap the results into a DataFrame if the shape is compatible.
Note that arrays are special-cased a little differently from lists: if the shape is not compatible, the result for lists is a Series (as you see in the second output above), but for arrays,
df.apply(lambda x: np.array([0, 1, 2]))
a b c
0 0 0 0
1 1 1 1
2 2 2 2
df.apply(lambda x: np.array([0, 1]))
ValueError: Shape of passed values is (3, 2), indices imply (3, 3)
In short, this is a consequence of the pandas internals. For more information, peruse the apply function code on GitHub.
To get your desired output, use a list comprehension and assign the result to df['new']. Don't use apply.
df['new'] = [
    np.random.randint(0, 10, len(x.split('\t'))).tolist() for x in df.b
]
df
    a             b  c        new
0   1          3412  0        [8]
1  88        56\t23  1     [4, 2]
2  45  412\t34\t324  2  [9, 0, 3]
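Another route that sidesteps the shape-snapping entirely (a sketch reusing get_rank_list from the question): operate on the single column as a Series, where map stores each result element-wise without any attempt to broadcast:
df['new'] = df['b'].str.split('\t').map(get_rank_list)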

MultiColumns get lost when indexing and re-indexing

Create some data
cols = pd.MultiIndex.from_product([['what', 'why'], ['me', 'you']])
df = pd.DataFrame(columns=cols)
df.loc[0, :] = [1, 2, 3, 4]
What do we have?
In[8]: df
Out[8]:
  what     why
    me you  me you
0    1   2   3   4
Set one (or more) columns as index:
In[11]: df.set_index(('what', 'me'))
Out[11]:
           what why
            you  me you
(what, me)
1             2   3   4
Let's reset that index:
In[12]: df.set_index(('what', 'me')).reset_index()
Out[12]:
  (what, me) what why
              you   me you
0          1    2    3   4
And in particular,
In[13]: df.set_index(('what', 'me')).reset_index().columns
Out[13]:
MultiIndex(levels=[['what', 'why', ('what', 'me')], ['me', 'you', '']],
           labels=[[2, 0, 1, 1], [2, 1, 0, 1]])
Is there any way to use these (multi) columns as indices without losing the column structure?
