Create some data:
cols = pd.MultiIndex.from_product([['what', 'why'], ['me', 'you']])
df = pd.DataFrame(columns=cols)
df.loc[0, :] = [1, 2, 3, 4]
What do we have?
In [8]: df
Out[8]:
  what     why
    me you  me you
0    1   2   3   4
Set one (or more) columns as index:
In [11]: df.set_index(('what', 'me'))
Out[11]:
           what why
            you  me you
(what, me)
1             2   3   4
Let's reset that index:
In [12]: df.set_index(('what', 'me')).reset_index()
Out[12]:
  (what, me) what why
              you   me you
0          1    2    3   4
And in particular,
In [13]: df.set_index(('what', 'me')).reset_index().columns
Out[13]:
MultiIndex(levels=[['what', 'why', ('what', 'me')], ['me', 'you', '']],
           labels=[[2, 0, 1, 1], [2, 1, 0, 1]])
Is there any way to use these (multi) columns as indices without losing the column structure?
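One workaround, sketched here rather than offered as a canonical answer: after the round trip, rebuild the columns as a clean MultiIndex from the column tuples, unwrapping the tuple-valued entry that reset_index leaves behind:

import pandas as pd

cols = pd.MultiIndex.from_product([['what', 'why'], ['me', 'you']])
df = pd.DataFrame(columns=cols)
df.loc[0, :] = [1, 2, 3, 4]

# round-trip through set_index / reset_index as above, then
# repair the column structure
restored = df.set_index(('what', 'me')).reset_index()
restored.columns = pd.MultiIndex.from_tuples(
    # reset_index leaves the moved column as (('what', 'me'), '');
    # unwrap it back into a plain two-level tuple
    [lvl0 if isinstance(lvl0, tuple) else (lvl0, lvl1)
     for lvl0, lvl1 in restored.columns]
)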
Related
I ran into some pandas melt behavior that undermines my mental model of that function, and I wonder if somebody could explain why this is sane/logical/desirable behavior.
The following snippet melts down a dataframe and then converts the result into a numpy array. Since I'm melting all columns, I would have expected the result to be similar to what np.ndarray.ravel() would do, i.e., create a 1D view into the data and add a column with the respective column names (var names). However, to my surprise, melt actually makes a copy of the data and reorders it as F-contiguous. Why is F-contiguity a good idea here?
expected_flat = np.arange(100*3)
expected_full = expected_flat.reshape(100, 3)
# expected_full is view into flat array
assert expected_full.base is expected_flat
assert expected_flat.flags["C_CONTIGUOUS"]
test_df = pd.DataFrame(
    expected_flat.reshape(100, 3),
    columns=["a", "b", "c"],
)
# test_df, too, is a view into flat array
reconstructed = test_df.to_numpy()
assert reconstructed.base is expected_flat
flatten_melt = test_df.melt(var_name="col", value_name="foobar")
flatten_melt_numpy = flatten_melt.foobar.to_numpy()
# flatten_melt is NOT a view and reordered
assert flatten_melt_numpy.base is not expected_flat
assert not np.allclose(flatten_melt_numpy, expected_flat)
# the confusing part is that the array is now F-contiguous
reconstructed_melt = flatten_melt_numpy.reshape(100, 3, order="F")
assert np.allclose(reconstructed_melt, expected_full)
Construct a frame from a pair of "series":
In [322]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [323]: df
Out[323]:
   a  b
0  1  4
1  2  5
2  3  6
In [324]: arr = df.to_numpy()
In [325]: arr
Out[325]:
array([[1, 4],
       [2, 5],
       [3, 6]])
In [326]: arr.flags
Out[326]:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
...
In [327]: arr.strides
Out[327]: (8, 24)
The resulting array is F_CONTIGUOUS: with 8-byte int64 elements, strides of (8, 24) mean that stepping to the next row moves 8 bytes while stepping to the next column moves 24, so each column is stored contiguously.
If I make a frame from a 2d array, the values keep the memory order of the input, in this case order 'C':
In [328]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [329]: df1
Out[329]:
   a  b
0  1  2
1  3  4
2  5  6
In [330]: df1.to_numpy().strides
Out[330]: (16, 8)
Create it from an order-'F' array, and the result is the same as in the first case:
In [332]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2, order="F"), columns=["a", "b"])
In [333]: df1
Out[333]:
   a  b
0  1  4
1  2  5
2  3  6
In [334]: df1.to_numpy().strides
Out[334]: (8, 24)
melt
Going back to the frame created from an order C:
In [335]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [336]: df2 = df1.melt()
In [337]: df2
Out[337]:
  variable  value
0        a      1
1        a      3
2        a      5
3        b      2
4        b      4
5        b      6
Notice how the value column is a vertical concatenation of the 'a' and 'b' columns. This is what the method's examples show. I don't use pivot enough to know whether this is a natural interpretation of it or not.
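That stacking is exactly a column-major ravel, which can be checked directly (a small sketch re-using the frames above):

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
df2 = df1.melt()

# melt stacks the columns end-to-end: [1, 3, 5, 2, 4, 6],
# which is a column-major (order='F') ravel of the 2-D values
assert (df2["value"].to_numpy() ==
        df1.to_numpy().ravel(order="F")).all()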
And converting the melted frame to an array:
In [338]: df2.to_numpy()
Out[338]:
array([['a', 1],
       ['a', 3],
       ['a', 5],
       ['b', 2],
       ['b', 4],
       ['b', 6]], dtype=object)
In [339]: _.strides
Out[339]: (8, 48)
In df1 both columns are int dtype, and can be stored as a 2d array:
In [340]: df1.dtypes
Out[340]:
a    int64
b    int64
dtype: object
df2's columns have different dtypes, object (string) and int, so they are stored as separate arrays. to_numpy constructs an object-dtype array from them, and it too is order 'F':
In [341]: df2.dtypes
Out[341]:
variable    object
value        int64
dtype: object
We get a hint of this storage from:
In [352]: df1._mgr
Out[352]:
BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: int64
In [353]: df2._mgr
Out[353]:
BlockManager
Items: Index(['variable', 'value'], dtype='object')
Axis 1: RangeIndex(start=0, stop=6, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 6, dtype: object
NumericBlock: slice(1, 2, 1), 1 x 6, dtype: int64
How a dataframe stores its values is a complex subject, and I have not read a comprehensive description. I've only gathered bits and pieces from experimenting like this.
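One way to see the consequence of that block storage, sketched with the homogeneous frame from earlier: to_numpy() hands back the transpose of the 2 x 3 block, so the F-contiguity of the result is just the C-contiguity of the underlying block.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
arr = df.to_numpy()

# the single int64 block is stored as a 2 x 3 (columns x rows)
# C-contiguous array; arr is its transpose, hence F-contiguous
assert arr.flags["F_CONTIGUOUS"]
assert arr.T.flags["C_CONTIGUOUS"]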
Let's assume that I have the following dataframe:
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "nominal": [1, np.nan, 1, 1, np.nan],
    "numeric1": [3, np.nan, np.nan, 7, np.nan],
    "numeric2": [2, 3, np.nan, 2, np.nan],
    "numeric3": [np.nan, 2, np.nan, np.nan, 3],
    "date": [pd.Timestamp(2005, 6, 22), pd.Timestamp(2006, 2, 11),
             pd.Timestamp(2008, 9, 13), pd.Timestamp(2009, 5, 12),
             pd.Timestamp(2010, 5, 9)],
})
As output, I want a dataframe that indicates, for each id and each numeric column, the number of days that have passed since a non-NaN value was last seen in that column. If a column has a value on the corresponding date, or if a column has no value yet at the start of a new id, the value should be 0. This should be computed only for the numeric columns. With that said, the output dataframe should be:
output_df = pd.DataFrame({
    "numeric1_delta": [0, 234, 1179, 0, 362],
    "numeric2_delta": [0, 0, 945, 0, 362],
    "numeric3_delta": [0, 0, 945, 0, 0],
})
Looking forward to your answers!
You can group by the cumulative sum of the non-null mask and then subtract each group's first date:
In [11]: df.numeric1.notnull().cumsum()
Out[11]:
0    1
1    1
2    1
3    2
4    2
Name: numeric1, dtype: int64
In [12]: df.groupby(df.numeric1.notnull().cumsum()).date.transform(lambda x: x.iloc[0])
Out[12]:
0   2005-06-22
1   2005-06-22
2   2005-06-22
3   2009-05-12
4   2009-05-12
Name: date, dtype: datetime64[ns]
In [13]: df.date - df.groupby(df.numeric1.notnull().cumsum()).date.transform(lambda x: x.iloc[0])
Out[13]:
0      0 days
1    234 days
2   1179 days
3      0 days
4    362 days
Name: date, dtype: timedelta64[ns]
For multiple columns:
ncols = [col for col in df.columns if col.startswith("numeric")]
for c in ncols:
    df[c + "_delta"] = df.date - df.groupby(df[c].notnull().cumsum()).date.transform('first')
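If you want plain integer day counts like output_df rather than timedeltas, a small follow-up sketch (assuming the loop above has run) converts with .dt.days:

# convert the timedelta columns to plain integer day counts,
# matching output_df above
for c in ncols:
    df[c + "_delta"] = df[c + "_delta"].dt.days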
What I have now:
import numpy as np
# 1) Read CSV with headers
data = np.genfromtxt("big.csv", delimiter=',', names=True)
# 2) Get absolute values for column in a new ndarray
new_ndarray = np.absolute(data["target_column_name"])
# 3) Append column in new_ndarray to data
# I'm having trouble here. Can't get hstack, concatenate, append, etc., to work.
# 4) Sort by new column and obtain a new ndarray
data.sort(order="target_column_name_abs")
I would like:
- A solution for 3): to be able to add this new "abs" column to the original ndarray, or
- Another approach to be able to sort a csv file by the absolute values of a column.
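For 3) specifically, one possible approach (a sketch using the file and column names from the snippet above) is numpy.lib.recfunctions.append_fields, which returns a new structured array with an extra named field, so the sort in 4) then works as written:

import numpy as np
from numpy.lib import recfunctions as rfn

data = np.genfromtxt("big.csv", delimiter=",", names=True)

# append the absolute values as a new named field on the
# structured array
data = rfn.append_fields(data, "target_column_name_abs",
                         np.absolute(data["target_column_name"]),
                         usemask=False)
# 4) now works, since the field exists
data.sort(order="target_column_name_abs")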
Here is a way to do it.
First, let's create a sample array:
In [39]: a = (np.arange(12).reshape(4, 3) - 6)
In [40]: a
Out[40]:
array([[-6, -5, -4],
       [-3, -2, -1],
       [ 0,  1,  2],
       [ 3,  4,  5]])
OK, let's say
In [41]: col = 1
which is the column we want to sort by,
and here is the sorting code - using Python's sorted:
In [42]: b = sorted(a, key=lambda row: np.abs(row[col]))
Let's convert b from list to array, and we have:
In [43]: np.array(b)
Out[43]:
array([[ 0,  1,  2],
       [-3, -2, -1],
       [ 3,  4,  5],
       [-6, -5, -4]])
Which is the array with the rows sorted according to the absolute value of column 1.
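An equivalent, vectorized alternative using argsort (a sketch with the same a and col as above):

import numpy as np

a = np.arange(12).reshape(4, 3) - 6
col = 1
# argsort the absolute values of the chosen column and use that
# row order to index the array
b = a[np.abs(a[:, col]).argsort()]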
Here's a solution using pandas:
In [117]: import pandas as pd
In [118]: df = pd.read_csv('test.csv')
In [119]: df
Out[119]:
   a  b
0  1 -3
1  2  2
2  3 -1
3  4  4
In [120]: df['c'] = abs(df['b'])
In [121]: df
Out[121]:
   a  b  c
0  1 -3  3
1  2  2  2
2  3 -1  1
3  4  4  4
In [122]: df.sort_values(by='c')
Out[122]:
   a  b  c
2  3 -1  1
1  2  2  2
0  1 -3  3
3  4  4  4
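On pandas 1.1 or later (where sort_values grew a key argument), you could skip the helper column entirely; a small sketch:

# sort directly on the absolute values; no extra column needed
df.sort_values(by='b', key=abs)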
Suppose I have a data frame with multiindex column names that looks like this:
          A               B
      '1.5' '2.3' '8.4'  b1
r1        1     2     3   a
r2        4     5     6   b
r3        7     8     9  10
How would I change just the column names under 'A' from strings to floats, without modifying 'b1', to get the following?
          A               B
        1.5   2.3   8.4  b1
r1        1     2     3   a
r2        4     5     6   b
r3        7     8     9  10
In the real use case, under 'A' there would be thousands of columns with names that should be floats (they represent the wavelengths for a spectrometer) and the data in the data frame represents multiple different observations.
Thanks!
# build the DataFrame (sideways at first, then transposed)
arrays = [['A', 'A', 'A', 'B'], ['1.5', '2.3', '8.4', 'b1']]
tuples = list(zip(*arrays))
data1 = np.array([[1, 2, 3, 'a'], [4, 5, 6, 'b'], [7, 8, 9, 10]])
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(data1.T, index=index).T
Printing df.columns gives the existing column names:
Out[84]:
MultiIndex(levels=[[u'A', u'B'], [u'1.5', u'2.3', u'8.4', u'b1']],
           labels=[[0, 0, 0, 1], [0, 1, 2, 3]],
           names=[u'first', u'second'])
Now change the column names:
# make new column titles (probably more pythonic ways to do this)
A_cols = [float(i) for i in df['A'].columns]
B_cols = [i for i in df['B'].columns]
cols = A_cols + B_cols
# set levels
levels = [df.columns.levels[0], cols]
df.columns.set_levels(levels, inplace=True)
This gives the following output:
Out[86]:
MultiIndex(levels=[[u'A', u'B'], [1.5, 2.3, 8.4, u'b1']],
           labels=[[0, 0, 0, 1], [0, 1, 2, 3]],
           names=[u'first', u'second'])
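As an aside, set_levels(..., inplace=True) is deprecated in recent pandas; a non-inplace sketch of the same rename rebuilds the MultiIndex from the column tuples:

# cast the second level to float only where the first level is 'A'
df.columns = pd.MultiIndex.from_tuples(
    [(a, float(b) if a == 'A' else b) for a, b in df.columns],
    names=df.columns.names,
)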
I am sometimes struggling a bit to understand pandas data structures, and it seems to be the case again. Basically, I've got:
- a pivot table whose major axis is a serial number
- a Series using the same index
I would like to divide each column of my pivot table by the value in the Series, using the index to match the lines. I've tried plenty of combinations... without being successful so far :/
import pandas as pd
df = pd.DataFrame([['123', 1, 1, 3], ['456', 2, 3, 4], ['123', 4, 5, 6]], columns=['A', 'B', 'C', 'D'])
pt = pd.pivot_table(df, index=['A', 'B'], columns='C', values='D', fill_value=0)
serie = pd.Series([5, 5, 5], index=['123', '678', '345'])
pt.div(serie, axis='index')
But I am only getting NaN. I guess it's because the column names are not matching, but that's why I was using index as the axis. Any ideas on what I am doing wrong?
Thanks
You say "using the same index", but they're not the same: pt has a MultiIndex, while serie has only a flat index:
>>> pt.index
MultiIndex(levels=[[u'123', u'456'], [1, 2, 4]],
           labels=[[0, 0, 1], [0, 2, 1]],
           names=[u'A', u'B'])
And you haven't told the division that you want to align on the A part of the index. You can pass that information using level:
>>> pt.div(serie, level='A', axis='index')
C        1    3    5
A   B
123 1  0.6    0  0.0
    4  0.0    0  1.2
456 2  NaN  NaN  NaN

[3 rows x 3 columns]
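The remaining NaN row is purely an alignment gap: '456' appears in pt's 'A' level but not in serie's index (['123', '678', '345']). A quick sketch, with serie2 as a hypothetical variant of serie that covers '456' as well:

# serie2 also covers '456', so its row now divides cleanly
# instead of producing NaN
serie2 = pd.Series([5, 5, 5, 5], index=['123', '678', '345', '456'])
pt.div(serie2, level='A', axis='index')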