DataFrame groupby and shifting values based on freq - python

I am trying to groupby my DataFrame and shift some columns by 1 day.
Code for the test df:
import pandas as pd
import datetime as dt
d = {'date': ['202211', '202211', '202211', '202211', '202211', '202212', '202212', '202212', '202212', '202213', '202213', '202213', '202213', '202213'],
     'id': ['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'a', 'b', 'c', 'd', 'e'],
     'price': [1, 1.2, 1.3, 1.5, 1.7, 2, 1.5, 2, 1.1, 2, 1.5, 0.8, 1.3, 1.5],
     'shrs': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df.set_index('date', inplace=True)
df['me'] = df['price'] * df['shrs']
df['rank'] = df.groupby('date')['price'].transform(
    lambda x: pd.qcut(x, 2, labels=range(1, 3), duplicates='drop'))
df['weight'] = df['me'] / df.groupby(['date', 'rank'])['me'].transform('sum')
df['ew'] = 1 / df.groupby(['date', 'rank'])['price'].transform('count')
df.sort_values(['id', 'date'], inplace=True)
print(df)
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100 100.0 1 0.285714 0.333333
2022-01-02 a 2.0 100 200.0 2 0.500000 0.500000
2022-01-03 a 2.0 100 200.0 2 1.000000 1.000000
2022-01-01 b 1.2 100 120.0 1 0.342857 0.333333
2022-01-02 b 1.5 100 150.0 1 0.576923 0.500000
2022-01-03 b 1.5 100 150.0 1 0.294118 0.250000
2022-01-01 c 1.3 100 130.0 1 0.371429 0.333333
2022-01-02 c 2.0 100 200.0 2 0.500000 0.500000
2022-01-03 c 0.8 100 80.0 1 0.156863 0.250000
2022-01-01 d 1.5 100 150.0 2 0.468750 0.500000
2022-01-02 d 1.1 100 110.0 1 0.423077 0.500000
2022-01-03 d 1.3 100 130.0 1 0.254902 0.250000
2022-01-01 e 1.7 100 170.0 2 0.531250 0.500000
2022-01-03 e 1.5 100 150.0 1 0.294118 0.250000
The following code gets me almost what I want. But as my data is not consistent and some days might be skipped (see the observations for id "e"), I cannot simply use shift(1) but need to take the frequency into account as well.
df['rank'] = df.groupby('id')['rank'].shift(1)
df['weight'] = df.groupby('id')['weight'].shift(1)
df['ew'] = df.groupby('id')['ew'].shift(1)
print(df)
results in:
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100 100.0 NaN NaN NaN
2022-01-02 a 2.0 100 200.0 1 0.285714 0.333333
2022-01-03 a 2.0 100 200.0 2 0.500000 0.500000
2022-01-01 b 1.2 100 120.0 NaN NaN NaN
2022-01-02 b 1.5 100 150.0 1 0.342857 0.333333
2022-01-03 b 1.5 100 150.0 1 0.576923 0.500000
2022-01-01 c 1.3 100 130.0 NaN NaN NaN
2022-01-02 c 2.0 100 200.0 1 0.371429 0.333333
2022-01-03 c 0.8 100 80.0 2 0.500000 0.500000
2022-01-01 d 1.5 100 150.0 NaN NaN NaN
2022-01-02 d 1.1 100 110.0 2 0.468750 0.500000
2022-01-03 d 1.3 100 130.0 1 0.423077 0.500000
2022-01-01 e 1.7 100 170.0 NaN NaN NaN
2022-01-03 e 1.5 100 150.0 2 0.531250 0.500000
The desired outcome would be (note the observations for id "e"):
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100 100.0 NaN NaN NaN
2022-01-02 a 2.0 100 200.0 1 0.285714 0.333333
2022-01-03 a 2.0 100 200.0 2 0.500000 0.500000
2022-01-01 b 1.2 100 120.0 NaN NaN NaN
2022-01-02 b 1.5 100 150.0 1 0.342857 0.333333
2022-01-03 b 1.5 100 150.0 1 0.576923 0.500000
2022-01-01 c 1.3 100 130.0 NaN NaN NaN
2022-01-02 c 2.0 100 200.0 1 0.371429 0.333333
2022-01-03 c 0.8 100 80.0 2 0.500000 0.500000
2022-01-01 d 1.5 100 150.0 NaN NaN NaN
2022-01-02 d 1.1 100 110.0 2 0.468750 0.500000
2022-01-03 d 1.3 100 130.0 1 0.423077 0.500000
2022-01-01 e 1.7 100 170.0 NaN NaN NaN
2022-01-03 e 1.5 100 150.0 NaN NaN NaN
I did not manage to simply use freq='d' here. What could be the easiest solution?

One quick and dirty solution that works with your data is to unstack() and stack() the id column to create nulls for the gaps and then drop the nulls at the end:
In [50]: df = df.set_index('id', append=True).unstack().stack(dropna=False).reset_index(level=1).sort_values('id')
In [51]: df['rank'] = df.groupby('id')['rank'].shift(1)
...: df['weight'] = df.groupby('id')['weight'].shift(1)
...: df['ew'] = df.groupby('id')['ew'].shift(1)
In [52]: print(df[df['price'].notnull()])
id price shrs me rank weight ew
date
2022-01-01 a 1.0 100.0 100.0 NaN NaN NaN
2022-01-02 a 2.0 100.0 200.0 1 0.285714 0.333333
2022-01-03 a 2.0 100.0 200.0 2 0.500000 0.500000
2022-01-01 b 1.2 100.0 120.0 NaN NaN NaN
2022-01-02 b 1.5 100.0 150.0 1 0.342857 0.333333
2022-01-03 b 1.5 100.0 150.0 1 0.576923 0.500000
2022-01-01 c 1.3 100.0 130.0 NaN NaN NaN
2022-01-02 c 2.0 100.0 200.0 1 0.371429 0.333333
2022-01-03 c 0.8 100.0 80.0 2 0.500000 0.500000
2022-01-01 d 1.5 100.0 150.0 NaN NaN NaN
2022-01-02 d 1.1 100.0 110.0 2 0.468750 0.500000
2022-01-03 d 1.3 100.0 130.0 1 0.423077 0.500000
2022-01-01 e 1.7 100.0 170.0 NaN NaN NaN
2022-01-03 e 1.5 100.0 150.0 NaN NaN NaN

You can resample on a groupby to generate a daily series:
# Resample to produce daily data for each ID
shifted = df.groupby("id").resample("1D").first().groupby(level=0).shift()
# `shifted` and `df`'s indexes must have the same shape
df = df.set_index("id", append=True).swaplevel()
# Rely on pandas' automatic row alignment to handle the assignment
cols = ["rank", "weight", "ew"]
df[cols] = shifted[cols]
df.reset_index(0, inplace=True)
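For reference, a merge-based sketch (not from either answer above) that avoids reshaping altogether: build a copy of the previous day's rows, move their dates forward one calendar day, and left-join them back on ['date', 'id']. A missing day (id "e" on 2022-01-02) then yields NaN automatically. This assumes the original df from the question, before any of the shifting above:
# Hedged sketch: self-merge on (id, previous calendar day)
cols = ['rank', 'weight', 'ew']
prev = df.reset_index()[['date', 'id'] + cols].copy()
prev['date'] = prev['date'] + pd.Timedelta(days=1)  # these values become "yesterday's"
out = (df.reset_index()
         .drop(columns=cols)
         .merge(prev, on=['date', 'id'], how='left')
         .sort_values(['id', 'date'])
         .set_index('date'))
print(out)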

Related

Pandas: how to prevent df.append() from returning NaN values

I am trying to append the content of one dataframe into another. Here is a basic example of what I am working with:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Id': ['001', '001', '001', '002', '002', '002', '004', '004'],
                    'Date': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-01', '2020-01-02', '2020-01-03', '2020-01-02', '2020-01-03'],
                    'Quantity': [100, 100, 100, 50, 50, 50, 60, 60],
                    'fx': [1, 1, 1, 2, 2, 2, 1, 1],
                    'fy': [1, 1, 1, 3, 3, 3, 1, 1]})
df2 = pd.DataFrame({'Id': ['001', '001', '001', '002', '002', '002', '003'],
                    'Date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-01', '2019-01-02', '2019-01-03', '2019-02-02'],
                    'Quantity': [100, 100, 100, 50, 50, 50, 20]})
Now I want to append the content of df2 to df1, but the issue is that the result contains NaN values here and there:
histo = df1.append(df2)
histo = histo.sort_values('Id')
print(histo)
Id Date Quantity fx fy
0 001 2020-01-01 100 1.0 1.0
1 001 2020-01-02 100 1.0 1.0
2 001 2020-01-03 100 1.0 1.0
0 001 2019-01-01 100 NaN NaN
1 001 2019-01-02 100 NaN NaN
2 001 2019-01-03 100 NaN NaN
3 002 2020-01-01 50 2.0 3.0
4 002 2020-01-02 50 2.0 3.0
5 002 2020-01-03 50 2.0 3.0
3 002 2019-01-01 50 NaN NaN
4 002 2019-01-02 50 NaN NaN
5 002 2019-01-03 50 NaN NaN
6 003 2019-02-02 20 NaN NaN
6 004 2020-01-02 60 1.0 1.0
7 004 2020-01-03 60 1.0 1.0
The output I want to achieve is that for each 'Id', the values of fx and fy stay the same across rows. The result would look like this:
Id Date Quantity fx fy
0 001 2020-01-01 100 1.0 1.0
1 001 2020-01-02 100 1.0 1.0
2 001 2020-01-03 100 1.0 1.0
0 001 2019-01-01 100 1.0 1.0
1 001 2019-01-02 100 1.0 1.0
2 001 2019-01-03 100 1.0 1.0
3 002 2020-01-01 50 2.0 3.0
4 002 2020-01-02 50 2.0 3.0
5 002 2020-01-03 50 2.0 3.0
3 002 2019-01-01 50 2.0 3.0
4 002 2019-01-02 50 2.0 3.0
5 002 2019-01-03 50 2.0 3.0
6 003 2019-02-02 20 2.0 3.0
6 004 2020-01-02 60 1.0 1.0
7 004 2020-01-03 60 1.0 1.0
What can I do to achieve the above output? I cannot find it in the pandas documentation. Thanks.
Use ffill, which forward-fills NaN values with the last non-NaN value seen in a column:
histo = histo.sort_values('Id').ffill()
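Note that this fill is purely positional: Id '003' has no fx/fy values of its own, so it inherits them from the preceding '002' rows, which is exactly what the expected output above shows. If, instead, values should never cross Id boundaries (an assumption about intent, not what the question asks for), a group-wise sketch would be:
# Hedged alternative: fill only within each Id group, so an Id with no
# fx/fy values of its own stays NaN instead of borrowing from a neighbour.
# (pd.concat replaces DataFrame.append, which is removed in pandas 2.0+.)
histo = pd.concat([df1, df2]).sort_values('Id')
cols = ['fx', 'fy']
histo[cols] = histo.groupby('Id')[cols].transform(lambda s: s.ffill().bfill())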

Replace values based on multiple conditions with groupby mean in Pandas

Say I have a dataframe as follows:
df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=6, freq='M'),
                   'value': [3, 3.5, -5, 2, 7, 6.8],
                   'type': ['a', 'a', 'a', 'b', 'b', 'b']})
df['pct'] = df.groupby(['type'])['value'].pct_change()
Output:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -2.428571
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b -0.028571
I want to replace the pct values which are bigger than 0.2 or smaller than -0.2 with the groupby type means.
My attempt to solve this problem: first, replace the "outliers" with the extreme value -999, then replace those by the groupby output. This is what I have done:
df.loc[df['pct'] >= 0.2, 'pct'] = -999
df.loc[df['pct'] <= -0.2, 'pct'] = -999
df["pct"] = df.groupby(['type'])['pct'].transform(lambda x: x.replace(-999, x.mean()))
But obviously this is not the best solution, and the results are not correct:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -499.416667
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b -499.514286
5 2013-06-30 6.8 b -0.028571
The expected result should look like this:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -1.130
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b 1.24
What have I done wrong? Thanks again for your kind help.
Instead of both conditions, you can use Series.between and set the values in pct with GroupBy.transform and mean:
mask = df['pct'].between(-0.2, 0.2)
df.loc[mask, 'pct'] = df.groupby('type')['pct'].transform('mean')
print (df)
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a -1.130952
2 2013-03-31 -5.0 a -2.428571
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b 1.235714
An alternative solution is to use numpy.where:
import numpy as np

mask = df['pct'].between(-0.2, 0.2)
df['pct'] = np.where(mask, df.groupby('type')['pct'].transform('mean'), df['pct'])
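Both variants above replace the values inside the [-0.2, 0.2] band, which is what the sample output shows. If the intent is the literal condition stated in the question (replace values outside that band), a hedged variant is simply to negate the mask; note that the group means are still computed over all pct values, outliers included, which is what gives -1.13 for type a:
mask_out = ~df['pct'].between(-0.2, 0.2)
df.loc[mask_out, 'pct'] = df.groupby('type')['pct'].transform('mean')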

how to concat between columns keeping sequence unchanged in 2 dataframes pandas

I have 2 dataframes and I want to concat them as follows:
df1:
index 394 min FIC-2000 398 min FFC
0 Recycle Gas min 20K20 Compressor min 20k
1 TT date kg/h AT date ..
2 nan 2011-03-02 08:00:00 -20.7 2011-03-02 08:00:00
3 nan 2011-03-02 08:00:10 -27.5 ...
df2:
index Unnamed:0 0 1 .. 394 395 .....
0 Service Prop Prop1 Recycle Gas RecG
the output df3 should be like this:
df3
index Unnamed:0 0 .. 394 395..
0 Service Prop Recycle Gas RecG
1 Recycle Gas min FIC-2000
2 min 20K20
3 TT date kg/h
4 nan 2011-03-02 -20.7
08:00:00
5 nan 2011-03-02 -27.5
08:00:10
I've tried to use this code:
df3 = pd.concat([df1, df2], axis=1)
but this just concatenates on index 394 and the rest of df1 is appended to the end of df2.
Any idea how to do this?
Just change to axis=0.
Consider this:
Input:
>>> df
col1 col2 col3
0 1 4 2
1 2 1 5
2 3 6 319
>>> df_1
col4 col5 col6
0 1 4 12
1 32 12 3
2 3 2 319
>>> df_2
col1 col3 col6
0 12 14 2
1 4 132 3
2 23 22 9
Concat with mismatched column names:
>>> pd.concat([df, df_1], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
Concat with matching column names:
>>> pd.concat([df, df_1, df_2], axis=0)
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 NaN NaN NaN
1 2.0 1.0 5.0 NaN NaN NaN
2 3.0 6.0 319.0 NaN NaN NaN
0 NaN NaN NaN 1.0 4.0 12.0
1 NaN NaN NaN 32.0 12.0 3.0
2 NaN NaN NaN 3.0 2.0 319.0
0 12.0 NaN 14.0 NaN NaN 2.0
1 4.0 NaN 132.0 NaN NaN 3.0
2 23.0 NaN 22.0 NaN NaN 9.0
Concat with matching columns, filling NaNs (analogously you can fill Nones):
>>> pd.concat([df, df_1, df_2], axis=0).fillna(0) #in case you wish to prettify it, maybe in case of strings do .fillna('')
col1 col2 col3 col4 col5 col6
0 1.0 4.0 2.0 0.0 0.0 0.0
1 2.0 1.0 5.0 0.0 0.0 0.0
2 3.0 6.0 319.0 0.0 0.0 0.0
0 0.0 0.0 0.0 1.0 4.0 12.0
1 0.0 0.0 0.0 32.0 12.0 3.0
2 0.0 0.0 0.0 3.0 2.0 319.0
0 12.0 0.0 14.0 0.0 0.0 2.0
1 4.0 0.0 132.0 0.0 0.0 3.0
2 23.0 0.0 22.0 0.0 0.0 9.0
EDIT
Triggered by the conversation with OP in the comment section below.
So you do:
(1) To concat dataframes
df3=pd.concat([df1,df2], axis=0)
(2) To join another dataframe on them:
df5=pd.merge(df3, df4[["FIC", "min"]], on="FIC", how="outer")
(you may want to consider the suffixes parameter if you think it's relevant)
REF: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
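For illustration of that suffixes parameter, a small hedged sketch with made-up frames (the column names here are hypothetical, chosen only to mirror the merge above):
import pandas as pd
# Two frames sharing both the join key and a value column name.
left = pd.DataFrame({'FIC': ['A', 'B'], 'min': [1, 2]})
right = pd.DataFrame({'FIC': ['A', 'C'], 'min': [10, 30]})
# suffixes disambiguates the overlapping 'min' columns:
# the result has columns FIC, min_left, min_right.
merged = pd.merge(left, right, on='FIC', how='outer', suffixes=('_left', '_right'))
print(merged)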

ReArrange Pandas DataFrame date columns in date order

I have a pandas dataframe that summarises sales by calendar month & outputs something like:
Month level_0 UNIQUE_ID 102018 112018 12018 122017 122018 22018 32018 42018 52018 62018 72018 82018 92018
0 SOLD_QUANTITY 01 3692.0 5182.0 3223.0 1292.0 2466.0 2396.0 2242.0 2217.0 3590.0 2593.0 1665.0 3371.0 3069.0
1 SOLD_QUANTITY 011 3.0 6.0 NaN NaN 7.0 5.0 2.0 1.0 5.0 NaN 1.0 1.0 3.0
2 SOLD_QUANTITY 02 370.0 130.0 NaN NaN 200.0 NaN NaN 269.0 202.0 NaN 201.0 125.0 360.0
3 SOLD_QUANTITY 03 2.0 6.0 NaN NaN 2.0 1.0 NaN 6.0 11.0 9.0 2.0 3.0 5.0
4 SOLD_QUANTITY 08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 175.0 NaN NaN
I want to be able to programmatically re-arrange the column headers in ascending date order (e.g. starting 122017, 12018, 22018, ...). It has to be programmatic because the report runs every month over the last 365 days, so each run produces a different list of months.
The column index looks like this:
Index(['level_0', 'UNIQUE_ID', '102018', '112018', '12018', '122017', '122018',
'22018', '32018', '42018', '52018', '62018', '72018', '82018', '92018'],
dtype='object', name='Month')
Use set_index so that only the date columns remain, convert them to datetimes and get the ordering positions with argsort, then change the column order with iloc:
df = df.set_index(['level_0','UNIQUE_ID'])
df = df.iloc[:, pd.to_datetime(df.columns, format='%m%Y').argsort()].reset_index()
print (df)
level_0 UNIQUE_ID 122017 12018 22018 32018 42018 52018 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0 3590.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0 5.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0 202.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0 11.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN NaN
62018 72018 82018 92018 102018 112018 122018
0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN 175.0 NaN NaN NaN NaN NaN
Another idea is to create monthly periods with DatetimeIndex.to_period, which makes it possible to use sort_index:
df = df.set_index(['level_0','UNIQUE_ID'])
df.columns = pd.to_datetime(df.columns, format='%m%Y').to_period('m')
#alternative for convert to datetimes
#df.columns = pd.to_datetime(df.columns, format='%m%Y')
df = df.sort_index(axis=1).reset_index()
print (df)
level_0 UNIQUE_ID 2017-12 2018-01 2018-02 2018-03 2018-04 \
0 SOLD_QUANTITY 1 1292.0 3223.0 2396.0 2242.0 2217.0
1 SOLD_QUANTITY 11 NaN NaN 5.0 2.0 1.0
2 SOLD_QUANTITY 2 NaN NaN NaN NaN 269.0
3 SOLD_QUANTITY 3 NaN NaN 1.0 NaN 6.0
4 SOLD_QUANTITY 8 NaN NaN NaN NaN NaN
2018-05 2018-06 2018-07 2018-08 2018-09 2018-10 2018-11 2018-12
0 3590.0 2593.0 1665.0 3371.0 3069.0 3692.0 5182.0 2466.0
1 5.0 NaN 1.0 1.0 3.0 3.0 6.0 7.0
2 202.0 NaN 201.0 125.0 360.0 370.0 130.0 200.0
3 11.0 9.0 2.0 3.0 5.0 2.0 6.0 2.0
4 NaN NaN 175.0 NaN NaN NaN NaN NaN
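If the rest of the report expects the original '%m%Y'-style string labels, a hedged follow-up (assuming pandas is imported as pd and using the Period columns produced above) is to format them back after sorting:
# Hypothetical follow-up: turn the sorted Period columns back into
# '%m%Y'-style strings, e.g. Period('2018-01') -> '12018'.
df.columns = [f"{c.month}{c.year}" if isinstance(c, pd.Period) else c
              for c in df.columns]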

How to access prior rows within a multiindex Panda dataframe

How do I reach inside a datetime-indexed multilevel DataFrame such as the following? This is downloaded financial data.
The tough part is getting inside the frame and accessing non-adjacent rows of a particular inner level, without explicitly specifying the outer-level date, since I have thousands of such rows.
ABC DEF GHI \
Date STATS
2012-07-19 00:00:00 NaN NaN NaN
investment 4 9 13
price 5 8 1
quantity 12 9 8
So the two formulas I am looking for could be summarized as
X(today row) = quantity(prior row)*price(prior row)
or
X(today row) = quantity(prior row)*price(today)
The difficulty is how to formulate access to those rows using numpy or pandas with a multilevel index, given that the rows are not adjacent.
In the end I would end up with this:
ABC DEF GHI XN
Date STATS
2012-07-19 00:00:00 NaN NaN NaN
investment 4 9 13 X1
price 5 8 1
quantity 12 9 8
2012-07-18 00:00:00 NaN NaN NaN
investment 1 2 3 X2
price 2 3 4
quantity 18 6 7
X1 = (18*2) + (6*3) + (7*4)   (quantity_day_2 * price_day_2 data)
or, for the other formula,
X1 = (18*5) + (6*8) + (7*1)   (quantity_day_2 * price_day_1 data)
Could I use a groupby?
If you need to add the output to the original DataFrame, then it is more complicated:
print (df)
ABC DEF GHI
Date STATS
2012-07-19 NaN NaN NaN
investment 4.0 9.0 13.0
price 5.0 8.0 1.0
quantity 12.0 9.0 8.0
2012-07-18 NaN NaN NaN
investment 1.0 2.0 3.0
price 2.0 3.0 4.0
quantity 18.0 6.0 7.0
2012-07-17 NaN NaN NaN
investment 1.0 2.0 3.0
price 0.0 1.0 4.0
quantity 5.0 1.0 0.0
df.sort_index(inplace=True)
#rename the value in the inner level to 'investment' - aligns the data in the final concat
idx = pd.IndexSlice
p = df.loc[idx[:,'price'],:].rename(index={'price':'investment'})
q = df.loc[idx[:,'quantity'],:].rename(index={'quantity':'investment'})
print (p)
ABC DEF GHI
Date STATS
2012-07-17 investment 0.0 1.0 4.0
2012-07-18 investment 2.0 3.0 4.0
2012-07-19 investment 5.0 8.0 1.0
print (q)
ABC DEF GHI
Date STATS
2012-07-17 investment 5.0 1.0 0.0
2012-07-18 investment 18.0 6.0 7.0
2012-07-19 investment 12.0 9.0 8.0
#multiply and concat to the original df
print (p * q)
ABC DEF GHI
Date STATS
2012-07-17 investment 0.0 1.0 0.0
2012-07-18 investment 36.0 18.0 28.0
2012-07-19 investment 60.0 72.0 8.0
a = (p * q).sum(axis=1).rename('col1')
print (pd.concat([df, a], axis=1))
ABC DEF GHI col1
Date STATS
2012-07-17 NaN NaN NaN NaN
investment 1.0 2.0 3.0 1.0
price 0.0 1.0 4.0 NaN
quantity 5.0 1.0 0.0 NaN
2012-07-18 NaN NaN NaN NaN
investment 1.0 2.0 3.0 82.0
price 2.0 3.0 4.0 NaN
quantity 18.0 6.0 7.0 NaN
2012-07-19 NaN NaN NaN NaN
investment 4.0 9.0 13.0 140.0
price 5.0 8.0 1.0 NaN
quantity 12.0 9.0 8.0 NaN
#shift with a MultiIndex is not supported yet - first create a DatetimeIndex
#with unstack, then shift, and finally reshape back to the original with stack
#multiply and concat to the original df
print (p.unstack().shift(-1, freq='D').stack() * q)
ABC DEF GHI
Date STATS
2012-07-16 investment NaN NaN NaN
2012-07-17 investment 10.0 3.0 0.0
2012-07-18 investment 90.0 48.0 7.0
2012-07-19 investment NaN NaN NaN
b = (p.unstack().shift(-1, freq='D').stack() * q).sum(axis=1).rename('col2')
print (pd.concat([df, b], axis=1))
ABC DEF GHI col2
Date STATS
2012-07-16 investment NaN NaN NaN 0.0
2012-07-17 NaN NaN NaN NaN
investment 1.0 2.0 3.0 13.0
price 0.0 1.0 4.0 NaN
quantity 5.0 1.0 0.0 NaN
2012-07-18 NaN NaN NaN NaN
investment 1.0 2.0 3.0 145.0
price 2.0 3.0 4.0 NaN
quantity 18.0 6.0 7.0 NaN
2012-07-19 NaN NaN NaN NaN
investment 4.0 9.0 13.0 0.0
price 5.0 8.0 1.0 NaN
quantity 12.0 9.0 8.0 NaN
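The key trick in that answer is the unstack -> shift(freq) -> stack detour, needed because shift with a freq cannot be applied directly to this (Date, STATS) MultiIndex. A minimal, self-contained sketch of just that idiom (toy values, names assumed only for illustration):
import pandas as pd
# Toy (Date, STATS) frame holding only the 'price' rows.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2012-07-17', '2012-07-18', '2012-07-19']), ['price']],
    names=['Date', 'STATS'])
p = pd.DataFrame({'ABC': [0.0, 2.0, 5.0]}, index=idx)
# Move STATS into the columns so the index is a plain DatetimeIndex,
# shift the dates back by one calendar day, then stack STATS back in.
shifted = p.unstack().shift(-1, freq='D').stack()
print(shifted)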
You can use:
#add new datetime with data for better testing
print (df)
ABC DEF GHI
Date STATS
2012-07-19 NaN NaN NaN
investment 4.0 9.0 13.0
price 5.0 8.0 1.0
quantity 12.0 9.0 8.0
2012-07-18 NaN NaN NaN
investment 1.0 2.0 3.0
price 2.0 3.0 4.0
quantity 18.0 6.0 7.0
2012-07-17 NaN NaN NaN
investment 1.0 2.0 3.0
price 0.0 1.0 4.0
quantity 5.0 1.0 0.0
#lexsort the MultiIndex
df.sort_index(inplace=True)
#select the data and drop the last index level, because:
#1. we need to shift
#2. it is easier to work with
idx = pd.IndexSlice
p = df.loc[idx[:,'price'],:]
p.index = p.index.droplevel(-1)
q = df.loc[idx[:,'quantity'],:]
q.index = q.index.droplevel(-1)
print (p)
ABC DEF GHI
Date
2012-07-17 0.0 1.0 4.0
2012-07-18 2.0 3.0 4.0
2012-07-19 5.0 8.0 1.0
print (q)
ABC DEF GHI
Date
2012-07-17 5.0 1.0 0.0
2012-07-18 18.0 6.0 7.0
2012-07-19 12.0 9.0 8.0
print (p * q)
ABC DEF GHI
Date
2012-07-17 0.0 1.0 0.0
2012-07-18 36.0 18.0 28.0
2012-07-19 60.0 72.0 8.0
print ((p * q).sum(axis=1).to_frame().rename(columns={0:'col1'}))
col1
Date
2012-07-17 1.0
2012-07-18 82.0
2012-07-19 140.0
#shift the dates by -1 day, because the df is lexsorted in ascending date order
print (p.shift(-1, freq='D') * q)
ABC DEF GHI
Date
2012-07-16 NaN NaN NaN
2012-07-17 10.0 3.0 0.0
2012-07-18 90.0 48.0 7.0
2012-07-19 NaN NaN NaN
print ((p.shift(-1, freq='D') * q).sum(axis=1).to_frame().rename(columns={0:'col2'}))
col2
Date
2012-07-16 0.0
2012-07-17 13.0
2012-07-18 145.0
2012-07-19 0.0
