How to subtract buy/sell rows for each group in dataframe - python

I have a dataframe that looks like this:
 symbol   side    min     max   mean    wav
1000038    buy  0.931  1.0162  0.977  0.992
1000038   sell  0.932  1.0173  0.978  0.995
1000039    buy  0.881  1.00    0.99   0.995
1000039   sell  0.885  1.025   0.995  1.001
What is the most pythonic (efficient) way of generating a new dataframe consisting of the differences between the buys and the sells of each symbol?
For example, for symbol 1000038, the difference between the min sell and the min buy is (0.932 - 0.931) = 0.001.
I am seeking a method that avoids looping through the dataframe rows, as I believe that would be inefficient. I am instead looking for a grouping-type solution.
I have tried something like this:
df1 = stats[['symbol','side']].join(stats[['mean','wav']].diff(-1))
df2 = df1[df1['side']=='sell']
print(df2)
but it does not seem to work as expected.

You could use the pandas MultiIndex. First, set up the data:
import pandas as pd
columns = ('symbol', 'side', 'min', 'max', 'mean', 'wav')
data = [
    (1000038, 'buy', 0.931, 1.0162, 0.977, 0.992),
    (1000038, 'sell', 0.932, 1.0173, 0.978, 0.995),
    (1000039, 'buy', 0.881, 1.00, 0.99, 0.995),
    (1000039, 'sell', 0.885, 1.025, 0.995, 1.001),
]
df = pd.DataFrame(data = data, columns = columns)
Then, create the index and compute the difference between two data frames:
df2 = df.set_index(['side', 'symbol'], verify_integrity=True)
df2 = df2.sort_index()
df2.loc[('buy',), :] - df2.loc[('sell',), :]
The result is:
            min     max   mean    wav
symbol
1000038  -0.001 -0.0011 -0.001 -0.003
1000039  -0.004 -0.0250 -0.005 -0.006
I'm assuming that each symbol (like 1000038) appears twice. You could use fillna() if you have un-matched buys and sells.
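For example, a minimal sketch of one way to handle unmatched rows, reusing the df2 built above (whether to fillna(0) or dropna() is up to you):
buys = df2.loc[('buy',), :]
sells = df2.loc[('sell',), :]
# Subtraction aligns on the symbol index, so a symbol present on only one side becomes NaN;
# fillna(0) (or dropna()) then decides how those unmatched rows are treated.
result = (buys - sells).fillna(0)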

If needed, start with drop_duplicates and sort_values to make sure each symbol only has 1xbuy and 1xsell (in that order):
df = df.drop_duplicates(['symbol', 'side']).sort_values(['symbol', 'side'])
Then use either xs (faster) or groupby.diff for the group subtractions.
xs
Set the index to ['side', 'symbol'] and use xs to get cross-sections for buy and sell:
df.set_index(['side', 'symbol']).pipe(lambda df: df.xs('sell') - df.xs('buy'))
# min max mean wav
# symbol
# 1000038 0.001 0.0011 0.001 0.003
# 1000039 0.004 0.0250 0.005 0.006
groupby.diff
Set the index to symbol and subtract the groups using groupby.diff:
df.drop(columns='side').set_index('symbol').groupby('symbol').diff().dropna()
# min max mean wav
# symbol
# 1000038 0.001 0.0011 0.001 0.003
# 1000039 0.004 0.0250 0.005 0.006
- To flip the subtraction order, use diff(-1).
- If your version throws an error with groupby('symbol'), use groupby(level=0).
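A minimal sketch of that fallback, grouping on the index level by position instead of by name:
df.drop(columns='side').set_index('symbol').groupby(level=0).diff().dropna()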

Related

New pandas version: how to groupby all columns with different aggregation statistics

I have a df that looks like this:
time volts1 volts2
0 0.000 -0.299072 0.427551
2 0.001 -0.299377 0.427551
4 0.002 -0.298767 0.427551
6 0.003 -0.298767 0.422974
8 0.004 -0.298767 0.422058
10 0.005 -0.298462 0.422363
12 0.006 -0.298767 0.422668
14 0.007 -0.298462 0.422363
16 0.008 -0.301208 0.420227
18 0.009 -0.303345 0.418091
In actuality, the df has >50 columns, but for simplicity, I'm just showing 3.
I want to groupby this df every n rows, let's say 5. I want to aggregate time with max and aggregate the rest of the columns by mean. Because there are so many columns, I'd love to be able to loop this and not have to do it manually.
I know I can do something like this where I go through and create all new columns manually:
df.groupby(df.index // 5).agg(time=('time', 'max'),
                              volts1=('volts1', 'mean'),
                              volts2=('volts2', 'mean'),
                              ...
                              )
but because there are so many columns, I want to do this in a loop, something like:
df.groupby(df.index // 5).agg(time=('time', 'max'),
                              # df.time is always the first column
                              [i for i in df.columns[1:]]=(i, 'mean'),
                              )
If useful:
print(pd.__version__)
1.0.5
You can use a dictionary:
d = {col: "mean" if not col=='time' else "max" for col in df.columns}
#{'time': 'max', 'volts1': 'mean', 'volts2': 'mean'}
df.groupby(df.index // 5).agg(d)
time volts1 volts2
0 0.002 -0.299072 0.427551
1 0.004 -0.298767 0.422516
2 0.007 -0.298564 0.422465
3 0.009 -0.302276 0.419159
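A sketch of the same idea with named aggregation, assuming pandas >= 0.25 (the question uses 1.0.5), unpacking the per-column means as keyword arguments:
agg_kwargs = {col: (col, 'mean') for col in df.columns if col != 'time'}
df.groupby(df.index // 5).agg(time=('time', 'max'), **agg_kwargs)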

groupby rolling agg custom function for portfolio beta

Thanks for reading and in advance for any answers.
Beta is a measure of the systematic risk of an investment portfolio. It is calculated by taking the covariance of that portfolio's returns against the benchmark / market and dividing it by the variance of the market. I'd like to calc this on a rolling basis against many portfolios.
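In symbols: beta = cov(r_portfolio, r_benchmark) / var(r_benchmark).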
I have a df as follows
PERIOD,PORT1,PORT2,BM
201504,-0.004,-0.001,-0.013
201505,0.017,0.019,0.022
201506,-0.027,-0.037,-0.039
201507,0.026,0.033,0.017
201508,-0.045,-0.054,-0.081
201509,-0.033,-0.026,-0.032
201510,0.053,0.07,0.09
201511,0.03,0.032,0.038
201512,-0.05,-0.034,-0.044
201601,-0.016,-0.043,-0.057
201602,-0.007,-0.007,-0.011
201603,0.014,0.014,0.026
201604,0.003,0.001,0.01
201605,0.046,0.038,0.031
Except with many more columns like port1 and port2.
I would like to create a dataset with a rolling beta vs the BM column.
I created a similar rolling correlation dataset with
df.rolling(3).corr(df['BM'])
...which took every column in my large set and calced a correlation vs my BM column.
I tried to make a custom function for beta, but I am struggling because it takes two arguments. Below is my custom function and how I got it to work by feeding it two columns of returns.
import numpy as np

def beta(arr1, arr2):
    # ddof=0 gives the population covariance; the [0][1] element of the matrix is the arr1 vs arr2 covariance
    return (np.cov(arr1, arr2, ddof=0)[0][1]) / np.var(arr2)

beta_test = beta(df['PORT1'], df['BM'])
So this helps me find the beta between two columns that I feed in. The question is how to do this for my data above, which has many columns/portfolios, and how to do it on a rolling basis. From what I saw above with the correlation, something like the below should be possible, running each rolling 3-month window of each column against one specified column.
beta_data = df.rolling(3).agg(beta(df['BM']))
Any pointer in the right direction would be appreciated
IIUC, you can set_index on the columns PERIOD and BM, filter the columns with PORT in them (in case you have other columns you don't want to apply the beta function to), then use rolling.apply like:
print (df.set_index(['PERIOD','BM']).filter(like='PORT')
         .rolling(3).apply(lambda x: beta(x, x.index.get_level_values(1)))
         .reset_index())
PERIOD BM PORT1 PORT2
0 201504 -0.013 NaN NaN
1 201505 0.022 NaN NaN
2 201506 -0.039 0.714514 0.898613
3 201507 0.017 0.814734 1.055798
4 201508 -0.081 0.736486 0.907336
5 201509 -0.032 0.724490 0.887755
6 201510 0.090 0.598332 0.736964
7 201511 0.038 0.715848 0.789221
8 201512 -0.044 0.787248 0.778703
9 201601 -0.057 0.658877 0.794949
10 201602 -0.011 0.412270 0.789567
11 201603 0.026 0.354829 0.690573
12 201604 0.010 0.562924 0.558083
13 201605 0.031 1.716066 1.530471
Another approach is to scale the benchmark by its rolling variance and take a rolling covariance of every column against it:
def getbetas(df, market, window=45):
    """ given an unstacked pandas dataframe (columns instruments, rows
    dates), compute the rolling betas vs the market.
    """
    # Dividing the market by its rolling variance means the rolling covariance
    # with it plays the role of cov(portfolio, market) / var(market).
    nmarket = market / market.rolling(window).var()
    thebetas = df.rolling(window).cov(other=nmarket)
    return thebetas
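A hypothetical call against the example data from the question (window=3 to match the rolling window used above):
rolling_betas = getbetas(df.filter(like='PORT'), df['BM'], window=3)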

Is there a simpler way to merge results of describe() from multiple chunks of a DataFrame?

I am working with large csv files. Since I cannot import a whole csv file into a dataframe at once due to memory limitations, I am using chunks to process the data.
df = pd.read_csv(filepath, chunksize=chunksize)
for chunk in df:
    print(chunk['col2'].describe())
This gives me the stats for each chunk. Is there a way to merge the results from each chunk.describe() call so that I can get the stats for all the data at once?
The only way I can think of right now is to maintain a dictionary to store the stats and update with each iteration.
EDITED:
I got to playing around with this a little. I am new so take this with a grain of salt:
Load a sample from a remote source:
import pandas as pd

df1_iter = pd.read_csv("https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv",
                       chunksize=5,
                       iterator=True)
Do a simple for loop that applies .describe() and .T to each chunk and appends the result to a list, then use pd.concat() on df_list:
df_list = []
for chunk in df1_iter:
    df_list.append(chunk.describe().T)

df_concat = pd.concat(df_list)
Groupby
For the agg I used functions I thought would be useful; adjust as needed.
desc_df = df_concat.groupby(df_concat.index).agg(
    {
        'mean': 'mean',
        'std': 'std',
        'min': 'min',
        '25%': 'mean',
        '50%': 'mean',
        '75%': 'mean',
        'max': 'max'
    }
)
print(desc_df)
mean std min 25% 50% 75% max
am 0.433333 0.223607 0.000 0.333333 0.500000 0.500000 1.000
carb 3.100000 1.293135 1.000 2.250000 2.666667 4.083333 8.000
cyl 6.200000 0.636339 4.000 5.500000 6.000000 7.166667 8.000
disp 232.336667 40.954447 71.100 177.216667 195.233333 281.966667 472.000
drat 3.622833 0.161794 2.760 3.340417 3.649167 3.849583 4.930
gear 3.783333 0.239882 3.000 3.541667 3.916667 3.958333 5.000
hp 158.733333 44.053017 52.000 124.416667 139.333333 191.083333 335.000
mpg 19.753333 2.968229 10.400 16.583333 20.950000 23.133333 33.900
qsec 17.747000 0.868257 14.500 16.948333 17.808333 18.248333 22.900
vs 0.450000 0.102315 0.000 0.208333 0.416667 0.625000 1.000
wt 3.266900 0.598493 1.513 2.850417 3.042500 3.809583 5.424
I hope this was helpful.
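Note that aggregating per-chunk stds and percentiles like this is only approximate. A rough sketch (not part of the answer above) of getting the exact count, mean and std for one column across all chunks, reusing the filepath and chunksize variables from the question, is to keep running totals:
import numpy as np
import pandas as pd

count = total = total_sq = 0.0
for chunk in pd.read_csv(filepath, chunksize=chunksize):
    col = chunk['col2'].dropna()
    count += col.size
    total += col.sum()
    total_sq += (col ** 2).sum()

mean = total / count
std = np.sqrt((total_sq - count * mean ** 2) / (count - 1))  # sample std, as in describe()
Exact percentiles, however, cannot be recovered from per-chunk summaries alone.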

Python (pandas or other) - I need to "flatten" a data file from many rows, few columns to 1 row, many columns

I need to "flatten" a data file from many rows, few columns to 1 row many columns.
I currently have a dataframe in pandas (loaded from Excel) and ultimately need to change the way the data is displayed so I can accumulate large amounts of data in a logical manner. The below tables are an attempt at illustrating my requirements.
From:
             1      2
Ryan     0.706  0.071
Chad     0.151  0.831
Stephen  0.750  0.653
To:
1_Ryan 1_Chad 1_Stephen 2_Ryan 2_Chad 2_Stephen
0.706 0.151 0.75 0.071 0.831 0.653
Thank you for any assistance!
One line, for fun
df.unstack().pipe(
lambda s: pd.DataFrame([s.values], columns=s.index.map('{0[0]}_{0[1]}'.format))
)
1_Ryan 1_Chad 1_Stephen 2_Ryan 2_Chad 2_Stephen
0 0.706 0.151 0.75 0.071 0.831 0.653
Let's use stack, swaplevel, to_frame, and T:
df_out = df.stack().swaplevel(1,0).to_frame().T.sort_index(axis=1)
Or better yet (using @piRSquared's unstack solution):
df_out = df.unstack().to_frame().T
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out
Output:
1_Chad 1_Ryan 1_Stephen 2_Chad 2_Ryan 2_Stephen
0 0.151 0.706 0.75 0.831 0.071 0.653

How to barplot Pandas dataframe columns aligning by sub-index?

I have a pandas dataframe df that contains two stocks' financial ratio data:
>>> df
ROIC ROE
STK_ID RPT_Date
600141 20110331 0.012 0.022
20110630 0.031 0.063
20110930 0.048 0.103
20111231 0.063 0.122
20120331 0.017 0.033
20120630 0.032 0.077
20120930 0.050 0.120
600809 20110331 0.536 0.218
20110630 0.734 0.278
20110930 0.806 0.293
20111231 1.679 0.313
20120331 0.666 0.165
20120630 1.039 0.257
20120930 1.287 0.359
And I try to plot the ratio 'ROIC' & 'ROE' of stock '600141' & '600809' together on the same 'RPT_Date' to benchmark their performance.
df.plot(kind='bar') draws '600141' on the left side and '600809' on the right side, which makes it somewhat inconvenient to compare the 'ROIC' & 'ROE' of the two stocks on the same report date 'RPT_Date'.
What I want is to put the 'ROIC' & 'ROE' bars indexed by the same 'RPT_Date' side by side in the same group (4 bars per group), with the x-axis labelling only the 'RPT_Date'; that would clearly show the difference between the two stocks.
How can I do that?
And if I use df.plot(kind='line'), it only shows two lines, but there should be four lines (2 stocks * 2 ratios). Is this a bug, or what can I do to correct it? Thanks.
I am using Pandas 0.8.1.
If you unstack STK_ID, you can create side by side plots per RPT_Date.
In [55]: dfu = df.unstack("STK_ID")
In [56]: fig, axes = subplots(2,1)
In [57]: dfu.plot(ax=axes[0], kind="bar")
Out[57]: <matplotlib.axes.AxesSubplot at 0xb53070c>
In [58]: dfu.plot(ax=axes[1])
Out[58]: <matplotlib.axes.AxesSubplot at 0xb60e8cc>
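For reference, a self-contained sketch of the same idea outside an IPython --pylab session (matplotlib imported explicitly):
import matplotlib.pyplot as plt

dfu = df.unstack("STK_ID")           # columns become (ratio, STK_ID) pairs, index is RPT_Date
fig, axes = plt.subplots(2, 1)
dfu.plot(ax=axes[0], kind="bar")     # grouped bars per RPT_Date
dfu.plot(ax=axes[1])                 # four lines: 2 stocks * 2 ratios
plt.show()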
