python groupby calculate ratio - python

I have some simple code that does a multi-groupby (first on the date column, second on the cp_flag column) and calculates an aggregated sum for each cp_flag per day.
df.groupby(['date', 'cp_flag']).volume.sum()
I would like to calculate the ratio between C and P (e.g. for 2015-01-02, return 170381/366072) without using .apply, .transform or .agg if possible. I can't quite figure out how to extend my current code to achieve this ratio calculation.
Edit:
The desired output would just be an individual series with the C/P ratio for each date, e.g.
2015-01-02 0.465
...
2020-12-31 0.309
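One way to get there without .apply, .transform or .agg (a minimal sketch, assuming the frame has date, cp_flag and volume columns as above and that cp_flag takes the values 'C' and 'P'):
summed = df.groupby(['date', 'cp_flag']).volume.sum()
# Move the cp_flag level into columns, then divide column-wise;
# the result is one C/P ratio per date.
by_flag = summed.unstack('cp_flag')
cp_ratio = by_flag['C'] / by_flag['P']  # e.g. 170381 / 366072 ≈ 0.465 for 2015-01-02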

Related

pandas: calculate the daily average, grouped by label

I want to create a graph with one line per label, so in the example picture each line represents a distinct label.
The data looks something like this, where the x-axis is the datetime and the y-axis is the count.
datetime, count, label
1656140642, 12, A
1656140643, 20, B
1656140645, 11, A
1656140676, 1, B
Because I have a lot of data, I want to aggregate it by 1 hour or even 1 day chunks.
I'm able to generate the above picture with
# df is dataframe here, result from pandas.read_csv
df.set_index("datetime").groupby("label")["count"].plot
and I can get a time-range average with
df.set_index("datetime").groupby(pd.Grouper(freq='2min')).mean().plot()
but I'm unable to get both rules applied. Can someone point me in the right direction?
You can use the .pivot function (see the documentation) to create a convenient structure where datetime is the index and the different labels are the columns, with count as the values.
df.set_index('datetime').pivot(columns='label', values='count')
output:
label A B
datetime
1656140642 12.0 NaN
1656140643 NaN 20.0
1656140645 11.0 NaN
1656140676 NaN 1.0
Now that you have your data in this format, you can perform a simple aggregation over the index (with groupby, resample, or whatever suits you) and it will be applied to each column separately. Plotting the result then draws a different line for each column, as in the sketch below.
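A sketch of that, assuming the datetime column holds epoch seconds as in the sample data (the unit='s' conversion and the hourly frequency are assumptions):
import pandas as pd

pivoted = (df.assign(datetime=pd.to_datetime(df['datetime'], unit='s'))
             .set_index('datetime')
             .pivot(columns='label', values='count'))
# Aggregate each label column into hourly means, then plot one line per label.
pivoted.resample('1H').mean().plot()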

Python dataframe: Standard deviation of last one year of data

I have a dataframe df with 10 years of daily stock market data, with columns Date, Open, Close.
I want to calculate the daily standard deviation of the close price. For this the mathematical formula is:
Step1: Calculate the daily interday change of the Close
Step2: Next, calculate the daily standard deviation of the daily interday change (calculated from Step1) for the last 1 year of data
Presently, I have figured out Step1 as per the code below. The column Interday_Close_change calculates the difference between each row and the value one day ago.
df = pd.DataFrame(data, columns=columns)
df['Close_float'] = df['Close'].astype(float)
df['Interday_Close_change'] = df['Close_float'].diff()
df.fillna('', inplace=True)
Questions:
(a) How do I obtain a column Daily_SD that holds the standard deviation of the last 252 days (1 year of trading days)? In Excel, we have the formula STDEV.S() to do this.
(b) The Daily_SD should begin on the 252nd row of the data, since that is when there are 252 data points to calculate from. How do I achieve this?
It looks like you are trying to calculate a rolling standard deviation, with the rolling window consisting of previous 252 rows.
Pandas has many .rolling() methods, including one for standard deviation:
df['Daily_SD'] = df['Interday_Close_change'].rolling(252).std().shift()
If there is less than 252 rows available from which to calculate the standard deviation, the result for the row will be a null value (NaN). Think about whether you really want to apply the .fillna('') method to fill null values, as you are doing. That will convert the entire column from a numeric (float) data type to object data type.
Without the .shift() method, the current row's value will be included in calculations. The .shift() method will shift all rolling standard deviation values down by 1 row, so the current row's result will be the standard deviation of the previous 252 rows, as you want.
With pandas version >= 1.2 you can use this instead:
df['Daily_SD'] = df['Interday_Close_change'].rolling(252, closed='left').std()
The closed='left' parameter excludes the last point in the window (the current row) from the calculation.
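A quick check with made-up data (the Series below is purely hypothetical) that the two variants agree:
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(500))
shifted = s.rolling(252).std().shift()
left_closed = s.rolling(252, closed='left').std()
print(np.allclose(shifted.dropna(), left_closed.dropna()))  # True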

Plotting a cumulative sum with groupby in pandas

I'm missing something really obvious or simply doing this wrong. I have two dataframes of similar structure and I'm trying to plot a time-series of the cumulative sum of one column from both. The dataframes are indexed by date:
df1
value
2020-01-01 2435
2020-01-02 12847
...
2020-10-01 34751
The plot should be grouped by month and be a cumulative sum of the whole time range. I've tried:
line1 = df1.groupby(pd.Grouper(freq='1M')).value.cumsum()
line2 = df2.groupby(pd.Grouper(freq='1M')).value.cumsum()
and then plot, but it resets after each month. How can I change this?
I am guessing you want to take the cumulative sum over the whole series first and only then group by month (taking the mean, say, to represent each month's cumulative value). Grouping before cumsum restarts the running total inside every month, which is why your plot resets:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'value': np.random.randint(100, 200, 366)},
                   index=pd.date_range(start='1/1/2018', end='1/1/2019'))
df1.cumsum().groupby(pd.Grouper(freq='1M')).mean().plot()

Normalize data by first value in the group

I have a DataFrame of 6 million rows of intraday data that looks like such:
closingDate Time Last
1997-09-09 11:30:00-04:00 1997-09-09 11:30:00 100
1997-09-09 11:31:00-04:00 1997-09-09 11:31:00 105
I want to normalize my Last column in a vectorized manner by dividing every row by the price on the first row that contains that day. This is my attempt:
df['Last']/df.groupby('closingDate').first()['Last']
The denominator looks like such:
closingDate
1997-09-09 943.25
1997-09-10 942.50
1997-09-11 928.00
1997-09-12 915.75
1997-09-14 933.00
1997-09-15 933.00
However, this division just gives me a column of NaNs. How can I get the division to broadcast across my DateTime index?
Usually, this is a good use case for transform:
df['Last'] /= df.groupby('closingDate')['Last'].transform('first')
The transform result is broadcast back to the original DataFrame (one value per row, aligned on the original index), so the division can now be done element-wise.
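A minimal illustration with made-up prices (two days, two rows each; all values below are hypothetical):
import pandas as pd

df = pd.DataFrame({
    'closingDate': ['1997-09-09', '1997-09-09', '1997-09-10', '1997-09-10'],
    'Last': [100.0, 105.0, 200.0, 210.0],
})
# transform('first') returns a Series aligned to df's row index,
# unlike .first(), which is indexed by closingDate and therefore misaligns.
df['Last_norm'] = df['Last'] / df.groupby('closingDate')['Last'].transform('first')
# Last_norm: 1.00, 1.05, 1.00, 1.05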

Groupby with Apply Method in Pandas : Percentage Sum of Grouped Values

I am trying to develop a program to convert daily data into monthly or yearly data and so on.
I have a DataFrame with datetime index and price change %:
% Percentage
Date
2015-06-02 0.78
2015-06-10 0.32
2015-06-11 0.34
2015-06-12 -0.06
2015-06-15 -0.41
...
I had success grouping by some frequency. Then I tested:
df.groupby('Date').sum()
df.groupby('Date').cumsum()
If a plain sum were what I needed this would work fine, but the problem is that percentages have to be compounded: (1 + x0) * (1 + x1) * ... - 1. Then I tried:
def myfunc(values):
    p = 0
    for val in values:
        p = (1 + p) * (1 + val) - 1
    return p

df.groupby('Date').apply(myfunc)
I can't understand how apply() works. It seems to apply my function to all the data and not just to the grouped items.
Your apply is being applied to each row individually because you're grouping by the Date column, which appears to have a unique value on every row, so each group contains only one row. You need to use a Grouper to group by month, then use cumprod and take the last value for each group:
# make sure Date is a datetime
df["Date"] = pd.to_datetime(df["Date"])
# add one to percentages
df["% Percentage"] += 1
# use cumprod on each month group, take the last value, and subtract 1
df.groupby(pd.Grouper(key="Date", freq="M"))["% Percentage"].apply(lambda g: g.cumprod().iloc[-1] - 1)
Note, though, that this compounds the percentage changes as if the steps between your rows were evenly spaced, but it looks like the gap is sometimes 8 days and sometimes 1 day. You may need to do some clean-up depending on the result you want.
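An equivalent formulation without the lambda, as a sketch (it assumes the +1 adjustment above has already been applied to the column):
df.groupby(pd.Grouper(key="Date", freq="M"))["% Percentage"].prod() - 1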
