Normalize data by first value in the group - python

I have a DataFrame of 6 million rows of intraday data that looks like this:
closingDate Time Last
1997-09-09 11:30:00-04:00 1997-09-09 11:30:00 100
1997-09-09 11:31:00-04:00 1997-09-09 11:31:00 105
I want to normalize my Last column in a vectorized manner by dividing every row by the price on the first row of that day. This is my attempt:
df['Last']/df.groupby('closingDate').first()['Last']
The denominator looks like this:
closingDate
1997-09-09 943.25
1997-09-10 942.50
1997-09-11 928.00
1997-09-12 915.75
1997-09-14 933.00
1997-09-15 933.00
However, this division just gives me a column of NaNs. How can I get the division broadcast across my DateTime index?

Usually, this is a good use case for transform:
df['Last'] /= df.groupby('closingDate')['Last'].transform('first')
The groupby result is broadcast back to the shape of the original DataFrame, so the division works element-wise.
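As a minimal sketch of that broadcasting behavior, on made-up numbers (column names follow the question):
import pandas as pd
df = pd.DataFrame({'closingDate': ['1997-09-09', '1997-09-09', '1997-09-10'],
                   'Last': [100.0, 105.0, 200.0]})
first_per_day = df.groupby('closingDate')['Last'].transform('first')  # aligned to df's index: [100.0, 100.0, 200.0]
df['Last_norm'] = df['Last'] / first_per_day                          # [1.00, 1.05, 1.00]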

Related

Python dataframe: Standard deviation of last one year of data

I have a dataframe df with 10 years of daily stock market data, with columns Date, Open, Close.
I want to calculate the daily standard deviation of the close price. The procedure is:
Step 1: Calculate the daily interday change of the Close.
Step 2: Calculate the daily standard deviation of that interday change (from Step 1) over the last 1 year of data.
Presently, I have figured out Step 1 with the code below. The column Interday_Close_change holds the difference between each row and the value one day before.
df = pd.DataFrame(data, columns=columns)
df['Close_float'] = df['Close'].astype(float)
df['Interday_Close_change'] = df['Close_float'].diff()
df.fillna('', inplace=True)
Questions:
(a). How do I obtain a column Daily_SD that holds the standard deviation over the last 252 days (1 year of trading days)? In Excel, the formula STDEV.S() does this.
(b). The Daily_SD should begin on the 252nd row of the data, since that is the first row with 252 datapoints to calculate from. How do I achieve this?
It looks like you are trying to calculate a rolling standard deviation, with the rolling window consisting of the previous 252 rows.
Pandas has many .rolling() methods, including one for standard deviation:
df['Daily_SD'] = df['Interday_Close_change'].rolling(252).std().shift()
If there are fewer than 252 rows available from which to calculate the standard deviation, the result for that row will be a null value (NaN). Think about whether you really want to apply the .fillna('') method to fill null values, as you are doing: it converts the entire column from a numeric (float) dtype to the object dtype.
Without the .shift() method, the current row's value will be included in calculations. The .shift() method will shift all rolling standard deviation values down by 1 row, so the current row's result will be the standard deviation of the previous 252 rows, as you want.
With pandas version >= 1.2 you can use this instead:
df['Daily_SD'] = df['Interday_Close_change'].rolling(252, closed='left').std()
The closed='left' parameter excludes the last point in the window (the current row) from the calculation.
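As a toy illustration (a window of 3 instead of 252, and made-up numbers), both variants leave NaN until enough prior rows exist and then agree:
import pandas as pd
s = pd.Series([1.0, -0.5, 0.25, 2.0, -1.0])  # stand-in for Interday_Close_change
print(s.rolling(3).std().shift())            # NaN until 3 prior rows exist, then the trailing std
print(s.rolling(3, closed='left').std())     # identical result on pandas >= 1.2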

python groupby calculate ratio

I have some simple code that does a multi-groupby (first on the date column, second on the cp_flag column) and calculates an aggregated sum for each cp_flag per day.
df.groupby(['date', 'cp_flag']).volume.sum()
I would like to calculate the ratio between C and P (e.g. for 2015-01-02, return 170381/366072) without using .apply, .transform or .agg if possible. I can't quite figure out how to extend my current code to achieve this ratio calculation.
Edit:
The desired output would just be an individual series with the C/P ratio for each date, e.g.
2015-01-02 0.465
...
2020-12-31 0.309
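One possible approach, avoiding .apply, .transform and .agg, is to unstack the cp_flag level of the grouped result and divide the resulting columns (the labels 'C' and 'P' are taken from the example above); a sketch:
vol = df.groupby(['date', 'cp_flag']).volume.sum().unstack('cp_flag')
ratio = vol['C'] / vol['P']  # one value per date, e.g. 170381 / 366072 ≈ 0.465 for 2015-01-02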

Summing values of (dropped) duplicate rows Pandas DataFrame

For a time series analysis, I have to drop instances that occur on the same date, but keep some of the 'deleted' information and add it to the remaining 'duplicate' instance. Below is a short example of part of my dataset.
z = pd.DataFrame({'lat':[49.125,49.125], 'lon':[-114.125 ,-114.125 ], 'time':[np.datetime64('2005-08-09'),np.datetime64('2005-08-09')], 'duration':[3,6],'size':[4,10]})
lat lon time duration size
0 49.125 -114.125 2005-08-09 3 4
1 49.125 -114.125 2005-08-09 6 10
I would like to drop the (duplicate) instance which has the lowest 'duration' value but at the same time sum the 'size' variables. Output would look like:
lat lon time duration size
0 49.125 -114.125 2005-08-09 6 14
Does anyone know how I would be able to tackle such a problem? Furthermore, for another variable, I would like to take the mean of its values, though I expect the process to be similar to summing them.
Edit: so far I know how to keep the row with the highest duration value using:
z.sort_values(by='duration', ascending=False).drop_duplicates(subset=['lat', 'lon', 'time'], keep='first')
If those are all the columns in your dataframe, you can get your result using a groupby on your time column and passing in your aggregations for each column.
More specifically, you can drop the (duplicate) instance which has the lowest 'duration' by keeping the max() duration, and at the same time sum the 'size' variable by using sum() on your size column.
res = z.groupby('time').agg({'lat': 'first',
                             'lon': 'first',
                             'duration': 'max',
                             'size': 'sum'}).reset_index()
res
time lat lon duration size
0 2005-08-09 49.125 -114.125 6 14
The only difference is that 'time' is now your first column, which you can quickly fix.
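For instance, reordering the columns back (using the column names from the example) could be as simple as:
res = res[['lat', 'lon', 'time', 'duration', 'size']]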
Group by to get the sum, then merge it back onto the df without duplicates:
import pandas as pd
import numpy as np
z = pd.DataFrame({'lat':[49.125,49.125], 'lon':[-114.125 ,-114.125 ], 'time':[np.datetime64('2005-08-09'),np.datetime64('2005-08-09')], 'duration':[3,6],'size':[4,10]}) # original data
gp = z.groupby(['lat', 'lon','time'], as_index=False)[['size']].sum() # getting the sum of 'size' for unique combination of lat, lon, time
df = z.sort_values(by='duration', ascending=True).drop_duplicates(subset=['lat', 'lon','time'], keep='last') # dropping duplicates
pd.merge(df[['lat', 'lon', 'time', 'duration']], gp, on=['lat', 'lon', 'time']) # adding the columns summed onto the df without duplicates
lat lon time duration size
0 49.125 -114.125 2005-08-09 6 14
Another way, based on sophocles' answer:
res = z.sort_values(by='duration', ascending=False).groupby(['time', 'lat', 'lon']).agg({
    'duration': 'first',  # same as 'max' since we've sorted the data by duration DESC
    'size': 'sum'})
This one could become less readable if you have several columns you want to keep (you'd end up with a lot of 'first' entries in the agg dictionary).
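For the other variable the question mentions (taking a mean rather than a sum), the same agg dictionary can presumably be extended; a sketch with a hypothetical extra column other_col:
z['other_col'] = [1.0, 2.0]  # hypothetical column for illustration; not part of the original data
res = z.groupby(['lat', 'lon', 'time'], as_index=False).agg({'duration': 'max',
                                                             'size': 'sum',
                                                             'other_col': 'mean'})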

Find Gaps in a Pandas Dataframe

I have a DataFrame which has a column for minutes and a correlated value. The frequency is about 79 seconds, but sometimes there is missing data for a period (no rows at all). I want to detect whether there is a gap of 25 or more minutes and, if so, delete the dataset.
How do I test whether such a gap exists?
The dataframe looks like this:
INDEX minutes data
0 23.000 1.456
1 24.185 1.223
2 27.250 0.931
3 55.700 2.513
4 56.790 1.446
... ... ...
So there is an irregular but short gap and one that exceeds 25 minutes. In this case I want the dataset to be empty.
I am quite new to Python, and especially to Pandas, so an explanation would be helpful for learning.
You can use numpy.roll to create a column with shifted values (i.e. the first value from the original column becomes the second value, the second becomes the third, etc):
import pandas as pd
import numpy as np
df = pd.DataFrame({'minutes': [23.000, 24.185, 27.250, 55.700, 56.790]})
np.roll(df['minutes'], 1)
# output: array([56.79 , 23. , 24.185, 27.25 , 55.7 ])
Add this as a new column to your dataframe and subtract the new column from the original column.
We also drop the first row beforehand, since we don't want the difference between your first timepoint in the original column and your last timepoint, which got rolled to the start of the new column.
Then we just check whether any of the resulting differences exceed your threshold:
df['rolled_minutes'] = np.roll(df['minutes'], 1)
dropped_df = df.drop(index=0)
diff = dropped_df['minutes'] - dropped_df['rolled_minutes']
(diff > 25).any()
# output: True
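As a side note, pandas' built-in .diff() computes the same consecutive differences without the extra column or the dropped row; reusing the df defined above, an equivalent check might look like:
gaps = df['minutes'].diff()  # NaN for the first row, otherwise the difference to the previous row
if (gaps > 25).any():        # True here, because 55.700 - 27.250 > 25
    df = df.iloc[0:0]        # empty the dataset, as described in the question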

linear interpolation between values stored in separate dataframes

I have numbers stored in 2 data frames (the real ones are much bigger) which look like this:
df1
A B C T Z
13/03/2017 1.321674 3.1790 3.774602 30.898 13.22
06/02/2017 1.306358 3.1387 3.712554 30.847 13.36
09/01/2017 1.361103 3.2280 3.738500 32.062 13.75
05/12/2016 1.339258 3.4560 3.548593 31.978 13.81
07/11/2016 1.295137 3.2323 3.188161 31.463 13.43
df2
A B C T Z
13/03/2017 1.320829 3.1530 3.7418 30.933 13.1450
06/02/2017 1.305483 3.1160 3.6839 30.870 13.2985
09/01/2017 1.359989 3.1969 3.7129 32.098 13.6700
05/12/2016 1.338151 3.4215 3.5231 32.035 13.7243
07/11/2016 1.293996 3.2020 3.1681 31.480 13.3587
and a list where I have stored all daily dates from 13/03/2017 to 7/11/2016.
I would like to create a dataframe with the following features:
the list of daily dates is the row index
I would like to create columns (in this case from A to Z) and, for each row/day, compute the linear interpolation between the value in df1 and the corresponding value in df2 shifted by -1. For example, in the row '12/03/2017' for column A I want to compute [(34/35)*1.321674] + [(1/35)*1.305483] = 1.3212114, where 35 is the number of days between 13/03/2017 and 06/02/2017, 1.321674 is the value in df1 for column A on 13/03/2017, and 1.305483 is the value in df2 for column A on 06/02/2017. For 11/03/2017, column A, I want to compute [(33/35)*1.321674] + [(2/35)*1.305483] = 1.3207488. The values 1.321674 and 1.305483 thus stay fixed over the interval down to 06/02/2017, where the result should be 1.305483.
Finally, the linear interpolation should switch to the next pair of values once the row's date enters the next time interval. For example, once I reach 05/02/2017, the interpolation should be between 1.306358 (df1, column A, 06/02/2017) and 1.359989 (df2, column A, 09/01/2017), i.e. everything shifts one position down.
For clarity, date format is 'dd/mm/yyyy'
I would greatly appreciate any piece of advice or suggestion, I am aware it's a lot of work so any hint is valued!
Please let me know if you need more clarification.
Thanks!
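In case a starting point helps, below is a rough sketch of the interpolation described above. It assumes df1 and df2 share the same columns and the same DatetimeIndex, sorted in descending order and already parsed from the 'dd/mm/yyyy' strings (e.g. with pd.to_datetime(..., dayfirst=True)); it keeps the df2 value at each interval boundary, as in the example:
import pandas as pd

def interpolate_between(df1, df2):
    idx = df1.index  # descending: 13/03/2017, 06/02/2017, ...
    rows = {}
    for t_curr, t_next in zip(idx[:-1], idx[1:]):
        span = (t_curr - t_next).days  # e.g. 35 days between 13/03/2017 and 06/02/2017
        for d in pd.date_range(t_next, t_curr, freq='D'):
            if d in rows:  # keep the boundary value already set by the previous (more recent) interval
                continue
            w = (d - t_next).days / span  # 1 at t_curr, 0 at t_next
            rows[d] = w * df1.loc[t_curr] + (1 - w) * df2.loc[t_next]
    return pd.DataFrame(rows).T.sort_index(ascending=False)
With the sample frames above, this would give roughly 1.32121 for column A on 12/03/2017 and 1.305483 on 06/02/2017, matching the worked example.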
