Python: Create DataFrame with hierarchical columns and add columns

I have a DataFrame with a multiindex as follows:
df:
                  open  close
date       Symbol
2022-01-01 SPY     100    102
           TSLA    232    245
2022-01-02 SPY     103    100
           TSLA    222    220
           AAPL    143    147
I want to convert this into a DataFrame with hierarchical columns and add another column df['delta']=df['open']-df['close'] as follows:
df2:
             SPY         TSLA        AAPL
            Open Close   Open Close  Open Close
date
2022-01-01   100   102    232   245   nan   nan
2022-01-02   103   100    222   220   143   147
EDIT: After I get the shape in df2, I want to calculate a third column called delta to get the following:
df:
             SPY                TSLA               AAPL
            Open Close delta   Open Close delta   Open Close delta
date
2022-01-01   100   102    -2    232   245   -13    nan   nan   nan
2022-01-02   103   100     3    222   220     2    143   147    -4
How can this be done? I tried pivoting the DataFrame but it did not work.

You should be able to do it with:
(df.assign(delta=lambda x: x['open'] - x['close'])
   .stack()
   .unstack(level=[1, 2])
)
Output:
Symbol        SPY                  TSLA                  AAPL
             open  close delta    open  close  delta    open   close delta
date
2022-01-01  100.0  102.0  -2.0   232.0  245.0  -13.0     NaN     NaN   NaN
2022-01-02  103.0  100.0   3.0   222.0  220.0    2.0   143.0   147.0  -4.0
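For reference, here is a minimal runnable sketch of the same approach; the DataFrame construction is assumed, since the question only shows the printed frame:
import pandas as pd

# rebuild the sample frame from the question (assumed construction)
idx = pd.MultiIndex.from_tuples(
    [('2022-01-01', 'SPY'), ('2022-01-01', 'TSLA'),
     ('2022-01-02', 'SPY'), ('2022-01-02', 'TSLA'), ('2022-01-02', 'AAPL')],
    names=['date', 'Symbol'])
df = pd.DataFrame({'open': [100, 232, 103, 222, 143],
                   'close': [102, 245, 100, 220, 147]}, index=idx)

# add delta, then push the Symbol level and the column names up into hierarchical columns
df2 = (df.assign(delta=lambda x: x['open'] - x['close'])
         .stack()
         .unstack(level=[1, 2]))
print(df2)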

Related

How to calculate time difference in minutes and populate the dataframe accordingly

I have time series data converted to a dataframe. It has multiple columns: the first column holds timestamps, and the remaining column names are themselves timestamps, with the values underneath.
The dataframe looks like this:
date 2022-01-02 10:20:00 2022-01-02 10:25:00 2022-01-02 10:30:00 2022-01-02 10:35:00 2022-01-02 10:40:00 2022-01-02 10:45:00 2022-01-02 10:50:00 2022-01-02 10:55:00 2022-01-02 11:00:00
2022-01-02 10:30:00 25.5 26.3 26.9 NaN NaN NaN NaN NaN NaN
2022-01-02 10:45:00 60.3 59.3 59.2 58.4 56.9 58.0 NaN NaN NaN
2022-01-02 11:00:00 43.7 43.9 48 48 48.1 48.9 49 49.5 49.5
Note that when the value in the date column matches one of the column names, the values after that intersecting column are NaN.
The dataframe I am trying to achieve is shown below, where the column names are the minutes before the date (40, 35, 30, 25, 20, 15, 10, 5, 0) and the same values are populated accordingly:
For example: 1) 2022-01-02 10:30:00 - 2022-01-02 10:30:00 = 0 mins, hence the corresponding value there should be 26.9. 2) 2022-01-02 10:30:00 - 2022-01-02 10:25:00 = 5 mins, hence the value there should be 26.3, and so on.
Note: values marked with * are dummy values for illustration (the real dataframe has many more columns).
date 40mins 35mins 30mins 25mins 20mins 15mins 10mins 5mins 0mins
2022-01-02 10:30:00 24* 24* 24.8* 24.8* 25* 25* 25.5 26.3 26.9
2022-01-02 10:45:00 59* 58* 60* 60.3 59.3 59.2 58.4 56.9 58.0
2022-01-02 11:00:00 43.7 43.9 48 48 48.1 48.9 49 49.5 49.5
I would highly appreciate some help here. Apologies if I have not framed the question well. Please ask for clarification if needed.
IIUC, you can melt, compute the timedelta and filter, then pivot back:
(df.melt('date', var_name='date2')   # reshape the columns to rows
   # convert the date strings to datetime
   # and compute the timedelta in minutes
   .assign(date=lambda d: pd.to_datetime(d['date']),
           date2=lambda d: pd.to_datetime(d['date2']),
           delta=lambda d: d['date'].sub(d['date2'])
                            .dt.total_seconds().floordiv(60)
           )
   # filter out negative timedeltas
   .loc[lambda d: d['delta'].ge(0)]
   # reshape the rows back to columns (keyword arguments are required in pandas >= 2.0)
   .pivot(index='date', columns='delta', values='value')
   # rename columns from integer to "Xmins"
   .rename(columns=lambda x: f'{x:.0f}mins')
   # remove the columns axis label
   .rename_axis(columns=None)
)
Output:
0mins 5mins 10mins 15mins 20mins 25mins 30mins 35mins 40mins
date
2022-01-02 10:30:00 26.9 26.3 25.5 NaN NaN NaN NaN NaN NaN
2022-01-02 10:45:00 58.0 56.9 58.4 59.2 59.3 60.3 NaN NaN NaN
2022-01-02 11:00:00 49.5 49.5 49.0 48.9 48.1 48.0 48.0 43.9 43.7
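For reference, a reduced self-contained sketch of the same pipeline; it assumes only the first three timestamp columns from the question's frame, typed in by hand:
import pandas as pd

df = pd.DataFrame({
    'date': ['2022-01-02 10:30:00', '2022-01-02 10:45:00', '2022-01-02 11:00:00'],
    '2022-01-02 10:20:00': [25.5, 60.3, 43.7],
    '2022-01-02 10:25:00': [26.3, 59.3, 43.9],
    '2022-01-02 10:30:00': [26.9, 59.2, 48.0],
})

out = (df.melt('date', var_name='date2')
         .assign(date=lambda d: pd.to_datetime(d['date']),
                 date2=lambda d: pd.to_datetime(d['date2']),
                 delta=lambda d: d['date'].sub(d['date2']).dt.total_seconds().floordiv(60))
         .loc[lambda d: d['delta'].ge(0)]
         .pivot(index='date', columns='delta', values='value')
         .rename(columns=lambda x: f'{x:.0f}mins')
         .rename_axis(columns=None))
print(out)  # columns 0mins, 5mins, ..., 40mins, with NaN where no value exists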

Python: Dynamically calculate rolling returns over different frequencies

Consider a DataFrame with multiple columns as follows:
data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],[234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],[76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],[574,45747,658468],[57457,46534,4675]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
df1:
col_A col_B col_C
2022-01-01 99330 12 122
2022-01-02 1123 1230 1287
2022-01-03 123 101 812739
2022-01-04 1143 1230123 252
2022-01-05 234 342 4546
2022-01-06 2445 3453 3457
2022-01-07 7897 8657 5675
2022-01-08 46 5675 453
2022-01-09 76 484 3735
2022-01-10 363 93 4568
2022-01-11 385 568 367
2022-01-12 458 846 4847
2022-01-13 574 45747 658468
2022-01-14 57457 46534 4675
Is there a way to write a loop so I can calculate the rolling returns on a daily ('1D'), weekly ('1W'), monthly ('1M') and six monthly ('6M') basis?
EDIT: Here is my attempt at calculating the rolling return on a daily and weekly basis:
periodicity_dict = {'1D': 'daily', '1W': 'weekly'}
df_columns = df1.columns
for key in periodicity_dict:
    for col in df_columns:
        df1[col + '_rolling'] = np.nan
        for i in pd.date_range(start=df1[col].first_valid_index(), end=df1[col].last_valid_index(), freq=key):
            df1[col + '_rolling'].iloc[i] = (df1[col].iloc[i] - df[col].iloc[i - '1W']) / df[col].iloc[i - '1W']
pct_change does the shifting math for you, but you would have to do it one window at a time.
windows = ["1D", "7D"]
for window in windows:
df1 = pd.merge(
df1,
(
df1[["col_A", "col_B", "col_C"]]
.pct_change(freq=window)
.add_suffix(f"_rolling_{window}")
),
left_index=True,
right_index=True,
)
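As a sanity check on what pct_change(freq=...) computes: it is effectively dividing by the frequency-shifted frame. A small sketch, assuming df1 as constructed in the question:
import pandas as pd

cols = ["col_A", "col_B", "col_C"]
auto = df1[cols].pct_change(freq="7D")
# manual equivalent: divide by the frame shifted forward by 7 days, then drop the extra dates
manual = (df1[cols] / df1[cols].shift(freq="7D") - 1).reindex(df1.index)
pd.testing.assert_frame_equal(auto, manual)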
You can use shift to shift your index by a certain time period. For instance, you can shift everything by one day with:
df1.shift(freq="1D").add_suffix("_1D")
This will then be something like:
col_A_1D col_B_1D col_C_1D
2022-01-02 99330 12 122
2022-01-03 1123 1230 1287
2022-01-04 123 101 812739
2022-01-05 1143 1230123 252
2022-01-06 234 342 4546
You can then add the new columns to the existing data:
df1.merge(df1.shift(freq="1D").add_suffix("_1D"), how="left", left_index=True, right_index=True)
col_A col_B col_C col_A_1D col_B_1D col_C_1D
2022-01-01 99330 12 122 NaN NaN NaN
2022-01-02 1123 1230 1287 99330.0 12.0 122.0
2022-01-03 123 101 812739 1123.0 1230.0 1287.0
2022-01-04 1143 1230123 252 123.0 101.0 812739.0
2022-01-05 234 342 4546 1143.0 1230123.0 252.0
And then just calculate e.g. (df1["col_A"] - df1["col_A_1D"]) / df1["col_A_1D"]. This will then result in:
2022-01-01 NaN
2022-01-02 -0.988694
2022-01-03 -0.890472
2022-01-04 8.292683
2022-01-05 -0.795276
You can do this for all the required columns and time shifts in the same way. For instance:
initial_cols = ["col_A", "col_B", "col_C"]
shifted_cols = [f"{c}_1D" for c in initial_cols]
for i, s in zip(initial_cols, shifted_cols):
df1[f"{i}_rolling"] = (df1[i] - df1[s]) / df1[s]
This will then result in:
col_A col_B col_C col_A_1D col_B_1D col_C_1D col_A_rolling col_B_rolling col_C_rolling
2022-01-01 99330 12 122 NaN NaN NaN NaN NaN NaN
2022-01-02 1123 1230 1287 99330.0 12.0 122.0 -0.988694 101.500000 9.549180
2022-01-03 123 101 812739 1123.0 1230.0 1287.0 -0.890472 -0.917886 630.498834
2022-01-04 1143 1230123 252 123.0 101.0 812739.0 8.292683 12178.435644 -0.999690
2022-01-05 234 342 4546 1143.0 1230123.0 252.0 -0.795276 -0.999722 17.039683
So to answer the main question:
Is there a way to write a loop so I can calculate the rolling returns on a daily ('1D'), weekly ('1W'), monthly ('1M') and six monthly ('6M') basis?
Yes, but there is also a way to do it without a loop :)
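That said, if you do want the loop from the question over several horizons, the same shift-and-merge idea can be wrapped up as below. This is a sketch that assumes fixed-day offsets ('7D', '30D', '182D') as stand-ins for weekly, monthly and six-monthly returns, because calendar offsets such as '1M' would shift onto dates that are not present in this daily index; it also assumes df1 as constructed in the question:
cols = ["col_A", "col_B", "col_C"]
offsets = {"1D": "1D", "1W": "7D", "1M": "30D", "6M": "182D"}  # assumed approximations
for label, off in offsets.items():
    # shift the index forward by the offset and tag the columns with the horizon label
    shifted = df1[cols].shift(freq=off).add_suffix(f"_{label}")
    merged = df1[cols].merge(shifted, how="left", left_index=True, right_index=True)
    for c in cols:
        # return relative to the value one offset earlier
        df1[f"{c}_rolling_{label}"] = (merged[c] - merged[f"{c}_{label}"]) / merged[f"{c}_{label}"]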

What's wrong with this code to conditionally count Pandas dataframe columns?

I have the following data:
Data:
ObjectID,Date,Price,Vol,Mx
101,2017-01-01,,145,203
101,2017-01-02,,155,163
101,2017-01-03,67.0,140,234
101,2017-01-04,78.0,130,182
101,2017-01-05,58.0,178,202
101,2017-01-06,53.0,134,204
101,2017-01-07,52.0,134,183
101,2017-01-08,62.0,148,176
101,2017-01-09,42.0,152,193
101,2017-01-10,80.0,137,150
I want to add a new column called CheckCount counting how many of the values in the Vol and Mx columns are greater than 150. I have written the following code:
Code:
import pandas as pd
Observations = pd.read_csv("C:\\Users\\Observations.csv", parse_dates=['Date'], index_col=['ObjectID', 'Date'])
Observations['CheckCount'] = (Observations[['Vol', 'Mx']]>150).count(axis=1)
print(Observations)
However, unfortunately it is counting every value (result is always 2) rather than only where the values are >150 - what is wrong with my code?
Current Result:
ObjectID,Date,Price,Vol,Mx,CheckCount
101,2017-01-01,,145,203,2
101,2017-01-02,,155,163,2
101,2017-01-03,67.0,140,234,2
101,2017-01-04,78.0,130,182,2
101,2017-01-05,58.0,178,202,2
101,2017-01-06,53.0,134,204,2
101,2017-01-07,52.0,134,183,2
101,2017-01-08,62.0,148,176,2
101,2017-01-09,42.0,152,193,2
101,2017-01-10,80.0,137,150,2
Desired Result:
ObjectID,Date,Price,Vol,Mx,CheckCount
101,2017-01-01,,145,203,1
101,2017-01-02,,155,163,2
101,2017-01-03,67.0,140,234,1
101,2017-01-04,78.0,130,182,1
101,2017-01-05,58.0,178,202,2
101,2017-01-06,53.0,134,204,1
101,2017-01-07,52.0,134,183,1
101,2017-01-08,62.0,148,176,1
101,2017-01-09,42.0,152,193,2
101,2017-01-10,80.0,137,150,0
Are you looking for:
df['CheckCount'] = df[['Vol', 'Mx']].gt(150).sum(axis=1)
Output:
ObjectID Date Price Vol Mx CheckCount
0 101 2017-01-01 NaN 145 203 1
1 101 2017-01-02 NaN 155 163 2
2 101 2017-01-03 67.0 140 234 1
3 101 2017-01-04 78.0 130 182 1
4 101 2017-01-05 58.0 178 202 2
5 101 2017-01-06 53.0 134 204 1
6 101 2017-01-07 52.0 134 183 1
7 101 2017-01-08 62.0 148 176 1
8 101 2017-01-09 42.0 152 193 2
9 101 2017-01-10 80.0 137 150 0
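The reason the original code always returns 2: count tallies non-NaN cells, and a boolean comparison never produces NaN, so every row has two countable values; sum adds up the True values instead. A one-row sketch to illustrate:
import pandas as pd

row = pd.DataFrame({'Vol': [145], 'Mx': [203]})
print((row > 150).count(axis=1))  # 2 -- both cells are non-NaN booleans
print((row > 150).sum(axis=1))    # 1 -- only Mx exceeds 150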

Pandas DataFrame mean of data in columns occurring before certain date time

I have a dataframe with IDs of clients and their expenses for 2014-2018. What I want is the mean of the expenses per ID, but only the years before a certain date may be taken into account when calculating the mean (so the 'Date' column dictates which year columns count towards the mean).
Example: for index 0 (ID 12), the date is '2016-03-08', so the mean should be taken over the columns 'y_2014' and 'y_2015', which gives 111.0 for this index. If the date is too early (e.g. somewhere in 2014 or earlier in this case), then NaN should be returned (see index 6 and 9).
Desired output:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.0
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.0
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.0
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.0
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.0
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.6
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
The code below is what I tried.
Tried code:
import pandas as pd
import numpy as np


df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
 '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})

print(df)

# the years from columns
data = df.filter(like='y_')
data_years = data.columns.str.extract('(\d+)')[0].astype(int)

# the years from Date
years = pd.to_datetime(df.Date).dt.year.values


df['mean'] = data.where(data_years<years[:,None]).mean(1)
print(df)
-> ValueError: Lengths must match to compare
Solved: one possible answer to my own question
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
'2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
# Subset from the original df to calculate the mean
subset = df.loc[:, ['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']]
# an expense value only counts towards the mean once that year has passed,
# so 2015-01-01 is chosen for the 'y_2014' column (and so on) when comparing against the 'Date' column
subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

s = subset.columns[0:].values < df.Date.values[:, None]
t = s.astype(float)
t[t == 0] = np.nan
df['mean'] = (subset.iloc[:, 0:] * t).mean(1)

print(df)
# Additionally: gives the sum of expenses before the date in the 'Date' column
df['sum'] = (subset.iloc[:,0:]*t).sum(1)

print(df)
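For completeness, the original vectorised attempt also works once both sides of the comparison are plain NumPy arrays; the ValueError came from comparing a length-5 pandas Series with a 10x1 array. A sketch, reusing df from the snippet above:
# the year implied by each y_XXXX column
data = df.filter(like='y_')
data_years = data.columns.str.extract(r'(\d+)')[0].astype(int).to_numpy()

# the year of each row's Date
years = pd.to_datetime(df['Date']).dt.year.to_numpy()

# keep only the columns for years strictly before the row's year, then average per row
df['mean'] = data.where(data_years < years[:, None]).mean(axis=1)
print(df[['ID', 'Date', 'mean']])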

How do I get the minimal value of multiple timestamp columns

I want to get the minimal value across multiple timestamp columns. Here's my data:
Id timestamp 1 timestamp 2 timestamp 3
136 2014-08-27 17:29:23 2014-11-05 13:02:18 2014-09-29 22:26:34
245 2015-09-06 15:46:00 NaN NaN
257 2014-09-29 22:26:34 2016-02-02 17:59:54 NaN
258 NaN NaN NaN
480 2016-02-02 17:59:54 2014-11-05 13:02:18 NaN
I want to get the minimal timestamp per row, in a column called minimal:
Id minimal
136 2014-08-27 17:29:23
245 2015-09-06 15:46:00
257 2014-09-29 22:26:34
258 NaN
480 2014-11-05 13:02:18
Select all columns except the first with iloc, convert them to datetimes, take the minimum per row, and join the result back onto the first column:
df = df[['Id']].join(df.iloc[:, 1:].apply(pd.to_datetime).min(axis=1).rename('min'))
print (df)
Id min
0 136 2014-08-27 17:29:23
1 245 2015-09-06 15:46:00
2 257 2014-09-29 22:26:34
3 258 NaT
4 480 2014-11-05 13:02:18
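A self-contained sketch of the same approach, with the sample data from the question typed in by hand (the construction is assumed):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [136, 245, 257, 258, 480],
    'timestamp 1': ['2014-08-27 17:29:23', '2015-09-06 15:46:00',
                    '2014-09-29 22:26:34', np.nan, '2016-02-02 17:59:54'],
    'timestamp 2': ['2014-11-05 13:02:18', np.nan, '2016-02-02 17:59:54',
                    np.nan, '2014-11-05 13:02:18'],
    'timestamp 3': ['2014-09-29 22:26:34', np.nan, np.nan, np.nan, np.nan],
})

# convert every timestamp column to datetime, take the row-wise minimum,
# and join it back onto the Id column
out = df[['Id']].join(df.iloc[:, 1:].apply(pd.to_datetime).min(axis=1).rename('min'))
print(out)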
