Python: Create DataFrame with hierarchical columns and add columns

I have a DataFrame with a multiindex as follows:
df:
                  open  close
date       Symbol
2022-01-01 SPY     100    102
           TSLA    232    245
2022-01-02 SPY     103    100
           TSLA    222    220
           AAPL    143    147
I want to convert this into a DataFrame with hierarchical columns and add another column df['delta']=df['open']-df['close'] as follows:
df2:
             SPY         TSLA        AAPL
            Open Close   Open Close  Open Close
date
2022-01-01   100   102    232   245   nan   nan
2022-01-02   103   100    222   220   143   147
EDIT: After I get the shape in df2, I want to calculate a third column called delta to get the following:
df:
             SPY                TSLA               AAPL
            Open Close delta   Open Close delta   Open Close delta
date
2022-01-01   100   102    -2    232   245   -13    nan   nan   nan
2022-01-02   103   100     3    222   220     2    143   147    -4
How can this be done? I tried pivoting the DataFrame but it did not work.

You should be able to do it with:
(df.assign(delta=lambda x: x['open'] - x['close'])
   .stack()
   .unstack(level=[1, 2])
)
Output:
Symbol        SPY                  TSLA                  AAPL
             open  close delta    open  close  delta    open   close delta
date
2022-01-01  100.0  102.0  -2.0   232.0  245.0  -13.0     NaN     NaN   NaN
2022-01-02  103.0  100.0   3.0   222.0  220.0    2.0   143.0   147.0  -4.0
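For reference, here is a minimal runnable sketch of the same approach; the DataFrame construction is assumed, since the question only shows the printed frame:
import pandas as pd

# rebuild the sample frame from the question (assumed construction)
idx = pd.MultiIndex.from_tuples(
    [('2022-01-01', 'SPY'), ('2022-01-01', 'TSLA'),
     ('2022-01-02', 'SPY'), ('2022-01-02', 'TSLA'), ('2022-01-02', 'AAPL')],
    names=['date', 'Symbol'])
df = pd.DataFrame({'open': [100, 232, 103, 222, 143],
                   'close': [102, 245, 100, 220, 147]}, index=idx)

# add delta, then push the Symbol level and the column names up into hierarchical columns
df2 = (df.assign(delta=lambda x: x['open'] - x['close'])
         .stack()
         .unstack(level=[1, 2]))
print(df2)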

Related

How to calculate time difference in minutes and populate the dataframe accordingly

I have time series data converted to a dataframe. It has multiple columns: the first column holds timestamps, and the remaining column names are themselves timestamps, with the values underneath.
The dataframe looks like this:
date 2022-01-02 10:20:00 2022-01-02 10:25:00 2022-01-02 10:30:00 2022-01-02 10:35:00 2022-01-02 10:40:00 2022-01-02 10:45:00 2022-01-02 10:50:00 2022-01-02 10:55:00 2022-01-02 11:00:00
2022-01-02 10:30:00 25.5 26.3 26.9 NaN NaN NaN NaN NaN NaN
2022-01-02 10:45:00 60.3 59.3 59.2 58.4 56.9 58.0 NaN NaN NaN
2022-01-02 11:00:00 43.7 43.9 48 48 48.1 48.9 49 49.5 49.5
Note that when the value in the date column matches one of the column names, the values after that intersecting column are NaN.
The dataframe I am trying to achieve is shown below, where the column names are the minutes before the date (40, 35, 30, 25, 20, 15, 10, 5, 0) and the same values are populated accordingly:
For example: 1) 2022-01-02 10:30:00 - 2022-01-02 10:30:00 = 0 mins, hence the corresponding value there should be 26.9. 2) 2022-01-02 10:30:00 - 2022-01-02 10:25:00 = 5 mins, hence the value there should be 26.3, and so on.
Note: values marked with * are dummy values for illustration (the real dataframe has many more columns).
date 40mins 35mins 30mins 25mins 20mins 15mins 10mins 5mins 0mins
2022-01-02 10:30:00 24* 24* 24.8* 24.8* 25* 25* 25.5 26.3 26.9
2022-01-02 10:45:00 59* 58* 60* 60.3 59.3 59.2 58.4 56.9 58.0
2022-01-02 11:00:00 43.7 43.9 48 48 48.1 48.9 49 49.5 49.5
I would highly appreciate some help here. Apologies if I have not framed the question well. Please ask for clarification if needed.
IIUC, you can melt, compute the timedelta and filter, then pivot back:
(df.melt('date', var_name='date2')   # reshape the columns to rows
   # convert the date strings to datetime
   # and compute the timedelta in minutes
   .assign(date=lambda d: pd.to_datetime(d['date']),
           date2=lambda d: pd.to_datetime(d['date2']),
           delta=lambda d: d['date'].sub(d['date2'])
                            .dt.total_seconds().floordiv(60)
           )
   # filter out negative timedeltas
   .loc[lambda d: d['delta'].ge(0)]
   # reshape the rows back to columns (keyword arguments are required in pandas >= 2.0)
   .pivot(index='date', columns='delta', values='value')
   # rename columns from integer to "Xmins"
   .rename(columns=lambda x: f'{x:.0f}mins')
   # remove the columns axis label
   .rename_axis(columns=None)
)
Output:
0mins 5mins 10mins 15mins 20mins 25mins 30mins 35mins 40mins
date
2022-01-02 10:30:00 26.9 26.3 25.5 NaN NaN NaN NaN NaN NaN
2022-01-02 10:45:00 58.0 56.9 58.4 59.2 59.3 60.3 NaN NaN NaN
2022-01-02 11:00:00 49.5 49.5 49.0 48.9 48.1 48.0 48.0 43.9 43.7
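For reference, a reduced self-contained sketch of the same pipeline; it assumes only the first three timestamp columns from the question's frame, typed in by hand:
import pandas as pd

df = pd.DataFrame({
    'date': ['2022-01-02 10:30:00', '2022-01-02 10:45:00', '2022-01-02 11:00:00'],
    '2022-01-02 10:20:00': [25.5, 60.3, 43.7],
    '2022-01-02 10:25:00': [26.3, 59.3, 43.9],
    '2022-01-02 10:30:00': [26.9, 59.2, 48.0],
})

out = (df.melt('date', var_name='date2')
         .assign(date=lambda d: pd.to_datetime(d['date']),
                 date2=lambda d: pd.to_datetime(d['date2']),
                 delta=lambda d: d['date'].sub(d['date2']).dt.total_seconds().floordiv(60))
         .loc[lambda d: d['delta'].ge(0)]
         .pivot(index='date', columns='delta', values='value')
         .rename(columns=lambda x: f'{x:.0f}mins')
         .rename_axis(columns=None))
print(out)  # columns 0mins, 5mins, ..., 40mins, with NaN where no value exists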

Python: Dynamically calculate rolling returns over different frequencies

Consider a DataFrame with multiple columns as follows:
data = [[99330,12,122],[1123,1230,1287],[123,101,812739],[1143,1230123,252],[234,342,4546],[2445,3453,3457],[7897,8657,5675],[46,5675,453],[76,484,3735],[363,93,4568],[385,568,367],[458,846,4847],[574,45747,658468],[57457,46534,4675]]
df1 = pd.DataFrame(data, index=['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
'2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
'2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
'2022-01-13', '2022-01-14'],
columns=['col_A', 'col_B', 'col_C'])
df1.index = pd.to_datetime(df1.index)
df1:
col_A col_B col_C
2022-01-01 99330 12 122
2022-01-02 1123 1230 1287
2022-01-03 123 101 812739
2022-01-04 1143 1230123 252
2022-01-05 234 342 4546
2022-01-06 2445 3453 3457
2022-01-07 7897 8657 5675
2022-01-08 46 5675 453
2022-01-09 76 484 3735
2022-01-10 363 93 4568
2022-01-11 385 568 367
2022-01-12 458 846 4847
2022-01-13 574 45747 658468
2022-01-14 57457 46534 4675
Is there a way to write a loop so I can calculate the rolling returns on a daily ('1D'), weekly ('1W'), monthly ('1M') and six monthly ('6M') basis?
EDIT: Here is my attempt at calculating the rolling return on a daily and weekly basis:
periodicity_dict = {'1D': 'daily', '1W': 'weekly'}
df_columns = df1.columns
for key in periodicity_dict:
    for col in df_columns:
        df1[col + '_rolling'] = np.nan
        for i in pd.date_range(start=df1[col].first_valid_index(), end=df1[col].last_valid_index(), freq=key):
            df1[col + '_rolling'].iloc[i] = (df1[col].iloc[i] - df[col].iloc[i - '1W']) / df[col].iloc[i - '1W']
pct_change does the shifting math for you, but you would have to do it one window at a time.
windows = ["1D", "7D"]
for window in windows:
df1 = pd.merge(
df1,
(
df1[["col_A", "col_B", "col_C"]]
.pct_change(freq=window)
.add_suffix(f"_rolling_{window}")
),
left_index=True,
right_index=True,
)
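As a sanity check on what pct_change(freq=...) computes: it is effectively dividing by the frequency-shifted frame. A small sketch, assuming df1 as constructed in the question:
import pandas as pd

cols = ["col_A", "col_B", "col_C"]
auto = df1[cols].pct_change(freq="7D")
# manual equivalent: divide by the frame shifted forward by 7 days, then drop the extra dates
manual = (df1[cols] / df1[cols].shift(freq="7D") - 1).reindex(df1.index)
pd.testing.assert_frame_equal(auto, manual)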
You can use shift to shift your index by a certain time period. For instance, you can shift everything by one day with:
df1.shift(freq="1D").add_suffix("_1D")
This will then be something like:
col_A_1D col_B_1D col_C_1D
2022-01-02 99330 12 122
2022-01-03 1123 1230 1287
2022-01-04 123 101 812739
2022-01-05 1143 1230123 252
2022-01-06 234 342 4546
You can then add the new columns to the existing data:
df1.merge(df1.shift(freq="1D").add_suffix("_1D"), how="left", left_index=True, right_index=True)
col_A col_B col_C col_A_1D col_B_1D col_C_1D
2022-01-01 99330 12 122 NaN NaN NaN
2022-01-02 1123 1230 1287 99330.0 12.0 122.0
2022-01-03 123 101 812739 1123.0 1230.0 1287.0
2022-01-04 1143 1230123 252 123.0 101.0 812739.0
2022-01-05 234 342 4546 1143.0 1230123.0 252.0
And then just calculate e.g. (df1["col_A"] - df1["col_A_1D"]) / df1["col_A_1D"]. This will then result in:
2022-01-01 NaN
2022-01-02 -0.988694
2022-01-03 -0.890472
2022-01-04 8.292683
2022-01-05 -0.795276
You can do this for all the required columns and time shifts in the same way. For instance:
initial_cols = ["col_A", "col_B", "col_C"]
shifted_cols = [f"{c}_1D" for c in initial_cols]
for i, s in zip(initial_cols, shifted_cols):
df1[f"{i}_rolling"] = (df1[i] - df1[s]) / df1[s]
This will then result in:
col_A col_B col_C col_A_1D col_B_1D col_C_1D col_A_rolling col_B_rolling col_C_rolling
2022-01-01 99330 12 122 NaN NaN NaN NaN NaN NaN
2022-01-02 1123 1230 1287 99330.0 12.0 122.0 -0.988694 101.500000 9.549180
2022-01-03 123 101 812739 1123.0 1230.0 1287.0 -0.890472 -0.917886 630.498834
2022-01-04 1143 1230123 252 123.0 101.0 812739.0 8.292683 12178.435644 -0.999690
2022-01-05 234 342 4546 1143.0 1230123.0 252.0 -0.795276 -0.999722 17.039683
So to answer the main question:
Is there a way to write a loop so I can calculate the rolling returns on a daily ('1D'), weekly ('1W'), monthly ('1M') and six monthly ('6M') basis?
Yes, but there is also a way to do it without a loop :)
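That said, if you do want the loop from the question over several horizons, the same shift-and-merge idea can be wrapped up as below. This is a sketch that assumes fixed-day offsets ('7D', '30D', '182D') as stand-ins for weekly, monthly and six-monthly returns, because calendar offsets such as '1M' would shift onto dates that are not present in this daily index; it also assumes df1 as constructed in the question:
cols = ["col_A", "col_B", "col_C"]
offsets = {"1D": "1D", "1W": "7D", "1M": "30D", "6M": "182D"}  # assumed approximations
for label, off in offsets.items():
    # shift the index forward by the offset and tag the columns with the horizon label
    shifted = df1[cols].shift(freq=off).add_suffix(f"_{label}")
    merged = df1[cols].merge(shifted, how="left", left_index=True, right_index=True)
    for c in cols:
        # return relative to the value one offset earlier
        df1[f"{c}_rolling_{label}"] = (merged[c] - merged[f"{c}_{label}"]) / merged[f"{c}_{label}"]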

What's wrong with this code to conditionally count Pandas dataframe columns?

I have the following data:
Data:
ObjectID,Date,Price,Vol,Mx
101,2017-01-01,,145,203
101,2017-01-02,,155,163
101,2017-01-03,67.0,140,234
101,2017-01-04,78.0,130,182
101,2017-01-05,58.0,178,202
101,2017-01-06,53.0,134,204
101,2017-01-07,52.0,134,183
101,2017-01-08,62.0,148,176
101,2017-01-09,42.0,152,193
101,2017-01-10,80.0,137,150
I want to add a new column called CheckCount counting how many of the values in the Vol and Mx columns are greater than 150. I have written the following code:
Code:
import pandas as pd
Observations = pd.read_csv("C:\\Users\\Observations.csv", parse_dates=['Date'], index_col=['ObjectID', 'Date'])
Observations['CheckCount'] = (Observations[['Vol', 'Mx']]>150).count(axis=1)
print(Observations)
However, unfortunately it is counting every value (result is always 2) rather than only where the values are >150 - what is wrong with my code?
Current Result:
ObjectID,Date,Price,Vol,Mx,CheckCount
101,2017-01-01,,145,203,2
101,2017-01-02,,155,163,2
101,2017-01-03,67.0,140,234,2
101,2017-01-04,78.0,130,182,2
101,2017-01-05,58.0,178,202,2
101,2017-01-06,53.0,134,204,2
101,2017-01-07,52.0,134,183,2
101,2017-01-08,62.0,148,176,2
101,2017-01-09,42.0,152,193,2
101,2017-01-10,80.0,137,150,2
Desired Result:
ObjectID,Date,Price,Vol,Mx,CheckCount
101,2017-01-01,,145,203,1
101,2017-01-02,,155,163,2
101,2017-01-03,67.0,140,234,1
101,2017-01-04,78.0,130,182,1
101,2017-01-05,58.0,178,202,2
101,2017-01-06,53.0,134,204,1
101,2017-01-07,52.0,134,183,1
101,2017-01-08,62.0,148,176,1
101,2017-01-09,42.0,152,193,2
101,2017-01-10,80.0,137,150,0
Are you looking for:
df['CheckCount'] = df[['Vol', 'Mx']].gt(150).sum(axis=1)
Output:
ObjectID Date Price Vol Mx CheckCount
0 101 2017-01-01 NaN 145 203 1
1 101 2017-01-02 NaN 155 163 2
2 101 2017-01-03 67.0 140 234 1
3 101 2017-01-04 78.0 130 182 1
4 101 2017-01-05 58.0 178 202 2
5 101 2017-01-06 53.0 134 204 1
6 101 2017-01-07 52.0 134 183 1
7 101 2017-01-08 62.0 148 176 1
8 101 2017-01-09 42.0 152 193 2
9 101 2017-01-10 80.0 137 150 0
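The reason the original code always returns 2: count tallies non-NaN cells, and a boolean comparison never produces NaN, so every row has two countable values; sum adds up the True values instead. A one-row sketch to illustrate:
import pandas as pd

row = pd.DataFrame({'Vol': [145], 'Mx': [203]})
print((row > 150).count(axis=1))  # 2 -- both cells are non-NaN booleans
print((row > 150).sum(axis=1))    # 1 -- only Mx exceeds 150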

Pandas DataFrame mean of data in columns occurring before certain date time

I have a dataframe with IDs of clients and their expenses for 2014-2018. What I want is the mean of the expenses per ID, but only the years before a certain date may be taken into account when calculating the mean (so the 'Date' column dictates which year columns count towards the mean).
Example: for index 0 (ID 12), the date is '2016-03-08', so the mean should be taken over the columns 'y_2014' and 'y_2015', which gives 111.0 for this index. If the date is too early (e.g. somewhere in 2014 or earlier in this case), then NaN should be returned (see index 6 and 9).
Desired output:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.0
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.0
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.0
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.0
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.0
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.6
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
The code below is what I tried.
Tried code:
import pandas as pd
import numpy as np


df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
 '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})

print(df)

# the years from columns
data = df.filter(like='y_')
data_years = data.columns.str.extract('(\d+)')[0].astype(int)

# the years from Date
years = pd.to_datetime(df.Date).dt.year.values


df['mean'] = data.where(data_years<years[:,None]).mean(1)
print(df)
-> ValueError: Lengths must match to compare
Solved: one possible answer to my own question
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
'2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
# Subset from the original df to calculate the mean
subset = df.loc[:, ['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']]
# an expense value only counts towards the mean once that year has passed,
# so 2015-01-01 is chosen for the 'y_2014' column (and so on) when comparing against the 'Date' column
subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

s = subset.columns[0:].values < df.Date.values[:, None]
t = s.astype(float)
t[t == 0] = np.nan
df['mean'] = (subset.iloc[:, 0:] * t).mean(1)

print(df)
# Additionally: gives the sum of expenses before the date in the 'Date' column
df['sum'] = (subset.iloc[:,0:]*t).sum(1)

print(df)
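For completeness, the original vectorised attempt also works once both sides of the comparison are plain NumPy arrays; the ValueError came from comparing a length-5 pandas Series with a 10x1 array. A sketch, reusing df from the snippet above:
# the year implied by each y_XXXX column
data = df.filter(like='y_')
data_years = data.columns.str.extract(r'(\d+)')[0].astype(int).to_numpy()

# the year of each row's Date
years = pd.to_datetime(df['Date']).dt.year.to_numpy()

# keep only the columns for years strictly before the row's year, then average per row
df['mean'] = data.where(data_years < years[:, None]).mean(axis=1)
print(df[['ID', 'Date', 'mean']])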

How do I get the minimal value of multiple timestamp columns

I want to get the minimal value across multiple timestamp columns. Here's my data:
Id timestamp 1 timestamp 2 timestamp 3
136 2014-08-27 17:29:23 2014-11-05 13:02:18 2014-09-29 22:26:34
245 2015-09-06 15:46:00 NaN NaN
257 2014-09-29 22:26:34 2016-02-02 17:59:54 NaN
258 NaN NaN NaN
480 2016-02-02 17:59:54 2014-11-05 13:02:18 NaN
I want to get the minimal timestamp per row, in a column called minimal:
Id minimal
136 2014-08-27 17:29:23
245 2015-09-06 15:46:00
257 2014-09-29 22:26:34
258 NaN
480 2014-11-05 13:02:18
Select all columns except the first with iloc, convert them to datetimes, take the minimum per row, and join the result back onto the first column:
df = df[['Id']].join(df.iloc[:, 1:].apply(pd.to_datetime).min(axis=1).rename('min'))
print (df)
Id min
0 136 2014-08-27 17:29:23
1 245 2015-09-06 15:46:00
2 257 2014-09-29 22:26:34
3 258 NaT
4 480 2014-11-05 13:02:18
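A self-contained sketch of the same approach, with the sample data from the question typed in by hand (the construction is assumed):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id': [136, 245, 257, 258, 480],
    'timestamp 1': ['2014-08-27 17:29:23', '2015-09-06 15:46:00',
                    '2014-09-29 22:26:34', np.nan, '2016-02-02 17:59:54'],
    'timestamp 2': ['2014-11-05 13:02:18', np.nan, '2016-02-02 17:59:54',
                    np.nan, '2014-11-05 13:02:18'],
    'timestamp 3': ['2014-09-29 22:26:34', np.nan, np.nan, np.nan, np.nan],
})

# convert every timestamp column to datetime, take the row-wise minimum,
# and join it back onto the Id column
out = df[['Id']].join(df.iloc[:, 1:].apply(pd.to_datetime).min(axis=1).rename('min'))
print(out)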
