Thanks for reading, and thanks in advance for any answers.
Beta is a measure of the systematic risk of an investment portfolio. It is calculated by taking the covariance of that portfolio's returns against the benchmark / market and dividing it by the variance of the market. I'd like to calculate this on a rolling basis against many portfolios.
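In symbols, the usual definition is:
beta = Cov(r_p, r_m) / Var(r_m)
where r_p is the portfolio's return series and r_m is the benchmark/market return series.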
I have a df as follows
PERIOD,PORT1,PORT2,BM
201504,-0.004,-0.001,-0.013
201505,0.017,0.019,0.022
201506,-0.027,-0.037,-0.039
201507,0.026,0.033,0.017
201508,-0.045,-0.054,-0.081
201509,-0.033,-0.026,-0.032
201510,0.053,0.07,0.09
201511,0.03,0.032,0.038
201512,-0.05,-0.034,-0.044
201601,-0.016,-0.043,-0.057
201602,-0.007,-0.007,-0.011
201603,0.014,0.014,0.026
201604,0.003,0.001,0.01
201605,0.046,0.038,0.031
Except with many more columns like PORT1 and PORT2.
I would like to create a dataset with a rolling beta vs the BM column.
I created a similar rolling correlation dataset with
df.rolling(3).corr(df['BM'])
...which took every column in my large set and calculated a correlation vs my BM column.
I tried to make a custom function for Beta but because it takes two arguments I am struggling. Below is my custom function and how I got it to work by feeding it two columns of returns.
import numpy as np

def beta(arr1, arr2):
    # ddof=0 gives the population covariance; the [0][1] entry is the arr1 vs arr2 covariance from the matrix
    return np.cov(arr1, arr2, ddof=0)[0][1] / np.var(arr2)
beta_test = beta(df['PORT1'],df['BM'])
So this helps me find the beta between two columns that I feed in... the question is how to do this for my data above, with many columns/portfolios, and then how to do it on a rolling basis. From what I saw above with the correlation, something like the below should be possible, running each rolling 3-month window of each column vs one specified column.
beta_data = df.rolling(3).agg(beta(df['BM']))
Any pointer in the right direction would be appreciated.
IIUC, you can set_index on the columns PERIOD and BM, filter for columns containing PORT (in case you have other columns you don't want to apply the beta function to), then use rolling.apply like:
print (df.set_index(['PERIOD','BM']).filter(like='PORT')
.rolling(3).apply(lambda x: beta(x, x.index.get_level_values(1)))
.reset_index())
PERIOD BM PORT1 PORT2
0 201504 -0.013 NaN NaN
1 201505 0.022 NaN NaN
2 201506 -0.039 0.714514 0.898613
3 201507 0.017 0.814734 1.055798
4 201508 -0.081 0.736486 0.907336
5 201509 -0.032 0.724490 0.887755
6 201510 0.090 0.598332 0.736964
7 201511 0.038 0.715848 0.789221
8 201512 -0.044 0.787248 0.778703
9 201601 -0.057 0.658877 0.794949
10 201602 -0.011 0.412270 0.789567
11 201603 0.026 0.354829 0.690573
12 201604 0.010 0.562924 0.558083
13 201605 0.031 1.716066 1.530471
def getbetas(df, market, window = 45):
""" given an unstacked pandas dataframe (columns instruments, rows
dates), compute the rolling betas vs the market.
"""
nmarket = market/market.rolling(window).var()
thebetas = df.rolling(window).cov(other=nmarket)
return thebetas
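A hedged usage sketch with the sample frame from the question (window=3 to mirror the rolling correlation example above; selecting the portfolio columns with filter is my assumption). Note that because the market series is rescaled by its own rolling variance before the covariance is taken, this matches cov/var exactly only when that rolling variance is stable across the rows in a window.
# Hypothetical usage with the sample data above
returns = df.set_index('PERIOD')
rolling_betas = getbetas(returns.filter(like='PORT'), returns['BM'], window=3)
print(rolling_betas)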
I have a Pandas DataFrame like (abridged):
       age   gender  control  county
11877  67.0  F       0        AL-Calhoun
11552  60.0  F       0        AL-Coosa
11607  60.0  F       0        AL-Talladega
13821  NaN   NaN     1        AL-Mobile
11462  59.0  F       0        AL-Dale
I want to run a linear regression with fixed effects by county entity (not by time) to balance check my control and treatment groups for an experimental design, such that my dependent variable is membership in the treatment group (control = 1) or not (control = 0).
In order to do this, so far as I have seen I need to use linearmodels.panel.PanelOLS and set my entity field (county) as my index.
So far as I'm aware my model should look like this:
# set index on entity effects field:
to_model = to_model.set_index(["county"])
# implement fixed effects linear model
model = PanelOLS.from_formula("control ~ age + gender + EntityEffects", to_model)
When I try to do this, I get the below error:
ValueError: The index on the time dimension must be either numeric or date-like
I have seen a lot of implementations of such models online and they all seem to use a temporal effect, which is not relevant in my case. If I try to encode my county field using numerics, I get a different error.
# create a dict to map county values to numerics
county_map = dict(zip(to_model["county"].unique(), range(len(to_model.county.unique()))))
# create a numeric column as alternative to county
to_model["county_numeric"] = to_model["county"].map(county_map)
# set index on numeric entity effects field
to_model = to_model.set_index(["county_numeric"])
FactorEvaluationError: Unable to evaluate factor `control`. [KeyError: 'control']
How am I able to implement this model using the county as a unit fixed effect?
Assuming you have multiple entries for each county, you could use the following. The key step is to use a groupby transform to create a distinct numeric index within each county, which can be used as a fake time index.
import numpy as np
import pandas as pd
import string
import linearmodels as lm
# Generate a random DataFrame
rs = np.random.default_rng(1213892)
counties = rs.choice([c for c in string.ascii_lowercase], (1000, 3))
counties = np.array([["".join(c)] * 10 for c in counties]).ravel()
age = rs.integers(18, 65, (10 * 1000))
gender = rs.choice(["m", "f"], size=(10 * 1000))
control = rs.integers(0, 2, size=10 * 1000)
df = pd.DataFrame(
{"counties": counties, "age": age, "gender": gender, "control": control}
)
# Construct a dummy numeric index for each county
numeric_index = df.groupby("counties").age.transform(lambda c: np.arange(len(c)))
df["numeric_index"] = numeric_index
df = df.set_index(["counties","numeric_index"])
# Take a look
df.head(15)
age gender control
counties numeric_index
qbt 0 51 m 1
1 36 m 0
2 28 f 1
3 28 m 0
4 47 m 0
5 19 m 1
6 32 m 1
7 54 m 0
8 36 m 1
9 52 m 0
nub 0 19 m 0
1 57 m 0
2 49 f 0
3 53 m 1
4 30 f 0
This just shows that the model can be estimated.
# Fit the model
# Note: Results are meaningless, just shows that this works
mod = lm.PanelOLS.from_formula("control ~ age + gender + EntityEffects", data=df)
mod.fit()
PanelOLS Estimation Summary
================================================================================
Dep. Variable: control R-squared: 0.0003
Estimator: PanelOLS R-squared (Between): 0.0005
No. Observations: 10000 R-squared (Within): 0.0003
Date: Thu, May 12 2022 R-squared (Overall): 0.0003
Time: 11:08:00 Log-likelihood -6768.3
Cov. Estimator: Unadjusted
F-statistic: 1.4248
Entities: 962 P-value 0.2406
Avg Obs: 10.395 Distribution: F(2,9036)
Min Obs: 10.0000
Max Obs: 30.000 F-statistic (robust): 2287.4
P-value 0.0000
Time periods: 30 Distribution: F(2,9036)
Avg Obs: 333.33
Min Obs: 2.0000
Max Obs: 962.00
Parameter Estimates
===============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-------------------------------------------------------------------------------
age -0.0002 0.0004 -0.5142 0.6072 -0.0010 0.0006
gender[T.f] 0.5191 0.0176 29.559 0.0000 0.4847 0.5535
gender[T.m] 0.5021 0.0175 28.652 0.0000 0.4678 0.5365
===============================================================================
F-test for Poolability: 0.9633
P-value: 0.7768
Distribution: F(961,9036)
Included effects: Entity
PanelEffectsResults, id: 0x2246f38a9d0
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements; however, this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It's a bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we groupby on MotherID, PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby 'tm' and apply max. For each sub-dataframe d we thus obtain a Series mapping tm to max(abdomCirc).
Then we unstack(), which moves tm into the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it, you can simplify the above a bit:
(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
similar idea, same output.
There is a handy DataFrame method called query. This should do the work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (rather than manually changing your MotherID and PregnancyID values every time for each different group of rows), you have to combine it with groupby (as you did on your own).
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
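If it helps, here is a hedged sketch combining the trimester boundaries from the question with groupby; the use of pd.cut, the bin edges and the new column names are my assumptions:
import pandas as pd
bins = [0, 13, 26, 40]  # trimester boundaries as described in the question
labels = ['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd']
out = (df.assign(trimester=pd.cut(df['gestationalAgeInWeeks'], bins=bins, labels=labels))
         .groupby(['MotherID', 'PregnancyID', 'trimester'])['abdomCirc']
         .max()
         .unstack())  # one column per trimester, NaN where no measurement was taken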
I'm trying to implement a paper that uses the PIMA Indians Diabetes dataset. This is the dataset after imputing missing values:
Preg Glucose BP SkinThickness Insulin BMI Pedigree Age Outcome
0 1 148.0 72.000000 35.00000 155.548223 33.600000 0.627 50 1
1 1 85.0 66.000000 29.00000 155.548223 26.600000 0.351 31 0
2 1 183.0 64.000000 29.15342 155.548223 23.300000 0.672 32 1
3 1 89.0 66.000000 23.00000 94.000000 28.100000 0.167 21 0
4 0 137.0 40.000000 35.00000 168.000000 43.100000 2.288 33 1
5 1 116.0 74.000000 29.15342 155.548223 25.600000 0.201 30 0
The description:
df.describe()
       Preg        Glucose     BP          SkinThickness  Insulin     BMI         Pedigree    Age
count  768.000000  768.000000  768.000000  768.000000     768.000000  768.000000  768.000000  768.000000
mean   0.855469    121.686763  72.405184   29.153420      155.548223  32.457464   0.471876    33.240885
std    0.351857    30.435949   12.096346   8.790942       85.021108   6.875151    0.331329    11.760232
min    0.000000    44.000000   24.000000   7.000000       14.000000   18.200000   0.078000    21.000000
25%    1.000000    99.750000   64.000000   25.000000      121.500000  27.500000   0.243750    24.000000
50%    1.000000    117.000000  72.202592   29.153420      155.548223  32.400000   0.372500    29.000000
75%    1.000000    140.250000  80.000000   32.000000      155.548223  36.600000   0.626250    41.000000
max    1.000000    199.000000  122.000000  99.000000      846.000000  67.100000   2.420000    81.000000
The description of normalization from the paper is as follows:
As part of our data preprocessing, the original data values are scaled so as to fall within a small specified range of [0,1] values by performing normalization of the dataset. This will improve speed and reduce runtime complexity. Using the Z-Score we normalize our value set V to obtain a new set of normalized values V’ with the equation below:
V' = (V - Y) / Z
where V' = new normalized value, V = previous value, Y = mean and Z = standard deviation
import scipy.stats
z = scipy.stats.zscore(df)
But when I try to run the code above, I'm getting negative values and values greater than one i.e., not in the range [0,1].
There are several points to note here.
Firstly, z-score normalisation will not result in features in the range [0, 1] unless the input data has very specific characteristics.
Secondly, as others have noted, two of the most common ways of normalising data are standardisation and min-max scaling.
Set up data
import pandas as pd
import string
df = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv')
# For the purposes of this exercise, we'll just use the alphabet as column names
df.columns = list(string.ascii_lowercase)[:len(df.columns)]
$ print(df.head())
a b c d e f g h i
0 1 85 66 29 0 26.6 0.351 31 0
1 8 183 64 0 0 23.3 0.672 32 1
2 1 89 66 23 94 28.1 0.167 21 0
3 0 137 40 35 168 43.1 2.288 33 1
4 5 116 74 0 0 25.6 0.201 30 0
Standardisation
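The standardised frame isn't defined in this excerpt; a plausible definition consistent with the column-wise z-score discussed above (this exact line is my assumption, not necessarily the code that produced the printout below) would be:
# Assumed definition: column-wise z-score standardisation of df
standardised = (df - df.mean()) / df.std()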
# print the minimum and maximum values in the entire dataset with a little formatting
$ print(f"Min: {standardised.min().min():4.3f} Max: {standardised.max().max():4.3f}")
Min: -4.055 Max: 845.307
As you can see, the values are far from being in [0, 1]. Note the range of the resulting data from z-score normalisation will vary depending on the distribution of the input data.
Min-max scaling
min_max = (df - df.values.min()) / (df.values.max() - df.values.min())
# print the minimum and maximum values in the entire dataset with a little formatting
$ print(f"Min: {min_max.min().min():4.3f} Max: {min_max.max().max():4.3f}")
Min: 0.000 Max: 1.000
Here we do indeed get values in [0, 1].
Discussion
These and a number of other scalers exist in the sklearn preprocessing module. I recommend reading the sklearn documentation and using these instead of doing it manually, for various reasons:
There are fewer chances of making a mistake as you have to do less typing.
sklearn will be at least as computationally efficient and often more so.
You should use the same scaling parameters from training on the test data to avoid leakage of test-data information. (In most real-world uses this is unlikely to be significant, but it is good practice.) By using sklearn you don't need to store the min/max/mean/SD etc. from scaling the training data to reuse later on the test data; you can just use scaler.fit_transform(X_train) and scaler.transform(X_test), as in the sketch after this list.
If you want to reverse the scaling later on, you can use scaler.inverse_transform(data).
I'm sure there are other reasons, but these are the main ones that come to mind.
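A minimal sketch of the workflow described in the last two points, reusing the df set up above; treating column 'i' as the target and the particular train/test split are assumptions on my part:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
X, y = df.drop(columns='i'), df['i']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler()                          # scales each feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)   # learn min/max from the training split only
X_test_scaled = scaler.transform(X_test)         # reuse those parameters on the test split
X_train_restored = scaler.inverse_transform(X_train_scaled)  # reverse the scaling if needed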
Your standardization formula isn't meant to put values in the range [0, 1].
If you want to normalize data into such a range, you can use the following formula:
z = (actual_value - min_value_in_database)/(max_value_in_database - min_value_in_database)
And you're not obliged to do it manually; just use the sklearn library. You'll find different standardization and normalization methods in the preprocessing module.
Assuming your original dataframe is df and it has no invalid float values, this should work
df2 = (df - df.values.min()) / (df.values.max()-df.values.min())
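Note that df.values.min() and df.values.max() are the global minimum and maximum over the whole frame. If you instead want each column scaled to [0, 1] independently (a common alternative), a per-column variant would be:
df2 = (df - df.min()) / (df.max() - df.min())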
I am still a noob when it comes to statistics.
I am using the Python package statsmodels, with the patsy formula functionality.
My pandas dataframe looks as such:
index sed label c_g lvl1 lvl2
0 5.0 SP_A c b c
1 10.0 SP_B g b c
2 0.0 SP_C c b c
3 -10.0 SP_H c b c
4 0.0 SP_J g b c
5 -20.0 SP_K g b c
6 30.0 SP_W g a a
7 40.0 SP_X g a a
8 -10.0 SP_Y c a a
9 45.0 SP_BB g a a
10 45.0 SP_CC g a a
11 10.0 SP_A c b c
12 10.0 SP_B g b c
13 10.0 SP_C c b c
14 6.0 SP_D g b c
15 10.0 SP_E c b c
16 29.0 SP_F c b c
17 3.0 SP_G g b c
18 23.0 SP_H c b c
19 34.0 SP_J g b c
Dependent variable: Sedimentation (longitudinal data)
Independent variables: Label (categorical), control_grid (categorical), lvl1 (categorical), lvl2 (categorical).
I am interested in two things.
Which Independent variables have significant effect on Dependent variable?
Which Independent variables have significant interaction?
After having searched and read multiple documents, I do this as such:
import statsmodels.formula.api as smf
import pandas as pd
df = pd.read_csv('some.csv')
model = smf.ols(formula = 'sedimentation ~ lvl1*lvl2',data=df)
results = model.fit()
results.summary()
With results showing:
OLS Regression Results
==============================================================================
Dep. Variable: sedimentation R-squared: 0.129
Model: OLS Adj. R-squared: 0.124
Method: Least Squares F-statistic: 24.91
Date: Tue, 17 Jul 2018 Prob (F-statistic): 4.80e-15
Time: 11:15:28 Log-Likelihood: -2353.6
No. Observations: 510 AIC: 4715.
Df Residuals: 506 BIC: 4732.
Df Model: 3
Covariance Type: nonrobust
=======================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------
Intercept 6.9871 1.611 4.338 0.000 3.823 10.151
lvl1[T.b] -3.7990 1.173 -3.239 0.001 -6.103 -1.495
lvl1[T.d] -3.5124 1.400 -2.509 0.012 -6.263 -0.762
lvl2[T.b] -8.9427 1.155 -7.744 0.000 -11.212 -6.674
lvl2[T.c] 5.1436 0.899 5.722 0.000 3.377 6.910
lvl2[T.f] -3.5124 1.400 -2.509 0.012 -6.263 -0.762
lvl1[T.b]:lvl2[T.b] -8.9427 1.155 -7.744 0.000 -11.212 -6.674
lvl1[T.d]:lvl2[T.b] 0 0 nan nan 0 0
lvl1[T.b]:lvl2[T.c] 5.1436 0.899 5.722 0.000 3.377 6.910
lvl1[T.d]:lvl2[T.c] 0 0 nan nan 0 0
lvl1[T.b]:lvl2[T.f] 0 0 nan nan 0 0
lvl1[T.d]:lvl2[T.f] -3.5124 1.400 -2.509 0.012 -6.263 -0.762
==============================================================================
Omnibus: 13.069 Durbin-Watson: 1.118
Prob(Omnibus): 0.001 Jarque-Bera (JB): 18.495
Skew: -0.224 Prob(JB): 9.63e-05
Kurtosis: 3.818 Cond. No. inf
==============================================================================
Am I using the correct model in Python to get my desired results?
I think I am, but I would like to verify. The way I read the table is that the categorical variables lvl1 and lvl2 have a significant effect on the dependent variable AND show significant interaction (for some of the variables). However, I don't understand why not all of my variables are showing... as you can see in my data, the lvl1 column also contains "a", but this value is not shown in the results summary.
I am not an expert and I fear I can't tell you what the correct test for longitudinal data is, but I think the numbers you got can't really be trusted that much.
First, the easy part of the answer, regarding your "why not all of my variables are showing": for example, in lvl1, "a" is not showing because you have to fix a "base" value of some kind. So you should read every entry as "effect of having 'b' instead of 'a'", "effect of having 'd' instead of 'a'", etc. In more mathematical terms, if you have a categorical variable that takes three values (a, b, d here), then when you implicitly one-hot encode it you get three columns that only take values 0 or 1 and whose sum is always 1. This means that your final A matrix in the regression y = A.x + b will always be degenerate, and you have to delete one column to have a chance of it not being so (thus giving any interpretability at all to the regression coefficients).
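As a small illustration of that reference-category point (the values below mirror the lvl1 column from the question):
import pandas as pd
lvl1 = pd.Series(['a', 'b', 'd', 'a', 'b'])
# drop_first=True drops the 'a' column, making 'a' the implicit baseline level
print(pd.get_dummies(lvl1, prefix='lvl1', drop_first=True))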
Concerning why I think the numbers you got cannot be trusted: among the various hypotheses of linear regression is the independence of consecutive observations (rows). In the case of longitudinal data, this is exactly what clearly fails. Pushing the example to the limit, if you observe a bunch of people (e.g. 11 as in your set) every second for 1 day, you'll get a huge data frame of nearly 1M rows, and every single person will have virtually the same data repeated over and over again. In this setting, any spurious correlation between the independent and dependent variables will be seen by your model as hugely significant (to the model, you've run 86,400 independent tests and they all confirmed exactly the same conclusion!), while of course this is not the case.
Summing up, I can't say for sure that the regression coefficients you get are not the best guess you can hope for, but certainly the t statistics, the p-values and everything else there that looks like a statistic don't make much sense.
I was wondering how I could calculate the average of a specific category via Python? I have a csv file called demo.csv
import pandas as pd
import numpy as np
#loading the data into data frame
X = pd.read_csv('demo.csv')
the two columns of interest are the Category and Totals column:
Category Totals estimates
2 2777 0.43
4 1003 0.26
4 3473 0.65
4 2638 0.17
1 2855 0.74
0 2196 0.13
0 2630 0.91
2 2714 0.39
3 2472 0.51
0 1090 0.12
I'm interested in finding the average for the Totals corresponding with Category 2. I know how to do this in Excel: I would just filter to only show category 2 and take the average (which ends up being 2745.5). But how would I code this in Python?
You can restrict your dataframe to the subset of rows you want (Category == 2), then take the mean of the Totals column, as follows:
df[df['Category'] == 2]['Totals'].mean()
2745.5
I'm interested in finding the average for the Totals corresponding with Category 2
You may set the category as the index then calculate the mean for any category using the .loc or .ix indexers:
df.set_index('Category').loc['2', 'Totals'].mean()
=> 2745.50
df.set_index('Category').ix['2', 'Totals'].mean()
=> 2745.50
The same can be achieved by using groupby
df.groupby('Category').Totals.mean().loc['2']
=> 2745.50
Note I'm assuming Category is a string.
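If Category is actually numeric, as the sample in the question suggests, the same groupby approach works with an integer label:
df.groupby('Category')['Totals'].mean().loc[2]
=> 2745.50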