I have a dataset and I want to select the subset of variables with a VIF (Variance Inflation Factor) smaller than a certain threshold. My idea was to calculate the VIF for every variable, remove the variable with the highest value (if it is higher than the threshold), recalculate the VIF for every remaining variable, and repeat until no VIF is above the threshold.
There is nothing novel about this approach, but I couldn't get past a certain point when writing a function to automate the process in Python.
Here, x is the dataset with the target variable dropped:
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
x_vif = add_constant(x)
vif = pd.DataFrame([variance_inflation_factor(x_vif.values, i) for i in range(x_vif.shape[1])], index=x_vif.columns)
The vif could also be a list. Is there a package that does this automatically, or could you give me an idea of how to create this function?
I found an R library (thinXwithVIF) that does this automatically, but I couldn't get rpy2 to work with the Python version I need to use.
What would make sense is to remove the variable with the highest VIF in each round, subset the dataframe, and stop when all variables are below your threshold. I don't think VIF is the be-all and end-all, though; you really have to look at the data to decide what to include.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = sm.datasets.get_rdataset('mtcars')
x_vif = data.data.iloc[:, 1:]   # all predictors
y = data.data['mpg']            # target
thres = 10

while True:
    cols = range(x_vif.shape[1])
    vif = np.array([variance_inflation_factor(x_vif.values, i) for i in cols])
    if all(vif < thres):
        break
    # drop the column with the highest VIF and recompute
    cols = np.delete(cols, np.argmax(vif))
    x_vif = x_vif.iloc[:, cols]
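If you'd rather have this as a reusable function (the question asked for one), a minimal sketch could be the following; the function name drop_high_vif and its signature are my own invention, not from any package:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(df, thres=10.0):
    """Iteratively drop the column with the highest VIF until all VIFs are below thres."""
    x = df.copy()
    while True:
        vif = pd.Series(
            [variance_inflation_factor(x.values, i) for i in range(x.shape[1])],
            index=x.columns,
        )
        if vif.max() < thres:
            return x
        # remove the worst offender and recompute on the remaining columns
        x = x.drop(columns=vif.idxmax())

For the mtcars example above you would call x_reduced = drop_high_vif(data.data.iloc[:, 1:], thres=10) and get back the reduced feature frame.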
I have a CSV file with 10 columns. I can use pandas to import the dataframe and use the corr() function to output a correlation matrix as a heatmap. What I want next is for the code to loop through the dataframe and find high or low correlations between combinations of columns.
For example, the simple correlation matrix looks at:
A:A, A:B, A:C, A:D etc
But I want the code to combine columns, in every conceivable way, such as:
AB:A, AB:B, AB:C, AB:D etc
ABC:A, ABC:B, ABC:D etc
And if there are any noticeable correlations between certain combinations, to highlight those.
Is this possible at all? Or are there proprietary applications that can do this?
Thanks
I assume that by "combination" you mean linear combination. You can loop over the columns (not the most elegant way) and use sklearn's linear_model:
import pandas as pd
import numpy as np
from sklearn import linear_model

df = pd.DataFrame(np.random.random([10, 10]),
                  columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])

for i, col1 in enumerate(df):
    if i > 0:
        X = df.iloc[:, 0:i]
        for j, col2 in enumerate(df):
            if j >= i:
                y = df[[col2]]
                regr = linear_model.LinearRegression()
                regr.fit(X, y)
                score = regr.score(X, y)
                print(f'X: {X.columns} y: {y.columns} score: {score}')
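If you then want the code itself to highlight the noticeable combinations rather than scanning the printout, one option is to collect the scores and filter them. This sketch reuses df and linear_model from the snippet above; the 0.8 cutoff is an arbitrary choice of mine:

results = []
for i, col1 in enumerate(df):
    if i > 0:
        X = df.iloc[:, 0:i]
        for j, col2 in enumerate(df):
            if j >= i:
                y = df[[col2]]
                regr = linear_model.LinearRegression().fit(X, y)
                results.append({
                    "predictors": "+".join(X.columns),
                    "target": col2,
                    "r2": regr.score(X, y),
                })

scores = pd.DataFrame(results)
# show only the combinations with a strong linear relationship
print(scores[scores["r2"] > 0.8])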
I am trying to find out whether my time series is additive or multiplicative. I tried two approaches:
If the variance is high and varies with time (i.e. high variability), the series is multiplicative, otherwise additive; but I'm confused whether this should be checked on the detrended series or the original.
I also converted the code from a blog post, linked here.
The converted Python code is:
import pandas as pd
import numpy as np
import statsmodels.api as sm

df = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv")
df['Month'] = pd.to_datetime(df['Month'])
# Set the column 'Month' as index (skip if already done)
df = df.set_index('Month')

## Trend
df['Trend'] = [0]*(7-1) + list(pd.Series(df['Passengers']).rolling(window=7).mean().iloc[7-1:].values)

## De-trend data
df['detrended_a'] = df['Passengers'] - df['Trend']
df['detrended_m'] = df['Passengers'] / df['Trend']

## Seasonal components
df['seasonal_a'] = [0]*(7-1) + list(pd.Series(df['detrended_a']).rolling(window=7).mean().iloc[7-1:].values)
df['seasonal_m'] = [0]*(7-1) + list(pd.Series(df['detrended_m']).rolling(window=7).mean().iloc[7-1:].values)

## Residuals
df['residual_a'] = df['detrended_a'] - df['seasonal_a']
df['residual_m'] = df['detrended_m'] / df['seasonal_m']

acf_a = sum(pd.Series(sm.tsa.acf(df['residual_a'])).fillna(0))
acf_m = sum(pd.Series(sm.tsa.acf(df['residual_m'])).fillna(0))

if acf_a > acf_m:
    print("Additive")
else:
    print("Multiplicative")
I am still not sure how well this generalises across different kinds of series. Can anyone help me improve this method or suggest a better one?
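Not a definitive answer, but one alternative you could try is to let statsmodels' seasonal_decompose fit both an additive and a multiplicative model and compare the residual autocorrelation. The criterion below, preferring the model whose residuals retain less absolute autocorrelation, is my own assumption (it differs from the comparison direction in the snippet above), and period=12 assumes monthly data:

import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose

df = pd.read_csv(
    "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv",
    parse_dates=["Month"], index_col="Month",
)

# 'period' replaced the older 'freq' keyword in recent statsmodels versions
dec_add = seasonal_decompose(df["Passengers"], model="additive", period=12)
dec_mul = seasonal_decompose(df["Passengers"], model="multiplicative", period=12)

# less leftover autocorrelation in the residuals -> the decomposition captured more structure
acf_add = pd.Series(sm.tsa.acf(dec_add.resid.dropna())).abs().sum()
acf_mul = pd.Series(sm.tsa.acf(dec_mul.resid.dropna())).abs().sum()

print("Additive" if acf_add < acf_mul else "Multiplicative")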
I am trying to solve a multi-objective optimization problem with 3 objectives and 2 decision variables using NSGA-II. The pymoo code for the NSGA-II algorithm and the termination criterion is given below. My pop_size is 100 and n_offsprings is 100, and the algorithm is run for 100 generations. I want to store the decision variable values considered in each generation, for all 100 generations, in a dataframe.
NSGA-II implementation in pymoo:
from pymoo.algorithms.nsga2 import NSGA2
from pymoo.factory import get_sampling, get_crossover, get_mutation
from pymoo.factory import get_termination
from pymoo.optimize import minimize

algorithm = NSGA2(
    pop_size=20,
    n_offsprings=10,
    sampling=get_sampling("real_random"),
    crossover=get_crossover("real_sbx", prob=0.9, eta=15),
    mutation=get_mutation("real_pm", prob=0.01, eta=20),
    eliminate_duplicates=True
)

termination = get_termination("n_gen", 100)

res = minimize(MyProblem(),
               algorithm,
               termination,
               seed=1,
               save_history=True,
               verbose=True)
What I have tried (my reference: a Stack Overflow question):
import pandas as pd
df2 = pd.DataFrame(algorithm.pop)
df2.head(10)
The result from the above code is blank, and on running
print(df2)
I get
Empty DataFrame
Columns: []
Index: []
Glad you intend to use pymoo for your research. You have correctly enabled the save_history option, which means you can access the algorithm objects.
To get all solutions from the run, you can combine the offspring (algorithm.off) from each generation. Don't forget that Population objects contain Individual objects; with the get method you can retrieve X, F, or other values. See the code below.
import pandas as pd

from pymoo.algorithms.nsga2 import NSGA2
from pymoo.factory import get_sampling, get_crossover, get_mutation, ZDT1
from pymoo.factory import get_termination
from pymoo.model.population import Population
from pymoo.optimize import minimize
problem = ZDT1()

algorithm = NSGA2(
    pop_size=20,
    n_offsprings=10,
    sampling=get_sampling("real_random"),
    crossover=get_crossover("real_sbx", prob=0.9, eta=15),
    mutation=get_mutation("real_pm", prob=0.01, eta=20),
    eliminate_duplicates=True
)

termination = get_termination("n_gen", 10)

res = minimize(problem,
               algorithm,
               termination,
               seed=1,
               save_history=True,
               verbose=True)

# collect the offspring of every generation from the history
all_pop = Population()
for algorithm in res.history:
    all_pop = Population.merge(all_pop, algorithm.off)

df = pd.DataFrame(all_pop.get("X"), columns=[f"X{i+1}" for i in range(problem.n_var)])
print(df)
Another way would be to use a callback and fill the dataframe each generation, similar to what is shown here: https://pymoo.org/interface/callback.html
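For illustration, here is a rough sketch of that callback idea. It assumes the same pymoo version and imports as above and freshly constructed problem, algorithm, and termination objects (note the history loop above rebinds the name algorithm); the CollectX class name and the DataFrame assembly at the end are my own additions:

import pandas as pd
from pymoo.model.callback import Callback   # moved to pymoo.core.callback in newer pymoo
from pymoo.optimize import minimize

class CollectX(Callback):
    """Record the decision variables of the whole population at every generation."""
    def __init__(self):
        super().__init__()
        self.data["X"] = []

    def notify(self, algorithm):
        # algorithm.pop holds the current population; get("X") returns its decision variables
        self.data["X"].append(algorithm.pop.get("X"))

res = minimize(problem,
               algorithm,
               termination,
               seed=1,
               callback=CollectX(),
               verbose=True)

# one row per individual, tagged with its generation number
frames = [pd.DataFrame(X, columns=[f"X{i+1}" for i in range(problem.n_var)]).assign(gen=g)
          for g, X in enumerate(res.algorithm.callback.data["X"])]
df = pd.concat(frames, ignore_index=True)
print(df)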
I'm trying to write a block of code that will allow me to identify the risk contribution of assets in a portfolio. The covariance matrix is a 6x6 pandas dataframe.
My code is as follows:
import numpy as np
import pandas as pd
weights = np.array([.1,.2,.05,.25,.1,.3])
data = pd.DataFrame(np.random.randn(1000,6),columns = 'a','b','c','d','e','f'])
covariance = data.cov()
portfolio_variance = (weights*covariance*weights.T)[0,0]
sigma = np.sqrt(portfolio_variance)
marginal_risk = covariance*weights.T
risk_contribution = np.multiply(marginal_risk, weights.T)/sigma
print(risk_contribution)
When I try to run the code I get a KeyError, and if I remove the [0,0] from portfolio_variance I get output that doesn't seem to make sense.
Can somebody point me to my error(s)?
Three problems with your code:
First, the opening square bracket of the column list is missing in the line that creates the DataFrame:
data = pd.DataFrame(np.random.randn(1000,6),columns = ['a','b','c','d','e','f'])
Second, you're using the two-dimensional indexing wrong: you can't say [0,0], you have to chain it as [0][0].
And last, because you named the columns, you have to use those names when indexing, so it's actually ['a'][0]:
portfolio_variance = (weights*covariance*weights.T)['a'][0]
Final working code:
import numpy as np
import pandas as pd
weights = np.array([.1,.2,.05,.25,.1,.3])
data = pd.DataFrame(np.random.randn(1000,6),columns = ['a','b','c','d','e','f'])
covariance = data.cov()
portfolio_variance = (weights*covariance*weights.T)['a'][0]
sigma = np.sqrt(portfolio_variance)
marginal_risk = covariance*weights.T
risk_contribution = np.multiply(marginal_risk, weights.T)/sigma
print(risk_contribution)
portfolio_variance = (weights*covariance*weights.T)
should be
portfolio_variance = (weights @ covariance @ weights.T)
This will give the portfolio variance, which should be a single number.
The same goes for marginal_risk; it should be
marginal_risk = covariance @ weights.T
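Putting those @ corrections together, a sketch of the whole calculation might look like the following; with matrix multiplication the contributions sum to the portfolio volatility sigma:

import numpy as np
import pandas as pd

weights = np.array([.1, .2, .05, .25, .1, .3])
data = pd.DataFrame(np.random.randn(1000, 6), columns=['a', 'b', 'c', 'd', 'e', 'f'])
covariance = data.cov()

# w' * Sigma * w is a scalar when @ (matrix multiplication) is used
portfolio_variance = weights @ covariance @ weights.T
sigma = np.sqrt(portfolio_variance)

# marginal risk of each asset, then its contribution; the contributions add up to sigma
marginal_risk = covariance @ weights.T
risk_contribution = np.multiply(marginal_risk, weights.T) / sigma

print(risk_contribution)
print(risk_contribution.sum(), sigma)  # these two should match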
For a given Series I want to change the value of each element around its current value and then calculate an arbitrary function (here std), as shown in the following code:
import pandas as pd
import numpy as np

a = pd.Series(np.random.randn(10))
perturb = {}

for item in range(2, len(a)):
    serturb = {}
    for ep in np.arange(-1, 1, 0.1):
        temp = a.loc[0:item].copy()   # .ix is deprecated; copy so the original series is not modified
        temp.iloc[-1] += ep
        serturb[ep] = temp.std()
    perturb[item] = pd.Series(serturb)

perturb = pd.DataFrame(perturb).T
The above code becomes too slow for a large amount of data, and many of the calculations are repeated. Applied to a DataFrame, this process would return a Panel. Is there an efficient way of doing this?
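One possible direction, sketched under two assumptions of mine: each perturbation is meant to be independent of the others, and only the last element of each prefix a[0:item] is shifted by ep. Then the perturbed standard deviations follow from prefix sums, so nothing has to be recomputed inside a Python loop:

import numpy as np
import pandas as pd

a = pd.Series(np.random.randn(10))
eps = np.arange(-1, 1, 0.1)
vals = a.to_numpy()

items = np.arange(2, len(vals))          # same prefixes as the loop above
n = items + 1                            # a[0:item] by label includes item, so the prefix has item+1 points
prefix_sum = np.cumsum(vals)[items]      # sum of each prefix
prefix_sq = np.cumsum(vals ** 2)[items]  # sum of squares of each prefix
last = vals[items]                       # the element that gets perturbed

# replacing last by last+ep shifts the sum by ep and the sum of squares by 2*last*ep + ep**2
sum_p = prefix_sum[:, None] + eps[None, :]
sq_p = prefix_sq[:, None] + 2 * last[:, None] * eps[None, :] + eps[None, :] ** 2

# sample variance with ddof=1, matching pandas' default .std()
var_p = (sq_p - sum_p ** 2 / n[:, None]) / (n[:, None] - 1)
perturb = pd.DataFrame(np.sqrt(var_p), index=items, columns=np.round(eps, 2))
print(perturb)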