perturbation analysis in pandas - python

For a given Series I want to change the value of each element around it's current value and then calculate an arbitrary function (here std) as shown in the following code:
import pandas as pd
import numpy as np
a = pd.Series(np.random.randn(10))
perturb = {}
for item in range(2,len(a)):
serturb = {}
for ep in np.arange(-1,1,0.1):
temp = a.ix[0:item]
temp.iloc[-1] += ep
serturb[ep] = temp.std()
perturb[item] = pd.Series(serturb)
perturb = pd.DataFrame(perturb).T
The above code will become too slow for a large amount of data. The above process, when applied on a DataFrame would return a Panel. Is there an efficient way of doing it, since a lot of the calculations are being repeated?

Related

parallelized groupby: apply a function to a groupby object simultaneously

I would like to perform a group by operations and for every single group estimate a linear model.
Writing a function and then using a for loop is pretty easy ,however, kind of slow.
This is a toy example but it does serve the purpose. What is, in your opinion, the "best" way of making this parallelized?
an intuitive example:
import seaborn as sns
import pandas as pd
from statsmodels.formula.api import ols
import time
# Dataset
df = sns.load_dataset("tips")
df.head()
# Groupby the dataset
df_grouped = df.groupby(["day"])
# Some function to be applied for every grouped element
def regression_model(df):
"""
This function estimates a linear regression model and returns coefs as dictionary
"""
model = ols('tip ~ total_bill + C(sex) + size', data = df)
return dict(model.fit().params)
# Performing the function in the for loop ------ Slow. We want to perform it for each grouped element simultaneously.
coefs_dict = {}
for i, j in df_grouped:
coefs_i = regression_model(j)
coefs_dict[i] = coefs_i
# Artificial sleep so we can demostrate that the "mechanical" for loop is slow....
time.sleep(2)
In this particular case I am using the 'sleep' module to make it slower to demonstrate that the for loop will take a lot of time especially if we would be grouping by much larger number of unique cathegories.
You can use multiprocessing module as suggested by #JérômeRichard and the Pool.starmap to be used with groupby
import pandas as pd
import multiprocessing
def regression_model(keys, df):
print(f'Pool: {keys}')
# do stuff here
return df
if __name__ == '__main__':
data = []
with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
data = pool.starmap(regression_model, df.groupby('day'))
df2 = pd.concat(data)

Creating Python Function to Iterate over List/DataFrame (VIF)

I have a dataset and I want to select the subset of variables with VIF(Variance Inflation Factor) smaller than a certain threshold. My idea was to calculate the VIF for every variable, then take out the variable for the highest value (if its higher than a certain threshold), recalculate the VIF for every remaining variable and repeat the process until there is no VIF higher than the treshold.
There is no novel idea in this approach but I couldn't get past a certain point to make a function to automatize this process in Python.
x is the dataset with the target variable dropped
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
x_vif = add_constant(x)
vif = pd.DataFrame([variance_inflation_factor(x_vif.values, i) for i in range(x_vif.shape[1])], index=x_vif.columns)
The vif could also be a List. So, is there any package that does that automatically or could you give me an idea how to create this function ?
I found a R library (thinXwithVIF) that could do that automatically, but I couldn't make rpy2 work with the python version that I need to use.
Maybe what would make sense is to remove the variable with the highest vif in each round, subset the dataframe and stop when all variables are lower than your threshold. I don't think vif would be be-all-and-end-all and you really have to look at the data to decide what to include etc.
import statsmodels.api as sm
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
data = sm.datasets.get_rdataset('mtcars')
x_vif = data.data.iloc[:,1:]
y = data.data['mpg']
thres = 10
while True:
Cols = range(x_vif.shape[1])
vif = np.array([variance_inflation_factor(x_vif.values, i) for i in Cols])
if all(vif < thres):
break
else:
Cols = np.delete(Cols,np.argmax(vif))
x_vif = x_vif.iloc[:,Cols]

Replacing masked data with previous values of same dataset

I am working on filling in missing data in a large (4GB) netcdf datafile (3 dimensions: time, longitude and latitude). The method is to fill in the masked values in data1 either with:
1) previous values from data1 or
2) with data from another (also masked dataset, data2) if the found value from data1 < the found value from data2.
So fare I have tried a couple of things, one is to make a very complex script with long for loops which never finished running after 24 hours. I have tried to reduce it, but i think it is still very much to complicated. I believe there is a much more simple procedure to do it than the way I am doing it now I just can't see how.
I have made a script where masked data is first replaced with zeroes in order to use the function np.where to get the index of my masked data (i did not find a function that returns the coordinates of masked data, so this is my work arround it). My problem is that my code is very long and i think time consuming for large datasets to run through. I believe there is a more simple way of doing it, but I haven't found another work arround it.
Here is what I have so fare: : (the first part is just to generate some matrices that are easy to work with):
if __name__ == '__main__':
import numpy as np
import numpy.ma as ma
from sortdata_helpers import decision_tree
# Generating some (easy) test data to try the algorithm on:
# data1
rand1 = np.random.randint(10, size=(10, 10, 10))
rand1 = ma.masked_where(rand1 > 5, rand1)
rand1 = ma.filled(rand1, fill_value=0)
rand1[0,:,:] = 1
#data2
rand2 = np.random.randint(10, size=(10, 10, 10))
rand2[0, :, :] = 1
coordinates1 = np.asarray(np.where(rand1 == 0)) # gives the locations of where in the data there are zeros
filled_data = decision_tree(rand1, rand2, coordinates1)
print(filled_data)
The functions that I defined to be called in the main script are these, in the same order as they are used:
def decision_tree(data1, data2, coordinates):
# This is the main function,
# where the decision between data1 or data2 is chosen.
import numpy as np
from sortdata_helpers import generate_vector
from sortdata_helpers import find_value
for i in range(coordinates.shape[1]):
coordinate = [coordinates[0, i], coordinates[1,i], coordinates[2,i]]
AET_vec = generate_vector(data1, coordinate) # makes vector to go back in time
AET_value = find_value(AET_vec) # Takes the vector and find closest day with data
PET_vec = generate_vector(data2, coordinate)
PET_value = find_value(PET_vec)
if PET_value > AET_value:
data1[coordinate[0], coordinate[1], coordinate[2]] = AET_value
else:
data1[coordinate[0], coordinate[1], coordinate[2]] = PET_value
return(data1)
def generate_vector(data, coordinate):
# This one generates the vector to go back in time.
vector = data[0:coordinate[0], coordinate[1], coordinate[2]]
return(vector)
def find_value(vector):
# Here the fist value in vector that is not zero is chosen as "value"
from itertools import dropwhile
value = list(dropwhile(lambda x: x == 0, reversed(vector)))[0]
return(value)
Hope someone has a good idea or suggestions on how to improve my code. I am still struggling with understanding indexing in python, and I think this can definately be done in a more smooth way than I have done here.
Thanks for any suggestions or comments,

How to calculate risk contribution of assets in Python

I'm trying to write a block of code that will allow me to identify the risk contribution of assets in a portfolio. The covariance matrix is a 6x6 pandas dataframe.
My code is as follows:
import numpy as np
import pandas as pd
weights = np.array([.1,.2,.05,.25,.1,.3])
data = pd.DataFrame(np.random.randn(1000,6),columns = 'a','b','c','d','e','f'])
covariance = data.cov()
portfolio_variance = (weights*covariance*weights.T)[0,0]
sigma = np.sqrt(portfolio_variance)
marginal_risk = covariance*weights.T
risk_contribution = np.multiply(marginal_risk, weights.T)/sigma
print(risk_contribution)
When I try to run the code I get a KeyError, and if I remove the [0,0] from portfolio_variance I get output that doesn't seem to make sense.
Can somebody point me to my error(s)?
Three problems with your code:
Open your list operator square brackets on line 6:
data = pd.DataFrame(np.random.randn(1000,6),columns = ['a','b','c','d','e','f'])
You're using the two dimensional indexing operator wrong. You can't say [0,0], you have to say [0][0].
And last, because you named the columns, you have to use them when indexing, so it's actually ['a'][0]:
portfolio_variance = (weights*covariance*weights.T)['a'][0]
Final working code:
import numpy as np
import pandas as pd
weights = np.array([.1,.2,.05,.25,.1,.3])
data = pd.DataFrame(np.random.randn(1000,6),columns = ['a','b','c','d','e','f'])
covariance = data.cov()
portfolio_variance = (weights*covariance*weights.T)['a'][0]
sigma = np.sqrt(portfolio_variance)
marginal_risk = covariance*weights.T
risk_contribution = np.multiply(marginal_risk, weights.T)/sigma
print(risk_contribution)
portfolio_variance =(weights*covariance*weights.T)
portfolio_variance should be
portfolio_variance =(weights#covariance#weights.T)
This will provide the portfolio variance, which should be a single number.
same for marginal risk, it should be
marginal_risk = covariance#weights.T

How to make my python integration faster?

Hi i want to integrate a function from 0 to several different upper limits (around 1000). I have written a piece of code to do this using a for loop and appending each value to an empty array. However i realise i could make the code faster by doing smaller integrals and then adding the previous integral result to the one just calculated. So i would be doing the same number of integrals, but over a smaller interval, then just adding the previous integral to get the integral from 0 to that upper limit. Heres my code at the moment:
import numpy as np #importing all relevant modules and functions
from scipy.integrate import quad
import pylab as plt
import datetime
t0=datetime.datetime.now() #initial time
num=np.linspace(0,10,num=1000) #setting up array of values for t
Lt=np.array([]) #empty array that values for L(t) are appended to
def L(t): #defining function for L
return np.cos(2*np.pi*t)
for g in num: #setting up for loop to do integrals for L at the different values for t
Lval,x=quad(L,0,g) #using the quad function to get the values for L. quad takes the function, where to start the integral from, where to end the integration
Lv=np.append(Lv,[Lval]) #appending the different values for L at different values for t
What changes do I need to make to do the optimisation technique I've suggested?
Basically, we need to keep track of the previous values of Lval and g. 0 is a good initial value for both, since we want to start by adding 0 to the first integral, and 0 is the start of the interval. You can replace your for loop with this:
last, lastG = 0, 0
for g in num:
Lval,x = quad(L, lastG, g)
last, lastG = last + Lval, g
Lv=np.append(Lv,[last])
In my testing, this was noticeably faster.
As #askewchan points out in the comments, this is even faster:
Lv = []
last, lastG = 0, 0
for g in num:
Lval,x = quad(L, lastG, g)
last, lastG = last + Lval, g
Lv.append(last)
Lv = np.array(Lv)
Using this function:
scipy.integrate.cumtrapz
I was able to reduce time to below machine precision (very small).
The function does exactly what you are asking for in a highly efficient manner. See docs for more info: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.integrate.cumtrapz.html
The following code, which reproduces your version first and then mine:
# Module Declarations
import numpy as np
from scipy.integrate import quad
from scipy.integrate import cumtrapz
import time
# Initialise Time Array
num=np.linspace(0,10,num=1000)
# Your Method
t0 = time.time()
Lv=np.array([])
def L(t):
return np.cos(2*np.pi*t)
for g in num:
Lval,x=quad(L,0,g)
Lv=np.append(Lv,[Lval])
t1 = time.time()
print(t1-t0)
# My Method
t2 = time.time()
functionValues = L(num)
Lv_Version2 = cumtrapz(functionValues, num, initial=0)
t3 = time.time()
print(t3-t2)
Which consistently yields:
t1-t0 = O(0.1) seconds
t3-t2 = 0 seconds

Categories

Resources