Improving performance of Cronbach Alpha code python numpy - python

I wrote some code for calculating Cronbach's alpha that works, but I am not very good with lambda functions. Is there a way to shorten the code and make it more efficient by using a lambda instead of the svar() function, and by replacing some of the for loops with numpy arrays?
import numpy as np
def svar(X):
    # sample variance (denominator n - 1)
    n = float(len(X))
    svar = (sum([(x - np.mean(X))**2 for x in X]) / n) * n / (n - 1.)
    return svar
def CronbachAlpha(itemscores):
    # rows are items, columns are observations
    itemvars = [svar(item) for item in itemscores]
    tscores = [0] * len(itemscores[0])
    for item in itemscores:
        for i in range(len(item)):
            tscores[i] += item[i]
    nitems = len(itemscores)
    # print("total scores=", tscores, 'number of items=', nitems)
    Calpha = nitems / (nitems - 1.) * (1 - sum(itemvars) / svar(tscores))
    return Calpha
########### Test ################
itemscores = [[4, 14, 3, 3, 23, 4, 52, 3, 33, 3],
              [5, 14, 4, 3, 24, 5, 55, 4, 15, 3]]
print("Cronbach alpha = ", CronbachAlpha(itemscores))

import numpy as np
def CronbachAlpha(itemscores):
    # rows are items, columns are observations
    itemscores = np.asarray(itemscores)
    itemvars = itemscores.var(axis=1, ddof=1)
    tscores = itemscores.sum(axis=0)
    nitems = len(itemscores)
    return nitems / (nitems-1.) * (1 - itemvars.sum() / tscores.var(ddof=1))
NumPy has a variance function built in. Specifying ddof=1 uses a denominator of N-1, giving a sample variance. There's also a sum builtin.

As Julien Marrec mentioned, I suggest the following refactoring of CronbachAlpha:
def CronbachAlpha(itemscores):
    # cols are items, rows are observations
    itemscores = np.asarray(itemscores)
    itemvars = itemscores.var(axis=0, ddof=1)
    tscores = itemscores.sum(axis=1)
    # a NumPy array has no .columns attribute, so count the columns via shape
    nitems = itemscores.shape[1]
    return (nitems / (nitems-1)) * (1 - (itemvars.sum() / tscores.var(ddof=1)))

Same as the other answers, just a bit more Pythonic. X is a data matrix -- that is, the rows are samples, the columns are items. X may be a numpy array or pandas DataFrame.
def cronbach_alpha(X):
    num_items = X.shape[1]
    sum_of_item_variances = X.var(axis=0).sum()
    variance_of_sum_of_items = X.sum(axis=1).var()
    return num_items/(num_items - 1)*(1 - sum_of_item_variances/variance_of_sum_of_items)
(It's not necessary to specify ddof: as long as the same value is used for both variances, the correction factor appears in both the numerator and the denominator and cancels.)
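One way to check that the vectorized versions agree with the loop-based code, and that ddof really cancels, is to run both on the test data from the question. This is only a sketch: the names cronbach_alpha_loop and cronbach_alpha_np and the ddof parameter are made up for illustration, and rows are assumed to be items, as in the question.

```python
import numpy as np

def cronbach_alpha_loop(itemscores):
    """Loop-based reference: rows are items, columns are observations."""
    def svar(xs):
        # sample variance with denominator n - 1
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    nitems = len(itemscores)
    tscores = [sum(col) for col in zip(*itemscores)]  # total score per observation
    itemvars = [svar(item) for item in itemscores]
    return nitems / (nitems - 1) * (1 - sum(itemvars) / svar(tscores))

def cronbach_alpha_np(itemscores, ddof=1):
    """Vectorized version; ddof is exposed only to show that it cancels."""
    scores = np.asarray(itemscores, dtype=float)
    nitems = scores.shape[0]
    item_var_sum = scores.var(axis=1, ddof=ddof).sum()
    total_var = scores.sum(axis=0).var(ddof=ddof)
    return nitems / (nitems - 1) * (1 - item_var_sum / total_var)

itemscores = [[4, 14, 3, 3, 23, 4, 52, 3, 33, 3],
              [5, 14, 4, 3, 24, 5, 55, 4, 15, 3]]
alpha_loop = cronbach_alpha_loop(itemscores)
alpha_ddof1 = cronbach_alpha_np(itemscores, ddof=1)
alpha_ddof0 = cronbach_alpha_np(itemscores, ddof=0)
print(alpha_loop, alpha_ddof1, alpha_ddof0)
```

All three printed values should agree up to floating-point rounding.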

Related

Vectorization for computing variance of a vector split at different points

I have a 1-D array arr and I need to compute the variance of all possible contiguous subvectors that begin at position 0. It may be easier to understand with a for loop:
np.random.seed(1)
arr = np.random.normal(size=100)
res = []
for i in range(1, arr.size+1):
    subvector = arr[:i]
    var = np.var(subvector)
    res.append(var)
Is there any way to compute res without the for loop?
Yes. Since var = sum_squares / N - mean**2 and mean = sum / N, you can use cumsum to get the cumulative sums:
cumsum = np.cumsum(arr)
cummean = cumsum/(np.arange(len(arr)) + 1)
sq = np.cumsum(arr**2)
# this is the population variance (ddof=0); rescale here if you need a different ddof
cumvar = sq/(np.arange(len(arr))+1) - cummean**2
np.allclose(res, cumvar)
# True
With pandas, you could use expanding:
import pandas as pd
pd.Series(arr).expanding().var(ddof=0).values
NB. One advantage is that you can use the parameters of var (ddof defaults to 1 in pandas), and of course you can apply many other expanding methods.
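If the sample variance (ddof=1) is wanted from the cumsum approach, the population variance can be rescaled by n / (n - 1). The sketch below assumes that is what the "correct the dof" comment refers to; the names pop_var and sample_var are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
arr = rng.normal(size=100)

n = np.arange(1, arr.size + 1)                    # prefix lengths 1..N
cummean = np.cumsum(arr) / n
pop_var = np.cumsum(arr ** 2) / n - cummean ** 2  # ddof=0 variance of arr[:i]

# Rescale to ddof=1. A single element has no sample variance, so keep NaN there.
sample_var = np.full(arr.size, np.nan)
sample_var[1:] = pop_var[1:] * n[1:] / (n[1:] - 1)

# Reference values from the explicit loop, starting at prefix length 2.
expected = np.array([np.var(arr[:i], ddof=1) for i in range(2, arr.size + 1)])
print(np.allclose(sample_var[1:], expected))
```

Note that the sum-of-squares formula can lose precision when the mean is large compared with the spread of the data, so it is worth comparing against the loop on realistic input.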

EWMA Covariance Matrix in Pandas - Optimization

I would like to calculate the EWMA Covariance Matrix from a DataFrame of stock price returns using Pandas and have followed the methodology in PyPortfolioOpt.
I like the flexibility of using Pandas objects and functions, but when the set of assets grows the function becomes very slow:
import pandas as pd
import numpy as np
def ewma_cov_pairwise_pd(x, y, alpha=0.06):
    x = x.mask(y.isnull(), np.nan)
    y = y.mask(x.isnull(), np.nan)
    covariation = ((x - x.mean()) * (y - y.mean())).dropna()
    # use the alpha argument rather than a hard-coded 0.06
    return covariation.ewm(alpha=alpha).mean().iloc[-1]
def ewma_cov_pd(rets, alpha=0.06):
    assets = rets.columns
    n = len(assets)
    cov = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            cov[i, j] = cov[j, i] = ewma_cov_pairwise_pd(
                rets.iloc[:, i], rets.iloc[:, j], alpha=alpha)
    return pd.DataFrame(cov, columns=assets, index=assets)
I would like to improve the speed of the code, ideally while still using Pandas, but the bottleneck is within the DataFrame.ewm() function, which uses 90% of the calculation time.
If using this function were a binding constraint, what would be the most efficient way of improving the speed at which the code runs? I was considering taking a brute-force approach and using concurrent.futures.ProcessPoolExecutor, but perhaps there is a better solution.
n = 100 # n is typically 2000
rets = pd.DataFrame(np.random.normal(0, 1., size=(n, n)))
cov_pd = ewma_cov_pd(rets)
The true time-series data can contain leading nulls and potentially missing values after that, although the latter is less likely.
Update I
A potential solution which leverages off the answer provided by Quang Hoang and produces the expected results in a far more reasonable time would be something similar to:
def ewma_cov_frame_qh(rets, alpha=0.06):
    weights = (1-alpha) ** np.arange(len(rets))[::-1]
    normalized = (rets-rets.mean()).to_numpy()
    out = (weights * normalized.T) @ normalized / weights.sum()
    return pd.DataFrame(out, index=rets.columns, columns=rets.columns)
def ewma_cov_qh(rets, alpha=0.06):
    syms = rets.columns
    covar = pd.DataFrame(index=rets.columns, columns=rets.columns)
    delta = rets.isnull().sum(axis=1).shift(1) - rets.isnull().sum(axis=1)
    dates = delta.loc[delta != 0].index.tolist()
    for date in dates:
        frame = rets.loc[rets.index >= date].dropna(axis=1, how='any')
        cov = ewma_cov_frame_qh(frame, alpha=alpha).reindex(index=syms, columns=syms)
        covar = covar.fillna(cov)
    return covar
cov_qh = ewma_cov_qh(rets)
This violates the requirement that the underlying covariance is calculated using the native Pandas/Numpy functions, and the calculation time will depend on the number of leading NaNs in the data set.
Update II
A potential improvement on the above which uses (a naive implementation of) multiprocessing and improves the calculation time by a further 42.5% on my machine is listed below:
from concurrent.futures import ProcessPoolExecutor, as_completed
from functools import partial
def ewma_cov_mp_worker(date, rets, alpha=0.06):
    syms = rets.columns
    frame = rets.loc[rets.index >= date].dropna(axis=1, how='any')
    return ewma_cov_frame_qh(frame, alpha=alpha).reindex(index=syms, columns=syms)
def ewma_cov_mp(rets, alpha=0.06):
    covar = pd.DataFrame(index=rets.columns, columns=rets.columns)
    delta = rets.isnull().sum(axis=1).shift(1) - rets.isnull().sum(axis=1)
    dates = delta.loc[delta != 0].index.tolist()
    func = partial(ewma_cov_mp_worker, rets=rets, alpha=alpha)
    covs = {}
    # named executor to avoid shadowing the built-in exec
    with ProcessPoolExecutor(max_workers=6) as executor:
        future_to_date = {executor.submit(func, date): date for date in dates}
        covs = {future_to_date[future]: future.result() for future in as_completed(future_to_date)}
    for date in dates:
        covar.fillna(covs[date], inplace=True)
    return covar
[I have not added this as an answer because it does not address the original question, and I am optimistic there is a better solution.]
Since you don't really care about ewm itself, i.e. you only take the last value, we can try matrix multiplication:
def ewma(df, alpha=0.06):
    # default alpha set to 0.06 to match the question; the weights are (1-alpha)**k
    weights = (1-alpha) ** np.arange(len(df))[::-1]
    # fillna with 0 here
    normalized = (df-df.mean()).fillna(0).to_numpy()
    out = (weights * normalized.T) @ normalized / weights.sum()
    return out
# verify (df is the returns DataFrame, called rets in the question)
out = ewma(df)
print(np.isclose(out[0,1], ewma_cov_pairwise_pd(df[0], df[1])))
# True
And this took about 150 ms on my system with df.shape==(2000,2000) while your code refuses to run within minutes :-).
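To see why a single matrix product reproduces the pairwise results, the following NumPy-only sketch compares it with an explicit pairwise weighted mean on small random data. It assumes there are no missing values and uses weights (1 - alpha)**k, as in the code above; ewma_cov_matrix and ewma_cov_pair are hypothetical helper names.

```python
import numpy as np

def ewma_cov_matrix(returns, alpha=0.06):
    """Exponentially weighted covariance of the columns of a 2-D array (no NaNs)."""
    n_obs = returns.shape[0]
    weights = (1 - alpha) ** np.arange(n_obs)[::-1]  # newest row gets weight 1
    centered = returns - returns.mean(axis=0)
    # (weights * centered.T) scales each observation; @ sums over observations
    return (weights * centered.T) @ centered / weights.sum()

def ewma_cov_pair(x, y, alpha=0.06):
    """Same quantity for a single pair of columns, written out explicitly."""
    weights = (1 - alpha) ** np.arange(len(x))[::-1]
    prod = (x - x.mean()) * (y - y.mean())
    return np.sum(weights * prod) / weights.sum()

rng = np.random.default_rng(0)
rets = rng.normal(size=(50, 4))
cov = ewma_cov_matrix(rets)
pairwise = np.array([[ewma_cov_pair(rets[:, i], rets[:, j]) for j in range(4)]
                     for i in range(4)])
print(np.allclose(cov, pairwise))
```

With missing values the two approaches can differ, because filling NaNs with 0 is not the same as dropping rows pair by pair, so the result should be checked against the pairwise version on real data.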

Normalization of a vector using loops in Python

Write a function that normalizes a vector (finds the unit vector). A vector can be normalized by dividing each individual component of the vector by its magnitude. Your input for this function will be a vector i.e. 1 dimensional list containing 3 integers.
My solution so far assumes a predefined list of 3 elements. Please explain how I could reach the solution using loops instead. This is what I have so far:
from math import sqrt
def vector_normalization(my_vector):
    result = 0
    for x in my_vector:
        result = result + (x ** 2)
    magnitude = sqrt(result)
    nx_vector = my_vector[0] / magnitude
    ny_vector = my_vector[1] / magnitude
    nz_vector = my_vector[2] / magnitude
    n_vector = [nx_vector, ny_vector, nz_vector]
    return n_vector
After I calculate the magnitude with a for loop over some arbitrary list, my program still returns only three elements. I want every element of the list to be normalized. Please suggest a way to achieve this.
Also, you can use higher-order functions in Python like map:
from math import sqrt
vec = [1,2,3]
magnitude = sqrt(sum(map(lambda x: x**2, vec)))
normalized_vec = list(map(lambda x: x/magnitude, vec))
So normalized_vec becomes:
[0.2672612419124244, 0.5345224838248488, 0.8017837257372732]
Or using Numpy:
import numpy as np
arr = np.array([1,2,3])
arr_normalized = arr/np.sqrt(np.sum(arr**2))
arr_normalized results in:
array([ 0.26726124, 0.53452248, 0.80178373])
Please try the following code,
vector = [1,2,4]
y=0
for x in vector:
    y+=x**2
y = y**0.5
unit_vector = []
for x in vector:
    unit_vector.append(x/y)
Hope this helps.
def vector_normalization(vec):
    result = 0
    for x in vec:
        result = result + (x**2)
    magnitude = (result)**0.5
    x = vec[0]/magnitude
    y = vec[1]/magnitude
    z = vec[2]/magnitude
    vec = [x,y,z]
    return vec
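The loop-based idea can be wrapped in a function that works for a list of any length rather than exactly three components. This is a minimal sketch; the function name normalize and the choice to raise ValueError for a zero vector are assumptions, not part of the original exercise.

```python
from math import sqrt

def normalize(vector):
    """Return the unit vector for a list of numbers of any length."""
    total = 0.0
    for x in vector:              # first loop: sum of squares
        total += x ** 2
    magnitude = sqrt(total)
    if magnitude == 0:
        # a zero vector has no direction, so it cannot be normalized
        raise ValueError("cannot normalize the zero vector")
    result = []
    for x in vector:              # second loop: divide every component
        result.append(x / magnitude)
    return result

unit = normalize([3, 4])
longer = normalize([1, 2, 3, 4, 5])
print(unit)
print(len(longer))
```

The second loop is what removes the three-element restriction: it visits every component instead of indexing positions 0, 1 and 2 by hand.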

iteration through pandas dataframes and numpy

I am coming from a Java background and am new to numpy and pandas.
I want to translate the following pseudo code into python.
theta[0...D] - numpy
input[1...D][0...N-1] - Pandas data frame
PSEUDO CODE:
mean = theta[0]
for(row = 0 to N-1)
    for(col = 1 to D)
        mean += theta[col] * input[row][col]
Implementation:
class simulator:
    theta = np.array([])
    stddev = 0
    def __init__(self, v_coefficents, v_stddev):
        self.theta = v_coefficents
        self.stddev = v_stddev
    def sim( self, input ):
        mean = self.theta[0]
        D = input.shape[0]
        N = input.shape[1]
        for index, row in input.iterrows():
            mean = self.theta[0]
            for i in range(D):
                mean += self.theta[i+1] *row['y']
I am concerned with iteration in the last line of code:
mean += self.theta[i+1] *row['y'].
Since you are working with NumPy, I would suggest extracting the pandas dataframe as an array and then we would have the luxury of working with theta and the extracted version of input both as arrays.
Thus, starting off we would have the array as -
input_arr = input.values
Then, the translation of the pseudo code would be -
mean = theta[0]
for row in range(N):
    for col in range(1,D+1):
        mean += theta[col] * input_arr[row,col]
Since NumPy supports vectorized operations and broadcasting, the sum-reductions collapse to simply -
mean = theta[0] + (theta[1:D+1]*input_arr[:,1:D+1]).sum()
This could be optimized further with np.dot as a matrix-multiplication, like so -
mean = theta[0] + np.dot(input_arr[:,1:D+1], theta[1:D+1]).sum()
Please note that if you meant that input has only D columns (no extra column at index 0), then we need a few edits:
Loopy code would have: input_arr[row,col-1] instead of input_arr[row,col].
Vectorized codes would have: input_arr instead of input_arr[:,1:D+1].
Sample run based on comments -
In [71]: df = {'y' : [1,2,3,4,5]}
    ...: data_frame = pd.DataFrame(df)
    ...: test_coefficients = np.array([1,2,3,4,5,6])
    ...:
In [79]: input_arr = data_frame.values
    ...: theta = test_coefficients
    ...:
In [80]: theta[0] + np.dot(input_arr[:,0], theta[1:])
Out[80]: 71
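The equivalence between the nested loops and the vectorized expression can be checked on random data. The sketch below assumes the layout where input_arr has exactly D columns (so the loop indexes col-1); the per-row variant at the end is an assumption about what the sim method may have intended, since it resets mean for every row.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 3
theta = rng.normal(size=D + 1)       # theta[0] is the intercept
input_arr = rng.normal(size=(N, D))  # assumed layout: N rows, D columns

# Direct translation of the pseudo code (one running total over all rows).
mean_loop = theta[0]
for row in range(N):
    for col in range(1, D + 1):
        mean_loop += theta[col] * input_arr[row, col - 1]

# Vectorized equivalent: matrix-vector product, then sum over rows.
mean_vec = theta[0] + np.dot(input_arr, theta[1:]).sum()

# If one value per row is wanted instead, skip the final sum.
row_means = theta[0] + input_arr @ theta[1:]

print(np.isclose(mean_loop, mean_vec), row_means.shape)
```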

how to vectorize a matrix sum in a for loop using numpy?

Basically I have a matrix with rows=3600 and columns=5 and wish to downsample it to parcels of 60 rows:
import numpy as np
X = np.random.rand(3600,5)
down_sample = 60
ds_rng = range(0,X.shape[0],down_sample)
X_ds = np.zeros((len(ds_rng),X.shape[1]))
i = 0
for j in ds_rng:
    X_ds[i,:] = np.sum( X[j:j+down_sample,:], axis=0 )
    i += 1
Another way to do this might be:
def blockwise_sum(X, down_sample=60):
    n, m = X.shape
    # integer division, so ds_n can be used as a shape and slice bound
    ds_n = n // down_sample
    N = ds_n * down_sample
    if N == n:
        return np.sum(X.reshape(-1, down_sample, m), axis=1)
    X_ds = np.zeros((ds_n + 1, m))
    X_ds[:ds_n] = np.sum(X[:N].reshape(-1, down_sample, m), axis=1)
    X_ds[-1] = np.sum(X[N:], axis=0)
    return X_ds
I don't know if it's any faster though.
At least in this case, einsum is faster than sum.
np.einsum('ijk->ik',X.reshape(-1,down_sample,X.shape[1]))
is about 2x faster than blockwise_sum.
My timings:
OP iterative - 1.59 ms
with strided - 198 us
blockwise_sum - 179 us
einsum - 76 us
Looks like you can use some stride tricks to get the job done.
Here's the setup code we'll need:
import numpy as np
X = np.random.rand(1000,5)
down_sample = 60
And now we trick numpy into thinking X is split into parcels:
num_parcels = int(np.ceil(X.shape[0] / float(down_sample)))
X_view = np.lib.stride_tricks.as_strided(X, shape=(num_parcels,down_sample,X.shape[1]))
X_ds = X_view.sum(axis=1) # sum over the down_sample axis
Finally, if your downsampling interval doesn't divide the number of rows evenly, you'll need to fix up the last row in X_ds, because the last parcel of the strided view extends past the end of X's data and its sum is not meaningful.
rem = X.shape[0] % down_sample
if rem != 0:
    X_ds[-1] = X[-rem:].sum(axis=0)
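Another option that handles a ragged last parcel without a separate fix-up step is np.add.reduceat, which sums the rows between consecutive start indices and lets the final segment run to the end of the array. This is offered as an alternative sketch, not as the method from the answers above, and no timing claim is made for it; downsample_sum is a hypothetical name.

```python
import numpy as np

def downsample_sum(X, down_sample=60):
    """Sum consecutive blocks of down_sample rows; the last block may be shorter."""
    starts = np.arange(0, X.shape[0], down_sample)  # first row of each parcel
    return np.add.reduceat(X, starts, axis=0)

rng = np.random.default_rng(0)
X = rng.random((1000, 5))                           # 1000 is not a multiple of 60
X_ds = downsample_sum(X)

# Reference result from the explicit loop used in the question.
expected = np.array([X[j:j + 60].sum(axis=0) for j in range(0, 1000, 60)])
print(X_ds.shape, np.allclose(X_ds, expected))
```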
