Efficient calculation of rolling Pearson correlation - Python

As shown in the question Calculating rolling correlation of pandas dataframes, I need to compute the correlation of an array of length N with each window of a second array of length M.
import numpy as np

x = np.random.randint(0, 100, 10000)
y = [4, 5, 4, 5]
corrs = []
for i in range(0, (len(x) - len(y)) + 1):
    corrs.append(np.corrcoef(x[i:i + len(y)], y)[0, 1])
Every similar question I find discusses how to do it on a matrix, for NxK against MxK. However, the ones I try do not work for 1d data. In the linked question, the suggestion is to use a rolling window over the pandas DataFrame, which is pretty slow. Is there a faster way to calculate this?
The above code takes around 0.4 s, while the code from the linked example takes 1.6 s:
# here x is a pandas Series rather than a numpy array (rolling is a pandas method)
corr = x.rolling(4).apply(lambda x: np.corrcoef(x, y)[0, 1], raw=False).dropna(how='all', axis=0)
Is there a much more efficient way to do this?

Store your correlation coefficients in a preallocated NumPy array instead of a regular Python list (the list has to keep growing as you insert elements):
corrs = np.zeros(len(x) - len(y) + 1)
for i in range(0, (len(x) - len(y)) + 1):
    corrs[i] = np.corrcoef(x[i:i + len(y)], y)[0, 1]
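For a larger speedup, the whole loop can be vectorised. This is a minimal sketch, not part of the original answer, assuming NumPy >= 1.20 for sliding_window_view:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# every window of x with the same length as y: shape (len(x) - len(y) + 1, len(y))
windows = sliding_window_view(x, len(y))

# centre the windows and y, then apply the Pearson formula row-wise
wc = windows - windows.mean(axis=1, keepdims=True)
yc = np.asarray(y) - np.mean(y)
corrs = (wc @ yc) / (np.linalg.norm(wc, axis=1) * np.linalg.norm(yc))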

Related

Most efficient way of computing pairwise cosine similarity for large DataFrame

I have a 300,000-row pd.DataFrame with multiple columns, one of which holds a 50-dimensional numpy array of shape (1, 50), like so:
ID Array1
1 [2.4252 ... 5.6363]
2 [3.1242 ... 9.0091]
3 [6.6775 ... 12.958]
...
300000 [0.1260 ... 5.3323]
I then generate a new numpy array (let's call it array2) with the same shape and calculate the cosine similarity between each row of the dataframe and the generated array. For this, I am currently using sklearn.metrics.pairwise.cosine_similarity and save the results in a new column:
from sklearn.metrics.pairwise import cosine_similarity
df['Cosine'] = cosine_similarity(df['Array1'].tolist(), array2)
This works as intended and takes, on average, 2.5 seconds to execute. I am currently trying to get this under 1 second, simply to reduce waiting time in the system I am building.
I am beginning to learn about Vaex and Dask as alternatives to pandas but am failing to convert the code I provided to a working equivalent that is also faster.
Preferably with one of the technologies I mentioned, how can I go about making pairwise cosine calculations even faster for large datasets?
You could use Faiss here and apply a knn operation. To do this, you would put the dataframe into a Faiss index and then search it with the generated array, using k equal to the total number of rows of your dataframe (300,000 here).
import faiss
import numpy as np

dimension = 50  # length of each vector stored in Array1
index = faiss.IndexFlatIP(dimension)

# add the rows of the dataframe into Faiss as one contiguous float32 matrix
vectors = np.vstack(df['Array1'].values).astype('float32')
index.add(vectors)

# search with the generated array (array2 in the question), shaped (1, dimension)
query = np.ascontiguousarray(array2, dtype='float32').reshape(1, dimension)
k = len(df)
D, I = index.search(query, k)
Note that you'll need to normalise the vectors for this to work (since the solution above is based on inner product).
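A rough sketch of that normalisation step, using faiss.normalize_L2 with the variable names from the snippet above; the calls have to happen before index.add and index.search so that the inner-product search returns cosine similarity:
# normalises in place; expects contiguous float32 arrays
faiss.normalize_L2(vectors)
faiss.normalize_L2(query)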

Speed up iteration over DataFrame items

I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
from tqdm import tqdm

def calculate_dfA(df_t, xout):
    df_A = df_t.copy()
    vector_x = xout.T
    for index_col, column in tqdm(df_A.iteritems()):
        for index_row, row in df_A.iterrows():
            df_A.iloc[index_row, index_col] = df_A.iloc[index_row, index_col] / vector_x.iloc[0, index_col]
    return df_A
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm the processing speed is roughly 4.5 s/it, so the calculation would require approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?
You need to vectorize your division:
result = df_A.values/vector_x
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
Edit: coming back to this answer today, I noticed that converting to a numpy array first speeds up the computation. Locally I get a 10x speedup for an array of a size similar to the one in the question above, so I have edited my answer accordingly.
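For illustration, a minimal sketch of the broadcasted division with made-up shapes (the real df_t is 14839 x 14839):
import numpy as np
import pandas as pd

# hypothetical small stand-ins for df_t (the values) and xout (one divisor per column)
df_t = pd.DataFrame(np.random.rand(1000, 1000))
xout = pd.DataFrame(np.random.rand(1000, 1))

vector_x = xout.T.to_numpy()                 # shape (1, 1000)
result = df_t.to_numpy() / vector_x          # broadcasts: column j is divided by vector_x[0, j]
df_A = pd.DataFrame(result, index=df_t.index, columns=df_t.columns)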
I'm on mobile now, but you should try to avoid every for loop in Python - there's always a better way.
For one, I know you can multiply a pandas column (Series) by another column to get your desired result.
I think that to multiply every column by the matching column of another DataFrame you would still need to iterate (but only with one for loop => performance boost).
I would strongly recommend that you temporarily convert to a numpy ndarray and work with that.
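A small sketch of that single-loop, column-wise idea (hypothetical small frames; the broadcasted numpy division in the previous answer is still the faster route):
import numpy as np
import pandas as pd

# divide every column of df_A by the matching entry of vector_x, with one loop over columns
df_A = pd.DataFrame(np.random.rand(5, 3), columns=['a', 'b', 'c'])
vector_x = pd.Series([2.0, 4.0, 8.0], index=['a', 'b', 'c'])

for col in df_A.columns:
    df_A[col] = df_A[col] / vector_x[col]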

How to make a simple stock price simulation process more efficient in Python?

I have some very simple code written to simulate a stock price assuming random movement between -2% and +2% a day (it's overly simplistic, but for demonstration purposes I figured it was easier than using a GBM formula).
The issue I have is that it's very slow, I understand that it's because I'm using double loops. From what I understand I might be able to use vectorization, but I can't figure out how.
Basically what I did was create 100 simulations assuming 256 trading days in a year, each day the previous stock price is multiplied by a random number between .98 and 1.02.
I currently do this using a nested for loop. As I gather this is not good, but as a novice I'm having a hard time vectorizing. I've read about it online, and from what I understand, instead of using loops you would try to convert both of them into matrices and use matrix multiplication, but I'm unsure how to apply that here. Can anyone point me in the right direction?
from numpy import exp, sqrt, log, linspace
from random import gauss
from random import uniform
import pandas as pd
nsims = 100
stpx = 100
days = 256
mainframe = pd.DataFrame(0, index = list(range(1,days)), columns = list(range(1,nsims)))
mainframe.iloc[0] = stpx
for i in range(0, nsims-1):
    for x in range(1, days-1):
        mainframe.iloc[x, i] = mainframe.iloc[x-1, i] * uniform(.98, 1.02)
Vectorization can be tricky when one calculation relies on the result of a previous calculation, like in this case where day x needs to know the results from day x-1. I won't say it can't be done as quite possibly somebody can find a way, but here's my solution that at least gets rid of one of the loops. We still loop through days, but we do all 100 simulations at once by generating an array of random numbers and making use of numpy's element-wise multiplication (which is much faster than using a loop):
You will need to add the following import:
import numpy as np
And then replace your nested loop with this single loop:
for x in range(1, days-1):
    mainframe.iloc[x] = mainframe.iloc[x-1] * np.random.uniform(0.98, 1.02, nsims-1)
Edit to add: because you are using a very simple formula which only involves basic multiplication, you actually can get rid of both loops by generating a random matrix of numbers, using numpy's cumulative product function column-wise, and multiplying it by a DataFrame where each value begins at 100. I'm not sure such an approach would be viable if you started using a more complicated formula though. Here it is anyway:
import pandas as pd
import numpy as np
nsims = 100
stpx = 100
days = 256
mainframe = pd.DataFrame(stpx, index=list(range(1, days)), columns=list(range(1, nsims)))
rand_matrix = np.random.uniform(0.98, 1.02, (days-2, nsims-1)).cumprod(axis=0)
mainframe.iloc[1:] *= rand_matrix
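For reference, a plain-NumPy sketch of the same cumulative-product idea without the DataFrame (not part of the original answer; sizes are chosen so there is one row per day, with day 0 at the starting price):
import numpy as np

nsims, stpx, days = 100, 100, 256

# day-over-day random factors for every simulation, accumulated along the time axis
factors = np.random.uniform(0.98, 1.02, (days - 1, nsims)).cumprod(axis=0)

# prepend a row of ones for day 0, then scale every path by the starting price
paths = stpx * np.vstack([np.ones((1, nsims)), factors])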

Vectorize an operation in Numpy

I am trying to do the following on Numpy without using a loop :
I have a matrix X of dimensions N*d and a vector y of dimension N.
y contains integers ranging from 1 to K.
I am trying to get a matrix M of size K*d, where M[i,:]=np.mean(X[y==i,:],0)
Can I achieve this without using a loop?
With a loop, it would go something like this.
import numpy as np

N = 3
d = 3
K = 2
X = np.eye(N)
y = np.random.randint(1, K+1, N)
M = np.zeros((K, d))
for i in np.arange(0, K):
    line = X[y == i+1, :]
    if line.size == 0:
        M[i, :] = np.zeros(d)
    else:
        M[i, :] = np.mean(line, 0)
Thank you in advance.
The code is basically collecting specific rows of X and adding them up, for which we have a NumPy builtin in np.add.reduceat. So, with that in focus, the steps to solve it in a vectorized way could be as listed next -
# Get sort indices of y
sidx = y.argsort()
# Collect rows off X based on their IDs so that they come in consecutive order
Xr = X[np.arange(N)[sidx]]
# Get unique row IDs, start positions of each unique ID
# and their counts to be used for average calculations
unq,startidx,counts = np.unique((y-1)[sidx],return_index=True,return_counts=True)
# Add rows off Xr based on the slices signified by the start positions
vals = np.true_divide(np.add.reduceat(Xr,startidx,axis=0),counts[:,None])
# Setup output array and set row summed values into it at unique IDs row positions
out = np.zeros((K,d))
out[unq] = vals
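Putting the steps above together as a runnable sketch with the question's toy sizes (same logic as the snippet above, just consolidated):
import numpy as np

N, d, K = 3, 3, 2
X = np.eye(N)
y = np.random.randint(1, K + 1, N)

sidx = y.argsort()
Xr = X[sidx]
unq, startidx, counts = np.unique((y - 1)[sidx], return_index=True, return_counts=True)
vals = np.add.reduceat(Xr, startidx, axis=0) / counts[:, None]

out = np.zeros((K, d))
out[unq] = vals
print(out)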
The following also solves the question, but it creates an intermediate K×N boolean matrix and doesn't use the built-in mean function, which may lead to worse performance or worse numerical stability in some cases. It lets the class labels range from 0 to K-1 rather than 1 to K.
import numpy as np

# Define constants
K, N, d = 10, 1000, 3
# Sample data
Y = np.random.randint(0, K-1, N)  # K-1 to omit one class, to test the no-examples case
X = np.random.randn(N, d)
# Calculate means for each class, vectorized:
# map samples to labels by taking a logical "outer product"
mark = Y[None, :] == np.arange(0, K)[:, None]
# Count the number of examples in each class
count = mark.sum(axis=1)
# Avoid division by zero if a class has no examples
count += count == 0
# Sum within each class and normalize
M = (mark.dot(X).T / count).T
print(M, M.shape, mark.shape)

Comparison of two lists with NaN - Python

I believe this is a simple question and I looked for related topics, but I didn't find the right thing. Here is the problem:
I have two NumPy arrays on which I need to do some statistical analysis by calculating a few criteria, for example the correlation coefficient and the Nash criterion (for those who are familiar with Nash). Since the first array holds observation data (the second holds simulation results), it contains some NaNs. I would like my program to compute the criteria while ignoring the value pairs where the value in the first array is NaN.
I tried the mask method. It worked well when I only needed to deal with the first array (to calculate its average, for example), but it didn't work for comparing the two arrays value by value.
Could anyone give some help? Thanks!
I just answered a similar question, Numpy only on finite entries. You can locate the NaN values in your array with NumPy's isnan function and replace them, which is a common way to deal with NaN values.
import numpy as np

replace_NaN = np.isnan(array_name)  # boolean mask of the NaN positions
array_name[replace_NaN] = 0         # overwrite the NaNs with 0
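If the goal is to ignore the pairs where the observation is NaN, rather than zeroing them out, a minimal sketch (with hypothetical data) can filter both arrays with the same boolean mask before computing the correlation:
import numpy as np

obs = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])    # observations, with gaps
sim = np.array([1.1, 2.2, 2.9, 4.2, 5.1, 5.8])          # simulation results

valid = ~np.isnan(obs)                                   # keep only pairs with a real observation
r = np.corrcoef(obs[valid], sim[valid])[0, 1]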
