Compute weighted sums on a rolling window with pandas DataFrames of different lengths - python

I have a large dataframe (more than 5,000,000 rows) on which I am performing a rolling calculation:
df = pd.DataFrame(np.random.randn(10000, 1), columns=['rand'])
sum_abs = df.rolling(5).sum()
I would like to do the same calculations but add in a weighted sum.
df2 = pd.DataFrame(pd.Series([1, 2, 3, 4, 5], name='weight'))
df3 = df.mul(df2.set_index(df.index)).rolling(5).sum()
However, I get a "Length mismatch: Expected axis has 5 elements" error.
I know I could do something like [a * b for a, b in zip(L, weight)] if I converted everything to lists, but I would like to keep it in a dataframe if possible. Is there a way to multiply frames of different sizes, or do I need to repeat the set of weights to the length of the dataframe I'm multiplying against?

Easy way to do this is
w = np.arange(1, 6)
df.rolling(5).apply(lambda x: (x * w).sum())
A less easy way, using strides:
from numpy.lib.stride_tricks import as_strided as strided
v = df.values
n, m = v.shape
s1, s2 = v.strides
k = 5
w = np.arange(1, 6).reshape(1, 1, k)
pd.DataFrame(
    (strided(v, (n - k + 1, m, k), (s1, s2, s1)) * w).sum(-1),
    df.index[k - 1:], df.columns)
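As a quick sanity check (a sketch, not part of the original answer), both approaches should agree on the rows where a full 5-element window exists; raw=True is available in recent pandas and simply passes plain ndarrays to the lambda:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided as strided

df = pd.DataFrame(np.random.randn(100, 1), columns=['rand'])
w = np.arange(1, 6)
k = 5

# rolling apply version
easy = df.rolling(k).apply(lambda x: (x * w).sum(), raw=True)

# strided version from above
v = df.values
n, m = v.shape
s1, s2 = v.strides
fast = pd.DataFrame(
    (strided(v, (n - k + 1, m, k), (s1, s2, s1)) * w.reshape(1, 1, k)).sum(-1),
    df.index[k - 1:], df.columns)

# the first k-1 rows of the rolling result are NaN; the rest should match
assert np.allclose(easy.dropna().values, fast.values)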

Related

Efficient way to perform if condition nested in for loop in python

Is there an efficient pythonic way to perform if conditions in nested for loops:
import numpy as np
big = 3
med = 2
small = 5
mat1 = np.zeros((big, 3))
mat2 = np.zeros((big, med, 3))
mat3 = np.zeros((big, med, small))
mat1 = np.array([
    [0, 0, 0],
    [1.0, 0.5, 0.2],
    [0.2, 0.1, -0.1]])
mat2 = np.array([
    [[1.0, 0.5, 0.2],
     [0.1, 0.1, 0.1]],
    [[0.2, 0.2, 0.2],
     [1.0, -0.5, -0.2]],
    [[1.0, -0.5, -0.2],
     [-1.0, 0.5, -0.2]]])
mat3 = np.array([
    [[1, 1, 1, 1, 1],
     [0, 21, 1, 3, 5]],
    [[1, 2, 3, 4, 5],
     [-1, -2, -2, -3, -4]],
    [[1.0, 1.2, 1.3, 1.4, 1.5],
     [5, 4, 3, 2, 1]]])
sol = np.zeros((small))
for ii in np.arange(big):
    found = False
    for jj in np.arange(big):
        for kk in np.arange(med):
            if all(abs(mat1[ii, :] - mat2[jj, kk, :]) < 1E-8):
                found = True
                sol = mat3[jj, kk, :]
                print(sol)
                break
        if found:
            break
where big and med can be much bigger. The above dummy code works but is very slow. Is there a way to speed it up?
Note: mat1, mat2 and mat3 contain floats (not integers) and are not zeros in practice.
Solution:
The solution for me was the following (greatly benefiting from LRRR's answer):
for ii in np.arange(big):
    tmp = mat1[ii, :]
    A = np.tile(tmp[:], (med, 1))
    AA = np.repeat(A[np.newaxis, :], big, 0)
    sub = abs(AA - mat2) < 1E-8
    tmp2 = mat3[sub.all(axis=2)]
    if len(tmp2) > 0:
        val = tmp2[0, :]
Note that because I had other complications, I kept the outer loop.
The if statement is required because I want the first occurrence of a match.
Also worth noting: this is significantly faster, but it could probably be made faster still, since we could stop at the first match rather than computing all matches.
If I understand correctly, your goal is, for each row of mat1, to subtract each row in each matrix of mat2, check whether all values in the resulting vector are negative, and if so use that index to return the values from mat3?
Here's an example on smaller data:
import numpy as np
np.random.seed(10)
big = 5
med = 3
small = 2
mat1 = np.random.randint(0, 10, (big, 3))
mat2 = np.random.randint(0, 10, (big, med, 3))
mat3 = np.random.randint(0, 10, (big, med, small))
# Row subtractions
A = abs(np.repeat(mat1[:, np.newaxis], med, 1) - mat2) < 1E-8
# Extract from mat3
mat3[A.all(axis = 2)]
Breaking it down: mat1[:, np.newaxis] adds another dimension, and np.repeat() duplicates each row along it, so the shapes of mat1 and mat2 line up for a simple elementwise subtraction between the two.
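As a quick illustration of the shapes involved (a sketch using the same arrays as above, not part of the original answer):
expanded = np.repeat(mat1[:, np.newaxis], med, 1)
print(expanded.shape)           # (big, med, 3) -- now matches mat2
print((expanded - mat2).shape)  # (big, med, 3) -- same shape, so the subtraction is elementwise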
Note: I left out the abs() from your original code on the line if all(abs(mat1[ii, :] - mat2[jj, kk, :]) < 1E-8):. It seems that by taking the absolute value, the condition < 1E-8 will never be satisfied.
Update:
Here's the redo using the new data added to the original post:
# Repeat each row of mat1 for rows in mat2
A = np.repeat(mat1, big * med, 0)
# Reshape mat2 to match matrix A
B = mat2.reshape(big*med, 3)
C = np.tile(B, (big, 1))
# Subtraction rows
sub = abs(A - C) < 1E-8
# Find values from tiled mat2
values = C[sub.all(axis = 1)]
# Get indices on reshaped mat2
indices = np.all(B == values, axis=1)
# Reshape mat3
M = mat3.reshape(big * med, small)
# Result
M[indices]
output: array([[1., 1., 1., 1., 1.]])
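A small aside (not from the original answer) on why np.repeat is paired with np.tile above: repeat duplicates each row in place, while tile repeats the whole block, so together every row of mat1 gets lined up against every row of the flattened mat2:
import numpy as np
x = np.array([[1, 2], [3, 4]])
print(np.repeat(x, 2, 0))  # [[1 2] [1 2] [3 4] [3 4]] -- each row duplicated in place
print(np.tile(x, (2, 1)))  # [[1 2] [3 4] [1 2] [3 4]] -- the whole block repeated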

Calculate the empirical distribution of a sequence in NumPy?

Suppose A is a (NumPy) array of length M containing integers in 0, 1, ..., N-1. I would like to calculate an array c of length N such that c[i] = sum(A == i). A for-based solution is obvious, but is there a faster one?
I am also aware of np.histogram, but it seems like overkill for this problem.
I think I found a solution.
N = 10 # just an example
M = 10000
A = np.random.randint(0, N, size=M)
# for-based solution
c1 = [sum(A == i) for i in range(N)]
# using numpy unique
c2 = np.zeros(N, dtype=int)
val, count = np.unique(A, return_counts=True)
c2[val] = count
assert all(c2 == c1)
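Worth adding (not part of the original post): np.bincount does this counting in a single call, under the same assumption that A holds non-negative integers below N:
c3 = np.bincount(A, minlength=N)  # minlength pads values that never occur with zero counts
assert all(c3 == c2)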

SKlearn Minimum Covariance Determinant (MCD) Function yields different results if applied to whole data array vs looped

I have a repeated experiment (n=K) which measures time series of equal length N, i.e. my data matrix has shape NxK. I now want to compute a robust estimate of the covariance between the experiments, for which I use the Minimum Covariance Determinant algorithm implemented in scikit-learn.
One way to apply the algorithm is to directly apply the function to the data array D, i.e.:
import numpy as np
from sklearn.covariance import MinCovDet
N = 300 #number of rows
K = 40 #number of columns
D = np.random.normal(0, 1, size=(N, K)) #create random Data
mcd = MinCovDet().fit(D) #yields a KxK matrix
cov_mat = mcd.covariance_ #covariances between the columns
Another way is to loop over the experiments:
cov_loop = np.zeros((K, K))
for i in range(0, K):
    for j in range(i, K):
        temp_arr = np.zeros((N, 2))
        temp_arr[:, 0] = D[:, i]
        temp_arr[:, 1] = D[:, j]
        mcd_temp = MinCovDet().fit(temp_arr)
        cov_temp = mcd_temp.covariance_  # yields a 2x2 matrix; we are only interested in the [0, 1] element
        cov_loop[i, j] = cov_temp[0, 1]
        cov_loop[j, i] = cov_loop[i, j]
print(cov_loop / cov_mat)
The results differ significantly, which is why I wanted to ask what went wrong here.

How to find the nearest neighbour index from one series to another

I have a target array A, which represents isobaric pressure levels in NCEP reanalysis data.
I also have the pressure at which a cloud is observed as a long time series, B.
What I am looking for is a k-nearest-neighbour lookup that returns the indices of those nearest neighbours, something like knnsearch in Matlab, which could be expressed the same way in Python: indices, distance = knnsearch(A, B, n)
where indices holds the nearest n indices in A for every value in B, and distance is how far each value in B is from the corresponding value in A. A and B can be of different lengths; this is the bottleneck I have found with most solutions so far, whereby I would have to loop over each value in B to get my indices and distances.
import numpy as np
A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10]) # this is a fixed 17-by-1 array
B = np.array([923, 584.2, 605.3, 153.2]) # this can be any n-by-1 array
n = 2
What I would like returned from indices, distance = knnsearch(A, B, n) is this:
indices = [[1, 2],[4, 5] etc...]
where 923 in A is matched to first A[1]=925 and then A[2]=850
and 584.2 in A is matched to first A[4]=600 and then A[5]=500
distance = [[2, 73],[15.8, 84.2] etc...]
where 2 represents the distance between the queried value in B and the nearest value in A, e.g. distance[0, 0] == np.abs(B[0] - A[1])
The only solution I have been able to come up with is:
import numpy as np
def knnsearch(A, B, n):
    indices = np.zeros((len(B), n))
    distances = np.zeros((len(B), n))
    for i in range(len(B)):
        a = A
        for N in range(n):
            dif = np.abs(a - B[i])
            ind = np.argmin(dif)
            indices[i, N] = ind + N
            distances[i, N] = dif[ind + N]
            # remove this neighbour from future consideration
            np.delete(a, ind)
    return indices, distances
array_A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10])
array_B = np.array([923, 584.2, 605.3, 153.2])
neighbours = 2
indices, distances = knnsearch(array_A, array_B, neighbours)
print(indices)
print(distances)
returns:
[[ 1. 2.]
[ 4. 5.]
[ 4. 3.]
[10. 11.]]
[[ 2. 73. ]
[ 15.8 84.2]
[ 5.3 94.7]
[ 3.2 53.2]]
There must be a way to remove the for loops, as I need the performance should my A and B arrays contain many thousands of elements with many nearest neighbours...
Please help! Thanks :)
The second loop can easily be vectorized. The most straightforward way is to use np.argsort and select the indices corresponding to the n smallest dif values. However, for large arrays, since only the n smallest values need to be found, it is better to use np.argpartition.
Therefore, the code would look something like this:
def vector_knnsearch(A, B, n):
    indices = np.empty((len(B), n))
    distances = np.empty((len(B), n))
    for i, b in enumerate(B):
        dif = np.abs(A - b)
        min_ind = np.argpartition(dif, n)[:n]  # indexes of the n smallest values,
                                               # but not necessarily sorted
        ind = min_ind[np.argsort(dif[min_ind])]  # sort the output of argpartition just in case
        indices[i, :] = ind
        distances[i, :] = dif[ind]
    return indices, distances
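For reference (not part of the original answer), calling it on the sample arrays from the question:
array_A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200,
                    150, 100, 70, 50, 30, 20, 10])
array_B = np.array([923, 584.2, 605.3, 153.2])
indices, distances = vector_knnsearch(array_A, array_B, 2)
# e.g. the first row of indices is [1, 2]: 923 is nearest to A[1]=925, then A[2]=850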
As said in the comments, the first loop can also be removed using a meshgrid. However, the extra memory and computation time needed to construct the meshgrid make this approach slower for the dimensions I tried (and this will probably get worse for large arrays and end up in a MemoryError). In addition, the readability of the code decreases. Overall, this probably makes the approach less pythonic.
def mesh_knnsearch(A, B, n):
    m = len(B)
    rng = np.arange(m).reshape((m, 1))
    Amesh, Bmesh = np.meshgrid(A, B)
    dif = np.abs(Amesh - Bmesh)
    min_ind = np.argpartition(dif, n, axis=1)[:, :n]
    ind = min_ind[rng, np.argsort(dif[rng, min_ind], axis=1)]
    return ind, dif[rng, ind]
Note that it is important to define this rng as a 2D array in order to retrieve a[rng[0], ind[0]], a[rng[1], ind[1]], etc. and maintain the dimensions of the array, as opposed to a[:, ind], which retrieves a[:, ind[0]], a[:, ind[1]], etc.
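A tiny illustration of that indexing difference (hypothetical values, not from the original answer):
import numpy as np
a = np.arange(12).reshape(3, 4)
ind = np.array([[0, 1], [2, 3], [1, 2]])
rng = np.arange(3).reshape(3, 1)
print(a[rng, ind])       # shape (3, 2): each row paired with its own column indices
print(a[:, ind].shape)   # (3, 3, 2): every row indexed by every row of ind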

Python: Drop Rows With NaNs When Memory Constrained

I have 3 numpy arrays [A, B, C]. They all have the same number of rows but different numbers of columns. I need to drop a row from all arrays if any of the arrays has a nan or inf in that row, and I need to use as little memory as possible.
For example, if the first row of A has a nan or inf, I need to drop the first row of A, B and C.
I considered combining them into one big pandas dataframe and then using dropna, but that takes up a lot of RAM.
Use isfinite() and sum(axis=-1):
import numpy as np
def random_with_nan_and_inf(shape, count):
    a = np.random.rand(*shape)
    idx = [np.random.randint(0, n, count) for n in shape]
    a[tuple(idx)] = ([np.nan, np.inf] * count)[:count]  # tuple, not list, for multi-axis fancy indexing
    return a
a = random_with_nan_and_inf((100, 3), 5)
b = random_with_nan_and_inf((100, 4), 10)
c = random_with_nan_and_inf((100, 5), 15)
mask = np.isfinite(a.sum(-1) + b.sum(-1) + c.sum(-1))
a2, b2, c2 = a[mask], b[mask], c[mask]
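An equivalent mask (a sketch, not from the original answer) that tests finiteness per array and row instead of relying on the sums:
mask = (np.isfinite(a).all(axis=1)
        & np.isfinite(b).all(axis=1)
        & np.isfinite(c).all(axis=1))
a2, b2, c2 = a[mask], b[mask], c[mask]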
