Python: Drop Rows With NaNs When Memory Constrained - python

I have 3 numpy arrays [A,B,C] they all have same number of rows but different number of columns. I need to drop the rows from all arrays if any of the arrays have a nan or inf on that row. I need to use as little memory as possible.
for example, if the first row of A has a nan or inf, I need to drop the first row of A,B,C
I considered making them into one big pandas data frame and then using dropna.But that takes up a lot of ram.

use isfinite() and sum(axis=-1):
import numpy as np
def random_with_nan_and_inf(shape, count):
a = np.random.rand(*shape)
idx = [np.random.randint(0, n, count) for n in shape]
a[idx] = ([np.nan, np.inf] * count)[:count]
return a
a = random_with_nan_and_inf((100, 3), 5)
b = random_with_nan_and_inf((100, 4), 10)
c = random_with_nan_and_inf((100, 5), 15)
mask = np.isfinite(a.sum(-1) + b.sum(-1) + c.sum(-1))
a2, b2, c2 = a[mask], b[mask], c[mask]

Related

sparse matrix to ijv format

Is there an efficient way to represent a sparse matrix as ijv (3 arrays : row, column, value) form.
Using nested loop seems very naive and slow for large matrices.
The code comes from here:
# Python program for Sparse Matrix Representation
# using arrays
# assume a sparse matrix of order 4*5
# let assume another matrix compactMatrix
# now store the value,row,column of arr1 in sparse matrix compactMatrix
sparseMatrix = [[0,0,3,0,4],[0,0,5,7,0],[0,0,0,0,0],[0,2,6,0,0]]
# initialize size as 0
size = 0
for i in range(4):
for j in range(5):
if (sparseMatrix[i][j] != 0):
size += 1
# number of columns in compactMatrix(size) should
# be equal to number of non-zero elements in sparseMatrix
rows, cols = (3, size)
compactMatrix = [[0 for i in range(cols)] for j in range(rows)]
k = 0
for i in range(4):
for j in range(5):
if (sparseMatrix[i][j] != 0):
compactMatrix[0][k] = i
compactMatrix[1][k] = j
compactMatrix[2][k] = sparseMatrix[i][j]
k += 1
for i in compactMatrix:
print(i)
# This code is contributed by MRINALWALIA
I am going to print sparse matrix to file in ijv form and read it in C++.
scipy.sparse.coo_matrix just give me:
print(coo_matrix([[0,0,3,0,4],[0,0,5,7,0],[0,0,0,0,0],[0,2,6,0,0]]))
(0, 2) 3
(0, 4) 4
(1, 2) 5
(1, 3) 7
(3, 1) 2
(3, 2) 6
with np.where() I can get the index of nonzero elements, but how about the v array?
Probably do you know a more efficient method (I am not going to use swig, ... to wrap the code)?
Edit
size=np.count_nonzero(sparseMatrix)
rows, cols = np.where(sparseMatrix)
compactMatrix = np.zeros((3, size))
for i in range(size):
compactMatrix[0][i] = rows[i]
compactMatrix[1][i] = cols[i]
compactMatrix[2][i] = sparseMatrix[rows[i]][cols[i]]
print(compactMatrix)
That's what I thought finally.

Efficient way to perform if condition nested in for loop in python

Is there an efficient pythonic way to perform if conditions in nested for loops:
import numpy as np
big = 3
med = 2
small = 5
mat1 = np.zeros((big, 3))
mat2 = np.zeros((big, med, 3))
mat3 = np.zeros((big, med, small))
mat1 = np.array([
[0,0,0],\
[1.0,0.5,0.2],\
[0.2,0.1,-0.1]])
mat2 = np.array([[
[1.0,0.5,0.2],\
[0.1,0.1,0.1]],\
[[0.2,0.2,0.2],\
[1.0,-0.5,-0.2]],\
[[1.0,-0.5,-0.2],\
[-1.0,0.5,-0.2]]])
mat3 = np.array([[
[1,1,1,1,1],\
[0,21,1,3,5]],\
[[1,2,3,4,5],\
[-1,-2,-2,-3,-4]],\
[[1.0,1.2,1.3,1.4,1.5],\
[5,4,3,2,1]]])
sol = np.zeros((small))
for ii in np.arange(big):
found = False
for jj in np.arange(big):
for kk in np.arange(med):
if all(abs(mat1[ii, :] - mat2[jj, kk, :]) < 1E-8):
found = True
sol = mat3[jj, kk, :]
print(sol)
break
if found:
break
where big and med can be much bigger. The above dummy code works but is very slow. Is there a way to speed it up ?
Note: the mat1, mat2 and mat3 are floats (not integer) and are not zeros in practice.
Solution:
The solution for me was the following (greatly benefiting from #LRRR answer):
for ii in np.arange(big):
tmp = mat1[ii, :]
A = np.tile(tmp[:], (med, 1))
AA = np.repeat(A[np.newaxis, :], big, 0)
sub = abs(AA - mat2) < 1E-8
tmp2 = mat3[sub.all(axis=2)]
if (len(tmp2) > 0):
val = tmp2[0, :]
Note that because I had other complications I kept the outer loop.
The if statement is required as I want the first occurrence of a match.
Also worth noting, this is significantly faster but probably can be made even faster since we could stop at the match rather than having all matches.
If I understand correctly your goal is for each row of mat1, subtract each row in each matrix of mat2, check if all values in the resultant vector are negative, and if true then use that index to return the values from mat3?
Here's an example on smaller data:
import random
import numpy as np
random.seed(10)
big = 5
med = 3
small = 2
mat1 = np.random.randint(0, 10, (big, 3))
mat2 = np.random.randint(0, 10, (big, med, 3))
mat3 = np.random.randint(0, 10, (big, med, small))
# Row subtractions
A = abs(np.repeat(mat1[:, np.newaxis], med, 1) - mat2) < 1E-8
# Extract from mat3
mat3[A.all(axis = 2)]
Breaking it down mat1[:, np.newaxis] increases the array by another dimension and np.repeat() will duplicate each row, so the sizes of mat1 and mat2 will line up to do a simple subtraction between the two.
Note: I left out the abs() from your original code on the line if all(abs(mat1[ii, :] - mat2[jj, kk, :]) < 1E-8):. It seems that by taking the absolute value, the condition < 1E-8 will never be satisfied.
Update:
Here's the redo using the new data added to the original post:
# Repeat each row of mat1 for rows in mat2
A = np.repeat(mat1, big * med, 0)
# Reshape mat2 to match matrix A
B = mat2.reshape(big*med, 3)
C = np.tile(B, (big, 1))
# Subtraction rows
sub = abs(A - C) < 1E-8
# Find values from tiled mat2
values = C[sub.all(axis = 1)]
# Get indices on reshaped mat2
indices = np.all(B == values, axis=1)
# Reshape mat3
M = mat3.reshape(big * med, small)
# Result
M[indices]
output: array([[1., 1., 1., 1., 1.]])

NumPy: Concatenating 1D array to 3D array

Suppose I have a 5x10x3 array, which I interpret as 5 'sub-arrays', each consisting of 10 rows and 3 columns. I also have a seperate 1D array of length 5, which I call b.
I am trying to insert a new column into each sub-array, where the column inserted into the ith (i=0,1,2,3,4) sub-array is a 10x1 vector where each element is equal to b[i].
For example:
import numpy as np
np.random.seed(777)
A = np.random.rand(5,10,3)
b = np.array([2,4,6,8,10])
A[0] should look like:
A[1] should look like:
And similarly for the other 'sub-arrays'.
(Notice b[0]=2 and b[1]=4)
What about this?
# Make an array B with the same dimensions than A
B = np.tile(b, (1, 10, 1)).transpose(2, 1, 0) # shape: (5, 10, 1)
# Concatenate both
np.concatenate([A, B], axis=-1) # shape: (5, 10, 4)
One method would be np.pad:
np.pad(A, ((0,0),(0,0),(0,1)), 'constant', constant_values=[[[],[]],[[],[]],[[],b[:, None,None]]])
# array([[[9.36513084e-01, 5.33199169e-01, 1.66763960e-02, 2.00000000e+00],
# [9.79060284e-02, 2.17614285e-02, 4.72452812e-01, 2.00000000e+00],
# etc.
Or (more typing but probably faster):
i,j,k = A.shape
res = np.empty((i,j,k+1), np.result_type(A, b))
res[...,:-1] = A
res[...,-1] = b[:, None]
Or dstack after broadcast_to:
np.dstack([A,np.broadcast_to(b[:,None],A.shape[:2])]

Compute weighted sums on rolling window with pandas dataframes of different length

I have a large dataframe > 5000000 rows that I am performing a rolling calculation on.
df = pd.DataFrame(np.randn(10000,1), columns = ['rand'])
sum_abs = df.rolling(5).sum()
I would like to do the same calculations but add in a weighted sum.
df2 = pd.DataFrame(pd.Series([1,2,3,4,5]), name ='weight'))
df3 = df.mul(df2.set_index(df.index)).rolling(5).sum()
However, I am getting a Length Mismatch expected axis has 5 elements error.
I know I could do something like [a *b for a, b in zip(L, weight)] if I converted everything to a list but I would like to keep it in a dataframe if possible. Is there a way to multiply against different size frames or do I need to repeat the set of numbers the length of the dataset I'm multiplying against?
Easy way to do this is
w = np.arange(1, 6)
df.rolling(5).apply(lambda x: (x * w).sum())
A less easy way using strides
from numpy.lib.stride_tricks import as_strided as strided
v = df.values
n, m = v.shape
s1, s2 = v.strides
k = 5
w = np.arange(1, 6).reshape(1, 1, k)
pd.DataFrame(
(strided(v, (n - k + 1, m, k), (s1, s2, s1)) * w).sum(-1),
df.index[k - 1:], df.columns)
naive time test

Reshape a six-dimensional array into a 2d dataframe

I have a six-dimensional numeric array A, and I want to reshape it into a two-dimensional array. The rows of the resulting matrix should be multi-indexed by the first three dimensions of A, and the columns should be multi-indexed by the last three dimensions of A. What is the best way to achieve this using pandas or numpy?
Here is a handy function to do just this.
def make2d(a):
shape = a.shape
n = len(shape)
col_lvls = n // 2
idx_lvls = n - col_lvls
midx = pd.MultiIndex.from_product(
[range(i) for i in shape[:idx_lvls]],
names=['d-{}'.format(d) for d in range(1, idx_lvls + 1)])
mcol = pd.MultiIndex.from_product(
[range(i) for i in shape[idx_lvls:]],
names=['d-{}'.format(d) for d in range(idx_lvls + 1, idx_lvls + col_lvls + 1)])
return pd.DataFrame(
a.reshape(np.array(shape[:3]).prod(), -1),
midx, mcol
)
demonstration
a = np.arange(216).reshape(2, 3, 2, 3, 2, 3)
make2d(a)

Categories

Resources