Comparing two matrices row-wise by occurrence in NumPy - python

Suppose I have two NumPy matrices (or Pandas DataFrames, though I'm guessing this will be faster in NumPy).
>>> arr1
array([[3, 1, 4],
[4, 3, 5],
[6, 5, 4],
[6, 5, 4],
[3, 1, 4]])
>>> arr2
array([[3, 1, 4],
[8, 5, 4],
[3, 1, 4],
[6, 5, 4],
[3, 1, 4]])
For every row-vector in arr1, I want to count the occurrence of that row vector in arr2 and generate a vector of these counts. So for this example, the result would be
[3, 0, 1, 1, 3]
What is an efficient way to do this?
First approach:
The obvious approach of just looping over the row vectors of arr1 and generating a corresponding boolean vector against arr2 seems very slow.
np.apply_along_axis(lambda x: (x == arr2).all(1).sum(), axis=1, arr=arr1)
And it seems like a bad algorithm, as I have to check the same rows multiple times.
Second approach: I could store the row counts in a collections.Counter, and then just access that with apply_along_axis.
cnter = Counter(tuple(row) for row in arr2)
np.apply_along_axis(lambda x: cnter[tuple(x)], axis=1, arr=arr1)
This seems to be somewhat faster, but I feel like there has to still be a more direct approach than this.

Here's a NumPy approach that converts the inputs to 1D equivalents, then sorts and uses np.searchsorted along with np.bincount for the counting -
def searchsorted_based(a, b):
    # Collapse each row to a single integer key
    dims = np.maximum(a.max(0), b.max(0)) + 1
    a1D = np.ravel_multi_index(a.T, dims)
    b1D = np.ravel_multi_index(b.T, dims)
    unq_a1D, IDs = np.unique(a1D, return_inverse=1)
    # Locate each row of b among the unique rows of a
    fidx = np.searchsorted(unq_a1D, b1D)
    fidx[fidx==unq_a1D.size] = 0
    mask = unq_a1D[fidx] == b1D
    # minlength guards the edge case where the largest unique rows of a have no match in b
    count = np.bincount(fidx[mask], minlength=unq_a1D.size)
    out = count[IDs]
    return out
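To make the 1D-conversion step concrete, here is a small illustration (my own, not part of the original answer) of how np.ravel_multi_index collapses each row into a single integer key, so that whole-row equality becomes plain scalar equality:
import numpy as np

a = np.array([[3, 1, 4],
              [6, 5, 4]])
b = np.array([[3, 1, 4],
              [8, 5, 4]])
dims = np.maximum(a.max(0), b.max(0)) + 1   # per-column extents
print(np.ravel_multi_index(a.T, dims))      # one integer key per row of a
print(np.ravel_multi_index(b.T, dims))      # matching rows share a key with a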
Sample run -
In [308]: a
Out[308]:
array([[3, 1, 4],
[4, 3, 5],
[6, 5, 4],
[6, 5, 4],
[3, 1, 4]])
In [309]: b
Out[309]:
array([[3, 1, 4],
[8, 5, 4],
[3, 1, 4],
[6, 5, 4],
[3, 1, 4],
[2, 1, 5]])
In [310]: searchsorted_based(a,b)
Out[310]: array([3, 0, 1, 1, 3])
Runtime test -
In [377]: A = a[np.random.randint(0,a.shape[0],(1000))]
In [378]: B = b[np.random.randint(0,b.shape[0],(1000))]
In [379]: np.allclose(comp2D_vect(A,B), searchsorted_based(A,B))
Out[379]: True
# @Nickil Maveli's soln
In [380]: %timeit comp2D_vect(A,B)
10000 loops, best of 3: 184 µs per loop
In [381]: %timeit searchsorted_based(A,B)
10000 loops, best of 3: 92.6 µs per loop

numpy:
Start off by gathering the linear-index equivalents of the row and column subscripts of a2 using np.ravel_multi_index. Add 1 to account for NumPy's 0-based indexing. Get the respective counts for the unique rows through np.unique(). Next, find the matching rows between the unique rows of a2 and a1 by extending a1 with a new trailing axis (broadcasting) and extract the indices of the non-zero (matching) rows for both arrays.
Initialize an array of zeros and fill its values by slicing based on the obtained indices.
def comp2D_vect(a1, a2):
    midx = np.ravel_multi_index(a2.T, a2.max(0)+1)
    a, idx, cnt = np.unique(midx, return_counts=True, return_index=True)
    m1, m2 = (a1[:, None] == a2[idx]).all(-1).nonzero()
    out = np.zeros(a1.shape[0], dtype=int)
    out[m1] = cnt[m2]
    return out
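As a quick sanity check (my addition, using the arrays from the question), this reproduces the expected count vector:
a1 = np.array([[3, 1, 4], [4, 3, 5], [6, 5, 4], [6, 5, 4], [3, 1, 4]])
a2 = np.array([[3, 1, 4], [8, 5, 4], [3, 1, 4], [6, 5, 4], [3, 1, 4]])
print(comp2D_vect(a1, a2))   # [3 0 1 1 3]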
benchmarks:
For: a2 = a2.repeat(100000, axis=0)
%%timeit
df = pd.DataFrame(a2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(a1.T.tolist(), fill_value=0).values
10 loops, best of 3: 67.2 ms per loop  # @Ted Petrou's solution
%timeit comp2D_vect(a1, a2)
10 loops, best of 3: 34 ms per loop # Posted solution
%timeit searchsorted_based(a1,a2)
10 loops, best of 3: 27.6 ms per loop  # @Divakar's solution (winner)

Pandas would be a good tool for this. You can put arr2 into a DataFrame and use the groupby method to count the number of occurrences of each row, and then reindex the result with arr1.
arr1=np.array([[3, 1, 4],
[4, 3, 5],
[6, 5, 4],
[6, 5, 4],
[3, 1, 4]])
arr2 = np.array([[3, 1, 4],
[8, 5, 4],
[3, 1, 4],
[6, 5, 4],
[3, 1, 4]])
df = pd.DataFrame(arr2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(arr1.T.tolist(), fill_value=0)
Output
a  b  c
3  1  4    3
4  3  5    0
6  5  4    1
      4    1
3  1  4    3
dtype: int64
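If you just need the flat count vector from the question rather than the indexed Series (my addition), append .values:
df_count.reindex(arr1.T.tolist(), fill_value=0).values
# array([3, 0, 1, 1, 3])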
Timings
Create a lot more data first
arr2_2 = arr2.repeat(100000, axis=0)
Now time it:
%%timeit
cnter = Counter(tuple(row) for row in arr2_2)
np.apply_along_axis(lambda x: cnter[tuple(x)], axis=1, arr=arr1)
1 loop, best of 3: 704 ms per loop
%%timeit
df = pd.DataFrame(arr2_2, columns=['a', 'b', 'c'])
df_count = df.groupby(df.columns.tolist()).size()
df_count.reindex(arr1.T.tolist(), fill_value=0)
10 loops, best of 3: 53.8 ms per loop

Related

Cycling Slicing in Python

I came up with this question while trying to apply a Caesar cipher to a matrix with different shift values for each row, i.e. given a matrix X
array([[1, 0, 8],
[5, 1, 4],
[2, 1, 1]])
with shift values of S = array([0, 1, 1]), the output needs to be
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])
This is easy to implement by the following code:
Y = []
for i in range(X.shape[0]):
    if S[i] > 0:
        Y.append(X[i, S[i]:].tolist() + X[i, :S[i]].tolist())
    else:
        Y.append(X[i, :].tolist())
Y = np.array(Y)
This is a left cyclic shift. I wonder how to do this more efficiently using NumPy arrays?
Update: The following example applies the shift to the columns of a matrix. Suppose that we have a 3D array
array([[[8, 1, 8],
[8, 6, 2],
[5, 3, 7]],
[[4, 1, 0],
[5, 9, 5],
[5, 1, 7]],
[[9, 8, 6],
[5, 1, 0],
[5, 5, 4]]])
Then, the cyclic right shift of S = array([0, 0, 1]) over the columns leads to
array([[[8, 1, 7],
[8, 6, 8],
[5, 3, 2]],
[[4, 1, 7],
[5, 9, 0],
[5, 1, 5]],
[[9, 8, 4],
[5, 1, 6],
[5, 5, 0]]])
Approach #1 : Use modulus to implement the cyclic pattern and get the new column indices and then simply use advanced-indexing to extract the elements, giving us a vectorized solution, like so -
def cyclic_slice(X, S):
    m, n = X.shape
    idx = np.mod(np.arange(n) + S[:,None], n)
    return X[np.arange(m)[:,None], idx]
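For the sample X and S from the question, the index array built by the modulus step looks like this (my own illustration):
import numpy as np

S = np.array([0, 1, 1])
idx = np.mod(np.arange(3) + S[:, None], 3)
# array([[0, 1, 2],
#        [1, 2, 0],
#        [1, 2, 0]])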
Approach #2 : We can also leverage the power of strides for further speedup. The idea is to take the sliced-off portion from the start of each row and append it at the end, then create sliding windows whose length equals the number of columns, and finally index into the appropriate window number per row to get the same rolled-over effect. The implementation would be like so -
def cyclic_slice_strided(X, S):
    X2 = np.column_stack((X, X[:,:-1]))
    s0, s1 = X2.strides
    strided = np.lib.stride_tricks.as_strided
    m, n1 = X.shape
    n2 = X2.shape[1]
    X2_3D = strided(X2, shape=(m, n2-n1+1, n1), strides=(s0, s1, s1))
    return X2_3D[np.arange(len(S)), S]
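To visualize the windows that the strided view exposes, here is a sketch for the first sample row (my own illustration; it uses sliding_window_view, available in NumPy 1.20+, purely to show the window contents rather than the zero-copy as_strided mechanics):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

row = np.array([1, 0, 8])
ext = np.concatenate([row, row[:-1]])        # [1 0 8 1 0]
print(sliding_window_view(ext, 3))
# [[1 0 8]
#  [0 8 1]
#  [8 1 0]]   <- selecting window S[i] gives the left-rolled row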
Sample run -
In [34]: X
Out[34]:
array([[1, 0, 8],
[5, 1, 4],
[2, 1, 1]])
In [35]: S
Out[35]: array([0, 1, 1])
In [36]: cyclic_slice(X, S)
Out[36]:
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])
Runtime test -
In [75]: X = np.random.rand(10000,100)
...: S = np.random.randint(0,100,(10000))
# @Moses Koledoye's soln
In [76]: %%timeit
...: Y = []
...: for i, x in zip(S, X):
...:     Y.append(np.roll(x, -i))
10 loops, best of 3: 108 ms per loop
In [77]: %timeit cyclic_slice(X, S)
100 loops, best of 3: 14.1 ms per loop
In [78]: %timeit cyclic_slice_strided(X, S)
100 loops, best of 3: 4.3 ms per loop
Adaptation for the 3D case
Adapting approach #1 for the 3D case, we would have -
shift = 'left'
axis = 1 # axis along which S is to be used (axis=1 for rows)
n = X.shape[axis]
if shift == 'left':
    Sa = S
else:
    Sa = -S
# For rows
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[:,idx, np.arange(len(S))]
# For columns
idx = np.mod(Sa[:,None] + np.arange(n),n)
out = X[:,np.arange(len(S))[:,None], idx]
# For axis=0
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[idx, np.arange(len(S))]
There may be a way to write a generic solution for an arbitrary axis, but I will stop at this point.
You could shift each row using np.roll and use the new rows to build the output array:
Y = []
for i, x in zip(S, X):
    Y.append(np.roll(x, -i))
print(np.array(Y))
array([[1, 0, 8],
[1, 4, 5],
[1, 1, 2]])

Create a 2D array from another array and its indices with NumPy

Given an array:
arr = np.array([[1, 3, 7], [4, 9, 8]]); arr
array([[1, 3, 7],
[4, 9, 8]])
And given its indices:
np.indices(arr.shape)
array([[[0, 0, 0],
[1, 1, 1]],
[[0, 1, 2],
[0, 1, 2]]])
How would I be able to stack them neatly one against the other to form a new 2D array? This is what I'd like:
array([[0, 0, 1],
[0, 1, 3],
[0, 2, 7],
[1, 0, 4],
[1, 1, 9],
[1, 2, 8]])
This is my current solution:
def foo(arr):
    return np.hstack((np.indices(arr.shape).reshape(2, arr.size).T, arr.reshape(-1, 1)))
It works, but is there something shorter/more elegant to carry this operation out?
Use array initialization and then broadcasted assignment to set the indices and the array values in subsequent steps -
def indices_merged_arr(arr):
    m, n = arr.shape
    I, J = np.ogrid[:m, :n]
    out = np.empty((m, n, 3), dtype=arr.dtype)
    out[...,0] = I
    out[...,1] = J
    out[...,2] = arr
    out.shape = (-1, 3)
    return out
Note that we are avoiding the use of np.indices(arr.shape), which could have slowed things down.
Sample run -
In [10]: arr = np.array([[1, 3, 7], [4, 9, 8]])
In [11]: indices_merged_arr(arr)
Out[11]:
array([[0, 0, 1],
[0, 1, 3],
[0, 2, 7],
[1, 0, 4],
[1, 1, 9],
[1, 2, 8]])
Performance
arr = np.random.randn(100000, 2)
%timeit df = pd.DataFrame(np.hstack((np.indices(arr.shape).reshape(2, arr.size).T,\
arr.reshape(-1, 1))), columns=['x', 'y', 'value'])
100 loops, best of 3: 4.97 ms per loop
%timeit pd.DataFrame(indices_merged_arr_divakar(arr), columns=['x', 'y', 'value'])
100 loops, best of 3: 3.82 ms per loop
%timeit pd.DataFrame(indices_merged_arr_eric(arr), columns=['x', 'y', 'value'], dtype=np.float32)
100 loops, best of 3: 5.59 ms per loop
Note: Timings include the conversion to a pandas DataFrame, as that is the eventual use case for this solution.
A more generic answer for n-dimensional arrays that handles other dtypes correctly:
def indices_merged_arr(arr):
    out = np.empty(arr.shape, dtype=[
        ('index', np.intp, arr.ndim),
        ('value', arr.dtype)
    ])
    out['value'] = arr
    for i, l in enumerate(arr.shape):
        shape = (1,)*i + (-1,) + (1,)*(arr.ndim-1-i)
        out['index'][..., i] = np.arange(l).reshape(shape)
    return out.ravel()
This returns a structured array with an index column and a value column, which can be of different types.
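A brief usage sketch (mine, assuming the structured-array version just defined): the 'index' field holds the ndim-long index tuple for each element and 'value' holds the element itself, and both drop straight into pandas:
import numpy as np
import pandas as pd

arr = np.array([[1, 3, 7], [4, 9, 8]])
rec = indices_merged_arr(arr)
print(rec['index'])    # (row, col) pairs, shape (6, 2)
print(rec['value'])    # [1 3 7 4 9 8]
df = pd.DataFrame({'x': rec['index'][:, 0],
                   'y': rec['index'][:, 1],
                   'value': rec['value']})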

Convert Python sequence to NumPy array, filling missing values

The implicit conversion of a Python sequence of variable-length lists into a NumPy array causes the array to be of type object.
v = [[1], [1, 2]]
np.array(v)
>>> array([[1], [1, 2]], dtype=object)
Trying to force another type will cause an exception:
np.array(v, dtype=np.int32)
ValueError: setting an array element with a sequence.
What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?
From my sample sequence v, I would like to get something like this, if 0 is the placeholder
array([[1, 0], [1, 2]], dtype=int32)
You can use itertools.zip_longest:
import itertools
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
Out:
array([[1, 0],
[1, 2]])
Note: For Python 2, it is itertools.izip_longest.
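To see why the transpose is needed (my note): zip_longest(*v, ...) pairs up the i-th element of every sub-list, i.e. it builds the columns of the result, so .T flips it back to rows:
import itertools
import numpy as np

v = [[1], [1, 2]]
print(list(itertools.zip_longest(*v, fillvalue=0)))  # [(1, 1), (0, 2)]  <- columns
print(np.array(list(itertools.zip_longest(*v, fillvalue=0))).T)
# [[1 0]
#  [1 2]]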
Here's an almost* vectorized boolean-indexing based approach that I have used in several other posts -
def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=int)
    out[mask] = np.concatenate(v)
    return out
Sample run
In [27]: v
Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]
In [28]: out
Out[28]:
array([[1, 0, 0, 0, 0],
[1, 2, 0, 0, 0],
[3, 6, 7, 8, 9],
[4, 0, 0, 0, 0]])
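For reference, the intermediate boolean mask for this v looks like the following (my own illustration); each True marks a slot that receives a real value, while everything else keeps the zero placeholder:
import numpy as np

v = [[1], [1, 2], [3, 6, 7, 8, 9], [4]]
lens = np.array([len(item) for item in v])   # [1 2 5 1]
print(lens[:, None] > np.arange(lens.max()))
# [[ True False False False False]
#  [ True  True False False False]
#  [ True  True  True  True  True]
#  [ True False False False False]]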
*Please note that this is coined as almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. Since that part is not very computationally demanding, it should have minimal effect on the total runtime.
Runtime test
In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, against the boolean-indexing based one from this post, for relatively larger datasets with three levels of size variation across the list elements.
Case #1 : Larger size variation
In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]
In [45]: v = v*1000
In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 9.82 ms per loop
In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
100 loops, best of 3: 5.11 ms per loop
In [48]: %timeit boolean_indexing(v)
100 loops, best of 3: 6.88 ms per loop
Case #2 : Lesser size variation
In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]
In [50]: v = v*1000
In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 3.12 ms per loop
In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1000 loops, best of 3: 1.55 ms per loop
In [53]: %timeit boolean_indexing(v)
100 loops, best of 3: 5 ms per loop
Case #3 : Larger number of elements (100 max) per list element
In [139]: # Setup inputs
...: N = 10000 # Number of elems in list
...: maxn = 100 # Max. size of a list element
...: lens = np.random.randint(0,maxn,(N))
...: v = [list(np.random.randint(0,9,(L))) for L in lens]
...:
In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
1 loops, best of 3: 292 ms per loop
In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1 loops, best of 3: 264 ms per loop
In [142]: %timeit boolean_indexing(v)
10 loops, best of 3: 95.7 ms per loop
To me, it seems itertools.izip_longest is doing pretty well! There's no clear winner; it would have to be decided on a case-by-case basis!
Pandas and its DataFrames deal beautifully with missing data.
import numpy as np
import pandas as pd
v = [[1], [1, 2]]
print(pd.DataFrame(v).fillna(0).values.astype(np.int32))
# array([[1, 0],
# [1, 2]], dtype=int32)
max_len = max(len(sub_list) for sub_list in v)
result = np.array([sub_list + [0] * (max_len - len(sub_list)) for sub_list in v])
>>> result
array([[1, 0],
[1, 2]])
>>> type(result)
numpy.ndarray
Here is a general way:
>>> v = [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
>>> max_len = max(len(i) for i in v)
>>> np.hstack(np.insert(v, range(1, len(v)+1),[[0]*(max_len-len(i)) for i in v])).astype('int32').reshape(len(v), max_len)
array([[ 1, 0, 0, 0],
[ 2, 3, 4, 0],
[ 5, 6, 0, 0],
[ 7, 8, 9, 10],
[11, 12, 0, 0]], dtype=int32)
You can convert to a pandas DataFrame first, and after that convert it to a NumPy array:
ll = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
df = pd.DataFrame(ll)
print(df)
#    0  1    2    3
# 0  1  2  3.0  NaN
# 1  4  5  NaN  NaN
# 2  6  7  8.0  9.0
npl = df.to_numpy()
print(npl)
# [[ 1. 2. 3. nan]
# [ 4. 5. nan nan]
# [ 6. 7. 8. 9.]]
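If you need an integer placeholder instead of NaN (my note, reusing df from above and assuming numpy is imported as np), combine this with fillna as in the earlier pandas answer:
npl_int = df.fillna(0).to_numpy().astype(np.int32)
# [[1 2 3 0]
#  [4 5 0 0]
#  [6 7 8 9]]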
I was getting a NumPy broadcast error with Alexander's answer, so I added a small variation using numpy.pad:
pad = len(max(X, key=len))
result = np.array([np.pad(i, (0, pad-len(i)), 'constant') for i in X])
If you want to extend the same logic to deeper levels (lists of lists of lists, ...), you can use TensorFlow ragged tensors and convert them to tensors/arrays. For example:
import tensorflow as tf
v = [[1], [1, 2]]
padded_v = tf.ragged.constant(v).to_tensor(0)
This creates an array padded with 0.
or a deeper example:
w = [[[1]], [[2],[1, 2]]]
padded_w = tf.ragged.constant(w).to_tensor(0)
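To get NumPy arrays back out (my note, assuming TensorFlow 2.x with eager execution), call .numpy() on the resulting tensors:
print(padded_v.numpy())   # [[1 0]
                          #  [1 2]]
print(padded_w.numpy())   # shape (2, 2, 2), padded with 0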

Difference between every row and column in two DataFrames (Python / Pandas)

Is there a more efficient way to compare every column in every row in one DF to every column in every row of another DF? This feels sloppy to me, but my loop / apply attempts have been much slower.
df1 = pd.DataFrame({'a': np.random.randn(1000),
'b': [1, 2] * 500,
'c': np.random.randn(1000)},
index=pd.date_range('1/1/2000', periods=1000))
df2 = pd.DataFrame({'a': np.random.randn(100),
'b': [2, 1] * 50,
'c': np.random.randn(100)},
index=pd.date_range('1/1/2000', periods=100))
df1 = df1.reset_index()
df1['embarrassingHackInd'] = 0
df1.set_index('embarrassingHackInd', inplace=True)
df1.rename(columns={'index':'origIndex'}, inplace=True)
df1['df1Date'] = df1.origIndex.astype(np.int64) // 10**9
df1['df2Date'] = 0
df2 = df2.reset_index()
df2['embarrassingHackInd'] = 0
df2.set_index('embarrassingHackInd', inplace=True)
df2.rename(columns={'index':'origIndex'}, inplace=True)
df2['df2Date'] = df2.origIndex.astype(np.int64) // 10**9
df2['df1Date'] = 0
timeit df3 = abs(df1-df2)
10 loops, best of 3: 60.6 ms per loop
I need to know which comparison was made, thus the ugly addition of each opposing index to the comparison DF so that it will end up in the final DF.
Thanks in advance for any assistance.
The code you posted shows a clever way to produce a subtraction table. However, it doesn't play to Pandas strengths. Pandas DataFrames store the underlying data in column-based blocks. So retrieval of the data is fastest when done by column, not by row. Since all the rows have the same index, the subtractions are performed by row (pairing each row with every other row), which means there is a lot of row-based data retrieval going on in df1-df2. That's not ideal for Pandas, particularly when not all the columns have the same dtype.
Subtraction tables are something NumPy is good at:
In [5]: x = np.arange(10)
In [6]: y = np.arange(5)
In [7]: x[:, np.newaxis] - y
Out[7]:
array([[ 0, -1, -2, -3, -4],
[ 1, 0, -1, -2, -3],
[ 2, 1, 0, -1, -2],
[ 3, 2, 1, 0, -1],
[ 4, 3, 2, 1, 0],
[ 5, 4, 3, 2, 1],
[ 6, 5, 4, 3, 2],
[ 7, 6, 5, 4, 3],
[ 8, 7, 6, 5, 4],
[ 9, 8, 7, 6, 5]])
You can think of x as one column of df1, and y as one column of df2. You'll see below that NumPy can handle all the columns of df1 and all the columns of df2 in basically the same way, using basically the same syntax.
The code below defines orig and using_numpy. orig is the code you posted, using_numpy is an alternative method which performs the subtraction using NumPy arrays:
In [2]: %timeit orig(df1.copy(), df2.copy())
10 loops, best of 3: 96.1 ms per loop
In [3]: %timeit using_numpy(df1.copy(), df2.copy())
10 loops, best of 3: 19.9 ms per loop
import numpy as np
import pandas as pd
N = 100
df1 = pd.DataFrame({'a': np.random.randn(10*N),
'b': [1, 2] * 5*N,
'c': np.random.randn(10*N)},
index=pd.date_range('1/1/2000', periods=10*N))
df2 = pd.DataFrame({'a': np.random.randn(N),
'b': [2, 1] * (N//2),
'c': np.random.randn(N)},
index=pd.date_range('1/1/2000', periods=N))
def orig(df1, df2):
    df1 = df1.reset_index()                                    # 312 µs per loop
    df1['embarrassingHackInd'] = 0                             # 75.2 µs per loop
    df1.set_index('embarrassingHackInd', inplace=True)         # 526 µs per loop
    df1.rename(columns={'index':'origIndex'}, inplace=True)    # 209 µs per loop
    df1['df1Date'] = df1.origIndex.astype(np.int64) // 10**9   # 23.1 µs per loop
    df1['df2Date'] = 0
    df2 = df2.reset_index()
    df2['embarrassingHackInd'] = 0
    df2.set_index('embarrassingHackInd', inplace=True)
    df2.rename(columns={'index':'origIndex'}, inplace=True)
    df2['df2Date'] = df2.origIndex.astype(np.int64) // 10**9
    df2['df1Date'] = 0
    df3 = abs(df1-df2)                                         # 88.7 ms per loop <-- this is the bottleneck
    return df3
def using_numpy(df1, df2):
    df1.index.name = 'origIndex'
    df2.index.name = 'origIndex'
    df1.reset_index(inplace=True)
    df2.reset_index(inplace=True)
    df1_date = df1['origIndex']
    df2_date = df2['origIndex']
    df1['origIndex'] = df1_date.astype(np.int64)
    df2['origIndex'] = df2_date.astype(np.int64)
    arr1 = df1.values
    arr2 = df2.values
    arr3 = np.abs(arr1[:,np.newaxis,:]-arr2)   # 3.32 ms per loop vs 88.7 ms
    arr3 = arr3.reshape(-1, 4)
    index = pd.MultiIndex.from_product(
        [df1_date, df2_date], names=['df1Date', 'df2Date'])
    result = pd.DataFrame(arr3, index=index, columns=df1.columns)
    # You could stop here, but the rest makes the result more similar to orig
    result.reset_index(inplace=True, drop=False)
    result['df1Date'] = result['df1Date'].astype(np.int64) // 10**9
    result['df2Date'] = result['df2Date'].astype(np.int64) // 10**9
    return result
def is_equal(expected, result):
    expected.reset_index(inplace=True, drop=True)
    result.reset_index(inplace=True, drop=True)
    # expected has dtypes 'O', while result has some float and int dtypes.
    # Make all the dtypes float for a quick and dirty comparison check
    expected = expected.astype('float')
    result = result.astype('float')
    columns = ['a','b','c','origIndex','df1Date','df2Date']
    return expected[columns].equals(result[columns])
expected = orig(df1.copy(), df2.copy())
result = using_numpy(df1.copy(), df2.copy())
assert is_equal(expected, result)
How x[:, np.newaxis] - y works:
This expression takes advantage of NumPy broadcasting.
To understand broadcasting -- and in general with NumPy -- it pays to know the shape of the arrays:
In [6]: x.shape
Out[6]: (10,)
In [7]: x[:, np.newaxis].shape
Out[7]: (10, 1)
In [8]: y.shape
Out[8]: (5,)
The [:, np.newaxis] adds a new axis to x on the right, so the shape is (10, 1). Thus x[:, np.newaxis] - y is the subtraction of an array of shape (5,) from an array of shape (10, 1).
On the face of it, that doesn't make sense, but NumPy arrays broadcast their shape according to certain rules to try to make their shapes compatible.
The first rule is that new axes can be added on the left. So an array of shape (5,) can be broadcast to shape (1, 5).
The next rule is that an axis of length 1 can be broadcast to arbitrary length; the values in the array are simply repeated as often as needed along the extra dimension(s).
So when arrays of shape (10, 1) and (1, 5) are put together in a NumPy arithmetic operation, they are both broadcast up to arrays of shape (10, 5):
In [14]: broadcasted_x, broadcasted_y = np.broadcast_arrays(x[:, np.newaxis], y)
In [15]: broadcasted_x
Out[15]:
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2],
[3, 3, 3, 3, 3],
[4, 4, 4, 4, 4],
[5, 5, 5, 5, 5],
[6, 6, 6, 6, 6],
[7, 7, 7, 7, 7],
[8, 8, 8, 8, 8],
[9, 9, 9, 9, 9]])
In [16]: broadcasted_y
Out[16]:
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
So x[:, np.newaxis] - y is equivalent to broadcasted_x - broadcasted_y.
Now, with this simpler example under our belt, we can look at
arr1[:,np.newaxis,:]-arr2.
arr1 has shape (1000, 4) and arr2 has shape (100, 4). We want to subtract the items in the axis of length 4, for each row along the 1000-length axis, and each row along the 100-length axis. In other words, we want the subtraction to form an array of shape (1000, 100, 4).
Importantly, we don't want the 1000-axis to interact with the 100-axis. We want them to be in separate axes.
So if we add an axis to arr1 like this: arr1[:,np.newaxis,:], then its shape becomes
In [22]: arr1[:, np.newaxis, :].shape
Out[22]: (1000, 1, 4)
And now, NumPy broadcasting pumps up both arrays to the common shape of (1000, 100, 4). Voila, a subtraction table.
To massage the values into a 2D DataFrame of shape (1000*100, 4), we can use reshape:
arr3 = arr3.reshape(-1, 4)
The -1 tells NumPy to replace it with whatever positive integer is needed for the reshape to make sense. Since arr3 has 1000*100*4 values, the -1 is replaced with 1000*100. Using -1 is nicer than writing 1000*100, however, since it allows the code to work even if we change the number of rows in df1 and df2.
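Putting the shapes together in one self-contained snippet (my own sketch, mirroring the sizes used above):
import numpy as np

arr1 = np.random.randn(1000, 4)
arr2 = np.random.randn(100, 4)
out = np.abs(arr1[:, np.newaxis, :] - arr2)  # (1000, 1, 4) op (100, 4) -> (1000, 100, 4)
print(out.shape)                             # (1000, 100, 4)
print(out.reshape(-1, 4).shape)              # (100000, 4), ready for the DataFrame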
