Vectorized solution for generating NumPy matrix - python

I am looking for a built-in NumPy routine or some vectorized approach to get such n-by-n matrices for n > 1. The key property is that the last element of a given row serves as the first element of the following row.
n = 2
# array([[1, 2],
#        [2, 3]])

n = 3
# array([[1, 2, 3],
#        [3, 4, 5],
#        [5, 6, 7]])

n = 4
# array([[ 1,  2,  3,  4],
#        [ 4,  5,  6,  7],
#        [ 7,  8,  9, 10],
#        [10, 11, 12, 13]])
My attempt uses a list comprehension. The same can be written with extended for-loop syntax as well.
import numpy as np
n = 4
arr = np.array([[(n-1)*j+i for i in range(1, n+1)] for j in range(n)])
# array([[ 1,  2,  3,  4],
#        [ 4,  5,  6,  7],
#        [ 7,  8,  9, 10],
#        [10, 11, 12, 13]])

If you want to keep things simple (and somewhat readable), this should do it:
def ranged_mat(n):
    out = np.arange(1, n ** 2 + 1).reshape(n, n)
    out -= np.arange(n).reshape(n, 1)
    return out
Simply build all numbers from 1 to n², reshape them to the desired n-by-n shape, then subtract the row number from each row.
This does the same as Divakar's ranged_mat_v2, but I like being explicit about intermediate array shapes; not everyone is an expert at NumPy's broadcasting rules.
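A quick check -
ranged_mat(3)
# array([[1, 2, 3],
#        [3, 4, 5],
#        [5, 6, 7]])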

Approach #1
We can leverage scikit-image's view_as_windows, which is built on np.lib.stride_tricks.as_strided, to get sliding windows. Additionally, it accepts a step argument, which fits this problem perfectly. Hence, the implementation would be -
from skimage.util.shape import view_as_windows
def ranged_mat(n):
    r = np.arange(1, n*(n-1)+2)
    return view_as_windows(r, n, step=n-1)
Sample runs -
In [270]: ranged_mat(2)
Out[270]:
array([[1, 2],
       [2, 3]])

In [271]: ranged_mat(3)
Out[271]:
array([[1, 2, 3],
       [3, 4, 5],
       [5, 6, 7]])

In [272]: ranged_mat(4)
Out[272]:
array([[ 1,  2,  3,  4],
       [ 4,  5,  6,  7],
       [ 7,  8,  9, 10],
       [10, 11, 12, 13]])
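If scikit-image is not at hand, NumPy 1.20+ ships np.lib.stride_tricks.sliding_window_view. It has no step argument, but slicing every (n-1)-th window gives the same result; a minimal sketch, assuming that NumPy version (ranged_mat_sw is a name picked here for illustration) -
def ranged_mat_sw(n):
    # all windows of length n, then keep every (n-1)-th one
    r = np.arange(1, n*(n-1)+2)
    return np.lib.stride_tricks.sliding_window_view(r, n)[::n-1]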
Approach #2
Another approach with outer broadcasted addition -
def ranged_mat_v2(n):
    r = np.arange(n)
    return (n-1)*r[:,None] + r + 1
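A quick check -
ranged_mat_v2(3)
# array([[1, 2, 3],
#        [3, 4, 5],
#        [5, 6, 7]])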
Approach #3
We can also use the numexpr module, which supports multi-core processing, to achieve better efficiency on large n -
import numexpr as ne
def ranged_mat_v3(n):
    r = np.arange(n)
    r2d = (n-1)*r[:,None]
    return ne.evaluate('r2d+r+1')
Making use of slicing gives us a more memory-efficient variant -
def ranged_mat_v4(n):
    r = np.arange(n+1)
    r0 = r[1:]
    r1 = r[:-1,None]*(n-1)
    return ne.evaluate('r0+r1')
Timings -
In [423]: %timeit ranged_mat(10000)
273 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [424]: %timeit ranged_mat_v2(10000)
316 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [425]: %timeit ranged_mat_v3(10000)
176 ms ± 85.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [426]: %timeit ranged_mat_v4(10000)
154 ms ± 82.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

There is also np.fromfunction, as below.
def func(n):
    # dtype=int makes fromfunction pass integer index arrays,
    # so the result is an integer matrix rather than float
    return np.fromfunction(lambda r, c: (n-1)*r + 1 + c, shape=(n, n), dtype=int)
It takes a function that computes the array values from the index arrays.
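A quick check -
func(3)
# array([[1, 2, 3],
#        [3, 4, 5],
#        [5, 6, 7]])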

We can use NumPy strides for this:
def as_strides(n):
    m = n**2 - (n-1)
    a = np.arange(1, m+1)
    s = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(n,n), strides=((n-1)*s, s))
as_strides(2)
array([[1, 2],
       [2, 3]])

as_strides(3)
array([[1, 2, 3],
       [3, 4, 5],
       [5, 6, 7]])

as_strides(4)
array([[ 1,  2,  3,  4],
       [ 4,  5,  6,  7],
       [ 7,  8,  9, 10],
       [10, 11, 12, 13]])
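Note that the rows are overlapping views into one base buffer; if you plan to write to the result, take a copy first -
out = as_strides(4).copy()  # windows share memory; copy before mutating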


How to filter out rows in a Pandas DataFrame that contain a specific subsequence in a list-column?

I have a DataFrame like below:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "list": [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]})
I want to filter this DataFrame so that it only contains rows in which X is a subsequence of the list column (that is, the elements of X appear in the list in the same order and are not interleaved with other elements).
For example, if X = [6, 8, 3], I want the output to look like this:
id  list
1   [2, 51, 6, 8, 3]
3   [6, 8, 3, 9, 10, 11]
I know I can check if a list is a subsequence of another list with the following function (found on How to check subsequence exists in a list?):
def x_in_y(query, base):
    l = len(query)
    for i in range(len(base) - l + 1):
        if base[i:i+l] == query:
            return True
    return False
I have two questions:
Question 1:
How to apply this to a Pandas DataFrame column like in my example?
Question 2:
Is this the most efficient way to do this? And if not, what would be? The function does not look very elegant/Pythonic, and I have to apply it to a very big DataFrame of about 200K rows.
[Note: the elements of the lists in the list column are unique, should that help to optimize things]
Here is a solution that calls the function on the column:
df = df[df.list.map(lambda x: x_in_y(X, x))]
#alternative
#df = df[df.list.apply(lambda x: x_in_y(X, x))]
print (df)
   id                  list
0   1      [2, 51, 6, 8, 3]
2   3  [6, 8, 3, 9, 10, 11]
Performance is really good on the sample data; with real data, it is best to test too:
#200k rows
df = pd.concat([df] * 40000, ignore_index=True)
print (df)
X = [6, 8, 3]
x = to_string([6, 8, 3])
In [166]: %timeit df.list.map(lambda x: x_in_y(X, x))
214 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [167]: %timeit df['list'].map(to_string).str.contains(x)
413 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [168]: %timeit df["list"].apply(has_subsequence, subseq=X)
5.2 s ± 420 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [169]: %timeit df.list.apply(lambda y: ''.join(map(str,X)) in ''.join(map(str,y)))
573 ms ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using the numpy rolling-window technique:
import numpy as np
def rolling_window(a, size):
    a = np.array(a)
    shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def has_subsequence(a, subseq):
    return (rolling_window(a, len(subseq)) == subseq).all(axis=1).any()
mask = df["list"].apply(has_subsequence, subseq=[6,8,3])
df[mask]
Explanation:
rolling_window creates a view into the array with the given shape and strides:
>>> rolling_window([1,2,3,4], 2)
np.array([[1,2], [2,3], [3,4]])
Then we compare the result with our target X
>>> np.array([[1,2], [2,3], [3,4]]) == [2,3]
np.array([[False, False], [True, True], [False, False]])
Then we tell NumPy to return True for the rows in which all items are True along the 1st axis.
>>> np.array([[False, False], [True, True], [False, False]]).all(axis=1)
np.array([False, True, False])
And finally, return True if there are any True values in the array.
>>> np.array([False, True, False]).any()
True
You can try this:
import pandas as pd
def to_string(l):
    return '-' + '-'.join(map(str, l)) + '-'
X = to_string([6, 8, 3])
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "list": [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]})
df[df['list'].map(to_string).str.contains(X)]
#    id                  list
# 0   1      [2, 51, 6, 8, 3]
# 2   3  [6, 8, 3, 9, 10, 11]
In my opinion it is important to add delimiters to the start and end of the string; otherwise you will have problems with lists such as [666, 8, 3].
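A quick demonstration of why the delimiters matter -
''.join(map(str, [6, 8, 3])) in ''.join(map(str, [666, 8, 3]))  # True, a false positive
to_string([6, 8, 3]) in to_string([666, 8, 3])                  # False, as intended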
You can try this:
x = [6, 8, 3]
df = df.loc[df.list.apply(lambda y: ''.join(map(str,x)) in ''.join(map(str,y)))]
OUTPUT:
   id                  list
0   1      [2, 51, 6, 8, 3]
2   3  [6, 8, 3, 9, 10, 11]

Efficiently apply different permutations for each row of a 2D NumPy array [duplicate]

This question already has answers here:
Randomly shuffle items in each row of numpy array
(6 answers)
Closed 3 years ago.
Given a matrix A, I want to apply a different random shuffle to each row of A; for example,
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
becomes
array([[1, 3, 2],
       [6, 5, 4],
       [7, 9, 8]])
Of course we can loop through the matrix and shuffle each row; however, iteration is slow, and I am asking if there is a more efficient way to do this.
Picked up this neat trick from Divakar which involves randn and argsort:
np.random.seed(0)
s = np.arange(16).reshape(4, 4)
np.take_along_axis(s, np.random.randn(*s.shape).argsort(axis=1), axis=1)
array([[ 1,  0,  3,  2],
       [ 4,  6,  5,  7],
       [11, 10,  8,  9],
       [14, 12, 13, 15]])
For a 2D array, this can be simplified to
s[np.arange(len(s))[:,None], np.random.randn(*s.shape).argsort(axis=1)]
array([[ 1,  0,  3,  2],
       [ 4,  6,  5,  7],
       [11, 10,  8,  9],
       [14, 12, 13, 15]])
You can also apply np.random.permutation over each row independently to return a new array.
np.apply_along_axis(np.random.permutation, axis=1, arr=s)
array([[ 3,  1,  0,  2],
       [ 4,  6,  5,  7],
       [ 8,  9, 10, 11],
       [15, 14, 13, 12]])
Performance -
s = np.arange(10000 * 100).reshape(10000, 100)
%timeit s[np.arange(len(s))[:,None], np.random.randn(*s.shape).argsort(axis=1)]
%timeit np.apply_along_axis(np.random.permutation, 1, s)
84.6 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
842 ms ± 8.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I've noticed it depends on the dimensions of your data, so make sure to test it out first.
Code-wise, you can use NumPy's apply_along_axis as
np.apply_along_axis(np.random.shuffle, 1, matrix)
but it doesn't seem to be more efficient than iterating, at least for a 3x3 matrix; for that method I get
%%timeit
np.apply_along_axis(np.random.shuffle, 1, test)

67 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
while the iteration gives
%%timeit
for i in range(test.shape[0]):
    np.random.shuffle(test[i])

20.3 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
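On NumPy 1.20+ there is also Generator.permuted, which shuffles along an axis independently for each row in a single vectorized call; a minimal sketch, assuming that version -
import numpy as np

rng = np.random.default_rng(0)
s = np.arange(16).reshape(4, 4)
rng.permuted(s, axis=1)  # each row shuffled independently; returns a new array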

Repeat a 2D NumPy array N times [duplicate]

This question already has answers here:
Create 3D array from a 2D array by replicating/repeating along the first axis
(4 answers)
Closed 4 years ago.
I need to augment (replicate) a 2D array of shape 32x32 into a 3D array of shape 3x32x32 by duplicating the source array. How can I do this in the best possible way?
Below is a sample of the source and expected arrays. I need to apply this logic over a bigger scope in my application.
Source array:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
Expected array:
array([[[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]],

       [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]],

       [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]])
By my tests, np.repeat is a little faster than np.tile:
X = np.repeat(arr[None,:], 3, axis=0)
Alternatively, use np.concatenate:
X = np.concatenate([[arr]] * 3, axis=0)
arr = np.arange(10000 * 1000).reshape(10000, 1000)

%timeit np.repeat(arr[None,:], 3, axis=0)
170 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.tile(arr, (3, 1, 1))
187 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.concatenate([[arr]] * 3, axis=0)
243 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Read-only, array cannot be modified.
%timeit np.broadcast_to(arr, (3, *arr.shape))
10.9 µs ± 218 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# Creating a copy of the above.
%timeit np.broadcast_to(arr, (3, *arr.shape)).copy()
189 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.array_equal(np.repeat(arr[None,:], 3, axis=0),
               np.tile(arr, (3, 1, 1)))
True
Sounds like a job for np.tile:
In [101]: np.tile(A, (3,1,1))
Out[101]:
array([[[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]],

       [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]],

       [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]])
The second argument specifies the number of copies along each dimension.
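For instance -
np.tile(A, (1, 2)).shape     # (3, 6): two copies side by side along axis 1
np.tile(A, (3, 1, 1)).shape  # (3, 3, 3): three copies along a new leading axis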
If you don't need to modify the result, make use of broadcast_to:
np.broadcast_to(arr, (3, *arr.shape))
Validation using @coldspeed's answer:
arr = np.arange(10000 * 1000).reshape(10000, 1000)
X = np.repeat(arr[None,:], 3, axis=0)
broadcast_x = np.broadcast_to(arr, (3, *arr.shape))
np.array_equal(X, broadcast_x)
True
If you do need to be able to modify the result, you can call copy() on it, which should come close to repeat and tile in terms of speed.
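For example -
X = np.broadcast_to(arr, (3, *arr.shape)).copy()
X.flags.writeable  # True; the broadcast view itself is read-only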

Numpy: create a matrix from a cartesian product of two vectors (or one with itself) while applying a function to all pairs

To give a bit of explanation: I want to create a covariance matrix where each element is defined by a kernel function k(x, y), and I want to do this for a single vector. It should be something like:
# This is given
x = [x1, x2, x3, x4, ...]
# This is what I want to compute
result = [[k(x1, x1), k(x1, x2), k(x1, x3), ...],
          [k(x2, x1), k(x2, x2), ...],
          [k(x3, x1), k(x3, x2), ...],
          ...]
but of course this should be done with NumPy arrays, ideally without Python iteration, because of performance. If I didn't care about performance, I'd probably just write:
result = np.zeros((len(x), len(x)))
for i in range(len(x)):
    for j in range(len(x)):
        result[i, j] = k(x[i], x[j])
But I feel like there must be a more idiomatic way to write this pattern.
If k operates on 2D arrays, you could use np.meshgrid. But this would have extra memory overhead. One alternative would be to create 2D mesh views, same as with np.meshgrid, like so -
def meshgrid1D_view(x):
    shp = (len(x), len(x))
    mesh1 = np.broadcast_to(x, shp)
    mesh2 = np.broadcast_to(x[:,None], shp)
    return mesh1, mesh2
Sample run -
In [140]: x
Out[140]: array([3, 5, 6, 8])

In [141]: np.meshgrid(x,x)
Out[141]:
[array([[3, 5, 6, 8],
        [3, 5, 6, 8],
        [3, 5, 6, 8],
        [3, 5, 6, 8]]),
 array([[3, 3, 3, 3],
        [5, 5, 5, 5],
        [6, 6, 6, 6],
        [8, 8, 8, 8]])]

In [142]: meshgrid1D_view(x)
Out[142]:
(array([[3, 5, 6, 8],
        [3, 5, 6, 8],
        [3, 5, 6, 8],
        [3, 5, 6, 8]]),
 array([[3, 3, 3, 3],
        [5, 5, 5, 5],
        [6, 6, 6, 6],
        [8, 8, 8, 8]]))
How does this help?
It helps with memory efficiency and hence performance. Let's test out on large arrays to see the difference -
In [143]: x = np.random.randint(0,10,(10000))

In [144]: %timeit np.meshgrid(x,x)
10 loops, best of 3: 171 ms per loop

In [145]: %timeit meshgrid1D_view(x)
100000 loops, best of 3: 6.91 µs per loop
Another solution is to let numpy do the broadcasting itself:
import numpy as np

def k(x, y):
    return x**2 + y

def meshgrid1D_view(x):
    shp = (len(x), len(x))
    mesh1 = np.broadcast_to(x, shp)
    mesh2 = np.broadcast_to(x[:,None], shp)
    return mesh1, mesh2

x = np.random.randint(0, 10, (10000,))

def sol0(x):
    return k(x[:,None], x[None,:])

def sol1(x):
    x, y = np.meshgrid(x, x)
    return k(x, y)

def sol2(x):
    x, y = meshgrid1D_view(x)
    return k(x, y)
%timeit sol0(x)
165 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sol1(x)
655 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit sol2(x)
341 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, this is more efficient, and there's less code.

Python numpy, reshape/transform array avoiding iteration through rows

I have a time series with 4 features at each step; it looks like a set of rows with 4 columns. I want to convert it so that row N contains a vector of the features of rows N and N-1:
a = np.array([[1,2,3,0], [4,5,6,0], [7,8,9,0], [10,11,12,0]])
array([[ 1,  2,  3,  0],
       [ 4,  5,  6,  0],
       [ 7,  8,  9,  0],
       [10, 11, 12,  0]])
a.shape
(4, 4)
convert to:
array([[[ 1,  2,  3,  0],
        [ 4,  5,  6,  0]],

       [[ 4,  5,  6,  0],
        [ 7,  8,  9,  0]],

       [[ 7,  8,  9,  0],
        [10, 11, 12,  0]]])
a_.shape
(3, 2, 4)
I'm using the following code to do that:
seq_len = 2
for i in range(seq_len, a.shape[0]+1):
    if i-seq_len == 0:
        a_ = a[i-seq_len:i, :].reshape(1, -1, 4)
    else:
        a_ = np.vstack([a_, a[i-seq_len:i, :].reshape(1, -1, 4)])
It works, but I think it is not an optimal solution. Could you please suggest how I can improve my code by avoiding the 'for' loop?
Use slicing and np.stack along the appropriate axis.
np.stack((a[:-1], a[1:]), axis=1)
Some timings to compare with the other answer out there.
In [13]: s = 1_000_000
In [15]: a = np.arange(s).reshape((s//4,4))
In [21]: %timeit a[[(i-1,i) for i in range(1,a.shape[0])],:]
127 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [22]: %timeit np.stack((a[:-1], a[1:]), axis=1) # My solution
6.8 ms ± 8.18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Avoiding any Python-level for-loop is the way to go; OP was right.
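On NumPy 1.20+, np.lib.stride_tricks.sliding_window_view offers a zero-copy variant; note the window axis is appended last, hence the transpose back into the expected layout (a sketch assuming that version) -
w = np.lib.stride_tricks.sliding_window_view(a, 2, axis=0).transpose(0, 2, 1)
w.shape  # (3, 2, 4) for the 4x4 example above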
Use slicing: a[[(i-1,i) for i in range(1,a.shape[0])],:]
Edit: nicoco's answer is the better one.
