I am looking for a built-in NumPy function or a vectorized approach to build n-by-n matrices like the ones below, for n > 1. The key property is that the last element of each row serves as the first element of the following row.
n = 2
# array([[1, 2],
# [2, 3]])
n = 3
# array([[1, 2, 3],
# [3, 4, 5],
# [5, 6, 7]])
n = 4
# array([[1, 2, 3, 4],
# [4, 5, 6, 7],
# [7, 8, 9, 10],
# [10, 11, 12, 13]])
My attempt uses a list comprehension. The same can be written with explicit for loops as well.
import numpy as np
n = 4
arr = np.array([[(n-1)*j+i for i in range(1, n+1)] for j in range(n)])
# array([[ 1, 2, 3, 4],
# [ 4, 5, 6, 7],
# [ 7, 8, 9, 10],
# [10, 11, 12, 13]])
If you want to keep things simple (and somewhat readable), this should do it:
def ranged_mat(n):
    out = np.arange(1, n ** 2 + 1).reshape(n, n)
    out -= np.arange(n).reshape(n, 1)
    return out
Simply build all numbers from 1 to n², reshape them to the desired block shape, then subtract the row number from each row.
This does the same as Divakar's ranged_mat_v2, but I like being explicit with intermediate array shapes. Not everyone is an expert at NumPy's broadcasting rules.
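For concreteness, here is what the intermediate arrays look like for n = 3 (a quick sketch, with values shown as comments):

import numpy as np

n = 3
out = np.arange(1, n ** 2 + 1).reshape(n, n)
# out is now:
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]
out -= np.arange(n).reshape(n, 1)
# subtracting the column vector [[0], [1], [2]] row-wise gives:
# [[1 2 3]
#  [3 4 5]
#  [5 6 7]]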
Approach #1
We can leverage scikit-image's view_as_windows, which is built on np.lib.stride_tricks.as_strided, to get sliding windows. Additionally, it accepts a step argument, and that fits in perfectly for this problem. Hence, the implementation would be -
from skimage.util.shape import view_as_windows
def ranged_mat(n):
    r = np.arange(1, n*(n-1)+2)
    return view_as_windows(r, n, step=n-1)
Sample runs -
In [270]: ranged_mat(2)
Out[270]:
array([[1, 2],
[2, 3]])
In [271]: ranged_mat(3)
Out[271]:
array([[1, 2, 3],
[3, 4, 5],
[5, 6, 7]])
In [272]: ranged_mat(4)
Out[272]:
array([[ 1, 2, 3, 4],
[ 4, 5, 6, 7],
[ 7, 8, 9, 10],
[10, 11, 12, 13]])
Approach #2
Another with outer-broadcasted-addition -
def ranged_mat_v2(n):
    r = np.arange(n)
    return (n-1)*r[:,None] + r + 1
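To see the broadcasting at work, here is a small sketch for n = 3 with shapes noted in comments:

import numpy as np

n = 3
r = np.arange(n)             # shape (n,):   [0 1 2]
col = (n - 1) * r[:, None]   # shape (n, 1): [[0] [2] [4]]
# (n, 1) + (n,) broadcasts to (n, n); adding 1 shifts the start to 1:
print(col + r + 1)
# [[1 2 3]
#  [3 4 5]
#  [5 6 7]]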
Approach #3
We can also use numexpr module that supports multi-core processing and hence achieve better efficiency on large n's -
import numexpr as ne
def ranged_mat_v3(n):
    r = np.arange(n)
    r2d = (n-1)*r[:,None]
    return ne.evaluate('r2d+r+1')
Making use of slicing gives us a more memory-efficient one -
def ranged_mat_v4(n):
    r = np.arange(n+1)
    r0 = r[1:]
    r1 = r[:-1,None]*(n-1)
    return ne.evaluate('r0+r1')
Timings -
In [423]: %timeit ranged_mat(10000)
273 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [424]: %timeit ranged_mat_v2(10000)
316 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [425]: %timeit ranged_mat_v3(10000)
176 ms ± 85.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [426]: %timeit ranged_mat_v4(10000)
154 ms ± 82.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
There is also np.fromfunction, as below (see the NumPy documentation for np.fromfunction for details).

def func(n):
    return np.fromfunction(lambda r, c: (n-1)*r + 1 + c, shape=(n, n))
It takes a function which calculates an array from the index values.
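One caveat worth noting (not mentioned in the original answer): np.fromfunction builds float index arrays by default, so the result above is a float matrix. Passing dtype=int keeps everything integral:

import numpy as np

def func(n):
    # dtype=int makes fromfunction pass integer index arrays,
    # so the result matches the integer outputs shown elsewhere.
    return np.fromfunction(lambda r, c: (n-1)*r + 1 + c, shape=(n, n), dtype=int)

func(3)
# array([[1, 2, 3],
#        [3, 4, 5],
#        [5, 6, 7]])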
We can use NumPy strides for this:
def as_strides(n):
    m = n**2 - (n-1)
    a = np.arange(1, m+1)
    s = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(n, n), strides=((n-1)*s, s))
as_strides(2)
array([[1, 2],
[2, 3]])
as_strides(3)
array([[1, 2, 3],
[3, 4, 5],
[5, 6, 7]])
as_strides(4)
array([[ 1, 2, 3, 4],
[ 4, 5, 6, 7],
[ 7, 8, 9, 10],
[10, 11, 12, 13]])
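A word of caution on as_strided: it returns a view into a's memory, and the overlapping windows share elements, so writing through the view can silently corrupt neighbouring rows. An optional hardening (not part of the original answer) is to request a read-only view via the writeable flag:

def as_strides_readonly(n):
    # Same construction, but accidental writes raise an error
    # instead of corrupting the overlapping windows.
    m = n**2 - (n-1)
    a = np.arange(1, m+1)
    s = a.strides[0]
    return np.lib.stride_tricks.as_strided(
        a, shape=(n, n), strides=((n-1)*s, s), writeable=False)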
Related
I have a DataFrame like below:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
"list": [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]})
I want to filter this DataFrame so that it only contains the rows in which X is a subsequence of the list column (i.e. the elements of X appear in the list in the same order and are not interleaved with other elements).
For example, if X = [6, 8, 3], I want the output to look like this:
id list
1 [2, 51, 6, 8, 3]
3 [6, 8, 3, 9, 10, 11]
I know I can check if a list is a subsequence of another list with the following function (found on How to check subsequence exists in a list?):
def x_in_y(query, base):
    l = len(query)
    for i in range(len(base) - l + 1):
        if base[i:i+l] == query:
            return True
    return False
I have two questions:
Question 1:
How to apply this to a Pandas DataFrame column like in my example?
Question 2:
Is this the most efficient way to do this? And if not, what would be? The function does not look that elegant/Pythonic, and I have to apply it to a very big DataFrame of about 200K rows.
[Note: the elements of the lists in the list column are unique, should that help to optimize things]
Here is a solution: call the function for each value of the column:
df = df[df.list.map(lambda x: x_in_y(X, x))]
#alternative
#df = df[df.list.apply(lambda x: x_in_y(X, x))]
print (df)
id list
0 1 [2, 51, 6, 8, 3]
2 3 [6, 8, 3, 9, 10, 11]
Performance is really good on the sample data, and it holds up on realistically sized data too:
#200k rows
df = pd.concat([df] * 40000, ignore_index=True)
print (df)
X = [6, 8, 3]
x = to_string([6, 8, 3])
In [166]: %timeit df.list.map(lambda x: x_in_y(X, x))
214 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [167]: %timeit df['list'].map(to_string).str.contains(x)
413 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [168]: %timeit df["list"].apply(has_subsequence, subseq=X)
5.2 s ± 420 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [169]: %timeit df.list.apply(lambda y: ''.join(map(str,X)) in ''.join(map(str,y)))
573 ms ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using the numpy rolling-window technique:
import numpy as np
def rolling_window(a, size):
    a = np.array(a)
    shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def has_subsequence(a, subseq):
    return (rolling_window(a, len(subseq)) == subseq).all(axis=1).any()
mask = df["list"].apply(has_subsequence, subseq=[6,8,3])
df[mask]
Explanation:
rolling_window creates a view into the array with the given shape and strides:
>>> rolling_window([1,2,3,4], 2)
np.array([[1,2], [2,3], [3,4]])
Then we compare the result with our target X
>>> np.array([[1,2], [2,3], [3,4]]) == [2,3]
np.array([[False, False], [True, True], [False, False]])
Then we tell NumPy to return True for the rows where all items are True along axis 1.
>>> np.array([[False, False], [True, True], [False, False]]).all(axis=1)
np.array([False, True, False])
And finally we return True if there is any True in the array.
>>> np.array([False, True, False]).any()
True
You can try this:
import pandas as pd
def to_string(l):
    return '-' + '-'.join(map(str, l)) + '-'
X = to_string([6, 8, 3])
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "list": [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]})
df[df['list'].map(to_string).str.contains(X)]
# id list
# 0 1 [2, 51, 6, 8, 3]
# 2 3 [6, 8, 3, 9, 10, 11]
In my opinion it is important to add delimiters to the start and end of the string. Otherwise you will get false positives with lists such as [666, 8, 3].
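A quick illustration of why the delimiters matter, using the [666, 8, 3] example from above:

# Without delimiters, digits from different elements run together:
''.join(map(str, [6, 8, 3])) in ''.join(map(str, [666, 8, 3]))
# '683' in '66683' -> True (a false positive!)

# With delimiters, element boundaries are preserved:
to_string([6, 8, 3]) in to_string([666, 8, 3])
# '-6-8-3-' in '-666-8-3-' -> False, as desired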
You can try this:
x = [6, 8, 3]
df = df.loc[df.list.apply(lambda y: ''.join(map(str,x)) in ''.join(map(str,y)))]
OUTPUT:
id list
0 1 [2, 51, 6, 8, 3]
2 3 [6, 8, 3, 9, 10, 11]
This question already has answers here:
Randomly shuffle items in each row of numpy array
(6 answers)
Closed 3 years ago.
Given a matrix A, I want to apply different random shuffles to different rows of A; for example,
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
becomes
array([[1, 3, 2],
[6, 5, 4],
[7, 9, 8]])
Of course we can loop through the matrix and shuffle each row; however, iteration is slow, and I am asking if there is a more efficient way to do this.
Picked up this neat trick from Divakar which involves randn and argsort:
np.random.seed(0)
s = np.arange(16).reshape(4, 4)
np.take_along_axis(s, np.random.randn(*s.shape).argsort(axis=1), axis=1)
array([[ 1, 0, 3, 2],
[ 4, 6, 5, 7],
[11, 10, 8, 9],
[14, 12, 13, 15]])
For a 2D array, this can be simplified to
s[np.arange(len(s))[:,None], np.random.randn(*s.shape).argsort(axis=1)]
array([[ 1, 0, 3, 2],
[ 4, 6, 5, 7],
[11, 10, 8, 9],
[14, 12, 13, 15]])
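Why the trick works, in brief: argsorting a row of i.i.d. random draws yields a uniformly random permutation of that row's column indices, so every row gets its own independent shuffle. A small sketch:

import numpy as np

np.random.seed(0)
noise = np.random.randn(2, 4)   # independent random values per row
perm = noise.argsort(axis=1)    # each row is a random permutation of 0..3
print(perm)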
You can also apply np.random.permutation over each row independently to return a new array.
np.apply_along_axis(np.random.permutation, axis=1, arr=s)
array([[ 3, 1, 0, 2],
[ 4, 6, 5, 7],
[ 8, 9, 10, 11],
[15, 14, 13, 12]])
Performance -
s = np.arange(10000 * 100).reshape(10000, 100)
%timeit s[np.arange(len(s))[:,None], np.random.randn(*s.shape).argsort(axis=1)]
%timeit np.apply_along_axis(np.random.permutation, 1, s)
84.6 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
842 ms ± 8.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I've noticed the winner depends on the dimensions of your data, so make sure to test on your own arrays first.
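For what it's worth, NumPy 1.20+ also offers Generator.permuted, which shuffles each row independently in a single call; a minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
s = np.arange(16).reshape(4, 4)
shuffled = rng.permuted(s, axis=1)  # independent shuffle within each row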
Code-wise you can use NumPy's apply_along_axis as
np.apply_along_axis(np.random.shuffle, 1, matrix)
but it doesn't seem to be more efficient than iterating, at least for a 3x3 matrix; for that method I get
> %%timeit
> np.apply_along_axis(np.random.shuffle, 1, test)
67 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
while the iteration gives
> %%timeit
> for i in range(test.shape[0]):
> np.random.shuffle(test[i])
20.3 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This question already has answers here:
Create 3D array from a 2D array by replicating/repeating along the first axis
(4 answers)
Closed 4 years ago.
I need to augment (replicate) a 2D array of shape 32x32 into a 3D array of shape 3x32x32 by duplicating the source array along a new leading axis. How can I do this in the best possible way?
Below is a sample of the source and expected arrays. I need to apply this logic at a larger scale in my application.
Source array:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Expected array:
array([[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]],
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]],
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]])
By my tests, np.repeat is a little faster than np.tile:
X = np.repeat(arr[None,:], 3, axis=0)
Alternatively, use np.concatenate:
X = np.concatenate([[arr]] * 3, axis=0)
arr = np.arange(10000 * 1000).reshape(10000, 1000)
%timeit np.repeat(arr[None,:], 3, axis=0)
%timeit np.tile(arr, (3, 1, 1))
%timeit np.concatenate([[arr]] * 3, axis=0)
# Read-only, array cannot be modified.
%timeit np.broadcast_to(arr, (3, *arr.shape))
# Creating copy of the above.
%timeit np.broadcast_to(arr, (3, *arr.shape)).copy()
170 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
187 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
243 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.9 µs ± 218 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
189 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.array_equal(np.repeat(arr[None,:], 3, axis=0),
               np.tile(arr, (3, 1, 1)))
True
Sounds like a job for np.tile:
In [101]: np.tile(A, (3,1,1))
Out[101]:
array([[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]],
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]],
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]])
The second argument specifies the number of copies along each dimension.
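As a small illustration of the reps semantics (shapes only, for the 3x3 A above):

# reps (3, 1, 1): A is first promoted to shape (1, 3, 3), then copied
# 3 times along the new leading axis and once along each original axis.
np.tile(A, (3, 1, 1)).shape   # -> (3, 3, 3)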
If you don't need to modify the result, make use of broadcast_to:
np.broadcast_to(arr, (3, *arr.shape))
Validation using #coldspeed's answer:
arr = np.arange(10000 * 1000).reshape(10000, 1000)
X = np.repeat(arr[None,:], 3, axis=0)
broadcast_x = np.broadcast_to(arr, (3, *arr.shape))
np.array_equal(X, broadcast_x)
True
If you do need to be able to modify, you can call copy() on the result, which should come close to repeat and tile in terms of speed.
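A short demonstration of the read-only behaviour and the copy (a sketch):

import numpy as np

arr = np.arange(6).reshape(2, 3)
view = np.broadcast_to(arr, (3, *arr.shape))
view.flags.writeable          # False: assigning through the view raises ValueError
writable = view.copy()        # materializes a writable, memory-owning copy
writable[0, 0, 0] = 99        # fine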
To give a bit of explanation, I want to create a covariance matrix where each element is defined by a kernel function k(x, y), and I want to do this for a single vector. It should be something like:
# This is given
x = [x1, x2, x3, x4, ...]
# This is what I want to compute
result = [[k(x1, x1), k(x1, x2), k(x1, x3), ...],
[k(x2, x1), k(x2, x2), ...],
[k(x3, x1), k(x3, x2), ...],
...]
but of course this should be done with NumPy arrays, ideally without any Python iteration, because of performance. If I didn't care about performance, I'd probably just write:
result = np.zeros((len(x), len(x)))
for i in range(len(x)):
    for j in range(len(x)):
        result[i, j] = k(x[i], x[j])
But I feel like there must be a more idiomatic way to write this pattern.
If k operates on 2D arrays, you could use np.meshgrid. But this carries extra memory overhead. One alternative would be to create 2D mesh views that look the same as np.meshgrid's output, like so -
def meshgrid1D_view(x):
    shp = (len(x), len(x))
    mesh1 = np.broadcast_to(x, shp)
    mesh2 = np.broadcast_to(x[:,None], shp)
    return mesh1, mesh2
Sample run -
In [140]: x
Out[140]: array([3, 5, 6, 8])
In [141]: np.meshgrid(x,x)
Out[141]:
[array([[3, 5, 6, 8],
[3, 5, 6, 8],
[3, 5, 6, 8],
[3, 5, 6, 8]]), array([[3, 3, 3, 3],
[5, 5, 5, 5],
[6, 6, 6, 6],
[8, 8, 8, 8]])]
In [142]: meshgrid1D_view(x)
Out[142]:
(array([[3, 5, 6, 8],
[3, 5, 6, 8],
[3, 5, 6, 8],
[3, 5, 6, 8]]), array([[3, 3, 3, 3],
[5, 5, 5, 5],
[6, 6, 6, 6],
[8, 8, 8, 8]]))
How does this help?
It helps with memory efficiency and hence performance. Let's test out on large arrays to see the difference -
In [143]: x = np.random.randint(0,10,(10000))
In [144]: %timeit np.meshgrid(x,x)
10 loops, best of 3: 171 ms per loop
In [145]: %timeit meshgrid1D_view(x)
100000 loops, best of 3: 6.91 µs per loop
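A related built-in worth knowing (not used in the original answer): np.meshgrid accepts sparse=True, which returns lightweight (1, n) and (n, 1) arrays instead of two full n x n arrays, giving a similar memory saving:

mesh1, mesh2 = np.meshgrid(x, x, sparse=True)
mesh1.shape, mesh2.shape   # (1, n) and (n, 1); they broadcast to (n, n)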
Another solution is to let numpy do the broadcasting itself:
import numpy as np
def k(x, y):
    return x**2 + y

def meshgrid1D_view(x):
    shp = (len(x), len(x))
    mesh1 = np.broadcast_to(x, shp)
    mesh2 = np.broadcast_to(x[:,None], shp)
    return mesh1, mesh2
x = np.random.randint(0, 10, (10000))

def sol0(x):
    return k(x[:,None], x[None,:])

def sol1(x):
    x, y = np.meshgrid(x, x)
    return k(x, y)

def sol2(x):
    x, y = meshgrid1D_view(x)
    return k(x, y)
%timeit sol0(x)
165 ms ± 1.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit sol1(x)
655 ms ± 6.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit sol2(x)
341 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, this is more efficient, and there's less code.
I have a time series with 4 features at each step; it looks like a set of rows with 4 columns. I want to convert it so that row N contains a vector of the features of rows N and N-1.
a = np.array([[1,2,3,0], [4,5,6,0], [7,8,9,0], [10,11,12,0]])
array([[ 1, 2, 3, 0],
[ 4, 5, 6, 0],
[ 7, 8, 9, 0],
[10, 11, 12, 0]])
a.shape
(4, 4)
convert to:
array([[[ 1, 2, 3, 0],
[ 4, 5, 6, 0]],
[[ 4, 5, 6, 0],
[ 7, 8, 9, 0]],
[[ 7, 8, 9, 0],
[10, 11, 12, 0]]])
a_.shape
(3, 2, 4)
I'm using the following code to do that:
seq_len = 2
for i in range(seq_len, a.shape[0]+1):
    if i - seq_len == 0:
        a_ = a[i-seq_len:i, :].reshape(1, -1, 4)
    else:
        a_ = np.vstack([a_, a[i-seq_len:i, :].reshape(1, -1, 4)])
It works, but I think it is not an optimal solution. Could you please suggest how I can improve my code by avoiding the for loop?
Use slicing to get the shifted copies, then np.stack them along the appropriate axis.
np.stack((a[:-1], a[1:]), axis=1)
Some timings to compare with the other answer out there.
In [13]: s = 1_000_000
In [15]: a = np.arange(s).reshape((s//4,4))
In [21]: %timeit a[[(i-1,i) for i in range(1,a.shape[0])],:]
127 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [22]: %timeit np.stack((a[:-1], a[1:]), axis=1) # My solution
6.8 ms ± 8.18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Avoiding any Python-level for loop is the way to go; the OP was right.
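On NumPy 1.20+ there is also sliding_window_view, a copy-free alternative worth considering (a sketch using the a from the question; note the window axis is appended last, hence the transpose):

from numpy.lib.stride_tricks import sliding_window_view

# windows of length 2 along axis 0 give shape (3, 4, 2);
# moving the window axis to the middle yields the desired (3, 2, 4)
a_ = sliding_window_view(a, 2, axis=0).transpose(0, 2, 1)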
Use slicing: a[[(i-1,i) for i in range(1,a.shape[0])],:]
Edit: nicoco's answer is the better one.