I have a multi-dimensional array arr_multi_dim. Every time I increase a parameter n, more entries are created in the array results and the array gets larger.
With each increase in n, I need to apply np.concatenate() to arr_multi_dim in such a way that more np.concatenate() calls are nested every time n increases.
For example,
when n=2:
arr_multi_dim = np.concatenate(np.concatenate(arr_multi_dim, axis=1), axis=1)
when n=3:
arr_multi_dim = np.concatenate(np.concatenate(
np.concatenate(np.concatenate(arr_multi_dim, axis=1), axis=1), axis=1), axis=1)
when n=4:
arr_multi_dim = np.concatenate(np.concatenate(
np.concatenate(np.concatenate(
np.concatenate(np.concatenate(arr_multi_dim, axis=1), axis=1), axis=1), axis=1), axis=1), axis=1)
etc.
where at each increment of n, a pair of np.concatenate() calls (i.e. two) gets added to the expression.
How do I write a function, loop (or something similar), so that when I specify any value of n, the appropriate number of np.concatenate() calls will be used?
Many thanks in advance.
Edit:
This is the full code that I have written which uses the above np.concatenate() function.
from itertools import product
from joblib import Parallel, delayed
from functools import reduce
from operator import mul
import numpy as np
lst = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
arr = np.array(lst)
n = 2
def test1(arr, n):
    flat = np.ravel(arr).tolist()
    gen = (list(a) for a in product(flat, repeat=n))
    results = Parallel(n_jobs=-1)(delayed(reduce)(mul, x) for x in gen)
    nrows = arr.shape[0]
    ncols = arr.shape[1]
    arr_multi_dim = np.array(results).reshape((nrows, ncols)*n)
    arr_final = np.concatenate(np.concatenate(arr_multi_dim, axis=1), axis=1)  # need to generalise this
    return arr_final
The above code only works for n=2. I am trying to generalise the np.concatenate() part of the code so that it works for any n, as described above.
If I understood you correctly, it's pretty simple:
arr_multi_dim = np.array(results).reshape((nrows, ncols) * n)  # the reshaped array from test1
for i in range(n):
    if i < 2:
        arr_multi_dim = np.concatenate(arr_multi_dim, axis=1)
    else:
        arr_multi_dim = np.concatenate(np.concatenate(arr_multi_dim, axis=1), axis=1)
because the first two iterations only add a single layer each, while the rest add two layers.
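Equivalently, since each np.concatenate(..., axis=1) call collapses one leading axis and the loop above performs 2*(n-1) of them in total, the whole thing can be written as a single flat loop. A minimal sketch (flatten_blocks is just an illustrative name; it assumes arr_multi_dim is the reshaped (nrows, ncols)*n array from test1):
def flatten_blocks(arr_multi_dim, n):
    # 2*(n-1) merges take the 2n-dimensional block array down to 2 dimensions
    for _ in range(2 * (n - 1)):
        arr_multi_dim = np.concatenate(arr_multi_dim, axis=1)
    return arr_multi_dim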
Related
I have a boolean sparse matrix that I represent with row indices and column indices of True values.
import numpy as np
import jax
from jax import numpy as jnp
N = 10000
M = 1000
X = np.random.randint(0, 100, size=(N, M)) == 0 # data setup
rows, cols = np.where(X == True)
rows = jax.device_put(rows)
cols = jax.device_put(cols)
I want to get a column slice of the matrix like X[:, 3], but just from rows indices and column indices.
I managed to do that by using jnp.isin like below, but the problem is that this is not JIT compatible because of the data-dependent shaped array rows[cols == m].
def not_jit_compatible_slice(rows, cols, m):
    return jnp.isin(jnp.arange(N), rows[cols == m])
I could make it JIT compatible by using jnp.where in the three-argument form, but this operation is much slower than the previous one.
def jit_compatible_but_slow_slice(rows, cols, m):
    return jnp.isin(jnp.arange(N), jnp.where(cols == m, rows, -1))
Is there any fast and JIT-compatible solution to achieve the same output?
You can do a bit better than the answer below by using the mode argument of set() to drop out-of-bound indices, eliminating the final slice:
out = jnp.zeros(N, bool).at[jnp.where(cols==3, rows, N)].set(True, mode='drop')
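Wrapped as a function for comparison (a sketch; drop_slice is just an illustrative name, and N is the global from the setup above):
def drop_slice(rows, cols, m):
    # indices equal to N are out of bounds for a length-N array and are
    # silently discarded thanks to mode='drop'
    return jnp.zeros(N, bool).at[jnp.where(cols == m, rows, N)].set(True, mode='drop')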
I figured out that the implementation below returns the same output much faster, and it’s JIT compatible.
def slice(rows, cols, m):
    res = jnp.zeros(N + 1, dtype=bool)
    res = res.at[jnp.where(cols == m, rows, -1)].set(True)
    return res[:-1]
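A quick sanity check against the dense slice, assuming the setup from the question (note that slice shadows the Python builtin, so a different name may be preferable in real code):
col = jax.jit(slice)(rows, cols, 3)
print(np.array_equal(np.asarray(col), X[:, 3]))  # should print True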
I am trying to get good at numpy and want to know if I can use values in existing arrays as indices for a function that returns values for another array. I can do this:
def somefun(i):
    return i + 1

x = np.array([2, 4, 5])
k_labs = np.arange(100)
k_labs2 = k_labs[somefun(x[:])]
But how do I deal with vectors in matrices in case x is a 2D array, where I just want to use one vector (column) at a time as index arguments for a function, such as x[:, i], without using for-loops?
such as would be the case in:
x = np.array([[2, 4, 5], [7, 8, 9]])

def somefun(i):
    return i + 1

k_labs = np.arange(100)
k_labs2 = k_labs[somefun(x[:, i])]
EDIT ITERATION 2
To get the gist of what I am trying to accomplish, see the code below. In the function pred, as you can see, I wanted to write the things I've commented out in a numpy fashion that might work better. I have some problems, though, with the two lines I put in instead, since I get an error of wrong broadcast dimensions in the function distance, at the line where I try to assign the normalized vectors to a variable.
class kNN:
    def __init__(self, X_train: np.array, label_train, val=None):
        self.X = X_train  # X[:-1, :]
        self.labels = label_train  # X[-1, :]
        # self.k = k
        self.kNN_4all = None  # np.zeros(self.X.shape[1])

    def distance(self, x1):
        x1 = np.tile(x1, (self.X.shape[1], 1))  # creates a matrix of len of X with copies of the x1 vector for easy matrix subtraction
        dists = np.linalg.norm(x1 - self.X.T, axis=1)  # flips to find linalg.norm along all the rows
        return dists

    def k_nearest(self, x_vec, k):
        k_nearest = self.distance(x_vec)
        k_nearest = np.argsort(k_nearest)[:k]
        kNN_labs = np.zeros(k_nearest.shape)
        kNN_labs[:] = self.labels[k_nearest[:]]
        unique, vote = np.unique(kNN_labs, return_counts=True)
        return unique[np.argmax(vote)]

    def pred(self, X_test, k):
        self.kNN_4all = np.zeros(X_test.shape[1])
        self.kNN_4all = self.k_nearest(X_test[:, :], k)
        # for i in range(X_test.shape[1]):
        #     NewLabel = self.k_nearest(X_test[:, i], k)  # defines x_vec in matrix X
        #     self.kNN_4all[i] = NewLabel
        # return self.kNN_4all

    def prec(self, labels_val):
        elem_equal = (self.kNN_4all == labels_val).astype(int).flatten()
        prec = np.sum(elem_equal) / elem_equal.shape
        return 1 - prec[0]
X_train = X[:, :100]
labs_train = labs[:100]
pilot = kNN(X_train, labs_train)
pilot.pred(X[:,100:200], 10)
pilot.prec(labs[100:200])
I get the following error:
ValueError: operands could not be broadcast together with shapes (78400,100) (100,784)
As we can see from the code, k_nearest(self, x_vec, k) takes one 1D subarray, so passing a full matrix X will cause the broadcasting error, since the functions within k_nearest rely on receiving only a 1D subarray.
I don't know if it is really possible to avoid for-loops in this regard and use numpy to step through 1D subarrays as arguments for a function, such that each call of the function can be assigned to a different cell in another array, in this case self.kNN_4all.
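For reference, the per-test-point loop can be avoided entirely by computing all pairwise distances at once with broadcasting. A minimal sketch, assuming the same (features, samples) layout as the class above (X_train and X_test are stand-ins for self.X and the pred argument):
# X_test: (d, n_test), X_train: (d, n_train) -> dists: (n_test, n_train)
dists = np.linalg.norm(X_test.T[:, None, :] - X_train.T[None, :, :], axis=2)
nearest = np.argsort(dists, axis=1)[:, :k]  # k nearest training indices per test point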
x = np.array([[2, 4, 5], [7, 8, 9], [33, 50, 71]])
x = x + 1
k_labs = np.arange(100)
ttt = k_labs[x]
print(ttt)
ttt creates an array that takes values from k_labs based on the pseudo-indices in x. The array is accessed, for example:
print(ttt[1])  # [ 8  9 10]
If you want to refer to a certain row alone (for example, via the indices in x[2]), then the code is as follows:
x = np.array([[2, 4, 5], [7, 8, 9], [33, 50, 71]])
x = x + 1
k_labs = np.arange(100)
print(k_labs[x[2]])
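Tying this back to the question: fancy indexing accepts the whole 2D index array at once, so the per-column form k_labs[somefun(x[:, i])] is unnecessary. A short sketch with the somefun from the question:
x = np.array([[2, 4, 5], [7, 8, 9]])
k_labs = np.arange(100)
k_labs2 = k_labs[somefun(x)]  # shape (2, 3): one row of labels per row of x
print(k_labs2)
# [[ 3  5  6]
#  [ 8  9 10]]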
I have a matrix (2d numpy ndarray, to be precise):
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
And I want to roll each row of A independently, according to roll values in another array:
r = np.array([2, 0, -1])
That is, I want to do this:
print(np.array([np.roll(row, x) for row, x in zip(A, r)]))
[[0 0 4]
 [1 2 3]
 [0 5 0]]
Is there a way to do this efficiently? Perhaps using fancy indexing tricks?
Sure you can do it using advanced indexing, whether it is the fastest way probably depends on your array size (if your rows are large it may not be):
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Always use a non-negative shift, so that column_indices are valid.
# (could also use a modulo operation)
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:, np.newaxis]
result = A[rows, column_indices]
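A quick sanity check with the example arrays from the question (note that the snippet above modifies r in place, so pass a copy if the original shift values are still needed):
A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
shifts = np.array([2, 0, -1])
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
shifts[shifts < 0] += A.shape[1]
print(A[rows, column_indices - shifts[:, np.newaxis]])
# [[0 0 4]
#  [1 2 3]
#  [0 5 0]]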
numpy.lib.stride_tricks.as_strided stricks (abbrev pun intended) again!
Speaking of fancy indexing tricks, there's the infamous np.lib.stride_tricks.as_strided. The idea/trick is to take a sliced portion starting from the first column up to the second-to-last one and concatenate it at the end. This ensures that we can stride in the forward direction as needed to leverage np.lib.stride_tricks.as_strided and thus avoid the need to actually roll back. That's the whole idea!
Now, in terms of actual implementation we would use scikit-image's view_as_windows to elegantly use np.lib.stride_tricks.as_strided under the hood. Thus, the final implementation would be -
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r):
    # Concatenate with sliced portion to cover all rolls
    a_ext = np.concatenate((a, a[:, :-1]), axis=1)
    # Get sliding windows; use advanced indexing to select appropriate ones
    n = a.shape[1]
    return viewW(a_ext, (1, n))[np.arange(len(r)), (n - r) % n, 0]
Here's a sample run -
In [327]: A = np.array([[4, 0, 0],
...: [1, 2, 3],
...: [0, 0, 5]])
In [328]: r = np.array([2, 0, -1])
In [329]: strided_indexing_roll(A, r)
Out[329]:
array([[0, 0, 4],
[1, 2, 3],
[0, 5, 0]])
Benchmarking
# @seberg's solution
def advindexing_roll(A, r):
    rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
    r[r < 0] += A.shape[1]
    column_indices = column_indices - r[:, np.newaxis]
    return A[rows, column_indices]
Let's do some benchmarking on an array with a large number of rows and columns -
In [324]: np.random.seed(0)
...: a = np.random.rand(10000,1000)
...: r = np.random.randint(-1000,1000,(10000))
# @seberg's solution
In [325]: %timeit advindexing_roll(a, r)
10 loops, best of 3: 71.3 ms per loop
# Solution from this post
In [326]: %timeit strided_indexing_roll(a, r)
10 loops, best of 3: 44 ms per loop
In case you want a more general solution (dealing with any shape and any axis), I modified @seberg's solution:
def indep_roll(arr, shifts, axis=1):
    """Apply an independent roll for each dimension of a single axis.

    Parameters
    ----------
    arr : np.ndarray
        Array of any shape.
    shifts : np.ndarray
        How many places to shift each subarray. Shape: `(arr.shape[axis],)`.
    axis : int
        Axis along which elements are shifted.
    """
    arr = np.swapaxes(arr, axis, -1)
    all_idcs = np.ogrid[[slice(0, n) for n in arr.shape]]
    # Convert to a positive shift
    shifts[shifts < 0] += arr.shape[-1]
    all_idcs[-1] = all_idcs[-1] - shifts[:, np.newaxis]
    result = arr[tuple(all_idcs)]
    arr = np.swapaxes(result, -1, axis)
    return arr
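A quick usage sketch on the example arrays from the question (indep_roll also modifies shifts in place, hence the copy):
A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])
print(indep_roll(A, r.copy(), axis=1))
# [[0 0 4]
#  [1 2 3]
#  [0 5 0]]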
I implement a pure numpy.lib.stride_tricks.as_strided solution as follows
from numpy.lib.stride_tricks import as_strided

def custom_roll(arr, r_tup):
    m = np.asarray(r_tup)
    arr_roll = arr[:, [*range(arr.shape[1]), *range(arr.shape[1]-1)]].copy()  # need `copy`
    strd_0, strd_1 = arr_roll.strides
    n = arr.shape[1]
    result = as_strided(arr_roll, (*arr.shape, n), (strd_0, strd_1, strd_1))
    return result[np.arange(arr.shape[0]), (n-m) % n]
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
r = np.array([2, 0, -1])
out = custom_roll(A, r)
# out:
# array([[0, 0, 4],
#        [1, 2, 3],
#        [0, 5, 0]])
By using a fast Fourier transform we can apply a transformation in the frequency domain and then use the inverse fast Fourier transform to obtain the row shift.
So this is a pure numpy solution that takes only one line:
import numpy as np
from numpy.fft import fft, ifft
# The row-shift function using the fast Fourier transform
# rshift(A, r) where A is a 2D array, r the row shift vector
def rshift(A, r):
    return np.real(ifft(fft(A, axis=1) * np.exp(2*1j*np.pi/A.shape[1] * r[:, None] * np.r_[0:A.shape[1]][None, :]), axis=1).round())
This applies a left shift, but we can simply negate the exponent of the exponential to turn the function into a right-shift function:
ifft(fft(...)*np.exp(-2*1j...)
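Spelled out, a right-shift variant matching np.roll's sign convention would be (a sketch under the same assumptions; rshift_right is just an illustrative name):
def rshift_right(A, r):
    # negated exponent: positive entries of r now roll to the right, like np.roll
    return np.real(ifft(fft(A, axis=1) * np.exp(-2*1j*np.pi/A.shape[1] * r[:, None] * np.r_[0:A.shape[1]][None, :]), axis=1).round())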
The left-shift version can be used like this:
# Example:
A = np.array([[1, 2, 3, 4],
              [1, 2, 3, 4],
              [1, 2, 3, 4]])
r = np.array([1, -1, 3])
print(rshift(A, r))
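Given the left-shift semantics, this should print (up to float formatting):
# [[2. 3. 4. 1.]
#  [4. 1. 2. 3.]
#  [4. 1. 2. 3.]]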
Building on Divakar's excellent answer, you can apply this logic to a 3D array easily (which was the problem that brought me here in the first place). Here's an example - basically flatten your data, roll it, and reshape it after:
from skimage.util.shape import view_as_windows as viewW  # used by strided_indexing_roll below

def applyroll_30(cube, threshold=25, offset=500):
    flattened_cube = cube.copy().reshape(cube.shape[0]*cube.shape[1], cube.shape[2])
    roll_matrix = calc_roll_matrix_flattened(flattened_cube, threshold, offset)
    rolled_cube = strided_indexing_roll(flattened_cube, roll_matrix, cube_shape=cube.shape)
    rolled_cube = rolled_cube.reshape(cube.shape[0], cube.shape[1], cube.shape[2])
    return rolled_cube
def calc_roll_matrix_flattened(cube_flattened, threshold, offset):
    """Calculates the number of positions along the time axis we need to shift
    elements in order to trigger the data.
    We return a 1D numpy array of (X*Y,) elements.
    """
    # argmax(...) finds the position in the cube (3d) where we are above threshold
    roll_matrix = np.argmax(cube_flattened > threshold, axis=1) + offset
    # ensure we don't have an index out of bounds
    roll_matrix[roll_matrix > cube_flattened.shape[1]] = cube_flattened.shape[1]
    return roll_matrix
def strided_indexing_roll(cube_flattened, roll_matrix_flattened, cube_shape):
    # negate the shifts, otherwise we shift in the wrong direction for my application
    roll_matrix_flattened = -1 * roll_matrix_flattened
    # Concatenate with sliced portion to cover all rolls
    a_ext = np.concatenate((cube_flattened, cube_flattened[:, :-1]), axis=1)
    # Get sliding windows; use advanced indexing to select appropriate ones
    n = cube_flattened.shape[1]
    result = viewW(a_ext, (1, n))[np.arange(len(roll_matrix_flattened)), (n - roll_matrix_flattened) % n, 0]
    result = result.reshape(cube_shape)
    return result
Divakar's answer doesn't do justice to how much more efficient this is on a large cube of data. I've timed it on 400x400x2000 data formatted as int8. An equivalent for-loop takes ~5.5 seconds, Seberg's answer ~3.0 seconds, and strided_indexing_roll ~0.5 seconds.
Preamble
How can I apply a function to a list with a non-overlapping sliding window? E.g. data = {x_1, x_2, ...., x_n} and we apply f with window size 2 to get {f(x_1, x_2), f(x_3, x_4), ...., f(x_{n-1}, x_n)}.
I understand that I can partition the list and use map on the partitioned list. But are there more efficient ways to handle this operation, especially for ndarrays and dataframes? Something analogous to Mathematica's BlockMap.
Question
The ultimate goal of this is: suppose the dataframe is a time series with values for each hour of the day. How can I apply a function (e.g. mean, variance) for each day, i.e. block-map the function with a non-overlapping window of size 24 hours?
EDIT 1:
Here is a code that returns a pandas dataframe:
import pandas as pd
import numpy as np
dat = np.random.uniform(0,10,40)
xpd = pd.DataFrame(dat)
xpd.rename(columns = {0:'new_name'}, inplace = True)
date_rng = pd.date_range(start='1/1/2018 03:00:00', periods=40, freq='H')
xpd.set_index(date_rng, inplace=True)
How can I calculate the variance for each day from the hourly data and return it as a dataframe?
I tried the below line but it didn't work:
xpd.groupby(by=lambda x: pd.Series.dt.floor(x, freq='d'))
EDIT 2
This worked, problem seems to be solved:
xpd.groupby(by=lambda x: x.floor('d')).var()
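For the record, the same daily aggregation can also be written with pandas' resampling API, which should give the same result here:
xpd.resample('D').var()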
(EDIT: Answered when the question was without edits and titled: map a function with non-overlapping window on a dataframe or ndarray.)
One way, assuming that n is always even, is:
def pairwise_map(func, items):
    iterators = [iter(items)] * 2
    return map(func, zip(*iterators))
list(pairwise_map(sum, range(10)))
# [1, 5, 9, 13, 17]
This consists of two steps: the separation into groups and the mapping.
A more general version of the group separation can be found in flyingcircus.base.group_by().
(Disclaimer: I am the main author of the package).
While the above works for the general case, if you have a NumPy array arr and the function func() is vectorized, one can simply use:
import numpy as np
arr = np.arange(10)
def func(x, y):
    return x + y
func(arr[::2], arr[1::2])
# array([ 1, 5, 9, 13, 17])
EDIT
This can be generalized to any size, e.g.:
def pairwise_map(func, items, window=2):
    iterators = [iter(items)] * window
    return map(func, zip(*iterators))
list(pairwise_map(sum, range(10), 3))
# [3, 12, 21]
This obviously relies on func() being able to accept the correct, or a variable, number of arguments.
Similarly, for NumPy arrays and NumPy-aware functions:
import numpy as np
arr = np.arange(9)
def func(*args):
    return sum(args)
window = 3
func(*(arr[i::window] for i in range(window)))
# array([ 3, 12, 21])
Note that this requires len(arr) % window == 0.
For NumPy functions that support the axis keyword (e.g. np.mean(), np.std(), etc.), one can simply use the following reshaping trick:
import numpy as np
arr = np.arange(56)
window = 8
np.mean(arr.reshape(-1, window), axis=1)
# array([ 3.5, 11.5, 19.5, 27.5, 35.5, 43.5, 51.5])
Note that this also strictly requires len(arr) % window == 0, which can be enforced with e.g. np.concatenate() to pad zeros at the end of the input:
import numpy as np
arr = np.arange(53)
window = 8
remainder = len(arr) % window
padder = np.zeros(window - remainder if remainder else 0, dtype=arr.dtype)
np.mean(np.concatenate((arr, padder)).reshape(-1, window), axis=1)
# array([ 3.5 , 11.5 , 19.5 , 27.5 , 35.5 , 43.5 , 31.25])
I have a numpy array my_array of size 100x20. I want to create a function that receives as input a 2D numpy array my_arr and an index x, and returns two arrays: one of size 1x20, test_arr, and one of size 99x20, train_arr. The vector test_arr corresponds to the row of my_arr with index x, and train_arr contains the remaining rows. I tried to follow a solution using masking:
def split_train_test(my_arr, x):
    a = np.ma.array(my_arr, mask=False)
    a.mask[x, :] = True
    a = np.array(a.compressed())
    return a
Apparently this is not working as I wanted. How can I return numpy arrays as the result, with the train and test arrays split properly?
You can use simple indexing and numpy.delete for this:
def split_train_test(my_arr, x):
    return np.delete(my_arr, x, 0), my_arr[x:x+1]
my_arr = np.arange(10).reshape(5,2)
train, test = split_train_test(my_arr, 2)
train
#array([[0, 1],
# [2, 3],
# [6, 7],
# [8, 9]])
test
#array([[4, 5]])
You can also use a boolean index as the mask:
def split_train_test(my_arr, x):
    # define mask
    mask = np.zeros(my_arr.shape[0], dtype=bool)
    mask[x] = True  # True only at index x, False elsewhere
    return my_arr[mask, :], my_arr[~mask, :]
Sample run:
test_arr, train_arr = split_train_test(np.random.rand(100, 20), x=10)
print(test_arr.shape, train_arr.shape)
(1, 20) (99, 20)
EDIT:
If someone is looking for the general case where more than one element needs to be allocated to the test array (say 80%-20% split), x can also accept an array:
my_arr = np.random.rand(100, 20)
x = np.random.choice(np.arange(my_arr.shape[0]), int(my_arr.shape[0]*0.8), replace=False)
test_arr, train_arr = split_train_test(my_arr, x)
print(test_arr.shape, train_arr.shape)
(80, 20) (20, 20)