Related
I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond -- i.e. shuffle them in unison with respect to their leading indices.
This code works, and illustrates my goals:
def shuffle_in_unison(a, b):
assert len(a) == len(b)
shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
permutation = numpy.random.permutation(len(a))
for old_index, new_index in enumerate(permutation):
shuffled_a[new_index] = a[old_index]
shuffled_b[new_index] = b[old_index]
return shuffled_a, shuffled_b
For example:
>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
[1, 1],
[3, 3]]), array([2, 1, 3]))
However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays -- I'd rather shuffle them in-place, since they'll be quite large.
Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.
One other thought I had was this:
def shuffle_in_unison_scary(a, b):
rng_state = numpy.random.get_state()
numpy.random.shuffle(a)
numpy.random.set_state(rng_state)
numpy.random.shuffle(b)
This works...but it's a little scary, as I see little guarantee it'll continue to work -- it doesn't look like the sort of thing that's guaranteed to survive across numpy version, for example.
Your can use NumPy's array indexing:
def unison_shuffled_copies(a, b):
assert len(a) == len(b)
p = numpy.random.permutation(len(a))
return a[p], b[p]
This will result in creation of separate unison-shuffled arrays.
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=0)
To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html
Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.
If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.
Example: Let's assume the arrays a and b look like this:
a = numpy.array([[[ 0., 1., 2.],
[ 3., 4., 5.]],
[[ 6., 7., 8.],
[ 9., 10., 11.]],
[[ 12., 13., 14.],
[ 15., 16., 17.]]])
b = numpy.array([[ 0., 1.],
[ 2., 3.],
[ 4., 5.]])
We can now construct a single array containing all the data:
c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[ 0., 1., 2., 3., 4., 5., 0., 1.],
# [ 6., 7., 8., 9., 10., 11., 2., 3.],
# [ 12., 13., 14., 15., 16., 17., 4., 5.]])
Now we create views simulating the original a and b:
a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)
The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).
In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.
This solution could be adapted to the case that a and b have different dtypes.
Very simple solution:
randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]
the two arrays x,y are now both randomly shuffled in the same way
James wrote in 2015 an sklearn solution which is helpful. But he added a random state variable, which is not needed. In the below code, the random state from numpy is automatically assumed.
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)
from np.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array
# Data is currently unshuffled; we should shuffle
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]
Shuffle any number of arrays together, in-place, using only NumPy.
import numpy as np
def shuffle_arrays(arrays, set_seed=-1):
"""Shuffles arrays in-place, in the same order, along axis=0
Parameters:
-----------
arrays : List of NumPy arrays.
set_seed : Seed value if int >= 0, else seed is random.
"""
assert all(len(arr) == len(arrays[0]) for arr in arrays)
seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed
for arr in arrays:
rstate = np.random.RandomState(seed)
rstate.shuffle(arr)
And can be used like this
a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])
shuffle_arrays([a, b, c])
A few things to note:
The assert ensures that all input arrays have the same length along
their first dimension.
Arrays shuffled in-place by their first dimension - nothing returned.
Random seed within positive int32 range.
If a repeatable shuffle is needed, seed value can be set.
After the shuffle, the data can be split using np.split or referenced using slices - depending on the application.
you can make an array like:
s = np.arange(0, len(a), 1)
then shuffle it:
np.random.shuffle(s)
now use this s as argument of your arrays. same shuffled arguments return same shuffled vectors.
x_data = x_data[s]
x_label = x_label[s]
There is a well-known function that can handle this:
from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)
Just setting test_size to 0 will avoid splitting and give you shuffled data.
Though it is usually used to split train and test data, it does shuffle them too.
From documentation
Split arrays or matrices into random train and test subsets
Quick utility that wraps input validation and
next(ShuffleSplit().split(X, y)) and application to input data into a
single call for splitting (and optionally subsampling) data in a
oneliner.
This seems like a very simple solution:
import numpy as np
def shuffle_in_unison(a,b):
assert len(a)==len(b)
c = np.arange(len(a))
np.random.shuffle(c)
return a[c],b[c]
a = np.asarray([[1, 1], [2, 2], [3, 3]])
b = np.asarray([11, 22, 33])
shuffle_in_unison(a,b)
Out[94]:
(array([[3, 3],
[2, 2],
[1, 1]]),
array([33, 22, 11]))
One way in which in-place shuffling can be done for connected lists is using a seed (it could be random) and using numpy.random.shuffle to do the shuffling.
# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
np.random.seed(seed)
np.random.shuffle(a)
np.random.seed(seed)
np.random.shuffle(b)
That's it. This will shuffle both a and b in the exact same way. This is also done in-place which is always a plus.
EDIT, don't use np.random.seed() use np.random.RandomState instead
def shuffle(a, b, seed):
rand_state = np.random.RandomState(seed)
rand_state.shuffle(a)
rand_state.seed(seed)
rand_state.shuffle(b)
When calling it just pass in any seed to feed the random state:
a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)
Output:
>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]
Edit: Fixed code to re-seed the random state
Say we have two arrays: a and b.
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]])
We can first obtain row indices by permutating first dimension
indices = np.random.permutation(a.shape[0])
[1 2 0]
Then use advanced indexing.
Here we are using the same indices to shuffle both arrays in unison.
a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]
This is equivalent to
np.take(a, indices, axis=0)
[[4 5 6]
[7 8 9]
[1 2 3]]
np.take(b, indices, axis=0)
[[6 6 6]
[4 2 0]
[9 1 1]]
If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array, and randomly swap it to another position in the array
for old_index in len(a):
new_index = numpy.random.randint(old_index+1)
a[old_index], a[new_index] = a[new_index], a[old_index]
b[old_index], b[new_index] = b[new_index], b[old_index]
This implements the Knuth-Fisher-Yates shuffle algorithm.
Shortest and easiest way in my opinion, use seed:
random.seed(seed)
random.shuffle(x_data)
# reset the same seed to get the identical random sequence and shuffle the y
random.seed(seed)
random.shuffle(y_data)
most solutions above work, however if you have column vectors you have to transpose them first. here is an example
def shuffle(self) -> None:
"""
Shuffles X and Y
"""
x = self.X.T
y = self.Y.T
p = np.random.permutation(len(x))
self.X = x[p].T
self.Y = y[p].T
With an example, this is what I'm doing:
combo = []
for i in range(60000):
combo.append((images[i], labels[i]))
shuffle(combo)
im = []
lab = []
for c in combo:
im.append(c[0])
lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)
I extended python's random.shuffle() to take a second arg:
def shuffle_together(x, y):
assert len(x) == len(y)
for i in reversed(xrange(1, len(x))):
# pick an element in x[:i+1] with which to exchange x[i]
j = int(random.random() * (i+1))
x[i], x[j] = x[j], x[i]
y[i], y[j] = y[j], y[i]
That way I can be sure that the shuffling happens in-place, and the function is not all too long or complicated.
Just use numpy...
First merge the two input arrays 1D array is labels(y) and 2D array is data(x) and shuffle them with NumPy shuffle method. Finally split them and return.
import numpy as np
def shuffle_2d(a, b):
rows= a.shape[0]
if b.shape != (rows,1):
b = b.reshape((rows,1))
S = np.hstack((b,a))
np.random.shuffle(S)
b, a = S[:,0], S[:,1:]
return a,b
features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(train, test)
How can I stack arrays in an alternating fashion? Consider the following example with three arrays:
import numpy as np
one = np.ones((5, 2, 2))
two = np.ones((5, 2, 2))*2
three = np.ones((5, 2, 2))*3
I would like to create a new array result with shape (15, 2, 2) which is formed by alternately taking a slice from each of the given arrays, i.e. the result should look like:
result[0] = one[0]
result[1] = two[0]
result[2] = three[0]
result[3] = one[1]
result[4] = two[1]
result[5] = three[1]
result[6] = one[2]
etc...
The arrays above are just an example to illustrate the question, I am not looking for a way to create this specific result array. What is the easiest way to achieve this, at best with specifying a stacking axis?
Of course, it is possible to do some loops but it seems rather inconvenient...
You may wanne take a look at np.stack() i.e.:
np.stack([one, two, three], axis=1).reshape(15, 2, 2)
With np.hstack and then reshape (with -1 for the first axis appended with the lengths along last two axes for a generic solution) -
np.hstack([one,two,three]).reshape((-1,)+one.shape[1:])
I think you are looking for np.vstack
np.vstack((one,two,three))
Read more about it here np.vstack
With selectable axis:
# example arrays
a,b,c = np.multiply.outer([1,2,3],np.ones((5,2,2)))
# axis
k = 1
np.stack([a,b,c],k+1).reshape(*(-(k==j) or s for j,s in enumerate(a.shape)))
# array([[[1., 1.],
# [2., 2.],
# [3., 3.],
# [1., 1.],
# [2., 2.],
# [3., 3.]],
#
# [[1., 1.],
...
I have a network with multiple inputs and I split out the first 10 inputs and calculate the weighted sum, and then concatenate it with the rest of the input:
first = Lambda(lambda z: z[:, 0:11])(d_inputs)
wsum_first = Lambda(calcWSumF)(first )
d_input = concatenate([d_inputs, wsum_first], axis=-1)
with the function defined as:
w_vec = K.constant(np.array([range(10)]*64).reshape(10, 64)) # batch size is 64
def calcWSumF(x):
y = K.dot(w_vec, x)
y = K.expand_dims(y, -1)
return y
I want a constant vector to be used to calculate the weighted sum of the first part of the input. The concatenation doesn't work because the shapes don't match. How can I implement this correctly?
You can write this much better using K.sum and only a vector containing the coefficients. Further, there is no need to use a fixed batch size (it can be any number):
def calcWSumF(x, idx):
w_vec = K.constant(np.arange(idx))
y = K.sum(x[:, 0:idx] * w_vec, axis=-1, keepdims=True)
return y
d_inputs = Input((15,))
wsum_first = Lambda(calcWSumF, arguments={'idx': 10})(d_inputs)
d_input = concatenate([d_inputs, wsum_first], axis=-1)
model = Model(d_inputs, d_input)
model.predict(np.arange(15).reshape(1, 15))
# output:
array([[ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
11., 12., 13., 14., 285.]], dtype=float32)
# Note: 0*0 + 1*1 + 2*2 + ... + 9*9 = 285
Note that, to make it more general, we have added another argument (idx) to the lambda function which specifies how many of the elements from the beginning we would like to consider.
Is there a way to optimize this snippet of code? With my current im value, it is taking me ~28 seconds. Wondering if I could reduce this time.
im = output_image[-min_xy[1]:-min_xy[1] + image_2.shape[0], -min_xy[0]:-min_xy[0] + image_2.shape[1]]
for idx, rk in enumerate(im):
for ix, k in enumerate(rk):
image_2[idx][ix] = avg(im[idx][ix], image_2[idx][ix])
type(image_2) and type(im) is <type 'numpy.ndarray'>
im.shape and image_2.shape is (2386, 3200, 3)
what my avg() does is
def avg(a1, a2):
if [0., 0., 0.] in a1:
return a2
else:
return (a1 + a2) / 2
NOTE: a1 is an array of size 3 ex: array([ 0.68627451, 0.5372549 , 0.4745098])
The only obstacle to vectorization seemed to be that IF conditional in avg. To get past it, simply use the choosing capability of np.where and thus have our solution like so -
avgs = (im + image_2)/2.0
image_2_out = np.where((im == 0).any(-1,keepdims=1), image_2, avgs)
Note that this assumes with if [0., 0., 0.] in a1, you meant to check for ANY one match. If you meant to check for ALL zeros, simply use .all instead of .any.
Alternatively, to do in-situ edits in image_2, use a mask for boolean-indexing -
mask = ~(im == 0).any(-1)
image_2[mask] = avgs[mask]
I have very large matrix, so dont want to sum by going through each row and column.
a = [[1,2,3],[3,4,5],[5,6,7]]
def neighbors(i,j,a):
return [a[i][j-1], a[i][(j+1)%len(a[0])], a[i-1][j], a[(i+1)%len(a)][j]]
[[np.mean(neighbors(i,j,a)) for j in range(len(a[0]))] for i in range(len(a))]
This code works well for 3x3 or small range of matrix, but for large matrix like 2k x 2k this is not feasible. Also this does not work if any of the value in matrix is missing or it's like na
This code works well for 3x3 or small range of matrix, but for large matrix like 2k x 2k this is not feasible. Also this does not work if any of the value in matrix is missing or it's like na. If any of the neighbor values is na then skip that neighbour in getting the average
Shot #1
This assumes you are looking to get sliding windowed average values in an input array with a window of 3 x 3 and considering only the north-west-east-south neighborhood elements.
For such a case, signal.convolve2d with an appropriate kernel could be used. At the end, you need to divide those summations by the number of ones in kernel, i.e. kernel.sum() as only those contributed to the summations. Here's the implementation -
import numpy as np
from scipy import signal
# Inputs
a = [[1,2,3],[3,4,5],[5,6,7],[4,8,9]]
# Convert to numpy array
arr = np.asarray(a,float)
# Define kernel for convolution
kernel = np.array([[0,1,0],
[1,0,1],
[0,1,0]])
# Perform 2D convolution with input data and kernel
out = signal.convolve2d(arr, kernel, boundary='wrap', mode='same')/kernel.sum()
Shot #2
This makes the same assumptions as in shot #1, except that we are looking to find average values in a neighborhood of only zero elements with the intention to replace them with those average values.
Approach #1: Here's one way to do it using a manual selective convolution approach -
import numpy as np
# Convert to numpy array
arr = np.asarray(a,float)
# Pad around the input array to take care of boundary conditions
arr_pad = np.lib.pad(arr, (1,1), 'wrap')
R,C = np.where(arr==0) # Row, column indices for zero elements in input array
N = arr_pad.shape[1] # Number of rows in input array
offset = np.array([-N, -1, 1, N])
idx = np.ravel_multi_index((R+1,C+1),arr_pad.shape)[:,None] + offset
arr_out = arr.copy()
arr_out[R,C] = arr_pad.ravel()[idx].sum(1)/4
Sample input, output -
In [587]: arr
Out[587]:
array([[ 4., 0., 3., 3., 3., 1., 3.],
[ 2., 4., 0., 0., 4., 2., 1.],
[ 0., 1., 1., 0., 1., 4., 3.],
[ 0., 3., 0., 2., 3., 0., 1.]])
In [588]: arr_out
Out[588]:
array([[ 4. , 3.5 , 3. , 3. , 3. , 1. , 3. ],
[ 2. , 4. , 2. , 1.75, 4. , 2. , 1. ],
[ 1.5 , 1. , 1. , 1. , 1. , 4. , 3. ],
[ 2. , 3. , 2.25, 2. , 3. , 2.25, 1. ]])
To take care of the boundary conditions, there are other options for padding. Look at numpy.pad for more info.
Approach #2: This would be a modified version of convolution based approach listed earlier in Shot #1. This is same as that earlier approach, except that at the end, we selectively replace
the zero elements with the convolution output. Here's the code -
import numpy as np
from scipy import signal
# Inputs
a = [[1,2,3],[3,4,5],[5,6,7],[4,8,9]]
# Convert to numpy array
arr = np.asarray(a,float)
# Define kernel for convolution
kernel = np.array([[0,1,0],
[1,0,1],
[0,1,0]])
# Perform 2D convolution with input data and kernel
conv_out = signal.convolve2d(arr, kernel, boundary='wrap', mode='same')/kernel.sum()
# Initialize output array as a copy of input array
arr_out = arr.copy()
# Setup a mask of zero elements in input array and
# replace those in output array with the convolution output
mask = arr==0
arr_out[mask] = conv_out[mask]
Remarks: Approach #1 would be the preferred way when you have fewer number of zero elements in input array, otherwise go with Approach #2.
This is an appendix to comments under #Divakar's answer (rather than an independent answer).
Out of curiosity I tried different 'pseudo' convolutions against the scipy convolution. The fastest one was the % (modulus) wrapping one, which surprised me: obviously numpy does something clever with its indexing, though obviously not having to pad will save time.
fn3 -> 9.5ms, fn1 -> 21ms, fn2 -> 232ms
import timeit
setup = """
import numpy as np
from scipy import signal
N = 1000
M = 750
P = 5 # i.e. small number -> bigger proportion of zeros
a = np.random.randint(0, P, M * N).reshape(M, N)
arr = np.asarray(a,float)"""
fn1 = """
arr_pad = np.lib.pad(arr, (1,1), 'wrap')
R,C = np.where(arr==0)
N = arr_pad.shape[1]
offset = np.array([-N, -1, 1, N])
idx = np.ravel_multi_index((R+1,C+1),arr_pad.shape)[:,None] + offset
arr[R,C] = arr_pad.ravel()[idx].sum(1)/4"""
fn2 = """
kernel = np.array([[0,1,0],
[1,0,1],
[0,1,0]])
conv_out = signal.convolve2d(arr, kernel, boundary='wrap', mode='same')/kernel.sum()
mask = arr == 0.0
arr[mask] = conv_out[mask]"""
fn3 = """
R,C = np.where(arr == 0.0)
arr[R, C] = (arr[(R-1)%M,C] + arr[R,(C-1)%N] + arr[R,(C+1)%N] + arr[(R+1)%M,C]) / 4.0
"""
print(timeit.timeit(fn1, setup, number = 100))
print(timeit.timeit(fn2, setup, number = 100))
print(timeit.timeit(fn3, setup, number = 100))
Using numpy and scipy.ndimage, you can apply a "footprint" that defines where you look for the neighbours of each element and apply a function to those neighbours:
import numpy as np
import scipy.ndimage as ndimage
# Getting neighbours horizontally and vertically,
# not diagonally
footprint = np.array([[0,1,0],
[1,0,1],
[0,1,0]])
a = [[1,2,3],[3,4,5],[5,6,7]]
# Need to make sure that dtype is float or the
# mean won't be calculated correctly
a_array = np.array(a, dtype=float)
# Can specify that you want neighbour selection to
# wrap around at the borders
ndimage.generic_filter(a_array, np.mean,
footprint=footprint, mode='wrap')
Out[36]:
array([[ 3.25, 3.5 , 3.75],
[ 3.75, 4. , 4.25],
[ 4.25, 4.5 , 4.75]])