I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond -- i.e. shuffle them in unison with respect to their leading indices.
This code works, and illustrates my goals:
def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b
For example:
>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
[1, 1],
[3, 3]]), array([2, 1, 3]))
However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays -- I'd rather shuffle them in-place, since they'll be quite large.
Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.
One other thought I had was this:
def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)
This works...but it's a little scary, as I see little guarantee it'll continue to work -- it doesn't look like the sort of thing that's guaranteed to survive across numpy versions, for example.
You can use NumPy's array indexing:
def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = numpy.random.permutation(len(a))
    return a[p], b[p]
This will result in creation of separate unison-shuffled arrays.
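The same idea can be written against NumPy's newer Generator interface. This is only a sketch: the function name unison_shuffled_copies_rng and the optional seed parameter are assumptions made for illustration, not part of the answer above.

```python
import numpy as np

def unison_shuffled_copies_rng(a, b, seed=None):
    """Return shuffled copies of a and b that share one row permutation."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    rng = np.random.default_rng(seed)  # seed=None draws fresh entropy
    p = rng.permutation(len(a))        # one permutation, reused for both arrays
    return a[p], b[p]

a = np.asarray([[1, 1], [2, 2], [3, 3]])
b = np.asarray([1, 2, 3])
shuffled_a, shuffled_b = unison_shuffled_copies_rng(a, b, seed=0)
```

Because both results are built from the same index array p, row i of shuffled_a always belongs to element i of shuffled_b. Like the original, this returns copies rather than shuffling in place.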
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y, random_state=0)
To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html
Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.
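A small check of this argument, assuming a private numpy.random.RandomState instance (with an arbitrary seed of 42) instead of the global generator, so the experiment does not disturb other code:

```python
import numpy as np

a = np.arange(10).repeat(2).reshape(10, 2)  # row i is [i, i]
b = np.arange(10)                           # element i is i

rng = np.random.RandomState(42)  # private generator; the seed is arbitrary
state = rng.get_state()
rng.shuffle(a)                   # shuffles rows of a in place
rng.set_state(state)             # rewind so the same draws are replayed
rng.shuffle(b)                   # shuffles b in place with the same draws

still_aligned = bool((a[:, 0] == b).all())
```

If the reasoning holds, still_aligned is True even though a and b have different shapes, because both shuffles consume the same sequence of random draws.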
If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.
Example: Let's assume the arrays a and b look like this:
a = numpy.array([[[ 0., 1., 2.],
[ 3., 4., 5.]],
[[ 6., 7., 8.],
[ 9., 10., 11.]],
[[ 12., 13., 14.],
[ 15., 16., 17.]]])
b = numpy.array([[ 0., 1.],
[ 2., 3.],
[ 4., 5.]])
We can now construct a single array containing all the data:
c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[ 0., 1., 2., 3., 4., 5., 0., 1.],
# [ 6., 7., 8., 9., 10., 11., 2., 3.],
# [ 12., 13., 14., 15., 16., 17., 4., 5.]])
Now we create views simulating the original a and b:
a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)
The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).
In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.
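As a rough sketch of that "create c first" approach, assuming the same shapes as in the example (the fill values are placeholders chosen so that rows can be matched up afterwards):

```python
import numpy as np

n = 3
a_shape = (n, 2, 3)                    # each row of a holds 2*3 = 6 values
b_shape = (n, 2)                       # each row of b holds 2 values
a_width = int(np.prod(a_shape[1:]))
b_width = int(np.prod(b_shape[1:]))

# Allocate the combined buffer first, then carve views out of it.
c = np.empty((n, a_width + b_width))
a2 = c[:, :a_width].reshape(a_shape)
b2 = c[:, a_width:].reshape(b_shape)

# Fill through the views; the data lands in c.
a2[...] = np.arange(18.).reshape(a_shape)
b2[...] = np.arange(6.).reshape(b_shape)

np.random.shuffle(c)                   # shuffles rows of c, and thus a2 and b2
```

Since a2 and b2 share memory with c, a single in-place shuffle of c reorders both of them consistently.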
This solution could be adapted to the case that a and b have different dtypes.
Very simple solution:
randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]
The two arrays x and y are now both randomly shuffled in the same way.
James posted a helpful sklearn solution in 2015, but it passes a random_state argument, which is not needed. In the code below, NumPy's global random state is used automatically.
X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
from sklearn.utils import shuffle
X, y = shuffle(X, y)
from numpy.random import permutation
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data #numpy array
y = iris.target #numpy array
# Data is currently unshuffled; we should shuffle
# each X[i] with its corresponding y[i]
perm = permutation(len(X))
X = X[perm]
y = y[perm]
Shuffle any number of arrays together, in-place, using only NumPy.
import numpy as np
def shuffle_arrays(arrays, set_seed=-1):
    """Shuffles arrays in-place, in the same order, along axis=0

    Parameters:
    -----------
    arrays : List of NumPy arrays.
    set_seed : Seed value if int >= 0, else seed is random.
    """
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed
    for arr in arrays:
        rstate = np.random.RandomState(seed)
        rstate.shuffle(arr)
And can be used like this
a = np.array([1, 2, 3, 4, 5])
b = np.array([10,20,30,40,50])
c = np.array([[1,10,11], [2,20,22], [3,30,33], [4,40,44], [5,50,55]])
shuffle_arrays([a, b, c])
A few things to note:
The assert ensures that all input arrays have the same length along their first dimension.
Arrays are shuffled in-place along their first dimension; nothing is returned.
The random seed lies within the positive int32 range.
If a repeatable shuffle is needed, the seed value can be set.
After the shuffle, the data can be split using np.split or referenced using slices, depending on the application.
You can make an index array like this:
s = np.arange(0, len(a), 1)
Then shuffle it:
np.random.shuffle(s)
Now use s to index your arrays. Indexing with the same shuffled indices returns arrays shuffled in the same order.
x_data = x_data[s]
x_label = x_label[s]
There is a well-known function that can handle this:
from sklearn.model_selection import train_test_split
X, _, Y, _ = train_test_split(X,Y, test_size=0.0)
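Setting test_size to 0 avoids splitting and gives you shuffled data. Note that newer scikit-learn versions reject test_size=0 with a ValueError, so this only works on older releases.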
Though it is usually used to split train and test data, it does shuffle them too.
From documentation
Split arrays or matrices into random train and test subsets
Quick utility that wraps input validation and
next(ShuffleSplit().split(X, y)) and application to input data into a
single call for splitting (and optionally subsampling) data in a
oneliner.
This seems like a very simple solution:
import numpy as np
def shuffle_in_unison(a,b):
    assert len(a)==len(b)
    c = np.arange(len(a))
    np.random.shuffle(c)
    return a[c],b[c]
a = np.asarray([[1, 1], [2, 2], [3, 3]])
b = np.asarray([11, 22, 33])
shuffle_in_unison(a,b)
Out[94]:
(array([[3, 3],
[2, 2],
[1, 1]]),
array([33, 22, 11]))
One way in which in-place shuffling can be done for connected lists is using a seed (it could be random) and using numpy.random.shuffle to do the shuffling.
# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
    np.random.seed(seed)
    np.random.shuffle(a)
    np.random.seed(seed)
    np.random.shuffle(b)
That's it. This will shuffle both a and b in the exact same way. This is also done in-place which is always a plus.
EDIT: don't use np.random.seed(); use np.random.RandomState instead.
def shuffle(a, b, seed):
    rand_state = np.random.RandomState(seed)
    rand_state.shuffle(a)
    rand_state.seed(seed)
    rand_state.shuffle(b)
When calling it just pass in any seed to feed the random state:
a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)
Output:
>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]
Edit: Fixed code to re-seed the random state
Say we have two arrays: a and b.
a = np.array([[1,2,3],[4,5,6],[7,8,9]])
b = np.array([[9,1,1],[6,6,6],[4,2,0]])
We can first obtain row indices by permuting the first dimension:
indices = np.random.permutation(a.shape[0])
[1 2 0]
Then use advanced indexing.
Here we are using the same indices to shuffle both arrays in unison.
a_shuffled = a[indices[:,np.newaxis], np.arange(a.shape[1])]
b_shuffled = b[indices[:,np.newaxis], np.arange(b.shape[1])]
This is equivalent to
np.take(a, indices, axis=0)
[[4 5 6]
[7 8 9]
[1 2 3]]
np.take(b, indices, axis=0)
[[6 6 6]
[4 2 0]
[9 1 1]]
If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array, and randomly swap it to another position in the array
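for old_index in range(len(a)):
    new_index = numpy.random.randint(old_index + 1)
    # Swap via index lists: the right-hand side is a copy, so rows of
    # multi-dimensional arrays are not overwritten before they are read.
    a[[old_index, new_index]] = a[[new_index, old_index]]
    b[[old_index, new_index]] = b[[new_index, old_index]]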
This implements the Knuth-Fisher-Yates shuffle algorithm.
Shortest and easiest way in my opinion, use seed:
random.seed(seed)
random.shuffle(x_data)
# reset the same seed to get the identical random sequence and shuffle the y
random.seed(seed)
random.shuffle(y_data)
Most solutions above work; however, if you have column vectors you have to transpose them first. Here is an example:
def shuffle(self) -> None:
    """
    Shuffles X and Y
    """
    x = self.X.T
    y = self.Y.T
    p = np.random.permutation(len(x))
    self.X = x[p].T
    self.Y = y[p].T
With an example, this is what I'm doing:
from random import shuffle  # assumption: shuffle here is random.shuffle
combo = []
for i in range(60000):
    combo.append((images[i], labels[i]))
shuffle(combo)
im = []
lab = []
for c in combo:
    im.append(c[0])
    lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)
I extended python's random.shuffle() to take a second arg:
def shuffle_together(x, y):
    assert len(x) == len(y)
    for i in reversed(range(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random.random() * (i+1))
        x[i], x[j] = x[j], x[i]
        y[i], y[j] = y[j], y[i]
That way I can be sure that the shuffling happens in-place, and the function is not all too long or complicated.
Just use numpy...
First merge the two input arrays (the 1D array holds the labels y and the 2D array holds the data x), then shuffle the merged array with NumPy's shuffle method. Finally, split it again and return both parts.
import numpy as np
def shuffle_2d(a, b):
    rows = a.shape[0]
    if b.shape != (rows, 1):
        b = b.reshape((rows, 1))
    S = np.hstack((b, a))
    np.random.shuffle(S)
    b, a = S[:, 0], S[:, 1:]
    return a, b
features, samples = 2, 5
x, y = np.random.random((samples, features)), np.arange(samples)
x, y = shuffle_2d(x, y)
Related
I am trying to use fancy indexing to modify a large sparse matrix. Suppose you have the following code:
import numpy as np
import scipy.sparse as sp
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
b = sp.lil_matrix(a)
c = sp.lil_matrix((3,4))
c[[1,2], 0] = b[[1,2], 0]
However, this code gives the following error:
ValueError: shape mismatch in assignment
I don't understand why this doesn't work. Both matrices have the same shape and this usually works if both matrices are numpy arrays. I would appreciate any help.
Yeah this is a bug with the sparse __setitem__. I've run into it before (but I just worked around it). Now I actually looked into it; first, you can fix this pretty easily:
import numpy as np
import scipy.sparse as sp
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
b = sp.lil_matrix(a)
c = sp.lil_matrix((3,4))
c[[1,2], 0] = b[[1,2], 0]
This raises the ValueError you saw. This doesn't and works as expected:
c[[1,2], 0] = b[[1,2], [0]]
>>> c.A
array([[0., 0., 0., 0.],
[5., 0., 0., 0.],
[9., 0., 0., 0.]])
Let's just walk through the offending __setitem__ (I'm going to omit a lot of code that doesn't get called):
row, col = self._validate_indices(key)
This is fine - row = [1, 2] and col = 0
col = np.atleast_1d(col)
i, j = _broadcast_arrays(row, col)
So far so good - i = [1, 2] and j = [0, 0]
if i.ndim == 1:
    # Inner indexing, so treat them like row vectors.
    i = i[None]
    j = j[None]
broadcast_row = x.shape[0] == 1 and i.shape[0] != 1
broadcast_col = x.shape[1] == 1 and i.shape[1] != 1
Here's our problem - i and j both got turned into row vectors with shape (1, 2). x here is what you're trying to assign (b[[1,2], 0]), which is of shape (2, 1); the next step raises a ValueError because x and the indices don't align.
>>> c[[1,2], 0] = b[[1,2], 0].A
ValueError: cannot reshape array of size 4 into shape (2,)
Here's the same problem but __setitem__ broadcasts x into a (2,2) array, which then fails again because it's larger than the array you're assigning it to.
The workaround (b[[1,2], [0]]) has a shape of (1, 2) which is not correct, but that error ends up cancelling out the error in indexing c.
I'm not sure exactly what the logic is behind this indexing code so I'm not sure how to fix this without introducing other subtle bugs.
I want to create an array of a given shape based on another numpy array. The number of dimensions will be matching, but the sizes will differ from axis to axis. If the original size is too small, I want to pad it with zeros to fulfill the requirements. Example of expected behaviour to clarify:
embedding = np.array([
[1, 2, 3, 4],
[5, 6, 7, 8]
])
resize_with_outer_zeros(embedding, (4, 3)) = np.array([
[1, 2, 3],
[5, 6, 7],
[0, 0, 0],
[0, 0, 0]
])
I think I achieved the desired behaviour with the function below.
from typing import Tuple
def resize_with_outer_zeros(embedding: np.ndarray, target_shape: Tuple[int, ...]) -> np.ndarray:
    padding = tuple((0, max(0, target_size - size)) for target_size, size in zip(target_shape, embedding.shape))
    target_slice = tuple(slice(0, target_size) for target_size in target_shape)
    return np.pad(embedding, padding)[target_slice]
However, I have strong doubts about its efficiency and elegance, as it involves a lot of pure python tuple operations. Is there a better and more concise way to do it?
If you know that your array won't be bigger than some size (r, c), why not just:
def pad_with_zeros(A, r, c):
    out = np.zeros((r, c))
    r_, c_ = np.shape(A)
    out[0:r_, 0:c_] = A
    return out
If you want to support arbitrary dimensions (tensors) it gets a little uglier, but the principle remains the same:
def pad(A, shape):
    out = np.zeros(shape)
    out[tuple(slice(0, d) for d in np.shape(A))] = A
    return out
And to support larger arrays (larger than what you would pad):
def pad(A, shape):
    shape = np.max([np.shape(A), shape], axis=0)
    out = np.zeros(shape)
    out[tuple(slice(0, d) for d in np.shape(A))] = A
    return out
I don't think you can do much better, but instead of using pad and then slicing, just do zeros at the right size and then an assignment - this cuts it to one list comprehension instead of two.
embedding = np.array([
[1, 2, 3, 4],
[5, 6, 7, 8]
])
z = np.zeros((4,3))
s = tuple([slice(None, min(za,ea)) for za,ea in zip(z.shape, embedding.shape)])
z[s] = embedding[s]
z
# array([[1., 2., 3.],
# [5., 6., 7.],
# [0., 0., 0.],
# [0., 0., 0.]])
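The zeros-plus-assignment idea can be wrapped into a function that works for any number of dimensions. This is a sketch only; the name resize_with_zeros and the choice to keep the input dtype are assumptions made here, not part of the answers above.

```python
import numpy as np

def resize_with_zeros(arr, target_shape):
    """Copy arr into a zero array of target_shape, padding or truncating each axis."""
    out = np.zeros(target_shape, dtype=arr.dtype)
    # The overlapping region is the per-axis minimum of the two shapes.
    overlap = tuple(slice(0, min(t, s)) for t, s in zip(target_shape, arr.shape))
    out[overlap] = arr[overlap]
    return out

embedding = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
resized = resize_with_zeros(embedding, (4, 3))
```

Here the first axis is padded from 2 to 4 rows and the second axis is truncated from 4 to 3 columns, which matches the behaviour asked for in the question.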
I'd just use a zero-matrix and run a nested for-loop to set the values from the older array - the remaining places will automatically be padded with zeros.
import numpy as np
def resize_array(array, new_size):
    Z = np.zeros(new_size)
    for i in range(len(Z)):
        for j in range(len(Z[i])):
            try:
                Z[i][j] = array[i][j]
            except IndexError: # just in case array[i][j] doesn't exist in the new size and should be truncated
                pass
    return Z
embedding = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(resize_array(embedding, (4, 3)))
How can I stack arrays in an alternating fashion? Consider the following example with three arrays:
import numpy as np
one = np.ones((5, 2, 2))
two = np.ones((5, 2, 2))*2
three = np.ones((5, 2, 2))*3
I would like to create a new array result with shape (15, 2, 2) which is formed by alternately taking a slice from each of the given arrays, i.e. the result should look like:
result[0] = one[0]
result[1] = two[0]
result[2] = three[0]
result[3] = one[1]
result[4] = two[1]
result[5] = three[1]
result[6] = one[2]
etc...
The arrays above are just an example to illustrate the question, I am not looking for a way to create this specific result array. What is the easiest way to achieve this, at best with specifying a stacking axis?
Of course, it is possible to do some loops but it seems rather inconvenient...
You may want to take a look at np.stack(), i.e.:
np.stack([one, two, three], axis=1).reshape(15, 2, 2)
With np.hstack and then reshape (with -1 for the first axis appended with the lengths along last two axes for a generic solution) -
np.hstack([one,two,three]).reshape((-1,)+one.shape[1:])
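Both one-liners stack along a new axis placed right after the target axis and then merge the two axes with a reshape. A small sketch of that pattern with a selectable axis follows; the helper name interleave is made up for illustration, and it assumes equally shaped inputs and a non-negative axis.

```python
import numpy as np

def interleave(arrays, axis=0):
    """Interleave equally shaped arrays along a non-negative axis."""
    # New axis right after `axis`: index (i, k) means slice i of array k.
    stacked = np.stack(arrays, axis=axis + 1)
    new_shape = list(arrays[0].shape)
    new_shape[axis] *= len(arrays)
    # Merging the two axes yields order i*len(arrays) + k, i.e. alternating.
    return stacked.reshape(new_shape)

one = np.ones((5, 2, 2))
two = one * 2
three = one * 3
result = interleave([one, two, three], axis=0)
```

With axis=0 this gives result[0] == one[0], result[1] == two[0], result[2] == three[0], result[3] == one[1], and so on.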
I think you are looking for np.vstack
np.vstack((one,two,three))
Read more about it here np.vstack
With selectable axis:
# example arrays
a,b,c = np.multiply.outer([1,2,3],np.ones((5,2,2)))
# axis
k = 1
np.stack([a,b,c],k+1).reshape(*(-(k==j) or s for j,s in enumerate(a.shape)))
# array([[[1., 1.],
# [2., 2.],
# [3., 3.],
# [1., 1.],
# [2., 2.],
# [3., 3.]],
#
# [[1., 1.],
...
I have a list of 100 values in python where each value in the list corresponds to an n-dimensional list.
For example,
x=[[1, 2],[2, 3]] is a 2d list
I want to compute euclidean norm over all such points. Is there a standard method to do this?
I found this on scipy and this works.
scipy
If I have interpreted the question correctly, then you have a list of 100 n-dimensional vectors, and you would like a list of their (Euclidean) norms.
I think using numpy is easiest (and quickest!) here,
import numpy as np
a = np.array(x)
np.sqrt((a*a).sum(axis=1))
If the vectors do not have equal dimension, or if you want to avoid numpy, then perhaps,
[sum([i*i for i in vec])**0.5 for vec in x]
or,
import math
[math.sqrt(sum([i*i for i in vec])) for vec in x]
Edit: Not entirely sure what you were asking for. So, alternatively: it looks like you have a list, each element of which is an n-dimensional vector, and you want the Euclidean distance between each consecutive pair. With numpy (assuming n is fixed),
x = [ [1,2,3], [4,5,6], [8,9,10], [13,14,15] ] # 3D example.
import numpy as np
a = np.array(x)
sqrDiff = (a[:-1] - a[1:])**2
np.sqrt(sqrDiff.sum(axis=1))
where the last line returns,
array([ 5.19615242, 6.92820323, 8.66025404])
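For comparison, both readings of the question can also be written with np.linalg.norm, which takes an axis argument. A short sketch, assuming all vectors have the same dimension and using the small example from the question:

```python
import numpy as np

x = [[1, 2], [2, 3]]
a = np.array(x, dtype=float)

# Euclidean norm of each row vector.
row_norms = np.linalg.norm(a, axis=1)

# Euclidean distance between each consecutive pair of rows.
step_lengths = np.linalg.norm(np.diff(a, axis=0), axis=1)
```

row_norms has one entry per vector, while step_lengths has one entry fewer, since it measures the gaps between neighbours.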
Try this code:
from math import sqrt
valueList = [[[1,2], [2,3]], [[2,2], [3,3]]]
def distance(valueList):
    resultList = []
    for (point1, point2) in valueList:
        # sum of squared coordinate differences, then the square root
        resultList.append(sqrt(sum((x1 - x2) * (x1 - x2) for x1, x2 in zip(point1, point2))))
    return resultList
print(distance(valueList))
output is
[1.4142135623730951, 1.4142135623730951]
Here valueList contains 2 values, but it works just as well with 100 values.
You can do this to compute the euclidean norm of each row:
>>> a = np.arange(200.).reshape((100,2))
>>> a
array([[ 0., 1.],
[ 2., 3.],
[ 4., 5.],
[ 6., 7.],
[ 8., 9.],
[ 10., 11.],
...
>>> np.sum(a**2,axis=-1) ** .5
array([ 1. , 3.60555128, 6.40312424, 9.21954446,
12.04159458, 14.86606875, 17.69180601, 20.51828453,
23.34523506, 26.17250466, 29. , 31.82766093,
34.6554469 , 37.48332963, 40.31128874, 43.13930922,
45.96737974, 48.7954916 , 51.623638 , 54.45181356,
...
I have a list of tuples e.g. like this:
l=[ (2,2,1), (2,4,0), (2,8,0),
(4,2,0), (4,4,1), (4,8,0),
(8,2,0), (8,4,0), (8,8,1) ]
and want to transform it into a numpy array like this (only the z values go into the matrix, arranged according to the order of the x, y coordinates; the coordinates should be stored separately):
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
I'm posting my solution below, but it's pretty low-level and I think there should be some higher-level solution for this, either using matplotlib or numpy. Any idea?
One needs this kind of conversion to provide the arrays to matplotlib plotting functions like pcolor, imshow, contour.
It looks like np.unique with the return_inverse option fits the bill. For example,
In [203]: l[:,0]
Out[203]: array([2, 2, 2, 4, 4, 4, 8, 8, 8])
In [204]: np.unique(l[:,0], return_inverse = True)
Out[204]: (array([2, 4, 8]), array([0, 0, 0, 1, 1, 1, 2, 2, 2]))
np.unique returns a 2-tuple. The first array in the 2-tuple is an array of all the unique values in l[:,0]. The second array is the index values associating values in array([2, 4, 8]) with values in the original array l[:,0]. It also happens to be the rank, since np.unique returns the unique values in sorted order.
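A tiny check of what the inverse array means, using the same x column as above (the variable names here are chosen only for this sketch):

```python
import numpy as np

col = np.array([2, 2, 2, 4, 4, 4, 8, 8, 8])
values, inverse = np.unique(col, return_inverse=True)

# Indexing the sorted unique values with the inverse rebuilds the input,
# so inverse[i] is the rank of col[i] among the unique values.
rebuilt = values[inverse]
```

That property is what allows the inverse arrays to be used directly as row and column indices when filling the matrix.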
import numpy as np
import matplotlib.pyplot as plt
l = np.array([ (2,2,1), (2,4,0), (2,8,0),
(4,2,0), (4,4,1), (4,8,0),
(8,2,0), (8,4,0), (8,8,1) ])
x, xrank = np.unique(l[:,0], return_inverse = True)
y, yrank = np.unique(l[:,1], return_inverse = True)
a = np.zeros((max(xrank)+1, max(yrank)+1))
a[xrank,yrank] = l[:,2]
fig = plt.figure()
ax = plt.subplot(111)
ax.pcolor(x, y, a)
plt.show()
yields
My solution first ranks the x and y values, and then creates the array.
l=[ (2,2,1), (2,4,0), (2,8,0),
(4,2,0), (4,4,1), (4,8,0),
(8,2,0), (8,4,0), (8,8,1) ]
def rankdata_ignoretied(data):
    """ranks data counting all tied values as one"""
    # first translate the data values to integers in increasing order
    counter=0
    encountered=dict()
    for e in sorted(data):
        if e not in encountered:
            encountered[e]=counter
            counter+=1
    # then map the original sequence of the data values
    result=[encountered[e] for e in data]
    return result
x=[e[0] for e in l]
y=[e[1] for e in l]
z=[e[2] for e in l]
xrank=rankdata_ignoretied(x)
yrank=rankdata_ignoretied(y)
import numpy
a=numpy.zeros((max(xrank)+1, max(yrank)+1))
for i in range(len(l)):
    a[xrank[i],yrank[i]]=l[i][2]
To use the resulting array for plotting one also needs the original x and y values, e.g.:
ax=plt.subplot(511)
ax.pcolor(sorted(set(x)), sorted(set(y)), a)
Anyone has a better idea of how to achieve this?
I don't understand why you're making this so complex. You can do it simply with:
array([
[cell[2] for cell in row] for row in zip(*[iter(x)] * 3)
])
Or perhaps more readably:
array([
[a[2], b[2], c[2]] for a, b, c in zip(x[0::3], x[1::3], x[2::3])
])
A solution using the standard Python constructs set, list and sorted. If you don't have a lot of points, it gains in readability, even if it is slower than the numpy solution given by unutbu.
l=[ (2,2,1), (2,4,0), (2,8,0),
(4,2,0), (4,4,1), (4,8,0),
(8,2,0), (8,4,0), (8,8,1) ]
#get the ranks of the values for x and y
xi = sorted(list(set( i[0] for i in l )))
yi = sorted(list(set( i[1] for i in l )))
a = np.zeros((len(xi),len(yi)))
#fill the matrix using the list.index
for x,y,v in l:
    a[xi.index(x),yi.index(y)]=v
ax=plt.subplot(111)
ax.pcolor(array(xi), array(yi), a)