Shuffle columns of an array with Numpy - python

Let's say I have an array r of dimension (n, m). I would like to shuffle the columns of that array.
If I use numpy.random.shuffle(r), it shuffles the rows. How can I shuffle only the columns, so that, for example, the first column becomes the second one and the third becomes the first, at random?
Example:
input:
array([[  1,  20, 100],
       [  2,  31, 401],
       [  8,  11, 108]])
output:
array([[ 20,   1, 100],
       [ 31,   2, 401],
       [ 11,   8, 108]])

One approach is to shuffle the transposed array:
np.random.shuffle(np.transpose(r))
Another approach (see YXD's answer https://stackoverflow.com/a/20546567/1787973) is to generate a random permutation of the column indices and index the columns in that order:
r = r[:, np.random.permutation(r.shape[1])]
Performance-wise, the second approach is faster.
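Both can be tried side by side; a minimal sketch using the array from the question:

import numpy as np

r = np.array([[1, 20, 100],
              [2, 31, 401],
              [8, 11, 108]])

# Approach 1: shuffle the rows of the transposed view, in place.
np.random.shuffle(np.transpose(r))
print(r)

# Approach 2: index with a random permutation of the column indices (returns a new array).
r = r[:, np.random.permutation(r.shape[1])]
print(r)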

For a general axis you can follow the same pattern:
>>> import numpy as np
>>>
>>> a = np.array([[ 1, 20, 100,   4],
...               [ 2, 31, 401,   5],
...               [ 8, 11, 108,   6]])
>>>
>>> print(a[:, np.random.permutation(a.shape[1])])   # shuffle the columns
[[  4   1  20 100]
 [  5   2  31 401]
 [  6   8  11 108]]
>>>
>>> print(a[np.random.permutation(a.shape[0]), :])   # shuffle the rows
[[  1  20 100   4]
 [  2  31 401   5]
 [  8  11 108   6]]

So, one step further from your answer:
Edit: I could easily be mistaken about how this works, so I'm inserting my understanding of the state of the matrix at each step.
r == 1 2 3
     4 5 6
     6 7 8
r = np.transpose(r)
r == 1 4 6
     2 5 7
     3 6 8    # columns are now rows
np.random.shuffle(r)
r == 2 5 7
     3 6 8
     1 4 6    # columns-as-rows are shuffled
r = np.transpose(r)
r == 2 3 1
     5 6 4
     7 8 6    # columns are columns again, shuffled
which is back in the proper shape, with the columns rearranged.
The transpose of the transpose of a matrix is that matrix, i.e. (A^T)^T == A. But a shuffle of the transpose is not undone by transposing, so you need the second transpose after the shuffle to get the array back into its proper shape.
Edit: The OP's answer skips storing the transpositions: np.transpose(r) is a view of r, so shuffling the view shuffles r's columns directly.
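A quick check that np.transpose really returns a view, which is why the OP's one-liner needs no assignment (a small sketch):

import numpy as np

r = np.arange(9).reshape(3, 3)
t = np.transpose(r)
print(np.shares_memory(r, t))   # True: t is a view, no data was copied
np.random.shuffle(t)            # shuffling the view's rows...
print(r)                        # ...reorders r's columns in place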

In general, if you want to shuffle a numpy array along axis i:

def shuffle(x, axis=0):
    # Permutation of the axes that swaps `axis` with axis 0.
    n_axis = len(x.shape)
    t = np.arange(n_axis)
    t[0] = axis
    t[axis] = 0
    # Transpose a copy, shuffle along the (new) first axis, transpose back.
    # The same `t` works both times because swapping two axes is its own inverse.
    xt = np.transpose(x.copy(), t)
    np.random.shuffle(xt)
    shuffled_x = np.transpose(xt, t)
    return shuffled_x

shuffle(array, axis=i)
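On recent NumPy versions the Generator API offers a more direct route: Generator.permutation accepts an axis argument, so no transposing is needed. A sketch (the seed is only there for reproducibility):

import numpy as np

rng = np.random.default_rng(seed=42)
a = np.arange(12).reshape(3, 4)
print(rng.permutation(a, axis=1))   # columns reordered as whole units; returns a copy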

>>> print(s0)
[[0. 1. 0. 1.]
 [0. 1. 0. 0.]
 [0. 1. 0. 1.]
 [0. 0. 0. 1.]]
>>> print(np.random.permutation(s0.T).T)
[[1. 0. 1. 0.]
 [0. 0. 1. 0.]
 [1. 0. 1. 0.]
 [1. 0. 0. 0.]]
np.random.permutation() permutes along the first axis (the rows), so permuting the transpose and transposing back shuffles the columns.

There is another way, which does not use transposition and is apparently faster:
np.take(r, np.random.permutation(r.shape[1]), axis=1, out=r)
CPU times: user 1.14 ms, sys: 1.03 ms, total: 2.17 ms
Wall time: 3.89 ms
The approach from the other answers, np.random.shuffle(r.T):
CPU times: user 2.24 ms, sys: 0 ns, total: 2.24 ms
Wall time: 5.08 ms
I used r = np.arange(64*1000).reshape(64, 1000) as the input.
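A note on out=r: with the default mode='raise', np.take buffers the output, so writing the result back into r itself is safe and the columns end up reordered in place. A small sketch:

import numpy as np

r = np.arange(12).reshape(3, 4)
perm = np.random.permutation(r.shape[1])
np.take(r, perm, axis=1, out=r)   # buffered, so overlapping input/output is fine
print(r)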

Related

Referencing index for vectorization in NumPy

I have a couple of for loops that I want to vectorize in order to improve performance. They operate on 1 x N matrices.

for y in range(1, len(array[0]) + 1):
    array[0, y - 1] = np.floor(np.nanmean(otherArray[0, ((y-1)*3):((y-1)*3+3)]))

for i in range(len(array[0])):
    array[0, int((i-1)*L+1)] = otherArray[0, i]

The operations rely on the index given by the for loop. Is there any way to access the index while using numpy.vectorize so that I can rewrite these as vectorized functions?
First loop:
import numpy as np

array = np.zeros((1, 10))
otherArray = np.arange(30).reshape(1, -1)

print(f'array = \n{array}')
print(f'otherArray = \n{otherArray}')

# original loop
for y in range(1, len(array[0]) + 1):
    array[0, y - 1] = np.floor(np.nanmean(otherArray[0, ((y-1)*3):((y-1)*3+3)]))
print(f'array = \n{array}')

# vectorized equivalent: average over non-overlapping groups of 3
array = np.floor(np.nanmean(otherArray.reshape(-1, 3), axis=1)).reshape(1, -1)
print(f'array = \n{array}')

output:
array =
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
otherArray =
[[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
  24 25 26 27 28 29]]
array =
[[ 1.  4.  7. 10. 13. 16. 19. 22. 25. 28.]]
array =
[[ 1.  4.  7. 10. 13. 16. 19. 22. 25. 28.]]
Second loop:
array = np.zeros((1, 10))
otherArray = np.arange(10, dtype=float).reshape(1, -1)
L = 1

print(f'array = \n{array}')
print(f'otherArray = \n{otherArray}')

# original loop: with L = 1 the target index (i-1)*L + 1 simplifies to i
for i in range(len(otherArray[0])):
    array[0, int((i-1)*L+1)] = otherArray[0, i]
print(f'array = \n{array}')

# vectorized equivalent: a plain copy
array = otherArray
print(f'array = \n{array}')

output:
array =
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
otherArray =
[[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]]
array =
[[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]]
array =
[[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]]
It looks like in the first loop you are computing a windowed average over groups of three. With non-overlapping windows this is best done with a reshape:

import numpy as np

window_width = 3
arr = np.arange(12)
out = np.floor(np.nanmean(arr.reshape(-1, window_width), axis=-1))
print(out)

Regarding your second loop, I have no clue what it does. Are you trying to copy values from otherArray to array with some offset? I'd recommend looking at numpy's slicing functionality.
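If a true moving average (overlapping windows) is ever needed instead of the block average above, np.convolve is the usual tool; a minimal sketch:

import numpy as np

arr = np.arange(12, dtype=float)
window = 3
kernel = np.ones(window) / window
print(np.convolve(arr, kernel, mode='valid'))   # [ 1.  2.  3. ... 10.], one value per overlapping window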

Convert for loop outcome to numpy array

I want to create an array with 17 elements, starting with 1, where each subsequent number is twice the value immediately before it.
What I have so far is:

import numpy as np
array = np.zeros(shape=17)
array[0] = 1
x = 1
for i in array:
    print(x)
    x *= 2
print(array)
what I got is:
1
2
4
8
16
32
64
128
256
512
1024
2048
4096
8192
16384
32768
65536
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
and what I want is:
[1. 2. 4. 8. 16. 32. 64. 128. 256. 512. 1024. 2048. 4096. 8192. 16384. 32768. 65536.]
There is a function for that:
np.logspace(0, 16, 17, base=2, dtype=int)
# array([    1,     2,     4,     8,    16,    32,    64,   128,   256,
#          512,  1024,  2048,  4096,  8192, 16384, 32768, 65536])
Alternatives:
1<<np.arange(17)
2**np.arange(17)
np.left_shift.accumulate(np.ones(17,int))
np.repeat((1,2),(1,16)).cumprod()
np.vander([2],17,True)[0]
np.ldexp(1,np.arange(17),dtype=float)
Silly alternatives:
from scipy.sparse import linalg, diags
linalg.spsolve(diags([(1,), (-2,)], (0, -1), (17, 17)), np.r_[:17] == 0)
np.packbits(np.identity(32,'?')[:17],1,'little').view('<i4').T[0]
np.ravel_multi_index(np.identity(17,int)[::-1],np.full(17,2))
np.where(np.sum(np.ix_(*17*((0,1),))).reshape(-1)==1)[0]
You need to assign the value back:

import numpy as np
array = np.zeros(shape=17, dtype="int")
x = 1
for i in range(len(array)):
    array[i] = x
    print(x)
    x *= 2

>>> print(array)
[    1     2     4     8    16    32    64   128   256   512  1024  2048
  4096  8192 16384 32768 65536]
It will be more efficient to use numpy vectorization, like below:

import numpy as np
n = 17
triangle = np.tri(n, n, -1, dtype=np.int64) + 1
triangle.cumprod(axis=1)[:, -1]

Explanation:
np.tri(n, n, dtype=np.int64) creates a triangular matrix with 1s on and below the diagonal and 0s elsewhere.
np.tri(n, n, -1, dtype=np.int64) shifts the triangle down by one row, so the first row is all zeros.
np.tri(n, n, -1, dtype=np.int64) + 1 turns the 0s into 1s and the 1s into 2s.
In the last step, take the cumulative product along each row; the last column is the answer, since row i is the product of i twos and (n - i) ones, i.e. 2**i.
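To see the intermediate matrix concretely, here is the same construction for a small n (a sketch with n = 4):

import numpy as np

n = 4
triangle = np.tri(n, n, -1, dtype=np.int64) + 1
print(triangle)
# [[1 1 1 1]
#  [2 1 1 1]
#  [2 2 1 1]
#  [2 2 2 1]]
print(triangle.cumprod(axis=1)[:, -1])   # [1 2 4 8], i.e. 2**i per row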

What does copy=False do in sklearn

In the documentation for the PCA function in scikit-learn, there is a copy argument that is True by default.
The documentation says this about the argument:
If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
I'm not sure what this is saying, however, because how would the function overwrite the input X? When you call .fit(X), the function should just be calculating the PCA vectors and updating the internal state of the PCA object, right?
So even if you set copy to False, the .fit(X) function should still be returning the object self as it says in the documentation, so shouldn't fit(X).transform(X) still work?
So what is it copying when this argument is set to False?
Additionally, when would I want to set it to False?
Edit:
I ran the fit and transform function together and separately and got different results even though the copy parameter was the same for both.
from sklearn.decomposition import PCA
import numpy as np
X = np.arange(20).reshape((5,4))
print("Separate")
XT = X.copy()
pcaT = PCA(n_components=2, copy=True)
print("Original: ", XT)
results = pcaT.fit(XT).transform(XT)
print("New: ", XT)
print("Results: ", results)
print("\nCombined")
XF = X.copy()
pcaF = PCA(n_components=2, copy=True)
print("Original: ", XF)
results = pcaF.fit_transform(XF)
print("New: ", XF)
print("Results: ", results)
########## Results
Separate
Original:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
New:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
Results:
[[ 1.60000000e+01 -2.66453526e-15]
 [ 8.00000000e+00 -1.33226763e-15]
 [ 0.00000000e+00  0.00000000e+00]
 [-8.00000000e+00  1.33226763e-15]
 [-1.60000000e+01  2.66453526e-15]]

Combined
Original:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
New:
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]
Results:
[[ 1.60000000e+01  1.44100598e-15]
 [ 8.00000000e+00 -4.80335326e-16]
 [-0.00000000e+00  0.00000000e+00]
 [-8.00000000e+00  4.80335326e-16]
 [-1.60000000e+01  9.60670651e-16]]
In your example the value of copy ends up being ignored, as explained below. But here is what can happen if you set it to False:
X = np.arange(20).reshape((5,4)).astype(np.float64)
print(X)
pca = PCA(n_components=2, copy=False).fit(X)
print(X)
This prints the original X:
[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]
 [12. 13. 14. 15.]
 [16. 17. 18. 19.]]
and then shows that X was mutated by the fit method:
[[-8. -8. -8. -8.]
 [-4. -4. -4. -4.]
 [ 0.  0.  0.  0.]
 [ 4.  4.  4.  4.]
 [ 8.  8.  8.  8.]]
The culprit is this line: X -= self.mean_, where augmented assignment mutates the array.
If you set copy=True, which is the default value, then X is not mutated.
Copy is sometimes made even if copy=False
Why did copy make no difference in your example? The only thing the method PCA.fit does with the value of copy is pass it to the utility function check_array, which is called to make sure the data matrix has dtype float32 or float64. If the data type isn't one of those, a type conversion happens, and that creates a copy anyway (in your example, there is a conversion from int to float). This is why in my example above I made X a float array.
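This can be checked with an int array like the one in the question; the dtype conversion inside check_array copies the data before fit ever touches it (a quick sketch):

from sklearn.decomposition import PCA
import numpy as np

X_int = np.arange(20).reshape(5, 4)          # int dtype
PCA(n_components=2, copy=False).fit(X_int)
print(X_int)                                 # unchanged: the int -> float conversion made a copy anyway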
Tiny differences between fit().transform() and fit_transform()
You also asked why fit(X).transform(X) and fit_transform(X) return slightly different results. This has nothing to do with the copy parameter. The difference is within the error of double-precision arithmetic, and it comes from the following:
fit performs the SVD as X = U @ S @ V.T (where @ is matrix multiplication) and stores V in the components_ property.
transform multiplies the data by V.
fit_transform performs the same SVD as fit does, and then returns U @ S.
Mathematically, U @ S is the same as X @ V because V is an orthogonal matrix, but the errors of floating-point arithmetic produce tiny differences.
It makes sense for fit_transform to return U @ S instead of X @ V: since S is diagonal, it is a simpler and more accurate multiplication. The reason fit does not do the same is that only V is stored, and in any case transform doesn't know that the argument it receives is the same data the model was fit on.
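The U @ S versus X @ V claim is easy to verify with numpy's own SVD (a sketch; the centering mirrors what PCA does before decomposing):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
Xc = X - X.mean(axis=0)                        # PCA centers the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(U * S, Xc @ Vt.T))           # True: U @ diag(S) equals Xc @ V up to rounding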

python numpy - improve efficiency on column-wise cosine similarity

I am fairly new to programming and I never used numpy before.
So, I have a matrix of dimensions 19001 x 19001. It contains a lot of zeros, so it is relatively sparse. I wrote some code to compute the pairwise cosine similarity of the columns whose entries in a given row are non-zero. For each row I add up all of those pairwise similarity values and do some mathematical operations on the sum to obtain one value per row of the matrix (see code below). It does what it is supposed to do; however, with this many dimensions it is really slow. Is there any way to modify my code to make it more efficient?
import numpy as np
from scipy.spatial.distance import cosine

row_number = 0
out_file = open('outfile.txt', 'w')

for row in my_matrix:
    non_zeros = np.nonzero(my_matrix[row_number])[0]
    non_zeros = list(non_zeros)
    cosine_sim = []
    for item in non_zeros:
        if len(non_zeros) <= 1:
            break
        x = non_zeros[0]
        y = non_zeros[1]
        similarity = 1 - cosine(my_matrix[:, x], my_matrix[:, y])
        cosine_sim.append(similarity)
        non_zeros.pop(0)
    summing = np.sum(cosine_sim)
    mean = summing / len(cosine_sim)
    log = np.log(mean)
    out_file_value = log * -1
    out_file.write(str(row_number) + " " + str(out_file_value) + "\n")
    if row_number <= 19000:
        row_number += 1
    else:
        break
I know that there are functions to compute the cosine similarity between columns (from sklearn.metrics.pairwise import cosine_similarity), so I tried that. However, its output is similar to mine yet at the same time really confusing to me, even though I read the documentation and the posts on this page referring to the issue.
For instance:
my_matrix = [[ 0.  0.  7.  0.  5.]
             [ 0.  0. 11.  0.  0.]
             [ 0.  2.  0.  0.  0.]
             [ 0.  0.  2. 11.  5.]
             [ 0.  0.  5.  0.  0.]]

transposed = np.transpose(my_matrix)
sim_matrix = cosine_similarity(transposed)

# resulting similarity matrix
sim_matrix = [[0. 0.         0.         0.         0.        ]
              [0. 1.         0.         0.         0.        ]
              [0. 0.         1.         0.14177624 0.45112924]
              [0. 0.         0.14177624 1.         0.70710678]
              [0. 0.         0.45112924 0.70710678 1.        ]]
If I compute the cosine similarity with my code above, it returns 0.45112924 for the 1st row ([0]) and 0.14177624 and 0.70710678 for row 4 ([3]).
out_file.txt
0 0.796001425306
1 nan
2 nan
3 0.856981065776
4 nan
I greatly appreciate any help or suggestions to my question!
You can consider using scipy's cdist instead. Note, however, that it doesn't take sparse matrix input; you have to provide a dense numpy array.

from scipy.spatial.distance import cdist
import numpy as np

X = np.random.randn(10000, 10000)
D = cdist(X.T, X.T, metric='cosine')   # pairwise cosine distance between columns of X

(The rows of X.T are the columns of X, so passing X.T to both arguments gives column-against-column distances.)

Here is the speed that I got for a 10000 x 10000 random array:

%timeit cdist(X.T, X.T, metric='cosine')
16.4 s ± 325 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Try it on a small array:

X = np.array([[1, 0, 1],
              [0, 3, 2],
              [1, 0, 1]])
D = cdist(X.T, X.T, metric='cosine')

This gives (rounded):

[[0.        1.        0.422650]
 [1.        0.        0.183503]
 [0.422650  0.183503  0.      ]]

For example, D[0, 2] is the cosine distance between columns 0 and 2. Checking by hand:

from numpy.linalg import norm
1 - np.dot(X[:, 0], X[:, 2]) / (norm(X[:, 0]) * norm(X[:, 2]))   # gives 0.422650
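For the column-wise case there is also a fully vectorized route with no pairwise loop at all: normalize the columns and take one matrix product. A sketch (the zero-norm guard is an assumption added to keep all-zero columns from producing NaN, mirroring the zeros sklearn returned in the question):

import numpy as np

X = np.array([[1., 0., 1.],
              [0., 3., 2.],
              [1., 0., 1.]])
norms = np.linalg.norm(X, axis=0)
norms[norms == 0] = 1.0          # guard: leave all-zero columns at similarity 0
Xn = X / norms                   # unit-norm columns
sim = Xn.T @ Xn                  # sim[i, j] = cosine similarity of columns i and j
print(sim)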

Divide each row by a vector element with floating value precision

Suppose I have
a = np.arange(9).reshape((3,3))
and I want to divide each row by a vector
n = np.array([1.1, 2.2, 3.3])
I tried the solution proposed in this question, but the fractional values are not taken into account.
I understand your question differently from the comments above:

import numpy as np
a = np.arange(12).reshape((4, 3))
print(a)
n = np.array([[1.1, 2.2, 3.3]])
print(n)
print(a / n)

Output:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
[[1.1 2.2 3.3]]
[[0.         0.45454545 0.60606061]
 [2.72727273 1.81818182 1.51515152]
 [5.45454545 3.18181818 2.42424242]
 [8.18181818 4.54545455 3.33333333]]

I also changed from a square 3x3 matrix to a 4x3 one to point out that rows vs. columns matter. The divisor is also an explicitly 2-D row vector now (note the double brackets), of shape (1, 3).
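For completeness, broadcasting handles both orientations; the divisor's shape decides whether rows or columns get scaled (a quick sketch):

import numpy as np

a = np.arange(12).reshape(4, 3)
n = np.array([1.1, 2.2, 3.3])        # length 3 = number of columns
print(a / n)                         # each row divided elementwise by n
m = np.array([1.1, 2.2, 3.3, 4.4])   # length 4 = number of rows
print(a / m[:, None])                # row i divided by the scalar m[i]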
