In the documentation for the PCA function in scikit-learn, there is a copy argument that is True by default.
The documentation says this about the argument:
If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
I'm not sure what this is saying, however, because how would the function overwrite the input X? When you call .fit(X), the function should just be calculating the PCA vectors and updating the internal state of the PCA object, right?
So even if you set copy to False, the .fit(X) function should still be returning the object self as it says in the documentation, so shouldn't fit(X).transform(X) still work?
So what is it copying when this argument is set to False?
Additionally, when would I want to set it to False?
Edit:
I ran the fit and transform functions together and separately and got slightly different results even though the copy parameter was the same for both.
from sklearn.decomposition import PCA
import numpy as np
X = np.arange(20).reshape((5,4))
print("Separate")
XT = X.copy()
pcaT = PCA(n_components=2, copy=True)
print("Original: ", XT)
results = pcaT.fit(XT).transform(XT)
print("New: ", XT)
print("Results: ", results)
print("\nCombined")
XF = X.copy()
pcaF = PCA(n_components=2, copy=True)
print("Original: ", XF)
results = pcaF.fit_transform(XF)
print("New: ", XF)
print("Results: ", results)
########## Results
Separate
Original: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
New: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
Results: [[ 1.60000000e+01 -2.66453526e-15]
[ 8.00000000e+00 -1.33226763e-15]
[ 0.00000000e+00 0.00000000e+00]
[ -8.00000000e+00 1.33226763e-15]
[ -1.60000000e+01 2.66453526e-15]]
Combined
Original: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
New: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
Results: [[ 1.60000000e+01 1.44100598e-15]
[ 8.00000000e+00 -4.80335326e-16]
[ -0.00000000e+00 0.00000000e+00]
[ -8.00000000e+00 4.80335326e-16]
[ -1.60000000e+01 9.60670651e-16]]
In your example the value of copy ends up being ignored, as explained below. But here is what can happen if you set it to False:
X = np.arange(20).reshape((5,4)).astype(np.float64)
print(X)
pca = PCA(n_components=2, copy=False).fit(X)
print(X)
This prints the original X
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[ 12. 13. 14. 15.]
[ 16. 17. 18. 19.]]
and then shows that X was mutated by the fit method:
[[-8. -8. -8. -8.]
[-4. -4. -4. -4.]
[ 0. 0. 0. 0.]
[ 4. 4. 4. 4.]
[ 8. 8. 8. 8.]]
The culprit is this line: X -= self.mean_, where augmented assignment mutates the array.
If you set copy=True, which is the default value, then X is not mutated.
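What the mutation looks like can be sketched directly. This is a minimal illustration of in-place versus out-of-place centering, not scikit-learn's actual code path:

```python
import numpy as np

X = np.arange(20, dtype=np.float64).reshape(5, 4)

# Augmented assignment writes into X's existing buffer,
# which is exactly what copy=False permits fit to do:
X -= X.mean(axis=0)
print(X[0])          # [-8. -8. -8. -8.]

# A plain binary operation allocates a new array instead:
Y = np.arange(20, dtype=np.float64).reshape(5, 4)
Z = Y - Y.mean(axis=0)
print(Y[0])          # [0. 1. 2. 3.] -- Y is untouched
```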
A copy is sometimes made even if copy=False
Why didn't copy make a difference in your example? The only thing the PCA.fit method does with the value of copy is pass it to the utility function check_array, which is called to make sure the data matrix has dtype float32 or float64. If the data type is neither of those, a type conversion happens, and that creates a copy anyway (in your example, int is converted to float). This is why I made X a float array in my example above.
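A rough analogue of what that conversion implies (not check_array's exact implementation): casting an int array to float necessarily allocates a new buffer, so the caller's array is safe regardless of copy:

```python
import numpy as np

X_int = np.arange(20).reshape(5, 4)            # int dtype, like the question's X
X_float = np.asarray(X_int, dtype=np.float64)  # dtype conversion forces a new array

X_float -= X_float.mean(axis=0)  # mutate the converted copy in place
print(X_int[0])                  # [0 1 2 3] -- original untouched
print(X_float.base is None)      # True: the cast produced an independent array
```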
Tiny differences between fit().transform() and fit_transform()
You also asked why fit(X).transform(X) and fit_transform(X) return slightly different results. This has nothing to do with the copy parameter; the difference is within the rounding error of double-precision arithmetic. It comes from the following:
fit performs the SVD as X = U @ S @ V.T (where @ means matrix multiplication) and stores V (as its transpose) in the components_ property.
transform multiplies the data by V.
fit_transform performs the same SVD as fit does, and then returns U @ S.
Mathematically, U @ S is the same as X @ V because V is an orthogonal matrix, but the errors of floating-point arithmetic result in tiny differences.
It makes sense that fit_transform computes U @ S instead of X @ V: it is a simpler and more accurate multiplication because S is diagonal. The reason transform does not do the same is that it only has V stored, and in any case it cannot know that the argument it received is the same data the model was fitted on.
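The algebra can be checked with NumPy's own SVD (a sketch of the identity, not scikit-learn's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
Xc = X - X.mean(axis=0)               # PCA centers the data first

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
us = U[:, :k] * S[:k]                 # what fit_transform returns: U @ diag(S)
xv = Xc @ Vt[:k].T                    # what fit(X).transform(X) computes: X @ V

print(np.allclose(us, xv))            # True -- equal up to floating-point error
print(np.abs(us - xv).max())          # tiny, near machine epsilon
```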
Related
I am wondering why Python truncates the numbers to integers whenever I assign floating point numbers to a numpy array:
import numpy as np
lst = np.asarray(list(range(10)))
print ("lst before assignment: ", lst)
lst[:4] = [0.3, 0.5, 10.6, 0.2];
print ("lst after assignment: ", lst)
output:
lst before assignment: [0 1 2 3 4 5 6 7 8 9]
lst after assignment: [ 0 0 10 0 4 5 6 7 8 9]
Why does it do this? Since you do not need to specify types in the language, I cannot understand why numpy would cast the floats to ints before assigning to the array (which contains integers).
Try:
lst = np.asarray(list(range(10)), dtype=float)
lst before assignment: [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
lst after assignment: [ 0.3 0.5 10.6 0.2 4. 5. 6. 7. 8. 9. ]
numpy fixes the data type (.dtype) of an array at the moment of creation. Here it sees that you are storing integers, and once the dtype is set it stays that way: values assigned later are cast to fit it. If you plan to use floats, either input floats or specify the data type, i.e.
np.array(list(map(float, range(10))))
or
np.array(list(range(10)), dtype=float)
or
np.array(list(range(10))).astype(float)
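To see the dtype rule end to end with the question's data:

```python
import numpy as np

lst = np.arange(10)                    # integer dtype fixed at creation
lst[:4] = [0.3, 0.5, 10.6, 0.2]        # floats are truncated to fit the dtype
print(lst[:4])                         # [ 0  0 10  0]

flst = np.arange(10, dtype=float)      # declare float up front
flst[:4] = [0.3, 0.5, 10.6, 0.2]       # values survive intact
print(flst[:4])                        # [ 0.3  0.5 10.6  0.2]
```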
I tested two 3x3 matrices, computing their inverses in Python and Excel, but the results are different. Which should I consider the correct or best result?
These are the matrices I tested:
Matrix 1:
1 0 0
1 2 0
1 2 3
Matrix 2:
1 0 0
4 5 0
7 8 9
The Matrix 1 inverse is the same in Python and Excel, but Matrix 2 inverse is different.
In Excel I use the MINVERSE(matrix) function, and in Python np.linalg.inv(matrix) (from Numpy library)
I can't post images yet, so I can't show the results from Excel :c
This is the code I use in Python:
# Matrix 1
A = np.array([[1,0,0],
[1,2,0],
[1,2,3]])
Ainv = np.linalg.inv(A)
print(Ainv)
Result:
[[ 1. 0. 0. ]
[-0.5 0.5 0. ]
[ 0. -0.33333333 0.33333333]]
# (This is the same in Excel)
# Matrix 2
B = np.array([[1,0,0],
[4,5,0],
[7,8,9]])
Binv = np.linalg.inv(B)
print(Binv)
Result:
[[ 1.00000000e+00 0.00000000e+00 -6.16790569e-18]
[-8.00000000e-01 2.00000000e-01 1.23358114e-17]
[-6.66666667e-02 -1.77777778e-01 1.11111111e-01]]
# (This is different in Excel)
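Neither result need be wrong here: entries like -6.16790569e-18 are zero up to double-precision rounding, and both candidate inverses can be checked against B itself. A quick sanity check:

```python
import numpy as np

B = np.array([[1, 0, 0],
              [4, 5, 0],
              [7, 8, 9]], dtype=float)
Binv = np.linalg.inv(B)

# A matrix times its inverse should give the identity, up to rounding:
print(np.allclose(B @ Binv, np.eye(3)))          # True

# The exact inverse (B is lower triangular, so it can be computed by hand):
exact = np.array([[ 1.0,    0.0,    0.0],
                  [-0.8,    0.2,    0.0],
                  [-1/15,  -8/45,   1/9]])
print(np.allclose(Binv, exact))                  # True
```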
I was trying to achieve a kind of 2d filter with numpy, and I found something that looks to me like a bug.
In the example below, I'm trying to target columns 2 and 4 of the first, second and last rows of my data, i.e.:
[[ 2 4]
[ 8 10]
[26 28]]
I am aware that the second-to-last line does return that, but I wouldn't be able to assign anything through it (it returns a copy). And this still doesn't explain why the last one fails.
import numpy as np
# create my data: 5x6 array
data = np.arange(0,30).reshape(5,6)
# mask: only keep row 1,2,and 5
mask = np.array([1,1,0,0,1])
mask = mask.astype(bool)
# this is fine
print 'data\n', data, '\n'
# this is good
print 'mask\n', mask, '\n'
# this is nice
print 'data[mask]\n', data[mask], '\n'
# this is great
print 'data[mask, 2]\n', data[mask, 2], '\n'
# this is awesome
print 'data[mask][:,[2,4]]\n', data[mask][:,[2,4]], '\n'
# this fails ??
print 'data[mask, [2,4]]\n', data[mask, [2,4]], '\n'
output:
data
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]]
mask
[ True True False False True]
data[mask]
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[24 25 26 27 28 29]]
data[mask, 2]
[ 2 8 26]
data[mask][:,[2,4]]
[[ 2 4]
[ 8 10]
[26 28]]
data[mask, [2,4]]
Traceback (most recent call last):
[...]
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (3,) (2,)
I'm posting this here, because I'm not confident enough in my numpy skills to be sure this is a bug, and file a bug report...
Thanks for your help/feedback !
This is not a bug; it is how advanced indexing is defined.
If you read the "Advanced Indexing" section of the array indexing documentation, you will notice that it says:
Purely integer array indexing: When the index consists of as many integer arrays as the array being indexed has dimensions, the indexing is straightforward, but different from slicing. Advanced indexes are always broadcast and iterated as one:
result[i_1, ..., i_M] == x[ind_1[i_1, ..., i_M], ind_2[i_1, ..., i_M], ..., ind_N[i_1, ..., i_M]]
Therefore
print 'data[mask, [1,2,4]]\n', data[mask, [1,2,4]], '\n'
works and outputs
data[mask, [1,2,4]]
[ 1  8 28]
because the mask selects three rows and the integer index also has length three: the index arrays must broadcast to the same shape.
Maybe you can achieve what you want using the ix_ function (see the array indexing documentation):
columns = np.array([2, 4], dtype=np.intp)
print data[np.ix_(mask, columns)]
which outputs
[[ 2 4]
[ 8 10]
[26 28]]
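An alternative to np.ix_ is to broadcast explicit integer indices yourself; unlike the chained data[mask][:,[2,4]], this form also supports assignment:

```python
import numpy as np

data = np.arange(30).reshape(5, 6)
mask = np.array([1, 1, 0, 0, 1], dtype=bool)

rows = mask.nonzero()[0][:, None]   # shape (3, 1): the selected row indices
cols = np.array([2, 4])             # shape (2,): broadcasts against rows to (3, 2)
print(data[rows, cols])
# [[ 2  4]
#  [ 8 10]
#  [26 28]]

data[rows, cols] = 0                # fancy indexing on the left-hand side assigns
print(data[0])                      # [0 1 0 3 0 5]
```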
Suppose I have
a = np.arange(9).reshape((3,3))
and I want to divide each row by a vector
n = np.array([1.1,2.2,3.3])
I tried the solution proposed in this question, but the fractional values are not taken into account.
I understand your question differently from the comments above:
import numpy as np
a = np.arange(12).reshape((4,3))
print a
n = np.array([[1.1,2.2,3.3]])
print n
print a/n
Output:
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
[[ 1.1 2.2 3.3]]
[[ 0. 0.45454545 0.60606061]
[ 2.72727273 1.81818182 1.51515152]
[ 5.45454545 3.18181818 2.42424242]
[ 8.18181818 4.54545455 3.33333333]]
I also changed from a square (3x3) array to a (4x3) one to point out that rows vs. columns matter. Also, the divisor is now a row vector of shape (1, 3) (note the double brackets).
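The shape of the divisor picks the axis the division runs along; a small sketch of both orientations:

```python
import numpy as np

a = np.arange(12).reshape(4, 3)

n = np.array([1.1, 2.2, 3.3])
print(a / n[None, :])     # divisor shape (1, 3): every row divided elementwise by n

m = np.array([1.0, 2.0, 4.0, 8.0])
print(a / m[:, None])     # divisor shape (4, 1): row i divided by the scalar m[i]
```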
Let's say I have an array r of dimension (n, m). I would like to shuffle the columns of that array.
If I use numpy.random.shuffle(r) it shuffles the rows. How can I shuffle only the columns? So that, for example, the first column becomes the second one and the third the first, randomly.
Example:
input:
array([[ 1, 20, 100],
[ 2, 31, 401],
[ 8, 11, 108]])
output:
array([[ 20, 1, 100],
[ 31, 2, 401],
[ 11, 8, 108]])
One approach is to shuffle the transposed array:
np.random.shuffle(np.transpose(r))
Another approach (see YXD's answer https://stackoverflow.com/a/20546567/1787973) is to generate a list of permutations to retrieve the columns in that order:
r = r[:, np.random.permutation(r.shape[1])]
Performance-wise, the second approach is faster.
For a general axis you could follow the pattern:
>>> import numpy as np
>>>
>>> a = np.array([[ 1, 20, 100, 4],
... [ 2, 31, 401, 5],
... [ 8, 11, 108, 6]])
>>>
>>> print a[:, np.random.permutation(a.shape[1])]
[[ 4 1 20 100]
[ 5 2 31 401]
[ 6 8 11 108]]
>>>
>>> print a[np.random.permutation(a.shape[0]), :]
[[ 1 20 100 4]
[ 2 31 401 5]
[ 8 11 108 6]]
>>>
So, one step further from your answer:
Edit: I very easily could be mistaken how this is working, so I'm inserting my understanding of the state of the matrix at each step.
r == 1 2 3
4 5 6
6 7 8
r = np.transpose(r)
r == 1 4 6
2 5 7
3 6 8 # Columns are now rows
np.random.shuffle(r)
r == 2 5 7
3 6 8
1 4 6 # Columns-as-rows are shuffled
r = np.transpose(r)
r == 2 3 1
5 6 4
7 8 6 # Columns are columns again, shuffled.
which would then be back in the proper shape, with the columns rearranged.
The transpose of the transpose of a matrix equals that matrix ([A^T]^T == A), so in the walkthrough above you need a second transpose after the shuffle (because a transpose is not a shuffle) to get back to the proper shape.
Edit: The OP's answer skips storing the transpositions: np.transpose returns a view, so the shuffle operates on r's columns directly.
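The view semantics can be checked directly; np.transpose copies nothing, so shuffling the transposed view reorders the original's columns in place:

```python
import numpy as np

r = np.array([[1, 2, 3],
              [4, 5, 6],
              [6, 7, 8]])

rt = np.transpose(r)         # a view sharing r's memory, not a copy
print(rt.base is r)          # True

np.random.shuffle(rt)        # shuffling the view's rows reorders r's columns
print(sorted(r[0].tolist())) # [1, 2, 3]: first row's values, possibly reordered
```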
In general if you want to shuffle a numpy array along axis i:
def shuffle(x, axis=0):
    # Build a transposition that swaps axis 0 with the requested axis
    n_axis = len(x.shape)
    t = np.arange(n_axis)
    t[0] = axis
    t[axis] = 0
    # Shuffle a copy along the swapped-in axis, then swap back
    xt = np.transpose(x.copy(), t)
    np.random.shuffle(xt)
    shuffled_x = np.transpose(xt, t)
    return shuffled_x
shuffle(array, axis=i)
>>> print(s0)
[[0. 1. 0. 1.]
 [0. 1. 0. 0.]
 [0. 1. 0. 1.]
 [0. 0. 0. 1.]]
>>> print(np.random.permutation(s0.T).T)
[[1. 0. 1. 0.]
 [0. 0. 1. 0.]
 [1. 0. 1. 0.]
 [1. 0. 0. 0.]]
np.random.permutation() permutes the rows of its argument, so transposing before and after permutes the columns.
There is another way, which does not use transposition and is apparently faster:
np.take(r, np.random.permutation(r.shape[1]), axis=1, out=r)
CPU times: user 1.14 ms, sys: 1.03 ms, total: 2.17 ms
Wall time: 3.89 ms
The approach in the other answers, np.random.shuffle(r.T):
CPU times: user 2.24 ms, sys: 0 ns, total: 2.24 ms
Wall time: 5.08 ms
I used r = np.arange(64*1000).reshape(64, 1000) as an input.
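The NumPy docs note that out is buffered when mode='raise' (the default), so passing r as both input and output is safe. A quick check that the in-place take really performed a column permutation:

```python
import numpy as np

r = np.arange(64 * 1000).reshape(64, 1000)
before = r.copy()

np.take(r, np.random.permutation(r.shape[1]), axis=1, out=r)

# Each row should contain the same values as before, just reordered:
print(sorted(r[0].tolist()) == sorted(before[0].tolist()))   # True
```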