I am wondering why Python truncates the numbers to integers whenever I assign floating-point numbers to a numpy array:
import numpy as np
lst = np.asarray(list(range(10)))
print ("lst before assignment: ", lst)
lst[:4] = [0.3, 0.5, 10.6, 0.2]
print ("lst after assignment: ", lst)
output:
lst before assignment: [0 1 2 3 4 5 6 7 8 9]
lst after assignment: [ 0 0 10 0 4 5 6 7 8 9]
Why does it do this? Since you do not need to specify types in the language, I cannot understand why numpy would cast the floats to ints before assigning to the array (which contains integers).
Try:
lst = np.asarray(list(range(10)), dtype=float)
lst before assignment: [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
lst after assignment: [ 0.3 0.5 10.6 0.2 4. 5. 6. 7. 8. 9. ]
numpy fixes the data type (the .dtype) of an array at the moment of creation. Here it sees that you are storing integers, and once the dtype is set it stays that way. If you plan to use floats you should either input floats or specify the data type explicitly, i.e.
np.array(list(map(float, range(10))))
or
np.array(list(range(10)), dtype=float)
or
np.array(list(range(10))).astype(float)
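A quick check that all three variants indeed produce a float array (a minimal sketch):
import numpy as np

print(np.array(list(map(float, range(10)))).dtype)    # float64
print(np.array(list(range(10)), dtype=float).dtype)   # float64
print(np.array(list(range(10))).astype(float).dtype)  # float64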
WHY? Very strange...
In Python, if we test .astype() with Numba, the following prints results like
x: [-6. -5. -4. -3. -2. -1. 0. 1. 2. 3. 4. 5.]
x-int: [-6 -5 -4 -3 -2 -1 0 1 2 3 4 5]
@numba.njit
def tt():
    nn = 3
    x = np.linspace(0, 4*nn-1, 4*nn) - 2*nn
    print(x)
    print(x.astype(np.int32))
BUT, if I change the line defining x to x = np.linspace(0, 8*nn-1, 8*nn)-4*nn, the result is strange:
x: [-12. -11. -10. -9. -8. -7. -6. -5. -4. -3. -2. -1. 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11.]
x-int: [-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 0 2 3 4 5 6 7 8 9 10 11]
There are two 0s in x-int. Why?
tl;dr: This is a reported Numba bug.
The issue comes from a slight inaccuracy in Numba's linspace implementation, related to floating-point rounding. Here is an example that highlights the issue:
def tt_classic():
    nn = 3
    return np.linspace(0, 8*nn-1, 8*nn) - 4*nn

@numba.njit
def tt_numba():
    nn = 3
    return np.linspace(0, 8*nn-1, 8*nn) - 4*nn

print(tt_classic()[13])
print(tt_numba()[13])
Here is the result:
1.0
0.9999999999999982
As you can see, the Numba implementation does not return an exact value. While this cannot be avoided for big values, it can be considered a bug for such small values, since they can be represented exactly (without any loss of precision) on any IEEE-754 platform.
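A minimal illustration of the truncating cast itself, independent of Numba:
import numpy as np

x = np.array([0.9999999999999982])
print(x.astype(np.int32))            # [0] -- astype truncates toward zero
print(np.round(x).astype(np.int32))  # [1] -- round first, then cast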
As a result, the cast truncates the floating-point number 0.9999999999999982 to 0 rather than to the nearest integer. If you want a safe conversion (i.e. a workaround), you can explicitly tell NumPy/Numba to round first. Here is an example:
@numba.njit
def tt():
    nn = 3
    x = np.linspace(0, 8*nn-1, 8*nn) - 4*nn
    np.round(x, 0, x)  # round in place before the cast
    print(x)
    print(x.astype(np.int32))
This bug has been reported on the Numba bug tracker here.
You may also be interested in this related Numba issue.
I'm aligning multiple datasets (model and observations) and I thought it would make a lot of sense if xarray.align had a method to propagate NaNs/missing data in one dataset to the others. For now, I'm using xr.dataset.where in combination with np.isfinite, but especially my attempt to generalize this for more than two arrays feels a bit tricky. Is there a better way to do this?
a = xr.DataArray(np.arange(10).astype(float))
b = xr.DataArray(np.arange(10).astype(float))
a[[4, 5]] = np.nan
print(a.values)
print(b.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# Default behaviour
c, d = xr.align(a, b)
print(c.values)
print(d.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
# Desired behaviour
e, f = xr.align(a.where(np.isfinite(b)), b.where(np.isfinite(a)))
print(e.values)
print(f.values)
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
>> [ 0. 1. 2. 3. nan nan 6. 7. 8. 9.]
# Attempt to generalize for multiple arrays
c = b.copy()
c[[1, -1]] = np.nan

def align_better(*dataarrays):
    allvalid = np.all(np.array([np.isfinite(x) for x in dataarrays]), axis=0)
    return xr.align(*[da.where(allvalid) for da in dataarrays])
g, h, i = align_better(a, b, c)
print(g.values)
print(h.values)
print(i.values)
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
>> [ 0. nan 2. 3. nan nan 6. 7. 8. nan]
From the xarray docs:
Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes and dimension sizes.
Array from the aligned objects are suitable as input to mathematical operators, because along each dimension they have the same index and size.
Missing values (if join != 'inner') are filled with NaN.
Nothing about this function deals with the values in the arrays, just the dimensions and coordinates. This function is used for setting up arrays for operations against each other.
If your desired behavior is a function that returns NaN for all arrays where any arrays are NaN, your align_better function seems like a decent way to do it.
The function in my initial attempt was slow because the DataArrays were cast to numpy arrays. In this modified version, I first align the datasets; then I can safely use the .values attribute. This is much faster.
def align_better(*dataarrays):
    """Align datasets and propagate NaNs."""
    aligned = xr.align(*dataarrays)
    allvalid = np.all(np.asarray([np.isfinite(x).values for x in aligned]), axis=0)
    return [da.where(allvalid) for da in aligned]
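Reusing the a, b and c arrays from above, a quick sketch of the faster version in action:
g, h, i = align_better(a, b, c)
print(g.values)
>> [ 0. nan  2.  3. nan nan  6.  7.  8. nan]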
In the documentation for the PCA class in scikit-learn, there is a copy argument that is True by default.
The documentation says this about the argument:
If False, data passed to fit are overwritten and running fit(X).transform(X) will not yield the expected results, use fit_transform(X) instead.
I'm not sure what this is saying, however, because how would the function overwrite the input X? When you call .fit(X), the function should just be calculating the PCA vectors and updating the internal state of the PCA object, right?
So even if you set copy to False, the .fit(X) function should still be returning the object self as it says in the documentation, so shouldn't fit(X).transform(X) still work?
So what is it copying when this argument is set to False?
Additionally, when would I want to set it to False?
Edit:
I ran fit and transform separately, and fit_transform combined, and got different results even though the copy parameter was the same for both.
from sklearn.decomposition import PCA
import numpy as np
X = np.arange(20).reshape((5,4))
print("Separate")
XT = X.copy()
pcaT = PCA(n_components=2, copy=True)
print("Original: ", XT)
results = pcaT.fit(XT).transform(XT)
print("New: ", XT)
print("Results: ", results)
print("\nCombined")
XF = X.copy()
pcaF = PCA(n_components=2, copy=True)
print("Original: ", XF)
results = pcaF.fit_transform(XF)
print("New: ", XF)
print("Results: ", results)
########## Results
Separate
Original: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
New: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
Results: [[ 1.60000000e+01 -2.66453526e-15]
[ 8.00000000e+00 -1.33226763e-15]
[ 0.00000000e+00 0.00000000e+00]
[ -8.00000000e+00 1.33226763e-15]
[ -1.60000000e+01 2.66453526e-15]]
Combined
Original: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
New: [[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]]
Results: [[ 1.60000000e+01 1.44100598e-15]
[ 8.00000000e+00 -4.80335326e-16]
[ -0.00000000e+00 0.00000000e+00]
[ -8.00000000e+00 4.80335326e-16]
[ -1.60000000e+01 9.60670651e-16]]
In your example the value of copy ends up being ignored, as explained below. But here is what can happen if you set it to False:
X = np.arange(20).reshape((5,4)).astype(np.float64)
print(X)
pca = PCA(n_components=2, copy=False).fit(X)
print(X)
This prints the original X
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[ 12. 13. 14. 15.]
[ 16. 17. 18. 19.]]
and then shows that X was mutated by the fit method.
[[-8. -8. -8. -8.]
[-4. -4. -4. -4.]
[ 0. 0. 0. 0.]
[ 4. 4. 4. 4.]
[ 8. 8. 8. 8.]]
The culprit is this line: X -= self.mean_, where augmented assignment mutates the array.
If you set copy=True, which is the default value, then X is not mutated.
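A minimal sketch of why the augmented assignment matters: X -= mean writes into the existing buffer, while X = X - mean allocates a new array and leaves the original untouched.
import numpy as np

X = np.arange(4, dtype=float)

Y = X
Y = Y - 1.0   # rebinds Y to a new array; X is untouched
print(X)      # [0. 1. 2. 3.]

Z = X
Z -= 1.0      # in-place: mutates the buffer that X also references
print(X)      # [-1.  0.  1.  2.]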
A copy is sometimes made even if copy=False
Why hasn't copy made a difference in your example? The only thing that the method PCA.fit does with the value of copy is pass it to the utility function check_array, which is called to make sure the data matrix has dtype float32 or float64. If the data type isn't one of those, a type conversion happens, and that creates a copy anyway (in your example, there is a conversion from int to float). This is why in my example above I made X a float array.
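A sketch of that behaviour using the same utility (check_array lives in sklearn.utils; the dtype list below is my stand-in for the float32/float64 requirement described above):
import numpy as np
from sklearn.utils import check_array

X_int = np.arange(20).reshape(5, 4)   # integer dtype
X_out = check_array(X_int, copy=False, dtype=[np.float64, np.float32])
print(X_out is X_int)   # False: the int-to-float conversion copied anyway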
Tiny differences between fit().transform() and fit_transform()
You also asked why fit(X).transform(X) and fit_transform(X) return slightly different results. This has nothing to do with the copy parameter. The difference is within the error of double-precision arithmetic, and it comes from the following:
fit performs the SVD as X = U @ S @ V.T (where @ means matrix multiplication) and stores V in the components_ property.
transform multiplies the data by V.
fit_transform performs the same SVD as fit does, and then returns U @ S.
Mathematically, U @ S is the same as X @ V because V is an orthogonal matrix. But the errors of floating-point arithmetic result in tiny differences.
It makes sense that fit_transform does U @ S instead of X @ V; it's a simpler and more accurate multiplication to perform because S is diagonal. The reason fit does not do the same is that only V is stored, and in any case fit doesn't know that the matrix later passed to transform is the same one it was fitted on.
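A quick numeric check of that claim (a sketch; any centered matrix will do):
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
X -= X.mean(axis=0)   # PCA centers the data first

U, S, Vt = np.linalg.svd(X, full_matrices=False)
us = U * S            # same as U @ np.diag(S)
xv = X @ Vt.T
print(np.allclose(us, xv))      # True up to floating-point error
print(np.abs(us - xv).max())    # tiny, but generally not exactly zero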
I have a file that looks something like this:
some text
the grids are
3 x 3
more text
matrix marker 1 1
3 2 4
7 4 2
9 1 1
new matrix 2 4
9 4 1
1 3 4
4 3 1
new matrix 3 3
7 2 1
1 3 4
2 3 2
… the file continues, with several 3x3 matrices appearing in the same fashion. Each matrix is prefaced by text with a unique ID, though the IDs aren't particularly important to me. I want to build an array of these matrices. Can I use loadtxt to do that?
Here is my best attempt. The 6 in this code could be replaced with an iterating variable starting at 6 and incrementing by the number of rows in the matrix. I thought that skiprows would accept a list, but apparently it only accepts an integer.
np.loadtxt(fl, skiprows = [x for x in range(nlines) if x not in (np.array([1,2,3])+ 6)])
TypeError Traceback (most recent call last)
<ipython-input-23-7d82fb7ef14a> in <module>()
----> 1 np.loadtxt(fl, skiprows = [x for x in range(nlines) if x not in (np.array([1,2,3])+ 6)])
/usr/local/lib/python2.7/site-packages/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
932
933 # Skip the first `skiprows` lines
--> 934 for i in range(skiprows):
935 next(fh)
936
Maybe I misunderstand, but if you can match the lines preceding the 3x3 matrices, then you can create a generator to feed to loadtxt:
import numpy as np
def get_matrices(fs):
    # iterate over the file object directly; a bare "while True: next(fs)"
    # loop would raise at end of file (PEP 479)
    for line in fs:
        if 'matrix' in line:  # or whatever matches the line before a matrix
            yield next(fs)
            yield next(fs)
            yield next(fs)
with open('matrices.dat') as fs:
    g = get_matrices(fs)
    M = np.loadtxt(g)

M = M.reshape((M.size//9, 3, 3))
print(M)
If you feed it:
some text
the grids are
3 x 3
more text
matrix marker 1 1
3 2 4
7 4 2
9 1 1
new matrix 2 4
9 4 1
1 3 4
4 3 1
new matrix 3 3
7 2 1
1 3 4
2 3 2
new matrix 7 6
1 0 1
2 0 3
0 1 2
You get an array of the matrices:
[[[ 3. 2. 4.]
[ 7. 4. 2.]
[ 9. 1. 1.]]
[[ 9. 4. 1.]
[ 1. 3. 4.]
[ 4. 3. 1.]]
[[ 7. 2. 1.]
[ 1. 3. 4.]
[ 2. 3. 2.]]
[[ 1. 0. 1.]
[ 2. 0. 3.]
[ 0. 1. 2.]]]
Alternatively, if you just want to yield all lines that look like rows of a 3x3 matrix of integers, match them against a regular expression:
import re
def get_matrices(fs):
    for line in fs:
        if re.match(r'\d+\s+\d+\s+\d+', line):
            yield line
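Wiring this variant up looks the same as before (a sketch, assuming nothing else in the file looks like a row of three integers):
with open('matrices.dat') as fs:
    M = np.loadtxt(get_matrices(fs))
M = M.reshape((M.size//9, 3, 3))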
You need to change your processing workflow to use steps: first extract the substring corresponding to each desired matrix, then call numpy.loadtxt on it. A great way to do this is:
Find matrix start and end with re.
Load matrix within that range
Reset your range and continue.
Your matrix markers seem to vary, so you could use regular expressions like these:
start = re.compile(r"\w+\s+matrix\s+(\d+)\s+(\d+)\n")
end = re.compile(r"\n\n")
Then, you can find start/end pairs and then load the text for each matrix:
import io
import numpy as np
# read our data
data = open("/path/to/file.txt").read()
def load_matrix(data, *args):
    # find the start and end bounds of the next matrix
    s = start.search(data)
    if not s:
        # no matrix left over
        return None, data
    e = end.search(data, s.end())
    e_index = e.end() if e else len(data)
    # load the text between the bounds
    buf = io.StringIO(data[s.end(): e_index])
    matrix = np.loadtxt(buf, *args)  # add other loadtxt args here
    # return the matrix together with the unconsumed remainder of the data;
    # reassigning the local variable would not persist outside the function
    return matrix, data[e_index:]
Idea
In this case, my regular expression marker for the start of the matrix has capturing groups (\d+) for the matrix dimensions, so you can get the MxN shape of each matrix if you wish. I also search for lines with the word "matrix" in them, with arbitrary leading text and two whitespace-separated numbers at the end.
My match for the end is "\n\n", that is, two consecutive newlines (if you have Windows line endings, you may need to consider "\r" too).
Automating This
Now that we have a way to find a single matrix, all you need to do is iterate and populate a list of matrices while you still get matches.
matrices = []
# read our data
data = open("/path/to/file.txt").read()

while True:
    result, data = load_matrix(data)  # pass other loadtxt arguments as needed
    if result is None:
        break
    matrices.append(result)
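If all the matrices share the same shape (3x3 here), the list can then be stacked into a single 3-D array:
M = np.stack(matrices)   # shape: (number_of_matrices, 3, 3)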
Suppose I have
a = np.arange(9).reshape((3,3))
and I want to divide each row by a vector
n = np.array([1.1,2.2,3.3])
I tried the solution proposed in this question, but the fractional part is not taken into account.
I understand your question differently from the comments above:
import numpy as np
a = np.arange(12).reshape((4,3))
print(a)
n = np.array([[1.1,2.2,3.3]])
print(n)
print(a/n)
Output:
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]]
[[ 1.1 2.2 3.3]]
[[ 0. 0.45454545 0.60606061]
[ 2.72727273 1.81818182 1.51515152]
[ 5.45454545 3.18181818 2.42424242]
[ 8.18181818 4.54545455 3.33333333]]
I also changed the array from a square 3x3 matrix to a 4x3 one to point out that rows vs. columns matter. Also, the divisor is now explicitly a 1x3 row vector (note the double brackets).
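If the fractional part still seems to get lost (for example under Python 2's integer division rules, or with an integer divisor), forcing floating-point division is a safe sketch:
import numpy as np

a = np.arange(9).reshape((3, 3))
n = np.array([1.1, 2.2, 3.3])

print(a / n)                  # broadcasting divides each row of a by n
print(np.true_divide(a, n))   # floating-point division even for int inputs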