I have a numpy.float32 matrix V, and I divided it by an integer scalar:
V = V/num
where num is an integer. The outcome is somewhat surprising: V is converted to a numpy.float64 matrix.
Can anyone help understand why this is so?
Thanks!
According to numpy.result_type, numpy.float32 cannot hold an int32 losslessly. When the operation involves an int32, numpy promotes the result to float64. Also, as @Eric points out, the actual int type may differ between environments, so a quick pre-test is good practice to avoid surprises.
A previous similar question is suggested for further reading: Numpy casting float32 to float64.
Numpy treats purely scalar operations differently from operations involving an array. In this case, the division involves an ndarray, so when num is larger than 255 but smaller than 65536, numpy treats it as a 16-bit integer. Numpy determines that a 16-bit integer can be cast to float32 losslessly while int32 cannot: np.can_cast(np.int16, np.float32) gives True, but np.can_cast(np.int32, np.float32) gives False.
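The can_cast checks above can be run directly, along with a look at why int32 fails: float32 has a 24-bit significand, so 16-bit integers always fit exactly but integers above 2**24 may not survive the cast. A minimal sketch:

```python
import numpy as np

# int16 (range +-32767) fits in float32's 24-bit significand; int32 does not.
print(np.can_cast(np.int16, np.float32))  # True
print(np.can_cast(np.int32, np.float32))  # False

# 2**24 + 1 is the first positive integer float32 cannot represent exactly.
print(int(np.float32(2**24)))      # 16777216 - exact
print(int(np.float32(2**24 + 1)))  # 16777216 - rounded, the +1 is lost
```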
Thanks for the insightful comments under the question. This answer is a short summary of these comments.
Related
I am converting a NumPy array from a float dtype to an integer dtype. In the process, I want to cast values above the maximum value allowable by the dtype to that maximum. But for some reason that fails, and the conversion returns the minimum value. Here is code to reproduce (Python3, Numpy 1.22.2), with just numpy.inf as an example
float_array = numpy.array([[1, +numpy.inf], [2,2]])
dtype = numpy.dtype(numpy.int64)
cut_array = numpy.nan_to_num(float_array, posinf=numpy.iinfo(dtype).max)
int_array = cut_array.astype(dtype)
With this, int_array[0,1] equals -9223372036854775808.
Why is the representable maximum value (about 9.2e+18) actually not usable for dtype int64?
I tested a bit; a slightly smaller value than the max works, e.g. using posinf=numpy.iinfo(dtype).max - 600 leads to a correct conversion.
From the comments by Warren Weckesser and Tim Roberts:
since a double has only 53 bits of precision, it cannot represent every int64 value exactly, e.g.
int(float(9223372036854775807)) == 9223372036854775808
In this example, the float approximation of the original int value rounded it up, effectively adding 1 and pushing it one past the int64 maximum, so the cast back to int64 overflowed.
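The rounding can be reproduced in plain Python; a minimal sketch:

```python
# A float64 has a 53-bit significand, so 2**63 - 1 (the int64 maximum)
# rounds up to 2**63 when converted to float.
big = 2**63 - 1                # 9223372036854775807, numpy.iinfo(int64).max
print(int(float(big)))         # 9223372036854775808 == 2**63, one past the max

# Precision loss starts at 2**53 + 1, the first integer float64
# cannot represent exactly: both values round to the same float.
print(float(2**53) == float(2**53 + 1))  # True
```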
I'm encountering a problem with incorrect numpy calculations when the inputs to a calculation are a numpy array with a 32-bit integer data type, but the outputs include larger numbers that require 64-bit representation.
Here's a minimal working example:
arr = np.ones(5, dtype=int) * (2**24 + 300) # arr.dtype defaults to 'int32'
# Following comment from #hpaulj I changed the first line, which was originally:
# arr = np.zeros(5, dtype=int)
# arr[:] = 2**24 + 300
single_value_calc = 2**8 * (2**24 + 300)
numpy_calc = 2**8 * arr
print(single_value_calc)
print(numpy_calc[0])
# RESULTS
4295044096
76800
The desired output is that the numpy array contains the correct value of 4295044096, which requires 64 bits to represent. I.e. I would have expected numpy arrays to automatically upcast from int32 to int64 when the output requires it, rather than maintaining a 32-bit output and wrapping back to 0 once 2^32 is exceeded.
Of course, I can fix the problem manually by forcing int64 representation:
numpy_calc2 = 2**8 * arr.astype('int64')
but this is undesirable for general code, since the output will only need 64-bit representation (i.e. to hold large numbers) in some cases and not all. In my use case, performance is critical so forcing upcasting every time would be costly.
Is this the intended behaviour of numpy arrays? And if so, is there a clean, performant solution please?
Type casting and promotion in numpy is fairly complicated and occasionally surprising. This recent unofficial write-up by Sebastian Berg explains some of the nuances of the subject (mostly concentrating on scalars and 0d arrays).
Quoting from this document:
Python Integers and Floats
Note that python integers are handled exactly like numpy ones. They are, however, special in that they do not have a dtype associated with them explicitly. Value based logic, as described here, seems useful for python integers and floats to allow:
arr = np.arange(10, dtype=np.int8)
arr += 1
# or:
res = arr + 1
res.dtype == np.int8
which ensures that no upcast (for example with higher memory usage) occurs.
(emphasis mine.)
See also Allan Haldane's gist suggesting C-style type coercion, linked from the previous document:
Currently, when two dtypes are involved in a binary operation numpy's principle is that "the output dtype's range covers the range of both input dtypes", and when a single dtype is involved there is never any cast.
(emphasis again mine.)
So my understanding is that the promotion rules for numpy scalars and arrays differ, primarily because it's not feasible to check every element inside an array to determine whether casting can be done safely. Again from the former document:
Scalar based rules
Unlike arrays, where inspection of all values is not feasible, for scalars (and 0-D arrays) the value is inspected.
This would mean that you can either use np.int64 from the start to be safe (and if you're on linux then dtype=int will actually do this on its own), or check the maximum value of your arrays before suspect operations and determine if you have to promote the dtype yourself, on a case-by-case basis. I understand that this might not be feasible if you are doing a lot of calculations, but I don't believe there is a way around this considering numpy's current type promotion rules.
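One way to do that case-by-case check is to compare the array's actual extremes against the dtype's range before the suspect operation. A minimal sketch (the safe_multiply helper is illustrative, not a NumPy function):

```python
import numpy as np

def safe_multiply(arr, scalar):
    """Multiply by a scalar, upcasting to int64 only when the result
    could overflow the array's current integer dtype."""
    info = np.iinfo(arr.dtype)
    lo, hi = int(arr.min()), int(arr.max())
    # Worst-case products given the array's actual extremes.
    candidates = (lo * scalar, hi * scalar)
    if min(candidates) < info.min or max(candidates) > info.max:
        arr = arr.astype(np.int64)   # upcast only when needed
    return arr * scalar

small = np.array([1, 2, 3], dtype=np.int32)
big = np.array([2**24 + 300] * 3, dtype=np.int32)

print(safe_multiply(small, 2**8).dtype)  # int32: no upcast needed
print(safe_multiply(big, 2**8)[0])       # 4295044096: upcast to int64
```

The min/max scan is O(n) but much cheaper than an unconditional astype copy when the upcast usually turns out to be unnecessary.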
Why does Python not cast long numbers to numpy floats when doing something like
a = np.array([10.0, 56.0]) + long(10**47)
The dtype of the variable a is object. I did not expect this when, during a maximum likelihood optimization problem, one fit parameter B was an integer and thus 10**B became a long.
Is this due to fear of precision loss?
I suspect this is because python can store arbitrarily large integers, so numpy realizes that it can't safely cast the result to a known data type. Therefore, it falls back to treating the array as an array of python objects and performs the operation elementwise using python's rules (which here produces floats).
You can see what the result type is by using np.result_type:
>>> np.result_type(np.array([10.0, 56.0]), long(10**47))
dtype('O')
Based on the documentation for np.result_type what happens is:
First, np.min_scalar_type() is called on each of the inputs:
>>> np.min_scalar_type(np.array([10.0, 56.0]))
dtype('float64')
>>> np.min_scalar_type(long(10**47))
dtype('O')
Second, the result is determined by combining these types using np.promote_types:
>>> np.promote_types(np.float64, np.dtype('O'))
dtype('O')
Suppose I enter:
a = uint8(200)
a*2
Then the result is 400, and it is recast to be of type uint16.
However:
a = array([200],dtype=uint8)
a*2
and the result is
array([144], dtype=uint8)
The multiplication has been performed modulo 256, to ensure that the result stays in one byte.
I'm confused about "types" and "dtypes" and where one is used in preference to another. And as you see, the type may make a significant difference in the output.
Can I, for example, create a single number of dtype uint8, so that operations on that number will be performed modulo 256? Alternatively, can I create an array of type (not dtype) uint8 so that operations on it will produce values outside the range 0-255?
The simple, high-level answer is that NumPy layers a second type system atop Python's type system.
When you ask for the type of an NumPy object, you get the type of the container--something like numpy.ndarray. But when you ask for the dtype, you get the (numpy-managed) type of the elements.
>>> from numpy import *
>>> arr = array([1.0, 4.0, 3.14])
>>> type(arr)
<type 'numpy.ndarray'>
>>> arr.dtype
dtype('float64')
Sometimes, as when using the default float type, the element data type (dtype) is equivalent to a Python type. But that's equivalent, not identical:
>>> arr.dtype == float
True
>>> arr.dtype is float
False
In other cases, there is no equivalent Python type, for example when you specify uint8. Such values can still be managed by Python, but unlike in C, Rust, and other "systems languages," working with values that map directly onto machine data types (uint8 aligns closely with "unsigned byte" computations) is not the common use-case for Python.
So the big story is that NumPy provides containers like arrays and matrices that operate under its own type system. And it provides a bunch of highly useful, well-optimized routines to operate on those containers (and their elements). You can mix-and-match NumPy and normal Python computations, if you use care.
There is no Python type uint8. There is a constructor function named uint8, which when called returns a NumPy type:
>>> u = uint8(44)
>>> u
44
>>> u.dtype
dtype('uint8')
>>> type(u)
<type 'numpy.uint8'>
So "can I create an array of type (not dtype) uint8...?" No. You can't. There is no such animal.
You can do computations constrained to uint8 rules without using NumPy arrays, by using NumPy scalar values. E.g.:
>>> uint8(44 + 1000)
20
>>> uint8(44) + uint8(1000)
20
But if you want to compute values mod 256, it's probably easier to use Python's mod operator:
>>> (44 + 1000) % 256
20
Driving data values larger than 255 into uint8 data types and then doing arithmetic is a rather backdoor way to get mod-256 arithmetic. If you're not careful, you'll either cause Python to "upgrade" your values to full integers (killing your mod-256 scheme), or trigger overflow exceptions (because tricks that work great in C and machine language are often flagged by higher level languages).
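Seen side by side, uint8 array wrap-around and Python's mod operator give the same residues; a minimal sketch:

```python
import numpy as np

# uint8 array arithmetic silently wraps modulo 256...
a = np.array([44], dtype=np.uint8)
b = np.array([232], dtype=np.uint8)   # 1000 % 256 == 232
print(a + b)                          # [20], dtype stays uint8

# ...which matches Python's mod operator on plain ints.
print((44 + 1000) % 256)              # 20
```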
The type of a NumPy array is numpy.ndarray; this is just the type of Python object it is (similar to how type("hello") is str for example).
dtype just defines how bytes in memory will be interpreted by a scalar (i.e. a single number) or an array and the way in which the bytes will be treated (e.g. int/float). For that reason you don't change the type of an array or scalar, just its dtype.
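A quick way to see that converting data changes only the dtype, never the Python type; a minimal sketch:

```python
import numpy as np

a = np.array([1.5, 2.5])
b = a.astype(np.int32)    # converts the values; still an ndarray

print(type(a), a.dtype)   # <class 'numpy.ndarray'> float64
print(type(b), b.dtype)   # <class 'numpy.ndarray'> int32

# .view reinterprets the same bytes under a different dtype:
# one float64 value becomes eight uint8 bytes.
print(np.array([1.0]).view(np.uint8).shape)  # (8,)
```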
As you observe, if you multiply two scalars, the resulting datatype is the smallest "safe" type to which both values can be cast. However, multiplying an array and a scalar will simply return an array of the same datatype. The documentation for np.result_type is clear about when a particular scalar or array object's dtype is changed:
Type promotion in NumPy works similarly to the rules in languages like C++, with some slight differences. When both scalars and arrays are used, the array's type takes precedence and the actual value of the scalar is taken into account.
The documentation continues:
If there are only scalars or the maximum category of the scalars is higher than the maximum category of the arrays, the data types are combined with promote_types to produce the return value.
So for np.uint8(200) * 2, two scalars, the resulting datatype will be the type returned by np.promote_types:
>>> np.promote_types(np.uint8, int)
dtype('int32')
For np.array([200], dtype=np.uint8) * 2 the array's datatype takes precedence over the scalar int and a np.uint8 datatype is returned.
To address your final question about retaining the dtype of a scalar during operations, you'll have to restrict the datatypes of any other scalars you use to avoid NumPy's automatic dtype promotion:
>>> np.array([200], dtype=np.uint8) * np.uint8(2)
array([144], dtype=uint8)
The alternative, of course, is to simply wrap the single value in a NumPy array (and then NumPy won't cast it in operations with scalars of different dtype).
To promote the type of an array during an operation, you could wrap any scalars in an array first:
>>> np.array([200], dtype=np.uint8) * np.array([2])
array([400])
A numpy array contains elements of the same type, so np.array([200],dtype=uint8) is an array with one value of type uint8. When you do np.uint8(200), you don't have an array, only a single value. This makes a huge difference.
When performing an operation on the array, the dtype stays the same, irrespective of whether a single value overflows or not. Automatic upcasting of arrays is avoided because the size of the whole array would have to change; that is done only when the user explicitly asks for it. When performing an operation on a single value, it can easily upcast without influencing other values.
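The array behavior can be seen side by side: an array op a plain scalar keeps the array's dtype (and wraps), while combining two arrays of different dtypes promotes, since a new array is allocated anyway. A minimal sketch:

```python
import numpy as np

a = np.array([200], dtype=np.uint8)

# Array op scalar: the uint8 dtype is kept, so the product wraps mod 256.
wrapped = a * 2
print(wrapped, wrapped.dtype)             # [144] uint8

# Array op array: dtypes are promoted (uint8 with int64 -> int64),
# so the full result fits.
exact = a * np.array([2], dtype=np.int64)
print(exact, exact.dtype)                 # [400] int64
```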
[EDIT]
Okay, my test case was poorly thought out. I only tested on 1-D arrays, in which case I get a 64-bit scalar returned. If I do it on a 3-D array, I get the 32-bit result as expected.
I am trying to calculate the mean and standard deviation of a very large numpy array (600*600*4044) and I am close to the limit of my memory (16GB on a 64-bit machine). As such, I am trying to process everything as float32 rather than the default float64. However, any time I try to work on the data I get a float64 returned, even if I specify the dtype as float32. Why is this happening? Yes, I can convert afterwards, but as I said I am close to the limit of my RAM and I am trying to keep everything as small as possible, even during the processing step. Below is an example of what I am getting.
import scipy
a = scipy.ones((600,600,4044), dtype=scipy.float32)
print(a.dtype)
a_mean = scipy.mean(a, 2, dtype=scipy.float32)
a_std = scipy.std(a, 2, dtype=scipy.float32)
print(a_mean.dtype)
print(a_std.dtype)
Returns
float32
float32
float32
Note: This answer applied to the original question
You have to switch to 64 bit Python. According to your comments your object has size 5.7GB even with 32 bit floats. That cannot fit in 32 bit address space which is 4GB, at best.
Once you've switched to 64 bit Python I think you can stop worrying about intermediate values using 64 bit floats. In fact you can quite probably perform your entire calculation using 64 bit floats.
If you are already using 64 bit Python (and your comments confused me on the matter), then you simply do not need to worry about scipy.mean or scipy.std returning a 64 bit float. That's one single value out of ~1.5 billion values in your array. It's nothing to worry about.
Note: This answer applies to the new question
The code in your question produces the following output:
float32
float32
float32
In other words, the symptoms you report are not in fact representative of reality. The reason for the confusion is that your earlier code, to which my original answer applied, was quite different and operated on a one-dimensional array. It looks awfully like scipy returns scalars as float64. But when the return value is not a scalar, the data type is not transformed in the way you thought.
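The distinction is easy to check on a small array; a minimal sketch (tiny shapes standing in for the original 600*600*4044 array):

```python
import numpy as np

a = np.ones((6, 6, 40), dtype=np.float32)

# Reducing along one axis returns an array: the requested dtype sticks.
m3d = np.mean(a, axis=2, dtype=np.float32)
print(m3d.shape, m3d.dtype)   # (6, 6) float32

# Reducing a 1-D array to a single value returns a NumPy scalar;
# with an explicit dtype it is still float32.
m1d = np.mean(np.ones(10, dtype=np.float32), dtype=np.float32)
print(m1d.dtype)              # float32
```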
You can force the result back to float32:
a_mean = numpy.asarray(scipy.mean(a, 2, dtype=scipy.float32), dtype=scipy.float32)
I have tested it, so feel free to correct me if I'm wrong.
There is an out option: http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
a = scipy.ones(10, dtype=scipy.float32)
b = numpy.array(0,dtype=scipy.float32)
scipy.mean(a, dtype=scipy.float32, out=b)
Test:
In [35]: b = numpy.array(0, dtype=scipy.float32)
In [36]: b.dtype
Out[36]: dtype('float32')
In [37]: scipy.mean(a, dtype=scipy.float32, out=b)
Out[37]: array(1.0, dtype=float32)
In [38]: b
Out[38]: array(1.0, dtype=float32)