This question already has an answer here:
Different slices give different inequalities for same elements
(1 answer)
Closed 1 year ago.
import numpy as np
np.random.seed(2)
x = np.random.randn(1000000).astype('float32')
print(float(np.linalg.norm(x, keepdims=1)**2))
print(float(np.linalg.norm(x, keepdims=0)**2))
998428.125
998428.1084311157
Reproduced in Colab. Also, Colab outputs different values than my CPU:
998425.0625
998425.059075091
Removing **2, they match. Also reproduced with sum, haven't tried other methods.
Why this behavior? I can understand device dependence but keepdims seems buggy.
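This is because with keepdims=0 the result is a NumPy float32 scalar rather than an array, and squaring it with the Python integer 2 turns it into a float64 (at least in the NumPy version I'm using). With keepdims=1 the result is still an array (shape (1,) for this 1-D input), and an array keeps its float32 dtype when raised to a Python integer power.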
>>> print(
...     np.linalg.norm(x, keepdims=1).dtype,
...     np.linalg.norm(x, keepdims=0).dtype
... )
# Returns
float32 float32
>>> print(
...     (np.linalg.norm(x, keepdims=1) ** 2).dtype,
...     (np.linalg.norm(x, keepdims=0) ** 2).dtype
... )
# Returns
float32 float64
The version of numpy I'm using is 1.20.3.
I could not find an explanation of why this happens in the NumPy documentation. I think opening a GitHub issue in NumPy's repository might be a good idea.
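A minimal sketch of the difference, assuming only that NumPy is installed (the variable names and the smaller array are illustrative). The dtype of the squared scalar depends on the NumPy version: the 1.20.3 session above promotes it to float64, while other releases may keep float32, so it is printed here rather than asserted. Converting to float64 explicitly before squaring should make the two paths agree:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1000).astype(np.float32)

n_arr = np.linalg.norm(x, keepdims=True)  # array of shape (1,), dtype float32
n_sc = np.linalg.norm(x)                  # NumPy scalar, dtype float32

sq_arr = n_arr ** 2  # an array keeps its float32 dtype
sq_sc = n_sc ** 2    # scalar result dtype depends on the NumPy version
print(sq_arr.dtype, sq_sc.dtype)

# Choosing the dtype explicitly removes the ambiguity.
sq_arr64 = n_arr.astype(np.float64) ** 2
sq_sc64 = np.float64(n_sc) ** 2
print(sq_arr64[0] == sq_sc64)
```

If float32 is wanted throughout, squaring with np.float32(2) instead of the Python integer is another option that should keep both results in float32.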
I'm encountering a problem with incorrect numpy calculations when the inputs to a calculation are a numpy array with a 32-bit integer data type, but the outputs include larger numbers that require 64-bit representation.
Here's a minimal working example:
arr = np.ones(5, dtype=int) * (2**24 + 300) # arr.dtype defaults to 'int32'
# Following a comment from @hpaulj I changed the first line, which was originally:
# arr = np.zeros(5, dtype=int)
# arr[:] = 2**24 + 300
single_value_calc = 2**8 * (2**24 + 300)
numpy_calc = 2**8 * arr
print(single_value_calc)
print(numpy_calc[0])
# RESULTS
4295044096
76800
The desired output is that the numpy array contains the correct value of 4295044096, which requires 64 bits to represent. That is, I would have expected numpy arrays to automatically upcast from int32 to int64 when the output requires it, rather than keeping a 32-bit output and wrapping around once the value exceeds the 32-bit range.
Of course, I can fix the problem manually by forcing int64 representation:
numpy_calc2 = 2**8 * arr.astype('int64')
but this is undesirable for general code, since the output will only need 64-bit representation (i.e. to hold large numbers) in some cases and not all. In my use case, performance is critical so forcing upcasting every time would be costly.
Is this the intended behaviour of numpy arrays? And if so, is there a clean, performant solution please?
Type casting and promotion in numpy is fairly complicated and occasionally surprising. This recent unofficial write-up by Sebastian Berg explains some of the nuances of the subject (mostly concentrating on scalars and 0d arrays).
Quoting from this document:
Python Integers and Floats
Note that python integers are handled exactly like numpy ones. They are, however, special in that they do not have a dtype associated with them explicitly. Value based logic, as described here, seems useful for python integers and floats to allow:
arr = np.arange(10, dtype=np.int8)
arr += 1
# or:
res = arr + 1
res.dtype == np.int8
which ensures that no upcast (for example with higher memory usage) occurs.
(emphasis mine.)
See also Allan Haldane's gist suggesting C-style type coercion, linked from the previous document:
Currently, when two dtypes are involved in a binary operation numpy's principle is that "the output dtype's range covers the range of both input dtypes", and when a single dtype is involved there is never any cast.
(emphasis again mine.)
So my understanding is that the promotion rules for numpy scalars and arrays differ, primarily because it's not feasible to check every element inside an array to determine whether casting can be done safely. Again from the former document:
Scalar based rules
Unlike arrays, where inspection of all values is not feasable, for scalars (and 0-D arrays) the value is inspected.
This would mean that you can either use np.int64 from the start to be safe (and if you're on linux then dtype=int will actually do this on its own), or check the maximum value of your arrays before suspect operations and determine if you have to promote the dtype yourself, on a case-by-case basis. I understand that this might not be feasible if you are doing a lot of calculations, but I don't believe there is a way around this considering numpy's current type promotion rules.
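A minimal sketch of the "check the maximum first" approach described above, reusing the numbers from the question. The names `factor`, `needs_upcast` and `safe` are illustrative, and the check assumes non-negative values; a fuller version would also look at `arr.min()`:

```python
import numpy as np

arr = np.full(5, 2**24 + 300, dtype=np.int32)
factor = 2**8

# Plain multiplication keeps int32 and wraps around silently.
wrapped = factor * arr

# Compute the largest product with exact Python integers first.
limit = np.iinfo(arr.dtype).max
needs_upcast = int(arr.max()) * factor > limit

# Upcast only when the result would not fit in the current dtype.
safe = factor * (arr.astype(np.int64) if needs_upcast else arr)
print(wrapped[0], safe[0], safe.dtype)
```

The check costs one pass over the array for `arr.max()`, which is usually cheaper than upcasting every array unconditionally.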
I'm having a hard time learning Python array handling with numpy.
I have a .csv file in which one column contains unsigned integer data representing binary values from an analog-to-digital converter.
I would like to convert these unsigned integer values to a 12-bit binary representation using Python inside a Jupyter notebook.
I tried several ways of implementing it, but I still fail...
Here is my code:
import pandas as pd
df = pd.read_csv('my_adc_values.csv', delimiter=r'\s+', header=None, usecols=[19])
decimalValues = df.values
print(decimalValues.shape)
So far so good... I have all my ADC data column values in the decimalValues numpy array.
Now, I would like to iterate through the array and convert the integers in the array to a binary representation:
import numpy as np
# destination array of shape of source array
binaryValues = np.zeros(decimalValues.shape)
for i in range(len(decimalValues)):
    print(decimalValues[i])
    binaryValues[i]=(bin(decimalValues[i]))

print(binaryValues)
With this code I get the error message
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-10-890444040b2e> in <module>()
6 for i in range(len(decimalValues)):
7 print(decimalValues[i])
----> 8 binaryValues[i]=(bin(decimalValues[i]))
9
10 print(binaryValues)
TypeError: only integer scalar arrays can be converted to a scalar index
I tried several different solutions, but none of them worked. It seems as if I have a massive misunderstanding of numpy arrays.
I'm looking for a tip on how to solve the problem described above. I found some threads describing the error message. I suspected it had something to do with the shape of the source and destination arrays, so I initialized the destination array with the same shape as the source. It did not help...
Thank you,
Maik
Numpy is primarily for working with numeric data; it doesn't give you much benefit when you're working with strings. Numpy can convert integers to decimal or hexadecimal strings using the numpy.char.mod function, which utilises the old % string interpolation operator. Unfortunately, that doesn't support binary output. We can create a Numpy vectorized function that uses the standard Python format function to do the conversion. This is better than bin, since you don't get the leading '0b', and you can specify the minimum length.
import numpy as np
# Make some fake numeric data
nums = (1 << np.arange(1, 10)) - 1
print(nums)
# Convert to 12 bit binary strings
func = np.vectorize(lambda n: format(n, '012b'))
bins = func(nums)
print(bins)
output
[ 1 3 7 15 31 63 127 255 511]
['000000000001' '000000000011' '000000000111' '000000001111' '000000011111'
'000000111111' '000001111111' '000011111111' '000111111111']
Alternatively, do the conversion using plain Python. You can convert the result back to a Numpy array, if you really need that. This code uses the str.format method, rather than the format function used by the previous version.
bins = list(map('{:012b}'.format, nums))
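As another sketch, NumPy also has np.binary_repr, which takes a single integer and a width. It is not vectorized either, so it is called in a list comprehension here. The names `bins_repr`, `bins_fmt` and `back` are illustrative:

```python
import numpy as np

# Same fake numeric data as above
nums = (1 << np.arange(1, 10)) - 1

# Zero-padded 12-bit strings, one call per value
bins_repr = [np.binary_repr(int(n), width=12) for n in nums]

# Equivalent plain-Python formatting, for comparison
bins_fmt = [format(int(n), '012b') for n in nums]

# Converting the strings back to integers
back = [int(s, 2) for s in bins_repr]
print(bins_repr[:3], back[:3])
```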
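What is causing the error in your case is that you are applying the bin function to a row of the array (decimalValues[i] is itself an array), whereas it can only be applied to a single integer value. You need an extra for loop to iterate over the column values. The destination array also has to hold strings rather than floats, so it is created with dtype=object here. Try changing your code in this way:
binaryValues = np.empty(decimalValues.shape, dtype=object)
for i in range(len(decimalValues)):
    print(decimalValues[i])
    for j in range(decimalValues.shape[1]):
        binaryValues[i, j] = bin(decimalValues[i, j])
print(binaryValues)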
Let me know if it works!
This question already has answers here:
numpy.sum() giving strange results on large arrays
(4 answers)
Closed 5 years ago.
I am using NumPy with this code:
>>> import numpy as np
>>> a = np.arange(1, 100000001).sum()
>>> a
987459712
I expected the result to be something like
5000000050000000
I noticed that with up to five digits the result is OK.
Does anyone know what is happening?
regards
Numpy is not making a mistake here. This phenomenon is known as integer overflow.
x = np.arange(1,100000001)
print(x.sum()) # 987459712
print(x.dtype) # int32
The 32 bit integer type used in arange for the given input simply cannot hold 5000000050000000. At most it can take 2147483647.
If you explicitly use a larger integer or floating point data type you get the expected result.
a = np.arange(1, 100000001, dtype='int64').sum()
print(a) # 5000000050000000
a = np.arange(1.0, 100000001.0).sum()
print(a) # 5000000050000000.0
I suspect you are using Windows, where the data type of the result is a 32 bit integer (while for those using, say, Mac OS X or Linux, the data type is 64 bit). Note that 5000000050000000 % (2**32) = 987459712
Try using
a = np.arange(1, 100000001, dtype=np.int64).sum()
or
a = np.arange(1, 100000001).sum(dtype=np.int64)
P.S. Anyone not using Windows can reproduce the result as follows:
>>> np.arange(1, 100000001).sum(dtype=np.int32)
987459712
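The wrap-around can also be checked without allocating the array, using exact Python integers. This is only a sketch of the arithmetic, and the variable names are illustrative:

```python
n = 100_000_000

# Closed form for 1 + 2 + ... + n, computed exactly
exact = n * (n + 1) // 2

# Keep only the low 32 bits, as a 32-bit accumulator would
wrapped = exact % 2**32

# Reinterpret as a signed 32-bit value
if wrapped >= 2**31:
    wrapped -= 2**32

print(exact, wrapped)
```

The wrapped value matches the 987459712 reported in the question.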
I read the post is-floating-point-math-broken and understand why it happens, but I couldn't find a solution that helps me.
How can I do the correct subtraction?
Python version 2.6.6, Numpy version 1.4.1.
I have two numpy.ndarray objects, origin and new, each containing float32 values. I'm trying to use numpy.subtract to subtract them, but I get the following (odd) result:
>>> import numpy as np
>>> with open('base_R.l_BSREM_S.9.1_001.bin', 'r+') as fid:
...     origin = np.fromfile(fid, np.float32)
>>> with open('new_R.l_BSREM_S.9.1_001.bin', 'r+') as fid:
...     new = np.fromfile(fid, np.float32)
>>> diff = np.subtract(origin, new)
>>> origin[5184939]
0.10000000149011611938
>>> new[5184939]
0.00000000023283064365
>>> diff[5184939]
0.10000000149011611938
Also, when I subtract the two elements at index 5184939 directly, I get the same result as diff[5184939]:
>>> origin[5184939] - new[5184939]
0.10000000149011611938
But when I do the following I get this result:
>>> 0.10000000149011611938 - 0.00000000023283064365
0.10000000125728548
and that's not equal to diff[5184939].
How can the correct subtraction be done? (0.10000000125728548 is the one that I need.)
Please help, and thanks in advance.
You might add your Python and numpy versions to the question.
Differences can arise from np.float32 v np.float64 dtype, the default Python float type, as well as display standards. numpy uses different display rounding than the underlying Python.
The subtraction itself does not differ.
I can reproduce the 0.10000000125728548 value, which may also display as 0.1 (out to 8 decimals).
I'm not sure where the 0.10000000149011611938 comes from. That looks as though new[5184939] was identically 0, not just something small like 0.00000000023283064365.
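One possible explanation, offered as a sketch rather than a confirmed diagnosis of the question's data: near 0.1 the spacing between adjacent float32 values is about 7.5e-9, so subtracting something as small as 2.3e-10 in float32 rounds back to the original value. The scalar values below are stand-ins for origin[5184939] and new[5184939]. Doing the subtraction in float64 keeps the difference:

```python
import numpy as np

a32 = np.float32(0.1)
b32 = np.float32(2.3283064365e-10)

# float32 arithmetic: the small operand is below half the spacing near 0.1
d32 = a32 - b32
print(d32 == a32, np.spacing(a32))

# float64 arithmetic on the same stored values keeps the difference
d64 = np.float64(a32) - np.float64(b32)
print(repr(float(d64)))
```

For whole arrays the same idea would be origin.astype(np.float64) - new.astype(np.float64), at the cost of twice the memory.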
[EDIT]
Okay, my test case was poorly thought out. I only tested on 1-D arrays, in which case I get a 64-bit scalar returned. If I do it on a 3-D array, I get the 32-bit result as expected.
I am trying to calculate the mean and standard deviation of a very large numpy array (600*600*4044) and I am close to the limit of my memory (16GB on a 64-bit machine). As such I am trying to process everything as float32 rather than the default float64. However, any time I try to work on the data I get a float64 returned, even if I specify the dtype as float32. Why is this happening? Yes, I can convert afterwards, but as I said I am close to the limit of my RAM and I am trying to keep everything as small as possible, even during the processing step. Below is an example of what I am getting.
import scipy
a = scipy.ones((600,600,4044), dtype=scipy.float32)
print(a.dtype)
a_mean = scipy.mean(a, 2, dtype=scipy.float32)
a_std = scipy.std(a, 2, dtype=scipy.float32)
print(a_mean.dtype)
print(a_std.dtype)
Returns
float32
float32
float32
Note: This answer applied to the original question
You have to switch to 64 bit Python. According to your comments your object has size 5.7GB even with 32 bit floats. That cannot fit in 32 bit address space which is 4GB, at best.
Once you've switched to 64 bit Python I think you can stop worrying about intermediate values using 64 bit floats. In fact you can quite probably perform your entire calculation using 64 bit floats.
If you are already using 64 bit Python (and your comments confused me on the matter), then you simply do not need to worry about scipy.mean or scipy.std returning a 64 bit float. That's one single value out of ~1.5 billion values in your array. It's nothing to worry about.
Note: This answer applies to the new question
The code in your question produces the following output:
float32
float32
float32
In other words, the symptoms that you report are not in fact representative of reality. The reason for the confusion is that your earlier code, the code to which my original answer applied, was quite different and operated on a one-dimensional array. It looks awfully like scipy returns scalars as float64. But when the return value is not a scalar, the data type is not transformed in the way you thought.
You can force the base type to change:
a_mean = numpy.array( scipy.mean(a, dtype=scipy.float32) , dtype = scipy.float32 )
I have not tested it, so feel free to correct me if I'm wrong.
There is an out option: http://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
a = scipy.ones(10, dtype=scipy.float32)
b = numpy.array(0,dtype=scipy.float32)
scipy.mean(a, dtype=scipy.float32, out=b)
Test :
In [34]: b= numpy.array(0)
In [35]: b= numpy.array(0,dtype = scipy.float32)
In [36]: b.dtype
Out[36]: dtype('float32')
In [37]: scipy.mean(a, dtype=scipy.float32, out = numpy.array(b) )
Out[37]: 1.0
In [38]: b
Out[38]: array(0.0, dtype=float32)
In [39]:
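Note that In [37] passes numpy.array(b), which makes a copy of b, and that would explain why b still shows 0.0 afterwards. A minimal sketch of the same idea with the array passed directly, using NumPy's own functions instead of the scipy aliases (an assumption about what was intended; the names follow the example above):

```python
import numpy as np

a = np.ones(10, dtype=np.float32)
b = np.array(0, dtype=np.float32)  # 0-d float32 array to receive the result

# Pass b itself so the result is written into it
np.mean(a, dtype=np.float32, out=b)
print(b, b.dtype)

# For a reduction along one axis, out needs the shape of the reduced result
a3 = np.ones((4, 5, 6), dtype=np.float32)
m = np.empty((4, 5), dtype=np.float32)
np.mean(a3, axis=2, dtype=np.float32, out=m)
print(m.dtype, m.shape)
```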