Using numpy where functions - python

I am trying to understand the behavior of the following piece of code:
import numpy as np
theta = np.arange(0,1.1,0.1)
prior_theta = 0.7
prior_prob = np.where(theta == prior_theta)
print(prior_prob)
However, if I explicitly give the datatype, the where function works as expected:
import numpy as np
theta = np.arange(0,1.1,0.1,dtype = np.float32)
prior_theta = 0.7
prior_prob = np.where(theta == prior_theta)
print(prior_prob)
This seems like a data type comparison issue. Any insight into this would be very helpful.

This is just how floating point numbers work: you can't rely on exact comparisons. The number 0.7 cannot be represented exactly in binary -- it is an infinitely repeating fraction. arange has to compute 0.1+0.1+0.1+0.1 etc., and the round-off errors accumulate, so the value at index 7 is not exactly the literal 0.7. The rounding works out differently for float32, so in that case you happened to get lucky.
You need to get in the habit of using "close enough" comparisons, like where(np.abs(theta-prior_theta) < 0.0001).
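Applied to the code in the question, a tolerance-based lookup might look like this (a minimal sketch; the 1e-8 tolerance is an arbitrary choice of mine):
import numpy as np

theta = np.arange(0, 1.1, 0.1)
prior_theta = 0.7
# exact equality fails because of accumulated round-off; compare within a tolerance instead
prior_prob = np.where(np.abs(theta - prior_theta) < 1e-8)
print(prior_prob)   # (array([7]),)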

np.isclose (and np.allclose) is useful when testing floats.
In [240]: theta = np.arange(0,1.1,0.1)
In [241]: theta
Out[241]: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
In [242]: theta == 0.7
Out[242]:
array([False, False, False, False, False, False, False, False, False,
False, False])
np.arange warns us about using float increments - read the warnings section.
In [243]: theta.tolist()
Out[243]:
[0.0,
0.1,
0.2,
0.30000000000000004,
0.4,
0.5,
0.6000000000000001,
0.7000000000000001,
0.8,
0.9,
1.0]
In [244]: np.isclose(theta, 0.7)
Out[244]:
array([False, False, False, False, False, False, False, True, False,
False, False])
In [245]: np.nonzero(np.isclose(theta, 0.7))
Out[245]: (array([7]),)
The arange docs suggest using np.linspace, but that's more to address the endpoint issue, which you've already handled with the 1.1 stop value. The 0.7 value comes out the same either way.
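A quick check (mine, not part of the original answer) confirming that np.linspace runs into the same 0.7 issue, since it also builds its values from a binary 0.1 step:
import numpy as np

theta = np.linspace(0, 1, 11)               # handles the endpoint cleanly, but...
print(theta[7])                             # 0.7000000000000001 (exact digits may vary by version)
print(theta[7] == 0.7)                      # False
print(np.nonzero(np.isclose(theta, 0.7)))   # (array([7]),)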

Find indices of element in 2D array

I have a piece of code below that calculates the maximum value of an array. It then calculates a value for 90% of the maximum, finds the closest value to this in the array as well as its corresponding index.
I need to ensure that I am finding the closest value to 90% that occurs only before the maximum. Can anyone help with this please? I was thinking about maybe compressing the array after the maximum has occurred but then each array I use will be a different size and that will be difficult later on.
import numpy as np
#make amplitude arrays
amplitude=[0,1,2,3, 5.5, 6,5,2,2, 4, 2,3,1,6.5,5,7,1,2,2,3,8,4,9,2,3,4,8,4,9,3]
#split arrays up into a line for each sample
traceno=5 #number of traces in file
samplesno=6 #number of samples in each trace. This wont change.
amplitude_split=np.array(amplitude, dtype=np.int).reshape((traceno,samplesno))
#find max value of trace
max_amp=np.amax(amplitude_split,1)
#find index of max value
ind_max_amp=np.argmax(amplitude_split, axis=1, out=None)
#find 90% of max value of trace
amp_90=np.amax(amplitude_split,1)*0.9
# find the indices of the min absolute difference
indices_90 = np.argmin(np.abs(amplitude_split - amp_90[:, None]), axis=1)
print("indices for 90 percent are", + indices_90)
Use a mask to set the values after the maximum (including the maximum?) to a known 'too high' value. Then argmin will return the index of the minimum difference in the 'valid' area of each row.
# Create a mask marking, in each row, where the value equals that row's maximum;
# max_amp needs an extra dimension so it broadcasts against each row.
mask = np.equal(amplitude_split, max_amp[:, None])
# Cumsum the mask to set all elements in a row after the first True to True
mask[:] = mask.cumsum(axis = 1)
mask
# array([[False, False, False, False, False,  True],
#        [ True,  True,  True,  True,  True,  True],
#        [False, False, False,  True,  True,  True],
#        [False, False, False, False,  True,  True],
#        [False, False, False, False,  True,  True]])
# Set inter to the absolute difference from each row's 90% value.
inter = np.abs(amplitude_split - amp_90[:, None])
# Set the max and everything after it to a suitably high value
# (max_amp.max(), i.e. 9 here, works because all remaining differences are smaller).
inter[mask] = max_amp.max()
inter # Where the mask is True, inter == 9.
# array([[5.4, 4.4, 3.4, 2.4, 0.4, 9. ],
#        [9. , 9. , 9. , 9. , 9. , 9. ],
#        [5.3, 0.3, 1.3, 9. , 9. , 9. ],
#        [6.1, 5.1, 0.1, 4.1, 9. , 9. ],
#        [5.1, 4.1, 0.1, 4.1, 9. , 9. ]])
# Find the indices of the minimum in each row
np.argmin(inter, axis = 1)
# array([4, 0, 1, 2, 2])
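Putting the answer's steps together, here's a self-contained sketch (the function name index_90_before_max is mine, and np.inf is used as the 'too high' sentinel; for a row whose maximum sits at position 0, like row 1, there is nothing before the max and argmin simply returns 0):
import numpy as np

def index_90_before_max(rows):
    # For each row: index of the value closest to 90% of the row maximum,
    # considering only positions before the first occurrence of the maximum.
    rows = np.asarray(rows, dtype=float)
    target = 0.9 * rows.max(axis=1)
    # True from the first occurrence of the row max onwards
    mask = np.equal(rows, rows.max(axis=1)[:, None]).cumsum(axis=1).astype(bool)
    diff = np.abs(rows - target[:, None])
    diff[mask] = np.inf              # exclude the max and everything after it
    return np.argmin(diff, axis=1)

amplitude = [0,1,2,3,5.5,6, 5,2,2,4,2,3, 1,6.5,5,7,1,2, 2,3,8,4,9,2, 3,4,8,4,9,3]
print(index_90_before_max(np.array(amplitude).reshape(5, 6)))   # [4 0 1 2 2]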

Masked `np.nan` in the `np.ma.array` problem in jupyter

Let's run the following Python 3 NumPy code in Anaconda Jupyter:
y = np.ma.array(np.matrix([[np.nan, 2.0]]), mask=[0, 1])
m = (y < 0.01)
and we have the warning: /.../anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in less.
Substituting np.nan with 1.0 etc. --- no warning.
Why can't the np.nan be masked and then compared?
np.ma has several strategies for implementing methods:
1) evaluate the method on y.data, and make a new ma with y.mask. It may suppress any runtime warnings.
2) evaluate the method on y.filled() # with the default fill value
3) evaluate the method on y.filled(1) # or some other innocuous value
4) evaluate the method on y.compressed()
5) evaluate the method on y.data[~y.mask]
Multiplication, for example, uses filled(1), and addition uses filled(0).
It appears that the comparisons are done with 1).
I haven't studied the ma code in detail, but I don't think it does 5).
If you are using ma just to avoid the runtime warning, there are some alternatives.
there's a collection of np.nan* functions (np.nansum, np.nanmax, etc.) that filter out nan before calculating
there are ways of suppressing runtime warnings
ufuncs have a where parameter that can be used to skip some elements; use it with an out parameter to define the values at the skipped positions (see the sketch after this list)
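A rough sketch of those alternatives (my own illustration, not code from the answer):
import numpy as np

a = np.array([np.nan, 0.0, 2.0])

# 1) nan-aware reductions skip NaN instead of propagating it
print(np.nanmax(a))                  # 2.0

# 2) suppress the floating-point warning locally
with np.errstate(invalid='ignore'):
    m = a > 1.0                      # NaN compares as False, no warning
print(m)                             # [False False  True]

# 3) ufunc with where= and out=, so the NaN element is never compared
out = np.zeros(a.shape, dtype=bool)
np.greater(a, 1.0, where=~np.isnan(a), out=out)
print(out)                           # [False False  True]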
===
Looking at np.ma.core.py I see functions like ma.less.
In [857]: y = np.ma.array([np.nan, 0.0, 2.0], mask=[1, 0, 0])
In [858]: y >1.0
/usr/local/bin/ipython3:1: RuntimeWarning: invalid value encountered in greater
#!/usr/bin/python3
Out[858]:
masked_array(data=[--, False, True],
mask=[ True, False, False],
fill_value=True)
In [859]: np.ma.greater(y,1.0)
Out[859]:
masked_array(data=[--, False, True],
mask=[ True, False, False],
fill_value=True)
Looking at the code, ma.less and such are a MaskedBinaryOperation class, and use 1) - evaluate on the data with
np.seterr(divide='ignore', invalid='ignore')
The result mask is a logical combination of the arguments' masks.
https://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html#operations-on-masked-arrays
Making the issue simpler, let's assume:
y = np.ma.array([np.nan, 0.0, 2.0], mask=[1, 0, 0])
m = (y > 1.0)
print(y, y.shape) ; print(y[m], y[m].shape, m.shape)
and the output is:
[-- 0.0 2.0] (3,)
[2.0] (1,) (3,)
with the RuntimeWarning: /.../anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in greater.
Changing:
...
m = (y != 2.0)
...
We get:
[-- 0.0 2.0] (3,)
[-- 0.0] (2,) (3,)
so we have a masked element and the result without any RuntimeWarning.
Changing now:
...
m = y.mask.copy() ; y[np.isnan(y)] = 9.0 ; y.mask = m ; m = (y > 1.0)
...
We get (without RuntimeWarning):
[-- 0.0 2.0] (3,)
[-- 2.0] (2,) (3,)
This workaround is, however, strange (setting an arbitrary value in place of np.nan and saving the mask). Comparing something with masked should always be masked, shouldn't it?
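For what it's worth, a tidier version of that workaround (my suggestion, not from the thread) is np.ma.fix_invalid, which masks the NaN and also replaces it in the underlying data, so the comparison never sees it:
import numpy as np

y = np.ma.array([np.nan, 0.0, 2.0], mask=[1, 0, 0])
y = np.ma.fix_invalid(y)     # NaN is masked and replaced by the fill value in .data
m = (y > 1.0)                # no RuntimeWarning this time
print(m)                     # [-- False True]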

Is there a concise way to produce this dataframe mask?

I have a pandas DataFrame with a model score and order_amount_bucket. There are 8 bins in the order amount bucket and I have a different threshold for each bin. I want to filter the frame and produce a boolean mask showing which rows pass.
I can do this by exhaustively listing the conditions but I feel like there must be a more pythonic way to do this.
A small example of how I have made this work so far (with only 3 bins for simplicity).
import pandas as pd
sc = 'score'
amt = 'order_amount_bucket'
example_data = {sc: [0.5, 0.8, 0.99, 0.95, 0.8, 0.8],
                amt: [1, 2, 2, 2, 3, 1]}
thresholds = [0.7, 0.8, 0.9]
df = pd.DataFrame(example_data)
# the exhaustive method to create the pass mask
# is there a better way to do this part?
pass_mask = (((df[amt] == 1) & (df[sc] < thresholds[0]))
             | ((df[amt] == 2) & (df[sc] < thresholds[1]))
             | ((df[amt] == 3) & (df[sc] < thresholds[2]))
             )
pass_mask.values
>> array([ True, False, False, False, True, False])
You could convert thresholds to a dict and use Series.map:
d = dict(enumerate(thresholds, 1))
# d: {1: 0.7, 2: 0.8, 3: 0.9}
pass_mask = df['order_amount_bucket'].map(d) > df['score']
[out]
print(pass_mask.values)
array([ True, False, False, False, True, False])
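For reference, the same approach stitched into one runnable snippet:
import pandas as pd

sc, amt = 'score', 'order_amount_bucket'
df = pd.DataFrame({sc: [0.5, 0.8, 0.99, 0.95, 0.8, 0.8],
                   amt: [1, 2, 2, 2, 3, 1]})
thresholds = [0.7, 0.8, 0.9]

d = dict(enumerate(thresholds, 1))   # bucket -> threshold, {1: 0.7, 2: 0.8, 3: 0.9}
pass_mask = df[amt].map(d) > df[sc]
print(pass_mask.values)              # [ True False False False  True False]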

How to get a list of decimals in python 3

No, this is not a duplicate and the link above is specifically what I was referring to as not the correct answer. That link, and my post here specifically ask about producing a Decimal list. But the "answer" produces a float list.
The correct answer is to use Decimal parameters (constructed from strings, not floats) with np.arange, as in
x_values = np.arange(Decimal('-2.0'), Decimal('2.0'), Decimal('0.1'))
Thanks https://stackoverflow.com/users/2084384/boargules
I believe this may be answered elsewhere, but the answers I've found seem wrong. I want a list of decimals (precision = 1 decimal place) from -2 to 2.
-2, -1.9, -1.8 ... 1.8, 1.9, 2.0
When I do:
import numpy as np
x_values = np.arange(-2,2,0.1)
x_values
I get:
array([ -2.00000000e+00, -1.90000000e+00, -1.80000000e+00, ...
I tried:
from decimal import getcontext, Decimal
getcontext().prec = 2
x_values = [x for x in np.around(np.arange(-2, 2, .1), 2)]
x_values2 = [Decimal(x) for x in x_values]
x_values2
I get:
[Decimal('-2'),
Decimal('-1.899999999999999911182158029987476766109466552734375'),
Decimal('-1.8000000000000000444089209850062616169452667236328125'), ...
I'm running 3.6.3 in jupyter notebook.
Update: I changed the ranges from 2 to 2.0. This improved the result, but I still get a rounding error:
import numpy as np
x_values = np.arange(-2.0, 2.0, 0.1)
x_values
Which produces:
-2.00000000e+00, -1.90000000e+00, -1.80000000e+00, ...
1.00000000e-01, 1.77635684e-15, 1.00000000e-01, ...
1.80000000e+00, 1.90000000e+00
Note 1.77635684e-15 may be an incredibly small number, but it's NOT zero. A test for zero will fail. Therefore the output is wrong.
My response to the duplicate assertion: as you can see from my results, the answer at How to use a decimal range() step value? does not produce the same results I'm seeing with a different range. Specifically, floats are still being returned and not rounded, and 1.77635684e-15 is not equal to zero.
The discussion and the suggested duplicate dance around a simple solution:
In [177]: np.arange(Decimal('-2.0'), Decimal('2.0'), Decimal('0.1'))
Out[177]:
array([Decimal('-2.0'), Decimal('-1.9'), Decimal('-1.8'), Decimal('-1.7'),
Decimal('-1.6'), Decimal('-1.5'), Decimal('-1.4'), Decimal('-1.3'),
Decimal('-1.2'), Decimal('-1.1'), Decimal('-1.0'), Decimal('-0.9'),
Decimal('-0.8'), Decimal('-0.7'), Decimal('-0.6'), Decimal('-0.5'),
Decimal('-0.4'), Decimal('-0.3'), Decimal('-0.2'), Decimal('-0.1'),
Decimal('0.0'), Decimal('0.1'), Decimal('0.2'), Decimal('0.3'),
Decimal('0.4'), Decimal('0.5'), Decimal('0.6'), Decimal('0.7'),
Decimal('0.8'), Decimal('0.9'), Decimal('1.0'), Decimal('1.1'),
Decimal('1.2'), Decimal('1.3'), Decimal('1.4'), Decimal('1.5'),
Decimal('1.6'), Decimal('1.7'), Decimal('1.8'), Decimal('1.9')],
dtype=object)
Giving float values to Decimal does not work well:
In [180]: np.arange(Decimal(-2.0), Decimal(2.0), Decimal(0.1))
Out[180]:
array([Decimal('-2'), Decimal('-1.899999999999999994448884877'),
Decimal('-1.799999999999999988897769754'),
Decimal('-1.699999999999999983346654631'),
because Decimal(0.1) just solidifies the floating point imprecision of 0.1:
In [178]: Decimal(0.1)
Out[178]: Decimal('0.1000000000000000055511151231257827021181583404541015625')
Suggested duplicate: How to use a decimal range() step value?
From the numpy docs:
import numpy as np
np.set_printoptions(suppress=True)
will make NumPy "always print floating point numbers using fixed point notation, in which case numbers equal to zero in the current precision will print as zero".
In[2]: import numpy as np
In[3]: np.array([1/50000000])
Out[3]: array([2.e-08])
In[4]: np.set_printoptions(suppress=True)
In[5]: np.array([1/50000000])
Out[5]: array([0.00000002])
In[6]: np.set_printoptions(precision=6)
In[7]: np.array([1/50000000])
Out[7]: array([0.])
In[8]: x_values = np.arange(-2,2,0.1)
In[9]: x_values
Out[9]:
array([-2. , -1.9, -1.8, -1.7, -1.6, -1.5, -1.4, -1.3, -1.2, -1.1, -1. ,
-0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0. , 0.1,
0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2,
1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])
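If the real goal is just floats that print and compare cleanly at one decimal place, another common workaround (not from the answers above) is to generate integers and divide, so each element is the closest float to the intended one-decimal value rather than an accumulated sum:
import numpy as np

x_values = np.arange(-20, 21) / 10      # -2.0, -1.9, ..., 1.9, 2.0
print(x_values[20])                     # 0.0 exactly, no 1.77635684e-15 residue
print(x_values[27])                     # 0.7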

Why is x <= x false?

The title is a bit misleading, because it's not exactly x and x, it's x and 0.3; however, the values should be the same.
I have:
arr = np.arange(0, 1.1, 0.1)
and I receive:
arr[arr <= 0.3]
> array([0., 0.1, 0.2])
The correct result should be:
arr[arr <= 0.3]
> array([0., 0.1, 0.2, 0.3])
I have not yet stumbled upon this problem. I know it is related to floating point precision ... but what can I do here?
Don't rely on comparing floats for equality (unless you know exactly what floats you are dealing with).
Since you know the stepsize used to generate the array is 0.1,
arr = np.arange(0, 1.1, 0.1)
you could increase the threshold value, 0.3, by half the stepsize to find a new threshold which is safely between values in arr:
In [48]: stepsize = 0.1; arr[arr < 0.3+(stepsize/2)]
Out[48]: array([ 0. , 0.1, 0.2, 0.3])
By the way, the 1.1 in np.arange(0, 1.1, 0.1) is an application of the same idea -- given the vagaries of floating-point arithmetic, we couldn't be sure that 1.0 would be included if we wrote np.arange(0, 1.0, 0.1), so the right endpoint was increased by the stepsize.
Fundamentally, the problem boils down to floating-point arithmetic being inaccurate:
In [17]: 0.1+0.2 == 0.3
Out[17]: False
So the fourth value in the array is a little bit greater than 0.3.
In [40]: arr = np.arange(0,1.1, 0.1)
In [41]: arr[3]
Out[41]: 0.30000000000000004
Note that rounding may not be a viable solution. For example,
if arr has dtype float128:
In [53]: arr = np.arange(0, 1.1, 0.1, dtype='float128')
In [56]: arr[arr.round(1) <= 0.3]
Out[56]: array([ 0.0, 0.1, 0.2], dtype=float128)
Although making the dtype float128 made arr[3] closer to the decimal 0.3,
In [54]: arr[3]
Out[54]: 0.30000000000000001665
now rounding does not produce a number less than 0.3:
In [55]: arr.round(1)[3]
Out[55]: 0.30000000000000000001
Unutbu points out the main problem: you should avoid comparing floating point numbers for equality, as they carry round-off error.
However, this is a problem many people come across, so there is a function that helps you get around it: np.isclose. In your case this would lead to:
arr[np.logical_or(arr <= 0.3, np.isclose(0.3, arr))]
>>> array([0., 0.1, 0.2, 0.3])
In this case this might not be the best option, but it might be helpful to know about this function.
Sidenote:
In case nobody has ever explained to you why this happens: computers store everything in binary, but 0.1 is a periodic number in binary, which means the computer can't store all of its digits (there are infinitely many). The equivalent in decimal would be:
1/3+1/3+1/3 = 0.33333 + 0.33333 + 0.33333 = 0.99999
which is not 1.
