Realistic float value for "about zero" - python

I'm working on a program with fairly complex numerics, mostly in numpy with complex datatypes. Some of the calculations are returning arrays with complex components that are almost zero. For example:
(2 + 0j, 3+0j, 4+3.9320340202e-16j)
Clearly the third component is basically 0, but for whatever reason this is the output of my calculation, and it turns out that for some of these nearly zero values np.iscomplex() returns True. Rather than dig through all of that code, I think it's sensible to just apply a cutoff. My question is, what is a sensible cutoff below which anything should be considered zero? 0.00? 0.000000? etc...
I understand that these values are due to rounding errors in floating point math, and just want to handle them sensibly. What is the tolerance/range one allows for such precision error? I'd like to set it to a parameter:
ABOUTZERO=0.000001

As others have commented, what constitutes 'almost zero' really does depend on your particular application, and how large you expect the rounding errors to be.
If you must use a hard threshold, a sensible value might be the machine epsilon, which is defined as the upper bound on the relative error due to rounding for floating point operations. Intuitively, it is the smallest positive number that, when added to 1.0, gives a result >1.0 using a given floating point representation and rounding method.
In numpy, you can get the machine epsilon for a particular float type using np.finfo:
import numpy as np
print(np.finfo(float).eps)
# 2.22044604925e-16
print(np.finfo(np.float32).eps)
# 1.19209e-07
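
For example, here is one way to apply such a cutoff to the array from the question (np.real_if_close is a built-in helper whose tol argument is a multiple of eps; the explicit ABOUTZERO comparison below is just an illustration, not the only way):
import numpy as np

a = np.array([2 + 0j, 3 + 0j, 4 + 3.9320340202e-16j])

# Built-in option: return the real part if every imaginary component
# is within tol * eps of zero.
print(np.real_if_close(a, tol=1000))
# [2. 3. 4.]

# Manual option with an explicit cutoff parameter. The result keeps the
# complex dtype, but the tiny imaginary parts are now exactly 0.
ABOUTZERO = 1e-12
cleaned = np.where(np.abs(a.imag) < ABOUTZERO, a.real, a)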

Related

numpy.linalg.det returns very small numbers instead of 0

I calculated the determinant of a matrix using np.linalg.det(matrix), but it returns weird values. For example, it gives 1.1012323e-16 instead of 0.
Of course, I can round the result with numpy.around, but is there any option to set some "default" rounding for results of all numpy methods, including numpy.linalg.det?
The determinant looking "weird" is due to floating point arithmetic; tiny nonzero values like this are just accumulated rounding error, and you can read up on how floating point works for the details.
Regarding your question, I believe numpy.set_printoptions is what you are looking for. Please see the docs.
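As a minimal sketch of that display-only approach (the singular matrix below is just an example; note that the stored value of the determinant is unchanged):
import numpy as np

np.set_printoptions(suppress=True, precision=8)  # affects how arrays are printed

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])   # singular, so the determinant should be 0
d = np.linalg.det(m)              # typically something like 6.7e-16 rather than 0.0
print(np.array([d]))              # displays as [0.] (or [-0.]) under these options
print(np.around(d, decimals=12))  # or round the value itself: 0.0 / -0.0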

overflow encountered in expm1

The following code is from a reference notebook for the Kaggle House Price Prediction competition:
X=train_df.drop(['SalePrice'],axis=1)
y=train_df.SalePrice
X_pwr=power_transformer.fit_transform(X)
test_std=std_scaler.fit_transform(test_df)
test_rbst=rbst_scaler.fit_transform(test_df)
test_pwr=power_transformer.fit_transform(test_df)
gb_reg = GradientBoostingRegressor(n_estimators=1792,
                                   learning_rate=0.01005, max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=14, loss='huber', random_state=42)
gb_reg.fit(X_pwr, y)
y_head=gb_reg.predict(X_test)
test_pred_gb=gb_reg.predict(test_pwr)
test_pred_gb=pd.DataFrame(test_pred_gb,columns=['SalePrice'])
test_pred_gb.SalePrice = np.floor(np.expm1(test_pred_gb.SalePrice))
sample_sub.iloc[:,1] = (0.5 * test_pred_gb.iloc[:,0]) + (0.5 * old_prediction.iloc[:,1])
# here old_prediction is the sample prediction given by Kaggle
I want to know the reason for the line that applies np.expm1. Why are they exponentiating the predicted values?
Also, that line gives a runtime warning: overflow encountered in expm1. I would also like to know how to solve this overflow problem, because after this step all the SalePrice values are NaN.
For the first question, it is hard to say without seeing more code, though I doubt there is a good reason, as the numbers you are feeding np.expm1 are apparently large (which makes sense if they're the sale prices of houses). This brings me to the second question:
expm1 is a special function for computing exp(x) - 1. It returns greater precision for very small x than just using exp(x) - 1. I don't know the exact way in which numpy performs the calculation, though typically it is done with a Taylor series: you start with the Taylor series for exp(x) and move the initial term of 1 over to the other side to get exp(x) - 1 as a large polynomial sum of terms. The polynomial contains things like x^n and n!, where n is the number of terms the polynomial is taken to (i.e. the level of precision). For large x the numbers get unwieldy pretty quickly, and in any case exp(x) - 1 itself soon exceeds the largest value a 64-bit float can represent (about 1.8e308, which exp(x) reaches near x ≈ 709.8). To show this, just try the following:
import numpy as np
import warnings
warnings.filterwarnings('error')
for i in range(200000):
    try:
        np.expm1(i)
    except Warning:
        print(i)
        break
Which, on my system, prints 710. As a workaround, you may be able to get away with making the large numbers small (e.g. a price of $200,000 becomes 0.2 mega-dollars).
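If the intent of that expm1 line was to undo a log transform of the target (a common pattern in House Price notebooks, though it is not visible in the snippet above, so this is an assumption), the fix is to fit the model on np.log1p(y) so that the values handed to np.expm1 stay small. A rough sketch with made-up prices:
import numpy as np

y = np.array([200000.0, 350000.0, 125000.0])  # hypothetical sale prices
y_log = np.log1p(y)        # roughly 12.2, 12.8, 11.7 -- safe inputs for expm1
# ... fit the regressor on y_log instead of y ...
pred_log = y_log           # stand-in for the model's predictions on the log scale
pred = np.expm1(pred_log)  # back to (essentially) the original prices, no overflow
print(pred)
# [200000. 350000. 125000.]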

Safety of taking `int(numpy.sqrt(N))`

Let's say I'm considering M=N**2 where N is an integer. It appears that numpy.sqrt(M) returns a float (actually numpy.float64).
I could imagine that there could be a case where it returns, say, N-10**(-16) due to numerical precision issues, in which case int(numpy.sqrt(M)) would be N-1.
Nevertheless, my tests have N==numpy.sqrt(M) returning True, so it looks like this approximation isn't happening.
Is it safe for me to assume that int(numpy.sqrt(M)) is indeed accurate when M is a perfect square? If so, for bonus, what's going on in the background that makes it work?
To avoid missing the integer by 1e-15, you could use:
int(numpy.sqrt(M)+0.5)
or
int(round(numpy.sqrt(M)))
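A small sketch of the second option wrapped in a helper (int_sqrt is just an illustrative name):
import numpy as np

def int_sqrt(M):
    # Round to the nearest integer instead of truncating, so a result
    # like N - 1e-16 still maps back to N rather than N - 1.
    return int(round(np.sqrt(M)))

print(int_sqrt(12345**2))
# 12345
# For exact integer square roots of arbitrarily large perfect squares,
# math.isqrt (Python 3.8+) avoids floating point entirely.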

vector magnitude for large components

I noticed that numpy has a built-in function linalg.norm(vector), which produces the magnitude. For small values I get the desired output:
>>> import numpy as np
>>> np.linalg.norm([0,2])
2.0
However for large values:
>>> np.linalg.norm([0,149600000000])
2063840737.6330884
This is a huge error. Making my own function seems to produce the same error. What is the problem here? Can a rounding error really be this big, and what can I do instead?
Your number is written as an integer, and yet it is too big to fit into a numpy.int32. This problem can happen even in Python 3, where native integers have arbitrary precision, because numpy converts the input into a fixed-width integer array.
In numerical work I try to make everything floating point unless it is an index. So I tried:
In [3]: np.linalg.norm([0.0,149600000000.0])
Out[3]: 149600000000.0
To elaborate: in this case, adding the .0 was an easy way of turning the integers into doubles. In more realistic code, you might have incoming data of uncertain type. The safest (but not always the right) thing to do is simply to coerce it to a floating point array at the top of your function:
import numpy as np

def do_something_with_array(arr):
    arr = np.double(arr)  # or np.float32 if you prefer
    # ... do something ...
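Applied to the example from the question, a small usage sketch:
import numpy as np

v = [0, 149600000000]  # integer inputs
print(np.linalg.norm(np.asarray(v, dtype=np.double)))
# 149600000000.0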

Python plotting

I have a question about plotting. I want to plot some data whose values range between:
3825229325678980.0786812569752124806963380417361932
and
3825229325678980.078681262584097479512892231994772
but I get the following error:
Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=3.82522932568e+15, top=3.82522932568e+15
How should I increase the decimal points here to solve the problem?
The difference between your min and max values is far below what a double can resolve at that magnitude: the machine epsilon of a double is about 2e-16 in relative terms, so near 3.8e15 the spacing between adjacent representable values is roughly 0.5.
Basically, with an 8-byte floating point representation you cannot distinguish between the two numbers.
I suggest removing the integer part from your data and representing only the decimal part. The integer part is only a big constant that you can always add back later.
It might be easiest to shift or scale your data so that the plotted range no longer collapses to a single representable value.
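A minimal sketch of that idea, treating the two values from the question as the data to plot (matplotlib assumed):
import matplotlib.pyplot as plt

offset = 3825229325678980                   # the large common integer part
y = [0.0786812569752124806963380417361932,  # decimal parts only; their difference
     0.078681262584097479512892231994772]   # (~5.6e-9) is easily resolvable here
plt.plot([0, 1], y, marker='o')
plt.ylabel('value - %d' % offset)
plt.show()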
