overflow encountered in expm1 - python

The following code is from a reference notebook for the Kaggle House Price Prediction competition:
X = train_df.drop(['SalePrice'], axis=1)
y = train_df.SalePrice
X_pwr = power_transformer.fit_transform(X)
test_std = std_scaler.fit_transform(test_df)
test_rbst = rbst_scaler.fit_transform(test_df)
test_pwr = power_transformer.fit_transform(test_df)
gb_reg = GradientBoostingRegressor(n_estimators=1792, learning_rate=0.01005,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=14,
                                   loss='huber', random_state=42)
gb_reg.fit(X_pwr, y)
y_head = gb_reg.predict(X_test)
test_pred_gb = gb_reg.predict(test_pwr)
test_pred_gb = pd.DataFrame(test_pred_gb, columns=['SalePrice'])
test_pred_gb.SalePrice = np.floor(np.expm1(test_pred_gb.SalePrice))
sample_sub.iloc[:, 1] = (0.5 * test_pred_gb.iloc[:, 0]) + (0.5 * old_prediction.iloc[:, 1])
# here old_prediction is the sample prediction given by Kaggle
I want to know the reason for the last line of code: why are they taking the exponent of the predicted values?
Also, that line gives a runtime warning: overflow encountered in expm1. I also want to know how to solve this overflow problem, because after this step all the SalePrice values are replaced by NaN.

For the first question, it is hard to say without seeing more code, though I doubt there is a good reason, as the numbers you are feeding np.expm1 are apparently large (which makes sense if they're the sale prices of houses). This brings me to the second question:
expm1 is a special function for computing exp(x) - 1. It returns greater precision for very small x than just evaluating exp(x) - 1 directly. I don't know the exact way in which numpy performs the calculation, though typically it is done with a Taylor series: start with the Taylor series for exp(x) and move the initial term of 1 over to the other side to get exp(x) - 1 as a large polynomial sum of terms. This polynomial contains things like x^n and n!, where n is the number of terms the polynomial is taken to (i.e. the level of precision). For large x, the numbers get unwieldy pretty quickly! In other words, you quickly hit the limit of how big a number a double-precision float can represent. To show this, just try the following:
import numpy as np
import warnings

warnings.filterwarnings('error')

for i in range(200000):
    try:
        np.expm1(i)
    except Warning:
        print(i)
        break
Which, on my system, prints 710 (see the quick check below for where that number comes from). For a workaround, you may try to get away with making large numbers small (e.g. a price of $200,000 would really be 0.2 mega-dollars).
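That 710 isn't arbitrary: exp(x) overflows a 64-bit float once x exceeds the log of the largest representable double, which you can check directly (a quick aside, not part of the original notebook):

import numpy as np

# exp(x) overflows float64 once x > log(max float64) ~= 709.78, so the loop
# above first triggers the warning at i = 710
print(np.log(np.finfo(np.float64).max))  # 709.782712893384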

Related

Can I discard the complex portion of results generated with scipy.linalg.logm?

I have a matrix that looks like the one below. It is always a square matrix (up to 1000 x 1000) and its values are between 0 and 1:
data = np.array([[0.0308, 0.07919, 0.05694, 0.00662, 0.00927],
                 [0.07919, 0.00757, 0.00720, 0.00526, 0.00709],
                 [0.05694, 0.00720, 0.00518, 0.00707, 0.00413],
                 [0.00662, 0.00526, 0.00707, 0.01612, 0.00359],
                 [0.00927, 0.00709, 0.00413, 0.00359, 0.00870]])
When I try to take the natural log of this matrix, using scipy.linalg.logm, it gives me the following result.
print(logm(data))
>> [[-2.3492917 +1.42962407j 0.15360003-1.26717846j 0.15382223-0.91631624j 0.15673496+0.0443927j 0.20636448-0.01113953j]
[ 0.15360003-1.26717846j -3.75764578+2.16378501j 1.92614937-0.60836013j -0.13584605+0.27652444j 0.27819383-0.25190565j]
[ 0.15382223-0.91631624j 1.92614937-0.60836013j -5.08018989+2.52657239j 0.37036433-0.45966441j -0.03892575+0.36450564j]
[ 0.15673496+0.0443927j -0.13584605+0.27652444j 0.37036433-0.45966441j -4.22733838+0.09726189j 0.26291385-0.07980921j]
[ 0.20636448-0.01113953j 0.27819383-0.25190565j -0.03892575+0.36450564j 0.26291385-0.07980921j -4.91972246+0.06594195j]]
First of all, why is this happening? Based on another post I found here, pertaining to a different scipy.linalg method, this is due to truncation and rounding issues caused by floating point errors.
If that is correct, then how am I able to fix it? The second answer on that same linked post suggested this:
(2) All imaginary parts returned by numpy's linalg.eig are close to the machine precision. Thus you should consider them zero.
Is this correct? I can use numpy.real(data) to simply discard the complex portion of the values, but I don't know if that is a mathematically (or scientifically) robust thing to do.
Additionally, I attempted to use TensorFlow's linalg.logm method, but got the exact same complex results, which suggests this isn't unexpected behavior?
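For what it's worth, NumPy has a helper, np.real_if_close, that only discards imaginary parts when they are within a given number of machine epsilons of zero; a minimal sketch on a toy sub-matrix (this just makes the "close to machine precision" check explicit, it does not settle the underlying math question):

import numpy as np
from scipy.linalg import logm

data = np.array([[0.0308, 0.07919],
                 [0.07919, 0.00757]])   # small toy matrix, values in [0, 1]

log_data = logm(data)

# real_if_close only drops imaginary parts within `tol` machine epsilons of
# zero; here they are large (the matrix has a negative eigenvalue), so the
# result stays complex
cleaned = np.real_if_close(log_data, tol=1000)
print(cleaned.dtype)   # complex128 -- the imaginary parts are not mere noise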

Why is the result of numpy fft different from matlab fft?

I was using the parameters and formulation below to generate signals.
Python code:
import numpy as np
fs=15e6
dt=1/fs
f0=1e6
pri=400e-6
t=np.arange(0,pri,dt)
i=64
fd=5/(i*pri)
xt=0.1*np.exp(2j*np.pi*f0*t)
xf=np.fft.fft(xt)
The MATLAB code is very similar to the Python code:
fs=15e6
dt=1/fs
f0=1e6
pri=400e-6
t=0:dt:pri-dt
i=64
fd=5/(i*pri)
xt=0.1*exp(2j*pi*f0*t)
xf=fft(xt)
This code generates an array of length 6000 on which to perform the FFT. I then calculate the result in MATLAB using the same method. The results are exactly the same when the FFT length is less than 6000, but they differ slightly when the FFT length is 6000.
The result of xf in Python is:
xf[:5] = [4.68819428e-12-2.53650626e-12j,
6.55886345e-12+4.51937973e-13j,
5.91758655e-12+4.48215898e-12j,
2.07297400e-12+6.37992397e-12j,
-1.44454940e-12+5.60550355e-12j]
The result of xf in MATLAB is:
xf(1:5) = 5.165829569664382e-12+1.503743771929872e-12j
4.389776854811194e-12+5.127317569216533e-12j
1.067288620484369e-12+7.191186166371298e-12j
-3.058138112418996e-12+6.189531470616248e-12j
-5.288313073640339e-12+2.908982377132765e-12j
If I use length 5999 for the FFT, like this in Python:
xf=np.fft.fft(xt, 5999)
or in matlab:
xf=fft(xt, 5999)
The results are absolutely identical.
In Python:
xf[:5] = [-0.09135455+0.04067366j,
-0.09160153+0.04072616j,
-0.09184974+0.04077892j,
-0.09209917+0.04083194j,
-0.09234986+0.04088522j]
In MATLAB:
xf(1:5) = -9.135455e-02+4.067366e-2j
-9.160153e-02+4.072616e-2j
-9.184974e-02+4.077892e-2j
-9.209917e-02+4.083194e-2j
-9.234986e-02+4.088522e-2j
I am confused. Can anybody explain this phenomenon? Thanks for your help.
PS: Python 3.8.5, NumPy 1.19.2, MATLAB 2014
demio, I think the different values you are getting are due to MATLAB's floating-point rounding errors. Low values, of the order of 1e-15, are rounded to 0, and that introduces an error of the same order as what was rounded away. The same happens for really big values. You can see a related post with a pretty good explanation of this at: https://es.mathworks.com/matlabcentral/answers/475494-unexpected-results-due-to-floating-point-rounding-errors-by-performing-arithmetic-calculations-on-la.
It is also worth noticing that, even though these floating-point rounding errors always occur, you have to determine whether they are significant, taking into account your data set and the result you are expecting. Sometimes those absolute differences do not mean anything, because the relative differences are marginal. If you wish to avoid this behavior in MATLAB, you can use the sym function, which makes MATLAB use a symbolic representation; among other things, the numbers are then represented more accurately. More on this subject can be found here: https://es.mathworks.com/help/symbolic/create-symbolic-numbers-variables-and-expressions.html#buyfu27.
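To make the point about relative versus absolute differences concrete, here is a rough sketch comparing the two result snippets against the scale of the signal (the arrays simply copy the values quoted above):

import numpy as np

# first two bins reported above from each implementation
xf_python = np.array([4.68819428e-12 - 2.53650626e-12j,
                      6.55886345e-12 + 4.51937973e-13j])
xf_matlab = np.array([5.165829569664382e-12 + 1.503743771929872e-12j,
                      4.389776854811194e-12 + 5.127317569216533e-12j])

# the spectrum's peak bin has magnitude ~0.1 * 6000 = 600, so differences of
# a few 1e-12 are roughly 14 orders of magnitude below the data: numerical noise
peak = 0.1 * 6000
print(np.max(np.abs(xf_python - xf_matlab)) / peak)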

Finding combination of columns which provides best combination based on function return

I have a dataframe with daily returns for 6 portfolios (PORT1, PORT2, PORT3, ... PORT6).
I have defined functions for compound annual return and risk-adjusted return, and I can run them for any one PORT.
I want to find the combination of portfolios (assume equal weighting) that obtains the highest return. For example, a combination of PORT1, PORT3, PORT4, and PORT6 may provide the highest risk-adjusted return. Is there a method to automatically run the defined function on all combinations and obtain the desired combination?
No code is included, as I do not think it is necessary to show the computation used to determine the risk-adjusted return.
def returns(PORT):
val = ... [computation of return here for PORT]
return val
Finding the optimal location within a multidimensional space is possible, but people have made fortunes figuring out better ways of achieving exactly this.
The problem at the outset is setting out your possibility space. You've got six dimensions, and presumably you want to allocate 1 unit of "stuff" across all six, such that the vector of allocations {a,b,c,d,e,f} sums to 1. That's still an infinity of numbers, so maybe we only start off with increments of size 0.10. Ten increments per dimension, across 6 dimensions, gives you on the order of 10^6 grid points (before discarding those that don't sum to 1).
So the simple brute-force method would be to "simply" run your function across the entire parameter space, store the values and pick the best one.
That may not be the answer you want; other methods exist, including randomising your guesses and limiting your results to a more manageable number. But the performance gain is offset by some uncertainty, and by some potentially difficult conversations with your clients ("What do you mean you did it randomly?!").
To make any guesses at what might be optimal, it would be helpful to have an understanding of the response curves each portfolio has under different circumstances and the sorts of risk/reward profiles you might expect them to operate under. Are they linear, quadratic, or are they more complex? If you can model them mathematically, you might be able to use an algorithm to reduce your search space.
Short (but fundamental) answer is "it depends".
You can do
import itertools

best_return = 0
best_PORT = None
# r runs from 1 to len(PORTS) so the empty combination is skipped and the
# full set of portfolios is included
for r in range(1, len(PORTS) + 1):
    for PORT in itertools.combinations(PORTS, r):
        cur_return = returns(PORT)
        if cur_return > best_return:
            best_return = cur_return
            best_PORT = PORT
You can also do
max([max([PORT for PORT in itertools.combinations(PORTS, r)], key=returns)
     for r in range(1, len(PORTS) + 1)], key=returns)
However, this is more of an economics question than a CS one. Given a set of positions and their returns and risk, there are explicit formulae to find the optimal portfolio without having to brute force it.
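As an illustration of that last point: if "risk-adjusted return" means something like a Sharpe ratio, the classic mean-variance result gives optimal (not necessarily equal) weights in closed form. A hedged sketch with made-up data follows; the column names, return model, and zero risk-free rate are assumptions for illustration only:

import numpy as np
import pandas as pd

# made-up daily returns, one column per portfolio (stand-in for your dataframe)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(0.0005, 0.01, size=(250, 6)),
                  columns=[f'PORT{i}' for i in range(1, 7)])

mu = df.mean().values   # expected daily return of each portfolio
cov = df.cov().values   # covariance matrix (the risk model)

# maximum-Sharpe (tangency) weights: proportional to inv(cov) @ mu,
# normalised to sum to 1 (negative weights mean short positions)
raw = np.linalg.solve(cov, mu)
weights = raw / raw.sum()
print(dict(zip(df.columns, weights.round(3))))

If you must restrict yourself to equal-weighted subsets, the itertools loop above is the way to go; the closed form is what "explicit formulae" usually refers to.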

Linear regression with Python

I am studying "Building Machine Learning Systems with Python (2nd edition)".
I have a silly doubt about the answer part of the very first chapter.
According to the book, and based on my own observation, I always get a 2nd-order polynomial as the best-fitting curve.
Whenever I train my system with the training dataset, I get a different test error for each polynomial function.
Thus the parameters of my equation also differ.
But surprisingly, I get approximately the same answer every time, in the range 9.19-9.99.
My final hypothesis function has different parameters each time, but I get approximately the same answer.
Can anyone tell me the reason behind it?
[FYI: I am finding the answer for y=100000]
I am sharing the code sample and the output of each iteration.
Here are the errors and the corresponding answers:
https://i.stack.imgur.com/alVzU.png
https://i.stack.imgur.com/JVGSm.png
https://i.stack.imgur.com/RB53X.png
Thanks in advance!
import scipy as sp
import matplotlib.pyplot as mp

def error(f, x, y):
    return sp.sum((f(x) - y)**2)
data=sp.genfromtxt("web_traffic.tsv",delimiter="\t")
x=data[:,0]
y=data[:,1]
x=x[~sp.isnan(y)]
y=y[~sp.isnan(y)]
mp.scatter(x,y,s=10)
mp.title("web traffic over the month")
mp.xlabel("week")
mp.ylabel("hits/hour")
mp.xticks([w*24*7 for w in range(10)],["week %i"%i for i in range(10)])
mp.autoscale(enable=True,tight=True)
mp.grid(color='b',linestyle='-',linewidth=1)
mp.show()
infletion=int(3.5*7*24)
xa=x[infletion:]
ya=y[infletion:]
f1=sp.poly1d(sp.polyfit(xa,ya,1))
f2=sp.poly1d(sp.polyfit(xa,ya,2))
f3=sp.poly1d(sp.polyfit(xa,ya,3))
print(error(f1,xa,ya))
print(error(f2,xa,ya))
print(error(f3,xa,ya))
fx=sp.linspace(0,xa[-1],1000)
mp.plot(fx,f1(fx),linewidth=1)
mp.plot(fx,f2(fx),linewidth=2)
mp.plot(fx,f3(fx),linewidth=3)
frac=0.3
partition=int(frac*len(xa))
shuffled=sp.random.permutation(list(range(len(xa))))
test=sorted(shuffled[:partition])
train=sorted(shuffled[partition:])
fbt1=sp.poly1d(sp.polyfit(xa[train],ya[train],1))
fbt2=sp.poly1d(sp.polyfit(xa[train],ya[train],2))
fbt3=sp.poly1d(sp.polyfit(xa[train],ya[train],3))
fbt4=sp.poly1d(sp.polyfit(xa[train],ya[train],4))
print ("error in fbt1:%f"%error(fbt1,xa[test],ya[test]))
print ("error in fbt2:%f"%error(fbt2,xa[test],ya[test]))
print ("error in fbt3:%f"%error(fbt3,xa[test],ya[test]))
from scipy.optimize import fsolve
print (fbt2)
print (fbt2-100000)
maxreach=fsolve(fbt2-100000,x0=800)/(7*24)
print ("ans:%f"%maxreach)
Don't do it like that.
Linear regression is more "up to you" than you think.
Start by getting the slope of the line: (#1) average((f(x2)-f(x))/(x2-x)).
Then use that answer as M in (#2) average(f(x) - M*x).
Now you have (#1) and (#2) as your regression.
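A rough sketch of that two-step procedure (my own reading of the steps above, using consecutive points for the pairwise slopes; this is not code from the book):

import numpy as np

# made-up sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# (#1) slope M: average of the slopes between consecutive points
M = np.mean(np.diff(y) / np.diff(x))

# (#2) intercept: average of f(x) - M*x
B = np.mean(y - M * x)

print(M, B)  # slope and intercept estimates for this toy data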
For any type of regression similar to this example, e.g. polynomial, you need to isolate the A-factor (first factor) by taking the derivative of f(x) with respect to x. For example, delta(ax^2+bx+c)/delta(x) gives you an equation in a and b, and from there it works. When doing this, take the average whenever there are more entries; do it like a window sliding down a sheet of paper, e.g. select entries 1-10, then 2-11, then 3-12, and so on, for some crazy awesome regression. You may want to create a matrix API. The best way to handle it is first to create an API that takes out one row and one column, then build on that to automate it. The ratios of the in/out entries left in only two columns are averaged and give the solution for a coefficient. Then make a program to take rows out, but, for example, leave row 1 and row 5 (the output), then row 2 and row 5, ... row 4 and row 5.
I wouldn't recommend Python for coding this; I recommend C, because it prevents you from making messy arrays that you don't remember. You need to understand systems theory and build system by system. It is unwise to code matrices without building automated sub-systems that are carefully tested. I failed until I worked on it in C: I first made a one-step shrinking function that was carefully tested, then built systems to automate getting one coefficient, tested that, then automated the repetition of that program to solve the whole thing. You won't understand any of this by using Python or similar shortcuts; you use them after you realize what they really are. That's how I learned, and I'm still amazed I coded it. The one problem is that it's unstable above 4x4 (actually 4x5) matrices.
Good Luck,
Misha Taylor

Realistic float value for "about zero"

I'm working on a program with fairly complex numerics, mostly in numpy with complex datatypes. Some of the calculations are returning nearly empty arrays with a complex component that is almost zero. For example:
(2 + 0j, 3+0j, 4+3.9320340202e-16j)
Clearly the third component is basically 0, but for whatever reason this is the output of my calculation, and it turns out that for some of these nearly zero values np.iscomplex() returns True. Rather than dig through that big code, I think it's sensible to just apply a cutoff. My question is: what is a sensible cutoff below which a value should be considered zero? 0.00? 0.000000? etc...
I understand that these values are due to rounding errors in floating-point math, and I just want to handle them sensibly. What tolerance/range does one allow for such precision error? I'd like to set it as a parameter:
ABOUTZERO=0.000001
As others have commented, what constitutes 'almost zero' really does depend on your particular application, and how large you expect the rounding errors to be.
If you must use a hard threshold, a sensible value might be the machine epsilon, which is defined as the upper bound on the relative error due to rounding for floating point operations. Intuitively, it is the smallest positive number that, when added to 1.0, gives a result >1.0 using a given floating point representation and rounding method.
In numpy, you can get the machine epsilon for a particular float type using np.finfo:
import numpy as np
print(np.finfo(float).eps)
# 2.22044604925e-16
print(np.finfo(np.float32).eps)
# 1.19209e-07
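For example, one way to apply such a cutoff to the array from the question (the factor of 100 is an arbitrary safety margin, not a recommendation):

import numpy as np

vals = np.array([2 + 0j, 3 + 0j, 4 + 3.9320340202e-16j])

# treat imaginary parts within a few hundred machine epsilons of zero as zero
tol = 100 * np.finfo(float).eps
cleaned = np.where(np.abs(vals.imag) < tol, vals.real + 0j, vals)
print(cleaned)                  # the stray 3.93e-16j component is dropped
print(np.iscomplex(cleaned))    # [False False False]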
