Python stats module: How to extract confidence/prediction intervals from GPy?
After having looked through all the docs and examples online, I have not been able to find a way to extract information regarding the confidence or prediction intervals from GPy models.
I generate dummy data like this,
import numpy as np
import matplotlib.pyplot as plt

## Generating data for regression
# First, regular sine wave + normal noise
x = np.linspace(0,40, num=300)
noise1 = np.random.normal(0,0.3,300)
y = np.sin(x) + noise1
# Second, an upward trend starting midway, with its own noise as well
temp = x[150:]
noise2 = 0.004*temp**2 + np.random.normal(0,0.1,150)
y[150:] = y[150:] + noise2
plt.plot(x, y)
and then estimate a basic model,
import GPy

## Pre-processing
X = np.expand_dims(x, axis=1)
Y = np.expand_dims(y, axis=1)
## Model
kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
model1 = GPy.models.GPRegression(X, Y, kernel)
However, nothing makes it clear how to proceed from there... Another question here asked the same thing, but the answer given there no longer works, and seems rather unsatisfactory for such an important element of statistical modelling.
Given a model and a set of target x values at which we want to generate the intervals, you can extract the intervals using:
intervals = model.predict_quantiles(X=target_x_vals, quantiles=(2.5, 97.5))
You can change the quantiles argument to get intervals of whatever width you need. The documentation for this function is found at: https://gpy.readthedocs.io/en/deploy/_modules/GPy/core/gp.html
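For completeness, here is a minimal sketch applied to model1 from the question (the grid target_x_vals is illustrative; note that GPy expects a 2-D input array). As far as I can tell, these quantiles include the likelihood noise, i.e. they are prediction intervals; GPy also offers predict_noiseless for quantities on the latent function alone.

model1.optimize()  # fit the hyperparameters before asking for intervals
target_x_vals = np.linspace(0, 40, num=200)[:, None]  # shape (200, 1)
lower, upper = model1.predict_quantiles(X=target_x_vals, quantiles=(2.5, 97.5))  # one array per quantile
plt.plot(x, y, 'k.', alpha=0.3)
plt.fill_between(target_x_vals[:, 0], lower[:, 0], upper[:, 0], alpha=0.3, label='95% interval')
plt.legend()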
Related
Finding two linear fits on different domains in the same data
I'm trying to plot a 3rd-order polynomial and two linear fits on the same set of data. My data looks like this:

,Frequency,Flux Density,log_freq,log_flux
0,1.25e+18,1.86e-07,18.096910013008056,-6.730487055782084
1,699000000000000.0,1.07e-06,14.84447717574568,-5.97061622231479
2,541000000000000.0,1.1e-06,14.73319726510657,-5.958607314841775
3,468000000000000.0,1e-06,14.670245853074125,-6.0
4,458000000000000.0,1.77e-06,14.660865478003869,-5.752026733638194
5,89400000000000.0,3.01e-05,13.951337518795917,-4.521433504406157
6,89400000000000.0,9.3e-05,13.951337518795917,-4.031517051446065
7,89400000000000.0,0.00187,13.951337518795917,-2.728158393463501
8,65100000000000.0,2.44e-05,13.813580988568193,-4.61261017366127
9,65100000000000.0,6.28e-05,13.813580988568193,-4.2020403562628035
10,65100000000000.0,0.00108,13.813580988568193,-2.96657624451305
11,25900000000000.0,0.000785,13.413299764081252,-3.1051303432547472
12,25900000000000.0,0.00106,13.413299764081252,-2.9746941347352296
13,25900000000000.0,0.000796,13.413299764081252,-3.099086932262331
14,13600000000000.0,0.00339,13.133538908370218,-2.469800301796918
15,13600000000000.0,0.00372,13.133538908370218,-2.4294570601181023
16,13600000000000.0,0.00308,13.133538908370218,-2.5114492834995557
17,12700000000000.0,0.00222,13.103803720955957,-2.653647025549361
18,12700000000000.0,0.00204,13.103803720955957,-2.6903698325741012
19,230000000000.0,0.133,11.361727836017593,-0.8761483590329142
22,90000000000.0,0.518,10.954242509439325,-0.28567024025476695
23,61000000000.0,1.0,10.785329835010767,0.0
24,61000000000.0,0.1,10.785329835010767,-1.0
25,61000000000.0,0.4,10.785329835010767,-0.3979400086720376
26,42400000000.0,0.8,10.627365856592732,-0.09691001300805639
27,41000000000.0,0.9,10.612783856719735,-0.045757490560675115
28,41000000000.0,0.7,10.612783856719735,-0.1549019599857432
29,41000000000.0,0.8,10.612783856719735,-0.09691001300805639
30,41000000000.0,0.6,10.612783856719735,-0.2218487496163564
31,41000000000.0,0.7,10.612783856719735,-0.1549019599857432
32,37000000000.0,1.0,10.568201724066995,0.0
33,36800000000.0,1.0,10.565847818673518,0.0
34,36800000000.0,0.98,10.565847818673518,-0.00877392430750515
35,33000000000.0,0.8,10.518513939877888,-0.09691001300805639
36,33000000000.0,1.0,10.518513939877888,0.0
37,31400000000.0,0.92,10.496929648073214,-0.036212172654444715
38,23000000000.0,1.4,10.361727836017593,0.146128035678238
39,23000000000.0,1.1,10.361727836017593,0.04139268515822508
40,23000000000.0,1.11,10.361727836017593,0.045322978786657475
41,23000000000.0,1.1,10.361727836017593,0.04139268515822508
42,22200000000.0,1.23,10.346352974450639,0.08990511143939793
43,22200000000.0,1.24,10.346352974450639,0.09342168516223506
44,21700000000.0,0.98,10.33645973384853,-0.00877392430750515
45,21700000000.0,1.07,10.33645973384853,0.029383777685209667
46,20000000000.0,1.44,10.301029995663981,0.15836249209524964
47,15400000000.0,1.32,10.187520720836464,0.12057393120584989
48,15000000000.0,1.5,10.176091259055681,0.17609125905568124
49,15000000000.0,1.5,10.176091259055681,0.17609125905568124
50,15000000000.0,1.42,10.176091259055681,0.15228834438305647
51,15000000000.0,1.43,10.176091259055681,0.1553360374650618
52,15000000000.0,1.42,10.176091259055681,0.15228834438305647
53,15000000000.0,1.47,10.176091259055681,0.1673173347481761
54,15000000000.0,1.38,10.176091259055681,0.13987908640123647
55,10700000000.0,2.59,10.02938377768521,0.4132997640812518
56,8870000000.0,2.79,9.947923619831727,0.44560420327359757
57,8460000000.0,2.69,9.927370363039023,0.42975228000240795
58,8400000000.0,2.8,9.924279286061882,0.4471580313422192
59,8400000000.0,2.53,9.924279286061882,0.40312052117581787
60,8400000000.0,2.06,9.924279286061882,0.31386722036915343
61,8300000000.0,2.58,9.919078092376074,0.41161970596323016
62,8080000000.0,2.76,9.907411360774587,0.4409090820652177
63,5010000000.0,3.68,9.699837725867246,0.5658478186735176
64,5000000000.0,0.81,9.698970004336019,-0.09151498112135022
65,5000000000.0,3.5,9.698970004336019,0.5440680443502757
66,5000000000.0,3.57,9.698970004336019,0.5526682161121932
67,4980000000.0,3.46,9.697229342759718,0.5390760987927766
68,4900000000.0,2.95,9.690196080028514,0.46982201597816303
69,4850000000.0,3.46,9.685741738602264,0.5390760987927766
70,4850000000.0,3.45,9.685741738602264,0.5378190950732742
71,4780000000.0,2.16,9.679427896612118,0.3344537511509309
72,4540000000.0,3.61,9.657055852857104,0.557507201905658
73,2700000000.0,3.5,9.431363764158988,0.5440680443502757
74,2700000000.0,3.7,9.431363764158988,0.568201724066995
75,2700000000.0,3.92,9.431363764158988,0.5932860670204573
76,2700000000.0,3.92,9.431363764158988,0.5932860670204573
77,2250000000.0,4.21,9.352182518111363,0.6242820958356683
78,1660000000.0,3.69,9.220108088040055,0.5670263661590603
79,1660000000.0,3.8,9.220108088040055,0.5797835966168101
80,1410000000.0,3.5,9.14921911265538,0.5440680443502757
81,1400000000.0,3.45,9.146128035678238,0.5378190950732742
82,1400000000.0,3.28,9.146128035678238,0.5158738437116791
83,1400000000.0,3.19,9.146128035678238,0.5037906830571811
84,1400000000.0,3.51,9.146128035678238,0.5453071164658241
85,1340000000.0,3.31,9.127104798364808,0.5198279937757188
86,1340000000.0,3.31,9.127104798364808,0.5198279937757188
87,750000000.0,3.14,8.8750612633917,0.49692964807321494
88,408000000.0,1.46,8.61066016308988,0.1643528557844371
89,408000000.0,1.46,8.61066016308988,0.1643528557844371
90,365000000.0,1.62,8.562292864456476,0.20951501454263097
91,365000000.0,1.56,8.562292864456476,0.1931245983544616
92,333000000.0,1.32,8.52244423350632,0.12057393120584989
93,302000000.0,1.23,8.48000694295715,0.08990511143939793
94,151000000.0,2.13,8.178976947293169,0.3283796034387377
95,73800000.0,3.58,7.868056361823042,0.5538830266438743

and my code is

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import numpy.polynomial.polynomial as poly

def find_extrema(poly, bounds):
    '''
    Finds the extrema of the polynomial; ensure real.
    https://stackoverflow.com/questions/72932816/python-finding-local-maxima-minima-for-multiple-polynomials-efficiently
    '''
    deriv = poly.deriv()
    extrema = deriv.roots()
    # Filter out complex roots
    extrema = extrema[np.isreal(extrema)]
    # Get real part of root
    extrema = np.real(extrema)
    # Apply bounds check
    lb, ub = bounds
    extrema = extrema[(lb <= extrema) & (extrema <= ub)]
    return extrema

def find_maximum(poly, bounds):
    '''
    Find the maximum point; returns the value of the turnover frequency.
    https://stackoverflow.com/questions/72932816/python-finding-local-maxima-minima-for-multiple-polynomials-efficiently
    '''
    extrema = find_extrema(poly, bounds)
    # Either bound could end up being the minimum. Check those too.
    extrema = np.concatenate((extrema, bounds))
    value_at_extrema = poly(extrema)
    maximum_index = np.argmax(value_at_extrema)
    return extrema[maximum_index]

# LOAD THE DATA FROM FILE HERE
# CARRY ON...

xvar = 'log_freq'
yvar = 'log_flux'
x, y = pks[xvar], pks[yvar]
lower = min(x)
upper = max(x)

# Find the 3rd-order polynomial which fits the SED
coefs = poly.polyfit(x, y, 3)  # find the coeffs
x_new = np.linspace(lower, upper, num=len(x)*10)  # space to plot the fit
ffit = poly.Polynomial(coefs)  # find the polynomial

# Find turnover frequency and peak flux
nu_to = find_maximum(ffit, (lower, upper))
F_p = ffit(nu_to)

# HERE'S THE TRICKY BIT
# Find the straight line to fit to the left of nu_to
left_linefit = poly.polyfit(x, y, 1)
x_left = np.linspace(lower, nu_to, num=len(x)*10)  # space to plot the fit
ffit_left = poly.Polynomial(left_linefit, domain=(lower, nu_to))

# PLOTS THE POLYNOMIAL WELL
ax1 = plt.subplot(1, 1, 1)
ax1.scatter(pks[xvar], pks[yvar], label='PKS 0742+10', c='b')
ax1.plot(x_new, ffit(x_new), color='r')
ax1.plot(x_left, ffit_left(x_left), color='gold')
ax1.set_yscale('linear')
ax1.set_xscale('linear')
ax1.legend()
ax1.set_xlabel(r'$\log\nu$ ($\nu$ in Hz)')
ax1.set_ylabel(r'$\log F_{\nu}$ ($F_{\nu}$ in Jy)')
ax1.grid(axis='both', which='major')

The code produces the poly fit well. I'm trying to plot the straight-line fits for the points on either side of the maximum. I thought I could do it with ffit_left = poly.Polynomial(left_linefit, domain=(lower, nu_to)) and similar for ffit_right, but that produces what is actually the straight-line fit for the whole dataset, plotted only over that domain. I don't want to manipulate the dataset, because eventually I'll have to do this on a lot of datasets. The fitting part of the code comes from an answer to this question. How can I fit a straight line to just a subset of points without manipulating the dataset? My guess is that I have to make left_linefit = poly.polyfit(x, y, 1) recognise a domain, but I can't see anything in the numpy polyfit docs. Sorry for the long question!
I am not sure I fully understand your request. If you want to fit a piecewise function made of three linear segments, a method is described in https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf with theory and numerical examples. Several cases are considered; among them, a model of the form y = a + b*x + c*(x - x1)*H(x - x1) + d*(x - x2)*H(x - x2) might be convenient for you, where H(*) is the Heaviside step function and x1, x2 are the breakpoints between segments.
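A minimal sketch of that idea in Python, assuming the breakpoints x1 < x2 are known or picked by grid search (the helper name heaviside_design is illustrative, not from the linked document; x and y are the data as NumPy arrays):

import numpy as np

def heaviside_design(x, x1, x2):
    # Columns for y = a + b*x + c*(x - x1)*H(x - x1) + d*(x - x2)*H(x - x2)
    H = lambda t: np.heaviside(t, 0.0)
    return np.column_stack([np.ones_like(x), x,
                            (x - x1) * H(x - x1),
                            (x - x2) * H(x - x2)])

A = heaviside_design(x, x1, x2)
params, *_ = np.linalg.lstsq(A, y, rcond=None)  # params = [a, b, c, d]

If the breakpoints are unknown, they can be estimated by minimizing the residual sum of squares over a grid of (x1, x2) candidates.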
How to make a high-pass filter?
I have a 3D data matrix of sea level data (time, y, x) and I found the power spectrum by taking the square of the FFT, but there are low frequencies that are really dominant. I want to get rid of those low frequencies by applying a high-pass filter... how would I go about doing that? An example of the data set and structure/code is below.

This is the data set and creating the arrays:

Yearmin = 2018
Yearmax = 2019
year_len = Yearmax - Yearmin + 1.0  # number of years
direcInput = "filepath"
a = s.Dataset(direcInput + "test.nc", mode='r')

# creating arrays
lat = a.variables["latitude"][:]
lon = a.variables["longitude"][:]
time1 = a.variables["time"][:]  # DAYS SINCE JAN 1ST 1950
sla = a.variables["sla"][:, :, :]  # t, y, x
time = Yearmin + (year_len * (time1 - np.min(time1)) / (np.max(time1) - np.min(time1)))

# detrending and normalizing data
def standardize(y, detrend=True, normalize=True):
    if detrend == True:
        y = signal.detrend(y, axis=0)
    y = (y - np.mean(y, axis=0))
    if normalize == True:
        y = y / np.std(y, axis=0)
    return y

sla_standard = standardize(sla)
print(sla_standard.shape)  # (710, 81, 320)

# fft
fft = np.fft.rfft(sla_standard, axis=0)
spec = np.square(abs(fft))
frequencies = np.arange(0, nyquist, df)  # nyquist and df defined from the sampling rate

# PLOTTING THE FREQUENCIES VS SPECTRUM FOR A FEW DIFFERENT SPATIAL LOCATIONS
plt.plot(frequencies, spec[:, 68, 85])
plt.plot(frequencies, spec[:, 23, 235])
plt.plot(frequencies, spec[:, 39, 178])
plt.plot(frequencies, spec[:, 30, 149])
plt.xlim(0, .05)
plt.show()

My goal is to make a high-pass filter of the ORIGINAL time series (sla_standard) to remove the two really big peaks. Which type of filter should I use? Thank you!
Use .axes.Axes.set_ylim to set the y-axis limit:

Axes.set_ylim(self, bottom=None, top=None, emit=True, auto=False, *, ymin=None, ymax=None)

So in your case you leave ymin=None and set ymax, for example, to ymax=60000 before you start plotting. Thus plt.ylim(ymin=None, ymax=60000). Taking out data should not be done here because it amounts to "falsifying results". What you actually want is to zoom in on the chart; a person who reads the chart independently of you would otherwise interpret the data falsely if not made aware in advance. Peaks that go off the chart are okay, because everybody understands that.

Or: directly replace certain values in an array arr:

arr[arr > ori] = dest

For example, in your case ori=60000 and dest=1: all values larger (">") than 60k are replaced by 1.
The different filters: as you state, a filter acts on the frequencies of your signal. Different filter shapes exist, and some of them have complex expressions because they need to be implemented in real-time processing (causal). However, in your case you seem to be post-processing the data, so you can use the Fourier transform, which requires all the data (non-causal).

The filter to choose: consequently, you can perform your filtering operation directly in the Fourier domain by applying a mask to your frequencies. If you want to remove frequencies, I recommend you use a binary mask made of 0s and 1s. Why? Because it is the simplest filter you can think of. It is scientifically relevant to state that you completely removed some frequencies (say it and justify it). However, it is more difficult to claim that you kept some, attenuated others a little bit, and chose the attenuation factor arbitrarily...

Python implementation:

signal_fft = np.fft.rfft(sla_standard, axis=0)
mask = np.ones_like(signal_fft)  # the mask must match the FFT's shape along axis 0, not the signal's
mask[freq_to_filter, ...] = 0.0  # define here the frequencies to filter
filtered_signal = np.fft.irfft(mask * signal_fft, axis=0, n=sla_standard.shape[0])
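To make "define here the frequencies to filter" concrete, one way to build freq_to_filter (a sketch; dt, the sampling interval along axis 0, and f_cutoff, the cutoff frequency, are yours to choose) is with np.fft.rfftfreq:

freqs = np.fft.rfftfreq(sla_standard.shape[0], d=dt)  # frequency of each rfft bin along axis 0
freq_to_filter = freqs < f_cutoff  # boolean mask: a high-pass filter removes everything below f_cutoff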
Python GPy module: how to plot model predictions over simple x-axis?
In Python, I was attempting to dive into the GPy library for estimating Gaussian Process models, when I encountered a stumbling block early on with simple plotting. For my data, I generated a simple sine wave with squared growth added in midway, and GPy successfully estimated the initial model.

Data generation:

## Generating data for regression
# First, regular sine wave + normal noise
x = np.linspace(0, 40, num=300)
noise1 = np.random.normal(0, 0.3, 300)
y = np.sin(x) + noise1
# Second, an upward trend starting midway, with its own noise as well
temp = x[150:]
noise2 = 0.004*temp**2 + np.random.normal(0, 0.1, 150)
y[150:] = y[150:] + noise2
plt.plot(x, y)

Initial model:

## Pre-processing
X = np.expand_dims(x, axis=1)
Y = np.expand_dims(y, axis=1)
## Model
kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
model1 = GPy.models.GPRegression(X, Y, kernel)
## Plotting
fig = model1.plot()
GPy.plotting.show(fig, filename='basic_gp_regression_notebook')

However, this model is mis-specified, since the data was only created using sin(X) and X^2, and not just X, so I create the next model:

X_all = np.hstack((np.sin(X), np.square(X)))
model2 = GPy.models.GPRegression(X_all, Y, kernel)
fig = model2.plot()
GPy.plotting.show(fig, filename='basic_correct_gp_regression_notebook')

However, now I am getting plotting errors:

Invalid value of type 'builtins.str' received for the 'size' property of scatter.marker
    Received value: '5'

I assume this is because the plot does not know to use "X" as the x-axis, having been supplied only sin(X) and X^2. How could I fix this?
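One possible workaround (a sketch, not a confirmed fix from this thread): skip GPy's built-in plotting for model2 and plot its predictions against the original x yourself, since model.predict returns the posterior mean and variance:

mu, var = model2.predict(X_all)  # mean and variance at the training inputs
sd = np.sqrt(var)
plt.plot(x, y, 'k.', alpha=0.3)
plt.plot(x, mu[:, 0], 'b-')
plt.fill_between(x, (mu - 1.96 * sd)[:, 0], (mu + 1.96 * sd)[:, 0], alpha=0.3)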
How is the offset handled in XGBoost's binary:logistic objective function?
I am working on a mortality prediction (binary outcome) problem with "base mortality probability" as my offset in the XGBoost model. I have used the gbtree booster and the binary:logistic objective function. In my data I have multiple observations/records with the same X values but different offset values.

As per my understanding (please correct me if wrong), XGBoost under the binary:logistic setup tries to fit a model of the below representation: log(p/(1-p)) = offset + F(x), where F(x) is optimized (for a specific loss function) using splits over various X values. Thus, when the X values are exactly the same, to get F(x) I can take the predicted output (with outputmargin = TRUE) and subtract the offset from it. However, when I got the output, it turned out that with this approach I am getting different values of F(x) for the same set of X. I believe the way the offset is handled internally in XGBoost is different from my understanding. Can anyone explain this method/mathematical formulation of handling the offset? I am specifically interested in extracting the value of F(x) (as this is additional information the model is providing) by adjusting the model prediction with the offset values.

Here are the sample codes:

library(xgboost)
x1 = runif(1000)
y1 = as.numeric(runif(1000) > .8)
y2 = as.numeric(runif(1000) > .8)
off1 = runif(1000)
off2 = runif(1000)

# stacking the data to have the same X values
x = c(x1, x1)
y = c(y1, y2)
off = c(off1, off2)
length(unique(off)) # shows 2000 unique values
length(unique(x))   # shows 1000 unique values, i.e. each X is repeated once (as expected)

fulldata = cbind.data.frame(x, y, off)
train_dMtrix = xgb.DMatrix(data = as.matrix(x), label = y, base_margin = off)

params_list = list(booster = "gblinear",
                   objective = "binary:logistic",
                   eta = 0.05,
                   max_depth = 4,
                   min_child_weight = 10,
                   eval_metric = 'logloss')
set.seed(100)
xgbmodel = xgb.train(params = params_list, data = train_dMtrix, nrounds = 100,
                     callbacks = list(cb.gblinear.history()))

# Getting the prediction in link format
fulldata$Predicted_link = predict(xgbmodel, train_dMtrix, outputmargin = TRUE)

# Assuming Predicted_link = offset + F(x), calculating F(x) for each value of X
fulldata$F_x = fulldata$Predicted_link - fulldata$off

# As per my understanding, since F(x) is purely independent of the offset,
# the F_x values (not the predicted probabilities) should be exactly the same
# for the same values of x, irrespective of the corresponding offsets.
# Given 1000 distinct X values, I'm expecting 1000 distinct F_x values.
length(unique(fulldata$F_x)) # shows almost 2000 unique values, contrary to my expectation
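This question has no answer in the scraped thread, but one assumption in it can be sanity-checked directly. The sketch below (Python's xgboost rather than R; it assumes the usual additive-margin semantics, i.e. that at predict time the margin output is base_margin plus the fitted ensemble's contribution) compares the same fitted model under two different base margins:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(100)
X = rng.random((1000, 1))
y = (rng.random(1000) > 0.8).astype(int)
off = rng.random(1000)

dtrain = xgb.DMatrix(X, label=y, base_margin=off)
model = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=50)

# Same fitted model, same X, two different base margins
m_off = model.predict(xgb.DMatrix(X, base_margin=off), output_margin=True)
m_zero = model.predict(xgb.DMatrix(X, base_margin=np.zeros(1000)), output_margin=True)

# If margins are purely additive, (m_off - m_zero) equals off exactly;
# in practice the match may only hold up to float32 precision
print(np.abs((m_off - m_zero) - off).max())

If that additivity holds, the "almost 2000 unique values" of F_x may simply be float32 rounding in Predicted_link rather than genuinely different F(x) values, which is worth ruling out before concluding the offset is handled differently.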
Least Squares method in practice
Very simple regression task. I have three variables x1, x2, x3 with some random noise, and I know the target equation: y = q1*x1 + q2*x2 + q3*x3. Now I want to find the target coefficients q1, q2, q3 and evaluate the performance using the mean Relative Squared Error (RSE), (Prediction/Real - 1)^2. From the research I've done, this looks like an ordinary Least Squares problem, but I can't work out from the examples on the internet how to solve this particular problem in Python. Let's say I have data:

import numpy as np
sourceData = np.random.rand(1000, 3)
koefs = np.array([1, 2, 3])
target = np.dot(sourceData, koefs)

(In real life the data are noisy, with a non-normal distribution.) How do I find these koefs using a Least Squares approach in Python? Any lib usage is fine.
@ayhan made a valuable comment. And there is a problem with your code: there is actually no noise in the data you collect. The input data is noisy, but after the multiplication, you don't add any additional noise. I've added some noise to your measurements and used the least squares formula to fit the parameters; here's my code:

data = np.random.rand(1000, 3)
true_theta = np.array([1, 2, 3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
estimated_theta = np.linalg.inv(data.T @ data) @ data.T @ noisy_measurements

The estimated_theta will be close to true_theta. If you don't add noise to the measurements, they will be equal. I've used the Python 3 matrix multiplication syntax. You could use np.dot instead of @, but that makes the code longer, so I've split the formula:

MTM_inv = np.linalg.inv(np.dot(data.T, data))
MTy = np.dot(data.T, noisy_measurements)
estimated_theta = np.dot(MTM_inv, MTy)

You can read up on least squares here: https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#The_general_problem

UPDATE: Or you could just use the builtin least squares function:

np.linalg.lstsq(data, noisy_measurements)
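For reference, np.linalg.lstsq returns a 4-tuple (solution, residuals, rank, singular values), so the solution itself is the first element; passing rcond=None avoids a deprecation warning on newer NumPy:

estimated_theta, residuals, rank, sv = np.linalg.lstsq(data, noisy_measurements, rcond=None)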
In addition to @lhk's answer, I have found the great scipy least_squares function. It is easy to get the requested behavior with it: we can provide a custom function that returns the residuals and forms a Relative Squared Error instead of an absolute squared difference:

import numpy as np
from scipy.optimize import least_squares

data = np.random.rand(1000, 3)
true_theta = np.array([1, 2, 3])
true_measurements = np.dot(data, true_theta)
noise = np.random.rand(1000) * 1
noisy_measurements = true_measurements + noise
# Uncomment this outlier to see how much better the Relative Squared Error estimator
# works than the default absolute difference for this case:
# noisy_measurements[-1] = data[-1] @ (1000 * true_theta)

def my_func(params, x, y):
    # If we change this line to: (x @ params) - y, we get the same result as np.linalg.lstsq
    res = (x @ params) / y - 1
    return res

x0 = np.zeros(3)  # initial guess for the parameters
res = least_squares(my_func, x0, args=(data, noisy_measurements))
estimated_theta = res.x

Also, we can provide a custom loss with the loss argument, which will process the residuals and form the final loss.
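For example, a one-line sketch using one of the robust losses scipy.optimize.least_squares accepts ('soft_l1'):

res_robust = least_squares(my_func, x0, args=(data, noisy_measurements), loss='soft_l1')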