Fitting Gaussian like model to data does not work? - python

I want to fit this data:
I have the following model function.
def losvd_param(v, v_rot, v_disp, h3, h4):
    y = np.asarray((np.asarray(v)-v_rot)/(v_disp)) # define new variable y for compact notation
    return (np.exp(-0.5 * y**2) * (1 + h3*((2*np.sqrt(2)*y**3-3*np.sqrt(2)*y)/np.sqrt(6)) + h4*((4*y**4-12*y**2+3)/np.sqrt(24))))
The 4 parameters refer to: x-value of maximum, width of the distribution, skewness and kurtosis.
I use curve_fit() to fit my data:
gh_moments = curve_fit(losvd_param, vel_corr_peak, broadening_func)[0]
and get the unexpected output [1. 1. 1. 1.], which is clearly not correct. It should be something more like [1318, 300, 0, 0]; when I put these values into my model function manually, I get roughly the right fit to my data. I also get the warning:
OptimizeWarning: Covariance of the parameters could not be estimated
Can anybody tell me why this could be the case?
Edit: I get the same result when I use a different model function (a simple Gaussian). With a linear model the fit "works", so it might be something connected to the Gaussian function? (Note that my x-array spans values from roughly 500 to 2250.)
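A likely explanation, sketched under assumptions: when no initial guess is passed, curve_fit starts from p0 = [1, 1, 1, 1]. With v between roughly 500 and 2250, that makes y several hundred or more, so exp(-0.5 * y**2) underflows to zero everywhere, the residuals do not respond to small parameter changes, and the optimizer hands back its starting point. The sketch below uses synthetic data in place of vel_corr_peak and broadening_func (which are not available here), and the p0 heuristic is only an illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def losvd_param(v, v_rot, v_disp, h3, h4):
    y = (np.asarray(v) - v_rot) / v_disp
    return np.exp(-0.5 * y**2) * (
        1
        + h3 * (2 * np.sqrt(2) * y**3 - 3 * np.sqrt(2) * y) / np.sqrt(6)
        + h4 * (4 * y**4 - 12 * y**2 + 3) / np.sqrt(24)
    )

# Synthetic stand-in for the real data (assumption: values taken from the question)
rng = np.random.default_rng(0)
v = np.linspace(500, 2250, 200)
true_params = [1318.0, 300.0, 0.0, 0.0]
data = losvd_param(v, *true_params) + rng.normal(0, 0.005, v.size)

# Initial guess on the scale of the data: peak position, a fraction of the x-range, no skew/kurtosis
p0 = [v[np.argmax(data)], (v[-1] - v[0]) / 6, 0.0, 0.0]
popt, pcov = curve_fit(losvd_param, v, data, p0=p0)
print(popt)
```

The key point is that p0 is on the same scale as the x-values; the exact numbers matter much less than getting the order of magnitude right.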

Related

Why does fit_transform() always give me zeros?

I'm wondering why the following:
sklearn.preprocessing.StandardScaler().fit_transform([[58,144000]])
gives this result:
array([[0., 0.]])
I'm doing a Logistic Regression where I run fit_transform() on array of values (the actual data file) like the ones above. Yet, that transform seems to work fine. But when I try to do a single pair of values as shown above ([[58,144000]]), I get zeros.
For predictions using a "new" input, I need to scale that new value the same way as the test/train data were scaled so my ML predictions will work.
Thanks for suggestions.
Thanks!
If you read the docs, you may wonder why it expects a 2D array. After all, you can compute the mean and standard deviation of a vector, which is a 1D array, as you suggest in your question. The answer is that it expects data of shape (samples, features).
So when you pass data like [[58,144000]], it is a (1,2) array, which means 1 sample with 2 features. Each feature is then fitted and transformed separately, and each feature consists of a single number. That number equals its own mean, so subtracting the mean gives zero: [[0., 0.]].
On the other hand, if you pass the data like [[58],[144000]], it is a (2,1) array, which means 2 samples and 1 feature. The scaler then standardizes that one feature and gives the result you may have expected: [[-1],[1]].
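import numpy as np
from sklearn.preprocessing import StandardScaler
# Standardize by hand: subtract the mean, divide by the standard deviation
x = [58,144000]
mu = np.mean(x)
sigma = np.std(x)
print([((58 - mu) / sigma),((144000 - mu) / sigma)]) # [-1.0, 1.0]
# Same data as 2 samples of 1 feature
print(StandardScaler().fit_transform([[58],[144000]])) # [[-1.] [ 1.]]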
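For the prediction use case in the question (scaling a new input the same way as the training data), the usual pattern is to fit the scaler once on the training data and then call transform, not fit_transform, on the new row. A minimal sketch follows; the training values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: 4 samples, 2 features
X_train = np.array([[25, 50000], [40, 90000], [58, 144000], [33, 72000]], dtype=float)

# Learn the per-feature mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)

# Reuse those statistics for a new input; do NOT refit on the single row
X_new = np.array([[58, 144000]], dtype=float)
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```

Calling fit_transform on the single row would re-estimate the mean from that row alone, which is what produces the zeros.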

tensorflow_probability: Gradients always zero when backpropagating the log_prob of a sample of a normal distribution

As part of a project I am having trouble with the gradients of a normal distribution with tensorflow_probability. For this I create a normal distribution of which a sample is drawn. The log_prob of this sample shall then be fed into an optimizer to update the weights of network.
If I get the log_prob of some constant I always get non-zero gradients. Unfortunately I have not found any relevant help in tutorials or similar sources of help.
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# TensorFlow 1.x graph mode (tf.Session)
def get_log_prob(mu, std):
    normal = tfd.Normal(mu, scale=std)
    samples = normal.sample(sample_shape=(1))
    log_prob = normal.log_prob(samples)
    return log_prob

const = tf.constant([0.1], dtype=np.float32)
log_prob = get_log_prob(const, 0.01)
grads = tf.gradients(log_prob, const)
with tf.Session() as sess:
    gradients = sess.run([grads])
print('gradients', gradients)
Output: gradients [array([0.], dtype=float32)]
I expect to get non-zero gradients when computing the gradient of the log_prob of a sample. Instead the output is always "0."
This is a consequence of TensorFlow Probability implementing reparameterization gradients (a.k.a. the "reparameterization trick"), and it is in fact the correct answer in certain situations. Let me show you how that 0. answer comes about.
One way to generate a sample from a normal distribution with some location and scale is to first generate a sample from a standard normal distribution (this is usually some library provided function, e.g. tf.random.normal in TensorFlow) and then shift and scale it. E.g. let's say the output of tf.random.normal is z. To get a sample x from the normal distribution with location loc and scale scale, you'd do: x = z * scale + loc.
Now, how does one compute the value of the probability density of a number under the normal distribution? One way to do it is to reverse that transformation, so that you're now dealing with a standard normal distribution, and then compute the log-probability density there. I.e. log_prob(x) = log_prob_std_normal((x - loc) / scale) + f(scale) (the f(scale) term comes from the change of variables involved in the transformation; its form doesn't matter for this explanation).
If you now plug the first expression into the second, you get log_prob(x) = log_prob_std_normal(z) + f(scale), i.e. loc cancels entirely! As a result, the gradient of log_prob with respect to loc is exactly 0. This also explains why you don't get 0. if you evaluate the log probability at a constant: the forward transformation used to create the sample is missing, so you get some (typically) non-zero gradient.
So, when is this the correct behavior? The reparameterization gradients are correct when you're computing the gradient of an expectation of a function under that distribution with respect to the distribution's parameters. One way to compute such an expectation is a Monte Carlo approximation, like so: tf.reduce_mean(g(dist.sample(N)), axis=0). It sounds like that's what you're doing (where your g() is log_prob()), so it looks like the gradients are correct.
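The cancellation of loc can be checked numerically without TensorFlow. The sketch below is an illustration in plain NumPy, not TFP code: it writes out the normal log-density by hand and uses central finite differences in place of automatic differentiation. The helper names and the constant 0.11 are made up for the example.

```python
import numpy as np

def normal_log_prob(x, loc, scale):
    # Log-density of N(loc, scale^2) evaluated at x
    return -0.5 * ((x - loc) / scale) ** 2 - np.log(scale) - 0.5 * np.log(2 * np.pi)

scale = 0.01
z = np.random.default_rng(0).standard_normal()  # fixed standard-normal draw

def log_prob_of_reparam_sample(loc):
    # The sample itself depends on loc: x = z * scale + loc
    x = z * scale + loc
    return normal_log_prob(x, loc, scale)

def log_prob_of_constant(loc, x_const=0.11):
    # The evaluation point does not depend on loc
    return normal_log_prob(x_const, loc, scale)

def central_diff(f, loc, eps=1e-6):
    return (f(loc + eps) - f(loc - eps)) / (2 * eps)

loc = 0.1
grad_sample = central_diff(log_prob_of_reparam_sample, loc)
grad_const = central_diff(log_prob_of_constant, loc)
print(grad_sample)  # numerically zero: loc cancels
print(grad_const)   # about (x_const - loc) / scale**2 = 100
```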

Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I would like to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the prediction and P(Y|X) as output. Any idea how to do this?
There is no probability in regression. In regression the only output you get is a predicted value, which is why it is called regression, so a regressor cannot give you the probability of a prediction. That exists only in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time based phenomenon. Specifically, you have the temperature (y) after (x) time (in sec for instance) inside an oven. At x = 0s it is at 20°C, and you start heating it, and want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
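One way to get an interval that depends on the input is quantile regression. The sketch below is an assumption-laden illustration: it swaps in scikit-learn's GradientBoostingRegressor with loss="quantile" instead of XGBoost, uses synthetic data whose noise grows with x, and strictly speaking produces a prediction interval, not a confidence interval.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic heteroscedastic data: the noise level grows with x
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5 + 0.3 * X[:, 0])

def fit_quantile(alpha):
    # One model per quantile; alpha is the target quantile level
    model = GradientBoostingRegressor(
        loss="quantile", alpha=alpha, n_estimators=200, random_state=0
    )
    return model.fit(X, y)

lower, median, upper = (fit_quantile(a) for a in (0.05, 0.5, 0.95))

X_new = np.array([[1.0], [9.0]])
lo = lower.predict(X_new)
hi = upper.predict(X_new)
width = hi - lo  # approximate 90% interval width at each new input
print(median.predict(X_new), lo, hi)
```

Because each quantile is fitted separately, the interval can be narrow where the data are tight and wide where they are noisy, which is the situation heteroscedasticity describes.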
You can probably try to get the distribution of your known outputs and compare the prediction on that curve, and check the pvalue. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d
N = 1000 # The number of sample
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)
# We want to get a normed histogram (since this is PDF, if we integrate
# it must be equal to 1)
nbins = N / 10
n = int(N / nbins)
# 'normed' has been removed from NumPy and Matplotlib; 'density' is the replacement
p, x = np.histogram(outputs, bins=n, density=True)
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0])/2 # converting bin edges to centers
# Now we want to interpolate :
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
plt.show()
# To check :
area = integrate.quad(f, x[0], x[-1])
print(area) # (should be close to 1)
Now, the interpolation method is not great for outliers. If a predicted value is extremely far (more than 3 times the std) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in that time. I'm sure there are some better ways to do it. If your data follow a normal law, it becomes trivial.
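For the normal case mentioned above, a minimal sketch might look like the following. It assumes the known outputs really are roughly Gaussian, uses synthetic outputs as a stand-in, and the helper two_sided_p_value is a hypothetical name. As with the histogram approach, it ignores the inputs entirely.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the known outputs
rng = np.random.default_rng(0)
outputs = rng.normal(loc=0.0, scale=1.0, size=1000)

# Maximum-likelihood estimates of the mean and standard deviation
mu_hat, sigma_hat = stats.norm.fit(outputs)

def two_sided_p_value(prediction):
    # Probability of an output at least this far from the mean under the fitted normal
    z = abs(prediction - mu_hat) / sigma_hat
    return 2 * stats.norm.sf(z)

print(two_sided_p_value(mu_hat + 3 * sigma_hat))  # about 0.0027
```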
I suggest you look into NGBoost (a gradient-boosting library in the same family as XGBoost, but a separate method, which yields a probabilistic model).
Here you can find slides on how NGBoost works and the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (Gaussian by default) and fit a boosted ensemble that estimates the best parameters of that distribution (for the Gaussian, $\mu$ and $\sigma$). The model splits the feature space into regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you can use the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.

Python: Automatically choose parameters for ARMA model

I am trying to fit an ARMA model to time series data. I haven't found any function that can choose the parameters automatically. Below is the code I wrote; as I am a beginner in Python, I believe it can be optimised.
Can someone give me some ideas on how to:
vectorize the double loop
choose the parameters more quickly
Much appreciated.
import numpy as np
import statsmodels.api as sm
parameter_bound = 3
# Creating a 2-D array, storing the residuals of two different parameters of ARMA model
residuals = [[0 for x in range(parameter_bound)] for x in range(parameter_bound)]
model = [[0 for x in range(parameter_bound)] for x in range(parameter_bound)]
# Calculate residuals for each parameter combination
for i in range(parameter_bound):
    for j in range(parameter_bound):
        # sm.tsa.ARMA was removed in statsmodels 0.13; ARIMA with d=0 is an ARMA(i, j) model
        model[i][j] = sm.tsa.ARIMA(input_data, order=(i, 0, j)).fit()
        residuals[i][j] = sum(abs(model[i][j].resid))
# Find the parameters with lowest residuals
# np.argmin returns an index into the flattened array, so split it into row and column
parameters = np.argmin(residuals)
parameter1, parameter2 = divmod(parameters, parameter_bound)
# Use the model with lowest residuals to get prediction data
prediction = model[parameter1][parameter2].resid + input_data
I'm not sure exactly what you're expecting, but you could replace your lists with numpy arrays (I don't think it'll improve your specific code):
import numpy as np
residuals = np.zeros((parameter_bound, parameter_bound))
model = np.zeros((parameter_bound, parameter_bound), dtype=object)
Also, be aware that np.argmin with axis=None returns an index into the flattened array. If you want the model with the lowest residuals, you might try:
prediction = model.ravel()[np.argmin(residuals)].resid + input_data
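To recover the (row, column) pair itself, i.e. the two model orders, the flat index can be converted back with np.unravel_index. A small sketch with made-up residual sums:

```python
import numpy as np

# Hypothetical residual sums for a 3 x 3 grid of (p, q) orders
residuals = np.array([
    [5.0, 3.2, 4.1],
    [2.7, 1.9, 3.3],
    [2.8, 2.5, 6.0],
])

flat_index = np.argmin(residuals)                     # index into the flattened array
p, q = np.unravel_index(flat_index, residuals.shape)  # back to (row, column)
print(flat_index, p, q)  # 4 1 1
```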
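You can use the Ljung-Box test:
# statsmodels >= 0.13 returns a DataFrame; lags=[10] is an arbitrary choice of lag
result = sm.stats.acorr_ljungbox(model[i][j].resid, lags=[10])
pvalue = result["lb_pvalue"].iloc[0]
# H0: the residuals are not autocorrelated. A p-value above 0.05 means H0 is not
# rejected, so the residuals look like white noise and the model is a candidate
if pvalue > 0.05:
    use_parameters = ...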

Creating data that follows a specific data distribution [duplicate]

This question already has answers here:
Generate random numbers replicating arbitrary distribution
(4 answers)
Closed 8 years ago.
I have a variable x of 2700 points. It is my original data.
The HISTOGRAM of my data looks like this. The cyan color line is the distribution my data follows. I used curve_fit to my histogram and obtained the fitted curve. The fitted curve is a numpy array of 100000 points.
I want to generate smoothed random data, say 100000 points, that follows the DISTRIBUTION of my original data, i.e. in principle I want 100000 points below the fitted curve, starting from 0.0 and increasing in the same way as the curve up to 0.5.
What I have tried so far to get 100000 points below the curve is:
I generated uniform random numbers using np.random.uniform(0,0.5,100000)
random_x = []
u = np.random.uniform(0,0.5,100000)
for i in u:
    if i<=y_ran: # here y_ran is the numpy array of the fitted curve
        random_x.append(i)
But I get the error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I know the above code is not the proper one, but how should I proceed further?
I would approach the problem in the following way: first, fit your y_ran fitted curve to a gaussian (see for instance this question), and then draw your sample from a normal distribution with known coefficients by np.random.normal function. Something along these lines will work (in part taken from the answer to the question I'm referring to):
import numpy
from scipy.optimize import curve_fit
# Define model function to be used to fit to the data above:
def gauss(x, *p):
    A, mu, sigma = p
    return A*numpy.exp(-(x-mu)**2/(2.*sigma**2))
# p0 is the initial guess for the fitting coefficients (A, mu and sigma above)
p0 = [1., 0., 1.]
coeff, var_matrix = curve_fit(gauss, x, y_ran, p0=p0)
# Only mu and sigma parametrize the normal distribution; the amplitude A is not needed
A, mu, sigma = coeff
sample = numpy.random.normal(mu, abs(sigma), 100000)
Note: 1. this is not tested, 2. you'll need x values for your fitted curve.
Okay, so y_ran is a list of values that defines your curve. If I understand correctly, you want a random dataset that falls underneath your curve. One approach is to start with your curve points, and decrease each of them by some amount; for example, you could just make each new point equal somewhere in the range of 80%-100% of the original.
variation = np.random.uniform(low=.8, high=1.0, size=len(y_ran))
newData = y_ran * variation
Does that give you someplace to start?
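Another route, not taken from the answers above, is inverse transform sampling, which works for any non-negative curve and does not require it to be Gaussian. The sketch below assumes the fitted curve is sampled on an evenly spaced grid from 0.0 to 0.5. Here x_grid and the Gaussian-shaped y_ran are hypothetical stand-ins for the real fitted curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the fitted curve: 100000 points on [0.0, 0.5]
x_grid = np.linspace(0.0, 0.5, 100_000)
y_ran = np.exp(-((x_grid - 0.2) ** 2) / (2 * 0.05 ** 2))

# Build a normalized cumulative distribution from the curve
cdf = np.cumsum(y_ran)
cdf = cdf / cdf[-1]

# Draw uniform numbers in [0, 1) and map them through the inverse CDF
u = rng.uniform(0.0, 1.0, 100_000)
samples = np.interp(u, cdf, x_grid)
```

A histogram of samples (with density=True) should then follow the shape of y_ran, up to normalization.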
