I'm looking into the QuantileTransformer object in the Scikit-Learn Python library, in an attempt to "uniformize" residuals from an ARIMA model as part of a copula model. The idea is to feed my Kendall correlation matrix of residuals into a Student's t copula, and then apply the reverse transformation of the simulated residuals, in order to get values that are on the scale of the original data.
My question is this: What underlies this mechanism? I'm struggling to understand how the QuantileTransformer uniformizes values without knowledge of the true distribution. How does this happen without a percent point function (PPF)? Or is there an assumed PPF that is simply not stated in the documentation?
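For context, the scikit-learn documentation describes the transformer as using an estimate of each feature's cumulative distribution function, so no parametric PPF is assumed. A minimal sketch of that idea follows, under the assumption that n_quantiles equals the number of samples and the values are all distinct; the variable names and the synthetic Student's t sample are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic heavy-tailed "residuals" (illustrative data, not from the question).
rng = np.random.default_rng(0)
x = rng.standard_t(df=4, size=200).reshape(-1, 1)

# With n_quantiles equal to the sample size, the learned quantiles are
# the sorted sample values themselves.
qt = QuantileTransformer(n_quantiles=200, output_distribution="uniform")
u = qt.fit_transform(x)

# Empirical CDF computed by hand: rank of each value scaled to [0, 1].
ranks = np.argsort(np.argsort(x.ravel()))
u_ranks = (ranks / (len(x) - 1)).reshape(-1, 1)

# The inverse transform interpolates back through the same stored quantiles.
x_back = qt.inverse_transform(u)

print(np.allclose(u, u_ranks, atol=1e-6))
print(np.allclose(x_back, x, atol=1e-6))
```

In this setting the transformed values coincide with the scaled ranks, which suggests the "PPF" being used is the empirical one estimated from the training sample, with linear interpolation between the stored quantiles.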
Problem statement: I'm working with a linear system of equations that correspond to an inverse problem that is ill-posed. I can apply Tikhonov regularization or ridge regression by hand in Python, and get solutions on test data that are sufficiently accurate for my problem. I'd like to try solving this problem using sklearn.linear_model.Ridge, because I'd like to try other machine-learning methods in the linear models part of that package (https://scikit-learn.org/stable/modules/linear_model.html). I'd like to know if using sklearn in this context is using the wrong tool.
What I've done: I read the documentation for sklearn.linear_model.Ridge. Since I know the linear transformation corresponding to the forward problem, I have run it over impulse responses to create training data, and then supplied that data to sklearn.linear_model.Ridge to generate a model. Unlike when I apply the equation for ridge regression myself in Python, the model from sklearn.linear_model.Ridge only works on impulse responses. On the other hand, applying ridge regression using the equations myself generates a model that can be applied to any linear combination of the impulse responses.
Is there a way to apply the linear methods of sklearn, without needing to generate a large test data set that represents the entire parameter space of the problem, or is this requisite for using (even linear) machine learning algorithms?
Should sklearn.linear_model.Ridge return the same results as solving the equation for ridge regression, when the sklearn method is applied to test cases that span the forward problem?
Many thanks to anyone who can help my understanding.
Found the answer through trial and error. Answering my own question in case anyone was thinking like I did and needs clarity.
Yes, if you use training data that spans the problem space, it is the same as running ridge regression in Python using the equations. sklearn does what it says in the documentation.
You need fit_intercept=True (the default in current scikit-learn releases) to get sklearn.linear_model.Ridge to fit the Y-intercept of your problem; with fit_intercept=False the intercept is assumed to be zero.
If you set fit_intercept=False and your problem does NOT have a Y-intercept of zero, you will, of course, get a bad solution.
This might lead a novice like me to the impression that you haven't supplied enough training data, which is incorrect.
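A minimal sketch of both points, on synthetic data that is an assumption of this example (the matrix X, the weights true_w, and alpha are made up for illustration). It compares the closed-form ridge solution with sklearn.linear_model.Ridge when no intercept is fitted, and then shows the effect of fit_intercept on targets that carry a constant offset.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic linear problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.01 * rng.normal(size=50)
alpha = 0.1

# Closed-form ridge / Tikhonov solution: (X^T X + alpha I)^{-1} X^T y.
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Same problem through scikit-learn, with no intercept term.
model = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
coef_matches = np.allclose(model.coef_, w_closed, atol=1e-6)

# Now shift the targets so the true Y-intercept is 10 rather than 0.
y_offset = y + 10.0
no_icpt = Ridge(alpha=alpha, fit_intercept=False).fit(X, y_offset)
with_icpt = Ridge(alpha=alpha, fit_intercept=True).fit(X, y_offset)

# Mean squared error on the training data for each setting.
err_no = np.mean((no_icpt.predict(X) - y_offset) ** 2)
err_with = np.mean((with_icpt.predict(X) - y_offset) ** 2)

print(coef_matches, err_no, err_with)
```

The large error of the fit_intercept=False model on the shifted targets comes from the missing intercept, not from a shortage of training data.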
I'm doing a regression with the support vector machine from sklearn and I need to calculate the deviation from the real value for each point. I only found the method "score" to calculate the overall performance of the model. Is there another method that will give me the error for each data point? Thanks in advance :)
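One way to get per-point errors, sketched here on made-up data, is to call predict and subtract the result from the true values. The SVR hyperparameters below are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic one-dimensional regression data (illustrative only).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=40)

model = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, y)

# predict returns one value per row of X, so the difference is a per-point error.
y_pred = model.predict(X)
errors = y - y_pred          # signed deviation for each data point
abs_errors = np.abs(errors)  # magnitude of the deviation

print(errors[:5])
print(abs_errors.max())
```

The same subtraction works on held-out data: pass the test inputs to predict and compare against the test targets.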
I'm running a linear model using statsmodels.formula.api.ols after scaling the continuous input variables using sklearn.preprocessing.StandardScaler.
So everything works ok during the model building. My issue is more of a technical one after the fitting is done and I want to convert the fitted coefficients back into the space where they can be used with un-scaled input.
I know how StandardScaler works: scaled = (x - x_mean) / sqrt(x_var). So for any given scaled x value, I can calculate the transformation back (by hand if I wanted to): x = scaled * sqrt(x_var) + x_mean.
What I'm unsure of is how to apply this type of reverse scaling to the fitted "scaled" linear coefficients.
Can anyone explain how to do this?
As a side note, I'd really love to use sklearn.linear_model.LinearRegression with normalize=True in a pipeline... in this way the coefficients are scaled back for you (it's much more elegant and streamlined)... alas, I really need the rich statistical output of the statsmodels ols method so that I can make an ANOVA analysis after the fit.
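The algebra behind the back-transformation can be sketched as follows. Substituting scaled = (x - x_mean) / sqrt(x_var) into the fitted model gives slope_orig = slope_scaled / sqrt(x_var) for each variable, and intercept_orig = intercept_scaled - sum(slope_scaled * x_mean / sqrt(x_var)). The example below checks this on synthetic data, using numpy.linalg.lstsq as a stand-in for the OLS fit; the helper ols is hypothetical, and the same arithmetic should apply to the parameters returned by statsmodels.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data with two predictors on different scales (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, -3.0], scale=[2.0, 0.5], size=(100, 2))
y = 4.0 + X @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=100)

scaler = StandardScaler().fit(X)
Z = scaler.transform(X)

def ols(design, target):
    """Hypothetical helper: ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(design)), design])
    params, *_ = np.linalg.lstsq(A, target, rcond=None)
    return params[0], params[1:]

# Fit on the scaled predictors.
b0_scaled, b_scaled = ols(Z, y)

# Convert back to the original units.
# scaler.scale_ is sqrt(x_var); scaler.mean_ is x_mean.
b_orig = b_scaled / scaler.scale_
b0_orig = b0_scaled - np.sum(b_scaled * scaler.mean_ / scaler.scale_)

# Reference: fit directly on the unscaled predictors.
b0_direct, b_direct = ols(X, y)

print(np.allclose(b_orig, b_direct), np.isclose(b0_orig, b0_direct))
```

Only the coefficients of the scaled variables are divided by their standard deviations; the intercept absorbs the shift from the means.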
In R, when doing linear modeling I often use the residualPlots function from the car library to plot my model residuals against my fitted values and against each numeric/applicable regressor. This is done all at once with a single function call. For example,
Is there an equivalent library in Python that plots residuals against fitted values and against my predictors, all at once?
I'm aware I can plot small multiples in matplotlib using a loop, but I'm looking for something that does it in one go. I know statsmodels.graphics.plot_partregress does this for partial regression plots, but I haven't been able to find its equivalent just for straight residuals. Integration with statsmodels and the ability to compute other residuals (externally studentized) as part of plotting would be a big bonus.
Given time-series data, I want to find the best fitting logarithmic curve. What are good libraries for doing this in either Python or SQL?
Edit: Specifically, what I'm looking for is a library that can fit data resembling a sigmoid function, with upper and lower horizontal asymptotes.
If your data were categorical, then you could use a logistic regression to fit the probabilities of belonging to a class (classification).
However, I understand you are trying to fit the data to a sigmoid curve, which means you just want to minimize the mean squared error of the fit.
I would redirect you to the SciPy function called scipy.optimize.leastsq: it is used to perform least squares fits.
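A minimal sketch of such a fit with scipy.optimize.leastsq, assuming a four-parameter sigmoid with lower and upper horizontal asymptotes. The function names, parameter names, and synthetic data are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import leastsq

def sigmoid(t, lower, upper, rate, midpoint):
    """Sigmoid rising from the `lower` asymptote to the `upper` asymptote."""
    return lower + (upper - lower) / (1.0 + np.exp(-rate * (t - midpoint)))

def residuals(params, t, y):
    """Difference between observations and the model; leastsq minimizes its sum of squares."""
    return y - sigmoid(t, *params)

# Synthetic time series with a little noise (illustrative only).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 60)
true_params = np.array([1.0, 5.0, 1.5, 4.0])
y = sigmoid(t, *true_params) + 0.02 * rng.normal(size=t.size)

# Rough starting values: data range for the asymptotes, middle of the time axis.
initial_guess = [y.min(), y.max(), 1.0, np.median(t)]
fitted_params, status = leastsq(residuals, initial_guess, args=(t, y))

# Root-mean-square error of the fitted curve.
rms = np.sqrt(np.mean(residuals(fitted_params, t, y) ** 2))
print(fitted_params, status, rms)
```

A reasonable starting guess matters for this kind of nonlinear fit; taking the asymptotes from the minimum and maximum of the data is a simple heuristic rather than a guarantee of convergence.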