How to check for non-linear relationships in an easy way using Python? - python

I have a dataset in a pandas DataFrame. I built a function which returns a dataframe that looks like this:
Feature_name_1 Feature_name_2 corr_coef p-value
ABC DCA 0.867327 0.02122
So it takes independent variables and returns their correlation coefficient.
Is there any easy way I can check for non-linear relationships in the same manner?
In the above case I used scipy's Pearson correlation, but I cannot find how to check for non-linearity. I found only more sophisticated methods, and I would like to have something easy to implement, as above.
It will be enough if the method is easy to implement; it doesn't necessarily have to be from scipy or other specific packages.

Regress your dependent variables on your independent variables and examine the residuals. If the residuals show a pattern, there is likely a nonlinear relationship.
It may also be the case that your model is missing a cross term or could benefit from a transformation, or something along those lines. I might be wrong, but I'm not aware of a cut-and-dried test for non-linearity.
A quick Google search returned this, which seems like it might be useful for you:
https://stattrek.com/regression/residual-analysis.aspx
Edit: Per the comment below, this is a very general method that helps verify the linear regression assumptions.
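A minimal sketch of that residual check, using made-up quadratic data; correlating the residuals against an x² probe term is just one convenient way to quantify the pattern, not part of the original answer:

```python
import numpy as np
from scipy import stats

# Made-up data with a clearly non-linear (quadratic) relationship
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = x ** 2 + rng.normal(scale=0.5, size=x.size)

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Plotting residuals vs. x is the usual check; as a cheap numeric probe,
# correlate the residuals against a candidate non-linear term (here x^2).
# A strong correlation suggests the linear model missed that structure.
r, p = stats.pearsonr(x ** 2, residuals)
```

Here r should come out close to 1 because this data really is quadratic; for a genuinely linear relationship the residuals would look like unstructured noise and r would hover near 0.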

Related

A Python module that can perform fitting with asymmetrical X and Y errors, and introduce bounds on fitted parameters?

I am fitting datasets with some broken power laws; the data has asymmetrical errors in X and Y, and I'd like to be able to introduce constraints on the fitted parameters (i.e. not below 0, or within a certain range).
Using Scipy.ODR, I can fit the data well, including the asymmetrical errors on both axes; however, I can't seem to find any way in the documentation to introduce bounds on my fitted parameters, and discussions online seem to suggest this is flat-out impossible with this module: https://stackoverflow.com/a/17786438/19086741
Using Lmfit, I can also fit the data well and can introduce bounds on the fitted parameters. However, discussions online once again state that Lmfit is not able to handle asymmetrical errors, or errors on both axes.
Is there some module, or perhaps I am missing something with one of these modules, that would allow me to meet both of my requirements in this case? Many thanks.
Sorry, I don't have a good answer for you. As you note, Lmfit does not support ODR regression, which allows for uncertainties in the (single) independent variable as well as uncertainties in the dependent variable.
I think this would be possible in principle. Unfortunately, ODR has a very different interface from the other minimization routines, making a wrapper as "another possible solving algorithm for lmfit" a bit challenging. I am sure that none of the developers would object to someone trying this, but it would take some effort.
FWIW, you say "both axes" as if you are certain there are exactly 2 axes. ODR supports exactly 1 independent variable; lmfit is not limited to this assumption.
You also say that lmfit cannot handle asymmetric uncertainties. That is only partially true. The lmfit.Model interface allows only a single uncertainty value per data point. But with the lmfit.minimize interface, you write your own objective function to calculate the array you want minimized, and so you can weight the residual of "data" and "model" any way you want.
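The "write your own objective" idea can be sketched with scipy.optimize.least_squares instead of lmfit; the data, the linear model, and the per-side sigma arrays below are all invented for illustration:

```python
import numpy as np
from scipy.optimize import least_squares

# Invented example data with asymmetric y-uncertainties
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma_lo = np.array([0.2, 0.3, 0.2, 0.4, 0.3])  # uncertainty below each point
sigma_hi = np.array([0.5, 0.4, 0.6, 0.5, 0.7])  # uncertainty above each point

def model(params, x):
    a, b = params
    return a * x + b

def objective(params):
    r = y - model(params, x)
    # Weight each residual by the uncertainty on the side the model
    # falls on: model above the point (r < 0) -> sigma_hi, below -> sigma_lo
    sigma = np.where(r < 0, sigma_hi, sigma_lo)
    return r / sigma

fit = least_squares(objective, x0=[1.0, 0.0])
a, b = fit.x
```

The same objective function would drop straight into lmfit.minimize with a Parameters object (which is where the bounds come in); this sketch only shows the asymmetric-weighting part.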

Find the appropriate polynomial fit for data in Python

Is there a function or library in Python to automatically compute the best polynomial fit for a set of data points? I am not really interested in the ML use case of generalizing to a set of new data; I am just focusing on the data I have. I realize the higher the degree, the better the fit. However, I want something that penalizes complexity, or looks at where the error elbows. When I say elbowing, I mean something like this (although usually it is not so drastic or obvious):
One idea I had was to use Numpy's polyfit: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.polyfit.html to compute polynomial regression for a range of orders/degrees. Polyfit requires the user to specify the degree of the polynomial, which poses a challenge because I don't have any assumptions or preconceived notions. The higher the degree of fit, the lower the error will be, but eventually it plateaus, like the image above. Therefore, I want to automatically compute the degree of polynomial where the error curve elbows: if my error is E and d is my degree, I want to maximize (E[d-1] - E[d]) - (E[d] - E[d+1]), i.e. the improvement gained at degree d minus the improvement gained after it.
Is this even a valid approach? Are there other tools and approaches in well-established Python libraries like Numpy or Scipy that can help with finding the appropriate polynomial fit (without me having to specify the order/degree)? I would appreciate any thoughts or suggestions! Thanks!
To select the "right" fit and prevent over-fitting, you can use the Akaike Information Criterion or the Bayesian Information Criterion. Note that your fitting procedure can be non-Bayesian and you can still use these to compare fits. Here is a quick comparison between the two methods.
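A sketch of that idea using only numpy, with the Gaussian-noise form of the BIC computed from the residual sum of squares (the cubic test data is invented):

```python
import numpy as np

# Invented test data: a noisy cubic, so degree 3 should win
rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 100)
y = 1.0 + 2.0 * x - 1.5 * x ** 3 + rng.normal(scale=0.3, size=x.size)

n = x.size
best = None
for degree in range(1, 9):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    k = degree + 1  # number of fitted parameters
    # Gaussian-noise BIC: lower is better; the k*ln(n) term
    # penalizes the extra coefficients of higher degrees
    bic = n * np.log(rss / n) + k * np.log(n)
    if best is None or bic < best[1]:
        best = (degree, bic)

best_degree = best[0]
```

Swapping k*np.log(n) for 2*k gives the AIC instead; the BIC penalizes extra parameters more heavily, so it tends to pick the lower degree when the two disagree.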

parameter within an interval while optimizing

Usually I use Mathematica, but now I am trying to shift to Python, so this question might be a trivial one; I am sorry about that.
Anyway, is there any built-in function in Python which is similar to the function named Interval[{min,max}] in Mathematica? The link is: http://reference.wolfram.com/language/ref/Interval.html
What I am trying to do is: I have a function and I am trying to minimize it, but it is a constrained minimization. By that I mean the parameters of the function are only allowed within some particular interval.
For a very simple example, let's say f(x) is a function with parameter x, and I am looking for the value of x which minimizes the function, but x is constrained within an interval (min, max). [Obviously the actual problem is not one-dimensional but multi-dimensional optimization, so different parameters may have different intervals.]
Since it is an optimization problem, of course I do not want to pick the parameter randomly from an interval.
Any help will be highly appreciated, thanks!
If it's a highly non-linear problem, you'll need to use an algorithm such as the Generalized Reduced Gradient (GRG) Method.
The idea of the generalized reduced gradient algorithm (GRG) is to solve a sequence of subproblems, each of which uses a linear approximation of the constraints. (Ref)
You'll need to ensure that certain conditions known as the KKT conditions are met, etc. but for most continuous problems with reasonable constraints, you'll be able to apply this algorithm.
This is a good reference for such problems with a few examples provided. Ref. pg. 104.
Regarding implementation:
While I am not familiar with Python, I have built solver libraries in C++ using templates as well as function pointers, so you can pass functions (for the objective as well as the constraints) as arguments to the solver and get your result - hopefully in polynomial time for convex problems, or in cases where the initial values are reasonable.
If an ability to do that exists in Python, it shouldn't be difficult to build a generalized GRG solver.
The Python Solution:
Edit: Here is the python solution to your problem: Python constrained non-linear optimization
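For reference, the closest counterpart to Mathematica's Interval[{min, max}] in the scientific Python stack is the bounds argument of scipy.optimize.minimize; a minimal sketch with an invented two-parameter objective:

```python
import numpy as np
from scipy.optimize import minimize

# Invented 2-D objective: unconstrained minimum at (2, -1)
def f(p):
    x, y = p
    return (x - 2.0) ** 2 + (y + 1.0) ** 2

# One (min, max) interval per parameter, like Interval[{min, max}]
bounds = [(0.0, 1.5), (-0.5, 3.0)]

result = minimize(f, x0=[0.5, 0.0], bounds=bounds, method="L-BFGS-B")
x_opt, y_opt = result.x
```

Because the unconstrained minimum lies outside both intervals, the solver lands on the boundary at (1.5, -0.5); each parameter gets its own interval, which matches the multi-dimensional case in the question.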

Python: matrix completion functions/library?

Are there functions in Python that will fill in missing values in a matrix for you, using collaborative filtering (e.g. the alternating minimization algorithm)? Or does one need to implement such functions from scratch?
[EDIT]: Although this isn't a matrix-completion example, just to illustrate a similar situation: I know there is an svd() function in Matlab that takes a matrix as input and automatically outputs the singular value decomposition (SVD) of it. I'm looking for something like that in Python - hopefully a built-in function, but even a good library out there would be great.
Check out numpy's linalg library to find a python SVD implementation
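For the svd() part of the question specifically, numpy's counterpart to Matlab's svd() is numpy.linalg.svd:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Matlab's [U, S, V] = svd(A); note numpy returns V transposed (Vt)
# and the singular values as a 1-D array rather than a diagonal matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruct A from its factors to verify the decomposition
A_reconstructed = U @ np.diag(s) @ Vt
```

With full_matrices=False you get the reduced ("economy") decomposition, which is usually what you want for low-rank work such as matrix completion.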
There is a library fancyimpute. Also, sklearn NMF

Doing model selection in python / scipy

During model selection, the likelihood-ratio test or an analysis using the BIC (Bayesian Information Criterion) is often necessary. While I could definitely do it by hand, I was wondering: are there any scipy functions that are designed to do this?
I am asking this question because I think there should be a way to do this type of analysis, or at least a function to get the likelihood value.
PS: I am not thinking about fitting a single distribution; instead, I am thinking about looking at some 1D data that changes with time (i.e. the model prediction changes with time).
Any help would be appreciated!
Example for this question:
I have some data that looks like this.
And now I have two models: one with four parameters, and another model nested in it with two parameters (the other two fixed).
I want to perform a BIC / likelihood-ratio test to see whether the two free parameters make a significant difference.
In statsmodels you can perform likelihood ratio and Wald tests. Different information criteria are also available for all of the models. There are a few other model selection techniques, but I'm going to need to know a little bit more about what you're doing to give specific answers. Meanwhile, our documentation should help http://statsmodels.sourceforge.net/devel/
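If you do want to do the likelihood-ratio test by hand with scipy alone, here is a sketch assuming Gaussian noise (so the log-likelihood reduces to the residual sum of squares); the data and the two nested models below are invented:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import chi2

# Invented time-series data: decaying exponential plus an offset
rng = np.random.default_rng(2)
t = np.linspace(0.0, 10.0, 200)
data = 3.0 * np.exp(-0.4 * t) + 0.5 + rng.normal(scale=0.1, size=t.size)

# Full model: 4 free parameters
def full_model(t, a, k, c, d):
    return a * np.exp(-k * t) + c + d * t

# Nested model: c and d fixed at 0, leaving 2 free parameters
def nested_model(t, a, k):
    return a * np.exp(-k * t)

popt_full, _ = curve_fit(full_model, t, data, p0=[1.0, 1.0, 0.0, 0.0])
popt_nested, _ = curve_fit(nested_model, t, data, p0=[1.0, 1.0])

rss_full = np.sum((data - full_model(t, *popt_full)) ** 2)
rss_nested = np.sum((data - nested_model(t, *popt_nested)) ** 2)

# Under Gaussian noise the likelihood ratio reduces to n*ln(RSS_0/RSS_1);
# it is asymptotically chi-squared with df = number of fixed parameters (2)
n = t.size
lr_stat = n * np.log(rss_nested / rss_full)
p_value = chi2.sf(lr_stat, df=2)
```

A small p-value means the two extra parameters improve the fit more than chance would predict; the same RSS values also give you the BIC directly via n*ln(RSS/n) + k*ln(n).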
