Python - vectorized operation with scipy.stats.hypergeom

I have some intensive calculations to do involving the hypergeometric distribution, and I'd like to know whether there is an implementation of this function that supports 'real' vectorized calculation. (Q1)
By 'real' I mean: not using np.vectorize, which seems to be implemented as a for loop.
I know the R implementation of the hypergeometric law allows that kind of computation, which makes the code much more efficient.
If it helps: I'm trying to port an algorithm developed in R to Python, to compute exact confidence intervals for hypergeometric parameters. The concept was developed in the following paper:
http://www.wright.edu/~weizhen.wang/paper/37-2015jasa_wang.pdf
The solution involves iterating over all possible combinations of parameters to select the subset that satisfies a specific condition.
(Q2) As an alternative, would it be possible to create my own UDF using the definition of the law and the np.choose method, but applied to vectors?
(Q3) As a final option, is it possible to call my R script from Python?
Thank you !
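On Q1 and Q2, a minimal sketch rather than a definitive answer: as far as I know, the methods of scipy.stats.hypergeom broadcast over array arguments, so np.vectorize is not needed. The second function is a hand-rolled PMF built from log-binomial coefficients with scipy.special.gammaln (np.choose does indexing, not binomial coefficients). The names log_binom and hypergeom_pmf_manual, and the population size of 50 with 10 draws, are assumptions made for illustration.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import hypergeom


def log_binom(a, b):
    """Elementwise log of the binomial coefficient C(a, b)."""
    return gammaln(a + 1) - gammaln(b + 1) - gammaln(a - b + 1)


def hypergeom_pmf_manual(k, M, n, N):
    """P(X = k) using scipy's convention: M = population size,
    n = number of success states, N = number of draws."""
    k, M, n, N = np.broadcast_arrays(
        *(np.asarray(a, dtype=float) for a in (k, M, n, N))
    )
    # Support of the distribution: max(0, N - (M - n)) <= k <= min(n, N).
    lo = np.maximum(0, N - (M - n))
    hi = np.minimum(n, N)
    inside = (k >= lo) & (k <= hi)
    # Outside the support, evaluate at a valid k and zero the result afterwards.
    k_safe = np.where(inside, k, lo)
    log_p = log_binom(n, k_safe) + log_binom(M - n, N - k_safe) - log_binom(M, N)
    return np.where(inside, np.exp(log_p), 0.0)


# Every k in 0..10 against every n in 1..49, in one call each (no Python loop).
k = np.arange(0, 11)[:, None]
n = np.arange(1, 50)[None, :]
pmf_scipy = hypergeom.pmf(k, 50, n, 10)
pmf_manual = hypergeom_pmf_manual(k, 50, n, 10)
```

Both results have shape (11, 49), one column per candidate value of n, which is the kind of grid a search over all parameter combinations would need.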

Related

Is there any built-in function in python to minimize the sum of all y-deviations squared (Least-Square)?

The task is to get datasets as well as ideal functions through csv and choose ideal functions on the basis of how they minimize the sum of all y-deviations squared (Least-Square).
I am basically confused about this task, so can someone explain it to me? I am trying to learn Python.
No, there is not a built-in function to do that. There is an implementation here that uses numpy, pandas and matplotlib.
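To make the selection criterion concrete, here is a small sketch with NumPy only. It assumes the training values and the candidate ideal functions are already loaded (for example with pandas.read_csv) and sampled at the same x points; best_ideal_function and the toy data are hypothetical names and values, not part of the task description.

```python
import numpy as np


def best_ideal_function(y_train, ideal):
    """Return (column index, sum of squared deviations) of the column
    of `ideal` that is closest to y_train in the least-squares sense."""
    sse = ((ideal - y_train[:, None]) ** 2).sum(axis=0)
    best = int(np.argmin(sse))
    return best, float(sse[best])


# Toy data: three candidate functions evaluated at the same x values.
x = np.linspace(-2.0, 2.0, 9)
y_train = x ** 2 + 0.1
ideal = np.column_stack([x, x ** 2, x ** 3])

best, sse = best_ideal_function(y_train, ideal)
```

Here the second column (x squared) is selected, since it differs from the training data only by the constant offset.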

How to check in easy way non-linear relationships using Python?

I have a dataset in a pandas DataFrame. I built a function which returns a DataFrame that looks like this:
Feature_name_1 Feature_name_2 corr_coef p-value
ABC DCA 0.867327 0.02122
So it takes independent variables and returns their correlation coefficient.
Is there an easy way to check for a non-linear relationship in the same way?
In the case above I used SciPy's Pearson correlation, but I cannot find how to check for non-linearity. I have found only more sophisticated methods, and I would like something as easy to implement as the above.
It is enough if the method is easy to implement; it does not have to come from scipy or any other specific package.
Regress your dependent variables on your independent variables and examine the residuals. If your residuals show a pattern, there is likely a nonlinear relationship.
It may also be the case that your model is missing a cross term or could benefit from a transformation or something along those lines. I might be wrong, but I'm not aware of a cut-and-dried test for nonlinearity.
A quick Google search returned this, which seems like it might be useful for you:
https://stattrek.com/regression/residual-analysis.aspx
Edit: Per the comment below, this is a very general method that helps verify the linear regression assumptions.
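One way to turn the residual check into a number, as a rough sketch and not a formal test: fit a straight line, then see how much of the residual variance a quadratic in x can explain. The function name residual_curvature and the synthetic data are assumptions for illustration; this only detects curvature-like patterns, and plotting the residuals remains the more general check.

```python
import numpy as np


def residual_curvature(x, y):
    """R^2 of a quadratic fitted to the residuals of a straight-line fit.
    Near 0: no obvious curvature left; near 1: strong nonlinear pattern."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    quad = np.polyval(np.polyfit(x, resid, 2), x)
    ss_res = np.sum((resid - quad) ** 2)
    ss_tot = np.sum((resid - resid.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)
noise = rng.normal(0.0, 0.1, x.size)

score_linear = residual_curvature(x, 2.0 * x + 1.0 + noise)  # truly linear data
score_curved = residual_curvature(x, x ** 2 + noise)         # quadratic data
```

Note that the quadratic example has a Pearson correlation close to zero, so the correlation table alone would miss it.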

parameter within an interval while optimizing

I usually use Mathematica but am now trying to shift to Python, so this question might be a trivial one; sorry about that.
Anyway, is there any built-in function in Python similar to the function named Interval[{min,max}] in Mathematica? The link is: http://reference.wolfram.com/language/ref/Interval.html
What I am trying to do is minimize a function, but it is a constrained minimization: the parameters of the function are only allowed within some particular interval.
For a very simple example, let's say f(x) is a function with parameter x and I am looking for the value of x which minimizes the function, but x is constrained to an interval (min,max). [Obviously the actual problem is not one-dimensional but a multi-dimensional optimization, so different parameters may have different intervals.]
Since it is an optimization problem, of course I do not want to pick the parameter randomly from an interval.
Any help will be highly appreciated, thanks!
If it's a highly non-linear problem, you'll need to use an algorithm such as the Generalized Reduced Gradient (GRG) Method.
The idea of the generalized reduced gradient algorithm (GRG) is to solve a sequence of subproblems, each of which uses a linear approximation of the constraints. (Ref)
You'll need to ensure that certain conditions known as the KKT conditions are met, etc. but for most continuous problems with reasonable constraints, you'll be able to apply this algorithm.
This is a good reference for such problems with a few examples provided. Ref. pg. 104.
Regarding implementation:
While I am not familiar with Python, I have built solver libraries in C++ using templates as well as using function pointers so you can pass on functions (for the objective as well as constraints) as arguments to the solver and you'll get your result - hopefully in polynomial time for convex problems or in cases where the initial values are reasonable.
If an ability to do that exists in Python, it shouldn't be difficult to build a generalized GRG solver.
The Python Solution:
Edit: Here is the python solution to your problem: Python constrained non-linear optimization
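For plain interval constraints on each parameter, a minimal sketch with scipy.optimize.minimize and its bounds argument may be enough. Note this uses L-BFGS-B, not GRG, and the objective and intervals below are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize


def objective(p):
    # Unconstrained minimum is at (2, -1), which lies outside the box below.
    return (p[0] - 2.0) ** 2 + (p[1] + 1.0) ** 2


# One (min, max) pair per parameter, playing the role of Interval[{min, max}].
bounds = [(0.0, 1.0), (0.0, 3.0)]

result = minimize(objective, x0=np.array([0.5, 1.5]), method="L-BFGS-B", bounds=bounds)
```

The optimizer stops on the edge of the box, at roughly (1, 0), because the unconstrained minimum is not allowed.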

Python: matrix completion functions/library?

Are there functions in Python that will fill in missing values in a matrix for you using collaborative filtering (e.g. an alternating minimization algorithm)? Or does one need to implement such functions from scratch?
[EDIT]: This isn't a matrix-completion example, but to illustrate a similar situation: I know there is an svd() function in Matlab that takes a matrix as input and automatically outputs its singular value decomposition (SVD). I'm looking for something like that in Python, hopefully a built-in function, but even a good library would be great.
Check out numpy's linalg library to find a Python SVD implementation.
There is a library, fancyimpute. Also, sklearn's NMF.
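As an illustration of how numpy.linalg.svd can be used for completion, here is a sketch of a simple iterative low-rank imputation: fill the gaps, take a truncated SVD, overwrite only the missing entries with the low-rank reconstruction, and repeat. This is not necessarily what fancyimpute does internally; the function name svd_impute, the rank, the iteration count, and the toy matrix are all assumptions.

```python
import numpy as np


def svd_impute(X, rank, n_iter=500):
    """Fill NaN entries of X by repeated rank-`rank` SVD reconstruction."""
    missing = np.isnan(X)
    # Start from the column means of the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed entries fixed; update only the missing ones.
        filled = np.where(missing, low_rank, X)
    return filled


# Rank-1 toy matrix with two entries removed.
truth = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
observed = truth.copy()
observed[0, 0] = np.nan
observed[2, 1] = np.nan

completed = svd_impute(observed, rank=1)
```

On real data the rank is unknown and the observations are noisy, so the rank and number of iterations would have to be chosen by validation.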

Constrained least-squares estimation in Python

I'm trying to perform a constrained least-squares estimation using Scipy such that all of the coefficients are in the range (0,1) and sum to 1 (this functionality is implemented in Matlab's LSQLIN function).
Does anybody have tips for setting up this calculation using Python/SciPy? I believe I should be using scipy.optimize.fmin_slsqp(), but I am not entirely sure what parameters I should be passing to it.[1]
Many thanks for the help,
Nick
[1] The one example in the documentation for fmin_slsqp is a bit difficult for me to parse without the referenced text -- and I'm new to using Scipy.
scipy-optimize-leastsq-with-bound-constraints on SO gives leastsq_bounds, which is leastsq with bound constraints such as 0 <= x_i <= 1.
The constraint that they sum to 1 can be added in the same way.
(I've found leastsq_bounds / MINPACK to be good on synthetic test functions in 5d, 10d, 20d; how many variables do you have?)
Have a look at this tutorial, it seems pretty clear.
Since MATLAB's lsqlin is a bounded linear least squares solver, you would want to check out scipy.optimize.lsq_linear.
Non-negative least squares optimization using scipy.optimize.nnls is a robust way of doing it. Note that if the coefficients are constrained to be non-negative and to sum to unity, they are automatically limited to the interval [0,1]; that is, one need not additionally constrain them from above.
scipy.optimize.nnls automatically makes the variables non-negative using the Lawson and Hanson algorithm, whereas the sum constraint can be taken care of as discussed in this thread and this one.
SciPy's nnls uses an old Fortran backend, which is apparently widely used in equivalent implementations of nnls by other software.
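For the original question (coefficients in [0, 1] that sum to 1), a minimal sketch using scipy.optimize.minimize with method="SLSQP", which is the newer interface to the same solver as fmin_slsqp. The design matrix A, the target b, and the "true" weights are synthetic, made up only to show how the bounds and the equality constraint are passed.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic problem: b is an exact mixture of the columns of A.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
true_w = np.array([0.2, 0.3, 0.5])
b = A @ true_w


def sum_of_squares(w):
    r = A @ w - b
    return r @ r


result = minimize(
    sum_of_squares,
    x0=np.full(3, 1.0 / 3.0),                # feasible starting point
    method="SLSQP",
    bounds=[(0.0, 1.0)] * 3,                 # 0 <= w_i <= 1
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # sum(w) = 1
)
weights = result.x
```

Because the problem is a small, well-conditioned quadratic, the recovered weights should be close to true_w; for large problems a dedicated solver such as lsq_linear or nnls may scale better.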
