GSVD for Python: Generalized Singular Value Decomposition

MATLAB has a gsvd function to perform the generalised SVD. Since around 2013 there has been a lot of discussion on GitHub about adding it to SciPy, and some of those pages contain code I could use, such as here, but it is very complicated for a novice like me to get running.
I also found LJWilliams' GitHub page with an implementation. It is of little use as it stands, because it has a lot of bugs when moved to Python 3. I tried correcting the simple ones, such as the assert and print statements, but it quickly gets complicated.
Can someone help me with GSVD code for Python, or show me how to use the versions that are online?
This is what I get with the LJWilliams implementation once the print and assert statements are corrected. The code looks complicated and I am not sure that spending time on it is the best thing to do. Some people have also reported issues on the same GitHub page, and I am not sure whether they are fixed or related.
import numpy as np

n = 10
m = 6
p = 6
A = np.random.rand(m,n)
B = np.random.rand(p,n)
gsvd(A,B)
  File "/home/eghx/agent18/master_thesis/AMfe/amfe/gsvd.py", line 260, in gsvd
    U, V, Z, C, S = csd(Q[0:m,:],Q[m:m+n,:])
  File "/home/eghx/agent18/master_thesis/AMfe/amfe/gsvd.py", line 107, in csd
    Q,R = scipy.linalg.qr(S[q:n,m:p])
  File "/home/eghx/anaconda3/lib/python3.5/site-packages/scipy/linalg/decomp_qr.py", line 141, in qr
    overwrite_a=overwrite_a)
  File "/home/eghx/anaconda3/lib/python3.5/site-packages/scipy/linalg/decomp_qr.py", line 19, in safecall
    ret = f(*args, **kwargs)
ValueError: failed to create intent(cache|hide)|optional array-- must have defined dimensions but got (0,)

If you want to work from the LJWilliams implementation on GitHub, there are a couple of bugs. However, to understand the technique fully, I'd probably recommend having a go at implementing it yourself. I looked up what Octave (the free-software equivalent of MATLAB) does: its "code is a wrapper to the corresponding Lapack dggsvd and zggsvd routines.", which is what SciPy should do, IMHO.
I'll post the bugs I found, but I'm not going to post the code in full working order, because I'm not sure how that stands with regard to copyright, given the copyrighted MATLAB implementation from which it is translated.
Caveat: I am not an expert on the generalised SVD and have approached this only from the perspective of debugging, not of checking whether the underlying algorithm is correct. I have had this working on your original random arrays and on the test case already present in the Python file.
Bugs
Setting k
Around line 63, the conditions for setting k, together with a misunderstanding of NumPy's index lookup (presumably numpy.argwhere, particularly in comparison to MATLAB's find), seem to set k wrong in some circumstances. Change that code to
if q == 1:
    k = 0
elif m < p:
    k = n
else:
    k = max([0,sum((np.diag(C) <= 1/np.sqrt(2)))])
line 79
S[1,1] should be S[0,0], I think (Python 0-indexed arrays)
lines 83 onwards
The numpy matrix slicing around here seems wrong. I got the code working by changing lines 83-95 to read:
UT, ST, VT = scipy.linalg.svd(slice_matrix(S,i,j))
ST = add_zeros(ST,np.zeros([n-k,r-k]))
if k > 0:
    print('Zeroing elements of S in row indices > r, to be replaced by ST')
    S[0:k,k:r] = 0
S[k:n,k:r] = ST
C[:,j] = np.dot(C[:,j],VT)
V[:,i] = np.dot(V[:,i],UT)
Z[:,j] = np.dot(Z[:,j],VT)
i = np.arange(k,q)
Q,R = scipy.linalg.qr(C[k:q,k:r])
C[i,j] = np.diag(diagf(R))
U[:,k:q] = np.dot(U[:,k:q],Q)
in diagp()
There are two matrix multiplications using X*Y that should be np.dot(X,Y) instead (note * is element-wise multiplication in numpy, not matrix multiplication.)
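To make the difference concrete, here is a minimal illustration with two small arrays (the values are arbitrary and not taken from the gsvd code). It assumes plain numpy.ndarray objects, which is what np.random.rand and scipy.linalg return.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Y = np.array([[0.0, 1.0],
              [1.0, 0.0]])

elementwise = X * Y          # multiplies matching entries
matrix_prod = np.dot(X, Y)   # rows of X times columns of Y
same_as_dot = X @ Y          # the @ operator is equivalent for 2-D arrays

print(elementwise)
print(matrix_prod)
```

In MATLAB the roles are reversed (* is the matrix product and .* is element-wise), which is an easy thing to get wrong in a line-by-line translation.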


Python curve fit with change point

As I'm really struggling to get from R code to Python code, I would like to ask for some help. The code I want to use was provided to me on the mathematics forum of Stack Exchange.
https://math.stackexchange.com/questions/2205573/curve-fitting-on-dataset
I do understand what is going on, but I'm having a hard time translating the R code, as I have never worked with R. I have written the function that returns the sum of squares, but I'm stuck on how to use something similar to R's optim function. I also don't really like the guesswork in the initial values. I would prefer to run and re-run some kind of optim function until I get the wanted result, because I need a nearly perfect curve fit.
def model (par,x):
    n = len(x)
    res = []
    for i in range(1,n):
        A0 = par[3] + (par[4]-par[1])*par[6] + (par[5]-par[2])*par[6]**2
        if(x[i] == par[6]):
            res[i] = A0 + par[1]*x[i] + par[2]*x[i]**2
        else:
            res[i] = par[3] + par[4]*x[i] + par[5]*x[i]**2
    return res
This is my model function...
def sum_squares (par, x, y):
    ss = sum((y-model(par,x))^2)
    return ss
And this is the sum of squares
But I have no idea on how to convert this:
#I found these initial values with a few minutes of guess and check.
par0 <- c(7,-1,-395,70,-2.3,10)
sol <- optim(par= par0, fn=sqerror, x=x, y=y)$par
To Python code...
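For reference, R's optim uses the Nelder-Mead simplex method by default, and the closest SciPy counterpart is scipy.optimize.minimize with method="Nelder-Mead". A minimal sketch of that translation follows. The data are synthetic, the parameter values are invented, and the model is a vectorised guess at the piecewise quadratic from the question (0-based parameter indices, breakpoint in par[5]), so all of these are assumptions rather than the original data or R code.

```python
import numpy as np
from scipy.optimize import minimize


def model(par, x):
    # Piecewise quadratic, continuous at the breakpoint par[5] (assumed form).
    A0 = par[2] + (par[3] - par[0]) * par[5] + (par[4] - par[1]) * par[5] ** 2
    left = A0 + par[0] * x + par[1] * x ** 2
    right = par[2] + par[3] * x + par[4] * x ** 2
    return np.where(x <= par[5], left, right)


def sqerror(par, x, y):
    # Scalar objective, the same role as fn=sqerror in the R call.
    return np.sum((y - model(par, x)) ** 2)


# Synthetic stand-in data (not the data from the question).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 360.0, 181)
true_par = np.array([1e-3, -2e-6, 0.5, -5e-4, 1e-6, 200.0])
y = model(true_par, x) + rng.normal(scale=0.005, size=x.size)

# Rough initial guess, the counterpart of par0 in the R snippet.
par0 = np.array([8e-4, -1e-6, 0.45, -4e-4, 8e-7, 180.0])
sol = minimize(sqerror, par0, args=(x, y), method="Nelder-Mead")

print(sol.x)    # counterpart of optim(...)$par
print(sol.fun)  # final sum of squares
```

Extra data are passed through args=(x, y), which plays the role of the x=x, y=y arguments in the optim call.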
I wrote an open source Python package (BSD license) that has a genetic algorithm (Differential Evolution) front end to the scipy Levenberg-Marquardt solver. It functions similarly to what you describe in your question. The github URL is:
https://github.com/zunzun/pyeq3
It comes with a "user-defined function" example that's fairly easy to use:
https://github.com/zunzun/pyeq3/blob/master/Examples/Simple/FitUserDefinedFunction_2D.py
along with command-line, GUI, cluster, parallel, and web-based examples. You can install the package with "pip3 install pyeq3" to see if it might suit your needs.
Seems like I have been able to fix the problem.
def model (par,x):
    n = len(x)
    res = np.array([])
    for i in range(0,n):
        A0 = par[2] + (par[3]-par[0])*par[5] + (par[4]-par[1])*par[5]**2
        if(x[i] <= par[5]):
            res = np.append(res, A0 + par[0]*x[i] + par[1]*x[i]**2)
        else:
            res = np.append(res,par[2] + par[3]*x[i] + par[4]*x[i]**2)
    return res
def sum_squares (par, x, y):
    ss = sum((y-model(par,x))**2)
    print('Sum of squares = {0}'.format(ss))
    return ss
And then I used the functions as follows:
parameter = np.array([0.0,-8.0,0.0018,0.0018,0,200])
res = least_squares(sum_squares, parameter, bounds=(-360,360), args=(x1,y1), verbose=1)
The only problem is that it doesn't produce the results I'm looking for... And that is mainly because my x values are [0,360] and the Y values only vary by about 0.2, so it's a hard nut to crack for this function, and it produces this (poor) result:
[Image: plot of the resulting fit]
I think that the range of x values [0, 360] and y values (which you say is ~0.2) is probably not the problem. Getting good initial values for the parameters is probably much more important.
In Python with numpy / scipy, you would definitely want to not loop over values of x but do something more like
def model(par,x):
    res = par[2] + par[3]*x + par[4]*x**2
    A0 = par[2] + (par[3]-par[0])*par[5] + (par[4]-par[1])*par[5]**2
    # overwrite only the points at or below the breakpoint par[5]
    mask = x <= par[5]
    res[mask] = A0 + par[0]*x[mask] + par[1]*x[mask]**2
    return res
It's not clear to me that that form is really what you want: why should A0 (a value independent of x added to a portion of the model) be so complicated and interdependent on the other parameters?
More importantly, your sum_of_squares() function is actually not what least_squares() wants: you should return the residual array, you should not do the sum of squares yourself. So, that should be
def sum_of_squares(par, x, y):
    return (y - model(par, x))
But most importantly, there is a conceptual problem that is probably going to plague this model: your par[5] is meant to represent a breakpoint where the model changes form. This is going to be very hard for these optimization routines to find. These routines generally make a very small change to each parameter value to estimate the derivative of the residual array with respect to that variable, in order to figure out how to change that variable. With a parameter that is essentially used as an integer, the small change in the initial value will have no effect at all, and the algorithm will not be able to determine the value for this parameter.
With some of the scipy.optimize algorithms (notably leastsq) you can specify a scale for the relative change to make. With leastsq that is called epsfcn. You may need to set this as high as 0.3 or 1.0 for fitting the breakpoint to work. Unfortunately, this cannot be set per variable, only per fit. You might need to experiment with this and other options to least_squares or leastsq.
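The points above can be put together in a short sketch. It returns the residual array rather than its sum of squares and uses the diff_step argument of least_squares, which, unlike epsfcn in leastsq, accepts one relative step per parameter, so only the breakpoint gets a coarse step. The data, the true parameters, and the step value 1e-2 are assumptions chosen for illustration, not values from the question.

```python
import numpy as np
from scipy.optimize import least_squares


def model(par, x):
    # Piecewise quadratic, continuous at the breakpoint par[5] (assumed form).
    A0 = par[2] + (par[3] - par[0]) * par[5] + (par[4] - par[1]) * par[5] ** 2
    left = A0 + par[0] * x + par[1] * x ** 2
    right = par[2] + par[3] * x + par[4] * x ** 2
    return np.where(x <= par[5], left, right)


def residuals(par, x, y):
    # least_squares wants the residual vector, not its sum of squares.
    return y - model(par, x)


# Synthetic stand-in data with y varying by a few tenths over x in [0, 360].
rng = np.random.default_rng(1)
x = np.linspace(0.0, 360.0, 181)
true_par = np.array([1e-3, -2e-6, 0.5, -5e-4, 1e-6, 200.0])
y = model(true_par, x) + rng.normal(scale=0.005, size=x.size)

par0 = np.array([8e-4, -1e-6, 0.45, -4e-4, 8e-7, 180.0])

# Relative finite-difference steps: tiny for the smooth coefficients,
# 1% for the breakpoint so that a step actually moves it past data points.
steps = np.full(par0.size, 1.5e-8)
steps[5] = 1e-2

fit = least_squares(residuals, par0, args=(x, y), diff_step=steps)

print(fit.x)
print(fit.cost)  # 0.5 * sum of squared residuals at the solution
```

Whether the breakpoint converges still depends strongly on the starting value, so it is worth trying several values of par0[5] and comparing fit.cost.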

How to read a system of differential equations from a text file to solve the system with scipy.odeint?

I have a large (>2000 equations) system of ODE's that I want to solve with python scipy's odeint.
I have three problems that I want to solve (maybe I will have to ask 3 different questions?).
For simplicity, I will explain them here with a toy model, but please keep in mind that my system is large.
Suppose I have the following system of ODE's:
dS/dt = -beta*S
dI/dt = beta*S - gamma*I
dR/dt = gamma*I
with beta = c*p*I,
where c, p and gamma are parameters that I want to pass to odeint.
odeint is expecting a file like this:
def myODEs(y, t, params):
    c,p, gamma = params
    beta = c*p
    S = y[0]
    I = y[1]
    R = y[2]
    dydt = [-beta*S*I,
            beta*S*I - gamma*I,
            gamma*I]  # dR/dt = gamma*I, matching the system above
    return dydt
that then can be passed to odeint like this:
myoutput = odeint(myODEs, [1000, 1, 0], np.linspace(0, 100, 50), args = ([c,p,gamma], ))
I generated a text file in Mathematica, say myODEs.txt, where each line of the file corresponds to the RHS of one equation of my system of ODEs, so it looks like this:
#myODEs.txt
-beta*S*I
beta*S*I - gamma*I
gamma*I
My text file looks similar to what odeint is expecting, but I am not quite there yet.
I have three main problems:
1. How can I pass my text file so that odeint understands that this is the RHS of my system?
2. How can I define my variables in a smart way, that is, in a systematic way? Since there are >2000 of them, I cannot manually define them. Ideally I would define them in a separate file and read that as well.
3. How can I pass the parameters (there are a lot of them) as a text file too?
I read this question that is close to my problems 1 and 2 and tried to copy it (I directly put values for the parameters so that I didn't have to worry about my point 3 above):
systemOfEquations = []
with open("myODEs.txt", "r") as fp :
    for line in fp :
        systemOfEquations.append(line)
def dX_dt(X, t):
    vals = dict(S=X[0], I=X[1], R=X[2], t=t)
    return [eq for eq in systemOfEquations]
out = odeint(dX_dt, [1000,1,0], np.linspace(0, 1, 5))
but I got the error:
odepack.error: Result from function call is not a proper array of floats.
ValueError: could not convert string to float: -((12*0.01/1000)*I*S),
Edit: I modified my code to:
systemOfEquations = []
with open("SIREquationsMathematica2.txt", "r") as fp :
    for line in fp :
        pattern = regex.compile(r'.+?\s+=\s+(.+?)$')
        expressionString = regex.search(pattern, line)
        systemOfEquations.append( sympy.sympify( expressionString) )
def dX_dt(X, t):
    vals = dict(S=X[0], I=X[1], R=X[2], t=t)
    return [eq for eq in systemOfEquations]
out = odeint(dX_dt, [1000,1,0], np.linspace(0, 100, 50))
and this works (I don't quite get what the first two lines of the for loop are doing). However, I would like to make the process of defining the variables more automatic, and I still don't know how to use this solution while passing the parameters in a text file. Along the same lines, how can I define parameters (that will depend on the variables) inside the dX_dt function?
Thanks in advance!
This isn't a full answer, but rather some observations/questions, but they are too long for comments.
dX_dt is called many times by odeint with a 1d array y, the current time t, and any extra arguments. You provide those extra arguments as a tuple via the args parameter. y is generated by odeint and varies with each step. dX_dt should be streamlined so it runs fast.
Usually an expression like [eq for eq in systemOfEquations] can be simplified to systemOfEquations. [eq for eq...] doesn't do anything meaningful. But there may be something about systemOfEquations that requires it.
I'd suggest you print out systemOfEquations (for this small 3 line case), both for your benefit and ours. You are using sympy to translate the strings from the file into equations. We need to see what it produces.
Note that myODEs is a function, not a file. It may be imported from a module, which of course is a file.
The point to vals = dict(S=X[0], I=X[1], R=X[2], t=t) is to produce a dictionary that the sympy expressions can work with. A more direct (and I think faster) dX_dt function would look like:
def myODEs(y, t, params):
    c,p, gamma = params
    beta = c*p
    dydt = [-beta*y[0]*y[1],
            beta*y[0]*y[1] - gamma*y[1],
            gamma*y[1]]  # dR/dt = gamma*I
    return dydt
I suspect that the dX_dt that runs sympy generated expressions will be a lot slower than a 'hardcoded' one like this.
I'm going to add the sympy tag, because, as written, that is the key to translating your text file into a function that odeint can use.
I'd be inclined to put the equation variability in the parameters passed via args, rather than in a list of sympy expressions.
That is, replace:
dydt = [-beta*y[0]*y[1],
        beta*y[0]*y[1] - gamma*y[1],
        gamma*y[1]]
with something like
arg12 = np.array([-beta, beta, 0])
arg1 = np.array([0, -gamma, gamma])
arg0 = np.array([0,0,0])
dydt = arg12*y[0]*y[1] + arg1*y[1] + arg0*y[0]
Once this is right, the argxx definitions can be moved outside dX_dt and passed via args. Now dX_dt is just a simple, and fast, calculation.
This whole sympy approach may work fine, but I'm afraid that in practice it will be slow. But someone with more sympy experience may have other insights.
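One way to reduce the speed concern is to let sympy parse the text once and then compile all right-hand sides into a plain numeric function with sympy.lambdify, so that no symbolic work happens inside dX_dt. The sketch below is an illustration under assumptions: the three "files" are in-memory stand-ins, their formats (one expression per line, one variable name per line, name = value per line) are hypothetical, and the parameter values are made up. Note that symbols are passed explicitly to sympify, because otherwise sympy reads I as the imaginary unit, S as its singleton registry, and gamma as the gamma function.

```python
import io

import numpy as np
import sympy
from scipy.integrate import odeint

# In-memory stand-ins for the three text files (hypothetical formats).
equations_file = io.StringIO("-c*p*S*I\nc*p*S*I - gamma*I\ngamma*I\n")
variables_file = io.StringIO("S\nI\nR\n")
parameters_file = io.StringIO("c = 12\np = 0.00001\ngamma = 0.05\n")

state_names = [line.strip() for line in variables_file if line.strip()]
param_pairs = [line.split("=") for line in parameters_file if line.strip()]
param_names = [name.strip() for name, _ in param_pairs]
param_values = [float(value) for _, value in param_pairs]

# Explicit symbols for every name that appears in the equations.
symbols = {name: sympy.Symbol(name) for name in state_names + param_names}
exprs = [sympy.sympify(line.strip(), locals=symbols)
         for line in equations_file if line.strip()]

# Compile all right-hand sides once; rhs takes states first, then parameters.
rhs = sympy.lambdify([symbols[n] for n in state_names + param_names], exprs)


def dX_dt(X, t, params):
    # Pure numeric evaluation; sympy is not involved at solve time.
    return rhs(*X, *params)


t = np.linspace(0, 100, 50)
out = odeint(dX_dt, [1000, 1, 0], t, args=(param_values,))

print(out[-1])
```

Because the variable and parameter lists are read from files, the same script works unchanged for a system with thousands of equations, as long as the order of initial values matches the order of names in the variables file.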

MemoryError with large sparse matrices

For a project I have built a program that constructs large matrices.
def ExpandSparse(LNew):
    SpId = ssp.csr_matrix(np.identity(MS))
    Sz = MS**LNew
    HNew = ssp.csr_matrix((Sz,Sz))
    Bulk = dict()
    for i in range(LNew-1):
        for j in range(LNew-1):
            if i == j:
                Bulk[(i,j)]=H2
            else:
                Bulk[(i,j)]=SpId
    Ha = ssp.csr_matrix((8,8))
    try:
        for i in range(LNew-1):
            for j in range(LNew-2):
                if j < 1:
                    Ha = ssp.csr_matrix(ssp.kron(Bulk[(i,j)],Bulk[(i,j+1)]))
                else:
                    Ha = ssp.csr_matrix(ssp.kron(Ha,Bulk[(i,j+1)]))
            HNew = HNew + Ha
    except MemoryError:
        print('The matrix you tried to build requires too much memory space.')
        return
    return HNew
This does the job, but it does not work as well as I would have expected. The problem is that it won't allow for really large matrices. When LNew is larger than 13 I get a MemoryError. My experience with numpy suggests that, memory-wise, I should be able to get LNew up to 18 or 19 before hitting this error. Does this have to do with my code, or with the way scipy.sparse.kron() works with these matrices?
Another note that might be important is that I use Windows, not Linux.
After some more reading on how scipy.sparse.kron() works, I noticed that it takes a third argument named format. The default is None, but when it is set to 'csr' or another supported format, kron returns its result directly in that sparse format, which is a lot more efficient. With that change I can build a 2097152 x 2097152 matrix, which corresponds to LNew = 21.
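A minimal sketch of that change, assuming MS = 2 and using a made-up 4 x 4 matrix as a stand-in for H2 (the real H2 is not shown in the question):

```python
import numpy as np
import scipy.sparse as ssp

# Hypothetical stand-in for the two-site term H2 (4 x 4 when MS = 2).
H2 = ssp.csr_matrix(np.array([[1.0,  0.0,  0.0, 0.0],
                              [0.0, -1.0,  2.0, 0.0],
                              [0.0,  2.0, -1.0, 0.0],
                              [0.0,  0.0,  0.0, 1.0]]))
SpId = ssp.identity(2, format="csr")

# Ask kron for CSR output directly instead of converting afterwards.
Ha = ssp.kron(H2, SpId, format="csr")
small = Ha.copy()
for _ in range(8):
    Ha = ssp.kron(Ha, SpId, format="csr")

print(Ha.shape, Ha.format, Ha.nnz)
```

Passing format="csr" also makes the outer ssp.csr_matrix(...) wrapper in the original loop unnecessary.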

Scipy Sparse Eigensolver: MemoryError after multiple passes through loop without anything new being written during loop

I'm using Python + Scipy to diagonalize sparse matrices with random entries on the diagonal; in particular, I need eigenvalues in the middle of the spectrum. The code I've written has worked fine for months, but now I'm looking at bigger matrices and am running into "MemoryError"s. What's confusing/driving me insane is that the error only shows up after a few iterations (namely 9) of constructing a random matrix and diagonalizing it, but I don't see any way in which my code stores anything extra in memory from one iteration to the next, and so can't see how my code could fail during the 9th iteration but not the 1st.
Here are the details (and I apologize in advance if I've left anything out, I'm new to posting on this site):
Each matrix I construct is 16000x16000, with 15x16000 non-zero entries. Everything ran fine when I was looking at 4000x4000-size matrices. The bulk of my code is
#Initialization
#...
for i in range(dim):
    for n in range(N):
        digit = (i % 2**(n+1)) / 2**n
        index = (i % 2**n) + ((digit + 1) % 2)*(2**n) + (i / 2**(n+1))*(2**(n+1))
        row[dim + N*i + n] = index
        col[dim + N*i + n] = i
        dat[dim + N*i + n] = -G
e_list = open(e_list_name + "_%03dk_%010ds" % (num_states, int(start_time)), "w")
e_log = open(e_log_name + "_%03dk_%010ds" % (num_states, int(start_time)), "w")
for t in range(num_itr): #Begin iterations
    dat[0:dim] = math.sqrt(N/2.0)*np.random.randn(dim) #Get new diagonal elements
    H = sparse.csr_matrix((dat, (row, col))) #Construct new matrix
    vals = sparse.linalg.eigsh(H, k = num_states + 2, sigma = target_energy, which = 'LM', return_eigenvectors = False) #Get new eigenvalues
    vals = np.sort(vals)
    vals.tofile(e_list)
    e_log.write("Iter %d complete\n" % (t+1))
    e_list.flush()
    e_log.flush()
e_list.close()
e_log.close()
I've been setting num_itr to 100. During the 9th pass through the num_itr loop (as indicated by 8 lines having been written to e_log), the program crashes with the error message
Can't expand MemType 0: jcol 7438
Traceback (most recent call last):
File "/usr/lusers/clb37/QREM_Energy_Gatherer.py", line 55, in <module>
vals = sparse.linalg.eigsh(H, k = num_states + 2, sigma = target_energy, which = 'LM', return_eigenvectors = False)
File "/usr/lusers/clb37/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1524, in eigsh
symmetric=True, tol=tol)
File "/usr/lusers/clb37/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1030, in get_OPinv_matvec
return SpLuInv(A.tocsc()).matvec
File "/usr/lusers/clb37/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 898, in __init__
self.M_lu = splu(M)
File "/usr/lusers/clb37/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 242, in splu
ilu=False, options=_options)
MemoryError
Sure enough, the program will fail during the 9th pass through that loop every time I run it on my machine, and when I try running this code on machines with more memory the program makes it through more iterations before crashing, so it looks like the computer really is running out of memory. If that's all there is to it then fine, but what I can't understand is why the program doesn't crash during the 1st iteration. I don't see any point in the 8 lines of the num_itr loop at which something gets written to memory without just being overwritten during the following iteration. I've used Heapy's heap() function to look at my memory usage, and it just prints out "Total size = 11715240 bytes" during every pass.
I feel like there's something fundamental that I just don't know about going on here, either some bug in my writing that I don't know to look for or some detail about how memory is handled. Can anyone explain to me why this code fails during the 9th pass through the num_itr loop but not the 1st?
OK, this seems to be reproducible on SciPy 0.14.0.
The issue can apparently be worked around by adding
import gc; gc.collect()
inside the loop to force Python's cyclic garbage collector to run.
The issue appears to be that somewhere inside scipy.sparse.linalg.eigsh there is a reference cycle, in the vein of:
class Foo(object):
    pass
a = Foo()
b = Foo()
a.spam = b
b.spam = a
del a, b # <- but a, b still refer to each other and are not dead
This is still perfectly OK in principle: although Python's reference counting doesn't detect such cyclic garbage, a collection is run periodically to gather such objects. However, if each object is very large in memory (e.g. big NumPy arrays), the periodic runs are too infrequent, and you run out of memory before the next cyclic garbage collection run is done.
So a workaround is to force the GC to run when you know there's big garbage to collect.
A better workaround would be to change scipy.sparse.linalg.eigsh so that such cyclic garbage is not generated in the first place.
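The behaviour described above can be observed directly with a weak reference. This small demonstration (CPython assumed) disables the automatic collector for a moment so that the timing is deterministic. It does not touch SciPy, and the class Foo is the same toy class as above.

```python
import gc
import weakref


class Foo(object):
    pass


gc.disable()                 # keep the periodic collector out of the way

a = Foo()
b = Foo()
a.spam = b
b.spam = a
probe = weakref.ref(a)       # lets us check whether the object still exists

del a, b
alive_before = probe() is not None   # the cycle keeps both objects alive

gc.collect()                 # explicit cyclic collection
alive_after = probe() is not None    # now the cycle has been reclaimed

gc.enable()
print(alive_before, alive_after)
```

In the eigensolver loop the same explicit gc.collect() call would go at the end of each iteration, after vals has been written out.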

GLPK linear programming

I am working on some very large scale linear programming problems. (Matrices are currently roughly 1000x1000 and these are the 'mini' ones.)
I thought that I had the program running successfully, but I have realized that I am getting some very unintuitive answers. For example, let's say I were to maximize x+y+z subject to the constraints x+y<10 and y+z<5. I run this and get an optimal solution. Then I run the same objective with looser constraints: x+y<20 and y+z<5. Yet in the second run, my maximum decreases!
I have painstakingly gone through and assured myself that the constraints are loading correctly.
Does anyone know what the problem might be?
I found something in the documentation about lpx_check_kkt which seems to tell you when your solution is likely to be correct or high confidence (or low confidence for that matter), but I don't know how to use it.
I made an attempt and got the error message lpx_check_kkt not defined.
I am adding some code as an addendum in hopes that someone can find an error.
The result of this is that it claims an optimal solution has been found. And yet every time I raise an upper bound, it gets less optimal.
I have confirmed that my bounds are going up and not down.
size = 10000000+1
ia = intArray(size)
ja = intArray(size)
ar = doubleArray(size)
prob = glp_create_prob()
glp_set_prob_name(prob, "sample")
glp_set_obj_dir(prob, GLP_MAX)
glp_add_rows(prob, Num_constraints)
for x in range(Num_constraints):
    Variables.add_variables(Constraints_for_simplex)
    glp_set_row_name(prob, x+1, Variables.variers[x])
    glp_set_row_bnds(prob, x+1, GLP_UP, 0, Constraints_for_simplex[x][1])
    print 'we set the row_bnd for', x+1,' to ',Constraints_for_simplex[x][1]
glp_add_cols(prob, len(All_Loops))
for x in range(len(All_Loops)):
    glp_set_col_name(prob, x+1, "".join(["x",str(x)]))
    glp_set_col_bnds(prob,x+1,GLP_LO,0,0)
    glp_set_obj_coef(prob,x+1,1)
for x in range(1,len(All_Loops)+1):
    z=Constraints_for_simplex[0][0][x-1]
    ia[x] = 1; ja[x] = x; ar[x] = z
x=len(All_Loops)+1
while x<Num_constraints + len(All_Loops):
    for y in range(2, Num_constraints+1):
        z=Constraints_for_simplex[y-1][0][0]
        ia[x] = y; ja[x] =1 ; ar[x] = z
        x+=1
x=Num_constraints+len(All_Loops)
while x <len(All_Loops)*(Num_constraints-1):
    for z in range(2,len(All_Loops)+1):
        for y in range(2,Num_constraints+1):
            if x<len(All_Loops)*Num_constraints+1:
                q = Constraints_for_simplex[y-1][0][z-1]
                ia[x] = y ; ja[x]=z; ar[x] = q
                x+=1
glp_load_matrix(prob, len(All_Loops)*Num_constraints, ia, ja, ar)
glp_exact(prob,None)
Z = glp_get_obj_val(prob)
Start by solving your problematic instances with different solvers and checking the objective function value. If you can export your model to .mps format (I don't know how to do this with GLPK, sorry), you can upload the mps file to http://www.neos-server.org/neos/solvers/index.html and solve it with several different LP solvers.
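As a quick local cross-check in the same spirit, the toy example from the question can be solved with a different solver, here scipy.optimize.linprog. This sketch assumes x, y, z >= 0 (linprog's default bounds, and what GLP_LO with lower bound 0 requests in the code above) and treats the strict inequalities as <=. It is not a substitute for checking the real 1000x1000 model.

```python
from scipy.optimize import linprog

# linprog minimises, so negate the objective to maximise x + y + z.
c = [-1.0, -1.0, -1.0]
A_ub = [[1.0, 1.0, 0.0],   # x + y <= b1
        [0.0, 1.0, 1.0]]   # y + z <= b2

tight = linprog(c, A_ub=A_ub, b_ub=[10.0, 5.0])
loose = linprog(c, A_ub=A_ub, b_ub=[20.0, 5.0])

max_tight = -tight.fun
max_loose = -loose.fun
print(max_tight, max_loose)
```

Loosening an upper bound only enlarges the feasible region, so a correct solver can never report a smaller maximum for the second problem. If GLPK does, the constraint matrix being loaded is the first thing to suspect, for example the same (row, column) entry written more than once into ia, ja and ar.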
