I have to factorize a big sparse matrix (6.5 million rows representing users × 6.5 million columns representing items) to find latent vectors for users and items. I chose the ALS algorithm in the Spark framework (PySpark).
To boost the quality I have to reduce the sparsity of my matrix to 98% (the current value is 99.99% because I have only 356 million filled entries).
I can do it by dropping rows or columns, but I must find the optimal solution that maximizes the number of rows (users).
The main problem is that I must find subsets of both the user set and the item set, and dropping a row can cause columns to be dropped and vice versa. The second problem is that the function that evaluates sparsity is not linear.
How can I solve this problem? Which Python libraries can help me with it?
Thank you.
This is a combinatorial problem. There is no easy way to drop an optimal set of columns to achieve max number of users while reducing sparsity. A formal approach would be to formulate it as a mixed-integer program. Consider the following 0-1 matrix, derived from your original matrix C.
A(i,j) = 1 if C(i,j) is nonzero,
A(i,j) = 0 if C(i,j) is zero
Parameters:
M : a sufficiently big numeric value, e.g. number of columns of A (NCOLS)
N : total number of nonzeros in C (aka in A)
Decision vars are
x(j) : 1 if column j is kept, 0 if it is dropped
nr(i): number of nonzeros covered in row i
y(i) : 1 if row i is kept, 0 if it is dropped
Constraints:
A(i,:) x = nr(i) for i = 1..NROWS
nr(i) <= y(i) * M for i = 1..NROWS
sum(nr(i)) + e = 0.98 * N # explicit slack 'e' to be penalized in the objective
y(i) and x(j) are 0-1 variables (binary variables) for i,j
Objective:
maximize sum(y(i)) - N*e
Such a model would be extremely difficult to solve as an integer model. However, barrier methods should be able to solve the linear programming (LP) relaxations. Possible solvers are Coin/CLP (open-source), Lindo (commercial), etc. It may then be possible to use the LP solution to compute approximate integer solutions by simple rounding.
In the end, you will definitely need an iterative approach that solves the MF problem several times, each time factoring a different submatrix of C computed with the above approach, until you are satisfied with the solution.
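Whatever method proposes the rows and columns to keep, each iteration needs to measure the sparsity of the resulting submatrix. A minimal sketch of that check, assuming the nonzeros of C are available as coordinate arrays (the function name and the toy data are made up for illustration):

```python
import numpy as np

def submatrix_sparsity(rows, cols, keep_row, keep_col):
    """Sparsity of the submatrix formed by the kept rows and columns.

    rows, cols         : integer arrays with the coordinates of the nonzeros of C
    keep_row, keep_col : boolean masks over all rows / all columns
    """
    # a nonzero survives only if both its row and its column are kept
    kept_nnz = np.count_nonzero(keep_row[rows] & keep_col[cols])
    n_cells = int(keep_row.sum()) * int(keep_col.sum())
    return 1.0 - kept_nnz / n_cells

# toy 4x4 matrix with 6 nonzeros (illustrative data only)
rows = np.array([0, 0, 1, 1, 2, 3])
cols = np.array([0, 1, 0, 1, 2, 3])

full = submatrix_sparsity(rows, cols, np.ones(4, dtype=bool), np.ones(4, dtype=bool))
keep_row = np.array([True, True, True, False])
keep_col = np.array([True, True, False, False])
sub = submatrix_sparsity(rows, cols, keep_row, keep_col)
print(full, sub)
```

The value is a ratio whose numerator and denominator both depend on the row and column choices, which is why sparsity is not a linear function of the decision variables.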
Related
Cross-posted from https://cs.stackexchange.com/questions/153558/find-a-range-of-values-to-subset-the-rows-to-maximize-the-objective-function?noredirect=1#comment323025_153558.
I have searched around for some time but couldn't find a similar example to my problem.
It looks common enough that I would expect it to be solved. It lies between search and optimization/regression.
The goal is to find a range of values for each feature, so that the subset of rows where every feature falls in the corresponding range maximizes the objective function.
Assume we have a matrix with Yi and corresponding set of features Xi (say around 40).
Number of samples relatively large, 100k+.
Table example
So in this case for the total data Sum(Y_i) = 73 and the mean(Y_i)= 6.0833
The problem is to:
Max sum(Y_i) subject to:
mean(Y_i) > 7
sum(i) > 5000
where i is the row index and rows are selected by imposing 2 constraints (< and >) on each feature.
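To make the problem statement concrete, here is a rough sketch of how one candidate set of ranges could be scored. The function name, argument names and toy data are my own; the thresholds 7 and 5000 are the ones from the constraints above.

```python
import numpy as np

def evaluate_ranges(X, y, lower, upper, min_mean=7.0, min_rows=5000):
    """Score one candidate: rows where lower[j] < X[:, j] < upper[j] for every feature j.

    Returns (sum of y over the selected rows, whether both constraints hold).
    """
    mask = np.all((X > lower) & (X < upper), axis=1)
    n = int(mask.sum())
    if n == 0:
        return 0.0, False
    total = float(y[mask].sum())
    feasible = (total / n > min_mean) and (n > min_rows)
    return total, feasible

# toy data, illustrative only: 4 rows, 2 features
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([5.0, 8.0, 9.0, 2.0])
total, feasible = evaluate_ranges(X, y,
                                  lower=np.array([1.5, 15.0]),
                                  upper=np.array([3.5, 35.0]),
                                  min_rows=1)
print(total, feasible)
```

Any search method (differential evolution or otherwise) would call a function like this once per candidate, so the decision variables are just the 2 bounds per feature.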
I have managed to get a solution using DEoptim in R for 5-6 variables with 2 conditions (partitions), "<" and ">". For more features it gets slow or fails to converge.
Seeing the (somewhat) similar question (and answer) here : Pandas find subset of rows minimizing the sum of a column under other column constraint
I am wondering if there is a way to formulate my problem in OR-Tools as well. I have gone through the documentation at https://developers.google.com/optimization but still struggle to understand how to express my problem.
I would appreciate any pointers on how to formulate (and solve) this problem in OR-Tools in the general case, where there is a dataset with features plus a response variable and the objective is to find the splits on the features that maximize (or minimize) the sum (or another function) of the response variable.
The number of splits should be 2 per feature, as we want the solution to be locally monotonic with respect to the features.
Thanks.
For a linear optimization problem, I would like to include a penalty. The penalty of every option (penalties[(i)]) should be 1 if the sum is larger than 0 and 0 if the sum is zero. Is there a way to do this?
The penalty is defined as:
penalties = {}
for i in A:
    penalties[(i)] = lpSum(choices[i][k] for k in B) / len(C)
prob += Objective Function + sum(penalties)
For example:
penalties[(0)]=0
penalties[(1)]=2
penalties[(3)]=6
penalties[(4)]=0
The sum of the penalties should then be:
sum(penalties)=0+1+1+0= 2
Yes. What you need to do is create binary variables use_ith_row[i]. The interpretation of this variable is that it is 1 if any of the choices[i][k] are > 0 for row i (and 0 otherwise).
The penalty term in your objective function simply needs to be sum(use_ith_row[i] for i in A).
The last thing you need is the set of constraints which enforce the rule described above:
for i in A:
    prob += lpSum(choices[i][k] for k in B) <= use_ith_row[i]*M
Finally, you need to choose M large enough so that the constraint above has no limiting effect when use_ith_row is 1 (you can normally work out this bound quite easily). Choosing an M which is way too large will also work, but will tend to make your problem solve more slowly.
p.s. I don't know what C is or why you divide by its length, but typically, if this penalty is secondary to your other/primary objective, you would weight it so that improvement in your primary objective is always given greater weight.
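To see what the big-M constraint does without involving a solver, here is a plain-Python sketch. It assumes the objective is being minimized, so the solver picks the smallest indicator value the constraint allows. The helper name is made up, and the row sums are the example values from the question.

```python
def smallest_indicator(row_sum, big_m):
    """Smallest binary u that satisfies row_sum <= u * big_m."""
    for u in (0, 1):
        if row_sum <= u * big_m:
            return u
    raise ValueError("big_m is too small for this row sum")

# example values from the question: sums of choices[i][k] per row i
row_sums = {0: 0, 1: 2, 3: 6, 4: 0}
big_m = max(row_sums.values())  # any valid upper bound on the row sums works

use_ith_row = {i: smallest_indicator(s, big_m) for i, s in row_sums.items()}
total_penalty = sum(use_ith_row.values())
print(use_ith_row, total_penalty)
```

In the actual PuLP model, use_ith_row would be declared as binary decision variables and the solver would make this choice itself. The sketch only illustrates why a row with a positive sum is forced to 1 and why M must be at least the largest possible row sum.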
My goal in a nutshell:
Given an 8x8 matrix C I have an algorithm which constructs another 8x8 matrix L=L(C) in physically relevant way, wherein each entry of L is given by a particular (possibly irrational) linear combination of up to 8 entries of C. I want to find a particular choice of C which is positive semidefinite and has large rank (7 or 8) but gives rise to a singular L.
Facts:
Every positive semidefinite C can be written as C=U*U for some U (where U* denotes the complex conjugate transpose of U).
In this case ker U=ker C, and so the rank of U and C are the same.
If U has a nonsingular 7x7 principal submatrix, then the rank of U is at least 7.
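These facts can be sanity-checked numerically. The sketch below is only a floating-point illustration with a random U (not the exact symbolic computation asked about); the seed, the choice of zeroing one row to get rank 7, and the rank tolerance are all assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
U[7, :] = 0.0                         # zero out one row so that rank(U) = 7

C = U.conj().T @ U                    # C = (conjugate transpose of U) times U
eigenvalues = np.linalg.eigvalsh(C)   # real eigenvalues, since C is Hermitian
is_psd = bool(eigenvalues.min() > -1e-10)

tol = 1e-8                            # assumed numerical rank tolerance
rank_U = np.linalg.matrix_rank(U, tol=tol)
rank_C = np.linalg.matrix_rank(C, tol=tol)

V = U[:7, :7]                         # upper-left 7x7 principal submatrix
rank_V = np.linalg.matrix_rank(V, tol=tol)
print(is_psd, rank_U, rank_C, rank_V)
```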
Naive solution:
Declare U as an 8x8 matrix of symbols and create C=U*U. This guarantees C is positive semidefinite.
Define V to be the upper-left 7x7 principal submatrix of U. Note if V is nonsingular the rank of C is at least 7.
Compute L from C.
Try to solve([ det(V)~=0, det(L)==0 ], [vars]), where vars=entries of U.
The difficulty:
One immediate problem is that not every choice of U which is rank 7 or 8 will have V nonsingular, so the matrix C I'm seeking might be missed in this regime. Ignoring this here, my issue for this forum is more to the complexity of the problem:
By setting C=U*U and using the entries of U as the variables, each entry of C becomes a sum of 8 terms, each degree 2. E.g., the c21 entry is given by
u11*conj(u12) + u21*conj(u22) + u31*conj(u32) + u41*conj(u42) + u51*conj(u52) + u61*conj(u62) + u71*conj(u72) + u81*conj(u82)
Since each entry of L is a linear combination of between 2 and 8 entries of C, we have that each entry of L is a linear combination of between 2*8=16 and 8*8=64 degree 2 terms from the entries of U (or U*). Thus det(L) is a uniform degree 2*8=16 polynomial with between 8!*16^8≈10^14 and 8!*64^8≈10^19 terms before simplification. I need this polynomial to be zero while simultaneously det(V) is nonzero (a uniform degree 7 polynomial with 7!=5040 terms).
Note that if one were to avoid using C=U*U and instead let the entries of C be the variables, then det(L) would be a uniform degree 8 polynomial with between 8!*2^8≈10^7 and 8!*8^8≈7*10^11 terms before simplification. I would need this polynomial to be zero while simultaneously det(V) is nonzero and some extra conditions are placed so that C is positive semidefinite (Sylvester's Criterion, etc).
My question:
Is there a smarter way to do this? Certainly the determinant is not the most efficient way to determine if L is singular, but ideally I would like an exact answer for C, rather than a numerical approximation.
I am most familiar with Matlab, but any suggestions using any system (Python, Macaulay2, ...) would be greatly appreciated. For computing power, I have access to several supercomputer clusters.
Edits:
Perhaps a bit lofty a question. More digestible sub-questions:
Is there a computationally easier, ideally symbolic, algorithm for determining if a matrix is singular (opposed to computing the determinant)?
Is there a computationally easier way of demanding the answer be positive semidefinite (opposed to setting C=U*U and using the entries of U as the variables)?
Is there a less restrictive (but still computationally easy) way to demand that C has rank 7 or 8?
I have 50 lists, each one filled with 0s and 1s. I know the overall proportion of 1s when you consider all the 50 lists pooled together. I want to find the 10 lists that pooled together best resemble the overall proportion of 1s.
The function I want to minimise is abs(mean(pooled subset) - mean(pooled full set))
For those who know pandas:
In pandas terms, I have a dataframe as follows
and so on, with a total of 50 labels, each one with a number of values ranging between 100 and 1000.
I want to find the subset of 10 labels that minimises d, where
d = abs(df.loc[df.label.isin(subset), 'Value'].mean() - df.Value.mean())
I tried to apply dynamic programming solutions to the knapsack problem, but the issue is that the contribution of each list (label) to the final sample mean changes depending on which other lists you will include afterwards (because they will increase the sample size in unpredictable ways). It's like having knapsack problem where every new item you pick changes the value of the items you previously picked. Tricky.
Is there a better algorithm to solve this problem?
There is a way, somewhat cumbersome, to formulate this problem as a MIP (Mixed Integer Programming) problem.
We need the following data:
mu : mean of all data
mu(i) : mean of each subset i
n(i) : number of elements in each subset
N : number of subsets we need to select
And we need some binary decision variables
delta(i) = 1 if subset i is selected and 0 otherwise
A formal statement of the optimization problem can look like:
min | mu - sum(i, mu(i)*n(i)*delta(i)) / sum(i, n(i)*delta(i)) |
subject to
sum(i, delta(i)) = N
delta(i) in {0,1}
Here sum(i, mu(i)*n(i)*delta(i)) is the total value of the selected items and sum(i, n(i)*delta(i)) is the total number of selected items.
The objective is clearly nonlinear (we have an absolute value and a division). This is sometimes called an MINLP problem (MINLP for Mixed Integer Nonlinear Programming). Although MINLP solvers are readily available, we actually can do better. Using some gymnastics we can reformulate this problem into a linear problem (by adding some extra variables and extra inequality constraints). The full details are here. The resulting MIP model can be solved with any MIP solver.
Interestingly we don't need the data values in the model, just n(i),mu(i) for each subset.
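To illustrate that last point, here is a small brute-force sketch that evaluates the objective from n(i) and mu(i) alone. It is not the MIP reformulation, just exhaustive enumeration, which is only practical for small instances (choosing 10 out of 50 gives roughly 10^10 subsets). The function name and the toy numbers are made up.

```python
from itertools import combinations

def best_subset(n, mu, k):
    """Return (d, subset) minimizing |pooled mean of subset - overall mean|."""
    overall = sum(ni * mi for ni, mi in zip(n, mu)) / sum(n)
    best = None
    for subset in combinations(range(len(n)), k):
        count = sum(n[i] for i in subset)            # number of selected items
        total = sum(n[i] * mu[i] for i in subset)    # total value of selected items
        d = abs(total / count - overall)
        if best is None or d < best[0]:
            best = (d, subset)
    return best

# toy instance: 4 lists, pick 2 (illustrative numbers only)
n = [100, 200, 300, 400]
mu = [0.1, 0.5, 0.4, 0.2]
d, subset = best_subset(n, mu, 2)
print(d, subset)
```

A check like this is handy for validating a MIP model on a small instance before running it on the full 50 lists.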
I can test the rank of a matrix using np.linalg.matrix_rank(A) . But how can I test if all the rows of A are orthogonal efficiently?
I could take all pairs of rows and compute the inner product between them but is there a better way?
My matrix has fewer rows than columns and the rows are not unit vectors.
This answer basically summarizes the approaches mentioned in the question and the comments, and adds some comparison/insights about them
Approach #1 -- checking all row-pairs
As you suggested, you can iterate over all row pairs, and compute the inner product. If A.shape==(N,M), i.e. you have N rows of size M each, you end up with a O(M*N^2) complexity.
Approach #2 -- matrix multiplication
As suggested in the comments by @JoeKington, you can compute the product A.dot(A.T) and check all the off-diagonal elements. Depending on the algorithm used for matrix multiplication, this can be faster than the naive O(M*N^2) algorithm, but only asymptotically. Unless your matrices are big, the asymptotically faster algorithms would actually be slower.
The advantages of approach #1:
You can "short circuit" -- quit the check as soon as you find the first non-orthogonal pair
requires less memory. In #2, you create a temporary NxN matrix.
The advantages of approach #2:
The multiplication is fast, as it is implemented in a heavily optimized linear-algebra library (BLAS or ATLAS). I believe those libraries choose the right algorithm to use according to input size (i.e. they won't use the fancy algorithms on small matrices, because they are slower for small matrices. There's a big constant hidden behind that O-notation).
less code to write
My bet is that for small matrices, approach #2 would prove faster due to the fact the LA libraries are heavily optimized, and despite the fact they compute the entire multiplication, even after processing the first pair of non-orthogonal rows.
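For reference, the two approaches might look roughly like this. It is a sketch: the function names are mine, and the absolute tolerance is an assumption you would adapt to the scale of your data.

```python
from itertools import combinations
import numpy as np

def orthogonal_pairwise(A, tol=1e-10):
    """Approach #1: check row pairs one by one, stopping at the first failure."""
    for i, j in combinations(range(A.shape[0]), 2):
        if abs(A[i] @ A[j]) > tol:
            return False          # short circuit: no need to look further
    return True

def orthogonal_gram(A, tol=1e-10):
    """Approach #2: one matrix product, then inspect the off-diagonal entries."""
    gram = A @ A.T                # temporary N x N matrix
    off_diag = gram - np.diag(np.diag(gram))
    return bool(np.all(np.abs(off_diag) <= tol))

# rows are mutually orthogonal but not unit length
A_good = np.array([[1.0, 1.0, 0.0, 0.0],
                   [1.0, -1.0, 0.0, 0.0],
                   [0.0, 0.0, 2.0, 3.0]])
A_bad = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0]])
print(orthogonal_pairwise(A_good), orthogonal_gram(A_good))
print(orthogonal_pairwise(A_bad), orthogonal_gram(A_bad))
```

Timing both on matrices of your actual size is the only reliable way to settle which one is faster for you.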
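It seems that this will do:
import numpy as np
product = np.dot(A, A.T)  # Gram matrix of the rows
np.fill_diagonal(product, 0)  # ignore each row's product with itself
rows_orthogonal = not product.any()  # True only if every off-diagonal entry is exactly zero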
Approach #3: Compute the QR decomposition of A^T
In general, to find an orthogonal basis of the range space of some matrix X, one can compute the QR decomposition of this matrix (using Givens rotations or Householder reflectors). Q is an orthogonal matrix and R upper triangular. The columns of Q corresponding to non-zero diagonal entries of R form an orthonormal basis of the range space.
If the columns of X=A^T, i.e., the rows of A, already are orthogonal, then the QR decomposition will necessarily have a diagonal R factor, where the diagonal entries are plus or minus the lengths of the columns of X, i.e. of the rows of A.
Common folklore has it that this approach is numerically better behaved than the computation of the product A*A^T=R^T*R. This may only matter for larger matrices. The computation is not as straightforward as the matrix product; however, the number of operations is of the same order.
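A possible NumPy version of this check, as a sketch: it assumes A has no more rows than columns (as in the question), the function name is made up, and the tolerance is an arbitrary choice.

```python
import numpy as np

def rows_orthogonal_qr(A, tol=1e-10):
    """Return (rows are orthogonal?, |diag(R)|) using the QR decomposition of A^T."""
    _, R = np.linalg.qr(A.T)                 # reduced QR: R is N x N for N rows
    off_diag = R - np.diag(np.diag(R))       # R is diagonal iff this is ~0
    return bool(np.all(np.abs(off_diag) <= tol)), np.abs(np.diag(R))

A_good = np.array([[1.0, 1.0, 0.0, 0.0],
                   [1.0, -1.0, 0.0, 0.0],
                   [0.0, 0.0, 2.0, 3.0]])
A_bad = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0]])

ok_good, diag_good = rows_orthogonal_qr(A_good)
ok_bad, _ = rows_orthogonal_qr(A_bad)
print(ok_good, diag_good, ok_bad)
```

When the check passes, the absolute values of the diagonal of R are the row lengths of A, as described above.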
(U.T @ U == np.eye(U.shape[0])).all()
This gives True if the matrix U is orthogonal and False otherwise. Here all() reduces the matrix of boolean values produced by 'U.T @ U == np.eye(U.shape[0])' to a single boolean value.
If you want to check that the matrix is approximately orthonormal (by this I mean that the matrix 'U.T @ U' is nearly equal to an identity matrix), use np.allclose() like this:
np.allclose(U.T @ U, np.eye(U.shape[0]))
Note: '@' is used for matrix multiplication.