Find subset with mean similar to the full set - Python

I have 50 lists, each one filled with 0s and 1s. I know the overall proportion of 1s when you consider all 50 lists pooled together. I want to find the 10 lists that, pooled together, best resemble the overall proportion of 1s.
The function I want to minimise is abs(mean(pooled subset) - mean(pooled full set))
For those who know pandas:
In pandas terms, I have a dataframe as follows
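For example (made-up values; the real frame is of course much longer):

label  Value
a      0
a      1
a      1
b      0
b      1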
and so on, with a total of 50 labels, each one with a number of values ranging between 100 and 1000.
I want to find the subset of 10 labels that minimises d, where
d = abs(df.loc[df.label.isin(subset), 'Value'].mean() - df.Value.mean())
I tried to apply dynamic-programming solutions to the knapsack problem, but the issue is that the contribution of each list (label) to the final sample mean changes depending on which other lists you include afterwards (because they change the sample size in ways you can't predict up front). It's like a knapsack problem where every new item you pick changes the value of the items you previously picked. Tricky.
Is there a better algorithm to solve this problem?

There is a way, somewhat cumbersome, to formulate this problem as a MIP (Mixed Integer Programming) problem.
We need the following data:
mu : mean of all data
mu(i) : mean of each subset i
n(i) : number of elements in each subset
N : number of subsets we need to select
And we need some binary decision variables
delta(i) = 1 if subset i is selected and 0 otherwise
A formal statement of the optimization problem can look like:
min | mu - sum(i, mu(i)*n(i)*delta(i)) / sum(i, n(i)*delta(i)) |
subject to
sum(i, delta(i)) = N
delta(i) in {0,1}
Here sum(i, mu(i)*n(i)*delta(i)) is the total value of the selected items and sum(i, n(i)*delta(i)) is the total number of selected items.
The objective is clearly nonlinear (we have an absolute value and a division). This is sometimes called an MINLP problem (MINLP for Mixed Integer Nonlinear Programming). Although MINLP solvers are readily available, we actually can do better. Using some gymnastics we can reformulate this problem into a linear problem (by adding some extra variables and extra inequality constraints). The full details are here. The resulting MIP model can be solved with any MIP solver.
Interestingly, we don't need the raw data values in the model, just n(i) and mu(i) for each subset.
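For concreteness, here is a small PuLP sketch of the linearized model (PuLP and all the names are my choice, not part of the question). Note that minimizing z below minimizes |mu*S - T| = S*|mu - T/S|, i.e. the deviation scaled by the selected sample size S; this is one standard way to get rid of the division:

import pulp

def select_labels(n, m, N):
    # n[i]: number of elements in list i, m[i]: mean of list i
    mu = sum(ni * mi for ni, mi in zip(n, m)) / sum(n)  # overall mean
    prob = pulp.LpProblem("subset_mean", pulp.LpMinimize)
    delta = [pulp.LpVariable(f"delta{i}", cat="Binary") for i in range(len(n))]
    z = pulp.LpVariable("z", lowBound=0)
    T = pulp.lpSum(m[i] * n[i] * delta[i] for i in range(len(n)))  # selected total value
    S = pulp.lpSum(n[i] * delta[i] for i in range(len(n)))         # selected sample size
    prob += z                       # objective: z = |mu*S - T|
    prob += z >= mu * S - T
    prob += z >= T - mu * S
    prob += pulp.lpSum(delta) == N  # select exactly N lists
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i, d in enumerate(delta) if d.value() > 0.5]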


OR-Tools solution to partition the data so that the subset of rows where every feature falls in the corresponding range maximizes the objective function

Cross-posted from https://cs.stackexchange.com/questions/153558/find-a-range-of-values-to-subset-the-rows-to-maximize-the-objective-function?noredirect=1#comment323025_153558.
I have searched around for some time but couldn't find an example similar to my problem.
It looks common enough that I would expect it to be solved. It lies between search and optimization/regression.
The goal is to find a range of values for each feature, so that the subset of rows where every feature falls in the corresponding range maximizes the objective function.
Assume we have a matrix with responses Y_i and a corresponding set of features X_i (say around 40).
The number of samples is relatively large, 100k+.
Table example
So in this case, over the full data, sum(Y_i) = 73 and mean(Y_i) = 6.0833.
The problem is to:
max sum(Y_i) subject to:
mean(Y_i) > 7
number of selected rows > 5000
where i runs over the selected row indices, and rows are selected by imposing two constraints (< and >) on each feature.
I have managed to get a solution using DEoptim in R for 5-6 variables with two conditions (partitions), "<" and ">". For more features it gets slow or fails to converge.
Seeing the (somewhat) similar question (and answer) here: Pandas find subset of rows minimizing the sum of a column under other column constraint
I am wondering if there is a way to formulate my problem in OR-Tools as well. I have gone through the documentation at https://developers.google.com/optimization but still struggle to understand how to express my problem.
I would appreciate any pointers on how to formulate (and solve) this problem in OR-Tools in the general case, where there is a dataset of features plus a response variable, and the objective is to find the splits on the features that maximize (or minimize) the sum (or another function) of the response variable.
The number of splits should be 2 per feature, as we want the solution to be locally monotonic with respect to the features.
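For concreteness, here is a toy-scale CP-SAT sketch of the formulation I have in mind (it assumes integer-valued features; X is an n-by-k list of integer rows, Y a list of integer responses, and all the names are mine; with 100k+ rows this exact encoding would be far too large):

from ortools.sat.python import cp_model

def build_model(X, Y, min_rows=5000, min_mean=7):
    n, k = len(X), len(X[0])
    model = cp_model.CpModel()
    # one range [lo[f], hi[f]] per feature, i.e. two splits per feature
    lo = [model.NewIntVar(min(r[f] for r in X), max(r[f] for r in X), f"lo{f}") for f in range(k)]
    hi = [model.NewIntVar(min(r[f] for r in X), max(r[f] for r in X), f"hi{f}") for f in range(k)]
    z = [model.NewBoolVar(f"z{i}") for i in range(n)]  # row i selected?
    for i in range(n):
        lits = []
        for f in range(k):
            b1 = model.NewBoolVar(f"ge_{i}_{f}")       # b1 <=> lo[f] <= X[i][f]
            model.Add(lo[f] <= X[i][f]).OnlyEnforceIf(b1)
            model.Add(lo[f] > X[i][f]).OnlyEnforceIf(b1.Not())
            b2 = model.NewBoolVar(f"le_{i}_{f}")       # b2 <=> X[i][f] <= hi[f]
            model.Add(hi[f] >= X[i][f]).OnlyEnforceIf(b2)
            model.Add(hi[f] < X[i][f]).OnlyEnforceIf(b2.Not())
            lits += [b1, b2]
        # row is selected exactly when every feature lies inside its range
        model.AddBoolAnd(lits).OnlyEnforceIf(z[i])
        model.AddBoolOr([l.Not() for l in lits]).OnlyEnforceIf(z[i].Not())
    model.Add(sum(z) >= min_rows)                      # at least min_rows rows selected
    model.Add(sum(z[i] * (Y[i] - min_mean) for i in range(n)) >= 0)  # mean(Y) >= min_mean, linearized
    model.Maximize(sum(z[i] * Y[i] for i in range(n)))
    return model, z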
Thanks.

Implement a variational approach for budget closure with 2 constraints in Python

I'm new to Python and am quite helpless with a problem I have to solve:
I have two budget equations, say a+b+c+d = Res1 and a+c+e+f = Res2. Every term has a specific standard deviation (a_std, b_std, ...), and I want to distribute the budget residuals Res1 and Res2 onto the individual terms relative to their uncertainty (see the equation below), so that a_new+b_new+c_new+d_new = 0 and a_new+c_new+e_new+f_new = 0.
With only one budget equation I'm able to solve the problem and get the terms a_new, b_new, c_new, and d_new. But how can I add the second constraint to also get e_new and f_new?
e.g. I calculate a_new = a - (a_std^2/(a_std^2+b_std^2+c_std^2+d_std^2))*Res1; however, this depends only on the first equation, and I want a to be modified in a way that also satisfies the second equation.
I appreciate any help/any ideas on how to approach this problem.
Thanks in advance,
Sue
Edit:
What I have so far:
import numpy as np

def var_close(a, a_std, b, b_std, c, c_std, d, d_std, e, e_std, f, f_std, g, g_std):
    # residual of the first budget equation and its total variance
    x = [a, b, c, d, e]
    Res1 = np.sum(x)
    std_ges1 = a_std**2 + b_std**2 + c_std**2 + d_std**2 + e_std**2
    # residual of the second budget equation and its total variance
    y = [a, c, f, g]
    Res2 = np.sum(y)
    std_ges2 = a_std**2 + c_std**2 + f_std**2 + g_std**2
    # close each budget separately, weighting each term by its variance
    a_new = a - (a_std**2 / std_ges1) * Res1
    b_new = b - (b_std**2 / std_ges1) * Res1
    c_new = c - (c_std**2 / std_ges1) * Res1
    d_new = d - (d_std**2 / std_ges1) * Res1
    e_new = e - (e_std**2 / std_ges1) * Res1
    a_new2 = a - (a_std**2 / std_ges2) * Res2
    c_new2 = c - (c_std**2 / std_ges2) * Res2
    f_new = f - (f_std**2 / std_ges2) * Res2
    g_new = g - (g_std**2 / std_ges2) * Res2
    return a_new, b_new, c_new, d_new, e_new, a_new2, c_new2, f_new, g_new
But like this, e.g., a_new and a_new2 are slightly different, whereas I want them to be equal, with the other terms modified corresponding to their uncertainty.
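In case it is useful to others: the standard way to impose several budget constraints at once is a weighted least-squares adjustment solved with Lagrange multipliers. The sketch below (my own; x and std are NumPy arrays of the terms and their standard deviations, in the illustrative order [a, b, c, d, e, f]) minimizes sum(((x_new - x)/std)^2) subject to A @ x_new = 0, where each row of A encodes one budget equation. With a single constraint it reduces exactly to the formula above.

import numpy as np

def var_close_multi(x, std, A):
    # closed form: x_new = x - S A^T (A S A^T)^-1 (A x), with S = diag(std^2)
    S = np.diag(std**2)
    K = A @ S @ A.T                   # small: n_constraints x n_constraints
    lam = np.linalg.solve(K, A @ x)   # Lagrange multipliers (up to a constant factor)
    return x - S @ A.T @ lam

# the two budgets a+b+c+d = Res1 and a+c+e+f = Res2:
A = np.array([[1, 1, 1, 1, 0, 0],
              [1, 0, 1, 0, 1, 1]], dtype=float)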

Sparsity reduction

I have to factorize a big sparse matrix (6.5M rows representing users by 6.5M columns representing items) to find user and item latent vectors. I chose the ALS algorithm in the Spark framework (PySpark).
To boost the quality, I have to reduce the sparsity of my matrix to 98% (the current value is 99.99%, because I have only 356M filled entries).
I can do this by dropping rows or columns, but I must find the optimal solution, maximizing the number of rows (users).
The main problem is that I must find subsets of the user and item sets, and dropping rows can make further columns droppable and vice versa; the second problem is that the function evaluating sparsity is not linear.
How can I solve this problem? Which Python libraries can help me with it?
Thank you.
This is a combinatorial problem. There is no easy way to drop an optimal set of columns to achieve the maximum number of users while reducing sparsity. A formal approach is to formulate it as a mixed-integer program. Consider the following 0-1 matrix A, derived from your original matrix C:
A(i,j) = 1 if C(i,j) is nonzero,
A(i,j) = 0 if C(i,j) is zero
Parameters:
M : a sufficiently big numeric value, e.g. number of columns of A (NCOLS)
N : total number of nonzeros in C (aka in A)
Decision vars are
x(j) : 1 if column j is kept, 0 if it is dropped
nr(i) : number of nonzeros kept in row i
y(i) : 1 if row i is kept, 0 if it is dropped
Constraints:
sum(j, A(i,j)*x(j)) = nr(i)    for i = 1..NROWS
nr(i) <= y(i) * M              for i = 1..NROWS
sum(i, nr(i)) + e = 0.98 * N   (e is an explicit slack variable, penalized in the objective)
y(i) and x(j) are 0-1 variables (binary variables) for i,j
Objective:
maximize sum(i, y(i)) - N * e
Such a model would be extremely difficult to solve as an integer model. However, barrier methods should be able to solve the linear programming (LP) relaxation. Possible solvers are COIN/CLP (open-source), Lindo (commercial), etc. It may then be possible to use the LP solution to compute approximate integer solutions by simple rounding.
In the end, you will definitely need an iterative approach, solving the matrix factorization problem several times, each time factoring a different submatrix of C computed with the above approach, until you are satisfied with the solution.
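For what it's worth, here is a toy-scale PuLP transcription of the model above (illustrative only; at 6.5M x 6.5M you would solve the LP relaxation with a barrier solver and round, as described; A is a 0-1 list of lists and all names are mine):

import pulp

def build_sparsity_model(A, keep_frac=0.98):
    nrows, ncols = len(A), len(A[0])
    N = sum(map(sum, A))                  # total number of nonzeros
    M = ncols                             # big-M: max nonzeros possible in one row
    x = [pulp.LpVariable(f"x{j}", cat="Binary") for j in range(ncols)]  # column j kept?
    y = [pulp.LpVariable(f"y{i}", cat="Binary") for i in range(nrows)]  # row i kept?
    nr = [pulp.LpVariable(f"nr{i}", lowBound=0) for i in range(nrows)]  # nonzeros kept in row i
    e = pulp.LpVariable("e", lowBound=0)  # slack, penalized in the objective
    prob = pulp.LpProblem("sparsity_reduction", pulp.LpMaximize)
    prob += pulp.lpSum(y) - N * e         # objective: rows kept minus slack penalty
    for i in range(nrows):
        prob += pulp.lpSum(A[i][j] * x[j] for j in range(ncols)) == nr[i]
        prob += nr[i] <= M * y[i]
    prob += pulp.lpSum(nr) + e == keep_frac * N
    return prob, x, y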

Devising objective function for integer linear programming

I am working to devise an objective function for an integer linear programming model. The goal is to determine the copy number of two genes, as well as whether a gene conversion event has happened (where one copy is overwritten by the other, which looks as if one copy was deleted even though the net copy number has not changed).
The problem involves two data vectors, P_A and P_B. The vectors contain continuous values larger than zero that correspond to a measure of copy number made at each position. P_{A,i} is not necessarily the same spot across the gene as P_{B,i} is, because the positions are unique to each copy (and can be mapped to an absolute position in the genome).
Given this, my plan was to try and minimize the difference between my decision variables and the measured data across different genome windows, giving me different slices of the two data vectors that correspond to the same region.
Decision variables:
A_w = copy number of A in window in {0,1,2,3,4}
B_w = copy number of B in window in {0,1,2,3,4}
C_w = gene conversion in {-2,-1,0,1,2}
The goal then would be to minimize the difference between the left and right sides of the equations below:
A_w - C_w ~= mean(P_{A,W})
B_w + C_w ~= mean(P_{B,W})
Subject to a handful of constraints, such as 2 <= A_w + B_w <= 4.
But I am unsure how to formulate this into a function to minimize. I have two equations that are not really a function, and the decision variables have no coefficients.
I am also unsure how to handle the negative values of C_w.
I am also unsure how to bring the results back together; after I solve the LP in each window, I still need to merge the window results into one gene-wide call (and ideally identify which window(s) had non-zero values of C_w).
Create the LpProblem instance:
problem = LpProblem("Another LpProblem", LpMinimize)
Objective (per what you've vaguely described above, with the window means precomputed as plain numbers mean_PA_W and mean_PB_W):
problem += (mean_PA_W - (A_w - C_w)) + (mean_PB_W - (B_w + C_w))
This is all I could tell from your really rather vague question. You'll need to be much more specific with what you mean by terms like "bring the results back together", or "handle the negative values in C_w". Add in your current code snippets and the errors you're getting for more details.
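On the absolute-value point specifically: one standard trick is to introduce a nonnegative auxiliary variable per equation that bounds the deviation from above in both directions, and minimize the sum of those. A self-contained sketch for one window (the window means 1.8 and 2.1 are made-up numbers; the bounds and the copy-number constraint are taken from your question):

from pulp import LpProblem, LpMinimize, LpVariable

mean_PA_W, mean_PB_W = 1.8, 2.1       # window means of P_A and P_B (illustrative)

problem = LpProblem("copy_number_window", LpMinimize)
A_w = LpVariable("A_w", 0, 4, cat="Integer")
B_w = LpVariable("B_w", 0, 4, cat="Integer")
C_w = LpVariable("C_w", -2, 2, cat="Integer")
tA = LpVariable("tA", lowBound=0)     # bounds |mean_PA_W - (A_w - C_w)|
tB = LpVariable("tB", lowBound=0)     # bounds |mean_PB_W - (B_w + C_w)|

problem += tA + tB                    # objective: sum of absolute deviations
problem += tA >= mean_PA_W - (A_w - C_w)
problem += tA >= (A_w - C_w) - mean_PA_W
problem += tB >= mean_PB_W - (B_w + C_w)
problem += tB >= (B_w + C_w) - mean_PB_W
problem += A_w + B_w >= 2
problem += A_w + B_w <= 4
problem.solve()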

How would I use an artificial intelligence algorithm in a program so that it learns and assigns appropriate weight values?

I have a program I am writing, and I am wondering how I would use some AI algorithm so that it can learn and assign appropriate weight values to my fields.
For example, I have fields a, b, c, d, and e. Each of these fields would have a different weight, because field a is more valuable than d. I was wondering how I would go about doing this so I can normalize my values and use a weighted sum for comparison.
Example:
Weight of a = 1
Weight of b = 2
Weight of c = 3
Weight of d = 4
Weight of e = 5
For the sum, multiply each field's value by its assigned weight:
Result = (value of a) * 1 + (value of b) * 2 + (value of c) * 3 + (value of d) * 4 + (value of e) * 5
I am looking to input some training data and train my program to learn and compare the a,b,c,d,e values possessed by each object so that it can assign weights to each one.
EDIT: I am just looking for the method to approach this, whether it be by using neural nets, or some other means to learn and assign weights to these fields.
The best way to do this depends a lot on what kind of program you're writing. How do you assess how good an answer result is?
If result can be either correct or incorrect in a categorical way, then a neural net would be a great option. You could use a two-layer topology (i.e. all of the input nodes connected directly to each output node, with no layer in between), and have each input node correspond to one of your fields (a, b, c, etc.). You can then use backpropagation to train the network so that each set of field values maps to the correct category. The edge weights you end up with are the weights to associate with each field.
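A minimal NumPy sketch of that idea for a binary category (all names and the gradient-descent details are my choice; X is an n-by-5 array of field values, t an array of 0/1 labels):

import numpy as np

def train_weights(X, t, lr=0.1, epochs=1000):
    w = np.zeros(X.shape[1])                     # one weight per field
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid output node
        w -= lr * X.T @ (p - t) / len(t)         # gradient of cross-entropy loss
        b -= lr * np.mean(p - t)
    return w, b                                  # w[i] is the learned weight of field i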
However, if result can be more or less accurate in some sort of continuous way, a genetic algorithm is probably a better solution. This could be the case if you're comparing result to some ideal value, if you're using the weights in some sort of function with an evaluatable outcome (like a game), or in some other similar situation. The fitness function you use will depend on your exact circumstances (for the examples above you might use proximity to the ideal value, or win-loss ratio when playing the game with those weights). There are a variety of ways you could format the candidate solutions:
One option would be to use a sequence of bitstrings representing each weight in binary. Mutations could flip any bit, and crossover could occur at any point along the string (or you could allow it to occur only between numbers).
If you want to allow for floating-point values, though, you might be better off using a system where each candidate solution is a list of weights. Mutations can add to or subtract from a given weight, and crossover can occur at any point within the list (a small sketch of this option follows).
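A sketch of that floating-point variant (fitness is whatever evaluation your program provides; the population size, mutation scale, and so on are arbitrary choices):

import random

def evolve(fitness, n_fields=5, pop_size=50, generations=200):
    pop = [[random.uniform(-5, 5) for _ in range(n_fields)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)      # higher fitness is better
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n_fields)  # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(n_fields)       # mutate one weight
            child[i] += random.gauss(0, 0.5)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)                 # best weight list found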
If you want to provide more information on what specifically your program is trying to accomplish, I can try to offer a more specific suggestion.
