Fitness function with multiple weights in DEAP

Fitness function with multiple weights in DEAP - python

I'm learning to use the Python DEAP module and I have created a minimising fitness function and an evaluation function. The code I am using for the fitness function is below:
ct.create("FitnessFunc", base.Fitness, weights=(-0.0001, -100000.0))
Notice the very large difference in weights. This is because the DEAP documentation for Fitness says:
The weights can also be used to vary the importance of each objective one against another. This means that the weights can be any real number and only the sign is used to determine if a maximization or minimization is done.
To me, this says that you can prioritise one weight over another by making it larger.
I'm using algorithms.eaSimple (with a HallOfFame) to evolve and the best individuals in the population are selected with tools.selTournament.
The evaluation function returns abs(sum(input)), len(input).
After running, I take the values from the HallOfFame and evaluate them, however, the output is something like the following (numbers at end of line added by me):
(154.2830144, 3) 1
(365.6353634, 4) 2
(390.50576340000003, 3) 3
(390.50576340000003, 14) 4
(417.37616340000005, 4) 5
The thing that is confusing me is that I thought that the documentation stated that the larger second weight meant that len(input) would have a larger influence and would result in an output like so:
(154.2830144, 3) 1
(365.6353634, 4) 2
(390.50576340000003, 3) 3
(417.37616340000005, 4) 5
(390.50576340000003, 14) 4
Notice that lines 4 and 5 are swapped. This is because the weight of line 4 was much larger than the weight of line 5.
It appears that the fitness is actually evaluated based on the first element first, and then the second element is only considered if there is a tie between the first elements. If this is the case, then what is the purpose of setting a weight other than -1 or +1?

From a Pareto-optimality standpoint, neither of the two A=(390.50576340000003, 14) and B=(417.37616340000005, 4) solutions are superior to the other, regardless of the weights; always f1(A) > f1(B) and f2(A) < f2(B), and therefore neither dominates the other (source):
If they are on the same frontier, the winner can now be selected based on a secondary metric: density of solutions surrounding each solution in the frontier, which now accounts for the weights (wighted crowding distance). Indeed, if you select an appropriate operator, like selNSGA2. The selTournament operator you are using selects on the basis the first objective only:
def selTournament(individuals, k, tournsize, fit_attr="fitness"):
chosen = []
for i in xrange(k):
aspirants = selRandom(individuals, tournsize)
chosen.append(max(aspirants, key=attrgetter(fit_attr)))
return chosen
If you still want to use that, you can consider updating your evaluation function to return a single output of the weighted sum of the objectives. This approach would fail in the case of a non-convex objective space though (Page 12 here for details).

Related

Is this considered a fitness function for a genetic algorithm?

I start off with a population. I also have properties each individual in the population can have. If an individual DOES have the property, it’s score goes up by 5. If it DOESNT have it, it’s score increases by 0.
Example code using length as a property:
for x in individual:
if len <5:
score += 5
if len >=5:
score += 0
Then I add up the total score and select the individuals I want to continue. Is this a fitness function?

Anything can be a fitness algorithm as long as it gives better points for better DNA. The code you wrote looks like a gene of a DNA rather than a constraint. If it was a constraint, you'd give it a growing score penalty (its a minimization of score?) depending on the distance to the constraint point so that the selection/crossover part could prioritize the closer DNAs to 5 for smaller and distant values to 5 for the bigger. But currently it looks like "anything > 5 works fine" so there will be a lot of random solutions to this with high diversity rather than values like 4.9, 4.99, etc even if you apply elitism.
If there are many variables like "len" with equal score, then one gene's failure could be shadowed by another gene's success. To stop this, you can give them different scores like 5,10,20,40,... so that the selection and crossover can know if it actually made progress without any failure.
If you've meant a constraint by that 5, then you should tell the selection that the "failed" values closer to 5 (i.e. 4,4.5,4.9,4.99) are better than distant ones, by applying a variable score like this:
if(gene < constraint_value)
score += (constraint_value - gene)^2;
// if you've meant to add zero, then don't need to add zero
In comments, you said molecular computations. Molecules have floating point coordinates&masses so if you are optimizing them, then the constraint with variable penalty will make it easier for the selection to get better groups of DNAs for future generations if the mutation is adding onto the current value of genes rather than setting them to a totally random value.

Minimize the number of outputs

For a linear optimization problem, I would like to include a penalty. The penalty of every option (penalties[(i)]) should be 1 if the the sum is larger than 0 and 0 if the penalty is zero. Is there a way to do this?
The penalty is defined as:
penalties = {}
for i in A:
penalties[(i)]=(lpSum(choices[i][k] for k in B))/len(C)
prob += Objective Function + sum(penalties)
For example:
penalties[(0)]=0
penalties[(1)]=2
penalties[(3)]=6
penalties[(4)]=0
The sum of the penalties should then be:
sum(penalties)=0+1+1+0= 2

Yes. What you need to do is to create binary variables: use_ith_row. The interpretation of this variable will be ==1 if any of the choices[i][k] are >= 0 for row i (and 0 otherwise).
The penalty term in your objective function simply needs to be sum(use_ith_row[i] for i in A).
The last thing you need is the set of constraints which enforce the rule described above:
for i in A:
lpSum(choices[i][k] for k in B) <= use_ith_row[i]*M
Finnaly, you need to choose M large enough so that the constraint above has no limiting effect when use_ith_row is 1 (you can normally work out this bound quite easily). Choosing an M which is way too large will also work, but will tend to make your problem solve slower.
p.s. I don't know what C is or why you divide by its length - but typically if this penalty is secondary to you other/primary objective you would weight it so that improvement in your primary objective is always given greater weight.

Find subset with similar mean as full set

I have 50 lists, each one filled with 0s ans 1s. I know the overall proportion of 1s when you consider all the 50 lists pooled together. I want to find the 10 lists that pooled together best resemble the overall proportion of 1s.
The function I want to minimise is abs(mean(pooled subset) - mean(pooled full set))
For those who know pandas:
In pandas terms, I have a dataframe as follows
and so on, with a total of 50 labels, each one with a number of values ranging between 100 and 1000.
I want to find the subset of 10 labels that minimises d, where d
d = abs(df.loc[df.label.isin(subset), 'Value'].mean() - df.Value.mean())
I tried to apply dynamic programming solutions to the knapsack problem, but the issue is that the contribution of each list (label) to the final sample mean changes depending on which other lists you will include afterwards (because they will increase the sample size in unpredictable ways). It's like having knapsack problem where every new item you pick changes the value of the items you previously picked. Tricky.
Is there a better algorithm to solve this problem?

There is a way, somewhat cumbersome, to formulate this problem as a MIP (Mixed Integer Programming) problem.
We need the following data:
mu : mean of all data
mu(i) : mean of each subset i
n(i) : number of elements in each subset
N : number of subsets we need to select
And we need some binary decision variables
delta(i) = 1 if subset i is selected and 0 otherwise
A formal statement of the optimization problem can look like:
min | mu - sum(i, mu(i)*n(i)*delta(i)) / sum(i, n(i)*delta(i)) |
subject to
sum(i, delta(i)) = N
delta(i) in {0,1}
Here sum(i, mu(i)*n(i)*delta(i)) is the total value of the selected items and sum(i, n(i)*delta(i)) is the total number of selected items.
The objective is clearly nonlinear (we have an absolute value and a division). This is sometimes called an MINLP problem (MINLP for Mixed Integer Nonlinear Programming). Although MINLP solvers are readily available, we actually can do better. Using some gymnastics we can reformulate this problem into a linear problem (by adding some extra variables and extra inequality constraints). The full details are here. The resulting MIP model can be solved with any MIP solver.
Interestingly we don't need the data values in the model, just n(i),mu(i) for each subset.

Devising objective function for integer linear programming

I am working to devise a objective function for a integer linear programming model. The goal is to determine the copy number of two genes as well as if a gene conversion event has happened (where one copy is overwritten by the other, which looks like one was deleted but the net copy number has not changed).
The problem involves two data vectors, P_A and P_B. The vectors contain continuous values larger than zero that correspond to a measure of copy number made at each position. P_{A,i} is not necessarily the same spot across the gene as P_{B,i} is, because the positions are unique to each copy (and can be mapped to an absolute position in the genome).
Given this, my plan was to try and minimize the difference between my decision variables and the measured data across different genome windows, giving me different slices of the two data vectors that correspond to the same region.
Decision variables:
A_w = copy number of A in window in {0,1,2,3,4}
B_w = copy number of B in window in {0,1,2,3,4}
C_w = gene conversion in {-2,-1,0,1,2}
The goal then would be to minimize the difference between the left and right sides of the below equations:
A_w - C_w ~= mean(P_{A,W})
B_w + C_w ~= mean(P_{B,W})
Subject to a handful of constraints such as 2 <- A_w + B_w <= 4
But I am unsure how to formulate this into a function to minimize. I have two equations that are not really a function, and the decision variables have no coefficients.
I am also unsure of how to handle the negative values of C_w.
I also am unsure of how to bring the results back together; after I solve the LP in each window, I still need to merge it into one gene-wide call (and ideally identify which window(s) had non-zero values of C_w.

Create the LpProblem instance:
problem = LpProblem("Another LpProblem", LpMinimize)
Objective (per what you've vaguely described above):
problem += (mean(P_{A,W}) - (A_w - C_w)) + (mean(P_{B,W}) - (B_w + C_w))
This is all I could tell from your really rather vague question. You'll need to be much more specific with what you mean by terms like "bring the results back together", or "handle the negative values in C_w". Add in your current code snippets and the errors you're getting for more details.

How would I use an Artificial Intelligence algorithm to a program for it to learn and assign appropriate weight values?

I have a program I am writing and I am wondering how I would use some AI algorithm for my program so that it can learn and assign appropriate weight values to my fields.
For example I have fields a, b, c, d, and e. Each of these fields would have different weights because field a is more valuable than d. I was wondering how I would go about doing this so I can normalize my values and use a sum of these values to compare.
Example:
Weight of a = 1
Weight of b = 2
Weight of c = 3
Weight of d = 4
Weight of e = 5
For the sum, multiply each field's value with its assigned weight:
Result = (value of a) * 1 + (value of b) * 2 + (value of c) * 3 + (value of d) * 4 + (value of e) * 5
I am looking to input some training data and train my program to learn and compare the a,b,c,d,e values possessed by each object so that it can assign weights to each one.
EDIT: I am just looking for the method to approach this, whether it be by using neural nets, or some other means to learn and assign weights to these fields.

The best way to do this depends a lot on what kind of a program you're writing. How do you assess how good of an answer result is?
If result can either be correct or incorrect in a categorical way, then a neural net would be a great option. You could use a two-layer topology (i.e. all of the input nodes are connected to each output node, with no layer in between), and have each input node correspond to one of your fields (a, b, c, etc.). You can then use backpropagation to train the network such that each set of field values maps to the correct category. The edge weights that you end up with at the end will be the weights to associate with each field.
However, if result can either be either more or less accurate in some sort of a continuous way, a genetic algorithm is probably a better solution. This could be the case if you're comparing result to some ideal value, if you're using the weights in some sort of function with an evaluatable outcome (like a game), or some other similar situation. The fitness function that you use will depend on your exact circumstances (for the examples above you might use proximity to the ideal value, or win-loss ratio when playing the game with those values). There are a variety of ways that you could format the candidate solutions:
One option would be to use a sequence of bitstrings representing each weight in binary. Mutations could flip any bit and crossover could occur at any point along the string (or you could only allow it to occur between numbers)
If you want to allow for floating point values, though, you might be
better off using a system where each candidate solution is a list of weights. Mutations can add or subtract from a given weight an crossover can occur at any point within the list.
If you want to provide more information on what specifically your program is trying to accomplish, I can try to offer a more specific suggestion.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.