concordance index in python - python

I'm looking for a python/sklearn/lifelines/whatever implementation of Harrell's c-index (concordance index), which is mentioned in random survival forests.
The C-index is calculated using the following steps:
1. Form all possible pairs of cases over the data.
2. Omit those pairs whose shorter survival time is censored. Omit pairs i and j if Ti=Tj unless at least one is a death. Let Permissible denote the total number of permissible pairs.
3. For each permissible pair where Ti and Tj are not equal, count 1 if the shorter survival time has worse predicted outcome; count 0.5 if predicted outcomes are tied.
4. For each permissible pair where Ti=Tj and both are deaths, count 1 if predicted outcomes are tied; otherwise, count 0.5.
5. For each permissible pair where Ti=Tj, but not both are deaths, count 1 if the death has worse predicted outcome; otherwise, count 0.5.
6. Let Concordance denote the sum over all permissible pairs.
7. The C-index, C, is defined by C=Concordance/Permissible.
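As a rough illustration, the counting rules above can be written directly in plain Python. This is only a sketch: the function name harrell_c_index and its argument names are made up for this example, events are assumed to be 1 for a death and 0 for a censored case, and a higher predicted_risk is assumed to mean a worse predicted outcome. It loops over all pairs, so it is quadratic in the number of cases.

```python
from itertools import combinations


def harrell_c_index(times, predicted_risk, events):
    """Sketch of Harrell's C-index using the pair-counting rules above.

    times: observed survival or censoring times
    predicted_risk: higher value = worse predicted outcome (assumption)
    events: 1 if the case is a death, 0 if it is censored (assumption)
    """
    concordance = 0.0
    permissible = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] != times[j]:
            # Order the pair so that `short` has the shorter time.
            short, long_ = (i, j) if times[i] < times[j] else (j, i)
            if not events[short]:
                continue  # shorter time is censored: omit the pair
            permissible += 1
            if predicted_risk[short] > predicted_risk[long_]:
                concordance += 1.0
            elif predicted_risk[short] == predicted_risk[long_]:
                concordance += 0.5
        else:
            if not (events[i] or events[j]):
                continue  # tied times and no death: omit the pair
            permissible += 1
            if events[i] and events[j]:
                # Both are deaths: full credit only for tied predictions.
                concordance += 1.0 if predicted_risk[i] == predicted_risk[j] else 0.5
            else:
                # Exactly one death: it should have the worse prediction.
                death, other = (i, j) if events[i] else (j, i)
                concordance += 1.0 if predicted_risk[death] > predicted_risk[other] else 0.5
    if permissible == 0:
        raise ValueError("no permissible pairs")
    return concordance / permissible
```

Library implementations may treat ties differently or expect scores with the opposite sign, so results need not match this sketch exactly.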
Note: nltk has a ConcordanceIndex method with a different meaning :(

The lifelines package now implements the c-index (concordance index).

The lifelines package provides an implementation of the concordance index.
pip install lifelines
or
conda install -c conda-forge lifelines
Example:
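from lifelines import CoxPHFitter
from lifelines.utils import concordance_index
# df is a DataFrame with a duration column 'T', an event column 'E', and covariates
cph = CoxPHFitter().fit(df, 'T', 'E')
# negate the partial hazard: concordance_index expects higher scores for longer survival
concordance_index(df['T'], -cph.predict_partial_hazard(df), df['E'])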

Related

Is this considered a fitness function for a genetic algorithm?

I start off with a population. I also have properties each individual in the population can have. If an individual DOES have the property, its score goes up by 5. If it DOESN'T have it, its score increases by 0.
Example code using length as a property:
score = 0
for x in individual:
    if len(x) < 5:
        score += 5
    # otherwise (len(x) >= 5) the score is left unchanged
Then I add up the total score and select the individuals I want to continue. Is this a fitness function?
Anything can be a fitness function as long as it gives better points for better DNA. The code you wrote looks like a gene of a DNA rather than a constraint. If it were a constraint, you'd give it a growing score penalty (is it a minimization of score?) depending on the distance to the constraint point, so that the selection/crossover step could prioritize DNAs closer to 5 among the smaller values and DNAs farther from 5 among the bigger ones. But currently it looks like "anything > 5 works fine", so there will be a lot of random solutions with high diversity rather than values like 4.9, 4.99, etc., even if you apply elitism.
If there are many variables like "len" with equal score, then one gene's failure could be shadowed by another gene's success. To stop this, you can give them different scores like 5,10,20,40,... so that the selection and crossover can know if it actually made progress without any failure.
If you meant that 5 to be a constraint, then you should tell the selection that "failed" values closer to 5 (i.e. 4, 4.5, 4.9, 4.99) are better than distant ones, by applying a variable score like this:
if gene < constraint_value:
    score += (constraint_value - gene) ** 2
# if you meant to add zero otherwise, there is no need to add it explicitly
In the comments, you mentioned molecular computations. Molecules have floating-point coordinates and masses, so if you are optimizing them, a constraint with a variable penalty will make it easier for the selection to get better groups of DNAs for future generations, provided the mutation adds onto the current value of a gene rather than setting it to a totally random value.
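A minimal sketch of the two ideas in this answer, assuming a fitness that is minimised and genes that are plain numbers. The names graded_penalty and property_score, and the default values, are hypothetical and only serve the illustration.

```python
def graded_penalty(genes, constraint_value=5.0):
    """Penalty to minimise: squared shortfall of each gene below the constraint.

    Genes at or above constraint_value add nothing, so 4.99 scores
    better (lower) than 4.0 and selection can see gradual progress.
    """
    penalty = 0.0
    for gene in genes:
        if gene < constraint_value:
            penalty += (constraint_value - gene) ** 2
    return penalty


def property_score(satisfied, weights=(5, 10, 20, 40)):
    """Score to maximise: each satisfied property adds its own weight.

    With distinct doubling weights, two individuals get the same score
    only if they satisfy exactly the same properties, so one gene's
    failure cannot be hidden by another gene's success.
    """
    return sum(weight for ok, weight in zip(satisfied, weights) if ok)
```

For example, graded_penalty([4.9]) is smaller than graded_penalty([4.0]), while a fixed "+5 or +0" score would rate both values the same.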

Fitness function with multiple weights in DEAP

I'm learning to use the Python DEAP module and I have created a minimising fitness function and an evaluation function. The code I am using for the fitness function is below:
ct.create("FitnessFunc", base.Fitness, weights=(-0.0001, -100000.0))
Notice the very large difference in weights. This is because the DEAP documentation for Fitness says:
The weights can also be used to vary the importance of each objective one against another. This means that the weights can be any real number and only the sign is used to determine if a maximization or minimization is done.
To me, this says that you can prioritise one weight over another by making it larger.
I'm using algorithms.eaSimple (with a HallOfFame) to evolve and the best individuals in the population are selected with tools.selTournament.
The evaluation function returns abs(sum(input)), len(input).
After running, I take the values from the HallOfFame and evaluate them, however, the output is something like the following (numbers at end of line added by me):
(154.2830144, 3) 1
(365.6353634, 4) 2
(390.50576340000003, 3) 3
(390.50576340000003, 14) 4
(417.37616340000005, 4) 5
The thing that is confusing me is that I thought that the documentation stated that the larger second weight meant that len(input) would have a larger influence and would result in an output like so:
(154.2830144, 3) 1
(365.6353634, 4) 2
(390.50576340000003, 3) 3
(417.37616340000005, 4) 5
(390.50576340000003, 14) 4
Notice that lines 4 and 5 are swapped. This is because the weight of line 4 was much larger than the weight of line 5.
It appears that the fitness is actually evaluated based on the first element first, and then the second element is only considered if there is a tie between the first elements. If this is the case, then what is the purpose of setting a weight other than -1 or +1?
From a Pareto-optimality standpoint, neither of the two solutions A=(390.50576340000003, 14) and B=(417.37616340000005, 4) is superior to the other, regardless of the weights: f1(A) < f1(B) but f2(A) > f2(B), so with both objectives minimised each solution is better on one objective and neither dominates the other.
If they are on the same frontier, the winner can be selected based on a secondary metric: the density of solutions surrounding each solution in the frontier, which does account for the weights (weighted crowding distance). That only applies if you select an appropriate operator, such as selNSGA2. The selTournament operator you are using picks the maximum fitness, which effectively ranks on the first objective and uses the second only to break ties:
def selTournament(individuals, k, tournsize, fit_attr="fitness"):
    chosen = []
    for i in range(k):
        aspirants = selRandom(individuals, tournsize)
        chosen.append(max(aspirants, key=attrgetter(fit_attr)))
    return chosen
If you still want to use that, you can consider updating your evaluation function to return a single output of the weighted sum of the objectives. This approach would fail in the case of a non-convex objective space though (Page 12 here for details).
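The behaviour described here can be reproduced without DEAP. The sketch below assumes that a fitness is compared through its weighted values (each value multiplied by its weight) as an ordinary tuple, which is how the ordering in the question comes about. The helper names weighted_values, dominates and weighted_sum are hypothetical and not part of DEAP.

```python
from operator import mul


def weighted_values(values, weights):
    """Multiply each objective value by its weight (negative = minimise)."""
    return tuple(map(mul, values, weights))


def dominates(a, b):
    """True if weighted tuple a is no worse than b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def weighted_sum(values, weights):
    """Collapse all objectives into one number to maximise."""
    return sum(map(mul, values, weights))


weights = (-0.0001, -100000.0)
a_values = (390.50576340000003, 14)
b_values = (417.37616340000005, 4)
a = weighted_values(a_values, weights)
b = weighted_values(b_values, weights)

# Tuples compare element by element, so the first objective decides
# and the large second weight never gets a say unless there is a tie.
a_wins_tournament = a > b

# Neither solution Pareto-dominates the other.
nobody_dominates = not dominates(a, b) and not dominates(b, a)

# A single weighted-sum objective does let the second weight matter.
b_wins_weighted_sum = weighted_sum(b_values, weights) > weighted_sum(a_values, weights)
```

With a single weighted-sum objective the magnitudes of the weights matter, which is the ordering the question expected; the caveat about non-convex objective spaces still applies.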

Why does the fuzzywuzzy Ratio() use a slightly different implementation of Levenshtein Distance when calculating the ratio between two strings?

I am trying to wrap my head around how the fuzzywuzzy library calculates the Levenshtein Distance between two strings, as the docs clearly mention that it is using that.
The Levenshtein Distance algorithm looks for the minimum number of edits needed to turn one string into the other. The allowed edits are the addition, deletion, and substitution of a character. Each of these counts as a single operation when calculating the score.
Here are a couple of examples:
Example 1
s1 = 'hello'
s2 = 'hell'
Levenshtein Score = 1 (it requires 1 edit, addition of 'o')
Example 2
s1 = 'hello'
s2 = 'hella'
Levenshtein Score = 1 (it requires 1 edit, substitution of 'a' to 'o')
Plugging these scores into the Fuzzywuzzy formula (len(s1)+len(s2) - LevenshteinScore)/(len(s1)+len(s2)):
Example 1: (5+4-1)/9 = 89%
Example 2: (5+5-1)/10 = 90%
Now fuzzywuzzy does return the same score for Example 1, but not for Example 2. The score for Example 2 is 80%. On investigating how it calculates the distances under the hood, I found out that it counts the 'substitution' operation as 2 operations rather than 1 (as defined for Levenshtein). I understand that it uses the difflib library, but I just want to know why it is called Levenshtein Distance when it actually is not.
I am just trying to figure out why there is a distinction here. What does it mean or explain? Basically, what is the reason for counting a substitution as 2 operations rather than 1, as defined in Levenshtein Distance, and still calling it Levenshtein Distance? Does it have something to do with the gaps in sentences? Is this a standard way of converting LD to a normalized similarity score?
I would love if somebody could give me some insight. Also is there a better way to convert LD to a similarity score? Or in general measure the similarity between two strings? I am trying to measure the similarity between some audio file transcriptions done by a human transcription service and by an Automatic Speech Recognition system.
Thank you!
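The difference described in the question can be reproduced with a small dynamic-programming edit distance whose substitution cost is a parameter. This is a sketch, not fuzzywuzzy's actual code: edit_distance and similarity are hypothetical names, and the comparison with difflib.SequenceMatcher is only checked here on the two examples from the question.

```python
from difflib import SequenceMatcher


def edit_distance(s1, s2, substitution_cost=1):
    """Edit distance with unit insert/delete cost and a configurable substitution cost."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else substitution_cost
            current.append(min(previous[j] + 1,          # delete c1
                               current[j - 1] + 1,       # insert c2
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def similarity(s1, s2, substitution_cost=1):
    """(len(s1) + len(s2) - distance) / (len(s1) + len(s2)), as in the question."""
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0
    return (total - edit_distance(s1, s2, substitution_cost)) / total


classic = similarity("hello", "hella", substitution_cost=1)   # substitution = 1 edit
weighted = similarity("hello", "hella", substitution_cost=2)  # substitution = delete + insert
difflib_ratio = SequenceMatcher(None, "hello", "hella").ratio()
```

With a substitution cost of 2, a substitution is as expensive as a deletion plus an insertion, so the score for a pair of equal-length strings that differ in one character drops from 90% to 80%, while the pure-deletion example keeps its 89%.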

Selecting an item (from a set of items) based on distance and frequency of occurrence

There exists a set of points (or items, it doesn't matter). Each point a is at a specific distance from other points in the set. The distance can be retrieved via the function retrieve_dist(a, b).
This question is about programming (in Python) an algorithm to pick a point, with replacement, from this set of points. The picked point:
i) has to be at the maximum possible distance from all already-selected points, while adhering to the requirement in (ii)
ii) the number of times an already-selected point occurs in the sample must carry weight in this calculation. I.e. more frequently-selected points should be weighed more heavily.
E.g. imagine a and b have already been selected (100 and 10 times respectively). Then when the next point is to be selected, its distance from a matters more than its distance from b, in line with the frequency of occurrence of a in the already-selected sample.
What I can try:
This would have been easy to accomplish if weights/frequencies weren't in play. I could do:
from collections import defaultdict

distances = defaultdict(int)
for new_point in set_of_points:
    for already_selected_point in selected_points:
        distances[new_point] += retrieve_dist(new_point, already_selected_point)
Then I'd sort distances.items() by the second entry in each tuple, and would get the desired item to select.
However, when frequencies of already-selected points come into play, I just can't seem to wrap my head around this problem.
Can an expert help out? Thanks in advance.
A solution to your problem would be to make selected_points a list rather than a set. In this case, each new point is compared to a and b (and all other points) as many times as they have already been found.
If each point is typically found many times, it might be possible to improve performance by using a dict instead, with the points as keys and the number of times each point has been selected as values. In that case I think your algorithm would be
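from collections import defaultdict

distances = defaultdict(int)
for new_point in set_of_points:
    # selected_points maps each selected point to its number of occurrences
    for already_selected_point, occurrences in selected_points.items():
        distances[new_point] += occurrences * retrieve_dist(new_point, already_selected_point)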
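Putting the dict-based idea together with the final selection step might look like the sketch below. The function name pick_next, the one-dimensional example points and the absolute-difference distance are assumptions for illustration; in the question the distance comes from retrieve_dist.

```python
from collections import Counter


def pick_next(points, selected_counts, retrieve_dist):
    """Return the point with the largest frequency-weighted total distance.

    selected_counts maps each already-selected point to how many times
    it has been picked, so frequently picked points weigh more.
    """
    def weighted_distance(candidate):
        return sum(count * retrieve_dist(candidate, chosen)
                   for chosen, count in selected_counts.items())

    return max(points, key=weighted_distance)


# Hypothetical data: points on a line, distance = absolute difference.
points = [0, 3, 10]
selected = Counter({0: 100, 10: 10})

next_point = pick_next(points, selected, lambda a, b: abs(a - b))
selected[next_point] += 1  # sampling with replacement: just bump the count
```

Because the point 0 has been picked 100 times, candidates far from 0 are strongly favoured, even though that means picking 10 again.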

How to do binomial distribution in python where trial probabilities are unequal

I know how to do a standard binomial distribution in python where the probability of each trial is the same. My question is what to do if the trial probabilities change each time. I'm drafting up an algorithm based on the paper below but thought I should check on here to see whether there's already a standard way to do it.
http://www.tandfonline.com/doi/abs/10.1080/00949658208810534#.UeVnWT6gk6w
Thanks in advance,
James
Is this kind of what you are looking for?
import numpy as np

def random_MN_draw(n, probs):
    """Return one multinomial draw of n trials whose outcome probabilities are given by 'probs'.
    With n=1 and probs=[0.5, 0.5] this is a single fair coin flip."""
    return np.random.multinomial(n, probs)

def simulate(sim_probabilities):
    """sim_probabilities holds one [P(H), P(T)] pair per flip."""
    len_sim = len(sim_probabilities)
    simulated_flips = np.zeros((2, len_sim), dtype=int)
    for i in range(len_sim):
        # n=1: a single flip for each trial, using that trial's own probabilities
        simulated_flips[:, i] = random_MN_draw(1, sim_probabilities[i])
    # Row 0 marks heads and row 1 marks tails. At the end of the simulation
    # you can count the heads in 'simulated_flips' to estimate P(H) and P(T).
    return simulated_flips
Suppose you want to do 9 coin tosses, and P(H) on each flip is 0.1 .. 0.9, respectively: a 10% chance of a head on the first flip, 90% on the last.
For E(H), the expected number of heads, you can just sum the 9 individual expectations.
For a distribution, you could enumerate the ordered possible outcomes (itertools.product("HT", repeat=9))
(HHH HHH HHH)
(HHH HHH HHT)
...
(TTT TTT TTT)
and calculate a probability for the ordered outcome in a straightforward manner.
for each ordered outcome, increment a defaultdict(float) indexed by the number of heads with the calculated p.
When done, compute the sum of the dictionary values, then divide every value in the dictionary by that sum.
You'll have 10 values that correspond to the chances of observing 0 .. 9 heads.
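A short sketch of this enumeration, using only the standard library. The function name heads_distribution is made up for the example. It uses itertools.product because that yields every ordered sequence of flips (2**n of them), so it is only practical for a small number of tosses.

```python
from collections import defaultdict
from itertools import product


def heads_distribution(p_heads):
    """Probability of each possible number of heads, by brute-force enumeration.

    p_heads[i] is the probability of a head on flip i.
    """
    distribution = defaultdict(float)
    for outcome in product("HT", repeat=len(p_heads)):
        probability = 1.0
        for flip, p in zip(outcome, p_heads):
            probability *= p if flip == "H" else 1.0 - p
        # Index by the number of heads in this ordered outcome.
        distribution[outcome.count("H")] += probability
    return dict(distribution)


p_heads = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
distribution = heads_distribution(p_heads)
expected_heads = sum(k * p for k, p in distribution.items())
```

The expected number of heads computed from the distribution should agree with the sum of the individual probabilities, which gives a quick sanity check.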
Gerry
Well, the question is old and I can't answer it, since I don't know Python's math libraries well enough.
However, it might be helpful to other readers to know that this distribution often goes by the name
Poisson Binomial Distribution
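For reference, the probability mass function of this distribution can be computed exactly by adding one trial at a time. This is a common dynamic-programming approach rather than the method of the paper linked in the question, and the function name poisson_binomial_pmf is hypothetical.

```python
def poisson_binomial_pmf(probs):
    """Return [P(0 successes), ..., P(n successes)] for independent trials.

    probs[i] is the success probability of trial i. The list is built up
    one trial at a time: after each trial, k successes can come from
    k successes plus a failure, or k - 1 successes plus a success.
    """
    pmf = [1.0]  # with zero trials, zero successes is certain
    for p in probs:
        updated = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            updated[k] += mass * (1.0 - p)  # this trial fails
            updated[k + 1] += mass * p      # this trial succeeds
        pmf = updated
    return pmf
```

This takes on the order of n squared operations for n trials, so it stays usable where enumerating all 2**n outcomes does not. When all probabilities are equal it reduces to the ordinary binomial distribution.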
