Calculating average artist entropy given user predictions and tracks in recommender systems - Python

I have to calculate the average artist entropy of users. I have solved this task for one test case, but I am not able to generalize it to other test cases.
The Shannon entropy formula was used to calculate the entropy of each user.
def get_average_entropy_score(predictions: np.ndarray, item_df: pd.DataFrame, topK=10) -> float:
    """
    predictions - np.ndarray - predictions of the recommendation algorithm for each user.
    item_df - pd.DataFrame - information about each song with columns 'artist' and 'track'.
    returns - float - average entropy score of the predictions.
    """
    score = None
    # TODO: YOUR IMPLEMENTATION.
    l = []
    for i in item_df['artist']:
        l.append(i)
    prob = 0
    prob2 = 0
    prob3 = 0
    prob4 = 0
    for i in range(len(predictions)):
        for j, v in enumerate(predictions[i]):
            if l[v] == 'A1':
                p = 1/len(predictions[i])
                prob += p
            if l[v] == 'A2':
                p = 1/len(predictions[i])
                prob2 += p
            if l[v] == 'A3':
                p = 1/len(predictions[i])
                prob3 += p
            if l[v] == 'A4':
                p = 1/len(predictions[i])
                prob4 += p
            if v != -1:
                continue
    entro1 = (prob*np.log2(prob))
    entro2 = -(prob2*np.log2(prob2) + prob3*np.log2(prob3) + prob4*np.log2(prob4))
    add = entro1 + entro2
    entropy_over_users = add/4  # number of items/user
    score = entropy_over_users
    print(entropy_over_users)
    return score
Now imagine I have a dataframe of artist - track like the following:
item_df = pd.DataFrame({'artist': ['A1', 'A1', 'A1', 'A1', 'A2', 'A3', 'A4']})
And I have a prediction of recommender system predicting items in position 0 1 2 or 3 like the following:
predictions = np.array([[0, 1, 2, 3], [6, 5, 4, 3], [-1, -1, -1, -1]])
From predictions, e.g. user 1 has been recommended item 0 first, item 1 second, item 2 third and item 3 fourth. A prediction of -1 means I should ignore this value, because this item has not been seen by the user and should not be included in the calculation at all.
Now the question is that I can't get it to work for the general case where, for example, I don't know the artists A1, A2 and so on, or, better, imagine you don't know the track names. Also note that item 0 in the prediction means the first track in item_df, item 1 means the second, and so on. Please help me, I don't know how to progress further! Please ask if something is unclear! Thanks!
Additional remark: solving the test case on paper gave me 0.5 if I normalize it.
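One way to generalize this (a sketch, not a verified solution): for each user, keep only the valid recommendations (dropping -1), map them to artists through item_df, turn the artist counts into probabilities, compute the Shannon entropy, normalize by log2 of the number of recommended items, and average over users. On the example above this gives (0 + 1) / 2 = 0.5, matching the hand-computed normalized value.
import numpy as np
import pandas as pd

def get_average_entropy_score(predictions: np.ndarray, item_df: pd.DataFrame, topK=10) -> float:
    artists = item_df['artist'].to_numpy()
    user_entropies = []
    for user_preds in predictions:
        # keep only real recommendations (-1 means "not seen" and is skipped)
        valid = [p for p in user_preds[:topK] if p != -1]
        if not valid:
            continue
        # probability of each artist among this user's recommendations
        counts = pd.Series(artists[valid]).value_counts()
        probs = counts / counts.sum()
        entropy = -(probs * np.log2(probs)).sum()
        # normalize by the maximum possible entropy for this many items
        max_entropy = np.log2(len(valid)) if len(valid) > 1 else 1.0
        user_entropies.append(entropy / max_entropy)
    return float(np.mean(user_entropies)) if user_entropies else 0.0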

Related

Uncapacitated facility location covering

I am trying to solve this problem using PuLP.
This is my code. There is a problem, because the result should be to keep only the second location:
# Import PuLP modeler functions
from pulp import *

# Set of locations J
Locations = ["A", "B", "C"]

# Set of demands I
Demands = ["1", "2", "3", "4", "5"]

# Set of distances ij
dt = [  # Demands I
    #  1   2   3   4    5
    [ 2, 23, 30, 54,   1],   # A   Locations J
    [ 3,  1,  2,  2,   3],   # B
    [50, 65, 80, 90, 100]    # C   distances are very long
]

# Max value to get covered
s = 5

# These binary values should be generated by code from the dt array... I write them down directly for simplification.
# Demand I is served by location J if the distance is <= 5 (0 = KO, 1 = OK)
covered = [
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],  # This shows that we only need Location B, not A
    [0, 0, 0, 0, 0]   # This shows we can't use Location C, it's too far
]
# Creates the 'prob' variable to contain the problem data
prob = LpProblem("Set covering", LpMinimize)

# Problem variables
J = LpVariable.dicts("location", Locations, cat='Binary')

# The distance data is made into a dictionary
distances = makeDict([Locations, Demands], covered, 0)

# The objective function
# Minimize J, which is the number of locations
prob += lpSum(J["A"] + J["B"] + J["C"])

# The constraint
# Is it covered or not?
for w in Locations:
    for b in Demands:
        if distances[w][b] > 0:
            prob += int(distances[w][b]) * J[w] >= 1

# Or eventually this instead:
# for w in Locations:
#     prob += lpSum([distances[w][b] * J[w] for b in Demands]) >= 1
# or that:
# prob += 1 * J["A"] >= 1
# prob += 1 * J["A"] >= 1
# prob += 1 * J["B"] >= 1
# prob += 1 * J["B"] >= 1
# prob += 1 * J["B"] >= 1
# prob += 1 * J["B"] >= 1
# prob += 1 * J["B"] >= 1

# The problem data is written to an .lp file
prob.writeLP("SetCovering.lp")

# The problem is solved using PuLP's choice of Solver
prob.solve()

# The status of the solution is printed to the screen
print("Status:", LpStatus[prob.status])

# Each of the variables is printed with its resolved optimum value
for v in prob.variables():
    print(v.name, "=", v.varValue)

# The optimised objective function value is printed to the screen
print("Total Locations = ", value(prob.objective))

# Show constraints
constraints = prob.constraints
print(constraints)

# Status: Optimal
# location_A = 1.0
# location_B = 1.0
# location_C = 0.0
# Total Locations = 2.0
The result should be :
location_A = 0.0
location_B = 1.0
location_C = 0.0
because location B covers all of our needs.
I wonder where the problem is; here is the maths code, I hope I wrote enough.
Thanks, it would be nice if you have a solution. I have also tried lpSum, with no luck.
Edit: Modified the code a bit; you can see 'optimal solution', but it's not the solution I want. Also added a Location "C".
EDIT: This is my new code. I added a secondary continuous PuLP dict for arc (link) generation (ser_customer). The solver should only pick Fac-2 in this case, because it is near all of the customers and the other facilities are way too far:
from pulp import *

# Lists (sets / arrays) of Customers and Facilities
Customer = [1, 2, 3, 4, 5]
Facility = ['Fac-1', 'Fac-2', 'Fac-3']

# Dictionary of distances in kms
distance = {'Fac-1': {1: 54, 2: 76, 3: 5, 4: 76, 5: 76},
            'Fac-2': {1: 1, 2: 3, 3: 1, 4: 8, 5: 1},
            'Fac-3': {1: 45, 2: 23, 3: 54, 4: 87, 5: 88}
            }

# Setting the Problem
prob = LpProblem("pb", LpMinimize)

# Defining our Decision Variables
use_facility = LpVariable.dicts("Use Facility", Facility, 0, 1, LpBinary)
ser_customer = LpVariable.dicts("Service", [(i, j) for i in Customer for j in Facility], 0)

# Setting the Objective Function = Minimize amount of facilities and arcs
prob += lpSum(use_facility['Fac-1'] + use_facility['Fac-2'] + use_facility['Fac-3']) \
        + lpSum(distance[j][i] * ser_customer[(i, j)] for j in Facility for i in Customer)

# Constraints: at least 1 arc must exist between facilities and customers
for i in Customer:
    prob += lpSum(ser_customer[(i, j)] for j in Facility) >= 1

prob.solve()

# Print the solution of Decision Variables
for v in prob.variables():
    print(v.name, "=", v.varValue)

# Print the solution of Binary Decision Variables
Tolerance = 0.0001
for j in Facility:
    if use_facility[j].varValue > Tolerance:
        print("Establish Facility at site = ", j)
The result seems to show good arcs (links), but there is no facility selection. Does anybody have an idea? Is there any way to force use_facility[index] to be > 0? Is adding arc decision variables a good idea? I have also tried to move the arcs into a constraint instead of the objective function, with no luck:
Service_(1,_'Fac_1') = 0.0
Service_(1,_'Fac_2') = 1.0
Service_(1,_'Fac_3') = 0.0
Service_(2,_'Fac_1') = 0.0
Service_(2,_'Fac_2') = 1.0
Service_(2,_'Fac_3') = 0.0
Service_(3,_'Fac_1') = 0.0
Service_(3,_'Fac_2') = 1.0
Service_(3,_'Fac_3') = 0.0
Service_(4,_'Fac_1') = 0.0
Service_(4,_'Fac_2') = 1.0
Service_(4,_'Fac_3') = 0.0
Service_(5,_'Fac_1') = 0.0
Service_(5,_'Fac_2') = 1.0
Service_(5,_'Fac_3') = 0.0
Use_Facility_Fac_1 = 0.0
Use_Facility_Fac_2 = 0.0
Use_Facility_Fac_3 = 0.0
I have also tried the AirSquid solution. I think I may be missing source decision variables that should be minimized, but I don't know how to add them; I guess covered are the arcs (links). Anyway, it is a good exercise, harder than a simple product mix:
prob = LpProblem('source minimzer', LpMinimize)
dist_limit = 5
sources = ['A', 'B', 'C']        # the source locations
# note this is zero-indexed to work with the list indexes in dist dictionary...
destinations = list(range(5))    # the demand locations 0, 1, 2, 3, 4
dist = {'A': [2, 23, 30, 54, 1],
        'B': [3, 1, 2, 2, 3],
        'C': [24, 54, 12, 56, 76]}

covered = LpVariable.dicts('covered', [(s, d) for s in sources for d in destinations], cat='Binary')

# The objective function
# Minimize the number of sources
prob += lpSum(covered[s, d])

# set up constraint to limit covered if the destination is "reachable"
for s in sources:
    for d in destinations:
        prob += covered[s, d] * dist[s][d] <= dist_limit

# add one more constraint to make sure that every destination is "covered"...

# The problem is solved using PuLP's choice of Solver
prob.solve()

# The status of the solution is printed to the screen
print("Status:", LpStatus[prob.status])

# The optimised objective function value is printed to the screen
print("Location Selection = ", prob.objective)
This is the solution displayed, while it should print "B":
Status: Optimal
Total Locations = covered_('C',_4)
You are on the right track! A couple things will help...
First, you overlooked a key piece of information in your output in that the solver says your formulation is INFEASIBLE!
Status: Infeasible
So whatever came out in the variables is gibberish and you must figure that part out first.
So, why is it infeasible? Take a look at your constraint. You are trying to force the impossible: if your distance value is zero, this cannot be true:
prob += int(distances[w][b]) * J[w] >= 1
So, you need to reformulate! You are missing a concept here. You actually need 2 constraints for this problem.
You need to constrain the selection of a source-destination if the route is too long
You need to enforce that every destination is covered.
You also need a double-indexed decision variable. Why? Well, let's say that source 'A' covers destinations 1 and 2, and 'B' covers 2, 3, 4, 5.... You will be able to know that all the destinations are "covered" with one variable, but you will not know which sources were used, so you need to keep track of both to get the full picture.
Here is a start, along with a couple edits. I'd suggest the variable names source and destination as that is kinda standard. You do not have a specific demand in this particular problem, just the need for a connection. You might also want to use dictionaries more than nested lists, I think it is clearer. Below is an example start with the first constraint. Note the trick here in limiting the covered variable. If the distance is less than the limit, s, then this constraint is satisfiable. For instance, if the distance is 3:
3 * 1 <= s
Anyhow, here is a recommended start. The other constraint is not implemented. You will need to sum across all the sources to ensure the destination is "covered". Comment back if you are stuck.
prob = LpProblem('source minimzer', LpMinimize)
dist_limit = 5
sources = ['A', 'B']             # the source locations
# note this is zero-indexed to work with the list indexes in dist dictionary...
destinations = list(range(5))    # the demand locations 0, 1, 2, 3, 4
dist = {'A': [2, 23, 30, 54, 1],
        'B': [3, 1, 2, 2, 3]}

covered = LpVariable.dicts('covered', [(s, d) for s in sources for d in destinations], cat='Binary')

# set up constraint to limit covered if the destination is "reachable"
for s in sources:
    for d in destinations:
        prob += covered[s, d] * dist[s][d] <= dist_limit

# add one more constraint to make sure that every destination is "covered"...
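For completeness, here is one way the finished model might look (my sketch building on the start above, not the answer's own code; the use variable is a hypothetical addition that counts opened sources):
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus

prob = LpProblem('source_minimizer', LpMinimize)
dist_limit = 5
sources = ['A', 'B', 'C']
destinations = list(range(5))
dist = {'A': [2, 23, 30, 54, 1],
        'B': [3, 1, 2, 2, 3],
        'C': [50, 65, 80, 90, 100]}

covered = LpVariable.dicts('covered', [(s, d) for s in sources for d in destinations], cat='Binary')
use = LpVariable.dicts('use', sources, cat='Binary')   # 1 if the source is opened

# objective: open as few sources as possible
prob += lpSum(use[s] for s in sources)

for s in sources:
    for d in destinations:
        # a pairing is only allowed within the distance limit...
        prob += covered[s, d] * dist[s][d] <= dist_limit
        # ...and only with a source that is actually opened
        prob += covered[s, d] <= use[s]

# every destination must be covered by at least one source
for d in destinations:
    prob += lpSum(covered[s, d] for s in sources) >= 1

prob.solve()
print("Status:", LpStatus[prob.status])
for s in sources:
    print(s, use[s].varValue)   # with these distances only B should be 1.0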

Is there a statistical test that can compare two ordered lists

I would like to get a statistical test statistic to compare two lists. Suppose my Benchmark list is
Benchmark = [a,b,c,d,e,f,g]
and I have two other lists
A = [g,c,b,a,f,e,d]
C = [c,d,e,a,b,f,g]
I want the test to inform me which list is closer to the Benchmark. The test should consider the absolute location, but also the relative location: for example, it should penalize the fact that in list A 'g' is at the start while in the Benchmark it is at the end (how far is something from its true location), but it should also reward the fact that 'a' and 'b' are close to each other in list C, just like in the Benchmark.
A and C are always shuffled versions of the Benchmark. I would like a statistical test or some kind of metric that informs me that the orderings of lists A, B and C are not statistically different from that of the Benchmark, but that of a certain list D is significantly different at a certain threshold or p-value such as 5%. And even among the lists A, B and C, the test should clearly indicate which ordering is closer to the Benchmark.
Well, if you come to the conclusion that a metric will suffice, here you go:
def dist(a, b):
    perm = []
    for v in b:
        perm.append(a.index(v))
    perm_vals = [a[p] for p in perm]
    # displacement
    ret = 0
    for i, v in enumerate(perm):
        ret += abs(v - i)
    # coherence break
    current = perm_vals.index(a[0])
    for v in a[1:]:
        new = perm_vals.index(v)
        ret += abs(new - current) - 1
        current = new
    return ret
I've created a few samples to test this:
import random
ground_truth = [0, 1, 2, 3, 4, 5, 6]
samples = []
for i in range(7):
samples.append(random.sample(ground_truth, len(ground_truth)))
samples.append([0, 6, 1, 5, 3, 4, 2])
samples.append([6, 5, 4, 3, 2, 1, 0])
samples.append([0, 1, 2, 3, 4, 5, 6])
def dist(a, b):
perm = []
for v in b:
perm.append(a.index(v))
perm_vals = [a[p] for p in perm]
# displacement
ret = 0
for i, v in enumerate(perm):
ret += abs(v - i)
# coherence break
current = perm_vals.index(a[0])
for v in a[1:]:
new = perm_vals.index(v)
ret += abs(new - current) - 1
current = new
return ret
for s in samples:
print(s, dist(ground_truth, s))
The metric is a cost, that is, the lower it is, the better. I designed it to yield 0 iff the permutation is an identity. The job left for you, that which none can do for you, is deciding how strict you want to be when evaluating samples using this metric, which definitely depends on what you're trying to achieve.

Fast and not memory expensive k nearest neighbours search

I am trying to find the nearest neighbours for each element in a new array of points in a different dataset, in a way that is fast and not memory expensive. My bigger concern is adapting the code to more neighbours rather than more dimensions.
Based on https://glowingpython.blogspot.com/2012/04/k-nearest-neighbor-search.html?showComment=1355311029556#c8236097544823362777
I have written a k nearest neighbour search, but it is very memory intensive. In my real problem I have 1 mln values to search in and 100k points that need to be matched; the 1 mln x 10k array is estimated to be 600 GiB.
Is there a better way?
I have tried using bisect (based on "from list of integers, get number closest to a given value"), but I would have to loop 100k times, which will take some time, especially as I have to make many searches.
Good code for small datasets - able to find the K nearest neighbours, and easily adaptable to many dimensions (looping by dimension):
import numpy as np
import pandas as pd
from numpy import array, argsort

def knn_search(search_for, search_in, K = 1,
               return_col = ["ID"],
               col = 'A'):
    a_search_in = array(search_in[col])
    a_search_for = array(search_for[col])
    # full matrix of squared differences between every query and every reference point
    a = np.tile(a_search_for, [a_search_in.shape[0], 1]).T
    b = np.tile(a_search_in, [a_search_for.shape[0], 1])
    t_diff = a - b
    diff = np.square(t_diff)
    # sorting
    idx = argsort(diff)
    # return the indexes of K nearest neighbours
    if search_for.shape[0] == 1:
        return idx[:K]
    elif K == 1:
        return search_in.iloc[np.concatenate(idx[:, :K]), :][return_col]
    else:
        tmp = pd.DataFrame()
        for i in range(min(K, search_in.shape[0])):
            tmp = pd.concat([tmp.reset_index(drop=True),
                             search_in.iloc[idx[:, i], :][return_col].reset_index(drop=True)],
                            axis=1)
        return tmp
Good code for 1 dimension and 1 neighbour:
def knn_search_1K_1D(search_for, search_in,
return_col = ["ID"],
col = 'A'):
sort_search_in = search_in.sort_values(col).reset_index()
idx = np.searchsorted(sort_search_in[col], search_for[col])
idx_pop = np.where(idx > len(sort_search_in) - 1, len(sort_search_in) - 1, idx)
t = sort_search_in.iloc[idx_pop , :][[return_col]]
search_for_nn = pd.concat([search_for.add_prefix('').reset_index(drop=True),
t.add_prefix('nn_').reset_index(drop=True)],
axis=1)
Current working solution for K nearest neighbours > 1 and 1 dimension, but it takes more than an hour to calculate in the real-case scenario mentioned above:
def knn_search_nK_1D(search_for, search_in, K = 1,
                     return_col = ["ID"],
                     col = 'A'):
    t = []
    # looping one point at a time
    for i in range(search_for.shape[0]):
        y = search_in[col]
        x = search_for.iloc[i, :][col]
        nn = np.nanmean(search_in.iloc[np.argsort(np.abs(np.subtract(y, x)))[0:K], :][return_col])
        t.append(nn)
    search_for_nn = search_for
    search_for_nn['nn_' + return_col] = t
Example data:
search_for = pd.DataFrame({'ID': ["F", "G"],
                           'A': [-1, 9]})

search_in = pd.DataFrame({'ID': ["A", "B", "C", "D", "E"],
                          'A': [1, 2, 3, 4, 5]})

t = knn_search(search_for = search_for,
               search_in = search_in,
               K = 1,
               return_col = ['ID'],
               col = 'A')

print(t)
#   ID
# 0  A
# 4  E
Do you want to have your own implementation? If so, you could use a k-d tree for the KNN search; it is much more efficient. Otherwise, you could use a KNN library with GPU support such as knn_cuda.
Update
You could try cuml.
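For what it's worth, a minimal sketch of the k-d tree route with scipy.spatial.cKDTree, reusing the example data from the question (knn_search_kdtree and the nn1_ID column name are my own, hypothetical names). The tree is built once and queried in bulk, so no full distance matrix is ever materialized:
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def knn_search_kdtree(search_for, search_in, K=1, return_col='ID', col='A'):
    # build the tree once over the reference points (1-D here; pass more columns for more dimensions)
    tree = cKDTree(search_in[[col]].to_numpy())
    # query all points at once; idx has shape (n_queries,) for K == 1, else (n_queries, K)
    dist, idx = tree.query(search_for[[col]].to_numpy(), k=K)
    idx = idx.reshape(len(search_for), K)
    out = search_for.copy()
    for j in range(K):
        out['nn%d_%s' % (j + 1, return_col)] = search_in.iloc[idx[:, j]][return_col].to_numpy()
    return out

search_for = pd.DataFrame({'ID': ['F', 'G'], 'A': [-1, 9]})
search_in = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E'], 'A': [1, 2, 3, 4, 5]})
print(knn_search_kdtree(search_for, search_in, K=1))   # nearest IDs should be A and E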

How to generate a weighted random list from a list of elements in Python such that all elements are picked [duplicate]

I needed to write a weighted version of random.choice (each element in the list has a different probability for being selected). This is what I came up with:
def weightedChoice(choices):
    """Like random.choice, but each element can have a different chance of
    being selected.

    choices can be any iterable containing iterables with two items each.
    Technically, they can have more than two items, the rest will just be
    ignored.  The first item is the thing being chosen, the second item is
    its weight.  The weights can be any numeric values, what matters is the
    relative differences between them.
    """
    space = {}
    current = 0
    for choice, weight in choices:
        if weight > 0:
            space[current] = choice
            current += weight
    rand = random.uniform(0, current)
    for key in sorted(space.keys() + [current]):
        if rand < key:
            return choice
        choice = space[key]
    return None
This function seems overly complex to me, and ugly. I'm hoping everyone here can offer some suggestions on improving it or alternate ways of doing this. Efficiency isn't as important to me as code cleanliness and readability.
Since version 1.7.0, NumPy has a choice function that supports probability distributions.
from numpy.random import choice
draw = choice(list_of_candidates, number_of_items_to_pick,
              p=probability_distribution)
Note that probability_distribution is a sequence in the same order of list_of_candidates. You can also use the keyword replace=False to change the behavior so that drawn items are not replaced.
Since Python 3.6 there is the choices method in the random module.
In [1]: import random
In [2]: random.choices(
...: population=[['a','b'], ['b','a'], ['c','b']],
...: weights=[0.2, 0.2, 0.6],
...: k=10
...: )
Out[2]:
[['c', 'b'],
['c', 'b'],
['b', 'a'],
['c', 'b'],
['c', 'b'],
['b', 'a'],
['c', 'b'],
['b', 'a'],
['c', 'b'],
['c', 'b']]
Note that random.choices will sample with replacement, per the docs:
Return a k sized list of elements chosen from the population with replacement.
Note for completeness of answer:
When a sampling unit is drawn from a finite population and is returned
to that population, after its characteristic(s) have been recorded,
before the next unit is drawn, the sampling is said to be "with
replacement". It basically means each element may be chosen more than
once.
If you need to sample without replacement, then as ronan-paixão's brilliant answer states, you can use numpy.random.choice, whose replace argument controls such behaviour.
def weighted_choice(choices):
    total = sum(w for c, w in choices)
    r = random.uniform(0, total)
    upto = 0
    for c, w in choices:
        if upto + w >= r:
            return c
        upto += w
    assert False, "Shouldn't get here"
Arrange the weights into a cumulative distribution.
Use random.random() to pick a random float 0.0 <= x < total.
Search the distribution using bisect.bisect as shown in the example at http://docs.python.org/dev/library/bisect.html#other-examples.
from random import random
from bisect import bisect

def weighted_choice(choices):
    values, weights = zip(*choices)
    total = 0
    cum_weights = []
    for w in weights:
        total += w
        cum_weights.append(total)
    x = random() * total
    i = bisect(cum_weights, x)
    return values[i]

>>> weighted_choice([("WHITE",90), ("RED",8), ("GREEN",2)])
'WHITE'
If you need to make more than one choice, split this into two functions, one to build the cumulative weights and another to bisect to a random point.
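A possible shape for that split (my sketch of the suggestion, reusing the same cumulative-weights-plus-bisect idea):
from random import random
from bisect import bisect

def build_cumulative(choices):
    """Precompute once: the values plus their running weight totals."""
    values, weights = zip(*choices)
    cum_weights = []
    total = 0
    for w in weights:
        total += w
        cum_weights.append(total)
    return values, cum_weights

def weighted_pick(values, cum_weights):
    """Cheap per-draw step: one random number and one binary search."""
    x = random() * cum_weights[-1]
    return values[bisect(cum_weights, x)]

values, cum_weights = build_cumulative([("WHITE", 90), ("RED", 8), ("GREEN", 2)])
picks = [weighted_pick(values, cum_weights) for _ in range(5)]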
If you don't mind using numpy, you can use numpy.random.choice.
For example:
import numpy

items = [["item1", 0.2], ["item2", 0.3], ["item3", 0.45], ["item4", 0.05]]
elems = [i[0] for i in items]
probs = [i[1] for i in items]

trials = 1000
results = [0] * len(items)
for i in range(trials):
    res = numpy.random.choice(elems, p=probs)  # This is where the item is selected!
    results[elems.index(res)] += 1
results = [r / float(trials) for r in results]

print "item\texpected\tactual"
for i in range(len(probs)):
    print "%s\t%0.4f\t%0.4f" % (elems[i], probs[i], results[i])

If you know how many selections you need to make in advance, you can do it without a loop like this:
numpy.random.choice(elems, trials, p=probs)
As of Python 3.6, random.choices can be used to return a list of elements of a specified size from the given population, with optional weights.
random.choices(population, weights=None, *, cum_weights=None, k=1)
population: a list containing unique observations. (If empty, raises IndexError)
weights: more precisely, the relative weights required to make selections.
cum_weights: cumulative weights required to make selections.
k: the size (length) of the list to be output. (Default k=1)
A few caveats:
1) It makes use of weighted sampling with replacement, so drawn items can be selected again later. The values in the weights sequence in themselves do not matter; only their relative ratios do.
Unlike np.random.choice, which can only take probabilities as weights and must ensure that the individual probabilities sum to 1, there are no such restrictions here. As long as the weights belong to numeric types (int/float/Fraction, except Decimal), they will still work.
>>> import random
# weights being integers
>>> random.choices(["white", "green", "red"], [12, 12, 4], k=10)
['green', 'red', 'green', 'white', 'white', 'white', 'green', 'white', 'red', 'white']
# weights being floats
>>> random.choices(["white", "green", "red"], [.12, .12, .04], k=10)
['white', 'white', 'green', 'green', 'red', 'red', 'white', 'green', 'white', 'green']
# weights being fractions
>>> random.choices(["white", "green", "red"], [12/100, 12/100, 4/100], k=10)
['green', 'green', 'white', 'red', 'green', 'red', 'white', 'green', 'green', 'green']
2) If neither weights nor cum_weights are specified, selections are made with equal probability. If a weights sequence is supplied, it must be the same length as the population sequence.
Specifying both weights and cum_weights raises a TypeError.
>>> random.choices(["white", "green", "red"], k=10)
['white', 'white', 'green', 'red', 'red', 'red', 'white', 'white', 'white', 'green']
3) cum_weights are typically the result of the itertools.accumulate function, which is really handy in such situations.
From the documentation linked:
Internally, the relative weights are converted to cumulative weights
before making selections, so supplying the cumulative weights saves
work.
So, either supplying weights=[12, 12, 4] or cum_weights=[12, 24, 28] for our contrived case produces the same outcome, and the latter seems to be faster / more efficient.
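As a quick illustration of that equivalence (my example, using the weights from above):
import random
from itertools import accumulate

population = ["white", "green", "red"]
weights = [12, 12, 4]

# these two draw from the same distribution; the second skips the internal accumulation step
draw_a = random.choices(population, weights=weights, k=10)
draw_b = random.choices(population, cum_weights=list(accumulate(weights)), k=10)   # [12, 24, 28]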
Crude, but may be sufficient:
import random
weighted_choice = lambda s : random.choice(sum(([v]*wt for v,wt in s),[]))
Does it work?
# define choices and relative weights
choices = [("WHITE",90), ("RED",8), ("GREEN",2)]
# initialize tally dict
tally = dict.fromkeys([v for v, wt in choices], 0)

# tally up 1000 weighted choices
for i in xrange(1000):
    tally[weighted_choice(choices)] += 1

print tally.items()
Prints:
[('WHITE', 904), ('GREEN', 22), ('RED', 74)]
Assumes that all weights are integers. They don't have to add up to 100, I just did that to make the test results easier to interpret. (If weights are floating point numbers, multiply them all by 10 repeatedly until all weights >= 1.)
weights = [.6, .2, .001, .199]
while any(w < 1.0 for w in weights):
    weights = [w*10 for w in weights]
weights = map(int, weights)
If you have a weighted dictionary instead of a list you can write this
items = { "a": 10, "b": 5, "c": 1 }
random.choice([k for k in items for dummy in range(items[k])])
Note that [k for k in items for dummy in range(items[k])] produces this list ['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'c', 'b', 'b', 'b', 'b', 'b']
Here is the version that is being included in the standard library for Python 3.6:
import random
import itertools as _itertools
import bisect as _bisect

class Random36(random.Random):
    "Show the code included in the Python 3.6 version of the Random class"

    def choices(self, population, weights=None, *, cum_weights=None, k=1):
        """Return a k sized list of population elements chosen with replacement.

        If the relative weights or cumulative weights are not specified,
        the selections are made with equal probability.
        """
        random = self.random
        if cum_weights is None:
            if weights is None:
                _int = int
                total = len(population)
                return [population[_int(random() * total)] for i in range(k)]
            cum_weights = list(_itertools.accumulate(weights))
        elif weights is not None:
            raise TypeError('Cannot specify both weights and cumulative weights')
        if len(cum_weights) != len(population):
            raise ValueError('The number of weights does not match the population')
        bisect = _bisect.bisect
        total = cum_weights[-1]
        return [population[bisect(cum_weights, random() * total)] for i in range(k)]
Source: https://hg.python.org/cpython/file/tip/Lib/random.py#l340
A very basic and easy approach for a weighted choice is the following:
import numpy as np

np.random.choice(['A', 'B', 'C'], p=[0.3, 0.4, 0.3])

w = np.array([0.4, 0.8, 1.6, 0.8, 0.4])
np.random.choice(w, p=w/sum(w))
I'm probably too late to contribute anything useful, but here's a simple, short, and very efficient snippet:
def choose_index(probabilities):
    cmf = probabilities[0]
    choice = random.random()
    for k in xrange(len(probabilities)):
        if choice <= cmf:
            return k
        else:
            cmf += probabilities[k+1]
No need to sort your probabilities or create a vector with your cmf, and it terminates once it finds its choice. Memory: O(1), time: O(N), with average running time ~ N/2.
If you have weights, simply add one line:
def choose_index(weights):
    probabilities = weights / sum(weights)
    cmf = probabilities[0]
    choice = random.random()
    for k in xrange(len(probabilities)):
        if choice <= cmf:
            return k
        else:
            cmf += probabilities[k+1]
If your list of weighted choices is relatively static, and you want frequent sampling, you can do one O(N) preprocessing step, and then do the selection in O(1), using the functions in this related answer.
# run only when `choices` changes.
preprocessed_data = prep(weight for _,weight in choices)
# O(1) selection
value = choices[sample(preprocessed_data)][0]
If you happen to have Python 3, and are afraid of installing numpy or writing your own loops, you could do:
import itertools, bisect, random

def weighted_choice(choices):
    weights = list(zip(*choices))[1]
    return choices[bisect.bisect(list(itertools.accumulate(weights)),
                                 random.uniform(0, sum(weights)))][0]
Because you can build anything out of a bag of plumbing adaptors! Although... I must admit that Ned's answer, while slightly longer, is easier to understand.
I looked at the other thread that was pointed to and came up with this variation in my coding style. It returns the index of the choice for the purpose of tallying, but it is simple to return the string instead (see the commented return alternative):
import random
import bisect

try:
    range = xrange
except:
    pass

def weighted_choice(choices):
    total, cumulative = 0, []
    for c, w in choices:
        total += w
        cumulative.append((total, c))
    r = random.uniform(0, total)
    # return index
    return bisect.bisect(cumulative, (r,))
    # return item string
    #return choices[bisect.bisect(cumulative, (r,))][0]

# define choices and relative weights
choices = [("WHITE",90), ("RED",8), ("GREEN",2)]

tally = [0 for item in choices]
n = 100000
# tally up n weighted choices
for i in range(n):
    tally[weighted_choice(choices)] += 1

print([t/sum(tally)*100 for t in tally])
A general solution:
import random

def weighted_choice(choices, weights):
    total = sum(weights)
    threshold = random.uniform(0, total)
    for k, weight in enumerate(weights):
        total -= weight
        if total < threshold:
            return choices[k]
Here is another version of weighted_choice that uses numpy. Pass in the weights vector and it will return an array of 0's containing a 1 indicating which bin was chosen. The code defaults to just making a single draw but you can pass in the number of draws to be made and the counts per bin drawn will be returned.
If the weights vector does not sum to 1, it will be normalized so that it does.
import numpy as np

def weighted_choice(weights, n=1):
    if np.sum(weights) != 1:
        weights = weights/np.sum(weights)
    draws = np.random.random_sample(size=n)
    weights = np.cumsum(weights)
    weights = np.insert(weights, 0, 0.0)
    counts = np.histogram(draws, bins=weights)
    return counts[0]
It depends on how many times you want to sample the distribution.
Suppose you want to sample the distribution K times. Then, the time complexity of using np.random.choice() each time is O(K(n + log(n))), where n is the number of items in the distribution.
In my case, I needed to sample the same distribution multiple times of the order of 10^3 where n is of the order of 10^6. I used the below code, which precomputes the cumulative distribution and samples it in O(log(n)). Overall time complexity is O(n+K*log(n)).
import numpy as np

n, k = 10**6, 10**3

# Create dummy distribution
a = np.array([i+1 for i in range(n)])
p = np.array([1.0/n]*n)

cfd = p.cumsum()
for _ in range(k):
    x = np.random.uniform()
    idx = cfd.searchsorted(x, side='right')
    sampled_element = a[idx]
There is a lecture on this by Sebastian Thrun in the free Udacity course AI for Robotics. Basically he makes a circular array of the indexed weights using the mod operator %, sets a variable beta to 0, randomly chooses an initial index,
and loops through N, where N is the number of indices; in the for loop he first increments beta by the formula:
beta = beta + uniform sample from {0...2 * Weight_max}
and then, nested in the for loop, runs a while loop per below:
while w[index] < beta:
    beta = beta - w[index]
    index = index + 1
select p[index]
Then on to the next index to resample based on the probabilities (or normalized probabilities in the case presented in the course).
On Udacity, find Lesson 8, video number 21 of Artificial Intelligence for Robotics, where he is lecturing on particle filters.
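A rough Python transcription of that resampling wheel (my sketch of the pseudocode above, not Thrun's code; particles and weights are hypothetical inputs):
import random

def resample(particles, weights):
    """Draw len(particles) samples in proportion to their weights."""
    N = len(particles)
    index = random.randint(0, N - 1)          # random starting index
    beta = 0.0
    w_max = max(weights)
    picked = []
    for _ in range(N):
        beta += random.uniform(0, 2 * w_max)  # step forward by a random amount
        while weights[index] < beta:
            beta -= weights[index]
            index = (index + 1) % N           # circular array via the mod operator
        picked.append(particles[index])
    return picked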
Another way of doing this, assuming we have weights at the same index as the elements in the element array.
import numpy as np

weights = [0.1, 0.3, 0.5]  # weights for the items at index 0, 1, 2
# the sum of weights should be <= 1; you can also divide each weight by the sum of all weights to standardise them to this constraint
trials = 1     # number of trials
num_item = 1   # number of items that can be picked in each trial
selected_item_arr = np.random.multinomial(num_item, weights, trials)
# gives the number of times an item was selected at a particular index
# this assumes selection with replacement
# one possible output
# selected_item_arr
# array([[0, 0, 1]])
# say if trials = 5, then a possible output could be
# selected_item_arr
# array([[1, 0, 0],
#        [0, 0, 1],
#        [0, 0, 1],
#        [0, 1, 0],
#        [0, 0, 1]])
Now let's assume we have to sample out 3 items in 1 trial. You can assume that there are three balls R, G, B present in large quantities in the ratio of their weights given by the weight array; the following could be a possible outcome:
num_item = 3
trials = 1
selected_item_arr = np.random.multinomial(num_item, weights, trials)
# selected_item_arr can give output like :
# array([[1, 0, 2]])
You can also think of the number of items to be selected as the number of binomial/multinomial trials within a set. So, the above example can still work as:
num_binomial_trial = 5
weights = [0.1,0.9] #say an unfair coin weights for H/T
num_experiment_set = 1
selected_item_arr = np.random.multinomial(num_binomial_trial, weights, num_experiment_set)
# possible output
# selected_item_arr
# array([[1, 4]])
# i.e. H came 1 time and T came 4 times in 5 binomial trials. And one set contains 5 binomial trials.
Let's say you have
items = [11, 23, 43, 91]
probability = [0.2, 0.3, 0.4, 0.1]
and you have a function that generates a random number between [0, 1) (we can use random.random() here).
Now take the prefix sum of probability:
prefix_probability = [0.2, 0.5, 0.9, 1]
Now we can just take a random number between 0 and 1 and use binary search to find where that number belongs in prefix_probability. That index will be your answer.
The code will go something like this:
return items[bisect.bisect(prefix_probability, random.random())]
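Put together as a self-contained sketch (my wrapper around the approach just described):
import bisect
import random
from itertools import accumulate

def weighted_choice(items, probability):
    # prefix sums of the probabilities, e.g. [0.2, 0.5, 0.9, 1.0] for the example above
    prefix_probability = list(accumulate(probability))
    # a random number in [0, 1) falls into exactly one prefix interval
    return items[bisect.bisect(prefix_probability, random.random())]

items = [11, 23, 43, 91]
probability = [0.2, 0.3, 0.4, 0.1]
print(weighted_choice(items, probability))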
One way is to randomize on the total of all the weights and then use the values as the limit points for each var. Here is a crude implementation as a generator.
def rand_weighted(weights):
    """
    Generator which uses the weights to generate
    weighted random values
    """
    sum_weights = sum(weights.values())
    cum_weights = {}
    current_weight = 0
    for key, value in sorted(weights.iteritems()):
        current_weight += value
        cum_weights[key] = current_weight
    while True:
        sel = int(random.uniform(0, 1) * sum_weights)
        for key, value in sorted(cum_weights.iteritems()):
            if sel < value:
                break
        yield key
Using numpy
def choice(items, weights):
    return items[np.argmin((np.cumsum(weights) / sum(weights)) < np.random.rand())]
I needed to do something like this really fast and really simple; searching for ideas, I finally built this template. The idea is to receive the weighted values in the form of JSON from an API, which here is simulated by the dict.
Then translate it into a list in which each value repeats proportionally to its weight, and just use random.choice to select a value from the list.
I tried running it with 10, 100 and 1000 iterations. The distribution seems pretty solid.
def weighted_choice(weighted_dict):
    """Input example: dict(apples=60, oranges=30, pineapples=10)"""
    weight_list = []
    for key in weighted_dict.keys():
        weight_list += [key] * weighted_dict[key]
    return random.choice(weight_list)
I didn't love the syntax of any of those. I really wanted to just specify what the items were and what the weighting of each was. I realize I could have used random.choices but instead I quickly wrote the class below.
import random, string
from numpy import cumsum

class randomChoiceWithProportions:
    '''
    Accepts a dictionary of choices as keys and weights as values. Example if you want an unfair die:

    choiceWeightDic = {"1":0.16666666666666666, "2": 0.16666666666666666, "3": 0.16666666666666666
    , "4": 0.16666666666666666, "5": .06666666666666666, "6": 0.26666666666666666}
    dice = randomChoiceWithProportions(choiceWeightDic)

    samples = []
    for i in range(100000):
        samples.append(dice.sample())

    # Should be close to .26666
    samples.count("6")/len(samples)

    # Should be close to .16666
    samples.count("1")/len(samples)
    '''
    def __init__(self, choiceWeightDic):
        self.choiceWeightDic = choiceWeightDic
        weightSum = sum(self.choiceWeightDic.values())
        assert weightSum == 1, 'Weights sum to ' + str(weightSum) + ', not 1.'
        self.valWeightDict = self._compute_valWeights()

    def _compute_valWeights(self):
        valWeights = list(cumsum(list(self.choiceWeightDic.values())))
        valWeightDict = dict(zip(list(self.choiceWeightDic.keys()), valWeights))
        return valWeightDict

    def sample(self):
        num = random.uniform(0, 1)
        for key, val in self.valWeightDict.items():
            if val >= num:
                return key
Provide random.choice() with a pre-weighted list:
Solution & Test:
import random

options = ['a', 'b', 'c', 'd']
weights = [1, 2, 5, 2]

weighted_options = [[opt]*wgt for opt, wgt in zip(options, weights)]
weighted_options = [opt for sublist in weighted_options for opt in sublist]
print(weighted_options)

# test
counts = {c: 0 for c in options}
for x in range(10000):
    counts[random.choice(weighted_options)] += 1

for opt, wgt in zip(options, weights):
    wgt_r = counts[opt] / 10000 * sum(weights)
    print(opt, counts[opt], wgt, wgt_r)
Output:
['a', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'd', 'd']
a 1025 1 1.025
b 1948 2 1.948
c 5019 5 5.019
d 2008 2 2.008
In case you don't define in advance how many items you want to pick (so, you don't do something like k=10) and you just have probabilities, you can do the below. Note that your probabilities do not need to add up to 1, they can be independent of each other:
soup_items = ['pepper', 'onion', 'tomato', 'celery']
items_probability = [0.2, 0.3, 0.9, 0.1]
selected_items = [item for item,p in zip(soup_items,items_probability) if random.random()<p]
print(selected_items)
>>>['pepper','tomato']
Step 1: Generate the CDF F of the distribution you're interested in.
Step 2: Generate a uniform random variable u.
Step 3: Evaluate z = F^{-1}(u).
This modeling approach is described in courses on probability theory or stochastic processes. It is applicable just because you have an easy CDF.
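For a discrete distribution, those three steps might look like this with NumPy (my sketch; values and probs are made-up example inputs):
import numpy as np

values = np.array([11, 23, 43, 91])
probs = np.array([0.2, 0.3, 0.4, 0.1])

F = np.cumsum(probs)               # Step 1: the CDF of the distribution
u = np.random.uniform(size=5)      # Step 2: uniform random variables u
z = values[np.searchsorted(F, u)]  # Step 3: z = F^{-1}(u) via binary search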

Motif search with Gibbs sampler

I am a beginner in both programming and bioinformatics, so I would appreciate your understanding. I tried to develop a Python script for motif search using Gibbs sampling, as explained in the Coursera class "Finding Hidden Messages in DNA". The pseudocode provided in the course is:
GIBBSSAMPLER(Dna, k, t, N)
    randomly select k-mers Motifs = (Motif1, …, Motift) in each string from Dna
    BestMotifs ← Motifs
    for j ← 1 to N
        i ← Random(t)
        Profile ← profile matrix constructed from all strings in Motifs except for Motifi
        Motifi ← Profile-randomly generated k-mer in the i-th sequence
        if Score(Motifs) < Score(BestMotifs)
            BestMotifs ← Motifs
    return BestMotifs
Problem description:
CODE CHALLENGE: Implement GIBBSSAMPLER.
Input: Integers k, t, and N, followed by a collection of strings Dna.
Output: The strings BestMotifs resulting from running GIBBSSAMPLER(Dna, k, t, N) with
20 random starts. Remember to use pseudocounts!
Sample Input:
8 5 100
CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA
GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG
TAGTACCGAGACCGAAAGAAGTATACAGGCGT
TAGATCAAGTTTCAGGTGCACGTCGGTGAACC
AATCCACCAGCTCCACGTGCAATGTTGGCCTA
Sample Output:
TCTCGGGG
CCAAGGTG
TACAGGCG
TTCAGGTG
TCCACGTG
I followed the pseudocode to the best of my knowledge. Here is my code:
def BuildProfileMatrix(dnamatrix):
    ProfileMatrix = [[1 for x in xrange(len(dnamatrix[0]))] for x in xrange(4)]
    indices = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    for seq in dnamatrix:
        for i in xrange(len(dnamatrix[0])):
            ProfileMatrix[indices[seq[i]]][i] += 1
    ProbMatrix = [[float(x)/sum(zip(*ProfileMatrix)[0]) for x in y] for y in ProfileMatrix]
    return ProbMatrix

def ProfileRandomGenerator(profile, dna, k, i):
    indices = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    score_list = []
    for x in xrange(len(dna[i]) - k + 1):
        probability = 1
        window = dna[i][x : k + x]
        for y in xrange(k):
            probability *= profile[indices[window[y]]][y]
        score_list.append(probability)
    rnd = uniform(0, sum(score_list))
    current = 0
    for z, bias in enumerate(score_list):
        current += bias
        if rnd <= current:
            return dna[i][z : k + z]

def score(motifs):
    ProfileMatrix = [[0 for x in xrange(len(motifs[0]))] for x in xrange(4)]
    indices = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    for seq in motifs:
        for i in xrange(len(motifs[0])):
            ProfileMatrix[indices[seq[i]]][i] += 1
    score = len(motifs)*len(motifs[0]) - sum([max(x) for x in zip(*ProfileMatrix)])
    return score
from random import randint, uniform

def GibbsSampler(k, t, N):
    dna = ['CGCCCCTCTCGGGGGTGTTCAGTAACCGGCCA',
           'GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG',
           'TAGTACCGAGACCGAAAGAAGTATACAGGCGT',
           'TAGATCAAGTTTCAGGTGCACGTCGGTGAACC',
           'AATCCACCAGCTCCACGTGCAATGTTGGCCTA']

    Motifs = []
    for i in [randint(0, len(dna[0])-k) for x in range(len(dna))]:
        j = 0
        kmer = dna[j][i : k+i]
        j += 1
        Motifs.append(kmer)

    BestMotifs = []
    s_best = float('inf')
    for i in xrange(N):
        x = randint(0, t-1)
        Motifs.pop(x)
        profile = BuildProfileMatrix(Motifs)
        Motif = ProfileRandomGenerator(profile, dna, k, x)
        Motifs.append(Motif)
        s_motifs = score(Motifs)
        if s_motifs < s_best:
            s_best = s_motifs
            BestMotifs = Motifs
    return [s_best, BestMotifs]

k, t, N = 8, 5, 100
best_motifs = [float('inf'), None]

# Repeat the Gibbs sampler search 20 times.
for repeat in xrange(20):
    current_motifs = GibbsSampler(k, t, N)
    if current_motifs[0] < best_motifs[0]:
        best_motifs = current_motifs

# Print and save the answer.
print '\n'.join(best_motifs[1])
Unfortunately, my code never gives the same output as the solved example. Besides, while trying to debug the code I found that I get weird scores for the mismatches between motifs. However, when I tried to run the score function separately, it worked perfectly.
Each time I run the script, the output changes, but anyway here is an example of one of the outputs for the input present in the code:
Example output of my code
TATGTGTA
TATGTGTA
TATGTGTA
GGTGTTCA
TATACAGG
Could you please help me debug this code?!! I spent the whole day trying to find out what's wrong with it although I know it might be some silly mistake I made, but my eye failed to catch it.
Thank you all!!
Finally, I found out what was wrong in my code! It was in line 54:
Motifs.append(Motif)
After randomly removing one of the motifs, then building a profile out of the remaining motifs and randomly selecting a new motif based on this profile, I should have added the selected motif back in the same position it was removed from, NOT appended it to the end of the motif list.
Now, the correct code is:
Motifs.insert(x, Motif)
The new code worked as expected.
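For reference, a sketch of the corrected inner loop with that change applied (the list copy when saving BestMotifs is an extra safeguard of mine, not part of the original fix):
for i in range(N):
    x = randint(0, t - 1)
    Motifs.pop(x)                           # remove one motif at random
    profile = BuildProfileMatrix(Motifs)    # profile from the remaining t-1 motifs
    Motif = ProfileRandomGenerator(profile, dna, k, x)
    Motifs.insert(x, Motif)                 # put the new motif back at position x
    s_motifs = score(Motifs)
    if s_motifs < s_best:
        s_best = s_motifs
        BestMotifs = Motifs[:]              # copy, so later updates don't overwrite the best set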
