I have a dictionary that looks something like so:
exons = {'NM_015665': [(0, 225), (356, 441), (563, 645), (793, 861)], etc...}
and another file that has a position like so:
isoform pos
NM_015665 449
What I want to do is print the range of numbers that the position in the file is the closest to and then print the number within that range of numbers that the value is closest to. For this case, I want to print (356, 441) and then 441. I've successfully figured out a way to print the number in the set of numbers that the value is closest to, but my code below only takes into account 10 values on either side of the numbers listed. Is there any way to take into account that there are a different amount of numbers between each set of ranges?
This is the code I have so far:
import csv

with open('splicing_reinitialized.txt') as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        pos = row['pos']
        name = row['isoform']
        ppos1 = int(pos)
        if name in exons:
            y = exons[name]
            for i, (low, high) in enumerate(exons[name]):
                if low - 5 <= ppos1 <= high + 5:
                    values = (low, high)
                    closest = min((low, high), key=lambda x: abs(x - ppos1))
I would rewrite it as a minimum distance search:
if name in exons:
    y = exons[name]
    minDist = float('inf')  # larger than any real distance
    minIdx = None
    minNum = None
    for i, (low, high) in enumerate(y):
        dlow = abs(low - ppos1)
        dhigh = abs(high - ppos1)
        dist = min(dlow, dhigh)
        if dist < minDist:
            minDist = dist
            minIdx = i
            minNum = 0 if dlow < dhigh else 1
    print(y[minIdx])
    print(y[minIdx][minNum])
This ignores the search range and simply searches for the interval endpoint at the minimum distance.
A functional alternative :). This might even run faster. It clearly is very RAM-friendly and can be easily parallelized due to the perks of functional programming. I hope you'll find it interesting enough to study.
# Python 2 only: imap, izip and ifilter were removed from itertools in Python 3
from itertools import imap, izip, ifilter, repeat

def closest_point(position, interval):
    """:rtype: tuple[int, int]"""  # closest interval point, distance to it
    position_in_interval = interval[0] <= position <= interval[1]
    closest = min([(border, abs(position - border)) for border in interval], key=lambda x: x[1])
    return closest if not position_in_interval else (closest[0], 0)  # distance is 0 if position is inside an interval

def closest_interval(exons, pos):
    """:rtype: tuple[tuple[int, int], tuple[int, int]]"""
    return min(ifilter(lambda x: x[1][1], izip(exons, imap(closest_point, repeat(pos, len(exons)), exons))),
               key=lambda x: x[1][1])

print(closest_interval(exons['NM_015665'], 449))
This prints
((356, 441), (441, 8))
The first tuple is a range. The first integer in the second tuple is the closest point in the interval, the second integer is the distance.
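For readers on Python 3, here is a minimal sketch of the same lookup without the Python 2 itertools helpers. It is not the answer's exact method: it only measures the distance to interval endpoints, as the question does, and does not treat a position inside an interval specially.

```python
def closest_interval(intervals, pos):
    """Return (interval, endpoint) for the interval endpoint nearest to pos."""
    # Consider every endpoint of every interval and keep the nearest one.
    candidates = ((interval, point) for interval in intervals for point in interval)
    return min(candidates, key=lambda c: abs(c[1] - pos))


exons = {'NM_015665': [(0, 225), (356, 441), (563, 645), (793, 861)]}
interval, point = closest_interval(exons['NM_015665'], 449)
print(interval, point)  # (356, 441) 441
```

Because every interval is scanned, the differing widths of the intervals do not matter.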
I am implementing in Python3 an algorithm to find the longest substring of two strings s and t. Given s and t, I need to return (a,b,l) where l is the length of the longest common substring, a is the position in s where the longest substring starts, and b is the position in t where the longest substring starts. I have a working version of the algorithm but it is quite slow and I am not sure why; it is frustrating because I have found other implementations in python using pretty much the same logic that are many times faster. I am self-learning so any help would be greatly appreciated.
The approach is based on comparing hash values rather than directly comparing substrings and using binary search to find maximal length of common substrings. Here is the code for my hash function (m is a big prime and x is just some constant):
def polynomial_hash(my_string, m, x):
    str_len = len(my_string)
    result = 0
    for i in range(str_len):
        result = (result + ord(my_string[i]) * power_mod_p(x, i, m)) % m
    return result
Given two strings s and t, I first find which string is shorter, without loss of generality, let s be the shorter string. First I need to find the hash values of substrings of a string. I use the following function, implemented as a generator:
def all_length_k_hashes(my_string, k, m, x):
    current_position = len(my_string) - k
    x_to_the_k = power_mod_p(x, k, m)
    hash_value = polynomial_hash(my_string[current_position:], m, x)
    yield (hash_value, current_position)
    while current_position > 0:
        current_position = current_position - 1
        hash_value = ((hash_value * x) + ord(my_string[current_position]) - x_to_the_k*ord(my_string[current_position + k])) % m
        yield (hash_value, current_position)
This function is simple: its first yield is the hash value of the final length k substring of the string, and each later iteration yields the hash value of the next length k substring to its left (we move left by one position, for example for k=3 from abcdef[ghi] to abcde[fgh]i, then from abcde[fgh]i to abcd[efg]hi). This should be able to calculate the hash values of all length k substrings of my_string in O(|my_string|).
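One way to check the rolling update is to compare each rolled hash against a hash computed from scratch for the same window. The sketch below assumes that `power_mod_p(x, i, m)` behaves like Python's built-in three-argument `pow(x, i, m)` (the question does not show it), and the values of `x` and `text` are arbitrary choices for illustration.

```python
def polynomial_hash(s, m, x):
    """Hash s as sum(ord(s[i]) * x**i) mod m."""
    return sum(ord(ch) * pow(x, i, m) for i, ch in enumerate(s)) % m


def all_length_k_hashes(s, k, m, x):
    """Yield (hash, start) for every length-k window, moving right to left."""
    pos = len(s) - k
    x_to_the_k = pow(x, k, m)
    h = polynomial_hash(s[pos:], m, x)
    yield h, pos
    while pos > 0:
        pos -= 1
        # Multiply the old window by x, add the new leading character,
        # and subtract the character that fell off the right end.
        h = (h * x + ord(s[pos]) - x_to_the_k * ord(s[pos + k])) % m
        yield h, pos


m, x, k = 309000599, 31, 3
text = "abcdefghi"
for h, pos in all_length_k_hashes(text, k, m, x):
    # Third column is True when the rolled hash matches a fresh hash.
    print(pos, text[pos:pos + k], h == polynomial_hash(text[pos:pos + k], m, x))
```

Each step of the generator does a constant amount of arithmetic, so any slowness is more likely to come from how the initial hash and the modular powers are computed than from the rolling step itself.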
To find out whether s and t have a length k substring in common, I use the following function:
def common_sub_string_length_k(shorter_str, longer_str, k, m, x):
    short_str_dict = dict()
    for hash_and_index in all_length_k_hashes(shorter_str, k, m, x):
        short_str_dict.update({hash_and_index[0]: hash_and_index[1]})
    hash_generator_longer_str = all_length_k_hashes(longer_str, k, m, x)
    for hash_and_index in hash_generator_longer_str:
        if hash_and_index[0] in short_str_dict:
            return (short_str_dict[hash_and_index[0]], hash_and_index[1])
    return False
What is happening in this function is: I create an empty Python dictionary and fill it with key:value pairs such that each key is the hash value of a length k substring of the shorter string and its value is that substring's starting index. I call this 'short_str_dict'.
Then, using all_length_k_hashes, I create a generator of hash values of substrings of length k of the longer string, and I iterate through this generator to check whether any hash value is in 'short_str_dict'. If there is one, then the two strings have a substring of length k in common (assuming no hash collisions). This whole process should take time O(|shorter_string| + |longer_string|).
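Since the hashed check relies on the "no collisions" assumption, a slow reference version that compares the actual substrings can be handy for testing on small inputs. The function name below is hypothetical, and this is only a checking aid, not a faster replacement: building each slice costs O(k).

```python
def common_substring_length_k_reference(shorter_str, longer_str, k):
    """Return (index_in_shorter, index_in_longer) of a shared length-k substring, or False."""
    # Map each real substring (not its hash) to one starting index.
    seen = {shorter_str[i:i + k]: i for i in range(len(shorter_str) - k + 1)}
    for j in range(len(longer_str) - k + 1):
        piece = longer_str[j:j + k]
        if piece in seen:
            return seen[piece], j
    return False


print(common_substring_length_k_reference("xabcy", "zzabczz", 3))  # (1, 2)
print(common_substring_length_k_reference("xabcy", "zzabczz", 4))  # False
```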
Finally, the following function repeatedly uses the previous process to find the maximal k, using a binary search technique:
from random import randint

def longest_common_substring(str_1, str_2):
    m_1 = 309000599
    m_2 = 988017827
    x = randint(1, 10 ** 6)
    len_str_1 = len(str_1)
    len_str_2 = len(str_2)
    if len_str_1 <= len_str_2:
        short_str = str_1
        long_str = str_2
        switched = False
    else:
        short_str = str_2
        long_str = str_1
        switched = True
    len_short_str = len(short_str)
    len_long_str = len(long_str)
    low = 0
    high = len_short_str
    mid = 0
    longest_so_far = 0
    longest_indices = (0,0)
    while low <= high:
        mid = (high + low) // 2
        m1_result = common_sub_string_length_k(short_str, long_str, mid, m_1, x)
        m2_result = common_sub_string_length_k(short_str, long_str, mid, m_2, x)
        if m1_result is False or m2_result is False:
            high = mid - 1
        else:
            longest_so_far = mid
            longest_indices = m1_result
            low = mid + 1
    if switched:
        return (longest_indices[1], longest_indices[0], longest_so_far)
    else:
        return (longest_indices[0], longest_indices[1], longest_so_far)
Two different hashes are used to reduce the probability of a collision. So in total, assuming no collisions, this whole process should take O(log|shorter_string|) * O(|shorter_string| + |longer_string|).
Have I made any error? Is it slow because of the use of Python dictionaries? I really want to understand my mistake. Any help is greatly appreciated.
The input is an integer that specifies the amount to be ordered.
There are predefined package sizes that have to be used to create that order.
e.g.
Packs
3 for $5
5 for $9
9 for $16
for an input order 13 the output should be:
2x5 + 1x3
So far I've the following approach:
remaining_order = 13
package_numbers = [9,5,3]
required_packages = []
while remaining_order > 0:
    found = False
    for pack_num in package_numbers:
        if pack_num <= remaining_order:
            required_packages.append(pack_num)
            remaining_order -= pack_num
            found = True
            break
    if not found:
        break
But this will lead to the wrong result:
1x9 + 1x3
remaining: 1
So, you need to fill the order with the packages such that the total price is maximal? This is known as the knapsack problem. In that Wikipedia article you'll find several solutions written in Python.
To be more precise, you need a solution for the unbounded knapsack problem, in contrast to popular 0/1 knapsack problem (where each item can be packed only once). Here is working code from Rosetta:
from itertools import product

NAME, SIZE, VALUE = range(3)
items = (
    # NAME, SIZE, VALUE
    ('A', 3, 5),
    ('B', 5, 9),
    ('C', 9, 16))

capacity = 13

def knapsack_unbounded_enumeration(items, C):
    # find max of any one item
    max1 = [int(C / item[SIZE]) for item in items]
    itemsizes = [item[SIZE] for item in items]
    itemvalues = [item[VALUE] for item in items]

    # def totvalue(itemscount, itemsizes=itemsizes, itemvalues=itemvalues, C=C):
    def totvalue(itemscount):
        # nonlocal itemsizes, itemvalues, C
        totsize = sum(n * size for n, size in zip(itemscount, itemsizes))
        totval = sum(n * val for n, val in zip(itemscount, itemvalues))
        return (totval, -totsize) if totsize <= C else (-1, 0)

    # Try all combinations of bounty items from 0 up to max1
    bagged = max(product(*[range(n + 1) for n in max1]), key=totvalue)
    numbagged = sum(bagged)
    value, size = totvalue(bagged)
    size = -size
    # convert to (item, count) pairs in name order
    bagged = ['%dx%d' % (n, items[i][SIZE]) for i, n in enumerate(bagged) if n]
    return value, size, numbagged, bagged

if __name__ == '__main__':
    value, size, numbagged, bagged = knapsack_unbounded_enumeration(items, capacity)
    print(value)
    print(bagged)
Output is:
23
['1x3', '2x5']
Keep in mind that this is an NP-hard problem, so the running time will blow up as you enter larger values :)
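For integer sizes and a moderate capacity, a table indexed by capacity avoids enumerating every combination. The following is a minimal dynamic-programming sketch, not the Rosetta code; the function name and the (size, value) tuple layout are choices made for this illustration.

```python
def knapsack_unbounded_dp(packs, capacity):
    """packs: list of (size, value). Return (best_value, counts_per_pack)."""
    # best[c] holds the best (value, counts) achievable with total size <= c.
    best = [(0, [0] * len(packs))]
    for c in range(1, capacity + 1):
        candidate = best[c - 1]  # leaving one unit unused is always allowed
        for idx, (size, value) in enumerate(packs):
            if size <= c:
                prev_value, prev_counts = best[c - size]
                if prev_value + value > candidate[0]:
                    counts = prev_counts.copy()
                    counts[idx] += 1
                    candidate = (prev_value + value, counts)
        best.append(candidate)
    return best[capacity]


value, counts = knapsack_unbounded_dp([(3, 5), (5, 9), (9, 16)], 13)
print(value, counts)  # 23 [1, 2, 0]
```

The work grows with capacity times the number of pack sizes, so it stays practical as long as the capacity itself is not huge.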
You can use itertools.product:
import itertools

remaining_order = 13
package_numbers = [9,5,3]
required_packages = []
# An order never needs more than remaining_order // min(package_numbers) packages
max_count = remaining_order // min(package_numbers)
a = min([x for i in range(1, max_count + 1) for x in itertools.product(package_numbers, repeat=i)], key=lambda x: abs(sum(x) - remaining_order))
remaining_order -= sum(a)
print(a)
print(remaining_order)
Output:
(5, 5, 3)
0
This simply does the following steps:
Build every combination of packages and pick the one whose sum is closest to 13.
Then subtract that sum from remaining_order.
If you want it output with 'x':
import itertools
from collections import Counter

remaining_order = 13
package_numbers = [9,5,3]
required_packages = []
# An order never needs more than remaining_order // min(package_numbers) packages
max_count = remaining_order // min(package_numbers)
a = min([x for i in range(1, max_count + 1) for x in itertools.product(package_numbers, repeat=i)], key=lambda x: abs(sum(x) - remaining_order))
remaining_order -= sum(a)
print(' + '.join(['{0}x{1}'.format(v, k) for k, v in Counter(a).items()]))
print(remaining_order)
Output:
2x5 + 1x3
0
For your problem, I tried two implementations depending on what you want. In both solutions I assumed that the remaining order absolutely has to end at 0; otherwise the algorithm returns -1. If you need leftovers to be allowed, tell me and I can adapt my algorithm.
As the algorithm is implemented via dynamic programming, it handles large inputs well, at least more than 130 packages!
In the first solution, I assumed we fill with the biggest package each time.
In the second solution, I try to minimize the price, but the remaining order should always be 0.
remaining_order = 13
package_numbers = sorted([9,5,3], reverse=True) # To make sure the biggest package is the first element
prices = {9: 16, 5: 9, 3: 5}
required_packages = []

# First solution, using the biggest package each time, and making the total order remaining at 0 each time
ans = [[] for _ in range(remaining_order + 1)]
ans[0] = [0, 0, 0]
for i in range(1, remaining_order + 1):
    for index, package_number in enumerate(package_numbers):
        if i-package_number > -1:
            tmp = ans[i-package_number]
            if tmp != -1:
                ans[i] = [tmp[x] if x != index else tmp[x] + 1 for x in range(len(tmp))]
                break
    else: # Using for else instead of a boolean value `found`
        ans[i] = -1 # -1 is the not found combinations
print(ans[13]) # [0, 2, 1]
print(ans[9]) # [1, 0, 0]

# Second solution, minimizing the price with order at 0
def price(x):
    return 16*x[0]+9*x[1]+5*x[2]

ans = [[] for _ in range(remaining_order + 1)]
ans[0] = ([0, 0, 0],0) # combination + price
for i in range(1, remaining_order + 1):
    # The not found packages will be (-1, float('inf'))
    minimal_price = float('inf')
    minimal_combinations = -1
    for index, package_number in enumerate(package_numbers):
        if i-package_number > -1:
            tmp = ans[i-package_number]
            if tmp != (-1, float('inf')):
                tmp_price = price(tmp[0]) + prices[package_number]
                if tmp_price < minimal_price:
                    minimal_price = tmp_price
                    minimal_combinations = [tmp[0][x] if x != index else tmp[0][x] + 1 for x in range(len(tmp[0]))]
    ans[i] = (minimal_combinations, minimal_price)
print(ans[13]) # ([0, 2, 1], 23)
print(ans[9]) # ([0, 0, 3], 15) Because the price of three packages is lower than the price of a package of 9
In case you need a solution for a small number of possible package_numbers but a possibly very big remaining_order, in which case all the other solutions would fail, you can use this to reduce remaining_order:
import numpy as np

remaining_order = 13
package_numbers = [9,5,3]
required_packages = []
# int() keeps the product a plain Python integer for list repetition below
block = int(np.prod(package_numbers))
sub_max = np.sum([(block // i - 1) * i for i in package_numbers])
while remaining_order > sub_max:
    remaining_order -= block
    # integer division: number of largest packages that make up one block
    required_packages.extend([max(package_numbers)] * (block // max(package_numbers)))
This works because if any package i appears in required_packages more often than (np.prod(package_numbers)/i-1) times, a group of those packages sums to exactly np.prod(package_numbers). In case the package max(package_numbers) isn't the one with the smallest price per unit, take the one with the smallest price per unit instead.
Example:
remaining_order = 100
package_numbers = [5,3]
Any part of remaining_order bigger than 5*2 plus 3*4 = 22 can be sorted out by adding 5 three times to the solution and taking remaining_order - 5*3.
So the remaining order that actually needs to be calculated is 10, which can then be solved as 2 times 5. The rest is filled with 6 times 15, which is 18 times 5.
In case the number of possible package_numbers is bigger than just a handful, I recommend building a lookup table (with one of the others answers' code) for all numbers below sub_max which will make this immensely fast for any input.
Since the question does not state an objective function, I assume your goal is to maximize the package value within the pack's capacity.
Explanation: the time complexity is fixed, because the optimal solution may not come from taking the highest-valued item as many times as possible, so you have to search all possible combinations. However, you can reuse the candidate solutions you have already searched to save work. For example, [5,5,3] is derived from adding 3 to a previous [5,5] try, so the intermediate result can be "cached". You may use either an array or a set to store possible solutions. The code below performs about the same as the Rosetta code, but I think it's clearer.
To further optimize, use a priority set for opts.
costs = [3,5,9]
value = [5,9,16]
volume = 130

# solutions
opts = set()
opts.add(tuple([0]))

# calc total value
cost_val = dict(zip(costs, value))

def total_value(opt):
    return sum([cost_val.get(cost,0) for cost in opt])

def possible_solutions():
    solutions = set()
    for opt in opts:
        for cost in costs:
            if cost + sum(opt) > volume:
                continue
            cnt = (volume - sum(opt)) // cost
            for _ in range(1, cnt + 1):
                sol = tuple(list(opt) + [cost] * _)
                solutions.add(sol)
    return solutions

def optimize_max_return(opts):
    if not opts:
        return tuple([])
    cur = list(opts)[0]
    for sol in opts:
        if total_value(sol) > total_value(cur):
            cur = sol
    return cur

while sum(optimize_max_return(opts)) <= volume - min(costs):
    opts = opts.union(possible_solutions())

print(optimize_max_return(opts))
If your requirement is "just fill the pack" it'll be even simpler using the volume for each item instead.
I have been wondering if there is some kind of data-structure or clever way to use a dictionary (O(1) lookup) to return a value if there are given values for defined ranges that do not overlap. So far I have been thinking this could be done if the ranges have some constant difference (0-2, 2-4, 4-6, etc.) or a binary-search could be done to do this in O(log(n)) time.
So, for example given a dictionary,
d = {[0.0 - 0.1): "a",
[0.1 - 0.3): "b",
[0.3 - 0.55): "c",
[0.55 - 0.7): "d",
[0.7 - 1.0): "e"}
it should return,
d[0.05]
>>> "a"
d[0.8]
>>> "e"
d[0.9]
>>> "e"
d[random.random()] # this should also work
Is there any way to achieve something like this? Thanks for any responses or answers on this.
First, split your data into two arrays:
limits = [0.1, 0.3, 0.55, 0.7, 1.0]
values = ["a", "b", "c", "d", "e"]
limits is sorted, so you can do binary search in it:
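import bisect

def value_at(n):
    # bisect_right keeps each range half-open: a value equal to limits[i]
    # belongs to the next range, matching [low, high) in the question
    index = bisect.bisect_right(limits, n)
    return values[index]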
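Boundary values are where `bisect_left` and `bisect_right` differ, which matters for the half-open ranges in the question. A small sketch of the difference (the two function names are made up for this comparison):

```python
import bisect

limits = [0.1, 0.3, 0.55, 0.7, 1.0]   # exclusive upper bound of each range
values = ["a", "b", "c", "d", "e"]


def value_at_left(n):
    return values[bisect.bisect_left(limits, n)]


def value_at_right(n):
    return values[bisect.bisect_right(limits, n)]


print(value_at_left(0.05), value_at_right(0.05))  # a a
print(value_at_left(0.3), value_at_right(0.3))    # b c
```

For a range written as [0.3, 0.55), the value 0.3 should map to "c", which is what `bisect_right` gives. Either variant raises IndexError for inputs at or above the last limit, so out-of-range inputs need their own handling.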
You can have O(1) lookup time if you accept a low(er) resolution of range boundaries and sacrifice memory for lookup speed.
A dictionary can do a lookup in O(1) average time because there is a simple arithmetic relationship between key and location in a fixed-size data structure (hash(key) % tablesize, for the average case). Your ranges are effectively of a variable size with floating-point boundaries, so there is no fixed tablesize to map a search value to.
Unless, that is, you limit the absolute lower and upper boundaries of the ranges, and let range boundaries fall on a fixed step size. Your example uses values from 0.0 through to 1.0, and the ranges can be quantized to 0.05 steps. That can be turned into a fixed table:
import math
from collections.abc import MutableMapping  # the collections.MutableMapping alias was removed in Python 3.10

# empty slot marker
_EMPTY = object()

class RangeMap(MutableMapping):
    """Map points to values, and find values for points in O(1) constant time

    The map requires a fixed minimum lower and maximum upper bound for
    the ranges. Range boundaries are quantized to a fixed step size. Gaps
    are permitted, when setting overlapping ranges last range set wins.

    """
    def __init__(self, map=None, lower=0.0, upper=1.0, step=0.05):
        self._mag = 10 ** -round(math.log10(step) - 1)  # shift to integers
        self._lower, self._upper = round(lower * self._mag), round(upper * self._mag)
        self._step = round(step * self._mag)
        self._steps = (self._upper - self._lower) // self._step
        self._table = [_EMPTY] * self._steps
        self._len = 0
        if map is not None:
            self.update(map)

    def __len__(self):
        return self._len

    def _map_range(self, r):
        low, high = r
        start = round(low * self._mag) // self._step
        stop = round(high * self._mag) // self._step
        if not self._lower <= start < stop <= self._upper:
            raise IndexError('Range outside of map boundaries')
        return range(start - self._lower, stop - self._lower)

    def __setitem__(self, r, value):
        for i in self._map_range(r):
            self._len += int(self._table[i] is _EMPTY)
            self._table[i] = value

    def __delitem__(self, r):
        for i in self._map_range(r):
            self._len -= int(self._table[i] is not _EMPTY)
            self._table[i] = _EMPTY

    def _point_to_index(self, point):
        point = round(point * self._mag)
        if not self._lower <= point <= self._upper:
            raise IndexError('Point outside of map boundaries')
        return (point - self._lower) // self._step

    def __getitem__(self, point_or_range):
        if isinstance(point_or_range, tuple):
            low, high = point_or_range
            r = self._map_range(point_or_range)
            # all points in the range must point to the same value
            value = self._table[r[0]]
            if value is _EMPTY or any(self._table[i] != value for i in r):
                raise IndexError('Not a range for a single value')
        else:
            value = self._table[self._point_to_index(point_or_range)]
            if value is _EMPTY:
                raise IndexError('Point not in map')
        return value

    def __iter__(self):
        low = None
        value = _EMPTY
        for i, v in enumerate(self._table):
            pos = (self._lower + (i * self._step)) / self._mag
            if v is _EMPTY:
                if low is not None:
                    yield (low, pos)
                    low = None
            elif v != value:
                if low is not None:
                    yield (low, pos)
                low = pos
                value = v
        if low is not None:
            yield (low, self._upper / self._mag)
The above implements the full mapping interface, and accepts both points and ranges (as a tuple modelling a [start, stop) interval) when indexing or testing for containment (supporting ranges made it easier to reuse the default keys, values and items dictionary view implementations, which all work from the __iter__ implementation).
Demo:
>>> d = RangeMap({
... (0.0, 0.1): "a",
... (0.1, 0.3): "b",
... (0.3, 0.55): "c",
... (0.55, 0.7): "d",
... (0.7, 1.0): "e",
... })
>>> print(*d.items(), sep='\n')
((0.0, 0.1), 'a')
((0.1, 0.3), 'b')
((0.3, 0.55), 'c')
((0.55, 0.7), 'd')
((0.7, 1.0), 'e')
>>> d[0.05]
'a'
>>> d[0.8]
'e'
>>> d[0.9]
'e'
>>> import random
>>> d[random.random()]
'c'
>>> d[random.random()]
'a'
If you can't limit the step size and boundaries so readily, then your next best option is to use some kind of binary search algorithm; you keep the ranges in sorted order and pick a point in the middle of the data structure; based on your search key being higher or lower than that mid point you continue the search in either half of the data structure until you find a match.
If your ranges cover the full interval from lowest to highest boundary, then you can use the bisect module for this; just store either the lower or upper boundaries of each range in one list, the corresponding values in another, and use bisection to map a position in the first list to a result in the second.
If your ranges have gaps, then you either need to keep a third list with the other boundary and first validate that the point falls in the range, or use an interval tree. For non-overlapping ranges a simple binary tree would do, but there are more specialised implementations that support overlapping ranges too. There is an intervaltree project on PyPI that supports full interval tree operations.
A bisect-based mapping that matches behaviour to the fixed-table implementation would look like:
from bisect import bisect_left
from collections.abc import MutableMapping

class RangeBisection(MutableMapping):
    """Map ranges to values

    Lookups are done in O(logN) time. There are no limits set on the upper or
    lower bounds of the ranges, but ranges must not overlap.

    """
    def __init__(self, map=None):
        self._upper = []
        self._lower = []
        self._values = []
        if map is not None:
            self.update(map)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, point_or_range):
        if isinstance(point_or_range, tuple):
            low, high = point_or_range
            i = bisect_left(self._upper, high)
            point = low
        else:
            point = point_or_range
            i = bisect_left(self._upper, point)
        if i >= len(self._values) or self._lower[i] > point:
            raise IndexError(point_or_range)
        return self._values[i]

    def __setitem__(self, r, value):
        lower, upper = r
        i = bisect_left(self._upper, upper)
        if i < len(self._values) and self._lower[i] < upper:
            raise IndexError('No overlaps permitted')
        self._upper.insert(i, upper)
        self._lower.insert(i, lower)
        self._values.insert(i, value)

    def __delitem__(self, r):
        lower, upper = r
        i = bisect_left(self._upper, upper)
        if self._upper[i] != upper or self._lower[i] != lower:
            raise IndexError('Range not in map')
        del self._upper[i]
        del self._lower[i]
        del self._values[i]

    def __iter__(self):
        yield from zip(self._lower, self._upper)
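When a full mapping class is more than is needed, the "keep both boundaries and validate" idea for ranges with gaps can be sketched with two parallel lists. This is a stripped-down illustration rather than a replacement for the class above; the list names, the `lookup` function and the sample ranges are assumptions made for the example.

```python
from bisect import bisect_right

lowers = [0.0, 0.3, 0.7]      # inclusive lower bounds, sorted
uppers = [0.1, 0.55, 1.0]     # exclusive upper bounds, same order
labels = ["a", "c", "e"]


def lookup(point):
    """Return the label whose [lower, upper) range contains point."""
    # Find the last range whose lower bound is <= point.
    i = bisect_right(lowers, point) - 1
    # Reject points below every range or inside a gap.
    if i < 0 or point >= uppers[i]:
        raise KeyError(point)
    return labels[i]


print(lookup(0.05))  # a
print(lookup(0.3))   # c
```

A point such as 0.2 falls in the gap between 0.1 and 0.3 and raises KeyError instead of silently returning a neighbouring label.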
I'm getting this error: 'TypeError: list indices must be integers, not float'
but the functions I'm using need to accept non integer values, otherwise my results are different...
Just to give you an idea, I have written some code that fits a gaussian to some data with a single peak. To do this, I need to calculate an estimated value for sigma. To get that, I've written two functions that are meant to look at the data, use the x value for the peak to find two points(r_pos and l_pos) which are either side of the peak and a set distance from the y axis (thresh). And from that I can get an estimated sigma(r_pos - l_pos).
This is all coming about from a piece of code that worked, but the mark sheet for my coursework says I need to use functions, so I'm trying to turn this:
I0 = max(y)
pos = y.index(I0)
print 'Peak value is',I0,'Counts per sec at' ,x[pos], 'degrees(2theta)'
print pos,I0

#left position
thresh = 10
i = pos
while y[i] > thresh:
    i -= 1
l_pos = x[i]

#right position
thresh = 10
i = y.index(I0)
while y[i] > thresh:
    i += 1
r_pos = x[i]
print r_pos

sigma0 = r_pos - l_pos
print sigma0
Into something that uses functions that can be called etc. This is my attempt:
def Peak_Find(x,y):
    I0 = max(y)
    pos = y.index(I0)
    return I0, x[pos]

def R_Pos(thresh,position):
    i = position
    while y[i] > thresh:
        i += 0.1
    r_pos = x[i]
    return r_pos

peak_y,peak_x = Peak_Find(x,y)
Right_Position = R_Pos(10,peak_x)
By the way, peak_y = 855.0 and peak_x = 32.1.
It looks like you want to replace the line
i = position
With something like
i = x.index(position)
because position is a float, and you want the location in the array of position. You are using i to get the index of an array, and you must use ints to do this, hence using the .index method to return the (integer) location in the array.
You are better off writing the program this way because then the variable names will actually match what is contained in the variables.
def Peak_Find(x,y):
    I0 = max(y)
    pos = y.index(I0)
    return I0, pos

def R_Pos(thresh,position):
    while y[position] > thresh:
        position += 1 # Not sure if this is what you want
    r_pos = x[position]
    return r_pos # Not sure what you want here... this is the value at x, not the position
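Putting the pieces together, here is a minimal sketch in which the functions receive `x` and `y` as arguments and always work with integer indices. The sample data is made up for illustration, the function names are not from the original code, and the bounds checks are an addition so the loops stop at the ends of the list.

```python
def peak_find(x, y):
    """Return (peak height, index of the peak)."""
    peak = max(y)
    return peak, y.index(peak)


def left_pos(x, y, thresh, index):
    """Walk left from index until y drops to thresh or below; return that x."""
    while index > 0 and y[index] > thresh:
        index -= 1
    return x[index]


def right_pos(x, y, thresh, index):
    """Walk right from index until y drops to thresh or below; return that x."""
    while index < len(y) - 1 and y[index] > thresh:
        index += 1
    return x[index]


# Made-up sample data with a single peak
x = [30.0, 30.5, 31.0, 31.5, 32.0, 32.5, 33.0]
y = [2.0, 8.0, 40.0, 855.0, 60.0, 9.0, 3.0]

peak_y, peak_index = peak_find(x, y)
sigma0 = right_pos(x, y, 10, peak_index) - left_pos(x, y, 10, peak_index)
print(peak_y, x[peak_index], sigma0)  # 855.0 31.5 2.0
```

The x values can stay floats; only the position used inside `y[...]` and `x[...]` has to be an integer.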
I wish to select a random word from a list where the is a known chance for each word, for example:
Fruit with Probability
Orange 0.10
Apple 0.05
Mango 0.15
etc
What would be the best way of implementing this? The actual list I will take from is up to 100 items long, and the percentages do not tally to 100%; they fall short to account for the items that had a really low chance of occurrence. I would ideally like to take this from a CSV, which is where I store this data. This is not a time-critical task.
Thank you for any advice on how best to proceed.
You can pick items with weighted probabilities if you assign each item a number range proportional to its probability, pick a random number between zero and the sum of the ranges and find what item matches it. The following class does exactly that:
from random import random

class WeightedChoice(object):
    def __init__(self, weights):
        """Pick items with weighted probabilities.

        weights
            a sequence of tuples of item and its weight.
        """
        self._total_weight = 0.
        self._item_levels = []
        for item, weight in weights:
            self._total_weight += weight
            self._item_levels.append((self._total_weight, item))

    def pick(self):
        pick = self._total_weight * random()
        for level, item in self._item_levels:
            if level >= pick:
                return item
You can then load the CSV file with the csv module and feed it to the WeightedChoice class:
import csv
weighed_items = [(item,float(weight)) for item,weight in csv.reader(open('file.csv'))]
picker = WeightedChoice(weighed_items)
print(picker.pick())
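On Python 3.6 or newer, the standard library's `random.choices` does the same kind of weighted pick. The sketch below is an alternative to the class above, not part of it; the sample list stands in for the rows that would come from the CSV file.

```python
import random

weighted_items = [("Orange", 0.10), ("Apple", 0.05), ("Mango", 0.15)]
items = [item for item, _ in weighted_items]
weights = [weight for _, weight in weighted_items]

# random.choices treats the weights as relative, so they need not sum to 1.
choice = random.choices(items, weights=weights, k=1)[0]
print(choice)
```

Like the class above, this rescales the weights to the listed items, so the missing percentage is spread proportionally rather than producing a "nothing picked" result.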
What you want is to draw from a multinomial distribution. Assuming you have two lists of items and probabilities, and the probabilities sum to 1 (if not, just add some default value to cover the extra):
def choose(items,chances):
    import random
    p = chances[0]
    x = random.random()
    i = 0
    while x > p :
        i = i + 1
        p = p + chances[i]
    return items[i]
lst = [ ('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.69) ]

x = 0.0
lst2 = []
for fruit, chance in lst:
    tup = (x, fruit)
    lst2.append(tup)
    x += chance

tup = (x, None)
lst2.append(tup)

import random

def pick_one(lst2):
    if lst2[0][1] is None:
        raise ValueError("no valid values to choose")
    while True:
        r = random.random()
        for x, fruit in reversed(lst2):
            if x <= r:
                if fruit is None:
                    break  # try again with a different random value
                else:
                    return fruit

pick_one(lst2)
This builds a new list, with ascending values representing the range of values that choose a fruit; then pick_one() walks backward down the list, looking for a value that is <= the current random value. We put a "sentinel" value on the end of the list; if the values don't reach 1.0, there is a chance of a random value that shouldn't match anything, and it will match the sentinel value and then be rejected. random.random() returns a random value in the range [0.0, 1.0) so it is certain to match something in the list eventually.
The nice thing here is that you should be able to have one value with a 0.000001 chance of matching, and it should actually match with that frequency; the other solutions, where you make a list with the items repeated and just use random.choice() to choose one, would require a list with a million items in it to handle this case.
lst = [ ('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.69) ]

x = 0.0
lst2 = []
for fruit, chance in lst:
    low = x
    high = x + chance
    tup = (low, high, fruit)
    lst2.append(tup)
    x += chance

if x > 1.0:
    raise ValueError("chances add up to more than 100%")

low = x
high = 1.0
tup = (low, high, None)
lst2.append(tup)

import random

def pick_one(lst2):
    if lst2[0][2] is None:
        raise ValueError("no valid values to choose")
    while True:
        r = random.random()
        for low, high, fruit in lst2:
            if low <= r < high:
                if fruit is None:
                    break  # try again with a different random value
                else:
                    return fruit

pick_one(lst2)

# test it 10,000 times
d = {}
for i in range(10000):
    x = pick_one(lst2)
    if x in d:
        d[x] += 1
    else:
        d[x] = 1
I think this is a little clearer. Instead of a tricky way of representing ranges as ascending values, we just keep ranges. Because we are testing ranges, we can simply walk forward through the lst2 values; no need to use reversed().
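With up to 100 items, the linear walk can also be swapped for a binary search over the running totals. This is a sketch of that variation, not the code above: `pick_one_bisect` and its `rand` parameter are names invented here, and the parameter exists only so the function can be exercised with fixed numbers.

```python
import random
from bisect import bisect_right
from itertools import accumulate

lst = [('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.69)]
fruits = [fruit for fruit, _ in lst]
upper_bounds = list(accumulate(chance for _, chance in lst))  # running totals


def pick_one_bisect(rand=random.random):
    """Pick a fruit; redraw when the random value lands past the last range."""
    while True:
        r = rand()
        i = bisect_right(upper_bounds, r)
        if i < len(fruits):      # r fell inside one of the ranges
            return fruits[i]
        # otherwise r landed in the unassigned gap; draw again


print(pick_one_bisect())
```

Redrawing on the unassigned gap mirrors the sentinel behaviour described above.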
from numpy.random import multinomial
import numpy as np

def pickone(dist):
    return np.where(multinomial(1, dist) == 1)[0][0]

if __name__ == '__main__':
    lst = [ ('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.70) ]
    dist = [p[1] for p in lst]
    N = 10000
    draws = np.array([pickone(dist) for i in range(N)], dtype=int)
    hist = np.histogram(draws, bins=[i for i in range(len(dist)+1)])[0]
    for i in range(len(lst)):
        print(f'{lst[i]} {hist[i]/N}')
One solution is to normalize the probabilities to integers and then repeat each element once per value (e.g. a list with 2 Oranges, 1 Apple, 3 Mangos). This is incredibly easy to do (from random import choice). If that is not practical, try the code here.
import random

d = {'orange': 0.10, 'mango': 0.15, 'apple': 0.05}
weightedArray = []
for k in d:
    weightedArray += [k]*int(d[k]*100)
random.choice(weightedArray)
EDITS
This is essentially what Brian said above.