I have been wondering whether there is some kind of data structure or clever way to use a dictionary (O(1) lookup) to return a value when the keys are non-overlapping ranges. So far I have been thinking this could work if the ranges have some constant width (0-2, 2-4, 4-6, etc.), or that a binary search could do it in O(log(n)) time.
So, for example, given a dictionary,
d = {[0.0 - 0.1): "a",
[0.1 - 0.3): "b",
[0.3 - 0.55): "c",
[0.55 - 0.7): "d",
[0.7 - 1.0): "e"}
it should return,
d[0.05]
>>> "a"
d[0.8]
>>> "e"
d[0.9]
>>> "e"
d[random.random()] # this should also work
Is there any way to achieve something like this? Thanks for any responses or answers.
First, split your data into two arrays:
limits = [0.1, 0.3, 0.55, 0.7, 1.0]
values = ["a", "b", "c", "d", "e"]
limits is sorted, so you can do binary search in it:
import bisect

def value_at(n):
    # bisect_right finds the first limit strictly greater than n, which
    # matches the half-open [low, high) ranges at their boundaries
    index = bisect.bisect_right(limits, n)
    return values[index]
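A quick check against the example data from the question (using the limits and values lists above; the expected outputs follow from the half-open interval semantics):

print(value_at(0.05))   # 'a'
print(value_at(0.8))    # 'e'
print(value_at(0.3))    # 'c', since 0.3 falls in the [0.3, 0.55) range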
You can have O(1) lookup time if you accept a low(er) resolution of range boundaries and sacrifice memory for lookup speed.
A dictionary can do a lookup in O(1) average time because there is a simple arithmetic relationship between key and location in a fixed-size data structure (hash(key) % tablesize, for the average case). Your ranges are effectively of a variable size with floating-point boundaries, so there is no fixed tablesize to map a search value to.
Unless, that is, you limit the absolute lower and upper boundaries of the ranges, and let range boundaries fall on a fixed step size. Your example uses values from 0.0 through to 1.0, and the ranges can be quantized to 0.05 steps. That can be turned into a fixed table:
import math
from collections.abc import MutableMapping

# empty slot marker
_EMPTY = object()

class RangeMap(MutableMapping):
    """Map points to values, and find values for points in O(1) constant time

    The map requires a fixed minimum lower and maximum upper bound for
    the ranges. Range boundaries are quantized to a fixed step size. Gaps
    are permitted; when setting overlapping ranges, the last range set wins.
    """
    def __init__(self, map=None, lower=0.0, upper=1.0, step=0.05):
        self._mag = 10 ** -round(math.log10(step) - 1)  # shift to integers
        self._lower, self._upper = round(lower * self._mag), round(upper * self._mag)
        self._step = round(step * self._mag)
        self._steps = (self._upper - self._lower) // self._step
        self._table = [_EMPTY] * self._steps
        self._len = 0
        if map is not None:
            self.update(map)

    def __len__(self):
        return self._len

    def _map_range(self, r):
        low, high = r
        start = round(low * self._mag) // self._step
        stop = round(high * self._mag) // self._step
        if not self._lower <= start < stop <= self._upper:
            raise IndexError('Range outside of map boundaries')
        return range(start - self._lower, stop - self._lower)

    def __setitem__(self, r, value):
        for i in self._map_range(r):
            self._len += int(self._table[i] is _EMPTY)
            self._table[i] = value

    def __delitem__(self, r):
        for i in self._map_range(r):
            self._len -= int(self._table[i] is not _EMPTY)
            self._table[i] = _EMPTY

    def _point_to_index(self, point):
        point = round(point * self._mag)
        if not self._lower <= point <= self._upper:
            raise IndexError('Point outside of map boundaries')
        return (point - self._lower) // self._step

    def __getitem__(self, point_or_range):
        if isinstance(point_or_range, tuple):
            low, high = point_or_range
            r = self._map_range(point_or_range)
            # all points in the range must point to the same value
            value = self._table[r[0]]
            if value is _EMPTY or any(self._table[i] != value for i in r):
                raise IndexError('Not a range for a single value')
        else:
            value = self._table[self._point_to_index(point_or_range)]
            if value is _EMPTY:
                raise IndexError('Point not in map')
        return value

    def __iter__(self):
        low = None
        value = _EMPTY
        for i, v in enumerate(self._table):
            pos = (self._lower + (i * self._step)) / self._mag
            if v is _EMPTY:
                if low is not None:
                    yield (low, pos)
                    low = None
            elif v != value:
                if low is not None:
                    yield (low, pos)
                low = pos
                value = v
        if low is not None:
            yield (low, self._upper / self._mag)
The above implements the full mapping interface and accepts both points and ranges (a tuple modelling a [start, stop) interval) when indexing or testing for containment. Supporting ranges made it easier to reuse the default keys, values and items dictionary view implementations, which all work from the __iter__ implementation.
Demo:
>>> d = RangeMap({
... (0.0, 0.1): "a",
... (0.1, 0.3): "b",
... (0.3, 0.55): "c",
... (0.55, 0.7): "d",
... (0.7, 1.0): "e",
... })
>>> print(*d.items(), sep='\n')
((0.0, 0.1), 'a')
((0.1, 0.3), 'b')
((0.3, 0.55), 'c')
((0.55, 0.7), 'd')
((0.7, 1.0), 'e')
>>> d[0.05]
'a'
>>> d[0.8]
'e'
>>> d[0.9]
'e'
>>> import random
>>> d[random.random()]
'c'
>>> d[random.random()]
'a'
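Range lookups and containment tests work too, assuming the quantization lines up with the stored boundaries:

>>> d[0.3, 0.55]
'c'
>>> (0.3, 0.55) in d
True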
If you can't limit the step size and boundaries so readily, then your next best option is to use some kind of binary search algorithm; you keep the ranges in sorted order and pick a point in the middle of the data structure; based on your search key being higher or lower than that mid point you continue the search in either half of the data structure until you find a match.
If your ranges cover the full interval from lowest to highest boundary, then you can use the bisect module for this; just store either the lower or upper boundaries of each range in one list, the corresponding values in another, and use bisection to map a position in the first list to a result in the second.
If your ranges have gaps, then you either need to keep a third list with the other boundary and first validate that the point falls in the range, or use an interval tree. For non-overlapping ranges a simple binary tree would do, but there are more specialised implementations that support overlapping ranges too. There is an intervaltree project on PyPI that supports full interval tree operations.
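For illustration, a minimal sketch using that package (assuming its slice-assignment API; the exact repr of the result may differ):

from intervaltree import IntervalTree

tree = IntervalTree()
tree[0.1:0.3] = "b"
tree[0.3:0.55] = "c"

# a point query returns the set of intervals containing the point
print(tree[0.2])   # e.g. {Interval(0.1, 0.3, 'b')}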
A bisect-based mapping that matches the behaviour of the fixed-table implementation would look like this:
from bisect import bisect_left
from collections.abc import MutableMapping

class RangeBisection(MutableMapping):
    """Map ranges to values

    Lookups are done in O(logN) time. There are no limits set on the upper or
    lower bounds of the ranges, but ranges must not overlap.
    """
    def __init__(self, map=None):
        self._upper = []
        self._lower = []
        self._values = []
        if map is not None:
            self.update(map)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, point_or_range):
        if isinstance(point_or_range, tuple):
            low, high = point_or_range
            i = bisect_left(self._upper, high)
            point = low
        else:
            point = point_or_range
            i = bisect_left(self._upper, point)
        if i >= len(self._values) or self._lower[i] > point:
            raise IndexError(point_or_range)
        return self._values[i]

    def __setitem__(self, r, value):
        lower, upper = r
        i = bisect_left(self._upper, upper)
        if i < len(self._values) and self._lower[i] < upper:
            raise IndexError('No overlaps permitted')
        self._upper.insert(i, upper)
        self._lower.insert(i, lower)
        self._values.insert(i, value)

    def __delitem__(self, r):
        lower, upper = r
        i = bisect_left(self._upper, upper)
        if self._upper[i] != upper or self._lower[i] != lower:
            raise IndexError('Range not in map')
        del self._upper[i]
        del self._lower[i]
        del self._values[i]

    def __iter__(self):
        yield from zip(self._lower, self._upper)
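A quick check, mirroring the earlier demo (no fixed step size is needed here):

>>> d = RangeBisection({(0.0, 0.1): "a", (0.1, 0.3): "b", (0.3, 0.55): "c",
...                     (0.55, 0.7): "d", (0.7, 1.0): "e"})
>>> d[0.05]
'a'
>>> d[0.8]
'e'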
I have objects that store values as dataframes. I have been able to compare whether values from two dataframes are within 10% of each other. However, I am having difficulty extending this to multiple dataframes. Moreover, I am wondering how I should approach this problem if the dataframes are not the same size.
def add_well_peak(self, *other):
    if len(self.Bell) == len(other.Bell):  # if dataframes ARE the same size
        for k in range(len(self.Bell)):
            for j in range(len(other.Bell)):
                if int(self.Size[k]) - int(self.Size[k])*(1/10) <= int(other.Size[j]) <= int(self.Size[k]) + int(self.Size[k])*(1/10):
                    # average all
For example, in the image below, there are objects that contain dataframes (i.e., self, other1, other2). The colors represent matches (i.e., values that are within 10% of each other). If a match exists, average the values. If a match does not exist, still include the unmatched number. I want to be able to generalize this for any number of objects greater than or equal to 2 (other1, other2, other3, other...). Any help would be appreciated. Please let me know if anything is unclear. This is my first time posting. Thanks again.
[image: matching data]
Results:
Using my solution on the dataframes of your image, I get the following:
Threshold outlier = 0.2:
0
0 1.000000
1 1493.500000
2 5191.333333
3 35785.333333
4 43586.500000
5 78486.000000
6 100000.000000
Threshold outlier = 0.5:
0 1
0 1.000000 NaN
1 1493.500000 NaN
2 5191.333333 NaN
3 43586.500000 35785.333333
4 78486.000000 100000.000000
Explanations:
The rows are the averaged peaks; the columns are the different values obtained for each peak. I assumed the average coming from the largest number of elements is the legitimate one, and that the others within THRESHOLD_OUTLIER of it are outliers. The columns are sorted: the more probable a value is to be the legitimate peak, the further left it appears (column 0 is the most probable). For instance, on row 3 of the 0.5-threshold results, 43586.500000 is an average coming from 3 dataframes, while 35785.333333 comes from only 2, so the first one is the most probable.
Issues:
The solution is quite complicated. I assume a big part of it could be simplified, but I can't see how at the moment, and since it works I'll leave the optimization to you.
Still, I tried to comment it as well as I could; if you have any questions, don't hesitate to ask!
Files:
CombinationLib.py
from __future__ import annotations
from typing import Dict, List
from Errors import *
class Combination():
    """
    Support class, to make things easier.

    Contains a string `self.combination` which is a binary number stored as a string.
    This allows to test every combination of values (i.e. "101" on the list `[1, 2, 3]`
    would signify grouping `1` and `3` together).
    There are some methods:
    - `__add__` overrides the `+` operator
    - `compute_degree` gives how many `1`s are in the combination
    - `overlaps` allows to verify if two combinations overlap (use the same value twice)
      (i.e. `100` and `011` don't overlap, while `101` and `001` do)
    """
    def __init__(self, combination:str) -> Combination:
        self.combination:str = combination
        self.degree:int = self.compute_degree()

    def __add__(self, other: Combination) -> Combination:
        if self.combination == None:
            return other.copy()
        if other.combination == None:
            return self.copy()
        if self.overlaps(other):
            raise CombinationsOverlapError()
        result = ""
        for c1, c2 in zip(self.combination, other.combination):
            result += "1" if (c1 == "1" or c2 == "1") else "0"
        return Combination(result)

    def __str__(self) -> str:
        return self.combination

    def compute_degree(self) -> int:
        if self.combination == None:
            return 0
        degree = 0
        for bit in self.combination:
            if bit == "1":
                degree += 1
        return degree

    def copy(self) -> Combination:
        return Combination(self.combination)

    def overlaps(self, other:Combination) -> bool:
        for c1, c2 in zip(self.combination, other.combination):
            if c1 == "1" and c1 == c2:
                return True
        return False

class CombinationNode():
    """
    The main class.

    The main idea was to build a tree of possible "combinations of combinations":
        100-011 => 111
        |---010-001 => 111
        |---001-010 => 111
    At each node, the combination applied to the current list of values has to be acceptable
    (all values within THRESHOLD_AVERAGING).
    Also, the shorter a path, the better the solution, as it means it found a way to average
    a lot of the values with the minimum amount of outliers possible, maybe by grouping
    the outliers together in a way that makes sense, ...
    - `populate` fills the tree automatically, with every solution possible
    - `path` is used mainly on leaves, to obtain the path taken to arrive there.
    """
    def __init__(self, combination:Combination) -> CombinationNode:
        self.combination:Combination = combination
        self.children:List[CombinationNode] = []
        self.parent:CombinationNode = None
        self.total_combination:Combination = combination

    def __str__(self) -> str:
        list_paths = self.recur_paths()
        list_paths = [",".join([combi.combination.combination for combi in path]) for path in list_paths]
        return "\n".join(list_paths)

    def add_child(self, child:CombinationNode) -> None:
        if child.combination.degree > self.combination.degree and not self.total_combination.overlaps(child.combination):
            raise ChildDegreeExceedParentDegreeError(f"{child.combination} > {self.combination}")
        self.children.append(child)
        child.parent = self
        child.total_combination += self.total_combination

    def path(self) -> List[CombinationNode]:
        path = []
        current = self
        while current.parent != None:
            path.append(current)
            current = current.parent
        path.append(current)
        return path[::-1]

    def populate(self, combination_dict:Dict[int, List[Combination]]) -> None:
        missing_degrees = len(self.combination.combination)-self.total_combination.degree
        if missing_degrees == 0:
            return
        for i in range(min(self.combination.degree, missing_degrees), 0, -1):
            for combination in combination_dict[i]:
                if not self.total_combination.overlaps(combination):
                    self.add_child(CombinationNode(combination))
        for child in self.children:
            child.populate(combination_dict)

    def recur_paths(self) -> List[List[CombinationNode]]:
        if len(self.children) == 0:
            return [self.path()]
        paths = []
        for child in self.children:
            for path in child.recur_paths():
                paths.append(path)
        return paths
Errors.py
class ChildDegreeExceedParentDegreeError(Exception):
    pass

class CombinationsOverlapError(Exception):
    pass

class ToImplementError(Exception):
    pass

class UncompletePathError(Exception):
    pass
main.py
from typing import Dict, List, Set, Tuple, Union
import pandas as pd
from CombinationLib import *
best_depth:int = -1
best_path:List[CombinationNode] = []
THRESHOLD_OUTLIER = 0.2
THRESHOLD_AVERAGING = 0.1
def verif_averaging_pct(combination:Combination, values:List[float]) -> bool:
    """
    For a given combination of values, we must have all the values within
    THRESHOLD_AVERAGING of the average of the combination
    """
    avg = 0
    for c,v in zip(combination.combination, values):
        if c == "1":
            avg += v
    avg /= combination.degree
    for c,v in zip(combination.combination, values):
        if c == "1" and (v > avg*(1+THRESHOLD_AVERAGING) or v < avg*(1-THRESHOLD_AVERAGING)):
            return False
    return True

def recursive_check(node:CombinationNode, depth:int, values:List[Union[float, int]]) -> None:
    """
    Here is where we preferentially ask for a small number of bigger groups
    """
    global best_depth
    global best_path
    # If there are more groups than the current best way to do it, stop
    if best_depth != -1 and depth > best_depth:
        return
    # If all the values of the combination are not within THRESHOLD_AVERAGING, stop
    if not verif_averaging_pct(node.combination, values):
        return
    # If we finished the list of combinations, and this way is the best, keep it, stop
    if len(node.children) == 0:
        if best_depth == -1 or depth < best_depth:
            best_depth = depth
            best_path = node.path()
        return
    # If we are still not finished (not every value has been used), continue
    for cnode in node.children:
        recursive_check(cnode, depth+1, values)

def groups_from_list(values:List[Union[float, int]]) -> List[List[Union[float, int]]]:
    """
    From a list of values, get the smallest list of groups of elements
    within THRESHOLD_AVERAGING of each other.
    It implies that we will try and recursively find the biggest group possible
    within the unused values (i.e. groups with combinations of size [3, 1] are preferred
    over [2, 2])
    """
    global best_depth
    global best_path
    groups:List[List[float]] = []
    # Generate all the combinations (I used binary for this)
    combination_dict:Dict[int, List[Combination]] = {}
    for i in range(1, 2**len(values)):
        combination = format(i, f"0{len(values)}b")  # Here is the binary conversion
        counter = 0
        for c in combination:
            if c == "1":
                counter += 1
        if counter not in combination_dict:
            combination_dict[counter] = []
        combination_dict[counter].append(Combination(combination))
    # Generate all the combinations of combinations that use every value (without using one twice)
    combination_trees:List[List[CombinationNode]] = []
    for key in combination_dict:
        for combination in combination_dict[key]:
            cn = CombinationNode(combination)
            cn.populate(combination_dict)
            combination_trees.append(cn)
    best_depth = -1
    best_path = None
    for root in combination_trees:
        recursive_check(root, 0, values)
    # print(",".join([combination.combination.combination for combination in best_path]))
    for combination in best_path:
        temp = []
        for c,v in zip(combination.combination.combination, values):
            if c == "1":
                temp.append(v)
        groups.append(temp)
    return groups

def averages_from_groups(gs:List[List[Union[float, int]]]) -> List[float]:
    """Computing the averages of each group"""
    avgs:List[float] = []
    for group in gs:
        avg = 0
        for elt in group:
            avg += elt
        avg /= len(group)
        avgs.append(avg)
    return avgs

def end_check(ds:List[pd.DataFrame], ids:List[int]) -> bool:
    """Check if we finished consuming all the dataframes"""
    for d,i in zip(ds, ids):
        if i < len(d[0]):
            return False
    return True

def search(group:List[Union[float, int]], values_list:List[Union[float, int]]) -> List[int]:
    """Obtain all the indices corresponding to a set of values"""
    # We will get all the indices in values_list of the values in group
    # If a value is present in group, all the occurrences of this value will be too,
    # so we can use a set and search every occurrence for each value.
    indices:List[int] = []
    group_set = set(group)
    for value in group_set:
        for i,v in enumerate(values_list):
            if value == v:
                indices.append(i)
    return indices

def threshold_grouper(total_list:List[Union[float, int]]) -> pd.DataFrame:
    """Building a 2D pd.DataFrame with the averages (x) and the outliers (y)"""
    result_list:List[List[Union[float, int]]] = [[total_list[0]]]
    result_index = 0
    total_index = 1
    while total_index < len(total_list):
        # Only checking if the bigger one is within THRESHOLD_OUTLIER of the little one.
        # If it is the case, the opposite is true too.
        # If yes, it is an outlier
        if result_list[result_index][0]*(1+THRESHOLD_OUTLIER) >= total_list[total_index]:
            result_list[result_index].append(total_list[total_index])
        # Else it is a new peak
        else:
            result_list.append([total_list[total_index]])
            result_index += 1
        total_index += 1
    result:pd.DataFrame = pd.DataFrame(result_list)
    return result

def dataframes_merger(dataframes:List[pd.DataFrame]) -> pd.DataFrame:
    """Merging the dataframes, with THRESHOLDS"""
    # Store the averages for the within-10% cells, in ascending order
    result = []
    # Keep tabs on where we are regarding each dataframe (needed for when we skip cells)
    curr_indices:List[int] = [0 for _ in range(len(dataframes))]
    # Repeat until all the cells in every dataframe have been seen once
    while not end_check(dataframes, curr_indices):
        # Get the values of the current indices in the dataframes
        curr_values = [dataframe[0][i] for dataframe,i in zip(dataframes, curr_indices)]
        # Get the largest 10% groups from the current list of values
        groups = groups_from_list(curr_values)
        # Compute the average of these groups
        avgs = averages_from_groups(groups)
        # Obtain the minimum average...
        avg_min = min(avgs)
        # ... and its index
        avg_min_index = avgs.index(avg_min)
        # Then get the group corresponding to the minimum average
        avg_min_group = groups[avg_min_index]
        # Get the indices of the values included in this group
        indices_to_increment = search(avg_min_group, curr_values)
        # Add the average to the result merged list
        result.append(avg_min)
        # For every element in the average we added, increment the corresponding index
        for index in indices_to_increment:
            curr_indices[index] += 1
    # Re-assemble the dataframe, taking the threshold% around average into account
    result = threshold_grouper(result)
    print(result)

df1 = pd.DataFrame([1, 1487, 5144, 35293, 78486, 100000])
df2 = pd.DataFrame([1, 1500, 5144, 36278, 45968, 100000])
df3 = pd.DataFrame([1, 5286, 35785, 41205, 100000])
dataframes_merger([df3, df2, df1])
I am implementing in Python 3 an algorithm to find the longest common substring of two strings s and t. Given s and t, I need to return (a, b, l), where l is the length of the longest common substring, a is the position in s where it starts, and b is the position in t where it starts. I have a working version of the algorithm, but it is quite slow and I am not sure why; this is frustrating because I have found other Python implementations using pretty much the same logic that are many times faster. I am self-taught, so any help would be greatly appreciated.
The approach is based on comparing hash values rather than directly comparing substrings and using binary search to find maximal length of common substrings. Here is the code for my hash function (m is a big prime and x is just some constant):
def polynomial_hash(my_string, m, x):
    str_len = len(my_string)
    result = 0
    for i in range(str_len):
        result = (result + ord(my_string[i]) * power_mod_p(x, i, m)) % m
    return result
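The helper power_mod_p is not shown in the question; presumably it is plain modular exponentiation. A hypothetical stand-in for the missing helper could be:

def power_mod_p(x, i, m):
    # (x ** i) % m; Python's built-in three-argument pow already computes this
    return pow(x, i, m)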
Given two strings s and t, I first find which string is shorter; without loss of generality, let s be the shorter string. First I need the hash values of all substrings of a given length, for which I use the following function, implemented as a generator:
def all_length_k_hashes(my_string, k, m, x):
    current_position = len(my_string) - k
    x_to_the_k = power_mod_p(x, k, m)
    hash_value = polynomial_hash(my_string[current_position:], m, x)
    yield (hash_value, current_position)
    while current_position > 0:
        current_position = current_position - 1
        hash_value = ((hash_value * x) + ord(my_string[current_position]) - x_to_the_k*ord(my_string[current_position + k])) % m
        yield (hash_value, current_position)
This function is simple: its first yield is the hash value of the final length-k substring of the string, and each subsequent iteration yields the hash value of the next length-k substring to its left (we move left by one position; for example, with k=3 on "abcdefghi" we go from "ghi" to "fgh", then from "fgh" to "efg"). This should be able to calculate the hash values of all length-k substrings of my_string in O(|my_string|).
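As a quick sanity check (using the hypothetical power_mod_p sketched above, and an illustrative prime m and constant x), the rolling update can be compared against recomputing each window hash directly:

def check_rolling(s, k, m=1000000007, x=31):
    # every rolling hash should equal the directly computed hash of its window
    for h, i in all_length_k_hashes(s, k, m, x):
        assert h == polynomial_hash(s[i:i+k], m, x)

check_rolling("abcdefghi", 3)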
Now, to find out whether s and t have a length-k substring in common, I use the following function:
def common_sub_string_length_k(shorter_str, longer_str, k, m, x):
    short_str_dict = dict()
    for hash_and_index in all_length_k_hashes(shorter_str, k, m, x):
        short_str_dict.update({hash_and_index[0]: hash_and_index[1]})
    hash_generator_longer_str = all_length_k_hashes(longer_str, k, m, x)
    for hash_and_index in hash_generator_longer_str:
        if hash_and_index[0] in short_str_dict:
            return (short_str_dict[hash_and_index[0]], hash_and_index[1])
    return False
What is happening in this function is: I create an empty Python dictionary and fill it with key:value pairs such that each key is the hash value of a length-k substring of the shorter string and its value is that substring's starting index; I call this short_str_dict.
Then, using all_length_k_hashes, I create a generator of hash values of length-k substrings of the longer string and iterate through it, checking whether any hash value is in short_str_dict. If there is one, then the two strings have a substring of length k in common (assuming no hash collisions). This whole process should take O(|shorter_string| + |longer_string|) time.
Finally, the following function repeatedly uses the previous process to find the maximal k, using a binary search technique:
from random import randint

def longest_common_substring(str_1, str_2):
    m_1 = 309000599
    m_2 = 988017827
    x = randint(1, 10 ** 6)
    len_str_1 = len(str_1)
    len_str_2 = len(str_2)
    if len_str_1 <= len_str_2:
        short_str = str_1
        long_str = str_2
        switched = False
    else:
        short_str = str_2
        long_str = str_1
        switched = True
    len_short_str = len(short_str)
    len_long_str = len(long_str)
    low = 0
    high = len_short_str
    mid = 0
    longest_so_far = 0
    longest_indices = (0,0)
    while low <= high:
        mid = (high + low) // 2
        m1_result = common_sub_string_length_k(short_str, long_str, mid, m_1, x)
        m2_result = common_sub_string_length_k(short_str, long_str, mid, m_2, x)
        if m1_result is False or m2_result is False:
            high = mid - 1
        else:
            longest_so_far = mid
            longest_indices = m1_result
            low = mid + 1
    if switched:
        return (longest_indices[1], longest_indices[0], longest_so_far)
    else:
        return (longest_indices[0], longest_indices[1], longest_so_far)
Two different hashes are used to reduce the probability of a collision. So in total, assuming no collisions, this whole process should take O(log|shorter_string|) * O(|shorter_string| + |longer_string|) time.
Have I made any error? Is it slow because of the use of Python dictionaries? I really want to understand my mistake. Any help is greatly appreciated.
I have seen many solutions for generating random floats within a specific range (like this), which actually help me, and solutions for generating random floats summing to 1 (like this). Separately the solutions work perfectly, but I can't figure out how to combine them.
Currently my code is:
import random
def sample_floats(low, high, k=1):
    """ Return a k-length list of unique random floats
    in the range of low <= x <= high
    """
    result = []
    seen = set()
    for i in range(k):
        x = random.uniform(low, high)
        while x in seen:
            x = random.uniform(low, high)
        seen.add(x)
        result.append(x)
    return result
And still, applying
weights = sample_floats(0.055, 1.0, 11)
weights /= np.sum(weights)
this returns a weights array in which some floats are less than 0.055.
Should I somehow use np.random.dirichlet in the function above, or should the whole thing be built on np.random.dirichlet with the > 0.055 condition enforced afterwards? I can't figure out a solution.
Thanks in advance!
IIUC, you want to generate an array of k values, with minimum value of low=0.055.
It is easier to generate numbers from 0 that sum up to 1-low*k, and then to add low so that the final array sums to 1. Thus, this guarantees both the lower bound and the sum.
Regarding high, I am pretty sure it is mathematically impossible to add this constraint: once you fix the lower bound and the sum, there are not enough degrees of freedom left to choose an upper bound. The upper bound will be 1-low*(k-1) (here 0.505).
Also, be aware that with a minimum value you necessarily enforce a maximum k of 1//low (here 18 values). If you set k higher, the lower bound won't hold.
import numpy as np

# parameters
low = 0.055
k = 10

a = np.random.rand(k)
a = (a/a.sum()*(1-low*k))
weights = a+low

# checking that the sum is 1
assert np.isclose(weights.sum(), 1)
Example output:
array([0.13608635, 0.06796974, 0.07444545, 0.1361171 , 0.07217206,
0.09223554, 0.12713463, 0.11012871, 0.1107402 , 0.07297022])
You could generate k-1 numbers iteratively by varying the lower and upper bounds of the uniform random number generator, the constraint at any iteration being that the number generated leaves enough room for the remaining numbers to each be at least low:
import random

def sample_floats(low, high, k=1):
    result = []
    generated = 0
    while generated < k-1:
        current_higher_bound = max(low, 1 - (k - 1 - generated)*low - sum(result))
        next_num = random.uniform(low, current_higher_bound)
        result.append(next_num)
        generated += 1
    last_num = 1 - sum(result)
    result.append(last_num)
    return result
print(sample_floats(0.01, 1, k=15))
#[0.08878760926151083,
# 0.17897435239586243,
# 0.5873150041878156,
# 0.021487776792166513,
# 0.011234379498998357,
# 0.012408564286727042,
# 0.015391011259745103,
# 0.01264921242128719,
# 0.010759267284382326,
# 0.010615007333002748,
# 0.010288605412288477,
# 0.010060487014659121,
# 0.010027216923973544,
# 0.010000064276203318,
# 0.010001441651377285]
The samples are correlated, so I believe you can't generate them in an IID way. You can, however, do it iteratively, for example as I show in the code below. There are a few more special cases to check, like what happens if the user passes high < low or high*k < x_sum, but I figure you can find and account for them using my modification to your code.
import random
import warnings

def sample_floats(low = 0.055, high = 1., x_sum = 1., k = 1):
    """ Return a k-length list of unique random floats
    in the range of 'low' <= x <= 'high' summing up to 'x_sum'.
    """
    sum_i = 0
    xs = []
    if x_sum - (k-1)*low < high:
        warnings.warn(f'high = {high} is too high to be generated under the'
                      f' conditions set by k = {k}, sum = {x_sum}, and low = {low}.'
                      f' high automatically set to {x_sum - (k-1)*low}.')
    if k == 1:
        if high < x_sum:
            raise ValueError(f'The parameter combination k = {k}, sum = {x_sum},'
                             f' and high = {high} is impossible.')
        else:
            return x_sum
    high_i = high
    for i in range(k-1):
        x = random.uniform(low, high_i)
        xs.append(x)
        sum_i = sum_i + x
        if high < (x_sum - sum_i - (k-1-i)*low):
            high_i = high
        else:
            high_i = x_sum - sum_i - (k-1-i)*low
    xs.append(x_sum - sum_i)
    return xs
For example:
random.seed(0)
xs = sample_floats(low = 0.055, high = 0.5, x_sum = 1., k = 5)
print(xs)
print(sum(xs))
Output:
[0.43076772392864643, 0.27801464913542906, 0.08495210994346317, 0.06568433355884717, 0.14058118343361425]
1.0
I have a dictionary that looks something like so:
exons = {'NM_015665': [(0, 225), (356, 441), (563, 645), (793, 861)], etc...}
and another file that has a position like so:
isoform pos
NM_015665 449
What I want to do is print the range of numbers that the position in the file is closest to, and then print the number within that range that the value is closest to. For this case, I want to print (356, 441) and then 441. I've successfully figured out a way to print the number the value is closest to, but my code below only takes into account 10 values on either side of the listed numbers. Is there any way to account for the fact that there is a different amount of numbers between each pair of ranges?
This is the code I have so far:
import csv

with open('splicing_reinitialized.txt') as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        pos = row['pos']
        name = row['isoform']
        ppos1 = int(pos)
        if name in exons:
            y = exons[name]
            for i, (low, high) in enumerate(exons[name]):
                if low - 5 <= ppos1 <= high + 5:
                    values = (low, high)
                    closest = min((low, high), key=lambda x: abs(x - ppos1))
I would rewrite it as a minimum distance search:
if name in exons:
    y = exons[name]
    minDist = 99999  # large number
    minIdx = None
    minNum = None
    for i, (low, high) in enumerate(y):
        dlow = abs(low - ppos1)
        dhigh = abs(high - ppos1)
        dist = min(dlow, dhigh)
        if dist < minDist:
            minDist = dist
            minIdx = i
            minNum = 0 if dlow < dhigh else 1
    print(y[minIdx])
    print(y[minIdx][minNum])
This ignores the search range and simply searches for the minimum-distance pair.
A functional alternative :). This might even run faster. It clearly is very RAM-friendly and can be easily parallelized due to the perks of functional programming. I hope you'll find it interesting enough to study.
from itertools import repeat

def closest_point(position, interval):
    """:rtype: tuple[int, int]"""  # closest interval point, distance to it
    position_in_interval = interval[0] <= position <= interval[1]
    closest = min([(border, abs(position - border)) for border in interval], key=lambda x: x[1])
    return closest if not position_in_interval else (closest[0], 0)  # distance is 0 if position is inside an interval

def closest_interval(exons, pos):
    """:rtype: tuple[tuple[int, int], tuple[int, int]]"""
    # map, zip and filter are lazy in Python 3, standing in for itertools.imap/izip/ifilter
    return min(filter(lambda x: x[1][1],
                      zip(exons, map(closest_point, repeat(pos, len(exons)), exons))),
               key=lambda x: x[1][1])
print(closest_interval(exons['NM_015665'], 449))
This prints
((356, 441), (441, 8))
The first tuple is a range. The first integer in the second tuple is the closest point in the interval, the second integer is the distance.
I wish to select a random word from a list where there is a known chance for each word, for example:
Fruit with Probability
Orange 0.10
Apple 0.05
Mango 0.15
etc
What would be the best way of implementing this? The actual list I will draw from is up to 100 items long, and the percentages do not all tally to 100%; they fall short to account for the items that had a really low chance of occurrence. I would ideally like to take this from a CSV, which is where I store this data. This is not a time-critical task.
Thank you for any advice on how best to proceed.
You can pick items with weighted probabilities if you assign each item a number range proportional to its probability, pick a random number between zero and the sum of the ranges and find what item matches it. The following class does exactly that:
from random import random

class WeightedChoice(object):
    def __init__(self, weights):
        """Pick items with weighted probabilities.

            weights
                a sequence of tuples of an item and its weight.
        """
        self._total_weight = 0.
        self._item_levels = []
        for item, weight in weights:
            self._total_weight += weight
            self._item_levels.append((self._total_weight, item))

    def pick(self):
        pick = self._total_weight * random()
        for level, item in self._item_levels:
            if level >= pick:
                return item
You can then load the CSV file with the csv module and feed it to the WeightedChoice class:
import csv
weighed_items = [(item,float(weight)) for item,weight in csv.reader(open('file.csv'))]
picker = WeightedChoice(weighed_items)
print(picker.pick())
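For a quick test without a CSV file, you can feed the class the fruit list from the question directly (the weights do not have to add up to 1, since pick() scales the random draw by the total weight; the 0.70 for the 'etc' entry is illustrative):

fruit_picker = WeightedChoice([('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.70)])
print(fruit_picker.pick())  # e.g. 'Mango'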
What you want is to draw from a multinomial distribution. Assuming you have two lists of items and probabilities, and the probabilities sum to 1 (if not, just add some default value to cover the extra):
def choose(items, chances):
    import random
    p = chances[0]
    x = random.random()
    i = 0
    while x > p:
        i = i + 1
        p = p + chances[i]
    return items[i]
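For example, with the fruit list from the question, padding the remainder into an 'etc' entry so the chances sum to 1 (the padding value here is illustrative):

items = ['Orange', 'Apple', 'Mango', 'etc']
chances = [0.10, 0.05, 0.15, 0.70]
print(choose(items, chances))  # 'Orange' about 10% of the time, 'Apple' about 5%, and so on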
lst = [ ('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.69) ]
x = 0.0
lst2 = []
for fruit, chance in lst:
    tup = (x, fruit)
    lst2.append(tup)
    x += chance

tup = (x, None)
lst2.append(tup)
import random
def pick_one(lst2):
    if lst2[0][1] is None:
        raise ValueError("no valid values to choose")
    while True:
        r = random.random()
        for x, fruit in reversed(lst2):
            if x <= r:
                if fruit is None:
                    break  # try again with a different random value
                else:
                    return fruit
pick_one(lst2)
This builds a new list, with ascending values representing the range of values that choose a fruit; then pick_one() walks backward down the list, looking for a value that is <= the current random value. We put a "sentinel" value on the end of the list; if the values don't reach 1.0, there is a chance of a random value that shouldn't match anything, and it will match the sentinel value and then be rejected. random.random() returns a random value in the range [0.0, 1.0) so it is certain to match something in the list eventually.
The nice thing here is that you should be able to have one value with a 0.000001 chance of matching, and it should actually match with that frequency; the other solutions, where you make a list with the items repeated and just use random.choice() to choose one, would require a list with a million items in it to handle this case.
lst = [ ('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.69) ]
x = 0.0
lst2 = []
for fruit, chance in lst:
    low = x
    high = x + chance
    tup = (low, high, fruit)
    lst2.append(tup)
    x += chance

if x > 1.0:
    raise ValueError("chances add up to more than 100%")

low = x
high = 1.0
tup = (low, high, None)
lst2.append(tup)
import random
def pick_one(lst2):
    if lst2[0][2] is None:
        raise ValueError("no valid values to choose")
    while True:
        r = random.random()
        for low, high, fruit in lst2:
            if low <= r < high:
                if fruit is None:
                    break  # try again with a different random value
                else:
                    return fruit
pick_one(lst2)
# test it 10,000 times
d = {}
for i in range(10000):
    x = pick_one(lst2)
    if x in d:
        d[x] += 1
    else:
        d[x] = 1
I think this is a little clearer. Instead of a tricky way of representing ranges as ascending values, we just keep ranges. Because we are testing ranges, we can simply walk forward through the lst2 values; no need to use reversed().
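Printing the tally from that test loop should show counts roughly proportional to the chances: around 1,000 Orange, 500 Apple, 1,500 Mango and 6,900 etc, give or take sampling noise (the sentinel's roughly 1% gets rejected and redrawn):

for fruit, count in sorted(d.items()):
    print(fruit, count)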
from numpy.random import multinomial
import numpy as np
def pickone(dist):
    return np.where(multinomial(1, dist) == 1)[0][0]

if __name__ == '__main__':
    lst = [ ('Orange', 0.10), ('Apple', 0.05), ('Mango', 0.15), ('etc', 0.70) ]
    dist = [p[1] for p in lst]
    N = 10000
    draws = np.array([pickone(dist) for i in range(N)], dtype=int)
    hist = np.histogram(draws, bins=[i for i in range(len(dist)+1)])[0]
    for i in range(len(lst)):
        print(f'{lst[i]} {hist[i]/N}')
One solution is to normalize the probabilities to integers and then repeat each element once per unit of weight (e.g. a list with 2 Oranges, 1 Apple, 3 Mangos). This is incredibly easy to do (from random import choice). If that is not practical, try the code below.
import random

d = {'orange': 0.10, 'mango': 0.15, 'apple': 0.05}
weightedArray = []
for k in d:
    weightedArray += [k]*int(d[k]*100)
random.choice(weightedArray)
EDITS
This is essentially what Brian said above.