Pythonic way to assign range of number to bucket - python

I'm developing an ABtest framework using django. I want to assign variant number based on bucket_id from cookies' request.
bucket_id is set by the front end with a range integer from 0-99.
So far, I have created the function name get_bucket_name:
def get_bucket_range(data):
range_bucket = []
first_val = 0
next_val = 0
for i, v in enumerate(data.split(",")):
v = int(v)
if i == 0:
first_val = v
range_bucket.append([0, first_val])
elif i == 1:
range_bucket.append([first_val, first_val + v])
next_val = first_val + v
else:
range_bucket.append([next_val, next_val + v])
next_val = next_val + v
return range_bucket
Data input for get_bucket_range is a comma delineated string which means we have 3 variants where each variant has its own weight e.g. data = "25,25,50" with first variant's weight being 25 etc.
I then created a function to assign the variant named,
def assign_variant(range_bucket, num):
for i in range(len(range_bucket)):
if num in range(range_bucket[i][0], range_bucket[i][1]):
return i
This function should have 2 parameters, range_bucket -> from get_bucket_range function, and num -> bucket_id from cookies.
With this function I can return which bucket_id belongs to the variant id.
For example, we have 25 as bucket_id, with data = "25,25,50". This means our bucket_id should belong to variant id 1. Or in the case that we have 25 as bucket_id, with data = "10,10,10,70". This should mean that our bucket_id will belong to variant id 2.
However, it feels like neither of my functions are pythonic or optimised. Does anyone here have any suggestions as to how I could improve my code?

Your functions could look like this for example:
def get_bucket_range(data):
last = 0
range_bucket = []
for v in map(int, data.split(',')):
range_bucket.append([last, last+v])
last += v
return range_bucket
def assign_variant(range_bucket, num):
for i, (low, high) in enumerate(range_bucket):
if low <= num < high:
return i

You can greatly reduce the lengths of your functions with the itertools.accumulate and bisect.bisect functions. The first function accumulates all the weights into sums (10,10,10,70 becomes 10,20,30,100), and the second function gives you the index of where that element would belong, which in your case is equivalent to the index of the group it belongs to.
from itertools import accumulate
from bisect import bisect
def get_bucket_range(data):
return list(accumulate(map(int, data.split(',')))
def assign_variant(range_bucket, num):
return bisect(range_bucket, num)

Related

Finding and averaging elements of multiple dataframes that are within 10% of each other

I have objects that store values are dataframes. I have been able to compare if values from two dataframes are within 10% of each other. However, I am having difficulty extending this to multiple dataframes. Moreover, I am wondering how I should apporach this problem if dataframes are not the same size?
def add_well_peak(self, *other):
if len(self.Bell) == len(other.Bell): #if dataframes ARE the same size
for k in range(len(self.Bell)):
for j in range(len(other.Bell)):
if int(self.Size[k]) - int(self.Size[k])*(1/10) <= int(other.Size[j]) <= int(self.Size[k]) + int(self.Size[k])*(1/10):
#average all
For example, in the image below, there are objects that contain dataframes (i.e., self, other1, other2). The colors represent matches (i.e, values that are within 10% of each other). If a match exist, then average the values. If a match does not exist still include the unmatch number. I want to be able to generalize this for any number of objects greater or equal than 2 (other 1, other 2, other 3, other ....). Any help would be appreciated. Please let me know if anything is unclear. This is my first time posting. Thanks again.
matching data
Results:
Using my solution on the dataframes of your image, I get the following:
Threshold outlier = 0.2:
0
0 1.000000
1 1493.500000
2 5191.333333
3 35785.333333
4 43586.500000
5 78486.000000
6 100000.000000
Threshold outlier = 0.5:
0 1
0 1.000000 NaN
1 1493.500000 NaN
2 5191.333333 NaN
3 43586.500000 35785.333333
4 78486.000000 100000.000000
Explanations:
The lines are averaged peaks, the columns representing the different values obtained for these peaks. I assumed the average emanating from the biggest number of elements was the legitimate one, and the rest within the THRESHOLD_OUTLIER were the outliers (should be sorted, the more probable you are as a legitimate peak, the more you are on the left (the 0th column is the most probable)). For instance, on line 3 of the 0.5 outlier threshold results, 43586.500000 is an average coming from 3 dataframes, while 35785.333333 comes from only 2, thus the most probable is the first one.
Issues:
The solution is quite complicated. I assume a big part of it could be removed, but I can't see how for the moment, and as it works, I'll certainly leave the optimization to you.
Still, I tried commenting my best, and if you have any question, do not hesitate!
Files:
CombinationLib.py
from __future__ import annotations
from typing import Dict, List
from Errors import *
class Combination():
"""
Support class, to make things easier.
Contains a string `self.combination` which is a binary number stored as a string.
This allows to test every combination of value (i.e. "101" on the list `[1, 2, 3]`
would signify grouping `1` and `3` together).
There are some methods:
- `__add__` overrides the `+` operator
- `compute_degree` gives how many `1`s are in the combination
- `overlaps` allows to verify if combination overlaps (use the same value twice)
(i.e. `100` and `011` don't overlap, while `101` and `001` do)
"""
def __init__(self, combination:str) -> Combination:
self.combination:str = combination
self.degree:int = self.compute_degree()
def __add__(self, other: Combination) -> Combination:
if self.combination == None:
return other.copy()
if other.combination == None:
return self.copy()
if self.overlaps(other):
raise CombinationsOverlapError()
result = ""
for c1, c2 in zip(self.combination, other.combination):
result += "1" if (c1 == "1" or c2 == "1") else "0"
return Combination(result)
def __str__(self) -> str:
return self.combination
def compute_degree(self) -> int:
if self.combination == None:
return 0
degree = 0
for bit in self.combination:
if bit == "1":
degree += 1
return degree
def copy(self) -> Combination:
return Combination(self.combination)
def overlaps(self, other:Combination) -> bool:
for c1, c2 in zip(self.combination, other.combination):
if c1 == "1" and c1 == c2:
return True
return False
class CombinationNode():
"""
The main class.
The main idea was to build a tree of possible "combinations of combinations":
100-011 => 111
|---010-001 => 111
|---001-010 => 111
At each node, the combination applied to the current list of values was to be acceptable
(all within THREASHOLD_AVERAGING).
Also, the shorter a path, the better the solution as it means it found a way to average
a lot of the values, with the minimum amount of outliers possible, maybe by grouping
the outliers together in a way that makes sense, ...
- `populate` fills the tree automatically, with every solution possible
- `path` is used mainly on leaves, to obtain the path taken to arrive there.
"""
def __init__(self, combination:Combination) -> CombinationNode:
self.combination:Combination = combination
self.children:List[CombinationNode] = []
self.parent:CombinationNode = None
self.total_combination:Combination = combination
def __str__(self) -> str:
list_paths = self.recur_paths()
list_paths = [",".join([combi.combination.combination for combi in path]) for path in list_paths]
return "\n".join(list_paths)
def add_child(self, child:CombinationNode) -> None:
if child.combination.degree > self.combination.degree and not self.total_combination.overlaps(child.combination):
raise ChildDegreeExceedParentDegreeError(f"{child.combination} > {self.combination}")
self.children.append(child)
child.parent = self
child.total_combination += self.total_combination
def path(self) -> List[CombinationNode]:
path = []
current = self
while current.parent != None:
path.append(current)
current = current.parent
path.append(current)
return path[::-1]
def populate(self, combination_dict:Dict[int, List[Combination]]) -> None:
missing_degrees = len(self.combination.combination)-self.total_combination.degree
if missing_degrees == 0:
return
for i in range(min(self.combination.degree, missing_degrees), 0, -1):
for combination in combination_dict[i]:
if not self.total_combination.overlaps(combination):
self.add_child(CombinationNode(combination))
for child in self.children:
child.populate(combination_dict)
def recur_paths(self) -> List[List[CombinationNode]]:
if len(self.children) == 0:
return [self.path()]
paths = []
for child in self.children:
for path in child.recur_paths():
paths.append(path)
return paths
Errors.py
class ChildDegreeExceedParentDegreeError(Exception):
pass
class CombinationsOverlapError(Exception):
pass
class ToImplementError(Exception):
pass
class UncompletePathError(Exception):
pass
main.py
from typing import Dict, List, Set, Tuple, Union
import pandas as pd
from CombinationLib import *
best_depth:int = -1
best_path:List[CombinationNode] = []
THRESHOLD_OUTLIER = 0.2
THRESHOLD_AVERAGING = 0.1
def verif_averaging_pct(combination:Combination, values:List[float]) -> bool:
"""
For a given combination of values, we must have all the values within
THRESHOLD_AVERAGING of the average of the combination
"""
avg = 0
for c,v in zip(combination.combination, values):
if c == "1":
avg += v
avg /= combination.degree
for c,v in zip(combination.combination, values):
if c == "1"and (v > avg*(1+THRESHOLD_AVERAGING) or v < avg*(1-THRESHOLD_AVERAGING)):
return False
return True
def recursive_check(node:CombinationNode, depth:int, values:List[Union[float, int]]) -> None:
"""
Here is where we preferencially ask for a small number of bigger groups
"""
global best_depth
global best_path
# If there are more groups than the current best way to do, stop
if best_depth != -1 and depth > best_depth:
return
# If all the values of the combination are not within THRESHOLD_AVERAGING, stop
if not verif_averaging_pct(node.combination, values):
return
# If we finished the list of combinations, and this way is the best, keep it, stop
if len(node.children) == 0:
if best_depth == -1 or depth < best_depth:
best_depth = depth
best_path = node.path()
return
# If we are still not finished (not every value has been used), continue
for cnode in node.children:
recursive_check(cnode, depth+1, values)
def groups_from_list(values:List[Union[float, int]]) -> List[List[Union[float, int]]]:
"""
From a list of values, get the smallest list of groups of elements
within THRESHOLD_AVERAGING of each other.
It implies that we will try and recursively find the biggest group possible
within the unsused values (i.e. groups with combinations of size [3, 1] are prefered
over [2, 2])
"""
global best_depth
global best_path
groups:List[List[float]] = []
# Generate all the combinations (I used binary for this)
combination_dict:Dict[int, List[Combination]] = {}
for i in range(1, 2**len(values)):
combination = format(i, f"0{len(values)}b") # Here is the binary conversion
counter = 0
for c in combination:
if c == "1":
counter += 1
if counter not in combination_dict:
combination_dict[counter] = []
combination_dict[counter].append(Combination(combination))
# Generate of the combinations of combinations that use all values (without using one twice)
combination_trees:List[List[CombinationNode]] = []
for key in combination_dict:
for combination in combination_dict[key]:
cn = CombinationNode(combination)
cn.populate(combination_dict)
combination_trees.append(cn)
best_depth = -1
best_path = None
for root in combination_trees:
recursive_check(root, 0, values)
# print(",".join([combination.combination.combination for combination in best_path]))
for combination in best_path:
temp = []
for c,v in zip(combination.combination.combination, values):
if c == "1":
temp.append(v)
groups.append(temp)
return groups
def averages_from_groups(gs:List[List[Union[float, int]]]) -> List[float]:
"""Computing the averages of each group"""
avgs:List[float] = []
for group in gs:
avg = 0
for elt in group:
avg += elt
avg /= len(group)
avgs.append(avg)
return avgs
def end_check(ds:List[pd.DataFrame], ids:List[int]) -> bool:
"""Check if we finished consuming all the dataframes"""
for d,i in zip(ds, ids):
if i < len(d[0]):
return False
return True
def search(group:List[Union[float, int]], values_list:List[Union[float, int]]) -> List[int]:
"""Obtain all the indices corresponding to a set of values"""
# We will get all the indices in values_list of the values in group
# If a value is present in group, all the occurences of this value will be too,
# so we can use a set and search every occurence for each value.
indices:List[int] = []
group_set = set(group)
for value in group_set:
for i,v in enumerate(values_list):
if value == v:
indices.append(i)
return indices
def threshold_grouper(total_list:List[Union[float, int]]) -> pd.DataFrame:
"""Building a 2D pd.DataFrame with the averages (x) and the outliers (y)"""
result_list:List[List[Union[float, int]]] = [[total_list[0]]]
result_index = 0
total_index = 1
while total_index < len(total_list):
# Only checking if the bigger one is within THRESHOLD_OUTLIER of the little one.
# If it is the case, the opposite is true too.
# If yes, it is an outlier
if result_list[result_index][0]*(1+THRESHOLD_OUTLIER) >= total_list[total_index]:
result_list[result_index].append(total_list[total_index])
# Else it is a new peak
else:
result_list.append([total_list[total_index]])
result_index += 1
total_index += 1
result:pd.DataFrame = pd.DataFrame(result_list)
return result
def dataframes_merger(dataframes:List[pd.DataFrame]) -> pd.DataFrame:
"""Merging the dataframes, with THRESHOLDS"""
# Store the averages for the within 10% cells, in ascending order
result = []
# Keep tabs on where we are regarding each dataframe (needed for when we skip cells)
curr_indices:List[int] = [0 for _ in range(len(dataframes))]
# Repeat until all the cells in every dataframe has been seen once
while not end_check(dataframes, curr_indices):
# Get the values of the current indices in the dataframes
curr_values = [dataframe[0][i] for dataframe,i in zip(dataframes, curr_indices)]
# Get the largest 10% groups from the current list of values
groups = groups_from_list(curr_values)
# Compute the average of these groups
avgs = averages_from_groups(groups)
# Obtain the minimum average...
avg_min = min(avgs)
# ... and its index
avg_min_index = avgs.index(avg_min)
# Then get the group corresponding to the minimum average
avg_min_group = groups[avg_min_index]
# Get the indices of the values included in this group
indices_to_increment = search(avg_min_group, curr_values)
# Add the average to the result merged list
result.append(avg_min)
# For every element in the average we added, increment the corresponding index
for index in indices_to_increment:
curr_indices[index] += 1
# Re-assemble the dataframe, taking the threshold% around average into account
result = threshold_grouper(result)
print(result)
df1 = pd.DataFrame([1, 1487, 5144, 35293, 78486, 100000])
df2 = pd.DataFrame([1, 1500, 5144, 36278, 45968, 100000])
df3 = pd.DataFrame([1, 5286, 35785, 41205, 100000])
dataframes_merger([df3, df2, df1])

MaxValueSelect() function has some errors . any fixes?

My code is:
def maxValueSelection(items,V):
maxval = 0
val = 0
sorted_dict={}
for i in items.items():
sorted_dict[i[1][1]] = [i[1][0],i[0]]
sorted_dict_list = (sorted(sorted_dict))[::-1]
while sorted_dict_list!=[]:
item = sorted_dict_list[0]
if(sorted_dict[item][0] + val<=V):
maxval+=item
val = val+sorted_dict[item][0]
sorted_dict_list.pop(0)
return maxval
items = {1:(4,400),2:(9,1800),3:(10,3500),4:(20,4000),5:(2,1000),6:(1,200)}
V = 20
print(maxValueSelection(items,V))
I have used a greedy algorithm for the question in which I have two values which records the value of the item and one monitors the weight of the items which should not be exceeded more than a threshold value mentioned in the question. It seems like my greedy strategy is working upto some extent but nearly misses the maxValue in every test case. It will be helpful if someone tells me how to fix this issue with my code
If I understood correctly what you need.
this code selects V elements from the dict items, with maximum total value.
from operator import itemgetter, truediv
def maxValueSelection(items,V):
maxval = 0
val = 0
itemlist = sorted([i[1] for i in items.items()],key=itemgetter(1),reverse=True)
for item in itemlist:
if V > 0:
val = item[1]
maxval += val * min(V,item[0])
V -= min(V,item[0])
return maxval
items = {1:(4,400),2:(9,1800),3:(10,3500),4:(20,4000),5:(2,1000),6:(1,200)}
V = 20
print(maxValueSelection(items,V))

How to compare a list of string and a string, return the best match string from the list?

My problem is on my second function code.
This is my code so far....
def simi(d1,d2):
dna_1 = d1.lower()
dna_2 = d2.lower()
lst = []
i = 0
while i < len(dna_1):
if dna_1[i] == dna_2[i]:
lst.append(1)
i += 1
return len(lst) / len(d1)
def match(list_1, d , s):
dna = []
for item in list_1:
dna.append(simi(item, d))
if max(dna) < s:
return None
return list_1[max(dna)]
You have two problems, the first is you return in the loop before you have tried all the elements, secondly your function simi(item, d)returns a float if it works correctly so trying to index a list with a float will also fail. There is no way your code could do anything but error or return None.
I imagine you want to keep track of the best each iteration and return the item that is the best based on what it's simi distance calc is and if the simi is > s or else return None:
def match(list_1, d , s):
best = None
mx = float("-inf")
for item in list_1:
f = simi(item, d)
if f > mx:
mx = f
best = item
return best if mx > s else None
You can also use range in simi instead of your while loop with a list comp:
def simi(d1,d2):
dna_1 = d1.lower()
dna_2 = d2.lower()
lst = [1 for i in range(len(dna_1)) if dna_1[i] == dna_2[i] ]
return len(lst) / len(dna_1)
But if you just want to add 1 each time they condition is True you can use sum:
def simi(d1,d2):
dna_1 = d1.lower()
dna_2 = d2.lower()
sm = sum(dna_1[i] == dna_2[i] for i in range(len(dna_1)))
return sm / len(dna_1)
Using some builtins:
from functools import partial
similarity_with_sample = partial(simi, 'TACgtAcGaCGT')
Now similarity_with_sample is a function that takes one argument, and returns its similarity with 'TACgtAcGaCGT'.
Now use that as the key argument of the builtin max function:
best_match = max(list_of_samples, key=similarity_with_sample)
I'm not sure what your s variable is doing.

Python: Concatenate similiar objects in List

I have a list containing strings as ['Country-Points'].
For example:
lst = ['Albania-10', 'Albania-5', 'Andorra-0', 'Andorra-4', 'Andorra-8', ...other countries...]
I want to calculate the average for each country without creating a new list. So the output would be (in the case above):
lst = ['Albania-7.5', 'Andorra-4.25', ...other countries...]
Would realy appreciate if anyone can help me with this.
EDIT:
this is what I've got so far. So, "data" is actually a dictionary, where the keys are countries and the values are list of other countries points' to this country (the one as Key). Again, I'm new at Python so I don't realy know all the built-in functions.
for key in self.data:
lst = []
index = 0
score = 0
cnt = 0
s = str(self.data[key][0]).split("-")[0]
for i in range(len(self.data[key])):
if s in self.data[key][i]:
a = str(self.data[key][i]).split("-")
score += int(float(a[1]))
cnt+=1
index+=1
if i+1 != len(self.data[key]) and not s in self.data[key][i+1]:
lst.append(s + "-" + str(float(score/cnt)))
s = str(self.data[key][index]).split("-")[0]
score = 0
self.data[key] = lst
itertools.groupby with a suitable key function can help:
import itertools
def get_country_name(item):
return item.split('-', 1)[0]
def get_country_value(item):
return float(item.split('-', 1)[1])
def country_avg_grouper(lst) :
for ctry, group in itertools.groupby(lst, key=get_country_name):
values = list(get_country_value(c) for c in group)
avg = sum(values)/len(values)
yield '{country}-{avg}'.format(country=ctry, avg=avg)
lst[:] = country_avg_grouper(lst)
The key here is that I wrote a function to do the change out of place and then I can easily make the substitution happen in place by using slice assignment.
I would probabkly do this with an intermediate dictionary.
def country(s):
return s.split('-')[0]
def value(s):
return float(s.split('-')[1])
def country_average(lst):
country_map = {}|
for point in lst:
c = country(pair)
v = value(pair)
old = country_map.get(c, (0, 0))
country_map[c] = (old[0]+v, old[1]+1)
return ['%s-%f' % (country, sum/count)
for (country, (sum, count)) in country_map.items()]
It tries hard to only traverse the original list only once, at the expense of quite a few tuple allocations.

single iteration sharing the iterator

I have a lot of data, usually in a file. I want to compute some quantities so I have this kind of functions:
def mean(iterator):
n = 0
sum = 0.
for i in iterator:
sum += i
n += 1
return sum / float(n)
I have also many other similar functions (var, size, ...)
Now I have an iterator iterating throught the data: iter_data. I can compute all the quantities I want: m = mean(iter_data); v = var(iter_data) and so on, but the problem is that I am iterating many times and this is expensive in my case. Actually the I/O is the most expensive part.
So the question is: can I compute my quantities m, v, ... iterating only one time over iter_data keeping separate the functions mean, var, ... so that it is easy to add new ones?
What I need is something similar to boost::accumulators
For example use objects and callbacks like:
class Counter():
def __init__(self):
self.n = 0
def __call__(self, i):
self.n += 1
class Summer():
def __init__(self):
self.sum = 0
def __call__(self, i):
self.sum += i
def process(iterator, callbacks):
for i in iterator:
for f in callbacks: f(i)
counter = Counter()
summer = Summer()
callbacks = [counter, summer]
iterator = xrange(10) # testdata
process(iterator, callbacks)
# process results from callbacks
n = counter.n
sum = summer.sum
This is easily extendible and iterates the data only once.
You can use itertools.tee and generator magic (I say magic because it's not exactly nice and readable):
import itertools
def mean(iterator):
n = 0
sum = 0.
for i in iterator:
sum += i
n += 1
yield
yield sum / float(n)
def multi_iterate(funcs, iter_data):
iterators = itertools.tee(iter_data, len(funcs))
result_iterators = [func(values) for func, values in zip(funcs, iterators)]
for results in itertools.izip(*result_iterators):
pass
return results
mean_result, var_result = multi_iterate([mean, var], iter([10, 20, 30]))
print(mean_result) # 20.0
By the way, you can write mean in a simpler way:
def mean(iterator):
total = 0.
for n, item in enumerate(iterator, 1):
total += i
yield
yield total / n
You shouldn't name variables sum because that shadows the built-in function with the same name.
Without classes, you could adapt the following:
def my_mean():
total = 0.
length = 0
while True:
val = (yield)
if val is not None:
total += val
length += 1
else:
yield total / length
def my_len():
length = 0
while True:
val = (yield)
if val is not None:
length += 1
else:
yield length
def my_sum():
total = 0.
while True:
val = (yield)
if val is not None:
total += val
else:
yield total
def process(iterable, **funcs):
fns = {name:func() for name, func in funcs.iteritems()}
for fn in fns.itervalues():
fn.send(None)
for item in iterable:
for fn in fns.itervalues():
fn.send(item)
return {name:next(func) for name, func in fns.iteritems()}
data = [1, 2, 3]
print process(data, items=my_len, some_other_value=my_mean, Total=my_sum)
# {'items': 3, 'some_other_value': 2.0, 'Total': 6.0}
What you want is to have a main Calc class that iterates over the data applying different calculation for mean, var, etc and then can return those values through an interface. You could make it more generic by letting calculations register themselves with this class before the main calculation and then have their results available through new accessors in the interface.

Categories

Resources