Inconsistent results in pyspark combineByKey (as opposed to groupByKey) - python

I want to group some rows in an RDD by key so I can perform more advanced operations on the rows within one group. Please note, I do not want to merely calculate some aggregate values. The rows are key-value pairs, where the key is a GUID and the value is a complex object.
As per the pyspark documentation, I first tried to implement this with combineByKey, as it is supposed to be more performant than groupByKey. The list at the beginning is just for illustration, not my real data:
l = list(range(1000))
numbers = sc.parallelize(l)
rdd = numbers.map(lambda x: (x % 5, x))

def f_init(initial_value):
    return [initial_value]

def f_merge(current_merged, new_value):
    if current_merged is None:
        current_merged = []
    return current_merged.append(new_value)

def f_combine(merged1, merged2):
    if merged1 is None:
        merged1 = []
    if merged2 is None:
        merged2 = []
    return merged1 + merged2

combined_by_key = rdd.combineByKey(f_init, f_merge, f_combine)

c = combined_by_key.collectAsMap()
i = 0
for k, v in c.items():
    if v is None:
        print(i, k, 'value is None.')
    else:
        print(i, k, len(v))
    i += 1
The output of this is:
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
Which is not what I expected. The same logic but implemented with groupByKey returns a correct output:
grouped_by_key = rdd.groupByKey()
d = grouped_by_key.collectAsMap()
i = 0
for k, v in d.items():
    if v is None:
        print(i, k, 'value is None.')
    else:
        print(i, k, len(v))
    i += 1
Returns:
0 0 200
1 1 200
2 2 200
3 3 200
4 4 200
So unless I'm missing something, this is the case when groupByKey is preferred over reduceByKey or combineByKey (the topic of related discussion: Is groupByKey ever preferred over reduceByKey).

It is the case when understanding basic APIs is preferred. In particular if you check list.append docstring:
?list.append
## Docstring: L.append(object) -> None -- append object to end
## Type: method_descriptor
you'll see that, like the other mutating methods in the Python API, it by convention doesn't return the modified object. This means that f_merge always returns None and there is no accumulation whatsoever.
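For completeness, a minimal sketch of combiner functions that do accumulate, assuming the same rdd as in the question (append mutates the list in place, so the accumulator has to be returned explicitly; the None checks are unnecessary because the initializer always produces a list):

def f_init(initial_value):
    return [initial_value]

def f_merge(current_merged, new_value):
    current_merged.append(new_value)   # append mutates in place and returns None...
    return current_merged              # ...so return the list explicitly

def f_combine(merged1, merged2):
    return merged1 + merged2

combined_by_key = rdd.combineByKey(f_init, f_merge, f_combine)
# each value should now be a list of 200 elements per key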
That being said, for most problems there are much more efficient solutions than groupByKey, but rewriting it with combineByKey (or aggregateByKey) is never one of them.

Flattening nested JSON columns in a pandas dataframe - python

I have a dataframe with two columns that are json.
So for example,
df =
   A  B  C                             D
0  1  2  {b:1, c:2, d:{r:1, t:{y:0}}}  {v:9}
I want to flatten it entirely, so every value in the json will be in a separate column, and the column name will be the full path. So here the value 0 will be in the column:
C_d_t_y
What is the best way to do it, and without having to predefine the depth of the json or the fields?
If your dataframe contains only nested dictionaries (no lists), you can try:
import pandas as pd

def get_values(df):
    def _parse(val, current_path):
        if isinstance(val, dict):
            for k, v in val.items():
                yield from _parse(v, current_path + [k])
        else:
            yield "_".join(map(str, current_path)), val

    rows = []
    for idx, row in df.iterrows():
        tmp = {}
        for i in row.index:
            tmp.update(dict(_parse(row[i], [i])))
        rows.append(tmp)
    return pd.DataFrame(rows, index=df.index)

print(get_values(df))
Prints:
A B C_b C_c C_d_r C_d_t_y D_v
0 1 2 1 2 1 0 9
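If the nested values are actual Python dicts, pandas' built-in pd.json_normalize offers similar per-column flattening; a minimal sketch, assuming the example frame from the question (the hard-coded column list ["C", "D"] is just for illustration):

import pandas as pd

df = pd.DataFrame({
    "A": [1],
    "B": [2],
    "C": [{"b": 1, "c": 2, "d": {"r": 1, "t": {"y": 0}}}],
    "D": [{"v": 9}],
})

# Flatten each dict-valued column and prefix its new columns with the original name.
flat_parts = [
    pd.json_normalize(df[col].tolist(), sep="_").add_prefix(col + "_")
    for col in ["C", "D"]
]
result = pd.concat([df[["A", "B"]]] + flat_parts, axis=1)
print(result)  # columns: A B C_b C_c C_d_r C_d_t_y D_v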

Truth table with Boolean Functions

I am trying to generate a Truth Table using PANDAS in python.
I have been given a Boolean Network with 3 external nodes (U1,U2,U3) and 6 internal nodes (v1,v2,v3,v4,v5,v6).
I have created a table with all the possible combinations of the 3 external nodes which are 2^3 = 8.
import pandas as pd
import itertools
in_comb = list(itertools.product([0,1], repeat = 3))
df = pd.DataFrame(in_comb)
df.columns = ['U1','U2','U3']
df.index += 1
   U1  U2  U3
1   0   0   0
2   0   0   1
3   0   1   0
4   0   1   1
5   1   0   0
6   1   0   1
7   1   1   0
8   1   1   1
And I also have created the same table but with all the possible combinations of the 6 internal nodes which are 2^6 = 64 combinations.
The functions for each node were also given
v1(t+1) = U1(t)
v2(t+1) = v1(t) and U2(t)
v3(t+1) = v2(t) and v5(t)
v5(t+1) = not U3(t)
v6(t+1) = v5(t) or v3(t)
The truth table has to be built with pandas and it has to show the resulting state for every possible combination.
For example.
[Example table: columns v1 ... v6; each row pairs an external-input combination (0 0 0, 0 0 1, 0 1 0, ...) and a current internal state with the resulting next internal state, e.g. 000010 or 000000.]
The table above is an example of how the end product should look, where [0 0 0] is the first combination of the external nodes.
I am confused as to how to compute the functions of each gene and how to filter the data to end up with a new table like the one here.
Here I attach an image of the problem I want to solve:
What you seem to have missed is the fact that you don't only have 3 inputs to your network, as the "old state" is also considered an input - that's what a feedback combinational network does: it turns the old state + input into a new state (and often an output).
This means that you have 3+6 inputs, for 2^9 = 512 combinations. Not very easy to understand when printed, but still possible. I modified your code to print this (beware that I'm quite new to pandas, so this code can definitely be improved):
import pandas as pd
import itertools

# list of (u, v) pairs (3 and 6 elements)
# uses bools instead of ints
inputs = list((row[0:3], row[3:]) for row in itertools.product([False, True], repeat = 9))

def new_state(u, v):
    # implement the internal nodes
    return (
        u[0],
        v[0] and u[1],
        v[1] and v[4],
        v[2],
        not u[2],
        v[4] or v[2]
    )

new_states = list(new_state(u, v) for u, v in inputs)

# unzip inputs to (u,v), add new_states
raw_rows = zip(*zip(*inputs), new_states)

def format_boolvec(v):
    """Format a tuple of bools like (False, False, True, False) into a string like "0010" """
    return "".join('1' if b else '0' for b in v)

formatted_rows = list(map(lambda row: list(map(format_boolvec, row)), raw_rows))

df = pd.DataFrame(formatted_rows)
df.columns = ['U', "v(t)", "v(t+1)"]
df.index += 1
df
The heart of it is the function new_state that takes the (u, v) pair of input & old state and produces the resulting new state. It's a direct translation of your specification.
I modified your itertools.product line to use bools, produce length-9 results and split them into 3+6 length tuples. To still print in your format, I added the format_boolvec(v) function. Other than that, it should be very easy to follow, but feel free to comment if you need more explanation.
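To read a single transition off the resulting table, you can filter on the input and old-state columns; a small usage sketch (the concrete values "100" and "000010" are just an illustration):

# Look up the new state reached from old state "000010" under external input "100".
row = df[(df["U"] == "100") & (df["v(t)"] == "000010")]
print(row["v(t+1)"].iloc[0])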
To find an input sequence from a given start state to a given end state, you could do it yourself by hand, but it's tedious. I recommend using a graph algorithm, which is easy to implement since we also know the length of the desired path, so we don't need any fancy algorithms like Bellman-Ford or Dijkstra's - we need to just generate all length=3 paths and filter for the endpoint.
# to find desired inputs
# treat each state as a node in a graph
# (think of visual graph transition diagrams)
# and add edges between them labeled with the inputs
# find all l=3 paths, and check the end nodes
nodes = {format_boolvec(prod): {} for prod in itertools.product([False, True], repeat = 6)}
for index, row in df.iterrows():
    nodes[row['v(t)']][row['U']] = row['v(t+1)']

# we now built the graph, only need to find a path from start state to end state
def prefix_paths(prefix, paths):
    # aux helper function for all_length_n_paths
    for path, endp in paths:
        yield ([prefix]+path, endp)

def all_length_n_paths(graph, start_node, n):
    """Return all length n paths from a given starting point

    Yield tuples (path, endpoint) where path is a list of strings of the inputs, and endpoint is the end of the path.
    Uses internal recursion to generate paths"""
    if n == 0:
        yield ([], start_node)
        return
    for inp, nextstate in graph[start_node].items():
        yield from prefix_paths(inp, all_length_n_paths(graph, nextstate, n-1))

# just iterate over all length=3 paths starting at 101100 and print it if it ends up at 011001
for path, end in all_length_n_paths(nodes, "101100", 3):
    if end == "011001":
        print(path)
This code should also be easy to follow, except maybe for the generator syntax.
The result is not just one, but 3 different paths:
['100', '110', '011']
['101', '110', '011']
['111', '110', '011']

Finding and averaging elements of multiple dataframes that are within 10% of each other

I have objects that store values as dataframes. I have been able to compare whether values from two dataframes are within 10% of each other. However, I am having difficulty extending this to multiple dataframes. Moreover, I am wondering how I should approach this problem if the dataframes are not the same size?
def add_well_peak(self, *other):
    if len(self.Bell) == len(other.Bell):  # if dataframes ARE the same size
        for k in range(len(self.Bell)):
            for j in range(len(other.Bell)):
                if int(self.Size[k]) - int(self.Size[k])*(1/10) <= int(other.Size[j]) <= int(self.Size[k]) + int(self.Size[k])*(1/10):
                    # average all
For example, in the image below, there are objects that contain dataframes (i.e., self, other1, other2). The colors represent matches (i.e., values that are within 10% of each other). If a match exists, then average the values. If a match does not exist, still include the unmatched number. I want to be able to generalize this for any number of objects greater than or equal to 2 (other1, other2, other3, other ...). Any help would be appreciated. Please let me know if anything is unclear. This is my first time posting. Thanks again.
matching data
Results:
Using my solution on the dataframes of your image, I get the following:
Threshold outlier = 0.2:
0
0 1.000000
1 1493.500000
2 5191.333333
3 35785.333333
4 43586.500000
5 78486.000000
6 100000.000000
Threshold outlier = 0.5:
0 1
0 1.000000 NaN
1 1493.500000 NaN
2 5191.333333 NaN
3 43586.500000 35785.333333
4 78486.000000 100000.000000
Explanations:
The rows are the averaged peaks, and the columns represent the different values obtained for these peaks. I assumed the average coming from the largest number of elements was the legitimate one, and the others within THRESHOLD_OUTLIER of it were the outliers (they are sorted: the more probable a value is as the legitimate peak, the further left it appears, so column 0 is the most probable). For instance, on line 3 of the 0.5-threshold results, 43586.500000 is an average coming from 3 dataframes, while 35785.333333 comes from only 2, so the first one is the most probable.
Issues:
The solution is quite complicated. I assume a big part of it could be removed, but I can't see how for the moment, and as it works, I'll certainly leave the optimization to you.
Still, I tried to comment it as best I could, and if you have any questions, do not hesitate!
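The core criterion used throughout is simply "every value in a group lies within THRESHOLD_AVERAGING of the group's average"; a standalone sketch of that check, mirroring the verif_averaging_pct function below (the helper name and the sample values are just for illustration, taken from the example dataframes):

THRESHOLD_AVERAGING = 0.1

def all_within_threshold(values):
    # True if every value is within THRESHOLD_AVERAGING of the group's average
    avg = sum(values) / len(values)
    return all(avg * (1 - THRESHOLD_AVERAGING) <= v <= avg * (1 + THRESHOLD_AVERAGING) for v in values)

print(all_within_threshold([1487, 1500]))    # True: both within 10% of their mean
print(all_within_threshold([35293, 45968]))  # False: 35293 falls outside the 10% band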
Files:
CombinationLib.py
from __future__ import annotations
from typing import Dict, List
from Errors import *


class Combination():
    """
    Support class, to make things easier.
    Contains a string `self.combination` which is a binary number stored as a string.
    This allows to test every combination of values (i.e. "101" on the list `[1, 2, 3]`
    would signify grouping `1` and `3` together).
    There are some methods:
    - `__add__` overrides the `+` operator
    - `compute_degree` gives how many `1`s are in the combination
    - `overlaps` allows to verify if combinations overlap (use the same value twice)
      (i.e. `100` and `011` don't overlap, while `101` and `001` do)
    """
    def __init__(self, combination:str) -> Combination:
        self.combination:str = combination
        self.degree:int = self.compute_degree()

    def __add__(self, other: Combination) -> Combination:
        if self.combination == None:
            return other.copy()
        if other.combination == None:
            return self.copy()
        if self.overlaps(other):
            raise CombinationsOverlapError()
        result = ""
        for c1, c2 in zip(self.combination, other.combination):
            result += "1" if (c1 == "1" or c2 == "1") else "0"
        return Combination(result)

    def __str__(self) -> str:
        return self.combination

    def compute_degree(self) -> int:
        if self.combination == None:
            return 0
        degree = 0
        for bit in self.combination:
            if bit == "1":
                degree += 1
        return degree

    def copy(self) -> Combination:
        return Combination(self.combination)

    def overlaps(self, other:Combination) -> bool:
        for c1, c2 in zip(self.combination, other.combination):
            if c1 == "1" and c1 == c2:
                return True
        return False


class CombinationNode():
    """
    The main class.
    The main idea was to build a tree of possible "combinations of combinations":
        100-011 => 111
        |---010-001 => 111
        |---001-010 => 111
    At each node, the combination applied to the current list of values has to be acceptable
    (all within THRESHOLD_AVERAGING).
    Also, the shorter a path, the better the solution, as it means it found a way to average
    a lot of the values with the minimum amount of outliers possible, maybe by grouping
    the outliers together in a way that makes sense, ...
    - `populate` fills the tree automatically, with every solution possible
    - `path` is used mainly on leaves, to obtain the path taken to arrive there.
    """
    def __init__(self, combination:Combination) -> CombinationNode:
        self.combination:Combination = combination
        self.children:List[CombinationNode] = []
        self.parent:CombinationNode = None
        self.total_combination:Combination = combination

    def __str__(self) -> str:
        list_paths = self.recur_paths()
        list_paths = [",".join([combi.combination.combination for combi in path]) for path in list_paths]
        return "\n".join(list_paths)

    def add_child(self, child:CombinationNode) -> None:
        if child.combination.degree > self.combination.degree and not self.total_combination.overlaps(child.combination):
            raise ChildDegreeExceedParentDegreeError(f"{child.combination} > {self.combination}")
        self.children.append(child)
        child.parent = self
        child.total_combination += self.total_combination

    def path(self) -> List[CombinationNode]:
        path = []
        current = self
        while current.parent != None:
            path.append(current)
            current = current.parent
        path.append(current)
        return path[::-1]

    def populate(self, combination_dict:Dict[int, List[Combination]]) -> None:
        missing_degrees = len(self.combination.combination)-self.total_combination.degree
        if missing_degrees == 0:
            return
        for i in range(min(self.combination.degree, missing_degrees), 0, -1):
            for combination in combination_dict[i]:
                if not self.total_combination.overlaps(combination):
                    self.add_child(CombinationNode(combination))
        for child in self.children:
            child.populate(combination_dict)

    def recur_paths(self) -> List[List[CombinationNode]]:
        if len(self.children) == 0:
            return [self.path()]
        paths = []
        for child in self.children:
            for path in child.recur_paths():
                paths.append(path)
        return paths
Errors.py
class ChildDegreeExceedParentDegreeError(Exception):
    pass

class CombinationsOverlapError(Exception):
    pass

class ToImplementError(Exception):
    pass

class UncompletePathError(Exception):
    pass
main.py
from typing import Dict, List, Set, Tuple, Union
import pandas as pd
from CombinationLib import *

best_depth:int = -1
best_path:List[CombinationNode] = []
THRESHOLD_OUTLIER = 0.2
THRESHOLD_AVERAGING = 0.1

def verif_averaging_pct(combination:Combination, values:List[float]) -> bool:
    """
    For a given combination of values, we must have all the values within
    THRESHOLD_AVERAGING of the average of the combination
    """
    avg = 0
    for c, v in zip(combination.combination, values):
        if c == "1":
            avg += v
    avg /= combination.degree
    for c, v in zip(combination.combination, values):
        if c == "1" and (v > avg*(1+THRESHOLD_AVERAGING) or v < avg*(1-THRESHOLD_AVERAGING)):
            return False
    return True

def recursive_check(node:CombinationNode, depth:int, values:List[Union[float, int]]) -> None:
    """
    Here is where we preferentially ask for a small number of bigger groups
    """
    global best_depth
    global best_path
    # If there are more groups than the current best way to do it, stop
    if best_depth != -1 and depth > best_depth:
        return
    # If all the values of the combination are not within THRESHOLD_AVERAGING, stop
    if not verif_averaging_pct(node.combination, values):
        return
    # If we finished the list of combinations, and this way is the best, keep it, stop
    if len(node.children) == 0:
        if best_depth == -1 or depth < best_depth:
            best_depth = depth
            best_path = node.path()
        return
    # If we are still not finished (not every value has been used), continue
    for cnode in node.children:
        recursive_check(cnode, depth+1, values)

def groups_from_list(values:List[Union[float, int]]) -> List[List[Union[float, int]]]:
    """
    From a list of values, get the smallest list of groups of elements
    within THRESHOLD_AVERAGING of each other.
    It implies that we will try and recursively find the biggest group possible
    within the unused values (i.e. groups with combinations of size [3, 1] are preferred
    over [2, 2])
    """
    global best_depth
    global best_path
    groups:List[List[float]] = []
    # Generate all the combinations (I used binary for this)
    combination_dict:Dict[int, List[Combination]] = {}
    for i in range(1, 2**len(values)):
        combination = format(i, f"0{len(values)}b")  # Here is the binary conversion
        counter = 0
        for c in combination:
            if c == "1":
                counter += 1
        if counter not in combination_dict:
            combination_dict[counter] = []
        combination_dict[counter].append(Combination(combination))
    # Generate all the combinations of combinations that use all values (without using one twice)
    combination_trees:List[List[CombinationNode]] = []
    for key in combination_dict:
        for combination in combination_dict[key]:
            cn = CombinationNode(combination)
            cn.populate(combination_dict)
            combination_trees.append(cn)
    best_depth = -1
    best_path = None
    for root in combination_trees:
        recursive_check(root, 0, values)
    # print(",".join([combination.combination.combination for combination in best_path]))
    for combination in best_path:
        temp = []
        for c, v in zip(combination.combination.combination, values):
            if c == "1":
                temp.append(v)
        groups.append(temp)
    return groups

def averages_from_groups(gs:List[List[Union[float, int]]]) -> List[float]:
    """Computing the averages of each group"""
    avgs:List[float] = []
    for group in gs:
        avg = 0
        for elt in group:
            avg += elt
        avg /= len(group)
        avgs.append(avg)
    return avgs

def end_check(ds:List[pd.DataFrame], ids:List[int]) -> bool:
    """Check if we finished consuming all the dataframes"""
    for d, i in zip(ds, ids):
        if i < len(d[0]):
            return False
    return True

def search(group:List[Union[float, int]], values_list:List[Union[float, int]]) -> List[int]:
    """Obtain all the indices corresponding to a set of values"""
    # We will get all the indices in values_list of the values in group
    # If a value is present in group, all the occurrences of this value will be too,
    # so we can use a set and search every occurrence of each value.
    indices:List[int] = []
    group_set = set(group)
    for value in group_set:
        for i, v in enumerate(values_list):
            if value == v:
                indices.append(i)
    return indices

def threshold_grouper(total_list:List[Union[float, int]]) -> pd.DataFrame:
    """Building a 2D pd.DataFrame with the averages (x) and the outliers (y)"""
    result_list:List[List[Union[float, int]]] = [[total_list[0]]]
    result_index = 0
    total_index = 1
    while total_index < len(total_list):
        # Only checking if the bigger one is within THRESHOLD_OUTLIER of the little one.
        # If it is the case, the opposite is true too.
        # If yes, it is an outlier
        if result_list[result_index][0]*(1+THRESHOLD_OUTLIER) >= total_list[total_index]:
            result_list[result_index].append(total_list[total_index])
        # Else it is a new peak
        else:
            result_list.append([total_list[total_index]])
            result_index += 1
        total_index += 1
    result:pd.DataFrame = pd.DataFrame(result_list)
    return result

def dataframes_merger(dataframes:List[pd.DataFrame]) -> pd.DataFrame:
    """Merging the dataframes, with THRESHOLDS"""
    # Store the averages for the within-10% cells, in ascending order
    result = []
    # Keep tabs on where we are regarding each dataframe (needed for when we skip cells)
    curr_indices:List[int] = [0 for _ in range(len(dataframes))]
    # Repeat until all the cells in every dataframe have been seen once
    while not end_check(dataframes, curr_indices):
        # Get the values of the current indices in the dataframes
        curr_values = [dataframe[0][i] for dataframe, i in zip(dataframes, curr_indices)]
        # Get the largest 10% groups from the current list of values
        groups = groups_from_list(curr_values)
        # Compute the average of these groups
        avgs = averages_from_groups(groups)
        # Obtain the minimum average...
        avg_min = min(avgs)
        # ... and its index
        avg_min_index = avgs.index(avg_min)
        # Then get the group corresponding to the minimum average
        avg_min_group = groups[avg_min_index]
        # Get the indices of the values included in this group
        indices_to_increment = search(avg_min_group, curr_values)
        # Add the average to the result merged list
        result.append(avg_min)
        # For every element in the average we added, increment the corresponding index
        for index in indices_to_increment:
            curr_indices[index] += 1
    # Re-assemble the dataframe, taking the threshold% around the average into account
    result = threshold_grouper(result)
    print(result)
df1 = pd.DataFrame([1, 1487, 5144, 35293, 78486, 100000])
df2 = pd.DataFrame([1, 1500, 5144, 36278, 45968, 100000])
df3 = pd.DataFrame([1, 5286, 35785, 41205, 100000])
dataframes_merger([df3, df2, df1])

Quicksort function does not produce the expected output, though partition function does (python)

I have to implement a quicksort algorithm on a "slice" object for a school project; the slice object is a dictionary with:
a 'data' field (the whole numpy array to sort)
a 'left' and 'right' field (the indexes representing the subpart of the slice in the array)
The partition function code goes as follows:
def partition (s, cmp, piv=0):
    """
    Creates two slices from *s* by selecting in the first slice all
    elements being less than the pivot and in the second one all other
    elements.

    :param s: A slice, that is a dictionary with 3 fields:
        - data: the array of objects,
        - left: left bound of the slice (a position in the array),
        - right: right bound of the slice.
    :type s: dict
    :param cmp: A comparison function, returning 0 if a == b, -1 if a < b, 1 if a > b
    :type cmp: function
    :return: A couple of slices, the first slice contains all elements that are
        less than the pivot, the second one contains all elements that are
        greater than the pivot, the pivot does not belong to any slice.
    :rtype: tuple

    >>> import generate
    >>> import element
    >>> import numpy
    >>> def cmp (x,y):
    ...     if x == y:
    ...         return 0
    ...     elif x < y:
    ...         return -1
    ...     else:
    ...         return 1
    >>> t = numpy.array([element.Element(i) for i in [4,2,3,1]])
    >>> p = {'left':0,'right':len(t)-1,'data':t}
    >>> pivot = 0
    >>> p1,p2 = partition(p,cmp,pivot)
    >>> print(p1['data'][p1['left']:p1['right']+1])
    [1 2 3 4]
    >>> print(p2['data'][p2['left']:p2['right']+1])
    []
    """
    # convenience variables to make the function easily readable.
    left = s['left']
    right = s['right']
    swap(s['data'], right, piv)
    j = left
    for i in range(left, right+1):
        if (cmp(s['data'][i], s['data'][right]) != 1):
            swap(s['data'], i, j)
            j += 1
    s1 = {'data': s['data'], 'left': left, 'right': j-1}
    s2 = {'data': s['data'], 'left': j+1, 'right': right}
    return s1, s2
The doctests for this function do not fail; however, when calling the quicksort recursive function, the partition function does not behave as expected. The code for the quicksort function is as follows:
def quicksort_slice (s, cmp):
    """
    A sorting function implementing the quicksort algorithm

    :param s: A slice of an array, that is a dictionary with 3 fields:
        data, left, right representing resp. an array of objects and left
        and right bounds of the slice.
    :type s: dict
    :param cmp: A comparison function, returning 0 if a == b, -1 if a < b, 1 if a > b
    :type cmp: function
    :return: Nothing

    >>> import generate
    >>> import element
    >>> import numpy
    >>> def cmp (x,y):
    ...     if x == y:
    ...         return 0
    ...     elif x < y:
    ...         return -1
    ...     else:
    ...         return 1
    >>> t = numpy.array([element.Element(i) for i in [4,5,1,2,3,8,9,6,7]])
    >>> p = {'left':0,'right':len(t)-1,'data':t}
    >>> quicksort_slice(p,cmp)
    >>> print(p['data'])
    [1 2 3 4 5 6 7 8 9]
    """
    if s['left'] < s['right']:
        s1, s2 = partition(s, cmp)
        quicksort_slice(s1, cmp)
        quicksort_slice(s2, cmp)
This is the output produced by the doctests for this function:
File "C:\Users\Marion\Documents\S2\ASD\TP\tp2_asd\src\sorting.py", line 148, in __main__.quicksort_slice
Failed example:
print(p['data'])
Expected:
[1 2 3 4 5 6 7 8 9]
Got:
[8 2 1 3 7 4 9 5 6]
I have tried to insert debug print statements to observe the recursive calls to the partition function and it seems it doesn't swap values properly but I don't know how to fix this. Any help would be much appreciated.
Edit: As pointed out by one of the comments below, I simply forgot to update the value of piv, which is not always 0 after the first call to the function. I also replaced:
for i in range(left, right+1):
    if (cmp(s['data'][i], s['data'][right]) != 1):
        swap(s['data'], i, j)
        j += 1
with
for i in range(left, right):
    if (cmp(s['data'][i], s['data'][right]) != 1):
        swap(s['data'], i, j)
        j += 1
swap(s['data'], right, j)
Tried to close this thread before and just realised that I needed an answer to do this. The solution was provided by jasonharper.
As pointed out by jasonharper I was "always using the element at index 0 as your pivot - even if 0 isn't within the slice".
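Putting both fixes together, here is a self-contained sketch of how the corrected partition and quicksort can look on a plain Python list: the dict-based slice structure is kept, the pivot index is always chosen inside the current slice, and swap is a minimal stand-in for the helper used above.

def swap(data, i, j):
    # exchange two elements in place
    data[i], data[j] = data[j], data[i]

def partition(s, cmp, piv):
    left, right = s['left'], s['right']
    swap(s['data'], right, piv)          # move the pivot out of the way, to the right end
    j = left
    for i in range(left, right):         # stop before the pivot itself
        if cmp(s['data'][i], s['data'][right]) != 1:
            swap(s['data'], i, j)
            j += 1
    swap(s['data'], right, j)            # put the pivot in its final position j
    return ({'data': s['data'], 'left': left, 'right': j - 1},
            {'data': s['data'], 'left': j + 1, 'right': right})

def quicksort_slice(s, cmp):
    if s['left'] < s['right']:
        s1, s2 = partition(s, cmp, s['left'])   # pivot index chosen inside the slice
        quicksort_slice(s1, cmp)
        quicksort_slice(s2, cmp)

def cmp(x, y):
    return 0 if x == y else (-1 if x < y else 1)

t = [4, 5, 1, 2, 3, 8, 9, 6, 7]
quicksort_slice({'data': t, 'left': 0, 'right': len(t) - 1}, cmp)
print(t)   # expected: [1, 2, 3, 4, 5, 6, 7, 8, 9]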

Pythonic way to assign range of number to bucket

I'm developing an A/B test framework using Django. I want to assign a variant number based on the bucket_id from the request's cookies.
bucket_id is set by the front end as an integer in the range 0-99.
So far, I have created a function named get_bucket_range:
def get_bucket_range(data):
    range_bucket = []
    first_val = 0
    next_val = 0
    for i, v in enumerate(data.split(",")):
        v = int(v)
        if i == 0:
            first_val = v
            range_bucket.append([0, first_val])
        elif i == 1:
            range_bucket.append([first_val, first_val + v])
            next_val = first_val + v
        else:
            range_bucket.append([next_val, next_val + v])
            next_val = next_val + v
    return range_bucket
The data input for get_bucket_range is a comma-delimited string in which each variant has its own weight, e.g. data = "25,25,50" means we have 3 variants, with the first variant's weight being 25, etc.
I then created a function to assign the variant:
def assign_variant(range_bucket, num):
    for i in range(len(range_bucket)):
        if num in range(range_bucket[i][0], range_bucket[i][1]):
            return i
This function takes 2 parameters: range_bucket (from the get_bucket_range function) and num (the bucket_id from cookies).
With this function I can return the variant id that a given bucket_id belongs to.
For example, if we have 25 as bucket_id with data = "25,25,50", our bucket_id should belong to variant id 1. Or, if we have 25 as bucket_id with data = "10,10,10,70", our bucket_id will belong to variant id 2.
However, it feels like neither of my functions is pythonic or optimised. Does anyone here have any suggestions as to how I could improve my code?
Your functions could look like this for example:
def get_bucket_range(data):
    last = 0
    range_bucket = []
    for v in map(int, data.split(',')):
        range_bucket.append([last, last+v])
        last += v
    return range_bucket

def assign_variant(range_bucket, num):
    for i, (low, high) in enumerate(range_bucket):
        if low <= num < high:
            return i
You can greatly reduce the lengths of your functions with the itertools.accumulate and bisect.bisect functions. The first function accumulates all the weights into sums (10,10,10,70 becomes 10,20,30,100), and the second function gives you the index of where that element would belong, which in your case is equivalent to the index of the group it belongs to.
from itertools import accumulate
from bisect import bisect

def get_bucket_range(data):
    return list(accumulate(map(int, data.split(','))))

def assign_variant(range_bucket, num):
    return bisect(range_bucket, num)
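A quick usage sketch with the example weights from the question (both versions of the functions return the same variant ids; the printed ranges shown here are for the cumulative version):

ranges = get_bucket_range("10,10,10,70")
print(ranges)                      # cumulative version: [10, 20, 30, 100]
print(assign_variant(ranges, 25))  # bucket_id 25 falls in the third bucket -> variant id 2
print(assign_variant(get_bucket_range("25,25,50"), 25))  # -> variant id 1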
