Rank aggregation: Merge local subrankings into global ranking - python

I have a dataset of multiple local store rankings that I'm looking to aggregate / combine into one national ranking, programmatically. I know that the local rankings are by sales volume, but I am not given the sales volume so must use the relative rankings to create as accurate a national ranking as possible.
As a short example, let's say that we have 3 local ranking lists, from best ranking (1st) to worst ranking (last), that represent different geographic boundaries that can overlap with one another.
ranking_1 = ['J','A','Z','B','C']
ranking_2 = ['A','H','K','B']
ranking_3 = ['Q','O','A','N','K']
We know that J or Q is the highest ranked store, as both are highest in ranking_1 and ranking_3, respectively, and they appear above A, which is the highest in ranking_2. We know that O is next, as it's above A in ranking_3. A comes next, and so on...
If I did this correctly on paper, the output of this short example would be:
global_ranking = [('J',1.5),('Q',1.5),('O',3),('A',4),('H',6),('N',6),('Z',6),('K',8),('B',9),('C',10)]
Note that when we don't have enough data to determine which of two stores is ranked higher, we consider it a tie (i.e. we know that one of J or Q is the highest ranked store, but don't know which is higher, so we put them both at 1.5). In the actual dataset, there are 100+ lists of 1000+ items in each.
I've had fun trying to figure out this problem and am curious if anyone has any smart approaches to it.

A modified merge sort will help here. The modification should take incomparable stores into account and thus build groups of incomparable elements that you are willing to treat as equal (like Q and J).

This method analyzes the stores at the front of each ranking. If a front-runner does not appear below first position in any other ranking list, it belongs at the current level and is added to a 'level' list. The stores at that level are then removed from the fronts of the lists, which exposes new front-runners. Repeat the process until there are no stores left. Note that the function empties the input lists as it goes.
def rank_stores(rankings):
    """
    Rank stores from rankings by sales volume, with overlap between lists.
    :param rankings: list of rankings, each itself a list of stores.
    :return: Ordered list with sets of items at same rankings.
    """
    rank_global = []
    # Evaluate all stores in the number one position, if they are not below
    # number one somewhere else, then they belong at this level.
    # Then remove them from the front of the list, and repeat.
    while sum([len(x) for x in rankings]) > 0:
        tops = []
        # Find out which of the number one stores are not in a lower position
        # somewhere else.
        for rank in rankings:
            if not rank:
                continue
            else:
                top = rank[0]
                add = True
            for rank_test in rankings:
                if not rank_test:
                    continue
                elif not rank_test[1:]:
                    continue
                elif top in rank_test[1:]:
                    add = False
                    break
                else:
                    continue
            if add:
                tops.append(top)
        # Now add tops to total rankings list,
        # then go through the rankings and pop the top if in tops.
        rank_global.append(set(tops))
        # Remove the stores that just made it to the top.
        for rank in rankings:
            if not rank:
                continue
            elif rank[0] in tops:
                rank.pop(0)
            else:
                continue
    return rank_global
For the rankings provided:
ranking_1 = ['J','A','Z','B','C']
ranking_2 = ['A','H','K','B']
ranking_3 = ['Q','O','A','N','K']
rankings = [ranking_1, ranking_2, ranking_3]
Then calling the function:
rank_stores(rankings)
Results in:
[{'J', 'Q'}, {'O'}, {'A'}, {'H', 'N', 'Z'}, {'K'}, {'B'}, {'C'}]
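The question asks for tied stores to share an averaged rank (J and Q both at 1.5). A minimal sketch of that conversion, assuming the list-of-sets output shown above; levels_to_ranks is a hypothetical helper name, not part of the answer's code:

```python
def levels_to_ranks(levels):
    """Convert ordered tie groups into (store, average_rank) pairs."""
    ranked = []
    position = 1  # next unused 1-based rank
    for level in levels:
        size = len(level)
        # Tied stores share the mean of the positions they jointly occupy.
        average = position + (size - 1) / 2
        # Sort within a level only to make the output order deterministic.
        ranked.extend((store, average) for store in sorted(level))
        position += size
    return ranked


levels = [{'J', 'Q'}, {'O'}, {'A'}, {'H', 'N', 'Z'}, {'K'}, {'B'}, {'C'}]
print(levels_to_ranks(levels))
```

For the three example rankings this gives J and Q a rank of 1.5 and H, N and Z a rank of 6, which agrees with the ranking worked out by hand in the question.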
In some circumstances there may not be enough information to determine definite rankings. Suppose the true order is:
['Z', 'A', 'B', 'J', 'K', 'F', 'L', 'E', 'W', 'X', 'Y', 'R', 'C']
We can derive the following rankings:
a = ['Z', 'A', 'B', 'F', 'E', 'Y']
b = ['Z', 'J', 'K', 'L', 'X', 'R']
c = ['F', 'E', 'W', 'Y', 'C']
d = ['J', 'K', 'E', 'W', 'X']
e = ['K', 'F', 'W', 'R', 'C']
f = ['X', 'Y', 'R', 'C']
g = ['Z', 'F', 'W', 'X', 'Y', 'R', 'C']
h = ['Z', 'A', 'E', 'W', 'C']
i = ['L', 'E', 'Y', 'R', 'C']
j = ['L', 'E', 'W', 'R']
k = ['Z', 'B', 'K', 'L', 'W', 'Y', 'R']
rankings = [a, b, c, d, e, f, g, h, i, j, k]
Calling the function:
rank_stores(rankings)
results in:
[{'Z'},
{'A', 'J'},
{'B'},
{'K'},
{'F', 'L'},
{'E'},
{'W'},
{'X'},
{'Y'},
{'R'},
{'C'}]
In this scenario there is not enough information to determine where 'J' should be relative to 'A' and 'B', only that it lies somewhere between 'Z' and 'K'.
When multiplied among hundreds of rankings and stores, some of the stores will not be properly ranked on an absolute volume basis.
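To see which pairs of stores the data can actually order, one option is to treat every adjacent pair in a ranking as a "ranked above" edge and test reachability. This is a sketch only; build_graph, outranks and comparable are illustrative names, and it assumes the rankings do not contradict one another:

```python
from collections import defaultdict


def build_graph(rankings):
    """Map each store to the stores ranked directly below it in some list."""
    below = defaultdict(set)
    for ranking in rankings:
        for higher, lower in zip(ranking, ranking[1:]):
            below[higher].add(lower)
    return below


def outranks(below, a, b):
    """Return True if some chain of rankings places a above b."""
    stack, seen = [a], {a}
    while stack:
        store = stack.pop()
        for lower in below.get(store, ()):
            if lower == b:
                return True
            if lower not in seen:
                seen.add(lower)
                stack.append(lower)
    return False


def comparable(below, a, b):
    """Return True if the rankings determine the relative order of a and b."""
    return outranks(below, a, b) or outranks(below, b, a)
```

With the eleven lists a to k above, comparable(below, 'J', 'A') is False while comparable(below, 'Z', 'K') is True, which matches the tie between 'A' and 'J' in the output.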


Simulating the sample space for a probability problem in Python

I am interested in simulating the sample space for the following question on a probability assignment:
A man will carve pumpkins for his two daughters and three sons. His wife will bring each kid’s pumpkins in a completely random order. The man has decided that as soon as he has carved pumpkins for two of his sons, he would ask his wife to carve the remaining pumpkins. Let W denote the number of pumpkins he will carve.
So the resulting sample space of W would look something like this:
sample_space=[['S','S'],
['S','D','S'],
['S','D','D','S'],
['D','S','S'],
['D','S','D','S'],
['D','D','S','S']]
I was thinking about having two lists, one of sons, one of daughters:
son_list1=['S','S','S']
daughter_list1=['D','D']
And then combining them in every possible order:
result_list1=[['S','S','S','D','D'],
['S','S','D','S','D'],
['S','S','D','D','S'],
['S','D','S','S','D'],
['S','D','S','D','S'],
['S','D','D','S','S'],
['D','S','S','S','D'],
['D','S','S','D','S'],
['D','S','D','S','S'],
['D','D','S','S','S']]
I don't know if numbering each son and each daughter and then combining them would be easier where we have:
son_list2=['S1','S2','S3']
daughter_list2=['D1','D2']
where this resulting list would be something like:
result_list2=[['S1','S2','S3','D1','D2'],
['S1','S3','S2','D1','D2'],
['S2','S1','S3','D1','D2'],
['S2','S3','S1','D1','D2'],
['S3','S1','S2','D1','D2'],
['S3','S2','S1','D1','D2'],
...
['D2','D1','S3','S2','S1']]
But if this method would be easier, I could just get rid of the numbers after result_list2 was generated and then delete the repeats.
Anyway, after I get the resulting list in the form of result_list1, I could create a "son counter" and then go through each list and then stop when the "son counter" reaches 2 and then from there delete the repeats to get the sample_space list.
Is there any better logic?
To solve this problem, I think the best solution would be to get all of the permutations of the order in which he carves each pumpkin.
I just used the following code, for getting all permutations of a set, from GeeksforGeeks. I just changed some of the variable names to make it more clear.
def permutation(passed_list):
    # If passed_list is empty then there are no permutations
    if len(passed_list) == 0:
        return []
    # If there is only one element in passed_list then only
    # one permutation is possible
    if len(passed_list) == 1:
        return [passed_list]
    # Find the permutations for passed_list if there are
    # more than 1 characters
    perm_list = []  # empty list that will store the permutations found so far
    # Iterate the input (passed_list) and calculate the permutation
    for i in range(len(passed_list)):
        item = passed_list[i]
        # Extract passed_list[i] or item from the list. remaining_list is
        # remaining list
        remaining_list = passed_list[:i] + passed_list[i + 1:]
        # Generating all permutations where item is first
        # element
        for p in permutation(remaining_list):
            perm_list.append([item] + p)
    return perm_list
Then you can just iterate through all of the permutations, keeping track of the order as you go. Once you get to two sons, you stop going through that iteration, add that order to your sample space, and then go to the next permutation.
if __name__ == '__main__':
    # Set of all children. It doesn't matter what order this list is in
    children = ['S', 'S', 'S', 'D', 'D']
    # perms is the list of all permutations of children list
    perms = permutation(children)
    # This list will hold the resulting sample space you are looking for
    total_set = []
    # For each permutation
    for perm in perms:
        order = []  # Contains the order of whose pumpkin he carves
        son_counter = 0
        for child in perm:
            # Compare with ==, not is: identity checks on strings and
            # ints are not reliable
            if child == 'S':
                son_counter += 1
            # Update the order
            order.append(child)
            if son_counter == 2:
                # To keep from adding duplicate orders
                if order not in total_set:
                    total_set.append(order)
                # Reset the following two variables for the next iteration
                order = []
                son_counter = 0
                break
    print(total_set)
This gave me the following output:
[['S', 'S'], ['S', 'D', 'S'], ['S', 'D', 'D', 'S'], ['D', 'S', 'S'], ['D', 'S', 'D', 'S'], ['D', 'D', 'S', 'S']]
I believe this is the answer you are looking for.
Let me know if you have any questions!
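The standard library's itertools.permutations can stand in for the hand-written permutation function, and a set removes duplicates without scanning a list. A sketch under the same stopping rule; sample_space and its parameters are illustrative names, not part of the answer above:

```python
from itertools import permutations


def sample_space(children, target='S', stop_after=2):
    """Collect the distinct carving orders, cut off at the stop_after-th target."""
    outcomes = set()
    for perm in permutations(children):
        seen = 0
        for position, child in enumerate(perm, start=1):
            if child == target:
                seen += 1
            if seen == stop_after:
                # Keep only the pumpkins carved up to and including this one.
                outcomes.add(perm[:position])
                break
    return sorted(outcomes)


print(sample_space(['S', 'S', 'S', 'D', 'D']))
```

The outcomes come back as tuples in lexicographic order rather than in the order shown in the question, but they are the same six sequences.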
You could use dynamic programming to build up the sample space from the bottom up. For example,
def create_samples(n_sons, n_daughters):
    if n_sons == 0:
        # stop carving
        yield []
    elif n_daughters == 0:
        # must carve n_sons more pumpkins
        yield ['S'] * n_sons
    else:
        # choose to carve for a son
        for sample in create_samples(n_sons - 1, n_daughters):
            yield ['S'] + sample
        # choose to carve for a daughter
        for sample in create_samples(n_sons, n_daughters - 1):
            yield ['D'] + sample

samples = list(create_samples(2, 2))
# [['S', 'S'],
#  ['S', 'D', 'S'],
#  ['S', 'D', 'D', 'S'],
#  ['D', 'S', 'S'],
#  ['D', 'S', 'D', 'S'],
#  ['D', 'D', 'S', 'S']]
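The function create_samples(n_sons, n_daughters) returns all samples that meet your condition, under the assumption that n_sons and n_daughters remain to be processed. Here n_sons is the number of sons' pumpkins still to be carved before he stops, which is why the call is create_samples(2, 2) rather than create_samples(3, 2).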

Unique elements of sublists depending on specific value in sublist

I am trying to select unique datasets from a very large, quite inconsistent list.
My dataset RawData consists of lists of strings of different lengths.
Some items occur many times, for example: ['a','b','x','15/30']
The key used to compare items is always the last string, for example '15/30'.
The goal is to get a list UniqueData in which each key occurs only once, keeping the first item for each key. (I want to keep the order.)
Dataset:
RawData = [['a','b','x','15/30'],['d','e','f','g','h','20/30'],['w','x','y','z','10/10'],['a','x','c','15/30'],['i','j','k','l','m','n','o','p','20/60'],['x','b','c','15/30']]
My desired solution Dataset:
UniqueData = [['a','b','x','15/30'],['d','e','f','g','h','20/30'],['w','x','y','z','10/10'],['i','j','k','l','m','n','o','p','20/60']]
I tried many possible solutions for instance:
for index, elem in enumerate(RawData): and appending to a new list if.....
for element in list does not work, because the items are not exactly the same.
Can you help me find a solution to my problem?
Thanks!
The best way to remove duplicates is to track them in a set. Add the last element of each sublist to a set to keep track of all the unique keys. If the key is already in the set unique, do nothing; if not, add it to unique and append the sublist to the result list, here called new.
Try this.
new = []
unique = set()
for lst in RawData:
    if lst[-1] not in unique:
        unique.add(lst[-1])
        new.append(lst)
print(new)
# [['a', 'b', 'x', '15/30'],
#  ['d', 'e', 'f', 'g', 'h', '20/30'],
#  ['w', 'x', 'y', 'z', '10/10'],
#  ['i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', '20/60']]
You could set up one list for the unique data and another to track the keys you have seen so far. Then, as you loop through the data, if you have not seen the last element of a sublist before, append the sublist to the unique data and add its last element to the seen list.
RawData = [['a', 'b', 'x', '15/30'], ['d', 'e', 'f', 'g', 'h', '20/30'], ['w', 'x', 'y', 'z', '10/10'],
           ['a', 'x', 'c', '15/30'], ['i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', '20/60'], ['x', 'b', 'c', '15/30']]
seen = []
UniqueData = []
for data in RawData:
    if data[-1] not in seen:
        UniqueData.append(data)
        seen.append(data[-1])
print(UniqueData)
OUTPUT
[['a', 'b', 'x', '15/30'], ['d', 'e', 'f', 'g', 'h', '20/30'], ['w', 'x', 'y', 'z', '10/10'], ['i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', '20/60']]
RawData = [['a','b','x','15/30'],['d','e','f','g','h','20/30'],['w','x','y','z','10/10'],['a','x','c','15/30'],['i','j','k','l','m','n','o','p','20/60'],['x','b','c','15/30']]
seen = []
seen_indices = []
for _,i in enumerate(RawData):
    # _ -> index
    # i -> individual lists
    if i[-1] not in seen:
        seen.append(i[-1])
    else:
        seen_indices.append(_)
# Delete from the end so earlier indices stay valid
for index in sorted(seen_indices, reverse=True):
    del RawData[index]
print (RawData)
Using a set to filter out entries for which the key has already been seen is the most efficient way to go.
Here's a one liner example using a list comprehension with internal side effects:
UniqueData = [rd for seen in [set()] for rd in RawData if not(rd[-1] in seen or seen.add(rd[-1])) ]
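Another way to avoid the side effect inside the comprehension is to lean on the insertion order that dicts preserve in Python 3.7+. This is a sketch of that alternative, with unique_by_last as a hypothetical helper name; setdefault keeps only the first sublist stored for each key:

```python
def unique_by_last(rows):
    """Keep the first row seen for each distinct last element, in order."""
    first_seen = {}
    for row in rows:
        # setdefault stores row only if its key is not already present.
        first_seen.setdefault(row[-1], row)
    return list(first_seen.values())


RawData = [['a', 'b', 'x', '15/30'], ['d', 'e', 'f', 'g', 'h', '20/30'],
           ['w', 'x', 'y', 'z', '10/10'], ['a', 'x', 'c', '15/30'],
           ['i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', '20/60'],
           ['x', 'b', 'c', '15/30']]
print(unique_by_last(RawData))
```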

Solving a "colored Quxes" coding challenge with recursion

I am trying to solve some of the coding challenges that I find online, but I got stuck on the problem below. I tried to solve it using recursion, but I feel I am missing a very important concept in recursion. My code works for all of the examples below except the last one, where it breaks down.
Can someone point out the mistake that I made in this recursive code? Or maybe guide me through solving the issue?
I know why my code breaks, but I don't know how to get around the "pass by object reference" behaviour in Python, which I think is creating the bigger problem for me.
The coding question is:
On a mysterious island there are creatures known as Quxes which come in three colors: red, green, and blue. One power of the Qux is that if two of them are standing next to each other, they can transform into a single creature of the third color.
Given N Quxes standing in a line, determine the smallest number of them remaining after any possible sequence of such transformations.
For example, given the input ['R', 'G', 'B', 'G', 'B'], it is possible to end up with a single Qux through the following steps:
Arrangement | Change
----------------------------------------
['R', 'G', 'B', 'G', 'B'] | (R, G) -> B
['B', 'B', 'G', 'B'] | (B, G) -> R
['B', 'R', 'B'] | (R, B) -> G
['B', 'G'] | (B, G) -> R
['R'] |
________________________________________
My code is:
class fusionCreatures(object):
    """Regular Numbers Gen.
    """
    def __init__(self , value=[]):
        self.value = value
        self.ans = len(self.value)
    def fusion(self, fus_arr, i):
        color = ['R','G','B']
        color.remove(fus_arr[i])
        color.remove(fus_arr[i+1])
        fus_arr.pop(i)
        fus_arr.pop(i)
        fus_arr.insert(i, color[0])
        return fus_arr
    def fusionCreatures1(self, arr=None):
        # this method is to find the smallest number of creature in a row after fusion
        if arr == None:
            arr = self.value
        for i in range (0,len(arr)-1):
            #print(arr)
            if len(arr) == 2 and i >= 1 or len(arr)<2:
                break
            if arr[i] != arr[i+ 1]:
                arr1 = self.fusion(arr, i)
                testlen = self.fusionCreatures1(arr)
                if len(arr) < self.ans:
                    self.ans = len(arr)
        return self.ans
Testing array (all of them work except the last one):
t1 = fusionCreatures(['R','G','B','G','B'])
t2 = fusionCreatures(['R','G','B','R','G','B'])
t3 = fusionCreatures(['R','R','G','B','G','B'])
t4 = fusionCreatures(['G','R','B','R','G'])
t5 = fusionCreatures(['G','R','B','R','G','R','G'])
t6 = fusionCreatures(['R','R','R','R','R'])
t7 = fusionCreatures(['R', 'R', 'R', 'G', 'G', 'G', 'B', 'B', 'B'])
print(t1.fusionCreatures1())
print(t2.fusionCreatures1())
print(t3.fusionCreatures1())
print(t4.fusionCreatures1())
print(t5.fusionCreatures1())
print(t6.fusionCreatures1())
print(t7.fusionCreatures1())
I'll start by mentioning that there is a deductive approach that works in O(n) and is detailed in this blog post. It boils down to checking the parity of the counts of the three types of elements in the list to determine which of a few fixed outcomes occurs.
You mention that you'd prefer to use a recursive approach, which is O(n!). This is a good start because it can be used as a tool for helping arrive at the O(n) solution and is a common recursive pattern to be familiar with.
Because we can't know whether a given fusion between two Quxes will ultimately lead to an optimal global solution we're forced to try every possibility. We do this by walking over the list and looking for potential fusions. When we find one, perform the transformation in a new list and call fuse_quxes on it. Along the way, we keep track of the smallest length achieved.
Here's one approach:
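from itertools import permutations

def fuse_quxes(quxes, choices="RGB"):
    # Map each ordered pair of distinct colors to a one-element list
    # holding the third color, e.g. ('R', 'G') -> ['B']
    fusion = {x[:-1]: [x[-1]] for x in permutations(choices)}
    def walk(quxes):
        best = len(quxes)
        for i in range(1, len(quxes)):
            if quxes[i-1] != quxes[i]:
                # Build a new list with the pair replaced by its fusion
                sub = quxes[:i-1] + fusion[quxes[i-1], quxes[i]] + quxes[i+1:]
                best = min(walk(sub), best)
        return best
    return walk(quxes)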
This is pretty much the direction your provided code is moving towards, but the implementation seems unclear. Unfortunately, I don't see any single or quick fix. Here are a few general issues:
Putting the fusionCreatures1 function into a class allows it to mutate external state, namely self.value and self.ans. self.value in particular is poorly named and difficult to keep track of. It seems like the intent is to use it as a reference copy to reset arr to its default value, but arr = self.value means that when fus_arr is mutated in fusion(), self.value is as well. Everything is pretty much a reference to one underlying list.
Adding slices to these copies at least makes the program easier to reason about, for example, arr = self.value[:] and fus_arr = fus_arr[:] in the fusion() function. In short, try to write pure functions.
self.ans is also unclear and unnecessary; better to keep the result value relegated to a local variable within the recursive function.
It seems unnecessary to put a stateless function into a class unless it's a purely static method and the class is acting as a namespace.
Another cause of cognitive overload is branching statements like if and break. We want to minimize the frequency and nesting of these. Here is fusionCreatures1 in pseudocode, with annotations for mutations and complex interactions:
def fusionCreatures1():
    if ...
        read mutated global state
    for i in len(arr):
        if complex length and index checks:
            break
        if arr[i] != arr[i+ 1]:
            impure_func_that_changes_arr_length(arr)
            recurse()
            if new best compared to global state:
                mutate global state
You'll probably agree that it's pretty difficult to mentally step through a run of this function.
In fusionCreatures1(), two variables are unused:
arr1 = self.fusion(arr, i)
testlen = self.fusionCreatures1(arr)
The assignment arr1 = self.fusion(arr, i) (along with the return fus_arr) seems to indicate a lack of understanding that self.fusion is really an in-place function that mutates its argument array. So calling it means arr1 is arr and we have another aliased variable to reason about.
Beyond this, neither arr1 nor testlen is used in the program, so the intent is unclear.
A good linter will pick up these unused variables and identify most of the other complexity issues I've mentioned.
Mutating a list while looping over it is usually disastrous. self.fusion(arr, i) mutates arr inside a loop, making it very difficult to reason about its length and causing an index error when the range(len(arr)) no longer matches the actual len(arr) in the function body (or at least necessitating an in-body precondition). Making self.fusion(arr, i) pure using a slice, as mentioned above, fixes this problem but reveals that there is no recursive base case, resulting in a stack overflow error.
Avoid variable names like arr, arr1, value unless the context is obvious. Again, these obfuscate intent and make the program difficult to understand.
Some minor style suggestions:
Use snake_case per PEP-8. Class names should be TitleCased to differentiate them from functions. No need to inherit from object--that's implicit.
Use consistent spacing around functions and operators: range (0,len(arr)-1): is clearer as range(len(arr) - 1):, for example. Use vertical whitespace around blocks.
Use lists instead of typing out t1, t2, ... t7.
Function names should be verbs, not nouns. A class like fusionCreatures with a method called fusionCreatures1 is unclear. Something like QuxesSolver.minimize(creatures) makes the intent a bit more obvious.
As for the solution I provided above, there are other tricks worth considering to speed it up. One is memoization, which can help avoid duplicate work (any given list will always produce the same minimized length, so we just store this computation in a dict and spit it back out if we ever see it again). If we hit a length of 1, that's the best we can do globally, so we can skip the rest of the search.
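A minimal sketch of those two speed-ups, assuming functools.lru_cache as the memo store (the text above suggests a plain dict; the cache decorator is swapped in here for brevity). The name fuse_quxes_memo is hypothetical, and the line is converted to a tuple because cached arguments must be hashable:

```python
from functools import lru_cache


def fuse_quxes_memo(quxes, choices="RGB"):
    """Smallest reachable line length, with memoization and early exit."""
    all_colors = set(choices)

    @lru_cache(maxsize=None)
    def walk(line):
        best = len(line)
        for i in range(1, len(line)):
            if line[i - 1] != line[i]:
                # The fused creature is the one color absent from the pair.
                (third,) = all_colors - {line[i - 1], line[i]}
                best = min(best, walk(line[:i - 1] + (third,) + line[i + 1:]))
                if best == 1:
                    # A single Qux cannot be improved on; stop searching.
                    break
        return best

    return walk(tuple(quxes))


print(fuse_quxes_memo(['R', 'G', 'B', 'G', 'B']))
```

Because every recursive call receives a fresh tuple, nothing is mutated in place, which also sidesteps the aliasing problems described earlier.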
Here's a full runner, including the linear solution translated to Python (again, defer to the blog post to read about how it works):
from collections import defaultdict
from itertools import permutations
from random import choice, randint

def fuse_quxes_linear(quxes, choices="RGB"):
    counts = defaultdict(int)
    for e in quxes:
        counts[e] += 1
    if not quxes or any(x == len(quxes) for x in counts.values()):
        return len(quxes)
    elif len(set(counts[x] % 2 for x in choices)) == 1:
        return 2
    return 1

def fuse_quxes(quxes, choices="RGB"):
    fusion = {x[:-1]: [x[-1]] for x in permutations(choices)}
    def walk(quxes):
        best = len(quxes)
        for i in range(1, len(quxes)):
            if quxes[i-1] != quxes[i]:
                sub = quxes[:i-1] + fusion[quxes[i-1], quxes[i]] + quxes[i+1:]
                best = min(walk(sub), best)
        return best
    return walk(quxes)

if __name__ == "__main__":
    tests = [
        ['R','G','B','G','B'],
        ['R','G','B','R','G','B'],
        ['R','R','G','B','G','B'],
        ['G','R','B','R','G'],
        ['G','R','B','R','G','R','G'],
        ['R','R','R','R','R'],
        ['R', 'R', 'R', 'G', 'G', 'G', 'B', 'B', 'B']
    ]
    for test in tests:
        print(test, "=>", fuse_quxes(test))
        assert fuse_quxes_linear(test) == fuse_quxes(test)
    for i in range(100):
        test = [choice("RGB") for x in range(randint(0, 10))]
        assert fuse_quxes_linear(test) == fuse_quxes(test)
Output:
['R', 'G', 'B', 'G', 'B'] => 1
['R', 'G', 'B', 'R', 'G', 'B'] => 2
['R', 'R', 'G', 'B', 'G', 'B'] => 2
['G', 'R', 'B', 'R', 'G'] => 1
['G', 'R', 'B', 'R', 'G', 'R', 'G'] => 2
['R', 'R', 'R', 'R', 'R'] => 5
['R', 'R', 'R', 'G', 'G', 'G', 'B', 'B', 'B'] => 2
Here is my suggestion.
First, instead of "R", "G" and "B" I use integer values 0, 1, and 2. This allows nice and easy fusion between a and b, as long as they are different, by simply doing 3 - a - b.
Then my recursion code is:
def fuse_quxes(l):
    n = len(l)
    for i in range(n - 1):
        if l[i] == l[i + 1]:
            continue
        else:
            newn = fuse_quxes(l[:i] + [3 - l[i] - l[i + 1]] + l[i+2:])
            if newn < n:
                n = newn
    return n
Run this with
IN[5]: fuse_quxes([0, 0, 0, 1, 1, 1, 2, 2, 2])
Out[5]: 2
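To feed the letter-based test cases into this integer version, the colors need to be mapped to 0, 1 and 2 first. A small sketch; COLOR_CODES, encode and fuse are illustrative names, and the particular assignment of letters to numbers is an assumption (any one-to-one mapping works):

```python
# Assumed mapping; any bijection onto 0, 1, 2 would do.
COLOR_CODES = {'R': 0, 'G': 1, 'B': 2}


def encode(quxes):
    """Translate a list of color letters into integer codes."""
    return [COLOR_CODES[q] for q in quxes]


def fuse(a, b):
    """Return the third color code for two distinct codes a and b."""
    # 0 + 1 + 2 == 3, so subtracting two distinct codes leaves the third.
    return 3 - a - b


print(encode(['R', 'R', 'R', 'G', 'G', 'G', 'B', 'B', 'B']))
print(fuse(0, 1), fuse(0, 2), fuse(1, 2))
```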
Here is my attempt at the problem.
The explanation is in the comments.
inputs = [['R','G','B','G','B'],
          ['R','G','B','R','G','B'],
          ['R','R','G','B','G','B'],
          ['G','R','B','R','G'],
          ['G','R','B','R','G','R','G'],
          ['R','R','R','R','R'],
          ['R', 'R', 'R', 'G', 'G', 'G', 'B', 'B', 'B'],]

def fuse_quxes(inp):
    RGB_set = {"R", "G", "B"}
    merge_index = -1
    ## pair each qux with the next in line and loop through all pairs
    for i, (q1, q2) in enumerate(zip(inp[:-1], inp[1:])):
        merged = RGB_set-{q1,q2}
        ## If more than one item remains in merged after removing q1 and q2, the quxes can't fuse
        if(len(merged))==1:
            merged = merged.pop()
            merge_index=i
            merged_color = merged
            ## loop through the pairs until the result of a fuse differs from the qux on either
            ## the right or the left side
            if (i>0 and merged!=inp[i-1]) or ((i+2)<len(inp) and merged!=inp[i+2]):
                break
    print(inp)
    ## merge the two quxes whose result differs from its right or left neighbour; otherwise do any
    ## possible merge
    if merge_index>=0:
        del inp[merge_index]
        inp[merge_index] = merged_color
        return fuse_quxes(inp)
    else:
        ## if no merge can be made, end the recursion
        print("Result", len(inp))
        print("_______________________")
        return len(inp)

[fuse_quxes(inp) for inp in inputs]
output
['R', 'G', 'B', 'G', 'B']
['R', 'R', 'G', 'B']
['R', 'B', 'B']
['G', 'B']
['R']
Result 1
_______________________
['R', 'G', 'B', 'R', 'G', 'B']
['R', 'G', 'B', 'R', 'R']
['R', 'G', 'G', 'R']
['B', 'G', 'R']
['B', 'B']
Result 2
_______________________
['R', 'R', 'G', 'B', 'G', 'B']
['R', 'B', 'B', 'G', 'B']
['G', 'B', 'G', 'B']
['R', 'G', 'B']
['R', 'R']
Result 2
_______________________
['G', 'R', 'B', 'R', 'G']
['G', 'G', 'R', 'G']
['G', 'B', 'G']
['R', 'G']
['B']
Result 1
_______________________
['G', 'R', 'B', 'R', 'G', 'R', 'G']
['G', 'G', 'R', 'G', 'R', 'G']
['G', 'B', 'G', 'R', 'G']
['R', 'G', 'R', 'G']
['B', 'R', 'G']
['B', 'B']
Result 2
_______________________
['R', 'R', 'R', 'R', 'R']
Result 5
_______________________
['R', 'R', 'R', 'G', 'G', 'G', 'B', 'B', 'B']
['R', 'R', 'B', 'G', 'G', 'B', 'B', 'B']
['R', 'G', 'G', 'G', 'B', 'B', 'B']
['B', 'G', 'G', 'B', 'B', 'B']
['R', 'G', 'B', 'B', 'B']
['R', 'R', 'B', 'B']
['R', 'G', 'B']
['R', 'R']
Result 2
_______________________
[1, 2, 2, 1, 2, 5, 2]

A Faster Way of Removing Unused Categories in Pandas?

I'm running some models in Python, with data subset on categories.
For memory usage, and preprocessing, all the categorical variables are stored as category data type.
For each level of a categorical variable in my 'group by' column, I am running a regression, where I need to reset all my categorical variables to those that are present in that subset.
I am currently doing this using .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. At the moment, the worst offender is my grouping column, others are not taking as much time (as I guess there are not as many levels to drop).
Here is a simplified example:
import itertools
import pandas as pd
#generate some fake data
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 2)]
z = pd.DataFrame({'x':keywords})
#convert to category datatype
z.x = z.x.astype('category')
#groupby
z = z.groupby('x')
#loop over groups
for i in z.groups:
    x = z.get_group(i)
    x.x = x.x.cat.remove_unused_categories()
    #run my fancy model here
On my laptop, this takes about 20 seconds. For this small example, we could convert to str and then back to category for a speed-up, but my real data has at least 300 lines per group.
Is it possible to speed up this loop? I have tried using x.x = x.x.cat.set_categories(i) which takes a similar time, and x.x.cat.categories = i, which asks for the same number of categories as I started with.
Your problem is that you are assigning the result of z.get_group(i) directly to x, so x is still tied to z. Your code will work fine if you take an explicit copy first:
for i in z.groups:
    x = z.get_group(i).copy() # will no longer be tied to z
    x.x = x.x.cat.remove_unused_categories()

Change the way sorted works in Python (different than alphanumeric)

I'm representing cards in poker as letters (lower and uppercase) in order to store them efficiently. I basically now need a custom sorting function to allow calculations with them.
What is the fastest way to sort letters in Python using
['a', 'n', 'A', 'N', 'b', 'o', ....., 'Z']
as the ranks rather than
['A', 'B', 'C', 'D', 'E', 'F', ....., 'z']
which is the default?
Note, this sorting is derived from:
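import string
# string.ascii_letters is lowercase then uppercase; string.letters exists only in Python 2
c = string.ascii_letters[:13]
d = string.ascii_letters[13:26]
h = string.ascii_letters[26:39]
s = string.ascii_letters[39:]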
'a' = 2 of clubs
'n' = 2 of diamonds
'A' = 2 of hearts
'N' = 2 of spades
etc
You can provide a key function to sorted. This function is called for each element in the iterable, and its return value is used for sorting instead of the element's own value.
In this case it might look something like the following:
order = ['a', 'n', 'A', 'N', 'b', 'o', ....., 'Z']
sorted_list = sorted(some_list, key=order.index)
Here is a brief example to illustrate this:
>>> order = ['a', 'n', 'A', 'N']
>>> sorted(['A', 'n', 'N', 'a'], key=order.index)
['a', 'n', 'A', 'N']
Note that to make this more efficient you may want to use a dictionary lookup for your key function instead of order.index, for example:
order = ['a', 'n', 'A', 'N', 'b', 'o', ....., 'Z']
order_dict = {x: i for i, x in enumerate(order)}
sorted_list = sorted(some_list, key=order_dict.get)
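The full order does not need to be typed out by hand: going by the suit slices in the question, it is the four 13-letter suits interleaved rank by rank. A sketch under that assumption, using string.ascii_letters (lowercase followed by uppercase); suits, rank_of and sort_cards are illustrative names:

```python
import string

letters = string.ascii_letters
# Four suits of 13 cards each: clubs, diamonds, hearts, spades.
suits = [letters[i:i + 13] for i in range(0, 52, 13)]
# Interleave the suits so equal ranks sit together: a, n, A, N, b, o, ...
order = [card for rank in zip(*suits) for card in rank]
# Precompute each card's position for constant-time key lookups.
rank_of = {card: i for i, card in enumerate(order)}


def sort_cards(cards):
    """Sort card letters by rank first, then by suit."""
    return sorted(cards, key=rank_of.__getitem__)


print(sort_cards(['A', 'n', 'N', 'a']))
```

Using rank_of.__getitem__ rather than rank_of.get raises a KeyError for an unknown card instead of silently passing None to the sort.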
Store [edit: not store, use internally] them as numbers ordered by value and only convert to letters when displaying them.
Edit: if 1 byte values are required then you can have the cards in the range 1:52 as characters, then, again, convert to the proper letters when displaying and storing them.
