I'm running some models in Python, with the data subset by category.
To save memory and simplify preprocessing, all the categorical variables are stored as the category data type.
For each level of my 'group by' column, I run a regression, and I need to reset all my categorical variables so that they contain only the categories present in that subset.
I am currently doing this with .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. The worst offender is my grouping column; the others don't take as much time (I guess because there are fewer levels to drop).
Here is a simplified example:
import itertools
import pandas as pd
#generate some fake data
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 2)]
z = pd.DataFrame({'x':keywords})
#convert to category datatype
z.x = z.x.astype('category')
#groupby
z = z.groupby('x')
#loop over groups
for i in z.groups:
    x = z.get_group(i)
    x.x = x.x.cat.remove_unused_categories()
    #run my fancy model here
On my laptop, this takes about 20 seconds. For this small example we could convert to str and back to category for a speedup, but my real data has at least 300 rows per group.
Is it possible to speed up this loop? I have tried x.x = x.x.cat.set_categories(i), which takes a similar amount of time, and x.x.cat.categories = i, which complains that it needs the same number of categories as I started with.
Your problem is that you are assigning z.get_group(i) to x, so x is a copy of a portion of z. Your code will work fine with this change:
for i in z.groups:
    x = z.get_group(i).copy()  # will no longer be tied to z
    x.x = x.x.cat.remove_unused_categories()
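As the question itself notes, a str round-trip rebuilds the categories from only the values present. Here is a minimal sketch of that variant; it's worth profiling against .remove_unused_categories() on the real data, since the relative speed depends on group size:
for i in z.groups:
    x = z.get_group(i).copy()
    # rebuild the categorical from the values actually present,
    # instead of pruning the full category list
    x.x = x.x.astype(str).astype('category')
    #run my fancy model here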
DISCLAIMER (added later): I have modified the code to take into account the comments from @jasonharper and @user2357112supportsMonica below. I'm still having the memory issue.
I'm running the following code:
import itertools
from tqdm import tnrange
import random
def perm_generator(comb1, comb2):
    seen = set()
    length1 = len(comb1)
    length2 = len(comb2)
    while True:
        perm1 = tuple(random.sample(comb1, length1))
        perm2 = tuple(random.sample(comb2, length2))
        perm_pair = perm1 + perm2
        if perm_pair not in seen:
            seen.add(perm_pair)
            yield [perm1, perm2]
seq_all = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V')
combinations_first_half = list(itertools.combinations(seq_all, int(len(seq_all)/2)))
n = 1000
random.seed(0)
all_rand_permutations = []
for i in tnrange(len(combinations_first_half), desc='rand_permutations'):
    comb1 = combinations_first_half[i]
    comb2 = tuple(set(seq_all) - set(comb1))
    gen = perm_generator(comb1, comb2)
    rand_permutations = [next(gen) for _ in range(n)]
    all_rand_permutations += rand_permutations
For almost all iterations of the for loop everything goes smoothly, at about 33 iterations per second.
However, in some rare cases the loop gets stuck and memory pressure builds for quite a few seconds. Eventually, on some later iteration, the kernel dies.
It seems to be related to random.sample(): if I start the loop from the index of the iteration in which the kernel dies, or from one of the high-memory-pressure iterations (thereby shifting the effect of random.seed()), there is no issue and the loop runs as fast as the other iterations.
I'm studying Python and trying to learn how to use the map() function.
I had the idea of shifting each letter of a string to the next letter in the alphabet, e.g. abc -> bcd.
I wrote the following code:
m = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
def func(s):
    return m[m.index(s) + 1]
l = "abc"
print(set(map(func, l)))
But every execution returns the letters in a different order.
I got the expected answer by using:
l2 = [func(i) for i in l]
print(l2)
But I wanted to understand the map() function and how it works. I tried to read the documentation but could not understand much.
Sorry about my bad English and my lack of experience in Python :/
It is because you are converting to a set in set(map(func, l)), and a set is an unordered collection in Python.
From the docs:
A set object is an unordered collection of distinct hashable objects....Being an unordered collection, sets do not record element position or order of insertion. Accordingly, sets do not support indexing, slicing, or other sequence-like behavior.
If you replace print(set(map(func, l))) with print(list(map(func, l))), you won't see this behavior.
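A quick way to see the difference, reusing the question's own func and input (a minimal sketch):
m = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
def func(s):
    return m[m.index(s) + 1]
l = "abc"
print(list(map(func, l)))  # ['b', 'c', 'd'] -- a list preserves the mapping order
print(set(map(func, l)))   # e.g. {'d', 'b', 'c'} -- a set's order is arbitrary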
I have a dataset of multiple local store rankings that I'm looking to aggregate / combine into one national ranking, programmatically. I know that the local rankings are by sales volume, but I am not given the sales volume so must use the relative rankings to create as accurate a national ranking as possible.
As a short example, let's say that we have 3 local ranking lists, from best ranking (1st) to worst ranking (last), that represent different geographic boundaries that can overlap with one another.
ranking_1 = ['J','A','Z','B','C']
ranking_2 = ['A','H','K','B']
ranking_3 = ['Q','O','A','N','K']
We know that either J or Q is the highest-ranked store, as they are first in ranking_1 and ranking_3 respectively, and both appear above A, which is the highest in ranking_2. We know that O is next, as it's above A in ranking_3. A comes next, and so on...
If I did this correctly on paper, the output of this short example would be:
global_ranking = [('J',1.5),('Q',1.5),('O',3),('A',4),('H',6),('N',6),('Z',6),('K',8),('B',9),('C',10)]
Note that when we don't have enough data to determine which of two stores is ranked higher, we consider it a tie (i.e. we know that one of J or Q is the highest ranked store, but don't know which is higher, so we put them both at 1.5). In the actual dataset, there are 100+ lists of 1000+ items in each.
I've had fun trying to figure out this problem and am curious if anyone has any smart approaches to it.
A modified merge sort algorithm would help here. The modification should take incomparable stores into account and thus build groups of incomparable elements that you are willing to consider equal (like Q and J).
This method analyzes all of the stores at the front of the rankings. If a store does not appear below the first position in any other ranking list, then it belongs at this front level and is added to a 'level' list. Next, those stores are removed from the front runners, and all of the lists are adjusted so that there are new front runners. Repeat the process until there are no stores left.
def rank_stores(rankings):
    """
    Rank stores by sales volume from rankings with overlap between lists.
    :param rankings: list of store rankings, themselves lists.
    :return: ordered list with sets of stores at the same ranking.
    """
    rank_global = []
    # Evaluate all stores in the number one position; if they are not below
    # number one somewhere else, then they belong at this level.
    # Then remove them from the front of the lists, and repeat.
    while sum([len(x) for x in rankings]) > 0:
        tops = []
        # Find out which of the number one stores are not in a lower
        # position somewhere else.
        for rank in rankings:
            if not rank:
                continue
            else:
                top = rank[0]
                add = True
                for rank_test in rankings:
                    if not rank_test:
                        continue
                    elif not rank_test[1:]:
                        continue
                    elif top in rank_test[1:]:
                        add = False
                        break
                    else:
                        continue
                if add:
                    tops.append(top)
        # Now add tops to the global rankings list,
        # then go through the rankings and pop the top if it is in tops.
        rank_global.append(set(tops))
        # Remove the stores that just made it to the top.
        for rank in rankings:
            if not rank:
                continue
            elif rank[0] in tops:
                rank.pop(0)
            else:
                continue
    return rank_global
For the rankings provided:
ranking_1 = ['J','A','Z','B','C']
ranking_2 = ['A','H','K','B']
ranking_3 = ['Q','O','A','N','K']
rankings = [ranking_1, ranking_2, ranking_3]
Then calling the function:
rank_stores(rankings)
Results in:
[{'J', 'Q'}, {'O'}, {'A'}, {'H', 'N', 'Z'}, {'K'}, {'B'}, {'C'}]
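If you want the averaged tie ranks from the question's expected output (e.g. ('J', 1.5)), a small post-processing step can convert these level sets. Here is a sketch; to_tied_ranks is a hypothetical helper, not part of the answer above:
def to_tied_ranks(levels):
    # Convert [{'J', 'Q'}, {'O'}, ...] into [(store, avg_rank), ...],
    # giving each tied group the average of the positions it occupies.
    result = []
    position = 1
    for group in levels:
        avg = position + (len(group) - 1) / 2
        for store in sorted(group):
            result.append((store, avg))
        position += len(group)
    return result

# rank_stores() consumes its input lists, so pass fresh copies.
levels = rank_stores([['J','A','Z','B','C'], ['A','H','K','B'], ['Q','O','A','N','K']])
print(to_tied_ranks(levels))
# [('J', 1.5), ('Q', 1.5), ('O', 3.0), ('A', 4.0), ('H', 6.0), ('N', 6.0), ('Z', 6.0), ('K', 8.0), ('B', 9.0), ('C', 10.0)]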
In some circumstances there may not be enough information to determine definite rankings. Try this order:
['Z', 'A', 'B', 'J', 'K', 'F', 'L', 'E', 'W', 'X', 'Y', 'R', 'C']
We can derive the following rankings:
a = ['Z', 'A', 'B', 'F', 'E', 'Y']
b = ['Z', 'J', 'K', 'L', 'X', 'R']
c = ['F', 'E', 'W', 'Y', 'C']
d = ['J', 'K', 'E', 'W', 'X']
e = ['K', 'F', 'W', 'R', 'C']
f = ['X', 'Y', 'R', 'C']
g = ['Z', 'F', 'W', 'X', 'Y', 'R', 'C']
h = ['Z', 'A', 'E', 'W', 'C']
i = ['L', 'E', 'Y', 'R', 'C']
j = ['L', 'E', 'W', 'R']
k = ['Z', 'B', 'K', 'L', 'W', 'Y', 'R']
rankings = [a, b, c, d, e, f, g, h, i, j, k]
Calling the function:
rank_stores(rankings)
results in:
[{'Z'},
{'A', 'J'},
{'B'},
{'K'},
{'F', 'L'},
{'E'},
{'W'},
{'X'},
{'Y'},
{'R'},
{'C'}]
In this scenario there is not enough information to determine where 'J' should be relative to 'A' and 'B', only that it is somewhere in the range between 'Z' and 'K'.
When this is multiplied across hundreds of rankings and stores, some of the stores will not be ranked correctly on an absolute-volume basis.
I'm using a function to parse an Excel file in Python without using libraries, by importing an individual script that accesses the Excel data. My program can read the Excel data and gets the values with the following structure. I need to access specific columns of the list of Excel values and pass them as input to another function.
I've converted the list into a dict, but the problem is that I don't want to pass the whole dictionary as input to the other function, just specific columns.
Any ideas how to do so?
As I understand it, you have a structure like this:
>>> l=[['a','b','c','d'],['e','f','g','h'],['i','j','k','l'],['m','n','o','p']]
>>> m=list(enumerate(l,0))
>>> m
[(0, ['a', 'b', 'c', 'd']), (1, ['e', 'f', 'g', 'h']), (2, ['i', 'j', 'k', 'l']), (3, ['m', 'n', 'o', 'p'])]
Then you can access rows like this. Here is the second row (counting from zero):
>>> row=m[2][1]
>>> row
['i', 'j', 'k', 'l']
>>> hrow=m[2]
>>> hrow
(2, ['i', 'j', 'k', 'l'])
And columns too. Here is the second (counting from zero) column:
>>> col=[]
>>> for r in m:
... col.append(r[1][2])
>>> col
['c', 'g', 'k', 'o']
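The same column can also be extracted with a one-line list comprehension, equivalent to the loop above:
>>> col = [r[1][2] for r in m]
>>> col
['c', 'g', 'k', 'o']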
You want to take a subset of your dict. You may want to try this out:
new_dict = {k: v for k, v in original_dict.items() if k in list_of_req_cols}
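For example (the column names and list_of_req_cols here are made up for illustration):
original_dict = {'name': ['a', 'e'], 'age': ['b', 'f'], 'city': ['c', 'g']}
list_of_req_cols = ['name', 'city']
new_dict = {k: v for k, v in original_dict.items() if k in list_of_req_cols}
print(new_dict)  # {'name': ['a', 'e'], 'city': ['c', 'g']}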
I have a list a = ["c","o","m","p","a","r","e"], and two other lists:
b = ["c","l","o","m","p","a","r","e"] and c = ["c","o","m","p","a","e","r"]
Now I want to compare 'b' and 'c' with 'a' to see whether the order of elements in 'b' or in 'c' is closer to 'a', and return the closer list. In this example, 'b' should be returned. Is there a function to do that?
difflib.SequenceMatcher will find
the longest contiguous matching subsequence
that contains no "junk" elements
SequenceMatcher.ratio returns a measure of the sequences' similarity as a float in the range [0, 1]. A higher ratio indicates higher similarity (the ratio is 1.0 if the given sequences are identical).
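For instance, checking ratio() on the question's data (values rounded):
from difflib import SequenceMatcher

print(SequenceMatcher(a=list("compare"), b=list("clompare")).ratio())  # ~0.933
print(SequenceMatcher(a=list("compare"), b=list("compaer")).ratio())   # ~0.857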
The helper function below uses max() to compare the first argument against the rest of the positional arguments:
from difflib import SequenceMatcher

def closest(seq, *args):
    # Cache information about `seq`.
    # We only really need to change one sequence.
    sm = SequenceMatcher(b=seq)
    def _ratio(x):
        sm.set_seq1(x)
        return sm.ratio()
    return max(args, key=_ratio)
Example:
In [37]: closest(
....: ['c', 'o', 'm', 'p', 'a', 'r', 'e'], # a
....: ['c', 'l', 'o', 'm', 'p', 'a', 'r', 'e'], # b
....: ['c', 'o', 'm', 'p', 'a', 'e', 'r'] # c
....: )
Out[37]: ['c', 'l', 'o', 'm', 'p', 'a', 'r', 'e'] # b
The traditional way of solving this problem is with the Levenshtein distance. This basically tallies up the insertions, deletions, and substitutions required to transform one string into another.
You can think of each of those operations as "breaking" the pattern just a bit.
It's a pretty simple function to implement, but there's a package that has already done it for you here. Sample code is below:
>>> from Levenshtein import distance
>>> distance("compare", "clompare")
1
>>> distance("compare", "compaer")
2
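Applied to the original question, you could join each list into a string and pick the candidate with the smaller distance. A sketch, assuming the same Levenshtein package (closest_by_levenshtein is a hypothetical helper):
from Levenshtein import distance

def closest_by_levenshtein(seq, *candidates):
    # Join each list of single characters into a string, then return
    # the candidate with the smallest edit distance to seq.
    target = ''.join(seq)
    return min(candidates, key=lambda cand: distance(target, ''.join(cand)))

a = ["c", "o", "m", "p", "a", "r", "e"]
b = ["c", "l", "o", "m", "p", "a", "r", "e"]
c = ["c", "o", "m", "p", "a", "e", "r"]
print(closest_by_levenshtein(a, b, c))  # ['c', 'l', 'o', 'm', 'p', 'a', 'r', 'e']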