High memory pressure in generator with random.sample()


DISCLAIMER (added later): I have modified the code to take into account the comments below from #jasonharper and #user2357112supportsMonica. I am still having the memory issue.
I'm running the following code:
import itertools
from tqdm import tnrange
import random

def perm_generator(comb1, comb2):
    seen = set()
    length1 = len(comb1)
    length2 = len(comb2)
    while True:
        perm1 = tuple(random.sample(comb1, length1))
        perm2 = tuple(random.sample(comb2, length2))
        perm_pair = perm1 + perm2
        if perm_pair not in seen:
            seen.add(perm_pair)
            yield [perm1, perm2]

seq_all = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V')
combinations_first_half = list(itertools.combinations(seq_all, int(len(seq_all)/2)))
n = 1000
random.seed(0)
all_rand_permutations = []
for i in tnrange(len(combinations_first_half), desc = 'rand_permutations'):
    comb1 = combinations_first_half[i]
    comb2 = tuple(set(seq_all) - set(comb1))
    gen = perm_generator(comb1, comb2)
    rand_permutations = [next(gen) for _ in range(n)]
    all_rand_permutations += rand_permutations
For almost all iterations of the for loop everything goes smoothly, at about 33 iterations per second.
However, in some rare cases the loop gets stuck and memory pressure builds for several seconds. Eventually, on some later iteration, the kernel dies.
It seems to be related to random.sample(): if I start the loop from the index of the iteration in which the kernel dies, or from one of the high-memory-pressure iterations (which shifts the random state relative to random.seed(0)), there is no issue and the loop goes through that iteration as quickly as the other fast ones.
I attach here a few screenshots:

Related

How do I alphabetically sort an array of strings without any sort functions? Python

When solving the following problem:
"Assuming you have a random list of strings (for example: a, b, c, d, e, f, g), write a program that will sort the strings in alphabetical order.
You may not use the sort command."
I run into the problem that running strings through the following code sometimes gives me duplicated strings in the final list.
I am fairly new to Python, and our class has just started to look at numpy and the functions in that module; I'm not sure any of them are being used in the code (except any sort function).
import numpy as np
list=[]
list=str(input("Enter list of string(s): "))
list=list.split()
print() # for format space purposes
listPop=list
min=listPop[0]
newFinalList=[]
if(len(list2)!=1):
    while(len(listPop)>=1):
        for i in range(len(listPop)):
            #setting min=first element of list
            min=listPop[0]
            if(listPop[i]<=min):
                min=listPop[i]
                print(min)
        listPop.pop(i)
        newFinalList.append(min)
    print(newFinalList)
else:
    print("Only one string inputted, so already alphabatized:",list2)
Expected result of ["a","y","z"]
["a","y","z"]
Actual result...
Enter list of string(s): a y z
a
a
a
['a', 'a', 'a']
Enter list of string(s): d e c
d
c
d
d
['c', 'd', 'd']

Selection sort: for each index i of the list, select the smallest item at or after i and swap it into the ith position. Here's an implementation in three lines:
# For each index i...
for i in range(len(list)):
    # Find the position of the smallest item after (or including) i.
    j = list[i:].index(min(list[i:])) + i
    # Swap it into the i-th place (this is a no-op if i == j).
    list[i], list[j] = list[j], list[i]
list[i:] is a slice (subset) of list starting at the ith element.
min(list) gives you the smallest element in list.
list.index(element) gives you the (first) index of element in list.
a, b = b, a swaps the values of a and b in a single statement (the right-hand side is evaluated before either name is reassigned).
The trickiest part of this implementation is that when you're using index to find the index of the smallest element, you need to find the index within the same list[i:] slice that you found the element in, otherwise you might select a duplicate element in an earlier part of the list. Since you're finding the index relative to list[i:], you then need to add i back to it to get the index within the entire list.
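The same idea can be sketched as a self-contained function. This is only an illustration, not the answerer's code: the name selection_sort is hypothetical, it works on a copy, and it finds the smallest item with an explicit scan instead of min() and index(), so it still runs if min or list have been reused as variable names the way the question's code does.

```python
def selection_sort(items):
    """Return a sorted copy of items without calling sort() or sorted()."""
    result = [x for x in items]  # copy, so the caller's list is left untouched
    for i in range(len(result)):
        # Scan result[i:] for the position of the smallest remaining item.
        smallest = i
        for j in range(i + 1, len(result)):
            if result[j] < result[smallest]:
                smallest = j
        # Swap it into position i (a no-op when smallest == i).
        result[i], result[smallest] = result[smallest], result[i]
    return result


print(selection_sort(["d", "e", "c"]))  # ['c', 'd', 'e']
print(selection_sort(["a", "y", "z"]))  # ['a', 'y', 'z']
```

Because the scan always starts at i, duplicates earlier in the list can never be selected again, which is the pitfall described above.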

You can implement quicksort for the same task:
def partition(arr, low, high):
    i = low - 1
    pivot = arr[high]
    for j in range(low, high):
        if arr[j] <= pivot:
            i = i + 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i+1], arr[high] = arr[high], arr[i+1]
    return i + 1

def quickSort(arr, low, high):
    if low < high:
        pi = partition(arr, low, high)
        quickSort(arr, low, pi-1)
        quickSort(arr, pi+1, high)

arr = ['a', 'x', 'p', 'o', 'm', 'w']
n = len(arr)
quickSort(arr, 0, n-1)
print("Sorted list is:")
for i in range(n):
    print("%s" % arr[i], end=" ")
Output:
Sorted list is:
a m o p w x

Mergesort:
from heapq import merge
from itertools import islice

def _ms(a, n):
    return islice(a,n) if n<2 else merge(_ms(a,n//2),_ms(a,n-n//2))

def mergesort(a):
    return type(a)(_ms(iter(a),len(a)))

# example
import string
import random
L = list(string.ascii_lowercase)
random.shuffle(L)
print(L)
print(mergesort(L))
Sample run:
['h', 'g', 's', 'l', 'a', 'f', 'b', 'z', 'x', 'c', 'r', 'j', 'q', 'p', 'm', 'd', 'k', 'w', 'u', 'v', 'y', 'o', 'i', 'n', 't', 'e']
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Solving a "colored Quxes" coding challenge with recursion

I am trying to solve some of the coding challenges that I find online, but I got stuck on the problem below. I tried to solve it using recursion, but I feel I am missing a very important concept in recursion. My code works for all of the examples below except the last one, where it breaks down.
Can someone point out the mistake I made in this recursive code, or guide me through solving the issue?
I know why my code breaks, but I don't know how to get around "pass by object reference" in Python, which I think is creating the bigger problem for me.
The coding question is:
On a mysterious island there are creatures known as Quxes which come in three colors: red, green, and blue. One power of the Qux is that if two of them are standing next to each other, they can transform into a single creature of the third color.
Given N Quxes standing in a line, determine the smallest number of them remaining after any possible sequence of such transformations.
For example, given the input ['R', 'G', 'B', 'G', 'B'], it is possible to end up with a single Qux through the following steps:
Arrangement | Change
----------------------------------------
['R', 'G', 'B', 'G', 'B'] | (R, G) -> B
['B', 'B', 'G', 'B'] | (B, G) -> R
['B', 'R', 'B'] | (R, B) -> G
['B', 'G'] | (B, G) -> R
['R'] |
________________________________________
My code is:
class fusionCreatures(object):
    """Regular Numbers Gen.
    """
    def __init__(self , value=[]):
        self.value = value
        self.ans = len(self.value)

    def fusion(self, fus_arr, i):
        color = ['R','G','B']
        color.remove(fus_arr[i])
        color.remove(fus_arr[i+1])
        fus_arr.pop(i)
        fus_arr.pop(i)
        fus_arr.insert(i, color[0])
        return fus_arr

    def fusionCreatures1(self, arr=None):
        # this method is to find the smallest number of creature in a row after fusion
        if arr == None:
            arr = self.value
        for i in range (0,len(arr)-1):
            #print(arr)
            if len(arr) == 2 and i >= 1 or len(arr)<2:
                break
            if arr[i] != arr[i+ 1]:
                arr1 = self.fusion(arr, i)
                testlen = self.fusionCreatures1(arr)
                if len(arr) < self.ans:
                    self.ans = len(arr)
        return self.ans
Testing array (all of them work except the last one):
t1 = fusionCreatures(['R','G','B','G','B'])
t2 = fusionCreatures(['R','G','B','R','G','B'])
t3 = fusionCreatures(['R','R','G','B','G','B'])
t4 = fusionCreatures(['G','R','B','R','G'])
t5 = fusionCreatures(['G','R','B','R','G','R','G'])
t6 = fusionCreatures(['R','R','R','R','R'])
t7 = fusionCreatures(['R', 'R', 'R', 'G', 'G', 'G', 'B', 'B', 'B'])
print(t1.fusionCreatures1())
print(t2.fusionCreatures1())
print(t3.fusionCreatures1())
print(t4.fusionCreatures1())
print(t5.fusionCreatures1())
print(t6.fusionCreatures1())
print(t7.fusionCreatures1())

I'll start by mentioning that there is a deductive approach that works in O(n) and is detailed in this blog post. It boils down to checking the parity of the counts of the three types of elements in the list to determine which of a few fixed outcomes occurs.
You mention that you'd prefer to use a recursive approach, which is O(n!). This is a good start because it can be used as a tool for helping arrive at the O(n) solution and is a common recursive pattern to be familiar with.
Because we can't know whether a given fusion between two Quxes will ultimately lead to an optimal global solution, we're forced to try every possibility. We do this by walking over the list and looking for potential fusions. When we find one, we perform the transformation in a new list and call fuse_quxes on it. Along the way, we keep track of the smallest length achieved.
Here's one approach:
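from itertools import permutations

def fuse_quxes(quxes, choices="RGB"):
    # Map each ordered pair of distinct colors to a one-item list holding the third color.
    fusion = {x[:-1]: [x[-1]] for x in permutations(choices)}

    def walk(quxes):
        best = len(quxes)
        for i in range(1, len(quxes)):
            if quxes[i-1] != quxes[i]:
                # Build a new list with the adjacent pair replaced by the fused color.
                sub = quxes[:i-1] + fusion[quxes[i-1], quxes[i]] + quxes[i+1:]
                best = min(walk(sub), best)
        return best

    return walk(quxes)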
This is pretty much the direction your provided code is moving towards, but the implementation seems unclear. Unfortunately, I don't see any single or quick fix. Here are a few general issues:
Putting the fusionCreatures1 function into a class allows it to mutate external state, namely self.value and self.ans. self.value in particular is poorly named and difficult to keep track of. It seems like the intent is to use it as a reference copy to reset arr to its default value, but arr = self.value means that when fus_arr is mutated in fusion(), self.value is as well. Everything is pretty much a reference to one underlying list.
Adding slices to these copies at least makes the program easier to reason about, for example, arr = self.value[:] and fus_arr = fus_arr[:] in the fusion() function. In short, try to write pure functions.
self.ans is also unclear and unnecessary; better to keep the result value relegated to a local variable within the recursive function.
It seems unnecessary to put a stateless function into a class unless it's a purely static method and the class is acting as a namespace.
Another cause of cognitive overload is branching statements like if and break. We want to minimize the frequency and nesting of these. Here is fusionCreatures1 in pseudocode, with annotations for mutations and complex interactions:
def fusionCreatures1():
    if ...
        read mutated global state
    for i in len(arr):
        if complex length and index checks:
            break
        if arr[i] != arr[i+ 1]:
            impure_func_that_changes_arr_length(arr)
            recurse()
            if new best compared to global state:
                mutate global state
You'll probably agree that it's pretty difficult to mentally step through a run of this function.
In fusionCreatures1(), two variables are unused:
arr1 = self.fusion(arr, i)
testlen = self.fusionCreatures1(arr)
The assignment arr1 = self.fusion(arr, i) (along with the return fus_arr) seems to indicate a lack of understanding that self.fusion is really an in-place function that mutates its argument array. So calling it means arr1 is arr and we have another aliased variable to reason about.
Beyond this, neither arr1 nor testlen is used in the program, so the intent is unclear.
A good linter will pick up these unused variables and identify most of the other complexity issues I've mentioned.
Mutating a list while looping over it is usually disastrous. self.fusion(arr, i) mutates arr inside a loop, making it very difficult to reason about its length and causing an index error when the range(len(arr)) no longer matches the actual len(arr) in the function body (or at least necessitating an in-body precondition). Making self.fusion(arr, i) pure using a slice, as mentioned above, fixes this problem but reveals that there is no recursive base case, resulting in a stack overflow error.
Avoid variable names like arr, arr1, value unless the context is obvious. Again, these obfuscate intent and make the program difficult to understand.
Some minor style suggestions:
Use snake_case per PEP-8. Class names should be TitleCased to differentiate them from functions. No need to inherit from object--that's implicit.
Use consistent spacing around functions and operators: range (0,len(arr)-1): is clearer as range(len(arr) - 1):, for example. Use vertical whitespace around blocks.
Use lists instead of typing out t1, t2, ... t7.
Function names should be verbs, not nouns. A class like fusionCreatures with a method called fusionCreatures1 is unclear. Something like QuxesSolver.minimize(creatures) makes the intent a bit more obvious.
As for the solution I provided above, there are other tricks worth considering to speed it up. One is memoization, which can help avoid duplicate work (any given list will always produce the same minimized length, so we just store this computation in a dict and spit it back out if we ever see it again). If we hit a length of 1, that's the best we can do globally, so we can skip the rest of the search.
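As a rough sketch of those two speed-ups, the variant below caches results with functools.lru_cache and stops searching as soon as a length of 1 is reached. The name fuse_quxes_memo is hypothetical, and converting the input to a tuple is an assumption made here so that each arrangement is hashable and can serve as a cache key.

```python
from functools import lru_cache
from itertools import permutations


def fuse_quxes_memo(quxes, choices="RGB"):
    # Map each ordered pair of distinct colors to a 1-tuple holding the third color.
    fusion = {x[:-1]: x[-1:] for x in permutations(choices)}

    @lru_cache(maxsize=None)
    def walk(state):
        best = len(state)
        for i in range(1, len(state)):
            if state[i-1] != state[i]:
                sub = state[:i-1] + fusion[state[i-1], state[i]] + state[i+1:]
                best = min(walk(sub), best)
                if best == 1:
                    # A single Qux is the global optimum, so stop searching.
                    return 1
        return best

    # Tuples are hashable, so every arrangement can be used as a cache key.
    return walk(tuple(quxes))


print(fuse_quxes_memo(['R', 'G', 'B', 'G', 'B']))  # 1
print(fuse_quxes_memo(list("RRRGGGBBB")))          # 2
```

The cache lives inside each call to fuse_quxes_memo, so it is discarded when the call returns and does not grow across unrelated inputs.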
Here's a full runner, including the linear solution translated to Python (again, defer to the blog post to read about how it works):
from collections import defaultdict
from itertools import permutations
from random import choice, randint

def fuse_quxes_linear(quxes, choices="RGB"):
    counts = defaultdict(int)
    for e in quxes:
        counts[e] += 1
    if not quxes or any(x == len(quxes) for x in counts.values()):
        return len(quxes)
    elif len(set(counts[x] % 2 for x in choices)) == 1:
        return 2
    return 1

def fuse_quxes(quxes, choices="RGB"):
    fusion = {x[:-1]: [x[-1]] for x in permutations(choices)}

    def walk(quxes):
        best = len(quxes)
        for i in range(1, len(quxes)):
            if quxes[i-1] != quxes[i]:
                sub = quxes[:i-1] + fusion[quxes[i-1], quxes[i]] + quxes[i+1:]
                best = min(walk(sub), best)
        return best

    return walk(quxes)

if __name__ == "__main__":
    tests = [
        ['R','G','B','G','B'],
        ['R','G','B','R','G','B'],
        ['R','R','G','B','G','B'],
        ['G','R','B','R','G'],
        ['G','R','B','R','G','R','G'],
        ['R','R','R','R','R'],
        ['R', 'R', 'R', 'G', 'G', 'G', 'B', 'B', 'B']
    ]
    for test in tests:
        print(test, "=>", fuse_quxes(test))
        assert fuse_quxes_linear(test) == fuse_quxes(test)
    for i in range(100):
        test = [choice("RGB") for x in range(randint(0, 10))]
        assert fuse_quxes_linear(test) == fuse_quxes(test)
Output:
['R', 'G', 'B', 'G', 'B'] => 1
['R', 'G', 'B', 'R', 'G', 'B'] => 2
['R', 'R', 'G', 'B', 'G', 'B'] => 2
['G', 'R', 'B', 'R', 'G'] => 1
['G', 'R', 'B', 'R', 'G', 'R', 'G'] => 2
['R', 'R', 'R', 'R', 'R'] => 5
['R', 'R', 'R', 'G', 'G', 'G', 'B', 'B', 'B'] => 2

Here is my suggestion.
First, instead of "R", "G" and "B" I use integer values 0, 1, and 2. This allows nice and easy fusion between a and b, as long as they are different, by simply doing 3 - a - b.
Then my recursion code is:
def fuse_quxes(l):
    n = len(l)
    for i in range(n - 1):
        if l[i] == l[i + 1]:
            continue
        else:
            newn = fuse_quxes(l[:i] + [3 - l[i] - l[i + 1]] + l[i+2:])
            if newn < n:
                n = newn
    return n
Run this with
IN[5]: fuse_quxes([0, 0, 0, 1, 1, 1, 2, 2, 2])
Out[5]: 2
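The integer encoding used in this answer can be illustrated on its own. In the sketch below, COLORS and fuse_pair are hypothetical names introduced only for illustration; the point is that the codes 0, 1 and 2 sum to 3, so the color missing from a pair is 3 minus the two codes.

```python
COLORS = "RGB"  # position in this string is the integer code: R=0, G=1, B=2


def fuse_pair(a, b):
    """Return the color produced when two different colors a and b fuse."""
    x, y = COLORS.index(a), COLORS.index(b)
    if x == y:
        raise ValueError("identical colors cannot fuse")
    # The three codes sum to 0 + 1 + 2 = 3, so the missing code is 3 - x - y.
    return COLORS[3 - x - y]


print(fuse_pair("R", "G"))  # B
print(fuse_pair("B", "R"))  # G
```

Converting a list of letters with [COLORS.index(c) for c in quxes] gives input in the form the recursive function above expects.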

Here is my attempt at the problem; please find the description in the comments.
inputs = [['R','G','B','G','B'],
          ['R','G','B','R','G','B'],
          ['R','R','G','B','G','B'],
          ['G','R','B','R','G'],
          ['G','R','B','R','G','R','G'],
          ['R','R','R','R','R'],
          ['R', 'R', 'R', 'G', 'G', 'G', 'B', 'B', 'B'],]

def fuse_quxes(inp):
    RGB_set = {"R", "G", "B"}
    merge_index = -1
    ## pair each qux with the next in line and loop through all pairs
    for i, (q1, q2) in enumerate(zip(inp[:-1], inp[1:])):
        merged = RGB_set-{q1,q2}
        ## if more than one item remains in merged after removing q1 and q2, the quxes can't fuse
        if(len(merged))==1:
            merged = merged.pop()
            merge_index=i
            merged_color = merged
            ## loop through the pairs until the result of the fuse differs from the qux
            ## on either its right or its left side
            if (i>0 and merged!=inp[i-1]) or ((i+2)<len(inp) and merged!=inp[i+2]):
                break
    print(inp)
    ## merge two quxes whose result differs from the qux on either its right or left,
    ## else do any possible merge
    if merge_index>=0:
        del inp[merge_index]
        inp[merge_index] = merged_color
        return fuse_quxes(inp)
    else:
        ## if no merge can be made, end the recursion
        print("Result", len(inp))
        print("_______________________")
        return len(inp)

[fuse_quxes(inp) for inp in inputs]
Output:
['R', 'G', 'B', 'G', 'B']
['R', 'R', 'G', 'B']
['R', 'B', 'B']
['G', 'B']
['R']
Result 1
_______________________
['R', 'G', 'B', 'R', 'G', 'B']
['R', 'G', 'B', 'R', 'R']
['R', 'G', 'G', 'R']
['B', 'G', 'R']
['B', 'B']
Result 2
_______________________
['R', 'R', 'G', 'B', 'G', 'B']
['R', 'B', 'B', 'G', 'B']
['G', 'B', 'G', 'B']
['R', 'G', 'B']
['R', 'R']
Result 2
_______________________
['G', 'R', 'B', 'R', 'G']
['G', 'G', 'R', 'G']
['G', 'B', 'G']
['R', 'G']
['B']
Result 1
_______________________
['G', 'R', 'B', 'R', 'G', 'R', 'G']
['G', 'G', 'R', 'G', 'R', 'G']
['G', 'B', 'G', 'R', 'G']
['R', 'G', 'R', 'G']
['B', 'R', 'G']
['B', 'B']
Result 2
_______________________
['R', 'R', 'R', 'R', 'R']
Result 5
_______________________
['R', 'R', 'R', 'G', 'G', 'G', 'B', 'B', 'B']
['R', 'R', 'B', 'G', 'G', 'B', 'B', 'B']
['R', 'G', 'G', 'G', 'B', 'B', 'B']
['B', 'G', 'G', 'B', 'B', 'B']
['R', 'G', 'B', 'B', 'B']
['R', 'R', 'B', 'B']
['R', 'G', 'B']
['R', 'R']
Result 2
_______________________
[1, 2, 2, 1, 2, 5, 2]

What's happening in this code? Anagram as a substring in a string

I found this question posted on here, but couldn't comment or ask a question, so I am creating a new question.
The original post stated the following:
t = "abd"
s = "abdc"
s trivially contains t. However, when you sort them, you get the strings abd and abcd, and the in comparison fails. The sorting gets other letters in the way.
Instead, you need to step through s in chunks the size of t.
t_len = len(t)
s_len = len(s)
t_sort = sorted(t)
for start in range(s_len - t_len + 1):
    chunk = s[start:start+t_len]
    if t_sort == sorted(chunk):
        # SUCCESS!!
In the for loop, why do they take s_len and then subtract t_len? Why do they add 1 at the end?

alvits and d_void already explained the value of start; I won't repeat that.
I strongly recommend that you learn some basic trace debugging. Insert some useful print statements to follow the execution. For instance:
Code:
t = "goal"
s = "catalogue"
t_len = len(t)
s_len = len(s)
t_sort = sorted(t)
print("lengths & sorted", t_len, s_len, t_sort)
for start in range(s_len - t_len + 1):
    chunk = s[start:start+t_len]
    print("LOOP start=", start, "\tchunk=", chunk, sorted(chunk))
    if t_sort == sorted(chunk):
        print("success")
Output:
lengths & sorted 4 9 ['a', 'g', 'l', 'o']
LOOP start= 0 chunk= cata ['a', 'a', 'c', 't']
LOOP start= 1 chunk= atal ['a', 'a', 'l', 't']
LOOP start= 2 chunk= talo ['a', 'l', 'o', 't']
LOOP start= 3 chunk= alog ['a', 'g', 'l', 'o']
success
LOOP start= 4 chunk= logu ['g', 'l', 'o', 'u']
LOOP start= 5 chunk= ogue ['e', 'g', 'o', 'u']
Does that help illustrate what's happening in the loop?
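To make the loop bound concrete, here is a small sketch that wraps the same loop in a function; the name find_anagram_starts is hypothetical. With s of length 9 and t of length 4, the last chunk that still has 4 characters starts at index 9 - 4 = 5, and because range() excludes its upper bound, the + 1 is needed to include that final start.

```python
def find_anagram_starts(s, t):
    """Return every start index where a chunk of s is an anagram of t."""
    t_len = len(t)
    t_sort = sorted(t)
    starts = []
    # The last full-length chunk begins at len(s) - t_len;
    # range() stops before its upper bound, hence the + 1.
    for start in range(len(s) - t_len + 1):
        if sorted(s[start:start + t_len]) == t_sort:
            starts.append(start)
    return starts


print(find_anagram_starts("catalogue", "goal"))  # [3]
print(find_anagram_starts("abdc", "abd"))        # [0]
```

If t is longer than s, the range is empty and the function returns an empty list without any special-casing.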

creating a binary heap implementation giving wrong result

I am implementing the binary heap following an online course and I have done the following:
from __future__ import division

class BinaryHeap(object):
    def __init__(self, arr=None):
        self.heap = []

    def insert(self, item):
        self.heap.append(item)
        self.__swim(len(self.heap) - 1)

    def __swim(self, index):
        parent_index = (index - 1) // 2
        while index > 0 and self.heap[parent_index] < self.heap[index]:
            self.heap[parent_index], self.heap[index] = self.heap[index], self.heap[parent_index]
            index = parent_index
Now, I use it as:
s = 'SORTEXAMPLE'
a = BinaryHeap()
for c in s:
a.insert(c)
Now, after this the heap is ordered as:
['S', 'T', 'X', 'P', 'L', 'R', 'A', 'M', 'O', 'E', 'E']
rather than
['X', 'T', 'S', 'P', 'L', 'R', 'A', 'M', 'O', 'E', 'E']
So, it seems one of the last exchanges did not happen and I thought I might have messed up the indexing but I could not find any obvious issues.

OK, I figured it out. Of course, I cannot cache parent_index outside the loop!
The code should be:
def __swim(self, index):
    while index > 0 and self.heap[(index - 1) // 2] < self.heap[index]:
        self.heap[(index - 1) // 2], self.heap[index] = self.heap[index], self.heap[(index - 1) // 2]
        index = (index - 1) // 2
I am surprised this did not go into an infinite loop before...
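A minimal runnable sketch of the corrected approach follows. The names MaxHeap and is_heap are hypothetical and not from the course; the parent index is recomputed on every pass, and the result is checked against the heap property rather than against one particular ordering.

```python
class MaxHeap:
    """Minimal max-heap sketch with a swim step that recomputes the parent."""

    def __init__(self):
        self.heap = []

    def insert(self, item):
        self.heap.append(item)
        self._swim(len(self.heap) - 1)

    def _swim(self, index):
        # Recompute the parent each pass so the comparison follows the item upward.
        while index > 0:
            parent = (index - 1) // 2
            if not self.heap[parent] < self.heap[index]:
                break
            self.heap[parent], self.heap[index] = self.heap[index], self.heap[parent]
            index = parent


def is_heap(values):
    """Check that no element is larger than its parent."""
    return all(values[(i - 1) // 2] >= values[i] for i in range(1, len(values)))


h = MaxHeap()
for c in "SORTEXAMPLE":
    h.insert(c)
print(h.heap[0], is_heap(h.heap))  # X True
```

The order of siblings can differ between implementations (for example, ones that use a 1-based array), so checking the heap property is a more robust test than comparing against a fixed list.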

A Faster Way of Removing Unused Categories in Pandas?

I'm running some models in Python, with data subset on categories.
For memory usage, and preprocessing, all the categorical variables are stored as category data type.
For each level of a categorical variable in my 'group by' column, I am running a regression, where I need to reset all my categorical variables to those that are present in that subset.
I am currently doing this using .cat.remove_unused_categories(), which is taking nearly 50% of my total runtime. At the moment, the worst offender is my grouping column, others are not taking as much time (as I guess there are not as many levels to drop).
Here is a simplified example:
import itertools
import pandas as pd

#generate some fake data
alphabets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
keywords = [''.join(i) for i in itertools.product(alphabets, repeat = 2)]
z = pd.DataFrame({'x':keywords})
#convert to category datatype
z.x = z.x.astype('category')
#groupby
z = z.groupby('x')
#loop over groups
for i in z.groups:
    x = z.get_group(i)
    x.x = x.x.cat.remove_unused_categories()
    #run my fancy model here
On my laptop, this takes about 20 seconds. For this small example, we could convert to str and then back to category for a speed-up, but my real data has at least 300 lines per group.
Is it possible to speed up this loop? I have tried x.x = x.x.cat.set_categories(i), which takes a similar time, and x.x.cat.categories = i, which asks for the same number of categories as I started with.

Your problem is that you are assigning z.get_group(i) to x. x is now a copy of a portion of z. Your code will work fine with this change:
for i in z.groups:
    x = z.get_group(i).copy() # will no longer be tied to z
    x.x = x.x.cat.remove_unused_categories()
