Python multiprocessing with large objects: prevent copying/serialization of object

I have implemented multiprocessing for a problem involving larger objects, like the following:
import time
import pathos.multiprocessing as mp
from functools import partial
from random import randrange


class RandomNumber():
    def __init__(self, object_size=100):
        self.size = bytearray(object_size * 10**6)  # 100 MB size
        self.foo = None

    def do_something(self, *args, **kwargs):
        self.foo = randrange(1, 10)
        time.sleep(0.5)  # wait for 0.5 seconds
        return self


def wrapper(random_number, *args, **kwargs):
    return random_number.do_something(*args, **kwargs)


if __name__ == '__main__':
    # create data
    numbers = [RandomNumber() for m in range(0, 9)]
    kwds = {'add': randrange(1, 10)}
    # calculate
    pool = mp.Pool(processes=mp.cpu_count())
    result = pool.map_async(partial(wrapper, **kwds), numbers)
    try:
        result = result.get()
    except:
        pass
    # print result
    my_results = [i.foo for i in result]
    print(my_results)
    pool.close()
    pool.join()
which yields something like:
[8, 7, 8, 3, 1, 2, 6, 4, 8]
Now the problem is that I see a massive performance improvement compared to a plain list comprehension when the objects are very small, but this improvement turns into the opposite with larger object sizes, e.g. 100 MB and above.
From the documentation and other questions I have discovered that this is caused by the use of pickle/dill to serialize the single objects in order to pass them to the workers within the pool. In other words: the objects are copied, and this I/O operation becomes a bottleneck because it is more time consuming than the actual calculation.
I have already tried to work on the same object using a multiprocessing.Manager, but this resulted in even higher runtimes.
The problem is that I am bound to a specific class structure (here represented through RandomNumber()) which I cannot change.
Now my question is: are there any ways or concepts to circumvent this behaviour and run my calls to do_something() without the overhead of serialization or copying?
Any hints are welcome. Thanks in advance!

You need to use batch processing. Do not create and destroy workers for each number.
Create a limited number of workers based on cpu_count, then pass a list to each worker and let it process the whole batch. Use map and pass it a list containing batches of numbers.
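A minimal sketch of that batching idea, reusing the RandomNumber and wrapper definitions from the question (the batch size of 3 is an arbitrary choice for illustration; note that each batch is still serialized once when it is sent to a worker):

import pathos.multiprocessing as mp


def process_batch(batch):
    # each worker processes a whole batch instead of a single object
    return [wrapper(number) for number in batch]


if __name__ == '__main__':
    numbers = [RandomNumber() for _ in range(9)]
    batch_size = 3  # illustrative choice
    batches = [numbers[i:i + batch_size]
               for i in range(0, len(numbers), batch_size)]

    pool = mp.Pool(processes=mp.cpu_count())
    batched_results = pool.map(process_batch, batches)
    pool.close()
    pool.join()

    # flatten the per-batch results back into a single list
    my_results = [obj.foo for batch in batched_results for obj in batch]
    print(my_results)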

I have found a solution using multiprocessing or multithreading from the concurrent.futures library which does not require pickling the objects. In my case, multithreading with ThreadPoolExecutor brings a clear advantage over multiprocessing via ProcessPoolExecutor.
import time
from random import randrange
import concurrent.futures as cf


class RandomNumber():
    def __init__(self, object_size=100):
        self.size = bytearray(object_size * 10**6)  # 100 MB size
        self.foo = None

    def do_something(self, *args, **kwargs):
        self.foo = randrange(1, 10)
        time.sleep(0.5)  # wait for 0.5 seconds
        return self


def wrapper(random_number, *args, **kwargs):
    return random_number.do_something(*args, **kwargs)


if __name__ == '__main__':
    # create data
    numbers = [RandomNumber() for m in range(0, 100)]
    kwds = {'add': randrange(1, 10)}
    # run
    with cf.ThreadPoolExecutor(max_workers=3) as executor:
        result = executor.map(wrapper, numbers, timeout=5*60)
    # print result
    my_results = [i.foo for i in result]
    print(my_results)
yields:
[3, 3, 1, 1, 3, 7, 7, 6, 7, 5, 9, 5, 6, 5, 6, 9, 1, 5, 1, 7, 5, 3, 6, 2, 9, 2, 1, 2, 5, 1, 7, 9, 2, 9, 4, 9, 8, 5, 2, 1, 7, 8, 5, 1, 4, 5, 8, 2, 2, 5, 3, 6, 3, 2, 5, 3, 1, 9, 6, 7, 2, 4, 1, 5, 4, 4, 4, 9, 3, 1, 5, 6, 6, 8, 4, 4, 8, 7, 5, 9, 7, 8, 6, 2, 3, 1, 7, 2, 4, 8, 3, 6, 4, 1, 7, 7, 3, 4, 1, 2]
real 0m21.100s
user 0m1.100s
sys 0m2.896s
Nonetheless, this still leads to excessive memory use when there are too many objects (here numbers), and it does not prevent this by falling back into some "batch mode" when memory has to be swapped, i.e. the system freezes until the task has finished.
Any hints on how to prevent this?

Related

How to (log) transform *args arguments without losing structure

I am attempting to apply statistical tests to some datasets with variable numbers of groups. This causes a problem when I try to perform a log transformation for said groups while maintaining the ability to perform the test function (in this case scipy's kruskal()), which takes a variable number of arguments, one for each group of data.
The code below is an idea of what I want. Naturally stats.kruskal([np.log(i) for i in args]) does not work, as kruskal() does not expect a list of arrays, but one argument for each array. How do I perform log transformation (or any kind of alteration, really), while still being able to use the function?
import scipy.stats as stats
import numpy as np


def t(*args):
    test = stats.kruskal([np.log(i) for i in args])
    return test


a = [11, 12, 4, 42, 12, 1, 21, 12, 6]
b = [1, 12, 4, 3, 14, 8, 8, 6]
c = [2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8]
print(t(a, b, c))
IIUC, * in front of the list you are forming while calling kruskal should do the trick:
test = stats.kruskal(*[np.log(i) for i in args])
The asterisk unpacks the list and passes each entry of the list as a separate argument to the function being called, i.e. kruskal here.
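Put together, the corrected version of the function from the question would look something like this (same data as above; the only change is the unpacking asterisk):

import scipy.stats as stats
import numpy as np


def t(*args):
    # unpack the log-transformed groups so kruskal receives one argument per group
    return stats.kruskal(*[np.log(i) for i in args])


a = [11, 12, 4, 42, 12, 1, 21, 12, 6]
b = [1, 12, 4, 3, 14, 8, 8, 6]
c = [2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8]
print(t(a, b, c))  # prints a KruskalResult with a statistic and a p-value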

Dataframe with fixed length (over writing)

I am writing code that generates a massive amount of data in each round, so I only need to keep the data from the last 10 rounds. How can I create a dataframe which erases the oldest object when I add a new object (over-writing)? The order of observations, from old to new, should be maintained. Is there any simple function or data format to do this?
Thanks in advance!
You could use this function:
def ins(arr, item):
    if len(arr) < 10:
        arr.insert(0, item)
    else:
        arr.pop()
        arr.insert(0, item)


ex = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ins(ex, 'a')
print(ex)
# ['a', 1, 2, 3, 4, 5, 6, 7, 8, 9]
ins(ex, 'b')
print(ex)
# ['b', 'a', 1, 2, 3, 4, 5, 6, 7, 8]
In order for this to work you MUST pass a list as the argument to ins(), so that the new item is inserted and the 10th one is removed (if there is one).
(I assumed that the question is not pandas-specific, but rather about a way to store a maximum number of items in an array.)
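As an alternative to a hand-rolled function, the standard library's collections.deque with a maxlen argument gives the same fixed-length behaviour out of the box; a minimal sketch mirroring the example above:

from collections import deque

# a deque with maxlen=10 silently drops the item at the opposite end once it is full
ex = deque([1, 2, 3, 4, 5, 6, 7, 8, 9], maxlen=10)

ex.appendleft('a')
print(list(ex))
# ['a', 1, 2, 3, 4, 5, 6, 7, 8, 9]

ex.appendleft('b')
print(list(ex))
# ['b', 'a', 1, 2, 3, 4, 5, 6, 7, 8]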

Rolling dice with pygal

Why is there a +1 in the code below?
from die import Die
# from the file die.py import the class Die

# create a D6.
die = Die()

# make some rolls, and store results in a list.
results = []
for roll_num in range(100):
    result = die.roll()  # file die.py and function roll in that file.
    results.append(result)  # adding to results every time that a die rolls. die.roll()

# Analyze the results.
frequencies = []
for value in range(1, die.num_sides+1):
    frequency = results.count(value)
    frequencies.append(frequency)
print(frequencies)
I am able to run this code, but I don't know why that +1 is there.
The +1 is needed because range stops one before its stop value, so to include die.num_sides itself the stop has to be die.num_sides + 1. From the example in the documentation:
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(1, 11))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
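Applied to the die example (assuming Die defaults to num_sides = 6, as in the usual die.py from this exercise), the difference the +1 makes is:

num_sides = 6

# without the +1 the highest side would never be counted
print(list(range(1, num_sides)))      # [1, 2, 3, 4, 5]

# with the +1 every possible roll value from 1 to num_sides is covered
print(list(range(1, num_sides + 1)))  # [1, 2, 3, 4, 5, 6]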

Does multiprocessing.pool.map erase map object?

When I apply multiprocessing.pool.map to a list object, the list object is not affected:
from multiprocessing import Pool


def identity(x):
    return x


num_list = list(range(0, 10))

print("before multiprocessing:")
with Pool(10) as p:
    print(p.map(identity, num_list))

print("after multiprocessing:")
print(list(num_list))
prints
before multiprocessing:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
after multiprocessing:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
But when I apply multiprocessing.pool.map to a map object, it seems to get erased:
from multiprocessing import Pool


def identity(x):
    return x


num_list = list(range(0, 10))
num_list = map(identity, num_list)

print("before multiprocessing:")
with Pool(10) as p:
    print(p.map(identity, num_list))

print("after multiprocessing:")
print(list(num_list))
prints
before multiprocessing:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
after multiprocessing:
[]
The only difference is num_list = map(identity, num_list).
Is num_list (a map object) erased by multiprocessing.pool.map?
I'm not sure about this but I couldn't find another explanation.
The map function returns an iterator. After p.map() has traversed the last element of the map object, accessing it again yields nothing more. That is simply how iterators work: they can be consumed only once, so the object is exhausted rather than erased.
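The same behaviour can be reproduced without multiprocessing at all, which shows that Pool.map is not erasing anything; it simply consumes the iterator, leaving it exhausted:

num_list = map(lambda x: x, range(10))

# the first full traversal consumes the iterator ...
print(list(num_list))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# ... so a second traversal finds nothing left
print(list(num_list))  # []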

Printing top n distinct values of a list

I want to print the top 10 distinct elements from a list:
top = 10
test = [1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
for i in range(0, top):
    if test[i] == 1:
        top = top + 1
    else:
        print(test[i])
It is printing:
2,3,4,5,6,7,8
I am expecting:
2,3,4,5,6,7,8,9,10,11
What am I missing?
Using numpy
import numpy as np

top = 10
test = [1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
test = np.unique(np.array(test))
test[test != 1][:top]
Output
array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Your code only executes the loop 10 times, because range(0, top) is evaluated once at the start, so increasing top inside the loop has no effect. The first 3 iterations are spent skipping the 1s, so only the following 7 values (2 through 8) get printed, which is exactly what happens here.
If you want to print the top 10 distinct values, I recommend doing this:
# The code of unique is taken from "Removing duplicates in lists"
# (https://stackoverflow.com/questions/7961363/removing-duplicates-in-lists)
def unique(l):
    return list(set(l))


def print_top_unique(List, top):
    ulist = unique(List)
    for i in range(0, top):
        print(ulist[i])


print_top_unique([1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], 10)
My Solution
test = [1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
uniqueList = [num for num in set(test)]  # creates a list of unique values [1, 2, 3, ..., 13]
for num in range(0, 11):
    if uniqueList[num] != 1:  # skips one, since you wanted to start with two
        print(uniqueList[num])
