When I apply multiprocessing.pool.map to a list object, the list is not affected:
from multiprocessing import Pool

def identity(x):
    return x

num_list = list(range(0, 10))

print("before multiprocessing:")
with Pool(10) as p:
    print(p.map(identity, num_list))

print("after multiprocessing:")
print(list(num_list))
prints
before multiprocessing:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
after multiprocessing:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
But when I apply multiprocessing.pool.map to a map object, it seems to get erased:
from multiprocessing import Pool

def identity(x):
    return x

num_list = list(range(0, 10))
num_list = map(identity, num_list)

print("before multiprocessing:")
with Pool(10) as p:
    print(p.map(identity, num_list))

print("after multiprocessing:")
print(list(num_list))
prints
before multiprocessing:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
after multiprocessing:
[]
The only difference is num_list = map(identity, num_list).
Does num_list (a map object) get erased by multiprocessing.pool.map?
I'm not sure about this but I couldn't find another explanation.
The map function returns an iterator. Once p.map() has traversed the last element of the map object, accessing it again returns nothing. That is simply how iterators behave.
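You can reproduce the same exhaustion without any multiprocessing at all; this is just a quick illustration of the iterator behaviour:

nums = map(lambda x: x, range(10))  # a map object, i.e. a lazy one-shot iterator
print(list(nums))  # the first traversal consumes it: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list(nums))  # a second traversal finds nothing left: []

If you need to iterate more than once, materialize the map object with list() first and keep that list.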
Related
Why is there a +1 in the code below?
from die import Die  # from the file die.py, import the class Die

# Create a D6.
die = Die()

# Make some rolls, and store the results in a list.
results = []
for roll_num in range(100):
    result = die.roll()     # roll() is defined on Die in die.py
    results.append(result)  # record the outcome of every roll

# Analyze the results.
frequencies = []
for value in range(1, die.num_sides + 1):
    frequency = results.count(value)
    frequencies.append(frequency)
print(frequencies)
I am able to run this code, but I don't know why that +1 is there.
From the example in the documentation:
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(1, 11))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
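range stops one step before its second argument, so the +1 makes the highest face value inclusive. A minimal illustration, assuming a six-sided die (num_sides = 6):

num_sides = 6
print(list(range(1, num_sides)))      # [1, 2, 3, 4, 5] -- face 6 would never be counted
print(list(range(1, num_sides + 1)))  # [1, 2, 3, 4, 5, 6] -- every face is counted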
I have implemented multiprocessing for a problem with larger objects, like the following:
import time
import pathos.multiprocessing as mp
from functools import partial
from random import randrange

class RandomNumber():
    def __init__(self, object_size=100):
        self.size = bytearray(object_size*10**6)  # 100 MB size
        self.foo = None

    def do_something(self, *args, **kwargs):
        self.foo = randrange(1, 10)
        time.sleep(0.5)  # wait for 0.5 seconds
        return self

def wrapper(random_number, *args, **kwargs):
    return random_number.do_something(*args, **kwargs)

if __name__ == '__main__':
    # create data
    numbers = [RandomNumber() for m in range(0, 9)]
    kwds = {'add': randrange(1, 10)}

    # calculate
    pool = mp.Pool(processes=mp.cpu_count())
    result = pool.map_async(partial(wrapper, **kwds), numbers)
    try:
        result = result.get()
    except:
        pass

    # print result
    my_results = [i.foo for i in result]
    print(my_results)

    pool.close()
    pool.join()
which yields something like:
[8, 7, 8, 3, 1, 2, 6, 4, 8]
Now the problem is that I see a massive performance improvement over a list comprehension when the objects are very small, but this improvement turns into the opposite with larger object sizes, e.g. 100 MB and above.
From the documentation and other questions I have discovered that this is caused by the use of pickle/dill to serialize each object in order to pass it to the workers within the pool. In other words: the objects are copied, and this I/O operation becomes a bottleneck because it is more time consuming than the actual calculation.
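As a rough way to confirm this (a sketch, not part of the original code; it reuses the RandomNumber class above and plain pickle rather than dill):

import pickle
import time

obj = RandomNumber()      # carries a ~100 MB bytearray
start = time.time()
data = pickle.dumps(obj)  # roughly what the pool must do for every task
print(len(data) / 10**6, "MB serialized in", time.time() - start, "seconds")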
I have already tried to work on the same object using a multiprocessing.Manager, but this resulted in even higher runtimes.
The problem is that I am bound to a specific class structure (here represented by RandomNumber()) which I cannot change.
Now my question is: are there any ways or concepts to circumvent this behaviour and run my calls to do_something() without the overhead of serialization or copying?
Any hints are welcome. Thanks in advance!
You need to use batch processing. Do not create and destroy workers for each number.
Create a limited number of workers based on cpu_count, then pass a list to each worker and process it there. Use map and pass a list containing batches of numbers.
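A minimal sketch of that batching idea, reusing the RandomNumber class and wrapper function from the question; process_batch and batch_size are illustrative names, not part of the original code:

from multiprocessing import Pool, cpu_count

def process_batch(batch):
    # each task now carries a whole batch, cutting the per-task dispatch overhead
    return [wrapper(obj) for obj in batch]

if __name__ == '__main__':
    numbers = [RandomNumber() for _ in range(9)]
    batch_size = 3
    batches = [numbers[i:i + batch_size] for i in range(0, len(numbers), batch_size)]
    with Pool(processes=cpu_count()) as pool:
        results = [obj for batch in pool.map(process_batch, batches) for obj in batch]
    print([r.foo for r in results])

Pool.map's chunksize argument achieves much the same effect without the manual slicing; note that the objects themselves are still pickled either way, so batching mainly reduces the per-task overhead.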
I have found a solution using multiprocessing or multithreading from the concurrent.futures library, which does not require pickling the objects. In my case, multithreading with ThreadPoolExecutor brings a clear advantage over multiprocessing via ProcessPoolExecutor.
import time
from random import randrange
import concurrent.futures as cf

class RandomNumber():
    def __init__(self, object_size=100):
        self.size = bytearray(object_size*10**6)  # 100 MB size
        self.foo = None

    def do_something(self, *args, **kwargs):
        self.foo = randrange(1, 10)
        time.sleep(0.5)  # wait for 0.5 seconds
        return self

def wrapper(random_number, *args, **kwargs):
    return random_number.do_something(*args, **kwargs)

if __name__ == '__main__':
    # create data
    numbers = [RandomNumber() for m in range(0, 100)]
    kwds = {'add': randrange(1, 10)}

    # run
    with cf.ThreadPoolExecutor(max_workers=3) as executor:
        result = executor.map(wrapper, numbers, timeout=5*60)

    # print result
    my_results = [i.foo for i in result]
    print(my_results)
yields:
[3, 3, 1, 1, 3, 7, 7, 6, 7, 5, 9, 5, 6, 5, 6, 9, 1, 5, 1, 7, 5, 3, 6, 2, 9, 2, 1, 2, 5, 1, 7, 9, 2, 9, 4, 9, 8, 5, 2, 1, 7, 8, 5, 1, 4, 5, 8, 2, 2, 5, 3, 6, 3, 2, 5, 3, 1, 9, 6, 7, 2, 4, 1, 5, 4, 4, 4, 9, 3, 1, 5, 6, 6, 8, 4, 4, 8, 7, 5, 9, 7, 8, 6, 2, 3, 1, 7, 2, 4, 8, 3, 6, 4, 1, 7, 7, 3, 4, 1, 2]
real 0m21.100s
user 0m1.100s
sys 0m2.896s
Nonetheless, this still leads to memory exhaustion in cases where I have too many objects (here numbers), and it does not prevent this by switching into some "batch mode" when memory has to be swapped, i.e. the system freezes until the task has finished.
Any hints on how to prevent this?
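One hedged option, assuming the objects can be created lazily: feed the executor in fixed-size chunks and keep only the small per-object results, so at most one chunk of large objects is alive at a time. process_in_chunks, chunk_size and the use of RandomNumber as a factory are illustrative, not taken from the original code:

import itertools
import concurrent.futures as cf

def process_in_chunks(make_object, total, chunk_size=10, max_workers=3):
    # build and run only chunk_size objects at a time; keep just the .foo values,
    # so the large bytearrays of a finished chunk can be garbage collected
    object_stream = (make_object() for _ in range(total))
    foos = []
    with cf.ThreadPoolExecutor(max_workers=max_workers) as executor:
        while True:
            chunk = list(itertools.islice(object_stream, chunk_size))
            if not chunk:
                break
            foos.extend(result.foo for result in executor.map(wrapper, chunk))
    return foos

print(process_in_chunks(RandomNumber, 100))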
I want to print the top 10 distinct elements from a list:
top = 10
test = [1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
for i in range(0, top):
    if test[i] == 1:
        top = top + 1
    else:
        print(test[i])
It is printing:
2,3,4,5,6,7,8
I am expecting:
2,3,4,5,6,7,8,9,10,11
What am I missing?
Using numpy
import numpy as np
top=10
test=[1,1,1,2,3,4,5,6,7,8,9,10,11,12,13]
test=np.unique(np.array(test))
test[test!=1][:top]
Output
array([ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Your code only executes the loop 10 times: range(0, top) is evaluated once, so incrementing top inside the loop does not add iterations. The first 3 iterations are spent skipping the 1s, so only the following 7 values are printed, which is exactly what happened here.
If you want to print the top 10 distinct values, I recommend doing it like this:
# The unique helper is taken from "Removing duplicates in lists"
# (https://stackoverflow.com/questions/7961363/removing-duplicates-in-lists)
def unique(l):
    return list(set(l))

def print_top_unique(List, top):
    ulist = unique(List)
    for i in range(0, top):
        print(ulist[i])

print_top_unique([1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], 10)
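Note that set() does not promise any particular order; for small integers like these it happens to come back sorted, which is why the example above looks ordered. If you want the order guaranteed, a small (hypothetical) variation of the helper sorts explicitly:

def unique_sorted(l):
    # sorted() guarantees ascending order regardless of how the set iterates
    return sorted(set(l))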
My Solution
test = [1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
uniqueList = [num for num in set(test)]  # a list of the unique numbers: [1, 2, 3, ..., 13]
for num in range(0, 11):
    if uniqueList[num] != 1:  # skip the 1, since you wanted to start with 2
        print(uniqueList[num])
I'm trying to make an iterator that prints the repeating sequence
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, ...
I want an iterator so I can use .next(), and I want it to loop around to 0 when .next() is called while the iterator is at 9.
But the thing is that I'll probably have a lot of these, so I don't just want to do itertools.cycle([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).
I don't want to have that many repeated lists of the same sequence in memory. I'd rather have each iterator apply (x + 1) % 10 and just increment x every time next is called. I can't seem to figure out how to do this with itertools. Is there a pythonic way of doing this?
You can write a generator that uses range:
def my_cycle(start, stop, step=1):
    while True:
        for x in range(start, stop, step):
            yield x

c = my_cycle(0, 10)
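For example, a quick check (assuming Python 3, where you call next(c) rather than c.next()):

print([next(c) for _ in range(12)])  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1]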
You can use your own custom generator:
def cycle_range(xr):
    while True:
        for x in xr:
            yield x
Assuming you are on Python 2, use:

r = xrange(10)
it1 = cycle_range(r)
it2 = cycle_range(r)

Both iterators share the same xrange object, for memory efficiency.
This is one way via itertools:
import itertools

def counter():
    for i in itertools.count():
        yield i % 10

g = counter()
You can use a custom generator like this:
def single_digit_ints():
    i = 0
    while True:
        yield i
        i = (i + 1) % 10

for i in single_digit_ints():
    # ...
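To peek at the output without writing an infinite loop, itertools.islice works well (just an illustrative check):

from itertools import islice
print(list(islice(single_digit_ints(), 12)))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1]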
I was getting confused by the purpose of "return" and "yield"
def countMoreThanOne():
    return (yy for yy in xrange(1,10,2))

def countMoreThanOne():
    yield (yy for yy in xrange(1,10,2))
What is the difference between the two functions above?
Is it impossible to access the content inside the function when using yield?
In the first you return a generator:

from itertools import chain

def countMoreThanOne():
    return (yy for yy in xrange(1,10,2))

print list(countMoreThanOne())
>>>
[1, 3, 5, 7, 9]
while in the second you are yielding a generator, so you get a generator within a generator:

def countMoreThanOne():
    yield (yy for yy in xrange(1,10,2))

print list(countMoreThanOne())
print list(chain.from_iterable(countMoreThanOne()))
[<generator object <genexpr> at 0x7f0fd85c8f00>]
[1, 3, 5, 7, 9]
If you use a list comprehension instead, the difference can be seen clearly. In the first:

def countMoreThanOne():
    return [yy for yy in xrange(1,10,2)]

print countMoreThanOne()
>>>
[1, 3, 5, 7, 9]
And in the second:

def countMoreThanOne1():
    yield [yy for yy in xrange(1,10,2)]

print countMoreThanOne1()
<generator object countMoreThanOne1 at 0x7fca33f70eb0>
>>>
After reading your other comments I think you should write the function like this:
def countMoreThanOne():
    return xrange(1, 10, 2)
>>> print countMoreThanOne()
xrange(1, 11, 2)
>>> print list(countMoreThanOne())
[1, 3, 5, 7, 9]
or even better, to have some point in making it a function:
def oddNumbersLessThan(stop):
    return xrange(1, stop, 2)
>>> print list(oddNumbersLessThan(15))
[1, 3, 5, 7, 9, 11, 13]