Passing result-list from Multiprocessing Manager outside scope - python

I'm trying to scan domains for pentesting purposes. The program uses multiprocessing, and the result list has to be passed from the worker processes back to the main function.
I have tried global variables (both a plain global and a class attribute), but that only reminded me that each process lives in its own memory space. So I'm using manager.list() instead, to share the list between processes.
Here's what I've tried:
from multiprocessing import Process, cpu_count, Manager
import requests

class variably:
    variably = bla..
    ....

def engine(domainlist, R):
    for domain in domainlist:
        try:
            r = requests.get("http://" + domain, headers=headers, timeout=0.7, allow_redirects=False)
            if r.status_code == expected_response:
                print("Success " + domain)
                print(domain, file=open("LazyWritesForDebugPurposes.txt", "a"))
                R.append(str(domain))
            elif r.status_code != expected_response:
                print("Failed " + domain + " " + str(r.status_code))
        except:
            pass
def fromtext():
    ....
    R = []
    with Manager() as manager:
        num_cpus = cpu_count()
        processes = []
        R = manager.list()
        for process_num in range(num_cpus):
            section = domainlist[process_num::num_cpus]
            p = Process(target=engine, args=(section, R,))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
        print(R)
    print(R)
    print("")
    print(" Total of Domains Queried : " + colors.RED_BG + " " + str(len(R)) + " " + colors.ENDC)
    if len(inf.result_success) >= 0:
        print(" Successfull Result : " + colors.GREEN_BG + " " + str(len(R)) + " " + colors.ENDC)

fromtext()
Sorry for any invalid syntax or indentation; I tried to condense the code into a shorter snippet.
The code above sometimes returns BrokenPipeError together with ConnectionRefusedError.
From the traceback I can see that the list has already been appended to, e.g. ['Domain.com', 'Domain_2.com'], but it still raises an exception.
EDIT:
It looks like the list can only be used inside the Manager() scope. How can I pass the data outside that scope, for example to use the list in a different function? The code below works:
with Manager() as manager:
    num_cpus = cpu_count()
    processes = []
    R = manager.list()
    for process_num in range(num_cpus):
        section = domainlist[process_num::num_cpus]
        p = Process(target=engine, args=(section, R,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
    str(len(R))

You really want to use a queue. Create a multiprocessing.SimpleQueue in your main thread and pass it to all your subprocesses. They can add items to this queue.
Creating your own manager is almost always a mistake.
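For example, here is a minimal sketch of that approach (the HTTP check from the question is replaced by a placeholder so the snippet is self-contained, and the engine/domainlist names are simply reused from the question). Each worker puts its hits on the queue plus a None sentinel when it finishes, so the main process can drain the queue before joining:
from multiprocessing import Process, SimpleQueue, cpu_count

def engine(domainlist, queue):
    for domain in domainlist:
        if domain.startswith("good"):  # placeholder for the real requests.get() check
            queue.put(domain)
    queue.put(None)  # sentinel: this worker is done

if __name__ == "__main__":
    domainlist = ["good1.com", "bad.com", "good2.com"]
    queue = SimpleQueue()
    num_workers = cpu_count()
    workers = [Process(target=engine, args=(domainlist[i::num_workers], queue))
               for i in range(num_workers)]
    for p in workers:
        p.start()

    results, finished = [], 0
    while finished < num_workers:  # drain until every worker has signalled
        item = queue.get()
        if item is None:
            finished += 1
        else:
            results.append(item)

    for p in workers:
        p.join()
    print(results)  # a plain list, usable anywhere in the program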

The problem with this is that I had to explicitly convert the manager.list() into a normal list by assigning a new variable and using global to make it usable in another function. I know it's dirty, and I haven't tried Queue() yet, but for now at least it's working.
def executor():
    global R
    ....
    with Manager() as manager:
        num_cpus = cpu_count()
        processes = []
        R = manager.list()
        for process_num in range(num_cpus):
            section = domainlist[process_num::num_cpus]
            p = Process(target=engine, args=(section, R,))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
        R = list(R)
    print(R)
If there's a way to simplify this, I would really appreciate it.
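One possible simplification (just a sketch, with a stub engine standing in for the real scanner): copy the proxy into a plain list while the manager is still alive and return it from the function, so no global is needed.
from multiprocessing import Process, Manager, cpu_count

def engine(section, shared):
    # stand-in for the real scan; appends every domain it is given
    for domain in section:
        shared.append(domain)

def executor(domainlist):
    with Manager() as manager:
        shared = manager.list()
        num_cpus = cpu_count()
        processes = []
        for i in range(num_cpus):
            p = Process(target=engine, args=(domainlist[i::num_cpus], shared))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
        return list(shared)  # materialise the proxy before the manager shuts down

if __name__ == "__main__":
    print(executor(["a.com", "b.com", "c.com"]))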

Related

Why are my CPUs not running at 100% while I'm using multiprocessing?

I'm learning Python multiprocessing and I tried this code:
from multiprocessing import Process

def f(name):
    n = 0
    print('running ', name)
    for i in range(1000):
        for j in range(1000):
            for k in range(100):
                n += 1
    print(n)

if __name__ == '__main__':
    tab = []
    for i in range(10):
        p = Process(target=f, args=(i,))
        p.start()
        tab.append(p)
    for t in tab:
        t.join()
It worked well, and I saw in the monitor that the CPUs were all running at 100%.
But when I switched to my application, which uses a shared variable for multiprocessing, I saw that the CPUs were all running at 40-50% and execution times are worse than without multiprocessing.
def repeat_optimization(self, cycle, nb_cycles, res):
    start = time.time()
    res['a'].append(a())
    res['b'].append(b())
    end = time.time()
    print("Benchmark iteration : " + str(cycle + 1) + "/" + str(nb_cycles) + " completed in " + str(round(end - start, 2)) + "s")

def parallelize_repeat_optimization(self, nb_cycles):
    manager = Manager()
    res = manager.dict()
    res['a'] = []
    res['b'] = []
    tab = []
    for cycle in range(nb_cycles):
        p = Process(target=self.repeat_optimization, args=(cycle, nb_cycles, res))
        p.start()
        tab.append(p)
    for t in tab:
        t.join()
    return res
What's wrong with my code, since I don't see any major difference from the first example?
In any multiprocessing situation you will always have higher utilization if each process is completely independent. In this case, there are hidden costs to using a Manager to synchronize the processes. Under the hood, the processes have to communicate with each other to make sure they do not overwrite each others' changes, and this means each process wastes a lot of time waiting for responses from the manager.
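To illustrate, here is a minimal sketch of the independent version (the a() and b() calls are replaced by placeholder CPU-bound work): each task simply returns its values, a Pool collects them once per task over a pipe, and the dict is built only at the end in the parent, so there is no per-append synchronization.
import time
from multiprocessing import Pool

def one_cycle(cycle):
    start = time.time()
    a_val = sum(i * i for i in range(10**6))  # placeholder for a()
    b_val = sum(i * 3 for i in range(10**6))  # placeholder for b()
    print("cycle %d completed in %.2fs" % (cycle, time.time() - start))
    return a_val, b_val

if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(one_cycle, range(8))
    res = {'a': [a for a, _ in results], 'b': [b for _, b in results]}
    print(res)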

Using python Multiprocessing Pool for accessing split files, search bytestream and then concat the values

I am new to multiprocessing. I have a large binary file which I've split. I want to search for specific values (here 1002 and 1376) in each file, collect their counts in a dictionary, and return the total count. But I can't find the correct way to do it; the definition of f1 below doesn't seem right. What I want is for one process to use all 3 parameters of one tuple as the parameters of f1, and another process to use the other tuple, each returning the count of each number in a dictionary, so that the processes may return {1002: 40, 1376: 33} and {1002: 24, 1376: 65}.
from collections import defaultdict
from multiprocessing import Pool

def f1(file_name, num1, num2):
    Vdict = defaultdict(int)
    with open(file_name, 'rb') as f:
        b = f.read(188)
        while len(b) > 0:
            ...
            # compare: if b contains num1 or num2, then
            # Vdict[num1] += 1 or Vdict[num2] += 1
            b = f.read(188)  # read the next 188-byte block
    return Vdict

if __name__ == "__main__":
    pool = Pool()
    numbers = [('file_1.ts', 1002, 1376), ('file_2.ts', 1002, 1376)]
    result2 = pool.starmap(f1, numbers)
    print(result2)
    pool.close()
    pool.join()
    print('done!')
Use a common data structure (see result)
import multiprocessing

def worker(procnum, return_dict):
    """worker function"""
    print(str(procnum) + " sharing-state!")
    return_dict[procnum] = procnum

if __name__ == "__main__":
    # see https://docs.python.org/3/library/multiprocessing.html#sharing-state-between-processes
    manager = multiprocessing.Manager()
    result = manager.dict()
    jobs = []
    for i in range(3):
        p = multiprocessing.Process(target=worker, args=(i, result))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    print(result.values())
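If you prefer to stay with the Pool.starmap route from the question, a short sketch (assuming f1 is the counting function shown above) is to let each call return its own dict and merge the per-file dicts afterwards, for example with collections.Counter:
from collections import Counter
from multiprocessing import Pool

if __name__ == "__main__":
    numbers = [('file_1.ts', 1002, 1376), ('file_2.ts', 1002, 1376)]
    with Pool() as pool:
        per_file = pool.starmap(f1, numbers)  # e.g. [{1002: 40, 1376: 33}, {1002: 24, 1376: 65}]
    total = sum((Counter(d) for d in per_file), Counter())
    print(per_file)
    print(dict(total))  # combined counts across both files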

How to return data from a function called by multiprocessing.Process? (Python3) [duplicate]

In the example code below, I'd like to get the return value of the function worker. How can I go about doing this? Where is this value stored?
Example Code:
import multiprocessing

def worker(procnum):
    '''worker function'''
    print(str(procnum) + ' represent!')
    return procnum

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker, args=(i,))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    print(jobs)
Output:
0 represent!
1 represent!
2 represent!
3 represent!
4 represent!
[<Process(Process-1, stopped)>, <Process(Process-2, stopped)>, <Process(Process-3, stopped)>, <Process(Process-4, stopped)>, <Process(Process-5, stopped)>]
I can't seem to find the relevant attribute in the objects stored in jobs.
Use a shared variable to communicate. For example like this:
import multiprocessing

def worker(procnum, return_dict):
    """worker function"""
    print(str(procnum) + " represent!")
    return_dict[procnum] = procnum

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker, args=(i, return_dict))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    print(return_dict.values())
I think the approach suggested by @sega_sai is the better one. But it really needs a code example, so here goes:
import multiprocessing
from os import getpid

def worker(procnum):
    print('I am number %d in process %d' % (procnum, getpid()))
    return getpid()

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=3)
    print(pool.map(worker, range(5)))
Which will print the return values:
I am number 0 in process 19139
I am number 1 in process 19138
I am number 2 in process 19140
I am number 3 in process 19139
I am number 4 in process 19140
[19139, 19138, 19140, 19139, 19140]
If you are familiar with map (the Python 2 built-in) this should not be too challenging. Otherwise have a look at sega_Sai's link.
Note how little code is needed. (Also note how processes are re-used).
For anyone else who is seeking how to get a value from a Process using Queue:
import multiprocessing

ret = {'foo': False}

def worker(queue):
    ret = queue.get()
    ret['foo'] = True
    queue.put(ret)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    queue.put(ret)
    p = multiprocessing.Process(target=worker, args=(queue,))
    p.start()
    p.join()
    print(queue.get())  # Prints {"foo": True}
Note that on Windows or in a Jupyter Notebook, with multiprocessing you have to save this as a file and execute the file. If you run it in an interactive prompt you will see an error like this:
AttributeError: Can't get attribute 'worker' on <module '__main__' (built-in)>
For some reason, I couldn't find a general example of how to do this with Queue anywhere (even Python's doc examples don't spawn multiple processes), so here's what I got working after like 10 tries:
from multiprocessing import Process, Queue

def add_helper(queue, arg1, arg2):  # the func called in child processes
    ret = arg1 + arg2
    queue.put(ret)

def multi_add():  # spawns child processes
    q = Queue()
    processes = []
    rets = []
    for _ in range(0, 100):
        p = Process(target=add_helper, args=(q, 1, 2))
        processes.append(p)
        p.start()
    for p in processes:
        ret = q.get()  # will block
        rets.append(ret)
    for p in processes:
        p.join()
    return rets
Queue is a blocking, thread-safe queue that you can use to store the return values from the child processes. So you have to pass the queue to each process. Something less obvious here is that you have to get() from the queue before you join the Processes or else the queue fills up and blocks everything.
Update for those who are object-oriented (tested in Python 3.4):
from multiprocessing import Process, Queue

class Multiprocessor():

    def __init__(self):
        self.processes = []
        self.queue = Queue()

    @staticmethod
    def _wrapper(func, queue, args, kwargs):
        ret = func(*args, **kwargs)
        queue.put(ret)

    def run(self, func, *args, **kwargs):
        args2 = [func, self.queue, args, kwargs]
        p = Process(target=self._wrapper, args=args2)
        self.processes.append(p)
        p.start()

    def wait(self):
        rets = []
        for p in self.processes:
            ret = self.queue.get()
            rets.append(ret)
        for p in self.processes:
            p.join()
        return rets

# tester
if __name__ == "__main__":
    mp = Multiprocessor()
    num_proc = 64
    for _ in range(num_proc):  # queue up multiple tasks running `sum`
        mp.run(sum, [1, 2, 3, 4, 5])
    ret = mp.wait()  # get all results
    print(ret)
    assert len(ret) == num_proc and all(r == 15 for r in ret)
This example shows how to use a list of multiprocessing.Pipe instances to return strings from an arbitrary number of processes:
import multiprocessing

def worker(procnum, send_end):
    '''worker function'''
    result = str(procnum) + ' represent!'
    print(result)
    send_end.send(result)

def main():
    jobs = []
    pipe_list = []
    for i in range(5):
        recv_end, send_end = multiprocessing.Pipe(False)
        p = multiprocessing.Process(target=worker, args=(i, send_end))
        jobs.append(p)
        pipe_list.append(recv_end)
        p.start()
    for proc in jobs:
        proc.join()
    result_list = [x.recv() for x in pipe_list]
    print(result_list)

if __name__ == '__main__':
    main()
Output:
0 represent!
1 represent!
2 represent!
3 represent!
4 represent!
['0 represent!', '1 represent!', '2 represent!', '3 represent!', '4 represent!']
This solution uses fewer resources than a multiprocessing.Queue which uses
a Pipe
at least one Lock
a buffer
a thread
or a multiprocessing.SimpleQueue which uses
a Pipe
at least one Lock
It is very instructive to look at the source for each of these types.
It seems that you should use the multiprocessing.Pool class instead and use the methods .apply(), .apply_async(), and .map().
http://docs.python.org/library/multiprocessing.html?highlight=pool#multiprocessing.pool.AsyncResult
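For instance, a minimal sketch using apply_async() and its AsyncResult objects to fetch the return values (the worker here is just an illustrative doubling function):
from multiprocessing import Pool

def worker(procnum):
    return procnum * 2

if __name__ == '__main__':
    with Pool(processes=3) as pool:
        async_results = [pool.apply_async(worker, (i,)) for i in range(5)]
        print([r.get() for r in async_results])  # [0, 2, 4, 6, 8]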
You can use the exit built-in to set the exit code of a process. It can be obtained from the exitcode attribute of the process:
import multiprocessing

def worker(procnum):
    print(str(procnum) + ' represent!')
    exit(procnum)

if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker, args=(i,))
        jobs.append(p)
        p.start()
    result = []
    for proc in jobs:
        proc.join()
        result.append(proc.exitcode)
    print(result)
Output:
0 represent!
1 represent!
2 represent!
3 represent!
4 represent!
[0, 1, 2, 3, 4]
The pebble package has a nice abstraction leveraging multiprocessing.Pipe which makes this quite straightforward:
from pebble import concurrent

@concurrent.process
def function(arg, kwarg=0):
    return arg + kwarg

future = function(1, kwarg=1)
print(future.result())
Example from: https://pythonhosted.org/Pebble/#concurrent-decorators
Thought I'd simplify the simplest examples copied from above, working for me on Py3.6. Simplest is multiprocessing.Pool:
import multiprocessing
import time

def worker(x):
    time.sleep(1)
    return x

pool = multiprocessing.Pool()
print(pool.map(worker, range(10)))
You can set the number of processes in the pool with, e.g., Pool(processes=5). However it defaults to CPU count, so leave it blank for CPU-bound tasks. (I/O-bound tasks often suit threads anyway, as the threads are mostly waiting so can share a CPU core.) Pool also applies chunking optimization.
(Note that the worker method cannot be nested within a method. I initially defined my worker method inside the method that makes the call to pool.map, to keep it all self-contained, but then the processes couldn't import it, and threw "AttributeError: Can't pickle local object outer_method..inner_method". More here. It can be inside a class.)
(Appreciate the original question specified printing 'represent!' rather than time.sleep(), but without it I thought some code was running concurrently when it wasn't.)
Py3's ProcessPoolExecutor is also two lines (.map returns a generator so you need the list()):
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    print(list(executor.map(worker, range(10))))
With plain Processes:
import multiprocessing
import time

def worker(x, queue):
    time.sleep(1)
    queue.put(x)

queue = multiprocessing.SimpleQueue()
tasks = range(10)

for task in tasks:
    multiprocessing.Process(target=worker, args=(task, queue,)).start()

for _ in tasks:
    print(queue.get())
Use SimpleQueue if all you need is put and get. The first loop starts all the processes, before the second makes the blocking queue.get calls. I don't think there's any reason to call p.join() too.
If you are using Python 3, you can use concurrent.futures.ProcessPoolExecutor as a convenient abstraction:
from concurrent.futures import ProcessPoolExecutor

def worker(procnum):
    '''worker function'''
    print(str(procnum) + ' represent!')
    return procnum

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        print(list(executor.map(worker, range(5))))
Output:
0 represent!
1 represent!
2 represent!
3 represent!
4 represent!
[0, 1, 2, 3, 4]
A simple solution:
import multiprocessing

output = []
data = range(0, 10)

def f(x):
    return x**2

def handler():
    p = multiprocessing.Pool(64)
    r = p.map(f, data)
    return r

if __name__ == '__main__':
    output.append(handler())
    print(output[0])
Output:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
You can use ProcessPoolExecutor to get a return value from a function as shown below:
from concurrent.futures import ProcessPoolExecutor

def test(num1, num2):
    return num1 + num2

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        feature = executor.submit(test, 2, 3)
        print(feature.result())  # 5
I modified vartec's answer a bit since I needed to get the error codes from the function. (Thanks vartec!!! It's an awesome trick.)
This can also be done with a manager.list, but I think it's better to have it in a dict and store a list within it. That way we keep the function and the results, since we can't be sure of the order in which the list will be populated.
from multiprocessing import Process
import time
import datetime
import multiprocessing

def func1(fn, m_list):
    print('func1: starting')
    time.sleep(1)
    m_list[fn] = "this is the first function"
    print('func1: finishing')
    # return "func1"  # no need for return since Multiprocess doesnt return it =(

def func2(fn, m_list):
    print('func2: starting')
    time.sleep(3)
    m_list[fn] = "this is function 2"
    print('func2: finishing')
    # return "func2"

def func3(fn, m_list):
    print('func3: starting')
    time.sleep(9)
    # if it fails it won't join the rest because it never populates the dict,
    # or do a try/except to get something in return.
    raise ValueError("failed here")
    # if we want to get the error in the manager dict we can catch the error
    try:
        raise ValueError("failed here")
        m_list[fn] = "this is third"
    except:
        m_list[fn] = "this is third and it fail horrible"
    # print('func3: finishing')
    # return "func3"

def runInParallel(*fns):  # * is to accept any input in list
    start_time = datetime.datetime.now()
    proc = []
    manager = multiprocessing.Manager()
    m_list = manager.dict()
    for fn in fns:
        # print(fn)
        # print(dir(fn))
        p = Process(target=fn, name=fn.__name__, args=(fn, m_list))
        p.start()
        proc.append(p)
    for p in proc:
        p.join()  # optionally pass a timeout, e.g. p.join(5)
    print(datetime.datetime.now() - start_time)
    return m_list, proc

if __name__ == '__main__':
    manager, proc = runInParallel(func1, func2, func3)
    # print(dir(proc[0]))
    # print(proc[0]._name)
    # print(proc[0].name)
    # print(proc[0].exitcode)
    # here you can check what failed
    for i in proc:
        print(i.name, i.exitcode)  # name was set in the Process(...) call above
    # this will only show the functions that worked and were able to populate
    # the manager dict
    for i, j in manager.items():
        print(dir(i))  # things you can do to the function
        print(i, j)

concurrent requests with queue and thread

I am trying to make concurrent API calls with Python.
I based my code on the solution (first answer) presented in this thread: What is the fastest way to send 100,000 HTTP requests in Python?
Currently, my code is broken.
I have a main function which creates the queue, populates it, initiates the threads, starts them, and joins the queue.
I also have a target function which should make the GET requests to the API.
The difficulty I am experiencing right now is that the target function does not do the necessary work. The target is called, but it acts as if the queue is empty. The first print is executed ("inside scraper worker"), while the second ("inside scraper worker, queue not empty") is not.
from queue import Queue
from threading import Thread
import requests

def main_scraper(flights):
    print("main scraper was called, got: ")
    print(flights)
    data = []
    q = Queue()
    map(q.put, flights)
    for i in range(0, 5):
        t = Thread(target=scraper_worker, args=(q, data))
        t.daemon = True
        t.start()
    q.join()
    return data

def scraper_worker(q, data):
    print("inside scraper worker")
    while not q.empty():
        print("inside scraper worker, queue not empty")
        f = q.get()
        url = kiwi_url(f)
        response = requests.get(url)
        response_data = response.json()
        results = parseResults(response_data)
        q.task_done()
        print("task done. results:")
        print(results)
        # f._price = results[0]["price"]
        # f._url = results[0]["deep_link"]
        data.append(results)
    return data
I hope this is enough information for you to help me out.
Otherwise, I will rewrite the code so that it can be run by anyone.
I would guess that the flights are not being put on the queue. map(q.put, flights) is lazy, and is never accessed so it is as if it didn't happen. I would just iterate.
def main_scraper(flights):
    print("main scraper was called, got: ")
    print(flights)
    data = []
    q = Queue()
    for flight in flights:
        q.put(flight)
    for i in range(0, 5):
        t = Thread(target=scraper_worker, args=(q, data))
        t.daemon = True
        t.start()
    q.join()
    return data

python - multiprocessing with queue

Here is my code below. I put strings in a queue and hope dowork2 will do some work and return a character in shared_queue, but I always get nothing at while not shared_queue.empty().
Please give me some pointers, thanks.
import time
import multiprocessing as mp

class Test(mp.Process):

    def __init__(self, **kwargs):
        mp.Process.__init__(self)
        self.daemon = False
        print('dosomething')

    def run(self):
        manager = mp.Manager()
        queue = manager.Queue()
        shared_queue = manager.Queue()
        # shared_list = manager.list()
        pool = mp.Pool()
        results = []
        results.append(pool.apply_async(self.dowork2, (queue, shared_queue)))
        while True:
            time.sleep(0.2)
            t = time.time()
            queue.put('abc')
            queue.put('def')
            l = ''
            while not shared_queue.empty():
                l = l + shared_queue.get()
            print(l)
            print('%.4f' % (time.time() - t))
        pool.close()
        pool.join()

    def dowork2(queue, shared_queue):
        while True:
            path = queue.get()
            shared_queue.put(path[-1:])

if __name__ == '__main__':
    t = Test()
    t.start()
    # t.join()
    # t.run()
I managed to get it to work by moving your dowork2 outside the class. If you declare dowork2 as a function before the Test class and call it as
results.append(pool.apply_async(dowork2, (queue, shared_queue)))
it works as expected. I am not 100% sure, but it probably goes wrong because your Test class is already subclassing Process. Now when your pool creates a subprocess and initialises the same class in the subprocess, something gets overridden somewhere.
Overall I wonder if Pool is really what you want to use here. Your worker seems to be in an infinite loop, indicating that you do not expect a return value from the worker, only the result in the return queue. If this is the case, you can remove the Pool.
I also managed to get it to work keeping your worker function within the class when I scrapped the Pool and replaced it with another subprocess:
foo = mp.Process(group=None, target=self.dowork2, args=(queue, shared_queue))
foo.start()
# results.append(pool.apply_async(Test.dowork2, (queue, shared_queue)))
while True:
....
(you need to add self to your worker, though, or declare it as a static method:)
def dowork2(self, queue, shared_queue):
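Putting the first suggestion together, here is a minimal sketch (dowork2 moved to module level, and the producer loop shortened to a few iterations plus pool.terminate() so the example actually exits; the rest follows the question's structure):
import time
import multiprocessing as mp

def dowork2(queue, shared_queue):
    while True:
        path = queue.get()
        shared_queue.put(path[-1:])

class Test(mp.Process):
    def run(self):
        manager = mp.Manager()
        queue = manager.Queue()
        shared_queue = manager.Queue()
        pool = mp.Pool()
        pool.apply_async(dowork2, (queue, shared_queue))
        for _ in range(3):
            t = time.time()
            queue.put('abc')
            queue.put('def')
            time.sleep(0.2)  # give the worker time to respond
            l = ''
            while not shared_queue.empty():
                l = l + shared_queue.get()
            print(l)
            print('%.4f' % (time.time() - t))
        pool.terminate()  # the worker loops forever, so close()/join() would hang

if __name__ == '__main__':
    t = Test()
    t.start()
    t.join()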
