I am trying to build a system that will crawl a remote server continuously and download new files locally. I want crawling and downloading to be split into separate objects and separate threads. To keep track of the files found on the server and which ones still need to be downloaded, I will use a PriorityQueue.
Because the system will grow later with more tasks added, I need a main module that sits on top, and I need to instantiate the PriorityQueue in that main module. But I have a problem with how to share this PriorityQueue between the main module and the crawler.
Here is the code, ignoring the download part for now since it doesn't play into the problem yet: I can't figure out how to make the Crawler object "see" the queue object created in main.py.
main.py
import crawler
import threading
from queue import PriorityQueue

class Watchdog(object):
    def __init__(self):
        self.queue = PriorityQueue

    def setup(self):
        self.crawler = crawler.Crawler()

    def run(self):
        t1 = threading.Thread(target=self.crawler.crawl(), args=(<pysftp connection>,))
        t1.daemon = True
        t2.daemon = True
        t1.start()
        t2.start()
        t1.join()
        t2.join()
crawler.py
import pysftp

class Crawler(object):
    def __init__(self, connection):
        self.connection = connection

    def crawl(self):
        callbacks = pysftp.WTCallbacks()
        self.connection.walktree(rootdir, fcallback=callbacks.file_cb, dcallback=callbacks.dir_cb, ucallback=callbacks.unk_cb)
        for fpath in callbacks.flist:
            with queue.mutex:
                if not fpath in queue.queue:
                    queue.put(os.path.getmtime(fpath), fpath)
The problem is that I cannot figure out how to make the queue object I create in main.py reachable and shared in crawler.py. When I add the download task, it should also be able to see the queue object, and the queue needs to stay in sync across all modules, so that when a new file is added by the crawler, the downloader will immediately see it.
You need to use the multiprocessing module. Check the docs at https://docs.python.org/3.6/library/multiprocessing.html#exchanging-objects-between-processes
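For illustration, here is a minimal sketch of that pattern, with the crawl and download logic reduced to placeholders (note that multiprocessing.Queue is FIFO rather than a priority queue): the queue is created once in the parent and handed to each worker process as an argument, so both workers read from and write to the same queue.

import multiprocessing

def crawl(queue):
    # Placeholder for the real crawl logic: push discovered paths onto the
    # shared queue so the download worker can pick them up.
    queue.put((0, "/remote/path/example.txt"))

def download(queue):
    # Blocks until the crawler has put something on the queue.
    priority, fpath = queue.get()
    print("would download", fpath)

if __name__ == '__main__':
    queue = multiprocessing.Queue()  # shared between both processes
    p1 = multiprocessing.Process(target=crawl, args=(queue,))
    p2 = multiprocessing.Process(target=download, args=(queue,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()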
You could also use celery
I just started getting familiar with multiprocessing in Python and got stuck on a problem which I'm not able to solve the way I want, and I can't find any clear information on whether what I'm trying is even properly solvable.
What I'm trying to do is something similar to the following:
import time
from multiprocessing import Process, Event, Queue
from threading import Thread

class Main:
    def __init__(self):
        self.task_queue = Queue()
        self.process = MyProcess(self.task_queue)
        self.process.start()

    def execute_script(self, code):
        ProcessCommunication(code, self.task_queue).start()

class ProcessCommunication(Thread):
    def __init__(self, script, task_queue):
        super().__init__()
        self.script = script
        self.script_queue = task_queue
        self.script_end_event = Event()

    def run(self):
        self.script_queue.put((self.script, self.script_end_event))
        while not self.script_end_event.is_set():
            time.sleep(0.1)

class MyProcess(Process):

    class ExecutionThread(Thread):
        def __init__(self, code, end_event):
            super().__init__()
            self.code = code
            self.event = end_event

        def run(self):
            exec(compile(self.code, '<string>', 'exec'))
            self.event.set()

    def __init__(self, task_queue):
        super().__init__(name="TEST_PROCESS")
        self.task_queue = task_queue
        self.status = None

    def run(self):
        while True:
            if not self.task_queue.empty():
                script, end_event = self.task_queue.get()
                if script is None:
                    break
                self.ExecutionThread(script, end_event).start()
So I would like to have one separate process, running for the whole runtime of my main process, that executes user-written scripts in an environment with restricted user privileges and a restricted namespace. It should also protect the main process from potential endless loops without waiting times, which would load the CPU core too much.
Example code using this structure could look something like this:
if __name__ == '__main__':
    main_class = Main()
    main_class.execute_script("print(1)")
The main process can start several scripts simultaneously, and I would like to pass an event together with the execution request to the process, so that the main process gets notified whenever one of the scripts finishes.
However, multiprocessing queues do not accept events being passed through them and throw the following error:
'RuntimeError: Semaphore objects should only be shared between processes through inheritance'
Since I create a new event with every execution request, I can't pass them at instantiation of the Process.
I came up with one way to solve this: pass an identifier together with the code and set up another queue that is fed that identifier whenever the end_event would be set. However, using events seems much more elegant to me, and I wonder if there is a solution I did not think of yet.
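For reference, a minimal sketch of the identifier-based workaround described above (the names worker and done_queue and the uuid identifiers are illustrative, not taken from the original code):

import uuid
from multiprocessing import Process, Queue

def worker(task_queue, done_queue):
    # Instead of setting an Event that was passed through the queue, the
    # child reports completion by sending the task's identifier back.
    while True:
        task_id, code = task_queue.get()
        if code is None:
            break
        exec(compile(code, '<string>', 'exec'))
        done_queue.put(task_id)

if __name__ == '__main__':
    task_queue, done_queue = Queue(), Queue()
    p = Process(target=worker, args=(task_queue, done_queue))
    p.start()

    task_id = uuid.uuid4().hex
    task_queue.put((task_id, "print(1)"))
    print("finished:", done_queue.get() == task_id)

    task_queue.put((None, None))  # sentinel to stop the worker
    p.join()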
I'm looking into concurrency options for Python. Since I'm an iOS/macOS developer, I'd find it very useful if there were something like NSOperationQueue in Python.
Basically, it's a queue to which you can add operations (every operation is an Operation-derived class with a run method to implement), which are executed either serially or in parallel; ideally, various dependencies can be set on operations (i.e. that an operation depends on others having been executed before it can start).
Have you looked at Celery as an option? This is what the Celery website says:
Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.
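To give a feel for the API, here is a minimal sketch of a Celery task (the Redis broker URL is an assumption; any broker Celery supports will do):

# tasks.py -- run a worker with:  celery -A tasks worker
from celery import Celery

# Assumed broker URL; point this at whatever broker you actually run.
app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def add(x, y):
    return x + y

Calling add.delay(2, 2) from another process then enqueues the operation, and a worker executes it.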
I'm looking for it, too. But since it doesn't seem to exist yet, I have written my own implementation:
import time
import threading
import queue
import weakref

class OperationQueue:
    def __init__(self):
        self.thread = None
        self.queue = queue.Queue()

    def run(self):
        while self.queue.qsize() > 0:
            msg = self.queue.get()
            print(msg)
            # emulate an operation that takes time
            time.sleep(2)

    def addOperation(self, string):
        # put onto the queue first, for thread safety
        self.queue.put(string)
        if not (self.thread and self.thread.is_alive()):
            print('renew a thread')
            self.thread = threading.Thread(target=self.run)
            self.thread.start()

myQueue = OperationQueue()
myQueue.addOperation("test1")
# test whether it is freed automatically
item = weakref.ref(myQueue)
time.sleep(1)
myQueue.addOperation("test2")
myQueue = None
time.sleep(3)
print(f'item = {item}')
print("Done.")
I'm a bit new to coding/scripting in general and need some help implementing multiprocessing.
I currently have two functions that I'll concentrate on here. The first, `def getting_routes(router):`, logs into my routers (the list of routers comes from a previous function) and runs a command. The second, `def parse_paths(routes):`, parses the results of this command.
def get_list_of_routers
    <some code>
    return routers

def getting_routes(router):
    routes = sh.ssh(router, "show ip route")
    return routes

def parse_paths(routes):
    l = routes.split("\n")
    ...... <more code>.....
    return parsed_list
My list is roughly 50 routers long and, together with the subsequent parsing, takes quite a bit of time, so I'd like to use the multiprocessing module to run the SSH sessions, the command execution, and the subsequent parsing in parallel across all routers.
I wrote:
#!/usr/bin/env python

import multiprocessing.dummy import Pool as ThreadPool

def get_list_of_routers (***this part does not need to be threaded)
    <some code>
    return routers

def getting_routes(router):
    routes = sh.ssh(router, "show ip route")
    return routes

def parse_paths(routes):
    l = routes.split("\n")
    ...... <more code>.....
    return parsed_list

if __name__ == '__main__':
    worker_1 = multiprocessing.Process(target=getting_routes)
    worker_2 = multiprocessing.Process(target=parse_paths)
    worker_1.start()
    worker_2.start()
What I'd like is to SSH into each router, run the command, and return the parsed output, all in parallel. I've been reading http://kmdouglass.github.io/posts/learning-pythons-multiprocessing-module.html and the multiprocessing documentation but am still not getting the results I need and keep getting undefined errors. Any help with what I might be missing in the multiprocessing module? Thanks in advance!
Looks like you're not sending the router parameter to the getting_routes function.
Also, I think using threads will be sufficient; you don't need to create new processes.
What you need to do is create a loop in your main block that starts a new thread for each router returned from the get_list_of_routers function. Then you have two options: either call the parse_paths function from within the thread, or collect the return values from the threads and then call parse_paths.
For example:
import Queue
from threading import Thread

que = Queue.Queue()
threads = []

for router in get_list_of_routers():
    t = Thread(target=lambda q, arg1: q.put(getting_routes(arg1)), args=(que, router))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

results = []
while not que.empty():
    results.append(que.get())

parse_paths(results)
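Alternatively, since the question already imports ThreadPool, a thread pool keeps this shorter. A minimal sketch, assuming getting_routes and parse_paths as defined in the question (the pool size of 10 is an arbitrary choice):

from multiprocessing.dummy import Pool as ThreadPool

if __name__ == '__main__':
    routers = get_list_of_routers()
    pool = ThreadPool(10)  # 10 concurrent SSH sessions
    all_routes = pool.map(getting_routes, routers)
    pool.close()
    pool.join()
    parsed = [parse_paths(routes) for routes in all_routes]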
I'm using the vSphere API; here are the lines that I'm dealing with:
task = vm.PowerOff()
while task.info.state not in [vim.TaskInfo.State.success, vim.TaskInfo.State.error]:
    time.sleep(1)
    log.info("task {} is running".format(task))
log.info("task {} is done".format(task))
The problem here is that this blocks execution completely while the task is not finished. I would like the logging part to be run "in parallel", so I can start other tasks.
I thought about creating a function that accepts a task as a parameter and polls the info.state attribute just like now, but how do I make this non-blocking?
EDIT: I'm using Python 2.7
You could use asyncio and create an event loop. You can use asyncio.ensure_future() (named asyncio.async() in older versions) to schedule an asynchronous task that won't block the event loop.
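A rough sketch of that approach (this assumes Python 3.5+ for async/await; on Python 2.7 you would need a backport such as trollius, or the threading example below). The vm1/vm2 names are illustrative; vim, log, and the task objects are the ones from the question:

import asyncio

async def wait_for_task(task):
    # Poll the vSphere task without blocking the event loop.
    while task.info.state not in [vim.TaskInfo.State.success, vim.TaskInfo.State.error]:
        await asyncio.sleep(1)
        log.info("task {} is running".format(task))
    log.info("task {} is done".format(task))

loop = asyncio.get_event_loop()
# Several tasks can be awaited concurrently; none of them blocks the others.
loop.run_until_complete(asyncio.gather(
    wait_for_task(vm1.PowerOff()),
    wait_for_task(vm2.PowerOff()),
))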
Here is an example of using the threading module:
import threading

class VMShutdownThread(threading.Thread):
    def __init__(self, vm):
        # Initialise the Thread base class so start() works.
        threading.Thread.__init__(self)
        self.vm = vm

    def run(self):
        task = self.vm.PowerOff()
        while task.info.state not in [vim.TaskInfo.State.success, vim.TaskInfo.State.error]:
            time.sleep(1)
            log.info("task {} is running".format(task))
        log.info("task {} is done".format(task))

vm_shutdown_thread = VMShutdownThread(vm)
vm_shutdown_thread.start()
If you create a logger, you can configure it to print the thread name.
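For example, a formatter that includes %(threadName)s does that; a minimal sketch, not taken from the original answer:

import logging

logging.basicConfig(format="%(asctime)s [%(threadName)s] %(levelname)s %(message)s", level=logging.INFO)
log = logging.getLogger(__name__)

You can also pass name="..." to the Thread constructor so the log lines show a meaningful name instead of Thread-1.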
I am facing a problem with collecting logs from the following script.
Once I set SLEEP_TIME to too "small" a value, the LoggingThread
threads somehow block the logging module. The script freezes on the logging call
in the action function. If SLEEP_TIME is about 0.1, the script collects
all log messages as I expect.
I tried to follow this answer but it does not solve my problem.
import multiprocessing
import threading
import logging
import time

SLEEP_TIME = 0.000001

logger = logging.getLogger()
ch = logging.StreamHandler()
ch.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(funcName)s(): %(message)s'))
ch.setLevel(logging.DEBUG)
logger.setLevel(logging.DEBUG)
logger.addHandler(ch)

class LoggingThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        while True:
            logger.debug('LoggingThread: {}'.format(self))
            time.sleep(SLEEP_TIME)

def action(i):
    logger.debug('action: {}'.format(i))

def do_parallel_job():
    processes = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=processes)
    for i in range(20):
        pool.apply_async(action, args=(i,))
    pool.close()
    pool.join()

if __name__ == '__main__':
    logger.debug('START')

    #
    # multithread part
    #
    for _ in range(10):
        lt = LoggingThread()
        lt.setDaemon(True)
        lt.start()

    #
    # multiprocess part
    #
    do_parallel_job()

    logger.debug('FINISH')
How do I use the logging module in multiprocess and multithread scripts?
This is probably bug 6721.
The problem is common in any situation where you have locks, threads, and forks. If thread 1 holds a lock while thread 2 calls fork, then in the forked process there will only be thread 2, and the lock will be held forever. In your case, that is logging.StreamHandler.lock.
A fix can be found here (permalink) for the logging module. Note that you need to take care of any other locks, too.
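As a rough illustration of the idea behind such a fix (a sketch assuming Python 3.7+, which provides os.register_at_fork; this is not the patch from the linked report): hold the handler's lock across fork() in the parent and give the child a fresh lock, so the forked process can never inherit a lock held by a thread that no longer exists.

import os
import logging

handler = logging.StreamHandler()

# Acquire the handler lock before fork(), release it in the parent afterwards,
# and re-create it in the child so it is never left permanently held there.
os.register_at_fork(
    before=handler.acquire,
    after_in_parent=handler.release,
    after_in_child=handler.createLock,
)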
I ran into a similar issue just recently while using the logging module together with the Pathos multiprocessing library. I'm still not 100% sure, but it seems that in my case the problem may have been caused by the logging handler trying to reuse a lock object from within different processes.
I was able to fix it with a simple wrapper around the default logging handler:
import threading
from collections import defaultdict
from multiprocessing import current_process

import colorlog

class ProcessSafeHandler(colorlog.StreamHandler):
    def __init__(self):
        super().__init__()
        self._locks = defaultdict(lambda: threading.RLock())

    def acquire(self):
        current_process_id = current_process().pid
        self._locks[current_process_id].acquire()

    def release(self):
        current_process_id = current_process().pid
        self._locks[current_process_id].release()
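Usage is the same as for any other handler; a short sketch (the format string is just an example):

import logging

logger = logging.getLogger()
handler = ProcessSafeHandler()
handler.setFormatter(logging.Formatter('%(asctime)s %(processName)s %(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)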
By default, multiprocessing uses fork() to create the worker processes in the pool when running on Linux. The resulting subprocesses lose all running threads except for the main one. So if you're on Linux, that's the problem.
First action item: You shouldn't ever use the fork()-based pool; see https://pythonspeed.com/articles/python-multiprocessing/ and https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
On Windows, and I think newer versions of Python on macOS, the "spawn"-based pool is used. This is also what you ought to use on Linux. In this setup, a new Python process is started. As you would expect, the new process doesn't have any of the threads from the parent process, because it's a new process.
Second action item: you'll want to have logging setup done in each subprocess in the pool; the logging setup for the parent process isn't sufficient to get logs from the worker processes. You do this with the initializer keyword argument to Pool, e.g. write a function called setup_logging() and then do pool = multiprocessing.Pool(initializer=setup_logging) (https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool).
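Putting both action items together, a minimal sketch might look like this (the format string and pool size are illustrative, and setup_logging is a name you'd write yourself):

import logging
import multiprocessing

def setup_logging():
    # Runs once in each worker process, so every worker gets its own handler.
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter('%(asctime)s %(processName)s %(message)s'))
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.DEBUG)

def action(i):
    logging.getLogger().debug('action: %s', i)

if __name__ == '__main__':
    ctx = multiprocessing.get_context("spawn")  # avoids the fork()-related deadlock
    with ctx.Pool(processes=4, initializer=setup_logging) as pool:
        pool.map(action, range(20))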