I'm a bit new to coding/scripting in general and need some help implementing multiprocessing.
I currently have two functions that I'll concentrate on here. The first, def getting_routes(router):, logs into all my routers (the list of routers comes from a previous function) and runs a command. The second, def parse_paths(routes):, parses the results of this command.
def get_list_of_routers():
    <some code>
    return routers

def getting_routes(router):
    routes = sh.ssh(router, "show ip route")
    return routes

def parse_paths(routes):
    l = routes.split("\n")
    ...... <more code>.....
    return parsed_list
My list is roughly 50 routers long and, along with the subsequent parsing, takes quite a bit of time, so I'd like to use the multiprocessing module to run the sshing into routers, the command execution, and the subsequent parsing in parallel across all routers.
I wrote:
#!/usr/bin/env python
from multiprocessing.dummy import Pool as ThreadPool
def get_list_of_routers():  # this part does not need to be threaded
    <some code>
    return routers

def getting_routes(router):
    routes = sh.ssh(router, "show ip route")
    return routes

def parse_paths(routes):
    l = routes.split("\n")
    ...... <more code>.....
    return parsed_list

if __name__ == '__main__':
    worker_1 = multiprocessing.Process(target=getting_routes)
    worker_2 = multiprocessing.Process(target=parse_paths)
    worker_1.start()
    worker_2.start()
What I'd like is for parallel sshing into a router, running the command, and returning the parsed output. I've been reading http://kmdouglass.github.io/posts/learning-pythons-multiprocessing-module.html and the multiprocessing module but am still not getting the results I need and keep getting undefined errors. Any help with what I might be missing in the multiprocessing module? Thanks in advance!
Looks like you're not sending the router parameter to the getting_routes function.
Also, I think using threads will be sufficient; you don't need to create new processes.
What you need to do is create a loop in your main block that starts a new thread for each router returned from the get_list_of_routers function. Then you have two options: either call the parse_paths function from within the thread, or collect the return values from the threads and then call parse_paths.
For example:
import Queue
from threading import Thread

que = Queue.Queue()
threads = []
for router in get_list_of_routers():
    t = Thread(target=lambda q, arg1: q.put(getting_routes(arg1)), args=(que, router))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

results = []
while not que.empty():
    results.append(que.get())

parse_paths(results)
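If you prefer the Pool interface that your original import was reaching for, a thread pool gives the same effect with less bookkeeping. This is only a sketch built on the function names above, not tested against your environment:

from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool

if __name__ == '__main__':
    routers = get_list_of_routers()
    pool = ThreadPool(len(routers))                    # one thread per router
    raw_routes = pool.map(getting_routes, routers)     # ssh + command run in parallel
    pool.close()
    pool.join()
    parsed = [parse_paths(r) for r in raw_routes]      # parse each router's output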
I am creating an API to execute command line commands. The server has practically only two methods, "run" and "stop". The main purpose of "run" is to run a command line program on the server side and return a list with the system output. On the other hand, "stop" just kills the running process. Here is the code:
import sys
import json
import subprocess
from klein import Klein
class ItemStore(object):
    app = Klein()
    current_process = None

    def __init__(self):
        self._items = {}

    def create_process(self, exe):
        """
        Run command and return the system output inside a JSON string
        """
        print("COMMAND: ", exe)
        process = subprocess.Popen(exe, shell=True, stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
        self.current_process = process
        # Poll process for new output until finished
        output_lines = []
        counter = 0
        while True:
            counter = counter + 1
            nextline = process.stdout.readline()
            if process.poll() is not None:
                break
            aux = nextline.decode("utf-8")
            output_lines.append(aux)
            sys.stdout.flush()
            counter = counter + 1
        print("RETURN CODE: ", process.returncode)
        return json.dumps(output_lines)

    @app.route('/run/<command>', methods=['POST'])
    def run(self, request, command):
        """
        Execute command line process
        """
        exe = command
        print("COMMAND: ", exe)
        output_lines = self.create_process(exe)
        request.setHeader("Content-Type", "application/json")
        request.setResponseCode(200)
        return output_lines

    @app.route('/stop', methods=['POST'])
    def stop(self, request):
        """
        Kill current execution
        """
        self.current_process.kill()
        request.setResponseCode(200)
        return None


if __name__ == '__main__':
    store = ItemStore()
    store.app.run('0.0.0.0', 15508)
Well, the problem with this is that if I need to stop the current execution, the "stop" request will not be attended to until the "run" request has finished, so it makes no sense to work this way. I have been reading several pages about async/await solutions, but I cannot get it to work. I think the most promising solution is in this webpage https://crossbario.com/blog/Going-Asynchronous-from-Flask-to-Twisted-Klein/ ; however, "run" is still a synchronous process. I just posted my main and original code in order not to confuse things with the changes from that webpage.
Best regards
Everything to do with Klein in this example is already handling requests concurrently. However, your application code blocks until it has fully responded to a request.
You have to write your application code to be non-blocking instead of blocking.
Switch your code from the subprocess module to Twisted's process support.
Use Klein's feature of being able to return a Deferred instead of a result (if you want incremental results while the process is running, also look at the request interface - in particular, the write method - so you can write those results before the Deferred fires with a final result).
After Deferreds make sense to you, then you might want to think about syntactic sugar that's available in the form of async/await. Until you understand what Deferreds are doing, async/await is just going to be black magic that will only ever work by accident in your programs.
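A rough sketch of the non-blocking shape this points at (my own illustration, not the original application; the shell invocation and route layout are assumptions):

import json
from klein import Klein
from twisted.internet.utils import getProcessOutput

app = Klein()

@app.route('/run/<command>', methods=['POST'])
def run(request, command):
    request.setHeader("Content-Type", "application/json")
    # getProcessOutput returns a Deferred; returning it lets Klein answer this
    # request when the process finishes, without blocking other requests
    # (such as /stop) in the meantime.
    d = getProcessOutput('/bin/sh', ('-c', command), errortoo=True)
    d.addCallback(lambda output: json.dumps(output.decode("utf-8").splitlines()))
    return d

app.run('0.0.0.0', 15508)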
I'm having issues using most or all of the cores to process the files faster; it could be reading multiple files at a time or using multiple cores to read a single file.
I would prefer using multiple cores to read a single file before moving on to the next.
I tried the code below but can't seem to get all the cores used.
The following code basically retrieves the *.txt files in the directory, which contain HTML in JSON format.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import json
import urlparse
import os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool # This is a thread-based Pool
from multiprocessing import cpu_count
def crawlTheHtml(htmlsource):
    htmlArray = json.loads(htmlsource)
    for eachHtml in htmlArray:
        soup = BeautifulSoup(eachHtml['result'], 'html.parser')
        if all(['another text to search' not in str(soup),
                'text to search' not in str(soup)]):
            try:
                gd_no = ''
                try:
                    gd_no = soup.find('input', {'id': 'GD_NO'})['value']
                except:
                    pass
                r = requests.post('domain api address', data={
                    'gd_no': gd_no,
                })
            except:
                pass


if __name__ == '__main__':
    pool = Pool(cpu_count() * 2)
    print(cpu_count())
    fileArray = []
    for filename in os.listdir(os.getcwd()):
        if filename.endswith('.txt'):
            fileArray.append(filename)
    for file in fileArray:
        with open(file, 'r') as myfile:
            htmlsource = myfile.read()
            results = pool.map(crawlTheHtml(htmlsource), f)
On top of that, I'm not sure what the ,f represents.
Question 1:
What did I not do properly to fully utilize all the cores/threads?
Question 2:
Is there a better way to use try/except? Sometimes the value is not in the page, and that would cause the script to stop. When dealing with multiple variables, I end up with a lot of try/except statements.
Answer to question 1: your problem is this line:
from multiprocessing.dummy import Pool # This is a thread-based Pool
Answer taken from: multiprocessing.dummy in Python is not utilising 100% cpu
When you use multiprocessing.dummy, you're using threads, not processes:
multiprocessing.dummy replicates the API of multiprocessing but is no
more than a wrapper around the threading module.
That means you're restricted by the Global Interpreter Lock (GIL), and only one thread can actually execute CPU-bound operations at a time. That's going to keep you from fully utilizing your CPUs. If you want to get full parallelism across all available cores, you're going to need to address the pickling issue you're hitting with multiprocessing.Pool.
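For what it's worth, here is a rough sketch of how the main block might look with a process-based Pool and a proper pool.map call (my own guess at the intended structure, not the poster's final code):

import os
from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    filenames = [f for f in os.listdir(os.getcwd()) if f.endswith('.txt')]
    sources = []
    for name in filenames:
        with open(name, 'r') as myfile:
            sources.append(myfile.read())
    pool = Pool(cpu_count())
    # Pass the function itself plus an iterable of inputs;
    # calling crawlTheHtml(htmlsource) inline runs it in the parent process instead.
    pool.map(crawlTheHtml, sources)
    pool.close()
    pool.join()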
I had this problem. You need to do:
from multiprocessing import Pool
from multiprocessing import freeze_support
and at the end you need:
if __name__ == '__main__':
    freeze_support()
and then you can continue your script.
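Roughly, the layout that answer describes would look like this (my own minimal sketch, not code from the thread; the work function is made up):

from multiprocessing import Pool, freeze_support

def work(x):
    return x * x

if __name__ == '__main__':
    freeze_support()                  # only needed for frozen Windows executables
    pool = Pool()
    print(pool.map(work, range(10)))
    pool.close()
    pool.join()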
from multiprocessing import Pool, Queue
from os import getpid
from time import sleep
from random import random
MAX_WORKERS=10
class Testing_mp(object):

    def __init__(self):
        """
        Initiates a queue, a pool and a temporary buffer, used only
        when the queue is full.
        """
        self.q = Queue()
        self.pool = Pool(processes=MAX_WORKERS, initializer=self.worker_main,)
        self.temp_buffer = []

    def add_to_queue(self, msg):
        """
        If the queue is full, put the message in a temporary buffer.
        If the queue is not full, add the message to the queue.
        If the buffer is not empty and the queue is not full,
        put messages from the buffer back into the queue.
        """
        if self.q.full():
            self.temp_buffer.append(msg)
        else:
            self.q.put(msg)
            if len(self.temp_buffer) > 0:
                self.add_to_queue(self.temp_buffer.pop())

    def write_to_queue(self):
        """
        This function writes some messages to the queue.
        """
        for i in range(50):
            self.add_to_queue("First item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random() * 2)
            self.add_to_queue("Second item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random() * 2)

    def worker_main(self):
        """
        Waits indefinitely for an item to be written in the queue.
        Finishes when the parent process terminates.
        """
        print "Process {0} started".format(getpid())
        while True:
            # If the queue is not empty, pop the next element and do the work.
            # If the queue is empty, wait indefinitely until an element gets into the queue.
            item = self.q.get(block=True, timeout=None)
            print "{0} retrieved: {1}".format(getpid(), item)
            # simulate some random length operations
            sleep(random())


# Warning from the Python documentation:
# Functionality within this package requires that the __main__ module be
# importable by the children. This means that some examples, such as the
# multiprocessing.Pool examples, will not work in the interactive interpreter.
if __name__ == '__main__':
    mp_class = Testing_mp()
    mp_class.write_to_queue()
    # Wait a bit for the child processes to do some work,
    # because when the parent exits, the children are terminated.
    sleep(5)
I'm using the VSphere API, here are the lines that I'm dealing with:
task = vm.PowerOff()
while task.info.state not in [vim.TaskInfo.State.success, vim.TaskInfo.State.error]:
    time.sleep(1)
    log.info("task {} is running".format(task))
log.info("task {} is done".format(task))
The problem here is that this blocks execution completely while the task is not finished. I would like the logging part to be run "in parallel", so I can start other tasks.
I thought about creating a function that would accept a task as a parameter and poll the info.state attribute just like now, but how do I make this non-blocking?
EDIT: I'm using Python 2.7
You could use asyncio and create an event loop. You can use asyncio.async() to create an asynchronous task that won't block the event loop execution.
Here is an example of using the threading module:
import threading

class VMShutdownThread(threading.Thread):
    def __init__(self, vm):
        threading.Thread.__init__(self)
        self.vm = vm

    def run(self):
        task = self.vm.PowerOff()
        while task.info.state not in [vim.TaskInfo.State.success, vim.TaskInfo.State.error]:
            time.sleep(1)
            log.info("task {} is running".format(task))
        log.info("task {} is done".format(task))

vm_shutdown_thread = VMShutdownThread(vm)
vm_shutdown_thread.start()
If you create a logger, you can configure it to print the thread name.
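For instance, assuming the standard logging module (the poster's log object may be set up differently), a format string with %(threadName)s does this:

import logging

# Include the thread name in every log record.
logging.basicConfig(format="%(threadName)s %(levelname)s %(message)s",
                    level=logging.INFO)
log = logging.getLogger(__name__)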
I am running this code as a CherryPy Web Service both on Mac OS X and Ubuntu 14.04. Using multiprocessing on Python 3, I want to start the static method worker() asynchronously, within a process Pool.
The same code runs flawlessly on Mac OS X; on Ubuntu 14.04, worker() does not run. That is, by debugging the code inside the POST method I am able to see that each line is executed, from
reqid = str(uuid.uuid4())
to
return handle_error(202, "Request ID: " + reqid)
Starting the same code in Ubuntu 14.04, it does not run the worker() method, not even a print() at the top of the method (which would be logged).
Here's the relevant code (I only omitted the handle_error() method):
import cherrypy
import json
from lib import get_parameters, handle_error
from multiprocessing import Pool
import os
from pymatbridge import Matlab
import requests
import shutil
import uuid
from xml.etree import ElementTree
class Schedule(object):
    exposed = True

    def __init__(self, mlab_path, pool):
        self.mlab_path = mlab_path
        self.pool = pool

    def POST(self, *paths, **params):
        if validate(cherrypy.request.headers):
            try:
                reqid = str(uuid.uuid4())
                path = os.path.join("results", reqid)
                os.makedirs(path)
                wargs = [(self.mlab_path, reqid)]
                self.pool.apply_async(Schedule.worker, wargs)
                return handle_error(202, "Request ID: " + reqid)
            except:
                return handle_error(500, "Internal Server Error")
        else:
            return handle_error(401, "Unauthorized")

    #### this is not executed ####
    @staticmethod
    def worker(args):
        mlab_path, reqid = args
        mlab = Matlab(executable=mlab_path)
        mlab.start()
        mlab.run_code("cd mlab")
        mlab.run_code("sched")
        a = mlab.get_variable("a")
        mlab.stop()
        return reqid
    ####


# to start the Web Service
if __name__ == "__main__":
    # start Web Service with some configuration
    global_conf = {
        "global": {
            "server.environment": "production",
            "engine.autoreload.on": True,
            "engine.autoreload.frequency": 5,
            "server.socket_host": "0.0.0.0",
            "log.screen": False,
            "log.access_file": "site.log",
            "log.error_file": "site.log",
            "server.socket_port": 8084
        }
    }
    cherrypy.config.update(global_conf)
    conf = {
        "/": {
            "request.dispatch": cherrypy.dispatch.MethodDispatcher(),
            "tools.encode.debug": True,
            "request.show_tracebacks": False
        }
    }
    pool = Pool(3)
    cherrypy.tree.mount(Schedule('matlab', pool), "/sched", conf)
    # activate signal handler
    if hasattr(cherrypy.engine, "signal_handler"):
        cherrypy.engine.signal_handler.subscribe()
    # start serving pages
    cherrypy.engine.start()
    cherrypy.engine.block()
Your logic is hiding the problem from you. The apply_async method returns an AsyncResult object which acts as a handle to the asynchronous task you just scheduled. As you ignore the outcome of the scheduled task, the whole thing looks like it is "failing silently".
If you try to get the result from that task, you'd see the real problem.
handler = self.pool.apply_async(Schedule.worker, wargs)
handler.get()
... traceback here ...
cPickle.PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
In short, you must ensure the arguments you pass to the Pool are picklable.
Instance and class methods are picklable if the object/class they belong to is picklable as well. Static methods are not picklable because they lose the association with the object itself, so the pickle library cannot serialise them correctly.
As a general rule, it is better to avoid scheduling anything other than top-level functions to a multiprocessing.Pool.
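As an illustration of that advice (a minimal sketch with made-up values, not the poster's application):

from multiprocessing import Pool

# A plain top-level function is picklable, so the Pool can send it to a worker.
def worker(args):
    mlab_path, reqid = args
    # ... start Matlab and run the job here ...
    return reqid

if __name__ == "__main__":
    pool = Pool(3)
    handler = pool.apply_async(worker, [("matlab", "some-request-id")])
    print(handler.get())   # raises here if the task failed, instead of failing silently
    pool.close()
    pool.join()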
To run background tasks with CherryPy, it's better to use an asynchronous task queue manager like Celery or RQ. These services are very easy to install and run, your tasks will run in a completely separate process, and if you need to scale because your load is increasing, it will be very straightforward.
You have a simple example with Cherrypy here.
I solved it by changing the method from @staticmethod to @classmethod. Now the job runs inside the process Pool. I found classmethods to be more useful in this case, as explained here.
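For reference, a sketch of the shape of that change (based on the code in the question, trimmed down):

class Schedule(object):
    # ... __init__ and POST as before ...

    @classmethod                      # instead of @staticmethod
    def worker(cls, args):
        mlab_path, reqid = args
        # ... same Matlab calls as before ...
        return reqid

The call site in POST stays self.pool.apply_async(Schedule.worker, wargs).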
Thanks.
I have an issue reading a multiprocessing queue; the function for reading the queue is being called from another module.
Below is the class containing the function to start a thread which runs function_to_get_data. The class resides in its own file, which I will call one.py. function_to_get_data is in another file, two.py, and is an infinite loop which puts data into the queue (code snippet for this further down). The class also contains the function to read the queue. The Queue q is defined globally at the beginning.
import multiprocessing
from two import function_to_get_data
q = multiprocessing.Queue()
class Poller:

    def startPoller(self):
        pollerThread = multiprocessing.Process(target=module_to_get_data, args=(q,))
        pollerThread.start()

    def getPoller(self):
        if q.empty():
            print "queue is empty"
        else:
            pollResQueue = q.get()
            q.put(pollResQueue)
            return pollResQueue


if __name__ == "__main__":
    startpoll = Poller()
    startpoll.startPoller()
Below is snippet from function_to_get_data:
def module_to_get_data(q):
    while 1:
        # performs actions #
        q.put(data_from_actions)
I have another module, three.py, which requires the data from the queue and requests it by calling the function from the initial class:
from one import Poller
externalPoller = Poller()
data_this_module_needs = externalPoller.getPoller()
The issue is that the Queue is always empty.
I should add that the function in three.py is also started as a thread in one.py by a POST from a web page:
def POST(data):
    data = web.input()
    if data == 'Start':
        thread_two = multiprocessing.Process(target=function_in_three_py, args=(q,))
        thread_two.start()
If I use the python command line and enter the two Poller functions and call them, I get data from the queue no problem.