I have an issue reading a multiprocessing queue; the function for reading the queue is being called from another module.
Below is the class containing the function that starts a process which runs function_to_get_data. The class resides in its own file, which I will call one.py. function_to_get_data lives in another file, two.py, and is an infinite loop that puts data into the queue (code snippet for this further down). The class also contains the function to read the queue. The Queue q is defined globally at the beginning.
import multiprocessing

from two import function_to_get_data

q = multiprocessing.Queue()


class Poller:

    def startPoller(self):
        pollerThread = multiprocessing.Process(target=function_to_get_data, args=(q,))
        pollerThread.start()

    def getPoller(self):
        if q.empty():
            print "queue is empty"
        else:
            pollResQueue = q.get()
            q.put(pollResQueue)
            return pollResQueue


if __name__ == "__main__":
    startpoll = Poller()
    startpoll.startPoller()
Below is a snippet from function_to_get_data in two.py:

def function_to_get_data(q):
    while 1:
        # performs actions #
        q.put(data_from_actions)
I have another module, three.py, which requires the data from the queue and requests it by calling the function from the initial class:
from one import Poller
externalPoller = Poller()
data_this_module_needs = externalPoller.getPoller()
The issue is that the Queue is always empty.
I should add that the function in three.py is also started as a separate process in one.py by a POST from a web page:
def POST(data):
    data = web.input()
    if data == 'Start':
        thread_two = multiprocessing.Process(target=function_in_three_py, args=(q,))
        thread_two.start()
If I use the Python command line, enter the two Poller functions and call them, I get data from the queue with no problem.
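For reference, a minimal Python 3 sketch of the usual arrangement is to create one Queue in the parent and pass that same object to every process that needs it, instead of relying on each module importing its own module-level queue (the producer and consumer names below are just stand-ins, not the original modules):

import multiprocessing


def producer(q):
    # stand-in for function_to_get_data in two.py
    q.put("data_from_actions")


def consumer(q):
    # stand-in for the code in three.py that needs the data
    print(q.get())


if __name__ == "__main__":
    q = multiprocessing.Queue()
    p1 = multiprocessing.Process(target=producer, args=(q,))
    p2 = multiprocessing.Process(target=consumer, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()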
Related
I have a long-running process whose current state I want to keep track of. There are N processes running at the same time, hence the multiprocessing issue.
I pass a Queue into each process so it can report messages about its state, and this Queue is then read (if not empty) in a thread every couple of seconds.
I'm using Spyder on Windows as my environment, and the behaviour described below is in its console. I did not try it in a different environment.
from multiprocessing import Process, Queue, Lock
import time

from tqdm import tqdm  # needed for tqdm.write below


def test(process_msg: Queue):
    try:
        process_msg.put('Inside process message')
        # process...
        return  # to have exitstate = 0
    except Exception as e:
        process_msg.put(e)


callback_msg = Queue()

if __name__ == '__main__':
    p = Process(target=test,
                args=(callback_msg,))
    p.start()
    time.sleep(5)
    print(p)

    while not callback_msg.empty():
        msg = callback_msg.get()
        if type(msg) != Exception:
            tqdm.write(str(msg))
        else:
            raise msg
The problem is that whatever I do with the code, it never reads what is inside the Queue (also because the process never puts anything in it). It only works when I switch to the dummy version, which runs similarly to threading on only one CPU: from multiprocessing.dummy import Process, Queue, Lock.
Apparently the test function has to be in a separate file.
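For reference, that arrangement might look roughly like the sketch below, assuming a hypothetical worker.py module that a spawned child process (as used on Windows) can import cleanly:

# worker.py (hypothetical module containing only the worker function)
from multiprocessing import Queue


def test(process_msg: Queue):
    process_msg.put('Inside process message')


# main.py (run this file directly)
from multiprocessing import Process, Queue

from worker import test

if __name__ == '__main__':
    callback_msg = Queue()
    p = Process(target=test, args=(callback_msg,))
    p.start()
    p.join()
    while not callback_msg.empty():
        print(callback_msg.get())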
I have 4 different Python custom objects and an events queue. Each object has a method that allows it to retrieve an event from the shared events queue, process it if the type is the desired one, and then put a new event on the same events queue, allowing other processes to process it.
Here's an example.
import multiprocessing as mp


class CustomObject:

    def __init__(self, events_queue: mp.Queue) -> None:
        self.events_queue = events_queue

    def process_events_queue(self) -> None:
        event = self.events_queue.get()
        if type(event) == SpecificEventDataTypeForThisClass:
            # do something and create a new_event
            self.events_queue.put(new_event)
        else:
            self.events_queue.put(event)

    # there are other methods specific to each object
These 4 objects have specific tasks to do, but they all share this same structure. Since I need to "simulate" the production conditions, I want them all to run at the same time, independently of each other.
Here's just an example of what I want to do, if possible.
import multiprocessing as mp

import CustomObject

if __name__ == '__main__':
    events_queue = mp.Queue()

    data_provider = mp.Process(target=CustomObject, args=(events_queue,))
    portfolio = mp.Process(target=CustomObject, args=(events_queue,))
    engine = mp.Process(target=CustomObject, args=(events_queue,))
    broker = mp.Process(target=CustomObject, args=(events_queue,))

    while True:
        data_provider.process_events_queue()
        portfolio.process_events_queue()
        engine.process_events_queue()
        broker.process_events_queue()
My idea is to run each object in a separate process, allowing them to communicate with events shared through the events_queue. So my question is, how can I do that?
The problem is that obj = mp.Process(target=CustomObject, args=(events_queue,)) returns a Process instance and I can't access the CustomObject methods from it. Also, is there a smarter way to achieve what I want?
Processes require a function to run, which defines what the process is actually doing. Once this function exits (and there are no non-daemon threads) the process is done. This is similar to how Python itself always executes a __main__ script.
If you do mp.Process(target=CustomObject, args=(events_queue,)) that just tells the process to call CustomObject - which instantiates it once and then is done. This is not what you want, unless the class actually performs work when instantiated - which is a bad idea for other reasons.
Instead, you must define a main function or method that handles what you need: "communicate with events shared through the events_queue". This function should listen to the queue and take action depending on the events received.
A simple implementation looks like this:
import os, time
from multiprocessing import Queue, Process


class Worker:
    # separate input and output for simplicity
    def __init__(self, commands: Queue, results: Queue):
        self.commands = commands
        self.results = results

    # our main function to be run by a process
    def main(self):
        # each process should handle more than one command
        while True:
            value = self.commands.get()
            # pick a well-defined signal to detect "no more work"
            if value is None:
                self.results.put(None)
                break
            # do whatever needs doing
            result = self.do_stuff(value)
            print(os.getpid(), ':', self, 'got', value, 'put', result)
            time.sleep(0.2)  # pretend we do something
            # pass on more work if required
            self.results.put(result)

    # placeholder for what needs doing
    def do_stuff(self, value):
        raise NotImplementedError
This is a template for a class that just keeps on processing events. The do_stuff method must be overridden to define what actually happens.
class AddTwo(Worker):
    def do_stuff(self, value):
        return value + 2


class TimesThree(Worker):
    def do_stuff(self, value):
        return value * 3


class Printer(Worker):
    def do_stuff(self, value):
        print(value)
This already defines fully working process payloads: Process(target=TimesThree(in_queue, out_queue).main) schedules the main method in a process, listening for and responding to commands.
Running this mainly requires connecting the individual components:
if __name__ == '__main__':
    # bookkeeping of resources we create
    processes = []
    start_queue = Queue()
    # connect our workers via queues
    queue = start_queue
    for element in (AddTwo, TimesThree, Printer):
        instance = element(queue, Queue())
        # we run the main method in processes
        processes.append(Process(target=instance.main))
        queue = instance.results
    # start all processes
    for process in processes:
        process.start()
    # send input, but do not wait for output
    start_queue.put(1)
    start_queue.put(248124)
    start_queue.put(-256)
    # send shutdown signal
    start_queue.put(None)
    # wait for processes to shutdown
    for process in processes:
        process.join()
Note that you do not need classes for this. You can also compose functions for a similar effect, as long as everything is pickle-able:
import os, time
from multiprocessing import Queue, Process


def main(commands, results, do_stuff):
    while True:
        value = commands.get()
        if value is None:
            results.put(None)
            break
        result = do_stuff(value)
        print(os.getpid(), ':', do_stuff, 'got', value, 'put', result)
        time.sleep(0.2)
        results.put(result)


def times_two(value):
    return value * 2


if __name__ == '__main__':
    in_queue, out_queue = Queue(), Queue()
    worker = Process(target=main, args=(in_queue, out_queue, times_two))
    worker.start()
    for message in (1, 3, 5, None):
        in_queue.put(message)
    while True:
        reply = out_queue.get()
        if reply is None:
            break
        print('result:', reply)
I'm having issues using most or all of the cores to process the files faster; it could be reading multiple files at a time or using multiple cores to read a single file.
I would prefer using multiple cores to read a single file before moving on to the next one.
I tried the code below but can't seem to get all the cores used.
The following code basically retrieves the *.txt files in the directory, each of which contains HTML, in JSON format.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import json
import urlparse
import os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count


def crawlTheHtml(htmlsource):
    htmlArray = json.loads(htmlsource)
    for eachHtml in htmlArray:
        soup = BeautifulSoup(eachHtml['result'], 'html.parser')
        if all(['another text to search' not in str(soup),
                'text to search' not in str(soup)]):
            try:
                gd_no = ''
                try:
                    gd_no = soup.find('input', {'id': 'GD_NO'})['value']
                except:
                    pass
                r = requests.post('domain api address', data={
                    'gd_no': gd_no,
                })
            except:
                pass


if __name__ == '__main__':
    pool = Pool(cpu_count() * 2)
    print(cpu_count())
    fileArray = []
    for filename in os.listdir(os.getcwd()):
        if filename.endswith('.txt'):
            fileArray.append(filename)
    for file in fileArray:
        with open(file, 'r') as myfile:
            htmlsource = myfile.read()
            results = pool.map(crawlTheHtml(htmlsource), f)
On top of that, I'm not sure what the , f represents.
Question 1:
What did I not do properly to fully utilize all the cores/threads?
Question 2:
Is there a better way to use try/except? Sometimes the value is not in the page, and that causes the script to stop. When dealing with multiple variables, I end up with a lot of try/except statements.
Answer to question 1, your problem is this line:
from multiprocessing.dummy import Pool # This is a thread-based Pool
Answer taken from: multiprocessing.dummy in Python is not utilising 100% cpu
When you use multiprocessing.dummy, you're using threads, not processes:
multiprocessing.dummy replicates the API of multiprocessing but is no
more than a wrapper around the threading module.
That means you're restricted by the Global Interpreter Lock (GIL), and only one thread can actually execute CPU-bound operations at a time. That's going to keep you from fully utilizing your CPUs. If you want to get full parallelism across all available cores, you're going to need to address the pickling issue you're hitting with multiprocessing.Pool.
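As a rough sketch of that direction (not the original poster's code, and with the HTML parsing omitted), a process-based pool can map a picklable top-level function over the list of files:

import os
from multiprocessing import Pool, cpu_count


def process_file(filename):
    # must be a top-level function so it can be pickled and sent to worker processes
    with open(filename, 'r') as f:
        htmlsource = f.read()
    # ... parse htmlsource and post results here ...
    return filename


if __name__ == '__main__':
    files = [f for f in os.listdir(os.getcwd()) if f.endswith('.txt')]
    with Pool(cpu_count()) as pool:
        results = pool.map(process_file, files)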
I had this problem. You need to do

from multiprocessing import Pool
from multiprocessing import freeze_support

and at the end you need

if __name__ == '__main__':
    freeze_support()

and then you can continue your script.
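For context, a minimal self-contained sketch of that pattern (the square function and pool size here are made up):

from multiprocessing import Pool, freeze_support


def square(x):
    return x * x


if __name__ == '__main__':
    freeze_support()  # only strictly needed for frozen Windows executables
    with Pool(4) as pool:
        print(pool.map(square, range(10)))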
from multiprocessing import Pool, Queue
from os import getpid
from time import sleep
from random import random

MAX_WORKERS = 10


class Testing_mp(object):

    def __init__(self):
        """
        Initiates a queue, a pool and a temporary buffer, used only
        when the queue is full.
        """
        self.q = Queue()
        self.pool = Pool(processes=MAX_WORKERS, initializer=self.worker_main,)
        self.temp_buffer = []

    def add_to_queue(self, msg):
        """
        If the queue is full, put the message in a temporary buffer.
        If the queue is not full, add the message to the queue.
        If the buffer is not empty and the queue is not full,
        put messages from the buffer back into the queue.
        """
        if self.q.full():
            self.temp_buffer.append(msg)
        else:
            self.q.put(msg)
            if len(self.temp_buffer) > 0:
                self.add_to_queue(self.temp_buffer.pop())

    def write_to_queue(self):
        """
        This function writes some messages to the queue.
        """
        for i in range(50):
            self.add_to_queue("First item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random() * 2)
            self.add_to_queue("Second item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random() * 2)

    def worker_main(self):
        """
        Waits indefinitely for an item to be written to the queue.
        Finishes when the parent process terminates.
        """
        print "Process {0} started".format(getpid())
        while True:
            # If the queue is not empty, pop the next element and do the work.
            # If the queue is empty, wait indefinitely until an element gets into the queue.
            item = self.q.get(block=True, timeout=None)
            print "{0} retrieved: {1}".format(getpid(), item)
            # simulate some random-length operations
            sleep(random())


# Warning from the Python documentation:
# Functionality within this package requires that the __main__ module be
# importable by the children. This means that some examples, such as the
# multiprocessing.Pool examples, will not work in the interactive interpreter.
if __name__ == '__main__':
    mp_class = Testing_mp()
    mp_class.write_to_queue()
    # Wait a bit for the child processes to do some work,
    # because when the parent exits, the children are terminated.
    sleep(5)
I have to write a Python program that spawns 3 threads and passes an ID to each thread as a parameter (numbers 1 through 3). In each thread, I call the following JSON endpoint, substituting {ID} with the ID passed to the thread:
https://jsonplaceholder.typicode.com/posts/{ID}
Each thread parses the JSON string into a dict before returning it to the main thread, and the main thread combines the results of all child threads into a list. My code is below:
import threading
import requests
import json
import queue

q = queue.Queue()


def main():
    th_list = []
    for i in range(1, 4):
        t = threading.Thread(target=call_url, args=(i, q))
        th_list.append(t)
    print(th_list)
    for thread in th_list:
        thread.start()
    #for thread in th_list:
    #    print(thread.join())
    print(q.get())


def call_url(i, q):
    url = 'https://jsonplaceholder.typicode.com/posts/' + str(i)
    response = requests.get(url)
    o_dict = json.loads(response.content)
    q.put(o_dict)


if __name__ == '__main__':
    main()
But that gets me None as the result. Any help will be appreciated.
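For what it's worth, one common pattern (just a sketch, with a stand-in worker rather than the poster's call_url) is to join all the threads first and only then drain the queue into a list in the main thread:

import threading
import queue

q = queue.Queue()


def worker(i, q):
    # stand-in for call_url: put the parsed dict on the queue
    q.put({'id': i})


if __name__ == '__main__':
    threads = [threading.Thread(target=worker, args=(i, q)) for i in range(1, 4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread to finish
    results = []
    while not q.empty():  # safe here because all producers have finished
        results.append(q.get())
    print(results)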
Long-time lurker here.
I have a thread controller object. This object takes in other objects called "Checks". These Checks pull in DB rows that match their criteria. The thread manager polls each check (asking it for its DB rows, aka work units) and then enqueues each row along with a reference to that check object. The thought is that N threads will come in, pull an item off the queue and execute the corresponding Check's do_work method. The do_work method will return Pass/Fail, and all passes will be enqueued for further processing.
The main script (not shown) instantiates the checks and adds them to the thread manager using add_check and then calls kick_off_work.
So far I am testing and it simply locks up:
import Queue
from threading import Thread


class ThreadMan:
    def __init__(self, reporter):
        print "Initializing thread manager..."
        self.workQueue = Queue.Queue()
        self.resultQueue = Queue.Queue()
        self.checks = []

    def add_check(self, check):
        self.checks.append(check)

    def kick_off_work(self):
        for check in self.checks:
            for work_unit in check.populate_work():
                # work unit is a DB row
                self.workQueue.put({"object": check, "work": work_unit})
        threads = Thread(target=self.execute_work_unit)
        threads = Thread(target=self.execute_work_unit)
        threads.start()
        self.workQueue.join()

    def execute_work_unit(self):
        unit = self.workQueue.get()
        check_object = unit['object']  # Check object
        work_row = unit['work']  # DB ROW
        check_object.do_work(work_row)
        self.workQueue.task_done()
        print "Done with work!!"
The output is simply:
Initializing thread manager...
In check1's do_work method... Doing work
Done with work!!
(locked up)
I would like to run through the entire queue.
You just need to add a "while" loop in your execute_work_unit, otherwise it stops after the first iteration:
def execute_work_unit(self):
    while True:
        unit = self.workQueue.get()
        check_object = unit['object']  # Check object
        work_row = unit['work']  # DB ROW
        check_object.do_work(work_row)
        self.workQueue.task_done()
        print "Done with work!!"
Have a look here:
http://docs.python.org/2/library/queue.html#module-Queue
EDIT: to get it to finish, just add threads.join() after your self.workQueue.join() in
def kick_off_work(self):