Multiprocessing: Pass file handle to Process - python

My program spawns multiple processes to do some time consuming calculations. The results are then collected in a queue and a writer process writes them into an output file.
Below is a simplified version of my code which should illustrate my issue. If I comment out the flush statement in the Writer class, test.out is empty at the end of the program.
What exactly is happening here? Is test.out not closed properly? Was it naive to assume that passing the file handle to an autonomous process should work in the first place?
from multiprocessing import JoinableQueue, Process

def main():
    queue = JoinableQueue()
    queue.put("hello world!")

    with open("test.out", "w") as outhandle:
        wproc = Writer(queue, outhandle)
        wproc.start()
        queue.join()

    with open("test.out") as handle:
        for line in handle:
            print(line.strip())

class Writer(Process):

    def __init__(self, queue, handle):
        Process.__init__(self)
        self.daemon = True
        self.queue = queue
        self.handle = handle

    def run(self):
        while True:
            msg = self.queue.get()
            print(msg, file=self.handle)
            #self.handle.flush()
            self.queue.task_done()

if __name__ == '__main__':
    main()

The writer is a separate process. The data it writes to the file might be buffered, and because the process keeps running, it doesn't know that it should flush the buffer (write it to the file). Flushing manually is the right thing to do.
Normally, the file would be closed when you exit the with block, and this would flush the buffer. But the parent process doesn't know anything about its children's buffers, so the child has to flush its own buffer (closing the file should work too; that doesn't close the file for the parent, at least on Unix systems).
Also, check out the Pool class from multiprocessing (https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool) - it might save you some work.
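The buffering described above can be observed without any child process. The following is a minimal sketch, assuming CPython's default buffered file objects; the helper read_back and the temporary directory are illustrative and not part of the original program.

```python
import os
import tempfile


def read_back(path):
    """Return what is currently in the file, using a second handle."""
    with open(path) as handle:
        return handle.read()


# Use a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test.out")
    out = open(path, "w")

    # A short write normally stays in the writer's in-memory buffer.
    print("hello world!", file=out)
    before_flush = read_back(path)

    # flush() hands the buffered data over to the operating system.
    out.flush()
    after_flush = read_back(path)

    out.close()

print(repr(before_flush))
print(repr(after_flush))
```

In the question's program the writer is a daemon process that is terminated when the parent exits, so its buffer never gets the chance to be flushed unless the child does it explicitly.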

I had the same issue when writing the output of a pooled data-processing job directly to a file.
I sorted it out by first collecting the pooled results into a list and only then writing them to the file.
This mainly happens because the disk can't write as fast as the processor produces results, so buffered content gets lost or the pooled data ends up out of order.
The best thing to do is to collect the pooled output in memory first (a string or a list) and then write it to the file; that way things should get sorted out.
Good luck!

Related

Keep track of the state of event in multiprocessing by writing to file python

I am working on a task that requires multiprocessing in python and I need to keep track of the state by writing already processed document IDs to a file (a single file shared among processes).
I have implemented a simple version using the code snippet below. In the code, I have some Ids stored in a variable called question, the shared file f and in the main method, I split the question into possible chunks that can be processed parallel.
Is this the right way to do such?
from multiprocessing import Pool
from multiprocessing import Queue

def reader(val):
    pqueue.put(val)

def writer():
    a = pqueue.get()
    f = open("num.txt",'a')
    f.write(str(a))
    f.write("\n")
    f.close()

def main():
    global question
    global pqueue
    pqueue = Queue() # writer() writes to pqueue from _this_ process
    processes = []
    question = [16,0,1,2,3,4,5,6,7,8,10,11,12,13,14,15]
    cores=5
    loww=0
    chunksize = int((len(question)-loww)/cores)
    splits = []
    for i in range(cores):
        splits.append(loww+1+((i)*chunksize))
    splits.append(len(question)+1)
    print(splits)
    args = []
    for i in range(cores):
        a=[]
        arguments = (i, splits[i], splits[i+1])
        a.append(arguments)
        args.append(a)
    print(args)
    p = Pool(cores)
    p.map(call_process, args)
    p.close()
    p.join

def call_process(args):
    lower=args[0][1]
    upper=args[0][2]
    for x in range(lower,upper):
        a = question[x-1]
        try:
            pass
        except:
            continue
        #write item to file
        print(f,'a = ',a)
        reader(a)
        writer()

main()
Note: the code seems not to be working.
Sooner or later a process will try to open the file while another process is in the middle of writing to it, and things will break.
Rather, my strategy would be:
Start a process, call it the "chronicler", that monitors a Queue for incoming bits and pieces and writes to the file every time something comes in.
Start the workers. Every time a worker is done, it pushes some bits and pieces into the aforementioned Queue and then continues with the next task (thus handing off the whole file open-write-close process to the "chronicler").
Have all of them monitor an Event called "stop_and_drop_dead". The main process can set() this Event, and the child processes, upon seeing that the Event is set, end themselves gracefully.
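The strategy above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the answerer's code: it swaps the "stop_and_drop_dead" Event for a sentinel value (STOP) placed on the queue, which lets the chronicler drain everything before it exits. The function names, the chunking, and the temporary file location are made up for the example.

```python
import multiprocessing as mp
import os
import tempfile

STOP = None  # sentinel telling the chronicler to finish (assumption of this sketch)


def chronicler(queue, path):
    """Single writer: append every item taken from the queue to the file."""
    with open(path, "a") as handle:
        while True:
            item = queue.get()
            if item is STOP:
                break
            handle.write(f"{item}\n")
            handle.flush()  # keep the on-disk state current


def worker(doc_ids, queue):
    """Process each document ID, then report it to the chronicler."""
    for doc_id in doc_ids:
        # ... the real processing of doc_id would happen here ...
        queue.put(doc_id)


def main():
    path = os.path.join(tempfile.mkdtemp(), "num.txt")
    queue = mp.Queue()
    writer = mp.Process(target=chronicler, args=(queue, path))
    writer.start()

    chunks = [[16, 0, 1, 2], [3, 4, 5, 6], [7, 8, 10, 11]]
    workers = [mp.Process(target=worker, args=(chunk, queue)) for chunk in chunks]
    for proc in workers:
        proc.start()
    for proc in workers:
        proc.join()

    queue.put(STOP)  # all workers are done, so the writer can stop
    writer.join()

    with open(path) as handle:
        print(sorted(int(line) for line in handle))


if __name__ == "__main__":
    main()
```

Only the chronicler ever opens the file, so no two processes can interleave their writes, and the workers never wait on disk I/O.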

Can I asynchronously delete a file in Python?

I have a long-running Python script which creates and deletes temporary files. I notice that a non-trivial amount of time is spent on file deletion, but the only purpose of deleting those files is to ensure that the program doesn't eventually fill up all the disk space during a long run. Is there a cross-platform mechanism in Python to asynchronously delete a file so the main thread can continue to work while the OS takes care of the file deletion?
You can try delegating deleting the files to another thread or process.
Using a newly spawned thread:
thread.start_new_thread(os.remove, (filename,))
Or, using a process:
# create the process pool once
process_pool = multiprocessing.Pool(1)
results = []
# later on, remove a file asynchronously
# note: hold on to the async result until it has completed;
# the arguments must be passed as a tuple
results.append(process_pool.apply_async(os.remove, (filename,)))
# from time to time, drop the results that have completed
results = [r for r in results if not r.ready()]
The process version may allow for more parallelism because Python threads do not execute in parallel due to the notorious global interpreter lock. I would expect, though, that the GIL is released when Python calls any blocking kernel function, such as unlink(), so that another thread can make progress. In other words, a background worker thread that calls os.unlink() may be the best solution; see Tim Peters' answer.
Yet multiprocessing uses Python threads underneath to communicate asynchronously with the processes in the pool, so some benchmarking is required to figure out which version gives more parallelism.
An alternative that avoids Python threads but requires more coding is to spawn another process and send the filenames to its standard input through a pipe. This way you trade os.remove() for a synchronous os.write() (one write() syscall). It can be done using the deprecated os.popen(); this usage of the function is perfectly safe because it only communicates in one direction, to the child process. A working prototype:
#!/usr/bin/python

from __future__ import print_function
import os, sys

def remover():
    for line in sys.stdin:
        filename = line.strip()
        try:
            os.remove(filename)
        except Exception: # ignore errors
            pass

def main():
    if len(sys.argv) == 2 and sys.argv[1] == '--remover-process':
        return remover()

    remover_process = os.popen(sys.argv[0] + ' --remover-process', 'w')
    def remove_file(filename):
        print(filename, file=remover_process)
        remover_process.flush()

    for file in sys.argv[1:]:
        remove_file(file)

if __name__ == "__main__":
    main()
You can create a thread to delete files, following a common producer-consumer pattern:
import threading, Queue

dead_files = Queue.Queue()
END_OF_DATA = object() # a unique sentinel value

def background_deleter():
    import os
    while True:
        path = dead_files.get()
        if path is END_OF_DATA:
            return
        try:
            os.remove(path)
        except: # add the exceptions you want to ignore here
            pass # or log the error, or whatever

deleter = threading.Thread(target=background_deleter)
deleter.start()

# when you want to delete a file, do:
# dead_files.put(file_path)

# when you want to shut down cleanly,
dead_files.put(END_OF_DATA)
deleter.join()
CPython releases the GIL (global interpreter lock) around internal file deletion calls, so this should be effective.
Edit - new text
I would advise against spawning a new process per delete. On some platforms, process creation is quite expensive. Would also advise against spawning a new thread per delete: in a long-running program, you really never want the possibility of creating an unbounded number of threads at any point. Depending on how quickly file deletion requests pile up, that could happen here.
The "solution" above is wordier than the others, because it avoids all that. There's only one new thread total. Of course it could easily be generalized to use any fixed number of threads instead, all sharing the same dead_files queue. Start with 1, add more if needed ;-)
The OS-level file removal primitives are synchronous on both Unix and Windows, so I think you pretty much have to use a worker thread. You could have it pull files to delete off a Queue object, and then when the main thread is done with a file it can just post the file to the queue. If you're using NamedTemporaryFile objects, you probably want to set delete=False in the constructor and just post the name to the queue, not the file object, so you don't have object lifetime headaches.
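As an illustration of the last suggestion, here is a minimal Python 3 sketch that combines NamedTemporaryFile(delete=False) with a single background deleter thread. It is a port of the producer-consumer pattern shown earlier (queue instead of the Python 2 Queue module), with made-up file contents, and is not the code of either answerer.

```python
import os
import queue
import tempfile
import threading

dead_files = queue.Queue()
END_OF_DATA = object()  # unique sentinel that stops the deleter


def background_deleter():
    """Remove every path posted to dead_files until the sentinel arrives."""
    while True:
        path = dead_files.get()
        if path is END_OF_DATA:
            return
        try:
            os.remove(path)
        except OSError:
            pass  # already gone or not removable; a real program might log this


deleter = threading.Thread(target=background_deleter)
deleter.start()

paths = []
for _ in range(3):
    # delete=False keeps the file on disk after close(), so only the
    # name (not the file object) has to travel through the queue.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"scratch data")
    paths.append(tmp.name)
    dead_files.put(tmp.name)  # the main thread continues immediately

# Shut down cleanly: the sentinel is processed after all queued paths.
dead_files.put(END_OF_DATA)
deleter.join()
print([os.path.exists(p) for p in paths])
```

Posting the name rather than the file object means the temporary file can be closed and forgotten by the main thread as soon as it is done with it.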

Python file handling within threads

Hello, I have a program that looks through a range of data and finds anomalies in it. To make it faster I use threads (66 in total). When the program finds an anomaly I want it to write it to a file, but when I try to write to the file from within multiple threads, nothing gets written. Here is a segment of it:
Python:
import threading

class myThread(threading.Thread):
    def __init__(self,lock,output):
        threading.Thread.__init__(self)
        self.lock = lock
        self.file = output
    def run(self):
        main(self.lock,self.file)

def main(lock,file):
    lock.acquire()
    file.write("It wont write :(")
    lock.release

if __name__ == "__main__":
    lock = threading.Lock()
    file = open("file.txt","wb")
    thread1 = myThread(lock,file)
    thread1.start()
Here is my code on a much smaller scale.
My error message is that the file is not open for writing.
EDIT: This code works for some reason, but my full-length code does not, so I am going to post it:
def main(START_IP,END_IP,lock,File):
    # store found DNS servers
    foundDNS=[]
    # scan all the ip addresses in the range
    for i0 in range(START_IP[0], END_IP[0]+1):
        for i1 in range(START_IP[1], END_IP[1]+1):
            for i2 in range(START_IP[2], END_IP[2]+1):
                for i3 in range(START_IP[3], END_IP[3]+1):
                    # build ip address
                    ipaddr=str(i0)+"."+str(i1)+"."+str(i2)+"."+str(i3)
                    print "Scanning "+ipaddr+"...",
                    # scan address
                    ret=ScanDNS(ipaddr, 10)
                    if ret==True:
                        foundDNS.append(ipaddr)
                        print "Found!"
                        lock.acquire()
                        File.write(ipaddr)
                        File.write("\n")
                        File.flush()
                        lock.release()
                    else:
                        print
This uses my exact same myThread class, just with the arguments main needs to manipulate the data. If I run my code for about a minute while it scans for DNS servers,
I should get maybe 20-30 DNS servers saved into a file, but I generally get this:
FILE.TXT
2.2.1.2
8.8.8.8
31.40.40
31.31.40.40
31.31.41.41
I know for a fact (because I watched the scanning output) that this is hardly all of them. So why are some written and some not?
This may be a typo, but this:
lock.release
should have parentheses:
lock.release()
Also, your writes will be buffered until the buffer fills up or you call flush() or close().
Check the documentation for File Objects:
flush() does not necessarily write the file’s data to disk. Use
flush() followed by os.fsync() to ensure this behavior.
File.fileno() gets the file descriptor needed by os.fsync():
with lock:
    File.write(ipaddr)
    File.write("\n")
    File.flush()
    os.fsync(File.fileno())
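Put together, the lock, flush() and os.fsync() advice above might look like the following Python 3 sketch. The helper name record, the sample addresses and the temporary file location are assumptions made for the example; the scanning itself is left out.

```python
import os
import tempfile
import threading


def record(handle, lock, lines):
    """Append lines to a shared file handle, one locked write per line."""
    for line in lines:
        with lock:  # the lock is released even if write() raises
            handle.write(line + "\n")
            handle.flush()             # Python buffer -> operating system
            os.fsync(handle.fileno())  # operating system -> disk


path = os.path.join(tempfile.mkdtemp(), "file.txt")
lock = threading.Lock()
batches = [["2.2.1.2", "8.8.8.8"], ["31.31.40.40", "31.31.41.41"]]

with open(path, "w") as handle:
    threads = [
        threading.Thread(target=record, args=(handle, lock, batch))
        for batch in batches
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait before the with block closes the file

with open(path) as handle:
    written = sorted(handle.read().split())
print(written)
```

Using the lock as a context manager avoids the missing-parentheses mistake entirely, because there is no release() call to forget.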

Python threading with filehandling

Hello, I have a program that looks through a range of data and finds anomalies in it. To make it faster I use threads (66 in total). When the program finds an anomaly I want it to write it to a file, but when I try to write to the file from within multiple threads, nothing gets written.
class myThread(threading.Thread):
    def __init__(self,arg1,arg2,lock,output):
        threading.Thread.__init__(self)
        self.arg1 = arg1
        self.arg2 = arg2
        self.lock = lock
        self.file = output
    def run(self):
        # print "Starting " + self.name
        main(self.arg1,self.arg2,self.lock,self.file)
        # print "Exiting " + self.name

def main(START_IP,END_IP,lock,File):
    # store found DNS servers
    foundDNS=[]
    # scan all the ip addresses in the range
    for i0 in range(START_IP[0], END_IP[0]+1):
        for i1 in range(START_IP[1], END_IP[1]+1):
            for i2 in range(START_IP[2], END_IP[2]+1):
                for i3 in range(START_IP[3], END_IP[3]+1):
                    # build ip address
                    ipaddr=str(i0)+"."+str(i1)+"."+str(i2)+"."+str(i3)
                    print "Scanning "+ipaddr+"...",
                    # scan address
                    ret=ScanDNS(ipaddr, 10)
                    if ret==True:
                        foundDNS.append(ipaddr)
                        print "Found!"
                        lock.acquire()
                        File.write(ipaddr)
                        File.write("\n")
                        File.flush()
                        lock.release()
                    else:
                        print
file = open("file.txt","wb")
lock = threading.Lock()
thread1 = myThread(START_IP,END_IP,lock,file)
thread1.start()
This uses my exact same myThread class, just with the arguments main needs to manipulate the data. If I run my code for about a minute while it scans for DNS servers, I should get maybe 20-30 DNS servers saved into a file, but I generally get this:
FILE.TXT
2.2.1.2
8.8.8.8
31.40.40
31.31.40.40
31.31.41.41
I know for a fact (because I watched the scanning output) that it hardly writes all of them. So why are some written and some not?
I don't know why your code is not working, but I can hazard a guess that it is due to race conditions. Hopefully someone knowledgeable can answer that part of your question.
However, I've encountered a similar problem before, and I solved it by moving the file writing code to a single output thread. This thread read from a synchronized queue to which other threads pushed data to be written.
Also, if you happen to be working on a machine with multiple cores, it may be better to use multiprocessing instead of threading. Because of the GIL, CPython threads execute Python code one at a time, while separate processes do not have this limitation.
Instead of providing the file, provide a Queue. Spawn a new thread that reads from the Queue and writes to the file. Or use locks everywhere, including around print, because some threads can be deadlocked.
To avoid potential errors or misuse when accessing a file from multiple threads, you can try using logging to write down your results.
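import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # the root logger defaults to WARNING
file_handler = logging.FileHandler("file.txt")  # a filename is required
formatter = logging.Formatter("%(message)s")  # replace with your format
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)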
Check the documentation for File Objects:
File.flush() is not enough to ensure that your data is written to disk; add
os.fsync(File.fileno()) just after it to make that happen.
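For the logging suggestion, a small Python 3 sketch of several threads reporting through one FileHandler is given below. The logging module's handlers serialize access to their stream, so no explicit lock is needed here. The logger name, log file location and sample addresses are assumptions made for the example.

```python
import logging
import os
import tempfile
import threading

# Hypothetical log location and logger name, chosen for this sketch.
log_path = os.path.join(tempfile.mkdtemp(), "found_dns.log")
logger = logging.getLogger("dns_scan")
logger.setLevel(logging.INFO)
file_handler = logging.FileHandler(log_path)
file_handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(file_handler)


def report(addresses):
    """Log each address; the handler serializes access to the file."""
    for ipaddr in addresses:
        logger.info(ipaddr)


batches = [["8.8.8.8", "8.8.4.4"], ["1.1.1.1"]]
threads = [threading.Thread(target=report, args=(batch,)) for batch in batches]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# Detach and close the handler so everything is written out.
logger.removeHandler(file_handler)
file_handler.close()

with open(log_path) as handle:
    logged = sorted(line.split()[-1] for line in handle)
print(logged)
```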

python -> multiprocessing module

Here's what I am trying to accomplish -
I have about a million files which I need to parse & append the parsed content to a single file.
Since a single process takes ages, this option is out.
Not using threads in Python as it essentially comes to running a single process (due to GIL).
Hence using multiprocessing module. i.e. spawning 4 sub-processes to utilize all that raw core power :)
So far so good, now I need a shared object which all the sub-processes have access to. I am using Queues from the multiprocessing module. Also, all the sub-processes need to write their output to a single file. A potential place to use Locks I guess. With this setup when I run, I do not get any error (so the parent process seems fine), it just stalls. When I press ctrl-C I see a traceback (one for each sub-process). Also no output is written to the output file. Here's code (note that everything runs fine without multi-processes) -
import os
import glob
from multiprocessing import Process, Queue, Pool

data_file = open('out.txt', 'w+')

def worker(task_queue):
    for file in iter(task_queue.get, 'STOP'):
        data = mine_imdb_page(os.path.join(DATA_DIR, file))
        if data:
            data_file.write(repr(data)+'\n')
    return

def main():
    task_queue = Queue()
    for file in glob.glob('*.csv'):
        task_queue.put(file)
    task_queue.put('STOP') # so that worker processes know when to stop

    # this is the block of code that needs correction.
    if multi_process:
        # One way to spawn 4 processes
        # pool = Pool(processes=4) #Start worker processes
        # res = pool.apply_async(worker, [task_queue, data_file])

        # But I chose to do it like this for now.
        for i in range(4):
            proc = Process(target=worker, args=[task_queue])
            proc.start()
    else: # single process mode is working fine!
        worker(task_queue)
    data_file.close()
    return
what am I doing wrong? I also tried passing the open file_object to each of the processes at the time of spawning. But to no effect. e.g.- Process(target=worker, args=[task_queue, data_file]). But this did not change anything. I feel the subprocesses are not able to write to the file for some reason. Either the instance of the file_object is not getting replicated (at the time of spawn) or some other quirk... Anybody got an idea?
EXTRA: Also Is there any way to keep a persistent mysql_connection open & pass it across to the sub_processes? So I open a mysql connection in my parent process & the open connection should be accessible to all my sub-processes. Basically this is the equivalent of a shared_memory in python. Any ideas here?
Although the discussion with Eric was fruitful, I later found a better way of doing this. The multiprocessing module has a class called 'Pool' which is perfect for my needs.
It sizes itself to the number of cores my system has, i.e. only as many processes are spawned as there are cores. Of course this is customizable. So here's the code. It might help someone later:
from multiprocessing import Pool

def main():
    po = Pool()
    for file in glob.glob('*.csv'):
        filepath = os.path.join(DATA_DIR, file)
        po.apply_async(mine_page, (filepath,), callback=save_data)
    po.close()
    po.join()
    file_ptr.close()

def mine_page(filepath):
    #do whatever it is that you want to do in a separate process.
    return data

def save_data(data):
    #data is a object. Store it in a file, mysql or...
    return
I'm still going through this huge module. I'm not sure whether save_data() is executed by the parent process or by the spawned child processes. If it's the child that does the saving, it might lead to concurrency issues in some situations. If anyone has more experience with this module, I'd appreciate more knowledge here...
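On the open question of where the callback runs: with Pool.apply_async, the callback is invoked in the parent process (by the pool's result-handling thread), not in the workers, so writes made from it are not concurrent across processes. The following sketch records process IDs to show this; mine_page here is a stand-in that only upper-cases its argument, and the file names are invented.

```python
import os
from multiprocessing import Pool

callback_pids = []


def mine_page(filepath):
    """Stand-in for the real parser; runs in a worker process."""
    return filepath.upper(), os.getpid()


def save_data(result):
    """Callback: note which process runs it, then handle the data."""
    callback_pids.append(os.getpid())


def main():
    with Pool(2) as pool:
        for name in ["a.csv", "b.csv", "c.csv"]:
            pool.apply_async(mine_page, (name,), callback=save_data)
        pool.close()
        pool.join()
    print("parent pid:", os.getpid())
    print("callback ran in pids:", set(callback_pids))


if __name__ == "__main__":
    main()
```

Because the callback runs on a thread inside the parent, it should return quickly; a slow callback delays the handling of further results.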
The docs for multiprocessing indicate several methods of sharing state between processes:
http://docs.python.org/dev/library/multiprocessing.html#sharing-state-between-processes
I'm sure each process gets a fresh interpreter and then the target (function) and args are loaded into it. In that case, the global namespace from your script would have been bound to your worker function, so the data_file would be there. However, I am not sure what happens to the file descriptor as it is copied across. Have you tried passing the file object as one of the args?
An alternative is to pass another Queue that will hold the results from the workers. The workers put the results and the main code gets the results and writes it to the file.
