I am working on learning multiprocessing, and had no issues until I encountered this one while working with queues. Essentially, the queue gets filled up, but then something seems to go wrong and it crashes.
I am running Python 3.6.8 on Windows 10. multiprocessing seemed to work fine when I was not using queues (I built a similar code snippet to the one below, without queues, while learning).
import glob, multiprocessing, os

def appendFilesThreaded(inputDirectory, outputDirectory, inputFileType=".txt", outputFileName="appended_files.txt"):
    files = glob.glob(inputDirectory + '*' + inputFileType)
    fileQueue = multiprocessing.Queue()
    for file in files:
        fileQueue.put(file)
    threadsToUse = max(1, multiprocessing.cpu_count() - 1)
    print("Using " + str(threadsToUse) + " worker threads.")
    processes = []
    for i in range(threadsToUse):
        p = multiprocessing.Process(target=appendFilesWorker, args=(fileQueue, outputDirectory + "temp-" + str(i) + outputFileName))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()
    with open(outputDirectory + outputFileName, 'w') as outputFile:
        for i in range(threadsToUse):
            with open(outputDirectory + "temp-" + str(i) + outputFileName) as fileToAppend:
                outputFile.write(fileToAppend.read())
            os.remove(outputDirectory + "temp-" + str(i) + outputFileName)
    print('Done')

def appendFilesWorker(fileQueue, outputFileNamePath):
    with open(outputFileNamePath, 'w') as outputFile:
        while not fileQueue.empty:
            with open(fileQueue.get()) as fileToAppend:
                outputFile.write(fileToAppend.read())

if __name__ == '__main__':
    appendFilesThreaded(inputDir, outputDir)
I would expect this to successfully append files, but it crashes. It results in BrokenPipeError: [WinError 232] The pipe is being closed
Found the issue: referencing queue.empty without parentheses is incorrect (the bound method is always truthy, so not fileQueue.empty is always False and the workers exit immediately). You need to call it, e.g. queue.empty().
I'll leave my embarrassing mistake up in case it helps others :)
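For what it's worth, multiprocessing.Queue.empty() is documented as unreliable when several processes share the queue, so even with the parentheses the check can race. A more defensive sketch of the worker (my own variant, assuming all files are enqueued before the workers start) pulls with a timeout and stops on queue.Empty:
import queue  # standard-library module that defines the Empty exception

def appendFilesWorker(fileQueue, outputFileNamePath):
    with open(outputFileNamePath, 'w') as outputFile:
        while True:
            try:
                # All files were enqueued before the workers started, so a short
                # timeout is enough to decide that the queue has been drained.
                filePath = fileQueue.get(timeout=1)
            except queue.Empty:
                break
            with open(filePath) as fileToAppend:
                outputFile.write(fileToAppend.read())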
I am running a script that iterates through a text file. Each line of the text file contains an IP address. The script grabs the banner, then writes the IP and banner to another file.
The problem is, it just stops around 500 lines, more or less, with no error.
Another weird thing is that if I run it with python3 it does what I said above, but if I run it with python it iterates through those 500 lines and then starts again from the beginning. I noticed this when I saw repetitions in my output file. Anyway, here is the code; maybe you guys can tell me what I'm doing wrong:
import os
import subprocess
import concurrent.futures
#import time, random
import threading
import multiprocessing

with open("ipuri666.txt") as f:

    def multiprocessing_func():
        try:
            line2 = line.rstrip('\r\n')
            a = subprocess.Popen(["curl", "-I", line2, "--connect-timeout", "1", "--max-time", "1"], stdout=subprocess.PIPE)
            b = subprocess.Popen(["grep", "Server"], stdin=a.stdout, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            #a.stdout.close()
            out, err = b.communicate()
            g = open("IP_BANNER2", "a")
            print("out: {0}".format(out))
            g.write(line2 + " " + "out: {0}\n".format(out))
            print("err: {0}".format(err))
        except IOError:
            print("Connection timed out")

    if __name__ == '__main__':
        #starttime = time.time()
        processes = []
        for line in f:
            p = multiprocessing.Process(target=multiprocessing_func, args=())
            processes.append(p)
            p.start()
        for process in processes:
            process.join()
If your use case allows it, I would recommend just rewriting this as a shell script; there is no need to use Python here. (This would likely solve your issue indirectly.)
#!/usr/bin/env bash

readarray -t ips < ipuri666.txt

for ip in "${ips[@]}"; do
    output=$(curl -I "$ip" --connect-timeout 1 --max-time 1 | grep "Server")
    echo "$ip $output" >> fisier.txt
done
The script is slightly simpler than what you are trying to do; for instance, I do not capture the error output. Still, this should be pretty close to what you are trying to accomplish. I will update again if needed.
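If you would rather stay in Python, here is a minimal sketch of one way to do it (the file names are taken from your code; the grab_banner helper, the max_workers value, and the "Server" matching are my own assumptions). It reads the file up front, passes each IP to the worker explicitly instead of relying on a leaked loop variable, and caps concurrency with a thread pool so hundreds of processes are never spawned at once:
#!/usr/bin/env python3
import subprocess
from concurrent.futures import ThreadPoolExecutor

def grab_banner(ip):
    # Mirror the curl | grep Server pipeline from the original code.
    result = subprocess.run(
        ["curl", "-I", ip, "--connect-timeout", "1", "--max-time", "1"],
        stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, universal_newlines=True)
    server = next((l for l in result.stdout.splitlines() if "Server" in l), "")
    return ip + " " + server

if __name__ == '__main__':
    with open("ipuri666.txt") as f:
        ips = [line.strip() for line in f if line.strip()]
    # A bounded pool of threads instead of one process per IP.
    with ThreadPoolExecutor(max_workers=20) as pool, open("IP_BANNER2", "w") as out:
        for banner in pool.map(grab_banner, ips):
            print(banner)
            out.write(banner + "\n")
Bounding the number of workers also avoids hitting process or file-descriptor limits, which is a common reason for scripts like this silently dying partway through.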
I have a sequence of about 1000 CR2 images which I need to convert to TIFF16. The following command line works:
darktable-cli input_image.CR2 colorcard.xmp output.tiff --core --conf plugins/imageio/format/tiff/bpp=16
But when I want to execute that command in parallel via the Python code below, I am getting the following error after one image is converted:
[init] the database lock file contains a pid that seems to be alive in your system: 31531
[init] database is locked, probably another process is already using it
ERROR: can't acquire database lock, aborting.
Here is my Python code:
#!/usr/bin/env python3
import glob
import shlex
import subprocess
import multiprocessing as mp
from multiprocessing import Pool
def call_proc(cmd):
    subprocess.run(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
app = '/Applications/darktable.app/Contents/MacOS/darktable-cli '
xmp = ' colorcard.xmp '
opt = ' --core --conf plugins/imageio/format/tiff/bpp=16 --conf plugins/imageio/storage/disk/overwrite=true --library /tmp/darktable.db'
raw_images = glob.glob('indata/*')
procs = []
for raw_image in raw_images:
    tif_image = raw_image.replace('.CR2', '.tif').replace('indata', 'outdata')
    cmd = app + raw_image + xmp + tif_image + opt
    procs.append(cmd)
pool = Pool(mp.cpu_count())
pool.map(call_proc, procs)
pool.close()
pool.join()
Platform:
Darktable Version: darktable-cli 3.0.0
OS: macOS Mojave 10.14.3 (18D42)
NVIDIA GeForce GTX 680MX 2048 MB
I found the following thread but had no luck with the given solution.
Any help is highly appreciated.
In the thread, @miguev gave the answer that helped me. It is not pretty, but it works: for each image I create a tmp directory and pass it to the --configdir option, like so:
for i in raw_images:
    os.mkdir('/tmp/' + str(os.path.basename(i).split('.')[0]))

cmds_list = []
for raw_image in raw_images:
    tif_image = raw_image.replace('.CR2', '.tif').replace('indata', 'outdata')
    cmd = app + raw_image + ' ' + xmp + ' ' + tif_image + opt + ' --configdir /tmp/' + str(os.path.basename(raw_image).split('.')[0])
    cmds_list.append(cmd)
All you need to do at the end, once everything is done, is clean up behind you.
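For completeness, a minimal cleanup sketch (assuming the same raw_images list and /tmp naming scheme as above) could look like this:
import os
import shutil

# Remove the per-image temporary config directories created above.
for raw_image in raw_images:
    config_dir = '/tmp/' + str(os.path.basename(raw_image).split('.')[0])
    shutil.rmtree(config_dir, ignore_errors=True)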
I'm trying to write the results from multiprocessing (4 cores/processes) to file. Since the CPU cores work simultaneously, I thought of making 4 files, 0.txt, 1.txt, 2.txt and 3.txt, and keeping them in a multiprocessing.Manager().list(). But I'm getting an error: TypeError: cannot serialize '_io.TextIOWrapper' object.
def run_solver(total, proc_id, result, fouts):
    for i in range(10):
        fouts[proc_id].write('hi\n')

if __name__ == '__main__':
    processes = []
    fouts = Manager().list((open('0.txt', 'w'), open('1.txt', 'w'), open('2.txt', 'w'), open('3.txt', 'w')))
    for proc_id in range(os.cpu_count()):
        processes.append(Process(target=run_solver, args=(int(total/os.cpu_count()), proc_id, result, fouts)))
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    for i in range(len(fouts)):
        fouts[i].close()
I've tried populating the list with the file handles inside the function too, like below.
def run_solver(total, proc_id, result, fouts):
    fouts[proc_id] = open(str(proc_id) + '.txt', 'w')
    for i in range(10):
        fouts[proc_id].write('hi\n')
    fouts[proc_id].close()

if __name__ == '__main__':
    processes = []
    fouts = Manager().list([0]*os.cpu_count())
Neither works, and I understand it's related to the file objects not being serializable/picklable, but I don't know how to resolve this. Can someone suggest a solution?
Open the files in each process. Do not open them in the manager; you can't send open files from the manager process to the executor processes.
def run_solver(total, proc_id, result, fouts):
    with open(fouts[proc_id], 'w') as openfile:
        for i in range(10):
            openfile.write('hi\n')

if __name__ == '__main__':
    processes = []
    with Manager() as manager:
        fouts = manager.list(['0.txt', '1.txt', '2.txt', '3.txt'])
        for proc_id in range(os.cpu_count()):
            processes.append(Process(
                target=run_solver, args=(
                    int(total/os.cpu_count()), proc_id, result, fouts)
            ))
If you are sharing filenames between processes and want to prevent race conditions when writing to those files, you really want to use a lock per file too:
def run_solver(total, proc_id, result, fouts, locks):
    with open(fouts[proc_id], 'a') as openfile:
        for i in range(10):
            with locks[proc_id]:
                openfile.write('hi\n')
                openfile.flush()

if __name__ == '__main__':
    processes = []
    with Manager() as manager:
        fouts = manager.list(['0.txt', '1.txt', '2.txt', '3.txt'])
        locks = manager.list([Lock() for fout in fouts])
        for proc_id in range(os.cpu_count()):
            processes.append(Process(
                target=run_solver, args=(
                    int(total/os.cpu_count()), proc_id, result, fouts, locks
                )
            ))
Because the files are opened with a with statement, they are closed automatically each time, and they are opened in append mode so different processes don't clobber one another. You do need to remember to flush the write buffer before unlocking again.
As an aside, you probably want to look at process pools rather than doing your own manual pooling.
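For example, here is a rough sketch of the same idea with multiprocessing.Pool; the file names and the fixed 10 writes are just placeholders carried over from the example above:
import os
from multiprocessing import Pool

def run_solver(proc_id):
    # Each worker opens its own file, so nothing unpicklable crosses process boundaries.
    with open('%d.txt' % proc_id, 'w') as openfile:
        for i in range(10):
            openfile.write('hi\n')
    return proc_id

if __name__ == '__main__':
    with Pool(os.cpu_count()) as pool:
        finished = pool.map(run_solver, range(os.cpu_count()))
    print('Workers finished:', finished)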
So I need to be able to read and count the number of lines in a file on an FTP server WITHOUT downloading it to my local machine, using Python.
I know the code to connect to the server:
ftp = ftplib.FTP('example.com')  # ftp object set to the server address
ftp.login('username', 'password')  # login info
ftp.retrlines('LIST')  # list file directories
ftp.cwd('/parent folder/another folder/file/')  # change directory
I also know the basic code to count the number of lines if the file is already downloaded/stored locally:
with open('file') as f:
    count = sum(1 for line in f)
    print(count)
I just need to know how to connect these 2 pieces of code without having to download the file to my local system.
Any help is appreciated.
Thank You
As far as I know, FTP doesn't provide any functionality to read a file's content without actually downloading it. However, you could try something like the approach in Is it possible to read FTP files without writing them using Python?
(You haven't specified which Python version you are using.)
#!/usr/bin/env python
from ftplib import FTP
def countLines(s):
    print len(s.split('\n'))
ftp = FTP('ftp.kernel.org')
ftp.login()
ftp.retrbinary('RETR /pub/README_ABOUT_BZ2_FILES', countLines)
Please take this code as a reference only
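Since the snippet above is Python 2, here is a rough Python 3 sketch of the same idea, reusing the same example host and path; it accumulates the newline count across chunks in the retrbinary callback and prints a single total at the end:
#!/usr/bin/env python3
from ftplib import FTP

line_count = 0

def count_chunk(chunk):
    # retrbinary calls this once per downloaded block; tally newline bytes as they stream in.
    global line_count
    line_count += chunk.count(b'\n')

ftp = FTP('ftp.kernel.org')
ftp.login()
ftp.retrbinary('RETR /pub/README_ABOUT_BZ2_FILES', count_chunk)
ftp.quit()
print(line_count, "lines")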
There is a way: I adapted a piece of code that I created for processing CSV files "on the fly". It is implemented with a producer-consumer approach. Applying this pattern lets us assign each task to a thread (or process) and show partial results for huge remote files. You can adapt it for FTP requests.
The download stream is saved in a queue and consumed "on the fly". No extra HDD space is needed, and it is memory efficient. Tested in Python 3.5.2 (vanilla) on Fedora Core 25 x86_64.
This is the source adapted for FTP (over HTTP) retrieval:
from threading import Thread, Event
from queue import Queue, Empty
import urllib.request,sys,csv,io,os,time;
import argparse
FILE_URL = 'http://cdiac.ornl.gov/ftp/ndp030/CSV-FILES/nation.1751_2010.csv'
def download_task(url, chunk_queue, event):
    CHUNK = 1*1024
    response = urllib.request.urlopen(url)
    event.clear()
    print('%% - Starting Download - %%')
    print('%% - ------------------ - %%')
    '''VT100 control codes.'''
    CURSOR_UP_ONE = '\x1b[1A'
    ERASE_LINE = '\x1b[2K'
    while True:
        chunk = response.read(CHUNK)
        if not chunk:
            print('%% - Download completed - %%')
            event.set()
            break
        chunk_queue.put(chunk)

def count_task(chunk_queue, event):
    part = False
    time.sleep(5)  # give the producer some time
    M = 0
    contador = 0
    '''VT100 control codes.'''
    CURSOR_UP_ONE = '\x1b[1A'
    ERASE_LINE = '\x1b[2K'
    while True:
        try:
            # By default queue.get() blocks when the queue is empty.
            # Here block=False is used: when the queue is empty, get() does not block and
            # throws a queue.Empty exception, which is used to show a partial result of the process.
            chunk = chunk_queue.get(block=False)
            for line in chunk.splitlines(True):
                if line.endswith(b'\n'):
                    if part:  # the previous chunk ended mid-line; prepend the leftover part
                        line = linepart + line
                        part = False
                    M += 1
                else:
                    # If the line does not end with '\n' it is the last line of the chunk:
                    # a partial line that is completed in the next iteration over the next chunk.
                    part = True
                    linepart = line
        except Empty:
            # QUEUE EMPTY
            print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
            print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
            print('Downloading records ...')
            if M > 0:
                print('Partial result: Lines: %d ' % M)  # M-1 because M includes the header
            if event.is_set():  # THE END: no elements in the queue and download finished (event is set)
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print(CURSOR_UP_ONE + ERASE_LINE + CURSOR_UP_ONE)
                print('The consumer has waited %s times' % str(contador))
                print('RECORDS = ', M)
                break
            contador += 1
            time.sleep(1)  # give some time for loading more records

def main():
    chunk_queue = Queue()
    event = Event()
    args = parse_args()
    url = args.url
    p1 = Thread(target=download_task, args=(url, chunk_queue, event,))
    p1.start()
    p2 = Thread(target=count_task, args=(chunk_queue, event,))
    p2.start()
    p1.join()
    p2.join()

# The user of this module can customize one parameter:
# + URL where the remote file can be found.
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-u', '--url', default=FILE_URL,
                        help='remote-csv-file URL')
    return parser.parse_args()

if __name__ == '__main__':
    main()
Usage
$ python ftp-data.py -u <ftp-file>
Example:
python ftp-data-ol.py -u 'http://cdiac.ornl.gov/ftp/ndp030/CSV-FILES/nation.1751_2010.csv'
The consumer has waited 0 times
RECORDS = 16327
Csv version on Github: https://github.com/AALVAREZG/csv-data-onthefly
I'm trying to run an unknown number of commands and capture their stdout in a file. However, I am presented with a difficulty when attempting to p.wait() on each instance. My code looks something like this:
print "Started..."
for i, cmd in enumerate(commands):
i = "output_%d.log" % i
p = Popen(cmd, shell=True, universal_newlines=True, stdout=open(i, 'w'))
p.wait()
print "Done!"
I'm looking for a way to execute everything in commands simultaneously and exit the current script only when each and every single process has been completed. It would also help to be informed when each command returns an exit code.
I've looked at some answers, including this one by J.F. Sebastian, and tried to adapt it to my situation by changing args=(p.stdout, q) to args=(p.returncode, q), but it ended up exiting immediately and running in the background (possibly due to shell=True?), as well as not responding to any keys pressed inside the bash shell... I don't know where to go with this.
Jeremy Brown's answer also helped, sort of, but select.epoll() was throwing an AttributeError exception.
Is there any other seamless way or trick to make it work? It doesn't need to be cross platform, a solution for GNU/Linux and macOS would be much appreciated. Thanks in advance!
A big thanks to Adam Matan for the biggest hint towards the solution. This is what I came up with, and it works flawlessly:
It initiates each Thread object in parallel
It starts each instance simultaneously
Finally it waits for each exit code without blocking other threads
Here is the code:
import threading
import subprocess
...
def run(cmd):
    name = cmd.split()[0]
    out = open("%s_log.txt" % name, 'w')
    err = open('/dev/null', 'w')
    p = subprocess.Popen(cmd.split(), stdout=out, stderr=err)
    p.wait()
    print name + " completed, return code: " + str(p.returncode)
...
proc = [threading.Thread(target=run, args=(cmd)) for cmd in commands]
[p.start() for p in proc]
[p.join() for p in proc]
print "Done!"
I would have rather added this as a comment, since I was working off of Jack of all Spades' answer, but I had trouble getting that exact command to work because it was unpacking the strings in my list of commands.
Here's my edit for Python 3:
import subprocess
import threading
commands = ['sleep 2', 'sleep 4', 'sleep 8']
def run(cmd):
    print("Command %s" % cmd)
    name = cmd.split(' ')[0]
    print("name %s" % name)
    out = open('/tmp/%s_log.txt' % name, 'w')
    err = open('/dev/null', 'w')
    p = subprocess.Popen(cmd.split(' '), stdout=out, stderr=err)
    p.wait()
    print(name + " completed, return code: " + str(p.returncode))
proc = [threading.Thread(target=run, kwargs={'cmd':cmd}) for cmd in commands]
[p.start() for p in proc]
[p.join() for p in proc]
print("Done!")