I have searched for an answer on this site but didn't find one. My problem is that I want to convert several files from one format to another in Python, and I would like to convert 4 files simultaneously. I have already written code using Process from the multiprocessing library; it works, but it uses several processes to convert files one by one, which is not what I want. I tried to convert them simultaneously with this code:
import os
import multiprocessing as mp

def convert_all_files(directoryName):
    directoryName = r'C:\...'  # Here I set my directory name
    files2 = []
    pool = mp.Pool(4)
    for path, dirs, files in os.walk(directoryName):
        for f in files:
            f1 = f
            path1 = path
            files2.append((path1, f1))
    for j in range(0, len(files2)):
        pool.apply_async(convert, (files2[j][0], files2[j][1]))
    pool.close()
    pool.join()
My problem is that the code runs but the convert function is never executed, and the code freezes at the line pool.join(). (I use this technique to save a lot of time, because the conversion is very long; yet when I run this code the conversion finishes instantly and nothing gets converted.)
I use the function defined above in another file: I import my module and call the function.
Does anyone have an idea?
Thanks.
Here is a working solution, without the limit of 4 conversions at a time.
import os
from multiprocessing import Process

def convert_all_files(directoryName):
    for folder, subs, files in os.walk(directoryName):
        for filename in files:
            p = Process(target=convert, args=(folder, filename))
            p.start()
If you need the limit, here is a solution, but I am not sure it is the best one:
def convert_all_files(directoryName):
    process_count = 0
    for folder, subs, files in os.walk(directoryName):
        for filename in files:
            p = Process(target=convert, args=(folder, filename))
            p.start()

            # Maybe not the best way to handle it
            process_count = process_count + 1
            if process_count % 4 == 0:
                p.join()
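If the limit of 4 is the main requirement, a hedged alternative is to go back to a Pool with four workers, as in the question, but collect the tasks first and submit them with starmap from inside an if __name__ == '__main__' guard (a missing guard is a common reason, especially on Windows, why pool workers never seem to run). This is only a sketch; convert stands in for the conversion routine from the question:

import os
from multiprocessing import Pool

def convert(path, filename):
    # Placeholder for the conversion routine from the question.
    pass

def convert_all_files(directoryName):
    # Gather (path, filename) pairs first, then let 4 worker processes convert them.
    tasks = []
    for path, dirs, files in os.walk(directoryName):
        for f in files:
            tasks.append((path, f))
    with Pool(processes=4) as pool:
        pool.starmap(convert, tasks)

if __name__ == '__main__':
    convert_all_files(r'C:\...')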
I have used the following code to read multiple files simultaneously
from contextlib import ExitStack

files_to_parse = [file1, file2, file3]

with ExitStack() as stack:
    files = [stack.enter_context(open(i, "r")) for i in files_to_parse]
    for rows in zip(*files):
        for r in rows:
            pass  # do stuff
However, I have noticed that since my files don't all have the same number of lines, as soon as the shortest file reaches its end the iteration stops and all the files are closed.
I used the code above (which I found here on Stack Overflow) because I need to parse several files at the same time (to save time). Doing so divides the computing time by about 4. However, the files aren't parsed entirely because of the problem mentioned above.
Is there any way to solve this problem?
open can be used as a context manager, but it does not have to be. You can use it the old-fashioned way, where you take responsibility for closing the files yourself, that is:
try:
    files = [open(fname) for fname in filenames]
    # here do what you need to with files
finally:
    for file in files:
        file.close()
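As a side note on the original symptom: the loop stops at the shortest file because that is how zip behaves, not because the files are closed early. One hedged workaround, assuming missing lines from the shorter files can simply be skipped, is itertools.zip_longest (the file names below are made up):

from contextlib import ExitStack
from itertools import zip_longest

files_to_parse = ["file1.txt", "file2.txt", "file3.txt"]  # hypothetical paths

with ExitStack() as stack:
    files = [stack.enter_context(open(name, "r")) for name in files_to_parse]
    # zip_longest keeps yielding rows until the longest file is exhausted,
    # padding the shorter ones with None.
    for rows in zip_longest(*files, fillvalue=None):
        for r in rows:
            if r is None:
                continue  # this particular file has already ended
            pass  # do stuff with r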
Please help me with a challenge I have: I need to list files every 30 seconds and process them (processing means, for example, copying them to another location; each file is moved out of the directory once it has been processed). When I list the files again after 30 seconds, I want to skip any files that were listed previously for processing (because they were listed in the earlier pass and the for loop over them is still in progress).
In other words, I want to avoid duplicate file processing while listing the files every 30 seconds.
Here is my code:
import os
import time
from threading import Thread

def List_files():
    path = 'c:\\projects\\hc2\\'
    files = []
    for r, d, f in os.walk(path):
        for file in f:
            if '.txt' in file:
                files.append(os.path.join(r, file))
    return files

class MyFilethreads:
    def __init__(self, t1):
        self.t1 = t1

    def start_threading(self):
        for file in List_files():
            self.t1 = Thread(target=<FILEPROCESS_FUNCTION>, args=(file,))
            self.t1.start()

t1 = Thread()
myclass = MyFilethreads(t1)

while True:
    myclass.start_threading()
    time.sleep(30)
I have not included my actual file-processing function, since it is big; it is called in a thread as FILEPROCESS_FUNCTION.
Problem:
If the file sizes are large, the processing time can sometimes increase (in other words, the for loop takes more than 30 seconds), but I can't change the 30-second timer, since that is a very rare case and my Python script handles hundreds of files every minute.
Hence, I am looking for a way to skip files that were already listed previously, and thereby avoid duplicate file processing.
Please help.
Thanks in advance.
Keep a dictionary (or a set) in your class and add every file you have already seen to it. Then, in start_threading, check whether the file is already in it and skip it in that case.
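A minimal sketch of that idea on top of the question's code; the seen set, the process_file placeholder, and the hard-coded path are assumptions standing in for the real FILEPROCESS_FUNCTION and configuration:

import os
import time
from threading import Thread

def process_file(path):
    # Placeholder for the real FILEPROCESS_FUNCTION.
    pass

def list_files(path='c:\\projects\\hc2\\'):
    found = []
    for r, d, f in os.walk(path):
        for name in f:
            if name.endswith('.txt'):
                found.append(os.path.join(r, name))
    return found

class MyFilethreads:
    def __init__(self):
        self.seen = set()  # paths already handed to a worker thread

    def start_threading(self):
        for path in list_files():
            if path in self.seen:
                continue  # listed in an earlier pass, skip it
            self.seen.add(path)
            Thread(target=process_file, args=(path,)).start()

myclass = MyFilethreads()
while True:
    myclass.start_threading()
    time.sleep(30)

Since the question says each file is moved out of the directory once processed, the set could also be pruned periodically so it does not grow without bound.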
I'm trying to wrap my head around multiprocessing in Python, but I simply can't. Note that I was, am, and will probably forever be a noob at everything programming-related. Ah, anyway. Here it goes.
I'm writing a Python script that compresses images downloaded to a folder with ImageMagick, using predefined variables from the user, stored in an ini file. The script searches for folders matching a pattern in a download dir, checks if they contain JPGs, PNGs or other image files and, if yes, recompresses and renames them, storing the results in a "compressed" folder.
Now, here's the thing: I'd love it if I was able to "parallelize" the whole compression thingy, but... I can't understand how I'm supposed to do that.
I don't want to tire you with the existing code since it simply sucks. It's just a simple "for file in directory" loop. THAT's what I'd love to parallelize - could somebody give me an example of how multiprocessing could be used with files in a directory?
I mean, let's take this simple piece of code:
for f in matching_directory:
    print('I\'m going to process file:', f)
For those that DO have to peek at the code, here's the part where I guess the whole parallelization bit will stick:
for f in ImageFolders:
    print(splitter)
    print(f)
    print(splitter)
    PureName = CleanName(f)
    print(PureName)
    for root, dirs, files in os.walk(f):
        padding = int(round(math.log(len(files), 10))) + 1
        padding = max(minpadding, padding)
        filecounter = 0
        for filename in files:
            if filename.endswith(('.jpg', '.jpeg', '.gif', '.png')):
                filecounter += 1
                imagefile, ext = os.path.splitext(filename)
                newfilename = "%s_%s%s" % (PureName, (str(filecounter).rjust(padding, '0')), '.jpg')
                startfilename = os.path.join(f, filename)
                finalfilename = os.path.join(Dir_Images_To_Publish, PureName, newfilename)
                print(filecounter, ':', startfilename, ' >>> ', finalfilename)
                Original_Image_FileList.append(startfilename)
                Processed_Image_FileList.append(finalfilename)
...and here I'd like to be able to add a piece of code where a worker takes the first file from Original_Image_FileList and compresses it to the first filename from Processed_Image_FileList, a second one takes the one after that, blah-blah, up to a specific number of workers - depending on a user setting in the ini file.
Any ideas?
You can create a pool of workers using the Pool class, to which you can distribute the image compression. See the Using a pool of workers section of the multiprocessing documentation.
If your compression function is called compress(filename), for example, then you can use the Pool.map method to apply this function to an iterable that returns the filenames, i.e. your list matching_directory:
from multiprocessing import Pool

def compress_image(image):
    """Define how you'd like to compress `image`..."""
    pass

def distribute_compression(images, pool_size=4):
    with Pool(processes=pool_size) as pool:
        pool.map(compress_image, images)
There's a variety of map-like methods available, see map for starters. You may like to experiment with the pool size, to see what works best.
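Since the question builds matching source and destination lists, a hedged variation is Pool.starmap over the zipped pairs; compress_image here is still a placeholder, and the two list names are taken from the question:

from multiprocessing import Pool

def compress_image(startfilename, finalfilename):
    """Placeholder: compress `startfilename` and write the result to `finalfilename`."""
    pass

def distribute_compression(pairs, pool_size=4):
    # Each element of `pairs` is a (startfilename, finalfilename) tuple.
    with Pool(processes=pool_size) as pool:
        pool.starmap(compress_image, pairs)

# Usage, assuming the two lists built in the question:
# distribute_compression(zip(Original_Image_FileList, Processed_Image_FileList), pool_size=4)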
I am new to Python and not adept at it. I need to traverse a huge list of directories which contain zipped files within them. While this can be done with the following approach,
import gzip

for file in list:
    for filename in file:
        with gzip.open(filename) as fileopen:
            for line in fileopen:
                process  # placeholder for the actual processing
the time taken would be a few days. Would I be able to use any function that allows me to traverse other parts of the directory concurrently to perform the same work, without any repeats in the traversal?
Any help or direction would be greatly appreciated.
Move the heavy processing to a separate program, then call that program with subprocess to keep a certain number of parallel processes running:
import subprocess
import time

todo = []
for file in list:
    for filename in file:
        todo.append(filename)

running_processes = []
while len(todo) > 0:
    running_processes = [p for p in running_processes if p.poll() is None]
    if len(running_processes) < 8:
        target = todo.pop()
        running_processes.append(subprocess.Popen(['python', 'process_gzip.py', target]))
    time.sleep(1)
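For completeness, process_gzip.py could be as small as the sketch below; the per-line work is a placeholder, and the script name is only the assumption carried over from the snippet above:

# process_gzip.py
import gzip
import sys

def main(path):
    with gzip.open(path, "rt") as fh:
        for line in fh:
            pass  # replace with the real per-line processing

if __name__ == "__main__":
    main(sys.argv[1])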
You can open many files concurrently. For instance:
import gzip

files = [gzip.open(f, "rb") for f in fileslist]
processed = [process(f) for f in files]
(By the way, don't call your list of files "list" or a single file "file": these names shadow Python built-ins and do not describe what the object really is in your case.)
Now it is going to take about the same time, since you always process them one at a time. So, is it the processing of them that you want to parallelize? Then you want to look at threading or multiprocessing.
Are you looking for os.path.walk to traverse directories? (https://docs.python.org/2/library/os.path.html). You can also do:
for folder in folderslist:
    fileslist = os.listdir(folder)
    for file in fileslist:
        ....
Are you interested in fileinput to iterate over lines from multiple input streams? (https://docs.python.org/2/library/fileinput.html; fileinput.hook_compressed seems to handle gzip.)
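For the fileinput route, a brief hedged sketch (the file names are made up); hook_compressed transparently opens .gz and .bz2 files:

import fileinput

# Iterate over the lines of several gzip files as one continuous stream.
for line in fileinput.input(files=["a.gz", "b.gz"], openhook=fileinput.hook_compressed):
    pass  # process the line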
I have a thread that I would like to loop through all of the .txt files in a certain directory (C:\files\). All I need is help reading anything from that directory that is a .txt file. I can't seem to figure it out. Here is my current code, which looks for specific files:
def file_Read(self):
    if self.is_connected:
        threading.Timer(5, self.file_Read).start()
        print '~~~~~~~~~~~~Thread test~~~~~~~~~~~~~~~'
        try:
            with open('C:\\files\\test.txt', 'r') as content_file:
                content = content_file.read()
            Num, Message = content.strip().split(';')
            print Num
            print Message
            print Num
            self.send_message(Num, Message)
            os.remove("test.txt")
        except Exception as e:
            print 'no file ', e
            time.sleep(10)
Does anyone have a simple fix for this? I have found a lot of threads using approaches like:
directory = os.path.join("c:\\files\\", "path")

threading.Timer(5, self.file_Read).start()
print '~~~~~~~~~~~~Thread test~~~~~~~~~~~~~~~'
try:
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".txt"):
                content_file = open(file, 'r')
but this doesn't seem to be working.
Any help would be appreciated. Thanks in advance...
I would do something like this, by using glob:
import glob
import os

txtpattern = os.path.join("c:\\files\\", "*.txt")
files = glob.glob(txtpattern)

for f in files:
    print "Filename : %s" % f
    # Do what you want with the file
This method works only if you want to read the .txt files in your directory itself and not in its potential subdirectories.
Take a look at the manual entries for os.walk if you need to recurse into sub-directories, or glob.glob if you are only interested in a single directory.
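A short sketch of the os.walk variant, which also recurses into sub-directories and joins each name with its directory (the missing join is probably why the open(file, 'r') call in the question fails):

import os

txt_files = []
for root, dirs, files in os.walk("c:\\files\\"):
    for name in files:
        if name.endswith(".txt"):
            txt_files.append(os.path.join(root, name))  # full path, not just the name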
The main problem is that the first thing you do in the function you want to run in the threads is to create a new timer thread running that same function. Since every thread starts a new thread, you get an ever-increasing number of threads starting new threads, which also seems to be what happens.
If you want to do some work on all the files, and you want to do that in parallel on a multi-core machine (which is what I'm guessing) take a look at the multiprocessing module, and the Queue class. But get the file handling code working first before you try to parallelize it.
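A rough sketch of the multiprocessing-plus-Queue direction; the per-file work is a placeholder, four worker processes pull paths from a JoinableQueue, and one None sentinel per worker shuts them down:

import glob
import os
from multiprocessing import Process, JoinableQueue

def worker(queue):
    # Pull paths off the queue until the None sentinel arrives.
    while True:
        path = queue.get()
        if path is None:
            queue.task_done()
            break
        with open(path, 'r') as fh:
            fh.read()  # placeholder for the real per-file processing
        queue.task_done()

if __name__ == '__main__':
    queue = JoinableQueue()
    workers = [Process(target=worker, args=(queue,)) for _ in range(4)]
    for p in workers:
        p.start()
    for path in glob.glob(os.path.join('c:\\files\\', '*.txt')):
        queue.put(path)
    for _ in workers:
        queue.put(None)  # one sentinel per worker
    queue.join()
    for p in workers:
        p.join()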