I'm trying to wrap my head around multiprocessing in Python, but I simply can't. Note that I was, am, and probably forever will be a noob in everything programming-related. Ah, anyway. Here it goes.
I'm writing a Python script that compresses images downloaded to a folder with ImageMagick, using predefined variables from the user, stored in an ini file. The script searches for folders matching a pattern in a download dir, checks if they contain JPGs, PNGs or other image files and, if yes, recompresses and renames them, storing the results in a "compressed" folder.
Now, here's the thing: I'd love it if I was able to "parallelize" the whole compression thingy, but... I can't understand how I'm supposed to do that.
I don't want to tire you with the existing code since it simply sucks. It's just a simple "for file in directory" loop. THAT's what I'd love to parallelize - could somebody give me an example of how multiprocessing could be used with the files in a directory?
I mean, let's take this simple piece of code:
for f in matching_directory:
    print('I\'m going to process file:', f)
For those that DO have to peek at the code, here's the part where I guess the whole parallelization bit will stick:
for f in ImageFolders:
    print(splitter)
    print(f)
    print(splitter)
    PureName = CleanName(f)
    print(PureName)
    for root, dirs, files in os.walk(f):
        padding = int(round(math.log(len(files), 10))) + 1
        padding = max(minpadding, padding)
        filecounter = 0
        for filename in files:
            if filename.endswith(('.jpg', '.jpeg', '.gif', '.png')):
                filecounter += 1
                imagefile, ext = os.path.splitext(filename)
                newfilename = "%s_%s%s" % (PureName, str(filecounter).rjust(padding, '0'), '.jpg')
                startfilename = os.path.join(f, filename)
                finalfilename = os.path.join(Dir_Images_To_Publish, PureName, newfilename)
                print(filecounter, ':', startfilename, ' >>> ', finalfilename)
                Original_Image_FileList.append(startfilename)
                Processed_Image_FileList.append(finalfilename)
...and here I'd like to be able to add a piece of code where a worker takes the first file from Original_Image_FileList and compresses it to the first filename from Processed_Image_FileList, a second one takes the one after that, blah-blah, up to a specific number of workers - depending on a user setting in the ini file.
Any ideas?
You can create a pool of workers using the Pool class and distribute the image compression to it. See the "Using a pool of workers" section of the multiprocessing documentation.
If your compression function is called compress(filename), for example, then you can use the Pool.map method to apply this function to an iterable that returns the filenames, i.e. your list matching_directory:
from multiprocessing import Pool

def compress_image(image):
    """Define how you'd like to compress `image`..."""
    pass

def distribute_compression(images, pool_size=4):
    with Pool(processes=pool_size) as pool:
        pool.map(compress_image, images)
There's a variety of map-like methods available; see map for starters. You may like to experiment with the pool size to see what works best.
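For the two-list setup described in the question (Original_Image_FileList paired with Processed_Image_FileList), one option is to zip the lists and hand the pairs to Pool.starmap. The sketch below is only an illustration under assumptions: the compress_pair helper and the ImageMagick convert options are mine, not from the question, and the worker count would come from the ini setting.

import subprocess
from multiprocessing import Pool

def compress_pair(src, dst):
    # Hypothetical worker: recompress `src` into `dst` with ImageMagick's
    # command-line `convert` tool (the options here are an assumption).
    subprocess.run(['convert', src, '-quality', '85', dst], check=True)

if __name__ == '__main__':
    # Original_Image_FileList / Processed_Image_FileList are the lists
    # built in the question's loop above.
    pairs = list(zip(Original_Image_FileList, Processed_Image_FileList))
    with Pool(processes=4) as pool:   # 4 would come from the ini file
        pool.starmap(compress_pair, pairs)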
Related
I wrote a script that sums the size of the files in subdirectories on an FTP server:
for dirs in ftp.nlst("."):
    try:
        print("Searching in " + dirs + "...")
        ftp.cwd(dirs)
        for files in ftp.nlst("."):
            size += ftp.size(files)
        ftp.cwd("../")
    except ftplib.error_perm:
        pass

print("Total size of " + serveradd + tvt + " = " + str(size * 10**-9) + " GB")
Is there a quicker way to get the size of the whole directory tree other than summing the file sizes for all directories?
As Alex Hall commented, this is not recursive. I'll address the speeding-up issue, as you can read about recursion from many sources, for example here.
Putting that aside, you didn't mention approximately how many files are in that directory, but you're wasting time by spending a whole round-trip on every single file. Instead, ask the server to return the entire listing for the directory and sum the file sizes:
import re

class DirSizer:
    def __init__(self):
        self.size = 0

    def add_list_entry(self, lst):
        if '<DIR>' not in lst:
            metadata = re.split(r'\s+', lst)
            self.size += int(metadata[2])

ds = DirSizer()
ftp.retrlines('LIST', ds.add_list_entry)  # add_list_entry will be called for every line
print(ds.size)  # => size (shallow, currently) of the directory
Note that:
This should of course be done recursively for every directory in the tree.
Your server might return the list in a different format, so you might need to change either the re.split line or the metadata[2] part.
If your server supports the MLSD FTP command, use that instead, as it'll be in a standardized format.
See here for an explanation of retrlines and the callback.
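If the server does support MLSD, a recursive size sketch could look like the following; this is only an illustration and assumes the server reports the type and size facts:

def mlsd_tree_size(ftp, path=""):
    # Recursively sum file sizes under `path`, using one MLSD listing per
    # directory (assumes the server supports MLSD and the 'size' fact).
    total = 0
    for name, facts in ftp.mlsd(path, facts=["type", "size"]):
        if facts.get("type") == "dir":
            subdir = path + "/" + name if path else name
            total += mlsd_tree_size(ftp, subdir)
        elif facts.get("type") == "file":
            total += int(facts.get("size", 0))
    return total

print(mlsd_tree_size(ftp))  # size of the whole tree, in bytes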
I have searched for an answer on this site but didn't find one. My problem is that I want to convert several files from one format to another in Python, and I would like to convert 4 files simultaneously. I have already written code using the Process class from the multiprocessing library, and it works, but it uses several processes to convert the files one by one, which is not what I want. I tried to convert them simultaneously with this code:
def convert_all_files(directoryName):
    directoryName = r'Here I set my directory name C:\...'
    files2 = []
    pool = mp.Pool(4)
    for path, dirs, files in os.walk(directoryName):
        for f in files:
            f1 = f
            path1 = path
            files2.append((path1, f1))
    for j in range(0, len(files2)):
        pool.apply_async(convert, (files2[j][0], files2[j][1]))
    pool.close()
    pool.join()
My problem is that the code runs, but the convert function is never executed and the code freezes at the line pool.join(). (I'm using this technique to save a lot of time, because the conversion is very long; but when I run this code, the "conversion" finishes instantly and nothing is actually converted.)
I use the function defined above in another file. I import my module and call the function.
Does anyone have an idea?
Thanks
Here is a working solution, without the limitation of 4 conversions at the same time.
def convert_all_files(directoryName):
    for folder, subs, files in os.walk(directoryName):
        for filename in files:
            p = Process(target=convert, args=(folder, filename))
            p.start()
If you need the limitation, here is a solution, but I am not sure it is the best one:
def convert_all_files(directoryName):
    process_count = 0
    for folder, subs, files in os.walk(directoryName):
        for filename in files:
            p = Process(target=convert, args=(folder, filename))
            p.start()
            # Maybe not the best way to handle it
            process_count = process_count + 1
            if process_count % 4 == 0:
                p.join()
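If you do want the hard cap of four workers, a Pool can also handle it. The sketch below is only one possible variant; it assumes that convert(folder, filename) is defined at module level (so the child processes can import it) and that the call site sits behind an if __name__ == '__main__': guard, both of which are common reasons why apply_async jobs silently never run.

import os
from multiprocessing import Pool

def convert_all_files(directoryName):
    jobs = []
    for folder, subs, files in os.walk(directoryName):
        for filename in files:
            jobs.append((folder, filename))
    with Pool(processes=4) as pool:
        # starmap blocks until every conversion has finished and
        # re-raises any exception a worker hit, instead of hiding it
        pool.starmap(convert, jobs)

if __name__ == '__main__':
    convert_all_files(r'C:\your\directory')  # hypothetical path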
I am new to Python and not adept at it. I need to traverse a huge list of directories which contain zipped files within them. While this can be done with something like:
for file in list:
    for filename in file:
        with gzip.open(filename) as fileopen:
            for line in fileopen:
                process
the time taken would be a few days. Would I be able to use any function that allows me to traverse other parts of the directory concurrently, performing the same work without any repeats in the traversal?
Any help or direction would be greatly appreciated
Move the heavy processing to a separate program, then call that program with subprocess to keep a certain number of parallel processes running:
import subprocess
import time

todo = []
for file in list:
    for filename in file:
        todo.append(filename)

running_processes = []
while len(todo) > 0:
    running_processes = [p for p in running_processes if p.poll() is None]
    if len(running_processes) < 8:
        target = todo.pop()
        running_processes.append(subprocess.Popen(['python', 'process_gzip.py', target]))
    time.sleep(1)
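The process_gzip.py script above is hypothetical; a minimal sketch of what it might contain (the per-line work is an assumption):

# process_gzip.py (hypothetical helper script)
import gzip
import sys

def main():
    path = sys.argv[1]                 # filename passed in by the parent script
    with gzip.open(path, 'rt') as fh:  # 'rt' iterates over decoded text lines
        for line in fh:
            pass                       # real per-line processing goes here

if __name__ == '__main__':
    main()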
You can open many files concurrently. For instance:
files = [gzip.open(f,"rb") for f in fileslist]
processed = [process(f) for f in files]
(By the way, don't call your list "list", or a list of files "file": those are built-in names in Python and don't describe what the object really is in your case.)
Now it is going to take about the same time, since you always process them one at a time. So, is it the processing of them that you want to parallelize? Then you want to look at threading or multiprocessing.
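If it is the per-file processing you want to spread over several cores, a minimal multiprocessing sketch could look like the following; the process_file body and the flat filepaths list are assumptions, not code from the question:

import gzip
from multiprocessing import Pool

def process_file(path):
    # Assumed per-file work: open the gzipped file and handle each line.
    with gzip.open(path, 'rt') as fh:
        for line in fh:
            pass  # real processing here

if __name__ == '__main__':
    # filepaths: a flat list of the gzipped files, gathered beforehand
    with Pool(processes=8) as pool:
        for _ in pool.imap_unordered(process_file, filepaths, chunksize=16):
            pass  # results arrive as workers finish, in no particular order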
Are you looking for os.path.walk to traverse directories? (https://docs.python.org/2/library/os.path.html). You can also do:
for folder in folderslist:
    fileslist = os.listdir(folder)
    for file in fileslist:
        ....
Are you interested in fileinput to iterate over lines from multiple input streams? (https://docs.python.org/2/library/fileinput.html; fileinput.hook_compressed seems to handle gzip.)
I have a thread that I would like to use to loop through all of the .txt files in a certain directory (C:\files\). All I need is help reading anything from that directory that is a .txt file. I can't seem to figure it out... Here is my current code, which looks for specific files:
def file_Read(self):
    if self.is_connected:
        threading.Timer(5, self.file_Read).start()
        print '~~~~~~~~~~~~Thread test~~~~~~~~~~~~~~~'
        try:
            with open('C:\\files\\test.txt', 'r') as content_file:
                content = content_file.read()
                Num, Message = content.strip().split(';')
                print Num
                print Message
                print Num
                self.send_message(Num, Message)
            os.remove("test.txt")
        except Exception as e:
            print 'no file ', e
            time.sleep(10)
Does anyone have a simple fix for this? I have found a lot of threads using methods like:
directory = os.path.join("c:\\files\\", "path")

threading.Timer(5, self.file_Read).start()
print '~~~~~~~~~~~~Thread test~~~~~~~~~~~~~~~'
try:
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(".txt"):
                content_file = open(file, 'r')
but this doesn't seem to be working.
Any help would be appreciated. Thanks in advance...
I would do something like this, by using glob:
import glob
import os

txtpattern = os.path.join("c:\\files\\", "*.txt")
files = glob.glob(txtpattern)

for f in files:
    print "Filename : %s" % f
    # Do what you want with the file
This method works only if you want to read the .txt files in your directory, and not in its potential subdirectories.
Take a look at the manual entries for os.walk if you need to recurse into sub-directories, or glob.glob if you are only interested in a single directory.
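For the recursive case, here is a short os.walk sketch; note the os.path.join(root, name), which is what the walk-based snippet in the question is missing:

import os

for root, dirs, filenames in os.walk('C:\\files\\'):
    for name in filenames:
        if name.endswith('.txt'):
            path = os.path.join(root, name)  # join with root, not the bare name
            with open(path, 'r') as fh:
                content = fh.read()
                # Do what you want with the content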
The main problem is that the first thing you do in the function you want to start in the threads is create a new thread running that same function.
Since every thread will start a new thread, you should get an increasing number of threads starting new threads, which also seems to be what happens.
If you want to do some work on all the files, and you want to do that in parallel on a multi-core machine (which is what I'm guessing) take a look at the multiprocessing module, and the Queue class. But get the file handling code working first before you try to parallelize it.
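Once the file handling works, a hedged sketch of the multiprocessing/Queue approach mentioned above could look like this; the handle_file body is a placeholder, and the directory is the one from the question:

import glob
import os
from multiprocessing import Process, Queue

def handle_file(path):
    # Placeholder: put the real per-file work here.
    with open(path, 'r') as fh:
        print(path, len(fh.read()))

def worker(queue):
    while True:
        path = queue.get()
        if path is None:      # sentinel: no more work for this worker
            break
        handle_file(path)

if __name__ == '__main__':
    queue = Queue()
    workers = [Process(target=worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    for path in glob.glob(os.path.join('C:\\files\\', '*.txt')):
        queue.put(path)
    for _ in workers:
        queue.put(None)       # one sentinel per worker
    for w in workers:
        w.join()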
I'm trying to download multiple image files from two websites, and am doing it using the multiprocessing module, hoping to shorten the time needed (synchronously it would take about five minutes). This is the code being executed in a separate process:
def _get_image(self):
    if not os.path.isdir(self.file_path + self.folder):
        os.makedirs(self.file_path + self.folder)
    rand = Random()
    rand_num = rand.randint(0, sys.maxint)
    self.url += str(rand_num)
    opener = urllib.FancyURLopener()
    opener.retrieve(self.url, self.file_path + self.folder + '/' + str(rand_num) + '.jpg')
The above code is executed in separate processes and works OK, though I'd like it not to save each file right after it's downloaded, but at the end of the process's execution. After download, I'd like the images to be stored in some internal list or dict... Sadly, FancyURLopener doesn't allow files to be stored in memory and insists on writing them to disk right after download. Is there a tool like FancyURLopener, but without the disk writes?
URLopener.open() returns a file-like object. You can read() it to retrieve the data as a byte string, then store it wherever you want.
Why do you need a URLopener in the first place? How about a simple urllib2.urlopen()?
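A hedged sketch of that approach, keeping the downloaded bytes in memory instead of writing them straight to disk; the self.downloaded dict and the later write step are assumptions, not part of the question's class:

import urllib2

def _get_image_in_memory(self):
    # Variant of the question's method that keeps the image as bytes
    # instead of writing it to disk immediately.
    response = urllib2.urlopen(self.url)
    data = response.read()              # the whole image as a byte string
    self.downloaded[self.url] = data    # hypothetical internal dict
    return data

# later, once the process has finished all its downloads:
# for url, data in self.downloaded.items():
#     with open(make_filename(url), 'wb') as fh:  # make_filename: hypothetical
#         fh.write(data)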