I am new to Python and not adept at it. I need to traverse a huge list of directories which contain gzipped files within them. While this can be done with something like:
for file in list:
    for filename in file:
        with gzip.open(filename) as fileopen:
            for line in fileopen:
                process(line)  # placeholder for the actual work
the time taken would be a few days. Is there any function that would allow me to traverse other parts of the directory tree concurrently, doing the same work, without any repeats in the traversal?
Any help or direction would be greatly appreciated.
Move the heavy processing to a separate program, then call that program with subprocess to keep a certain number of parallel processes running:
import subprocess
import time

todo = []
for file in list:
    for filename in file:
        todo.append(filename)

running_processes = []
while len(todo) > 0:
    # keep only the processes that are still running
    running_processes = [p for p in running_processes if p.poll() is None]
    if len(running_processes) < 8:
        target = todo.pop()
        running_processes.append(subprocess.Popen(['python', 'process_gzip.py', target]))
    time.sleep(1)
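For completeness, a minimal sketch of what the hypothetical process_gzip.py worker could look like (process() is a placeholder for whatever you do per line):

import gzip
import sys

def process(line):
    pass  # placeholder for the actual per-line processing

if __name__ == '__main__':
    target = sys.argv[1]  # path handed over by the parent script
    with gzip.open(target, 'rb') as fileopen:
        for line in fileopen:  # lines are bytes in 'rb' mode
            process(line)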
You can open many files concurrently. For instance:
files = [gzip.open(f,"rb") for f in fileslist]
processed = [process(f) for f in files]
(By the way, don't call your files list "list", or a list of files "file": these are built-in names in Python, and they do not describe what the objects really are in your case.)
Now it is going to take about the same time, since you still process them one at a time. So, is it the processing of the files that you want to parallelize? Then you want to look at threading or multiprocessing.
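As a rough multiprocessing sketch (assuming the per-file work lives in a process_file function; the file names below are placeholders):

import gzip
from multiprocessing import Pool

def process_file(filename):
    # hypothetical worker: open one gzipped file and handle it line by line
    with gzip.open(filename, "rb") as fileopen:
        for line in fileopen:
            pass  # replace with the real processing

if __name__ == "__main__":
    fileslist = ["example1.gz", "example2.gz"]  # placeholder paths
    with Pool(processes=4) as pool:  # tune the worker count to your machine
        pool.map(process_file, fileslist)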
Are you looking for os.path.walk (or the newer os.walk) to traverse directories? (https://docs.python.org/2/library/os.path.html). You can also do:
for folder in folderslist:
    fileslist = os.listdir(folder)
    for file in fileslist:
        ....
Are you interested by fileinput to iterate over lines from multiple input streams? (https://docs.python.org/2/library/fileinput.html, fileinput.hook_compressed seems to handle gzip).
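A minimal sketch of the fileinput approach (the file names are placeholders):

import fileinput

filenames = ["example1.gz", "example2.gz"]  # placeholder paths
# hook_compressed transparently opens gzip (and bz2) files
for line in fileinput.input(files=filenames, openhook=fileinput.hook_compressed):
    pass  # process each line here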
Related
dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]
# one of subdirs contain huge number of files
files = [os.path.join(file, f) for file in subdirs for f in os.listdir(file)]
The code ran smoothly the first few times, in under 30 seconds, but over successive runs of the same code the time increased to 11 minutes, and now it does not even finish in 11 minutes. The problem is in the 3rd line, and I suspect os.listdir is responsible.
EDIT: I just want to list the files so they can be passed as arguments to a multiprocessing function. RAM is also not an issue, as RAM is ample and the program uses less than 1/10th of it.
It may be that os.listdir(dir_) has to read the entire directory and return a list of all the files and subdirectories in dir_. This can take a long time if the directory is very large or if the system is under heavy load.
Instead, use the os.scandir() approach below, or the os.walk() method (a sketch of the walk() variant follows the scandir example).
import os

dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]

# Create an empty list to store the file paths
files = []
for subdir in subdirs:
    # Use os.scandir() to iterate over the files and directories in the subdirectory
    with os.scandir(subdir) as entries:
        for entry in entries:
            # Check if the entry is a regular file
            if entry.is_file():
                # Add the file path to the list
                files.append(entry.path)
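And a rough sketch of the os.walk() variant mentioned above; note that, unlike the two-level code above, os.walk() recurses into every subdirectory:

import os

dir_ = "/path/to/folder/with/huge/number/of/files"

# os.walk() yields one (root, dirnames, filenames) tuple per directory it visits
files = []
for root, dirnames, filenames in os.walk(dir_):
    for name in filenames:
        files.append(os.path.join(root, name))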
I have used the following code to read multiple files simultaneously
from contextlib import ExitStack

files_to_parse = [file1, file2, file3]

with ExitStack() as stack:
    files = [stack.enter_context(open(i, "r")) for i in files_to_parse]
    for rows in zip(*files):
        for r in rows:
            # do stuff
However, I have noticed that since my files don't all have the same number of lines, as soon as the shortest file reaches its end, the loop stops and all the files are closed.
I used the code above (which I found here on Stack Overflow) because I need to parse several files at the same time to save time; doing so divides the computing time by 4. However, not all files are parsed entirely because of the problem mentioned above.
Is there any way to solve this problem?
open can be used as a context manager, but it does not have to be. You can use it the old-fashioned way, where you take responsibility for closing the files yourself:
files = []
try:
    files = [open(fname) for fname in filenames]
    # here do what you need to with files
finally:
    for file in files:
        file.close()
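Note that keeping the files open yourself does not change the fact that zip() stops at the shortest file. If the goal is to keep reading until the longest file is exhausted, one option (a sketch, assuming Python 3; the file names are placeholders) is itertools.zip_longest:

from contextlib import ExitStack
from itertools import zip_longest

files_to_parse = ["file1.txt", "file2.txt", "file3.txt"]  # placeholder names

with ExitStack() as stack:
    files = [stack.enter_context(open(i, "r")) for i in files_to_parse]
    # zip_longest keeps yielding rows until the longest file is exhausted,
    # padding the finished files with None
    for rows in zip_longest(*files, fillvalue=None):
        for r in rows:
            if r is None:
                continue  # this file has already ended
            # do stuff with r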
Please help me with a challenge I have: listing files every 30 seconds and processing them (processing means, for example, copying them to another location; each file is moved out of the directory once processed). When I list the files again after 30 seconds, I want to skip any files that were already listed for processing (because they were picked up previously and the for loop over them is still in progress).
In other words, I want to avoid duplicate file processing while listing the files every 30 seconds.
Here is my code:
def List_files():
    path = 'c:\\projects\\hc2\\'
    files = []
    for r, d, f in os.walk(path):
        for file in f:
            if '.txt' in file:
                files.append(os.path.join(r, file))
    return files  # return the collected paths to the caller
class MyFilethreads:
    def __init__(self, t1):
        self.t1 = t1

    def start_threading(self):
        for file in List_files():
            self.t1 = Thread(target=<FILEPROCESS_FUNCTION>, args=(file,))
            self.t1.start()

t1 = Thread()
myclass = MyFilethreads(t1)

while True:
    myclass.start_threading()
    time.sleep(30)
I have not included my actual file-processing function since it is big; it is the function started in each thread as FILEPROCESS_FUNCTION.
Problem:
If the files are large, the processing time sometimes increases (in other words, the for loop takes more than 30 seconds), but I can't reduce the 30-second timer for this since it is a very rare possibility, and my Python script takes in hundreds of files every minute.
Hence, I am looking for a way to skip files that were already listed previously, and thereby avoid duplicate file processing.
Please help.
Thanks in advance.
Keep a dictionary (or a set) in your class and insert every file you have already seen; then, in start_threading, check whether the file is already in it and skip it if so.
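A rough sketch of that idea, using a set since only membership matters (List_files and FILEPROCESS_FUNCTION are the names from the question):

from threading import Thread

class MyFilethreads:
    def __init__(self):
        self.seen_files = set()  # every path already handed to a worker thread

    def start_threading(self):
        for file in List_files():
            if file in self.seen_files:
                continue  # listed in an earlier pass, skip it
            self.seen_files.add(file)
            Thread(target=FILEPROCESS_FUNCTION, args=(file,)).start()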
I'm trying to wrap my head around multiprocessing in Python, but I simply can't. Note that I was, am, and probably forever will be a noob in everything-programming. Ah, anyway. Here it goes.
I'm writing a Python script that compresses images downloaded to a folder with ImageMagick, using predefined variables from the user, stored in an ini file. The script searches for folders matching a pattern in a download dir, checks if they contain JPGs, PNGs or other image files and, if yes, recompresses and renames them, storing the results in a "compressed" folder.
Now, here's the thing: I'd love it if I was able to "parallelize" the whole compression thingy, but... I can't understand how I'm supposed to do that.
I don't want to tire you with the existing code since it simply sucks. It's just a simple "for file in directory" loop. THAT's what I'd love to parallelize - could somebody give me an example on how multiprocessing could be used with files in a directory?
I mean, let's take this simple piece of code:
for f in matching_directory:
    print('I\'m going to process file:', f)
For those that DO have to peek at the code, here's the part where I guess the whole parallelization bit will stick:
for f in ImageFolders:
    print(splitter)
    print(f)
    print(splitter)
    PureName = CleanName(f)
    print(PureName)
    for root, dirs, files in os.walk(f):
        padding = int(round(math.log(len(files), 10))) + 1
        padding = max(minpadding, padding)
        filecounter = 0
        for filename in files:
            if filename.endswith(('.jpg', '.jpeg', '.gif', '.png')):
                filecounter += 1
                imagefile, ext = os.path.splitext(filename)
                newfilename = "%s_%s%s" % (PureName, (str(filecounter).rjust(padding, '0')), '.jpg')
                startfilename = os.path.join(f, filename)
                finalfilename = os.path.join(Dir_Images_To_Publish, PureName, newfilename)
                print(filecounter, ':', startfilename, ' >>> ', finalfilename)
                Original_Image_FileList.append(startfilename)
                Processed_Image_FileList.append(finalfilename)
...and here I'd like to be able to add a piece of code where a worker takes the first file from Original_Image_FileList and compresses it to the first filename from Processed_Image_FileList, a second one takes the one after that, blah-blah, up to a specific number of workers - depending on a user setting in the ini file.
Any ideas?
You can create a pool of workers using the Pool class, to which you can distribute the image compression to. See the Using a pool of workers section of the multiprocessing documentation.
If your compression function is called compress(filename), for example, then you can use the Pool.map method to apply this function to an iterable that returns the filenames, i.e. your list matching_directory:
from multiprocessing import Pool

def compress_image(image):
    """Define how you'd like to compress `image`..."""
    pass

def distribute_compression(images, pool_size=4):
    with Pool(processes=pool_size) as pool:
        pool.map(compress_image, images)
There's a variety of map-like methods available, see map for starters. You may like to experiment with the pool size, to see what works best.
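If you want each worker to compress one entry of Original_Image_FileList into the matching name from Processed_Image_FileList, one way is to zip the two lists and use Pool.starmap. A sketch, where compress_pair stands in for your ImageMagick call and the pool size would come from your ini file:

from multiprocessing import Pool

def compress_pair(src, dst):
    # placeholder: run ImageMagick here (e.g. via subprocess) to write src -> dst
    pass

def distribute_compression(sources, destinations, pool_size=4):
    with Pool(processes=pool_size) as pool:
        # each worker receives one (src, dst) pair
        pool.starmap(compress_pair, zip(sources, destinations))

# usage sketch:
# distribute_compression(Original_Image_FileList, Processed_Image_FileList, pool_size=4)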
I have a directory in which I have around a hundred thousand text files.
My Python code creates a list of the names of these files:
listoffiles = os.listdir(directory)
I break this listoffiles into chunks of 64 with this lol function:
lol = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)]
partitioned_listoffiles = lol(listoffiles, 64)
Then I map the chunks over a pool of 2 processes:
pool = Pool(processes=2,)
single_count_tuples = pool.map(Map, partitioned_listoffiles)
In the Map function I read those files and do further processing.
My problem is that this code works fine for a small folder with thousands of files, but on large directories it runs out of memory. How should I solve this? Can I read the first n files, then the next n files, building listoffiles and processing these steps in a for loop?
If the directory is very, very large then you could use os.scandir() instead of os.listdir(). But it is unlikely that os.listdir() causes the MemoryError, so the issue is probably in the other two places:
Use a generator expression instead of list comprehension:
chunks = (lst[i:i+n] for i in range(0, len(lst), n))
Use pool.imap or pool.imap_unordered instead of pool.map():
for result in pool.imap_unordered(Map, chunks):
pass
Or better:
files = os.listdir(directory)
for result in pool.imap_unordered(process_file, files, chunksize=100):
pass
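Putting those pieces together, a rough end-to-end sketch (process_file stands in for whatever your Map function did per file, and the directory path is a placeholder):

import os
from multiprocessing import Pool

def process_file(filename):
    pass  # placeholder for the per-file work previously done in Map

if __name__ == "__main__":
    directory = "/path/to/directory"  # placeholder path
    files = os.listdir(directory)
    with Pool(processes=2) as pool:
        # results arrive as workers finish; chunksize batches filenames per task
        for result in pool.imap_unordered(process_file, files, chunksize=100):
            pass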
I've had a very similar problem, where I had to verify that a certain number of files were in a specific folder. The problem was that the folder could contain up to 20 million very small files.
From what I've learned, there is no way to limit Python's listdir to a certain number of items.
My listdir takes quite a while to list the directory and uses a lot of RAM, but it manages to run on a VM with 4 GB of RAM.
You may want to try using glob instead, which might keep the file list smaller, depending on your requirements.
import glob
print(glob.glob("/tmp/*.txt"))
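If the size of the returned list is the real problem, glob.iglob returns an iterator instead of building the full list at once:

import glob

# iterate over matches lazily instead of materialising the full list
for path in glob.iglob("/tmp/*.txt"):
    print(path)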