Python: list files every 30sec and avoid processing duplicates - python

Please help me with a challenge I have: I need to list files every 30 seconds and process them (processing means, for example, copying them to another location; each file is moved out of the directory once processed). When I list the files again after 30 seconds, I want to skip any files that were already listed in a previous pass (because that pass's FOR loop may still be in progress).
In other words, I want to avoid duplicate file processing while listing the files every 30 seconds.
Here is my code.
import os
import time
from threading import Thread

def List_files():
    path = 'c:\\projects\\hc2\\'
    files = []
    for r, d, f in os.walk(path):
        for file in f:
            if '.txt' in file:
                files.append(os.path.join(r, file))
    return files  # return the collected paths so start_threading can iterate over them

class MyFilethreads:
    def __init__(self, t1):
        self.t1 = t1

    def start_threading(self):
        for file in List_files():
            self.t1 = Thread(target=<FILEPROCESS_FUNCTION>, args=(file,))
            self.t1.start()

t1 = Thread()
myclass = MyFilethreads(t1)

while True:
    myclass.start_threading()
    time.sleep(30)
I have not included my actual file-processing function, since it is large; it is started in a thread as FILEPROCESS_FUNCTION.
Problem:
If a file is large, its processing time can increase (in other words, the FOR loop takes more than 30 seconds), but I can't change the 30-second timer, since that case is rare and my script receives hundreds of files every minute.
Hence, I am looking for a way to skip files that were already listed previously, and thereby avoid duplicate file processing.
Please help.
Thanks in advance.

Keep a dictionary (or a set) in your class and insert every file you have already seen. Then, in start_threading, check whether the file is already in it and skip it in that case.
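A minimal sketch of that idea, reusing List_files() from the question; process_file here is only a stand-in for the question's FILEPROCESS_FUNCTION:
import time
from threading import Thread

def process_file(path):
    # stand-in for the question's FILEPROCESS_FUNCTION
    pass

class MyFilethreads:
    def __init__(self):
        self.seen = set()              # paths already handed to a worker thread

    def start_threading(self):
        for file in List_files():
            if file in self.seen:      # listed in an earlier pass -> skip it
                continue
            self.seen.add(file)
            Thread(target=process_file, args=(file,)).start()

myclass = MyFilethreads()
while True:
    myclass.start_threading()
    time.sleep(30)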

Related

os.listdir getting slower over different runs

dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]
# one of subdirs contain huge number of files
files = [os.path.join(file, f) for file in subdirs for f in os.listdir(file)]
The code ran smoothly the first few times, finishing under 30 seconds, but over repeated runs of the same code the time increased to 11 minutes, and now it does not even finish in 11 minutes. The problem is in the 3rd line, and I suspect os.listdir is responsible.
EDIT: I just want to collect the file paths so that they can be passed as arguments to a multiprocessing function. RAM is also not an issue, as it is ample and the program does not even use a tenth of it.
It might be that os.listdir(dir_) reads the entire directory and returns a list of all the files and subdirectories in dir_. This can take a long time if the directory is very large or if the system is under heavy load.
Instead, use either the method below or the os.walk() method.
dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]

# Create an empty list to store the file paths
files = []
for subdir in subdirs:
    # Use os.scandir() to iterate over the files and directories in the subdirectory
    with os.scandir(subdir) as entries:
        for entry in entries:
            # Check if the entry is a regular file
            if entry.is_file():
                # Add the file path to the list
                files.append(entry.path)
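For comparison, the os.walk() route mentioned above could look roughly like this (same dir_ as before):
import os

dir_ = "/path/to/folder/with/huge/number/of/files"

files = []
# os.walk descends into every subdirectory for us, yielding (root, dirs, filenames)
for root, dirs, filenames in os.walk(dir_):
    for name in filenames:
        files.append(os.path.join(root, name))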

How do I update a file directory pickle file with new files/locations without searching everything again in Python

I currently have a script using os.walk in a for loop like this:
empty = []
if os.path.exists(pickle_location):
    df = pd.read_pickle(pickle_location)
    #INSERT UPDATE FUNCTION HERE
else:
    for i in file_lists:
        for root, dir, files in os.walk(i, topdown=False):
            for name in files:
                empty.append(root + "/" + name)
I was wondering how to write the update function, so the script does not run forever every time, or at least cuts the time significantly. I am doing this over about 1.2 million files/locations, so it takes over an hour.
I have looked through the documentation for something suitable, but I don't have much experience. Maybe there is a smarter way? Thanks in advance.
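One illustrative sketch of such an update function, under the assumption that the pickle stores the list of paths built by the loop above and that a directory's mtime is a good enough signal that its direct contents changed (the name update_index is hypothetical):
import os
import pandas as pd

def update_index(pickle_location, file_lists):
    # Load the previously indexed paths (assumed here to be the pickled list of paths).
    known = set(pd.read_pickle(pickle_location))
    last_run = os.path.getmtime(pickle_location)
    for top in file_lists:
        for root, dirs, files in os.walk(top, topdown=False):
            # A directory's mtime changes when entries are added or removed directly inside it,
            # so only such directories need to be re-listed.
            if os.path.getmtime(root) > last_run:
                for name in files:
                    known.add(root + "/" + name)
    # Note: this does not drop paths whose files have since been deleted.
    pd.to_pickle(sorted(known), pickle_location)
    return known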

Files are closed before reaching the end - using ExitStack in Python

I have used the following code to read multiple files simultaneously
from contextlib import ExitStack

files_to_parse = [file1, file2, file3]

with ExitStack() as stack:
    files = [stack.enter_context(open(i, "r")) for i in files_to_parse]
    for rows in zip(*files):
        for r in rows:
            # do stuff
            pass
However, I have noticed that since my files don't all have the same number of lines, whenever the shortest file reaches its end, all the files are closed.
I used the code above (which I found here on Stack Overflow) because I need to parse several files at the same time (to save time); doing so divides the computing time by 4. However, not all files are parsed entirely because of the problem mentioned above.
Is there any way to solve this problem?
open can be used as a context manager, but it does not have to be. You can use it the old-fashioned way, where you take responsibility for closing the files yourself, that is:
try:
    files = [open(fname) for fname in filenames]
    # here do what you need to with files
finally:
    for file in files:
        file.close()
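As an aside, the early stop described in the question comes from zip() itself, which ends at the shortest iterable; a hedged sketch of the same ExitStack pattern using itertools.zip_longest instead (a swapped-in technique, with illustrative file names) would be:
from contextlib import ExitStack
from itertools import zip_longest

files_to_parse = ["file1.txt", "file2.txt", "file3.txt"]  # illustrative names

with ExitStack() as stack:
    files = [stack.enter_context(open(name, "r")) for name in files_to_parse]
    for rows in zip_longest(*files, fillvalue=None):
        for r in rows:
            if r is None:        # that file is already exhausted
                continue
            # do stuff with r
            pass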

Grab the name of a modified file python

Within a script is a watcher algorithm which I've adapted from here:
https://www.michaelcho.me/article/using-pythons-watchdog-to-monitor-changes-to-a-directory
My aim is now to add a few lines to grab the name of any file that's modified so I can have an if statement checking for a certain file such as:
if [modified file name] == "my_file":
    print("do something")
I don't have any experience with watchdog or file watching, so I am struggling to find an answer for this. How would I receive the modified file name?
The current setup of the watchdog class is pretty useless for this, since it just prints; it doesn't return anything.
Let me offer a different approach:
The following would get you a list of files modified in the past 12 hours:
import datetime as dt
import os

result = [os.path.join(root, f)
          for root, subfolder, files in os.walk(my_dir)
          for f in files
          if dt.datetime.fromtimestamp(os.path.getmtime(os.path.join(root, f))) >
             dt.datetime.now() - dt.timedelta(hours=12)]
Instead of a fixed delta in hours, you can save last_search_time and use it in subsequent searches.
You can then search result to see if it contains your file:
if my_file in result:
    print("Sky is falling. Take cover.")

traversing multiple files and opening them

I am new to Python and not adept at it. I need to traverse a huge list of directories which contain zipped files within them. While this can be done via the method,
for file in list:
    for filename in file:
        with gzip.open(filename) as fileopen:
            for line in fileopen:
                process
the time taken would be a few days. Would I be able to use any function that allows me to traverse other parts of the directory tree concurrently, perform the same processing, and not have any repeats in the traversal?
Any help or direction would be greatly appreciated.
Move the heavy processing to a separate program, then call that program with subprocess to keep a certain number of parallel processes running:
import subprocess
import time

todo = []
for file in list:
    for filename in file:
        todo.append(filename)

running_processes = []
while len(todo) > 0:
    running_processes = [p for p in running_processes if p.poll() is None]
    if len(running_processes) < 8:
        target = todo.pop()
        running_processes.append(subprocess.Popen(['python', 'process_gzip.py', target]))
    time.sleep(1)
You can open many files concurrently. For instance:
files = [gzip.open(f,"rb") for f in fileslist]
processed = [process(f) for f in files]
(By the way, don't call your list of files "list", or an individual file "file", since those names shadow built-ins in the language and do not describe what the object really is in your case.)
Now it is going to take about the same time, since you always process them one at a time. So, is it the processing of them that you want to parallelize? Then you want to look at threading or multiprocessing.
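A hedged sketch of the multiprocessing route, where process_gzip_file is only a stand-in for the question's per-file processing (file names are illustrative):
import gzip
from multiprocessing import Pool

def process_gzip_file(filename):
    # stand-in for the question's per-line processing
    with gzip.open(filename, "rt") as fileopen:
        for line in fileopen:
            pass  # process the line here
    return filename

if __name__ == "__main__":
    fileslist = ["a.gz", "b.gz", "c.gz"]   # illustrative names
    with Pool(processes=8) as pool:
        # imap_unordered yields results as each worker finishes
        for done in pool.imap_unordered(process_gzip_file, fileslist):
            print("finished", done)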
Are you looking for os.path.walk to traverse directories? (https://docs.python.org/2/library/os.path.html). You can also do:
for folder in folderslist:
    fileslist = os.listdir(folder)
    for file in fileslist:
        ....
Are you interested in fileinput to iterate over lines from multiple input streams? (https://docs.python.org/2/library/fileinput.html; fileinput.hook_compressed seems to handle gzip.)
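A small sketch of the fileinput option, assuming gzip-compressed inputs (file names are illustrative):
import fileinput

gz_files = ["a.gz", "b.gz", "c.gz"]   # illustrative names

# hook_compressed transparently opens gzip- and bzip2-compressed files
with fileinput.input(files=gz_files, openhook=fileinput.hook_compressed) as stream:
    for line in stream:
        # note: lines from .gz files may be bytes rather than str, depending on the Python version
        pass  # process the line here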
