I am trying to write a Python script which will monitor folders. The files are being written into the folder by a third-party GUI-based program. Some exported files grow in situ and others are written to a tmp folder elsewhere before being copied into the target folder. In the tmp-folder cases, an empty folder is placed at the target destination until the file is ready to move. There may be multiples of these empty folders, but they are only created after the previous one has been populated.
The code below appears to work well until there are zero-size files/folders.
I think the main issue lies in zero_files. I'm providing the rest for context.
import os
import datetime
import time
import itertools

print('Starting to Monitor File growth')
print(datetime.datetime.now())
print("")

path = os.path.normpath(r'C:\Users\ed\Desktop\Test_Run')
check_rate = 180  # time in seconds between checks

print("Waiting for a moment before starting monitoring")
print("")
time.sleep(60)  # wait for the first files to appear

def get_directory_size(directory):
    """Returns the `directory` size in bytes."""
    total = 0
    try:
        # print("[+] Getting the size of", directory)
        for entry in os.scandir(directory):
            if entry.is_file():
                # if it's a file, use stat() function
                total += entry.stat().st_size
            elif entry.is_dir():
                # if it's a directory, recursively call this function
                try:
                    total += get_directory_size(entry.path)
                except FileNotFoundError:
                    pass
    except NotADirectoryError:
        # if `directory` isn't a directory, get the file size then
        return os.path.getsize(directory)
    except PermissionError:
        # if for whatever reason we can't open the folder, return 0
        return 0
    return total

def folder_growing(path):
    sizes = [1, 2]
    while sizes[-1] > sizes[-2]:
        time.sleep(check_rate)
        sizes.append(get_directory_size(path))
        print('Monitoring Folder')

def zero_files(path):
    files = os.listdir(path)
    a = []
    for i in files:
        file_size = a.append(os.path.getsize(f'{path}\\{i}'))
    a.sort()
    try:
        while a[-1] == 0:
            file_size = a.append(os.path.getsize(f'{path}\\{i}'))
            a.sort()
            print("test")
            time.sleep(120)
    except FileNotFoundError:
        pass

print("***Checking folders every", int(check_rate / 60), "mins***")
get_directory_size(path)
folder_growing(path)
time.sleep(120)
zero_files(path)
wait = 10
print('No Folder Growth Detected')
print("")
print("***Waiting", int(wait / 60), "mins for Safety***")
time.sleep(wait)
print("")
print(datetime.datetime.now())
print("Done")
Weird problem I've run into. I'm currently using the following code:
generic.py
def function_in_different_pyfile(input_folder):
    # do stuff here

folder_1 = f"/folder_1"
folder_1_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_1)
folder_2 = f"/folder_2"
folder_2_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_2)

if len([file for file in folder_1_virtualdir]) != len([file for file in folder_2_virtualdir]):
    generic.function_in_different_pyfile(folder_1_virtualdir)
else:
    print('Already done')
So what I'm trying to do is:
Check the number of files in folder_1_virtualdir and folder_2_virtualdir
If they aren't equal, run the function.
If they are, then print statement/pass.
The problem:
The generic.function() runs but doesn't do anything when you pass in the list comprehension.
The generic.function() works totally fine if you don't have a list comprehension in the code, e.g.:
folder_1 = f"/folder_1"
folder_1_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_1)
folder_2 = f"/folder_2"
folder_2_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_2)
generic.function_in_different_pyfile(folder_1_virtualdir)
Will work completely fine.
There are no error messages. It passes through the function as if the function doesn't do anything.
What I've tried:
I've tested this by modifying the function:
generic.py
def function_in_different_pyfile(input_folder):
    print('Start of the function')
    # do stuff here
    print('End of the function')
You will see these print statements, but the function doesn't process any of the files in the input_folder argument if you include the list comprehension.
This extends to the case where the list comprehension appears ANYWHERE in the code:
folder_1 = f"/folder_1"
folder_1_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_1)
folder_1_contents = [file for file in folder_1_virtualdir]
folder_2 = f"/folder_2"
folder_2_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_2)
generic.function_in_different_pyfile(folder_1_virtualdir)
# Function doesn't run.
I'm fairly new to Python and can't seem to understand why the list comprehension here completely prevents the function from running correctly.
You could try the following code if the number of files in the folder is less than 5,000:
folder_1 = f"/folder_1"
folder_1_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_1)
folder_2 = f"/folder_2"
folder_2_virtualdir = CONTAINER_CLIENT.list_blobs(name_starts_with=folder_2)
folder_1_count = len(folder_1_virtualdir)
folder_2_count = len(folder_2_virtualdir)
if folder_1_count != folder_2_count :
generic.function_in_different_pyfile(folder_1_virtualdir)
else:
print('Already done')
If it is greater than 5,000, you need to get the number by iterating through your blobs:
count = 0
for count, item in enumerate(blobs):
    print("number", count + 1, "in the list is", item)
I have a program that gets the modified date/time of directories and files. I then want to get the date/time from 30 seconds ago and compare that to the modified date/time.
If the modified time is less than 30 seconds ago, I want to trigger an alert. My code triggers the alert even if the modification occurred more than 30 seconds ago.
Is there a way I can only trigger an alert if the modification occurred less than 30 seconds ago?
import os.path
import time, stat
import sys

share_dir = 'C:/mydir'
source_dir = r'' + share_dir + '/'

def trigger():
    print("Triggered")

def check_dir():
    while True:
        for currentdir, dirs, files in os.walk(source_dir):
            for file in files:
                currentfile = os.path.join(currentdir, file)
                # get modified time for files
                ftime = os.stat(currentfile)[stat.ST_MTIME]
                past = time.time() - 30  # last 30 seconds
                if time.ctime(ftime) >= time.ctime(past):
                    print(time.ctime(ftime) + " > " + time.ctime(past))
                    print("Found modification in last 30 seconds for file =>", currentfile, time.ctime(ftime))
                    trigger()
                    sys.exit()
                else:
                    print('No recent modifications.' + currentfile)
            for folder in dirs:
                currentfolder = os.path.join(currentdir, folder)
                # get modified time for directories
                dtime = os.stat(currentfolder)[stat.ST_MTIME]
                past = time.time() - 30  # last 30 seconds
                if time.ctime(dtime) >= time.ctime(past):
                    print(time.ctime(dtime) + " > " + time.ctime(past))
                    print("Found modification in last 30 seconds for folder =>", currentfolder, time.ctime(dtime))
                    trigger()
                    sys.exit()
                else:
                    print('No recent modifications: ' + currentfolder)
        time.sleep(4)

if __name__ == "__main__":
    check_dir()
I'm doing this on a large-scale file system. I personally use SQLite3 and round the mtime of the file (I had weird things happen using any other sort of operation, and it was more consistent).
I'm also unsure why you're not just doing a pure math solution: take the current time, take the mtime of the file, find the difference between them, and if it's less than or equal to thirty seconds, you get a hit.
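In other words, the check boils down to a simple subtraction; a minimal helper along these lines (the 30-second window is from the question, the function name is just for illustration):

import os.path
import time

def modified_recently(path, window_seconds=30):
    # True if `path` was modified within the last `window_seconds`
    return time.time() - os.path.getmtime(path) <= window_seconds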
I redid some of the code. I recommend trying this:
import os.path
import time, stat
import sys

def trigger():
    print("Triggered")

def check_dir(source_dir):
    for currentdir, dirs, files in os.walk(source_dir):
        for file in files:
            currentfile = os.path.join(currentdir, file)
            # get modified time for files
            ftime = os.path.getmtime(currentfile)
            if time.time() - ftime <= 30:
                print("Found modification in last 30 seconds for file =>", currentfile, time.ctime(ftime))
                trigger()
                exit(0)
            else:
                print('No recent modifications.' + currentfile)
        for folder in dirs:
            currentfolder = os.path.join(currentdir, folder)
            # get modified time for directories
            dtime = os.stat(currentfolder)[stat.ST_MTIME]
            if time.time() - dtime <= 30:
                print("Found modification in last 30 seconds for folder =>", currentfolder, time.ctime(dtime))
                trigger()
                exit(0)
            else:
                print('No recent modifications: ' + currentfolder)

if __name__ == "__main__":
    check_dir('yourdirectoryhere')
I did some light testing on my own system and it seemed to work perfectly. You might want to add back the while loop, but it should work.
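For completeness, a minimal sketch of what adding the polling loop back might look like, reusing check_dir() from above; the 4-second pause is carried over from the question and is an arbitrary choice:

import time

if __name__ == "__main__":
    # keep scanning until a recent modification triggers the alert and exits
    while True:
        check_dir('yourdirectoryhere')
        time.sleep(4)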
I wrote a program to find large files whose size is >= 100 MB.
However, it runs endlessly on macOS.
I set
sentinel = True
while sentinel:
and the breaking condition:
sentinel = False
The complete code:
import os, time, shelve, logging

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s - %(levelname)s - %(message)s")
logging.info("Start of Program")
start = time.time()

root = '/'
# errors = set()
# dirs = set()
sentinel = True
while sentinel:
    try:
        root = os.path.abspath(root)  # ensure it's an abspath
        # set the baseline as 100M
        # consider the shift
        baseline = 100 * 2**20  # 2**20 is 1M
        # setup to collect the large files
        large_files = []
        # root is a better choice as a concept
        for foldername, subfolders, files in os.walk(root):
            # logging.error("foldername: %s" % foldername)
            # print("subfolders: ", subfolders)
            for f in files:
                # print(f"{foldername}, {f}")
                abspath = os.path.join(foldername, f)
                logging.debug("abspath: %s" % abspath)
                size = os.path.getsize(abspath)
                if size >= baseline:
                    large_files.append((os.path.basename(abspath), size / 2**20))
                    # turn_end = time.time()
                    # print(f"UnitTimer: {turn_end-start}")  # no spaces between .

        # write the large files to shelf
        logging.debug("subfolders: " + str(subfolders))
        shelf = shelve.open('large_files')
        shelf.clear()
        shelf["large_files"] = large_files
        shelf.close()

        end = time.time()
        logging.debug("Timer: %s." % (end - start))

        logging.info("End of Program")
        # break the loop after walk()
        sentinel = False
    except (PermissionError, FileNotFoundError) as e:
        # errors.add(e)
        pass
The code runs endlessly, but I cannot find the problem.
The sentinel is only set to False when there isn't an exception. Make sure to set it to False in your except block as well.
To recover gracefully from an error, you probably don't want to wrap every file access in the same try/except block. Rather, you want to have a small try/except block catching an individual file operation, and if that fails you can apply your error handling code (e.g. retrying or logging and continuing to the next file).
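For illustration, a minimal sketch of that idea applied to the size check from the question; the 100 MB baseline and the large_files list come from the post, while the surrounding structure is an assumption for the example:

import os

baseline = 100 * 2**20  # 100 MB
large_files = []

for foldername, subfolders, files in os.walk('/'):
    for f in files:
        abspath = os.path.join(foldername, f)
        try:
            # wrap only the individual file operation, so one unreadable
            # file does not restart the whole walk
            size = os.path.getsize(abspath)
        except (PermissionError, FileNotFoundError):
            continue  # log or retry here if needed, then move on
        if size >= baseline:
            large_files.append((os.path.basename(abspath), size / 2**20))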
I am trying to write a Python script that scans a folder, collects updated SQL scripts, and then automatically pulls data for each SQL script. In the code, a while loop scans for new SQL files and sends them to the data pull function. I am having trouble understanding how to make a dynamic queue with a while loop, but also have multiprocessing run the tasks in the queue.
The problem with the following code is that a while loop iteration will keep working on a long job before it moves to the next iteration and collects other jobs to fill the vacant processors.
Update:
Thanks to @pbacterio for catching the bug; the error message is now gone. After changing the code, the Python code can take all the job scripts during one iteration and distribute the scripts to four processors. However, it will hang on a long job before it goes to the next iteration to scan and submit the newly added job scripts. Any idea how to restructure the code?
I finally figured out the solution; see the answer below. It turned out that what I was looking for is:
the_queue = Queue()
the_pool = Pool(4, worker_main,(the_queue,))
For those who stumble on a similar idea, the following is the whole architecture of this automation script, which converts a shared drive into a 'server for SQL pulling' or any other job-queue 'server'.
a. The Python script auto_data_pull.py, as shown in the answer. You need to add your own job function.
b. A 'batch script' with following:
start C:\Anaconda2\python.exe C:\Users\bin\auto_data_pull.py
c. Add a task triggered by start computer, run the 'batch script'
That's all. It works.
Python Code:
from glob import glob
import os, time
import sys
import csv
import re
import subprocess
import pandas as PD
import pypyodbc
from multiprocessing import Process, Queue, current_process, freeze_support

#
# Function run by worker processes
#
def worker(input, output):
    for func, args in iter(input.get, 'STOP'):
        result = compute(func, args)
        output.put(result)

#
# Function used to compute result
#
def compute(func, args):
    result = func(args)
    return '%s says that %s%s = %s' % \
        (current_process().name, func.__name__, args, result)

def query_sql(sql_file):  # test func
    # jsl file processing and SQL querying, data table will be saved to csv.
    fo_name = os.path.splitext(sql_file)[0] + '.csv'
    fo = open(fo_name, 'w')
    print sql_file
    fo.write("sql_file {0} is done\n".format(sql_file))
    return "Query is done for {0}\n".format(sql_file)

def check_files(path):
    """
    arguments -- root path to monitor
    returns -- dictionary of {file: timestamp, ...}
    """
    sql_query_dirs = glob(path + "/*/IDABox/")
    files_dict = {}
    for sql_query_dir in sql_query_dirs:
        for root, dirs, filenames in os.walk(sql_query_dir):
            [files_dict.update({(root + filename): os.path.getmtime(root + filename)}) for
                filename in filenames if filename.endswith('.jsl')]
    return files_dict

##### working in single thread
def single_thread():
    path = "Y:/"
    before = check_files(path)
    sql_queue = []
    while True:
        time.sleep(3)
        after = check_files(path)
        added = [f for f in after if not f in before]
        deleted = [f for f in before if not f in after]
        overlapped = list(set(list(after)) & set(list(before)))
        updated = [f for f in overlapped if before[f] < after[f]]
        before = after
        sql_queue = added + updated
        # print sql_queue
        for sql_file in sql_queue:
            try:
                query_sql(sql_file)
            except:
                pass

##### not working in queue
def multiple_thread():
    NUMBER_OF_PROCESSES = 4
    path = "Y:/"
    sql_queue = []
    before = check_files(path)  # get the current dictionary of sql_files
    task_queue = Queue()
    done_queue = Queue()
    while True:  # while loop to check the changes of the files
        time.sleep(5)
        after = check_files(path)
        added = [f for f in after if not f in before]
        deleted = [f for f in before if not f in after]
        overlapped = list(set(list(after)) & set(list(before)))
        updated = [f for f in overlapped if before[f] < after[f]]
        before = after
        sql_queue = added + updated

        TASKS = [(query_sql, sql_file) for sql_file in sql_queue]
        # Create queues
        # submit task
        for task in TASKS:
            task_queue.put(task)
        for i in range(NUMBER_OF_PROCESSES):
            p = Process(target=worker, args=(task_queue, done_queue)).start()
        # try:
        #     p = Process(target=worker, args=(task_queue))
        #     p.start()
        # except:
        #     pass

        # Get and print results
        print 'Unordered results:'
        for i in range(len(TASKS)):
            print '\t', done_queue.get()
        # Tell child processes to stop
        for i in range(NUMBER_OF_PROCESSES):
            task_queue.put('STOP')

# single_thread()
if __name__ == '__main__':
    # freeze_support()
    multiple_thread()
References:
Monitor file changes with a Python script: http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html
Multiprocessing: https://docs.python.org/2/library/multiprocessing.html
Where did you define sql_file in multiple_thread() in
multiprocessing.Process(target=query_sql, args=(sql_file)).start()
You have not defined sql_file in the method, and moreover you have only used that variable inside a for loop; its scope is confined to that for loop.
Try replacing this:
result = func(*args)
by this:
result = func(args)
I have figured this out. Thank you for the responses that inspired the thought.
Now the script can run a while loop to monitor the folder for newly updated/added SQL scripts and then distribute the data pulling to multiple worker processes. The solution comes from queue.get() and queue.put(). I assume the queue object takes care of the communication by itself.
This is the final code:
from glob import glob
import os, time
import sys
import pypyodbc
from multiprocessing import Process, Queue, Event, Pool, current_process, freeze_support

def query_sql(sql_file):  # test func
    # jsl file processing and SQL querying, data table will be saved to csv.
    fo_name = os.path.splitext(sql_file)[0] + '.csv'
    fo = open(fo_name, 'w')
    print sql_file
    fo.write("sql_file {0} is done\n".format(sql_file))
    return "Query is done for {0}\n".format(sql_file)

def check_files(path):
    """
    arguments -- root path to monitor
    returns -- dictionary of {file: timestamp, ...}
    """
    sql_query_dirs = glob(path + "/*/IDABox/")
    files_dict = {}
    try:
        for sql_query_dir in sql_query_dirs:
            for root, dirs, filenames in os.walk(sql_query_dir):
                [files_dict.update({(root + filename): os.path.getmtime(root + filename)}) for
                    filename in filenames if filename.endswith('.jsl')]
    except:
        pass
    return files_dict

def worker_main(queue):
    print os.getpid(), "working"
    while True:
        item = queue.get(True)
        query_sql(item)

def main():
    the_queue = Queue()
    the_pool = Pool(4, worker_main, (the_queue,))
    path = "Y:/"
    before = check_files(path)  # get the current dictionary of sql_files
    while True:  # while loop to check the changes of the files
        time.sleep(5)
        sql_queue = []
        after = check_files(path)
        added = [f for f in after if not f in before]
        deleted = [f for f in before if not f in after]
        overlapped = list(set(list(after)) & set(list(before)))
        updated = [f for f in overlapped if before[f] < after[f]]
        before = after
        sql_queue = added + updated
        if sql_queue:
            for jsl_file in sql_queue:
                try:
                    the_queue.put(jsl_file)
                except:
                    print "{0} failed with error {1}. \n".format(jsl_file, str(sys.exc_info()[0]))
                    pass
        else:
            pass

if __name__ == "__main__":
    main()
I am filtering huge text files using multiprocessing.py. The code basically opens the text files, works on them, then closes them.
Thing is, I'd like to be able to launch it successively on multiple text files. Hence, I tried to add a loop, but for some reason it doesn't work (even though the code works on each individual file). I believe this is an issue with:
if __name__ == '__main__':
However, I am looking for something else. I tried to create Launcher and LauncherCount files like this:
LauncherCount.py:
def setLauncherCount(n):
    global LauncherCount
    LauncherCount = n
and,
Launcher.py:
import os
import LauncherCount
LauncherCount.setLauncherCount(0)
os.system("OrientedFilterNoLoop.py")
LauncherCount.setLauncherCount(1)
os.system("OrientedFilterNoLoop.py")
...
I import LauncherCount.py, and use LauncherCount.LauncherCount as my loop index.
Of course, this doesn't work either, as it edits the variable LauncherCount.LauncherCount locally, so it won't be edited in the imported version of LauncherCount.
Is there any way to globally edit a variable in an imported file? Or is there another way to do this? What I need is to run the code multiple times, changing one value each time, and apparently without using any loop.
Thanks!
Edit: Here is my main code if necessary. Sorry for the bad style ...
import multiprocessing
import config
import time
import LauncherCount

class Filter:
    """ Filtering methods """
    def __init__(self):
        print("launching methods")

    # Return the list: [Latitude, Longitude] (elements are floating point numbers)
    def LatLong(self, line):
        comaCount = []
        comaCount.append(line.find(','))
        comaCount.append(line.find(',', comaCount[0] + 1))
        comaCount.append(line.find(',', comaCount[1] + 1))
        Lat = line[comaCount[0] + 1 : comaCount[1]]
        Long = line[comaCount[1] + 1 : comaCount[2]]
        try:
            return [float(Lat), float(Long)]
        except ValueError:
            return [0, 0]

    # Return a boolean:
    # - True if the Lat/Long is within the Lat/Long rectangle defined by:
    #   tupleFilter = (minLat, maxLat, minLong, maxLong)
    # - False if not
    def LatLongFilter(self, LatLongList, tupleFilter):
        if (tupleFilter[0] <= LatLongList[0] <= tupleFilter[1] and
                tupleFilter[2] <= LatLongList[1] <= tupleFilter[3]):
            return True
        else:
            return False

    def writeLine(self, key, line):
        filterDico[key][1].write(line)

def filteringProcess(dico):
    myFilter = Filter()
    while True:
        try:
            currentLine = readFile.readline()
        except ValueError:
            break
        if len(currentLine) == 0:  # Breaks at the end of the file
            break
        if len(currentLine) < 35:  # Deletes wrong lines (too short)
            continue
        LatLongList = myFilter.LatLong(currentLine)
        for key in dico:
            if myFilter.LatLongFilter(LatLongList, dico[key][0]):
                myFilter.writeLine(key, currentLine)

###########################################################################
# Main
###########################################################################

# Open read files:
readFile = open(config.readFileList[LauncherCount.LauncherCount][1], 'r')

# Generate writing files:
pathDico = {}
filterDico = config.filterDico

# Create outputs
for key in filterDico:
    output_Name = (config.readFileList[LauncherCount.LauncherCount][0][:-4]
                   + '_' + key + '.log')
    pathDico[output_Name] = config.writingFolder + output_Name
    filterDico[key] = [filterDico[key], open(pathDico[output_Name], 'w')]

p = []
CPUCount = multiprocessing.cpu_count()
CPURange = range(CPUCount)

startingTime = time.localtime()

if __name__ == '__main__':
    ### Create and start processes:
    for i in CPURange:
        p.append(multiprocessing.Process(target=filteringProcess,
                                         args=(filterDico,)))
        p[i].start()

    ### Kill processes:
    while True:
        if [p[i].is_alive() for i in CPURange] == [False for i in CPURange]:
            readFile.close()
            for key in config.filterDico:
                config.filterDico[key][1].close()
                print(key, "is Done!")
            endTime = time.localtime()
            break

    print("Process started at:", startingTime)
    print("And ended at:", endTime)
To process groups of files in sequence while working on files within a group in parallel:
#!/usr/bin/env python
from multiprocessing import Pool

def work_on(args):
    """Process a single file."""
    i, filename = args
    print("working on %s" % (filename,))
    return i

def files():
    """Generate input filenames to work on."""
    #NOTE: you could read the file list from a file, get it using glob.glob, etc
    yield "inputfile1"
    yield "inputfile2"

def process_files(pool, filenames):
    """Process filenames using pool of processes.

    Wait for results.
    """
    for result in pool.imap_unordered(work_on, enumerate(filenames)):
        #NOTE: in general the files won't be processed in the original order
        print(result)

def main():
    p = Pool()

    # to do "successive" multiprocessing
    for filenames in [files(), ['other', 'bunch', 'of', 'files']]:
        process_files(p, filenames)

if __name__ == "__main__":
    main()
Each process_files() call runs in sequence after the previous one has completed, i.e., the files from different calls to process_files() are not processed in parallel.