Within a script is a watcher algorithm which I've adapted from here:
https://www.michaelcho.me/article/using-pythons-watchdog-to-monitor-changes-to-a-directory
My aim is now to add a few lines to grab the name of any file that's modified so I can have an if statement checking for a certain file such as:
if [modified file name] == "my_file":
    print("do something")
I don't have any experience with watchdog or file watching, so I am struggling to find an answer for this. How would I receive the modified file name?
The current setup of the watchdog class is pretty useless since it just prints; it doesn't return anything.
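The handler I'm using, adapted from that article, looks roughly like this (paraphrased, so treat it as a sketch); the event object it receives exposes the changed file's path as event.src_path:

from watchdog.events import FileSystemEventHandler

class Handler(FileSystemEventHandler):
    @staticmethod
    def on_any_event(event):
        if event.is_directory:
            return None
        elif event.event_type == 'modified':
            # currently this just prints the path of the file that triggered the event
            print("Received modified event - %s." % event.src_path)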
Let me offer a different approach.
The following would get you a list of files modified in the past 12 hours:
import os
import datetime as dt

result = [os.path.join(root, f)
          for root, subfolder, files in os.walk(my_dir)
          for f in files
          if dt.datetime.fromtimestamp(os.path.getmtime(os.path.join(root, f))) >
             dt.datetime.now() - dt.timedelta(hours=12)]
Instead of a fixed delta of hours, you can save the last_search_time and use it in subsequent searches.
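A rough sketch of that idea (my_dir is a placeholder for the directory being searched):

import os
import datetime as dt

my_dir = "/path/to/search"                                      # placeholder
last_search_time = dt.datetime.now() - dt.timedelta(hours=12)   # seed for the first run

def modified_since(directory, since):
    # every file whose modification time is newer than `since`
    return [os.path.join(root, f)
            for root, subfolder, files in os.walk(directory)
            for f in files
            if dt.datetime.fromtimestamp(os.path.getmtime(os.path.join(root, f))) > since]

result = modified_since(my_dir, last_search_time)
last_search_time = dt.datetime.now()                            # remember for the next search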
You can then search result to see if it contains your file (note that result holds full paths, so compare against the full path or a basename):
if my_file in result:
    print("Sky is falling. Take cover.")
I've only just recently started coding with Python, and I've found my first problem which I can't seem to figure out after days of research. Hopefully someone on this forum can help me out.
The situation: In our company we have multiple folders and subfolders. Within those subfolders we have Excel files called:
Item Supply Demand "date".xlsx
Backorder report"date".xlsx
Product available report"date".xlsx
Every day in the morning our IT department downloads a new file with these names and today's date. For example, today it will look like this: Item Supply Demand 23-06-22.xlsx
The goal: I want to find the most recent Excel file within our subfolders which contains the name "Item Supply Demand". I already know how to find the most recent Excel file with the glob.glob function. However, I cannot seem to add an extra filter on a name part. Below is the code that I already have:
import sys
import csv
import pandas as pd
import glob
import os.path
import pathlib
import re
#search for all Excel files
files = glob.glob(r"Pathname\**\*.xlsx", recursive = True)
#find most recent Item Supply Demand report
text_files = str(files)
if 'Item Supply Demand' in text_files:
    max_file = max(files, key=os.path.getctime)
#Add the file to the dataframe
df = pd.read_excel(max_file)
df
Does anyone know what is currently missing or wrong in my code?
Many thanks in advance for helping out!
Cheers,
Kav
Try this, you're already 99% of the way there.
files = glob.glob(r"Pathname\**\*Item Supply Demand*.xlsx", recursive = True)
Then I suppose the code block underneath can drop the conditional to become
# find most recent Item Supply Demand report
max_file = max(files, key=os.path.getctime)
Note: I haven't checked whether that syntax does what you want, or even works at all; I'm assuming it's working for you as it's not the focus of your question.
Edit: Just checked that; nice, it will give you exactly what you want.
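Putting it together, a minimal sketch of the whole thing (the Pathname root stands in for your actual folder):

import glob
import os.path
import pandas as pd

# match only the Item Supply Demand reports, anywhere under the root folder
files = glob.glob(r"Pathname\**\*Item Supply Demand*.xlsx", recursive=True)

if files:
    # most recently created report
    max_file = max(files, key=os.path.getctime)
    df = pd.read_excel(max_file)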
The variable "files" is already a list of strings. You can create a list of strings that match only the substring, then use that list.
wanted_file_substring = "Item Supply Demand"
matching_files = [specific_file for specific_file in files if wanted_file_substring in specific_file]
max_file = max(matching_files, key=os.path.getctime)
Edit to my answer:
Whichever answer you choose, you need to initialize the variable outside of the "if" statement or move the read_excel line into the if statement. If the file you want is not found, your program will error out, because the code tries to reference a variable (max_file) that was never assigned.
Change the if statement to:
if files:
    max_file = max(.....)
    pd.read_excel(max_file)
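For instance, combined with the substring filter from my answer, it might look like:

matching_files = [f for f in files if "Item Supply Demand" in f]
if matching_files:
    max_file = max(matching_files, key=os.path.getctime)
    df = pd.read_excel(max_file)
else:
    print("No Item Supply Demand report found")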
Please help me with a challenge I have: I need to list files every 30 seconds and process them (processing means, for example, copying them to another location; each file is moved out of the directory once processed). When I list files again after 30 seconds, I want to skip any files that were already listed for processing (because they were picked up previously and the for loop is still in progress).
In other words, I want to avoid duplicate file processing while listing the files every 30 seconds.
Here is my code:
import os
import time
from threading import Thread

def List_files():
    path = 'c:\\projects\\hc2\\'
    files = []
    for r, d, f in os.walk(path):
        for file in f:
            if '.txt' in file:
                files.append(os.path.join(r, file))
    return files   # needed so start_threading has something to iterate over

class MyFilethreads:
    def __init__(self, t1):
        self.t1 = t1

    def start_threading(self):
        for file in List_files():
            self.t1 = Thread(target=<FILEPROCESS_FUNCTION>, args=(file,))
            self.t1.start()

t1 = Thread()
myclass = MyFilethreads(t1)

while True:
    myclass.start_threading()
    time.sleep(30)
I have not included my actual function for processing files, since it's big; it is called in a thread as FILEPROCESS_FUNCTION.
Problem:
If the file size is large, my file processing time sometimes increases (in other words, the for loop takes more than 30 seconds), but I can't change the 30-second timer since that is a very rare possibility, and my Python script takes in hundreds of files every minute.
Hence, I am looking for a way to skip files that were already listed previously, and by this I want to avoid duplicate file processing.
Please help.
Thanks in advance.
Make a dictionary in your class and insert all the files you have seen. Then, in your start_threading, check whether the file is in the dictionary, and pass in that case.
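A minimal sketch of that idea, reusing your List_files() helper (fileprocess_function stands in for your actual processing function):

from threading import Thread

class MyFilethreads:
    def __init__(self):
        self.seen_files = {}                      # files already handed to a worker thread

    def start_threading(self):
        for file in List_files():
            if file in self.seen_files:
                continue                          # already queued on an earlier pass
            self.seen_files[file] = True
            Thread(target=fileprocess_function, args=(file,)).start()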
SO members, how can I read the latest JSON file in a directory one time only (and if there is no new file, print something)? So far I can only read the latest file. The sample script below (run every 45 minutes) opens and reads the latest JSON file in a directory. In this case the latest file is file3.json (a JSON file is created every 30 minutes). Thus, if file4 is not created for some reason (for example, the server fails to create a new JSON file) and the script runs again, it will still read the same last file3.
files in directory
file1.json
file2.json
file3.json
The script below is able to open and read the latest JSON file created in the directory.
import glob
import json
import os
import os.path
import datetime, time

listFiles = glob.iglob('logFile/*.json')
latestFile = max(listFiles, key=os.path.getctime)
with open(latestFile, 'r') as f:
    mydata = json.load(f)
    print(mydata)
To ensure the script will only read the newest file, and read it one time only, I expect something like below:
listFiles = glob.iglob('logFile/*.json')
latestFile = max(listFiles, key=os.path.getctime)
if latestFile newer than previous open/read file:  # Not sure how to compare the latest file with the previous file.
    with open(latestFile, 'r') as f:
        mydata = json.load(f)
        print(mydata)
else:
    print("no new file created")
Thank you for your help. An example solution would be good to share.
I can't figure out the solution; it seems simple, but a few days of trial and error haven't brought any luck.
(1) Make sure to read the latest file in the directory.
(2) Make sure to read any file/s that may have been missed (due to the script failing to run).
(3) Only read all the files once, and if there is no new file, give a warning.
Thank you.
After the SO discussion and suggestions, I got a few methods to resolve, or at least accommodate, some of the requirements. I just move files that have been processed. If no file is created, the script will do nothing, and if the script fails, then once things are back to normal it will run and read all the related files available. I think it's good for now. Thank you, guys.
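A minimal sketch of that move-after-processing approach (the processed/ folder name is just an example):

import glob
import json
import os
import shutil

PROCESSED_DIR = 'logFile/processed'            # example destination folder
os.makedirs(PROCESSED_DIR, exist_ok=True)

new_files = sorted(glob.glob('logFile/*.json'), key=os.path.getctime)
if not new_files:
    print("no new file created")
for path in new_files:
    with open(path) as f:
        print(json.load(f))
    shutil.move(path, PROCESSED_DIR)           # so it will not be read again on the next run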
Below is not so much an answer as an approach I would like to propose:
The idea is as follows:
Every log file that is written to the directory can have a key-value pair in it called "creation_time": timestamp (the fileX.json that gets stored on the server). Now, your script runs every 45 minutes to obtain the file which is dumped into the directory. In normal cases, you should be able to read the file, and finally, when you exit the script, you can store the last-read filename and the creation_time taken from that fileX.json into a logger.json.
An example of a logger.json is as follows:
{
    "creation_time": "03520201330",
    "file_name": "file3.json"
}
Whenever a server failure or any delay occurs, the fileX.json could have been rewritten or new fileX.json files could have been created in the directory. In these situations, you would first open the logger.json and obtain both the timestamp and the last filename, as shown in the example above. By using the last filename, you can compare the old timestamp present in the logger with the new timestamp in fileX.json. If they match, there is basically no change; you only read the files ahead of it and rewrite the logger.
If that is not the case, you would re-read the last fileX.json again and then proceed to read the other files ahead of it.
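A rough sketch of that bookkeeping; here I use the file's ctime as the timestamp instead of a creation_time key inside each JSON, but the logic is the same (logger.json's location is an assumption):

import glob
import json
import os

LOGGER_PATH = 'logger.json'                    # assumed location of the state file

def load_logger():
    # return the last recorded state, or an empty state on the first run
    if os.path.exists(LOGGER_PATH):
        with open(LOGGER_PATH) as f:
            return json.load(f)
    return {"creation_time": 0, "file_name": None}

def save_logger(file_name, creation_time):
    with open(LOGGER_PATH, 'w') as f:
        json.dump({"creation_time": creation_time, "file_name": file_name}, f)

state = load_logger()
latestFile = max(glob.iglob('logFile/*.json'), key=os.path.getctime, default=None)

if latestFile and os.path.getctime(latestFile) > state["creation_time"]:
    with open(latestFile) as f:
        print(json.load(f))
    save_logger(latestFile, os.path.getctime(latestFile))
else:
    print("no new file created")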
My code, which is aimed at showing the last opened file, always outputs one particular file even though it is not the last opened file. Also, my code doesn't work if it isn't in the same folder as the data I am searching through.
I'm a complete newbie in Python, and this is actually the first program I've ever created with it. I wanted to make my life easier and build a little terminal application which should automatically debug my code. My first step was to create code that shows the last opened project, because if the folder I want to put my projects in is full, it would be hard to search for the name of my project. So I've come up with this:
import os

z = 3
o = r"/home/myname/Dokumente/Python"
list = os.listdir(o)
list_length = len(list)
list_time = []
list_low = []
print(list)
while list_length != 0:
    list_length -= 1
    print((os.path.getatime(list[list_length-1])))
    list_time.append((os.path.getatime(list[list_length-1])))
    print(list_time)
else:
    list_time.reverse()
    recent = list_time.index(min(list_time))
    print(recent)
    print("recently opened")
    print(list[recent])
    print(list)
To my second problem (not working when not in the same folder), this is the output of the terminal:
['Hello_World.py', 'Python_Debugger_Kopie.py']
Traceback (most recent call last):
File "Python_Debugger.py", line 14, in <module>
print((os.path.getatime(list[list_length-1])))
File "/usr/lib/python3.7/genericpath.py", line 60, in getatime
return os.stat(filename).st_atime
FileNotFoundError: [Errno 2] No such file or directory: 'Hello_World.py'
You have a path problem I think.
os.path.getatime(file)
Returns the time of last access of file.
So when you call :
os.path.getatime(list[list_length-1])
Python is trying to find the file 'Hello_World.py'.
However, this file is in your "/home/myname/Dokumente/Python" directory.
So I think you could write this at the beginning of your file:
path_dir = '/home/myname/Dokumente/Python'
os.chdir(path_dir)
It will change your current working directory, and it should work.
os.listdir returns only the file names, so when you try to use os.path.getatime it looks in your current working directory for the file name and doesn't find the file. Try adding the path with something like os.path.join(o, list[list_length-1]).
A couple of additional notes since you're new to Python:
list is a built-in name, so you should probably choose another variable name to avoid problems
you don't need to keep track of your indexes in Python when iterating through lists; it's handled for you. You can do something like:
for element in list:
    print((os.path.getatime(element)))
    ...
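Pulling both suggestions together, a possible sketch (using max over access times for "most recent", which is an assumption about what you want):

import os

path_dir = "/home/myname/Dokumente/Python"
entries = os.listdir(path_dir)                      # avoids shadowing the built-in name `list`

# map each entry to its last-access time, using the full path
access_times = {name: os.path.getatime(os.path.join(path_dir, name)) for name in entries}

recent = max(access_times, key=access_times.get)    # most recently accessed entry
print("recently opened:", recent)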
We have a requirement in our project to detect, in Python, anything that is dropped into a directory.
The process is like this:
There will be a Python script running almost all day (a sort of cron job), which will keep watch on a directory.
When anybody puts a file into that directory, the file should be detected.
The dropped file will have a zip, xml, json, or ini format.
There is no fixed way for how the user will drop the file into that directory (i.e. a person could simply copy or move it from the console with the cp or mv command, or might do an FTP transfer from some other computer, or may upload it through our web interface).
I'm able to detect it when it's dropped via the web interface, but not for the other ways.
Can anyone suggest a way to detect the dropped file? This is what I have so far:
def detect_file(watch_folder_path):
    """ Detect a file dropped """
    watched_files = os.listdir(watch_folder_path)
    if len(watched_files) > 0:
        filename = watched_files[0]
        print "File located: ", filename
If this is a Linux system, I would suggest inotifywatch, as it can be configured per event, like create, moved_to, and more.
There is a Python wrapper for it, pyinotify, which you can invoke like this:
python -m pyinotify -v /my-dir-to-watch
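Used from Python rather than the command line, a rough sketch could look like this (treat the details as an assumption and check them against the pyinotify docs):

import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        print("created:", event.pathname)

    def process_IN_MOVED_TO(self, event):
        print("moved in:", event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch('/my-dir-to-watch', pyinotify.IN_CREATE | pyinotify.IN_MOVED_TO)
notifier = pyinotify.Notifier(wm, Handler())
notifier.loop()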
How about:
known_files = []

def detect_file(watch_folder_path):
    files = os.listdir(watch_folder_path)
    for file in files:
        if file not in known_files:
            # RAISE ALERT e.g. send email
            known_files.append(file)
Add the file to the known_files list once the alert has been raised so that it does not keep alerting.
You will then want to run detect_file() on repeat at a frequency of your discretion. I recommend using Timer to achieve this. Or, even more simply, execute this function inside a while True: loop and add time.sleep(60) to run the detect_file() check every 60 seconds, for example.
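For example, the while True variant might look like this (the path and the 60-second interval are placeholders):

import time

watch_folder_path = '/path/to/watch'      # placeholder
while True:
    detect_file(watch_folder_path)
    time.sleep(60)                        # check every 60 seconds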
If you don't want to use any dependency for your project, you can rely on a script to compute the changes to your files. Assuming this script will always be running, you can write the following code:
import os

def is_interesting_file(f):
    interesting_extensions = ['zip', 'json', 'xml', 'ini']
    file_extension = f.split('.')[-1]
    if file_extension in interesting_extensions:
        return True
    return False

watch_folder_path = '/path/to/watch'   # path placeholder
previous_watched_files = set()

while True:
    watched_files = set(os.listdir(watch_folder_path))
    new_files = watched_files.difference(previous_watched_files)
    interesting_files = [filename for filename in new_files if is_interesting_file(filename)]
    previous_watched_files = watched_files   # remember this listing for the next pass
    # Do something with your interesting files
If you want to run this script on a cron or something like that using this approach, you can always save the directory listing in a file or a simple database such as SQLite and assign it to the previous_watched_files variable. Then you can make one iteration watching the directory for changes, clear the db/file records, and create them again with the updated listing results.
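A minimal sketch of that persistence idea, using a plain JSON file for the saved listing (both paths are assumptions):

import json
import os

STATE_FILE = '/tmp/watched_files.json'         # assumed location for the saved listing
WATCH_DIR = '/path/to/watch'                   # placeholder

def load_previous():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return set(json.load(f))
    return set()

def save_current(listing):
    with open(STATE_FILE, 'w') as f:
        json.dump(sorted(listing), f)

watched_files = set(os.listdir(WATCH_DIR))
new_files = watched_files - load_previous()
# process new_files here, e.g. filter them with is_interesting_file()
save_current(watched_files)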