We have a requirement in our project to detect, from Python, anything that is dropped into a directory.
The process is like this:
There will be a Python script running almost all day (a sort of cron job), which will keep watch on a directory.
When anybody puts a file into that directory, the file should be detected.
The dropped file will be in zip, xml, json or ini format.
There is no fixed way for how the user will drop the file into that directory (i.e. the person could simply copy or move it from the console with cp or mv, transfer it over FTP from another computer, or upload it through our web interface).
I'm able to detect it when dropped via the web interface, but not for the other ways.
Can anyone suggest a way to detect the dropped file? Here is what I have:
import os

def detect_file(watch_folder_path):
    """ Detect a file dropped """
    watched_files = os.listdir(watch_folder_path)
    if len(watched_files) > 0:
        filename = watched_files[0]
        print "File located:", filename
If this is a Linux system, I would suggest inotifywatch, as it can be configured per event, such as create, moved_to and more.
There is a Python wrapper, pyinotify, which you can invoke like this:
python -m pyinotify -v /my-dir-to-watch
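If you prefer to stay inside Python rather than shelling out, a minimal sketch using the pyinotify API could look like this (the watch path and the chosen event masks are assumptions to adapt):

import pyinotify

class Handler(pyinotify.ProcessEvent):
    # IN_CLOSE_WRITE fires when a newly written file is closed,
    # IN_MOVED_TO when a file is moved/renamed into the directory.
    def process_IN_CLOSE_WRITE(self, event):
        print("File dropped:", event.pathname)

    def process_IN_MOVED_TO(self, event):
        print("File moved in:", event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch('/my-dir-to-watch', pyinotify.IN_CLOSE_WRITE | pyinotify.IN_MOVED_TO)
notifier = pyinotify.Notifier(wm, Handler())
notifier.loop()  # blocks and dispatches events as files arrive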
How about:
known_files = []

def detect_file(watch_folder_path):
    files = os.listdir(watch_folder_path)
    for file in files:
        if file not in known_files:
            # RAISE ALERT e.g. send email
            known_files.append(file)
Add the file to the known_files list once the alert has been raised so that it does not keep alerting.
You will then want to run detect_file() on repeat at a frequency of your discretion. I recommend using Timer to achieve this. Or, even more simply, execute this function inside a while True: loop and add a time.sleep(60) to run the detect_file() check every 60 seconds, for example.
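For example, a minimal polling loop along those lines (the 60-second interval and the watch path are placeholders):

import time

while True:
    detect_file('/path/to/watch')  # the function defined above
    time.sleep(60)                 # poll once per minute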
If you don't want to add any dependency to your project, you can rely on a script that computes the changes to your files. Assuming this script is always running, you can write the following code:
import os
import time

def is_interesting_file(f):
    interesting_extensions = ['zip', 'json', 'xml', 'ini']
    file_extension = f.split('.')[-1]
    if file_extension in interesting_extensions:
        return True
    return False

watch_folder_path = '/path/to/watch'  # directory to monitor
previous_watched_files = set()
while True:
    watched_files = set(os.listdir(watch_folder_path))
    new_files = watched_files.difference(previous_watched_files)
    interesting_files = [filename for filename in new_files if is_interesting_file(filename)]
    # Do something with your interesting files
    previous_watched_files = watched_files  # remember this pass for the next iteration
    time.sleep(1)  # avoid busy-looping
If you want to run this script from cron or something like that using this approach, you can always save the directory listing in a file or a simple database such as SQLite and assign it to the previous_watched_files variable. Then you make one iteration watching the directory for changes, clear the db/file records and create them again with the updated listing results.
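A rough sketch of that idea, persisting the previous listing in a small JSON file so a cron-driven run can compare against it (the state-file location and watch path are assumptions; is_interesting_file is reused from the snippet above):

import json
import os

STATE_FILE = '/var/tmp/watched_files.json'   # hypothetical location for the saved listing
watch_folder_path = '/path/to/watch'

try:
    with open(STATE_FILE) as fh:
        previous_watched_files = set(json.load(fh))
except (IOError, ValueError):
    previous_watched_files = set()           # first run, or unreadable state file

watched_files = set(os.listdir(watch_folder_path))
new_files = [f for f in watched_files - previous_watched_files if is_interesting_file(f)]
# ...process new_files here...

with open(STATE_FILE, 'w') as fh:
    json.dump(sorted(watched_files), fh)     # becomes the baseline for the next cron run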
Related
I am working on the listener portion of a backdoor program (for an ETHICAL hacking course), and I would like to be able to read files from any part of my Linux system, not just from within the directory where my listener Python script is located. However, this has not proven to be as simple as specifying a typical absolute path such as "~/Desktop/test.txt".
So far my code is able to read files and upload them to the virtual machine where my reverse backdoor script is actively running. But this only works when I read and upload files that are in the same directory as my listener script (aptly named listener.py). Code shown below.
def read_file(self, path):
    with open(path, "rb") as file:
        return base64.b64encode(file.read())
As I've mentioned previously, the above function only works if I try to open and read a file that is in the same directory as the script that the above code belongs to, meaning that path in the above context is a simple file name such as "picture.jpg".
I would like to be able to read a file from any part of my filesystem while maintaining the same functionality.
For example, I would love to be able to specify "~/Desktop/another_picture.jpg" as the path so that the contents of "another_picture.jpg" from my "~/Desktop" directory are base64 encoded for further processing and eventual upload.
Any and all help is much appreciated.
Edit 1:
My script where all the code is contained, "listener.py", is located in /root/PycharmProjects/virus_related/reverse_backdoor/. Within this directory is a file that, for simplicity's sake, we can call "picture.jpg". The same file, "picture.jpg", is also located on my desktop, absolute path = "/root/Desktop/picture.jpg".
When I try read_file("picture.jpg"), there are no problems, the file is read.
When I try read_file("/root/Desktop/picture.jpg"), the file is not read and my terminal becomes stuck.
Edit 2:
I forgot to note that I am using the latest version of Kali Linux and Pycharm.
I have run "realpath picture.jpg" and it has yielded the path "/root/Desktop/picture.jpg"
Upon running read_file("/root/Desktop/picture.jpg"), I encounter the same problem where my terminal becomes stuck.
[FINAL EDIT aka Problem solved]:
Based on the answer suggesting I try to read a file like "../file", I realized that the code was fully functional, because read_file("../file") worked without any flaws, indicating that my Python script had no trouble locating the given path. Once the file was read, it was uploaded to the machine running my backdoor where, curiously, it ended up in the parent directory of the script. It was then that I realized the problem lay in the handling of paths in the backdoor script rather than in my listener.py.
Credit is also due to the commenter who pointed out that "~" does not count as a valid path element. Once I reached the conclusion mentioned just above, I attempted read_file("~/Desktop/picture.jpg"), which failed. But with a quick modification, read_file("/root/Desktop/picture.jpg") executed successfully, and the file was uploaded to the same directory as my backdoor script on my target machine once I implemented some quick-fix code.
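For reference, a small sketch of that "~" point: Python passes the tilde to the OS literally, but os.path.expanduser can resolve it first (the path below is just the example from this question):

import os

path = "~/Desktop/picture.jpg"
# open(path) would fail here: "~" is a shell convention, not a real path element
expanded = os.path.expanduser(path)   # e.g. "/root/Desktop/picture.jpg" when running as root
with open(expanded, "rb") as f:
    data = f.read()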
My apologies for not being so specific; efforts to aid were certainly confounded by the unmentioned complexity of my situation and I would like to personally thank everyone who chipped in.
This was my first whole-hearted attempt to reach out to the stackoverflow community for help and I have not been disappointed. Cheers!
A solution I found is putting "../" before the filename if the file is right outside of the directory.
test.py (in some directory right inside the directory "Desktop", i.e. /Desktop/test):
with open("../test.txt", "r") as test:
    print(test.readlines())
test.txt (in directory "/Desktop"):
Hi!
Hello!
Result:
["Hi!", "Hello!"]
This is likely the simplest solution. I found this solution because I always use "cd ../" on the terminal.
This not only allows you to modify the current file, but also all other files in the same directory as the one you are reading/writing to.
import os
from PIL import Image

path = os.path.dirname(os.path.abspath(__file__))
dir_ = os.listdir(path)
for filename in dir_:
    full_path = os.path.join(path, filename)
    with open(full_path, 'rb') as f:   # read as bytes; some files (e.g. images) are not valid text
        content = f.read()
    print(filename, len(content))
    try:
        im = Image.open(full_path)
        im.show()
    except IOError:
        print('The following file is not an image type:', filename)
I want to build a script which finds out which files on an FTP server are new and which are already processed.
For each file on the FTP we read out the information, parse it and write the information we need from it to our database. The files are xml-files, but have to be translated.
At the moment I'm using mlsd() to get a list, but this takes up to 4 minutes because there are already 15,000 files in this directory - and it will be more every day.
Instead of comparing this list with an older list which I saved in a text file, I would like to know if there are better possibilities.
Because this task has to run "live", it would end up in a cron job every 1 or 2 minutes. If this method takes too long, this won't work.
The solution should be either in PHP or Python.
def handle(self, *args, **options):
    ftp = FTP_TLS(host=host)
    ftp.login(user, passwd)
    ftp.prot_p()
    listing = ftp.mlsd("...")
    for item in listing:
        print(item[0] + " => " + item[1]['modify'])
This code example already runs for 4 minutes.
I have always tried to avoid browsing a folder to find what could have changed. I preferred setting up a dedicated workflow. When files can only be added (or new versions of existing files), I try to use a workflow where files are added to one directory and then move to other directories where they are archived. Processing can occur in a directory where files are deleted after being used, or when they are copied/moved from one folder to another.
As a small bonus, I also use a copy/rename pattern: the files are first copied using a temporary name (for example a .t prefix or suffix) and renamed when the copy has finished. This prevents trying to process a file which is not fully copied. OK, it used to be more important when we had slow lines, but race conditions should be avoided as much as possible, and it allows a daemon to poll a folder every 10 seconds or less.
Unsure whether this is really relevant here because it could require some refactoring, but it gives bulletproof solutions.
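A minimal sketch of that copy-then-rename pattern (the .t suffix and paths are illustrative):

import os
import shutil

def publish(src, dest_dir):
    """Copy src into dest_dir without exposing a half-written file to the watcher."""
    final_path = os.path.join(dest_dir, os.path.basename(src))
    tmp_path = final_path + '.t'        # watchers are told to ignore *.t files
    shutil.copy2(src, tmp_path)         # the slow copy happens under the temporary name
    os.rename(tmp_path, final_path)     # atomic on the same filesystem: the file appears complete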
If FTP is your only interface to the server, there's no better way than what you are already doing.
Except maybe, if your server supports the non-standard -t switch to the LIST/NLST commands, which returns the list sorted by timestamps.
See How to get files in FTP folder sorted by modification time.
That helps if what takes long is the download of the file list itself (not the initiation of the download). In that case you can request the sorted list, but download only the leading new files, aborting the listing once you find the first already-processed file.
For an example, how to abort download of a file list, see:
Download the first N rows of text file in ftp with ftplib.retrlines
Something like this:
class AbortedListing(Exception):
    pass

def collectNewFiles(s):
    if isProcessedFile(s):  # your code to detect if the file was processed already
        print("We know this file already: " + s + " - aborting")
        raise AbortedListing()
    print("New file: " + s)

try:
    ftp.retrlines("NLST -t /path", collectNewFiles)
except AbortedListing:
    # read/skip response
    ftp.getmultiline()
I have written the following code to extract zip files in a directory and delete a particular Excel file in the extracted directory:
def extractZipFiles(dest_directory):
    "This function extracts zip files in the destination directory for further processing"
    fileFullPath = dest_directory + '\\'
    extractedDirList = list()
    for file in os.listdir(dest_directory):
        dn = fileFullPath + file
        dn = re.sub(r'\.zip$', "", fileFullPath + file)  # remove the trailing .zip
        extractedDirList.append(dn)
        zf = zipfile.ZipFile(fileFullPath + file, mode='r')
        zf.extractall(dn)  # extract the contents of that zip to the empty directory
        zf.close()
    return extractedDirList
def removeSelectedReports(extractedDirList):
    "This function removes the selected reports from extracted directory"
    for i in range(len(extractedDirList)):
        for filename in os.listdir(extractedDirList[i]):
            if filename.startswith("ABC_8"):
                logger.info("File to be removed::" + filename)
                fullPathName = "%s/%s" % (extractedDirList[i], filename)
                os.remove(fullPathName)
    return
extractedDirList = extractZipFiles(attributionRptDestDir)
logger.info("ZIP FILES EXTRACTED:"+str(extractedDirList))
removeSelectedReports(extractedDirList)
I am getting the following intermittent issue even though I have closed the zip file handler.
[WinError 32] The process cannot access the file because it is being used by another process: '\\\\share\\Workingdirectory\\report.20180517.zip'
Can you please help resolve this issue?
You should try to figure out what has the file open. Based on your code, it looks like you are on Microsoft Windows.
I would stop all applications on your workstation, including browsers, run with only a minimum number of apps open, and reproduce the problem. Once reproduced, you can use a tool to list all handles open to a particular file.
A handy utility would be handle.exe, but please use any tool with similar functionality.
Once you find the offending application, you can further investigate why the file is open, and take counter measures.
I would be careful not to close any application which has the file open, until you know it is safe to do so.
I'm using the Python FTP lib for the first time. My goal is simply to connect to an FTP site, get a directory listing, and then download all files which are newer than a certain date - (e.g. download all files created or modified within the last 5 days, for example)
This turned out to be a bit more complicated than I expected for a few reasons. Firstly, I've discovered that there is no real "standard" FTP file list format. Most FTP sites conventionally use the UNIX ls format, but this isn't guaranteed.
So, my initial thought was to simply parse the UNIX ls format: it's not so bad after all, and it seems most mainstream FTP servers will use it in response to the LIST command.
This was easy enough to code with Python's ftplib:
import ftplib

def callback(line):
    print(line)

ftp = ftplib.FTP("ftp.example.com")
result = ftp.login(user="myusername", passwd="XXXXXXXX")
dirlist = ftp.retrlines("LIST", callback)
This works, except the problem is that the date given in the UNIX list format returned by the FTP server I'm dealing with doesn't have a year. A typical entry is:
-rw-rw-r-- 1 user user 1505581 Dec 9 21:53 somefile.txt
So the problem here is that I'd have to code in extra logic to sort of "guess" if the date refers to the current year or not. Except really, I'd much rather not code some complex logic like that when it seems so unnecessary - there's no reason the FTP server shouldn't be able to give me the year.
Okay, so after Googling around for some alternative ways to get LIST information, I've found that many FTP servers support the MLST and MLSD commands, which apparently provide a directory listing in a "machine-readable" format, i.e. a list format which is much more amenable to automatic processing. Great. So, I try the following:
dirlist = ftp.sendcmd("MLST")
print(dirlist)
This produces a single line response, giving me data about the current working directory, but NOT a list of files.
250-Start of list for /
modify=20151210094445;perm=flcdmpe;type=cdir;unique=808U6EC0051;UNIX.group=1003;UNIX.mode=0775;UNIX.owner=1229; /
250 End of list
So this looks great, and easy to parse, and it also has a modify date with the year. Except it seems the MLST command is showing information about the directory itself, rather than a listing of files.
So, I've Googled around and read the relevant RFCs, but can't seem to figure out how to get a listing of files in "MLST" format. It seems the MLSD command is what I want, but I get a 425 error when I try that:
File "temp8.py", line 8, in <module>
dirlist = ftp.sendcmd("MLSD")
File "/usr/lib/python3.2/ftplib.py", line 255, in sendcmd
return self.getresp()
File "/usr/lib/python3.2/ftplib.py", line 227, in getresp
raise error_temp(resp)
ftplib.error_temp: 425 Unable to build data connection: Invalid argument
So how can I get a full directory listing in MLST/MLSD format here?
There is another module, ftputil, which is built on top of ftplib and has many features emulating os, os.path and shutil. I found it pretty easy to use and robust in related operations. Maybe you could give it a try.
As for your purpose, the introductory code example solves it exactly; a rough sketch along those lines follows.
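For illustration, a rough sketch with ftputil for the "files newer than 5 days" goal (host, credentials and the time window are placeholders, not tested against your server):

import time
import ftputil

with ftputil.FTPHost('ftp.example.com', 'myusername', 'XXXXXXXX') as host:
    cutoff = time.time() - 5 * 24 * 3600          # anything modified in the last 5 days
    for name in host.listdir(host.curdir):
        if host.path.isfile(name) and host.path.getmtime(name) >= cutoff:
            host.download(name, name)             # remote name -> same local name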
You could try this and see if you can get what you need:
print(list(ftp.mlsd('directory')))  # ftplib exposes the MLSD listing via FTP.mlsd()
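For example, a minimal sketch that uses the MLSD facts to keep only files modified in the last 5 days (assuming the server supports MLSD and that ftp is an already logged-in ftplib.FTP instance):

from datetime import datetime, timedelta

cutoff = datetime.utcnow() - timedelta(days=5)
recent = []
for name, facts in ftp.mlsd('directory'):
    if facts.get('type') != 'file':
        continue                                   # skip subdirectory/cdir/pdir entries
    modified = datetime.strptime(facts['modify'][:14], '%Y%m%d%H%M%S')
    if modified >= cutoff:
        recent.append(name)
print(recent)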
I am working on something similar where I need to parse the content of a directory and all sub-directories within. However, the server that I am working with did not allow the MLST command, so I accomplished what I needed by:
parse the main directory content
loop through the main directory content
append the loop output to a pandas DataFrame
import pandas as pd

test = pd.Series(ftp.nlst('/target directory/'))
df_server_content = pd.DataFrame()
for i in test:
    data_dir = '/target directory/' + i
    server_series = pd.Series(ftp.nlst(data_dir))
    df_server_content = df_server_content.append(server_series)
I'm trying to code a simple application that must read all currently open files within a certain directory.
More specifically, I want to get a list of files open anywhere inside my Documents folder,
but I don't want only the processes' IDs or process name, I want the full path of the open file.
The thing is I haven't quite found anything to do that.
I couldn't do it either in the Linux shell (using the ps and lsof commands) or with Python's psutil library. Neither of these gives me the information I need, which is only the paths of files currently open in a directory.
Any advice?
P.S: I'm tagging this as python question (besides os related tags) because it would be a plus if it could be done using some python library.
This seems to work (on Linux):
import subprocess
import shlex

cmd = shlex.split('lsof -F n +d .')
try:
    output = subprocess.check_output(cmd, universal_newlines=True).splitlines()
except subprocess.CalledProcessError as err:
    # lsof exits non-zero when it could not inspect everything; keep whatever it printed
    output = err.output.splitlines()
output = [line[3:] for line in output if line.startswith('n./')]
# e.g. ['file.tmp']
It reads open files from the current directory, non-recursively.
For a recursive search, use the +D option. Keep in mind that it is vulnerable to a race condition - by the time you get your output, the situation might already have changed. It is always best to try to do something (open the file) and check for failure, e.g. open the file and catch the exception, or check for a NULL FILE value in C.
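For instance, a recursive variant of the snippet above (the ~/Documents path is only an illustration, and the error handling shown earlier is omitted for brevity):

import os
import shlex
import subprocess

# +D makes lsof descend into subdirectories; expanduser resolves "~" since no shell is involved
docs = os.path.expanduser('~/Documents')
cmd = shlex.split('lsof -F n +D') + [docs]
lines = subprocess.check_output(cmd, universal_newlines=True).splitlines()
open_paths = [line[1:] for line in lines if line.startswith('n/')]  # absolute paths of open files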