Python FTP: parseable directory listing

I'm using the Python FTP lib for the first time. My goal is simply to connect to an FTP site, get a directory listing, and then download all files which are newer than a certain date (e.g. all files created or modified within the last 5 days).
This turned out to be a bit more complicated than I expected for a few reasons. Firstly, I've discovered that there is no real "standard" FTP file list format. Most FTP sites conventionally use the UNIX ls format, but this isn't guaranteed.
So, my initial thought was to simply parse the UNIX ls format: it's not so bad after all, and it seems most mainstream FTP servers will use it in response to the LIST command.
This was easy enough to code with Python's ftplib:
import ftplib

def callback(line):
    print(line)

ftp = ftplib.FTP("ftp.example.com")
result = ftp.login(user="myusername", passwd="XXXXXXXX")
dirlist = ftp.retrlines("LIST", callback)
This works, except the problem is that the date given in the UNIX list format returned by the FTP server I'm dealing with doesn't have a year. A typical entry is:
-rw-rw-r-- 1 user user 1505581 Dec 9 21:53 somefile.txt
So the problem here is that I'd have to add extra logic to "guess" whether the date refers to the current year or not. Really, I'd much rather not write that kind of logic when it seems so unnecessary: there's no reason the FTP server shouldn't be able to give me the year.
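For reference, the usual guessing heuristic relies on the ls convention that the year is omitted only for recent timestamps, so a parsed date that lands in the future must belong to the previous year. A minimal sketch of that logic, assuming local time roughly matches the server's (Feb 29 would need extra care):

from datetime import datetime

def guess_mtime(month_day_time):
    """Guess a full timestamp from an ls-style date like 'Dec 9 21:53'."""
    now = datetime.now()
    normalized = " ".join(month_day_time.split())  # collapse padding spaces
    parsed = datetime.strptime(normalized, "%b %d %H:%M").replace(year=now.year)
    if parsed > now:  # a listing can't refer to the future
        parsed = parsed.replace(year=now.year - 1)
    return parsed

print(guess_mtime("Dec  9 21:53"))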
Okay, so after Googling around for some alternative ways to get LIST information, I've found that many FTP servers support the MLST and MLSD commands, which apparently provide a directory listing in a "machine-readable" format, i.e. a list format which is much more amenable to automatic processing. Great. So, I try the following:
dirlist = ftp.sendcmd("MLST")
print(dirlist)
This produces a single line response, giving me data about the current working directory, but NOT a list of files.
250-Start of list for /
modify=20151210094445;perm=flcdmpe;type=cdir;unique=808U6EC0051;UNIX.group=1003;UNIX.mode=0775;UNIX.owner=1229; /
250 End of list
So this looks great, and easy to parse, and it also has a modify date with the year. Except it seems the MLST command is showing information about the directory itself, rather than a listing of files.
So, I've Googled around and read the relevant RFCs, but can't seem to figure out how to get a listing of files in "MLST" format. It seems the MLSD command is what I want, but I get a 425 error when I try that:
File "temp8.py", line 8, in <module>
dirlist = ftp.sendcmd("MLSD")
File "/usr/lib/python3.2/ftplib.py", line 255, in sendcmd
return self.getresp()
File "/usr/lib/python3.2/ftplib.py", line 227, in getresp
raise error_temp(resp)
ftplib.error_temp: 425 Unable to build data connection: Invalid argument
So how can I get a full directory listing in MLST/MLSD format here?

There is another module, ftputil, which is built on top of ftplib and offers many features emulating os, os.path and shutil. I found it pretty easy to use and robust for this kind of operation. Maybe you could give it a try.
As for your purpose, the introductory code example in its documentation solves it almost exactly.
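A minimal sketch of that approach, assuming ftputil is installed and using placeholder credentials; ftputil exposes modification times through an os.path-style API:

import time
import ftputil

# FTPHost works as a context manager; host and credentials are placeholders.
with ftputil.FTPHost("ftp.example.com", "myusername", "XXXXXXXX") as host:
    cutoff = time.time() - 5 * 24 * 3600  # "newer than 5 days"
    for name in host.listdir("."):
        if host.path.isfile(name) and host.path.getmtime(name) > cutoff:
            host.download(name, name)  # remote name, local name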

You could try ftplib's mlsd method (added in Python 3.3), which issues the MLSD command over a data connection and parses the facts for you, and see if you can get what you need:

for name, facts in ftp.mlsd('directory'):
    print(name, facts)
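Since each entry's modify fact is a YYYYMMDDHHMMSS string in UTC, the original goal (files changed within the last 5 days) then reduces to a string comparison. A minimal sketch, assuming the server reports the type and modify facts:

from datetime import datetime, timedelta

cutoff = (datetime.utcnow() - timedelta(days=5)).strftime("%Y%m%d%H%M%S")
for name, facts in ftp.mlsd("/"):
    if facts.get("type") == "file" and facts.get("modify", "") > cutoff:
        print("modified recently:", name, facts["modify"])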
I am working on something similar, where I need to parse the contents of a directory and all the subdirectories within it. However, the server that I am working with did not allow the MLST command, so I accomplished what I needed by:
1. parsing the main directory contents,
2. looping through the main directory contents,
3. appending each loop's output to a pandas DataFrame:
import pandas as pd

test = pd.Series(ftp.nlst('/target directory/'))  # fixed: the quotes around the call were mismatched
df_server_content = pd.DataFrame()
for i in test:
    data_dir = '/target directory/' + i
    server_series = pd.Series(ftp.nlst(data_dir))
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
    df_server_content = pd.concat([df_server_content, server_series.to_frame()],
                                  ignore_index=True)

Related

ftplib -- deleting many files from ftp folder

I am pretty new to Python, but would like to use it to do some tasks on an FTP server. It feels like it should be fairly easy, but I am having some issues when trying to delete multiple files (hundreds) from an FTP folder. I have the file names I want to delete as strings from a SQL table, and I can copy and paste them somewhere if needed.
My code so far:
import os
import ftplib
ftpHost = 'ftp.myhost.com'
ftpPort = 21
ftpUsername = 'myuser'
ftpPassword = 'mypassword'
ftp = ftplib.FTP(timeout=30)
ftp.connect(ftpHost, ftpPort)
ftp.login(ftpUsername, ftpPassword)
ftp.cwd("/myftpfolder/January2023")
ftp.delete("1234myfile.mp4")
ftp.quit()
print("Execution complete...")
As above, I can delete the files one-off, but is there a practical way for me to delete about 800 files from the folder above, if I were able to paste them somewhere or put them in a text file and have Python read through it to execute the deletes? I suppose this isn't necessarily an FTP- or ftplib-specific question, but it could help me get a better general understanding of lists, tuples, etc. Using Python 3.10, btw.
Thanks!
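A minimal sketch of that approach, assuming the filenames are pasted one per line into a text file with the hypothetical name delete_list.txt:

import ftplib

ftp = ftplib.FTP(timeout=30)
ftp.connect('ftp.myhost.com', 21)
ftp.login('myuser', 'mypassword')
ftp.cwd("/myftpfolder/January2023")

# One filename per line; skip blank lines and strip trailing newlines.
with open("delete_list.txt") as f:
    names = [line.strip() for line in f if line.strip()]

for name in names:
    try:
        ftp.delete(name)
    except ftplib.error_perm as e:
        print("could not delete", name, "-", e)  # e.g. file already gone

ftp.quit()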

Find out differences between directory listings on time A and time B on FTP

I want to build a script which finds out which files on an FTP server are new and which have already been processed.
For each file on the FTP server we read out the information, parse it, and write what we need from it to our database. The files are XML files, but they have to be translated.
At the moment I'm using mlsd() to get a list, but this takes up to 4 minutes because there are already 15,000 files in this directory, and there will be more every day.
Instead of comparing this list with an older list which I saved in a text file, I would like to know if there are better possibilities.
Because this task has to run "live", it would end up as a cronjob running every 1 or 2 minutes. If the method takes too long, this won't work.
The solution should be either in PHP or Python.
from ftplib import FTP_TLS

def handle(self, *args, **options):
    ftp = FTP_TLS(host=host)
    ftp.login(user, passwd)
    ftp.prot_p()
    entries = ftp.mlsd("...")  # renamed from `list`, which shadows the builtin
    for item in entries:
        print(item[0] + " => " + item[1]['modify'])
This code example already takes 4 minutes to run.
I have always tried to avoid browsing a folder to find out what could have changed, and have preferred setting up a dedicated workflow instead. When files can only be added (or new versions of existing files), I try to use a workflow where files are added to one directory and then move on to other directories where they are archived. Processing can occur in a directory where files are deleted after being used, or while they are copied/moved from one folder to another.
As an extra safeguard, I also use a copy/rename pattern: the files are first copied under a temporary name (for example with a .t prefix or suffix) and renamed when the copy has ended. This prevents trying to process a file which is not fully copied; a sketch follows below. Admittedly this used to matter more when we had slow lines, but race conditions should be avoided as much as possible, and it allows a daemon to poll a folder every 10 seconds or less.
I am unsure whether this is really relevant here, because it could require some refactoring, but it gives bulletproof solutions.
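A minimal sketch of the copy/rename pattern on the uploading side, using ftplib and a placeholder filename; the file only becomes visible under its final name once it is complete:

import ftplib

def upload_atomically(ftp, local_path, remote_name):
    """Upload under a temporary name, then rename once the copy has finished."""
    tmp_name = remote_name + ".t"  # consumers are told to ignore *.t files
    with open(local_path, "rb") as f:
        ftp.storbinary("STOR " + tmp_name, f)
    ftp.rename(tmp_name, remote_name)  # the rename is effectively the commit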
If FTP is your only interface to the server, there's no better way than what you are already doing.
Except maybe, if your server supports the non-standard -t switch to the LIST/NLST commands, which returns the list sorted by timestamp.
See How to get files in FTP folder sorted by modification time.
That helps if what takes long is the download of the file list (not the initiation of the download): you can request the sorted list, but download only the leading new files, aborting the listing once you find the first already-processed file.
For an example, how to abort download of a file list, see:
Download the first N rows of text file in ftp with ftplib.retrlines
Something like this:
class AbortedListing(Exception):
    pass

def collectNewFiles(s):
    if isProcessedFile(s):  # your code to detect if the file was processed already
        print("We know this file already: " + s + " - aborting")
        raise AbortedListing()
    print("New file: " + s)

try:
    ftp.retrlines("NLST -t /path", collectNewFiles)
except AbortedListing:
    # read/skip response
    ftp.getmultiline()

WinError 32: The process cannot access the file because it is being used by another process

I have written the following code to extract zip files in a directory and delete a particular Excel file in the extracted directory:
def extractZipFiles(dest_directory):
    "This function extracts zip files in the destination directory for further processing"
    fileFullPath = dest_directory + '\\'
    extractedDirList = list()
    for file in os.listdir(dest_directory):
        dn = fileFullPath + file
        dn = re.sub(r'\.zip$', "", fileFullPath + file)  # remove the trailing .zip
        extractedDirList.append(dn)
        zf = zipfile.ZipFile(fileFullPath + file, mode='r')
        zf.extractall(dn)  # extract the contents of that zip to the empty directory
        zf.close()
    return extractedDirList

def removeSelectedReports(extractedDirList):
    "This function removes the selected reports from the extracted directory"
    for i in range(len(extractedDirList)):
        for filename in os.listdir(extractedDirList[i]):
            if filename.startswith("ABC_8"):
                logger.info("File to be removed::" + filename)
                fullPathName = "%s/%s" % (extractedDirList[i], filename)
                os.remove(fullPathName)
    return

extractedDirList = extractZipFiles(attributionRptDestDir)
logger.info("ZIP FILES EXTRACTED:" + str(extractedDirList))
removeSelectedReports(extractedDirList)
I am getting the following intermittent issue even though I have closed the zip file handler.
[WinError 32] The process cannot access the file because it is being used by another process: '\\\\share\\Workingdirectory\\report.20180517.zip'
Can you please help resolve this issue?
You should try to figure out what has the file open. Based on your code, it looks like you are on Microsoft Windows.
I would stop all applications on your workstation, including browsers, run with only a minimum number of apps open, and reproduce the problem. Once reproduced, you can use a tool to list all handles open to a particular file.
A handy utility would be handle.exe, but please use any tool with similar functionality.
Once you find the offending application, you can investigate further why the file is open, and take countermeasures.
I would be careful not to close any application which has the file open, until you know it is safe to do so.
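If the lock turns out to be transient (an antivirus scanner or search indexer briefly holding a freshly written file is a common culprit), a short retry loop is a pragmatic workaround. A sketch, not specific to the asker's code; on Python 3, WinError 32 surfaces as a PermissionError:

import os
import time

def remove_with_retry(path, attempts=5, delay=1.0):
    """Retry deletion in case another process holds the file briefly."""
    for attempt in range(attempts):
        try:
            os.remove(path)
            return
        except PermissionError:
            if attempt == attempts - 1:
                raise  # still locked after all attempts; give up loudly
            time.sleep(delay)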

Read msi with python msilib

I need to read an MSI file and make some queries against it. But it seems that, despite being part of Python's standard library, msilib has poor documentation.
To make queries I have to know the database schema, and I can't find any examples or methods to get it from the file.
Here is my code I'm trying to make work:
import msilib
path = "C:\\Users\\Paul\\Desktop\\my.msi" #I cannot share msi
dbobject = msilib.OpenDatabase(path, msilib.MSIDBOPEN_READONLY)
view = dbobject.OpenView("SELECT FileName FROM File")
rec = view.Execute(None)
r = v.Fetch()
And the rec variable is None. But I can open the MSI file with the InstEd tool and see that the File table is present in the tables list and that there are a lot of records in it.
What am I doing wrong?
Your code is suspect, as the last line will throw a NameError in your sample (v is never defined). So let's ignore that line.
The real problem is that view.Execute returns nothing of use. Under the hood, the MsiViewExecute function only returns success or failure. After you call it, you then need to call view.Fetch, which may be what your last line intended to do.
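A minimal corrected sketch; note that how Fetch signals the end of the result set varies between Python versions (returning None or raising msilib.MSIError), so both are handled here, and the path is a placeholder:

import msilib

path = r"C:\path\to\your.msi"  # placeholder; any readable MSI will do
db = msilib.OpenDatabase(path, msilib.MSIDBOPEN_READONLY)
view = db.OpenView("SELECT FileName FROM File")
view.Execute(None)
while True:
    try:
        rec = view.Fetch()
    except msilib.MSIError:  # some versions raise when no rows remain
        break
    if rec is None:          # others simply return None
        break
    print(rec.GetString(1))  # record fields are 1-based
view.Close()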

Trouble using requests library in python

I am attempting to check for active web site folders against a list that was created using robots.txt (this is for learning security; I'm doing this on a server that I own and control). I am using Python 2.7 on Kali Linux.
My code works if I just do one web address at a time, as I get a proper 200 or 404 response for folders that exist and that don't, respectively.
When I attempt this against the entire list, I get a string of 404 errors. When I print out the actual addresses that the script is creating, everything looks correct.
Here is the code that I am doing:
import requests

attempt = open('info.txt', 'r')
folders = attempt.readlines()
for line in folders:
    host = 'http://10.0.1.66/mutillidae' + line
    attempt = requests.get(host)
    print attempt
This results in a string of 404 errors. If I take the loop out and try each one individually, I get a 200 response back showing that it is up and running.
I have also printed out the addresses using the same loop against the text document that contains the correct folders, and they seem to look fine, which I verified through copying and pasting. I have tried this with a file containing multiple folders and with a single folder listed, and I always get a 404 when attempting to read from the file.
The info.txt file contains the following:
/passwords/
/classes/
/javascript/
/config
/owasp-esapi-php/
/documentation/
Any advice is appreciated.
Lines returned by file.readlines() contain trailing newlines, which you must remove before passing them to requests.get. Replace the statement:
host = 'http://10.0.1.66/mutillidae'+line
with:
host = 'http://10.0.1.66/mutillidae' + line.rstrip()
and the problem will go away.
Note that your code would be easier to read if you refrained from using the same generic variable name, such as attempt, for different purposes in the same scope. Also, one should try to use variable names that reflect their usage; for example, host would be better named url, as it doesn't hold the host name, but the entire URL.
