How can I optimise this recursive file size function? - python

I wrote a script that sums the sizes of the files in the subdirectories of an FTP server:
for dirs in ftp.nlst("."):
    try:
        print("Searching in " + dirs + "...")
        ftp.cwd(dirs)
        for files in ftp.nlst("."):
            size += ftp.size(files)
        ftp.cwd("../")
    except ftplib.error_perm:
        pass
print("Total size of " + serveradd + tvt + " = " + str(size * 10**-9) + " GB")
Is there a quicker way to get the size of the whole directory tree other than summing the file sizes for all directories?

As Alex Hall commented, this is not recursive. I'll address the speeding-up issue, as you can read about recursion from many sources, for example here.
Putting that aside, you didn't mention approximately how many files are in that directory, but you're wasting time by spending a whole round trip for every single file. Instead, ask the server to return the entire listing for a directory at once and sum the file sizes from it:
import re

class DirSizer:
    def __init__(self):
        self.size = 0

    def add_list_entry(self, lst):
        if '<DIR>' not in lst:
            metadata = re.split(r'\s+', lst)
            self.size += int(metadata[2])

ds = DirSizer()
ftp.retrlines('LIST', ds.add_list_entry)  # add_list_entry will be called for every line
print(ds.size)  # => size (shallow, currently) of the directory
Note that:
This should of course be done recursively for every directory in the tree.
Your server might return the list in a different format, so you might need to change either the re.split line or the metadata[2] part.
If your server supports the MLSD FTP command, use that instead, as its output is in a standardized format (see the sketch after these notes).
See here for an explanation of retrlines and the callback.
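For instance, a recursive MLSD-based version might look roughly like this (a minimal sketch, assuming the server supports MLSD and that ftp is an already-logged-in ftplib.FTP object; total_size is just an illustrative name):
import ftplib

def total_size(ftp, path="."):
    # one MLSD round trip per directory instead of one SIZE request per file
    size = 0
    for name, facts in ftp.mlsd(path, facts=["type", "size"]):
        if facts.get("type") == "file":
            size += int(facts.get("size", 0))
        elif facts.get("type") == "dir":
            size += total_size(ftp, path + "/" + name)
    return size

# e.g. print(str(total_size(ftp, ".") * 10**-9) + " GB")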

Related

Sorting files list after creation time, stopped working on network location

I have created a list of files using glob.
I sort this list by creation time.
The code had been working, but it suddenly stopped.
It still works when the files are in a local folder.
It also works on the network location when there are only a few files.
Typically there are more than 5000 files on the network location.
def get_files(folder):
    files_list = glob.glob(folder)
    files_list.sort(key=os.path.getctime)
    return files_list
It generates the files_list with no problem, but it stalls when it comes to the sorting.
I think this is what you want:
def get_files(folder):
    files_list = glob.glob(folder)
    return sorted(files_list, key=os.path.getctime)
getctime on a remote file is a network operation. That means that, under the hood, you have one (large) network access to get the list of files and then one network access per file to get its timestamp.
According to the documentation for scandir, it can save resources by fetching the attributes of each file during the initial folder access. So you should try:
def get_files(folder):
    entries = os.scandir(folder)
    return [entry.name for entry in sorted(entries, key=lambda e: e.stat().st_ctime)]
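One thing to watch out for: glob.glob(folder) was presumably called with a pattern (something like r'\\server\share\*.txt'), whereas os.scandir() takes a plain directory path. A minimal sketch combining scandir with an explicit pattern filter (splitting the argument into a directory and a pattern is my assumption):
import fnmatch
import os

def get_files(folder_pattern):
    # split a glob-style argument such as r"\\server\share\*.txt"
    # into a directory part and a filename pattern
    directory, pattern = os.path.split(folder_pattern)
    with os.scandir(directory) as entries:
        matches = [e for e in entries
                   if e.is_file() and fnmatch.fnmatch(e.name, pattern)]
    # DirEntry.stat() can reuse data fetched during the scan (notably on
    # Windows), avoiding an extra network round trip per file
    matches.sort(key=lambda e: e.stat().st_ctime)
    return [e.path for e in matches]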

Python: list files every 30sec and avoid processing duplicates

Please help me with a challenge I have: I want to list files every 30 seconds and process them (processing means, for example, copying to another location; each file is moved out of the directory once processed). When I list files again after 30 seconds, I want to skip any files that were listed previously (because they were already listed and the for loop is still in progress).
In other words, I want to avoid duplicate file processing while listing the files every 30 seconds.
Here is my code:
def List_files():
    path = 'c:\\projects\\hc2\\'
    files = []
    for r, d, f in os.walk(path):
        for file in f:
            if '.txt' in file:
                files.append(os.path.join(r, file))
    return files

class MyFilethreads:
    def __init__(self, t1):
        self.t1 = t1

    def start_threading(self):
        for file in List_files():
            self.t1 = Thread(target=<FILEPROCESS_FUNCTION>, args=(file,))
            self.t1.start()

t1 = Thread()
myclass = MyFilethreads(t1)

while True:
    myclass.start_threading()
    time.sleep(30)
I have not included my actual file-processing function, since it is big; it is called in the thread as FILEPROCESS_FUNCTION.
Problem:
If the files are large, the processing time can grow (in other words, the for loop takes more than 30 seconds), but I can't reduce the 30-second timer because that case is rare and my Python script takes in hundreds of files every minute.
Hence, I am looking for a way to skip files that were already listed previously, and thereby avoid duplicate file processing.
Please help.
Thanks in advance.
Keep a dictionary (or a set) in your class and insert every file you have seen. Then, in start_threading, check whether the file is already in it and skip it in that case.
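A minimal sketch of that idea, reusing the question's List_files and FILEPROCESS_FUNCTION placeholders (the attribute name self.seen, and using a set rather than a dictionary, are my choices):
from threading import Thread

class MyFilethreads:
    def __init__(self):
        self.seen = set()  # paths already handed off to a worker thread

    def start_threading(self):
        for file in List_files():
            if file in self.seen:
                continue  # listed in a previous pass, skip it
            self.seen.add(file)
            Thread(target=FILEPROCESS_FUNCTION, args=(file,)).start()
Since each file is moved out of the directory once processed, you may also want to remove entries from self.seen after processing so the set does not grow without bound.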

Find out differences between directory listings on time A and time B on FTP

I want to build a script which finds out which files on an FTP server are new and which have already been processed.
For each file on the FTP server we read out the information, parse it and write what we need from it to our database. The files are XML files, but have to be translated.
At the moment I'm using mlsd() to get a list, but this takes up to 4 minutes because there are already 15,000 files in this directory, and there will be more every day.
Instead of comparing this list with an older list which I saved in a text file, I would like to know if there are better possibilities.
Because this task has to run "live", it would end up as a cron job running every 1 or 2 minutes. If this method takes too long, that won't work.
The solution should be either in PHP or Python.
def handle(self, *args, **options):
    ftp = FTP_TLS(host=host)
    ftp.login(user, passwd)
    ftp.prot_p()
    list = ftp.mlsd("...")
    for item in list:
        print(item[0] + " => " + item[1]['modify'])
This code example already takes 4 minutes to run.
I have always tried to avoid browsing a folder to find what could have changed; I prefer setting up a dedicated workflow. When files can only be added (or new versions of existing files), I try to use a workflow where files are added to one directory and then move on to other directories where they are archived. Processing can happen in a directory where files are deleted after being used, or while they are copied/moved from one folder to another.
As a slight extra, I also use a copy/rename pattern: the files are first copied using a temporary name (for example a .t prefix or suffix) and renamed when the copy has ended. This prevents trying to process a file which is not fully copied. It used to matter more when we had slow lines, but race conditions should be avoided as much as possible, and it allows the use of a daemon which polls a folder every 10 seconds or less.
I'm unsure whether this is really relevant here because it could require some refactoring, but it gives bullet-proof solutions.
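For the copy/rename pattern, a rough ftplib sketch from the uploading side might look like this (assuming you control the uploading process; the .t suffix is just the convention mentioned above, and upload_atomically is an illustrative name):
import ftplib

def upload_atomically(ftp, local_path, remote_name):
    # upload under a temporary name so a polling consumer
    # never picks up a half-transferred file
    temp_name = remote_name + ".t"
    with open(local_path, "rb") as f:
        ftp.storbinary("STOR " + temp_name, f)
    # rename only once the transfer has fully completed
    ftp.rename(temp_name, remote_name)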
If FTP is your only interface to the server, there's no better way than what you are already doing.
Except maybe, if your server supports the non-standard -t switch to the LIST/NLST commands, which returns the list sorted by timestamp.
See How to get files in FTP folder sorted by modification time.
And if what takes long is the download of the file list (not the initiation of the download), you can request the sorted list but download only the leading new files, aborting the listing once you find the first already-processed file.
For an example of how to abort the download of a file list, see:
Download the first N rows of text file in ftp with ftplib.retrlines
Something like this:
class AbortedListing(Exception):
    pass

def collectNewFiles(s):
    if isProcessedFile(s):  # your code to detect if the file was processed already
        print("We know this file already: " + s + " - aborting")
        raise AbortedListing()
    print("New file: " + s)

try:
    ftp.retrlines("NLST -t /path", collectNewFiles)
except AbortedListing:
    # read/skip response
    ftp.getmultiline()

Given a filename, go to the next file in a directory

I am writing a method that takes a filename and a path to a directory and returns the next available filename in the directory, or None if there are no files with names that would sort after the file.
There are plenty of questions about how to list all the files in a directory or iterate over them, but I am not sure whether the best solution for finding a single next filename is to take the list that one of those answers generates, find the location of the current file in it, and choose the next element (or None if we're already on the last one).
EDIT: here's my current file-picking code. It's reused from a different part of the project, where it is used to pick a random image from a potentially nested series of directories.
# picks a file from a directory
# if the file is also a directory, pick a file from the new directory
# this might choke up if it encounters a directory only containing invalid files
def pickNestedFile(directory, bad_files):
    file = None
    while file is None or file in bad_files:
        file = random.choice(os.listdir(directory))
        # file = directory + file  # use the full path name
        print "Trying " + file
        if os.path.isdir(os.path.join(directory, file)) == True:
            print "It's a directory!"
            return pickNestedFile(directory + "/" + file, bad_files)
        else:
            return directory + "/" + file
The program I am using this in takes a folder of chat logs and picks a random log, a starting position, and a length. These are then processed into a MOTD-like series of (typically) short log snippets. I need the next-file-picking ability for when the length is unusually long or the starting line is at the end of the file, so that it continues at the top of the next file (a.k.a. wraps around midnight).
I am open to the idea of using a different method to choose the file, since the method above does not give the filename and directory separately, and I'd have to use a listdir and a match to get an index anyway.
You should probably consider rewriting your program to not have to use this. But this would be how you could do it:
import os

def nextFile(filename, directory):
    fileList = os.listdir(directory)
    nextIndex = fileList.index(filename) + 1
    if nextIndex == 0 or nextIndex == len(fileList):
        return None
    return fileList[nextIndex]

print(nextFile("mail", "test"))
I tweaked the accepted answer to allow new files to be added to the directory on the fly and for it to work if a file is deleted or changed or doesn't exist. There are better ways to work with filenames/paths, but the example below keeps it simple. Maybe it's helpful:
import os

def next_file_in_dir(directory, current_file=None):
    file_list = os.listdir(directory)
    next_index = 0
    if current_file in file_list:
        next_index = file_list.index(current_file) + 1
        if next_index >= len(file_list):
            next_index = 0
    return file_list[next_index]

file_name = None
directory = "videos"
user_advanced_to_next = True
while user_advanced_to_next:
    file_name = next_file_in_dir(directory=directory, current_file=file_name)
    user_advanced_to_next = play_video("{}/{}".format(directory, file_name))
finish_and_clean_up()
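Since the original question asks for the next filename in name-sorted order (returning None when nothing sorts after the current file), while os.listdir() returns names in arbitrary order, a sorted/bisect variant might be closer to that intent (a sketch, not the accepted approach; next_file_sorted is an illustrative name):
import bisect
import os

def next_file_sorted(filename, directory):
    names = sorted(os.listdir(directory))
    # position of the first name that sorts strictly after filename
    i = bisect.bisect_right(names, filename)
    return names[i] if i < len(names) else None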

os.walk() caching/speeding up

I have a prototype server[0] that's doing an os.walk()[1] for each query a client[0] makes.
I'm currently looking into ways of:
caching this data in memory,
speeding up queries, and
hopefully allowing for expansion into storing metadata and data persistence later on.
I find SQL complicated for tree structures, so I thought I would get some advice before actually committing to SQLite.
Are there any cross-platform, embeddable or bundle-able non-SQL databases that might be able to handle this kind of data?
I have a small (10k-100k files) list.
I have an extremely small amount of connections (maybe 10-20).
I want to be able to scale to handling metadata as well.
[0] the server and client are actually the same piece of software; this is a P2P application designed to share files over a local trusted network without a main server, using zeroconf for discovery and twisted for pretty much everything else
[1] query time is currently 1.2s with os.walk() on 10,000 files
Here is the related function in my Python code that does the walking:
def populate(self, string):
    for name, sharedir in self.sharedirs.items():
        for root, dirs, files in os.walk(sharedir):
            for dir in dirs:
                if fnmatch.fnmatch(dir, string):
                    yield os.path.join(name, *os.path.join(root, dir)[len(sharedir):].split("/"))
            for file in files:
                if fnmatch.fnmatch(file, string):
                    yield os.path.join(name, *os.path.join(root, file)[len(sharedir):].split("/"))
You don't need to persist a tree structure -- in fact, your code is busily dismantling the natural tree structure of the directory tree into a linear sequence, so why would you want to restart from a tree next time?
Looks like what you need is just an ordered sequence of rows:
i | X | result of os.path.join for X
where X, a string, names either a file or directory (you treat them just the same), i is a progressively incrementing integer (to preserve the order), and the result column, also a string, is the result of os.path.join(name, *os.path.join(root, &c.
This is perfectly easy to put in a SQL table, of course!
To create the table the first time, just remove the if fnmatch.fnmatch guards (and the string argument) from your populate function, yield the dir or file before the os.path.join result, and use cursor.executemany to save the enumerate of the call (or use a self-incrementing column, your pick). To use the table, populate becomes essentially:
select result from thetable where X LIKE '%foo%' order by i
where string is foo.
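A minimal sqlite3 sketch of that table (the table and column names follow the answer; unguarded_walk stands for your populate function with the fnmatch guards removed and is an assumption):
import sqlite3

conn = sqlite3.connect("walk_cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS thetable ("
             "i INTEGER PRIMARY KEY AUTOINCREMENT, X TEXT, result TEXT)")

def rebuild(rows):
    # rows: an iterable of (X, result) pairs yielded by the unguarded walk
    conn.execute("DELETE FROM thetable")
    conn.executemany("INSERT INTO thetable (X, result) VALUES (?, ?)", rows)
    conn.commit()

def populate(string):
    cur = conn.execute(
        "SELECT result FROM thetable WHERE X LIKE ? ORDER BY i",
        ("%" + string + "%",))
    for (result,) in cur:
        yield result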
I misunderstood the question at first, but I think I have a solution now (and sufficiently different from my other answer to warrant a new one). Basically, you do the normal query the first time you run walk on a directory, but you store the yielded values. The second time around, you just yield those stored values. I've wrapped the os.walk() call because it's short, but you could just as easily wrap your generator as a whole.
cache = {}

def os_walk_cache(dir):
    if dir in cache:
        # replay the results stored on a previous walk
        for x in cache[dir]:
            yield x
    else:
        cache[dir] = []
        for x in os.walk(dir):
            cache[dir].append(x)
            yield x
I'm not sure of your memory requirements, but you may want to consider periodically cleaning out cache.
Have you looked at MongoDB? What about mod_python? mod_python should allow you to do your os.walk() and just store the data in Python data structures, since the script is persistent between connections.
