After creating a zip file in Python 2, how can I get the details of the zip archive itself? This is not about the files it contains, but about the zip file as a whole.
On Linux, opening the zip file with the 'Archive Manager' displays these properties:
"Last modified, Archive size, Content size, Compression ratio, Number of files"
How can I get those properties from within a Python script?
This information is not available in the ZIP archive as a single structure you can access. I am not sure how Archive Manager implements it and I do not have it around to check, but I presume it is a combination of a stat of the archive file itself, to retrieve its last modification time and size, e.g. for archive ar.zip:
os.stat('ar.zip').st_mtime # last modification of the archive
os.stat('ar.zip').st_size # size of the archive
And iterating over the archive members' information for the rest. For a ZIP file this should not be prohibitively expensive, since there is a central directory at the end of the archive pointing to all entries, so the whole archive does not have to be read.
For instance:
import zipfile

z = zipfile.ZipFile('ar.zip')  # the archive to inspect
osize = csize = cnt = 0
for item in z.infolist():
    osize += item.file_size       # uncompressed size of this entry
    csize += item.compress_size   # compressed size of this entry
    cnt += 1
will give you osize, the original (uncompressed) size of all files, csize, the compressed size in the archive, and cnt, the number of entries in the archive.
With that, you can get the compression ratio by dividing csize by osize, with one caveat. Since you mention/flag using Python 2.7, do not forget to convert (at least) one of them to float to force the result to be a float as well: ratio = float(csize) / osize. On Python 3, / would produce a float in any case.
You can of course wrap all of that into a convenient function you can pass an open zip archive to:
import os

def zip_details(archive_obj):
    archive_info = {'original_size': 0,
                    'compressed_size': 0,
                    'total_entries': 0}
    # stat the underlying file object for the archive's own size and mtime
    archive_info['total_size'] = os.fstat(archive_obj.fp.fileno()).st_size
    archive_info['last_change'] = os.fstat(archive_obj.fp.fileno()).st_mtime
    for item in archive_obj.infolist():
        archive_info['original_size'] += item.file_size
        archive_info['compressed_size'] += item.compress_size
        archive_info['total_entries'] += 1
    archive_info['compression_ratio'] = (float(archive_info['compressed_size'])
                                         / archive_info['original_size'])
    return archive_info
and get a dictionary with the desired details in return. Or you could subclass zipfile.ZipFile and add this functionality as a method.
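For instance, a minimal sketch of that subclassing approach (the class name DetailedZipFile and the details() method name are purely illustrative, not part of the standard library):

import os
import zipfile

class DetailedZipFile(zipfile.ZipFile):
    """ZipFile with a convenience method returning archive-level details."""
    def details(self):
        info = {'original_size': 0, 'compressed_size': 0, 'total_entries': 0}
        st = os.fstat(self.fp.fileno())          # stat of the archive file itself
        info['total_size'] = st.st_size
        info['last_change'] = st.st_mtime
        for item in self.infolist():
            info['original_size'] += item.file_size
            info['compressed_size'] += item.compress_size
            info['total_entries'] += 1
        if info['original_size']:                 # avoid division by zero for empty archives
            info['compression_ratio'] = float(info['compressed_size']) / info['original_size']
        return info

# usage sketch:
# z = DetailedZipFile('ar.zip')
# print(z.details())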
You have expressed a limitation in the question title to exclude using the content, but I am afraid that condition is impossible to fulfill for an existing archive, except for the overall size and the time of last modification. Everything else can only be learned by looking into the archive itself: the file count from the central directory at its end, and further details from the information stored on individual files. This is not Python specific and holds for any tool or language used.
As long as you are working in 'bash' (e.g. on Linux), here is a simple method to zip a given list of files/directories and capture the zip archive properties:
import os
bashCommand = "zip -r -v" \
              " " + "./my-extension.zip" \
              " " + "file1 file2 fileN dir1 dir2 dirN" \
              " " + "| grep 'total bytes=' > zip.log"
os.system(bashCommand)
Note: Sure, this can be executed directly at the OS prompt, but the intent is to include the call in a bigger Python script.
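If you want the properties back in the Python process rather than in a log file, a possible variant is to capture the command's output with subprocess instead of os.system. This is only a sketch, assuming the Info-ZIP zip binary is used and its verbose mode prints the 'total bytes=' summary line that the grep above relies on; file names are placeholders:

import subprocess

# hypothetical file list, adjust to your own files
cmd = "zip -r -v ./my-extension.zip file1 file2 dir1 | grep 'total bytes='"
output = subprocess.check_output(cmd, shell=True)  # Python 2.7+
print(output)  # the summary line with total and compressed byte counts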
Related
I want to build a script which finds out which files on an FTP server are new and which are already processed.
For each file on the FTP we read out the information, parse it and write the information we need from it to our database. The files are xml-files, but have to be translated.
At the moment I'm using mlsd() to get a list, but this takes up to 4 minutes because there are already 15,000 files in this directory, and there will be more every day.
Instead of comparing this list with an older list saved in a text file, I would like to know if there are better possibilities.
Because this task has to run "live", it would end up in a cronjob running every 1 or 2 minutes. If this method takes too long, that won't work.
The solution should be either in PHP or Python.
from ftplib import FTP_TLS

def handle(self, *args, **options):
    ftp = FTP_TLS(host=host)
    ftp.login(user, passwd)
    ftp.prot_p()
    list = ftp.mlsd("...")
    for item in list:
        print(item[0] + " => " + item[1]['modify'])
This code example already takes 4 minutes to run.
I have always tried to avoid browsing a folder to find out what could have changed, preferring to set up a dedicated workflow instead. When files can only be added (or new versions of existing files), I try to use a workflow where files are added to one directory and then moved to other directories where they are archived. Processing can occur in a directory where files are deleted after being used, or where they are copied/moved from one folder to another.
As a nice extra, I also use a copy/rename pattern: the files are first copied under a temporary name (for example with a .t prefix or suffix) and renamed once the copy has finished. This prevents trying to process a file which has not been fully copied yet. Granted, this used to matter more when we had slow lines, but race conditions should be avoided as much as possible, and it allows a daemon to poll a folder every 10 seconds or less. An example of this pattern is sketched below.
I am unsure whether this is really relevant here because it could require some refactoring, but it gives bulletproof solutions.
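As a minimal sketch of that copy/rename pattern on the uploading side (the .t suffix and the upload_atomic name are just illustrative), using ftplib:

from ftplib import FTP

def upload_atomic(ftp, local_path, remote_name):
    """Upload to a temporary name, then rename so readers never see a partial file."""
    tmp_name = remote_name + ".t"          # temporary name during transfer
    with open(local_path, "rb") as fh:
        ftp.storbinary("STOR " + tmp_name, fh)
    ftp.rename(tmp_name, remote_name)      # switch to the final name only when complete

# usage sketch:
# ftp = FTP("ftp.example.com")
# ftp.login("user", "password")
# upload_atomic(ftp, "data.xml", "data.xml")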
If FTP is your only interface to the server, there's no better way than what you are already doing.
Except maybe, if your server supports the non-standard -t switch to the LIST/NLST commands, which returns the list sorted by timestamp.
See How to get files in FTP folder sorted by modification time.
And if what takes long is the download of the file list (not the initiation of the download), you can request the sorted list but download only the leading new files, aborting the listing once you find the first already-processed file.
For an example, how to abort download of a file list, see:
Download the first N rows of text file in ftp with ftplib.retrlines
Something like this:
class AbortedListing(Exception):
    pass

def collectNewFiles(s):
    if isProcessedFile(s):  # your code to detect if the file was processed already
        print("We know this file already: " + s + " - aborting")
        raise AbortedListing()
    print("New file: " + s)

try:
    ftp.retrlines("NLST -t /path", collectNewFiles)
except AbortedListing:
    # read/skip response
    ftp.getmultiline()
What I'm trying to do is write a script that deletes every second (or third) file in a folder. I have batch renamed the files so that their names increment like 0.jpg, 1.jpg, 2.jpg... n.jpg and so on. What I had in mind for the every-second-file scenario was to use something like "if % 2 == 0", but I couldn't figure out how to actually remove the files from the list object and, obviously, from my folder.
Below is the piece of NON-WORKING code. I guess it is not working because file_name is a str.
import os

os.chdir('path_to_my_folder')
for f in os.listdir():
    file_name, file_ext = os.path.splitext(f)
    print(file_name)
    if file_name % 2 == 0:
        os.remove();
Yes, that's your problem: you're trying to use an integer operation on a string. Simply convert:
if int(file_name)%2 == 0:
... that should fix your current problem.
Your filename is a string, like '0.jpg', and you can’t % 2 a string.1
What you want to do is pull the number out of the filename, like this:
name, ext = os.path.splitext(filename)
number = int(name)
And now, you can use number % 2.
(Of course this only works if every file in the directory is named in the format N.jpg, where N is an integer; otherwise you’ll get a ValueError.)
1. Actually, you can do that, it just doesn’t do what you want. For strings, % means printf-style formatting, so filename % 2 means “find the %d or similar format spec in filename and replace it with a string version of 2”.
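A quick illustration of that footnote in the interactive interpreter:

>>> "img%d.jpg" % 2     # a format spec is present, so it gets substituted
'img2.jpg'
>>> "0.jpg" % 2         # no format spec in the string
TypeError: not all arguments converted during string formatting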
Thanks a lot for the answers! I have amended the code and now it looks like this:
import os

os.chdir('path_to_the_folder')
for f in os.listdir():
    name, ext = os.path.splitext(f)
    number = int(name)
    if number % 2 == 0:
        os.remove()
It doesn't give an error, but it also doesn't remove/delete the files from the folder. What I want to achieve in the end is that every file whose name is divisible by two is removed, so that only 1.jpg, 3.jpg, 5.jpg and so on remain.
Thanks so much for your time.
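For reference, the remaining issue is most likely that os.remove() is called without an argument; it needs the path of the file to delete. A minimal corrected sketch (assuming every file in the folder is named N.jpg with an integer N):

import os

os.chdir('path_to_the_folder')
for f in os.listdir():
    name, ext = os.path.splitext(f)
    number = int(name)
    if number % 2 == 0:
        os.remove(f)  # pass the filename so the file is actually deleted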
A non-Python method, but sharing for future reference:
cd path_to_your_folder
mkdir odd; mv *[13579].png odd
also works on OS X. This reverses the file selection, but that can be corrected easily. Still, I want to manage this within Python!
I wrote a script that sums the size of the files in subdirectories on an FTP server:
import ftplib

size = 0
for dirs in ftp.nlst("."):
    try:
        print("Searching in " + dirs + "...")
        ftp.cwd(dirs)
        for files in ftp.nlst("."):
            size += ftp.size(files)
        ftp.cwd("../")
    except ftplib.error_perm:
        pass
print("Total size of " + serveradd + tvt + " = " + str(size * 10**-9) + " GB")
Is there a quicker way to get the size of the whole directory tree other than summing the file sizes for all directories?
As Alex Hall commented, this is not recursive. I'll address the speeding-up issue, as you can read about recursion from many sources, for example here.
Putting that aside, you didn't mention approximately how many files are in that directory, but you're wasting time by spending a whole round trip for every file in it. Instead, ask the server to return the entire listing for the directory and sum the file sizes:
import re

class DirSizer:
    def __init__(self):
        self.size = 0

    def add_list_entry(self, lst):
        if '<DIR>' not in lst:
            metadata = re.split(r'\s+', lst)
            self.size += int(metadata[2])

ds = DirSizer()
ftp.retrlines('LIST', ds.add_list_entry)  # add_list_entry will be called for every line
print(ds.size)  # => size (shallow, currently) of the directory
Note that:
This should of course be done recursively for every directory in the tree.
Your server might return the list in a different format, so you might need to change either the re.split line or the metadata[2] part.
If your server supports the MLSD FTP command, use that instead, as it'll be in a standardized format.
See here for an explanation of retrlines and the callback.
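If the server does support MLSD, a possible sketch using ftplib's mlsd() (Python 3.3+), summing the 'size' facts of plain files from a single listing, could look like this. It assumes the server actually reports the size fact, and the recursion into subdirectories is still up to you:

from ftplib import FTP

def dir_size_mlsd(ftp, path):
    """Shallow size of `path`, using one MLSD listing instead of one SIZE call per file."""
    total = 0
    for name, facts in ftp.mlsd(path, facts=["type", "size"]):
        if facts.get("type") == "file" and "size" in facts:
            total += int(facts["size"])
    return total

# usage sketch:
# ftp = FTP("ftp.example.com")
# ftp.login()
# print(dir_size_mlsd(ftp, "."))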
After reading the man page on filtering rules and looking here: Using Rsync filter to include/exclude files
I don't understand why the code below doesn't work.
import subprocess, os
from ftplib import FTP

ftp_site = 'ftp.ncbi.nlm.nih.gov'
ftp = FTP(ftp_site)
ftp.login()
ftp.cwd('genomes/genbank/bacteria')

dirs = ftp.nlst()
for organism in dirs:
    latest = os.path.join(organism, "latest_assembly_versions")
    for path in ftp.nlst(latest):
        accession = path.split("/")[-1]
        fasta = accession + "_genomic.fna.gz"
        subprocess.call(['rsync',
                         '--recursive',
                         '--copy-links',
                         #'--dry-run',
                         '-vv',
                         '-f=+ ' + accession + '/*',
                         '-f=+ ' + fasta,
                         '-f=- *',
                         'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/' + latest,
                         '--log-file=scratch/test_dir/log.txt',
                         'scratch/' + organism])
I also tried '--exclude=*[^'+fasta+']' to exclude files that don't match fasta, instead of '-f=- *'.
For each directory path within latest/*, I want the file that matches fasta exactly. There will always be exactly one file matching fasta in the directory latest/path.
EDIT: I am testing this with rsync version 3.1.0 and have seen incompatibility issues with earlier versions.
Here is a link to working code that you should be able to paste into a Python interpreter to get the results of a "dry run", which won't download anything onto your machine: http://pastebin.com/0reVKMCg. It gets EVERYTHING under 'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/'+latest, which is not what I want, and if I run that script with '-f=- *' uncommented, it doesn't get anything, which seems to contradict the answer here: Using Rsync filter to include/exclude files.
This part of the rsync man page contained the info I needed to solve my problem:
Note that, when using the --recursive (-r) option (which is implied by -a), every subcomponent of every path is visited from the top down, so include/exclude patterns get applied recursively to each subcomponent's full name (e.g. to include "/foo/bar/baz" the subcomponents "/foo" and "/foo/bar" must not be excluded). The exclude patterns actually short-circuit the directory traversal stage when rsync finds the files to send. If a pattern excludes a particular parent directory, it can render a deeper include pattern ineffectual because rsync did not descend through that excluded section of the hierarchy. This is particularly important when using a trailing '*' rule. For instance, this won't work:

+ /some/path/this-file-will-not-be-found
+ /file-is-included
- *

This fails because the parent directory "some" is excluded by the '*' rule, so rsync never visits any of the files in the "some" or "some/path" directories. One solution is to ask for all directories in the hierarchy to be included by using a single rule: "+ */" (put it somewhere before the "- *" rule), and perhaps use the --prune-empty-dirs option. Another solution is to add specific include rules for all the parent dirs that need to be visited. For instance, this set of rules works fine:

+ /some/
+ /some/path/
+ /some/path/this-file-is-found
+ /file-also-included
- *
This helped me write the following code:
import os
import subprocess
from ftplib import FTP

def get_fastas(local_mirror="scratch/ncbi", bacteria="Escherichia_coli"):
    ftp_site = 'ftp.ncbi.nlm.nih.gov'
    ftp = FTP(ftp_site)
    ftp.login()
    ftp.cwd('genomes/genbank/bacteria')

    rsync_log = os.path.join(local_mirror, "rsync_log.txt")
    latest = os.path.join(bacteria, 'latest_assembly_versions')
    for parent in ftp.nlst(latest)[0:2]:
        accession = parent.split("/")[-1]
        fasta = accession + "_genomic.fna.gz"
        organism_dir = os.path.join(local_mirror, bacteria)
        subprocess.call(['rsync',
                         '--copy-links',
                         '--recursive',
                         '--itemize-changes',
                         '--prune-empty-dirs',
                         '-f=+ ' + accession,
                         '-f=+ ' + fasta,
                         '--exclude=*',
                         'ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/' + parent,
                         organism_dir])
It turns out that '-f=+ '+accession doesn't work with a * after a trailing /, although it does work with just a trailing / and no *.
I'm writing a FUSE filesystem in Python that should interact with Amazon S3 (basically treating S3 buckets as filesystems), and I'm facing some issues with my readdir implementation.
First of all, I'd like to mention that I'm relatively new to Python and FUSE (more of a Java guy, using Python here because of the simplicity of the FUSE binding), so this may all just be a stupid beginner's mistake.
Here's what I currently have:
def readdir(self, path, fh):
    s3Path = S3fsUtils.toS3Path(path)  # removes prefixed slash - boto3 can't handle that in key names
    print("Reading dir: " + str(path))
    retVal = [".", ".."]
    for s3Obj in self.bucket.objects.all():  # for now list all objects in bucket
        tmp = str(s3Obj.key)
        if tmp.startswith(s3Path):  # only return things below current path
            print("READDIR: appending to output: " + tmp)
            retVal.append(tmp)
    return retVal  # return directory contents as a list of strings
Here's what happens when running "ls -l" (The filesystem is mounted into "/tmp/fusetest"):
root@michael-dev:/tmp/fusetest# ls -l
ls: reading directory .: Input/output error
total 0
root@michael-dev:/tmp/fusetest#
... and here's the console output of the filesystem:
(found entries are a few "directories", i.e. S3 keys with no data behind them)
Reading dir: /
READDIR: appending to output: blabla/
READDIR: appending to output: blablubb/
READDIR: appending to output: haha/
READDIR: appending to output: hahaha/
READDIR: appending to output: huhu/
READDIR: appending to output: new_folder/
Releasing dir: /
I'm guessing the problem is that I return a list of strings rather than some more "C-struct-like" thing...
I found this question which is also about problems with readdir, there a class "fuse.Direntry" is used. However, in my fuse.py (fusepy version = 2.0.2), I cannot find any class like that, the closest in name I found was "fuse_file_info" which doesn't really look useful for the task at hand.
So what should readdir return and where does the i/o error come from?
Ok, this turned out to be just what I expected - a stupid mistake...
Since Amazon S3 represents folders as empty files whose names end with a slash, my file listings contained a lot of entries with slashes at the end.
It turned out that FUSE cannot handle that, causing the readdir operation to fail.
Removing the trailing slashes from the file names does the trick.
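Based on that, a minimal sketch of the fix inside readdir (just stripping a trailing slash before appending; the surrounding code is the same as in the question, and where exactly you strip is up to you):

for s3Obj in self.bucket.objects.all():
    tmp = str(s3Obj.key)
    if tmp.startswith(s3Path):
        entry = tmp.rstrip("/")  # FUSE rejects entry names ending in '/', so drop the trailing slash
        print("READDIR: appending to output: " + entry)
        retVal.append(entry)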