Python tarfile progress

Python tarfile progress - python

Is there any library to show progress when adding files to a tar archive in python or alternativly would be be possible to extend the functionality of the tarfile module to do this?
In an ideal world I would like to show the overall progress of the tar creation as well as an ETA as to when it will be complete.
Any help on this would be really appreciated.

Unfortunately it doesn't look like there is an easy way to get byte by byte numbers.
Are you adding really large files to this tar file? If not, I would update progress on a file-by-file basis so that as files are added to the tar, the progress is updated based on the size of each file.
Supposing that all your filenames are in the variable toadd and tarfile is a TarFile object. How about,
from itertools import imap
from operator import attrgetter
# you may want to change this depending on how you want to update the
# file info for your tarobjs
tarobjs = imap(tarfile.getattrinfo, toadd)
total = sum(imap(attrgetter('size'), tarobjs))
complete = 0.0
for tarobj in tarobjs:
sys.stdout.write("\rPercent Complete: {0:2.0d}%".format(complete))
tarfile.add(tarobj)
complete += tarobj.size / total * 100
sys.stdout.write("\rPercent Complete: {0:2.0d}%\n".format(complete))
sys.stdout.write("Job Done!")

Find or write a file-like that wraps a real file which provides progress reporting, and pass it to Tarfile.addfile() so that you can know how many bytes have been requested for inclusion in the archive. You may have to use/implement throttling in case Tarfile tries to read the whole file at once.

I have recently written a wrapper library that provides a progress callback. Have a look at it on git hub:
https://github.com/thomaspurchas/tarfile-Progress-Reporter
Feel free to ask for help if you need it.

Seems like you can use the filter parameter of tarfile.add()
with tarfile.open(<tarball path>, 'w') as tarball:
tarball.add(<some file>, filter = my_filter)
def my_filter(tarinfo):
#increment some count
#add tarinfo.size to some byte counter
return tarinfo
All information you can get from a TarInfo object is available to you.

How are you adding files to the tar file? Is is through "add" with recursive=True? You could build the list of files yourself and call "add" one-by-one, showing the progress as you go. If you're building from a stream/file then it looks like you could wrap that fileobj to see the read status and pass that into addfile.
It does not look like you will need to modify tarfile.py at all.

Related

How to monitor changes to a particular website CSS?

I wish to monitor the CSS of a few websites (these websites aren't my own) for changes and receive a notification of some sort when they do. If you could please share any experience you've had with this to point me in the right direction for how to code it I'd appreciate it greatly.
I'd like this script/app to notify a Slack group on change, which I assume will require a webhook.
Not asking for code, just any advice about particular APIs and other tools that may be of benefit.

I would suggest a modification of tschaefermedia's answer.
Crawl website for .css files, save.
Take an md5 of each file.
Then compare md5 of the new file will the old file.
If the md5 is different then the file changed.
Below is a function to take md5 of large files.
def md5(file_name):
# make a md5 hash object
hash_md5 = hashlib.md5()
# open file as binary and read only
with open(file_name, 'rb') as f:
i = 0
# read 4096 bytes at a time and take the md5 hash of it and add it to the hash total
# b converts string literal to bytes
for chunk in iter(lambda: f.read(4096), b''):
i += 1
# get sum of md5 hashes
# m.update(a); m.update(b) is equivalent to m.update(a+b)
hash_md5.update(chunk)
# check for correct number of iterations
file_size = os.path.getsize(file_name)
expected_i = int(math.ceil(float(file_size) / float(4096)))
correct_i = i == expected_i
# check if md5 correct
md5_chunk_file = hash_md5.hexdigest()
return md5_chunk_file

I would suggest using Github in your workflow. That gives you a good idea of changes and ways to revert back to older versions.

One possible solution:
Crawl website for .css files, save change dates and/or filesize.
After each crawl compare information and if changes detected use slack API to notify. I haven't worked with slack that, for this part of the solution maybe someone else can give advice.

Detecting duplicate files with Python

I'm trying to write a script in Python for sorting through files (photos, videos), checking metadata of each, finding and moving all duplicates to a separate directory. Got stuck with the metadata checking part. Tried os.stat - doesn't return True for duplicate files. Ideally, I should be able to do something like :
if os.stat("original.jpg")== os.stat("duplicate.jpg"):
shutil.copy("duplicate.jpg","C:\\Duplicate Folder")
Pointers anyone?

There's a few things you can do. You can compare the contents or hash of each file or you can check a few select properties from the os.stat result, ex
def is_duplicate(file1, file2):
stat1, stat2 = os.stat(file1), os.stat(file2)
return stat1.st_size==stat2.st_size and stat1.st_mtime==stat2.st_mtime

A basic loop using a set to keep track of already encountered files:
import glob
import hashlib
uniq = set()
for fname in glob.glob('*.txt'):
with open(fname,"rb") as f:
sig = hashlib.sha256(f.read()).digest()
if sig not in uniq:
uniq.add(sig)
print fname
else:
print fname, " (duplicate)"
Please note as with any hash function there is a slight chance of collision. That is two different files having the same digest. Depending your needs, this is acceptable of not.
According to Thomas Pornin in an other answer :
"For instance, with SHA-256 (n=256) and one billion messages (p=109) then the probability [of collision] is about 4.3*10-60."
Given your need, if you have to check for additional properties in order to identify "true" duplicates, change the sig = ....line to whatever suits you. For example, if you need to check for "same content" and "same owner" (st_uidas returned by os.stat()), write:
sig = ( hashlib.sha256(f.read()).digest(),
os.stat(fname).st_uid )

If two files have the same md5 they are exact duplicates.
from hashlib import md5
with open(file1, "r") as original:
original_md5 = md5(original.read()).hexdigest()
with open(file2, "r") as duplicate:
duplicate_md5 = md5(duplicate.read()).hexdigest()
if original_md5 == duplicate_md5:
do_stuff()
In your example you're using jpg file in that case you want to call the method open with its second argument equals to rb. For that see the documentation for open

os.stat offers information about some file's metadata and features, including the creation time. That is not a good approach in order to find out if two files are the same.
For instance: Two files can be the same and have different time creation. Hence, comparing stats will fail here. Sylvain Leroux approach is the best one when combining performance and accuracy, since it is very rare two different files has the same hash.
So, unless you have an incredibly large amount of data and a repeated file will cause a system fatality, this is the way to go.
If that your case (it not seems to be), well ... the only way you can be 100% sure two file are the same is iterating and perform a comparison byte per byte.

python: edit ISO file directly

Is it possible to take an ISO file and edit a file in it directly, i.e. not by unpacking it, changing the file, and repacking it?
It is possible to do 1. from Python? How would I do it?

You can use for listing and extracting, I tested the first.
https://github.com/barneygale/iso9660/blob/master/iso9660.py
import iso9660
cd = iso9660.ISO9660("/Users/murat/Downloads/VisualStudio6Enterprise.ISO")
for path in cd.tree():
print path
https://github.com/barneygale/isoparser
import isoparser
iso = isoparser.parse("http://www.microsoft.com/linux.iso")
print iso.record("boot", "grub").children
print iso.record("boot", "grub", "grub.cfg").content

Have you seen Hachoir, a Python library to "view and edit a binary stream field by field"? I haven't had a need to try it myself, but ISO 9660 is listed as a supported parser format.

PyCdlib can read and edit ISO-9660 files, as well as Joliet, Rock Ridge, UDF, and El Torito extensions. It has a bunch of detailed examples in its documentation, including one showing how to edit a file in-place. At the time of writing, it cannot add or remove files, or edit directories. However, it is still actively maintained, in contrast to the libraries linked in older answers.

Of course, as with any file.
It can be done with open/read/write/seek/tell/close operations on a file. Pack/unpack the data with struct/ctypes. It would require serious knowledge of the contents of ISO, but I presume you already know what to do. If you're lucky you can try using mmap - the interface to file contents string-like.

Saving a temporary file

I'm using xlwt in python to create a Excel spreadsheet. You could interchange this for almost anything else that generates a file; it's what I want to do with the file that's important.
from xlwt import *
w = Workbook()
#... do something
w.save('filename.xls')
I want to I have two use cases for the file: I stream it out to the user's browser or I attach it to an email. In both cases the file only needs to exist the duration of the web request that generates it.
What I'm getting at, the reason for starting this thread is saving to a real file on the filesystem has its own hurdles (stopping overwriting, cleaning up the file once done). Is there somewhere I could "save" it where it lives only in memory and only for the duration of the request?

cStringIO
(or mmap if it should be mutable)

Generalising the answer, as you suggested: If the "anything else that generates a file" won't accept a file-like object as well as a filepath, then you can reduce the hassle by using tempfile.NamedTemporaryFile

abstracting the conversion between id3 tags, m4a tags, flac tags

I'm looking for a resource in python or bash that will make it easy to take, for example, mp3 file X and m4a file Y and say "copy X's tags to Y".
Python's "mutagen" module is great for manupulating tags in general, but there's no abstract concept of "artist field" that spans different types of tag; I want a library that handles all the fiddly bits and knows fieldname equivalences. For things not all tag systems can express, I'm okay with information being lost or best-guessed.
(Use case: I encode lossless files to mp3, then go use the mp3s for listening. Every month or so, I want to be able to update the 'master' lossless files with whatever tag changes I've made to the mp3s. I'm tired of stubbing my toes on implementation differences among formats.)

I needed this exact thing, and I, too, realized quickly that mutagen is not a distant enough abstraction to do this kind of thing. Fortunately, the authors of mutagen needed it for their media player QuodLibet.
I had to dig through the QuodLibet source to find out how to use it, but once I understood it, I wrote a utility called sequitur which is intended to be a command line equivalent to ExFalso (QuodLibet's tagging component). It uses this abstraction mechanism and provides some added abstraction and functionality.
If you want to check out the source, here's a link to the latest tarball. The package is actually a set of three command line scripts and a module for interfacing with QL. If you want to install the whole thing, you can use:
easy_install QLCLI
One thing to keep in mind about exfalso/quodlibet (and consequently sequitur) is that they actually implement audio metadata properly, which means that all tags support multiple values (unless the file type prohibits it, which there aren't many that do). So, doing something like:
print qllib.AudioFile('foo.mp3')['artist']
Will not output a single string, but will output a list of strings like:
[u'The First Artist', u'The Second Artist']
The way you might use it to copy tags would be something like:
import os.path
import qllib # this is the module that comes with QLCLI
def update_tags(mp3_fn, flac_fn):
mp3 = qllib.AudioFile(mp3_fn)
flac = qllib.AudioFile(flac_fn)
# you can iterate over the tag names
# they will be the same for all file types
for tag_name in mp3:
flac[tag_name] = mp3[tag_name]
flac.write()
mp3_filenames = ['foo.mp3', 'bar.mp3', 'baz.mp3']
for mp3_fn in mp3_filenames:
flac_fn = os.path.splitext(mp3_fn)[0] + '.flac'
if os.path.getmtime(mp3_fn) != os.path.getmtime(flac_fn):
update_tags(mp3_fn, flac_fn)

I have a bash script that does exactly that, atwat-tagger. It supports flac, mp3, ogg and mp4 files.
usage: `atwat-tagger.sh inputfile.mp3 outputfile.ogg`
I know your project is already finished, but somebody who finds this page through a search engine might find it useful.

Here's some example code, a script that I wrote to copy tags between
files using Quod Libet's music format classes (not mutagen's!). To run
it, just do copytags.py src1 dest1 src2 dest2 src3 dest3, and it
will copy the tags in sec1 to dest1 (after deleting any existing tags
on dest1!), and so on. Note the blacklist, which you should tweak to
your own preference. The blacklist will not only prevent certain tags
from being copied, it will also prevent them from being clobbered in
the destination file.
To be clear, Quod Libet's format-agnostic tagging is not a feature of mutagen; it is implemented on top of mutagen. So if you want format-agnostic tagging, you need to use quodlibet.formats.MusicFile to open your files instead of mutagen.File.
Code can now be found here: https://github.com/DarwinAwardWinner/copytags
If you also want to do transcoding at the same time, use this: https://github.com/DarwinAwardWinner/transfercoder
One critical detail for me was that Quod Libet's music format classes
expect QL's configuration to be loaded, hence the config.init line in my
script. Without that, I get all sorts of errors when loading or saving
files.
I have tested this script for copying between flac, ogg, and mp3, with "standard" tags, as well as arbitrary tags. It has worked perfectly so far.
As for the reason that I didn't use QLLib, it didn't work for me. I suspect it was getting the same config-related errors as I was, but was silently ignoring them and simply failing to write tags.

You can just write a simple app with a mapping of each tag name in each format to an "abstract tag" type, and then its easy to convert from one to the other. You don't even have to know all available types - just those that you are interested in.
Seems to me like a weekend-project type of time investment, possibly less. Have fun, and I won't mind taking a peek at your implementation and even using it - if you won't mind releasing it of course :-) .

There's also tagpy, which seems to work well.

Since the other solutions have mostly fallen off the net, here is what I came up, based on the python mediafile library (python3-mediafile in Debian GNU/Linux).
#!/usr/bin/python3
import sys
from mediafile import MediaFile
src = MediaFile (sys.argv [1])
dst = MediaFile (sys.argv [2])
for field in src.fields ():
try:
setattr (dst, field, getattr (src, field))
except:
pass
dst.save ()
Usage: mediafile-mergetags srcfile dstfile
It copies (merges) all tags from srcfile into dstfile, and seems to work properly with flac, opus, mp3 and so on, including copying album art.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.