How can I mount a tar.gz archive transparently with Python?
I have a tar.gz archive whose contents have to be read by an external program. The contents will only be needed temporarily. I could just unpack the archive to a temporary folder, point my external program there to read it, and delete the temp folder again afterwards. However, the archives may be large (>1 GB when extracted), so unpacking them would take up a lot of disk space. My server is rather weak in terms of disk performance and I cannot waste space freely, but it does have plenty of RAM and CPU power.
That's why I want to try to mount the archive transparently without unpacking it entirely. I came across archivemount which seems to do exactly what I want. Is there a way to do what archivemount does in pure Python? No subprocess.call "solutions", please. It should run on 64-bit Linux.
I believe there should be a smart way to use tarfile to access the archive's contents and fusepy to create a user-space file system that exposes those contents. Has anyone already put these pieces together? Any ideas?
If you think that this is not a good idea, please post relevant comments. If you know what is better, please comment.
As of version 0.3.1 of my ratarmount module, you can use it directly, or take a look at its source, to mount a .tar.gz in Python. The gzip seeking support comes from the indexed_gzip dependency. Ratarmount itself is based on tarindexer, which implements the idea of using tarfile to get member offsets and then seeking to them; ratarmount adds a FUSE layer on top, among other usability and performance features.
You can install ratarmount from PyPI:
pip3 install --user ratarmount
and then call its command-line interface directly from Python like so:
import ratarmount

ratarmount.cli( [ '--help' ] )                     # show the available options
ratarmount.cli( [ pathToTar, pathToMountPoint ] )  # mount the archive
The heart of the module is, as you already surmised, tarfile, which is used to iterate over all TarInfo objects and build a list of (filepath, offset, size) entries. That list can then be used to seek directly to an offset in the raw tar file and simply read the next size bytes. This works because TAR is that simple a format.
Here is the unoptimized and very bare core idea:
import sys
import tarfile
from indexed_gzip import IndexedGzipFile

targzfile = sys.argv[1]
filetoprint = sys.argv[2]

index = {}  # path : ( offset, size )

# IndexedGzipFile allows fast random seeking inside the gzip stream
file = IndexedGzipFile( targzfile )

# stream over the archive once, recording where each member's data begins
for tarinfo in tarfile.open( fileobj = file, mode = 'r|' ):
    index[tarinfo.name] = ( tarinfo.offset_data, tarinfo.size )
# at this point you could save or load the index for faster consecutive file seeks

# seek straight to the member's data and read exactly its size in bytes
file.seek( index[filetoprint][0] )
sys.stdout.buffer.write( file.read( index[filetoprint][1] ) )
The above example was tested to work with:
wget -O- 'https://ftp.mozilla.org/pub/firefox/releases/70.0/linux-x86_64/en-US/firefox-70.0.tar.bz2' | bzip2 -d -c | gzip > firefox.tgz
python3 minimal-example.py firefox.tgz firefox/updater.ini
Related
Finder allows you to sort files by many different attributes.
In the OSX filesystem, there is an attribute on every file called "comments" (com.apple.metadata:kMDItemFinderComment) which allows you to add any arbitrary string data as metadata for that file.
Finder exposes this "comment" attribute in the GUI and you can sort by it. I thought I could abuse this attribute to fill in random data for each file's "comments", and then sort by those random comments.
tl;dr: I'm trying to create "sort by random" functionality (in Finder) with the help of a Bash script and some Python.
This does achieve that (sort of):
find "$1" -type f -print0 | while IFS= read -r -d $'\0' file;  # get a list of files in the dir
do
    if [[ $file == *.wav ]]
    then
        hash=$(openssl rand -hex 12)                     # generate a random hash
        osxmetadata --set findercomment "$hash" "$file"  # set the comment
    fi
done
Here I'm using the osxmetadata Python utility to do the heavy lifting.
It works as intended, but it's really slow:
https://i.stack.imgur.com/d7exk.gif
I'm trying to do this operation on folders with many items, and would frequently be "re-seeding" the files with random comments.
Can anyone suggest an optimization I can try to make this faster? I tried using xattrs, but that doesn't seem to reindex the comments in Finder when they update.
I'd wrap the then-clause in a (...)& and add a wait after the loop. Then it will do every file in parallel.
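A sketch of that change applied to the script above (the braces keep the wait in the same subshell that spawns the background jobs; for folders with very many files you may want to throttle how many jobs run at once):

find "$1" -type f -print0 | {
    while IFS= read -r -d $'\0' file; do
        if [[ $file == *.wav ]]; then
            (
                hash=$(openssl rand -hex 12)
                osxmetadata --set findercomment "$hash" "$file"
            ) &  # set each file's comment in a background subshell
        fi
    done
    wait  # block until every background job has finished
}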
I have a big zip file containing many files that I'd like to unzip in chunks to avoid consuming too much memory.
I tried to use the Python zipfile module, but I didn't find a way to load the archive chunk by chunk and extract it to disk.
Is there a simple way to do that in Python?
EDIT
@steven-rumbalski correctly pointed out that zipfile handles big files correctly, unzipping the files one by one without loading the full archive.
My problem here is that my zip file is on AWS S3 and my EC2 instance cannot load such a big file into RAM, so I download it in chunks and would like to unzip it chunk by chunk.
You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory; rather, it just reads in the table of contents of the ZIP file. ZipFile.extractall() then extracts the members one at a time, using shutil.copyfileobj() to copy from a subclass of io.BufferedIOBase.
If all you want to do is a one-time extraction Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
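The programmatic equivalent is just as short; a minimal sketch (the archive and target names are placeholders):

import zipfile

# members are extracted one at a time; the archive is never fully loaded into RAM
with zipfile.ZipFile("archive.zip") as zf:
    zf.extractall("target-dir/")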
You can use zipfile (or possibly tarfile) as follows:
import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    # extract members ix_begin..ix_end (clamped to valid indices) of fn into directory
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()  # one ZipInfo object per member
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)
I am looking for some help with a Python script to create a self-extracting archive (SFX), i.e. an .exe file of the kind that WinRAR can create.
I want to archive a folder with password protection and split it into 3900 MB volumes so that it can easily be burned to disks.
I know WinRAR has command-line parameters to create an archive, but I am not sure how to call it via Python. Any help on this would be great.
Here are the main things I want:
Archive format: RAR
Compression method: Normal
Split volume size: 3900 MB
Password protection
I have looked everywhere but can't seem to find anything about this functionality.
You could have a look at rarfile.
Alternatively use something like:
from subprocess import call
cmdlineargs = "command -switch1 -switchN archive files.. path_to_extract"
call(["WinRAR"] + cmdlineargs.split())
Note that in the second line you will need to use the correct command-line arguments; the ones above are just an example.
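For the requirements listed above, a hedged sketch of what such a call might look like; the switches below are the RAR command-line options as I understand them, so verify them with rar -? on your system, and note that the executable, archive name, and folder are placeholders:

import subprocess

# assumed RAR switches -- check them against your WinRAR/rar version:
#   a        add files to an archive
#   -sfx     create a self-extracting (SFX) module
#   -m3      normal compression
#   -v3900m  split into 3900 MB volumes
#   -hp...   password-protect both file data and headers
subprocess.check_call([
    "rar", "a", "-sfx", "-m3", "-v3900m", "-hpMySecret",
    "backup.exe",       # output archive (placeholder)
    "folder_to_pack/",  # folder to pack (placeholder)
])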
I am working on a project where I have to store .hg directories. The easiest way is to pack the .hg into an hg.tar, which I save in MongoDB's GridFS filesystem.
If I go with this plan, I have to read the tar out.
import tarfile, cStringIO as io

repo = get_repo(saved_repo.id)  # application-specific helper

# load the stored tarball from GridFS into an in-memory buffer
ios = io.StringIO()
ios.write(repo.hgfile.read())
ios.seek(0)

tar = tarfile.open(mode='r', fileobj=ios)
members = tar.getmembers()
#for info in members:
#    tar.extract(info.name, '/tmp')
for file in members:
    print file.name, file.isdir()
This code works: I can get all the file and directory names as the loop runs.
My question is: how do I extract this tar into a valid, filesystem-like directory? I can .extractfile() individual members into memory, but if I want to feed this into the Mercurial API, I probably need the entire .hg directory in memory as a single unit, just as it exists in the filesystem.
Thoughts?
Mercurial has a concept called opener that's used to abstract filesystem access. I first looked at http://hg.intevation.org/mercurial/crew/file/tip/mercurial/revlog.py to see if you can replace the revlog class (which is the base class for changelog, manifest log and filelogs), but recent versions of Mercurial also have a VFS abstraction layer. It can be found in http://hg.intevation.org/mercurial/crew/file/8c64c4af21a4/mercurial/scmutil.py#l202 and is used by the localrepo.localrepository class for all file access.
I want to create a tar archive with a hierarchical directory structure from Python, using strings for the contents of the files. I've read this question, which shows a way of adding strings as files, but not as directories. How can I add directories on the fly to a tar archive without actually creating them on disk?
Something like:
archive.tgz:
    file1.txt
    file2.txt
    dir1/
        file3.txt
    dir2/
        file4.txt
Extending the example given in the linked question, you can do it as follows (Python 2):
import tarfile
import StringIO
import time

tar = tarfile.TarFile("test.tar", "w")

# in-memory contents for the file entry
string = StringIO.StringIO()
string.write("hello")
string.seek(0)

# create the directory entry first
info = tarfile.TarInfo(name='dir')
info.type = tarfile.DIRTYPE
info.mode = 0755
info.mtime = time.time()
tar.addfile(tarinfo=info)

# then add the file inside that directory
info = tarfile.TarInfo(name='dir/foo')
info.size = len(string.buf)
info.mtime = time.time()
tar.addfile(tarinfo=info, fileobj=string)

tar.close()
Be careful with the mode attribute: the default value might not include execute permission for the owner of the directory, which is needed to change into it and list its contents.
A slight modification to the helpful accepted answer so that it works with Python 3 as well as Python 2 (and matches the OP's example a bit more closely):
from io import BytesIO
import tarfile
import time
# create and open empty tar file
tar = tarfile.open("test.tgz", "w:gz")
# Add a file
file1_contents = BytesIO("hello 1".encode())
finfo1 = tarfile.TarInfo(name='file1.txt')
finfo1.size = len(file1_contents.getvalue())
finfo1.mtime = time.time()
tar.addfile(tarinfo=finfo1, fileobj=file1_contents)
# create directory in the tar file
dinfo = tarfile.TarInfo(name='dir')
dinfo.type = tarfile.DIRTYPE
dinfo.mode = 0o755
dinfo.mtime = time.time()
tar.addfile(tarinfo=dinfo)
# add a file to the new directory in the tar file
file2_contents = BytesIO("hello 2".encode())
finfo2 = tarfile.TarInfo(name='dir/file2.txt')
finfo2.size = len(file2_contents.getvalue())
finfo2.mtime = time.time()
tar.addfile(tarinfo=finfo2, fileobj=file2_contents)
tar.close()
In particular, I updated the octal syntax following PEP 3127 (Integer Literal Support and Syntax), switched to BytesIO from io, used getvalue instead of buf, and used open instead of TarFile to show gzipped output as in the example. (Context-manager usage (with ... as tar:) would also work in both Python 2 and Python 3, but cut and paste didn't work with my Python 2 REPL, so I didn't switch to it.) Tested on Python 2.7.15+ and Python 3.7.3.
Looking at the tar file format, it seems doable. The files that go in each subdirectory get the relative pathname (e.g. dir1/file3.txt) as their name.
The only trick is that you must define each directory before the files that go into it (tar won't create the necessary subdirectories on the fly). There is a special flag you can use to identify a tarfile entry as a directory, but for legacy purposes, tar also accepts file entries whose names end with / as representing directories, so you should be able to just add dir1/ as a file from a zero-length string using the same technique, as sketched below.
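A minimal sketch of that trailing-slash approach (the file names and contents here are made up for illustration):

import io
import tarfile

with tarfile.open("archive.tgz", "w:gz") as tar:
    # a name ending in "/" is treated as a directory entry by tar tools,
    # even without setting info.type = tarfile.DIRTYPE explicitly
    dir_info = tarfile.TarInfo(name="dir1/")
    tar.addfile(dir_info)  # zero-length entry, no fileobj needed

    # files below that path can then be added as usual
    data = b"contents of file3"
    file_info = tarfile.TarInfo(name="dir1/file3.txt")
    file_info.size = len(data)
    tar.addfile(file_info, io.BytesIO(data))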