Python: Assign a compression level to tarfile

My question is a follow up to this one.
I would like to know how I can modify the following code so that I can assign a compression level:
import os
import tarfile
home = '//global//scratch//chamar//parsed_data//batch0'
backup_dir = '//global//scratch//chamar//parsed_data//'
home_dirs = [ name for name in os.listdir(home) if os.path.isdir(os.path.join(home, name)) ]
for directory in home_dirs:
    full_dir = os.path.join(home, directory)
    tar = tarfile.open(os.path.join(backup_dir, directory+'.tar.gz'), 'w:gz')
    tar.add(full_dir, arcname=directory)
    tar.close()
Basically, the code loops through each directory in batch0 (each directory contains 6000+ files) and creates a tar.gz archive for each one in //global//scratch//chamar//parsed_data//.
I think the default compression level is 9, but it takes a long time to compress. I don't need much compression; level 5 would be enough. How can I modify the above code to set a compression level?

There is a compresslevel keyword argument you can pass to open() (no need to use gzopen() directly):
tar = tarfile.open(filename, "w:gz", compresslevel=5)
From the gzip documentation, compresslevel can be a number between 1 and 9 (9 is the default), 1 being the fastest and least compressed, and 9 being the slowest and most compressed.
[See also: tarfile documentation]
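To see what a given level buys you before running it over thousands of files, here is a minimal sketch that builds the same archive in memory at levels 1, 5 and 9 and records the resulting sizes. The helper name make_targz and the sample payload are made up for illustration; real data will compress differently.

```python
import io
import tarfile


def make_targz(payload, level):
    """Return the bytes of a tar.gz holding one file, gzip-compressed at `level`."""
    buf = io.BytesIO()
    # compresslevel is forwarded by tarfile.open() to the gzip layer for "w:gz"
    with tarfile.open(fileobj=buf, mode="w:gz", compresslevel=level) as tar:
        info = tarfile.TarInfo(name="data.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Hypothetical sample data; substitute a representative file of your own.
payload = b"some fairly repetitive text\n" * 20000
archives = {level: make_targz(payload, level) for level in (1, 5, 9)}
sizes = {level: len(data) for level, data in archives.items()}

# Read one archive back to confirm the content survives the round trip.
with tarfile.open(fileobj=io.BytesIO(archives[5]), mode="r:gz") as tar:
    restored = tar.extractfile("data.txt").read()
```

Printing sizes (and timing each call) on a sample of your own files should tell you whether level 5 is a reasonable trade-off.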

There is a compresslevel option in the gzopen method. The line below should replace the one with the tarfile.open call in your example:
tar = tarfile.TarFile.gzopen(os.path.join(backup_dir, directory+'.tar.gz'), mode='w', compresslevel=5)

Related

Limit on bz2 file decompression using python?

I have numerous files compressed in the bz2 format, and I am trying to decompress them into a temporary directory using Python so I can analyze them. There are hundreds of thousands of files, so decompressing them manually isn't feasible, and I wrote the following script.
My issue is that whenever I do this, the output files max out at 900 kb, even though manual decompression gives files of around 6 MB each. I am not sure whether this is a flaw in my code (in how I read the data and copy it to the file) or a problem with something else. I have tried this with different files and I know that it works for files smaller than 900 kb. Has anyone else had a similar problem and knows of a solution?
My code is below:
import numpy as np
import bz2
import os
import glob
def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed
    '''
    cpath = os.getcwd() #get current path
    filenames_ = [] #list to add filenames to for future use
    for zipped_file in glob.glob(filepath): #loop over the files that meet the name criteria
        with bz2.BZ2File(zipped_file,'rb') as zipfile: #Read in the bz2 files
            newfilepath = cpath +'/temp/'+zipped_file[-47:-4] #create a temporary file
            with open(newfilepath, "wb") as tmpfile: #open the temporary file
                for i,line in enumerate(zipfile.readlines()):
                    tmpfile.write(line) #write the data from the compressed file to the temporary file
        filenames_.append(newfilepath)
    return filenames_
path_='test/HS_H08_20180930_0710_B13_FLDK_R20_S*bz2'
unzip_f(path_)
It returns the correct file paths with the wrong sizes capped at 900 kb.
It turns out this issue is due to the files being multi-stream, which the bz2 module in Python 2.7 does not support. There is more info here as mentioned by jasonharper and here. Below is a solution that just uses the Unix bzip2 command to decompress the files and then moves them to the temporary directory I want. It is not as pretty but it works.
import numpy as np
import os
import glob
import shutil
import subprocess
def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed
    '''
    cpath = os.getcwd() #get current path
    filenames_ = [] #list to add filenames to for future use
    for zipped_file in glob.glob(filepath): #loop over the files that meet the name criteria
        newfilepath = cpath +'/temp/' #temporary directory for the output
        newfilename = newfilepath + zipped_file[-47:-4]
        subprocess.call(['bzip2', '-kd', zipped_file]) #wait for bzip2 to finish (os.popen would return immediately)
        shutil.move(zipped_file[:-4],newfilepath) #bzip2 writes its output next to the source file
        filenames_.append(newfilename)
    return filenames_
path_='test/HS_H08_20180930_0710_B13_FLDK_R20_S0*bz2'
unzip_f(path_)
This is a known limitation in Python 2, where the BZ2File class doesn't support multiple streams.
This can be easily resolved by using bz2file, https://pypi.org/project/bz2file/, which is a backport of the Python 3 implementation and can be used as a drop-in replacement.
After running pip install bz2file, you can just replace bz2 with it:
import bz2file as bz2
and everything should just work :)
The original Python bug report: https://bugs.python.org/issue1625
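For anyone on Python 3 (3.3 or later), the standard bz2 module reads multi-stream files on its own. The sketch below shows both behaviours side by side: a single BZ2Decompressor stops at the end of the first stream, which is the truncation described in the question, while bz2.open() reads through every stream. The function name decompress_bz2 and the sample data are assumptions for illustration only.

```python
import bz2
import os
import shutil
import tempfile


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress src_path into dst_path without loading it all in memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path


# Build a two-stream file, similar to concatenating two .bz2 files.
first = b"stream one\n" * 1000
second = b"stream two\n" * 1000
multi_stream = bz2.compress(first) + bz2.compress(second)

# A single decompressor object only handles the first stream.
single = bz2.BZ2Decompressor()
only_first = single.decompress(multi_stream)
leftover = single.unused_data  # the untouched second stream

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "sample.bz2")
    with open(src_path, "wb") as fh:
        fh.write(multi_stream)
    out_path = decompress_bz2(src_path, os.path.join(tmp, "sample.out"))
    with open(out_path, "rb") as fh:
        restored = fh.read()
```

Copying in chunks with shutil.copyfileobj also avoids readlines(), which is a poor fit for binary data.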

Writing zipfile in Python 3.6 without absolute path

I am trying to write a zip file using Python's zipfile module that starts at a certain subfolder but still maintains the tree structure from that subfolder. For example, if I pass "C:\Users\User1\OneDrive\Documents", the zip file will contain everything from Documents onward, with all of Documents' subfolders maintained within Documents. I have the following code:
import zipfile
import os
import datetime
def backup(src, dest):
    """Backup files from src to dest."""
    base = os.path.basename(src)
    now = datetime.datetime.now()
    newFile = f'{base}_{now.month}-{now.day}-{now.year}.zip'
    # Set the current working directory.
    os.chdir(dest)
    if os.path.exists(newFile):
        os.unlink(newFile)
        newFile = f'{base}_{now.month}-{now.day}-{now.year}_OVERWRITE.zip'
    # Write the zipfile and walk the source directory tree.
    with zipfile.ZipFile(newFile, 'w') as zip:
        for folder, _ , files in os.walk(src):
            print(f'Working in folder {os.path.basename(folder)}')
            for file in files:
                zip.write(os.path.join(folder, file),
                          arcname=os.path.join(
                              folder[len(os.path.dirname(folder)) + 1:], file),
                          compress_type=zipfile.ZIP_DEFLATED)
    print(f'\n---------- Backup of {base} to {dest} successful! ----------\n')
I know I have to use the arcname parameter for zipfile.write(), but I can't figure out how to get it to maintain the tree structure of the original directory. The code as it is now writes every subfolder to the first level of the zip file, if that makes sense. I've read several posts suggesting I use os.path.relpath() to chop off the root, but I can't seem to figure out how to do it properly. I am also aware that this post looks similar to others on Stack Overflow. I have read those other posts and cannot figure out how to solve this problem.
The arcname parameter sets the exact path within the zip file for the file you are adding. Your issue is that, when building the path for arcname, you are using the wrong value to get the length of the prefix to remove. Specifically:
arcname=os.path.join(folder[len(os.path.dirname(folder)) + 1:], file)
Should be changed to:
arcname=os.path.join(folder[len(src):], file)
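If you also want the top-level folder itself (Documents in the question) to appear inside the archive, one alternative is to compute arcname with os.path.relpath() relative to the parent of src. This is a minimal sketch rather than a drop-in replacement for backup(); the helper name zip_tree and the temporary test tree are made up for illustration.

```python
import os
import tempfile
import zipfile


def zip_tree(src, zip_path):
    """Zip everything under src, keeping src's own folder name as the archive root."""
    src = os.path.abspath(src)
    parent = os.path.dirname(src)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for folder, _, files in os.walk(src):
            for name in files:
                full = os.path.join(folder, name)
                # Path relative to src's parent, e.g. "Documents/sub/b.txt"
                zf.write(full, arcname=os.path.relpath(full, parent))


# Hypothetical tree: Documents/a.txt and Documents/sub/b.txt
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "Documents")
    os.makedirs(os.path.join(src, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(src, rel), "w") as fh:
            fh.write("hello")
    zip_path = os.path.join(tmp, "backup.zip")
    zip_tree(src, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        names = sorted(zf.namelist())
```

Note that this only adds files, so empty directories will not show up in the archive.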

Using gdal in python to produce tiff files from csv files

I have many csv files with this format:
Latitude,Longitude,Concentration
53.833399,-122.825257,0.021957
53.837893,-122.825238,0.022642
....
My goal is to produce GeoTiff files based on the information within these files (one tiff file per csv file), preferably using python. This was done several years ago on the project I am working on, however how they did it before has been lost. All I know is that they most likely used GDAL.
I have attempted to do this by researching how to use GDAL, but this has not got me anywhere, as there are limited resources and I have no knowledge of how to use this.
Can someone help me with this?
Here is some code I adapted for your case. For it to work, the GDAL directory containing all the *.exe utilities must be on your PATH (in most cases it's C:\Program Files (x86)\GDAL).
It uses the gdal_grid.exe utility (see the documentation here: http://www.gdal.org/gdal_grid.html).
You can modify the gdal_cmd variable to suit your needs.
import subprocess
import os
# your directory with all your csv files in it
dir_with_csvs = r"C:\my_csv_files"
# make it the active directory
os.chdir(dir_with_csvs)
# function to get the csv filenames in the directory
def find_csv_filenames(path_to_dir, suffix=".csv"):
    filenames = os.listdir(path_to_dir)
    return [ filename for filename in filenames if filename.endswith(suffix) ]
# get the filenames
csvfiles = find_csv_filenames(dir_with_csvs)
# loop through each CSV file
# for each CSV file, make an associated VRT file to be used with gdal_grid command
# and then run the gdal_grid util in a subprocess instance
for fn in csvfiles:
    vrt_fn = fn.replace(".csv", ".vrt")
    lyr_name = fn.replace('.csv', '')
    out_tif = fn.replace('.csv', '.tiff')
    with open(vrt_fn, 'w') as fn_vrt:
        fn_vrt.write('<OGRVRTDataSource>\n')
        fn_vrt.write('\t<OGRVRTLayer name="%s">\n' % lyr_name)
        fn_vrt.write('\t\t<SrcDataSource>%s</SrcDataSource>\n' % fn)
        fn_vrt.write('\t\t<GeometryType>wkbPoint</GeometryType>\n')
        fn_vrt.write('\t\t<GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude" z="Concentration"/>\n')
        fn_vrt.write('\t</OGRVRTLayer>\n')
        fn_vrt.write('</OGRVRTDataSource>\n')
    gdal_cmd = 'gdal_grid -a invdist:power=2.0:smoothing=1.0 -zfield "Concentration" -of GTiff -ot Float64 -l %s %s %s' % (lyr_name, vrt_fn, out_tif)
    subprocess.call(gdal_cmd, shell=True)
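If your file names may contain spaces or characters that are special in XML, it can help to build the VRT with an XML library and to pass the command as an argument list instead of a shell string. The sketch below only constructs the VRT text and the argument list; it does not run GDAL. The helper names build_vrt and build_gdal_grid_args are hypothetical, and the gdal_grid options are exactly the ones used in the script above.

```python
import xml.etree.ElementTree as ET


def build_vrt(layer, csv_name, x="Longitude", y="Latitude", z="Concentration"):
    """Return the OGR VRT document (as text) describing a CSV of points."""
    root = ET.Element("OGRVRTDataSource")
    lyr = ET.SubElement(root, "OGRVRTLayer", name=layer)
    ET.SubElement(lyr, "SrcDataSource").text = csv_name
    ET.SubElement(lyr, "GeometryType").text = "wkbPoint"
    ET.SubElement(lyr, "GeometryField", encoding="PointFromColumns", x=x, y=y, z=z)
    return ET.tostring(root, encoding="unicode")


def build_gdal_grid_args(layer, vrt_name, out_tif, zfield="Concentration"):
    """Return the gdal_grid command as a list, suitable for subprocess.call(args)."""
    return [
        "gdal_grid",
        "-a", "invdist:power=2.0:smoothing=1.0",
        "-zfield", zfield,
        "-of", "GTiff",
        "-ot", "Float64",
        "-l", layer,
        vrt_name,
        out_tif,
    ]


# Hypothetical file names for illustration.
vrt_text = build_vrt("sample", "sample.csv")
args = build_gdal_grid_args("sample", "sample.vrt", "sample.tiff")
```

You would then write vrt_text to the .vrt file and call subprocess.call(args) without shell=True.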

Python zipfile module doesn't compress files

I have a problem with compression in Python.
I know I should call the ZIP_DEFLATED method when writing to make the zip file compressed, but it does not work for me.
I have 3 PDF documents in the C:\zip directory.
When I run the following code it works just fine:
import os,sys
from zipfile import ZipFile, ZIP_DEFLATED
list = os.listdir(r'C:\zip')
file = ZipFile('test.zip','w')
for item in list:
    file.write(item)
file.close()
It makes the test.zip file without the compression.
When I change the fourth row to this:
file = ZipFile('test.zip','w', compression = ZIP_DEFLATED)
It also makes the test.zip file without the compression.
I also tried to change the write method to give it the compress_type argument:
file.write(item, compress_type = ZIP_DEFLATED)
But that doesn't work either.
I use Python version 2.7.4 with Win7.
I tried the code on another computer (same circumstances, Win7 and Python 2.7.4), and it made the test.zip file compressed just like it should.
I know the zlib module should be available, when I run this:
import zlib
It doesn't raise an error. Also, if there were something wrong with the zlib module, the code at the top should have raised an error too, so I suspect that zlib isn't the problem.
By default, the zipfile module only stores data without compressing it. To compress it, you can do this:
import zipfile
try:
    import zlib
    mode = zipfile.ZIP_DEFLATED
except ImportError:
    mode = zipfile.ZIP_STORED
zip = zipfile.ZipFile('zipfilename', 'w', mode)
zip.write(item)
zip.close()
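To check whether an archive was really compressed, you can inspect each entry's compress_type and compress_size instead of judging by the file size on disk. Here is a minimal in-memory sketch; the helper name archive_entry and the sample payload are made up for illustration. Keep in mind that PDFs are often compressed internally already, so deflating them may shrink them only slightly.

```python
import io
import zipfile

# Hypothetical, highly compressible sample data.
payload = b"highly compressible line\n" * 5000


def archive_entry(compression):
    """Write payload into an in-memory zip and return the entry's ZipInfo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        zf.writestr("sample.txt", payload)
    with zipfile.ZipFile(buf) as zf:
        return zf.getinfo("sample.txt")


stored = archive_entry(zipfile.ZIP_STORED)
deflated = archive_entry(zipfile.ZIP_DEFLATED)

for label, info in (("stored", stored), ("deflated", deflated)):
    print(label, info.compress_type, info.file_size, info.compress_size)
```

For an existing file, zipfile.ZipFile('test.zip').infolist() gives the same ZipInfo objects.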
In case you get here as I did, I'll add something.
If you use ZipInfo objects, they always override the compression method specified while creating the ZipFile, which is then useless.
So either you set their compression method (no parameter on the constructor, you must set the attribute) or specify the compression method when calling write (or writestr).
import io
import zlib
from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED
def write_things():
    zip_buffer = io.BytesIO()
    with ZipFile(file = zip_buffer, mode = "w", compression = ZIP_DEFLATED) as zipper:
        # Get some data to write (get_file_data() is a placeholder, not defined here)
        fname, content, zip_ts = get_file_data()
        file_object = ZipInfo(fname, zip_ts)
        zipper.writestr(file_object, content) # Surprise, no compression
        # This is required to get compression
        # zipper.writestr(file_object, content, compress_type = ZIP_DEFLATED)
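A small self-contained sketch of the three cases described above: a plain ZipInfo (ends up stored), a ZipInfo with its compress_type attribute set, and a compress_type passed to writestr(). The entry names and content are made up for illustration.

```python
import io
import zipfile

content = b"repeat me\n" * 5000  # hypothetical sample data
buf = io.BytesIO()

with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    # 1. Plain ZipInfo: its own default (ZIP_STORED) wins over the ZipFile setting.
    zf.writestr(zipfile.ZipInfo("default_info.txt"), content)

    # 2. Set the attribute on the ZipInfo before writing.
    tuned = zipfile.ZipInfo("tuned_info.txt")
    tuned.compress_type = zipfile.ZIP_DEFLATED
    zf.writestr(tuned, content)

    # 3. Pass compress_type on the writestr() call itself.
    zf.writestr(zipfile.ZipInfo("per_call.txt"), content,
                compress_type=zipfile.ZIP_DEFLATED)

with zipfile.ZipFile(buf) as zf:
    methods = {info.filename: info.compress_type for info in zf.infolist()}
```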

How can I programmatically create a tar archive of nested directories and files solely from Python strings and without temporary files?

I want to create a tar archive with a hierarchical directory structure from Python, using strings for the contents of the files. I've read this question, which shows a way of adding strings as files, but not as directories. How can I add directories on the fly to a tar archive without actually making them?
Something like:
archive.tgz:
file1.txt
file2.txt
dir1/
file3.txt
dir2/
file4.txt
Extending the example given in the question linked, you can do it as follows:
import tarfile
import StringIO
import time
tar = tarfile.TarFile("test.tar", "w")
string = StringIO.StringIO()
string.write("hello")
string.seek(0)
info = tarfile.TarInfo(name='dir')
info.type = tarfile.DIRTYPE
info.mode = 0755
info.mtime = time.time()
tar.addfile(tarinfo=info)
info = tarfile.TarInfo(name='dir/foo')
info.size=len(string.buf)
info.mtime = time.time()
tar.addfile(tarinfo=info, fileobj=string)
tar.close()
Be careful with the mode attribute: the default value might not include execute permission for the owner of the directory, which is needed to change into it and list its contents.
A slight modification to the helpful accepted answer so that it works with python 3 as well as python 2 (and matches the OP's example a bit closer):
from io import BytesIO
import tarfile
import time
# create and open empty tar file
tar = tarfile.open("test.tgz", "w:gz")
# Add a file
file1_contents = BytesIO("hello 1".encode())
finfo1 = tarfile.TarInfo(name='file1.txt')
finfo1.size = len(file1_contents.getvalue())
finfo1.mtime = time.time()
tar.addfile(tarinfo=finfo1, fileobj=file1_contents)
# create directory in the tar file
dinfo = tarfile.TarInfo(name='dir')
dinfo.type = tarfile.DIRTYPE
dinfo.mode = 0o755
dinfo.mtime = time.time()
tar.addfile(tarinfo=dinfo)
# add a file to the new directory in the tar file
file2_contents = BytesIO("hello 2".encode())
finfo2 = tarfile.TarInfo(name='dir/file2.txt')
finfo2.size = len(file2_contents.getvalue())
finfo2.mtime = time.time()
tar.addfile(tarinfo=finfo2, fileobj=file2_contents)
tar.close()
In particular, I updated the octal syntax following PEP 3127 -- Integer Literal Support and Syntax, switched to BytesIO from io, used getvalue instead of buf, and used open instead of TarFile to show zipped output as in the example. (Using a context manager (with ... as tar:) would also work in both Python 2 and Python 3, but cut and paste didn't work with my Python 2 REPL, so I didn't switch it.) Tested on Python 2.7.15+ and Python 3.7.3.
Looking at the tar file format it seems doable. The files that go in each subdirectory get the relative pathname (e.g. dir1/file3.txt) as their name.
The only trick is that you must define each directory before the files that go into it (tar won't create the necessary subdirectories on the fly). There is a special flag you can use to identify a tarfile entry as a directory, but for legacy purposes, tar also accepts file entries having names that end with / as representing directories, so you should be able to just add dir1/ as a file from a zero-length string using the same technique.
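Putting the pieces together, here is a sketch (Python 3) that takes a dict of archive paths and string contents and adds each parent directory entry before the files inside it, all in memory. The function name build_archive and the sample paths are assumptions for illustration; it uses explicit DIRTYPE entries rather than the trailing-slash trick.

```python
import io
import tarfile
import time


def build_archive(files):
    """files maps 'dir1/dir2/name.txt' style paths to text; returns tar.gz bytes."""
    buf = io.BytesIO()
    seen_dirs = set()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path in sorted(files):
            # Add every missing parent directory, outermost first.
            parts = path.split("/")[:-1]
            for depth in range(1, len(parts) + 1):
                dir_name = "/".join(parts[:depth])
                if dir_name not in seen_dirs:
                    dinfo = tarfile.TarInfo(name=dir_name)
                    dinfo.type = tarfile.DIRTYPE
                    dinfo.mode = 0o755
                    dinfo.mtime = int(time.time())
                    tar.addfile(dinfo)
                    seen_dirs.add(dir_name)
            # Then add the file itself from an in-memory buffer.
            data = files[path].encode("utf-8")
            finfo = tarfile.TarInfo(name=path)
            finfo.size = len(data)
            finfo.mtime = int(time.time())
            tar.addfile(finfo, io.BytesIO(data))
    return buf.getvalue()


archive = build_archive({
    "file1.txt": "one",
    "dir1/file3.txt": "three",
    "dir1/dir2/file4.txt": "four",
})

with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
    names = tar.getnames()
    dir_names = [m.name for m in tar.getmembers() if m.isdir()]
    nested_text = tar.extractfile("dir1/dir2/file4.txt").read().decode("utf-8")
```

Write the returned bytes to a file if you need the archive on disk.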
