Calculate CRC32, MD5 and SHA1 of zip content without decompression in Python

I need to calculate the CRC32, MD5 and SHA1 of the content of zip files without decompressing them.
So far I have found out how to calculate these for the zip file itself, e.g.:
CRC32:
import zlib

zip_name = "test.zip"

def Crc32Hasher(file_path):
    buf_size = 65536
    crc32 = 0
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            crc32 = zlib.crc32(data, crc32)
    return format(crc32 & 0xFFFFFFFF, '08x')

print(Crc32Hasher(zip_name))
SHA1: (MD5 similarly)
import hashlib

zip_name = "test.zip"

def Sha1Hasher(file_path):
    buf_size = 65536
    sha1 = hashlib.sha1()
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            sha1.update(data)
    return sha1.hexdigest()

print(Sha1Hasher(zip_name))
For the content of the zip file, I can read the CRC32 from the zip directly without the need of calculating it, as follows:
Read CRC32 of zip content:
import zipfile

zip_name = "test.zip"

if zip_name.lower().endswith('.zip'):
    z = zipfile.ZipFile(zip_name, "r")
    for info in z.infolist():
        print(info.filename,
              format(info.CRC & 0xFFFFFFFF, '08x'))
But I couldn't figure out how to calculate the SHA1 (or MD5) of the content of zip files without decompressing them first.
Is that somehow possible?

It is not possible. You can get the CRC because it was precalculated for you when the archive was created (it is used for integrity checks). Any other checksum/hash has to be calculated from scratch and will require at least streaming the archive content, i.e. unpacking it.
UPD: Possible implementations
libarchive: extra dependencies, supports many archive formats
import hashlib
import libarchive.public as libarchive

with libarchive.file_reader(fname) as archive:
    for entry in archive:
        md5 = hashlib.md5()
        for block in entry.get_blocks():
            md5.update(block)
        print(str(entry), md5.hexdigest())
Native zipfile: no dependencies, zip only
import hashlib
import zipfile

archive = zipfile.ZipFile(fname)
blocksize = 1024**2  # 1M chunks
for name in archive.namelist():
    entry = archive.open(name)
    md5 = hashlib.md5()
    while True:
        block = entry.read(blocksize)
        if not block:
            break
        md5.update(block)
    print(name, md5.hexdigest())
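For completeness, a sketch (not part of the original answer) that computes all three digests the question asks for in a single streaming pass over each member, assuming the same local "test.zip" as above:
import hashlib
import zipfile
import zlib

archive = zipfile.ZipFile("test.zip")
blocksize = 1024**2
for name in archive.namelist():
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    crc32 = 0
    with archive.open(name) as entry:
        while True:
            block = entry.read(blocksize)
            if not block:
                break
            md5.update(block)  # all three digests share one read
            sha1.update(block)
            crc32 = zlib.crc32(block, crc32)
    print(name, format(crc32 & 0xFFFFFFFF, '08x'),
          md5.hexdigest(), sha1.hexdigest())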

Related

How to get the md5 checksum of files present in windows shared folder?

I'm using pysmb to connect to a Windows shared folder and calculate the md5 checksum of the files.
I'm using hashlib for this purpose.
The code I have tried is as follows:
conn = SMBConnection(userName, password, config.clientMachineName, serverName,
                     use_ntlm_v2=True)
conn.connect(host, 139)
file_obj = tempfile.NamedTemporaryFile()
file_attributes, filesize = conn.retrieveFile(share_name, file_path, file_obj)

# calculate md5 hash
md5_hash = hashlib.md5()
while True:
    data = file_obj.read(1024)
    if not data:
        break
    md5_hash.update(data)
print(file_path, md5_hash.hexdigest())
But it returns the same hexadecimal value for all files.
What can be the alternative solution?
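One likely cause, not stated in the thread: retrieveFile writes into file_obj and leaves the file position at the end of the written data, so the read() loop that follows gets nothing and every file hashes to the digest of zero bytes. A minimal sketch of the fix, reusing the names from the snippet above:
file_attributes, filesize = conn.retrieveFile(share_name, file_path, file_obj)
file_obj.seek(0)  # rewind before reading back, otherwise read() returns nothing

md5_hash = hashlib.md5()
while True:
    data = file_obj.read(1024)
    if not data:
        break
    md5_hash.update(data)
print(file_path, md5_hash.hexdigest())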

Write tar file to buffer with python

I want to get the data of tar.gz I created
In this example I create the tar.gz file, and then read the content
import tarfile

with tarfile.open('/tmp/test.tar.gz', 'w:gz') as f:
    f.add("/home/chris/.zshrc")

with open('/tmp/test.tar.gz', 'rb') as f:
    data = f.read()
Is there any short and clean way? I don't need the tar.gz file, only the data.
Use an in-memory buffer by passing tarfile an io.BytesIO instance as the fileobj argument:
import tarfile
from io import BytesIO

buf = BytesIO()
with tarfile.open('/tmp/test.tar.gz', 'w:gz', fileobj=buf) as f:
    f.add("/home/chris/.zshrc")

data = buf.getvalue()
print(len(data))
Or you can do:
import tarfile
from io import BytesIO

buf = BytesIO()
with tarfile.open('/tmp/test.tar.gz', 'w:gz', fileobj=buf) as f:
    f.add("/home/chris/.zshrc")

buf.seek(0, 0)  # reset pointer back to the start of the buffer
with tarfile.open('/tmp/test.tar.gz', 'r:gz', fileobj=buf) as f:
    print(f.getmembers())
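Note that when fileobj is given, tarfile only records the name argument as archive metadata; if you don't care about that, the name can be omitted entirely. A minimal variant of the same idea:
import tarfile
from io import BytesIO

buf = BytesIO()
# no filename needed when writing straight to a file object
with tarfile.open(mode='w:gz', fileobj=buf) as f:
    f.add("/home/chris/.zshrc")
data = buf.getvalue()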

Sending multiple .CSV files to .ZIP without storing to disk in Python

I'm working on a reporting application for my Django powered website. I want to run several reports and have each report generate a .csv file in memory that can be downloaded in batch as a .zip. I would like to do this without storing any files to disk. So far, to generate a single .csv file, I am following the common operation:
mem_file = StringIO.StringIO()
writer = csv.writer(mem_file)
writer.writerow(["My content", my_value])
mem_file.seek(0)
response = HttpResponse(mem_file, content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename=my_file.csv'
This works fine, but only for a single, unzipped .csv. If I had, for example, a list of .csv files created with a StringIO stream:
firstFile = StringIO.StringIO()
# write some data to the file
secondFile = StringIO.StringIO()
# write some data to the file
thirdFile = StringIO.StringIO()
# write some data to the file
myFiles = [firstFile, secondFile, thirdFile]
How could I return a compressed file that contains all objects in myFiles and can be properly unzipped to reveal three .csv files?
zipfile is a standard library module that does exactly what you're looking for. For your use-case, the meat and potatoes is a method called "writestr" that takes a name of a file and the data contained within it that you'd like to zip.
In the code below, I've used a sequential naming scheme for the files when they're unzipped, but this can be switched to whatever you'd like.
import zipfile
import StringIO

zipped_file = StringIO.StringIO()
with zipfile.ZipFile(zipped_file, 'w') as zip:
    for i, file in enumerate(files):
        file.seek(0)
        zip.writestr("{}.csv".format(i), file.read())
zipped_file.seek(0)
If you want to future-proof your code (hint hint Python 3 hint hint), you might want to switch over to using io.BytesIO instead of StringIO, since Python 3 is all about the bytes. Another bonus is that explicit seeks are not necessary with io.BytesIO before reads (I haven't tested this behavior with Django's HttpResponse, so I've left that final seek in there just in case).
import io
import zipfile

zipped_file = io.BytesIO()
with zipfile.ZipFile(zipped_file, 'w') as f:
    for i, file in enumerate(files):
        f.writestr("{}.csv".format(i), file.getvalue())
zipped_file.seek(0)
The stdlib comes with the module zipfile, and the main class, ZipFile, accepts a file or file-like object:
from zipfile import ZipFile

temp_file = StringIO.StringIO()
zipped = ZipFile(temp_file, 'w')

# create temp csv_files = [(name1, data1), (name2, data2), ... ]
for name, data in csv_files:
    data.seek(0)
    zipped.writestr(name, data.read())
zipped.close()
temp_file.seek(0)
# etc. etc.
# etc. etc.
I'm not a user of StringIO so I may have the seek and read out of place, but hopefully you get the idea.
import zipfile
from StringIO import StringIO  # use io.BytesIO for Python 3

def zipFiles(files):
    outfile = StringIO()  # io.BytesIO() for Python 3
    with zipfile.ZipFile(outfile, 'w') as zf:
        for n, f in enumerate(files):
            zf.writestr("{}.csv".format(n), f.getvalue())
    return outfile.getvalue()

zipped_file = zipFiles(myFiles)
response = HttpResponse(zipped_file, content_type='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename=my_file.zip'
StringIO has a getvalue method which returns the entire contents. You can compress the zip file
with zipfile.ZipFile(outfile, 'w', zipfile.ZIP_DEFLATED). The default compression is ZIP_STORED, which creates the zip file without compressing it.
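For example, a short sketch of the compressed variant (the member name and data here are made up):
import io
import zipfile

outfile = io.BytesIO()
# ZIP_DEFLATED compresses each member; the default ZIP_STORED does not
with zipfile.ZipFile(outfile, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("report.csv", "My content,42\n")
data = outfile.getvalue()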

How do I automatically handle decompression when reading a file in Python?

I am writing some Python code that loops through a number of files and processes the first few hundred lines of each file. I would like to extend this code so that if any of the files in the list are compressed, it will automatically decompress while reading them, so that my code always receives the decompressed lines. Essentially my code currently looks like:
for f in files:
    handle = open(f)
    process_file_contents(handle)
Is there any function that can replace open in the above code so that if f is either plain text or gzip-compressed text (or bzip2, etc.), the function will always return a file handle to the decompressed contents of the file? (No seeking required, just sequential access.)
I had the same problem: I'd like my code to accept filenames and return a filehandle to be used with with, handling decompression automatically.
In my case, I'm willing to trust the filename extensions and I only need to deal with gzip and maybe bzip2 files.
import gzip
import bz2

def open_by_suffix(filename):
    if filename.endswith('.gz'):
        return gzip.open(filename, 'rb')
    elif filename.endswith('.bz2'):
        return bz2.BZ2File(filename, 'r')
    else:
        return open(filename, 'r')
If we don't trust the filenames, we can compare the initial bytes of the file for magic strings (modified from https://stackoverflow.com/a/13044946/117714):
import gzip
import bz2

magic_dict = {
    "\x1f\x8b\x08": (gzip.open, 'rb'),
    "\x42\x5a\x68": (bz2.BZ2File, 'r'),
}

max_len = max(len(x) for x in magic_dict)

def open_by_magic(filename):
    with open(filename) as f:
        file_start = f.read(max_len)
    for magic, (fn, flag) in magic_dict.items():
        if file_start.startswith(magic):
            return fn(filename, flag)
    return open(filename, 'r')
Usage:
# cat
for filename in filenames:
    with open_by_suffix(filename) as f:
        for line in f:
            print(line)
Your use-case would look like:
for f in files:
    with open_by_suffix(f) as handle:
        process_file_contents(handle)

Python: binary file to gz file and then to jpg extension and finally return again to the original binary file

I want to do the following in Python:
Take a binary (executable) file
Compress it with gzip (.gz extension)
Then give it a .jpg extension
Later, recover the original file again (without the .gz or .jpg extension)
The idea is to send binary files through Gmail SMTP, then fetch them over IMAP and process them in their original form (1).
The Python gzip and shutil libraries can do what you need.
To gzip the executable:
import gzip, shutil
src = open('executable', 'rb')
dest = gzip.open('executable.gz.jpg', 'wb')
shutil.copyfileobj(src, dest)
src.close()
dest.close()
And then to get the original back.
import gzip, shutil

src = gzip.open('executable.gz.jpg', 'rb')
dest = open('executable', 'wb')  # plain open: the gzip source already decompresses
shutil.copyfileobj(src, dest)
src.close()
dest.close()
That being said, Gmail's MIME filters look at content, not extension, so it may still block the new file.
You can use gzip to compress the files and os.rename to change file names. In your case you could just use gzip and save it with a .jpg extension in the first place.
import gzip

# write compressed file ('content' holds the original binary data)
with gzip.open('my_file.jpg', 'wb') as f:
    f.write(content)

# read it again
with gzip.open('my_file.jpg', 'rb') as f:
    content = f.read()
