I have a problem that i need to validate huge compressed files after download (usually more than 10-20gb per file) against reference checksums that have apparently been generated using cksum
(To be more precise: My python script needs to download large compressed files from the ncbi ftp-server that was supposed to provide md5 checksums for validating the downloads, but instead only provided some different unspecified filehash/checksum values. After some trial and error I found that these checksums were identical to the output of the unix tool cksum, which apparently genereates CRC-checksums. So to compare/validate these i need to generate cksum-equivalent checksums for the downloaded files.)
It appears that the unix tool cksum yields totally different checksum values than the supposed equivalent unix tool crc32 (or the python zlib.crc32() function, for that matter). When googling the problem I could not understand the explanations for why this occurs, especially since they appear to be identical on some systems? So maybe this is because I work on a 64 bit system (but then: who doesn't nowadays)?
using built-in python modules I can easily generate md5- and CRC32 checksums, but none of these are equivalent to the cksum output, neither in decimal nor in hexadecimal representation.
I did find a previous post here on stackoverflow pointing to a snippet that seems to solve this. But while it works for small files, A.) I do not understand a word of it, so I have a hard time adapting it and B.) it does not seem to work well with large files.
for completeness sake: here is the snippet (python3 version):
#!/usr/bin/env python
import sys
crctab = [ 0x00000000, 0x04c11db7, 0x09823b6e, 0x0d4326d9, 0x130476dc,
0x17c56b6b, 0x1a864db2, 0x1e475005, 0x2608edb8, 0x22c9f00f,
0x2f8ad6d6, 0x2b4bcb61, 0x350c9b64, 0x31cd86d3, 0x3c8ea00a,
0x384fbdbd, 0x4c11db70, 0x48d0c6c7, 0x4593e01e, 0x4152fda9,
0x5f15adac, 0x5bd4b01b, 0x569796c2, 0x52568b75, 0x6a1936c8,
0x6ed82b7f, 0x639b0da6, 0x675a1011, 0x791d4014, 0x7ddc5da3,
0x709f7b7a, 0x745e66cd, 0x9823b6e0, 0x9ce2ab57, 0x91a18d8e,
0x95609039, 0x8b27c03c, 0x8fe6dd8b, 0x82a5fb52, 0x8664e6e5,
0xbe2b5b58, 0xbaea46ef, 0xb7a96036, 0xb3687d81, 0xad2f2d84,
0xa9ee3033, 0xa4ad16ea, 0xa06c0b5d, 0xd4326d90, 0xd0f37027,
0xddb056fe, 0xd9714b49, 0xc7361b4c, 0xc3f706fb, 0xceb42022,
0xca753d95, 0xf23a8028, 0xf6fb9d9f, 0xfbb8bb46, 0xff79a6f1,
0xe13ef6f4, 0xe5ffeb43, 0xe8bccd9a, 0xec7dd02d, 0x34867077,
0x30476dc0, 0x3d044b19, 0x39c556ae, 0x278206ab, 0x23431b1c,
0x2e003dc5, 0x2ac12072, 0x128e9dcf, 0x164f8078, 0x1b0ca6a1,
0x1fcdbb16, 0x018aeb13, 0x054bf6a4, 0x0808d07d, 0x0cc9cdca,
0x7897ab07, 0x7c56b6b0, 0x71159069, 0x75d48dde, 0x6b93dddb,
0x6f52c06c, 0x6211e6b5, 0x66d0fb02, 0x5e9f46bf, 0x5a5e5b08,
0x571d7dd1, 0x53dc6066, 0x4d9b3063, 0x495a2dd4, 0x44190b0d,
0x40d816ba, 0xaca5c697, 0xa864db20, 0xa527fdf9, 0xa1e6e04e,
0xbfa1b04b, 0xbb60adfc, 0xb6238b25, 0xb2e29692, 0x8aad2b2f,
0x8e6c3698, 0x832f1041, 0x87ee0df6, 0x99a95df3, 0x9d684044,
0x902b669d, 0x94ea7b2a, 0xe0b41de7, 0xe4750050, 0xe9362689,
0xedf73b3e, 0xf3b06b3b, 0xf771768c, 0xfa325055, 0xfef34de2,
0xc6bcf05f, 0xc27dede8, 0xcf3ecb31, 0xcbffd686, 0xd5b88683,
0xd1799b34, 0xdc3abded, 0xd8fba05a, 0x690ce0ee, 0x6dcdfd59,
0x608edb80, 0x644fc637, 0x7a089632, 0x7ec98b85, 0x738aad5c,
0x774bb0eb, 0x4f040d56, 0x4bc510e1, 0x46863638, 0x42472b8f,
0x5c007b8a, 0x58c1663d, 0x558240e4, 0x51435d53, 0x251d3b9e,
0x21dc2629, 0x2c9f00f0, 0x285e1d47, 0x36194d42, 0x32d850f5,
0x3f9b762c, 0x3b5a6b9b, 0x0315d626, 0x07d4cb91, 0x0a97ed48,
0x0e56f0ff, 0x1011a0fa, 0x14d0bd4d, 0x19939b94, 0x1d528623,
0xf12f560e, 0xf5ee4bb9, 0xf8ad6d60, 0xfc6c70d7, 0xe22b20d2,
0xe6ea3d65, 0xeba91bbc, 0xef68060b, 0xd727bbb6, 0xd3e6a601,
0xdea580d8, 0xda649d6f, 0xc423cd6a, 0xc0e2d0dd, 0xcda1f604,
0xc960ebb3, 0xbd3e8d7e, 0xb9ff90c9, 0xb4bcb610, 0xb07daba7,
0xae3afba2, 0xaafbe615, 0xa7b8c0cc, 0xa379dd7b, 0x9b3660c6,
0x9ff77d71, 0x92b45ba8, 0x9675461f, 0x8832161a, 0x8cf30bad,
0x81b02d74, 0x857130c3, 0x5d8a9099, 0x594b8d2e, 0x5408abf7,
0x50c9b640, 0x4e8ee645, 0x4a4ffbf2, 0x470cdd2b, 0x43cdc09c,
0x7b827d21, 0x7f436096, 0x7200464f, 0x76c15bf8, 0x68860bfd,
0x6c47164a, 0x61043093, 0x65c52d24, 0x119b4be9, 0x155a565e,
0x18197087, 0x1cd86d30, 0x029f3d35, 0x065e2082, 0x0b1d065b,
0x0fdc1bec, 0x3793a651, 0x3352bbe6, 0x3e119d3f, 0x3ad08088,
0x2497d08d, 0x2056cd3a, 0x2d15ebe3, 0x29d4f654, 0xc5a92679,
0xc1683bce, 0xcc2b1d17, 0xc8ea00a0, 0xd6ad50a5, 0xd26c4d12,
0xdf2f6bcb, 0xdbee767c, 0xe3a1cbc1, 0xe760d676, 0xea23f0af,
0xeee2ed18, 0xf0a5bd1d, 0xf464a0aa, 0xf9278673, 0xfde69bc4,
0x89b8fd09, 0x8d79e0be, 0x803ac667, 0x84fbdbd0, 0x9abc8bd5,
0x9e7d9662, 0x933eb0bb, 0x97ffad0c, 0xafb010b1, 0xab710d06,
0xa6322bdf, 0xa2f33668, 0xbcb4666d, 0xb8757bda, 0xb5365d03,
0xb1f740b4 ]
UNSIGNED = lambda n: n & 0xffffffff
def memcrc(b):
n = len(b)
i = c = s = 0
for c in b:
tabidx = (s>>24)^c
s = UNSIGNED((s << 8)) ^ crctab[tabidx]
while n:
c = n & 0o0377
n = n >> 8
s = UNSIGNED(s << 8) ^ crctab[(s >> 24) ^ c]
return UNSIGNED(~s)
if __name__ == '__main__':
fname = sys.argv[-1]
buffer = open(fname, 'rb').read()
print("%d\t%d\t%s" % (memcrc(buffer), len(buffer), fname))
Could someone please help me understand this?
what exactly is the problem with the difference between cksum and crc32?
is it simply the fact that the one is 32bit based and the other 64 bit?
Can i simply convert between the values produced by both, and if yes how?
what is the purpose of the crctab in the above snippet and how does the conversion work there?
I don't know the why part of your question. All I can say is that the great thing about standards is that you have so many to choose from.
cksum is specified by POSIX to use a different CRC than the more common CRC-32 you find in zlib, Python, used in zip and gzip files, etc. The CRC-32/CKSUM has this specification (from Greg Cook's CRC catalog):
width=32 poly=0x04c11db7 init=0x00000000 refin=false refout=false xorout=0xffffffff check=0x765e7680 residue=0xc704dd7b name="CRC-32/CKSUM"
The more common CRC-32 has this specification:
width=32 poly=0x04c11db7 init=0xffffffff refin=true refout=true xorout=0xffffffff check=0xcbf43926 residue=0xdebb20e3 name="CRC-32/ISO-HDLC"
The cksum utility on my system (macOS) computes the CRC-32/CKSUM, but it also has options to compute the CRC-32/ISO_HDLC, as well as two other actual checksums, the first from the BSD Unix sum command, and the second from the AT&T System V Unix sum command.
There is apparently no shortage of results that cksum might produce.
No, it has nothing to do with 32 vs. 64 bit systems.
No, you cannot convert between the values.
The purpose of the table is to speed up the CRC calculation by precomputing the CRC of every byte value.
I've tried refactoring your code into a class following a similar api to the standard Python hashlib:
class crc32:
def __init__(self):
self.nchars = 0
self.crc = 0
def update(self, buf):
crc = self.crc
for c in buf:
crc = crctab[(crc >> 24) ^ c] ^ ((crc << 8) & 0xffffffff)
self.crc = crc
self.nchars += len(buf)
def digest(self):
crc = self.crc
n = self.nchars
while n:
c = n & 0xff
crc = crctab[(crc >> 24) ^ c] ^ ((crc << 8) & 0xffffffff)
n >>= 8
return UNSIGNED(~crc)
I've expanded out UNSIGNED to try and make it faster and reordered some of the statements to be more similar to the standard zlib library (as used by Python) while I was trying to understand the differences. It seems they use a different polynomial to generate the table, but otherwise it's the same.
The above code can be used as:
with open('largefile', 'rb') as fd:
digest = crc32()
while buf := fd.read(4096):
digest.update(buf)
print(digest.digest())
which prints out the expected 1135714720 for a file created by:
echo -n hello world > test.txt
The above code should work for large files, but given the performance of Python this would take far too long to be useful. A 75MB file I have takes ~11 seconds, while cksum takes just ~0.2 seconds. You should be able to get somewhere with using Cython to speed it up, but that's a bit more fiddly and if your're struggling with the existing code it's going to be quite a learning curve!
I've had another play and got performance similar to cksum with Cython, the code looks like:
cdef unsigned int *crctab_c = [
// copy/paste crctab from above
]
cdef class crc32_c:
cdef unsigned int crc, nchars
def __init__(self):
self.nchars = 0
self.crc = 0
cdef _update(self, bytes buf):
cdef unsigned int crc, i, j
cdef unsigned char c
crc = self.crc
for c in buf:
i = (crc >> 24) ^ c
j = crc << 8
crc = crctab_c[i] ^ j
self.crc = crc
self.nchars += len(buf)
def update(self, buf):
return self._update(buf)
def digest(self):
crc = self.crc
n = self.nchars
while n:
c = n & 0xff
crc = crctab_c[(crc >> 24) ^ c] ^ ((crc << 8) & 0xffffffff)
n >>= 8
return (~crc) & 0xffffffff
after compiling this code with Cython, it can be used in a similar manner to the previous class. Performance is pretty good: Python now takes ~200ms for a 75MiB file and is basically the same as cksum, but much slower than zlib which only takes ~80ms.
I want to write a Python program that makes PNG files. My big problem is with generating the CRC and the data in the IDAT chunk. Python 2.6.4 does have a zlib module, but there are extra settings needed. The PNG specification REQUIRES the IDAT data to be compressed with zlib's deflate method with a window size of 32768 bytes, but I can't find how to set those parameters in the Python zlib module.
As for the CRC for each chunk, the zlib module documentation indicates that it contains a CRC function. I believe that calling that CRC function as crc32(data,-1) will generate the CRC that I need, though if necessary I can translate the C code given in the PNG specification.
Note that I can generate the rest of the PNG file and the data that is to be compressed for the IDAT chunk, I just don't know how to properly compress the image data for the IDAT chunk after implementing the initial filtering step.
EDITED:
The problem with PyPNG is that it will not write tEXt chunks. A minor annoyance is that one has to manipulate the image as (R, G, B) data; I'd prefer to manipulate palette values of the pixels directly and then define the associations between palette values and color data. I'm also left unsure if PyPNG takes advantage of the "compression" allowed by using 1-, 2-, and 4- bit palette values in the image data to fit more than one pixel in a byte.
Even if you can't use PyPNG for the tEXt chunk reason, you can use its code! (it's MIT licensed). Here's how a chunk is written:
def write_chunk(outfile, tag, data=''):
"""
Write a PNG chunk to the output file, including length and
checksum.
"""
# http://www.w3.org/TR/PNG/#5Chunk-layout
outfile.write(struct.pack("!I", len(data)))
outfile.write(tag)
outfile.write(data)
checksum = zlib.crc32(tag)
checksum = zlib.crc32(data, checksum)
outfile.write(struct.pack("!i", checksum))
Note the use of zlib.crc32 to create the CRC checksum, and also note how the checksum runs over both the tag and the data.
For compressing the IDAT chunks you basically just use zlib. As others have noted the adler checksum and the default window size are all okay (by the way the PNG spec does not require a window size of 32768, it requires that the window is at most 32768 bytes; this is all a bit odd, because in any case 32768 is the maximum window size permitted by the current version of the zlib spec).
The code to do this in PyPNG is not particular great, see the write_passes() function. The bit that actually compresses the data and writes a chunk is this:
compressor = zlib.compressobj()
compressed = compressor.compress(tostring(data))
if len(compressed):
# print >> sys.stderr, len(data), len(compressed)
write_chunk(outfile, 'IDAT', compressed)
PyPNG never uses scanline filtering. Partly this is because it would be very slow in Python, partly because I haven't written the code. If you have Python code to do filtering, it would be a most welcome contribution to PyPNG. :)
Short answer: (1) "deflate" and "32Kb window" are the defaults (2) uses adler32 not crc32
Long answer:
""" The PNG specification REQUIRES the IDAT data to be compressed with zlib's deflate method with a window size of 32768 bytes, but I can't find how to set those parameters in the Python zlib module. """
You don't need to set them. Those are the defaults.
If you really want to specify non-default arguments to zlib, you can use zlib.compressobj() ... it has several args that are not documented in the Python docs. Reading material:
source: Python's gzip.py (see how it calls zlib.compressobj)
source: Python's zlibmodule.c (see its defaults)
SO: This question (see answers by MizardX and myself, and comments on each)
docs: The manual on the zlib site
"""As for the CRC for each chunk, the zlib module documentation indicates that it contains a CRC function. I believe that calling that CRC function as crc32(data,-1) will generate the CRC that I need, though if necessary I can translate the C code given in the PNG specification."""
Please check out the zlib specification aka RFC 1950 ... it says that the checksum used is adler32
The zlib compress or compressobj output will include the appropriate CRC; why do you think that you will need to do it yourself?
Edit So you do need a CRC-32. Good news: zlib.crc32() will do the job:
Code:
import zlib
crc_table = None
def make_crc_table():
global crc_table
crc_table = [0] * 256
for n in xrange(256):
c = n
for k in xrange(8):
if c & 1:
c = 0xedb88320L ^ (c >> 1)
else:
c = c >> 1
crc_table[n] = c
make_crc_table()
"""
/* Update a running CRC with the bytes buf[0..len-1]--the CRC
should be initialized to all 1's, and the transmitted value
is the 1's complement of the final running CRC (see the
crc() routine below)). */
"""
def update_crc(crc, buf):
c = crc
for byte in buf:
c = crc_table[int((c ^ ord(byte)) & 0xff)] ^ (c >> 8)
return c
# /* Return the CRC of the bytes buf[0..len-1]. */
def crc(buf):
return update_crc(0xffffffffL, buf) ^ 0xffffffffL
if __name__ == "__main__":
tests = [
"",
"\x00",
"\x01",
"Twas brillig and the slithy toves did gyre and gimble in the wabe",
]
for test in tests:
model = crc(test) & 0xFFFFFFFFL
zlib_result = zlib.crc32(test) & 0xFFFFFFFFL
print (model, zlib_result, model == zlib_result)
Output from Python 2.7 is below. Also tested with Python 2.1 to 2.6 inclusive and 1.5.2 JFTHOI.
(0L, 0L, True)
(3523407757L, 3523407757L, True)
(2768625435L, 2768625435L, True)
(4186783197L, 4186783197L, True)
Don't you want to use some existing software to generate your PNGs? How about PyPNG?
There are libraries that can write PNG files for you, such as PIL. That will be easier and faster, and as an added bonus you can read and write tons of formats.
It looks like you will have to resort to call zlib "by hand" using ctypes --
It is not that hard:
>>> import ctypes
>>> z = ctypes.cdll.LoadLibrary("libzip.so.1")
>>> z.zlibVersion.restype=ctypes.c_char_p
>>> z.zlibVersion()
'1.2.3'
You can check the zlib library docmentation here: http://zlib.net/manual.html
The zlib.crc32 works fine, and the zlib compressor has correct defaults for png generation.
For the casual reader who looks for png generation from Python code, here is a complete example that you can use as a starter for your own png generator code - all you need is the standard zlib module and some bytes-encoding:
#! /usr/bin/python
""" Converts a list of list into gray-scale PNG image. """
__copyright__ = "Copyright (C) 2014 Guido Draheim"
__licence__ = "Public Domain"
import zlib
import struct
def makeGrayPNG(data, height = None, width = None):
def I1(value):
return struct.pack("!B", value & (2**8-1))
def I4(value):
return struct.pack("!I", value & (2**32-1))
# compute width&height from data if not explicit
if height is None:
height = len(data) # rows
if width is None:
width = 0
for row in data:
if width < len(row):
width = len(row)
# generate these chunks depending on image type
makeIHDR = True
makeIDAT = True
makeIEND = True
png = b"\x89" + "PNG\r\n\x1A\n".encode('ascii')
if makeIHDR:
colortype = 0 # true gray image (no palette)
bitdepth = 8 # with one byte per pixel (0..255)
compression = 0 # zlib (no choice here)
filtertype = 0 # adaptive (each scanline seperately)
interlaced = 0 # no
IHDR = I4(width) + I4(height) + I1(bitdepth)
IHDR += I1(colortype) + I1(compression)
IHDR += I1(filtertype) + I1(interlaced)
block = "IHDR".encode('ascii') + IHDR
png += I4(len(IHDR)) + block + I4(zlib.crc32(block))
if makeIDAT:
raw = b""
for y in xrange(height):
raw += b"\0" # no filter for this scanline
for x in xrange(width):
c = b"\0" # default black pixel
if y < len(data) and x < len(data[y]):
c = I1(data[y][x])
raw += c
compressor = zlib.compressobj()
compressed = compressor.compress(raw)
compressed += compressor.flush() #!!
block = "IDAT".encode('ascii') + compressed
png += I4(len(compressed)) + block + I4(zlib.crc32(block))
if makeIEND:
block = "IEND".encode('ascii')
png += I4(0) + block + I4(zlib.crc32(block))
return png
def _example():
with open("cross3x3.png","wb") as f:
f.write(makeGrayPNG([[0,255,0],[255,255,255],[0,255,0]]))