I have a problem that i need to validate huge compressed files after download (usually more than 10-20gb per file) against reference checksums that have apparently been generated using cksum
(To be more precise: My python script needs to download large compressed files from the ncbi ftp-server that was supposed to provide md5 checksums for validating the downloads, but instead only provided some different unspecified filehash/checksum values. After some trial and error I found that these checksums were identical to the output of the unix tool cksum, which apparently genereates CRC-checksums. So to compare/validate these i need to generate cksum-equivalent checksums for the downloaded files.)
It appears that the unix tool cksum yields totally different checksum values than the supposed equivalent unix tool crc32 (or the python zlib.crc32() function, for that matter). When googling the problem I could not understand the explanations for why this occurs, especially since they appear to be identical on some systems? So maybe this is because I work on a 64 bit system (but then: who doesn't nowadays)?
using built-in python modules I can easily generate md5- and CRC32 checksums, but none of these are equivalent to the cksum output, neither in decimal nor in hexadecimal representation.
I did find a previous post here on stackoverflow pointing to a snippet that seems to solve this. But while it works for small files, A.) I do not understand a word of it, so I have a hard time adapting it and B.) it does not seem to work well with large files.
for completeness sake: here is the snippet (python3 version):
#!/usr/bin/env python
import sys
crctab = [ 0x00000000, 0x04c11db7, 0x09823b6e, 0x0d4326d9, 0x130476dc,
0x17c56b6b, 0x1a864db2, 0x1e475005, 0x2608edb8, 0x22c9f00f,
0x2f8ad6d6, 0x2b4bcb61, 0x350c9b64, 0x31cd86d3, 0x3c8ea00a,
0x384fbdbd, 0x4c11db70, 0x48d0c6c7, 0x4593e01e, 0x4152fda9,
0x5f15adac, 0x5bd4b01b, 0x569796c2, 0x52568b75, 0x6a1936c8,
0x6ed82b7f, 0x639b0da6, 0x675a1011, 0x791d4014, 0x7ddc5da3,
0x709f7b7a, 0x745e66cd, 0x9823b6e0, 0x9ce2ab57, 0x91a18d8e,
0x95609039, 0x8b27c03c, 0x8fe6dd8b, 0x82a5fb52, 0x8664e6e5,
0xbe2b5b58, 0xbaea46ef, 0xb7a96036, 0xb3687d81, 0xad2f2d84,
0xa9ee3033, 0xa4ad16ea, 0xa06c0b5d, 0xd4326d90, 0xd0f37027,
0xddb056fe, 0xd9714b49, 0xc7361b4c, 0xc3f706fb, 0xceb42022,
0xca753d95, 0xf23a8028, 0xf6fb9d9f, 0xfbb8bb46, 0xff79a6f1,
0xe13ef6f4, 0xe5ffeb43, 0xe8bccd9a, 0xec7dd02d, 0x34867077,
0x30476dc0, 0x3d044b19, 0x39c556ae, 0x278206ab, 0x23431b1c,
0x2e003dc5, 0x2ac12072, 0x128e9dcf, 0x164f8078, 0x1b0ca6a1,
0x1fcdbb16, 0x018aeb13, 0x054bf6a4, 0x0808d07d, 0x0cc9cdca,
0x7897ab07, 0x7c56b6b0, 0x71159069, 0x75d48dde, 0x6b93dddb,
0x6f52c06c, 0x6211e6b5, 0x66d0fb02, 0x5e9f46bf, 0x5a5e5b08,
0x571d7dd1, 0x53dc6066, 0x4d9b3063, 0x495a2dd4, 0x44190b0d,
0x40d816ba, 0xaca5c697, 0xa864db20, 0xa527fdf9, 0xa1e6e04e,
0xbfa1b04b, 0xbb60adfc, 0xb6238b25, 0xb2e29692, 0x8aad2b2f,
0x8e6c3698, 0x832f1041, 0x87ee0df6, 0x99a95df3, 0x9d684044,
0x902b669d, 0x94ea7b2a, 0xe0b41de7, 0xe4750050, 0xe9362689,
0xedf73b3e, 0xf3b06b3b, 0xf771768c, 0xfa325055, 0xfef34de2,
0xc6bcf05f, 0xc27dede8, 0xcf3ecb31, 0xcbffd686, 0xd5b88683,
0xd1799b34, 0xdc3abded, 0xd8fba05a, 0x690ce0ee, 0x6dcdfd59,
0x608edb80, 0x644fc637, 0x7a089632, 0x7ec98b85, 0x738aad5c,
0x774bb0eb, 0x4f040d56, 0x4bc510e1, 0x46863638, 0x42472b8f,
0x5c007b8a, 0x58c1663d, 0x558240e4, 0x51435d53, 0x251d3b9e,
0x21dc2629, 0x2c9f00f0, 0x285e1d47, 0x36194d42, 0x32d850f5,
0x3f9b762c, 0x3b5a6b9b, 0x0315d626, 0x07d4cb91, 0x0a97ed48,
0x0e56f0ff, 0x1011a0fa, 0x14d0bd4d, 0x19939b94, 0x1d528623,
0xf12f560e, 0xf5ee4bb9, 0xf8ad6d60, 0xfc6c70d7, 0xe22b20d2,
0xe6ea3d65, 0xeba91bbc, 0xef68060b, 0xd727bbb6, 0xd3e6a601,
0xdea580d8, 0xda649d6f, 0xc423cd6a, 0xc0e2d0dd, 0xcda1f604,
0xc960ebb3, 0xbd3e8d7e, 0xb9ff90c9, 0xb4bcb610, 0xb07daba7,
0xae3afba2, 0xaafbe615, 0xa7b8c0cc, 0xa379dd7b, 0x9b3660c6,
0x9ff77d71, 0x92b45ba8, 0x9675461f, 0x8832161a, 0x8cf30bad,
0x81b02d74, 0x857130c3, 0x5d8a9099, 0x594b8d2e, 0x5408abf7,
0x50c9b640, 0x4e8ee645, 0x4a4ffbf2, 0x470cdd2b, 0x43cdc09c,
0x7b827d21, 0x7f436096, 0x7200464f, 0x76c15bf8, 0x68860bfd,
0x6c47164a, 0x61043093, 0x65c52d24, 0x119b4be9, 0x155a565e,
0x18197087, 0x1cd86d30, 0x029f3d35, 0x065e2082, 0x0b1d065b,
0x0fdc1bec, 0x3793a651, 0x3352bbe6, 0x3e119d3f, 0x3ad08088,
0x2497d08d, 0x2056cd3a, 0x2d15ebe3, 0x29d4f654, 0xc5a92679,
0xc1683bce, 0xcc2b1d17, 0xc8ea00a0, 0xd6ad50a5, 0xd26c4d12,
0xdf2f6bcb, 0xdbee767c, 0xe3a1cbc1, 0xe760d676, 0xea23f0af,
0xeee2ed18, 0xf0a5bd1d, 0xf464a0aa, 0xf9278673, 0xfde69bc4,
0x89b8fd09, 0x8d79e0be, 0x803ac667, 0x84fbdbd0, 0x9abc8bd5,
0x9e7d9662, 0x933eb0bb, 0x97ffad0c, 0xafb010b1, 0xab710d06,
0xa6322bdf, 0xa2f33668, 0xbcb4666d, 0xb8757bda, 0xb5365d03,
0xb1f740b4 ]
UNSIGNED = lambda n: n & 0xffffffff
def memcrc(b):
n = len(b)
i = c = s = 0
for c in b:
tabidx = (s>>24)^c
s = UNSIGNED((s << 8)) ^ crctab[tabidx]
while n:
c = n & 0o0377
n = n >> 8
s = UNSIGNED(s << 8) ^ crctab[(s >> 24) ^ c]
return UNSIGNED(~s)
if __name__ == '__main__':
fname = sys.argv[-1]
buffer = open(fname, 'rb').read()
print("%d\t%d\t%s" % (memcrc(buffer), len(buffer), fname))
Could someone please help me understand this?
what exactly is the problem with the difference between cksum and crc32?
is it simply the fact that the one is 32bit based and the other 64 bit?
Can i simply convert between the values produced by both, and if yes how?
what is the purpose of the crctab in the above snippet and how does the conversion work there?
I don't know the why part of your question. All I can say is that the great thing about standards is that you have so many to choose from.
cksum is specified by POSIX to use a different CRC than the more common CRC-32 you find in zlib, Python, used in zip and gzip files, etc. The CRC-32/CKSUM has this specification (from Greg Cook's CRC catalog):
width=32 poly=0x04c11db7 init=0x00000000 refin=false refout=false xorout=0xffffffff check=0x765e7680 residue=0xc704dd7b name="CRC-32/CKSUM"
The more common CRC-32 has this specification:
width=32 poly=0x04c11db7 init=0xffffffff refin=true refout=true xorout=0xffffffff check=0xcbf43926 residue=0xdebb20e3 name="CRC-32/ISO-HDLC"
The cksum utility on my system (macOS) computes the CRC-32/CKSUM, but it also has options to compute the CRC-32/ISO_HDLC, as well as two other actual checksums, the first from the BSD Unix sum command, and the second from the AT&T System V Unix sum command.
There is apparently no shortage of results that cksum might produce.
No, it has nothing to do with 32 vs. 64 bit systems.
No, you cannot convert between the values.
The purpose of the table is to speed up the CRC calculation by precomputing the CRC of every byte value.
I've tried refactoring your code into a class following a similar api to the standard Python hashlib:
class crc32:
def __init__(self):
self.nchars = 0
self.crc = 0
def update(self, buf):
crc = self.crc
for c in buf:
crc = crctab[(crc >> 24) ^ c] ^ ((crc << 8) & 0xffffffff)
self.crc = crc
self.nchars += len(buf)
def digest(self):
crc = self.crc
n = self.nchars
while n:
c = n & 0xff
crc = crctab[(crc >> 24) ^ c] ^ ((crc << 8) & 0xffffffff)
n >>= 8
return UNSIGNED(~crc)
I've expanded out UNSIGNED to try and make it faster and reordered some of the statements to be more similar to the standard zlib library (as used by Python) while I was trying to understand the differences. It seems they use a different polynomial to generate the table, but otherwise it's the same.
The above code can be used as:
with open('largefile', 'rb') as fd:
digest = crc32()
while buf := fd.read(4096):
digest.update(buf)
print(digest.digest())
which prints out the expected 1135714720 for a file created by:
echo -n hello world > test.txt
The above code should work for large files, but given the performance of Python this would take far too long to be useful. A 75MB file I have takes ~11 seconds, while cksum takes just ~0.2 seconds. You should be able to get somewhere with using Cython to speed it up, but that's a bit more fiddly and if your're struggling with the existing code it's going to be quite a learning curve!
I've had another play and got performance similar to cksum with Cython, the code looks like:
cdef unsigned int *crctab_c = [
// copy/paste crctab from above
]
cdef class crc32_c:
cdef unsigned int crc, nchars
def __init__(self):
self.nchars = 0
self.crc = 0
cdef _update(self, bytes buf):
cdef unsigned int crc, i, j
cdef unsigned char c
crc = self.crc
for c in buf:
i = (crc >> 24) ^ c
j = crc << 8
crc = crctab_c[i] ^ j
self.crc = crc
self.nchars += len(buf)
def update(self, buf):
return self._update(buf)
def digest(self):
crc = self.crc
n = self.nchars
while n:
c = n & 0xff
crc = crctab_c[(crc >> 24) ^ c] ^ ((crc << 8) & 0xffffffff)
n >>= 8
return (~crc) & 0xffffffff
after compiling this code with Cython, it can be used in a similar manner to the previous class. Performance is pretty good: Python now takes ~200ms for a 75MiB file and is basically the same as cksum, but much slower than zlib which only takes ~80ms.
I googled this issue for last 2 weeks and wasn't able to find an algorithm or solution. I have some short .wav file but it has MULAW compression and python doesn't seem to have function inside wave.py that can successfully decompresses it. So I've taken upon myself to build a decoder in python.
I've found some info about MULAW in basic elements:
Wikipedia
A-law u-Law comparison
Some c-esc codec library
So I need some guidance, since I don't know how to approach getting from signed short integer to a full wave signal. This is my initial thought from what I've gathered so far:
So from wiki I've got a equation for u-law compression and decompression :
compression :
decompression :
So judging by compression equation, it looks like the output is limited to a float range of -1 to +1 , and with signed short integer from –32,768 to 32,767 so it looks like I would need to convert it from short int to float in specific range.
Now, to be honest, I've heard of quantisation before, but I am not sure if I should first try and dequantize and then decompress or in the other way, or even if in this case it is the same thing... the tutorials/documentation can be a bit of tricky with terminology.
The wave file I am working with is supposed to contain 'A' sound like for speech synthesis, I could probably verify success by comparing 2 waveforms in some audio software and custom wave analyzer but I would really like to diminish trial and error section of this process.
So what I've had in mind:
u = 0xff
data_chunk = b'\xe7\xe7' # -6169
data_to_r1 = unpack('h',data_chunk)[0]/0xffff # I suspect this is wrong,
# # but I don't know what else
u_law = ( -1 if data_chunk<0 else 1 )*( pow( 1+u, abs(data_to_r1)) -1 )/u
So is there some sort of algorithm or crucial steps I would need to take in form of first: decompression, second: quantisation : third ?
Since everything I find on google is how to read a .wav PCM-modulated file type, not how to manage it if wild compression arises.
So, after scouring the google the solution was found in github ( go figure ). I've searched for many many algorithms and found 1 that is within bounds of error for lossy compression. Which is for u law for positive values from 30 -> 1 and for negative values from -32 -> -1
To be honest i think this solution is adequate but not quite per equation per say, but it is best solution for now. This code is transcribed to python directly from gcc9108 audio codec
def uLaw_d(i8bit):
bias = 33
sign = pos = 0
decoded = 0
i8bit = ~i8bit
if i8bit&0x80:
i8bit &= ~(1<<7)
sign = -1
pos = ( (i8bit&0xf0) >> 4 ) + 5
decoded = ((1 << pos) | ((i8bit & 0x0F) << (pos - 4)) | (1 << (pos - 5))) - bias
return decoded if sign else ~decoded
def uLaw_e(i16bit):
MAX = 0x1fff
BIAS = 33
mask = 0x1000
sign = lsb = 0
pos = 12
if i16bit < 0:
i16bit = -i16bit
sign = 0x80
i16bit += BIAS
if ( i16bit>MAX ): i16bit = MAX
for x in reversed(range(pos)):
if i16bit&mask != mask and pos>=5:
pos = x
break
lsb = ( i16bit>>(pos-4) )&0xf
return ( ~( sign | ( pos<<4 ) | lsb ) )
With test:
print( 'normal :\t{0}\t|\t{0:2X}\t:\t{0:016b}'.format(0xff) )
print( 'encoded:\t{0}\t|\t{0:2X}\t:\t{0:016b}'.format(uLaw_e(0xff)) )
print( 'decoded:\t{0}\t|\t{0:2X}\t:\t{0:016b}'.format(uLaw_d(uLaw_e(0xff))) )
and output:
normal : 255 | FF : 0000000011111111
encoded: -179 | -B3 : -000000010110011
decoded: 263 | 107 : 0000000100000111
And as you can see 263-255 = 8 which is within bounds. When i tried to implement seeemmmm method described in G.711 ,that kind user Oliver Charlesworth suggested that i look in to , the decoded value for maximum in data was -8036 which is close to the maximum of uLaw spec, but i couldn't reverse engineer decoding function to get binary equivalent of function from wikipedia.
Lastly, i must say that i am currently disappointed that python library doesn't support all kind of compression algorithms since it is not just a tool that people use, it is also a resource python consumers learn from since most of data for further dive into code isn't readily available or understandable.
EDIT
After decoding the data and writing wav file via wave.py i've successfully succeeded to write a new raw linear PCM file. This works... even though i was sceptical at first.
EDIT 2: ::> you can find real solution oncompressions.py
I find this helpful for converting to/from ulaw with numpy arrays.
import audioop
def numpy_audioop_helper(x, xdtype, func, width, ydtype):
'''helper function for using audioop buffer conversion in numpy'''
xi = np.asanyarray(x).astype(xdtype)
if np.any(x != xi):
xinfo = np.iinfo(xdtype)
raise ValueError("input must be %s [%d..%d]" % (xdtype, xinfo.min, xinfo.max))
y = np.frombuffer(func(xi.tobytes(), width), dtype=ydtype)
return y.reshape(xi.shape)
def audioop_ulaw_compress(x):
return numpy_audioop_helper(x, np.int16, audioop.lin2ulaw, 2, np.uint8)
def audioop_ulaw_expand(x):
return numpy_audioop_helper(x, np.uint8, audioop.ulaw2lin, 2, np.int16)
Python actually supports decoding u-Law out of the box:
audioop.ulaw2lin(fragment, width)
Convert sound fragments in u-LAW encoding to linearly encoded sound fragments. u-LAW encoding always uses 8 bits samples, so width
refers only to the sample width of the output fragment here.
https://docs.python.org/3/library/audioop.html#audioop.ulaw2lin