python-lzw doesn't decompress larger blobs

I am new to Python, and we have been trying to use the LZW code from GitHub in our program:
https://github.com/joeatwork/python-lzw/blob/master/lzw/init.py
It works well for smaller blobs, but once the blob size increases it does not decompress the whole blob. I have been reading the documentation, but I don't understand the part below, which might be the reason the full blob is not being decompressed.
I have also attached a snippet of the Python code I am using.
Our control codes are
- CLEAR_CODE (codepoint 256). When this code is encountered, we flush
the codebook and start over.
- END_OF_INFO_CODE (codepoint 257). This code is reserved for
encoder/decoders over the integer codepoint stream (like the
mechanical bit that unpacks bits into codepoints)
When dealing with bytes, codes are emitted as variable
length bit strings packed into the stream of bytes.
codepoints are written with varying length
- initially 9 bits
- at 512 entries 10 bits
- at 1025 entries at 11 bits
- at 2048 entries 12 bits
- with max of 4095 entries in a table (including Clear and EOI)
code points are stored with their MSB in the most significant bit
available in the output character.
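To make the packing rule in the quoted documentation concrete, here is a minimal sketch of MSB-first packing and unpacking of codepoints. It is not the library's code: the names pack_codes and unpack_codes are hypothetical, and the width is fixed at 9 bits for simplicity, whereas a real stream grows the width as the table fills.

```python
def pack_codes(codes, width=9):
    """Pack integer codepoints MSB-first into bytes, zero-padding the last byte."""
    # Assumption: fixed code width; real LZW streams widen codes over time.
    bits = "".join(format(code, f"0{width}b") for code in codes)
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def unpack_codes(data, width=9):
    """Read fixed-width codepoints MSB-first, ignoring trailing pad bits."""
    bits = "".join(format(byte, "08b") for byte in data)
    usable = len(bits) - len(bits) % width
    return [int(bits[i:i + width], 2) for i in range(0, usable, width)]


# CLEAR_CODE (256), the byte for "A" (65), END_OF_INFO_CODE (257)
packed = pack_codes([256, 65, 257])
print(packed.hex(), unpack_codes(packed))
```

Because codes straddle byte boundaries, cutting bytes off the end of a buffer removes whole or partial codes from the tail of the stream.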
My code snippet:
    def decompress_without_eoi(buf):
        # Decompress LZW into a bytes, ignoring End of Information code
        def gen():
            try:
                for byte in lzw.decompress(buf):
                    yield byte
            except ValueError as exc:
                #print(repr(exc))
                if 'End of information code' in repr(exc):
                    #print('Ignoring EOI error..\n')
                    pass
                else:
                    raise
                return
        try:
            #print('Trying a join..\n')
            deblob = b''.join(gen())
        except Exception as exc2:
            #print(repr(exc2))
            #print('Trying byte by byte..')
            deblob=[]
            try:
                for byte in gen():
                    deblob.append(byte)
            except Exception as exc3:
                #print(repr(exc3))
                return b''.join(deblob)
        return deblob
    #current function to deblob
    def deblob3(row):
        if pd.notnull(row[0]):
            blob = row[0]
            h = html2text.HTML2Text()
            h.ignore_links=True
            h.ignore_images = True #zzzz
            if type(blob) != bytes:
                blobbytes = blob.read()[:-10]
            else:
                blobbytes = blob[:-10]
            if row[1]==361:
                # If compressed, return up to EOI-257 code, which is last non-null code before tag
                # print (row[0])
                return h.handle(striprtf(decompress_without_eoi(blobbytes)))
            elif row[1]==360:
                # If uncompressed, return up to tag
                return h.handle(striprtf(blobbytes))
The function is called as follows:
    nf['IS_BLOB'] = nf[['IS_BLOB','COMPRESSION']].apply(deblob3,axis=1)

Related

Zeroing/blacking out pixels in a .tiff-like file (.svs or .ndpi)

I am trying to zero out the pixels in some .tiff-like biomedical scans (.svs & .ndpi) by changing values directly in the binary file.
For reference, I am using the docs on the .tiff format here.
As a sanity check, I've confirmed that the first two bytes have the values 73 and 73 (I and I in ASCII), meaning the file is little-endian, and that the next two bytes hold the value 42 (both are expected according to the docs just mentioned).
I wrote a Python script that reads the IFD (Image File Directory) and its components, but I am having trouble proceeding from there.
My code is this:
    with open('scan.svs', "rb") as f:
        # Read away the first 4 bytes:
        f.read(4)
        # Read offset of first IFD as the four next bytes:
        IFD_offset = int.from_bytes(f.read(4), 'little')
        # Move to IFD:
        f.seek(IFD_offset, 0)
        # Read IFD:
        IFD = f.read(12)
        # Get components of IFD:
        tag = int.from_bytes(IFD[:2], 'little')
        field_type = int.from_bytes(IFD[2:4], 'little')
        count = int.from_bytes(IFD[4:8], 'little')
        value_offset = int.from_bytes(IFD[8:], 'little')
        # Now what?
The values for the components are tag=16, field_type=254, count=65540 and value_offset=0.
How do I go from there?
PS: Using Python is not a must, if there is some other tool that could do the job more easily.
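One possible way forward, sketched under the assumption that the file follows the classic (non-BigTIFF) TIFF 6.0 layout: an IFD starts with a 2-byte entry count, followed by that many 12-byte entries. The values reported above fit that layout if the first 12 bytes are read without skipping the count: 16 would be the number of entries, 254 the first entry's tag (NewSubfileType), and 65540 = 0x00010004 the field type 4 (LONG) fused with the low half of a count of 1. The function name read_first_ifd is hypothetical.

```python
import io
import struct


def read_first_ifd(f):
    """Return (tag, field_type, count, raw_value) tuples for the first IFD.

    Assumes a classic TIFF header: byte order mark, magic 42, 4-byte offset.
    """
    header = f.read(8)
    order = {b"II": "<", b"MM": ">"}.get(header[:2])
    if order is None:
        raise ValueError("unknown byte order mark")
    magic, ifd_offset = struct.unpack(order + "HI", header[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    f.seek(ifd_offset)
    # The IFD begins with the number of entries, not with the first entry.
    (num_entries,) = struct.unpack(order + "H", f.read(2))
    entries = []
    for _ in range(num_entries):
        # raw_value is kept as 4 bytes: it holds either the value itself
        # or an offset to it, depending on field type and count.
        entries.append(struct.unpack(order + "HHI4s", f.read(12)))
    return entries


# Tiny in-memory file with a single entry, standing in for a real scan.
entry = struct.pack("<HHI4s", 254, 4, 1, bytes(4))
tiny = b"II" + struct.pack("<HI", 42, 8) + struct.pack("<H", 1) + entry + bytes(4)
print(read_first_ifd(io.BytesIO(tiny)))
```

Locating the pixel data would then mean looking up the strip or tile offset and byte count tags among these entries; whether zeroing those bytes is safe depends on the compression the file uses.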

How to fix this IO bound python operation on 12GB .bin file?

I'm reading the book Hands-On Machine Learning for Algorithmic Trading and came across a script that parses a large .bin binary file and converts it to .h5. The file consists of something called ITCH data; you can find the technical documentation of the data here. The script is very inefficient: it reads a 12 GB (12952050754 bytes) file 2 bytes at a time, which is extremely slow (it can take up to 4 hours on a decent 4-CPU GCP instance), which is not very surprising. You can find the whole notebook here.
My problem is that I don't understand how this .bin file is being read. I don't see why the file has to be read 2 bytes at a time, and I think there is a way to read it with a large buffer, but I'm not sure how to do it. If the script is still slow after optimizing, I could convert it to C++, provided I understand the inner workings of this I/O process. Does anyone have suggestions?
Here's a link to the source of the ITCH data; it includes small files (300 MB or less) covering shorter time periods if you need to experiment with the code.
The bottleneck:
    with file_name.open('rb') as data:
        while True:
            # determine message size in bytes
            message_size = int.from_bytes(data.read(2), byteorder='big', signed=False)
            # get message type by reading first byte
            message_type = data.read(1).decode('ascii')
            message_type_counter.update([message_type])
            # read & store message
            record = data.read(message_size - 1)
            message = message_fields[message_type]._make(unpack(fstring[message_type], record))
            messages[message_type].append(message)
            # deal with system events
            if message_type == 'S':
                seconds = int.from_bytes(message.timestamp, byteorder='big') * 1e-9
                print('\n', event_codes.get(message.event_code.decode('ascii'), 'Error'))
                print(f'\t{format_time(seconds)}\t{message_count:12,.0f}')
                if message.event_code.decode('ascii') == 'C':
                    store_messages(messages)
                    break
            message_count += 1
            if message_count % 2.5e7 == 0:
                seconds = int.from_bytes(message.timestamp, byteorder='big') * 1e-9
                d = format_time(time() - start)
                print(f'\t{format_time(seconds)}\t{message_count:12,.0f}\t{d}')
                res = store_messages(messages)
                if res == 1:
                    print(pd.Series(dict(message_type_counter)).sort_values())
                    break
                messages.clear()
And here's the store_messages() function:
    def store_messages(m):
        """Handle occasional storing of all messages"""
        with pd.HDFStore(itch_store) as store:
            for mtype, data in m.items():
                # convert to DataFrame
                data = pd.DataFrame(data)
                # parse timestamp info
                data.timestamp = data.timestamp.apply(int.from_bytes, byteorder='big')
                data.timestamp = pd.to_timedelta(data.timestamp)
                # apply alpha formatting
                if mtype in alpha_formats.keys():
                    data = format_alpha(mtype, data)
                s = alpha_length.get(mtype)
                if s:
                    s = {c: s.get(c) for c in data.columns}
                dc = ['stock_locate']
                if m == 'R':
                    dc.append('stock')
                try:
                    store.append(mtype,
                                 data,
                                 format='t',
                                 min_itemsize=s,
                                 data_columns=dc)
                except Exception as e:
                    print(e)
                    print(mtype)
                    print(data.info())
                    print(pd.Series(list(m.keys())).value_counts())
                    data.to_csv('data.csv', index=False)
                    return 1
        return 0

According to the code, the file format is 2 bytes of message size, followed by one byte of message type and then the remaining bytes of the message. The size read first counts the type byte, which is why the code reads message_size - 1 bytes afterwards.
The low-hanging fruit is to read the first 3 bytes in one call, convert bytes [0:2] to the integer message size and byte [2] to the message type, and then read the rest of the message.
To further reduce the number of reads, you could read a fixed amount of data from the file into a buffer and extract messages from it. While extracting, keep an index of the bytes already processed; once that index, or the index plus the amount of data to be read, goes past the end of the buffer, refill the buffer. This could lead to huge memory requirements if not done properly, though.
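The buffered approach described above could look roughly like the following sketch. It assumes the record layout inferred from the question's code (2-byte big-endian size, then one type byte, then size - 1 payload bytes); the function name iter_messages and the 1 MiB chunk size are illustrative choices, not part of the original notebook.

```python
import io
import struct


def iter_messages(stream, chunk_size=1 << 20):
    """Yield (message_type, payload) from a stream of length-prefixed records."""
    buf = b""
    pos = 0  # index of the first unprocessed byte in buf
    while True:
        available = len(buf) - pos
        if available >= 2:
            (size,) = struct.unpack_from(">H", buf, pos)
            needed = 2 + size
        else:
            needed = 2
        if available < needed:
            chunk = stream.read(chunk_size)
            if not chunk:
                return  # end of input; a trailing partial record is dropped
            # Keep only the unprocessed tail, so memory stays bounded.
            buf = buf[pos:] + chunk
            pos = 0
            continue
        message_type = buf[pos + 2:pos + 3].decode("ascii")
        payload = buf[pos + 3:pos + 2 + size]
        yield message_type, payload
        pos += 2 + size


# Two made-up records; a tiny chunk size forces several refills.
data = struct.pack(">H", 4) + b"Sabc" + struct.pack(">H", 1) + b"A"
print(list(iter_messages(io.BytesIO(data), chunk_size=3)))
```

Each payload could then be passed to the existing unpack call in place of the record read from the file. Whether this removes the bottleneck is worth measuring first, since the per-message unpacking and the HDF5 writes may dominate the runtime.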

"Split" a image into packages of bytes

I am trying to do a project for college which consists of sending images using two Arduino Due boards and Python. I have two codes: one for the client (the one who sends the image) and one for the server (the one who receives the image). I know how to send the bytes and check if they are correct, however, I'm required to "split" the image into packages that have:
- a header that has a size of 8 bytes and must be in this order:
  - the first byte must say the payload size;
  - the next three bytes must say how many packages will be sent in total;
  - the next three bytes must say which package I'm currently at;
  - the last byte must contain a code to an error message;
- a payload containing data with a maximum size of 128 bytes;
- an end of package (EOP) sequence (in this case, 3 bytes).
I managed to create the end of package sequence and append it correctly to a payload in order to send, however I'm facing issues on creating the header.
I'm currently trying to make the following loop:
    with open(root.filename, 'rb') as f:
        picture = f.read()
    picture_size = len(picture)
    packages = ceil(picture_size/128)
    last_pack_size = (picture_size)
    EOPs = 0
    EOP_bytes = [b'\x15', b'\xff', b'\xd9']
    for p in range(1,packages):
        read_bytes = [None, int.to_bytes(picture[(p-1)*128], 1, 'big'),
                      int.to_bytes(picture[(p-1)*128 + 1], 1, 'big')]
        if p != packages:
            endrange = p*128+1
        else:
            endrange = picture_size
        for i in range((p-1)*128 + 2, endrange):
            read_bytes.append(int.to_bytes(picture[i], 1, 'big'))
            read_bytes.pop(0)
            if read_bytes == EOP_bytes:
                EOPs += 1
    print("read_bytes:", read_bytes)
    print("EOP_bytes:", EOP_bytes)
    print("EOPs", EOPs)
I expect at the end that the server receives the same amount of packages that the client has sent, and in the end I need to join the packages to recreate the image. I can manage to do that, I just need some help with creating the header.

Here is a demo of how to construct your header. It's not a complete solution, but given that you only asked for help constructing the header, it may be what you are looking for.
    headerArray = bytearray()

    def Main():
        global headerArray
        # Sample Data
        payloadSize = 254 # 0 - 254
        totalPackages = 1
        currentPackage = 1
        errorCode = 101 # 0 - 254
        AddToByteArray(payloadSize,1) # the first byte must say the payload size;
        AddToByteArray(totalPackages,3) # the next three bytes must say how many packages will be sent in total;
        AddToByteArray(currentPackage,3) # the next three bytes must say which package I'm currently at;
        AddToByteArray(errorCode,1) # the last byte must contain a code to an error message;

    def AddToByteArray(value,numberOfBytes):
        global headerArray
        allocate = value.to_bytes(numberOfBytes, 'little')
        headerArray += allocate

    Main()
    # Output
    print(f"Byte Array: {headerArray}")
    for i in range(0,len(headerArray)):
        print(f"Byte Position: {i} Value:{headerArray[i]}")
Obviously I have not included the logic to obtain the current package or total packages.
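For the missing part, a minimal sketch of how the total and current package numbers might be derived while splitting the image is shown below. Several things here are assumptions rather than requirements from the question: big-endian byte order for the multi-byte fields, package numbering starting at 1, an error code of 0 meaning "no error", and the helper names build_header and make_packages. The EOP bytes are the ones from the question's EOP_bytes list.

```python
from math import ceil

PAYLOAD_MAX = 128          # maximum payload size stated in the question
EOP = b"\x15\xff\xd9"      # end-of-package bytes from the question's code


def build_header(payload_size, total_packages, current_package, error_code=0):
    """Build the 8-byte header: 1 + 3 + 3 + 1 bytes (byte order is assumed)."""
    return (payload_size.to_bytes(1, "big")
            + total_packages.to_bytes(3, "big")
            + current_package.to_bytes(3, "big")
            + error_code.to_bytes(1, "big"))


def make_packages(data):
    """Split data into header + payload + EOP packages."""
    total = ceil(len(data) / PAYLOAD_MAX)
    packages = []
    for index in range(total):
        payload = data[index * PAYLOAD_MAX:(index + 1) * PAYLOAD_MAX]
        header = build_header(len(payload), total, index + 1)
        packages.append(header + payload + EOP)
    return packages


# 300 placeholder bytes stand in for the image contents.
for package in make_packages(bytes(300)):
    print(len(package), list(package[:8]))
```

This sketch does not handle an EOP sequence that happens to occur inside a payload; the receiver would need the payload size from the header, or some form of byte stuffing, to tell the two apart.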

Securely encrypt integers (up to 2^48) into the shortest possible URL-safe string

In my Django application I have hierarchical URL structure:
webpage.com/property/PK/sub-property/PK/ etc...
I do not want to expose primary keys and create a vulnerability. Therefore I am
encrypting all PKs into strings in all templates and URLs. This is done by the wonderful library django-encrypted-id written by this SO user.
However, the library supports up to 2^64 long integers and produces 24 characters output (22 + 2 padding). This results in huge URLs in my nested structure.
Therefore, I would like to patch the encrypting and decrypting functions and try to shorten the output. Here is the original code (+ padding handling which I added):
    # Remove the padding after encode and add it on decode
    PADDING = '=='

    def encode(the_id):
        assert 0 <= the_id < 2 ** 64
        crc = binascii.crc32(bytes(the_id)) & 0xffffff
        message = struct.pack(b"<IQxxxx", crc, the_id)
        assert len(message) == 16
        cypher = AES.new(
            settings.SECRET_KEY[:24], AES.MODE_CBC,
            settings.SECRET_KEY[-16:]
        )
        return base64.urlsafe_b64encode(cypher.encrypt(message)).rstrip(PADDING)

    def decode(e):
        if isinstance(e, basestring):
            e = bytes(e.encode("ascii"))
        try:
            e += str(PADDING)
            e = base64.urlsafe_b64decode(e)
        except (TypeError, AttributeError):
            raise ValueError("Failed to decrypt, invalid input.")
        for skey in getattr(settings, "SECRET_KEYS", [settings.SECRET_KEY]):
            cypher = AES.new(skey[:24], AES.MODE_CBC, skey[-16:])
            msg = cypher.decrypt(e)
            crc, the_id = struct.unpack("<IQxxxx", msg)
            if crc != binascii.crc32(bytes(the_id)) & 0xffffff:
                continue
            return the_id
        raise ValueError("Failed to decrypt, CRC never matched.")

    # Lets test with big numbers
    for x in range(100000000, 100000003):
        ekey = encode(x)
        pk = decode(ekey)
        print "Pk: %s Ekey: %s" % (pk, ekey)
Output (I changed the strings a bit, so don't try to hack me :P):
Pk: 100000000 Ekey: GNtOHji8rA42qfq3p5gNMI
Pk: 100000001 Ekey: tK6RcAZ2MrWmR3nB5qkQDe
Pk: 100000002 Ekey: a7VXIf8pEB6R7XvqwGQo6W
I have tried to modify everything in the encode() function but without any success. The produced string has always the length of 22.
Here is what I want:
- Keep the encryption strength near the original level, or at least do not decrease it dramatically.
- Support integers up to 2^48 (~281 trillion), or 2^40; 2^64 as it is now is too much, and I do not think we will ever have such huge PKs in the database.
- I will be happy with a string length between 14 and 20. If it's 20, then it's still 2 chars less.

You are currently using CBC mode with a static IV, so the code you have isn't secure anyway and, as you say, it produces rather large ciphertexts.
I would recommend switching from CBC mode to CTR mode, which lets you use a variable-length IV. The usual recommended length for the IV (or nonce) in CTR mode is, I think, 12 bytes, but you can adjust this up or down as needed. CTR also turns AES into a stream cipher, which means the output is the same size as the input. With AES, CBC mode always returns ciphertext in blocks of 16 bytes, so even if you encrypt 6 bytes you get 16 bytes out, which isn't ideal for you.
If you make your IV, say, 48 bits long and encrypt values no larger than 48 bits, you can produce a raw output of 6 + 6 = 12 bytes, or with base64, 4 * (12 / 3) = 16 characters. You can get a shorter output than this by further reducing your IV and/or input size (2^40?). You can shrink the range of possible input values as much as you want without damaging the security.
Keep in mind that CTR does have pitfalls. Producing two ciphertexts that share the same IV and key means that they can be trivially broken, so always randomly generate your IV (and don't reduce it in size too much).
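The size arithmetic can be checked with a short sketch. It performs no encryption: zero bytes stand in for the nonce plus ciphertext, and the function name urlsafe_token_length is hypothetical. It only shows how many URL-safe base64 characters a given number of raw bytes produces once the trailing padding is stripped.

```python
import base64


def urlsafe_token_length(raw_bytes):
    """Length of the URL-safe base64 text for raw_bytes bytes, '=' stripped."""
    encoded = base64.urlsafe_b64encode(bytes(raw_bytes))
    return len(encoded.rstrip(b"="))


# (nonce length, value length) in bytes; 16 is the current CBC block output.
for nonce_len, value_len in [(6, 6), (5, 5), (8, 6), (16, 0)]:
    total = nonce_len + value_len
    print(nonce_len, value_len, total, urlsafe_token_length(total))
```

A 6-byte nonce with a 6-byte value gives 16 characters and 5 + 5 bytes gives 14, both inside the 14 to 20 range asked for, while the existing 16-byte block gives the 22 characters seen in the question.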

Incorrect Padding error while decoding base64 encoding

I have tried to decode a PDF I stored as BLOB and save it into a file with .pdf extension. results[0][1] has the BLOB data extracted from database query.
    blob_val = results[0][1]
    if len(blob_val) % 4 != 0:
        while len(blob_val) % 4 != 0:
            blob_val = blob_val + b"="
        decod_text = base64.b64decode(blob_val)
    else:
        decod_text = base64.b64decode(blob_val)
Even though I have added = at the end to correct the padding, it still shows an incorrect padding error. Why does it still show this error even after the padding has been corrected with "="?

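Each base64 character encodes six bits, so four characters carry three bytes. The encoded text therefore has to be a multiple of four characters long (not three), and padding to a multiple of four, as the question does, is the right idea.
The code can be simplified a bit:
    blob_val = results[0][1]
    # If the length is already divisible by 4, the 'while' will never
    # be entered, so no point in doing the additional 'if' above.
    while len(blob_val) % 4 != 0:
        blob_val += b"="
    decod_text = base64.b64decode(blob_val)
If this still raises an incorrect padding error, the value is probably not clean base64. By default, b64decode discards characters outside the base64 alphabet before it checks the padding, so stray bytes change the effective length, and a count of data characters that is one more than a multiple of four can never be valid. It is also worth checking whether the BLOB already holds the raw PDF bytes (these start with %PDF), in which case no decoding is needed at all.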
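Since base64 text comes in groups of four characters for every three bytes, a defensive version of the decoding step might look like the sketch below. It rests on two assumptions that the question does not confirm: that the BLOB may already contain raw PDF bytes (recognizable by the %PDF signature), and that any base64 text is stored without line breaks, because validate=True rejects every character outside the alphabet. The function name decode_pdf_blob is hypothetical.

```python
import base64


def decode_pdf_blob(blob):
    """Return PDF bytes from a blob that is either raw PDF or base64 text."""
    if blob.startswith(b"%PDF"):
        return blob  # assumption: already raw PDF, nothing to decode
    # Pad to a multiple of four characters, then decode strictly so that
    # stray characters raise binascii.Error instead of being discarded.
    padded = blob + b"=" * (-len(blob) % 4)
    return base64.b64decode(padded, validate=True)


# Placeholder content standing in for a real document.
stored = base64.b64encode(b"%PDF-1.4 example").rstrip(b"=")
print(decode_pdf_blob(stored))
```

If the strict decode raises an error, printing the first few bytes of the blob usually shows quickly whether it is base64 text, raw PDF data, or something else entirely.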
