How to compress or compact a string in Python

I'm writing a Python script that sends a string to a web service (written in C#). I need to compress or compact this string, because the bandwidth and the amount of data I can send are very limited.
I was thinking of converting it into a file and then compressing the file. But I'm looking for a method to directly compress the string.
How can I compress or compact the string?

How about zlib?
import zlib
a = "this string needs compressing"
a = zlib.compress(a.encode())
print(zlib.decompress(a).decode()) # outputs original contents of a
You can also use sys.getsizeof(obj) to see how much memory an object takes up before and after compression (note that getsizeof includes Python's object overhead; len() gives just the byte count).

import sys
import zlib
text=b"""This function is the primary interface to this module along with
decompress() function. This function returns byte object by compressing the data
given to it as parameter. The function has another parameter called level which
controls the extent of compression. It an integer between 0 to 9. Lowest value 0
stands for no compression and 9 stands for best compression. Higher the level of
compression, greater the length of compressed byte object."""
# Checking size of text
text_size = sys.getsizeof(text)
print("\nsize of original text", text_size)
# Compressing text
compressed = zlib.compress(text)
# Checking size of text after compression
csize = sys.getsizeof(compressed)
print("\nsize of compressed text", csize)
# Decompressing text
decompressed = zlib.decompress(compressed)
# Checking size of text after decompression
dsize = sys.getsizeof(decompressed)
print("\nsize of decompressed text", dsize)
print("\nDifference of size =", text_size - csize)

Related

Python how to get length of data compressed with zlib?

I have a file containing multiple chunks of zlib-compressed binary data, and the offsets and lengths are unknown. Below, I have a script that gets the offset of the byte after the final zlib-compressed data, which is what I need. The script works; however, in order to get the length of the original zlib-compressed data, I have to decompress it and re-compress it. Is there a better way to get the length without having to re-compress it? Here's my code:
import zlib

def inflate(infile):
    data = infile.read()
    offset = 0
    while offset < len(data):
        window = data[offset : offset + 2]
        for key, value in zlib_headers.items():
            if window == key:
                decomp_obj = zlib.decompressobj()
                yield key, offset, decomp_obj.decompress(data[offset:])
        if offset == len(data):
            break
        offset += 1

if __name__ == "__main__":
    zlib_headers = {b"\x78\x01": 3, b"\x78\x9c": 6, b"\x78\xda": 9}
    with open("input_file", "rb") as infile:
        *_, last = inflate(infile)
        key, offset, data = last
        start_offset = offset + len(zlib.compress(data, zlib_headers[key]))
        print(start_offset)
Recompressing it won't even work: the recompressed data could be a different length. There is no assurance that the result will be the same unless you control the compression process that produced the compressed data in the first place, and can guarantee that it uses the same compression code, the same version of that code, and exactly the same settings. There is not even enough information in the zlib header to determine what the compression level was.
By the way, your list of possible zlib headers is incomplete; there are 29 others it could be. The easiest and most reliable way to determine whether or not a zlib stream starts at the current byte is to begin decompressing until you either get an error or it completes. The first thing the decompressor will do is check the zlib header for validity.
To find the length of the compressed zlib stream, feed decomp_obj.decompress() a fixed number of bytes at a time, e.g. 65536 bytes. Keep track of how many bytes you have fed it. Stop when decomp_obj.eof is true; that indicates that the end of the zlib stream has been reached. Then decomp_obj.unused_data will be the bytes you fed it that were after the zlib stream. Subtract the length of that leftover from the total amount fed, and you have the length of the zlib stream.
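A minimal sketch of that approach, assuming Python 3 (where Decompress objects expose eof) and a file object already positioned at the start of the zlib stream; the helper name is made up:
import zlib
def zlib_stream_length(infile, chunk_size=65536):
    """Length of the compressed zlib stream starting at infile's current position."""
    decomp = zlib.decompressobj()
    fed = 0
    while not decomp.eof:
        chunk = infile.read(chunk_size)
        if not chunk:
            raise ValueError("truncated zlib stream")
        fed += len(chunk)
        decomp.decompress(chunk)   # the decompressed output is discarded here
    # subtract whatever we fed it that came after the end of the stream
    return fed - len(decomp.unused_data)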

Buffer/stream to read a subset of a file in Python?

I have written a class which parses the header segment of a data file (for storing scientific instrument data) and collects things like offsets to various data segments within the file. The actual data is obtained through various methods which read and parse the data segments.
The problem I'm running into is that there is a segment defined for vendor-specific, unstructured data. Since there's nothing to parse, I just need my method to return raw binary data. However, this segment could be very large, so I don't want to just read it all at once and return a single bytes object.
What I'd like to do is have the method return an io.BufferedReader object or similar into the file that only reads between a beginning and end offset. I haven't been able to figure out a way to do this using the builtin IO classes. Is it possible?
A BufferedReader inherits the methods of IOBase, so you can absolutely call reader.seek(byte_offset) to skip to that byte position in the stream. From there, however, you will have to manually track how many bytes you have read until you reach the end offset. The start offset passed to seek() must of course be known beforehand, as must the ending byte offset. Some example code follows (assuming a start offset of byte 250):
import io
# open the file as an unbuffered raw binary stream, then wrap it;
# wrapping a text-mode stream in BufferedReader would not work
raw = io.open("file.txt", "rb", buffering=0)
buffered_reader = io.BufferedReader(raw)
# set the stream to byte 250
buffered_reader.seek(250)
# read up to byte 750 (500 bytes from position 250)
data = buffered_reader.read(500)
Of course, if this header is dynamically sized, you will have to scan the file to determine the start positions, which means reading it line by line (or chunk by chunk) first.
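If you would rather hand the caller something file-like that simply cannot read past the segment, a small wrapper that tracks the remaining byte budget is one option. A rough sketch (the class name, file name, and offsets are just for illustration):
class SegmentReader:
    """Read-only view of a binary file object limited to [start, start + length)."""
    def __init__(self, fileobj, start, length):
        self._f = fileobj
        self._f.seek(start)
        self._remaining = length
    def read(self, size=-1):
        if size < 0 or size > self._remaining:
            size = self._remaining
        data = self._f.read(size)
        self._remaining -= len(data)
        return data

with open("file.bin", "rb") as f:
    segment = SegmentReader(f, start=250, length=500)
    chunk = segment.read(100)  # never returns bytes past offset 749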

Reading bytes from wave file python

I'm working on a little audio project and part of it requires using wave files and flac files. I'm trying to figure out how to read the metadata in each and how to add tags manually. I'm having trouble figuring out how to read the bytes as they are.
I have been referencing this page and a couple of others to see the full format of a wave file; however, for some wave files I get discrepancies. I want to be able to see the hexadecimal bytes in order to see what differences are occurring.
Simply using open('fname', 'rb') and read() only gives me the bytes back as strings. Using struct.unpack has worked for some wave files, but it is limited to printing as strings, ints, or shorts, and I can't see exactly what is going wrong when I use it. Is there any other way I can read this file in hex?
Thanks
I assume that you just want to display the content of a binary file in hexadecimal. First, you do not need Python for that, as some editors do it natively, for example vim.
Now assuming you have a string that you got by reading a file, you can easily change it to a list of hexadecimal values:
with open('fname', 'rb') as fd:          # open the file
    data = fd.read(16)                   # read 16 bytes from it
    h = [hex(ord(b)) for b in data]      # convert each byte to its hex value (Python 2: bytes iterate as 1-char strings)
    print(h)                             # prints a list of hexadecimal codes of the read bytes
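On Python 3, where opening in 'rb' mode gives you bytes (which iterate as ints), the same idea is shorter, and bytes objects also have a built-in hex() method:
with open('fname', 'rb') as fd:
    data = fd.read(16)
print([hex(b) for b in data])  # e.g. ['0x52', '0x49', '0x46', '0x46', ...] for a RIFF/WAVE file
print(data.hex())              # the same 16 bytes as one hex string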

Python and zlib: Terribly slow decompressing concatenated streams

I've been supplied with a zipped file containing multiple individual streams of compressed XML. The compressed file is 833 MB.
If I try to decompress it as a single object, I only get the first stream (about 19 KB).
I've modified the following code, supplied as an answer to an older question, to decompress each stream and write it to a file:
import zlib
outfile = open('output.xml', 'w')
def zipstreams(filename):
    """Return all zip streams and their positions in file."""
    with open(filename, 'rb') as fh:
        data = fh.read()
    i = 0
    print "got it"
    while i < len(data):
        try:
            zo = zlib.decompressobj()
            dat = zo.decompress(data[i:])
            outfile.write(dat)
            zo.flush()
            i += len(data[i:]) - len(zo.unused_data)
        except zlib.error:
            i += 1
    outfile.close()
zipstreams('payload')
This code runs and produces the desired result (all the XML data decompressed to a single file). The problem is that it takes several days to work!
Even though there are tens of thousands of streams in the compressed file, it still seems like this should be a much faster process. Roughly 8 days to decompress 833 MB (an estimated 3 GB raw) suggests that I'm doing something very wrong.
Is there another way to do this more efficiently, or is the slow speed the result of a read-decompress-write---repeat bottleneck that I'm stuck with?
Thanks for any pointers or suggestions you have!
It's hard to say very much without more specific knowledge of the file format you're actually dealing with, but it's clear that your algorithm's handling of substrings is quadratic, which is not a good thing when you've got tens of thousands of them. So let's see what we know:
You say that the vendor states that they are
using the standard zlib compression library. These are the same compression routines on which the gzip utilities are built.
From this we can conclude that the component streams are in raw zlib format, and are not encapsulated in a gzip wrapper (or a PKZIP archive, or whatever). The authoritative documentation on the ZLIB format is here: https://www.rfc-editor.org/rfc/rfc1950
So let's assume that your file is exactly as you describe: A 32-byte header, followed by raw ZLIB streams concatenated together, without any other stuff in between. (Edit: That's not the case, after all).
Python's zlib documentation provides a Decompress class that is actually pretty well suited to churning through your file. It includes an attribute unused_data whose documentation states clearly that:
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
So, this is what you can do: write a loop that reads through the data, say, one block at a time (no need to even read the entire 800 MB file into memory). Push each block to the Decompress object and check the unused_data attribute. When it becomes non-empty, you've got a complete object. Write it to disk, create a new decompress object and initialize it with the unused_data from the last one. This just might work (untested, so check for correctness).
Edit: Since you do have other data in your data stream, I've added a routine that aligns to the next ZLIB start. You'll need to find and fill in the two-byte sequence that identifies a ZLIB stream in your data. (Feel free to use your old code to discover it.) While there's no fixed ZLIB header in general, it should be the same for each stream since it consists of protocol options and flags, which are presumably the same for the entire run.
import zlib
# FILL IN: ZHEAD is two bytes with the actual ZLIB settings in the input
ZHEAD = CMF + FLG

def findstart(header, buf, source):
    """Find `header` in str `buf`, reading more from `source` if necessary"""
    while buf.find(header) == -1:
        more = source.read(2**12)
        if len(more) == 0:  # EOF without finding the header
            return ''
        buf += more
    offset = buf.find(header)
    return buf[offset:]
You can then advance to the start of the next stream. I've added a try/except pair since the same byte sequence might occur outside a stream:
source = open(datafile, 'rb')
skip_ = source.read(32)  # Skip non-zlib header
buf = ''
while True:
    decomp = zlib.decompressobj()
    # Find the start of the next stream
    buf = findstart(ZHEAD, buf, source)
    try:
        stream = decomp.decompress(buf)
    except zlib.error:
        print "Spurious match(?) at output offset %d." % outfile.tell(),
        print "Skipping 2 bytes"
        buf = buf[2:]
        continue
    # Read until zlib decides it's seen a complete file
    while decomp.unused_data == '':
        block = source.read(2**12)
        if len(block) > 0:
            stream += decomp.decompress(block)
        else:
            break  # We've reached EOF
    outfile.write(stream)
    buf = decomp.unused_data  # Save for the next stream
    if len(block) == 0:
        break  # EOF
outfile.close()
PS 1. If I were you I'd write each XML stream into a separate file.
PS 2. You can test whatever you do on the first MB of your file, till you get adequate performance.
Decompressing 833 MB should take about 30 seconds on a modern processor (e.g. a 2 GHz i7). So yes, you are doing something very wrong. Attempting to decompress at every byte offset to see if you get an error is part of the problem, though not all of it. There are better ways to find the compressed data. Ideally you should find or figure out the format. Alternatively, you can search for valid zlib headers using the RFC 1950 specification, though you may get false positives.
More significant may be that you are reading the entire 833 MB into memory at once, and decompressing the 3 GB to memory, possibly in large pieces each time. How much memory does your machine have? You may be thrashing to virtual memory.
If the code you show works, then the data is not zipped. zip is a specific file format, normally with the .zip extension, that encapsulates raw deflate data within a structure of local and central directory information intended to reconstruct a directory in a file system. You must have something rather different, since your code is looking for and apparently finding zlib streams. What is the format you have? Where did you get it? How is it documented? Can you provide a dump of, say, the first 100 bytes?
The way this should be done is not to read the whole thing into memory and decompress entire streams at once, also into memory. Instead, make use of the zlib.decompressobj interface, which allows you to provide a piece at a time and get the resulting available decompressed data. You can read the input file in much smaller pieces, find the compressed data streams by using the documented format or by looking for zlib (RFC 1950) headers, and then run those a chunk at a time through the decompression object, writing out the decompressed data where you want it. decomp.unused_data can be used to detect the end of the compressed stream (as in the example you found).
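A minimal sketch of that incremental approach (untested, assuming Python 3, and assuming each stream begins exactly where the previous one's unused_data left off; the fixed per-stream headers discussed in the next answer would still need to be skipped):
import zlib
CHUNK = 1 << 16  # read 64 KiB of compressed input at a time
def iter_streams(infile):
    """Yield each decompressed zlib stream found in infile, incrementally."""
    leftover = b""
    while True:
        data = leftover or infile.read(CHUNK)
        if not data:
            return
        decomp = zlib.decompressobj()
        out = []
        while True:
            out.append(decomp.decompress(data))
            if decomp.eof:                  # end of this zlib stream
                leftover = decomp.unused_data
                break
            data = infile.read(CHUNK)
            if not data:                    # ran out of input mid-stream
                leftover = b""
                break
        yield b"".join(out)

with open("payload", "rb") as fh, open("output.xml", "wb") as outfile:
    for stream in iter_streams(fh):
        outfile.write(stream)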
From what you've described in the comments, it sounds like they're concatenating together the individual files they would have sent you separately, which means each one has a 32-byte header you need to skip.
If you don't skip those headers, it would probably have exactly the behavior you described: If you get lucky, you'll get 32 invalid-header errors and then successfully parse the next stream. If you get unlucky, the 32 bytes of garbage will look like the start of a real stream, and you'll waste a whole lot of time parsing some arbitrary number of bytes until you finally get a decoding error. (If you get really unlucky, it'll actually decode successfully, giving you a giant hunk of garbage and eating up one or more subsequent streams.)
So, try just skipping 32 bytes after each stream finishes.
Or, if you have a more reliable way of detecting the start of the next stream (this is why I told you to print out the offsets and look at the data in a hex editor, and why alexis told you to look at the zlib spec), do that instead.
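In terms of a decompressobj loop like the sketch above, skipping that fixed header just means dropping 32 bytes from unused_data before starting the next stream (32 is the header size described in this thread; adjust if the real format differs):
# after one stream reports decomp.eof:
leftover = decomp.unused_data[32:]   # drop the 32-byte vendor header
decomp = zlib.decompressobj()        # fresh object for the next stream
chunk = decomp.decompress(leftover)  # continue feeding as before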

Get uncompressed size of a .gz file in python

Using gzip, tell() returns the offset in the uncompressed file.
In order to show a progress bar, I want to know the original (uncompressed) size of the file.
Is there an easy way to find out?
The uncompressed size is stored in the last 4 bytes of the gzip file. We can read that binary data and convert it to an int. (This will only work for files under 4 GB.)
import struct
def getuncompressedsize(filename):
    with open(filename, 'rb') as f:
        f.seek(-4, 2)  # the last four bytes hold the ISIZE field
        return struct.unpack('<I', f.read(4))[0]  # little-endian, per the gzip spec
The gzip format specifies a field called ISIZE that:
This contains the size of the original (uncompressed) input data modulo 2^32.
In gzip.py, which I assume is what you're using for gzip support, there is a method called _read_eof defined as such:
def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check the that the computed CRC and size of the
    # uncompressed data matches the stored values. Note that the size
    # stored is the true file size mod 2**32.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = U32(read32(self.fileobj))  # may exceed 2GB
    if U32(crc32) != U32(self.crc):
        raise IOError, "CRC check failed"
    elif isize != LOWU32(self.size):
        raise IOError, "Incorrect length of data produced"
There you can see that the ISIZE field is being read, but only to compare it to self.size for error detection. This should mean that GzipFile.size stores the actual uncompressed size. However, I don't think it's exposed publicly, so you might have to hack it in to expose it. Not so sure, sorry.
I just looked all of this up right now, and I haven't tried it so I could be wrong. I hope this is of some use to you. Sorry if I misunderstood your question.
Despite what the other answers say, the last four bytes are not a reliable way to get the uncompressed length of a gzip file. First, there may be multiple members in the gzip file, so that would only be the length of the last member. Second, the length may be more than 4 GB, in which case the last four bytes represent the length modulo 2^32, not the length.
However, for what you want there is no need to get the uncompressed length. You can instead base your progress bar on the amount of input consumed, as compared to the length of the gzip file, which is readily obtained. For typical homogeneous data, that progress bar would show exactly the same thing as a progress bar based instead on the uncompressed data.
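One way to do that without touching any of GzipFile's private attributes is to open the raw file yourself, hand it to GzipFile via fileobj, and report the raw file's position against its total size. A sketch (the file name and chunk size are arbitrary):
import gzip
import os
path = "data.gz"
total = os.path.getsize(path)  # compressed size, cheap to obtain
with open(path, "rb") as raw, gzip.GzipFile(fileobj=raw) as gz:
    while True:
        chunk = gz.read(1 << 16)
        if not chunk:
            break
        # ... process the decompressed chunk ...
        # raw.tell() is approximately how much compressed input has been consumed
        print("progress: %.1f%%" % (100.0 * raw.tell() / total))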
Unix way: use "gunzip -l file.gz" via subprocess.call / os.popen, capture and parse its output.
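For example (the exact output columns depend on your gzip version, so treat the parsing as a sketch):
import subprocess
out = subprocess.check_output(["gunzip", "-l", "file.gz"]).decode()
# typical output:
#          compressed        uncompressed  ratio uncompressed_name
#                5389               12873  58.3% file
uncompressed = int(out.splitlines()[1].split()[1])
print(uncompressed)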
The last 4 bytes of the .gz hold the original size of the file
I am not sure about performance, but this could be achieved without knowing gzip magic by using:
import gzip
import io
with gzip.open(filepath, 'rb') as file_obj:
    file_size = file_obj.seek(0, io.SEEK_END)
This should also work for other (compressed) stream readers like bz2 or the plain open.
EDIT:
As suggested in the comments, the whence value 2 was replaced by io.SEEK_END, which is definitely more readable and probably more future-proof.
EDIT:
Works only in Python 3.
f = gzip.open(filename)
# kludge - report the position in the compressed file so progress bars
# don't go to 400%
f.tell = f.fileobj.tell
Looking at the source for the gzip module, I see that the underlying file object for GzipFile seems to be fileobj. So:
mygzipfile = gzip.GzipFile()
...
mygzipfile.fileobj.tell()
?
Maybe it would be good to do some sanity checking before doing that, like checking that the attribute exists with hasattr.
Not exactly a public API, but...
GzipFile.size stores the uncompressed size, but it's only incremented when you read the file, so you should prefer len(fd.read()) instead of the non-public GzipFile.size.
Here is a Python 2 version of @norok's solution:
import gzip, io
with gzip.open("yourfile.gz", "rb") as f:
    prev, cur = 0, f.seek(1000000, io.SEEK_CUR)
    while prev < cur:
        prev, cur = cur, f.seek(1000000, io.SEEK_CUR)
    filesize = cur
Note that, just like f.seek(0, io.SEEK_END), this is slow for large files, but it overcomes the 4 GB size limitation of the faster solutions suggested here.
import gzip
File = gzip.open("input.gz", "r")
# note: read32() was a private helper in the Python 2 gzip module (it unpacks four
# little-endian bytes from a file object); called on a GzipFile it reads from the
# decompressed stream, not from the gzip trailer
Size = gzip.read32(File)
