I created a Python script to compress text by using the Huffman algorithm. Say I have the following string:
string = 'The quick brown fox jumps over the lazy dog'
Running my algorithm returns the following 'bits':
result = '01111100111010101111010011111010000000011000111000010111110111110010100110010011010100101111100011110001000110101100111101000010101101110110111000111010101110010111111110011000101101000110111000'
By comparing the amount of bits of the result with the input string, the algorithm seems to work:
>>> print(len(result), len(string) * 8)
194 344
But now comes the question: how do I write this to a file while still being able to decode it? You can only write to a file per byte, not per bit. If I write the 'codes' as bytes, there is no compression at all!
I am new at computer science, and the online resources just don't cut it for me. All help is much appreciated!
Edit: note that I had my codes something like this (in case of another input string 'xxxxxxxyzz'):
{'y': '00', 'x': '1', 'z': '10'}
The way I create the resulting string is by concatenating these codes in order of the input string:
result = '1111111001010'
How to get back to the original string from this result? Or am I getting this completely wrong? Thank you!
First you need to convert your bit string to bytes:
def _to_Bytes(data):
    b = bytearray()
    for i in range(0, len(data), 8):
        # Pad a short final chunk on the right so its bits keep their position
        b.append(int(data[i:i+8].ljust(8, '0'), 2))
    return bytes(b)
Then, open a file to write in binary mode:
result = '01111100111010101111010011111010000000011000111000010111110111110010100110010011010100101111100011110001000110101100111101000010101101110110111000111010101110010111111110011000101101000110111000'
with open('test.bin', 'wb') as f:
    f.write(_to_Bytes(result))
Now, after writing the original string to a file as well, the two file sizes can be compared:
import os
with open('test_compare.txt', 'w') as f:  # 'w' rather than 'a', so reruns do not grow the file
    f.write('The quick brown fox jumps over the lazy dog')
_o = os.path.getsize('test_compare.txt')
_c = os.path.getsize('test.bin')
print(f'Original file: {_o} bytes')
print(f'Compressed file: {_c} bytes')
print('Compressed file is about {}% smaller than the original'.format(round(((_o - _c) / _o) * 100, 0)))
Output:
Original file: 43 bytes
Compressed file: 25 bytes
Compressed file is about 42.0% smaller than the original
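To decode the file later, the reader also has to know how many bits in the last byte are padding. One possible convention (an assumption for illustration, not something the answer above prescribes) is to store the padding count in the first byte. The names `pack_bits` and `unpack_bits` are hypothetical:

```python
def pack_bits(bits: str) -> bytes:
    """Pack a string of '0'/'1' into bytes, prefixed by a pad-count byte."""
    pad = (8 - len(bits) % 8) % 8          # zero bits needed to fill the last byte
    padded = bits + '0' * pad              # pad on the right so bit order is preserved
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return bytes([pad]) + body


def unpack_bits(data: bytes) -> str:
    """Inverse of pack_bits: recover the exact original bit string."""
    pad, body = data[0], data[1:]
    bits = ''.join('{:08b}'.format(byte) for byte in body)
    return bits[:len(bits) - pad] if pad else bits


bits = '1111111001010'                     # the 13-bit example from the question
packed = pack_bits(bits)
restored = unpack_bits(packed)
```

The header costs one extra byte per file. A real format would also have to store the code table (or the character frequencies) so that the decoder can rebuild it.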
To get back to the original, you can write a function that walks the bit string and, at each step, picks a code that still leaves a decodable remainder:
d = {'y': '00', 'x': '1', 'z': '10'}
result = '1111111001010'
from typing import Generator
def reverse_encoding(content:str, _lookup) -> Generator[str, None, None]:
    while content:
        _options = [i for i in _lookup if content.startswith(i) and (any(content[len(i):].startswith(b) for b in _lookup) or not content[len(i):])]
        if not _options:
            raise Exception("Decoding error")
        yield _lookup[_options[0]]
        content = content[len(_options[0]):]
print(''.join(reverse_encoding(result, {b:a for a, b in d.items()})))
Output:
xxxxxxxyzz
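Note that the table {'y': '00', 'x': '1', 'z': '10'} is not prefix-free ('1' is a prefix of '10'), which is why decoding it needs lookahead. A Huffman tree normally yields codes in which no code is a prefix of another, and then decoding is a simple left-to-right scan. Below is a minimal sketch of that idea using `heapq`. The function names `build_codes` and `decode` are made up for this example, and the exact codes depend on how ties are broken:

```python
import heapq
from collections import Counter


def build_codes(text: str) -> dict:
    """Build a prefix-free Huffman code table for the characters in text."""
    freq = Counter(text)
    if len(freq) == 1:                      # degenerate case: a single distinct symbol
        return {next(iter(freq)): '0'}
    # Heap entries: (weight, tie_breaker, {char: code_so_far}).
    heap = [(w, i, {ch: ''}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # the two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {ch: '0' + code for ch, code in left.items()}
        merged.update({ch: '1' + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def decode(bits: str, codes: dict) -> str:
    """Decode by reading bits until the buffer matches a code word."""
    lookup = {code: ch for ch, code in codes.items()}
    out, buffer = [], ''
    for bit in bits:
        buffer += bit
        if buffer in lookup:                # unambiguous: no code is a prefix of another
            out.append(lookup[buffer])
            buffer = ''
    if buffer:
        raise ValueError('Trailing bits do not form a complete code word')
    return ''.join(out)


text = 'xxxxxxxyzz'
codes = build_codes(text)
encoded = ''.join(codes[ch] for ch in text)
decoded = decode(encoded, codes)
```

If the bits were padded to a byte boundary before writing, the padding has to be stripped before calling `decode`, otherwise the trailing zeros may be read as extra symbols.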
Is there a simple way to, in Python, read a file's hexadecimal data into a list, say hex?
So hex would be this:
hex = ['AA','CD','FF','0F']
I don't want to have to read into a string, then split. This is memory intensive for large files.
s = "Hello"
hex_list = ["{:02x}".format(ord(c)) for c in s]
Output
['48', '65', '6c', '6c', '6f']
Just change s to open(filename).read() and you should be good.
with open('/path/to/some/file', 'r') as fp:
    hex_list = ["{:02x}".format(ord(c)) for c in fp.read()]
Or, if you do not want to keep the whole list in memory at once for large files:
hex_list = ("{:02x}".format(ord(c)) for c in fp.read())
and to get the values, keep calling
next(hex_list)
to get all the remaining values from the generator
list(hex_list)
Using Python 3, let's assume the input file contains the sample bytes you show. For example, we can create it like this
>>> inp = bytes((170,12*16+13,255,15)) # i.e. b'\xaa\xcd\xff\x0f'
>>> with open(filename,'wb') as f:
...     f.write(inp)
Now, given we want the hex representation of each byte in the input file, it would be nice to open the file in binary mode, without trying to interpret its contents as characters/strings (or we might trip on the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 0: invalid start byte)
>>> with open(filename,'rb') as f:
...     buff = f.read() # it reads the whole file into memory
...
>>> buff
b'\xaa\xcd\xff\x0f'
>>> out_hex = ['{:02X}'.format(b) for b in buff]
>>> out_hex
['AA', 'CD', 'FF', '0F']
If the file is large, we might want to read it one byte at a time or in chunks. For that purpose I recommend reading this Q&A.
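As a rough sketch of the chunked approach, a generator can read a fixed number of bytes at a time and yield one hex string per byte. The name `iter_hex` and the chunk size are arbitrary choices for this example, and the temporary file only stands in for the real input:

```python
import os
import tempfile


def iter_hex(path, chunk_size=4096):
    """Yield each byte of the file as a two-character uppercase hex string."""
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)      # at most chunk_size bytes in memory at a time
            if not chunk:                   # an empty bytes object means end of file
                break
            for b in chunk:                 # iterating over bytes yields ints in Python 3
                yield '{:02X}'.format(b)


# Demonstration on a small temporary file holding the sample bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\xaa\xcd\xff\x0f')
hex_values = list(iter_hex(tmp.name, chunk_size=3))
os.remove(tmp.name)
```

Calling `list()` on the generator defeats the purpose for a large file. In practice you would loop over `iter_hex(...)` and handle each value as it arrives.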
Be aware that for viewing hexadecimal dumps of files, there are utilities available on most operating systems. If all you want to do is hex dump the file, consider one of these programs:
od (octal dump, which has a -x or -t x option)
hexdump
xd utility available under windows
Online hex dump tools, such as this one.
I have a string (it could be an integer too) in Python and I want to write it to a file. It contains only ones and zeros, and I want that pattern of ones and zeros to be written to a file. I want to write the binary directly because I need to store a lot of data, but only certain values. I see no need to take up the space of using eight bits per value when I only need three.
For instance. Let's say I were to write the binary string "01100010" to a file. If I opened it in a text editor it would say b (01100010 is the ascii code for b). Do not be confused though. I do not want to write ascii codes, the example was just to indicate that I want to directly write bytes to the file.
Clarification:
My string looks something like this:
binary_string = "001011010110000010010"
It is not made of the binary codes for numbers or characters. It contains data relevant only to my program.
To write out a string you can use the file's .write method. To write an integer, you will need to use the struct module:
import struct
#...
with open('file.dat', 'wb') as f:
    if isinstance(value, int):
        f.write(struct.pack('i', value)) # write an int
    elif isinstance(value, str):
        f.write(value.encode('ascii')) # write a string (a binary-mode file needs bytes in Python 3)
    else:
        raise TypeError('Can only write str or int')
However, the representations of int and string are different; you may wish to use the bin function instead to turn the int into a string of 0s and 1s:
>>> bin(7)
'0b111'
>>> bin(7)[2:] #cut off the 0b
'111'
but maybe the best way to handle all these ints is to decide on a fixed width for the binary strings in the file and convert them like so:
>>> x = 7
>>> '{0:032b}'.format(x) #32 character wide binary number with '0' as filler
'00000000000000000000000000000111'
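The fixed-width idea can be sketched for the three-bit case from the question. This is only an illustration: `encode_fixed` and `decode_fixed` are hypothetical helper names, and the sample values were chosen so that they reproduce the question's binary_string:

```python
def encode_fixed(values, width):
    """Concatenate non-negative ints as zero-padded binary fields of the given width."""
    for v in values:
        if not 0 <= v < (1 << width):
            raise ValueError('{} does not fit in {} bits'.format(v, width))
    return ''.join('{:0{}b}'.format(v, width) for v in values)


def decode_fixed(bits, width):
    """Split the bit string back into ints, one per width-sized field."""
    return [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]


values = [1, 3, 2, 6, 0, 2, 2]
bits = encode_fixed(values, 3)
roundtrip = decode_fixed(bits, 3)
```

The result is still a text string with one character per bit, so it has to be packed into bytes before writing if the goal is to save space.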
Alright, after quite a bit more searching, I found an answer. I believe that the rest of you simply didn't understand (which was probably my fault, as I had to edit twice to make it clear). I found it here.
The answer was to split up each piece of data, convert them into a binary integer then put them in a binary array. After that, you can use the array's tofile() method to write to a file.
from array import array
bin_array = array('B')
bin_array.append(int('011',2))
bin_array.append(int('010',2))
bin_array.append(int('110',2))
with open('binary.mydata', 'wb') as f:  # open() replaces the Python 2-only file()
    bin_array.tofile(f)
I want that pattern of ones and zeros to be written to a file.
If you mean you want to write a bitstream from a string to a file, you'll need something like this...
from io import StringIO  # cStringIO in Python 2
s = "001011010110000010010"
sio = StringIO(s)
f = open('outfile', 'wb')
while 1:
    # Grab the next 8 bits
    b = sio.read(8)
    # Bail if we hit EOF
    if not b:
        break
    # If we got fewer than 8 bits, pad with zeroes on the right
    if len(b) < 8:
        b = b + '0' * (8 - len(b))
    # Convert to int
    i = int(b, 2)
    # Convert to a single byte
    c = bytes([i])
    # Write
    f.write(c)
f.close()
...for which xxd -b outfile shows...
0000000: 00101101 01100000 10010000 -`.
Brief example:
import struct
my_number = 1234
with open('myfile', 'wb') as file_handle:
    file_handle.write(struct.pack('i', my_number))
...
with open('myfile', 'rb') as file_handle:
    my_number_back = struct.unpack('i', file_handle.read())[0]
Appending to an array.array 3 bits at a time will still produce 8 bits for every value. Appending 011, 010, and 110 to an array and writing to disk will produce the following output: 00000011 00000010 00000110. Note all the padded zeros in there.
It seems like, instead, you want to "compact" binary triplets into bytes to save space. Given the example string in your question, you can convert it to a list of integers (8 bits at a time) and then write it to a file directly. This will pack all the bits together using only 3 bits per value rather than 8.
Python 3.4 example
original_string = '001011010110000010010'
# first split into 8-bit chunks
bit_strings = [original_string[i:i + 8] for i in range(0, len(original_string), 8)]
# then convert to integers
byte_list = [int(b, 2) for b in bit_strings]
with open('byte.dat', 'wb') as f:
    f.write(bytearray(byte_list)) # convert to bytearray before writing
Contents of byte.dat:
hex: 2D 60 12
binary (by 8 bits): 00101101 01100000 00010010
binary (by 3 bits): 001 011 010 110 000 000 010 010
                                         ^^ ^ (Note extra bits)
Note that this method pads the last byte so that it aligns to an 8-bit boundary, and the padding goes to the most significant bits (the left side of the last byte in the above output). So you need to be careful, and possibly add zeros to the end of your original string to make its length a multiple of 8.
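One way to sidestep the left-padding issue is to pack the values with shifts and put the padding on the right explicitly. The following is a sketch under the assumption that the number of stored values is known when reading back. `pack_values` and `unpack_values` are illustrative names only:

```python
def pack_values(values, width):
    """Pack ints of `width` bits each into bytes, padding the tail with zero bits."""
    acc = 0
    for v in values:
        acc = (acc << width) | (v & ((1 << width) - 1))   # append one field
    total_bits = len(values) * width
    pad = (8 - total_bits % 8) % 8
    acc <<= pad                                            # padding goes on the right
    return acc.to_bytes((total_bits + pad) // 8, 'big')


def unpack_values(data, width, count):
    """Recover `count` values; the count must be stored or known separately."""
    acc = int.from_bytes(data, 'big')
    total_bits = len(data) * 8
    mask = (1 << width) - 1
    return [(acc >> (total_bits - width * (i + 1))) & mask for i in range(count)]


values = [1, 3, 2, 6, 0, 2, 2]            # 001 011 010 110 000 010 010
packed = pack_values(values, 3)
unpacked = unpack_values(packed, 3, len(values))
```

Here the seven three-bit values occupy three bytes (2D 60 90) instead of seven. The single big-integer accumulator is convenient for short inputs. For very large inputs a chunked approach would probably be preferable.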
I want to open up about 135 different offsets in the file in hex form. The sections of interest are the names of the characters skins in the game, so an easy way to edit these and save them would save me MEGA time.
This is code I ended up with, something I could understand. I converted the file to HEX and TEXT form:
import binascii
filename = 'Skin1.pack'
with open(filename, 'rb') as f:
    content = f.read()
out = binascii.hexlify(content)
f = open('hex.txt', 'wb')
f.write(out)
f.close()
import binascii
filename = 'hex.txt'
with open(filename, 'rb') as f:
    content = f.read()
asci = binascii.unhexlify(content)
w = open('printed-hex.txt', 'wb')
w.write(asci)
w.close()
Now I'm trying to use this to replace some of the text in the file:
f = open("printed-hex.txt",'r')
filedata = f.read()
f.close()
newdata = filedata.replace("K n i g h t ",input)
f = open("printed-hex.txt",'w')
f.write(newdata)
f.close()
but I'm met with this error,
Traceback (most recent call last):
File "C:\Users\Dee\Desktop\ARC to HEX\Edit-Printed-HEX.py", line 3, in <module>
filedata = f.read()
File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2656: character maps to <undefined>
To nitpick, hex doesn't have 'lines' so you might want to think about how you will limit the location you want to edit. Perhaps edit a fixed number of bytes.
The output you have seen in the console is Python attempting to print binary data. It has printed the extended characters because there aren't printable characters that correspond to the characters in the string. You can see that some characters are printable, and that is why you have things like 7(5. in it.
What you need is an easy way to represent the binary data as hex, and a way to convert back. I'll leave the implementation of the actual editor up to you.
import mmap
handle = open('/usr/bin/xxd', 'rb')
memorymap = mmap.mmap(handle.fileno(), 0, prot=mmap.PROT_READ)
value_to_hex = dict(enumerate('0123456789ABCDEF'))
hex_to_value = {v: k for (k, v) in value_to_hex.items()}
def expand_byte(byte):
    """ Converts a single byte into 2 4 bit values """
    return [(byte >> s) & 0xF for s in [4, 0]]
def compact_bytes(values):
    """ Converts 2 4 bit values into a single byte """
    return (values[0] << 4) | values[1]
def bin_to_hex(data):
    """ Converts binary data to hex characters """
    return [value_to_hex[v] for b in data for v in expand_byte(b)]
def hex_to_bin(hexadecimal):
    """ Converts hex characters to binary data """
    return [
        compact_bytes([hex_to_value[v] for v in hexadecimal[i:i + 2]])
        for i in range(0, len(hexadecimal), 2)
    ]
test_data = list(memorymap[0:8])  # slicing an mmap gives bytes; list() yields ints in Python 3
hex_data = bin_to_hex(test_data)
final_data = hex_to_bin(hex_data)
print("From '{0}'\nto '{1}'\nto '{2}'".format([chr(c) for c in test_data], hex_data, [chr(c) for c in final_data]))
This prints:
From '['\x7f', 'E', 'L', 'F', '\x02', '\x01', '\x01', '\x00']'
to '['7', 'F', '4', '5', '4', 'C', '4', '6', '0', '2', '0', '1', '0', '1', '0', '0']'
to '['\x7f', 'E', 'L', 'F', '\x02', '\x01', '\x01', '\x00']'
Bitwise value manipulation is something you may not have come across before, so you should learn about it. The >> << | and & operators are bitwise operators.
To retrieve the data, operate on the mmap object as in the example code.
If you want to open a fragment of data in a hex editor, copy it into a temporary file, then open the file in the editor, e.g. with subprocess.check_call(), then copy the new file's contents back. (That's unless your editor has a command-line option that allows setting focus at a specific offset at startup.)
To use just Python's console, use something like
" ".join("%02x"%ord(c) for c in <data>)
to see the data in hex (or just repr to see it in ASCII), or, for more xxd-like look and feel, something 3rd-party like hexview.
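The UnicodeDecodeError in the question comes from opening binary data in text mode, where Python tries to decode every byte with cp1252. Reading with 'rb' and replacing on bytes avoids the decoding step entirely. Here is a small sketch, assuming (and this is only a guess from the "K n i g h t " spacing) that the names are stored as UTF-16 text. The sample bytes and the replacement name are made up:

```python
def replace_bytes(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace old with new, refusing edits that would shift later offsets."""
    if len(new) != len(old):
        raise ValueError('Replacement must have the same length as the original')
    return data.replace(old, new)


# Hypothetical sample standing in for the contents of a file opened with 'rb'.
sample = b'\x00\x9d' + 'Knight'.encode('utf-16-le') + b'\xff'
old = 'Knight'.encode('utf-16-le')          # assumes the names are UTF-16 text
new = 'Wizard'.encode('utf-16-le')
patched = replace_bytes(sample, old, new)
```

Keeping the replacement the same length matters if the file format refers to fixed offsets, as the 135 offsets mentioned in the question suggest. The patched bytes would then be written back with a file opened in 'wb' mode.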
I need to read up to the point of a certain string in a binary file, and then act on the bytes that follow. The string is 'colr' (this is a JPEG 2000 file) and here is what I have so far:
from collections import deque
f = open('my.jp2', 'rb')
bytes = deque([], 4)
while ''.join(map(chr, bytes)) != 'colr':
    bytes.appendleft(ord(f.read(1)))
If this works:
bytes = deque([0x63, 0x6F, 0x6C, 0x72], 4)
print ''.join(map(chr, bytes))
(returns 'colr'), I'm not sure why the test in my loop never evaluates to True. I wind up spinning - just hanging - I don't even get an exit when I've read through the whole file.
Change your bytes.appendleft() to bytes.append() and then it will work -- it does for me.
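For reference, here is a sketch of the corrected loop in Python 3 that also stops at end of file instead of spinning. The function name `seek_past` is made up, and `io.BytesIO` with invented payload bytes stands in for the real JPEG 2000 file:

```python
import io
from collections import deque


def seek_past(stream, marker: bytes) -> bool:
    """Read one byte at a time until marker has just been consumed.

    Returns True with the stream positioned right after the marker,
    or False if the end of the stream is reached first.
    """
    window = deque(maxlen=len(marker))      # keeps only the most recent bytes
    while True:
        b = stream.read(1)
        if not b:                           # empty read means EOF, so stop instead of spinning
            return False
        window.append(b[0])                 # append on the right keeps bytes in file order
        if bytes(window) == marker:
            return True


# io.BytesIO stands in for open('my.jp2', 'rb'); the payload bytes are made up.
stream = io.BytesIO(b'\x00\x00jp2hcolr\x01\x02\x03')
found = seek_past(stream, b'colr')
following = stream.read(3)
```

With appendleft() the window holds the bytes in reverse order, so it would only ever match 'rloc'. That is why the comparison in the question never succeeds.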
with open("my.jp2","rb") as f:
    print(f.read().split(b"colr",1))
If you don't want to read it all at once, then:
def preprocess(line):
    print("Do something with this line")
def postprocess(line):
    print("Do something else with this line")
currentproc = preprocess
with open("my.jp2","rb") as f:
    for line in f:
        if b"colr" in line:
            left,right = line.split(b"colr",1)
            preprocess(left)
            postprocess(right)
            currentproc = postprocess
        else:
            currentproc(line)
It's line by line rather than byte by byte ... but meh ...
I have a hard time thinking that you don't have enough RAM to hold the whole jpg in memory... Python is not really an awesome language to minimize memory or time footprints
but it is awesome for functional requirements :)
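If memory really is a concern, a middle ground between byte by byte and the whole file is to read fixed-size chunks and carry over the last few bytes, so that a marker straddling a chunk boundary is still found. This is a hedged sketch. `find_marker` is a hypothetical helper and the chunk size is arbitrary:

```python
import io


def find_marker(stream, marker: bytes, chunk_size: int = 65536) -> int:
    """Return the offset of the first marker in stream, or -1 if absent."""
    tail = b''                              # last len(marker)-1 bytes of the previous window
    offset = 0                              # stream offset where `tail + chunk` starts
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1
        window = tail + chunk
        pos = window.find(marker)
        if pos != -1:
            return offset + pos
        keep = len(marker) - 1
        tail = window[-keep:] if keep else b''
        offset += len(window) - len(tail)


# io.BytesIO stands in for a file opened with 'rb'; the bytes are made up.
data = b'\x00\x00jp2hcolr\x01\x02'
position = find_marker(io.BytesIO(data), b'colr', chunk_size=4)
```

The tiny chunk size in the demonstration is only there to force the marker across a chunk boundary.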