Python - trying to deal with the bits of a file

I have very recently started to learn Python, and I chose to learn by trying to solve a problem I find interesting. The problem is to take a file (binary or not) and encrypt it using a simple method, something like replacing every "1001 0001" in it with a "0010 0101", and vice versa.
However, I didn't find a way to do it. When reading the file, I can create an array in which each element contains one byte of data, with the read() method. But how can I replace this byte with another one, if it is one of the bytes I chose to replace, and then write the resulting information into the output encrypted file?
Thanks in advance!

To swap bytes 10010001 and 00100101:
#!/usr/bin/env python
import string

a, b = map(chr, [0b10010001, 0b00100101])
translation_table = string.maketrans(a + b, b + a)  # swap a, b
with open('input', 'rb') as fin, open('output', 'wb') as fout:
    fout.write(fin.read().translate(translation_table))
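Note this is Python 2 code (string.maketrans is gone in Python 3). On Python 3, where reading a binary file gives you bytes, a minimal sketch of the same swap uses bytes.maketrans:
a, b = bytes([0b10010001]), bytes([0b00100101])
translation_table = bytes.maketrans(a + b, b + a)  # swap the two byte values
with open('input', 'rb') as fin, open('output', 'wb') as fout:
    fout.write(fin.read().translate(translation_table))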

read() returns an immutable string, so you'll first need to convert that to a list of characters. Then go through your list and change the bytes as needed, and finally join the list back into a new string to write to the output file.
with open('input', 'rb') as f:
    filedata = f.read()
filebytes = list(filedata)
for i, c in enumerate(filebytes):
    if ord(c) == 0x91:           # found a byte to replace
        filebytes[i] = chr(0x25)
newfiledata = ''.join(filebytes)
with open('output', 'wb') as fout:
    fout.write(newfiledata)

Following Aaron's answer, once you have a string, then you can also use translate or replace:
In [43]: s = 'abc'
In [44]: s.replace('ab', 'ba')
Out[44]: 'bac'
In [45]: tbl = string.maketrans('a', 'd')
In [46]: s.translate(tbl)
Out[46]: 'dbc'
Docs: the Python string module.

I'm sorry about this somewhat relevant wall of text -- I'm just in a teaching mood.
If you want to optimize such an operation, I suggest using numpy. The advantage is that the entire translation is done with a single numpy operation, and those are written in C, so it is about as fast as you can get with Python.
In the example below I simply XOR every byte with 0b11111111 using a lookup table: the first element is the translation of 0b00000000, the second the translation of 0b00000001, the third of 0b00000010, and so on. By altering the lookup table, you can do any kind of translation that does not change within the file.
import numpy as np
import sys

data = np.fromfile(sys.argv[1], dtype="uint8")
lookup_table = np.array(
    [i ^ 0xFF for i in range(256)], dtype="uint8")
lookup_table[data].tofile(sys.argv[2])
To highlight the simplicity of it all, I've done no argument checking. Invoke the script like this:
python name_of_script.py input_file.txt output_file.txt
To directly answer your question, if you want to swap 0b10010001 and 0b00100101, you replace the lookup_table = ... line with this:
lookup_table = np.array(range(256), dtype="uint8")
lookup_table[0b10010001] = 0b00100101
lookup_table[0b00100101] = 0b10010001
Of course there is no lookup table encryption that isn't easily broken using frequency analysis. But as you may know, encryption using a one-time pad is unbreakable, as long as the pad is safe. This modified script encrypts or decrypts using a one-time pad (which you'll have to create yourself, store to a file, and somehow (there's the rub) securely transmit to the intended recipient of the message):
data = np.fromfile(sys.argv[1], dtype="uint8")
pad = np.fromfile(sys.argv[2], dtype="uint8")
(data ^ pad[:len(data)]).tofile(sys.argv[3])
Example usage (linux):
$ dd if=/dev/urandom of=pad.bin bs=512 count=5
$ python pytrans.py pytrans.py pad.bin encrypted.bin
Recipient then does:
$ python pytrans.py encrypted.bin pad.bin decrypted.py
Voilà! Fast and unbreakable encryption with three lines (plus two import lines) of Python.

Related

How to store multiple float arrays efficiently in a file using python

I am trying to extract embeddings from a hidden layer of an LSTM. I have a huge dataset with multiple sentences, and those will therefore generate multiple numpy vectors. I want to store all those vectors efficiently in a single file. This is what I have so far:
with open(src_vectors_save_file, "wb") as s_writer, open(tgt_vectors_save_file, "wb") as t_writer:
    for batch in data_iter:
        encoder_hidden_layer, decoder_hidden_layer = self.extract_lstm_hidden_states_for_batch(
            batch, data.src_vocabs, attn_debug
        )
        encoder_hidden_layer = encoder_hidden_layer.detach().numpy()
        decoder_hidden_layer = decoder_hidden_layer.detach().numpy()
        enc_hidden_bytes = pickle.dumps(encoder_hidden_layer)
        dec_hidden_bytes = pickle.dumps(decoder_hidden_layer)
        s_writer.write(enc_hidden_bytes)
        s_writer.write("\n")  # the bug: writing the str "\n" to a binary file raises TypeError
        t_writer.write(dec_hidden_bytes)
        t_writer.write("\n")
Essentially I am using pickle to get the bytes from the np.array and writing those to a binary file. I tried to naively separate each byte-encoded array with an ASCII newline, which obviously throws an error. I was planning to use the .readlines() function, or to read each byte-encoded array per line using a for loop in the next program. However, that won't be possible now.
I am out of ideas; can someone suggest an alternative? How can I efficiently store all the arrays in a compressed fashion in one file, and how can I read them back from that file?
There is a problem with using \n as a separator: the dump from pickle (enc_hidden_bytes) could itself contain \n, because the data is not ASCII-encoded.
There are two solutions. You can escape any \n appearing in the data and then use \n as a terminator, but this adds complexity even while reading.
The other solution is to put the size of the data into the file before the actual data. This acts as a header and is very common practice when sending data over a connection.
You can write the following two functions:
import struct

def write_bytes(handle, data):
    total_bytes = len(data)
    handle.write(struct.pack(">Q", total_bytes))  # 8-byte big-endian length header
    handle.write(data)

def read_bytes(handle):
    size_bytes = handle.read(8)
    if len(size_bytes) == 0:
        return None  # end of file
    total_bytes = struct.unpack(">Q", size_bytes)[0]
    return handle.read(total_bytes)
Now you can replace
s_writer.write(enc_hidden_bytes)
s_writer.write("\n")
with
write_bytes(s_writer, enc_hidden_bytes)
and same for the other variables.
While reading back from the file in a loop you can use the read_bytes function in a similar way.
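For example, a minimal read loop might look like this (assuming the same pickle usage as in the question):
import pickle

with open(src_vectors_save_file, "rb") as s_reader:
    while True:
        enc_hidden_bytes = read_bytes(s_reader)
        if enc_hidden_bytes is None:
            break  # reached end of file
        encoder_hidden_layer = pickle.loads(enc_hidden_bytes)
        # ... use the recovered array here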

Reading and comparing a set of bytes: Python

I have two text files, both of them having 150000+ lines of data. I need to shorten them to a range of lines.
Allow me to explain:
The range starts at the line which starts with "BO_ ", and the last line kept will be the first one after that which does not start with "BO_". How do I compare a set of characters, given that Python reads the file one byte at a time?
Is there any built-in function to trim the lines in the file? I thought of getting each byte and checking it consecutively against 'B', 'O', '_' and ' '. But this would be tedious, and I bet the memory would run out before it even finished checking the file, considering the match might happen only at the end of the file.
I tried the following code:
def character(f):
    c = f.read(1)
    while c:
        yield c
        c = f.read(1)
This code works perfectly fine: it yields each byte of the text. But going by this approach would be difficult and time-consuming, and the code would be very ugly.
You can use f.readline() to read a single line (everything up to and including the next newline b"\n" character); see the Python documentation on file objects for details.
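For instance, a minimal sketch of the range extraction (the file name dbc_file.txt is an assumption):
first_to_last = []
in_block = False
with open("dbc_file.txt", "rb") as f:
    for line in f:  # iterates line by line, using readline() under the hood
        if line.startswith(b"BO_ "):
            in_block = True
            first_to_last.append(line)
        elif in_block:
            first_to_last.append(line)  # the first non-"BO_ " line is the last one kept
            break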

Python - split files

I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?
----Adding current code----
if not os.path.exists(output_path):
    os.makedirs(output_path)
memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)
outFile = open('output/tempfile', 'wb')
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        outFile.write(chunk)
outFile.close()  # close so everything is flushed to disk before reading it back
f = open('output/tempfile', 'rb').read().split('\r\n\r\n')
arf = open('output/recording.arf', 'wb')
arf.write(f[3])
os.remove('output/tempfile')
Okay, I was bored and wanted to figure out the best way to do this. Turns out my initial way in the comments above was overly complicated (unless considering some scenario where time is absolutely critical, or memory is severely constrained). A buffer is a much simpler way to achieve this, as long as you take two or more blocks at a time. This code emulates the question's scenario for demonstration.
Note: depending on the regex engine implementation, this is more efficient and requires significantly fewer str/bytes conversions, since using a regex requires casting each block of bytes to a string. The approach below requires no string conversions, instead operating solely on the bytes returned from requests.post(), and in turn writing those same bytes to file, without conversions.
from pprint import pprint

someString = '''I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?'''

n = 16
# emulate a stream by creating 37 blocks of 16 bytes
byteBlocks = [bytearray(someString[i:i+n]) for i in range(0, len(someString), n)]
pprint(byteBlocks)

# this string is present twice, but both times it is split across two bytearrays
matchBytes = bytearray('requests.post()')

# our buffer
buff = bytearray()
count = 0
for bb in byteBlocks:
    buff += bb
    count += 1
    # every two blocks
    if (count % 2) == 0:
        if count == 2:
            start = 0
        else:
            start = len(matchBytes)
        # check the bytes from ((count-2)*n - start) to (len(buff) - len(matchBytes));
        # this will check all the bytes only once...
        if matchBytes in buff[((count-2)*n)-start : len(buff)-len(matchBytes)]:
            print('Match starting at index:', buff.index(matchBytes), 'ending at:', buff.index(matchBytes)+len(matchBytes))
Update:
So, given the updated question, this code may remove the need to create a temporary file. I haven't been able to test it exactly, as I don't have a similar response, but you should be able to figure out any bugs yourself.
Since you aren't actually working with a stream directly, i.e. you're given the finished response object from requests.post(), you don't have to worry about using chunks in the networking sense. The "chunks" that requests refers to are really its way of dishing out the bytes, all of which it already has. You can access the bytes directly using r.raw.read(n), but as far as I can tell the response object doesn't let you see how many bytes there are in r.raw, so you're more or less forced to use the iter_content method.
Anyway, this code should copy all the bytes from the request object into a string, then you can search and split that string as before.
memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)
match = '\r\n\r\n'
data = ''
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        data += chunk
f = data.split(match)
arf = open('output/recording.arf', 'wb')
arf.write(f[3])
# no temporary file is created, so there is nothing to os.remove()
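One caveat: on Python 3, iter_content yields bytes, so you would accumulate into a bytes object and split on a bytes pattern. A sketch of the same approach:
memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)
data = b''
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        data += chunk
parts = data.split(b'\r\n\r\n')  # bytes pattern, not str
with open('output/recording.arf', 'wb') as arf:
    arf.write(parts[3])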

Python 2.6: Creating image from array

Python rookie here! So, I have a data file which stores a list of bytes, representing pixel values in an image. I know that the image is 3-by-3 pixels. Here's my code so far:
# Part 1: read the data
import PIL.Image

data = []
file = open("test.dat", "rb")
for i in range(0, 9):
    byte = file.read(1)
    data.append(byte)
file.close()

# Part 2: create the image
image = PIL.Image.frombytes('L', (3, 3), data)
image.save('image.bmp')
I have a couple of questions:
In part 1, is this the best way to read a binary file and store the data in an array?
In part 2, I get the error "TypeError: must be string or read-only buffer, not list".
Any help on either of these?
Thank you!
Part 1
If you know that you need exactly nine bytes of data, that looks like a fine way to do it, though it would probably be cleaner/clearer to use a context manager and skip the explicit loop:
with open('test.dat', 'rb') as infile:
    data = list(infile.read(9))  # read nine bytes and convert to a list
Part 2
According to the documentation, the data you must pass to PIL.Image.frombytes is:
data – A byte buffer containing raw data for the given mode.
A list isn't a byte buffer, so you're probably wasting your time converting the input to a list. My guess is that if you pass it the byte string directly, you'll get what you're looking for. This is what I'd try:
with open('test.dat', 'rb') as infile:
    data = infile.read(9)  # don't convert the bytestring to a list
image = PIL.Image.frombytes('L', (3, 3), data)  # pass in the bytestring
image.save('image.bmp')
Hopefully that helps; obviously I can't test it over here since I don't know what the content of your file is.
Of course, if you really need the bytes as a list for some other reason (doubtful--you can iterate over a string just as well as a list), you can always either convert them to a list when you need it (datalist = list(data)) or join them into a string when you make the call to PIL:
image = PIL.Image.frombytes('L', (3, 3), ''.join(datalist))
Part 3
This is sort of an aside, but it's likely to be relevant: do you know which version of PIL you're using? If you're using the actual, original Python Imaging Library, you may also be running into some of the many problems with that library: it's super buggy and has been unsupported since about 2009.
If you are, I highly recommend getting rid of it and grabbing the Pillow fork instead, which is the live, functional version. You don't have to change any code (it still installs a module called PIL), but the Pillow library is superior to the original PIL by leaps and bounds.

Python get rid of bytes b' '

import save

string = ""
with open("image.jpg", "rb") as f:
    byte = f.read(1)
    while byte != b"":
        byte = f.read(1)
        print(byte)
I'm getting bytes like:
b'\x00'
How do I get rid of this b''?
Let's say I want to save the bytes to a list, and then save this list as the same image again. How do I proceed?
Thanks!
You can use the bytes.decode function if you really need to "get rid of the b": http://docs.python.org/3.3/library/stdtypes.html#bytes.decode
But it seems from your code that you do not really need to do this; you really need to work with bytes.
The b"..." is just Python's notation for byte strings; it's not really there, it only gets printed. Does it cause some real problem for you?
The b'' is only part of the string representation of the data that is produced when you print it.
Using decode will not help you here because you only want the bytes, not the characters they represent. Slicing the string representation will help even less because then you are still left with a string of several useless characters ('\', 'x', and so on), not the original bytes.
There is no need to modify the string representation of the data, because the data is still there. Just use it instead of the string (i.e. don't use print). If you want to copy the data, you can simply do:
data = file1.read(...)
...
file2.write(data)
If you want to output the binary data directly from your program, use sys.stdout.buffer:
import sys
sys.stdout.buffer.write(data)
To operate on binary data you can use the array module.
Below you will find an iterator that operates on 4096-byte chunks of data instead of reading everything into memory at once.
import array

def bytesfromfile(f):
    while True:
        raw = array.array('B')
        raw.fromstring(f.read(4096))  # fromstring is the Python 2 spelling; Python 3 uses frombytes
        if not raw:
            break
        yield raw

with open("image.jpg", 'rb') as fd:
    for byte in bytesfromfile(fd):
        for b in byte:
            # do something with b
            pass
This is one way to get rid of the b'': skip print and write the raw bytes to standard output's underlying binary buffer (the same sys.stdout.buffer approach as above):
import sys
sys.stdout.buffer.write(byte)
If you want to save the bytes later it's more efficient to read the entire file in one go rather than building a list, like this:
with open('sample.jpg', mode='rb') as fh:
    content = fh.read()
with open('out.jpg', mode='wb') as out:
    out.write(content)
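If you really do want the intermediate list, as in your question, Python 3's bytes type converts both ways; a sketch:
with open('sample.jpg', mode='rb') as fh:
    byte_list = list(fh.read())       # a list of ints, one per byte
with open('out.jpg', mode='wb') as out:
    out.write(bytes(byte_list))       # reassemble and write the same image back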
Here is one solution, though note it merely slices the b and quotes off the printable representation (see the caveats above):
print(str(byte)[2:-1])
