Python rookie here! So, I have a data file which stores a list of bytes, representing pixel values in an image. I know that the image is 3-by-3 pixels. Here's my code so far:
# Part 1: read the data
data = []
file = open("test.dat", "rb")
for i in range(0, 9):
    byte = file.read(1)
    data.append(byte)
file.close()
# Part 2: create the image
import PIL.Image  # needed for Image.frombytes
image = PIL.Image.frombytes('L', (3, 3), data)
image.save('image.bmp')
I have a couple of questions:
In part 1, is this the best way to read a binary file and store the data in an array?
In part 2, I get the error "TypeError: must be string or read-only buffer, not list".
Any help on either of these?
Thank you!
Part 1
If you know that you need exactly nine bytes of data, that looks like a fine way to do it, though it would probably be cleaner/clearer to use a context manager and skip the explicit loop:
with open('test.dat', 'rb') as infile:
    data = list(infile.read(9))  # read nine bytes and convert to a list
Part 2
According to the documentation, the data you must pass to PIL.Image.frombytes is:
data – A byte buffer containing raw data for the given mode.
A list isn't a byte buffer, so you're probably wasting your time converting the input to a list. My guess is that if you pass it the byte string directly, you'll get what you're looking for. This is what I'd try:
with open('test.dat', 'rb') as infile:
    data = infile.read(9)  # don't convert the bytestring to a list

image = PIL.Image.frombytes('L', (3, 3), data)  # pass in the bytestring
image.save('image.bmp')
Hopefully that helps; obviously I can't test it over here since I don't know what the content of your file is.
Of course, if you really need the bytes as a list for some other reason (doubtful--you can iterate over a string just as well as a list), you can always either convert them to a list when you need it (datalist = list(data)) or join them into a string when you make the call to PIL:
image = PIL.Image.frombytes('L', (3, 3), ''.join(datalist))
Part 3
This is sort of an aside, but it's likely to be relevant: do you know what version of PIL you're using? If you're using the actual, original Python Imaging Library, you may also be running into some of the many problems with that library--it's super buggy and unsupported since about 2009.
If you are, I highly recommend getting rid of it and grabbing the Pillow fork instead, which is the live, functional version. You don't have to change any code (it still installs a module called PIL), but the Pillow library is superior to the original PIL by leaps and bounds.
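If you're not sure which of the two you have installed, here's a quick heuristic check (note: recent Pillow versions expose a PIL.__version__ attribute; as far as I know the classic PIL does not):
import PIL
# Pillow defines PIL.__version__; classic PIL lacks it
print(getattr(PIL, "__version__", "no __version__ -- probably classic PIL"))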
Related
I am trying to extract embeddings from a hidden layer of an LSTM. I have a huge dataset with multiple sentences, which will therefore generate multiple numpy vectors. I want to store all those vectors efficiently in a single file. This is what I have so far:
with open(src_vectors_save_file, "wb") as s_writer, open(tgt_vectors_save_file, "wb") as t_writer:
    for batch in data_iter:
        encoder_hidden_layer, decoder_hidden_layer = self.extract_lstm_hidden_states_for_batch(
            batch, data.src_vocabs, attn_debug
        )
        encoder_hidden_layer = encoder_hidden_layer.detach().numpy()
        decoder_hidden_layer = decoder_hidden_layer.detach().numpy()
        enc_hidden_bytes = pickle.dumps(encoder_hidden_layer)
        dec_hidden_bytes = pickle.dumps(decoder_hidden_layer)
        s_writer.write(enc_hidden_bytes)
        s_writer.write("\n")
        t_writer.write(dec_hidden_bytes)
        t_writer.write("\n")
Essentially I am using pickle to get the bytes from the np.array and writing them to a binary file. I naively tried to separate each byte-encoded array with an ASCII newline, which obviously throws an error (you can't write a str to a file opened in binary mode). I was planning to use the .readlines() function, or to read each byte-encoded array per line with a for loop, in the next program. However, that won't be possible now.
I am out of ideas; can someone suggest an alternative? How can I efficiently store all the arrays in a compressed fashion in one file, and how can I read them back from that file?
There is a problem with using \n as a separator: the dump from pickle (enc_hidden_bytes) could itself contain \n, because the data is not ASCII-encoded.
There are two solutions. You can escape any \n appearing in the data and then use \n as a terminator, but this adds complexity even while reading.
The other solution is to put the size of the data into the file before the actual data starts. This is like a sort of header and is a very common practice when sending data over a connection.
You can write the following two functions:
import struct

def write_bytes(handle, data):
    total_bytes = len(data)
    handle.write(struct.pack(">Q", total_bytes))  # 8-byte big-endian length header
    handle.write(data)

def read_bytes(handle):
    size_bytes = handle.read(8)
    if len(size_bytes) == 0:
        return None  # end of file
    total_bytes = struct.unpack(">Q", size_bytes)[0]
    return handle.read(total_bytes)
Now you can replace
s_writer.write(enc_hidden_bytes)
s_writer.write("\n")
with
write_bytes(s_writer, enc_hidden_bytes)
and same for the other variables.
While reading back from the file in a loop you can use the read_bytes function in a similar way.
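For example, a minimal reading loop might look like this (a sketch, assuming the read_bytes helper above and that each record was written with pickle.dumps):
import pickle

with open(src_vectors_save_file, "rb") as s_reader:
    while True:
        enc_hidden_bytes = read_bytes(s_reader)
        if enc_hidden_bytes is None:
            break  # reached end of file
        encoder_hidden_layer = pickle.loads(enc_hidden_bytes)
        # ... use the recovered numpy array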
I am trying to read an image file which is in the *.his format. Honestly, I do not know much about this format; after spending some time on Google, I figured out that it's a binary format which can be read in the ImageJ software as a raw-format import. On further inquiry, I found the following details of the *.his file:
Image type = 16-bit unsigned
Matrix dimensions in pixels = w1024 x h1024
Skip header info = 100 Bytes (The number of bytes in the file before the first byte of image data).
Little-Endian Byte Order
With this information in hand, I started out ...
I just wanted to print the values one by one, to see the output:
f = open("file.his", 'rb')
f.seek(100)
try:
byte = f.read(2)
while byte != "":
byte = f.read(2)
print unpack('<H', byte)
finally:
f.close()
It prints some numbers out and then the error message:
.....
(64846,)
(64846,)
(64830,)
Traceback (most recent call last):
print unpack('
Please can someone suggest how to read this kind of file? I still think unpack is the right function; if someone has similar experience, any response is greatly appreciated.
Rky.
I've done a very similar task with the *.inr image format; maybe the logic could help you. Here is what you could apply:
1-Reading the file
First you need to read the file.
file = open(hisfile, 'rb')  # binary mode, since the data is not text
inp = file.readlines()
2-Get header
In my case I did a for loop until the number of characters was 256; in your case you need to count the bytes, so you could print line by line to find out where the header stops, or count the bytes in each line with:
len(line)  # number of bytes in this line (sys.getsizeof would include Python object overhead)
3-Data
When you already know that the remaining lines are the raw data, you need to accumulate them into one variable with a for loop:
raw_data = ""
for line in inp:
    raw_data += line
4-Convert the data
To convert the string to a numpy array you could do:
import numpy as np
data = np.fromstring(raw_data, dtype='<u2')  # little-endian unsigned 16-bit, per the file details
And then applying the shape:
data = data.reshape((1024, 1024)).transpose()  # check whether the transpose is needed; in my case it was fundamental
Maybe if you have an example of the file I could try to read it and help you more. Of course, you could do the whole process in one for loop using ifs.
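For what it's worth, given the format details in the question (100-byte header, then 1024 x 1024 little-endian unsigned 16-bit pixels), a more direct sketch using numpy would be:
import numpy as np

with open("file.his", "rb") as f:
    f.seek(100)  # skip the 100-byte header
    data = np.fromfile(f, dtype="<u2", count=1024 * 1024)
image = data.reshape((1024, 1024))
This also sidesteps the crash in the original loop: at end of file, f.read(2) returns an empty string, and unpack fails on it.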
with open("image.jpg", "rb") as f:
    byte = f.read(1)
    while byte != b"":
        byte = f.read(1)
        print(byte)
I'm getting bytes like:
b'\x00'
How do I get rid of this b''?
Let's say I want to save the bytes to a list, and then save this list as the same image again. How do I proceed?
Thanks!
You can use the bytes.decode function if you really need to "get rid of the b": http://docs.python.org/3.3/library/stdtypes.html#bytes.decode
But it seems from your code that you do not really need to do this, you really need to work with bytes.
The b"..." is just a python notation of byte strings, it's not really there, it only gets printed. Does it cause some real problems to you?
The b'' is only the string representation of the data, which is produced when you print it.
Using decode will not help you here because you only want the bytes, not the characters they represent. Slicing the string representation will help even less because then you are still left with a string of several useless characters ('\', 'x', and so on), not the original bytes.
There is no need to modify the string representation of the data, because the data is still there. Just use it instead of the string (i.e. don't use print). If you want to copy the data, you can simply do:
data = file1.read(...)
...
file2.write(data)
If you want to output the binary data directly from your program, use the sys.stdout.buffer:
import sys
sys.stdout.buffer.write(data)
To operate on binary data you can use the array module.
Below you will find an iterator that operates on 4096-byte chunks of data instead of reading everything into memory at once.
import array

def bytesfromfile(f):
    while True:
        raw = array.array('B')
        raw.fromstring(f.read(4096))  # Python 2; use raw.frombytes on Python 3
        if not raw:
            break
        yield raw

with open("image.jpg", 'rb') as fd:
    for chunk in bytesfromfile(fd):
        for b in chunk:
            # do something with b
            pass
This is one way to get rid of the b'': write the raw bytes to stdout's binary buffer instead of printing their repr:
import sys
sys.stdout.buffer.write(b)
If you want to save the bytes later it's more efficient to read the entire file in one go rather than building a list, like this:
with open('sample.jpg', mode='rb') as fh:
    content = fh.read()

with open('out.jpg', mode='wb') as out:
    out.write(content)
Here is one solution (note that this slices the printed string representation, which the answer above advises against):
print(str(byte)[2:-1])
I have very recently started to learn Python, and I chose to learn things by trying to solve a problem that I find interesting. This problem is to take a file (binary or not) and encrypt it using a simple method, something like replacing every "1001 0001" in it with a "0010 0101", and vice-versa.
However, I didn't find a way to do it. When reading the file, I can create an array in which each element contains one byte of data, with the read() method. But how can I replace this byte with another one, if it is one of the bytes I chose to replace, and then write the resulting information into the output encrypted file?
Thanks in advance!
To swap bytes 10010001 and 00100101:
#!/usr/bin/env python
import string
a, b = map(chr, [0b10010001, 0b00100101])
translation_table = string.maketrans(a+b, b+a) # swap a,b
with open('input', 'rb') as fin, open('output', 'wb') as fout:
    fout.write(fin.read().translate(translation_table))
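Note that this is Python 2 code (string.maketrans on one-character strings). On Python 3 the same idea works with bytes.maketrans and bytes.translate; a minimal sketch:
# Python 3: translate operates directly on bytes
a, b = bytes([0b10010001]), bytes([0b00100101])
table = bytes.maketrans(a + b, b + a)  # swap the two byte values
with open('input', 'rb') as fin, open('output', 'wb') as fout:
    fout.write(fin.read().translate(table))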
read() returns an immutable string, so you'll first need to convert that to a list of characters. Then go through your list and change the bytes as needed, and finally join the list back into a new string to write to the output file.
filedata = f.read()
filebytes = list(filedata)
for i, c in enumerate(filebytes):
    if ord(c) == 0x91:
        filebytes[i] = chr(0x25)
    elif ord(c) == 0x25:  # swap in the other direction too
        filebytes[i] = chr(0x91)
newfiledata = ''.join(filebytes)
Following Aaron's answer, once you have a string, then you can also use translate or replace:
In [43]: s = 'abc'
In [44]: s.replace('ab', 'ba')
Out[44]: 'bac'
In [45]: tbl = string.maketrans('a', 'd')
In [46]: s.translate(tbl)
Out[46]: 'dbc'
Docs: the Python string module.
I'm sorry about this somewhat relevant wall of text -- I'm just in a teaching mood.
If you want to optimize such an operation, I suggest using numpy. The advantage is that the entire translation operation is done with a single numpy operation, and those are written in C, so it is about as fast as you can get it using python.
In the example below I simply XOR every byte with 0b11111111 using a lookup table -- the first element is the translation of 0b00000000, the second the translation of 0b00000001, the third 0b00000010, and so on. By altering the lookup table, you can do any kind of translation that does not change within the file.
import numpy as np
import sys
data = np.fromfile(sys.argv[1], dtype="uint8")
lookup_table = np.array(
[i ^ 0xFF for i in range(256)], dtype="uint8")
lookup_table[data].tofile(sys.argv[2])
To highlight the simplicity of it all I've done no argument checking. Invoke the script like this:
python name_of_script.py input_file.txt output_file.txt
To directly answer your question, if you want to swap 0b10010001 and 0b00100101, you replace the lookup_table = ... line with this:
lookup_table = np.array(range(256), dtype="uint8")
lookup_table[0b10010001] = 0b00100101
lookup_table[0b00100101] = 0b10010001
Of course there is no lookup table encryption that isn't easily broken using frequency analysis. But as you may know, encryption using a one-time pad is unbreakable, as long as the pad is safe. This modified script encrypts or decrypts using a one-time pad (which you'll have to create yourself, store to a file, and somehow (there's the rub) securely transmit to the intended recipient of the message):
data = np.fromfile(sys.argv[1], dtype="uint8")
pad = np.fromfile(sys.argv[2], dtype="uint8")
(data ^ pad[:len(data)]).tofile(sys.argv[3])
Example usage (linux):
$ dd if=/dev/urandom of=pad.bin bs=512 count=5
$ python pytrans.py pytrans.py pad.bin encrypted.bin
Recipient then does:
$ python pytrans.py encrypted.bin pad.bin decrypted.py
Voila! Fast and unbreakable encryption in three lines (plus two import lines) of Python.
Using gzip, tell() returns the offset in the uncompressed file.
In order to show a progress bar, I want to know the original (uncompressed) size of the file.
Is there an easy way to find out?
Uncompressed size is stored in the last 4 bytes of the gzip file. We can read the binary data and convert it to an int. (This will only work for files under 4GB)
import struct

def getuncompressedsize(filename):
    with open(filename, 'rb') as f:
        f.seek(-4, 2)  # the ISIZE field is the last four bytes
        return struct.unpack('<I', f.read(4))[0]  # little-endian per the gzip spec
The gzip format specifies a field called ISIZE, which is described as:
This contains the size of the original (uncompressed) input data modulo 2^32.
In gzip.py, which I assume is what you're using for gzip support, there is a method called _read_eof defined as such:
def _read_eof(self):
    # We've read to the end of the file, so we have to rewind in order
    # to reread the 8 bytes containing the CRC and the file size.
    # We check that the computed CRC and size of the
    # uncompressed data matches the stored values.  Note that the size
    # stored is the true file size mod 2**32.
    self.fileobj.seek(-8, 1)
    crc32 = read32(self.fileobj)
    isize = U32(read32(self.fileobj))   # may exceed 2GB
    if U32(crc32) != U32(self.crc):
        raise IOError, "CRC check failed"
    elif isize != LOWU32(self.size):
        raise IOError, "Incorrect length of data produced"
There you can see that the ISIZE field is being read, but only to compare it to self.size for error detection. This should mean that GzipFile.size stores the actual uncompressed size. However, I think it's not exposed publicly, so you might have to hack it in to expose it. Not so sure, sorry.
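A quick illustration of that (a sketch, assuming the Python 2 gzip module quoted above; size is a non-public attribute and is only updated as you read):
import gzip

g = gzip.GzipFile("file.gz")
g.read()       # size is only incremented as data is actually read
print(g.size)  # uncompressed byte count -- non-public attribute
g.close()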
I just looked all of this up right now, and I haven't tried it so I could be wrong. I hope this is of some use to you. Sorry if I misunderstood your question.
Despite what the other answers say, the last four bytes are not a reliable way to get the uncompressed length of a gzip file. First, there may be multiple members in the gzip file, so that would only be the length of the last member. Second, the length may be more than 4 GB, in which case the last four bytes represent the length modulo 2^32, not the length.
However, for what you want, there is no need to get the uncompressed length. You can instead base your progress bar on the amount of input consumed, compared to the length of the gzip file, which is readily obtained. For typical homogeneous data, that progress bar would show exactly the same thing as a progress bar based on the uncompressed data.
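A minimal sketch of that idea (hedged: it uses the non-public GzipFile.fileobj attribute, also mentioned in other answers here, to get the position in the compressed stream):
import gzip
import os

path = "data.gz"
total = os.path.getsize(path)  # compressed size is cheap to obtain
with gzip.open(path, "rb") as f:
    for line in f:
        fraction_done = f.fileobj.tell() / float(total)
        # ... update the progress bar with fraction_done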
Unix way: use "gunzip -l file.gz" via subprocess.call / os.popen, capture and parse its output.
The last 4 bytes of the .gz hold the original size of the file
I am not sure about performance, but this could be achieved without knowing gzip magic by using:
import gzip
import io

with gzip.open(filepath, 'rb') as file_obj:
    file_size = file_obj.seek(0, io.SEEK_END)
This should also work for other (compressed) stream readers like bz2 or the plain open.
EDIT: as suggested in the comments, the 2 in the second line was replaced by io.SEEK_END, which is definitely more readable and probably more future-proof.
EDIT: this works only in Python 3.
f = gzip.open(filename)
# kludge - report the compressed file position so progress bars
# don't go to 400%
f.tell = f.fileobj.tell
Looking at the source for the gzip module, I see that the underlying file object for GzipFile seems to be fileobj. So:
mygzipfile = gzip.GzipFile()
...
mygzipfile.fileobj.tell()
?
Maybe it would be good to do some sanity checking before doing that, like checking that the attribute exists with hasattr.
Not exactly a public API, but...
GzipFile.size stores the uncompressed size, but it's only incremented when you read the file, so you should prefer len(fd.read()) instead of the non-public GzipFile.size.
Here is a Python 2 version of @norok's solution:
import gzip, io

with gzip.open("yourfile.gz", "rb") as f:
    prev, cur = 0, f.seek(1000000, io.SEEK_CUR)
    while prev < cur:
        prev, cur = cur, f.seek(1000000, io.SEEK_CUR)
    filesize = cur
Note that just like f.seek(0, io.SEEK_END) this is slow for large files, but it will overcome the 4GB size limitation of the faster solutions suggested here
import gzip

# read32 is a private helper in the Python 2 gzip module; it expects the raw
# (compressed) file positioned at the ISIZE field, i.e. the last four bytes
with open("input.gz", "rb") as f:
    f.seek(-4, 2)
    size = gzip.read32(f)