Python struct unpack with irregular field sizes

I'm trying to unpack a struct containing some metadata and an image. It's all bits, but the way the contents were packed is quite strange: instead of elements being aligned on byte boundaries, they're aligned on every 6 bits. The file format and contents are described at the link: http://etlcdb.db.aist.go.jp/etlcdb/etln/form_c.htm.
The only fields I am interested in are the JIS code and the 16-gray-level image data, which is the last field. This data is taken from http://etlcdb.db.aist.go.jp/etlcdb/etln/etl4/etl4.htm.
The main problem I'm facing is that the fields are 6-bit aligned; I don't know how to properly unpack the struct.
As an aside, it's also unclear how to handle 4-bit images with PIL. Mode '1' isn't working correctly, as expected.
import struct
from PIL import Image, ImageFont, ImageDraw
filename = 'ETL4/ETL4C'
skip = 0
record_size = 2952
with open(filename, 'rb') as f:  # binary data, so open in binary mode
    f.seek(skip * record_size)
    s = f.read(record_size)
r = struct.unpack('>9x1i203x2736s', s)
print(r[1])
i1 = Image.frombytes('1', (72, 76), r[1], 'raw')
# fn = 'ETL9B_{:d}_{:d}.png'.format((r[0]-1)%20+1, hex(r[1])[-4:])
fn = 'name1.png'
i1.save(fn, 'PNG')
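Since struct can only address whole bytes, one workaround is to treat the whole record as a big integer and mask out individual bit fields. A minimal sketch, assuming the record s has been read in binary mode as above; the offsets shown are illustrative and the real ones must come from the ETL4 format table:

def get_bits(record, bit_offset, bit_count):
    # Extract an arbitrary bit field from a bytes object (MSB-first bit order).
    as_int = int.from_bytes(record, 'big')
    shift = len(record) * 8 - bit_offset - bit_count
    return (as_int >> shift) & ((1 << bit_count) - 1)

# Illustrative only: if the JIS code occupied, say, the second 6-bit unit,
# it would start at bit 6 and span 6 bits:
# jis = get_bits(s, 6, 6)

For the 4-bit pixels, PIL has no native 4-bit mode; one option is to expand each packed byte into two 8-bit pixels and use mode 'L' instead of '1' (again a sketch, not tested against the real ETL4 data):

def expand_4bit(data):
    # Unpack two 4-bit pixels per byte, scaling 0..15 up to 0..255.
    out = bytearray()
    for b in data:
        out.append((b >> 4) * 17)    # high nibble
        out.append((b & 0x0F) * 17)  # low nibble
    return bytes(out)

# 2736 bytes * 2 pixels/byte = 5472 = 72 * 76 pixels:
# i1 = Image.frombytes('L', (72, 76), expand_4bit(r[1]), 'raw')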

Related

Zeroing/blacking out pixels in a .tiff-like file (.svs or .ndpi)

I am trying to zero out the pixels in some .tiff-like biomedical scans (.svs & .ndpi) by changing values directly in the binary file.
For reference, I am using the docs on the .tiff format here.
As a sanity check, I've confirmed that the first two bytes have values 73 and 73 (or I and I in ASCII), meaning it is little-endian, and that the next two bytes hold the value 42 (both as expected according to the docs just mentioned).
I wrote a Python script that reads the IFD (Image File Directory) and its components, but I am having trouble proceeding from there.
My code is this:
with open('scan.svs', "rb") as f:
    # Read away the first 4 bytes:
    f.read(4)
    # Read offset of first IFD as the four next bytes:
    IFD_offset = int.from_bytes(f.read(4), 'little')
    # Move to IFD:
    f.seek(IFD_offset, 0)
    # Read IFD:
    IFD = f.read(12)
    # Get components of IFD:
    tag = int.from_bytes(IFD[:2], 'little')
    field_type = int.from_bytes(IFD[2:4], 'little')
    count = int.from_bytes(IFD[4:8], 'little')
    value_offset = int.from_bytes(IFD[8:], 'little')
    # Now what?
The values for the components are tag=16, field_type=254, count=65540 and value_offset=0.
How do I proceed from there?
PS: Using Python is not a must, if there is some other tool that could more easily do the job.
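For what it's worth, the odd values suggest an off-by-two: per the TIFF spec, the first two bytes at the IFD offset are the number of directory entries, and the 12-byte entries only start after that. A minimal sketch under that reading, reusing the question's file:

with open('scan.svs', 'rb') as f:
    f.read(4)  # byte order ("II") and magic number (42)
    IFD_offset = int.from_bytes(f.read(4), 'little')
    f.seek(IFD_offset, 0)
    # The first two bytes of an IFD give the number of 12-byte entries that follow.
    num_entries = int.from_bytes(f.read(2), 'little')
    for _ in range(num_entries):
        entry = f.read(12)
        tag = int.from_bytes(entry[:2], 'little')
        field_type = int.from_bytes(entry[2:4], 'little')
        count = int.from_bytes(entry[4:8], 'little')
        value_offset = int.from_bytes(entry[8:], 'little')
        print(tag, field_type, count, value_offset)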

unpack_from requires a buffer of at least 784 bytes

I'm running the following function for an ML model.
def get_images(filename):
    bin_file = open(filename, 'rb')
    buf = bin_file.read()  # the whole file is read into memory
    bin_file.close()       # release the file handle back to the operating system
    index = 0
    magic, num_images, num_rows, num_colums = struct.unpack_from(big_endian + four_bytes, buf, index)
    index += struct.calcsize(big_endian + four_bytes)
    images = []  # temp images as tuples
    for x in range(num_images):
        im = struct.unpack_from(big_endian + picture_bytes, buf, index)
        index += struct.calcsize(big_endian + picture_bytes)
        im = list(im)
        for i in range(len(im)):
            if im[i] > 1:
                im[i] = 1
However, I am receiving an error at the line:
im = struct.unpack_from(big_endian + picture_bytes, buf, index)
With the error:
error: unpack_from requires a buffer of at least 784 bytes
I have noticed this error is only occurring at certain iterations. I cannot figure out why this might be the case. The dataset is a standard MNIST dataset which is freely available online.
I have also looked through similar questions on SO (e.g. error: unpack_from requires a buffer) but they don't seem to resolve the issue.
You didn't include the struct formats in your minimal reproducible example, so it is hard to say why you are getting the error. Either you are using a partial/corrupted file or your struct formats are wrong.
This answer uses the test file 't10k-images-idx3-ubyte.gz' and the file format described at http://yann.lecun.com/exdb/mnist/.
Open the file and read it into a bytes object (gzip is used because of the file's type).
import gzip,struct
with gzip.open(r'my\path\t10k-images-idx3-ubyte.gz','rb') as f:
    data = bytes(f.read())
print(len(data))
The file format spec says the header is 16 bytes (four 32-bit ints); separate it from the pixels with a slice, then unpack it:
hdr,pixels = data[:16],data[16:]
magic, num_images, num_rows, num_cols = struct.unpack(">4L",hdr)
# print(len(hdr),len(pixels))
# print(magic, num_images, num_rows, num_cols)
There are a number of ways to iterate over the individual images.
img_size = num_rows * num_cols
imgfmt = "B"*img_size
for i in range(num_images):
    start = i * img_size
    end = start + img_size
    img = pixels[start:end]
    img = struct.unpack(imgfmt, img)
    # do work on the img
Or...
imgfmt = "B"*img_size
for img in struct.iter_unpack(imgfmt, pixels):
    img = [p if p == 0 else 1 for p in img]
The itertools grouper recipe would probably also work.
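For completeness, a sketch of that recipe applied here, assuming pixels and img_size from the code above (iterating over a bytes object yields ints in Python 3):

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks (the itertools grouper recipe).
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

for img in grouper(pixels, img_size):
    img = [p if p == 0 else 1 for p in img]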

Problems when I write np array to binary file, new file is only half of the original one

I am trying to remove the top 24 lines of a raw file. I opened the original raw file (call it raw1.raw), converted it to an nparray, then initialized a new array and removed the top 24 lines. But after writing the new array to a new binary file (raw2.raw), I found raw2.raw is only 15.2 MB while the original raw1.raw is about 30.6 MB. My code:
import numpy as np
import imageio
import rawpy
import cv2

def ave():
    fd = open('raw1.raw', 'rb')
    rows = 3000  # around 3000, not the real rows
    cols = 5100  # around 5100, not the real cols
    f = np.fromfile(fd, dtype=np.uint8, count=rows*cols)
    I_array = f.reshape((rows, cols))  # notice row, column format
    #print(I_array)
    fd.close()
    im = np.zeros((rows - 24, cols))
    for i in range(len(I_array) - 24):
        for j in range(len(I_array[i])):
            im[i][j] = I_array[i + 24][j]
    #print(im)
    newFile = open("raw2.raw", "wb")
    im.astype('uint8').tofile(newFile)
    newFile.close()

if __name__ == "__main__":
    ave()
I tried to use im.astype('uint16') when writing the binary file, but the values are wrong if I use uint16.
There must clearly be more data in your 'raw1.raw' file that you are not using. Are you sure that file wasn't created using 'uint16' data and you are just pulling out the first half as 'uint8' data? I just checked by writing random data:
import os, numpy as np
x = np.random.randint(0, 256, size=(3000, 5100), dtype='uint8')
x.tofile(open('testfile.raw', 'wb'))  # binary mode for raw bytes
print(os.stat('testfile.raw').st_size)  # I get 15.3MB.
So, 'uint8' for a 3000 by 5100 array clearly takes up 15.3 MB. I don't know how you got 30+.
EDIT:
Just to add more clarification. Do you realize that dtype does nothing more than change the "view" of your data? It doesn't affect the actual data that is saved in memory. This also goes for data that you read from a file. Take for example:
import numpy as np
#The way to understand x, is that x is taking 12 bytes in memory and using
#that information to hold 3 values. The first 4 bytes are the first value,
#the second 4 bytes are the second, etc.
x = np.array([1,2,3],dtype='uint32')
#Change x to display those 12 bytes as 6 different values. Doing this does
#NOT change the data that the array is holding. You are only changing the
#'view' of the data.
x.dtype = 'uint16'
print(x)
In general (there are a few special cases), changing the dtype doesn't change the underlying data. However, the conversion function .astype() does change the underlying data. If you have an array of 12 bytes viewed as 'int32', then running .astype('uint8') will take each entry (4 bytes) and convert it (known as casting) to a uint8 entry (1 byte). The new array will only have 3 bytes for the 3 entries. You can see this literally:
x = np.array([1,2,3],dtype='uint32')
print(x.tobytes())
y = x.astype('uint8')
print(y.tobytes())
So, when we say that a file is 30 MB, we mean that the file holds (minus some header information) 30,000,000 bytes, which are exactly uint8s. 1 uint8 is 1 byte. If an array holds 6000 by 5100 uint8s (bytes), then the array holds 30,600,000 bytes of information in memory.
Likewise, if you read a file (it does not matter what file) with np.fromfile(fd, dtype=np.uint8, count=15_300_000), then you told Python to read EXACTLY 15_300_000 bytes (again, 1 byte is 1 uint8) of information (15 MB). Whether your file is 100 MB, 40 MB, or even 30 MB is completely irrelevant, because you told Python to only read the first 15 MB of data.
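If the file really does contain 16-bit samples, a sketch of reading it with the matching dtype; rows and cols are the question's placeholders, and the 24-row crop then becomes a single slice instead of a double loop:

import numpy as np

rows, cols = 3000, 5100  # placeholders, as in the question
with open('raw1.raw', 'rb') as fd:
    # Each uint16 is 2 bytes, so this consumes the full ~30.6 MB file.
    data = np.fromfile(fd, dtype=np.uint16, count=rows * cols)
I_array = data.reshape((rows, cols))
# Slicing drops the top 24 rows and keeps the 16-bit values intact:
I_array[24:].tofile('raw2.raw')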

How to write GPS degrees, minutes AND seconds to a tif in python using the piexif package?

How can I write full GPS coordinates in degrees, minutes, seconds format as a GPS EXIF tag to .tif images using Python? I can write a portion of the GPS coord (e.g. Longitude) using the piexif package if I only include the degrees and minutes as integers. However, piexif throws a ValueError whenever I include seconds and fractions of a second. piexif will only accept a tuple containing two integers for the longitude, despite the standard calling for 3 integers.
Info on EXIF tags is available here: http://www.cipa.jp/std/documents/e/DC-008-2012_E.pdf
import piexif # pip install piexif
from PIL import Image # PIL version 4.0.0 for compatibility with conda/py3.6
fname = "foo.tiff"
#read tiff file into a pillow image obj
im = Image.open(fname)
#read in any existing exif data to a dict
exif_dict = piexif.load(fname)
#GPS coord to write to GPS tag in dms format
#long = 120° 37' 42.9996" East longitude
LongRef = "E"
Long_d = 120
Long_m = 37
Long_s = 42.9996
#add gps data to EXIF containing GPS dict
#exif_dict['GPS'][piexif.GPSIFD.GPSLongitude] = (Long_d, Long_m) #this works
exif_dict['GPS'][piexif.GPSIFD.GPSLongitude] = (Long_d, (Long_m,Long_s)) #this returns an error
exif_dict['GPS'][piexif.GPSIFD.GPSLongitude] = (Long_d, Long_m, Long_s) #this also returns an error
"""
Traceback (most recent call last):
File "<ipython-input-210-918cd4e2989f>", line 7, in <module>
exif_bytes = piexif.dump(exif_dict)
File "C:\Users\JF\AppData\Local\Continuum\anaconda3\lib\site-packages\piexif-1.0.13-py3.6.egg\piexif\_dump.py", line 74, in dump
gps_set = _dict_to_bytes(gps_ifd, "GPS", zeroth_length + exif_length)
File "C:\Users\JF\AppData\Local\Continuum\anaconda3\lib\site-packages\piexif-1.0.13-py3.6.egg\piexif\_dump.py", line 341, in _dict_to_bytes
'{0} in {1} IFD. Got as {2}.'.format(key, ifd, type(ifd_dict[key]))
ValueError: "dump" got wrong type of exif value.
4 in GPS IFD. Got as <class 'tuple'>.
"""
#convert updated GPS dict to exif_bytes
exif_bytes = piexif.dump(exif_dict)
#encode updated exif tag into image and save as a jpeg
im.save(fname.replace('.tiff','.jpeg'), "jpeg", exif=exif_bytes)
Found this solution. There is great documentation here:
http://www.cipa.jp/std/documents/e/DC-008-2012_E.pdf
https://piexif.readthedocs.io/en/latest/functions.html
Hope this is useful.
import piexif
#http://www.cipa.jp/std/documents/e/DC-008-2012_E.pdf
#https://piexif.readthedocs.io/en/latest/functions.html
exif_dict = piexif.load("test_exif.JPG")
gps_ifd = {piexif.GPSIFD.GPSLatitudeRef: "N",
           piexif.GPSIFD.GPSLatitude: (2, 2),
           piexif.GPSIFD.GPSLongitudeRef: "W",
           piexif.GPSIFD.GPSLongitude: (1, 1),
           piexif.GPSIFD.GPSAltitudeRef: (0),
           piexif.GPSIFD.GPSAltitude: (1700, 1)}
exif_dict = {"0th":{}, "Exif":{}, "GPS":gps_ifd, "1st":{}, "thumbnail":None}
exif_bytes = piexif.dump(exif_dict)
piexif.insert(exif_bytes,"test_exif.JPG")
exif_dict = piexif.load("test_exif.JPG")
print(exif_dict)
I haven't used piexif but according to their docs you need to specify the data in rational format as in (int, int):
exif_dict["GPS"][piexif.GPSIFD.GPSLongitude] = [(120, 1), (37,1), (429996, 10000)];
Basically use (numerator, denominator) to describe your values with integers.
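A small helper for that conversion, with an illustrative name and a fixed denominator for the fractional seconds (10000 here, matching the example above):

def dms_to_rationals(d, m, s, s_denom=10000):
    # Convert degrees/minutes/seconds to the (numerator, denominator) pairs EXIF expects.
    return ((int(d), 1), (int(m), 1), (round(s * s_denom), s_denom))

# dms_to_rationals(120, 37, 42.9996) -> ((120, 1), (37, 1), (429996, 10000))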

How python calculates the size of text files

I use the following code to save to text file:
filepath = open(filename, 'a')
np.savetxt(filepath, C, fmt='%i')
I came from C, where I can control the size of the resulting file and know it in advance. Hence, I want to understand how the size of the file is calculated in Python. My program generates a numpy matrix of shape (12500, 65) containing values 1 or -1. The resulting text file on disk is 2,024,874 bytes, which does not make sense to me! Isn't it supposed to be calculated as (assuming the size of a signed integer is 8, as I explicitly set fmt='%i'): 12500 * 65 * 8 = 6,500,000 bytes?
As mentioned by Mark, you're saving text, i.e. "1", not \x01\x00.... To demonstrate:
import io
import numpy as np
tenbyten = np.ones((10, 10), dtype=int)
myfile = io.BytesIO()
np.savetxt(myfile, tenbyten, fmt='%i')
len(myfile.getvalue()) # 200
myfile.getvalue()[:30] # b'1 1 1 1 1 1 1 1 1 1\n1 1 1 1 1 '
It's a string of ASCII number 1's and spaces, with newlines. Yours has some -'s mixed in, I gather. If you want pure binary, you could do something like the following:
raw_data = tenbyten.tobytes() # .tofile() to go to a file instead of bytestring
len(raw_data) # 800
raw_data[:10] # b'\x01\x00\x00\x00\x00\x00\x00\x00\x01\x00'
To get something that matches your 6.5 MB as an exercise, you could do the following: len(np.empty((12500, 65), dtype='int64').tobytes()). Note that the raw data is very raw: it discards all information about the data type, endianness, and shape, so the following is true:
np.ones((10, 10)).tobytes() == np.ones((5, 20)).tobytes() == np.ones(100).tobytes()
If you use np.save, that will save binary with the metadata:
my_npy = io.BytesIO()
np.save(my_npy, tenbyten)
len(my_npy.getbuffer()) # 880
my_npy.getvalue()[:70]
# b"\x93NUMPY\x01\x00F\x00{'descr': '<i8', 'fortran_order': False, 'shape': (10, 10), "
For your case with +1/-1, forcing a datatype of int8 (with my_array.astype('int8')) is basically a free 8-fold data compression.
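As a rough sanity check on the 2,024,874-byte figure: with fmt='%i', "1" is one character and "-1" is two, and each row adds 64 spaces plus a newline, so the exact size depends on the mix of signs. A sketch, assuming roughly half the entries are -1:

rows, cols = 12500, 65
separators = rows * cols              # 64 spaces + 1 newline per row
digit_chars = rows * cols * 1.5       # assume a 50/50 mix of "1" and "-1"
print(int(separators + digit_chars))  # 2031250, close to the observed 2024874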
