I use the following code to save a numpy array to a text file:
filepath = open(filename, 'a')
np.savetxt(filepath, C, fmt='%i')
I come from C, where I can control the size of the resulting file and know it in advance. Hence, I want to understand how the size of the file is calculated in Python. My program generates a numpy matrix of shape (12500, 65) containing only the values 1 and -1. The resulting text file on disk is 2,024,874 bytes, which does not make sense to me. Shouldn't it be calculated as `12500 * 65 * 8 = 6,500,000 bytes`, assuming a signed integer is 8 bytes, since I explicitly pass fmt='%i'?
As mentioned by Mark, you're saving text, i.e. "1", not \x01\x00.... To demonstrate:
import io
import numpy as np
tenbyten = np.ones((10, 10), dtype=int)
myfile = io.BytesIO()
np.savetxt(myfile, tenbyten, fmt='%i')
len(myfile.getvalue()) # 200
myfile.getvalue()[:30] # b'1 1 1 1 1 1 1 1 1 1\n1 1 1 1 1 '
It's a string of ASCII '1' characters and spaces, with newlines; yours has some '-' signs mixed in as well, I gather. If you want pure binary, you could do something like the following:
raw_data = tenbyten.tobytes() # .tofile() to go to a file instead of bytestring
len(raw_data) # 800
raw_data[:10] # b'\x01\x00\x00\x00\x00\x00\x00\x00\x01\x00'
To get something that matches your expected 6,500,000 bytes as an exercise, you could do the following: len(np.empty((12500, 65), dtype='int64').tobytes()). Note that the raw data is very raw and discards all information about the data type, endianness, and shape, so the following is true:
np.ones((10, 10)).tobytes() == np.ones((5, 20)).tobytes() == np.ones(100).tobytes()
If you use np.save, that will save the binary data along with the metadata:
my_npy = io.BytesIO()
np.save(my_npy, tenbyten)
len(my_npy.getbuffer()) # 880
my_npy.getvalue()[:70]
# b"\x93NUMPY\x01\x00F\x00{'descr': '<i8', 'fortran_order': False, 'shape': (10, 10), "
For your case with +1/-1, forcing a datatype of int8 (with my_array.astype('int8')) is basically a free 8-fold data compression.
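As a quick sketch of that size difference (the array below is just an illustration, not your actual data):

import numpy as np

signs = np.ones((12500, 65), dtype='int64')  # stand-in for a +1/-1 matrix
signs[::2] = -1

print(len(signs.tobytes()))                 # 6500000 bytes as int64
print(len(signs.astype('int8').tobytes()))  # 812500 bytes as int8, 8x smaller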
I am trying to remove the top 24 lines of a raw file. I opened the original raw file (let's call it raw1.raw), converted it to a numpy array, then initialized a new array and removed the top 24 lines. But after writing the new array to a new binary file (raw2.raw), I found that raw2.raw is only 15.2 MB while the original raw1.raw is about 30.6 MB. My code:
import numpy as np
import imageio
import rawpy
import cv2

def ave():
    fd = open('raw1.raw', 'rb')
    rows = 3000  # around 3000, not the real rows
    cols = 5100  # around 5100, not the real cols
    f = np.fromfile(fd, dtype=np.uint8, count=rows*cols)
    I_array = f.reshape((rows, cols))  # notice row, column format
    #print(I_array)
    fd.close()
    im = np.zeros((rows - 24, cols))
    for i in range(len(I_array) - 24):
        for j in range(len(I_array[i])):
            im[i][j] = I_array[i + 24][j]
    #print(im)
    newFile = open("raw2.raw", "wb")
    im.astype('uint8').tofile(newFile)
    newFile.close()

if __name__ == "__main__":
    ave()
I tried using im.astype('uint16') when writing to the binary file, but the values are wrong if I use uint16.
There must clearly be more data in your 'raw1.raw' file that you are not using. Are you sure that file wasn't created with 'uint16' data and you are just pulling out the first half as 'uint8' data? I just checked by writing random data:
import os, numpy as np
x = np.random.randint(0, 256, size=(3000, 5100), dtype='uint8')
x.tofile(open('testfile.raw', 'wb'))  # open in binary mode
print(os.stat('testfile.raw').st_size)  # I get 15.3 MB
So a 3000 by 5100 array of 'uint8' clearly takes up 15.3 MB. I don't know how you got 30+ MB.
EDIT
Just to add more clarification: do you realize that dtype does nothing more than change the "view" of your data? It doesn't affect the actual data that is saved in memory. This also goes for data that you read from a file. Take for example:
import numpy as np
#The way to understand x, is that x is taking 12 bytes in memory and using
#that information to hold 3 values. The first 4 bytes are the first value,
#the second 4 bytes are the second, etc.
x = np.array([1,2,3],dtype='uint32')
#Change x to display those 12 bytes at 6 different values. Doing this does
#NOT change the data that the array is holding. You are only changing the
#'view' of the data.
x.dtype = 'uint16'
print(x)
In general (there are a few special cases), changing the dtype doesn't change the underlying data. However, the conversion function .astype() does change the underlying data. If you have an array of 12 bytes viewed as 'uint32', then running .astype('uint8') will take each entry (4 bytes) and convert it (known as casting) to a uint8 entry (1 byte). The new array will only have 3 bytes for the 3 entries. You can see this literally:
x = np.array([1,2,3],dtype='uint32')
print(x.tobytes())
y = x.astype('uint8')
print(y.tobytes())
So, when we say that a file is 30 MB, we mean that the file holds (minus some header information) 30,000,000 bytes, each of which is exactly one uint8. One uint8 is one byte. If an array has 6000 by 5100 uint8s (bytes), then the array has 30,600,000 bytes of information in memory.
Likewise, if you read a file (it does not matter which file) with np.fromfile(fd, dtype=np.uint8, count=15_300_000), then you told Python to read EXACTLY 15,300,000 bytes (again, 1 byte is 1 uint8) of information (15.3 MB). Whether your file is 100 MB, 40 MB, or even 30 MB is completely irrelevant, because you told Python to read only the first 15.3 MB of data.
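As a sketch of how you could check whether 'raw1.raw' actually holds uint16 data (the filename and dimensions are the ones from the question, which are stated to be approximate):

import os
import numpy as np

rows, cols = 3000, 5100
file_size = os.stat('raw1.raw').st_size
print(file_size, rows * cols, rows * cols * 2)  # actual size vs. uint8 and uint16 expectations

# If the size matches rows * cols * 2, the file likely holds uint16 data
if file_size == rows * cols * 2:
    data = np.fromfile('raw1.raw', dtype=np.uint16, count=rows * cols).reshape(rows, cols)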
I have some very large data to deal with. I'd like to be able to use np.load(filename, mmap_mode="r+") to use these files on disk rather than RAM. My issue is that creating them in RAM causes the exact problem I'm trying to avoid.
I know about np.memmap already, and that is a potential solution, but creating a memmap and then saving the array with np.save(filename, memmap) means I'd be doubling the disk space requirement, even if only briefly, and that isn't always an option. The main reason I don't want to use a raw memmap is that the header information in .npy files (namely shape and dtype) is useful to have.
My question is: can I create a numpy file without needing to first create it in memory? That is, can I create a numpy file by just giving a dtype and a shape? The idea would be along the lines of np.save(filename, np.empty((x, y, z))), but I'm assuming np.empty requires the array to be allocated in memory before saving.
My current solution is:
import tempfile
import numpy as np

def create_empty_numpy_file(filename, shape, dtype=np.float64):
    with tempfile.TemporaryFile() as tmp:
        memmap = np.memmap(tmp, dtype, mode="w+", shape=shape)
        np.save(filename, memmap)
EDIT
My final solution based on bnaeker's answer and a few details from numpy.lib.format:
import numpy as np

class MockFlags:
    def __init__(self, shape, c_contiguous=True):
        self.c_contiguous = c_contiguous
        self.f_contiguous = (not c_contiguous) or (c_contiguous and len(shape) == 1)

class MockArray:
    def __init__(self, shape, dtype=np.float64, c_contiguous=True):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.flags = MockFlags(shape, c_contiguous)

    def save(self, filename):
        if self.dtype.itemsize == 0:
            buffersize = 0
        else:
            # Set buffer size to 16 MiB to hide the Python loop overhead.
            buffersize = max(16 * 1024 ** 2 // self.dtype.itemsize, 1)
        n_chunks, remainder = np.divmod(
            np.prod(self.shape) * self.dtype.itemsize, buffersize
        )
        with open(filename, "wb") as f:
            np.lib.format.write_array_header_2_0(
                f, np.lib.format.header_data_from_array_1_0(self)
            )
            for chunk in range(n_chunks):
                f.write(b"\x00" * buffersize)
            f.write(b"\x00" * remainder)
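A short usage sketch (the filename and shape are just examples): the file is created on disk without ever allocating the full array in RAM, and can then be opened memory-mapped.

MockArray((5000, 5000), dtype=np.float64).save("empty.npy")

arr = np.load("empty.npy", mmap_mode="r+")  # memory-mapped view, not loaded into RAM
arr[0, :10] = 1.0                           # writes go through to the file
arr.flush()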
The Numpy file format is really simple. There are a few under-documented functions you can use to create the required header bytes from the metadata needed to build an array, without actually building one.
import numpy as np

def create_npy_header_bytes(
    shape, dtype=np.float64, fortran_order=False, format_version="2.0"
):
    # 4 or 2-byte unsigned integer, depending on version
    n_size_bytes = 4 if format_version[0] == "2" else 2
    magic = b"\x93NUMPY"
    version_info = (
        int(each).to_bytes(1, "little") for each in format_version.split(".")
    )
    # Keys are supposed to be alphabetically sorted
    header = {
        "descr": np.lib.format.dtype_to_descr(np.dtype(dtype)),
        "fortran_order": fortran_order,
        "shape": shape
    }
    # Pad header up to multiple of 64 bytes
    header_bytes = str(header).encode("ascii")
    header_len = len(header_bytes)
    current_length = header_len + len(magic) + 2 + n_size_bytes  # for version information
    required_length = int(np.ceil(current_length / 64.0) * 64)
    padding = required_length - current_length - 1  # For newline
    header_bytes += b" " * padding + b"\n"
    # Length of the header dict, including padding and newline
    length = len(header_bytes).to_bytes(n_size_bytes, "little")
    return b"".join((magic, *version_info, length, header_bytes))
You can test that it's equivalent with this snippet:
import numpy as np
import io
x = np.zeros((10, 3, 4))
first = create_npy_header_bytes(x.shape)
stream = io.BytesIO()
np.lib.format.write_array_header_2_0(
    stream, np.lib.format.header_data_from_array_1_0(x)
)
print(f"Library: {stream.getvalue()}")
print(f"Custom: {first}")
You should see something like:
Library: b"\x93NUMPY\x02\x00t\x00\x00\x00{'descr': '<f8', 'fortran_order': False, 'shape': (10, 3, 4), } \n"
Custom: b"\x93NUMPY\x02\x00t\x00\x00\x00{'descr': '<f8', 'fortran_order': False, 'shape': (10, 3, 4)} \n"
which match, except for the trailing comma inside the header dict representation. That will not matter, as this is required to be a valid Python literal string representation of a dict, which will happily ignore that comma if it's there.
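With those header bytes you can also create the empty file on disk without ever allocating the data. Here is a sketch, assuming a C-contiguous float64 array, and relying on the unwritten region reading back as zeros (which sparse-file-aware filesystems provide):

import numpy as np

shape = (10, 3, 4)
dtype = np.dtype(np.float64)

with open("empty.npy", "wb") as f:
    f.write(create_npy_header_bytes(shape, dtype))
    # Extend the file to the full data size without writing the data itself
    f.seek(int(np.prod(shape)) * dtype.itemsize - 1, 1)
    f.write(b"\x00")

arr = np.load("empty.npy", mmap_mode="r+")
print(arr.shape, arr.dtype)  # (10, 3, 4) float64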
As an alternative approach, you could mock out an object which has the required fields for the library functions used to make the header itself. For np.lib.format.header_data_from_array_1_0, these seem to be .flags (which must have a field c_contiguous and/or f_contiguous), and a dtype. That's actually much simpler, and would look like:
import numpy as np
import io

class MockFlags:
    def __init__(self, shape, c_contiguous=True):
        self.c_contiguous = c_contiguous
        self.f_contiguous = (not c_contiguous) or (c_contiguous and len(shape) == 1)

class MockArray:
    def __init__(self, shape, dtype=np.float64, c_contiguous=True):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.flags = MockFlags(shape, c_contiguous)

mock = MockArray((10, 3, 4))

stream = io.BytesIO()
np.lib.format.write_array_header_2_0(
    stream, np.lib.format.header_data_from_array_1_0(mock)
)
print(stream.getvalue())
You should see:
b"\x93NUMPY\x02\x00t\x00\x00\x00{'descr': '<f8', 'fortran_order': False, 'shape': (10, 3, 4), } \n"
which happily matches what we have above, but without having to do the tedious work of counting bytes, mucking with padding, etc. Much nicer :)
I'm trying to read noncontiguous fields from a binary file in Python using numpy fromfile function. It's based on this Matlab code using fread:
fseek(file, 0, 'bof');
q = fread(file, inf, 'float32', 8);
8 indicates the number of bytes I want to skip after reading each value. I was wondering if there was a similar option in fromfile, or if there is another way of reading specific values from a binary file in Python. Thanks for your help.
Henrik
Something like this should work, untested:
import struct

floats = []
with open(filename, 'rb') as f:
    while True:
        buff = f.read(4)  # 'f' is 4 bytes wide
        if len(buff) < 4:
            break
        x = struct.unpack('f', buff)[0]  # Convert buffer to float (take it from the returned tuple)
        floats.append(x)                 # Add float to list (for example)
        f.seek(8, 1)                     # The second arg 1 specifies a relative offset
Using struct.unpack()
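If you'd rather stay within numpy, a sketch using a structured dtype should also work (the field names are just placeholders): each record is one float32 value followed by 8 bytes of padding, and np.fromfile reads the whole file in one call. Note this assumes every value, including the last one, is followed by the 8 skipped bytes.

import numpy as np

record = np.dtype([('value', '<f4'), ('skip', 'V8')])  # 4-byte float + 8 bytes to skip
q = np.fromfile(filename, dtype=record)['value']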
I'm trying to unpack a struct containing some metadata and an image. It's all binary, but the way the contents were packed is quite strange: instead of elements being aligned on byte boundaries, they're aligned on every 6 bits. The file format and contents are described in the image below and can also be accessed at the link: http://etlcdb.db.aist.go.jp/etlcdb/etln/form_c.htm.
The only fields that I am interested in are the JIS code and the 16 gray level image data that is the last field. This data is taken from http://etlcdb.db.aist.go.jp/etlcdb/etln/etl4/etl4.htm.
The main problem I'm facing is that the fields are 6-bit aligned; I don't know how to properly unpack the struct.
As an aside, it's also unclear how to handle 4-bit images with PIL. Mode '1' isn't working correctly, as expected.
import struct
from PIL import Image, ImageFont, ImageDraw
filename = 'ETL4/ETL4C'
skip = 0
record_size = 2952
with open(filename, 'r') as f:
    f.seek(skip * record_size)
    s = f.read(record_size)
    r = struct.unpack('>9x1i203x2736s', s)
    print r[1]
i1 = Image.frombytes('1', (72, 76), r[1], 'raw')
# fn = 'ETL9B_{:d}_{:d}.png'.format((r[0]-1)%20+1, hex(r[1])[-4:])
fn = 'name1.png'
i1.save(fn, 'PNG')
I am new to both Matlab and Python and I have to convert a program in Matlab to Python. I am not sure how to typecast the data after reading from the file in Python. The file used is a binary file.
Below is the Matlab code:
fid = fopen (filename, 'r');
fseek (fid, 0, -1);
meta = zeros (n, 9, 'single');
v = zeros (n, 128, 'single');
d = 0;
for i = 1:n
    meta(i,:) = fread(fid, 9, 'float');
    d = fread(fid, 1, 'int');
    v(i,:) = fread(fid, d, 'uint8=>single');
end
I have written the below program in python:
fid = open(filename, 'r')
fid.seek(0 , 0)
meta = np.zeros((n,9),dtype = np.float32)
v = np.zeros((n,128),dtype = np.float32)
for i in range(n):
    data_str = fid.read(9)
    meta[1,:] = unpack('f', data_str)
For this unpack, I am getting the error "unpack requires a string argument of length 4". Please suggest some way to make it work.
I looked a little into the problem, mainly because I will need this in the near future, too. It turns out there is a very simple solution using numpy, assuming you have a Matlab matrix stored like mine.
import numpy as np
def read_matrix(file_name):
    return np.fromfile(file_name, dtype='<f')  # little-endian single-precision float
arr = read_matrix(file_path)
print arr[0:10] #sample data
print len(arr) # number of elements
You must find out the data type (dtype) yourself. Help on this is here. I used fwrite(fid, value, 'single'); to store the data in Matlab; if you have the same, the code above will work.
Note that the returned variable is a flat array; you'll have to reshape it to match the original shape of your data. In my case len(arr) is 307200, from a matrix of size 15360 x 20.
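A sketch of that reshaping step for my example dimensions (note that Matlab stores matrices column-major, so order='F' is likely what you want):

import numpy as np

arr = np.fromfile(file_path, dtype='<f')   # 307200 elements in my case
mat = arr.reshape((15360, 20), order='F')  # column-major order, matching Matlab's layout
print(mat.shape)                           # (15360, 20)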