I compare two simple methods for writing a numpy array into a raw binary file :
# method 1
import numpy
A = numpy.random.randint(1000, size=512*1024*1024) # 2 GB
with open('blah.bin', 'wb') as f:
f.write(A)
and
# method 2
import numpy
A = numpy.random.randint(1000, size=512*1024*1024) # 2 GB
raw_input()
B = A.tostring() # check memory usage of the current process here : 4 GB are used !!
raw_input()
with open('blah.bin', 'wb') as f:
f.write(B)
With the second method, the memory usage is doubled (4 GB here) !
Why is .tostring() often used for writing numpy arrays to file ?
(in http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.tofile.html, it is explained that numpy.ndarray.tofile() may be equivalent to file.write(a.tostring()))
Is method 1 as correct as method 2 for writing such an array to disk ?
The documentation does not say that .tofile() is equivalent to file.write(a.tostring()), it only mentions the latter to explain how the argument sep will behave if it's value is "".
In the second method you are creating a copy of the array A, storing in B, and after that you write in the file, while in the first method this intermediate copy is avoided.
You should also have a look in:
np.savetxt()
Related
I have two HDF5 files having an identical structure, each store a matrix of the same shape. I need to create a third HDF5 file with a matrix representing the element-wise sum of the two mentioned above matrices. Given the sizes of matrices are extremely large (in the Gb-Tb range), what would be the best way to do it, preferably in a parallel way? I am using the h5py interface to the HDF5 library. Are there any libraries capable of doing it?
Yes, this is possible. The key is to access slices of the data from file1 & file2, do your element-wise sum, then write that slice of new data to the file3. You can do this with h5py or PyTables (aka tables). No other libraries are required. I only have passing knowledge of parallel computing. I know h5py supports an mpi interface through the mpi4py Python package. Details here: h5py docs: Parallel HDF5
Here is a simple example. It creates 2 files with a dataset of random floats, shape=(10,10,10). It then creates a new file with an empty dataset of the same shape. The loop reads a slice of data from file1 and file2, sums them, then writes to the same slice in file3. To test with large data, you can modify the shapes to match your file.
21-Jan-2021 Update:
I added code to get the dataset shapes from file1 and file2, and compare them (to be sure they are equal). If the shapes aren't equal, I exit. If they match, I create the new file, then create a dataset of matching shape. (If you really want to be robust, you could do the same with the dtype.) I also use the value of shape[2] as the slice iterator over the dataset.
import h5py
import numpy as np
import random
import sys
arr = np.random.random(10**3).reshape(10,10,10)
with h5py.File('file1.h5','w') as h5fw :
h5fw.create_dataset('data_1',data=arr)
arr = np.random.random(10**3).reshape(10,10,10)
with h5py.File('file2.h5','w') as h5fw :
h5fw.create_dataset('data_2',data=arr)
h5fr1 = h5py.File('file1.h5','r')
f1shape = h5fr1['data_1'].shape
h5fr2 = h5py.File('file2.h5','r')
f2shape = h5fr2['data_2'].shape
if (f1shape!=f2shape):
print ('Datasets shapes do not match')
h5fr1.close()
h5fr2.close()
sys.exit('Exiting due to error.')
else:
with h5py.File('file3.h5','w') as h5fw :
ds3 = h5fw.create_dataset('data_3', shape=f1shape, dtype='f')
for i in range(f1shape[2]):
arr1_slice = h5fr1['data_1'][:,:,i]
arr2_slice = h5fr2['data_2'][:,:,i]
arr3_slice = arr1_slice + arr2_slice
ds3[:,:,i] = arr3_slice
# alternately, you can slice and sum in 1 line
# ds3[:,:,i] = h5fr1['data_1'][:,:,i] + \
# h5fr2['data_2'][:,:,i]
print ('Done.')
h5fr1.close()
h5fr2.close()
I have large binary data files that have a predefined format, originally written by a Fortran program as little endians. I would like to read these files in the fastest, most efficient manner, so using the array package seemed right up my alley as suggested in Improve speed of reading and converting from binary file?.
The problem is the pre-defined format is non-homogeneous. It looks something like this:
['<2i','<5d','<2i','<d','<i','<3d','<2i','<3d','<i','<d','<i','<3d']
with each integer i taking up 4 bytes, and each double d taking 8 bytes.
Is there a way I can still use the super efficient array package (or another suggestion) but with the right format?
Use struct. In particular, struct.unpack.
result = struct.unpack("<2i5d...", buffer)
Here buffer holds the given binary data.
It's not clear from your question whether you're concerned about the actual file reading speed (and building data structure in memory), or about later data processing speed.
If you are reading only once, and doing heavy processing later, you can read the file record by record (if your binary data is a recordset of repeated records with identical format), parse it with struct.unpack and append it to a [double] array:
from functools import partial
data = array.array('d')
record_size_in_bytes = 9*4 + 16*8 # 9 ints + 16 doubles
with open('input', 'rb') as fin:
for record in iter(partial(fin.read, record_size_in_bytes), b''):
values = struct.unpack("<2i5d...", record)
data.extend(values)
Under assumption you are allowed to cast all your ints to doubles and willing to accept increase in allocated memory size (22% increase for your record from the question).
If you are reading the data from file many times, it could be worthwhile to convert everything to one large array of doubles (like above) and write it back to another file from which you can later read with array.fromfile():
data = array.array('d')
with open('preprocessed', 'rb') as fin:
n = os.fstat(fin.fileno()).st_size // 8
data.fromfile(fin, n)
Update. Thanks to a nice benchmark by #martineau, now we know for a fact that preprocessing the data and turning it into an homogeneous array of doubles ensures that loading such data from file (with array.fromfile()) is ~20x to ~40x faster than reading it record-per-record, unpacking and appending to array (as shown in the first code listing above).
A faster (and a more standard) variation of record-by-record reading in #martineau's answer which appends to list and doesn't upcast to double is only ~6x to ~10x slower than array.fromfile() method and seems like a better reference benchmark.
Major Update: Modified to use proper code for reading in a preprocessed array file (function using_preprocessed_file() below), which dramatically changed the results.
To determine what method is faster in Python (using only built-ins and the standard libraries), I created a script to benchmark (via timeit) the different techniques that could be used to do this. It's a bit on the longish side, so to avoid distraction, I'm only posting the code tested and related results. (If there's sufficient interest in the methodology, I'll post the whole script.)
Here are the snippets of code that were compared:
#TESTCASE('Read and constuct piecemeal with struct')
def read_file_piecemeal():
structures = []
with open(test_filenames[0], 'rb') as inp:
size = fmt1.size
while True:
buffer = inp.read(size)
if len(buffer) != size: # EOF?
break
structures.append(fmt1.unpack(buffer))
return structures
#TESTCASE('Read all-at-once, then slice and struct')
def read_entire_file():
offset, unpack, size = 0, fmt1.unpack, fmt1.size
structures = []
with open(test_filenames[0], 'rb') as inp:
buffer = inp.read() # read entire file
while True:
chunk = buffer[offset: offset+size]
if len(chunk) != size: # EOF?
break
structures.append(unpack(chunk))
offset += size
return structures
#TESTCASE('Convert to array (#randomir part 1)')
def convert_to_array():
data = array.array('d')
record_size_in_bytes = 9*4 + 16*8 # 9 ints + 16 doubles (standard sizes)
with open(test_filenames[0], 'rb') as fin:
for record in iter(partial(fin.read, record_size_in_bytes), b''):
values = struct.unpack("<2i5d2idi3d2i3didi3d", record)
data.extend(values)
return data
#TESTCASE('Read array file (#randomir part 2)', setup='create_preprocessed_file')
def using_preprocessed_file():
data = array.array('d')
with open(test_filenames[1], 'rb') as fin:
n = os.fstat(fin.fileno()).st_size // 8
data.fromfile(fin, n)
return data
def create_preprocessed_file():
""" Save array created by convert_to_array() into a separate test file. """
test_filename = test_filenames[1]
if not os.path.isfile(test_filename): # doesn't already exist?
data = convert_to_array()
with open(test_filename, 'wb') as file:
data.tofile(file)
And here were the results running them on my system:
Fastest to slowest execution speeds using Python 3.6.1
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes
Read array file (#randomir part 2): 0.06430 secs, relative 1.00x ( 0.00% slower)
Read all-at-once, then slice and struct: 0.39634 secs, relative 6.16x ( 516.36% slower)
Read and constuct piecemeal with struct: 0.43283 secs, relative 6.73x ( 573.09% slower)
Convert to array (#randomir part 1): 1.38310 secs, relative 21.51x (2050.87% slower)
Interestingly, most of the snippets are actually faster in Python 2...
Fastest to slowest execution speeds using Python 2.7.13
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes
Read array file (#randomir part 2): 0.03586 secs, relative 1.00x ( 0.00% slower)
Read all-at-once, then slice and struct: 0.27871 secs, relative 7.77x ( 677.17% slower)
Read and constuct piecemeal with struct: 0.40804 secs, relative 11.38x (1037.81% slower)
Convert to array (#randomir part 1): 1.45830 secs, relative 40.66x (3966.41% slower)
Take a look at the documentation for numpy's fromfile function: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfile.html and https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#arrays-dtypes-constructing
Simplest example:
import numpy as np
data = np.fromfile('binary_file', dtype=np.dtype('<i8, ...'))
Read more about "Structured Arrays" in numpy and how to specify their data type(s) here: https://docs.scipy.org/doc/numpy/user/basics.rec.html#
There's a lot of good and helpful answers here, but I think the best solution needs more explaining. I implemented a method that reads the entire data file in one pass using the built-in read() and constructs a numpy ndarray all at the same time. This is more efficient than reading the data and constructing the array separately, but it's also a bit more finicky.
line_cols = 20 #For example
line_rows = 40000 #For example
data_fmt = 15*'f8,'+5*'f4,' #For example (15 8-byte doubles + 5 4-byte floats)
data_bsize = 15*8 + 4*5 #For example
with open(filename,'rb') as f:
data = np.ndarray(shape=(1,line_rows),
dtype=np.dtype(data_fmt),
buffer=f.read(line_rows*data_bsize))[0].astype(line_cols*'f8,').view(dtype='f8').reshape(line_rows,line_cols)[:,:-1]
Here, we open the file as a binary file using the 'rb' option in open. Then, we construct our ndarray with the proper shape and dtype to fit our read buffer. We then reduce the ndarray into a 1D array by taking its zeroth index, where all our data is hiding. Then, we reshape the array using np.astype, np.view and np.reshape methods. This is because np.reshape doesn't like having data with mixed dtypes, and I'm okay with having my integers expressed as doubles.
This method is ~100x faster than looping line-for-line through the data, and could potentially be compressed down into a single line of code.
In the future, I may try to read the data in even faster using a Fortran script that essentially converts the binary file into a text file. I don't know if this will be faster, but it may be worth a try.
I have some very large txt fils(about 1.5 GB ) which I want to load into Python as an array. The Problem is in this data a comma is used as a decimal separator. for smaller fils I came up with this solution:
import numpy as np
data= np.loadtxt(file, dtype=np.str, delimiter='\t', skiprows=1)
data = np.char.replace(data, ',', '.')
data = np.char.replace(data, '\'', '')
data = np.char.replace(data, 'b', '').astype(np.float64)
But for the large fils Python runs into an Memory Error. Is there any other more memory efficient way to load this data?
The problem with np.loadtxt(file, dtype=np.str, delimiter='\t', skiprows=1) is that it uses python objects (strings) instead of float64, which is very memory inefficient. You can use pandas read_table
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html#pandas.read_table
to read your file and set decimal=',' to change the default behaviour. This will allow for seamless reading and converting your strings into floats. After loading pandas dataframe use df.values to get a numpy array.
If it's still too large for your memory use chunks
http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking
If still no luck try np.float32 format which further halves memory footprint.
You should try parsing it yourself, iterating for each line (so implicitly using a generator that does not read all the file into memory).
Also, for data of that size, I would use python standard array library, that uses similar memory as a c array. That is, one value next to the other in memory (numpy array is also very efficient in memory usage though).
import array
def convert(s):
# The function that converts the string to float
s = s.strip().replace(',', '.')
return float(s)
data = array.array('d') #an array of type double (float of 64 bits)
with open(filename, 'r') as f:
for l in f:
strnumbers = l.split('\t')
data.extend( (convert(s) for s in strnumbers if s!='') )
#A generator expression here.
I'm sure similar code (with similar memory footprint) can be written replacing the array.array by numpy.array, specially if you need to have a two dimensional array.
I have a piece of software that reads a file and transforms each first value it reads per line using a function (derived from numpy.polyfit and numpy.poly1d functions).
This function has to then write the transformed file away and I wrongly (it seems) assumed that the disk I/O part was the performance bottleneck.
The reason why I claim that it is the transformation that is slowing things down is because I tested the code (listed below) after i changed transformedValue = f(float(values[0])) into transformedValue = 1000.00 and that took the time required down from 1 min to 10 seconds.
I was wondering if anyone knows of a more efficient way to perform repeated transformations like this?
Code snippet:
def transformFile(self, f):
""" f contains the function returned by numpy.poly1d,
inputFile is a tab seperated file containing two floats
per line.
"""
with open (self.inputFile,'r') as fr:
for line in fr:
line = line.rstrip('\n')
values = line.split()
transformedValue = f(float(values[0])) # <-------- Bottleneck
outputBatch.append(str(transformedValue)+" "+values[1]+"\n")
joinedOutput = ''.join(outputBatch)
with open(output,'w') as fw:
fw.write(joinedOutput)
The function f is generated by another function, the function fits a 2d degree polynomial through a set of expected floats and a set of measured floats. A snippet from that function is:
# Perform 2d degree polynomial fit
z = numpy.polyfit(measuredValues,expectedValues,2)
f = numpy.poly1d(z)
-- ANSWER --
I have revised the code to vectorize the values prior to transforming them, which significantly speed-up the performance, the code is now as follows:
def transformFile(self, f):
""" f contains the function returned by numpy.poly1d,
inputFile is a tab seperated file containing two floats
per line.
"""
with open (self.inputFile,'r') as fr:
outputBatch = []
x_values = []
y_values = []
for line in fr:
line = line.rstrip('\n')
values = line.split()
x_values.append(float(values[0]))
y_values.append(int(values[1]))
# Transform python list into numpy array
xArray = numpy.array(x_values)
newArray = f(xArray)
# Prepare the outputs as a list
for index, i in enumerate(newArray):
outputBatch.append(str(i)+" "+str(y_values[index])+"\n")
# Join the output list elements
joinedOutput = ''.join(outputBatch)
with open(output,'w') as fw:
fw.write(joinedOutput)
It's difficult to suggest improvements without knowing exactly what your function f is doing. Are you able to share it?
However, in general many NumPy operations often work best (read: "fastest") on NumPy array objects rather than when they are repeated multiple times on individual values.
You might like to consider reading the numbers values[0] into a Python list, passing this to a NumPy array and using vectorisable NumPy operations to obtain an array of output values.
I would like to store approx 4000 numpy arrays (of 1.5 MB each) in a serialized uncompressed file (approx 6 GB of data). Here is an example with 2 small arrays :
import numpy
d1 = { 'array1' : numpy.array([1,2,3,4]), 'array2': numpy.array([5,4,3,2]) }
numpy.savez('myarrays', **d1)
d2 = numpy.load('myarrays.npz')
for k in d2:
print d2[k]
It works, but I would like to do the same thing not in a single step :
When saving, I would like to be able to save 10 arrays, then do some other task (than may use some seconds), then write 100 other arrays, then do something else, then write some other 50 arrays, etc.
When loading : idem, I would like to be able to load some arrays, then do some other task, then continue the loading.
How to do with this numpy.savez / numpy.load ?
I don't think you can do this with np.savez. This, however, is the perfect use-case for hdf5. See either:
http://www.h5py.org
or
http://www.pytables.org
As an example of how to do this in h5py:
h5f = h5py.File('test.h5', 'w')
h5f.create_dataset('array1', data=np.array([1,2,3,4]))
h5f.create_dataset('array2', data=np.array([5,4,3,2]))
h5f.close()
# Now open it back up and read data
h5f = h5py.File('test.h5', 'r')
a = h5f['array1'][:]
b = h5f['array2'][:]
h5f.close()
print a
print b
# [1 2 3 4]
# [5 4 3 2]
And of course there are more sophisticated ways of doing this, organizing arrays via groups, adding metadata, pre-allocating space in the hdf5 file and then filling it later, etc.
savez in the current numpy is just writing the arrays to temp files with numpy.save, and then adds them to a zip file (with or without compression).
If you're not using compression, you might as well skip step 2 and just save your arrays one by one, and keep all 4000 of them in a single folder.