Using numpy.fromfile to read scattered binary data - python

There are different blocks in a binary file that I want to read using a single call to numpy.fromfile. Each block has the following format:
OES=[
('EKEY','i4',1),
('FD1','f4',1),
('EX1','f4',1),
('EY1','f4',1),
('EXY1','f4',1),
('EA1','f4',1),
('EMJRP1','f4',1),
('EMNRP1','f4',1),
('EMAX1','f4',1),
('FD2','f4',1),
('EX2','f4',1),
('EY2','f4',1),
('EXY2','f4',1),
('EA2','f4',1),
('EMJRP2','f4',1),
('EMNRP2','f4',1),
('EMAX2','f4',1)]
Here is the format of the binary:
Data I want (OES format repeating n times)
------------------------
Useless Data
------------------------
Data I want (OES format repeating m times)
------------------------
etc..
I know the byte increment between the data I want and the useless data. I also know the size of each data block I want.
So far, I have accomplished my goal by seeking on the file object f and then calling:
nparr = np.fromfile(f,dtype=OES,count=size)
So I have a different nparr for each data block I want, and I concatenated all the numpy arrays into one new array.
My goal is to have a single array with all the blocks I want, without concatenating (for memory purposes). That is, I want to call nparr = np.fromfile(f,dtype=OES) only once. Is there a way to accomplish this goal?

That is, I want to call nparr = np.fromfile(f,dtype=OES) only once. Is there a way to accomplish this goal?
No, not with a single call to fromfile().
But if you know the complete layout of the file in advance, you can preallocate the array, and then use fromfile and seek to read the OES blocks directly into the preallocated array. Suppose, for example, that you know the file positions of each OES block, and you know the number of records in each block. That is, you know:
file_positions = [position1, position2, ...]
numrecords = [n1, n2, ...]
Then you could do something like this (assuming f is the already opened file):
total = sum(numrecords)
nparr = np.empty(total, dtype=OES)
current_index = 0
for pos, n in zip(file_positions, numrecords):
    f.seek(pos)
    nparr[current_index:current_index+n] = np.fromfile(f, count=n, dtype=OES)
    current_index += n
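If the byte increments between the wanted blocks and the useless data are known (as stated in the question), the positions and counts can be derived from the OES record size instead of hard-coded. A minimal sketch, with hypothetical block sizes and gap sizes:

import numpy as np

oes_itemsize = np.dtype(OES).itemsize   # 68 bytes: one i4 plus sixteen f4 fields
numrecords = [1000, 500]                # hypothetical record counts per wanted block
gaps_after = [4096, 4096]               # hypothetical byte sizes of the useless data

file_positions = []
pos = 0
for n, gap in zip(numrecords, gaps_after):
    file_positions.append(pos)          # a wanted block starts here
    pos += n * oes_itemsize + gap       # skip over the block and the junk after it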

Related

Reading and Writing Matrices in Python, Could not Convert String to Float Error

I'm trying to write a matrix (i.e. list of lists) to a txt file and then read it out again. I'm able to do this for lists. But for some reason when I tried to move up to a matrix yesterday, it didn't work.
genotypes=[[] for i in range(10000)]
for n in range(10000):
    for m in range(1024):
        u=np.random.uniform()
        if u<0.9:
            genotypes[n].append(0)
        elif 0.9<u<0.99:
            genotypes[n].append(1)
        elif u>0.99:
            genotypes[n].append(2)
return genotypes
#genotypes=genotype_maker()
#np.savetxt('genotypes.txt',genotypes)
g=open("genotypes.txt","r")
genotypes=[]
for line in g:
    genotypes.append(int(float(line.rstrip())))
I run the code twice. The first time, the middle two lines are not commented out while the last four are commented out. This successfully writes a matrix of floats to a .txt file.
The second time, I comment out the middle two lines and uncomment the last four. Unfortunately I then get the error message: ValueError: could not convert string to float: '0.000000000000000000e+00 0.000000000000000000e+00 (and a whole lot more of these)
What's wrong with the code?
Thanks
In your case, you should just do np.loadtxt("genotypes.txt") if you want to load the file.
However, if you want to do it manually, you need to parse everything yourself. You get an error because np.savetxt saves the matrix in a space-delimited file. You need to split your string before converting it. So for instance:
def str_to_int(x):
    return int(float(x))

g=open("genotypes.txt","r")
genotypes=[]
for line in g:
    values = line.rstrip().split(' ')          # values is an array of strings
    values_int = list(map(str_to_int,values))  # convert strings to int
    genotypes.append(values_int)               # append to your list
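For comparison, a sketch of the np.loadtxt route mentioned above; it handles the splitting and float parsing for you (assuming the file was written with np.savetxt as in the question):

import numpy as np

# Load the space-delimited matrix written by np.savetxt, then cast it to ints.
genotypes = np.loadtxt("genotypes.txt").astype(int)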
a matrix (i.e. list of lists)
Since we are already using numpy, it is possible to have it directly generate one of its own array types storing data of this sort:
np.random.choice(
    3,                   # i.e., allow values from 0..2
    size=(10000, 1024),  # the dimensions of the array to create
    p=(0.9, 0.09, 0.01)  # the relative probability for each value
)
See the numpy.random.choice documentation.

Fastest way to read a binary file with a defined format?

I have large binary data files that have a predefined format, originally written by a Fortran program in little-endian format. I would like to read these files in the fastest, most efficient manner, so using the array package seemed right up my alley as suggested in Improve speed of reading and converting from binary file?.
The problem is the pre-defined format is non-homogeneous. It looks something like this:
['<2i','<5d','<2i','<d','<i','<3d','<2i','<3d','<i','<d','<i','<3d']
with each integer i taking up 4 bytes, and each double d taking 8 bytes.
Is there a way I can still use the super efficient array package (or another suggestion) but with the right format?
Use struct. In particular, struct.unpack.
result = struct.unpack("<2i5d...", buffer)
Here buffer holds the given binary data.
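For instance, a sketch using the full record layout from the question; the concatenated format string follows from the list of pieces above, and the file name is a placeholder:

import struct

fmt = "<2i5d2idi3d2i3didi3d"          # the question's format pieces joined together
record_size = struct.calcsize(fmt)    # 164 bytes: 9 ints + 16 doubles

with open("data.bin", "rb") as f:     # "data.bin" is a placeholder name
    buffer = f.read(record_size)      # one record's worth of bytes
    result = struct.unpack(fmt, buffer)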
It's not clear from your question whether you're concerned about the actual file reading speed (and building data structure in memory), or about later data processing speed.
If you are reading only once, and doing heavy processing later, you can read the file record by record (if your binary data is a recordset of repeated records with identical format), parse it with struct.unpack and append it to a [double] array:
import array
import struct
from functools import partial

data = array.array('d')
record_size_in_bytes = 9*4 + 16*8  # 9 ints + 16 doubles
with open('input', 'rb') as fin:
    for record in iter(partial(fin.read, record_size_in_bytes), b''):
        values = struct.unpack("<2i5d...", record)
        data.extend(values)
This is under the assumption that you are allowed to cast all your ints to doubles and are willing to accept the increase in allocated memory size (a 22% increase for your record from the question).
If you are reading the data from file many times, it could be worthwhile to convert everything to one large array of doubles (like above) and write it back to another file from which you can later read with array.fromfile():
import array
import os

data = array.array('d')
with open('preprocessed', 'rb') as fin:
    n = os.fstat(fin.fileno()).st_size // 8
    data.fromfile(fin, n)
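The write side of that preprocessing step is a single tofile() call; a minimal sketch, assuming data is the array.array('d') built above:

# Write the accumulated doubles out so they can later be reloaded in one shot.
with open('preprocessed', 'wb') as fout:
    data.tofile(fout)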
Update. Thanks to a nice benchmark by #martineau, now we know for a fact that preprocessing the data and turning it into a homogeneous array of doubles ensures that loading such data from file (with array.fromfile()) is ~20x to ~40x faster than reading it record-per-record, unpacking and appending to an array (as shown in the first code listing above).
A faster (and more standard) variation of record-by-record reading in #martineau's answer, which appends to a list and doesn't upcast to double, is only ~6x to ~10x slower than the array.fromfile() method and seems like a better reference benchmark.
Major Update: Modified to use proper code for reading in a preprocessed array file (function using_preprocessed_file() below), which dramatically changed the results.
To determine what method is faster in Python (using only built-ins and the standard libraries), I created a script to benchmark (via timeit) the different techniques that could be used to do this. It's a bit on the longish side, so to avoid distraction, I'm only posting the code tested and related results. (If there's sufficient interest in the methodology, I'll post the whole script.)
Here are the snippets of code that were compared:
#TESTCASE('Read and construct piecemeal with struct')
def read_file_piecemeal():
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        size = fmt1.size
        while True:
            buffer = inp.read(size)
            if len(buffer) != size:  # EOF?
                break
            structures.append(fmt1.unpack(buffer))
    return structures
#TESTCASE('Read all-at-once, then slice and struct')
def read_entire_file():
    offset, unpack, size = 0, fmt1.unpack, fmt1.size
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        buffer = inp.read()  # read entire file
        while True:
            chunk = buffer[offset: offset+size]
            if len(chunk) != size:  # EOF?
                break
            structures.append(unpack(chunk))
            offset += size
    return structures
#TESTCASE('Convert to array (#randomir part 1)')
def convert_to_array():
    data = array.array('d')
    record_size_in_bytes = 9*4 + 16*8  # 9 ints + 16 doubles (standard sizes)
    with open(test_filenames[0], 'rb') as fin:
        for record in iter(partial(fin.read, record_size_in_bytes), b''):
            values = struct.unpack("<2i5d2idi3d2i3didi3d", record)
            data.extend(values)
    return data
#TESTCASE('Read array file (#randomir part 2)', setup='create_preprocessed_file')
def using_preprocessed_file():
    data = array.array('d')
    with open(test_filenames[1], 'rb') as fin:
        n = os.fstat(fin.fileno()).st_size // 8
        data.fromfile(fin, n)
    return data

def create_preprocessed_file():
    """ Save array created by convert_to_array() into a separate test file. """
    test_filename = test_filenames[1]
    if not os.path.isfile(test_filename):  # doesn't already exist?
        data = convert_to_array()
        with open(test_filename, 'wb') as file:
            data.tofile(file)
And here are the results from running them on my system:
Fastest to slowest execution speeds using Python 3.6.1
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes
Read array file (#randomir part 2): 0.06430 secs, relative 1.00x ( 0.00% slower)
Read all-at-once, then slice and struct: 0.39634 secs, relative 6.16x ( 516.36% slower)
Read and construct piecemeal with struct: 0.43283 secs, relative 6.73x ( 573.09% slower)
Convert to array (#randomir part 1): 1.38310 secs, relative 21.51x (2050.87% slower)
Interestingly, most of the snippets are actually faster in Python 2...
Fastest to slowest execution speeds using Python 2.7.13
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes
Read array file (#randomir part 2): 0.03586 secs, relative 1.00x ( 0.00% slower)
Read all-at-once, then slice and struct: 0.27871 secs, relative 7.77x ( 677.17% slower)
Read and construct piecemeal with struct: 0.40804 secs, relative 11.38x (1037.81% slower)
Convert to array (#randomir part 1): 1.45830 secs, relative 40.66x (3966.41% slower)
Take a look at the documentation for numpy's fromfile function: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfile.html and https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#arrays-dtypes-constructing
Simplest example:
import numpy as np
data = np.fromfile('binary_file', dtype=np.dtype('<i8, ...'))
Read more about "Structured Arrays" in numpy and how to specify their data type(s) here: https://docs.scipy.org/doc/numpy/user/basics.rec.html#
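As a hedged sketch, a structured dtype matching the record layout from the question could be spelled out like this; the field names c1..c12 are made up for illustration:

import numpy as np

# One record: the question's format pieces expressed as named fields
# (little-endian 4-byte ints and 8-byte doubles, 164 bytes per record).
record_dtype = np.dtype([
    ('c1', '<i4', (2,)), ('c2', '<f8', (5,)), ('c3', '<i4', (2,)), ('c4', '<f8'),
    ('c5', '<i4'),       ('c6', '<f8', (3,)), ('c7', '<i4', (2,)), ('c8', '<f8', (3,)),
    ('c9', '<i4'),       ('c10', '<f8'),      ('c11', '<i4'),      ('c12', '<f8', (3,)),
])

data = np.fromfile('binary_file', dtype=record_dtype)  # one element per record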
There are a lot of good and helpful answers here, but I think the best solution needs more explaining. I implemented a method that reads the entire data file in one pass using the built-in read() and constructs a numpy ndarray at the same time. This is more efficient than reading the data and constructing the array separately, but it's also a bit more finicky.
line_cols = 20                  # For example
line_rows = 40000               # For example
data_fmt = 15*'f8,' + 5*'f4,'   # For example (15 8-byte doubles + 5 4-byte floats)
data_bsize = 15*8 + 4*5         # For example
with open(filename, 'rb') as f:
    data = np.ndarray(shape=(1, line_rows),
                      dtype=np.dtype(data_fmt),
                      buffer=f.read(line_rows*data_bsize))[0].astype(line_cols*'f8,').view(dtype='f8').reshape(line_rows, line_cols)[:,:-1]
Here, we open the file as a binary file using the 'rb' option in open. Then, we construct our ndarray with the proper shape and dtype to fit our read buffer. We then reduce the ndarray into a 1D array by taking its zeroth index, where all our data is hiding. Then, we reshape the array using the astype, view and reshape ndarray methods. This is because reshape doesn't like having data with mixed dtypes, and I'm okay with having my integers expressed as doubles.
This method is ~100x faster than looping line-for-line through the data, and could potentially be compressed down into a single line of code.
In the future, I may try to read the data in even faster using a Fortran script that essentially converts the binary file into a text file. I don't know if this will be faster, but it may be worth a try.

How to make an equivalent to Fortran's 'access=stream' in python

Let's say I'm making a loop, and after each iteration, I want to extend some array.
iter 1 ------------> iter 2 --------------> iter 3-------------->....
shape=[2,4]---->shape=[2,12]----->shape=[2,36]---->....
In Fortran I used to do this by appending the new numbers to a binary file with:
OPEN(2,file='array.in',form='unformatted',status='unknown',access='stream')
write(2) newarray
So this would extend the old array with new values at the end.
I wish to do the same in Python. This is my attempt so far:
import numpy as np
#write 2x2 array to binfile
bintest=open('binfile.in','wb')
np.ndarray.tofile(np.array([[1.0,2.0],[3.0,4.0]]),'binfile.in')
bintest.close()
#read array from binfile
artest=np.fromfile('binfile.in',dtype=np.float64).reshape(2,2)
But I can't get it to extend the array, say by appending another [[5.0,5.0],[5.0,5.0]] at the end,
#append new values.
np.ndarray.tofile(np.array([[5.0,5.0],[5.0,5.0]]),'binfile.in')
to make it [[1.0,2.0,5.0,5.0],[3.0,4.0,5.0,5.0]] after the reading.
How can I do this?
The other problem I have is that I would like to be able to do this without knowing the shape of the final array (I know it would be 2 x n). But this is not so important.
edit: the use of 'access=stream' is only to skip having to read format headers and tails.
This does the trick:
import numpy as np
#write
bintest=open('binfile.in','ab')
a=np.array([[1.0,2.0],[3.0,2.0]])
a.tofile(bintest)
bintest.close()
#read
array=np.fromfile('binfile.in',dtype=np.float64)
This way, each time it's run, it appends the new array to the end of the file.
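If the goal is the 2 x n layout from the question, the flat stream can be regrouped after reading. A sketch, assuming the file holds 2x2 blocks appended one after another:

import numpy as np

flat = np.fromfile('binfile.in', dtype=np.float64)
blocks = flat.reshape(-1, 2, 2)    # one 2x2 block per append
result = np.hstack(list(blocks))   # stack the blocks side by side -> shape (2, n)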

Transforming float values using a function is performance bottleneck

I have a piece of software that reads a file and transforms each first value it reads per line using a function (derived from numpy.polyfit and numpy.poly1d functions).
The program then has to write the transformed file away, and I wrongly (it seems) assumed that the disk I/O part was the performance bottleneck.
The reason I claim that it is the transformation that is slowing things down is that I tested the code (listed below) after I changed transformedValue = f(float(values[0])) into transformedValue = 1000.00, and that took the time required down from 1 minute to 10 seconds.
I was wondering if anyone knows of a more efficient way to perform repeated transformations like this?
Code snippet:
def transformFile(self, f):
    """ f contains the function returned by numpy.poly1d,
        inputFile is a tab separated file containing two floats
        per line.
    """
    with open(self.inputFile, 'r') as fr:
        for line in fr:
            line = line.rstrip('\n')
            values = line.split()
            transformedValue = f(float(values[0]))  # <-------- Bottleneck
            outputBatch.append(str(transformedValue)+" "+values[1]+"\n")
        joinedOutput = ''.join(outputBatch)
        with open(output,'w') as fw:
            fw.write(joinedOutput)
The function f is generated by another function, which fits a 2nd degree polynomial through a set of expected floats and a set of measured floats. A snippet from that function is:
# Perform 2nd degree polynomial fit
z = numpy.polyfit(measuredValues,expectedValues,2)
f = numpy.poly1d(z)
-- ANSWER --
I have revised the code to vectorize the values prior to transforming them, which significantly sped up performance. The code is now as follows:
def transformFile(self, f):
    """ f contains the function returned by numpy.poly1d,
        inputFile is a tab separated file containing two floats
        per line.
    """
    with open(self.inputFile, 'r') as fr:
        outputBatch = []
        x_values = []
        y_values = []
        for line in fr:
            line = line.rstrip('\n')
            values = line.split()
            x_values.append(float(values[0]))
            y_values.append(int(values[1]))
        # Transform python list into numpy array
        xArray = numpy.array(x_values)
        newArray = f(xArray)
        # Prepare the outputs as a list
        for index, i in enumerate(newArray):
            outputBatch.append(str(i)+" "+str(y_values[index])+"\n")
        # Join the output list elements
        joinedOutput = ''.join(outputBatch)
        with open(output,'w') as fw:
            fw.write(joinedOutput)
It's difficult to suggest improvements without knowing exactly what your function f is doing. Are you able to share it?
However, in general many NumPy operations often work best (read: "fastest") on NumPy array objects rather than when they are repeated multiple times on individual values.
You might like to consider reading the values[0] numbers into a Python list, converting this to a NumPy array, and using vectorisable NumPy operations to obtain an array of output values.
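For illustration, a minimal sketch of that vectorised approach with hypothetical calibration data:

import numpy as np

# Fit the 2nd degree polynomial once (hypothetical measured/expected values).
measured = np.array([0.0, 1.0, 2.0, 3.0])
expected = np.array([0.1, 0.9, 4.2, 8.8])
f = np.poly1d(np.polyfit(measured, expected, 2))

# Transform all the first-column values in one vectorised call
# instead of calling f() once per input line.
x = np.array([0.5, 1.5, 2.5])
y = f(x)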

Python memory error for a large data set

I want to generate a 'bag of words' matrix containing documents with the corresponding counts for the words in the document. In order to do this I run the code below for initialising the bag of words matrix. Unfortunately I receive a memory error after a certain number of documents, in the line where I read the document. Is there a better way of doing this, so that I can avoid the memory error? Please be aware that I would like to process a very large number of documents, ~2,000,000, with only 8 GB of RAM.
def __init__(self, paths, words_count, normalize_matrix = False, trainingset_size = None, validation_set_words_list = None):
    '''
    Open all documents from the given path.
    Initialize the variables needed in order
    to construct the word matrix.

    Parameters
    ----------
    paths: paths to the documents.
    words_count: number of words in the bag of words.
    trainingset_size: the proportion of the data that should be set to the training set.
    validation_set_words_list: the attributes for validation.
    '''
    print '################ Data Processing Started ################'
    self.max_words_matrix = words_count
    print '________________ Reading Docs From File System ________________'
    timer = time()
    for folder in paths:
        self.class_names.append(folder.split('/')[len(folder.split('/'))-1])
        print '____ dataprocessing for category '+folder
        if trainingset_size == None:
            docs = os.listdir(folder)
        elif not trainingset_size == None and validation_set_words_list == None:
            docs = os.listdir(folder)[:int(len(os.listdir(folder))*trainingset_size-1)]
        else:
            docs = os.listdir(folder)[int(len(os.listdir(folder))*trainingset_size+1):]
        count = 1
        length = len(docs)
        for doc in docs:
            if doc.endswith('.txt'):
                d = open(folder+'/'+doc).read()
                # Append a filtered version of the document to the document list.
                self.docs_list.append(self.__filter__(d))
                # Append the name of the document to the list containing document names.
                self.docs_names.append(doc)
                # Increase the class indices counter.
                self.class_indices.append(len(self.class_names)-1)
                print 'Processed '+str(count)+' of '+str(length)+' in category '+folder
                count += 1
What you're asking for isn't possible. Also, Python doesn't automatically get the space benefits you're expecting from BoW. Plus, I think you're doing the key piece wrong in the first place. Let's take those in reverse order.
Whatever you're doing in this line:
self.docs_list.append(self.__filter__(d))
… is likely wrong.
All you want to store for each document is a count vector. In order to get that count vector, you will need to append to a single dict of all words seen. Unless __filter__ is modifying a hidden dict in-place, and returning a vector, it's not doing the right thing.
The main space savings in the BoW model come from not having to store copies of the string keys for each document, and from being able to store a simple array of ints instead of a fancy hash table. But an integer object is nearly as big as a (short) string object, and there's no way to predict or guarantee when you get new integers or strings vs. additional references to existing ones. So, really, the only advantage you get is 1/hash_fullness; if you want any of the other advantages, you need something like an array.array or numpy.ndarray.
For example:
a = np.zeros(len(self.word_dict), dtype='i2')
for word in split_into_words(d):
    try:
        idx = self.word_dict[word]
    except KeyError:
        idx = len(self.word_dict)
        self.word_dict[word] = idx
        a = np.resize(a, idx+1)  # np.resize returns a new array, so rebind it
        a[idx] = 1
    else:
        a[idx] += 1
self.doc_vectors.append(a)
But this still won't be enough. Unless you have on the order of 1K unique words, you can't fit all those counts in memory.
For example, if you have 5000 unique words, you've got 2M arrays, each of which has 5000 2-byte counts, so the most compact possible representation will take 20GB.
Since most documents won't have most words, you will get some benefit by using sparse arrays (or a single 2D sparse array), but there's only so much benefit you can get. And, even if things happened to be ordered in such a way that you get absolutely perfect RLE compression, if the average number of unique words per doc is on the order of 1K, you're still going to run out of memory.
So, you simply can't store all of the document vectors in memory.
If you can process them iteratively instead of all at once, that's the obvious answer.
If not, you'll have to page them in and out to disk (whether explicitly, or by using PyTables or a database or something).
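To make the sparse-array suggestion above concrete, here is a hedged sketch that accumulates counts directly in CSR form with scipy.sparse; iter_documents() is a hypothetical document iterator, and split_into_words() is the tokenizer assumed in the code above:

import numpy as np
from scipy.sparse import csr_matrix

word_dict = {}
indices, counts, indptr = [], [], [0]

for d in iter_documents():                  # hypothetical: yields raw document text
    doc_counts = {}
    for word in split_into_words(d):        # tokenizer assumed as in the code above
        idx = word_dict.setdefault(word, len(word_dict))
        doc_counts[idx] = doc_counts.get(idx, 0) + 1
    indices.extend(doc_counts.keys())       # column indices for this row
    counts.extend(doc_counts.values())      # the matching counts
    indptr.append(len(indices))             # row boundary

bow = csr_matrix((counts, indices, indptr),
                 shape=(len(indptr) - 1, len(word_dict)), dtype=np.int32)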
