I have a piece of software that reads a file and transforms the first value on each line using a function (derived from the numpy.polyfit and numpy.poly1d functions).
It then has to write the transformed file back out, and I wrongly (it seems) assumed that the disk I/O was the performance bottleneck.
The reason why I claim that it is the transformation that is slowing things down is because when I tested the code (listed below) after changing transformedValue = f(float(values[0])) to transformedValue = 1000.00, the run time dropped from 1 minute to 10 seconds.
I was wondering if anyone knows of a more efficient way to perform repeated transformations like this?
Code snippet:
def transformFile(self, f):
    """ f contains the function returned by numpy.poly1d,
    inputFile is a tab-separated file containing two floats
    per line.
    """
    outputBatch = []
    with open(self.inputFile, 'r') as fr:
        for line in fr:
            line = line.rstrip('\n')
            values = line.split()
            transformedValue = f(float(values[0]))  # <-------- Bottleneck
            outputBatch.append(str(transformedValue) + " " + values[1] + "\n")
    joinedOutput = ''.join(outputBatch)
    with open(output, 'w') as fw:
        fw.write(joinedOutput)
The function f is generated by another function, which fits a 2nd-degree polynomial through a set of expected floats and a set of measured floats. A snippet from that function is:
# Perform 2nd-degree polynomial fit
z = numpy.polyfit(measuredValues,expectedValues,2)
f = numpy.poly1d(z)
-- ANSWER --
I have revised the code to vectorize the values prior to transforming them, which significantly sped up the performance. The code is now as follows:
def transformFile(self, f):
    """ f contains the function returned by numpy.poly1d,
    inputFile is a tab-separated file containing two floats
    per line.
    """
    outputBatch = []
    x_values = []
    y_values = []
    with open(self.inputFile, 'r') as fr:
        for line in fr:
            line = line.rstrip('\n')
            values = line.split()
            x_values.append(float(values[0]))
            y_values.append(int(values[1]))
    # Transform the Python list into a numpy array
    xArray = numpy.array(x_values)
    newArray = f(xArray)
    # Prepare the outputs as a list
    for index, i in enumerate(newArray):
        outputBatch.append(str(i) + " " + str(y_values[index]) + "\n")
    # Join the output list elements
    joinedOutput = ''.join(outputBatch)
    with open(output, 'w') as fw:
        fw.write(joinedOutput)
It's difficult to suggest improvements without knowing exactly what your function f is doing. Are you able to share it?
However, in general NumPy operations work best (read: "fastest") on NumPy array objects, rather than when they are repeated many times on individual values.
You might consider reading the values[0] numbers into a Python list, converting that to a NumPy array, and using vectorisable NumPy operations to obtain an array of output values.
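For instance, here is a minimal sketch of that approach (assuming the input really is a plain two-column, whitespace-separated file and that f is the poly1d object; the function and path names are just placeholders):

import numpy

def transform_file_vectorised(f, input_path, output_path):
    # Load both columns at once; column 0 holds the values to transform.
    data = numpy.loadtxt(input_path)
    transformed = f(data[:, 0])  # one vectorised call instead of a per-line loop
    numpy.savetxt(output_path, numpy.column_stack((transformed, data[:, 1])))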
Related
I'm trying to write a matrix (i.e. list of lists) to a txt file and then read it out again. I'm able to do this for lists. But for some reason when I tried to move up to a matrix yesterday, it didn't work.
import numpy as np

def genotype_maker():
    genotypes = [[] for i in range(10000)]
    for n in range(10000):
        for m in range(1024):
            u = np.random.uniform()
            if u < 0.9:
                genotypes[n].append(0)
            elif 0.9 < u < 0.99:
                genotypes[n].append(1)
            elif u > 0.99:
                genotypes[n].append(2)
    return genotypes
#genotypes=genotype_maker()
#np.savetxt('genotypes.txt',genotypes)
g=open("genotypes.txt","r")
genotypes=[]
for line in g:
    genotypes.append(int(float(line.rstrip())))
I run the code twice. The first time, the middle two lines are not commented out while the last four are commented out. This appears to successfully write a matrix of floats to a .txt file.
The second time, I comment out the middle two lines and uncomment the last four. Unfortunately I then get the error message: ValueError: could not convert string to float: '0.000000000000000000e+00 0.000000000000000000e+00 (and a whole lot more of these)
What's wrong with the code?
Thanks
In your case, you should just do np.loadtxt("genotypes.txt") if you want to load the file.
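A minimal sketch of that (assuming the file was written with np.savetxt as in the question, so the values come back as floats):

import numpy as np

genotypes = np.loadtxt("genotypes.txt").astype(int)  # shape (10000, 1024)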
However, if you want to do it manually, you need to parse everything yourself. You get an error because np.savetxt saves the matrix in a space-delimited file. You need to split your string before converting it. So for instance:
def str_to_int(x):
    return int(float(x))

g = open("genotypes.txt", "r")
genotypes = []
for line in g:
    values = line.rstrip().split(' ')           # values is an array of strings
    values_int = list(map(str_to_int, values))  # convert strings to int
    genotypes.append(values_int)                # append to your list
a matrix (i.e. list of lists)
Since we are already using numpy, it is possible to have numpy directly generate one of its own array types storing data of this sort:
np.random.choice(
3, # i.e., allow values from 0..2
size=(10000, 1024), # the dimensions of the array to create
p=(0.9, 0.09, 0.01) # the relative probability for each value
)
See the documentation for numpy.random.choice.
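Putting it together with the save/load round trip from the question (a sketch; fmt='%d' is my addition so the file stays integer-valued):

import numpy as np

genotypes = np.random.choice(3, size=(10000, 1024), p=(0.9, 0.09, 0.01))
np.savetxt('genotypes.txt', genotypes, fmt='%d')
loaded = np.loadtxt('genotypes.txt', dtype=int)  # back as a (10000, 1024) int array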
I have large binary data files that have a predefined format, originally written by a Fortran program as little-endian. I would like to read these files in the fastest, most efficient manner, so using the array package seemed right up my alley, as suggested in Improve speed of reading and converting from binary file?.
The problem is the pre-defined format is non-homogeneous. It looks something like this:
['<2i','<5d','<2i','<d','<i','<3d','<2i','<3d','<i','<d','<i','<3d']
with each integer i taking up 4 bytes, and each double d taking 8 bytes.
Is there a way I can still use the super efficient array package (or another suggestion) but with the right format?
Use struct. In particular, struct.unpack.
result = struct.unpack("<2i5d...", buffer)
Here buffer holds the given binary data.
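Since the question already has the per-field formats as a list, one way to build the full format string (a sketch; the fmts and record names are just for illustration) is to collapse the list into a single little-endian format and unpack one record at a time:

import struct

fmts = ['<2i', '<5d', '<2i', '<d', '<i', '<3d', '<2i', '<3d', '<i', '<d', '<i', '<3d']
fmt = '<' + ''.join(f.lstrip('<') for f in fmts)  # '<2i5d2idi3d2i3didi3d'
record = struct.Struct(fmt)                       # record.size == 164 bytes

with open('input', 'rb') as fin:
    values = record.unpack(fin.read(record.size))  # one record's worth of numbers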
It's not clear from your question whether you're concerned about the actual file reading speed (and building the data structure in memory), or about later data processing speed.
If you are reading only once, and doing heavy processing later, you can read the file record by record (if your binary data is a recordset of repeated records with identical format), parse it with struct.unpack and append it to a [double] array:
import array
import struct
from functools import partial

data = array.array('d')
record_size_in_bytes = 9*4 + 16*8  # 9 ints + 16 doubles

with open('input', 'rb') as fin:
    for record in iter(partial(fin.read, record_size_in_bytes), b''):
        values = struct.unpack("<2i5d...", record)
        data.extend(values)
This assumes you are allowed to cast all your ints to doubles and are willing to accept the increase in allocated memory size (a 22% increase for the record from your question).
If you are reading the data from the file many times, it could be worthwhile to convert everything to one large array of doubles (like above) and write it back to another file, from which you can later read it with array.fromfile():
import array
import os

data = array.array('d')
with open('preprocessed', 'rb') as fin:
    n = os.fstat(fin.fileno()).st_size // 8
    data.fromfile(fin, n)
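The preprocessing step itself is just the reverse (a sketch, reusing the data array and the 'preprocessed' file name from above):

# Write the homogeneous array of doubles once, so later runs can use the
# fast array.fromfile() path shown above.
with open('preprocessed', 'wb') as fout:
    data.tofile(fout)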
Update. Thanks to a nice benchmark by #martineau, we now know for a fact that preprocessing the data and turning it into a homogeneous array of doubles ensures that loading such data from file (with array.fromfile()) is ~20x to ~40x faster than reading it record by record, unpacking and appending to the array (as shown in the first code listing above).
A faster (and more standard) variation of record-by-record reading in #martineau's answer, which appends to a list and doesn't upcast to double, is only ~6x to ~10x slower than the array.fromfile() method and seems like a better reference benchmark.
Major Update: Modified to use proper code for reading in a preprocessed array file (function using_preprocessed_file() below), which dramatically changed the results.
To determine what method is faster in Python (using only built-ins and the standard libraries), I created a script to benchmark (via timeit) the different techniques that could be used to do this. It's a bit on the longish side, so to avoid distraction, I'm only posting the code tested and related results. (If there's sufficient interest in the methodology, I'll post the whole script.)
Here are the snippets of code that were compared:
#TESTCASE('Read and construct piecemeal with struct')
def read_file_piecemeal():
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        size = fmt1.size
        while True:
            buffer = inp.read(size)
            if len(buffer) != size:  # EOF?
                break
            structures.append(fmt1.unpack(buffer))
    return structures
#TESTCASE('Read all-at-once, then slice and struct')
def read_entire_file():
    offset, unpack, size = 0, fmt1.unpack, fmt1.size
    structures = []
    with open(test_filenames[0], 'rb') as inp:
        buffer = inp.read()  # read entire file
        while True:
            chunk = buffer[offset: offset+size]
            if len(chunk) != size:  # EOF?
                break
            structures.append(unpack(chunk))
            offset += size
    return structures
#TESTCASE('Convert to array (#randomir part 1)')
def convert_to_array():
    data = array.array('d')
    record_size_in_bytes = 9*4 + 16*8  # 9 ints + 16 doubles (standard sizes)
    with open(test_filenames[0], 'rb') as fin:
        for record in iter(partial(fin.read, record_size_in_bytes), b''):
            values = struct.unpack("<2i5d2idi3d2i3didi3d", record)
            data.extend(values)
    return data
#TESTCASE('Read array file (#randomir part 2)', setup='create_preprocessed_file')
def using_preprocessed_file():
    data = array.array('d')
    with open(test_filenames[1], 'rb') as fin:
        n = os.fstat(fin.fileno()).st_size // 8
        data.fromfile(fin, n)
    return data
def create_preprocessed_file():
    """ Save array created by convert_to_array() into a separate test file. """
    test_filename = test_filenames[1]
    if not os.path.isfile(test_filename):  # doesn't already exist?
        data = convert_to_array()
        with open(test_filename, 'wb') as file:
            data.tofile(file)
And here were the results of running them on my system:
Fastest to slowest execution speeds using Python 3.6.1
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes
Read array file (#randomir part 2): 0.06430 secs, relative 1.00x ( 0.00% slower)
Read all-at-once, then slice and struct: 0.39634 secs, relative 6.16x ( 516.36% slower)
Read and construct piecemeal with struct: 0.43283 secs, relative 6.73x ( 573.09% slower)
Convert to array (#randomir part 1): 1.38310 secs, relative 21.51x (2050.87% slower)
Interestingly, most of the snippets are actually faster in Python 2...
Fastest to slowest execution speeds using Python 2.7.13
(10 executions, best of 3 repetitions)
Size of structure: 164
Number of structures in test file: 40,000
file size: 6,560,000 bytes
Read array file (#randomir part 2): 0.03586 secs, relative 1.00x ( 0.00% slower)
Read all-at-once, then slice and struct: 0.27871 secs, relative 7.77x ( 677.17% slower)
Read and construct piecemeal with struct: 0.40804 secs, relative 11.38x (1037.81% slower)
Convert to array (#randomir part 1): 1.45830 secs, relative 40.66x (3966.41% slower)
Take a look at the documentation for numpy's fromfile function: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.fromfile.html and https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#arrays-dtypes-constructing
Simplest example:
import numpy as np
data = np.fromfile('binary_file', dtype=np.dtype('<i8, ...'))
Read more about "Structured Arrays" in numpy and how to specify their data type(s) here: https://docs.scipy.org/doc/numpy/user/basics.rec.html#
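As a sketch of what that could look like for the record layout in the question (the field names are placeholders; the itemsize works out to the same 164 bytes as above):

import numpy as np

# One record: 2i, 5d, 2i, d, i, 3d, 2i, 3d, i, d, i, 3d (little-endian)
record_dtype = np.dtype([
    ('f0',  '<i4', 2), ('f1',  '<f8', 5), ('f2',  '<i4', 2), ('f3',  '<f8'),
    ('f4',  '<i4'),    ('f5',  '<f8', 3), ('f6',  '<i4', 2), ('f7',  '<f8', 3),
    ('f8',  '<i4'),    ('f9',  '<f8'),    ('f10', '<i4'),    ('f11', '<f8', 3),
])  # record_dtype.itemsize == 164

data = np.fromfile('binary_file', dtype=record_dtype)  # one row per record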
There are a lot of good and helpful answers here, but I think the best solution needs more explaining. I implemented a method that reads the entire data file in one pass using the built-in read() and constructs a numpy ndarray at the same time. This is more efficient than reading the data and constructing the array separately, but it's also a bit more finicky.
import numpy as np

line_cols = 20                   # For example
line_rows = 40000                # For example
data_fmt = 15*'f8,' + 5*'f4,'    # For example (15 8-byte doubles + 5 4-byte floats)
data_bsize = 15*8 + 4*5          # For example

with open(filename, 'rb') as f:
    data = np.ndarray(shape=(1, line_rows),
                      dtype=np.dtype(data_fmt),
                      buffer=f.read(line_rows*data_bsize))[0].astype(line_cols*'f8,').view(dtype='f8').reshape(line_rows, line_cols)[:, :-1]
Here, we open the file as a binary file using the 'rb' option in open. Then, we construct our ndarray with the proper shape and dtype to fit our read buffer. We then reduce the ndarray to a 1D array by taking its zeroth index, where all our data is hiding. Finally, we reshape the array using the astype, view and reshape methods. This is because reshape doesn't like having data with mixed dtypes, and I'm okay with having my integers expressed as doubles.
This method is ~100x faster than looping line-for-line through the data, and could potentially be compressed down into a single line of code.
In the future, I may try to read the data in even faster using a Fortran script that essentially converts the binary file into a text file. I don't know if this will be faster, but it may be worth a try.
I have a 60 MB file with lots of lines.
Each line has the following format:
(x,y)
Each line will be parsed as a numpy vector of shape (1, 2).
At the end it should be concatenated into a big numpy array of shape (N, 2),
where N is the number of lines.
What is the fastest way to do that? Right now it takes too much time (more than 30 minutes).
My Code:
points = None
with open(fname) as f:
    for line in f:
        point = parse_vector_string_to_array(line)
        if points is None:
            points = point
        else:
            points = np.vstack((points, point))
Where the parser is:
def parse_vector_string_to_array(string):
    x, y = eval(string)
    array = np.array([[x, y]])
    return array
One thing that would improve speed is to imitate genfromtxt and accumulate each line in a list of lists (or tuples). Then do one np.array at the end.
for example (roughly):
points = []
for line in file:
    x, y = eval(line)
    points.append((x, y))
result = np.array(points)
Since your file lines look like tuples I'll leave your eval parsing. We don't usually recommend eval, but in this limited case it might be the simplest.
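If you do want to avoid eval, ast.literal_eval is a safer drop-in for lines that are literal tuples; a quick sketch (fname as in the question):

import ast
import numpy as np

points = []
with open(fname) as f:
    for line in f:
        points.append(ast.literal_eval(line))  # "(1.0, 2.0)\n" -> (1.0, 2.0)
result = np.array(points)                      # shape (N, 2)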
You could try to make genfromtxt read this, but the () on each line will give some headaches.
pandas is supposed to have a faster csv reader, but I don't know if it can be configured to handle this format or not.
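For what it's worth, one way pandas could be coaxed into it (a sketch, not tested against your data): let read_csv split on the comma and strip the parentheses afterwards.

import numpy as np
import pandas as pd

# Each "(x,y)" line splits on the comma into the strings "(x" and "y)".
df = pd.read_csv(fname, header=None, names=["x", "y"])
x = df["x"].str.lstrip("(").astype(float)
y = df["y"].str.rstrip(")").astype(float)
points = np.column_stack((x.to_numpy(), y.to_numpy()))  # shape (N, 2)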
When I load an array using numpy.loadtxt, it seems to take too much memory. E.g.
a = numpy.zeros(int(1e6))
causes an increase of about 8 MB in memory (using htop, or just 8 bytes * 1 million ≈ 8 MB). On the other hand, if I save and then load this array
numpy.savetxt('a.csv', a)
b = numpy.loadtxt('a.csv')
my memory usage increases by about 100MB! Again I observed this with htop. This was observed while in the iPython shell, and also while stepping through code using Pdb++.
Any idea what's going on here?
After reading jozzas's answer, I realized that if I know the array size ahead of time, there is a much more memory-efficient way to do things. Say 'a' is an m x n array:
import csv
import numpy

b = numpy.zeros((m, n))
with open('a.csv', 'r') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        b[i, :] = numpy.array(row)
Saving this array of floats to a text file creates a 24M text file. When you re-load this, numpy goes through the file line-by-line, parsing the text and recreating the objects.
I would expect memory usage to spike during this time, as numpy doesn't know how big the resultant array needs to be until it gets to the end of the file, so I'd expect there to be at least 24M + 8M + other temporary memory used.
Here's the relevant bit of the numpy code, from /lib/npyio.py:
# Parse each line, including the first
for i, line in enumerate(itertools.chain([first_line], fh)):
    vals = split_line(line)
    if len(vals) == 0:
        continue
    if usecols:
        vals = [vals[i] for i in usecols]
    # Convert each value according to its column and store
    items = [conv(val) for (conv, val) in zip(converters, vals)]
    # Then pack it according to the dtype's nesting
    items = pack_items(items, packing)
    X.append(items)

#...A bit further on
X = np.array(X, dtype)
This additional memory usage shouldn't be a concern, as this is just the way Python works - while your Python process appears to be using 100M of memory, internally it maintains knowledge of which items are no longer used, and will re-use that memory. For example, if you were to re-run this save-load procedure in the same program (save, load, save, load), your memory usage would not increase to 200M.
Here is what I ended up doing to solve this problem. It works even if you don't know the shape ahead of time. It performs the conversion to float first, and then combines the arrays (as opposed to #JohnLyon's answer, which combines the arrays of strings and then converts to float). This used an order of magnitude less memory for me, although it was perhaps a bit slower. However, I literally did not have the requisite memory to use np.loadtxt, so if you don't have sufficient memory, this will be better:
import numpy as np

def numpy_loadtxt_memory_friendly(the_file, max_bytes=1000000, **loadtxt_kwargs):
    numpy_arrs = []
    with open(the_file, 'rb') as f:
        i = 0
        while True:
            print(i)
            some_lines = f.readlines(max_bytes)
            if len(some_lines) == 0:
                break
            vec = np.loadtxt(some_lines, **loadtxt_kwargs)
            if len(vec.shape) < 2:
                vec = vec.reshape(1, -1)
            numpy_arrs.append(vec)
            i += len(some_lines)
    return np.concatenate(numpy_arrs, axis=0)
I have a bunch of csv datasets, about 10Gb in size each. I'd like to generate histograms from their columns. But it seems like the only way to do this in numpy is to first load the entire column into a numpy array and then call numpy.histogram on that array. This consumes an unnecessary amount of memory.
Does numpy support online binning? I'm hoping for something that iterates over my csv line by line and bins values as it reads them. This way at most one line is in memory at any one time.
Wouldn't be hard to roll my own, but wondering if someone already invented this wheel.
As you said, it's not that hard to roll your own. You'll need to set up the bins yourself and reuse them as you iterate over the file. The following ought to be a decent starting point:
import numpy as np

datamin = -5
datamax = 5
numbins = 20
mybins = np.linspace(datamin, datamax, numbins)
myhist = np.zeros(numbins-1, dtype='int32')

for i in range(100):
    d = np.random.randn(1000, 1)
    htemp, jnk = np.histogram(d, mybins)
    myhist += htemp
I'm guessing performance will be an issue with such large files, and the overhead of calling histogram on each line might be too slow. #doug's suggestion of a generator seems like a good way to address that problem.
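Applied to the CSV case directly, here is a sketch that reads the file in chunks of lines instead of all at once (the column index, bin edges, chunk size and the assumption of no header row are all placeholders to adapt):

import numpy as np

def streaming_histogram(csv_path, column=0,
                        bin_edges=np.linspace(-5, 5, 20), chunk_size=100000):
    counts = np.zeros(len(bin_edges) - 1, dtype='int64')
    with open(csv_path) as f:
        chunk = []
        for line in f:
            chunk.append(float(line.split(',')[column]))
            if len(chunk) == chunk_size:
                counts += np.histogram(chunk, bin_edges)[0]
                chunk = []
        if chunk:  # leftover lines
            counts += np.histogram(chunk, bin_edges)[0]
    return counts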
Here's a way to bin your values directly:
import numpy as NP
column_of_values = NP.random.randint(10, 99, 10)
# set the bin values:
bins = NP.array([0.0, 20.0, 50.0, 75.0])
binned_values = NP.digitize(column_of_values, bins)
'binned_values' is an index array, containing the index of the bin to which each value in column_of_values belongs.
'bincount' will give you (obviously) the bin counts:
NP.bincount(binned_values)
Given the size of your data set, using NumPy's 'loadtxt' to build a generator might be useful:
data_array = NP.loadtxt("data_file.txt", delimiter=",")

def fnx():
    for i in range(0, data_array.shape[1]):
        yield data_array[:, i]
Binning with a Fenwick Tree (very large dataset; percentile boundaries needed)
I'm posting a second answer to the same question since this approach is very different, and addresses different issues.
What if you have a VERY large dataset (billions of samples), and you don't know ahead of time WHERE your bin boundaries should be? For example, maybe you want to bin things up into quartiles or deciles.
For small datasets, the answer is easy: load the data in to an array, then sort, then read off the values at any given percentile by jumping to the index that percentage of the way through the array.
For large datasets where holding the array in memory is not practical (not to mention the time to sort)... then consider using a Fenwick Tree, aka a "Binary Indexed Tree".
I think these only work for positive integer data, so you'll at least need to know enough about your dataset to shift (and possibly scale) your data before you tabulate it in the Fenwick Tree.
I've used this to find the median of a 100 billion sample dataset, in reasonable time and very comfortable memory limits. (Consider using generators to open and read the files, as per my other answer; that's still useful.)
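A minimal sketch of the idea (assuming the samples have already been shifted/scaled to non-negative integers below some known maximum; the class and method names are just for illustration):

class FenwickCounter:
    """Counts occurrences of non-negative integer values and answers
    rank queries (how many samples are <= v) in O(log n) per operation."""

    def __init__(self, max_value):
        self.size = max_value + 1
        self.tree = [0] * (self.size + 1)  # 1-based internal indexing

    def add(self, value):
        i = value + 1
        while i <= self.size:
            self.tree[i] += 1
            i += i & -i

    def count_le(self, value):
        # number of samples <= value
        i, total = value + 1, 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def value_at_rank(self, k):
        # smallest value v such that count_le(v) >= k (k is 1-based)
        lo, hi = 0, self.size - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.count_le(mid) >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo

With n samples tabulated via add(), the median is value_at_rank((n + 1) // 2), and any other percentile boundary works the same way.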
More on Fenwick Trees:
http://en.wikipedia.org/wiki/Fenwick_tree
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=binaryIndexedTrees
Are interval, segment, fenwick trees the same?
Binning with Generators (large dataset; fixed-width bins; float data)
If you know the width of your desired bins ahead of time -- even if there are hundreds or thousands of buckets -- then I think rolling your own solution would be fast (both to write, and to run). Here's some Python that assumes you have an iterator that gives you the next value from the file:
from math import floor

binwidth = 20
counts = dict()
filename = "mydata.csv"

for val in next_value_from_file(filename):
    binname = int(floor(val/binwidth)*binwidth)
    if binname not in counts:
        counts[binname] = 0
    counts[binname] += 1

print(counts)
The values can be floats, but this is assuming you use an integer binwidth; you may need to tweak this a bit if you want to use a binwidth of some float value.
As for next_value_from_file(), as mentioned earlier, you'll probably want to write a custom generator or an object with an __iter__() method to do this efficiently. The pseudocode for such a generator would be this:
def next_value_from_file(filename):
    f = open(filename)
    for line in f:
        # parse out from the line the value or values you need
        val = parse_the_value_from_the_line(line)
        yield val
If a given line has multiple values, then make parse_the_value_from_the_line() either return a list or itself be a generator, and use this pseudocode:
def next_value_from_file(filename):
    f = open(filename)
    for line in f:
        for val in parse_the_values_from_the_line(line):
            yield val