Fastest way to create a numpy array from text file - python

I have a 60 MB file with lots of lines.
Each line has the following format:
(x,y)
Each line should be parsed into a numpy vector of shape (1,2).
At the end everything should be concatenated into one big numpy array of shape (N,2),
where N is the number of lines.
What is the fastest way to do that? Right now it takes too much time (more than 30 minutes).
My Code:
points = None
with open(fname) as f:
    for line in f:
        point = parse_vector_string_to_array(line)
        if points is None:
            points = point
        else:
            points = np.vstack((points, point))
Where the parser is:
def parse_vector_string_to_array(string):
    x, y = eval(string)
    array = np.array([[x, y]])
    return array

One thing that would improve speed is to imitate genfromtxt and accumulate each line in a list of lists (or tuples). Then do one np.array at the end.
For example (roughly):
points = []
for line in file:
    x, y = eval(line)
    points.append((x, y))
result = np.array(points)
Since your file lines look like tuples I'll leave your eval parsing. We don't usually recommend eval, but in this limited case it might be the simplest.
You could try to make genfromtxt read this, but the () on each line will give some headaches.
pandas is supposed to have a faster csv reader, but I don't know if it can be configured to handle this format or not.
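For what it's worth, here is a rough sketch of the pandas route: strip the parentheses first and hand the cleaned text to read_csv. This is untested against your exact data, and the filename and column names are just placeholders:
import io
import pandas as pd

with open("points.txt") as f:
    # Turn each "(x,y)" line into a plain "x,y" line.
    cleaned = "".join(line.strip().strip("()") + "\n" for line in f)

df = pd.read_csv(io.StringIO(cleaned), header=None, names=["x", "y"])
result = df.to_numpy()   # numpy array of shape (N, 2); use df.values on older pandas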

Related

Reading and Writing Matrices in Python, Could not Convert String to Float Error

I'm trying to write a matrix (i.e. list of lists) to a txt file and then read it out again. I'm able to do this for lists. But for some reason when I tried to move up to a matrix yesterday, it didn't work.
import numpy as np

def genotype_maker():
    genotypes = [[] for i in range(10000)]
    for n in range(10000):
        for m in range(1024):
            u = np.random.uniform()
            if u < 0.9:
                genotypes[n].append(0)
            elif 0.9 < u < 0.99:
                genotypes[n].append(1)
            elif u > 0.99:
                genotypes[n].append(2)
    return genotypes
#genotypes=genotype_maker()
#np.savetxt('genotypes.txt',genotypes)
g=open("genotypes.txt","r")
genotypes=[]
for line in g:
genotypes.append(int(float(line.rstrip())))
I run the code twice. The first time, the middle two lines are not commented out while the last four are commented out. It looks like this successfully writes a matrix of floats to a .txt file.
The second time, I comment out the middle two lines and uncomment the last four. Unfortunately I then get the error message: ValueError: could not convert string to float: '0.000000000000000000e+00 0.000000000000000000e+00 ...' (and a whole lot more of these).
What's wrong with the code?
Thanks
In your case, you should just do np.loadtxt("genotypes.txt") if you want to load the file.
However, if you want to do it manually, you need to parse everything yourself. You get an error because np.savetxt saves the matrix in a space-delimited file. You need to split your string before converting it. So for instance:
def str_to_int(x):
    return int(float(x))

g = open("genotypes.txt", "r")
genotypes = []
for line in g:
    values = line.rstrip().split(' ')           # values is a list of strings
    values_int = list(map(str_to_int, values))  # convert strings to int
    genotypes.append(values_int)                # append to your list
a matrix (i.e. list of lists)
Since we are already using numpy, it is possible to have numpy directly generate an array storing data of this sort:
np.random.choice(
    3,                   # i.e., allow values from 0..2
    size=(10000, 1024),  # the dimensions of the array to create
    p=(0.9, 0.09, 0.01)  # the probability of each value
)
See the numpy.random.choice documentation.
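Putting the two suggestions together, a minimal sketch of the full round trip (the filename is just an example; note that np.savetxt writes floats by default, so cast on reload):
import numpy as np

genotypes = np.random.choice(3, size=(10000, 1024), p=(0.9, 0.09, 0.01))
np.savetxt('genotypes.txt', genotypes)

loaded = np.loadtxt('genotypes.txt').astype(int)   # back to a (10000, 1024) int array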

Formatting list of tuples to decimals and adding them to a file

I have a list of tuples as coordinates (it has to be this way)
points = [(x1, y1), (x2, y2), ...]
to draw a polygon in matplotlib. To get these coordinates I first created an empty list points = [] and then wrote a function to calculate each point from the coordinates of the centre, the number of sides, the side length and the angle of rotation. After the function I wrote code to read these initial values from the user's input and check their validity, and then call the function if the check is successful.
Now I want to store the coordinates and the number of points in a file as follows:
number of points
x1, y1
x2, y2
...
xn, yn
where each coordinate is written to 3 decimal places. Therefore, I need to format my tuples to 3 decimals, then convert them to strings and then write them in a file, and I want it in the shortest possible way.
I thought I would do something like lines = [float("{:.3f}".format(j)) for j in points] (which doesn't work since I have tuples) and then
lines.insert(0, len(points))
with open('points.txt', 'w') as f:
    f.writelines("%s\n" % l for l in lines)
The above solution seems very nice to me, but I can't find a way to do the first line (formatting to decimals) for tuples, so I was wondering how I could format a list of tuples to 3 decimals and store them in a list for the subsequent use of writelines and conversion into strings?
Or if there is a shorter and better way of doing this, I would appreciate any hints. Thank you!
You can directly write the floats into your file:
Test data:
import random

tupledata = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(10)]
print(tupledata)
Output:
[(1.4248082335110652, 1.9169955985773148), (0.7948001195399392, 0.6827204752328884),
(-0.7506234890561183, 3.5787165366514735), (-1.0180103847958843, 2.260945997153615),
(-2.951745273938622, 4.0178333333006435), (1.8200624561140613, -2.048841087823593),
(-2.2727453771856765, 1.3697390993773828), (1.3427726323007603, -1.7616141110472583),
(0.5022889371913024, 4.88736204694349), (2.345381610723872, -2.2849852099748915)]
Write the formatted values:
with open("n.txt","w") as w:
# w.write(f"{len(tupledata)}\n") # uncomment for line number on top
for t in tupledata:
w.write("{:.3f},{:.3f}\n".format(*t))
# for python 3.6 up you can alternatively use string literal interpolation:
# see https://www.python.org/dev/peps/pep-0498/
# w.write(f"{t[0]:.3f},{t[1]:.3f}\n")
with open("n.txt","r") as r:
print(r.read())
Output in file:
1.425,1.917
0.795,0.683
-0.751,3.579
-1.018,2.261
-2.952,4.018
1.820,-2.049
-2.273,1.370
1.343,-1.762
0.502,4.887
2.345,-2.285
See "Proper name for python * operator?" for what *t does. Hint: print(*[1, 2, 3]) is equivalent to print(1, 2, 3).
Formatting syntax: the format specification mini-language.
You are mixing several things. There is no need to add the number to the list, nor do you need to create an intermediate string list. Also, the formatting is easier like this:
with open('points.txt', 'w') as f:
    f.write(f"{len(points)}\n")   # note the newline after the count
    for x, y in points:
        f.write(f"{x:.3f}, {y:.3f}\n")
Use tuple unpacking when constructing your lines:
lines = ["{:.3f}, {:.3f}\n".format(*point) for point in points]
That way you already have a list of strings you can easily write to a file. There is no need to convert them to float again, only to cast them back to strings.
with open('points.txt', 'w') as f:
    f.writelines(lines)
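If you also want the point count on the first line, as described in the question, a small sketch combining the two pieces (it assumes points is the list of (x, y) tuples from the question):
lines = ["{:.3f}, {:.3f}\n".format(*point) for point in points]
lines.insert(0, "{}\n".format(len(points)))

with open('points.txt', 'w') as f:
    f.writelines(lines)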

Using numpy.fromfile to read scattered binary data

There are different blocks in a binary file that I want to read using a single call of numpy.fromfile. Each block has the following format:
OES = [
    ('EKEY', 'i4', 1),
    ('FD1', 'f4', 1),
    ('EX1', 'f4', 1),
    ('EY1', 'f4', 1),
    ('EXY1', 'f4', 1),
    ('EA1', 'f4', 1),
    ('EMJRP1', 'f4', 1),
    ('EMNRP1', 'f4', 1),
    ('EMAX1', 'f4', 1),
    ('FD2', 'f4', 1),
    ('EX2', 'f4', 1),
    ('EY2', 'f4', 1),
    ('EXY2', 'f4', 1),
    ('EA2', 'f4', 1),
    ('EMJRP2', 'f4', 1),
    ('EMNRP2', 'f4', 1),
    ('EMAX2', 'f4', 1)]
Here is the format of the binary:
Data I want (OES format repeating n times)
------------------------
Useless Data
------------------------
Data I want (OES format repeating m times)
------------------------
etc..
I know the byte increment between the data I want and the useless data. I also know the size of each data block I want.
So far, I have accomplished my goal by seeking on the file object f and then calling:
nparr = np.fromfile(f,dtype=OES,count=size)
So I get a different nparr for each data block I want, and I concatenate all the numpy arrays into one new array.
My goal is to have a single array with all the blocks I want, without concatenating (for memory purposes). That is, I want to call nparr = np.fromfile(f,dtype=OES) only once. Is there a way to accomplish this goal?
That is, I want to call nparr = np.fromfile(f,dtype=OES) only once. Is there a way to accomplish this goal?
No, not with a single call to fromfile().
But if you know the complete layout of the file in advance, you can preallocate the array, and then use fromfile and seek to read the OES blocks directly into the preallocated array. Suppose, for example, that you know the file positions of each OES block, and you know the number of records in each block. That is, you know:
file_positions = [position1, position2, ...]
numrecords = [n1, n2, ...]
Then you could do something like this (assuming f is the already opened file):
total = sum(numrecords)
nparr = np.empty(total, dtype=OES)
current_index = 0
for pos, n in zip(file_positions, numrecords):
    f.seek(pos)
    nparr[current_index:current_index+n] = np.fromfile(f, count=n, dtype=OES)
    current_index += n
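If you know the record counts and the size in bytes of each useless gap, the file positions themselves can be derived from the dtype's itemsize. A rough sketch using the OES dtype from the question; the counts, gap sizes and start offset are hypothetical values you would replace with the ones you know:
import numpy as np

numrecords = [1000, 2500, 800]        # records in each wanted block (example values)
gap_sizes = [4096, 4096]              # bytes of useless data between blocks (example values)
start_offset = 0                      # byte offset of the first block

record_size = np.dtype(OES).itemsize  # 17 fields * 4 bytes = 68 bytes per record

file_positions = [start_offset]
for n, gap in zip(numrecords[:-1], gap_sizes):
    file_positions.append(file_positions[-1] + n * record_size + gap)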

Transforming float values using a function is performance bottleneck

I have a piece of software that reads a file and transforms the first value it reads on each line using a function (derived from the numpy.polyfit and numpy.poly1d functions).
The function then has to write the transformed file away, and I wrongly (it seems) assumed that the disk I/O part was the performance bottleneck.
The reason why I claim that it is the transformation that is slowing things down is that I tested the code (listed below) after I changed transformedValue = f(float(values[0])) into transformedValue = 1000.00, and that took the time required down from 1 minute to 10 seconds.
I was wondering if anyone knows of a more efficient way to perform repeated transformations like this?
Code snippet:
def transformFile(self, f):
    """ f contains the function returned by numpy.poly1d,
        inputFile is a tab separated file containing two floats
        per line.
    """
    outputBatch = []
    with open(self.inputFile, 'r') as fr:
        for line in fr:
            line = line.rstrip('\n')
            values = line.split()
            transformedValue = f(float(values[0]))  # <-------- Bottleneck
            outputBatch.append(str(transformedValue) + " " + values[1] + "\n")
    joinedOutput = ''.join(outputBatch)
    with open(output, 'w') as fw:
        fw.write(joinedOutput)
The function f is generated by another function; that function fits a 2nd-degree polynomial through a set of expected floats and a set of measured floats. A snippet from that function is:
# Perform 2nd-degree polynomial fit
z = numpy.polyfit(measuredValues, expectedValues, 2)
f = numpy.poly1d(z)
-- ANSWER --
I have revised the code to vectorize the values prior to transforming them, which significantly sped up the performance; the code is now as follows:
def transformFile(self, f):
    """ f contains the function returned by numpy.poly1d,
        inputFile is a tab separated file containing two floats
        per line.
    """
    with open(self.inputFile, 'r') as fr:
        outputBatch = []
        x_values = []
        y_values = []
        for line in fr:
            line = line.rstrip('\n')
            values = line.split()
            x_values.append(float(values[0]))
            y_values.append(int(values[1]))
    # Transform the Python list into a numpy array
    xArray = numpy.array(x_values)
    newArray = f(xArray)
    # Prepare the outputs as a list
    for index, i in enumerate(newArray):
        outputBatch.append(str(i) + " " + str(y_values[index]) + "\n")
    # Join the output list elements
    joinedOutput = ''.join(outputBatch)
    with open(output, 'w') as fw:
        fw.write(joinedOutput)
It's difficult to suggest improvements without knowing exactly what your function f is doing. Are you able to share it?
However, in general many NumPy operations often work best (read: "fastest") on NumPy array objects rather than when they are repeated multiple times on individual values.
You might like to consider reading the numbers values[0] into a Python list, converting this into a NumPy array and using vectorisable NumPy operations to obtain an array of output values.
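For instance, a rough sketch of that idea using numpy.loadtxt to pull in both columns at once (assuming whitespace-separated columns, that f is the poly1d object from the question, and placeholder filenames):
import numpy as np

data = np.loadtxt("input.txt")        # two whitespace-separated columns per line

x = data[:, 0]
y = data[:, 1].astype(int)

transformed = f(x)                    # one vectorised call instead of one per line

np.savetxt("output.txt", np.column_stack((transformed, y)), fmt="%s %d")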

Numpy histogram of large arrays

I have a bunch of csv datasets, about 10 GB in size each. I'd like to generate histograms from their columns. But it seems like the only way to do this in numpy is to first load the entire column into a numpy array and then call numpy.histogram on that array. This consumes an unnecessary amount of memory.
Does numpy support online binning? I'm hoping for something that iterates over my csv line by line and bins values as it reads them. This way at most one line is in memory at any one time.
Wouldn't be hard to roll my own, but wondering if someone already invented this wheel.
As you said, it's not that hard to roll your own. You'll need to set up the bins yourself and reuse them as you iterate over the file. The following ought to be a decent starting point:
import numpy as np

datamin = -5
datamax = 5
numbins = 20
mybins = np.linspace(datamin, datamax, numbins)
myhist = np.zeros(numbins-1, dtype='int32')

for i in range(100):
    d = np.random.randn(1000, 1)
    htemp, jnk = np.histogram(d, mybins)
    myhist += htemp
I'm guessing performance will be an issue with such large files, and the overhead of calling histogram on each line might be too slow. @doug's suggestion of a generator seems like a good way to address that problem.
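For example, a rough sketch of chunked accumulation over the actual CSV, reusing fixed bins as above (the filename, column index, and chunk size are placeholders):
import itertools
import numpy as np

mybins = np.linspace(-5, 5, 20)
myhist = np.zeros(len(mybins) - 1, dtype='int64')

with open("mydata.csv") as f:
    while True:
        chunk = list(itertools.islice(f, 100000))   # up to 100000 lines at a time
        if not chunk:
            break
        values = np.array([float(line.split(',')[0]) for line in chunk])
        htemp, _ = np.histogram(values, mybins)
        myhist += htemp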
Here's a way to bin your values directly:
import numpy as NP
column_of_values = NP.random.randint(10, 99, 10)
# set the bin values:
bins = NP.array([0.0, 20.0, 50.0, 75.0])
binned_values = NP.digitize(column_of_values, bins)
'binned_values' is an index array, containing the index of the bin to which each value in column_of_values belongs.
'bincount' will give you (obviously) the bin counts:
NP.bincount(binned_values)
Given the size of your data set, using Numpy's loadtxt to build a generator might be useful:
data_array = NP.loadtxt("data_file.txt", delimiter=",")

def fnx():
    for i in range(0, data_array.shape[1]):
        yield data_array[:, i]
Binning with a Fenwick Tree (very large dataset; percentile boundaries needed)
I'm posting a second answer to the same question since this approach is very different, and addresses different issues.
What if you have a VERY large dataset (billions of samples), and you don't know ahead of time WHERE your bin boundaries should be? For example, maybe you want to bin things up into quartiles or deciles.
For small datasets, the answer is easy: load the data into an array, then sort, then read off the values at any given percentile by jumping to the index that percentage of the way through the array.
For large datasets where the memory size to hold the array is not practical (not to mention the time to sort)... then consider using a Fenwick Tree, aka a "Binary Indexed Tree".
I think these only work for positive integer data, so you'll at least need to know enough about your dataset to shift (and possibly scale) your data before you tabulate it in the Fenwick Tree.
I've used this to find the median of a 100 billion sample dataset, in reasonable time and very comfortable memory limits. (Consider using generators to open and read the files, as per my other answer; that's still useful.)
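For illustration only (this is not the original poster's code), a minimal sketch of such a Fenwick-tree counter for non-negative integer samples; shifting and scaling your real data into the [0, size) range is up to you:
class FenwickCounter:
    """Counts integer samples in [0, size) and answers rank queries in O(log size)."""
    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)   # 1-indexed binary indexed tree

    def add(self, value):
        i = value + 1
        while i <= self.size:
            self.tree[i] += 1
            i += i & (-i)

    def count_upto(self, value):
        """Number of samples <= value."""
        i = value + 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)
        return total

    def value_at_rank(self, rank):
        """Smallest value v such that count_upto(v) >= rank (rank is 1-based)."""
        pos = 0
        remaining = rank
        step = 1
        while step * 2 <= self.size:
            step *= 2
        while step:
            if pos + step <= self.size and self.tree[pos + step] < remaining:
                remaining -= self.tree[pos + step]
                pos += step
            step //= 2
        return pos

# Hypothetical usage: samples already mapped to integers in [0, 1000).
counter = FenwickCounter(1000)
samples = [423, 17, 999, 512, 512, 64]   # stand-in for values streamed from the file
for s in samples:
    counter.add(s)

median = counter.value_at_rank((len(samples) + 1) // 2)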
More on Fenwick Trees:
http://en.wikipedia.org/wiki/Fenwick_tree
http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=binaryIndexedTrees
Are interval, segment, fenwick trees the same?
Binning with Generators (large dataset; fixed-width bins; float data)
If you know the width of your desired bins ahead of time -- even if there are hundreds or thousands of buckets -- then I think rolling your own solution would be fast (both to write and to run). Here's some Python that assumes you have an iterator that gives you the next value from the file:
from math import floor

binwidth = 20
counts = dict()
filename = "mydata.csv"

for val in next_value_from_file(filename):
    binname = int(floor(val / binwidth) * binwidth)
    if binname not in counts:
        counts[binname] = 0
    counts[binname] += 1

print(counts)
The values can be floats, but this is assuming you use an integer binwidth; you may need to tweak this a bit if you want to use a binwidth of some float value.
As for next_value_from_file(), as mentioned earlier, you'll probably want to write a custom generator or an object with an __iter__() method to do this efficiently. The pseudocode for such a generator would be this:
def next_value_from_file(filename):
    f = open(filename)
    for line in f:
        # parse out from the line the value or values you need
        val = parse_the_value_from_the_line(line)
        yield val
If a given line has multiple values, then make parse_the_value_from_the_line() either return a list or itself be a generator, and use this pseudocode:
def next_value_from_file(filename):
    f = open(filename)
    for line in f:
        for val in parse_the_values_from_the_line(line):
            yield val
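As a concrete (assumed) example, if each line is a plain comma-separated row of numbers and you want every field, the parser could be as simple as:
def parse_the_values_from_the_line(line):
    # Adjust for headers, quoting, or non-numeric columns in your real CSV.
    return [float(field) for field in line.strip().split(',')]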
