I need to read a huge text file with space-separated values in Python; values may wrap to a new line arbitrarily. If there is a series of identical values, they are written as N*value, so a data file looks like this:
71*0.00000000 547.89916992 133*0.00000000 193.83299255 293.70559692 471.63159180
481.01190186 570.33251953 701.80609131 856.66857910 1037.59094238 1385.04602051
126*0.00000000 128.15339661 158.62240601 175.11259460 169.34010315 193.85580444
248.73970032 268.69958496 385.36999512 405.28909302 848.83941650 874.15411377
1034.21594238 1144.12500000 1378.19799805 935.37860107 1028.68994141 118*0.00000000
546.53228760 269.04510498 259.84869385 347.39208984 311.84609985 270.91180420
In this example, there are 71 zeros at the beginning of the data set, followed by the value 547.89..., then another 133 zeros, and so on. The resulting dataset is a flat 1D array.
The dimension (i.e. total number of "expanded" values) is known.
Repeated values are not necessarily zeros; they just happen to be in this example.
So, is there an efficient way, maybe a Python library, to read this kind of dataset?
I tried doing a straightforward check of whether a value has an asterisk, and then populating a chunk of a pre-allocated numpy array with that value, but with tens of millions of points it is way too slow.
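A rough sketch of that approach, simplified (the function name, the total_size argument and the float64 dtype here are placeholders, not the actual code):

import numpy as np

def read_rle_values(path, total_size):
    # pre-allocate the output, then expand each "N*value" token as it is read
    out = np.empty(total_size, dtype=np.float64)
    pos = 0
    with open(path) as f:
        for line in f:
            for token in line.split():
                if '*' in token:
                    count, value = token.split('*')
                    count = int(count)
                    out[pos:pos + count] = float(value)
                    pos += count
                else:
                    out[pos] = float(token)
                    pos += 1
    return out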
I thought about writing a C++ extension, but the project is used across multiple systems (both Linux and Windows) and it may be difficult to maintain, so I would prefer a pure Python solution (or at least an extra Python library, if one exists).
Thanks!
Related
I have a 20+GB dataset that is structured as follows:
1 3
1 2
2 3
1 4
2 1
3 4
4 2
(Note: the repetition is intentional and there is no inherent order in either column.)
I want to construct a file in the following format:
1: 2, 3, 4
2: 3, 1
3: 4
4: 2
Here is my problem; I have tried writing scripts in both Python and C++ to load in the file, create long strings, and write to a file line-by-line. It seems, however, that neither language is capable of handling the task at hand. Does anyone have any suggestions as to how to tackle this problem? Specifically, is there a particular method/program that is optimal for this? Any help or guided directions would be greatly appreciated.
You can try this using Hadoop. You can run a stand-alone MapReduce program. The mapper will output the first column as the key and the second column as the value. All the outputs with the same key will go to one reducer, so you end up with a key and a list of values for that key. You can run through the values list and output the (key, valueString) pair, which is the final output you desire. You can start with a simple Hadoop tutorial and write the mapper and reducer as I suggested. However, I've not tried to scale 20GB of data on a stand-alone Hadoop system. You may try. Hope this helps.
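A minimal sketch of what the mapper and reducer could look like with Hadoop Streaming (the script names, and the assumption that each input line has exactly two whitespace-separated columns, are placeholders):

mapper.py:

#!/usr/bin/env python
# emit "key<TAB>value" for each "key value" input line
import sys

for line in sys.stdin:
    parts = line.split()
    if len(parts) == 2:
        print('%s\t%s' % (parts[0], parts[1]))

reducer.py:

#!/usr/bin/env python
# lines arrive grouped and sorted by key; collect the values for each key
# and emit "key: v1, v2, ..."
import sys

current_key = None
values = []
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    if current_key is not None and key != current_key:
        print('%s: %s' % (current_key, ', '.join(values)))
        values = []
    current_key = key
    values.append(value)
if current_key is not None:
    print('%s: %s' % (current_key, ', '.join(values)))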
Have you tried using a std::vector of std::vector?
The outer vector represents each row. Each slot in the outer vector is a vector containing all the possible values for each row. This assumes that the row # can be used as an index into the vector.
Otherwise, you can try std::map<unsigned int, std::vector<unsigned int> >, where the key is the row number and the vector contains all values for the row.
A std::list of values would work as well.
Does your program run out of memory?
Edit 1: Handling large data files
You can handle your issue by treating it like a merge sort (a rough sketch follows the steps below).
Open a file for each row number.
Append the 2nd column values to the file.
After all data is read, close all files.
Open each file and read the values and print them out, comma separated.
Open output file for each key.
While iterating over lines of source file append values into output files.
Join output files.
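A rough Python sketch of these steps (the file names, the temporary directory, and the assumption of integer keys in the first column are placeholders):

import os

def group_by_key(source_path, out_path, tmp_dir='tmp_keys'):
    os.makedirs(tmp_dir, exist_ok=True)
    handles = {}
    # pass 1: append each second-column value to a per-key temporary file
    with open(source_path) as src:
        for line in src:
            key, value = line.split()
            if key not in handles:
                handles[key] = open(os.path.join(tmp_dir, key), 'a')
            handles[key].write(value + '\n')
    for h in handles.values():
        h.close()
    # pass 2: join the per-key files into the final "key: v1, v2, ..." format
    with open(out_path, 'w') as out:
        for key in sorted(handles, key=int):
            with open(os.path.join(tmp_dir, key)) as f:
                values = [v.strip() for v in f]
            out.write('%s: %s\n' % (key, ', '.join(values)))

Note that keeping one file handle open per distinct key can hit the OS file-descriptor limit if there are very many keys; in that case, open and close the file around each append, or batch by key ranges.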
An interesting thought, also found on Stack Overflow:
If you want to persist a large dictionary, you are basically looking at a database.
As recommended there, use Python's sqlite3 module to write to a table where the primary key is auto incremented, with a field called "key" (or "left") and a field called "value" (or "right").
Then SELECT out of the table the MIN(key) and MAX(key), and with that information you can SELECT all rows that have the same "key" (or "left") value, in sorted order, and print that information to an output file (if a database is not a suitable output format for you).
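A minimal sqlite3 sketch along these lines (the file, table, and column names are placeholders, integer keys are assumed, and the "key"/"value" identifiers are quoted to avoid clashing with SQL keywords):

import sqlite3

conn = sqlite3.connect('pairs.db')
conn.execute('CREATE TABLE IF NOT EXISTS pairs '
             '(id INTEGER PRIMARY KEY AUTOINCREMENT, "key" INTEGER, "value" INTEGER)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_key ON pairs ("key")')

# load the 20+GB input without holding it in memory
with open('input.txt') as f:
    conn.executemany('INSERT INTO pairs ("key", "value") VALUES (?, ?)',
                     (tuple(map(int, line.split())) for line in f))
conn.commit()

lo, hi = conn.execute('SELECT MIN("key"), MAX("key") FROM pairs').fetchone()
with open('output.txt', 'w') as out:
    for k in range(lo, hi + 1):
        rows = conn.execute('SELECT "value" FROM pairs WHERE "key" = ? ORDER BY "value"',
                            (k,)).fetchall()
        if rows:
            out.write('%d: %s\n' % (k, ', '.join(str(r[0]) for r in rows)))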
I have written this approach on the assumption that you call this problem "big data" because the number of keys does not fit well into memory (otherwise, a simple Python dictionary would be enough). However, IMHO this question is not correctly tagged as "big data": in order to require distributed computation on Hadoop or similar, your input data should be much larger than what you can hold on a single hard drive, or your computations should be much more costly than a simple hash table lookup and insertion.
I'm using multiprocessing with a large (~5GB) read-only dict used by the processes. I started by passing the whole dict to each process, but ran into memory constraints, so I changed to using a multiprocessing Manager dict (after reading How to share a dictionary between multiple processes in python without locking).
Since the change, performance has dived. What alternatives are there for a faster shared data store? The dict has a 40-character string key and, as data, a tuple of two small strings.
Use a memory mapped file. While this might sound insane (performance wise), it might not be if you use some clever tricks:
Sort the keys so you can use binary search in the file to locate a record
Try to make each line of the file the same length ("fixed width records")
If you can't use fixed width records, use this pseudo code:
Read 1KB in the middle (or enough to be sure the longest line fits *twice*)
Find the first new line character
Find the next new line character
Get a line as a substring between the two positions
Check the key (first 40 bytes)
If the key is too big, repeat with a 1KB block in the first half of the search range, else in the upper half of the search range
If the performance isn't good enough, consider writing an extension in C.
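A minimal sketch of the fixed-width-record variant (the 40-byte key, the record width, and the file layout are assumptions for illustration; wanted must be a 40-byte bytes key):

import mmap

KEY_LEN = 40
RECORD_LEN = 64   # key + value + newline, padded to a fixed width

def lookup(path, wanted):
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            lo, hi = 0, mm.size() // RECORD_LEN
            while lo < hi:
                mid = (lo + hi) // 2
                record = mm[mid * RECORD_LEN:(mid + 1) * RECORD_LEN]
                key = record[:KEY_LEN]
                if key == wanted:
                    return record[KEY_LEN:].rstrip()
                elif key < wanted:
                    lo = mid + 1
                else:
                    hi = mid
    return None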
I am working on some large (several million line) bioinformatics data sets with the general format:
chromosomeNumber locusStart locusStop sequence moreData
I have other files in this format:
chromosomeNumber locusStart locusStop moreData
What I need to be able to do is read one of each type of file into memory and if the locusStart of a line of the upper file is between the start and stop of any of the lines in the lower file, print the line to output file 1. If the locusStart of that line is not between the start and stop of any lines in the bottom file, then print it to output file 2.
I am currently reading the files in, converting them into dictionaries keyed on chromosome with the corresponding lines as values. I then split each value line into a list of strings and do comparisons with those strings. This takes an incredibly long time, and I would like to know if there is a more efficient way to do it.
Thanks.
It seems that for the lower file (which I am assuming has the second format), the only field you are concerned about is 'locusStart'. Since, from your description, you do not necessarily care about the other data, you could make a set of all of the locusStart values:
locusStart_list = set()
with open(upper_file, 'r') as f:
    for line in f:
        tmp_list = line.strip().split()
        locusStart_list.add(tmp_list[1])
This removes all of the unnecessary line manipulation you do for the bottom file. Then, you can easily compare the locusStart of a field to the set built from the lower file. The set would also remove duplicates, making it a bit faster than using a list.
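For example, the comparison against the other file could look like this (the file names are placeholders; note that, as described, this only catches exact matches of locusStart rather than start/stop ranges):

with open(other_file, 'r') as f, open('out1.txt', 'w') as out1, open('out2.txt', 'w') as out2:
    for line in f:
        fields = line.strip().split()
        if fields[1] in locusStart_list:
            out1.write(line)
        else:
            out2.write(line)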
It sounds like you are going to be doing lots of greater-than/less-than comparisons, so I don't think loading your data into dictionaries is going to improve the speed of your code at all; based on what you've explained, it sounds like you're still looping through every element of one file or the other.
What you need is a different data structure to load your data into and run comparison operations with. Check out the Python bisect module; I think it may provide the data structure that you need to run your comparison operations much more efficiently.
If you can more precisely describe what exactly you're trying to accomplish, we'll be able to help you get started writing your code.
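For example, a rough sketch of how bisect could be used, assuming (my assumption) that the start/stop intervals for a given chromosome do not overlap:

import bisect

starts = []   # locusStart values from the interval file, sorted, per chromosome
stops = []    # matching locusStop values, in the same order

def in_any_interval(locus_start):
    # find the right-most interval that starts at or before locus_start
    i = bisect.bisect_right(starts, locus_start) - 1
    return i >= 0 and locus_start <= stops[i]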
Using a dictionary of the chromosome number is a good idea, as long as you can fit both files into memory.
You then want to sort both lists by locusStart (split the string, convert locusStart to a number--see instructions on sorting if you're unsure how to sort on locusStart alone).
Now you can just walk through your lists: if the lower locusStart is less than the first upper locusStart, put the line in file 2 and go on to the next one. If the lower locusStart is greater than the first upper locusStart then
While it is also greater than locusEnd, throw away the beginning of the upper list
If you find a case where it's greater than locusStart and less than locusEnd, put it in file 1
Otherwise, put it in file 2
This should replace what is now probably an O(n^2) algorithm with an O(n log n) one.
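A rough sketch of that walk for one chromosome (the names are mine; both lists are assumed sorted by locusStart, and the intervals are assumed not to overlap):

def classify(queries, intervals, out1, out2):
    # queries: (line, locusStart) pairs to classify, sorted by locusStart
    # intervals: (locusStart, locusStop) pairs, sorted by locusStart
    i = 0
    for line, locus_start in queries:
        # throw away intervals that end before this locusStart
        while i < len(intervals) and intervals[i][1] < locus_start:
            i += 1
        if i < len(intervals) and intervals[i][0] <= locus_start <= intervals[i][1]:
            out1.write(line)   # inside an interval -> output file 1
        else:
            out2.write(line)   # outside every interval -> output file 2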
Right, I'm iterating through a large binary file
I need to minimise the time of this loop:
def NB2(self, ID_LEN):
    r1 = np.fromfile(ReadFile.fid, dTypes.NB_HDR, 1)
    num_receivers = r1[0][0]
    num_channels = r1[0][1]
    num_samples = r1[0][5]
    blockReturn = np.zeros((num_samples, num_receivers, num_channels))
    for rec in range(0, num_receivers):
        for chl in range(0, num_channels):
            for smpl in range(0, num_samples):
                r2_iq = np.fromfile(ReadFile.fid, np.int16, 2)
                blockReturn[smpl, rec, chl] = np.sqrt(math.fabs(r2_iq[0])*math.fabs(r2_iq[0]) + math.fabs(r2_iq[1])*math.fabs(r2_iq[1]))
    return blockReturn
So, what's going on is as follows:
r1 is the header of the file, dTypes.NB_HDR is a type I made:
NB_HDR= np.dtype([('f3',np.uint32),('f4',np.uint32),('f5',np.uint32),('f6',np.int32),('f7',np.int32),('f8',np.uint32)])
That gets all the information about the forthcoming data block, and nicely puts us in the right position within the file (the start of the data block!).
In this data block there is:
4096 samples per channel,
4 channels per receiver,
9 receivers.
So num_receivers, num_channels, num_samples will always be the same (at the moment anyway), but as you can see this is a fairly large amount of data. Each 'sample' is a pair of int16 values that I want to find the magnitude of (hence Pythagoras).
This NB2 code is executed for each 'block' in the file; for a 12GB file (which is how big they are) there are about 20,900 blocks, and I've got to iterate through 1000 of these files (so, 12TB overall). Any speed advantage, even if it's just milliseconds, would be massively appreciated.
EDIT: Actually it might be of help to know how I'm moving around inside the file. I have a function as follows:
def navigateTo(self, blockNum, indexNum):
    ReadFile.fid.seek(ReadFile.fileIndex[blockNum][indexNum], 0)
    ReadFile.currentBlock = blockNum
    ReadFile.index = indexNum
Before I run all this code I scan the file and make a list of index locations at ReadFile.fileIndex that I browse using this function and then 'seek' to the absolute location - is this efficient?
Cheers
Because you know the length of a block after you read the header, read the whole block at once. Then reshape the array (very fast, only affects metadata) and use the np.hypot ufunc:
blockData = np.fromfile(ReadFile.fid, np.int16, num_receivers*num_channels*num_samples*2)
blockData = blockData.reshape((num_receivers, num_channels, num_samples, 2))
return np.hypot(blockData[:,:,:,0], blockData[:,:,:,1])
On my machine it runs in 11ms per block.
import numpy as np

def NB2(self, ID_LEN):
    r1 = np.fromfile(ReadFile.fid, dTypes.NB_HDR, 1)
    num_receivers = r1[0][0]
    num_channels = r1[0][1]
    num_samples = r1[0][5]
    # first, match your array bounds to the way you are walking the file
    blockReturn = np.zeros((num_receivers, num_channels, num_samples))
    for rec in range(0, num_receivers):
        for chl in range(0, num_channels):
            # second, read in all the samples at once if you have enough memory
            r2_iq = np.fromfile(ReadFile.fid, np.int16, 2*num_samples)
            r2_iq.shape = (-1, 2)  # tell numpy that it is an array of two values
            # cast to float before squaring so the int16 values cannot overflow
            r2_iq = r2_iq.astype(np.float64)
            # create dot product vector by squaring data elementwise, and then
            # adding those elements together. Result is of length num_samples
            r2_iq = r2_iq * r2_iq
            r2_iq = r2_iq[:, 0] + r2_iq[:, 1]
            # get the distance by performing the square root "into" blockReturn
            np.sqrt(r2_iq, out=blockReturn[rec, chl, :])
    return blockReturn
This should help your performance. Two main ideas are at work in the numpy version. First, your result array's dimensions should match the way your loop dimensions are crafted, for memory locality.
Second, numpy is FAST. I've beaten hand-coded C with numpy, simply because it uses LAPACK and vector acceleration. However, to get that power you have to let it manipulate more data at a time. That is why your sample loop has been collapsed to read in the full set of samples for each receiver and channel in one large read. Then use the supreme vector powers of numpy to calculate your magnitude by dot product.
There is a little more optimization to be had in the magnitude calculation, but numpy recycles buffers for you, making it less important than you might think. I hope this helps!
I'd try to use as few loops and as many constants as possible.
Everything that can be done in a linear fashion should be done so.
If values don't change, use constants to reduce lookups and such,
because that eats up CPU cycles.
This is from a theoretical point of view ;-)
If possible, use highly optimized libraries. I don't exactly know what you are trying to achieve, but I'd rather use an existing FFT library than write one myself :>
One more thing: http://en.wikipedia.org/wiki/Big_O_notation (can be an eye-opener)
Most importantly, you shouldn't do file access at the lowest level of a triple nested loop, whether you do this in C or Python. You've got to read in large chunks of data at a time.
So to speed this up, read in large chunks of data at a time, and process that data using numpy indexing (that is, vectorize your code). This is particularly easy in your case since all your data is int16. Just read in big chunks of data, reshape the data into an array that reflects the (receiver, channel, sample) structure, then use the appropriate indexing to multiply and add things for Pythagoras, and the 'sum' command to add up the terms in the resulting array.
This is more of an observation than a solution, but porting that function to C++ and loading it in with the Python API would get you a lot of speed gain to begin with before loop optimization.
Let's say I need to save a matrix (each line corresponds to one row) that could be loaded from Fortran later. What method should I prefer? Is converting everything to strings the only approach?
You can save them in binary format as well. Please see the documentation of the struct standard module; it has a pack function for converting Python objects into binary data.
For example:
import struct
value = 3.141592654
data = struct.pack('d', value)
open('file.ext', 'wb').write(data)
You can convert each element of your matrix and write to a file. Fortran should be able to load that binary data. You can speed up the process by converting a row as a whole, like this:
row_data = struct.pack('d' * len(matrix_row), *matrix_row)
Please note that 'd' * len(matrix_row) is constant for your matrix size, so you only need to calculate that format string once.
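A small end-to-end sketch of the row-at-a-time packing described above (the file name and the example matrix are just placeholders):

import struct

matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]

row_format = 'd' * len(matrix[0])   # computed once, since every row has the same length
with open('matrix.bin', 'wb') as f:
    for row in matrix:
        f.write(struct.pack(row_format, *row))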
I don't know Fortran, so it's hard to tell what is easy for you to parse on that side.
It sounds like your options are either saving the doubles in plain text (meaning, 'converting' them to strings), or in binary (using struct and the like). Which one is better depends on your situation.
I would go with the plain-text solution, as it means the files will be easily readable, and you won't have to mess with different kinds of details (endianness, default double sizes).
But, there are cases where binary is better (for example, if you have a really big list of doubles and space is of importance, or if it is easier for you to parse it and you need the optimization) - but this is likely not your case.
You can use JSON:
import json
matrix = [[2.3452452435, 3.34134], [4.5, 7.9]]
data = json.dumps(matrix)
open('file.ext', 'w').write(data)
File content will look like:
[[2.3452452435, 3.3413400000000002], [4.5, 7.9000000000000004]]
If legibility and ease of access are important (and the file size is reasonable), Fortran can easily parse a simple array of numbers, at least if it knows the size of the matrix beforehand (with something like READ(FILE_ID, '2(F)'), I think):
1.234 5.6789e4
3.1415 9.265358978
42 ...
Two nested for loops in your Python code can easily write your matrix in this form.
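A minimal sketch of those two loops (the file name and the example matrix are placeholders):

matrix = [[1.234, 5.6789e4],
          [3.1415, 9.265358978]]

with open('matrix.txt', 'w') as f:
    for row in matrix:
        for value in row:
            f.write('%.10g ' % value)
        f.write('\n')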