MPI in Python: load data from a file by line concurrently

I'm new to Python as well as MPI.
I have a huge data file, 10 GB, and I want to load it into a list or whatever is more efficient (please suggest).
Here is how I currently load the file content into a list:
def load(source, size):
    # one sub-list per process: line number ln goes to sub-list ln % size
    data = [[] for _ in range(size)]
    ln = 0
    with open(source, 'r') as input_file:
        for line in input_file:
            ln += 1
            data[ln % size].append(line)  # sanitize the line here if needed
    return data
Note:
source is the file name.
size is the number of concurrent processes; I divide the data into size sub-lists
for parallel computing using MPI in Python.
Please advise how to load the data more efficiently and faster. I have been searching for days but couldn't find anything that matches my purpose; if something exists, please comment with a link here.
Regards

If I have understood the question, your bottleneck is not Python data structures. It is the I/O speed that limits the efficiency of your program.
If the file is written in contiguous blocks on the HDD, then I don't know of a way to read it faster than reading the file sequentially from the first byte to the end.
But if the file is fragmented, create multiple threads, each reading a part of the file. That would normally slow reading down, but modern HDDs implement a technique named NCQ (Native Command Queueing), which gives higher priority to read operations on sectors whose addresses are near the current position of the HDD head, improving the overall read speed when multiple threads are used.
To recommend an efficient data structure in Python for your program, you need to say what operations you will perform on the data (delete, add, insert, search, append and so on) and how often.
By the way, if you use commodity hardware, 10 GB of RAM is expensive. Try to reduce the need for this much RAM by loading only the data necessary for the current computation, then replacing it with new data for the next operation. You can overlap the computation with the I/O operations to improve performance.
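To illustrate the last point, here is a minimal sketch of overlapping I/O with computation using a reader thread and a bounded queue; CHUNK_LINES and process_chunk() are placeholders, and the real gain depends on how much of the work happens outside the GIL (e.g. in the OS read calls or in C extensions):

import threading
import queue

CHUNK_LINES = 100_000          # placeholder chunk size
SENTINEL = None                # marks end of input

def reader(path, q):
    # producer: read the file chunk by chunk while the consumer computes
    with open(path) as f:
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) >= CHUNK_LINES:
                q.put(chunk)
                chunk = []
        if chunk:
            q.put(chunk)
    q.put(SENTINEL)

def process_chunk(chunk):
    # placeholder for the real computation on one chunk of lines
    return len(chunk)

def main(path):
    q = queue.Queue(maxsize=4)          # bounded queue keeps memory use low
    t = threading.Thread(target=reader, args=(path, q))
    t.start()
    total = 0
    while True:
        chunk = q.get()
        if chunk is SENTINEL:
            break
        total += process_chunk(chunk)   # computation overlaps the next read
    t.join()
    return total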

(original) Solution using pickling
The strategy for your task can go this way:
- split the large file into smaller ones, making sure they are divided on line boundaries
- have Python code which converts each smaller file into the resulting list of records and saves it as a pickled file
- run the Python code for all the smaller files in parallel (using Python or other means)
- run integrating code, taking the pickled files one by one, loading the list from each and appending it to the final result.
To gain anything, you have to be careful, as overhead can overcome all possible gains from parallel runs:
- as Python uses the Global Interpreter Lock (GIL), do not use threads for parallel processing; use processes. As processes cannot simply pass data around, you have to pickle the results and let the other (final integrating) part read them back.
- try to minimize the number of loops. For this reason it is better to:
  - not split the large file into too many smaller parts. To use the power of your cores, it is best to match the number of parts to the number of cores (or possibly twice as many, but going higher spends too much time switching between processes).
  - pickle a list of items (records) rather than pickling items one by one. Pickling one list of 1000 items will be faster than 1000 separate pickles of small items (see the sketch after this list).
- some tasks (splitting the file, calling the conversion task in parallel) can often be done faster by existing tools in the system. If you have that option, use it.
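Here is a rough micro-benchmark sketch of the list-vs-items point; the record layout is made up and the absolute numbers will vary by machine, but pickling one list should come out clearly ahead of item-by-item pickling:

import io
import pickle
import time

records = [{"id": i, "value": "record-%d" % i} for i in range(1000)]

# one dump of the whole list
buf_one = io.BytesIO()
t0 = time.perf_counter()
pickle.dump(records, buf_one)
t1 = time.perf_counter()

# 1000 dumps, one record at a time
buf_many = io.BytesIO()
t2 = time.perf_counter()
for rec in records:
    pickle.dump(rec, buf_many)
t3 = time.perf_counter()

print("one list dump :", t1 - t0)
print("item by item  :", t3 - t2)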
In my small test, I created a file with 100 thousand lines with content like "98-BBBBBBBBBBBBBB", "99-BBBBBBBBBBB" etc. and tested converting it to a list of numbers [..., 98, 99, ...].
For splitting I used the Linux command split, asking it to create 4 parts while preserving line boundaries:
$ split -n l/4 long.txt
This created smaller files xaa, xab, xac, xad.
To convert each smaller file I used the following script, which converts the content into a file with the extension .pickled containing the pickled list.
# chunk2pickle.py
import pickle
import sys

def process_line(line):
    return int(line.split("-", 1)[0])

def main(fname, pick_fname):
    with open(pick_fname, "wb") as fo:
        with open(fname) as f:
            pickle.dump([process_line(line) for line in f], fo)

if __name__ == "__main__":
    fname = sys.argv[1]
    pick_fname = fname + ".pickled"
    main(fname, pick_fname)
To convert one chunk of lines into a pickled list of records:
$ python chunk2pickle.py xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used the parallel tool (which has to be installed on the system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
which created the new files with the extension .pickled on the disk.
-j 4 asks it to run 4 processes in parallel; adjust it to your system, or leave it out and it will default to the number of cores you have.
parallel can also get the list of parameters (input file names in our case) by other means, like the ls command:
$ ls x?? | parallel -j 4 python chunk2pickle.py {}
To integrate the results, use the script integrate.py:
# integrate.py
import pickle

def main(file_names):
    res = []
    for fname in file_names:
        with open(fname, "rb") as f:
            res.extend(pickle.load(f))
    return res

if __name__ == "__main__":
    file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
    # here you have the list of records you asked for
    records = main(file_names)
    print(records)
In my answer I have used a couple of external tools (split and parallel). You can do a similar task with Python too. My answer focuses only on giving you an option to keep the Python code for converting lines into the required data structures. A complete pure-Python answer is not covered here (it would get much longer and probably slower).
Solution using process Pool (no explicit pickling needed)
The following solution uses multiprocessing from Python. In this case there is no need to pickle the results explicitly (the library does this automatically behind the scenes when it passes data between processes).
# direct_integrate.py
from multiprocessing import Pool

def process_line(line):
    return int(line.split("-", 1)[0])

def process_chunkfile(fname):
    with open(fname) as f:
        return [process_line(line) for line in f]

def main(file_names, cores=4):
    p = Pool(cores)
    return p.map(process_chunkfile, file_names)

if __name__ == "__main__":
    file_names = ["xaa", "xab", "xac", "xad"]
    # here you have the list of records you asked for
    # warning: records are in groups, one group per input file
    record_groups = main(file_names)
    for rec_group in record_groups:
        print(rec_group)
This updated solution still assumes that the large file is already available in the form of four smaller files.
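If you would rather avoid the external split command, a pure-Python splitter along these lines should also work; this is only a sketch, and the part_00, part_01, ... names are my own choice:

import os

def split_file(source, parts=4):
    # write source into part_00, part_01, ... keeping whole lines together
    target = os.path.getsize(source) // parts
    names = []
    with open(source, "rb") as src:
        for i in range(parts):
            name = "part_%02d" % i
            names.append(name)
            written = 0
            with open(name, "wb") as dst:
                for line in src:
                    dst.write(line)
                    written += len(line)
                    # all but the last part stop near the target size
                    if i < parts - 1 and written >= target:
                        break
    return names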

Related

Python in-memory files for caching large files

I am doing very large data processing (16 GB) and I would like to try to speed up the operations by storing the entire file in RAM in order to deal with disk latency.
I looked into the existing libraries but couldn't find anything that would give me the flexibility of interface I need.
Ideally, I would like to use something with an integrated read_line() method so that the behavior is similar to the standard file-reading interface.
Homer512 answered my question somewhat; mmap is a nice option that results in caching.
Alternatively, Numpy has some interfaces for parsing byte streams from a file directly into memory and it also comes with some string-related operations.
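For reference, a minimal sketch of the mmap approach mentioned above; it gives a readline()-style interface while the OS keeps the pages cached (the file name is a placeholder):

import mmap

with open("big_data.txt", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        while True:
            line = mm.readline()            # behaves like file.readline(), returns bytes
            if not line:
                break
            text = line.decode("utf8")      # or process the raw bytes directly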
Just read the file once. The OS will cache it for the next read operations. It's called the page cache.
For example, run this code and see what happens (replace the path with one pointing to a file on your system):
import time

filepath = '/home/mostafa/Videos/00053.MTS'
n_bytes = 200_000_000  # read 200 MB

t0 = time.perf_counter()
with open(filepath, 'rb') as f:
    f.read(n_bytes)
t1 = time.perf_counter()
with open(filepath, 'rb') as f:
    f.read(n_bytes)
t2 = time.perf_counter()

print(f'duration 1 = {t1 - t0}')
print(f'duration 2 = {t2 - t1}')
On my Linux system, I get:
duration 1 = 2.1419905399670824
duration 2 = 0.10992361599346623
Note how much faster the second read operation is, even though the file was closed and reopened. This is because, in the second read, the file is served from RAM, thanks to the Linux kernel. Also note that if you run this program again, you'll see that both operations finish quickly, because the file was cached by the previous run:
duration 1 = 0.10143039299873635
duration 2 = 0.08972924604313448
So in short, just use this code to preload the file into RAM:
with open(filepath, 'rb') as f:
    f.read()
Finally, it's worth mentioning that the page cache is practically non-deterministic, i.e. your file might not stay fully cached, especially if the file size is big compared to the RAM. But in cases like that, caching the file might not be practical after all. In short, the OS page cache is not the most sophisticated mechanism, but it is good enough for many use cases.
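On Linux there is also a gentler option than reading the whole file in Python: ask the kernel to prefetch it into the page cache with posix_fadvise. This is a sketch assuming Python 3.3+ on a POSIX system; the path is a placeholder:

import os

filepath = '/path/to/large/file'        # placeholder
fd = os.open(filepath, os.O_RDONLY)
try:
    # offset 0, length 0 means "the whole file"; WILLNEED asks the kernel
    # to start reading it into the page cache in the background
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
finally:
    os.close(fd)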

Efficient parallel pickle file reading

I'm running a program that analyzes pickle files; each pickle should be converted into a text file (of my own format).
In order to do that efficiently, I'm running N parallel Python processes, using multiprocessing.Pool (N is a parameter).
Each process opens and reads a single file exclusively, converts it into text and writes it to a separate directory (each pkl file is converted into a unique text file).
I'm running on a Linux machine with N = multiprocessing.cpu_count() - 1.
The pkl processing takes much longer than expected. In total I'm analyzing 100K pkl files, each about 1 MB, and it takes about a day. The text conversion function (to_txt()) doesn't involve any major calculations, just conversion to strings.
I suspect the inefficiency is caused by hitting the disk I/O capacity (judging by top on my Linux machine).
Any ideas about how to make this more efficient?
Each process function is:
from pathlib import Path
from typing import cast
import pickle

def analyze(pkl_path: Path):
    # Results is a class defined elsewhere in the project
    with pkl_path.open("rb") as pkl_file:
        results = cast(Results, pickle.load(pkl_file))
        with pkl_path.with_suffix(".txt").open("w") as txt_file:
            txt_file.write(results.to_txt())
Something like this:
def analyze(pkl_path: Path):
    with pkl_path.open("rb") as pkl_file:
        file = pickle.load(pkl_file)
    results = cast(Results, file)
    inTxt = results.to_txt()
    with pkl_path.with_suffix(".txt").open("w") as txt_file:
        txt_file.write(inTxt)
This way, files are not kept open any longer than necessary, leaving the OS free to handle the other processes' file open/close work.
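For context, a minimal driver sketch that runs analyze() from above over a directory of .pkl files with multiprocessing.Pool; the directory name and the chunksize value are assumptions, not part of the original code:

from multiprocessing import Pool, cpu_count
from pathlib import Path

def main(pkl_dir="pkl_files"):
    pkl_paths = sorted(Path(pkl_dir).glob("*.pkl"))
    n_workers = max(cpu_count() - 1, 1)
    with Pool(n_workers) as pool:
        # chunksize > 1 cuts task-dispatch overhead when there are 100K small files
        pool.map(analyze, pkl_paths, chunksize=32)

if __name__ == "__main__":
    main()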

Speed up reading multiple pickle files

I have a lot of pickle files. Currently I read them in a loop but it takes a lot of time. I would like to speed it up but don't have any idea how to do that.
Multiprocessing wouldn't work because in order to transfer data from a child subprocess to the main process data need to be serialized (pickled) and deserialized.
Using threading wouldn't help either because of GIL.
I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without GIL). Is there something like this around?
UPDATE
Answering your questions:
The files are partial products of data processing for ML purposes.
They are pandas.Series objects, but the dtype is not known upfront.
I want to have many files because we want to pick any subset easily.
I want many smaller files instead of one big file because deserialization of one big file takes more memory (at some point in time we hold both the serialized string and the deserialized objects).
The size of the files can vary a lot.
I use Python 3.7, so I believe it's cPickle in fact.
Using pickle is very flexible because I don't have to worry about the underlying types - I can save anything.
I agree with what has been noted in the comments, namely that due to the constraints of Python itself (chiefly the GIL, as you noted), there may simply be no way to load the information faster than you are doing now. Or, if there is a way, it may be both highly technical and, in the end, give you only a modest increase in speed.
That said, depending on the datatypes you have, it may be faster to use quickle or pyrobuf.
I think that the solution would be some library written in C that
takes a list of files to read and then runs multiple threads (without
GIL). Is there something like this around?
In short: no. pickle is apparently good enough for enough people that there are no major alternative implementations fully compatible with the pickle protocol. In Python 3, cPickle was merged into pickle (as the C accelerator module _pickle), and neither releases the GIL anyway, which is why threading won't help you (search for Py_BEGIN_ALLOW_THREADS in _pickle.c and you will find nothing).
If your data can be restructured into a simpler data format like CSV, or a binary format like numpy's .npy, there will be less CPU overhead when reading it. Pickle is built for flexibility first rather than speed or compactness. One possible exception to the rule that more complexity means less speed is the HDF5 format via h5py, which can be fairly complex, and which I have used to max out the bandwidth of a SATA SSD.
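As a rough illustration of the simpler-binary-format point, a .npy round trip looks like this (the array contents and the file name are placeholders):

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)   # stand-in for one chunk of data
np.save("chunk_0001.npy", arr)                 # plain binary + small header

loaded = np.load("chunk_0001.npy")             # reads with very little CPU overhead
assert np.array_equal(arr, loaded)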
Finally, you mention that you have many, many pickle files, and that by itself probably causes no small amount of overhead: each time you open a new file there is some operating-system overhead involved. Conveniently, you can combine pickle files by simply appending them together, and then call Unpickler.load() until you reach the end of the file. Here's a quick example of combining two pickle files using shutil:
import pickle, shutil, os

# some dummy data
d1 = {'a': 1, 'b': 2, 1: 'a', 2: 'b'}
d2 = {'c': 3, 'd': 4, 3: 'c', 4: 'd'}

# create two pickles
with open('test1.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d1)
with open('test2.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d2)

# combine the pickle files by concatenation
with open('test3.pickle', 'wb') as dst:
    for pickle_file in ['test1.pickle', 'test2.pickle']:
        with open(pickle_file, 'rb') as src:
            shutil.copyfileobj(src, dst)

# unpack the data
with open('test3.pickle', 'rb') as f:
    p = pickle.Unpickler(f)
    while True:
        try:
            print(p.load())
        except EOFError:
            break

# cleanup
os.remove('test1.pickle')
os.remove('test2.pickle')
os.remove('test3.pickle')
I think you should try using mmap (memory-mapped files), which is similar to open() but can be much faster.
Note: if each of your files is big, use mmap; if the files are small, use the regular methods.
I have written a sample that you can try.
import mmap
import pickle
from time import perf_counter as pf

def load_files(filelist):
    start = pf()  # for rough time calculations
    for filename in filelist:
        # the file object is opened only to obtain a file descriptor for mmap
        with open(filename, mode="rb") as file_obj:
            with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_file_obj:
                data = pickle.load(mmap_file_obj)
                print(data)
    print(f'Operation took {pf()-start} sec(s)')
Here mmap.ACCESS_READ maps the file read-only. The file_obj returned by open is only used to get the file descriptor, which mmap uses to map the file into memory as a memory-mapped file.
As you can see below in the documentation of the low-level os.open, it returns a file descriptor (fd for short). So we don't have to do anything with file_obj operation-wise; we just need its fileno() method to get the file descriptor. Also note that we are not closing file_obj before mmap_file_obj: the mmap block is closed first.
As you said in your comment.
open (file, flags[, mode])
Open the file file and set various flags according to flags and possibly its mode according to mode.
The default mode is 0777 (octal), and the current umask value is first masked out.
Return the file descriptor for the newly opened file.
Give it a try and see how much impact it has on your operation.
You can read more about mmap here, and about file descriptors here.
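A hypothetical invocation of the load_files() function above could look like this (the directory and glob pattern are placeholders):

import glob

load_files(sorted(glob.glob("pickles/*.pickle")))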
You can try multiprocessing:
import os, pickle

pickle_list = os.listdir("pickles")
output_dict = dict.fromkeys(pickle_list, '')

def pickle_process_func(picklename):
    with open("pickles/" + picklename, 'rb') as file:
        dapickle = pickle.load(file)

    # if you need the previous file's output, wait for it
    # (caveat: a plain dict is not shared between Pool worker processes;
    # a multiprocessing.Manager().dict() would be needed for this to work)
    while not output_dict[pickle_list[pickle_list.index(picklename) - 1]]:
        continue

    print("loaded")
    output_dict[picklename] = custom_func_i_dunno(dapickle)

from multiprocessing import Pool

with Pool(processes=10) as pool:
    pool.map(pickle_process_func, pickle_list)
Consider using HDF5 via h5py instead of pickle. The performance is generally much better than pickle for numerical data in pandas and numpy data structures, and it supports most common data types and compression.
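A minimal h5py sketch of that idea; the file and dataset names are placeholders, and the random arrays stand in for the pandas.Series values (many series can live in one HDF5 file, so picking a subset stays easy without opening thousands of files):

import numpy as np
import h5py

# write: one dataset per series, with transparent compression
with h5py.File("series.h5", "w") as f:
    for i in range(3):
        data = np.random.rand(1000)
        f.create_dataset("series_%d" % i, data=data, compression="gzip")

# read back only the subset you need
with h5py.File("series.h5", "r") as f:
    subset = {name: f[name][:] for name in ("series_0", "series_2")}
print({name: arr.shape for name, arr in subset.items()})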

Performance issue with reading integers from a binary file at specific locations

I have a file with integers stored as binary, and I'm trying to extract values at specific locations. It's one big serialized integer array for which I need values at specific indexes. I've created the following code, but it's terribly slow compared to the F# version I wrote before.
import os, struct

def read_values(filename, indices):
    # indices are sorted and unique
    values = []
    with open(filename, 'rb') as f:
        for index in indices:
            f.seek(index * 4, os.SEEK_SET)
            b = f.read(4)
            v = struct.unpack("@i", b)[0]
            values.append(v)
    return values
For comparison here is the F# version:
open System
open System.IO

let readValue (reader:BinaryReader) cellIndex =
    // set stream to correct location
    reader.BaseStream.Position <- cellIndex*4L
    match reader.ReadInt32() with
    | Int32.MinValue -> None
    | v -> Some(v)

let readValues fileName indices =
    use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
    let values = List.map (readValue reader) (List.ofSeq indices)
    values
Any tips on how to improve the performance of the Python version, e.g. by using numpy?
Update
HDF5 works very well (from 5 seconds to 0.8 seconds on my test file):
import tables

def read_values_hdf5(filename, indices):
    with tables.open_file(filename) as f:
        dset = f.root.raster
        return dset[indices]
Update 2
I went with np.memmap because the performance is similar to HDF5 and I already have numpy in production.
Depending heavily on your file size, you might want to read it completely into a numpy array. If the file is not large, a complete sequential read may be faster than a large number of seeks.
One problem with the seek operations is that Python operates on buffered input. If the program were written in some lower-level language, the use of unbuffered I/O would be a good idea, as you only need a few values.
import numpy as np

# read the complete index into memory
index_array = np.fromfile("my_index", dtype=np.uint32)
# look up the indices you need (indices being a list of indices)
values = index_array[indices]
If you would anyway read almost all pages (i.e. your indices are random and at a frequency of 1/1000 or more), this is probably faster. On the other hand, if you have a large index file, and you only want to pick a few indices, this is not so fast.
Then one more possibility - which might be the fastest - is to use the python mmap module. Then the file is memory-mapped, and only the pages really required are accessed.
It should be something like this:
import mmap
import struct

with open("my_index", "rb") as f:
    memory_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for i in indices:
        # the value at index i:
        idx_value = struct.unpack('I', memory_map[4*i:4*i+4])[0]
(Note, I did not actually test that one, so there may be typing errors. Also, I did not care about endianness, so please check that it is correct.)
Happily, these can be combined by using numpy.memmap. It should keep your array on disk but give you numpyish indexing. It should be as easy as:
import numpy as np

index_arr = np.memmap(filename, dtype='uint32', mode='r')  # 'r' = read-only
values = index_arr[indices]
I think this should be the easiest and fastest alternative. However, if "fast" is important, please test and profile.
EDIT: As the mmap solution seems to gain some popularity, I'll add a few words about memory mapped files.
What is mmap?
Memory mapped files are not something uniquely pythonic, because memory mapping is something defined in the POSIX standard. Memory mapping is a way to use devices or files as if they were just areas in memory.
File memory mapping is a very efficient way to randomly access fixed-length data files. It uses the same technology as is used with virtual memory. The reads and writes are ordinary memory operations. If they point to a memory location which is not in the physical RAM memory ("page fault" occurs), the required file block (page) is read into memory.
The delay in random file access is mostly due to the physical rotation of the disks (SSDs are another story). On average, the block you need is half a rotation away; for a typical HDD this delay is approximately 5 ms plus any data-handling delay. The overhead introduced by using Python instead of a compiled language is negligible compared to this delay.
If the file is read sequentially, the operating system usually uses a read-ahead cache to buffer the file before you even know you need it. For a randomly accessed big file this does not help at all. Memory mapping provides a very efficient way, because all blocks are loaded exactly when you need them and remain in the cache for further use. (This could in principle happen with fseek as well, because it might use the same technology behind the scenes. However, there is no guarantee, and there is anyway some overhead as the call wanders through the operating system.)
mmap can also be used to write files. It is very flexible in the sense that a single memory mapped file can be shared by several processes. This may be very useful and efficient in some situations, and mmap can also be used in inter-process communication. In that case usually no file is specified for mmap, instead the memory map is created with no file behind it.
mmap is not very well-known despite its usefulness and relative ease of use. It has, however, one important 'gotcha'. The file size has to remain constant. If it changes during mmap, odd things may happen.
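As a small illustration of the inter-process use mentioned above, here is a sketch of an anonymous memory map (no file behind it) shared with a child process; it assumes a Unix system where the fork start method is available:

import mmap
import multiprocessing as mp

def child_writer(shared):
    # the forked child writes into the shared anonymous map
    shared.seek(0)
    shared.write(b"hello from the child process")

if __name__ == "__main__":
    mp.set_start_method("fork")      # Unix only; the child inherits the map
    shared = mmap.mmap(-1, 4096)     # -1: anonymous map, not backed by a file
    p = mp.Process(target=child_writer, args=(shared,))
    p.start()
    p.join()
    shared.seek(0)
    print(shared.read(28))           # b'hello from the child process'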
Is the indices list sorted? I think you could get better performance if the list were sorted, as you would make far fewer disk seeks.

Python - find the unique values from a large JSON file efficiently

I have a JSON file data_large of size 150.1 MB. The content inside the file is of the form [{"score": 68},{"score": 78}]. I need to find the list of unique scores from the items.
This is what I'm doing:
import ijson # since json file is large, hence making use of ijson
f = open ('data_large')
content = ijson.items(f, 'item') # json loads quickly here as compared to when json.load(f) is used.
print set(i['score'] for i in content) #this line is actually taking a long time to get processed.
Can I make the print set(i['score'] for i in content) line more efficient? It currently takes 201 seconds to execute.
This will give you the set of unique score values (only) as ints. You'll need the 150 MB of free memory. It uses re.finditer() to parse, which is about three times faster than the json parser (on my computer).
import re
import time

t = time.time()
obj = re.compile(r'{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(m.group(1) for m in obj.finditer(data))
s = set(map(int, s))
print time.time() - t
Using re.findall() also seems to be about three times faster than the json parser; it consumes about 260 MB:
import re

obj = re.compile(r'{.*?: (\d*?)}')
with open('datafile.txt', 'r') as f:
    data = f.read()
s = set(obj.findall(data))
I don't think there is any way to improve things by much. The slow part is probably just the fact that at some point you need to parse the whole JSON file. Whether you do it all up front (with json.load) or little by little (when consuming the generator from ijson.items), the whole file needs to be processed eventually.
The advantage to using ijson is that you only need to have a small amount of data in memory at any given time. This probably doesn't matter too much for a file with a hundred or so megabytes of data, but would be a very big deal if your data file grew to be gigabytes or more. Of course, this may also depend on the hardware you're running on. If your code is going to run on an embedded system with limited RAM, limiting your memory use is much more important. On the other hand, if it is going to be running on a high-performance server or workstation with lots and lots of RAM available, there may not be any reason to hold back.
So, if you don't expect your data to get too big (relative to your system's RAM capacity), you might try testing to see if using json.load to read the whole file at the start, then getting the unique values with a set is faster. I don't think there are any other obvious shortcuts.
On my system, the straightforward code below handles 10,000,000 scores (139 megabytes) in 18 seconds. Is that too slow?
#!/usr/local/cpython-2.7/bin/python

from __future__ import print_function

import json

with open('data_large', 'r') as file_:
    content = json.load(file_)
    print(set(element['score'] for element in content))
Try using a set:
set([x['score'] for x in scores])
For example:
>>> scores = [{"score" : 78}, {"score": 65} , {"score" : 65}]
>>> set([x['score'] for x in scores])
set([65, 78])
