Python hangs silently on large file write


I am trying to write a big list of numpy nd_arrays to disk.
The list is ~50000 elements long
Each element is a nd_array of size (~2048,2) of ints. The arrays have different shapes.
The method I am currently using is:
@staticmethod
def _write_with_yaml(path, obj):
    with io.open(path, 'w+', encoding='utf8') as outfile:
        yaml.dump(obj, outfile, default_flow_style=False, allow_unicode=True)
I have also tried pickle, which gives the same problem.
On small lists (~3400 long), this works fine, finishes fast enough (<30 sec).
On ~6000 long lists, this finishes after ~2 minutes.
When the list gets larger, the process seems not to do anything. No change in RAM or disk activity.
I stopped waiting after 30 minutes.
After force stopping the process, the file suddenly became of significant size (~600MB).
I can't know if it finished writing or not.
What is the correct way to write such large lists, know whether the write succeeded, and, if possible, know when the write/read is going to finish?
How can I debug what's happening when the process seems to hang?
I prefer not to break up and reassemble the lists manually in my code; I expect the serialization libraries to be able to do that for me.

For the code
import numpy as np
import yaml
x = []
for i in range(0,50000):
    x.append(np.random.rand(2048,2))
print("Arrays generated")
with open("t.yaml", 'w+', encoding='utf8') as outfile:
    yaml.dump(x, outfile, default_flow_style=False, allow_unicode=True)
on my system (macOS, i7, 16 GiB RAM, SSD) with Python 3.7 and PyYAML 3.13, the run takes 61 minutes. During the save the Python process occupied around 5 GB of memory, and the final file size is 2 GB. This also shows the overhead of the file format: the raw data is 50k * 2048 * 2 * 8 bytes (the size of a float is generally 64 bits in Python) = 1562 MB, so the YAML file is around 1.3 times larger (and serialisation/deserialisation also takes time).
To answer your questions:
There is no correct or incorrect way. Getting a progress update and an estimate of the finishing time is not easy (e.g. other tasks might interfere with the estimate, resources like memory could be used up, etc.). You can rely on a library that supports that or implement something yourself (as the other answer suggested).
Not sure "debug" is the correct term, as in practice the process might just be slow. Doing a performance analysis is not easy, especially when using multiple/different libraries. What I would start with is clear requirements: what do you want from the saved file? Does it need to be YAML? Saving 50k arrays as YAML does not seem the best solution if you care about performance. You should first ask yourself "which is the best format for what I want?" (but you did not give details, so I can't say...)
Edit: if you want something just fast, use pickle. The code:
import numpy as np
import pickle
x = []
for i in range(0,50000):
    x.append(np.random.rand(2048,2))
print("Arrays generated")
with open("t.pickle", "wb") as outfile:
    pickle.dump(x, outfile)
finishes in 9 seconds, and generates a file of 1.5GBytes (no overhead). Of course pickle format should be used in very different circumstances than yaml...
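One way to get progress reporting and a check that the write completed, while staying with pickle, is to pickle the elements one at a time into a single file. The following is a minimal sketch, not a tested recipe for the 50k-array case; the helper names dump_items and load_items and the report_every parameter are assumptions made for illustration and are not part of the pickle module.

```python
import pickle


def dump_items(path, items, report_every=1000):
    """Pickle each item separately into one file and report progress."""
    total = len(items)
    with open(path, "wb") as f:
        # Store the count first so the reader knows how many items to expect.
        pickle.dump(total, f, protocol=pickle.HIGHEST_PROTOCOL)
        for done, item in enumerate(items, start=1):
            pickle.dump(item, f, protocol=pickle.HIGHEST_PROTOCOL)
            if done % report_every == 0 or done == total:
                print(f"wrote {done}/{total} items")


def load_items(path):
    """Read the items back; an incomplete file raises instead of returning less."""
    with open(path, "rb") as f:
        total = pickle.load(f)
        return [pickle.load(f) for _ in range(total)]
```

Because the item count is written first, a file that was cut short makes load_items raise an exception (typically EOFError) rather than silently return a shorter list. The items can be NumPy arrays or any other picklable objects.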

I can't say this is the answer, but it may be.
When I was working on an app that required fast cycles, I found that something in the code was very slow: opening and closing YAML files.
It was solved by using JSON.
Don't use YAML for anything other than some kind of config that you don't open often.
Solution to your array saving:
np.save(path, array) # path = path + name + '.npy'
If you really need to save a list of arrays, I recommend saving a list of array paths (the arrays themselves you save to disk with np.save). Saving Python objects to disk is not really what you want. What you want is to save NumPy arrays with np.save.
Complete solution (saving example):
for array_index in range(len(list_of_arrays)):
    np.save(str(array_index) + '.npy', list_of_arrays[array_index])
    # path = str(array_index) + '.npy'
Complete solution (loading example):
list_of_array_paths = ['0.npy', '1.npy']
list_of_arrays = []
for array_path in list_of_array_paths:
    list_of_arrays.append(np.load(array_path))
Further advice:
Python does not cope well with many large arrays held in memory at once, for example in a list. For both speed and memory, work with only one or two arrays at a time and leave the rest on disk. So instead of holding an object reference, hold a path and load the array from disk when needed.
Also, you said you don't want to assemble the list manually.
A possible solution, which I don't advise, but which may be exactly what you are looking for:
>>> a = np.zeros(shape = [10,5,3])
>>> b = np.zeros(shape = [7,7,9])
>>> c = [a,b]
>>> np.save('data.npy',c)
>>> d = np.load('data.npy')
>>> d.shape
(2,)
>>> type(d)
<type 'numpy.ndarray'>
>>> d.shape
(2,)
>>> d[0].shape
(10, 5, 3)
>>>
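I believe the code above needs no comment. Note, however, that after loading you no longer have a list: it has been converted into a NumPy array of objects. Recent NumPy versions also refuse to load such an object array unless you pass allow_pickle=True to np.load.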
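If the arrays differ only in their number of rows, as in the question (shape (~2048, 2)), another option is to concatenate them into one array and store the row counts alongside it. This is a minimal sketch under that assumption; save_ragged and load_ragged are hypothetical helper names, and the path is expected to end in .npz.

```python
import numpy as np


def save_ragged(path, arrays):
    """Store arrays that share all dimensions except the first in one .npz file."""
    lengths = np.array([len(a) for a in arrays], dtype=np.int64)
    stacked = np.concatenate(arrays, axis=0)
    np.savez(path, data=stacked, lengths=lengths)


def load_ragged(path):
    """Rebuild the original list by cutting the stacked array at the row counts."""
    with np.load(path) as f:
        data = f["data"]
        lengths = f["lengths"]
    # The split points are the cumulative row counts, without the final total.
    return np.split(data, np.cumsum(lengths)[:-1])
```

This writes two plain numeric arrays, so no pickling is involved and np.load works with its default settings.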

Related

How to split large data and rejoin later

My code generates a list of numpy arrays of size (1, 1, n, n, m, m) where n may vary from 50-100 and m from 5-10 depending on the case at hand. The length of the list itself may go up to 10,000, and it is written with pickle at the end of the code. For cases at the higher end of these numbers, or when file sizes go beyond 5-6 GB, I get an out-of-memory error. Below is a made-up example of the situation:
import pickle
import numpy as np
list, list_length = [], 1000
n = 100
m = 3
for i in range(0, list_length):
    list.append(np.random.random((1, 1, n, n, m, m)))
file_path = 'C:/Users/Desktop/Temp/'
with open(file_path, 'wb') as file:
    pickle.dump(list, file)
I am looking for a way that helps me to
split the data so that I can get rid of memory error, and
rejoin the data in the original form when needed later
All I could think is:
for i in range(0, list_length):
    data = np.random.random((1, 1, n, n, m, m))
    file_path = 'C:/Users/Desktop/Temp/single' + str(i)
    with open(file_path, 'wb') as file:
        pickle.dump(data, file)
and then combine using:
combined_list = []
for i in range(0, list_length):
    file_path = 'C:/Users/Desktop/Temp/single' + str(i)
    with open(file_path, 'rb') as file:
        data = pickle.load(file)
    combined_list.append(data)
Using this way, the file size certainly reduces due to multiple files, but that also increases processing time due to multiple file I/O operations.
Is there a more elegant and better way to do this?

Using savez, savez_compressed, or even things like h5py can be useful, as @tel mentioned, but that takes extra effort to "reinvent" a caching mechanism. There are two easier ways to process a larger-than-memory ndarray, if applicable:
The easiest way is of course to enable the pagefile on Windows or swap on Linux (I am not sure about the OS X counterpart). This gives you virtual memory large enough that you don't need to worry about memory at all; the OS saves to and loads from disk as needed.
If the first way is not applicable, for example because you do not have admin rights, numpy provides another way: np.memmap. This function maps an ndarray to disk such that you can index it just as if it were in memory. Technically the IO is done directly to the hard disk, but the OS will cache accordingly.
For the second way, you can create a disk-backed ndarray using:
np.memmap('yourFileName', 'float32', 'w+', 0, 2**32)
This creates a 16GB float32 array in no time (containing 4G numbers). You can then do IO on it. Many functions have an out parameter; you can set it so that the output is written straight to the memmap instead of being copied from memory to disk.
If you want to save a list of ndarrays using the second method, either create many memmaps or concatenate them into a single array.
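A small round trip shows how the memmap approach behaves in practice. This is only a sketch at a much smaller scale than the 16GB example; the file name, dtype and shape are assumptions chosen for illustration.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "arrays.dat")
shape = (1000, 4)

# "w+" creates (or overwrites) the file and sizes it to hold the whole array.
writer = np.memmap(path, dtype="float32", mode="w+", shape=shape)
writer[10] = [1.0, 2.0, 3.0, 4.0]  # written through to the mapped file
writer.flush()                     # make sure pending changes reach the disk
del writer

# "r" maps the existing file read-only; dtype and shape must be supplied again
# because the raw file stores no metadata.
reader = np.memmap(path, dtype="float32", mode="r", shape=shape)
row = np.array(reader[10])         # copy one row into ordinary memory
file_size = os.path.getsize(path)  # 1000 * 4 values * 4 bytes each
```

Since the file carries no header, the dtype and shape have to be remembered elsewhere, for instance in a small side file.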

Don't use pickle to store large data, it's not an efficient way to serialize anything. Instead, use the built-in numpy serialization formats/functions via the numpy.savez_compressed and numpy.load functions.
System memory isn't infinite, so at some point you'll still need to split your files (or use a heavier duty solution such as the one provided by the h5py package). However, if you were able to fit the original list into memory then savez_compressed and load should do what you need.
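As a rough illustration of the savez_compressed route for a list of arrays, the sketch below stores every array as its own entry and rebuilds the list on load. The helper names save_list and load_list are assumptions; the entry names arr_0, arr_1, ... are the ones NumPy assigns to positional arguments.

```python
import numpy as np


def save_list(path, arrays):
    """Write every array as a separate compressed entry of one .npz file."""
    # Positional arrays are stored under the automatic names arr_0, arr_1, ...
    np.savez_compressed(path, *arrays)


def load_list(path):
    """Read the entries back in their original order."""
    with np.load(path) as data:
        return [data[f"arr_{i}"] for i in range(len(data.files))]
```

Pass a path that already ends in .npz, because NumPy appends that extension when it is missing. The arrays may have different shapes, since each one is stored independently.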

Python - size of object in memory vs. on disk

Here is my example:
import sys
import numpy as np
test = [np.random.choice(range(1, 1000), 1000000) for el in range(1,1000)]
this object takes in memory:
print(sys.getsizeof(test)/1024/1024/1024)
8.404254913330078e-06
something like 8 KB
When I write it to disk
import pickle
file_path = './test.pickle'
with open(file_path, 'wb') as f:
    pickle.dump(test, f)
it takes almost 8 GB according to ls -l.
Could somebody clarify why it takes so little space in memory and so much on disk? I am guessing in memory numbers are not accurate.

"I am guessing in memory numbers are not accurate."
Well, this would not explain 6 orders of magnitude in size, right? ;)
test is a Python list instance. getsizeof only reports the size of the list object itself: a small header plus one pointer per element, which is 64 bits on your system. It does not follow those pointers, so the NumPy arrays the list refers to are not counted. To get the full size you need to inspect every element and add up what is attached to it (lists have no strict element type in Python, so you can't simply compute size_of_element * len(list)).
Here is one resource: https://code.tutsplus.com/tutorials/understand-how-much-memory-your-python-objects-use--cms-25609
Here is another one: How do I determine the size of an object in Python?
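For a list of NumPy arrays specifically, the two numbers can be compared directly. This is a minimal sketch; list_sizes is a hypothetical helper name, and it counts only the array data buffers (via nbytes), not the small per-array object overhead.

```python
import sys

import numpy as np


def list_sizes(arrays):
    """Return (bytes used by the list object itself, bytes of array data)."""
    container = sys.getsizeof(arrays)        # header plus one pointer per element
    payload = sum(a.nbytes for a in arrays)  # the actual numbers in each array
    return container, payload


arrays = [np.zeros(1000, dtype=np.int64) for _ in range(10)]
container, payload = list_sizes(arrays)
print(container, payload)
```

The payload figure is the one to compare with the size of the pickle file on disk.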

MPI in Python: load data from a file by line concurrently

I'm new to Python as well as MPI.
I have a huge data file, 10 GB, and I want to load it into a list, or into something more efficient if you can suggest it.
Here is the way I load the file content into a list
def load(source, size):
    data = [[] for _ in range(size)]
    ln = 0
    with open(source, 'r') as input:
        for line in input:
            ln += 1
            data[ln%size].sanitize(line)
    return data
Note:
source: the file name
size: the number of concurrent processes; I divide the data into [size] sublists
for parallel computing using MPI in Python.
Please advise how to load the data more efficiently and faster. I have been searching for days but couldn't find anything that matches my purpose; if something exists, please comment with a link here.
Regards

If I have understood the question, your bottleneck is not Python data structures. It is the I/O speed that limits the efficiency of your program.
If the file is written in contiguous blocks on the HDD, then I don't know a way to read it faster than reading the file from the first byte to the end.
But if the file is fragmented, create multiple threads, each reading a part of the file. This might be expected to slow down reading, but modern HDDs implement a technique named NCQ (Native Command Queuing). It works by giving high priority to read operations on sectors with addresses near the current position of the HDD head, hence improving the overall speed of reads issued from multiple threads.
To suggest an efficient data structure in Python for your program, you need to say which operations you will perform on the data (delete, add, insert, search, append and so on) and how often.
By the way, if you use commodity hardware, 10 GB of RAM is expensive. Try reducing the need for this amount of RAM by loading only the data necessary for a computation, then replacing it with new data for the next operation. You can overlap the computation with the I/O operations to improve performance.
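One way to avoid holding the whole file in memory is to let each process stream only its own share of the lines. The sketch below assumes the round-robin split from the question; lines_for_rank is a hypothetical helper, and rank and size stand for the values an MPI library would supply. Note that it counts lines from 0, whereas the question's loop increments ln before taking the modulo.

```python
from itertools import islice


def lines_for_rank(path, rank, size):
    """Yield every size-th line of the file, starting at line number rank."""
    with open(path, "r") as f:
        # islice skips the other ranks' lines without storing them.
        for line in islice(f, rank, None, size):
            yield line.rstrip("\n")
```

Every process still reads the file from start to end, so this saves memory rather than I/O time.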

(original) Solution using pickling
The strategy for your task can go this way:
split the large file into smaller ones, making sure they are divided on line boundaries
have Python code which converts each smaller file into a list of records and saves it as a pickled file
run the Python code for all the smaller files in parallel (using Python or other means)
run integrating code, which takes the pickled files one by one, loads the list from each and appends it to the final result.
To gain anything, you have to be careful, as overhead can outweigh all possible gains from parallel runs:
as Python uses the Global Interpreter Lock (GIL), do not use threads for parallel processing, use processes. As processes cannot simply pass data around, you have to pickle the data and let the other (final integrating) part read the result from it.
try to minimize the number of loops. For this reason it is better to:
not split the large file into too many smaller parts. To use the power of your cores, match the number of parts to the number of cores (or possibly twice as many, but going higher will spend too much time on switching between processes).
create a list of items (records) and pickle the list as one item, even though pickling allows saving individual items. Pickling one list of 1000 items will be faster than pickling 1000 small items one by one.
some tasks (splitting the file, calling the conversion task in parallel) can often be done faster by existing tools in the system. If you have this option, use it.
In my small test, I created a file with 100 thousand lines with content "98-BBBBBBBBBBBBBB", "99-BBBBBBBBBBB" etc. and tested converting it to a list of numbers [...., 98, 99, ...].
For splitting I used the Linux command split, asking it to create 4 parts preserving line borders:
$ split -n l/4 long.txt
This created smaller files xaa, xab, xac, xad.
To convert each smaller file I used the following script, which converts the content into a file with extension .pickled containing the pickled list.
# chunk2pickle.py
import pickle
import sys
def process_line(line):
    return int(line.split("-", 1)[0])
def main(fname, pick_fname):
    with open(pick_fname, "wb") as fo:
        with open(fname) as f:
            pickle.dump([process_line(line) for line in f], fo)
if __name__ == "__main__":
    fname = sys.argv[1]
    pick_fname = fname + ".pickled"
    main(fname, pick_fname)
To convert one chunk of lines into a pickled list of records:
$ python chunk2pickle.py xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used the parallel tool (which has to be installed on the system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
and I found new files with extension .pickled on the disk.
-j 4 asks to run 4 processes in parallel; adjust it to your system or leave it out and it will default to the number of cores you have.
parallel can also get the list of parameters (input file names in our case) by other means, such as the ls command:
$ ls x?? |parallel -j 4 python chunk2pickle.py {}
To integrate the results, use script integrate.py:
# integrate.py
import pickle
def main(file_names):
    res = []
    for fname in file_names:
        with open(fname, "rb") as f:
            res.extend(pickle.load(f))
    return res
if __name__ == "__main__":
    file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
    # here you have the list of records you asked for
    records = main(file_names)
    print(records)
In my answer I have used a couple of external tools (split and parallel). You can do a similar task with Python too. My answer focuses only on giving you an option to keep the Python code for converting lines to the required data structures. A complete pure-Python answer is not covered here (it would get much longer and probably slower).
Solution using process Pool (no explicit pickling needed)
The following solution uses multiprocessing from Python. In this case there is no need to pickle results explicitly: multiprocessing pickles the return values itself when it passes them from the worker processes back to the parent.
# direct_integrate.py
from multiprocessing import Pool
def process_line(line):
    return int(line.split("-", 1)[0])
def process_chunkfile(fname):
    with open(fname) as f:
        return [process_line(line) for line in f]
def main(file_names, cores=4):
    p = Pool(cores)
    return p.map(process_chunkfile, file_names)
if __name__ == "__main__":
    file_names = ["xaa", "xab", "xac", "xad"]
    # here you have the list of records you asked for
    # warning: records are in groups.
    record_groups = main(file_names)
    for rec_group in record_groups:
        print(rec_group)
This updated solution still assumes that the large file is available in the form of four smaller files.
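The need for pre-split files can be avoided by computing line-aligned byte ranges inside the single large file and letting each worker read only its range. The sketch below is one possible way to do that, not part of the original answer; chunk_offsets and process_range are hypothetical names, and the line format ("98-BBBB...") is the one from the small test above.

```python
import os


def chunk_offsets(path, n_chunks):
    """Return (start, end) byte ranges that begin and end on line boundaries."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)
            f.readline()  # move forward to the start of the next line
            pos = f.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def process_range(args):
    """Convert the lines inside one byte range into a list of numbers."""
    path, start, end = args
    records = []
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            records.append(int(line.split(b"-", 1)[0]))
    return records
```

process_range takes a single tuple so that a list of (path, start, end) tuples can be handed to Pool.map in the same way as the file names in direct_integrate.py.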

Performance issue with reading integers from a binary file at specific locations

I have a file with integers stored as binary and I'm trying to extract values at specific locations. It's one big serialized integer array for which I need values at specific indexes. I've created the following code, but it's terribly slow compared to the F# version I created before.
import os, struct
def read_values(filename, indices):
    # indices are sorted and unique
    values = []
    with open(filename, 'rb') as f:
        for index in indices:
            f.seek(index*4, os.SEEK_SET)
            b = f.read(4)
            v = struct.unpack("@i", b)[0]
            values.append(v)
    return values
For comparison here is the F# version:
open System
open System.IO
let readValue (reader:BinaryReader) cellIndex =
    // set stream to correct location
    reader.BaseStream.Position <- cellIndex*4L
    match reader.ReadInt32() with
    | Int32.MinValue -> None
    | v -> Some(v)
let readValues fileName indices =
    use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    // Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
    let values = List.map (readValue reader) (List.ofSeq indices)
    values
Any tips on how to improve the performance of the Python version, e.g. by using numpy?
Update
HDF5 works very well (from 5 seconds to 0.8 seconds on my test file):
import tables
def read_values_hdf5(filename, indices):
    with tables.open_file(filename) as f:
        dset = f.root.raster
        return dset[indices]
Update 2
I went with np.memmap because the performance is similar to HDF5 and I already have numpy in production.

Depending heavily on the size of your index file, you might want to read it completely into a numpy array. If the file is not large, a complete sequential read may be faster than a large number of seeks.
One problem with the seek operations is that Python operates on buffered input. If the program were written in some lower-level language, the use of unbuffered IO would be a good idea, as you only need a few values.
import numpy as np
# read the complete index into memory
index_array = np.fromfile("my_index", dtype=np.uint32)
# look up the indices you need (indices being a list of indices)
values = index_array[indices]
If you would anyway read almost all pages (i.e. your indices are random and at a frequency of 1/1000 or more), this is probably faster. On the other hand, if you have a large index file, and you only want to pick a few indices, this is not so fast.
Then one more possibility - which might be the fastest - is to use the Python mmap module. Then the file is memory-mapped, and only the pages really required are accessed.
It should be something like this:
import mmap
import struct
with open("my_index", "rb") as f:
    memory_map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for i in indices:
        # the index at position i:
        idx_value = struct.unpack('I', memory_map[4*i:4*i+4])[0]
(Note, I did not actually test that one, so there may be typing errors. Also, I did not care about endianness, so please check it is correct.)
Happily, these can be combined by using numpy.memmap. It should keep your array on disk but give you numpyish indexing. It should be as easy as:
import numpy as np
index_arr = np.memmap(filename, dtype='uint32', mode='r')
values = index_arr[indices]
I think this should be the easiest and fastest alternative. However, if "fast" is important, please test and profile.
EDIT: As the mmap solution seems to gain some popularity, I'll add a few words about memory mapped files.
What is mmap?
Memory mapped files are not something uniquely pythonic, because memory mapping is something defined in the POSIX standard. Memory mapping is a way to use devices or files as if they were just areas in memory.
File memory mapping is a very efficient way to randomly access fixed-length data files. It uses the same technology as is used with virtual memory. The reads and writes are ordinary memory operations. If they point to a memory location which is not in the physical RAM memory ("page fault" occurs), the required file block (page) is read into memory.
The delay in random file access is mostly due to the physical rotation of the disks (SSD is another story). On average, the block you need is half a rotation away; for a typical HDD this delay is approximately 5 ms plus any data handling delay. The overhead introduced by using Python instead of a compiled language is negligible compared to this delay.
If the file is read sequentially, the operating system usually uses a read-ahead cache to buffer the file before you even know you need it. For a randomly accessed big file this does not help at all. Memory mapping provides a very efficient way, because all blocks are loaded exactly when you need and remain in the cache for further use. (This could in principle happen with fseek, as well, because it might use the same technology behind the scenes. However, there is no guarantee, and there is anyway some overhead as the call wanders through the operating system.)
mmap can also be used to write files. It is very flexible in the sense that a single memory mapped file can be shared by several processes. This may be very useful and efficient in some situations, and mmap can also be used in inter-process communication. In that case usually no file is specified for mmap, instead the memory map is created with no file behind it.
mmap is not very well-known despite its usefulness and relative ease of use. It has, however, one important 'gotcha'. The file size has to remain constant. If it changes during mmap, odd things may happen.
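To make the mmap description concrete, here is a small self-contained sketch that reads 32-bit integers at given positions from a memory-mapped file. The function name read_int32_at is an assumption, and so is the little-endian signed layout ("<i"); adjust the format string to match how the file was actually written.

```python
import mmap
import struct


def read_int32_at(path, indices):
    """Return the 32-bit integers stored at the given element positions."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ matches the read-only handle.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # unpack_from reads 4 bytes at the byte offset 4 * i without slicing.
            return [struct.unpack_from("<i", mm, 4 * i)[0] for i in indices]
```

Only the pages that contain the requested offsets are read from disk, which is the behaviour described above.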

Is the indices list sorted? I think you could get better performance if the list were sorted, as you would make far fewer disk seeks.

How can I speed up unpickling large objects if I have plenty of RAM?

It's taking me up to an hour to read a 1-gigabyte NetworkX graph data structure using cPickle (it's 1 GB when stored on disk as a binary pickle file).
Note that the file quickly loads into memory. In other words, if I run:
import cPickle as pickle
f = open("bigNetworkXGraph.pickle","rb")
binary_data = f.read() # This part doesn't take long
graph = pickle.loads(binary_data) # This takes ages
How can I speed this last operation up?
Note that I have tried pickling the data using both binary protocols (1 and 2), and it doesn't seem to make much difference which protocol I use. Also note that although I am using the "loads" (meaning "load string") function above, it is loading binary data, not ASCII data.
I have 128gb of RAM on the system I'm using, so I'm hoping that somebody will tell me how to increase some read buffer buried in the pickle implementation.

I had great success in reading a ~750 MB igraph data structure (a binary pickle file) using cPickle itself. This was achieved by simply wrapping up the pickle load call as mentioned here
Example snippet in your case would be something like:
import cPickle as pickle
import gc
f = open("bigNetworkXGraph.pickle", "rb")
# disable garbage collector
gc.disable()
graph = pickle.load(f)
# enable garbage collector again
gc.enable()
f.close()
This definitely isn't the most apt way to do it, however, it reduces the time required drastically.
(For me, it reduced from 843.04s to 41.28s, around 20x)
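The same idea can be wrapped so that the garbage collector is restored even if unpickling fails. This is a sketch for Python 3 (where cPickle is simply pickle); load_without_gc is a hypothetical helper name, and the speed-up reported above is the answerer's measurement, not something this snippet guarantees.

```python
import gc
import pickle


def load_without_gc(path):
    """Unpickle a file with the cyclic garbage collector temporarily disabled."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    finally:
        # Restore the previous state even when loading raises an exception.
        if was_enabled:
            gc.enable()
```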

You're probably bound by Python object creation/allocation overhead, not the unpickling itself.
If so, there is little you can do to speed this up, except not creating all the objects. Do you need the entire structure at once? If not, you could use lazy population of the data structure (for example: represent parts of the structure by pickled strings, then unpickle them only when they are accessed).
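The lazy population idea could look roughly like this: each part is kept as a pickled byte string and only turned into Python objects on first access. LazyPickled is a hypothetical class written for illustration; how to cut a graph into useful parts is left open.

```python
import pickle


class LazyPickled:
    """Mapping-like holder that unpickles each part only when it is first used."""

    def __init__(self, parts):
        # Keep every part as compact bytes instead of live Python objects.
        self._blobs = {
            key: pickle.dumps(value, protocol=pickle.HIGHEST_PROTOCOL)
            for key, value in parts.items()
        }
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            self._cache[key] = pickle.loads(self._blobs[key])
        return self._cache[key]

    def loaded_keys(self):
        """Return the keys whose parts have been unpickled so far."""
        return sorted(self._cache)
```

Pickling the holder itself stores the byte strings, so loading it back is cheap until individual parts are accessed.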

Why don't you try marshaling your data and storing it in RAM using memcached (for example). Yes, it has some limitations but as this points out marshaling is way faster (20 to 30 times) than pickling.
Of course, you should also spend as much time optimizing your data structure in order to minimize the amount and complexity of data you want stored.

This is ridiculous.
I have a huge ~150MB dictionary (collections.Counter actually) that I was reading and writing using cPickle in the binary format.
Writing it took about 3 min.
I stopped reading it in at the 16 min mark, with my RAM completely choked up.
I'm now using marshal, and it takes:
write: ~3s
read: ~5s
I poked around a bit, and came across this article.
Guess I've never looked at the pickle source, but it builds an entire VM to reconstruct the dictionary?
There should be a note about performance on very large objects in the documentation IMHO.
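For reference, a marshal round trip for a Counter might look like the sketch below. The helper names are assumptions. marshal only handles plain built-in types, so the Counter is converted to a dict first, and the format is specific to the Python version, which makes it unsuitable for long-term storage.

```python
import marshal
from collections import Counter


def save_counter(path, counter):
    """Write the counts with marshal, after converting to a plain dict."""
    with open(path, "wb") as f:
        marshal.dump(dict(counter), f)


def load_counter(path):
    """Read the counts back and wrap them in a Counter again."""
    with open(path, "rb") as f:
        return Counter(marshal.load(f))
```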

I'm also trying to speed up the loading/storing of networkx graphs. I'm using the adjacency_graph method to convert the graph to something serialisable, see for instance this code:
import pickle
from networkx.generators import fast_gnp_random_graph
from networkx.readwrite import json_graph
G = fast_gnp_random_graph(4000, 0.7)
with open('/tmp/graph.pickle', 'wb+') as f:
    data = json_graph.adjacency_data(G)
    pickle.dump(data, f)
with open('/tmp/graph.pickle', 'rb') as f:
    d = pickle.load(f)
    H = json_graph.adjacency_graph(d)
However, this adjacency_graph conversion method is quite slow, so time gained in pickling is probably lost on converting.
So this actually doesn't speed things up, bummer. Running this code gives the following timings:
N=1000
0.666s ~ generating
0.790s ~ converting
0.237s ~ storing
0.295s ~ loading
1.152s ~ converting
N=2000
2.761s ~ generating
3.282s ~ converting
1.068s ~ storing
1.105s ~ loading
4.941s ~ converting
N=3000
6.377s ~ generating
7.644s ~ converting
2.464s ~ storing
2.393s ~ loading
12.219s ~ converting
N=4000
12.458s ~ generating
19.025s ~ converting
8.825s ~ storing
8.921s ~ loading
27.601s ~ converting
This growth is faster than linear, probably because the number of edges grows quadratically with the number of nodes at a fixed edge probability. Here is a test gist, in case you want to try it yourself:
https://gist.github.com/wires/5918834712a64297d7d1

Maybe the best thing you can do is to split the big data into smaller objects, say smaller than 50 MB each, so that they fit in RAM, and then recombine them.
AFAIK there is no way to split data automatically with the pickle module, so you have to do it yourself.
Another (considerably harder) way is to use a NoSQL database such as MongoDB to store your data...

In general, I've found that if possible, when saving large objects to disk in python, it's much more efficient to use numpy ndarrays or scipy.sparse matrices.
Thus for huge graphs like the one in the example, I could convert the graph to a scipy sparse matrix (networkx has a function that does this, and it's not hard to write one), and then save that sparse matrix in binary format.

Why don't you use pickle.load?
f = open('fname', 'rb')
graph = pickle.load(f)
