How to split large data and rejoin later

How to split large data and rejoin later - python

My code generates a list of numpy arrays of size (1, 1, n, n, m, m) where n may vary from 50-100 and m from 5-10 depending on the case at hand. The length of the list itself may go up to 10,000 and is being written/dumped using pickle at the end of the code. For cases at the higher end of these numbers or when file sizes go beyond 5-6 GB, I get Out of Memory error. Below is a made up example of the situation,
import numpy as np
list, list_length = [], 1000
n = 100
m = 3
for i in range(0, list_length):
list.append(np.random.random((1, 1, n, n, m, m)))
file_path = 'C:/Users/Desktop/Temp/'
with open(file_path, 'wb') as file:
pickle.dump(list, file)
I am looking for a way that helps me to
split the data so that I can get rid of memory error, and
rejoin the data in the original form when needed later
All I could think is:
for i in range(0, list_length):
data = np.random.random((1, 1, n, n, m, m))
file_path = 'C:/Users/Desktop/Temp/'+str(i)
with open(file_path, 'wb') as file:
pickle.dump(data, file)
and then combine using:
combined_list = []
for i in range(0, list_length):
file_path = 'C:/Users/Desktop/Temp/single' + str(i)
with open(file_path, 'rb') as file:
data = pickle.load(file)
combined_list.append(data)
Using this way, the file size certainly reduces due to multiple files, but that also increases processing time due to multiple file I/O operations.
Is there a more elegant and better way to do this?

Using savez, savez_compressed, or even things like h5py can be useful as #tel mentioned, but that takes extra effort trying to do "reinvent" caching mechanism. There are two easier ways to process larger-than-memory ndarray if applicable:
The easiest way is of course enable pagefile (or some other name) on Windows or swap on Linux (not sure about OS X counter part). This creates a virtually large enough memory so that you don't need to worry about memory at all. It will save to disk/load from disk accordingly
If the first way is not applicable due to not have admin rights or etc, numpy provides another way: np.memmap. This function maps an ndarray to disk such that you can index it just like it is in memory. Technically IO is done directly to the hard disk but OS will cache accordingly
For the second way, you can create a hard-disk side ndarray using:
np.memmap('yourFileName', 'float32', 'w+', 0, 2**32)
This creates a 16GB float32 array within no time (containing 4G numbers). You can then do IO to it. A lot of functions have an out parameter. You can set the out parameter accordingly so that the output is not "copied" to the disk from memory
If you want to save a list of ndarrays using the second method, either create a lot of memmaps, or concat them to a single array

Don't use pickle to store large data, it's not an efficient way to serialize anything. Instead, use the built-in numpy serialization formats/functions via the numpy.savez_compressed and numpy.load functions.
System memory isn't infinite, so at some point you'll still need to split your files (or use a heavier duty solution such as the one provided by the h5py package). However, if you were able to fit the original list into memory then savez_compressed and load should do what you need.

Related

Converting CSV to numpy NPY efficiently

How to convert a .csv file to .npy efficently?
I've tried:
import numpy as np
filename = "myfile.csv"
vec =np.loadtxt(filename, delimiter=",")
np.save(f"{filename}.npy", vec)
While the above works for smallish file, the actual .csv file I'm working on has ~12 million lines with 1024 columns, it takes quite a lot to load everything into RAM before converting into an .npy format.
Q (Part 1): Is there some way to load/convert a .csv to .npy efficiently for large CSV file?
The above code snippet is similar to the answer from Convert CSV to numpy but that won't work for ~12M x 1024 matrix.
Q (Part 2): If there isn't any way to to load/convert a .csv to .npy efficiently, is there some way to iteratively read the .csv file into .npy efficiently?
Also, there's an answer here https://stackoverflow.com/a/53558856/610569 to save the csv file as numpy array iteratively. But seems like the np.vstack isn't the best solution when reading the file. The accepted answer there suggests hdf5 but the format is not the main objective of this question and the hdf5 format isn't desired in my use-case since I've to read it back into a numpy array afterwards.
Q (Part 3): If part 1 and part2 are not possible, are there other efficient storage (e.g. tensorstore) that can store and efficiently convert to numpy array when loading the saved storage format?
There is another library tensorstore that seems to efficiently handles arrays which support conversion to numpy array when read, https://google.github.io/tensorstore/python/tutorial.html. But somehow there isn't any information on how to save the tensor/array without the exact dimensions, all of the examples seem to include configurations like 'dimensions': [1000, 20000],.
Unlike the HDF5, the tensorstore doesn't seem to have reading overhead issues when converting to numpy, from docs:
Conversion to an numpy.ndarray also implicitly performs a synchronous read (which hits the in-memory cache since the same region was just retrieved)

Nice question; Informative in itself.
I understand you want to have the whole data set/array in memory, eventually, as a NumPy array. I assume, then, you have enough (RAM) memory to host such array -- 12M x 1K.
I don't specifically know about how np.loadtxt (genfromtxt) is operating behind the scenes, so I will tell you how I would do (after trying like you did).
Reasoning about memory...
Notice that a simple boolean array will cost ~12 GBytes of memory:
>>> print("{:.1E} bytes".format(
np.array([True]).itemsize * 12E6 * 1024
))
1.2E+10 bytes
And this is for a Boolean data type. Most likely, you have -- what -- a dataset of Integer, Float? The size may increase quite significantly:
>>> np.array([1], dtype=bool).itemsize
1
>>> np.array([1], dtype=int).itemsize
8
>>> np.array([1], dtype=float).itemsize
8
It's a lot of memory (which you know, just want to emphasize).
At this point, I would like to point out a possible swapping of the working memory. You may have enough physical (RAM) memory in your machine, but if not enough of free memory, your system will use the swap memory (i.e, disk) to keep your system stable & have the work done. The cost you pay is clear: read/writing from/to the disk is very slow.
My point so far is: check the data type of your dataset, estimate the size of your future array, and guarantee you have that minimum amount of RAM memory available.
I/O text
Considering you do have all the (RAM) memory necessary to host the whole numpy array: I would then loop over the whole (~12M lines) text file, filling the pre-existing array row-by-row.
More precisely, I would have the (big) array already instantiated before start reading the file. Only then, I would read each line, split the columns, and give it to np.asarray and assign those (1024) values to each respective row of the output array.
The looping over the file is slow, yes. The thing here is that you limit (and control) the amount of memory being used. Roughly speaking, the big objects consuming your memory are the "output" (big) array, and the "line" (1024) array. Sure, there are quite a considerable amount of memory being consumed in each loop in the temporary objects during reading (text!) values, splitting into list elements and casting to an array. Still, it's something that will remain largely constant during the whole ~12M lines.
So, the steps I would go through are:
0) estimate and guarantee enough RAM memory available
1) instantiate (np.empty or np.zeros) the "output" array
2) loop over "input.txt" file, create a 1D array from each line "i"
3) assign the line values/array to row "i" of "output" array
Sure enough, you can even make it parallel: If on one hand text files cannot be randomly (r/w) accessed, on the other hand you can easily split them (see How can I split one text file into multiple *.txt files?) to have -- if fun is at the table -- them read in parallel, if that time if critical.
Hope that helps.

TL;DR
Export to a different function other than .npy seems inevitable unless your machine is able to handle the size of the data in-memory as per described in #Brandt answer.
Reading the data, then processing it (Kinda answering Q part 2)
To handle data size larger than what the RAM can handle, one would often resort to libraries that performs "out-of-core" computation, e.g. turicreate.SFrame, vaex or dask . These libraries would be able to lazily load the .csv files into dataframes and process them by chunks when evaluated.
from turicreate import SFrame
filename = "myfile.csv"
sf = SFrame.read_csv(filename)
sf.apply(...) # Trying to process the data
or
import vaex
filename = "myfile.csv"
df = vaex.from_csv(filename,
convert=True,
chunk_size=50_000_000)
df.apply(...)
Converting the read data into numpy array (kinda answering Q part 1)
While out-of-core libraries can read and process the data efficiently, converting into numpy is an "in-memory" operation, the machine needs to have enough RAM to fit all data.
The turicreate.SFrame.to_numpy documentation writes:
Converts this SFrame to a numpy array
This operation will construct a numpy array in memory. Care must be taken when size of the returned object is big.
And the vaex documentation writes:
In-memory data representations
One can construct a Vaex DataFrame from a variety of in-memory data representations.
And dask best practices actually reimplemented their own array objects that are simpler than numpy array, see https://docs.dask.org/en/stable/array-best-practices.html. But when going through the docs, it seems like the format they have saved the dask array in are not .npy but various other formats.
Writing the file into non-.npy versions (answering Q Part 3)
Given the numpy arrays are inevitably in-memory, trying to save the data into one single .npy isn't the most viable option.
Different libraries seems to have different solutions for storage. E.g.
vaex saves the data into hdf5 by default if the convert=True argument is set when data is read through vaex.from_csv()
sframe saves the data into their own binary format
dask export functions save to_hdf() and to_parquet() format

It it's latest version (4.14) vaex support "streaming", i.e. lazy loading of CSV files. It uses pyarrow under the hood so it is supper fast. Try something like
df = vaex.open(my_file.csv)
# or
df = vaex.from_csv_arrow(my_file.csv, lazy=True)
Then you can export to bunch of formats as needed, or keep working with it like that (it is surprisingly fast). Of course, it is better to convert to some kind of binary format..

import numpy as np
import pandas as pd
# Define the input and output file names
csv_file = 'data.csv'
npy_file = 'data.npy'
# Create dummy data
data = np.random.rand(10000, 100)
df = pd.DataFrame(data)
df.to_csv(csv_file, index=False)
# Define the chunk size
chunk_size = 1000
# Read the header row and get the number of columns
header = pd.read_csv(csv_file, nrows=0)
num_cols = len(header.columns)
# Initialize an empty array to store the data
data = np.empty((0, num_cols))
# Loop over the chunks of the csv file
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
# Convert the chunk to a numpy array
chunk_array = chunk.to_numpy()
# Append the chunk to the data array
data = np.append(data, chunk_array, axis=0)
np.save(npy_file, data)
# Load the npy file and check the shape
npy_data = np.load(npy_file)
print('Shape of data before conversion:', data.shape)
print('Shape of data after conversion:', npy_data.shape)```

I'm not aware of any existing function or utility that directly and efficiently converts csv files into npy files. With efficient I guess primarily meaning with low memory requirements.
Writing a npy file iteratively is indeed possible, with some extra effort. There's already a question on SO that addresses this, see:
save numpy array in append mode
For example using the NpyAppendArray class from Michael's answer you can do:
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
for line in csv:
row = np.fromstring(line, sep=',')
npy.append(row[np.newaxis, :])
The NpyAppendArray class updates the npy file header on every call to append, which is a bit much for your 12M rows. Maybe you could update the class to (optionally) only write the header on close. Or you could easily batch the writes:
batch_lines = 128
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
done = False
while not done:
batch = []
for count, line in enumerate(csv):
row = np.fromstring(line, sep=',')
batch.append(row)
if count + 1 >= batch_lines:
break
else:
done = True
npy.append(np.array(batch))
(code is not tested)

Python hangs silently on large file write

I am trying to write a big list of numpy nd_arrays to disk.
The list is ~50000 elements long
Each element is a nd_array of size (~2048,2) of ints. The arrays have different shapes.
The method I am (curently) using is
#staticmethod
def _write_with_yaml(path, obj):
with io.open(path, 'w+', encoding='utf8') as outfile:
yaml.dump(obj, outfile, default_flow_style=False, allow_unicode=True)
I have also tried pickle which also give the same problem:
On small lists (~3400 long), this works fine, finishes fast enough (<30 sec).
On ~6000 long lists, this finishes after ~2 minutes.
When the list gets larger, the process seems not to do anything. No change in RAM or disk activity.
I stopped waiting after 30 minutes.
After force stopping the process, the file suddenly became of significant size (~600MB).
I can't know if it finished writing or not.
What is the correct way to write such large lists, know if he write succeeded, and, if possible, knowing when the write/read is going to finish?
How can I debug what's happening when the process seems to hang?
I prefer not to break and assemble the lists manually in my code, I expect the serialization libraries to be able to do that for me.

For the code
import numpy as np
import yaml
x = []
for i in range(0,50000):
x.append(np.random.rand(2048,2))
print("Arrays generated")
with open("t.yaml", 'w+', encoding='utf8') as outfile:
yaml.dump(x, outfile, default_flow_style=False, allow_unicode=True)
on my system (MacOSX, i7, 16 GiB RAM, SSD) with Python 3.7 and PyYAML 3.13 the finish time is 61min. During the save the python process occupied around 5 GBytes of memory and final file size is 2 GBytes. This also shows the overhead of the file format: as the size of the data is 50k * 2048 * 2 * 8 (the size of a float is generally 64 bits in python) = 1562 MBytes, means yaml is around 1.3 times worse (and serialisation/deserialisation is also taking time).
To answer your questions:
There is no correct or incorrect way. To have a progress update and
estimation of finishing time is not easy (ex: other tasks might
interfere with the estimation, resources like memory could be used
up, etc.). You can rely on a library that supports that or implement
something yourself (as the other answer suggested)
Not sure "debug" is the correct term, as in practice it might be that the process just slow. Doing a performance analysis is not easy, especially if
using multiple/different libraries. What I would start with is clear
requirements: what do you want from the file saved? Do they need to
be yaml? Saving 50k arrays as yaml does not seem the best solution
if you care about performance. Should you ask yourself first "which is the best format for what I want?" (but you did not give details so can't say...)
Edit: if you want something just fast, use pickle. The code:
import numpy as np
import yaml
import pickle
x = []
for i in range(0,50000):
x.append(np.random.rand(2048,2))
print("Arrays generated")
pickle.dump( x, open( "t.yaml", "wb" ) )
finishes in 9 seconds, and generates a file of 1.5GBytes (no overhead). Of course pickle format should be used in very different circumstances than yaml...

I cant say this is the answer, but it may be it.
When I was working on app that required fast cycles, I found out that something in the code is very slow. It was opening / closing yaml files.
It was solved by using JSON.
Dont use YAML for anything else than as some kind of config you dont open often.
Solution to your array saving:
np.save(path,array) # path = path+name+'.npy'
If you really need to save a list of arrays, I recommend you to save list with array paths(array themselfs you will save on disk with np.save). Saving python objects on disk is not really what you want. What you want is to save numpy arrays with np.save
Complete solution(Saving example):
for array_index in range(len(list_of_arrays)):
np.save(array_index+'.npy',list_of_arrays[array_index])
# path = array_index+'.npy'
Complete solution(Loading example):
list_of_array_paths = ['1.npy','2.npy']
list_of_arrays = []
for array_path in list_of_array_paths:
list_of_arrays.append(np.load(array_path))
Further advice:
Python cant really handle large arrays. Moreover if you have loaded several of them in the list. From the point of speed and memory, always work with one,two arrays at a time. The rest must be waiting on the disk. So instead of object reference, have reference as a path and when needed, load it from disk.
Also, you said you dont want to assemble the list manually.
Possible solution, which I dont advice, but is possibly exactly what you are looking for
>>> a = np.zeros(shape = [10,5,3])
>>> b = np.zeros(shape = [7,7,9])
>>> c = [a,b]
>>> np.save('data.npy',c)
>>> d = np.load('data.npy')
>>> d.shape
(2,)
>>> type(d)
<type 'numpy.ndarray'>
>>> d.shape
(2,)
>>> d[0].shape
(10, 5, 3)
>>>
I believe I dont need to comment above mentioned code. However, after loading back, you will lose list as the list will be transformed into numpy array.

How to save very large ndarray to disk?

How to save very large ndarray to disk?
Please note, that any solution, including duplicating of data is not acceptable.
For example, this code
x = pandas.HDFStore("some_file.hdf")
x.append("a", pandas.DataFrame(a))
includes
pandas.DataFrame(a)
which unapprovable duplicates memory usage.
Obvious code
pickle.dump(a, f)
hangs.

It looks like numpy's save function can handle large arrays.
from pylab import *
q = randn(1000, 1000, 1000)
print('{} G'.format(q.nbytes/1024**3))
np.save(open('test_large_array_save.dat', 'wb'), q, allow_pickle=False)
Result
7.450580596923828 G
and a 7.5 G file created on disk.
Monitoring python's memory usage shows that it doesn't increases notably during the save, so a copy isn't being created.

Performance issue with reading integers from a binary file at specific locations

I have a file with integers stored as binary and I'm trying to extract values at specific locations. It's one big serialized integer array for which I need values at specific indexes. I've created the following code but its terribly slow compared to the F# version I created before.
import os, struct
def read_values(filename, indices):
# indices are sorted and unique
values = []
with open(filename, 'rb') as f:
for index in indices:
f.seek(index*4L, os.SEEK_SET)
b = f.read(4)
v = struct.unpack("#i", b)[0]
values.append(v)
return values
For comparison here is the F# version:
open System
open System.IO
let readValue (reader:BinaryReader) cellIndex =
// set stream to correct location
reader.BaseStream.Position <- cellIndex*4L
match reader.ReadInt32() with
| Int32.MinValue -> None
| v -> Some(v)
let readValues fileName indices =
use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
// Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
let values = List.map (readValue reader) (List.ofSeq indices)
values
Any tips on how to improve the performance of the python version, e.g. by usage of numpy ?
Update
Hdf5 works very good (from 5 seconds to 0.8 seconds on my test file):
import tables
def read_values_hdf5(filename, indices):
values = []
with tables.open_file(filename) as f:
dset = f.root.raster
return dset[indices]
Update 2
I went with the np.memmap because the performance is similar to hdf5 and I already have numpy in production.

Heavily depending on your index file size you might want to read it completely into a numpy array. If the file is not large, complete sequential read may be faster than a large number of seeks.
One problem with the seek operations is that python operates on buffered input. If the program was written in some lower level language, the use on unbuffered IO would be a good idea, as you only need a few values.
import numpy as np
# read the complete index into memory
index_array = np.fromfile("my_index", dtype=np.uint32)
# look up the indices you need (indices being a list of indices)
return index_array[indices]
If you would anyway read almost all pages (i.e. your indices are random and at a frequency of 1/1000 or more), this is probably faster. On the other hand, if you have a large index file, and you only want to pick a few indices, this is not so fast.
Then one more possibility - which might be the fastest - is to use the python mmap module. Then the file is memory-mapped, and only the pages really required are accessed.
It should be something like this:
import mmap
with open("my_index", "rb") as f:
memory_map = mmap.mmap(mmap.mmap(f.fileno(), 0)
for i in indices:
# the index at position i:
idx_value = struct.unpack('I', memory_map[4*i:4*i+4])
(Note, I did not actually test that one, so there may be typing errors. Also, I did not care about endianess, so please check it is correct.)
Happily, these can be combined by using numpy.memmap. It should keep your array on disk but give you numpyish indexing. It should be as easy as:
import numpy as np
index_arr = np.memmap(filename, dtype='uint32', mode='rb')
return index_arr[indices]
I think this should be the easiest and fastest alternative. However, if "fast" is important, please test and profile.
EDIT: As the mmap solution seems to gain some popularity, I'll add a few words about memory mapped files.
What is mmap?
Memory mapped files are not something uniquely pythonic, because memory mapping is something defined in the POSIX standard. Memory mapping is a way to use devices or files as if they were just areas in memory.
File memory mapping is a very efficient way to randomly access fixed-length data files. It uses the same technology as is used with virtual memory. The reads and writes are ordinary memory operations. If they point to a memory location which is not in the physical RAM memory ("page fault" occurs), the required file block (page) is read into memory.
The delay in random file access is mostly due to the physical rotation of the disks (SSD is another story). In average, the block you need is half a rotation away; for a typical HDD this delay is approximately 5 ms plus any data handling delay. The overhead introduced by using python instead of a compiled language is negligible compared to this delay.
If the file is read sequentially, the operating system usually uses a read-ahead cache to buffer the file before you even know you need it. For a randomly accessed big file this does not help at all. Memory mapping provides a very efficient way, because all blocks are loaded exactly when you need and remain in the cache for further use. (This could in principle happen with fseek, as well, because it might use the same technology behind the scenes. However, there is no guarantee, and there is anyway some overhead as the call wanders through the operating system.)
mmap can also be used to write files. It is very flexible in the sense that a single memory mapped file can be shared by several processes. This may be very useful and efficient in some situations, and mmap can also be used in inter-process communication. In that case usually no file is specified for mmap, instead the memory map is created with no file behind it.
mmap is not very well-known despite its usefulness and relative ease of use. It has, however, one important 'gotcha'. The file size has to remain constant. If it changes during mmap, odd things may happen.

Is the indices list sorted? i think you could get better performance if the list would be sorted, as you would make a lot less disk seeks

Large, sparse list of lists giving MemoryError when calling np.array(data)

I have a large matrix of 0s and 1s, that is mostly 0s. It is initially stored as a list of 25 thousand other lists, each of which are about 2000 ints long.
I am trying to put these into a numpy array, which is what another piece of my program takes. So I run training_data = np.array(data), but this returns a MemoryError
Why is this happening? I'm assuming it is too much memory for the program to handle (which is surprising to me..), but if so, is there a better way of doing this?

A (short) integer takes two bytes to store. You want 25,000 lists, each with 2,000 integers; that gives
25000*2000*2/1000000 = 100 MB
This works fine on my computer (4GB RAM):
>>> import numpy as np
>>> x = np.zeros((25000,2000),dtype=int)
Are you able to instantiate the above matrix of zeros?
Are you reading the file into a Python list of lists and then converting that to a numpy array? That's a bad idea; it will at least double the memory requirements. What is the file format of your data?
For sparse matrices scipy.sparse provides various alternative datatypes which will be much more efficient.
EDIT: responding to the OP's comment.
I have 25000 instances of some other class, each of which returns a list of length about 2000. I want to put all of these lists returned into the np.array.
Well, you're somehow going over 8GB! To solve this, don't do all this manipulation in memory. Write the data to disk a class at a time, then delete the instances and read in the file from numpy.
First do
with open(..., "wb") as f:
f = csv.writer(f)
for instance in instances:
f.writerow(instance.data)
This will write all your data into a large-ish CSV file. Then, you can just use np.loadtxt:
numpy.loadtxt(open(..., "rb"), delimiter=",")

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to split large data and rejoin later - python

Related

Converting CSV to numpy NPY efficiently

Python hangs silently on large file write

How to save very large ndarray to disk?

Performance issue with reading integers from a binary file at specific locations

Large, sparse list of lists giving MemoryError when calling np.array(data)

Categories

Resources