I have 3000 binary files (each of size 40[MB]) of known format (5,000,000 'records' of 'int32,float32' each). they were created using numpy tofile() method.
A method that I use, WhichShouldBeUpdated(), determines which file (out of the 3000) should be updated, and also, which records in this file should be changed. The method's output is the following:
(1) path_to_file_name_to_update
(2) a numpy record array with N records (N is the number of records to update), in the following format: [(recordID1, newIntValue1, newFloatValue1), (recordID2, newIntValue2, newFloatValue2), .....]
As can be seen:
(1) the file to update is known only at running time
(2) the records to update are also only known at running time
what would be the most efficient approach to updating the file with the new values for the records?
Since the records are of fixed length you can just open the file and seek to the position, which is a multiple of the record size and record offset. To encode the ints and floats as binary you can use struct.pack. Update: Given that the files are originally generated by numpy, the fastest way may be numpy.memmap.
You're probably not interested in data conversion, but I've had very good experiences with HDF5 and pytables for large binary files. HDF5 is designed for large scientific data sets, so it is quick and efficient.
Related
How to convert a .csv file to .npy efficently?
I've tried:
import numpy as np
filename = "myfile.csv"
vec =np.loadtxt(filename, delimiter=",")
np.save(f"{filename}.npy", vec)
While the above works for smallish file, the actual .csv file I'm working on has ~12 million lines with 1024 columns, it takes quite a lot to load everything into RAM before converting into an .npy format.
Q (Part 1): Is there some way to load/convert a .csv to .npy efficiently for large CSV file?
The above code snippet is similar to the answer from Convert CSV to numpy but that won't work for ~12M x 1024 matrix.
Q (Part 2): If there isn't any way to to load/convert a .csv to .npy efficiently, is there some way to iteratively read the .csv file into .npy efficiently?
Also, there's an answer here https://stackoverflow.com/a/53558856/610569 to save the csv file as numpy array iteratively. But seems like the np.vstack isn't the best solution when reading the file. The accepted answer there suggests hdf5 but the format is not the main objective of this question and the hdf5 format isn't desired in my use-case since I've to read it back into a numpy array afterwards.
Q (Part 3): If part 1 and part2 are not possible, are there other efficient storage (e.g. tensorstore) that can store and efficiently convert to numpy array when loading the saved storage format?
There is another library tensorstore that seems to efficiently handles arrays which support conversion to numpy array when read, https://google.github.io/tensorstore/python/tutorial.html. But somehow there isn't any information on how to save the tensor/array without the exact dimensions, all of the examples seem to include configurations like 'dimensions': [1000, 20000],.
Unlike the HDF5, the tensorstore doesn't seem to have reading overhead issues when converting to numpy, from docs:
Conversion to an numpy.ndarray also implicitly performs a synchronous read (which hits the in-memory cache since the same region was just retrieved)
Nice question; Informative in itself.
I understand you want to have the whole data set/array in memory, eventually, as a NumPy array. I assume, then, you have enough (RAM) memory to host such array -- 12M x 1K.
I don't specifically know about how np.loadtxt (genfromtxt) is operating behind the scenes, so I will tell you how I would do (after trying like you did).
Reasoning about memory...
Notice that a simple boolean array will cost ~12 GBytes of memory:
>>> print("{:.1E} bytes".format(
np.array([True]).itemsize * 12E6 * 1024
))
1.2E+10 bytes
And this is for a Boolean data type. Most likely, you have -- what -- a dataset of Integer, Float? The size may increase quite significantly:
>>> np.array([1], dtype=bool).itemsize
1
>>> np.array([1], dtype=int).itemsize
8
>>> np.array([1], dtype=float).itemsize
8
It's a lot of memory (which you know, just want to emphasize).
At this point, I would like to point out a possible swapping of the working memory. You may have enough physical (RAM) memory in your machine, but if not enough of free memory, your system will use the swap memory (i.e, disk) to keep your system stable & have the work done. The cost you pay is clear: read/writing from/to the disk is very slow.
My point so far is: check the data type of your dataset, estimate the size of your future array, and guarantee you have that minimum amount of RAM memory available.
I/O text
Considering you do have all the (RAM) memory necessary to host the whole numpy array: I would then loop over the whole (~12M lines) text file, filling the pre-existing array row-by-row.
More precisely, I would have the (big) array already instantiated before start reading the file. Only then, I would read each line, split the columns, and give it to np.asarray and assign those (1024) values to each respective row of the output array.
The looping over the file is slow, yes. The thing here is that you limit (and control) the amount of memory being used. Roughly speaking, the big objects consuming your memory are the "output" (big) array, and the "line" (1024) array. Sure, there are quite a considerable amount of memory being consumed in each loop in the temporary objects during reading (text!) values, splitting into list elements and casting to an array. Still, it's something that will remain largely constant during the whole ~12M lines.
So, the steps I would go through are:
0) estimate and guarantee enough RAM memory available
1) instantiate (np.empty or np.zeros) the "output" array
2) loop over "input.txt" file, create a 1D array from each line "i"
3) assign the line values/array to row "i" of "output" array
Sure enough, you can even make it parallel: If on one hand text files cannot be randomly (r/w) accessed, on the other hand you can easily split them (see How can I split one text file into multiple *.txt files?) to have -- if fun is at the table -- them read in parallel, if that time if critical.
Hope that helps.
TL;DR
Export to a different function other than .npy seems inevitable unless your machine is able to handle the size of the data in-memory as per described in #Brandt answer.
Reading the data, then processing it (Kinda answering Q part 2)
To handle data size larger than what the RAM can handle, one would often resort to libraries that performs "out-of-core" computation, e.g. turicreate.SFrame, vaex or dask . These libraries would be able to lazily load the .csv files into dataframes and process them by chunks when evaluated.
from turicreate import SFrame
filename = "myfile.csv"
sf = SFrame.read_csv(filename)
sf.apply(...) # Trying to process the data
or
import vaex
filename = "myfile.csv"
df = vaex.from_csv(filename,
convert=True,
chunk_size=50_000_000)
df.apply(...)
Converting the read data into numpy array (kinda answering Q part 1)
While out-of-core libraries can read and process the data efficiently, converting into numpy is an "in-memory" operation, the machine needs to have enough RAM to fit all data.
The turicreate.SFrame.to_numpy documentation writes:
Converts this SFrame to a numpy array
This operation will construct a numpy array in memory. Care must be taken when size of the returned object is big.
And the vaex documentation writes:
In-memory data representations
One can construct a Vaex DataFrame from a variety of in-memory data representations.
And dask best practices actually reimplemented their own array objects that are simpler than numpy array, see https://docs.dask.org/en/stable/array-best-practices.html. But when going through the docs, it seems like the format they have saved the dask array in are not .npy but various other formats.
Writing the file into non-.npy versions (answering Q Part 3)
Given the numpy arrays are inevitably in-memory, trying to save the data into one single .npy isn't the most viable option.
Different libraries seems to have different solutions for storage. E.g.
vaex saves the data into hdf5 by default if the convert=True argument is set when data is read through vaex.from_csv()
sframe saves the data into their own binary format
dask export functions save to_hdf() and to_parquet() format
It it's latest version (4.14) vaex support "streaming", i.e. lazy loading of CSV files. It uses pyarrow under the hood so it is supper fast. Try something like
df = vaex.open(my_file.csv)
# or
df = vaex.from_csv_arrow(my_file.csv, lazy=True)
Then you can export to bunch of formats as needed, or keep working with it like that (it is surprisingly fast). Of course, it is better to convert to some kind of binary format..
import numpy as np
import pandas as pd
# Define the input and output file names
csv_file = 'data.csv'
npy_file = 'data.npy'
# Create dummy data
data = np.random.rand(10000, 100)
df = pd.DataFrame(data)
df.to_csv(csv_file, index=False)
# Define the chunk size
chunk_size = 1000
# Read the header row and get the number of columns
header = pd.read_csv(csv_file, nrows=0)
num_cols = len(header.columns)
# Initialize an empty array to store the data
data = np.empty((0, num_cols))
# Loop over the chunks of the csv file
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
# Convert the chunk to a numpy array
chunk_array = chunk.to_numpy()
# Append the chunk to the data array
data = np.append(data, chunk_array, axis=0)
np.save(npy_file, data)
# Load the npy file and check the shape
npy_data = np.load(npy_file)
print('Shape of data before conversion:', data.shape)
print('Shape of data after conversion:', npy_data.shape)```
I'm not aware of any existing function or utility that directly and efficiently converts csv files into npy files. With efficient I guess primarily meaning with low memory requirements.
Writing a npy file iteratively is indeed possible, with some extra effort. There's already a question on SO that addresses this, see:
save numpy array in append mode
For example using the NpyAppendArray class from Michael's answer you can do:
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
for line in csv:
row = np.fromstring(line, sep=',')
npy.append(row[np.newaxis, :])
The NpyAppendArray class updates the npy file header on every call to append, which is a bit much for your 12M rows. Maybe you could update the class to (optionally) only write the header on close. Or you could easily batch the writes:
batch_lines = 128
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
done = False
while not done:
batch = []
for count, line in enumerate(csv):
row = np.fromstring(line, sep=',')
batch.append(row)
if count + 1 >= batch_lines:
break
else:
done = True
npy.append(np.array(batch))
(code is not tested)
I have been facing the following issue: I have to loop over num_objects = 897 objects, for every single one of which I have to use num_files = 2120 h5 files. The files are very large, each being 1.48 GB and the content of interest to me is 3 arrays of floats of size 256 x 256 x 256 contained in every file (v1,v2 and v3). This is to say, the loop looks like:
for i in range(num_objects):
...
for j in range(num_files):
some operation with the three 256 x 256 x 256 arrays in each file
and my current way of loading them is by doing the following in the innermost loop:
f = h5py.File('output_'+str(q)+'.h5','r')
key1 = np.array(f['key1'])
v1=key1[:,:,:,0]
v2=key2[:,:,:,1]
v3=key3[:,:,:,2]
The above option of loading the files every time for every object is obviously very slow. On the other hand, loading all files at once and importing them in a dictionary results in excessive use of memory and my job gets killed. Some diagnostics:
The above way takes 0.48 seconds per file, per object, resulting in a total of 10.5 days (!) spent only on this operation.
I tried exporting key1 to npz files, but it was actually slower by 0.7 seconds per file.
I exported v1,v2 and v3 individually for every file to npz files (i.e. 3 npz files for each h5 file), but that saved me only 1.5 day in total.
Does anyone have some other idea/suggestion I could try to be fast and at the same time not limited by excessive memory usage?
If I understand, you have 2120 .h5 files. Do you only read the 3 arrays in dataset f['key1'] for each file? (or are there other datasets?) If you only/always read f['key1'], that's a bottleneck you can't program around. Using a SSD will help (because I/O is faster than HDD). Otherwise, you will have to reorganize your data. The amount of RAM on your system will determine the number of arrays you can read simultaneously. How much RAM do you have?
You might gain a little speed with a small code change. v1=key1[:,:,:,0] returns v1 as an array (same for v2 and v3). There's no need to read dataset f['key1'] into an array. Doing that doubles your memory footprint. (BTW, Is there a reason to convert your arrays to a dictionary?)
The process below only creates 3 arrays by slicing v1,v2,v3 from the h5py f['key1']object. It will reduce your memory footprint for each loop by 50%.
f = h5py.File('output_'+str(q)+'.h5','r')
key1 = f['key1']
## key1 is returned as a h5py dataset OBJECT, not an array
v1=key1[:,:,:,0]
v2=key2[:,:,:,1]
v3=key3[:,:,:,2]
On the HDF5 side, since you always slice out the last axis, your chunk parameters might improve I/O. However, if you want to change the chunk shape, you will have to recreate the .h5 files. So, that will probably not save time (at least in the short-term).
Agreed with #kcw78, in my experience the bottleneck is usually that you load too much data compared to what you need. So don't convert the dataset to an array unless you need to, and slice only the part you need (do you need the whole [:,:,0] or just a subset of it?).
Also, if you can, change the order of the loops, so that you open each file only once.
for j in range(num_files):
...
for i in range(num_objects):
some operation with the three 256 x 256 x 256 arrays in each file
Another approach would be to use an external tool (like h5copy) to extract the data you need in a much smaller dataset, and then read it in python, to avoid python's overhead (which might not be that much to be honest).
Finally, you could use multiprocessing to take advantage of you CPU cores, as long as you tasks are relatively independent from one another. You have an example here.
h5py is the data representation class the supports most of NumPy data types and supports traditional NumPy operations like slicing, and a variety of descriptive attributes such as shape and size attributes.
There is a cool option that you can use for representing datasets in chunks using the create_dataset method with the chunked keyword enabled:
dset = f.create_dataset("chunked", (1000, 1000), chunks=(100, 100))
The benefit of this is that you can easily resize your dataset and now as in your case, you would only have to read a chunk and not the whole 1.4 GB of data.
But beware of chunking implications: different data sizes work best of different chunk sizes. There is an option of autochunk that selects this chunk size automatically rather than through hit and trial:
dset = f.create_dataset("autochunk", (1000, 1000), chunks=True)
Cheers
I'm reading data from a large text file (a VCF) into a zarr array. The overall flow of the code is
with zarr.LMDBStore(...) as store:
array = zarr.create(..., chunks=(1000,1000), store=store, ...)
for line_num, line in enumerate(text_file):
array[line_num, :] = process_data(line)
I'm wondering - when does zarr compress the modified chunks of the array and push them to the underlying store (in this case LMDB)? Does it do that every time a chunk is updated (i.e. each line)? Or does it wait till a chunk is filled/evicted from memory before doing that? Assuming that I need to process each line separately in a for loop (that there aren't efficient array operations to use here due to the nature of the data and processing), is there any optimization I should do here with regards to how I feed the data into Zarr?
I just don't want Zarr running compression on each modified chunk every line when each chunk will be modified 1000 times before being complete and ready to save to disk.
Thanks!
Every time you execute this line:
array[line_num, :] = process_data(line)
...zarr will (1) figure out which chunks overlap the array region you want to write to, (2) retrieve those chunks from the store, (3) decompress the chunks, (4) modify the data, (5) compress the modified chunks, (6) write the modified compressed chunks to the store.
This will happen regardless of what type of underlying storage you are using.
If you have created an array with chunks that are more than one row tall, then this will likely be inefficient, resulting in each chunk being read, decompressed, updated, compressed and written many times.
A better strategy would be to parse your input file in blocks of N lines, where N is equal to the number of rows in each chunk of the output array, so that each chunk is only compressed and written once.
If by VCF you mean Variant Call Format files, you might want to look at the vcf_to_zarr function implementation in scikit-allel.
I believe the LMDB store (as far as I can tell) will write/compress every time you assign.
You could aggregate your rows in an in-memory Zarr and then assign for each block.
There could be a "batch" option to the datasets but it has not been implemeted yet as far as I can tell.
Here is a simple example to illustrate my problem:
I have a large binary file with 10 million values.
I want to get 5K values from certain points in this file.
I have a list of indexes giving me the exact place in the file I have my value in.
To solve this I tried two methods:
Going through the values and simply using seek() (from the start of the file) to get each value, something like this:
binaryFile_new = open(binary_folder_path, "r+b")
for index in index_list:
binaryFile_new.seek (size * (index), 0)
wanted_line = binaryFile_new.read (size)
wanted_line_list.append(wanted_line)
binaryFile_new.close()
But as I understand this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.
Sorting the indexes so I could go through the file "once" while seeking from the current position with something like that:
binaryFile_new = open(binary_folder_path, "r+b")
sorted_index_list = sorted(index_list)
for i, index in enumerate(sorted_index_list):
if i == 0:
binaryFile_new.seek (size * (v), 0)
else:
binaryFile_new.seek ((index - sorted_index_list[i-1]) * size - size, 1)
binaryFile_new.seek (size * (index), 0)
wanted_line = binaryFile_new.read (size)
wanted_line_list.append(wanted_line)
binaryFile_new.close()
I expected the second solution to be much faster because in theory it would go through the whole file once O(N).
But for some reason both solutions run the same.
I also have a hard constraint on memory usage, as I run this operation in parallel and on many files, so I can't read files into memory.
Maybe the mmap package will help? Though, I think mmap also scans the entire file until it gets to the index so it's not "true" random access.
I'd go with #1:
for index in index_list:
binary_file.seek(size * index)
# ...
(I cleaned up your code a bit to comply with Python naming conventions and to avoid using a magic 0 constant, as SEEK_SET is default anyway.)
as I understand this solution reads through from the beginning for each index, therefore the complexity is O(N**2) in terms of file size.
No, a seek() does not "read through from the beginning", that would defeat the point of seeking. Seeking to the beginning of file and to the end of file have roughly the same cost.
Sorting the indexes so I could go through the file "once" while seeking from the current position
I can't quickly find a reference for this, but I believe there's absolutely no point in calculating the relative offset in order to use SEEK_CUR instead of SEEK_SET.
There might be a small improvement just from seeking to the positions you need in order instead of randomly, as there's an increased chance your random reads will be serviced from cache, in case many of the points you need to read happen to be close to each other (and so your read patterns trigger read-ahead in the file system).
Maybe the mmap package will help? Though, I think mmap also scans the entire file until it gets to the index so it's not "true" random access.
mmap doesn't scan the file. It sets up a region in your program's virtual memory to correspond to the file, so that accessing any page from this region the first time leads to a page fault, during which the OS reads that page (several KB) from the file (assuming it's not in the page cache) before letting your program proceed.
The internet is full of discussions of relative merits of read vs mmap, but I recommend you don't bother with trying to optimize by using mmap and use this time to learn about the virtual memory and the page cache.
[edit] reading in chunks larger than the size of your values might save you a bit of CPU time in case many of the values you need to read are in the same chunk (which is not a given) - but unless your program is CPU bound in production, I wouldn't bother with that either.
I have a file with integers stored as binary and I'm trying to extract values at specific locations. It's one big serialized integer array for which I need values at specific indexes. I've created the following code but its terribly slow compared to the F# version I created before.
import os, struct
def read_values(filename, indices):
# indices are sorted and unique
values = []
with open(filename, 'rb') as f:
for index in indices:
f.seek(index*4L, os.SEEK_SET)
b = f.read(4)
v = struct.unpack("#i", b)[0]
values.append(v)
return values
For comparison here is the F# version:
open System
open System.IO
let readValue (reader:BinaryReader) cellIndex =
// set stream to correct location
reader.BaseStream.Position <- cellIndex*4L
match reader.ReadInt32() with
| Int32.MinValue -> None
| v -> Some(v)
let readValues fileName indices =
use reader = new BinaryReader(File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
// Use list or array to force creation of values (otherwise reader gets disposed before the values are read)
let values = List.map (readValue reader) (List.ofSeq indices)
values
Any tips on how to improve the performance of the python version, e.g. by usage of numpy ?
Update
Hdf5 works very good (from 5 seconds to 0.8 seconds on my test file):
import tables
def read_values_hdf5(filename, indices):
values = []
with tables.open_file(filename) as f:
dset = f.root.raster
return dset[indices]
Update 2
I went with the np.memmap because the performance is similar to hdf5 and I already have numpy in production.
Heavily depending on your index file size you might want to read it completely into a numpy array. If the file is not large, complete sequential read may be faster than a large number of seeks.
One problem with the seek operations is that python operates on buffered input. If the program was written in some lower level language, the use on unbuffered IO would be a good idea, as you only need a few values.
import numpy as np
# read the complete index into memory
index_array = np.fromfile("my_index", dtype=np.uint32)
# look up the indices you need (indices being a list of indices)
return index_array[indices]
If you would anyway read almost all pages (i.e. your indices are random and at a frequency of 1/1000 or more), this is probably faster. On the other hand, if you have a large index file, and you only want to pick a few indices, this is not so fast.
Then one more possibility - which might be the fastest - is to use the python mmap module. Then the file is memory-mapped, and only the pages really required are accessed.
It should be something like this:
import mmap
with open("my_index", "rb") as f:
memory_map = mmap.mmap(mmap.mmap(f.fileno(), 0)
for i in indices:
# the index at position i:
idx_value = struct.unpack('I', memory_map[4*i:4*i+4])
(Note, I did not actually test that one, so there may be typing errors. Also, I did not care about endianess, so please check it is correct.)
Happily, these can be combined by using numpy.memmap. It should keep your array on disk but give you numpyish indexing. It should be as easy as:
import numpy as np
index_arr = np.memmap(filename, dtype='uint32', mode='rb')
return index_arr[indices]
I think this should be the easiest and fastest alternative. However, if "fast" is important, please test and profile.
EDIT: As the mmap solution seems to gain some popularity, I'll add a few words about memory mapped files.
What is mmap?
Memory mapped files are not something uniquely pythonic, because memory mapping is something defined in the POSIX standard. Memory mapping is a way to use devices or files as if they were just areas in memory.
File memory mapping is a very efficient way to randomly access fixed-length data files. It uses the same technology as is used with virtual memory. The reads and writes are ordinary memory operations. If they point to a memory location which is not in the physical RAM memory ("page fault" occurs), the required file block (page) is read into memory.
The delay in random file access is mostly due to the physical rotation of the disks (SSD is another story). In average, the block you need is half a rotation away; for a typical HDD this delay is approximately 5 ms plus any data handling delay. The overhead introduced by using python instead of a compiled language is negligible compared to this delay.
If the file is read sequentially, the operating system usually uses a read-ahead cache to buffer the file before you even know you need it. For a randomly accessed big file this does not help at all. Memory mapping provides a very efficient way, because all blocks are loaded exactly when you need and remain in the cache for further use. (This could in principle happen with fseek, as well, because it might use the same technology behind the scenes. However, there is no guarantee, and there is anyway some overhead as the call wanders through the operating system.)
mmap can also be used to write files. It is very flexible in the sense that a single memory mapped file can be shared by several processes. This may be very useful and efficient in some situations, and mmap can also be used in inter-process communication. In that case usually no file is specified for mmap, instead the memory map is created with no file behind it.
mmap is not very well-known despite its usefulness and relative ease of use. It has, however, one important 'gotcha'. The file size has to remain constant. If it changes during mmap, odd things may happen.
Is the indices list sorted? i think you could get better performance if the list would be sorted, as you would make a lot less disk seeks