How can I create a numpy .npy file in place on disk? - python

Is it possible to create an .npy file without allocating the corresponding array in memory first?
I need to create and work with a large numpy array, too big to create in memory. Numpy supports memory mapping, but as far as I can see my options are either:
Create a memmapped file using numpy.memmap. This creates the file directly on disk without allocating memory, but doesn't store the metadata, so when I re-map the file later I need to know its dtype, shape, etc. In the following, notice that not specifying the shape results in the memmap being interpreted as a flat array:
In [77]: x=memmap('/tmp/x', int, 'w+', shape=(3,3))
In [78]: x
Out[78]:
memmap([[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])
In [79]: y=memmap('/tmp/x', int, 'r')
In [80]: y
Out[80]: memmap([0, 0, 0, 0, 0, 0, 0, 0, 0])
Create an array in memory, save it using numpy.save, after which it can be loaded in memmapped mode. This records metadata with the array data on disk, but requires that memory be allocated for the entire array at least once.
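To illustrate that second option concretely (paths and sizes are just example assumptions): the whole array has to exist in memory once, but the saved .npy file carries its metadata, so reopening it memory-mapped recovers shape and dtype automatically:
import numpy as np

x = np.zeros((3, 3), dtype=int)           # allocated in RAM once
np.save('/tmp/x.npy', x)                  # dtype and shape are stored in the .npy header
y = np.load('/tmp/x.npy', mmap_mode='r')  # returns a numpy.memmap with the right shape
print(y.shape, y.dtype)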

I had the same question and was disappointed when I read Sven's reply. It seems as though numpy would be missing out on some key functionality if you couldn't have a huge array on file and work on little pieces of it at a time. Your case seems to be close to one of the use cases in the original rationale for making the .npy format (see: http://svn.scipy.org/svn/numpy/trunk/doc/neps/npy-format.txt).
I then ran into numpy.lib.format, which seems to be full of useful goodies. I have no idea why this functionality is not available from the numpy root package. The key advantage over HDF5 is that this ships with numpy.
>>> print numpy.lib.format.open_memmap.__doc__
"""
Open a .npy file as a memory-mapped array.

This may be used to read an existing file or create a new one.

Parameters
----------
filename : str
    The name of the file on disk. This may not be a filelike object.
mode : str, optional
    The mode to open the file with. In addition to the standard file modes,
    'c' is also accepted to mean "copy on write". See `numpy.memmap` for
    the available mode strings.
dtype : dtype, optional
    The data type of the array if we are creating a new file in "write"
    mode.
shape : tuple of int, optional
    The shape of the array if we are creating a new file in "write"
    mode.
fortran_order : bool, optional
    Whether the array should be Fortran-contiguous (True) or
    C-contiguous (False) if we are creating a new file in "write" mode.
version : tuple of int (major, minor)
    If the mode is a "write" mode, then this is the version of the file
    format used to create the file.

Returns
-------
marray : numpy.memmap
    The memory-mapped array.

Raises
------
ValueError
    If the data or the mode is invalid.
IOError
    If the file is not found or cannot be opened correctly.

See Also
--------
numpy.memmap
"""

As you have found out yourself, NumPy is mainly targeted at handling data in memory. There are different libraries for handling data on disk, the one most commonly used today probably being HDF5. I suggest having a look at h5py, an excellent Python wrapper for the HDF5 libraries. It is designed to be used together with NumPy, and its interface is easy to learn if you already know NumPy. To get an impression of how it tackles your problem, read the documentation of Datasets.
For the sake of completeness I should mention PyTables, which seems to be the "standard" way of handling large datasets in Python. I did not use it because h5py appealed more to me. Both libraries have FAQ entries defining their scope against the other one.
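To give a rough impression of the h5py approach, a minimal sketch (file name, dataset name and sizes are example assumptions): the dataset lives on disk and can be written and read in slices, so the full array never has to fit in memory.
import numpy as np
import h5py

with h5py.File('/tmp/big.h5', 'w') as f:
    # The dataset is created on disk; nothing of this size is allocated in RAM.
    dset = f.create_dataset('data', shape=(1_000_000, 100), dtype='float64')
    dset[0:10, :] = np.random.rand(10, 100)   # write a small slice

with h5py.File('/tmp/big.h5', 'r') as f:
    chunk = f['data'][0:10, :]                # read back only what you need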

Related

Allow_Pickle = True modified my dictionary to "unsized" when loaded

I am trying to save and load variables (dictionaries) to use in other notebooks. I save the variables with:
with open('opp2b.npy', 'wb') as f:
    np.save(f, mak)
    np.save(f, mp)
len(mak)
82
mak and mp are dictionaries with 82 entries of the same length.
When loading, if I don't use allow_pickle=True it will not load, so I use this:
with open('opp2b.npy', 'rb') as f:
    mak = np.load(f, allow_pickle=True)
    mp = np.load(f, allow_pickle=True)
and when I check the length I get
len(mak)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-16-bb967ce1f5ef> in <module>
----> 1 len(mak)
TypeError: len() of unsized object
I am not sure why the array is modified, but it is now unusable for what I need.
Per your comments, mak is not a numpy array at all. numpy.save is specifically documented to:
Save an array to a binary file in NumPy .npy format.
allow_pickle is for numpy arrays containing Python objects, but the .npy format is not intended to store things that aren't numpy arrays at all. To successfully store the dict, it's wrapping it in a 0D numpy "array", and that's what np.load is giving you. You could extract the original dict by doing:
mak = mak.item(0) # mak = mak[0] doesn't work, and I'm unclear on why .item(0) works,
# as the docs claim the only difference is that .item(0) returns
# a Python scalar, rather than a numpy scalar, and that's not an
# issue here, but I assume something about 0D arrays requires this
But really, that's trying to put a square peg in a round hole. If you're not storing numpy arrays, there's little benefit to the .npy format, if any. The main advantages it provides are:
Avoiding arbitrary code execution for untrusted inputs (since you need to allow_pickle, that advantage goes away)
Allowing you to memory map on load (irrelevant when the entire data structure must be pickled anyway; memory mapping helps only for C level data where you might benefit from lazy loading and better performance if RAM grows short, as the data need not be written to swap before the pages are reclaimed)
(No longer relevant on modern Python) Stores array data more efficiently than the old pickle protocol 0 (that produced legal ASCII output, meaning only bytes of 127 or below, which made pickling raw binary data inefficient). As long as you're using protocol 2 or higher (which is binary, handles new-style classes efficiently, and is supported back to Python 2.3), it should store your data efficiently. As of Python 3.0, the default protocol is protocol 3 (rising to protocol 4 in 3.8), so if you're using a supported version of Python, and don't specify the protocol, it will use 3 or 4 (both of which work fine; protocol 4 being better if you're pickling huge objects).
Since you aren't storing numpy arrays, just rely on the pickle module directly to store arbitrary data (with modern pickle protocols, which allow efficient binary storage, numpy arrays pickle efficiently enough anyway, so the .npy format isn't helping much, if at all; for some trivial test cases I tried, saving {'a': numpy.array([0,1,2])}, the .npy dump was over twice the size).
import pickle  # At top of file

with open('opp2b.pkl', 'wb') as f:  # Name with common pickle extension instead of .npy
    pickle.dump(mak, f)  # Argument order reversed from np.save
    pickle.dump(mp, f)
and then to load:
with open('opp2b.pkl', 'rb') as f:  # Matching change in name
    mak = pickle.load(f)
    mp = pickle.load(f)
This assumes you might in fact want to load only one data set or the other at a time; if you plan to store and load both all the time, you may as well condense it to a single write of a tuple of the relevant values (increasing the chance that duplicated objects across the two objects can use back-references to avoid reserializing the same data multiple times), e.g.:
with open('opp2b.pkl', 'wb') as f:
    pickle.dump((mak, mp), f)
and:
with open('opp2b.pkl', 'rb') as f:
    mak, mp = pickle.load(f)

Converting CSV to numpy NPY efficiently

How to convert a .csv file to .npy efficiently?
I've tried:
import numpy as np
filename = "myfile.csv"
vec = np.loadtxt(filename, delimiter=",")
np.save(f"{filename}.npy", vec)
While the above works for smallish files, the actual .csv file I'm working on has ~12 million lines with 1024 columns, and it takes quite a lot of RAM to load everything before converting it to the .npy format.
Q (Part 1): Is there some way to load/convert a .csv to .npy efficiently for large CSV file?
The above code snippet is similar to the answer from Convert CSV to numpy, but that won't work for a ~12M x 1024 matrix.
Q (Part 2): If there isn't any way to load/convert a .csv to .npy efficiently, is there some way to iteratively read the .csv file into .npy efficiently?
Also, there's an answer here https://stackoverflow.com/a/53558856/610569 to save the csv file as a numpy array iteratively. But it seems like np.vstack isn't the best solution when reading the file. The accepted answer there suggests hdf5, but that format is not the main objective of this question and isn't desired in my use case, since I have to read it back into a numpy array afterwards.
Q (Part 3): If part 1 and part2 are not possible, are there other efficient storage (e.g. tensorstore) that can store and efficiently convert to numpy array when loading the saved storage format?
There is another library, tensorstore, that seems to handle arrays efficiently and supports conversion to a numpy array when read, https://google.github.io/tensorstore/python/tutorial.html. But somehow there isn't any information on how to save the tensor/array without the exact dimensions; all of the examples seem to include configurations like 'dimensions': [1000, 20000].
Unlike HDF5, tensorstore doesn't seem to have read-overhead issues when converting to numpy; from the docs:
Conversion to an numpy.ndarray also implicitly performs a synchronous read (which hits the in-memory cache since the same region was just retrieved)
Nice question; Informative in itself.
I understand you want to have the whole data set/array in memory, eventually, as a NumPy array. I assume, then, you have enough (RAM) memory to host such an array -- 12M x 1K.
I don't specifically know how np.loadtxt (genfromtxt) operates behind the scenes, so I will tell you how I would do it (after trying like you did).
Reasoning about memory...
Notice that a simple boolean array will cost ~12 GBytes of memory:
>>> print("{:.1E} bytes".format(
np.array([True]).itemsize * 12E6 * 1024
))
1.2E+10 bytes
And this is for a Boolean data type. Most likely, you have -- what -- a dataset of Integer, Float? The size may increase quite significantly:
>>> np.array([1], dtype=bool).itemsize
1
>>> np.array([1], dtype=int).itemsize
8
>>> np.array([1], dtype=float).itemsize
8
It's a lot of memory (which you know, just want to emphasize).
At this point, I would like to point out possible swapping of the working memory. You may have enough physical (RAM) memory in your machine, but if there is not enough free memory, your system will use swap memory (i.e., disk) to keep your system stable and get the work done. The cost you pay is clear: reading/writing from/to the disk is very slow.
My point so far is: check the data type of your dataset, estimate the size of your future array, and guarantee you have that minimum amount of RAM memory available.
I/O text
Considering you do have all the (RAM) memory necessary to host the whole numpy array: I would then loop over the whole (~12M lines) text file, filling the pre-existing array row-by-row.
More precisely, I would have the (big) array already instantiated before start reading the file. Only then, I would read each line, split the columns, and give it to np.asarray and assign those (1024) values to each respective row of the output array.
Looping over the file is slow, yes. The thing here is that you limit (and control) the amount of memory being used. Roughly speaking, the big objects consuming your memory are the "output" (big) array and the "line" (1024) array. Sure, a considerable amount of memory is consumed in each loop iteration by temporary objects while reading (text!) values, splitting them into list elements and casting to an array. Still, that remains largely constant during the whole ~12M lines.
So, the steps I would go through are:
0) estimate and guarantee enough RAM memory available
1) instantiate (np.empty or np.zeros) the "output" array
2) loop over "input.txt" file, create a 1D array from each line "i"
3) assign the line values/array to row "i" of "output" array
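A minimal sketch of steps 1)-3), assuming the CSV has no header and that the number of rows and columns (n_rows, n_cols below) is known up front:
import numpy as np

n_rows, n_cols = 12_000_000, 1024                     # assumed to be known in advance
out = np.empty((n_rows, n_cols), dtype=np.float64)    # step 1: preallocate the output array

with open("input.txt") as f:                          # step 2: loop over the text file
    for i, line in enumerate(f):
        # step 3: split the columns and assign them to row i
        out[i, :] = np.asarray(line.split(","), dtype=np.float64)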
Sure enough, you can even make it parallel: if on one hand text files cannot be randomly (r/w) accessed, on the other hand you can easily split them (see How can I split one text file into multiple *.txt files?) to have them read in parallel -- if fun is at the table -- when that time is critical.
Hope that helps.
TL;DR
Exporting to a format other than .npy seems inevitable, unless your machine is able to handle the size of the data in memory, as described in Brandt's answer above.
Reading the data, then processing it (Kinda answering Q part 2)
To handle data sizes larger than what the RAM can hold, one would often resort to libraries that perform "out-of-core" computation, e.g. turicreate.SFrame, vaex or dask. These libraries are able to lazily load the .csv files into dataframes and process them in chunks when evaluated.
from turicreate import SFrame
filename = "myfile.csv"
sf = SFrame.read_csv(filename)
sf.apply(...) # Trying to process the data
or
import vaex
filename = "myfile.csv"
df = vaex.from_csv(filename,
                   convert=True,
                   chunk_size=50_000_000)
df.apply(...)
Converting the read data into numpy array (kinda answering Q part 1)
While out-of-core libraries can read and process the data efficiently, converting to numpy is an "in-memory" operation, so the machine needs to have enough RAM to fit all the data.
The turicreate.SFrame.to_numpy documentation writes:
Converts this SFrame to a numpy array
This operation will construct a numpy array in memory. Care must be taken when size of the returned object is big.
And the vaex documentation writes:
In-memory data representations
One can construct a Vaex DataFrame from a variety of in-memory data representations.
And dask actually reimplements its own array objects, which are simpler than numpy arrays; see the dask array best practices, https://docs.dask.org/en/stable/array-best-practices.html. But going through the docs, it seems that dask arrays are saved not as .npy but in various other formats.
Writing the file into non-.npy versions (answering Q Part 3)
Given that the numpy arrays are inevitably in-memory, trying to save the data into one single .npy isn't the most viable option.
Different libraries seem to have different solutions for storage. E.g.
vaex saves the data into hdf5 by default if the convert=True argument is set when data is read through vaex.from_csv()
sframe saves the data into their own binary format
dask exposes export functions such as to_hdf() and to_parquet() (a minimal sketch follows below)
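For example, a minimal dask sketch (the file names are just examples) that converts a large CSV to parquet out of core and exposes a lazily evaluated array view:
import dask.dataframe as dd

df = dd.read_csv('myfile.csv')          # lazy, chunked read
df.to_parquet('myfile.parquet')         # written chunk by chunk, not via one big numpy array

arr = df.to_dask_array(lengths=True)    # dask array view; calling .compute() would pull it into RAM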
In its latest version (4.14), vaex supports "streaming", i.e. lazy loading of CSV files. It uses pyarrow under the hood, so it is super fast. Try something like
df = vaex.open('my_file.csv')
# or
df = vaex.from_csv_arrow('my_file.csv', lazy=True)
Then you can export to a bunch of formats as needed, or keep working with it like that (it is surprisingly fast). Of course, it is better to convert to some kind of binary format.
import numpy as np
import pandas as pd
# Define the input and output file names
csv_file = 'data.csv'
npy_file = 'data.npy'
# Create dummy data
data = np.random.rand(10000, 100)
df = pd.DataFrame(data)
df.to_csv(csv_file, index=False)
# Define the chunk size
chunk_size = 1000
# Read the header row and get the number of columns
header = pd.read_csv(csv_file, nrows=0)
num_cols = len(header.columns)
# Initialize an empty array to store the data
data = np.empty((0, num_cols))
# Loop over the chunks of the csv file
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
    # Convert the chunk to a numpy array
    chunk_array = chunk.to_numpy()
    # Append the chunk to the data array
    data = np.append(data, chunk_array, axis=0)

np.save(npy_file, data)

# Load the npy file and check the shape
npy_data = np.load(npy_file)
print('Shape of data before conversion:', data.shape)
print('Shape of data after conversion:', npy_data.shape)
I'm not aware of any existing function or utility that directly and efficiently converts csv files into npy files. By "efficient" I guess you primarily mean with low memory requirements.
Writing an npy file iteratively is indeed possible, with some extra effort. There's already a question on SO that addresses this, see:
save numpy array in append mode
For example using the NpyAppendArray class from Michael's answer you can do:
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    for line in csv:
        row = np.fromstring(line, sep=',')
        npy.append(row[np.newaxis, :])
The NpyAppendArray class updates the npy file header on every call to append, which is a bit much for your 12M rows. Maybe you could update the class to (optionally) only write the header on close. Or you could easily batch the writes:
batch_lines = 128

with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    done = False
    while not done:
        batch = []
        for count, line in enumerate(csv):
            row = np.fromstring(line, sep=',')
            batch.append(row)
            if count + 1 >= batch_lines:
                break
        else:
            done = True
        if batch:  # guard against an empty final batch
            npy.append(np.array(batch))
(code is not tested)

How to avoid memory mapping when loading a numpy file

Csv file:
0,0,0,0,0,0,0,0,0,0.32,0.21,0,0.16,0,0,0,0,0,0,0.32
0,0,0,0,0,0,0.17,0,0.04,0,0,0.25,0.03,0.32,0,0.02,0.05,0.03,0.08,0
0.08,0.07,0.09,0.06,0,0,0.21,0.02,0,0,0,0,0,0,0,0.1,0.36,0,0,0
[goes on always 20 columns and x number of rows]
I'm saving the array this way:
with open(csv_profile) as csv_file:
    array = np.loadtxt(csv_file, delimiter=",", dtype='str')

npy_profile = open(outfile, "wb")
np.save(npy_profile, array)
The data is saved as U4 (strings) instead of f8 (floats), which is what I need.
I noticed this error in the datatype as the output file says
<93>NUMPY^A^#v^#{'descr': '<U4', 'fortran_order': False, 'shape': (680, 20), }
Also when I load it:
profile_matrix = np.load(npy_profile, "r")
the class type is numpy.memmap instead of numpy.ndarray. How can I avoid this issue?
Both saving it in the correct format and loading it in the correct format.
Looking into the manual we can see that the second parameter of numpy.load is called mmap_mode and is set to "r" in your code. This enables memory mapping the file:
A memory-mapped array is kept on disk. However, it can be accessed and sliced like any ndarray. Memory mapping is especially useful for accessing small fragments of large files without reading the entire file into memory.
Memory mapping is normally not an "issue" as you called it, but a feature that enables faster file access and saves memory for large files. When doing memory-mapped I/O, your operating system maps parts of the file into your program's address space. That way the data does not have to be copied into RAM. Any changes made to the memory-mapped numpy array are directly reflected in the file. Because you specified read-only access ('r'), you cannot change values in the array.
If you want to disable memory mapping, you could remove the second argument "r" from the call to numpy.load. This gives you a fresh copy of the array in RAM, which you can modify without affecting the file.
While the answer from Jakob Stark explains what the additional "r" argument to np.load() does, let me just suggest a simpler and safer usage. To save and load NumPy arrays in the straightforward way (no memory mapping, etc.), use the most straightforward syntax:
np.save('filename.npy', array)
array2 = np.load('filename.npy')
You don't have to specify the dtype or anything; it just does the simplest possible thing, as you are expecting. Also, not manually opening the file prior to calling np.save() means that you do not have to worry about closing it again (manually opened files should generally be managed with a with block or try/finally so they are closed reliably, which further adds to the complexity).
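Putting the two answers together, a minimal sketch (variable names follow the question; npy_profile is assumed here to be a plain filename rather than an open file handle): load the CSV as floats so the .npy header records f8 rather than U4, and load it back without mmap_mode to get a plain numpy.ndarray:
import numpy as np

array = np.loadtxt(csv_profile, delimiter=",", dtype=float)  # f8 values instead of U4 strings
np.save(npy_profile, array)             # np.save accepts a filename directly

profile_matrix = np.load(npy_profile)   # no mmap_mode -> plain numpy.ndarray in RAM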

How to train Models with numpy arrays bigger than 6GB?

I have a couple of huge training files I am planning to train on. The validation data is fine and I see no problem with it, but the SIZE is huge. I am talking about 20GB+. Loading one file crashes python due to a memory error.
I have tried combining the files into one, but it's too big.
X = np.load('X150.npy')
Y = np.load('Y150.npy')
Error
~\AppData\Roaming\Python\Python37\site-packages\numpy\lib\format.py in read_array(fp, allow_pickle, pickle_kwargs)
710 if isfileobj(fp):
711 # We can use the fast fromfile() function.
--> 712 array = numpy.fromfile(fp, dtype=dtype, count=count)
713 else:
714 # This is not a real file. We have to read it the
MemoryError:
I need a solution so I can train huge datasets.
Important: first make sure that your Python is 64-bit. The methods below only support files up to 2GB for 32-bit Python versions.
Typically, one should use np.memmap() to use the array without loading on to the RAM. From the numpy docs, "Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory."
Example usage:
x_file = "X_150.npy"
X = np.memmap(x_file, dtype='int', mode='w+', shape=(300000, 1000))
However, since your files are already stored as .npy files, I stumbled upon np.lib.format.open_memmap(), which creates or loads memory-mapped .npy files.
To memory-map an existing .npy file without loading it into RAM, open it in read mode; the dtype and shape are read from the file's header, so you don't need to pass them:
x_file = "X_150.npy"
X = np.lib.format.open_memmap(x_file, mode='r')
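For training, a hedged sketch of reading the existing .npy files as memmaps and feeding the model in batches (the batch size and the train_on_batch call are placeholders for whatever your framework expects):
import numpy as np

X = np.lib.format.open_memmap('X150.npy', mode='r')   # nothing is loaded into RAM yet
Y = np.lib.format.open_memmap('Y150.npy', mode='r')

batch_size = 1024
for start in range(0, X.shape[0], batch_size):
    x_batch = np.asarray(X[start:start + batch_size])  # only this slice is read from disk
    y_batch = np.asarray(Y[start:start + batch_size])
    # model.train_on_batch(x_batch, y_batch)            # placeholder for your framework's call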
The full docstring for np.lib.format.open_memmap is quoted in the first answer above.

Finding shape of saved numpy array (.npy or .npz) without loading into memory

I have a huge compressed numpy array saved to disk (~20gb in memory, much less when compressed). I need to know the shape of this array, but I do not have the available memory to load it. How can I find the shape of the numpy array without loading it into memory?
This does it:
import numpy as np
import zipfile
def npz_headers(npz):
    """Takes a path to an .npz file, which is a Zip archive of .npy files.

    Generates a sequence of (name, shape, np.dtype).
    """
    with zipfile.ZipFile(npz) as archive:
        for name in archive.namelist():
            if not name.endswith('.npy'):
                continue

            npy = archive.open(name)
            version = np.lib.format.read_magic(npy)
            shape, fortran, dtype = np.lib.format._read_array_header(npy, version)
            yield name[:-4], shape, dtype
Opening the file in mmap_mode might do the trick.
If not None, then memory-map the file, using the given mode
(see `numpy.memmap` for a detailed description of the modes).
A memory-mapped array is kept on disk. However, it can be accessed
and sliced like any ndarray. Memory mapping is especially useful for
accessing small fragments of large files without reading the entire
file into memory.
It is also possible to read the header block without reading the data buffer, but that requires digging further into the underlying lib/npyio/format code. I explored that in a recent SO question about storing multiple arrays in a single file (and reading them).
https://stackoverflow.com/a/35752728/901925
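For a single .npy file (not .npz), a similar sketch reads just the header, or lets mmap_mode do the work. Note that read_array_header_1_0 assumes a version 1.0 header; for files written with a newer header version, the tuple returned by read_magic would need to be dispatched on:
import numpy as np

# Option 1: parse only the header of the .npy file; the data buffer is never read.
with open('big.npy', 'rb') as f:
    version = np.lib.format.read_magic(f)   # advances past the magic string
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
print(shape, dtype)

# Option 2: memory-map the file; only the header is read to build the memmap.
print(np.load('big.npy', mmap_mode='r').shape)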
