Numpy matrix of arrays without copying possible? - python

I have a question about numpy and its memory. Is it possible to generate a view or something similar out of multiple numpy arrays without copying them?
import numpy as np

def test_var_args(*inputData):
    dataArray = np.array(inputData)
    print(np.may_share_memory(inputData, dataArray))  # prints False, because no memory is shared

test_var_args(np.arange(32), np.arange(32) * 2)
I've got a C++ application with images and want to do some Python magic. I pass the images row by row to the Python script using the C API and want to combine them without copying them.
I am able to pass the data such that C++ and Python share the same memory. Now I want to arrange that memory into a numpy view/array or something like that.
The images in C++ are not contiguous in memory (I slice them). The rows that I hand over to Python are arranged in a contiguous memory block.
The number of images I pass varies. Maybe I can change that if a preallocation trick exists.

There's a useful discussion in the answer here: Can memmap pandas series. What about a dataframe?
In short:
If you initialize your DataFrame from a single array or matrix, the data may not be copied.
If you initialize it from multiple arrays of the same or different dtypes, your data will be copied.
This is the only behavior permitted by the default BlockManager used by pandas' DataFrame, which organizes the DataFrame's memory internally.
It is possible to monkey-patch the BlockManager to change this behavior, in which case your supplied data will be referenced instead.
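A minimal sketch of that distinction, using np.shares_memory to check whether the DataFrame still references the original array (the exact outcome depends on your pandas version and its internal manager):
import numpy as np
import pandas as pd

arr = np.arange(12, dtype=np.float64).reshape(3, 4)

# One homogeneous 2-D array: pandas can wrap it in a single block.
df_single = pd.DataFrame(arr, copy=False)
print(np.shares_memory(arr, df_single.values))      # often True

# Two separate arrays: the BlockManager consolidates them, which copies.
a = np.arange(3, dtype=np.float64)
b = np.arange(3, dtype=np.float64) * 2
df_multi = pd.DataFrame({'a': a, 'b': b})
print(np.shares_memory(a, df_multi['a'].values))    # typically False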

Related

Fastest Way to Create Large Numpy Arrays?

I'm working on creating a bunch of data (10M rows or more) and am trying to understand more deeply the best way to work with numpy arrays as quickly as possible.
I'll have 1M rows of each class of data, which I read in from different sources (asynchronously). When I'm done reading, I want to combine them into a single numpy array. I'll know the final array is exactly 10M rows.
I'm thinking I have the following paths:
1. Create a global numpy array of the entire size and copy each source in.
2. Create a global numpy array plus a numpy array per source, and concatenate them all at the end.
3. Create an empty global numpy array and append each row to it (I think this is the slowest).
However, I'm not sure how to do #1 - numpy.copyto seems to always start with index 0.
Is there another model I should be going with here?
If I use "views", I'm not clear on how to copy them into the final array. I'm familiar with views for databases, of course, but not for numpy.

Python: Can I write to a file without loading its contents in RAM?

I've got a big dataset that I want to shuffle. The entire set won't fit into RAM, so it would be good if I could open several files (e.g. hdf5, numpy) simultaneously, loop through my data chronologically, and randomly assign each data point to one of the piles (then afterwards shuffle each pile).
I'm really inexperienced with working with data in Python, so I'm not sure if it's possible to write to files without holding the rest of their contents in RAM (I've been using np.save and np.savez with little success).
Is this possible in h5py or numpy and, if so, how could I do it?
Memory-mapped files will allow for what you want. They create a numpy array that leaves the data on disk, only loading data as needed. The complete manual page is here. However, the easiest way to use them is to pass the argument mmap_mode='r+' or mmap_mode='w+' in the call to np.load, which leaves the file on disk (see here).
I'd suggest using advanced indexing. If you have data in a one-dimensional array arr, you can index it using a list, so arr[[0, 3, 5]] will give you the 0th, 3rd, and 5th elements of arr. That will make selecting the shuffled versions much easier. Since this will overwrite the data, you'll need to open the files on disk read-only, and create copies (using mmap_mode='w+') to put the shuffled data in.
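A minimal sketch of that idea, assuming the data lives in a hypothetical big_data.npy and is shuffled into a new shuffled.npy in chunks, so only one chunk is ever resident in RAM at a time:
import numpy as np

src = np.load('big_data.npy', mmap_mode='r')          # hypothetical file; opened read-only, nothing loaded yet

# Pre-create the output file on disk with the same shape and dtype.
out = np.lib.format.open_memmap('shuffled.npy', mode='w+',
                                dtype=src.dtype, shape=src.shape)

perm = np.random.permutation(len(src))                # shuffled order of row indices
chunk = 100_000
for start in range(0, len(src), chunk):
    idx = perm[start:start + chunk]
    out[start:start + chunk] = src[idx]               # advanced indexing pulls just these rows into RAM

out.flush()                                           # make sure everything is written to disk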

Slow loading of large NumPy datasets

I notice a long loading time (~10 min) of a .npy file for a 1D numpy array of object dtype with a length of ~10000. Each element in this array is an ordered dictionary (OrderedDict, a dictionary subclass from the collections package) with a length of ~5000. So, how can I efficiently save and load large NumPy arrays to and from disk? How are large datasets in Python traditionally handled?
Numpy will pickle embedded objects by default (which you could avoid with allow_pickle=False, but it sounds like you may need it), and pickling is slow (see https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html).
You may want to check Pandas (see http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization) or try to come up with your own file format that avoids pickling of your complex data structures.
Saving and loading large datasets to/from disk will always be a costly operation. One possible optimization is using memory mapping to disk and working directly on the array (if this is compatible with your application), especially if you're only interested in a small part of the dataset. This is what numpy.memmap does.
For example:
import numpy as np
a = np.memmap('largeArray.dat', dtype=np.int32, mode='w+', shape=(100000,))
This will create a numpy array 'a' of 100000 int32 values. It can be handled like any "normal" numpy array. This also creates the corresponding file 'largeArray.dat' on your disk that will contain the data in 'a'. Synchronization between 'a' and 'largeArray.dat' is handled by numpy, and this depends on your RAM size.
More info here

how to downsize a pandas DataFrame without making a copy?

I have RAM concerns, and I want to downsize the data I loaded (with read_stata() you cannot read only a few rows, sadly). Can I change the code below to use only some rows for X and y, but not make a copy? That would, even if only temporarily, defeat the purpose: I want to save on memory, not add ever more to my footprint. Or should I downsize the data first (does `reshape` do that without a copy if you specify a smaller size than the original?) and then pick some columns?
data = pd.read_stata('S:/data/controls/notreat.dta')
X = data.iloc[:,1:]
y = data.iloc[:,0]
I feel your pain. Pandas is not a memory-friendly library, and 500 MB can quickly turn into >16 GB, shredding performance.
However, one thing that's worked for me is memmap. You can use memmap to page in numpy arrays and matrices just about as fast as your databus permits. And as an added benefit, unused pages may be unloaded.
See here for details. With some work, these memmap numpy arrays can be used to back a pd.Series or a pd.DataFrame without copying. However, you may find that pandas later copies your data as you proceed. So, my advice: create a memmap file, and stay in numpy-land.
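A minimal sketch of staying in numpy-land, assuming the numeric part of the data has already been dumped once to a hypothetical controls.npy (e.g. via np.save with data.to_numpy() on a machine where it fits); after that, the memmap pages rows in lazily and basic slicing returns views rather than copies:
import numpy as np

arr = np.load('controls.npy', mmap_mode='r')   # hypothetical file; nothing is read into RAM yet

X = arr[:, 1:]             # basic slices are views into the memmap, no copy
y = arr[:, 0]

X_small = X[:100_000]      # still a view; pages are loaded only when actually accessed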
Your other alternative is to use HDFS.

Is it possible to store multidimensional arrays of arbitrary shape in a PyTables cell?

PyTables supports the creation of tables from user-defined classes that inherit from the IsDescription class. This includes support for multidimensional cells, as in the following example from the documentation:
from tables import IsDescription, StringCol, Int32Col, Float32Col, Float64Col

class Particle(IsDescription):
    name = StringCol(itemsize=16)           # 16-character string
    lati = Int32Col()                       # integer
    longi = Int32Col()                      # integer
    pressure = Float32Col(shape=(2, 3))     # array of floats (single-precision)
    temperature = Float64Col(shape=(2, 3))  # array of doubles (double-precision)
However, is it possible to store an arbitrarily-shaped multidimensional array in a single cell? Following the above example, something like pressure = Float32Col(shape=(x, y)) where x and y are determined upon the insertion of each row.
If not, what is the preferred approach? Storing each (arbitrarily-shaped) multidimensional array in a CArray with a unique name and then storing those names in a master index table? The application I'm imagining is storing images and associated metadata, which I'd like to be able to both query and use numexpr on.
Any pointers toward PyTables best practices are much appreciated!
The long answer is "yes, but you probably don't want to."
PyTables probably doesn't support it directly, but HDF5 does support the creation of nested variable-length datatypes, allowing ragged arrays in multiple dimensions. Should you wish to go down that path, you'll want to use h5py and browse through the HDF5 User's Guide, Datatypes chapter. See section 6.4.3.2.3, Variable-length Datatypes. (I'd link to it, but they apparently chose not to put anchors that deep.)
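A minimal sketch of a variable-length datatype in h5py (hypothetical file and dataset names, recent h5py assumed); note that h5py's vlen support gives each cell a 1-D array of arbitrary length, so truly ragged 2-D cells need nesting or flattening plus a stored shape:
import h5py
import numpy as np

vlen_f32 = h5py.vlen_dtype(np.float32)          # variable-length float32 datatype

with h5py.File('ragged.h5', 'w') as f:
    dset = f.create_dataset('pressure', shape=(3,), dtype=vlen_f32)
    dset[0] = np.arange(4, dtype=np.float32)    # each row can have a different length
    dset[1] = np.arange(7, dtype=np.float32)
    dset[2] = np.arange(2, dtype=np.float32)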
Personally, the way that I would arrange the data you've got is into groups of datasets, not into a single table. That is, something like:
/particles/particlename1/pressure
/particles/particlename1/temperature
/particles/particlename2/pressure
/particles/particlename2/temperature
and so on. The lat and long values would be attributes on the /particles/particlename group rather than datasets, though having small datasets for them is perfectly fine too.
If you want to be able to do searches based on the lat and long, then having a dataset with the lat/long/name columns would be good. And if you wanted to get really fancy, there's an HDF5 datatype for references, allowing you to store a pointer to a dataset, or even to a subset of a dataset.
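A minimal h5py sketch of that group-per-particle layout (the records iterable, its field order, and the file name are all hypothetical):
import h5py

with h5py.File('particles.h5', 'w') as f:
    particles = f.create_group('particles')
    for name, pressure, temperature, lat, lon in records:    # hypothetical source data
        grp = particles.create_group(name)
        grp.create_dataset('pressure', data=pressure)         # arbitrary shape per particle
        grp.create_dataset('temperature', data=temperature)
        grp.attrs['lati'] = lat                               # small scalars stored as attributes
        grp.attrs['longi'] = lon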
The short answer is "no", and I think it's a "limitation" of HDF5 rather than PyTables.
I think the reason is that each unit of storage (the compound dataset) must have a well-defined size, which it obviously cannot have if one or more of its components can change size. Note that it is totally possible to resize and extend a dataset in HDF5 (PyTables makes heavy use of this), but not the units of data within that array.
I suspect the best thing to do is either:
a) make it a well-defined size and provide a flag for overflow. This works well if the largest reasonable size is still pretty small and you are okay with tail events being thrown out. Note that you might be able to get rid of the unused disk space with HDF5 compression.
b) do as you suggest and create a new CArray in the same file, then just read that in when required (to keep things tidy you might want to put these all under their own group).
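A minimal PyTables sketch of option (b), assuming a hypothetical image_list of arbitrarily shaped float32 arrays:
import tables

with tables.open_file('images.h5', mode='w') as f:
    grp = f.create_group('/', 'images')
    for i, img in enumerate(image_list):            # hypothetical source images
        carr = f.create_carray(grp, f'img_{i:06d}',
                               atom=tables.Float32Atom(),
                               shape=img.shape)
        carr[:] = img                               # each CArray keeps its own shape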
HDF5 actually has an API which is designed (and optimized) for storing images in an HDF5 file. I don't think it's exposed in PyTables.
