How slicing numpy load file is loaded into memory - python

If I want to load a portion of a file using numpy.load, I use slicing as:
np.load('myfile.npy')[start:end].
Does this guarantee that only this portion of the file, i.e., [start:end], is loaded into memory, or does it load the entire file first and then slice it?
Thanks,

That loads the whole thing. If you don't want to load the whole thing, you could mmap the file and only copy the part you want:
part = numpy.load('myfile.npy', mmap_mode='r')[start:end].copy()
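A minimal sketch of that approach, assuming a small throwaway file in a temporary directory stands in for the large one (the array contents and the start/end bounds are illustrative only):

```python
import os
import tempfile

import numpy as np

# Write a small example array to disk (stand-in for a large .npy file).
path = os.path.join(tempfile.mkdtemp(), "myfile.npy")
np.save(path, np.arange(1000, dtype=np.int64))

start, end = 100, 110

# mmap_mode="r" maps the file instead of reading it all; slicing the map
# and copying pulls only the requested elements into an in-memory array.
mapped = np.load(path, mmap_mode="r")
part = mapped[start:end].copy()

print(part)
```

Without the .copy(), the slice stays backed by the file and is paged in lazily as it is accessed.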

Related

h5py file subset taking more space than parent file?

I have an existing h5py file that I downloaded which is ~18G in size. It has a number of nested datasets within it:
h5f = h5py.File('input.h5', 'r')
data = h5f['data']
latlong_data = data['lat_long'].value
I want to be able to do some basic min/max scaling of the numerical data within lat_long, so I want to put it in its own h5py file for easier use and lower memory usage.
However, when i try to write it out to its own file:
out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong_data)
out.close()
The output file is incredibly large. It has not finished writing to disk and already takes up ~85 GB. Why is the data being written to the new file not compressed?
It could be that h5f['data/lat_long'] uses compression filters (and your new dataset doesn't). To check the original dataset's compression settings, use this line:
print(h5f['data/lat_long'].compression, h5f['data/lat_long'].compression_opts)
After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the whole dataset into an array, which is unnecessary in most use cases. An h5py dataset object behaves much like a NumPy array, so use ds = h5f['data/lat_long'] to create a dataset object (instead of an array) and index it as you would a NumPy array; only the slices you access are read from disk. FYI, .value is a deprecated way to return the dataset as an array; use arr = h5f['data/lat_long'][()] instead. Either form loads the entire dataset into memory, which can be an issue with large datasets.
There are other ways to access the data. My suggestion to use dataset objects is one way. Your method (extracting the data to a new file) is another. I am not fond of that approach because you then have two copies of the data, which is a bookkeeping nightmare. Another alternative is to create external links from the new file to the existing 18 GB file. That way you have a small file that links to the big file (and no duplicate data). I describe that method in this post: "How can I combine multiple .h5 file?" (Method 1: Create External Links).
If you still want to copy the data, here is what I would do. Your code reads the dataset into an array and then writes the array to the new file (uncompressed). Instead, copy the dataset using h5py's group .copy() method; it retains the compression settings and attributes.
See below:
import h5py
# Group.copy() carries over the dataset's compression settings and attributes.
with h5py.File('input.h5', 'r') as h5f1, \
     h5py.File('latlong_only.h5', 'w') as h5f2:
    h5f1.copy(h5f1['data/lat_long'], h5f2, 'latlong')

How to avoid memory mapping when loading a numpy file

Csv file:
0,0,0,0,0,0,0,0,0,0.32,0.21,0,0.16,0,0,0,0,0,0,0.32
0,0,0,0,0,0,0.17,0,0.04,0,0,0.25,0.03,0.32,0,0.02,0.05,0.03,0.08,0
0.08,0.07,0.09,0.06,0,0,0.21,0.02,0,0,0,0,0,0,0,0.1,0.36,0,0,0
[goes on always 20 columns and x number of rows]
I'm saving the array this way:
with open(csv_profile) as csv_file:
    array = np.loadtxt(csv_file, delimiter=",", dtype='str')
npy_profile = open(outfile, "wb")
np.save(npy_profile, array)
The array is saved as <U4 (strings) instead of f8 (float64), which is what I need.
I noticed the wrong dtype because the header of the output file says:
<93>NUMPY^A^#v^#{'descr': '<U4', 'fortran_order': False, 'shape': (680, 20), }
Also when I load it:
profile_matrix=np.load(npy_profile,"r")
the class type is numpy.memmap instead of numpy.ndarray. How can I avoid this issue?
I need to both save it in the correct format and load it in the correct format.
Looking into the manual we can see that the second parameter of numpy.load is called mmap_mode and is set to "r" in your code. This enables memory mapping the file:
A memory-mapped array is kept on disk. However, it can be accessed and sliced like any ndarray. Memory mapping is especially useful for accessing small fragments of large files without reading the entire file into memory.
Memory mapping is normally not an "issue", as you called it, but a feature that enables faster file access and saves memory for large files. With memory-mapped I/O, your operating system maps parts of the file into the address space of your program, so the data does not have to be copied into RAM up front. In a writable mode such as "r+", changes made to the memory-mapped NumPy array are written back to the file. Because you specified read-only access ("r"), you cannot change values in the array.
If you want to disable memory mapping, remove the second argument "r" from the call to numpy.load. That gives you a fresh copy of the array in RAM that you can modify without affecting the file.
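The difference can be checked directly. This is a small sketch that assumes a throwaway file (the name profile.npy and the array shape are made up for illustration):

```python
import os
import tempfile

import numpy as np

# Hypothetical file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "profile.npy")
np.save(path, np.zeros((3, 20)))

mapped = np.load(path, mmap_mode="r")  # equivalent to np.load(path, "r")
in_ram = np.load(path)                 # no mmap_mode: a regular ndarray

print(type(mapped).__name__)
print(type(in_ram).__name__)

# A read-only map rejects writes.
try:
    mapped[0, 0] = 1.0
    write_failed = False
except ValueError:
    write_failed = True

# The in-memory copy is writable, and the file on disk is not touched.
in_ram[0, 0] = 1.0
```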
While the answer from Jakob Stark explains what the additional "r" argument to np.load() does, let me just suggest a simpler and safer usage. To save and load NumPy arrays in the straightforward way (no memory mapping, etc.), use the most straightforward syntax:
np.save('filename.npy', array)
array2 = np.load('filename.npy')
You don't have to specify the dtype or anything else; it does the simplest possible thing, as you are expecting. Also, not manually opening the file before calling np.save() means that you do not have to worry about closing it again (manual open/close should generally go inside a with statement or a try/finally block, which further adds to the complexity).
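For the dtype half of the question: np.loadtxt produces float64 when dtype is omitted, so dropping dtype='str' should give the f8 array that was wanted. A minimal sketch, assuming a short inline string stands in for the CSV file and a temporary path stands in for outfile:

```python
import io
import os
import tempfile

import numpy as np

# Stand-in for the CSV file from the question (only a few columns shown).
csv_text = "0,0.32,0.21\n0.17,0,0.04\n"

# No dtype argument: loadtxt parses the values as float64.
array = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# Hypothetical output path; np.save opens and closes the file itself.
outfile = os.path.join(tempfile.mkdtemp(), "profile.npy")
np.save(outfile, array)

# No mmap_mode argument: the result is a regular in-memory ndarray.
profile_matrix = np.load(outfile)

print(profile_matrix.dtype, type(profile_matrix).__name__)
```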

Loading large csv file in python in list or numpy array

I am unable to load a large CSV file (about 1.2 GB) into a NumPy array or a list in Python. Is there a way around this?
Here is an approach for your case. It reads only one line at a time; once the next line is read, the previous one is garbage collected unless you have stored a reference to it somewhere else. The with statement (a context manager) also closes the file for you.
with open("Large_size_filename") as infile:
    for line in infile:
        do_something_with(line)
Hope it helps you in understanding
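As one concrete shape for the per-line work, here is a sketch that accumulates column sums while holding a single row at a time. The helper name column_sums is hypothetical, and the sketch assumes the file is purely numeric with no header row:

```python
import csv
import io


def column_sums(lines, delimiter=","):
    """Accumulate per-column sums while holding only one row in memory."""
    sums = None
    rows = 0
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        values = [float(cell) for cell in row]
        if sums is None:
            sums = [0.0] * len(values)
        for i, value in enumerate(values):
            sums[i] += value
        rows += 1
    return rows, sums


# A small in-memory sample stands in for the open file object.
sample = io.StringIO("1,2,3\n4,5,6\n")
rows, sums = column_sums(sample)
print(rows, sums)
```

With a real file, pass the object returned by open(...) inside a with block instead of the StringIO sample.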

Writing to a memory file instead of file path

Is it possible to supply a memory buffer to write the data to instead of a file path, e.g. instead of object.save("D:\filename.jpg"), pass it a buffer in memory? I want to avoid writing the image object data to a .JPG file and then loading it again from disk; I would rather save it directly into memory.
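I believe you are looking for the StringIO library (in Python 3, io.StringIO for text and io.BytesIO for binary data such as images).
If you want a raw buffer of bytes to write to, use the third-party bitstring package.
>>> from bitstring import BitArray
>>> a = BitArray('0x1af')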
>>> a.hex, a.bin, a.uint # Different interpretations using properties
('1af', '000110101111', 431)
If you don't want a raw array of bits/bytes, then just keep your image object in memory. It's basically the same thing as a file, just, as you say -- in memory not on disk.
If object.save supports file-like objects, that is, objects that have a write method, you can pass it a StringIO.StringIO instance (io.BytesIO in Python 3). It has the same interface as a normal file object, but keeps its contents in memory.
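A rough sketch of that idea in Python 3. FakeImage is a hypothetical stand-in for whatever image object is in use; the only assumption is that its save() accepts a file-like target with a write method:

```python
import io


class FakeImage:
    """Hypothetical image object whose save() accepts a file-like target."""

    def __init__(self, payload):
        self.payload = payload

    def save(self, target):
        target.write(self.payload)


buffer = io.BytesIO()
FakeImage(b"\xff\xd8 fake jpeg bytes").save(buffer)

data = buffer.getvalue()  # the bytes that would otherwise have gone to disk
buffer.seek(0)            # rewind before handing the buffer to a reader
print(len(data))
```

Whether a given library's save() accepts a buffer like this depends on that library, so check its documentation first.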

Read matlab file (*.mat) from zipped file without extracting to directory in Python

This specific question stems from an attempt to handle large data sets produced by a MATLAB algorithm so that I can process them with Python algorithms.
Background: I have large arrays in MATLAB (typically 20x20x40x15000 [i,j,k,frame]) and I want to use them in python. So I save the array to a *.mat file and use scipy.io.loadmat(fname) to read the *.mat file into a numpy array. However, a problem arises in that if I try to load the entire *.mat file in python, a memory error occurs. To get around this, I slice the *.mat file into pieces, so that I can load the pieces one at a time into a python array. If I divide up the *.mat by frame, I now have 15,000 *.mat files which quickly becomes a pain to work with (at least in windows). So my solution is to use zipped files.
Question: Can I use scipy to directly read a *.mat file from a zipped file without first unzipping the file to the current working directory?
Specs: Python 2.7, windows xp
Current code:
import scipy.io
import zipfile
import numpy as np
def readZip(zfilename,dim,frames):
    data=np.zeros((dim[0],dim[1],dim[2],frames),dtype=np.float32)
    zfile = zipfile.ZipFile( zfilename, "r" )
    i=0
    for info in zfile.infolist():
        fname = info.filename
        zfile.extract(fname)
        mat=scipy.io.loadmat(fname)
        data[:,:,:,i]=mat['export']
        mat.clear()
        i=i+1
    return data
Tried code:
mat=scipy.io.loadmat(zfile.read(fname))
produces this error:
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
mat=scipy.io.loadmat(zfile.open(fname))
produces this error:
fileobj.seek(0)
UnsupportedOperation: seek
Any other suggestions on handling the data are appreciated.
Thanks!
I am pretty sure that the answer to my question is NO and there are better ways to accomplish what I am trying to do.
Regardless, with the suggestion from J.F. Sebastian, I have devised a solution.
Solution: Save the data in MATLAB in the HDF5 format, namely hdf5write(fname, '/data', data_variable). This produces a *.h5 file which then can be read into python via h5py.
python code:
import h5py
r = h5py.File(fname, 'r+')
data = r['data']
I can now index directly into the data; however, it stays on the hard drive.
print data[:,:,:,1]
Or I can load it into memory.
data_mem = data[:]
However, this once again gives memory errors. So, to get it into memory I can loop through each frame and add it to a numpy array.
h5py FTW!
In one of my frozen applications we bundle some files into the .bin file that py2exe creates, then pull them out like this:
z = zipfile.ZipFile(os.path.join(myDir, 'common.bin'))
data = z.read('schema-new.sql')
I am not certain if that would feed your .mat files into scipy, but I'd consider it worth a try.
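One way to try that idea is to wrap the bytes returned by read() in io.BytesIO, which gives a seekable file-like object (the missing seek is what the zfile.open() attempt in the question tripped over). The sketch below uses a saved NumPy array as a stand-in for a .mat file so that it runs without MATLAB data; the member name frame0.npy is made up. scipy.io.loadmat is documented to accept open file-like objects, so the same wrapped buffer may work there, but that part is not tested here.

```python
import io
import zipfile

import numpy as np

# Build a small zip archive in memory containing one saved array
# (a stand-in for the zipped .mat files).
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    member = io.BytesIO()
    np.save(member, np.arange(6, dtype=np.float32).reshape(2, 3))
    zf.writestr("frame0.npy", member.getvalue())

# Read the member's bytes and wrap them in a seekable buffer; nothing is
# extracted to the working directory.
with zipfile.ZipFile(archive, "r") as zf:
    raw = zf.read("frame0.npy")
frame = np.load(io.BytesIO(raw))

print(frame.shape)
```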
