numpy python 3.4 for beginners - python

I am a beginner in programming.
I installed Anaconda on my laptop in order to get NumPy. After that I loaded the .npy files as follows:
import numpy
X_train = numpy.load("train-features.npy")
X_test = numpy.load("test-features.npy")
Now I would like to see what is inside them, so I tried to print them, but that gives me a memory error.
How can I look into these files to understand what my data set looks like?

Without knowing the details or the exact error, I am assuming that you are getting a memory error because the data is too large to fit in memory all at once.
If that is the case, then you may want to memory-map the .npy / .dat file from disk.
Detailed examples are available on the scipy.org numpy.load page,
and details about numpy.memmap are available here.
If you can post the error and provide more insight into the data, that will help with determining the correct cause of the problem.
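For example, a minimal sketch (reusing the file names from the question) of opening the arrays memory-mapped, so only the slices you actually touch are read from disk:
import numpy
# mmap_mode='r' opens the file read-only and lazily; nothing is copied into RAM yet
X_train = numpy.load("train-features.npy", mmap_mode='r')
print(X_train.shape, X_train.dtype)  # cheap: only the header is read
print(X_train[:5])                   # reads just the first five rows from disk
With a memory-mapped array you can inspect the shape, dtype and small slices without ever loading the full file.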

Related

Memory usage due to pickle/joblib

I have a significantly large dataset consisting of several thousands of files spread among different directories. These files all have different formats and come from different sensors with different sampling rates. Basically, a mess. I created a Python module that is able to enter these folders, make sense of all this data, reformat it, and get it into a pandas DataFrame that I can use for effective and easy resampling and that, in general, makes it easier to work with.
The problem is that the resulting DataFrame is big and takes a large amount of RAM. Loading several of these datasets does not leave enough memory available to actually train an ML model. And it is painfully slow to read the data.
So my solution is a two-part approach. First, I read the dataset into a big variable (a dict with nested pandas DataFrames), then compute a reduced, derived DataFrame with the information I actually need to train my model, and remove the dict variable from memory. Not ideal, but it works. However, further computations sometimes need re-reading the data and, as stated previously, that is slow.
Enter the second part. Before removing the dict from memory, I pickle it to a file. sklearn actually recommends using joblib, so that's what I use. Once the single files for the dataset are stored in the working directory, the reading stage is about 90% faster than reading the scattered data, most likely because it loads a single large file directly into memory rather than reading and reformatting thousands of files across different directories.
Here's my problem. The same code, when reading the data from the scattered files, ends up using about 70% less RAM than when reading the pickled data. So, although it is faster, it ends up using much more memory. Has anybody experienced something like this?
Given that there are some access issues to the data (it is located on a network drive with some weird restrictions for user access) and the fact that I need to make it as user friendly as possible for other people, I'm using a Jupyter notebook. My IT department provides a web tool with all the packages required to read the network drive from the get-go and run Jupyter there, whereas running from a VM would require manual configuration of the network drive to access the data, and that part is not user friendly. The Jupyter tool requires only login information, while the VM requires basic knowledge of Linux sysadmin work.
I'm using Python 3.9.6. I'll keep trying to get an MWE that reproduces the situation. So far I have one that shows the opposite behaviour (loading the pickled dataset consumes less memory than reading it directly). That might be because of the particular structure of the dict with nested DataFrames.
MWE (Warning, running this code will create a 4GB file in your hard drive):
import numpy as np
import psutil
from os.path import exists
from os import getpid
from joblib import dump, load
## WARNING. THIS CODE SAVES A LARGE FILE INTO YOUR HARD DRIVE
def read_compute():
    if exists('df.joblib'):
        df = load('df.joblib')
        print('==== df loaded from .joblib')
    else:
        df = np.random.rand(1000000, 500)
        dump(df, 'df.joblib')
        print('=== df created and dumped')
    tab = df[:100, :10]
    del df
    return tab
table = read_compute()
print(f'{psutil.Process(getpid()).memory_info().rss / 1024 ** 2} MB')
With this, running without the df.joblib file in the pwd, I get
=== df created and dumped
3899.62890625 MB
And then, after that file is created, I restart the kernel and run the same code again, getting
==== df loaded from .joblib
1588.5234375 MB
In my actual case, with the format of my data, I have the opposite effect.
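One thing that may be worth trying, although it is only an assumption and not something verified against the real dataset: joblib.load accepts an mmap_mode argument that memory-maps the numpy buffers stored in the dump instead of copying them into RAM. For the array in the MWE above it would look roughly like this:
from joblib import load
# Hypothetical variation on the MWE: memory-map the dumped array so only
# the slice that is actually accessed gets pulled into memory.
df = load('df.joblib', mmap_mode='r')
tab = df[:100, :10].copy()  # copy() detaches the small slice from the mapped file
del df
For the real dict of nested DataFrames the gain may be smaller, since only the raw numpy buffers inside the pickle can be memory-mapped, but it is a cheap experiment when chasing RAM usage.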

Memory Error: Unable to allocate array shape ({very large}), solution for windows Python

Though there exist so many of these questions, I couldn't find any working solutions for Windows:
I got a large list of lists (of lists): (~30000, 48, 411) (or even bigger in some cases), which I need as a numpy array for the training of my LSTM model...
Any ideas how to work it out? (I don't use Linux, just Windows and 64-bit Python.)
I already tried converting it to np.float32 -> still too big!
Then I tried to convert it to np.float16 -> "tuple not callable"
The idea was to save and load it via np.memmap(), but for that I would also need it as a numpy array first. (This format is also needed for the training process, so the goal is to convert it to an np.ndarray.)
I even tried to split it into smaller lists (tenths), but it still was unable to allocate.
It is not clear to me in what format you have these "lists of lists" and what you mean by "too big" (for your memory, I assume?), but you might want to look into dask.
With that you can do something like
import dask.array as da
import dask
import numpy as np
...
arrays = []
for i in range(nfiles):
    arrays.append(da.from_delayed(read_list(...), shape=(...)))
arr = da.stack(arrays)
The dask documentation has more examples on how to create dask arrays.
In general, if you have data too large for your memory to handle (which should not be the case for 2-3 GB of data), the processing will be very slow, so your best bet is then to chunk the data and analyze it chunk by chunk.
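A more complete, hypothetical sketch of that pattern (read_list, the file count, and the per-file shape (48, 411) are placeholders taken from the question, not a tested pipeline):
import numpy as np
import dask
import dask.array as da
@dask.delayed
def read_list(i):
    # Placeholder loader: replace with code that reads file i and
    # returns an array of shape (48, 411).
    return np.random.rand(48, 411).astype(np.float32)
nfiles = 100  # ~30000 in the question
arrays = [da.from_delayed(read_list(i), shape=(48, 411), dtype=np.float32)
          for i in range(nfiles)]
arr = da.stack(arrays)             # lazy (nfiles, 48, 411) array, nothing read yet
print(arr.shape)
print(arr[:10].mean().compute())   # only the chunks needed are actually loaded
Whether the LSTM framework can consume a dask array directly depends on the training setup; often the practical route is to .compute() one batch-sized slice at a time.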

how to load a very large mat file in chunks?

Okay, so the code is like this:
X1 is the loaded hyperspectral image with dimensions (512x512x91).
What I am trying to do is basically crop 64x64x91 sized matrices with a stride of 2. This gives me a total of 49952 images, each of 64x64x91 size. However, when I run the for loop I get the memory error.
My system has 8 GB RAM.
data_images_0 = np.zeros((49952, 256, 256, 91))
k = 0
for i in range(0, 512-64, 2):
    r = 64
    print(k)
    for j in range(0, 512-64, 2):
        #print(k)
        data_images_0[k, :, :, :] = X1[i:i+r, j:j+r, :]
        k = k+1
I have a hyperspectral image loaded from a .mat file with dimensions (512x512x91). I want to use chunks of this image as the input to my CNN, for example crops of 64x64x91. The problem is that once I create the crops out of the original image, I have trouble loading the data, since loading all the crops at once gives me a memory error.
Is there something I can do to load my cropped data in batches so that I don't receive such a memory error?
Should I convert my data into some other format or approach the problem in some other way?
You are looking for the matfile function. It allows you to access the array on your hard disk and then only load parts of it.
Say your picture is named pic; then you can do something like
data = matfile("filename.mat");
part = data.pic(1:64,1:64,:);
%Do something
Then only the (1:64,1:64,:) part of the variable pic will be loaded into part.
As always, it should be noted that working from the hard disk is not exactly fast and should be avoided. On the other hand, if your variable is too large to fit in memory, there is no other way around it (apart from buying more memory).
I think you might want to use the matfile function, which basically opens a .mat file without pulling its entire content into RAM. You basically read a header from your .mat file that contains information about the stored elements, like size, data type and so on. Imagine your .mat file hyperspectralimg.mat contains the matrix myImage. You'd have to proceed like this:
filename = 'hyperspectralimg.mat';
img = matfile(filename);
A = doStuff2MyImg(img.myImage(1:64,1:64,:)); % Do stuff to your imageparts
img.myImage(1:64,1:64,:) = A; %Return changes to your file
This is a brief example of how you can use it, in case you haven't used matfile before. If you have already used it and it doesn't work, let us know; as a general recommendation, upload code snippets and data samples regarding your issues, it helps.
A quick comment regarding the tags: if your concern is Matlab, then don't tag python and similar things.
You can use a numpy memory map. This is the equivalent of MATLAB's matfile.
https://numpy.org/doc/stable/reference/generated/numpy.memmap.html
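As a hedged sketch of that idea (assuming the cube has first been exported from the .mat file to a plain .npy array; the file name here is hypothetical): np.load with mmap_mode returns an np.memmap, so crops can be copied out in small batches without ever materialising all ~50,000 of them.
import numpy as np
# Assumed: the 512x512x91 cube was saved once as a plain .npy file.
X1 = np.load('hyperspectral.npy', mmap_mode='r')  # nothing loaded into RAM yet
def crop_batches(X, size=64, stride=2, batch=32):
    # Collect lightweight views into the memory-mapped cube; np.stack copies
    # them into a single in-memory array only once per batch.
    h, w, _ = X.shape
    buf = []
    for i in range(0, h - size, stride):
        for j in range(0, w - size, stride):
            buf.append(X[i:i+size, j:j+size, :])
            if len(buf) == batch:
                yield np.stack(buf)
                buf = []
    if buf:
        yield np.stack(buf)
for batch in crop_batches(X1):
    pass  # feed `batch` to the CNN here
This keeps peak memory at one batch of crops instead of the full (49952, 64, 64, 91) array.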

How to extract data from Matlab .fig files in Python?

If I have some X vs Y data saved in a Matlab .fig file, is there a way to extract that data in Python? I've tried using the method shown in a previous discussion, but this does not work for me. I have also tried to open the files using h5py and PyTables, since newer (v7.3) .mat files are actually HDF5 files, but this results in an error where a valid file signature can't be found.
Currently I'm trying to do this with the Anaconda distribution of Python 3.4.
EDIT: I managed to figure out something that works, but I don't know why. This has me worried that something might break in the future and I won't be able to debug it. If anyone can explain why this works but the method in the old discussion doesn't, I'd really appreciate it.
from scipy.io import loadmat
d = loadmat('linear.fig', squeeze_me=True, struct_as_record=False)
x = d['hgS_070000'].children.children.properties.XData
y = d['hgS_070000'].children.children.properties.YData
The best way I can think of is using one of the Matlab-Python bridges (such as pymatbridge).
You can call Matlab code directly from Python and move data from one side to the other. You could use some Matlab code to load the .fig and extract the data, and then convert the numerical variables to Python structures (or numpy arrays) easily.
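If a Matlab installation is not available, a pure-Python alternative is to build on the loadmat approach from the question's EDIT. The hgS_070000 layout varies with the figure contents (number of axes, legends, annotations), so treat this as a sketch to adapt rather than a general solution:
import numpy as np
from scipy.io import loadmat
d = loadmat('linear.fig', squeeze_me=True, struct_as_record=False)
ax = d['hgS_070000'].children        # assumes a single axes in the figure
lines = np.atleast_1d(ax.children)   # one object per plotted line
for obj in lines:
    props = obj.properties
    if hasattr(props, 'XData'):      # skip children (e.g. text) without data
        print(props.XData, props.YData)
The chained .children.children in the EDIT works because a .fig file is an ordinary (pre-v7.3) .mat file storing the saved handle-graphics hierarchy: figure -> axes -> line objects, each with a properties struct.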

How to read a big (3-4GB) file that doesn't have newlines into a numpy array?

I have a 3.3 GB file containing one long line. The values in the file are comma separated and are either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 GB of memory, so I don't think it's even halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to post comments.
I've used HDF files, via h5py, of sizes well over 200 GB with very little processing time, on the order of a minute or two for file accesses. In addition, the HDF libraries support MPI and concurrent access.
This means that, assuming you can format your original one-line file as an appropriately hierarchical HDF file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of HDF to process your data on multiple cores, using MPI to pass whatever data you need between them.
You need to be careful with your code and understand how MPI works in conjunction with HDF, but it'll speed things up no end.
Of course, all of this depends on putting the data into an HDF file in a way that allows you to take advantage of MPI... so maybe not the most practical suggestion.
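For reference, a minimal sketch of the non-MPI part of that workflow with h5py (the file names and options are illustrative assumptions): parse the data once, write it into a chunked HDF5 dataset, and from then on read arbitrary slices without loading the whole file.
import numpy as np
import h5py
# One-off conversion: pay the parsing cost a single time.
values = np.fromfile('distance_matrix.tmp', sep=',')
with h5py.File('distance_matrix.h5', 'w') as f:
    f.create_dataset('distances', data=values, chunks=True, compression='gzip')
# Later runs: fast, partial reads straight from disk.
with h5py.File('distance_matrix.h5', 'r') as f:
    first_million = f['distances'][:1000000]  # reads only the chunks it needs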
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
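A hedged sketch of that one-off conversion (assuming the numpy.fromfile parse from the question does finish once): save the parsed array with numpy.save, and every later run then loads the binary .npy file directly, optionally memory-mapped, in a fraction of the time.
import numpy as np
# Pay the parsing cost once...
distance_matrix = np.fromfile('distance_matrix.tmp', sep=',')
np.save('distance_matrix.npy', distance_matrix)
# ...then later runs skip parsing entirely.
distance_matrix = np.load('distance_matrix.npy', mmap_mode='r')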
If you can't change the file type (i.e. it is not the output of one of your own programs), then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8 GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump the result in a binary format. I don't have any links to examples for this one.
