Memory usage due to pickle/joblib

Memory usage due to pickle/joblib - python

I have a significant large dataset consisting of several thousands of files spread among different directories. These files all have different formats and come from different sensors giving me different sampling rates. Basically, a mess. I created a python module that is able to enter these folders and make sense of all this data, reformat it, get it into a pandas dataframe that I could use for effective and easy resampling, and in general, make it easier to work with.
The problem is that the resulting dataframe is big and takes a large amount of RAM memory. Loading several of these datasets leaves not enough memory available to actually train a ML model. And it is painfully slow to read the data.
So my solution is a two part approach. First, I read the dataset into a big variable. It is a dict with nested pandas DataFrame, then compute a reduced derived DataFrame with the information I actually need to train my model, and remove from memory the dict variable. Not ideal, but it works. However, further computations sometimes needs re-reading the data and as stated previously, it is slow.
Enter the second part. Before removing the dict from memory, I pickle it into a file. sklearn actually recommends using joblib, so that's what I use. So, once the single files for the dataset are stored in the working directory, the reading stage is about 90% faster than reading the scattered data, most likely because is loading a single large file directly into memory than reading and reformatting thousands of files across different directories.
Here's my problem. The same code when is reading the data from the scattered files, ends up with about 70% less RAM than when reading the pickled data. So, although it is faster, it ends up using much memory. Has anybody experienced something like this?
Given that there are some access issues to the data (it is located in a network drive with some weird restrictions for user access) and the fact that I need to make it as user friendly as possible for other people, I'm using a Jupyter notebook. My IT department provides a web tool with all the packages required to read the network drive from the go and run Jupyter there, whilst running from a VM will require the manual configuration of the network drive to access the data and that part is not user friendly. The Jupyter tool requires only login information, while the VM requires a basic knowledge of linux sysadmin
I'm using Python 3.9.6. I'll keep trying to get a MWE that has a similar situation. So far I have one that has the opposite behaviour (loading the pickled dataset consumes less memory than reading it directly). Might be because the particular structure of the dict with nested DataFrame
MWE (Warning, running this code will create a 4GB file in your hard drive):
import numpy as np
import psutil
from os.path import exists
from os import getpid
from joblib import dump, load
## WARNING. THIS CODE SAVES A LARGE FILE INTO YOUR HARD DRIVE
def read_compute():
if exists('df.joblib'):
df = load('df.joblib')
print('==== df loaded from .joblib')
else:
df = np.random.rand(1000000,500)
dump(df, 'df.joblib')
print('=== df created and dumped')
tab = df[:100, :10]
del df
return tab
table = read_compute()
print(f'{psutil.Process(getpid()).memory_info().rss / 1024 ** 2} MB')
With this, I get when running without the df.joblib file in the pwd
=== df created and dumped
3899.62890625 MB
And then, after that file is created, I restart the kernel and run the same code again, getting
==== df loaded from .joblib
1588.5234375 MB
In my actual case, with the format of my data, I have the opposite effect.

Related

Are temporary files generated while working with pandas data frames

Until now, I have always used SAS to work with sensitive data. It would be nice to change to Python instead. However, I realize I do not understand how the data is handled during processing in pandas.
While running SAS, one knows exactly where all the temporary files are stored (hence it is easy to keep these in an encrypted container). But what happens when I use pandas data frames? I think I would not even notice, if the data left my computer during processing.
The size of the mere flat files, of which I typically have dozens to merge, are a couple of Gb. Hence I cannot simply rely on the hope, that everything will be kept in the RAM during processing - or can I? I am currently using a desktop with 64 Gb RAM.

If it's a matter of life and death, I would write the data merging function in C. This is the only way to be 100% sure of what happens with the data. The general philosophy of Python is to hide whatever happens "under the hood", this does not seem to fit your particular use case.

(Py)Spark: df.sample(0.1) doesn't affect runtime of df.toPandas()

I'm working on a data set of ~ 100k lines in PySpark, and I wan't to convert it to Pandas. The data on web clicks contains string variables and is read from a .snappy.orc file in an Amazon S3 bucket via spark.read.orc(...).
The conversion is running too slow for my application (for reasons very well explained here on stackoverflow), and thus I've tried to downsample my spark DataFrame to one tenth - the dataset is so large, that the statistical analysis I need to do is probably still valid. I however need to repeat the analysis for 5000 similar datasets, why speed the a concern.
What surprised me, is that the running time of: df.sample(false, 0.1).toPandas() is exactly the same as df.toPandas() (approx 180s), and so I don't get the reduction in running time I was hoping for.
I'm suspecting it may be a question of putting in a .cache() or .collect(), but I can't figure out a proper way to fit it in.

Python large dataset feature engineering workflow using dask hdf/parquet

There is already a nice question about it in SO but the best answer is now 5years old, So I think there should be better option(s) in 2018.
I am currently looking for a feature engineering pipeline for larger than memory dataset (using suitable dtypes).
The initial file is a csv that doesn't fit in memory. Here are my needs:
Create features (mainly using groupby operations on multiple columns.)
Merge the new feature to the previous data (on disk because it doesn't fit in memory)
Use a subset (or all) columns/index for some ML applications
Repeat 1/2/3 (This is an iterative process like day1: create 4
features, day2: create 4 more ...)
Attempt with parquet and dask:
First, I splitted the big csv file in multiple small "parquet" files. With this, dask is very efficient for the calculation of new features but then, I need to merge them to the initial dataset and atm, we cannot add new columns to parquet files. Reading the csv by chunk, merging and resaving to multiple parquet files is too time consuming as feature engineering is an iterative process in this project.
Attempt with HDF and dask:
I then turned to HDF because we can add columns and also use special queries and it is still a binary file storage. Once again I splitted the big csv file to multiple HDF with the same key='base' for the base features, in order to use the concurrent writing with DASK (not allowed by HDF).
data = data.repartition(npartitions=10) # otherwise it was saving 8Mo files using to_hdf
data.to_hdf('./hdf/data-*.hdf', key='base', format='table', data_columns=['day'], get=dask.threaded.get)
(Annex quetion: specifying data_columns seems useless for dask as there is no "where" in dask.read_hdf?)
Unlike what I expected, I am not able to merge the new feature to the multiples small files with code like this:
data = dd.read_hdf('./hdf/data-*.hdf', key='base')
data['day_pow2'] = data['day']**2
data['day_pow2'].to_hdf('./hdf/data-*.hdf', key='added', get=dask.threaded.get)
with dask.threaded I get "python stopped working" after 2%.
With dask.multiprocessing.get it takes forever and create new files
What are the most appropriated tools (storage and processing) for this workflow?

I will just make a copy of a comment from the related issue on fastparquet: it is technically possible to add columns to existing parquet data-sets, but this is not implemented in fastparquet and possibly not in any other parquet implementation either.
Making code to do this might not be too onerous (but it is not currently planned): the calls to write columns happen sequentially, so new columns for writing would need to percolate down to this function, together with the file position corresponding to the current first byte of the metadata in the footer. I addition, the schema would need to be updated separately (this is simple). The process would need to be repeated for every file of a data-set. This is not an "answer" to the question, but perhaps someone fancies taking on the task.

I would seriously consider using database (indexed access) as a storage or even using Apache Spark (for processing data in a distributed / clustered way) and Hive / Impala as a backend ...

HDF5 with Python, Pandas: Data Corruption and Read Errors

So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't stop things until downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mismarked as being dated in the 1750s...etc).
I thought I'd go through the current process and then the things that I suspect might be causing problems of that might help. Here's what it looks like:
Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they arent,
Take the table that had been in the HDF5 store as "QUERY1" and store it as QUERY1_YYYY_MM_DD" in the HDF5 store instead
Run the associated query on external database for that table. Each one is between 100 and 1500 columns of daily data back to 1980.
Store the result of query 1 as the new "QUERY1" in the HDF5 store
Compute several transformations of one or more of QUERY1, QUERY2,...QUERYn which will have hierarchical (Pandas MultiIndex) columns. Overwrite item "Derived_Frame1"...etc with its update/replacement in the HDF5 store
Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- potentially sometimes, but not usually, at the same time.
Some things I suspect could be part of the problem:
using default format (df.to_hdf(store, key)) instead of insisting on "Table" format with df.to_hdf(store, key, format='table')). I do this because default format is between 2 and 5x faster on both the read and the write according to %timeit
Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at least gradually increasing the file size, not sure what else). Maybe I should be storing query archives in another file? Maybe I should be nuking and replacing the whole main file upon update?
Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although it seems like the warning goes away if I use format='table'.
Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.
Any ideas?

How is HDF5 different from a folder with files?

I'm working on an open source project dealing with adding metadata to folders. The provided (Python) API lets you browse and access metadata like it was just another folder. Because it is just another folder.
\folder\.meta\folder\somedata.json
Then I came across HDF5 and its derivation Alembic.
Reading up on HDF5 in the book Python and HDF5 I was looking for benefits to using it compared to using files in folders, but most of what I came across spoke about the benefits of a hierarchical file-format in terms of its simplicity in adding data via its API:
>>> import h5py
>>> f = h5py.File("weather.hdf5")
>>> f["/15/temperature"] = 21
Or its ability to read only certain parts of it upon request (e.g. random access), and parallel execution of a single HDF5 file (e.g. for multiprocessing)
You could mount HDF5 files, https://github.com/zjttoefs/hdfuse5
It even boasts a strong yet simple foundation concept of Groups and Datasets which from wiki reads:
Datasets, which are multidimensional arrays of a homogeneous type
Groups, which are container structures which can hold datasets and
other groups
Replace Dataset with File and Group with Folder and the whole feature-set sounds to me like what files in folders are already fully capable of doing.
For every benefit I came across, not one stood out as being exclusive to HDF5.
So my question is, if I were to give you one HDF5 file and one folder with files, both with identical content, in which scenario would HDF5 be better suited?
Edit:
Having gotten some responses about the portability of HDF5.
It sounds lovely and all, but I still haven't been given an example, a scenario, in which an HDF5 would out-do a folder with files. Why would someone consider using HDF5 when a folder is readable on any computer, any file-system, over a network, supports "parallel I/O", is readable by humans without an HDF5 interpreter.
I would go as far as to say, a folder with files is far more portable than any HDF5.
Edit 2:
Thucydides411 just gave an example of a scenario where portability matters.
https://stackoverflow.com/a/28512028/478949
I think what I'm taking away from the answers in this thread is that HDF5 is well suited for when you need the organisational structure of files and folders, like in the example scenario above, with lots (millions) small (~1 byte) data structures; like individual numbers or strings. That it makes up for what file-systems lack by providing a "sub file-system" favouring the small and many as opposed to few and large.
In computer graphics, we use it to store geometric models and arbitrary data about individual vertices which seems to align quite well with it's use in the scientific community.

As someone who developed a scientific project that went from using folders of files to HDF5, I think I can shed some light on the advantages of HDF5.
When I began my project, I was operating on small test datasets, and producing small amounts of output, in the range of kilobytes. I began with the easiest data format, tables encoded as ASCII. For each object I processed, I produced on ASCII table.
I began applying my code to groups of objects, which meant writing multiple ASCII tables at the end of each run, along with an additional ASCII table containing output related to the entire group. For each group, I now had a folder that looked like:
+ group
| |-- object 1
| |-- object 2
| |-- ...
| |-- object N
| |-- summary
At this point, I began running into my first difficulties. ASCII files are very slow to read and write, and they don't pack numeric information very efficiently, because each digit takes a full Byte to encode, rather than ~3.3 bits. So I switched over to writing each object as a custom binary file, which sped up I/O and decreased file size.
As I scaled up to processing large numbers (tens of thousands to millions) of groups, I suddenly found myself dealing with an extremely large number of files and folders. Having too many small files can be a problem for many filesystems (many filesystems are limited in the number of files they can store, regardless of how much disk space there is). I also began to find that when I would try to do post-processing on my entire dataset, the disk I/O to read many small files was starting to take up an appreciable amount of time. I tried to solve these problems by consolidating my files, so that I only produced two files for each group:
+ group 1
| |-- objects
| |-- summary
+ group 2
| |-- objects
| |-- summary
...
I also wanted to compress my data, so I began creating .tar.gz files for collections of groups.
At this point, my whole data scheme was getting very cumbersome, and there was a risk that if I wanted to hand my data to someone else, it would take a lot of effort to explain to them how to use it. The binary files that contained the objects, for example, had their own internal structure that existed only in a README file in a repository and on a pad of paper in my office. Whoever wanted to read one of my combined object binary files would have to know the byte offset, type and endianness of each metadata entry in the header, and the byte offset of every object in the file. If they didn't, the file would be gibberish to them.
The way I was grouping and compressing data also posed problems. Let's say I wanted to find one object. I would have to locate the .tar.gz file it was in, unzip the entire contents of the archive to a temporary folder, navigate to the group I was interested in, and retrieve the object with my own custom API to read my binary files. After I was done, I would delete the temporarily unzipped files. It was not an elegant solution.
At this point, I decided to switch to a standard format. HDF5 was attractive for a number of reasons. Firstly, I could keep the overall organization of my data into groups, object datasets and summary datasets. Secondly, I could ditch my custom binary file I/O API, and just use a multidimensional array dataset to store all the objects in a group. I could even create arrays of more complicated datatypes, like arrays of C structs, without having to meticulously document the byte offsets of every entry. Next, HDF5 has chunked compression which can be completely transparent to the end user of the data. Because the compression is chunked, if I think users are going to want to look at individual objects, I can have each object compressed in a separate chunk, so that only the part of the dataset the user is interested in needs to be decompressed. Chunked compression is an extremely powerful feature.
Finally, I can just give a single file to someone now, without having to explain much about how it's internally organized. The end user can read the file in Python, C, Fortran, or h5ls on the commandline or the GUI HDFView, and see what's inside. That wasn't possible with my custom binary format, not to mention my .tar.gz collections.
Sure, it's possible to replicate everything you can do with HDF5 with folders, ASCII and custom binary files. That's what I originally did, but it became a major headache, and in the end, HDF5 did everything I was kluging together in an efficient and portable way.

Thanks for asking this interesting question. Is a folder with files portable because I can copy a directory onto a stick on a Mac and then see the same directory and files on a PC? I agree that the file directory structure is portable, thanks to the people that write operating systems, but this is unrelated to the data in the files being portable. Now, if the files in this directory are pdf's, they are portable because there are tools that read and make sense of pdfs in multiple operating systems (thanks to Adobe). But, if those files are raw scientific data (in ASCII or binary doesn't matter) they are not at all portable. The ASCII file would look like a bunch of characters and the binary file would look like gibberish. If the were XML or json files, they would be readable, because json is ASCII, but the information they contain would likely not be portable because the meaning of the XML/json tags may not be clear to someone that did not write the file. This is an important point, the characters in an ASCII file are portable, but the information they represent is not.
HDF5 data are portable, just like the pdf, because there are tools in many operating systems that can read the data in HDF5 files (just like pdf readers, see http://www.hdfgroup.org/products/hdf5_tools/index.html). There are also libraries in many languages that can be used to read the data and present it in a way that makes sense to users – which is what Adobe reader does. There are hundreds of groups in the HDF5 community that do the same thing for their users (see http://www.hdfgroup.org/HDF5/users5.html).
There has been some discussion here of compression as well. The important thing about compressing in HDF5 files is that objects are compressed independently and only the objects that you need get decompressed on output. This is clearly more efficient than compressing the entire file and having to decompress the entire file to read it.
The other critical piece is that HDF5 files are self-describing – so, people that write the files can add information that helps users and tools know what is in the file. What are the variables, what are their types, what software wrote them, what instruments collected them, etc. It sounds like the tool you are working on can read metadata for files. Attributes in an HDF5 file can be attached to any object in the file – they are not just file level information. This is huge. And, of course, those, attributes can be read using tools written in many languages and many operating systems.

I'm currently evaluating HDF5 so had the same question.
This article – Moving Away from HDF5 – asks pretty much the same question. The article raises some good points about the fact that there is only a single implementation of the HDF5 library which is developed in relatively opaque circumstances by modern open-source standards.
As you can tell from the title, the authors decided to move away from HDF5, to a filesystem hierarchy of binary files containing arrays with metadata in JSON files. This was in spite having made a significant investment in HDF5, having had their fingers burnt by data corruption and performance issues.

I think the main advantage is portability.
HDF5 stores information about your datasets like the size, type and endianness of integers and floating point numbers, which means you can move an hdf5 file around and read its content even if it was created on a machine with a different architecture.
You can also attach arbitrary metadata to groups and datasets. Arguably you can also do that with files and folders if your filesystem support extended attributes.
An hdf5 file is a single file which can sometimes be more convenient than having to zip/tar folders and files. There is also a major drawback to this: if you delete a dataset, you can't reclaim the space without creating a new file.
Generally, HDF5 is well suited for storing large arrays of numbers, typically scientific datasets.

To me, we can compare folder with files to HDF5 only in the relevant context of scientific data where the most important data are arrays described by a set of metadata.
In the general context, Marcus is alright when he claims that folder with files is far more portable than any HDF5. I will add that in a general context, a folder with file is far most accessible than a HDF5 file. The obvious challenge is that with "normal" folder and files, there is no need of an extra API to access data. That is simply impossible with HDF5 that keeps data and metadata in the same file.
Imagine a moment, to read your pdf file, you need a new pdf reader that understands HDF5? Imagine, to play your music, you need a music player that can decode HDF5? to run your python script, the python interpreter needs to first decode the HDF5? Or the total, to launch your python interpreter, your operating system needs to decode the HDF5? etc. I will have simply not be able to write this answer, because my OS won't have been able to launch my web browser, that won't have able to read its internal files because I previously
turned everything into HDF5 (maybe a large HDF5 for everything in my hard drive).
Storing meta data in separate file has that huge advantage of working well with the huge amount of data files and softwares that already exist without any extra piece of headache.
I hope this helps.

A game where you need to load a lot of resources into the memory would be a scenario in which an HDF5 may be better than a folder with files. Loading data from files has costs as seek time, the time required to open each file, and read data from the file into memory. These operations can be even slower when reading data from a DVD or Blu-ray. Opening a single file can reduce drastically those costs.

Yes, the main advantage is that HDF5 is portable. HDF5 files can be accessed by a host of other programming/interpreting languages, such as Python (which your API is built on), MATLAB, Fortran and C. As Simon suggested, HDF5 is used extensively in the scientific community to store large datasets. In my experience, I find the ability to retrieve only certain datasets (and regions) useful. In addition, the building the HDF5 library for parallel I/O is very advantageous for post-processing of raw data at a later time.
Since the file is also self-describing, it is capable of storing not just raw data, but also description of that data, such as the array size, array name, units and a host of additional metadata.
Hope this helps.

HDF5 is ultimately, a format to store numbers, optimised for large datasets. The main strengths are the support for compression (that can make reading and writing data faster in many circumstances) and the fast in-kernel queries (retrieval of data fulfilling certain conditions, for example, all the values of pressure when the temperature was over 30 C).
The fact that you can combine several datasets in the same file is just a convenience. For example, you could have several groups corresponding to different weather stations, and each group consisting on several tables of data. For each group you would have a set of attributes describing the details of the instruments, and each table the individual settings. You can have one h5 file for each block of data, with an attribute in the corresponding place and it would give you the same functionality. But now, what you can do with HDF5 is to repack the file for optimized querying, compress the whole thing slightly, and retrieve your information at a blazing speed. If you have several files, each one would be individually compressed, and the OS would decide the layout on disk, that may not be the optimal.
One last thing HDF5 allows you is to load a file (or a piece) in memory exposing the same API as in disk. So, for example, you could use one or other backend depending on the size of the data and the available RAM. In your case, that would be equivalent as copying the relevant information to /dev/shm in Linux, and you would be responsible for commiting back to disk any modification.

One factor to consider is performance of disk access. Using hd5f, everything is stored in continuous area of disk, making reading data faster with fewer disk seek and rotation. On the other hand, using file system to organize data may involve reading from many small files, thus more disk access is required.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.