Process a LOT of data - python

So I'm working with parametric energy simulations and ended up with 500GB+ of data stored in .CSV files. I need to process all this data to compare the results and gain insight into the influence of the different parameters.
Each CSV file name contains information about the parameters used for that simulation, so I cannot simply merge the files.
I normally loaded the .csv files into Python using pandas and defined a class, but now (with all this data) there is not enough memory to do this.
Can you point me to a way to process this data? I need to be able to make plots and compare the CSV files.
Thank you for your time.

Convert the csv files to hdf5, which was created to deal with massive and complex datasets. It works with pandas as well as other libraries.
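For example, a rough sketch of that conversion (not from the answer above; it assumes the file names encode parameters as name=value pairs, e.g. results/run_param1=0.4_param2=12.csv, and the parse_params helper is hypothetical): each CSV is streamed in chunks and appended to one HDF5 table, with the parameters added as columns so individual runs can still be selected and compared.

import glob
import os
import pandas as pd

def parse_params(path):
    # Hypothetical parser: 'run_param1=0.4_param2=12.csv' -> {'param1': '0.4', 'param2': '12'}
    stem = os.path.splitext(os.path.basename(path))[0]
    return dict(part.split('=') for part in stem.split('_') if '=' in part)

with pd.HDFStore('simulations.h5') as store:
    for path in glob.glob('results/*.csv'):
        params = parse_params(path)
        for chunk in pd.read_csv(path, chunksize=500_000):
            for name, value in params.items():
                chunk[name] = value  # tag every row with its simulation parameters
            store.append('runs', chunk, format='table', data_columns=list(params),
                         min_itemsize={name: 32 for name in params})  # reserve space for string params

# Later, load only what is needed for a plot or comparison, e.g.:
# subset = pd.read_hdf('simulations.h5', 'runs', where="param1 == '0.4'")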

Related

Memory usage due to pickle/joblib

I have a fairly large dataset consisting of several thousand files spread among different directories. These files all have different formats and come from different sensors with different sampling rates. Basically, a mess. I created a Python module that can enter these folders, make sense of all this data, reformat it, and get it into a pandas DataFrame that I can use for effective and easy resampling, and that in general makes it easier to work with.
The problem is that the resulting DataFrame is big and takes a large amount of RAM. Loading several of these datasets leaves not enough memory available to actually train an ML model, and it is painfully slow to read the data.
So my solution is a two-part approach. First, I read the dataset into a big variable, a dict with nested pandas DataFrames, then compute a reduced derived DataFrame with the information I actually need to train my model, and remove the dict variable from memory. Not ideal, but it works. However, further computations sometimes need re-reading the data, and as stated previously, that is slow.
Enter the second part. Before removing the dict from memory, I pickle it into a file. sklearn actually recommends using joblib, so that's what I use. Once the single file for each dataset is stored in the working directory, the reading stage is about 90% faster than reading the scattered data, most likely because it is loading a single large file directly into memory rather than reading and reformatting thousands of files across different directories.
Here's my problem. The same code, when reading the data from the scattered files, ends up using about 70% less RAM than when reading the pickled data. So, although it is faster, reading the pickled data ends up using much more memory. Has anybody experienced something like this?
Given that there are some access issues to the data (it is located on a network drive with some weird restrictions for user access) and the fact that I need to make it as user friendly as possible for other people, I'm using a Jupyter notebook. My IT department provides a web tool with all the packages required to read the network drive out of the box and run Jupyter there, whereas running from a VM would require manual configuration of the network drive to access the data, and that part is not user friendly. The Jupyter tool requires only login information, while the VM requires basic knowledge of Linux sysadmin.
I'm using Python 3.9.6. I'll keep trying to get an MWE that reproduces the situation. So far I have one that shows the opposite behaviour (loading the pickled dataset consumes less memory than reading it directly). It might be because of the particular structure of the dict with nested DataFrames.
MWE (warning: running this code will create a ~4 GB file on your hard drive):
import numpy as np
import psutil
from os.path import exists
from os import getpid
from joblib import dump, load

## WARNING. THIS CODE SAVES A LARGE FILE INTO YOUR HARD DRIVE
def read_compute():
    if exists('df.joblib'):
        df = load('df.joblib')
        print('==== df loaded from .joblib')
    else:
        df = np.random.rand(1000000, 500)
        dump(df, 'df.joblib')
        print('=== df created and dumped')
    tab = df[:100, :10]
    del df
    return tab

table = read_compute()
print(f'{psutil.Process(getpid()).memory_info().rss / 1024 ** 2} MB')
With this, when running without the df.joblib file in the working directory, I get
=== df created and dumped
3899.62890625 MB
And then, after that file is created, I restart the kernel and run the same code again, getting
==== df loaded from .joblib
1588.5234375 MB
In my actual case, with the format of my data, I see the opposite effect.

Is it possible to analyze data loaded in a database without creating a data frame in python?

The data is stored in a database (which I access through DBeaver). I would like to analyze my data through Python without creating a data frame, and Python is installed on my computer. As the data is huge, creating a data frame would consume my RAM and space.
So, is it possible to directly link my Python code to the database, do the necessary aggregation or data manipulation there, and gather only the output?
If you pull everything into Python directly it will also consume a lot of RAM and space, and if you do the data analysis directly in the database it may lead to unexpected results.
Instead you can use a Dask DataFrame (Dask official ... Dask Wikipedia).
With a Dask DataFrame you can do data analysis even if you have a big dataset.
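If the database itself is reachable from Python, a minimal sketch of that Dask route could look like this (the connection URI, table name, and column names are made up for illustration); read_sql_table reads the table lazily in partitions, so only the aggregated result is materialized in memory:

import dask.dataframe as dd

# Hypothetical connection string and table/column names.
df = dd.read_sql_table('readings', 'postgresql://user:password@host:5432/mydb',
                       index_col='id', npartitions=16)
result = df.groupby('sensor')['value'].mean().compute()  # only the aggregate comes back
print(result)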
I don't know at what scale you want to work with your data or how big your dataset is, but if you are going to change the data at a large scale I would recommend creating a CSV file which contains your dataset and working with pandas DataFrames; reading CSV files is fairly fast and they are easy to work with. If you're interested you can visit here and read the parts you need.
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

Python large dataset feature engineering workflow using dask hdf/parquet

There is already a nice question about this on SO, but the best answer is now 5 years old, so I think there should be better option(s) in 2018.
I am currently looking for a feature engineering pipeline for a larger-than-memory dataset (using suitable dtypes).
The initial file is a CSV that doesn't fit in memory. Here are my needs:
Create features (mainly using groupby operations on multiple columns).
Merge the new features into the previous data (on disk, because it doesn't fit in memory).
Use a subset (or all) of the columns/index for some ML applications.
Repeat 1/2/3 (this is an iterative process, like day 1: create 4 features, day 2: create 4 more ...).
Attempt with parquet and dask:
First, I split the big CSV file into multiple small parquet files. With this, dask is very efficient for the calculation of new features, but then I need to merge them into the initial dataset and, at the moment, we cannot add new columns to parquet files. Reading the CSV by chunks, merging and re-saving to multiple parquet files is too time consuming, as feature engineering is an iterative process in this project.
Attempt with HDF and dask:
I then turned to HDF because we can add columns and also use special queries, and it is still a binary file storage. Once again I split the big CSV file into multiple HDF files with the same key='base' for the base features, in order to get concurrent writing with dask (which HDF does not allow on a single file).
data = data.repartition(npartitions=10) # otherwise it was saving 8 MB files using to_hdf
data.to_hdf('./hdf/data-*.hdf', key='base', format='table', data_columns=['day'], get=dask.threaded.get)
(Annex question: specifying data_columns seems useless for dask as there is no "where" in dask.read_hdf?)
Unlike what I expected, I am not able to merge the new feature into the multiple small files with code like this:
data = dd.read_hdf('./hdf/data-*.hdf', key='base')
data['day_pow2'] = data['day']**2
data['day_pow2'].to_hdf('./hdf/data-*.hdf', key='added', get=dask.threaded.get)
With dask.threaded I get "python stopped working" after 2%.
With dask.multiprocessing.get it takes forever and creates new files.
What are the most appropriated tools (storage and processing) for this workflow?
I will just make a copy of a comment from the related issue on fastparquet: it is technically possible to add columns to existing parquet data-sets, but this is not implemented in fastparquet and possibly not in any other parquet implementation either.
Making code to do this might not be too onerous (but it is not currently planned): the calls to write columns happen sequentially, so new columns for writing would need to percolate down to this function, together with the file position corresponding to the current first byte of the metadata in the footer. In addition, the schema would need to be updated separately (this is simple). The process would need to be repeated for every file of a data-set. This is not an "answer" to the question, but perhaps someone fancies taking on the task.
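One practical workaround for this limitation, sketched under assumptions (the paths are made up and 'id' stands for whatever row key the dataset actually has), is to write each batch of new features as its own parquet dataset keyed by that id and join the pieces lazily when the ML step needs them:

import dask.dataframe as dd

base = dd.read_parquet('./parquet/base')                  # existing features, contains 'id' and 'day'
new = base[['id', 'day']].copy()
new['day_pow2'] = new['day'] ** 2                         # the newly engineered feature
new[['id', 'day_pow2']].to_parquet('./parquet/day_pow2')  # stored separately, base files untouched

# Later, reassemble only what is needed, still out of core:
base = dd.read_parquet('./parquet/base')
extra = dd.read_parquet('./parquet/day_pow2')
data = base.merge(extra, on='id', how='left')             # lazy join, evaluated on .compute()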
I would seriously consider using a database (indexed access) as storage, or even using Apache Spark (for processing data in a distributed / clustered way) with Hive / Impala as a backend ...

can you subset while reading in a csv in python

I have daily weather data in CSV going back to 1980, >10GB in size. The column I am interested in is the date, and I want a user to be able to select a date so that only the results from that date are returned.
I wonder if it is possible to read in and subset at the same time, to save memory and computation.
I am relatively new to python and tried:
d=pd.read_csv('weather.csv',sep='\t')['Date' == 'yyyymmdd']
to no avail.
Is it possible to read in only the data for a single day (e.g. 20011004)?
Short answer: from a csv you'll not be able to do so.
Long answer: CSV is a very handy format for humans to read, but it's one of the worst for machines to operate on. You'll need to parse line by line until you find the lines where the date matches the requested one.
A possible solution: you should convert the CSV into a format more amenable to such operations. My suggestion would be to go with something like HDF5. You can read the whole CSV with pandas and then save it as an HDF5 file with d.to_hdf('weather.h5', key='weather', format='table'). You can check the pandas HDF documentation here. This should allow you to handle the data in a more memory- and CPU-efficient way.
Binary files can implement indexes and sorting in such a way that you don't have to go through all the data to check for those pieces you need. The same ideas apply to databases.
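A minimal sketch of that conversion-and-query step, assuming the file is tab-separated as in the attempt above and the Date column holds integers like 20011004 (the file and column names are taken from the question, everything else is an assumption):

import pandas as pd

# One-time conversion: stream the CSV in chunks and append to an HDF5 table,
# declaring 'Date' as a data column so it can be queried on disk.
with pd.HDFStore('weather.h5') as store:
    for chunk in pd.read_csv('weather.csv', sep='\t', chunksize=1_000_000):
        store.append('weather', chunk, format='table', data_columns=['Date'])

# Afterwards, pull just one day without loading the whole file into memory:
day = pd.read_hdf('weather.h5', 'weather', where='Date == 20011004')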
Addendum: there are other options for binary formats, like parquet (which might be even better, you should test) or feather (if you want some level of "native" interoperability with R). You might want to check the following post for some insights regarding loading/saving times of different formats and their sizes.

Updating large DataFrame objects not on disk.

I've been learning the ins and outs of pandas by manipulating large CSV files obtained online; the files are time series of financial data. I have so far figured out how to use HDFStore to store and manipulate them, but I was wondering whether there is an easier way to update the files without re-downloading the entire source file?
I ask because I'm working with 12 files of ~300+ MB each, which update every 15 minutes. While I don't need the update to be continuous, it'd be swell not to download what I already have.
The Blaze library from Continuum should help you. You can find an introduction here.
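Alternatively (not the Blaze route from the answer above, just a sketch assuming each 15-minute refresh can be downloaded on its own; the file, key, and column names are hypothetical), the HDFStore tables you already use can simply be appended to, so only the new rows ever need to be fetched:

import pandas as pd

def append_update(new_rows, path='prices.h5', key='spy'):
    # Append only the freshly downloaded rows to the existing HDF5 table;
    # the previously stored ~300 MB of history is never re-written.
    with pd.HDFStore(path) as store:
        store.append(key, new_rows, format='table', data_columns=['timestamp'])

# usage: append_update(latest_chunk) after each 15-minute download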
