I am trying to perform a power transformation using sklearn.preprocessing.PowerTransformer on a large dataset with a shape of roughly ~(15000, 850000). The dataset is gridded climate data divided into individual .npz files (retrieved with another script; it could also have been .hdf or .nc). Each individual file has between 200-1000 rows and all 850,000 columns, corresponding to the timesteps and grid cells of the climate data, respectively.
My overall goal is to perform a PCA analysis of the data. Previous studies (Lim, 2012; Zhao, 2022) have stated that a lot of information is lost if the data is very skewed and not Gaussian-like. For that reason I want to perform a power transformation to give all columns a more Gaussian distribution before performing the PCA analysis.
My overall workflow looks like:
Preprocessing:
Load one data file -> Perform power transformation -> Save file
PCA:
Loop over all files: Load one data file -> Perform PCA using sklearn.decomposition.IncrementalPCA, calling partial_fit() on each part of the dataset.
Explore PCA patterns and temporal coefficients
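For reference, the IncrementalPCA step would look roughly like this (a sketch; the file pattern, the 'data' key inside the .npz files, and the number of components are placeholders for my actual setup):

import glob
import numpy as np
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=50)  # placeholder component count

# First pass: fit the PCA incrementally, one power-transformed file at a time
for path in sorted(glob.glob("transformed/*.npz")):
    chunk = np.load(path)["data"]  # (200-1000, 850000) block of timesteps
    ipca.partial_fit(chunk)

# Second pass: project each chunk to obtain the temporal coefficients
scores = []
for path in sorted(glob.glob("transformed/*.npz")):
    scores.append(ipca.transform(np.load(path)["data"]))
scores = np.vstack(scores)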
The problem is that sklearn.preprocessing.PowerTransformer performs a feature-wise transformation. Given all the data is float64, this would require ~100 GB of RAM, or ~50 GB if converted to float32. I only have 16 GB of RAM on my computer, so I cannot read all files into memory and perform the transformation on all rows and columns at once. Is it possible to perform the transformation in batches, like for the PCA analysis?
I would suggest a three-step process:
For each npz file, which contains 200-1000 observations of 850,000 features (approximately 1 to 6 GB uncompressed), split these files up by feature group. Put the first 10,000 features in one file, the next 10,000 features in the next file, and so on. Each input file becomes 85 output files. You will end up with about 6,000 files of roughly 20 MB apiece.
Join together the sample files which have the same features. For each feature group you would then have 15,000 observations of 10,000 features; write out the joined version. You would have 85 of these feature-group files, each about 1.2 GB uncompressed.
For each of these 85 feature-group files, load the file, run PowerTransformer on it, and write it back out as the transformed version. Since each file is only about 1.2 GB, it fits in memory.
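A rough sketch of those three steps (the 'data' key inside the .npz files and the output directories are assumptions to adapt to your setup):

import glob
import numpy as np
from sklearn.preprocessing import PowerTransformer

GROUP = 10_000
npz_files = sorted(glob.glob("raw/*.npz"))
n_groups = 850_000 // GROUP  # 85 feature groups

# Step 1: split every input file into per-feature-group files
# (the split/, groups/ and transformed/ directories are assumed to exist)
for i, path in enumerate(npz_files):
    data = np.load(path)["data"]  # (200-1000, 850000)
    for g in range(n_groups):
        np.save(f"split/file{i:04d}_group{g:03d}.npy",
                data[:, g * GROUP:(g + 1) * GROUP])

# Step 2: join all pieces that belong to the same feature group
for g in range(n_groups):
    parts = [np.load(f"split/file{i:04d}_group{g:03d}.npy")
             for i in range(len(npz_files))]
    np.save(f"groups/group{g:03d}.npy", np.vstack(parts))  # (15000, 10000)

# Step 3: power-transform each ~1.2 GB feature group independently
for g in range(n_groups):
    block = np.load(f"groups/group{g:03d}.npy")
    np.save(f"transformed/group{g:03d}.npy",
            PowerTransformer().fit_transform(block))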
Related
I have 11 years of data with a record (row) every second, over about 100 columns. It's indexed with a datetime series (created with Pandas to_datetime()).
We need to be able to do some correlation analysis between the columns, which can work with just 2 columns loaded at a time. We may be resampling at a lower time cadence (e.g. 48 s, 1 hour, months, etc.) over up to 11 years and visualizing those correlations over the 11 years.
The data are currently in 11 separate parquet files (one per year), individually generated with Pandas from 11 .txt files. Pandas did not partition any of those files. In memory, each of these parquet files loads up to about 20 GB. The intended target machine will only have 16 GB; loading even just 1 column over the 11 years takes about 10 GB, so 2 columns will not fit either.
Is there a more effective solution than working with Pandas for the correlation analysis over 2 columns at a time? For example, using Dask to (i) concatenate them, and (ii) repartition to some number of partitions so Dask can work with 2 columns at a time without blowing up the memory?
I tried the latter solution following this post, and did:
# Read all 11 parquet files in `data/`
df = dd.read_parquet("/blah/parquet/", engine='pyarrow')
# Export to 20 `.parquet` files
df.repartition(npartitions=20).to_parquet("/mnt/data2/SDO/AIA/parquet/combined")
but at the 2nd step, Dask blew up my memory and I got a kernel shutdown.
As Dask is very much about working with larger-than-memory data, I am surprised this memory escalation happened.
----------------- UPDATE 1: ROW GROUPS ---------------
I reprocessed the parquet files with Pandas to create about 20 row groups per file (it had defaulted to just 1 row group per file). Now, regardless of setting split_row_groups to True or False, I am not able to resample with Dask (e.g. myseries = myseries.resample('48s').mean()). I have to call compute() on the Dask series first to get it as a Pandas dataframe, which seems to defeat the purpose of working with the row groups within Dask.
When doing that resampling, I get instead:
ValueError: Can only resample dataframes with known divisions See
https://docs.dask.org/en/latest/dataframe-design.html#partitions for
more information.
I did not have that problem when I used the default Pandas behavior to write the parquet files with just 1 row group.
dask.dataframe is by default geared more toward reading many smaller "hive"-style parquet files than toward chunking individual huge parquet files into manageable pieces. From the dask.dataframe docs:
By default, Dask will load each parquet file individually as a partition in the Dask dataframe. This is performant provided all files are of reasonable size.
We recommend aiming for 10-250 MiB in-memory size per file once loaded into pandas. Too large files can lead to excessive memory usage on a single worker, while too small files can lead to poor performance as the overhead of Dask dominates. If you need to read a parquet dataset composed of large files, you can pass split_row_groups=True to have Dask partition your data by row group instead of by file. Note that this approach will not scale as well as split_row_groups=False without a global _metadata file, because the footer will need to be loaded from every file in the dataset.
I'd try a few strategies here:
Only read in the columns you need. Since your files are so huge, you don't want Dask even trying to load the first chunk to infer structure. You can provide the columns keyword to dd.read_parquet, which will be passed through to the various stages of the parsing engines. In this case: dd.read_parquet(filepath, columns=list_of_columns).
If your parquet files have multiple row groups, you can make use of the dd.read_parquet argument split_row_groups=True. This will create smaller chunks which are each smaller than the full file size.
If (2) works, you may be able to avoid repartitioning, or if you need to, repartition to a multiple of your original number of partitions (22, 33, etc). When reading data from a file, dask doesn't know how large each partition is, and if you specify a number less than a multiple of the current number of partitions, the partitioning behavior isn't very well defined. On some small tests I've run, repartitioning 11 --> 20 will leave the first 10 partitions as-is and split the last one into the remaining 10!
If your file is on disk, you may be able to read the file as a memory map to avoid loading the data prior to repartitioning. You can do this by passing memory_map=True to dd.read_parquet.
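Putting (1) and (2) together, a rough sketch of what that could look like (the column names, the 'datetime' index name, and the 48s cadence are placeholders to adapt):

import dask.dataframe as dd

cols = ["colA", "colB"]  # only the two columns needed for the correlation
ddf = dd.read_parquet(
    "/blah/parquet/",
    engine="pyarrow",
    columns=cols,
    split_row_groups=True,  # one partition per row group instead of per file
)

# resample() needs known divisions; declaring the (sorted) timestamp column
# as the index gives Dask those divisions
ddf = ddf.reset_index().set_index("datetime", sorted=True)

# Downsample and correlate without ever materialising the full series
resampled = ddf.resample("48s").mean()
corr = resampled.corr().compute()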
I'm sure you're not the only one with this problem. Please let us know how this goes and report back what works!
Number of files: 894
Total file size: 22.2 GB
I have to do machine learning by reading many CSV files. There is not enough memory to read them all at once.
Specifically, to load a large number of files that do not fit in memory, one can use dask:
import dask.dataframe as dd
df = dd.read_csv('file-*.csv')
This will create a lazy version of the data, meaning the data will be loaded only when requested, e.g. df.head() will load the data from the first 5 rows only. Where possible pandas syntax will apply to dask dataframes.
For machine learning you can use dask-ml which has tight integration with sklearn, see docs.
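For example, a minimal sketch of the dask-ml route, assuming a 'label' column and binary classes (both placeholders), wrapping a scikit-learn estimator that supports partial_fit:

import dask.dataframe as dd
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

df = dd.read_csv("file-*.csv")
X = df.drop(columns=["label"]).to_dask_array(lengths=True)
y = df["label"].to_dask_array(lengths=True)

# Incremental feeds the wrapped estimator one chunk at a time via partial_fit,
# so the full 22 GB never has to sit in memory at once
clf = Incremental(SGDClassifier())
clf.fit(X, y, classes=[0, 1])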
You can read your files in chunks, but not during the training phase. You have to select an appropriate algorithm for your files. However, having such big files for model training usually means you have to do some data preparation first, which will reduce the size of the files significantly.
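For that preparation step, pandas can stream each file in chunks and write out a much smaller copy; a rough sketch (the float32 downcast and the parquet output are only examples of what "reducing the size" could mean here):

import glob
import pandas as pd

for path in glob.glob("file-*.csv"):
    pieces = []
    # chunksize makes read_csv yield DataFrames of 100,000 rows at a time
    for chunk in pd.read_csv(path, chunksize=100_000):
        float_cols = chunk.select_dtypes("float64").columns
        chunk[float_cols] = chunk[float_cols].astype("float32")  # halve the footprint
        pieces.append(chunk)
    # parquet is smaller on disk and much faster to reload than csv
    pd.concat(pieces).to_parquet(path.replace(".csv", ".parquet"))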
I have a grid of approximately 500x1000 points with weather data. Each NetCDF file contains data for one day, and I have data for several years. What I am trying to extract is all weather data for all times per point. So essentially I would like to split the data into 500,000 files with data for one point in each.
What approach would be the best with respect to performance?
I tried to join all NetCDF files into one giant NetCDF file (approximately 400 GB). Then I read it in with xarray and tried to split it into one file per point. It took approximately 30 minutes per point, so I am looking for a faster solution.
import xarray as xr

# Open the combined dataset (lazily) and select a single grid cell
dataset = xr.open_dataset(big_nc_path)
selected_data = dataset.isel(x=12, y=17)

# Convert that point's full time series to a dataframe and write it out
dataframe = selected_data.to_dataframe()
dataframe.to_csv(file_path)
Update 1:
I have reduced the file to one year now to make it easier to test.
It seems like it was to a large extent an IO issue, since I had the large file on an external hard drive. Now I have moved it to my SSD and it is considerably faster.
Time to extract one x,y-combination when the yearly file is on the external hard drive: 10 minutes and 40 seconds.
Time to extract one x,y-combination when the yearly file is on the SSD: 20 seconds.
Still interested in improvements though :)
I posted the output of ncdump -hs here:
https://gist.github.com/robban/fd8f1b0e3228c0b80e93772976dc638a
I generated about 500 sharded numpy data files, each of which contains about 10,000 data samples (e.g., an image and its label), for example:
file-000001.npy
file-000002.npy
file-000003.npy
...
file-000500.npy
Each .npy file contains a numpy dictionary whose keys and shapes are {'image': 10000x3x512x64 (dtype=np.float32), 'label': 10000x100 (dtype=np.float32)}. Please note that some of these numpy files contain fewer than 10,000 samples, say 8111 etc.
During training, for each epoch, we need to iterate over all 500x10,000 samples. These data cannot be loaded into memory due to capacity limits. A common solution is a data prefetching queue.
My thought is as follows: (1) first record all the filenames and the count of data samples in each file, (2) for each batch, compute the batch indices, then determine which data files need to be loaded into memory to read the samples that compose the batch.
During step (2), if we set the batch size to 256, it is possible that we would need to open 256 different files and read just one sample from each of them to compose the batch. This might be slow and impractical.
With a queue, the data loading can run on background threads, and all batches that have been read are kept in the queue (whose capacity might be a large number, depending on the memory capacity). The background threads keep reading data to fill the queue whenever it has free space.
Is this hard to implement? I've searched on Google, and it seems there are some more advanced solutions such as caching or using mmap, but I'm not familiar with those. Are there any simple examples of this?
I created an hdf5 file with 5000 groups, each containing 1000 datasets, using h5py. The datasets are 1-D arrays of integers, averaging about 10,000 elements, though this can vary from dataset to dataset. The datasets were created without the chunked storage option. The total size of the file is 281 GB. I want to iterate through all datasets to create a dictionary mapping the dataset name to the length of the dataset. I will use this dictionary in an algorithm later.
Here's what I've tried.
import h5py

lens = {}
# Walk every group and record each dataset's length (first dimension)
with h5py.File('/my/path/to/database.hdf5', 'r') as f:
    for group in f.values():
        for dataset in group.values():
            lens[dataset.name] = dataset.shape[0]
This is too slow for my purposes and I'm looking for ways to speed up this procedure. I'm aware that it's possible to parallelize operations with h5py, but wanted to see if there was another approach before going down that road. I'd be open to re-creating the database with different options/structure if it would speed things up.
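For reference, the same dictionary can also be built with a single visititems traversal (standard h5py API); this is only a sketch and is not guaranteed to be faster, as it still has to visit every object in the file:

import h5py

lens = {}

def record_length(name, obj):
    # called once for every group and dataset; name is the path inside the file
    if isinstance(obj, h5py.Dataset):
        lens[name] = obj.shape[0]

with h5py.File('/my/path/to/database.hdf5', 'r') as f:
    f.visititems(record_length)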