Fastest method for loading large dataset into python

I have some relatively large .mat files that I'm reading in to Python to eventually use in PyTorch. These files range in the number of rows (~55k to ~111k), but each has a little under 11k columns, with no header, and all the entries are floats. The data file sizes range from 5.8 GB to 11.8 GB. The .mat files came from a prior data processing step in Perl, so I'm not sure about the mat version; when I tried to load a file using scipy.io.loadmat, I received the following error: ValueError: Unknown mat file type, version 46, 56. I've tried pandas, dask, and astropy and been successful, but it takes between 4-6 minutes to load a single file. Here's the code for loading using each of the methods I've mentioned above, run as a timing experiment:
import pandas as pd
import dask.dataframe as dd
from astropy.io import ascii as aio
import numpy as np
import time
# dataPath is the path to one of the tab-separated data files (defined elsewhere)
numberIterations = 6
daskTime = np.zeros((numberIterations,), dtype=float)
pandasTime = np.zeros((numberIterations,), dtype=float)
astropyTime = np.zeros((numberIterations,), dtype=float)
for ii in range(numberIterations):
    t0 = time.time()
    data = dd.read_csv(dataPath, delimiter='\t', dtype=np.float64, header=None)
    daskTime[ii] = time.time() - t0
    data = 0
    del(data)
    t0 = time.time()
    data = pd.read_csv(dataPath, delimiter='\t', dtype=np.float64, header=None)
    pandasTime[ii] = time.time() - t0
    data = 0
    del(data)
    t0 = time.time()
    data = aio.read(dataPath, format='fast_no_header', delimiter='\t', header_start=None, guess=False)
    astropyTime[ii] = time.time() - t0
    data = 0
    del(data)
When I time these methods, dask is by far the slowest (by almost 3x), followed by pandas, and then astropy. For the largest file, the load time (in seconds) over 6 runs is:
dask: 1006.15 (avg), 1.14 (std)
pandas: 337.50 (avg), 5.84 (std)
astropy: 314.61 (avg), 2.02 (std)
I'm wondering if there is a faster way to load these files, since this is still quite long. Specifically, I'm wondering if there is perhaps a better library to use for the consistent loading of tabular float data and/or if there is a way to incorporate C/C++ or bash to read the files faster. I realize this question is a little open-ended; I'm hoping to get some ideas for how I can read these files in faster, so there is not a bunch of time wasted on just reading in the files.

Given that these were generated in Perl, and given that the code above works, these are tab-separated text files, not MATLAB files, which are what scipy.io.loadmat expects.
Generally, reading in text is slow, and will depend heavily on compression and IO limitations.
FWIW pandas is already pretty well optimised under the hood, and I doubt you would get significant gains from using C directly.
If you plan to use these files frequently it might be worth using zarr or hdf5 to represent tabular float data. I'd lean towards zarr if you have some experience with dask already. They work nicely together.
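To get a feel for how much a one-time conversion to a binary format can save, here is a minimal sketch that writes the same array as tab-separated text and as binary, then times both loads. It uses NumPy's own .npy format rather than zarr or HDF5, purely to avoid extra dependencies; the array size and file names are made up for illustration, and absolute timings will differ on real data.

```python
import os
import tempfile
import time

import numpy as np

# Small synthetic stand-in for the real data (the real files are far larger).
rng = np.random.default_rng(0)
data = rng.random((2000, 100))

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "data.tsv")  # hypothetical file names
    npy_path = os.path.join(tmp, "data.npy")

    # One-time conversion: store the same values as text and as binary.
    np.savetxt(txt_path, data, delimiter="\t")
    np.save(npy_path, data)

    # Time parsing the text file.
    t0 = time.perf_counter()
    from_text = np.loadtxt(txt_path, delimiter="\t")
    text_seconds = time.perf_counter() - t0

    # Time loading the binary file.
    t0 = time.perf_counter()
    from_binary = np.load(npy_path)
    binary_seconds = time.perf_counter() - t0

print(f"text:   {text_seconds:.4f} s")
print(f"binary: {binary_seconds:.4f} s")
```

The binary load skips number parsing entirely, which is usually where most of the time goes for large float tables.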

Related

Converting CSV to numpy NPY efficiently

How to convert a .csv file to .npy efficiently?
I've tried:
import numpy as np
filename = "myfile.csv"
vec =np.loadtxt(filename, delimiter=",")
np.save(f"{filename}.npy", vec)
While the above works for smallish files, the actual .csv file I'm working on has ~12 million lines with 1024 columns, and it takes a long time to load everything into RAM before converting it to .npy format.
Q (Part 1): Is there some way to load/convert a .csv to .npy efficiently for large CSV file?
The above code snippet is similar to the answer from Convert CSV to numpy but that won't work for ~12M x 1024 matrix.
Q (Part 2): If there isn't any way to load/convert a .csv to .npy efficiently, is there some way to iteratively read the .csv file into .npy efficiently?
Also, there's an answer here https://stackoverflow.com/a/53558856/610569 to save the csv file as numpy array iteratively. But seems like the np.vstack isn't the best solution when reading the file. The accepted answer there suggests hdf5 but the format is not the main objective of this question and the hdf5 format isn't desired in my use-case since I've to read it back into a numpy array afterwards.
Q (Part 3): If part 1 and part2 are not possible, are there other efficient storage (e.g. tensorstore) that can store and efficiently convert to numpy array when loading the saved storage format?
There is another library tensorstore that seems to efficiently handles arrays which support conversion to numpy array when read, https://google.github.io/tensorstore/python/tutorial.html. But somehow there isn't any information on how to save the tensor/array without the exact dimensions, all of the examples seem to include configurations like 'dimensions': [1000, 20000],.
Unlike the HDF5, the tensorstore doesn't seem to have reading overhead issues when converting to numpy, from docs:
Conversion to an numpy.ndarray also implicitly performs a synchronous read (which hits the in-memory cache since the same region was just retrieved)

Nice question; Informative in itself.
I understand you want to have the whole data set/array in memory, eventually, as a NumPy array. I assume, then, you have enough (RAM) memory to host such array -- 12M x 1K.
I don't know specifically how np.loadtxt (or genfromtxt) operates behind the scenes, so I will tell you how I would do it (after trying what you did).
Reasoning about memory...
Notice that a simple boolean array will cost ~12 GBytes of memory:
>>> print("{:.1E} bytes".format(
...     np.array([True]).itemsize * 12E6 * 1024
... ))
1.2E+10 bytes
And this is for a Boolean data type. Most likely, you have -- what -- a dataset of Integer, Float? The size may increase quite significantly:
>>> np.array([1], dtype=bool).itemsize
1
>>> np.array([1], dtype=int).itemsize
8
>>> np.array([1], dtype=float).itemsize
8
It's a lot of memory (which you know, just want to emphasize).
At this point, I would like to point out possible swapping of the working memory. You may have enough physical memory (RAM) in your machine, but if not enough of it is free, your system will use swap space (i.e., disk) to stay stable and get the work done. The cost you pay is clear: reading from and writing to disk is very slow.
My point so far is: check the data type of your dataset, estimate the size of your future array, and guarantee you have that minimum amount of RAM memory available.
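As a rough sketch of that estimate: the size of a dense array is simply rows × columns × bytes per element. The helper name `estimate_nbytes` is made up for illustration, and the shape is the ~12M × 1024 case from the question.

```python
import numpy as np


def estimate_nbytes(n_rows, n_cols, dtype):
    """Return the number of bytes a dense (n_rows, n_cols) array of dtype needs."""
    return n_rows * n_cols * np.dtype(dtype).itemsize


# Compare a few candidate dtypes for the ~12M x 1024 dataset.
for dtype in (np.bool_, np.float32, np.float64):
    size_gib = estimate_nbytes(12_000_000, 1024, dtype) / 2**30
    print(f"{np.dtype(dtype).name:>8}: {size_gib:6.1f} GiB")
```

If the values tolerate reduced precision, choosing float32 over float64 halves the requirement.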
I/O text
Considering you do have all the (RAM) memory necessary to host the whole numpy array: I would then loop over the whole (~12M lines) text file, filling the pre-existing array row-by-row.
More precisely, I would have the (big) array already instantiated before starting to read the file. Only then would I read each line, split the columns, pass them to np.asarray, and assign those (1024) values to the respective row of the output array.
Looping over the file is slow, yes. The point is that you limit (and control) the amount of memory being used. Roughly speaking, the big objects consuming your memory are the "output" (big) array and the "line" (1024) array. Sure, a considerable amount of memory is consumed in each iteration by the temporary objects created while reading the (text!) values, splitting them into list elements, and casting to an array. Still, that amount remains largely constant over the whole ~12M lines.
So, the steps I would go through are:
0) estimate and guarantee enough RAM memory available
1) instantiate (np.empty or np.zeros) the "output" array
2) loop over "input.txt" file, create a 1D array from each line "i"
3) assign the line values/array to row "i" of "output" array
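The steps above could be sketched roughly as follows, assuming a headerless, purely numeric file with a known number of columns. The function name and its parameters are illustrative, not part of any library.

```python
import numpy as np


def load_csv_preallocated(path, n_cols, delimiter=",", dtype=np.float64):
    """Fill a preallocated array row by row from a headerless numeric text file."""
    # First pass: count non-empty lines so the output can be allocated once.
    with open(path) as fh:
        n_rows = sum(1 for line in fh if line.strip())

    # Step 1: instantiate the "output" array.
    out = np.empty((n_rows, n_cols), dtype=dtype)

    # Steps 2 and 3: build a 1D array from each line and assign it to row i.
    with open(path) as fh:
        i = 0
        for line in fh:
            line = line.strip()
            if not line:
                continue
            out[i, :] = np.asarray(line.split(delimiter), dtype=dtype)
            i += 1
    return out
```

Only the output array and one parsed line are alive at any time, at the cost of reading the file twice (once to count rows, once to parse).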
Sure enough, you can even make it parallel: while text files cannot be randomly (r/w) accessed, you can easily split them (see How can I split one text file into multiple *.txt files?) and, if you feel like it, read the pieces in parallel, if time is critical.
Hope that helps.

TL;DR
Exporting to a format other than .npy seems inevitable unless your machine can hold the whole dataset in memory, as described in @Brandt's answer.
Reading the data, then processing it (Kinda answering Q part 2)
To handle data larger than the available RAM, one would often resort to libraries that perform "out-of-core" computation, e.g. turicreate.SFrame, vaex or dask. These libraries can lazily load the .csv files into dataframes and process them in chunks when evaluated.
from turicreate import SFrame
filename = "myfile.csv"
sf = SFrame.read_csv(filename)
sf.apply(...) # Trying to process the data
or
import vaex
filename = "myfile.csv"
df = vaex.from_csv(filename,
                   convert=True,
                   chunk_size=50_000_000)
df.apply(...)
Converting the read data into numpy array (kinda answering Q part 1)
While out-of-core libraries can read and process the data efficiently, converting into numpy is an "in-memory" operation: the machine needs to have enough RAM to fit all the data.
The turicreate.SFrame.to_numpy documentation writes:
Converts this SFrame to a numpy array
This operation will construct a numpy array in memory. Care must be taken when size of the returned object is big.
And the vaex documentation writes:
In-memory data representations
One can construct a Vaex DataFrame from a variety of in-memory data representations.
Dask has its own array objects, which implement a subset of the numpy array interface; see its best practices at https://docs.dask.org/en/stable/array-best-practices.html. Going through the docs, though, it seems that dask arrays are saved in various formats other than .npy.
Writing the file into non-.npy versions (answering Q Part 3)
Given that numpy arrays are inevitably in-memory, trying to save the data into one single .npy isn't the most viable option.
Different libraries seem to have different solutions for storage. E.g.
vaex saves the data into hdf5 by default if the convert=True argument is set when data is read through vaex.from_csv()
sframe saves the data into its own binary format
dask provides the export functions to_hdf() and to_parquet()

In its latest version (4.14), vaex supports "streaming", i.e. lazy loading of CSV files. It uses pyarrow under the hood, so it is super fast. Try something like
df = vaex.open("my_file.csv")
# or
df = vaex.from_csv_arrow("my_file.csv", lazy=True)
Then you can export to a bunch of formats as needed, or keep working with it like that (it is surprisingly fast). Of course, it is better to convert to some kind of binary format.

import numpy as np
import pandas as pd
# Define the input and output file names
csv_file = 'data.csv'
npy_file = 'data.npy'
# Create dummy data
data = np.random.rand(10000, 100)
df = pd.DataFrame(data)
df.to_csv(csv_file, index=False)
# Define the chunk size
chunk_size = 1000
# Read the header row and get the number of columns
header = pd.read_csv(csv_file, nrows=0)
num_cols = len(header.columns)
# Initialize an empty array to store the data
data = np.empty((0, num_cols))
# Loop over the chunks of the csv file
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
    # Convert the chunk to a numpy array
    chunk_array = chunk.to_numpy()
    # Append the chunk to the data array (note: np.append copies the whole array each time)
    data = np.append(data, chunk_array, axis=0)
np.save(npy_file, data)
# Load the npy file and check the shape
npy_data = np.load(npy_file)
print('Shape of data before conversion:', data.shape)
print('Shape of data after conversion:', npy_data.shape)
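The loop above still ends up with the whole array in memory and copies it on every np.append call. A possible alternative, sketched here under the assumption that the CSV is purely numeric with a fixed number of columns, is to create the .npy file at its final size with np.lib.format.open_memmap and fill it chunk by chunk. The function name `csv_to_npy` and its parameters are made up for this sketch.

```python
import itertools

import numpy as np


def csv_to_npy(csv_path, npy_path, chunk_rows=1000, delimiter=",", skip_header=0):
    """Convert a numeric CSV to .npy while parsing only one chunk at a time."""
    # First pass: count data rows and columns so the output file can be sized.
    with open(csv_path) as fh:
        for _ in range(skip_header):
            next(fh)
        first = next(fh)
        n_cols = len(first.split(delimiter))
        n_rows = 1 + sum(1 for line in fh if line.strip())

    # Create the .npy file on disk with its final shape, backed by a memmap.
    out = np.lib.format.open_memmap(
        npy_path, mode="w+", dtype=np.float64, shape=(n_rows, n_cols)
    )

    # Second pass: parse chunk_rows lines at a time and copy them into place.
    with open(csv_path) as fh:
        for _ in range(skip_header):
            next(fh)
        start = 0
        while True:
            lines = list(itertools.islice(fh, chunk_rows))
            if not lines:
                break
            block = np.loadtxt(lines, delimiter=delimiter, ndmin=2)
            out[start:start + len(block)] = block
            start += len(block)

    out.flush()
    del out  # release the memmap
    return n_rows, n_cols
```

The result is an ordinary .npy file, so it can be read back with np.load(npy_path), or with np.load(npy_path, mmap_mode="r") when it does not fit in RAM.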

I'm not aware of any existing function or utility that directly and efficiently converts csv files into npy files, where by efficient I primarily mean low memory requirements.
Writing a npy file iteratively is indeed possible, with some extra effort. There's already a question on SO that addresses this, see:
save numpy array in append mode
For example using the NpyAppendArray class from Michael's answer you can do:
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    for line in csv:
        row = np.fromstring(line, sep=',')
        npy.append(row[np.newaxis, :])
The NpyAppendArray class updates the npy file header on every call to append, which is a bit much for your 12M rows. Maybe you could update the class to (optionally) only write the header on close. Or you could easily batch the writes:
batch_lines = 128
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    done = False
    while not done:
        batch = []
        for count, line in enumerate(csv):
            row = np.fromstring(line, sep=',')
            batch.append(row)
            if count + 1 >= batch_lines:
                break
        else:
            # the file was exhausted before the batch filled up
            done = True
        if batch:  # skip the empty last batch when the file ends on a batch boundary
            npy.append(np.array(batch))
(code is not tested)

What is the recommended way to access data from R data.table in python? Can I avoid writing data to disc?

Is there some recommended way to pass data from R (in the form of data.table) to Python without having to save the data to disc? I know I could use python modules from R using reticulate (and I suppose the same thing can be done on the other side using rpy2), but from what I've read that hurts the overall performance of the libraries and therefore there is quite a big chance that it's better to store to disc my r data.table and read that same data from disc using python and running, say, lightgbm, than to try to run lightgbm using reticulate or data.table using rpy2.
Why don't I just stick to either R or Python:
I prefer using R data.table (as opposed to pandas) for my data manipulations, because it is way faster, more memory efficient, and has a lot of features which I like, such as non-equi joins, rolling joins, cartesian joins, and pretty straightforward melting and casting. I also like that whenever I ask a data.table related question on Stack Overflow, I get a high-quality answer pretty fast, while for pandas I haven't been so successful. However, there are tasks for which I prefer Python, such as gradient boosting or neural networks.

There is no recommended way.
In theory you have to dump R data.frame to disk and read it in python.
In practice (assuming production grade operating system), you can use "RAM disk" location /dev/shm/ so you essentially write data to a file that resides in RAM memory and then read it from RAM directly, without the need to dump data to disk memory.
Example usage:
fwrite(iris, "/dev/shm/data.csv")
d = fread("/dev/shm/data.csv")
unlink("/dev/shm/data.csv")
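On the Python side the same idea could look roughly like this. It is a sketch only: NumPy's savetxt/loadtxt stand in for the R writer and a fast Python reader, the file name is made up, and the fallback to the regular temp directory is an assumption added so the snippet also runs where /dev/shm does not exist.

```python
import os
import tempfile

import numpy as np

# /dev/shm is RAM-backed on most Linux systems; fall back to the regular
# temp directory elsewhere (the fallback is an assumption for portability).
use_shm = os.path.isdir("/dev/shm") and os.access("/dev/shm", os.W_OK)
ram_dir = "/dev/shm" if use_shm else tempfile.gettempdir()
path = os.path.join(ram_dir, f"data_{os.getpid()}.csv")  # hypothetical file name

written = np.array([[5.1, 3.5], [4.9, 3.0]])
np.savetxt(path, written, delimiter=",")  # stands in for R's fwrite
try:
    # Python-side read of the file that lives in RAM.
    read_back = np.loadtxt(path, delimiter=",", ndmin=2)
finally:
    os.remove(path)  # same cleanup as unlink() in the R example
```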
As for the format, you have the following options:
csv - universal and portable format
data.table's fwrite function is super fast and produces portable csv data file. Be sure to enable all cpu threads with setDTthreads(0L) before using fwrite on a multi-core machine.
Then in Python you need to read the csv file, for which the Python datatable module will be very fast; if needed, the object can then be converted to pandas using x.to_pandas().
feather - "portable" binary format
Another option is to use R's arrow package and function write_feather, and then read data in python using pyarrow module and read_feather.
This format should be faster than csv in most cases; see the timings below. For writing data the difference might not be that big, but reading will be much faster in most cases, especially when reading many character variables in R (although that is not your use case, because you read in Python). On the other hand, it is not really portable yet (see apache/arrow#8732). Moreover, if a new version 3 is eventually released, files saved with the current feather format might no longer be compatible.
fst - fast binary format
fst can be used as a faster alternative to the feather format, but it is not yet possible to read fst data in Python, so this method cannot be applied to your problem as of now. You can track the progress of this feature request in https://github.com/fstpackage/fst/issues/184; once that issue is resolved, it will probably address your question in the fastest manner.
Using the following scripts (R to write the data, then Python to read it):
library(data.table)
setDTthreads(0L) ## 40
N = 1e8L
x = setDT(lapply(1:10, function(...) sample.int(N)))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))
N = 1e8L
x = setDT(lapply(1:10, function(...) runif(N)))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))
N = 1e7L
x = setDT(lapply(1:10, function(...) paste0("id",sample.int(N))))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))
import datatable as dt
import timeit
import gc
from pyarrow import feather
gc.collect()
t_start = timeit.default_timer()
x = dt.fread("/dev/shm/data.csv")
print(timeit.default_timer() - t_start, flush=True)
gc.collect()
t_start = timeit.default_timer()
y = x.to_pandas()
print(timeit.default_timer() - t_start, flush=True)
del x, y
gc.collect()
t_start = timeit.default_timer()
x = feather.read_feather("/dev/shm/data.feather", memory_map=False)
print(timeit.default_timer() - t_start, flush=True)
del x
I got the following timings:
integer:
    write: feather 2.7s vs csv 5.7s
    read: feather 2.8s vs csv 111s+3s
double:
    write: feather 5.7s vs csv 10.8s
    read: feather 5.1s vs csv 180s+4.9s
character:
    write: feather 50.2s vs csv 2.8s
    read: feather 35s vs csv 14s+16s
Based on the presented data cases (1e8 rows for int/double, 1e7 rows for character; 10 columns: int/double/character) we can conclude the following:
writing int and double is around 2 times slower for csv than feather
writing character is around 20 times faster for csv than feather
reading int and double is much slower for csv than feather
conversion int and double from python datatable to pandas is relatively cheap
reading character is around 2 times faster for csv than feather
conversion character from python datatable to pandas is expensive
Note that these are very basic data cases, be sure to check timings on your actual data.

python read zipfile into numpy-array efficiently

I want to read a zipfile into memory and extract its content into a numpy array (as numpy datatypes). This needs to happen in an extremely efficient/fast manner, since the files are rather big and there are many of them. Unfortunately, looking at similar questions didn't help me, because I couldn't find a way to convert the data into numpy datatypes at the time of reading. Speed also turned out to be a big problem.
For example: The zipfile "log_ks818.zip" contains "log_file.csv", which contains the needed data in the following format (yyyymmdd hhnnsszzz,float,float,zero):
20161001 190000100,1.000500,1.000800,0
20161001 190001000,1.001000,1.002000,0
20161001 190002500,1.001500,1.001200,0
...
The fastest that I managed to do so far (using pandas):
import io
import zipfile
import pandas as pd
zfile = zipfile.ZipFile("log_ks818.zip", 'r')
df = pd.read_csv(io.BytesIO(zfile.read("log_file.csv")), header=None, usecols=[0, 1, 2], delimiter=',', encoding='utf-8')
print(df.head())
However this takes around 6 seconds for ~2,000,000 lines in the file (~80MB if unpacked), which is too slow (plus it's not a numpy object). When I compared the read-in speed of different numpy/pandas-methods, using the extracted file on the hard drive to test, np.fromfile performed the best with 0.08 seconds to simply get it into memory. It would be nice if it was possible to stay in this magnitude when reading the data from the zip-file.

I think this is not a problem of read speed from disk. Even on an HDD, reading 80MB into memory can be done within a second.
In my experience, the time cost is dominated by the decompression step. If you work with already extracted data, I believe it won't cost you much.
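One way to check where the time actually goes is to time decompression and parsing separately. The sketch below reads one CSV member of a zip archive straight into a NumPy array without extracting it to disk; the function name is made up, and it only parses the two float columns of the format shown in the question (the timestamp column is skipped).

```python
import io
import time
import zipfile

import numpy as np


def read_zipped_csv(zip_path, member, usecols=(1, 2)):
    """Return (array, decompress_seconds, parse_seconds) for one CSV in a zip."""
    # Decompress the member fully into memory.
    t0 = time.perf_counter()
    with zipfile.ZipFile(zip_path) as zf:
        raw = zf.read(member)
    decompress_seconds = time.perf_counter() - t0

    # Parse only the selected numeric columns from the in-memory text.
    t0 = time.perf_counter()
    text = io.StringIO(raw.decode("utf-8"))
    values = np.loadtxt(text, delimiter=",", usecols=usecols, ndmin=2)
    parse_seconds = time.perf_counter() - t0

    return values, decompress_seconds, parse_seconds
```

Called as read_zipped_csv("log_ks818.zip", "log_file.csv"), the two timings show whether decompression or text parsing is the larger share on your files.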

Efficient reading of netcdf variable in python

I need to be able to quickly read lots of netCDF variables in python (1 variable per file). I'm finding that the Dataset function in netCDF4 library is rather slow compared to reading utilities in other languages (e.g., IDL).
My variables have shape of (2600,5200) and type float. They don't seem that big to me (filesize = 52Mb).
Here is my code:
import numpy as np
from netCDF4 import Dataset
import time
file = '20151120-235839.netcdf'
t0=time.time()
openFile = Dataset(file,'r')
raw_data = openFile.variables['MergedReflectivityQCComposite']
data = np.copy(raw_data)
openFile.close()
print(time.time() - t0)
It takes about 3 seconds to read one variable (one file). I think the main slowdown is np.copy. raw_data is <type 'netCDF4.Variable'>, thus the copy. Is this the best/fastest way to do netCDF reads in python?
Thanks.

The power of Numpy is that you can create views into the existing data in memory via the metadata it retains about the data. So a copy will always be slower than a view, which works via pointers. As JCOidl says, it's not clear why you don't just use:
raw_data = openFile.variables['MergedReflectivityQCComposite'][:]
For more info see SciPy Cookbook and SO View onto a numpy array?
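The view-versus-copy distinction for in-memory arrays can be illustrated with a small sketch. Note that this covers plain ndarrays only: as far as I know, slicing a netCDF4 Variable with [:] reads the values from the file into a new array, so the analogy is about avoiding the extra np.copy step, not about avoiding the read.

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)

view = a[:]        # basic slicing of an ndarray: new array object, same buffer
copy = np.copy(a)  # allocates a new buffer and copies every element

shares_view = np.shares_memory(a, view)
shares_copy = np.shares_memory(a, copy)

view[0, 0] = -1.0  # writes through to `a` because the buffer is shared

print("view shares memory:", shares_view)
print("copy shares memory:", shares_copy)
```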

I'm not sure what to say about the np.copy operation (which is indeed slow), but I find that the PyNIO module from UCAR works well for both NetCDF and HDF files. This will place data into a numpy array:
import Nio
f = Nio.open_file(file, format="netcdf")
data = f.variables['MergedReflectivityQCComposite'][:]
f.close()
Testing your code versus the PyNIO code on a netCDF file I have resulted in 1.1 seconds for PyNIO, versus 3.1 seconds for the netCDF4 module. Your results may vary; worth a look though.

You can use xarray for that.
%matplotlib inline
import xarray as xr
### Single netcdf file ###
ds = xr.open_dataset('path/file.nc')
### Opening multiple NetCDF files and concatenating them by time ####
# (recent xarray versions may also require combine='nested' when concat_dim is given)
ds = xr.open_mfdataset('path/*.nc', concat_dim='time')
To read the variable you can simply type ds.MergedReflectivityQCComposite or ds['MergedReflectivityQCComposite'][:]
You can also use xr.load_dataset but I find that it uses up more space than the open function. For xr.open_mfdataset, you can also chunk along the dimensions of the file if you want. There are other options for both functions and you might be interested to learn more about it in the xarray documentation.

Killed/MemoryError when creating a large dask.dataframe from delayed collection

I am trying to create a dask.dataframe from a bunch of large CSV files (currently 12 files, 8-10 million lines and 50 columns each). A few of them might fit together into my system memory but all of them at once definitely will not, hence the use of dask instead of regular pandas.
Since reading each csv file involves some extra work (adding columns with data from the file path), I tried creating the dask.dataframe from a list of delayed objects, similar to this example.
This is my code:
import dask.dataframe as dd
from dask.delayed import delayed
import os
import pandas as pd
def read_file_to_dataframe(file_path):
    df = pd.read_csv(file_path)
    df['some_extra_column'] = 'some_extra_value'
    return df
if __name__ == '__main__':
    path = '/path/to/my/files'
    delayed_collection = list()
    for rootdir, subdirs, files in os.walk(path):
        for filename in files:
            if filename.endswith('.csv'):
                file_path = os.path.join(rootdir, filename)
                delayed_reader = delayed(read_file_to_dataframe)(file_path)
                delayed_collection.append(delayed_reader)
    df = dd.from_delayed(delayed_collection)
    print(df.compute())
When starting this script (Python 3.4, dask 0.12.0), it runs for a couple of minutes while my system memory constantly fills up. When it is fully used, everything starts lagging and it runs for some more minutes, then it crashes with killed or MemoryError.
I thought the whole point of dask.dataframe was to be able to operate on larger-than-memory dataframes that span over multiple files on disk, so what am I doing wrong here?
edit: Reading the files instead with df = dd.read_csv(path + '/*.csv') seems to work fine as far as I can see. However, this does not allow me to alter each single dataframe with additional data from the file path.
edit #2:
Following MRocklin's answer, I tried to read my data with dask's read_bytes() method as well as using the single-threaded scheduler as well as doing both in combination.
Still, even when reading chunks of 100MB in single-threaded mode on a laptop with 8GB of memory, my process gets killed sooner or later. Running the code stated below on a bunch of small files (around 1MB each) of similar shape works fine though.
Any ideas what I am doing wrong here?
import dask
from dask.bytes import read_bytes
import dask.dataframe as dd
from dask.delayed import delayed
from io import BytesIO
import pandas as pd
def create_df_from_bytesio(bytesio):
    df = pd.read_csv(bytesio)
    return df
def create_bytesio_from_bytes(block):
    bytesio = BytesIO(block)
    return bytesio
path = '/path/to/my/files/*.csv'
sample, blocks = read_bytes(path, delimiter=b'\n', blocksize=1024*1024*100)
delayed_collection = list()
for datafile in blocks:
    for block in datafile:
        bytesio = delayed(create_bytesio_from_bytes)(block)
        df = delayed(create_df_from_bytesio)(bytesio)
        delayed_collection.append(df)
dask_df = dd.from_delayed(delayed_collection)
print(dask_df.compute(get=dask.async.get_sync))

If each of your files is large then a few concurrent calls to read_file_to_dataframe might be flooding memory before Dask ever gets a chance to be clever.
Dask tries to operate in low memory by running functions in an order such that it can delete intermediate results quickly. However if the results of just a few functions can fill up memory then Dask may never have a chance to delete things. For example if each of your functions produced a 2GB dataframe and if you had eight threads running at once, then your functions might produce 16GB of data before Dask's scheduling policies can kick in.
Some options
Use dask.bytes.read_bytes
The reason why read_csv works is that it chunks up large CSV files into many ~100MB blocks of bytes (see the blocksize= keyword argument). You could do this too, although it's tricky because you need to always break on an endline.
The dask.bytes.read_bytes function can help you here. It can convert a single path into a list of delayed objects, each corresponding to a byte range of that file that starts and stops cleanly on a delimiter. You would then put these bytes into an io.BytesIO (standard library) and call pandas.read_csv on that. Beware that you'll also have to handle headers and such. The docstring to that function is extensive and should provide more help.
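To make the "break on an endline" idea concrete, here is a plain-Python sketch of splitting a file into byte ranges that always end on a newline. It is not Dask's implementation and ignores header handling; the function names are invented for illustration.

```python
import os


def newline_aligned_blocks(path, blocksize):
    """Yield (offset, length) byte ranges of roughly blocksize that end on a newline."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as fh:
        start = 0
        while start < file_size:
            # Jump to the tentative end of the block, then extend to the next newline.
            fh.seek(min(start + blocksize, file_size))
            fh.readline()
            end = fh.tell()
            yield start, end - start
            start = end


def read_block(path, offset, length):
    """Read one byte range produced by newline_aligned_blocks."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length)
```

Each block returned by read_block can then be wrapped in io.BytesIO and passed to pandas.read_csv, with the column names supplied explicitly for every block after the first.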
Use a single thread
In the example above everything would be fine if we didn't have the 8x multiplier from parallelism. I suspect that if you only ran a single function at once that things would probably pipeline without ever reaching your memory limit. You can set dask to use only a single thread with the following line
dask.set_options(get=dask.async.get_sync)
Note: For Dask versions >= 0.15, you need to use dask.local.get_sync instead. In current Dask releases, the equivalent setting is dask.config.set(scheduler='synchronous').
Make sure that results fit in memory (response to edit 2)
If you make a dask.dataframe and then compute it immediately
ddf = dd.read_csv(...)
df = ddf.compute()
You're loading in all of the data into a Pandas dataframe, which will eventually blow up memory. Instead it's better to operate on the Dask dataframe and only compute on small results.
# result = df.compute() # large result fills memory
result = df.groupby(...).column.mean().compute() # small result
Convert to a different format
CSV is a pervasive and pragmatic format, but also has some flaws. You might consider a data format like HDF5 or Parquet.
