How can I import many binary files in Dask? - python

I have many binary files (.tdms format, similar to .wav) stored in S3 and I would like to read them with nptdms then process them in a distributed fashion with Dask on a cluster.
In PySpark there is pyspark.SparkContext.binaryFiles(), which produces an RDD with a bytearray for each input file; that would be a simple solution to this problem.
I have not found an equivalent function in Dask - is there one? If not, how could the equivalent functionality be achieved in Dask?
I noticed there's dask.bytes.read_bytes() if it's necessary to involve this. However, nptdms can't read a chunk of a file - it needs the entire file to be available, and I'm not sure how to accomplish that.

dask.bytes.read_bytes() will give you whole files if you use blocksize=None, i.e., exactly one block per file. The most common use case for that is compressed files (e.g., gzip) where you can't start mid-stream, but it should work for your use case too. Note that the delayed objects you get each return bytes, not open files.
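A minimal sketch of that path (the S3 pattern and the parse_tdms helper are my own placeholders, and each file is assumed small enough to hold in memory as bytes):
import dask
from dask.bytes import read_bytes
from io import BytesIO
from nptdms import TdmsFile

# blocksize=None -> one delayed block per file, each resolving to that file's raw bytes
sample, blocks = read_bytes("s3://bucket/path/*.tdms", blocksize=None)

@dask.delayed
def parse_tdms(raw_bytes):
    # nptdms needs the whole file, so wrap the bytes in a file-like object
    return TdmsFile.read(BytesIO(raw_bytes))

tdms_files = [parse_tdms(file_blocks[0]) for file_blocks in blocks]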
Alternatively, you can use fsspec.open_files. This returns OpenFile objects, which are safe to serialise, so you can use them in dask.delayed calls such as:
import dask
import fsspec
from nptdms import TdmsFile

ofs = fsspec.open_files("s3://...", ...)

@dask.delayed
def read_a_file(of):
    with of as f:
        # entering the context actually touches storage
        return TdmsFile.read(f)

tdms = [read_a_file(of) for of in ofs]
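Either way you end up with a list of delayed values; to actually run the reads, compute them (a brief usage note):
import dask

results = dask.compute(*tdms)  # runs on whatever scheduler/cluster is configured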

Related

Speed up reading multiple pickle files

I have a lot of pickle files. Currently I read them in a loop but it takes a lot of time. I would like to speed it up but don't have any idea how to do that.
Multiprocessing wouldn't work because in order to transfer data from a child process to the main process, the data needs to be serialized (pickled) and deserialized anyway.
Using threading wouldn't help either because of the GIL.
I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without GIL). Is there something like this around?
UPDATE
Answering your questions:
Files are partial products of data processing for the purpose of ML
The files contain pandas.Series objects, but the dtype is not known upfront
I want to have many files because we want to pick any subset easily
I want to have many smaller files instead of one big file because deserialization of one big file takes more memory (at some point in time we hold both the serialized string and the deserialized objects)
The size of the files can vary a lot
I use Python 3.7, so I believe it's cPickle in fact
Using pickle is very flexible because I don't have to worry about underlying types - I can save anything
I agree with what has been noted in the comments: due to the constraints of Python itself (chiefly the GIL, as you noted), there may simply be no faster way to load this information than what you are doing now. Or, if there is a way, it may be both highly technical and yield only a modest increase in speed in the end.
That said, depending on the datatypes you have, it may be faster to use quickle or pyrobuf.
I think that the solution would be some library written in C that takes a list of files to read and then runs multiple threads (without GIL). Is there something like this around?
In short: no. pickle is apparently good enough for enough people that there are no major alternative implementations fully compatible with the pickle protocol. In Python 3, cPickle was merged into pickle (as the C extension _pickle), and it does not release the GIL anyway, which is why threading won't help you (search for Py_BEGIN_ALLOW_THREADS in _pickle.c and you will find nothing).
If your data can be restructured into a simpler format like CSV, or a binary format like numpy's npy, there will be less CPU overhead when reading it. Pickle is built for flexibility first rather than speed or compactness. One possible exception to the rule that more complexity means less speed is HDF5 via h5py, which can be fairly complex yet which I have used to max out the bandwidth of a SATA SSD.
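For instance, if a given object is really just a numeric array, a round-trip through numpy's native .npy format usually costs less CPU than pickle (a small illustrative sketch; the file name is made up):
import numpy as np

arr = np.random.randn(1_000_000)
np.save("data.npy", arr)      # plain binary dump of the array
loaded = np.load("data.npy")  # reads it back without pickle's object machinery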
Finally, you mention that you have many, many pickle files, and that itself probably adds no small amount of overhead: each time you open a new file, there is some overhead from the operating system. Conveniently, you can combine pickle files by simply appending them together; you then call Unpickler.load() until you reach the end of the file. Here's a quick example of combining two pickle files using shutil:
import pickle, shutil, os

# some dummy data
d1 = {'a': 1, 'b': 2, 1: 'a', 2: 'b'}
d2 = {'c': 3, 'd': 4, 3: 'c', 4: 'd'}

# create two pickles
with open('test1.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d1)
with open('test2.pickle', 'wb') as f:
    pickle.Pickler(f).dump(d2)

# combine list of pickle files
with open('test3.pickle', 'wb') as dst:
    for pickle_file in ['test1.pickle', 'test2.pickle']:
        with open(pickle_file, 'rb') as src:
            shutil.copyfileobj(src, dst)

# unpack the data
with open('test3.pickle', 'rb') as f:
    p = pickle.Unpickler(f)
    while True:
        try:
            print(p.load())
        except EOFError:
            break

# cleanup
os.remove('test1.pickle')
os.remove('test2.pickle')
os.remove('test3.pickle')
I think you should try mmap (memory-mapped files), which is similar to open() but can be much faster.
Note: if each of your files is large, use mmap; if the files are small, regular reads work fine.
I have written a sample that you can try.
import mmap
import pickle
from time import perf_counter as pf

def load_files(filelist):
    start = pf()  # for rough time calculations
    for filename in filelist:
        with open(filename, mode="rb") as file_obj:
            with mmap.mmap(file_obj.fileno(), length=0, access=mmap.ACCESS_READ) as mmap_file_obj:
                data = pickle.load(mmap_file_obj)
                print(data)
    print(f'Operation took {pf()-start} sec(s)')
Here mmap.ACCESS_READ maps the file read-only. The file_obj returned by open is only used to obtain a file descriptor, which mmap then uses to open a memory-mapped stream to the file.
As you can see in the os.open documentation quoted below, a file descriptor (fd for short) is what identifies an open file at the OS level. We don't have to do anything with file_obj operation-wise; we just need its fileno() method to get its file descriptor. Also note that, as you said in your comment, we are not closing file_obj before mmap_file_obj: the mmap block is closed first.
os.open(file, flags[, mode])
Open the file file and set various flags according to flags and possibly its mode according to mode.
The default mode is 0777 (octal), and the current umask value is first masked out.
Return the file descriptor for the newly opened file.
Give it a try and see how much of an impact it has on your operation.
You can read more about mmap and file descriptors in the Python documentation.
You can try multiprocessing:
import os
import pickle
from multiprocessing import Pool

pickle_list = os.listdir("pickles")
# NOTE: a plain dict is copied into each worker process, so updates made here are
# not visible across workers unless it is shared (e.g. via multiprocessing.Manager)
output_dict = dict.fromkeys(pickle_list, '')

def pickle_process_func(picklename):
    with open(os.path.join("pickles", picklename), 'rb') as file:
        dapickle = pickle.load(file)
    # if you need the previous file's output, wait for it
    while not output_dict[pickle_list[pickle_list.index(picklename) - 1]]:
        continue
    # then do something with the data
    print("loaded")
    output_dict[picklename] = custom_func_i_dunno(dapickle)

if __name__ == '__main__':
    with Pool(processes=10) as pool:
        pool.map(pickle_process_func, pickle_list)
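If the "wait for the previous file" ordering is not actually needed, a simpler variant (my own sketch, not the answer's original approach) is to let pool.map return the loaded objects directly, avoiding shared state entirely:
import os
import pickle
from multiprocessing import Pool

def load_one(picklename):
    # each worker loads one file and returns it through pool.map
    with open(os.path.join("pickles", picklename), "rb") as f:
        return picklename, pickle.load(f)

if __name__ == "__main__":
    names = os.listdir("pickles")
    with Pool(processes=10) as pool:
        output_dict = dict(pool.map(load_one, names))
Note that the loaded objects are pickled again when they travel back to the parent process, which is exactly the overhead the question worries about, so this mainly pays off when per-file parsing or processing dominates.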
Consider using HDF5 via h5py instead of pickle. Performance is generally much better than pickle for numerical data in pandas and numpy data structures, and it supports most common data types and compression.
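A minimal illustration of that route with pandas' HDF5 backend (it assumes the optional "tables" package is installed; the file and key names are made up):
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(1_000_000))

# one HDF5 file can hold many keyed datasets, replacing many small pickles
s.to_hdf("data.h5", key="series_0", mode="w", complevel=5, complib="blosc")
loaded = pd.read_hdf("data.h5", key="series_0")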

Join/merge multiple NetCDF files using xarray

I have a folder with NetCDF files from 2006-2100, in ten year blocks (2011-2020, 2021-2030 etc).
I want to create a new NetCDF file which contains all of these files joined together. So far I have read in the files:
ds = xarray.open_dataset('Path/to/file/20062010.nc')
ds1 = xarray.open_dataset('Path/to/file/20112020.nc')
etc.
Then merged these like this:
dsmerged = xarray.merge([ds,ds1])
This works, but is clunky and there must be a simpler way to automate this process, as I will be doing this for many different folders full of files. Is there a more efficient way to do this?
EDIT:
Trying to join these files using glob:
for filename in glob.glob('path/to/file/.*nc'):
    dsmerged = xarray.merge([filename])
Gives the error:
AttributeError: 'str' object has no attribute 'items'
This is reading only the text of the filename, and not the actual file itself, so it can't merge it. How do I open, store as a variable, then merge without doing it bit by bit?
If you are looking for a clean way to get all your datasets merged together, you can use some form of list comprehension and the xarray.merge function to get it done. The following is an illustration:
ds = xarray.merge([xarray.open_dataset(f) for f in glob.glob('path/to/file/*.nc')])
In response to the out-of-memory issues you encountered, that is probably because you have more files than the Python process can handle at once. The best fix for that is to use the xarray.open_mfdataset function, which uses the dask library under the hood to break the data into smaller chunks to be processed. This is usually more memory efficient and will often allow you to bring your data into Python. With this function, you do not need a for-loop; you can just pass it a glob string of the form "path/to/my/files/*.nc". The following is equivalent to the previously provided solution, but more memory efficient:
ds = xarray.open_mfdataset('path/to/file/*.nc')
I hope this proves useful.
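If the end goal is a single combined file on disk, you can write the merged dataset back out, for example (a short sketch; 'combined.nc' is just a placeholder output name):
import xarray

ds = xarray.open_mfdataset('path/to/file/*.nc')
ds.to_netcdf('combined.nc')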

Reading numpy arrays from GCS into Spark

I have 100 npz files containing numpy arrays in Google Cloud Storage.
I have set up Dataproc with Jupyter and I am trying to read all the numpy arrays into a Spark RDD. What is the best way to load the numpy arrays from Google Cloud Storage into PySpark?
Is there an easy way like np.load("gs://path/to/array.npz") to load the numpy array and then do sc.parallelize on it?
If you plan to scale eventually, you'll want to use the distributed input methods in SparkContext rather than loading files locally in the driver program and relying on sc.parallelize. It sounds like you need to read each file intact, though, so in your case you want:
npz_rdd = sc.binaryFiles("gs://path/to/directory/containing/files/")
Or you can also specify single files if you want, but then you just have an RDD with a single element:
npz_rdd = sc.binaryFiles("gs://path/to/directory/containing/files/arr1.npz")
Then each record is a pair of <filename>,<str of bytes>. On Dataproc, sc.binaryFiles will just automatically work directly with GCS paths, unlike np.load which requires local filesystem paths.
Then in your worker code, you just need to use io.BytesIO to wrap those byte strings in the file-like object you pass to np.load:
from io import BytesIO
import numpy as np

# For example, to create an RDD of the 'arr_0' element of each of the npz archives:
npz_rdd.map(lambda l: np.load(BytesIO(l[1]))['arr_0'])
During development, if you really want to read the files into your main driver program, you can always collapse your RDD down using collect() to retrieve the data locally:
npz_rdd = sc.binaryFiles("gs://path/to/directory/containing/files/arr1.npz")
local_bytes = npz_rdd.collect()[0][1]
local_np_obj = np.load(BytesIO(local_bytes))

Pyspark reading pickled files [duplicate]

My data are available as sets of Python 3 pickled files. Most of them are serialization of Pandas DataFrames.
I'd like to start using Spark because I need more memory and CPU that one computer can have. Also, I'll use HDFS for distributed storage.
As a beginner, I haven't found relevant information explaining how to use pickle files as input files.
Does it exist? If not, is there any workaround?
Thanks a lot
A lot depends on the data itself. Generally speaking, Spark doesn't perform particularly well when it has to read large, non-splittable files. Nevertheless, you can try to use the binaryFiles method and combine it with the standard Python tools. Let's start with some dummy data:
import tempfile
import pandas as pd
import numpy as np

outdir = tempfile.mkdtemp()
for i in range(5):
    pd.DataFrame(
        np.random.randn(10, 2), columns=['foo', 'bar']
    ).to_pickle(tempfile.mkstemp(dir=outdir)[1])
Next we can read it using the binaryFiles method:
rdd = sc.binaryFiles(outdir)
and deserialize individual objects:
import pickle
from io import BytesIO

dfs = rdd.values().map(lambda p: pickle.load(BytesIO(p)))
dfs.first()[:3]

##         foo       bar
## 0 -0.162584 -2.179106
## 1  0.269399 -0.433037
## 2 -0.295244  0.119195
One important note is that this typically requires significantly more memory than simpler methods like textFile.
Another approach is to parallelize only the paths and use libraries which can read directly from a distributed file system like hdfs3. This typically means lower memory requirements at the price of a significantly worse data locality.
Considering these two facts it is typically better to serialize your data in a format which can be loaded with a higher granularity.
Note:
SparkContext provides a pickleFile method, but the name can be misleading: it reads SequenceFiles containing pickled objects, not plain Python pickle files.
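For completeness, a rough sketch of the "parallelize only the paths" approach mentioned above (the paths are hypothetical, and every executor must be able to reach the storage, e.g. a mounted or distributed filesystem that pandas can read from):
import pandas as pd

paths = ["/shared/data/part-0.pkl", "/shared/data/part-1.pkl"]  # placeholder paths

def load_df(path):
    # only the small path string is shipped to the executor,
    # which then reads the file itself
    return pd.read_pickle(path)

dfs = sc.parallelize(paths).map(load_df)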

Killed/MemoryError when creating a large dask.dataframe from delayed collection

I am trying to create a dask.dataframe from a bunch of large CSV files (currently 12 files, 8-10 million lines and 50 columns each). A few of them might fit together into my system memory but all of them at once definitely will not, hence the use of dask instead of regular pandas.
Since reading each csv file involves some extra work (adding columns with data from the file path), I tried creating the dask.dataframe from a list of delayed objects, similar to this example.
This is my code:
import dask.dataframe as dd
from dask.delayed import delayed
import os
import pandas as pd

def read_file_to_dataframe(file_path):
    df = pd.read_csv(file_path)
    df['some_extra_column'] = 'some_extra_value'
    return df

if __name__ == '__main__':
    path = '/path/to/my/files'
    delayed_collection = list()
    for rootdir, subdirs, files in os.walk(path):
        for filename in files:
            if filename.endswith('.csv'):
                file_path = os.path.join(rootdir, filename)
                delayed_reader = delayed(read_file_to_dataframe)(file_path)
                delayed_collection.append(delayed_reader)

    df = dd.from_delayed(delayed_collection)
    print(df.compute())
When starting this script (Python 3.4, dask 0.12.0), it runs for a couple of minutes while my system memory constantly fills up. When it is fully used, everything starts lagging and it runs for some more minutes, then it crashes with killed or MemoryError.
I thought the whole point of dask.dataframe was to be able to operate on larger-than-memory dataframes that span over multiple files on disk, so what am I doing wrong here?
Edit: Reading the files instead with df = dd.read_csv(path + '/*.csv') seems to work fine as far as I can see. However, this does not allow me to alter each individual dataframe with additional data from the file path.
Edit #2:
Following MRocklin's answer, I tried reading my data with dask's read_bytes() method, using the single-threaded scheduler, and also both in combination.
Still, even when reading chunks of 100MB in single-threaded mode on a laptop with 8GB of memory, my process gets killed sooner or later. Running the code below on a bunch of small files (around 1MB each) of similar shape works fine, though.
Any ideas what I am doing wrong here?
import dask
from dask.bytes import read_bytes
import dask.dataframe as dd
from dask.delayed import delayed
from io import BytesIO
import pandas as pd

def create_df_from_bytesio(bytesio):
    df = pd.read_csv(bytesio)
    return df

def create_bytesio_from_bytes(block):
    bytesio = BytesIO(block)
    return bytesio

path = '/path/to/my/files/*.csv'

sample, blocks = read_bytes(path, delimiter=b'\n', blocksize=1024*1024*100)
delayed_collection = list()
for datafile in blocks:
    for block in datafile:
        bytesio = delayed(create_bytesio_from_bytes)(block)
        df = delayed(create_df_from_bytesio)(bytesio)
        delayed_collection.append(df)

dask_df = dd.from_delayed(delayed_collection)
print(dask_df.compute(get=dask.async.get_sync))
If each of your files is large then a few concurrent calls to read_file_to_dataframe might be flooding memory before Dask ever gets a chance to be clever.
Dask tries to operate in low memory by running functions in an order such that it can delete intermediate results quickly. However if the results of just a few functions can fill up memory then Dask may never have a chance to delete things. For example if each of your functions produced a 2GB dataframe and if you had eight threads running at once, then your functions might produce 16GB of data before Dask's scheduling policies can kick in.
Some options
Use dask.bytes.read_bytes
The reason why read_csv works is that it chunks up large CSV files into many ~100MB blocks of bytes (see the blocksize= keyword argument). You could do this too, although it's tricky because you need to always break on an endline.
The dask.bytes.read_bytes function can help you here. It can convert a single path into a list of delayed objects, each corresponding to a byte range of that file that starts and stops cleanly on a delimiter. You would then put these bytes into an io.BytesIO (standard library) and call pandas.read_csv on that. Beware that you'll also have to handle headers and such. The docstring to that function is extensive and should provide more help.
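One workable pattern for the header problem (my own sketch, assuming every file carries the same header; not a canonical recipe): read the column names once from the returned sample, then parse each block with those names, treating only the first block of each file as containing a header row.
import dask
import pandas as pd
from io import BytesIO
from dask.bytes import read_bytes

sample, blocks = read_bytes('/path/to/my/files/*.csv',
                            delimiter=b'\n', blocksize=100 * 2**20)

# the sampled bytes of the first file include the header line
columns = pd.read_csv(BytesIO(sample), nrows=0).columns

@dask.delayed
def block_to_df(raw_bytes, has_header):
    # non-first blocks have no header, so the names are supplied explicitly
    return pd.read_csv(BytesIO(raw_bytes), names=columns,
                       header=0 if has_header else None)

parts = [block_to_df(block, i == 0)
         for file_blocks in blocks
         for i, block in enumerate(file_blocks)]
# dd.from_delayed(parts) then assembles the dask dataframe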
Use a single thread
In the example above everything would be fine if we didn't have the 8x multiplier from parallelism. I suspect that if you only ran a single function at once, things would probably pipeline without ever reaching your memory limit. You can set Dask to use only a single thread with the following line:
dask.set_options(get=dask.async.get_sync)
Note: For Dask versions >= 0.15, you need to use dask.local.get_sync instead.
Make sure that results fit in memory (response to edit 2)
If you make a dask.dataframe and then compute it immediately
ddf = dd.read_csv(...)
df = ddf.compute()
then you're loading all of the data into a pandas dataframe, which will eventually blow up memory. Instead, it's better to operate on the Dask dataframe and only compute small results.
# result = df.compute() # large result fills memory
result = df.groupby(...).column.mean().compute() # small result
Convert to a different format
CSV is a pervasive and pragmatic format, but also has some flaws. You might consider a data format like HDF5 or Parquet.
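For example, a one-off conversion to Parquet looks roughly like this (a sketch; it assumes pyarrow or fastparquet is installed, and 'my_data.parquet' is a placeholder path):
import dask.dataframe as dd

ddf = dd.read_csv('/path/to/my/files/*.csv', blocksize=100 * 2**20)
ddf.to_parquet('my_data.parquet')

# later analyses can then read the columnar data much more cheaply
ddf2 = dd.read_parquet('my_data.parquet')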
