Pandas to_csv() slow when writing to network drive - python

When writing to a Windows network drive with the pandas to_csv() function, the write operation is considerably slower than when writing to a local disk. This is obviously partly a function of network latency, but I find that writing the data to a StringIO object first and then writing the contents of the StringIO object to the network drive is considerably faster than calling to_csv() directly with the network path, i.e.
from io import StringIO

# Slow
df.to_csv("/network/drive/test.csv")

# Fast
buf = StringIO()
df.to_csv(buf)
with open("/network/drive/test.csv", "w") as fh:
    fh.write(buf.getvalue())
I likewise find that when using the fwrite() function from the R data.table package there is a much smaller difference in write time between the local and network drives.
Given that I frequently need to write to a network disk, I am considering using the "fast" StringIO method above, but I am curious whether there is some option I am overlooking in to_csv() that would get the same result?
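One thing that might be worth testing (a hedged sketch, not a confirmed to_csv() option): to_csv() also accepts an already-open file object, so the write buffering can be controlled by opening the network file with a large buffer yourself. The 4 MiB buffer size below is an arbitrary illustration, and whether this matches the StringIO approach depends on how the share handles many small writes.
# Hedged sketch: pass an open handle with a large write buffer to to_csv().
# The 4 MiB buffer size is an arbitrary choice for illustration.
with open("/network/drive/test.csv", "w", newline="", buffering=4 * 1024 * 1024) as fh:
    df.to_csv(fh)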

Related

Memory usage due to pickle/joblib

I have a significantly large dataset consisting of several thousand files spread across different directories. These files all have different formats and come from different sensors with different sampling rates. Basically, a mess. I created a Python module that is able to walk through these folders and make sense of all this data, reformat it, and get it into a pandas DataFrame that I can use for effective and easy resampling, and that in general makes it easier to work with.
The problem is that the resulting DataFrame is big and takes a large amount of RAM. Loading several of these datasets leaves too little memory available to actually train an ML model. And reading the data is painfully slow.
So my solution is a two-part approach. First, I read the dataset into a big variable (a dict of nested pandas DataFrames), compute a reduced, derived DataFrame with the information I actually need to train my model, and then remove the dict variable from memory. Not ideal, but it works. However, further computations sometimes need re-reading the data and, as stated previously, that is slow.
Enter the second part. Before removing the dict from memory, I pickle it into a file. sklearn actually recommends using joblib, so that's what I use. Once the single files for the dataset are stored in the working directory, the reading stage is about 90% faster than reading the scattered data, most likely because it is loading a single large file directly into memory rather than reading and reformatting thousands of files across different directories.
Here's my problem. The same code, when reading the data from the scattered files, ends up using about 70% less RAM than when reading the pickled data. So, although reading the pickled data is faster, it ends up using much more memory. Has anybody experienced something like this?
Given that there are some access issues with the data (it is located on a network drive with some odd restrictions on user access) and the fact that I need to make it as user-friendly as possible for other people, I'm using a Jupyter notebook. My IT department provides a web tool with all the packages required to read the network drive out of the box and run Jupyter there, whereas running from a VM would require manually configuring the network drive to access the data, and that part is not user-friendly. The Jupyter tool requires only login information, while the VM requires basic Linux sysadmin knowledge.
I'm using Python 3.9.6. I'll keep trying to get an MWE that reproduces a similar situation. So far I have one that shows the opposite behaviour (loading the pickled dataset consumes less memory than reading it directly). This might be because of the particular structure of the dict with nested DataFrames.
MWE (Warning, running this code will create a 4GB file in your hard drive):
import numpy as np
import psutil
from os.path import exists
from os import getpid
from joblib import dump, load

## WARNING. THIS CODE SAVES A LARGE FILE INTO YOUR HARD DRIVE
def read_compute():
    if exists('df.joblib'):
        df = load('df.joblib')
        print('==== df loaded from .joblib')
    else:
        df = np.random.rand(1000000, 500)
        dump(df, 'df.joblib')
        print('=== df created and dumped')
    tab = df[:100, :10]
    del df
    return tab

table = read_compute()
print(f'{psutil.Process(getpid()).memory_info().rss / 1024 ** 2} MB')
With this, when running without the df.joblib file in the current working directory, I get
=== df created and dumped
3899.62890625 MB
And then, after that file is created, I restart the kernel and run the same code again, getting
==== df loaded from .joblib
1588.5234375 MB
In my actual case, with the format of my data, I have the opposite effect.
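As a hedged side note for experimenting with the MWE above: joblib.load() accepts an mmap_mode argument that memory-maps numpy arrays in the dump instead of reading them fully into RAM, which can change the RSS numbers reported above. Whether it helps with a dict of nested DataFrames is an assumption to verify.
from joblib import load

# Hedged sketch: memory-map the arrays in the dump instead of loading them
# fully into RAM. 'df.joblib' is the file produced by the MWE above.
df = load('df.joblib', mmap_mode='r')  # returns a read-only, memmap-backed array
print(df[:100, :10].mean())            # slices are read from disk on demand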

Single file unzip optimization using python

I have a large zip file that contains 1 file inside.
I want to unzip that file to a given directory for further processing and used this code:
def unzip(zipfile: ZipFile, filename: str, dest: str):
    ZipFile.extract(zipfile, filename, dest)
This function is called using:
with ZipFile(file_path, "r") as zip_source:
    unzip(zip_source, zip_source.infolist()[0], extract_path)  # extract_path is correctly defined earlier in the code
It seems like unzipping a large file takes a long time (file size > 500 MB) and I would like to optimize this solution.
All the optimizations I found were multiprocessing-based, aimed at making the extraction of multiple files faster; however, my zip contains only a single file, so multiprocessing doesn't seem to be the answer.
You cannot parallelize the decompression of a zip file with one file inside, as long as the file is actually compressed using the usual compression algorithms (LZ77/LZW/LZSS). These algorithms are intrinsically sequential.
Moreover, these decompression methods are known to be slow (often much slower than reading the file from a storage device). This is mainly because of the algorithms themselves: their complexity, and the fact that most mainstream processors cannot speed the computation up by a large margin.
Thus, there is no way to decompress the file faster, although you might find a slightly faster implementation by using another library.
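If you do want to experiment, here is a hedged sketch that streams the single member out with the standard library instead of ZipFile.extract(), using a large copy buffer. The 1 MiB buffer size is an illustrative assumption, and given the algorithmic bottleneck described above any gain is likely to be modest.
import shutil
from zipfile import ZipFile

# Hedged sketch: stream-decompress the single member with a large copy buffer.
# file_path and extract_path are the variables from the question; the buffer
# size is an illustrative assumption.
with ZipFile(file_path, "r") as zip_source:
    member = zip_source.infolist()[0]
    with zip_source.open(member, "r") as src, \
            open(f"{extract_path}/{member.filename}", "wb") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)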

How to speed-up read_pickle?

Need to write and read huge pandas DF. I am using pickle format right now:
.to_pickle to write DF to pickle
read_pickle to read pickle file.
I have a couple of issues when the pickle file size is huge (2 GB in this case):
Read speed is very slow (23 seconds to read the data)
Increasing RAM/cores in the VM does not improve the speed
How can I read it faster? Can I use some other format which is much faster?
Can I leverage parallel processing/more core functionality to read it faster?
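One alternative worth benchmarking, as a hedged sketch: columnar formats such as Parquet or Feather, which pandas supports via to_parquet()/read_parquet() and to_feather()/read_feather() (pyarrow required). The file names below are illustrative assumptions, and actual speed depends on your data.
import pandas as pd

# Hedged sketch: store the same DataFrame as Parquet instead of pickle.
# Requires pyarrow; 'data.parquet' is an illustrative file name.
df.to_parquet('data.parquet', engine='pyarrow')
df2 = pd.read_parquet('data.parquet', engine='pyarrow')

# Feather is another option that is often fast for intermediate storage
# (older pandas versions may require a default RangeIndex).
df.to_feather('data.feather')
df3 = pd.read_feather('data.feather')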

Data compression in python/numpy

I'm looking at using the Amazon cloud for all my simulation needs. The resulting sim files are quite large, and I would like to move them over to my local drive for ease of analysis, etc. You have to pay per amount of data you transfer, so I want to compress all my sim solutions to be as small as possible. They are simply numpy arrays saved in the form of .mat files, using:
import scipy.io as sio
sio.savemat(filepath, mdict, do_compression=True)  # mdict: the dict of arrays being saved
So my question is: what is the best way to compress numpy arrays (they are currently stored in .mat files, but I could store them using any Python method)? Using Python compression when saving, Linux compression, or both?
I am in the linux environment, and I am open to any kind of file compression.
Unless you know something special about the arrays (e.g. sparseness, or some pattern) you aren't going to do much better than the default compression, and maybe gzip on top of that. In fact you may not even need to gzip the files if you're using HTTP for downloads and your server is configured to do compression. Good lossless compression algorithms rarely vary by more than 10%.
If savemat works as advertised, you should be able to get gzip compression all in Python with:
import scipy.io as sio
import gzip

# mdict is the dict of arrays to save (as in savemat's signature)
with gzip.open(filepath_dot_gz, 'wb') as f_out:
    sio.savemat(f_out, mdict, do_compression=True)
Also, LZMA (a.k.a. xz) gives very good compression on fairly sparse numpy arrays, albeit it is pretty slow when compressing (and may require more memory as well).
On Ubuntu (Python 2) it is installed with sudo apt-get install python-lzma; on Python 3 the lzma module is part of the standard library.
It is used like any other file-object wrapper, something like this (to load pickled data):
from lzma import LZMAFile
import pickle

# Open through LZMAFile if the file is xz-compressed, otherwise plainly;
# pickle needs the file opened in binary mode.
if fileName.endswith('.xz'):
    dataFile = LZMAFile(fileName, 'rb')
else:
    dataFile = open(fileName, 'rb')
data = pickle.load(dataFile)
Though it won't necessarily give you the highest compression ratios, I've had good experiences saving compressed numpy arrays to disk with python-blosc. It is very fast and integrates well with numpy.
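A hedged sketch of what that can look like, assuming the python-blosc package is installed; pack_array()/unpack_array() are the numpy convenience helpers I believe it provides, but treat the exact API as an assumption to verify against the blosc docs.
import numpy as np
import blosc

arr = np.random.rand(1000, 1000)

# Compress the array to bytes and write them to disk
packed = blosc.pack_array(arr)            # assumed python-blosc helper
with open('arr.blosc', 'wb') as fh:
    fh.write(packed)

# Read the bytes back and decompress into a numpy array
with open('arr.blosc', 'rb') as fh:
    arr2 = blosc.unpack_array(fh.read())  # assumed python-blosc helper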

What's the best compression algorithm for data dumps

I'm creating data dumps from my site for others to download and analyze. Each dump will be a giant XML file.
I'm trying to figure out the best compression algorithm that:
Compresses efficiently (CPU-wise)
Makes the smallest possible file
Is fairly common
I know the basics of compression, but haven't a clue as to which algo fits the bill. I'll be using MySQL and Python to generate the dump, so I'll need something with a good python library.
GZIP with the standard compression level should be fine for most cases. Higher compression levels mean more CPU time. BZ2 packs better but is also slower. Well, there is always a trade-off between CPU consumption/running time and compression efficiency; with the default compression levels, any of them should be fine.
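For reference, a minimal sketch of gzipping such a dump from Python with the standard library gzip module; the file names and compresslevel are illustrative choices.
import gzip
import shutil

# Hedged sketch: compress an already-generated XML dump with gzip.
# 'dump.xml' and compresslevel=6 (a middle setting) are illustrative choices.
with open('dump.xml', 'rb') as src, gzip.open('dump.xml.gz', 'wb', compresslevel=6) as dst:
    shutil.copyfileobj(src, dst)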
