I am new to vaex. I just started using it to speed up some groupby + agg.nunique operations on a ~40-million-row DataFrame in a Jupyter notebook.
It works much faster than pandas and I'm really excited to use it more often, but sometimes I run into a weird problem:
executing a notebook cell with simple vaex filter code like:
vf[vf.Item_count >1]
finishes in under 1 s, but when I run the same cell again it can take several minutes and won't respond to keyboard interrupts.
I run vaex and pandas inside VS Code on a Windows 10 machine with 32 GB of RAM.
My DataFrame takes around 1 GB in pandas.
Could you please help me work around these slowdowns?
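For reference, here is a rough sketch of the kind of code I'm running (the tiny stand-in frame and the Item_id/Customer_id column names are placeholders, not my real data):

import pandas as pd
import vaex

# Stand-in for my real ~40-million-row pandas DataFrame
pdf = pd.DataFrame({'Item_id': [1, 1, 2],
                    'Customer_id': [10, 11, 10],
                    'Item_count': [1, 2, 3]})

# Convert the pandas DataFrame to a vaex DataFrame
vf = vaex.from_pandas(pdf, copy_index=False)

# The kind of aggregation I'm speeding up: nunique per group
result = vf.groupby(by='Item_id',
                    agg={'n_customers': vaex.agg.nunique('Customer_id')})

# The filter cell that is sometimes instant and sometimes hangs for minutes
filtered = vf[vf.Item_count > 1]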
I have a fairly large dataset consisting of several thousand files spread across different directories. These files all have different formats and come from different sensors with different sampling rates. Basically, a mess. I created a Python module that can walk these folders, make sense of all this data, reformat it, and load it into a pandas DataFrame that I can use for effective and easy resampling, and that is, in general, easier to work with.
The problem is that the resulting DataFrame is big and takes a large amount of RAM. Loading several of these datasets leaves too little memory available to actually train an ML model. And reading the data is painfully slow.
So my solution is a two-part approach. First, I read the dataset into a big variable, a dict with nested pandas DataFrames, then compute a reduced derived DataFrame with the information I actually need to train my model, and remove the dict variable from memory. Not ideal, but it works. However, further computations sometimes need re-reading the data, and as stated previously, that is slow.
Enter the second part. Before removing the dict from memory, I pickle it into a file. sklearn actually recommends using joblib, so that's what I use. Once these single files for the datasets are stored in the working directory, the reading stage is about 90% faster than reading the scattered data, most likely because loading a single large file directly into memory beats reading and reformatting thousands of files across different directories.
Here's my problem: the same code, when reading the data from the scattered files, ends up using about 70% less RAM than when reading the pickled data. So, although reading the pickled file is faster, it ends up using much more memory. Has anybody experienced something like this?
Given that there are some access issues with the data (it is located on a network drive with some odd restrictions on user access) and the fact that I need to make this as user friendly as possible for other people, I'm using a Jupyter notebook. My IT department provides a web tool with all the required packages that can read the network drive out of the box and run Jupyter there, whereas running from a VM would require manually configuring the network drive to access the data, and that part is not user friendly. The Jupyter tool requires only login information, while the VM requires basic Linux sysadmin knowledge.
I'm using Python 3.9.6. I'll keep trying to build an MWE that reproduces the situation. So far I have one that shows the opposite behaviour (loading the pickled dataset consumes less memory than reading it directly), which might be because of the particular structure of the dict with nested DataFrames.
MWE (warning: running this code will create a 4 GB file on your hard drive):
import numpy as np
import psutil
from os.path import exists
from os import getpid
from joblib import dump, load

## WARNING. THIS CODE SAVES A LARGE FILE INTO YOUR HARD DRIVE
def read_compute():
    if exists('df.joblib'):
        df = load('df.joblib')
        print('==== df loaded from .joblib')
    else:
        df = np.random.rand(1000000, 500)
        dump(df, 'df.joblib')
        print('=== df created and dumped')
    tab = df[:100, :10]
    del df
    return tab

table = read_compute()
print(f'{psutil.Process(getpid()).memory_info().rss / 1024 ** 2} MB')
With this, running without the df.joblib file in the working directory, I get
=== df created and dumped
3899.62890625 MB
Then, after that file has been created, I restart the kernel and run the same code again, getting
==== df loaded from .joblib
1588.5234375 MB
In my actual case, with the format of my data, I have the opposite effect.
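One thing I haven't tried yet is joblib's mmap_mode, which memory-maps the numpy arrays stored in the dump instead of copying them into RAM; I'm not sure how much it would help with my dict of nested DataFrames, but it might be worth checking:

from joblib import load

# Assumes the df.joblib file from the MWE above already exists.
# mmap_mode='r' maps array data from disk read-only instead of loading it all into RAM.
df = load('df.joblib', mmap_mode='r')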
Do I have to manually free arrays and dataframes as in C and C++?
I'm currently working with large arrays (thousands by thousands of elements), about 100-300 MB each as CSV.
I'm using a Jupyter notebook (Python 3) to do normalizations and other data handling, and I have run my code hundreds of times.
I'm new to Python, and it only recently occurred to me that a memory leak might be happening. (I'm not getting any errors.)
I've read up a bit on gc.collect() but had trouble understanding it.
Any help is appreciated, thanks!
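To make the question concrete, here is a minimal sketch of the pattern I've read about, dropping the last reference with del and then calling gc.collect() (the array size is just a placeholder):

import gc
import os

import numpy as np
import psutil

def rss_mb():
    # Resident memory of the current Python process in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

a = np.random.rand(5000, 5000)  # roughly 200 MB, similar to my CSVs
print(f'after allocation: {rss_mb():.0f} MB')

del a         # drop the last reference; CPython's reference counting frees it right away
gc.collect()  # mainly matters for objects caught in reference cycles
print(f'after del + gc.collect(): {rss_mb():.0f} MB')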
I want to merge two DataFrames using the merge function in pandas. When I try to do so on a common column, the Jupyter notebook gives me the following error: "The kernel appears to have died. It will restart automatically." Each DataFrame is about 50k rows. But when I try the same thing with only 50 rows from each DataFrame, it works fine. I was wondering if anyone has a suggestion.
This is most likely a RAM/memory issue on your machine. Check how much RAM you have and monitor it while you do the merge operation.
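As a rough sketch of how you might monitor it (the stand-in frames and the 'id' column are placeholders for your data, and psutil needs to be installed):

import os

import pandas as pd
import psutil

def rss_mb():
    # Resident memory of the current Python process in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

# Stand-in frames; replace with your real DataFrames
df1 = pd.DataFrame({'id': range(50_000), 'a': 1.0})
df2 = pd.DataFrame({'id': range(50_000), 'b': 2.0})

print(f'before merge: {rss_mb():.0f} MB')
merged = df1.merge(df2, on='id', how='inner')
print(f'after merge:  {rss_mb():.0f} MB ({len(merged)} rows)')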
I am working with a 1.7 GB dataset in a Python Jupyter (IPython) notebook. I read in the .csv I am working with using pd.read_csv, and my RAM usage shoots up to about 7 GB.
When I tried to plot the time series of one of my columns, my RAM shot up to nearly 16 GB. I was worried about the performance of my laptop, so I decided to interrupt the kernel.
My question is two-fold:
If I let the cell run, would the plot eventually have shown up? Or is it unable to plot my chart because it reached its RAM limit?
My data is a time series of second-by-second values over the course of a month and contains mostly zeroes. Should I remove these zeroes from the data, and would that make it easier to plot?
It will eventually show up. Even if it needs more than 16 GB, it will use the pagefile to get more virtual memory. This is called paging. Paging is an important part of memory management in modern operating systems: it uses secondary storage to let programs exceed the size of available physical memory. When a computer runs out of RAM, the operating system moves pages of memory over to the hard disk to free up RAM for other processes. This keeps the operating system from running out of memory and crashing.
For more information: http://searchservervirtualization.techtarget.com/definition/memory-paging
It depends on whether or not you need the zeroes. Dropping them may make the data easier to plot or visualize, depending on how many points they add.
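As an illustrative sketch (the series below is a stand-in for your month of second-by-second data), you could either drop the zero rows or resample to a coarser interval before plotting, so far fewer points reach the plotting library:

import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for a month of second-by-second data that is mostly zeros
idx = pd.date_range('2023-01-01', periods=30 * 24 * 3600, freq='s')
s = pd.Series(0.0, index=idx)
s.iloc[::3600] = 1.0  # one non-zero value per hour

nonzero = s[s != 0]                     # option 1: drop the zeros
per_minute = s.resample('1min').mean()  # option 2: downsample to one point per minute

per_minute.plot()
plt.show()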