Plotting with Pandas on a laptop with 16GB RAM - python

I am working with a 1.7 GB dataset in a Python Jupyter (IPython) notebook. I read in the .csv I am working with using pd.read_csv, and my RAM usage shoots up to about 7 GB.
When I tried to plot the time series of one of the columns, my RAM usage shot up to nearly 16 GB. I was worried about my laptop's performance, so I decided to interrupt the kernel.
My question is two-fold:
If I let the cell run, would the plot eventually have shown up? Or is it unable to plot the chart because it has hit its RAM limit?
My data is a time series of second-by-second values over the course of a month, and it contains mostly zeroes. Should I remove these zeroes, and would that make the data easier to plot?

It will eventually show up. Even if the process needs more than 16 GB, it will use the pagefile to get more virtual memory. This is called paging. Paging is an important part of memory management in modern operating systems: it uses secondary storage to let programs exceed the size of available physical memory. When a computer runs low on RAM, the operating system (OS) moves pages of memory over to the computer's hard disk to free up RAM for other processes. This keeps the system from running out of memory and crashing, at the cost of much slower access to the paged-out data.
For more information: http://searchservervirtualization.techtarget.com/definition/memory-paging
Whether to drop the zeroes depends on whether you actually need them. Removing them may make the data easier to plot or visualize, depending on how much of the data they make up.
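If you decide to thin the data before plotting, here is a minimal sketch of both ideas, assuming a datetime index and a numeric column named 'value' (the file and column names are placeholders for whatever your data uses):

import pandas as pd

# Placeholder file and column names; adjust to your data.
df = pd.read_csv('data.csv', index_col=0, parse_dates=True)

# Option 1: drop the rows that are exactly zero before plotting.
df[df['value'] != 0]['value'].plot()

# Option 2: keep the zeroes but downsample the second-by-second data,
# e.g. to one point per minute, which cuts the plotted points ~60x.
df['value'].resample('1min').mean().plot()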

Related

Large CSV is being loaded much slower than expected, could it be that the RAM Python is allowed to use is limited?

I'm trying to load a large CSV file into a pandas dataframe. The CSV is rather large: a few GB.
The code works, but rather slowly. Even slower than I would expect. If I take only 1/10th of the CSV, the job is done in about 10 seconds. If I try to load the whole file, it takes more than 15 minutes. I would expect this to take roughly 10 times as long, not ~100 times.
The amount of RAM used by Python never rises above exactly 1,930.8 MB (there is 16 GB in my system).
It seems to be capped at this value, which makes me think there is some sort of limit on how much RAM Python is allowed to use. However, I never set such a limit, and everyone online says "Python has no RAM limit".
Could it be that the RAM Python is allowed to use is limited somewhere? And if so, how do I remove that limit?
The problem is not just how much RAM it can use, but how fast your CPU is. Loading a very large CSV file is very time-consuming if you use plain pandas alone. Here are a few options:
You can try other libraries that are made to work with big data. This tutorial shows some of them. I like Dask; its API is similar to pandas.
If you have a GPU, you can use RAPIDS (also mentioned in the link). RAPIDS is really a game changer: any computation on the GPU is significantly faster. One drawback is that not all pandas features are implemented yet, but that only matters if you need them.
The last solution, although not recommended, is to process your file in batches: for example, use a loop that loads only 100K rows at a time, processes them, saves the result, and continues until the file ends. This is still very time-consuming, but it is the most straightforward approach.
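A minimal sketch of the batch approach using pandas' built-in chunked reader (the file name, chunk size, and column names are placeholders):

import pandas as pd

# Read the CSV in 100K-row pieces instead of all at once.
results = []
for chunk in pd.read_csv('big_file.csv', chunksize=100000):
    # Process each piece, e.g. keep a small aggregate instead of the raw rows.
    results.append(chunk.groupby('some_key')['some_value'].sum())

# Combine the per-chunk aggregates at the end.
total = pd.concat(results).groupby(level=0).sum()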
I hope it helps.

How to figure out if a modin dataframe is going to fit in RAM?

I'm learning how to work with large datasets, so I'm using modin.pandas.
I'm doing some aggregation, after which a 50 GB dataset will hopefully shrink to something closer to 5 GB - and now I need to check: if the df is small enough to fit in RAM, I want to cast it to pandas and enjoy a bug-free, reliable library.
So, naturally, the question is: how do I check that? .memory_usage(deep=True).sum() tells me how much the whole df uses, but I can't possibly know from that one number how much of it is in RAM and how much is in swap - in other words, how much space I need for casting the df to pandas. Are there other ways? Am I even right to assume that some partitions live in RAM while others live in swap? How do I calculate how much data will flood the RAM when I call ._to_pandas()? Is there a hidden .__memory_usage_in_swap_that_needs_to_fit_in_ram() of some sort?
Am I even right to assume that some partitions live in RAM while others live in swap?
Modin doesn't specify whether data should be in RAM or swap.
On Ray, Modin uses ray.put to store partitions, and ray.put doesn't give any guarantees about where the data will go. Note that Ray spills data blocks to disk when they are too large for its in-memory object store. You can run ray memory to get a summary of how much of each kind of storage Ray is using.
On Dask, Modin stores partition data with dask.distributed.Client.scatter, which also gives no guarantees about where the data will go. I don't know of any way to figure out how much of the stored data is actually in RAM.
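If all you want is a rough go/no-go check before converting, here is one sketch. It is not Modin API: it assumes the reported size is a fair proxy for what ._to_pandas() will materialize, and it uses psutil, which you may need to install:

import psutil

# df is the Modin DataFrame from the question.
needed = df.memory_usage(deep=True).sum()
available = psutil.virtual_memory().available

if needed < 0.8 * available:  # leave some headroom
    pandas_df = df._to_pandas()
else:
    print("Need ~%.1f GB but only %.1f GB of RAM is free" % (needed / 1e9, available / 1e9))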

Problems with memory and Dask distributed: multiple times the size of the data loading into memory and data spill not happening

I'm running some simple tests with Dask distributed and Datashader, but I'm running into two problems that I haven't been able to solve or understand.
The data I'm working with consists of 1.7 billion rows with 97 columns each, distributed across 64 parquet files. My test code is the following; I simply plot two columns of the data in a scatter plot, following the example code at the bottom of https://datashader.org/user_guide/Performance.html:
import dask.dataframe as dd
import datashader as ds
import datashader.transfer_functions as tf

def plot(file_path):
    dask_df = dd.read_parquet(file_path, engine='pyarrow')
    cvs = ds.Canvas(plot_width=600, plot_height=300)
    agg = cvs.points(dask_df, 'x', 'y')
    img = tf.shade(agg, cmap=['lightblue', 'darkblue'])
    return img

futures = [dask_client.submit(plot, f) for f in files_paths]
result = [f.result() for f in futures]  # array with one plot per file
The two problems are the following:
First, my workers load far too much data into memory. For example, I've run the previous code with just one worker and one file. Even though the file is 11 GB, the Dask dashboard shows around 50 GB loaded into memory. The only solution I have found is to change the read line to explicitly select a small subset of the columns:
def plot(file_path):
    dask_df = dd.read_parquet(file_path, columns=['x','y',...], engine='pyarrow')
    …
Although this works (and makes sense, because I'm only using two columns for the plot), it's still confusing why the workers use that much memory.
The second problem is that, even though I have configured in my ~/.config/dask/distributed.yaml file that spilling to disk should happen at 70% memory use, my workers keep crashing because they run out of memory:
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
Finally, when I plot all the points, reading only 5 columns with columns=['x','y','a','b','c'], I get unreasonably slow times. Despite the files being split across 8 disks to speed up I/O, and despite working with 8 cores (8 workers), it takes 5 minutes for the 1.7 billion points to plot.
I'm using: dask 2.18.0, distributed 2.19.0, datashader 0.10.0, and python 3.7.7.
I've been struggling with this for a whole week so any advice would be highly appreciated. Please feel free to ask me for any other information that may be missing.
Although this works (and makes sense, because I'm only using two columns for the plot), it's still confusing why the workers use that much memory.
Parquet is a relatively efficient format. For example, your data may be compressed on disk but is uncompressed in pandas, or the pandas string type might be causing some bloat (pandas uses Python string objects, which are large).
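A quick way to see that effect for a single file (a sketch; the path is a placeholder, and plain pandas is used so only one file is read):

import os
import pandas as pd

# Compare the compressed on-disk size of one parquet file with the
# uncompressed in-memory footprint of the resulting DataFrame.
path = 'part.0.parquet'  # placeholder
on_disk = os.path.getsize(path)
df = pd.read_parquet(path, engine='pyarrow')
in_memory = df.memory_usage(deep=True).sum()
print("on disk: %.1f GB, in memory: %.1f GB" % (on_disk / 1e9, in_memory / 1e9))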
The second problem is that, even though I have configured in my ~/.config/dask/distributed.yaml file that spilling to disk should happen at 70% memory use, my workers keep crashing because they run out of memory:
I'm not sure what to tell you with this one. Dask can't stop Python functions from running out of RAM. I would check in with the datashader folks, but I would expect their code to be pretty tight.
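For reference, these are the worker-memory thresholds the nanny warning refers to. They can be set in distributed.yaml or programmatically before the client is created; the sketch below uses Dask's standard config keys with their usual default fractions. Keep in mind that spilling only acts on data the worker has already finished computing, not on memory allocated inside a task that is still running:

import dask
from dask.distributed import Client

# Fractions of the worker's memory limit (usual defaults shown).
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling stored results to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on total process memory
    "distributed.worker.memory.pause": 0.80,      # stop accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # the nanny restarts the worker
})

client = Client(n_workers=8)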
Finally, when I plot all the points, reading only 5 columns with columns=['x','y','a','b','c'], I get unreasonably slow times. Despite the files being split across 8 disks to speed up I/O, and despite working with 8 cores (8 workers), it takes 5 minutes for the 1.7 billion points to plot.
It's hard to diagnose performance issues over stack overflow. I recommend following the guidance here: https://docs.dask.org/en/latest/understanding-performance.html

python handling of memory

I've noticed that Python handles memory in a way that I didn't expect. I have a huge dataset which is stored in a 70 GB file. I usually load this file with np.loadtxt() and do some math on it. I have 32 GB of RAM, and I've noticed that, when the data is loaded into memory, around 25 GB of RAM is used. But apparently this value can change. For example, once, while I was processing the data, I got a memory error. After the error the dataset was still in memory (I verified that I could access it), but only around 5 GB of RAM was used. How is this possible? And how can I force Python to use as little memory as possible with my data, so that I can run other applications simultaneously?
Moreover, I sometimes do calculations that return a new dataset as large as the original, so that at the end I have several large datasets in memory, yet the total RAM used does not change. Are these variables written to the hard disk in some way? If so, why do I sometimes get memory crashes?
(BTW I use spyder as IDE if it matters)

Jupyter notebook kernel dies when creating dummy variables with pandas

I am working on the Walmart Kaggle competition and I'm trying to create dummy columns from the "FinelineNumber" column. For context, df.shape returns (647054, 7). I am trying to make dummy columns for df['FinelineNumber'], which has 5,196 unique values. The result should be a dataframe of shape (647054, 5196), which I then plan to concat to the original dataframe.
Nearly every time I run fineline_dummies = pd.get_dummies(df['FinelineNumber'], prefix='fl'), I get the following error message: The kernel appears to have died. It will restart automatically. I am running Python 2.7 in a Jupyter notebook on a MacBook Pro with 16 GB of RAM.
Can someone explain why this is happening (and why it happens most of the time but not every time)? Is it a Jupyter notebook or pandas bug? I also thought it might be a lack of RAM, but I get the same error on a Microsoft Azure Machine Learning notebook with >100 GB of RAM. On Azure ML, the kernel dies every time, almost immediately.
It very much could be memory usage: a 647054 x 5196 data frame has 3,362,092,584 elements, which would be roughly 25 GB just for the pointers to the objects on a 64-bit system. On AzureML, while the VM has a large amount of memory, you're actually limited in how much memory you have available (currently 2 GB, soon to be 4 GB), and when you hit that limit the kernel typically dies. So it seems very likely to be a memory usage issue.
You might try calling .to_sparse() on the data frame before doing any additional manipulations. That should let pandas avoid storing the mostly-zero values, keeping the frame's memory footprint much smaller.
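A sketch of the same idea applied at creation time: get_dummies accepts a sparse flag, so the dense intermediate never has to exist (column names follow the question):

import pandas as pd

# df is the DataFrame from the question. Build the dummy columns as sparse
# from the start, so the dense (647054 x 5196) intermediate is never materialized.
fineline_dummies = pd.get_dummies(df['FinelineNumber'], prefix='fl', sparse=True)
df = pd.concat([df, fineline_dummies], axis=1)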
