How to share zero copy dataframes between processes with PyArrow - python

I'm trying to work out how to share data between processes with PyArrow (to hopefully at some stage share pandas DataFrames). I am at a rather experimental (read: Newbie) stage and am trying to figure out how to use PyArrow. I'm a bit stuck and need help.
Going through the documentation, I found an example to create a buffer:
import time
import pyarrow as pa
data = b'abcdefghijklmnopqrstuvwxyz'
buf = pa.py_buffer(data)
print(buf)
# <pyarrow.Buffer address=0x7fa5be7d5850 size=26 is_cpu=True is_mutable=False>
while True:
    time.sleep(1)
While this process was running, I used the address and size that the script printed to (try to) access the buffer from another script:
import pyarrow as pa
buf = pa.foreign_buffer(0x7fa5be7d5850, size=26)
print(buf.to_pybytes())
and received... a segmentation fault - most likely because the script is trying to access memory from another process, which may require different handling.
Is this not possible with PyArrow, or is it just the way I am trying to do it? Do I need other libraries? I'd like to avoid serialisation (or writing to disk in general) if possible, but this may or may not be achievable. Any pointers are appreciated.

You can't share buffers across processes; instead, you should use a memory-mapped file.
To write:
import pyarrow as pa
mmap = pa.create_memory_map("hello.txt", 20)
mmap.write(b"hello")
To read:
import pyarrow as pa
mmap = pa.memory_map("hello.txt")
mmap.read(5)
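Since the original question is about sharing DataFrames, here is a minimal sketch of the same idea applied to a whole table, assuming both processes can see a file named shared_table.arrow (the name is just illustrative): the writer serialises the table to the Arrow IPC file format, and the reader memory-maps it, so the record batches are read without copying.
import pyarrow as pa
import pyarrow.ipc as ipc
import pandas as pd

# Writer process: serialise a DataFrame to an Arrow IPC file on disk.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
table = pa.Table.from_pandas(df)
with pa.OSFile("shared_table.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Reader process: memory-map the file; the record batches are read zero-copy.
with pa.memory_map("shared_table.arrow", "r") as source:
    shared_table = ipc.open_file(source).read_all()
print(shared_table.to_pandas())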

Related

Is it possible in Python to load a large object into memory with one process, and access it in separate independent processes?

I'm writing a program that requires running algorithms on a very large (~6GB) csv file, which is loaded with pandas using read_csv().
The issue I have now, is that anytime I tweak my algorithms and need to re-simulate (which is very often), I need to wait ~30s for the dataset to load into memory, and then another 30s afterward to load the same dataset into a graphing module so I can visually see what's going on. Once it's loaded however, operations are done very quickly.
So far I've tried using mmap, and loading the dataset into a RAM disk for access, with no improvement.
I'm hoping to find a way to load up the dataset once into memory with one process, and then access it in memory with the algorithm-crunching process, which gets re-run each time I make a change.
This thread seems to be close-ish to what I need, but uses multiprocessing which needs everything to be run within the same context.
I'm not a computer engineer (I'm electrical :), so I'm not sure what I'm asking for is even possible. Any help would be appreciated however.
Thanks,
Found a solution that worked, although it was not directly related to my original ask.
Instead of loading a large file into memory and sharing it between independent processes, I found that the bottleneck was really the parsing function in the pandas library.
In particular, CSV parsing, as CSVs are notoriously inefficient in terms of data storage.
I started storing my files in the python-native pickle format, which is supported by pandas through the to_pickle() and read_pickle() functions. This cut my load times drastically from ~30s to ~2s.
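For illustration, a minimal sketch of that caching pattern, with large_dataset.csv standing in for the real file:
import pandas as pd

# One-time conversion: parse the CSV once, then cache the parsed frame.
df = pd.read_csv("large_dataset.csv")
df.to_pickle("large_dataset.pkl")

# Subsequent runs: load the already-parsed frame and skip CSV parsing.
df = pd.read_pickle("large_dataset.pkl")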

Memory efficient data loading with Dask

Well, what I have is a big CSV file with data and a server RAM bottleneck. Besides that, there is a dask-distributed cluster that looks like a solution for this case; dask-scheduler is running on the server. Here's what I have tried:
import dask.dataframe as dd
import pandas as pd
from dask.bag import from_sequence
cheques = dd.read_csv('cheque_data.csv')  # not working because distributed workers can't access the file directly
cheques = from_sequence(pd.read_csv('cheque_data.csv', chunksize=10**4)).to_dataframe()  # dask Bag from_sequence constructs from tuples or dicts
So I am stuck here; any ideas and clues would be great.
Your code snippet says dd.read_csv('cheque_data.csv') is "not working because distributed workers can't access the file directly".
I think you need to give your workers access to the file.
I prefer splitting up huge files before they're read by Dask. That makes it easier for Dask to read the files in parallel.
It depends on the size of cheque_data.csv. 2GB is a lot easier to manage than 30GB.
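As an illustration, here is a sketch of that splitting approach; the chunk size, file names and the assumption that all workers can read the output location are mine, not from the question:
import dask.dataframe as dd
import pandas as pd

# Split the big CSV into smaller pieces once (chunk size is arbitrary),
# writing them somewhere every worker can read (shared/network filesystem).
for i, chunk in enumerate(pd.read_csv("cheque_data.csv", chunksize=10**6)):
    chunk.to_csv(f"cheque_data_part_{i:04d}.csv", index=False)

# Point Dask at the resulting glob so each worker reads its own piece.
cheques = dd.read_csv("cheque_data_part_*.csv")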

Writing large dask dataframe to file

I have a large BCP file (12GB) that I have loaded into Dask and done some data wrangling on, and I wish to import the result to SQL Server. The file has been reduced from 40+ columns to 8 columns, and I want to find the best method to do the import. I have tried using the following:
import sqlalchemy as sa
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from urllib.parse import quote_plus
pbar = ProgressBar()
pbar.register()
#windows authentication
#to_sql_uri = quote_plus(engine)
ddf.to_sql('test',
           uri='mssql+pyodbc://TEST_SERVER/TEST_DB?driver=SQL Server?Trusted_Connection=yes',
           if_exists='replace', index=False)
This method is taking too long (3 days and counting). I had suspected this may be the case, so I also tried to write to a BCP file with the intention of using SQL BCP, but again this is taking a number of days:
df_train_grouped.compute().to_csv("F:\TEST_FILE.bcp", sep='\t')
I am relatively new to dask and can't seem to find an easy to follow example on the most efficient method to do this.
There is no need for you to use compute(); it materialises the dataframe into memory and is likely the bottleneck for you. You can instead do
df_train_grouped.to_csv("F:\TEST_FILE*.bcp", sep='\t')
which will create a number of output files in parallel - which is probably exactly what you want.
Note that profiling will determine whether your process is IO bound (e.g., by the disc itself), in which case there is nothing you can do, or whether one of the process-based schedulers (ideally the distributed scheduler) can help with GIL-holding tasks.
Changing to a multiprocessing scheduler as follows improved performance in this particular case:
import dask

dask.config.set(scheduler='processes')  # overwrite the default with the multiprocessing scheduler
df_train_grouped.to_csv("F:\TEST_FILE*.bcp", sep='\t', chunksize=1000000)
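If you want to try the distributed scheduler instead, a minimal local sketch might look like this; the worker counts and the input file pattern are assumptions, not part of the original setup:
import dask.dataframe as dd
from dask.distributed import Client

# Start a local distributed cluster; one thread per worker avoids
# contention for GIL-holding tasks.
client = Client(n_workers=4, threads_per_worker=1)

ddf = dd.read_csv("input_part_*.csv")      # hypothetical input files
ddf.to_csv("F:/TEST_FILE*.bcp", sep="\t")  # one output file per partition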

how to dereference a module and free memory in python?

I am working on a project where I need to dereference a module and free its memory for further model training. Below is demo code that prints the memory usage after each block.
I used garbage collection as well as the del statement, but it didn't work for me.
import psutil
import sys
import gc
sys.path.insert(0,'/scripts_v2')
process = psutil.Process()
mem = process.memory_info().rss/(1024**2)
print(mem)
import pandas as pd
process = psutil.Process()
mem = process.memory_info().rss/(1024**2)
print(mem)
sys.modules.pop('pandas')
#del pd
gc.collect()
process = psutil.Process()
mem = process.memory_info().rss/(1024**2)
print(mem)
I calculated the memory usage after each block of code. In the output, you can see that before and after deleting the pandas library, its memory is still 60.65 MB. How can I free its memory?
Deleting a variable, a module, or anything else in Python does not necessarily free its memory. Python simply caches some objects when you initialise them.
Your problem is likely caused by the caching of some objects in the pandas library, and as far as I am aware, you cannot free the internal cache of Python. There might be a way to do it, but it will definitely be too hacky to be worth the bother. If you need 100% control over your memory in a project, then Python will not help you much. For more information on the topic, see the answers here, especially this answer.
Note that del and gc.collect() might release memory; they just do not have to.
Edit: sys.modules.pop('pandas') does not de-import pandas; you can verify this by, for example, running print(pd.__version__). But the same memory problem occurs if you actually de-import the library using the methods specified here.
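For reference, here is a sketch of what actually de-importing pandas might look like (dropping the local name and every pandas entry in sys.modules); even then, memory already allocated by the module's C extensions and internal caches is typically not returned to the OS:
import gc
import sys

import pandas as pd

# Drop the local reference and every cached pandas module object.
del pd
for name in [m for m in sys.modules if m == "pandas" or m.startswith("pandas.")]:
    del sys.modules[name]

gc.collect()  # may or may not reduce the process's resident memory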

Freeing up buffer space after use in Python?

So I'm using Google Cloud Datalab and I use the %%storage read command to read a large file (2,000,000 rows) into the text variable, and then I have to process it into a pandas dataframe using BytesIO, e.g. df_new = pd.read_csv(BytesIO(text)).
Now I don't need the text variable or its contents around (all further processing is done on df_new). How can I delete it (text) and free up memory? I sure don't need two copies of a 2-million-record dataset hanging around...
Use del followed by forced garbage collection.
import gc
# Remove text variable
del text
# Force a gc collection - this is not strictly necessary, but may be useful.
gc.collect()
Note that you may not see process size decreasing and memory returning to OS, depending on memory allocator used (depends on OS, core libraries used and python compilation options).
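For context, here is a hypothetical reproduction of the pattern (the open() call just stands in for %%storage read, and the filename is made up):
from io import BytesIO
import gc

import pandas as pd

# Load the raw bytes, parse them, then drop the only reference to the buffer.
with open("large_export.csv", "rb") as f:
    text = f.read()

df_new = pd.read_csv(BytesIO(text))

del text      # CPython can free the bytes object once its last reference is gone
gc.collect()  # optional; mainly helps if reference cycles are involved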
