Python multiprocessing pool map - Memory issues in Databricks

I am running a Python component in a Databricks environment that creates a set of JSON messages, each encoded with an Avro schema. The encoding was taking a long time (8 minutes to encode 10K messages with a complex JSON structure), so I tried to use multiprocessing with the pool map function. The process seems to work fine for the first execution; however, for subsequent runs the performance degrades and eventually fails with an OOM error. I make sure that pool.close() and pool.join() are issued at the end of execution, but I'm not sure they are really freeing up the memory. When I look at the Databricks Ganglia UI, it shows that swap memory and CPU utilization increase with each run. I also tried reducing the number of pool workers (the driver node has 8 cores, so I tried 6 and 4) and setting maxtasksperchild=1, but that still doesn't help. I am wondering if I'm doing anything wrong. Following is the code I'm using now. Any pointers / suggestions on what is causing the issue are appreciated.
from multiprocessing import Pool
import multiprocessing
import json
from datetime import datetime  # needed for the timestamp print below
from avro.io import *
import avro.schema
from avro_json_serializer import AvroJsonSerializer, AvroJsonDeserializer
import pyspark.sql.functions as F

def create_json_avro_encoding(row):
    row_dict = row.asDict(True)
    json_data = json.loads(avro_serializer.to_json(row_dict))
    #print(f"JSON created { multiprocessing.current_process().name }")
    return json_data

avro_schema = avro.schema.SchemaFromJSONData(avro_schema_dict, avro.schema.Names())
avro_serializer = AvroJsonSerializer(avro_schema)

records = df.collect()
pool_cnt = int(multiprocessing.cpu_count() * 0.5)
print(f"No of records: {len(records)}")
print(f"starting timestamp {datetime.now().isoformat(sep=' ')}")

with Pool(pool_cnt, maxtasksperchild=1) as pool:
    json_data_ret = pool.map(create_json_avro_encoding, records)
    pool.close()
    pool.join()

You shouldn't call close() and join() yourself when using the pool in a with block: the with block shuts the pool down automatically when it exits (its __exit__ calls terminate()), so the explicit close()/join() inside it is redundant.
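For reference, a minimal sketch of the two lifecycle patterns this implies; square is a hypothetical stand-in for create_json_avro_encoding:

from multiprocessing import Pool

def square(x):
    # hypothetical worker, standing in for create_json_avro_encoding
    return x * x

if __name__ == "__main__":
    # Pattern 1: rely on the with block; it terminates the pool on exit.
    with Pool(4) as pool:
        results = pool.map(square, range(100))

    # Pattern 2: manage the lifecycle explicitly when you want join() semantics,
    # i.e. to block until the workers have exited and released their memory.
    pool = Pool(4)
    try:
        results = pool.map(square, range(100))
    finally:
        pool.close()   # no new tasks will be submitted
        pool.join()    # wait for all workers to exit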

Related

Multithreading Memory Leak

I have a Flask application that uses multithreading to collect data via thousands of HTTP requests.
When I deploy the application without multithreading it works as expected, but with multithreading roughly 200 MB of RAM is not freed after each run, which eventually leads to a MemoryError.
At first I used queue with multithreading and then replaced it with concurrent.futures.ThreadPoolExecutor. I have also tried deleting some variables and forcing garbage collection, but the MemoryError still persists.
This code is not leaking memory:
data = []
for subprocess in process:
    result = process_valuechains(subprocess, publishedRevision)
    data.extend(result)
This code is leaking memory:
import gc
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat

subprocesses = []
for subprocess in process:
    subprocesses.append(subprocess)

data = []
with ThreadPoolExecutor() as pool:
    for res in pool.map(process_valuechains, subprocesses, repeat(publishedRevision)):
        data.extend(res)
        del res
gc.collect()
A simplified version of process_valuechains looks like the following:
def process_valuechains(subprocess, publishedRevision):
    data = []
    new_data_1 = request_data_1(subprocess)
    data.extend(new_data_1)
    new_data_2 = request_data_2(subprocess)
    data.extend(new_data_2)
    return data
Unfortunately, even after a lot of research I have no idea what exactly is causing the leak or how to fix it.

How to use pyarrow parquet with multiprocessing

I want to read multiple HDFS files simultaneously using pyarrow and multiprocessing.
The simple Python script works (see below), but if I try to do the same thing with multiprocessing, it hangs indefinitely.
My only guess is that the environment is somehow different, but all the environment variables should be the same in the child process and the parent process.
I've tried to debug this with print() and by reducing the pool to a single worker. To my surprise, it fails even with only one worker.
So, what can be the possible causes? How would I debug this?
Code:
import pyarrow.parquet as pq

def read_pq(file):
    table = pq.read_table(file)
    return table

##### this works #####
table = read_pq('hdfs://myns/mydata/000000_0')

###### this doesn't work #####
import multiprocessing
from multiprocessing import Pool  # Pool is used by name below

result_async = []
with Pool(1) as pool:
    result_async.append(pool.apply_async(pq.read_table, args=('hdfs://myns/mydata/000000_0',)))
    results = [r.get() for r in result_async]  ###### hangs here indefinitely, no exceptions raised
    print(results)  ###### expecting to get List[pq.Table]
#########################
Have you tried importing pq inside a user-defined function, so that any per-process initialization the library needs can happen in each process in the pool?
def read_pq(file):
    import pyarrow.parquet as pq
    table = pq.read_table(file)
    return table

###### this doesn't work #####
import multiprocessing
from multiprocessing import Pool

result_async = []
with Pool(1) as pool:
    result_async.append(pool.apply_async(read_pq, args=('hdfs://myns/mydata/000000_0',)))
    results = [r.get() for r in result_async]  ###### hangs here indefinitely, no exceptions raised
    print(results)  ###### expecting to get List[pq.Table]
#########################
The problem was due to my lack of experience with multiprocessing.
The solution is to add:
from multiprocessing import set_start_method
set_start_method("spawn")
The solution and the reason are exactly what
https://pythonspeed.com/articles/python-multiprocessing/
describes: the logging state got forked and caused a deadlock.
Furthermore, although I had only Pool(1), in fact there was the parent process plus the child process, so I still had two processes, and that was enough for the deadlock to occur.
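A minimal sketch of how that fix fits into the example above (the HDFS path is the same placeholder used in the question); set_start_method must run once, under the __main__ guard, before the pool is created:

from multiprocessing import Pool, set_start_method

def read_pq(file):
    import pyarrow.parquet as pq   # import inside the worker, after the spawn
    return pq.read_table(file)

if __name__ == "__main__":
    # "spawn" starts a fresh interpreter instead of forking the parent,
    # so no locks or partially initialized library state are inherited.
    set_start_method("spawn")
    with Pool(1) as pool:
        result = pool.apply_async(read_pq, args=('hdfs://myns/mydata/000000_0',))
        print(result.get())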

Python Multiprocessing using Pool goes recursively haywire

I'm trying to make an expensive part of my pandas calculations parallel to speed things up.
I've already managed to make multiprocessing.Pool work with a simple example:
import multiprocessing as mpr
import numpy as np

def Test(l):
    for i in range(len(l)):
        l[i] = i**2
    return l

t = list(np.arange(100))
L = [t, t, t, t]

if __name__ == "__main__":
    pool = mpr.Pool(processes=4)
    E = pool.map(Test, L)
    pool.close()
    pool.join()
No problems here. Now, my own algorithm is a bit more complicated and I can't post it here in its full glory and terribleness, so I'll use some pseudo-code to outline what I'm doing there:
import pandas as pd
import time
import datetime as dt
import multiprocessing as mpr
import MPFunctions as mpf        # self-written worker functions that get called for the multiprocessing
import ClassGetDataFrames as gd  # self-written class that reads in all the data and puts it into dataframes

# === Settings
# === Use ClassGetDataFrames to get data
# === Lots of single-thread calculations and manipulations on the dataframe
# === Cut dataframe into 4 evenly big chunks, make a list of them called DDC

if __name__ == "__main__":
    pool = mpr.Pool(processes=4)
    LLT = pool.map(mpf.processChunks, DDC)
    pool.close()
    pool.join()

# === Join processed chunks LLT back into one dataframe
# === More calculations and manipulations
# === Data output
When I run this script, the following happens:
It reads in the data.
It does all calculations and manipulations until the Pool statement.
Suddenly it reads in the data again, fourfold.
Then it goes into the main script fourfold at the same time.
The whole thing cascades recursively and goes haywire.
I have read before that this can happen if you're not careful, but I do not know why it happens here. My multiprocessing code is protected by the required name-main statement (I'm on Win7 64), it is only four lines long, it has close and join statements, and it calls one defined worker function which then calls a second worker function in a loop; that's it. As far as I know it should just create the pool with four processes, call the four processes from the imported script, close the pool, wait until everything is done, and then continue with the script. As a side note, I first had the worker functions in the same script, and the behaviour was the same. Instead of just doing what's in the pool, it seems to restart the whole script fourfold.
Can anyone enlighten me as to what might cause this behaviour? I seem to be missing some crucial understanding of Python's multiprocessing behaviour.
Also, I don't know if it's important, but I'm on a virtual machine that sits on my company's mainframe.
Do I have to use individual processes instead of a pool?
I managed to make it work by enclosing the entire script in the if __name__ == "__main__": block, not just the multiprocessing part.
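A minimal sketch of that structure, with hypothetical stand-ins (process_chunk and run_pipeline) for the question's own modules and dataframes:

import multiprocessing as mpr

def process_chunk(chunk):
    # hypothetical stand-in for mpf.processChunks
    return [x ** 2 for x in chunk]

def run_pipeline():
    # stand-in for reading the data, the single-threaded calculations,
    # and cutting the dataframe into the four chunks DDC
    DDC = [list(range(10)), list(range(10, 20)), list(range(20, 30)), list(range(30, 40))]
    pool = mpr.Pool(processes=4)
    LLT = pool.map(process_chunk, DDC)
    pool.close()
    pool.join()
    # stand-in for joining LLT back together and writing the output
    print(LLT)

if __name__ == "__main__":
    # On Windows there is no fork: each child process re-imports this module,
    # so everything that should run only once must sit behind this guard.
    run_pipeline()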

multiprocessing.Pool.imap_unordered with fixed queue size or buffer?

I am reading data from large CSV files, processing it, and loading it into a SQLite database. Profiling suggests 80% of my time is spent on I/O and 20% is processing input to prepare it for DB insertion. I sped up the processing step with multiprocessing.Pool so that the I/O code is never waiting for the next record. But, this caused serious memory problems because the I/O step could not keep up with the workers.
The following toy example illustrates my problem:
#!/usr/bin/env python  # 3.4.3
import time
from multiprocessing import Pool

def records(num=100):
    """Simulate generator getting data from large CSV files."""
    for i in range(num):
        print('Reading record {0}'.format(i))
        time.sleep(0.05)  # getting raw data is fast
        yield i

def process(rec):
    """Simulate processing of raw text into dicts."""
    print('Processing {0}'.format(rec))
    time.sleep(0.1)  # processing takes a little time
    return rec

def writer(records):
    """Simulate saving data to SQLite database."""
    for r in records:
        time.sleep(0.3)  # writing takes the longest
        print('Wrote {0}'.format(r))

if __name__ == "__main__":
    data = records(100)
    with Pool(2) as pool:
        writer(pool.imap_unordered(process, data, chunksize=5))
This code results in a backlog of records that eventually consumes all memory because I cannot persist the data to disk fast enough. Run the code and you'll notice that Pool.imap_unordered will consume all the data when writer is at the 15th record or so. Now imagine the processing step is producing dictionaries from hundreds of millions of rows and you can see why I run out of memory. Amdahl's Law in action perhaps.
What is the fix for this? I think I need some sort of buffer for Pool.imap_unordered that says "once there are x records that need insertion, stop and wait until there are fewer than x before making more." I should be able to get some speed improvement from preparing the next record while the last one is being saved.
I tried using NuMap from the papy module (which I modified to work with Python 3) to do exactly this, but it wasn't faster. In fact, it was worse than running the program sequentially; NuMap uses two threads plus multiple processes.
Bulk import features of SQLite are probably not suited to my task because the data need substantial processing and normalization.
I have about 85G of compressed text to process. I'm open to other database technologies, but picked SQLite for ease of use and because this is a write-once read-many job in which only 3 or 4 people will use the resulting database after everything is loaded.
As I was working on the same problem, I figured that an effective way to prevent the pool from overloading is to use a semaphore with a generator:
from multiprocessing import Pool, Semaphore

def produce(semaphore, from_file):
    with open(from_file) as reader:
        for line in reader:
            # Reduce semaphore by 1 or wait if 0
            semaphore.acquire()
            # Now deliver an item to the caller (pool)
            yield line

def process(item):
    result = (first_function(item),
              second_function(item),
              third_function(item))
    return result

def consume(semaphore, result):
    database_con.cur.execute("INSERT INTO ResultTable VALUES (?,?,?)", result)
    # Result is consumed, semaphore may now be increased by 1
    semaphore.release()

def main():
    global database_con
    semaphore_1 = Semaphore(1024)
    with Pool(2) as pool:
        for result in pool.imap_unordered(process, produce(semaphore_1, "workfile.txt"), chunksize=128):
            consume(semaphore_1, result)
See also:
K Hong - Multithreading - Semaphore objects & thread pool
Lecture from Chris Terman - MIT 6.004 L21: Semaphores
Since processing is fast but writing is slow, it sounds like your problem is I/O-bound. Therefore there might not be much to be gained from using multiprocessing.
However, it is possible to peel off chunks of data, process each chunk, and wait until that data has been written before peeling off another chunk:
import itertools as IT

if __name__ == "__main__":
    data = records(100)
    with Pool(2) as pool:
        chunksize = ...
        for chunk in iter(lambda: list(IT.islice(data, chunksize)), []):
            writer(pool.imap_unordered(process, chunk, chunksize=5))
It sounds like all you really need is to replace the unbounded queues underneath the Pool with bounded (and blocking) queues. That way, if any side gets ahead of the rest, it'll just block until they're ready.
This would be easy to do by peeking at the source and subclassing or monkeypatching Pool, something like:
import multiprocessing.pool
import queue

class Pool(multiprocessing.pool.Pool):
    def _setup_queues(self):
        self._inqueue = self._ctx.Queue(5)
        self._outqueue = self._ctx.Queue(5)
        self._quick_put = self._inqueue._writer.send
        self._quick_get = self._outqueue._reader.recv
        self._taskqueue = queue.Queue(10)
But that's obviously not portable (even to CPython 3.3, much less to a different Python 3 implementation).
I think you can do it portably in 3.4+ by providing a customized context, but I haven't been able to get that right, so…
A simple workaround might be to use psutil to detect the memory usage in each process and, say, if more than 90% of memory is taken, just sleep for a while:
import time
import psutil

while psutil.virtual_memory().percent > 75:
    time.sleep(1)
    print("process paused for 1 second!")

Access python program data while running

I have a python program that's been running for a while, and because of an unanticipated event, I'm now unsure that it will complete within a reasonable amount of time. The data it's collected so far, however, is valuable, and I would like to recover it if possible.
Here is the relevant code:

from multiprocessing.dummy import Pool as ThreadPool

def pull_details(url):
    # accesses a given URL
    # returns some data which gets appended to the results list
    ...

pool = ThreadPool(25)
results = pool.map(pull_details, urls)
pool.close()
pool.join()
So I need either to access the data currently in results, or to somehow change the source code (or manually alter the program's control flow) to break out of the loop so that it continues to the later part of the program where the data is exported (I'm not sure the second option is even possible).
It seems as though the first option is also quite tricky, but luckily the IDE (Spyder) I'm using indicates the value of what I assume is the location of the list in the machine's memory (0xB73EDECCL).
Is it possible to create a C program (or another python program) to access this location in memory and read what's there?
Can't you use some sort of mechanism to exchange data between the two processes, like a queue or a pipe?
Something like the below:
from functools import partial
from multiprocessing import Queue
from multiprocessing.dummy import Pool as ThreadPool

def pull_details(q, url):
    # ... access the given URL ...
    q.put(my_useful_data)   # placeholder for whatever data the worker collects

q = Queue()
pool = ThreadPool(25)
pool.map_async(partial(pull_details, q), urls)

# Drain the queue in the main thread as results arrive; the data collected
# so far is available here even if some workers never finish.
results = [q.get() for _ in urls]

pool.close()
pool.join()
