Multiprocessing in AWS Lambda - python

I'm trying to scan all the CloudWatch log groups (nearly 10k log groups) in my AWS account and check them for subscription filters. Since Lambda has an execution time limit of 15 minutes, I'm using multiprocessing to finish within that limit. Here is my code. When I execute it, it fails with a timeout error.
import time
import concurrent.futures
import boto3
from multiprocessing import Process, Pipe

logs = boto3.client('logs')

def describe_log_groups():
    paginator = logs.get_paginator('describe_log_groups')
    for page in paginator.paginate():
        for log_groups in page['logGroups']:
            yield(log_groups)

def describe_subscription_filter(loggroupname,conn):
    print('In Subscription Filters')
    response = logs.describe_subscription_filters(logGroupName=loggroupname)['subscriptionFilters']
    if len(response) != 0:
        for log in response:
            print(log['destinationArn'])
            conn.send([log['destinationArn']])
    conn.close()

def lambda_handler(event, context):
    t1 = time.perf_counter()
    evlaute_loggroups = []
    processes = []
    parent_connections = []
    loggroups_list = describe_log_groups()
    for loggroup in loggroups_list:
        parent_conn, child_conn = Pipe()
        parent_connections.append(parent_conn)
        print(parent_connections)
        print(loggroup['logGroupName'])
        process = Process(target=describe_subscription_filter, args=(loggroup['logGroupName'], child_conn,))
        processes.append(process)
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    for parent_connection in parent_connections:
        print(parent_connection.recv()[0])
    print('done')
    t2 = time.perf_counter()
    print(f'Finished in {t2-t1} seconds')
I also have a question: can multiprocessing be used to scan a huge number of log groups in Lambda?

Using multiprocessing in Lambda is not going to help much. The computational power of your function is tied to its RAM allocation.
If you want your function to run faster, you have to give it more RAM. With 1792 MB of RAM your function gets an allocation of 1 vCPU. This means that even with the maximum amount of RAM (3008 MB) you will not get 2 vCPUs. Since 1 vCPU can be considered equivalent to 1 hyper-thread on a physical CPU core, your Lambda function is basically limited to one thread.
You can consider the following options:
check the execution time with more RAM,
simplify your code: instead of having one large function, have a few smaller functions that can be orchestrated using Step Functions, for instance,
move from Lambda to another service, e.g. ECS.
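As a rough illustration of the second option, one function could list the log group names and split them into batches, with each batch then handed to a separate, smaller invocation. The sketch below covers only the batching step; the helper name batched, the batch size, and the sample names are assumptions for illustration, and how the batches are fanned out (Step Functions or otherwise) is left open.

```python
from itertools import islice

def batched(names, batch_size):
    """Yield successive lists of at most `batch_size` items from `names`."""
    iterator = iter(names)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical log group names standing in for the paginator output.
log_group_names = [f"/aws/lambda/fn-{i}" for i in range(10)]
batches = list(batched(log_group_names, batch_size=4))
print([len(b) for b in batches])
```

Because batched consumes any iterable lazily, it could also wrap a generator such as describe_log_groups() without first loading all 10k entries into memory.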

Related

Multiprocessing is not executing parallel in Python

I have edited the code and it currently works fine, but I think it is not executing in parallel or dynamically. Can anyone please check it?
Code :
import time
from functools import partial
from multiprocessing import Pool, freeze_support

def folderStatistic(t):
    j, dir_name = t
    row = []
    for content in dir_name.split(","):
        row.append(content)
    print(row)

def get_directories():
    import csv
    with open('CONFIG.csv', 'r') as file:
        reader = csv.reader(file,delimiter = '\t')
        return [col for row in reader for col in row]

def folderstatsMain():
    freeze_support()
    start = time.time()
    pool = Pool()
    worker = partial(folderStatistic)
    pool.map(worker, enumerate(get_directories()))

def datatobechecked():
    try:
        folderstatsMain()
    except Exception as e:
        # pass
        print(e)

if __name__ == '__main__':
    datatobechecked()
Config.CSV
C:\USERS, .CSV
C:\WINDOWS , .PDF
etc.
There may be around 200 folder paths in config.csv
Welcome to Stack Overflow and the Python programming world!
Moving on to the question.
Inside the get_directories() function you open the file in a with context and get the reader object, and the file is closed as soon as you leave the context, so by the time the reader object is used the file is already closed.
I don't want to discourage you, but if you are very new to programming, do not dive into parallel programming yet. The difficulty of handling multiple threads simultaneously grows exponentially with every thread you add (pools greatly simplify this process, though). Processes are even worse, as they don't share memory and can't communicate with each other easily.
My advice is to write it as a single-threaded program first. If you have it working and still need to parallelize it, isolate a single function that takes the input file path as a parameter and does all the work, and then use a thread/process pool on that function.
EDIT:
From what I can understand from your code, you get directory names from the CSV file and then, for each "cell" in the file, you run folderStatistic in parallel. This part seems correct. The problem may lie in dir_name.split(","): notice that you pass individual "cells" to folderStatistic, not rows. What makes you think it's not running in parallel?
There is a certain amount of overhead in creating a multiprocessing pool because creating processes is, unlike creating threads, a fairly costly operation. Then those submitted tasks, represented by each element of the iterable being passed to the map method, are gathered up in "chunks" and written to a multiprocessing queue of tasks that are read by the pool processes. This data has to move from one address space to another and that has a cost associated with it. Finally when your worker function, folderStatistic, returns its result (which is None in this case), that data has to be moved from one process's address space back to the main process's address space and that too has a cost associated with it.
All of those added costs become worthwhile when your worker function is sufficiently CPU-intensive that these additional costs are small compared to the savings gained by having the tasks run in parallel. But your worker function's CPU requirements are too small to reap any benefit from multiprocessing.
Here is a demo comparing single-processing time vs. multiprocessing time for invoking a worker function, fn, in two cases: in the first, it performs its internal loop only 10 times (low CPU requirements), while in the second it performs its internal loop 1,000,000 times (higher CPU requirements). You can see that in the first case the multiprocessing version runs considerably slower (you can't even measure the time for the single-processing run). But when we make fn more CPU-intensive, multiprocessing achieves gains over the single-processing case.
from multiprocessing import Pool
from functools import partial
import time

def fn(iterations, x):
    the_sum = x
    for _ in range(iterations):
        the_sum += x
    return the_sum

# required for Windows:
if __name__ == '__main__':
    for n_iterations in (10, 1_000_000):
        # single processing time:
        t1 = time.time()
        for x in range(1, 20):
            fn(n_iterations, x)
        t2 = time.time()
        # multiprocessing time:
        worker = partial(fn, n_iterations)
        t3 = time.time()
        with Pool() as p:
            results = p.map(worker, range(1, 20))
        t4 = time.time()
        print(f'#iterations = {n_iterations}, single processing time = {t2 - t1}, multiprocessing time = {t4 - t3}')
Prints:
#iterations = 10, single processing time = 0.0, multiprocessing time = 0.35399389266967773
#iterations = 1000000, single processing time = 1.182999849319458, multiprocessing time = 0.5530076026916504
But even with a pool size of 8, the running time is not reduced by a factor of 8 (it's more like a factor of 2) due to the fixed multiprocessing overhead. When I change the number of iterations for the second case to be 100,000,000 (even more CPU-intensive), we get ...
#iterations = 100000000, single processing time = 109.3077495098114, multiprocessing time = 27.202054023742676
... which is a reduction in running time by a factor of 4 (I have many other processes running in my computer, so there is competition for the CPU).
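On the "chunks" mentioned earlier: when no chunksize is passed, Pool.map picks one itself. The sketch below mirrors the heuristic used in CPython's multiprocessing.pool, which aims for about four chunks per worker. This is an implementation detail rather than documented behaviour, so treat it as an approximation that may change; default_chunksize is a hypothetical helper name, not part of the library.

```python
def default_chunksize(n_tasks, n_workers):
    """Approximate the chunk size Pool.map picks when chunksize is not given.

    Mirrors the heuristic in CPython's multiprocessing.pool (an implementation
    detail that may change): aim for about four chunks per worker.
    """
    chunksize, extra = divmod(n_tasks, n_workers * 4)
    if extra:
        chunksize += 1
    return chunksize

print(default_chunksize(19, 8))          # the demo above: 19 tasks, 8 workers
print(default_chunksize(10_000_000, 4))  # a large input on 4 workers
```

With only 19 tasks on 8 workers every task travels in its own chunk, which is part of why the per-task overhead dominates when the tasks themselves are tiny.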

multiprocessing is slower than thread in python

I have tested multiprocessing and threading in Python, but the multiprocessing version is slower than the threaded one. I calculate distances using editdistance; my code looks like this:
import time
from multiprocessing import Pool
import editdistance

def calc_dist(kw, trie_word):
    dists = []
    while len(trie_word) != 0:
        w = trie_word.pop()
        dist = editdistance.eval(kw, w)
        dists.append((w, dist))
    return dists

if __name__ == "__main__":
    word_list = [str(i) for i in range(1, 10000001)]
    key_word = '2'
    print("calc")
    s = time.time()
    with Pool(processes=4) as pool:
        result = pool.apply_async(calc_dist, (key_word, word_list))
        print(len(result.get()))
    print("elapsed time",time.time()-s)
Using threading:
import threading

class DistThread(threading.Thread):
    def __init__(self, func, args):
        super(DistThread, self).__init__()
        self.func = func
        self.args = args
        self.dists = None

    def run(self):
        self.dists = self.func(*self.args)

    def join(self):
        # Thread.join() takes an optional timeout, not the thread itself
        super().join()
        return self.dists
On my computer, the multiprocessing version takes about 118 s, but the threaded version takes about 36 s. What is wrong with it?
A couple of issues:
A significant amount of time is spent serialising the data so it can be sent to the other process, whereas threads share the same address space, so pointers can be used.
Your current code uses only one process to do all the calculations with multiprocessing. You need to separate your array into "chunks" somehow so that it can be processed by multiple workers.
e.g.:
import time
from multiprocessing import Pool
import editdistance

def calc_one(trie_word):
    return editdistance.eval(key_word, trie_word)

if __name__ == "__main__":
    word_list = [str(i) for i in range(1, 10000001)]
    key_word = '2'
    print("calc")
    s = time.time()
    with Pool(processes=4) as pool:
        result = pool.map(calc_one, word_list, chunksize=10000)
        print(len(result))
    print("time",time.time()-s)
    s = time.time()
    result = list(calc_one(w) for w in word_list)
    print(len(result))
    print("time",time.time()-s)
This relies on key_word being a global variable. For me, the version using multiple processes takes ~5.3 seconds while the second version takes ~16.9 seconds. That is not 4 times as quick, since the data still needs to be sent back and forth, but it is pretty good.
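One way to avoid depending on a global key_word is to bind it with functools.partial, so the keyword is sent to the workers together with the function. The sketch below is not the answer's code: edit_distance is a plain-Python Levenshtein function standing in for editdistance.eval so the example has no third-party dependency, calc_all is a hypothetical helper, and the word list is shortened to keep the run quick.

```python
from functools import partial
from multiprocessing import Pool

def edit_distance(a, b):
    """Plain Levenshtein distance; a stand-in for editdistance.eval."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def calc_all(key_word, words, processes=4, chunksize=10000):
    # partial binds key_word, so it is pickled and sent along with the function
    worker = partial(edit_distance, key_word)
    with Pool(processes=processes) as pool:
        return pool.map(worker, words, chunksize=chunksize)

if __name__ == "__main__":
    word_list = [str(i) for i in range(1, 100001)]
    result = calc_all('2', word_list)
    print(len(result))
```

This also works under the spawn start method (the default on Windows and macOS), where a global assigned only inside the main guard would not exist in the worker processes.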
I had a similar experience with threading and multiprocessing in Python when consuming CSVs that held a large amount of data. I had a brief look into this and found that multiprocessing spawns multiple processes to perform tasks, which can be slower than running a single multi-threaded process, since threads all run in one place. There is a more definitive answer here: Multiprocessing vs Threading Python.
Pasting the answer from the link in case the link disappears:
The threading module uses threads, the multiprocessing module uses processes. The difference is that threads run in the same memory space, while processes have separate memory. This makes it a bit harder to share objects between processes with multiprocessing. Since threads use the same memory, precautions have to be taken or two threads will write to the same memory at the same time. This is what the global interpreter lock is for.
Spawning processes is a bit slower than spawning threads. Once they are running, there is not much difference.

Unexpected Behavior from Python Multiprocessing Pool Class

I am trying to utilize Python's multiprocessing library to quickly run a function using the 8 processing cores I have on a Linux VM I created. As a test, I am measuring the time in seconds it takes for a worker pool with 4 processes to run a function, and the time it takes to run the same function without a worker pool. The two times come out about the same; in some cases the worker pool takes much longer than running without it.
Script
import requests
import datetime
import multiprocessing as mp

shared_results = []

def stress_test_url(url):
    print('Starting Stress Test')
    count = 0
    while count <= 200:
        response = requests.get(url)
        shared_results.append(response.status_code)
        count += 1

pool = mp.Pool(processes=4)
now = datetime.datetime.now()
results = pool.apply(stress_test_url, args=(url,))
diff = (datetime.datetime.now() - now).total_seconds()
now = datetime.datetime.now()
results = stress_test_url(url)
diff2 = (datetime.datetime.now() - now).total_seconds()
print(diff)
print(diff2)
Terminal Output
Starting Stress Test
Starting Stress Test
44.316212
41.874116
The apply function of multiprocessing.Pool simply runs a function in a separate process and waits for its result. It takes a little longer than running sequentially, as it needs to pack the job to be processed and ship it to the child process via a pipe.
multiprocessing doesn't make sequential operations faster; it simply allows them to be run in parallel if your hardware has more than one core.
Just try this:
urls = ["http://google.com",
"http://example.com",
"http://stackoverflow.com",
"http://python.org"]
results = pool.map(stress_test_url, urls)
You will see that the 4 URLs get visited seemingly at the same time. This means your logic reduces the time needed to visit N websites to roughly the time of N / processes sequential visits.
Lastly, benchmarking a function which performs an HTTP request is a very poor way to measure performance as networks are unreliable. You will hardly get two executions which take the same amount of time no matter whether you use multiprocessing or not.
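The difference between apply and map can be seen without any network access. The sketch below is an illustration under stated assumptions, not the asker's code: it swaps in multiprocessing.pool.ThreadPool, which exposes the same apply and map methods as Pool, and uses time.sleep in a hypothetical slow_task to stand in for a blocking HTTP request.

```python
import time
from multiprocessing.pool import ThreadPool

def slow_task(x):
    time.sleep(0.2)  # stands in for a blocking HTTP request
    return x * 2

with ThreadPool(processes=4) as pool:
    start = time.perf_counter()
    # apply() blocks until each call returns, so the calls run one after another
    applied = [pool.apply(slow_task, (x,)) for x in range(4)]
    apply_time = time.perf_counter() - start

    start = time.perf_counter()
    # map() hands all four tasks to the pool before waiting for results
    mapped = pool.map(slow_task, range(4))
    map_time = time.perf_counter() - start

print(applied == mapped)
print(f"apply: {apply_time:.2f}s, map: {map_time:.2f}s")
```

Both calls return the same results; only map lets the four workers overlap their waiting time.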

parallel request using multiprocessing.dummy

I am trying to run parallel GET requests using multiprocessing.dummy, with progress reporting.
import time
from multiprocessing.dummy import Pool
from functools import partial

class Test(object):
    def __init__(self):
        self.count = 0
        self.threads = 10

    def callback(self, total, x):
        self.count += 1
        if self.count%100==0:
            print("Working ({}/{}) cases processed.".format(self.count, total))

    def do_async(self):
        thread_pool = Pool(self.threads)#self.threads
        input_list = link
        callback = partial(self.callback, len(link))
        tasks = [thread_pool.apply_async(get_data, (x,), callback=callback) for x in input_list]
        return (task.get() for task in tasks)

start = time.time()
t = Test()
results = t.do_async()
end = time.time()
The result: it takes the same time as the non-parallel requests.
CPython's Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time, even if there are multiple CPU cores available. multiprocessing.dummy is just a wrapper for using threads, so it cannot speed up CPU-bound work; threads help only while they are blocked on I/O, when the GIL is released.
To get the benefit of having multiple CPUs, you must use multiprocessing itself. However, there are overheads based on the cost of sending and receiving the input and output data of the sub-process. If the cost of this is greater than the amount of work done by the sub-process then using multiprocessing can actually slow your program down. So in your example, multiprocessing would likely not give you a speed increase. This is especially true as most of the work in the callback involves printing to standard out, which all the processes in the pool must synchronise over to prevent garbage being printed out.
I found a solution in concurrent.futures:
import concurrent.futures as futures
import datetime
import sys
import time

results=[]
print("start", datetime.datetime.now().isoformat())
start =time.time()
with futures.ThreadPoolExecutor(max_workers=100) as executor:
    fs = [executor.submit(get_data, url) for url in link]
    for i, f in enumerate(futures.as_completed(fs)):
        results.append(f.result())
        if i%100==0:
            sys.stdout.write("line nr: {} / {} \r".format(i, len(link)))

Asynchronous multiprocessing with a worker pool in Python: how to keep going after timeout?

I would like to run a number of jobs using a pool of processes and apply a given timeout after which a job should be killed and replaced by another working on the next task.
I have tried to use the multiprocessing module, which offers a method to run a pool of workers asynchronously (e.g. using map_async), but there I can only set a "global" timeout after which all processes would be killed.
Is it possible to have an individual timeout after which only a single process that takes too long is killed and a new worker is added to the pool again instead (processing the next task and skipping the one that timed out)?
Here's a simple example to illustrate my problem:
def Check(n):
    import time
    if n % 2 == 0: # select some (arbitrary) subset of processes
        print("%d timeout" % n)
        while 1:
            # loop forever to simulate some process getting stuck
            pass
    print("%d done" % n)
    return 0

from multiprocessing import Pool
pool = Pool(processes=4)
result = pool.map_async(Check, range(10))
print(result.get(timeout=1))
After the timeout all workers are killed and the program exits. I would like instead that it continues with the next subtask. Do I have to implement this behavior myself or are there existing solutions?
Update
It is possible to kill the hanging workers and they are automatically replaced. So I came up with this code:
import multiprocessing

jobs = pool.map_async(Check, range(10))
while 1:
    try:
        print("Waiting for result")
        result = jobs.get(timeout=1)
        break # all clear
    except multiprocessing.TimeoutError:
        # kill all processes
        for c in multiprocessing.active_children():
            c.terminate()
print(result)
The problem now is that the loop never exits; even after all tasks have been processed, calling get yields a timeout exception.
The pebble Pool module has been built to solve these types of issues. It supports timeouts on given tasks, making it possible to detect them and easily recover.
from pebble import ProcessPool
from concurrent.futures import TimeoutError

with ProcessPool() as pool:
    future = pool.schedule(function, args=[1,2], timeout=5)
    try:
        result = future.result()
    except TimeoutError as error:
        # bind the exception so its arguments can be reported
        print("Function took longer than %d seconds" % error.args[1])
For your specific example:
from pebble import ProcessPool
from concurrent.futures import TimeoutError

results = []
with ProcessPool(max_workers=4) as pool:
    future = pool.map(Check, range(10), timeout=5)
    iterator = future.result()
    # iterate over all results, if a computation timed out
    # print it and continue to the next result
    while True:
        try:
            result = next(iterator)
            results.append(result)
        except StopIteration:
            break
        except TimeoutError as error:
            print("function took longer than %d seconds" % error.args[1])
print(results)
Currently Python does not provide native means to control the execution time of each distinct task in the pool from outside the worker itself.
So the easy way is to use wait_procs in the psutil module and implement the tasks as subprocesses.
If nonstandard libraries are not desirable, then you have to implement your own pool on top of the subprocess module, with the working cycle in the main process poll()-ing the execution of each worker and performing the required actions.
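A minimal sketch of such a hand-rolled pool, using only the standard library, might look like the following. It is an illustration under assumptions rather than a production pool: run_with_timeouts and make_command are hypothetical names, each task is an external command, and the main loop simply polls every worker and kills any that outlives its own deadline.

```python
import subprocess
import sys
import time

def run_with_timeouts(commands, max_workers=4, timeout=1.0, poll_interval=0.05):
    """Run each command as a subprocess, killing any that exceeds `timeout` seconds.

    Returns a dict mapping each command's index to its exit code,
    or to None if the command was killed after timing out.
    """
    pending = list(enumerate(commands))
    running = {}   # index -> (Popen object, start time)
    results = {}
    while pending or running:
        # Top up the pool so that at most max_workers commands run at once.
        while pending and len(running) < max_workers:
            index, command = pending.pop(0)
            running[index] = (subprocess.Popen(command), time.monotonic())
        # Poll each worker: collect finished ones, kill those past their deadline.
        for index, (proc, started) in list(running.items()):
            code = proc.poll()
            if code is not None:
                results[index] = code
                del running[index]
            elif time.monotonic() - started > timeout:
                proc.kill()
                proc.wait()
                results[index] = None
                del running[index]
        time.sleep(poll_interval)
    return results

def make_command(n):
    # Hypothetical task mirroring Check: even n loops forever, odd n exits at once.
    script = "import sys\nn = int(sys.argv[1])\nwhile n % 2 == 0:\n    pass\n"
    return [sys.executable, "-c", script, str(n)]

if __name__ == "__main__":
    print(run_with_timeouts([make_command(n) for n in range(10)], timeout=1.0))
```

Because each task has its own start time, a stuck task frees its slot for the next one without affecting the tasks that are still within their limit.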
As for the updated problem, the pool becomes corrupted if you directly terminate one of the workers (this is a bug in the interpreter implementation, because such behavior should not be allowed): the worker is recreated, but the task is lost and the pool becomes nonjoinable.
You have to terminate the whole pool and then recreate it for the remaining tasks:
import multiprocessing
from multiprocessing import Pool

while True:
    pool = Pool(processes=4)
    jobs = pool.map_async(Check, range(10))
    print("Waiting for result")
    try:
        result = jobs.get(timeout=1)
        break # all clear
    except multiprocessing.TimeoutError:
        # kill all processes
        pool.terminate()
        pool.join()
print(result)
UPDATE
Pebble is an excellent and handy library which solves the issue. Pebble is designed for the asynchronous execution of Python functions, whereas PyExPool is designed for the asynchronous execution of modules and external executables, though both can be used interchangeably.
When third-party dependencies are not desirable, PyExPool can be a good choice: it is a single-file, lightweight implementation of a multi-process execution pool with per-job and global timeouts, the option to group jobs into tasks, and other features.
PyExPool can be embedded into your sources and customized; it has a permissive Apache 2.0 license and production quality, being used in the core of a high-load scientific benchmarking framework.
Try a construction where each process is joined with a timeout on a separate thread. That way the main program never gets stuck, and any process that does get stuck is killed once the timeout expires. This technique is a combination of the threading and multiprocessing modules.
Here is my way to maintain a minimum of x threads in memory. It is a combination of the threading and multiprocessing modules. It may be unusual compared with the techniques other members have explained above, but it may be worth considering. For the sake of explanation, I am taking the scenario of crawling a minimum of 5 websites at a time.
so here it is:-
# importing dependencies.
from multiprocessing import Process
from threading import Thread
import threading

# Crawler function
def crawler(domain):
    # define crawler technique here.
    output.write(scrapeddata + "\n")
    pass
Next is the threadController function. This function controls the flow of threads into main memory. It keeps activating threads to maintain the threadNum "minimum" limit, i.e. 5. It also won't exit until all active threads (activeCount) have finished.
It maintains a minimum of threadNum (5) startProcess threads (these threads eventually start the processes from processList and join them with a timeout of 60 seconds). After starting threadController, there are 2 threads that are not included in the above limit of 5, i.e. the main thread and the threadController thread itself. That's why threading.activeCount() != 2 is used.
def threadController():
    print("Thread count before child thread starts is:-", threading.activeCount(), len(processList))
    # starting first thread. This will make the activeCount=3
    Thread(target = startProcess).start()
    # loop while the process list is not empty OR active threads have not finished up.
    while len(processList) != 0 or threading.activeCount() != 2:
        if (threading.activeCount() < (threadNum + 2) and # if the count of active threads is less than the minimum AND
                len(processList) != 0): # processList is not empty
            Thread(target = startProcess).start() # This line starts the startProcess function as a separate thread **
The startProcess function, run as a separate thread, starts processes from processList. The purpose of this function (**started as a separate thread) is to act as a parent thread for the processes. When it joins a process with a timeout of 60 seconds, that blocks the startProcess thread, but it does not block threadController. This way, threadController works as required.
def startProcess():
    pr = processList.pop(0)
    pr.start()
    pr.join(60.00) # wait for the process for at most 60 seconds (a float)
    if pr.is_alive():
        pr.terminate() # join() only stops waiting; a stuck process must be terminated explicitly

if __name__ == '__main__':
    # a file holding a list of domains
    domains = open("Domains.txt", "r").read().split("\n")
    output = open("test.txt", "a")
    processList = [] # process list
    threadNum = 5 # number of thread initiated processes to be run at one time
    # making process List
    for r in range(0, len(domains), 1):
        domain = domains[r].strip()
        p = Process(target = crawler, args = (domain,))
        processList.append(p) # making a list of performer processes.
    # starting the threadController as a separate thread.
    mt = Thread(target = threadController)
    mt.start()
    mt.join() # won't let go next until threadController thread finishes.
    output.close()
    print("Done")
Besides maintaining a minimum number of threads in memory, my aim was also to have something that avoids stuck threads or processes in memory. I did this using the timeout. My apologies for any typing mistakes.
I hope this construction helps someone.
Regards,
Vikas Gautam
