How to share resources between several threads? - python

I'm writing a script that should retrieve file data from S3 storage in parallel. I cannot do that with a simple Pool object from the multiprocessing module, because I then get an error that the boto3 Resource object cannot be pickled.
I see why, and I implemented a solution that uses Pool but whose target function does not use a shared Resource and instead creates a new Resource within each worker. However, I found that to be so inefficient that even sequential processing would be faster.
I know that Pool has an option to set the number of workers that are used, but is there any way to create several instances of Resource and "assign" them to those workers, so that I would not have to create a new instance for each item to be processed?
Code:
import itertools
import json
from multiprocessing import Pool

import boto3

class Client:
    def __init__(self, url, access_key, secret_key):
        self.url = url
        self.access_key = access_key
        self.secret_key = secret_key

    def get_resource(self):
        return boto3.resource('s3',
                              endpoint_url=self.url,
                              aws_access_key_id=self.access_key,
                              aws_secret_access_key=self.secret_key)

    def get_data_for_object(self, key):
        data = []
        resource = self.get_resource()
        for line in resource.Object(bucket_name='bucket_name', key=key).get()['Body'].iter_lines():
            data.append(json.loads(line))
        return data

    def get_data_for_minute(self, minute):
        keys = [...]  # object keys in S3 storage
        pool = Pool()
        data = pool.imap(self.get_data_for_object, keys)
        flat_data = list(itertools.chain.from_iterable(data))
        return flat_data
Thanks for any help.
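One way to avoid creating a new Resource per item is to create it once per worker process through Pool's initializer and keep it in a module-level variable, so each worker reuses its own instance. Below is a minimal sketch of that idea, not the original code: the _init_worker helper, the module-level _resource, and the standalone get_data_for_minute are illustrative names built around the same Client fields as above.
import itertools
import json
from multiprocessing import Pool

import boto3

_resource = None  # one boto3 resource per worker process

def _init_worker(url, access_key, secret_key):
    # Runs once in each worker process when the pool starts.
    global _resource
    _resource = boto3.resource('s3',
                               endpoint_url=url,
                               aws_access_key_id=access_key,
                               aws_secret_access_key=secret_key)

def _get_data_for_object(key):
    # Reuses the worker's resource instead of creating a new one per key.
    obj = _resource.Object(bucket_name='bucket_name', key=key)
    return [json.loads(line) for line in obj.get()['Body'].iter_lines()]

def get_data_for_minute(client, keys):
    with Pool(initializer=_init_worker,
              initargs=(client.url, client.access_key, client.secret_key)) as pool:
        data = pool.imap(_get_data_for_object, keys)
        return list(itertools.chain.from_iterable(data))
With this layout the number of Resource instances equals the number of pool workers rather than the number of keys processed.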

Related

How to enter MFA code in each of the multiprocessing Pool threads?

I have a function that uploads files to S3, but it asks for an MFA code before uploading starts. I am passing the function to a multiprocessing pool, which creates two processes and runs the function twice concurrently.
When I run my script, it asks for the MFA code twice in the terminal, but the script crashes.
How do I enter the MFA code in both processes concurrently and authenticate both of them?
Here is my Python code:
import multiprocessing
import boto3

session = boto3.Session()
s3_client = session.client('s3')

def load_to_s3(file_path):
    response = s3_client.upload_file(file_path, bucket, target_path)  # This line asks for MFA
    return response

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2)
    response_list = pool.map(load_to_s3, file_path_chunks)
Error Messages -
You can try to use multiprocessing.Lock to make sure that only one process will be authenticated at a time.
Also you may need to create a new client for each process:
Resource instances are not thread safe and should not be shared across threads or processes. These special classes contain additional metadata that cannot be shared. It's recommended to create a new Resource for each thread or process.
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html?highlight=multithreading#multithreading-or-multiprocessing-with-resources
Example:
import multiprocessing
import time
import boto3

lock = multiprocessing.Lock()

def load_to_s3(file_path):
    with lock:
        print(file_path)
        # there will be only one process at a time
        # do your work here
        # session = boto3.Session()
        # s3_client = session.client('s3')
        time.sleep(1)

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)
    file_path_chunks = ["1", "2", "3", "4"]
    response_list = pool.map(load_to_s3, file_path_chunks)

How to send a progress of operation in a FastAPI app?

I have deployed a fastapi endpoint,
from fastapi import FastAPI, UploadFile
from typing import List

app = FastAPI()

@app.post('/work/test')
async def testing(files: List[UploadFile]):
    for i in files:
        .......
        # do a lot of operations on each file
        # after that I am just writing that processed data into a mysql database
        # cur.execute(...)
        # cur.commit()
        .......
    # just returning "OK" to confirm data is written into mysql
    return {"response": "OK"}
I can request output from the API endpoint, and it's working perfectly for me.
Now, the biggest challenge for me is to know how much time each iteration is taking, because in the UI (for those who are accessing my API endpoint) I want to show a progress bar (TIME TAKEN) for each iteration/file being processed.
Is there any possible way for me to achieve this? If so, please help me with how I can proceed.
Thank you.
Approaches
Polling
The preferred approach to tracking the progress of a task is polling:
After receiving a request to start a task on the backend:
Create a task object in storage (e.g. in-memory, Redis, etc.). The task object must contain data such as: task ID, status (pending, completed), result, and so on.
Run the task in the background (coroutines, threading, multiprocessing, or a task queue like Celery, arq, aio-pika, dramatiq, etc.).
Respond immediately with 202 (Accepted), returning the task ID created above.
Update the task status:
This can be done from within the task itself, if it knows about the task store and has access to it: the task periodically updates information about itself.
Or use a task monitor (Observer, producer-consumer pattern), which watches the status of the task and its result and updates the information in the storage.
On the client side (front end), start a polling cycle for the task status against the /task/{ID}/status endpoint, which reads information from the task storage.
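As an illustration of the client side, a polling loop could look like the rough sketch below. It assumes the /new_task/{param} and /task/{uid}/status endpoints from the demo further down, the requests library, and a 0.5-second interval; none of these are requirements.
import time
import requests

BASE = "http://127.0.0.1:8000"

def run_and_poll(param: int) -> dict:
    # Start the task; the server answers 202 Accepted with the job object.
    job = requests.post(f"{BASE}/new_task/{param}").json()
    uid = job["uid"]

    # Poll the status endpoint until the job reports completion.
    while True:
        status = requests.get(f"{BASE}/task/{uid}/status").json()
        print("progress:", status["progress"])
        if status["status"] == "complete":
            return status
        time.sleep(0.5)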
Streaming response
Streaming is a less convenient way of periodically getting the status of request processing: the server gradually pushes responses without closing the connection. It has a number of significant disadvantages; for example, if the connection is broken, you can lose information. A streaming API is a different approach from a REST API.
Websockets
You can also use websockets for real-time notifications and bidirectional communication.
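As a sketch of what that could look like with FastAPI (assuming the same app, jobs dict and Job model as in the polling demo further down; the route path and the one-second interval are made up for illustration):
import asyncio
from uuid import UUID

from fastapi import WebSocket

@app.websocket("/ws/task/{uid}")
async def task_progress_ws(websocket: WebSocket, uid: UUID):
    # Push the job's status to the client until it completes.
    await websocket.accept()
    while (job := jobs.get(uid)) is not None:
        await websocket.send_json({"status": job.status, "progress": job.progress})
        if job.status == "complete":
            break
        await asyncio.sleep(1)
    await websocket.close()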
Links:
Examples of the polling approach for a progress bar, with a more detailed description for Django + Celery, can be found at these links:
https://www.dangtrinh.com/2013/07/django-celery-display-progress-bar-of.html
https://buildwithdjango.com/blog/post/celery-progress-bars/
I have provided simplified examples of running background tasks in FastAPI using multiprocessing here:
https://stackoverflow.com/a/63171013/13782669
Old answer:
You could run a task in the background, return its ID, and provide a /status endpoint that the front end would call periodically. In the status response, you could return the current state of your task (for example, pending, with the number of the currently processed file). I provided a few simple examples here.
Demo
Polling
Demo of the approach using asyncio tasks (single worker solution):
import asyncio
from http import HTTPStatus
from typing import Dict, Optional
from uuid import UUID, uuid4

import uvicorn
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel, Field

class Job(BaseModel):
    uid: UUID = Field(default_factory=uuid4)
    status: str = "in_progress"
    progress: int = 0
    result: Optional[int] = None

app = FastAPI()
jobs: Dict[UUID, Job] = {}  # Dict as job storage

async def long_task(queue: asyncio.Queue, param: int):
    for i in range(1, param):  # do work and return our progress
        await asyncio.sleep(1)
        await queue.put(i)
    await queue.put(None)

async def start_new_task(uid: UUID, param: int) -> None:
    queue = asyncio.Queue()
    task = asyncio.create_task(long_task(queue, param))
    while progress := await queue.get():  # monitor task progress
        jobs[uid].progress = progress
    jobs[uid].status = "complete"

@app.post("/new_task/{param}", status_code=HTTPStatus.ACCEPTED)
async def task_handler(background_tasks: BackgroundTasks, param: int):
    new_task = Job()
    jobs[new_task.uid] = new_task
    background_tasks.add_task(start_new_task, new_task.uid, param)
    return new_task

@app.get("/task/{uid}/status")
async def status_handler(uid: UUID):
    return jobs[uid]
Example adapted for the for loop from the question
The background processing function is defined as a plain def, so FastAPI runs it in a thread pool.
import time
from http import HTTPStatus
from typing import Dict, List
from uuid import UUID, uuid4

from fastapi import BackgroundTasks, FastAPI, File, UploadFile
from pydantic import BaseModel, Field

class Job(BaseModel):
    uid: UUID = Field(default_factory=uuid4)
    status: str = "in_progress"
    processed_files: List[str] = Field(default_factory=list)

app = FastAPI()
jobs: Dict[UUID, Job] = {}

def process_files(task_id: UUID, files: List[UploadFile]):
    for i in files:
        time.sleep(5)  # pretend long task
        # ...
        # do a lot of operations on each file
        # then append the processed file to a list
        # ...
        jobs[task_id].processed_files.append(i.filename)
    jobs[task_id].status = "completed"

@app.post('/work/test', status_code=HTTPStatus.ACCEPTED)
async def work(background_tasks: BackgroundTasks, files: List[UploadFile] = File(...)):
    new_task = Job()
    jobs[new_task.uid] = new_task
    background_tasks.add_task(process_files, new_task.uid, files)
    return new_task

@app.get("/work/{uid}/status")
async def status_handler(uid: UUID):
    return jobs[uid]
Streaming
from fastapi.responses import StreamingResponse

async def process_files_gen(files: List[UploadFile]):
    for i in files:
        time.sleep(5)  # pretend long task
        # ...
        # do a lot of operations on each file
        # then append the processed file to a list
        # ...
        yield f"{i.filename} processed\n"
    yield "OK\n"

@app.post('/work/stream/test', status_code=HTTPStatus.ACCEPTED)
async def work(files: List[UploadFile] = File(...)):
    return StreamingResponse(process_files_gen(files))
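To actually see the per-file progress, the client has to consume the response incrementally rather than waiting for the whole body. A rough sketch with httpx (the file names and URL are placeholders):
import httpx

# Placeholder upload payload; each tuple is (form field name, file object).
files = [("files", open("a.csv", "rb")), ("files", open("b.csv", "rb"))]

with httpx.stream("POST", "http://127.0.0.1:8000/work/stream/test", files=files) as response:
    for line in response.iter_lines():
        print(line)  # one "<name> processed" line per file, then "OK"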
Below is a solution which uses unique identifiers and a globally available dictionary which holds information about the jobs:
NOTE: The code below is safe to use as long as you use dynamic key values (the sample uses uuid) and keep the application within a single process.
To start the app:
1. Create a file main.py with the code below.
2. Run uvicorn main:app --reload.
3. Create a job entry by accessing http://127.0.0.1:8000/.
4. Repeat step 3 to create multiple jobs.
5. Go to the http://127.0.0.1:8000/status page to see the statuses of all jobs.
6. Go to http://127.0.0.1:8000/status/{identifier} to see the progression of a job by its job id.
Code of the app:
import asyncio
import uuid
from typing import List

from fastapi import FastAPI, UploadFile

context = {'jobs': {}}

app = FastAPI()

async def do_work(job_key, files=None):
    iter_over = files if files else range(100)
    for file_number, file in enumerate(iter_over):
        jobs = context['jobs']
        job_info = jobs[job_key]
        job_info['iteration'] = file_number
        job_info['status'] = 'inprogress'
        await asyncio.sleep(1)
    context['jobs'][job_key]['status'] = 'done'

@app.post('/work/test')
async def testing(files: List[UploadFile]):
    identifier = str(uuid.uuid4())
    context['jobs'][identifier] = {}
    asyncio.run_coroutine_threadsafe(do_work(identifier, files), loop=asyncio.get_running_loop())
    return {"identifier": identifier}

@app.get('/')
async def get_testing():
    identifier = str(uuid.uuid4())
    context['jobs'][identifier] = {}
    asyncio.run_coroutine_threadsafe(do_work(identifier), loop=asyncio.get_running_loop())
    return {"identifier": identifier}

@app.get('/status')
def status():
    return {
        'all': list(context['jobs'].values()),
    }

@app.get('/status/{identifier}')
async def status_by_id(identifier):
    return {
        "status": context['jobs'].get(identifier, 'job with that identifier is undefined'),
    }

How to create Gcp Memory-store using python

I am trying to automate GCP Memorystore creation, but I didn't find a way to create it using Python. Please help.
You can use the Python Client for the Google Cloud Memorystore for Redis API in order to create it.
You can use the create_instance method of the client library, which creates a Redis instance based on the specified tier and memory size:
async create_instance(request: google.cloud.redis_v1.types.cloud_redis.CreateInstanceRequest = None, *,
                      parent: str = None, instance_id: str = None,
                      instance: google.cloud.redis_v1.types.cloud_redis.Instance = None,
                      retry: google.api_core.retry.Retry = <object object>,
                      timeout: float = None, metadata: Sequence[Tuple[str, str]] = ())
from google.cloud import redis_v1beta1
from google.cloud.redis_v1beta1 import enums

client = redis_v1beta1.CloudRedisClient()

parent = client.location_path('<project>', '<location>')
instance_id = 'test-instancee'
tier = enums.Instance.Tier.BASIC
memory_size_gb = 1
instance = {'tier': tier, 'memory_size_gb': memory_size_gb}

response = client.create_instance(parent, instance_id, instance)

def callback(operation_future):
    # Handle result.
    result = operation_future.result()

response.add_done_callback(callback)

# Handle metadata.
# metadata = response.metadata()

print "Created"
This code works fine, but only for Python 2. If there is any way to use it in Python 3, please mention it.
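For what it's worth, assuming the same redis_v1beta1 client library, the rest of the snippet above is already valid Python 3 syntax; only the print statement is Python 2 specific, and the callback can be replaced by blocking on the operation if that is more convenient. A sketch of the tail end in Python 3:
response = client.create_instance(parent, instance_id, instance)

# Block until the long-running operation finishes instead of using a callback.
result = response.result()

print("Created")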

Reading multiple "bulked" jsons from s3 asynchronously. Is there a better way?

The goal is to load a large amount of "bulked" JSONs from S3. I found aiobotocore and felt urged to try it, in the hope of getting more efficiency while at the same time familiarising myself with asyncio. I gave it a shot and it works, but I know basically nada about asynchronous programming, so I was hoping for some improvements/comments. Maybe there are some kind souls out there who can spot some obvious mistakes.
The problem is that boto3 only supports one HTTP request at a time. By utilising a ThreadPool I managed to get significant improvements, but I'm hoping for a more efficient way.
Here is the code:
Imports:
import os
import asyncio
import aiobotocore
from itertools import chain
import json
from json.decoder import WHITESPACE
A helper generator I found somewhere that returns decoded JSONs from a string containing multiple JSONs:
def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
    '''helper for parsing individual jsons from string of jsons (stolen from somewhere)'''
    string = str(string_or_fp)
    decoder = cls(**kwargs)
    idx = WHITESPACE.match(string, 0).end()
    while idx < len(string):
        obj, end = decoder.raw_decode(string, idx)
        yield obj
        idx = WHITESPACE.match(string, end).end()
This function gets keys from an s3 bucket with a given prefix:
# Async stuff starts here
async def get_keys(loop, bucket, prefix):
    '''Get keys in bucket based on prefix'''
    session = aiobotocore.get_session(loop=loop)
    async with session.create_client('s3', region_name='us-west-2',
                                     aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                                     aws_access_key_id=AWS_ACCESS_KEY_ID) as client:
        keys = []
        # list s3 objects using paginator
        paginator = client.get_paginator('list_objects')
        async for result in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for c in result.get('Contents', []):
                keys.append(c['Key'])
        return keys
This function gets the content for a provided key. On top of that, it flattens the list of decoded content:
async def get_object(loop, bucket, key):
    '''Get json content from s3 object'''
    session = aiobotocore.get_session(loop=loop)
    async with session.create_client('s3', region_name='us-west-2',
                                     aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                                     aws_access_key_id=AWS_ACCESS_KEY_ID) as client:
        # get object from s3
        response = await client.get_object(Bucket=bucket, Key=key)
        async with response['Body'] as stream:
            content = await stream.read()
        return list(iterload(content.decode()))
Here is the main function which gathers the contents for all the found keys and flattens the list of contents.
async def go(loop, bucket, prefix):
    '''Returns list of dicts of object contents'''
    session = aiobotocore.get_session(loop=loop)
    async with session.create_client('s3', region_name='us-west-2',
                                     aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                                     aws_access_key_id=AWS_ACCESS_KEY_ID) as client:
        keys = await get_keys(loop, bucket, prefix)
        contents = await asyncio.gather(*[get_object(loop, bucket, k) for k in keys])
        return list(chain.from_iterable(contents))
Finally, I run this and the resulting list of dicts ends up nicely in result:
loop = asyncio.get_event_loop()
result = loop.run_until_complete(go(loop, 'some-bucket', 'some-prefix'))
One thing that I think might be a bit weird is that I create a client in each async function. Probably that can be lifted out. Not sure how aiobotocore works with multiple clients.
Furthermore, I think you should not need to wait for all keys to be loaded before loading the objects for those keys, which I think is what happens in this implementation. I'm assuming that as soon as a key is found you could call get_object, so maybe it should be an async generator. But I'm not completely clear on this.
Thank you in advance! Hope this helps someone in a similar situation.
First, check out aioboto3.
Second, each client in aiobotocore is associated with an aiohttp session. Each session can have up to max_pool_connections connections. This is why the basic aiobotocore example does an async with on create_client, so the pool is closed when you're done using the client.
Here are some tips:
You should use a work pool (created by me, modularized by CaliDog) to avoid polluting your event loop. When using this, think of your workflow as a stream.
This avoids having to use asyncio.gather, which will leave tasks running in the background after the first exception is thrown.
You should tune your work pool size and max_pool_connections together, and only use one client, with the number of tasks you want to (or, based on the compute required, can) support in parallel.
You really don't need to pass the loop around, as with modern Python versions there's one loop per thread.
You should use AWS profiles (the profile param to the Session init) or environment variables so you don't need to hardcode key and region information.
Based on the above here is how I would do it:
import asyncio
from itertools import chain
import json
from typing import List
from json.decoder import WHITESPACE
import logging
from functools import partial

# Third Party
import asyncpool
import aiobotocore.session
import aiobotocore.config

_NUM_WORKERS = 50

def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
    # helper for parsing individual jsons from string of jsons (stolen from somewhere)
    string = str(string_or_fp)
    decoder = cls(**kwargs)
    idx = WHITESPACE.match(string, 0).end()
    while idx < len(string):
        obj, end = decoder.raw_decode(string, idx)
        yield obj
        idx = WHITESPACE.match(string, end).end()

async def get_object(s3_client, bucket: str, key: str):
    # Get json content from s3 object

    # get object from s3
    response = await s3_client.get_object(Bucket=bucket, Key=key)
    async with response['Body'] as stream:
        content = await stream.read()
    return list(iterload(content.decode()))

async def go(bucket: str, prefix: str) -> List[dict]:
    """
    Returns list of dicts of object contents

    :param bucket: s3 bucket
    :param prefix: s3 bucket prefix
    :return: list of dicts of object contents
    """
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger()

    session = aiobotocore.session.AioSession()
    config = aiobotocore.config.AioConfig(max_pool_connections=_NUM_WORKERS)
    contents = []

    async with session.create_client('s3', config=config) as client:
        worker_co = partial(get_object, client, bucket)
        async with asyncpool.AsyncPool(None, _NUM_WORKERS, 's3_work_queue', logger, worker_co,
                                       return_futures=True, raise_on_join=True, log_every_n=10) as work_pool:
            # list s3 objects using paginator
            paginator = client.get_paginator('list_objects')
            async for result in paginator.paginate(Bucket=bucket, Prefix=prefix):
                for c in result.get('Contents', []):
                    contents.append(await work_pool.push(c['Key']))

    # retrieve results from futures
    contents = [c.result() for c in contents]
    return list(chain.from_iterable(contents))

_loop = asyncio.get_event_loop()
_result = _loop.run_until_complete(go('some-bucket', 'some-prefix'))

gunicorn with gevent workers: Using a shared global list

I am trying to implement Server-Sent Events in my Flask application by following this simple recipe: http://flask.pocoo.org/snippets/116/
For serving the app, I use gunicorn with gevent workers.
A minimal version of my code looks like this:
import multiprocessing

from gevent.queue import Queue
from gunicorn.app.base import BaseApplication
from flask import Flask, Response

app = Flask('minimal')

# NOTE: This is the global list of subscribers
subscriptions = []

class ServerSentEvent(object):

    def __init__(self, data):
        self.data = data
        self.event = None
        self.id = None
        self.desc_map = {
            self.data: "data",
            self.event: "event",
            self.id: "id"
        }

    def encode(self):
        if not self.data:
            return ""
        lines = ["%s: %s" % (v, k)
                 for k, v in self.desc_map.iteritems() if k]
        return "%s\n\n" % "\n".join(lines)

@app.route('/api/events')
def subscribe_events():
    def gen():
        q = Queue()
        print "New subscription!"
        subscriptions.append(q)
        print len(subscriptions)
        print id(subscriptions)
        try:
            while True:
                print "Waiting for data"
                result = q.get()
                print "Got data: " + result
                ev = ServerSentEvent(unicode(result))
                yield ev.encode()
        except GeneratorExit:
            print "Removing subscription"
            subscriptions.remove(q)
    return Response(gen(), mimetype="text/event-stream")

@app.route('/api/test')
def push_event():
    print len(subscriptions)
    print id(subscriptions)
    for sub in subscriptions:
        sub.put("test")
    return "OK"

class GunicornApplication(BaseApplication):

    def __init__(self, wsgi_app, port=5000):
        self.options = {
            'bind': "0.0.0.0:{port}".format(port=port),
            'workers': multiprocessing.cpu_count() + 1,
            'worker_class': 'gevent',
            'preload_app': True,
        }
        self.application = wsgi_app
        super(GunicornApplication, self).__init__()

    def load_config(self):
        config = dict([(key, value) for key, value in self.options.iteritems()
                       if key in self.cfg.settings and value is not None])
        for key, value in config.iteritems():
            self.cfg.set(key.lower(), value)

    def load(self):
        return self.application

if __name__ == '__main__':
    gapp = GunicornApplication(app)
    gapp.run()
The problem is that the subscriber's list seems to be different for every worker. This means that if worker #1 handles the /api/events endpoint and adds a new subscriber to the list, the client will only receive events that are added when worker #1 also handles the /api/test endpoint.
Curiously enough, the actual list object seems to be the same for each worker, since id(subscriptions) returns the same value in every worker.
Is there a way around this? I know that I could just use Redis, but the application is supposed to be as self-contained as possible, so I'm trying to avoid any external services.
Update: The cause of the problem seems to be my embedding of gunicorn.app.base.BaseApplication (which is a new feature in v0.19). When running the application from the command line with gunicorn -k gevent minimal:app, everything works as expected.
Update 2: The previous suspicion turned out to be wrong; the only reason it worked was that gunicorn's default number of worker processes is 1. When the number is adjusted to match the code via the -w parameter, it exhibits the same behavior.
You say:
the actual list object seems to be the same for each worker, since id(subscriptions) returns the same value in every worker.
but I think that's not true: the subscriptions list in each worker is not the same object. Each worker is an individual process and has its own memory space.
For a self-contained system, you could build a tiny component that functions like a simple version of Redis, for example using SQLite or ZeroMQ to communicate between the workers.
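As a rough sketch of the ZeroMQ route (the ports, helper names, and XSUB/XPUB broker layout are illustrative assumptions, and inside gevent workers you would also want gevent-compatible socket handling): one forwarder process relays events published by any worker to a subscriber loop in every worker, and that loop feeds the worker-local subscription queues.
import zmq

BROKER_IN = "tcp://127.0.0.1:5561"   # workers publish here
BROKER_OUT = "tcp://127.0.0.1:5562"  # workers subscribe here

def run_broker():
    # Run this once in its own process, outside the gunicorn workers.
    ctx = zmq.Context()
    xsub = ctx.socket(zmq.XSUB)
    xsub.bind(BROKER_IN)
    xpub = ctx.socket(zmq.XPUB)
    xpub.bind(BROKER_OUT)
    zmq.proxy(xsub, xpub)  # forward everything from publishers to subscribers

# --- inside each worker ---
_ctx = zmq.Context.instance()
_pub = _ctx.socket(zmq.PUB)
_pub.connect(BROKER_IN)

def publish(event_data):
    # Called from /api/test instead of touching the worker-local list directly.
    _pub.send_string(event_data)

def listen(on_event):
    # Run in a background greenlet/thread per worker; on_event() should push
    # the received data into that worker's local subscription queues.
    sub = _ctx.socket(zmq.SUB)
    sub.connect(BROKER_OUT)
    sub.setsockopt_string(zmq.SUBSCRIBE, "")
    while True:
        on_event(sub.recv_string())
Each worker keeps its own subscriptions list exactly as in the snippet above, but /api/test calls publish() instead of writing to the local list, so every worker's subscribers receive the event.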
