The goal is to load a large number of "bulked" JSON objects from S3. I found aiobotocore and was keen to try it, hoping to gain efficiency while familiarising myself with asyncio. I gave it a shot and it works, but I know basically nothing about asynchronous programming, so I'm hoping for some improvements/comments. Maybe some kind souls out there can spot some obvious mistakes.
The problem is that boto3 only supports one HTTP request at a time. By using a ThreadPool I managed to get significant improvements, but I'm hoping for a more efficient way.
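For reference, the ThreadPool baseline I'm comparing against looks roughly like this (just a sketch: the bucket name, pool size and the fetch_one helper are simplified stand-ins for the real thing):

from multiprocessing.dummy import Pool as ThreadPool

import boto3

def fetch_one(key):
    # a fresh client per call keeps the sketch simple
    s3 = boto3.client('s3', region_name='us-west-2')
    body = s3.get_object(Bucket='some-bucket', Key=key)['Body'].read()
    return list(iterload(body.decode()))  # iterload is defined below

with ThreadPool(16) as pool:
    results = pool.map(fetch_one, keys)  # keys: a list of object keys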
Here is the code:
Imports:
import os
import asyncio
import aiobotocore
from itertools import chain
import json
from json.decoder import WHITESPACE
A helper generator I found somewhere that yields decoded JSON objects from a string containing multiple JSON documents:
def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
    '''helper for parsing individual jsons from string of jsons (stolen from somewhere)'''
    string = str(string_or_fp)
    decoder = cls(**kwargs)
    idx = WHITESPACE.match(string, 0).end()
    while idx < len(string):
        obj, end = decoder.raw_decode(string, idx)
        yield obj
        idx = WHITESPACE.match(string, end).end()
This function gets keys from an s3 bucket with a given prefix:
# Async stuff starts here
async def get_keys(loop, bucket, prefix):
    '''Get keys in bucket based on prefix'''
    session = aiobotocore.get_session(loop=loop)
    async with session.create_client('s3', region_name='us-west-2',
                                     aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                                     aws_access_key_id=AWS_ACCESS_KEY_ID) as client:
        keys = []
        # list s3 objects using paginator
        paginator = client.get_paginator('list_objects')
        async for result in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for c in result.get('Contents', []):
                keys.append(c['Key'])
        return keys
This function gets the content for a provided key and returns the decoded JSON objects as a list:
async def get_object(loop, bucket, key):
    '''Get json content from s3 object'''
    session = aiobotocore.get_session(loop=loop)
    async with session.create_client('s3', region_name='us-west-2',
                                     aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                                     aws_access_key_id=AWS_ACCESS_KEY_ID) as client:
        # get object from s3
        response = await client.get_object(Bucket=bucket, Key=key)
        async with response['Body'] as stream:
            content = await stream.read()
        return list(iterload(content.decode()))
Here is the main function which gathers the contents for all the found keys and flattens the list of contents.
async def go(loop, bucket, prefix):
    '''Returns list of dicts of object contents'''
    session = aiobotocore.get_session(loop=loop)
    async with session.create_client('s3', region_name='us-west-2',
                                     aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                                     aws_access_key_id=AWS_ACCESS_KEY_ID) as client:
        keys = await get_keys(loop, bucket, prefix)
        contents = await asyncio.gather(*[get_object(loop, bucket, k) for k in keys])
    return list(chain.from_iterable(contents))
Finally, I run this and the resulting list of dicts ends up nicely in result:
loop = asyncio.get_event_loop()
result = loop.run_until_complete(go(loop, 'some-bucket', 'some-prefix'))
One thing that I think might be a bit weird is that I create a client in each async function. That could probably be lifted out, but I'm not sure how aiobotocore behaves with multiple clients.
Furthermore, I don't think you should need to wait for all keys to be loaded before fetching the objects for those keys, which is what this implementation does. I'm assuming that as soon as a key is found you could call get_object, so maybe get_keys should be an async generator, roughly as sketched below. But I'm not completely clear on this.
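Something along these lines is what I have in mind, though I'm not sure it's the right pattern (just a rough sketch; iter_keys and go_streaming are new names I made up, and get_object is unchanged from above):

async def iter_keys(client, bucket, prefix):
    '''Yield keys as the paginator returns them instead of collecting them all first'''
    paginator = client.get_paginator('list_objects')
    async for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            yield obj['Key']

async def go_streaming(loop, bucket, prefix):
    session = aiobotocore.get_session(loop=loop)
    async with session.create_client('s3', region_name='us-west-2',
                                     aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                                     aws_access_key_id=AWS_ACCESS_KEY_ID) as client:
        tasks = []
        async for key in iter_keys(client, bucket, prefix):
            # start fetching as soon as each key is known
            tasks.append(asyncio.ensure_future(get_object(loop, bucket, key)))
        contents = await asyncio.gather(*tasks)
    return list(chain.from_iterable(contents))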
Thank you in advance! Hope this helps someone in a similar situation.
First, check out aioboto3.
Second, each client in aiobotocore is associated with an aiohttp session. Each session can have up to max_pool_connections open at once. This is why the basic aiobotocore example does an async with on create_client, so the pool is closed when you're done using the client.
Here are some tips:
You should use a work pool (created by me, modularized by CaliDog) to avoid polluting your event loop. When using this, think of your workflow as a stream.
This avoids having to use asyncio.gather, which leaves tasks running in the background after the first exception is thrown.
You should tune your work pool size and max_pool_connections together, and use only one client with the number of tasks you want to (or can, based on the compute required) support in parallel.
You really don't need to pass the loop around, as with modern Python versions there's one loop per thread.
You should use AWS profiles (the profile param to the Session init) or environment variables so you don't need to hardcode key and region information.
Based on the above here is how I would do it:
import asyncio
from itertools import chain
import json
from typing import List
from json.decoder import WHITESPACE
import logging
from functools import partial

# Third Party
import asyncpool
import aiobotocore.session
import aiobotocore.config

_NUM_WORKERS = 50


def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
    # helper for parsing individual jsons from string of jsons (stolen from somewhere)
    string = str(string_or_fp)
    decoder = cls(**kwargs)
    idx = WHITESPACE.match(string, 0).end()
    while idx < len(string):
        obj, end = decoder.raw_decode(string, idx)
        yield obj
        idx = WHITESPACE.match(string, end).end()


async def get_object(s3_client, bucket: str, key: str):
    # Get json content from s3 object
    # get object from s3
    response = await s3_client.get_object(Bucket=bucket, Key=key)
    async with response['Body'] as stream:
        content = await stream.read()
    return list(iterload(content.decode()))


async def go(bucket: str, prefix: str) -> List[dict]:
    """
    Returns list of dicts of object contents

    :param bucket: s3 bucket
    :param prefix: s3 bucket prefix
    :return: list of dicts of object contents
    """
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger()

    session = aiobotocore.session.AioSession()
    config = aiobotocore.config.AioConfig(max_pool_connections=_NUM_WORKERS)
    contents = []

    async with session.create_client('s3', config=config) as client:
        worker_co = partial(get_object, client, bucket)
        async with asyncpool.AsyncPool(None, _NUM_WORKERS, 's3_work_queue', logger, worker_co,
                                       return_futures=True, raise_on_join=True, log_every_n=10) as work_pool:
            # list s3 objects using paginator
            paginator = client.get_paginator('list_objects')
            async for result in paginator.paginate(Bucket=bucket, Prefix=prefix):
                for c in result.get('Contents', []):
                    contents.append(await work_pool.push(c['Key']))

    # retrieve results from futures
    contents = [c.result() for c in contents]
    return list(chain.from_iterable(contents))


_loop = asyncio.get_event_loop()
_result = _loop.run_until_complete(go('some-bucket', 'some-prefix'))
Context
I am trying to write a data pipeline using dask distributed and some legacy code from a previous project. get_data simply takes url: str and session: ClientSession as arguments and returns a pandas DataFrame.
from dask.distributed import Client
from aiohttp import ClientSession

client = Client()
session: ClientSession = connector.session_factory()

futures = client.map(
    get_data,  # function to get data (takes url and http session)
    urls,
    [session for _ in range(len(urls))],  # PROBLEM IS HERE
    retries=5,
)
r = client.map(loader.job, futures)
_ = client.gather(r)
Problem
I get the following error
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/worker.py", line 2952, in warn_dumps
b = dumps(obj)
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 58, in dumps
result = cloudpickle.dumps(x, **dump_kwargs)
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 632, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle 'TaskStepMethWrapper' object
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f3042b2fa00>
My temptation was then to register a serializer and a deserializer for this exotic object following this doc
from distributed.protocol import dask_serialize, dask_deserialize

@dask_serialize.register(TaskStepMethWrapper)
def serialize(ctx: TaskStepMethWrapper) -> Tuple[Dict, List[bytes]]:
    header = {}  # ?
    frames = []  # ?
    return header, frames

@dask_deserialize.register(TaskStepMethWrapper)
def deserialize(header: Dict, frames: List[bytes]) -> TaskStepMethWrapper:
    return TaskStepMethWrapper(frames)  # ?
The problem is that I don't know where to import TaskStepMethWrapper from. I know that the TaskStepMethWrapper class is asyncio related:
grep -rnw './' -e '.*TaskStepMethWrapper.*'
grep: ./lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so: binary file matches
But I couldn't find its definition anywhere in site-packages/aiohttp. I also tried to use a Client(asynchronous=True), which only resulted in a TypeError: cannot pickle '_contextvars.Context' object.
How do you handle serialization of exotic objects in dask? Should I extend the dask serializer or use an additional serialization family?
client = Client('tcp://scheduler-address:8786',
                serializers=['dask', 'pickle'],     # BUT WHICH ONE
                deserializers=['dask', 'msgpack'])  # BUT WHICH ONE
There is a far easier way to get around this: create your sessions within the mapped function. You would have been recreating the sessions in each worker anyway; they cannot survive a transfer.
from dask.distributed import Client
from aiohttp import ClientSession

client = Client()

def func(u):
    session: ClientSession = connector.session_factory()
    return get_data(u, session)

futures = client.map(
    func,
    urls,
    retries=5,
)
(I don't know what loader.job is, so I have omitted that).
Note that TaskStepMethWrapper (and anything to do with aiohttp) sounds like it should be called only in async code. Maybe func needs to be async and you need appropriate awaits, roughly as in the sketch below.
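For example, something along these lines, assuming get_data is (or becomes) a coroutine function; the inner helper name is illustrative:

import asyncio
from aiohttp import ClientSession

def func(u):
    async def _inner():
        session: ClientSession = connector.session_factory()
        try:
            return await get_data(u, session)
        finally:
            await session.close()
    # each dask task runs in a worker thread with no running loop of its own,
    # so spinning up a fresh loop per call is fine
    return asyncio.run(_inner())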
I have deployed a FastAPI endpoint:
from fastapi import FastAPI, UploadFile
from typing import List

app = FastAPI()

@app.post('/work/test')
async def testing(files: List[UploadFile]):
    for i in files:
        .......
        # do a lot of operations on each file
        # after that I am just writing that processed data into a mysql database
        # cur.execute(...)
        # cur.commit()
        .......
        # just returning "OK" to confirm data is written into mysql
    return {"response": "OK"}
I can request output from the API endpoint and it's working perfectly fine for me.
Now, the biggest challenge for me is knowing how much time each iteration takes, because in the UI (for those who are accessing my API endpoint) I want to show them a progress bar (TIME TAKEN) for each iteration/file being processed.
Is there any possible way for me to achieve this? If so, please help me out on how I can proceed further.
Thank you.
Approaches
Polling
The most preferred approach to track the progress of a task is polling:
After receiving a request to start a task on the backend:
Create a task object in storage (e.g. in-memory, Redis, etc.). The task object must contain the following data: task ID, status (pending, completed), result, and others.
Run the task in the background (coroutines, threading, multiprocessing, a task queue like Celery, arq, aio-pika, dramatiq, etc.).
Immediately respond with 202 (Accepted), returning the ID of the task created above.
Update the task status:
This can be done from within the task itself, if it knows about the task store and has access to it: periodically, the task updates information about itself.
Or use a task monitor (Observer, producer-consumer pattern), which watches the status of the task and its result and updates the information in the storage.
On the client side (front end), start a polling cycle for the task status against the endpoint /task/{ID}/status, which takes information from the task storage.
Streaming response
Streaming is a less convenient way of getting the status of request processing periodically: we gradually push responses without closing the connection. It has a number of significant disadvantages; for example, if the connection is broken, you can lose information. A streaming API is a different approach from a REST API.
Websockets
You can also use websockets for real-time notifications and bidirectional communication; a rough sketch follows.
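A minimal sketch of that idea (the route, the jobs dict and the payload shape are illustrative; in practice you would point it at the same job storage as in the polling demos below):

import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()
jobs = {}  # uid -> {"status": ..., "progress": ...}, filled by whatever runs the task

@app.websocket("/ws/task/{uid}")
async def task_progress(websocket: WebSocket, uid: str):
    await websocket.accept()
    # push the current state to the client until the job reports completion
    while True:
        job = jobs.get(uid, {"status": "unknown", "progress": 0})
        await websocket.send_json(job)
        if job["status"] in ("complete", "unknown"):
            break
        await asyncio.sleep(1)
    await websocket.close()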
Links:
Examples of the polling approach for a progress bar, along with a more detailed description for Django + Celery, can be found at these links:
https://www.dangtrinh.com/2013/07/django-celery-display-progress-bar-of.html
https://buildwithdjango.com/blog/post/celery-progress-bars/
I have provided simplified examples of running background tasks in FastAPI using multiprocessing here:
https://stackoverflow.com/a/63171013/13782669
Old answer:
You could run a task in the background, return its id and provide a /status endpoint that the front end would periodically call. In the status response, you could return the current state of your task (for example, pending, with the number of the currently processed file). I provided a few simple examples here.
Demo
Polling
Demo of the approach using asyncio tasks (single worker solution):
import asyncio
from http import HTTPStatus
from fastapi import BackgroundTasks
from typing import Dict, List
from uuid import UUID, uuid4
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel, Field


class Job(BaseModel):
    uid: UUID = Field(default_factory=uuid4)
    status: str = "in_progress"
    progress: int = 0
    result: int = None


app = FastAPI()
jobs: Dict[UUID, Job] = {}  # Dict as job storage


async def long_task(queue: asyncio.Queue, param: int):
    for i in range(1, param):  # do work and return our progress
        await asyncio.sleep(1)
        await queue.put(i)
    await queue.put(None)


async def start_new_task(uid: UUID, param: int) -> None:
    queue = asyncio.Queue()
    task = asyncio.create_task(long_task(queue, param))
    while progress := await queue.get():  # monitor task progress
        jobs[uid].progress = progress
    jobs[uid].status = "complete"


@app.post("/new_task/{param}", status_code=HTTPStatus.ACCEPTED)
async def task_handler(background_tasks: BackgroundTasks, param: int):
    new_task = Job()
    jobs[new_task.uid] = new_task
    background_tasks.add_task(start_new_task, new_task.uid, param)
    return new_task


@app.get("/task/{uid}/status")
async def status_handler(uid: UUID):
    return jobs[uid]
Adapted example for the loop from the question
The background processing function is defined with a plain def, so FastAPI runs it on the thread pool.
import time
from http import HTTPStatus
from fastapi import BackgroundTasks, UploadFile, File
from typing import Dict, List
from uuid import UUID, uuid4
from fastapi import FastAPI
from pydantic import BaseModel, Field


class Job(BaseModel):
    uid: UUID = Field(default_factory=uuid4)
    status: str = "in_progress"
    processed_files: List[str] = Field(default_factory=list)


app = FastAPI()
jobs: Dict[UUID, Job] = {}


def process_files(task_id: UUID, files: List[UploadFile]):
    for i in files:
        time.sleep(5)  # pretend long task
        # ...
        # do a lot of operations on each file
        # then append the processed file to a list
        # ...
        jobs[task_id].processed_files.append(i.filename)
    jobs[task_id].status = "completed"


@app.post('/work/test', status_code=HTTPStatus.ACCEPTED)
async def work(background_tasks: BackgroundTasks, files: List[UploadFile] = File(...)):
    new_task = Job()
    jobs[new_task.uid] = new_task
    background_tasks.add_task(process_files, new_task.uid, files)
    return new_task


@app.get("/work/{uid}/status")
async def status_handler(uid: UUID):
    return jobs[uid]
Streaming
from fastapi.responses import StreamingResponse  # needed for the streaming endpoint below


async def process_files_gen(files: List[UploadFile]):
    for i in files:
        time.sleep(5)  # pretend long task
        # ...
        # do a lot of operations on each file
        # then append the processed file to a list
        # ...
        yield f"{i.filename} processed\n"
    yield "OK\n"


@app.post('/work/stream/test', status_code=HTTPStatus.ACCEPTED)
async def work(files: List[UploadFile] = File(...)):
    return StreamingResponse(process_files_gen(files))
Below is a solution which uses unique identifiers and a globally available dictionary that holds information about the jobs:
NOTE: The code below is safe to use as long as you use dynamic key values (a uuid in the sample) and keep the application within a single process.
1. To start the app, create a file main.py.
2. Run uvicorn main:app --reload.
3. Create a job entry by accessing http://127.0.0.1:8000/
4. Repeat step 3 to create multiple jobs.
5. Go to the http://127.0.0.1:8000/status page to see all job statuses.
6. Go to http://127.0.0.1:8000/status/{identifier} to see the progress of a job by its id.
Code of app:
from fastapi import FastAPI, UploadFile
import uuid
from typing import List
import asyncio

context = {'jobs': {}}

app = FastAPI()


async def do_work(job_key, files=None):
    iter_over = files if files else range(100)
    for file_number, file in enumerate(iter_over):
        jobs = context['jobs']
        job_info = jobs[job_key]
        job_info['iteration'] = file_number
        job_info['status'] = 'inprogress'
        await asyncio.sleep(1)
    context['jobs'][job_key]['status'] = 'done'


@app.post('/work/test')
async def testing(files: List[UploadFile]):
    identifier = str(uuid.uuid4())
    context['jobs'][identifier] = {}
    asyncio.run_coroutine_threadsafe(do_work(identifier, files), loop=asyncio.get_running_loop())
    return {"identifier": identifier}


@app.get('/')
async def get_testing():
    identifier = str(uuid.uuid4())
    context['jobs'][identifier] = {}
    asyncio.run_coroutine_threadsafe(do_work(identifier), loop=asyncio.get_running_loop())
    return {"identifier": identifier}


@app.get('/status')
def status():
    return {
        'all': list(context['jobs'].values()),
    }


@app.get('/status/{identifier}')
async def status(identifier):
    return {
        "status": context['jobs'].get(identifier, 'job with that identifier is undefined'),
    }
I would like to read from multiple simultaneous HTTP streaming requests inside coroutines using httpx, and yield the data back to my non-async function running the event loop, rather than just returning the final data.
But if I make my async functions yield instead of return, I get complaints that asyncio.as_completed() and loop.run_until_complete() expects a coroutine or a Future, not an async generator.
So the only way I can get this to work at all is by collecting all the streamed data inside each coroutine and returning all the data once the request finishes, then collecting all the coroutine results and finally returning those to the non-async calling function.
Which means I have to keep everything in memory, and wait until the slowest request has completed before I get all my data, which defeats the whole point of streaming http requests.
Is there any way I can accomplish something like this? My current silly implementation looks like this:
def collect_data(urls):
    """Non-async function wishing it was a non-async generator"""

    async def stream(async_client, url):
        data = []
        async with async_client.stream("GET", url=url) as ar:
            ar.raise_for_status()
            async for line in ar.aiter_lines():
                data.append(line)
                # would like to yield each line here
        return data

    async def execute_tasks(urls):
        all_data = []
        async with httpx.AsyncClient() as async_client:
            tasks = [stream(async_client, url) for url in urls]
            for coroutine in asyncio.as_completed(tasks):
                all_data += await coroutine
                # would like to iterate and yield each line here
        return all_data

    try:
        loop = asyncio.get_event_loop()
        data = loop.run_until_complete(execute_tasks(urls=urls))
        return data
        # would like to iterate and yield the data here as it becomes available
    finally:
        loop.close()
EDIT: I've tried some solutions using asyncio.Queue and trio memory channels as well, but since I can only read from those in an async scope it doesn't get me any closer to a solution
EDIT 2: The reason I want to use this from a non-asynchronous generator is that I want to use it from a Django app using a Django Rest Framework streaming API.
Normally you should just make collect_data async, and use async code throughout - that's how asyncio was designed to be used. But if that's for some reason not feasible, you can iterate an async iterator manually by applying some glue code:
def iter_over_async(ait, loop):
    ait = ait.__aiter__()

    async def get_next():
        try:
            obj = await ait.__anext__()
            return False, obj
        except StopAsyncIteration:
            return True, None

    while True:
        done, obj = loop.run_until_complete(get_next())
        if done:
            break
        yield obj
The way the above works is by providing an async closure that keeps retrieving the values from the async iterator using the __anext__ magic method and returning the objects as they arrive. This async closure is invoked with run_until_complete() in a loop inside an ordinary sync generator. (The closure actually returns a pair of done indicator and actual object in order to avoid propagating StopAsyncIteration through run_until_complete, which might be unsupported.)
With this in place, you can make your execute_tasks an async generator (async def with yield) and iterate over it using:
for chunk in iter_over_async(execute_tasks(urls), loop):
    ...
Just note that this approach is incompatible with asyncio.run, and might cause problems later down the line.
Just wanting to update @user4815162342's solution to use asyncio.run_coroutine_threadsafe instead of loop.run_until_complete.
import asyncio
from typing import Any, AsyncGenerator


def _iter_over_async(loop: asyncio.AbstractEventLoop, async_generator: AsyncGenerator):
    ait = async_generator.__aiter__()

    async def get_next() -> tuple[bool, Any]:
        try:
            obj = await ait.__anext__()
            done = False
        except StopAsyncIteration:
            obj = None
            done = True
        return done, obj

    while True:
        done, obj = asyncio.run_coroutine_threadsafe(get_next(), loop).result()
        if done:
            break
        yield obj
I'd also like to add that I have found tools like this quite helpful when piecewise converting synchronous code to asyncio code.
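Note that this variant assumes the event loop is already running in another thread (which is what run_coroutine_threadsafe needs). A minimal usage sketch, where the numbers generator is just an illustrative stand-in:

import asyncio
import threading

# start a loop in a background thread so run_coroutine_threadsafe has something to submit to
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def numbers():
    for i in range(3):
        await asyncio.sleep(0.1)
        yield i

for item in _iter_over_async(loop, numbers()):
    print(item)

loop.call_soon_threadsafe(loop.stop)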
There is a nice library that does this (and more!) called pypeln:
import pypeln as pl
import asyncio
from random import random

async def slow_add1(x):
    await asyncio.sleep(random())  # <= some slow computation
    return x + 1

async def slow_gt3(x):
    await asyncio.sleep(random())  # <= some slow computation
    return x > 3

data = range(10)  # [0, 1, 2, ..., 9]

stage = pl.task.map(slow_add1, data, workers=3, maxsize=4)
stage = pl.task.filter(slow_gt3, stage, workers=2)

data = list(stage)  # e.g. [5, 6, 9, 4, 8, 10, 7]
I currently have this class for making requests to an API and caching the JSON response:
import os
import pathlib
import json
import hashlib
import time

import requests


class NoJSONResponseError(Exception):
    pass


class JSONRequestCacher(object):
    """Manage a JSON object through the cache.

    Download the associated resource from the provided URL
    when need be and retrieve the JSON from a cached file
    if possible.
    """

    def __init__(self, duration=86400, cachedir=None):
        self.duration = duration
        self.cachedir = self._get_cachedir(cachedir)
        self._validate_cache()

    def _get_cachedir(self, cachedir):
        if cachedir is None:
            cachedir = os.environ.get(
                'CUSTOM_CACHEDIR',
                pathlib.Path(pathlib.Path.home(), '.custom_cache/')
            )
        return cachedir

    def _validate_cache(self):
        """Create the cache directory if it doesn't exist"""
        self.cachedir.mkdir(parents=True, exist_ok=True)

    def _request(self, url):
        """Perform the retrieval of the requested JSON data"""
        return requests.get(url)

    def save(self, raw, cachefile):
        """Save the provided raw JSON data into the cached file"""
        with open(cachefile, 'w') as out:
            json.dump(raw, out)

    def load(self, cachefile):
        """Retrieve the saved JSON data from the cached file"""
        with open(cachefile) as cached:
            return json.load(cached)

    def cache_is_valid(self, cachefile):
        """Check if cache exists and is more recent than the cutoff"""
        if cachefile.is_file():
            cache_age = time.time() - cachefile.stat().st_mtime
            return cache_age < self.duration
        return False

    def request(self, url, refresh=False):
        """The JSON data associated to the given URL.

        Either read from the cache or fetch from the web.
        """
        urlhash = hashlib.md5(url.encode()).hexdigest()
        cachefile = self.cachedir.joinpath(urlhash)
        start = time.time()
        if not refresh and self.cache_is_valid(cachefile):
            return self.load(cachefile), True, time.time() - start
        resp = self._request(url)
        resp.raise_for_status()
        try:
            raw = resp.json()
        except ValueError:
            raise NoJSONResponseError()
        self.save(raw, cachefile)
        return raw, False, resp.elapsed.total_seconds()
I then have other classes and code which call the request method of this class like so:
class APIHelper():
    def __init__(self):
        self.cache = JSONRequestCacher()

    def fetch(self, val):
        url = 'my/url/{}'.format(val)
        return self.cache.request(url)

    def fetchall(self, vals):
        responses = []
        for val in vals:
            responses.append(self.fetch(val))
        return responses
For a small number of vals this is fine and it's really no big deal to wait 10 minutes. However, I am now looking at making 30,000+ hits to this endpoint. In the past I have used threadpools (multiprocessing.dummy.Pool) to achieve some parallelism, but from my reading it seems like async/await and aiohttp is a better way to go. Unfortunately, try as I might, I cannot wrap my head around how to translate that to this code. I am using Python 3.8.
EDIT
I tried making this change:
class JSONRequestCacher():
    def __init__(self):
        self.http = aiohttp.ClientSession()

    async def _request(self, url):
        async with self.http.get(url) as response:
            return await response.read()
Got the error: AttributeError: 'coroutine' object has no attribute 'json' from my raw = resp.json() line
I then tried adding resp = await self._request(url), but that gives SyntaxError: 'await' outside async function. And if I make request an async function, calling it just seems to return a coroutine object that doesn't give me the expected response.
And this is just trying to make the _request call async. I can't even start to understand how I am meant to make multiple calls to it via another class (APIHelper).
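To give the question a concrete shape, here is a rough sketch of what an async version might look like, not a drop-in replacement: the class and function names beyond the original JSONRequestCacher are illustrative, error handling for non-JSON responses is omitted, and the cache file I/O stays synchronous.

import asyncio
import hashlib
import time

import aiohttp

class AsyncJSONRequestCacher(JSONRequestCacher):
    """Async variant: the HTTP round-trip is awaited, everything else is reused."""

    async def _request(self, session, url):
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.json()

    async def request(self, session, url, refresh=False):
        urlhash = hashlib.md5(url.encode()).hexdigest()
        cachefile = self.cachedir.joinpath(urlhash)
        start = time.time()
        if not refresh and self.cache_is_valid(cachefile):
            return self.load(cachefile), True, time.time() - start
        raw = await self._request(session, url)
        self.save(raw, cachefile)
        return raw, False, time.time() - start

async def fetchall(vals):
    cache = AsyncJSONRequestCacher()
    async with aiohttp.ClientSession() as session:
        tasks = [cache.request(session, 'my/url/{}'.format(v)) for v in vals]
        return await asyncio.gather(*tasks)

# responses = asyncio.run(fetchall(vals))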
I'm writing a script that should retrieve file data from S3 storage in parallel. I cannot do that with a simple Pool object from the multiprocessing module, because then I get the error that a boto3 Resource object cannot be pickled.
I see why, and I implemented a solution which uses Pool, but the target function does not use a shared Resource and instead creates a new Resource within each call. But I found that to be extremely inefficient, to the extent that even sequential processing would be faster.
I know that Pool has an option to set the number of workers which are used, but is there any way to create several instances of Resource and "assign" them to those workers so I would not have to create a new instance for each item to be processed? (One possible pattern is sketched after the code below.)
Code:
class Client:
    def __init__(self, url, access_key, secret_key):
        self.url = url
        self.access_key = access_key
        self.secret_key = secret_key

    def get_resource(self):
        return boto3.resource('s3',
                              endpoint_url=self.url,
                              aws_access_key_id=self.access_key,
                              aws_secret_access_key=self.secret_key)

    def get_data_for_object(self, key):
        data = []
        resource = self.get_resource()
        for line in resource.Object(bucket_name='bucket_name', key=key).get()['Body'].iter_lines():
            data.append(json.loads(line))
        return data

    def get_data_for_minute(self, minute):
        keys = [...]  # object keys in S3 storage
        pool = Pool()
        data = pool.imap(self.get_data_for_object, keys)
        flat_data = list(itertools.chain.from_iterable(data))
        return flat_data
Thanks for any help.
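One common way to get one Resource per worker (rather than per item) is a Pool initializer that stores the Resource in a module-level, per-process variable. A rough sketch under that assumption; the helper names and the 'bucket_name' bucket are illustrative:

import itertools
import json
from multiprocessing import Pool

import boto3

_resource = None  # one Resource per worker process, created by the initializer


def _init_worker(url, access_key, secret_key):
    # runs once in each worker; the Resource is then reused for every task in that worker
    global _resource
    _resource = boto3.resource('s3',
                               endpoint_url=url,
                               aws_access_key_id=access_key,
                               aws_secret_access_key=secret_key)


def _get_data_for_object(key):
    obj = _resource.Object(bucket_name='bucket_name', key=key)
    return [json.loads(line) for line in obj.get()['Body'].iter_lines()]


def get_data_for_minute(client, keys):
    # client is the Client instance from the question; only its plain-string fields are pickled
    with Pool(initializer=_init_worker,
              initargs=(client.url, client.access_key, client.secret_key)) as pool:
        data = pool.imap(_get_data_for_object, keys)
        return list(itertools.chain.from_iterable(data))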