I have an asyncio Python script for Azure that uses multiple tasks to upload files to blobs from an asyncio queue. It works fine, at least up until the point where it uses up all available memory on the system. I can't figure out where the memory leak is. Normally I use memory-profiler, but it doesn't seem to work with async functions.
Can someone tell me either what I'm doing wrong here, or else what the best way would be to find out where the issue lies? Thanks. It's not clear to me what is not being cleaned up, if anything.
I put anywhere from a few hundred to a few thousand files on the work queue, and usually run with 3-5 tasks. Within the space of a couple of minutes this program uses up anywhere from 3 to 6 GB of resident memory and then starts eating into swap until, if it runs long enough, it gets killed from memory starvation. This is on a Linux box with 8 GB of memory using Python 3.6.8 and the following Azure libraries:
azure-common 1.1.25
azure-core 1.3.0
azure-identity 1.3.0
azure-nspkg 3.0.2
azure-storage-blob 12.3.0
from azure.identity.aio import ClientSecretCredential
from azure.storage.blob.aio import BlobClient

async def uploadBlobsTask(taskName, args, workQueue):
    while not workQueue.empty():
        fileName = await workQueue.get()
        blobName = fileName.replace(args.sourceDirPrefix, '')
        blobClient = BlobClient(
            "https://{}.blob.core.windows.net".format(args.accountName),
            credential = args.creds,
            container_name = args.container,
            blob_name = blobName,
        )
        async with blobClient:
            args.logger.info("Task {}: uploading {} as {}".format(taskName, fileName, blobName))
            try:
                with open(fileName, "rb") as data:
                    await blobClient.upload_blob(data, overwrite=True)
                fileNameMoved = fileName + '.moved'
                with open(fileNameMoved, "w") as fm:
                    fm.write("")
            except KeyboardInterrupt:
                raise
            except:
                args.logger.error("Task {}: {}".format(taskName, traceback.format_exc()))
                await workQueue.put(fileName)
            finally:
                workQueue.task_done()

async def processFiles(args):
    workQueue = asyncio.Queue()
    for (path, dirs, files) in os.walk(args.sourceDir):
        for f in files:
            fileName = os.path.join(path, f)
            await workQueue.put(fileName)
    creds = ClientSecretCredential(args.tenant, args.appId, args.password)
    args.creds = creds
    tasks = [ args.loop.create_task(uploadBlobsTask(str(i), args, workQueue)) for i in range(1, args.tasks+1) ]
    await asyncio.gather(*tasks)
    await creds.close()

loop = asyncio.get_event_loop()
args.loop = loop
loop.run_until_complete(processFiles(args))
loop.close()
For what it's worth, I seem to have managed to fix this so that it works without memory leaks. I did it by obtaining a ContainerClient and then getting BlobClients from it (i.e., containerClient.get_blob_client()) instead of constructing BlobClient objects directly, roughly as sketched below. Now overall memory usage tops out at a very low level rather than growing continuously as before.
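Roughly, the reworked upload task looks like this (simplified sketch; the args fields are the same as in the original code, and error handling is omitted):

from azure.storage.blob.aio import ContainerClient

async def uploadBlobsTask(taskName, args, workQueue):
    # One ContainerClient per task; blob clients are derived from it instead of
    # constructing a new BlobClient from scratch for every file.
    containerClient = ContainerClient(
        "https://{}.blob.core.windows.net".format(args.accountName),
        container_name=args.container,
        credential=args.creds,
    )
    async with containerClient:
        while not workQueue.empty():
            fileName = await workQueue.get()
            blobName = fileName.replace(args.sourceDirPrefix, '')
            blobClient = containerClient.get_blob_client(blobName)
            try:
                with open(fileName, "rb") as data:
                    await blobClient.upload_blob(data, overwrite=True)
            finally:
                workQueue.task_done()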
I am using Prefect and I am trying to download a file from S3.
When I hard-coded the AWS credentials, the file downloaded successfully:
import asyncio

from prefect_aws.s3 import s3_download
from prefect_aws.credentials import AwsCredentials
from prefect import flow, get_run_logger

@flow
async def fetch_taxi_data():
    logger = get_run_logger()
    credentials = AwsCredentials(
        aws_access_key_id="xxx",
        aws_secret_access_key="xxx",
    )
    data = await s3_download(
        bucket="hongbomiao-bucket",
        key="hm-airflow/taxi.csv",
        aws_credentials=credentials,
    )
    logger.info(data)

if __name__ == "__main__":
    asyncio.run(fetch_taxi_data())
Now I tried to load the credentials from Prefect Blocks.
I created an AWS Credentials Block:
However,
aws_credentials_block = AwsCredentials.load("aws-credentials-block")

data = await s3_download(
    bucket="hongbomiao-bucket",
    key="hm-airflow/taxi.csv",
    aws_credentials=aws_credentials_block,
)
throws the error:
AttributeError: 'coroutine' object has no attribute 'get_boto3_session'
And
aws_credentials_block = AwsCredentials.load("aws-credentials-block")

credentials = AwsCredentials(
    aws_access_key_id=aws_credentials_block.aws_access_key_id,
    aws_secret_access_key=aws_credentials_block.aws_secret_access_key,
)

data = await s3_download(
    bucket="hongbomiao-bucket",
    key="hm-airflow/taxi.csv",
    aws_credentials=credentials,
)
throws the error:
AttributeError: 'coroutine' object has no attribute 'aws_access_key_id'
I didn't find any useful documentation about how to use it.
Am I supposed to use Blocks to load credentials? If so, what is the correct way to use Blocks in Prefect? Thanks!
I just found that the snippet in the screenshot in the question is missing an await.
After adding await, it works now!
aws_credentials_block = await AwsCredentials.load("aws-credentials-block")

data = await s3_download(
    bucket="hongbomiao-bucket",
    key="hm-airflow/taxi.csv",
    aws_credentials=aws_credentials_block,
)
UPDATE:
Got an answer from Michael Adkins on GitHub, and thanks!
await is only needed if you're writing an async flow or task. For users writing synchronous code, an await is not needed (and not possible). Most of our users are writing synchronous code and the example in the UI is in a synchronous context so it does not include the await.
I saw the source code at
https://github.com/PrefectHQ/prefect/blob/1dcd45637914896c60b7d49254a34e95a9ce56ea/src/prefect/blocks/core.py#L601-L604
@classmethod
@sync_compatible
@inject_client
async def load(cls, name: str, client: "OrionClient" = None):
    # ...
So I think that as long as a function has the @sync_compatible decorator, it can be called both as an async function and as a sync function, as in the sketch below.
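As a rough sketch of the two calling conventions (assuming the @sync_compatible behaviour described above):

# Inside an async flow or task: load() returns a coroutine, so it must be awaited.
aws_credentials_block = await AwsCredentials.load("aws-credentials-block")

# Inside a synchronous flow or task: call it directly; await is neither needed nor possible.
aws_credentials_block = AwsCredentials.load("aws-credentials-block")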
P.S. I started an issue: https://github.com/robinhood/faust/issues/702
I am developing a Faust app:
from concurrent.futures import ProcessPoolExecutor, as_completed

import faust

app = faust.App('my-app-name', broker='kafka://localhost:9092')
sink = app.topic('topic')

@app.task()
async def check():
    # 3 is the number of different folders where the archives are placed
    with ProcessPoolExecutor(max_workers=3) as executor:
        fs = [executor.submit(handle, directory) for directory in ['dir1', 'dir2', 'dir3']]
        for future in as_completed(fs):
            future.result()

def handle(directory):
    # finding archives in directory
    # unpacking 7z with mdb-files
    # converting mdb tables to csv
    # reading csv to dataframe
    # some data manipulating
    # and at last sending dataframe records to kafka
    f = sink.send_soon(value={'ts': 1234567890, 'count': 10})  # always in pending status
I faced a problem where the method sink.send_soon returns a FutureMessage(asyncio.Future, Awaitable[RecordMetadata]) that is always in pending status.
This is a situation with a future inside another future.
Note: the function handle has to be sync because one cannot pass an async function to ProcessPoolExecutor. The method send_soon is a sync method, and according to this example https://github.com/robinhood/faust/blob/b5e159f1d104ad4a6aa674d14b6ba0be19b5f9f5/examples/windowed_aggregation.py#L47 awaiting it is not necessary.
Is there any way to handle the pending future?
Also tried this:
import asyncio
from concurrent.futures import ProcessPoolExecutor

import faust

loop = asyncio.get_event_loop()

app = faust.App('my-app-name', broker='kafka://localhost:9092', loop=loop)
sink = app.topic('topic')

@app.task()
async def check():
    tasks = []
    with ProcessPoolExecutor(max_workers=3) as executor:
        for dir_ in ['dir1', 'dir2', 'dir3']:
            task = asyncio.create_task(run_dir_handling(executor, dir_))
            tasks.append(task)
        await asyncio.gather(*tasks)

async def run_dir_handling(executor, dir_):
    print('running blocking')
    await loop.run_in_executor(executor, handle, dir_)

def handle(directory):
    print('Handle directory')
    # finding archives in directory
    # unpacking 7z with mdb-files
    # converting mdb tables to csv
    # reading csv to dataframe
    # some data manipulating
    # and at last sending dataframe records to kafka

    # `send_soon` is not `async def`, but `send` is; the async `send` cannot be
    # used here because of:
    # `await loop.run_in_executor(executor, handle, dir_) TypeError: cannot pickle 'coroutine' object`
    f = sink.send_soon(value={'ts': 1234567890, 'count': 10, 'dir': directory})
    print(f)  # always <FutureMessage pending>
But that didn't work either.
It seems the event loop never even gets a chance to run the send_soon call.
I changed the code structure so that handle() only returns the directory, and the message is sent with the awaitable send method from the coroutine after run_in_executor completes, instead of calling send_soon inside the worker process:
import asyncio
from concurrent.futures import ProcessPoolExecutor

import faust

loop = asyncio.get_event_loop()

app = faust.App('my-app-name', broker='kafka://localhost:9092')
sink = app.topic('topic1')

@app.task()
async def check():
    tasks = []
    with ProcessPoolExecutor(max_workers=3) as executor:
        for dir_ in ['dir1', 'dir2', 'dir3']:
            task = asyncio.create_task(run_dir_handling(executor, dir_))
            tasks.append(task)
        await asyncio.gather(*tasks)

async def run_dir_handling(executor, dir_):
    directory = await loop.run_in_executor(executor, handle, dir_)
    await sink.send(value={'dir': directory})

def handle(directory):
    print('Handle directory')
    # finding archives in directory
    # unpacking 7z with mdb-files
    # converting mdb tables to csv
    # reading csv to dataframe
    # some data manipulating
    # and at last sending dataframe records to kafka
    return directory
I have an async Python script that creates a bulk API job/batch in Salesforce. After the batch is complete I then download the csv file for processing.
Here's my problem: a streaming download of a ~300 MB CSV file can take 3+ minutes using this asynchronous code:
If you're familiar with Salesforce bulk jobs, you can enter your information into the variables below and download your batch results for testing. This is a working example, provided you enter the necessary information.
import asyncio, aiohttp, aiofiles
from simple_salesforce import Salesforce
from credentials import credentials as cred

sf_data_path = 'C:/Users/[USER NAME]/Desktop/'
job_id = '[18 DIGIT JOB ID]'
batch_id = '[18 DIGIT BATCH ID]'
result_id = '[18 DIGIT RESULT ID]'
instance_name = '[INSTANCE NAME]'

result_url = f'https://{instance_name}.salesforce.com/services/async/45.0/job/{job_id}/batch/{batch_id}/result/{result_id}'

sf = Salesforce(username=['SALESFORCE USERNAME'],
                password=['SALESFORCE PASSWORD'],
                security_token=['SALESFORCE SECURITY TOKEN'],
                organizationId=['SALESFORCE ORGANIZATION ID'])

async def download_results():
    err = None
    retries = 3
    status = 'Not Downloaded'

    for _ in range(retries):
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url=result_url,
                                       headers={"X-SFDC-Session": sf.session_id, 'Content-Encoding': 'gzip'},
                                       timeout=300) as resp:
                    async with aiofiles.open(f'{sf_data_path}_DOWNLOAD_.csv', 'wb') as outfile:
                        while True:
                            chunk = await resp.content.read(10485760)  # = 10 MB
                            if not chunk:
                                break
                            await outfile.write(chunk)
                        status = 'Downloaded'
        except Exception as e:
            err = e
            retries -= 1
            status = 'Retrying'
            continue
        else:
            break
    else:
        status = 'Failed'

    return err, status, retries

asyncio.run(download_results())
However, if I download the result of the batch in the Developer Workbench: https://workbench.developerforce.com/asyncStatus.php?jobId='[18 DIGIT JOB ID]' the same file might download in 5 seconds.
There is obviously something going on here that I'm missing. I know that the Workbench uses PHP; is this functionality even available with Python? I figured the async calls would make this download quickly, but they don't seem to make it download as fast as the functionality in the browser. Any ideas?
Thanks!
You can try a curl request to get the CSV. This method is as quick as what you see in the Workbench (a rough sketch of such a request follows the links below).
You can read more information here:
https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/query_walkthrough.htm
https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/query_get_job_results.htm#example_locator
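Roughly, such a curl request could look like this (a sketch only; the placeholders follow the question's result URL pattern, the session ID comes from an authenticated login such as sf.session_id above, and requesting gzip lets curl decompress the stream with --compressed):

curl "https://[INSTANCE NAME].salesforce.com/services/async/45.0/job/[JOB ID]/batch/[BATCH ID]/result/[RESULT ID]" \
  -H "X-SFDC-Session: [SESSION ID]" \
  -H "Accept-Encoding: gzip" \
  --compressed \
  -o results.csv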
I have a large file with a JSON record on each line. I'm writing a script to upload a subset of these records to CouchDB via its API, and experimenting with different approaches to see what works fastest. Here's what I've found, from fastest to slowest (on a CouchDB instance on my localhost):
1. Read each needed record into memory. After all records are in memory, generate an upload coroutine for each record, and gather/run all the coroutines at once.
2. Synchronously read the file and, when a needed record is encountered, synchronously upload it.
3. Use aiofiles to read the file and, when a needed record is encountered, asynchronously upload it.
Approach #1 is much faster than the other two (about twice as fast). I am confused why approach #2 is faster than #3, especially in contrast to this example here, which takes half as much time to run asynchronously as synchronously (sync code not provided, I had to rewrite it myself). Is it the context switching from file I/O to HTTP I/O, especially with file reads occurring much more often than API uploads?
For additional illustration, here's some Python pseudo-code that represents each approach:
Approach 1 - Sync File IO, Async HTTP IO
import json
import asyncio
import aiohttp

records = []
with open('records.txt', 'r') as record_file:
    for line in record_file:
        record = json.loads(line)
        if valid(record):
            records.append(record)

async def batch_upload(records):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for record in records:
            task = async_upload(record, session)
            tasks.append(task)
        await asyncio.gather(*tasks)

asyncio.run(batch_upload(records))
Approach 2 - Sync File IO, Sync HTTP IO
import json

with open('records.txt', 'r') as record_file:
    for line in record_file:
        record = json.loads(line)
        if valid(record):
            sync_upload(record)
Approach 3 - Async File IO, Async HTTP IO
import json
import asyncio
import aiohttp
import aiofiles

async def batch_upload():
    async with aiohttp.ClientSession() as session:
        async with aiofiles.open('records.txt', 'r') as record_file:
            line = await record_file.readline()
            while line:
                record = json.loads(line)
                if valid(record):
                    await async_upload(record, session)
                line = await record_file.readline()

asyncio.run(batch_upload())
The file I'm developing this with is about 1.3 GB, with 100,000 records total, 691 of which I upload. Each upload begins with a GET request to see if the record already exists in CouchDB. If it does, a PUT is performed to update the CouchDB record with any new information; if it doesn't, the record is POSTed to the db. So each upload consists of two API requests. For dev purposes, I'm only creating records, so I run the GET and POST requests, 1382 API calls total.
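For illustration, async_upload roughly follows this pattern (the database URL and the _id/_rev handling here are simplified placeholders rather than my actual code):

import aiohttp

DB_URL = 'http://localhost:5984/records'  # assumed local CouchDB database

async def async_upload(record, session):
    doc_id = record['_id']
    # GET to check whether the document already exists
    async with session.get(f'{DB_URL}/{doc_id}') as resp:
        if resp.status == 200:
            existing = await resp.json()
            record['_rev'] = existing['_rev']
            # PUT updates the existing document (requires the current _rev)
            await session.put(f'{DB_URL}/{doc_id}', json=record)
        else:
            # POST creates a new document in the database
            await session.post(DB_URL, json=record)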
Approach #1 takes about 17 seconds, approach #2 takes about 33 seconds, and approach #3 takes about 42 seconds.
Your code uses async, but it does the work sequentially, and in that case it will be slower than the sync approach. Async won't speed up execution if it isn't constructed/used effectively.
You can create two coroutines and make them run concurrently; perhaps that speeds up the operation.
Example:
#!/usr/bin/env python3

import asyncio

async def upload(event, queue):
    # This logic is not so correct when it comes to shutdown,
    # but gives the idea
    while not event.is_set():
        record = await queue.get()
        print(f'uploading record : {record}')
    return

async def read(event, queue):
    # dummy logic : instead read here and populate the queue.
    for i in range(1, 10):
        await queue.put(i)
    # Initiate shutdown..
    event.set()

async def main():
    event = asyncio.Event()
    queue = asyncio.Queue()
    uploader = asyncio.create_task(upload(event, queue))
    reader = asyncio.create_task(read(event, queue))
    tasks = [uploader, reader]
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(main())
Situation:
I am trying to send an HTTP request to all the domains listed in a specific file I have already downloaded, and get the destination URL I was forwarded to.
Problem: I followed a tutorial, but I get far fewer responses than expected. It's around 100 responses per second, whereas the tutorial lists 100,000 responses per minute.
The script also gets slower and slower after a couple of seconds, so that eventually I just get 1 response every 5 seconds.
Already tried: At first I thought the problem was that I ran it on a Windows server. After trying the script on my own computer, I found it was only a little bit faster, but not by much. On another Linux server it was the same as on my computer (macOS).
Code: https://pastebin.com/WjLegw7K
import asyncio
import glob
import os
import re

from aiohttp import ClientSession

work_dir = os.path.dirname(__file__)

async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                delay = response.headers.get("DELAY")
                date = response.headers.get("DATE")
                print("{}:{} with delay {}".format(date, response.url, delay))
                return await response.read()
    except Exception:
        pass

async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)

async def run():
    os.chdir(work_dir)
    for file in glob.glob("cdx-*"):
        print("Opening: " + file)
        opened_file = file
        tasks = []
        # create instance of Semaphore
        sem = asyncio.Semaphore(40000)
        with open(work_dir + '/' + file) as infile:
            seen = set()
            async with ClientSession() as session:
                for line in infile:
                    regex = re.compile(r'://(.*?)/')
                    domain = regex.search(line).group(1)
                    domain = domain.lower()
                    if domain not in seen:
                        seen.add(domain)
                        task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                        tasks.append(task)
                    del line
                responses = asyncio.gather(*tasks)
                await responses
            infile.close()
        del seen
        del file

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)
I really don't know how to fix this issue, especially because I'm very new to Python... but I have to get it to work somehow :(
It's hard to tell what is going wrong without actually debugging the code, but one potential problem is that file processing is serialized. In other words, the code never processes the next file until all the requests from the current file have finished. If there are many files and one of them is slow, this could be a problem.
To change this, define run along these lines:
async def run():
    os.chdir(work_dir)
    async with ClientSession() as session:
        sem = asyncio.Semaphore(40000)
        seen = set()
        pending_tasks = set()
        for f in glob.glob("cdx-*"):
            print("Opening: " + f)
            with open(f) as infile:
                lines = list(infile)
            for line in lines:
                domain = re.search(r'://(.*?)/', line).group(1)
                domain = domain.lower()
                if domain in seen:
                    continue
                seen.add(domain)
                task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                pending_tasks.add(task)
                # ensure that each task removes itself from the pending set
                # when done, so that the set doesn't grow without bounds
                task.add_done_callback(pending_tasks.remove)
        # await the remaining tasks
        await asyncio.wait(pending_tasks)
Another important thing: silencing all exceptions in fetch() is bad practice because there is no indication that something has started going wrong (due to either a bug or a simple typo). This might well be the reason your script becomes "slow" after a while - fetch is raising exceptions and you're never seeing them. Instead of pass, use something like print(f'failed to get {url}: {e}'), where e is the object you get from except Exception as e. A sketch of the changed fetch is below.
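For illustration, fetch() with that change applied, keeping the rest of the question's logic as-is:

async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                delay = response.headers.get("DELAY")
                date = response.headers.get("DATE")
                print("{}:{} with delay {}".format(date, response.url, delay))
                return await response.read()
    except Exception as e:
        # surface failures instead of silently swallowing them
        print(f'failed to get {url}: {e}')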
Several additional remarks:
There is almost never a need to del local variables in Python; the garbage collector does that automatically.
You needn't close() a file opened using a with statement. with is designed specifically to do such closing automatically for you.
The code added domains to a seen set, but also processed an already seen domain. This version skips the domain for which it had already spawned a task.
You can create a single ClientSession and use it for the entire run.