Unit-testing: Mocking a subprocess running "aws s3 sync" with Python

My project needs to download quite a few files regularly before processing them.
I tried coding it directly in Python but it's horribly slow considering the amount of data in the buckets.
I decided to use a subprocess running aws-cli because boto3 still doesn't have a sync functionality. I know using a subprocess with aws-cli is not ideal, but it really is useful and works extremely well out of the box.
One of the perks of aws-cli is the fact that I can see the progress in stdout, which I am getting with the following code:
import os
import subprocess
from pathlib import Path

def download_bucket(bucket_url, dir_name, dest):
    """Download all the files from a bucket into a directory."""
    path = Path(dest) / dir_name
    bucket_dest = str(os.path.join(bucket_url, dir_name))
    with subprocess.Popen(["aws", "s3", "sync", bucket_dest, path],
                          stdout=subprocess.PIPE, bufsize=1,
                          universal_newlines=True) as p:
        for b in p.stdout:  # stream the CLI's progress lines as they arrive
            print(b, end='')
    if p.returncode != 0:
        raise subprocess.CalledProcessError(p.returncode, p.args)
Now, I want to make sure that I test this function but I am blocked here because:
I don't know the best way to test this kind of freakish behavior:
Am I supposed to actually create a fake local s3 bucket so that aws s3 sync can hit it?
Am I supposed to mock the subprocess call and not actually call my download_bucket function?
Until now, my attempt has been to create a fake bucket and pass it to my download_bucket function.
This way, I thought aws s3 sync would still work, albeit locally:
import boto3
from moto import mock_s3

def test_download_s3(tmpdir):
    tmpdir.join('frankendir').ensure()
    with mock_s3():
        conn = boto3.resource('s3', region_name='us-east-1')
        conn.create_bucket(Bucket='cool-bucket.us-east-1.dev.000000000000')
        s3 = boto3.client('s3', region_name='us-east-1')
        s3.put_object(Bucket='cool-bucket.us-east-1.dev.000000000000',
                      Key='frankendir', Body='has no files')
        body = conn.Object('cool-bucket.us-east-1.dev.000000000000',
                           'frankendir').get()['Body'].read().decode('utf-8')
        download_bucket('s3://cool-bucket.us-east-1.dev.000000000000',
                        'frankendir', tmpdir)
        # assert tmpdir.join('frankendir').join('has not files').exists()
        assert body == 'has no files'
But I get the following error: fatal error: An error occurred (InvalidAccessKeyId) when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.
My questions are the following:
Should I continue to pursue this creation of a fake local s3 bucket?
If so, how am I supposed to get the credentials to work?
Should I just mock the subprocess call and how?
I am having a hard time understanding how mocking works and how it's supposed to be done. From my understanding, I would just fake a call to aws s3 sync and return some files?
Is there another kind of unit test that would be enough that I didn't think of?
After all, I just want to know that when I pass a well-formed s3://bucketurl, a dir in that bucket, and a local dir, the files contained within s3://bucketurl/dir are downloaded to my local dir.
Thank you for your help, I hope that I am not all over the place.

A much better approach is to use moto when faking/testing S3. You can check out their documentation or look at a test code example I wrote: https://github.com/pksol/pycon-go-beyond-mocks/blob/main/test_s3_fake.py.
If you have a few minutes, you can view this short video of me explaining the benefits of using moto vs trying to mock.
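If you do decide to mock the subprocess call itself instead (your second option), a minimal sketch could look like the one below. It only checks that download_bucket builds the expected aws s3 sync command and handles a zero return code; no real transfer happens, and 'mymodule' is a placeholder for whatever module defines download_bucket:
from unittest import mock

def test_download_bucket_builds_expected_command(tmpdir):
    # Patch Popen where download_bucket looks it up ("mymodule" is hypothetical).
    with mock.patch('mymodule.subprocess.Popen') as popen:
        fake_proc = popen.return_value.__enter__.return_value
        fake_proc.stdout = iter(['download: fake progress line\n'])
        fake_proc.returncode = 0
        download_bucket('s3://cool-bucket.us-east-1.dev.000000000000',
                        'frankendir', tmpdir)
    cmd = popen.call_args[0][0]
    assert cmd[:3] == ['aws', 's3', 'sync']
    assert 'frankendir' in cmd[3]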

Related

How do I cache files within a function app?

I am trying to temporarily store files within a function app. The function is triggered by an http request which contains a file name. I first check if the file is within the function app storage and if not, I write it into the storage.
Like this:
if local_path.exists():
    file = json.loads(local_path.read_text('utf-8'))
else:
    s = AzureStorageBlob(account_name=account_name, container=container,
                         file_path=file_path, blob_contents_only=True,
                         mi_client_id=mi_client_id, account_key=account_key)
    local_path.write_bytes(s.read_azure_blob())
    file = json.loads(s.read_azure_blob().decode('utf-8'))
This is how I get the local path:
root = pathlib.Path(__file__).parent.absolute()
file_name = pathlib.Path(file_path).name
local_path = root.joinpath('file_directory').joinpath(file_name)
If I upload the files when deploying the function app, everything works as expected and I read the files from the function app storage. But if I try to save or cache a file, it breaks and gives me this error: Error: [Errno 30] Read-only file system:
Any help is greatly appreciated
So, after a year, I discovered that Azure Function Apps can use Python's os module to save files, and those files will persist across calls to the function app. There's no need to implement any sort of caching logic; simply use os.write() and it will work just fine.
I did have to implement a function to clear the cache after a certain period of time, but that was a much better alternative to the verbose code I had previously.
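For reference, a minimal sketch of that approach under the assumptions above; cache_file is a hypothetical helper and the cache directory name is a placeholder:
import os
import pathlib

CACHE_DIR = pathlib.Path(__file__).parent / 'file_directory'  # placeholder cache location

def cache_file(file_name, contents):
    """Persist blob contents so later invocations of the function app can reuse them."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local_path = CACHE_DIR / file_name
    fd = os.open(local_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    try:
        os.write(fd, contents)  # the low-level os write the answer refers to
    finally:
        os.close(fd)
    return local_path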

Boto3 S3 resource get stuck on "Object.get" method

I'm trying to get a pickle file from an S3 resource using the "Object.get()" method of the boto3 library from several processes simultaneously. This causes my program to get stuck in one of the processes (no exception is raised and the program does not continue to the next line).
I tried to add a "Config" variable to the S3 connection. That didn't help.
import os
import pickle

import boto3
from botocore.client import Config

s3_item = _get_s3_name(descriptor_key)  # returns the path string of the desired file
config = Config(connect_timeout=5, retries={'max_attempts': 0})
s3 = boto3.resource('s3', config=config)
bucket_uri = os.environ.get(*ct.S3_MICRO_SERVICE_BUCKET_URI)  # returns the bucket URI string
estimator_factory_logger.debug(f"Calling s3 with item {s3_item} from URI {bucket_uri}")
model_file_from_s3 = s3.Bucket(bucket_uri).Object(s3_item)
estimator_factory_logger.debug("Loading bytes...")
model_content = model_file_from_s3.get()['Body'].read()  # <- program gets stuck here
estimator_factory_logger.debug("Loading from pickle...")
est = pickle.loads(model_content)
No error message raised. It seems that the "get" method is stuck in a deadlock.
Your help will be much appreciated.
Is there a possibility that one of the files in the bucket is just huge and the program takes a long time to read it?
If that's the case, as a debugging step I'd look into the model_file_from_s3.get()['Body'] object, which is a botocore.response.StreamingBody object, and use set_socket_timeout() on it to try and force a timeout.
https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html
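For example, a quick sketch of that debugging step (the 30-second timeout is an arbitrary value for illustration):
body = model_file_from_s3.get()['Body']  # botocore.response.StreamingBody
body.set_socket_timeout(30)              # fail with a timeout instead of hanging indefinitely
model_content = body.read()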
The problem was that we created a subprocess after our main process had already opened several threads. Apparently, this is a big no-no in Linux.
We fixed it by using "spawn" instead of "fork".
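For anyone hitting the same deadlock, a minimal sketch of that fix with the standard library (load_model is a placeholder for the boto3 code above):
import multiprocessing as mp

def load_model(descriptor_key):
    ...  # the boto3 download shown in the question

if __name__ == '__main__':
    # 'spawn' starts a fresh interpreter instead of fork()-ing a process
    # that already holds locks taken by other threads.
    mp.set_start_method('spawn')
    p = mp.Process(target=load_model, args=('some-key',))
    p.start()
    p.join()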

Python Azure Batch - Permission Denied Linux node

I am running a Python script on several Linux nodes (after the creation of a pool) using Azure Batch. Each node runs Ubuntu 14.04.5-LTS.
In the script, I upload several files to each node and then run several tasks on each of these nodes. But I get a "Permission Denied" error when I try to execute the first task. The task is simply an unzip of a few files (FYI, the upload of these zip files went well).
This script was running well until recent weeks. I suspect an Ubuntu version update, but maybe it's something else.
Here is the error I get:
error: cannot open zipfile [ /mnt/batch/tasks/shared/01-AXAIS_HPC.zip ]
Permission denied
unzip: cannot find or open /mnt/batch/tasks/shared/01-AXAIS_HPC.zip,
Here is the main part of the code:
credentials = batchauth.SharedKeyCredentials(_BATCH_ACCOUNT_NAME, _BATCH_ACCOUNT_KEY)
batch_client = batch.BatchServiceClient(
    credentials,
    base_url=_BATCH_ACCOUNT_URL)

create_pool(batch_client,
            _POOL_ID,
            application_files,
            _NODE_OS_DISTRO,
            _NODE_OS_VERSION)

helpers.create_job(batch_client, _JOB_ID, _POOL_ID)

add_tasks(batch_client,
          _JOB_ID,
          input_files,
          output_container_name,
          output_container_sas_token)
with add_tasks:
def add_tasks(batch_service_client, job_id, input_files,
              output_container_name, output_container_sas_token):
    print('Adding {} tasks to job [{}]...'.format(len(input_files), job_id))
    tasks = list()
    for idx, input_file in enumerate(input_files):
        command = ['unzip -q $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC.zip -d $AZ_BATCH_NODE_SHARED_DIR',
                   'chmod a+x $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC/00-EXE/linux/*',
                   'PATH=$PATH:$AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC/00-EXE/linux',
                   'unzip -q $AZ_BATCH_TASK_WORKING_DIR/'
                   '{} -d $AZ_BATCH_TASK_WORKING_DIR/{}'.format(input_file.file_path, idx + 1),
                   'Rscript $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC/03-MAIN.R $AZ_BATCH_TASK_WORKING_DIR $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC $AZ_BATCH_TASK_WORKING_DIR/'
                   '{} {}'.format(idx + 1, idx + 1),
                   'python $AZ_BATCH_NODE_SHARED_DIR/01-IMPORT_FILES.py '
                   '--storageaccount {} --storagecontainer {} --sastoken "{}"'.format(
                       _STORAGE_ACCOUNT_NAME,
                       output_container_name,
                       output_container_sas_token)]
        tasks.append(batchmodels.TaskAddParameter(
            'Task{}'.format(idx),
            helpers.wrap_commands_in_shell('linux', command),
            resource_files=[input_file]
        ))

    Split = lambda tasks, n=100: [tasks[i:i + n] for i in range(0, len(tasks), n)]
    SPtasks = Split(tasks)
    for i in range(len(SPtasks)):
        batch_service_client.task.add_collection(job_id, SPtasks[i])
Do you have any insights to help me on this issue? Thank you very much.
Robin
Looking at the error, i.e.
error: cannot open zipfile [ /mnt/batch/tasks/shared/01-AXAIS_HPC.zip ]
Permission denied unzip: cannot find or open /mnt/batch/tasks/shared/01-AXAIS_HPC.zip,
this suggests that the file is not present at the shared directory location, or that it does not have the correct permissions. The former is more likely.
Is there any particular reason you are using the shared directory approach? Also, how are you uploading the file? (i.e. make sure any async/await is done correctly and that no greedy process runs your task before the shared_dir content is available on the node.)
Side note: you own the node, so you can RDP/SSH into it and check whether the shared_dir files are actually present.
A few things to ask: how are you uploading these zip files?
Also, if I may ask, what is the design/user scenario here and how exactly do you intend to use this?
Recommendation:
There are a few other ways to use zip files on an Azure Batch node, e.g. via a resource file or via an application package. (The application package route might suit *.zip files better.) I have added a few documents and places below where you can look at sample implementations and guidance for this; a rough sketch of the resource file route follows the links.
I think the material and samples below are a good place to start; hope they help you. :)
Also, I would recommend recreating your pool if it is old, which will ensure the node is running the latest version.
Azure Batch learning path:
Azure Batch API basics
Samples & demo link or look here
Detailed walkthrough depending on what you are using, i.e. CloudServiceConfiguration or VirtualMachineConfiguration link.
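For the resource file route, a hedged sketch is below. The SAS URL is a placeholder, helpers.wrap_commands_in_shell and input_file come from the question's own code, and depending on your azure-batch SDK version the ResourceFile parameter is blob_source (older releases) or http_url (newer releases):
import azure.batch.models as batchmodels

# Hypothetical SAS URL for the zip you uploaded to blob storage.
zip_sas_url = 'https://<storageaccount>.blob.core.windows.net/<container>/01-AXAIS_HPC.zip?<sas>'

shared_zip = batchmodels.ResourceFile(
    blob_source=zip_sas_url,       # newer SDKs: http_url=zip_sas_url
    file_path='01-AXAIS_HPC.zip')  # downloaded into the task's working directory

task = batchmodels.TaskAddParameter(
    id='Task0',
    command_line=helpers.wrap_commands_in_shell('linux', [
        'unzip -q $AZ_BATCH_TASK_WORKING_DIR/01-AXAIS_HPC.zip -d $AZ_BATCH_TASK_WORKING_DIR',
    ]),
    resource_files=[shared_zip, input_file])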

How do I pass grants with boto3's upload_file?

I have a command using the AWS CLI
aws s3 cp y:/mydatafolder s3://<bucket>/folder/subfolder --recursive --grants read=uri=http://policyurl
The first part is easy to do in Python: I can use os.walk to walk the folders, collect the files, and upload each one with boto3's S3 client upload_file method. The part I'm struggling with is the --grants read part.
What boto3 function do I need to call to do this?
Thanks
In your upload_file call you will need to pass GrantRead in ExtraArgs. Generally speaking, you can pass many of the arguments that put_object would take this way. For example:
import boto3

s3 = boto3.client('s3')
file_path = '/foo/bar/baz.txt'
s3.upload_file(
    Filename=file_path,
    Bucket='<bucket>',
    Key='key',
    ExtraArgs={
        'GrantRead': 'uri="http://policyurl"'
    }
)
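To cover the recursive part of the original aws s3 cp command as well, here is a rough sketch combining os.walk with upload_file; the bucket name, destination prefix, and grantee URI are placeholders from the question:
import os
import boto3

s3 = boto3.client('s3')
extra_args = {'GrantRead': 'uri="http://policyurl"'}  # placeholder grantee URI

local_root = 'y:/mydatafolder'
for dirpath, _dirnames, filenames in os.walk(local_root):
    for name in filenames:
        full_path = os.path.join(dirpath, name)
        # Mirror the local folder structure under the destination prefix.
        rel_key = os.path.relpath(full_path, local_root).replace(os.sep, '/')
        s3.upload_file(full_path, '<bucket>', 'folder/subfolder/' + rel_key,
                       ExtraArgs=extra_args)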

How do I queue FTP commands in Twisted?

I'm writing an FTP client using Twisted that downloads a lot of files and I'm trying to do it pretty intelligently. However, I've been having the problem that I'll download several files very quickly (sometimes ~20 per batch, sometimes ~250) and then the downloading will hang, only to eventually have connections time out and then the download and hang start all over again. I'm using a DeferredSemaphore to only download 3 files at a time, but I now suspect that this is probably not the right way to avoid throttling the server.
Here is the code in question:
def downloadFiles(self, result, directory):
    # make the download directory if it doesn't already exist
    if not os.path.exists(directory['filename']):
        os.makedirs(directory['filename'])
    log.msg("Downloading files in %r..." % directory['filename'])
    files = filterFiles(None, self.fileListProtocol)
    # from http://stackoverflow.com/questions/2861858/queue-remote-calls-to-a-python-twisted-perspective-broker/2862440#2862440
    # use a DeferredSemaphore to limit the number of files downloaded
    # simultaneously from the directory to 3
    sem = DeferredSemaphore(3)
    jobs = [sem.run(self.downloadFile, f, directory) for f in files]
    d = gatherResults(jobs)
    return d

def downloadFile(self, f, directory):
    filename = os.path.join(directory['filename'], f['filename']).encode('ascii')
    log.msg('Downloading %r...' % filename)
    d = self.ftpClient.retrieveFile(filename, FTPFile(filename))
    return d
You'll notice that I'm reusing an FTP connection (an active one, by the way) and using my own FTPFile instance to make sure the local file object gets closed when the file download connection is 'lost' (i.e. completed). Looking at FTPClient, I wonder if I should be using queueCommand directly. To be honest, I got lost following the retrieveFile command to _openDataConnection and beyond, so maybe it's already being used.
Any suggestions? Thanks!
I would suggest using queueCommand, as you suggested. I suspect the semaphore you're using is what's causing you issues. I believe using queueCommand will limit your FTPClient to a single active connection (though I'm just speculating), so you may want to think about creating a few FTPClient instances and passing download jobs to them if you want to do things quickly. If you use queueStringCommand, you get a Deferred that you can use to determine where each client is up to, and you can even add another job to the queue for that client in the callback.
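A rough, untested sketch of the multiple-clients idea (host, credentials, and the FTPFile consumer class stand in for or come from the question's own code):
from twisted.internet import reactor
from twisted.internet.defer import gatherResults
from twisted.internet.protocol import ClientCreator
from twisted.protocols.ftp import FTPClient

def connect_clients(host, username, password, count=3):
    """Open `count` independent FTP connections; fires with a list of FTPClient instances."""
    creator = ClientCreator(reactor, FTPClient, username, password, passive=1)
    return gatherResults([creator.connectTCP(host, 21) for _ in range(count)])

def download_all(clients, filenames):
    """Deal filenames out round-robin; each client queues its own transfers sequentially."""
    jobs = []
    for i, name in enumerate(filenames):
        client = clients[i % len(clients)]
        jobs.append(client.retrieveFile(name, FTPFile(name)))  # FTPFile is the question's consumer class
    return gatherResults(jobs)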
