I try to get a pickle file from an S3 resource using the "Object.get()" method of the boto3 library from several processes simultaneously. This causes my program to get stuck in one of the processes (no exception is raised and the program does not continue to the next line).
I tried to add a "Config" variable to the S3 connection. That didn't help.
import os
import pickle
import boto3
from botocore.client import Config

s3_item = _get_s3_name(descriptor_key)  # Returns a path string of the desired file
config = Config(connect_timeout=5, retries={'max_attempts': 0})
s3 = boto3.resource('s3', config=config)
bucket_uri = os.environ.get(*ct.S3_MICRO_SERVICE_BUCKET_URI)  # Returns a string of the bucket URI
estimator_factory_logger.debug(f"Calling s3 with item {s3_item} from URI {bucket_uri}")
model_file_from_s3 = s3.Bucket(bucket_uri).Object(s3_item)
estimator_factory_logger.debug("Loading bytes...")
model_content = model_file_from_s3.get()['Body'].read()  # <- Program gets stuck here
estimator_factory_logger.debug("Loading from pickle...")
est = pickle.loads(model_content)
No error message is raised. It seems that the "get" method is stuck in a deadlock.
Your help will be much appreciated.
Is there a possibility that one of the files in the bucket is just huge and the program takes a long time to read it?
If that's the case, as a debugging step I'd look into the model_file_from_s3.get()['Body'] object, which is a botocore.response.StreamingBody object, and use set_socket_timeout() on it to try and force a timeout.
https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html
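For example, a minimal sketch of that debugging step (the 30-second value is only an illustration):

body = model_file_from_s3.get()['Body']  # botocore.response.StreamingBody
body.set_socket_timeout(30)  # force a socket timeout instead of hanging indefinitely
model_content = body.read()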
The problem was that we created a subprocess after our main process had already opened several threads. Apparently, this is a big no-no in Linux.
We fixed it by using "spawn" instead of "fork".
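For reference, a minimal sketch of switching the start method; the worker function and its arguments are made up for illustration:

import multiprocessing as mp

def load_model_from_s3(descriptor_key):
    # hypothetical worker: create its own boto3 resource and do the get() + pickle.loads() here
    return descriptor_key

if __name__ == '__main__':
    ctx = mp.get_context('spawn')  # use "spawn" instead of the default "fork" on Linux
    with ctx.Pool(processes=4) as pool:
        models = pool.map(load_model_from_s3, ['model_a', 'model_b'])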
My project needs to download quite a few files regularly before processing them.
I tried coding it directly in Python, but it's horribly slow considering the amount of data in the buckets.
I decided to use a subprocess running aws-cli because boto3 still doesn't have a sync functionality. I know using a subprocess with aws-cli is not ideal, but it really is useful and works extremely well out of the box.
One of the perks of aws-cli is the fact that I can see the progress in stdout, which I am getting with the following code:
import os
import subprocess
from pathlib import Path

def download_bucket(bucket_url, dir_name, dest):
    """Download all the files from a bucket into a directory."""
    path = Path(dest) / dir_name
    bucket_dest = str(os.path.join(bucket_url, dir_name))
    with subprocess.Popen(["aws", "s3", "sync", bucket_dest, str(path)],
                          stdout=subprocess.PIPE, bufsize=1, universal_newlines=True) as p:
        for line in p.stdout:
            print(line, end='')
    # the context manager waits for the process, so returncode is set by this point
    if p.returncode != 0:
        raise subprocess.CalledProcessError(p.returncode, p.args)
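A call then looks something like this (the bucket and directory names are made up for illustration):

download_bucket('s3://my-data-bucket', 'daily-exports', '/tmp/downloads')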
Now, I want to make sure that I test this function but I am blocked here because:
I don't know the best way to test this kind of freakish behavior:
Am I supposed to actually create a fake local s3 bucket so that aws s3 sync can hit it?
Am I supposed to mock the subprocess call and not actually call my download_bucket function?
Until now, my attempt was to create a fake bucket and to pass it to my download_bucket function.
This way, I thought that aws s3 sync would still be working, albeit locally:
import boto3
from moto import mock_s3

# download_bucket is the function under test, defined above

def test_download_s3(tmpdir):
    tmpdir.join('frankendir').ensure()
    with mock_s3():
        conn = boto3.resource('s3', region_name='us-east-1')
        conn.create_bucket(Bucket='cool-bucket.us-east-1.dev.000000000000')
        s3 = boto3.client('s3', region_name='us-east-1')
        s3.put_object(Bucket='cool-bucket.us-east-1.dev.000000000000',
                      Key='frankendir', Body='has no files')
        body = conn.Object('cool-bucket.us-east-1.dev.000000000000',
                           'frankendir').get()['Body'].read().decode('utf-8')
        download_bucket('s3://cool-bucket.us-east-1.dev.000000000000', 'frankendir', tmpdir)
        # assert tmpdir.join('frankendir').join('has not files').exists()
        assert body == 'has no files'
But I get the following error: fatal error: An error occurred (InvalidAccessKeyId) when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.
My questions are the following:
Should I continue to pursue this creation of a fake local s3 bucket?
If so, how am I supposed to get the credentials to work?
Should I just mock the subprocess call and how?
I am having a hard time understanding how mocking works and how it's supposed to be done. From my understanding, I would just fake a call to aws s3 sync and return some files?
Is there another kind of unit test that would be enough that I didn't think of?
After all, I just want to know if when I transmit a well-formed s3://bucketurl, a dir in that bucket and a local dir, the files contained within the s3://bucketurl/dir are downloaded to my local dir.
Thank you for your help, I hope that I am not all over the place.
A much better approach is to use moto when faking/testing S3. You can check out their documentation or look at a test code example I did: https://github.com/pksol/pycon-go-beyond-mocks/blob/main/test_s3_fake.py.
If you have a few minutes, you can watch this short video of me explaining the benefits of using moto vs trying to mock.
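A minimal sketch of what a moto-based test typically looks like; note that moto patches the boto3/botocore calls made inside the test process itself, so an external aws-cli subprocess will not see the fake bucket (names here are placeholders):

import boto3
from moto import mock_s3

@mock_s3
def test_fake_s3_roundtrip():
    s3 = boto3.client('s3', region_name='us-east-1')
    s3.create_bucket(Bucket='fake-bucket')
    s3.put_object(Bucket='fake-bucket', Key='frankendir/data.txt', Body=b'hello')
    body = s3.get_object(Bucket='fake-bucket', Key='frankendir/data.txt')['Body'].read()
    assert body == b'hello'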
import os
import requests

# assets_files, domain and all_folder are defined earlier in the script
for v, i in enumerate(assets_files):
    a = requests.get(domain + i).content
    split_filename = i.split('/')
    path = os.path.join(all_folder[4], split_filename[-1])
    with open(path, 'wb') as w:
        w.write(a)
    print('Downloaded: ', split_filename[-1], ' number: ', v)
I don't want my sysadmin banning me for opening multiple connections. Is there a pythonic option to just download a list of files over a single connection? I would appreciate it.
requests has a Session object for this, as explained here.
Using the global requests.get will not reuse the connection, but session.get probably will.
I say probably because there is a limited connection pool used under the hood.
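A minimal sketch of the Session-based version of the loop above (variable names follow the question):

import os
import requests

with requests.Session() as session:  # one Session reuses the underlying TCP connection(s)
    for v, i in enumerate(assets_files):
        content = session.get(domain + i).content
        filename = i.split('/')[-1]
        with open(os.path.join(all_folder[4], filename), 'wb') as w:
            w.write(content)
        print('Downloaded:', filename, 'number:', v)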
I am trying to temporarily store files within a function app. The function is triggered by an HTTP request which contains a file name. I first check whether the file is in the function app storage and, if not, I write it into the storage.
Like this:
if local_path.exists():
    file = json.loads(local_path.read_text('utf-8'))
else:
    s = AzureStorageBlob(account_name=account_name, container=container,
                         file_path=file_path, blob_contents_only=True,
                         mi_client_id=mi_client_id, account_key=account_key)
    blob_bytes = s.read_azure_blob()  # read the blob once and reuse the bytes
    local_path.write_bytes(blob_bytes)  # <- this write triggers the error below
    file = json.loads(blob_bytes.decode('utf-8'))
This is how I get the local path:
import pathlib

root = pathlib.Path(__file__).parent.absolute()
file_name = pathlib.Path(file_path).name
local_path = root.joinpath('file_directory').joinpath(file_name)
If I upload the files when deploying the function app, everything works as expected and I read the files from the function app storage. But if I try to save or cache a file, it breaks and gives me this error: Error: [Errno 30] Read-only file system
Any help is greatly appreciated
So after a year I discovered that Azure function apps can use the Python os module to save files, and those files persist across calls to the function app. No need to implement any sort of caching logic. Simply use os.write() and it will work just fine.
I did have to implement a function to clear the cache after a certain period of time, but that was a much better alternative to the verbose code I had previously.
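For illustration, a minimal sketch of that kind of file cache with an expiry period; it assumes the function app's temp directory is writable, and the cache location, TTL, and helper names are made up:

import json
import pathlib
import tempfile
import time

CACHE_DIR = pathlib.Path(tempfile.gettempdir()) / 'blob_cache'  # hypothetical cache location
CACHE_TTL_SECONDS = 3600  # hypothetical expiry period

def read_cached(file_name, fetch_bytes):
    """Return parsed JSON, refreshing the cached copy when it is missing or stale."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local_path = CACHE_DIR / file_name
    if local_path.exists() and time.time() - local_path.stat().st_mtime < CACHE_TTL_SECONDS:
        return json.loads(local_path.read_text('utf-8'))
    blob_bytes = fetch_bytes()  # e.g. a call that downloads the blob, like read_azure_blob()
    local_path.write_bytes(blob_bytes)
    return json.loads(blob_bytes.decode('utf-8'))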
I've actually asked a question about multiprocessing before, but now I'm running in to a weird shortcoming with the type of data that gets returned.
I'm using Gspread to interface with Google's Sheets API and get a "worksheet" object back.
This object, or an aspect of this object, is apparently incompatible with multiprocessing due to being "unpickleable". Please see the output:
File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[<Worksheet 'Activation Log' id:o12345wm>]'.
Reason: 'UnpickleableError(<ssl.SSLContext object at 0x1e4be30>,)'
The code I'm using is essentially:
import multiprocessing.pool
from oauth2client.client import SignedJwtAssertionCredentials
import gspread

sheet = 1
pool = multiprocessing.pool.Pool(1)
p = pool.apply_async(get_a_worksheet, args=(sheet,))
worksheet = p.get()  # <- fails here when the Worksheet result is sent back to the parent
And the script fails while attempting to "get" the results. The get_a_worksheet function returns a Gspread worksheet object that allows me to manipulate the remote sheet. Being able to upload changes to the document is important here - I'm not just trying to reference data, I need to alter it as well.
Does anyone know how I can run a subprocess in a separate and monitorable thread, and get an arbitrary (or custom) object type safely out of it at the end? Does anyone know what makes the ssl.SSLContext object special and "unpickleable"?
Thanks all in advance.
Multiprocessing uses pickling to pass objects between processes, so I do not believe you can return an unpicklable object from a worker process.
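You can reproduce the limitation in isolation with a quick sketch (independent of gspread):

import pickle
import ssl

ctx = ssl.create_default_context()
try:
    pickle.dumps(ctx)  # SSLContext objects cannot be serialized
except (TypeError, pickle.PicklingError) as exc:
    print('unpicklable:', exc)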
I ended up writing a solution around this shortcoming by having the sub-process simply perform the necessary work inside itself rather than return a Worksheet object.
What I ended up with was about half a dozen pairs of functions (a manager and a multiprocessing worker), each one written to do what I needed done, but inside of a sub-process so that it could be monitored and timed.
A hierarchical map would look something like:
Main()
    check_spreadsheet_for_a_string()
        check_spreadsheet_for_a_string_worker()
    get_hash_of_spreadsheet()
        get_hash_of_spreadsheet_worker()
    ... etc
Where the "worker" functions are the functions called in the multiprocessing setup, and the regular functions above them manage the sub-process and time it to make sure the overall program doesn't halt if the call to gspread internals hangs or takes too long.
I'm writing an FTP client using Twisted that downloads a lot of files and I'm trying to do it pretty intelligently. However, I've been having the problem that I'll download several files very quickly (sometimes ~20 per batch, sometimes ~250) and then the downloading will hang, only to eventually have connections time out and then the download and hang start all over again. I'm using a DeferredSemaphore to only download 3 files at a time, but I now suspect that this is probably not the right way to avoid throttling the server.
Here is the code in question:
def downloadFiles(self, result, directory):
    # make download directory if it doesn't already exist
    if not os.path.exists(directory['filename']):
        os.makedirs(directory['filename'])
    log.msg("Downloading files in %r..." % directory['filename'])
    files = filterFiles(None, self.fileListProtocol)
    # from http://stackoverflow.com/questions/2861858/queue-remote-calls-to-a-python-twisted-perspective-broker/2862440#2862440
    # use a DeferredSemaphore to limit the number of files downloaded simultaneously from the directory to 3
    sem = DeferredSemaphore(3)
    jobs = [sem.run(self.downloadFile, f, directory) for f in files]
    d = gatherResults(jobs)
    return d

def downloadFile(self, f, directory):
    filename = os.path.join(directory['filename'], f['filename']).encode('ascii')
    log.msg('Downloading %r...' % filename)
    d = self.ftpClient.retrieveFile(filename, FTPFile(filename))
    return d
You'll notice that I'm reusing an FTP connection (active, by the way) and using my own FTPFile instance to make sure the local file object gets closed when the file download connection is 'lost' (i.e. completed). Looking at FTPClient, I wonder if I should be using queueCommand directly. To be honest, I got lost following the retrieveFile command to _openDataConnection and beyond, so maybe it's already being used.
Any suggestions? Thanks!
I would suggest using queueCommand, as you suggested. I'd suspect the semaphore you're using is probably causing you issues. I believe using queueCommand will limit your FTPClient to a single active connection (though I'm just speculating), so you may want to think about creating a few FTPClient instances and passing download jobs to them if you want to do things quickly. If you use queueStringCommand, you get a Deferred that you can use to determine where each client is up to, and even add another job to the queue for that client in the callback.
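For illustration, a rough sketch of the "several clients" idea; it assumes self.ftpClients is a hypothetical list of already-connected FTPClient instances, and FTPFile and the file dicts are the ones from the question:

import os
from twisted.internet.defer import gatherResults

def downloadFilesAcrossClients(self, files, directory):
    # round-robin the download jobs over several connected FTPClient instances
    jobs = []
    for index, f in enumerate(files):
        client = self.ftpClients[index % len(self.ftpClients)]
        filename = os.path.join(directory['filename'], f['filename']).encode('ascii')
        jobs.append(client.retrieveFile(filename, FTPFile(filename)))
    return gatherResults(jobs)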