I'm writing an FTP client using Twisted that downloads a lot of files and I'm trying to do it pretty intelligently. However, I've been having the problem that I'll download several files very quickly (sometimes ~20 per batch, sometimes ~250) and then the downloading will hang, only to eventually have connections time out and then the download and hang start all over again. I'm using a DeferredSemaphore to only download 3 files at a time, but I now suspect that this is probably not the right way to avoid throttling the server.
Here is the code in question:
def downloadFiles(self, result, directory):
    # make download directory if it doesn't already exist
    if not os.path.exists(directory['filename']):
        os.makedirs(directory['filename'])
    log.msg("Downloading files in %r..." % directory['filename'])
    files = filterFiles(None, self.fileListProtocol)
    # from http://stackoverflow.com/questions/2861858/queue-remote-calls-to-a-python-twisted-perspective-broker/2862440#2862440
    # use a DeferredSemaphore to limit the number of files downloaded simultaneously from the directory to 3
    sem = DeferredSemaphore(3)
    jobs = [sem.run(self.downloadFile, f, directory) for f in files]
    d = gatherResults(jobs)
    return d

def downloadFile(self, f, directory):
    filename = os.path.join(directory['filename'], f['filename']).encode('ascii')
    log.msg('Downloading %r...' % filename)
    d = self.ftpClient.retrieveFile(filename, FTPFile(filename))
    return d
You'll notice that I'm reusing an FTP connection (active, by the way) and using my own FTPFile instance to make sure the local file object gets closed when the file download connection is 'lost' (i.e. completed). Looking at FTPClient, I wonder if I should be using queueCommand directly. To be honest, I got lost following the retrieveFile command to _openDataConnection and beyond, so maybe it's already being used.
Any suggestions? Thanks!
I would suggest using queueCommand, as you proposed; I suspect the semaphore is what is causing your issues. I believe queueCommand will limit your FTPClient to a single active connection (though I'm just speculating), so if you want downloads to go quickly you may want to create a few FTPClient instances and pass download jobs to them. If you use queueStringCommand, you get a Deferred that tells you where each client is up to, and in its callback you can even add another job to that client's queue.
My project needs to download quite a few files regularly before doing treatment on them.
I tried coding it directly in Python but it's horribly slow considering the amount of data in the buckets.
I decided to use a subprocess running aws-cli because boto3 still doesn't have a sync functionality. I know using a subprocess with aws-cli is not ideal, but it really is useful and works extremely well out of the box.
One of the perks of aws-cli is the fact that I can see the progress in stdout, which I am getting with the following code:
import os
import subprocess
from pathlib import Path

def download_bucket(bucket_url, dir_name, dest):
    """Download all the files from a bucket into a directory."""
    path = Path(dest) / dir_name
    bucket_dest = str(os.path.join(bucket_url, dir_name))
    with subprocess.Popen(["aws", "s3", "sync", bucket_dest, path], stdout=subprocess.PIPE, bufsize=1, universal_newlines=True) as p:
        for b in p.stdout:
            print(b, end='')
    # returncode is only set once the process has been waited on (on leaving the with block)
    if p.returncode != 0:
        raise subprocess.CalledProcessError(p.returncode, p.args)
Now, I want to make sure that I test this function but I am blocked here because:
I don't know the best way to test this kind of freakish behavior:
Am I supposed to actually create a fake local s3 bucket so that aws s3 sync can hit it?
Am I supposed to mock the subprocess call and not actually call my download_bucket function?
Until now, my attempt was to create a fake bucket and to pass it to my download_bucket function.
This way, I thought that aws s3 sync would still be working, albeit locally:
def test_download_s3(tmpdir):
    tmpdir.join(f'frankendir').ensure()
    with mock_s3():
        conn = boto3.resource('s3', region_name='us-east-1')
        conn.create_bucket(Bucket='cool-bucket.us-east-1.dev.000000000000')
        s3 = boto3.client('s3', region_name="us-east-1")
        s3.put_object(Bucket='cool-bucket.us-east-1.dev.000000000000', Key='frankendir', Body='has no files')
        body = conn.Object('cool-bucket.us-east-1.dev.000000000000', 'frankendir').get()[
            'Body'].read().decode("utf-8")
        download_bucket('s3://cool-bucket.us-east-1.dev.000000000000', 'frankendir', tmpdir)
        #assert tmpdir.join('frankendir').join('has not files').exists()
        assert body == 'has no files'
But I get the following error:
fatal error: An error occurred (InvalidAccessKeyId) when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.
My questions are the following:
Should I continue to pursue this creation of a fake local s3 bucket?
If so, how am I supposed to get the credentials to work?
Should I just mock the subprocess call and how?
I am having a hard time understanding how mocking works and how it's supposed to be done. From my understanding, I would just fake a call to aws s3 sync and return some files?
Is there another kind of unit test that would be enough that I didn't think of?
After all, I just want to know if when I transmit a well-formed s3://bucketurl, a dir in that bucket and a local dir, the files contained within the s3://bucketurl/dir are downloaded to my local dir.
Thank you for your help, I hope that I am not all over the place.
A much better approach is to use moto when faking / testing s3. You can check out their documentation or look at a test code example I did: https://github.com/pksol/pycon-go-beyond-mocks/blob/main/test_s3_fake.py.
If you have a few minutes, you can view this short video of me explaining the benefits of using moto vs trying to mock.
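For the other option raised in the question, mocking the subprocess call, here is a minimal sketch using only unittest.mock. It rests on a few assumptions: download_bucket is re-declared in simplified form so the block is self-contained, and run_with_fake_aws, the bucket name and the fake output line are hypothetical. As far as I understand, the InvalidAccessKeyId error fits with mock_s3 patching boto3 only inside the test process, while the aws CLI runs as a separate process that never sees the fake bucket.

```python
import subprocess
from pathlib import Path
from unittest import mock


def download_bucket(bucket_url, dir_name, dest):
    """Simplified copy of the function under test."""
    path = Path(dest) / dir_name
    source = f"{bucket_url}/{dir_name}"
    cmd = ["aws", "s3", "sync", source, str(path)]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=1,
                          universal_newlines=True) as p:
        for line in p.stdout:
            print(line, end="")
    if p.returncode != 0:
        raise subprocess.CalledProcessError(p.returncode, p.args)


def run_with_fake_aws(returncode):
    """Call download_bucket with Popen replaced by a fake process (hypothetical helper)."""
    fake = mock.MagicMock()
    fake.__enter__.return_value = fake   # `with Popen(...) as p` yields the fake
    fake.stdout = ["download: s3://bucket/frankendir/a.txt to a.txt\n"]
    fake.returncode = returncode
    fake.args = ["aws", "s3", "sync"]
    with mock.patch("subprocess.Popen", return_value=fake) as popen:
        download_bucket("s3://bucket", "frankendir", "dest")
    # Return the command list that would have been executed.
    return popen.call_args[0][0]
```

A test built this way only checks that the right command is assembled and that a non-zero exit code raises; it says nothing about whether files really arrive, which is the part a fake S3 endpoint reachable by the CLI would cover.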
I was running into trouble using QCamera with focusing and other things, so I thought I could use the Camera software shipped with Windows 10. Based on the thread about opening the Windows Camera, I did some trials to acquire the captured images and use them in my program. In the documentation and its API I didn't find snippets that were usable for me, so I created the hack shown below. It assumes that the images are in the target folder 'C:\\Users\\*username*\\Pictures\\Camera Roll', which is mentioned in the registry (see below), but I don't know if this is reliable or how to get the proper key name.
I don't think that this is the only or the cleanest solution. So, my question is how to get the captured images and open/close the Camera properly?
Currently the function waits until 'WindowsCamera.exe' has left the process list and returns the images / videos newly added to the target folder.
In the registry I found:
Entry: Computer\HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\User Shell Folders with key name {3B193882-D3AD-4eab-965A-69829D1FB59F} for the target folder. I don't think that this key is usable.
Working example of my hack:
import subprocess
import pathlib
import psutil

def check_for_files(path, pattern):
    print(" check_for_files:", (path, pattern))
    files = []
    for filename in pathlib.Path(path).rglob(pattern):
        files.append(filename)
    return files

def get_Windows_Picture(picpath):
    prefiles = check_for_files(picpath, '*.jpg')
    x = subprocess.call('start microsoft.windows.camera:', shell=True)
    processlist = [proc.info['name'] for proc in psutil.process_iter(['name'])]
    while 'WindowsCamera.exe' in processlist:
        processlist = [proc.info['name'] for proc in psutil.process_iter(['name'])]
    postfiles = check_for_files(picpath, '*.jpg')
    newfiles = []
    for file in postfiles:
        if file not in prefiles:
            newfiles.append(str(file))
    return newfiles

if __name__ == "__main__":
    picpath = str(pathlib.Path("C:/Users/*user*/Pictures/Camera Roll"))
    images = get_Windows_Picture(picpath)
    print("Images:", images)
The Camera Roll is a "known Windows folder" which means some APIs can retrieve the exact path (even if it's non-default) for you:
SHGetKnownFolderPath
SHGetKnownFolderIDList
SHSetKnownFolderPath
The knownfolderid documentation will give you the constant name of the required folder (in your case FOLDERID_CameraRoll). As you can see in the linked page, the default is %USERPROFILE%\Pictures\Camera Roll (It's the default, so this doesn't mean it's the same for everyone).
The problem in Python is that you'll need to use ctypes, which can be cumbersome sometimes (especially in your case, where you'll have to deal with GUIDs and release the memory returned by the API).
This gist gives a good example on how to call SHGetKnownFolderPath from Python with ctypes. In your case you'll only need the CameraRoll member in the FOLDERID class so you can greatly simplify the code.
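As a rough idea of what such a simplified version could look like, here is a sketch under a few assumptions: the GUID string is the one I believe the KNOWNFOLDERID documentation lists for FOLDERID_CameraRoll (please verify it there), and the names GUID, make_guid and known_folder_path are my own, not taken from the gist. The call itself only works on Windows.

```python
import ctypes
import sys
import uuid

# Assumption: GUID of FOLDERID_CameraRoll; check it against the KNOWNFOLDERID docs.
FOLDERID_CAMERA_ROLL = "{AB5FB87B-7CE2-4F83-915D-550846C9537B}"


class GUID(ctypes.Structure):
    """Binary layout of a Windows GUID (16 bytes)."""
    _fields_ = [
        ("Data1", ctypes.c_uint32),
        ("Data2", ctypes.c_uint16),
        ("Data3", ctypes.c_uint16),
        ("Data4", ctypes.c_ubyte * 8),
    ]


def make_guid(text):
    """Build a GUID structure from its string form."""
    # bytes_le stores the first three fields little-endian, as Windows does.
    return GUID.from_buffer_copy(uuid.UUID(text).bytes_le)


def known_folder_path(folder_id=FOLDERID_CAMERA_ROLL):
    """Return the current path of a known folder (Windows only)."""
    if sys.platform != "win32":
        raise OSError("SHGetKnownFolderPath is only available on Windows")
    shell32 = ctypes.windll.shell32
    ole32 = ctypes.windll.ole32
    shell32.SHGetKnownFolderPath.argtypes = [
        ctypes.POINTER(GUID), ctypes.c_uint32, ctypes.c_void_p,
        ctypes.POINTER(ctypes.c_wchar_p),
    ]
    shell32.SHGetKnownFolderPath.restype = ctypes.c_long
    ole32.CoTaskMemFree.argtypes = [ctypes.c_void_p]
    ole32.CoTaskMemFree.restype = None

    path_ptr = ctypes.c_wchar_p()
    guid = make_guid(folder_id)
    hresult = shell32.SHGetKnownFolderPath(
        ctypes.byref(guid), 0, None, ctypes.byref(path_ptr))
    if hresult != 0:
        raise OSError("SHGetKnownFolderPath failed: %#x" % (hresult & 0xFFFFFFFF))
    try:
        return path_ptr.value  # copied into a Python str before the buffer is freed
    finally:
        ole32.CoTaskMemFree(ctypes.cast(path_ptr, ctypes.c_void_p))
```

On Windows, picpath = known_folder_path() could then replace the hard-coded path in the question's script.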
Side note: Don't poll for the process end, just use the wait() function on the Popen object.
for v,i in enumerate(assets_files):
    a = requests.get(domain+i).content
    split_filename = i.split('/')
    path = os.path.join(all_folder[4],split_filename[-1])
    with open(path,'wb') as w:
        w.write(a)
    print('Downloaded: ',split_filename[-1],' number: ',v)
I don't want my sys admin banning me for multiple connections. Is there a pythonic option to just download a list of files with one request? I would appreciate it.
requests has a Session object for this, as explained here.
Using the global requests.get will not reuse the connection, but session.get probably will.
I say "probably" because a limited connection pool is used under the hood. Note that this still sends one request per file; it only avoids opening a new connection for each one.
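A minimal sketch of the loop from the question rewritten around a Session. The function names, the timeout value and the raise_for_status() check are my additions, not part of the original code; requests must be installed.

```python
import os

import requests  # third-party: pip install requests


def download_assets(session, domain, asset_paths, dest_dir):
    """Fetch every asset through one session and save it under dest_dir."""
    saved = []
    for number, asset in enumerate(asset_paths):
        # session.get can reuse a pooled keep-alive connection to the host.
        response = session.get(domain + asset, timeout=30)
        response.raise_for_status()
        target = os.path.join(dest_dir, asset.split("/")[-1])
        with open(target, "wb") as out:
            out.write(response.content)
        print("Downloaded:", os.path.basename(target), "number:", number)
        saved.append(target)
    return saved


def download_with_session(domain, asset_paths, dest_dir):
    """Convenience wrapper that creates and closes the Session."""
    with requests.Session() as session:
        return download_assets(session, domain, asset_paths, dest_dir)
```

With the variables from the question, the call would be download_with_session(domain, assets_files, all_folder[4]).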
I have a very simple build script that's behaving in an unexpected way (for me) on Linux.
Part of the script simply checks that the files generated by a build stage haven't been manually replaced with stale files.
# load the cache from a json file
if os.path.getmtime(some_path) < cache['timestamp']:
    print("build must run because %s is stale" % some_path)
    run_build = True
...
if run_build:
    cache['timestamp'] = time.time()
    # use subprocess to run process that modifies some_path
    # store updated cache to json file
Unfortunately, this naive logic isn't working on Linux (Ubuntu 16.04): the timestamp stored in the cache is often a few milliseconds ahead of the modification time of the generated files.
I could probably just round down the cached timestamp, but that feels like a nasty hack, and I wonder if there's a better way to get consistent times here.
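One possible explanation, which is an assumption on my part and not confirmed in this thread, is that time.time() and the timestamps the kernel writes on files do not come from clocks with the same granularity. If so, a way around it is to take the reference time from the filesystem itself, so both sides of the comparison come from the same source. A minimal sketch, with filesystem_now and is_stale as hypothetical helper names:

```python
import os
import tempfile


def filesystem_now(directory):
    """Return the mtime the filesystem gives to a file created right now."""
    fd, marker = tempfile.mkstemp(dir=directory)
    try:
        os.close(fd)
        return os.stat(marker).st_mtime
    finally:
        os.remove(marker)


def is_stale(path, cached_timestamp):
    """True if path was last modified before the cached reference time."""
    return os.path.getmtime(path) < cached_timestamp
```

In the build script this would mean storing filesystem_now(os.path.dirname(some_path)) in cache['timestamp'] instead of time.time(). The marker file has to live on the same filesystem as the generated files for the comparison to be meaningful.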
I am running a python script on several Linux nodes (after the creation of a pool) using Azure Batch. Each node uses 14.04.5-LTS version of Ubuntu.
In the script, I upload several files to each node and then run several tasks on each of these nodes. But I get a "Permission denied" error when I try to execute the first task. The task is actually an unzip of a few files (FYI, the upload of these zip files went well).
This script was running well until last week. I suspect an update of the Ubuntu version, but maybe it's something else.
Here is the error I get :
error: cannot open zipfile [ /mnt/batch/tasks/shared/01-AXAIS_HPC.zip ]
Permission denied
unzip: cannot find or open /mnt/batch/tasks/shared/01-AXAIS_HPC.zip,
Here is the main part of the code :
credentials = batchauth.SharedKeyCredentials(_BATCH_ACCOUNT_NAME,_BATCH_ACCOUNT_KEY)
batch_client = batch.BatchServiceClient(
    credentials,
    base_url=_BATCH_ACCOUNT_URL)
create_pool(batch_client,
            _POOL_ID,
            application_files,
            _NODE_OS_DISTRO,
            _NODE_OS_VERSION)
helpers.create_job(batch_client, _JOB_ID, _POOL_ID)
add_tasks(batch_client,
          _JOB_ID,
          input_files,
          output_container_name,
          output_container_sas_token)
with add_tasks:
def add_tasks(batch_service_client, job_id, input_files,
              output_container_name, output_container_sas_token):
    print('Adding {} tasks to job [{}]...'.format(len(input_files), job_id))
    tasks = list()
    for idx, input_file in enumerate(input_files):
        command = ['unzip -q $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC.zip -d $AZ_BATCH_NODE_SHARED_DIR',
                   'chmod a+x $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC/00-EXE/linux/*',
                   'PATH=$PATH:$AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC/00-EXE/linux',
                   'unzip -q $AZ_BATCH_TASK_WORKING_DIR/'
                   '{} -d $AZ_BATCH_TASK_WORKING_DIR/{}'.format(input_file.file_path,idx+1),
                   'Rscript $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC/03-MAIN.R $AZ_BATCH_TASK_WORKING_DIR $AZ_BATCH_NODE_SHARED_DIR/01-AXAIS_HPC $AZ_BATCH_TASK_WORKING_DIR/'
                   '{} {}' .format(idx+1,idx+1),
                   'python $AZ_BATCH_NODE_SHARED_DIR/01-IMPORT_FILES.py '
                   '--storageaccount {} --storagecontainer {} --sastoken "{}"'.format(
                       _STORAGE_ACCOUNT_NAME,
                       output_container_name,
                       output_container_sas_token)]
        tasks.append(batchmodels.TaskAddParameter(
            'Task{}'.format(idx),
            helpers.wrap_commands_in_shell('linux', command),
            resource_files=[input_file]
            )
        )
    Split = lambda tasks, n=100: [tasks[i:i+n] for i in range(0, len(tasks), n)]
    SPtasks = Split(tasks)
    for i in range(len(SPtasks)):
        batch_service_client.task.add_collection(job_id, SPtasks[i])
Do you have any insights to help me on this issue? Thank you very much.
Robin
Looking at the error, i.e.
error: cannot open zipfile [ /mnt/batch/tasks/shared/01-AXAIS_HPC.zip ]
Permission denied unzip: cannot find or open /mnt/batch/tasks/shared/01-AXAIS_HPC.zip,
It seems that either the file is not present at the current shared directory location or it does not have the correct permissions. The former is more likely.
Is there any particular reason you are using the shared directory? Also, how are you uploading these zip files? (I hope async and await are used correctly, i.e. that no greedy process runs your task before the shared_dir contents are available to the node.)
Side note: you own the node, so you can RDP / SSH into it and check whether the files in shared_dir are actually present.
Also, if I may ask, what is the design / user scenario here, and how exactly do you intend to use this?
Recommendation:
There are a few other ways to use zip files on an Azure node, such as a resource file or an application package. (The application package approach might be better suited to *.zip files.) I have added a few documents and places where you can look at sample implementations and guidance.
I hope the material and samples below will help you. :)
I would also recommend recreating your pool if it is old, which will ensure that your nodes run the latest version.
Azure batch learning path:
Azure batch api basics
Samples & demo link or look here
Detailed walk through depending on what you are using i.e. CloudServiceConfiguration or VirtualMachineConfiguration link.