Download multiple files in different SFTP directories to local - python

I have a scenario where we need to download certain image files in different directories in SFTP server to local.
Example :
/IMAGES/folder1 has img11, img12, img13, img14
/IMAGES/folder2 has img21, img22, img23, img24
/IMAGES/folder3 has img31, img32, img33, img34
And I need to download img12, img23 and img34 from folder 1, 2 and 3 respectively
Right now I go inside each folder and get the images individually which takes an extraordinary amount of time(since there are 10,000s of images to download).
I have also found out that downloading a single file of the same size(as that of multiple image files) takes a fraction of the time.
My question is, is there a way to get these multiple files together instead of downloading them one after another ?
One approach I came up with was to copy all the files to a temp folder in sftp server and then download the directory but sftp does not allow 'copy', and I can not use 'rename' because then I will be moving the files to temp directory

You could use a process pool to open multiple sftp connections and download in parallel. For example,
from paramiko import SSHClient
from multiprocessing import Pool
def download_init(host):
global client, sftp
client = SSHClient()
client.load_system_host_keys()
client.connect(host)
sftp = ssh_client.open_sftp()
def download_close(dummy):
client.close()
def download_worker(params):
local_path, remote_path = *params
sftp.get(remote_path, local_path)
list_of_local_and_remote_files = [
["/client/files/folder1/img11", "/IMAGES/folder1/img11"],
]
def downloader(files):
pool_size = 8
pool = Pool(8, initializer=download_init,
initargs=["sftpserver.example.com"])
result = pool.map(download_worker, files, chunksize=10)
pool.map(download_close, range(pool_size))
if __name__ == "__main__":
downloader(list_of_local_and_remote_files)
Its unfortunate that Pool doesn't have a finalizer to undo what was set in the initializer. Its not usually necessary - the exiting process is cleanup enough. In the example I just wrote a separate worker function that cleans things up. By having 1 work item per pool process, they each get 1 call.

Related

Upload file using telegram-upload in flask?

We can upload file using telegram-upload library by using the following command on terminal
telegram-upload file1.mp4 /path/to/file2.mkv
But if I want to call this inside python function, How should I do it. I mean in a python function if users passes the file path as an argument, then that function should be able to upload the file to telegram server.It is not mentioned in the documentation.
In other words I want to ask how to execute or run shell commands from inside python function?
For telegram-upload you can use upload method in telegram_upload.management and
for telegram-download use download method in the same file.
Or you can see how they are implemented there.
from telegram_upload.client import Client
from telegram_upload.config import default_config, CONFIG_FILE
from telegram_upload.exceptions import catch
from telegram_upload.files import NoDirectoriesFiles, RecursiveFiles
DIRECTORY_MODES = {
'fail': NoDirectoriesFiles,
'recursive': RecursiveFiles,
}
def upload(files, to, config, delete_on_success, print_file_id, force_file, forward, caption, directories,
no_thumbnail):
"""Upload one or more files to Telegram using your personal account.
The maximum file size is 1.5 GiB and by default they will be saved in
your saved messages.
"""
client = Client(config or default_config())
client.start()
files = DIRECTORY_MODES[directories](files)
if directories == 'fail':
# Validate now
files = list(files)
client.send_files(to, files, delete_on_success, print_file_id, force_file, forward, caption, no_thumbnail)
I found the solution.Using os module we can run command line strings inside python function i.e. os.system('telegram-upload file1.mp4 /path/to/file2.mkv')

Reuse opened connection to another script

I'm trying to build project consists of multiple python files. The first file is called "startup.py" and just responsible of opening connections to multiple routers and switches (each device allow only one connection at a time) and save them to the list. This script should be running all the time so other files can use it
#startup.py
def validate_connections_to_leaves():
leaves = yaml_utils.load_yaml_file_from_directory("inventory", topology)["fabric_leaves"]
leaves_connections = []
for leaf in leaves:
leaf_ip = leaf["ansible_host"]
leaf_user = leaf["ansible_user"]
leaf_pass = leaf["ansible_pass"]
leaf_cnx = junos_utils.open_fabric_connection(host=leaf_ip, user=leaf_user, password=leaf_pass)
if leaf_cnx:
leaves_connections.append(leaf_cnx)
else:
log.script_logger(severity="ERROR", message="Unable to connect to Leaf", data=leaf_ip, debug=debug,
indent=0)
return leaves_connections
if __name__ == '__main__':
leaves = validate_connections_to_leaves()
pprint(leaves)
#Keep script running
while True:
time.sleep(10)
now I want to re-use these opened connections in another python file(s) without having to establish connections again. if I just import it to another file it will re-execute the startup script one more time.
can anyone help me to identify which part I'm missing here?
You should consider your startup.py file as your entry point where all the logic is. You other files should be imported and used inside this file.
import otherfile1
import otherfile2
# import other file here
def validate_connections_to_leaves:
# ...
if __name__ == '__main__':
leaves = validate_connections_to_leaves()
otherfile1.do_something_with_the_connection(leaves)
#Keep script running
while True:
time.sleep(10)
And in your other file it will be simply:
def do_something_with_the_connection(leaves):
# do something with the connections

Python fabric calling script "remote path"

I'm using fabric to connect to remote host, when i'm there, I try to call a script that I made (It parses the file I give in argument). But when I call the script from inside my Fabfile.py, it assumes the path I gave is from the machine I launch the fabfile from (so not my remote host)
In my fabfile.py I have:
Import import servclasse
env.host='host1'
def listconf():
#here I browes to the correct folder
s=servclasse.Server("my.file") #this is where I want it to open the host1:my.file file and instanciate a classe from what it parsed
If i do this, it tries to open the file from the folder where servclass.py is. Is there a way to give a "remote path" in argument? I would rather not downloading the file.
Should I upload the script servclasse.py with the operation.put before calling it?
Edit: more info
In my servclasse I have this:
def __init__(self, path):
self.config = ConfigParser.ConfigParser(allow_no_value=True)
self.config.readfp(open(path))
The function open() was the problem.
I figured out how to do it so i'll drop it here in case someone read this topic one day :
def listconf():
#first I browes to the correct folder then
contents = StringIO.StringIO()
get("MyFile",contents)
contents.seek(0)
s=Server(contents)
and in the servclass.py
def __init__(self, objfile):
self.config = ConfigParser.ConfigParser(allow_no_value=True)
self.config.readfp(objfile)
#and i do my stuffs

ZEO ZODB database - run locally not working

I tried looking at the documentation for running ZEO on a ZODB database, but it isn't working how they say it should.
I can get a regular ZODB running fine, but I would like to make the database accessible by several processes for a program, so I am trying to get ZEO to work.
I created this script in a folder with a subfolder zeo, which will hold the "database.fs" files created by the make_server function in a different parallel process:
CODE:
from ZEO import ClientStorage
import ZODB
import ZODB.config
import os, time, site, subprocess, multiprocessing
# make the server in for the database in a separate process with windows command
def make_server():
runzeo_path = site.getsitepackages()[0] + "\Lib\site-packages\zeo-4.0.0-py2.7.egg\ZEO\\runzeo.py"
filestorage_path = os.getcwd() + '\zeo\database.fs'
subprocess.call(["python", runzeo_path, "-a", "127.0.0.1:9100", "-f" , filestorage_path])
if __name__ == "__main__":
server_process = multiprocessing.Process(target = make_server)
server_process.start()
time.sleep(5)
storage = ClientStorage.ClientStorage(('localhost', 9100), wait=False)
db = ZODB.DB(storage)
connection = db.open()
root = connection.root()
the program will just block at the ClientStorage line if the wait=False is not given.
If the wait=False is given it produces this error:
Error Message:
Traceback (most recent call last):
File "C:\Users\cbrown\Google Drive\EclipseWorkspace\NewSpectro - v1\20131202\2 - database\zeo.py", line 17, in <module>
db = ZODB.DB(storage)
File "C:\Python27\lib\site-packages\zodb-4.0.0-py2.7.egg\ZODB\DB.py", line 443, in __init__
temp_storage.load(z64, '')
File "C:\Python27\lib\site-packages\zeo-4.0.0-py2.7.egg\ZEO\ClientStorage.py", line 841, in load
data, tid = self._server.loadEx(oid)
File "C:\Python27\lib\site-packages\zeo-4.0.0-py2.7.egg\ZEO\ClientStorage.py", line 88, in __getattr__
raise ClientDisconnected()
ClientDisconnected
Here is the output from the cmd prompt for my process which runs a server:
------
2013-12-06T21:07:27 INFO ZEO.runzeo (7460) opening storage '1' using FileStorage
------
2013-12-06T21:07:27 WARNING ZODB.FileStorage Ignoring index for C:\Users\cab0008
\Google Drive\EclipseWorkspace\NewSpectro - v1\20131202\2 - database\zeo\databas
e.fs
------
2013-12-06T21:07:27 INFO ZEO.StorageServer StorageServer created RW with storage
s: 1:RW:C:\Users\cab0008\Google Drive\EclipseWorkspace\NewSpectro - v1\20131202\
2 - database\zeo\database.fs
------
2013-12-06T21:07:27 INFO ZEO.zrpc (7460) listening on ('127.0.0.1', 9100)
What could I be doing wrong? I just want this to work locally right now so there shouldn't be any need for fancy web stuff.
You should use proper process management and simplify your life. You likely want to look into supervisor, which can be responsible for running/starting/stopping your application and ZEO.
Otherwise, you need to look at the double-fork trick to daemonize ZEO -- but why bother when a process management tool like supervisor does this for you.
If you are savvy with relational database administration, and already have a relational database at your disposal -- you can also consider RelStorage as a very good ZODB (low-level) storage backend.
In Windows you should use double \ instead of a single \ in the paths. Easy and portable way to accomplish this is to use os.path.join() function, eg. os.path.join('os.getcwd()', 'zeo', 'database.fs'). Otherwise a similar code worked ok for me.
Had same error on Windows , on Linux everything OK ...
your code is ok , to make this to work change following
C:\Python33\Lib\site-packages\ZEO-4.0.0-py3.3.egg\ZEO\zrpc\trigger.py ln:235
self.trigger.send(b'x')
C:\Python33\Lib\site-packages\ZEO-4.0.0-py3.3.egg\ZEO\zrpc\client.py ln:458:459 - comment them
here is those lines:
if socktype != socket.SOCK_STREAM:
continue

Multi-threaded S3 download doesn't terminate

I'm using python boto and threading to download many files from S3 rapidly. I use this several times in my program and it works great. However, there is one time when it doesn't work. In that step, I try to download 3,000 files on a 32 core machine (Amazon EC2 cc2.8xlarge).
The code below actually succeeds in downloading every file (except sometimes there is an httplib.IncompleteRead error that doesn't get fixed by the retries). However, only 10 or so of the 32 threads actually terminate and the program just hangs. Not sure why this is. All the files have been downloaded and all the threads should have exited. They do on other steps when I download fewer files. I've been reduced to downloading all these files with a single thread (which works but is super slow). Any insights would be greatly appreciated!
from boto.ec2.connection import EC2Connection
from boto.s3.connection import S3Connection
from boto.s3.key import Key
from boto.exception import BotoClientError
from socket import error as socket_error
from httplib import IncompleteRead
import multiprocessing
from time import sleep
import os
import Queue
import threading
def download_to_dir(keys, dir):
"""
Given a list of S3 keys and a local directory filepath,
downloads the files corresponding to the keys to the local directory.
Returns a list of filenames.
"""
filenames = [None for k in keys]
class DownloadThread(threading.Thread):
def __init__(self, queue, dir):
# call to the parent constructor
threading.Thread.__init__(self)
# create a connection to S3
connection = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
self.conn = connection
self.dir = dir
self.__queue = queue
def run(self):
while True:
key_dict = self.__queue.get()
print self, key_dict
if key_dict is None:
print "DOWNLOAD THREAD FINISHED"
break
elif key_dict == 'DONE': #last job for last worker
print "DOWNLOADING DONE"
break
else: #still work to do!
index = key_dict.get('idx')
key = key_dict.get('key')
bucket_name = key.bucket.name
bucket = self.conn.get_bucket(bucket_name)
k = Key(bucket) #clone key to use new connection
k.key = key.key
filename = os.path.join(dir, k.key)
#make dirs if don't exist yet
try:
f_dirname = os.path.dirname(filename)
if not os.path.exists(f_dirname):
os.makedirs(f_dirname)
except OSError: #already written to
pass
#inspired by: http://code.google.com/p/s3funnel/source/browse/trunk/scripts/s3funnel?r=10
RETRIES = 5 #attempt at most 5 times
wait = 1
for i in xrange(RETRIES):
try:
k.get_contents_to_filename(filename)
break
except (IncompleteRead, socket_error, BotoClientError), e:
if i == RETRIES-1: #failed final attempt
raise Exception('FAILED TO DOWNLOAD %s, %s' % (k, e))
break
wait *= 2
sleep(wait)
#put filename in right spot!
filenames[index] = filename
num_cores = multiprocessing.cpu_count()
q = Queue.Queue(0)
for i, k in enumerate(keys):
q.put({'idx': i, 'key':k})
for i in range(num_cores-1):
q.put(None) # add end-of-queue markers
q.put('DONE') #to signal absolute end of job
#Spin up all the workers
workers = [DownloadThread(q, dir) for i in range(num_cores)]
for worker in workers:
worker.start()
#Block main thread until completion
for worker in workers:
worker.join()
return filenames
Upgrade to AWS SDK version 1.4.4.0 or newer, or stick to exactly 2 threads. Older versions have a limit of at most 2 simultaneous connections. This means that your code will work well if you launch 2 threads; if you launch 3 or more, you are bound to see incomplete reads and exhausted timeouts.
You will see that while 2 threads can boost your throughput greatly, more than 2 does not change much because your network card is busy all the time anyway.
S3Connection uses httplib.py and that library is not threadsafe so ensuring each thread has it's own connection is critical. It looks like you are doing that.
Boto already has it's own retry mechanism but you are layering one on top of that to handle certain other errors. I wonder if it would be advisable to create a new S3Connection object inside the except block. It just seems like the underlying http connection could be in an unusual state at that point and it might be best to start with a fresh connection.
Just a thought.

Categories

Resources