When uploading 100 files of 100 bytes each with SFTP, it takes 17 seconds here (measured after the connection is established; the initial connection time is not counted). That is 17 seconds to transfer only 10 KB, i.e. 0.59 KB/sec!
I know that sending SSH commands to open, write, close, etc. probably creates a big overhead, but still, is there a way to speed up the process when sending many small files with SFTP?
Or is there a special mode in paramiko / pysftp to keep all the write operations in a memory buffer (say, all operations from the last 2 seconds), and then perform them all in one grouped SSH/SFTP pass? This would avoid waiting for a round trip between each operation.
Notes:
I have a ~100 KB/s connection upload speed (a measured 0.8 Mbit/s upload) and a 40 ms ping time to the server.
Of course, if instead of sending 100 files of 100 bytes I send one file of 10 KB, it takes < 1 second.
I don't want to have to run a binary program on the remote side; only SFTP commands are accepted.
import pysftp, time, os

with pysftp.Connection('1.2.3.4', username='root', password='') as sftp:
    with sftp.cd('/tmp/'):
        t0 = time.time()
        for i in range(100):
            print(i)
            with sftp.open('test%i.txt' % i, 'wb') as f:  # even worse in a+ append mode: it takes 25 seconds
                f.write(os.urandom(100))
        print(time.time() - t0)
With the following method (100 asynchronous tasks), it's done in ~ 0.5 seconds, which is a massive improvement.
import asyncio, asyncssh  # pip install asyncssh

async def main():
    async with asyncssh.connect('1.2.3.4', username='root', password='') as conn:
        async with conn.start_sftp_client() as sftp:
            print('connected')
            await asyncio.wait([sftp.put('files/test%i.txt' % i) for i in range(100)])

asyncio.run(main())
I'll explore the source, but I still don't know if it groups many operations into a few SSH transactions, or if it just runs the commands in parallel.
I'd suggest you parallelize the upload using multiple connections from multiple threads. That's an easy and reliable solution; a rough sketch follows.
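A minimal sketch of that threaded approach (my own illustration, not code from the original answer), assuming pysftp as in the question and the same hypothetical 100 local files:

from concurrent.futures import ThreadPoolExecutor

import pysftp

HOST, USER, PASSWORD = '1.2.3.4', 'root', ''          # placeholders from the question
FILES = ['files/test%i.txt' % i for i in range(100)]  # hypothetical local paths

def upload_batch(paths):
    # one SFTP connection per worker thread, reused for its share of the files
    with pysftp.Connection(HOST, username=USER, password=PASSWORD) as sftp:
        with sftp.cd('/tmp/'):
            for path in paths:
                sftp.put(path)

def parallel_upload(files, workers=8):
    # split the file list into one chunk per connection/thread
    chunks = [files[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(upload_batch, chunks))

parallel_upload(FILES)

Each request still pays the round trip, but with several connections the waits overlap, so throughput scales roughly with the number of workers until the link or the server becomes the bottleneck.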
If you want to do it the hard way by buffering the requests, you can base your solution on the following naive example.
The example:
Queues 100 file open requests;
As it reads the responses to the open requests, it queues write requests;
As it reads the responses to the write requests, it queues close requests
If I do a plain SFTPClient.put for 100 files, it takes about 10-12 seconds. Using the code below, I achieve the same result about 50-100 times faster.
But! The code is really naive:
It expects that the server responds to the requests in the same order. Indeed, the majority of SFTP servers (including the de facto standard OpenSSH) respond in the same order. But according to the SFTP specification, an SFTP server is free to respond in any order.
The code expects that all file reads happen in one go – upload.localhandle.read(32*1024). That's true for small files only.
The code expects that the SFTP server can handle 100 parallel requests and 100 opened files. That's not a problem for most servers, as they process the requests in order. And 100 opened files should not be a problem for a regular server.
You cannot do that for an unlimited number of files, though. You have to queue the files somehow to keep the number of outstanding requests in check. Actually, even these 100 requests may be too many.
The code uses non-public methods of SFTPClient class.
I do not do Python. There are definitely ways to code this more elegantly.
import paramiko
import paramiko.sftp
from paramiko.py3compat import long

ssh = paramiko.SSHClient()
ssh.connect(...)

sftp = ssh.open_sftp()

class Upload:
    def __init__(self):
        pass

uploads = []

for i in range(0, 100):
    print(f"sending open request {i}")
    upload = Upload()
    upload.i = i
    upload.localhandle = open(f"{i}.dat", "rb")  # binary mode, the data is sent as-is
    upload.remotepath = f"/remote/path/{i}.dat"
    imode = \
        paramiko.sftp.SFTP_FLAG_CREATE | paramiko.sftp.SFTP_FLAG_TRUNC | \
        paramiko.sftp.SFTP_FLAG_WRITE
    attrblock = paramiko.SFTPAttributes()
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_OPEN, upload.remotepath,
                            imode, attrblock)
    uploads.append(upload)

for upload in uploads:
    print(f"reading open response {upload.i}")
    t, msg = sftp._read_response(upload.request)
    if t != paramiko.sftp.CMD_HANDLE:
        raise paramiko.SFTPError("Expected handle")
    upload.handle = msg.get_binary()

    print(f"sending write request {upload.i} to handle {upload.handle}")
    data = upload.localhandle.read(32*1024)
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_WRITE,
                            upload.handle, long(0), data)

for upload in uploads:
    print(f"reading write response {upload.i} {upload.request}")
    t, msg = sftp._read_response(upload.request)
    if t != paramiko.sftp.CMD_STATUS:
        raise paramiko.SFTPError("Expected status")

    print(f"closing {upload.i} {upload.handle}")
    upload.request = \
        sftp._async_request(type(None), paramiko.sftp.CMD_CLOSE, upload.handle)

for upload in uploads:
    print(f"reading close response {upload.i} {upload.request}")
    sftp._read_response(upload.request)
Related
I have a remote FTP server where I want to upload new firmware images. When using the Linux ftp client, I can do this using put <somefile>; the server then responds with status messages like:
ftp> put TS252P005.bin flash
local: TS252P005.bin remote: flash
200 PORT command successful
150 Connecting to port 40929
226-File Transfer Complete. Starting Operation:
Checking file integrity...
Updating firmware...
Result: Success
Rebooting...
421 Service not available, remote server has closed connection
38563840 bytes sent in 6.71 secs (5.4779 MB/s)
ftp>
Now I can upload the file using Python as well using ftplib:
fw_path = ....
ip = ....
user = ...
pw = ....
with open(fw_path, "rb") as f:
    with ftplib.FTP(ip) as ftp:
        ftp.login(user=user, passwd=pw)
        ftp.storbinary("stor flash", f)
But I can't see a way for me to get the status messages that I can see using the ftp utility. This is important for me because I need to check that the update actually succeeded.
How can I get this output in my Python program? I'm also willing to use a different library if ftplib can't do it.
Any help is appreciated!
If you want to check the response programmatically, check the result of FTP.storbinary:
print(ftp.storbinary("STOR flash", f))
Though as your server actually closes the connection before even sending a complete response, the FTP.storbinary throws an exception.
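As a side note of mine (not from the original answer), you can at least catch that exception and inspect whatever text it carries; the exact exception class depends on what the server manages to send, so the broad ftplib.all_errors tuple is used here:

import ftplib

try:
    print(ftp.storbinary("STOR flash", f))
except ftplib.all_errors as e:
    # e.g. error_temp('421 Service not available, ...') when the device reboots mid-reply
    print("upload ended with:", e)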
If you want to read the partial response, you will have to re-implement what the FTP.storbinary does. Like
ftp.voidcmd('TYPE I')
with ftp.transfercmd("STOR flash") as conn:
    while 1:
        buf = f.read(8192)
        if not buf:
            break
        conn.sendall(buf)
line = ftp.getline()
print(line)
if line[3:4] == '-':
    code = line[:3]
    while 1:
        nextline = ftp.getline()
        print(nextline)
        if nextline[:3] == code and nextline[3:4] != '-':
            break
If you want to check the response manually, enable logging using FTP.set_debuglevel.
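For example, calling it before the transfer (same ftp object as above) prints the whole command/response exchange to stdout:

ftp.set_debuglevel(1)
ftp.storbinary("STOR flash", f)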
I have written a script to fetch scan results from Qualys to be run each week for the purpose of metrics gathering.
The first part of this script involves fetching a list of references for each of the scans that were run in the past week for further processing.
The problem is that, while this will work perfectly sometimes, other times the script will hang on the c.perform() line. This is manageable when running the script manually as it can just be re-run until it works. However, I am looking to run this as a scheduled task each week without any manual interaction.
Is there a foolproof way that I can detect if a hang has occurred and resend the PyCurl request until it works?
I have tried setting the c.TIMEOUT and c.CONNECTTIMEOUT options but these don't seem to be effective. Also, as no exception is thrown, simply putting it in a try-except block also won't fly.
The function in question is below:
import datetime as DT
import certifi
import pycurl
from io import BytesIO

# Retrieve a list of all scans conducted in the past week
# Save this to refs_raw.txt
def getScanRefs(usr, pwd):
    print("getting scan references...")
    with open('refs_raw.txt', 'wb') as refsraw:
        today = DT.date.today()
        week_ago = today - DT.timedelta(days=7)
        strtoday = str(today)
        strweek_ago = str(week_ago)
        c = pycurl.Curl()
        c.setopt(c.URL, 'https://qualysapi.qualys.eu/api/2.0/fo/scan/?action=list&launched_after_datetime=' + strweek_ago + '&launched_before_datetime=' + strtoday)
        c.setopt(c.HTTPHEADER, ['X-Requested-With: pycurl', 'Content-Type: text/xml'])
        c.setopt(c.USERPWD, usr + ':' + pwd)
        c.setopt(c.POST, 1)
        c.setopt(c.PROXY, 'companyproxy.net:8080')
        c.setopt(c.CAINFO, certifi.where())
        c.setopt(c.SSL_VERIFYPEER, 0)
        c.setopt(c.SSL_VERIFYHOST, 0)
        c.setopt(c.CONNECTTIMEOUT, 3)
        c.setopt(c.TIMEOUT, 3)
        refsbuffer = BytesIO()
        c.setopt(c.WRITEDATA, refsbuffer)
        c.perform()
        body = refsbuffer.getvalue()
        refsraw.write(body)
        c.close()
        print("Got em!")
I fixed the issue myself by using multiprocessing to launch the API call in a separate process, killing and restarting it if it runs for longer than 5 seconds. It's not very pretty, but it is cross-platform. For those looking for a solution that is more elegant but only works on *nix, look into the signal library, specifically SIGALRM (a sketch follows after the code below).
Code below:
# As this request for scan references sometimes hangs, it will be run in a separate process here
# This will be terminated and relaunched if no response is received within 5 seconds
def performRequest(usr, pwd):
    today = DT.date.today()
    week_ago = today - DT.timedelta(days=7)
    strtoday = str(today)
    strweek_ago = str(week_ago)
    c = pycurl.Curl()
    c.setopt(c.URL, 'https://qualysapi.qualys.eu/api/2.0/fo/scan/?action=list&launched_after_datetime=' + strweek_ago + '&launched_before_datetime=' + strtoday)
    c.setopt(c.HTTPHEADER, ['X-Requested-With: pycurl', 'Content-Type: text/xml'])
    c.setopt(c.USERPWD, usr + ':' + pwd)
    c.setopt(c.POST, 1)
    c.setopt(c.PROXY, 'companyproxy.net:8080')
    c.setopt(c.CAINFO, certifi.where())
    c.setopt(c.SSL_VERIFYPEER, 0)
    c.setopt(c.SSL_VERIFYHOST, 0)
    refsBuffer = BytesIO()
    c.setopt(c.WRITEDATA, refsBuffer)
    c.perform()
    c.close()
    body = refsBuffer.getvalue()
    refsraw = open('refs_raw.txt', 'wb')
    refsraw.write(body)
    refsraw.close()

# Retrieve a list of all scans conducted in the past week
# Save this to refs_raw.txt
def getScanRefs(usr, pwd):
    print("Getting scan references...")
    # Occasionally the request will hang indefinitely. Launch it in a separate process and retry if there is no response within 5 seconds
    success = False
    while not success:
        sendRequest = multiprocessing.Process(target=performRequest, args=(usr, pwd))
        sendRequest.start()
        for seconds in range(5):
            print("...")
            time.sleep(1)
        if sendRequest.is_alive():
            print("Maximum allocated time reached... Resending request")
            sendRequest.terminate()
            del sendRequest
        else:
            success = True
    print("Got em!")
The question is old, but I will add this answer; it might help someone.
The only way to terminate a running curl transfer after calling perform() is by using callbacks:
1- Using CURLOPT_WRITEFUNCTION:
As stated in the docs:
Your callback should return the number of bytes actually taken care of. If that amount differs from the amount passed to your callback function, it'll signal an error condition to the library. This will cause the transfer to get aborted and the libcurl function used will return CURLE_WRITE_ERROR.
The drawback with this method is that curl calls the write function only when it receives new data from the server, so if the server stops sending data, curl just keeps waiting on the server side and never sees your kill signal.
2- The alternative, and the best option so far, is using the progress callback:
The beauty of the progress callback is that curl calls it roughly once per second even if no data is coming from the server, which gives you the opportunity to return a non-zero value as a kill switch.
Use the option CURLOPT_XFERINFOFUNCTION;
note that it is preferred over CURLOPT_PROGRESSFUNCTION, as quoted in the docs:
We encourage users to use the newer CURLOPT_XFERINFOFUNCTION instead, if you can.
You also need to set the option CURLOPT_NOPROGRESS:
CURLOPT_NOPROGRESS must be set to 0 to make this function actually get called.
This is an example showing both write and progress function implementations in Python:
# example of using write and progress functions to terminate curl
import pycurl

f = open('mynewfile', 'wb')  # used to save downloaded data
counter = 0

# define callback functions which will be used by curl
def my_write_func(data):
    """write to file"""
    global counter
    f.write(data)
    counter += len(data)

    # an example to terminate curl: tell curl to abort if the downloaded data exceeds 1024 bytes by returning -1
    # or any number not equal to len(data)
    if counter >= 1024:
        return -1

def progress(*data):
    """it receives progress from curl and can be used as a kill switch
    Returning a non-zero value from this callback will cause curl to abort the transfer
    """
    d_size, downloaded, u_size, uploaded = data

    # an example to terminate curl: tell curl to abort if the downloaded data exceeds 1024 bytes by returning a non-zero value
    if downloaded >= 1024:
        return -1

# initialize curl object and options
c = pycurl.Curl()

# callback options
c.setopt(pycurl.WRITEFUNCTION, my_write_func)

c.setopt(pycurl.NOPROGRESS, 0)  # required to make the progress function actually get called
c.setopt(pycurl.XFERINFOFUNCTION, progress)
# c.setopt(pycurl.PROGRESSFUNCTION, progress)  # you can use this option, but pycurl.XFERINFOFUNCTION is recommended

# put other curl options as required, e.g. pycurl.URL

# executing curl
c.perform()
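One follow-up of my own: when a callback aborts the transfer, perform() raises pycurl.error, so in practice you would wrap that last call to tell an intentional abort apart from a real failure (the constant names assume a reasonably recent pycurl; the corresponding numeric codes are 23 and 42):

try:
    c.perform()
except pycurl.error as e:
    code = e.args[0]
    if code in (pycurl.E_WRITE_ERROR, pycurl.E_ABORTED_BY_CALLBACK):
        print("transfer aborted by a callback")
    else:
        raise
finally:
    c.close()
    f.close()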
Here are some bits of code I use to download through FTP. I was trying to stop the download and then continue or redownload it afterwards. I've tried ftp.abort(), but it only hangs and returns a timeout.
ftplib.error_proto: 421 Data timeout. Reconnect. Sorry.
SCENARIO:
The scenario is that the user chooses a file to download and, while it is downloading, can stop the current download and download a new file. The check 'if os.path.getsize(self.file_path) > 117625:' is just my example of the user stopping the download; it's not the full size of the file.
thanks.
import os  # needed for os.path.getsize
from ftplib import FTP

class ftpness:
    def __init__(self):
        self.connect(myhost, myusername, mypassword)

    def handleDownload(self, block):
        self.f.write(block)
        if os.path.getsize(self.file_path) > 117625:
            self.ftp.abort()

    def connect(self, host, username, password):
        self.ftp = FTP(host)
        self.ftp.login(username, password)
        self.get(self.file_path)

    def get(self, filename):
        self.f = open(filename, 'wb')
        self.ftp.retrbinary('RETR ' + filename, self.handleDownload)
        self.f.close()
        self.ftp.close()

a = ftpness()
Error 421 is the standard timeout error, so you need to keep the connection alive until the file has been downloaded.
def handleDownload(self, block):
    self.f.write(block)
    if os.path.getsize(self.file_path) > 117625:
        self.ftp.abort()
    else:
        self.ftp.sendcmd('NOOP')
        # try to add this line just to keep the connection alive.
hope this will help you. :)
Here's a way to do it with a watchdog timer. This involves creating a separate thread, which depending on the design of your application may not be acceptable.
To kill a download with a user event, it's the same idea. If the GUI works in a separate thread, then that thread can just reach inside the FTP instance and close its socket directly.
from threading import Timer

class ftpness:
    ...

    def connect(self, host, username, password):
        self.ftp = FTP(host)
        self.ftp.login(username, password)
        watchdog = Timer(self.timeout, self.ftp.sock.close)
        watchdog.start()
        self.get(self.file_path)
        watchdog.cancel()  # if the file transfer succeeds, cancel the timer
This way, if the file transfer runs longer than your preset timeout, the timer thread will close the socket underneath the transfer, forcing the get call to raise an exception. Only when the transfer succeeds is the watchdog timer cancelled.
And though this has nothing to do with your question, normally a connect call should not transfer payload data.
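A small restructuring along those lines (my own sketch, not the answerer's code; it reuses the placeholder names from the question) keeps connect() limited to logging in and lets the caller decide when to transfer:

class ftpness:
    def connect(self, host, username, password):
        # control-channel login only; no payload transfer here
        self.ftp = FTP(host)
        self.ftp.login(username, password)

    def get(self, filename):
        with open(filename, 'wb') as f:
            self.ftp.retrbinary('RETR ' + filename, f.write)

a = ftpness()
a.connect(myhost, myusername, mypassword)
a.get('firmware.bin')  # whichever file the user picked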
This happens because your session has been idle for too long. You can re-instantiate the ftplib connection after it times out, or modify the FTP server software configuration.
For example, you use vsftpd, you can add the following configuration to vsftpd.conf:
idle_session_timeout=60000 # The default is 600 seconds
I have the nginx upload module handling site uploads, but still need to transfer files (let's say 3-20mb each) to our cdn, and would rather not delegate that to a background job.
What is the best way to do this with Tornado without blocking other requests? Can I do this in an async callback?
You may find it useful in the overall architecture of your site to add a message queuing service such as RabbitMQ.
This would let you complete the upload via the nginx module, then, in the Tornado handler, post a message containing the uploaded file path and exit. A separate process would watch for these messages and handle the transfer to your CDN. This type of service would be useful for many other tasks that could be handled offline (sending emails, etc.). As your system grows, it also gives you a mechanism to scale by moving queue processing to separate machines.
I am using an architecture very similar to this. Just make sure to add your message consumer process to supervisord or whatever you are using to manage your processes.
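For illustration, the producer side of that design can be very small. This is my own sketch using pika's BlockingConnection API (simpler than the older asyncore-based sample further below); the queue name 'cdn_uploads' is made up:

import pika

def enqueue_cdn_upload(file_path):
    # publish the uploaded file's path; a separate consumer process does the CDN transfer
    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    ch = conn.channel()
    ch.queue_declare(queue='cdn_uploads', durable=True)
    ch.basic_publish(exchange='', routing_key='cdn_uploads', body=file_path)
    conn.close()

The Tornado handler returns as soon as the message is queued, so the request is never held up by the CDN transfer itself.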
In terms of implementation, if you are on Ubuntu installing RabbitMQ is a simple:
sudo apt-get install rabbitmq-server
On CentOS w/EPEL repositories:
yum install rabbitmq-server
There are a number of Python bindings to RabbitMQ. Pika is one of them and it happens to be created by an employee of LShift, who is responsible for RabbitMQ.
Below is a bit of sample code from the Pika repo. You can easily imagine how the handle_delivery method would accept a message containing a filepath and push it to your CDN.
import sys
import pika
import asyncore

conn = pika.AsyncoreConnection(pika.ConnectionParameters(
        sys.argv[1] if len(sys.argv) > 1 else '127.0.0.1',
        credentials = pika.PlainCredentials('guest', 'guest')))

print 'Connected to %r' % (conn.server_properties,)

ch = conn.channel()
ch.queue_declare(queue="test", durable=True, exclusive=False, auto_delete=False)

should_quit = False

def handle_delivery(ch, method, header, body):
    print "method=%r" % (method,)
    print "header=%r" % (header,)
    print "  body=%r" % (body,)
    ch.basic_ack(delivery_tag = method.delivery_tag)

    global should_quit
    should_quit = True

tag = ch.basic_consume(handle_delivery, queue = 'test')
while conn.is_alive() and not should_quit:
    asyncore.loop(count = 1)
if conn.is_alive():
    ch.basic_cancel(tag)

conn.close()
print conn.connection_close
Advice on the Tornado Google group points to using an async callback (documented at http://www.tornadoweb.org/documentation#non-blocking-asynchronous-requests) to move the file to the CDN.
The nginx upload module writes the file to disk and then passes parameters describing the upload(s) back to the view. Therefore the file isn't in memory, and the time it takes to read it from disk, which would block the request process itself but not other Tornado processes (AFAIK), is negligible.
That said, anything that doesn't need to be processed online shouldn't be, and should be deferred to a task queue like celeryd or similar (a sketch follows).
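For the task-queue route, a minimal sketch with the modern Celery API (celeryd is its old worker daemon; the task name, broker URL, and push_to_cdn helper are placeholders of mine, not from the answer):

# tasks.py
from celery import Celery

app = Celery('tasks', broker='amqp://guest:guest@localhost//')

@app.task
def upload_to_cdn(local_path):
    # open local_path here and push it to the CDN, outside the request cycle
    push_to_cdn(local_path)  # hypothetical helper

# in the Tornado handler, enqueue and return immediately:
#   upload_to_cdn.delay('/var/uploads/firmware.bin')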
I'm writing code that will run on Linux, OS X, and Windows. It downloads a list of approximately 55,000 files from the server, then steps through the list of files, checking if the files are present locally. (With SHA hash verification and a few other goodies.) If the files aren't present locally or the hash doesn't match, it downloads them.
The server-side is plain-vanilla Apache 2 on Ubuntu over port 80.
The client side works perfectly on Mac and Linux, but gives me this error on Windows (XP and Vista) after downloading a number of files:
urllib2.URLError: <urlopen error <10048, 'Address already in use'>>
This link: http://bytes.com/topic/python/answers/530949-client-side-tcp-socket-receiving-address-already-use-upon-connect points me to TCP port exhaustion, but "netstat -n" never showed me more than six connections in "TIME_WAIT" status, even just before it errored out.
The code (called once for each of the 55,000 files it downloads) is this:
request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()
datastream = opener.open(request)
outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj = outfileobj.close()
    datastream.close()
UPDATE: I find by grepping the log that it enters the download routine exactly 3998 times. I've run this multiple times and it fails at 3998 each time. Given that the linked article states that the available ports are 5000-1025=3975 (and some are probably expiring and being reused), it's starting to look a lot more like the linked article describes the real issue. However, I'm still not sure how to fix this. Making registry edits is not an option.
If it is really a resource problem (freeing OS socket resources), try this:
request = urllib2.Request(file_remote_path)
opener = urllib2.build_opener()

retry = 3  # 3 tries
while retry:
    try:
        datastream = opener.open(request)
    except urllib2.URLError, ue:
        if ue.reason.find('10048') > -1:
            if retry:
                retry -= 1
            else:
                raise urllib2.URLError("Address already in use / retries exhausted")
    else:
        retry = 0
    if datastream:
        retry = 0

outfileobj = open(temp_file_path, 'wb')
try:
    while True:
        chunk = datastream.read(CHUNK_SIZE)
        if chunk == '':
            break
        else:
            outfileobj.write(chunk)
finally:
    outfileobj = outfileobj.close()
    datastream.close()
If you want, you can insert a sleep, or make it OS-dependent.
On my Win XP the problem doesn't show up (I reached 5000 downloads).
I watch my processes and network with Process Hacker.
Thinking outside the box, the problem you seem to be trying to solve has already been solved by a program called rsync. You might look for a Windows implementation and see if it meets your needs.
You should seriously consider copying and modifying this pyCurl example for efficient downloading of a large collection of files.
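The core of that pycurl example is a CurlMulti loop that drives many transfers over a small set of reused connections. Roughly, as a condensed sketch of my own (with jobs as a hypothetical list of (url, local_path) pairs):

import pycurl

def fetch_all(jobs):
    multi = pycurl.CurlMulti()
    handles = []
    for url, local_path in jobs:
        c = pycurl.Curl()
        c.out = open(local_path, 'wb')  # keep a reference so it can be closed later
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.out)
        multi.add_handle(c)
        handles.append(c)

    num_active = len(handles)
    while num_active:
        while True:
            ret, num_active = multi.perform()  # advance all transfers
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        multi.select(1.0)  # wait until some socket is ready

    for c in handles:
        c.out.close()
        multi.remove_handle(c)
        c.close()

The stock example also caps the number of simultaneous handles; with 55,000 files you would want to feed the multi object in batches rather than adding every handle up front.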
Instead of opening a new TCP connection for each request, you should really use persistent HTTP connections; have a look at urlgrabber (or alternatively, just at keepalive.py for how to add keep-alive connection support to urllib2). A sketch follows.
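As a rough sketch of the keepalive route (I'm going from keepalive.py's own docstring here; depending on how you installed it, the import may be keepalive or urlgrabber.keepalive, and remote_to_local_pairs is a hypothetical iterable standing in for your 55,000-entry list):

import urllib2
from keepalive import HTTPHandler  # or: from urlgrabber.keepalive import HTTPHandler

keepalive_handler = HTTPHandler()
opener = urllib2.build_opener(keepalive_handler)
urllib2.install_opener(opener)

# subsequent urlopen calls to the same host reuse one TCP connection
for file_remote_path, temp_file_path in remote_to_local_pairs:
    datastream = urllib2.urlopen(file_remote_path)
    with open(temp_file_path, 'wb') as outfileobj:
        outfileobj.write(datastream.read())
    datastream.close()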
All indications point to a lack of available sockets. Are you sure that only 6 are in TIME_WAIT status? If you're running so many download operations, it's very likely that netstat overruns your terminal buffer. I find that netstat overruns my terminal during normal usage periods.
The solution is to either modify the code to reuse sockets or to introduce a timeout. It also wouldn't hurt to keep track of how many open sockets you have, to optimize the waiting. The default timeout on Windows XP is 120 seconds, so you want to sleep for at least that long if you run out of sockets. Unfortunately, it doesn't look like there's an easy way to check from Python when a socket has closed and left the TIME_WAIT status.
Given the asynchronous nature of the requests and timeouts, the best way to do this might be in a thread. Make each thread sleep for 2 minutes before it finishes. You can either use a Semaphore or limit the number of active threads to ensure that you don't run out of sockets.
Here's how I'd handle it. You might want to add an exception clause to the inner try block of the fetch section, to warn you about failed fetches.
import time
import threading
import Queue
import urllib2

# assumes url_queue is a Queue object populated with tuples in the form of (url_to_fetch, temp_file)
# also assumes that TotalUrls is the size of the queue before any threads are started.

class urlfetcher(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        try:  # needed to handle the Empty exception raised by an empty queue.
            file_remote_path, temp_file_path = self.queue.get()
            request = urllib2.Request(file_remote_path)
            opener = urllib2.build_opener()
            datastream = opener.open(request)
            outfileobj = open(temp_file_path, 'wb')
            try:
                while True:
                    chunk = datastream.read(CHUNK_SIZE)
                    if chunk == '':
                        break
                    else:
                        outfileobj.write(chunk)
            finally:
                outfileobj = outfileobj.close()
                datastream.close()
            time.sleep(120)
            self.queue.task_done()
        except Queue.Empty:
            pass

# elsewhere:
while url_queue.qsize() < TotalUrls:  # hard limit of available ports.
    if threading.active_count() < 3975:  # hard limit of available ports
        t = urlfetcher(url_queue)
        t.start()
    else:
        time.sleep(2)

url_queue.join()
Sorry, my python is a little rusty, so I wouldn't be surprised if I missed something.