ftplib - Python: script hangs on downloading big files

Backstory: I'm trying to pull some data from an FTP login I was given. This data gets updated roughly daily, and I believe they wipe the FTP at the end of each week or month. I was thinking about inputting a date and having the script run daily to see if there are any files that match the date, but if the server's time isn't accurate it could cause data loss. For now I just want to download ALL the files, and then I'll work on fine-tuning it.
I haven't worked much with coding FTP before, but it seems simple enough. However, the problem I'm having is that small files get downloaded without a problem and their file sizes check out and match. When it tries to download a big file that would normally take a few minutes, it gets to a certain point (almost completing the file) and then just stops and the script hangs.
For example:
It tries to download a file that is 373485927 bytes in size. The script runs and downloads that file up until 373485568 bytes. It ALWAYS stops at this amount, even after trying different methods and changing some code.
I don't understand why it always stops at this byte count, and why it works fine with smaller files (1000 bytes and under).
import os
import sys
import base64
import ftplib

def get_files(ftp, filelist):
    for f in filelist:
        try:
            print "Downloading file " + f + "\n"
            local_file = os.path.join('.', f)
            file = open(local_file, "wb")
            ftp.retrbinary('RETR ' + f, file.write)
        except ftplib.all_errors, e:
            print str(e)
    file.close()
    ftp.quit()

def list_files(ftp):
    print "Getting directory listing...\n"
    ftp.dir()
    filelist = ftp.nlst()
    #determine new files to DL, pass to get_files()
    #for now we will download all each execute
    get_files(ftp, filelist)

def get_conn(host, user, passwd):
    ftp = ftplib.FTP()
    try:
        print "\nConnecting to " + host + "...\n"
        ftp.connect(host, 21)
    except ftplib.all_errors, e:
        print str(e)
    try:
        print "Logging in...\n"
        ftp.login(user, base64.b64decode(passwd))
    except ftplib.all_errors, e:
        print str(e)
    ftp.set_pasv(True)
    list_files(ftp)

def main():
    host = "host.domain.com"
    user = "admin"
    passwd = "base64passwd"
    get_conn(host, user, passwd)

if __name__ == '__main__':
    main()
The output looks like this, with dddd.tar.gz being the big file that never finishes:
Downloading file aaaa.del.gz
Downloading file bbbb.del.gz
Downloading file cccc.del.gz
Downloading file dddd.tar.gz

This could be caused by a timeout issue; perhaps try, in:
def get_conn(host,user,passwd):
    ftp = ftplib.FTP()
adding a larger timeout until you have more of an idea of what's going on, like:
def get_conn(host,user,passwd):
    ftp = ftplib.FTP(timeout=100)
I'm not sure whether ftplib applies a timeout by default; it would be worth checking that, and also checking whether you are being timed out by the server. Hope this helps.
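For what it's worth, here is a minimal sketch of that suggestion applied to the question's get_conn() and get_files(); the 100-second value is only a guess, and the with block around the local file is an extra safeguard (not in the original code) so the last partial block gets flushed to disk:
import base64
import ftplib

def get_conn(host, user, passwd, timeout=100):
    # the timeout (in seconds) is applied to the sockets ftplib creates
    ftp = ftplib.FTP(timeout=timeout)
    ftp.connect(host, 21)
    ftp.login(user, base64.b64decode(passwd))
    ftp.set_pasv(True)
    return ftp

def get_files(ftp, filelist):
    for f in filelist:
        try:
            print "Downloading file " + f
            # the with block flushes and closes the local file even if
            # the transfer raises an error partway through
            with open(f, 'wb') as local_file:
                ftp.retrbinary('RETR ' + f, local_file.write)
        except ftplib.all_errors as e:
            print str(e)
    ftp.quit()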

If you are running your script in a Windows cmd console, try disabling the "QuickEdit Mode" option of cmd.
I encountered a problem where my FTP script hung when running on Windows but worked normally on Linux, and that turned out to be the fix for me.

Related

Urllib urlopen/urlretrieve too many open files error

Problem
I'm trying to download more than 100,000 files from an FTP server in parallel (using threads). I previously tried it with urlretrieve as answered here, but this gave me the following error: URLError(OSError(24, 'Too many open files')). Apparently this problem is a bug (I cannot find the reference anymore), so I tried to use urlopen in combination with shutil and then write to a file object which I could close myself, as described here. This seemed to work fine, but then I got the same error again: URLError(OSError(24, 'Too many open files')). I thought that whenever writing to a file is incomplete or fails, the with statement would cause the file to close itself, but apparently the files stay open and eventually cause the script to halt.
Question
How can I prevent this error, i.e. make sure that every file gets closed?
Code
import csv
import urllib.request
import shutil
from multiprocessing.dummy import Pool

def url_to_filename(url):
    filename = 'patric_genomes/' + url.split('/')[-1]
    return filename

def download(url):
    url = url.strip()
    try:
        with urllib.request.urlopen(url) as response, open(url_to_filename(url), 'wb') as out_file:
            shutil.copyfileobj(response, out_file)
    except Exception as e:
        return None, e

def build_urls(id_list):
    base_url = 'ftp://some_ftp_server/'
    urls = []
    for some_id in id_list:
        url = base_url + some_id + '/' + some_id + '.fna'
        print(url)
        urls.append(url)
    return urls

if __name__ == "__main__":
    with open('full_data/genome_ids.txt') as inFile:
        reader = csv.DictReader(inFile, delimiter='\t')
        ids = {row['some_id'] for row in reader}
    urls = build_urls(ids)
    p = Pool(100)
    print(p.map(download, urls))
You may try using contextlib to make sure the connection is closed, like so:
import contextlib
[ ... ]
with contextlib.closing(urllib.request.urlopen(url)) as response, open(url_to_filename(url), 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
[ ... ]
According to the docs:
contextlib.closing(thing)
Return a context manager that closes thing upon completion of the block. [ ... ] without needing to explicitly close page. Even if an error occurs, page.close() will be called when the with block is exited.
A workaround would be raising the open-files limit on your Linux OS. Check your current open-files limit with:
ulimit -Hn
Add the following line to your /etc/sysctl.conf file:
fs.file-max = <number>
where <number> is the new upper limit of open files you want to set.
Save and close the file, then run:
sysctl -p
so that the changes take effect.
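If you'd rather check or raise the per-process limit from inside Python (Unix only), the standard library's resource module can do it; a small sketch:
import resource

# current per-process limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit:', soft, 'hard limit:', hard)

# raise the soft limit up to the hard limit
# (going beyond the hard limit requires elevated privileges)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))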
I believe the file handles you create are not disposed of by the system in time, since it takes a while to close a connection. So you end up using all the free file handles (and that includes network sockets) very quickly.
What you are doing is setting up an FTP connection for each of your files. This is bad practice. A better way is to open 5-15 connections and reuse them, downloading the files through the existing sockets, without the overhead of the initial FTP handshake for each file. See this post for reference.
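A rough sketch of that reuse pattern, using a handful of persistent ftplib connections fed from a shared queue; the host, credentials, local folder and the pool size of 10 are placeholders, and the workers take server-relative paths (e.g. some_id/some_id.fna) rather than full URLs:
import ftplib
import queue
import threading

HOST, USER, PASSWD = 'some_ftp_server', 'user', 'pass'   # placeholders
NUM_WORKERS = 10                                         # modest concurrency

def worker(path_queue):
    # one FTP connection per worker, reused for every file it downloads
    ftp = ftplib.FTP(HOST, USER, PASSWD, timeout=60)
    while True:
        try:
            remote_name = path_queue.get_nowait()
        except queue.Empty:
            break
        local_name = 'patric_genomes/' + remote_name.split('/')[-1]
        try:
            with open(local_name, 'wb') as out_file:
                ftp.retrbinary('RETR ' + remote_name, out_file.write)
        except ftplib.all_errors as e:
            print(remote_name, e)
        finally:
            path_queue.task_done()
    ftp.quit()

def download_all(remote_paths):
    path_queue = queue.Queue()
    for path in remote_paths:
        path_queue.put(path)
    threads = [threading.Thread(target=worker, args=(path_queue,))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()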
P.S. Also, as @Tarun_Lalwani mentioned, it is not a good idea to have a folder with more than ~1000 files in it, as it slows down the file system.
How can I prevent this error, i.e. make sure that every file gets closed?
To prevent the error you need to either increase the open-file limit or, more reasonably, decrease the concurrency of your thread pool. Connection and file closing are handled properly by the context managers.
Your thread pool has 100 threads and opens at least 200 handles (one for the FTP connection and another for the file). A reasonable concurrency would be about 10-30 threads.
Here's a simplified reproduction which shows that the code itself is okay. Put some content in somefile in the current directory.
test.py
#!/usr/bin/env python3
import sys
import shutil
import logging
from pathlib import Path
from urllib.request import urlopen
from multiprocessing.dummy import Pool as ThreadPool

def download(id):
    ftp_url = sys.argv[1]
    filename = Path(__name__).parent / 'files'
    try:
        with urlopen(ftp_url) as src, (filename / id).open('wb') as dst:
            shutil.copyfileobj(src, dst)
    except Exception as e:
        logging.exception('Download error')

if __name__ == '__main__':
    with ThreadPool(10) as p:
        p.map(download, (str(i).zfill(4) for i in range(1000)))
And then in the same directory:
$ docker run --name=ftp-test -d -e FTP_USER=user -e FTP_PASSWORD=pass \
    -v `pwd`/somefile:/srv/dir/somefile panubo/vsftpd vsftpd /etc/vsftpd.conf
$ IP=`docker inspect --format '{{ .NetworkSettings.IPAddress }}' ftp-test`
$ curl ftp://user:pass@$IP/dir/somefile
$ python3 test.py ftp://user:pass@$IP/dir/somefile
$ docker stop ftp-test && docker rm -v ftp-test

Exception for Python ftplib in my program?

I wrote this program to pull data from a text file in a website's directory (which is edited by the user on the site), but it seems to crash. A lot.
from sys import argv
import ftplib
import serial
from time import sleep

one = "0"
repeat = True
ser = serial.Serial("COM3", 9600)

while repeat == True:
    path = 'public_html/'
    filename = 'fileone.txt'
    ftp = ftplib.FTP("*omitted*")
    ftp.login("*omitted*", "*omitted*")
    ftp.cwd(path)
    ftp.retrbinary("RETR " + filename, open(filename, 'wb').write)
    ftp.quit()
    txt = open(filename)
    openup = txt.read()
    ser.write(openup)
    print(openup)
Does anyone know a way to stop it from crashing? I was thinking of using an exception, but I'm no Python expert. The program does what it's meant to do, by the way; the address and login have been omitted for obvious reasons. If possible, I'd also like an exception to stop the program from crashing when it disconnects from the serial port.
Thanks in advance!
Two things:
You might want to put all the ftplib-related code in a try-except block, like so:
try:
    # code related to ftplib
except Exception, e:  # you can fill this in after you encounter the exception once
    print str(e)
You seem to be opening the file but not closing it when you're done. This might also cause errors later. The best way to do this would be:
with open(filename, 'r') as txt:
    openup = txt.read()
This way the file will be closed automatically once you're outside the 'with' block.
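Putting the two suggestions together, the main loop could look something like this sketch; it keeps the question's COM3 port and placeholder credentials (Python 2 style, matching the question), and the 5-second polling delay plus the separate SerialException handler are assumptions:
import ftplib
import serial
from time import sleep

path = 'public_html/'
filename = 'fileone.txt'
ser = serial.Serial("COM3", 9600)

while True:
    try:
        ftp = ftplib.FTP("*omitted*")
        ftp.login("*omitted*", "*omitted*")
        ftp.cwd(path)
        # download into a file that is closed as soon as the block exits
        with open(filename, 'wb') as local_file:
            ftp.retrbinary("RETR " + filename, local_file.write)
        ftp.quit()
        with open(filename, 'r') as txt:
            openup = txt.read()
        ser.write(openup)
        print(openup)
    except serial.SerialException as e:
        # stop cleanly if the serial port disconnects
        print(e)
        break
    except ftplib.all_errors as e:
        # log FTP errors and try again on the next pass
        print(e)
    sleep(5)   # assumed polling interval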

Transfer files from one FTP location to another using Python

I am trying to transfer files between two different FTP locations. The simple goal is that I want to copy files of a specific type from FTP location A to FTP location B, but only files from the last few hours, using a Python script.
I am using ftplib to perform the task and have put together the code below.
So far the file transfer works fine for the single file defined in the from_Sock variable, but I am hitting a roadblock when I want to loop through all files created within the last 2 hours and copy them. The script I have written basically copies an individual file, but I want to move all files with a particular extension (for example *.jpg) that were created within the last 2 hours. I tried to use MDTM to find the file modification time, but I am not able to implement it the right way.
Any help on this is much appreciated. Below is the current code:
import ftplib

srcFTP = ftplib.FTP("test.com", "username", "pass")
srcFTP.cwd("/somefolder")
desFTP = ftplib.FTP("test2.com", "username", "pass")
desFTP.cwd("/")

from_Sock = srcFTP.transfercmd("RETR Test1.text")
to_Sock = desFTP.transfercmd("STOR test1.text")

state = 0
while 1:
    block = from_Sock.recv(1024)
    if len(block) == 0:
        break
    state += len(block)
    while len(block) > 0:
        sentlen = to_Sock.send(block)
        block = block[sentlen:]

print state, "Total Bytes Transferred"
from_Sock.close()
to_Sock.close()
srcFTP.quit()
desFTP.quit()
Thanks,
DD
Here is a short piece of code that takes a path and uploads every file with a .jpg extension via FTP. It's not exactly what you want, but I stumbled on your question and this might help you on your way.
import os
from ftplib import FTP

def ftpPush(filepathSource, filename, filepathDestination):
    # IP, username and password are placeholders for your own credentials
    ftp = FTP(IP, username, password)
    ftp.cwd(filepathDestination)
    # JPEGs are binary, so use storbinary and open the file in 'rb' mode
    ftp.storbinary("STOR " + filename, open(filepathSource + filename, 'rb'))
    ftp.quit()

path = '/some/path/'
for fileName in os.listdir(path):
    if fileName.endswith(".jpg"):
        ftpPush(filepathSource=path, filename=fileName, filepathDestination='/some/destination/')
The modification time of a file on an FTP server can be checked with MDTM, as in this example:
fileName = "nameOfFile.txt"
modifiedTime = ftp.sendcmd('MDTM ' + fileName)
# successful response: '213 20120222090254'
ftp.quit()
Now you just need to check when each file was modified, download it if it is within your desired threshold, and then upload it to the other server.
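Putting the pieces together, a sketch along these lines might work; it reuses the question's placeholder servers and folders, filters on *.jpg, and assumes the standard 14-digit UTC timestamp in the MDTM reply shown above:
import ftplib
from datetime import datetime, timedelta

srcFTP = ftplib.FTP("test.com", "username", "pass")
srcFTP.cwd("/somefolder")
desFTP = ftplib.FTP("test2.com", "username", "pass")
desFTP.cwd("/")

cutoff = datetime.utcnow() - timedelta(hours=2)

for name in srcFTP.nlst():
    if not name.lower().endswith('.jpg'):
        continue
    # MDTM replies look like '213 20120222090254' (UTC timestamp)
    resp = srcFTP.sendcmd('MDTM ' + name)
    modified = datetime.strptime(resp[4:18], '%Y%m%d%H%M%S')
    if modified < cutoff:
        continue
    # stream the file from source to destination through the two data sockets
    from_sock = srcFTP.transfercmd('RETR ' + name)
    to_sock = desFTP.transfercmd('STOR ' + name)
    while True:
        block = from_sock.recv(8192)
        if not block:
            break
        to_sock.sendall(block)
    from_sock.close()
    to_sock.close()
    # read the end-of-transfer replies so the control connections stay in sync
    srcFTP.voidresp()
    desFTP.voidresp()

srcFTP.quit()
desFTP.quit()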

How to Retrieve a Zip Folder from FTP in Python

I'm trying to retrieve one or more zip folders from an FTP site and save them to my local machine using Python (ideally I'd like to specify where they are saved on my C: drive).
The code below connects to the FTP site, and then something happens in the PyScripter window that looks like random characters for about 1000 lines... but nothing actually gets downloaded to my hard drive.
Any tips?
import ftplib
import sys

def gettext(ftp, filename, outfile=None):
    # fetch a text file
    if outfile is None:
        outfile = sys.stdout
    # use a lambda to add newlines to the lines read from the server
    ftp.retrlines("RETR " + filename, lambda s, w=outfile.write: w(s + "\n"))

def getbinary(ftp, filename, outfile=None):
    # fetch a binary file
    if outfile is None:
        outfile = sys.stdout
    ftp.retrbinary("RETR " + filename, outfile.write)

ftp = ftplib.FTP("FTP IP Address")
ftp.login("username", "password")
ftp.cwd("/MCPA")

#gettext(ftp, "subbdy.zip")
getbinary(ftp, "subbdy.zip")
Well, it seems that you simply forgot to open the file you want to write into.
Something like:
getbinary(ftp, "subbdy.zip", open(r'C:\Path\to\subbdy.zip', 'wb'))
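Spelled out with a context manager, so the local file is closed (and flushed) even if the transfer fails; the connection details are the placeholders from the question and the local path is just an example:
import ftplib

ftp = ftplib.FTP("FTP IP Address")
ftp.login("username", "password")
ftp.cwd("/MCPA")

# the with block closes the local file even if retrbinary raises an error
with open(r'C:\Path\to\subbdy.zip', 'wb') as outfile:
    ftp.retrbinary("RETR subbdy.zip", outfile.write)

ftp.quit()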

Downloading .pdf file from FTP using a Python Script

I am able to download files from the FTP server using ftplib in Python, but only by hard-coding the name of the file (R.pdf), which downloads only R.pdf. Is there a way to download all files on the FTP server with the extension .pdf to my local system using Python? I am able to do this in a shell by just using *.pdf.
In the following code, replace host, user and password with your credentials, and 'public_html/soleil' with the path of the directory containing the PDF files you want to download; it should then work, I think.
from ftplib import *
from os import listdir
from os.path import getsize

ftp_dt = FTP(host, user, password)
ftp_pi = FTP(host, user, password)
print '\n- Connection opened and logged in: OK'

ftp_dt.cwd('public_html/soleil')
ftp_pi.cwd('public_html/soleil')

def func(content, li = [0], la = [], si = [0], memname = ['']):
    if name != memname[0]:
        memname[0], li[0:1], la[:], si[0:1] = name, [0], [], [0]
    li[0] = li[0] + 1
    si[0] = si[0] + len(content)
    la.append(str(len(content)))
    if li[0] % 8 == 0:
        print ' '.join(la) +\
              ' total: ' + str(li[0]) + ' chunks, ' + str(si[0]) + ' bytes'
        la[:] = []
    f.write(content)

li_files = []
for name in ftp_dt.nlst():
    try:
        ftp_dt.size(name)
        if name not in ('.', '..') and name[-4:] == '.pdf':
            li_files.append(name)
    except:
        pass

if li_files:
    for name in li_files:
        print '\n- Downloading ' + name
        with open('E:\\PDF\\DOWNS\\' + name, 'wb') as f:
            ftp_pi.retrbinary('RETR ' + name, func)
        if getsize('E:\\PDF\\DOWNS\\' + name) == ftp_dt.size(name):
            print ' OK ! Download of ' + repr(name) + ' complete: SUCCEEDED'
        else:
            print ' FAILURE !! : ' + name + ' only partially downloaded'
else:
    print '\nThere is no PDF file in this FTP directory'

ftp_dt.quit()
ftp_pi.quit()
Two connections, ftp_dt and ftp_pi, are defined for "Data Transfer" and "Protocol Interpretation", because the FTP protocol is based on two channels: one for the commands and the other for... guess what?
The func() function is used as the callback in retrbinary().
It could be just
def func(content):
    f.write(content)
but I played a bit with the possibilities of a function's default arguments.
One thing I don't understand well: how can this code work when the name f used in func() is only defined later in the code, after the definition of func()? But I tested it and it works!
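The reason it works is that Python resolves the global name f each time func() is called, not when func() is defined; by the time retrbinary() fires the callback, the with open(...) as f line has already bound f. A tiny illustration of that late binding:
def callback(chunk):
    # 'parts' is looked up in the module's globals at call time,
    # so it only needs to exist by the time callback() actually runs
    parts.append(chunk)

parts = []            # bound after callback() is defined, but before it is called
callback('hello ')
callback('world')
print ''.join(parts)  # -> hello world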
I don't have access to an FTP server I can try this on, but a cursory look at the documentation indicates that wildcard retrieval is not possible.
You can, however, obtain a list of files on the remote end with the dir or nlst commands and then fetch each file in a loop.
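For instance, a minimal sketch along those lines; the host, credentials and remote directory are placeholders:
import os
import ftplib

ftp = ftplib.FTP('host', 'user', 'password')   # placeholders
ftp.cwd('public_html/soleil')                  # directory containing the PDFs

for name in ftp.nlst():
    if name.lower().endswith('.pdf'):
        # write each PDF into the current local directory
        with open(os.path.join('.', name), 'wb') as local_file:
            ftp.retrbinary('RETR ' + name, local_file.write)

ftp.quit()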
Use the two Python modules glob and wget; note that glob matches paths on the local filesystem, so this approach only applies if the FTP directory is mounted or mirrored locally. Your snippet could look like this:
import glob
import wget

list_to_download = glob.glob(url + '*.pdf')
for file in list_to_download:
    wget.download(file)
