Download big files via FTP with Python

I'm trying to download a daily backup file from my server to my local storage server, but I'm running into some problems.
I wrote this code (with the irrelevant parts, such as the email function, removed):
import os
from time import strftime
from ftplib import FTP
import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEBase import MIMEBase
from email.MIMEText import MIMEText
from email import Encoders
day = strftime("%d")
today = strftime("%d-%m-%Y")
link = FTP(ftphost)
link.login(passwd = ftp_pass, user = ftp_user)
link.cwd(file_path)
link.retrbinary('RETR ' + file_name, open('/var/backups/backup-%s.tgz' % today, 'wb').write)
link.delete(file_name) #delete the file from online server
link.close()
mail(user_mail, "Download database %s" % today, "Database sucessfully downloaded: %s" % file_name)
exit()
And I run this from a crontab entry like:
40 23 * * * python /usr/bin/backup-transfer.py >> /var/log/backup-transfer.log 2>&1
It works with small files, but with the backup files (about 1.7 GB) it freezes: the downloaded file reaches about 1.2 GB and then never grows (I waited about a day), and the log file is empty.
Any ideas?
P.S.: I'm using Python 2.6.5.

Sorry to answer my own question, but I found the solution.
I tried ftputil with no success, so I tried many approaches and finally this works:
def ftp_connect(path):
    link = FTP(host='example.com', timeout=5)  # Keep a low timeout
    link.login(passwd='ftppass', user='ftpuser')
    debug("%s - Connected to FTP" % strftime("%d-%m-%Y %H.%M"))
    link.cwd(path)
    return link

downloaded = open('/local/path/to/file.tgz', 'wb')

def debug(txt):
    print txt

link = ftp_connect(path)
file_size = link.size(filename)

max_attempts = 5  # I don't want infinite loops.

while file_size != downloaded.tell():
    try:
        debug("%s while > try, run retrbinary\n" % strftime("%d-%m-%Y %H.%M"))
        if downloaded.tell() != 0:
            link.retrbinary('RETR ' + filename, downloaded.write, rest=downloaded.tell())
        else:
            link.retrbinary('RETR ' + filename, downloaded.write)
    except Exception as myerror:
        if max_attempts != 0:
            debug("%s while > except, something went wrong: %s\n\tfile length is: %i > %i\n" %
                  (strftime("%d-%m-%Y %H.%M"), myerror, file_size, downloaded.tell())
                  )
            link = ftp_connect(path)
            max_attempts -= 1
        else:
            break

debug("Done with file, attempt to download md5sum")
[...]
In my log file I found:
01-12-2011 23.30 - Connected to FTP
01-12-2011 23.30 while > try, run retrbinary
02-12-2011 00.31 while > except, something went wrong: timed out
	file length is: 1754695793 > 1754695793
02-12-2011 00.31 - Connected to FTP
Done with file, attempt to download md5sum
Sadly, I have to reconnect to the FTP server even if the file has been fully downloaded; in my case that is not a problem, because I have to download the md5sum too.
As you can see, I wasn't able to detect the timeout and resume the transfer on the same connection, so when I get a timeout I simply reconnect. If someone knows how to reconnect without creating a new ftplib.FTP instance, let me know ;)

You might try setting the timeout. From the docs:
# timeout in seconds
link = FTP(host=ftp_host, user=ftp_user, passwd=ftp_pass, acct='', timeout=3600)

I implemented code with ftplib that can monitor the connection, reconnect, and re-download the file in case of failure. Details here: How to download big file in python via ftp (with monitoring & reconnect)?
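For reference, here is a minimal sketch of that resume-and-reconnect idea (the hostname, credentials and paths are placeholders, and it assumes the server supports the SIZE and REST commands):
import os
from ftplib import FTP, error_temp

def download_with_resume(host, user, passwd, remote_name, local_path, attempts=5):
    while attempts > 0:
        ftp = None
        try:
            ftp = FTP(host, timeout=60)
            ftp.login(user=user, passwd=passwd)
            ftp.voidcmd('TYPE I')  # binary mode, so SIZE reports the real byte count
            remote_size = ftp.size(remote_name)
            local_size = os.path.getsize(local_path) if os.path.exists(local_path) else 0
            if local_size >= remote_size:
                return True  # already complete
            with open(local_path, 'ab') as f:  # append mode keeps the partial data
                # rest= asks the server to resume the transfer at our current offset
                ftp.retrbinary('RETR ' + remote_name, f.write, rest=local_size)
            return True
        except (IOError, EOFError, error_temp):
            attempts -= 1  # network hiccup: reconnect and resume on the next pass
        finally:
            if ftp is not None:
                try:
                    ftp.close()
                except Exception:
                    pass
    return False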

Related

httplib2.IncompleteRead: AttributeError: 'module' object has no attribute 'IncompleteRead'

I've been using a script that is no longer maintained, which downloads your entire Google Drive to your local storage. I managed to fix a few issues to do with deprecation, and the script seemed to be running smoothly; however, at seemingly random times it breaks and I receive the following error.
File "drive.py", line 169, in download_file
except httplib2.IncompleteRead:
AttributeError: 'module' object has no attribute 'IncompleteRead'
These are the modules I am using
import gflags, httplib2, logging, os, pprint, sys, re, time
import pprint
from apiclient.discovery import build
from apiclient.discovery import build
from oauth2client.file import Storage
from oauth2client.client import AccessTokenRefreshError, flow_from_clientsecrets
from oauth2client.tools import run_flow
And here is the code that is causing the error
if is_google_doc(drive_file):
    try:
        download_url = drive_file['exportLinks']['application/pdf']
    except KeyError:
        download_url = None
else:
    download_url = drive_file['downloadUrl']
if download_url:
    try:
        resp, content = service._http.request(download_url)
    except httplib2.IncompleteRead:
        log( 'Error while reading file %s. Retrying...' % drive_file['title'].replace( '/', '_' ) )
        print 'Error while reading file %s. Retrying...' % drive_file['title'].replace( '/', '_' )
        download_file( service, drive_file, dest_path )
        return False
    if resp.status == 200:
        try:
            target = open( file_location, 'w+' )
        except:
            log( "Could not open file %s for writing. Please check permissions." % file_location )
            print "Could not open file %s for writing. Please check permissions." % file_location
            return False
        target.write( content )
        return True
    else:
        log( 'An error occurred: %s' % resp )
        print 'An error occurred: %s' % resp
        return False
else:
    # The file doesn't have any content stored on Drive.
    return False
I am assuming this error has something to do with losing the connection while downloading, and I am unfamiliar with the httplib2 module.
The full code can be found here
Thank you in advance to anyone who can shed some light on a possible fix.
I've been updating that drive backup script, and have encountered the same error. I haven't yet worked out what exception is being thrown, but in order to reveal what it is (and allow the script to keep running) I've made the following change:
Remove this:
- except httplib2.IncompleteRead:
- log( 'Error while reading file %s. Retrying...' % drive_file['title'].replace( '/', '_' ) )
Replace it with this:
+ except Exception as e: #httplib2.IncompleteRead: # no longer exists
+ log( traceback.format_exc(e) + ' Error while reading file %s. Retrying...' % drive_file['title'].replace( '/', '_' ) )
This does have the downside that if it encounters an exception consistently, it may enter an endless loop. However, it will then reveal the actual exception being thrown, so the "except:" can be updated appropriately.
This change is visible in the repository here.
If I encounter the error again I'll update this answer with more detail.
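To avoid the endless-loop downside mentioned above, a bounded-retry variant is possible. This is only a sketch: service, drive_file and log come from the question's script, and MAX_RETRIES is a hypothetical cap.
import traceback

MAX_RETRIES = 3  # hypothetical cap so a persistent failure cannot retry forever

def fetch_content(service, drive_file, download_url):
    # Try the request a few times, logging the full traceback on each failure.
    for attempt in range(MAX_RETRIES):
        try:
            return service._http.request(download_url)
        except Exception:
            log(traceback.format_exc() +
                ' Error while reading file %s. Retrying...'
                % drive_file['title'].replace('/', '_'))
    return None, None  # give up; the caller treats this as a failed download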

Error when using os.stat - Python

Solved: Adding an os.chdir(myArg) resolved the issue.
I'm getting an error when trying to run the following code on anything other than my home directory or files/directories that I own.
FileNotFoundError: [Errno 2] No such file or directory:
I created a file in root and changed ownership on the file to pi:pi (the user running the script). If I specify that file directly, it works; however, if I run the script on "/", it will not read that or any other file/directory. I also created a directory /tempdir_delete/ and changed its ownership to pi:pi. If I run the script specifically on "/tempdir_delete/*", it works, but if I leave off the * it fails.
Why does it fail on everything except /home/pi/ or files that I explicitly specify and own? The stat runs as user pi, who is granted permission by sudo to perform it. Also, why do I have to specify the file that I own explicitly? Shouldn't it see that file in root and work because I own it?
import os
import re
import sys
import pwd

myReg = re.compile(r'^\.')
myUID = os.getuid()
myArg = sys.argv[1]
print(os.getuid())
print(pwd.getpwuid(int(myUID)))
print(myArg)

def getsize(direct):
    if os.path.isfile(direct) == True:
        statinfo = os.stat(myArg)
        print(str(statinfo.st_size))
    else:
        for i in os.listdir(direct):
            try:
                statinfo = os.stat(i)
                if myReg.search(i):
                    continue
                else:
                    print(i + ' Size: ' + str(statinfo.st_size))
            except:
                print("Exception occurred, can't read.")
                continue

getsize(myArg)
Solved. Adding an os.chdir(myArg) worked to resolve the issue.
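For completeness, the other possible fix (a sketch using only names from the question's code) is to join each entry to the directory being listed: os.listdir() returns bare names, which os.stat() otherwise resolves against the current working directory.
import os

def getsize(direct):
    for name in os.listdir(direct):
        full = os.path.join(direct, name)  # resolve against direct, not the CWD
        try:
            print(name + ' Size: ' + str(os.stat(full).st_size))
        except OSError:
            print("Exception occurred, can't read.")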

IOError when Using Urllib to Download Pics

Can anyone help me with an issue downloading multiple files? After a while, it stops with an IOError telling me the connection attempt failed. I tried using the time.sleep function to sleep for a random number of seconds, but it doesn't help. When I re-run the code, it starts downloading files again. Any solutions?
import urllib
import time
import random

index_list = ["index#1", "index#2", ..."index#n"]

for n in index_list:
    u = urllib.urlopen("url_address" + str(n) + ".jpg")
    data = u.read()
    f = open("tm" + str(n) + ".jpg", "wb")
    f.write(data)
    t = random.uniform(0, 1) * 10
    print "system sleep time is ", t, " seconds"
    time.sleep(t)
It is very likely that the error is caused by not closing the connections properly (see: should I call close() after urllib.urlopen()?).
It is also better practice to close files, so you should close f as well.
You could also use Python's with statement:
import urllib
import time
import random

index_list = ["index#1", "index#2", ..."index#n"]

for n in index_list:
    # The str() function call isn't necessary, since it's a list of strings
    u = urllib.urlopen("url_address" + n + ".jpg")
    data = u.read()
    u.close()
    with open("tm" + n + ".jpg", "wb") as f:
        f.write(data)
    t = random.uniform(0, 1) * 10
    print "system sleep time is ", t, " seconds"
    time.sleep(t)
If the problem still occurs and you can't provide further information, you may try urllib.urlretrieve.
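For example, inside the same loop (a sketch with the placeholder URL and filenames from above):
import urllib

# urlretrieve opens the URL, streams it to a local file and closes everything itself.
urllib.urlretrieve("url_address" + n + ".jpg", "tm" + n + ".jpg")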
Maybe you are not closing the connections properly, so the server sees too many open connections? Try to do a u.close() after reading the data in the loop.

Python else issues making an FTP program

I am having an issue with the else statement of this program. I have checked my spacing and it seems to be correct, but I keep getting a syntax error on the else statement. The program creates a file, then attempts to upload it to an FTP server; if the upload fails, it should not say anything to the user and just continue (it will try again when the program loops). Any help you could provide would be greatly appreciated.
#IMPORTS
import ConfigParser
import os
import random
import ftplib
from ftplib import FTP

#LOOP PART 1
from time import sleep

while True:
    #READ THE CONFIG FILE SETUP.INI
    config = ConfigParser.ConfigParser()
    config.readfp(open(r'setup.ini'))
    path = config.get('config', 'path')
    name = config.get('config', 'name')

    #CREATE THE KEYFILE
    filepath = os.path.join((path), (name))
    if not os.path.exists((path)):
        os.makedirs((path))
    file = open(filepath, 'w')
    file.write('text here')
    file.close()

    #Create Full Path
    fullpath = path + name

    #Random Sleep to Accommodate FTP Server
    sleeptimer = random.randrange(1, 30 + 1)
    sleep((sleeptimer))

    #Upload File to FTP Server
    try:
        host = '0.0.0.0'
        port = 3700
        ftp = FTP()
        ftp.connect(host, port)
        ftp.login('user', 'pass')
        file = open(fullpath, "rb")
        ftp.cwd('/')
        ftp.storbinary('STOR ' + name, file)
        ftp.quit()
        file.close()
    else:
        print 'Something is Wrong'

    #LOOP PART 2
    sleep(180.00)
else is valid as part of an exception block, but it only runs if no exception is raised, and there must be an except clause defined before it.
(edit) Most people skip the else clause and just write code after exiting (dedenting) from the try/except clauses.
The quick tutorial is:
try:
    # some statements that are executed until an exception is raised
    ...
except SomeExceptionType, e:
    # if some type of exception is raised
    ...
except SomeOtherExceptionType, e:
    # if another type of exception is raised
    ...
except Exception, e:
    # if *any* exception is raised - but this is usually evil because it hides
    # programming errors as well as the errors you want to handle. You can get
    # a feel for what went wrong with:
    traceback.print_exc()
    ...
else:
    # if no exception is raised
    ...
finally:
    # run regardless of whether exception was raised
    ...
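Applied to the upload in the question (a sketch reusing the question's own variables; ftplib.all_errors is the tuple of exceptions ftplib can raise), staying silent on failure would look roughly like this:
try:
    ftp = FTP()
    ftp.connect(host, port)
    ftp.login('user', 'pass')
    file = open(fullpath, "rb")
    ftp.cwd('/')
    ftp.storbinary('STOR ' + name, file)
    ftp.quit()
    file.close()
except ftplib.all_errors:
    # Stay silent and let the outer while loop retry on the next pass.
    pass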

PDF's and viruses in Django

I'm building a web application (python and Django) that allows users to upload pdf files for other users to download. How do I prevent a user from uploading a virus embedded in the pdf?
Update:
I found this code on Django Snippets that uses pyclamav. Would this do the job?
def clean_file(self):
    file = self.cleaned_data.get('file', '')
    #check a file in form for viruses
    if file:
        from tempfile import mkstemp
        import pyclamav
        import os
        tmpfile = mkstemp()[1]
        f = open(tmpfile, 'wb')
        f.write(file.read())
        f.close()
        isvirus, name = pyclamav.scanfile(tmpfile)
        os.unlink(tmpfile)
        if isvirus:
            raise forms.ValidationError(
                "WARNING! Virus \"%s\" was detected in this file. "
                "Check your system." % name)
    return file
Well, in general you can use any virus-scanning software to accomplish this task: just generate a command-line string that calls the virus scanner on your file, then use Python's subprocess module to run that command line, like so:
try:
    command_string = 'my_virusscanner -parameters ' + uploaded_file
    result = subprocess.check_output(command_string, stderr=subprocess.STDOUT, shell=True)
    #if needed, do something with "result"
except subprocess.CalledProcessError as e:
    #if your scanner gives an error code when detecting a virus, you'll end up here
    pass
except:
    #something else went wrong
    #check sys.exc_info() for info
    pass
Without checking the source code, I assume that pyclamav.scanfile is doing more or less the same thing, so if you trust clamav, you should be fine. If you don't trust it, use the approach above with the virus scanner of your choice.
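For instance, with ClamAV's command-line scanner (a sketch; clamscan exits with status 0 for a clean file and 1 when a virus is found, so check_output raises CalledProcessError on a detection):
import subprocess

def is_infected(path):
    try:
        subprocess.check_output(['clamscan', '--no-summary', path],
                                stderr=subprocess.STDOUT)
        return False  # exit status 0: no virus found
    except subprocess.CalledProcessError as e:
        return e.returncode == 1  # 1 = virus found, 2 = scan error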
You can use the django-safe-filefield package to validate that the uploaded file's extension matches its MIME type. Example:
settings.py
CLAMAV_SOCKET = 'unix://tmp/clamav.sock' # or tcp://127.0.0.1:3310
CLAMAV_TIMEOUT = 30 # 30 seconds timeout, None by default which means infinite
forms.py
from safe_filefield.forms import SafeFileField

class MyForm(forms.Form):
    attachment = SafeFileField(
        scan_viruses=True,
    )
