I have a very simple script that uses urllib to retrieve a zip file and place it on my desktop. The zip file is only a couple of MB in size and doesn't take long to download. However, the script doesn't seem to finish; it just hangs. Is there a way to forcibly close the urlretrieve, or a better solution?
The URL is to a public FTP site. Is the FTP perhaps the cause?
I'm using Python 2.7.8.
url = r'ftp://ftp.ngs.noaa.gov/pub/DS_ARCHIVE/ShapeFiles/IA.ZIP'
zip_path = r'C:\Users\***\Desktop\ngs.zip'
urllib.urlretrieve(url, zip_path)
Thanks in advance!
---Edit---
Was able to use ftplib to accomplish the task...
import os
from ftplib import FTP
import zipfile
ftp_site = 'ftp.ngs.noaa.gov'
ftp_file = 'IA.ZIP'
download_folder = '//folder to place file'
download_file = 'name of file'
download_path = os.path.join(download_folder, download_file)
# Download file from ftp
ftp = FTP(ftp_site)
ftp.login()
ftp.cwd('pub/DS_ARCHIVE/ShapeFiles') #change directory
ftp.retrlines('LIST') #show me the files located in directory
download = open(download_path, 'wb')
ftp.retrbinary('RETR ' + ftp_file, download.write)
ftp.quit()
download.close()
# Unzip if .zip file is downloaded
with zipfile.ZipFile(download_path, "r") as z:
    z.extractall(download_folder)
urllib has very poor support for error catching and debugging; urllib2 is a much better choice. The urlretrieve equivalent in urllib2 is:
resp = urllib2.urlopen(im_url)
with open(sav_name, 'wb') as f:
    f.write(resp.read())
And the errors to catch are:
urllib2.URLError, urllib2.HTTPError, httplib.HTTPException
And you can also catch socket.error in case the network is down.
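For illustration, a hedged sketch (Python 2) of catching those exceptions around a download; the URL and filename are placeholders, and the explicit timeout is just one way to keep urlopen from hanging indefinitely:
import socket
import httplib
import urllib2

url = 'ftp://example.com/somefile.zip'   # placeholder URL
try:
    # timeout (seconds) keeps a stalled connection from hanging forever
    resp = urllib2.urlopen(url, timeout=60)
    with open('somefile.zip', 'wb') as f:
        f.write(resp.read())
except urllib2.HTTPError as e:
    print 'HTTP error: %s' % e
except urllib2.URLError as e:
    print 'URL error: %s' % e
except httplib.HTTPException as e:
    print 'HTTP exception: %s' % e
except socket.error as e:
    print 'Network error: %s' % e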
You can use the Python requests library with the requests-ftp module. It provides an easier API and handles exceptions better. See: https://pypi.python.org/pypi/requests-ftp and http://docs.python-requests.org/en/latest/
Related
Problem
I'm trying to download more than 100,000 files from an FTP server in parallel (using threads). I previously tried it with urlretrieve, as answered here, but that gave me the following error: URLError(OSError(24, 'Too many open files')). Apparently this problem is a bug (I cannot find the reference anymore), so I tried to use urlopen in combination with shutil and then write to a file which I could close myself, as described here. This seemed to work fine, but then I got the same error again: URLError(OSError(24, 'Too many open files')). I thought that whenever writing to a file is incomplete or fails, the with statement would cause the file to close itself, but seemingly the files stay open and eventually cause the script to halt.
Question
How can I prevent this error, i.e. make sure that every file gets closed?
Code
import csv
import urllib.request
import shutil
from multiprocessing.dummy import Pool

def url_to_filename(url):
    filename = 'patric_genomes/' + url.split('/')[-1]
    return filename

def download(url):
    url = url.strip()
    try:
        with urllib.request.urlopen(url) as response, open(url_to_filename(url), 'wb') as out_file:
            shutil.copyfileobj(response, out_file)
    except Exception as e:
        return None, e

def build_urls(id_list):
    base_url = 'ftp://some_ftp_server/'
    urls = []
    for some_id in id_list:
        url = base_url + some_id + '/' + some_id + '.fna'
        print(url)
        urls.append(url)
    return urls

if __name__ == "__main__":
    with open('full_data/genome_ids.txt') as inFile:
        reader = csv.DictReader(inFile, delimiter = '\t')
        ids = {row['some_id'] for row in reader}
    urls = build_urls(ids)
    p = Pool(100)
    print(p.map(download, urls))
You may try to use contextlib to close your file as such:
import contextlib
[ ... ]
with contextlib.closing(urllib.request.urlopen(url)) as response, open(url_to_filename(url), 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
[ ... ]
According to the docs:
contextlib.closing(thing)
Return a context manager that closes thing upon completion of the block. [ ... ] without needing to explicitly close page. Even if an error occurs, page.close() will be called when the with block is exited.
A workaround would be to raise the open-file limit on your Linux OS. Check your current open-file limit:
ulimit -Hn
Add the following line in your /etc/sysctl.conf file:
fs.file-max = <number>
Where <number> is the new upper limit of open files you want to set.
Save and close the file, then run
sysctl -p
so that the changes take effect.
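If you'd rather inspect or raise the limit from within Python, here's a hedged sketch using the standard resource module (Unix only):
import resource

# Read this process's open-file limits (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)
# Raise the soft limit up to the hard limit (a process may not exceed the hard limit).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))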
I believe that the file handles you create are not disposed of in time by the system, since it takes some time to close a connection. So you end up using all the free file handles (and that includes network sockets) very quickly.
What you are doing is setting up an FTP connection for each of your files. This is bad practice. A better way is to open 5-15 connections and reuse them, downloading the files through the existing sockets without the overhead of the initial FTP handshake for each file. See this post for reference.
P.S. Also, as @Tarun_Lalwani mentioned, it is not a good idea to create a folder with more than ~1000 files in it, as it slows down the file system.
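To make the connection-reuse idea above concrete, here is a hedged sketch rather than the poster's actual setup: a hypothetical host, one persistent ftplib connection per worker thread, and a modest thread pool. Only the 'patric_genomes' folder name is taken from the question.
import os
import ftplib
import threading
from multiprocessing.dummy import Pool

FTP_HOST = 'some_ftp_server'   # hypothetical host
thread_local = threading.local()

def get_ftp():
    # Lazily open one FTP connection per thread and keep reusing it,
    # instead of paying the connection/handshake cost for every file.
    if getattr(thread_local, 'ftp', None) is None:
        ftp = ftplib.FTP(FTP_HOST, timeout=60)
        ftp.login()                      # anonymous login; adjust as needed
        thread_local.ftp = ftp
    return thread_local.ftp

def download(remote_path):
    local_path = os.path.join('patric_genomes', os.path.basename(remote_path))
    with open(local_path, 'wb') as out_file:
        get_ftp().retrbinary('RETR ' + remote_path, out_file.write)

if __name__ == '__main__':
    remote_paths = ['%04d/%04d.fna' % (i, i) for i in range(100)]   # placeholder list
    with Pool(10) as pool:               # 10 threads -> at most ~20 open handles
        pool.map(download, remote_paths)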
How can I prevent this error, i.e. make sure that every file gets closed?
To prevent the error you need to either increase the open-file limit or, more reasonably, decrease the concurrency of your thread pool. Connection and file closing is handled properly by the context managers.
Your thread pool has 100 threads and opens at least 200 handles (one for the FTP connection and another for the file). Reasonable concurrency would be about 10-30 threads.
Here's a simplified reproduction which shows that the code is okay. Put some content in somefile in the current directory.
test.py
#!/usr/bin/env python3
import sys
import shutil
import logging
from pathlib import Path
from urllib.request import urlopen
from multiprocessing.dummy import Pool as ThreadPool

def download(id):
    ftp_url = sys.argv[1]
    filename = Path(__name__).parent / 'files'
    try:
        with urlopen(ftp_url) as src, (filename / id).open('wb') as dst:
            shutil.copyfileobj(src, dst)
    except Exception as e:
        logging.exception('Download error')

if __name__ == '__main__':
    with ThreadPool(10) as p:
        p.map(download, (str(i).zfill(4) for i in range(1000)))
And then in the same directory:
$ docker run --name=ftp-test -d -e FTP_USER=user -e FTP_PASSWORD=pass \
-v `pwd`/somefile:/srv/dir/somefile panubo/vsftpd vsftpd /etc/vsftpd.conf
$ IP=`docker inspect --format '{{ .NetworkSettings.IPAddress }}' ftp-test`
$ curl ftp://user:pass@$IP/dir/somefile
$ python3 test.py ftp://user:pass@$IP/dir/somefile
$ docker stop ftp-test && docker rm -v ftp-test
Backstory: I'm trying to pull some data from an FTP login I was given. This data constantly gets updated, about daily, and I believe they wipe the FTP at the end of each week or month. I was thinking about inputting a date and having the script run daily to see if there are any files that match the date, but if the server's time isn't accurate it could cause data loss. For now I just want to download ALL the files, and then I'll work on fine-tuning it.
I haven't worked much with coding FTP before, but it seems simple enough. However, the problem I'm having is that small files get downloaded without a problem and their file sizes check out and match. When it tries to download a big file that would normally take a few minutes, it gets to a certain point (almost completing the file) and then it just stops and the script hangs.
For Example:
It tries to download a file that is 373485927 bytes in size. The script runs and downloads that file up until 373485568 bytes. It ALWAYS stops at this amount, even after trying different methods and changing some code.
I don't understand why it always stops at this byte count and why it works fine with smaller files (1000 bytes and under).
import os
import sys
import base64
import ftplib

def get_files(ftp, filelist):
    for f in filelist:
        try:
            print "Downloading file " + f + "\n"
            local_file = os.path.join('.', f)
            file = open(local_file, "wb")
            ftp.retrbinary('RETR ' + f, file.write)
        except ftplib.all_errors, e:
            print str(e)
        file.close()
    ftp.quit()

def list_files(ftp):
    print "Getting directory listing...\n"
    ftp.dir()
    filelist = ftp.nlst()
    #determine new files to DL, pass to get_files()
    #for now we will download all each execute
    get_files(ftp, filelist)

def get_conn(host,user,passwd):
    ftp = ftplib.FTP()
    try:
        print "\nConnecting to " + host + "...\n"
        ftp.connect(host, 21)
    except ftplib.all_errors, e:
        print str(e)
    try:
        print "Logging in...\n"
        ftp.login(user, base64.b64decode(passwd))
    except ftplib.all_errors, e:
        print str(e)
    ftp.set_pasv(True)
    list_files(ftp)

def main():
    host = "host.domain.com"
    user = "admin"
    passwd = "base64passwd"
    get_conn(host,user,passwd)

if __name__ == '__main__':
    main()
The output looks like this, with dddd.tar.gz being the big file that never finishes:
Downloading file aaaa.del.gz
Downloading file bbbb.del.gz
Downloading file cccc.del.gz
Downloading file dddd.tar.gz
This could be caused by a timeout issue. Where you currently have:
def get_conn(host,user,passwd):
    ftp = ftplib.FTP()
try adding a larger timeout until you have more of an idea of what's going on, like:
def get_conn(host,user,passwd):
    ftp = ftplib.FTP(timeout=100)
I'm not sure whether ftplib defaults to a timeout or not; it would be worth checking, and also worth checking whether you are being timed out by the server. Hope this helps.
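For what it's worth, a small hedged sketch for checking whether a default socket timeout is in effect (None means no timeout, so a stalled transfer can hang forever) and for passing one explicitly; the 100-second value simply mirrors the example above:
import socket
import ftplib

# None here means no process-wide default timeout is set.
print(socket.getdefaulttimeout())

# Pass an explicit timeout (in seconds) so a stalled connection errors out
# instead of hanging.
ftp = ftplib.FTP(timeout=100)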
If you are running your script in a Windows cmd console, try disabling the "QuickEdit Mode" option of cmd.
I had encountered a problem where my FTP script hung on Windows but worked normally on Linux. In the end I found that this solution worked for me.
I'm trying to download files (approximately 1 - 1.5MB/file) from a NASA server (URL), but to no avail! I've tried a few things with urllib2 and run into two results:
I create a new file on my machine that is only ~200KB and has nothing in it
I create a 1.5MB file on my machine that has nothing in it!
By "nothing in it" I mean when I open the file (these are hdf5 files, so I open them in hdfView) I see no hierarchical structure...literally looks like an empty h5 file. But, when I open the file in a text editor I can see there is SOMETHING there (it's binary, so in text it looks like...well, binary).
I think I am using urllib2 appropriately, though I have never successfully used urllib2 before. Would you please comment on whether what I am doing is right or not, and suggest something better?
from urllib2 import Request, urlopen, URLError, HTTPError

base_url = 'http://avdc.gsfc.nasa.gov/index.php?site=1480884223&id=40&go=list&path=%2FH2O%2F/2010'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
url = base_url + file_name
req = Request(url)

# Open the url
try:
    f = urlopen(req)
    print "downloading " + url

    # Open our local file for writing
    local_file = open('test.he5', "w" + file_mode)
    # Write to our local file
    local_file.write(f.read())
    local_file.close()
except HTTPError, e:
    print "HTTP Error:", e.code, url
except URLError, e:
    print "URL Error:", e.reason, url
I got this script (which seems to be the closest to working) from here.
I am unsure what the file_name should be. I looked at the page source information of the archive and pulled the file name listed there (not the same as what shows up on the web page), and doing this yields the 1.5MB file that shows nothing in hdfview.
You are creating an invalid URL; concatenating these two strings just tacks the download query onto the end of the listing URL:
base_url = 'http://avdc.gsfc.nasa.gov/index.php?site=1480884223&id=40&go=list&path=%2FH2O%2F/2010'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
url = base_url + file_name
You probably meant:
base_url = 'http://avdc.gsfc.nasa.gov/'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
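If it helps, here's a hedged sketch (Python 2) that builds the same download URL with urllib.urlencode so the query parameters stay readable; the values are copied from the question's URL:
import urllib

params = urllib.urlencode({
    'site': '1480884223',
    'id': '40',
    'go': 'download',
    'path': '/H2O/2010',
    'file': 'MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5',
})
url = 'http://avdc.gsfc.nasa.gov/download_2.php?' + params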
When downloading a large file, it's better to use a buffered copy from filehandle to filehandle:
import shutil
# ...
f = urlopen(req)
with open('test.he5', "w" + file_mode) as local_file:
    shutil.copyfileobj(f, local_file)
shutil.copyfileobj will efficiently read from the open urllib connection and write to the open local_file file handle. Note the with statement: when the code block underneath concludes, it will automatically close the file for you.
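copyfileobj also accepts an optional third argument, the number of bytes read per chunk; a small hedged sketch with an arbitrary 16 KiB buffer and a placeholder URL:
import shutil
from urllib2 import urlopen

response = urlopen('http://example.com/data.he5')          # placeholder URL
with open('data.he5', 'wb') as local_file:
    shutil.copyfileobj(response, local_file, 16 * 1024)    # 16 KiB per read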
I'm trying to download some public data files. I screenscrape to get the links to the files, which all look something like this:
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2001-2002/L28POC_B.xpt
I can't find any documentation on the Requests library website.
The requests library doesn't support ftp:// links.
To download a file from an FTP server you could use urlretrieve:
import urllib.request
urllib.request.urlretrieve('ftp://server/path/to/file', 'file')
# if you need to pass credentials:
# urllib.request.urlretrieve('ftp://username:password@server/path/to/file', 'file')
Or urlopen:
import shutil
import urllib.request
from contextlib import closing
with closing(urllib.request.urlopen('ftp://server/path/to/file')) as r:
    with open('file', 'wb') as f:
        shutil.copyfileobj(r, f)
Python 2:
import shutil
import urllib2
from contextlib import closing
with closing(urllib2.urlopen('ftp://server/path/to/file')) as r:
    with open('file', 'wb') as f:
        shutil.copyfileobj(r, f)
You can try this:
import ftplib

path = 'pub/Health_Statistics/NCHS/nhanes/2001-2002/'
filename = 'L28POC_B.xpt'

ftp = ftplib.FTP("Server IP")
ftp.login("UserName", "Password")
ftp.cwd(path)
with open(filename, 'wb') as f:
    ftp.retrbinary("RETR " + filename, f.write)
ftp.quit()
Try using the wget library for python. You can find the documentation for it here.
import wget
link = 'ftp://example.com/foo.txt'
wget.download(link)
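If you need to control where the file lands, wget.download also takes an out argument for the local path (sketch; URL and path are placeholders):
import wget

link = 'ftp://example.com/foo.txt'
# 'out' chooses the local filename (or an existing directory) for the download.
wget.download(link, out='foo_local.txt')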
Use urllib2. For more specifics, check out this example from doc.python.org:
Here's a snippet from the tutorial that may help
import urllib2
req = urllib2.Request('ftp://example.com')
response = urllib2.urlopen(req)
the_page = response.read()
import os
import ftplib
from contextlib import closing

def ftp_download(host, port, login, passwd, orig_filename, local_filename):
    with closing(ftplib.FTP()) as ftp:
        try:
            ftp.connect(host, port, 30*5)  # 5 mins timeout
            ftp.login(login, passwd)
            ftp.set_pasv(True)
            with open(local_filename, 'w+b') as f:
                res = ftp.retrbinary('RETR %s' % orig_filename, f.write)
                if not res.startswith('226 Transfer complete'):
                    print('Download of file {0} is not complete.'.format(orig_filename))
                    os.remove(local_filename)
                    return None
            return local_filename
        except:
            print('Error during download from FTP')
As several folks have noted, requests doesn't support FTP but Python has other libraries that do. If you want to keep using the requests library, there is a requests-ftp package that adds FTP capability to requests. I've used this library a little and it does work. The docs are full of warnings about code quality though. As of 0.2.0 the docs say "This library was cowboyed together in about 4 hours of total work, has no tests, and relies on a few ugly hacks".
import requests
import requests_ftp

requests_ftp.monkeypatch_session()
s = requests.Session()
response = s.get('ftp://example.com/foo.txt')
If you want to take advantage of recent Python versions' async features, you can use aioftp (from the same family of libraries and developers as the more popular aiohttp library). Here is a code example taken from their client tutorial:
client = aioftp.Client()
await client.connect("ftp.server.com")
await client.login("user", "pass")
await client.download("tmp/test.py", "foo.py", write_into=True)
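Those awaits need to run inside an event loop; a minimal hedged wrapper (Python 3.7+, same placeholder host, credentials, and paths as the tutorial snippet):
import asyncio
import aioftp

async def main():
    client = aioftp.Client()
    await client.connect("ftp.server.com")
    await client.login("user", "pass")
    # Download remote tmp/test.py into local foo.py
    await client.download("tmp/test.py", "foo.py", write_into=True)

asyncio.run(main())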
urllib2.urlopen handles ftp links.
urlretrieve did not work for me, and the official documentation says it might become deprecated at some point in the future.
import shutil
from urllib.request import URLopener
opener = URLopener()
url = 'ftp://ftp_domain/path/to/the/file'
store_path = 'path//to//your//local//storage'
with opener.open(url) as remote_file, open(store_path, 'wb') as local_file:
    shutil.copyfileobj(remote_file, local_file)
I am trying to download a zip file to a local drive and extract all files to a destination folder.
So I have come up with a solution, but it only "downloads" a file from one local directory to another; it doesn't work for downloading files from a URL. For the extraction, I am able to get it to work in 2.6 but not in 2.5, so I am open to any suggestions for a workaround or another approach.
Thanks in advance.
######################################
'''this part works but it is not good for URL links'''
import shutil

sourceFile = r"C:\Users\blueman\master\test2.5.zip"
destDir = r"C:\Users\blueman\user"
shutil.copy(sourceFile, destDir)
print "file copied"

######################################################
'''extract works but not good for version 2.5'''
import zipfile

GLBzipFilePath = r'C:\Users\blueman\user\test2.5.zip'
GLBextractDir = r'C:\Users\blueman\user'

def extract(zipFilePath, extractDir):
    zip = zipfile.ZipFile(zipFilePath)
    zip.extractall(path=extractDir)
    print "it works"

extract(GLBzipFilePath, GLBextractDir)
######################################################
urllib.urlretrieve can get a file (zip or otherwise;-) from a URL to a given path.
extractall is indeed new in 2.6, but in 2.5 you can use an explicit loop (get all names, open each name, etc). Do you need example code?
So here's the general idea (needs more try/except if you want to give a nice error message in each and every case which could go wrong, of which, of course, there are a million variants -- I'm only using a couple of such cases as examples...):
import os
import urllib
import zipfile

def getunzipped(theurl, thedir):
    name = os.path.join(thedir, 'temp.zip')
    try:
        name, hdrs = urllib.urlretrieve(theurl, name)
    except IOError, e:
        print "Can't retrieve %r to %r: %s" % (theurl, thedir, e)
        return
    try:
        z = zipfile.ZipFile(name)
    except zipfile.error, e:
        print "Bad zipfile (from %r): %s" % (theurl, e)
        return
    for n in z.namelist():
        dest = os.path.join(thedir, n)
        destdir = os.path.dirname(dest)
        if not os.path.isdir(destdir):
            os.makedirs(destdir)
        data = z.read(n)
        f = open(dest, 'wb')
        f.write(data)
        f.close()
    z.close()
    os.unlink(name)
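A usage example might look like this (URL and target directory are placeholders; the directory should already exist):
getunzipped('http://www.example.com/file.zip', '/tmp/dest')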
For downloading, look at urllib:
import urllib
webFile = urllib.urlopen(url)
For unzipping, use zipfile. See also this example.
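Putting the two together without a temporary file, a hedged sketch (Python 2.6+, placeholder URL and destination; fine for small archives, since the whole zip is held in memory):
import urllib
import zipfile
from cStringIO import StringIO

# Read the entire zip into memory, then extract it to the destination folder.
data = urllib.urlopen('http://www.example.com/file.zip').read()   # placeholder URL
zipfile.ZipFile(StringIO(data)).extractall('/tmp/dest')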
The shortest way I've found so far is to use @alex's answer above, but with ZipFile.extractall() instead of the loop:
from zipfile import ZipFile
from urllib import urlretrieve
from tempfile import mktemp
filename = mktemp('.zip')
destDir = mktemp()
theurl = 'http://www.example.com/file.zip'
name, hdrs = urlretrieve(theurl, filename)
thefile = ZipFile(filename)
thefile.extractall(destDir)
thefile.close()