Related
I have a small utility that I use to download an MP3 file from a website on a schedule and then builds/updates a podcast XML file which I've added to iTunes.
The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3 file. I would prefer to have the entire utility written in Python.
I struggled to find a way to actually download the file in Python, thus why I resorted to using wget.
So, how do I download the file using Python?
One more, using urlretrieve:
import urllib.request
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
(for Python 2 use import urllib and urllib.urlretrieve)
Use urllib.request.urlopen():
import urllib.request
with urllib.request.urlopen('http://www.example.com/') as f:
html = f.read().decode('utf-8')
This is the most basic way to use the library, minus any error handling. You can also do more complex stuff such as changing headers.
On Python 2, the method is in urllib2:
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
In 2012, use the python requests library
>>> import requests
>>>
>>> url = "http://download.thinkbroadband.com/10MB.zip"
>>> r = requests.get(url)
>>> print len(r.content)
10485760
You can run pip install requests to get it.
Requests has many advantages over the alternatives because the API is much simpler. This is especially true if you have to do authentication. urllib and urllib2 are pretty unintuitive and painful in this case.
2015-12-30
People have expressed admiration for the progress bar. It's cool, sure. There are several off-the-shelf solutions now, including tqdm:
from tqdm import tqdm
import requests
url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)
with open("10MB", "wb") as handle:
for data in tqdm(response.iter_content()):
handle.write(data)
This is essentially the implementation #kvance described 30 months ago.
import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3','wb') as output:
output.write(mp3file.read())
The wb in open('test.mp3','wb') opens a file (and erases any existing file) in binary mode so you can save data with it instead of just text.
Python 3
urllib.request.urlopen
import urllib.request
response = urllib.request.urlopen('http://www.example.com/')
html = response.read()
urllib.request.urlretrieve
import urllib.request
urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
Note: According to the documentation, urllib.request.urlretrieve is a "legacy interface" and "might become deprecated in the future" (thanks gerrit)
Python 2
urllib2.urlopen (thanks Corey)
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
urllib.urlretrieve (thanks PabloG)
import urllib
urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
use wget module:
import wget
wget.download('url')
import os,requests
def download(url):
get_response = requests.get(url,stream=True)
file_name = url.split("/")[-1]
with open(file_name, 'wb') as f:
for chunk in get_response.iter_content(chunk_size=1024):
if chunk: # filter out keep-alive new chunks
f.write(chunk)
download("https://example.com/example.jpg")
An improved version of the PabloG code for Python 2/3:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import ( division, absolute_import, print_function, unicode_literals )
import sys, os, tempfile, logging
if sys.version_info >= (3,):
import urllib.request as urllib2
import urllib.parse as urlparse
else:
import urllib2
import urlparse
def download_file(url, dest=None):
"""
Download and save a file specified by url to dest directory,
"""
u = urllib2.urlopen(url)
scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
filename = os.path.basename(path)
if not filename:
filename = 'downloaded.file'
if dest:
filename = os.path.join(dest, filename)
with open(filename, 'wb') as f:
meta = u.info()
meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
meta_length = meta_func("Content-Length")
file_size = None
if meta_length:
file_size = int(meta_length[0])
print("Downloading: {0} Bytes: {1}".format(url, file_size))
file_size_dl = 0
block_sz = 8192
while True:
buffer = u.read(block_sz)
if not buffer:
break
file_size_dl += len(buffer)
f.write(buffer)
status = "{0:16}".format(file_size_dl)
if file_size:
status += " [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
status += chr(13)
print(status, end="")
print()
return filename
if __name__ == "__main__": # Only run if this file is called directly
print("Testing with 10MB download")
url = "http://download.thinkbroadband.com/10MB.zip"
filename = download_file(url)
print(filename)
Simple yet Python 2 & Python 3 compatible way comes with six library:
from six.moves import urllib
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
Following are the most commonly used calls for downloading files in python:
urllib.urlretrieve ('url_to_file', file_name)
urllib2.urlopen('url_to_file')
requests.get(url)
wget.download('url', file_name)
Note: urlopen and urlretrieve are found to perform relatively bad with downloading large files (size > 500 MB). requests.get stores the file in-memory until download is complete.
Wrote wget library in pure Python just for this purpose. It is pumped up urlretrieve with these features as of version 2.0.
In python3 you can use urllib3 and shutil libraires.
Download them by using pip or pip3 (Depending whether python3 is default or not)
pip3 install urllib3 shutil
Then run this code
import urllib.request
import shutil
url = "http://www.somewebsite.com/something.pdf"
output_file = "save_this_name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
shutil.copyfileobj(response, out_file)
Note that you download urllib3 but use urllib in code
I agree with Corey, urllib2 is more complete than urllib and should likely be the module used if you want to do more complex things, but to make the answers more complete, urllib is a simpler module if you want just the basics:
import urllib
response = urllib.urlopen('http://www.example.com/sound.mp3')
mp3 = response.read()
Will work fine. Or, if you don't want to deal with the "response" object you can call read() directly:
import urllib
mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()
If you have wget installed, you can use parallel_sync.
pip install parallel_sync
from parallel_sync import wget
urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
wget.download('/tmp', urls)
# or a single file:
wget.download('/tmp', urls[0], filenames='x.zip', extract=True)
Doc:
https://pythonhosted.org/parallel_sync/pages/examples.html
This is pretty powerful. It can download files in parallel, retry upon failure , and it can even download files on a remote machine.
You can get the progress feedback with urlretrieve as well:
def report(blocknr, blocksize, size):
current = blocknr*blocksize
sys.stdout.write("\r{0:.2f}%".format(100.0*current/size))
def downloadFile(url):
print "\n",url
fname = url.split('/')[-1]
print fname
urllib.urlretrieve(url, fname, report)
If speed matters to you, I made a small performance test for the modules urllib and wget, and regarding wget I tried once with status bar and once without. I took three different 500MB files to test with (different files- to eliminate the chance that there is some caching going on under the hood). Tested on debian machine, with python2.
First, these are the results (they are similar in different runs):
$ python wget_test.py
urlretrive_test : starting
urlretrive_test : 6.56
==============
wget_no_bar_test : starting
wget_no_bar_test : 7.20
==============
wget_with_bar_test : starting
100% [......................................................................] 541335552 / 541335552
wget_with_bar_test : 50.49
==============
The way I performed the test is using "profile" decorator. This is the full code:
import wget
import urllib
import time
from functools import wraps
def profile(func):
#wraps(func)
def inner(*args):
print func.__name__, ": starting"
start = time.time()
ret = func(*args)
end = time.time()
print func.__name__, ": {:.2f}".format(end - start)
return ret
return inner
url1 = 'http://host.com/500a.iso'
url2 = 'http://host.com/500b.iso'
url3 = 'http://host.com/500c.iso'
def do_nothing(*args):
pass
#profile
def urlretrive_test(url):
return urllib.urlretrieve(url)
#profile
def wget_no_bar_test(url):
return wget.download(url, out='/tmp/', bar=do_nothing)
#profile
def wget_with_bar_test(url):
return wget.download(url, out='/tmp/')
urlretrive_test(url1)
print '=============='
time.sleep(1)
wget_no_bar_test(url2)
print '=============='
time.sleep(1)
wget_with_bar_test(url3)
print '=============='
time.sleep(1)
urllib seems to be the fastest
Just for the sake of completeness, it is also possible to call any program for retrieving files using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-R), can deal with FTP, redirects, HTTP proxies, can avoid re-downloading existing files (-nc), and aria2 can do multi-connection downloads which can potentially speed up your downloads.
import subprocess
subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])
In Jupyter Notebook, one can also call programs directly with the ! syntax:
!wget -O example_output_file.html https://example.com
Late answer, but for python>=3.6 you can use:
import dload
dload.save(url)
Install dload with:
pip3 install dload
Source code can be:
import urllib
sock = urllib.urlopen("http://diveintopython.org/")
htmlSource = sock.read()
sock.close()
print htmlSource
I wrote the following, which works in vanilla Python 2 or Python 3.
import sys
try:
import urllib.request
python3 = True
except ImportError:
import urllib2
python3 = False
def progress_callback_simple(downloaded,total):
sys.stdout.write(
"\r" +
(len(str(total))-len(str(downloaded)))*" " + str(downloaded) + "/%d"%total +
" [%3.2f%%]"%(100.0*float(downloaded)/float(total))
)
sys.stdout.flush()
def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
def _download_helper(response, out_file, file_size):
if progress_callback!=None: progress_callback(0,file_size)
if block_size == None:
buffer = response.read()
out_file.write(buffer)
if progress_callback!=None: progress_callback(file_size,file_size)
else:
file_size_dl = 0
while True:
buffer = response.read(block_size)
if not buffer: break
file_size_dl += len(buffer)
out_file.write(buffer)
if progress_callback!=None: progress_callback(file_size_dl,file_size)
with open(dstfilepath,"wb") as out_file:
if python3:
with urllib.request.urlopen(srcurl) as response:
file_size = int(response.getheader("Content-Length"))
_download_helper(response,out_file,file_size)
else:
response = urllib2.urlopen(srcurl)
meta = response.info()
file_size = int(meta.getheaders("Content-Length")[0])
_download_helper(response,out_file,file_size)
import traceback
try:
download(
"https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
"output.zip",
progress_callback_simple
)
except:
traceback.print_exc()
input()
Notes:
Supports a "progress bar" callback.
Download is a 4 MB test .zip from my website.
You can use PycURL on Python 2 and 3.
import pycurl
FILE_DEST = 'pycurl.html'
FILE_SRC = 'http://pycurl.io/'
with open(FILE_DEST, 'wb') as f:
c = pycurl.Curl()
c.setopt(c.URL, FILE_SRC)
c.setopt(c.WRITEDATA, f)
c.perform()
c.close()
Use Python Requests in 5 lines
import requests as req
remote_url = 'http://www.example.com/sound.mp3'
local_file_name = 'sound.mp3'
data = req.get(remote_url)
# Save file data to local copy
with open(local_file_name, 'wb')as file:
file.write(data.content)
Now do something with the local copy of the remote file
This may be a little late, But I saw pabloG's code and couldn't help adding a os.system('cls') to make it look AWESOME! Check it out :
import urllib2,os
url = "http://download.thinkbroadband.com/10MB.zip"
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
os.system('cls')
file_size_dl = 0
block_sz = 8192
while True:
buffer = u.read(block_sz)
if not buffer:
break
file_size_dl += len(buffer)
f.write(buffer)
status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
status = status + chr(8)*(len(status)+1)
print status,
f.close()
If running in an environment other than Windows, you will have to use something other then 'cls'. In MAC OS X and Linux it should be 'clear'.
urlretrieve and requests.get are simple, however the reality not.
I have fetched data for couple sites, including text and images, the above two probably solve most of the tasks. but for a more universal solution I suggest the use of urlopen. As it is included in Python 3 standard library, your code could run on any machine that run Python 3 without pre-installing site-package
import urllib.request
url_request = urllib.request.Request(url, headers=headers)
url_connect = urllib.request.urlopen(url_request)
#remember to open file in bytes mode
with open(filename, 'wb') as f:
while True:
buffer = url_connect.read(buffer_size)
if not buffer: break
#an integer value of size of written data
data_wrote = f.write(buffer)
#you could probably use with-open-as manner
url_connect.close()
This answer provides a solution to HTTP 403 Forbidden when downloading file over http using Python. I have tried only requests and urllib modules, the other module may provide something better, but this is the one I used to solve most of the problems.
New Api urllib3 based implementation
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'your_url_goes_here')
>>> r.status
200
>>> r.data
*****Response Data****
More info: https://pypi.org/project/urllib3/
You can python requests
import os
import requests
outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(URL, stream=True)
with open(outfile,'wb') as output:
output.write(response.content)
You can use shutil
import os
import requests
import shutil
outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(url, stream = True)
with open(outfile, 'wb') as f:
shutil.copyfileobj(response.content, f)
If you are downloading from restricted url, don't forget to include access token in headers
I wanted do download all the files from a webpage. I tried wget but it was failing so I decided for the Python route and I found this thread.
After reading it, I have made a little command line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.
It uses BeatifulSoup to collect all the URLs of the page and then download the ones with the desired extension(s). Finally it can download multiple files in parallel.
Here it is:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import (division, absolute_import, print_function, unicode_literals)
import sys, os, argparse
from bs4 import BeautifulSoup
# --- insert Stan's script here ---
# if sys.version_info >= (3,):
#...
#...
# def download_file(url, dest=None):
#...
#...
# --- new stuff ---
def collect_all_url(page_url, extensions):
"""
Recovers all links in page_url checking for all the desired extensions
"""
conn = urllib2.urlopen(page_url)
html = conn.read()
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('a')
results = []
for tag in links:
link = tag.get('href', None)
if link is not None:
for e in extensions:
if e in link:
# Fallback for badly defined links
# checks for missing scheme or netloc
if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
results.append(link)
else:
new_url=urlparse.urljoin(page_url,link)
results.append(new_url)
return results
if __name__ == "__main__": # Only run if this file is called directly
# Command line arguments
parser = argparse.ArgumentParser(
description='Download all files from a webpage.')
parser.add_argument(
'-u', '--url',
help='Page url to request')
parser.add_argument(
'-e', '--ext',
nargs='+',
help='Extension(s) to find')
parser.add_argument(
'-d', '--dest',
default=None,
help='Destination where to save the files')
parser.add_argument(
'-p', '--par',
action='store_true', default=False,
help="Turns on parallel download")
args = parser.parse_args()
# Recover files to download
all_links = collect_all_url(args.url, args.ext)
# Download
if not args.par:
for l in all_links:
try:
filename = download_file(l, args.dest)
print(l)
except Exception as e:
print("Error while downloading: {}".format(e))
else:
from multiprocessing.pool import ThreadPool
results = ThreadPool(10).imap_unordered(
lambda x: download_file(x, args.dest), all_links)
for p in results:
print(p)
An example of its usage is:
python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>
And an actual example if you want to see it in action:
python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics
Another possibility is with built-in http.client:
from http import HTTPStatus, client
from shutil import copyfileobj
# using https
connection = client.HTTPSConnection("www.example.com")
with connection.request("GET", "/noise.mp3") as response:
if response.status == HTTPStatus.OK:
copyfileobj(response, open("noise.mp3")
else:
raise Exception("request needs work")
The HTTPConnection object is considered “low-level” in that it performs the desired request once and assumes the developer will subclass it or script in a way to handle the nuances of HTTP. Libraries such as requests tend to handle more special cases such as automatically following redirects and so on.
You can use keras.utils.get_file to do it:
from tensorflow import keras
path_to_downloaded_file = keras.utils.get_file(
fname="file name",
origin="https://www.linktofile.com/link/to/file",
extract=True,
archive_format="zip", # downloaded file format
cache_dir="/", # cache and extract in current directory
)
Another way is to call an external process such as curl.exe. Curl by default displays a progress bar, average download speed, time left, and more all formatted neatly in a table.
Put curl.exe in the same directory as your script
from subprocess import call
url = ""
call(["curl", {url}, '--output', "song.mp3"])
Note: You cannot specify an output path with curl, so do an os.rename afterwards
So I am pulling jpg's from a url. I am able to save the image files as long as they are being saved to the same folder the python file is in. As soon as I attempt to change the folder(seen here as the outpath) the image files do not get created. I imagine it has something to do with my outpath, but it seems to be fine when I am printing and watching it in the console.
Ubuntu 11.10 OS by the way. I'm a newbie with both linux and python, so it could easily be either. :)
If I were to print the sequence taken from the CSV file it would look like: [['Champ1', 'Subname1', 'imgurl1'],['Champ2', 'subname2', 'imgurl2'],['Champ3','subname3','imgurl3']...]
(It was scraped from a website)
import csv
from urlparse import urlsplit
from urllib2 import urlopen, build_opener
from urllib import urlretrieve
import webbrowser
import os
import sys
reader = csv.reader(open('champdata.csv', "rb"), delimiter = ",", skipinitialspace=True)
champInfo = []
for champs in reader:
champInfo.append(champs)
size = len(champInfo)
def GetImages(x, out_folder="/home/sean/Home/workspace/CP/images"):
b=1
size = len(champInfo)
print size
while b < size:
temp_imgurls = x.pop(b)
filename = os.path.basename(temp_imgurls[2])
print filename
outpath = os.path.join(out_folder, filename)
print outpath
u = urlopen(temp_imgurls[2])
localFile = open(outpath, 'wb')
localFile.write(u.read())
localFile.close()
b+=1
GetImages(champInfo)
I understand it's quite crude, but it does work, only if I'm not attempting to change the save path.
Try providing the complete image path everywhere
E:/../home/sean/Home/workspace/CD/images
def GetImages(x):
b=1
size = len(champInfo)
print size
while b < size:
temp_imgurls = x.pop(b)
filename = temp_imgurls[2]
u = urlopen(temp_imgurls[2])
localFile = open(filename, 'wb')
localFile.write(u.read())
localFile.close()
And this code will be save files in the same directory where script is.
Updated Answer:
I think the answer to your problem is just to add a check for the output directory existence, and create it if needed. ie, add:
if not os.path.exists(out_folder):
os.makedirs(out_folder)
to your existing code.
More generally , you could try something more like this:
import csv
from urllib2 import urlopen
import os
import sys
default_outfolder = "/home/sean/Home/workspace/CD/images"
# simple arg passing wihtout error checking
out_folder = sys.argv[1] if len(sys.argv) == 2 else default_outfolder
if not os.path.exists(out_folder):
os.makedirs(out_folder) # creates out_folder, including any required parent ones
else:
if not os.path.isdir(out_folder):
raise RuntimeError('output path must be a directory')
reader = csv.reader(open('champdata.csv', "rb"), delimiter = ",", skipinitialspace=True)
for champs in reader:
img_url = champs[2]
filename = os.path.basename(img_url)
outpath = os.path.join(out_folder, filename)
print 'Downloading %s to %s' % (img_url, outpath)
with open(outpath, 'wb') as f:
u = urlopen(img_url)
f.write(u.read())
The above code works for champdata.csv of the form stuff,more_stuff,http://www.somesite.com.au/path/to/image.png
but will need to be adapted if I have not understood the actual format of your incoming data.
I know the URL of an image on Internet.
e.g. http://www.digimouth.com/news/media/2011/09/google-logo.jpg, which contains the logo of Google.
Now, how can I download this image using Python without actually opening the URL in a browser and saving the file manually.
Python 2
Here is a more straightforward way if all you want to do is save it as a file:
import urllib
urllib.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")
The second argument is the local path where the file should be saved.
Python 3
As SergO suggested the code below should work with Python 3.
import urllib.request
urllib.request.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")
import urllib
resource = urllib.urlopen("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")
output = open("file01.jpg","wb")
output.write(resource.read())
output.close()
file01.jpg will contain your image.
I wrote a script that does just this, and it is available on my github for your use.
I utilized BeautifulSoup to allow me to parse any website for images. If you will be doing much web scraping (or intend to use my tool) I suggest you sudo pip install BeautifulSoup. Information on BeautifulSoup is available here.
For convenience here is my code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import urllib
# use this image scraper from the location that
#you want to save scraped images to
def make_soup(url):
html = urlopen(url).read()
return BeautifulSoup(html)
def get_images(url):
soup = make_soup(url)
#this makes a list of bs4 element tags
images = [img for img in soup.findAll('img')]
print (str(len(images)) + "images found.")
print 'Downloading images to current working directory.'
#compile our unicode list of image links
image_links = [each.get('src') for each in images]
for each in image_links:
filename=each.split('/')[-1]
urllib.urlretrieve(each, filename)
return image_links
#a standard call looks like this
#get_images('http://www.wookmark.com')
This can be done with requests. Load the page and dump the binary content to a file.
import os
import requests
url = 'https://apod.nasa.gov/apod/image/1701/potw1636aN159_HST_2048.jpg'
page = requests.get(url)
f_ext = os.path.splitext(url)[-1]
f_name = 'img{}'.format(f_ext)
with open(f_name, 'wb') as f:
f.write(page.content)
Python 3
urllib.request — Extensible library for opening URLs
from urllib.error import HTTPError
from urllib.request import urlretrieve
try:
urlretrieve(image_url, image_local_path)
except FileNotFoundError as err:
print(err) # something wrong with local path
except HTTPError as err:
print(err) # something wrong with url
I made a script expanding on Yup.'s script. I fixed some things. It will now bypass 403:Forbidden problems. It wont crash when an image fails to be retrieved. It tries to avoid corrupted previews. It gets the right absolute urls. It gives out more information. It can be run with an argument from the command line.
# getem.py
# python2 script to download all images in a given url
# use: python getem.py http://url.where.images.are
from bs4 import BeautifulSoup
import urllib2
import shutil
import requests
from urlparse import urljoin
import sys
import time
def make_soup(url):
req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
html = urllib2.urlopen(req)
return BeautifulSoup(html, 'html.parser')
def get_images(url):
soup = make_soup(url)
images = [img for img in soup.findAll('img')]
print (str(len(images)) + " images found.")
print 'Downloading images to current working directory.'
image_links = [each.get('src') for each in images]
for each in image_links:
try:
filename = each.strip().split('/')[-1].strip()
src = urljoin(url, each)
print 'Getting: ' + filename
response = requests.get(src, stream=True)
# delay to avoid corrupted previews
time.sleep(1)
with open(filename, 'wb') as out_file:
shutil.copyfileobj(response.raw, out_file)
except:
print ' An error occured. Continuing.'
print 'Done.'
if __name__ == '__main__':
url = sys.argv[1]
get_images(url)
A solution which works with Python 2 and Python 3:
try:
from urllib.request import urlretrieve # Python 3
except ImportError:
from urllib import urlretrieve # Python 2
url = "http://www.digimouth.com/news/media/2011/09/google-logo.jpg"
urlretrieve(url, "local-filename.jpg")
or, if the additional requirement of requests is acceptable and if it is a http(s) URL:
def load_requests(source_url, sink_path):
"""
Load a file from an URL (e.g. http).
Parameters
----------
source_url : str
Where to load the file from.
sink_path : str
Where the loaded file is stored.
"""
import requests
r = requests.get(source_url, stream=True)
if r.status_code == 200:
with open(sink_path, 'wb') as f:
for chunk in r:
f.write(chunk)
Using requests library
import requests
import shutil,os
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
currentDir = os.getcwd()
path = os.path.join(currentDir,'Images')#saving images to Images folder
def ImageDl(url):
attempts = 0
while attempts < 5:#retry 5 times
try:
filename = url.split('/')[-1]
r = requests.get(url,headers=headers,stream=True,timeout=5)
if r.status_code == 200:
with open(os.path.join(path,filename),'wb') as f:
r.raw.decode_content = True
shutil.copyfileobj(r.raw,f)
print(filename)
break
except Exception as e:
attempts+=1
print(e)
ImageDl(url)
Use a simple python wget module to download the link. Usage below:
import wget
wget.download('http://www.digimouth.com/news/media/2011/09/google-logo.jpg')
This is very short answer.
import urllib
urllib.urlretrieve("http://photogallery.sandesh.com/Picture.aspx?AlubumId=422040", "Abc.jpg")
Version for Python 3
I adjusted the code of #madprops for Python 3
# getem.py
# python2 script to download all images in a given url
# use: python getem.py http://url.where.images.are
from bs4 import BeautifulSoup
import urllib.request
import shutil
import requests
from urllib.parse import urljoin
import sys
import time
def make_soup(url):
req = urllib.request.Request(url, headers={'User-Agent' : "Magic Browser"})
html = urllib.request.urlopen(req)
return BeautifulSoup(html, 'html.parser')
def get_images(url):
soup = make_soup(url)
images = [img for img in soup.findAll('img')]
print (str(len(images)) + " images found.")
print('Downloading images to current working directory.')
image_links = [each.get('src') for each in images]
for each in image_links:
try:
filename = each.strip().split('/')[-1].strip()
src = urljoin(url, each)
print('Getting: ' + filename)
response = requests.get(src, stream=True)
# delay to avoid corrupted previews
time.sleep(1)
with open(filename, 'wb') as out_file:
shutil.copyfileobj(response.raw, out_file)
except:
print(' An error occured. Continuing.')
print('Done.')
if __name__ == '__main__':
get_images('http://www.wookmark.com')
Late answer, but for python>=3.6 you can use dload, i.e.:
import dload
dload.save("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")
if you need the image as bytes, use:
img_bytes = dload.bytes("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")
install using pip3 install dload
Something fresh for Python 3 using Requests:
Comments in the code. Ready to use function.
import requests
from os import path
def get_image(image_url):
"""
Get image based on url.
:return: Image name if everything OK, False otherwise
"""
image_name = path.split(image_url)[1]
try:
image = requests.get(image_url)
except OSError: # Little too wide, but work OK, no additional imports needed. Catch all conection problems
return False
if image.status_code == 200: # we could have retrieved error page
base_dir = path.join(path.dirname(path.realpath(__file__)), "images") # Use your own path or "" to use current working directory. Folder must exist.
with open(path.join(base_dir, image_name), "wb") as f:
f.write(image.content)
return image_name
get_image("https://apod.nasddfda.gov/apod/image/2003/S106_Mishra_1947.jpg")
this is the easiest method to download images.
import requests
from slugify import slugify
img_url = 'https://apod.nasa.gov/apod/image/1701/potw1636aN159_HST_2048.jpg'
img = requests.get(img_url).content
img_file = open(slugify(img_url) + '.' + str(img_url).split('.')[-1], 'wb')
img_file.write(img)
img_file.close()
If you don't already have the url for the image, you could scrape it with gazpacho:
from gazpacho import Soup
base_url = "http://books.toscrape.com"
soup = Soup.get(base_url)
links = [img.attrs["src"] for img in soup.find("img")]
And then download the asset with urllib as mentioned:
from pathlib import Path
from urllib.request import urlretrieve as download
directory = "images"
Path(directory).mkdir(exist_ok=True)
link = links[0]
name = link.split("/")[-1]
download(f"{base_url}/{link}", f"{directory}/{name}")
# import the required libraries from Python
import pathlib,urllib.request
# Using pathlib, specify where the image is to be saved
downloads_path = str(pathlib.Path.home() / "Downloads")
# Form a full image path by joining the path to the
# images' new name
picture_path = os.path.join(downloads_path, "new-image.png")
# "/home/User/Downloads/new-image.png"
# Using "urlretrieve()" from urllib.request save the image
urllib.request.urlretrieve("//example.com/image.png", picture_path)
# urlretrieve() takes in 2 arguments
# 1. The URL of the image to be downloaded
# 2. The image new name after download. By default, the image is saved
# inside your current working directory
Ok, so, this is my rudimentary attempt, and probably total overkill.
Update if needed, as this doesn't handle any timeouts, but, I got this working for fun.
Code listed here: https://github.com/JayRizzo/JayRizzoTools/blob/master/pyImageDownloader.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# =============================================================================
# Created Syst: MAC OSX High Sierra 21.5.0 (17G65)
# Created Plat: Python 3.9.5 ('v3.9.5:0a7dcbdb13', 'May 3 2021 13:17:02')
# Created By : Jeromie Kirchoff
# Created Date: Thu Jun 15 23:31:01 2022 CDT
# Last ModDate: Thu Jun 16 01:41:01 2022 CDT
# =============================================================================
# NOTE: Doesn't work on SVG images at this time.
# I will look into this further: https://stackoverflow.com/a/6599172/1896134
# =============================================================================
import requests # to get image from the web
import shutil # to save it locally
import os # needed
from os.path import exists as filepathexist # check if file paths exist
from os.path import join # joins path for different os
from os.path import expanduser # expands current home
from pyuser_agent import UA # generates random UserAgent
class ImageDownloader(object):
"""URL ImageDownloader.
Input : Full Image URL
Output: Image saved to your ~/Pictures/JayRizzoDL folder.
"""
def __init__(self, URL: str):
self.url = URL
self.headers = {"User-Agent" : UA().random}
self.currentHome = expanduser('~')
self.desktop = join(self.currentHome + "/Desktop/")
self.download = join(self.currentHome + "/Downloads/")
self.pictures = join(self.currentHome + "/Pictures/JayRizzoDL/")
self.outfile = ""
self.filename = ""
self.response = ""
self.rawstream = ""
self.createdfilepath = ""
self.imgFileName = ""
# Check if the JayRizzoDL exists in the pictures folder.
# if it doesn't exist create it.
if not filepathexist(self.pictures):
os.mkdir(self.pictures)
self.main()
def getFileNameFromURL(self, URL: str):
"""Parse the URL for the name after the last forward slash."""
NewFileName = self.url.strip().split('/')[-1].strip()
return NewFileName
def getResponse(self, URL: str):
"""Try streaming the URL for the raw data."""
self.response = requests.get(self.url, headers=self.headers, stream=True)
return self.response
def gocreateFile(self, name: str, response):
"""Try creating the file with the raw data in a custom folder."""
self.outfile = join(self.pictures, name)
with open(self.outfile, 'wb') as outFilePath:
shutil.copyfileobj(response.raw, outFilePath)
return self.outfile
def main(self):
"""Combine Everything and use in for loops."""
self.filename = self.getFileNameFromURL(self.url)
self.rawstream = self.getResponse(self.url)
self.createdfilepath = self.gocreateFile(self.filename, self.rawstream)
print(f"File was created: {self.createdfilepath}")
return
if __name__ == '__main__':
# Example when calling the file directly.
ImageDownloader("https://stackoverflow.design/assets/img/logos/so/logo-stackoverflow.png")
Download Image file, with avoiding all possible error:
import requests
import validators
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
def is_downloadable(url):
valid=validators. url(url)
if valid==False:
return False
req = Request(url)
try:
response = urlopen(req)
except HTTPError as e:
return False
except URLError as e:
return False
else:
return True
for i in range(len(File_data)): #File data Contain list of address for image
#file
url = File_data[i][1]
try:
if (is_downloadable(url)):
try:
r = requests.get(url, allow_redirects=True)
if url.find('/'):
fname = url.rsplit('/', 1)[1]
fname = pth+File_data[i][0]+"$"+fname #Destination to save
#image file
open(fname, 'wb').write(r.content)
except Exception as e:
print(e)
except Exception as e:
print(e)
I really like how I can easily share files on a network using the SimpleHTTPServer, but I wish there was an option like "download entire directory". Is there an easy (one liner) way to implement this?
Thanks
I did that modification for you, I don't know if there'are better ways to do that but:
Just save the file (Ex.: ThreadedHTTPServer.py) and access as:
$ python -m /path/to/ThreadedHTTPServer PORT
BPaste Raw Version
The modification also works in threaded way so you won't have problem with download and navigation in the same time, the code aren't organized but:
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
from SocketServer import ThreadingMixIn
import threading
import SimpleHTTPServer
import sys, os, zipfile
PORT = int(sys.argv[1])
def send_head(self):
"""Common code for GET and HEAD commands.
This sends the response code and MIME headers.
Return value is either a file object (which has to be copied
to the outputfile by the caller unless the command was HEAD,
and must be closed by the caller under all circumstances), or
None, in which case the caller has nothing further to do.
"""
path = self.translate_path(self.path)
f = None
if self.path.endswith('?download'):
tmp_file = "tmp.zip"
self.path = self.path.replace("?download","")
zip = zipfile.ZipFile(tmp_file, 'w')
for root, dirs, files in os.walk(path):
for file in files:
if os.path.join(root, file) != os.path.join(root, tmp_file):
zip.write(os.path.join(root, file))
zip.close()
path = self.translate_path(tmp_file)
elif os.path.isdir(path):
if not self.path.endswith('/'):
# redirect browser - doing basically what apache does
self.send_response(301)
self.send_header("Location", self.path + "/")
self.end_headers()
return None
else:
for index in "index.html", "index.htm":
index = os.path.join(path, index)
if os.path.exists(index):
path = index
break
else:
return self.list_directory(path)
ctype = self.guess_type(path)
try:
# Always read in binary mode. Opening files in text mode may cause
# newline translations, making the actual size of the content
# transmitted *less* than the content-length!
f = open(path, 'rb')
except IOError:
self.send_error(404, "File not found")
return None
self.send_response(200)
self.send_header("Content-type", ctype)
fs = os.fstat(f.fileno())
self.send_header("Content-Length", str(fs[6]))
self.send_header("Last-Modified", self.date_time_string(fs.st_mtime))
self.end_headers()
return f
def list_directory(self, path):
try:
from cStringIO import StringIO
except ImportError:
from StringIO import StringIO
import cgi, urllib
"""Helper to produce a directory listing (absent index.html).
Return value is either a file object, or None (indicating an
error). In either case, the headers are sent, making the
interface the same as for send_head().
"""
try:
list = os.listdir(path)
except os.error:
self.send_error(404, "No permission to list directory")
return None
list.sort(key=lambda a: a.lower())
f = StringIO()
displaypath = cgi.escape(urllib.unquote(self.path))
f.write('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">')
f.write("<html>\n<title>Directory listing for %s</title>\n" % displaypath)
f.write("<body>\n<h2>Directory listing for %s</h2>\n" % displaypath)
f.write("<a href='%s'>%s</a>\n" % (self.path+"?download",'Download Directory Tree as Zip'))
f.write("<hr>\n<ul>\n")
for name in list:
fullname = os.path.join(path, name)
displayname = linkname = name
# Append / for directories or # for symbolic links
if os.path.isdir(fullname):
displayname = name + "/"
linkname = name + "/"
if os.path.islink(fullname):
displayname = name + "#"
# Note: a link to a directory displays with # and links with /
f.write('<li>%s\n'
% (urllib.quote(linkname), cgi.escape(displayname)))
f.write("</ul>\n<hr>\n</body>\n</html>\n")
length = f.tell()
f.seek(0)
self.send_response(200)
encoding = sys.getfilesystemencoding()
self.send_header("Content-type", "text/html; charset=%s" % encoding)
self.send_header("Content-Length", str(length))
self.end_headers()
return f
Handler = SimpleHTTPServer.SimpleHTTPRequestHandler
Handler.send_head = send_head
Handler.list_directory = list_directory
class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
"""Handle requests in a separate thread."""
if __name__ == '__main__':
server = ThreadedHTTPServer(('0.0.0.0', PORT), Handler)
print 'Starting server, use <Ctrl-C> to stop'
server.serve_forever()
Look at the sources, e.g. online here. Right now, if you call the server with a URL that's a directory, its index.html file is served, or, missing that, the list_directory method is called. Presumably, you want instead to make a zip file with the directory's contents (recursively, I imagine), and serve that? Obviously there's no way to do it with a one-line change, since you want to replace what are now lines 68-80 (in method send_head) plus the whole of method list_directory, lines 98-137 -- that's already at least a change to over 50 lines;-).
If you're OK with a change of several dozen lines, not one, and the semantics I've described are what you want, you could of course build the required zipfile as a cStringIO.StringIO object with the ZipFile class, and populate it with an os.walk on the directory in question (assuming you want, recursively, to get all subdirectories as well). But it's most definitely not going to be a one-liner;-).
There is no one liner which would do it, also what do you mean by "download whole dir" as tar or zip?
Anyway you can follow these steps
Derive a class from SimpleHTTPRequestHandler or may be just copy its code
Change list_directory method to return a link to "download whole folder"
Change copyfile method so that for your links you zip whole dir and return it
You may cache zip so that you do not zip folder every time, instead see if any file is modified or not
Would be a fun exercise to do :)
There is no simple way.
An alternative is to use the python script below to download the whole folder recursively. This works well for Python 3. Change the URL as needed.
import os
from pathlib import Path
from urllib.parse import urlparse, urljoin
import requests
from bs4 import BeautifulSoup
def get_links(content):
soup = BeautifulSoup(content)
for a in soup.findAll('a'):
yield a.get('href')
def download(url):
path = urlparse(url).path.lstrip('/')
print(path)
r = requests.get(url)
if r.status_code != 200:
raise Exception('status code is {} for {}'.format(r.status_code, url))
content = r.text
if path.endswith('/'):
Path(path.rstrip('/')).mkdir(parents=True, exist_ok=True)
for link in get_links(content):
if not link.startswith('.'): # skip hidden files such as .DS_Store
download(urljoin(url, link))
else:
with open(path, 'w') as f:
f.write(content)
if __name__ == '__main__':
# the trailing / indicates a folder
url = 'http://ed470d37.ngrok.io/a/bc/'
download(url)
I like #mononoke 's solution. But there are several problems in it. They are
write files in text mode
sometimeshref and text are different, especially for non-ascii path
not download large file block-wisely
I tried to fix these problems:
import os
from pathlib import Path
from urllib.parse import urlparse, urljoin
import requests
from bs4 import BeautifulSoup
import math
def get_links(content):
soup = BeautifulSoup(content)
for a in soup.findAll('a'):
yield a.get('href'), a.get_text()
def download(url, path=None, overwrite=False):
if path is None:
path = urlparse(url).path.lstrip('/')
if url.endswith('/'):
r = requests.get(url)
if r.status_code != 200:
raise Exception('status code is {} for {}'.format(r.status_code, url))
content = r.text
Path(path.rstrip('/')).mkdir(parents=True, exist_ok=True)
for link, name in get_links(content):
if not link.startswith('.'): # skip hidden files such as .DS_Store
download(urljoin(url, link), os.path.join(path, name))
else:
if os.path.isfile(path):
print("#existing", path)
if not overwrite:
return
chunk_size = 1024*1024
r = requests.get(url, stream=True)
content_size = int(r.headers['content-length'])
total = math.ceil(content_size / chunk_size)
print("#", path)
with open(path, 'wb') as f:
c = 0
st = 100
for chunk in r.iter_content(chunk_size=chunk_size):
c += 1
if chunk:
f.write(chunk)
ap = int(c*st/total) - int((c-1)*st/total)
if ap > 0:
print("#" * ap, end="")
print("\r "," "*int(c*st/total), "\r", end="")
if __name__ == '__main__':
# the trailing / indicates a folder
url = 'http://ed470d37.ngrok.io/a/bc/'
download(url, "/data/bc")
I have a small utility that I use to download an MP3 file from a website on a schedule and then builds/updates a podcast XML file which I've added to iTunes.
The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3 file. I would prefer to have the entire utility written in Python.
I struggled to find a way to actually download the file in Python, thus why I resorted to using wget.
So, how do I download the file using Python?
One more, using urlretrieve:
import urllib.request
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
(for Python 2 use import urllib and urllib.urlretrieve)
Use urllib.request.urlopen():
import urllib.request
with urllib.request.urlopen('http://www.example.com/') as f:
html = f.read().decode('utf-8')
This is the most basic way to use the library, minus any error handling. You can also do more complex stuff such as changing headers.
On Python 2, the method is in urllib2:
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
In 2012, use the python requests library
>>> import requests
>>>
>>> url = "http://download.thinkbroadband.com/10MB.zip"
>>> r = requests.get(url)
>>> print len(r.content)
10485760
You can run pip install requests to get it.
Requests has many advantages over the alternatives because the API is much simpler. This is especially true if you have to do authentication. urllib and urllib2 are pretty unintuitive and painful in this case.
2015-12-30
People have expressed admiration for the progress bar. It's cool, sure. There are several off-the-shelf solutions now, including tqdm:
from tqdm import tqdm
import requests
url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)
with open("10MB", "wb") as handle:
for data in tqdm(response.iter_content()):
handle.write(data)
This is essentially the implementation #kvance described 30 months ago.
import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3','wb') as output:
output.write(mp3file.read())
The wb in open('test.mp3','wb') opens a file (and erases any existing file) in binary mode so you can save data with it instead of just text.
Python 3
urllib.request.urlopen
import urllib.request
response = urllib.request.urlopen('http://www.example.com/')
html = response.read()
urllib.request.urlretrieve
import urllib.request
urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
Note: According to the documentation, urllib.request.urlretrieve is a "legacy interface" and "might become deprecated in the future" (thanks gerrit)
Python 2
urllib2.urlopen (thanks Corey)
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
urllib.urlretrieve (thanks PabloG)
import urllib
urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
use wget module:
import wget
wget.download('url')
import os,requests
def download(url):
get_response = requests.get(url,stream=True)
file_name = url.split("/")[-1]
with open(file_name, 'wb') as f:
for chunk in get_response.iter_content(chunk_size=1024):
if chunk: # filter out keep-alive new chunks
f.write(chunk)
download("https://example.com/example.jpg")
An improved version of the PabloG code for Python 2/3:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import ( division, absolute_import, print_function, unicode_literals )
import sys, os, tempfile, logging
if sys.version_info >= (3,):
import urllib.request as urllib2
import urllib.parse as urlparse
else:
import urllib2
import urlparse
def download_file(url, dest=None):
"""
Download and save a file specified by url to dest directory,
"""
u = urllib2.urlopen(url)
scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
filename = os.path.basename(path)
if not filename:
filename = 'downloaded.file'
if dest:
filename = os.path.join(dest, filename)
with open(filename, 'wb') as f:
meta = u.info()
meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
meta_length = meta_func("Content-Length")
file_size = None
if meta_length:
file_size = int(meta_length[0])
print("Downloading: {0} Bytes: {1}".format(url, file_size))
file_size_dl = 0
block_sz = 8192
while True:
buffer = u.read(block_sz)
if not buffer:
break
file_size_dl += len(buffer)
f.write(buffer)
status = "{0:16}".format(file_size_dl)
if file_size:
status += " [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
status += chr(13)
print(status, end="")
print()
return filename
if __name__ == "__main__": # Only run if this file is called directly
print("Testing with 10MB download")
url = "http://download.thinkbroadband.com/10MB.zip"
filename = download_file(url)
print(filename)
Simple yet Python 2 & Python 3 compatible way comes with six library:
from six.moves import urllib
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
Following are the most commonly used calls for downloading files in python:
urllib.urlretrieve ('url_to_file', file_name)
urllib2.urlopen('url_to_file')
requests.get(url)
wget.download('url', file_name)
Note: urlopen and urlretrieve are found to perform relatively bad with downloading large files (size > 500 MB). requests.get stores the file in-memory until download is complete.
Wrote wget library in pure Python just for this purpose. It is pumped up urlretrieve with these features as of version 2.0.
In python3 you can use urllib3 and shutil libraires.
Download them by using pip or pip3 (Depending whether python3 is default or not)
pip3 install urllib3 shutil
Then run this code
import urllib.request
import shutil
url = "http://www.somewebsite.com/something.pdf"
output_file = "save_this_name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
shutil.copyfileobj(response, out_file)
Note that you download urllib3 but use urllib in code
I agree with Corey, urllib2 is more complete than urllib and should likely be the module used if you want to do more complex things, but to make the answers more complete, urllib is a simpler module if you want just the basics:
import urllib
response = urllib.urlopen('http://www.example.com/sound.mp3')
mp3 = response.read()
Will work fine. Or, if you don't want to deal with the "response" object you can call read() directly:
import urllib
mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()
If you have wget installed, you can use parallel_sync.
pip install parallel_sync
from parallel_sync import wget
urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
wget.download('/tmp', urls)
# or a single file:
wget.download('/tmp', urls[0], filenames='x.zip', extract=True)
Doc:
https://pythonhosted.org/parallel_sync/pages/examples.html
This is pretty powerful. It can download files in parallel, retry upon failure , and it can even download files on a remote machine.
You can get the progress feedback with urlretrieve as well:
def report(blocknr, blocksize, size):
current = blocknr*blocksize
sys.stdout.write("\r{0:.2f}%".format(100.0*current/size))
def downloadFile(url):
print "\n",url
fname = url.split('/')[-1]
print fname
urllib.urlretrieve(url, fname, report)
If speed matters to you, I made a small performance test for the modules urllib and wget, and regarding wget I tried once with status bar and once without. I took three different 500MB files to test with (different files- to eliminate the chance that there is some caching going on under the hood). Tested on debian machine, with python2.
First, these are the results (they are similar in different runs):
$ python wget_test.py
urlretrive_test : starting
urlretrive_test : 6.56
==============
wget_no_bar_test : starting
wget_no_bar_test : 7.20
==============
wget_with_bar_test : starting
100% [......................................................................] 541335552 / 541335552
wget_with_bar_test : 50.49
==============
The way I performed the test is using "profile" decorator. This is the full code:
import wget
import urllib
import time
from functools import wraps
def profile(func):
#wraps(func)
def inner(*args):
print func.__name__, ": starting"
start = time.time()
ret = func(*args)
end = time.time()
print func.__name__, ": {:.2f}".format(end - start)
return ret
return inner
url1 = 'http://host.com/500a.iso'
url2 = 'http://host.com/500b.iso'
url3 = 'http://host.com/500c.iso'
def do_nothing(*args):
pass
#profile
def urlretrive_test(url):
return urllib.urlretrieve(url)
#profile
def wget_no_bar_test(url):
return wget.download(url, out='/tmp/', bar=do_nothing)
#profile
def wget_with_bar_test(url):
return wget.download(url, out='/tmp/')
urlretrive_test(url1)
print '=============='
time.sleep(1)
wget_no_bar_test(url2)
print '=============='
time.sleep(1)
wget_with_bar_test(url3)
print '=============='
time.sleep(1)
urllib seems to be the fastest
Just for the sake of completeness, it is also possible to call any program for retrieving files using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-R), can deal with FTP, redirects, HTTP proxies, can avoid re-downloading existing files (-nc), and aria2 can do multi-connection downloads which can potentially speed up your downloads.
import subprocess
subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])
In Jupyter Notebook, one can also call programs directly with the ! syntax:
!wget -O example_output_file.html https://example.com
Late answer, but for python>=3.6 you can use:
import dload
dload.save(url)
Install dload with:
pip3 install dload
Source code can be:
import urllib
sock = urllib.urlopen("http://diveintopython.org/")
htmlSource = sock.read()
sock.close()
print htmlSource
I wrote the following, which works in vanilla Python 2 or Python 3.
import sys
try:
import urllib.request
python3 = True
except ImportError:
import urllib2
python3 = False
def progress_callback_simple(downloaded,total):
sys.stdout.write(
"\r" +
(len(str(total))-len(str(downloaded)))*" " + str(downloaded) + "/%d"%total +
" [%3.2f%%]"%(100.0*float(downloaded)/float(total))
)
sys.stdout.flush()
def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
def _download_helper(response, out_file, file_size):
if progress_callback!=None: progress_callback(0,file_size)
if block_size == None:
buffer = response.read()
out_file.write(buffer)
if progress_callback!=None: progress_callback(file_size,file_size)
else:
file_size_dl = 0
while True:
buffer = response.read(block_size)
if not buffer: break
file_size_dl += len(buffer)
out_file.write(buffer)
if progress_callback!=None: progress_callback(file_size_dl,file_size)
with open(dstfilepath,"wb") as out_file:
if python3:
with urllib.request.urlopen(srcurl) as response:
file_size = int(response.getheader("Content-Length"))
_download_helper(response,out_file,file_size)
else:
response = urllib2.urlopen(srcurl)
meta = response.info()
file_size = int(meta.getheaders("Content-Length")[0])
_download_helper(response,out_file,file_size)
import traceback
try:
download(
"https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
"output.zip",
progress_callback_simple
)
except:
traceback.print_exc()
input()
Notes:
Supports a "progress bar" callback.
Download is a 4 MB test .zip from my website.
You can use PycURL on Python 2 and 3.
import pycurl
FILE_DEST = 'pycurl.html'
FILE_SRC = 'http://pycurl.io/'
with open(FILE_DEST, 'wb') as f:
c = pycurl.Curl()
c.setopt(c.URL, FILE_SRC)
c.setopt(c.WRITEDATA, f)
c.perform()
c.close()
Use Python Requests in 5 lines
import requests as req
remote_url = 'http://www.example.com/sound.mp3'
local_file_name = 'sound.mp3'
data = req.get(remote_url)
# Save file data to local copy
with open(local_file_name, 'wb')as file:
file.write(data.content)
Now do something with the local copy of the remote file
This may be a little late, But I saw pabloG's code and couldn't help adding a os.system('cls') to make it look AWESOME! Check it out :
import urllib2,os
url = "http://download.thinkbroadband.com/10MB.zip"
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
os.system('cls')
file_size_dl = 0
block_sz = 8192
while True:
buffer = u.read(block_sz)
if not buffer:
break
file_size_dl += len(buffer)
f.write(buffer)
status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
status = status + chr(8)*(len(status)+1)
print status,
f.close()
If running in an environment other than Windows, you will have to use something other then 'cls'. In MAC OS X and Linux it should be 'clear'.
urlretrieve and requests.get are simple, however the reality not.
I have fetched data for couple sites, including text and images, the above two probably solve most of the tasks. but for a more universal solution I suggest the use of urlopen. As it is included in Python 3 standard library, your code could run on any machine that run Python 3 without pre-installing site-package
import urllib.request
url_request = urllib.request.Request(url, headers=headers)
url_connect = urllib.request.urlopen(url_request)
#remember to open file in bytes mode
with open(filename, 'wb') as f:
while True:
buffer = url_connect.read(buffer_size)
if not buffer: break
#an integer value of size of written data
data_wrote = f.write(buffer)
#you could probably use with-open-as manner
url_connect.close()
This answer provides a solution to HTTP 403 Forbidden when downloading file over http using Python. I have tried only requests and urllib modules, the other module may provide something better, but this is the one I used to solve most of the problems.
New Api urllib3 based implementation
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'your_url_goes_here')
>>> r.status
200
>>> r.data
*****Response Data****
More info: https://pypi.org/project/urllib3/
You can python requests
import os
import requests
outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(URL, stream=True)
with open(outfile,'wb') as output:
output.write(response.content)
You can use shutil
import os
import requests
import shutil
outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(url, stream = True)
with open(outfile, 'wb') as f:
shutil.copyfileobj(response.content, f)
If you are downloading from restricted url, don't forget to include access token in headers
I wanted do download all the files from a webpage. I tried wget but it was failing so I decided for the Python route and I found this thread.
After reading it, I have made a little command line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.
It uses BeatifulSoup to collect all the URLs of the page and then download the ones with the desired extension(s). Finally it can download multiple files in parallel.
Here it is:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import (division, absolute_import, print_function, unicode_literals)
import sys, os, argparse
from bs4 import BeautifulSoup
# --- insert Stan's script here ---
# if sys.version_info >= (3,):
#...
#...
# def download_file(url, dest=None):
#...
#...
# --- new stuff ---
def collect_all_url(page_url, extensions):
"""
Recovers all links in page_url checking for all the desired extensions
"""
conn = urllib2.urlopen(page_url)
html = conn.read()
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('a')
results = []
for tag in links:
link = tag.get('href', None)
if link is not None:
for e in extensions:
if e in link:
# Fallback for badly defined links
# checks for missing scheme or netloc
if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
results.append(link)
else:
new_url=urlparse.urljoin(page_url,link)
results.append(new_url)
return results
if __name__ == "__main__": # Only run if this file is called directly
# Command line arguments
parser = argparse.ArgumentParser(
description='Download all files from a webpage.')
parser.add_argument(
'-u', '--url',
help='Page url to request')
parser.add_argument(
'-e', '--ext',
nargs='+',
help='Extension(s) to find')
parser.add_argument(
'-d', '--dest',
default=None,
help='Destination where to save the files')
parser.add_argument(
'-p', '--par',
action='store_true', default=False,
help="Turns on parallel download")
args = parser.parse_args()
# Recover files to download
all_links = collect_all_url(args.url, args.ext)
# Download
if not args.par:
for l in all_links:
try:
filename = download_file(l, args.dest)
print(l)
except Exception as e:
print("Error while downloading: {}".format(e))
else:
from multiprocessing.pool import ThreadPool
results = ThreadPool(10).imap_unordered(
lambda x: download_file(x, args.dest), all_links)
for p in results:
print(p)
An example of its usage is:
python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>
And an actual example if you want to see it in action:
python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics
Another possibility is with built-in http.client:
from http import HTTPStatus, client
from shutil import copyfileobj
# using https
connection = client.HTTPSConnection("www.example.com")
with connection.request("GET", "/noise.mp3") as response:
if response.status == HTTPStatus.OK:
copyfileobj(response, open("noise.mp3")
else:
raise Exception("request needs work")
The HTTPConnection object is considered “low-level” in that it performs the desired request once and assumes the developer will subclass it or script in a way to handle the nuances of HTTP. Libraries such as requests tend to handle more special cases such as automatically following redirects and so on.
You can use keras.utils.get_file to do it:
from tensorflow import keras
path_to_downloaded_file = keras.utils.get_file(
fname="file name",
origin="https://www.linktofile.com/link/to/file",
extract=True,
archive_format="zip", # downloaded file format
cache_dir="/", # cache and extract in current directory
)
Another way is to call an external process such as curl.exe. Curl by default displays a progress bar, average download speed, time left, and more all formatted neatly in a table.
Put curl.exe in the same directory as your script
from subprocess import call
url = ""
call(["curl", {url}, '--output', "song.mp3"])
Note: You cannot specify an output path with curl, so do an os.rename afterwards