How to download a file over HTTP? - Python
I have a small utility that I use to download an MP3 file from a website on a schedule; it then builds/updates a podcast XML file which I've added to iTunes.
The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3 file. I would prefer to have the entire utility written in Python.
I struggled to find a way to actually download the file in Python, which is why I resorted to wget.
So, how do I download the file using Python?
One more, using urlretrieve:
import urllib.request
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
(for Python 2 use import urllib and urllib.urlretrieve)
Use urllib.request.urlopen():
import urllib.request
with urllib.request.urlopen('http://www.example.com/') as f:
    html = f.read().decode('utf-8')
This is the most basic way to use the library, minus any error handling. You can also do more complex stuff such as changing headers.
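For instance, changing headers means wrapping the URL in a Request object first. A minimal sketch, assuming a made-up User-Agent value:
import urllib.request

# the URL and the header value are placeholders for illustration
req = urllib.request.Request(
    'http://www.example.com/',
    headers={'User-Agent': 'my-podcast-tool/1.0'},
)
with urllib.request.urlopen(req) as f:
    html = f.read().decode('utf-8')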
On Python 2, the method is in urllib2:
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
In 2012, use the Python requests library:
>>> import requests
>>>
>>> url = "http://download.thinkbroadband.com/10MB.zip"
>>> r = requests.get(url)
>>> print len(r.content)
10485760
You can run pip install requests to get it.
Requests has many advantages over the alternatives because the API is much simpler. This is especially true if you have to do authentication. urllib and urllib2 are pretty unintuitive and painful in this case.
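As a rough sketch of what that simplicity looks like (the URL and credentials below are made up), HTTP basic auth is a single keyword argument in requests:
import requests

# hypothetical URL and credentials, for illustration only
r = requests.get('https://api.example.com/file.zip', auth=('user', 'passwd'))
with open('file.zip', 'wb') as f:
    f.write(r.content)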
2015-12-30
People have expressed admiration for the progress bar. It's cool, sure. There are several off-the-shelf solutions now, including tqdm:
from tqdm import tqdm
import requests
url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)
with open("10MB", "wb") as handle:
for data in tqdm(response.iter_content()):
handle.write(data)
This is essentially the implementation @kvance described 30 months ago.
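If you also want the bar to show a percentage, one possible variation (not part of the original answer) reads the Content-Length header and passes it to tqdm as the total; the header may be missing, so it falls back to 0:
from tqdm import tqdm
import requests

url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)
# Content-Length may be absent; tqdm treats total=0 as unknown
total = int(response.headers.get("content-length", 0))
with open("10MB", "wb") as handle:
    with tqdm(total=total, unit="B", unit_scale=True) as bar:
        for data in response.iter_content(chunk_size=1024):
            handle.write(data)
            bar.update(len(data))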
import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3', 'wb') as output:
    output.write(mp3file.read())
The wb in open('test.mp3','wb') opens a file (and erases any existing file) in binary mode so you can save data with it instead of just text.
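Note that mp3file.read() pulls the entire file into memory before writing. For large files, a chunked variant of the same idea (a sketch, not from the original answer) writes as it reads:
import urllib2

mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3', 'wb') as output:
    while True:
        chunk = mp3file.read(16 * 1024)  # read 16 KB at a time
        if not chunk:
            break
        output.write(chunk)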
Python 3
urllib.request.urlopen
import urllib.request
response = urllib.request.urlopen('http://www.example.com/')
html = response.read()
urllib.request.urlretrieve
import urllib.request
urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
Note: According to the documentation, urllib.request.urlretrieve is a "legacy interface" and "might become deprecated in the future" (thanks gerrit)
Python 2
urllib2.urlopen (thanks Corey)
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
urllib.urlretrieve (thanks PabloG)
import urllib
urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
Use the wget module:
import wget
wget.download('url')
import os, requests

def download(url):
    get_response = requests.get(url, stream=True)
    file_name = url.split("/")[-1]
    with open(file_name, 'wb') as f:
        for chunk in get_response.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
download("https://example.com/example.jpg")
An improved version of the PabloG code for Python 2/3:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import ( division, absolute_import, print_function, unicode_literals )
import sys, os, tempfile, logging
if sys.version_info >= (3,):
    import urllib.request as urllib2
    import urllib.parse as urlparse
else:
    import urllib2
    import urlparse
def download_file(url, dest=None):
    """
    Download and save a file specified by url to dest directory.
    """
    u = urllib2.urlopen(url)
    scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
    filename = os.path.basename(path)
    if not filename:
        filename = 'downloaded.file'
    if dest:
        filename = os.path.join(dest, filename)
    with open(filename, 'wb') as f:
        meta = u.info()
        meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
        meta_length = meta_func("Content-Length")
        file_size = None
        if meta_length:
            file_size = int(meta_length[0])
        print("Downloading: {0} Bytes: {1}".format(url, file_size))
        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            status = "{0:16}".format(file_size_dl)
            if file_size:
                status += " [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
            status += chr(13)
            print(status, end="")
        print()
    return filename
if __name__ == "__main__":  # Only run if this file is called directly
    print("Testing with 10MB download")
    url = "http://download.thinkbroadband.com/10MB.zip"
    filename = download_file(url)
    print(filename)
A simple yet Python 2 & Python 3 compatible way comes with the six library:
from six.moves import urllib
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
The following are the most commonly used calls for downloading files in Python:
urllib.urlretrieve('url_to_file', file_name)
urllib2.urlopen('url_to_file')
requests.get(url)
wget.download('url', file_name)
Note: urlopen and urlretrieve were found to perform relatively badly when downloading large files (size > 500 MB). requests.get stores the whole file in memory until the download is complete (unless you pass stream=True, as sketched below).
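A minimal streaming sketch with requests, assuming a placeholder URL, that avoids holding the whole file in memory:
import requests

url = "http://www.example.com/bigfile.zip"  # placeholder URL
with requests.get(url, stream=True) as r:
    with open("bigfile.zip", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)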
I wrote the wget library in pure Python just for this purpose. As of version 2.0, it is urlretrieve pumped up with these features.
In Python 3 you can use the urllib.request and shutil libraries.
Both ship with the standard library, so there is nothing to install; don't confuse urllib.request with the third-party urllib3 package.
Then run this code:
import urllib.request
import shutil

url = "http://www.somewebsite.com/something.pdf"
output_file = "save_this_name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
I agree with Corey, urllib2 is more complete than urllib and should likely be the module used if you want to do more complex things, but to make the answers more complete, urllib is a simpler module if you want just the basics:
import urllib
response = urllib.urlopen('http://www.example.com/sound.mp3')
mp3 = response.read()
Will work fine. Or, if you don't want to deal with the "response" object you can call read() directly:
import urllib
mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()
If you have wget installed, you can use parallel_sync.
pip install parallel_sync
from parallel_sync import wget
urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
wget.download('/tmp', urls)
# or a single file:
wget.download('/tmp', urls[0], filenames='x.zip', extract=True)
Doc:
https://pythonhosted.org/parallel_sync/pages/examples.html
This is pretty powerful. It can download files in parallel, retry upon failure, and it can even download files on a remote machine.
You can get progress feedback with urlretrieve as well:
import sys
import urllib

def report(blocknr, blocksize, size):
    current = blocknr * blocksize
    sys.stdout.write("\r{0:.2f}%".format(100.0 * current / size))

def downloadFile(url):
    print "\n", url
    fname = url.split('/')[-1]
    print fname
    urllib.urlretrieve(url, fname, report)
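For Python 3 the same reporthook idea looks roughly like this (a sketch; urlretrieve passes the block number, block size, and total size to the callback):
import sys
import urllib.request

def report(blocknr, blocksize, size):
    current = blocknr * blocksize
    sys.stdout.write("\r{0:.2f}%".format(100.0 * current / size))

def download_file(url):
    fname = url.split('/')[-1]
    urllib.request.urlretrieve(url, fname, report)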
If speed matters to you, I made a small performance test of the urllib and wget modules; for wget I tried once with the status bar and once without. I took three different 500 MB files to test with (different files, to eliminate the chance that there is some caching going on under the hood). Tested on a Debian machine, with Python 2.
First, these are the results (they are similar in different runs):
$ python wget_test.py
urlretrive_test : starting
urlretrive_test : 6.56
==============
wget_no_bar_test : starting
wget_no_bar_test : 7.20
==============
wget_with_bar_test : starting
100% [......................................................................] 541335552 / 541335552
wget_with_bar_test : 50.49
==============
I performed the test using a "profile" decorator. This is the full code:
import wget
import urllib
import time
from functools import wraps
def profile(func):
    @wraps(func)
    def inner(*args):
        print func.__name__, ": starting"
        start = time.time()
        ret = func(*args)
        end = time.time()
        print func.__name__, ": {:.2f}".format(end - start)
        return ret
    return inner

url1 = 'http://host.com/500a.iso'
url2 = 'http://host.com/500b.iso'
url3 = 'http://host.com/500c.iso'

def do_nothing(*args):
    pass

@profile
def urlretrive_test(url):
    return urllib.urlretrieve(url)

@profile
def wget_no_bar_test(url):
    return wget.download(url, out='/tmp/', bar=do_nothing)

@profile
def wget_with_bar_test(url):
    return wget.download(url, out='/tmp/')

urlretrive_test(url1)
print '=============='
time.sleep(1)
wget_no_bar_test(url2)
print '=============='
time.sleep(1)
wget_with_bar_test(url3)
print '=============='
time.sleep(1)
urllib seems to be the fastest
Just for the sake of completeness, it is also possible to call any program for retrieving files using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-r), can deal with FTP, redirects, and HTTP proxies, can avoid re-downloading existing files (-nc), and aria2 can do multi-connection downloads, which can potentially speed up your downloads.
import subprocess
subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])
In Jupyter Notebook, one can also call programs directly with the ! syntax:
!wget -O example_output_file.html https://example.com
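Along the same lines, a hedged sketch of the aria2 idea mentioned above (the URL is a placeholder, the flag values are typical choices): -x sets the maximum connections per server and -s the number of splits:
import subprocess

# assumes aria2c is installed and on PATH
subprocess.check_call([
    'aria2c', '-x', '16', '-s', '16',
    '-o', 'example_output_file.zip',
    'https://example.com/bigfile.zip',  # placeholder URL
])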
Late answer, but for python>=3.6 you can use:
import dload
dload.save(url)
Install dload with:
pip3 install dload
In Python 2, the source code can be:
import urllib
sock = urllib.urlopen("http://diveintopython.org/")
htmlSource = sock.read()
sock.close()
print htmlSource
I wrote the following, which works in vanilla Python 2 or Python 3.
import sys
try:
    import urllib.request
    python3 = True
except ImportError:
    import urllib2
    python3 = False

def progress_callback_simple(downloaded, total):
    sys.stdout.write(
        "\r" +
        (len(str(total)) - len(str(downloaded))) * " " + str(downloaded) + "/%d" % total +
        " [%3.2f%%]" % (100.0 * float(downloaded) / float(total))
    )
    sys.stdout.flush()

def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
    def _download_helper(response, out_file, file_size):
        if progress_callback is not None: progress_callback(0, file_size)
        if block_size is None:
            buffer = response.read()
            out_file.write(buffer)
            if progress_callback is not None: progress_callback(file_size, file_size)
        else:
            file_size_dl = 0
            while True:
                buffer = response.read(block_size)
                if not buffer: break
                file_size_dl += len(buffer)
                out_file.write(buffer)
                if progress_callback is not None: progress_callback(file_size_dl, file_size)
    with open(dstfilepath, "wb") as out_file:
        if python3:
            with urllib.request.urlopen(srcurl) as response:
                file_size = int(response.getheader("Content-Length"))
                _download_helper(response, out_file, file_size)
        else:
            response = urllib2.urlopen(srcurl)
            meta = response.info()
            file_size = int(meta.getheaders("Content-Length")[0])
            _download_helper(response, out_file, file_size)

import traceback
try:
    download(
        "https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
        "output.zip",
        progress_callback_simple
    )
except:
    traceback.print_exc()
    input()
Notes:
Supports a "progress bar" callback.
Download is a 4 MB test .zip from my website.
You can use PycURL on Python 2 and 3.
import pycurl

FILE_DEST = 'pycurl.html'
FILE_SRC = 'http://pycurl.io/'

with open(FILE_DEST, 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, FILE_SRC)
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()
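Since PycURL is a thin wrapper over libcurl, its options map directly onto curl's. For instance, to follow redirects you would add a single setopt call; a sketch, the rest unchanged:
import pycurl

FILE_DEST = 'pycurl.html'
FILE_SRC = 'http://pycurl.io/'

with open(FILE_DEST, 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, FILE_SRC)
    c.setopt(c.FOLLOWLOCATION, True)  # follow HTTP 3xx redirects
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()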
Use Python Requests in 5 lines
import requests as req

remote_url = 'http://www.example.com/sound.mp3'
local_file_name = 'sound.mp3'

data = req.get(remote_url)
# Save file data to local copy
with open(local_file_name, 'wb') as file:
    file.write(data.content)
Now do something with the local copy of the remote file
This may be a little late, but I saw PabloG's code and couldn't help adding os.system('cls') to make it look AWESOME! Check it out:
import urllib2, os

url = "http://download.thinkbroadband.com/10MB.zip"
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
os.system('cls')
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,
f.close()
If running in an environment other than Windows, you will have to use something other than 'cls'. On macOS and Linux it should be 'clear'.
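One way to make that part portable (a small sketch) is to pick the command based on os.name:
import os

# 'nt' means Windows; everything else gets the POSIX 'clear'
os.system('cls' if os.name == 'nt' else 'clear')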
urlretrieve and requests.get are simple, but in reality they don't cover everything. I have fetched data from a couple of sites, including text and images, and the two above probably solve most tasks. But for a more universal solution I suggest urlopen: as it is included in the Python 3 standard library, your code can run on any machine with Python 3 without pre-installing any site packages.
import urllib.request

url = "http://www.example.com/file.bin"  # placeholder values for illustration
filename = "file.bin"
buffer_size = 64 * 1024
headers = {'User-Agent': 'Mozilla/5.0'}

url_request = urllib.request.Request(url, headers=headers)
url_connect = urllib.request.urlopen(url_request)

# remember to open the file in bytes mode
with open(filename, 'wb') as f:
    while True:
        buffer = url_connect.read(buffer_size)
        if not buffer: break
        # f.write returns the number of bytes written
        data_wrote = f.write(buffer)

# you could also manage the connection with a with-open-as block
url_connect.close()
This answer provides a solution to HTTP 403 Forbidden errors when downloading a file over HTTP with Python; sending a browser-like User-Agent header, as above, usually does the trick. I have tried only the requests and urllib modules; other modules may provide something better, but this is the one I used to solve most problems.
An implementation based on the newer urllib3 API:
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'your_url_goes_here')
>>> r.status
200
>>> r.data
*****Response Data****
More info: https://pypi.org/project/urllib3/
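If you want to write the body to disk instead of holding r.data in memory, urllib3 can stream it; a minimal sketch (the URL and file name are placeholders):
import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://www.example.com/file.zip', preload_content=False)
with open('file.zip', 'wb') as out:
    for chunk in r.stream(64 * 1024):  # read 64 KB at a time
        out.write(chunk)
r.release_conn()  # return the connection to the pool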
You can use Python requests:
import os
import requests

outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(URL, stream=True)
with open(outfile, 'wb') as output:
    output.write(response.content)
Or you can use shutil (note that copyfileobj needs a file-like object, so pass response.raw, not response.content):
import os
import requests
import shutil

outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(url, stream=True)
with open(outfile, 'wb') as f:
    shutil.copyfileobj(response.raw, f)
If you are downloading from a restricted URL, don't forget to include the access token in the headers, as sketched below.
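A hedged sketch of what that header might look like (the token and URL are placeholders):
import requests

headers = {'Authorization': 'Bearer YOUR_ACCESS_TOKEN'}  # placeholder token
response = requests.get('https://example.com/protected/file.zip',
                        headers=headers, stream=True)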
I wanted to download all the files from a webpage. I tried wget, but it was failing, so I decided on the Python route and found this thread.
After reading it, I made a little command-line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.
It uses BeautifulSoup to collect all the URLs of the page and then downloads the ones with the desired extension(s). Finally, it can download multiple files in parallel.
Here it is:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import (division, absolute_import, print_function, unicode_literals)
import sys, os, argparse
from bs4 import BeautifulSoup

# --- insert Stan's script here ---
# if sys.version_info >= (3,):
#     ...
# def download_file(url, dest=None):
#     ...

# --- new stuff ---
def collect_all_url(page_url, extensions):
    """
    Recovers all links in page_url checking for all the desired extensions
    """
    conn = urllib2.urlopen(page_url)
    html = conn.read()
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('a')
    results = []
    for tag in links:
        link = tag.get('href', None)
        if link is not None:
            for e in extensions:
                if e in link:
                    # Fallback for badly defined links
                    # checks for missing scheme or netloc
                    if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
                        results.append(link)
                    else:
                        new_url = urlparse.urljoin(page_url, link)
                        results.append(new_url)
    return results

if __name__ == "__main__":  # Only run if this file is called directly
    # Command line arguments
    parser = argparse.ArgumentParser(
        description='Download all files from a webpage.')
    parser.add_argument(
        '-u', '--url',
        help='Page url to request')
    parser.add_argument(
        '-e', '--ext',
        nargs='+',
        help='Extension(s) to find')
    parser.add_argument(
        '-d', '--dest',
        default=None,
        help='Destination where to save the files')
    parser.add_argument(
        '-p', '--par',
        action='store_true', default=False,
        help="Turns on parallel download")
    args = parser.parse_args()

    # Recover files to download
    all_links = collect_all_url(args.url, args.ext)

    # Download
    if not args.par:
        for l in all_links:
            try:
                filename = download_file(l, args.dest)
                print(l)
            except Exception as e:
                print("Error while downloading: {}".format(e))
    else:
        from multiprocessing.pool import ThreadPool
        results = ThreadPool(10).imap_unordered(
            lambda x: download_file(x, args.dest), all_links)
        for p in results:
            print(p)
An example of its usage is:
python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>
And an actual example if you want to see it in action:
python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics
Another possibility is the built-in http.client (note that HTTPConnection.request() returns None; the response comes from getresponse(), and the output file must be opened in binary mode):
from http import HTTPStatus, client
from shutil import copyfileobj

# using https
connection = client.HTTPSConnection("www.example.com")
connection.request("GET", "/noise.mp3")
response = connection.getresponse()
if response.status == HTTPStatus.OK:
    with open("noise.mp3", "wb") as f:
        copyfileobj(response, f)
else:
    raise Exception("request needs work")
The HTTPConnection object is considered “low-level” in that it performs the desired request once and assumes the developer will subclass it or script in a way to handle the nuances of HTTP. Libraries such as requests tend to handle more special cases such as automatically following redirects and so on.
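To illustrate what "low-level" means in practice, here is a rough sketch of checking for a redirect yourself, something requests would do automatically (the host and path are placeholders):
from http import HTTPStatus, client

connection = client.HTTPSConnection("www.example.com")
connection.request("GET", "/noise.mp3")
response = connection.getresponse()
if response.status in (HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND):
    # following the redirect is up to you: issue a new request here
    location = response.getheader("Location")
    print("redirected to", location)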
You can use keras.utils.get_file to do it:
from tensorflow import keras

path_to_downloaded_file = keras.utils.get_file(
    fname="file name",
    origin="https://www.linktofile.com/link/to/file",
    extract=True,
    archive_format="zip",  # downloaded file format
    cache_dir="/",         # cache and extract in current directory
)
Another way is to call an external process such as curl.exe. By default, curl displays a progress bar, average download speed, time left, and more, all formatted neatly in a table.
Put curl.exe in the same directory as your script.
from subprocess import call

url = ""
call(["curl", url, '--output', "song.mp3"])
Note: --output also accepts a path, so you can save the file directly where you want it.
Related
Python iterate file content returned by urlopen [duplicate]
how to download a .csv file from a web to computer using python [duplicate]
Python: download a file from an FTP server
I'm trying to download some public data files. I screenscrape to get the links to the files, which all look something like this:
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/nhanes/2001-2002/L28POC_B.xpt
I can't find any documentation on the Requests library website.
The requests library doesn't support ftp:// links. To download a file from an FTP server you could use urlretrieve:
import urllib.request

urllib.request.urlretrieve('ftp://server/path/to/file', 'file')
# if you need to pass credentials:
# urllib.request.urlretrieve('ftp://username:password@server/path/to/file', 'file')
Or urlopen:
import shutil
import urllib.request
from contextlib import closing

with closing(urllib.request.urlopen('ftp://server/path/to/file')) as r:
    with open('file', 'wb') as f:
        shutil.copyfileobj(r, f)
Python 2:
import shutil
import urllib2
from contextlib import closing

with closing(urllib2.urlopen('ftp://server/path/to/file')) as r:
    with open('file', 'wb') as f:
        shutil.copyfileobj(r, f)
You can try this:
import ftplib

path = 'pub/Health_Statistics/NCHS/nhanes/2001-2002/'
filename = 'L28POC_B.xpt'

ftp = ftplib.FTP("Server IP")
ftp.login("UserName", "Password")
ftp.cwd(path)
ftp.retrbinary("RETR " + filename, open(filename, 'wb').write)
ftp.quit()
Try using the wget library for Python. You can find the documentation for it here.
import wget

link = 'ftp://example.com/foo.txt'
wget.download(link)
Use urllib2. For more specifics, check out this example from doc.python.org. Here's a snippet from the tutorial that may help:
import urllib2

req = urllib2.Request('ftp://example.com')
response = urllib2.urlopen(req)
the_page = response.read()
import os
import ftplib
from contextlib import closing

# wrapped in a function so the return statements make sense;
# host, port, login, passwd and the filenames are parameters
def download_from_ftp(host, port, login, passwd, orig_filename, local_filename):
    with closing(ftplib.FTP()) as ftp:
        try:
            ftp.connect(host, port, 30*5)  # 5 mins timeout
            ftp.login(login, passwd)
            ftp.set_pasv(True)
            with open(local_filename, 'w+b') as f:
                res = ftp.retrbinary('RETR %s' % orig_filename, f.write)
                if not res.startswith('226 Transfer complete'):
                    print('Download of file {0} is not complete.'.format(orig_filename))
                    os.remove(local_filename)
                    return None
            return local_filename
        except:
            print('Error during download from FTP')
As several folks have noted, requests doesn't support FTP, but Python has other libraries that do. If you want to keep using the requests library, there is a requests-ftp package that adds FTP capability to it. I've used this library a little and it does work. The docs are full of warnings about code quality, though: as of 0.2.0 they say "This library was cowboyed together in about 4 hours of total work, has no tests, and relies on a few ugly hacks".
import requests, requests_ftp

requests_ftp.monkeypatch_session()
response = requests.get('ftp://example.com/foo.txt')
If you want to take advantage of recent Python versions' async features, you can use aioftp (from the same family of libraries and developers as the more popular aiohttp library). Here is a code example taken from their client tutorial:
client = aioftp.Client()
await client.connect("ftp.server.com")
await client.login("user", "pass")
await client.download("tmp/test.py", "foo.py", write_into=True)
urllib2.urlopen handles ftp links.
urlretrieve does not work for me, and the official documentation says such functions "might become deprecated at some point in the future".
import shutil
from urllib.request import URLopener

opener = URLopener()
url = 'ftp://ftp_domain/path/to/the/file'
store_path = 'path//to//your//local//storage'

with opener.open(url) as remote_file, open(store_path, 'wb') as local_file:
    shutil.copyfileobj(remote_file, local_file)
How to save an image locally using Python whose URL address I already know?
I know the URL of an image on the Internet, e.g. http://www.digimouth.com/news/media/2011/09/google-logo.jpg, which contains the logo of Google. Now, how can I download this image using Python, without actually opening the URL in a browser and saving the file manually?
Python 2
Here is a more straightforward way if all you want to do is save it as a file:
import urllib
urllib.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")
The second argument is the local path where the file should be saved.
Python 3
As SergO suggested, the code below should work with Python 3.
import urllib.request
urllib.request.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")
import urllib

resource = urllib.urlopen("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")
output = open("file01.jpg", "wb")
output.write(resource.read())
output.close()
file01.jpg will contain your image.
I wrote a script that does just this, and it is available on my GitHub for your use. I utilized BeautifulSoup to allow me to parse any website for images. If you will be doing much web scraping (or intend to use my tool) I suggest you sudo pip install BeautifulSoup. Information on BeautifulSoup is available here. For convenience, here is my code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import urllib

# use this image scraper from the location that
# you want to save scraped images to

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)

def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element tags
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print 'Downloading images to current working directory.'
    # compile our unicode list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        filename = each.split('/')[-1]
        urllib.urlretrieve(each, filename)
    return image_links

# a standard call looks like this
# get_images('http://www.wookmark.com')
This can be done with requests. Load the page and dump the binary content to a file.
import os
import requests

url = 'https://apod.nasa.gov/apod/image/1701/potw1636aN159_HST_2048.jpg'
page = requests.get(url)
f_ext = os.path.splitext(url)[-1]
f_name = 'img{}'.format(f_ext)
with open(f_name, 'wb') as f:
    f.write(page.content)
Python 3
urllib.request — Extensible library for opening URLs
from urllib.error import HTTPError
from urllib.request import urlretrieve

try:
    urlretrieve(image_url, image_local_path)
except FileNotFoundError as err:
    print(err)  # something wrong with local path
except HTTPError as err:
    print(err)  # something wrong with url
I made a script expanding on Yup.'s script. I fixed some things. It will now bypass 403: Forbidden problems. It won't crash when an image fails to be retrieved. It tries to avoid corrupted previews. It gets the right absolute URLs. It gives out more information. It can be run with an argument from the command line.
# getem.py
# python2 script to download all images in a given url
# use: python getem.py http://url.where.images.are

from bs4 import BeautifulSoup
import urllib2
import shutil
import requests
from urlparse import urljoin
import sys
import time

def make_soup(url):
    req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
    html = urllib2.urlopen(req)
    return BeautifulSoup(html, 'html.parser')

def get_images(url):
    soup = make_soup(url)
    images = [img for img in soup.findAll('img')]
    print (str(len(images)) + " images found.")
    print 'Downloading images to current working directory.'
    image_links = [each.get('src') for each in images]
    for each in image_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print 'Getting: ' + filename
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except:
            print '  An error occurred. Continuing.'
    print 'Done.'

if __name__ == '__main__':
    url = sys.argv[1]
    get_images(url)
A solution which works with Python 2 and Python 3:
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

url = "http://www.digimouth.com/news/media/2011/09/google-logo.jpg"
urlretrieve(url, "local-filename.jpg")
or, if the additional requirement of requests is acceptable and if it is a http(s) URL:
def load_requests(source_url, sink_path):
    """
    Load a file from an URL (e.g. http).

    Parameters
    ----------
    source_url : str
        Where to load the file from.
    sink_path : str
        Where the loaded file is stored.
    """
    import requests
    r = requests.get(source_url, stream=True)
    if r.status_code == 200:
        with open(sink_path, 'wb') as f:
            for chunk in r:
                f.write(chunk)
Using the requests library:
import requests
import shutil, os

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}

currentDir = os.getcwd()
path = os.path.join(currentDir, 'Images')  # saving images to Images folder

def ImageDl(url):
    attempts = 0
    while attempts < 5:  # retry 5 times
        try:
            filename = url.split('/')[-1]
            r = requests.get(url, headers=headers, stream=True, timeout=5)
            if r.status_code == 200:
                with open(os.path.join(path, filename), 'wb') as f:
                    r.raw.decode_content = True
                    shutil.copyfileobj(r.raw, f)
                print(filename)
                break
        except Exception as e:
            attempts += 1
            print(e)

ImageDl(url)
Use the simple Python wget module to download the link. Usage below:
import wget
wget.download('http://www.digimouth.com/news/media/2011/09/google-logo.jpg')
This is a very short answer:
import urllib
urllib.urlretrieve("http://photogallery.sandesh.com/Picture.aspx?AlubumId=422040", "Abc.jpg")
Version for Python 3
I adjusted the code of @madprops for Python 3.
# getem.py
# python3 script to download all images in a given url
# use: python getem.py http://url.where.images.are

from bs4 import BeautifulSoup
import urllib.request
import shutil
import requests
from urllib.parse import urljoin
import sys
import time

def make_soup(url):
    req = urllib.request.Request(url, headers={'User-Agent' : "Magic Browser"})
    html = urllib.request.urlopen(req)
    return BeautifulSoup(html, 'html.parser')

def get_images(url):
    soup = make_soup(url)
    images = [img for img in soup.findAll('img')]
    print(str(len(images)) + " images found.")
    print('Downloading images to current working directory.')
    image_links = [each.get('src') for each in images]
    for each in image_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print('Getting: ' + filename)
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except:
            print(' An error occurred. Continuing.')
    print('Done.')

if __name__ == '__main__':
    get_images('http://www.wookmark.com')
Late answer, but for python>=3.6 you can use dload, i.e.:
import dload
dload.save("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")
If you need the image as bytes, use:
img_bytes = dload.bytes("http://www.digimouth.com/news/media/2011/09/google-logo.jpg")
Install using pip3 install dload
Something fresh for Python 3 using requests: comments in the code, ready-to-use function.
import requests
from os import path

def get_image(image_url):
    """
    Get image based on url.
    :return: Image name if everything OK, False otherwise
    """
    image_name = path.split(image_url)[1]
    try:
        image = requests.get(image_url)
    except OSError:  # Little too wide, but works OK; no additional imports needed. Catch all connection problems.
        return False
    if image.status_code == 200:  # we could have retrieved an error page
        # Use your own path or "" to use the current working directory. Folder must exist.
        base_dir = path.join(path.dirname(path.realpath(__file__)), "images")
        with open(path.join(base_dir, image_name), "wb") as f:
            f.write(image.content)
        return image_name

get_image("https://apod.nasddfda.gov/apod/image/2003/S106_Mishra_1947.jpg")
This is the easiest method to download images.
import requests
from slugify import slugify

img_url = 'https://apod.nasa.gov/apod/image/1701/potw1636aN159_HST_2048.jpg'
img = requests.get(img_url).content
img_file = open(slugify(img_url) + '.' + str(img_url).split('.')[-1], 'wb')
img_file.write(img)
img_file.close()
If you don't already have the url for the image, you could scrape it with gazpacho:
from gazpacho import Soup

base_url = "http://books.toscrape.com"
soup = Soup.get(base_url)
links = [img.attrs["src"] for img in soup.find("img")]
And then download the asset with urllib as mentioned:
from pathlib import Path
from urllib.request import urlretrieve as download

directory = "images"
Path(directory).mkdir(exist_ok=True)
link = links[0]
name = link.split("/")[-1]
download(f"{base_url}/{link}", f"{directory}/{name}")
# import the required libraries from Python
import os, pathlib, urllib.request

# Using pathlib, specify where the image is to be saved
downloads_path = str(pathlib.Path.home() / "Downloads")

# Form a full image path by joining the path to the
# image's new name
picture_path = os.path.join(downloads_path, "new-image.png")
# "/home/User/Downloads/new-image.png"

# Using "urlretrieve()" from urllib.request, save the image
urllib.request.urlretrieve("http://example.com/image.png", picture_path)

# urlretrieve() takes 2 arguments:
# 1. The URL of the image to be downloaded
# 2. The image's new name after download. By default, the image is saved
#    inside your current working directory
Ok, so, this is my rudimentary attempt, and probably total overkill. Update if needed, as this doesn't handle any timeouts, but I got this working for fun.
Code listed here: https://github.com/JayRizzo/JayRizzoTools/blob/master/pyImageDownloader.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# =============================================================================
# Created Syst: MAC OSX High Sierra 21.5.0 (17G65)
# Created Plat: Python 3.9.5 ('v3.9.5:0a7dcbdb13', 'May 3 2021 13:17:02')
# Created By : Jeromie Kirchoff
# Created Date: Thu Jun 15 23:31:01 2022 CDT
# Last ModDate: Thu Jun 16 01:41:01 2022 CDT
# =============================================================================
# NOTE: Doesn't work on SVG images at this time.
# I will look into this further: https://stackoverflow.com/a/6599172/1896134
# =============================================================================
import requests                              # to get image from the web
import shutil                                # to save it locally
import os                                    # needed
from os.path import exists as filepathexist  # check if file paths exist
from os.path import join                     # joins path for different os
from os.path import expanduser               # expands current home
from pyuser_agent import UA                  # generates random UserAgent

class ImageDownloader(object):
    """URL ImageDownloader.
    Input : Full Image URL
    Output: Image saved to your ~/Pictures/JayRizzoDL folder.
    """

    def __init__(self, URL: str):
        self.url = URL
        self.headers = {"User-Agent" : UA().random}
        self.currentHome = expanduser('~')
        self.desktop = join(self.currentHome + "/Desktop/")
        self.download = join(self.currentHome + "/Downloads/")
        self.pictures = join(self.currentHome + "/Pictures/JayRizzoDL/")
        self.outfile = ""
        self.filename = ""
        self.response = ""
        self.rawstream = ""
        self.createdfilepath = ""
        self.imgFileName = ""
        # Check if the JayRizzoDL exists in the pictures folder.
        # if it doesn't exist create it.
        if not filepathexist(self.pictures):
            os.mkdir(self.pictures)
        self.main()

    def getFileNameFromURL(self, URL: str):
        """Parse the URL for the name after the last forward slash."""
        NewFileName = self.url.strip().split('/')[-1].strip()
        return NewFileName

    def getResponse(self, URL: str):
        """Try streaming the URL for the raw data."""
        self.response = requests.get(self.url, headers=self.headers, stream=True)
        return self.response

    def gocreateFile(self, name: str, response):
        """Try creating the file with the raw data in a custom folder."""
        self.outfile = join(self.pictures, name)
        with open(self.outfile, 'wb') as outFilePath:
            shutil.copyfileobj(response.raw, outFilePath)
        return self.outfile

    def main(self):
        """Combine Everything and use in for loops."""
        self.filename = self.getFileNameFromURL(self.url)
        self.rawstream = self.getResponse(self.url)
        self.createdfilepath = self.gocreateFile(self.filename, self.rawstream)
        print(f"File was created: {self.createdfilepath}")
        return

if __name__ == '__main__':
    # Example when calling the file directly.
    ImageDownloader("https://stackoverflow.design/assets/img/logos/so/logo-stackoverflow.png")
Download an image file, avoiding all possible errors:
import requests
import validators
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

def is_downloadable(url):
    valid = validators.url(url)
    if valid == False:
        return False
    req = Request(url)
    try:
        response = urlopen(req)
    except HTTPError as e:
        return False
    except URLError as e:
        return False
    else:
        return True

for i in range(len(File_data)):  # File_data contains a list of addresses for image files
    url = File_data[i][1]
    try:
        if (is_downloadable(url)):
            try:
                r = requests.get(url, allow_redirects=True)
                if url.find('/'):
                    fname = url.rsplit('/', 1)[1]
                    fname = pth + File_data[i][0] + "$" + fname  # Destination to save image file
                    open(fname, 'wb').write(r.content)
            except Exception as e:
                print(e)
    except Exception as e:
        print(e)
Stream large binary files with urllib2 to file
I use the following code to stream large files from the Internet into a local file:
fp = open(file, 'wb')
req = urllib2.urlopen(url)
for line in req:
    fp.write(line)
fp.close()
This works but it downloads quite slowly. Is there a faster way? (The files are large so I don't want to keep them in memory.)
No reason to work line by line (small chunks AND requires Python to find the line ends for you!-), just chunk it up in bigger chunks, e.g.:
# from urllib2 import urlopen  # Python 2
from urllib.request import urlopen  # Python 3

response = urlopen(url)
CHUNK = 16 * 1024
with open(file, 'wb') as f:
    while True:
        chunk = response.read(CHUNK)
        if not chunk:
            break
        f.write(chunk)
Experiment a bit with various CHUNK sizes to find the "sweet spot" for your requirements.
You can also use shutil:
import shutil
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

def get_large_file(url, file, length=16*1024):
    req = urlopen(url)
    with open(file, 'wb') as fp:
        shutil.copyfileobj(req, fp, length)
I used to use the mechanize module and its Browser.retrieve() method. In the past it took 100% CPU and downloaded things very slowly, but some recent release fixed this bug and it works very quickly. Example:
import mechanize

browser = mechanize.Browser()
browser.retrieve('http://www.kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.32-rc1.tar.bz2', 'Downloads/my-new-kernel.tar.bz2')
Mechanize is based on urllib2, so urllib2 can also have a similar method... but I can't find any now.
You can use urlretrieve to download files:
try:
    from urllib import urlretrieve  # Python 2
except ImportError:
    from urllib.request import urlretrieve  # Python 3

url = "http://www.examplesite.com/myfile"
urlretrieve(url, "./local_file")