How to parallelize file downloads? - python

I can download a file at a time with:
import urllib.request
urls = ['foo.com/bar.gz', 'foobar.com/barfoo.gz', 'bar.com/foo.gz']
for u in urls:
    urllib.request.urlretrieve(u)
I could try to subprocess it as such:
import subprocess
import os
def parallelized_commandline(command, files, max_processes=2):
    processes = set()
    for name in files:
        processes.add(subprocess.Popen([command, name]))
        if len(processes) >= max_processes:
            os.wait()
            processes.difference_update(
                [p for p in processes if p.poll() is not None])
    # Check if all the child processes were closed
    for p in processes:
        if p.poll() is None:
            p.wait()
urls = ['http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.en.gz',
        'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.cs.gz',
        'http://www.statmt.org/wmt15/training-monolingual-nc-v10/news-commentary-v10.de.gz']
parallelized_commandline('wget', urls)
Is there any way to parallelize urlretrieve without using os.system or subprocess to cheat?
Given that I must resort to the "cheat" for now, is subprocess.Popen the right way to download the data?
When using the parallelized_commandline() above, it's using multi-thread but not multi-core for the wget, is that normal? Is there a way to make it multi-core instead of multi-thread?

You could use a thread pool to download files in parallel:
#!/usr/bin/env python3
from multiprocessing.dummy import Pool # use threads for I/O bound tasks
from urllib.request import urlretrieve
urls = [...]
result = Pool(4).map(urlretrieve, urls) # download 4 files at a time
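The same idea can be sketched with concurrent.futures, which makes it easier to see which downloads failed. This is only a sketch under assumptions: the helper names fetch and download_all, the fallback file name "download.bin" and the dest_dir parameter are illustrative and not part of the answer above.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def fetch(url, dest_dir):
    """Download one URL into dest_dir and return the local path."""
    # Derive a file name from the URL path; the fallback name is an assumption.
    name = os.path.basename(urlsplit(url).path) or "download.bin"
    path = os.path.join(dest_dir, name)
    urlretrieve(url, path)
    return path


def download_all(urls, dest_dir=".", max_workers=4):
    """Download urls concurrently; return (saved, failed) dicts keyed by URL."""
    saved, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url, dest_dir): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                saved[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                failed[url] = exc
    return saved, failed
```

Calling download_all(urls, dest_dir="downloads", max_workers=4) would then keep at most four downloads in flight, and a failed URL ends up in the second dictionary instead of aborting the whole batch.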
You could also download several files at once in a single thread using asyncio:
#!/usr/bin/env python3
import asyncio
import logging
import aiohttp  # $ pip install aiohttp
async def download(url, session, semaphore, chunk_size=1<<15):
    async with semaphore:  # limit number of concurrent downloads
        filename = url2filename(url)
        logging.info('downloading %s', filename)
        async with session.get(url) as response:
            with open(filename, 'wb') as file:
                while True:  # save file
                    chunk = await response.content.read(chunk_size)
                    if not chunk:
                        break
                    file.write(chunk)
            logging.info('done %s', filename)
            return filename, (response.status, tuple(response.headers.items()))
async def main(urls):
    semaphore = asyncio.Semaphore(4)
    async with aiohttp.ClientSession() as session:
        download_tasks = (download(url, session, semaphore) for url in urls)
        return await asyncio.gather(*download_tasks)
urls = [...]
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
result = asyncio.run(main(urls))
where url2filename() is defined here.

Related

Why subprocess with waitpid is crashing?

I am trying to parallel download urls with the following:
def parallel_download_files(self, urls, filenames):
    pids = []
    for (url, filename) in zip(urls, filenames):
        pid = os.fork()
        if pid == 0:
            open(filename, 'wb').write(requests.get(url).content)
        else:
            pids.append(pid)
    for pid in pids:
        os.waitpid(pid, os.WNOHANG)
But when executing with a list of urls and filenames, the computer system is building up in memory and crashing. From the documentation, I thought that the options in waitpid should be correctly handled if setting it to os.WNOHANG. This is the first time I am trying parallel with forks, I have been doing such tasks with concurrent.futures.ThreadPoolExecutor before.
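Using os.fork() is far from ideal, especially as you're not handling the two processes that are being created (parent/child). In your code the child never exits after writing its file, so it carries on with the loop and forks children of its own, which is why memory use builds up. Multithreading is far superior for this use case.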
For example:
from concurrent.futures import ThreadPoolExecutor as TPE
from requests import get as GET
def parallel_download_files(urls, filenames):
    def _process(t):
        url, filename = t
        try:
            (r := GET(url)).raise_for_status()
            with open(filename, 'wb') as output:
                output.write(r.content)
        except Exception as e:
            print('Failed: ', url, filename, e)
    with TPE() as executor:
        executor.map(_process, zip(urls, filenames))
urls = ['https://www.bbc.co.uk', 'https://news.bbc.co.uk']
filenames = ['www.txt', 'news.txt']
parallel_download_files(urls, filenames)
Note:
If any filenames are duplicated in the filenames list, you'll need a more complex strategy that ensures no two threads ever write to the same file.
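One possible strategy, sketched here as an assumption and not as part of the answer above, is to rename duplicates before any thread starts, so that every download gets its own target file. The helper name unique_filenames and the "_1", "_2" suffix scheme are illustrative choices.

```python
import os


def unique_filenames(filenames):
    """Return a list of the same length in which every name is distinct."""
    used = set()
    result = []
    for name in filenames:
        stem, ext = os.path.splitext(name)
        candidate = name
        n = 1
        # Append _1, _2, ... until the name has not been handed out yet.
        while candidate in used:
            candidate = f"{stem}_{n}{ext}"
            n += 1
        used.add(candidate)
        result.append(candidate)
    return result
```

The result could be passed in place of the original list, for example parallel_download_files(urls, unique_filenames(filenames)).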

how to download a .csv file from a web to computer using python [duplicate]

I have a small utility that I use to download an MP3 file from a website on a schedule and then builds/updates a podcast XML file which I've added to iTunes.
The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3 file. I would prefer to have the entire utility written in Python.
I struggled to find a way to actually download the file in Python, thus why I resorted to using wget.
So, how do I download the file using Python?
One more, using urlretrieve:
import urllib.request
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
(for Python 2 use import urllib and urllib.urlretrieve)
Use urllib.request.urlopen():
import urllib.request
with urllib.request.urlopen('http://www.example.com/') as f:
    html = f.read().decode('utf-8')
This is the most basic way to use the library, minus any error handling. You can also do more complex stuff such as changing headers.
On Python 2, the method is in urllib2:
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
In 2012, use the python requests library
>>> import requests
>>>
>>> url = "http://download.thinkbroadband.com/10MB.zip"
>>> r = requests.get(url)
>>> print len(r.content)
10485760
You can run pip install requests to get it.
Requests has many advantages over the alternatives because the API is much simpler. This is especially true if you have to do authentication. urllib and urllib2 are pretty unintuitive and painful in this case.
2015-12-30
People have expressed admiration for the progress bar. It's cool, sure. There are several off-the-shelf solutions now, including tqdm:
from tqdm import tqdm
import requests
url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)
with open("10MB", "wb") as handle:
    for data in tqdm(response.iter_content()):
        handle.write(data)
This is essentially the implementation @kvance described 30 months ago.
import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3','wb') as output:
    output.write(mp3file.read())
The wb in open('test.mp3','wb') opens a file (and erases any existing file) in binary mode so you can save data with it instead of just text.
Python 3
urllib.request.urlopen
import urllib.request
response = urllib.request.urlopen('http://www.example.com/')
html = response.read()
urllib.request.urlretrieve
import urllib.request
urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
Note: According to the documentation, urllib.request.urlretrieve is a "legacy interface" and "might become deprecated in the future" (thanks gerrit)
Python 2
urllib2.urlopen (thanks Corey)
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
urllib.urlretrieve (thanks PabloG)
import urllib
urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
use wget module:
import wget
wget.download('url')
import os,requests
def download(url):
    get_response = requests.get(url,stream=True)
    file_name = url.split("/")[-1]
    with open(file_name, 'wb') as f:
        for chunk in get_response.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
download("https://example.com/example.jpg")
An improved version of the PabloG code for Python 2/3:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import ( division, absolute_import, print_function, unicode_literals )
import sys, os, tempfile, logging
if sys.version_info >= (3,):
    import urllib.request as urllib2
    import urllib.parse as urlparse
else:
    import urllib2
    import urlparse
def download_file(url, dest=None):
    """
    Download and save a file specified by url to dest directory,
    """
    u = urllib2.urlopen(url)
    scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
    filename = os.path.basename(path)
    if not filename:
        filename = 'downloaded.file'
    if dest:
        filename = os.path.join(dest, filename)
    with open(filename, 'wb') as f:
        meta = u.info()
        meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
        meta_length = meta_func("Content-Length")
        file_size = None
        if meta_length:
            file_size = int(meta_length[0])
        print("Downloading: {0} Bytes: {1}".format(url, file_size))
        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            status = "{0:16}".format(file_size_dl)
            if file_size:
                status += " [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
            status += chr(13)
            print(status, end="")
        print()
    return filename
if __name__ == "__main__": # Only run if this file is called directly
    print("Testing with 10MB download")
    url = "http://download.thinkbroadband.com/10MB.zip"
    filename = download_file(url)
    print(filename)
Simple yet Python 2 & Python 3 compatible way comes with six library:
from six.moves import urllib
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
Following are the most commonly used calls for downloading files in python:
urllib.urlretrieve ('url_to_file', file_name)
urllib2.urlopen('url_to_file')
requests.get(url)
wget.download('url', file_name)
Note: urlopen and urlretrieve have been found to perform relatively poorly when downloading large files (size > 500 MB). requests.get keeps the whole file in memory until the download is complete, unless it is called with stream=True.
Wrote wget library in pure Python just for this purpose. It is pumped up urlretrieve with these features as of version 2.0.
In Python 3 you can use the urllib.request and shutil modules.
Both are part of the standard library, so nothing needs to be installed with pip (urllib3 is a separate third-party package and is not used here).
Then run this code
import urllib.request
import shutil
url = "http://www.somewebsite.com/something.pdf"
output_file = "save_this_name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
I agree with Corey, urllib2 is more complete than urllib and should likely be the module used if you want to do more complex things, but to make the answers more complete, urllib is a simpler module if you want just the basics:
import urllib
response = urllib.urlopen('http://www.example.com/sound.mp3')
mp3 = response.read()
Will work fine. Or, if you don't want to deal with the "response" object you can call read() directly:
import urllib
mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()
If you have wget installed, you can use parallel_sync.
pip install parallel_sync
from parallel_sync import wget
urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
wget.download('/tmp', urls)
# or a single file:
wget.download('/tmp', urls[0], filenames='x.zip', extract=True)
Doc:
https://pythonhosted.org/parallel_sync/pages/examples.html
This is pretty powerful. It can download files in parallel, retry upon failure, and it can even download files on a remote machine.
You can get the progress feedback with urlretrieve as well:
import sys
import urllib
def report(blocknr, blocksize, size):
    current = blocknr*blocksize
    sys.stdout.write("\r{0:.2f}%".format(100.0*current/size))
def downloadFile(url):
    print "\n",url
    fname = url.split('/')[-1]
    print fname
    urllib.urlretrieve(url, fname, report)
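The snippet above uses Python 2 syntax. A possible Python 3 version of the same idea is sketched below, using urllib.request.urlretrieve and its reporthook argument. The helper name format_progress, the handling of an unknown size and the cap at 100% are assumptions added for illustration.

```python
import sys
import urllib.request


def format_progress(blocknr, blocksize, size):
    """Format progress as a percentage, or as a byte count if size is unknown."""
    received = blocknr * blocksize
    if size <= 0:  # urlretrieve passes -1 when there is no Content-Length
        return "{0} bytes".format(received)
    # The last block is usually partial, so cap the value at 100%.
    percent = min(100.0, 100.0 * received / size)
    return "{0:.2f}%".format(percent)


def report(blocknr, blocksize, size):
    """reporthook callback: rewrite the current line with the progress."""
    sys.stdout.write("\r" + format_progress(blocknr, blocksize, size))
    sys.stdout.flush()


def download_file(url):
    """Save url under the last component of its path and return that name."""
    fname = url.split("/")[-1]
    urllib.request.urlretrieve(url, fname, report)
    return fname
```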
If speed matters to you, I made a small performance test for the modules urllib and wget, and regarding wget I tried once with status bar and once without. I took three different 500MB files to test with (different files- to eliminate the chance that there is some caching going on under the hood). Tested on debian machine, with python2.
First, these are the results (they are similar in different runs):
$ python wget_test.py
urlretrive_test : starting
urlretrive_test : 6.56
==============
wget_no_bar_test : starting
wget_no_bar_test : 7.20
==============
wget_with_bar_test : starting
100% [......................................................................] 541335552 / 541335552
wget_with_bar_test : 50.49
==============
The way I performed the test is using "profile" decorator. This is the full code:
import wget
import urllib
import time
from functools import wraps
def profile(func):
    @wraps(func)
    def inner(*args):
        print func.__name__, ": starting"
        start = time.time()
        ret = func(*args)
        end = time.time()
        print func.__name__, ": {:.2f}".format(end - start)
        return ret
    return inner
url1 = 'http://host.com/500a.iso'
url2 = 'http://host.com/500b.iso'
url3 = 'http://host.com/500c.iso'
def do_nothing(*args):
    pass
@profile
def urlretrive_test(url):
    return urllib.urlretrieve(url)
@profile
def wget_no_bar_test(url):
    return wget.download(url, out='/tmp/', bar=do_nothing)
@profile
def wget_with_bar_test(url):
    return wget.download(url, out='/tmp/')
urlretrive_test(url1)
print '=============='
time.sleep(1)
wget_no_bar_test(url2)
print '=============='
time.sleep(1)
wget_with_bar_test(url3)
print '=============='
time.sleep(1)
urllib seems to be the fastest
Just for the sake of completeness, it is also possible to call any program for retrieving files using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-r), can deal with FTP, redirects, HTTP proxies, can avoid re-downloading existing files (-nc), and aria2 can do multi-connection downloads which can potentially speed up your downloads.
import subprocess
subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])
In Jupyter Notebook, one can also call programs directly with the ! syntax:
!wget -O example_output_file.html https://example.com
Late answer, but for python>=3.6 you can use:
import dload
dload.save(url)
Install dload with:
pip3 install dload
Source code can be:
import urllib
sock = urllib.urlopen("http://diveintopython.org/")
htmlSource = sock.read()
sock.close()
print htmlSource
I wrote the following, which works in vanilla Python 2 or Python 3.
import sys
try:
    import urllib.request
    python3 = True
except ImportError:
    import urllib2
    python3 = False
def progress_callback_simple(downloaded,total):
    sys.stdout.write(
        "\r" +
        (len(str(total))-len(str(downloaded)))*" " + str(downloaded) + "/%d"%total +
        " [%3.2f%%]"%(100.0*float(downloaded)/float(total))
    )
    sys.stdout.flush()
def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
    def _download_helper(response, out_file, file_size):
        if progress_callback!=None: progress_callback(0,file_size)
        if block_size == None:
            buffer = response.read()
            out_file.write(buffer)
            if progress_callback!=None: progress_callback(file_size,file_size)
        else:
            file_size_dl = 0
            while True:
                buffer = response.read(block_size)
                if not buffer: break
                file_size_dl += len(buffer)
                out_file.write(buffer)
                if progress_callback!=None: progress_callback(file_size_dl,file_size)
    with open(dstfilepath,"wb") as out_file:
        if python3:
            with urllib.request.urlopen(srcurl) as response:
                file_size = int(response.getheader("Content-Length"))
                _download_helper(response,out_file,file_size)
        else:
            response = urllib2.urlopen(srcurl)
            meta = response.info()
            file_size = int(meta.getheaders("Content-Length")[0])
            _download_helper(response,out_file,file_size)
import traceback
try:
    download(
        "https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
        "output.zip",
        progress_callback_simple
    )
except:
    traceback.print_exc()
    input()
Notes:
Supports a "progress bar" callback.
Download is a 4 MB test .zip from my website.
You can use PycURL on Python 2 and 3.
import pycurl
FILE_DEST = 'pycurl.html'
FILE_SRC = 'http://pycurl.io/'
with open(FILE_DEST, 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, FILE_SRC)
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()
Use Python Requests in 5 lines
import requests as req
remote_url = 'http://www.example.com/sound.mp3'
local_file_name = 'sound.mp3'
data = req.get(remote_url)
# Save file data to local copy
with open(local_file_name, 'wb') as file:
    file.write(data.content)
Now do something with the local copy of the remote file
This may be a little late, But I saw pabloG's code and couldn't help adding a os.system('cls') to make it look AWESOME! Check it out :
import urllib2,os
url = "http://download.thinkbroadband.com/10MB.zip"
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
os.system('cls')
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,
f.close()
If running in an environment other than Windows, you will have to use something other than 'cls'. On macOS and Linux it should be 'clear'.
urlretrieve and requests.get are simple; reality, however, is not.
I have fetched data from a couple of sites, including text and images, and those two calls probably solve most tasks. For a more universal solution, though, I suggest urlopen. Because it is part of the Python 3 standard library, your code can run on any machine that has Python 3 without installing any site-packages first.
import urllib.request
url_request = urllib.request.Request(url, headers=headers)
url_connect = urllib.request.urlopen(url_request)
#remember to open file in bytes mode
with open(filename, 'wb') as f:
    while True:
        buffer = url_connect.read(buffer_size)
        if not buffer: break
        #an integer value of size of written data
        data_wrote = f.write(buffer)
#you could probably use with-open-as manner
url_connect.close()
This answer provides a solution to HTTP 403 Forbidden when downloading a file over HTTP using Python. I have tried only the requests and urllib modules; other modules may provide something better, but this is the one I used to solve most of the problems.
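The snippet above leaves url, headers, filename and buffer_size undefined. A minimal sketch of how they might be filled in follows. The User-Agent string, the default buffer size and the helper names build_request and download are assumptions for illustration; which headers a given site actually expects varies.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-downloader/0.1)"):
    """Create a Request that carries a custom User-Agent header."""
    # Assumption: the server rejects the default Python user agent.
    headers = {"User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


def download(url, filename, buffer_size=8192):
    """Stream url to filename in chunks of buffer_size bytes."""
    with urllib.request.urlopen(build_request(url)) as response, \
            open(filename, "wb") as f:
        while True:
            buffer = response.read(buffer_size)
            if not buffer:
                break
            f.write(buffer)
```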
New Api urllib3 based implementation
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'your_url_goes_here')
>>> r.status
200
>>> r.data
*****Response Data****
More info: https://pypi.org/project/urllib3/
You can use Python requests
import os
import requests
outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(URL, stream=True)
with open(outfile,'wb') as output:
    output.write(response.content)
You can use shutil
import os
import requests
import shutil
outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(url, stream = True)
with open(outfile, 'wb') as f:
    # copyfileobj needs a file-like object, so use response.raw, not response.content (bytes)
    shutil.copyfileobj(response.raw, f)
If you are downloading from a restricted URL, don't forget to include the access token in the headers.
I wanted to download all the files from a webpage. I tried wget but it was failing, so I decided to go the Python route and found this thread.
After reading it, I have made a little command line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.
It uses BeautifulSoup to collect all the URLs of the page and then downloads the ones with the desired extension(s). Finally, it can download multiple files in parallel.
Here it is:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import (division, absolute_import, print_function, unicode_literals)
import sys, os, argparse
from bs4 import BeautifulSoup
# --- insert Stan's script here ---
# if sys.version_info >= (3,):
#...
#...
# def download_file(url, dest=None):
#...
#...
# --- new stuff ---
def collect_all_url(page_url, extensions):
    """
    Recovers all links in page_url checking for all the desired extensions
    """
    conn = urllib2.urlopen(page_url)
    html = conn.read()
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('a')
    results = []
    for tag in links:
        link = tag.get('href', None)
        if link is not None:
            for e in extensions:
                if e in link:
                    # Fallback for badly defined links
                    # checks for missing scheme or netloc
                    if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
                        results.append(link)
                    else:
                        new_url=urlparse.urljoin(page_url,link)
                        results.append(new_url)
    return results
if __name__ == "__main__": # Only run if this file is called directly
    # Command line arguments
    parser = argparse.ArgumentParser(
        description='Download all files from a webpage.')
    parser.add_argument(
        '-u', '--url',
        help='Page url to request')
    parser.add_argument(
        '-e', '--ext',
        nargs='+',
        help='Extension(s) to find')
    parser.add_argument(
        '-d', '--dest',
        default=None,
        help='Destination where to save the files')
    parser.add_argument(
        '-p', '--par',
        action='store_true', default=False,
        help="Turns on parallel download")
    args = parser.parse_args()
    # Recover files to download
    all_links = collect_all_url(args.url, args.ext)
    # Download
    if not args.par:
        for l in all_links:
            try:
                filename = download_file(l, args.dest)
                print(l)
            except Exception as e:
                print("Error while downloading: {}".format(e))
    else:
        from multiprocessing.pool import ThreadPool
        results = ThreadPool(10).imap_unordered(
            lambda x: download_file(x, args.dest), all_links)
        for p in results:
            print(p)
An example of its usage is:
python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>
And an actual example if you want to see it in action:
python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics
Another possibility is with built-in http.client:
from http import HTTPStatus, client
from shutil import copyfileobj
# using https
connection = client.HTTPSConnection("www.example.com")
connection.request("GET", "/noise.mp3")  # request() itself returns None
response = connection.getresponse()
if response.status == HTTPStatus.OK:
    with open("noise.mp3", "wb") as out_file:
        copyfileobj(response, out_file)
else:
    raise Exception("request needs work")
connection.close()
The HTTPConnection object is considered “low-level” in that it performs the desired request once and assumes the developer will subclass it or script in a way to handle the nuances of HTTP. Libraries such as requests tend to handle more special cases such as automatically following redirects and so on.
You can use keras.utils.get_file to do it:
from tensorflow import keras
path_to_downloaded_file = keras.utils.get_file(
    fname="file name",
    origin="https://www.linktofile.com/link/to/file",
    extract=True,
    archive_format="zip", # downloaded file format
    cache_dir="/", # cache and extract in current directory
)
Another way is to call an external process such as curl.exe. By default, curl displays a progress bar, average download speed, time left, and more, all formatted neatly in a table.
Put curl.exe in the same directory as your script
from subprocess import call
url = ""
call(["curl", url, '--output', "song.mp3"])
Note: --output accepts a full path as well as a bare file name, provided the target directory exists, so a separate os.rename afterwards is not required.

Downloading Many Images with Python Requests and Multiprocessing

I'm attempting to download a few thousand images using Python and the multiprocessing and requests libs. Things start off fine but about 100 images in, everything locks up and I have to kill the processes. I'm using python 2.7.6. Here's the code:
import re
import requests
import shutil
from multiprocessing import Pool
from urlparse import urlparse
def get_domain_name(s):
    domain_name = urlparse(s).netloc
    new_s = re.sub('\:', '_', domain_name) #replace colons
    return new_s
def grab_image(url):
    response = requests.get(url, stream=True, timeout=2)
    if response.status_code == 200:
        img_name = get_domain_name(url)
        with open(IMG_DST + img_name + ".jpg", 'wb') as outf:
            shutil.copyfileobj(response.raw, outf)
        del response
def main():
    with open(list_of_image_urls, 'r') as f:
        urls = f.read().splitlines()
    urls.sort()
    pool = Pool(processes=4, maxtasksperchild=2)
    pool.map(grab_image, urls)
    pool.close()
    pool.join()
if __name__ == "__main__":
    main()
Edit: After changing the multiprocessing import to multiprocessing.dummy to use threads instead of processes, I am still experiencing the same problem. It seems I'm sometimes hitting a motion JPEG stream instead of a single image, which is causing the associated problems. To deal with this issue I'm using a context manager, and I created a FileTooBigException. While I haven't implemented checking to make sure I've actually downloaded an image file, or some other housecleaning, I thought the code below might be useful for someone:
class FileTooBigException(requests.exceptions.RequestException):
    """File over LIMIT_SIZE"""
def grab_image(url):
    try:
        img = ''
        with closing(requests.get(url, stream=True, timeout=4)) as response:
            if response.status_code == 200:
                content_length = 0
                img_name = get_domain_name(url)
                img = IMG_DST + img_name + ".jpg"
                with open(img, 'wb') as outf:
                    for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                        outf.write(chunk)
                        content_length = content_length + CHUNK_SIZE
                        if(content_length > LIMIT_SIZE):
                            raise FileTooBigException(response)
    except requests.exceptions.Timeout:
        pass
    except requests.exceptions.ConnectionError:
        pass
    except socket.timeout:
        pass
    except FileTooBigException:
        os.remove(img)
        pass
And, any suggested improvements welcome!
There is no point in using multiprocessing for I/O concurrency. In network I/O, the thread involved spends most of its time waiting and doing nothing, and Python threads are excellent at doing nothing. So use a thread pool instead of a process pool. Each process consumes a lot of resources, which is unnecessary for I/O-bound activities, while threads share the process state and are exactly what you are looking for.
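The point about waiting threads can be illustrated with a small sketch in which time.sleep stands in for the network wait. The names fake_download and timed_run and the 0.2 second delay are assumptions made for the demonstration, not part of the answer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, delay=0.2):
    """Pretend to download url; sleeping stands in for waiting on the network."""
    time.sleep(delay)
    return url


def timed_run(urls, max_workers):
    """Run fake_download over urls in a thread pool; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fake_download, urls))
    return results, time.perf_counter() - start
```

With eight URLs and eight workers the waits overlap, so the elapsed time should stay close to a single delay, whereas one worker would take roughly eight delays.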

Shared variable among various threads in Python

I have a requirement to calculate the total number of POST requests sent to a server. My script uses one thread per JSON file that contains the POST data. Below is a rough code snippet.
statistics = 0
def load_from_file(some_arguments, filename):
    data_list = json.loads(open(filename).read())
    url = address + getUrl(filename, config)
    for data in data_list.get("results"):
        statistics += 1
        r = requests.post(url, data=json.dumps(data), headers=headers,
                          auth=HTTPBasicAuth(username, password))
def load_from_directory(some_arguments, directory):
    pool = mp.Pool(mp.cpu_count() * 2)
    func = partial(load_from_file, some_arguments)
    file_list = [f for f in listdir(directory) if isfile(join(directory, f))]
    pool.map(func, [join(directory, f) for f in file_list ])
    pool.close()
    pool.join()
    print "total post requests", statistics
I want to print the total number of post requests processed using this script. Is it the right way?
Sharing memory is not so simple when using multiple processes. I don't see the need to use the multiprocessing module instead of threading here. Multiprocessing is mostly used as a workaround for the global interpreter lock.
In your example you are using I/O-bound operations, which probably won't ever use the full CPU time. If you insist on using multiprocessing instead of threading, I suggest taking a look at exchanging-objects-between-processes.
Otherwise, using threading, you can share the global statistics variable between threads.
import threading
statistics = 0
def load_from_file(some_arguments, filename):
    global statistics
    data_list = json.loads(open(filename).read())
    url = address + getUrl(filename, config)
    for data in data_list.get("results"):
        statistics += 1
        r = requests.post(url, data=json.dumps(data), headers=headers,
                          auth=HTTPBasicAuth(username, password))
def load_from_directory(some_arguments, directory):
    threads = []
    func = partial(load_from_file, some_arguments)
    file_list = [f for f in listdir(directory) if isfile(join(directory, f))]
    for f in file_list:
        # args must be a tuple, hence the trailing comma
        t = threading.Thread(target=func, args=(join(directory, f),))
        t.start()
        threads.append(t)
    #Wait for threads to finish
    for thread in threads:
        thread.join()
    print "total post requests", statistics
Note: This currently simultaneously spawns threads based on the number of files in the directory. You might want to implement some kind of throttling for optimal performance.
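One way the throttling could look is sketched below in Python 3: a bounded thread pool limits how many files are processed at once, and a lock guards the shared counter, because "+= 1" on a shared variable is not guaranteed to be atomic. Counter, load_all and the send callable (a stand-in for requests.post) are hypothetical names introduced for this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Counter:
    """A counter that is safe to increment from several threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self, n=1):
        with self._lock:
            self.value += n


def load_from_file(records, counter, send):
    """Send every record and count it; send stands in for requests.post."""
    for record in records:
        send(record)
        counter.increment()


def load_all(batches, send, max_workers=8):
    """Process each batch in a pool of at most max_workers threads."""
    counter = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so exceptions from workers surface here.
        list(pool.map(lambda batch: load_from_file(batch, counter, send), batches))
    return counter.value
```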

How to download a file over HTTP?

I have a small utility that I use to download an MP3 file from a website on a schedule and then builds/updates a podcast XML file which I've added to iTunes.
The text processing that creates/updates the XML file is written in Python. However, I use wget inside a Windows .bat file to download the actual MP3 file. I would prefer to have the entire utility written in Python.
I struggled to find a way to actually download the file in Python, thus why I resorted to using wget.
So, how do I download the file using Python?
One more, using urlretrieve:
import urllib.request
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
(for Python 2 use import urllib and urllib.urlretrieve)
Use urllib.request.urlopen():
import urllib.request
with urllib.request.urlopen('http://www.example.com/') as f:
html = f.read().decode('utf-8')
This is the most basic way to use the library, minus any error handling. You can also do more complex stuff such as changing headers.
On Python 2, the method is in urllib2:
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
In 2012, use the python requests library
>>> import requests
>>>
>>> url = "http://download.thinkbroadband.com/10MB.zip"
>>> r = requests.get(url)
>>> print len(r.content)
10485760
You can run pip install requests to get it.
Requests has many advantages over the alternatives because the API is much simpler. This is especially true if you have to do authentication. urllib and urllib2 are pretty unintuitive and painful in this case.
2015-12-30
People have expressed admiration for the progress bar. It's cool, sure. There are several off-the-shelf solutions now, including tqdm:
from tqdm import tqdm
import requests
url = "http://download.thinkbroadband.com/10MB.zip"
response = requests.get(url, stream=True)
with open("10MB", "wb") as handle:
for data in tqdm(response.iter_content()):
handle.write(data)
This is essentially the implementation #kvance described 30 months ago.
import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
with open('test.mp3','wb') as output:
output.write(mp3file.read())
The wb in open('test.mp3','wb') opens a file (and erases any existing file) in binary mode so you can save data with it instead of just text.
Python 3
urllib.request.urlopen
import urllib.request
response = urllib.request.urlopen('http://www.example.com/')
html = response.read()
urllib.request.urlretrieve
import urllib.request
urllib.request.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
Note: According to the documentation, urllib.request.urlretrieve is a "legacy interface" and "might become deprecated in the future" (thanks gerrit)
Python 2
urllib2.urlopen (thanks Corey)
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
urllib.urlretrieve (thanks PabloG)
import urllib
urllib.urlretrieve('http://www.example.com/songs/mp3.mp3', 'mp3.mp3')
use wget module:
import wget
wget.download('url')
import os,requests
def download(url):
get_response = requests.get(url,stream=True)
file_name = url.split("/")[-1]
with open(file_name, 'wb') as f:
for chunk in get_response.iter_content(chunk_size=1024):
if chunk: # filter out keep-alive new chunks
f.write(chunk)
download("https://example.com/example.jpg")
An improved version of the PabloG code for Python 2/3:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import ( division, absolute_import, print_function, unicode_literals )
import sys, os, tempfile, logging
if sys.version_info >= (3,):
    import urllib.request as urllib2
    import urllib.parse as urlparse
else:
    import urllib2
    import urlparse
def download_file(url, dest=None):
    """
    Download and save a file specified by url to dest directory.
    """
    u = urllib2.urlopen(url)
    scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
    filename = os.path.basename(path)
    if not filename:
        filename = 'downloaded.file'
    if dest:
        filename = os.path.join(dest, filename)
    with open(filename, 'wb') as f:
        meta = u.info()
        meta_func = meta.getheaders if hasattr(meta, 'getheaders') else meta.get_all
        meta_length = meta_func("Content-Length")
        file_size = None
        if meta_length:
            file_size = int(meta_length[0])
        print("Downloading: {0} Bytes: {1}".format(url, file_size))
        file_size_dl = 0
        block_sz = 8192
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            status = "{0:16}".format(file_size_dl)
            if file_size:
                status += " [{0:6.2f}%]".format(file_size_dl * 100 / file_size)
            status += chr(13)
            print(status, end="")
        print()
    return filename
if __name__ == "__main__": # Only run if this file is called directly
    print("Testing with 10MB download")
    url = "http://download.thinkbroadband.com/10MB.zip"
    filename = download_file(url)
    print(filename)
A simple way that works on both Python 2 and Python 3 comes with the six library:
from six.moves import urllib
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
The following are the most commonly used calls for downloading files in Python:
urllib.urlretrieve('url_to_file', file_name)
urllib2.urlopen('url_to_file')
requests.get(url)
wget.download('url', file_name)
Note: urlopen and urlretrieve were found to perform relatively poorly when downloading large files (size > 500 MB). requests.get keeps the whole file in memory until the download is complete, unless stream=True is passed and the body is read in chunks.
I wrote the wget library in pure Python just for this purpose. It is a pumped-up urlretrieve with these features as of version 2.0.
In Python 3 you can use the urllib.request and shutil modules.
Both are part of the standard library, so there is nothing to install with pip. In particular, urllib3 is a separate third-party package that this code does not use.
Run this code:
import urllib.request
import shutil
url = "http://www.somewebsite.com/something.pdf"
output_file = "save_this_name.pdf"
with urllib.request.urlopen(url) as response, open(output_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
I agree with Corey, urllib2 is more complete than urllib and should likely be the module used if you want to do more complex things, but to make the answers more complete, urllib is a simpler module if you want just the basics:
import urllib
response = urllib.urlopen('http://www.example.com/sound.mp3')
mp3 = response.read()
Will work fine. Or, if you don't want to deal with the "response" object you can call read() directly:
import urllib
mp3 = urllib.urlopen('http://www.example.com/sound.mp3').read()
If you have wget installed, you can use parallel_sync.
pip install parallel_sync
from parallel_sync import wget
urls = ['http://something.png', 'http://somthing.tar.gz', 'http://somthing.zip']
wget.download('/tmp', urls)
# or a single file:
wget.download('/tmp', urls[0], filenames='x.zip', extract=True)
Doc:
https://pythonhosted.org/parallel_sync/pages/examples.html
This is pretty powerful. It can download files in parallel, retry upon failure, and even download files on a remote machine.
You can get the progress feedback with urlretrieve as well:
import sys
import urllib
def report(blocknr, blocksize, size):
    current = blocknr*blocksize
    sys.stdout.write("\r{0:.2f}%".format(100.0*current/size))
def downloadFile(url):
    print "\n",url
    fname = url.split('/')[-1]
    print fname
    urllib.urlretrieve(url, fname, report)
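The snippet above is Python 2. A minimal Python 3 sketch of the same idea follows; the helper name percent_done and the cap at 100% are my own additions, not part of the original answer. urlretrieve calls the hook with the block count, the block size, and the total size, and the total is -1 when the server does not report one.

```python
import sys
import urllib.request


def percent_done(block_number, block_size, total_size):
    """Return progress as a percentage, or None when the size is unknown."""
    if total_size <= 0:
        return None
    # The last block is usually partial, so cap the estimate at 100%.
    return min(100.0, 100.0 * block_number * block_size / total_size)


def report(block_number, block_size, total_size):
    """Report hook with the three-argument signature urlretrieve expects."""
    percent = percent_done(block_number, block_size, total_size)
    if percent is not None:
        sys.stdout.write("\r{0:.2f}%".format(percent))
        sys.stdout.flush()


def download_file(url, filename=None):
    """Download url to filename (default: last path segment of the URL)."""
    if filename is None:
        filename = url.split("/")[-1]
    urllib.request.urlretrieve(url, filename, report)
    sys.stdout.write("\n")
    return filename
```

Keeping the arithmetic in its own function makes it easy to check without touching the network.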
If speed matters to you, I made a small performance test for the modules urllib and wget, and regarding wget I tried once with status bar and once without. I took three different 500MB files to test with (different files- to eliminate the chance that there is some caching going on under the hood). Tested on debian machine, with python2.
First, these are the results (they are similar in different runs):
$ python wget_test.py
urlretrive_test : starting
urlretrive_test : 6.56
==============
wget_no_bar_test : starting
wget_no_bar_test : 7.20
==============
wget_with_bar_test : starting
100% [......................................................................] 541335552 / 541335552
wget_with_bar_test : 50.49
==============
The way I performed the test is using "profile" decorator. This is the full code:
import wget
import urllib
import time
from functools import wraps
def profile(func):
    @wraps(func)
    def inner(*args):
        print func.__name__, ": starting"
        start = time.time()
        ret = func(*args)
        end = time.time()
        print func.__name__, ": {:.2f}".format(end - start)
        return ret
    return inner
url1 = 'http://host.com/500a.iso'
url2 = 'http://host.com/500b.iso'
url3 = 'http://host.com/500c.iso'
def do_nothing(*args):
    pass
@profile
def urlretrive_test(url):
    return urllib.urlretrieve(url)
@profile
def wget_no_bar_test(url):
    return wget.download(url, out='/tmp/', bar=do_nothing)
@profile
def wget_with_bar_test(url):
    return wget.download(url, out='/tmp/')
urlretrive_test(url1)
print '=============='
time.sleep(1)
wget_no_bar_test(url2)
print '=============='
time.sleep(1)
wget_with_bar_test(url3)
print '=============='
time.sleep(1)
urllib seems to be the fastest
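For anyone repeating the measurement on Python 3, here is a hedged sketch of the same timing decorator. It uses time.perf_counter, which is better suited to measuring intervals than time.time. The fake_download function is a hypothetical stand-in that sleeps instead of downloading, so the sketch runs without a network; swap in a real download call to time it.

```python
import time
from functools import wraps


def profile(func):
    """Print how long each call to func takes and return its result."""
    @wraps(func)
    def inner(*args, **kwargs):
        print(func.__name__, ": starting")
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print("{} : {:.2f}".format(func.__name__, elapsed))
        return result
    return inner


@profile
def fake_download(seconds):
    # Hypothetical stand-in for a download: sleeps instead of using the network.
    time.sleep(seconds)
    return seconds
```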
Just for the sake of completeness, it is also possible to call any program for retrieving files using the subprocess package. Programs dedicated to retrieving files are more powerful than Python functions like urlretrieve. For example, wget can download directories recursively (-r), can deal with FTP, redirects, and HTTP proxies, and can avoid re-downloading existing files (-nc), while aria2 can do multi-connection downloads, which can potentially speed up your downloads.
import subprocess
subprocess.check_output(['wget', '-O', 'example_output_file.html', 'https://example.com'])
In Jupyter Notebook, one can also call programs directly with the ! syntax:
!wget -O example_output_file.html https://example.com
Late answer, but for python>=3.6 you can use:
import dload
dload.save(url)
Install dload with:
pip3 install dload
Source code can be:
import urllib
sock = urllib.urlopen("http://diveintopython.org/")
htmlSource = sock.read()
sock.close()
print htmlSource
I wrote the following, which works in vanilla Python 2 or Python 3.
import sys
try:
    import urllib.request
    python3 = True
except ImportError:
    import urllib2
    python3 = False
def progress_callback_simple(downloaded,total):
    sys.stdout.write(
        "\r" +
        (len(str(total))-len(str(downloaded)))*" " + str(downloaded) + "/%d"%total +
        " [%3.2f%%]"%(100.0*float(downloaded)/float(total))
    )
    sys.stdout.flush()
def download(srcurl, dstfilepath, progress_callback=None, block_size=8192):
    def _download_helper(response, out_file, file_size):
        if progress_callback!=None: progress_callback(0,file_size)
        if block_size == None:
            buffer = response.read()
            out_file.write(buffer)
            if progress_callback!=None: progress_callback(file_size,file_size)
        else:
            file_size_dl = 0
            while True:
                buffer = response.read(block_size)
                if not buffer: break
                file_size_dl += len(buffer)
                out_file.write(buffer)
                if progress_callback!=None: progress_callback(file_size_dl,file_size)
    with open(dstfilepath,"wb") as out_file:
        if python3:
            with urllib.request.urlopen(srcurl) as response:
                file_size = int(response.getheader("Content-Length"))
                _download_helper(response,out_file,file_size)
        else:
            response = urllib2.urlopen(srcurl)
            meta = response.info()
            file_size = int(meta.getheaders("Content-Length")[0])
            _download_helper(response,out_file,file_size)
import traceback
try:
    download(
        "https://geometrian.com/data/programming/projects/glLib/glLib%20Reloaded%200.5.9/0.5.9.zip",
        "output.zip",
        progress_callback_simple
    )
except:
    traceback.print_exc()
    input()
Notes:
Supports a "progress bar" callback.
Download is a 4 MB test .zip from my website.
You can use PycURL on Python 2 and 3.
import pycurl
FILE_DEST = 'pycurl.html'
FILE_SRC = 'http://pycurl.io/'
with open(FILE_DEST, 'wb') as f:
    c = pycurl.Curl()
    c.setopt(c.URL, FILE_SRC)
    c.setopt(c.WRITEDATA, f)
    c.perform()
    c.close()
Use Python Requests in 5 lines
import requests as req
remote_url = 'http://www.example.com/sound.mp3'
local_file_name = 'sound.mp3'
data = req.get(remote_url)
# Save file data to local copy
with open(local_file_name, 'wb') as file:
    file.write(data.content)
Now do something with the local copy of the remote file
This may be a little late, but I saw PabloG's code and couldn't help adding an os.system('cls') to make it look AWESOME! Check it out:
import urllib2,os
url = "http://download.thinkbroadband.com/10MB.zip"
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)
os.system('cls')
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,
f.close()
If running in an environment other than Windows, you will have to use something other than 'cls'. On Mac OS X and Linux it should be 'clear'.
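One way to avoid hard-coding the command is to pick it from os.name, which is 'nt' on Windows. This is a small sketch; the function names clear_command and clear_screen are my own, not from the answer above.

```python
import os
import subprocess


def clear_command(os_name=os.name):
    """Return the shell command that clears the terminal for this platform."""
    return "cls" if os_name == "nt" else "clear"


def clear_screen():
    # cls is a shell builtin on Windows, so run the command through the shell.
    subprocess.call(clear_command(), shell=True)
```

Calling clear_screen() in place of os.system('cls') keeps the script usable on Windows, Mac OS X, and Linux.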
urlretrieve and requests.get are simple; reality, however, is not.
I have fetched data from a couple of sites, including text and images, and the above two probably solve most tasks. But for a more universal solution I suggest using urlopen. As it is included in the Python 3 standard library, your code can run on any machine that runs Python 3 without pre-installing a site package.
import urllib.request
url_request = urllib.request.Request(url, headers=headers)
url_connect = urllib.request.urlopen(url_request)
#remember to open file in bytes mode
with open(filename, 'wb') as f:
    while True:
        buffer = url_connect.read(buffer_size)
        if not buffer: break
        #an integer value of size of written data
        data_wrote = f.write(buffer)
#you could probably use with-open-as manner
url_connect.close()
This answer provides a solution to HTTP 403 Forbidden when downloading file over http using Python. I have tried only requests and urllib modules, the other module may provide something better, but this is the one I used to solve most of the problems.
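The snippet above leaves url, headers, filename, and buffer_size undefined. A self-contained sketch of the same loop, wrapped in a function, might look like this; the function name download_with_headers and the default buffer size of 8192 are assumptions of mine.

```python
import urllib.request


def download_with_headers(url, filename, headers=None, buffer_size=8192):
    """Stream url into filename, sending the given request headers.

    Returns the number of bytes written.
    """
    request = urllib.request.Request(url, headers=headers or {})
    total = 0
    # Both the response and the output file are closed when the block exits.
    with urllib.request.urlopen(request) as response, open(filename, "wb") as f:
        while True:
            buffer = response.read(buffer_size)
            if not buffer:
                break
            total += f.write(buffer)
    return total
```

Some servers answer 403 to the default Python user agent, so a call such as download_with_headers(url, 'page.html', {'User-Agent': 'Mozilla/5.0'}) may help. That header value is only an illustration; which headers a given site requires is something you have to find out per site.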
A newer, urllib3-based implementation:
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET', 'your_url_goes_here')
>>> r.status
200
>>> r.data
*****Response Data****
More info: https://pypi.org/project/urllib3/
You can use Python requests:
import os
import requests
outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(URL, stream=True)
with open(outfile,'wb') as output:
    output.write(response.content)
You can use shutil
import os
import requests
import shutil
outfile = os.path.join(SAVE_DIR, file_name)
response = requests.get(url, stream = True)
with open(outfile, 'wb') as f:
    # response.content is a bytes object; copyfileobj needs the file-like response.raw
    shutil.copyfileobj(response.raw, f)
If you are downloading from a restricted URL, don't forget to include the access token in the headers.
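As a rough illustration of passing a token, here is a standard-library sketch. It assumes the server expects a bearer token in the Authorization header, which is common but not universal, so check what your API actually requires. The function name build_authorized_request is hypothetical.

```python
import urllib.request


def build_authorized_request(url, token):
    """Return a Request carrying a bearer token in the Authorization header.

    Assumption: the server uses the 'Bearer <token>' scheme.
    """
    return urllib.request.Request(
        url, headers={"Authorization": "Bearer " + token}
    )
```

The returned object can be passed to urllib.request.urlopen. With requests, the same headers dictionary can be passed as requests.get(url, headers=..., stream=True).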
I wanted to download all the files from a webpage. I tried wget but it was failing, so I decided to go the Python route and found this thread.
After reading it, I have made a little command line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.
It uses BeautifulSoup to collect all the URLs on the page and then downloads the ones with the desired extension(s). Finally, it can download multiple files in parallel.
Here it is:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import (division, absolute_import, print_function, unicode_literals)
import sys, os, argparse
from bs4 import BeautifulSoup
# --- insert Stan's script here ---
# if sys.version_info >= (3,):
#...
#...
# def download_file(url, dest=None):
#...
#...
# --- new stuff ---
def collect_all_url(page_url, extensions):
    """
    Recovers all links in page_url checking for all the desired extensions
    """
    conn = urllib2.urlopen(page_url)
    html = conn.read()
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('a')
    results = []
    for tag in links:
        link = tag.get('href', None)
        if link is not None:
            for e in extensions:
                if e in link:
                    # Fallback for badly defined links
                    # checks for missing scheme or netloc
                    if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
                        results.append(link)
                    else:
                        new_url=urlparse.urljoin(page_url,link)
                        results.append(new_url)
    return results
if __name__ == "__main__": # Only run if this file is called directly
    # Command line arguments
    parser = argparse.ArgumentParser(
        description='Download all files from a webpage.')
    parser.add_argument(
        '-u', '--url',
        help='Page url to request')
    parser.add_argument(
        '-e', '--ext',
        nargs='+',
        help='Extension(s) to find')
    parser.add_argument(
        '-d', '--dest',
        default=None,
        help='Destination where to save the files')
    parser.add_argument(
        '-p', '--par',
        action='store_true', default=False,
        help="Turns on parallel download")
    args = parser.parse_args()
    # Recover files to download
    all_links = collect_all_url(args.url, args.ext)
    # Download
    if not args.par:
        for l in all_links:
            try:
                filename = download_file(l, args.dest)
                print(l)
            except Exception as e:
                print("Error while downloading: {}".format(e))
    else:
        from multiprocessing.pool import ThreadPool
        results = ThreadPool(10).imap_unordered(
            lambda x: download_file(x, args.dest), all_links)
        for p in results:
            print(p)
An example of its usage is:
python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>
And an actual example if you want to see it in action:
python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics
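The parallel branch of the script boils down to mapping a download function over a list of URLs with a pool of threads. Below is a standalone sketch of that pattern using concurrent.futures instead of multiprocessing.pool.ThreadPool. The names download_file and download_all, the default of 4 workers, and the choice to return exceptions rather than raise them are my own assumptions, not part of soupget.

```python
import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def download_file(url, dest_dir="."):
    """Download one URL into dest_dir and return the local path."""
    name = os.path.basename(urlsplit(url).path) or "downloaded.file"
    path = os.path.join(dest_dir, name)
    with urllib.request.urlopen(url) as response, open(path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return path


def download_all(urls, dest_dir=".", workers=4):
    """Download urls concurrently; return (url, path_or_exception) pairs."""
    def attempt(url):
        try:
            return url, download_file(url, dest_dir)
        except Exception as exc:  # keep going if one download fails
            return url, exc

    # pool.map yields results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(attempt, urls))
```

Threads are a reasonable fit here because downloading is I/O bound: the workers spend most of their time waiting on the network rather than competing for the CPU.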
Another possibility is with built-in http.client:
from http import HTTPStatus, client
from shutil import copyfileobj
# using https
connection = client.HTTPSConnection("www.example.com")
# request() returns None; the response comes from getresponse()
connection.request("GET", "/noise.mp3")
with connection.getresponse() as response:
    if response.status == HTTPStatus.OK:
        with open("noise.mp3", "wb") as out_file:
            copyfileobj(response, out_file)
    else:
        raise Exception("request needs work")
The HTTPConnection object is considered “low-level” in that it performs the desired request once and assumes the developer will subclass it or script in a way to handle the nuances of HTTP. Libraries such as requests tend to handle more special cases such as automatically following redirects and so on.
You can use keras.utils.get_file to do it:
from tensorflow import keras
path_to_downloaded_file = keras.utils.get_file(
fname="file name",
origin="https://www.linktofile.com/link/to/file",
extract=True,
archive_format="zip", # downloaded file format
cache_dir="/", # cache and extract in current directory
)
Another way is to call an external process such as curl.exe. Curl by default displays a progress bar, average download speed, time left, and more all formatted neatly in a table.
Put curl.exe in the same directory as your script
from subprocess import call
url = ""
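call(["curl", url, "--output", "song.mp3"])
Note: --output accepts a path, not just a file name, so the file can be written straight to its destination (for example "downloads/song.mp3", provided the directory exists) and no os.rename is needed afterwards.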
