How to use progressbar module with urlretrieve - python

My Python 3 script downloads a number of images over the internet using urlretrieve, and I'd like to add a progress bar with a completion percentage and download speed for each download.
The progressbar module seems like a good solution; I've looked through its examples, and example4 seems like the right thing, but I still can't understand how to wrap it around urlretrieve.
I guess I should add a third parameter:
urllib.request.urlretrieve('img_url', 'img_filename', some_progressbar_based_reporthook)
But how do I properly define it?

The suggestion in the other answer did not progress for me past 1%. Here is a complete implementation that works for me on Python 3:
import progressbar
import urllib.request

pbar = None

def show_progress(block_num, block_size, total_size):
    global pbar
    if pbar is None:
        pbar = progressbar.ProgressBar(maxval=total_size)
        pbar.start()

    downloaded = block_num * block_size
    if downloaded < total_size:
        pbar.update(downloaded)
    else:
        pbar.finish()
        pbar = None

urllib.request.urlretrieve(model_url, model_file, show_progress)

I think a better solution is to create a class that holds all the needed state:
import progressbar

class MyProgressBar():
    def __init__(self):
        self.pbar = None

    def __call__(self, block_num, block_size, total_size):
        if not self.pbar:
            self.pbar = progressbar.ProgressBar(maxval=total_size)
            self.pbar.start()

        downloaded = block_num * block_size
        if downloaded < total_size:
            self.pbar.update(downloaded)
        else:
            self.pbar.finish()
and call:
urllib.request.urlretrieve('img_url', 'img_filename', MyProgressBar())
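The original question also asks for download speed. The bar can show that if it is built with explicit widgets; here is a minimal sketch of the same class using the Percentage, Bar and FileTransferSpeed widgets (the classic progressbar package uses the maxval argument shown above, while progressbar2 prefers max_value):
import progressbar
import urllib.request

class MyProgressBar():
    def __init__(self):
        self.pbar = None

    def __call__(self, block_num, block_size, total_size):
        if not self.pbar:
            # Percentage, Bar and FileTransferSpeed are stock widgets in
            # both the classic progressbar package and progressbar2.
            widgets = [progressbar.Percentage(), ' ',
                       progressbar.Bar(), ' ',
                       progressbar.FileTransferSpeed()]
            self.pbar = progressbar.ProgressBar(maxval=total_size, widgets=widgets)
            self.pbar.start()

        downloaded = block_num * block_size
        if downloaded < total_size:
            self.pbar.update(downloaded)
        else:
            self.pbar.finish()

# 'img_url' and 'img_filename' are the placeholders from the question.
urllib.request.urlretrieve('img_url', 'img_filename', MyProgressBar())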

The hook is defined as:
urlretrieve(url[, filename[, reporthook[, data]]])
"The third argument, if present, is a hook function that will be called
once on establishment of the network connection and once after each block
read thereafter. The hook will be passed three arguments; a count of blocks
transferred so far, a block size in bytes, and the total size of the file.
The third argument may be -1 on older FTP servers which do not return a
file size in response to a retrieval request. "
So, you can write a hook as follows:
# Global state shared between calls to the hook
pbar = None
downloaded = 0

def show_progress(count, block_size, total_size):
    global pbar, downloaded
    if pbar is None:
        pbar = ProgressBar(maxval=total_size)
    downloaded += block_size
    pbar.update(block_size)
    if downloaded == total_size:
        pbar.finish()
        pbar = None
        downloaded = 0
As a side note, I strongly recommend using the requests library, which is a lot easier to use; you can iterate over the response with the iter_content() method.
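For reference, a minimal sketch of that approach (a hypothetical helper, not part of the answer above; the chunk size and timeout are arbitrary choices):
import requests
import progressbar

def download(url, filename, chunk_size=8192):
    # Stream the response so the whole body is not loaded into memory at once.
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        total = int(response.headers.get('content-length', 0))
        pbar = progressbar.ProgressBar(maxval=total).start() if total else None
        downloaded = 0
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                downloaded += len(chunk)
                if pbar:
                    pbar.update(min(downloaded, total))
        if pbar:
            pbar.finish()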

In Python 3 you can achieve the same result without the progressbar module:
import urllib.request

# prepare progressbar
def show_progress(block_num, block_size, total_size):
    print(round(block_num * block_size / total_size * 100, 2), end="\r")

# use urlretrieve
urllib.request.urlretrieve(url, fileName, show_progress)
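If you prefer tqdm over a hand-rolled percentage, it can also be used as the reporthook; a minimal sketch (an assumption, not part of the answer above; it follows the update-by-increment pattern from tqdm's documentation):
import urllib.request
from tqdm import tqdm

def download_with_tqdm(url, filename):
    with tqdm(unit='B', unit_scale=True, miniters=1, desc=filename) as bar:
        def reporthook(block_num, block_size, total_size):
            downloaded = block_num * block_size
            if total_size > 0:
                bar.total = total_size
                downloaded = min(downloaded, total_size)
            # tqdm's update() takes the increment since the last call
            bar.update(downloaded - bar.n)
        urllib.request.urlretrieve(url, filename, reporthook)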

Related

Update tqdm bar

I'm creating a small script to download the audio from YouTube videos.
Every time you download a sound, I use a tqdm bar to display download info.
The first time you download, everything works fine, but the second time my bar is completely destroyed :(. I really don't know what's happening with it (I think the bar doesn't update correctly).
Here's the code that handles the bar and downloads the sound.
Thanks for your time :)
def DownloadAudioFromUrl(url):
    print("Getting the URL...")
    vid = pafy.new(url)
    print("Done")

    print("Getting best quality...")
    stream = vid.getbestaudio()
    fileSize = stream.get_filesize()
    print("Done")

    print("Downloading: " + vid.title + " ...")
    with tqdm.tqdm(total=fileSize, unit_scale=True, unit='B', initial=0) as pbar:
        stream.download("Download/", quiet=True, callback=lambda _, received, *args: UpdateBar(pbar, received))
    print("Done")

    ResetProgressBar(pbar)
    WebmToMp3()

def ResetProgressBar(bar):
    bar.reset()
    bar.clear()
    bar.close()

# I used these the last time I tried; I don't understand how they work :/
def UpdateBar(pbar, current_received):
    global previous_received
    diff = current_received - previous_received
    pbar.update(diff)
    previous_received = current_received
So I tried to update the bar with "reset", "clear" and "stop", but it changed nothing.
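A likely cause, judging only from the code above (so treat this as a guess): previous_received is a module-level global that keeps its value from the first download, so the second bar starts from a stale offset. One way to avoid the global entirely is to build a fresh callback per download; a minimal sketch (the callback argument order is assumed from the lambda in the question):
def make_update_bar(pbar):
    # Per-download state: a new counter is created on every call,
    # so a second download starts counting from zero again.
    state = {'previous': 0}
    def update_bar(total, received, *args):
        pbar.update(received - state['previous'])
        state['previous'] = received
    return update_bar

# inside DownloadAudioFromUrl:
#   stream.download("Download/", quiet=True, callback=make_update_bar(pbar))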

Get Progress in Python file Upload to Azure

I'm uploading files to Azure like so:
with open(tempfile, "rb") as data:
    blob_client.upload_blob(data, blob_type='BlockBlob', length=None, metadata=None)
How can I get a progress indication?
When I try uploading as a stream, it only uploads one chunk.
I'm sure I'm doing something wrong, but can't find info.
Thanks!
It looks like the Azure library doesn't include a callback function to monitor progress.
Fortunately, you can add a wrapper around Python's file object which calls a callback every time there's a read.
Try this:
import os
from io import BufferedReader, FileIO

class ProgressFile(BufferedReader):
    # For binary opening only

    def __init__(self, filename, read_callback):
        f = FileIO(file=filename, mode='r')
        self._read_callback = read_callback
        super().__init__(raw=f)
        # I prefer pathlib but this should still support 2.x
        self.length = os.stat(filename).st_size

    def read(self, size=None):
        calc_sz = size
        if not calc_sz:
            calc_sz = self.length - self.tell()
        self._read_callback(position=self.tell(), read_size=calc_sz, total=self.length)
        return super(ProgressFile, self).read(size)

def my_callback(position, read_size, total):
    # Write your own callback. You could convert the absolute values to percentages.
    # Using .format rather than f'' for compatibility.
    print("position: {position}, read_size: {read_size}, total: {total}".format(
        position=position, read_size=read_size, total=total))

myfile = ProgressFile(filename='mybigfile.txt', read_callback=my_callback)
Then you would do
blob_client.upload_blob(myfile, blob_type='BlockBlob', length=None, metadata=None)
myfile.close()
Edit:
It looks like TQDM (progress monitor) has a neat wrapper: https://github.com/tqdm/tqdm#hooks-and-callbacks.
The bonus there is that you get easy access to a pretty progress bar.
This is how I ended up using the tqdm wrapper that Alastair mentioned above:
size = os.stat(fname).st_size
with tqdm.wrapattr(open(fname, 'rb'), "read", total=size) as data:
    blob_client.upload_blob(data)
Works perfectly: it shows a time estimate, a progress bar, human-readable file sizes and transfer speeds.

request.urlretrieve in multiprocessing Python gets stuck

I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using:
...
import multiprocessing as mp

def getImages(val):
    # Download images
    try:
        url= # preprocess the url from the input val
        local= # Filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
It often gets stuck halfway through the list (it prints DONE or CAN'T DOWNLOAD for half of the list it has processed, but I don't know what is happening with the rest). Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance
OK, I have found an answer.
A possible culprit was that the script was getting stuck connecting to or downloading from a URL, so I added a socket timeout to limit the time to connect and download each image.
And now, the issue no longer bothers me.
Here is my complete code:
...
import multiprocessing as mp
import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):
    # Download images
    try:
        url= # preprocess the url from the input val
        local= # Filename generation from global variables and random stuff...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
Hope this solution helps others who are facing the same issue
It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time.
The multiprocessing module really does launch separate instances of Python to get the work done in parallel.
But in your case, urllib is called in all these instances: each of them tries to lock the IO process; the one that succeeds (e.g. comes first) gets you the result, while the others (trying to lock an already locked process) fail.
This is a very simplified explanation, but here are some additional resources:
You can find another way to parallelize requests here: Multiprocessing useless with urllib2?
And more info about the GIL here: What is a global interpreter lock (GIL)?

Python Multithreading Not Functioning

Excuse the unhelpful variable names and unnecessarily bloated code, but I just quickly whipped this together and haven't had time to optimise or tidy up yet.
I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo sharing service ( 321cheese.com ) by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.
At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct results (which is 121 photos in this case). But when I try to send this action to a new thread, the program sometimes downloads 112 photos, sometimes 90, sometimes 115 photos, etc, but never gives out the correct result.
Why would this create a problem? Should I limit the number of simultaneous threads (and how)?
import urllib
import thread

def getName(input):
    l = input.split(".com/")
    m = l[1]
    return m

def parseMessages():
    theFile = open('messages.html', 'r')
    theLines = theFile.readlines()
    theFile.close()
    theNewFile = open('new321.txt', 'w')
    for z in theLines:
        if "321cheese" in z:
            theNewFile.write(z)
    theNewFile.close()

def downloadImage(inputURL):
    urllib.urlretrieve(inputURL, "./grabNew/" + d)

parseMessages()

f = open('new321.txt', 'r')
lines = f.readlines()
f.close()

g = open('output.txt', 'w')

for x in lines:
    a = x.split("<a href=\"")
    b = a[1].split("\"")
    c = b[0]
    if ".png" in c:
        d = getName(c)
        g.write(c + "\n")
        thread.start_new_thread(downloadImage, (c,))
        ##downloadImage(c)

g.close()
There are multiple issues in your code.
The main issue is the use of the global name d across multiple threads. To fix it, pass the name explicitly as an argument to downloadImage().
The easy way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures (available on Python 2 as futures) or multiprocessing.Pool:
#!/usr/bin/env python
import os
import urllib
from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit

download_dir = "grabNew"

def url2filename(url):
    return basename(unquote(urlsplit(url).path).decode('utf-8'))

def download_image(url):
    filename = None
    try:
        filename = os.path.join(download_dir, url2filename(url))
        return urllib.urlretrieve(url, filename), None
    except Exception as e:
        return (filename, None), e

def main():
    pool = Pool(processes=10)
    # get_urls() is assumed to yield the image URLs parsed from the log
    for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
        pass  # do something with the downloaded file or handle an error

if __name__ == "__main__":
    main()
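For comparison, here is the same idea with concurrent.futures, where max_workers bounds the number of simultaneous downloads (a Python 3 sketch with a placeholder URL list, not a drop-in for the Python 2 code above):
#!/usr/bin/env python3
import os
from concurrent.futures import ThreadPoolExecutor
from posixpath import basename
from urllib.parse import urlsplit, unquote
from urllib.request import urlretrieve

download_dir = "grabNew"

def url2filename(url):
    return basename(unquote(urlsplit(url).path))

def download_image(url):
    filename = os.path.join(download_dir, url2filename(url))
    return urlretrieve(url, filename)

def main(urls):
    # max_workers limits how many downloads run at once
    with ThreadPoolExecutor(max_workers=10) as executor:
        for filename, headers in executor.map(download_image, urls):
            pass  # do something with the downloaded file; worker exceptions surface here

if __name__ == "__main__":
    main(["http://example.com/a.png"])  # placeholder list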
Did you make sure your parsing is working correctly?
Also, you are launching too many threads.
And finally... threads in Python are FAKE! Use the multiprocessing module if you want real parallelism. But since the images are probably all from the same server, if you open a hundred connections to it at the same time, its firewall will probably start dropping your connections.

Speed up nautilus python-extensions for reading image's Exif

I've written a Nautilus extension which reads pictures' metadata (by executing exiftool), but when I open folders with many files, it really slows down the file manager and hangs until it finishes reading the files' data.
Is there a way to make Nautilus keep working while it runs my extension? Perhaps the Exif data could appear gradually in the columns while I go on with my work.
#!/usr/bin/python
# Requires:
# nautilus-python
# exiftool
# gconf-python
# Version 0.15
import gobject
import nautilus
from subprocess import Popen, PIPE
from urllib import unquote
import gconf

def getexiftool(filename):
    options = '-fast2 -f -m -q -q -s3 -ExifIFD:DateTimeOriginal -IFD0:Software -ExifIFD:Flash -Composite:ImageSize -IFD0:Model'
    exiftool = Popen(['/usr/bin/exiftool'] + options.split() + [filename], stdout=PIPE, stderr=PIPE)
    # '-Nikon:ShutterCount' cannot be used together with the -fast2 option
    output, errors = exiftool.communicate()
    return output.split('\n')

class ColumnExtension(nautilus.ColumnProvider, nautilus.InfoProvider, gobject.GObject):
    def __init__(self):
        pass

    def get_columns(self):
        return (
            nautilus.Column("NautilusPython::ExifIFD:DateTimeOriginal", "ExifIFD:DateTimeOriginal", "Date (ExifIFD)", "Date the picture was taken"),
            nautilus.Column("NautilusPython::IFD0:Software", "IFD0:Software", "Software (IFD0)", "Software used"),
            nautilus.Column("NautilusPython::ExifIFD:Flash", "ExifIFD:Flash", "Flash (ExifIFD)", "Flash mode"),
            nautilus.Column("NautilusPython::Composite:ImageSize", "Composite:ImageSize", "Resolution (Exif)", "Image resolution"),
            nautilus.Column("NautilusPython::IFD0:Model", "IFD0:Model", "Camera (IFD0)", "Camera model"),
            #nautilus.Column("NautilusPython::Nikon:ShutterCount", "Nikon:ShutterCount", "Shutter count (Nikon)", "Number of shots the camera had taken up to this file"),
            nautilus.Column("NautilusPython::Mp", "Mp", "Megapixel (Exif)", "Image size in megapixels"),
        )

    def update_file_info_full(self, provider, handle, closure, file):
        client = gconf.client_get_default()
        if not client.get_bool('/apps/nautilus/nautilus-metadata/enable'):
            client.set_bool('/apps/nautilus/nautilus-metadata/enable', 0)
            return
        if file.get_uri_scheme() != 'file':
            return
        if file.get_mime_type() in ('image/jpeg', 'image/png', 'image/gif', 'image/bmp', 'image/x-nikon-nef', 'image/x-xcf', 'image/vnd.adobe.photoshop'):
            gobject.timeout_add_seconds(1, self.update_exif, provider, handle, closure, file)
            return Nautilus.OperationResult.IN_PROGRESS
        file.add_string_attribute('ExifIFD:DateTimeOriginal', '')
        file.add_string_attribute('IFD0:Software', '')
        file.add_string_attribute('ExifIFD:Flash', '')
        file.add_string_attribute('Composite:ImageSize', '')
        file.add_string_attribute('IFD0:Model', '')
        file.add_string_attribute('Nikon:ShutterCount', '')
        file.add_string_attribute('Mp', '')
        return Nautilus.OperationResult.COMPLETE

    def update_exif(self, provider, handle, closure, file):
        filename = unquote(file.get_uri()[7:])
        data = getexiftool(filename)
        file.add_string_attribute('ExifIFD:DateTimeOriginal', data[0].replace(':', '-', 2))
        file.add_string_attribute('IFD0:Software', data[1])
        file.add_string_attribute('ExifIFD:Flash', data[2])
        file.add_string_attribute('Composite:ImageSize', data[3])
        file.add_string_attribute('IFD0:Model', data[4])
        #file.add_string_attribute('Nikon:ShutterCount', data[5])
        width, height = data[3].split('x')
        mp = float(width) * float(height) / 1000000
        mp = "%.2f" % mp
        file.add_string_attribute('Mp', str(mp) + ' Mp')
        Nautilus.info_provider_update_complete_invoke(closure, provider, handle, Nautilus.OperationResult.COMPLETE)
        return False
That happens because you are invoking update_file_info, which is part of the asynchronous IO system of Nautilus. It therefore blocks Nautilus if the operations are not fast enough.
In your case it is exacerbated because you are calling an external program, which is an expensive operation. Notice that update_file_info is called once per file: if you have 100 files, you will call the external program 100 times, and Nautilus will have to wait for each call before processing the next one.
Since nautilus-python 0.7, update_file_info_full and cancel_update are available, which allow you to program asynchronous calls. You can check the nautilus-python 0.7 documentation for more details.
It is worth mentioning that this was a limitation of nautilus-python only, which previously did not expose those methods available in C.
EDIT: Added a couple of examples.
The trick is to make the process as fast as possible, or to make it asynchronous.
Example 1: Invoking an external program
Using a simplified version of your code, we make it asynchronous using
GObject.timeout_add_seconds in update_file_info_full.
from gi.repository import Nautilus, GObject
from urllib import unquote
from subprocess import Popen, PIPE

def getexiftool(filename):
    options = '-fast2 -f -m -q -q -s3 -ExifIFD:DateTimeOriginal'
    exiftool = Popen(['/usr/bin/exiftool'] + options.split() + [filename],
                     stdout=PIPE, stderr=PIPE)
    output, errors = exiftool.communicate()
    return output.split('\n')

class MyExtension(Nautilus.ColumnProvider, Nautilus.InfoProvider, GObject.GObject):
    def __init__(self):
        pass

    def get_columns(self):
        return (
            Nautilus.Column(name='MyExif::DateTime',
                            attribute='Exif:Image:DateTime',
                            label='Date Original',
                            description='Date time original'),
        )

    def update_file_info_full(self, provider, handle, closure, file_info):
        if file_info.get_uri_scheme() != 'file':
            return

        filename = unquote(file_info.get_uri()[7:])
        attr = ''
        if file_info.get_mime_type() in ('image/jpeg', 'image/png'):
            GObject.timeout_add_seconds(1, self.update_exif,
                                        provider, handle, closure, file_info)
            return Nautilus.OperationResult.IN_PROGRESS

        file_info.add_string_attribute('Exif:Image:DateTime', attr)
        return Nautilus.OperationResult.COMPLETE

    def update_exif(self, provider, handle, closure, file_info):
        filename = unquote(file_info.get_uri()[7:])
        try:
            data = getexiftool(filename)
            attr = data[0]
        except:
            attr = ''

        file_info.add_string_attribute('Exif:Image:DateTime', attr)
        Nautilus.info_provider_update_complete_invoke(closure, provider,
            handle, Nautilus.OperationResult.COMPLETE)
        return False
The code above will not block Nautilus, and if the column 'Date Original' is visible in the column view, JPEG and PNG images will first show an 'unknown' value and then slowly be updated (the subprocess is called after 1 second).
Example 2: Using a library
Rather than invoking an external program, it could be better to use a library, as in the example below:
from gi.repository import Nautilus, GObject
from urllib import unquote
import pyexiv2

class MyExtension(Nautilus.ColumnProvider, Nautilus.InfoProvider, GObject.GObject):
    def __init__(self):
        pass

    def get_columns(self):
        return (
            Nautilus.Column(name='MyExif::DateTime',
                            attribute='Exif:Image:DateTime',
                            label='Date Original',
                            description='Date time original'),
        )

    def update_file_info_full(self, provider, handle, closure, file_info):
        if file_info.get_uri_scheme() != 'file':
            return

        filename = unquote(file_info.get_uri()[7:])
        attr = ''
        if file_info.get_mime_type() in ('image/jpeg', 'image/png'):
            metadata = pyexiv2.ImageMetadata(filename)
            metadata.read()
            try:
                tag = metadata['Exif.Image.DateTime'].value
                attr = tag.strftime('%Y-%m-%d %H:%M')
            except:
                attr = ''

        file_info.add_string_attribute('Exif:Image:DateTime', attr)
        return Nautilus.OperationResult.COMPLETE
Eventually, if the routine is slow you would need to make it asynchronous (maybe using something better than GObject.timeout_add_seconds).
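One possible shape for that, as a sketch only (untested against a real Nautilus session): run the slow metadata read in a worker thread and hand the result back to the GTK main loop with GLib.idle_add, then signal completion exactly as in Example 1. It assumes the getexiftool helper and the MyExtension class defined above; the _read_metadata/_apply_metadata split is my own naming.
import threading
from gi.repository import Nautilus, GLib
from urllib import unquote

# Methods intended to replace update_file_info_full/update_exif
# in the MyExtension class from Example 1.
def update_file_info_full(self, provider, handle, closure, file_info):
    if file_info.get_uri_scheme() != 'file':
        return Nautilus.OperationResult.COMPLETE
    if file_info.get_mime_type() not in ('image/jpeg', 'image/png'):
        file_info.add_string_attribute('Exif:Image:DateTime', '')
        return Nautilus.OperationResult.COMPLETE
    worker = threading.Thread(target=self._read_metadata,
                              args=(provider, handle, closure, file_info))
    worker.daemon = True
    worker.start()
    return Nautilus.OperationResult.IN_PROGRESS

def _read_metadata(self, provider, handle, closure, file_info):
    # The slow part runs off the main loop, so Nautilus stays responsive.
    try:
        attr = getexiftool(unquote(file_info.get_uri()[7:]))[0]
    except Exception:
        attr = ''
    # Nautilus/GTK objects should only be touched from the main loop.
    GLib.idle_add(self._apply_metadata, provider, handle, closure, file_info, attr)

def _apply_metadata(self, provider, handle, closure, file_info, attr):
    file_info.add_string_attribute('Exif:Image:DateTime', attr)
    file_info.invalidate_extension_info()
    Nautilus.info_provider_update_complete_invoke(
        closure, provider, handle, Nautilus.OperationResult.COMPLETE)
    return False  # run once, then drop out of the idle loop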
Last but not least, in my examples I used GObject Introspection (typically for Nautilus 3), but it is easy to change them to use the nautilus module directly.
The above solution is only partly correct.
Between state changes to the file_info metadata, you should call file_info.invalidate_extension_info() to notify Nautilus of the change.
Failing to do this can leave 'unknown' showing in your columns.
file_info.add_string_attribute('video_width', video_width)
file_info.add_string_attribute('video_height', video_height)
file_info.add_string_attribute('name_suggestion', name_suggestion)
file_info.invalidate_extension_info()
Nautilus.info_provider_update_complete_invoke(closure, provider, handle, Nautilus.OperationResult.COMPLETE)
Full working example here:
Fully working example
API Documentation
Thanks to Dave!
I was looking for a solution to the 'unknown' text in the column for ages.
file_info.invalidate_extension_info()
Fixed the issue for me right away :)
Per the API documentation:
https://projects-old.gnome.org/nautilus-python/documentation/html/class-nautilus-python-file-info.html#method-nautilus-python-file-info--invalidate-extension-info
Nautilus.FileInfo.invalidate_extension_info
def invalidate_extension_info()
Invalidates the information Nautilus has about this file, which causes it to request new information from its Nautilus.InfoProvider providers.
