Get Progress in Python file Upload to Azure

I'm uploading files to Azure like so:
with open(tempfile, "rb") as data:
    blob_client.upload_blob(data, blob_type='BlockBlob', length=None, metadata=None)
How can I get a progress indication?
When I try uploading as a stream, it only uploads one chunk.
I'm sure I'm doing something wrong, but I can't find any information.
Thanks!

It looks like the Azure library doesn't include a callback function to monitor progress.
Fortunately, you can wrap Python's file object in a class that invokes a callback every time there's a read.
Try this:
import os
from io import BufferedReader, FileIO
class ProgressFile(BufferedReader):
    # For binary opening only
    def __init__(self, filename, read_callback):
        f = FileIO(file=filename, mode='r')
        self._read_callback = read_callback
        super().__init__(raw=f)
        # I prefer Pathlib but this should still support 2.x
        self.length = os.stat(filename).st_size
    def read(self, size=None):
        calc_sz = size
        # None or a negative size means "read to the end of the file"
        if calc_sz is None or calc_sz < 0:
            calc_sz = self.length - self.tell()
        self._read_callback(position=self.tell(), read_size=calc_sz, total=self.length)
        return super(ProgressFile, self).read(size)
def my_callback(position, read_size, total):
    # Write your own callback. You could convert the absolute values to percentages
    # Using .format rather than f'' for compatibility
    print("position: {position}, read_size: {read_size}, total: {total}".format(position=position,
                                                                                read_size=read_size,
                                                                                total=total))
myfile = ProgressFile(filename='mybigfile.txt', read_callback=my_callback)
Then you would do
blob_client.upload_blob(myfile, blob_type='BlockBlob', length=None, metadata=None)
myfile.close()
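The callback comment above suggests converting the absolute values to percentages. Here is a minimal sketch of that idea, using composition instead of subclassing. The names ProgressReader and percent are hypothetical, and whether a given upload client is happy with a wrapper like this (rather than a real file object) is an assumption you should verify.

```python
import io


class ProgressReader:
    """Hypothetical wrapper: counts bytes actually returned by read()."""

    def __init__(self, fileobj, total, on_progress):
        self._fileobj = fileobj
        self._total = total
        self._on_progress = on_progress
        self._seen = 0

    def read(self, size=-1):
        chunk = self._fileobj.read(size)
        self._seen += len(chunk)
        # Report cumulative bytes read so far and the expected total
        self._on_progress(self._seen, self._total)
        return chunk

    def __getattr__(self, name):
        # Delegate everything else (seek, tell, close, ...) to the wrapped file
        return getattr(self._fileobj, name)


def percent(done, total):
    """Convert absolute byte counts to a percentage."""
    return 100.0 * done / total if total else 100.0


if __name__ == "__main__":
    payload = io.BytesIO(b"x" * 1000)
    reader = ProgressReader(payload, 1000,
                            lambda done, total: print("%.0f%%" % percent(done, total)))
    while reader.read(256):
        pass
```

Counting the bytes that read() actually returned, rather than the requested size, keeps the figure accurate on the final short chunk.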
Edit:
It looks like TQDM (progress monitor) has a neat wrapper: https://github.com/tqdm/tqdm#hooks-and-callbacks.
The bonus there is that you get easy access to a pretty progress bar.

This is how I ended up using the tqdm wrapper that Alastair mentioned above:
import os
from tqdm import tqdm
size = os.stat(fname).st_size
with tqdm.wrapattr(open(fname, 'rb'), "read", total=size) as data:
    blob_client.upload_blob(data)
It works perfectly and shows a time estimate, a progress bar, human-readable file sizes, and transfer speeds.

Related

Is there a "proper" way to load config files using ConfigParser? Or, is there an equivalent to logging.getLogger for config files?

I have a particularly large python project that has modules upon modules and classes that call other classes and so on. It's an organized mess.
I want to be able to just read the config file once and get/set key-value pairs out of it from any part of my program.
Right now, my setup looks like this. I have a module with these functions:
def initialize():
    config_file = configparser.ConfigParser()
    config_file_path = os.path.join(Path(__file__).resolve().parents[2], 'config.ini')
    try:
        config_file.read(config_file_path)
    except FileNotFoundError:
        raise FileNotFoundError
    else:
        return config_file, config_file_path
def get_config_data(section, key):
    config_file, _ = initialize()
    return config_file[section][key]
def set_config_data(section, key, value):
    config_file, config_file_path = initialize()
    config_file.set(section, key, str(value))
    with open(config_file_path, 'w') as f:
        config_file.write(f)
Whenever I need a config key-value pair, I import the module as CFG and use CFG.get_config_data(SECTION, KEY). This means I have to run initialize every single time I need something. I don't think that's ideal (or is it? I genuinely don't know).
Is there a "proper and standard" method for reading config files in large Python projects? Something that I can just import and get in the beginning? Or is there nothing wrong with my setup as it is?
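One possible approach, sketched here as an assumption and not as an established standard, is to cache the parsed file so it is read once per process, much like logging.getLogger returns the same logger on every call. The function names mirror the ones in the question, with an added path parameter for illustration. Note that ConfigParser.read() silently skips files it cannot open, so the FileNotFoundError handler in the question never fires. The sketch uses read_file() on an explicitly opened file instead.

```python
import configparser
from functools import lru_cache


@lru_cache(maxsize=None)
def load_config(path):
    """Parse the file once per path; later calls return the same parser."""
    parser = configparser.ConfigParser()
    # open() raises FileNotFoundError if the file is missing,
    # unlike ConfigParser.read(), which ignores missing files
    with open(path) as f:
        parser.read_file(f)
    return parser


def get_config_data(path, section, key):
    return load_config(path)[section][key]


def set_config_data(path, section, key, value):
    parser = load_config(path)
    parser.set(section, key, str(value))
    # Persist the change; the cached parser already holds the new value
    with open(path, "w") as f:
        parser.write(f)
```

Because every caller shares the cached parser, a value written through set_config_data is visible to later get_config_data calls without re-reading the file.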

How to output model pickle file to s3 in luigi?

I have a task which trains the model, e.g.:
class ModelTrain(luigi.Task):
    def output(self):
        client = S3Client(os.getenv("CONFIG_AWS_ACCESS_KEY"),
                          os.getenv("CONFIG_AWS_SECRET_KEY"))
        model_output = os.path.join(
            "s3://", _BUCKET, exp.version + '_model.joblib')
        return S3Target(model_output, client)
    def run(self):
        joblib.dump(model, '/tmp/model.joblib')
        with open(self.output().path, 'wb') as out_file:
            out_file.write(joblib.load('/tmp/model.joblib'))
It fails with:
FileNotFoundError: [Errno 2] No such file or directory: 's3://bucket/version_model.joblib'
Any pointers in this regard would be helpful.

A few suggestions:
First, make sure you're using the target's own self.output().open() method instead of calling the built-in open(self.output().path). Calling open() on the path loses the atomicity of luigi targets. Those targets are also supposed to be swappable, so if you changed back to a LocalTarget your code should work the same way. Let the specific target class handle what it means to open the file. The error you get looks like Python is trying to find a local path, which obviously doesn't work.
Second, I just ran into the same issue, so here's my solution plugged into this code:
from luigi import format
class ModelTrain(luigi.Task):
    def output(self):
        client = S3Client(os.getenv("CONFIG_AWS_ACCESS_KEY"),
                          os.getenv("CONFIG_AWS_SECRET_KEY"))
        model_output = os.path.join(
            "s3://", _BUCKET, exp.version + '_model.joblib')
        # Use luigi.format.Nop for binary files
        return S3Target(model_output, client, format=format.Nop)
    def run(self):
        # where does `model` come from?
        with self.output().open('w') as s3_f:
            joblib.dump(model, s3_f)
My task is using pickle so I had to follow something similar to this post to re-import.
class MyNextTask(Task):
    ...
    def run(self):
        with my_pickled_task.output().open() as f:
            # The S3Target implements a read method and then I can use
            # the `.loads()` method to import from a binary string
            results = pickle.loads(f.read())
        # ... do more stuff with results ...
I recognize this post is stale, but putting the solution I found out there for the next poor soul trying to do this same thing.
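To make the pattern above concrete without needing S3, here is a small sketch of the same write-then-read round trip. InMemoryTarget is a hypothetical stand-in for a luigi target, written only for this illustration. It is not part of luigi, and it mimics only the open() call used in the answer.

```python
import io
import pickle
from contextlib import contextmanager


class InMemoryTarget:
    """Hypothetical stand-in for a luigi target; keeps bytes in memory."""

    def __init__(self):
        self._data = b""

    @contextmanager
    def open(self, mode="r"):
        if mode == "w":
            buf = io.BytesIO()
            yield buf
            # Publish the bytes only if the with-block finished without error
            self._data = buf.getvalue()
        else:
            yield io.BytesIO(self._data)


def save_model(target, model):
    # Let the target decide what "open" means, then dump into the file object
    with target.open("w") as f:
        pickle.dump(model, f)


def load_model(target):
    with target.open("r") as f:
        return pickle.loads(f.read())
```

The calling code never touches a path, which is what lets one target type be swapped for another.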

Could you try removing .path in your open statement?
def run(self):
    joblib.dump(model, '/tmp/model.joblib')
    with open(self.output(), 'wb') as out_file:
        out_file.write(joblib.load('/tmp/model.joblib'))

Python NamedTemporaryFile - ValueError When Reading

I am having an issue writing to a NamedTemporaryFile in Python and then reading it back. The function downloads a file via tftpy to the temp file, reads it, hashes the contents, and then compares the hash digest to the original file. The function in question is below:
def verify_upload(self, image, destination):
    # create a tftp client
    client = TftpClient(ip, 69, localip=self.binding_ip)
    # generate a temp file to hold the download info
    if not os.path.exists("temp"):
        os.makedirs("temp")
    with NamedTemporaryFile(dir="temp") as tempfile, open(image, 'r') as original:
        try:
            # attempt to download the target image
            client.download(destination, tempfile, timeout=self.download_timeout)
        except TftpTimeout:
            raise RuntimeError("Could not download {0} from {1} for verification".format(destination, self.target_ip))
        # hash the original file and the downloaded version
        original_digest = hashlib.sha256(original.read()).hexdigest()
        uploaded_digest = hashlib.sha256(tempfile.read()).hexdigest()
        if self.verbose:
            print "Original SHA-256: {0}\nUploaded SHA-256: {1}".format(original_digest, uploaded_digest)
        # return the hash comparison
        return original_digest == uploaded_digest
The problem is that every time I try to execute the line uploaded_digest = hashlib.sha256(tempfile.read()).hexdigest() the application errors out with a ValueError - I/O Operation on a closed file. Since the with block is not complete I am struggling to understand why the temp file would be closed. The only possibility I can think of is that tftpy is closing the file after doing the download, but I cannot find any point in the tftpy source where this would be happening. Note, I have also tried inserting the line tempfile.seek(0) in order to put the file back in a proper state for reading, however this also gives me the ValueError.
Is tftpy closing the file possibly? I read that there is possibly a bug in NamedTemporaryFile causing this problem? Why is the file closed before the reference defined by the with block goes out of scope?

TFTPy is closing the file. When you were looking at the source, you missed the following code path:
class TftpClient(TftpSession):
    ...
    def download(self, filename, output, packethook=None, timeout=SOCK_TIMEOUT):
        ...
        self.context = TftpContextClientDownload(self.host,
                                                 self.iport,
                                                 filename,
                                                 output,
                                                 self.options,
                                                 packethook,
                                                 timeout,
                                                 localip = self.localip)
        self.context.start()
        # Download happens here
        self.context.end() # <--
TftpClient.download calls TftpContextClientDownload.end:
class TftpContextClientDownload(TftpContext):
    ...
    def end(self):
        """Finish up the context."""
        TftpContext.end(self) # <--
        self.metrics.end_time = time.time()
        log.debug("Set metrics.end_time to %s", self.metrics.end_time)
        self.metrics.compute()
TftpContextClientDownload.end calls TftpContext.end:
class TftpContext(object):
    ...
    def end(self):
        """Perform session cleanup, since the end method should always be
        called explicitely by the calling code, this works better than the
        destructor."""
        log.debug("in TftpContext.end")
        self.sock.close()
        if self.fileobj is not None and not self.fileobj.closed:
            log.debug("self.fileobj is open - closing")
            self.fileobj.close() # <--
and TftpContext.end closes the file.
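Since the library closes whatever file object it is handed, one way around the problem is to give it a path inside a temporary directory and reopen that path afterwards for hashing. The sketch below is written in Python 3 and is independent of tftpy. The download parameter is a hypothetical callable that takes a destination path. Whether your tftpy version accepts a file name for the output argument is an assumption to check against its documentation.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(download, original_path):
    """Run download(dest_path), then compare hashes of both files.

    `download` is a hypothetical callable; in the question it would wrap
    something like client.download(destination, dest_path, timeout=...).
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        dest_path = os.path.join(tmpdir, "download.bin")
        download(dest_path)
        # The downloader may have closed its own handle; reopening by path
        # does not depend on that handle's state.
        return sha256_of_file(original_path) == sha256_of_file(dest_path)
```

Opening both files in binary mode also avoids newline translation altering the digest on some platforms.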

Python Multithreading Not Functioning

Excuse the unhelpful variable names and unnecessarily bloated code, but I just quickly whipped this together and haven't had time to optimise or tidy up yet.
I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo sharing service ( 321cheese.com ) by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.
At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct results (which is 121 photos in this case). But when I try to send this action to a new thread, the program sometimes downloads 112 photos, sometimes 90, sometimes 115 photos, etc, but never gives out the correct result.
Why would this create a problem? Should I limit the number of simultaneous threads (and how)?
import urllib
import thread
def getName(input):
    l = input.split(".com/")
    m = l[1]
    return m
def parseMessages():
    theFile = open('messages.html', 'r')
    theLines = theFile.readlines()
    theFile.close()
    theNewFile = open('new321.txt','w')
    for z in theLines:
        if "321cheese" in z:
            theNewFile.write(z)
    theNewFile.close()
def downloadImage(inputURL):
    urllib.urlretrieve (inputURL, "./grabNew/" + d)
parseMessages()
f = open('new321.txt', 'r')
lines = f.readlines()
f.close()
g = open('output.txt', 'w')
for x in lines:
    a = x.split("<a href=\"")
    b = a[1].split("\"")
    c = b[0]
    if ".png" in c:
        d = getName(c)
        g.write(c+"\n")
        thread.start_new_thread( downloadImage, (c,) )
        ##downloadImage(c)
g.close()

There are multiple issues in your code.
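The main issue is that downloadImage() reads the global name d, which the main loop keeps reassigning while earlier threads are still running, so several threads can end up saving to the same file name. To fix it, pass the name explicitly as an argument to downloadImage().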
The easy way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures (available on Python 2 as futures) or multiprocessing.Pool:
#!/usr/bin/env python
import os
import urllib
from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit
download_dir = "grabNew"
def url2filename(url):
    return basename(unquote(urlsplit(url).path).decode('utf-8'))
def download_image(url):
    filename = None
    try:
        filename = os.path.join(download_dir, url2filename(url))
        return urllib.urlretrieve(url, filename), None
    except Exception as e:
        return (filename, None), e
def main():
    pool = Pool(processes=10)
    # get_urls() is not defined here; it should yield the image URLs to download
    for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
        pass # do something with the downloaded file or handle an error
if __name__ == "__main__":
    main()
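For the concurrent.futures route mentioned above, a Python 3 sketch might look like the following. It is an illustration and not the answer author's code. The fetch parameter is a hypothetical callable taking (url, filename), for which urllib.request.urlretrieve would be one candidate. Each job computes its own file name from its own URL, so no global is shared between threads, and max_workers caps the number of simultaneous downloads.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from posixpath import basename
from urllib.parse import unquote, urlsplit


def url2filename(url):
    """Derive a local file name from the last path segment of the URL."""
    return basename(unquote(urlsplit(url).path))


def download_all(urls, fetch, download_dir="grabNew", max_workers=10):
    """Fetch every URL with at most max_workers threads.

    Returns a list of (url, filename, error) tuples in input order;
    error is None on success.
    """
    def job(url):
        # The file name is local to this call, never a shared global
        filename = os.path.join(download_dir, url2filename(url))
        try:
            fetch(url, filename)
            return url, filename, None
        except Exception as exc:
            return url, filename, exc

    # Leaving the with-block waits for all submitted jobs to finish
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(job, urls))
```

Collecting the errors instead of discarding them makes it easy to see which of the expected files did not arrive.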

Did you make sure your parsing is working correctly?
Also, you are launching too many threads.
And finally, because of the global interpreter lock, threads in CPython do not run Python code in parallel. Use the multiprocessing module if you want real parallelism. However, since the images are probably all on the same server, opening one hundred connections to it at the same time will probably make its firewall start dropping your connections.

Speed up nautilus python-extensions for reading image's Exif

I've written a Nautilus extension which reads picture's metadata (executing exiftool), but when I open folders with many files, it really slows down the file manager and hangs until it finishes reading the file's data.
Is there a way to make Nautilus keep its work while it runs my extension? Perhaps the Exif data could appear gradually in the columns while I go on with my work.
#!/usr/bin/python
# Requires:
# nautilus-python
# exiftool
# gconf-python
# Version 0.15
import gobject
import nautilus
from subprocess import Popen, PIPE
from urllib import unquote
import gconf
def getexiftool(filename):
    options = '-fast2 -f -m -q -q -s3 -ExifIFD:DateTimeOriginal -IFD0:Software -ExifIFD:Flash -Composite:ImageSize -IFD0:Model'
    exiftool=Popen(['/usr/bin/exiftool'] + options.split() + [filename],stdout=PIPE,stderr=PIPE)
    #'-Nikon:ShutterCount' cannot be used with the -fast2 argument
    output,errors=exiftool.communicate()
    return output.split('\n')
class ColumnExtension(nautilus.ColumnProvider, nautilus.InfoProvider, gobject.GObject):
    def __init__(self):
        pass
    def get_columns(self):
        return (
            nautilus.Column("NautilusPython::ExifIFD:DateTimeOriginal","ExifIFD:DateTimeOriginal","Data (ExifIFD)","Data di scatto"),
            nautilus.Column("NautilusPython::IFD0:Software","IFD0:Software","Software (IFD0)","Software utilizzato"),
            nautilus.Column("NautilusPython::ExifIFD:Flash","ExifIFD:Flash","Flash (ExifIFD)","Modalit\u00e0 del flash"),
            nautilus.Column("NautilusPython::Composite:ImageSize","Composite:ImageSize","Risoluzione (Exif)","Risoluzione dell'immagine"),
            nautilus.Column("NautilusPython::IFD0:Model","IFD0:Model","Fotocamera (IFD0)","Modello fotocamera"),
            #nautilus.Column("NautilusPython::Nikon:ShutterCount","Nikon:ShutterCount","Contatore scatti (Nikon)","Numero di scatti effettuati dalla macchina a questo file"),
            nautilus.Column("NautilusPython::Mp","Mp","Megapixel (Exif)","Dimensione dell'immagine in megapixel"),
        )
    def update_file_info_full(self, provider, handle, closure, file):
        client = gconf.client_get_default()
        if not client.get_bool('/apps/nautilus/nautilus-metadata/enable'):
            client.set_bool('/apps/nautilus/nautilus-metadata/enable',0)
            return
        if file.get_uri_scheme() != 'file':
            return
        if file.get_mime_type() in ('image/jpeg', 'image/png', 'image/gif', 'image/bmp', 'image/x-nikon-nef', 'image/x-xcf', 'image/vnd.adobe.photoshop'):
            gobject.timeout_add_seconds(1, self.update_exif, provider, handle, closure, file)
            return Nautilus.OperationResult.IN_PROGRESS
        file.add_string_attribute('ExifIFD:DateTimeOriginal','')
        file.add_string_attribute('IFD0:Software','')
        file.add_string_attribute('ExifIFD:Flash','')
        file.add_string_attribute('Composite:ImageSize','')
        file.add_string_attribute('IFD0:Model','')
        file.add_string_attribute('Nikon:ShutterCount','')
        file.add_string_attribute('Mp','')
        return Nautilus.OperationResult.COMPLETE
    def update_exif(self, provider, handle, closure, file):
        filename = unquote(file.get_uri()[7:])
        data = getexiftool(filename)
        file.add_string_attribute('ExifIFD:DateTimeOriginal',data[0].replace(':','-',2))
        file.add_string_attribute('IFD0:Software',data[1])
        file.add_string_attribute('ExifIFD:Flash',data[2])
        file.add_string_attribute('Composite:ImageSize',data[3])
        file.add_string_attribute('IFD0:Model',data[4])
        #file.add_string_attribute('Nikon:ShutterCount',data[5])
        width, height = data[3].split('x')
        mp = float(width) * float(height) / 1000000
        mp = "%.2f" % mp
        file.add_string_attribute('Mp',str(mp) + ' Mp')
        Nautilus.info_provider_update_complete_invoke(closure, provider, handle, Nautilus.OperationResult.COMPLETE)
        return False

That happens because you are invoking update_file_info, which is part of the asynchronous IO system of Nautilus. Therefore, it blocks Nautilus if the operations are not fast enough.
In your case the problem is exacerbated because you are calling an external program, which is an expensive operation. Notice that update_file_info is called once per file. If you have 100 files, you will call the external program 100 times, and Nautilus will have to wait for each call before processing the next one.
Since nautilus-python 0.7, update_file_info_full and cancel_update are available, which allow you to program asynchronous calls. You can check the nautilus-python 0.7 documentation for more details.
It is worth mentioning that this was a limitation of nautilus-python only, which previously did not expose those methods available in C.
EDIT: Added a couple of examples.
The trick is to make the process as fast as possible or to make it asynchronous.
Example 1: Invoking an external program
Using a simplified version of your code, we make it asynchronous using GObject.timeout_add_seconds in update_file_info_full.
from gi.repository import Nautilus, GObject
from urllib import unquote
from subprocess import Popen, PIPE
def getexiftool(filename):
    options = '-fast2 -f -m -q -q -s3 -ExifIFD:DateTimeOriginal'
    exiftool = Popen(['/usr/bin/exiftool'] + options.split() + [filename],
                     stdout=PIPE, stderr=PIPE)
    output, errors = exiftool.communicate()
    return output.split('\n')
class MyExtension(Nautilus.ColumnProvider, Nautilus.InfoProvider, GObject.GObject):
    def __init__(self):
        pass
    def get_columns(self):
        return (
            Nautilus.Column(name='MyExif::DateTime',
                            attribute='Exif:Image:DateTime',
                            label='Date Original',
                            description='Data time original'
            ),
        )
    def update_file_info_full(self, provider, handle, closure, file_info):
        if file_info.get_uri_scheme() != 'file':
            return
        filename = unquote(file_info.get_uri()[7:])
        attr = ''
        if file_info.get_mime_type() in ('image/jpeg', 'image/png'):
            GObject.timeout_add_seconds(1, self.update_exif,
                                        provider, handle, closure, file_info)
            return Nautilus.OperationResult.IN_PROGRESS
        file_info.add_string_attribute('Exif:Image:DateTime', attr)
        return Nautilus.OperationResult.COMPLETE
    def update_exif(self, provider, handle, closure, file_info):
        filename = unquote(file_info.get_uri()[7:])
        try:
            data = getexiftool(filename)
            attr = data[0]
        except:
            attr = ''
        file_info.add_string_attribute('Exif:Image:DateTime', attr)
        Nautilus.info_provider_update_complete_invoke(closure, provider,
                                handle, Nautilus.OperationResult.COMPLETE)
        return False
The code above will not block Nautilus. If the column 'Date Original' is enabled in the column view, the JPEG and PNG images will first show the 'unknown' value and will then be updated gradually (the subprocess is called after 1 second).
Example 2: Using a library
Rather than invoking an external program, it could be better to use a library, as in the example below:
from gi.repository import Nautilus, GObject
from urllib import unquote
import pyexiv2
class MyExtension(Nautilus.ColumnProvider, Nautilus.InfoProvider, GObject.GObject):
    def __init__(self):
        pass
    def get_columns(self):
        return (
            Nautilus.Column(name='MyExif::DateTime',
                            attribute='Exif:Image:DateTime',
                            label='Date Original',
                            description='Data time original'
            ),
        )
    def update_file_info_full(self, provider, handle, closure, file_info):
        if file_info.get_uri_scheme() != 'file':
            return
        filename = unquote(file_info.get_uri()[7:])
        attr = ''
        if file_info.get_mime_type() in ('image/jpeg', 'image/png'):
            metadata = pyexiv2.ImageMetadata(filename)
            metadata.read()
            try:
                tag = metadata['Exif.Image.DateTime'].value
                attr = tag.strftime('%Y-%m-%d %H:%M')
            except:
                attr = ''
        file_info.add_string_attribute('Exif:Image:DateTime', attr)
        return Nautilus.OperationResult.COMPLETE
Eventually, if the routine is slow, you would need to make it asynchronous (maybe using something better than GObject.timeout_add_seconds).
Last but not least, in my examples I used GObject Introspection (typically for Nautilus 3), but it is easy to change them to use the nautilus module directly.
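As one possible reading of "something better than GObject.timeout_add_seconds", the slow call could run on a worker thread and hand its result back to the main loop. The sketch below shows only that generic pattern in plain Python 3. The function run_in_background and its schedule parameter are hypothetical. In a real extension, schedule would have to be something that queues the callback on the GLib main loop (GLib.idle_add is one candidate), and the callback would be where add_string_attribute and the update-complete call happen. None of this has been tested inside Nautilus.

```python
import threading


def run_in_background(work, on_done, schedule):
    """Run work() on a worker thread and deliver its result via schedule.

    work     -- blocking callable, e.g. one that shells out to exiftool
    on_done  -- callable receiving the result (or None if work failed)
    schedule -- callable(fn, arg) that arranges for fn(arg) to run where
                it is safe to touch the UI (hypothetical parameter)
    """
    def worker():
        try:
            result = work()
        except Exception:
            result = None
        schedule(on_done, result)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    # Stand-in scheduler that simply calls the function directly
    t = run_in_background(lambda: "2020-01-01 12:00",
                          lambda value: print("got", value),
                          lambda fn, arg: fn(arg))
    t.join()
```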

The above solution is only partly correct.
Between state changes for file_info metadata, the user should call file_info.invalidate_extension_info() to notify nautilus of the change.
Failing to do this could end up with 'unknown' appearing in your columns.
file_info.add_string_attribute('video_width', video_width)
file_info.add_string_attribute('video_height', video_height)
file_info.add_string_attribute('name_suggestion', name_suggestion)
file_info.invalidate_extension_info()
Nautilus.info_provider_update_complete_invoke(closure, provider, handle, Nautilus.OperationResult.COMPLETE)
Full working example here:
Fully working example
API Documentation

Thanks to Dave!
I was looking for a solution to the 'unknown' text in the column for ages.
file_info.invalidate_extension_info()
fixed the issue for me right away :)
Per the API documentation:
https://projects-old.gnome.org/nautilus-python/documentation/html/class-nautilus-python-file-info.html#method-nautilus-python-file-info--invalidate-extension-info
Nautilus.FileInfo.invalidate_extension_info
def invalidate_extension_info()
Invalidates the information Nautilus has about this file, which causes it to request new information from its Nautilus.InfoProvider providers.
