I do not see a way to have the scheduler forget keys that were previously used in an executed dask graph. Minimal example:
from dask.distributed import Client

client = Client("127.0.0.1:8786")

def f():
    raise KeyError()

dsk = {'A': (f,)}
client.get(dsk, 'A')  # raises KeyError
If I go back to fix a bug and resubmit the graph:
def f():
    return True

dsk = {'A': (f,)}
client.get(dsk, 'A')   # still raises KeyError, but:

dsk = {'A1': (f,)}
client.get(dsk, 'A1')  # returns True
I understand that this is the correct behavior, since f is already pickled and sent to the scheduler as is with the initial get call. Is there a way that I can have the scheduler forget 'A' before resubmission (without full restart)?
It looks like there are two issues here:
A future exists
The scheduler does clear out keys pretty aggressively once no future points to them. In this case, though, it looks like the KeyError and traceback raised by the get call keep a reference to the temporary future, which in turn holds a circular reference back to them (I'd consider this a bug). If you can clean those references up, things should probably be fine.
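For example, in an interactive session the interpreter itself keeps a handle on the last traceback; a rough sketch of "cleaning those up" (assuming nothing else in your session still references the failed result) could be:

import gc
import sys

# Drop the interpreter's record of the last unhandled exception so the
# temporary future it references can be released, then collect the cycle.
for attr in ("last_type", "last_value", "last_traceback"):
    if hasattr(sys, attr):
        delattr(sys, attr)
gc.collect()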
Cancellation policy doesn't forget exceptions
I was going to suggest that you can explicitly clear out state by creating and cancelling a future:
from dask.distributed import Future
Future('A').cancel()
Unfortunately for you, it looks like the current policy is to remember exceptions even past cancellation. You might consider raising an issue for this.
Or just use delayed
It's rare to see people use custom graphs unless they are integrating with some other graph scheduling system. Most users use dask.delayed, which handles all of these issues for you. In this case dask.delayed would construct a new key for you, as you have done in your example with A1, but it would all be hidden behind a comfortable usability layer.
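For illustration, a minimal sketch of the delayed version (assuming the default impure behaviour, where every call gets a fresh key):

import dask
from dask.distributed import Client

client = Client("127.0.0.1:8786")

@dask.delayed
def f():
    return True

# Each call to f() builds a task with its own new key, so resubmitting after
# a fix never collides with a previously failed key.
result = f().compute()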
I am learning Python 3 and have a fairly simple task to complete, but I am struggling with how to glue it all together. I need to query an API and return the full list of applications, which I can do; I store this and need to use it again to gather more data for each application from a different API call.
applistfull = requests.get(url, authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
else:
    print(applistfull.status_code)
Next I take what I think is summaryguid and need to query a different API, returning a value that could exist many times for each application; in this case, the compiler used to build the code.
I can statically put a GUID in the URL and return the correct information, but I haven't yet figured out how to run the snippet below for every application from the loop above and build a master list:
summary = requests.get(f"url{summaryguid}moreurl", authmethod)
if summary.ok:
    fulldata = summary.json()
    for appsummary in fulldata["static-analysis"]["modules"]["module"]:
        print(appsummary["compiler"])
I would prefer that no one just type out the right answer yet; please drop a few hints and let me continue to work through it logically, so that I learn how to deal with what I assume is a common issue. My thought right now is that I need to move my second if block up into my initial block and continue the logic there, but I am stuck on that.
You are on the right track! Here is a hint: the second API request can be nested inside the loop that iterates through the list of applications from the first API call. That way, you make the second API call once for each application and collect the information you need.
import requests

applistfull = requests.get("url", authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
        summary = requests.get(f"url/{summaryguid}/moreurl", authmethod)
        fulldata = summary.json()
        for appsummary in fulldata["static-analysis"]["modules"]["module"]:
            print(app["profile"]["name"], appsummary["compiler"])
else:
    print(applistfull.status_code)
The kind of workflow that I want to run looks like this:
workflow = (
    generator.s() |
    spread.s() |
    gather.s()
)
where spread is a task that replaces itself with a group.
from celery import Celery, group

celery_app = Celery()

# task_1, task_2 and task_3 are defined elsewhere in my code.
@celery_app.task(bind=True)
def spread(self, numbers):
    return self.replace(group(
        (task_1.si(n) | task_2.s() | task_3.s()) for n in numbers
    ))
The whole workflow works fine and as expected.
My question is essentially only about the chains in the group created by spread. I don't care too much if some of them fail. I'm fine if an error somewhere in the chain would lead to a shorter list of results being passed to gather. However, I'm not sure how to achieve that.
I can, of course, catch exceptions in each of task_1, task_2, and task_3 and pass on an empty dummy result. For convenience, I'd really like to be able to say: on an error anywhere in the chain, please log the traceback and either remove the result from the group or pass on an empty dummy result.
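For concreteness, a rough sketch of that per-task workaround might look like this (do_work is a placeholder for the real task body, not part of my actual code):

from celery import Celery
from celery.utils.log import get_task_logger

celery_app = Celery()
logger = get_task_logger(__name__)

@celery_app.task
def task_2(value):
    # Catch errors locally, log the traceback, and pass a dummy result on
    # instead of failing the whole chain.
    try:
        return do_work(value)  # do_work is a placeholder
    except Exception:
        logger.exception("task_2 failed; passing on a dummy result")
        return None

@celery_app.task
def gather(results):
    # Drop the dummy results produced by failed chains.
    return [r for r in results if r is not None]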
I've searched the documentation and GitHub issues far and wide but could not find anything. I know that I can pass an on_error callback to the chain but I don't know how to pass on an empty result from there (if that's even possible).
Setup:
Python 3.6
celery 4.2.1
Redis broker and backend (though it's not a problem for me to switch if that would enable the behavior)
I get the following UserWarning when trying to cache results using joblib:
import numpy
from tempfile import mkdtemp
cachedir = mkdtemp()
from joblib import Memory
memory = Memory(cachedir=cachedir, verbose=0)

# open_or_die and logger are defined elsewhere in my module.
@memory.cache
def get_nc_var3d(path_nc, var, year):
    """
    Get value from netcdf for variable var for year
    :param path_nc:
    :param var:
    :param year:
    :return:
    """
    try:
        hndl_nc = open_or_die(path_nc)
        val = hndl_nc.variables[var][int(year), :, :]
    except:
        val = numpy.nan
        logger.info('Error in getting var ' + var + ' for year ' + str(year) + ' from netcdf ')
    hndl_nc.close()
    return val
I get the following warning when calling this function using parameters:
UserWarning: Persisting input arguments took 0.58s to run.
If this happens often in your code, it can cause performance problems
(results will be correct in all cases).
The reason for this is probably some large input arguments for a wrapped function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an example so that they can fix the problem.
Input parameters: C:/Users/rit/Documents/PhD/Projects/\GLA/Input/LUWH/\LUWLAN_v1.0h\transit_model.nc range_to_large 1150
How do I get rid of the warning? And why is it happening, since the input parameters are not too long?
I don't have an answer to the "why doesn't this work?" portion of the question. However, to simply ignore the warning, you can use warnings.catch_warnings with warnings.simplefilter, as shown below.
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    your_code()
Obviously, I don't recommend ignoring the warning unless you're sure it's harmless, but if you're going to do it, this approach only suppresses warnings inside the context manager, and it comes straight out of the Python docs.
UserWarning: Persisting input arguments took 0.58s to run.
If this happens often in your code, it can cause performance problems
(results will be correct in all cases).
The reason for this is probably some large input arguments for a wrapped function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an example so that they can fix the problem.
The warning itself is fairly self-explanatory, in my humble opinion. The issue might be in your code: you can try to decrease the input size, or you can share a report with the joblib team so that they can either improve joblib or suggest a better way to use it and avoid this type of performance warning.
The warning occurs because joblib is trying to persist the input arguments to disk for caching purposes, but the operation is taking too long. The cause is probably some large input argument, such as a long string, which takes time to serialize.
To resolve the issue, you can either disable the persist argument of the cache method, which would result in no caching, or you can try to preprocess the input arguments to reduce their size before calling the cache method.
@memory.cache(persist=False)
I was wondering if it is possible to retrieve the name of the last deleted object.
I have looked into listHistory, but that seems to list the history of a selected or named object. I have also looked into undoHistory printqueue, which prints the undo history to the script editor, but I can't retrieve that information from the console.
Any ideas? I've looked around and can't find any info on this. Thanks in advance.
You can get the list with:
undoInfo -q -pq;
There are a few really good use cases for scraping the Maya undo queue, such as determining selection order after the fact. In any case, it may be difficult to tell from the queue what the deleted object actually was, so you may need to undo and redo to find out.
So this may or may not work, mileage may vary.
As a side note, since you're restoring stuff, why not save the object list at the time of save? The order is guaranteed to be the same, so you can see the changes at the end, with deletions showing up as missing objects. The objects in a plain ls are in creation order. You can use this for a rudimentary diff from import to import, for example; the same works for deletions.
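A rough sketch of that diff idea (purely illustrative):

import maya.cmds as cmds

# Snapshot the scene's objects (ls returns them in creation order), then
# compare a later snapshot against it; anything missing was deleted.
snapshot = cmds.ls(long=True)

# ... work happens, objects get deleted ...

current = set(cmds.ls(long=True))
deleted = [obj for obj in snapshot if obj not in current]
print(deleted)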
Catching any individual deletion after the fact is not possible. However you can stick an attributeDeleted scriptJob on objects you want to monitor - it will fire when they are deleted. If you really want to catch every object, a scriptJob listening for the event DagObjectCreated will let you hook the other scriptJob to each new object - however that's not a good idea most of the time, since it will create a ton of scriptJobs in your scene (plus you'd have to also loop through the scene on load and attach the same deletion callback to existing objects as well...)
import maya.cmds as cmds
from functools import partial

def objectDeleted(obj):
    print("%s was deleted" % obj)

def catch_deletion(obj):
    cmds.scriptJob(attributeDeleted=((obj + ".tx"), partial(objectDeleted, obj)))

catch_deletion('pCube1')
I read many files from my system. I want to read them faster, maybe like this:
results = []
for file in open("filenames.txt").readlines():
    results.append(open(file, "r").read())
I don't want to use threading. Any advice is appreciated.
The reason I don't want to use threads is that they would make my code unreadable; I want to find some trick to make it faster, with less code and easier to understand.
Yesterday I tested another solution with multiprocessing, but it performed badly and I don't know why. Here is the code:
# pq, g_fields, Product, session, glob and g_filter come from the rest of my code.
def xml2db(file):
    s = pq(open(file, "r").read())
    dict = {}
    for field in g_fields:
        dict[field] = s("field[@name='%s']" % field).text()
    p = Product()
    for k, v in dict.iteritems():
        if v is None or v.strip() == "":
            pass
        else:
            if hasattr(p, k):
                setattr(p, k, v)
    session.commit()

@cost_time
@statistics_db
def batch_xml2db():
    from multiprocessing import Pool, Queue
    p = Pool(5)
    # q = Queue()
    files = glob.glob(g_filter)
    # for file in files:
    #     q.put(file)

    def P():
        while q.qsize() != 0:
            xml2db(q.get())

    p.map(xml2db, files)
    p.join()
results = [open(f.strip()).read() for f in open("filenames.txt").readlines()]
This may be insignificantly faster, but it's probably less readable (depending on the reader's familiarity with list comprehensions).
Your main problem here is that your bottleneck is disk IO - buying a faster disk will make much more of a difference than modifying your code.
Well, if you want to improve performance then improve the algorithm, right? What are you doing with all this data? Do you really need it all in memory at the same time, potentially causing OOM if filenames.txt specifies too many or too-large files?
If you're doing this with lots of files I suspect you are thrashing, hence your 700+ second (roughly 12-minute) run time. Even my poor little HD can sustain 42 MB/s writes (42 MB/s * 714 s ≈ 30 GB). Take that with a grain of salt knowing you must read and write, but I'm guessing you don't have over 8 GB of RAM available for this application. A related SO question/answer suggested you use mmap, and the answer above suggested an iterative/lazy read like what you get in Haskell for free. These are probably worth considering if you really do have tens of gigabytes to munge.
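For example, a lazy version of the original loop might look like this (process is a placeholder for whatever you do with each file's contents):

def read_files(list_path="filenames.txt"):
    # Yield one file's contents at a time instead of holding everything
    # in memory at once.
    with open(list_path) as names:
        for name in names:
            name = name.strip()
            if name:
                with open(name) as f:
                    yield f.read()

for contents in read_files():
    process(contents)  # placeholder for the real per-file work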
Is this a one-off requirement or something that you need to do regularly? If it's something you're going to be doing often, consider using MySQL or another database instead of a file system.
Not sure if this is still the code you are using.
A couple adjustments I would consider making.
Original:
def xml2db(file):
    s = pq(open(file, "r").read())
    dict = {}
    for field in g_fields:
        dict[field] = s("field[@name='%s']" % field).text()
    p = Product()
    for k, v in dict.iteritems():
        if v is None or v.strip() == "":
            pass
        else:
            if hasattr(p, k):
                setattr(p, k, v)
    session.commit()
Updated:
Remove the use of the dict; it is extra object creation, iteration, and collection.
def xml2db(file):
    s = pq(open(file, "r").read())
    p = Product()
    for k in g_fields:
        v = s("field[@name='%s']" % k).text()
        if v is None or v.strip() == "":
            pass
        else:
            if hasattr(p, k):
                setattr(p, k, v)
    session.commit()
You could profile the code using the Python profiler; that should tell you where the time is being spent. It may be in session.commit(), which may need to be reduced to every couple of files. I have no idea what the code does, so that is really a stab in the dark; you might also try running it without sending or writing any output.
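For example (batch_xml2db is the function from your code; the stats file name is arbitrary):

import cProfile
import pstats

# Profile the whole batch and dump the stats to a file.
cProfile.run("batch_xml2db()", "xml2db.prof")
# Show the 20 entries with the largest cumulative time.
pstats.Stats("xml2db.prof").sort_stats("cumulative").print_stats(20)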
If you can, separate your code into reading, processing, and writing stages:
A) Reading cost: see how long it takes to read all the files.
B) Processing cost: load a single file into memory and process it enough times to represent the entire job, without the extra reading IO.
C) Output cost: save a whole bunch of sessions representative of your job size.
Test the cost of each stage individually. This should show you what is taking the most time and whether any improvement can be made in any area.
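A rough sketch of timing each stage separately (the stage functions here are placeholders for your own read/process/write code):

import time

def timed(label, fn, *args):
    # Run one stage and report how long it took.
    start = time.time()
    result = fn(*args)
    print("%s took %.2fs" % (label, time.time() - start))
    return result

raw = timed("read", read_all_files)        # placeholder stage functions
data = timed("process", process_all, raw)
timed("write", write_all, data)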