Dask how to avoid recomputing things

Using Dask, I have defined a long pipeline of computations. At some point, because of constraints in the APIs and versions I am using, I need to compute a small result eagerly (not lazily) and feed it into the subsequent lazy operations. My problem is that at this point the whole computation graph is executed in order to produce that intermediate result. Is there a way not to lose the work done at this point, so that I don't have to recompute everything from scratch when a later step stores the final results to disk?
Is using persist supposed to help with that?
Any help will be very appreciated.

Yes, this is the use case that persist is for. The trick is figuring out where to apply it. This decision is usually influenced by:
The size of your intermediate results. These will be kept in memory until all references to them are deleted (e.g. foo in foo = intermediate.persist()).
The shape of your graph. It's better to persist only components that would need to be recomputed, to minimize the memory impact of the persisted values. You can use .visualize() to look at the graph.
The time it takes to compute the tasks. If the tasks are quick to compute, then it may be more beneficial just to recompute them rather than keep them around in memory.
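A minimal sketch of that pattern, assuming a dask.array pipeline run on the default local scheduler. The array sizes and variable names are illustrative and not taken from the question:

```python
import dask.array as da

# Illustrative lazy pipeline (a stand-in for the long pipeline in the question).
x = da.ones((1000, 1000), chunks=(250, 250))
intermediate = (x + 1).sum(axis=0)          # still lazy

# Compute the shared part once and keep its chunks in memory.
# The result is still a dask collection, so later steps stay lazy.
intermediate = intermediate.persist()

# Small eager value needed by the next step; it reuses the persisted chunks.
threshold = float(intermediate.mean().compute())

# Continue lazily; the final compute starts from the persisted chunks
# instead of re-running the work that produced them.
final = intermediate * threshold
result = final.compute()
```

Once the persisted value is no longer needed, dropping the reference (del intermediate) allows its memory to be released.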

Related

Simultaneous recursion on equivalent graphs using Python

I have a structure, looking a lot like a graph but I can 'sort' it. Therefore I can have two graphs, that are equivalent, but one is sorted and not the other. My goal is to compute a minimal dominant set (with a custom algorithm that fits my specific problem, so please do not link to other 'efficient' algorithms).
The thing is, I search for dominant sets of size one, then two, etc until I find one. If there isn't a dominant set of size i, using the sorted graph is a lot more efficient. If there is one, using the unsorted graph is much better.
I thought about using threads/multiprocessing, so that both graphs are explored at the same time and once one finds an answer (no solution or a specific solution), the other one stops and we go to the next step or end the algorithm. This didn't work, it just makes the process much slower (even though I would expect it to just double the time required for each step, compared to using the optimal graph without threads/multiprocessing).
I don't know why this didn't work and wonder if there is a better way, one that maybe doesn't even require the use of threads/multiprocessing. Any clue?

If you don't want an algorithm suggestion, then lazy evaluation seems like the way to go.
Set up the two in a data structure with a method such as class_instance.next_step(work_to_do_this_step), where each class instance is a solver for one graph type. You'll need two of them. You can then have each graph move one "step" (whatever you define a step to be) forward. By careful selection (possibly dynamically, based on how things are going) of what a step is, you can efficiently alternate between how much work/time is being spent on the sorted vs unsorted graph approaches. Of course this is only useful if there is at least a chance that either algorithm may finish before the other.
In theory if you can independently define what those steps are, then you could split up the work to run them in parallel, but it's important that each process/thread is doing roughly the same amount of "work" so they all finish about the same time. Though writing parallel algorithms for these kinds of things can be a bit tricky.
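A rough sketch of the alternating-step idea, using generators in place of solver classes. The toy_solver function and the budgets dictionary are hypothetical stand-ins; a real solver would do one unit of search per step instead of just counting:

```python
def toy_solver(steps_needed, verdict):
    """Hypothetical stand-in for one graph search: works for a while, then reports."""
    for _ in range(steps_needed - 1):
        yield None            # still searching
    yield verdict             # a dominant set, or a "no solution" marker


def alternate(solvers, budgets):
    """Round-robin over the solvers, giving each its step budget per round.

    Returns (name, verdict) of the first solver to produce a verdict.
    """
    active = dict(solvers)
    while active:
        for name in list(active):
            for _ in range(budgets[name]):
                try:
                    verdict = next(active[name])
                except StopIteration:
                    del active[name]      # solver ended without a verdict
                    break
                if verdict is not None:
                    return name, verdict
    return None, None


winner, verdict = alternate(
    {"sorted": toy_solver(10, "no set of size i"),
     "unsorted": toy_solver(4, "no set of size i")},
    budgets={"sorted": 2, "unsorted": 1},
)
```

Everything runs in a single thread, so as soon as one solver reports, the other simply stops being advanced. The budgets are the knob for how much work each approach gets per round.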

Sounds like you're not doing what you describe. Possibly you're waiting for BOTH to finish somehow? Try doing that, and seeing if the time changes.

GPyOpt - how to run a physical experiment?

I'm trying to do some physical experiments to find a formulation that optimizes some parameters. By physical experiments I mean I have a chemistry bench, I'm mixing stuff together, then measuring the properties of that formulation. Historically I've used traditional DOEs, but I need to speed up my time to getting to the ideal formulation. I'm aware of simplex optimization, but I'm interested in trying out Bayesian optimization. I found GPyOpt which claims (even in the SO Tag description) to support physical experiments. However, it's not clear how to enable this kind of behavior.
One thing I've tried is to collect user input via input, and I suppose I could pickle off the optimizer and function, but this feels kludgy. In the example code below, I use the function from the GPyOpt example but I have to type in the actual value.
from GPyOpt.methods import BayesianOptimization
import numpy as np
# --- Define your problem
def f(x):
    return (6*x-2)**2*np.sin(12*x-4)
def g(x):
    print(f(x))
    return float(input("Result?"))
domain = [{'name': 'var_1', 'type': 'continuous', 'domain': (0, 1)}]
myBopt = BayesianOptimization(f=g,
                              domain=domain,
                              X=np.array([[0.745], [0.766], [0], [1], [0.5]]),
                              Y=np.array([[f(0.745)], [f(0.766)], [f(0)], [f(1)], [f(0.5)]]),
                              acquisition_type='LCB')
myBopt.run_optimization(max_iter=15, eps=0.001)
So, my question is: what is the intended way of using GPyOpt for physical experimentation?

A few things.
First, set f=None. Note that this has the side-effect of causing the BO object to ignore the maximize=True, if you happen to be using this.
Second, rather than use run_optimization, you want suggest_next_locations. The former runs the entire optimization, whereas the latter just runs a single iteration. This method returns a vector with parameter combinations ("locations") to go test in the lab.
Third, you'll need to make some decisions regarding batch size. The number of combinations/locations that you get is controlled by the batch_size parameter that you use to initialize the BayesianOptimization object. Choice of acquisition function is important here, because some are closely tied to a batch_size of 1. If you need larger batches, then you'll need to read the docs for combinations suitable to your situation (e.g. acquisition_type='EI' and evaluator_type='local_penalization').
Fourth, you'll need to explicitly manage the data between iterations. There are at least two ways to approach this. One is to pickle the BO object and add more data to it. An alternative that I think is more elegant is to instead create a completely fresh BO object each time. When instantiating it, you concatenate the new data to the old data, and just run a single iteration on the whole set (again, using suggest_next_locations). This might be kind of insane if you were using BO to optimize a function in silico, but considering how slow the chemistry steps are likely to be, this might be cleanest (and easier to make mid-course corrections.)
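A sketch of the "fresh BO object each time" approach, assuming one continuous variable as in the question. The helper names append_results and suggest are made up for illustration, and the measurement values are invented. GPyOpt is imported inside suggest so that the data handling can be tried without it installed:

```python
import numpy as np


def append_results(X, Y, x_new, y_new):
    """Stack newly measured points onto the existing data (one row per experiment)."""
    X_all = np.vstack([X, np.atleast_2d(x_new)])
    Y_all = np.vstack([Y, np.atleast_2d(y_new)])
    return X_all, Y_all


def suggest(X, Y, domain):
    """Build a fresh optimizer on all data so far and ask for the next location(s)."""
    from GPyOpt.methods import BayesianOptimization  # needs GPyOpt installed
    bo = BayesianOptimization(f=None, domain=domain, X=X, Y=Y,
                              acquisition_type='LCB')
    return bo.suggest_next_locations()


domain = [{'name': 'var_1', 'type': 'continuous', 'domain': (0, 1)}]
X = np.array([[0.0], [0.5], [1.0]])
Y = np.array([[3.0], [0.9], [15.8]])      # invented bench measurements

# After measuring a suggested formulation in the lab, record it and repeat:
X, Y = append_results(X, Y, [0.75], [-6.0])
# x_next = suggest(X, Y, domain)          # uncomment with GPyOpt installed
```

Saving X and Y to disk between lab sessions (for example with np.save) is then enough to resume; the optimizer object itself never has to be pickled.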
Hope this helps!

storing values between iterations (cache-like mechanism) in pyCUDA

Good morning all,
I am kind of newbie with cuda/pyCuda, so probably this will have an easy solution employing some mechanism that I don't know....
I am employing pycuda to operate over pairs of values: I subtract the smallest from the biggest and then perform some time-consuming operations. As it must be repeated many times, it is well suited for GPUs.
However, most of the time the result of the subtraction is the same, so performing the time-consuming operations again makes no sense. What I do in the non-GPU version of my code is something like:
def myFunction(A, B):
    index = A - B
    try:
        value = myDictionary[index]
    except KeyError:
        value = expensiveOperation(index)
        myDictionary[index] = value
    return value
As accessing the dictionary is much faster than expensiveOperation, and the value is found most of the time, I obtain a significant time gain.
When porting this to GPUs, I can call myFunction(A,B) with a high degree of parallelism, which is great. However, I don't know how I could employ this dictionary mechanism, or a similar one, to avoid redundant operations.
Any thoughts on this?
Thanks for your help
edit: I would like to know, is it possible to store the dictionary on the GPU, or should I copy it every time? If it's on the GPU, can it be accessed/edited by several cores at the same time? How should I implement it?

You could try this:
def myFunction(A, B):
    index = A - B
    if index in myDictionary:
        value = myDictionary[index]
    else:
        value = expensiveOperation(index)
        myDictionary[index] = value
    return value

It seems your question is about implementing some sort of memoise facility inside GPU code. I don't think this is worth pursuing. On the GPU, arithmetic operations are almost free, but memory access is very expensive (and random memory access even more so). Performing a dictionary/hash table look-up in GPU memory to retrieve an arithmetic result from a cache is almost guaranteed to be slower than the cost of just calculating the result. It sounds counter-intuitive, but that is the reality of GPU computing.
In an interpreted language like Python, which is relatively slow, using a fast native memoisation mechanism makes a lot of sense, and memoising the results of a complete kernel function call also could yield useful performance benefits for expensive kernels. But memoisation inside CUDA doesn't seem all that useful.
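For the host-side case, a small sketch using functools.lru_cache from the standard library. The body of expensive_operation is a placeholder, and the calls counter exists only to show how often the slow path runs:

```python
from functools import lru_cache

calls = 0  # counts how often the expensive path actually runs


@lru_cache(maxsize=None)
def expensive_operation(index):
    """Placeholder for the costly computation from the question."""
    global calls
    calls += 1
    return sum(i * i for i in range(abs(index) + 1))


def my_function(a, b):
    # Results are cached per distinct difference, like the dictionary version.
    return expensive_operation(a - b)


pairs = [(10, 3), (9, 2), (8, 1), (20, 5)]
values = [my_function(a, b) for a, b in pairs]
```

Here three of the four pairs share the same difference, so the expensive function body runs only twice.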

Iterating over a large data set in long running Python process - memory issues?

I am working on a long-running Python program (one part of it is a Flask API, and the other is a real-time data fetcher).
Both my long running processes iterate, quite often (the API one might even do so hundreds of times a second) over large data sets (second by second observations of certain economic series, for example 1-5MB worth of data or even more). They also interpolate, compare and do calculations between series etc.
What techniques, for the sake of keeping my processes alive, can I practice when iterating / passing as parameters / processing these large data sets? For instance, should I use the gc module and collect manually?
UPDATE
I am originally a C/C++ developer and would have NO problem (and would even enjoy) writing parts in C++. I simply have 0 experience doing so. How do I get started?
Any advice would be appreciated.
Thanks!

Working with large datasets isn't necessarily going to cause memory complications. As long as you use sound approaches when you view and manipulate your data, you can typically make frugal use of memory.
There are two concepts you need to consider as you're building the models that process your data.
What is the smallest element of your data that you need access to in order to perform a given calculation? For example, you might have a 300GB text file filled with numbers. If you're looking to calculate the average of the numbers, read one number at a time to calculate a running average. In this example, the smallest element is a single number in the file, since that is the only element of our data set that we need to consider at any point in time.
How can you model your application such that you access these elements iteratively, one at a time, during that calculation? In our example, instead of reading the entire file at once, we'll read one number from the file at a time. With this approach, we use a tiny amount of memory, but can process an arbitrarily large data set. Instead of passing a reference to your dataset around in memory, pass a view of your dataset, which knows how to load specific elements from it on demand (which can be freed once worked with). This is similar in principle to buffering and is the approach many iterators take (e.g., xrange, open's file object, etc.).
In general, the trick is understanding how to break your problem down into tiny, constant-sized pieces, and then stitching those pieces together one by one to calculate a result. You'll find these tenets of data processing go hand-in-hand with building applications that support massive parallelism, as well.
Looking towards gc is jumping the gun. You've provided only a high-level description of what you are working on, but from what you've said, there is no reason you need to complicate things by poking around in memory management yet. Depending on the type of analytics you are doing, consider investigating numpy which aims to lighten the burden of heavy statistical analysis.
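The running-average example can be sketched like this. The io.StringIO object stands in for a large file, and the one-number-per-line format is an assumption:

```python
import io


def numbers(lines):
    """Yield one number at a time from an iterable of text lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield float(line)


def running_mean(values):
    """Average an arbitrarily long stream while keeping only two numbers in memory."""
    count = 0
    mean = 0.0
    for v in values:
        count += 1
        mean += (v - mean) / count   # incremental update, no list of values kept
    return mean


# A file object opened with open() can be iterated line by line the same way.
sample = io.StringIO("1\n2\n3\n4\n")
average = running_mean(numbers(sample))
```

Memory use stays constant no matter how long the input is, because each line is discarded as soon as it has been folded into the mean.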

It's hard to say without a real look into your data/algorithm, but the following approaches seem to be universal:
Make sure you have no memory leaks, otherwise they will kill your program sooner or later. Use objgraph for this. It's a great tool! Read the docs; they contain good examples of the types of memory leaks you can face in a Python program.
Avoid copying data whenever possible. For example, if you need to work with part of a string or do string transformations, don't create a temporary substring; use indexes and stay read-only as long as possible. It could make your code more complex and less "pythonic", but this is the cost of optimization.
Use gc carefully. It can make your process unresponsive for a while and at the same time add no value. Read the doc. Briefly: you should use gc directly only when there is a real reason to do so, like the Python interpreter being unable to free memory after allocating a big temporary list of integers.
Seriously consider rewriting critical parts in C++. Start thinking about this unpleasant idea now, to be ready to do it when your data becomes bigger. Seriously, it usually ends this way. You can also give Cython a try; it could speed up the iteration itself.

Minimising reading from and writing to disk in Python for a memory-heavy operation

Background
I am working on a fairly computationally intensive computational linguistics project, but the problem I have is quite general and hence I expect that a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
1) Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
2) Process the data on each line.
3) From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
4) These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is operation 3 (and by connection 4)
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have its basis weights updated.
If you think of them in terms of python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index).
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
Attempted Solutions
There are three extrema when it comes to how to do this:
1) I could build all the vectors in memory, then write them to disk.
2) I could build all the vectors directly on the disk, using shelve or pickle or some such library.
3) I could build the vectors in memory one at a time, writing each to disk, passing through the corpus once per vector.
All these options are fairly intractable. 1 just uses up all the system memory, and it panics and slows to a crawl. 2 is way too slow as IO operations aren't fast. 3 is possibly even slower than 2 for the same reasons.
Goals
A good solution would involve:
1) Building as much as possible in memory.
2) Once memory is full, dump everything to disk.
3) If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
4) Go back to 1 until all vectors are built.
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it?
Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?

Take a look at PyTables. One of the advantages is that you can work with very large amounts of data, stored on disk, as if it were in memory.
edit: Because the I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O per second and virtually no seek time. The size of your project is perfect for today's affordable SSDs.

A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.

Two ideas:
Use numpy arrays to represent vectors. They are much more memory-efficient, at the cost that they will force elements of the vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable). Then another pass over all the data with second 1M vectors.
It seems you're on the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly, RAM) is available to you for this task. If there are 100k vectors, each of them with 1M ints, this gives ~370GB. If the multiple-pass method is viable and you've got a machine with 16GB RAM, then it is about 25 passes, which should be easy to parallelize if you've got a cluster.
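The estimate and the multiple-pass idea can be sketched as follows. The 4 bytes per value is an assumption about the integer size, the (vector id, index, weight) update format is invented for illustration, and the sketch stores each vector as a sparse dict rather than a NumPy array to stay dependency-free:

```python
import math


def passes_needed(n_vectors, dim, bytes_per_value, ram_bytes):
    """Return total bytes for dense vectors and how many RAM-sized passes that takes."""
    total = n_vectors * dim * bytes_per_value
    return total, math.ceil(total / ram_bytes)


total, n_passes = passes_needed(100_000, 1_000_000, 4, 16 * 2**30)
total_gib = total / 2**30          # roughly 372.5


def build_in_passes(read_corpus, vector_ids, per_pass):
    """Re-read the corpus once per batch of vectors, yielding each finished batch."""
    for start in range(0, len(vector_ids), per_pass):
        wanted = set(vector_ids[start:start + per_pass])
        batch = {}
        for vec_id, index, weight in read_corpus():
            if vec_id in wanted:
                vec = batch.setdefault(vec_id, {})
                vec[index] = vec.get(index, 0) + weight
        yield batch                # caller writes it to disk, then it can be freed


corpus = [("a", 0, 1), ("b", 2, 5), ("a", 0, 2)]
batches = list(build_in_passes(lambda: iter(corpus), ["a", "b"], per_pass=1))
```

With these assumptions the ceiling comes out at 24 passes, in line with the rough figure above; real passes would need some headroom, so a few more is realistic.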

Think about using an existing in-memory DB solution like Redis. The problem of switching to disk once RAM is gone, and the tricks to tweak this process, should already be in place. A Python client is available as well.
Moreover this solution could scale vertically without much effort.

You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of the code will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much greater than available RAM.

Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.

I'd suggest to do it this way:
1) Construct the easy pipeline you mentioned
2) Construct your vectors in memory and "flush" them into a DB (Redis and MongoDB are good candidates)
3) Determine how much memory this procedure consumes and parallelize accordingly (or even better, use a map/reduce approach, or a distributed task queue like Celery)
Plus all the tips mentioned before (NumPy etc.)

Hard to say exactly because there are a few details missing, e.g. is this a dedicated box? Does the process run on several machines? Does the available memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache and see what happens. With plenty of cache performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation is calculated that can be used again, you store it in a dictionary (or perhaps disk, memcached, etc), and then look there first before calculating again. The Django docs have a good introduction.

From another comment I infer that your corpus fits into the memory, and you have some cores to throw at the problem, so I would try this:
Find a method to have your corpus in memory. This might be a sort of ram disk with file system, or a database. No idea, which one is best for you.
Have a smallish shell script monitor RAM usage, and spawn every second another process of the following, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwidth to disk):
iterate through the corpus and build and write some vectors
in the end you can collect and combine all vectors, if needed (this would be the reduce part)

Split the corpus evenly in size between parallel jobs (one per core) and process them in parallel, ignoring any incomplete line (or, if you cannot tell whether a line is incomplete, ignore the first and last line that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs. That's the reduce step.
You stand to lose information from 2*N lines, where N is the number of parallel processes, but you gain by not adding complicated logic to try and capture these lines for processing.
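A serial sketch of the map and reduce parts, relying on the fact stated in the question that partial vectors from corpus segments can simply be added. The "entity index weight" line format is invented for illustration:

```python
from collections import Counter
from functools import reduce


def map_segment(lines):
    """Build partial vectors from one corpus segment: {entity: Counter(index -> weight)}."""
    partial = {}
    for line in lines:
        entity, index, weight = line.split()
        partial.setdefault(entity, Counter())[int(index)] += float(weight)
    return partial


def merge(left, right):
    """Reduce step: add the partial vectors of two segments, entity by entity."""
    merged = {entity: Counter(vec) for entity, vec in left.items()}
    for entity, vec in right.items():
        merged.setdefault(entity, Counter()).update(vec)   # update() adds weights
    return merged


corpus = ["cat 0 1", "dog 3 2", "cat 0 4", "cat 7 1"]
segments = [corpus[:2], corpus[2:]]
combined = reduce(merge, map(map_segment, segments))
```

In a real run the built-in map could be swapped for multiprocessing.Pool.map so that each segment is processed on its own core; the merge step stays the same.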

Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example, say I want a while loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The following pseudocode will do the trick:
while someCondition:
    if psutil.virtual_memory().percent > 80.0:
        dumpQueue(myQueue, somefile)
    else:
        addSomeStufftoQueue(myQueue, stuff)
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
PS. Props to Sean for suggesting this module.
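A slightly fuller sketch of the build-then-dump loop. The function name, the update format and the dump callback are all hypothetical. The memory probe is passed in as a callable so the logic can be tried without psutil; with psutil installed it could be lambda: psutil.virtual_memory().percent:

```python
def build_with_flush(updates, memory_percent, dump, limit=80.0):
    """Accumulate vector updates in memory; dump and start afresh when memory is tight."""
    vectors = {}
    flushes = 0
    for vec_id, index, weight in updates:
        vec = vectors.setdefault(vec_id, {})
        vec[index] = vec.get(index, 0) + weight
        if memory_percent() > limit:
            dump(vectors)          # e.g. write the partial vectors to disk
            vectors = {}           # drop the in-memory copies
            flushes += 1
    if vectors:
        dump(vectors)              # final partial batch
        flushes += 1
    return flushes


# Demo with a fake memory probe that reports 10%, then 90%, then 10%.
readings = iter([10.0, 90.0, 10.0])
dumped = []
flushes = build_with_flush(
    [("a", 0, 1), ("a", 0, 2), ("b", 1, 1)],
    memory_percent=lambda: next(readings),
    dump=dumped.append,
)
```

A vector may end up split across several dumps, so the dumped pieces have to be added together afterwards, which works here because the vectors are additive.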
