Python - go beyond RAM limits?

I'm trying to analyze text, but my Mac has only 8 GB of RAM, and the RidgeRegressor just stops after a while with Killed: 9. I reckon this is because it needs more memory.
Is there a way to disable the stack size limiter so that the algorithm could use some kind of swap memory?

You will need to do it manually.
There are probably two different core-problems here:
A: holding your training-data
B: training the regressor
For A, you can try numpy's memmap which abstracts swapping away.
As an alternative, consider converting your data to HDF5 or storing it in a database. For HDF5, you can use h5py or pytables, both of which allow numpy-like usage.
For B: it's a good idea to use some out-of-core ready algorithm. In scikit-learn those are the ones supporting partial_fit.
Keep in mind that this training process then has to satisfy at least two requirements:
Memory efficiency: swapping is slow, so you don't want to use something which holds N^2 auxiliary memory during learning.
Efficient convergence.
The partial_fit algorithms mentioned above should be okay for both.
SGDRegressor can be parameterized to resemble RidgeRegression.
Also: you may need to call partial_fit manually, obeying the rules of the algorithm (often some kind of random ordering is needed for the convergence proofs). The problem with abstracting swapping away is: if your regressor does a permutation in each epoch without knowing how costly that is, you might be in trouble!
Because the problem itself is quite hard, there are some special libraries built for this, while sklearn needs some more manual work as explained. One of the most extreme ones (a lot of crazy tricks) might be vowpal_wabbit (where IO is often the bottleneck!). Of course there are other popular libs like pyspark, serving a slightly different purpose (distributed computing).
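To make points A and B concrete, here is a minimal sketch that keeps the training data in numpy.memmap files and feeds it to SGDRegressor.partial_fit one batch at a time. The file names, sizes, batch size and alpha are made up for illustration, and an L2-penalised SGDRegressor only approximates ridge regression (the two scale their alpha differently), so treat it as a starting point rather than a drop-in replacement.

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical sizes; a real data set would be far larger than RAM.
n_samples, n_features, batch_size = 2_000, 5, 250
rng = np.random.default_rng(0)

# Create disk-backed arrays. In practice these files would already exist.
tmp_dir = tempfile.mkdtemp()
X = np.memmap(os.path.join(tmp_dir, "X.dat"), dtype="float64",
              mode="w+", shape=(n_samples, n_features))
y = np.memmap(os.path.join(tmp_dir, "y.dat"), dtype="float64",
              mode="w+", shape=(n_samples,))
X[:] = rng.normal(size=(n_samples, n_features))
y[:] = X @ np.arange(1, n_features + 1, dtype="float64")
X.flush()
y.flush()

# Squared loss (the default) with an L2 penalty resembles ridge regression.
model = SGDRegressor(penalty="l2", alpha=1e-4, random_state=0)

batch_starts = np.arange(0, n_samples, batch_size)
for epoch in range(20):
    # Shuffle the order of whole batches: cheap, and each batch is a
    # contiguous slice, so disk reads stay sequential.
    for start in rng.permutation(batch_starts):
        stop = start + batch_size
        model.partial_fit(X[start:stop], y[start:stop])

print(model.coef_)
```

Shuffling whole batches instead of individual rows is a compromise: it gives the optimizer some randomness without triggering random access all over the memory-mapped file.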

Related

Performance of single segment operation vs multiple operations on segments

Since currently there is no easy way to profile TensorFlow operations (Can I measure the execution time of individual operations with TensorFlow?), can anyone help me understand the benefits of using segment operations (e.g. segment_sum) compared to using multiple operations on pre-segmented tensors? Would segment_sum be more efficient than using dynamic_partition or gather followed by multiple reduce_sum? Would segment_sum be equally parallelizable?

I've updated the SO question you link to with some information about CPU inference profiling tools we've recently released at:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/benchmark
Unfortunately the overall question is a lot harder to answer, since it depends on:
Whether you're focused on training, or inference.
If you're using a GPU, and if so what kind and how many.
Whether you're running distributed.
What your data looks like, and where the bottlenecks are.
What I usually end up doing is building small sub-graphs that are representative of the sort of ops I'm considering, and then timing how long they take on the sort of data I'll be feeding in. I know that isn't immediately helpful, since the experimentation can be time-consuming, but it is the best way to get an intuitive understanding of the optimal solution for your particular circumstances.

Running parallel iterations

I am trying to run a sort of simulation where there are fixed parameters I need to iterate on and find out the combinations which have the least cost. I am using Python multiprocessing for this purpose, but the time consumed is too high. Is there something wrong with my implementation? Or is there a better solution? Thanks in advance.
import multiprocessing

class Iters(object):
    #parameters for iterations
    iters={}
    iters['cwm']={'min':100,'max':130,'step':5}
    iters['fx']={'min':1.45,'max':1.45,'step':0.01}
    iters['lvt']={'min':106,'max':110,'step':1}
    iters['lvw']={'min':9.2,'max':10,'step':0.1}
    iters['lvk']={'min':3.3,'max':4.3,'step':0.1}
    iters['hvw']={'min':1,'max':2,'step':0.1}
    iters['lvh']={'min':6,'max':7,'step':1}

    def run_mp(self):
        mps=[]
        m=multiprocessing.Manager()
        q=m.list()
        cmain=self.iters['cwm']['min']
        #start one process per value of cwm
        while(cmain<=self.iters['cwm']['max']):
            t2=multiprocessing.Process(target=mp_main,args=(cmain,self.iters,q))
            mps.append(t2)
            t2.start()
            cmain=cmain+self.iters['cwm']['step']
        for mp in mps:
            mp.join()
        r1=sorted(q,key=lambda x:x['costing'])
        returning=r1[:20] #the 20 cheapest combinations
        self.counter=len(q)
        return returning

def mp_main(cmain,iters,q):
    fmain=iters['fx']['min']
    while(fmain<=iters['fx']['max']):
        lvtmain=iters['lvt']['min']
        while (lvtmain<=iters['lvt']['max']):
            lvwmain=iters['lvw']['min']
            while (lvwmain<=iters['lvw']['max']):
                lvkmain=iters['lvk']['min']
                while (lvkmain<=iters['lvk']['max']):
                    hvwmain=iters['hvw']['min']
                    while (hvwmain<=iters['hvw']['max']):
                        lvhmain=iters['lvh']['min']
                        while (lvhmain<=iters['lvh']['max']):
                            test={'cmain':cmain,'fmain':fmain,'lvtmain':lvtmain,'lvwmain':lvwmain,'lvkmain':lvkmain,'hvwmain':hvwmain,'lvhmain':lvhmain}
                            y=calculations(test,q)
                            lvhmain=lvhmain+iters['lvh']['step']
                        hvwmain=hvwmain+iters['hvw']['step']
                    lvkmain=lvkmain+iters['lvk']['step']
                lvwmain=lvwmain+iters['lvw']['step']
            lvtmain=lvtmain+iters['lvt']['step']
        fmain=fmain+iters['fx']['step']

def calculations(test,que):
    #perform huge number of calculations here
    output={}
    output['data']=test
    output['costing']='foo'
    que.append(output)

if __name__=='__main__':
    x=Iters()
    x.run_mp()

From a theoretical standpoint:
You're iterating every possible combination of 6 different variables. Unless your search space is very small, or you wanted just a very rough solution, there's no way you'll get any meaningful results within reasonable time.
i need to iterate on and find out the combinations which has the least cost
This very much sounds like an optimization problem.
There are many different efficient ways of dealing with these problems, depending on the properties of the function you're trying to optimize. If it has a straightforward "shape" (a single optimum with no other local optima), you can use a greedy algorithm such as hill climbing or gradient descent. If it's more complex, you can try shotgun hill climbing.
There are many more complex algorithms, but these are the basic ones, and they will probably help you a lot in this situation.
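As a rough illustration of the two ideas above, here is a small sketch of hill climbing on an integer grid, plus a "shotgun" wrapper that restarts it from random points and keeps the best result. The function names, the toy cost function and the bounds are all invented for the example; the real cost would come from the simulation.

```python
import random

def hill_climb(cost, start, bounds):
    """Greedy descent: move to the best neighbour until none improves."""
    current = tuple(start)
    current_cost = cost(current)
    while True:
        best, best_cost = current, current_cost
        for axis in range(len(current)):
            for delta in (-1, 1):
                candidate = list(current)
                candidate[axis] += delta
                low, high = bounds[axis]
                if not low <= candidate[axis] <= high:
                    continue  # stay inside the search space
                candidate = tuple(candidate)
                candidate_cost = cost(candidate)
                if candidate_cost < best_cost:
                    best, best_cost = candidate, candidate_cost
        if best == current:
            return current, current_cost  # local minimum reached
        current, current_cost = best, best_cost

def shotgun_hill_climb(cost, bounds, restarts=10, seed=0):
    """Run hill_climb from several random starts and keep the cheapest."""
    rng = random.Random(seed)
    results = []
    for _ in range(restarts):
        start = [rng.randint(low, high) for low, high in bounds]
        results.append(hill_climb(cost, start, bounds))
    return min(results, key=lambda result: result[1])

def toy_cost(point):
    # Hypothetical cost with a single minimum at (3, -2).
    x, y = point
    return (x - 3) ** 2 + (y + 2) ** 2

best_point, best_cost = shotgun_hill_climb(toy_cost, [(-10, 10), (-10, 10)])
print(best_point, best_cost)
```

Each climb evaluates only the neighbours along its path instead of the whole grid, which is where the savings over exhaustive search come from. The restarts matter only when the cost has several local minima.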
From a more practical programming standpoint:
You are using very large steps - so large, in fact, that you'll only probe the function 19,200 times. If this is what you want, it seems very feasible. In fact, if I comment out the y=calculations(test,q) line, this returns instantly on my computer.
As you indicate, there's a "huge number of calculations" there - so maybe that is your real problem, and not the code you're asking for help with.
As to multiprocessing, my honest advice is to not use it until you already have your code executing reasonably fast. Unless you're running a supercomputing cluster (you're not programming a supercomputing cluster in python, are you??), parallel processing will get you speedups of 2-4x. That's absolutely negligible compared to the gains you get from the kind of algorithmic changes I mentioned.
As an aside, I don't think I've ever seen that many nested loops in my life (excluding code jokes). If you don't want to switch to another algorithm, you might want to consider using itertools.product together with numpy.arange.
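For illustration, here is one way the nested loops could collapse into a single itertools.product call. This sketch builds each axis from an integer step count instead of numpy.arange (a deliberate substitution, to stay in the standard library and avoid accumulating floating-point error), uses only three of the question's ranges, and invents a toy_cost function in place of the real calculations.

```python
import heapq
import itertools

def axis(minimum, maximum, step):
    """Values from minimum to maximum inclusive, without float drift."""
    count = int(round((maximum - minimum) / step)) + 1
    return [minimum + i * step for i in range(count)]

# Three of the ranges from the question; the others are added the same way.
grid = {
    "cwm": axis(100, 130, 5),
    "lvt": axis(106, 110, 1),
    "lvh": axis(6, 7, 1),
}

def toy_cost(params):
    # Hypothetical stand-in for the real, expensive calculations.
    return abs(params["cwm"] - 115) + abs(params["lvt"] - 108) + params["lvh"]

names = list(grid)
# Lazily generate every combination as a dict of parameter values.
candidates = (dict(zip(names, values))
              for values in itertools.product(*grid.values()))

# Keep only the 20 cheapest combinations instead of sorting everything.
best = heapq.nsmallest(20, candidates, key=toy_cost)
print(best[0], toy_cost(best[0]))
```

Because candidates is a generator, the full list of combinations is never held in memory; heapq.nsmallest keeps just the current top 20.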

using cython or PyPy to optimise tuples/lists (graph theory algorithm implemented in python)

I am working on a theoretical graph theory problem which involves taking combinations of hyperedges in a hypergraph to analyse the various cases.
I have implemented an initial version of the main algorithm in Python, but due to its combinatorial structure (and probably my implementation) the algorithm is quite slow.
One way I am considering speeding it up is by using either PyPy or Cython.
Looking at the documentation it seems Cython doesn't offer great speedup when it comes to tuples. This might be problematic for the implementation, since I am representing hyperedges as tuples - so the majority of the algorithm is in manipulating tuples (however they are all the same length, around len 6 each).
Since both my C and Python skills are quite minimal, I would appreciate it if someone could advise on the best way to proceed in optimising the code, given its reliance on tuples/lists. Is there any documentation on using lists/tuples with Cython (or PyPy)?

If your algorithm is bad in terms of computational complexity, then you cannot be saved; you need to write a better one. Consult a good graph theory book or Wikipedia; it's usually relatively easy, although some problems only have algorithms that are both non-trivial and crazy hard to implement. This sounds like something PyPy can speed up quite significantly, but only by a constant factor; on the other hand, it does not require any modifications to your code. Cython does not speed up your code all that much without type declarations, and it seems like this sort of problem cannot really be sped up just by adding types.
The constant part is what's crucial here: if the algorithm's complexity grows like, say, 2^n (which is typical for a naive algorithm), then adding one extra node to the graph doubles your time. This means 10 extra nodes multiply the time by 1024, 20 nodes by 1024*1024, etc. If you're super-lucky, PyPy can speed up your algorithm by 100x, but that factor stays constant regardless of graph size (and you quickly run out of the universe's time one way or another).

what would be the best way to proceed in optimising the code...
Profile first. There is a standard cProfile module that does simple profiling very well. Optimising your code before profiling is quite pointless.
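A minimal sketch of what that profiling could look like, using only the standard library. The count_disjoint_pairs function and its tuple-based hyperedges are invented here as a stand-in for the real algorithm.

```python
import cProfile
import io
import pstats
from itertools import combinations

def count_disjoint_pairs(hyperedges):
    """Toy workload: count pairs of hyperedges that share no vertex."""
    return sum(1 for a, b in combinations(hyperedges, 2)
               if set(a).isdisjoint(b))

# Hypothetical input: every 3-vertex hyperedge over 8 vertices, as tuples.
edges = list(combinations(range(8), 3))

profiler = cProfile.Profile()
profiler.enable()
result = count_disjoint_pairs(edges)
profiler.disable()

# Collect the report in a string, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The report lists call counts and cumulative times per function, which shows whether the time really goes into tuple handling or somewhere else entirely. For a whole script, running python -m cProfile -s cumulative yourscript.py gives the same kind of table without code changes.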
Besides, for graphs you can try using the excellent networkx module. Also, if you deal with long sorted lists you can have a look at bisect and heapq modules.

Iterating over a large data set in long running Python process - memory issues?

I am working on a long running Python program (a part of it is a Flask API, and the other realtime data fetcher).
Both my long running processes iterate, quite often (the API one might even do so hundreds of times a second) over large data sets (second by second observations of certain economic series, for example 1-5MB worth of data or even more). They also interpolate, compare and do calculations between series etc.
What techniques, for the sake of keeping my processes alive, can I practice when iterating / passing as parameters / processing these large data sets? For instance, should I use the gc module and collect manually?
UPDATE
I am originally a C/C++ developer and would have NO problem (and would even enjoy) writing parts in C++. I simply have 0 experience doing so. How do I get started?
Any advice would be appreciated.
Thanks!

Working with large datasets isn't necessarily going to cause memory complications. As long as you use sound approaches when you view and manipulate your data, you can typically make frugal use of memory.
There are two concepts you need to consider as you're building the models that process your data.
What is the smallest element of your data that you need access to in order to perform a given calculation? For example, you might have a 300GB text file filled with numbers. If you're looking to calculate the average of the numbers, read one number at a time to calculate a running average. In this example, the smallest element is a single number in the file, since that is the only element of our data set that we need to consider at any point in time.
How can you model your application such that you access these elements iteratively, one at a time, during that calculation? In our example, instead of reading the entire file at once, we'll read one number from the file at a time. With this approach, we use a tiny amount of memory, but can process an arbitrarily large data set. Instead of passing a reference to your dataset around in memory, pass a view of your dataset, which knows how to load specific elements from it on demand (and which can be freed once worked with). This is similar in principle to buffering and is the approach many iterators take (e.g., xrange, open's file object, etc.).
In general, the trick is understanding how to break your problem down into tiny, constant-sized pieces, and then stitching those pieces together one by one to calculate a result. You'll find these tenets of data processing go hand-in-hand with building applications that support massive parallelism, as well.
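The running-average example can be sketched in a few lines. This assumes a text file with one number per line (the temporary file below is just demo data), and the function names are made up for the illustration.

```python
import os
import tempfile

def read_numbers(path):
    """Yield one number at a time; the file is never loaded whole."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield float(line)

def running_mean(values):
    """Consume an iterable once, keeping only a count and the mean so far."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        mean += (value - mean) / count  # incremental update
    return mean, count

# Demo data: a small stand-in for the huge file described above.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1\n2\n3\n4\n5\n")
path = tmp.name

mean, count = running_mean(read_numbers(path))
os.remove(path)
print(mean, count)
```

Memory use here is constant no matter how long the file is, because read_numbers is a generator and running_mean stores just two numbers.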
Looking towards gc is jumping the gun. You've provided only a high-level description of what you are working on, but from what you've said, there is no reason you need to complicate things by poking around in memory management yet. Depending on the type of analytics you are doing, consider investigating numpy which aims to lighten the burden of heavy statistical analysis.

It's hard to say without a real look at your data/algorithm, but the following approaches seem to be universal:
Make sure you have no memory leaks, otherwise they will kill your program sooner or later. Use objgraph for this - great tool! Read the docs - they contain good examples of the types of memory leaks you can face in a Python program.
Avoid copying data whenever possible. For example, if you need to work with part of a string or do string transformations, don't create a temporary substring; use indexes and stay read-only as long as possible. It could make your code more complex and less "pythonic", but this is the cost of optimization.
Use gc carefully - it can make your process unresponsive for a while and at the same time add no value. Read the docs. Briefly: you should use gc directly only when there is a real reason to do so, like the Python interpreter being unable to free memory after allocating a big temporary list of integers.
Seriously consider rewriting critical parts in C++. Start thinking about this unpleasant idea now, so you are ready to do it when your data becomes bigger. Seriously, it usually ends this way. You can also give Cython a try; it could speed up the iteration itself.
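On the "avoid copying" point, one standard-library option for binary data is memoryview, which lets you slice a bytes object without duplicating it. This is only a small sketch with made-up data; it applies to bytes-like objects, not to str.

```python
# About 1 MB of sample bytes (hypothetical data).
data = bytes(range(256)) * 4096

view = memoryview(data)

# Slicing a memoryview creates a new view onto the same buffer: no copy.
window = view[1000:2000]
total = sum(window)

# Slicing the bytes object directly allocates a new 1000-byte object.
copied = data[1000:2000]

print(len(window), total, window.readonly)
```

The view stays read-only because the underlying object is immutable, which matches the advice to "stay read-only as long as possible".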

Minimising reading from and writing to disk in Python for a memory-heavy operation

Background
I am working on a fairly computationally intensive computational linguistics project, but the problem I have is quite general, and hence I expect that a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
Process the data on each line.
From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is operation 3 (and by connection 4)
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have its basis weights updated.
If you think of them in terms of python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index).
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
Attempted Solutions
There are three extrema when it comes to how to do this:
I could build all the vectors in memory. Then write them to disk.
I could build all the vectors directly on the disk, using shelve or pickle or some such library.
I could build the vectors in memory one at a time and write each to disk, passing through the corpus once per vector.
All these options are fairly intractable. 1 just uses up all the system memory, and it panics and slows to a crawl. 2 is way too slow as IO operations aren't fast. 3 is possibly even slower than 2 for the same reasons.
Goals
A good solution would involve:
Building as much as possible in memory.
Once memory is full, dump everything to disk.
If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
Go back to 1 until all vectors are built.
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it?
Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?

Take a look at pytables. One of the advantages is that you can work with very large amounts of data, stored on disk, as if it were in memory.
edit: Because I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O per second and virtually no seek time. The size of your project is perfect for today's affordable SSD 'drives'.

A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.

Two ideas:
Use numpy arrays to represent vectors. They are much more memory-efficient, at the cost that they will force elements of the vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable). Then another pass over all the data with second 1M vectors.
It seems you're on the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly, RAM) is available to you for this task. If there are 100k vectors, each of them with 1M ints, this gives ~370GB. If multiple passes method is viable and you've got a machine with 16GB RAM, then it is about ~25 passes -- should be easy to parallelize if you've got a cluster.
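The estimate above can be checked with a few lines of arithmetic. This assumes 4-byte integers and that the full 16 GB is usable for vectors, which is optimistic; leaving headroom for the interpreter and the OS pushes the pass count up a little, in line with the rough figure of 25.

```python
import math

n_vectors = 100_000
dims_per_vector = 1_000_000
bytes_per_int = 4          # assumption: 32-bit integers
ram_bytes = 16 * 2**30     # assumption: all 16 GB available for vectors

total_bytes = n_vectors * dims_per_vector * bytes_per_int
total_gib = total_bytes / 2**30

# Each pass builds as many vectors as fit in RAM.
vectors_per_pass = ram_bytes // (dims_per_vector * bytes_per_int)
passes = math.ceil(n_vectors / vectors_per_pass)

print(round(total_gib, 1), vectors_per_pass, passes)
```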

Think about using an existing in-memory DB solution like Redis. The problem of switching to disk once RAM is gone and tricks to tweak this process should already be in place. Python client as well.
Moreover this solution could scale vertically without much effort.

You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of the code will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much greater than available RAM.

Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.

I'd suggest to do it this way:
1) Construct the easy pipeline you mentioned
2) Construct your vectors in memory and "flush" them into a DB (Redis and MongoDB are good candidates)
3) Determine how much memory this procedure consumes and parallelize accordingly (or even better, use a map/reduce approach, or a distributed task queue like celery)
Plus all the tips mentioned before (NumPy etc.)
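To show the shape of step 2 without needing a server, here is a sketch that uses the standard library's sqlite3 in place of Redis or MongoDB (a substitution made purely so the example is self-contained). The class name, table layout and max_entries threshold are all invented for the illustration.

```python
import sqlite3
from collections import defaultdict

class FlushingAccumulator:
    """Buffer sparse vector updates in memory, then add them into a table."""

    def __init__(self, connection, max_entries):
        self.connection = connection
        self.max_entries = max_entries
        self.buffer = defaultdict(float)
        connection.execute(
            "CREATE TABLE IF NOT EXISTS weights ("
            "vector TEXT, idx INTEGER, value REAL, "
            "PRIMARY KEY (vector, idx))"
        )

    def add(self, vector, index, amount):
        self.buffer[(vector, index)] += amount
        if len(self.buffer) >= self.max_entries:
            self.flush()

    def flush(self):
        items = list(self.buffer.items())
        with self.connection:  # one transaction per flush
            # Make sure every row exists, then add the buffered amounts.
            self.connection.executemany(
                "INSERT OR IGNORE INTO weights (vector, idx, value) "
                "VALUES (?, ?, 0.0)",
                [(vector, index) for (vector, index), _ in items],
            )
            self.connection.executemany(
                "UPDATE weights SET value = value + ? "
                "WHERE vector = ? AND idx = ?",
                [(amount, vector, index) for (vector, index), amount in items],
            )
        self.buffer.clear()

conn = sqlite3.connect(":memory:")
acc = FlushingAccumulator(conn, max_entries=2)
for vector, index, amount in [("cat", 3, 1.0), ("dog", 1, 2.0),
                              ("cat", 3, 0.5), ("cat", 7, 1.0)]:
    acc.add(vector, index, amount)
acc.flush()  # write whatever is still buffered

rows = conn.execute(
    "SELECT vector, idx, value FROM weights ORDER BY vector, idx"
).fetchall()
print(rows)
```

Because the updates are pure increments and order does not matter, repeated increments to the same index collapse in the buffer before they ever reach the database, which is where most of the I/O saving comes from.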

Hard to say exactly because there are a few details missing, e.g. is this a dedicated box? Does the process run on several machines? Does the available memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache and see what happens. With plenty of cache performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation is calculated that can be used again, you store it in a dictionary (or perhaps disk, memcached, etc), and then look there first before calculating again. The Django docs have a good introduction.
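The dictionary-based caching described above might look like this in its simplest form. The expensive function is a made-up placeholder, and the call counter exists only to show that the second lookup skips the computation.

```python
cache = {}
call_count = 0

def expensive(n):
    """Hypothetical slow calculation."""
    global call_count
    call_count += 1
    return sum(i * i for i in range(n))

def cached_expensive(n):
    # Look in the dictionary first; compute and store only on a miss.
    if n not in cache:
        cache[n] = expensive(n)
    return cache[n]

first = cached_expensive(10_000)
second = cached_expensive(10_000)  # served from the cache
print(first == second, call_count)
```

For pure functions with hashable arguments, functools.lru_cache from the standard library does the same thing as a decorator and can also cap the cache size.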

From another comment I infer that your corpus fits into the memory, and you have some cores to throw at the problem, so I would try this:
Find a method to have your corpus in memory. This might be a sort of ram disk with file system, or a database. No idea, which one is best for you.
Have a smallish shell script monitor RAM usage, and spawn another process of the following every second, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwidth to disk):
iterate through the corpus and build and write some vectors
in the end you can collect and combine all vectors, if needed (this would be the reduce part)

Split the corpus evenly in size between parallel jobs (one per core) and process in parallel, ignoring any incomplete line (or if you cannot tell whether a line is incomplete, ignore the first and last line that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs - That's the reduce step.
You stand to lose information from 2*N lines where N is the number of parallel processes, but you gain by not adding complicated logic to try and capture these lines for processing.
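Here is a sketch of that map/reduce split by byte offsets. It deviates from the description above in one respect: instead of dropping the lines at chunk boundaries, each line is assigned to the chunk that contains its first byte, which turns out to need only a seek and one extra readline. The token counting stands in for the real vector building, the function names are invented, and the chunks are processed in a plain loop here, although each count_chunk call is independent and could be handed to a separate process.

```python
import os
import tempfile
from collections import Counter

def chunk_ranges(path, n_chunks):
    """Split the file into at most n_chunks byte ranges of similar size."""
    size = os.path.getsize(path)
    step = max(1, -(-size // n_chunks))  # ceiling division
    return [(start, min(start + step, size))
            for start in range(0, size, step)]

def count_chunk(path, start, end):
    """Map step: count tokens on lines whose first byte is in [start, end)."""
    counts = Counter()
    with open(path, "rb") as handle:
        if start > 0:
            # Step back one byte and read to the next newline. This skips
            # the tail of a line that began in the previous chunk.
            handle.seek(start - 1)
            handle.readline()
        while handle.tell() < end:
            line = handle.readline()
            if not line:
                break
            counts.update(line.split())
    return counts

def count_corpus(path, n_chunks):
    """Reduce step: add the per-chunk counters together."""
    total = Counter()
    for start, end in chunk_ranges(path, n_chunks):
        total.update(count_chunk(path, start, end))
    return total

# Tiny demonstration corpus (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a b\nb c\nc c d\nd\n" * 5)
corpus_path = tmp.name

merged = count_corpus(corpus_path, n_chunks=4)
os.remove(corpus_path)
print(merged)
```

The file is opened in binary mode so that byte offsets and tell() are exact; the tokens therefore come back as bytes.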

Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example, say I want to have a while-loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The following pseudocode will do the trick:
import psutil

while (someCondition):
    if psutil.virtual_memory().percent > 80.0:
        dumpQueue(myQueue,somefile)
    else:
        addSomeStufftoQueue(myQueue,stuff)
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
PS. Props to Sean for suggesting this module.
