I have a CherryPy server. Part of that server's job is to crunch data (let's call it BigCalculation) and return the result to the client. The raw data is a numpy array (3 dims) and could ultimately be close to 1 GB in size. If performance is good it may even get bigger than that, so whatever I do needs to be scalable.
The result of BigCalculation that is typically returned to the client is a 2D image corresponding to different ways to slice the cube and compute some metrics on that slice. For now all this is done sequentially as a demo on smaller data cubes, but now I need to parallelize it to some extent.
I'm thinking of a few ways I could go about it. Conceptually, they range from low-level parallelism to a higher level:
Something like Intel's Python distribution. This is really low-level, but then assuming my code works fine on their distro I basically don't have to do anything.
At the data level: I could split the initial cube into sections that are to be processed in parallel.
At the method level: a few such BigCalculations may be ongoing at the same time. Therefore I could spin off each new BigCalculation in its own process.
Server level: I'm also thinking I could spin off a sub-server that does only the calculations. The server instance that's exposed to the user only serves the requests, and anything that involves server-side crunching is actually forwarded to a computation server.
My question: is there a best approach/practice when trying to decide the level at which you implement parallel processing first? Would I be better off implementing lower-level parallelism first (e.g. Intel's or slicing my cube of data) or higher-level first (e.g. spinning off each call to BigCalculation separately)? Or is it mostly something to be decided in specific instances, with no general advice one can really give on this topic?
I am writing a program where I need to send lots of small chunks of data to a server (mostly integers or strings), so I am using the struct library.
Right now I am using struct.pack, but I am wondering if I should use struct.pack_into instead, as I read it reduces overhead.
However, I am not interested in "saving" the values; I just want to pack the data and quickly send it off. If I use struct.pack_into, would it save the values in any way as it uses a buffer, thus reducing performance?
Which of these two methods best suits my needs?
Thanks,
The difference between these methods really revolves around whether you already have an existing buffer you wish to write formatted data into (struct.pack_into), or whether you simply want to create a new buffer with the formatted data (struct.pack).
You are dealing with small buffers. Unless you have good reason to suspect you need to optimise for buffer copies, you may as well use struct.pack.
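To make the difference concrete, here is a minimal sketch using only the standard struct module. The format string "!I8s", the helper names pack_new and pack_reuse, and the sample values are assumptions picked for illustration; they are not from the question.

```python
import struct

# Hypothetical wire format: network byte order, one unsigned 32-bit int
# followed by an 8-byte string field (shorter values are null-padded).
RECORD = struct.Struct("!I8s")


def pack_new(value, label):
    """struct.pack style: allocate a fresh bytes object for every message."""
    return RECORD.pack(value, label)


def pack_reuse(buffer, value, label):
    """struct.pack_into style: overwrite a caller-supplied, preallocated buffer."""
    RECORD.pack_into(buffer, 0, value, label)
    return buffer


message = pack_new(42, b"hello")

# The same scratch buffer can be reused for every outgoing message.
scratch = bytearray(RECORD.size)
pack_reuse(scratch, 42, b"hello")
```

Both calls produce the same bytes. pack_into does not "remember" anything beyond what you leave in the buffer yourself; it only saves an allocation per message, which rarely matters for tiny records.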
I am working on a long running Python program (a part of it is a Flask API, and the other realtime data fetcher).
Both my long running processes iterate, quite often (the API one might even do so hundreds of times a second) over large data sets (second by second observations of certain economic series, for example 1-5MB worth of data or even more). They also interpolate, compare and do calculations between series etc.
What techniques, for the sake of keeping my processes alive, can I practice when iterating / passing as parameters / processing these large data sets? For instance, should I use the gc module and collect manually?
UPDATE
I am originally a C/C++ developer and would have NO problem (and would even enjoy) writing parts in C++. I simply have 0 experience doing so. How do I get started?
Any advice would be appreciated.
Thanks!
Working with large datasets isn't necessarily going to cause memory complications. As long as you use sound approaches when you view and manipulate your data, you can typically make frugal use of memory.
There are two concepts you need to consider as you're building the models that process your data.
What is the smallest element of your data that you need access to in order to perform a given calculation? For example, you might have a 300GB text file filled with numbers. If you're looking to calculate the average of the numbers, read one number at a time to calculate a running average. In this example, the smallest element is a single number in the file, since that is the only element of our data set that we need to consider at any point in time.
How can you model your application such that you access these elements iteratively, one at a time, during that calculation? In our example, instead of reading the entire file at once, we'll read one number from the file at a time. With this approach, we use a tiny amount of memory, but can process an arbitrarily large data set. Instead of passing a reference to your dataset around in memory, pass a view of your dataset, which knows how to load specific elements from it on demand (which can be freed once worked with). This is similar in principle to buffering and is the approach many iterators take (e.g., xrange, or range in Python 3, open's file object, etc.).
In general, the trick is understanding how to break your problem down into tiny, constant-sized pieces, and then stitching those pieces together one by one to calculate a result. You'll find these tenets of data processing go hand-in-hand with building applications that support massive parallelism, as well.
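As a rough sketch of the running-average example, assuming one number per line (the function names iter_numbers and running_mean are made up for illustration):

```python
def iter_numbers(lines):
    """Yield one float per non-empty line without holding the whole input."""
    for line in lines:
        line = line.strip()
        if line:
            yield float(line)


def running_mean(numbers):
    """Update the mean incrementally, so memory use stays constant."""
    count = 0
    mean = 0.0
    for x in numbers:
        count += 1
        mean += (x - mean) / count
    return mean


# Any iterable of lines works here; an open file object would stream
# the lines from disk in exactly the same way.
sample = ["1\n", "2\n", "\n", "3\n", "6\n"]
result = running_mean(iter_numbers(sample))
```

With a real file you would pass the file object itself, e.g. running_mean(iter_numbers(open(path))), and only one line would be resident at a time.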
Looking towards gc is jumping the gun. You've provided only a high-level description of what you are working on, but from what you've said, there is no reason you need to complicate things by poking around in memory management yet. Depending on the type of analytics you are doing, consider investigating numpy which aims to lighten the burden of heavy statistical analysis.
It's hard to say without a real look into your data/algorithm, but the following approaches seem to be universal:
Make sure you have no memory leaks, otherwise they will kill your program sooner or later. Use objgraph for this, it's a great tool! Read the docs: they contain good examples of the types of memory leaks you can face in a Python program.
Avoid copying data whenever possible. For example, if you need to work with part of a string or do string transformations, don't create temporary substrings; use indexes and stay read-only as long as possible. It could make your code more complex and less "pythonic", but this is the cost of optimization.
Use gc carefully: it can make your process unresponsive for a while and at the same time add no value. Read the docs. Briefly: you should use gc directly only when there is a real reason to do so, like the Python interpreter being unable to free memory after allocating a big temporary list of integers.
Seriously consider rewriting critical parts in C++. Start thinking about this unpleasant idea now so you are ready to do it when your data becomes bigger. Seriously, it usually ends this way. You can also give Cython a try; it could speed up the iteration itself.
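For bytes-like data (not str), one way to follow the "avoid copying" advice is the built-in memoryview. This is only a small sketch; the buffer contents and variable names are invented for the example.

```python
# A mutable buffer standing in for a large chunk of raw data.
data = bytearray(b"0123456789" * 1000)

view = memoryview(data)
window = view[10:20]        # slicing a memoryview copies no bytes
copy = bytes(data[10:20])   # slicing the bytearray itself makes a copy

# Changing the underlying buffer is visible through the view,
# which shows that the view shares memory with `data`.
data[10] = ord("X")
```

After the assignment, window starts with "X" while copy still holds the original bytes. For str objects there is no equivalent view type, so passing start/end indexes around is the usual substitute.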
I have a scientific application that reads a potentially huge data file from disk and transforms it into various Python data structures such as a map of maps, list of lists etc. NumPy is called in for numerical analysis. The problem is, the memory usage can grow rapidly. As swap space is called in, the system slows down significantly. The general strategy I have seen:
lazy initialization: this doesn't seem to help in the sense that many operations require in memory data anyway.
shelving: this Python standard library module seems to support writing data objects into a data file (backed by some db). My understanding is that it dumps data to a file, but if you need it, you still have to load all of it into memory, so it doesn't exactly help. Please correct me if this is a misunderstanding.
The third option is to leverage a database, and offload as much data processing to it
As an example: a scientific experiment runs for several days and has generated a huge (terabytes of data) sequence of:
co-ordinate(x,y) observed event E at time t.
And we need to compute a histogram over t for each (x,y) and output a 3-dimensional array.
Any other suggestions? I guess my ideal case would be the in-memory data structure can be phased to disk based on a soft memory limit and this process should be as transparent as possible. Can any of these caching frameworks help?
Edit:
I appreciate all the suggested points and directions. Among those, I found user488551's comments to be most relevant. As much as I like Map/Reduce, for many scientific apps the setup and effort for parallelization of code is an even bigger problem to tackle than my original question, IMHO. It is difficult to pick an answer as my question itself is so open ... but Bill's answer is closer to what we can do in the real world, hence the choice. Thank you all.
Have you considered divide and conquer? Maybe your problem lends itself to that. One framework you could use for that is Map/Reduce.
Does your problem have multiple phases such that Phase I requires some data as input and generates an output which can be fed to phase II? In that case you can have 1 process do phase I and generate data for phase II. Maybe this will reduce the amount of data you simultaneously need in memory?
Can you divide your problem into many small problems and recombine the solutions? In this case you can spawn multiple processes that each handle a small sub-problem and have one or more processes to combine these results in the end?
If Map-Reduce works for you look at the Hadoop framework.
Well, if you need the whole dataset in RAM, there's not much to do but get more RAM. Sounds like you aren't sure if you really need to, but keeping all the data resident requires the smallest amount of thinking :)
If your data comes in a stream over a long period of time, and all you are doing is creating a histogram, you don't need to keep it all resident. Just create your histogram as you go along, write the raw data out to a file if you want to have it available later, and let Python garbage collect the data as soon as you have bumped your histogram counters. All you have to keep resident is the histogram itself, which should be relatively small.
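Here is a minimal sketch of that idea for the (x, y, t) events from the question, using only the standard library. The function name build_histograms, the bin_width parameter, and the sample events are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_histograms(events, bin_width):
    """Consume (x, y, t) events one at a time and keep only the counters.

    Returns a mapping (x, y) -> Counter of time-bin index -> event count.
    """
    histograms = defaultdict(Counter)
    for x, y, t in events:
        histograms[(x, y)][int(t // bin_width)] += 1
        # Nothing else references the event, so it can be freed right away.
    return histograms


# An iterator stands in for the real event stream (file, socket, ...).
events = iter([(0, 0, 0.5), (0, 0, 1.2), (0, 0, 1.9), (1, 2, 0.1)])
hist = build_histograms(events, bin_width=1.0)
```

Memory use grows with the number of distinct (x, y, bin) combinations, not with the number of events, so the raw stream can be arbitrarily long.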
Background
I am working on a fairly computationally intensive program for a computational linguistics project, but the problem I have is quite general and hence I expect that a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
Process the data on each line.
From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is step 3 (and, by extension, step 4).
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have its basis weights updated.
If you think of them in terms of python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index).
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
Attempted Solutions
There are three extrema when it comes to how to do this:
I could build all the vectors in memory. Then write them to disk.
I could build all the vectors directly on the disk, using shelve or pickle or some such library.
I could build the vectors in memory one at a time, writing each to disk and passing through the corpus once per vector.
All these options are fairly intractable. 1 just uses up all the system memory, and it panics and slows to a crawl. 2 is way too slow as IO operations aren't fast. 3 is possibly even slower than 2 for the same reasons.
Goals
A good solution would involve:
Building as much as possible in memory.
Once memory is full, dump everything to disk.
If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
Go back to 1 until all vectors are built.
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it?
Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?
Take a look at PyTables. One of its advantages is that you can work with very large amounts of data, stored on disk, as if it were in memory.
Edit: Because the I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O per second and virtually no seek times. The size of your project is perfect for today's affordable SSD drives.
A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation.
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.
Two ideas:
Use numpy arrays to represent vectors. They are much more memory-efficient, at the cost that they will force elements of the vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable). Then another pass over all the data with second 1M vectors.
It seems you're on the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly, RAM) is available to you for this task. If there are 100k vectors, each of them with 1M ints, this gives ~370GB. If multiple passes method is viable and you've got a machine with 16GB RAM, then it is about ~25 passes -- should be easy to parallelize if you've got a cluster.
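The estimate above can be checked with a few lines of arithmetic. This sketch assumes 4-byte integers and that all 16 GiB are available for vectors, which is optimistic; the function name passes_needed is invented for the example.

```python
import math

GIB = 1024 ** 3


def passes_needed(n_vectors, dims, bytes_per_item, ram_bytes):
    """Return (total bytes for all vectors, passes if each pass fills RAM)."""
    total = n_vectors * dims * bytes_per_item
    return total, math.ceil(total / ram_bytes)


# 100k vectors of 1M 4-byte ints on a machine with 16 GiB of RAM.
total_bytes, passes = passes_needed(100_000, 1_000_000, 4, 16 * GIB)
total_gib = total_bytes / GIB
```

This gives roughly 372.5 GiB and 24 passes, in line with the ~370GB and ~25 passes quoted above. Leaving headroom for the interpreter and the OS would push the pass count somewhat higher.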
Think about using an existing in-memory DB solution like Redis. The problem of switching to disk once RAM is gone, and the tricks to tweak this process, should already be taken care of. A Python client is available as well.
Moreover this solution could scale vertically without much effort.
You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of your code will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much greater than available RAM.
Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.
I'd suggest to do it this way:
1) Construct the easy pipeline you mentioned
2) Construct your vectors in memory and "flush" them into a DB (Redis and MongoDB are good candidates)
3) Determine how much memory this procedure consumes and parallelize accordingly (or even better, use a map/reduce approach, or a distributed task queue like Celery)
Plus all the tips mentioned before (NumPy etc.)
Hard to say exactly because there are a few details missing, e.g. is this a dedicated box? Does the process run on several machines? Does the available memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache and see what happens. With plenty of cache performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation is calculated that can be used again, you store it in a dictionary (or perhaps disk, memcached, etc), and then look there first before calculating again. The Django docs have a good introduction.
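In plain Python, the "look in the cache first" pattern is available from the standard library as functools.lru_cache. A small sketch, where the function expensive and the call counter are stand-ins invented for the example:

```python
from functools import lru_cache

calls = 0  # counts how often the body really runs


@lru_cache(maxsize=1024)
def expensive(n):
    """Stand-in for a costly computation; results are cached per argument."""
    global calls
    calls += 1
    return sum(i * i for i in range(n))


first = expensive(10_000)   # computed
second = expensive(10_000)  # served from the cache, body not re-run
```

The maxsize bound keeps the cache from growing without limit, which matters when memory is already the scarce resource. Arguments must be hashable for this to work.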
From another comment I infer that your corpus fits into the memory, and you have some cores to throw at the problem, so I would try this:
Find a method to have your corpus in memory. This might be a sort of ram disk with file system, or a database. No idea, which one is best for you.
Have a smallish shell script monitor RAM usage, and spawn another process of the following every second, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwidth to disk):
iterate through the corpus and build and write some vectors
in the end you can collect and combine all vectors, if needed (this would be the reduce part)
Split the corpus evenly in size between parallel jobs (one per core) and process in parallel, ignoring any incomplete line (or, if you cannot tell whether a line is incomplete, ignore the first and last line that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs - That's the reduce step.
You stand to lose information from 2*N lines where N is the number of parallel processes, but you gain by not adding complicated logic to try and capture these lines for processing.
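The map and reduce steps described above could look roughly like this. The line format "<entity> <index> <weight>" and the function names map_chunk and reduce_partials are assumptions for illustration; the real corpus processing will differ.

```python
from collections import Counter


def map_chunk(lines):
    """Map step: build partial sparse vectors from one chunk of the corpus."""
    partial = {}
    for line in lines:
        entity, index, weight = line.split()
        vector = partial.setdefault(entity, Counter())
        vector[int(index)] += float(weight)
    return partial


def reduce_partials(partials):
    """Reduce step: add up the partial vectors entity by entity."""
    merged = {}
    for partial in partials:
        for entity, vector in partial.items():
            # Counter.update adds values instead of replacing them.
            merged.setdefault(entity, Counter()).update(vector)
    return merged


corpus = ["cat 3 1.0", "dog 7 2.0", "cat 3 0.5", "cat 9 1.0"]
chunks = [corpus[:2], corpus[2:]]  # one chunk per job
vectors = reduce_partials(map(map_chunk, chunks))
```

Here the built-in map runs the chunks one after another; because map_chunk only depends on its own chunk, it could be handed to a process pool instead. Sparse Counter vectors also avoid allocating millions of zero entries per vector.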
Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example, say I want to have a while-loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The following pseudocode will do the trick:
    import psutil

    while someCondition:
        # virtual_memory() supersedes the older phymem_usage() call
        if psutil.virtual_memory().percent > 80.0:
            dumpQueue(myQueue, somefile)
        else:
            addSomeStufftoQueue(myQueue, stuff)
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
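A runnable variant of the same idea, kept independent of psutil by passing the memory check in as a function. Everything here (build_with_spill, read_spill, the fake readings) is a hypothetical sketch; in real use the probe could be lambda: psutil.virtual_memory().percent.

```python
import pickle
import tempfile


def build_with_spill(items, memory_percent, limit=80.0):
    """Collect items in memory and spill the batch to disk when RAM is tight.

    `memory_percent` is any zero-argument callable returning RAM usage in %.
    Returns the spill file (rewound) and the number of batches written.
    """
    spill = tempfile.TemporaryFile()
    batch, flushes = [], 0
    for item in items:
        batch.append(item)
        if memory_percent() > limit:
            pickle.dump(batch, spill)  # free memory by writing the batch out
            batch, flushes = [], flushes + 1
    if batch:  # write whatever is left at the end
        pickle.dump(batch, spill)
        flushes += 1
    spill.seek(0)
    return spill, flushes


def read_spill(spill, flushes):
    """Yield the spilled items back, one batch at a time."""
    for _ in range(flushes):
        yield from pickle.load(spill)


# Fake probe for the demo: reports "over the limit" on every third call.
readings = iter([10.0, 50.0, 90.0] * 3)
spill, flushes = build_with_spill(range(7), lambda: next(readings))
recovered = list(read_spill(spill, flushes))
```

Injecting the probe makes the spill logic testable without actually filling up RAM. Polling on every item is wasteful for a real probe, so checking every few thousand items is probably a better trade-off.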
PS. Props to Sean for suggesting this module.