Python Couchbase pipeline() kills interpreter with large batches

So I wanted to use the Python driver to batch-insert documents into Couchbase. However, if the batch exceeds a few thousand documents, the Python kernel gets restarted (I'm using a notebook).
To reproduce this please try:
from couchbase.bucket import Bucket

cb = Bucket('couchbase://localhost/default')

keys_per_doc = 50
doc_count = 10000

# Build doc_count documents, each with keys_per_doc numeric fields plus an 'id' field
docs = [
    dict(
        [('very_long_feature_{}'.format(i), float(i) if i % 2 == 0 else i)
         for i in xrange(keys_per_doc)]
        + [('id', id_)]
    )
    for id_ in xrange(doc_count)
]

def sframe_to_cb(sf, key_column, key_prefix):
    # Operations issued inside the pipeline context are batched;
    # their results are gathered when the context exits
    pipe = cb.pipeline()
    with pipe:
        for r in sf:
            cb.upsert(key_prefix + str(r[key_column]), r)
    return 0

p = sframe_to_cb(docs, 'id', 'test_')
The funny thing is that all the docs do get inserted, and I suspect the interpreter dies while gathering the results in the pipeline's __exit__ method.
I don't get any error message, and the notebook console just says that it has restarted the kernel.
I'm curious what is causing this behavior and if there is a way to fix it.
Obviously I can do mini-batches (up to 3000 docs in my case), but this makes things much slower if the batches are processed sequentially.
I cannot use multiprocessing because I run the inserts inside of celery.
I cannot use multiple celery tasks because the serialisation of batches is too expensive and could kill our redis instance.
So the questions:
What is causing the crash with large batches, and is there a way to fix it?
Assuming that nothing can go wrong with the upserts, can I make the pipeline discard the results?
Is there a different way to achieve high throughput from a single process?
Additional info as requested in the comments:
VMware Fusion on a Mac running an Ubuntu 14.04 LTS VM
The guest Ubuntu has 4 GB RAM, 12 GB swap on SSD, and 2 cores (4 threads)
The impression that doing mini-batches is slower comes from watching the bucket statistics (a large batch peaks at 10K TPS, smaller ones reach ca. 2K TPS)
There is a large speed-up if I use multiprocessing and the batches are distributed across multiple CPUs (20-30K TPS); however, I cannot do this in production because of Celery limitations (I cannot use a ProcessPoolExecutor inside a Celery task)
I cannot really tell when exactly the crash happens (I'm not sure if this is relevant)
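For what it's worth, a minimal sketch of the mini-batch workaround mentioned in the question, splitting the upserts across several smaller pipelines so that no single results list grows unbounded; chunked_upsert, _flush_chunk and the chunk size of 1000 are made-up names and values (not from the original code), and the chunk size would need tuning against whatever threshold triggers the crash.

def _flush_chunk(bucket, chunk, key_column, key_prefix):
    # One small pipeline per chunk; its results are still gathered on
    # exit, but they stay bounded by the chunk size.
    with bucket.pipeline():
        for r in chunk:
            bucket.upsert(key_prefix + str(r[key_column]), r)

def chunked_upsert(bucket, sf, key_column, key_prefix, chunk_size=1000):
    chunk = []
    for r in sf:
        chunk.append(r)
        if len(chunk) == chunk_size:
            _flush_chunk(bucket, chunk, key_column, key_prefix)
            chunk = []
    if chunk:
        _flush_chunk(bucket, chunk, key_column, key_prefix)

chunked_upsert(cb, docs, 'id', 'test_')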

Related

How to prevent python script from freezing in ubuntu 16.04

I work for a digital marketing agency with multiple clients. In one of the projects I have a very resource-intensive Python script (it fetches data for Facebook ads) that has to be run for all of those clients (say 500+ of them) on an Ubuntu 16.04 server.
Originally the script took around 2 minutes to complete for one client, with 300 MB RES and 1000 MB VM (as per htop). I therefore optimized it with ThreadPoolExecutor (max_workers=10) so that the script can run for 4 clients (almost) concurrently.
Then I found out that the script sometimes froze during a run (basically it went into a "comatose state"). I debugged and profiled it and found that it is not the script causing the issue, but the system.
Then I batched the runs: if there are 20 input clients, I ran 5 instances of the script (4*5=20). Sometimes this went fine, but sometimes the last instance froze.
Then I found out that RAM (2 GB) was being overused, so I increased the swap from 0 to 1 GB. That did the trick, but if a few clients are memory-heavy, the same thing happens.
I have attached a screenshot of the latest run, where after running the 8 instances the last 2 froze. They can be left that way for days, for that matter.
I am thinking of increasing the server RAM from 2 GB to 4 GB, but I am not sure that is a permanent solution. Has anyone faced a similar issue?
You need to fix the RAM consumption of your script.
If your script allocates more memory than your system can provide, it gets memory errors; and if that happens inside thread pools or similar constructs, the threads may never return under some circumstances.
You can fix this by using async functions with timeouts and implementing automatic restart handlers, in case a process does not yield the expected result.
The best way to do that depends heavily on the script and will probably require altering already-written code.
The issue is definitely with your script and not with the OS.
The fastest workaround would be to increase the system memory or to reduce the number of threads.
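To make the timeout-plus-restart idea more concrete, here is a rough sketch (not the poster's code): it runs the fetch for one client in its own process, kills the process if it exceeds a timeout, and retries a few times. run_with_restart, the 300-second timeout and the process-per-client layout are all assumptions for illustration.

import multiprocessing

def run_with_restart(fetch, client, timeout=300, max_attempts=3):
    # Run the fetch for one client in a separate process; if it has not
    # finished within `timeout` seconds, terminate it (which also frees
    # its memory) and start a fresh attempt.
    for attempt in range(max_attempts):
        proc = multiprocessing.Process(target=fetch, args=(client,))
        proc.start()
        proc.join(timeout)
        if not proc.is_alive():
            return proc.exitcode
        proc.terminate()
        proc.join()
    raise RuntimeError('{} still hanging after {} attempts'.format(client, max_attempts))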
If just adding 1GB of swap area "almost" did the trick then definitely increasing the physical memory is a good way to go. Btw remember that swapping means you're using disk storage, whose speed is measured in millisecs, while RAM speed is measured in nanosecs - so avoiding swap guarantees a performance boost.
And then, reboot your system every now and then. Although Linux is far better than Windows in this respect, memory leaks do occur in Linux too, and a reboot every few months will surely help.
As Gornoka stated, you need to reduce the memory consumption of the script. As an added detail, this can also be done by removing declared variables within the script once they have been used, with the keyword
del
It also helps to make sure that, if the script processes massive files, it does so line by line and saves each line as soon as it is finished with it.
I have had this happen, and it is usually an indicator of working with too much data at once in RAM; it is always better to work with the data in parts whenever possible, and if that is not possible, to get more RAM.
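A rough illustration of both points, streaming a file line by line and dropping references as soon as they are no longer needed; transform() is a hypothetical stand-in for the real per-line work, not anything from the poster's script.

def summarize(in_path, out_path):
    # Stream the input line by line and write each result out as soon as
    # it is ready, instead of holding the whole file in RAM.
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            result = transform(line)  # hypothetical per-line work
            dst.write(result + '\n')
            del result  # drop the reference right away (the `del` advice above)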

Python Multiprocessing with pool running only 400 jobs out of 2000 and getting stopped

I want to run a code module on 2000 separate data items. For that I have used the following Python code:
import multiprocessing

num_workers = multiprocessing.cpu_count()
pool = multiprocessing.Pool(processes=num_workers)
print
results = [pool.apply_async(run_Nested_Cage, args=(bodyid,)) for bodyid in body_IDs]
output = [p.get() for p in results]
It runs fine for 50 entries in body_IDs, but when I give it 2000 bodies it starts out fine and then, after generating results for 424 bodies, the program stops without any error.
I am running it on an AWS EC2 Ubuntu Linux server with 8 cores, 32 GB RAM and 100 GB storage.
Can anyone help me identify the solution?
Sounds like either a memory issue, an uncaught exception, or code complexity issue.
Add some "debug" logs you can disable later
Instead of one enormous pool, try rotating among a pool of pools
Occasionally terminate a pool instance to free up memory (see the sketch after this list)
Run some profilers
Profilers:
memory-profiler
cpu-profiler
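As a sketch of the "rotate among pools" and debug-logging suggestions above: the chunk size of 100 is an arbitrary starting point, and run_Nested_Cage and body_IDs are the names from the question, so this is only an outline to adapt.

import logging
import multiprocessing

logging.basicConfig(level=logging.INFO)

def run_in_chunks(body_IDs, chunk_size=100):
    # Work through the bodies in a series of small pools; each pool is
    # closed and joined before the next starts, so its workers release
    # their memory instead of accumulating it across all 2000 jobs.
    output = []
    for start in range(0, len(body_IDs), chunk_size):
        chunk = body_IDs[start:start + chunk_size]
        logging.info('starting bodies %d-%d', start, start + len(chunk))
        pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
        results = [pool.apply_async(run_Nested_Cage, args=(bodyid,))
                   for bodyid in chunk]
        output.extend(p.get() for p in results)
        pool.close()
        pool.join()
    return output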

Chaining or chording very large job groups/tasks with Celery & Redis

I'm working on a project to parallelize some heavy simulation jobs. Each run takes about two minutes, takes 100% of the available CPU power, and generates over 100 MB of data. In order to execute the next step of the simulation, those results need to be combined into one huge result.
Note that this will be run on fairly powerful systems (currently testing on a machine with 16 GB RAM and 12 cores, but it will probably be upgraded to bigger hardware).
I can use a celery job group to easily dispatch about 10 of these jobs, and then chain that into the concatenation step and the next simulation (essentially a Celery chord). However, I need to be able to run at least 20 on this machine, and eventually 40 on a beefier machine. It seems that Redis doesn't allow objects on the result backend large enough for me to do anything more than about 13, and I can't find any way to change this behavior.
I am currently doing the following, and it works fine:
test_a_group = celery.group(test_a.s(x) for x in ['foo', 'bar'])
test_a_result = test_a_group.apply_async(add_to_parent=False)
return test_b(test_a_result.get())
What I would rather do:
return chord(test_a_group, test_b.s())
The second one works for small datasets, but not large ones. It gives me a non-verbose 'Celery ChordError 104: connection refused' with large data.
Test B returns very small data, essentially a pass/fail, and I am only passing the group result into B, so it should work, except that I think the entire group result is being appended to the result of B (as its parent), making it too big. I can't find out how to prevent this from happening.
The first one works great, and I would be okay with it, except that it complains, saying:
[2015-01-04 11:46:58,841: WARNING/Worker-6] /home/cmaclachlan/uriel-venv/local/lib/python2.7/site-packages/celery/result.py:45:
RuntimeWarning: Never call result.get() within a task!
See http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks
In Celery 3.2 this will result in an exception being
raised instead of just being a warning.
warnings.warn(RuntimeWarning(E_WOULDBLOCK))
What the link essentially suggests is what I want to do, but can't.
I think I read somewhere that Redis has a limit of 500 mb on size of data pushed to it.
Any advice on this hairiest of problems?
Celery isn't really designed to address this problem directly. Generally speaking, you want to keep the inputs/outputs of tasks small.
Every input or output has to be serialized (pickle by default) and transmitted through the broker, such as RabbitMQ or Redis. Since the broker needs to queue the messages when there are no clients available to handle them, you end up potentially paying the hit of writing/reading the data to disk anyway (at least for RabbitMQ).
Typically, people store large data outside of celery and just access it within the tasks by URI, ID, or something else unique. Common solutions are to use a shared network file system (as already mentioned), a database, memcached, or cloud storage like S3.
You definitely should not call .get() within a task because it can lead to deadlock.
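A rough sketch of that pattern, assuming a shared filesystem mounted at /shared/results; run_simulation, combine, simulate, concatenate and param_sets are all placeholders for illustration, not the poster's actual tasks.

import os
from celery import chord, shared_task

@shared_task
def run_simulation(params, out_dir):
    # Write the ~100 MB result to shared storage and return only its path,
    # so the Redis result backend never has to hold the bulk data.
    path = os.path.join(out_dir, '{0}.dat'.format(params))
    with open(path, 'wb') as f:
        f.write(simulate(params))      # simulate() is a stand-in
    return path

@shared_task
def combine(result_paths):
    # The chord callback receives only a short list of paths; it reads the
    # partial results from shared storage and concatenates them there.
    return concatenate(result_paths)   # concatenate() is a stand-in

# Only small path strings travel through the broker and result backend.
result = chord(run_simulation.s(p, '/shared/results') for p in param_sets)(combine.s())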

Python Multiprocess using Pool fails on AWS Ubuntu

I have a simple string matching script that tests just fine for multiprocessing with up to 8 Pool workers on my local mac with 4 cores. However, the same script on an AWS c1.xlarge with 8 cores generally kills all but 2 workers, the CPU only works at 25%, and after a few rounds stops with MemoryError.
I'm not too familiar with server configuration, so I'm wondering if there are any settings to tweak?
The pool implementation looks as follows, but doesn't seem to be the issue as it works locally. There would be several thousand targets per worker, and it doesn't run past the first five or so. Happy to share more of the code if necessary.
import itertools
from multiprocessing import Pool

pool = Pool(processes = numProcesses)
totalTargets = len(getTargets('all'))
targetsPerBatch = totalTargets / numProcesses
pool.map_async(runMatch, itertools.izip(itertools.repeat(targetsPerBatch), xrange(0, totalTargets, targetsPerBatch))).get(99999999)
pool.close()
pool.join()
The MemoryError means you're running out of system-wide virtual memory. How much virtual memory you have is an abstract thing, based on the actual physical RAM plus swapfile size plus stuff that's paged into memory from other files and stuff that isn't paged anywhere because the OS is being clever and so on.
According to your comments, each process averages 0.75GB of real memory, and 4GB of virtual memory. So, your total VM usage is 32GB.
One common reason for this is that each process might peak at 4GB, but spend almost all of its time using a lot less than that. Python rarely releases memory to the OS; it'll just get paged out.
Anyway, 6GB of real memory is no problem on an 8GB Mac or a 7GB c1.xlarge instance.
And 32GB of VM is no problem on a Mac. A typical OS X system has virtually unlimited VM size—if you actually try to use all of it, it'll start creating more swap space automatically, paging like mad, and slowing your system to a crawl and/or running out of disk space, but that isn't going to affect you in this case.
But 32GB of VM is likely to be a problem on linux. A typical linux system has fixed-size swap, and doesn't let you push the VM beyond what it can handle. (It has a different trick that avoids creating probably-unnecessary pages in the first place… but once you've created the pages, you have to have room for them.) I'm not sure what an xlarge comes configured for, but the swapon tool will tell you how much swap you've got (and how much you're using).
Anyway, the easy solution is to create and enable an extra 32GB swapfile on your xlarge.
However, a better solution would be to reduce your VM use. Often each subprocess is doing a whole lot of setup work that creates intermediate data that's never needed again; you can use multiprocessing to push that setup into different processes that quit as soon as they're done, freeing up the VM. Or maybe you can find a way to do the processing more lazily, to avoid needing all that intermediate data in the first place.
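One concrete way to act on the "processes that quit as soon as they're done" idea is multiprocessing's maxtasksperchild option, which recycles each worker after a fixed number of tasks; the snippet below just restates the question's pool setup with that option added, so the question's names (numProcesses, runMatch, targetsPerBatch, totalTargets) are assumed.

import itertools
from multiprocessing import Pool

# maxtasksperchild=1 makes each worker exit after a single task, so
# whatever memory it peaked at goes back to the OS instead of sitting
# in swap for the life of the pool.
pool = Pool(processes=numProcesses, maxtasksperchild=1)
pool.map_async(runMatch,
               itertools.izip(itertools.repeat(targetsPerBatch),
                              xrange(0, totalTargets, targetsPerBatch))).get(99999999)
pool.close()
pool.join()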

Handling various resource requirements in an IPython cluster

I am using a PBS-based cluster and running IPython parallel over a set of nodes, each with either 24 or 32 cores and memory ranging from 24G to 72G; this heterogeneity is due to our cluster having history to it. In addition, I have jobs that I am sending to the IPython cluster that have varying resource requirements (cores and memory). I am looking for a way to submit jobs to the ipython cluster that know about their resource requirements and those of the available engines. I imagine there is a way to deal with this situation gracefully using IPython functionality, but I have not found it. Any suggestions as to how to proceed?
In addition to graph dependencies, which you indicate that you already get, IPython tasks can have functional dependencies. These can be arbitrary functions, like tasks themselves. A functional dependency runs before the real task, and if it returns False or raises a special parallel.UnmetDependency exception, the task will not be run on that engine, and will be retried somewhere else.
So to use this, you need a function that checks whatever metric you need. For instance, let's say we only want to run a task on your nodes with a minimum amount of memory. Here is a function that checks the total memory on the system (in bytes):
def minimum_mem(limit):
    import sys
    if sys.platform == 'darwin':  # or BSD in general?
        from subprocess import check_output
        mem = int(check_output(['sysctl', '-n', 'hw.memsize']))
    else:  # linux
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal"):
                    mem = 1024 * int(line.split()[1])
                    break
    return mem >= limit

kB = 1024.
MB = 1024 * kB
GB = 1024 * MB
so minimum_mem(4 * GB) will return True iff you have at least 4GB of memory on your system. If you want to check available memory instead of total memory, you can use the MemFree and Inactive values in /proc/meminfo to determine what is not already in use.
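For example, an available-memory variant might look roughly like the sketch below (Linux-only, and only an approximation; newer kernels also expose a MemAvailable field that would make this simpler).

def minimum_free_mem(limit):
    # Approximate "not already in use" memory as MemFree + Inactive,
    # both reported in kB by /proc/meminfo.
    free = inactive = 0
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                free = 1024 * int(line.split()[1])
            elif line.startswith("Inactive:"):
                inactive = 1024 * int(line.split()[1])
    return free + inactive >= limit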
Now you can submit tasks only to engines with sufficient RAM by applying the @parallel.depend decorator:
@parallel.depend(minimum_mem, 8 * GB)
def big_mem_task(n):
    import os, socket
    return "big", socket.gethostname(), os.getpid(), n

amr = view.map(big_mem_task, range(10))
Similarly, you can apply restrictions based on the number of CPUs (multiprocessing.cpu_count is a useful function there).
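For instance, a CPU-count dependency could follow the same pattern as minimum_mem; the 16-core threshold here is arbitrary and just for illustration.

def minimum_cpus(n):
    # Runs on the engine before the task; engines with too few cores
    # reject the task and it is retried elsewhere.
    import multiprocessing
    return multiprocessing.cpu_count() >= n

@parallel.depend(minimum_cpus, 16)
def many_core_task(n):
    import os, socket
    return "cores", socket.gethostname(), os.getpid(), n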
Here is a notebook that uses these to restrict assignment of some dumb tasks.
Typically, the model is to run one IPython engine per core (not per node), but if you have specific multicore tasks, then you may want to use a smaller number (e.g. N/2 or N/4). If your tasks are really big, then you may actually want to restrict it to one engine per node. If you are running more engines per node, then you will want to be a bit careful about running high-resource tasks together. As I have written them, these checks do not take into account other tasks on the same node, so if a node has 16 GB of RAM and you have two tasks that each need 10 GB, you will need to be more careful about how you track available resources.
