Neptune slow to insert data - python

I'm trying to load data with a Python script that creates 26,000 vertices and their related edges.
Using gremlin-python, the traversal looks like this:
g.V().has('dog', 'name', 'pluto').fold().\
    coalesce(__.unfold(), __.addV('dog').property('name', 'pluto')).store('dog').\
    V().has('person', 'name', 'sam').fold().\
    coalesce(__.unfold(), __.addV('person').property('name', 'sam')).store('person').\
    select('person').unfold().\
    coalesce(__.outE('has_dog').where(__.inV().where(eq('dog')).by(T.id).by(__.unfold().id())),
             __.addE('has_dog').to(__.select('person').unfold())).toList()
In the same transaction I append up to 50 vertices and edges.
If I run this script against my PC (i7 with 16 GB RAM) it takes 4-5 minutes, but running it against a Neptune instance with 8 vCPUs and 32 GB RAM, the script is only at about 10% after 20 minutes.
How is it possible that Neptune is so much slower?
Thanks.

Each Neptune instance that you connect to has a pool of worker threads. That pool is two times the number of vCPUs on the instance. If you send the queries in a single-threaded fashion, you are only taking advantage of one worker thread. You can substantially increase throughput by dividing the work across multiple tasks in your application. I often use the multiprocessing library, but even basic Python threads will likely help, as these are I/O-bound tasks and the threads will yield while waiting on the server. I have added millions of vertices and edges using Python in this way. Without doing something like this you are not taking full advantage of the available resources on the instance. If you already have the work divided into batches of 50, you can spread those batches across multiple threads. Matching the number of client threads/tasks to two times the number of vCPUs on the Neptune instance is a good place to start.
Ideally the threads should touch different parts of the graph, so that concurrent threads are not trying to modify the same vertices and edges.
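As a rough illustration of that pattern (not code from the original workload), here is a minimal sketch using gremlin-python and a thread pool; the endpoint and the batches list are placeholders, and the upsert traversal is deliberately simplified relative to the one in the question:

from concurrent.futures import ThreadPoolExecutor

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

NEPTUNE_ENDPOINT = 'wss://your-neptune-endpoint:8182/gremlin'  # placeholder

def upsert_person_dog(g, person, dog):
    # Simplified upsert: resolve (or create) both vertices, then the edge between them.
    dog_v = (g.V().has('dog', 'name', dog).fold()
             .coalesce(__.unfold(), __.addV('dog').property('name', dog)).next())
    person_v = (g.V().has('person', 'name', person).fold()
                .coalesce(__.unfold(), __.addV('person').property('name', person)).next())
    (g.V(person_v.id)
     .coalesce(__.outE('has_dog').where(__.inV().hasId(dog_v.id)),
               __.addE('has_dog').to(__.V(dog_v.id)))
     .iterate())

def load_batch(batch):
    # Each thread owns its own connection so its requests can be in flight concurrently.
    conn = DriverRemoteConnection(NEPTUNE_ENDPOINT, 'g')
    g = traversal().withRemote(conn)
    try:
        for person, dog in batch:
            upsert_person_dog(g, person, dog)
    finally:
        conn.close()

# `batches` stands in for the data already divided into groups of ~50 pairs
batches = [[('sam', 'pluto')]]

# 16 client threads ~= 2 x the 8 vCPUs on the Neptune instance
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(load_batch, batches))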

Related

Job, Worker, and Task in dask_jobqueue

I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:
In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.
The problem is I still don't get it. I use the word task to refer to a single function one submits using a client, i.e. with a client.submit(task, *params) call.
My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Any worker works on one task at a given time potentially using multiple threads and processes.
However my understanding does not leave any room for the term job and is thus certainly wrong. Moreover most configurations of the cluster (cores, memory, processes) are done on a per job basis according to the docs.
So my question is what is a job? Can anyone explain in simpler terms its relation to a task and a worker? And how the cores, memory, processes, and n_workers configurations interact? (I have read the docs, just don't understand and could use another explanation)
Your understanding of tasks and workers is correct. A job is a concept specific to SLURM (and other HPC clusters where users submit jobs). A job consists of instructions for what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch.
One can submit a job with instructions to launch multiple dask-workers (there might be advantages to this due to latency, permissions, resource availability, etc., but this would need to be determined from the particular cluster configuration).
From Dask's perspective what matters is the number of workers, but for dask-jobqueue the number of jobs also matters. For example, if the number of workers per job is 2, then to get 10 workers in total dask-jobqueue will submit 5 jobs to the SLURM scheduler.
This example, adapted from the docs, will result in 10 dask-workers, each with 24 cores:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=24,        # cores per job
    processes=1,     # dask workers per job
    memory="500 GB"  # memory per job
)
cluster.scale(jobs=10)  # ask for 10 jobs
If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so the following will give 100 workers with 2 cores each and 50 GB per worker (note that the memory in the config is the total per job):
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=20,        # cores per job
    processes=10,    # dask workers per job
    memory="500 GB"  # memory per job
)
cluster.scale(jobs=10)  # ask for 10 jobs
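To make the arithmetic explicit, here is a small sketch in plain Python (not dask_jobqueue API) of how the per-worker resources fall out of the per-job settings above:

# per-job settings from the SLURMCluster example above
jobs = 10
cores = 20        # cores per job
processes = 10    # dask workers per job
memory_gb = 500   # memory per job

total_workers = jobs * processes              # 100 workers in total
threads_per_worker = cores // processes       # 2 cores (threads) per worker
memory_per_worker_gb = memory_gb / processes  # 50 GB per worker

print(total_workers, threads_per_worker, memory_per_worker_gb)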

One Airflow job is much slower than another one with the same code base

I have two Airflow jobs with exactly the same code base. The only difference is that they write data to different Mongo collections.
One of the jobs is ~30% slower than the other. How is that possible? Could Airflow be allocating more resources to one job than to the other?
If both use the same queues, have the same priority, and always run in the same environment, there should be no visible difference, unless one of the jobs runs at a different time when the load on the system is higher. Is this duration difference a consistent trend?
Have you tested the performance of those jobs outside Airflow? The size and complexity of the collections may also matter.

Pros and cons of using shared Value/Array vs Queue/Pipe in Python multiprocessing

I've been slowly learning to use the multiprocessing library in Python these last few days, and I've come to a point where I'm asking myself this question, and I can't find an answer to this.
I understand that the answer might vary depending on the application, so I'll explain what my application is.
I've created a scheduler in my main process that controls when multiple processes execute (these processes are spawned earlier, loop continuously, and execute code when my scheduler raises a flag in a shared Value). Using counters in my scheduler, I can have multiple processes executing code at different frequencies (in the 100-400 Hz range), and they are all perfectly synchronized.
For example, one process executes a dynamic model of a quadcopter (ODE) at 400 Hz and updates the quadcopter's state. My other processes (command generation and trajectory generation) run at lower frequencies (200 Hz and 100 Hz), but require the updated state. I currently see two methods of doing this:
With Pipes: This requires separate Pipes for the dynamics/control and dynamics/trajectory connections. Furthermore, I need the control and trajectory processes to use the latest calculated quadcopter's state, so I need to flush the Pipes until the last value in them. This works, but doesn't look very clean.
With a shared Value/Array: I would only need one Array for the state; my dynamics process would write to it, while my other processes would read from it. I would probably have to implement locks (can I read a shared Value/Array from two processes at the same time without a lock?). This hasn't been tested yet, but would probably be cleaner.
I've read around that it is bad practice to use shared memory too much (why is that?). Yes, I'll be updating it at 400 Hz and reading it at 200 and 100 Hz, but it's not going to be such a large array (10-ish floats or doubles). However, I've also read that shared memory is faster than Pipes/Queues, and I would like to prioritize speed in my code, if it's not too much of an issue to use shared memory.
Mind you, I'll have to send generated commands to my dynamics process (another 5-ish floats), and generated desired states to my control process (another 10-ish floats), so that's either more shared Arrays, or more Pipes.
So I was wondering, for my application, what are the pros and cons of both methods. Thanks!
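For reference, here is a minimal sketch of the shared-Array option described above, with hypothetical process bodies and sizes; note that multiprocessing.Array carries its own recursive lock by default, which is what keeps the whole-slice reads and writes below consistent:

import time
from multiprocessing import Array, Process

STATE_SIZE = 12  # roughly 10-ish doubles for the quadcopter state (hypothetical size)

def dynamics(state):
    # 400 Hz producer: writes the whole state under the array's built-in lock
    while True:
        new_state = [0.0] * STATE_SIZE  # placeholder for one ODE integration step
        with state.get_lock():
            state[:] = new_state
        time.sleep(1 / 400)

def controller(state):
    # 200 Hz consumer: copies the latest state under the lock, then works on the copy
    while True:
        with state.get_lock():
            current = state[:]
        # ... compute commands from `current` here ...
        time.sleep(1 / 200)

if __name__ == '__main__':
    state = Array('d', STATE_SIZE)  # synchronized shared doubles
    Process(target=dynamics, args=(state,), daemon=True).start()
    Process(target=controller, args=(state,), daemon=True).start()
    time.sleep(2)  # let the loops run briefly in this sketch

The Pipe alternative would need one connection pair per consumer plus the message-draining loop the question describes.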

Why is groupBy bottlenecking my pipeline?

I have a pipeline written in Python with Apache Beam. It windows 800,000 time-stamped records into 2-second windows that overlap every 1 second. My elements may have different keys.
When it does a GroupBy, it takes 3 hours to complete. I am deployed on Cloud Dataflow using 10 workers. There isn't a significant increase in processing speed when I increase the number of workers. Why is this transform bottlenecking my pipeline?
To summarize the answer from jkff and others:
The pipeline appears to be bottlenecked by a single very large key. You can use regular logging and look in the worker logs (e.g. measure your DoFn's processing time in processElement() and log it if it exceeds a threshold), but unfortunately we do not yet provide higher-level tools for debugging "hot key" issues.
You can also turn on autoscaling so that the service can, at least, shut down unused workers so that you do not incur charges for them.
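A hedged sketch of that per-element timing idea for the Beam Python SDK (the threshold and the process_group function are placeholders for your own per-key logic):

import logging
import time

import apache_beam as beam

SLOW_THRESHOLD_S = 5.0  # arbitrary threshold for flagging a "hot" key

class TimedGroupProcessFn(beam.DoFn):
    # Wraps the real per-group work and logs groups that take too long to process.
    def process(self, element):
        key, values = element  # (key, values) coming out of the GroupByKey
        start = time.time()
        results = list(process_group(key, values))  # placeholder: your existing per-group logic
        elapsed = time.time() - start
        if elapsed > SLOW_THRESHOLD_S:
            logging.warning('Hot key %r took %.1f s to process', key, elapsed)
        for result in results:
            yield result

# usage: grouped | beam.ParDo(TimedGroupProcessFn())

Any key whose group takes far longer than the rest will then show up in the Dataflow worker logs.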

Multiprocessing a dictionary in python

I have two dictionaries of data, and I created a function that acts as a rules engine to analyze entries in each dictionary and do things based on specific metrics I set (if it helps, each entry in the dictionary is a node in a graph, and if rules match I create edges between them).
Here's the code I use (it's a for loop that passes parts of the dictionary to a rules function; I refactored my code based on a tutorial I read):
import multiprocessing

jobs = []

def loadGraph(dayCurrent, day2Previous):
    for dayCurrentCount in graph[dayCurrent]:
        dayCurrentValue = graph[dayCurrent][dayCurrentCount]
        for day1Count in graph[day2Previous]:
            day1Value = graph[day2Previous][day1Count]
            #rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous)
            p = multiprocessing.Process(target=rules, args=(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous))
            jobs.append(p)
            p.start()
        print ' in rules engine for day', dayCurrentCount, ' and we are about ', ((len(graph[dayCurrent])-dayCurrentCount)/float(len(graph[dayCurrent])))
The data I'm studying could be rather large (could, because it's randomly generated). I think for each day there are about 50,000 entries. Because most of the time is spent on this stage, I was wondering if I could use the 8 cores I have available to help process this faster.
Because each dictionary entry is being compared to a dictionary entry from the day before, I thought the processes could be split up by that, but my code above is slower than running it normally. I think this is because it's creating a new process for every entry it handles.
Is there a way to speed this up and use all my CPUs? My problem is, I don't want to pass the entire dictionary, because then one core will get stuck processing it; I would rather have the work split across the CPUs so that I make use of all the free CPUs.
I'm totally new to multiprocessing so I'm sure there's something easy I'm missing. Any advice/suggestions or reading material would be great!
What I've done in the past is to create a "worker class" that processes data entries. Then I'll spin up X number of threads that each run a copy of the worker class. Each item in the dataset gets pushed into a queue that the worker threads are watching. When there are no more items in the queue, the threads spin down.
Using this method, I was able to process 10,000+ data items using 5 threads in about 3 seconds. When the app was only single-threaded, this would take significantly longer.
Check out: http://docs.python.org/library/queue.html
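A minimal sketch of that worker/queue pattern in Python; rules, graph, dayCurrent, and day2Previous are the names from the question above, everything else is illustrative:

import queue
import threading

NUM_WORKERS = 8  # roughly one worker per available core

work_queue = queue.Queue()

def worker():
    # Each thread pulls (current, previous) entry pairs until it sees the sentinel.
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: no more work
            work_queue.task_done()
            break
        (dayCurrentCount, dayCurrentValue), (day1Count, day1Value) = item
        rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous)
        work_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

# enqueue every (current-day entry, previous-day entry) pair to compare
for current_item in graph[dayCurrent].items():
    for previous_item in graph[day2Previous].items():
        work_queue.put((current_item, previous_item))

for _ in threads:
    work_queue.put(None)  # one sentinel per worker
for t in threads:
    t.join()

If rules is CPU-bound rather than I/O-bound, the same shape works with multiprocessing.Process and a multiprocessing.JoinableQueue in place of the threads, which sidesteps the GIL.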
I would recommend looking into MapReduce implementations in Python. Here's one: http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=mapreduce+python. Also, take a look at a Python package called Celery: http://celeryproject.org/. With Celery you can distribute your computation not only among cores on a single machine, but also to a server farm (cluster). You do pay for that flexibility with more involved setup/maintenance.
