Why is groupBy bottlenecking my pipeline? - python

I have a pipeline written in Python Apache Beam. It windows 800,000 time-stamped elements into 2-second windows that slide every 1 second. My elements may have different keys.
When it does a groupBy, it takes 3 hours to complete. I am deployed on Cloud Dataflow using 10 workers, and there isn't a significant increase in processing speed when I increase the number of workers. Why is this transform bottlenecking my pipeline?
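For reference, a minimal sketch of the windowing described above in the Beam Python SDK (the events PCollection, its timestamps, and the step names are assumptions, not taken from the original pipeline):

import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows

# 2-second windows that slide every 1 second, followed by the grouping step.
windowed = (
    events  # PCollection of (key, value) pairs with event timestamps attached
    | "Window" >> beam.WindowInto(SlidingWindows(size=2, period=1))
    | "Group" >> beam.GroupByKey()
)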

To summarize the answer from jkff and others:
The pipeline appears to be bottlenecked by a single very large key. You can use regular logging and look in the worker logs (e.g. measure your DoFn's processing time in its process method and log it if it exceeds a threshold), but unfortunately we do not yet provide higher-level tools for debugging "hot key" issues.
You can also turn on autoscaling so that the service can at least shut down unused workers and you do not incur charges for them.
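A minimal sketch of the logging idea, assuming a Python DoFn that receives (key, values) pairs after the grouping step; the class name and threshold are illustrative:

import logging
import time

import apache_beam as beam

class TimedDoFn(beam.DoFn):
    SLOW_THRESHOLD_SEC = 5.0  # illustrative threshold

    def process(self, element):
        key, values = element
        start = time.time()
        # ... existing per-key processing goes here ...
        elapsed = time.time() - start
        if elapsed > self.SLOW_THRESHOLD_SEC:
            logging.warning("Possible hot key %r took %.1f s to process", key, elapsed)
        yield element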

Related

Dask jobqueue - Is there a way to start all workers at the same time?

Say I have the following deployment on SLURM:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(processes=1, cores=25, walltime="1-00:00:00")
cluster.scale(20)
client = Client(cluster)
So I will have 20 nodes each with 25 cores.
Is there a way to tell the slurm scheduler to start all nodes at the same time, instead of starting each one individually when they become available?
A specific example: when nodes are started individually, the ones that started earliest might wait several hours, say 2, until all 20 nodes are ready. This is not only a waste of resources, it also cuts my usable job time below the 24-hour walltime (e.g. to 22 hours).
This is something one can do easily with dask_mpi, where a single batch job is allocated. I am wondering if it's possible to do this with dask_jobqueue specifically.
dask-jobqueue itself doesn't propose such a functionality.
It is designed to submit independent jobs. So to achieve this you would have to look at the possibilities of the job queuing system, Slurm in your case, and see if this is possible without dask-jobqueue. Then you should try to pass the relevant options through dask-jobqueue if you can, via the job_extra_directives kwarg for example (see the sketch below).
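A hedged sketch of how such an option would be passed through dask-jobqueue (the --exclusive directive is only an illustration, not a flag that makes all jobs start together; in older dask-jobqueue releases the same kwarg was called job_extra):

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    processes=1,
    cores=25,
    walltime="1-00:00:00",
    job_extra_directives=["--exclusive"],  # extra #SBATCH directives; replace with whatever Slurm option you find
)
cluster.scale(20)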
I'm not aware of such a functionality within Slurm, but there are so many knobs it is hard to tell. I know this is not possible with PBS.
A good option to achieve what you want is, as you said, using dask-mpi.
A final thought: you could also start your computation with the first two nodes, without waiting for the others to be ready. This should be doable in most cases.

How to find notebooks attached to a cluster in Databricks using API?

I run an overnight job that terminates all running clusters in Azure Databricks. Since each cluster might be used by multiple people, I want to find out programmatically which notebooks are attached to each running cluster.
I use the Python Databricks Cluster API (https://github.com/crflynn/databricks-api), however I'm not against the REST API if necessary.
dbx_env.cluster.get_cluster(cluster_id)  # returns cluster metadata, but no list of attached notebooks
There is no explicit API for that, so it's not so straightforward. One possible approach would be to analyze the audit log for attachNotebook and detachNotebook events and decide whether a cluster is in use. But this method may not be reliable, as events appear with a delay, and you need a job that will analyze the audit log.
A simpler solution would be to enforce an auto-termination time on all interactive clusters - in this case they will be terminated automatically when nobody uses them. You can either:
enforce that through cluster policies
have a script that goes through the list of clusters and checks the auto-termination time, setting it to something like 30 or 60 minutes (see the sketch below)
monitor create & edit events in the audit log, and correct clusters that have no or very high auto-termination times
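A rough sketch of the scripted check with the same Python client (the host, token and 60-minute limit are assumptions; this only flags offenders, since editing a cluster requires sending back its full spec via the clusters edit API):

from databricks_api import DatabricksAPI

dbx_env = DatabricksAPI(host="https://<workspace>.azuredatabricks.net", token="<personal-access-token>")

for cluster in dbx_env.cluster.list_clusters().get("clusters", []):
    minutes = cluster.get("autotermination_minutes", 0)
    # Skip job clusters; flag interactive clusters with no or very long auto-termination.
    if cluster.get("cluster_source") != "JOB" and (minutes == 0 or minutes > 60):
        print(cluster["cluster_id"], cluster.get("cluster_name"), "autotermination_minutes =", minutes)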

DynamoDB on-demand table: does intensive writing affect reading

I am developing a heavily loaded application that reads data from a DynamoDB on-demand table. Let's say it constantly performs around 500 reads per second.
From time to time I need to upload a large dataset into the database (100 million records). I use python, spark and audienceproject/spark-dynamodb. I set throughput=40k and use BatchWriteItem() for data writing.
In the beginning I observe some throttled write requests and the write capacity is only 4k, but then upscaling takes place and the write capacity goes up.
Questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading and writing?
Is it fine to set a large throughput for a short period of time? As far as I can see, the cost is the same in the case of on-demand tables. What are the potential issues?
I observe some throttled requests but eventually all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I didn't manage to find a clear answer so far.
That's a lot of questions in one question, so you'll get a high-level answer.
DynamoDB scales by increasing the number of partitions. Each item is stored on a partition. Each partition can handle:
up to 3000 Read Capacity Units
up to 1000 Write Capacity Units
up to 10 GB of data
As soon as any of these limits is reached, the partition is split into two and the items are redistributed. This happens until there is sufficient capacity available to meet demand. You don't control how that happens, it's a managed service that does this in the background.
The number of partitions only ever grows.
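As a worked example with the numbers above: sustaining the 40k write throughput mentioned in the question needs at least 40,000 / 1,000 = 40 partitions, so a table that starts with fewer partitions will throttle writes at first and stop throttling once enough splits have happened - which matches the initial 4k write capacity followed by upscaling described in the question.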
Based on this information we can address your questions:
Does intensive writing affect reading in the case of on-demand tables? Does autoscaling work independently for reading and writing?
The scaling mechanism is the same for read and write activity, but the scaling points differ as mentioned above. In an on-demand table AutoScaling is not involved; that's only for tables with provisioned throughput. You shouldn't notice an impact on your reads here.
Is it fine to set a large throughput for a short period of time? As far as I can see, the cost is the same in the case of on-demand tables. What are the potential issues?
I assume the throughput you set is the budget Spark is allowed to use for writing; it won't have much of an impact on an on-demand table. It's information the connector can use internally to decide how much parallelization is possible.
I observe some throttled requests but eventually all the data is successfully uploaded. How can this be explained? I suspect that the client I use has advanced rate-limiting logic, but I didn't manage to find a clear answer so far.
If the client uses BatchWriteItem, it will get a list of items that couldn't be written for each request and can enqueue them again. Exponential backoff may be involved but that is an implementation detail. It's not magic, you just have to keep track of which items you've successfully written and enqueue those that you haven't again until the "to-write" queue is empty.
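A minimal sketch of that loop with boto3 (the table name and item shape are placeholders; a production version would add exponential backoff between retries):

import boto3

dynamodb = boto3.client("dynamodb")

def write_all(items, table_name="my-table"):
    # BatchWriteItem accepts at most 25 items per request.
    pending = [{"PutRequest": {"Item": item}} for item in items]
    while pending:
        batch, pending = pending[:25], pending[25:]
        response = dynamodb.batch_write_item(RequestItems={table_name: batch})
        # Throttled writes come back in UnprocessedItems; re-enqueue them.
        pending.extend(response.get("UnprocessedItems", {}).get(table_name, []))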

Flink Slots/Parallelism vs Max CPU capabilities

I'm trying to understand the logic behind Flink's slot and parallelism configurations in the .yaml config file.
The official Flink documentation states that you should allocate one slot per CPU core and increase the parallelism level by one at the same time.
But I suppose this is just a recommendation. If, for example, I have a powerful CPU (e.g. the newest i7 at high clock speeds), that's different from having an old CPU with limited GHz. So running more slots and higher parallelism than my system's CPU core count isn't irrational.
But is there any other way than just testing different configurations to check my system's max capabilities with Flink?
Just for the record, I'm using Flink's Batch Python API.
It is recommended to assign each slot at least one CPU core because each operator is executed by at least 1 thread. Given that you don't execute blocking calls in your operator and the bandwidth is high enough to feed the operators constantly with new data, 1 slot per CPU core should keep your CPU busy.
If on the other hand, your operators issue blocking calls (e.g. communicating with an external DB), it sometimes might make sense to configure more slots than you have cores.
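For reference, the two settings involved live in flink-conf.yaml; the values here are only illustrative:

taskmanager.numberOfTaskSlots: 4   # slots offered by each TaskManager
parallelism.default: 4             # default parallelism when a job does not set its own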
There are several interesting points in your question.
First, the slots in Flink are the processing capacity that each TaskManager brings to the cluster; they limit both the number of applications that can be executed on it and the number of operators that can run at the same time. Tentatively, a machine should not provide more processing power than the CPU units present in it. Of course, this is true if all the tasks that run on it are CPU-intensive with little IO. If your application has operators that block heavily on IO (e.g. communicating with an external DB), there is no problem in configuring more slots than CPU cores available in your TaskManager, as @Till_Rohrmann said.
On the other hand, the default parallelism is the number of CPU cores available to your application in the Flink cluster, although you can specify it manually as a parameter when you run your application, or in your code. Note that a Flink cluster can run multiple applications simultaneously, and it is not convenient for a single one to block the entire cluster unless that is the goal, so the default parallelism is usually less than the number of slots available in your cluster (the sum of all slots contributed by your TaskManagers).
However, an application with parallelism 4 tentatively means that if it contains a stream input().Map().Reduce().Sink(), there should be 4 instances of each operator, so the total number of cores used by the application can be greater than 4. But this is something the developers of Flink should explain ;)

Python parallel processing libraries

Python seems to have many different packages available to assist one in parallel processing on an SMP-based system or across a cluster. I'm interested in building a client-server system in which a server maintains a queue of jobs and clients (local or remote) connect and run jobs until the queue is empty. Of the packages listed above, which is recommended and why?
Edit: In particular, I have written a simulator which takes in a few inputs and processes things for a while. I need to collect enough samples from the simulation to estimate a mean within a user-specified confidence interval. To speed things up, I want to be able to run simulations on many different systems, each of which reports back to the server at some interval with the samples it has collected. The server then calculates the confidence interval and determines whether the client process needs to continue. After enough samples have been gathered, the server terminates all client simulations, reconfigures the simulation based on past results, and repeats the process.
With this need for intercommunication between the client and server processes, I question whether batch-scheduling is a viable solution. Sorry I should have been more clear to begin with.
Have a go with ParallelPython. Seems easy to use, and should provide the jobs and queues interface that you want.
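A rough sketch of what that looks like with ParallelPython (pp); the function name, its arguments and the server addresses are placeholders:

import pp

# Local workers plus optional remote ppserver instances running on other machines.
job_server = pp.Server(ppservers=("node1:60000", "node2:60000"))

job = job_server.submit(run_simulation, (params,), modules=("random",))
result = job()  # calling the job object blocks until the result is ready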
There are also now two different Python wrappers around the map/reduce framework Hadoop:
http://code.google.com/p/happy/
http://wiki.github.com/klbostee/dumbo
Map/Reduce is a nice development pattern with lots of recipes for solving common patterns of problems.
If you don't already have a cluster, Hadoop itself is nice because it has full job scheduling, automatic distribution of data across the cluster (i.e. HDFS), etc.
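As an illustration of that pattern, the canonical dumbo word count looks roughly like this (submitted to the cluster with dumbo's command-line runner):

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)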
Given that you tagged your question "scientific-computing", and mention a cluster, some kind of MPI wrapper seems the obvious choice, if the goal is to develop parallel applications as one might guess from the title. Then again, the text in your question suggests you want to develop a batch scheduler. So I don't really know which question you're asking.
The simplest way to do this would probably just to output the intermediate samples to separate files (or a database) as they finish, and have a process occasionally poll these output files to see if they're sufficient or if more jobs need to be submitted.
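A crude sketch of that polling loop (the file pattern, confidence-interval target and the normal approximation are all assumptions):

import glob
import statistics
import time

def enough_samples(pattern="samples_*.txt", max_ci_halfwidth=0.05):
    samples = []
    for path in glob.glob(pattern):
        with open(path) as f:
            samples.extend(float(line) for line in f if line.strip())
    if len(samples) < 2:
        return False
    stderr = statistics.stdev(samples) / len(samples) ** 0.5
    return 1.96 * stderr <= max_ci_halfwidth  # ~95% confidence interval half-width

while not enough_samples():
    time.sleep(60)  # poll once a minute; submit more simulation jobs here if needed
print("Enough samples collected; stop the clients and reconfigure the simulation.")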
