When using spark-1.6.2 and pyspark, I saw this:
where you can see that the number of active tasks is negative (the difference between the total tasks and the completed tasks).
What is the source of this error?
Note that I have many executors. However, one task seems to have been idle (I don't see any progress), while another identical task completed normally.
Also, this mail is related. I can confirm that many tasks are being created, since I am using 1k or 2k executors.
The error I am getting is a bit different:
16/08/15 20:03:38 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
16/08/15 20:07:18 WARN TaskSetManager: Lost task 20652.0 in stage 4.0 (TID 116652, myfoo.com): FetchFailed(BlockManagerId(61, mybar.com, 7337), shuffleId=0, mapId=328, reduceId=20652, message=
org.apache.spark.shuffle.FetchFailedException: java.util.concurrent.TimeoutException: Timeout waiting for task.
It is a Spark issue that occurs when executors restart after failures. A JIRA ticket has already been created for it; you can find more details at https://issues.apache.org/jira/browse/SPARK-10141.
As answered on the Spark-dev mailing list by S. Owen, there are several JIRA tickets relevant to this issue, such as:
ResourceManager UI showing negative value
NodeManager reports negative running containers
This behavior usually occurs when (many) executors restart after failure(s).
This behavior can also occur when the application uses too many executors and thus spawns a very large number of tasks; using coalesce() to reduce the partition count fixes this case.
To be exact, in Prepare my bigdata with Spark via Python, I had >400k partitions. I used data.coalesce(1024), as described in Repartition an RDD, and I was able to bypass that Spark UI bug (see the sketch below). Partitioning is a very important concept when it comes to distributed computing and Spark.
In my question I also use 1-2k executors, so it is likely related.
Note: with too few partitions you might instead hit this Spark Java error: Size exceeds Integer.MAX_VALUE.
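A minimal sketch of the coalesce() approach, assuming a PySpark RDD named data loaded from a hypothetical input path:

from pyspark import SparkContext

sc = SparkContext(appName="coalesce-sketch")   # hypothetical app name

# Hypothetical input that ends up with far too many partitions (>400k in my case).
data = sc.textFile("hdfs:///path/to/input")
print(data.getNumPartitions())                 # e.g. 400000+

# Shrink the partition count before the heavy stages; 1024 worked for me.
# coalesce() avoids a full shuffle, unlike repartition().
data = data.coalesce(1024)
print(data.getNumPartitions())                 # 1024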
Related
I have a problem in my Storm setup and it looks like there's some discrepancy between the number of executors I set for the topology, and the number of actual bolt processes I see running on one of the servers in that topology.
When setting the number of executors per bolt I use the setBolt method of TopologyBuilder. The number of executors shown in the UI is correct (a total of 105), and when drilling down to the number of executors per server I see that every server in my topology should hold 7-9 executors. This is all well and good; however, when sshing into one of the servers and using htop, I see one parent process with at least 30 child processes running for that bolt type.
A few notes:
I am using a very old version of Storm (0.9.3) that unfortunately I can't upgrade.
I'm running a Storm instance that runs Python processes (I don't know how relevant that is).
I think I'm missing something about the relation between the number of Storm processes and the number of bolts/executors I'm configuring, or about how to read htop properly. In any case, I would love an explanation.
I found this answer, which says that htop shows threads as processes, but I still don't think it answers my question.
Thank you
We are using the most recent Spark build. We have as input a very large list of tuples (800 million). We run our PySpark program using Docker containers with a master and multiple worker nodes. A driver is used to run the program and connect to the master.
When running the program, at the line sc.parallelize(tuplelist) the program either quits with a Java heap space error message or quits without any error at all. We do not use any Hadoop HDFS layer and no YARN.
We have so far considered the possible factors as mentioned in these SO postings:
Spark java.lang.OutOfMemoryError : Java Heap space
Spark java.lang.OutOfMemoryError: Java heap space (the list of possible solutions by samthebest also did not help to solve the issue)
At this point we have the following questions:
How do we know how many partitions we should use for the sc.parallelize step? What is a good rule of thumb here?
Do you know any (common?) mistakes which may lead to the observed behavior?
How do we know how many partitions we should use for the sc.parallelize step? What is a good rule of thumb here?
Ans: There are multiple factors to consider when deciding the number of partitions:
1) Having roughly 3-4x as many partitions as total cores is often a good choice (assuming each partition takes more than a few seconds to process); see the sketch after this list.
2) Partitions shouldn't be too small or too large; around 128 MB to 256 MB per partition is usually good enough.
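For illustration, a minimal sketch of rule 1), using sc.defaultParallelism as a stand-in for the total core count (the input list here is made up and far smaller than 800 million tuples):

from pyspark import SparkContext

sc = SparkContext(appName="parallelize-sketch")   # hypothetical app name

# Stand-in for the real 800-million-tuple input, sized down so the sketch runs.
tuplelist = [(i, i * 2) for i in range(1000000)]

# Rule of thumb: roughly 3-4x as many partitions as total cores.
num_partitions = 4 * sc.defaultParallelism

rdd = sc.parallelize(tuplelist, numSlices=num_partitions)
print(rdd.getNumPartitions())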
Do you know any (common?) mistakes which may lead to the observed behavior?
Ans: Check whether the executor memory and disk available are enough for data of this size.
If you can specify more details about the job (e.g. number of cores, executor memory, number of executors, and available disk), it will be easier to pinpoint the issue.
Preamble
Yet another airflow tasks not getting executed question...
Everything was going more or less fine in my airflow experience up until this weekend when things really went downhill.
I have checked all the standard things e.g. as outlined in this helpful post.
I have reset the whole instance multiple times trying to get it working properly but I am totally losing the battle here.
Environment
version: airflow 1.10.2
os: centos 7
python: python 3.6
virtualenv: yes
executor: LocalExecutor
backend db: mysql
The problem
Here's what happens in my troubleshooting infinite loop / recurring nightmare.
I reset the metadata DB (or possibly the whole virtualenv and config etc) and re-enter connection information.
Tasks will get executed once. They may succeed. If I missed something in the setup, a task may fail.
When task fails, it goes to retry state.
I fix the issue (e.g. I forgot to enter a connection) and manually clear the task instance.
Cleared task instances do not run, but just sit in a "none" state.
Attempts to get the dag running again fail.
Before I started having this trouble, after I cleared a task instance, it would always get picked up and executed again very quickly.
But now, clearing the task instance usually results in the task instance getting stuck in a cleared state. It just sits there.
Worse, if I try failing the dag and all instances, and manually triggering the dag again, the task instances get created but stay in 'none' state. Restarting scheduler doesn't help.
Other observation
This is probably a red herring, but one thing I have noticed only recently is that when I click on the icon representing the task instances stuck in the 'none' state, it takes me to a "task instances" view with the wrong filter; the filter is set to "string equals null".
But you need to switch it to "string empty yes" for it to actually return the stuck task instances.
I am assuming this is just an unrelated UI bug, a red herring as far as I am concerned, but I thought I'd mention it just in case.
Edit 1
I am noticing that there is some "null operator" going on:
Edit 2
Is null a valid value for task instance state? Or is this an indicator that something is wrong?
Edit 3
More none stuff.
Here are some bits from the task instance details page. Lots of attributes are none:
Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency: Unknown
Reason: All dependencies are met but the task instance is not running. In most cases this just means that the task will probably be scheduled soon unless:
- The scheduler is down or under heavy load
- The following configuration values may be limiting the number of queueable processes: parallelism, dag_concurrency, max_active_dag_runs_per_dag, non_pooled_task_slot_count
- This task instance already ran and had its state changed manually (e.g. cleared in the UI)
If this task instance does not start soon please contact your Airflow administrator for assistance.
Task Instance Attributes
Attribute Value
duration None
end_date None
is_premature False
job_id None
operator None
pid None
queued_dttm None
raw False
run_as_user None
start_date None
state None
Update
I may finally be on to something...
After my nightmarish, marathon, stuck-in-twilight-zone troubleshooting session, I threw my hands up and resolved to use docker containers instead of running natively. It was just too weird. Things were just not making sense. I needed to move to docker so that the environment could be completely controlled and reproduced.
So I started working on the docker setup based on puckel/docker-airflow. This was no trivial task either, because I decided to use environment variables for all parameters and connections. Not all hooks parse connection URIs the same way, so you have to be careful, look at the code, and do some trial and error.
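As a reference point, here is a hedged sketch of the environment-variable connection approach in airflow 1.10.x (the connection id and URI are made up); printing how the URI gets parsed is a cheap way to catch the hook-specific quirks mentioned above:

import os

# Airflow 1.10.x picks up AIRFLOW_CONN_<CONN_ID> environment variables
# and parses them as connection URIs (hypothetical values below).
os.environ["AIRFLOW_CONN_MY_POSTGRES"] = "postgres://user:secret@db-host:5432/mydb"

from airflow.hooks.base_hook import BaseHook

# Check how the URI was parsed before trusting a hook with it.
conn = BaseHook.get_connection("my_postgres")
print(conn.host, conn.port, conn.schema, conn.login)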
So I did that and finally got my docker setup working locally. But when I went to build the image on my EC2 instance, I found that the disk was full, and it was full in no small part due to airflow logs.
So, my new theory is that lack of disk space may have had something to do with this. I am not sure if I will be able to find a smoking gun in the logs, but I will look.
Ok I am closing this out and marking the presumptive root cause as server was out of space.
There were a number of contributing factors:
My server did not have a lot of storage. Only 10GB. I did not realize it was so low. Resolution: add more space
Logging in airflow 1.10.2 went a little crazy. An INFO log message ("Harvesting DAG parsing results") was being emitted every second or two, which eventually resulted in a large log file. Resolution: this is fixed in commit [AIRFLOW-3911] Change Harvesting DAG parsing results to DEBUG log level (#4729), which is in 1.10.3, but you can always fork and cherry-pick if you are stuck on 1.10.2.
Additionally, some of my scheduler / webserver interval params could have benefited from an increase; as a result I ended up with multi-GB log files. I think this may have been partly due to changing airflow versions without correctly updating airflow.cfg. Solution: when upgrading (or changing versions), temporarily move airflow.cfg aside so that a cfg compatible with the new version is generated, then merge the two carefully. Another strategy is to rely only on environment variables, so that your config always matches a fresh install, and the only things in your env variables are parameter overrides and, possibly, connections.
Airflow may not log errors anywhere in this case; everything looked fine, except the scheduler was not queuing up jobs, or it would queue one or two and then just stop, without any error message. Solutions can include (1) adding out-of-space alarms with your cloud provider, and (2) figuring out how to make the scheduler raise a helpful exception in this case and contributing it to airflow.
I have been using celery for a while but am looking for an alternative due to the lack of Windows support.
The top competitors seem to be dask and dramatiq. What I'm really looking for is something that can distribute 1000 long-running tasks onto 10 machines, where each machine picks up the next job as soon as it completes its current task and gives a callback with updates (in celery this can be achieved nicely with @task(bind=True), since the task instance itself can be accessed and I can send the status back to the instance that sent it).
Is there a similar functionality available in dramatiq or dask? Any suggestions would be appreciated.
On the Dask side you're probably looking for the futures interface: https://docs.dask.org/en/latest/futures.html
Futures have a basic status like "finished" or "pending" or "error" that you can check any time. If you want more complex messages then you should look into Dask Queues, PubSub, or other intertask communication mechanisms, also available from that doc page.
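For example, here is a minimal sketch using the futures interface (the scheduler address and task body are made up); for richer progress messages you could swap in a dask Queue or Pub/Sub as mentioned above:

from dask.distributed import Client, as_completed

client = Client("tcp://scheduler-address:8786")   # hypothetical scheduler address

def long_running_task(job_id):
    # ... the real long-running work would go here ...
    return {"job": job_id, "status": "done"}

# 1000 jobs spread over whatever workers are connected (e.g. 10 machines);
# each worker picks up the next job as soon as it finishes its current one.
futures = [client.submit(long_running_task, i) for i in range(1000)]

# Every future exposes a basic status at any time.
print(futures[0].status)   # "pending", "finished", or "error"

# Handle results (the "callback with updates") as they complete.
for future in as_completed(futures):
    print(future.result())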
Python seems to have many different packages available to assist with parallel processing on an SMP-based system or across a cluster. I'm interested in building a client-server system in which a server maintains a queue of jobs and clients (local or remote) connect and run jobs until the queue is empty. Of the packages listed above, which is recommended and why?
Edit: In particular, I have written a simulator which takes in a few inputs and processes things for a while. I need to collect enough samples from the simulation to estimate a mean within a user-specified confidence interval. To speed things up, I want to be able to run simulations on many different systems, each of which reports back to the server at some interval with the samples it has collected. The server then calculates the confidence interval and determines whether the client process needs to continue. After enough samples have been gathered, the server terminates all client simulations, reconfigures the simulation based on past results, and repeats the process.
With this need for intercommunication between the client and server processes, I question whether batch scheduling is a viable solution. Sorry, I should have been clearer to begin with.
Have a go with ParallelPython. It seems easy to use, and should provide the jobs-and-queues interface that you want.
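If it helps, here is a rough sketch of what that might look like with pp (the node addresses, secret, and worker function are made up; check the pp docs for the exact Server/submit signatures):

import random
import pp

def run_simulation(n):
    # Hypothetical worker: draw n samples and return them to the server.
    return [random.random() for _ in range(n)]

# Local workers plus remote ppserver.py instances running on the cluster nodes.
job_server = pp.Server(ppservers=("node1:60000", "node2:60000"), secret="change-me")

# submit(func, args, depfuncs, modules) queues a job and returns a handle.
jobs = [job_server.submit(run_simulation, (1000,), (), ("random",)) for _ in range(20)]

# Calling a job handle blocks until that job finishes and returns its result.
samples = [x for job in jobs for x in job()]
print(len(samples))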
There are also now two different Python wrappers around the map/reduce framework Hadoop:
http://code.google.com/p/happy/
http://wiki.github.com/klbostee/dumbo
Map/Reduce is a nice development pattern with lots of recipes for solving common patterns of problems.
If you don't already have a cluster, Hadoop itself is nice because it has full job scheduling, automatic distribution of data across the cluster (i.e. HDFS), etc.
Given that you tagged your question "scientific-computing", and mention a cluster, some kind of MPI wrapper seems the obvious choice, if the goal is to develop parallel applications as one might guess from the title. Then again, the text in your question suggests you want to develop a batch scheduler. So I don't really know which question you're asking.
The simplest way to do this would probably be just to output the intermediate samples to separate files (or a database) as they finish, and have a process occasionally poll these output files to see whether they're sufficient or whether more jobs need to be submitted.
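A rough sketch of that polling loop, with a made-up file layout and a made-up stopping rule (stop once the 95% confidence interval half-width drops below a target):

import glob
import json
import math
import time

TARGET_HALF_WIDTH = 0.05   # hypothetical precision target for the mean

def pooled_samples(pattern="samples/*.json"):
    # Each client writes its samples as a JSON list of numbers to its own file.
    samples = []
    for path in glob.glob(pattern):
        with open(path) as f:
            samples.extend(json.load(f))
    return samples

while True:
    samples = pooled_samples()
    n = len(samples)
    if n > 1:
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / (n - 1)
        half_width = 1.96 * math.sqrt(var / n)   # normal approximation
        if half_width < TARGET_HALF_WIDTH:
            print("enough samples:", n, "mean:", mean)
            break
    time.sleep(30)   # poll every 30 seconds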