We are using the most recent Spark build. Our input is a very large list of tuples (800 million). We run our PySpark program in Docker containers with one master and multiple worker nodes; a driver connects to the master and runs the program.
When the program reaches the line sc.parallelize(tuplelist), it either quits with a Java heap space error or exits without any error message at all. We use neither an HDFS layer nor YARN.
We have so far considered the possible factors as mentioned in these SO postings:
Spark java.lang.OutOfMemoryError : Java Heap space
Spark java.lang.OutOfMemoryError: Java heap space (the list of possible solutions by samthebest also did not help to solve the issue)
At this point we have the following questions:
How do we know how many partitions we should use for the sc.parallelize step? What is a good rule of thumb here?
Do you know of any (common?) mistake that may lead to the observed behavior?
How do we know how many partitions we should use for the sc.parallelize step? What is a good rule of thumb here?
Ans: Several factors determine a good number of partitions.
1) Having 3-4x as many partitions as you have cores usually works well, provided each partition takes more than a few seconds to process.
2) Partitions should be neither too small nor too large; roughly 128 MB to 256 MB per partition is a good target.
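The two rules above can be combined into a small helper. This is only a sketch of the rule of thumb, not a Spark API; the 50-bytes-per-tuple estimate and the 32-core cluster are assumptions for illustration.

```python
import math

def suggest_num_partitions(total_bytes, num_cores,
                           target_bytes=128 * 1024 ** 2,
                           core_multiplier=3):
    """Rule-of-thumb partition count: at least 3x the core count,
    but also enough that each partition stays near ~128 MB."""
    by_size = math.ceil(total_bytes / target_bytes)
    by_cores = num_cores * core_multiplier
    return max(by_size, by_cores)

# E.g. 800 million tuples at an assumed ~50 bytes each, on a 32-core cluster:
n = suggest_num_partitions(800_000_000 * 50, 32)
# The result could then be passed as the numSlices argument:
# rdd = sc.parallelize(tuplelist, numSlices=n)
```

Here the size-based bound (~299 partitions of ~128 MB) wins over the core-based bound (96), so the data size drives the choice.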
Do you know of any (common?) mistake that may lead to the observed behavior?
Can you check how much executor memory and disk is available to run a job of this size?
If you can share more details about the job (number of cores, executor memory, number of executors, disk available), it will be easier to point out the issue.
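As a back-of-the-envelope check before sharing those numbers: the aggregate executor memory has to hold the parallelized data, and only a fraction of each executor heap (roughly spark.memory.fraction, ~0.6 by default in recent Spark) is usable for storage and execution. A hypothetical helper with assumed figures:

```python
def fits_in_cluster(num_executors, executor_mem_gb, data_gb, usable_fraction=0.6):
    """Crude sanity check: usable memory is roughly spark.memory.fraction
    (~0.6 by default) of each executor heap, times the executor count."""
    usable_gb = num_executors * executor_mem_gb * usable_fraction
    return usable_gb >= data_gb

# Assumed setup: 16 executors x 2 GB heap vs. ~40 GB of tuples:
fits_in_cluster(16, 2, 40)   # → False: not enough aggregate memory
fits_in_cluster(64, 8, 40)   # → True
```

If the check fails for your configuration, a heap error at sc.parallelize is exactly what you would expect.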
Related
I have two Airflow jobs with exactly the same code base. The only difference is that they write data to different Mongo collections.
One of the jobs is ~30% slower than the other. How is that possible? Can Airflow allocate more resources to one job than to the other?
If both use the same queues, have the same priority, and always run in the same environment, there should be no visible difference, unless one of the jobs runs at a different time and the load on the system is higher at that moment. Is this duration difference a consistent trend?
Have you tested performance of those jobs outside Airflow? Size and complexity of the collections may also matter.
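Testing outside Airflow can be as simple as timing the two job callables directly. The two writer functions below are hypothetical stand-ins for your collection writers; taking the best of several runs suppresses background-load noise:

```python
import timeit

def benchmark(job, repeat=3, number=1):
    """Time a job callable several times and return the best run,
    which filters out transient system load."""
    return min(timeit.repeat(job, repeat=repeat, number=number))

# Hypothetical stand-ins for the two Mongo collection writers:
def write_to_collection_a(): return sum(range(10_000))
def write_to_collection_b(): return sum(range(10_000))

t_a = benchmark(write_to_collection_a)
t_b = benchmark(write_to_collection_b)
```

If the ~30% gap persists here, the cause is in the collections (size, indexes, document shape), not in Airflow's scheduling.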
I am trying to use to_parquet, but it crashes my system with a memory error. I've discovered that it is trying to write 100-300 of my partitions at a time.
Is it possible to specify that fewer partitions should be processed at a time, to prevent a crash from using up all the RAM?
Dask will use as many threads at a time as you give it. The tasks may be "processing" but that just means that they have been sent to a worker, which will handle them when it has a spare thread.
I am trying to use to_parquet but it crashes my system due to memory error.
However it could still be that your partitions are large enough that you can't fit several of them in memory at once. In this case you might want to select a smaller partition size. See https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions for more information.
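The underlying idea, independent of Dask, is that bounding concurrency bounds peak memory: only the partitions currently in flight need to be materialized. A minimal sketch using the standard library (the partition data and output layout are made up for illustration):

```python
import concurrent.futures as cf
import os
import tempfile

def write_partition(args):
    """Write one partition to its own file (a stand-in for one parquet part)."""
    out_dir, idx, rows = args
    path = os.path.join(out_dir, f"part-{idx:05d}.txt")
    with open(path, "w") as f:
        f.writelines(f"{r}\n" for r in rows)
    return path

# 20 toy partitions of 100 records each:
partitions = [range(i * 100, (i + 1) * 100) for i in range(20)]

out_dir = tempfile.mkdtemp()
# max_workers bounds how many partitions are materialized at once,
# the same way limiting Dask worker threads bounds peak memory.
with cf.ThreadPoolExecutor(max_workers=2) as ex:
    paths = list(ex.map(write_partition,
                        ((out_dir, i, p) for i, p in enumerate(partitions))))
```

In Dask itself the equivalent knobs are repartitioning to smaller partitions (e.g. ddf.repartition(partition_size="100MB")) and starting the scheduler or cluster with fewer workers/threads.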
When using spark-1.6.2 and pyspark, I saw this in the Spark UI:
where the active tasks are shown as a negative number (the difference between the total tasks and the completed tasks).
What is the source of this error?
Note that I have many executors. However, one task seems to have been idle (I don't see any progress), while another identical task completed normally.
Also related is this mail thread; I can confirm that many tasks are being created, since I am using 1k or 2k executors.
The error I am getting is a bit different:
16/08/15 20:03:38 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
16/08/15 20:07:18 WARN TaskSetManager: Lost task 20652.0 in stage 4.0 (TID 116652, myfoo.com): FetchFailed(BlockManagerId(61, mybar.com, 7337), shuffleId=0, mapId=328, reduceId=20652, message=
org.apache.spark.shuffle.FetchFailedException: java.util.concurrent.TimeoutException: Timeout waiting for task.
It is a Spark issue. It occurs when executors restart after failures. The corresponding JIRA issue has already been created; you can find more details at https://issues.apache.org/jira/browse/SPARK-10141.
As answered on the Spark-dev mailing list by S. Owen, there are several JIRA tickets relevant to this issue, such as:
ResourceManager UI showing negative value
NodeManager reports negative running containers
This behavior usually occurs when (many) executors restart after failure(s).
This behavior can also occur when the application uses too many partitions, and thus too many tasks. Use coalesce() to reduce the partition count in this case.
To be exact, in Prepare my bigdata with Spark via Python, I had >400k partitions. I used data.coalesce(1024), as described in Repartition an RDD, and I was able to bypass that Spark UI bug. Partitioning is a very important concept in distributed computing and Spark.
In my question I also use 1-2k executors, so it must be related.
Note: with too few partitions you might run into this Spark Java error: Size exceeds Integer.MAX_VALUE.
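To make the coalesce step concrete, here is a plain-Python model of merging >400k tiny partitions into 1024 larger ones. This is only an illustration of the idea, not Spark's implementation: real RDD.coalesce merges contiguous, co-located partitions rather than round-robin as done here.

```python
def coalesce(partitions, target):
    """Merge a long list of partitions into at most `target` larger ones
    without moving individual records between nodes (no shuffle).
    Round-robin assignment here; Spark merges by locality instead."""
    if target >= len(partitions):
        return partitions
    merged = [[] for _ in range(target)]
    for i, part in enumerate(partitions):
        merged[i % target].extend(part)
    return merged

tiny = [[i] for i in range(400_000)]   # 400k one-element partitions
merged = coalesce(tiny, 1024)
# 1024 partitions remain, and no records are lost
```

Fewer partitions means fewer tasks, which is what stops the UI's event queue from being overwhelmed.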
Deployment info: "pyspark --master yarn-client --num-executors 16 --driver-memory 16g --executor-memory 2g "
I am turning a 100,000-line text file (stored in HDFS) into an RDD with corpus = sc.textFile("my_file_name"). When I execute corpus.count() I do get 100000. I realize that all these steps are performed on the master node.
Now, my question is when I perform some action like new_corpus=corpus.map(some_function), will the job be automatically distributed by pyspark among all available slaves (16 in my case)? Or do I have to specify something?
Notes:
I don't think that anything actually gets distributed (or at least not across the 16 nodes), because when I do new_corpus.count(), what prints out is [Stage some_number:> (0+2)/2], not [Stage some_number:> (0+16)/16].
I don't think that corpus = sc.textFile("my_file_name", 16) is the solution for me, because the function I want to apply works at the line level and should therefore be applied 100,000 times (the goal of parallelization is to speed this up, e.g. each slave taking 100000/16 lines). It should not be applied 16 times on 16 subsets of the original text file.
Your observations are not really correct. Stages are not "executors". In Spark we have jobs, stages, and tasks. A job is kicked off by the driver, and tasks are assigned to different worker nodes; a stage is a collection of tasks that share the same shuffle dependencies. In your case shuffling happens only once.
To check whether you really have 16 executors, look at the resource manager UI. Since you are using YARN, the ResourceManager UI is typically at port 8088 (the Spark application UI itself runs at port 4040).
Also, rdd.map() parallelizes according to the number of partitions you defined, e.g. via sc.textFile("my_file_name", numPartitions), not according to the number of executors you requested.
Here is an overview again:
https://spark.apache.org/docs/1.6.0/cluster-overview.html
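To address the worry in the question's second note directly: with 16 partitions the function is still applied once per line, just inside 16 tasks. A toy plain-Python model of rdd.map() (no Spark needed) makes this visible:

```python
def run_map_job(lines, num_partitions, fn):
    """Toy model of rdd.map(fn): split the records into partitions,
    then apply fn once per record inside each partition (one task each)."""
    size = -(-len(lines) // num_partitions)   # ceiling division
    parts = [lines[i:i + size] for i in range(0, len(lines), size)]
    results = [[fn(line) for line in part] for part in parts]
    return parts, results

lines = [f"line-{i}" for i in range(100_000)]
parts, results = run_map_job(lines, 16, str.upper)
# 16 tasks, but fn still runs 100,000 times: once per line
```

So sc.textFile("my_file_name", 16) does exactly what the asker wants: each of the 16 tasks processes ~100000/16 lines, applying the function to every line it holds.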
First off, I saw yarn-client and a chill ran down my spine.
Is there a reason why you want the node where you submit your job to be running the driver? Why not let Yarn do its thing?
But about your question:
I realize that all these steps are performed on the master node.
No, they are not. You might be misled by the fact that you are running your driver on the node you are connected to (see my spine chill ;) ).
You tell yarn to start up 16 executors for you, and Yarn will do so.
It will try to take your rack and data locality into account to the best of its ability while doing so. These will be run in parallel.
Yarn is a resource manager; it manages the resources so you don't have to. All you have to specify with Spark is the number of executors you want, and the memory YARN has to assign to the executors and the driver.
Update: I have added an image to clarify how spark-submit (in cluster mode) works.
I have a simple string-matching script that tests just fine with multiprocessing, using up to 8 Pool workers, on my local Mac with 4 cores. However, the same script on an AWS c1.xlarge with 8 cores generally kills all but 2 workers, the CPU only works at 25%, and after a few rounds it stops with a MemoryError.
I'm not too familiar with server configuration, so I'm wondering if there are any settings to tweak?
The pool implementation looks as follows, but doesn't seem to be the issue, as it works locally. There would be several thousand targets per worker, and it doesn't run past the first five or so. Happy to share more of the code if necessary.
from multiprocessing import Pool
import itertools

pool = Pool(processes=numProcesses)
totalTargets = len(getTargets('all'))
targetsPerBatch = totalTargets / numProcesses  # Python 2: integer division
# Each call to runMatch receives a (batch size, start offset) tuple
pool.map_async(runMatch,
               itertools.izip(itertools.repeat(targetsPerBatch),
                              xrange(0, totalTargets, targetsPerBatch))).get(99999999)
pool.close()
pool.join()
The MemoryError means you're running out of system-wide virtual memory. How much virtual memory you have is an abstract thing, based on the actual physical RAM plus swapfile size plus stuff that's paged into memory from other files and stuff that isn't paged anywhere because the OS is being clever and so on.
According to your comments, each process averages 0.75GB of real memory and 4GB of virtual memory. So, with 8 processes, your total VM usage is 32GB.
One common reason for this is that each process might peak at 4GB but spend almost all of its time using much less than that. Python rarely releases memory back to the OS; it'll just get paged out.
Anyway, 6GB of real memory is no problem on an 8GB Mac or a 7GB c1.xlarge instance.
And 32GB of VM is no problem on a Mac. A typical OS X system has virtually unlimited VM size—if you actually try to use all of it, it'll start creating more swap space automatically, paging like mad, and slowing your system to a crawl and/or running out of disk space, but that isn't going to affect you in this case.
But 32GB of VM is likely to be a problem on linux. A typical linux system has fixed-size swap, and doesn't let you push the VM beyond what it can handle. (It has a different trick that avoids creating probably-unnecessary pages in the first place… but once you've created the pages, you have to have room for them.) I'm not sure what an xlarge comes configured for, but the swapon tool will tell you how much swap you've got (and how much you're using).
Anyway, the easy solution is to create and enable an extra 32GB swapfile on your xlarge.
However, a better solution would be to reduce your VM use. Often each subprocess is doing a whole lot of setup work that creates intermediate data that's never needed again; you can use multiprocessing to push that setup into different processes that quit as soon as they're done, freeing up the VM. Or maybe you can find a way to do the processing more lazily, to avoid needing all that intermediate data in the first place.