PySpark Task Size - python

I currently have a Spark cluster of 1 Driver and 2 Workers on version 2.4.5.
I would like to optimize parallelism further to get better throughput when loading and processing data. While doing this I often see the following message on the console:
WARN scheduler.TaskSetManager: Stage contains a task of very large size (728 KB). The maximum recommended task size is 100 KB.
How does this work? I am fairly new to Spark but understand the basics of it. I would like to know how to optimize this, but I'm not sure whether it involves configuring the workers to have more executors (and getting more parallelism that way), or whether I need to partition my DataFrames with either the coalesce or repartition functions.
Thank you guys in advance!

The general gist here is that you need to repartition to get more, but smaller, partitions, so as to get more parallelism and higher throughput. The 728 KB is the serialized size of a task in that stage; when it grows well past the 100 KB recommendation, it usually means too much data (or too large a closure) is being shipped with each task. I hit this sometimes when I first started out with Scala and Spark.
I cannot see your code, so I'll leave it at this, but searching here on SO also points to a lack of parallelism. In all honesty, it's a well-known issue.
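A minimal sketch of what that repartitioning looks like in PySpark (the paths and partition count are placeholders, not taken from your job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

df = spark.read.parquet("/path/to/input")      # placeholder input
print(df.rdd.getNumPartitions())               # how many partitions (tasks) you have now

# More, smaller partitions -> more tasks that can run in parallel across your workers.
df = df.repartition(200)                       # full shuffle into 200 partitions
# df = df.coalesce(50)                         # coalesce only merges partitions, no full shuffle

df.write.parquet("/path/to/output")            # placeholder output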

Related

How to figure out if a modin dataframe is going to fit in RAM?

I'm learning how to work with large datasets, so I'm using modin.pandas.
I'm doing some aggregation, after which a 50 GB dataset is hopefully going to become closer to 5 GB in size - and now I need to check: if the df is small enough to fit in RAM, I want to cast it to pandas and enjoy a bug-free, reliable library.
So, naturally, the question is: how do I check this? .memory_usage(deep=True).sum() tells me how much the whole df uses, but I can't possibly know from that one number how much of it is in RAM and how much is in swap - in other words, how much space I need to cast the df to pandas. Are there other ways? Am I even right to assume that some partitions live in RAM while others live in swap? How do I calculate how much data will flood the RAM when I call ._to_pandas()? Is there a hidden .__memory_usage_in_swap_that_needs_to_fit_in_ram() of some sort?
Am I even right to assume that some partitions live in RAM while others live in swap?
Modin doesn't specify whether data should be in RAM or swap.
On Ray, it uses ray.put to store partitions, and ray.put doesn't give any guarantees about where the data will go. Note that Ray spills objects to disk when they are too large for its in-memory object store. You can run the ray memory command to get a summary of how much of each kind of storage Ray is using.
On Dask, Modin stores partition data with dask.Client.scatter, which likewise gives no guarantees about where the data will go. I don't know of any way to figure out how much of the stored data is really in RAM.
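If the practical question is simply whether the aggregated result will fit in RAM before you call ._to_pandas(), one rough workaround (my own sketch, not a Modin API; the path, key and headroom factor are placeholders) is to compare the reported size of the Modin DataFrame against the memory currently available on the machine:

import modin.pandas as pd
import psutil                                  # third-party, used only to read available RAM

df = pd.read_csv("big_dataset.csv")            # placeholder path
result = df.groupby("some_key").sum()          # placeholder aggregation

needed = result.memory_usage(deep=True).sum()  # bytes the result reports using
available = psutil.virtual_memory().available  # bytes of physical RAM free right now

# Leave generous headroom: _to_pandas() materializes a full copy in the main process.
if needed * 2 < available:
    plain_df = result._to_pandas()
else:
    print("result is probably too large to convert to pandas safely")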

Pyspark - df.cache().count() taking forever to run

I'm trying to force eager evaluation for PySpark, using the count methodology I read online:
spark_df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
spark_df.cache().count()
However, when I try running the code, the cache-and-count part takes forever to run. My data size is relatively small (2.7 GB, 15 million rows), but after 28 minutes of running I decided to kill the job. For comparison, reading the data with the pandas.read_sql() method took only 6 minutes 43 seconds.
The machine I'm running the code on is pretty powerful (20 vCPUs, 160 GB RAM, Windows OS). I believe I'm missing a step to speed up the count statement.
Any help or suggestions are appreciated.
When you read with pandas, it can use as much of the machine's available memory as it needs (up to the 160 GB you mentioned, which is far larger than the data itself, ~3 GB).
However, it's not the same with Spark. When you start your Spark session, you typically have to state up front how much memory each executor (and the driver, and the application master if applicable) should use; if you don't specify it, the default is 1 GB according to the Spark documentation. So the first thing you want to do is give more memory to your executors and driver.
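For example, a minimal sketch of raising those defaults when building the session (set them before the session and its JVM are created; the values are placeholders for whatever your machine can spare):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jdbc-read")
    .config("spark.driver.memory", "16g")      # default is 1g
    .config("spark.executor.memory", "16g")    # default is 1g
    .getOrCreate()
)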
Second, reading from JDBC with Spark is tricky, because how slow it is depends on the number of executors (and tasks), those depend on how many partitions the RDD read from the JDBC connection has, and the number of partitions depends on your table, your query, columns, conditions, etc. One way to force a change in behavior, to get more partitions, more tasks, more executors, ... is via these configurations: numPartitions, partitionColumn, lowerBound, and upperBound.
numPartitions is the number of partitions (and hence the maximum number of parallel read tasks)
partitionColumn is an integer-type column that Spark uses to split the read into partitions
lowerBound is the min value of partitionColumn that you want to read
upperBound is the max value of partitionColumn that you want to read
You can read more here: https://stackoverflow.com/a/41085557/3441510, but the basic idea is that you want a reasonable number of parallel read tasks (defined by numPartitions), each processing an evenly distributed chunk of data (defined by partitionColumn, lowerBound and upperBound).
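Put together, a hedged sketch of a partitioned JDBC read (the column name and bounds are placeholders for whatever fits your table; note that in the DataFrameReader.jdbc API the partition column parameter is called column):

spark_df = spark.read.jdbc(
    url=jdbcUrl,
    table=pushdown_query,
    column="id",               # placeholder: a numeric column to split the read on
    lowerBound=1,              # min value of that column
    upperBound=15_000_000,     # max value of that column
    numPartitions=20,          # e.g. one partition per vCPU
    properties=connectionProperties,
)
spark_df.cache().count()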

Import and work with large dataset (Python beginners)

Since I couldn't find the best way to deal with my issue, I came here to ask.
I'm a beginner with Python, but I have to handle a large dataset.
However, I don't know the best way to handle the "MemoryError" problem.
I already have a 64-bit Python 3.7.3 installation.
I saw that we can use TensorFlow, specify chunks in the pandas read call, or use the Dask library, but I don't know which one best fits my problem, and as a beginner it's not very clear.
I have a huge dataset (over 100M observations), and I don't think reducing the dataset would lower memory usage much.
What I want to do is test multiple ML algorithms with train and test samples. I don't know how to deal with the problem.
Thanks!
This question is high level, so I'll provide some broad approaches for reducing memory usage in Dask:
Use a columnar file format like Parquet so you can leverage column pruning
Use column dtypes that require less memory (e.g. int8 instead of int64); the first two points are sketched in the snippet after this list
Strategically persist in memory, where appropriate
Use a cluster that's sized well for your data (running an analysis on 2 GB of data requires a different amount of memory than 2 TB)
Split data into multiple files so it's easier to process in parallel
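A minimal sketch of the first two points with Dask (the file pattern and column names are placeholders):

import dask.dataframe as dd

# Column pruning: read only the columns you actually need from Parquet.
ddf = dd.read_parquet("data/*.parquet", columns=["user_id", "amount", "label"])

# Downcast dtypes that don't need 64 bits.
ddf = ddf.astype({"label": "int8", "amount": "float32"})

print(ddf.memory_usage(deep=True).sum().compute())   # estimated bytes after the changes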
Your data has 100 million rows, which isn't that big (unless it has thousands of columns). Big data typically has billions or trillions of rows.
Feel free to add questions that are more specific and I can provide more specific advice. You can provide the specs of your machine / cluster, the memory requirements of the DataFrame (via ddf.memory_usage(deep=True)) and the actual code you're trying to run.

partitionBy taking too long while saving a dataset on S3 using Pyspark

I am trying to save a dataset using partitionBy on S3 with pyspark, partitioning by a date column. The Spark job takes more than an hour to execute. If I run the code without partitionBy, it takes just 3-4 minutes.
Could somebody help me fine-tune the partitionBy?
OK, so Spark is terrible at doing IO, especially with respect to S3. Currently, when you write in Spark, it will use a whole executor to write the data SEQUENTIALLY. That, plus the back and forth between S3 and Spark, leads to it being quite slow. There are a few things you can do to mitigate/sidestep these issues.
Use a different partitioning strategy, if possible, with the goal of minimizing the number of files written.
If there is a shuffle involved before the write, you can change the setting for the default number of shuffle partitions: spark.sql.shuffle.partitions (200 is the default). You'll probably want to reduce this and/or repartition the data before writing.
You can go around Spark's IO and write your own HDFS writer, or use the S3 API directly, e.g. with foreachPartition and a function that writes each partition to S3. That way things write in parallel instead of sequentially.
Finally, you may want to use repartition and partitionBy together when writing (see "DataFrame partitionBy to a single Parquet file (per partition)"). This leads to one file per partition and, combined with maxRecordsPerFile (below), helps keep your file sizes down; a sketch is shown after this list.
As a side note: you can use the option spark.sql.files.maxRecordsPerFile 1000000 to help control file sizes and make sure they don't get out of control.
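A hedged sketch of that combination (the column name, path and numbers are placeholders):

(
    df
    .repartition("date")                        # one shuffle partition per date value
    .write
    .partitionBy("date")                        # one output directory per date
    .option("maxRecordsPerFile", 1_000_000)     # cap rows per file so sizes stay reasonable
    .mode("overwrite")
    .parquet("s3a://my-bucket/output/")         # placeholder bucket/path
)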
In short, you should avoid creating too many files, especially small ones. Also note: you will see a big performance hit when you go to read those 2000*n files back in as well.
We use all of the above strategies in different situations, but in general we just try to use a reasonable partitioning strategy plus repartitioning before the write. Another note: if a shuffle is performed, your partitioning is destroyed and Spark's automatic partitioning takes over; hence the need for the constant repartitioning.
Hope these suggestions help. Spark IO is quite frustrating, but just remember to keep the number of files read/written to a minimum and you should see fine performance.
Use version 2 of the FileOutputCommitter:
.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Python vs Scala (for Spark jobs)

I am pretty new to Spark, currently exploring it by playing with pyspark and spark-shell.
So here is the situation: I run the same Spark job with pyspark and with spark-shell.
This is from pyspark:
textfile = sc.textFile('/var/log_samples/mini_log_2')
textfile.count()
And this one from spark-shell:
textfile = sc.textFile("file:///var/log_samples/mini_log_2")
textfile.count()
I tried both of them several times; the first (Python) one takes 30-35 seconds to complete, while the second one (Scala) takes about 15 seconds. I am curious about what may cause these different performance results. Is it because of the choice of language, or does spark-shell do something in the background that pyspark doesn't?
UPDATE
So I did some tests on larger datasets, about 550 GB (zipped) in total. I am using Spark Standalone as the master.
I observed that while using pyspark, tasks are shared equally among executors. However, when using spark-shell, tasks are not shared equally: more powerful machines get more tasks while weaker machines get fewer.
With spark-shell the job finishes in 25 minutes, and with pyspark it takes around 55 minutes. How can I make Spark Standalone assign tasks with pyspark the way it assigns them with spark-shell?
Using Python has some overhead, but its significance depends on what you're doing.
Recent reports indicate the overhead isn't very large (specifically for the newer DataFrame API).
Some of the overhead you encounter relates to constant per-job overhead, which is almost irrelevant for large jobs.
You should do a sample benchmark with a larger data set and see whether the overhead is a constant addition or proportional to the data size.
Another potential bottleneck is operations that apply a Python function to each element (map, etc.) - if these operations are relevant for you, you should test them too.
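A rough sketch of such a test, comparing a per-element Python function against a DataFrame expression that stays in the JVM (the row count is a placeholder; absolute timings will vary):

import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("overhead-test").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10_000_000))
start = time.perf_counter()
rdd.map(lambda x: x * 2).count()               # every element crosses the Python/JVM boundary
print("python map:", time.perf_counter() - start)

df = spark.range(10_000_000)
start = time.perf_counter()
df.select((F.col("id") * 2).alias("doubled")).count()   # stays inside the JVM
print("dataframe expr:", time.perf_counter() - start)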
