I am pretty new to Spark and am currently exploring it by playing with pyspark and spark-shell.
Here is the situation: I run the same Spark job with pyspark and with spark-shell.
This is from pyspark:
textfile = sc.textFile('/var/log_samples/mini_log_2')
textfile.count()
And this one from spark-shell:
textfile = sc.textFile("file:///var/log_samples/mini_log_2")
textfile.count()
I tried both of them several times; the first (Python) one takes 30-35 seconds to complete, while the second one (Scala) takes about 15 seconds. I am curious what may cause these different performance results. Is it the choice of language, or does spark-shell do something in the background that pyspark doesn't?
UPDATE
So I did some tests on a larger dataset, about 550 GB (zipped) in total, using Spark Standalone as the master.
I observed that while using pyspark, tasks are shared equally among the executors. However, when using spark-shell, tasks are not shared equally: more powerful machines get more tasks while weaker machines get fewer.
With spark-shell the job finishes in 25 minutes, while with pyspark it takes around 55 minutes. How can I make Spark Standalone assign tasks to pyspark the same way it assigns them to spark-shell?
Using Python has some overhead, but its significance depends on what you're doing.
Recent reports indicate the overhead isn't very large (specifically for the newer DataFrame API).
Some of the overhead you encounter is a constant per-job cost, which is almost irrelevant for large jobs.
You should do a sample benchmark with a larger data set and see whether the overhead is a constant addition or proportional to the data size.
Another potential bottleneck is operations that apply a Python function to each element (map, etc.); if these operations are relevant for you, you should test them too.
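To get a feel for the per-element cost in isolation, a minimal sketch along these lines could work (reusing the sample path from the question and assuming a SparkContext sc as in the pyspark shell; the timings are illustrative only):
import time
rdd = sc.textFile('/var/log_samples/mini_log_2')
start = time.time()
rdd.count()  # action executed entirely on the JVM side
print('plain count:', time.time() - start)
start = time.time()
rdd.map(lambda line: len(line)).sum()  # each element is shipped through a Python worker
print('python map + sum:', time.time() - start)
If the second timing grows much faster than the first as the data grows, the Python-function overhead is your bottleneck rather than the constant per-job cost.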
Related
I'm trying to force eager evaluation for PySpark, using the count methodology I read online:
spark_df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
spark_df.cache().count()
However, when I run the code, the cache-and-count part takes forever. My data size is relatively small (2.7 GB, 15 million rows), but after 28 minutes of running I decided to kill the job. For comparison, reading the same data with pandas.read_sql() took only 6 minutes 43 seconds.
The machine I'm running the code on is pretty powerful (20 vCPUs, 160 GB RAM, Windows OS). I believe I'm missing a step to speed up the count statement.
Any help or suggestions are appreciated.
When you read with pandas, it will use as much of the machine's available memory as it needs (you mentioned 160 GB, which is far larger than the data itself, ~3 GB).
However, it's not the same with Spark. When you start your Spark session, you typically have to state upfront how much memory per executor (and for the driver, and the application manager if applicable) you want to use, and if you don't specify it, it defaults to 1 GB according to the latest Spark documentation. So the first thing you want to do is give more memory to your executors and your driver.
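A rough sketch of that, assuming you create the session yourself (the sizes are made up for a 160 GB machine and should be tuned; in client mode the driver memory usually has to be set before the JVM starts, e.g. via --driver-memory on spark-submit or in spark-defaults.conf):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName('jdbc-read')
         .config('spark.executor.memory', '32g')  # illustrative value, leave headroom for the OS
         .config('spark.driver.memory', '16g')    # may need to go on the spark-submit command line instead
         .getOrCreate())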
Second, reading from JDBC with Spark is tricky, because the speed depends on the number of executors (and tasks), those numbers depend on how many partitions the RDD read from the JDBC connection has, and the number of partitions depends on your table, your query, columns, conditions, etc. One way to change this behavior, i.e. to get more partitions, more tasks and more executors, is via these configurations: numPartitions, partitionColumn, lowerBound, and upperBound.
numPartitions is the number of partitions (and hence the maximum number of parallel tasks and concurrent JDBC connections)
partitionColumn is an integer-typed column that Spark uses to split the read into ranges
lowerBound is the minimum value of partitionColumn that you want to read
upperBound is the maximum value of partitionColumn that you want to read
You can read more here: https://stackoverflow.com/a/41085557/3441510. The basic idea is that you want a reasonable number of partitions (defined by numPartitions) so that each executor processes an equally sized, distinct chunk of the data (defined by partitionColumn, lowerBound and upperBound).
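Reusing the variable names from the question, a partitioned read might look roughly like this; the column name 'id' and the bounds are assumptions and should be replaced with the real minimum and maximum of an integer column in your table:
spark_df = spark.read.jdbc(
    url=jdbcUrl,
    table=pushdown_query,
    column='id',                 # hypothetical integer column used for range partitioning
    lowerBound=1,                # smallest value of that column you want to read
    upperBound=15000000,         # largest value (~15 million rows in the question)
    numPartitions=20,            # number of partitions, hence parallel JDBC reads
    properties=connectionProperties)
spark_df.cache().count()
Spark then issues 20 range queries on id instead of one big query, so the read and the subsequent count are spread across the executors.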
I currently have a Spark cluster with 1 driver and 2 workers on version 2.4.5.
I would like to go further in optimizing parallelism to get better throughput when loading and processing data; while doing this I often get these messages on the console:
WARN scheduler.TaskSetManager: Stage contains a task of very large size (728 KB). The maximum recommended task size is 100 KB.
How does this work? I am fairly new to Spark but understand the basics of it. I would like to know how to optimize this, but I'm not sure whether it involves configuring the slaves to have more executors and thereby more parallelism, or whether I need to partition my DataFrames with either the coalesce or repartition functions.
Thank you guys in advance!
The general gist here is that you need to repartition to get more, but smaller, partitions, so as to get more parallelism and higher throughput. The 728 KB is an arbitrary number related to your stage. I ran into this sometimes when I first started out with Scala and Spark.
I cannot see your code, so I will leave it at this, but searching here on SO also points to a lack of parallelism. In all honesty, this is a quite well-known issue.
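Without your code this can only be a hedged sketch, but the shape of the fix usually looks something like this (the input path, column name and partition count are placeholders):
df = spark.read.parquet('/path/to/input')
df = df.repartition(200)      # full shuffle; 2-4 partitions per core in the cluster is a common starting point
# df = df.coalesce(50)        # cheaper alternative when you only want to reduce the partition count (no shuffle)
result = df.groupBy('some_key').count()
result.show()
More, smaller partitions mean more, smaller tasks and higher parallelism, which is what this answer is recommending.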
I have a Python application that processes in-memory data. It provides a <1 second response by querying ~1 million records and then aggregating the result set.
What would be the best Python framework(s) to make this application more scalable?
Here are more details:
Data comes from a single table on disk and is loaded into memory as numpy arrays, with custom indexes built from dictionaries.
The application starts breaching the 1 second limit when the number of records grows beyond 5 million. The search part, i.e. locating the indexes, takes only 100 ms; most of the time (900 to 2000 milliseconds) is spent just summing up the result set.
I can also see that the CPU cores and RAM are not used to their full capacity: each core is used only up to 20% and plenty of memory is free.
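To illustrate the shape of the workload, here is a simplified sketch (with made-up sizes and index names, not the real code):
import numpy as np
data = np.random.rand(5000000, 4)        # stand-in for the in-memory table
index = {'key_a': np.arange(1000000)}    # stand-in for a custom dictionary index
rows = index['key_a']                    # the fast search/lookup step
total = data[rows, 2].sum()              # the aggregation step, where most of the time goes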
I just read through a long list of Python frameworks for distributed computing. I am looking for specific solutions for real-time responses, by:
making better use of the available CPU and RAM of a single machine through parallel processing, to stay within the <1 second response time;
later, extending beyond a single machine to support ~100 million records. This data lives in a single table/file that can be horizontally partitioned across many machines, and those machines can work independently on their own data.
Suggestions based on what you have seen working in your past experience are greatly appreciated.
Spark newbie here. I tried to do some pandas-style actions on my data frame using Spark, and surprisingly it's slower than pure Python (i.e. using the pandas package in Python). Here's what I did:
1)
In Spark:
train_df.filter(train_df.gender == '-unknown-').count()
It takes about 30 seconds to get results back. But using Python it takes about 1 second.
2) In Spark:
sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()
Same thing, takes about 30 sec in Spark, 1 sec in Python.
Several possible reasons my Spark is much slower than pure Python:
1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.
2) My Spark is running locally and I should run it on something like Amazon EC2 instead.
3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.
4) Spark is slow because I'm running Python. If I were using Scala it would be much better. (Counter-argument: I heard lots of people are using PySpark just fine.)
Which one of these is most likely the reason, or the most credible explanation? I would love to hear from some Spark experts. Thank you very much!!
Python will definitely perform better than PySpark on smaller data sets. You will see the difference when you are dealing with larger data sets.
By default, when you run Spark with a SQLContext or HiveContext, it uses 200 shuffle partitions. You need to change it to 10, or whatever value fits your data, using sqlContext.sql("set spark.sql.shuffle.partitions=10"); it will definitely be faster than with the default.
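For example, with the query from the question (10 is just the value from this answer; something close to your core count is a sensible starting point):
sqlContext.sql("set spark.sql.shuffle.partitions=10")  # lower the shuffle partition count first
sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()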
1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.
You are right, you will not see much difference at lower volumes. Spark can be slower as well.
2) My Spark is running locally and I should run it on something like Amazon EC2 instead.
For your volume it might not help much.
3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.
Again, it does not matter for a ~24 MB data set.
4) Spark is slow because I'm running Python. If I were using Scala it would be much better. (Counter-argument: I heard lots of people are using PySpark just fine.)
On a standalone setup there will be a difference: Python has more runtime overhead than Scala, but on a larger cluster with distributed execution it need not matter.
What I'm trying to do
I'm new to Hadoop and I'm trying to run MapReduce several times with different numbers of mappers and reducers and compare the execution times. The file size is about 1 GB, and I'm not specifying the split size, so it should be 64 MB. I'm using a machine with 4 cores.
What I've done
The mapper and reducer are written in Python, so I'm using Hadoop Streaming. I specified the number of map tasks and reduce tasks with '-D mapred.map.tasks=1 -D mapred.reduce.tasks=1'.
Problem
Because I specified 1 map task and 1 reduce task, I expected to see just one attempt, but I actually got 38 map attempts and 1 reduce task. I read tutorials and SO questions similar to this problem, and some said that the default number of map tasks is 2, yet I'm getting 38 map tasks. I also read that mapred.map.tasks only suggests a number and that the actual number of map tasks equals the number of input splits. However, 1 GB divided by 64 MB is about 17, so I still don't understand why 38 map tasks were created.
1) If I want to use only 1 map task, do I have to set the input split size to 1 GB?
2) Let's say I successfully specify that I want to use only 2 map tasks. Does it use 2 cores, with each core running 1 map task?
The number of mappers is actually governed by the InputFormat you are using. That said, the InputFormat may vary based on the type of data you are processing. Normally, for data stored as files in HDFS, FileInputFormat or one of its subclasses is used, which works on the principle of MR split = HDFS block. However, this is not always true. Say you are processing a flat binary file. In that case there is no delimiter (\n or anything else) to mark a split boundary. What would you do then? So the above principle doesn't always work.
Consider another scenario where you are processing data stored in a DB rather than in HDFS. What happens then, given that there is no concept of a 64 MB block size when we talk about DBs?
The framework tries its best to carry out the computation as efficiently as possible, which might involve creating fewer or more mappers than you specified or expected. So, to see how exactly the mappers get created, you need to look at the InputFormat used in your job, the getSplits() method to be precise.
If I want to use only 1 map task, do I have to set the input split size to 1 GB?
You can override the isSplitable(FileSystem, Path) method of your InputFormat to ensure that the input files are not split up and are processed as a whole by a single mapper.
Let's say I successfully specify that I want to use only 2 map tasks. Does it use 2 cores, with each core running 1 map task?
It depends on availability: multiple mappers can run simultaneously on different cores, and a single core can run multiple mappers one after the other.
One addition to your question 2: the parallelism of map/reduce tasks on a node is controllable. You can set the maximum number of map/reduce tasks that a TaskTracker runs simultaneously via mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum. The default for both parameters is 2. For a 4-core node, mapreduce.tasktracker.map.tasks.maximum should be increased to at least 4, i.e. to make use of every core; 2 for max reduce tasks should be fine. By the way, finding the best values for the max map/reduce tasks is non-trivial, as it depends on the degree of job parallelism on the cluster, whether the mappers/reducers of the job(s) are IO- or compute-intensive, etc.