Spark newbie here. I tried to do some pandas action on my data frame using Spark, and surprisingly it's slower than pure Python (i.e. using pandas package in Python). Here's what I did:
1) In Spark:
train_df.filter(train_df.gender == '-unknown-').count()
It takes about 30 seconds to get results back. But using Python it takes about 1 second.
2) In Spark:
sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()
Same thing, takes about 30 sec in Spark, 1 sec in Python.
Several possible reasons my Spark is much slower than pure Python:
1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.
2) My spark is running locally and I should run it in something like Amazon EC2 instead.
3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.
4) Spark is slow because I'm running Python. If I'm using Scala it would be much better. (Con argument: I heard lots of people are using PySpark just fine.)
Which one of these is most likely the reason, or the most credible explanation? I would love to hear from some Spark experts. Thank you very much!!
Python will definitely perform better compared to pyspark on smaller data sets. You will see the difference when you are dealing with larger data sets.
By default, when you run Spark SQL through a SQLContext or HiveContext, it uses 200 shuffle partitions. You need to change that to 10, or whatever value fits your data, using sqlContext.sql("set spark.sql.shuffle.partitions=10"); it will definitely be faster than with the default.
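For example, a minimal sketch (assuming the same sqlContext and registered train table from the question) of lowering the shuffle partition count before re-running the aggregation:
# Lower the shuffle partition count; the default of 200 means ~200 tiny tasks
# for a 24 MB data set, which is mostly scheduling overhead.
sqlContext.sql("set spark.sql.shuffle.partitions=10")
# The GROUP BY now shuffles into 10 partitions instead of 200.
sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()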
1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.
You are right, you will not see much difference at lower volumes. Spark can be slower as well.
2) My spark is running locally and I should run it in something like Amazon EC2 instead.
For your volume it might not help much.
3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.
Again, it does not matter for a 24 MB data set.
4) Spark is slow because I'm running Python. If I'm using Scala it would be much better. (Con argument: I heard lots of people are using PySpark just fine.)
In standalone (local) mode there will be a difference. Python has more runtime overhead than Scala, but on a larger cluster with distributed workloads it need not matter.
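To make the comparison concrete, here is a rough sketch of the two operations being timed side by side (train_df is the Spark DataFrame from the question; the pandas load from train.csv is a made-up stand-in for however the data was actually loaded):
import time
import pandas as pd
# Pure pandas: single process, in memory, no job scheduling.
train_pd = pd.read_csv("train.csv")  # hypothetical file name
start = time.time()
print((train_pd["gender"] == "-unknown-").sum(), time.time() - start)
# PySpark: the same logical query, but every action launches a job,
# so task scheduling and Python<->JVM traffic dominate at 24 MB.
start = time.time()
print(train_df.filter(train_df.gender == "-unknown-").count(), time.time() - start)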
Related
I'm just a Python newbie who's been having fun working with data in Python.
When I started using pandas, Python's go-to data tool, it seemed like it would handle Excel files very quickly.
However, I was somewhat disappointed to see it take 1 to 2 minutes to read an .xlsx file with 470,000 rows, and while looking for something faster I found that modin with ray (or dask) should speed things up.
After learning the basic usage shown below, I compared it against plain pandas (this time on 100 million rows of data, about 5 GB).
import ray
ray.init()
import modin.pandas as md
Then, in a separate notebook cell:
%%time
TB = md.read_csv('train.csv')
TB
But the read took only 1 minute 3 seconds with plain pandas, while it took 1 minute 9 seconds with modin[ray].
I was disappointed to see that, far from being a small difference in modin's favour, it actually took longer.
How can I make modin faster than pandas? Does it only pay off for complex operations such as groupby or merge? Is there little difference when simply reading data?
Other people report that modin reads data faster, so is there something wrong with my machine's settings? I want to know why.
Here is how I installed the packages at the prompt, just in case it is relevant:
!pip install modin[ray]
!pip install ray[default]
First off, to do a fair assessment you always need to use the %%timeit magic command, which gives you an average of multiple runs.
Modin generally works best when you have:
Very large files
Large number of cores
The unimpressive performance in your case is, I believe, largely due to the multiprocessing management done by Ray/Dask, e.g. worker scheduling and all the setup that goes into parallelisation. When you meet at least one of the two criteria above (especially the first, given any current processor), the trade-off between that resource management and the speed-up you get from Modin works in your favour, but neither a 5 GB file nor 6 cores is large enough to tip it. Parallelisation is costly, and the task must be worth it.
If it is a one-off, 1-2 minutes is not an unreasonable amount of time at all for this sort of thing. If it is a file that you are going to continuously read and write I would recommend writing it to HDF5 or pickle format in which case your read/write performance will improve far more than just using Modin.
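For instance, a minimal sketch of paying the CSV parsing cost once and reading a binary format afterwards (file names are made up for illustration):
import pandas as pd
# One-off conversion: parse the 5 GB CSV a single time.
df = pd.read_csv("train.csv")
df.to_pickle("train.pkl")            # or df.to_hdf("train.h5", key="train")
# Later sessions load the binary file directly, which is typically much
# faster than re-parsing the text.
df = pd.read_pickle("train.pkl")     # or pd.read_hdf("train.h5", key="train")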
Alternatively, Vaex is the fastest option around for reading any df. Though, I personally think it is still very incomplete and sometimes doesn't match the promises made about it beyond simple numerical-data operations, e.g. when you have large strings in your data.
I'm trying to force eager evaluation for PySpark, using the count methodology I read online:
spark_df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
spark_df.cache().count()
However, when I try running the code, the cache count part is taking forever to run. My data size is relatively small (2.7GB, 15 mil rows), but after 28 min of running, I decided to kill the job. For comparison, when I use pandas.read_sql() method to read the data, it took only 6 min 43 seconds.
The machine I'm running the code on is pretty powerful, (20 vCPU, 160 GB RAM, Windows OS). I believe I'm missing a step to speed up the count statement.
Any help or suggestions are appreciated.
When you read with pandas, it will use as much of the machine's available memory as it needs (you mentioned 160 GB, which is far larger than the data itself at ~3 GB).
However, it's not the same with Spark. When you start your Spark session, you typically have to say upfront how much memory per executor (and driver, and application manager if applicable) you want to use, and if you don't specify it, it defaults to 1 GB according to the latest Spark documentation. So the first thing you want to do is to give more memory to your executors and driver.
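For example, a sketch of raising those limits when building the session; the sizes below are placeholders to adapt to your 160 GB machine, and note that driver memory generally has to be set before the JVM starts (e.g. via spark-submit --driver-memory or in the builder of a fresh session):
from pyspark.sql import SparkSession
# Placeholder sizes; tune to your machine and workload.
spark = (
    SparkSession.builder
    .appName("jdbc-read")
    .config("spark.driver.memory", "16g")      # only effective if no JVM is running yet
    .config("spark.executor.memory", "16g")
    .getOrCreate()
)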
Second, reading from JDBC with Spark is tricky, because the speed depends on the number of executors (and tasks), those numbers depend on how many partitions the RDD read from the JDBC connection has, and the number of partitions depends on your table, your query, columns, conditions, etc. One way to change this behaviour, so that you get more partitions, more tasks, more executors, ..., is via these configurations: numPartitions, partitionColumn, lowerBound, and upperBound.
numPartitions is the number of partitions (and hence the number of parallel tasks used for the read)
partitionColumn is an integer type column that Spark would use to target partitioning
lowerBound is the min value of partitionColumn that you want to read
upperBound is the max value of partitionColumn that you want to read
You can read more here: https://stackoverflow.com/a/41085557/3441510, but the basic idea is that you want a reasonable number of executors (defined by numPartitions), each processing an evenly distributed chunk of the data (defined by partitionColumn, lowerBound and upperBound).
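Putting that together, a hedged sketch of a partitioned JDBC read, reusing the variables from the question; the column name id and its bounds are assumptions, so substitute a roughly evenly distributed integer column from your table:
spark_df = spark.read.jdbc(
    url=jdbcUrl,
    table=pushdown_query,
    column="id",               # partitionColumn (assumed)
    lowerBound=1,
    upperBound=15000000,       # ~15 million rows mentioned in the question
    numPartitions=20,          # e.g. one partition per vCPU
    properties=connectionProperties,
)
spark_df.cache().count()       # the eager count now runs as 20 parallel chunks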
Since I couldn't find the best way to deal with my issue, I came here to ask.
I'm a beginner with Python, but I have to handle a large dataset.
However, I don't know the best way to handle the "Memory Error" problem.
I already have a 64-bit Python 3.7.3 installation.
I saw that I could use TensorFlow, or read the data in chunks with pandas, or use the Dask library, but I don't know which one best fits my problem, and as a beginner it's not very clear to me.
I have a huge dataset (over 100 million observations), and I don't think reducing the dataset would lower the memory use by much.
What I want to do is test multiple ML algorithms with train and test samples. I don't know how to deal with the problem.
Thanks!
This question is high level, so I'll provide some broad approaches for reducing memory usage in Dask (a short sketch follows the list):
Use a columnar file format like Parquet so you can leverage column pruning
Use column dtypes that require less memory, e.g. int8 instead of int64
Strategically persist in memory, where appropriate
Use a cluster that's sized well for your data (running an analysis on 2GB of data requires different amounts of memory than 2TB)
Split data into multiple files so it's easier to process in parallel
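A minimal Dask sketch illustrating a few of these points; the file pattern and column names are made up:
import dask.dataframe as dd
# Assumes the data was already split into many Parquet files under data/.
ddf = dd.read_parquet(
    "data/*.parquet",
    columns=["category", "value"],   # column pruning: read only what you need
)
# Downcast dtypes to shrink the in-memory footprint.
ddf["value"] = ddf["value"].astype("int32")
ddf["category"] = ddf["category"].astype("category")
# Persist only if the result is reused and fits in (cluster) memory.
ddf = ddf.persist()
print(ddf.memory_usage(deep=True).sum().compute())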
Your data has 100 million rows, which isn't that big (unless it had thousands of columns). Big data typically has billions or trillions of rows.
Feel free to add questions that are more specific and I can provide more specific advice. You can provide the specs of your machine / cluster, the memory requirements of the DataFrame (via ddf.memory_usage(deep=True)) and the actual code you're trying to run.
I currently have a Spark cluster of 1 Driver and 2 Workers on version 2.4.5.
I would like to go further on optimizing parallelism to get a better throughput when loading and processing data, when I am doing this I often get these messages on the console:
WARN scheduler.TaskSetManager: Stage contains a task of very large size (728 KB). The maximum recommended task size is 100 KB.
How does this work? I am fairly new to Spark but understand the basics of it. I would like to know how to optimize this, but I'm not sure whether it involves configuring the slaves to have more executors and thereby get more parallelism, or whether I need to partition my DataFrames with either the coalesce or repartition functions.
Thank you guys in advance!
The general gist here is that you need to repartition to get more, but smaller, partitions, so as to get more parallelism and higher throughput. The 728 KB is an arbitrary number related to your stage. I ran into this sometimes when I first started out with Scala and Spark.
I cannot see your code, so I will leave it at that, but searching here on SO also points to a lack of parallelism. In all honesty, it is a quite well-known issue.
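As a rough sketch of the repartitioning advice (df stands in for your DataFrame; the partition count is a placeholder, a common rule of thumb being 2-4 partitions per CPU core across the cluster):
print(df.rdd.getNumPartitions())   # see how the data is currently split
# More, smaller partitions -> more tasks that can run in parallel.
df = df.repartition(48)
# Note: coalesce(n) only merges partitions (no shuffle), so it can lower the
# partition count but not raise it; use repartition(n) to increase it.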
I am pretty new to Spark, currently exploring it by playing with pyspark and spark-shell.
So here is the situation: I run the same Spark jobs with pyspark and with spark-shell.
This is from pyspark:
textfile = sc.textFile('/var/log_samples/mini_log_2')
textfile.count()
And this one from spark-shell:
textfile = sc.textFile("file:///var/log_samples/mini_log_2")
textfile.count()
I tried both of them several times; the first (Python) one takes 30-35 seconds to complete, while the second one (Scala) takes about 15 seconds. I am curious what may cause these different results. Is it the choice of language, or does spark-shell do something in the background that pyspark doesn't?
UPDATE
So I did some tests on larger datasets, about 550 GB (zipped) in total. I am using Spark Standalone as master.
I observed that while using pyspark, tasks are shared equally among executors. However, when using spark-shell, tasks are not shared equally: more powerful machines get more tasks while weaker machines get fewer.
With spark-shell the job is finished in 25 minutes, and with pyspark it takes around 55 minutes. How can I make Spark Standalone assign tasks with pyspark the way it assigns them with spark-shell?
Using Python has some overhead, but its significance depends on what you're doing.
Recent reports indicate the overhead isn't very large (specifically for the new DataFrame API).
Some of the overhead you encounter relates to a constant per-job overhead, which is almost irrelevant for large jobs.
You should do a sample benchmark with a larger data set and see whether the overhead is a constant addition or proportional to the data size.
Another potential bottleneck is operations that apply a Python function to each element (map, etc.); if these operations are relevant for you, you should test them too.
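For example, a rough sketch of that kind of test, assuming a SparkSession named spark (on older versions, sqlContext.read.text works the same way) and the log file from the question:
import time
# Built-in DataFrame count: planned and executed on the JVM,
# Python only drives the job.
df = spark.read.text("file:///var/log_samples/mini_log_2")
start = time.time()
print(df.count(), time.time() - start)
# Per-element Python lambda: every record passes through Python worker
# processes, so this overhead grows with data size instead of staying constant.
rdd = sc.textFile("file:///var/log_samples/mini_log_2")
start = time.time()
print(rdd.map(lambda line: len(line)).sum(), time.time() - start)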