PySpark - df.cache().count() taking forever to run

I'm trying to force eager evaluation in PySpark, using the cache-and-count approach I read about online:
spark_df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
spark_df.cache().count()
However, when I run the code, the cache-and-count step takes forever. My data is relatively small (2.7 GB, 15 million rows), but after 28 minutes of running I decided to kill the job. For comparison, reading the same data with pandas.read_sql() took only 6 minutes 43 seconds.
The machine I'm running the code on is pretty powerful (20 vCPUs, 160 GB RAM, Windows OS). I believe I'm missing a step to speed up the count statement.
Any help or suggestions are appreciated.

When you read with pandas, it uses as much of the machine's available memory as it needs (all 160 GB in your case, which is far larger than the data itself, ~3 GB).
However, it's not the same with Spark. When you start your Spark session, you typically have to specify upfront how much memory each executor (and the driver, and the application manager if applicable) should use; if you don't specify it, it defaults to 1 GB according to the latest Spark documentation. So the first thing you want to do is give more memory to your executors and driver.
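For example, a minimal sketch of raising those limits when building the session (the sizes here are placeholders to tune for your machine, and driver memory may need to be set before the JVM starts, e.g. via spark-submit):

from pyspark.sql import SparkSession

# Hypothetical sizes for a 160 GB machine; adjust to your workload.
spark = (SparkSession.builder
         .appName("jdbc-read")
         .config("spark.driver.memory", "16g")    # only takes effect if the JVM isn't already running
         .config("spark.executor.memory", "16g")
         .getOrCreate())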
Second, reading from JDBC with Spark is tricky, because the slowness (or not) depends on the number of executors and tasks, those numbers depend on how many partitions the RDD that reads from the JDBC connection has, and the number of partitions depends on your table, your query, columns, conditions, etc. One way to force a change in behavior, i.e. to get more partitions, more tasks, more executors, is via these options: numPartitions, partitionColumn, lowerBound, and upperBound.
numPartitions is the number of partitions (and hence how many tasks can read in parallel)
partitionColumn is an integer-type column that Spark uses to split the read into partitions
lowerBound is the minimum value of partitionColumn that you want to read
upperBound is the maximum value of partitionColumn that you want to read
You can read more here: https://stackoverflow.com/a/41085557/3441510, but the basic idea is that you want a reasonable number of executors (set via numPartitions), each processing an evenly distributed chunk of data (defined by partitionColumn, lowerBound and upperBound).
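As a rough sketch, assuming the table has an integer column (hypothetically called id here) whose range you roughly know, the read from the question could become:

# Sketch only: "id", the bounds, and numPartitions=20 are placeholder assumptions.
spark_df = spark.read.jdbc(
    url=jdbcUrl,
    table=pushdown_query,
    column="id",              # the partition column
    lowerBound=1,
    upperBound=15000000,      # roughly the 15 million rows mentioned above
    numPartitions=20,
    properties=connectionProperties,
)
spark_df.cache().count()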

Related

PySpark Task Size

I currently have a Spark cluster of 1 Driver and 2 Workers on version 2.4.5.
I would like to optimize parallelism further to get better throughput when loading and processing data. While doing this I often get these messages on the console:
WARN scheduler.TaskSetManager: Stage contains a task of very large size (728 KB). The maximum recommended task size is 100 KB.
How does this work? I am fairly new to Spark but understand the basics of it. I would like to know how to optimize this, but I'm not sure whether it involves configuring the workers to have more executors and get more parallelism that way, or whether I need to partition my DataFrames with either the coalesce or repartition functions.
Thank you guys in advance!
The general gist here is that you need to repartition to get more, but smaller, partitions, so as to get more parallelism and higher throughput. The 728 KB is an arbitrary number tied to your stage; I hit this sometimes when I first started out with Scala and Spark.
I cannot see your code, so I'll leave it at this, but searching here on SO also points to a lack of parallelism. In all honesty, it's a quite well-known issue.
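A minimal sketch of what that repartitioning looks like (200 is an arbitrary starting point, not a recommendation for your data):

# Check the current partition count, then split into more, smaller partitions.
print(df.rdd.getNumPartitions())
df = df.repartition(200)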

How does dask work for larger than memory datasets

Would anyone be able to tell me, in simple terms, how dask works for larger-than-memory datasets? For example, I have a 6 GB dataset on a machine with 4 GB of RAM and 2 cores. How would dask go about loading the data and doing a simple calculation such as the sum of a column?
Does dask automatically check the size of the memory and chunk the dataset into smaller-than-memory pieces? Then, once asked to compute, does it bring the chunks into memory one by one and do the computation using each of the available cores? Am I right about this?
Thanks
Michael
By "dataset" you are apparently referring to a dataframe. Let's consider two file formats from which you may be loading: CSV and parquet.
For CSVs, there is no inherent chunking mechanism in the file, so you, the user, can choose the bytes-per-chunk appropriate for your application using dd.read_csv(path, blocksize=..), or allow Dask to try to make a decent guess; "100MB" may be a fine size to try.
For parquet, the format itself has internal chunking of the data, and Dask will make use of this pattern in loading the data.
In both cases, each worker will load one chunk at a time, and calculate the column sum you have asked for. Then, the loaded data will be discarded to make space for the next one, only keeping the results of the sum in memory (a single number for each partition). If you have two workers, two partitions will be in memory and processed at the same time. Finally, all the sums are added together.
Thus, each partition should comfortably fit into memory (not be too big), but the time it takes to load and process each one should be much longer than the overhead of scheduling the task on a worker, which is under 1 ms (so not too small either).
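A minimal sketch of the CSV case described above (the path, blocksize and column name are placeholders):

import dask.dataframe as dd

# Each ~100 MB block becomes one partition; workers load and reduce them a few at a time.
ddf = dd.read_csv('data/*.csv', blocksize='100MB')
total = ddf['amount'].sum().compute()   # 'amount' is a hypothetical column
print(total)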

How do I get PySpark to write intermediate results to disk before running out of memory?

Background: In Hadoop Streaming, each reduce job writes to HDFS as it finishes, thus clearing the way for the Hadoop cluster to execute the next reduce.
I am having trouble mapping this paradigm to (Py)Spark.
As an example,
df = spark.read.load('path')
df.rdd.reduceByKey(my_func).toDF().write.save('output_path')
When I run this, the cluster collects all of the data in the dataframe before it writes anything to disk. At least this is what it looks like is happening as I watch the job progress.
My problem is that my data is much bigger than my cluster memory, so I run out of memory before any data is written. In Hadoop Streaming, we don't have this problem because the output data is streamed to the disk to make room for the subsequent batches of data.
I have considered something like this:
for i in range(100):
    (df.filter(df.loop_index == i)
       .rdd
       .reduceByKey(my_func)
       .toDF()
       .write.mode('append')
       .save('output_path'))
where I only process a subset of my data in each iteration. But this seems kludgy, mainly because I have to either persist df, which isn't possible because of memory constraints, or re-read it from the input HDFS source in each iteration.
One way to make the loop work is to partition the source folders by day or some other subset of the data. But for the sake of the question, let's assume that isn't possible.
Questions: How do I run a job like this in PySpark? Do I just have to have a much bigger cluster? If so, what are the common practices for sizing a cluster before processing the data?
It might help to repartition your data into a large number of partitions. The example below would be similar to your for loop, although you may want to try fewer partitions first:
df = spark.read.load('path').repartition(100)
You should also review the number of executors you are currently using (--num-executors). Reducing this number should also reduce your memory footprint.
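As a rough sketch of how that fits into your pipeline (my_func and the key/value columns are placeholders based on your example):

df = spark.read.load('path').repartition(100)

(df.rdd
   .map(lambda row: (row['key'], row['value']))  # assumes each record reduces to a (key, value) pair
   .reduceByKey(my_func)
   .toDF(['key', 'value'])
   .write.mode('append')
   .save('output_path'))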

Python vs Scala (for Spark jobs)

I am pretty new to Spark, currently exploring it by playing with pyspark and spark-shell.
So here is the situation: I run the same Spark job with pyspark and with spark-shell.
This is from pyspark:
textfile = sc.textFile('/var/log_samples/mini_log_2')
textfile.count()
And this one from spark-shell:
textfile = sc.textFile("file:///var/log_samples/mini_log_2")
textfile.count()
I tried both of them several times; the first (Python) one takes 30-35 seconds to complete, while the second one (Scala) takes about 15 seconds. I am curious what may cause these different performance results. Is it the choice of language, or does spark-shell do something in the background that pyspark doesn't?
UPDATE
So I did some tests on larger datasets, about 550 GB (zipped) in total. I am using Spark Standalone as master.
I observed that while using pyspark, tasks are shared equally among executors. However, when using spark-shell, tasks are not shared equally: more powerful machines get more tasks while weaker machines get fewer.
With spark-shell the job finishes in 25 minutes, and with pyspark it takes around 55 minutes. How can I make Spark Standalone assign tasks with pyspark the way it assigns them with spark-shell?
Using Python has some overhead, but its significance depends on what you're doing.
Recent reports indicate the overhead isn't very large (specifically for the new DataFrame API).
Some of the overhead you encounter relates to constant per-job overhead, which is almost irrelevant for large jobs.
You should do a sample benchmark with a larger data set and see whether the overhead is a constant addition or proportional to the data size.
Another potential bottleneck is operations that apply a Python function to each element (map, etc.); if these operations are relevant for you, you should test them too.
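A minimal sketch of such a test, reusing the file from the question (the lowercasing lambda is just a stand-in for whatever per-element work you do):

import time

textfile = sc.textFile('/var/log_samples/mini_log_2')

start = time.time()
textfile.count()                                  # the counting stays inside the JVM
print("plain count: {:.1f}s".format(time.time() - start))

start = time.time()
textfile.map(lambda line: line.lower()).count()   # every record crosses the JVM/Python boundary
print("map + count: {:.1f}s".format(time.time() - start))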

Insert performance with Cassandra

Sorry for my English in advance.
I am a beginner with Cassandra and its data model. I am trying to insert one million rows into a Cassandra database, locally, on one node. Each row has 10 columns and I insert them into only one column family.
With one thread, that operation took around 3 minutes. I would like to do the same with 2 million rows while keeping a good time, so I tried with 2 threads to insert 2 million rows, expecting a similar result of around 3-4 minutes. But I got a result of about 7 minutes, twice the first result. From what I've read on different forums, multithreading is recommended to improve performance.
That is why I am asking this question: is it useful to use multithreading to insert data on a local node (client and server on the same computer), into only one column family?
Some information:
- I use pycassa
- I have separated the commit log directory and the data directory onto different disks
- I use batch inserts in each thread
- Consistency level: ONE
- Replication factor: 1
It's possible you're hitting the python GIL but more likely you're doing something wrong.
For instance, putting 2M rows in a single batch would be Doing It Wrong.
Try running multiple clients in multiple processes, NOT threads.
Then experiment with different insert sizes.
1M inserts in 3 mins is about 5500 inserts/sec, which is pretty good for a single local client. On a multi-core machine you should be able to get several times this amount provided that you use multiple clients, probably inserting small batches of rows, or individual rows.
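A rough sketch of that approach with pycassa and multiple processes (the keyspace, column family and batch size are placeholders, and the exact pycassa calls are written from memory of its docs, so treat them as assumptions):

from multiprocessing import Process
import pycassa

def insert_range(start, end):
    # Each process gets its own connection pool and column family handle.
    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')
    # Small automatic batches; queue_size is the knob to experiment with.
    with cf.batch(queue_size=100) as batch:
        for i in range(start, end):
            batch.insert('row%d' % i, dict(('col%d' % c, 'value') for c in range(10)))

if __name__ == '__main__':
    n_rows, n_procs = 2000000, 4
    chunk = n_rows // n_procs
    procs = [Process(target=insert_range, args=(p * chunk, (p + 1) * chunk))
             for p in range(n_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()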
You might consider Redis. Its single-node throughput is supposed to be faster. It's different from Cassandra though, so whether or not it's an appropriate option would depend on your use case.
The time taken doubled because you inserted twice as much data. Is it possible that you are I/O bound?
