How to create Python threads in PySpark code

I have around 70 Hive queries which I am executing in PySpark in sequence. I am looking at ways to improve the runtime by running the Hive queries in parallel. I am planning to do this by creating Python threads and running sqlContext.sql in those threads. Would this create threads in the driver and improve performance?

I am assuming you do not have any dependencies between these Hive queries, so they can run in parallel. You can accomplish this with threading, but I am not sure of the benefit in a single-user application: the total amount of resources in your cluster is fixed, so the total time to finish all the queries will be roughly the same, because the Spark scheduler will round-robin across the individual jobs when you multi-thread them.
https://spark.apache.org/docs/latest/job-scheduling.html explains this:
1) Spark uses a FIFO scheduler by default (which is what you are observing).
2) With threading you can use the FAIR scheduler instead.
3) In the method that is being threaded, set the pool:
sc.setLocalProperty("spark.scheduler.pool", <pool_id>)
4) The pool id needs to be different for each thread.
Example use case of threading from a code perspective:
from pyspark import SparkConf, SparkContext
import threading

# set the spark context to use FAIR scheduler mode
conf = SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
sc = SparkContext(conf=conf)

# Thread.join() does not return a value, so collect the resulting data frames in a dict
results = {}

# runs a query taking a spark context, pool id and query..
def runQuery(sc, pool_id, query):
    sc.setLocalProperty("spark.scheduler.pool", pool_id)
    ...  # <your code that builds df from the query>
    results[pool_id] = df

t1 = threading.Thread(target=runQuery, args=(sc, "1", <query1>))
t2 = threading.Thread(target=runQuery, args=(sc, "2", <query2>))
# start the threads...
t1.start()
t2.start()
# wait for the threads to complete...
t1.join()
t2.join()
# ...then pick up the returned data frames
df1 = results["1"]
df2 = results["2"]
As the Spark documentation indicates, you will not necessarily observe an improvement in overall throughput; fair scheduling is suited for multi-user sharing of resources. Hope this helps.
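For the original case of ~70 independent Hive queries, a thread pool is an easier way to manage the threads and collect the results (since Thread.join() itself returns nothing). The following is only a minimal sketch along the lines of the FAIR-scheduler setup above: `queries`, `run_query`, the pool ids and the worker count are illustrative assumptions, while sc and sqlContext are taken from the question.

from concurrent.futures import ThreadPoolExecutor

# `queries` is assumed to be the list of Hive query strings;
# `sc` and `sqlContext` are the existing SparkContext / SQLContext.
def run_query(pool_id, query):
    sc.setLocalProperty("spark.scheduler.pool", pool_id)  # one pool per submitted job
    return sqlContext.sql(query).collect()                # collect() forces execution

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(run_query, str(i), q) for i, q in enumerate(queries)]
    results = [f.result() for f in futures]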

Related

Dask scheduler empty / graph not showing

I have a setup as follows:
# etl.py
from dask.distributed import Client
import dask
from tasks import task1, task2, task3
def runall(**kwargs):
    print("done")

def etl():
    client = Client()
    tasks = {}
    tasks['task1'] = dask.delayed(task1)(*args)
    tasks['task2'] = dask.delayed(task2)(*args)
    tasks['task3'] = dask.delayed(task3)(*args)
    out = dask.delayed(runall)(**tasks)
    out.compute()
This logic was borrowed from Luigi and works nicely with if statements to control which tasks to run.
However, some of the tasks load large amounts of data from SQL and cause GIL freeze warnings (at least that is my suspicion; it is hard to diagnose exactly which line causes the issue). Sometimes the graph / monitoring shown on port 8787 does not show anything, just "scheduler empty", and I suspect this is caused by the app freezing Dask. What is the best way to load large amounts of data from SQL in Dask (MSSQL and Oracle)? At the moment this is done with SQLAlchemy with tuned settings. Would adding async and await help?
Also, some of the tasks are a bit slow and I'd like to use things like dask.dataframe or bag internally. The docs advise against calling delayed inside delayed. Does this also hold for dataframe and bag? The entire script runs on a single 40-core machine.
Using bag.starmap I get a graph (screenshot not reproduced here) where the upper straight lines are added / discovered once the computation reaches that task and compute is called inside it.
There appears to be no rhyme or reason to it, other than that the machine was / is very busy and struggling to show the state updates and Bokeh plots as desired.

How does LocalCluster() affect the number of tasks?

Do the calculations (like the Dask method dd.merge) need to be done inside or outside the LocalCluster context? Do final calculations (like .compute) need to be done inside or outside the LocalCluster context?
My main question is: how does LocalCluster() affect the number of tasks?
My colleague and I noticed that placing dd.merge outside of LocalCluster() reduced the number of tasks significantly (roughly 10x). What is the reason for that?
pseudo example
many tasks:
dd.read_parquet(somewhere, index=False)

with LocalCluster(
    n_workers=8,
    processes=True,
    threads_per_worker=1,
    memory_limit="10GB",
    ip="tcp://localhost:9895",
) as cluster, Client(cluster) as client:
    dd.merge(smth)
    smth.to_parquet(
        somewhere, engine="fastparquet", compression="snappy"
    )

few tasks:
dd.read_parquet(somewhere, index=False)
dd.merge(smth)

with LocalCluster(
    n_workers=8,
    processes=True,
    threads_per_worker=1,
    memory_limit="10GB",
    ip="tcp://localhost:9895",
) as cluster, Client(cluster) as client:
    smth.to_parquet(
        somewhere, engine="fastparquet", compression="snappy"
    )
The performance difference is due to the difference in the schedulers being used.
According to the Dask docs:
The Dask collections each have a default scheduler:
dask.dataframe uses the threaded scheduler by default.
The default scheduler is what is used when there is not another scheduler registered.
Additionally, according to the dask distributed docs:
When we create a Client object it registers itself as the default Dask scheduler. All .compute() methods will automatically start using the distributed system.
So when operating within the context manager for the cluster, computations implicitly use that scheduler.
A couple of additional notes:
It may be the case that the default scheduler is using more threads than the local cluster you are defining. It is also possible that a significant difference in performance is due to the overhead of inter-process communication that is not incurred by the threaded scheduler. More information about the schedulers is available here.
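To make the quoted behaviour concrete, here is a minimal sketch (the toy dataframe and worker counts are illustrative): the same .compute() call uses the default threaded scheduler outside the context manager and the LocalCluster once a Client has been created.

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

ddf = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)

# no Client yet: dask.dataframe falls back to its default (threaded) scheduler
result_threads = ddf.x.sum().compute()

with LocalCluster(n_workers=2, threads_per_worker=1) as cluster, Client(cluster) as client:
    # the Client is now registered as the default scheduler,
    # so the identical call runs on the local cluster instead
    result_cluster = ddf.x.sum().compute()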

Spark fair scheduling not working

I'm investigating the suitability of using spark as the back end for my REST API. A problem with that seems to be Spark's FIFO scheduling approach. This means that if a large task is under execution, no small task can finish until that heavy task has finished. According to https://spark.apache.org/docs/latest/job-scheduling.html a fair scheduler should fix this. However, I cannot notice this changing anything. Am I configuring the scheduler wrong?
scheduler.xml:
<?xml version="1.0"?>
<allocations>
  <pool name="test">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>10</minShare>
  </pool>
</allocations>
My code:
$ pyspark --conf spark.scheduler.mode=FAIR --conf spark.scheduler.allocation.file=/home/hadoop/scheduler.xml
>>> import threading
>>> sc.setLocalProperty("spark.scheduler.pool", "test")
>>> def heavy_spark_job():
...     pass  # Do some heavy work
...
>>> def smaller_spark_job():
...     pass  # Do something simple
...
>>> threading.Thread(target=heavy_spark_job).start()
>>> smaller_spark_job()
The smaller spark job can only start when the first task of the heavy spark job doesn't need all of the available CPU cores.
You just need to set different pools for your tasks:
By default, each pool gets an equal share of the cluster (also equal in share to each job in the default pool), but inside each pool, jobs run in FIFO order. For example, if you create one pool per user, this means that each user will get an equal share of the cluster, and that each user’s queries will run in order instead of later queries taking resources from that user’s earlier ones.
https://spark.apache.org/docs/latest/job-scheduling.html#default-behavior-of-pools
Also, in PySpark the child thread cannot inherit the parent thread's local properties, so you have to set the pool inside the thread target functions, as in the sketch below.
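A minimal sketch of that pattern, reusing the question's functions (the pool names are illustrative; pools not declared in the allocation file simply get default settings):

import threading

def heavy_spark_job():
    # set the pool inside the thread, because the local property is not inherited
    sc.setLocalProperty("spark.scheduler.pool", "pool_heavy")
    ...  # the heavy work from the question goes here

def smaller_spark_job():
    sc.setLocalProperty("spark.scheduler.pool", "pool_small")
    ...  # the lighter work goes here

threading.Thread(target=heavy_spark_job).start()
threading.Thread(target=smaller_spark_job).start()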
I believe you have implemented fair scheduling correctly. The problem is in the threading: you are not running smaller_spark_job as a thread, so it waits for heavy_spark_job to complete first. It should be like below.
threading.Thread(target=heavy_spark_job).start()
threading.Thread(target= smaller_spark_job).start()
Also, my post can help you verify FAIR scheduling from the Spark UI.

How to make spark run all tasks in a job concurrently?

I have a system where a REST API (Flask) uses spark-submit to send a job to an up-and-running PySpark application.
For various reasons, I need spark to run all tasks at the same time (i.e. I need to set the number of executors = the number of tasks during runtime).
For example, if I have twenty tasks and only 4 cores, I want each core to execute 5 tasks (executors) without having to restart spark.
I know I can set the number of executors when starting spark, but I don't want to do that since spark is executing other jobs.
Is this possible to achieve through a work around?
Use Spark scheduler pools. Here is an example of running multiple queries using scheduler pools (it is all the way at the end of the article; copied here for convenience); the same logic works for DStreams too:
https://docs.databricks.com/spark/latest/structured-streaming/production.html
// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("parquet").start(path1)
// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("orc").start(path2)

Python threading or multiprocess questions with sqlite3 and matplotlib

I have a python script that I'd like to run using two processes or threads. I am limited to two because I am connecting to an api/link which only has two licenses. I grab a license by importing their module and instantiating their class. Here are my issues:
I need to write to a sqlite3 database. I tried to share a db connection, pass it to the "worker" and have it create its own cursor, but I get stuck with a "database locked" message, and it seems that no matter how long I keep retrying, the lock doesn't clear. My program spends about 5 minutes loading data from a model, then about a minute processing the data and inserting it into the db. At the end, before I move to the next model, it does a commit(). I think I can live with just creating two separate databases, though.
After it writes to the database, I use matplotlib to create some plots and images and then save them to a file with a unique name. I kept getting "QApplication was not created in the main() thread" and "Xlib: unexpected async reply". I figure that switching from threading to multiprocessing may help with this.
I want to make sure only two threads or processes are running at once. What is the best way to accomplish this? With threading, I was doing the following:
c1 = load_lib_get_license()
c2 = load_lib_get_license()
prc_list = [...]  # list of models to process
# start with two dummy, never-started threads so is_alive() can be checked
t1 = threading.Thread()
t2 = threading.Thread()
while len(prc_list) > 0:
    if not t1.is_alive():
        t1 = threading.Thread(target=worker, args=(c1, db_connection, prc_list.pop(0)))
        t1.start()
    if not t2.is_alive():
        t2 = threading.Thread(target=worker, args=(c2, db_connection, prc_list.pop(0)))
        t2.start()
    while t1.is_alive() and t2.is_alive():
        sleep(1)
A Queue is probably what you're looking for; maybe the link in this previous answer can help you:
Sharing data between threads in Python
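As a rough illustration of the queue-based approach under the question's constraints (two licenses, so exactly two threads): `licensed_worker` is an illustrative name, while `worker`, `load_lib_get_license`, `db_connection` and `prc_list` are taken from the question's code.

import queue
import threading

work_q = queue.Queue()
for model in prc_list:              # prc_list: the models to process, as in the question
    work_q.put(model)

def licensed_worker(license_handle, db_connection):
    while True:
        try:
            model = work_q.get_nowait()
        except queue.Empty:
            return                                      # no models left, thread exits
        worker(license_handle, db_connection, model)    # the question's existing worker()
        work_q.task_done()

# exactly two threads, one per available license
threads = [
    threading.Thread(target=licensed_worker, args=(load_lib_get_license(), db_connection))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()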
