How to quickly instantiate a pyspark SparkContext for unit testing? - python

In my Python 3.8 unit test code, I need to instantiate a SparkContext to test some functions that manipulate an RDD.
The problem is that instantiating a SparkContext takes a few seconds, which is too slow. I'm using this code to instantiate a SparkContext:
from pyspark import SparkContext
return SparkContext.getOrCreate()
The tests only run filter, map and distinct on an RDD. In my tests, I only want to check an expected output against the obtained output on a tiny dataset (lists of length between 5 and 10), and I want them to run almost instantaneously. I don't need the other features offered by PySpark (such as parallelization).
I tried various approaches, such as the one below, but instantiation takes just as much time:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
sc = SparkContext(conf=conf)
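One option (not from the question, just a hedged sketch) is to pay the start-up cost once per test session by sharing a single local SparkContext across all tests, for example with a session-scoped pytest fixture; the fixture and test names below are illustrative:
# conftest.py -- minimal sketch, assuming pytest is the test runner
import pytest
from pyspark import SparkConf, SparkContext

@pytest.fixture(scope="session")
def sc():
    # One local core is plenty for 5-10 element lists; disabling the UI trims start-up a bit more.
    conf = (SparkConf()
            .setMaster("local[1]")
            .setAppName("pytest-pyspark-local-testing")
            .set("spark.ui.enabled", "false"))
    context = SparkContext.getOrCreate(conf=conf)
    yield context
    context.stop()

# test_rdd_helpers.py
def test_distinct(sc):
    rdd = sc.parallelize([1, 1, 2, 3, 3])
    assert sorted(rdd.distinct().collect()) == [1, 2, 3]
The context is still created once, so the few-second start-up is not eliminated, but it is no longer paid per test.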

Related

PySpark show_profile() prints nothing with DataFrame API operations

Pyspark uses cProfile and works according to the docs for the RDD API, but it seems that there is no way to get the profiler to print results after running a bunch of DataFrame API operations?
from pyspark import SparkContext, SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
rdd = sc.parallelize([('a', 0), ('b', 1)])
df = sqlContext.createDataFrame(rdd)
rdd.count() # this ACTUALLY gets profiled :)
sc.show_profiles() # here is where the profiling prints out
sc.show_profiles() # here prints nothing (no new profiling to show)
rdd.count() # this ACTUALLY gets profiled :)
sc.show_profiles() # here is where the profiling prints out in DataFrame API
df.count() # why does this NOT get profiled?!?
sc.show_profiles() # prints nothing?!
# and again it works when converting to RDD but not
df.rdd.count() # this ACTUALLY gets profiled :)
sc.show_profiles() # here is where the profiling prints out
df.count() # why does this NOT get profiled?!?
sc.show_profiles() # prints nothing?!
That is the expected behavior.
Unlike the RDD API, which runs native Python logic, the DataFrame / SQL API is JVM native. Unless you invoke Python udf* (including pandas_udf), no Python code is executed on the worker machines. All that is done on the Python side is simple API calls through the Py4j gateway.
Therefore no profiling information exists.
* Note that udfs seem to be excluded from the profiling as well.
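A minimal sketch of the behaviour described above, assuming the Python worker profiler is enabled via spark.python.profile (show_profiles() only reports anything when that flag is on):
from pyspark import SparkConf, SparkContext

# Sketch only: enable the Python worker profiler explicitly.
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext.getOrCreate(conf=conf)

rdd = sc.parallelize([('a', 0), ('b', 1)])
rdd.map(lambda kv: (kv[0], kv[1] + 1)).count()  # Python runs on the workers, so this is profiled
sc.show_profiles()                              # prints cProfile output for the stage above
DataFrame operations, by contrast, compile to JVM plans, so there is no Python code for the profiler to see.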

Heavy stateful UDF in pyspark

I have to run a really heavy Python function as a UDF in Spark and I want to cache some data inside the UDF. The case is similar to the one mentioned here.
I am aware that this is slow and wrong.
But the existing infrastructure is in Spark and I don't want to set up a new infrastructure and deal with data loading/parallelization/fail safety separately for this case.
This is what my Spark program looks like:
from mymodule import my_function # here is my function
from pyspark.sql.types import *
from pyspark.sql.functions import udf
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
schema = StructType().add("input", "string")
df = spark.read.format("json").schema(schema).load("s3://input_path")
udf1 = udf(my_function, StructType().add("output", "string"))
df.withColumn("result", udf1(df.input)).write.json("s3://output_path/")
my_function internally calls a method of an object with a slow constructor.
Therefore I don't want the object to be initialized for every entry, so I am trying to cache it:
from my_slow_class import SlowClass
from cachetools import cached

@cached(cache={})
def get_cached_object():
    # this call is really slow, therefore I am trying
    # to cache it with cachetools
    return SlowClass()

def my_function(input):
    slow_object = get_cached_object()
    output = slow_object.call(input)
    return {'output': output}
mymodule and my_slow_class are installed as modules on each spark machine.
It seems to work. The constructor is called only a few times (10-20 times for 100k lines in the input dataframe), and that is what I want.
My concern is multithreading/multiprocessing inside the Spark executors and whether the cached SlowClass instance is shared between many parallel my_function calls.
Can I rely on my_function being called one at a time inside the Python processes on the worker nodes? Does Spark use any multiprocessing/multithreading in the Python process that executes my UDF?
Spark forks the Python process to create individual workers; however, all processing inside an individual worker process is sequential unless multithreading or multiprocessing is used explicitly by the UserDefinedFunction.
So as long as the state is only used for caching and slow_object.call is a pure function, you have nothing to worry about.
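If the cachetools approach ever feels too implicit, another commonly used pattern (a sketch under the question's assumptions, not something from the answer) is to drop down to mapPartitions so the expensive constructor visibly runs once per partition:
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType
from my_slow_class import SlowClass  # assumed installed on every worker, as in the question

spark = SparkSession.builder.getOrCreate()
schema = StructType().add("input", "string")
df = spark.read.format("json").schema(schema).load("s3://input_path")

def process_partition(rows):
    slow_object = SlowClass()  # constructed once per partition, not once per row
    for row in rows:
        yield Row(input=row.input, output=slow_object.call(row.input))

result = spark.createDataFrame(df.rdd.mapPartitions(process_partition))
result.write.json("s3://output_path/")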

HiveContext vs spark sql

I am trying to compare Spark SQL with HiveContext. May I know the difference: does HiveContext's sql run Hive queries, while Spark SQL runs Spark queries?
Below is my code:
sc = pyspark.SparkContext.getOrCreate(conf=conf)
sqlContext = HiveContext(sc)
sqlContext.sql ('select * from table')
While with Spark SQL:
spark.sql('select * from table')
May I know the difference between these two?
SparkSession provides a single point of entry for interacting with the underlying Spark functionality and allows programming Spark with the DataFrame and Dataset APIs. Most importantly, it curbs the number of concepts and constructs a developer has to juggle while interacting with Spark.
Without your explicitly creating a SparkConf, SparkContext or SQLContext, SparkSession encapsulates them within itself.
SparkSession merged SQLContext and HiveContext into one object as of Spark 2.0.
When building a session object, for example:
val spark = SparkSession.builder()
  .appName("SparkSessionExample")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
.enableHiveSupport() provides the HiveContext functionality, so you will be able to access Hive tables since the Spark session is initialized with Hive support.
So, there is no difference between "sqlContext.sql" and "spark.sql", but it is advised to use "spark.sql", since spark is the single point of entry for all the Spark APIs.
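Since the question uses PySpark, the Python equivalent of the Scala snippet above would look roughly like this (a sketch; the warehouse path is illustrative):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SparkSessionExample")
         .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")  # illustrative location
         .enableHiveSupport()   # gives the old HiveContext behaviour
         .getOrCreate())

spark.sql('select * from table')   # same result as sqlContext.sql(...) on a HiveContext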

How to change SparkContext property spark.sql.pivotMaxValues in jupyter PySpark session

Q: How to change SparkContext property spark.sql.pivotMaxValues in jupyter PySpark session
I made the following code change to increase spark.sql.pivotMaxValues. Sadly, it had no effect on the resulting error after restarting jupyter and running the code again.
from pyspark import SparkConf, SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
import numpy as np
try:
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker') # original
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", "99999")
    conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", 99999)
    sc = SparkContext(conf=conf)
except:
    print("Variables sc and conf are now defined. Everything is OK and ready to run.")
<... (other code) ...>
df = sess.read.csv(in_filename, header=False, mode="DROPMALFORMED", schema=csv_schema)
ct = df.crosstab('username', 'itemname')
Spark error message that was thrown on my crosstab line of code:
IllegalArgumentException: "requirement failed: The number of distinct values for itemname, can't exceed 1e4. Currently 16467"
I expect I'm not actually setting the config variable I was trying to set, so what is a way to actually set that value, programmatically if possible? Thanks.
References:
Finally, you may be interested to know that there is a maximum number of values for the pivot column if none are specified. This is mainly to catch mistakes and avoid OOM situations. The config key is spark.sql.pivotMaxValues and its default is 10,000.
Source: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
I would prefer to change the config variable upwards, since I have written the crosstab code already which works great on smaller datasets. If it turns out there truly is no way to change this config variable then my backup plans are, in order:
relational right outer join to implement my own Spark crosstab with higher capacity than was provided by databricks
scipy dense vectors with handmade unique combinations calculation code using dictionaries
kernel.json
This configuration file should be distributed together with jupyter:
~/.ipython/kernels/pyspark/kernel.json
It contains the Spark configuration, including the variable PYSPARK_SUBMIT_ARGS - the list of arguments that will be used with the spark-submit script.
You can try adding --conf spark.sql.pivotMaxValues=99999 to this variable in the mentioned file.
PS
There are also cases where people try to override this variable programmatically. You can give that a try too...
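For the programmatic route, spark.sql.pivotMaxValues is a SQL conf, so on Spark 2.x it should also be settable on the session object right before the crosstab call; a sketch reusing the sess, in_filename and csv_schema names from the question:
# Sketch only: raise the pivot limit on an existing session before pivoting.
sess.conf.set("spark.sql.pivotMaxValues", 99999)

df = sess.read.csv(in_filename, header=False, mode="DROPMALFORMED", schema=csv_schema)
ct = df.crosstab('username', 'itemname')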

pyspark intersection() function to compare data frames

Below is the code I have written to compare two dataframes and apply the intersection function to them.
import os
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
df = sqlContext.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://xxx:xxx") \
    .option("databaseName", "xxx") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("dbtable", "xxx") \
    .option("user", "xxxx") \
    .option("password", "xxxx") \
    .load()
df.registerTempTable("test")
df1 = sqlContext.sql("select * from test where amitesh <= 300")
df2 = sqlContext.sql("select * from test where amitesh <= 400")
df3= df1.intersection(df2)
df3.show()
I am getting below error:
AttributeError: 'DataFrame' object has no attribute 'intersection'
If my understanding is correct, intersection() is a built-in method of Python sets. So,
1) if I am trying to use it inside pyspark, do I need to import any special module in my code, or should it work as a built-in for pyspark as well?
2) To use this intersection() function, do we first need to convert the df to an RDD?
Please correct me wherever I am wrong. Can somebody give me a working example?
My motive is to get the common records from SQL Server and move them to Hive. As of now, I am first trying to get my intersection function working, and then start with the Hive requirement, which I can take care of once intersection() is working.
I got it working: instead of intersection(), I used intersect(), and it worked.
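For reference, a small self-contained sketch of DataFrame.intersect, which returns only the rows present in both DataFrames (the example data is made up; sqlContext is the HiveContext from the question code):
a = sqlContext.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'val'])
b = sqlContext.createDataFrame([(2, 'y'), (3, 'z')], ['id', 'val'])

a.intersect(b).show()   # only the row (2, 'y') survives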
