PySpark uses cProfile and works according to the docs for the RDD API, but it seems there is no way to get the profiler to print results after running a bunch of DataFrame API operations?
from pyspark import SparkConf, SparkContext, SQLContext

# profiling must be enabled on the context before any jobs run
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
rdd = sc.parallelize([('a', 0), ('b', 1)])
df = sqlContext.createDataFrame(rdd)

rdd.count()         # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out
sc.show_profiles()  # here prints nothing (no new profiling to show)
rdd.count()         # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out

# in the DataFrame API
df.count()          # why does this NOT get profiled?!?
sc.show_profiles()  # prints nothing?!

# and again it works when converting to RDD, but not on the DataFrame itself
df.rdd.count()      # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out
df.count()          # why does this NOT get profiled?!?
sc.show_profiles()  # prints nothing?!
That is the expected behavior.
Unlike the RDD API, which runs native Python logic on the workers, the DataFrame / SQL API is JVM-native. Unless you invoke a Python udf* (including pandas_udf), no Python code is executed on the worker machines. All that happens on the Python side is simple API calls through the Py4j gateway.
Therefore no profiling information exists.
* Note that udfs seem to be excluded from the profiling as well.
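If you do need Python-side profiles for logic applied to a DataFrame, a minimal sketch (building on the question's own workaround; normalize is a hypothetical function, not from the original) is to drop to the underlying RDD for the step you want to measure, so the work actually runs in the Python workers:
def normalize(row):
    # hypothetical Python logic worth profiling
    return (row[0].upper(), row[1] + 1)

df.rdd.map(normalize).count()  # executes in Python workers, so it IS profiled
sc.show_profiles()             # now includes the profile for normalize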
Related
In my Python 3.8 unit test code, I need to instantiate a SparkContext to test some functions that manipulate an RDD.
The problem is that instantiating a SparkContext takes a few seconds, which is too slow. I'm using this code to instantiate one:
from pyspark import SparkContext

def spark_context():
    # helper used by the tests
    return SparkContext.getOrCreate()
The tests only run filter, map and distinct on an RDD. In my tests, I only want to compare an expected output against the obtained output on a tiny dataset (lists of length between 5 and 10), and I want them to run almost instantaneously. I don't need all the other features PySpark offers (such as parallelization).
I tried various ways, such as the one below, but instantiation takes just as much time:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
sc = SparkContext(conf=conf)
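The JVM start-up cost itself can't be eliminated, but a common mitigation (a sketch assuming pytest; not from the original thread) is to pay it only once per test session by sharing a single context through a session-scoped fixture:
import pytest
from pyspark import SparkConf, SparkContext

@pytest.fixture(scope="session")
def sc():
    # created once for the whole test session, then reused by every test
    conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
    context = SparkContext.getOrCreate(conf)
    yield context
    context.stop()

def test_distinct(sc):
    assert sorted(sc.parallelize([1, 1, 2]).distinct().collect()) == [1, 2]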
I'm quite new to PySpark and I'm getting the following error: Py4JJavaError: An error occurred while calling o517.showString. I've read that it is due to a lack of memory: Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
So I've been reading that a workaround for this situation is to use df.persist() and then read the persisted df back, so I would like to know:
Given a for loop in which I do some .join operations, should I use the .persist() inside the loop or at the end of it? e.g.
for col in columns:
    df_AA = df_AA.join(df_B, df_AA[col] == 'some_value', 'outer').persist()
--> or <--
for col in columns:
    df_AA = df_AA.join(df_B, df_AA[col] == 'some_value', 'outer')
df_AA.persist()
Once I've done that, how should I read back?
df_AA.unpersist()? sqlContext.read.some_thing(df_AA)?
I'm really new to this, so please explain as simply as you can.
I'm running on a local machine (8 GB RAM), using Jupyter notebooks (Anaconda); Windows 7; Java 8; Python 3.7.1; PySpark v2.4.3.
Spark is a lazily evaluated framework, so none of the transformations (e.g. join) are executed until you call an action.
So go ahead with what you have done:
from pyspark import StorageLevel

for col in columns:
    df_AA = df_AA.join(df_B, df_AA[col] == 'some_value', 'outer')

df_AA.persist(StorageLevel.MEMORY_AND_DISK)
df_AA.show()  # an action: the joins run once and the result is cached
There are multiple persist options available; choosing MEMORY_AND_DISK will spill data that cannot be held in memory to disk. As for reading back: you don't re-read anything, you simply keep using the same df_AA reference, and subsequent actions reuse the cached data; call df_AA.unpersist() to release the storage when you are done.
Also, GC errors could be the result of too little driver memory being allocated for the Spark application to run.
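On that last point, a minimal sketch of raising driver memory (the value is illustrative, not from the original answer); note that spark.driver.memory must be set before the JVM starts, so it belongs in the configuration used to create the session:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.driver.memory", "4g")  # illustrative value
         .getOrCreate())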
I have to run a really heavy Python function as a UDF in Spark, and I want to cache some data inside the UDF. The case is similar to the one mentioned here
I am aware that it is slow and wrong.
But the existing infrastructure is in Spark, and I don't want to set up a new infrastructure and deal with data loading/parallelization/fail safety separately for this case.
This is what my Spark program looks like:
from mymodule import my_function # here is my function
from pyspark.sql.types import *
from pyspark.sql.functions import udf
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
schema = StructType().add("input", "string")
df = spark.read.format("json").schema(schema).load("s3://input_path")
udf1 = udf(my_function, StructType().add("output", "string"))
df.withColumn("result", udf1(df.input)).write.json("s3://output_path/")
Internally, my_function calls a method of an object with a slow constructor.
Therefore I don't want the object to be initialized for every entry and I am trying to cache it:
from my_slow_class import SlowClass
from cachetools import cached

@cached(cache={})
def get_cached_object():
    # this call is really slow, therefore I am trying
    # to cache it with cachetools
    return SlowClass()

def my_function(input):
    slow_object = get_cached_object()
    output = slow_object.call(input)
    return {'output': output}
mymodule and my_slow_class are installed as modules on each spark machine.
It seems to work. The constructor is called only a few times (10-20 times for 100k lines in the input dataframe), and that is what I want.
My concern is multithreading/multiprocessing inside the Spark executors, and whether the cached SlowClass instance is shared between many parallel my_function calls.
Can I rely on my_function being called one invocation at a time inside the Python processes on the worker nodes? Does Spark use any multiprocessing/multithreading in the Python process that executes my UDF?
Spark forks a Python process to create each individual worker; however, all processing within an individual worker process is sequential unless multithreading or multiprocessing is used explicitly by the UserDefinedFunction itself. Since the cache dictionary lives in each forked worker's memory, you should expect roughly one SlowClass instance per Python worker process, which matches the 10-20 constructor calls you observed.
So as long as the state is used only for caching and slow_object.call is a pure function, you have nothing to worry about.
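To see this for yourself, a hedged sketch (my addition, not part of the original answer) is to tag each result with the worker's process id; distinct pids in the output correspond to distinct forked workers, each holding its own cached SlowClass instance:
import os

def my_function(input):
    slow_object = get_cached_object()
    # each distinct pid in the output marks a separate Python worker
    # process with its own cache (and thus its own SlowClass instance)
    return {'output': '%s|%s' % (os.getpid(), slow_object.call(input))}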
Q: How to change SparkContext property spark.sql.pivotMaxValues in jupyter PySpark session
I made the following code change to increase spark.sql.pivotMaxValues. Sadly it had no effect on the resulting error, even after restarting Jupyter and running the code again.
from pyspark import SparkConf, SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
import numpy as np
try:
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker')  # original
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", "99999")
    conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", 99999)
    sc = SparkContext(conf=conf)
except:
    print("Variables sc and conf are now defined. Everything is OK and ready to run.")
<... (other code) ...>
df = sess.read.csv(in_filename, header=False, mode="DROPMALFORMED", schema=csv_schema)
ct = df.crosstab('username', 'itemname')
The Spark error message that was thrown at my crosstab line of code:
IllegalArgumentException: "requirement failed: The number of distinct values for itemname, can't exceed 1e4. Currently 16467"
I expect I'm not actually setting the config variable that I was trying to set, so what is a way to get that value actually set, programmatically if possible? Thanks.
References:
Finally, you may be interested to know that there is a maximum number of values for the pivot column if none are specified. This is mainly to catch mistakes and avoid OOM situations. The config key is spark.sql.pivotMaxValues and its default is 10,000.
Source: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
I would prefer to raise the config value, since I have already written the crosstab code and it works great on smaller datasets. If it turns out there truly is no way to change this config variable, then my backup plans are, in order:
a relational right outer join to implement my own Spark crosstab with higher capacity than the one provided by Databricks
scipy dense vectors with hand-made unique-combination calculation code using dictionaries
kernel.json
This configuration file is distributed together with Jupyter:
~/.ipython/kernels/pyspark/kernel.json
It contains the Spark configuration, including the variable PYSPARK_SUBMIT_ARGS, the list of arguments that will be passed to the spark-submit script.
You can try adding --conf spark.sql.pivotMaxValues=99999 to this variable in the file mentioned above.
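A hedged sketch of what that might look like (the exact contents of your kernel.json will differ; only the PYSPARK_SUBMIT_ARGS entry matters here, and the value shown is illustrative):
{
  "display_name": "PySpark",
  "language": "python",
  "argv": ["python", "-m", "ipykernel", "-f", "{connection_file}"],
  "env": {
    "PYSPARK_SUBMIT_ARGS": "--master local[*] --conf spark.sql.pivotMaxValues=99999 pyspark-shell"
  }
}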
PS
There are also cases where people try to override this variable programmatically. You can give that a try too; a sketch follows.
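A minimal sketch of the programmatic route (my assumption, not code from the original thread): spark.sql.pivotMaxValues is a SQL option, so on Spark 2.x it can usually be changed at runtime on the session instead of on the SparkContext:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.pivotMaxValues", "99999")  # takes effect for subsequent queries
ct = df.crosstab('username', 'itemname')             # df as in the question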
Using native Python code in SQL UDFs in MonetDB is really powerful. BUT debugging such UDFs could benefit from more support. In particular, if I use the old-fashioned print('debugging info'), it disappears into a big black void.
create function dummy()
returns string
language python {
    print('Entering the dummy UDF')
    return 'hello';
};
How can I retrieve this information from the server or the MonetDB client?
I was debugging some Python UDFs last week :)
Step 1: first make sure your Python code at least works in a Python interpreter.
Step 2: in a Python UDF, write your debugging info to a file, e.g.:
f = open('/tmp/debug.out', 'w')
f.write('my debugging info\n')
f.close()
This isn't ideal, but it works. Also, I used this to export the parameter values of my Python UDF. In this way, I can run the body of my Python UDF in a Python interpreter with the exact data I receive from MonetDB.
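For that export trick, a hedged sketch (my own illustration, not the answerer's exact code) of what you might put inside the UDF body, where column stands for whichever UDF parameter you want to capture:
import pickle

# dump the UDF's input so the function body can be replayed offline
with open('/tmp/udf_input.pkl', 'wb') as f:
    pickle.dump(list(column), f)  # 'column' is the UDF's input parameter

You can then pickle.load() the file in a normal interpreter and step through the function body with exactly the data MonetDB passed in.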
In case someone is still interested in this problem.
There are two novel ways of debugging MonetDB's Python UDFs.
1) Using the Python client pymonetdb (https://github.com/gijzelaerr/pymonetdb).
You can install it through pip:
pip install pymonetdb
To use it, consider the following setup with a table that holds integers and a UDF that computes the mean absolute deviation of a given column.
CREATE TABLE integers(i INTEGER);
INSERT INTO integers VALUES (1), (3), (6), (8), (10);

CREATE OR REPLACE FUNCTION mean_deviation(column INTEGER)
RETURNS DOUBLE LANGUAGE PYTHON {
    mean = 0.0
    for i in range(0, len(column)):
        mean += column[i]
    mean = mean / len(column)
    distance = 0.0
    for i in range(0, len(column)):
        distance += abs(column[i] - mean)  # absolute distance from the mean
    deviation = distance / len(column)
    return deviation
};
To debug your function in the terminal (i.e., with pdb), open a database connection using pymonetdb.connect(), get a cursor object from the connection, and call the cursor's debug() function, passing as parameters the SQL query you want to examine and the name of the UDF you wish to debug.
import pymonetdb

conn = pymonetdb.connect(database='demo')  # open a database connection
c = conn.cursor()
sql = 'select mean_deviation(i) from integers;'
c.debug(sql, 'mean_deviation')             # console debugging
There is an optional sampling step that transfers only a uniform random sample of the data instead of the full input data set. If you wish to sample, just pass the number of elements you want the sample to contain (e.g., c.debug(sql, 'mean_deviation', 10) if you want a subset of 10 elements).
2) Using a POC plugin for PyCharm called devudf, which you can install through PyCharm's plugin page or directly from the JetBrains page: https://plugins.jetbrains.com/plugin/12063-devudf. It adds an option to the main menu called "UDF Development" and lets you import and export UDFs directly between your database and PyCharm, so you can enjoy the IDE's debugging capabilities.