I see in the services UI that I can create a Spark cluster. I also see that I can use the Spark operator runtime when executing a job. What is the use case for each and why would I choose one vs the other?
There are two ways of using Spark in Iguazio:
Create a standalone Spark cluster via the Iguazio UI (like you found on the services page). This is a persistent cluster that you can associate with multiple jobs, Jupyter notebooks, etc. This is a good choice for long-running computations with a static pool of resources. An overview of the Spark service in Iguazio can be found here, along with some ingestion examples.
When creating a JupyterLab instance in the UI, there is an option to associate it with an existing Spark cluster. This lets you use PySpark out of the box (a minimal session sketch is shown after the Spark Operator example below).
Create an ephemeral Spark cluster via the Spark Operator. This is a temporary cluster that only exists for the duration of the job. This is a good choice for shorter one-off jobs with a static or variable pool of resources. The Spark Operator runtime is usually the better option if you don't need a persistent Spark cluster. Some examples of using the Spark operator on Iguazio can be found here as well as below.
import mlrun
import os
# set up new spark function with spark operator
# command will use our spark code which needs to be located on our file system
# the name param can contain only lowercase letters (k8s naming convention)
sj = mlrun.new_function(kind='spark', command='spark_read_csv.py', name='sparkreadcsv')
# set spark driver config (gpu_type & gpus=<number_of_gpus> supported too)
sj.with_driver_limits(cpu="1300m")
sj.with_driver_requests(cpu=1, mem="512m")
# set spark executor config (gpu_type & gpus=<number_of_gpus> are supported too)
sj.with_executor_limits(cpu="1400m")
sj.with_executor_requests(cpu=1, mem="512m")
# adds fuse, daemon & iguazio's jars support
sj.with_igz_spark()
# set spark driver volume mount
# sj.with_driver_host_path_volume("/host/path", "/mount/path")
# set spark executor volume mount
# sj.with_executor_host_path_volume("/host/path", "/mount/path")
# args are also supported
sj.spec.args = ['-spark.eventLog.enabled','true']
# add python module
sj.spec.build.commands = ['pip install matplotlib']
# Number of executors
sj.spec.replicas = 2
# Rebuilds the image with MLRun - needed in order to support artifact logging etc.
sj.deploy()
# Run the task, setting the artifact path under which our run artifacts (if any) will be saved
sj.run(artifact_path='/User')
Where the spark_read_csv.py file looks like:
from pyspark.sql import SparkSession
from mlrun import get_or_create_ctx
context = get_or_create_ctx("spark-function")
# build spark session
spark = SparkSession.builder.appName("Spark job").getOrCreate()
# read csv
df = spark.read.load('iris.csv', format="csv",
                     sep=",", header="true")
# sample for logging
df_to_log = df.describe().toPandas()
# log final report
context.log_dataset("df_sample",
                    df=df_to_log,
                    format="csv")
spark.stop()
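For comparison, here is a minimal sketch of the first option: using PySpark out of the box from a Jupyter notebook associated with the standalone Spark service. The app name is arbitrary and the CSV path is a placeholder; the session picks up the cluster configuration provided by the service.
from pyspark.sql import SparkSession

# The Jupyter service that is associated with the Spark cluster is preconfigured
# to point at that cluster, so a plain builder call is typically all that is needed.
spark = SparkSession.builder.appName("notebook-app").getOrCreate()

df = spark.read.csv("iris.csv", header=True)
df.show(5)

spark.stop()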
Related
I'm able to establish a connection to my Databricks FileStore DBFS and access the filestore.
Reading, writing, and transforming data with PySpark is possible, but when I try to use a local Python API such as pathlib or the os module, I am unable to get past the first level of the DBFS file system.
I can use a magic command:
%fs ls dbfs:/mnt/my_fs/... which works perfectly and lists all the child directories,
but if I do os.listdir('/dbfs/mnt/my_fs/') it returns ['mount.err'] as a return value.
I've tested this on a new cluster and the result is the same.
I'm using Python on Databricks Runtime Version 6.1 with Apache Spark 2.4.4.
Is anyone able to advise?
Edit:
Connection script:
I've used the Databricks CLI library to store my credentials, which are formatted according to the Databricks documentation:
def initialise_connection(secrets_func):
    configs = secrets_func()
    # Check if the mount exists
    bMountExists = False
    for item in dbutils.fs.ls("/mnt/"):
        if str(item.name) == r"WFM/":
            bMountExists = True
    # Drop if it exists to refresh credentials
    if bMountExists:
        dbutils.fs.unmount("/mnt/WFM")
        bMountExists = False
    # Mount the drive
    if not bMountExists:
        dbutils.fs.mount(
            source="adl://test.azuredatalakestore.net/WFM",
            mount_point="/mnt/WFM",
            extra_configs=configs
        )
        print("Drive mounted")
    else:
        print("Drive already mounted")
We experienced this issue when the same container was mounted to two different paths in the workspace. Unmounting all and remounting resolved our issue. We were using Databricks version 6.2 (Spark 2.4.4, Scala 2.11). Our blob store container config:
Performance/Access tier: Standard/Hot
Replication: Read-access geo-redundant storage (RA-GRS)
Account kind: StorageV2 (general purpose v2)
Notebook script to run to unmount all mounts in /mnt:
# Iterate through all mounts and unmount
print('Unmounting all mounts beginning with /mnt/')
dbutils.fs.mounts()
for mount in dbutils.fs.mounts():
    if mount.mountPoint.startswith('/mnt/'):
        dbutils.fs.unmount(mount.mountPoint)
# Re-list all mount points
print('Re-listing all mounts')
dbutils.fs.mounts()
Minimal job to test on automated job cluster
Assuming you have a separate process to create the mounts, create a job definition (job.json) to run a Python script on an automated cluster:
{
  "name": "Minimal Job",
  "new_cluster": {
    "spark_version": "6.2.x-scala2.11",
    "spark_conf": {},
    "node_type_id": "Standard_F8s",
    "driver_node_type_id": "Standard_F8s",
    "num_workers": 2,
    "enable_elastic_disk": true,
    "spark_env_vars": {
      "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    }
  },
  "timeout_seconds": 14400,
  "max_retries": 0,
  "spark_python_task": {
    "python_file": "dbfs:/minimal/job.py"
  }
}
Python file (job.py) to print out mounts:
import os
path_mounts = '/dbfs/mnt/'
print(f"Listing contents of {path_mounts}:")
print(os.listdir(path_mounts))
path_mount = path_mounts + 'YOURCONTAINERNAME'
print(f"Listing contents of {path_mount }:")
print(os.listdir(path_mount))
Run the Databricks CLI commands below to run the job. View the Spark driver logs for the output, confirming that mount.err does not exist.
databricks fs mkdirs dbfs:/minimal
databricks fs cp job.py dbfs:/minimal/job.py --overwrite
databricks jobs create --json-file job.json
databricks jobs run-now --job-id <JOBID FROM LAST COMMAND>
We have experienced the same issue when connecting to an Azure Generation 2 storage account (without hierarchical namespaces).
The error seems to occur when switching the Databricks Runtime Environment from 5.5 to 6.x. However, we have not been able to pinpoint the exact reason for this. We assume some functionality might have been deprecated.
Updated answer: With Azure Data Lake Gen1 storage accounts, dbutils has access to the ADLS Gen1 tokens/access credentials, so file listing within the mount point works, whereas standard Python API calls do not have access to those credentials or the Spark conf; the first call you see is listing folders and is not making any calls to the ADLS APIs.
I have tested this in Databricks Runtime version 6.1 (includes Apache Spark 2.4.4, Scala 2.11).
The commands work as expected without any error message.
Hope this helps. Could you please try it and let us know?
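To illustrate the difference described above, here is a minimal comparison (the mount name my_fs is a placeholder; on the affected runtimes only the second call returns ['mount.err']):
import os

# Goes through Spark/DBFS and therefore has the mount credentials available
print(dbutils.fs.ls("dbfs:/mnt/my_fs/"))

# Goes through the local /dbfs FUSE mount with plain POSIX calls,
# which do not have access to the credentials/Spark conf
print(os.listdir("/dbfs/mnt/my_fs/"))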
Q: How to change the SparkContext property spark.sql.pivotMaxValues in a Jupyter PySpark session
I made the following code change to increase spark.sql.pivotMaxValues. Sadly, it had no effect on the resulting error after restarting Jupyter and running the code again.
from pyspark import SparkConf, SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
import numpy as np
try:
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker')  # original
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", "99999")
    conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", 99999)
    sc = SparkContext(conf=conf)
except:
    print("Variables sc and conf are now defined. Everything is OK and ready to run.")
<... (other code) ...>
df = sess.read.csv(in_filename, header=False, mode="DROPMALFORMED", schema=csv_schema)
ct = df.crosstab('username', 'itemname')
Spark error message that was thrown on my crosstab line of code:
IllegalArgumentException: "requirement failed: The number of distinct values for itemname, can't exceed 1e4. Currently 16467"
I expect I'm not actually setting the config variable that I was trying to set, so what is a way to get that value actually set, programmatically if possible? Thanks.
References:
Finally, you may be interested to know that there is a maximum number
of values for the pivot column if none are specified. This is mainly
to catch mistakes and avoid OOM situations. The config key is
spark.sql.pivotMaxValues and its default is 10,000.
Source: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
I would prefer to change the config variable upwards, since I have written the crosstab code already and it works great on smaller datasets. If it turns out there truly is no way to change this config variable, then my backup plans are, in order:
a relational right outer join to implement my own Spark crosstab with higher capacity than is provided by Databricks (see the sketch after this list for one way to sidestep the limit)
scipy dense vectors with handmade unique-combinations calculation code using dictionaries
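As a rough sketch of the goal behind the first backup plan (avoiding the inferred-values limit rather than raising it), passing an explicit value list to pivot() should skip the spark.sql.pivotMaxValues check. This assumes the distinct item names fit in driver memory and reuses the df and column names from the question:
from pyspark.sql import functions as F

# Collect the distinct pivot values ourselves; the 10,000-value limit does not apply here.
items = [row['itemname'] for row in df.select('itemname').distinct().collect()]

# With explicit values, Spark does not need to infer them, so the
# spark.sql.pivotMaxValues check is not triggered.
ct = df.groupBy('username').pivot('itemname', items).count()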
kernel.json
This configuration file is distributed together with Jupyter:
~/.ipython/kernels/pyspark/kernel.json
It contains the Spark configuration, including the variable PYSPARK_SUBMIT_ARGS - the list of arguments that will be used with the spark-submit script.
You can try adding --conf spark.sql.pivotMaxValues=99999 to this variable in the mentioned file, for example:
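A sketch of what that could look like (the interpreter command, SPARK_HOME, and master are placeholders for your environment; only the extra --conf is the relevant part):
{
  "display_name": "PySpark",
  "language": "python",
  "argv": ["python", "-m", "ipykernel", "-f", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/path/to/spark",
    "PYSPARK_SUBMIT_ARGS": "--master local[*] --conf spark.sql.pivotMaxValues=99999 pyspark-shell"
  }
}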
PS
There are also cases where people are trying to override this variable programmatically. You can give that a try too; a sketch follows.
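A minimal sketch of the programmatic route (assuming Spark 2.x with SparkSession; note that if a SparkContext is already running, builder .config() settings may be silently ignored, which is a common reason the value appears not to change):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("pivot_max_values_demo")
         .config("spark.sql.pivotMaxValues", "99999")
         .getOrCreate())

# spark.sql.pivotMaxValues is a regular SQL conf, so it can also be changed at runtime:
spark.conf.set("spark.sql.pivotMaxValues", "99999")
print(spark.conf.get("spark.sql.pivotMaxValues"))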
I have a case of connecting to Spark using Scala. Previously, I didn't have experience with Scala and used Python in combination with Spark.
So for Python the connection was done like this:
import findspark
import pyspark
findspark.init('/Users/SD/Data/spark-1.6.1-bin-hadoop2.6')
sc = pyspark.SparkContext(appName="myAppName")
and then the coding process began.
So my question is: how can I establish the connection to Spark using the Scala dialect?
Thanks!
Irrespective of Python or Scala, the following steps are common:
Make the jars available to the language you are using (the PYTHONPATH for Python and an sbt entry for Scala).
scala
name := "ProjectName"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0"
python
PYTHONPATH=/Users/XXX/softwares/spark-1.6.1-bin-hadoop2.6/python:/Users/XXX/softwares/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip:$PYTHONPATH
Once the libraries are available, the usage is regular as below
In scala
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
In python
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
The code snippet you provided is getting the libraries for Python. It may work, but it might not be the final approach you would follow.
I am trying to get the application output of the spark run and cannot find a straightforward way doing that.
Basically I am talking about the content of the <spark install dir>/work directory on the cluster worker.
I could've copied that directory to the location I need, but with hundreds of nodes it simply doesn't scale.
The other option I was considering is to attach an exit function (like a TRAP in bash) to get the logs from each worker as a part of the app run. I just think there has to be a better solution than that.
Yeah, I know that I can use YARN or Mesos cluster manager to get the logs, however it seems really weird to me that in order to do such a convenient thing I cannot use the default cluster manager.
Thanks a lot.
In the end I went for the following solution (Python):
import os
import tarfile
from io import BytesIO
from pyspark.sql import SparkSession
# Get the spark app.
spark = SparkSession.builder.appName("my-spark-app").getOrCreate()
# Get the executor working directories.
spark_home = os.environ.get('SPARK_HOME')
if spark_home:
    num_workers = 0
    with open(os.path.join(spark_home, 'conf', 'slaves'), 'r') as f:
        for line in f:
            num_workers += 1

    if num_workers:
        executor_logs_path = '/where/to/store/executor_logs'

        def _map(worker):
            '''Returns the list of tuples of the name and the tar.gz of the worker log
            directory in binary format for the corresponding worker.
            '''
            flo = BytesIO()
            with tarfile.open(fileobj=flo, mode="w:gz") as tar:
                tar.add(os.path.join(spark_home, 'work'), arcname='work')
            return [('worker_%d_dir.tar.gz' % worker, flo.getvalue())]

        def _reduce(worker1, worker2):
            '''Appends the worker name and its log tar.gz into the list.
            '''
            worker1.extend(worker2)
            return worker1

        os.makedirs(executor_logs_path)
        logs = spark.sparkContext.parallelize(range(num_workers), num_workers).map(_map).reduce(_reduce)
        with tarfile.open(os.path.join(executor_logs_path, 'logs.tar'), 'w') as tar:
            for name, data in logs:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(tarinfo=info, fileobj=BytesIO(data))
A couple of concerns though:
not sure if using the map-reduce technique is the best way to collect the logs
the files (tarballs) are created in memory, so depending on your application it can crash if the files are too big
perhaps there is a better way to determine the number of workers
I'm running a Spark Streaming task in a cluster using YARN. Each node in the cluster runs multiple spark workers. Before the streaming starts I want to execute a "setup" function on all workers on all nodes in the cluster.
The streaming task classifies incoming messages as spam or not spam, but before it can do that it needs to download the latest pre-trained models from HDFS to local disk, like this pseudo code example:
def fetch_models():
    if hadoop.version > local.version:
        hadoop.download()
I've seen the following examples here on SO:
sc.parallelize().map(fetch_models)
But in Spark 1.6 parallelize() requires some data to be used, like this shitty work-around I'm doing now:
sc.parallelize(range(1, 1000)).map(fetch_models)
Just to be fairly sure that the function is run on ALL workers I set the range to 1000. I also don't exactly know how many workers are in the cluster when running.
I've read the programming documentation and googled relentlessly but I can't seem to find any way to actually just distribute anything to all workers without any data.
After this initialization phase is done, the streaming task is as usual, operating on incoming data from Kafka.
The way I'm using the models is by running a function similar to this:
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
.repartition(spark_partitions)\
.foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition)))
Theoretically I could check whether or not the models are up to date in the on_partition function, though it would be really wasteful to do this on each batch. I'd like to do it before Spark starts retrieving batches from Kafka, since the downloading from HDFS can take a couple of minutes...
UPDATE:
To be clear: it's not an issue on how to distribute the files or how to load them, it's about how to run an arbitrary method on all workers without operating on any data.
To clarify what actually loading models means currently:
def on_partition(config, partition):
    if not MyClassifier.is_loaded():
        MyClassifier.load_models(config)
    handle_partition(config, partition)
While MyClassifier is something like this:
class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        MyClassifier.clf = load_from_file(config)
I use static methods since PySpark doesn't seem to be able to serialize classes with non-static methods (the state of the class is irrelevant in relation to another worker). Here we only have to call load_models() once, and on all future batches MyClassifier.clf will be set. This is something that should really not be done for each batch; it's a one-time thing. The same goes for downloading the files from HDFS using fetch_models().
If all you want is to distribute a file to the worker machines, the simplest approach is to use the SparkFiles mechanism:
some_path = ... # local file, a file in DFS, an HTTP, HTTPS or FTP URI.
sc.addFile(some_path)
and retrieve it on the workers using SparkFiles.get and standard IO tools:
from pyspark import SparkFiles
with open(SparkFiles.get(some_path)) as fw:
    ...  # Do something
If you want to make sure that the model is actually loaded, the simplest approach is to load it on module import. Assuming config can be used to retrieve the model path:
model.py:
from pyspark import SparkFiles
config = ...
class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        path = SparkFiles.get(config.get("model_file"))
        MyClassifier.clf = load_from_file(path)
# Executed once per interpreter
MyClassifier.load_models(config)
main.py:
from pyspark import SparkContext
config = ...
sc = SparkContext("local", "foo")
# Executed before StreamingContext starts
sc.addFile(config.get("model_file"))
sc.addPyFile("model.py")
import model
ssc = ...
stream = ...
stream.map(model.MyClassifier.do_something).pprint()
ssc.start()
ssc.awaitTermination()
This is a typical use case for Spark's broadcast variables. Assuming fetch_models returns the models rather than saving them locally, you would do something like:
bc_models = sc.broadcast(fetch_models())
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
.repartition(spark_partitions)\
.foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition, bc_models.value)))
This does assume that your models fit in memory, on the driver and the executors.
You may be worried that broadcasting the models from the single driver to all the executors is inefficient, but it uses 'efficient broadcast algorithms' that can significantly outperform distribution through HDFS, according to this analysis.
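To make the implied signature change explicit, on_partition would now take the broadcast value as an extra argument (handle_partition and the argument name are assumptions of this sketch, not part of the original code):
def on_partition(config, partition, models):
    # The broadcast value is already materialized on the executor,
    # so there is no per-batch download or lazy-loading step needed here.
    handle_partition(config, partition, models)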