Having two separate PySpark applications that each instantiate a HiveContext in place of a SQLContext causes one of the two applications to fail with the error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))
The other application terminates successfully.
I am using Spark 1.6 from the Python API and want to make use of some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.
This is enough to reproduce:
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, HiveContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)
data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)
time.sleep(60)
The sleep is just to keep the script running while I start the other process.
If I have two instances of this script running, the above error shows up when reading the parquet file. When I replace HiveContext with SQLContext, everything is fine.
Does anyone know why that is?
By default Hive(Context) uses embedded Derby as a metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications, you should configure a standalone metastore. At the moment Hive supports PostgreSQL, MySQL, Oracle, and MS SQL Server as backends. Configuration details depend on the backend and the chosen mode (local / remote), but generally speaking you'll need:
a running RDBMS server
a metastore database created using provided scripts
a proper Hive configuration
Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.
Theoretically it should also be possible to create separate Derby metastores with a proper configuration (see Hive Admin Manual - Local/Embedded Metastore Database) or to use Derby in server mode.
For development you can start the applications in different working directories. This will create a separate metastore_db for each application and avoid the issue of multiple active users (see the sketch after the quote below). Providing a separate Hive configuration should work as well, but is less useful in development:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory
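A minimal sketch of that development workaround, assuming the same Spark 1.6 / HiveContext setup as in the question (the directory name is made up):

import os

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# Hypothetical per-application scratch directory.
work_dir = '/tmp/app1'
if not os.path.isdir(work_dir):
    os.makedirs(work_dir)
# Derby creates metastore_db in the current working directory,
# so each application gets its own metastore and they no longer collide.
os.chdir(work_dir)

sc = SparkContext(conf=SparkConf())
sq = HiveContext(sc)  # uses /tmp/app1/metastore_db; a second app in /tmp/app2 gets its own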
Related
I'm new to PySpark and have a question about the best design pattern/practice:
I'm developing a library that should run both on a local machine and on Databricks.
I'm currently working on loading secrets. If the code runs on Databricks, I should load secrets using dbutils.secrets.get, while if the code runs on a local machine, I should use dotenv.load_dotenv.
Question:
How can I create/refer to the dbutils variable (which is readily provided in a Databricks instance)? pyspark doesn't have such a module... even if I import SparkSession, I still need DBUtils, which is not found in a local pyspark installation.
My current solution: if I identify that the code runs on Databricks, I create dbutils with:
dbutils = globals()['dbutils']
Just follow the approach described in the documentation for databricks-connect: wrap the instantiation of dbutils in a function that behaves differently depending on whether you're on Databricks or not:
def get_secret(scope, key):
    if spark.conf.get("spark.databricks.service.client.enabled") == "true":
        # On Databricks: read the secret from a secret scope via dbutils
        from pyspark.dbutils import DBUtils
        return DBUtils(spark).secrets.get(scope, key)
    else:
        # Locally: fall back to environment variables loaded via python-dotenv
        import os
        from dotenv import load_dotenv
        load_dotenv()
        return os.environ[key]
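Either way the caller then looks the same in both environments, for example (the scope and key names here are made up):

db_password = get_secret("my-scope", "db-password")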
I've installed pyspark, but have not installed any Hadoop or Spark version separately.
Apparently under Windows pyspark needs access to Hadoop's winutils.exe for some things (e.g. writing files to disk). When pyspark wants to access winutils.exe, it looks for it in the bin directory of the folder specified by the HADOOP_HOME environment variable (user variable). Therefore I copied winutils.exe into the bin directory of pyspark (.\site-packages\pyspark\bin) and set HADOOP_HOME to .\site-packages\pyspark\. This solved the problem of the error message: Failed to locate the winutils binary in the hadoop binary path.
However, when I start a Spark session using pyspark I still get the following warning:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Installing Hadoop and then pointing HADOOP_HOME at its installation directory did prevent the warning. Does a specific Hadoop version have to be installed to make pyspark work without restrictions?
A Hadoop installation is not mandatory.
Spark is only a distributed computing engine.
Spark offers only computation; it doesn't provide any storage.
Instead, Spark integrates with a huge variety of storage systems such as HDFS, Cassandra, HBase, MongoDB, the local file system, etc.
Spark is also designed to run on top of a variety of resource managers such as its standalone scheduler, Mesos, YARN, Kubernetes, or local mode.
PySpark is the Python API on top of Spark for developing Spark applications in Python, so a Hadoop installation is not mandatory.
Note: a Hadoop installation is only required either to run PySpark applications on top of YARN, or to read/write the input/output of a PySpark application from/to HDFS/Hive/HBase, or both.
The warning you posted is a normal one, so you can ignore it.
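If you'd rather not set HADOOP_HOME as a persistent user variable for the winutils.exe lookup described in the question, it can also be set from the script itself before the JVM is launched; a minimal sketch, with hypothetical paths:

import os

# Hypothetical path: the folder whose bin\ subdirectory contains winutils.exe.
os.environ["HADOOP_HOME"] = r"C:\Users\me\venv\Lib\site-packages\pyspark"
os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]

# The variables must be set before the JVM starts, i.e. before the session is created.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()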
I am working with Elasticsearch 6.7 which has an Elasticsearch SQL cli. This allows me to run more standard SQL queries. This is preferred over the API method as the query capabilities are much more robust.
I am attempting to run a query through this CLI and load the results into a pandas DataFrame. Is this something I can do via the subprocess module, or is there an easier/better way? This will go into production, so it needs to run in multiple environments.
This python program will be running on a different host than the Elasticsearch machine.
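Since the program runs on a separate host anyway, one possible route (an assumption, not a confirmed answer) is to skip the CLI and call Elasticsearch 6.x's SQL REST endpoint with CSV output, then read that into pandas; the host, index, and field names below are made up:

import io

import pandas as pd
import requests

# Hypothetical cluster address; _xpack/sql with format=csv is the 6.x SQL REST API.
ES_SQL_URL = "http://es-host:9200/_xpack/sql?format=csv"
payload = {"query": "SELECT field1, field2 FROM logs LIMIT 100"}

resp = requests.post(ES_SQL_URL, json=payload)
resp.raise_for_status()

# The response body is plain CSV, which pandas can parse directly.
df = pd.read_csv(io.StringIO(resp.text))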
I'd like to develop programs with spark + hive and unit test them locally.
Is there a way to get hive to run in-process? Or something else that will facilitate unit testing?
I'm using python 2.7 on Mac
EDIT: since Spark 2, it is possible to create a local Hive metastore that can be used in tests. The original answer is at the bottom.
from the spark sql programming guide:
When working with Hive, one must instantiate SparkSession with Hive
support, including connectivity to a persistent Hive metastore,
support for Hive serdes, and Hive user-defined functions. Users who do
not have an existing Hive deployment can still enable Hive support.
When not configured by the hive-site.xml, the context automatically
creates metastore_db in the current directory and creates a directory
configured by spark.sql.warehouse.dir, which defaults to the directory
spark-warehouse in the current directory that the Spark application is
started. Note that the hive.metastore.warehouse.dir property in
hive-site.xml is deprecated since Spark 2.0.0. Instead, use
spark.sql.warehouse.dir to specify the default location of database in
warehouse. You may need to grant write privilege to the user who
starts the Spark application.
Basically, what it means is that if you don't configure Hive, Spark will create a metastore for you and store it on the local disk.
Two configurations you should be aware of:
spark.sql.warehouse.dir - a Spark config that points to where table data is stored on disk, e.g.: "/path/to/test/folder/warehouse/"
javax.jdo.option.ConnectionURL - a Hive config, which should be set in hive-site.xml (or as a system property), e.g.: "jdbc:derby:;databaseName=/path/to/test/folder/metastore_db;create=true"
Those are not mandatory (since they have default values), but sometimes it is convenient to set them explicitly.
Make sure to clean the test folder between tests so that each suite gets a clean environment.
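Putting those pieces together, a minimal sketch of such a test session (the temp paths are made up, and passing javax.jdo.option.ConnectionURL through the builder instead of hive-site.xml is an assumption that usually works):

import tempfile

from pyspark.sql import SparkSession

# Throwaway folder so every test run gets a fresh warehouse and metastore.
test_dir = tempfile.mkdtemp()

spark = (SparkSession.builder
         .master("local[2]")
         .appName("hive-unit-tests")
         .config("spark.sql.warehouse.dir", "%s/warehouse" % test_dir)
         .config("javax.jdo.option.ConnectionURL",
                 "jdbc:derby:;databaseName=%s/metastore_db;create=true" % test_dir)
         .enableHiveSupport()
         .getOrCreate())

# Hive-backed tables now go to the throwaway metastore/warehouse.
spark.sql("CREATE TABLE IF NOT EXISTS test_table (id INT)")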
Original Answer:
I would recommend installing a Vagrant box that contains a full (small) Hadoop cluster in a VM on your machine.
You can find a ready-made Vagrant setup here: http://blog.cloudera.com/blog/2014/06/how-to-install-a-virtual-apache-hadoop-cluster-with-vagrant-and-cloudera-manager/
That way your tests could run in the same environment as production.
I have a set of simple Python 2.7 scripts and a set of Linux nodes. I want to run these scripts on these nodes at a specific time.
Each script may run on any node, but a script must not run on multiple nodes simultaneously.
So, I want to complete three simple tasks:
Deploy the set of scripts.
Run the main script with specific parameters on any node at a specific time.
Get the result when the script is finished.
It seems that I am able to complete the first task. I have the following code snippet:
import urllib
import urlparse
from pyspark import SparkContext
def path2url(path):
    return urlparse.urljoin(
        'file:', urllib.pathname2url(path))
MASTER_URL = "spark://My-PC:7077"
deploy_zip_path = "deploy.zip"
sc = SparkContext(master=("%s" % MASTER_URL), appName="Job Submitter", pyFiles=[path2url("%s" % deploy_zip_path)])
But I have a problem: this code immediately launches tasks, whereas I only want to deploy the scripts to all nodes.
I would recommend keeping the code to deploy your PySpark scripts outside of your PySpark scripts.
Chronos is a job scheduler that runs on Apache Mesos. Spark can run on Mesos. Chronos runs jobs as a shell command, so you can run your scripts with any arguments you specify. You will need to deploy Spark and your scripts to the Mesos nodes. Then, you can submit your Spark scripts with Chronos using spark-submit as the command.
You would store your results by writing to some kind of storage mechanism within your PySpark scripts. Spark has support for text files, HDFS, Amazon S3, and more. If Spark doesn't support the storage mechanism you need, you can use an external library that does. For example, I write to Cassandra in my PySpark scripts using cassandra-driver.
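For the "get the result" part, the usual pattern is to have the PySpark script itself persist its output to shared storage that the caller can read afterwards; a minimal sketch with made-up paths:

from pyspark import SparkContext

sc = SparkContext(appName="scheduled-job")

# Do the actual work of the script...
result = sc.parallelize(range(100)).map(lambda x: x * x)

# ...then write the output to shared storage the scheduler or caller can read later.
# Any backend Spark supports works here; the paths below are hypothetical.
result.saveAsTextFile("hdfs:///jobs/output/run-001")
# or: result.saveAsTextFile("s3a://my-bucket/jobs/output/run-001")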