I'd like to develop programs with spark + hive and unit test them locally.
Is there a way to get hive to run in-process? Or something else that will facilitate unit testing?
I'm using python 2.7 on Mac
EDIT: Since Spark 2, it is possible to create a local Hive metastore that can be used in tests. The original answer is at the bottom.
From the Spark SQL programming guide:
When working with Hive, one must instantiate SparkSession with Hive
support, including connectivity to a persistent Hive metastore,
support for Hive serdes, and Hive user-defined functions. Users who do
not have an existing Hive deployment can still enable Hive support.
When not configured by the hive-site.xml, the context automatically
creates metastore_db in the current directory and creates a directory
configured by spark.sql.warehouse.dir, which defaults to the directory
spark-warehouse in the current directory that the Spark application is
started. Note that the hive.metastore.warehouse.dir property in
hive-site.xml is deprecated since Spark 2.0.0. Instead, use
spark.sql.warehouse.dir to specify the default location of database in
warehouse. You may need to grant write privilege to the user who
starts the Spark application.
Basically, this means that if you don't configure Hive, Spark will create a metastore for you and store it on the local disk.
Two configurations you should be aware of:
spark.sql.warehouse.dir - a Spark config that points to where the table data is stored on disk, e.g. "/path/to/test/folder/warehouse/"
javax.jdo.option.ConnectionURL - a Hive config that should be set in hive-site.xml (or as a system property), e.g. "jdbc:derby:;databaseName=/path/to/test/folder/metastore_db;create=true"
These are not mandatory (they have default values), but sometimes it is convenient to set them explicitly.
Make sure to clean the test folder between tests, so each suite starts with a clean environment.
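For example, a pytest fixture along these lines creates an isolated warehouse and metastore per test session. This is only a sketch: the temp-folder layout and fixture name are my own, and passing the ConnectionURL through the builder is an alternative to the hive-site.xml / system-property approach mentioned above.

import shutil
import tempfile

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    test_dir = tempfile.mkdtemp()  # fresh folder per test session
    session = (
        SparkSession.builder
        .master("local[2]")
        .config("spark.sql.warehouse.dir", test_dir + "/warehouse")
        .config("javax.jdo.option.ConnectionURL",
                "jdbc:derby:;databaseName=" + test_dir + "/metastore_db;create=true")
        .enableHiveSupport()
        .getOrCreate()
    )
    yield session
    session.stop()
    shutil.rmtree(test_dir, ignore_errors=True)  # clean env for the next suite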
Original Answer:
I would recommend installing a Vagrant box that contains a full (small) Hadoop cluster in a VM on your machine.
You can find a ready-made Vagrant box here: http://blog.cloudera.com/blog/2014/06/how-to-install-a-virtual-apache-hadoop-cluster-with-vagrant-and-cloudera-manager/
That way your tests could run in the same environment as production.
I am working on a process to automatically remove and add databases to Azure. When a database isn't in use, it can be removed from Azure and placed in cheaper S3 storage as a .bacpac.
I am using Microsoft's SqlPackage.exe from a PowerShell script to export and import these databases from and to Azure in either direction, and I invoke it via a Python script so I can also use boto3.
The issue I have is with the down direction at step 3. The sequence would be:
Download the Azure SQL DB to a .bacpac (can be achieved with SqlPackage.exe)
Upload this .bacpac to cheaper S3 storage (using boto3 Python SDK)
Delete the Azure SQL database (it appears the Azure Blob Python SDK can't help me, and SqlPackage.exe does not seem to have a delete function)
Is step 3 impossible to automate with a script? Could a workaround be to use SqlPackage.exe to import a small dummy .bacpac with the same name, overwriting the old, bigger DB?
Thanks.
To remove an Azure SQL Database using PowerShell, you will need the Remove-AzSqlDatabase cmdlet.
To remove an Azure SQL Database using the Azure CLI, you will need az sql db delete.
If you want to write Python code to delete the database, you will need the Azure SDK for Python.
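A rough sketch of steps 2 and 3 in Python, assuming the boto3 and azure-mgmt-sql packages. All resource names, the bucket, the file path, and the subscription ID are placeholders; recent azure-mgmt-sql versions expose the delete as begin_delete (older versions use delete).

import boto3
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

# Step 2: upload the exported .bacpac to S3 (bucket/key names are placeholders).
s3 = boto3.client("s3")
s3.upload_file("C:/exports/mydb.bacpac", "my-archive-bucket", "backups/mydb.bacpac")

# Step 3: delete the Azure SQL database via the management SDK.
sql_client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")
poller = sql_client.databases.begin_delete(
    resource_group_name="my-resource-group",
    server_name="my-sql-server",
    database_name="mydb",
)
poller.wait()  # block until the delete completes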
I've installed pyspark, but have not installed any Hadoop or Spark version separately.
Apparently, under Windows pyspark needs access to Hadoop's winutils.exe for some things (e.g. writing files to disk). When pyspark wants to access winutils.exe, it looks in the bin directory of the folder specified by the HADOOP_HOME environment variable (user variable). I therefore copied winutils.exe into pyspark's bin directory (.\site-packages\pyspark\bin) and set HADOOP_HOME to .\site-packages\pyspark\. This got rid of the error message: Failed to locate the winutils binary in the hadoop binary path.
However, when I start a Spark session using pyspark I still get the following warning:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Installing Hadoop and then setting HADOOP_HOME to its installation directory did prevent the warning. Does a specific Hadoop version have to be installed to make pyspark work without restrictions?
A Hadoop installation is not mandatory.
Spark is a distributed computing engine only.
Spark offers only computation; it doesn't have any storage of its own.
But Spark integrates with a huge variety of storage systems such as HDFS, Cassandra, HBase, MongoDB, the local file system, etc.
Spark is designed to run on top of a variety of resource managers such as its own standalone mode, Mesos, YARN, Kubernetes, or local mode.
PySpark is the Python API on top of Spark for developing Spark applications in Python, so a Hadoop installation is not mandatory.
Note: a Hadoop installation is only required either to run a PySpark application on top of YARN, or to read/write the application's input/output from/to HDFS, Hive, or HBase, or both.
The warning you posted is a normal one, so you can ignore it.
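To illustrate the point, a minimal sketch of pyspark running purely locally, reading and writing the local file system with no full Hadoop installation (on Windows you still need the winutils.exe setup described in the question; the path and app name are arbitrary):

from pyspark.sql import SparkSession

# Local mode only: no YARN, no HDFS, no Hadoop cluster required.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("no-hadoop-demo")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("C:/tmp/no_hadoop_demo")  # plain local path
spark.read.parquet("C:/tmp/no_hadoop_demo").show()
spark.stop()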
I see that there's a built-in I/O connector for BigQuery, but a lot of our data is stored in Snowflake. Is there a workaround for connecting to Snowflake? The only thing I can think of doing is to use sqlalchemy to run the query and dump the output to Cloud Storage buckets, so that Apache Beam can then read its input data from the files stored in the bucket.
Snowflake Python and Java connectors were added to Beam recently.
Right now (version 2.24) it supports only the ReadFromSnowflake operation, in apache_beam.io.external.snowflake.
In the 2.25 release WriteToSnowflake will also be available, in the apache_beam.io.snowflake module. You can still use the old path, however it will be considered deprecated in that version.
Right now it runs only on the Flink runner, but there is an effort to make it available for other runners as well.
Also, it's a cross-language transform, so some additional setup can be needed. It's quite well documented in the pydoc here (I'm pasting it below):
https://github.com/apache/beam/blob/release-2.24.0/sdks/python/apache_beam/io/external/snowflake.py
Snowflake transforms tested against Flink portable runner.
**Setup**
Transforms provided in this module are cross-language transforms
implemented in the Beam Java SDK. During the pipeline construction, Python SDK
will connect to a Java expansion service to expand these transforms.
To facilitate this, a small amount of setup is needed before using these
transforms in a Beam Python pipeline.
There are several ways to setup cross-language Snowflake transforms.
* Option 1: use the default expansion service
* Option 2: specify a custom expansion service
See below for details regarding each of these options.
*Option 1: Use the default expansion service*
This is the recommended and easiest setup option for using Python Snowflake
transforms. This option requires the following prerequisites
before running the Beam pipeline.
* Install a Java runtime on the computer from which the pipeline is constructed
and make sure that the 'java' command is available.
In this option, the Python SDK will either download (for released Beam versions) or
build (when running from a Beam Git clone) an expansion service jar and use
that to expand transforms. Currently Snowflake transforms use the
'beam-sdks-java-io-expansion-service' jar for this purpose.
*Option 2: specify a custom expansion service*
In this option, you startup your own expansion service and provide that as
a parameter when using the transforms provided in this module.
This option requires the following prerequisites before running the Beam
pipeline.
* Startup your own expansion service.
* Update your pipeline to provide the expansion service address when
initiating Snowflake transforms provided in this module.
Flink Users can use the built-in Expansion Service of the Flink Runner's
Job Server. If you start Flink's Job Server, the expansion service will be
started on port 8097. For a different address, please set the
expansion_service parameter.
**More information**
For more information regarding cross-language transforms see:
- https://beam.apache.org/roadmap/portability/
For more information specific to Flink runner see:
- https://beam.apache.org/documentation/runners/flink/
Snowflake (like most of the portable IOs) has its own Java expansion service, which should be downloaded automatically when you don't specify your own custom one. I don't think it should be needed, but I'm mentioning it just to be on the safe side. You can download the jar and start it with java -jar <PATH_TO_JAR> <PORT>, then pass it to snowflake.ReadFromSnowflake as expansion_service='localhost:<PORT>'. Link to the 2.24 version: https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-io-snowflake-expansion-service/2.24.0
Notice that it's still experimental though and feel free to report issues on Beam Jira.
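For orientation, a rough usage sketch based on the 2.24 pydoc. The connection values, bucket, and storage integration names are placeholders, and you should double-check the exact parameter names against the pydoc linked above:

import apache_beam as beam
from apache_beam.io.external import snowflake
from apache_beam.options.pipeline_options import PipelineOptions


def csv_mapper(strings_array):
    # Each Snowflake row arrives as a list of strings in column order.
    return {"id": int(strings_array[0]), "name": strings_array[1]}


# Currently this only runs on the Flink runner (see above).
options = PipelineOptions(["--runner=FlinkRunner"])

with beam.Pipeline(options=options) as p:
    rows = (
        p
        | snowflake.ReadFromSnowflake(
            server_name="account.snowflakecomputing.com",
            username="user",
            password="password",
            schema="PUBLIC",
            database="MY_DB",
            staging_bucket_name="my-gcs-bucket",
            storage_integration_name="my_integration",
            csv_mapper=csv_mapper,
            table="MY_TABLE",
        )
        | beam.Map(print)
    )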
Google Cloud Support here!
There's no direct connector from Snowflake to Cloud Dataflow, but one workaround would be what you've mentioned. First dump the output to Cloud Storage, and then connect Cloud Storage to Cloud Dataflow.
I hope that helps.
For future folks looking for a tutorial on how to start with Snowflake and Apache Beam, I can recommend the below tutorial which was made by the creators of the connector.
https://www.polidea.com/blog/snowflake-and-apache-beam-on-google-dataflow/
Running two separate pyspark applications that instantiate a HiveContext in place of a SQLContext causes one of the two applications to fail with the error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34039))
The other application terminates successfully.
I am using Spark 1.6 from the Python API and want to make use of some DataFrame functions that are only supported with a HiveContext (e.g. collect_set). I've had the same issue on 1.5.2 and earlier.
This is enough to reproduce:
import time

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext  # swapping this for SQLContext avoids the error

conf = SparkConf()
sc = SparkContext(conf=conf)
sq = HiveContext(sc)

data_source = '/tmp/data.parquet'
df = sq.read.parquet(data_source)

time.sleep(60)
The sleep is just to keep the script running while I start the other process.
If I have two instances of this script running, the above error shows up when reading the parquet file. When I replace HiveContext with SQLContext, everything is fine.
Does anyone know why that is?
By default, Hive (and hence HiveContext) uses embedded Derby as a metastore. It is intended mostly for testing and supports only one active user. If you want to support multiple running applications, you should configure a standalone metastore. At the moment Hive supports PostgreSQL, MySQL, Oracle, and MS SQL Server as backends. The configuration details depend on the backend and mode (local / remote), but generally speaking you'll need:
a running RDBMS server
a metastore database created using provided scripts
a proper Hive configuration
Cloudera provides a comprehensive guide you may find useful: Configuring the Hive Metastore.
Theoretically it should also be possible to create separate Derby metastores with a proper configuration (see Hive Admin Manual - Local/Embedded Metastore Database) or to use Derby in Server Mode.
For development, you can start the applications in different working directories. This will create a separate metastore_db for each application and avoid the issue of multiple active users. Providing a separate Hive configuration for each should work as well, but it is less convenient in development:
When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory
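A minimal sketch of that development workaround (the folder path is arbitrary): each application changes into its own working directory before the HiveContext is created, so each one gets its own embedded Derby metastore_db.

import os

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

work_dir = '/tmp/app_one'  # use a different folder per application
if not os.path.isdir(work_dir):
    os.makedirs(work_dir)
os.chdir(work_dir)  # metastore_db will be created inside this folder

sc = SparkContext(conf=SparkConf())
sq = HiveContext(sc)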
So I am creating a local Python script which I plan to export as an executable. However, this script needs a MongoDB instance that runs in the background as a service or daemon. How could one include this MongoDB service along with their own packaged application?
I have this configuration manually installed on my own computer: MongoDB installed as a local Windows service, and Python, with my script adding to and removing from the database as events are fired. Is there any way to distribute this setup without a manual installation of Python and MongoDB?
If you want to include installations of all your utilities, I recommend pynsist. It'll allow you to make a Windows installer that makes your code launchable as an app on the client's system, and includes any other files and/or folders that you want.
Py2exe converts python scripts and their dependencies into Windows executable files. It has some limitations, but may work for your application.
You might also get away with not installing Mongo by embedding something like this in your application: https://github.com/Softmotions/ejdb. This may require you to rewrite your data access code.
If you can't or won't do that, then you could have all your clients share a multi-tenant Mongo instance that you host somewhere in the cloud.
Finally, if you can't or won't convert your python script to a standalone exe with an embedded database, and you don't want to host a shared mongo instance for your clients, there are legions of software installation makers that make deploying mongo, python, setting up an execution environment, creating services, etc, pretty easy. Some are free, some cost money. A long list can be found here: https://en.m.wikipedia.org/wiki/List_of_installation_software