Spark-Scala connection - python

I have a case where I need to connect to Spark using Scala. Previously I had no experience with Scala and used Python in combination with Spark.
For Python, the connection was done like this:
import findspark
import pyspark
findspark.init('/Users/SD/Data/spark-1.6.1-bin-hadoop2.6')
sc = pyspark.SparkContext(appName="myAppName")
and then the coding process began.
So my question is: how can I establish the connection to Spark using Scala?
Thanks!

Irrespective of Python or Scala, the following steps are common:
Make the jars available to the language you are using (the PYTHONPATH entry for Python, an sbt dependency for Scala).
Scala:
name := "ProjectName"
version := "1.0"
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0"
Python:
PYTHONPATH=/Users/XXX/softwares/spark-1.6.1-bin-hadoop2.6/python:/Users/XXX/softwares/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip:$PYTHONPATH
Once the libraries are available, the usage is the same in both languages, as below.
In Scala:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
In Python:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
The code snippet you provided makes the libraries available to Python. It may work, but it might not be the final approach you would follow.

Related

When would I use Spark Operator vs Spark Standalone in Iguazio?

I see in the services UI that I can create a Spark cluster. I also see that I can use the Spark operator runtime when executing a job. What is the use case for each and why would I choose one vs the other?
There are two ways of using Spark in Iguazio:
Create a standalone Spark cluster via the Iguazio UI (like you found on the services page). This is a persistent cluster that you can associate with multiple jobs, Jupyter notebooks, etc. This is a good choice for long-running computations with a static pool of resources. An overview of the Spark service in Iguazio can be found here, along with some ingestion examples.
When creating a JupyterLab instance in the UI, there is an option to associate it with an existing Spark cluster. This lets you use PySpark out of the box (see the short sketch after the Spark Operator example below).
Create an ephemeral Spark cluster via the Spark Operator. This is a temporary cluster that only exists for the duration of the job. This is a good choice for shorter one-off jobs with a static or variable pool of resources. The Spark Operator runtime is usually the better option if you don't need a persistent Spark cluster. Some examples of using the Spark operator on Iguazio can be found here as well as below.
import mlrun
import os
# set up new spark function with spark operator
# command will use our spark code which needs to be located on our file system
# the name param can contain only lowercase letters (k8s naming convention)
sj = mlrun.new_function(kind='spark', command='spark_read_csv.py', name='sparkreadcsv')
# set spark driver config (gpu_type & gpus=<number_of_gpus> supported too)
sj.with_driver_limits(cpu="1300m")
sj.with_driver_requests(cpu=1, mem="512m")
# set spark executor config (gpu_type & gpus=<number_of_gpus> are supported too)
sj.with_executor_limits(cpu="1400m")
sj.with_executor_requests(cpu=1, mem="512m")
# adds fuse, daemon & iguazio's jars support
sj.with_igz_spark()
# set spark driver volume mount
# sj.function.with_driver_host_path_volume("/host/path", "/mount/path")
# set spark executor volume mount
# sj.function.with_executor_host_path_volume("/host/path", "/mount/path")
# args are also supported
sj.spec.args = ['-spark.eventLog.enabled','true']
# add python module
sj.spec.build.commands = ['pip install matplotlib']
# Number of executors
sj.spec.replicas = 2
# Rebuilds the image with MLRun - needed in order to support artifact logging etc.
sj.deploy()
# Run task while setting the artifact path where our run artifacts (if any) will be saved
sj.run(artifact_path='/User')
Where the spark_read_csv.py file looks like:
from pyspark.sql import SparkSession
from mlrun import get_or_create_ctx
context = get_or_create_ctx("spark-function")
# build spark session
spark = SparkSession.builder.appName("Spark job").getOrCreate()
# read csv
df = spark.read.load('iris.csv', format="csv",
                     sep=",", header="true")
# sample for logging
df_to_log = df.describe().toPandas()
# log final report
context.log_dataset("df_sample",
                    df=df_to_log,
                    format="csv")
spark.stop()
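For the first option (the standalone cluster), once a JupyterLab service is associated with the cluster, PySpark can typically be used directly. The following is a minimal, hypothetical sketch; it assumes the Iguazio-provisioned Spark master and settings are already injected into the notebook environment, so no explicit master URL is set:
from pyspark.sql import SparkSession

# In a JupyterLab service associated with the standalone Spark service, the
# master URL and Iguazio-specific configuration are assumed to be preconfigured,
# so the builder only needs an application name (the name here is arbitrary).
spark = SparkSession.builder.appName("standalone-cluster-example").getOrCreate()

# Trivial sanity check that runs on the persistent cluster
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()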

How to create an OpenSplice DDS topic using python and statically created topic classes?

I've been trying to use ADLINK's Vortex OpenSplice Community edition with the Python API (Python version 3.6 within a PyEnv virtual environment) on Ubuntu 20.04.2 LTS. I've followed the PythonDCPSAPIGuide and got the Python examples in $OSPL_HOME/tools/python/examples working. However, I can't figure out how to create a topic, associated with a domain participant, for a topic class statically generated using idlpp. How would I be able to do this?
What I have done so far:
I have an IDL file that has include paths for quite a few other IDL files. I have converted these IDL files to Python topic classes using the following bash script:
#!/bin/bash
for FILE in *.idl; do
    $OSPL_HOME/bin/idlpp -I $OSPL_HOME/etc/idl -S -l python -d . $FILE
done
This creates a series of Python packages (Python topic classes) that I import into my Python script, which is in the same directory.
Using these packages, I would like to create or register a topic with a domain participant in my Python script. For example, something like the following Python code (however, the 'create_topic' function doesn't exist):
# myExampleDDSFile.py
from dds import *
from foo import foo_type # idlpp generated module/class
from foo2 import foo_type2 # idlpp generated module/class
dp = DomainParticipant()
topic = dp.create_topic('foo_topic',foo_type) # this function doesn't exist for a domain participant
pub = dp.create_publisher()
Would this be possible, and if so, how would I register a topic that I have statically created with a domain participant in Python?
I noticed in the provided python examples (e.g. $OSPL_HOME/tools/python/examples/example1.py) a topic is registered dynamically using the following code below, but I don't think this relates to statically generated python topic classes:
# example1.py snippet
dp = DomainParticipant()
gen_info = ddsutil.get_dds_classes_from_idl('example1.idl', 'basic::module_SequenceOfStruct::SequenceOfStruct_struct')
topic = gen_info.register_topic(dp, 'Example1')
I also couldn't see a relevant function in the source code.
I apologise if this is a simple question or if I have missed something - I am very new to Vortex OpenSplice DDS.
Any help would be appreciated.
I can't speak to OpenSplice, but you can do this with CoreDX DDS. For example, given the IDL file "hello.idl":
struct StringMsg
{
string msg;
};
run
coredx_ddl -l python -f hello.idl -d hello
And then, the following Python is an example of how to use the generated 'StringMsg' type to construct a Topic and a DataReader:
import time
import dds.domain
from hello import StringMsg

# Use default QoS, and no listener
dp = dds.domain.DomainParticipant( 0 )
topic = dds.topic.Topic( StringMsg, "helloTopic", "StringMsg", dp )
sub = dds.sub.Subscriber( dp )
dr = dds.sub.DataReader( sub, topic )

rc = dr.create_readcondition( dds.core.SampleState.ANY_SAMPLE_STATE,
                              dds.core.ViewState.ANY_VIEW_STATE,
                              dds.core.InstanceState.ANY_INSTANCE_STATE )
ws = dds.cond.WaitSet()
ws.attach_condition(rc)

while True:
    t = dds.core.Duration.infinite()
    print('waiting for data...')
    ws.wait(t)
    while True:
        try:
            samples = dr.take()
            for s in samples:
                if s.info.valid_data:
                    print("received: {}".format(s.data.msg))
        except dds.core.Error:
            break

Cloudera/CDH v6.1.x + Python HappyBase v1.1.0: TTransportException(type=4, message='TSocket read 0 bytes')

EDIT: This question and answer apply to anyone who is experiencing the exception stated in the subject line, TTransportException(type=4, message='TSocket read 0 bytes'), whether or not Cloudera and/or HappyBase is involved.
The root issue (as it turned out) stems from the client-side protocol and/or transport formats not matching what the server side is implementing, and this can happen with any client/server pairing. Mine just happened to be Cloudera and HappyBase, but yours needn't be, and you can run into this same issue.
Has anyone recently tried using the happybase v1.1.0 (latest) Python package to interact with Hbase on Cloudera CDH v6.1.x?
I'm trying various options with it, but keep getting the exception:
thriftpy.transport.TTransportException:
TTransportException(type=4, message='TSocket read 0 bytes')
Here is how I start a session and submit a simple call to get a listing of tables (using Python v3.6.7):
import happybase

CDH6_HBASE_THRIFT_VER = '0.92'

hbase_cnxn = happybase.Connection(
    host='vps00', port=9090,
    table_prefix=None,
    compat=CDH6_HBASE_THRIFT_VER,
    table_prefix_separator=b'_',
    timeout=None,
    autoconnect=True,
    transport='buffered',
    protocol='binary'
)

print('tables:', hbase_cnxn.tables())  # Exception happens here.
And here is how Cloudera CDH v6.1.x starts the Hbase Thrift server (truncated for brevity):
/usr/java/jdk1.8.0_141-cloudera/bin/java [... snip ... ] \
org.apache.hadoop.hbase.thrift.ThriftServer start \
--port 9090 -threadpool --bind 0.0.0.0 --framed --compact
I've tried several variations of options, but I'm getting nowhere.
Has anyone ever got this to work?
EDIT:
I next compiled Hbase.thrift (from the Hbase source files -- same HBase version as used by CDH v6.1.x) and used the Python thrift bindings package (in other words, I removed happybase from the equation) and got the same exception.
(._.);
Thank you!
After a day's worth of working on this, the answer to my question is the following:
import happybase

CDH6_HBASE_THRIFT_VER = '0.92'

hbase_cnxn = happybase.Connection(
    host='vps00', port=9090,
    table_prefix=None,
    compat=CDH6_HBASE_THRIFT_VER,
    table_prefix_separator=b'_',
    timeout=None,
    autoconnect=True,
    transport='framed',   # Default: 'buffered'  <---- Changed.
    protocol='compact'    # Default: 'binary'    <---- Changed.
)

print('tables:', hbase_cnxn.tables())  # Works. Output: [b'ns1:mytable', ]
Note that although this Q&A was framed in the context of Cloudera, it turns out (as you'll see) that the issue was related to Thrift versions and Thrift server-side configuration, so it applies to Hortonworks and MapR users, too.
Explanation:
On Cloudera CDH v6.1.x (and probably future versions, too), if you visit the HBase Thrift Server configuration section of its management UI, you'll find, among many other settings, that the compact protocol and framed transport options are both enabled (matching the --compact and --framed flags in the server command line above).
Because the server uses the compact protocol and framed transport, they correspondingly needed to be changed in happybase from its defaults (which I show above).
As mentioned in the EDIT follow-up to my initial question, I also investigated a pure Thrift (non-happybase) solution. With analogous changes to the Python code for that case, I got that to work, too. Here is the code you should use for the pure Thrift solution (taking care to read my commented annotations below):
from thrift.protocol import TCompactProtocol # Notice the import: TCompactProtocol [!]
from thrift.transport.TTransport import TFramedTransport # Notice the import: TFramedTransport [!]
from thrift.transport import TSocket
from hbase import Hbase
# -- This hbase module is compiled using the thrift(1) command (version >= 0.10 [!])
#    and a Hbase.thrift file (obtained from http://archive.apache.org/dist/hbase/).
# -- Also, your "pip freeze | grep '^thrift='" should show a version of >= 0.10 [!]
#    if you want Python3 support.
(host,port) = ("vps00","9090")
transport = TFramedTransport(TSocket.TSocket(host, port))
protocol = TCompactProtocol.TCompactProtocol(transport)
client = Hbase.Client(protocol)
transport.open()
# Do stuff here ...
print(client.getTableNames()) # Works. Output: [b'ns1:mytable', ]
transport.close()
I hope this spares people the pain I went through. =:)
CREDITS:
Here (MapR) and
Here (Blog from China)
I also encountered this problem recently when using HBase on CDH 6.3.2. Following the above configuration is not enough; you also need to disable hbase.regionserver.thrift.http and hbase.thrift.support.proxyuser in order to connect successfully.

How to upsert or partial updates with script documents in ElasticSearch with Spark?

I have pseudocode in Python that reads from a Kafka stream and upserts documents in Elasticsearch (incrementing a counter field, view, if the document already exists):
for message in consumer:
    msg = json.loads(message.value)
    print(msg)
    index = INDEX_NAME
    es_id = msg["id"]
    script = {"script": "ctx._source.view+=1", "upsert": msg}
    es.update(index=index, doc_type="test", id=es_id, body=script)
Since I want to use it in a distributed environment, I am using Spark Structured Streaming
df.writeStream \
    .format("org.elasticsearch.spark.sql") \
    .queryName("ESquery") \
    .option("es.resource", "credentials/url") \
    .option("checkpointLocation", "checkpoint").start()
or Spark Streaming in Scala that reads from a Kafka stream:
// Initializing Spark Streaming Context and kafka stream
sparkConf.setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(10))
[...]
val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topicsSet, kafkaParams)
)
[...]
val urls = messages.map(record => JsonParser.parse(record.value()).values.asInstanceOf[Map[String, Any]])
urls.saveToEs("credentials/credential")
.saveToEs(...) is the API of elastic-hadoop.jar (elasticsearch-hadoop), documented here. Unfortunately this repo is not really well documented, so I cannot understand where I can put the script command.
Can anyone help me? Thank you in advance.
You should be able to do it by setting the write operation to "update" (or "upsert") and passing your script via the es.update.script setting (the exact key depends on your ES version):
EsSpark.saveToEs(rdd, "spark/docs", Map("es.mapping.id" -> "id", "es.write.operation" -> "update", "es.update.script.inline" -> "your script"))
You probably want to use "upsert".
There are some good unit tests in the Cascading integration in the same library; those settings should work for Spark as well, since both use the same writer.
I suggest reading the unit tests to pick the correct settings for your ES version.
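For the Structured Streaming variant in the question, the same elasticsearch-hadoop settings can be passed as writeStream options. A rough sketch, assuming Elasticsearch 5+ (so es.update.script.inline and the painless language apply) and assuming each document carries an id field to use as es.mapping.id:
# Hypothetical sketch; the index/resource "credentials/url", the checkpoint path
# and the "view" counter field are taken from the question above.
query = (df.writeStream
    .format("org.elasticsearch.spark.sql")
    .option("checkpointLocation", "checkpoint")
    .option("es.mapping.id", "id")                                # document id comes from the "id" column
    .option("es.write.operation", "upsert")                       # insert if missing, run the script if present
    .option("es.update.script.lang", "painless")                  # script language for ES 5+
    .option("es.update.script.inline", "ctx._source.view += 1")   # increment the counter on update
    .start("credentials/url"))                                    # target index, as in es.resource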

How to fix this error: "'SQLContext' object has no attribute 'jsonFile'"

I am learning Spark now. When I tried to load a json file, as follows:
people=sqlContext.jsonFile("C:\wdchentxt\CustomerData.json")
I got the following error:
AttributeError: 'SQLContext' object has no attribute 'jsonFile'
I am running this on a Windows 7 PC, with spark-2.1.0-bin-hadoop2.7 and Python 2.7.13 (Dec 17, 2016).
Thank you for any suggestions that you may have.
You probably forgot to import the implicits. This is what my solution looks like in Scala:
def loadJson(filename: String, sqlContext: SQLContext): Dataset[Row] = {
  import sqlContext._
  import sqlContext.implicits._
  val df = sqlContext.read.json(filename)
  df
}
First, the more recent versions of Spark (like the one you are using) use .read.json(..) instead of the deprecated .jsonFile(..).
Second, you need to be sure that your SQLContext is set up right, as mentioned here: pyspark : NameError: name 'spark' is not defined. In my case, it's set up like this:
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
myObjects = sqlContext.read.json('file:///home/cloudera/Downloads/json_files/firehose-1-2018-08-24-17-27-47-7066324b')
Note that they have version-specific quick-start tutorials that can help with getting some of the basic operations right, as mentioned here: name spark is not defined
So, my point is to always check that, whatever library or language you are using (and this applies across all technologies), you are following the documentation that matches the version you are running, because breaking changes commonly cause a lot of confusion when there is a version mismatch. In cases where the technology you are trying to use is not well documented for the version you are running, evaluate whether you should upgrade to a more recent version or create a support ticket with the project maintainers so that you can help them better support their users.
You can find a guide on all of the version-specific changes of Spark here: https://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-16-to-20
You can also find version-specific documentation on Spark and PySpark here (e.g. for version 1.6.1): https://spark.apache.org/docs/1.6.1/sql-programming-guide.html
As mentioned before, .jsonFile(...) has been deprecated [1]; use this instead:
people = sqlContext.read.json("C:\wdchentxt\CustomerData.json").rdd
Source:
[1]: https://docs.databricks.com/spark/latest/data-sources/read-json.html
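For completeness, on Spark 2.x (such as the spark-2.1.0 build mentioned in the question) the usual entry point is a SparkSession rather than a raw SQLContext. A minimal sketch using the file path from the question:
from pyspark.sql import SparkSession

# SparkSession is the Spark 2.x entry point and wraps the old SQLContext.
spark = SparkSession.builder.appName("readJsonExample").getOrCreate()

# Use a raw string (r"...") so the Windows backslashes are not treated as escape sequences.
people = spark.read.json(r"C:\wdchentxt\CustomerData.json")
people.printSchema()
people.show()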
