I am having problems executing the example code from the MLeap repository. I want to run the code as a script instead of in a Jupyter notebook (which is how the example is run). My script is as follows:
##################################################################################
# start a local spark session
# https://spark.apache.org/docs/0.9.0/python-programming-guide.html
##################################################################################
from pyspark import SparkContext, SparkConf
conf = SparkConf()
#set app name
conf.set("spark.app.name", "train classifier")
#Run Spark locally with as many worker threads as logical cores on your machine (cores X threads).
conf.set("spark.master", "local[*]")
#number of cores to use for the driver process (only in cluster mode)
conf.set("spark.driver.cores", "1")
#Limit of total size of serialized results of all partitions for each Spark action (e.g. collect)
conf.set("spark.driver.maxResultSize", "1g")
#Amount of memory to use for the driver process
conf.set("spark.driver.memory", "1g")
#Amount of memory to use per executor process (e.g. 2g, 8g).
conf.set("spark.executor.memory", "2g")
#pass configuration to the spark context object along with code dependencies
sc = SparkContext(conf=conf)
from pyspark.sql.session import SparkSession
spark = SparkSession(sc)
##################################################################################
import mleap.pyspark
# # Imports MLeap serialization functionality for PySpark
from mleap.pyspark.spark_support import SimpleSparkSerializer
# Import standard PySpark Transformers and packages
from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row
# Create a test data frame
l = [('Alice', 1), ('Bob', 2)]
rdd = sc.parallelize(l)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
df2.collect()
# Build a very simple pipeline using two transformers
string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
feature_assembler = VectorAssembler(
    inputCols=[string_indexer.getOutputCol()], outputCol="features")
feature_pipeline = [string_indexer, feature_assembler]
featurePipeline = Pipeline(stages=feature_pipeline)
featurePipeline.fit(df2)
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
On executing spark-submit script.py I get the following error:
17/09/18 13:26:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/Users/opringle/Documents/Repos/finn/Magellan/src/no_spark_predict.py", line 58, in <module>
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'
Any help would be much appreciated! I have installed mleap from PyPI.
It seems MLeap isn't ready for Spark 2.3 yet. If you happen to be running Spark 2.3, try downgrading to 2.2 and retry. Hopefully, that helps!
I solved the issue by attaching the MLeap Spark package when submitting the job:
spark-submit --packages ml.combust.mleap:mleap-spark_2.11:0.8.1 script.py
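If the job is launched from another Python process, the same command line can be assembled programmatically. This is only a minimal sketch: `build_spark_submit` is a hypothetical helper (not part of Spark or MLeap), and the Maven coordinate is the one from the command above.

```python
import shlex


def build_spark_submit(script, packages=()):
    """Return the spark-submit argument list for `script`.

    `packages` is an iterable of Maven coordinates; spark-submit expects
    them as a single comma-separated value after --packages.
    """
    argv = ["spark-submit"]
    packages = list(packages)
    if packages:
        argv += ["--packages", ",".join(packages)]
    argv.append(script)
    return argv


argv = build_spark_submit(
    "script.py",
    packages=["ml.combust.mleap:mleap-spark_2.11:0.8.1"],
)
# shlex.join (Python 3.8+) is only used to display the command;
# pass the `argv` list itself to subprocess.run to execute it.
print(shlex.join(argv))
```

Keeping the arguments as a list avoids shell quoting problems when more packages or options are added later.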
It seems you didn't follow the steps exactly. The getting-started guide at http://mleap-docs.combust.ml/getting-started/py-spark.html says:
Note: the import of mleap.pyspark needs to happen before any other PySpark libraries are imported.
Hence, try importing SparkContext (and the other PySpark modules) after mleap.pyspark.
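One way to check a script against that note without running Spark is to inspect its import order. The sketch below is an illustration only: `mleap_imported_first` is a hypothetical helper that looks at top-level import statements and nothing else.

```python
import ast


def mleap_imported_first(source):
    """Return True if mleap.pyspark is imported before any pyspark module.

    Only top-level import statements are inspected, in source order.
    """
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name == "mleap.pyspark" or name.startswith("mleap.pyspark."):
                return True   # mleap.pyspark came first
            if name == "pyspark" or name.startswith("pyspark."):
                return False  # a PySpark import came first
    return False  # mleap.pyspark is never imported at top level


question_order = "from pyspark import SparkContext\nimport mleap.pyspark\n"
suggested_order = "import mleap.pyspark\nfrom pyspark import SparkContext\n"
print(mleap_imported_first(question_order))   # False
print(mleap_imported_first(suggested_order))  # True
```

The script in the question imports `pyspark` on its first code line, so it falls into the first case.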
I am trying to use pysparkling.ml.H2OMOJOModel to score a Spark DataFrame with a MOJO model trained with h2o==3.32.0.2 in an AWS Glue job; however, I get the error: TypeError: 'JavaPackage' object is not callable.
I opened a ticket with AWS support and they confirmed that the Glue environment is OK and that the problem is probably with sparkling-water (pysparkling). It seems that some dependency library is missing, but I have no idea which one.
The simple code below works perfectly when I run it on my local computer (I only need to change the MOJO path for GBM_grid__1_AutoML_20220323_233606_model_53.zip).
Has anyone managed to run sparkling-water in Glue jobs successfully?
Job Details:
-Glue version 2.0
--additional-python-modules, h2o-pysparkling-2.4==3.36.0.2-1
-Worker type: G1.X
-Number of workers: 2
-Using script "createFromMojo.py"
createFromMojo.py:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pandas as pd
from pysparkling.ml import H2OMOJOSettings
from pysparkling.ml import H2OMOJOModel
# from pysparkling.ml import *
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
#Job setup
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
caminho_modelo_mojo='s3://prod-lakehouse-stream/modeling/approaches/GBM_grid__1_AutoML_20220323_233606_model_53.zip'
print(caminho_modelo_mojo)
print(dir())
settings = H2OMOJOSettings(convertUnknownCategoricalLevelsToNa = True, convertInvalidNumbersToNa = True)
model = H2OMOJOModel.createFromMojo(caminho_modelo_mojo, settings)
data = {'days_since_last_application': [3, 2, 1, 0], 'job_area': ['a', 'b', 'c', 'd']}
base_escorada = model.transform(spark.createDataFrame(pd.DataFrame.from_dict(data)))
print(base_escorada.printSchema())
print(base_escorada.show())
job.commit()
I was able to run it successfully with the following steps:
Downloaded sparkling water distribution zip: http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.1/3.36.1.1-1-3.1/index.html
Dependent JARs path: s3://bucket_name/sparkling-water-assembly-scoring_2.12-3.36.1.1-1-3.1-all.jar
--additional-python-modules, h2o-pysparkling-3.1==3.36.1.1-1-3.1
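If the job is created through the API rather than the console, the two settings above can be written as job parameters. This is a minimal sketch that assumes the console's "Dependent JARs path" field corresponds to the `--extra-jars` special parameter. The helper name is hypothetical, and the bucket is the placeholder from the steps above.

```python
def sparkling_water_job_arguments(jar_s3_path, pysparkling_requirement):
    """Build the Glue job parameters for the JAR and Python module above."""
    return {
        # S3 path of the extra JAR added to the job's Java classpath
        "--extra-jars": jar_s3_path,
        # pip-style requirement installed before the script runs
        "--additional-python-modules": pysparkling_requirement,
    }


default_arguments = sparkling_water_job_arguments(
    "s3://bucket_name/sparkling-water-assembly-scoring_2.12-3.36.1.1-1-3.1-all.jar",
    "h2o-pysparkling-3.1==3.36.1.1-1-3.1",
)
print(default_arguments)
```

Note that the JAR and the Python module carry the same Sparkling Water version (3.36.1.1-1-3.1), as in the steps above.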
I'm following this tutorial video: https://www.youtube.com/watch?v=EzQArFt_On4
The example code provided in this video:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
glueJob = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueJob.init(args['JOB_NAME'], args)
sparkSession = glueContext.spark_session
#ETL process code
def etl_process():
    ...
    return xxx
glueJob.commit()
I'm wondering whether the part before the function etl_process can be used in production directly, or whether I need to wrap it in a separate function so that I can add unit tests for it.
Something like this:
def define_spark_session():
    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)
    glue_job = Job(glue_context)
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    glue_job.init(args['JOB_NAME'], args)
    spark_session = glue_context.spark_session
    return spark_session
But it doesn't seem to need a parameter...
Or should I just write unit tests for the etl_process function?
Or maybe I can create a separate Python file with the etl_process function and import it in this script?
I'm new to this and a bit confused. Could someone help, please? Thanks.
For now it is very difficult to test AWS Glue itself locally, although there are some solutions, such as downloading the Docker image AWS provides and running it from there (you'll probably need some tweaks, but it should be all right).
I guess the easiest way is to convert the DynamicFrame you get from the Glue libraries into a Spark DataFrame (.toDF()) and then do things in pure Spark (PySpark), so you'll be able to test the result.
dataFrame = dynamic_frame.toDF()
def transformation(dataframe):
    return dataframe.withColumn(...)
def test_transformation():
    result = transformation(input_test_dataframe)
    assert ...
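If a local Spark installation is not available where the tests run, passing the session in as a parameter also lets the function be exercised with a stand-in object. The sketch below is only an illustration: `etl_process`, its body and the S3 path are hypothetical, and a mock only checks which calls are made, not the resulting data.

```python
from unittest.mock import MagicMock


def etl_process(spark_session, source_path):
    """Hypothetical ETL step: read JSON and drop duplicate rows."""
    dataframe = spark_session.read.json(source_path)
    return dataframe.dropDuplicates()


def test_etl_process_reads_and_deduplicates():
    spark_session = MagicMock()  # stands in for a real SparkSession
    result = etl_process(spark_session, "s3://bucket/input/")

    # The function should read from the given path exactly once...
    spark_session.read.json.assert_called_once_with("s3://bucket/input/")
    # ...and return the de-duplicated DataFrame.
    loaded = spark_session.read.json.return_value
    loaded.dropDuplicates.assert_called_once_with()
    assert result is loaded.dropDuplicates.return_value


test_etl_process_reads_and_deduplicates()
```

To assert on actual rows, the same function can be called with a real local SparkSession instead of the mock. The Glue-specific setup (GlueContext, Job) then stays in the job script and is not needed by the test.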
I'm trying to connect to Hive using Python. I installed all of the required dependencies (sasl, thrift_sasl, etc.).
Here is how I try to connect:
configuration = {"hive.server2.authentication.kerberos.principal" : "hive/_HOST@REALM_HOST", "hive.server2.authentication.kerberos.keytab" : "/etc/security/keytabs/hive.service.keytab"}
connection = hive.Connection(configuration = configuration, host="host", port=port, auth="KERBEROS", kerberos_service_name = "hiveserver2")
But I get this error:
Minor code may provide more information (Cannot find KDC for realm "REALM_DOMAIN")
What am I missing? Does someone have an example of a PyHive connection using Kerberos?
Thank you for your help.
Thank you @Kishore.
Actually in PySpark, the code looks like this :
import pyspark
from pyspark import SparkContext
from pyspark.sql import Row
from pyspark import SparkConf
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
import pyspark.sql.types as T
def connection(self):
    conf = pyspark.SparkConf()
    conf.setMaster('yarn-client')
    sc = pyspark.SparkContext(conf=conf)
    self.cursor = HiveContext(sc)
    self.cursor.setConf("hive.exec.dynamic.partition", "true")
    self.cursor.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
    self.cursor.setConf("hive.warehouse.subdir.inherit.perms", "true")
    self.cursor.setConf('spark.scheduler.mode', 'FAIR')
and you can request using :
rows = self.cursor.sql("SELECT someone FROM something")
for row in rows.collect():
    print(row)
I'm actually running the code via the command :
spark-submit --master yarn MyProgram.py
I guess you could also run the script with plain Python, provided pyspark is installed, like this:
python MyProgram.py
but I haven't tried it, so I can't guarantee that it works.
I don't know about PySpark, but I have been using the Scala code below and it has been working for the past year. You may be able to port it to Python. Replace the property values according to your Kerberos setup.
System.setProperty("hive.metastore.uris", "add hive.metastore.uris url");
System.setProperty("hive.metastore.sasl.enabled", "true")
System.setProperty("hive.metastore.kerberos.keytab.file", "add keytab")
System.setProperty("hive.security.authorization.enabled", "false")
System.setProperty("hive.metastore.kerberos.principal", "replace hive.metastore.kerberos.principal value")
System.setProperty("hive.metastore.execute.setugi", "true")
val hiveContext = new HiveContext(sparkContext)
I have a Python script that depends on another file, which other scripts also need, so I zipped it and shipped it with a spark-submit job. Unfortunately it does not seem to work. Here is my code snippet and the error I keep getting:
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
def main(spark):
    employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employees.json")
    # employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employee.json")
    employee.printSchema()
    employee.show()
    people = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/people.json")
    people.printSchema()
    people.show()
    employee.createOrReplaceTempView("employee")
    people.createOrReplaceTempView("people")
    newDataFrame = employee.join(people,(employee.name==people.name),how="inner")
    newDataFrame.distinct().show()
    return "Hello I'm Done Processing the Operation"
This is the external dependency, which is also called by other modules. Here is the other script, which tries to execute it:
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
def sampleTest(output):
    print(output)
if __name__ == "__main__":
    #Application Name for the Spark RDD using Python
    # APP_NAME = "Spark Application"
    spark = SparkSession \
        .builder \
        .appName("Spark Application") \
        .config("spark.master", "spark://192.168.2.3:7077") \
        .getOrCreate()
    # .config() \
    import SparkFileMerge
    abc = SparkFileMerge.main(spark)
    sampleTest(abc)
Now, when I execute the command
./spark-submit --py-files /home/varun/SparkPythonJob.zip /home/varun/main.py
it gives me the following error:
Traceback (most recent call last):
File "/home/varun/main.py", line 18, in <module>
from SparkFileMerge import SparkFileMerge
ImportError: No module named SparkFileMerge
Any help will be highly appreciated.
What does SparkPythonJob.zip contain?
First, you should check that the first code snippet is actually in a file called SparkFileMerge.py.
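The archive layout matters too: Python imports from the root of a zip passed via --py-files, so `import SparkFileMerge` only works if SparkFileMerge.py sits at the top level of the archive rather than inside a subfolder. A minimal sketch for checking that, using only the standard library (`top_level_modules` is a hypothetical helper name):

```python
import zipfile


def top_level_modules(zip_path):
    """Return the module and package names importable from the archive root."""
    modules = set()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            if len(parts) == 1 and name.endswith(".py"):
                modules.add(name[:-3])   # module.py at the root
            elif len(parts) == 2 and parts[1] == "__init__.py":
                modules.add(parts[0])    # package/__init__.py
    return sorted(modules)
```

For the archive in the question, `"SparkFileMerge" in top_level_modules("/home/varun/SparkPythonJob.zip")` should be True. If the file shows up as something like `SparkPythonJob/SparkFileMerge.py` instead, the zip was probably created from the parent folder and needs to be rebuilt from inside it.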
I'm trying to save my streaming data from Spark to Cassandra. Spark is connected to Kafka and that part works fine, but saving to Cassandra is driving me crazy. I'm using Spark 2.0.2, Kafka 0.10 and Cassandra 2.23.
This is how I'm submitting to Spark:
spark-submit --verbose --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 --jars /tmp/pyspark-cassandra-0.3.5.jar --driver-class-path /tmp/pyspark-cassandra-0.3.5.jar --py-files /tmp/pyspark-cassandra-0.3.5.jar --conf spark.cassandra.connection.host=localhost /tmp/direct_kafka_wordcount5.py localhost:9092 testing
This is my code. It is just a small modification of the Spark examples; it works, but I can't save the data to Cassandra.
This is what I'm trying to do, but with just the count result:
http://rustyrazorblade.com/2015/05/spark-streaming-with-python-and-kafka/
from __future__ import print_function
import sys
import os
import time
import pyspark_cassandra
import pyspark_cassandra.streaming
from pyspark_cassandra import CassandraSparkContext
import urllib
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
from pyspark.sql.functions import from_unixtime, unix_timestamp, min, max
from pyspark.sql.types import FloatType
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 1)
    sqlContext = SQLContext(sc)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts=lines.count()
    counts.saveToCassandra("spark", "count")
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
I got this error:
Traceback (most recent call last):
File "/tmp/direct_kafka_wordcount5.py", line 88, in <module>
counts.saveToCassandra("spark", "count")
pyspark-cassandra stopped being updated a while ago, and the latest version only supports up to Spark 1.6:
https://github.com/TargetHolding/pyspark-cassandra
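Additionally:
counts = lines.count()  # on an RDD, count() returns a number to the driver, not an RDD
If counts is a plain integer, saveToCassandra does not apply, since that is a function for RDDs. Note that lines here comes from a DStream, and DStream.count() returns a new DStream holding one count per batch rather than an integer, so check which type counts actually has.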