How to create multiple DStreams for Kinesis in PySpark?

In the example at https://github.com/apache/spark/tree/master/external/kinesis-asl, both the Scala and Java versions create multiple DStreams manually:
val sparkConfig = new SparkConf().setAppName("KinesisWordCountASL")
val ssc = new StreamingContext(sparkConfig, batchInterval)
// Create the Kinesis DStreams
val kinesisStreams = (0 until numStreams).map { i =>
  KinesisInputDStream.builder
    .streamingContext(ssc)
    .streamName(streamName)
    .endpointUrl(endpointUrl)
    .regionName(regionName)
    .initialPosition(new Latest())
    .checkpointAppName(appName)
    .checkpointInterval(kinesisCheckpointInterval)
    .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
    .build()
}
// Union all the streams
val unionStreams = ssc.union(kinesisStreams)
But in the Python example, only one DStream is created:
sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl")
ssc = StreamingContext(sc, 1)
appName, streamName, endpointUrl, regionName = sys.argv[1:]
lines = KinesisUtils.createStream(
    ssc, appName, streamName, endpointUrl, regionName, InitialPositionInStream.LATEST, 2)
When I manually create multiple DStreams in Python:
sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl")
ssc = StreamingContext(sc, 1)
appName, streamName, endpointUrl, regionName = sys.argv[1:]
dstreams = [
    KinesisUtils.createStream(
        ssc, appName, streamName, endpointUrl, regionName, InitialPositionInStream.LATEST, 2)
    for i in range(num_streams)
]
lines = sc.union(dstreams)
this throws an error:
ValueError: Cannot run multiple SparkContexts at once;
Does anyone know how to replicate the Java/Scala examples for creating multiple DStreams? Thanks.

You can combine multiple streams in Python as well, using the union method available on the StreamingContext class.
In your code you are calling union on the SparkContext variable (sc). Call it on the StreamingContext variable (ssc) instead, and unpack the list, because the Python StreamingContext.union takes the DStreams as separate arguments: lines = ssc.union(*dstreams)
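The fix can be wrapped in a small helper. This is a minimal sketch, not the official example: `create_union_stream` and `make_stream` are hypothetical names, and `make_stream` stands in for whatever call builds one receiver (for instance the `KinesisUtils.createStream(...)` call from the question).

```python
def create_union_stream(ssc, make_stream, num_streams):
    """Build num_streams DStreams with make_stream(ssc) and union them.

    ssc is expected to be a pyspark.streaming.StreamingContext;
    make_stream is a caller-supplied factory (hypothetical helper).
    """
    if num_streams < 1:
        raise ValueError("num_streams must be at least 1")
    dstreams = [make_stream(ssc) for _ in range(num_streams)]
    # The union is called on the StreamingContext, not the SparkContext.
    # StreamingContext.union takes the DStreams as separate positional
    # arguments, so the list is unpacked with *.
    return ssc.union(*dstreams)
```

With the variables from the question, a call might look like `lines = create_union_stream(ssc, lambda s: KinesisUtils.createStream(s, appName, streamName, endpointUrl, regionName, InitialPositionInStream.LATEST, 2), num_streams)`.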

Related

Missing id from vertex DataFrame in PySpark when creating a GraphFrame

I have written this code in Python. When I run it, the following errors show up.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession\
    .builder\
    .appName("GraphX")\
    .getOrCreate()
e = spark.read.parquet("hdfs://localhost:9000/gf/edge")
v = spark.read.parquet("hdfs://localhost:9000/gf/vertex")
s = GraphFrame(v, e)
s.edges.show()
s.vertices.show()
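GraphFrame expects the vertex DataFrame to have a column named "id" and the edge DataFrame to have columns named "src" and "dst". A minimal sketch for checking this before building the graph follows; `missing_graphframe_columns` is a hypothetical helper that works on plain lists of column names, such as `v.columns` and `e.columns`.

```python
def missing_graphframe_columns(vertex_columns, edge_columns):
    """Return the required GraphFrame columns that are absent.

    The result maps "vertices" / "edges" to the list of missing column
    names; an empty dict means both inputs have the required columns.
    """
    missing = {}
    if "id" not in vertex_columns:
        missing["vertices"] = ["id"]
    edge_missing = [name for name in ("src", "dst") if name not in edge_columns]
    if edge_missing:
        missing["edges"] = edge_missing
    return missing
```

If the check reports a missing "id", renaming the existing key column (for example with `v.withColumnRenamed(...)`) before calling `GraphFrame(v, e)` is one possible fix; which column to rename depends on the data and is not shown in the question.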

Using Hive SQLContext in spark executors

I am receiving a complex JSON message through Kafka. I need to extract the required fields from the JSON and store them in Hive tables. I know I cannot use the driver's sqlContext in the executors, so I want to know how to use a sqlContext in the code run by the executors. Here is the code:
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming", topic)
msgs = kvs.map(lambda msg: msg[1])
# rdd.foreach runs timeline_events on the executors
msgs.foreachRDD(lambda rdd: rdd.foreach(lambda m : timeline_events(m)))
def timeline_events(m):
    msg = json.loads(m)
    for msgJson in msg:
        event_id = msgJson['events'][0]['event_id']
        event_type = msgJson['events'][0]['type']
        incidence_source = msgJson['incident']['source']
        csr_description = msgJson['incident']['data']['csr_description']
        sc_display_priority = msgJson['incident']['data']['display_priority']
        launch_tool_rec_label = msgJson['incident']['data']['LaunchTool'][0]['Label']
        launch_tool_rec_uri = msgJson['incident']['data']['LaunchTool'][0]['URI']
        launch_itg_rec_label = msgJson['incident']['data']['LaunchItg'][0]['Label']
        launch_itg_rec_uri = msgJson['incident']['data']['LaunchItg'][0]['URI']
        # Note: launch_tool_rec_id and launch_itg_rec_id are not defined in this snippet
        sqlContext.sql("Insert into nexus.timeline_events values({},{},{},{},{},{},{},{},{},{},{})".format(event_id, event_type, csr_description, incidence_source, sc_display_priority, launch_tool_rec_label,launch_tool_rec_uri, launch_tool_rec_id, launch_itg_rec_label, launch_itg_rec_uri, launch_itg_rec_id))

Error: SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
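One common way around this error is to keep the executor-side code free of any context object: parse the JSON into plain tuples on the executors, then build a DataFrame and write it from the driver inside foreachRDD. The sketch below assumes the message layout shown in the question. `extract_timeline_events`, `save_batch` and `COLUMNS` are hypothetical names, the two `*_rec_id` fields are omitted because the question does not show where they come from, and the insert assumes the Hive table has exactly these columns in this order.

```python
import json

# Hypothetical column names; adjust to the real table definition.
COLUMNS = ["event_id", "event_type", "csr_description", "incidence_source",
           "sc_display_priority", "launch_tool_rec_label", "launch_tool_rec_uri",
           "launch_itg_rec_label", "launch_itg_rec_uri"]


def extract_timeline_events(raw_message):
    """Runs on the executors: turn one JSON message into plain tuples."""
    rows = []
    for item in json.loads(raw_message):
        event = item["events"][0]
        data = item["incident"]["data"]
        rows.append((
            event["event_id"],
            event["type"],
            data["csr_description"],
            item["incident"]["source"],
            data["display_priority"],
            data["LaunchTool"][0]["Label"],
            data["LaunchTool"][0]["URI"],
            data["LaunchItg"][0]["Label"],
            data["LaunchItg"][0]["URI"],
        ))
    return rows


def save_batch(rdd, sql_context):
    """Runs on the driver (called from foreachRDD), so sql_context is usable."""
    if rdd.isEmpty():
        return
    # Only the flatMap function is shipped to the executors.
    df = sql_context.createDataFrame(rdd.flatMap(extract_timeline_events), COLUMNS)
    df.write.insertInto("nexus.timeline_events")
```

With the variables from the question, the wiring would be roughly `msgs.foreachRDD(lambda rdd: save_batch(rdd, sqlContext))`.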

I am currently working with an ASN.1 decoder. I receive a hexadecimal code from a producer and collect it in a consumer.
I then convert the hex code to an RDD and pass that RDD to another function in the same class, decode_module, which uses a Python ASN.1 decoder to decode the hex data, return it, and print it.
I don't understand what is wrong with my code. I have already installed my ASN.1 parser dependencies on the worker nodes too.
Is something wrong with the way I call it in the lambda expression, or is it something else?
My error: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
Please help. Thank you.
My code:
class telco_cn:
    def __init__(self,sc):
        self.sc = sc
        print ('in init function')
        logging.info('entered into init function')
    def decode_module(self,msg):
        try:
            logging.info('Entered into generate module')
            ### Providing input for module we need to load
            load_module(config_values['load_module'])
            ### Providing Value for Type of Decoding
            ASN1.ASN1Obj.CODEC = config_values['PER_DECODER']
            ### Providing Input for Align/UnAlign
            PER.VARIANT = config_values['PER_ALIGNED']
            ### Providing Input for pdu load
            pdu = GLOBAL.TYPE[config_values['pdu_load']]
            ### Providing Hex value to buf
            buf = '{}'.format(msg).decode('hex')
            return val
        except Exception as e:
            logging.debug('error in decode_module function %s' %str(e))
    def consumer_input(self,sc,k_topic):
        logging.info('entered into consumer input');print(k_topic)
        consumer = KafkaConsumer(ip and other values given)
        consumer.subscribe(k_topic)
        for msg in consumer:
            print(msg.value);
            a = sc.parallelize([msg.value])
            d = a.map(lambda x: self.decode_module(x)).collect()
            print d
if __name__ == "__main__":
    logging.info('Entered into main')
    conf = SparkConf()
    conf.setAppName('telco_consumer')
    conf.setMaster('yarn-client')
    sc = SparkContext(conf=conf)
    sqlContext = HiveContext(sc)
    cn = telco_cn(sc)
    cn.consumer_input(sc,config_values['kafka_topic'])
This happens because self.decode_module is a bound method, and the instance it is bound to holds a SparkContext.
To fix your code you can use @staticmethod:
class telco_cn:
    def __init__(self, sc):
        self.sc = sc
    @staticmethod
    def decode_module(msg):
        return msg
    def consumer_input(self, sc, k_topic):
        a = sc.parallelize(list('abcd'))
        d = a.map(lambda x: telco_cn.decode_module(x)).collect()
        print d
if __name__ == "__main__":
    conf = SparkConf()
    sc = SparkContext(conf=conf)
    cn = telco_cn(sc)
    cn.consumer_input(sc, '')
For more information:
http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark
You cannot reference the instance method (self.decode_module) inside the lambda expression, because the instance object contains a SparkContext reference.
This occurs because PySpark internally pickles everything it has to send to its workers. When you ask it to execute self.decode_module() on the worker nodes, PySpark tries to pickle the whole self object, which contains a reference to the SparkContext.
To fix that, remove the SparkContext reference from the telco_cn class and take a different approach, such as using the SparkContext before calling the class instance (as Zhangs's answer suggests).
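An alternative that follows the same reasoning is to move the per-record work into a module-level function, so nothing bound to self is ever pickled. The sketch below is only an illustration: `decode_hex` and `TelcoConsumer` are hypothetical names, and `bytes.fromhex` (the Python 3 counterpart of `.decode('hex')`) stands in for the real ASN.1 decoding, which the question does not show in full.

```python
def decode_hex(msg):
    """Module-level function: it references no instance and no SparkContext."""
    # Stand-in for the real ASN.1 decoding step.
    return bytes.fromhex(msg)


class TelcoConsumer:
    def __init__(self, sc):
        # The SparkContext stays on the driver as an attribute of the instance.
        self.sc = sc

    def decode_all(self, messages):
        # Pass the plain function rather than a bound method, so `self`
        # (and the SparkContext it holds) is not pickled for the workers.
        return self.sc.parallelize(messages).map(decode_hex).collect()
```

The class can still keep the context for driver-side calls such as parallelize; only the function handed to map has to be free of it.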
In my case the issue was:
text_df = "some text"
convertUDF = udf(lambda z: my_function(z), StringType())
cleaned_fun = text_df.withColumn('cleaned', udf(convertUDF, StringType())('text'))
I was wrapping the function in udf() twice. I changed it to:
convertUDF = lambda z: my_function(z)
cleaned_fun = text_df.withColumn('cleaned', udf(convertUDF, StringType())('text'))
and that solved the error.

SparkStreaming: How to get list like collect()

I am a beginner with Spark Streaming.
I want to load HBase records in a Spark Streaming app, so I wrote the code below in Python.
My load_records function fetches HBase records and returns them.
In Spark Streaming I cannot call collect() on a DStream. sc.newAPIHadoopRDD() has to be called in the driver program, but Spark Streaming does not seem to have a method that brings objects from the workers to the driver.
How can I get HBase records in Spark Streaming? Or how can I call sc.newAPIHadoopRDD()?
def load_records(sc, table, keys):
    host = 'localhost'
    keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
    rdd_list = []
    for key in keys:
        if table == "user":
            conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": "user",
                    "hbase.mapreduce.scan.columns": "u:uid",
                    "hbase.mapreduce.scan.row.start": key, "hbase.mapreduce.scan.row.stop": key + "\x00"}
        rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                                 "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                                 "org.apache.hadoop.hbase.client.Result",
                                 keyConverter=keyConv, valueConverter=valueConv, conf=conf)
        rdd_list.append(rdd)
    first_rdd = rdd_list.pop(0)
    for rdd in rdd_list:
        first_rdd = first_rdd.union(rdd)
    return first_rdd
sc = SparkContext(appName="UserStreaming")
ssc = StreamingContext(sc, 3)
topics = ["json"]
broker_list = "localhost:9092"
inputs = KafkaUtils.createDirectStream(ssc, topics, {"metadata.broker.list": broker_list})
jsons = inputs.map(lambda input: json.loads(input[1]))
user_id_rdd = jsons.map(lambda json: json["user_id"])
# the line below does not work. Are there any other methods?
user_id_list = user_id_rdd.collect()
user_record_rdd = load_records(sc, 'user', user_id_list)
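Although a DStream has no collect(), the function passed to DStream.foreachRDD runs on the driver once per batch and receives an ordinary RDD, which can be collected there. A minimal sketch under that assumption follows; `process_batch` and `handle_records` are hypothetical names, and `load_records` is the function from the question, passed in as an argument.

```python
def process_batch(rdd, sc, load_records, handle_records):
    """Driver-side handler for one batch, intended for DStream.foreachRDD.

    rdd holds the user ids of the current batch; handle_records is a
    caller-supplied callback that receives the RDD of HBase records.
    """
    # collect() is available here because rdd is a plain RDD, not a DStream.
    user_ids = rdd.distinct().collect()
    # load_records pops from its list of RDDs, so skip empty batches.
    if user_ids:
        handle_records(load_records(sc, "user", user_ids))
```

With the names from the question (where `user_id_rdd` is actually a DStream), the registration might look like `user_id_rdd.foreachRDD(lambda rdd: process_batch(rdd, sc, load_records, lambda records: print(records.collect())))`.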

Unable to create a dataframe from json dstream using pyspark

I am attempting to create a DataFrame from JSON in a DStream, but the code below does not seem to get the DataFrame right:
import sys
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']
if __name__ == "__main__":
    if len(sys.argv) != 3:
        raise IOError("Invalid usage; the correct format is:\nquadrant_count.py <hostname> <port>")
    # Initialize a SparkContext with a name
    spc = SparkContext(appName="jsonread")
    sqlContext = SQLContext(spc)
    # Create a StreamingContext with a batch interval of 2 seconds
    stc = StreamingContext(spc, 2)
    # Checkpointing feature
    stc.checkpoint("checkpoint")
    # Creating a DStream to connect to hostname:port (like localhost:9999)
    lines = stc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    lines.pprint()
    parsed = lines.map(lambda x: json.loads(x))
    def process(time, rdd):
        print("========= %s =========" % str(time))
        try:
            # Get the singleton instance of SQLContext
            sqlContext = getSqlContextInstance(rdd.context)
            # Convert RDD[String] to RDD[Row] to DataFrame
            rowRdd = rdd.map(lambda w: Row(word=w))
            wordsDataFrame = sqlContext.createDataFrame(rowRdd)
            # Register as table
            wordsDataFrame.registerTempTable("mytable")
            testDataFrame = sqlContext.sql("select summary from mytable")
            print(testDataFrame.show())
            print(testDataFrame.printSchema())
        except:
            pass
    parsed.foreachRDD(process)
    stc.start()
    # Wait for the computation to terminate
    stc.awaitTermination()
There are no errors. When the script runs, it reads the JSON from the streaming context successfully, but it prints neither the values in summary nor the DataFrame schema.
Example JSON I am attempting to read:
{"reviewerID": "A2IBPI20UZIR0U", "asin": "1384719342", "reviewerName":
"cassandra tu \"Yeah, well, that's just like, u...", "helpful": [0,
0], "reviewText": "Not much to write about here, but it does exactly
what it's supposed to. filters out the pop sounds. now my recordings
are much more crisp. it is one of the lowest prices pop filters on
amazon so might as well buy it, they honestly work the same despite
their pricing,", "overall": 5.0, "summary": "good", "unixReviewTime":
1393545600, "reviewTime": "02 28, 2014"}
I am an absolute newcomer to Spark Streaming and started working on pet projects by reading the documentation. Any help and guidance is greatly appreciated.
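A likely reason nothing is printed: the posted code never imports Row, it wraps each parsed record in a single "word" column (so "summary" is not a column of mytable), and the bare except hides both failures. The sketch below shows one possible rewrite under those assumptions. `to_record` is a hypothetical helper, the three selected fields are taken from the example JSON, and the SQLContext is passed in explicitly.

```python
import json
import traceback


def to_record(line):
    """Parse one JSON line and keep only the fields needed downstream."""
    review = json.loads(line)
    return {
        "reviewerID": review.get("reviewerID"),
        "summary": review.get("summary"),
        "overall": review.get("overall"),
    }


def process(time, rdd, sql_context):
    """Driver-side batch handler; rdd is expected to hold to_record() dicts."""
    print("========= %s =========" % str(time))
    if rdd.isEmpty():
        return
    try:
        # Imported here so the sketch can be loaded without PySpark installed.
        from pyspark.sql import Row
        df = sql_context.createDataFrame(rdd.map(lambda r: Row(**r)))
        df.registerTempTable("mytable")
        df.printSchema()
        sql_context.sql("select summary from mytable").show()
    except Exception:
        # Print the failure instead of silently discarding it.
        traceback.print_exc()
```

With the names from the question, the stream would be wired up roughly as `lines.map(to_record).foreachRDD(lambda time, rdd: process(time, rdd, getSqlContextInstance(rdd.context)))`.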
