I am trying to save my streaming data from Spark to Cassandra. Spark is connected to Kafka and that part works fine, but saving to Cassandra is driving me crazy. I am using Spark 2.0.2, Kafka 0.10 and Cassandra 2.23.
This is how I submit the job to Spark:
spark-submit --verbose --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 --jars /tmp/pyspark-cassandra-0.3.5.jar --driver-class-path /tmp/pyspark-cassandra-0.3.5.jar --py-files /tmp/pyspark-cassandra-0.3.5.jar --conf spark.cassandra.connection.host=localhost /tmp/direct_kafka_wordcount5.py localhost:9092 testing
Here is my code. It is just a small modification of the Spark examples; it works, but I cannot save the data to Cassandra.
This is what I am trying to do (following the post below), but only with the count result:
http://rustyrazorblade.com/2015/05/spark-streaming-with-python-and-kafka/
from __future__ import print_function
import sys
import os
import time
import pyspark_cassandra
import pyspark_cassandra.streaming
from pyspark_cassandra import CassandraSparkContext
import urllib
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
from pyspark.sql.functions import from_unixtime, unix_timestamp, min, max
from pyspark.sql.types import FloatType
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 1)
    sqlContext = SQLContext(sc)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.count()
    counts.saveToCassandra("spark", "count")
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
I get this error:
Traceback (most recent call last):
  File "/tmp/direct_kafka_wordcount5.py", line 88, in <module>
    counts.saveToCassandra("spark", "count")
pyspark-cassandra stopped being updated a while ago, and the latest version only supports up to Spark 1.6:
https://github.com/TargetHolding/pyspark-cassandra
Additionally,
counts = lines.count()  # returns data to the driver (not an RDD)
counts is now an integer. This means saveToCassandra does not apply, since that is a function for RDDs.
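If you still want to persist results, one option is the DataFrame API of the Spark Cassandra Connector instead of pyspark-cassandra. The snippet below is only a sketch: it assumes the job is submitted with --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3 rather than the pyspark-cassandra jar, and that a keyspace spark with a table count(word text PRIMARY KEY, total int) already exists; the word/total columns are purely illustrative.

# Sketch: compute per-word counts and write each micro-batch through the
# DataFrame writer of the Spark Cassandra Connector.
def save_to_cassandra(time, rdd):
    if rdd.isEmpty():
        return
    df = sqlContext.createDataFrame(rdd, ["word", "total"])
    df.write \
        .format("org.apache.spark.sql.cassandra") \
        .options(keyspace="spark", table="count") \
        .mode("append") \
        .save()

word_counts = lines.flatMap(lambda line: line.split(" ")) \
                   .map(lambda word: (word, 1)) \
                   .reduceByKey(lambda a, b: a + b)
word_counts.foreachRDD(save_to_cassandra)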
I am reading a file from HDFS, but I need to authenticate with Kerberos first, and I do not know how to run the kinit command from the Python script.
How can I authenticate in Python using a keytab?
kinit
kinit userlinux -k -t /etc/security/keytabs/userlinux.keytab
Python script:
import os
import sys
from pyspark.sql import HiveContext, Row
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.types import *
if __name__ == '__main__':
    conf = SparkConf().setAppName("Spark RDD").set("spark.speculation", "true")
    sc = SparkContext(conf=conf)
    sc.setLogLevel("OFF")
    sqlContext = HiveContext(sc)

    rddCentral = sc.textFile("/user/csv/file.csv")
    rddCentralMap = rddCentral.map(lambda line: line.split(","))
    print('paso 1')

    dfFields = ["AGE", "NAME", "CITY"]
    dfSchema = StructType([StructField(field_name, StringType(), True) for field_name in dfFields])
    dfCentral = sqlContext.createDataFrame(rddCentralMap, dfSchema)
    dfCentralFilter = dfCentral.filter("AGE > 10 and CITY = 'B'").orderBy("AGE", "CITY", ascending=True)

    for row in dfCentralFilter.rdd.collect():
        print(row['AGE'])
        print(row['NAME'])
        print(row['CITY'])

    sc.stop()
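One way to handle the kinit step is to shell out to it from Python before the SparkContext is created. This is only a sketch: it assumes the kinit binary is on the PATH of the host running the driver and reuses the principal and keytab path from the question.

import subprocess

def kerberos_login(principal, keytab):
    # Runs: kinit <principal> -k -t <keytab>; raises CalledProcessError on failure
    subprocess.check_call(["kinit", principal, "-k", "-t", keytab])

# Call this before SparkContext(conf=conf) so the ticket already exists when
# Spark reads from HDFS.
kerberos_login("userlinux", "/etc/security/keytabs/userlinux.keytab")

If the job is submitted to YARN, spark-submit also accepts --principal and --keytab options, which avoids shelling out from Python altogether.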
I set up a Kafka system with a producer and a consumer, streaming the lines of a JSON file as messages.
Using PySpark, I need to analyze the data for the different streaming windows. To do that, I need to look at the data as they are streamed by PySpark. How can I do it?
To run the code I used Yannael's Docker container. Here is my Python code:
# Add dependencies and load modules
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.ui.port=4040 --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0,com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3 pyspark-shell'
from kafka import KafkaConsumer
from random import randint
from time import sleep
# Load modules and start SparkContext
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
conf = SparkConf() \
    .setAppName("Streaming test") \
    .setMaster("local[2]") \
    .set("spark.cassandra.connection.host", "127.0.0.1")

try:
    sc.stop()
except:
    pass

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
# Create streaming task
ssc = StreamingContext(sc, 0.60)
kafkaStream = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "spark-streaming-consumer", {'test': 1})
ssc.start()
You can call kafkaStream.pprint() on the DStream to print each batch. Alternatively, look into Structured Streaming: read the topic into a streaming DataFrame df with spark.readStream.format("kafka") and print it to the console like so:
query = df \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()
query.awaitTermination()
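For completeness, here is a sketch of how that streaming DataFrame df could be created. It assumes a SparkSession, that the spark-sql-kafka-0-10 package matching your Spark/Scala version is added to --packages (it is not in the PYSPARK_SUBMIT_ARGS above), and that the broker listens on 127.0.0.1:9092 with the test topic from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Kafka source for Structured Streaming; key/value arrive as bytes, so cast
# them to strings before printing.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "127.0.0.1:9092") \
    .option("subscribe", "test") \
    .load() \
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")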
I see that you have Cassandra endpoints configured, so assuming you are writing into Cassandra, you could use Kafka Connect rather than writing Spark code for this.
I am using Kafka (Confluent 4.0.0) to fetch data from SQL Server into Kafka topics.
I want to read the topic data stored in Kafka from a Spark Streaming program, using the program below:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer
schema_registry_client = CachedSchemaRegistryClient(url='http://schemaregistryhostip:8081')
serializer = MessageSerializer(schema_registry_client)
ssc = StreamingContext(sc, 30)
kvs = KafkaUtils.createDirectStream(ssc, ["topic name"], {"metadata.broker.list": "brokerip:9092"}, valueDecoder=serializer.decode_message)
lines = kvs.map(lambda x: x[1])
lines.pprint()
ssc.start()
ssc.awaitTermination()
But I get the error below when I start the streaming job with ssc.start():
ImportError: No module named confluent_kafka.avro.serializer
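This error usually means the confluent_kafka package is not importable by the Python interpreter that PySpark is actually using, on the driver and/or the workers. As a rough diagnostic sketch, assuming the package was installed with pip install confluent-kafka[avro], you can check the import in that same interpreter before building the stream:

import importlib
import sys

# Print which interpreter PySpark runs and verify the module resolves in it;
# if this raises ImportError, install confluent-kafka[avro] for this
# interpreter (and on every worker node), or point PYSPARK_PYTHON at one
# that has it.
print(sys.executable)
importlib.import_module("confluent_kafka.avro.serializer.message_serializer")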
I am using Spark 2.1.0 with Kafka 0.9 in a MapR environment. I am trying to read from a Kafka topic into Spark Streaming, but I get the error below when I run the KafkaUtils.createDirectStream command.
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.streaming.kafka09.KafkaUtilsPythonHelper.createDirectStream.
Trace:
py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.ArrayList, class java.util.HashMap]) does not exist
The code I am running:
from __future__ import print_function
import sys
from pyspark import SparkContext,SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
from pyspark.streaming.kafka09 import KafkaUtils;
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, 3)
strLoc = '/home/mapr/stream:info'
kafkaparams = {"zookeeper.connect" : "x.x.x.x:5181","metadata.broker.list" : "x.x.x.x:9092"}
strarg = KafkaUtils.createDirectStream(ssc, [strLoc], kafkaparams)  # <- error when I run this command in the pyspark shell
I tried refining your code a bit. Please try running the code below:
from pyspark.sql import SQLContext, SparkSession
from pyspark.streaming import StreamingContext
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer
from pyspark.streaming.kafka import KafkaUtils
import json
var_schema_url = 'http://localhost:8081'
var_kafka_parms_src = {"metadata.broker.list": 'localhost:9092'}
schema_registry_client = CachedSchemaRegistryClient(var_schema_url)
serializer = MessageSerializer(schema_registry_client)
spark = SparkSession.builder \
    .appName('Advertiser_stream') \
    .master('local[*]') \
    .getOrCreate()

def handler(message):
    records = message.collect()
    for record in records:
        # process each decoded record here, e.g.:
        print(record)

sc = spark.sparkContext
ssc = StreamingContext(sc, 5)
kvs = KafkaUtils.createDirectStream(ssc, ['Topic-name'], var_kafka_parms_src, valueDecoder=serializer.decode_message)
kvs.foreachRDD(handler)
ssc.start()
ssc.awaitTermination()
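As one illustration of what the handler body might do, the sketch below turns the decoded values of each micro-batch into a DataFrame and prints it. It assumes each decoded value is a Python dict produced by the Avro deserializer; the resulting columns depend entirely on your Avro schema.

# Hypothetical handler: record[0] is the message key, record[1] the decoded value.
def handler(message):
    records = message.collect()
    if records:
        rows = [record[1] for record in records]
        df = spark.createDataFrame(rows)
        df.show(truncate=False)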
Below is my code to write data into Hive
from pyspark import since,SparkContext as sc
from pyspark.sql import SparkSession
from pyspark.sql.functions import _functions , isnan
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import HiveContext as hc
spark = SparkSession.builder.appName("example-spark").config("spark.sql.crossJoin.enabled","true").config('spark.sql.warehouse.dir',"file:///C:/spark-2.0.0-bin-hadoop2.7/bin/metastore_db/spark-warehouse").config('spark.rpc.message.maxSize','1536').getOrCreate()
Name = spark.read.csv("file:///D:/valid.csv", header="true", inferSchema=True, sep=',')
join_df=join_df.where("LastName != ''").show()
join_df.registerTempTable("test")
hc.sql("CREATE TABLE dev_party_tgt_repl STORED AS PARQUETFILE AS SELECT * from dev_party_tgt")
After executing the above code I get the error below:
Traceback (most recent call last):
  File "D:\01 Delivery Support\01 easyJet\SparkEclipseWorkspace\SparkTestPrograms\src\NameValidation.py", line 22, in <module>
    join_df.registerTempTable("test")
AttributeError: 'NoneType' object has no attribute 'test'
My system environment details:
OS: Windows
IDE: Eclipse Neon
Spark version: 2.0.0
show() only prints the DataFrame and returns None, so do not assign its result back to join_df. Try this instead:
join_df.where("LastName != ''").write.saveAsTable("dev_party_tgt_repl")
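For reference, a minimal sketch of the corrected flow, using the same file path and target table name as the question (show() is kept purely for inspection, since it returns None):

# Keep the DataFrame in a variable and never assign the result of show().
join_df = spark.read.csv("file:///D:/valid.csv", header=True, inferSchema=True, sep=',')
filtered = join_df.where("LastName != ''")
filtered.show()  # prints a preview, returns None
filtered.write.mode("overwrite").saveAsTable("dev_party_tgt_repl")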