[Python] Failed to find data source: kafka

I want to read the data sent by the kafka producer, but I encountered the following problem:
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
Then, following the error message, I searched the official documentation and this website and found questions with an error like mine: link1
However, none of those approaches solved the problem, so I want to ask if there is a better way to fix it.
Attached below are the code that produces the error and my version information.
My code:
from kafka import KafkaProducer
from pyspark.python.pyspark.shell import spark
from pyspark.streaming import StreamingContext
from pyspark import SparkConf, SparkContext
import json
import sys
import os
import findspark

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1'
findspark.init()


def ReadingDataToKafka():
    spark_conf = SparkConf().setAppName("KafkaWordCount")
    sc = SparkContext.getOrCreate(conf=spark_conf)
    sc.setLogLevel("ERROR")
    ssc = StreamingContext(sc, 1)
    ssc.checkpoint("file:///tmp/ZHYCargeProject")
    topics = 'sex'
    topicAry = topics.split(",")
    topicMap = {}
    for topic in topicAry:
        topicMap[topic] = 1
    df = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "bigdataweb01:9092") \
        .option("subscribe", "sex") \
        .load()
The error message is as follows:
Traceback (most recent call last):
  File "/tmp/ZHYCargeProject/demo3/kafka_text.py", line 94, in <module>
    ReadingDataToKafka()
  File "/tmp/ZHYCargeProject/demo3/kafka_text.py", line 23, in ReadingDataToKafka
    df = spark.readStream \
  File "/home/ubuntu/anaconda3/envs/pyspark/lib/python3.8/site-packages/pyspark/sql/streaming.py", line 482, in load
    return self._df(self._jreader.load())
  File "/home/ubuntu/anaconda3/envs/pyspark/lib/python3.8/site-packages/py4j/java_gateway.py", line 1309, in __call__
    return_value = get_return_value(
  File "/home/ubuntu/anaconda3/envs/pyspark/lib/python3.8/site-packages/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
Relevant version information:
python== 3.8.13
java==1.8.0_312
Spark==3.2.1
kafka==2.12-3.20
scala==2.12.15
kafka-python==2.0.2
pyspark==3.1.2

Related

pyspark streaming and utils import issues

I am trying to run the code below:
import findspark
findspark.init('/opt/spark')
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 pyspark-shell'
import sys
import time
from pyspark.context import SparkContext
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

n_secs = 1
topic = "video-stream-event"

conf = SparkConf().setAppName("KafkaStreamProcessor").setMaster("local[*]")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, n_secs)

kafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {
    'bootstrap.servers': '127.0.0.1:9092',
    'group.id': 'video-group',
    'fetch.message.max.bytes': '15728640',
    'auto.offset.reset': 'largest'})

# lines = kafkaStream.map(lambda x: x[1])
print(kafkaStream)
I got the following error:
from pyspark.streaming.kafka import KafkaUtils
ModuleNotFoundError: No module named 'pyspark.streaming.kafka'
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
Used Python 3.7 and PySpark 3.1.2.
I changed to PySpark 2.4.5 and 2.4.6 and executed the same code, and got the error below:
> 21/10/18 14:05:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.5.jar) to method java.nio.Bits.unaligned()
> WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 21/10/18 14:05:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Traceback (most recent call last):
>   File "/home/deepika/Desktop/kafka/kafka_pyspark.py", line 12, in <module>
>     from pyspark.context import SparkContext
>   File "/home/deepika/Downloads/code_dump/spark/python/pyspark/__init__.py", line 51, in <module>
>     from pyspark.context import SparkContext
>   File "/home/deepika/Downloads/code_dump/spark/python/pyspark/context.py", line 31, in <module>
>     from pyspark import accumulators
>   File "/home/deepika/Downloads/code_dump/spark/python/pyspark/accumulators.py", line 97, in <module>
>     from pyspark.serializers import read_int, PickleSerializer
>   File "/home/deepika/Downloads/code_dump/spark/python/pyspark/serializers.py", line 72, in <module>
>     from pyspark import cloudpickle
>   File "/home/deepika/Downloads/code_dump/spark/python/pyspark/cloudpickle.py", line 145, in <module>
>     _cell_set_template_code = _make_cell_set_template_code()
>   File "/home/deepika/Downloads/code_dump/spark/python/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
>     return types.CodeType(
> TypeError: an integer is required (got type bytes)
> log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Any idea what to do now? I want to run the above code. I tried Python 3.7 and 3.8 and different PySpark versions too, always ending with these two errors.
I installed PySpark following this link:
Installation Link for pyspark
You should use spark-sql-kafka-0-10.
You need to move findspark.init() after the os.environ line. Also, you don't actually need that line, since you can provide the packages via findspark:
SPARK_VERSION = '3.1.2'
SCALA_VERSION = '2.12'
import findspark
findspark.add_packages(['org.apache.spark:spark-sql-kafka-0-10_' + SCALA_VERSION + ':' + SPARK_VERSION ])
findspark.init()
from pyspark import SparkContext, SparkConf
Also, if you are just getting started with Spark, use the latest version:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
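To make that concrete, here is a minimal sketch of reading the topic with Structured Streaming once the spark-sql-kafka-0-10 package is on the classpath via findspark; the broker (127.0.0.1:9092), topic (video-stream-event) and app name are taken from the question, and the console sink is used only to verify that messages arrive.
SPARK_VERSION = '3.1.2'    # should match the installed PySpark version
SCALA_VERSION = '2.12'

import findspark
findspark.add_packages(['org.apache.spark:spark-sql-kafka-0-10_' + SCALA_VERSION + ':' + SPARK_VERSION])
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStreamProcessor").master("local[*]").getOrCreate()

# Broker and topic taken from the question; adjust for your setup.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "127.0.0.1:9092")
      .option("subscribe", "video-stream-event")
      .load())

# Kafka delivers keys/values as bytes; cast the value to a string and
# print each micro-batch to the console to check that messages arrive.
query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())
query.awaitTermination()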
Regarding the log4j error, you need to create a log4j.properties file in $SPARK_HOME/conf

Pyspark - print messages from Kafka

I set up a Kafka system with a producer and a consumer, streaming the lines of a JSON file as messages.
Using PySpark, I need to analyze the data for the different streaming windows. To do so, I need to have a look at the data as they are streamed by PySpark... How can I do it?
To run the code I used Yannael's Docker container. Here is my Python code:
# Add dependencies and load modules
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.ui.port=4040 --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0,com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3 pyspark-shell'
from kafka import KafkaConsumer
from random import randint
from time import sleep

# Load modules and start SparkContext
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row

conf = SparkConf() \
    .setAppName("Streaming test") \
    .setMaster("local[2]") \
    .set("spark.cassandra.connection.host", "127.0.0.1")

try:
    sc.stop()
except:
    pass

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Create streaming task
ssc = StreamingContext(sc, 0.60)
kafkaStream = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "spark-streaming-consumer", {'test': 1})
ssc.start()
You can either call kafkaStream.pprint() on the DStream, or learn more about Structured Streaming, where you can print to the console like so:
query = kafkaStream \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
query.awaitTermination()
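For the first option, here is a minimal sketch that reuses the ssc and kafkaStream already created in the question's code; note that the transformation and pprint() have to be registered before ssc.start() is called.
# kafkaStream is a DStream of (key, value) pairs from KafkaUtils.createStream.
lines = kafkaStream.map(lambda kv: kv[1])   # keep only the message value
lines.pprint()                              # print a sample of each micro-batch
ssc.start()
ssc.awaitTermination()                      # keep the streaming job running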
I see that you have Cassandra endpoints configured, so assuming you're writing into Cassandra, you can use Kafka Connect rather than writing Spark code for this.

Failed to Import External Dependency in Spark

I have a Python script that depends on another file, which is also needed by other scripts, so I zipped it and shipped it with a spark-submit job. Unfortunately it does not seem to work. Here is my code snippet and the error I keep getting.
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession


def main(spark):
    employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employees.json")
    # employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employee.json")
    employee.printSchema()
    employee.show()

    people = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/people.json")
    people.printSchema()
    people.show()

    employee.createOrReplaceTempView("employee")
    people.createOrReplaceTempView("people")
    newDataFrame = employee.join(people, (employee.name == people.name), how="inner")
    newDataFrame.distinct().show()
    return "Hello I'm Done Processing the Operation"
This is the external dependency that other modules call as well. Here is the script that tries to use it:
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession


def sampleTest(output):
    print output


if __name__ == "__main__":
    # Application Name for the Spark RDD using Python
    # APP_NAME = "Spark Application"
    spark = SparkSession \
        .builder \
        .appName("Spark Application") \
        .config("spark.master", "spark://192.168.2.3:7077") \
        .getOrCreate()
        # .config() \

    import SparkFileMerge

    abc = SparkFileMerge.main(spark)
    sampleTest(abc)
Now, when I execute the command
./spark-submit --py-files /home/varun/SparkPythonJob.zip /home/varun/main.py
it gives me the following error:
Traceback (most recent call last):
  File "/home/varun/main.py", line 18, in <module>
    from SparkFileMerge import SparkFileMerge
ImportError: No module named SparkFileMerge
Any help will be highly appreciated.
What does SparkPythonJob.zip contain?
First, you should check that the first code snippet is actually in a file called SparkFileMerge.py.
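To make that check concrete, here is a sketch of the layout the driver script expects; the structure is an assumption based on the snippets in the question, not something confirmed by the asker.
# Assumed contents of SparkPythonJob.zip:
#
#   SparkPythonJob.zip
#   └── SparkFileMerge.py      # the first snippet, defining main(spark)
#
# spark-submit --py-files adds the zip to sys.path on the driver and the
# executors, so the module is imported by its file name:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark Application").getOrCreate()

import SparkFileMerge           # works only if SparkFileMerge.py sits at the zip root

print(SparkFileMerge.main(spark))

# Note: the traceback shows "from SparkFileMerge import SparkFileMerge", which
# only works if SparkFileMerge.py itself defines a name called SparkFileMerge
# (or if the zip contains a SparkFileMerge/ package); pick one convention and
# keep the import consistent with it.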

mleap AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'

I am having problems executing the example code from the mleap repository. I wish to run the code in a script instead of a jupyter notebook (which is how the example is run). My script is as follows:
##################################################################################
# start a local spark session
# https://spark.apache.org/docs/0.9.0/python-programming-guide.html
##################################################################################
from pyspark import SparkContext, SparkConf
conf = SparkConf()
#set app name
conf.set("spark.app.name", "train classifier")
#Run Spark locally with as many worker threads as logical cores on your machine (cores X threads).
conf.set("spark.master", "local[*]")
#number of cores to use for the driver process (only in cluster mode)
conf.set("spark.driver.cores", "1")
#Limit of total size of serialized results of all partitions for each Spark action (e.g. collect)
conf.set("spark.driver.maxResultSize", "1g")
#Amount of memory to use for the driver process
conf.set("spark.driver.memory", "1g")
#Amount of memory to use per executor process (e.g. 2g, 8g).
conf.set("spark.executor.memory", "2g")
#pass configuration to the spark context object along with code dependencies
sc = SparkContext(conf=conf)
from pyspark.sql.session import SparkSession
spark = SparkSession(sc)
##################################################################################
import mleap.pyspark
# # Imports MLeap serialization functionality for PySpark
from mleap.pyspark.spark_support import SimpleSparkSerializer
# Import standard PySpark Transformers and packages
from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row
# Create a test data frame
l = [('Alice', 1), ('Bob', 2)]
rdd = sc.parallelize(l)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
df2.collect()
# Build a very simple pipeline using two transformers
string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
feature_assembler = VectorAssembler(
    inputCols=[string_indexer.getOutputCol()], outputCol="features")
feature_pipeline = [string_indexer, feature_assembler]
featurePipeline = Pipeline(stages=feature_pipeline)
featurePipeline.fit(df2)
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
On executing spark-submit script.py I get the following error:
17/09/18 13:26:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/Users/opringle/Documents/Repos/finn/Magellan/src/no_spark_predict.py", line 58, in <module>
    featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'
Any help would be much appreciated! I have installed mleap from PyPI.
See Here
It seems MLeap isn't ready for Spark 2.3 yet. If you happen to be running Spark 2.3, try downgrading to 2.2 and retry. Hopefully, that helps!
I have solved the issue by attaching the following jar file when running:
spark-submit --packages ml.combust.mleap:mleap-spark_2.11:0.8.1 script.py
It seems you didn't follow the steps correctly. Here, http://mleap-docs.combust.ml/getting-started/py-spark.html, it says:
Note: the import of mleap.pyspark needs to happen before any other PySpark libraries are imported.
Hence, try importing your SparkContext after mleap.
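Putting the answers together, here is a minimal sketch of the corrected script, under these assumptions: mleap.pyspark is imported before any other PySpark module, the MLeap jars are attached via --packages as shown above, and serializeToBundle is called on the fitted PipelineModel with a transformed dataset, following the MLeap getting-started example (the bundle path is illustrative; check the MLeap docs for the exact signature in your version).
# Run with the MLeap jars attached, e.g.:
#   spark-submit --packages ml.combust.mleap:mleap-spark_2.11:0.8.1 script.py

import mleap.pyspark                                   # must come before any other PySpark import
from mleap.pyspark.spark_support import SimpleSparkSerializer

from pyspark.sql import SparkSession, Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.appName("train classifier").master("local[*]").getOrCreate()

df2 = spark.createDataFrame([Row(name='Alice', age=1), Row(name='Bob', age=2)])

string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
feature_assembler = VectorAssembler(
    inputCols=[string_indexer.getOutputCol()], outputCol="features")
featurePipeline = Pipeline(stages=[string_indexer, feature_assembler])

# fit() returns a PipelineModel; the MLeap example serializes the fitted model
# together with a transformed dataset, not the unfitted Pipeline.
fittedPipeline = featurePipeline.fit(df2)
fittedPipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                                 fittedPipeline.transform(df2))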

spark, cassandra, streaming, python, error, database, kafka

I'm trying to save my streaming data from Spark to Cassandra. Spark is connected to Kafka and that part works OK, but saving to Cassandra is driving me crazy. I'm using Spark 2.0.2, Kafka 0.10 and Cassandra 2.23.
This is how I'm submitting to Spark:
spark-submit --verbose --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 --jars /tmp/pyspark-cassandra-0.3.5.jar --driver-class-path /tmp/pyspark-cassandra-0.3.5.jar --py-files /tmp/pyspark-cassandra-0.3.5.jar --conf spark.cassandra.connection.host=localhost /tmp/direct_kafka_wordcount5.py localhost:9092 testing
This is my code. It's just a small modification of the Spark examples; it works, but I can't save this data to Cassandra.
This is what I'm trying to do, but just with the count result:
http://rustyrazorblade.com/2015/05/spark-streaming-with-python-and-kafka/
from __future__ import print_function
import sys
import os
import time

import pyspark_cassandra
import pyspark_cassandra.streaming
from pyspark_cassandra import CassandraSparkContext
import urllib
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
from pyspark.sql.functions import from_unixtime, unix_timestamp, min, max
from pyspark.sql.types import FloatType
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 1)
    sqlContext = SQLContext(sc)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.count()
    counts.saveToCassandra("spark", "count")
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
I got this error:
Traceback (most recent call last):
  File "/tmp/direct_kafka_wordcount5.py", line 88, in <module>
    counts.saveToCassandra("spark", "count")
pyspark-cassandra stopped being updated a while ago, and the latest version only supports up to Spark 1.6:
https://github.com/TargetHolding/pyspark-cassandra
Additionally:
counts = lines.count()  # returns data to the driver (not an RDD)
counts is now an integer. This means the function saveToCassandra doesn't apply, since that is a function for RDDs.
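As an alternative to pyspark-cassandra, here is a minimal sketch that writes each micro-batch through the DataStax spark-cassandra-connector's DataFrame API instead. It reuses lines and ssc from the script above; it assumes the connector package is passed to spark-submit (version matching your Spark/Scala build), that the spark.count keyspace/table from the question already exists, and that its columns match the illustrative names batch_time and msg_count used here.
# Sketch: save the per-batch message count through the DataStax connector.
# Submit with something like:
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 \
#                --conf spark.cassandra.connection.host=localhost ...
from pyspark.sql import SparkSession

def save_counts(time, rdd):
    # counts is a DStream whose RDDs each hold a single number: the batch count
    if rdd.isEmpty():
        return
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(str(time), int(c)) for c in rdd.collect()],
        ["batch_time", "msg_count"])                  # illustrative column names
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="spark", table="count")      # keyspace/table from the question
       .mode("append")
       .save())

counts = lines.count()            # one count per micro-batch
counts.foreachRDD(save_counts)    # register before ssc.start()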
