pyspark streaming and utils import issues - python

I am trying to run the code below:
import findspark
findspark.init('/opt/spark')
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 pyspark-shell'
import sys
import time
from pyspark.context import SparkContext
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
n_secs = 1
topic = "video-stream-event"
conf = SparkConf().setAppName("KafkaStreamProcessor").setMaster("local[*]")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, n_secs)
kafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {
    'bootstrap.servers': '127.0.0.1:9092',
    'group.id': 'video-group',
    'fetch.message.max.bytes': '15728640',
    'auto.offset.reset': 'largest'})
# lines = kafkaStream.map(lambda x: x[1])
print(kafkaStream)
I got the following error
from pyspark.streaming.kafka import KafkaUtils
ModuleNotFoundError: No module named 'pyspark.streaming.kafka'
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
Used Python 3.7 and PySpark 3.1.2.
I then changed to PySpark 2.4.5 and 2.4.6 and executed the same code, and got the error below:
21/10/18 14:05:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.5.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/10/18 14:05:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/home/deepika/Desktop/kafka/kafka_pyspark.py", line 12, in <module>
    from pyspark.context import SparkContext
  File "/home/deepika/Downloads/code_dump/spark/python/pyspark/__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "/home/deepika/Downloads/code_dump/spark/python/pyspark/context.py", line 31, in <module>
    from pyspark import accumulators
  File "/home/deepika/Downloads/code_dump/spark/python/pyspark/accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "/home/deepika/Downloads/code_dump/spark/python/pyspark/serializers.py", line 72, in <module>
    from pyspark import cloudpickle
  File "/home/deepika/Downloads/code_dump/spark/python/pyspark/cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/home/deepika/Downloads/code_dump/spark/python/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Any idea what to do now? I want to run the above code. I tried Python 3.7 and 3.8 and different PySpark versions too, ending with these two errors.
Installed PySpark using this link:
Installation Link for pyspark

You should use spark-sql-kafka-0-10 (Structured Streaming). The pyspark.streaming.kafka module and the old spark-streaming-kafka-0-8 connector were removed in Spark 3.0, which is why the import fails on PySpark 3.1.2.
You need to move findspark.init() after the os.environ line. Also, you don't actually need that line, as you can provide the packages via findspark:
SPARK_VERSION = '3.1.2'
SCALA_VERSION = '2.12'
import findspark
findspark.add_packages([f'org.apache.spark:spark-sql-kafka-0-10_{SCALA_VERSION}:{SPARK_VERSION}'])
findspark.init()
from pyspark import SparkContext, SparkConf
Also, if you're just getting started with Spark, use the latest version:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
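For reference, a minimal Structured Streaming sketch of the same Kafka read, assuming a Spark 3.1.2 / Scala 2.12 installation to match the package pinned above and using the topic and broker from the question; it replaces the removed KafkaUtils DStream API:

import findspark
findspark.add_packages(['org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2'])
findspark.init()

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("KafkaStreamProcessor")
         .master("local[*]")
         .getOrCreate())

# Read the topic as a streaming DataFrame instead of a DStream
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "127.0.0.1:9092")
      .option("subscribe", "video-stream-event")
      .option("startingOffsets", "latest")  # replaces auto.offset.reset=largest
      .load())

# Kafka values arrive as bytes; cast to string (the old lambda x: x[1])
lines = df.selectExpr("CAST(value AS STRING) AS value")

query = (lines.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()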
Regarding the log4j error, you need to create a log4j.properties file in $SPARK_HOME/conf
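For example, a minimal $SPARK_HOME/conf/log4j.properties (adapted from the log4j.properties.template that ships with Spark) that routes everything to the console at WARN level:

# $SPARK_HOME/conf/log4j.properties
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n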

Related

[Python]Failed to find data source: kafka

I want to read the data sent by the kafka producer, but I encountered the following problem:
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
Then, following the error message, I searched the official documentation and this website, and I found errors similar to mine: link1
However, these methods still didn't solve the problem, so I want to ask if there is a better way to solve it.
Attached below are my code, the error, and the version information.
My code:
from kafka import KafkaProducer
from pyspark.python.pyspark.shell import spark
from pyspark.streaming import StreamingContext
from pyspark import SparkConf, SparkContext
import json
import sys
import os
import findspark
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.1'
findspark.init()
def ReadingDataToKafka():
    spark_conf = SparkConf().setAppName("KafkaWordCount")
    sc = SparkContext.getOrCreate(conf=spark_conf)
    sc.setLogLevel("ERROR")
    ssc = StreamingContext(sc, 1)
    ssc.checkpoint("file:///tmp/ZHYCargeProject")
    topics = 'sex'
    topicAry = topics.split(",")
    topicMap = {}
    for topic in topicAry:
        topicMap[topic] = 1
    df = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "bigdataweb01:9092") \
        .option("subscribe", "sex") \
        .load()
The error message is as follows:
Traceback (most recent call last):
File "/tmp/ZHYCargeProject/demo3/kafka_text.py", line 94, in <module>
ReadingDataToKafka()
File "/tmp/ZHYCargeProject/demo3/kafka_text.py", line 23, in ReadingDataToKafka
df = spark.readStream \
File "/home/ubuntu/anaconda3/envs/pyspark/lib/python3.8/site-packages/pyspark/sql/streaming.py", line 482, in load
return self._df(self._jreader.load())
File "/home/ubuntu/anaconda3/envs/pyspark/lib/python3.8/site-packages/py4j/java_gateway.py", line 1309, in __call__
return_value = get_return_value(
File "/home/ubuntu/anaconda3/envs/pyspark/lib/python3.8/site-packages/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
Related version information:
python== 3.8.13
java==1.8.0_312
Spark==3.2.1
kafka==2.12-3.20
scala==2.12.15
kafka-python==2.0.2
pyspark==3.1.2
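Note that the environment above mixes pyspark 3.1.2 with a 3.2.1 Kafka connector. A minimal sketch of pinning the connector to the installed pyspark version and keeping the trailing pyspark-shell token, assuming that mismatch is the cause:

import os
import findspark

# Pin the Kafka connector to the installed pyspark version (3.1.2 above), and keep
# the trailing 'pyspark-shell' token that PYSPARK_SUBMIT_ARGS expects.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 pyspark-shell'
)
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("KafkaWordCount").getOrCreate()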

Pyspark - Error related to SparkContext - no attribute _jsc

Unsure of what the issue is with this. I've seen similar issues regarding this problem, but nothing that solves mine. The full error:
Traceback (most recent call last):
File "C:/Users/computer/PycharmProjects/spark_test/spark_test/test.py", line 4, in <module>
sqlcontext = SQLContext(sc)
File "C:\Users\computer\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\context.py", line 74, in __init__
self._jsc = self._sc._jsc
AttributeError: type object 'SparkContext' has no attribute '_jsc'
Here is the simple code I am trying to run:
from pyspark import SQLContext
from pyspark.context import SparkContext as sc
sqlcontext = SQLContext(sc)
df = sqlcontext.read.json('random.json')
If you are using Spark Shell, you will notice that SparkContext is already created.
Otherwise, you can create the SparkContext by importing it, initializing it, and providing the configuration settings. In your case you passed the SparkContext class itself to SQLContext instead of an instance:
import pyspark
from pyspark.sql import SQLContext

conf = pyspark.SparkConf()
# conf.set('spark.app.name', app_name)  # Optional configurations

# init & return
sc = pyspark.SparkContext.getOrCreate(conf=conf)
sqlcontext = SQLContext(sc)
df = sqlcontext.read.json('random.json')
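Alternatively, on Spark 2.x and later you can skip SQLContext entirely and use SparkSession, which wraps it; a minimal sketch for the same JSON read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json_read").getOrCreate()
df = spark.read.json('random.json')
df.show()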

Failed to Import External Dependency in Spark

I have a Python script that depends on another file which is also essential for other scripts, so I have zipped it and shipped it to run as a spark-submit job. Unfortunately it does not seem to work; here are my code snippet and the error I keep getting.
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
def main(spark):
    employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employees.json")
    # employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employee.json")
    employee.printSchema()
    employee.show()
    people = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/people.json")
    people.printSchema()
    people.show()
    employee.createOrReplaceTempView("employee")
    people.createOrReplaceTempView("people")
    newDataFrame = employee.join(people, (employee.name == people.name), how="inner")
    newDataFrame.distinct().show()
    return "Hello I'm Done Processing the Operation"
This is the external dependency, called by other modules as well; here is the other script that tries to execute it:
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
def sampleTest(output):
    print output

if __name__ == "__main__":
    # Application Name for the Spark RDD using Python
    # APP_NAME = "Spark Application"
    spark = SparkSession \
        .builder \
        .appName("Spark Application") \
        .config("spark.master", "spark://192.168.2.3:7077") \
        .getOrCreate()
    # .config() \
    import SparkFileMerge
    abc = SparkFileMerge.main(spark)
    sampleTest(abc)
Now, when I execute the command
./spark-submit --py-files /home/varun/SparkPythonJob.zip /home/varun/main.py
it is giving me the following error.
Traceback (most recent call last):
File "/home/varun/main.py", line 18, in <module>
from SparkFileMerge import SparkFileMerge
ImportError: No module named SparkFileMerge
Any help will be highly appreciated.
What composes SparkPythonJob.zip?
First, you should check that the first code snippet is actually in a file called SparkFileMerge.py.
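If the archive layout is correct and --py-files still isn't picked up, the zip can also be added from the driver itself; a sketch, assuming SparkFileMerge.py sits at the top level of SparkPythonJob.zip:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark Application").getOrCreate()

# Ship the archive to the executors and put it on the driver's sys.path as well.
spark.sparkContext.addPyFile("/home/varun/SparkPythonJob.zip")

import SparkFileMerge  # the module name must match the file name inside the archive
abc = SparkFileMerge.main(spark)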

mleap AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'

I am having problems executing the example code from the mleap repository. I wish to run the code in a script instead of a jupyter notebook (which is how the example is run). My script is as follows:
##################################################################################
# start a local spark session
# https://spark.apache.org/docs/0.9.0/python-programming-guide.html
##################################################################################
from pyspark import SparkContext, SparkConf
conf = SparkConf()
#set app name
conf.set("spark.app.name", "train classifier")
#Run Spark locally with as many worker threads as logical cores on your machine (cores X threads).
conf.set("spark.master", "local[*]")
#number of cores to use for the driver process (only in cluster mode)
conf.set("spark.driver.cores", "1")
#Limit of total size of serialized results of all partitions for each Spark action (e.g. collect)
conf.set("spark.driver.maxResultSize", "1g")
#Amount of memory to use for the driver process
conf.set("spark.driver.memory", "1g")
#Amount of memory to use per executor process (e.g. 2g, 8g).
conf.set("spark.executor.memory", "2g")
#pass configuration to the spark context object along with code dependencies
sc = SparkContext(conf=conf)
from pyspark.sql.session import SparkSession
spark = SparkSession(sc)
##################################################################################
import mleap.pyspark
# # Imports MLeap serialization functionality for PySpark
from mleap.pyspark.spark_support import SimpleSparkSerializer
# Import standard PySpark Transformers and packages
from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import Row
# Create a test data frame
l = [('Alice', 1), ('Bob', 2)]
rdd = sc.parallelize(l)
Person = Row('name', 'age')
person = rdd.map(lambda r: Person(*r))
df2 = spark.createDataFrame(person)
df2.collect()
# Build a very simple pipeline using two transformers
string_indexer = StringIndexer(inputCol='name', outputCol='name_string_index')
feature_assembler = VectorAssembler(
inputCols=[string_indexer.getOutputCol()], outputCol="features")
feature_pipeline = [string_indexer, feature_assembler]
featurePipeline = Pipeline(stages=feature_pipeline)
featurePipeline.fit(df2)
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
On executing spark-submit script.py I get the following error:
17/09/18 13:26:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/Users/opringle/Documents/Repos/finn/Magellan/src/no_spark_predict.py", line 58, in <module>
featurePipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip")
AttributeError: 'Pipeline' object has no attribute 'serializeToBundle'
Any help would be much appreciated! I have installed mleap from PyPI.
See Here
It seems MLeap isn't ready for Spark 2.3 yet. If you happen to be running Spark 2.3, try downgrading to 2.2 and retry. Hopefully, that helps!
I have solved the issue by attaching the following jar file when running:
spark-submit --packages ml.combust.mleap:mleap-spark_2.11:0.8.1 script.py
It seems you didn't follow the steps correctly; here, http://mleap-docs.combust.ml/getting-started/py-spark.html says that:
Note: the import of mleap.pyspark needs to happen before any other PySpark libraries are imported.
Hence, try importing your SparkContext after mleap.
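A minimal sketch of the reordered imports (the rest of the script stays the same); importing mleap.pyspark first is what adds serializeToBundle to the PySpark ML classes:

# Import MLeap before any other PySpark modules, as the MLeap docs require.
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer

# Only now import the rest of PySpark.
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline, PipelineModel

Note also that featurePipeline.fit(df2) returns a PipelineModel; depending on the MLeap version, serializeToBundle may need to be called on that fitted model rather than on the unfitted Pipeline.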

Pyspark writing data into hive

Below is my code to write data into Hive
from pyspark import since,SparkContext as sc
from pyspark.sql import SparkSession
from pyspark.sql.functions import _functions , isnan
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import HiveContext as hc
spark = SparkSession.builder.appName("example-spark").config("spark.sql.crossJoin.enabled","true").config('spark.sql.warehouse.dir',"file:///C:/spark-2.0.0-bin-hadoop2.7/bin/metastore_db/spark-warehouse").config('spark.rpc.message.maxSize','1536').getOrCreate()
Name = spark.read.csv("file:///D:/valid.csv", header="true", inferSchema=True, sep=',')
join_df=join_df.where("LastName != ''").show()
join_df.registerTempTable("test")
hc.sql("CREATE TABLE dev_party_tgt_repl STORED AS PARQUETFILE AS SELECT * from dev_party_tgt")
After executing the above code I get the error below:
Traceback (most recent call last):
File "D:\01 Delivery Support\01
easyJet\SparkEclipseWorkspace\SparkTestPrograms\src\NameValidation.py", line
22, in <module>
join_df.registerTempTable("test")
AttributeError: 'NoneType' object has no attribute 'test'
My system environment details:
OS: Windows
IDE: Eclipse Neon
Spark version: 2.0.0
Try this:
join_df.where("LastName != ''").write.saveAsTable("dev_party_tgt_repl")
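A slightly fuller sketch of the same fix (names taken from the question); the key point is that .show() returns None, so its result must not be assigned and reused:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("example-spark") \
    .enableHiveSupport() \
    .getOrCreate()

name_df = spark.read.csv("file:///D:/valid.csv", header=True, inferSchema=True, sep=',')

# Keep the DataFrame; call .show() separately if you want to inspect it.
filtered = name_df.where("LastName != ''")
filtered.show()

filtered.write.mode("overwrite").format("parquet").saveAsTable("dev_party_tgt_repl")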
