Pyspark - Error related to SparkContext - no attribute _jsc - python

I'm not sure what the issue is here. I've seen similar questions about this problem, but nothing that solves my issue. Full error:
Traceback (most recent call last):
File "C:/Users/computer/PycharmProjects/spark_test/spark_test/test.py", line 4, in <module>
sqlcontext = SQLContext(sc)
File "C:\Users\computer\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\context.py", line 74, in __init__
self._jsc = self._sc._jsc
AttributeError: type object 'SparkContext' has no attribute '_jsc'
Here is the simple code I am trying to run:
from pyspark import SQLContext
from pyspark.context import SparkContext as sc
sqlcontext = SQLContext(sc)
df = sqlcontext.read.json('random.json')

If you are using the Spark shell, you will notice that a SparkContext is already created for you.
Otherwise, you can create a SparkContext by importing it, initializing it, and providing the configuration settings. In your case you passed the SparkContext class itself to SQLContext rather than an initialized SparkContext instance:
import pyspark
from pyspark.sql import SQLContext
conf = pyspark.SparkConf()
# conf.set('spark.app.name', app_name) # Optional configurations
# init & return
sc = pyspark.SparkContext.getOrCreate(conf=conf)
sqlcontext = SQLContext(sc)
df = sqlcontext.read.json('random.json')
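As a side note, on Spark 2.x and later a SparkSession is the usual single entry point and exposes the same read API, so the SQLContext can be skipped entirely; a minimal sketch, assuming Spark 2+ and the same random.json file (the app name is arbitrary):
from pyspark.sql import SparkSession

# Build (or reuse) a session; this also creates the underlying SparkContext.
spark = SparkSession.builder.appName("json-example").getOrCreate()

# spark.read exposes the same DataFrameReader that SQLContext provides.
df = spark.read.json('random.json')
df.show()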

Related

AttributeError: 'SQLContext' object has no attribute 'load' - ApacheSpark + SparkSql connection error with mysql

I am getting the error while running the code:
Error:
AttributeError: 'SQLContext' object has no attribute 'load'
Please let me know if anyone has faced the same problem.
Code:
# required import modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
# creating a configuration for context. here "Spark-SQL" is the name of the application and we will create local spark context.
conf = SparkConf().setAppName("Spark-SQL").setMaster("local")
# create a spark context. It is the entry point into all relational functionality in Spark.
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# loading the contents from the MySql database table to the dataframe.
df = sqlContext.load(source="jdbc", url="jdbc:mysql://localhost:3306/test?user=root&password=",dbtable="members_indexed")
# display the contents of the dataframe.
df.show()
# display the schema of the dataframe.
df.printSchema()
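For what it's worth, sqlContext.load was deprecated in Spark 1.4 and removed in later releases; the usual replacement is the DataFrameReader API. A minimal sketch using the connection settings from the code above (it assumes the MySQL JDBC driver JAR is on Spark's classpath):
# sqlContext.load(...) no longer exists; read through the DataFrameReader instead.
df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/test?user=root&password=",
    dbtable="members_indexed",
    driver="com.mysql.jdbc.Driver"   # assumes MySQL Connector/J is available
).load()
df.show()
df.printSchema()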

pyspark streaming with kafka error

I am using Spark 2.1.0 with Kafka 0.9 in a MapR environment. I am trying to read from a Kafka topic into Spark Streaming. However, I am facing the error below when I run the KafkaUtils createDirectStream command.
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.streaming.kafka09.KafkaUtilsPythonHelper.createDirectStream.
Trace:
py4j.Py4JException: Method createDirectStream([class org.apache.spark.streaming.api.java.JavaStreamingContext, class java.util.ArrayList, class java.util.HashMap]) does not exist
Code that I am running:
from __future__ import print_function
import sys
from pyspark import SparkContext,SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
from pyspark.streaming.kafka09 import KafkaUtils;
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, 3)
strLoc = '/home/mapr/stream:info'
kafkaparams = {"zookeeper.connect" : "x.x.x.x:5181","metadata.broker.list" : "x.x.x.x:9092"}
strarg = KafkaUtils.createDirectStream(ssc, [strLoc], kafkaparams)  # <- error when I run this command in the pyspark shell
I have tried refining your code. Please try executing the code below.
from pyspark.sql import SQLContext, SparkSession
from pyspark.streaming import StreamingContext
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer
from pyspark.streaming.kafka import KafkaUtils
import json
var_schema_url = 'http://localhost:8081'
var_kafka_parms_src = {"metadata.broker.list": 'localhost:9092'}
schema_registry_client = CachedSchemaRegistryClient(var_schema_url)
serializer = MessageSerializer(schema_registry_client)
spark = SparkSession.builder \
    .appName('Advertiser_stream') \
    .master('local[*]') \
    .getOrCreate()

def handler(message):
    records = message.collect()
    for record in records:
        pass  # process each decoded record here
sc = spark.sparkContext
ssc = StreamingContext(sc, 5)
kvs = KafkaUtils.createDirectStream(ssc, ['Topic-name'], var_kafka_parms_src,valueDecoder=serializer.decode_message)
kvs.foreachRDD(handler)
ssc.start()
ssc.awaitTermination()

Failed to Import External Dependency in Spark

I have a Python script that depends on another file, which is also essential for other scripts, so I have zipped it and shipped it to run as a spark-submit job. Unfortunately, it does not seem to be working. Here is my code snippet and the error I keep getting.
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
def main(spark):
    employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employees.json")
    # employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employee.json")
    employee.printSchema()
    employee.show()
    people = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/people.json")
    people.printSchema()
    people.show()
    employee.createOrReplaceTempView("employee")
    people.createOrReplaceTempView("people")
    newDataFrame = employee.join(people, (employee.name == people.name), how="inner")
    newDataFrame.distinct().show()
    return "Hello I'm Done Processing the Operation"
This is the external dependency, which is called by other modules as well. Here is the other script, which tries to execute it:
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
def sampleTest(output):
    print output

if __name__ == "__main__":
    # Application Name for the Spark RDD using Python
    # APP_NAME = "Spark Application"
    spark = SparkSession \
        .builder \
        .appName("Spark Application") \
        .config("spark.master", "spark://192.168.2.3:7077") \
        .getOrCreate()
    # .config() \
    import SparkFileMerge
    abc = SparkFileMerge.main(spark)
    sampleTest(abc)
Now, when I execute the command
./spark-submit --py-files /home/varun/SparkPythonJob.zip /home/varun/main.py
it gives me the following error:
Traceback (most recent call last):
File "/home/varun/main.py", line 18, in <module>
from SparkFileMerge import SparkFileMerge
ImportError: No module named SparkFileMerge
Any help will be highly appreciated.
What does SparkPythonJob.zip contain?
First, you should check that the first code snippet is actually in a file called SparkFileMerge.py.
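As a quick sanity check, the archive passed to --py-files should have SparkFileMerge.py at its top level so that the module is importable by that name. A minimal sketch of the assumed layout and the matching import in main.py (the layout itself is an assumption based on the snippets above):
# Assumed archive layout:
#   SparkPythonJob.zip
#   └── SparkFileMerge.py      # the first snippet above, defining main(spark)
#
# Once spark-submit ships the zip via --py-files, the module can be imported:
import SparkFileMerge

result = SparkFileMerge.main(spark)   # 'spark' is the SparkSession built in main.py
print(result)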

Pyspark writing data into hive

Below is my code to write data into Hive:
from pyspark import since,SparkContext as sc
from pyspark.sql import SparkSession
from pyspark.sql.functions import _functions , isnan
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import HiveContext as hc
spark = SparkSession.builder.appName("example-spark").config("spark.sql.crossJoin.enabled","true").config('spark.sql.warehouse.dir',"file:///C:/spark-2.0.0-bin-hadoop2.7/bin/metastore_db/spark-warehouse").config('spark.rpc.message.maxSize','1536').getOrCreate()
Name = spark.read.csv("file:///D:/valid.csv", header="true", inferSchema=True, sep=',')
join_df=join_df.where("LastName != ''").show()
join_df.registerTempTable("test")
hc.sql("CREATE TABLE dev_party_tgt_repl STORED AS PARQUETFILE AS SELECT * from dev_party_tgt")
After executing the above code, I get the error below:
Traceback (most recent call last):
File "D:\01 Delivery Support\01
easyJet\SparkEclipseWorkspace\SparkTestPrograms\src\NameValidation.py", line
22, in <module>
join_df.registerTempTable("test")
AttributeError: 'NoneType' object has no attribute 'test'
My system environment details:
OS: Windows
IDE: Eclipse Neon
Spark version: 2.0.0
Try this:
join_df.where("LastName != ''").write.saveAsTable("dev_party_tgt_repl")

'PipelinedRDD' object has no attribute 'toDF' in PySpark

I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark.
I've just installed a fresh Spark 1.5.0 on an Ubuntu 14.04 (no spark-env.sh configured).
My my_script.py is:
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext
sc = SparkContext("local", "Teste Original")
data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()
and I'm running using: ./spark-submit my_script.py
And I get the error:
Traceback (most recent call last):
File "/home/fred-spark/spark-1.5.0-bin-hadoop2.6/pipeline_teste_original.py", line 34, in <module>
data = MLUtils.loadLibSVMFile(sc, "/home/fred-spark/svm_capture").toDF()
AttributeError: 'PipelinedRDD' object has no attribute 'toDF'
What I can't understand is that if I run:
data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()
directly inside PySpark shell, it works.
The toDF method is a monkey patch executed inside the SparkSession constructor (the SQLContext constructor in Spark 1.x), so to be able to use it you have to create a SQLContext (or SparkSession) first:
# SQLContext or HiveContext in Spark 1.x
from pyspark.sql import SparkSession
from pyspark import SparkContext
sc = SparkContext()
rdd = sc.parallelize([("a", 1)])
hasattr(rdd, "toDF")
## False
spark = SparkSession(sc)
hasattr(rdd, "toDF")
## True
rdd.toDF().show()
## +---+---+
## | _1| _2|
## +---+---+
## | a| 1|
## +---+---+
Not to mention you need a SQLContext or SparkSession to work with DataFrames in the first place.
Make sure you have a SparkSession too:
sc = SparkContext("local", "first app")
spark = SparkSession(sc)
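Applied to the original script, a minimal sketch (app name and file path taken from the question):
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.mllib.util import MLUtils

sc = SparkContext("local", "Teste Original")
spark = SparkSession(sc)   # creating the session makes toDF available on RDDs

data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()
data.show()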
