Pyspark writing data into hive - python

Below is my code to write data into Hive
from pyspark import since,SparkContext as sc
from pyspark.sql import SparkSession
from pyspark.sql.functions import _functions , isnan
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import HiveContext as hc
spark = SparkSession.builder.appName("example-spark").config("spark.sql.crossJoin.enabled","true").config('spark.sql.warehouse.dir',"file:///C:/spark-2.0.0-bin-hadoop2.7/bin/metastore_db/spark-warehouse").config('spark.rpc.message.maxSize','1536').getOrCreate()
Name = spark.read.csv("file:///D:/valid.csv", header="true", inferSchema=True, sep=',')
join_df=join_df.where("LastName != ''").show()
join_df.registerTempTable("test")
hc.sql("CREATE TABLE dev_party_tgt_repl STORED AS PARQUETFILE AS SELECT * from dev_party_tgt")
After executing the above code, I get the error below:
Traceback (most recent call last):
File "D:\01 Delivery Support\01
easyJet\SparkEclipseWorkspace\SparkTestPrograms\src\NameValidation.py", line
22, in <module>
join_df.registerTempTable("test")
AttributeError: 'NoneType' object has no attribute 'test'
My system environment details:
OS: Windows
IDE: Eclipse Neon
Spark version: 2.0.0

The problem is that DataFrame.show() returns None, so after join_df = join_df.where("LastName != ''").show() the variable join_df is None and the later registerTempTable call fails. Keep the filtered result as a DataFrame and write it out instead:
join_df.where("LastName != ''").write.saveAsTable("dev_party_tgt_repl")
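A fuller sketch of the same write path, assuming the session is built with Hive support (enableHiveSupport() and the overwrite mode are assumptions about the intended setup; the file, filter and table names come from the question):
from pyspark.sql import SparkSession

# build a session with Hive support so saveAsTable writes through the Hive metastore
spark = SparkSession.builder \
    .appName("example-spark") \
    .enableHiveSupport() \
    .getOrCreate()

join_df = spark.read.csv("file:///D:/valid.csv", header=True, inferSchema=True, sep=',')

# keep the filtered result as a DataFrame; show() returns None, so never assign its result
filtered = join_df.where("LastName != ''")
filtered.write.mode("overwrite").format("parquet").saveAsTable("dev_party_tgt_repl")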

Related

AttributeError: 'SQLContext' object has no attribute 'load' - ApacheSpark + SparkSql connection error with mysql

I am getting the error while running the code:
Error:
AttributeError: 'SQLContext' object has no attribute 'load'
Please let me know if anyone has faced the same problem.
Code:
# required import modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
# creating a configuration for context. here "Spark-SQL" is the name of the application and we will create local spark context.
conf = SparkConf().setAppName("Spark-SQL").setMaster("local")
# create a spark context. It is the entry point into all relational functionality in Spark.
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# loading the contents from the MySql database table to the dataframe.
df = sqlContext.load(source="jdbc", url="jdbc:mysql://localhost:3306/test?user=root&password=",dbtable="members_indexed")
# display the contents of the dataframe.
df.show()
# display the schema of the dataframe.
df.printSchema()
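SQLContext.load was deprecated in Spark 1.4 and removed later, which is why the attribute is missing. A minimal sketch of the DataFrameReader equivalent, continuing from the sqlContext created above and reusing the connection details from the question (the MySQL JDBC driver still has to be on the classpath):
# load the table through the DataFrameReader API instead of the removed sqlContext.load
df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/test?user=root&password=",
    dbtable="members_indexed"
).load()

df.show()          # display the contents of the dataframe
df.printSchema()   # display the schema of the dataframe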

Pyspark - Error related to SparkContext - no attribute _jsc

I'm unsure what the issue is with this. I've seen similar questions about this problem, but nothing that solves my issue. Full error:
Traceback (most recent call last):
File "C:/Users/computer/PycharmProjects/spark_test/spark_test/test.py", line 4, in <module>
sqlcontext = SQLContext(sc)
File "C:\Users\computer\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\context.py", line 74, in __init__
self._jsc = self._sc._jsc
AttributeError: type object 'SparkContext' has no attribute '_jsc'
Here is the simple code I am trying to run:
from pyspark import SQLContext
from pyspark.context import SparkContext as sc
sqlcontext = SQLContext(sc)
df = sqlcontext.read.json('random.json')
If you are using the Spark shell, you will notice that a SparkContext is already created for you.
Otherwise, you can create one yourself by importing and initializing it with your configuration settings. In your case you passed the SparkContext class itself (imported as sc) to SQLContext instead of an instance:
import pyspark
from pyspark.sql import SQLContext

conf = pyspark.SparkConf()
# conf.set('spark.app.name', app_name)  # Optional configurations

# initialize (or reuse) a SparkContext instance and wrap it in a SQLContext
sc = pyspark.SparkContext.getOrCreate(conf=conf)
sqlcontext = SQLContext(sc)
df = sqlcontext.read.json('random.json')
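For what it's worth, on Spark 2.x the same thing is usually done through SparkSession, which creates the underlying SparkContext for you; a minimal sketch, assuming the same random.json file:
from pyspark.sql import SparkSession

# the builder creates (or reuses) the SparkContext and SQL support in one step
spark = SparkSession.builder.appName("json-example").getOrCreate()
df = spark.read.json('random.json')
df.show()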

Failed to Import External Dependency in Spark

I have a Python script which depends on another file that is also essential for other scripts, so I zipped it and shipped it to run as a spark-submit job, but unfortunately it does not seem to be working. Here is my code snippet and the error I keep getting:
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
def main(spark):
    employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employees.json")
    # employee = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/employee.json")
    employee.printSchema()
    employee.show()
    people = spark.read.json("/storage/hadoop/hadoop-3.0.0/bin/people.json")
    people.printSchema()
    people.show()
    employee.createOrReplaceTempView("employee")
    people.createOrReplaceTempView("people")
    newDataFrame = employee.join(people, (employee.name == people.name), how="inner")
    newDataFrame.distinct().show()
    return "Hello I'm Done Processing the Operation"
This is the external dependency, which is called by other modules as well. Here is the other script which tries to execute the file:
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession

def sampleTest(output):
    print output

if __name__ == "__main__":
    # Application Name for the Spark RDD using Python
    # APP_NAME = "Spark Application"
    spark = SparkSession \
        .builder \
        .appName("Spark Application") \
        .config("spark.master", "spark://192.168.2.3:7077") \
        .getOrCreate()
        # .config() \
    import SparkFileMerge
    abc = SparkFileMerge.main(spark)
    sampleTest(abc)
Now when I execute the command
./spark-submit --py-files /home/varun/SparkPythonJob.zip /home/varun/main.py
it is giving me the following error.
Traceback (most recent call last):
File "/home/varun/main.py", line 18, in <module>
from SparkFileMerge import SparkFileMerge
ImportError: No module named SparkFileMerge
Any help will be highly appreciated.
What composes SparkPythonJob.zip?
First, you should check that the first code snippet is actually in a file called SparkFileMerge.py at the root of the zip; --py-files only adds the archive to the Python search path on the driver and executors, so the module name has to match the file name.
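As a quick sanity check (the zip path comes from the spark-submit command above; the expected layout is an assumption), you can list the archive and confirm the module sits at its root:
import zipfile

# SparkFileMerge.py should appear at the top level, not inside a nested folder
with zipfile.ZipFile("/home/varun/SparkPythonJob.zip") as zf:
    print(zf.namelist())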

spark, cassandra, streaming, python, error, database, kafka

I'm trying to save my streaming data from Spark to Cassandra. Spark is connected to Kafka and that part works fine, but saving to Cassandra is driving me crazy. I'm using Spark 2.0.2, Kafka 0.10 and Cassandra 2.23.
This is how I'm submitting to Spark:
spark-submit --verbose --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 --jars /tmp/pyspark-cassandra-0.3.5.jar --driver-class-path /tmp/pyspark-cassandra-0.3.5.jar --py-files /tmp/pyspark-cassandra-0.3.5.jar --conf spark.cassandra.connection.host=localhost /tmp/direct_kafka_wordcount5.py localhost:9092 testing
And this is my code. It's just a small modification of the Spark examples; it works, but I can't save this data to Cassandra.
This is what I'm trying to do, but just with the count result:
http://rustyrazorblade.com/2015/05/spark-streaming-with-python-and-kafka/
from __future__ import print_function
import sys
import os
import time
import pyspark_cassandra
import pyspark_cassandra.streaming
from pyspark_cassandra import CassandraSparkContext
import urllib
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
from pyspark.sql.functions import from_unixtime, unix_timestamp, min, max
from pyspark.sql.types import FloatType
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 1)
    sqlContext = SQLContext(sc)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.count()
    counts.saveToCassandra("spark", "count")
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
I got this error:
Traceback (most recent call last):
File "/tmp/direct_kafka_wordcount5.py", line 88, in
counts.saveToCassandra("spark", "count")
pyspark-cassandra stopped being updated a while ago, and the latest version only supports up to Spark 1.6:
https://github.com/TargetHolding/pyspark-cassandra
Additionally
counts = lines.count()  # returns data to the driver (not an RDD)
counts is now an integer. This means the function saveToCassandra doesn't apply, since that is a function for RDDs.
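If the goal is just to persist a per-batch count, one rough sketch is to stay at the RDD level inside foreachRDD and build rows matching the target table; this assumes the pyspark_cassandra RDD patching behaves as its README describes and that a spark.count table with a suitable schema already exists:
import pyspark_cassandra  # importing it patches saveToCassandra onto RDDs

def save_count(rdd):
    # wrap the single count value in a row dict matching the (hypothetical) table columns
    rows = [{"value": rdd.count()}]
    rdd.context.parallelize(rows).saveToCassandra("spark", "count")

lines.foreachRDD(save_count)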

'PipelinedRDD' object has no attribute 'toDF' in PySpark

I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark.
I've just installed a fresh Spark 1.5.0 on an Ubuntu 14.04 (no spark-env.sh configured).
My my_script.py is:
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext
sc = SparkContext("local", "Teste Original")
data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()
and I'm running using: ./spark-submit my_script.py
And I get the error:
Traceback (most recent call last):
File "/home/fred-spark/spark-1.5.0-bin-hadoop2.6/pipeline_teste_original.py", line 34, in <module>
data = MLUtils.loadLibSVMFile(sc, "/home/fred-spark/svm_capture").toDF()
AttributeError: 'PipelinedRDD' object has no attribute 'toDF'
What I can't understand is that if I run:
data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()
directly inside PySpark shell, it works.
The toDF method is a monkey patch executed inside the SparkSession constructor (the SQLContext constructor in 1.x), so to be able to use it you have to create a SQLContext (or SparkSession) first:
# SQLContext or HiveContext in Spark 1.x
from pyspark.sql import SparkSession
from pyspark import SparkContext
sc = SparkContext()
rdd = sc.parallelize([("a", 1)])
hasattr(rdd, "toDF")
## False
spark = SparkSession(sc)
hasattr(rdd, "toDF")
## True
rdd.toDF().show()
## +---+---+
## | _1| _2|
## +---+---+
## | a| 1|
## +---+---+
Not to mention you need a SQLContext or SparkSession to work with DataFrames in the first place.
Make sure you have a SparkSession too:
sc = SparkContext("local", "first app")
spark = SparkSession(sc)
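Applied back to the original Spark 1.5 script, a minimal sketch (reusing the /home/svm_capture path from the question) would be:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.util import MLUtils

sc = SparkContext("local", "Teste Original")
sqlContext = SQLContext(sc)  # constructing it installs the toDF monkey patch on RDDs

# loadLibSVMFile returns an RDD of LabeledPoint; toDF works once the SQLContext exists
data = MLUtils.loadLibSVMFile(sc, "/home/svm_capture").toDF()
data.show()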
