I tried to create a standalone PySpark program that reads a CSV and stores it in a Hive table. I am having trouble configuring the Spark session, conf, and context objects. Here is my code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import *
conf = SparkConf().setAppName("test_import")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = SparkSession.builder.config(conf=conf)
dfRaw = spark.read.csv("hdfs:/user/..../test.csv",header=False)
dfRaw.createOrReplaceTempView('tempTable')
sqlContext.sql("create table customer.temp as select * from tempTable")
And I get the error:
dfRaw = spark.read.csv("hdfs:/user/../test.csv",header=False)
AttributeError: 'Builder' object has no attribute 'read'
What is the right way to configure the Spark session object so that I can use the read.csv command? Also, can someone explain the difference between the Session, Context and Conf objects?
There is no need to use both SparkContext and SparkSession to initialize Spark. SparkSession is the newer, recommended entry point.
To initialize your environment, simply do:
spark = SparkSession\
.builder\
.appName("test_import")\
.getOrCreate()
You can run SQL commands by doing:
spark.sql(...)
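For reference, here is a minimal sketch of the corrected standalone program from the question (the truncated HDFS path is kept as-is, and enableHiveSupport() is assumed to be needed for the Hive write):
from pyspark.sql import SparkSession

# Build the session once; enableHiveSupport() lets it create Hive tables.
spark = SparkSession \
    .builder \
    .appName("test_import") \
    .enableHiveSupport() \
    .getOrCreate()

# Read the CSV into a DataFrame.
dfRaw = spark.read.csv("hdfs:/user/..../test.csv", header=False)

# Expose it as a temporary view and materialize it as a Hive table.
dfRaw.createOrReplaceTempView("tempTable")
spark.sql("create table customer.temp as select * from tempTable")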
Prior to Spark 2.0.0, three separate objects were used: SparkContext, SQLContext and HiveContext. They were used separately depending on what you wanted to do and the data types involved.
With the introduction of the Dataset/DataFrame abstractions, the SparkSession object became the main entry point to the Spark environment. It is still possible to access the other objects by first initializing a SparkSession (say, in a variable named spark) and then accessing spark.sparkContext/spark.sqlContext, as illustrated below.
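As a small illustration of that access pattern (assuming the session variable is named spark):
sc = spark.sparkContext            # the underlying SparkContext
rdd = sc.parallelize([1, 2, 3])    # the RDD API is still available through it
print(rdd.count())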
Related
In my Python 3.8 unit test code, I need to instantiate a SparkContext to test some functions that manipulate an RDD.
The problem is that instantiating a SparkContext takes a few seconds, which is too slow. I'm using this code to instantiate a SparkContext:
from pyspark import SparkContext
return SparkContext.getOrCreate()
The tests only run filter, map and distinct on an RDD. I only want to check an expected output against the obtained output on a tiny dataset (lists of length between 5 and 10), and I want the tests to run almost instantaneously. I don't need any of the other features offered by PySpark (such as parallelization).
I tried various approaches, such as the one below, but instantiation takes just as long:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
sc = SparkContext(conf=conf)
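One common workaround (not shown in the question) is to pay the start-up cost only once per test session, for example with a session-scoped pytest fixture; the fixture below is a sketch under that assumption, and the fixture name and config options are illustrative:
import pytest
from pyspark import SparkConf, SparkContext

@pytest.fixture(scope="session")
def sc():
    # Create one local SparkContext for the whole test session instead of one per test.
    conf = (SparkConf()
            .setMaster("local[2]")
            .setAppName("pytest-pyspark-local-testing")
            .set("spark.ui.enabled", "false"))  # skip the web UI to trim start-up time
    context = SparkContext.getOrCreate(conf=conf)
    yield context
    context.stop()

def test_distinct(sc):
    assert sorted(sc.parallelize([1, 1, 2]).distinct().collect()) == [1, 2]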
I have code that reads data from a Hive table and applies a pandas UDF. The moment it reads data from the table it runs on 11 executors; however, the moment it executes the pandas UDF it uses only 1 executor. Is there a way to assign, say, 10 executors to execute the pandas UDF?
spark-submit --master yarn --deploy-mode client --conf spark.dynamicAllocation.enabled=false --conf spark.executor.instances=20 code_test.py
Code Snippet:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("yarn").appName("Test").enableHiveSupport().getOrCreate()
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    return v.mean()
df = spark.sql("select id, cast(tran_am as double) as v from table")
df.groupby("id").agg(mean_udf(df['v'])).show()
You may need to extract your UDF function into another file; it can then be broadcast to all executors.
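A minimal sketch of that suggestion, with hypothetical file names (udfs.py holding the UDF and shipped to the executors via addPyFile); whether this alone spreads the work across more executors depends on the data and grouping:
# udfs.py (hypothetical module containing the UDF)
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    return v.mean()

# code_test.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").appName("Test").enableHiveSupport().getOrCreate()
spark.sparkContext.addPyFile("udfs.py")   # ship the module so every executor can import it

from udfs import mean_udf
df = spark.sql("select id, cast(tran_am as double) as v from table")
df.groupby("id").agg(mean_udf(df["v"])).show()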
I have just installed PySpark 2.4.5 on my Ubuntu 18.04 laptop, and when I run the following code,
#this is a part of the code.
import os
from glob import glob

import pubmed_parser as pp
from pyspark.sql import SparkSession
from pyspark.sql import Row
medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
parse_results_rdd = medline_files_rdd.\
flatMap(lambda x: [Row(file_name=os.path.basename(x), **publication_dict)
for publication_dict in pp.parse_medline_xml(x)])
medline_df = parse_results_rdd.toDF()
# save to parquet
medline_df.write.parquet('raw_medline.parquet', mode='overwrite')
medline_df = spark.read.parquet('raw_medline.parquet')
I get this error:
medline_files_rdd = spark.sparkContext.parallelize(glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)
NameError: name 'spark' is not defined
I have seen similar questions on Stack Overflow, but none of them solve my problem. Can anyone help me? Thanks a lot.
By the way, I am new to Spark. If I just want to use Spark in Python, is it enough to install PySpark with pip install pyspark, or is there anything else I should do? Should I install Hadoop or anything else?
Just create the Spark session at the start:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
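Applied to the snippet above, the session is created before the RDD is built (a sketch that keeps the question's paths):
from glob import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()

# spark is now defined, so its SparkContext can be used to build the RDD.
medline_files_rdd = spark.sparkContext.parallelize(
    glob('/mnt/hgfs/ShareDir/data/*.gz'), numSlices=1000)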
I am trying to compare Spark SQL with HiveContext. May I know the difference: does HiveContext's sql run a Hive query, while Spark SQL runs a Spark query?
Below is my code:
sc = pyspark.SparkContext(conf=conf).getOrCreate()
sqlContext = HiveContext(sc)
sqlContext.sql('select * from table')
While sparksql:
spark.sql('select * from table')
May I know the difference between these two?
SparkSession provides a single point of entry to interact with underlying Spark functionality and allows programming Spark with DataFrame and Dataset APIs. Most importantly, it curbs the number of concepts and constructs a developer has to juggle while interacting with Spark.
SparkSession, without explicitly creating SparkConf, SparkContext or SQLContext, encapsulates them within itself.
From Spark 2.0+, SparkSession merges SQLContext and HiveContext into one object.
When building a session object, for example:
val spark = SparkSession
  .builder()
  .appName("SparkSessionExample")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
.enableHiveSupport() provides HiveContext functionality, so you will be able to access Hive tables since the Spark session is initialized with Hive support.
So, there is no difference between "sqlContext.sql" and "spark.sql", but it is advised to use "spark.sql", since spark is the single point of entry for all the Spark APIs.
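For completeness, a PySpark sketch of the same setup (the warehouse directory value is an assumption):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SparkSessionExample")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")  # assumed location
         .enableHiveSupport()       # gives the session access to Hive tables
         .getOrCreate())

# Queries go through the unified entry point.
spark.sql("select * from table").show()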
Below is the code I have written to compare two DataFrames and apply the intersection function to them.
import os
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://xxx:xxx").option("databaseName","xxx").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable","xxx").option("user","xxxx").option("password","xxxx").load()
df.registerTempTable("test")
df1 = sqlContext.sql("select * from test where amitesh <= 300")
df2 = sqlContext.sql("select * from test where amitesh <= 400")
df3 = df1.intersection(df2)
df3.show()
I am getting the error below:
AttributeError: 'DataFrame' object has no attribute 'intersection'
If my understanding is correct, intersection() is a built-in method of Python's set type. So,
1) If I am trying to use it inside PySpark, do I need to import any special module in my code, or should it work as a built-in for PySpark as well?
2) To use this intersection() function, do we first need to convert the df to an RDD?
Please correct me wherever I am wrong. Can somebody give me a working example?
My motive is to get the common records from SQL Server and move them to Hive. For now, I am first trying to get my intersection function working, and then I will start on the Hive requirement, which I can take care of once intersection() is working.
I got it working: instead of intersection(), I used intersect(), and it worked.
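For reference, a minimal sketch of the working version, including the Hive step mentioned as the motive (the target database/table name is hypothetical):
df1 = sqlContext.sql("select * from test where amitesh <= 300")
df2 = sqlContext.sql("select * from test where amitesh <= 400")

# intersect() is the DataFrame equivalent of a set intersection.
df3 = df1.intersect(df2)
df3.show()

# Move the common records into Hive (table name is hypothetical).
df3.write.mode("overwrite").saveAsTable("mydb.common_records")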