I'm constantly reusing UDFs across IPython notebooks and am trying to figure out if there is some way to share the code.
I'd love to be able to make a file, let's call it sparktoolz.py
import pyspark.sql.functions as F
import pyspark.sql.types as T

def myfunc(foo):
    # do stuff to foo
    return transformed_foo

myfunc_udf = F.udf(myfunc, T.SomeType())
Then from any given notebook in the same directory as sparktoolz.py do something like this:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
sc.addPyFile('sparktoolz.py')
from sparktoolz import myfunc_udf
df = sqlContext.read.parquet('path/to/foo')
stuff = df.select(myfunc_udf(F.col('bar')))
Whenever I try something like this, the notebook can find sparktoolz.py, but it gives me ImportError: cannot import name myfunc_udf.
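In case it helps, that particular ImportError means Python did find a module named sparktoolz, it just didn't see a myfunc_udf name inside it, which can happen with a stale .pyc or an older copy of the file sitting earlier on sys.path. One pattern worth trying (only a sketch, reusing the placeholder names from above) is to keep plain Python functions in the shared file and build the UDF objects inside the notebook, after the SparkContext and SQLContext exist:

# sparktoolz.py -- plain functions only, no Spark objects created at import time
def myfunc(foo):
    # do stuff to foo
    return transformed_foo

# in the notebook, after sc and sqlContext have been created
import pyspark.sql.functions as F
import pyspark.sql.types as T

sc.addPyFile('sparktoolz.py')      # ships the module to the executors
from sparktoolz import myfunc      # plain function import on the driver

myfunc_udf = F.udf(myfunc, T.SomeType())   # SomeType is the placeholder return type from above
df = sqlContext.read.parquet('path/to/foo')
stuff = df.select(myfunc_udf(F.col('bar')))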
I'm following this tutorial video: https://www.youtube.com/watch?v=EzQArFt_On4
The example code provided in this video:
import sys

from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
glueJob = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueJob.init(args['JOB_NAME'], args)
sparkSession = glueContext.spark_session

# ETL process code
def etl_process():
    ...
    return xxx

glueJob.commit()
I'm wondering if the part before the etl_process function can be used in production directly, or do I need to wrap that part in a separate function so that I can add a unit test for it?
Something like this:
def define_spark_session():
    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)
    glue_job = Job(glue_context)
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    glue_job.init(args['JOB_NAME'], args)
    spark_session = glue_context.spark_session
    return spark_session
But it doesn't seem to need any parameters...
Or should I just write unit tests for the etl_process function?
Or maybe I can create a separate Python file with the etl_process function and import it into this script?
I'm new to this and a bit confused; might someone be able to help, please? Thanks.
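For what it's worth, the separate-file idea from the last question is a common way to make the ETL logic testable while keeping the Glue boilerplate thin. A rough sketch, where the module name etl_logic.py, the S3 paths and the dropDuplicates step are all just placeholders of mine:

# etl_logic.py -- pure PySpark, no Glue imports, easy to unit test
def etl_process(dataframe):
    # placeholder transformation; put the real logic here
    return dataframe.dropDuplicates()

# glue_script.py -- thin wrapper that only wires Glue up and calls the tested function
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from etl_logic import etl_process

glue_context = GlueContext(SparkContext.getOrCreate())
glue_job = Job(glue_context)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_job.init(args['JOB_NAME'], args)
spark_session = glue_context.spark_session

df = spark_session.read.parquet('s3://some-bucket/input/')    # placeholder source
etl_process(df).write.parquet('s3://some-bucket/output/')     # placeholder sink

glue_job.commit()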
As of now it is very difficult to test AWS Glue itself locally, although there are some options, like downloading the Docker image AWS provides and running it from there (you'll probably need some tweaks, but it should be all right).
I guess the easiest way is to transform the DynamicFrame you get from the Glue libs into a Spark DataFrame (.toDF()) and then do things in pure Spark (PySpark), so you'll be able to test the result.
dataFrame = dynamic_frame.toDF()

def transformation(dataframe):
    return dataframe.withColumn(...)

def test_transformation():
    result = transformation(input_test_dataframe)
    assert ...
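To make that concrete, here is a minimal sketch of such a test with pytest and a local SparkSession; the column name and the upper-casing logic are invented for the example:

# test_transformation.py -- run with: pytest test_transformation.py
import pytest
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

def transformation(dataframe):
    # example logic: upper-case the "name" column
    return dataframe.withColumn("name", F.upper(F.col("name")))

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("glue-etl-tests").getOrCreate()

def test_transformation(spark):
    input_df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
    result = transformation(input_df)
    assert [row.name for row in result.collect()] == ["ALICE", "BOB"]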
I am trying to load data into a dataframe using PySpark. The files are in parquet format. I am using the following code:
from pyspark.conf import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext, HiveContext
from pyspark.sql.types import (StructType, StructField, IntegerType, StringType,
                               BooleanType, DateType, TimestampType, LongType,
                               FloatType, DoubleType, ArrayType, ShortType)
from pyspark.sql.functions import lit
from datetime import datetime, date, timedelta as td
import pandas as pd
daterange = pd.date_range('2019-12-01', '2019-12-31')

# assumes sc, sqlContext and spark already exist (e.g. in the pyspark shell)
df = sqlContext.createDataFrame(sc.emptyRDD())

for process_date in daterange:
    try:
        name = 's3://location/process_date={}'.format(process_date.strftime("%Y-%m-%d")) + '/'
        print(name)
        x = spark.read.parquet(name)
        x = x.withColumn('process_date', lit(process_date.strftime("%Y-%m-%d")))
        x.show()
        df = df.union(x)
    except:
        print("File doesn't exist for " + str(process_date.strftime("%Y-%m-%d")))
But when I run this code,
the resulting df is an empty dataset, and despite there being data for some dates, the exception message is printed for every date in the range.
Can anyone tell me what I am doing wrong?
I think the problem is the union combined with a too-broad except clause.
union will only work if the schemas of the dataframes being unioned are the same.
Hence emptyDF.union(nonEmptyDF) raises an error, which you then catch in the except clause.
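A sketch of how the loop could be restructured: skip the empty starting dataframe, collect the per-date frames in a list, union them once at the end, and only catch the missing-path error instead of everything. This assumes Spark 2.x, where spark.read.parquet raises AnalysisException ("Path does not exist") for a missing location, and it reuses the placeholder S3 path from the question:

from functools import reduce
from pyspark.sql.functions import lit
from pyspark.sql.utils import AnalysisException
import pandas as pd

daterange = pd.date_range('2019-12-01', '2019-12-31')
frames = []

for process_date in daterange:
    day = process_date.strftime("%Y-%m-%d")
    name = 's3://location/process_date={}/'.format(day)
    try:
        x = spark.read.parquet(name)   # assumes an existing SparkSession named spark
        frames.append(x.withColumn('process_date', lit(day)))
    except AnalysisException:
        print("File doesn't exist for " + day)

if frames:
    df = reduce(lambda a, b: a.union(b), frames)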
I have 45 PySpark scripts to run, and a password is stored in each script. I want to use a file placed in HDFS where I can store the password and use it for all the scripts.
Instead of changing the password in every script, I would then only change it in that file (please refer to the script below).
from pyspark.context import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
from pyspark.sql.types import *
sc = SparkContext()
sqlContext = HiveContext(sc)
sqlContext.setConf("spark.sql.tungsten.enabled", "false")
CSKU_query = """ (select * from CSKU) a """
CSKU = sqlContext.read.format("jdbc").options(url="jdbc:sap://myip:port",currentschema="SAPABAP1",user="username",password="mypassword",dbtable=CSKU_query).load()
CSKU.write.format("parquet").save("/user/admin/sqoop/base/sap/CSKU/")
Instead of specifying the password in each script, it should be fetched from a file that I can reference.
Thanks in advance
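One possible approach (a sketch, not the only option): keep the password in a one-line text file on HDFS, read it through Spark at the top of every script, and pass the value into the jdbc options. The HDFS path below is a placeholder, and note that this only centralises the password rather than securing it, so the file's HDFS permissions should be locked down:

# read the single-line password file from HDFS; the path is a placeholder
password = sc.textFile("hdfs:///user/admin/secrets/sap_password.txt").first().strip()

CSKU = sqlContext.read.format("jdbc").options(
    url="jdbc:sap://myip:port",
    currentschema="SAPABAP1",
    user="username",
    password=password,
    dbtable=CSKU_query
).load()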
I am getting the following error while running my code:
Error:
AttributeError: 'SQLContext' object has no attribute 'load'
Please let me know if anyone has faced the same problem.
Code:
# required import modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
# creating a configuration for context. here "Spark-SQL" is the name of the application and we will create local spark context.
conf = SparkConf().setAppName("Spark-SQL").setMaster("local")
# create a spark context. It is the entry point into all relational functionality in Spark.
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# loading the contents from the MySql database table to the dataframe.
df = sqlContext.load(source="jdbc", url="jdbc:mysql://localhost:3306/test?user=root&password=",dbtable="members_indexed")
# display the contents of the dataframe.
df.show()
# display the schema of the dataframe.
df.printSchema()
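For reference, SQLContext.load was deprecated in Spark 1.4 and removed in later releases, which is what this AttributeError points at; the DataFrameReader API (sqlContext.read) is the replacement. A minimal sketch against the same local MySQL table, reusing the connection details from the question (the MySQL JDBC driver jar still has to be on the classpath, e.g. via --jars):

df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/test",
    user="root",
    password="",
    dbtable="members_indexed",
    driver="com.mysql.jdbc.Driver"   # driver class shipped with MySQL Connector/J
).load()

df.show()
df.printSchema()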
I'm trying to save my streaming data from Spark to Cassandra. Spark is connected to Kafka and that part is working OK, but saving to Cassandra is driving me crazy. I'm using Spark 2.0.2, Kafka 0.10 and Cassandra 2.23.
This is how I'm submitting to Spark:
spark-submit --verbose --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 --jars /tmp/pyspark-cassandra-0.3.5.jar --driver-class-path /tmp/pyspark-cassandra-0.3.5.jar --py-files /tmp/pyspark-cassandra-0.3.5.jar --conf spark.cassandra.connection.host=localhost /tmp/direct_kafka_wordcount5.py localhost:9092 testing
And this is my code; it's just a small modification of the Spark examples. It works, but I can't save this data to Cassandra.
This is what I'm trying to do, but just with the count result:
http://rustyrazorblade.com/2015/05/spark-streaming-with-python-and-kafka/
from __future__ import print_function
import sys
import os
import time
import pyspark_cassandra
import pyspark_cassandra.streaming
from pyspark_cassandra import CassandraSparkContext
import urllib
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
from pyspark.sql.functions import from_unixtime, unix_timestamp, min, max
from pyspark.sql.types import FloatType
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: direct_kafka_wordcount.py <broker_list> <topic>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 1)
    sqlContext = SQLContext(sc)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.count()
    counts.saveToCassandra("spark", "count")
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
I got this error:
Traceback (most recent call last):
File "/tmp/direct_kafka_wordcount5.py", line 88, in
counts.saveToCassandra("spark", "count")
pyspark-cassandra stopped being updated a while ago and the latest version only supports up to Spark 1.6:
https://github.com/TargetHolding/pyspark-cassandra
Additionally,
counts = lines.count()  # on a DStream this returns a new DStream of per-batch counts, not an RDD
so counts is a stream of plain numbers rather than rows with named columns, which is not something saveToCassandra can map onto a table.
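If you stay on Spark 2.0.2, one alternative worth a try instead of pyspark-cassandra is the DataStax spark-cassandra-connector, writing each micro-batch out as a DataFrame inside foreachRDD. The sketch below is only an outline: it assumes a keyspace spark with a table counts (value int) already exists, and that a 2.0.x build of com.datastax.spark:spark-cassandra-connector_2.11 is passed to spark-submit via --packages together with --conf spark.cassandra.connection.host=localhost:

from pyspark import SparkContext
from pyspark.sql import SparkSession, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
ssc = StreamingContext(sc, 1)
spark = SparkSession.builder.getOrCreate()

kvs = KafkaUtils.createDirectStream(ssc, ["testing"], {"metadata.broker.list": "localhost:9092"})
lines = kvs.map(lambda x: x[1])
counts = lines.count()   # DStream holding one count per batch

def save_batch(rdd):
    # turn the per-batch count into a one-row DataFrame and append it to Cassandra
    if not rdd.isEmpty():
        df = spark.createDataFrame(rdd.map(lambda n: Row(value=int(n))))
        (df.write
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="spark", table="counts")
           .mode("append")
           .save())

counts.foreachRDD(save_batch)

ssc.start()
ssc.awaitTermination()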