I am getting a complex JSON message through Kafka. I need to extract the required fields from the JSON and store them in Hive tables. I know I cannot use the Spark driver's sqlContext in the executors. I want to know how to use the sqlContext in the code run by the executors. Here is the code:
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming", topic)
msgs = kvs.map(lambda msg: msg[1])
msgs.foreachRDD(lambda rdd: rdd.foreach(lambda m : timeline_events(m)))
def timeline_events(m):
    msg = json.loads(m)
    for msgJson in msg:
        event_id = msgJson['events'][0]['event_id']
        event_type = msgJson['events'][0]['type']
        incidence_source = msgJson['incident']['source']
        csr_description = msgJson['incident']['data']['csr_description']
        sc_display_priority = msgJson['incident']['data']['display_priority']
        launch_tool_rec_label = msgJson['incident']['data']['LaunchTool'][0]['Label']
        launch_tool_rec_uri = msgJson['incident']['data']['LaunchTool'][0]['URI']
        launch_itg_rec_label = msgJson['incident']['data']['LaunchItg'][0]['Label']
        launch_itg_rec_uri = msgJson['incident']['data']['LaunchItg'][0]['URI']
        sqlContext.sql("Insert into nexus.timeline_events values({},{},{},{},{},{},{},{},{},{},{})".format(event_id, event_type, csr_description, incidence_source, sc_display_priority, launch_tool_rec_label, launch_tool_rec_uri, launch_tool_rec_id, launch_itg_rec_label, launch_itg_rec_uri, launch_itg_rec_id))
I have written this code using Python; when I run it, the following errors show up.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession\
    .builder\
    .appName("GraphX")\
    .getOrCreate()
e = spark.read.parquet("hdfs://localhost:9000/gf/edge")
v = spark.read.parquet("hdfs://localhost:9000/gf/vertex")
s = GraphFrame(v, e)
s.edges.show()
s.vertices.show()
I am trying to read a file located in Azure Data Lake Gen2 into a Spark DataFrame using Python.
The code is:
from pyspark import SparkConf
from pyspark.sql import SparkSession
# create spark session
key = "some_key"
appName = "DataExtract"
master = "local[*]"
sparkConf = SparkConf() \
.setAppName(appName) \
.setMaster(master) \
.set("fs.azure.account.key.myaccount.dfs.core.windows.net", key)
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
data_csv="abfs://test-file-system#myaccount.dfs.core.windows.net/data.csv"
data_out = "abfs://test-file-system#myaccount.dfs.core.windows.net/data_out.csv"
# read csv
df = self.spark_session.read.csv(data_csv)
# write csv
df.write.csv(data_out)
The file is read and written correctly, but I am getting the following error:
ERROR AzureBlobFileSystemStore: Failed to parse the date Thu, 09 Sep 2021 10:12:34 GMT
The date seems to be the file creation date.
How can I parse the date to avoid getting this error?
I tried reproducing the same issue and found that these lines are causing the error:
data_csv="abfs://test-file-system#myaccount.dfs.core.windows.net/data.csv" data_out =
"abfs://test-file-system#myaccount.dfs.core.windows.net/data_out.csv"
# read csv df = self.spark_session.read.csv(data_csv) ```
Here is the code that worked for me after replacing abfs with abfss in the lines above:
from pyspark import SparkConf
from pyspark.sql import SparkSession
# create spark session
key = "<Your Storage Account Key>"
appName = "<Synapse App Name>"
master = "local[*]"
sparkConf = SparkConf() \
.setAppName(appName) \
.setMaster(master) \
.set("fs.azure.account.key.<Storage Account Name>.dfs.core.windows.net", key)
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
data_csv="abfss://<ContainerName>#<Storage Account Name>.dfs.core.windows.net/<Directory>"
# read csv
df1 = spark.read.option('header','true')\
.option('delimiter', ',')\
.csv(data_csv + '/sample1.csv')
df1.show()
# write csv
df2 = df1.write.csv(data_csv + '/<Give the name of blob you want to write to>.csv')
Alternatively, you can try the code below, which also worked for me:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
account_name = "<StorageAccount Name>"
container_name = "<Storage Account Container Name>"
relative_path = "<Directory path>"
adls_path = 'abfss://%s#%s.dfs.core.windows.net/%s'%(container_name,account_name,relative_path)
dataframe1 = spark.read.option('header','true')\
.option('delimiter', ',')\
.csv(adls_path + '/sample1.csv')
dataframe1.show()
dataframe2 = dataframe1.write.csv(adls_path + '/<Give the name of blob you want to write to>.csv')
Reference:
Synapse Spark – Reading CSV files from Azure Data Lake Storage Gen 2 with Synapse Spark using Python - SQL Stijn (sql-stijn.com)
In the example https://github.com/apache/spark/tree/master/external/kinesis-asl, both the Scala and Java versions create multiple DStreams manually:
val sparkConfig = new SparkConf().setAppName("KinesisWordCountASL")
val ssc = new StreamingContext(sparkConfig, batchInterval)
// Create the Kinesis DStreams
val kinesisStreams = (0 until numStreams).map { i =>
  KinesisInputDStream.builder
    .streamingContext(ssc)
    .streamName(streamName)
    .endpointUrl(endpointUrl)
    .regionName(regionName)
    .initialPosition(new Latest())
    .checkpointAppName(appName)
    .checkpointInterval(kinesisCheckpointInterval)
    .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
    .build()
}
// Union all the streams
val unionStreams = ssc.union(kinesisStreams)
But in Python, only one DStream is created:
sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl")
ssc = StreamingContext(sc, 1)
appName, streamName, endpointUrl, regionName = sys.argv[1:]
lines = KinesisUtils.createStream(
    ssc, appName, streamName, endpointUrl, regionName, InitialPositionInStream.LATEST, 2)
When manually creating multiple DStreams in Python,
sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl")
ssc = StreamingContext(sc, 1)
appName, streamName, endpointUrl, regionName = sys.argv[1:]
dstreams = [KinesisUtils.createStream(
    ssc, appName, streamName, endpointUrl, regionName,
    InitialPositionInStream.LATEST, 2) for i in range(num_streams)]
lines = sc.union(dstreams)
this will throw an error:
ValueError: Cannot run multiple SparkContexts at once;
Does anyone know how to replicate the Java/Scala examples for creating multiple DStreams? Thanks.
We can combine multiple streams in Python as well; we need to use the union method available on the StreamingContext class.
If you look at your code, you are calling the union method on the SparkContext variable (sc). Use the StreamingContext variable instead, unpacking the list because the Python API takes the streams as separate arguments: lines = ssc.union(*dstreams)
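For reference, here is a minimal sketch of the corrected version, assuming num_streams is set to however many receivers you want (e.g. one per Kinesis shard) and the other parameters come from sys.argv as in the question:
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl")
ssc = StreamingContext(sc, 1)
appName, streamName, endpointUrl, regionName = sys.argv[1:]

num_streams = 2  # assumed value, e.g. one receiver per Kinesis shard
dstreams = [KinesisUtils.createStream(
    ssc, appName, streamName, endpointUrl, regionName,
    InitialPositionInStream.LATEST, 2) for _ in range(num_streams)]

# union on the StreamingContext, not the SparkContext; the streams are
# passed as separate arguments, hence the unpacking
lines = ssc.union(*dstreams)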
I'm trying to load a CSV file with schema auto-detection, but I am unable to load the file into BigQuery. Can anyone help me with this?
Please find my code below:
from google.cloud import bigquery
import time

def load_data_from_file(dataset_name, table_name, source_file_name):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    table.reload()
    with open(source_file_name, 'rb') as source_file:
        job = table.upload_from_file(
            source_file, source_format='text/csv')
    wait_for_job(job)
    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))

def wait_for_job(job):
    while True:
        job.reload()
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)
Based on the Google BigQuery Python API documentation, you should set source_format to 'CSV' instead of 'text/csv':
source_format='CSV'
Code Sample:
with open(csv_file.name, 'rb') as readable:
    table.upload_from_file(
        readable, source_format='CSV', skip_leading_rows=1)
Source: https://googlecloudplatform.github.io/google-cloud-python/stable/bigquery-usage.html#datasets
If this does not solve your problem, please provide more specifics about the errors you are observing.
You can use the code snippet below to create and load data (CSV format) from Cloud Storage to BigQuery with schema auto-detection:
from google.cloud import bigquery
bigqueryClient = bigquery.Client()
jobConfig = bigquery.LoadJobConfig()
jobConfig.skip_leading_rows = 1
jobConfig.source_format = bigquery.SourceFormat.CSV
jobConfig.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
jobConfig.autodetect = True
datasetName = "dataset-name"
targetTable = "table_name"
uri = "gs://bucket_name/file.csv"
tableRef = bigqueryClient.dataset(datasetName).table(targetTable)
bigqueryJob = bigqueryClient.load_table_from_uri(uri, tableRef, job_config=jobConfig)
bigqueryJob.result()
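If the data has to come from a local file instead of Cloud Storage, as in the original question, a similar sketch (assuming a reasonably recent google-cloud-bigquery client and placeholder dataset, table, and file names) uses load_table_from_file with the same job config:
from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset("dataset-name").table("table_name")

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1
job_config.autodetect = True

# upload the local CSV; the load job itself runs server-side
with open("file.csv", "rb") as source_file:
    load_job = client.load_table_from_file(
        source_file, table_ref, job_config=job_config)
load_job.result()  # wait for the job to finish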
Currently, the Python client has no support for loading data from a file with a schema auto-detection flag (I plan on doing a pull request to add this support, but I'd still like to ask the maintainers what their opinions are on this implementation).
There are still some ways to work around this. I haven't found a very elegant solution so far, but the following code allows you to add schema auto-detection as an input flag:
from google.cloud.bigquery import Client
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/your/json.key'
import google.cloud.bigquery.table as mtable

def _configure_job_metadata(metadata,
                            allow_jagged_rows,
                            allow_quoted_newlines,
                            create_disposition,
                            encoding,
                            field_delimiter,
                            ignore_unknown_values,
                            max_bad_records,
                            quote_character,
                            skip_leading_rows,
                            write_disposition):
    load_config = metadata['configuration']['load']
    if allow_jagged_rows is not None:
        load_config['allowJaggedRows'] = allow_jagged_rows
    if allow_quoted_newlines is not None:
        load_config['allowQuotedNewlines'] = allow_quoted_newlines
    if create_disposition is not None:
        load_config['createDisposition'] = create_disposition
    if encoding is not None:
        load_config['encoding'] = encoding
    if field_delimiter is not None:
        load_config['fieldDelimiter'] = field_delimiter
    if ignore_unknown_values is not None:
        load_config['ignoreUnknownValues'] = ignore_unknown_values
    if max_bad_records is not None:
        load_config['maxBadRecords'] = max_bad_records
    if quote_character is not None:
        load_config['quote'] = quote_character
    if skip_leading_rows is not None:
        load_config['skipLeadingRows'] = skip_leading_rows
    if write_disposition is not None:
        load_config['writeDisposition'] = write_disposition
    load_config['autodetect'] = True  # --> here you can add the option for schema auto-detection

mtable._configure_job_metadata = _configure_job_metadata

bq_client = Client()
ds = bq_client.dataset('dataset_name')
ds.table = lambda: mtable.Table('table_name', ds)
table = ds.table()

with open(source_file_name, 'rb') as source_file:
    job = table.upload_from_file(
        source_file, source_format='text/csv')
I just wanted to show how I've used the Python client.
Below is my function to create a table and load it with a CSV file.
Also, self.client is my bigquery.Client().
def insertTable(self, datasetName, tableName, csvFilePath, schema=None):
    """
    This function creates a table in the given dataset in our default project
    and inserts the data given via a CSV file.
    :param datasetName: The name of the dataset in which the table needs to be created
    :param tableName: The name of the table to be created
    :param csvFilePath: The path of the file to be inserted
    :param schema: The schema of the table to be created
    :return: returns nothing
    """
    csv_file = open(csvFilePath, 'rb')
    dataset_ref = self.client.dataset(datasetName)
    # <import>: from google.cloud.bigquery import Dataset
    dataset = Dataset(dataset_ref)
    table_ref = dataset.table(tableName)
    if schema is not None:
        table = bigquery.Table(table_ref, schema)
    else:
        table = bigquery.Table(table_ref)
    try:
        self.client.delete_table(table)
    except:
        pass
    table = self.client.create_table(table)
    # <import>: from google.cloud.bigquery import LoadJobConfig
    job_config = LoadJobConfig()
    table_ref = dataset.table(tableName)
    job_config.source_format = 'CSV'
    job_config.skip_leading_rows = 1
    job_config.autodetect = True
    job = self.client.load_table_from_file(
        csv_file, table_ref, job_config=job_config)
    job.result()
Let me know if this solves your problem.
I am a beginner with Spark Streaming.
I want to load HBase records in a Spark Streaming app, so I wrote the code below in Python.
My "load_records" function fetches HBase records and returns them.
Spark Streaming cannot use collect(), and sc.newAPIHadoopRDD() needs to be used in the driver program, but Spark Streaming does not have a method to get objects from the workers to the driver.
How can I get HBase records in Spark Streaming? Or how can I call sc.newAPIHadoopRDD()?
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def load_records(sc, table, keys):
    host = 'localhost'
    keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
    rdd_list = []
    for key in keys:
        if table == "user":
            conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": "user",
                    "hbase.mapreduce.scan.columns": "u:uid",
                    "hbase.mapreduce.scan.row.start": key, "hbase.mapreduce.scan.row.stop": key + "\x00"}
            rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                                     "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                                     "org.apache.hadoop.hbase.client.Result",
                                     keyConverter=keyConv, valueConverter=valueConv, conf=conf)
            rdd_list.append(rdd)
    first_rdd = rdd_list.pop(0)
    for rdd in rdd_list:
        first_rdd = first_rdd.union(rdd)
    return first_rdd

sc = SparkContext(appName="UserStreaming")
ssc = StreamingContext(sc, 3)
topics = ["json"]
broker_list = "localhost:9092"
inputs = KafkaUtils.createDirectStream(ssc, topics, {"metadata.broker.list": broker_list})
jsons = inputs.map(lambda input: json.loads(input[1]))
user_id_rdd = jsons.map(lambda json: json["user_id"])
# the line below does not work; is there another way?
user_id_list = user_id_rdd.collect()
user_record_rdd = load_records(sc, 'user', user_id_list)
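One possible direction (a rough sketch rather than a verified answer, reusing the load_records function above) is to move the per-batch work into foreachRDD: the function passed to foreachRDD runs on the driver, so it can call collect() on that batch's RDD and then use sc.newAPIHadoopRDD() through load_records:
def process_batch(rdd):
    # runs on the driver once per micro-batch
    user_id_list = rdd.collect()
    if user_id_list:
        user_record_rdd = load_records(sc, 'user', user_id_list)
        for record in user_record_rdd.collect():
            print(record)  # placeholder for the real per-record processing

user_id_rdd.foreachRDD(process_batch)
ssc.start()
ssc.awaitTermination()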