I'm using PySpark to extract files, do basic transformations, and load the data into Hive. Currently a for loop finds the extract files and loads them into Hive, one table at a time. We have around 60 tables, so looping over each file and loading it sequentially takes a long time. To speed this up I'm using ThreadPoolExecutor to run the loads in parallel. Here is a sample code prototype:
from concurrent import futures

def func(args):
    df = extract(args)
    tbl, status = load(df)
    return tbl, status

def extract(args):
    ### find the file list and read it into a DataFrame ###
    return df

def load(df):
    ### load the DataFrame into Hive ###
    status[tbl] = 'Completed'
    return tbl, status

status = {}
listA = ['ABC', 'BCD', 'DEF']
prcs = []
with futures.ThreadPoolExecutor() as executor:
    for i in listA:
        prcs.append(executor.submit(func, i))
    for tsk in futures.as_completed(prcs):
        tbl, status = tsk.result()
        print(tbl)
        print(status)
It works well. I'm redirecting the spark-submit log to a file, but with ThreadPoolExecutor the logs from all threads are interleaved and I can't debug anything. Is there a better way to group the logs by thread? Here each thread corresponds to one table. I'm new to Python. Kindly help.
As described here:
Spark uses log4j for logging. You can configure it by adding a
log4j.properties file in the conf directory. One way to start is to
copy the existing log4j.properties.template located there.
So you can configure it via log4j.properties, or programmatically as described in How to configure the log level of a specific logger using log4j in pyspark?. That post is about the log level, but the same concept applies to configuring the appender.
As for what to put in the logs to make them more meaningful, you need some correlation id to correlate the log lines. I'm not sure the thread id/name is sufficient; if not, see the note about MDC below. You can add your own custom MDC keys (see: Spark application and logging MDC (Mapped Diagnostic Context)).
By default, Spark adds 1 record to the MDC (Mapped Diagnostic Context): mdc.taskName, which shows something like task 1.0 in stage 0.0. You can add %X{mdc.taskName} to your patternLayout in order to print it in the logs. Moreover, you can use spark.sparkContext.setLocalProperty(s"mdc.$name", "value") to add user specific data into MDC. The key in MDC will be the string of "mdc.$name".
If taskName isn't sufficient, you can create your own correlation id, add it to the MDC with setLocalProperty(), and reference it in your patternLayout.
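For example, here is a minimal sketch of how you could tag each thread's Spark jobs with the table it is processing. The key name mdc.tableName and the ConversionPattern line are illustrative, and spark is assumed to be your existing SparkSession:
def func(tbl_name):
    # setLocalProperty is thread-local, so jobs submitted from this thread
    # (and, per the docs quoted above, their executor-side logs) carry the key.
    spark.sparkContext.setLocalProperty("mdc.tableName", tbl_name)  # illustrative key name
    df = extract(tbl_name)
    return load(df)

# Then reference the key in the patternLayout of your log4j.properties, e.g.:
# log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1} [%X{mdc.tableName}]: %m%n
# so that log lines can be grepped per table.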
I am trying to write a script to import messages to a Uniform Distributed queue in Weblogic using WLST but I am unable to find a solution that specifically caters to my requirement.
Let me explain the requirement:
I have error queues that store failed messages. I have exported them as an xml file (using WLST) and segregated them, based on the different error codes in the message headers, into smaller xml files which need to be imported into the main queue for reprocessing (not using the Admin console).
I am sure there is a way to achieve this, because I can import the segregated xml files using the import option in the Admin console, which works like a charm, but I have no idea how that is actually done under the hood, so I can't implement it as a script.
I have explored a few options, like exporting the messages as a binary SER file, which works but cannot be used to filter out only the retryable messages.
The WLST method importMessages() only accepts a composite datatype array. Any way to convert/create the required CompositeData array from the xml files would also be a great solution to the issue.
I agree it is not very simple and intuitive.
You have 2 solutions:
pure WLST code
Java code using the JMS API
If you want to write pure WLST code, here is a code sample that will help you. The code creates and publishes n messages into a queue.
The buildJMSMessage() function is responsible for creating a text message.
from javax.management.openmbean import CompositeData
from weblogic.jms.extensions import JMSMessageInfo, JMSMessageFactoryImpl
import jarray
...
def buildJMSMessage(text):
    handle = 1
    state = 1
    XidString = None
    sequenceNumber = 1
    consumerID = None
    wlmessage = JMSMessageFactoryImpl.getFactory().createTextMessage(text)
    destinationName = ""
    bodyIncluded = True
    msg = JMSMessageInfo(handle, state, XidString, sequenceNumber, consumerID, wlmessage, destinationName, bodyIncluded)
    return msg
....
quantity = 10
messages = jarray.zeros(quantity, CompositeData)
for i in range(0, quantity):
    messages[i] = buildJMSMessage('Test message #' + str(i)).toCompositeData()
queue.importMessages(messages, False)
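For reference, the queue variable used above is the destination's runtime MBean. Here is a hedged sketch of looking it up in the domain runtime tree; the credentials, server, JMS server, module, and queue names are placeholders for your own domain, and the exact path may differ:
# Example only: navigate the domain runtime tree to the destination runtime,
# which is the MBean exposing importMessages(). Names below are placeholders.
connect('weblogic', 'welcome1', 't3://adminhost:7001')
domainRuntime()
queue = getMBean('ServerRuntimes/ManagedServer1/JMSRuntime/ManagedServer1.jms'
                 '/JMSServers/MyJMSServer/Destinations/MyJmsModule!MyQueue')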
I am very new to airflow and I am trying to create a DAG based on the below requirement.
Task 1 - Run a BigQuery query to get a value which I need to push to the 2nd task in the DAG.
Task 2 - Use the value from the above query, run another query, and export the data into a Google Cloud Storage bucket.
I have read other answers related to this and I understand we cannot use xcom_pull or xcom_push in a BigQuery operator in Airflow. So what I am doing is using a PythonOperator where I can use Jinja template variables by setting "provide_context=True".
Below is a snippet of my code, just task 1, where I want to do "task_instance.xcom_push" in order to see the value under the task's XCom in the Airflow logs/UI.
def get_bq_operator(dag, task_id, configuration, table_params=None, trigger_rule='all_success'):
    bq_operator = BigQueryInsertJobOperator(
        task_id=task_id,
        configuration=configuration,
        gcp_conn_id=gcp_connection_id,
        dag=dag,
        params=table_params,
        trigger_rule=trigger_rule,
        **task_instance.xcom_push(key='yr_wk', value=yr_wk),**
    )
    return bq_operator

def get_bq_wm_yr_wk():
    get_bq_operator(dag, app_name, bigquery_util.get_bq_job_configuration(
        bq_query,
        query_params=None))

get_wm_yr_wk = PythonOperator(task_id='get_wm_yr_wk',
                              python_callable=get_bq_wm_yr_wk,
                              provide_context=True,
                              on_failure_callback=failure_callback,
                              on_retry_callback=failure_callback,
                              dag=dag)
"bq_query" is the one I am passing the sql file which has my query and the query returns the value of yr_wk which I need to use in my 2nd task.
The highlighted task_instance.xcom_push(key='yr_wk', value=yr_wk), in get_bq_operator is failing and the errror i am getting is as below
raise KeyError(f'Variable {key} does not exist')
KeyError: 'Variable ei_migration_hour does not exist'
If I comment out that line, the DAG runs fine. However, how do I validate the value of yr_wk? I want to push it so that I can view the value in the logs.
I do not fully understand your code :), but if you want to do something with the results of a BigQuery query, then a far better way to approach it is to use BigQueryHook in your Python callable.
Operators in Airflow are usually thin wrappers around Hooks that provide a "complete" task (for example, you can use one to run an update operation), but if you want to do something with the result and you are already doing it via a PythonOperator, it is far better to use the Hook directly, as you do not inherit all the assumptions that operators make in their execute method.
In your case it should be something like the following (I am using the new TaskFlow syntax here, which is the preferred way of doing this kind of operation; see https://airflow.apache.org/docs/apache-airflow/stable/tutorial_taskflow_api.html for the TaskFlow API tutorial. Especially in Airflow 2 it became the de-facto default way of writing tasks):
@task(.....)
def my_task():
    hook = BigQueryHook(....)  # initialize it with the right parameters
    result = hook.run(sql='YOUR_QUERY', ...)  # add other necessary params
    processed_result = process_result(result)  # do something with the result
    return processed_result
This way you do not even have to run xcom_push (the TaskFlow API will do it for you automatically), and other tasks will be able to use the value by just doing:
@task
def next_task(input):
    pass
And then:
result = my_task()
next_task(result)
Then all the xcom push/pull will be handled for you automatically via TaskFlow.
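Putting it together, here is a minimal sketch of how the two tasks could be wired up with the TaskFlow API. The DAG id, connection id, query, and the get_records-style call are assumptions; adjust them to your provider version and SQL:
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook


@dag(schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)
def yr_wk_pipeline():  # hypothetical DAG id

    @task
    def get_wm_yr_wk():
        # Run the first query directly via the hook and return the value;
        # the return value is pushed to XCom automatically.
        hook = BigQueryHook(gcp_conn_id="my_gcp_conn", use_legacy_sql=False)  # assumed conn id
        records = hook.get_records(sql="SELECT yr_wk FROM my_dataset.my_table LIMIT 1")  # assumed query
        return records[0][0]

    @task
    def export_data(yr_wk):
        # Use the pulled value in the second query / export job.
        print(f"Exporting data for yr_wk={yr_wk}")

    export_data(get_wm_yr_wk())


dag_obj = yr_wk_pipeline()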
When running a Databricks notebook as a job, you can specify job or run parameters that can be used within the code of the notebook. However, it wasn't clear from the documentation how you actually fetch them. I'd like to be able to get all the parameters as well as the job id and run id.
Job/run parameters
When the notebook is run as a job, then any job parameters can be fetched as a dictionary using the dbutils package that Databricks automatically provides and imports. Here's the code:
run_parameters = dbutils.notebook.entry_point.getCurrentBindings()
If the job parameters were {"foo": "bar"}, then the result of the code above gives you the dict {'foo': 'bar'}. Note that Databricks only allows job parameter mappings of str to str, so keys and values will always be strings.
Note that if the notebook is run interactively (not as a job), then the dict will be empty. The getCurrentBindings() method also appears to work for getting any active widget values for the notebook (when run interactively).
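For example, a small sketch of reading a parameter defensively so the same code works whether or not the job supplies it (the parameter name env and its default are made up for illustration):
# getCurrentBindings() yields an empty mapping when run interactively without
# widgets, so copy it into a plain dict and fall back to a default.
run_parameters = dict(dbutils.notebook.entry_point.getCurrentBindings())
env = run_parameters.get("env", "dev")  # hypothetical parameter
print(f"Running with env={env}")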
Getting the jobId and runId
To get the jobId and runId you can get a context json from dbutils that contains that information. (Adapted from databricks forum):
import json
context_str = dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
context = json.loads(context_str)
run_id_obj = context.get('currentRunId', {})
run_id = run_id_obj.get('id', None) if run_id_obj else None
job_id = context.get('tags', {}).get('jobId', None)
So within the context object, the path of keys for runId is currentRunId > id and the path of keys to jobId is tags > jobId.
Nowadays you can easily get the parameters from a job through the widget API. This is pretty well described in the official documentation from Databricks. Below, I'll elaborate on the steps you have to take to get there; it is fairly easy.
Create or use an existing notebook that has to accept some parameters. We want to know the job_id and run_id, and let's also add two user-defined parameters, environment and animal.
# Get parameters from job
job_id = dbutils.widgets.get("job_id")
run_id = dbutils.widgets.get("run_id")
environment = dbutils.widgets.get("environment")
animal = dbutils.widgets.get("animal")
print(job_id)
print(run_id)
print(environment)
print(animal)
Now let's go to Workflows > Jobs to create a parameterised job. Make sure you select the correct notebook and specify the parameters for the job at the bottom. According to the documentation, we need to use double curly brackets for the parameter values of job_id and run_id. For the other parameters, we can pick a value ourselves.
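For illustration, the job's notebook parameters could then look like this (environment and animal take whatever example values you choose; dev and squirrel match the output shown below):
job_id       {{job_id}}
run_id       {{run_id}}
environment  dev
animal       squirrel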
Note: the reason why you are not allowed to get the job_id and run_id directly from the notebook context is security (as you can see from the stack trace when you try to access the attributes of the context). Within a notebook you are in a different context; those parameters live at a "higher" context.
Run the job and observe that it outputs something like:
dev
squirrel
137355915119346
7492
Command took 0.09 seconds
You can even set default parameters in the notebook itself; these will be used if you run the notebook interactively or if the notebook is triggered from a job without parameters. This makes testing easier and allows you to default certain values.
# Adding widgets to a notebook
dbutils.widgets.text("environment", "tst")
dbutils.widgets.text("animal", "turtle")
# Removing widgets from a notebook
dbutils.widgets.remove("environment")
dbutils.widgets.remove("animal")
# Or removing all widgets from a notebook
dbutils.widgets.removeAll()
And last but not least, I tested this on different cluster types and so far have found no limitations. My current settings are:
spark.databricks.cluster.profile serverless
spark.databricks.passthrough.enabled true
spark.databricks.pyspark.enableProcessIsolation true
spark.databricks.repl.allowedLanguages python,sql
I am trying to get the application output of a Spark run and cannot find a straightforward way of doing that.
Basically I am talking about the content of the <spark install dir>/work directory on the cluster worker.
I could've copied that directory to the location I need, but in case of 100500 nodes it simply doesn't scale.
The other option I was considering is to attach an exit function (like a TRAP in bash) to get the logs from each worker as a part of the app run. I just think there has to be a better solution than that.
Yeah, I know that I can use the YARN or Mesos cluster manager to get the logs; however, it seems really weird to me that in order to do such a convenient thing I cannot use the default (standalone) cluster manager.
Thanks a lot.
In the end I went for the following solution (Python):
import os
import tarfile
from io import BytesIO

from pyspark.sql import SparkSession

# Get the spark app.
spark = SparkSession.builder.appName("my-spark-app").getOrCreate()

# Get the executor working directories.
spark_home = os.environ.get('SPARK_HOME')
if spark_home:
    num_workers = 0
    with open(os.path.join(spark_home, 'conf', 'slaves'), 'r') as f:
        for line in f:
            num_workers += 1

    if num_workers:
        executor_logs_path = '/where/to/store/executor_logs'

        def _map(worker):
            '''Returns the list of tuples of the name and the tar.gz of the worker log directory
            in binary format for the corresponding worker.
            '''
            flo = BytesIO()
            with tarfile.open(fileobj=flo, mode="w:gz") as tar:
                tar.add(os.path.join(spark_home, 'work'), arcname='work')
            return [('worker_%d_dir.tar.gz' % worker, flo.getvalue())]

        def _reduce(worker1, worker2):
            '''Appends the worker name and its log tar.gz's into the list.
            '''
            worker1.extend(worker2)
            return worker1

        os.makedirs(executor_logs_path)
        logs = spark.sparkContext.parallelize(range(num_workers), num_workers).map(_map).reduce(_reduce)
        with tarfile.open(os.path.join(executor_logs_path, 'logs.tar'), 'w') as tar:
            for name, data in logs:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(tarinfo=info, fileobj=BytesIO(data))
A couple of concerns though:
not sure if using the map-reduce technique is the best way to collect the logs
the files (tarballs) are being created in memory, so depending on your application it can crash if the files are too big
perhaps there is a better way to determine the number of workers
I'm running a Spark Streaming task in a cluster using YARN. Each node in the cluster runs multiple spark workers. Before the streaming starts I want to execute a "setup" function on all workers on all nodes in the cluster.
The streaming task classifies incoming messages as spam or not spam, but before it can do that it needs to download the latest pre-trained models from HDFS to local disk, like this pseudo code example:
def fetch_models():
    if hadoop.version > local.version:
        hadoop.download()
I've seen the following examples here on SO:
sc.parallelize().map(fetch_models)
But in Spark 1.6 parallelize() requires some data to be used, like this shitty work-around I'm doing now:
sc.parallelize(range(1, 1000)).map(fetch_models)
Just to be fairly sure that the function is run on ALL workers I set the range to 1000. I also don't exactly know how many workers are in the cluster when running.
I've read the programming documentation and googled relentlessly but I can't seem to find any way to actually just distribute anything to all workers without any data.
After this initialization phase is done, the streaming task is as usual, operating on incoming data from Kafka.
The way I'm using the models is by running a function similar to this:
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
.repartition(spark_partitions)\
.foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition)))
Theoretically I could check whether or not the models are up to date in the on_partition function, though it would be really wasteful to do this on each batch. I'd like to do it before Spark starts retrieving batches from Kafka, since the downloading from HDFS can take a couple of minutes...
UPDATE:
To be clear: it's not an issue on how to distribute the files or how to load them, it's about how to run an arbitrary method on all workers without operating on any data.
To clarify what actually loading models means currently:
def on_partition(config, partition):
    if not MyClassifier.is_loaded():
        MyClassifier.load_models(config)
    handle_partition(config, partition)
While MyClassifier is something like this:
class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        MyClassifier.clf = load_from_file(config)
I use static methods since PySpark doesn't seem to be able to serialize classes with non-static methods (the state of the class is irrelevant in relation to another worker). Here we only have to call load_models() once, and on all future batches MyClassifier.clf will already be set. This is something that should really not be done for each batch; it's a one-time thing. The same goes for downloading the files from HDFS with fetch_models().
If all you want is to distribute a file between worker machines the simplest approach is to use SparkFiles mechanism:
some_path = ... # local file, a file in DFS, an HTTP, HTTPS or FTP URI.
sc.addFile(some_path)
and retrieve it on the workers using SparkFiles.get and standard IO tools:
from pyspark import SparkFiles

with open(SparkFiles.get(some_path)) as fw:
    ...  # Do something
If you want to make sure that the model is actually loaded, the simplest approach is to load it on module import. Assuming config can be used to retrieve the model path:
model.py:

from pyspark import SparkFiles

config = ...

class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        path = SparkFiles.get(config.get("model_file"))
        MyClassifier.clf = load_from_file(path)

# Executed once per interpreter
MyClassifier.load_models(config)
main.py:
from pyspark import SparkContext
config = ...
sc = SparkContext("local", "foo")
# Executed before StreamingContext starts
sc.addFile(config.get("model_file"))
sc.addPyFile("model.py")
import model
ssc = ...
stream = ...
stream.map(model.MyClassifier.do_something).pprint()
ssc.start()
ssc.awaitTermination()
This is a typical use case for Spark's broadcast variables. Assuming fetch_models returns the models rather than saving them locally, you would do something like:
bc_models = sc.broadcast(fetch_models())
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
.repartition(spark_partitions)\
.foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition, bc_models.value)))
This does assume that your models fit in memory, on the driver and the executors.
You may be worried that broadcasting the models from the single driver to all the executors is inefficient, but it uses 'efficient broadcast algorithms' that can significantly outperform distribution through HDFS, according to this analysis.
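For instance, here is a hypothetical adaptation of the on_partition helper from the question, assuming it is under your control and can take the broadcast value as an extra argument:
def on_partition(config, partition, models):
    # models is bc_models.value, already available on the executor,
    # so no MyClassifier.load_models() call is needed here.
    for message in partition:
        classify(models, message)  # classify() is a hypothetical helper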