Spark streaming and py4j.Py4JException: Method __getnewargs__([]) does not exist - python

I am trying to implement a Spark streaming application, but I keep getting back an exception: "py4j.Py4JException: Method __getnewargs__([]) does not exist"
I do not understand the source of this exception. I read here that I cannot use a SparkSession instance outside of the driver, but I do not know whether I am doing that. I don't understand how to tell whether some code executes on the driver or on an executor. I understand the difference between transformations and actions (I think), but when it comes to streams and foreachRDD, I get lost.
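For example, my rough (and possibly wrong) mental model is that in a snippet like the one below, the function passed to foreachRDD is invoked on the driver once per batch, while the lambdas inside the RDD operations run on the executors (some_dstream here is just a placeholder, not part of my app):

def handle_batch(rdd):
    # Invoked on the driver, once per batch interval...
    if rdd.isEmpty():
        return
    total = rdd.map(lambda line: len(line)) \
               .reduce(lambda a, b: a + b)  # ...but these lambdas run on the executors
    print(total)  # printed in the driver's logs

some_dstream.foreachRDD(handle_batch)

Is that roughly right, or am I already off track here?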
The app is a Spark streaming app, running on AWS EMR, reading data from AWS Kinesis. We submit the Spark app via spark-submit, with --deploy-mode cluster. Each record in the stream is a JSON object in the form:
{"type":"some string","state":"an escaped JSON string"}
E.g.:
{"type":"type1","state":"{\"some_property\":\"some value\"}"}
Here is my app in its current state:
# Each handler subclasses from BaseHandler and has the method
#     def process(self, df, df_writer, base_writer_path)
# Each handler's process method performs additional transformations.
# df_writer is a function which writes a DataFrame to some S3 location.
HANDLER_MAP = {
    'type1': Type1Handler(),
    'type2': Type2Handler(),
    'type3': Type3Handler()
}

FORMAT = 'MyProject %(asctime)s %(levelname)s %(name)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)

# Use a closure and lambda to create the streaming context
create = lambda: create_streaming_context(
    spark_app_name=spark_app_name,
    kinesis_stream_name=kinesis_stream_name,
    kinesis_endpoint=kinesis_endpoint,
    kinesis_region=kinesis_region,
    initial_position=InitialPositionInStream.LATEST,
    checkpoint_interval=checkpoint_interval,
    checkpoint_s3_path=checkpoint_s3_path,
    data_s3_path=data_s3_path)

streaming_context = StreamingContext.getOrCreate(checkpoint_s3_path, create)

streaming_context.start()
streaming_context.awaitTermination()
The function for creating the streaming context:
def create_streaming_context(
        spark_app_name, kinesis_stream_name, kinesis_endpoint,
        kinesis_region, initial_position, checkpoint_interval,
        data_s3_path, checkpoint_s3_path):
    """Create a new streaming context or reuse a checkpointed one."""
    # Spark configuration
    spark_conf = SparkConf()
    spark_conf.set('spark.streaming.blockInterval', 37500)
    spark_conf.setAppName(spark_app_name)

    # Spark context
    spark_context = SparkContext(conf=spark_conf)

    # Spark streaming context
    streaming_context = StreamingContext(spark_context, batchDuration=300)
    streaming_context.checkpoint(checkpoint_s3_path)

    # Spark session
    spark_session = get_spark_session_instance(spark_conf)

    # Set up stream processing
    stream = KinesisUtils.createStream(
        streaming_context, spark_app_name, kinesis_stream_name,
        kinesis_endpoint, kinesis_region, initial_position,
        checkpoint_interval)

    # Each record in the stream is a JSON object in the form:
    # {"type": "some string", "state": "an escaped JSON string"}
    json_stream = stream.map(json.loads)

    for state_type in HANDLER_MAP.iterkeys():
        filter_stream(json_stream, spark_session, state_type, data_s3_path)

    return streaming_context
The function get_spark_session_instance returns a global SparkSession instance (copied from here):
def get_spark_session_instance(spark_conf):
    """Lazily instantiated global instance of SparkSession."""
    logger.info('Obtaining global SparkSession instance...')
    if 'sparkSessionSingletonInstance' not in globals():
        logger.info('Global SparkSession instance does not exist, creating it...')
        globals()['sparkSessionSingletonInstance'] = SparkSession\
            .builder\
            .config(conf=spark_conf)\
            .getOrCreate()
    return globals()['sparkSessionSingletonInstance']
The filter_stream function is intended to filter the stream by the type property in the JSON, transforming it into a stream where each record is the escaped JSON string from the "state" property of the original object:
def filter_stream(json_stream, spark_session, state_type, data_s3_path):
    """Filter stream by type and process the stream."""
    state_type_stream = json_stream\
        .filter(lambda jsonObj: jsonObj['type'] == state_type)\
        .map(lambda jsonObj: jsonObj['state'])
    state_type_stream.foreachRDD(
        lambda rdd: process_rdd(spark_session, rdd, state_type, df_writer, data_s3_path))
The process_rdd function is intended to load the JSON into a Dataframe, using the correct schema depending on the type in the original JSON object. The handler instance returns a valid Spark schema, and has a process method which performs further transformations on the dataframe (after which df_writer is called, and the Dataframe is written to S3):
def process_rdd(spark_session, rdd, state_type, df_writer, data_s3_path):
    """Process an RDD by state type."""
    if rdd.isEmpty():
        logger.info('RDD is empty, returning early.')
        return
    handler = HANDLER_MAP[state_type]
    df = spark_session.read.json(rdd, handler.get_schema())
    handler.process(df, df_writer, data_s3_path)
Basically I am confused about the source of the exception. Is it related to how I am using spark_session.read.json? If so, how is it related? If not, is there something else in my code which is incorrect?
Update: I originally thought everything worked correctly if I just replaced the call to StreamingContext.getOrCreate with the contents of the create_streaming_context function, but I was mistaken - I get the same exception either way. I think the checkpointing is a red herring; I am obviously doing something else incorrectly.
I would greatly appreciate any help with this problem and I'm happy to clarify anything or add additional information!

Related

Error trying to write CSV file to Google Cloud Storage from Dataflow pipeline

I'm working on building a Dataflow pipeline that reads a CSV file (containing 250,000 rows) from my Cloud Storage bucket, modifies the value of each row and then writes the modified contents to a new CSV in the same bucket. With the code below I'm able to read and modify the contents of the original file, but when I attempt to write the contents of the new file in GCS I get the following error:
google.api_core.exceptions.TooManyRequests: 429 POST https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o?uploadType=multipart: {
  "error": {
    "code": 429,
    "message": "The rate of change requests to the object my-bucket/product-codes/URL_test_codes.csv exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
    "errors": [
      {
        "message": "The rate of change requests to the object my-bucket/product-codes/URL_test_codes.csv exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
        "domain": "usageLimits",
        "reason": "rateLimitExceeded"
      }
    ]
  }
}
: ('Request failed with status code', 429, 'Expected one of', <HTTPStatus.OK: 200>) [while running 'Store Output File']
My code in Dataflow:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import traceback
import sys
import pandas as pd
from cryptography.fernet import Fernet
import google.auth
from google.cloud import storage

fernet_secret = 'aD4t9MlsHLdHyuFKhoyhy9_eLKDfe8eyVSD3tu8KzoP='
bucket = 'my-bucket'
inputFile = f'gs://{bucket}/product-codes/test_codes.csv'
outputFile = 'product-codes/URL_test_codes.csv'

# Pipeline Logic
def product_codes_pipeline(project, env, region='us-central1'):
    options = PipelineOptions(
        streaming=False,
        project=project,
        region=region,
        staging_location="gs://my-bucket-dataflows/Templates/staging",
        temp_location="gs://my-bucket-dataflows/Templates/temp",
        template_location="gs://my-bucket-dataflows/Templates/Generate_Product_Codes.py",
        subnetwork='https://www.googleapis.com/compute/v1/projects/{}/regions/us-central1/subnetworks/{}-private'.format(project, env)
    )

    # Transform function
    def genURLs(code):
        f = Fernet(fernet_secret)
        encoded = code.encode()
        encrypted = f.encrypt(encoded)
        decrypted = f.decrypt(encrypted.decode().encode())
        decoded = decrypted.decode()
        if code != decoded:
            print(f'Error: Code {code} and decoded code {decoded} do not match')
            sys.exit(1)
        url = 'https://some-url.com/redeem/product-code=' + encrypted.decode()
        return url

    class WriteCSVFIle(beam.DoFn):
        def __init__(self, bucket_name):
            self.bucket_name = bucket_name

        def start_bundle(self):
            self.client = storage.Client()

        def process(self, urls):
            df = pd.DataFrame([urls], columns=['URL'])
            bucket = self.client.get_bucket(self.bucket_name)
            bucket.blob(f'{outputFile}').upload_from_string(df.to_csv(index=False), 'text/csv')
    # End function

    p = beam.Pipeline(options=options)
    (p | 'Read Input CSV' >> beam.io.ReadFromText(inputFile, skip_header_lines=1)
       | 'Map Codes' >> beam.Map(genURLs)
       | 'Store Output File' >> beam.ParDo(WriteCSVFIle(bucket)))
    p.run()
The code produces URL_test_codes.csv in my bucket, but the file only contains one row (not including the 'URL' header) which tells me that my code is writing/overwriting the file as it processes each row. Is there a way to bulk write the contents of the entire file instead of making a series of requests to update the file? I'm new to Python/Dataflow so any help is greatly appreciated.
Let's point out the issues: the evident one is a quota issue on the GCS side, reflected by the 429 error code. But, as you noted, this stems from the underlying issue, which is how you try to write your data to your blob.
Since a Beam pipeline works on a parallel collection of elements (a PCollection), each pipeline step is executed for every element. In other words, your ParDo tries to write something to your output file once per element in the PCollection.
So there are some issues with your WriteCSVFIle function. To write your PCollection to GCS, it would be better to use a separate pipeline step focused on writing the whole PCollection, as follows:
First, you can import this Function already included in Apache Beam:
from apache_beam.io import WriteToText
Then, you use it at the end of your pipeline:
| 'Write PCollection to Bucket' >> WriteToText('gs://{0}/{1}'.format(bucket_name, outputFile))
With this option you don't need to create a storage client or reference a blob; the transform just needs the GCS URI where it should write the final result, and you can adjust it with the parameters described in the documentation.
That leaves the DataFrame created in your WriteCSVFIle function. Each pipeline step produces a new PCollection, so with your current logic a DataFrame-creating step that receives one URL per element would produce one DataFrame per URL. Since it seems you just want to write the results of genURLs, and 'URL' is the only column in your DataFrame, going directly from genURLs to WriteToText may already give you what you're looking for.
Either way, you can adjust your pipeline as needed, but at least the WriteToText transform takes care of writing your whole final PCollection to your Cloud Storage bucket.
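For illustration, a minimal sketch of what the adjusted pipeline could look like (untested; it reuses the bucket and outputFile variables from your question, and the optional header argument just writes the 'URL' column name at the top of each output shard):

from apache_beam.io import WriteToText

p = beam.Pipeline(options=options)
(p | 'Read Input CSV' >> beam.io.ReadFromText(inputFile, skip_header_lines=1)
   | 'Map Codes' >> beam.Map(genURLs)
   | 'Write PCollection to Bucket' >> WriteToText(
         'gs://{0}/{1}'.format(bucket, outputFile), header='URL'))
p.run()

Note that WriteToText may shard the output across several files by default; if you really need a single CSV you can pass num_shards=1, at the cost of some parallelism.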

How to use Airflow ExternalTaskSensor as a SmartSensor?

I'm trying to implement the ExternalTaskSensor using SmartSensors, but since it uses execution_date to poke the other DAG's status, I can't seem to pass it. If I omit it from my SmartExternalSensor, I get a KeyError for execution_date, since it doesn't exist.
I tried overriding the get_poke_context method
def get_poke_context(self, context):
    result = super().get_poke_context(context)
    if self.execution_date is None:
        result['execution_date'] = context['execution_date']
    return result
but it now says that the datetime object is not JSON serializable (this happens while registering the sensor as a SmartSensor, via json.dumps), and it runs as a normal sensor. If I pass the string form of that datetime directly, it says that a str object has no isoformat() method, so I know the execution date must be a datetime object.
Do you guys have any idea on how to work around this?
I get similar issues trying to use ExternalTaskSensor as a SmartSensor. The code below hasn't been tested extensively, but it seems to work.
import datetime

from airflow.sensors.external_task import ExternalTaskSensor
from airflow.utils.session import provide_session


class SmartExternalTaskSensor(ExternalTaskSensor):
    # Something a bit odd happens with ExternalTaskSensor when run as a smart
    # sensor. ExternalTaskSensor requires execution_date in the poke context,
    # but the smart sensor system passes all poke context values to the
    # constructor of ExternalTaskSensor, but it doesn't allow execution_date
    # as an argument. So we add it...
    def __init__(self, execution_date=None, **kwargs):
        super().__init__(**kwargs)

    def get_poke_context(self, context):
        return {
            'external_dag_id': self.external_dag_id,
            'external_task_id': self.external_task_id,
            'timeout': self.timeout,
            'check_existence': self.check_existence,
            # ... but execution_date has to be manually extracted from the
            # context, and converted to a string, since it will be JSON
            # encoded by the smart sensor system...
            'execution_date': context['execution_date'].isoformat(),
        }

    @provide_session
    def poke(self, context, session=None):
        return super().poke(
            {
                **context,
                # ... and then converted back to a datetime object since
                # that's what ExternalTaskSensor poke needs
                'execution_date': datetime.datetime.fromisoformat(
                    context['execution_date']
                ),
            },
            session,
        )

How do I create a decorator for a function with added functionality?

I want to refactor my code. What I am currently doing is extracting data from an ad platform API endpoint and transforming and uploading it to big query. I have the following code which works but I want to refactor it after having learnt about decorators.
Decorators are a very powerful and useful tool in Python, since they allow programmers to modify the behavior of a function or class. A decorator wraps another function in order to extend the behavior of the wrapped function without permanently modifying it.
import datauploader
import ndjson
import os


def upload_ads_details(extractor, access_token, acccount_id, req_output,
                       bq_client_name, bq_dataset_id, bq_gs_bucket,
                       ndjson_local_file_path, ndjson_file_name):
    # Function to extract data from the API/ad platform
    ads_dictionary = extractor.get_ad_dictionary(access_token, acccount_id)

    # Converting data to ndjson for upload to BigQuery
    output_ndjson = ndjson.dumps(ads_dictionary)
    with open(ndjson_local_file_path, 'w') as f:
        f.writelines(output_ndjson)
    print(os.path.abspath(ndjson_local_file_path))

    # This code below remains the same for all the other function calls
    if req_output:
        # Inputs for the uploading functions
        print("Processing Upload")
        partition_by = "_insert_time"
        str_gcs_file_name = ndjson_file_name
        str_local_file_name = ndjson_local_file_path
        gs_bucket = bq_gs_bucket
        gs_file_format = "JSON"
        table_id = 'ads_performance_stats_table'
        table_schema = ads_dictionary_schema

        # Uploading function
        datauploader.loadToBigQuery(
            bq_client_name,
            bq_dataset_id,
            table_id,
            table_schema,
            partition_by,
            str_gcs_file_name,
            str_local_file_name,
            gs_bucket,
            gs_file_format,
            autodetect=False,
            req_partition=True,
            skip_leading_n_row=0
        )
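To make it concrete, this is roughly the shape I'm imagining for the refactor (an untested sketch; it assumes ads_dictionary_schema and datauploader.loadToBigQuery can be imported and called exactly as in the code above, and keeps the original argument order):

import functools

import datauploader
import ndjson
# Hypothetical import - wherever ads_dictionary_schema is actually defined
from schemas import ads_dictionary_schema


def with_bq_upload(table_id, table_schema, partition_by="_insert_time"):
    """Wrap an extractor function so the ndjson dump + BigQuery upload
    boilerplate runs after it, instead of being repeated everywhere."""
    def decorator(extract_func):
        @functools.wraps(extract_func)
        def wrapper(extractor, access_token, account_id, req_output,
                    bq_client_name, bq_dataset_id, bq_gs_bucket,
                    ndjson_local_file_path, ndjson_file_name):
            # The wrapped function only does the extraction part.
            records = extract_func(extractor, access_token, account_id)
            with open(ndjson_local_file_path, 'w') as f:
                f.writelines(ndjson.dumps(records))
            if req_output:
                datauploader.loadToBigQuery(
                    bq_client_name, bq_dataset_id, table_id, table_schema,
                    partition_by, ndjson_file_name, ndjson_local_file_path,
                    bq_gs_bucket, "JSON",
                    autodetect=False, req_partition=True,
                    skip_leading_n_row=0)
            return records
        return wrapper
    return decorator


@with_bq_upload('ads_performance_stats_table', ads_dictionary_schema)
def upload_ads_details(extractor, access_token, account_id):
    return extractor.get_ad_dictionary(access_token, account_id)

Is this a sensible direction, or is there a more idiomatic way to structure it?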

How to convert Ansible-Playbooks to be used with Ansible-API

I'm using Ansible to set up servers.
Sometimes I'm using the command ansible-playbook <my-playbook.yml> -i <inventory.txt>, and sometimes I'm using the Ansible Python API (https://docs.ansible.com/ansible/latest/dev_guide/developing_api.html).
Unfortunately I don't know how to use my already existing playbooks with the API.
Here is the API:
#!/usr/bin/env python

import json
import shutil

from ansible.module_utils.common.collections import ImmutableDict
from ansible.parsing.dataloader import DataLoader
from ansible.vars.manager import VariableManager
from ansible.inventory.manager import InventoryManager
from ansible.playbook.play import Play
from ansible.executor.task_queue_manager import TaskQueueManager
from ansible.plugins.callback import CallbackBase
from ansible import context
import ansible.constants as C


class ResultCallback(CallbackBase):
    """A sample callback plugin used for performing an action as results come in

    If you want to collect all results into a single object for processing at
    the end of the execution, look into utilizing the ``json`` callback plugin
    or writing your own custom callback plugin
    """

    def v2_runner_on_ok(self, result, **kwargs):
        """Print a json representation of the result

        This method could store the result in an instance attribute for retrieval later
        """
        host = result._host
        print(json.dumps({host.name: result._result}, indent=4))


# since the API is constructed for CLI it expects certain options to always be set in the context object
context.CLIARGS = ImmutableDict(connection='local', module_path=['/to/mymodules'], forks=10, become=None,
                                become_method=None, become_user=None, check=False, diff=False)

# initialize needed objects
loader = DataLoader()  # Takes care of finding and reading yaml, json and ini files
passwords = dict(vault_pass='secret')

# Instantiate our ResultCallback for handling results as they come in. Ansible expects this to be one of its main display outlets
results_callback = ResultCallback()

# create inventory, use path to host config file as source or hosts in a comma separated string
inventory = InventoryManager(loader=loader, sources='localhost,')

# variable manager takes care of merging all the different sources to give you a unified view of variables available in each context
variable_manager = VariableManager(loader=loader, inventory=inventory)

# create data structure that represents our play, including tasks, this is basically what our YAML loader does internally.
play_source = dict(
    name="Ansible Play",
    hosts='localhost',
    gather_facts='no',
    tasks=[
        dict(action=dict(module='shell', args='ls'), register='shell_out'),
        dict(action=dict(module='debug', args=dict(msg='{{shell_out.stdout}}')))
    ]
)

# Create play object, playbook objects use .load instead of init or new methods,
# this will also automatically create the task objects from the info provided in play_source
play = Play().load(play_source, variable_manager=variable_manager, loader=loader)

# Run it - instantiate task queue manager, which takes care of forking and setting up all objects to iterate over host list and tasks
tqm = None
try:
    tqm = TaskQueueManager(
        inventory=inventory,
        variable_manager=variable_manager,
        loader=loader,
        passwords=passwords,
        stdout_callback=results_callback,  # Use our custom callback instead of the ``default`` callback plugin, which prints to stdout
    )
    result = tqm.run(play)  # most interesting data for a play is actually sent to the callback's methods
finally:
    # we always need to cleanup child procs and the structures we use to communicate with them
    if tqm is not None:
        tqm.cleanup()

    # Remove ansible tmpdir
    shutil.rmtree(C.DEFAULT_LOCAL_TMP, True)
I already tried to convert the playbooks from yaml to python dictionary using
import yaml

with open('ansible-files/main.yml') as f:
    data = yaml.load(f)
and then passing data in as play_source.
This will work for simple playbooks, but not for more complex ones, where different roles are involved. Is there a way to pass an existing playbook with roles, templates and other stuff directly to the API?
If not: Is there another way of using existing playbooks, when using the Python API without the need to rewrite everything by hand?
Thank you!
Use PlaybookExecutor to execute a playbook that includes complex roles:
executor = PlaybookExecutor(
    playbooks=[playbooks],
    inventory=self.inventory,
    variable_manager=self.variable_manager,
    loader=self.loader,
    passwords=self.passwords
)
executor.run()
Here playbooks is the path to your playbook YAML file.
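A fuller sketch of how that could fit with the script above (untested; it assumes Ansible 2.8+, where PlaybookExecutor reads options from context.CLIARGS - depending on the version you may also need keys such as listhosts, listtasks, listtags, syntax and start_at_task in CLIARGS):

from ansible.executor.playbook_executor import PlaybookExecutor

# Reuse the loader, inventory, variable_manager and passwords objects that
# were already created above, and point the executor at an existing playbook.
executor = PlaybookExecutor(
    playbooks=['ansible-files/main.yml'],
    inventory=inventory,
    variable_manager=variable_manager,
    loader=loader,
    passwords=passwords,
)
rc = executor.run()  # returns 0 on success, non-zero otherwise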

Twitter streaming using spark and kafka: How store the data in MongoDB

I am collecting twitter stream data using this python code
https://github.com/sridharswamy/Twitter-Sentiment-Analysis-Using-Spark-Streaming-And-Kafka/blob/master/app.py
After that, I run this code to create a streaming context and store the data in MongoDB.
def main():
    conf = SparkConf().setMaster("local[2]").setAppName("Streamer")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)
    ssc.checkpoint("checkpoint")
    kstream = KafkaUtils.createDirectStream(
        ssc, topics=['topic1'],
        kafkaParams={"metadata.broker.list": 'localhost:9092'})
    tweets = kstream.map(lambda x: x[1].encode("ascii", "ignore"))
    # ................ insert in MongoDB .........................
    db.mynewcollection.insert_one(tweets)
    ssc.start()
    ssc.awaitTerminationOrTimeout(100)
    ssc.stop(stopGraceFully=True)

if __name__ == "__main__":
    urllib3.contrib.pyopenssl.inject_into_urllib3()
    connection = pymongo.MongoClient('....', ...)
    db = connection['twitter1']
    db.authenticate('..', '...')
    main()
but I got this error:
TypeError: document must be an instance of dict, bson.son.SON, bson.raw_bson.RawBSONDocument, or a type that inherits from collections.MutableMapping
I also tried to use foreachRDD and create a function Save
tweets.foreachRDD(Save)
and I moved the 'insert' to this function
def Save(rdd):
    if not rdd.isEmpty():
        db.mynewcollection.insert_one(rdd)
but it does not work
TypeError: can't pickle _thread.lock objects
Can anyone help me understand how to store the streaming data in MongoDB?
The first error occurs because you pass a distributed object (the DStream) into db.mynewcollection.insert_one.
The second error occurs because you initialize the database connection on the driver, and in general connection objects cannot be serialized.
While there exist a number of Spark / MongoDB connectors you should take a look at (see Getting Spark, Python, and MongoDB to work together), a generic pattern is to use foreachPartition. Define a helper:
def insert_partition(xs):
    connection = pymongo.MongoClient('....', ...)
    db = connection['twitter1']
    db.authenticate('..', '...')
    db.mynewcollection.insert_many(xs)
and then:
def to_dict(s):
    return ...  # Convert input to a format acceptable by `insert_many`, for example with json.loads

tweets.map(to_dict) \
    .foreachRDD(lambda rdd: rdd.foreachPartition(insert_partition))
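For example, a concrete version of the two helpers might look like this (an untested sketch; it assumes each Kafka record value is a JSON string, keeps the connection placeholders from your code, and skips empty partitions since pymongo's insert_many rejects an empty document list):

import json
import pymongo


def to_dict(s):
    # Assumes each record coming off the Kafka stream is a JSON string.
    return json.loads(s)


def insert_partition(xs):
    docs = list(xs)  # materialize the iterator so empty partitions can be skipped
    if not docs:
        return
    connection = pymongo.MongoClient('....', ...)  # same connection details as above
    db = connection['twitter1']
    db.authenticate('..', '...')
    db.mynewcollection.insert_many(docs)
    connection.close()


tweets.map(to_dict) \
    .foreachRDD(lambda rdd: rdd.foreachPartition(insert_partition))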
