How to use Airflow ExternalTaskSensor as a SmartSensor?

I'm trying to implement the ExternalTaskSensor as a Smart Sensor, but since it uses execution_date to poke for the other DAG's status, I can't seem to pass it through. If I omit it from my SmartExternalSensor, I get a KeyError for execution_date, since it doesn't exist.
I tried overriding the get_poke_context method:

def get_poke_context(self, context):
    result = super().get_poke_context(context)

    if self.execution_date is None:
        result['execution_date'] = context['execution_date']

    return result
But it now says that the datetime object is not JSON serializable (this happens while registering the sensor as a Smart Sensor using json.dumps), and it runs as a normal sensor instead. If I pass the string form of that datetime directly, it says that a str object has no isoformat() method, so I know the execution date must be a datetime object.
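For example, this minimal snippet (just the serialization step, outside the smart sensor machinery) reproduces the failure and shows why isoformat() works instead:

import json
from datetime import datetime

context = {'execution_date': datetime(2021, 1, 1)}

# json.dumps fails on the raw datetime object...
try:
    json.dumps({'execution_date': context['execution_date']})
except TypeError as exc:
    print(exc)  # datetime is not JSON serializable

# ... while serializing the isoformat() string works, and can be parsed back later.
print(json.dumps({'execution_date': context['execution_date'].isoformat()}))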
Do you guys have any idea on how to work around this?

I get similar issues trying to use ExternalTaskSensor as a SmartSensor. The code below hasn't been tested extensively, but it seems to work.
import datetime

from airflow.sensors.external_task import ExternalTaskSensor
from airflow.utils.session import provide_session


class SmartExternalTaskSensor(ExternalTaskSensor):
    # Something a bit odd happens with ExternalTaskSensor when run as a smart
    # sensor. ExternalTaskSensor requires execution_date in the poke context,
    # but the smart sensor system passes all poke context values to the
    # constructor of ExternalTaskSensor, but it doesn't allow execution_date
    # as an argument. So we add it...
    def __init__(self, execution_date=None, **kwargs):
        super().__init__(**kwargs)

    def get_poke_context(self, context):
        return {
            'external_dag_id': self.external_dag_id,
            'external_task_id': self.external_task_id,
            'timeout': self.timeout,
            'check_existence': self.check_existence,
            # ... but execution_date has to be manually extracted from the
            # context, and converted to a string, since it will be JSON
            # encoded by the smart sensor system...
            'execution_date': context['execution_date'].isoformat(),
        }

    @provide_session
    def poke(self, context, session=None):
        return super().poke(
            {
                **context,
                # ... and then converted back to a datetime object since
                # that's what ExternalTaskSensor poke needs
                'execution_date': datetime.datetime.fromisoformat(
                    context['execution_date']
                ),
            },
            session,
        )
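For completeness, here is a rough sketch of how the sensor might then be declared in a DAG (the DAG and task ids below are made up, and smart sensors still need to be enabled in the Airflow config):

from airflow import DAG
from airflow.utils.dates import days_ago

with DAG('downstream_dag', start_date=days_ago(1), schedule_interval='@daily') as dag:
    wait_for_upstream = SmartExternalTaskSensor(
        task_id='wait_for_upstream',
        external_dag_id='upstream_dag',    # made-up upstream DAG id
        external_task_id='final_task',     # made-up upstream task id
        check_existence=True,
        timeout=60 * 60,
    )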

Related

PATCH, Update a row with SQLAlchemy, project built using Flask and CORS

I hope everything is going well.
I'm working on a really big project that wasn't set up by me. The project is built using Flask and CORS.
I'm trying to create a query to update a row with SQLAlchemy, following the structure that the project already has. Basically it looks like this:
@app.route("/update-topic", methods=['PATCH'])
async def update_by_id():
    input_data = request.get_json()
    await update_record(input_data)
    return ApplicationTopicSchema(many=True).dump(data)
As you can see, the code above is just a simple endpoint with the PATCH method that gets the input data and passes it to a function update_record(). That function is in charge of updating the record, as you can see in the next snippet:
from sqlalchemy import and_, update


class AppTopics(Base):
    __tablename__ = AppTopics.__table__

    async def update_record(self, data):
        id_data = data['id']
        query = self.__tablename__.update().\
            where(self.__tablename__.c.id == id_data).values(**data).returning(self.__tablename__)
        await super().fetch_one(query=query)
        return 'updated'
Basically it's something like that, and when I try to use the endpoint I get the following error message:
TypeError: The response value returned by the view function cannot be None
Executing <Handle <TaskWakeupMethWrapper object at 0x000001CAD3970F10>(<Future f
inis...events.py:418>) created at C:\Python\Python380\lib\asyncio\tasks.py:881>
Also, I've tried structuring the query in another way, like this:
from sqlalchemy import and_, update


class AppTopics(Base):
    __tablename__ = AppTopics.__table__

    async def update_record(self, data):
        u = update(self.__tablename__)
        u = u.values({"topic": data['topic']})
        u = u.where(self.__tablename__.c.id == data['id'])
        await super().fetch_one(query=u)
        return 'updated'
However, I got the same error.
Do you guys know what is happening and what this error means:
TypeError: The response value returned by the view function cannot be None
Executing <Handle <TaskWakeupMethWrapper object at 0x000001B1B4861100>(<Future f
inis...events.py:418>) created at C:\Python\Python380\lib\asyncio\tasks.py:881>
Thanks in advance for your help and time.
Have a good day, evening, afternoon :)
The error message "TypeError: The response value returned by the view function cannot be None" is indicating that the view function (in this case, the update_by_id function) is not returning a value.
It seems that the function update_record does not return anything. If you want to return the string "updated" after updating the record, you should use a return statement like this:
async def update_record(self, data):
    # update code here
    return 'updated'
And in the update_by_id function you should capture the return value of await update_record(input_data) and return it:
async def update_by_id():
    input_data = request.get_json()
    result = await update_record(input_data)
    return result
Another point is that in the second example you are not returning anything either; you should add a return statement before the end of the function.
Also, you are returning ApplicationTopicSchema(many=True).dump(data), but data is not defined in the function; you should use the result variable returned by the update_record function instead.
async def update_by_id():
    input_data = request.get_json()
    result = await update_record(input_data)
    return ApplicationTopicSchema(many=True).dump(result)
It's important to note that in the first example, the update_record function seems to be missing the self parameter, which could be causing some issues with the class.
It's also important to check whether the fetch_one function from super() awaits the query with the await keyword, and whether fetch_one actually returns something; otherwise that could be the cause of the None return value.
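For instance, a rough sketch along those lines (assuming fetch_one returns the updated row) would look like:

async def update_record(self, data):
    query = (
        self.__tablename__.update()
        .where(self.__tablename__.c.id == data['id'])
        .values(**data)
        .returning(self.__tablename__)
    )
    # Keep whatever fetch_one returns instead of discarding it, so the
    # view function has an actual value to serialize.
    row = await super().fetch_one(query=query)
    return row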
My understanding and knowledge are limited, but I hope this helps. Feel free to shoot me any further questions.

Python. Attach data to asyncio.Task

Is there a proper way to attach additional data to a task created with asyncio.create_task()? Here is an example:
import asyncio
from dataclasses import dataclass


@dataclass
class Foo:
    name: str
    url_to_download: str
    size: int
    ...


async def download_file(url: str):
    return await download_impl()


objs: list[Foo] = [obj1, obj2, ...]

# Question
# How to attach the Foo object to each task?
tasks = [asyncio.create_task(download_file(obj.url_to_download)) for obj in objs]

for task in asyncio.as_completed(tasks):
    # Question
    # How to find out which Foo obj corresponds to the downloaded data?
    data = await task
    process(data)
There is also the option of forwarding the Foo object to download_file and returning it along with the downloaded data, but that seems like poor design. Am I missing something, or does anyone have a better design for solving this problem?
If I understood correctly, you just want a way to match the returned values from all your tasks to the instances of Foo whose url_to_download attributes you passed as arguments to said tasks.
Since all you are doing in that last loop is blocking until all tasks are completed, you may as well simply run the coroutines concurrently via asyncio.gather. The order of the individual return values in the list it returns corresponds to the order of the coroutines passed to it as arguments:
from asyncio import gather, run
from dataclasses import dataclass


@dataclass
class Foo:
    url_to_download: str
    ...


async def download_file(url: str):
    return url


async def main() -> None:
    objs: list[Foo] = [Foo("foo"), Foo("bar"), Foo("baz")]
    returned_values = await gather(
        *(download_file(obj.url_to_download) for obj in objs)
    )
    print(returned_values)  # ['foo', 'bar', 'baz']


if __name__ == '__main__':
    run(main())
That means you can simply match the objs and returned_values via index or zip them or whatever you need to do.
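For example, a short follow-up loop along these lines (process stands in for whatever handling you need) pairs each Foo with its result:

for obj, data in zip(objs, returned_values):
    # objs and returned_values are in the same order, so each Foo lines up
    # with the value returned by its download_file call.
    process(obj, data)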
As for your comment that saving the returned value in an attribute of the object itself is "poor design", I see absolutely no justification for that assessment. That would also be a perfectly valid way and arguably even cleaner. You might even define a download method on Foo for that purpose. But that is another discussion.
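A minimal sketch of that alternative, assuming the downloaded payload is simply stored on the instance (the data attribute and the download_impl call are placeholders for the poster's own code):

from dataclasses import dataclass
from typing import Any


@dataclass
class Foo:
    url_to_download: str
    data: Any = None  # placeholder slot for the downloaded payload

    async def download(self) -> "Foo":
        # download_impl stands in for the real download routine from the question
        self.data = await download_impl(self.url_to_download)
        return self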

How to mock sqlalchemy model create method in pytest

I have an API which creates new flights in the db, generating an id and then inserting into the db. I now need to mock this. It's existing code, so I don't want to change it.
# create_flight.py (Existing code)
from app.models.flight import Flight

flight = Flight()
flight.create_flight(json_data)  # need to mock this,
# this generates and commits in db and sets the flight object
response = {'flight_id': flight.flight_id}
My attempt using pytest:
# conftest.py
import uuid

import pytest


@pytest.fixture
def mock_response(monkeypatch):
    def mock_create_flight(*args):
        class flight:
            flight_id = str(uuid.uuid4())
        test = flight()
        return test
    monkeypatch.setattr('app.models.flight.Flight.create_flight', mock_create_flight)
It patches correctly, but I want the flight to have the flight_id attribute set so that the response dictionary sends the mocked flight id rather than going to the db. I'm definitely doing something wrong here, just not sure where. Thanks for looking.
I got around it by importing the SQLAlchemy model Flight in conftest.py; here is my solution:
import uuid

import pytest

from app.models.flight import Flight


@pytest.fixture(scope='session')
def get_flight_id():
    yield str(uuid.uuid4())


@pytest.fixture(scope="session")
def mock_response(monkeypatch, get_flight_id):
    def mock_create_flight(*args):
        flight = Flight()
        flight.flight_id = get_flight_id
        flight.version_id = 1
        return flight
    monkeypatch.setattr('app.models.flight.Flight.create_flight', mock_create_flight)
This then returns the mocked flight object in the original code.
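A minimal sketch of how the fixture might be used in a test (the client fixture and the /flights endpoint are made up, just to show the wiring):

def test_create_flight_returns_mocked_id(client, mock_response, get_flight_id):
    # mock_response patches Flight.create_flight before the request is made,
    # so the view builds its response from the mocked flight object.
    response = client.post('/flights', json={'origin': 'AMS', 'destination': 'JFK'})
    assert response.json['flight_id'] == get_flight_id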

Call a function in Python file using Flask

I have a Python file called testing_file.py:
from datetime import datetime
import json
import MySQLdb


# Open database connection
class DB():
    def __init__(self, server, user, password, db_name):
        db = MySQLdb.connect(server, user, password, db_name)
        self.cur = db.cursor()

    def time_statistic(self, start_date, end_date):
        time_list = {}
        sql = "SELECT activity_log.datetime, activity_log.user_id FROM activity_log"
        self.cur.execute(sql)
        self.date_data = self.cur.fetchall()
        for content in self.date_data:
            timestamp = str(content[0])
            datetime_object = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
            timestamps = datetime.strftime(datetime_object, "%Y-%m-%d")
            if start_dt <= timestamps and timestamps <= end_dt:
                if timestamps not in time_list:
                    time_list[timestamps] = 1
                else:
                    time_list[timestamps] += 1
        return json.dumps(time_list)


start_date = datetime.strptime(str('2017-4-7'), '%Y-%m-%d')
start_dt = datetime.strftime(start_date, "%Y-%m-%d")
end_date = datetime.strptime(str('2017-5-4'), '%Y-%m-%d')
end_dt = datetime.strftime(end_date, "%Y-%m-%d")

db = DB("host", "user_db", "pass_db", "db_name")
db.time_statistic(start_date, end_date)
I want to access the result (time_list) through an API using Flask. This is what I've written so far; it doesn't work, and I've also tried another way:
from flask import Flask
from testing_api import *

app = Flask(__name__)


@app.route("/")
def get():
    db = DB("host", "user_db", "pass_db", "db_name")
    d = db.time_statistic()
    return d


if __name__ == "__main__":
    app.run(debug=True)
Question: this is my first time working with an API and Flask. Can anyone please help me through this? Any hints are appreciated. Thank you.
I've got an empty result: {}
There are many things wrong with what you are doing.
1.> def get(self, DB): why self? This function does not belong to a class; it is not an instance method. self is a reference to the class instance when an instance method is called. Here not only is it not needed, it is plain and simple wrong.
2.> If you look into Flask's routing declarations a little bit, you will see how you should declare a route with a parameter. This is the link. In essence you should do something like this:
#app.route("/path/<variable>")
def route_func(variable):
return variable
3.> Finally, one more thing I would like to mention: please do not call a regular Python file test_<filename>.py unless you plan to use it as a unit testing file. This is very confusing.
Oh, and you have imported DB from your module already; no need to pass it as a parameter to a function. It should be available inside it anyway.
There are quite a few things that are wrong (ranging from "useless and unclear" to "plain wrong") in your code.
wrt/ the TypeError: as the error message says, your get() function expects two arguments (self and DB) which won't be passed by Flask - and are actually not used in the function anyway. Remove both arguments and you'll get rid of this error - just to find out you now have a NameError on the first line of the get() function (obviously since you didn't import time_statistic nor define start_date and end_date).
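A rough sketch of a cleaned-up route, assuming the module really is called testing_file.py as described at the top of the question and that the credentials stay as placeholders:

from datetime import datetime

from flask import Flask
from testing_file import DB

app = Flask(__name__)


@app.route("/")
def get():
    # Placeholder credentials, as in the question.
    db = DB("host", "user_db", "pass_db", "db_name")
    start_date = datetime.strptime('2017-4-7', '%Y-%m-%d')
    end_date = datetime.strptime('2017-5-4', '%Y-%m-%d')
    # time_statistic already returns a JSON string, which Flask can return directly.
    return db.time_statistic(start_date, end_date)


if __name__ == "__main__":
    app.run(debug=True)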

Spark streaming and py4j.Py4JException: Method __getnewargs__([]) does not exist

I am trying to implement a Spark streaming application, but I am getting back an exception: "py4j.Py4JException: Method __getnewargs__([]) does not exist"
I do not understand the source of this exception. I read here that I cannot use a SparkSession instance outside of the driver. But, I do not know whether I am doing that. I don't understand how to tell whether some code executes on the driver or an executor - I understand the difference between transformations and actions (I think), but when it comes to streams and foreachRDD, I get lost.
The app is a Spark streaming app, running on AWS EMR, reading data from AWS Kinesis. We submit the Spark app via spark-submit, with --deploy-mode cluster. Each record in the stream is a JSON object in the form:
{"type":"some string","state":"an escaped JSON string"}
E.g.:
{"type":"type1","state":"{\"some_property\":\"some value\"}"}
Here is my app in its current state:
# Each handler subclasses from BaseHandler and
# has the method
# def process(self, df, df_writer, base_writer_path)
# Each handler's process method performs additional transformations.
# df_writer is a function which writes a Dataframe to some S3 location.
HANDLER_MAP = {
    'type1': Type1Handler(),
    'type2': Type2Handler(),
    'type3': Type3Handler()
}

FORMAT = 'MyProject %(asctime)s %(levelname)s %(name)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)

# Use a closure and lambda to create streaming context
create = lambda: create_streaming_context(
    spark_app_name=spark_app_name,
    kinesis_stream_name=kinesis_stream_name,
    kinesis_endpoint=kinesis_endpoint,
    kinesis_region=kinesis_region,
    initial_position=InitialPositionInStream.LATEST,
    checkpoint_interval=checkpoint_interval,
    checkpoint_s3_path=checkpoint_s3_path,
    data_s3_path=data_s3_path)

streaming_context = StreamingContext.getOrCreate(checkpoint_s3_path, create)

streaming_context.start()
streaming_context.awaitTermination()
The function for creating the streaming context:
def create_streaming_context(
        spark_app_name, kinesis_stream_name, kinesis_endpoint,
        kinesis_region, initial_position, checkpoint_interval,
        data_s3_path, checkpoint_s3_path):
    """Create a new streaming context or reuse a checkpointed one."""
    # Spark configuration
    spark_conf = SparkConf()
    spark_conf.set('spark.streaming.blockInterval', 37500)
    spark_conf.setAppName(spark_app_name)

    # Spark context
    spark_context = SparkContext(conf=spark_conf)

    # Spark streaming context
    streaming_context = StreamingContext(spark_context, batchDuration=300)
    streaming_context.checkpoint(checkpoint_s3_path)

    # Spark session
    spark_session = get_spark_session_instance(spark_conf)

    # Set up stream processing
    stream = KinesisUtils.createStream(
        streaming_context, spark_app_name, kinesis_stream_name,
        kinesis_endpoint, kinesis_region, initial_position,
        checkpoint_interval)

    # Each record in the stream is a JSON object in the form:
    # {"type": "some string", "state": "an escaped JSON string"}
    json_stream = stream.map(json.loads)

    for state_type in HANDLER_MAP.iterkeys():
        filter_stream(json_stream, spark_session, state_type, data_s3_path)

    return streaming_context
The function get_spark_session_instance returns a global SparkSession instance (copied from here):
def get_spark_session_instance(spark_conf):
    """Lazily instantiated global instance of SparkSession"""
    logger.info('Obtaining global SparkSession instance...')
    if 'sparkSessionSingletonInstance' not in globals():
        logger.info('Global SparkSession instance does not exist, creating it...')
        globals()['sparkSessionSingletonInstance'] = SparkSession\
            .builder\
            .config(conf=spark_conf)\
            .getOrCreate()
    return globals()['sparkSessionSingletonInstance']
The filter_stream function is intended to filter the stream by the type property in the JSON. The intention is to transform the stream into a stream where each record is the escaped JSON string from the "state" property in the original JSON:
def filter_stream(json_stream, spark_session, state_type, data_s3_path):
    """Filter stream by type and process the stream."""
    state_type_stream = json_stream\
        .filter(lambda jsonObj: jsonObj['type'] == state_type)\
        .map(lambda jsonObj: jsonObj['state'])

    state_type_stream.foreachRDD(
        lambda rdd: process_rdd(spark_session, rdd, state_type, df_writer, data_s3_path))
The process_rdd function is intended to load the JSON into a Dataframe, using the correct schema depending on the type in the original JSON object. The handler instance returns a valid Spark schema, and has a process method which performs further transformations on the dataframe (after which df_writer is called, and the Dataframe is written to S3):
def process_rdd(spark_session, rdd, state_type, df_writer, data_s3_path):
    """Process an RDD by state type."""
    if rdd.isEmpty():
        logger.info('RDD is empty, returning early.')
        return

    handler = HANDLER_MAP[state_type]
    df = spark_session.read.json(rdd, handler.get_schema())
    handler.process(df, df_writer, data_s3_path)
Basically I am confused about the source of the exception. Is it related to how I am using spark_session.read.json? If so, how is it related? If not, is there something else in my code which is incorrect?
Everything seems to work correctly if I just replace the call to StreamingContext.getOrCreate with the contents of the create_streaming_context method. I was mistaken about this - I get the same exception either way. I think the checkpoint stuff is a red herring... I am obviously doing something else incorrectly.
I would greatly appreciate any help with this problem and I'm happy to clarify anything or add additional information!
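For reference, the pattern in the Spark Streaming programming guide avoids capturing the driver-side SparkSession in the closures handed to foreachRDD, and instead rebuilds (or reuses) the session from the RDD's own SparkContext inside the function. A rough sketch of how process_rdd could be restructured along those lines (not the poster's confirmed fix):

from pyspark.sql import SparkSession


def process_rdd(rdd, state_type, df_writer, data_s3_path):
    """Process an RDD by state type without closing over the driver's session."""
    if rdd.isEmpty():
        return

    # Obtain or reuse the SparkSession from the RDD's SparkContext instead of
    # passing the session object through the foreachRDD lambda.
    spark = SparkSession.builder.config(conf=rdd.context.getConf()).getOrCreate()

    handler = HANDLER_MAP[state_type]
    df = spark.read.json(rdd, handler.get_schema())
    handler.process(df, df_writer, data_s3_path)


# ... and the corresponding foreachRDD call no longer carries spark_session:
# state_type_stream.foreachRDD(
#     lambda rdd: process_rdd(rdd, state_type, df_writer, data_s3_path))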
