Does a celery chain execute tasks in a specific order? - python

I have a task task_main that calls other tasks, and I need them to execute in a specific order.
The Celery docs say not to call them synchronously one after another with .delay() and .get():
http://docs.celeryproject.org/en/latest/userguide/tasks.html#avoid-launching-synchronous-subtasks
Will using chain run them in order? I cannot find this in the docs.
from celery import shared_task

@shared_task
def task_a():
    pass

@shared_task
def task_b():
    pass

@shared_task
def task_c():
    pass

@shared_task
def task_main():
    chain = task_a.s() | task_b.s() | task_c.s()
    chain()

Yes, if you use a chain, the tasks will run one after another.
Here's the correct documentation for that: http://docs.celeryproject.org/en/latest/userguide/canvas.html#chains
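Not part of the original answer, but a minimal sketch of another property of chains worth knowing: each task's return value is passed as the first argument of the next signature, so the order matters for data flow as well (the add/double tasks below are made up for illustration):
from celery import chain, shared_task

@shared_task
def add(x, y):
    # runs first; its return value (3) is fed to the next task
    return x + y

@shared_task
def double(n):
    # receives the previous task's result as its first argument
    return n * 2

# add(1, 2) runs before double(3); with a running worker and result
# backend, async_result.get() would eventually return 6
workflow = chain(add.s(1, 2), double.s())
async_result = workflow()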

Maybe a more concrete example helps, following a Python data science ETL pipeline: basically, we extract data from the DB, transform it into the expected form, then load it into the result backend:
from celery import chain

@app.task(base=TaskWithDBClient, ignore_result=True)
def extract_task(user_id):
    """Extract data from db w.r.t. user."""
    data = ...  # some db operations
    return data

@app.task()
def transform_task(data):
    """Transform input into expected form."""
    data = ...  # some code
    # the data will be stored in the result backend
    # because we didn't ignore the result.
    return data

@app.task(ignore_result=True)
def etl(user_id):
    """Extract, transform and load."""
    ch = chain(extract_task.s(user_id),
               transform_task.s())()
    return ch
Back in your main application, you only need to call:
etl.delay(user_id)
The tasks will be executed sequentially.
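Not in the original answer, but as a hedged follow-up: because transform_task does not ignore its result, its return value ends up in the result backend and can be read back later with AsyncResult, given its task id. The id below is a hypothetical placeholder, and app is assumed to be the Celery instance used in the snippet above:
from celery.result import AsyncResult

# Hypothetical: the id of the transform_task run, captured when the chain was built
transform_task_id = "replace-with-the-real-task-id"

res = AsyncResult(transform_task_id, app=app)
if res.successful():
    transformed = res.result  # the value returned by transform_task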

Related

Getting current execution date in a task or asset in dagster

Is there an easier way to get the current date in a Dagster asset than what I'm currently doing?
from datetime import datetime
from dagster import asset

def current_dt():
    return datetime.today().strftime('%Y-%m-%d')

@asset
def my_task(current_dt):
    return current_dt
In Airflow these are passed by default to the python callable, e.g. def my_task(ds, **kwargs):
In Dagster, the typical way to do things that require Airflow execution_dates is with partitions:
from dagster import (
    DailyPartitionsDefinition,
    Definitions,
    asset,
    build_schedule_from_partitioned_job,
    define_asset_job,
)

partitions_def = DailyPartitionsDefinition(start_date="2020-01-01")

@asset(partitions_def=partitions_def)
def my_asset(context):
    current_dt = context.asset_partitions_time_window_for_output().start

my_job = define_asset_job("my_job", selection=[my_asset], partitions_def=partitions_def)

defs = Definitions(
    assets=[my_asset],
    schedules=[build_schedule_from_partitioned_job(my_job)],
)
This will set up a schedule to fill each daily partition at the end of each day, and you can also kick off runs for particular partitions or kick off backfills that materialize sets of partitions.
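Not part of the original answer, but as a hedged addition: you can also materialize a single partition ad hoc (for example in a script or test) with Dagster's materialize helper, assuming my_asset and partitions_def from the snippet above:
from dagster import materialize

# Materialize just the 2020-01-02 daily partition of my_asset.
result = materialize([my_asset], partition_key="2020-01-02")
assert result.success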

Add webserver to existing python service

I have a script that continually runs, processing data that it gets from an external device. The core logic follows something like:
from external_module import process_data, get_data, load_interesting_things

class MyService:
    def __init__(self):
        self.interesting_items = load_interesting_things()
        self.run()

    def run(self):
        try:
            while True:
                data = get_data()
                for item in self.interesting_items:
                    item.add_datapoint(process_data(data, item))
        except KeyboardInterrupt:
            pass
I would like to add the ability to request information for the various interesting things via a RESTful API.
Is there a way in which I can add something like a Flask web service to the program such that the web service can get a stat from the interesting_items list to return? For example something along the lines of:
@app.route("/item/<int:idx>/average")
def average(idx: int):
    avg = interesting_items[idx].getAverage()
    return jsonify({"average": avg})
Assuming there is the necessary idx bounds checking and any appropriate locking implemented.
It does not have to be Flask, but it should be lightweight. I want to avoid using a database. I would prefer a web service, but if that is not possible without completely restructuring the code base, I can use a socket instead, though this is less preferable.
The server would be running on a local network only and usually only handling a single user, sometimes it may have a few.
I needed to move the run() method out of the __init__() method, so that I could have a global reference to the service, and start the run method in a separate thread. Something along the lines of:
import threading

import flask
from flask import jsonify

service = MyService()
service_thread = threading.Thread(target=service.run, daemon=True)
service_thread.start()

app = flask.Flask("appname")

...

@app.route("/item/<int:idx>/average")
def average(idx: int):
    avg = service.interesting_items[idx].getAverage()
    return jsonify({"average": avg})

...

app.run()
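A hedged, illustrative sketch of the locking the question mentions (not part of the original answer; interesting_items, getAverage and the service loop below are stand-ins): share one threading.Lock between the service thread and the Flask handlers so a read never sees a half-updated item.
import threading

from flask import Flask, jsonify

app = Flask("appname")
lock = threading.Lock()
interesting_items = []  # hypothetical shared state, filled by the service thread

def service_loop():
    # stand-in for MyService.run(): the writer takes the same lock
    while True:
        with lock:
            pass  # e.g. item.add_datapoint(process_data(data, item))

@app.route("/item/<int:idx>/average")
def average(idx: int):
    # readers take the same lock, so the item is never read mid-update
    with lock:
        if not 0 <= idx < len(interesting_items):
            return jsonify({"error": "unknown item"}), 404
        avg = interesting_items[idx].getAverage()
    return jsonify({"average": avg})

if __name__ == "__main__":
    threading.Thread(target=service_loop, daemon=True).start()
    app.run()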

Getting task result with python huey from redis data store

I'm working with the huey task queue (https://github.com/coleifer/huey) in Flask. I'm trying to run a task and get a task id back from my initial function:
@main.route('/renew', methods=['GET', 'POST'])
def renew():
    print(request.form)
    user = request.form.get('user')
    pw = request.form.get('pw')
    res = renewer(user, pw)
    res(blocking=True)  # block until the result is ready
    print(res)
    return res.id
After running this, I plug the returned id (which is the same as the result shown in the screenshot) into:
@main.route('/get_result_by_id', methods=['GET', 'POST'])
def get_result_by_id():
    print(request.form)
    id = request.form.get('id')
    from ..tasking.tasks import my_huey
    res = my_huey.result(id)
    if res is None:
        res = 'no value'
    return res
However I'm getting 'no value'
How can I access the value in the data store?
When you call res(blocking=True) in renew(), you fetch the result from the result store and effectively remove it. When you then try to fetch the result again using the id, it just returns nothing.
You have two options to solve this (see the sketch after this list):
Either use res(blocking=True, preserve=True) to preserve the result in the result store, so you can still fetch it with your second call.
Or use a storage that expires results, such as RedisExpireStorage. When configuring this storage while setting up the huey instance, you can specify how long results should be stored, which gives you a window of time to make the second call with the task/result id.
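A minimal, hedged sketch of both options; the task body and credentials are made up, and RedisExpireHuey / expire_time reflect my reading of huey's API, so double-check them against your huey version:
from huey import RedisExpireHuey

# Option 2: results stored in Redis expire automatically, here after 10 minutes.
my_huey = RedisExpireHuey('my-app', expire_time=600)

@my_huey.task()
def renewer(user, pw):
    return f"renewed {user}"

# Option 1: block for the result but keep it in the store, so a later
# my_huey.result(res.id) call can still find it.
res = renewer('alice', 'secret')
value = res(blocking=True, preserve=True)
later = my_huey.result(res.id)  # still available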

rq queue always empty

I'm using django-rq in my project.
What I want to achieve:
I have a first view that loads a template where an image is acquired from webcam and saved on my pc. Then, the view calls a second view, where an asynchronous task to process the image is enqueued using rq. Finally, after a 20-second delay, a third view is called. In this latter view I'd like to retrieve the result of the asynchronous task.
The problem: the job object is correctly created, but the queue is always empty, so I cannot use queue.fetch_job(job_id). Reading here I managed to find the job in the FinishedJobRegistry, but I cannot access it, since the registry is not iterable.
from django_rq import job
import django_rq
from rq import Queue
from redis import Redis
from rq.registry import FinishedJobRegistry

redis_conn = Redis()
q = Queue('default', connection=redis_conn)
last_job_id = ''

def wait(request):  # second view, starts the job
    template = loader.get_template('pepper/wait.html')
    job = q.enqueue(processImage)
    print(q.is_empty())  # this is always True!
    last_job_id = job.id  # this is the expected job id
    return HttpResponse(template.render({}, request))

def ceremony(request):  # third view, retrieves the result
    template = loader.get_template('pepper/ceremony.html')
    print(q.is_empty())  # True
    registry = FinishedJobRegistry('default', connection=redis_conn)
    finished_job_ids = registry.get_job_ids()  # here I have the correct id (last_job_id)
    return HttpResponse(template.render({}, request))
The question: how can I retrieve the result of the asynchronous job from the finished job registry? Or, better, how can I correctly enqueue the job?
I have found another way to do it: I simply use a global list of jobs that I modify in the views. Anyway, I'd like to know the right way to do this...
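Not part of the original post, but one hedged way to do it: rq can fetch a job directly by its id, regardless of which registry it currently sits in, so neither the queue nor the registry object is needed to read the result. The stored id below is a placeholder you would persist yourself, e.g. in the session or database:
from redis import Redis
from rq.job import Job

redis_conn = Redis()
last_job_id = "replace-with-the-id-returned-by-q.enqueue"

job = Job.fetch(last_job_id, connection=redis_conn)
if job.is_finished:
    print(job.result)  # return value of processImage()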

Understanding flow of execution of Python code

I'm trying to do a home assignment in Python from Data Manipulation at Scale: Systems and Algorithms on Coursera. Generally, I have problems understanding the base code, which was presented as an example of a MapReduce algorithm. I would be grateful for help understanding it in two places; details below.
I tried to go step by step through the code flow of the two files below after running the command:
python wordcount.py 'data/books.json'
File wordcount.py is opened
mr = MapReduce.MapReduce() - the mr object is created
The def __init__(self): part from MapReduce.py is executed
We come back to wordcount.py
The functions def mapper(record): and def reducer(key, list_of_values): are defined, but for the time being not executed
Python goes to if __name__ == '__main__':
inputdata = open(sys.argv[1]) - the json file is assigned to a variable
mr.execute(inputdata, mapper, reducer) - a call to the method from MapReduce.py.
And here is my first question: we haven't defined a mapper or reducer variable/object so far. Is it just null/no value that gets passed to this function, or did we somehow define these variables earlier and I missed it?
Later we move to def execute(self, data, mapper, reducer): in MapReduce.py,
and there we have mapper(record).
So this is a reference to a function in wordcount.py, am I right? But if we have a reference to a function in a different file, shouldn't we use import at the beginning of the file and state which file this function comes from?
(...) further code execution
wordcount.py file:
import MapReduce
import sys

"""
Word Count Example in the Simple Python MapReduce Framework
"""

mr = MapReduce.MapReduce()

# =============================
# Do not modify above this line

def mapper(record):
    # key: document identifier
    # value: document contents
    key = record[0]
    value = record[1]
    words = value.split()
    for w in words:
        mr.emit_intermediate(w, 1)

def reducer(key, list_of_values):
    # key: word
    # value: list of occurrence counts
    total = 0
    for v in list_of_values:
        total += v
    mr.emit((key, total))

# Do not modify below this line
# =============================

if __name__ == '__main__':
    inputdata = open(sys.argv[1])
    mr.execute(inputdata, mapper, reducer)
MapReduce.py file:
import json

class MapReduce:
    def __init__(self):
        self.intermediate = {}
        self.result = []

    def emit_intermediate(self, key, value):
        self.intermediate.setdefault(key, [])
        self.intermediate[key].append(value)

    def emit(self, value):
        self.result.append(value)

    def execute(self, data, mapper, reducer):
        for line in data:
            record = json.loads(line)
            mapper(record)
        for key in self.intermediate:
            reducer(key, self.intermediate[key])
        #jenc = json.JSONEncoder(encoding='latin-1')
        jenc = json.JSONEncoder()
        for item in self.result:
            print(jenc.encode(item))
Thank you in advance for help with that.
In Python everything is an object, and that includes functions, so you can pass functionA as an argument to another functionB (or to a class, or wherever). If functionB expects you to do that, it will assume you gave it a function with the right signature and proceed as normal.
In your case:
mr.execute(inputdata, mapper, reducer)
Here mapper and reducer are the functions previously defined in wordcount.py. They are passed as arguments to the execute method of the instance mr of the class MapReduce, and, as you can see, that method uses them as the functions it expects.
Thanks to this you can, as that code shows, write generic code that performs some computation and can be reused by many applications, by giving the user the option of supplying their own functions.
A much more generic example of this is the built-in function map: it receives a function that does something; map doesn't care what that function does or where it comes from, only that it accepts the right number of arguments and returns a value used to build the new sequence of results.
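A tiny sketch of the same idea outside the MapReduce framework (the function names here are made up for illustration):
def shout(text):
    return text.upper()

def apply_twice(func, value):
    # func is just another object; we call whatever was passed in
    return func(func(value))

print(apply_twice(shout, "hello"))          # HELLO
print(list(map(len, ["a", "bb", "ccc"])))   # [1, 2, 3]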
