I’m working on an ETL project using Azure Functions where I extract data from blob storage, transform the data in Python and pandas, and load the data using pandas to_sql(). I’m trying to make this process more efficient by using asyncio and language workers.
I’m a little confused because I was under the impression that asyncio works using one thread, but the Azure Functions documentation says you can use multiple language workers if you change your config and that even a method that doesn’t use the async keyword runs in a thread pool.
Does that mean that if I don’t use the async keyword my methods will run concurrently using language workers? Do I have to use asyncio to utilize language workers?
Also, the documentation says that Azure Functions can scale to up to 200 instances. How can I scale to that many instances if I’m only allowed a maximum of 10 language workers?
Edit: Thanks Anatoli. Just to clarify, if I have a Timer Trigger with the following code:
import azure.functions as func
from . import client_one_etl
from . import client_two_etl
def main(mytimer: func.TimerRequest) -> None:
    client_one_etl.main()
    client_two_etl.main()
If I have increased the number of language workers, does that mean both client_one_etl.main() and client_two_etl.main() are automatically run in separate threads even without using asyncio? And if client_two_etl.main() needs client_one_etl.main() to finish before executing, I will need to use async await to prevent them from running concurrently?
And for the separate instances: if client_one_etl.main() and client_two_etl.main() do not rely on each other, does that mean I can execute them in one Azure Function app as separate .py scripts that run in their own VMs? Is it possible to run multiple Timer Triggers (calling multiple __init__.py scripts, each in its own VM, for one Function app)? Then each script will need to complete within 10 minutes if I increase functionTimeout in the host.json file?
FUNCTIONS_WORKER_PROCESS_COUNT limits the maximum number of worker processes per Functions host instance. If you set it to 10, each host instance will be able to run up to 10 Python functions concurrently. Each worker process will still execute Python code on a single thread, but now you have up to 10 of them running concurrently. You don't need to use asyncio for this to happen. (Having said that, there are legitimate reasons to use asyncio to improve scalability and resource utilization, but you don't have to do that to take advantage of multiple Python worker processes.)
The 200 limit applies to the number of Functions host instances per Function app. You can think of these instances as separate VMs. The FUNCTIONS_WORKER_PROCESS_COUNT limit is applied to each of them individually, which brings the total number of concurrent threads to 2000.
UPDATE (answering the additional questions):
As soon as your function invocation starts on a certain worker, it will run on this worker until complete. Within this invocation, code execution will not be distributed to other worker processes or Functions host instances, and it will not be automatically parallelized for you in any other way. In your example, client_two_etl.main() will start after client_one_etl.main() exits, and it will start on the same worker process, so you will not observe any concurrency, regardless of the configured limits (unless you do something special in client_*_etl.main()).
When multiple invocations happen around the same time, these invocations may be automatically distributed to multiple workers, and this is where the limits mentioned above apply. Each invocation will still run on exactly one worker, from start to finish. In your example, if you manage to invoke this function twice around the same time, each invocation can get its own worker and they can run concurrently, but each will execute both client_one_etl.main() and client_two_etl.main() sequentially.
Please also note that because you are using a timer trigger on a single function, you will not experience any concurrency at all: by design, timer trigger will not start a new invocation until the previous invocation is complete. If you want concurrency, either use a different trigger type (for example, you can put a queue message on timer, and then the function triggered by the queue can scale out to multiple workers automatically), or use multiple timer triggers with multiple functions, like you suggested.
If what you actually want is to run independent client_one_etl.main() and client_two_etl.main() concurrently, the most natural thing to do is to invoke them from different functions, each implemented in a separate __init__.py with its own trigger, within the same or different Function apps.
functionTimeout in host.json is applied per function invocation. So, if you have multiple functions in your app, each invocation should complete within the specified limit. This does not mean all of them together should complete within this limit (if I understood your question correctly).
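For reference, the per-invocation timeout is set in host.json like this (00:10:00 is just an example value; the allowed maximum depends on your hosting plan):

```json
{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}
```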
UPDATE 2 (answering more questions):
@JohnT Please note that I'm not talking about the number of Function apps or __init__.py scripts. A function (described by __init__.py) is a program that defines what needs to be done. You can create far more than 10 functions per app, but don't do this to increase concurrency - it will not help. Instead, use separate functions to separate logically independent and coherent programs. A function invocation is a process that actively executes the program, and this is where the limits I'm talking about apply. You will need to be very clear on the difference between a function and a function invocation.
Now, in order to invoke a function, you need a worker process dedicated to this invocation until this invocation is complete. Next, in order to run a worker process, you need a machine that will host this process. This is what the Functions host instance is (not a very accurate definition of Functions host instance, but good enough for the purposes of this discussion). When running on the Consumption plan, your app can scale out to 200 Functions host instances, and each of them will start a single worker process by default (because FUNCTIONS_WORKER_PROCESS_COUNT = 1), so you can run up to 200 function invocations simultaneously. Increasing FUNCTIONS_WORKER_PROCESS_COUNT will allow each Functions host instance to create more than one worker process, so up to FUNCTIONS_WORKER_PROCESS_COUNT function invocations can be handled by each Functions host instance, bringing the potential total to 2000.
Please note though that "can scale out" does not necessarily mean "will scale out". For more details, see Azure Functions scale and hosting and Azure Functions limits.
Related
I'm refactoring a .NET application to Airflow. This .NET application uses multiple threads to extract and process data from MongoDB (without multiple threads the process takes ~10 hrs; with multiple threads I can reduce this).
In each document in MongoDB I have a key named process. This value is used to control which thread processes the document. I'm going to develop an Airflow DAG to optimize this process. My doubt is about performance and the best way to do this.
Should my application have multiple tasks (I would control the process variable in the input of the Python method)? Or should I use only one task and use Python multithreading inside it? The image below illustrates my doubt.
[Image: Multi Task vs. Single Task (Multi-Threading)]
I know that using multiple tasks I'm going to do more DB reads (one per task). On the other hand, using Python multithreading I know I'll have to do a lot of control processing inside the task method. What is the best, fastest, and most optimized way to do this?
It really depends on the nature of your processing.
Multi-threading in Python can be limiting because of the GIL (Global Interpreter Lock) - some operations require an exclusive lock, and this limits the parallelism it can achieve. Especially if you mix CPU and I/O operations, the effect might be that a lot of time is spent by threads waiting for the lock. But it really depends on what you do - you need to experiment to see if the GIL affects your multithreading.
Multiprocessing (which is used by Airflow's Local Executor) is better because each process effectively runs a separate Python interpreter. So each process has its own GIL - at the expense of the resources used (each process uses its own memory, sockets, and so on). Each task in Airflow will run in a separate process.
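As a minimal illustration of that difference (plain stdlib, outside Airflow), multiprocessing gives each worker its own interpreter and therefore its own GIL:

```python
from multiprocessing import Pool

def cpu_bound(n):
    # A CPU-bound job: sum of squares below n.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # Each job runs in a separate process with its own GIL,
        # so CPU-bound work can use multiple cores.
        results = pool.map(cpu_bound, [10_000, 20_000, 30_000, 40_000])
    print(results)
```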
However, Airflow offers a bit more - it is also multi-machine. You can run separate workers with X processes on Y machines, effectively running up to X*Y processes at a time.
Unfortunately, Airflow is (currently) not well suited to running a dynamic number of parallel tasks of the same type. Specifically, if you would like to split the load into N pieces and run each piece in a separate task, this only really works if N is constant and does not change over time for the same DAG (for example, if you know you have 10 machines with 4 CPUs each, you'd typically want to run 10*4 = 40 tasks at a time, so you'd have to split your job into 40 tasks). And it cannot really change dynamically between runs - you'd have to write your DAG to run 40 parallel tasks every time it runs.
Not sure if I helped, but there is no single "best optimised" answer - you need to experiment and check what works best for your case.
I am really struggling to understand the interaction between asyncio event loop and multiple workers/threads/processes.
I am using Dash (which uses Flask internally) with Gunicorn.
Say I have two functions
import asyncio

async def async_download_multiple_files(files):
    # This function uses async just so that it can concurrently send
    # multiple requests to different webservers and return their data.
    ...

def sync_callback_dash(files):
    # This is a sync function that is called from a dash callback to get data
    asyncio.run(async_download_multiple_files(files))
As I understand it, asyncio.run runs the async function in an event loop, but blocks the calling thread until it finishes:
From Python Docs
While a Task is running in the event loop, no other Tasks can run in the same thread.
But what happens when I run a WSGI server like Gunicorn with multiple workers?
Say there are 2 requests coming in simultaneously, presumably there will be multiple calls to sync_callback_dash which will happen in parallel because of multiple Gunicorn workers.
Can both request 1 and request 2 try to execute asyncio.run in parallel in different threads/processes? Will one block the other?
If they can run in parallel, what is the use of having asyncio workers that Gunicorn offers?
I answered this question with the assumption that there is some lack of knowledge on some of the fundamental understandings of threads/processes/async loop. If there was not, forgive me for the amount of detail.
First thing to note is that processes and threads are two separate concepts. This answer might give you some context. To expand:
Processes are run directly by the CPU, and if the CPU has multiple cores, processes can be run in parallel. Threads run inside processes. There is always at least one thread per process, but there can be more. If there are more, the process switches between which thread it is executing every few milliseconds (dictated by things out of the scope of this question) - and therefore threads are not run in absolute parallel, but rather constantly switched in and out of the CPU (at least as it pertains to Python specifically, due to something called the GIL). The async loop runs inside a thread and switches context specifically around I/O-bound instructions (more on this below).
Regarding this question, it's worth noting that Gunicorn workers are processes, not threads (though you can increase the number of threads per worker).
The intention of asynchronous code (with the use of async def, await, and asyncio) is to speed up performance as it specifically relates to I/O-bound tasks: getting a file from disk, sending/receiving a network request, or anything that requires a physical piece of your computer other than the CPU - whether it is the SSD or the network card - to do some work. It can also be used for large CPU-bound instructions, but this is usually where threads come in. Note that I/O-bound instructions are much slower than CPU-bound instructions, as the electricity inside your computer literally has to travel greater distances and perform extra steps at the hardware level (to keep things simple).
These tasks waste the CPU time (or, more specifically, the current process's time) on simply waiting for a reply. Asynchronous code is run with the help of a loop that auto-manages the context switching of I/O bound instructions and normal CPU bound instructions (dependent on the use of await keywords) by leveraging the idea that a function can "yield" control back to the loop, and allow the loop to continue processing other pieces of code while it waits. When async code sends an I/O bound instruction (e.g. grab the latest packet from the network card), instead of sitting still and waiting for a reply it will switch the current process' context to the next task in its list to speed up general execution time (adding that previous I/O bound call to this list to check back in later). There is more to this, but this is the general gist as it relates to your question.
This is what it means when the docs says:
While a Task is running in the event loop, no other Tasks can run in the same thread.
The async loop is not running things in parallel, but rather constantly switching context between different instructions for a more optimized CPU + I/O relationship/execution.
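A small self-contained sketch of that switching, using asyncio.sleep as a stand-in for real I/O waits:

```python
import asyncio

async def fetch(name, delay):
    # await yields control back to the loop while "waiting on I/O",
    # letting the other coroutine make progress on the same thread.
    await asyncio.sleep(delay)
    return name

async def main():
    # Both coroutines wait concurrently on one thread: total time
    # is roughly max(0.2, 0.1) seconds, not the 0.3 second sum.
    return await asyncio.gather(fetch("slow", 0.2), fetch("fast", 0.1))

results = asyncio.run(main())
print(results)  # ['slow', 'fast'] - gather preserves submission order
```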
Processes, on the other hand, run in parallel on your CPU, assuming you have multiple cores. Gunicorn workers - as mentioned earlier - are processes. When you run multiple async workers with Gunicorn, you are effectively running multiple asyncio event loops in multiple (independent, parallel-running) processes. This should answer your question on:
Can both request 1 and request 2 try to execute asyncio.run in parallel in different threads/processes? Will one block the other?
If there is ever the case that one worker gets stuck on some extremely long I/O bound (or even non-async computation) instruction(s), other workers are there to take care of the next request(s).
With asyncio it is possible to run a separate event loop in each thread. Both will run in parallel (to the extent the Python Interpreter is capable). There are some restrictions. Communication between those loops must use threadsafe methods. Signals and subprocesses can be handled in the main thread only.
Calling asyncio.run in a callback will block until the asyncio part completely finishes. It is not clear from your question if this is what you want.
Alternatively, you could start a long running event loop in one thread and use asyncio.run_coroutine_threadsafe from other threads. Read the docs with an example here.
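A minimal stdlib-only sketch of that long-running-loop pattern (the work coroutine is just a placeholder):

```python
import asyncio
import threading

# Create a loop and run it forever in a dedicated background thread.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def work(x):
    await asyncio.sleep(0.01)
    return x * 2

# Submit a coroutine to the background loop from the main thread.
# This returns a concurrent.futures.Future; .result() blocks until done.
future = asyncio.run_coroutine_threadsafe(work(21), loop)
print(future.result())  # 42

# Stop the loop from outside it, also threadsafe.
loop.call_soon_threadsafe(loop.stop)
```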
I'm fairly new to Celery/AMQP and am trying to come up with a task/queue/worker design to meet the following requirements.
I have multiple types of "per-user" tasks: e.g., TaskA, TaskB, TaskC. Each of these "per-user" tasks read/write data for one particular user in the system. So at any given time, I might need to create tasks User1_TaskA, User1_TaskB, User1_TaskC, User2_TaskA, User2_TaskB, etc. I need to ensure that, for each user, no two tasks of any task type execute concurrently. I want a system in which no worker can execute User1_TaskA at the same time as any other worker is executing User1_TaskB or User1_TaskC, but while User1_TaskA is executing, other workers shouldn't be blocked from concurrently executing User2_TaskA, User3_TaskA, etc.
I realize this could be implemented using some sort of external locking mechanism (e.g., in the DB), but I'm hoping there's a more elegant task/queue/worker design that would work.
I suppose one possible solution is to implement queues as user buckets such that, when the workers are launched there's config that specifies how many buckets to create, and each "bucket worker" is bound to exactly one bucket. Then an "intermediate worker" would pull off tasks from the main task queue and assign them into the bucketed queues via, say, a hash/mod scheme. So UserA's tasks would always end up in the same queue, and multiple tasks for UserA would back up behind each other. I don't love this approach, as it would require the number of buckets to be defined ahead of time, and would seem to prevent (easily) adding workers dynamically. Seems to me there's got to be a better way -- suggestions would be greatly appreciated.
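The hash/mod routing described above can be sketched in a few lines (the queue-name scheme is hypothetical, and the bucket count is fixed when workers launch, which is exactly the drawback noted):

```python
import hashlib

N_BUCKETS = 4  # must be fixed ahead of time - the drawback noted above

def bucket_queue(user_id: str) -> str:
    # Use a deterministic digest: Python's built-in hash() is salted
    # per process, so it cannot route consistently across workers.
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return f"bucket-{digest % N_BUCKETS}"

# The same user always lands in the same bucket queue, so that
# user's tasks back up behind each other as described.
print(bucket_queue("User1"), bucket_queue("User2"))
```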
What's so bad in using an external locking mechanism? It's simple, straightforward, and efficient enough. You can find an example of distributed task locking in Celery here. Extend it by creating a lock per user, and you're done!
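The shape of a per-user lock looks like this; in real Celery you would back it with Redis or the cache, as in the linked example, while threading.Lock here is only a single-process stand-in:

```python
import threading
from collections import defaultdict

# One lock per user. In a distributed setup this would be a
# cache/Redis key with a timeout, not an in-process lock.
_user_locks = defaultdict(threading.Lock)

def run_user_task(user_id, task):
    # Tasks for the same user serialize on that user's lock;
    # tasks for different users proceed concurrently.
    with _user_locks[user_id]:
        return task()

print(run_user_task("User1", lambda: "TaskA done"))
```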
I wrote an API which does some database operations with values requested by the API caller. How does this whole API system work when more than one person calls a function at the same time?
Do different instances of my API code start when a number of API calls are made?
If I need to handle something like 2,500 parallel API calls, what exact precautions (like paying attention to database load) do I need to take?
Do you plan to call your python API from some other python code? If so then how is the parallelism achieved? Do you plan to spawn many threads, use your api in every thread?
Anyway, it's worthwhile to take a look at the multiprocessing module, which allows you to create separate Python processes. There are also threading modules that allow you to parallelize code execution within the same process. But keep in mind that the latter case is subject to the Global Interpreter Lock - google for more info.
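For I/O-bound work such as waiting on a database or the network, threads are usually enough despite the GIL, because the lock is released during blocking I/O. A sketch with the stdlib, using time.sleep as a stand-in for a blocking call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_api(n):
    # Stand-in for a blocking DB/API call; the GIL is released
    # while the thread sleeps, just as during real network I/O.
    time.sleep(0.05)
    return n * 10

with ThreadPoolExecutor(max_workers=8) as pool:
    # All 8 calls wait concurrently, so this takes ~0.05s, not ~0.4s.
    results = list(pool.map(call_api, range(8)))
print(results)
```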
I'm a bit puzzled about how to write asynchronous code in python/twisted. Suppose (for arguments sake) I am exposing a function to the world that will take a number and return True/False if it is prime/non-prime, so it looks vaguely like this:
def IsPrime(numberin):
    for n in range(2, numberin):
        if numberin % n == 0:
            return False
    return True
(just to illustrate).
Now let's say there is a webserver which needs to call IsPrime based on a submitted value. This will take a long time for large numberin.
If in the meantime another user asks for the primality of a small number, is there a way to run the two function calls asynchronously using the reactor/deferreds architecture so that the result of the short calc gets returned before the result of the long calc?
I understand how to do this if the IsPrime functionality came from some other webserver to which my webserver would do a deferred getPage, but what if it's just a local function?
i.e., can Twisted somehow time-share between the two calls to IsPrime, or would that require an explicit invocation of a new thread?
Or, would the IsPrime loop need to be chunked into a series of smaller loops so that control can be passed back to the reactor rapidly?
Or something else?
I think your current understanding is basically correct. Twisted is just a Python library and the Python code you write to use it executes normally as you would expect Python code to: if you have only a single thread (and a single process), then only one thing happens at a time. Almost no APIs provided by Twisted create new threads or processes, so in the normal course of things your code runs sequentially; IsPrime cannot execute a second time until after it has finished executing the first time.
Still considering just a single thread (and a single process), all of the "concurrency" or "parallelism" of Twisted comes from the fact that instead of doing blocking network I/O (and certain other blocking operations), Twisted provides tools for performing the operation in a non-blocking way. This lets your program continue on to perform other work when it might otherwise have been stuck doing nothing waiting for a blocking I/O operation (such as reading from or writing to a socket) to complete.
It is possible to make things "asynchronous" by splitting them into small chunks and letting event handlers run in between these chunks. This is sometimes a useful approach, if the transformation doesn't make the code too much more difficult to understand and maintain. Twisted provides a helper for scheduling these chunks of work, cooperate. It is beneficial to use this helper since it can make scheduling decisions based on all of the different sources of work and ensure that there is time left over to service event sources without significant additional latency (in other words, the more jobs you add to it, the less time each job will get, so that the reactor can keep doing its job).
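The chunking idea can be sketched without Twisted: write the long computation as a generator that yields after every chunk, so a scheduler (cooperate in Twisted, or the toy round-robin below) can interleave jobs. Everything here is an illustrative stand-in, not Twisted's API:

```python
def is_prime_chunked(n, chunk=1000):
    # Trial division, yielding control after every `chunk` divisors.
    for start in range(2, n, chunk):
        for d in range(start, min(start + chunk, n)):
            if n % d == 0:
                yield False  # found a divisor: final answer
                return
        yield None  # no answer yet; let other jobs run
    yield True  # no divisor found: prime

def round_robin(*jobs):
    # Toy scheduler: step each generator in turn until all finish.
    results = [None] * len(jobs)
    pending = dict(enumerate(jobs))
    while pending:
        for i, job in list(pending.items()):
            value = next(job)
            if value is not None:
                results[i] = value
                del pending[i]
    return results

# The short job (15) finishes long before the long one (10007),
# even though both run on a single thread.
print(round_robin(is_prime_chunked(10007), is_prime_chunked(15)))
```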
Twisted does also provide several APIs for dealing with threads and processes. These can be useful if it is not obvious how to break a job into chunks. You can use deferToThread to run a (thread-safe!) function in a thread pool. Conveniently, this API returns a Deferred which will eventually fire with the return value of the function (or with a Failure if the function raises an exception). These Deferreds look like any other, and as far as the code using them is concerned, it could just as well come back from a call like getPage - a function that uses no extra threads, just non-blocking I/O and event handlers.
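Outside Twisted, the closest stdlib analogue of deferToThread is submitting work to a thread pool and getting a Future back; this is not Twisted's Deferred, just the same shape of API:

```python
from concurrent.futures import ThreadPoolExecutor

def is_prime(n):
    # Plain trial-division check, same idea as IsPrime above.
    return n > 1 and all(n % d for d in range(2, n))

pool = ThreadPoolExecutor(max_workers=4)
# Like deferToThread: submit returns immediately with a Future
# that eventually holds the function's result (or its exception).
future = pool.submit(is_prime, 10007)
print(future.result())  # True
pool.shutdown()
```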
Since Python isn't ideally suited for running multiple CPU-bound threads in a single process, Twisted also provides a non-blocking API for launching and communicating with child processes. You can offload calculations to such processes to take advantage of additional CPUs or cores without worrying about the GIL slowing you down, something that neither the chunking strategy nor the threading approach offers. The lowest level API for dealing with such processes is reactor.spawnProcess. There is also Ampoule, a package which will manage a process pool for you and provides an analog to deferToThread for processes, deferToAMPProcess.