I have multiple functions in a Python script. Each of them takes anywhere from 2 to 60 seconds to run, and I want every one of these functions to run every minute.
I use the schedule library in Python, but the problem is that sometimes it takes more than 60 seconds to run all the functions, so the schedule library misses some of them during the run.
Is there any way I can create a queue and make sure that all backlogged functions get executed?
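One common pattern with the schedule library, sketched below under the assumption that your functions look like slow_task (a placeholder): the scheduled callback only enqueues the job, and a worker thread drains the queue, so jobs that fall behind are backlogged instead of skipped.

import queue
import threading
import time

import schedule  # the library the question is using

job_queue = queue.Queue()

def slow_task():
    # placeholder for one of your 2-60 second functions
    time.sleep(30)

def worker():
    # drain the queue forever; backlogged jobs run as soon as possible
    while True:
        job = job_queue.get()
        job()
        job_queue.task_done()

# enqueueing is instant, so the scheduler itself never falls behind
schedule.every(1).minutes.do(job_queue.put, slow_task)

threading.Thread(target=worker, daemon=True).start()

while True:
    schedule.run_pending()
    time.sleep(1)

With a single worker, the backlog is processed in order; if the combined runtime regularly exceeds a minute, you can start several worker threads over the same queue.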
The title doesn't explain much, so let me elaborate.
I have a function that needs to run in a separate .py file.
That file takes some variables and waits for an event to happen, then continues until it finishes. You can think of it as an event listener (a web socket).
The file runs without producing output; it just does some work, and when the event finishes, it exits. Running one file on its own is no problem, but I want to run more than 10 of these at the same time, for different purposes but the same event, and this causes problems where some of them don't work or miss the event.
Currently I do this by running 10 terminals (cmd or shell), which I think causes problems because of the amount of event handling happening in separate shells. In the future I might use more than 10 files, maybe 50-100.
So what I tried:
I tried putting everything in one file using threading (multithreading), and it didn't work.
My goal
I want to be able to run as many of these files as needed without missing events and without slowing down the system. I am open to ideas.
concurrent.futures could be a good option; it executes a piece of code in another thread. See the documentation: https://docs.python.org/3/library/concurrent.futures.html.
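As a sketch of how that could look for the listener use case (listen_for_event and its config argument are placeholders for your actual code):

from concurrent.futures import ThreadPoolExecutor, as_completed

def listen_for_event(config):
    ...  # placeholder for your web-socket listener

configs = [{"purpose": i} for i in range(10)]  # one config per listener

with ThreadPoolExecutor(max_workers=len(configs)) as executor:
    futures = [executor.submit(listen_for_event, cfg) for cfg in configs]
    for future in as_completed(futures):
        future.result()  # re-raises any exception a listener hit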
The threading library that comes with Python allows you to start the same function multiple times in different threads without having to wait for them to finish. See the documentation: https://docs.python.org/3/library/threading.html.
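A minimal sketch of that, again with a placeholder listener function:

import threading

def listen_for_event(config):
    ...  # placeholder for your web-socket listener

threads = []
for i in range(10):
    # start the same function ten times without waiting for any of them
    t = threading.Thread(target=listen_for_event, args=({"purpose": i},))
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # optional: block here until every listener has finished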
A similar API in the multiprocessing library allows you to do the same thing with multiple processes. Documentation: https://docs.python.org/3/library/multiprocessing.html. One difference is that in Python, threads are virtual, all managed inside a single interpreter process. With multiprocessing you start several separate processes, which probably has less impact on performance.
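The multiprocessing equivalent of the sketch above (note the __main__ guard, which multiprocessing requires on some platforms):

import multiprocessing

def listen_for_event(config):
    ...  # placeholder for your web-socket listener

if __name__ == "__main__":
    processes = []
    for i in range(10):
        p = multiprocessing.Process(target=listen_for_event,
                                    args=({"purpose": i},))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()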
The code you run in a process or a thread has to live in a defined function. It seems that this code is in a separate .py file, i.e. a module, so you have to import it first (https://docs.python.org/3/tutorial/modules.html). Then one file manages the threads/processes in a loop, another contains the code listening for the event, and only one terminal is required to start them.
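For example, assuming your existing file is named listener.py and its logic is wrapped in a run(config) function (both names are assumptions):

# main.py - the only file you need to start from a terminal
from concurrent.futures import ThreadPoolExecutor

import listener  # your existing .py file, imported as a module

configs = [{"purpose": i} for i in range(50)]

with ThreadPoolExecutor(max_workers=len(configs)) as executor:
    for cfg in configs:
        executor.submit(listener.run, cfg)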
You can use multithreading.
On the page below you will find some very useful examples of what you want. (I recommend using concurrent.futures, because in newer versions of Python 3 you can run into some bugs using threading directly.)
https://realpython.com/intro-to-python-threading/#using-a-threadpoolexecutor
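Following the ThreadPoolExecutor pattern from that page, a minimal sketch (the listener function is a placeholder):

import concurrent.futures

def listen_for_event(index):
    ...  # placeholder for your web-socket listener

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(listen_for_event, range(10))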
I’m working on an ETL project using Azure Functions where I extract data from blob storage, transform the data in Python and pandas, and load the data using pandas to_sql(). I’m trying to make this process more efficient by using asyncio and language workers.
I’m a little confused because I was under the impression that asyncio works using one thread, but the Azure Functions documentation says you can use multiple language workers if you change your config and that even a method that doesn’t use the async keyword runs in a thread pool.
Does that mean that if I don’t use the async keyword my methods will run concurrently using language workers? Do I have to use asyncio to utilize language workers?
Also, the documentation says that Azure Functions can scale to up to 200 instances. How can I scale to that many instances if I’m only allowed a maximum of 10 language workers?
Edit: Thanks Anatoli. Just to clarify, if I have a Timer Trigger with the following code:
import azure.functions as func
from . import client_one_etl
from . import client_two_etl

def main(mytimer: func.TimerRequest) -> None:
    client_one_etl.main()
    client_two_etl.main()
If I have increased the number of language workers, does that mean both client_one_etl.main() and client_two_etl.main() are automatically run in separate threads even without using asyncio? And if client_two_etl.main() needs client_one_etl.main() to finish before executing, I will need to use async await to prevent them from running concurrently?
And for the separate instances: if client_one_etl.main() and client_two_etl.main() do not rely on each other, does that mean I can execute them in one Azure Function app as separate .py scripts that run in their own VMs? Is it possible to run multiple Timer Triggers (calling multiple __init__.py scripts, each in its own VM, for one Azure Function app)? Would all scripts then need to complete within 10 minutes if I increase functionTimeout in the host.json file?
FUNCTIONS_WORKER_PROCESS_COUNT limits the maximum number of worker processes per Functions host instance. If you set it to 10, each host instance will be able to run up to 10 Python functions concurrently. Each worker process will still execute Python code on a single thread, but now you have up to 10 of them running concurrently. You don't need to use asyncio for this to happen. (Having said that, there are legitimate reasons to use asyncio to improve scalability and resource utilization, but you don't have to do that to take advantage of multiple Python worker processes.)
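For illustration, this is roughly what the asyncio route could look like, assuming the ETL entry points were rewritten as coroutines (which they are not in the question's code):

import asyncio
import azure.functions as func

from . import client_one_etl
from . import client_two_etl

async def main(mytimer: func.TimerRequest) -> None:
    # overlaps the two jobs on one worker's event loop - only useful
    # if client_*_etl.main() are async and I/O-bound
    await asyncio.gather(client_one_etl.main(), client_two_etl.main())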
The 200 limit applies to the number of Functions host instances per Function app. You can think of these instances as separate VMs. The FUNCTIONS_WORKER_PROCESS_COUNT limit is applied to each of them individually, which brings the potential total number of concurrent worker processes to 2000.
UPDATE (answering the additional questions):
As soon as your function invocation starts on a certain worker, it will run on this worker until complete. Within this invocation, code execution will not be distributed to other worker processes or Functions host instances, and it will not be automatically parallelized for you in any other way. In your example, client_two_etl.main() will start after client_one_etl.main() exits, and it will start on the same worker process, so you will not observe any concurrency, regardless of the configured limits (unless you do something special in client_*_etl.main()).
When multiple invocations happen around the same time, these invocations may be automatically distributed to multiple workers, and this is where the limits mentioned above apply. Each invocation will still run on exactly one worker, from start to finish. In your example, if you manage to invoke this function twice around the same time, each invocation can get its own worker and they can run concurrently, but each will execute both client_one_etl.main() and client_two_etl.main() sequentially.
Please also note that because you are using a timer trigger on a single function, you will not experience any concurrency at all: by design, timer trigger will not start a new invocation until the previous invocation is complete. If you want concurrency, either use a different trigger type (for example, you can put a queue message on timer, and then the function triggered by the queue can scale out to multiple workers automatically), or use multiple timer triggers with multiple functions, like you suggested.
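As an illustration of the timer-plus-queue idea (the queue name and the msg output binding are assumptions; in the v1 Python model both the output binding and the queue trigger would also have to be declared in each function's function.json):

# TimerTrigger function (its own folder with __init__.py):
import azure.functions as func

def main(mytimer: func.TimerRequest, msg: func.Out[str]) -> None:
    # enqueue a message describing one ETL job; the queue-triggered
    # function below picks it up and can scale out across workers
    msg.set("client_one")

# QueueTrigger function (a separate folder with __init__.py):
def main(queuemsg: func.QueueMessage) -> None:
    client = queuemsg.get_body().decode("utf-8")
    ...  # run the ETL for this client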
If what you actually want is to run independent client_one_etl.main() and client_two_etl.main() concurrently, the most natural thing to do is to invoke them from different functions, each implemented in a separate __init__.py with its own trigger, within the same or different Function apps.
functionTimeout in host.json is applied per function invocation. So, if you have multiple functions in your app, each invocation should complete within the specified limit. This does not mean all of them together should complete within this limit (if I understood your question correctly).
UPDATE 2 (answering more questions):
@JohnT Please note that I'm not talking about the number of Function apps or __init__.py scripts. A function (described by __init__.py) is a program that defines what needs to be done. You can create way more than 10 functions per app, but don't do this to increase concurrency - it will not help. Instead, add functions to separate logically independent and coherent programs. A function invocation is a process that actively executes the program, and this is where the limits I'm talking about apply. You will need to be very clear on the difference between a function and a function invocation.
Now, in order to invoke a function, you need a worker process dedicated to this invocation until it is complete. Next, in order to run a worker process, you need a machine that will host this process. This is what a Functions host instance is (not a very accurate definition of a Functions host instance, but good enough for the purposes of this discussion). When running on the Consumption plan, your app can scale out to 200 Functions host instances, and each of them starts a single worker process by default (because FUNCTIONS_WORKER_PROCESS_COUNT = 1), so you can run up to 200 function invocations simultaneously. Increasing FUNCTIONS_WORKER_PROCESS_COUNT allows each Functions host instance to create more than one worker process, so up to FUNCTIONS_WORKER_PROCESS_COUNT function invocations can be handled by each Functions host instance, bringing the potential total to 2000.
Please note though that "can scale out" does not necessarily mean "will scale out". For more details, see Azure Functions scale and hosting and Azure Functions limits.
I have a Python script which basically calls a few functions but takes quite a while to complete. I want to be able to run this script many times in a loop.
I need the script to wait until its previous iteration is done.
So, for example:
for i in range(times_to_run):
    do_things(thing_variable)
How can I make the script wait until do_things() is done before calling it again and continuing through the loop? Thanks!
I've been working the Python apscheduler package (Advanced Python Scheduler) into my app. So far it's going well; I'm able to do almost everything I had envisioned doing with it.
Only one kink left to iron out...
The function my events are calling will only accept around 3 calls a second before failing, as it triggers very slow hardware I/O :(
I've tried limiting the maximum number of threads in the thread pool from 20 down to just 1 to try to slow down execution, but since I'm not really putting a big load on apscheduler, my events are still firing pretty much concurrently (well... very, very close together, at least).
Is there a way to 'stagger' different events that fire within the same second?
I recently found this question because, like you, I was trying to stagger scheduled jobs slightly to compensate for slow hardware.
Including an argument like this in the scheduler's add_job call staggers each job's start time by 200 ms (while incrementing idx for each job):
next_run_time=datetime.datetime.now() + datetime.timedelta(seconds=idx * 0.2)
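In context, a sketch (assuming a BackgroundScheduler and a placeholder job function; with an interval trigger, next_run_time shifts the first run, and later runs keep that offset):

import datetime

from apscheduler.schedulers.background import BackgroundScheduler

def poll_hardware():
    ...  # placeholder for the slow hardware I/O

scheduler = BackgroundScheduler()
for idx in range(5):
    scheduler.add_job(
        poll_hardware, 'interval', seconds=1,
        # stagger each job's start by 200 ms
        next_run_time=datetime.datetime.now()
        + datetime.timedelta(seconds=idx * 0.2),
    )
scheduler.start()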
What you want to use is the 'jitter' option.
From the docs:
The jitter option enables you to add a random component to the execution time. This might be useful if you have multiple servers and don’t want them to run a job at the exact same moment or if you want to prevent multiple jobs with similar options from always running concurrently.
Example:
# Run the `job_function` every hour with an extra-delay picked randomly
# in a [-120,+120] seconds window.
sched.add_job(job_function, 'interval', hours=1, jitter=120)
I don't know about apscheduler, but have you considered using a Redis LIST (as a queue) and simply serializing the event feed into that one critically bounded function, so that it fires no more than three times per second? For example, you could do a blocking POP with a one-second timeout, increment a trigger count for every event, sleep when it hits three, and zero the trigger count whenever the blocking POP times out. (Or you could just sleep 333 milliseconds after each event.)
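A minimal sketch of that idea with the redis-py client (the key name and handler are placeholders):

import time

import redis  # the redis-py client

def handle_event(payload):
    ...  # placeholder for the rate-limited hardware call

r = redis.Redis()
fired = 0

while True:
    item = r.blpop("event_queue", timeout=1)  # blocking pop, 1 s max wait
    if item is None:
        fired = 0  # queue idle for a second: reset the trigger count
        continue
    _key, payload = item
    handle_event(payload)
    fired += 1
    if fired >= 3:
        time.sleep(1)  # three events this window: back off for a second
        fired = 0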
My solution, for future reference:
I added a basic bool lock plus a wait to the function being called, which seems to do the trick nicely - since it's not the calling of the function itself that raises the error, but rather a deadlock situation in what the function carries out :D
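Roughly what that description amounts to, as a sketch (a threading.Lock would be the more robust version, since a plain bool check-then-set can race between threads):

import time

_busy = False  # basic bool "lock" guarding the slow hardware call

def fire_event():
    global _busy
    while _busy:  # wait until the previous call has finished
        time.sleep(0.05)
    _busy = True
    try:
        ...  # the slow hardware I/O goes here
    finally:
        _busy = False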
I'm working with Python and Jython to deploy applications to WebSphere. However, we are running into an issue with the WAS libraries where calls to these actions can take up to 30 minutes to execute. To troubleshoot this, I need to be able to keep tabs on whether a process is still executing after a given number of minutes and, if so, send an alert.
I'm assuming I will need to put the call in a separate thread and watch it, but I have no experience with multithreading. What is the best way to do so and are there any gotchas?
Python's threading module has a built-in Timer class which will handle the threading stuff for you:
import threading

def after_2_minutes():
    # runs on the Timer's thread once the two minutes are up
    if process_still_running():
        print("ALERT!!")

# start a timer in the background that waits 2 minutes
threading.Timer(2 * 60, after_2_minutes).start()

# start the process
foo.bar()
Gotchas: after_2_minutes will run in a separate thread, so as with any thread you have to make sure that its actions don't interfere with your other threads (although you can manipulate simple data structures without problems thanks to the global interpreter lock in CPython).
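One small refinement worth knowing about: threading.Timer has a standard cancel() method, so you can keep a reference to the timer and cancel it when the process finishes early, instead of letting the callback fire needlessly:

timer = threading.Timer(2 * 60, after_2_minutes)
timer.start()

foo.bar()       # the potentially slow call
timer.cancel()  # no alert needed if we got here within 2 minutes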