My Python AppEngine app interacts with slow external systems (think receiving data from narrow-band connections). Half-hour-long interactions are a norm. I need to run 10-15 of such interactions in parallel.
My options are background tasks and "background threads" (not plain Python threads). Theoretically they look about the same. I'd stick with tasks since background threads don't run on the local development server.
Are there any significant advantages of one approach over the other?
It depends on how long the "interaction" takes. App Engine has a limit of 60 seconds per HTTP request.
If your external systems send data periodically, I would advise grabbing the data in small chunks to respect the 60-second limit. Aggregate those chunks into blobs and then process the data periodically using tasks.
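A minimal sketch of that chunk-and-chain pattern, assuming the legacy Python 2.7 runtime with webapp2 and the taskqueue API (fetch_next_chunk and store_chunk are hypothetical helpers):

    # Fetch one small chunk per task, store it, and re-enqueue a task for the
    # next chunk so no single request approaches the deadline.
    from google.appengine.api import taskqueue
    import webapp2

    class FetchChunkHandler(webapp2.RequestHandler):
        def post(self):
            offset = int(self.request.get('offset', 0))
            chunk = fetch_next_chunk(offset)   # hypothetical: reads from the slow external system
            if chunk:
                store_chunk(offset, chunk)     # hypothetical: append to a blob / datastore entity
                # Chain the next chunk as a fresh task; each task stays short.
                taskqueue.add(url='/tasks/fetch-chunk',
                              params={'offset': offset + len(chunk)},
                              queue_name='default')

    app = webapp2.WSGIApplication([('/tasks/fetch-chunk', FetchChunkHandler)])

A separate task (or cron job) can then process the aggregated blobs once enough chunks have accumulated.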
Related
I implemented a script in which every day I process several urls and make many I/O operations, and I am subclassing threading.Thread and starting a number of threads (say 32).
The workload varies day by day but as soon as the processing starts I am sure that no more tasks will be added to the input queue.
Also, my script is not supporting any front-end (at least for now).
I feel though that this solution will not be so easily scalable in the case of multiple processes / machines and would like to give Celery (or any distributed task queue) a shot, but I always read that it’s better suited for long-running tasks running in the background to avoid blocking a UI.
On the other hand, I have also read that having many small tasks is not a problem with Celery.
What's your thought on this? Would it be easy to scale Celery workers across processes / machines?
I have a few Celery workers that perform tasks that are not always that fast. The tasks are usually a bunch of HTTP requests and DB queries (using psycopg2 behind SQLAlchemy). I'm running in Kubernetes and the CPU usage is always fairly low (0.01 or so). Celery automatically sets the concurrency to 2 (the number of cores of a single node), but I was wondering whether it would make sense to manually increase this number.
I always read that the concurrency (processes?) should be the same as the number of cores, but if the worker does not use a whole core, couldn't it be more? Like concurrency=10 ? Or that would make no difference and I'm just missing the point of processes and concurrency?
I couldn't find information on that. Thanks.
Everything is true. Celery automatically sets the concurrency to the number of cores, as it assumes that you will use the entire core (CPU-intensive tasks).
It sounds like you can increase the concurrency, since your tasks are mostly I/O bound (and the CPU is idle).
To be on the safe side, I would do it gradually: increase to 5 first, monitor, make sure CPU usage is fine, and then go to 10.
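A minimal sketch of how to raise it, assuming a Celery 4+ app module named proj and a Redis broker (both assumptions, not from the question):

    from celery import Celery

    app = Celery('proj', broker='redis://localhost:6379/0')

    # For I/O-bound tasks the prefork pool can run more processes than cores;
    # raise this gradually while watching CPU and memory on the node.
    app.conf.worker_concurrency = 10

    # Equivalent on the command line:
    #   celery -A proj worker --concurrency=10
    # For very I/O-heavy workloads a green-thread pool is another option:
    #   celery -A proj worker --pool=gevent --concurrency=100

In Kubernetes, keep an eye on memory as well: each prefork process has its own footprint, and that is often what limits how many fit in one pod.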
I'm using Python + Locust for performance testing. I mostly use Java, and in Java one CPU thread = one Java thread. So if I have a VM with 12 threads, I can perform only 12 actions in parallel.
But Locust has a USERS parameter, which stands for "Peak number of concurrent Locust users". Does it work the same way? If I set USERS = 25 but the VM has only 12 threads, does that mean only 12 actions will execute in parallel and the rest will wait until a thread finishes?
Locust uses gevent, which makes I/O asynchronous. A single Locust/Python process can only use one CPU thread (a slight oversimplification), but it can make concurrent HTTP requests: when a request is made by one user, control is immediately handed over to other running users, which can in turn trigger other requests.
This is fundamentally different from Java (which is threaded but often synchronous), but similar to JavaScript.
As long as you run enough Locust worker processes, this is a very efficient approach, and a single process can handle thousands of concurrent users (in fact, the number of users is almost never a limitation - the number of requests per second is the limiting factor)
See Locust's documentation (https://docs.locust.io/en/stable/running-locust-distributed.html)
Because Python cannot fully utilize more than one core per process (see GIL), you should typically run one worker instance per processor core on the worker machines in order to utilize all their computing power.
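A minimal locustfile sketch showing why users are not tied to CPU threads (the path, host, and numbers are illustrative):

    from locust import HttpUser, task, between

    class ApiUser(HttpUser):
        wait_time = between(1, 3)   # each simulated user pauses 1-3 s between tasks

        @task
        def get_index(self):
            # While this request waits on the network, gevent switches to other
            # users, so 25 users can be in flight even on a single CPU thread.
            self.client.get("/")

    # Run with e.g.:
    #   locust -f locustfile.py --users 25 --spawn-rate 5 --host http://target.example.com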
I have a python (2.6.5 64-bit, Windows 2008 Server R2) app that launches worker processes. The parent process puts jobs in a job queue, from which workers pick them up. Similarly it has a results queue. Each worker performs its job by querying a server. CPU usage by the workers is low.
When the number of workers grows, CPU usage on the servers actually shrinks. The servers themselves are not the bottleneck, as I can load them up further from other applications.
Anyone else seen similar behavior? Is there an issue with python multiprocessing queues when a large number of processes are reading or writing to the same queues?
Two different ideas for performance constraints:
The bottleneck is the workers fighting each other and the parent for access to the job queue.
The bottleneck is connection rate-limits (syn-flood protection) on the servers.
Gathering more information:
Profile the amount of work done: tasks completed per second, use this as your core performance metric.
Use packet capture to view the network activity for network-level delays.
Have your workers document how long they wait for access to the job queue.
Possible improvements:
Have your workers use persistent connections if available/applicable (e.g. HTTP).
Split the tasks into multiple job queues fed to pools of workers.
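As a rough sketch of the last suggestion, here is one way to shard jobs across several queues, each served by its own small pool of workers, so fewer processes contend for any single queue (run_job and the shard counts are placeholders):

    import multiprocessing as mp

    def worker(job_q, result_q):
        for job in iter(job_q.get, None):      # None is the shutdown sentinel
            result_q.put(run_job(job))         # hypothetical: query the server

    def run_all(jobs, n_shards=4, workers_per_shard=4):
        result_q = mp.Queue()
        job_qs = [mp.Queue() for _ in range(n_shards)]
        procs = [mp.Process(target=worker, args=(q, result_q))
                 for q in job_qs for _ in range(workers_per_shard)]
        for p in procs:
            p.start()
        for i, job in enumerate(jobs):         # round-robin jobs across the shards
            job_qs[i % n_shards].put(job)
        for q in job_qs:                       # one sentinel per worker on each queue
            for _ in range(workers_per_shard):
                q.put(None)
        results = [result_q.get() for _ in range(len(jobs))]
        for p in procs:
            p.join()
        return results

On Windows, keep the call to run_all() under an if __name__ == '__main__': guard so the workers can be spawned correctly.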
It's hard to say exactly what is going on without all the details.
However, remember that the real concurrency is bounded by the actual number of hardware threads. If the number of processes launched is much larger than the actual number of hardware threads, at some point the context-switching overhead will be more than the benefit of having more concurrent processes.
Creating a new thread is a very expensive operation.
One of the simplest ways of controlling a lot of parallel network connections is to use stackless threads with support for asynchronous sockets. Python has great support and a bunch of libraries for that.
My favorite one is gevent, which has a great and completely transparent monkey-patching utility.
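A minimal gevent sketch of that approach, assuming Python 2-era urllib2 and placeholder URLs; the monkey-patching must happen before the network modules are imported:

    from gevent import monkey
    monkey.patch_all()      # makes sockets (and urllib2 on top of them) cooperative

    import gevent
    import urllib2          # on Python 3, urllib.request instead

    def fetch(url):
        return urllib2.urlopen(url, timeout=60).read()

    urls = ['http://example.com/%d' % i for i in range(100)]
    jobs = [gevent.spawn(fetch, url) for url in urls]
    gevent.joinall(jobs, timeout=120)
    results = [job.value for job in jobs]

Each greenlet yields while it waits on the network, so one process can keep hundreds of slow connections open concurrently.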
I would like to synchronize two events between two (or more) wired, networked Linux machines. Can I use NTP to do this?
NTP seems to be mostly focused on synchronizing to a time server, where I need two machines to be synchronized to each other. There is a subtle difference there. For example, if one machine is located half as many hops away as a second machine from the time server, I might be able to get better synchronization if I try to synchronize the two machines to each other directly instead of synchronizing both to a time server.
A slightly different question: if I were to use NTP, what would be the best way to schedule events? A cron job or an at script? Could I get better (sub-second) synchronization if I were to use a library like this one?
Finally, does anyone know of any time synchronization software packages that are suited to synchronizing two (or more) machines to each other, not necessarily to a time server?
Thanks for any help.
You might try delegating one machine as the master, and the remaining machines as slaves. When the synchronized events should occur, the master triggers the slaves to commence.
The synchronization would be limited only by the latency (ping) between the machines, and you wouldn't need to worry about system clocks consistency.
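A rough sketch of that trigger idea using plain TCP sockets (host names, port, and run_synchronized_event are placeholders): the slaves block on a read until the master sends a "go" byte, so the skew is roughly one-way network latency.

    # slave.py
    import socket

    def wait_for_trigger(master_host='master.example.com', port=9000):
        s = socket.create_connection((master_host, port))
        s.recv(1)                 # blocks until the master sends the trigger byte
        run_synchronized_event()  # hypothetical: the action to start

    # master.py
    import socket

    def trigger_slaves(n_slaves, port=9000):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind(('', port))
        srv.listen(n_slaves)
        conns = [srv.accept()[0] for _ in range(n_slaves)]  # wait for every slave to connect
        for conn in conns:
            conn.sendall(b'g')    # fire: each slave unblocks within about one network hop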
What is the ping latency variation between your hosts? What kind of start-time discrepancy between the coordinated processes is OK for you? How are you going to start the processes? Cron is very imprecise, and the startup time of the processes needs to be accounted for, too.
If ping times to different hosts vary significantly, I'd do something like this.
Use a reliable public NTP server to synchronize the clocks on all coordinated hosts a few minutes before the event. With frequent events, though, 3-4 synchronizations a day should be plenty.
Using a low-precision scheduler like cron, 2-3 minutes ahead of time, start a simple wrapper shell script that sleeps until e.g. 15 seconds before the event. The wrapper script then starts the target app with higher-than-normal priority.
The app loads (disk access and dynamic linking are slow), reads whatever data files it needs, does all time-consuming calculations, etc. Then it waits until the start moment with sub-second precision using usleep() and ftime() or gettimeofday(), placed right before fireLasersAtTheMoon() or whatever your target action happens to be.
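A rough Python equivalent of that final wait, assuming the hosts' clocks are NTP-disciplined and start_time is an agreed-upon epoch timestamp:

    import time

    def wait_until(start_time):
        # Sleep in shrinking steps, then busy-spin for the last few milliseconds
        # so we fire as close to start_time as the local clock allows.
        while True:
            remaining = start_time - time.time()
            if remaining <= 0:
                return
            if remaining > 0.005:
                time.sleep(remaining / 2)   # coarse sleeps, never overshooting

    # Usage (illustrative):
    #   wait_until(target_epoch)
    #   fire_lasers_at_the_moon()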
Obviously, it makes little sense to synchronize actions so precisely when they are naturally imprecise, like network communication. If your network has predictable latency, you can just measure it using ping round-trip times and have a master process on one host send a start command via ssh to the other host(s), taking the latency into account.