I'm trying to implement some code to import user's data from another service via the service's API. The way I'm going to set it up is all the request jobs will be kept in a queue which my simple importer program will draw from. Handling one task at a time won't come anywhere close to maxing out any of the computer's resources so I'm wondering what is the standard way to structure a program to run multiple "jobs" at once? Should I be looking into threading or possibly a program that pulls the jobs from the queue and launches instances of the importer program? Thanks for the help.
EDIT: What I have right now is in Python although I'm open to rewriting it in another language if need be.
Use a Producer-Consumer queue, with as many Consumer threads as you need to optimize resource usage on the host (sorry - that's very vague advice, but the "right number" is problem-dependent).
If requests are lightweight you may well only need one Producer thread to handle them.
Launching multiple processes could work too - best choice depends on your requirements. Do you need the Producer to know whether the operation worked, or is it 'fire-and-forget'? Do you need retry logic in the event of failure? How do you keep count of concurrent Consumers in this model? And so on.
For Python, take a look at this.
Related
I'm trying to build a system that would manage a small database and populate it with data from the web.
I would like to have this process run in the background, but still have some way of interacting with it.
How can I go about this in Python?
I would like to know how to do this in two cases:
from within the same python script something like daemon = task.start() followed by daemon.get_info() or daemon.do_something()
from the shell (via another program I could make) myclient get_info or myclient do_something
Could someone give me some key concepts to go look into?
edit: I just read this blogpost, is socket programming (as indicated in his last example) the best way to go about this?
So in the end I landed on some terminology that I was missing.
The core concept seems to be inter-process communication (ipc).
On unix variants the two easiest ways of implementing this are:
Named Pipes (one way communication)
Sockets (two way)
A python script that would make use of these could spawn another thread which would repeatedly read from the pipe and communicate messages back to the main thread through a queue.
I have seen a few variants of my question but not quite exactly what I am looking for, hence opening a new question.
I have a Flask/Gunicorn app that for each request inserts some data in a store and, consequently, kicks off an indexing job. The indexing is 2-4 times longer than the main data write and I would like to do that asynchronously to reduce the response latency.
The overall request lifespan is 100-150ms for a large request body.
I have thought about a few ways to do this, that is as resource-efficient as possible:
Use Celery. This seems the most obvious way to do it, but I don't want to introduce a large library and most of all, a dependency on Redis or other system packages.
Use subprocess.Popen. This may be a good route but my bottleneck is I/O, so threads could be more efficient.
Using threads? I am not sure how and if that can be done. All I know is how to launch multiple processes concurrently with ThreadPoolExecutor, but I only need to spawn one additional task, and return immediately without waiting for the results.
asyncio? This too I am not sure how to apply to my situation. asyncio has always a blocking call.
Launching data write and indexing concurrently: not doable. I have to wait for a response from the data write to launch indexing.
Any suggestions are welcome!
Thanks.
Celery will be your best bet - it's exactly what it's for.
If you have a need to introduce dependencies, it's not a bad thing to have dependencies. Just as long as you don't have unneeded dependencies.
Depending on your architecture, though, more advanced and locked-in solutions might be available. You could, if you're using AWS, launch an AWS Lambda function by firing off an AWS SNS notification, and have that handle what it needs to do. The sky is the limit.
I actually should have perused the Python manual section on concurrency better: the threading module does just what I needed: https://docs.python.org/3.5/library/threading.html
And I confirmed with some dummy sleep code that the sub-thread gets completed even after the Flask request is completed.
I am fairly new to Python and programming in general. I have written a script to go through a long list (~7000) of URLs and check their status to find any broken links. Predictably, this takes a few hours to request each URL one by one. I have heard that multiprocessing (or multithreading?) can be used to speed things up. What is the best approach to this? How many processes/threads should I run in one go? Do I have to create batches of URLs to check concurrently?
The answer to the question depends on whether the process spends most of its time processing data or waiting for the network. If it is the former, then you need to use multiprocessing, and spawn about as many processes as you have physical cores on the system. Do not forget to make sure that you choose the appropriate algorithm for the task. Finally, if all else fails, coding parts of the program in C can be a viable solution as well.
If your program is slow because it spends a lot of time waiting for individual server responses, you can parallelize network access using threads or an asynchronous IO framework. In this case you can use many more threads than you have physical processor cores because most of the time your cores will be sleeping waiting for something interesting to happen. You will need to measure the results on your machine to find out the best number of threads that works for you.
Whatever you do, please make sure that your program is not hammering the remote servers with a large number of concurrent or repeated requests.
Task is:
I have task queue stored in db. It grows. I need to solve tasks by python script when I have resources for it. I see two ways:
python script working all the time. But i don't like it (reason posible memory leak).
python script called by cron and do a little part of task. But i need to solve the problem of one working active script in memory (To prevent active scripts count grow). What is the best solution to implement it in python?
Any ideas to solve this problem at all?
You can use a lockfile to prevent multiple scripts from running out of cron. See the answers to an earlier question, "Python: module for creating PID-based lockfile". This is really just good practice in general for anything that you need to make sure won't have multiple instances running, actually, so you should look into it even if you do have the script running constantly, which I do suggest.
For most things, it shouldn't be too hard to avoid memory leaks, but if you're having a lot of trouble with it (I sometimes do with complex third-party web frameworks, for example), I would suggest instead writing the script with a small, carefully-designed main loop that monitors the database for new jobs, and then uses the multiprocessing module to fork off new processes to complete each task.
When a task is complete, the child process can exit, immediately freeing any memory that isn't properly garbage collected, and the main loop should be simple enough that you can avoid any memory leaks.
This also offers the advantage that you can run multiple tasks in parallel if your system has more than one CPU core, or if your tasks spend a lot of time waiting for I/O.
This is a bit of a vague question. One thing you should remember is that it is very difficult to leak memory in Python, because of the automatic garbage collection. croning a Python script to handle the queue isn't very nice, although it would work fine.
I would use method 1; if you need more power you could make a small Python process that monitors the DB queue and starts new processes to handle the tasks.
I'd suggest using Celery, an asynchronous task queuing system which I use myself.
It may seem a bit heavy for your use case, but it makes it easy to expand later by adding more worker resources if/when needed.
Feel free to close and/or redirect if this has been asked, but here's my situation:
I've got an application that will require doing a bunch of small units of work (polling a web service until something is done, then parsing about 1MB worth of XML and putting it in a database). I want to have a simple async queueing mechanism that'll poll for work to do in a queue, execute the units of work that need to be done, and have the flexibility to allow for spawning multiple worker processes so these units of work can be done in parallel. (Bonus if there's some kind of event framework that would also me to listen for when work is complete.)
I'm sure there is stuff to do this. Am I describing Twisted? I poked through the documentation, I'm just not sure exactly how my problems maps onto their framework, but I haven't spent much time with it. Should I just look at the multiprocess libraries in Python? Something else?
There's celery.
You could break it down into 2 different Tasks: one to poll the web service and queue the second task, which would be in charge of parsing the XML and persisting it.
This problem sounds like a pretty good candidate for Python's built-in (2.6+ anyway) multiprocessing module: http://docs.python.org/library/multiprocessing.html
A simple solution would be to create a process Pool and use your main program to poll for the XML chunks. Once it has them it can then pass them off to the pool for parsing/persisting.