Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
My application polls an API repeatedly and spawns processes to parse any new data resulting from these calls, conditionally making an API request based on those data. The speed of that turnaround time is critical.
A large bottleneck seems to be the setup of the spawned processes themselves: a few module imports and normal instantiation code, which take up to 0.05 seconds on a middling Amazon setup†. It seems it would be helpful to have a batch of processes with those imports/init code already done††, waiting to process results. What is the best approach to create and communicate with a pool (10-20?) of warm, reusable, and extremely lightweight processes in Python?
† - yes, I know throwing better hardware at the problem will help, and I'll do that too.
†† - yes, I know doing less will help, and I'm working on making the code as streamlined and minimal as possible
Well, you're in for a learning curve here, but multiprocessing.Pool() will create a pool of any number of processes you specify. Use the initializer= argument to specify a function each process will run at the start. Then there are several methods you can use to submit work items to the processes in the pool - read the docs, play with it, and ask questions if you get stuck.
One caution: "extremely lightweight processes" is impossible. By definition, processes are "heavy". "How heavy" is up to your operating system, and has approximately nothing to do with the programming language you use. If you're looking for lightweight, you're looking for threads.
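A minimal sketch of the multiprocessing.Pool() approach described above, assuming the expensive setup can be moved into an initializer. The names init_worker, handle and _state, and the JSON payloads, are hypothetical stand-ins for the asker's real imports and parsing code:

```python
import json
import multiprocessing as mp
import os

# Per-process state, filled in once by init_worker() when a worker starts.
_state = {}


def init_worker():
    """Runs once in each worker process: put slow imports and setup here."""
    # Hypothetical stand-in for the expensive initialisation in the question.
    _state["decoder"] = json.JSONDecoder()
    _state["pid"] = os.getpid()


def handle(payload):
    """Parse one API response using the already-initialised state."""
    data = _state["decoder"].decode(payload)
    return _state["pid"], data["id"]


if __name__ == "__main__":
    payloads = ['{"id": %d}' % i for i in range(8)]
    # The pool is created once and stays warm; each submit reuses a worker.
    with mp.Pool(processes=4, initializer=init_worker) as pool:
        for pid, item_id in pool.imap_unordered(handle, payloads):
            print(f"worker {pid} handled item {item_id}")
```

In a long-running poller the pool would presumably be created once at startup and kept open, with apply_async() or imap_unordered() called as new data arrives, rather than opened per batch as in this short demo.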
Closed 8 years ago.
I have a few files that are ~64GB in size that I think I would like to convert to hdf5 format. I was wondering what the best approach for doing so would be? Reading line-by-line seems to take more than 4 hours, so I was thinking of using multiprocessing in sequence, but was hoping for some direction on what would be the most efficient way without resorting to hadoop. Any help would be very much appreciated. (and thank you in advance)
For this type of problem I typically turn away from Python. You're right that multiprocessing/parallelization is a good solution, but Python is not pleasant to work with in this area. Consider trying something on the JVM. I like Clojure's core.async, but there are also the peach ("parallel each") and celluloid libraries for JRuby, which is much closer to Python.
The approach doesn't have to be as "heavy" as Hadoop, but I'd still use a similar map/reduce pattern over the files. Have a thread that is reading line by line from the source file(s) and dispatching to several threads. (Using core.async I'd have multiple queues which are getting consumed by different threads, then feeding back a "finished" signal into a watchdog thread.) In the end you should be able to squeeze a lot of performance out of your CPU.
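For readers who want to stay in Python, the reader-plus-workers shape described above can be sketched with the standard queue and threading modules. This is only an illustration of the pattern: parse_line is a hypothetical placeholder (the real parsing and HDF5 writing are not given in the question), and in standard CPython the GIL means threads will not speed up CPU-bound parsing, so the same layout would likely need multiprocessing to use several cores.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker there is no more input


def parse_line(line):
    # Hypothetical per-line work: count comma-separated fields.
    return len(line.split(","))


def worker(inbox, results):
    """Consume chunks of lines until the STOP sentinel arrives."""
    while True:
        chunk = inbox.get()
        if chunk is STOP:
            break
        results.put(sum(parse_line(line) for line in chunk))


def process_lines(lines, n_workers=4, chunk_size=1000):
    """Read lines in one thread, fan chunks out to workers, sum the results."""
    inbox = queue.Queue(maxsize=2 * n_workers)  # bounded: reader cannot run far ahead
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()

    chunk, n_chunks = [], 0
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            inbox.put(chunk)
            n_chunks += 1
            chunk = []
    if chunk:  # flush the final partial chunk
        inbox.put(chunk)
        n_chunks += 1

    for _ in threads:  # one sentinel per worker
        inbox.put(STOP)
    for t in threads:
        t.join()

    # "Reduce" step: combine the per-chunk results.
    return sum(results.get() for _ in range(n_chunks))
```

Passing an open file object as lines keeps memory bounded, since only a few chunks are queued at any time.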
Closed 8 years ago.
I need a key-value database, like redis or memcached, but not in memory and rather on disk. After filling the database (which we do regularly and from scratch), I'd actually only need the get operation, but from many different processes (so Kyoto Cabinet and LevelDB do not work for me).
I need about 5 million keys and ~10-30 GB of data, so some other simple databases don't work either.
I can't find any information on whether RocksDB can handle multiple read-only clients; it's not straightforward to build on my OS so I wanted to ask before doing that. If it can't, is there any database which would work? Preferably with an Ubuntu package and Python bindings ;-).
We're just using a great many small files now, but it really sucks, as we want easy backups, copying, etc. I also suspect this may cause slowdowns, but that doesn't really matter much.
Yes, you should be able to run multiple read-only clients on a single RocksDB database. Just open the database with a DB::OpenForReadOnly() call: https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L108
The simplest answer is probably Berkeley DB, and bindings are a part of the stdlib: https://docs.python.org/2/library/anydbm.html
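As a rough illustration of the stdlib route: in Python 3 the anydbm module linked above became dbm, and the backend it picks depends on the installation (it is not necessarily Berkeley DB). The sketch below fills a store from scratch and then opens it read-only; build_store, lookup and the sample keys are hypothetical names, and whether a given backend copes with 10-30 GB is something to measure rather than assume.

```python
import dbm
import os
import tempfile


def build_store(path, items):
    """Rebuild the database from scratch from (key, value) byte pairs."""
    with dbm.open(path, "n") as db:  # "n": always create a new, empty database
        for key, value in items:
            db[key] = value


def lookup(path, key):
    """Open the database read-only and fetch one key (None if absent)."""
    with dbm.open(path, "r") as db:  # "r": read-only access
        return db.get(key)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kv")
        build_store(path, [(b"user:1", b"alice"), (b"user:2", b"bob")])
        print(lookup(path, b"user:2"))
```

A real reader process would keep the handle open between lookups instead of reopening the file per key as lookup does here.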
Closed 9 years ago.
I've been reading around and see that it is a bad idea to have a remote application talk directly to my MongoDB, e.g. by installing a MongoDB driver in a phone app. The best way is to have a REST interface on a server sitting between the database and the end user. But what about the aggregation framework?
I see Sleepy.mongoose and Eve but I cannot see anything about aggregation.
Is there any way/or REST interface which allows you to make aggregation calls (I'm interested in subdocuments)?
E.g. requesting $ curl 'http://localhost:27080/customFunction/Restaurant' and getting back all the subdocuments where shop.kind matches Restaurant.
I'm familiar with Python and Java. Is there any API framework that allows you to do that?
Before you get flagged as off-topic, as you likely will for asking for opinions rather than a specific programming question, I'll just say one bit. Hopefully on-topic.
I highly doubt that most projects will go beyond being a basic CRUD adaptor allowing you access to collection objects and sometimes (badly) database objects. As with their various ORM-backed counterparts, they will no doubt allow a similar query syntax to be executed from the client, so queries could be composed and sent through as JSON, which will, not surprisingly, look much like (if not identical to) the standard query syntax for MongoDB.
For myself I prefer to roll my own, largely because you may want to implement a lot of custom behavior and actions, and in some way abstract a little from having a lot of CRUD code in the client. Let's face it, you're probably just passing JSON through into the native structures you're using anyway. So it's not hard really. Anyhow, each to his own I suppose.
A listing of other implementations is available here:
http://docs.mongodb.org/ecosystem/tools/http-interfaces/
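To give a feel for the "roll my own" route, here is a small sketch of the server-side piece that would turn the URL from the question into an aggregation pipeline. It assumes shop is an array of subdocuments with a kind field, which the question does not actually state; kind_from_path and subdocs_by_kind are hypothetical helper names, and only the pipeline stages ($match, $unwind, $replaceRoot) are MongoDB's own.

```python
def kind_from_path(path, prefix="/customFunction/"):
    """Extract the requested kind from a path such as /customFunction/Restaurant."""
    if not path.startswith(prefix) or len(path) == len(prefix):
        raise ValueError(f"unsupported path: {path!r}")
    return path[len(prefix):]


def subdocs_by_kind(kind, array_field="shop"):
    """Build a pipeline that returns only the subdocuments of the given kind."""
    field_path = f"{array_field}.kind"
    return [
        {"$match": {field_path: kind}},   # skip documents with no matching subdocument
        {"$unwind": f"${array_field}"},   # one output document per array element
        {"$match": {field_path: kind}},   # keep only the matching elements
        {"$replaceRoot": {"newRoot": f"${array_field}"}},  # return the subdocument itself
    ]
```

With pymongo, the resulting list would be passed to a collection's aggregate() method inside whatever HTTP handler the server uses, so the client never sends raw pipeline stages.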
Closed 9 years ago.
At work I'm not allowed to use Perl for web services. Python is allowed, however.
What I need to do is serve up the results of some very slow C++ binaries. Each exe takes up to 20 seconds to run. In Perl I'd just use Mojolicious's non-blocking event loop (an example of which is given here: http://blogs.perl.org/users/joel_berger/2014/01/writing-non-blocking-applications-with-mojolicious-part-3.html).
How would one go about doing this with Django and Python?
Tornado uses non-blocking I/O; the concepts are the same as in the Perl or Node.js event loop: multiple tasks per thread and so on.
Probably won't be possible with Django, as the entire framework would need to be built specifically for running inside an event loop. In an event-driven framework, slow operations (I/O for example) need to be implemented using callbacks, so that the actual I/O can be offloaded to the event loop itself, and the callback is only called when the operation has finished; Django is not implemented like this.
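The event-loop idea can be sketched with Python's standard asyncio module instead of Tornado or Twisted (a swapped-in technique, not what the answers above use). The slow C++ binaries are simulated by a hypothetical fake_slow_command that just sleeps and prints; run_binary and run_many are likewise illustrative names.

```python
import asyncio
import sys


def fake_slow_command(tag, seconds=0.2):
    """Hypothetical stand-in for a slow binary: sleep, then print a result."""
    code = f"import time; time.sleep({seconds}); print('result {tag}')"
    return [sys.executable, "-c", code]


async def run_binary(argv):
    """Start one process and wait for it without blocking the event loop."""
    proc = await asyncio.create_subprocess_exec(
        *argv, stdout=asyncio.subprocess.PIPE)
    out, _ = await proc.communicate()
    return out.decode().strip()


async def run_many(commands):
    """Run all commands concurrently; results come back in input order."""
    return await asyncio.gather(*(run_binary(argv) for argv in commands))


if __name__ == "__main__":
    commands = [fake_slow_command(i) for i in range(3)]
    for line in asyncio.run(run_many(commands)):
        print(line)
```

While each child process runs, the single thread stays free to start or collect others, which is the same property the Mojolicious example relies on.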
Take a look at Twisted — it is an event-driven networking engine for Python that also has some web application frameworks built on top of it.
Take a look at "A clean, lightweight alternative to Python's twisted". I'd choose gevent for a web app, as it runs with uWSGI, the most versatile web server to run Python code.
Closed 9 years ago.
OK, I have been reading about Celery and RabbitMQ. While I appreciate the effort put into the project and the documentation, I am still confused about a lot of things.
http://www.celeryproject.org/
http://ask.github.com/django-celery/
I am super confused about whether Celery is only for Django or is a standalone server, as the second link claims Celery is tightly coupled with Django. The two sites show different ways of setting up and using Celery, which to me is chaotic.
Enough ranting. Is there a proper book available that I can buy?
Well, not a book, but I recently set up Django+Celery on dotCloud, and here's the short doc:
http://web.archive.org/web/20150329132442/http://docs.dotcloud.com/tutorials/python/django-celery/
It's intended for simple tasks to be run asynchronously. There is a dotcloud-specific setup, but the rest might clear things up a bit. AFAIK, Celery started tightly coupled with Django but later became an entirely different animal, although it still retains superb compatibility with Django.
I don't know of a book; I guess a quick Amazon search would dig one up.
The bottom line is that Celery runs as a separate server and works just as well for a standalone Python program as for Django, so it is not tied directly to Django. You can also run the celeryd worker software on multiple computers so they can all process the same queue concurrently. Often a separate queueing server, such as RabbitMQ, is run to store the queued messages.
Keep in mind, django-celery is just an integration app that acts as glue between Django and Celery.
This was asked a long time back and the Celery docs have been significantly spruced up since. It'd be good to start with the FAQ to settle questions of this nature.
http://docs.celeryproject.org/en/latest/faq.html#is-celery-for-django-only