I have some code which depends on Greenlets, and need to remove this dependency.
Can anyone explain to me exactly what I'll need to do?
They would preferably be replaced with threads or (better yet) processes from the multiprocessing module, but anything that relies solely on the Python standard library would be sufficient for my needs.
Functionality can be sacrificed, as I don't need asynchronous code, nor does the code that I am converting (for my uses, not the original implementation).
UPDATE:
Specifically, I need to know of alternatives to Greenlet.spawn()
It really depends on the structure of your code and the high-level architecture of your system. If whatever you are using greenlets for can be done with the multiprocessing module from the Python standard library, then you can do that. If you post specific instances, you can get specific advice on translating them to multiprocessing. But beware that these are two different ways of solving the general problem of concurrency.
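For the Greenlet.spawn() question in particular, the mechanical translation usually looks something like the sketch below (the worker function and its argument are made-up placeholders, and this ignores greenlet-specific behaviour such as cooperative switching, which threads and processes do not reproduce):

```python
import threading
import multiprocessing

def worker(url):
    # placeholder for whatever the greenlet used to do
    print("processing", url)

if __name__ == "__main__":
    # greenlet/gevent style would be:  task = Greenlet.spawn(worker, "http://example.com")

    # thread-based equivalent:
    t = threading.Thread(target=worker, args=("http://example.com",))
    t.start()
    t.join()   # Greenlet.join() maps naturally onto Thread.join()

    # process-based equivalent:
    p = multiprocessing.Process(target=worker, args=("http://example.com",))
    p.start()
    p.join()
```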
Related
Today I found a library named trio, which describes itself as an asynchronous API for humans. The wording is quite similar to the tagline of requests. Since requests is a really good library, I am wondering what the advantages of trio are.
There aren't many articles about it; I only found an article discussing curio and asyncio. To my surprise, trio describes itself as even better than curio (a next-generation curio).
After reading half of the article, I still cannot find the core difference between these two asynchronous frameworks. It just gives some examples where curio's implementation is more convenient than asyncio's, but the underlying structure is almost the same.
So could someone give me a reason to accept that trio or curio is better than asyncio? Or explain more about why I should choose trio instead of the built-in asyncio?
Where I'm coming from: I'm the primary author of trio. I'm also one of the top contributors to curio (and wrote the article about it that you link to), and a Python core dev who's been heavily involved in discussions about how to improve asyncio.
In trio (and curio), one of the core design principles is that you never program with callbacks; it feels more like thread-based programming than callback-based programming. I guess if you open up the hood and look at how they're implemented internally, then there are places where they use callbacks, or things that are sorta equivalent to callbacks if you squint. But that's like saying that Python and C are equivalent because the Python interpreter is implemented in C. You never use callbacks.
Anyway:
Trio vs asyncio
Asyncio is more mature
The first big difference is ecosystem maturity. At the time I'm writing this in March 2018, there are many more libraries with asyncio support than trio support. For example, right now there aren't any real HTTP servers with trio support. The Framework :: AsyncIO classifier on PyPI currently has 122 libraries in it, while the Framework :: Trio classifier only has 8. I'm hoping that this part of the answer will become out of date quickly – for example, here's Kenneth Reitz experimenting with adding trio support in the next version of requests – but right now, you should expect that if you're using trio for anything complicated, then you'll run into missing pieces that you need to fill in yourself instead of grabbing a library from PyPI, or that you'll need to use the trio-asyncio package that lets you use asyncio libraries in trio programs. (The trio chat channel is useful for finding out about what's available, and what other people are working on.)
Trio makes your code simpler
In terms of the actual libraries, they're also very different. The main argument for trio is that it makes writing concurrent code much, much simpler than using asyncio. Of course, when was the last time you heard someone say that their library makes things harder to use... let me give a concrete example. In this talk (slides), I use the example of implementing RFC 8305 "Happy eyeballs", which is a simple concurrent algorithm used to efficiently establish a network connection. This is something that Glyph has been thinking about for years, and his latest version for Twisted is ~600 lines long. (Asyncio would be about the same; Twisted and asyncio are very similar architecturally.) In the talk, I teach you everything you need to know to implement it in <40 lines using trio (and we fix a bug in his version while we're at it). So in this example, using trio literally makes our code an order of magnitude simpler.
You might also find these comments from users interesting: 1, 2, 3
There are many many differences in detail
Why does this happen? That's a much longer answer :-). I'm gradually working on writing up the different pieces in blog posts and talks, and I'll try to remember to update this answer with links as they become available. Basically, it comes down to Trio having a small set of carefully designed primitives that have a few fundamental differences from any other library I know of (though of course they build on ideas from lots of places). Here are some random notes to give you some idea:
A very, very common problem in asyncio and related libraries is that you call some_function(), and it returns, so you think it's done – but actually it's still running in the background. This leads to all kinds of tricky bugs, because it makes it difficult to control the order in which things happen or to know when anything has actually finished, and it can directly hide problems: if a background task crashes with an unhandled exception, asyncio will generally just print something to the console and then keep going. In trio, the way we handle task spawning via "nurseries" means that none of these things happen: when a function returns, you know it's done, and trio's currently the only concurrency library for Python where exceptions always propagate until you catch them.
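To make the nursery idea concrete, here's a minimal sketch (the child tasks are made up purely for illustration):

```python
import trio

async def child(name, delay):
    await trio.sleep(delay)
    print(name, "finished")

async def main():
    # The nursery block doesn't exit until every task started in it has
    # finished, and an unhandled exception in any child propagates out of
    # the block instead of being silently printed and dropped.
    async with trio.open_nursery() as nursery:
        nursery.start_soon(child, "a", 1)
        nursery.start_soon(child, "b", 2)
    print("both children are guaranteed to be done here")

trio.run(main)
```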
Trio's way of managing timeouts and cancellations is novel, and I think better than previous state-of-the-art systems like C# and Golang. I actually did write a whole essay on this, so I won't go into all the details here. But asyncio's cancellation system – or really systems, since it has two of them with slightly different semantics – is based on an older set of ideas than even C# and Golang, and is difficult to use correctly. (For example, it's easy for code to accidentally "escape" a cancellation by spawning a background task; see the previous paragraph.)
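As a small taste of the cancel-scope style (a toy example, not taken from the essay):

```python
import trio

async def main():
    # Everything inside this block (including any tasks spawned inside it)
    # is cancelled if it takes longer than 5 seconds.
    with trio.move_on_after(5):
        await trio.sleep(10)
        print("this line is never reached")
    print("control resumes here once the timeout fires")

trio.run(main)
```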
There's a ton of redundant stuff in asyncio, which can make it hard to tell which thing to use when. You have futures, tasks, and coroutines, which are all basically used for the same purpose but you need to know the differences between them. If you want to implement a network protocol, you have to pick whether to use the protocols/transports layer or the streams layer, and they both have tricky pitfalls (this is what the first part of the essay you linked is about).
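For example, here are three of asyncio's overlapping ways of representing "a result that will be ready later" (a toy sketch, using only the long-stable APIs):

```python
import asyncio

async def fetch():
    await asyncio.sleep(0.1)
    return 42

async def main():
    loop = asyncio.get_event_loop()
    result_a = await fetch()                # a bare coroutine
    task = asyncio.ensure_future(fetch())   # a Task wrapping the coroutine
    result_b = await task
    fut = loop.create_future()              # a low-level Future you resolve by hand
    fut.set_result(42)
    result_c = await fut
    assert result_a == result_b == result_c == 42

asyncio.get_event_loop().run_until_complete(main())
```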
Trio's currently the only concurrency library for Python where control-C just works the way you expect (i.e., it raises KeyboardInterrupt wherever your code is). It's a small thing, but it makes a big difference :-). For various reasons, I don't think this is fixable in asyncio.
Summing up
If you need to ship something to production next week, then you should use asyncio (or Twisted or Tornado or gevent, which are even more mature). They have large ecosystems, other people have used them in production before you, and they're not going anywhere.
If trying to use those frameworks leaves you frustrated and confused, or if you want to experiment with a different way of doing things, then definitely check out trio – we're friendly :-).
If you want to ship something to production a year from now... then I'm not sure what to tell you. Python concurrency is in flux. Trio has many advantages at the design level, but is that enough to overcome asyncio's head start? Will asyncio being in the standard library be an advantage, or a disadvantage? (Notice how these days everyone uses requests, even though the standard library has urllib.) How many of the new ideas in trio can be added to asyncio? No-one knows. I expect that there will be a lot of interesting discussions about this at PyCon this year :-).
There is a Scala library I'd like to use, namely BIDMach, but I need to be able to use it from Python rather than from Scala. I've been trying to think of different ways of communicating between the library and Python code, such as creating an HTTP server in Scala and calling it from Python, using something like JPype to try to use Scala libraries from Python, and various kinds of interprocess communication. However, none of them seem to work very well, and they would seem to require a large amount of reimplementation of what is already in the library. Does anyone know of a good way of going about this?
edit: In terms of exactly what I'd like to do, ideally I'd be able to get almost all of the library's functionality usable from Python, although that probably isn't realistic. It would be nice if some of the Scala classes were easily usable from Python without too much repeated implementation effort. The reason I don't think what I've looked into so far will work well is that it requires a fair bit of reimplementation of what is already in the library (e.g. representing something like a matrix in JSON as a way of transporting data between Python and Scala).
What is the simplest way to use all the cores of a computer in a Python program? In particular, I would want to parallelize a NumPy function (which already exists). Is there something in Python like OpenMP in Fortran?
Check out the multiprocessing library. It even allows you to spread work across multiple computers.
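If the NumPy function operates independently on pieces of the data, a minimal sketch with a process pool looks like this (the function and the chunking strategy are placeholders for your real code):

```python
import multiprocessing
import numpy as np

def my_numpy_function(chunk):
    # placeholder for the existing NumPy function you want to parallelize
    return np.sqrt(chunk).sum()

if __name__ == "__main__":
    data = np.random.rand(1000000)
    # split the input into one chunk per core and map the function over them
    chunks = np.array_split(data, multiprocessing.cpu_count())
    pool = multiprocessing.Pool()
    partial_results = pool.map(my_numpy_function, chunks)
    pool.close()
    pool.join()
    print(sum(partial_results))
```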
It depends on what you want to do and how numpy is compiled on your machine (in some cases, some multicore use will be automatic). See this page for details.
It may or may not fit the specific problem you want to solve, but I personally find the IPython shell's parallel infrastructure quite attractive. It is relatively easy to set up an ipcluster on localhost (see the manual).
You can wrap the function you wish to evaluate in a parallel decorator, for example, and its evaluation will be distributed among many cores (see the "Quick and easy parallelism" section of the manual).
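Roughly, the workflow looks like the sketch below. This assumes an ipcluster is already running locally (for example via "ipcluster start -n 4") and uses the map-style API rather than the decorator; note that the import path has moved between IPython releases (it now lives in the separate ipyparallel package):

```python
from IPython.parallel import Client

rc = Client()      # connect to the running ipcluster
dview = rc[:]      # a view over all available engines

def slow_square(x):
    # placeholder for the function you actually want to evaluate
    return x * x

# map_sync distributes the calls across the engines and blocks until done
results = dview.map_sync(slow_square, range(16))
print(results)
```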
I am learning Python and want to do a certain kind of scripting with it. For example:
I want to run 'wmic' commands through the DOS prompt and store the result in a file,
then access an SQLite database, take the data it holds, and compare it with the result I stored.
What I don't get is how I should proceed. Are there any specific frameworks or modules/libraries for this, like win32api / COM or something else?
Please guide me on what I should follow/learn to accomplish what I intend to do.
Thanks
For your project, look here:
http://tgolden.sc.sabren.com/python/wmi/index.html
http://docs.python.org/library/sqlite3.html
Here is the list of generic Python modules:
http://docs.python.org/modindex.html
http://docs.python.org/ - here you can find all the information you need, with examples.
One of the particularly attractive features of Python is the "batteries included" philosophy: the standard library is huge, and extremely well thought out in 90% of the modules I've ever used. Consequently, a good approach is to learn it first before branching out and installing third-party libraries that may be much less well supported and, ultimately, have no advantage.
In practice, this means I'd recommend keeping https://docs.python.org/release/2.6.5/library/index.html close to your heart, and delving into the documentation for the sqlite3 and probably the subprocess modules. You may or may not want win32api later (I've never worked with wmic, so I'm not sure what you'd need); come back here when you have concrete questions and have already explored what the standard library offers.
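A minimal sketch of the kind of glue this involves (the particular wmic query, file name, and database schema below are made-up examples, not something your task requires exactly):

```python
import sqlite3
import subprocess

# run a wmic command via the Windows command prompt and capture its output
output = subprocess.check_output(["wmic", "os", "get", "Caption"],
                                 universal_newlines=True)

# store the raw result in a file
with open("wmic_result.txt", "w") as f:
    f.write(output)

# compare it against whatever is already stored in an SQLite database
conn = sqlite3.connect("inventory.db")
cur = conn.execute("SELECT value FROM os_info WHERE key = ?", ("Caption",))
row = cur.fetchone()
stored = row[0] if row else None
print("matches stored value" if stored and stored in output
      else "differs from stored value")
conn.close()
```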
Would it be possible to make a python cluster, by writing a telnet server, then telnet-ing the commands and output back-and-forth? Has anyone got a better idea for a python compute cluster?
PS. Preferably for python 3.x, if anyone knows how.
The Python wiki hosts a very comprehensive list of Python cluster computing libraries and tools. You might be especially interested in Parallel Python.
Edit: There is a new library that is IMHO especially good at clustering: execnet. It is small and simple. And it appears to have fewer bugs than, say, the standard multiprocessing module.
You can see most of the third-party packages available for Python 3 listed here; relevant to cluster computation is mpi4py -- most other distributed computing tools such as pyro are still Python-2 only, but MPI is a leading standard for cluster distributed computation and well worth looking into (I have no direct experience using mpi4py with Python 3 yet, but by hearsay I believe it's a good implementation).
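To give a flavour of the mpi4py programming model, here is a toy sum-of-squares sketch (not tied to any particular cluster setup; you would launch it with something like mpiexec -n 4 python script.py):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id within the job
size = comm.Get_size()   # total number of processes

# each process works on its own slice of the problem...
local_result = sum(i * i for i in range(rank, 1000, size))

# ...and the partial results are combined on rank 0
total = comm.reduce(local_result, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of squares below 1000:", total)
```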
The main alternative is Python's own built-in multiprocessing, which also scales up pretty well if you have no interest in interfacing existing nodes that respect the MPI standards but may not be coded in Python.
There is no real added value in rolling your own (as Atwood says, don't reinvent the wheel, unless your purpose is just to better understand wheels!-) -- use one of the solid, widespread solutions, already tested, debugged and optimized on your behalf!-)
Look into these:
http://www.parallelpython.com/
http://pyro.sourceforge.net/
I have used both, and both are excellent for distributed computing.
For a more detailed list of options, see
http://wiki.python.org/moin/ParallelProcessing
And if you want to automatically execute something on a remote machine, a better alternative to telnet is SSH, as in http://pydsh.sourceforge.net/
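If you do end up scripting the remote execution yourself, the simplest version is just shelling out to ssh from Python (the hostname and command below are placeholders, and this assumes key-based SSH login is already configured):

```python
import subprocess

# run a command on a remote node over SSH and collect its output
host = "node01.example.com"            # placeholder hostname
cmd = "python -c 'print(2 + 2)'"       # placeholder remote command
output = subprocess.check_output(["ssh", host, cmd], universal_newlines=True)
print(output.strip())
```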
What kind of stuff do you want to do? You might want to check out Hadoop. The backend heavy lifting is done in Java, but it has a Python interface, so you can write Python scripts to create and send the input, as well as process the results.
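The usual form of that Python interface is Hadoop Streaming, where the mapper and reducer are ordinary scripts that read stdin and write tab-separated key/value pairs to stdout. For instance, a minimal word-count mapper might look like this (a generic sketch, not specific to any Hadoop version):

```python
#!/usr/bin/env python
# minimal Hadoop Streaming mapper: Hadoop feeds input lines on stdin
# and collects "key<TAB>value" pairs from stdout
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```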
If you need to write administrative scripts, take a look at the ClusterShell Python library too, and/or its parallel shell clush. It's also useful when dealing with node sets (man nodeset).
I think IPython.parallel is the way to go. I've been using it extensively for the last year and a half. It allows you to work interactively with as many worker nodes as you want. If you are on AWS, StarCluster is a great way to get IPython.parallel up and running quickly and easily with as many EC2 nodes as you can afford. (It can also automatically install Hadoop, and a variety of other useful tools, if needed.) There are some tricks to using it. (For example, you don't want to send large amounts of data through the IPython.parallel interface itself. Better to distribute a script that will pull down chunks of data on each engine individually.) But overall, I've found it to be a remarkably easy way to do distributed processing (WAY better than Hadoop!)
"Would it be possible to make a python cluster"
Yes.
I love yes/no questions. Anything else you want to know?
(Note that Python 3 has few third-party libraries yet, so you may want to stay with Python 2 at the moment.)