Python Kafka Consumer Library that supports scalability and recoverability - python

Basically, I have a use case where I'm trying to set up a python process to run in containers on EKS to consume data from Kafka and process the same. So far, I've gathered that I could consider using the following three frameworks for this:
Apache Spark(Pyspark)
kafka-python Module
Confluent-kafka-python Module.
I'm looking for a solution that has the following features:
Seamless scalability
If a container dies, and it gets spun up again, it should know where to pick up the offset where it left off from.
Any advice on which of the above mentioned libraries would be most suitable for this? Thanks in advance!

Since you're looking for a mere library, I would avoid PySpark since it's a whole framework with it's own deployment model that is going to have a major impact on how you structure your application. Unless you explicitly know that you need the specific features of Spark, I would skip this one. Similarly, Apache Flink and Beam frameworks exist with Python support.
Confluent Kafka is a Python wrapper around the librdkafka C library, so you have the guarantee it's going to be the most performant and compatible option. Also, since it's maintained by Confluent it should continue being updated by very knowledgeable professional in the long term: it's a dependable solution.
You also have Kafka python which is implemented fully in python. It might "feel" more pythonic, but may also be less performant. At the time of writing this the repo does not show a lot of recent activity.
There's also faust that aimed to be Kafka Streams-like, and aiokafka which uses AsyncIO support

Related

Extensible Local HTTP Server with Framework in Python

I'm trying to build a desktop application using Python. To make it able to be used on as many platforms as possible, I think web UI may be a good choice. This boils down to the problem of making a local HTTP server first. I did some survey and found that people are mainly talking about BaseHTTPServer and SimpleHTTPServer. For prototyping, subclassing them may suffice.
Besides pure prototyping, I also want to leave some room for extension to real service. That is, once mature, I'd like to move the codes to a real dedicated HTTP server, so that end users only need a browser to use it.
I say "extensible" in the following sense:
The code modification is as minimum as possible in the migration process.
I will focus on algorithm in the prototyping stage. I also want to leave some room for future front end designer.
It looks WSGI + Django is a widely mentioned combination. After some search, what I found is using WSGI in apache or nginx. Is it possible to use self-contained modules? i.e. wsgiref + Django, so that I can start everything just from one entry script. I don't want to bother potential first adopters by asking them install apache and configure it. It will be very good if you have sample codes or pointers for further reading.
I'm new to Python and web programming in Python. Thanks for your help. I just try to make sure I'm on the right track. My underlying algorithms is implemented in Python 2.7. So the UI solution had better also be in Python 2.7.
I think what you may want is Bottle. It is a web framework that only needs the standard library to be installed. It also has compatibility with many other production servers, as well as shipping with it's own development server. And if that isn't good enough, it is all in a single file, and has support with many different templating languages, as well as it's own built in templating language.
Check it out here: http://bottlepy.org/docs/dev/
As mentioned bottle is a good choice, I personally like Flask, which if I recall correctly is what bottle is based off of. Anyways there are three things that really make Flask a joy to use.
Blueprints - essentially an application architecture
Flask-Sijax - allows for comet technology
Celery - an asynchronous task queue/job queue based on distributed message passing
there are a lot of other plugins, including one for an admin interface that I haven't tried out yet but it looks promising, and it works with Python 2.7

Non relational database for twisted

I'm looking for any key-value database implementation for working with twisted in asynchronous mode.
The one Idea that I have is using the Twisted Memcache API with MemcacheDB.
Is this some other solution?
One of possible solution is using Redis(REmote DIctionary Server).
Redis is very fast, powerful and stable key-value storage which is used in many projects. Stackoverflow also uses redis;).
I've recently start using redis in my current project for creating user's ratings. My personal opinion: redis is very simple, very fast and stable. It also has a pretty command line client, I like it.
On website I use synchronous redis package. Server uses twisted and requires asynchronous approach. Fortunately, there is third-party module txredis, which allows easily to interact with redis database using twisted. I didn't have any problems with it. However, txredis doesn't have a connection pool, but it's not a problem to implement it manually, if needed.
I use Apache Cassandra from twisted, using Telephus if production for years.
Adding one more point to #dr. 's answer marked as accepted : Use the python package txredisapi, which uses redis protocol for twisted with connection pool support and many more.

Jython, Jepp or Pylons for the performance

I'm trying to incorporate server-based code diff and highlighting in my GWT (Java) project. I managed to incorporate Pygments and difflib into my code using Jython. The basic idea is to generate complete markup on the server and then simply inject code into the page as innerHTML.
I found Jython completely inadequate as even for relatively small files (2K-3K lines) it takes Pygments or difflib forever (minutes not seconds) to process these files. Difflib actually reliably causes OOM errors in the process with dedicated 500M of memory
So I'm wondering if my current setup is wrong or Jython is simply unsuitable for this purpose?
If so, what's next? I discover Jepp but then I would have to build my project for each platform and it has little documentation and don't seem very stable. Another possibility would be to run Pylons as a separate webservice on the same host and get the markup directly to client or channel it through server. And yet another way is to use Java System to execute python script as a process and capture the output.
I would be very interested to hear solid suggestion on the matter.
Having a separate service sounds like the best way to go. For Pygments, there is already a service available (on Google App Engine). The source for the app is BSD open source and on GitHub here. You could adapt this to add difflib functionality too, of course.
I'm going to accept answer above since it coincides with my findings but just to let anyone who reads this know - running separate webservice for Pygments using Python-native solution such as Bottle performs many times better than embedded Jython. Especially on Linux

Web gateway interfaces in Python 3

I've finally concluded that I can no longer afford to just hope the ongoing Py3k/WSGI disasterissues will be resolved anytime soon, so I need to get ready to move on.
Unfortunately, my available options don't seem a whole lot better:
While I find a few different Python modules for FastCGI scattered around the web, none of them seem to be getting much (if any) attention and/or maintenance, particularly with regard to Python 3.x, and it's difficult to distinguish which, if any, are really viable.
Falling all the way back to the built-in CGI module is hardly better than building something myself from scratch (worse, there's an important bug or two in there that may not get attention until Python 3.3).
There is no higher sin than handling HTTP directly in a production webapp. And anyway, that's still reinventing the wheel.
Surely somebody out there is deploying webapps on 3.x in production. What gateway interface are you using, with which module/libraries, and why?
CherryPy 3.2 release candidates support Python 3.X. Because it only supports WSGI at the web server interface layer and not through the whole stack, then you are isolated from issues as to whether WSGI will change. CherryPy has its own internal WSGI server, but also can run under Apache/mod_wsgi with Python 3.1+. See:
http://www.cherrypy.org/wiki/WhatsNewIn32
http://code.google.com/p/modwsgi/wiki/SupportForPython3X
bottle supports Python 3, but it suffers from the broken stdlib. However, multipart reimplements cgi.FieldStorage and can be used with bottle to build a Python 3 WSGI web app. I just published a demo. For the moment it is just a test, but as far as I can tell it works well.

What is the advantage of using Python Virtualbox API?

what is the advantage of using a python virtualbox API instead of using XPCOM?
The advantage is that pyvb is lot easier to work with.
On the contrary the documentation for the python API of XPCOM doesn't exist, and the API is not pythonic at all. You can't do introspection to find methods/attributes of an object, etc. So you have to check the C++ source to find how it works or some python scripts already written (like vboxshell.py and VBoxWebSrv.py).
On the other hand pyvb is really just python wrapper that call VirtuaBoxManager on the command line. I don't know if it's a real disadvantage or not?
I would generally recommend against either one. If you need to use virtualization programmatically, take a look at libvirt, which gives you cross platform and cross hypervisor support; which lets you do kvm/xen/vz/vmware later on.
That said, the SOAP api is using two extra abstraction layers (the client and server side of the HTTP transaction), which is pretty clearly then just calling the XPCOM interface.
If you need local host only support, use XPCOM. The extra indirection of libvirt/SOAP doesn't help you.
If you need to access virtualbox on a various hosts across multiple client machines, use SOAP or libvirt
If you want cross platform support, or to run your code on Linux, use libvirt.
From sun's site on VirtualBox python APIs:
SOAP allows to control remote VMs over
HTTP, while XPCOM is much more
high-performing and exposes certain
functionality not available with SOAP.
They use very different technologies
(SOAP is procedural, while XPCOM is
OOP), but as it is ultimately API to
the same functionality of the
VirtualBox, we kept in bindings
original semantics, so other that
connection establishment, code could
be written in such a way that people
may not care what communication
channel with VirtualBox instance is
used.
From that article, I'm having trouble seeing the difference between "python virtualbox API" and "XPCOM". Could you provide a link to the API you're thinking of?

Categories

Resources