Dynamic updates and API - Bind or something else? - python

At the place where I work we have a few upcoming projects that require dynamic-DNS functionality, i.e. being able to dynamically insert/modify/delete DNS records.
So far we've been using a simple BIND setup with one master and a few slaves. The master's data (zone files) lives in git, and we have a simple but pretty effective Fabric file we use to make sure all changes are committed to git and then to deploy them to the master, which propagates the changes on to the slaves.
We use views, and we use them a lot; given how much internal stuff we have, keeping this kind of functionality is a must, i.e. we cannot expose internal records to the public.
I've been researching possible solutions for quite a while. The solution would have to (a) allow us to perform dynamic updates on all zones, including the same zone in different views, (b) ideally expose a RESTful API we can talk to to issue those updates, and (c) be open source, so we can either use it directly or at least build on it.
Sadly, I haven't found anything that comes close to this set of requirements, which I don't think is particularly unusual. We've started considering writing something of our own: Python with dnspython talking to BIND via the dynamic update (nsupdate) protocol, issuing changes as we want. But before I dive in, I thought I'd ask for advice in case I've missed something.
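For reference, something roughly like the following is what we have in mind with dnspython; the zone name, TSIG key and server address below are just placeholders:

```python
# Rough sketch of a dynamic update via dnspython (all names/keys are placeholders).
import dns.update
import dns.query
import dns.rcode
import dns.tsigkeyring

# TSIG key that the BIND master would be configured to accept for this zone/view
keyring = dns.tsigkeyring.from_text({"ddns-key.": "c2VjcmV0c2VjcmV0c2VjcmV0"})

update = dns.update.Update("internal.example.com",
                           keyring=keyring, keyname="ddns-key.")
update.replace("host1", 300, "A", "10.0.0.42")   # add or overwrite an A record
update.delete("old-host", "A")                   # drop a record

response = dns.query.tcp(update, "192.0.2.1")    # send to the master
print(dns.rcode.to_text(response.rcode()))       # NOERROR on success
```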
Any advice, much appreciated.

Related

Python - Multiprocessing and database entries

I'm working on a framework that digital forensic investigators can use to compare files with each other, for my Master's capstone project. However, I ran into a bit of a snag...
I'm trying to implement multiprocessing for the comparisons, since using a single core seems to be really slow. The trouble I'm having, however, comes when the code writes information into an SQLite database: it will occasionally get a "database is locked" error when two processes finish at nearly the same time.
So, the simple side of my question: is it unsafe to perform database operations in a multiprocessing environment, given the errors I'm encountering? If so, is there a way of going about this that is safe and won't result in random errors?
Thanks!
Your problem is that you are trying to have multiple writers access a toy database -- i.e. SQLite -- which is stored in a single file. Using a Lock might help, but it's going to kill your multiprocessing throughput because of all the time spent waiting for the lock. In essence, the lock becomes a choke point that serializes your program.
Setting up either MySQL or Postgres on almost any platform is straightforward, and there are several excellent Python modules for accessing them. Using one of those will completely eliminate this problem.
Update for an extended response to comment:
I always ask clients / students, "What problem are you trying to solve?" I'm assuming that you are not trying to create a database system, simply to use one. SQLite3 is fine for a well-defined set of problems, but multiprocess access is not one of them. I could veer off into asking what aspect of your project requires multiprocess access, but I'll assume that you have already determined that this is needed. I don't know either your programming skills or your understanding of how a database works, so forgive me if the following is a bit basic.
Normally you need a database (my preference is Postgres), and a Python module that understands all of the fiddly details of how to talk to that database. Then you need to know what it is you want the DBMS to do for you. The Good News is that you are hardly the first to go down this path.
The Postgres Wiki is full of good stuff. See their page on Python drivers. Psycopg2 is the category leader and runs on Windows/Linux/Mac. Also check out PyPI, the Python Package Index, for many well-written extensions.
If you want to stay more object-oriented, as opposed to writing straight SQL, you might want to look at an ORM like SQLAlchemy. This is another category leader that is well-maintained and widely deployed.
The value of using an ORM is that you can (mostly) keep your head in ObjectLand, where most of your problem lives, and not get tangled up in the cognitive dissonance created by object-oriented programming vs. relational database management, which are two very different views of the world of data.
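For example, here is a minimal SQLAlchemy sketch of the kind of thing you would write instead of raw SQL (the connection URL and the model are placeholders, not taken from your project):

```python
# Hypothetical sketch: store a comparison result via SQLAlchemy + Postgres.
from sqlalchemy import create_engine, Column, Integer, String, Float
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Result(Base):
    __tablename__ = "results"
    id = Column(Integer, primary_key=True)
    file_a = Column(String)
    file_b = Column(String)
    score = Column(Float)

# placeholder credentials; Postgres handles concurrent writers for you
engine = create_engine("postgresql+psycopg2://user:secret@localhost/forensics")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Result(file_a="a.bin", file_b="b.bin", score=0.93))
    session.commit()
```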
If you need more help, email me. My address is in my profile.
You can make use of Lock. Take a look at https://docs.python.org/2/library/multiprocessing.html#synchronization-between-processes
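For instance, a minimal sketch of that approach with a lock shared between pool workers (the results table and the comparison function here are just placeholders):

```python
# Sketch: serialize SQLite writes across worker processes with a shared Lock.
import sqlite3
from multiprocessing import Pool, Lock

lock = None  # shared lock, installed in each worker by the initializer

def init_worker(shared_lock):
    global lock
    lock = shared_lock

def compare_files(a, b):
    # placeholder for the real file-comparison work
    return abs(len(a) - len(b))

def compare_and_store(pair):
    score = compare_files(*pair)
    with lock:                                # only one process writes at a time
        conn = sqlite3.connect("results.db")
        conn.execute("INSERT INTO results (a, b, score) VALUES (?, ?, ?)",
                     (pair[0], pair[1], score))
        conn.commit()
        conn.close()

if __name__ == "__main__":
    with sqlite3.connect("results.db") as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS results (a TEXT, b TEXT, score REAL)")
    shared_lock = Lock()
    with Pool(processes=4, initializer=init_worker, initargs=(shared_lock,)) as pool:
        pool.map(compare_and_store, [("a.bin", "b.bin"), ("a.bin", "c.bin")])
```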

I need to run untrusted server-side code in a web app - what are my options?

# Context -- skip if you want to get right to the point
I've been building a rather complex web application in Python (Bottle/gevent/MongoDB). It is an RSVP system which allows several independent front-end instances with registration forms, as well as back-end access with granular user permissions (those users are our clients). I now need to implement a flexible map-reduce engine to collect statistics on the registration data. A one-size-fits-all solution is impossible since the data gathered varies from instance to instance. I also want to keep this open to our more technically inclined clients.
# End of context
So I need to execute arbitrary strings of code (some kind of ad-hoc plugin; the language doesn't matter) entered through a web interface. I've already learned that it's virtually impossible to properly sandbox Python, so that's not an option.
As of now I've looked into Lua and found Lupa, Lunatic Python and Lupy, but all three of them allow access to parts of the Python runtime.
There's also PyExecJS and its various runtimes (V8, Node, SpiderMonkey), but I have no idea whether it poses any security risks.
Questions:
1. Does anyone know of another (more fitting) option?
2. To those familiar with any of the Lua bindings: Is it possible to make them completely safe without too much hassle?
3. To those familiar with PyExecJS: How secure is it? Also, what kind of performance should I expect for, say, calling a short mapping function 1000 times and then iterating over a 1000-item list?
Here are a few ways you can run untrusted code:
a Docker container that runs the code; I would suggest checking out codecube.io, it does exactly what you want and you can learn more about the process here (a rough sketch of this approach follows below)
the libsandbox libraries, though at the present time the documentation is pretty bad
PyPy's sandboxing
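A rough, hypothetical sketch of the Docker route (the image name, limits, and helper are illustrative, not what codecube.io actually does):

```python
# Sketch: run an untrusted snippet in a throwaway container, no network,
# tight resource limits. All limits and the image name are illustrative.
import subprocess
import tempfile
import os

def run_untrusted(code: str, timeout: int = 5) -> str:
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "snippet.py")
        with open(script, "w") as f:
            f.write(code)
        cmd = [
            "docker", "run", "--rm",
            "--network", "none",           # no network access
            "--memory", "64m",             # cap memory
            "--cpus", "0.5",               # cap CPU
            "-v", f"{workdir}:/sandbox:ro",
            "python:3-slim",
            "python", "/sandbox/snippet.py",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return result.stdout

print(run_untrusted("print(sum(range(10)))"))
```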
Sneklang is a strict subset of Python that is safely evaluated in a scope you provide.
It is limited by scope size and by the number of node evaluation steps, and it protects against infinite loops, stack overflows, and excessive memory usage.
There is an online sandbox as well: https://sneklang.functup.com
I've made this project specifically because I had the same requirements.

Python Backend Design Patterns

I am now working on a big backend system for a real-time and history-tracking web service.
I am highly experienced in Python and intend to use it with SQLAlchemy (MySQL) to develop the backend.
I don't have any major experience developing robust and sustainable backend systems, and I was wondering if you could point me to some documentation/books about backend design patterns. I basically need to feed data into a database by querying different services (over HTML / SOAP / JSON) in real time, and to keep a history of that data.
Thanks!
Can you define "backend" more precisely? Normally, in web dev, I follow an MVC-ish structure where my "front end" (HTML/CSS/JS and the code dealing with displaying it) is loosely coupled with my "backend" model (business objects and data persistence, i.e. the database).
I like Django's Model/View/Template approach:
http://docs.djangoproject.com/en/dev/faq/general/#django-appears-to-be-a-mvc-framework-but-you-call-the-controller-the-view-and-the-view-the-template-how-come-you-don-t-use-the-standard-names
But you haven't really defined what you mean by "backend", so it's hard to give advice on design patterns. You said you are experienced in Python; have you ever developed a database-driven web application before?
update
Based on your comment, I won't be able to help much, as I don't have much experience building "backends" like that. However, seeing as you are pulling in resources from the web, your per-request latency is going to be pretty high. So, in order to increase overall effectiveness, you are going to want something that can run multiple threads or processes with fairly high concurrency. I suggest you check out the answers on this thread (and search for similar ones):
Concurrent downloads - Python
Specifically, I found the recursive web crawler example, and the example following it, to probably be a very good start on your solution:
http://eventlet.net/doc/examples.html#recursive-web-crawler
As for taking that idea and turning it into a robust, continuously running process, a lot will depend on your platform and how well you do error handling. Basically:
run it in a loop and make sure you handle any error that can possibly be thrown
have some kind of process monitor watching your worker process, to kill/restart it if it hangs or dies
make sure you have a monitoring solution to notify you if it stops working (nagios, etc.)
One of the best ways to keep things "robust" is to make them as simple (not simplistic) as possible. If all you are doing is pulling in info from the web, parsing it in some way, and then storing that info in a DB, try to keep the process that simple. Don't add unnecessary complexity in an effort to make it more robust. If you end up with a 200-line script that does what you want, great!
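To make that concrete, here is one possible shape of such a loop using only the standard library; the URLs, table, and polling interval are all made up:

```python
# Sketch of a simple fetch-parse-store worker loop (endpoints and schema are placeholders).
import time
import sqlite3
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVICES = ["https://example.com/feed1.json", "https://example.com/feed2.json"]

def fetch(url):
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return url, resp.read().decode("utf-8")
    except Exception:                 # handle anything the fetch can throw
        return url, None

def store(conn, url, payload):
    if payload is not None:
        conn.execute("INSERT INTO history (url, fetched_at, payload) VALUES (?, ?, ?)",
                     (url, time.time(), payload))
        conn.commit()

def main():
    conn = sqlite3.connect("history.db")
    conn.execute("CREATE TABLE IF NOT EXISTS history (url TEXT, fetched_at REAL, payload TEXT)")
    while True:                       # loop forever: fetch concurrently, store, sleep
        with ThreadPoolExecutor(max_workers=8) as pool:
            for url, payload in pool.map(fetch, SERVICES):
                store(conn, url, payload)
        time.sleep(60)

if __name__ == "__main__":
    main()
```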
Use Apache, Django and Piston.
Use REST as the protocol.
Write as little code as possible.
Django models, forms, and admin interface.
Piston wrappers for your resources.

How should I configure Amazon EC2 to perform parallelizable data-intensive calculations?

I have a computationally intensive project that is highly parallelizable: basically, I have a function that I need to run on each observation in a large table (PostgreSQL). The function itself is a stored Python procedure.
Amazon EC2 seems like an excellent fit for the project.
My question is this: should I make a custom image (AMI) that already contains the database? This would seem to have the advantage of minimizing data transfers and making parallelization simple: each instance launched from the image could get an assigned block of indices to compute, e.g. instance 1 gets 1:100, instance 2 gets 101:200, etc. Splitting up the data and the instances (which most how-to guides suggest) doesn't seem to make sense for my application, but I'm very new to this, so I'm not confident my intuition is right.
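To illustrate what I mean, each instance would run something like this hypothetical worker; the table, column, and procedure names are made up:

```python
# Sketch: one worker processes the block of row ids assigned to this instance.
import os
import psycopg2

START = int(os.environ["BLOCK_START"])   # e.g. 1
END = int(os.environ["BLOCK_END"])       # e.g. 100

conn = psycopg2.connect(host=os.environ["DB_HOST"], dbname="observations",
                        user="worker", password=os.environ["DB_PASSWORD"])
with conn, conn.cursor() as cur:
    cur.execute("SELECT id FROM observations WHERE id BETWEEN %s AND %s", (START, END))
    for (row_id,) in cur.fetchall():
        # the stored Python procedure that does the heavy computation
        cur.execute("SELECT compute_metrics(%s)", (row_id,))
```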
You will definitely want to keep the data and the server instance separate so that changes to your data are persisted when you are done with the instance. Your best bet will be to start with a basic image that has the OS and database platform you want to use, customize it to suit your needs, and then mount one or more EBS volumes containing your data. You may also want to create your own image (AMI) once you are finished with your customization, unless what you are doing is fairly straightforward.
Some helpful links:
http://docs.amazonwebservices.com/AmazonEC2/gsg/2006-10-01/creating-an-image.html
http://developer.amazonwebservices.com/connect/entry.jspa?categoryID=100&externalID=1663
(You said Postgres, but this MySQL tutorial covers the same basic concepts you'll want to keep in mind.)
If you've already got the function implemented in Python, the simplest route might be to look at PiCloud, which just gives you a really easy interface for running a Python function on EC2, handling pretty much everything else for you. Whether it's economically sensible will depend on how much data has to get sent per function call vs how long computations take to run.

How would one make Python objects persistent in a web-app?

I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which does not change often, plus user specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state from the (semi-normalized) database data on every request quickly becomes infeasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there are several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared-memory solutions, but the only relevant thing I've found is POSH. However, POSH doesn't seem to be widely used, and I don't feel comfortable integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization and then optimized as needed. I've done the measuring and testing to make sure the issue mentioned above is the actual bottleneck. I'm pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suits your solution.
Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage it: learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, use session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether a MySQL socket or a named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) measurements that are fine-grained enough to be useful and stable enough to provide meaningful comparisons of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.
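For example, cached sessions in Django are a settings change rather than code. A minimal sketch (the memcached address is a placeholder, and the exact cache backend class depends on your Django version):

```python
# settings.py sketch: keep session state in memcached via Django's cache framework.
CACHES = {
    "default": {
        # On newer Django versions this backend is PyMemcacheCache instead.
        "BACKEND": "django.core.cache.backends.memcached.MemcachedCache",
        "LOCATION": "127.0.0.1:11211",
    }
}
SESSION_ENGINE = "django.contrib.sessions.backends.cache"
# or "django.contrib.sessions.backends.cached_db" for a write-through variant
```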
I think the multiprocessing package has something that might be applicable here, namely its shared ctypes support.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether this approach works with processes that were not spawned via multiprocessing.
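A tiny illustration of the shared ctypes idea (the values here are obviously made up; it only helps if your state can be expressed as fixed-size C types):

```python
# Shared state between processes via multiprocessing's shared ctypes (Value/Array).
from multiprocessing import Process, Value, Array

def worker(counter, scores):
    with counter.get_lock():        # shared ctypes come with an optional lock
        counter.value += 1
    scores[0] = 0.75                # update the shared array in place

if __name__ == "__main__":
    counter = Value("i", 0)         # shared int
    scores = Array("d", [0.0] * 4)  # shared array of doubles
    procs = [Process(target=worker, args=(counter, scores)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value, list(scores))
```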
I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initially it was an integral part of Zope, but lately a standalone package has also become available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read about it but haven't given it a shot myself, though.
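For orientation, here is a minimal sketch of what using it looks like, based on the ZODB documentation (the file name and keys are placeholders):

```python
# Hypothetical ZODB sketch: the root behaves like a dictionary, commits are explicit.
from ZODB import FileStorage, DB
from persistent.mapping import PersistentMapping
import transaction

storage = FileStorage.FileStorage("appstate.fs")   # on-disk object store
db = DB(storage)
connection = db.open()
root = connection.root()                           # the "magic bag"

if "users" not in root:
    root["users"] = PersistentMapping()

root["users"]["alice"] = {"score": 42}             # any picklable object
transaction.commit()                               # persist the change

connection.close()
db.close()
```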
Another possibility could be an in-memory SQLite DB; being in memory, it may speed things up a bit, but you would still have to do all the serialization work.
Note: an in-memory DB is expensive on resources.
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1
First of all, your approach is not a common web development practice. Even when multi-threading is used, web applications are designed to be able to run in multi-process environments, for both scalability and easier deployment.
If you just need to initialize a large object and do not need to change it later, you can do that easily by using a global variable that is initialized while your WSGI application is being created, or while the module containing the object is being loaded, etc.; multi-processing will work fine for you.
If you need to change the object and access it from every thread, you need to be sure your object is thread-safe; use locks to ensure that. And use a single server context, i.e. a single process. Any multi-threaded Python server will serve you well; FCGI is also a good choice for this kind of design.
But if multiple threads are accessing and changing your object, the locks may have a really bad effect on your performance gain, which is likely to make all the benefits go away.
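A sketch of that single-process pattern, entirely illustrative and not tied to any particular framework:

```python
# One shared state object, guarded by a threading.Lock, used by all request
# threads within a single server process.
import threading

class SharedState:
    def __init__(self):
        self._lock = threading.Lock()
        self._per_user = {}

    def update(self, user_id, key, value):
        with self._lock:                       # keep mutations thread-safe
            self._per_user.setdefault(user_id, {})[key] = value

    def get(self, user_id):
        with self._lock:
            return dict(self._per_user.get(user_id, {}))

state = SharedState()   # module-level, shared by every request thread in this process
```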
This is Durus, a persistent object system for applications written in the Python programming language. Durus offers an easy way to use and maintain a consistent collection of object instances used by one or more processes. Access and change of persistent instances is managed through a cached Connection instance which includes commit() and abort() methods so that changes are transactional.
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.
Another option is to review the requirement for state; it sounds like if serialization is the bottleneck, then the object is very large. Do you really need an object that large?
I know that in Stack Overflow podcast 27 the reddit guys discuss what they use for state, so that may be useful to listen to.
