Bad practice to have ORMs with NoSQL stores? - python

I use Redis (redis-py) inside my Python platform. Recently it was suggested that I switch to an ORM.
E.g.: python-stdnet, rom or redisco
Is use of ORMs considered bad practice in the NoSQL world?

Ultimately the question boils down to at what layer do you want to write code.
Do you want to write code that manipulates data structures in a remote database, or do you want to write higher-level code that uses the abstractions built on top of those data structures? You can think of it as a similar question about relational databases as do you want to write SQL, or do you want to write higher-level code?
Personally, despite using rom myself for a variety of tasks (I am the author), I also directly manipulate Redis in the same projects where it makes sense.

Comments pointing out that the R in ORM is for relational are technically correct. That doesn't mean there aren't valid uses and reasons for libraries that abstract redis away.
There are some great libraries that make interfacing with a redis feel much nicer and more idiomatic to the language you are using. For ruby libraries like ohm or redis-native_hash (disclosure: I wrote that one) do just that. For python there are tools like redisco and surely others. These make persisting objects to redis very simple and make working with redis feel much more ruby-ish or python-ish.
Here are a few more benefits from using even the most basic abstraction, like a very thin wrapper you might write and keep in your application:
Switching redis clients will be easier. Maybe you'll never do this, but if you did, changing your calls to redis in one place (your wrapper) is much simpler than changing them everywhere you use redis.
Implementing things you might need for scaling, like sharding or connection pooling, is likely going to be easier if your calls are made through some abstraction.
Replacing redis with some other key/value store or data structure server would be simpler if an abstraction is in place.
I'm not advocating using an object mapping library or building your own abstraction, just pointing out there are valid reasons why you would. Its up to you to evaluate your needs and pick what works best for you. There is nothing wrong with calling redis directly either.

Related

which databases can be used better for pyqt application

i want to create application in windows. i need to use databases which would be preferable best for pyqt application
like
sqlalchemy
mysql
etc.
I would use SQLite every time unless performance became an obvious big problem.
It comes with Python
You don't need to worry about installing it on a target machine or having an existing installation which might clash (including a potential port clash - SQLite doesn't use a port)
It's fairly small (doesn't increase the installed size too much)
Then, a much less obvious choice that I would very much consider making: adding Django to the mix. Django's model system could make for much simpler management, depending on the type of data you're working with. Also, in the case where I've considered it (I just haven't got to that stage of development yet) it means I can reuse the models I've got on the server and a good bit of code from there too.
Obviously in this case you could need to be careful about what you expose; business-critical processing stuff that you don't want to share, potential security holes in server code which you've helpfully provided the code for, etc.
SQlite is fine for a single user.
If you are going over a network to talk to a central database, then you need a database woith a decent Python lirary.
Take a serious look at MySQL if you need/want SQL.
Otherwise, there is CouchDB in the Not SQL camp, which is great if you are storing documents, and can express searches as Map/reduce functions. Poor for adhoc queries.
If you want a relational database I'd recommend you to use SQLAlchemy, as you then get a choice as well as an ORM. Bu default go with SQLite, as per other recommendations here.
If you don't need a relational database, take a look at ZODB. It's an awesome Python-only object-oriented database.
i guess its totally upto you ..but as far as i am concerned i personlly use sqlite, becoz it is easy to use and amazingly simple syntax whereas for MYSQL u can use it for complex apps and has options for performance tuning. but in end its totally upto u and wt your app requires

Twisted + SQLAlchemy and the best way to do it

So I'm writing yet another Twisted based daemon. It'll have an xmlrpc interface as usual so I can easily communicate with it and have other processes interchange data with it as needed.
This daemon needs to access a database. We've been using SQL Alchemy in place of hard coding SQL strings for our latest projects - those mostly done for web apps in Pylons.
We'd like to do the same for this app and re-use library code that makes use of SQL Alchemy. So what to do? Well of course since that library was written for use in a Pylons app it's all the straight-forward blocking style code that everyone is accustomed to and all of the non-blocking is magically handled by Pylons via threading, thread locals, scoped sessions and so on.
So now for Twisted I guess I'm a bit stuck. I could:
Just write the sql I need directly if it's minimal and use the dbapi pool in twisted to do runInteractions etc when I need to hit the db.
Use the objects and inherently blocking methods in our library and block now and then in my Twisted daemon. Bah.
Use sAsync which was last updated in 2008 and kind of reuse the models we have defined already but not really and this doesn't address that the library code needs to work in Pylons too. Does that even work with the latest version SQL Alchemy? Who knows. That project looked great though - why was it apparently abandoned?
Spawn a separate subprocess and have it deal with the library code and all it's blocking, the results being returned back to my daemon when ready as objects marshalled via YAML over xmlrpc.
Use deferToThread and then expunge the objects returned having made sure to do eager loads so that I have all my stuff that I might need. Seems kind of ugha to me.
I'm also stuck using Python 2.5.4 atm so no 2.6 yet and I don't think I can just do an import from future to get access to the cool new multiprocessing module stuff in there. That's OK though I guess as we've got dealing with interprocess communication down pretty well.
So I'm leaning towards option 4 mostly as that would avoid the mortal sin of logic duplication with option 1 while also staying the heck away from threads.
My first attempt though will be option 2 to just get the thing going and then separate out the calls to the library code perhaps into a separate process if it looks like there's a good chance that something might take a bit too long to block on. Sad. Maybe a combination of Stackless Python and Twisted would be interesting here.
Any better ideas?
In the intervening couple of years, Alex Gaynor created https://github.com/alex/alchimia which may be a better central repository for doing integration with SQLAlchemy and Twisted.
Firstly, I can unfortunately only second your opinion that twisted and
SQLAlchemy don't play along very well. I have worked some with both
and would be somewhat afraid of the complexity that would arise from
putting them together.
All the database integration layers that I know of to date use
twisteds threading integration layer, and if you want to avoid that at
all costs you are pretty much stuck with point 4 in your list.
On the other hand, I have seen examples of database connecting code
using deferToThread() and friends that worked very well.
Anyway, some pointers if you'd be ready to consider other frameworks
than SQLAlchemy:
The DivMod guys have been doing some tentative work on twisted -
database integration based on the Storm ORM (google for "storm orm").
See this link for an example:
http://divmod.readthedocs.org/en/latest/products/nevow/storm-approach.html
Also, head over to DivMod's site and have a look at the sources of
their Axiom db layer (probably not of any use to you directly since
it's Sqlite only, but it's principles might be useful).
There's a storm branch that you can use with twisted directly (internally it does the defer to thread stuff) on launchpad https://code.launchpad.net/~therve/storm/twisted-integration. I've used it nicely.
Sadly sqlalchemy is significantly more complex in implementation to audit for async usage. If you really want to use it, i'd recommend an out of process approach with a storage rpc layer.
alternatively if your feeling adventurous and using postgresql, the latest pyscopg2 supports true async usage (https://launchpad.net/txpostgres), and the storm source is pretty simple to hack on ;-)
incidentally the storm you tried last year may not have had the C-extension on by default (it is now in the latest releases.) which might account for your speed issues.
Perhaps twistar is what you're looking for. It's a native active record (aka ORM) implementation for twisted, working on top of twisted.enterprise.adbapi.
http://findingscience.com/twistar/

What is so bad with threadlocals

Everybody in Django world seems to hate threadlocals(http://code.djangoproject.com/ticket/4280, http://code.djangoproject.com/wiki/CookBookThreadlocalsAndUser). I read Armin's essay on this(http://lucumr.pocoo.org/2006/7/10/why-i-cant-stand-threadlocal-and-others), but most of it hinges on threadlocals is bad because it is inelegant.
I have a scenario where theadlocals will make things significantly easier. (I have a app where people will have subdomains, so all the models need to have access to the current subdomain, and passing them from requests is not worth it, if the only problem with threadlocals is that they are inelegant, or make for brittle code.)
Also a lot of Java frameworks seem to be using threadlocals a lot, so how is their case different from Python/Django 's?
I avoid this sort of usage of threadlocals, because it introduces an implicit non-local coupling. I frequently use models in all kinds of non-HTTP-oriented ways (local management commands, data import/export, etc). If I access some threadlocals data in models.py, now I have to find some way to ensure that it is always populated whenever I use my models, and this could get quite ugly.
In my opinion, more explicit code is cleaner and more maintainable. If a model method requires a subdomain in order to operate, that fact should be made obvious by having the method accept that subdomain as a parameter.
If I absolutely could find no way around storing request data in threadlocals, I would at least implement wrapper methods in a separate module that access threadlocals and call the model methods with the needed data. This way the models.py remains self-contained and models can be used without the threadlocals coupling.
I don't think there is anything wrong with threadlocals - yes, it is a global variable, but besides that it's a normal tool. We use it just for this purpose (storing subdomain model in the context global to the current request from middleware) and it works perfectly.
So I say, use the right tool for the job, in this case threadlocals make your app much more elegant than passing subdomain model around in all the model methods (not mentioning the fact that it is even not always possible - when you are overriding django manager methods to limit queries by subdomain, you have no way to pass anything extra to get_query_set, for example - so threadlocals is the natural and only answer).
Also a lot of Java frameworks seem to be using threadlocals a lot, so how is their case different from Python/Django 's?
CPython's interpreter has a Global Interpreter Lock (GIL) which means that only one Python thread can be executed by the interpreter at any given time. It isn't clear to me that a Python interpreter implementation would necessarily need to use more than one operating system thread to achieve this, although in practice CPython does.
Java's main locking mechanism is via objects' monitor locks. This is a decentralized approach that allows the use of multiple concurrent threads on multi-core and or multi-processor CPUs, but also produces much more complicated synchronization issues for the programmer to deal with.
These synchronization issues only arise with "shared-mutable state". If the state isn't mutable, or as in the case of a ThreadLocal it isn't shared, then that is one less complicated problem for the Java programmer to solve.
A CPython programmer still has to deal with the possibility of race conditions, but some of the more esoteric Java problems (such as publication) are presumably solved by the interpreter.
A CPython programmer also has the option to code performance critical code in Python-callable C or C++ code where the GIL restriction does not apply. Technically a Java programmer has a similar option via JNI, but this is rightly or wrongly considered less acceptable in Java than in Python.
You want to use threadlocals when you're working with multiple threads and want to localize some objects to a specific thread, eg. having one database connection for each thread.
In your case, you want to use it more as a global context (if I understand you correctly), which is probably a bad idea. It will make your app a bit slower, more coupled and harder to test.
Why is passing it from request not worth it? Why don't you store it in session or user profile?
There difference with Java is that web development there is much more stateful than in Python/PERL/PHP/Ruby world so people are used to all kind of contexts and stuff like that. I don't think that is an advantage, but it does seem like it at the beginning.
I have found using ThreadLocal is an excellent way to implement Dependency Injection in a HTTP request/response environment (i.e. any webapp). You just set up a servlet filter to 'inject' the object you need into the thread on receiving the request and 'uninject' it on returning the response.
It's a smart man's DI without all the XML ugliness, without the MB of Spring Jars (not to mention its learning curve) and without all the cryptic repetitive #annotation nonsense and because it doesn't individually inject many object instances with the dependencies it's probably a heck of a lot faster and uses less memory.
It worked so well we opened sourced our exPOJO Filter that can inject a Hibernate session or a JDO PersistenceManager using ThreadLocal:
http://www.expojo.com

Comparing persistent storage solutions in python

I'm starting on a new scientific project which has a lot of data (millions of entries) I'd like to store in an easily and quickly accessible format. I've come across a number of different potential options, but I'm not sure how to pick amongst them. My data can probably just be stored as a dictionary, or potentially a dictionary of dictionaries. Some potential considerations:
Speed. I can't load all the data off disk every time I start a new script, and I'd like as quick access to random entries as possible.
Ease-of-use. This is python. The storage should feel like python.
Stability/maturity. I'd like something that's currently supported, although something that works well but is still in development would be fine.
Ease of installation. My sysadmin should be able to get this running on our cluster.
I don't really care that much about the size of the storage, but it could be a consideration if an option is really terrible on this front. Also, if it matters, I'll most likely be creating the database once, and thereafter only reading from it.
Some potential options that I've started looking at (see this post):
pyTables
ZopeDB
shove
shelve
redis
durus
Any suggestions on which of these might be better for my purposes? Any better ideas? Some of these have a back-end; any suggestions on which file-system back-end would be best?
Might want to give mongodb a shot - the PyMongo library works with dictionaries and supports most Python types. Easy to install, very performant + scalable. MongoDB (and PyMongo) is also used in production at some big names.
A RDBMS.
Nothing is more realiable than using tables on a well known RDBMS. Postgresql comes to mind.
That automatically gives you some choices for the future like clustering. Also you automatically have a lot of tools to administer your database, and you can use it from other software written in virtually any language.
It is really fast.
In the "feel like python" point, I might add that you can use an ORM. A strong name is sqlalchemy. Maybe with the elixir "extension".
Using sqlalchemy you can leave your user/sysadmin choose which database backend he wants to use. Maybe they already have MySql installed - no problem.
RDBMSs are still the best choice for data storage.
I'm working on such a project and I'm using SQLite.
SQLite stores everything in one file and is part of Python's standard library. Hence, installation and configuration is virtually for free (ease of installation).
You can easily manage the database file with small Python scripts or via various tools. There is also a Firefox plugin (ease of installation / ease-of-use).
I find it very convenient to use SQL to filter/sort/manipulate/... the data. Although, I'm not an SQL expert. (ease-of-use)
I'm not sure if SQLite is the fastes DB system for this work and it lacks some features you might need e.g. stored procedures.
Anyway, SQLite works for me.
if you really just need dictionary-like storage, some of the new key/value or column stores like Cassandra or MongoDB might provide a lot more speed than you'd get with a relational database. Of course if you decide to go with RDBMS, SQLAlchemy is the way to go (disclaimer: I am its creator), but your desired featurelist seems to lean in the direction of "I just want a dictionary that feels like Python" - if you aren't interested in relational queries or strong ACIDity, those facets of RDBMS will probably feel cumbersome.
Sqlite -- it comes with python, fast, widely availible and easy to maintain
If you only need simple (dict like) access mechanisms and need efficiency for processing a lot of data, then HDF5 might be a good option. If you are going to be using numpy then it is really worth considering.
Go with a RDBMS is reliable scalable and fast.
If you need a more scalabre solution and don't need the features of RDBMS, you can go with a key-value store like couchdb that has a good python api.
The NEMO collaboration (building a cosmic neutrino detector underwater) had much of the same problems, and they used mysql and postgresql without major problems.
It really depends on what you're trying to do. An RDBMS is designed for relational data, so if your data is relational, then use one of the various SQL options. But it sounds like your data is more oriented towards a key-value store with very fast random GET operations. If that's the case, compare the benchmarks of the various key-stores, focusing on the GET speed. The ideal key-value store will keep or cache requests in memory, and be able to handle many GET requests concurrently. You may actually want to create your own benchmark suite so you can effectively compare random concurrent GET operations.
Why do you need a cluster? Is the size of each value very large? If not, you shouldn't need a cluster to handle storage of a million entries. But if you're storing large blobs of data, that matters, and you may need something easily supports read slaves and/or transparent partitioning. Some of the key-value stores are document oriented and/or optimized for storing larger values. Redis is technically more storage efficient for larger values due to the indexing overhead required for fast GETs, but that doesn't necessarily mean it's slower. In fact, the extra indexing makes lookups faster.
You're the only one that can truly answer this question, and I strongly recommend putting together a custom benchmark suite to test available options with actual usage scenarios. The data you get from that will give you more insight than anything else.

How would one make Python objects persistent in a web-app?

I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which does not change often, plus user specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state on every request from the (semi-normalized) database data quickly becomes non-feasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared memory solutions but the only relevant thing I've found is POSH. However POSH doesn't seem to be widely used and I don't feel easy integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above mentioned issue is the actual bottleneck. I'm sure pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suites your solution.
Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage to learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, using session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether MySQL socket or named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) the measurements as fine-grained enough to be useful and stable enough to providing meaningful comparison of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.
I think that the multiprocessing framework has what might be applicable here - namely the shared ctypes module.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.
I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initailly it was a integral part of Zope, but lately a standalone package is also available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read it but haven't given it a shot myself though.
Other possible thing could be a in-memory sqlite db, that may speed up the process a bit - being an in-memory db, but still you would have to do the serialization stuff and all.
Note: In memory db is expensive on resources.
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1
First of all your approach is not a common web development practice. Even multi threading is being used, web applications are designed to be able to run multi-processing environments, for both scalability and easier deployment .
If you need to just initialize a large object, and do not need to change later, you can do it easily by using a global variable that is initialized while your WSGI application is being created, or the module contains the object is being loaded etc, multi processing will do fine for you.
If you need to change the object and access it from every thread, you need to be sure your object is thread safe, use locks to ensure that. And use a single server context, a process. Any multi threading python server will serve you well, also FCGI is a good choice for this kind of design.
But, if multiple threads are accessing and changing your object the locks may have a really bad effect on your performance gain, which is likely to make all the benefits go away.
This is Durus, a persistent object system for applications written in the Python
programming language. Durus offers an easy way to use and maintain a consistent
collection of object instances used by one or more processes. Access and change of a
persistent instances is managed through a cached Connection instance which includes
commit() and abort() methods so that changes are transactional.
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.
Another option is to review the requirement for state, it sounds like if the serialisation is the bottle neck then the object is very large. Do you really need an object that large?
I know in the Stackoverflow podcast 27 the reddit guys discuss what they use for state, so that maybe useful to listen to.

Categories

Resources