It seems to me like going through the whole process of creating the expression tree and then creating a query from it again is wasted time when using SQLAlchemy. Apart from an occasional dynamic query, almost everything will be exactly the same for the whole life of an application (apart from the parameters, of course).
Is there any way to just save a query once it's created and reuse it later on with different parameters? Or maybe there's some internal mechanism which already does something similar?
It seems to me like going through the whole process of creating the expression tree and then creating a query from it again is wasted time when using SQLAlchemy.
Do you have any estimates on how much time is wasted, compared to the rest of the application? Profiling here is extremely important before making your program more complex. As I will often note, Reddit serves well over one billion page views a day; they use the SQLAlchemy Core to query their database, and the last time I looked at their code they make no attempt to optimize this process - they build expression trees on the fly and compile each time. We have had users who determined that their specific system actually benefits from optimizations in these areas, however.
I've written up some background on profiling here: How can I profile a SQLAlchemy powered application?
Is there any way to just save a query once it's created and reuse it later on with different parameters? Or maybe there's some internal mechanism which already does something similar?
There are several methods, depending on what APIs you're using and what areas you'd like to optimize.
There's two main portions to rendering SQL - there's the construction of the expression tree, so to speak, and then the compilation of the string from the expression tree.
The tree itself, which can either be a select() construct if using Core or a Query() if using ORM, can be reused. A select() especially has nothing associated with it that prevents it from being reused as often as you like (same for insert(), delete(), update(), etc.).
In the ORM, a Query can also be used with different sessions using the with_session() method. The win here is not as great, since Query() still produces an entire select() each time it is invoked. However, as we'll see below, there is a recipe that allows this to be cached.
The next level of optimization involves the caching of the actual SQL text rendered. This is an area where a little more care is needed, as the SQL we generate is specific to the target backend; there are also edge cases where various parameterizations change the SQL itself (such as using "TOP N ROWS" with SQL Server - we can't use a bound parameter there). Caching of the SQL strings is provided via the execution_options() method of Connection, which is also available in a few other places: set the compiled_cache option to a dictionary or other dict-like object, and it will cache the string form of statements, keyed to the dialect, the identity of the construct, and the parameters sent. This feature is normally used by the ORM for insert/update/delete statements.
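As a rough illustration (not from the original answer), here is a minimal sketch of reusing a Core select() construct and attaching a compiled_cache dictionary via execution_options(); the table and column names are invented, and the code is written against the 1.x-era Core API:

from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, select, bindparam)

engine = create_engine("sqlite://")
metadata = MetaData()
users = Table("users", metadata,
              Column("id", Integer, primary_key=True),
              Column("name", String))
metadata.create_all(engine)

# Build the expression tree once; nothing ties it to a particular connection.
stmt = select([users]).where(users.c.name == bindparam("name"))

# A plain dict serves as the compiled-statement cache.
cache = {}
conn = engine.connect().execution_options(compiled_cache=cache)

for n in ("alice", "bob"):
    # The statement is compiled to a string once, then fetched from `cache`.
    result = conn.execute(stmt, name=n).fetchall()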
There's a recipe I've posted which integrates the compiled_cache feature with the Query, at BakedQuery. By taking any Query and saying query.bake(), you can now run that query with any Session and as long as you don't call any more chained methods on it, it should use a cached form of the SQL string each time:
from sqlalchemy import bindparam  # needed for the bound parameter below

q = s.query(Foo).filter(Foo.data == bindparam('foo')).bake()

for i in range(10):
    result = q.from_session(s).params(foo='data 12').all()
It's experimental, and not used very often, but it's exactly what you're asking for here. I'd suggest you tailor it to your needs, keep an eye on memory usage as you use it and make sure you follow how it works.
SQLAlchemy 1.0 introduced the baked extension which is a cache designed specifically for caching Query objects. It caches the object's construction and string-compilation steps to minimize the overhead of the Python interpreter on repetitive queries.
Official docs here: http://docs.sqlalchemy.org/en/latest/orm/extensions/baked.html
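A minimal sketch of what using the baked extension looks like, reusing the hypothetical Foo model and the 'foo' bound parameter from the recipe above:

from sqlalchemy import bindparam
from sqlalchemy.ext import baked

# The bakery holds the cache of built-up Query objects and their compiled SQL.
bakery = baked.bakery()

def lookup(session, value):
    # The construction work inside each lambda runs once; later calls reuse the cache.
    baked_query = bakery(lambda s: s.query(Foo))
    baked_query += lambda q: q.filter(Foo.data == bindparam("foo"))
    return baked_query(session).params(foo=value).all()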
Note that it does NOT cache the result set returned by the database. For that, you'll want to check out dogpile.cache:
http://docs.sqlalchemy.org/en/latest/orm/examples.html#module-examples.dogpile_caching
Related
I am finding Neo4j slow to add nodes and relationships/arcs/edges when using the REST API via py2neo for Python. I understand that this is due to each REST API call executing as a single self-contained transaction.
Specifically, adding a few hundred pairs of nodes with relationships between them takes a number of seconds, running on localhost.
What is the best approach to significantly improve performance whilst staying with Python?
Would using bulbflow and Gremlin be a way of constructing a bulk insert transaction?
Thanks!
There are several ways to do a bulk create with py2neo, each making only a single call to the server.
1. Use the create method to build a number of nodes and relationships in a single batch (sketched below).
2. Use a Cypher CREATE statement.
3. Use the new WriteBatch class (just released this week) to manually make a batch of nodes and relationships (this is really just a manual version of 1).
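For option 1, a rough sketch of what a single batched create call looked like with the py2neo API of that era (the server URL and properties are placeholders, and newer py2neo releases have changed these interfaces, so treat this purely as illustration):

from py2neo import neo4j

graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")

# One call to the server creates two nodes and the relationship between them.
# Plain dicts describe abstract nodes; the tuple refers to them by batch position.
alice, bob, knows = graph_db.create(
    {"name": "Alice"},
    {"name": "Bob"},
    (0, "KNOWS", 1),
)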
If you have some code, I'm happy to look at it and make suggestions on performance tweaks. There are also quite a few tests you may be able to get inspiration from.
Cheers,
Nige
Neo4j's write performance is slow unless you are doing a batch insert.
The Neo4j batch importer (https://github.com/jexp/batch-import) is the fastest way to load data into Neo4j. It's a Java utility, but you don't need to know any Java because you're just running the executable. It handles typed data and indexes, and it imports from a CSV file.
To use it with Bulbs (http://bulbflow.com/) Models, use the model's get_bundle() method to get the data, index name, and index keys, already prepared for insert, and then output the data to a CSV file. Or, if you don't want to model your data, just output your data from Python to the CSV file.
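For the "output your data from Python to the CSV file" route, a minimal sketch using the standard library (the node data and column layout are invented, and the header/type convention shown, such as age:int, should be verified against the batch importer's README):

import csv

# Hypothetical node data to be written out for the batch importer.
people = [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": 27},
]

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # The first row is a header describing the properties and, optionally, their types.
    writer.writerow(["name", "age:int"])
    for person in people:
        writer.writerow([person["name"], person["age"]])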
Will that work for you?
There are so many old answers to this question online that it took me forever to realize there's an import tool that comes with Neo4j. It's very fast and the best tool I was able to find.
Here's a simple example if we want to import student nodes:
bin/neo4j-import --into [path-to-your-neo4j-directory]/data/graph.db --nodes students
The students file contains data that looks like this, for example:
studentID:ID(Student),name,year:int,:LABEL
1111,Amy,2000,Student
2222,Jane,2012,Student
3333,John,2013,Student
Explanation:
The header explains how the data below it should be interpreted.
studentID is used as the node's ID (within the Student ID space, hence ID(Student)).
name is of type string, which is the default.
year is an integer.
:LABEL is the label you want for these nodes; in this case it is "Student".
Here's the documentation for it: http://neo4j.com/docs/stable/import-tool-usage.html
Note: I realize the question specifically mentions Python, but, like another useful answer here, this is a non-Python solution.
Well, I myself needed massive performance from Neo4j. I ended up doing the following things to improve graph performance:
1. Ditched py2neo, since there were a lot of issues with it. Besides, it is very convenient to use the REST endpoint provided by Neo4j; just make sure to use request sessions.
2. Used raw Cypher queries for bulk inserts instead of any OGM (Object-Graph Mapper). That is crucial if you need a high-performance system.
3. Performance was still not enough for my needs, so I ended up writing a custom system that merges 6-10 queries together using WITH * and UNION clauses. That improved performance by a factor of 3 to 5.
4. Used a larger transaction size, with at least 1000 queries per transaction.
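As a rough illustration of points 1, 2, and 4 above - raw Cypher batched into one transaction and sent over a reused requests session - something along these lines (the URL, label, and properties are placeholders, and this targets the Neo4j 2.x-era transactional HTTP endpoint; older Neo4j versions used {name} instead of $name for Cypher parameters):

import requests

session = requests.Session()  # reuse the underlying TCP connection across calls
url = "http://localhost:7474/db/data/transaction/commit"

# Batch many small CREATE statements into a single transaction.
statements = [
    {
        "statement": "CREATE (p:Person {name: $name, year: $year})",
        "parameters": {"name": "Student %d" % i, "year": 2000 + i},
    }
    for i in range(1000)
]

resp = session.post(url, json={"statements": statements})
resp.raise_for_status()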
To insert a bulk of nodes at very high speed into Neo4j, use the Batch Inserter:
http://neo4j.com/docs/stable/batchinsert-examples.html
In my case I'm working in Java.
I'm beginning to develop a site in Pyramid and, before I commit to using SQLAlchemy, would like to know if it's possible to wrap/extend it to add in 'database lock' functionality.
One quick example as to why I'd like this functionality is for write throttling. My wrapper will be able to detect if a user is flooding the database with writes and, if they are, they'll be prevented from further writes for X amount of time.
I was looking into extending sqlalchemy.orm.session.Session and overriding the add method, which would perform this throttle check. If the user passes the check, it would simply pass the query off to super(MyWrapper, self).query(*args, **kwargs)
This is easy enough to do. However, it only adds the throttle functionality to DBSession.query. If somewhere in my code I use DBSession.execute, the throttle check is bypassed.
Is there a cleaner way to accomplish this?
Detecting excessive network traffic from particular clients is something you might be doing outside the ORM, even outside the Python app, like at the network or database client configuration level.
If within the Python app, definitely not in the ORM. add() doesn't correspond very cleanly to a SQL statement in any case (no SQL is emitted until flush(), and only if the given object was previously pending. add() also cascades to many objects and can result in any number of INSERT statements).
For a simple count on statements, cursor execute events are the best way to go. This gives you a hook at the point of calling execute() on the DBAPI cursor. See before_cursor_execute() at http://www.sqlalchemy.org/docs/core/events.html#connection-events.
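A minimal sketch of that hook (the threshold, time window, and the idea of raising an exception to block the statement are all illustrative choices, not part of the linked docs; per-user attribution would have to come from your own application context):

import time
from collections import deque
from sqlalchemy import create_engine, event

engine = create_engine("sqlite://")

_recent = deque()      # timestamps of recently executed statements
MAX_PER_MINUTE = 100   # arbitrary illustrative threshold

@event.listens_for(engine, "before_cursor_execute")
def throttle_check(conn, cursor, statement, parameters, context, executemany):
    now = time.time()
    _recent.append(now)
    # Discard timestamps older than the 60-second window.
    while _recent and now - _recent[0] > 60:
        _recent.popleft()
    if len(_recent) > MAX_PER_MINUTE:
        raise RuntimeError("write throttled: too many statements per minute")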
I noticed that a significant part of my (pure Python) code deals with tables. Of course, I have class Table which supports the basic functionality, but I end up adding more and more features to it, such as queries, validation, sorting, indexing, etc.
I am starting to wonder if it's a good idea to remove my class Table, and refactor the code to use a regular relational database that I will instantiate in-memory.
Here's my thinking so far:
Performance of queries and indexing would improve, but communication between Python code and a separate database process might be less efficient than between Python functions. I assume that would be too much overhead, so I would have to go with SQLite, which comes with Python and lives in the same process. I hope this means it's a pure performance gain (at the cost of the non-standard SQL dialect and limited features of SQLite).
With SQL, I will get a lot more powerful features than I would ever want to code myself. Seems like a clear advantage (even with sqlite).
I won't need to debug my own implementation of tables, but debugging mistakes in SQL is hard since I can't set breakpoints or easily print out intermediate state. I don't know how to judge the overall impact on my code's reliability and debugging time.
The code will be easier to read, since instead of calling my own custom methods I would write SQL (everyone who needs to maintain this code knows SQL). However, the Python code to deal with the database might be uglier and more complex than the code that uses the pure Python class Table. Again, I don't know which is better on balance.
Any corrections to the above, or anything else I should think about?
SQLite does not run in a separate process. So you don't actually have any extra overhead from IPC. But IPC overhead isn't that big, anyway, especially over e.g., UNIX sockets. If you need multiple writers (more than one process/thread writing to the database simultaneously), the locking overhead is probably worse, and MySQL or PostgreSQL would perform better, especially if running on the same machine. The basic SQL supported by all three of these databases is the same, so benchmarking isn't that painful.
You generally don't have to do the same type of debugging on SQL statements as you do on your own implementation. SQLite works, and is fairly well debugged already. It is very unlikely that you'll ever have to debug "OK, that row exists, why doesn't the database find it?" and track down a bug in index updating. Debugging SQL is completely different than procedural code, and really only ever happens for pretty complicated queries.
As for debugging your code, you can fairly easily centralize your SQL calls and add tracing to log the queries you are running, the results you get back, etc. The Python SQLite interface may already have this (not sure, I normally use Perl). It'll probably be easiest to just make your existing Table class a wrapper around SQLite.
I would strongly recommend not reinventing the wheel. SQLite will have far fewer bugs, and save you a bunch of time. (You may also want to look into Firefox's fairly recent switch to using SQLite to store history, etc., I think they got some pretty significant speedups from doing so.)
Also, SQLite's well-optimized C implementation is probably quite a bit faster than any pure Python implementation.
You could try to make an SQLite wrapper with the same interface as your class Table, so that you keep your code clean and get the SQLite performance.
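A rough sketch of that idea (the Table interface here is invented for illustration; set_trace_callback, used for query logging, needs Python 3.3+):

import sqlite3

class Table:
    """Keeps the old Table-style interface but stores the rows in an in-memory SQLite db."""

    def __init__(self, name, columns, log=None):
        self.name = name
        self.conn = sqlite3.connect(":memory:")
        if log is not None:
            # Log every SQL statement that actually runs (Python 3.3+).
            self.conn.set_trace_callback(log)
        self.conn.execute("CREATE TABLE %s (%s)" % (name, ", ".join(columns)))

    def insert(self, row):
        placeholders = ", ".join("?" for _ in row)
        self.conn.execute("INSERT INTO %s VALUES (%s)" % (self.name, placeholders), row)

    def select(self, where="1", params=()):
        query = "SELECT * FROM %s WHERE %s" % (self.name, where)
        return self.conn.execute(query, params).fetchall()

t = Table("people", ["name", "age"], log=print)
t.insert(("Alice", 34))
print(t.select("age > ?", (30,)))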
If you're doing database work, use a database; if you're not, then don't. Using tables, it sounds like you are. I'd recommend using an ORM to make it more Pythonic. SQLAlchemy is the most flexible (though it's not strictly just an ORM).
We've worked hard to work up a full dimensional database model of our problem, and now it's time to start coding. Our previous projects have used hand-crafted queries constructed by string manipulation.
Is there any best/standard practice for interfacing between python and a complex database layout?
I've briefly evaluated SQLAlchemy, SQLObject, and Django-ORM, but (I may easily be missing something) they seem tuned for tiny web-type (OLTP) transactions, whereas I'm doing high-volume analytical (OLAP) transactions.
Some of my requirements, that may be somewhat different than usual:
load large amounts of data relatively quickly
update/insert small amounts of data quickly and easily
handle large numbers of rows easily (300 entries per minute over 5 years)
allow for modifications in the schema, for future requirements
Writing these queries is easy, but writing the code to get the data all lined up is tedious, especially as the schema evolves. This seems like something that a computer might be good at?
Don't get confused by your requirements. One size does not fit all.
load large amounts of data relatively quickly
Why not use the database's native loaders for this? Use Python to prepare files, but use database tools to load. You'll find that this is amazingly fast.
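For example, a minimal sketch of the "prepare with Python, load with the database tool" pattern, assuming PostgreSQL and psycopg2 (the table, columns, and file name are placeholders):

import csv
import psycopg2

# Step 1: use Python to prepare a flat file.
rows = [(1, "widget", 9.99), (2, "gadget", 4.50)]
with open("facts.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Step 2: hand the file to the database's bulk loader (COPY), not row-by-row INSERTs.
conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur, open("facts.csv") as f:
    cur.copy_expert("COPY fact_sales (id, product, amount) FROM STDIN WITH CSV", f)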
update/insert small amounts of data quickly and easily
That starts to bend the rules of a data warehouse, unless you're talking about Master Data Management to update reporting attributes of a dimension.
That's what ORMs and web frameworks are for.
handle large numbers of rows easily (300 entries per minute over 5 years)
Again, that's why you use a pipeline of Python front-end processing, but the actual INSERTs are done by database tools. Not Python.
alter schema (along with python interface) easily, for future requirements
You have almost no use for automating this. It's certainly your lowest priority task for "programming". You'll often do this manually in order to preserve data properly.
BTW, "hand-crafted queries constructed by string manipulation" is probably the biggest mistake ever. These are hard for the RDBMS parser to handle -- they're slower than using queries that have bind variables inserted.
I'm using SQLAlchemy with a pretty big data warehouse and I'm using it for the full ETL process with success, especially for sources where I have complex transformation rules or heterogeneous sources (such as web services). I'm not using the SQLAlchemy ORM but rather its SQL Expression Language, because I don't really need to map anything to objects in the ETL process. Worth noticing that when I'm bringing in a verbatim copy of one of the sources, I'd rather use the db tools for that - such as the PostgreSQL dump utility. You can't beat that.
The SQL Expression Language is the closest you will get with SQLAlchemy (or any ORM, for that matter) to hand-writing SQL, but since you can programmatically generate the SQL from Python, you will save time, especially if you have some really complex transformation rules to follow.
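A small sketch of what "programmatically generating the SQL from Python" can look like with the Expression Language (the staging table and the aggregation rule are invented for illustration, and the code uses the 1.x-style select([...]) API):

from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, select, func)

engine = create_engine("sqlite://")
metadata = MetaData()
staging = Table("staging_orders", metadata,
                Column("id", Integer, primary_key=True),
                Column("region", String),
                Column("amount", Integer))
metadata.create_all(engine)

def build_query(regions=None, min_total=None):
    # Transformation rules are ordinary Python: compose the statement conditionally.
    stmt = select([staging.c.region, func.sum(staging.c.amount)]).group_by(staging.c.region)
    if regions:
        stmt = stmt.where(staging.c.region.in_(regions))
    if min_total is not None:
        stmt = stmt.having(func.sum(staging.c.amount) >= min_total)
    return stmt

with engine.connect() as conn:
    print(conn.execute(build_query(regions=["EMEA", "APAC"], min_total=100)).fetchall())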
One thing, though: I'd rather modify my schema by hand. I don't trust any tool for that job.
SQLAlchemy, definitely. Compared to SQLAlchemy, all other ORMs look like a child's toy, especially the Django ORM. What Hibernate is to Java, SQLAlchemy is to Python.
I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which does not change often, plus user specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state on every request from the (semi-normalized) database data quickly becomes non-feasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there are several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared memory solutions but the only relevant thing I've found is POSH. However POSH doesn't seem to be widely used and I don't feel easy integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above-mentioned issue is the actual bottleneck. I'm pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suits your solution.
Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage it to learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, use session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether MySQL socket or named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) measurements that are fine-grained enough to be useful and stable enough to provide a meaningful comparison of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.
I think the multiprocessing framework has something that might be applicable here - namely, its shared ctypes support.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.
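A minimal sketch of the shared-ctypes idea (this shares a flat numeric value between processes started by multiprocessing itself; as the caveat above says, it is unclear how well this carries over to independently started server processes):

from multiprocessing import Process, Value, Lock

def worker(counter, lock):
    for _ in range(1000):
        with lock:
            counter.value += 1   # lives in shared memory; no serialization round trip

if __name__ == "__main__":
    counter = Value("i", 0)      # a shared C int
    lock = Lock()
    procs = [Process(target=worker, args=(counter, lock)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)         # 4000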
I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initially it was an integral part of Zope, but lately a standalone package has also become available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read it but haven't given it a shot myself though.
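For reference, a rough sketch of the dictionary-style usage described above (the file name and keys are placeholders; the imports come from the standalone ZODB package plus the transaction package it depends on):

import transaction
from ZODB import FileStorage, DB

storage = FileStorage.FileStorage("app_state.fs")
db = DB(storage)
connection = db.open()
root = connection.root()            # behaves like a persistent dictionary

root["user_state"] = {"step": 1}    # anything picklable can go in
transaction.commit()                # persist the change

state = root["user_state"]
state["step"] += 1                  # small incremental update
root["user_state"] = state          # re-assign so the change is noticed
transaction.commit()

db.close()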
Another possibility could be an in-memory SQLite db; being in-memory, it may speed up the process a bit, but you would still have to do the serialization work and all that.
Note: an in-memory db is expensive on resources.
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1
First of all, your approach is not common web development practice. Even when multithreading is used, web applications are designed to be able to run in multi-process environments, for both scalability and easier deployment.
If you just need to initialize a large object and do not need to change it later, you can do so easily by using a global variable that is initialized while your WSGI application is being created, or when the module containing the object is loaded, etc.; multi-processing will work fine for you.
If you need to change the object and access it from every thread, you need to make sure your object is thread safe; use locks to ensure that. And use a single server context, i.e. a single process. Any multi-threaded Python server will serve you well; FCGI is also a good choice for this kind of design, as sketched below.
But if multiple threads are accessing and changing your object, the locks may have a really bad effect on your performance gain, which is likely to make all the benefits go away.
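A bare-bones sketch of that lock-guarded shared object (single-process, multi-threaded server assumed; the per-user state structure is a placeholder):

import threading

class SharedState:
    """One instance shared by all request-handling threads within a single process."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}             # per-user algorithm state, keyed by user id

    def update(self, user_id, delta):
        with self._lock:             # serializes access; heavy contention erodes the gains
            state = self._state.setdefault(user_id, {"score": 0})
            state["score"] += delta
            return dict(state)       # hand back a copy so callers read outside the lock

shared = SharedState()               # module-level: created once per server process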
This is Durus, a persistent object system for applications written in the Python programming language. Durus offers an easy way to use and maintain a consistent collection of object instances used by one or more processes. Access and change of persistent instances is managed through a cached Connection instance, which includes commit() and abort() methods so that changes are transactional.
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.
Another option is to review the requirement for state: it sounds like, if the serialization is the bottleneck, then the object is very large. Do you really need an object that large?
I know that in Stack Overflow podcast 27 the Reddit guys discuss what they use for state, so that may be useful to listen to.