Python: interact with complex data warehouse

We've worked hard to put together a full dimensional database model of our problem, and now it's time to start coding. Our previous projects have used hand-crafted queries constructed by string manipulation.
Is there any best/standard practice for interfacing between python and a complex database layout?
I've briefly evaluated SQLAlchemy, SQLObject, and Django-ORM, but (I may easily be missing something) they seem tuned for tiny web-type (OLTP) transactions, whereas I'm doing high-volume analytical (OLAP) transactions.
Some of my requirements, that may be somewhat different than usual:
load large amounts of data relatively quickly
update/insert small amounts of data quickly and easily
handle large numbers of rows easily (300 entries per minute over 5 years)
allow for modifications in the schema, for future requirements
Writing these queries is easy, but writing the code to get the data all lined up is tedious, especially as the schema evolves. This seems like something that a computer might be good at?

Don't get confused by your requirements. One size does not fit all.
load large amounts of data relatively quickly
Why not use the database's native loaders for this? Use Python to prepare files, but use database tools to load. You'll find that this is amazingly fast.
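For example, a minimal sketch of that split, assuming PostgreSQL with psycopg2 and a made-up fact_sales staging table (adjust the names and connection string to your environment):

import csv
import psycopg2

# Python prepares a flat file; the database's native loader (COPY) ingests it.
rows = [(1, "2010-01-02", 42, 19.95), (2, "2010-01-02", 7, 3.50)]
with open("fact_sales.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

conn = psycopg2.connect("dbname=dwh user=etl")
with conn, conn.cursor() as cur, open("fact_sales.csv") as f:
    cur.copy_expert(
        "COPY fact_sales (sale_id, sale_date, product_key, amount) "
        "FROM STDIN WITH (FORMAT csv)", f)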
update/insert small amounts of data quickly and easily
That starts to bend the rules of a data warehouse. Unless you're talking about Master Data Management to update reporting attributes of a dimension.
That's what ORM's and web frameworks are for.
handle large numbers of rows easily (300 entries per minute over 5 years)
Again, that's why you use a pipeline of Python front-end processing, but the actual INSERTs are done by database tools, not Python.
alter schema (along with python interface) easily, for future requirements
You have almost no use for automating this. It's certainly your lowest priority task for "programming". You'll often do this manually in order to preserve data properly.
BTW, "hand-crafted queries constructed by string manipulation" is probably the biggest mistake ever. These are hard for the RDBMS parser to handle -- they're slower than using queries that have bind variables inserted.

I'm using SQLAlchemy with a pretty big data warehouse, and I'm using it for the full ETL process with success, especially for sources where I have some complex transformation rules or for heterogeneous sources (such as web services). I'm not using the SQLAlchemy ORM but rather its SQL Expression Language, because I don't really need to map anything to objects in the ETL process. Worth noting: when I'm bringing in a verbatim copy of one of the sources, I prefer to use the database tools for that -- such as PostgreSQL's dump utility. You can't beat that.
The SQL Expression Language is the closest you will get with SQLAlchemy (or any ORM, for that matter) to hand-writing SQL, but since you can programmatically generate the SQL from Python, you will save time, especially if you have some really complex transformation rules to follow.
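A small sketch of what that looks like with the SQL Expression Language, assuming SQLAlchemy 1.4+ (the star-schema tables and connection string below are invented):

from sqlalchemy import (MetaData, Table, Column, Integer, Numeric, String,
                        create_engine, select, func)

engine = create_engine("postgresql+psycopg2://etl:secret@localhost/dwh")
metadata = MetaData()

dim_product = Table("dim_product", metadata,
                    Column("product_key", Integer, primary_key=True),
                    Column("category", String(50)))
fact_sales = Table("fact_sales", metadata,
                   Column("product_key", Integer),
                   Column("amount", Numeric))

# The query is built programmatically and emitted as parameterized SQL.
stmt = (select(dim_product.c.category,
               func.sum(fact_sales.c.amount).label("total"))
        .join_from(fact_sales, dim_product,
                   fact_sales.c.product_key == dim_product.c.product_key)
        .group_by(dim_product.c.category))

with engine.connect() as conn:
    for category, total in conn.execute(stmt):
        print(category, total)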
One thing, though: I prefer to modify my schema by hand. I don't trust any tool for that job.

SQLAlchemy, definitely. Compared to SQLAlchemy, all other ORMs look like children's toys -- especially the Django ORM. What Hibernate is to Java, SQLAlchemy is to Python.

Related

Preferred (or recommended) way to store large amounts of simulation configurations, run values and final results

I am working with a network simulator. After making some extensions to it, I need to run a lot of different simulations and tests. I need to record:
simulation scenario configurations
values of some parameters (e.g. buffer sizes, signal qualities, position) per device per time unit t
final results computed from those recorded values
The second kind of data is needed for visualization after the simulation has run (simple animation, showing some statistics over time).
I am using Python with matplotlib etc. for post-processing the data and for writing a proper app (now considering PyQt or Django, but this is not the topic of the question). Now I am wondering what would be the best way to store this data?
My first guess was to use XML files, but the XML syntax adds a lot of overhead (I mean, the files can grow very large, especially for the second kind of data). So I tried to design a database... but that also doesn't seem to me to be the proper way... Maybe a mix of both?
I have tried to find some clues on Google, but found nothing special. Have you ever had a need to store such data? How have you done it? Is there any "design pattern" for that?
Separate concerns:
Apart from pondering on the technology to use for storing data (DBMS, CSV, or maybe one of the specific formats for scientific data), note that you have three very different kinds of data to manage:
Simulation scenario configurations: these are (typically) rather small, but they need to be simple to edit, simple to re-use, and should make it possible to reproduce a simulation run. Here, text or code files seem to be a good choice (and they should also be version-controlled).
Raw simulation data: this is where you should be really careful if you are concerned with simulation performance, because writing 3 GB of data during a run can take a huge amount of time if implemented badly. One way to proceed would be to use existing file formats for this purpose (see below) and see if they work for you. If not, you can still use a DBMS. Also, it is usually a good idea to include a description of the scenario that generated the data (or at least a reference), as this helps you manage the results.
Data for post-processing: how to store this mostly depends on the post-processing tools. For example, if you already have a class structure for your visualization application, you could define a file format that makes it easy to read in the required data.
Look for existing solutions:
The problem you face (How to manage simulation data?) is fundamental and there are many potential solutions, each coming with certain trade-offs. As you are working in network simulation, check out what capabilities other tools used in your community provide. It could be that their developers ran into problems you are not even anticipating yet (regarding reproducibility etc.), and already found a good solution. For example, you could check out how OMNeT++ is handling simulation output: the simulation configurations are defined in a separate file, results are written to vec and sca files (depending on their nature). As far as I understood your problems with hierarchical data, this is supported as well (vectors get unique IDs and are associated with an attribute of some model entity).
Additional tools already work with these file formats, e.g. to convert them to other formats like CSV/MATLAB files, so you could even think of creating files in the same format (documented here) and using existing tools/converters for post-processing.
Many other simulation tools will have similar features, so take a look at what would work best for you.
It sounds like you need to record more or less the same kinds of information for each case, so a relational database sounds like a good fit -- why do you think it's "not the proper way"?
If your data fits in a collection of CSV files, you're most of the way to a relational database already! Just store in database tables instead, and you have support for foreign keys and queries. If you go on to implement an object-oriented solution, you can initialize your objects from the database.
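As a rough sketch of that path (the file name and columns are invented), the standard-library sqlite3 module is enough:

import csv
import sqlite3

conn = sqlite3.connect("simulation.db")
conn.execute("""CREATE TABLE IF NOT EXISTS samples (
                    run_id TEXT, t REAL, device TEXT, buffer_size INTEGER)""")

with open("results.csv", newline="") as f:
    rows = ((r["run_id"], float(r["t"]), r["device"], int(r["buffer_size"]))
            for r in csv.DictReader(f))
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Queries come for free, e.g. a per-device average:
for device, avg in conn.execute(
        "SELECT device, AVG(buffer_size) FROM samples GROUP BY device"):
    print(device, avg)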
If your data structures are well-known and stable AND you need some of the SQL querying / computation features then a light-weight relational DB like SQLite might be the way to go (just make sure it can handle your eventual 3+GB data).
Otherwise -- i.e., if each simulation scenario might need a dedicated data structure to store its results -- and you don't need any SQL features, then you might be better off using a more free-form solution (document-oriented database, OO database, filesystem + CSV, whatever).
Note that you can still use a SQL db in the second case, but you'll have to dynamically create tables for each resultset, and of course dynamically create the relevant SQL queries too.

Should I use complex SQL queries or process results in the application?

I am dealing with an application with huge SQL queries. They are so complex that when I finish understanding one I have already forgotten how it all started.
I was wondering if it would be good practice to pull more data from the database and do the final query in my code, let's say, with Python. Am I nuts? Would it be that bad for performance?
Note that the results are huge too; I am talking about an ERP in production, developed by other people.
Let the DB figure out how best to retrieve the information that you want, else you'll have to duplicate the functionality of the RDBMS in your code, and that will be way more complex than your SQL queries.
Plus, you'll waste time transferring all that unneeded information from the DB to your app, so that you can filter and process it in code.
All this is true because you say you're dealing with large data.
I would keep the business logic in the application, as much as possible. Complex business logic in queries is difficult to maintain ("when I finish understanding one I have already forgotten how it all started"). Complex logic in stored procedures is OK, but with a typical Python application you would want your business logic to be in Python.
Now, the database is way better at handling data than your application code. So if your logic involves huge amounts of data, you may get better performance with the logic in the database. But this is for complex reports, bookkeeping operations and the like that operate on a large volume of data. You may want to use stored procedures, or systems that specialize in such operations (a data warehouse for reports), for these types of operations.
Normal OLTP operations do not involve much data. The database may be huge, but the data required for a typical transaction will be (typically) a very small part of it. Querying this in a large database may cause performance issues, but you can optimize it in several ways (indexes, full-text search, redundancy, summary tables... depending on your actual problem).
Every rule has exceptions, but as a general guideline, try to have your business logic in your application code. Stored procedures for complex logic. A separate data warehouse or a set of procedures for reporting.
My experience is that you should let the database do the processing. It will be much faster than retrieving all data from the db first and have some code to join and filter the results.
The hard part here is to document your queries, so someone else (or even you) will understand what is going on if you look at them after some time. I found that most databases allow comments in SQL. Sometimes between /* comment */ tags and sometimes commenting a line with -- comment.
A documented query can look like this:
select name, dateofbirth from (
    -- The next subquery will retrieve ....
    select ....
) SR /* SR SubResults */
....
@Nivas is generally correct.
These are pretty common patterns:
Division of labour -- the DBAs have to return all the data the business needs, but they only have a database to work with. The developers could work with the DBAs to do it better, but departmental responsibilities make it nearly impossible, so SQL that does more than just retrieve data gets written.
Lack of smaller functions. Could the massive query be broken down into smaller stages, using working tables? Yes, but I have known environments where a new table needs reams of approvals -- so a heavy query is just written instead.
So, in general, getting data out of the database -- that's down to the database. But if a SQL query is too long, it's going to be hard for the RDBMS to optimise, and it probably means the query spans data access, business logic and even presentation in one go.
I would suggest that a saner approach is usually to separate out the "get me the data" portions into stored procedures or other controllable queries that populate staging tables. Then the business logic can be written in a scripting language sitting above and controlling the stored procedures, and presentation is left elsewhere. In essence, solutions like Cognos try to do this anyway.
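A rough sketch of that split, assuming PostgreSQL 11+ with psycopg2; the stored procedure, staging table and threshold are all made up:

import psycopg2

conn = psycopg2.connect("dbname=erp user=report")
with conn, conn.cursor() as cur:
    # Step 1: the database does the heavy "get me the data" work,
    # populating a staging table via a stored procedure.
    cur.execute("CALL populate_staging_sales(%s, %s)",
                ("2010-01-01", "2010-12-31"))

    # Step 2: business logic lives in the scripting layer, reading from
    # the (much smaller) staging table.
    cur.execute("SELECT customer_id, total FROM staging_sales")
    high_value = [(cid, total) for cid, total in cur.fetchall()
                  if total > 10000]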
But if you are looking at an ERP in production, the constraints and the solutions above probably already exist - are you talking to the right people?
One of my apps (A) uses tempdb to split complex queries into batches; another app (B) uses complex queries without that.
App A is more effective for a large DB.
But to take a look at your situation, you can execute a DMV script like this:
-- Get top total worker time queries for entire instance (Query 38) (Top Worker Time Queries)
SELECT TOP(50) DB_NAME(t.[dbid]) AS [Database Name], t.[text] AS [Query Text],
qs.total_worker_time AS [Total Worker Time], qs.min_worker_time AS [Min Worker Time],
qs.total_worker_time/qs.execution_count AS [Avg Worker Time],
qs.max_worker_time AS [Max Worker Time], qs.execution_count AS [Execution Count],
qs.total_elapsed_time/qs.execution_count AS [Avg Elapsed Time],
qs.total_logical_reads/qs.execution_count AS [Avg Logical Reads],
qs.total_physical_reads/qs.execution_count AS [Avg Physical Reads],
qp.query_plan AS [Query Plan], qs.creation_time AS [Creation Time]
FROM sys.dm_exec_query_stats AS qs WITH (NOLOCK)
CROSS APPLY sys.dm_exec_sql_text(plan_handle) AS t
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp
ORDER BY qs.total_worker_time DESC OPTION (RECOMPILE);
Then you can open the query plan for the heaviest queries and look for something like:
StatementOptmEarlyAbortReason="TimeOut"
or
StatementOptmEarlyAbortReason="MemoryLimitExceeded"
These flags tell you to split your complex query into batches using tempdb.
PS. This applies to otherwise good queries, without index/table scans, missing indexes and so on.

Handle large data pools in python

I'm working on an academic project aimed at studying people's behavior.
The project will be divided in three parts:
A program to read the data from some remote sources, and build a local data pool with it.
A program to validate this data pool, and to keep it coherent
A web interface to allow people to read/manipulate the data.
The data consists of a list of people, all with an ID #, and with several characteristics: height, weight, age, ...
I need to easily make groups out of this data (e.g. everyone with a given age, or within a range of heights), and the data is several TB in size (but can be reduced to smaller subsets of 2-3 GB).
I have a strong background in the theoretical side of the project, but I'm not a computer scientist. I know Java, C and MATLAB, and now I'm learning Python.
I would like to use Python since it seems easy enough and greatly reduces the verbosity of Java. The problem is that I'm wondering how to handle the data pool.
I'm no expert on databases, but I guess I need one here. What tools do you think I should use?
Remember that the aim is to implement very advanced mathematical functions on sets of data, thus we want to reduce complexity of source code. Speed is not an issue.
It sounds like the main functionality you need can be found in:
pytables
and
scipy/numpy
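For example, a minimal PyTables sketch (the column names are invented) that stores person records in an HDF5 file and pulls out a group with an in-kernel query:

import tables

class Person(tables.IsDescription):
    person_id = tables.Int64Col()
    age = tables.Int32Col()
    height = tables.Float32Col()   # cm
    weight = tables.Float32Col()   # kg

h5 = tables.open_file("pool.h5", mode="w")
people = h5.create_table("/", "people", Person)

row = people.row
for i in range(100000):            # stand-in for the real data import
    row["person_id"] = i
    row["age"] = 20 + i % 60
    row["height"] = 150.0 + i % 50
    row["weight"] = 50.0 + i % 40
    row.append()
people.flush()

# Grouping by characteristics runs inside the HDF5 file, not in Python loops:
group = [r["person_id"] for r in
         people.where("(age >= 25) & (age < 30) & (height > 180)")]
h5.close()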
Go with a NoSQL database like MongoDB; handling data in a case like this is much easier than having to learn SQL.
Since you aren't an expert, I recommend using a MySQL database as the backend for storing your data. It's easy to learn, and you'll be able to query your data using SQL and write it from Python; see this MySQL guide: Python-Mysql.

Pros and cons of using sqlite3 vs custom table implementation

I noticed that a significant part of my (pure Python) code deals with tables. Of course, I have a Table class which supports the basic functionality, but I end up adding more and more features to it, such as queries, validation, sorting, indexing, etc.
I am starting to wonder if it's a good idea to remove my Table class and refactor the code to use a regular relational database that I will instantiate in memory.
Here's my thinking so far:
Performance of queries and indexing would improve but communication between Python code and the separate database process might be less efficient than between Python functions. I assume that is too much overhead, so I would have to go with sqlite which comes with Python and lives in the same process. I hope this means it's a pure performance gain (at the cost of non-standard SQL definition and limited features of sqlite).
With SQL, I will get a lot more powerful features than I would ever want to code myself. Seems like a clear advantage (even with sqlite).
I won't need to debug my own implementation of tables, but debugging mistakes in SQL is hard since I can't set breakpoints or easily print out intermediate state. I don't know how to judge the overall impact on my code's reliability and debugging time.
The code will be easier to read, since instead of calling my own custom methods I would write SQL (everyone who needs to maintain this code knows SQL). However, the Python code to deal with database might be uglier and more complex than the code that uses pure Python class Table. Again, I don't know which is better on balance.
Any corrections to the above, or anything else I should think about?
SQLite does not run in a separate process. So you don't actually have any extra overhead from IPC. But IPC overhead isn't that big, anyway, especially over e.g., UNIX sockets. If you need multiple writers (more than one process/thread writing to the database simultaneously), the locking overhead is probably worse, and MySQL or PostgreSQL would perform better, especially if running on the same machine. The basic SQL supported by all three of these databases is the same, so benchmarking isn't that painful.
You generally don't have to do the same type of debugging on SQL statements as you do on your own implementation. SQLite works, and is fairly well debugged already. It is very unlikely that you'll ever have to debug "OK, that row exists, why doesn't the database find it?" and track down a bug in index updating. Debugging SQL is completely different than procedural code, and really only ever happens for pretty complicated queries.
As for debugging your code, you can fairly easily centralize your SQL calls and add tracing to log the queries you are running, the results you get back, etc. The Python SQLite interface may already have this (not sure, I normally use Perl). It'll probably be easiest to just make your existing Table class a wrapper around SQLite.
I would strongly recommend not reinventing the wheel. SQLite will have far fewer bugs, and save you a bunch of time. (You may also want to look into Firefox's fairly recent switch to using SQLite to store history, etc., I think they got some pretty significant speedups from doing so.)
Also, SQLite's well-optimized C implementation is probably quite a bit faster than any pure Python implementation.
You could try to make an SQLite wrapper with the same interface as your Table class, so that you keep your code clean and you get SQLite's performance.
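A rough sketch of such a wrapper; the insert/select interface below is hypothetical, so adapt it to whatever your Table class actually exposes. It also logs every statement that runs, as suggested above:

import sqlite3

class SqliteTable:
    """Drop-in stand-in for a hypothetical Table(name, columns) class,
    backed by an in-memory SQLite database."""

    def __init__(self, name, columns):
        self.name = name
        self.conn = sqlite3.connect(":memory:")
        self.conn.set_trace_callback(print)   # log every SQL statement (Python 3.3+)
        self.conn.execute("CREATE TABLE %s (%s)" % (name, ", ".join(columns)))

    def insert(self, *values):
        placeholders = ", ".join("?" for _ in values)
        self.conn.execute("INSERT INTO %s VALUES (%s)" % (self.name, placeholders),
                          values)

    def select(self, where="1=1", params=()):
        return self.conn.execute(
            "SELECT * FROM %s WHERE %s" % (self.name, where), params).fetchall()

t = SqliteTable("people", ["name TEXT", "age INTEGER"])
t.insert("Alice", 30)
print(t.select("age > ?", (25,)))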
If you're doing database work, use a database; if you're not, then don't. Since you're using tables, it sounds like you are. I'd recommend using an ORM to make it more Pythonic. SQLAlchemy is the most flexible (though it's not strictly just an ORM).

Comparing persistent storage solutions in python

I'm starting on a new scientific project which has a lot of data (millions of entries) I'd like to store in an easily and quickly accessible format. I've come across a number of different potential options, but I'm not sure how to pick amongst them. My data can probably just be stored as a dictionary, or potentially a dictionary of dictionaries. Some potential considerations:
Speed. I can't load all the data off disk every time I start a new script, and I'd like as quick access to random entries as possible.
Ease-of-use. This is python. The storage should feel like python.
Stability/maturity. I'd like something that's currently supported, although something that works well but is still in development would be fine.
Ease of installation. My sysadmin should be able to get this running on our cluster.
I don't really care that much about the size of the storage, but it could be a consideration if an option is really terrible on this front. Also, if it matters, I'll most likely be creating the database once, and thereafter only reading from it.
Some potential options that I've started looking at (see this post):
pyTables
ZopeDB
shove
shelve
redis
durus
Any suggestions on which of these might be better for my purposes? Any better ideas? Some of these have a back-end; any suggestions on which file-system back-end would be best?
Might want to give mongodb a shot - the PyMongo library works with dictionaries and supports most Python types. Easy to install, very performant + scalable. MongoDB (and PyMongo) is also used in production at some big names.
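A tiny PyMongo sketch, assuming a local MongoDB instance (the database/collection names are made up):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
entries = client["science"]["entries"]

# Documents are plain Python dictionaries, nested dicts included.
entries.insert_one({"run": 42,
                    "params": {"temp": 300.5, "steps": 10000},
                    "result": 1.234})

entries.create_index("run")              # fast random access by key
doc = entries.find_one({"run": 42})
print(doc["params"]["temp"])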
An RDBMS.
Nothing is more reliable than using tables in a well-known RDBMS. PostgreSQL comes to mind.
That automatically gives you some choices for the future like clustering. Also you automatically have a lot of tools to administer your database, and you can use it from other software written in virtually any language.
It is really fast.
In the "feel like python" point, I might add that you can use an ORM. A strong name is sqlalchemy. Maybe with the elixir "extension".
Using sqlalchemy you can leave your user/sysadmin choose which database backend he wants to use. Maybe they already have MySql installed - no problem.
RDBMSs are still the best choice for data storage.
I'm working on such a project and I'm using SQLite.
SQLite stores everything in one file and is part of Python's standard library. Hence, installation and configuration is virtually for free (ease of installation).
You can easily manage the database file with small Python scripts or via various tools. There is also a Firefox plugin (ease of installation / ease-of-use).
I find it very convenient to use SQL to filter/sort/manipulate/... the data. Although, I'm not an SQL expert. (ease-of-use)
I'm not sure if SQLite is the fastest DB system for this work, and it lacks some features you might need, e.g. stored procedures.
Anyway, SQLite works for me.
If you really just need dictionary-like storage, some of the newer key/value or column stores like Cassandra or MongoDB might provide a lot more speed than you'd get with a relational database. Of course, if you decide to go with an RDBMS, SQLAlchemy is the way to go (disclaimer: I am its creator), but your desired feature list seems to lean in the direction of "I just want a dictionary that feels like Python" -- if you aren't interested in relational queries or strong ACIDity, those facets of an RDBMS will probably feel cumbersome.
SQLite -- it comes with Python, is fast, widely available and easy to maintain.
If you only need simple (dict like) access mechanisms and need efficiency for processing a lot of data, then HDF5 might be a good option. If you are going to be using numpy then it is really worth considering.
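A minimal h5py sketch along those lines (the dataset name and shape are invented):

import numpy as np
import h5py

data = np.random.random((1000000, 3))       # e.g. a million (x, y, z) entries

with h5py.File("store.h5", "w") as f:
    f.create_dataset("positions", data=data, compression="gzip")

with h5py.File("store.h5", "r") as f:
    dset = f["positions"]
    chunk = dset[5000:5010]                 # random access; nothing else is loaded
    print(chunk.mean(axis=0))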
Go with an RDBMS; it is reliable, scalable and fast.
If you need a more scalable solution and don't need the features of an RDBMS, you can go with a key-value store like CouchDB, which has a good Python API.
The NEMO collaboration (building a cosmic neutrino detector underwater) had many of the same problems, and they used MySQL and PostgreSQL without major problems.
It really depends on what you're trying to do. An RDBMS is designed for relational data, so if your data is relational, then use one of the various SQL options. But it sounds like your data is more oriented towards a key-value store with very fast random GET operations. If that's the case, compare the benchmarks of the various key-stores, focusing on the GET speed. The ideal key-value store will keep or cache requests in memory, and be able to handle many GET requests concurrently. You may actually want to create your own benchmark suite so you can effectively compare random concurrent GET operations.
Why do you need a cluster? Is the size of each value very large? If not, you shouldn't need a cluster to handle storage of a million entries. But if you're storing large blobs of data, that matters, and you may need something that easily supports read slaves and/or transparent partitioning. Some of the key-value stores are document oriented and/or optimized for storing larger values. Redis is technically less storage efficient for larger values due to the indexing overhead required for fast GETs, but that doesn't necessarily mean it's slower. In fact, the extra indexing makes lookups faster.
You're the only one that can truly answer this question, and I strongly recommend putting together a custom benchmark suite to test available options with actual usage scenarios. The data you get from that will give you more insight than anything else.
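For instance, a minimal benchmark sketch that times random gets against any dict-like store (a plain dict here as the baseline; swap in shelve, a Redis wrapper, etc. behind the same interface):

import random
import time

def benchmark_random_gets(store, keys, n=100000):
    sample = [random.choice(keys) for _ in range(n)]
    start = time.time()
    for k in sample:
        store[k]
    return n / (time.time() - start)        # lookups per second

baseline = dict((i, {"value": i * 2}) for i in range(1000000))
print("%.0f gets/sec" % benchmark_random_gets(baseline, list(baseline)))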
