I have three Python object databases, constructed with the ZODB module, which I would like to merge into one. The reason I have three and not one is that each object belongs to one of three populations, and an object was added to its database once my code finished analyzing it. The analysis of each object can definitely be done in parallel. My code takes a few days to run, so to keep this from becoming a week-long endeavor, I have three computers each processing objects from one of the three populations, each outputting a single ZODB database once it has completed. I couldn't have three computers adding the analysis of objects from different populations to the same database because of the way ZODB handles conflicts: essentially, until you close the database, it is locked from the inside.
My questions are:
1) How can I merge multiple .fs database files into a single master database? The structure of each database is exactly the same - meaning the dictionary structures are the same between each. As an example, MyDB may represent the ZODB database structure of the first population:
root['MyDB']['ID123456']['property1']
root['MyDB']['ID123456']['property2']
...
root['MyDB']['ID123457']['property1']
root['MyDB']['ID123457']['property2']
...
where the ellipses represent more of the same. The names of the keys 'property1', 'property2', etc., are all the same for each 'IDXXXXXX' key within the database, though the values will certainly vary.
2) What would have been the smarter way to run this code in parallel while still ending up with a single ZODB structure?
Please let me know if clarification is needed.
Thanks!
The smarter thing would have been to use ZEO to share the ZODB storage among the processes.
ZEO shares a ZODB database across the network and extends the conflict resolution across multiple clients, which can reside on the same machine or elsewhere.
Alternatively, you could use the RelStorage backend to store your ZODB instead of using the standard FileStorage; this backend uses a traditional relational database to provide concurrent access instead.
See the question "zc.lockfile.LockError in ZODB" for some usage examples of either option.
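To give a feel for the ZEO route, here is a rough sketch; the address, file path and property values are made up, and each of the three analysis processes would run the client part against one shared server:

# on one machine, serve the single shared FileStorage:
#   runzeo -a 8090 -f /data/master.fs
# in each analysis process, connect as a ZEO client instead of opening a FileStorage:
import transaction
from persistent.mapping import PersistentMapping
from ZEO.ClientStorage import ClientStorage
from ZODB.DB import DB

db = DB(ClientStorage(('analysis-server', 8090)))
conn = db.open()
root = conn.root()

if 'MyDB' not in root:
    root['MyDB'] = PersistentMapping()
root['MyDB']['ID123456'] = {'property1': 1.23, 'property2': 4.56}  # illustrative values
transaction.commit()  # concurrent clients are coordinated by the ZEO server

conn.close()
db.close()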
The ZODB data structures are otherwise merely persisted Python data structures; merging the three ZODB data structures requires you to open each of the databases and merge the nested structures as needed.
Okay, since ZODB object databases are essentially just dictionaries of Python objects, this post happens to be the answer I was looking for. It describes how to add databases together and, in doing so, combines any keys the databases have in common. It's still the answer I'm looking for because my databases are mutually exclusive, so the result would be a single ZODB database that contains the unmodified entries of the other two.
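A minimal sketch of that kind of merge, assuming the per-ID entries hold plain picklable values and the three .fs files really share no IDs; the file names and the 'MyDB' key simply follow the example above:

import transaction
from persistent.mapping import PersistentMapping
from ZODB.DB import DB
from ZODB.FileStorage import FileStorage

def open_conn(path):
    db = DB(FileStorage(path))
    return db, db.open()

master_db, master_conn = open_conn('master.fs')
master_root = master_conn.root()
if 'MyDB' not in master_root:
    master_root['MyDB'] = PersistentMapping()

for path in ('population1.fs', 'population2.fs', 'population3.fs'):
    src_db, src_conn = open_conn(path)
    for obj_id, props in src_conn.root()['MyDB'].items():
        # rebuild each entry in the master connection so that no persistent
        # object from a source database is referenced across databases
        master_root['MyDB'][obj_id] = PersistentMapping(dict(props))
    src_conn.close()
    src_db.close()

transaction.commit()
master_conn.close()
master_db.close()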
Related
I have a Flask web app that has no registered users, but its database is updated daily (therefore the content only changes once a day).
It seems to me the best choice would be to cache the entire website once a day and serve everything from the cache.
I tried Flask-Cache, but a dynamic page is created and then cached for every different user session, which is clearly not ideal since the content is always the same no matter who's browsing the website.
Do you know how I can do better, either with Flask-Cache or using something else?
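For concreteness, a session-independent cache entry with Flask-Cache would look roughly like the sketch below; the key_prefix, timeout and helper function are illustrative assumptions, not the app's actual code:

from flask import Flask, render_template
from flask_cache import Cache

app = Flask(__name__)
cache = Cache(app, config={'CACHE_TYPE': 'simple'})

@app.route('/')
@cache.cached(timeout=24 * 60 * 60, key_prefix='home_page')  # one entry shared by every visitor
def home():
    # load_articles_from_db() is a placeholder for however the page is built today
    return render_template('home.html', articles=load_articles_from_db())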
Perhaps use an in-memory SQLite database? Will look and feel like any regular db, but with memory access speeds.
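A minimal sketch of that suggestion, assuming Python 3.7+ for Connection.backup and made-up file/table names; the :memory: copy would simply be rebuilt once a day:

import sqlite3

disk = sqlite3.connect('daily_content.db')   # the file that is refreshed once a day
mem = sqlite3.connect(':memory:')
disk.backup(mem)                             # copy the whole database into RAM
disk.close()

rows = mem.execute('SELECT title, body FROM articles').fetchall()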
A couple of years ago, I wrote an in-memory database which I called littletable. Tables are represented as lists of objects. Selects and queries are normally done by simple list scans, but common object properties can be indexed. Tables can be joined or pivoted.
The main difference in the littletable model is that there is no separate concept of a table vs. a results list. The result of any query or join is another table. Tables can also store namedtuples and a littletable-defined type called a DataObject. Tables can be imported/exported to CSV files to persist any updates.
There is at least one website that uses littletable to maintain its mostly-static product catalog. You might also find littletable useful for prototyping before creating actual tables in a more common database. Here's a link to the online docs.
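A rough sketch of what that looks like, pieced together from the description above (check the littletable docs for the exact, current API):

import littletable as lt

catalog = lt.Table('catalog')
catalog.create_index('sku', unique=True)
catalog.insert(lt.DataObject(sku='001', name='widget', price=9.99))
catalog.insert(lt.DataObject(sku='002', name='gadget', price=19.99))

widgets = catalog.where(name='widget')   # the result of a query is itself another Table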
I'm working on a project in Python and using SQLite3. I don't expect to be using any huge number of records (less than some other projects I've done that don't show any notable performance penalty) and I'm trying to decide if I should put the entire database in one file or multiple files. It's a ledger program that will be keeping names of all vendors, configuration info, and all data for the user in one DB file, but I was considering using a different DB file for each ledger (in the case of using different ledgers for different purposes or investment activities).
I know, from here, that I can do joins, when needed, across DBs in different files, so I don't see any reason I have to keep all the tables in one DB, but I also don't see a reason I need to split them up into different files.
How does using one DB in SQLite compare to using multiple DBs? What are the strengths and disadvantages to using one file or using multiple files? Is there a compelling reason for using one format over the other?
Here are a couple of points to consider. Feel free to add more in comments.
Advantages:
You can place each database file on a different physical drive and benefit from parallel read/write operations, making those operations slightly faster.
Disadvantages:
You won't be able to create foreign keys across databases.
Views that rely on tables from several databases will require you to attach all of those databases every time, using exactly the same names for the attached databases (a view is only compiled and validated when queried, so an incorrect SELECT statement inside it will only be reported at that point).
Triggers cannot operate cross-database, so a trigger on a table can only query tables from the same database.
Transactions will be atomic across databases, but only if the main database is neither in WAL mode nor a :memory: database.
In other words, you can achieve some speed boost (assuming you have physical drives to spare), but you lose some flexibility in database design and it's harder to maintain consistency.
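For reference, the cross-file joins mentioned in the question come down to ATTACH; a small sketch with invented file and table names, using the stdlib sqlite3 module:

import sqlite3

conn = sqlite3.connect('main.db')                         # vendors, configuration, user data
conn.execute("ATTACH DATABASE 'investments.db' AS inv")   # a separate ledger file

rows = conn.execute("""
    SELECT v.name, e.amount
    FROM vendors AS v
    JOIN inv.entries AS e ON e.vendor_id = v.id
""").fetchall()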
I'm impressed by the speed of running transformations, loading data and ease of use of Pandas and want to leverage all these nice properties (amongst others) to model some large-ish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask.
I'm currently using a Postgres database to store the data, but importing the data (which comes from CSV files) is slow, tedious and error-prone, and getting the data out of the database and processing it is not much easier. The data is never going to be changed once imported (no CRUD operations), so I thought it would be ideal to store it as several pandas DataFrames (stored in HDF5 format and loaded via PyTables).
The questions are:
(1) Is this a good idea, and what are the things to watch out for? (For instance, I don't expect concurrency problems, as DataFrames are (or should be) stateless and immutable, which is taken care of on the application side.) What else needs to be watched out for?
(2) How would I go about caching the data once it's loaded from the HDF5 file into a DataFrame, so it doesn't need to be loaded for every client request (at least for the most recent/frequent DataFrames)? Flask (or Werkzeug) has a SimpleCache class, but, internally, it pickles the data and unpickles the cached data on access. I wonder if this is necessary in my specific case (assuming the cached object is immutable). Also, is such a simple caching method usable when the system gets deployed with Gunicorn (is it possible to have static data, i.e. the cache, and can concurrent requests, possibly from different processes, access the same cache)?
I realise these are many questions, but before I invest more time and build a proof of concept, I thought I'd get some feedback here. Any thoughts are welcome.
Answers to some aspects of what you're asking for:
It's not quite clear from your description whether you have the tables in your SQL database only, stored as HDF5 files, or both. Something to look out for here is that if you use Python 2.x and create the files via pandas' HDFStore class, any strings will be pickled, leading to fairly large files. You can also generate pandas DataFrames directly from SQL queries using read_sql, for example.
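A small sketch of both routes just mentioned; the connection string, table and file names are invented:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://localhost/mydb')
df = pd.read_sql('SELECT * FROM measurements', engine)   # build a DataFrame straight from SQL

df.to_hdf('measurements.h5', 'measurements')             # persist once ...
df = pd.read_hdf('measurements.h5', 'measurements')      # ... and reload quickly later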
If you don't need any relational operations then I would say ditch the Postgres server; if it's already set up and you might need it in the future, keep using the SQL server. The nice thing about the server is that even if you don't expect concurrency issues, it will be handled automatically for you by (Flask-)SQLAlchemy, causing you less headache. In general, if you ever expect to add more tables (files), it's less of an issue to have one central database server than to maintain multiple files lying around.
Whichever way you go, Flask-Cache will be your friend, using either a memcached or a redis backend. You can then cache/memoize the function that returns a prepared DataFrame from either the SQL database or the HDF5 file. Importantly, it also lets you cache templates, which may play a role in displaying large tables.
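Roughly what the cache/memoize suggestion could look like with Flask-Cache and a redis backend; the dataset name, file layout and timeout are illustrative:

from flask import Flask
from flask_cache import Cache
import pandas as pd

app = Flask(__name__)
cache = Cache(app, config={'CACHE_TYPE': 'redis'})

@cache.memoize(timeout=24 * 60 * 60)     # one cached DataFrame per dataset name per day
def load_dataframe(name):
    return pd.read_hdf('%s.h5' % name, name)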
You could, of course, also create a global variable, for example where you create the Flask app, and just import it wherever it's needed. I have not tried this and would thus not recommend it; it might cause all sorts of concurrency issues.
Edit: Main Topic:
Is there a way I can force SqlAlchemy to pre-populate the session as far as it can? That is, synchronize as much state from the database as possible (there will be no DB updates at this point).
I am having some mild performance issues and I believe I have traced it to SqlAlchemy. I'm sure there are changes in my declarative and db-schema that could improve time, but that is not what I am asking about here.
My SqlAlchemy declarative defines 8 classes, my database has 11 tables with only 7 of them holding my real data, and in total my database has 800 records (all Integers and UnicodeText). My database engine is sqlite and the actual size is currently 242Kb.
Really, the number of entities is quite small, but many of the table relationships have recursive behavior (5-6 levels deep). My problem starts with the wonderful automagic that SA does for me, and my reluctance to properly extract the data with my own python classes.
I have ORM attribute access scattered across all kinds of iterators, recursive evaluators, right up to my file I/O streams. The access on these attributes is largely non-linear, and every time I do a lookup, my call stack disappears into SqlAlchemy for quite some time, and I am getting lots of singleton queries.
I am using mostly default SA settings (python 2.7.2, sqlalchemy 0.7).
Considering that RAM is not an issue, and that my database is so small (for the time being), is there a way I can just force SqlAlchemy to pre-populate the session as far as it can? I am hoping that if I just load the raw data into memory, then the most I will have to do is chase a few joins dynamically (almost all queries are pretty straightforward).
I am hoping for a 5 minute fix so I can run some reports ASAP. My next month of TODO is likely going to be full of direct table queries and tighter business logic that can pipeline tuples.
A five minute fix for that kind of issue is unlikely, but for many-to-one "singleton" gets there is a simple recipe I use often. Suppose you're loading lots of User objects and they all have many-to-one references to a Category of some kind:
# load all categories, then hold onto them
categories = Session.query(Category).all()
for user in Session.query(User):
    print user, user.category  # no SQL will be emitted for the Category
This is because the query.get() that a many-to-one emits will look in the local identity map for the primary key first.
If you're looking for more caching than that (and have a bit more than five minutes to spare), the same concept can be expanded to also cache the results of SELECT statements in a way that the cache is associated only with the current Session - check out the local_session_caching.py recipe included with the distribution examples.
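If simply loading everything up front really is acceptable, the same identity-map behaviour can be exploited more bluntly; a hedged sketch, where the class names are placeholders for your eight declarative classes:

def preload(session, classes):
    # warm the session's identity map so that later many-to-one lookups resolve
    # in memory instead of emitting singleton queries
    for cls in classes:
        session.query(cls).all()

preload(Session, (User, Category, Order, LineItem))   # placeholder class names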
Could anyone shed some light on how to migrate my MongoDB to PostgreSQL? What tools do I need, and what about handling primary keys and foreign key relationships, etc.?
I had MongoDB set up with Django, but would like to convert it back to PostgreSQL.
Whether the migration is easy or hard depends on a very large number of things including how many different versions of data structures you have to accommodate. In general you will find it a lot easier if you approach this in stages:
1. Ensure that all the Mongo data is consistent in structure with your RDBMS model and that the data structure versions are all the same.
2. Move your data. Expect that problems will be found and you will have to go back to step 1.
The primary problems you can expect are data validation problems because you are moving from a less structured data platform to a more structured one.
Depending on what you are doing regarding MapReduce you may have some work there as well.
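A hedged sketch of the staged approach above, using pymongo and psycopg2; the collection, table and column names are made up for illustration:

import psycopg2
from pymongo import MongoClient

mongo = MongoClient()['mydb']
pg = psycopg2.connect('dbname=mydb')
cur = pg.cursor()

for doc in mongo['orders'].find():
    # step 1: validate/normalise each document before it touches Postgres
    if 'customer_id' not in doc or 'total' not in doc:
        print('skipping malformed document %s' % doc['_id'])
        continue
    # step 2: move the data into the relational schema
    cur.execute(
        'INSERT INTO orders (mongo_id, customer_id, total) VALUES (%s, %s, %s)',
        (str(doc['_id']), doc['customer_id'], float(doc['total'])),
    )

pg.commit()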
In the meantime, a Postgres Foreign Data Wrapper for MongoDB has emerged (for PostgreSQL versions 9.1-9.4). With it, one can set up a view onto MongoDB from within PostgreSQL and then handle the data as SQL.
This would probably make copying the data over rather easy as well.
Limitations of FDW that I have faced:
objects within arrays (in MongoDB) do not seem to be addressable
objects with dynamic key names do not seem to be addressable
I know it's 2015 now. :)
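For completeness, here is roughly what wiring up such a wrapper looks like from Python, assuming the mongo_fdw extension; the server options, column list and names are assumptions based on mongo_fdw's documentation rather than anything from the answer above:

import psycopg2

pg = psycopg2.connect('dbname=mydb')
cur = pg.cursor()
cur.execute('CREATE EXTENSION IF NOT EXISTS mongo_fdw')   # assumed extension name
cur.execute("CREATE SERVER mongo_srv FOREIGN DATA WRAPPER mongo_fdw "
            "OPTIONS (address '127.0.0.1', port '27017')")
cur.execute('CREATE USER MAPPING FOR CURRENT_USER SERVER mongo_srv')
cur.execute("CREATE FOREIGN TABLE mongo_orders (_id NAME, customer_id TEXT, total FLOAT8) "
            "SERVER mongo_srv OPTIONS (database 'mydb', collection 'orders')")
# the Mongo collection can now be queried, and copied, with plain SQL
cur.execute('INSERT INTO orders (mongo_id, customer_id, total) '
            'SELECT _id, customer_id, total FROM mongo_orders')
pg.commit()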