SQLite: Using One File vs. Many Files


I'm working on a project in Python and using SQLite3. I don't expect to be using any huge number of records (less than some other projects I've done that don't show any notable performance penalty) and I'm trying to decide if I should put the entire database in one file or multiple files. It's a ledger program that will be keeping names of all vendors, configuration info, and all data for the user in one DB file, but I was considering using a different DB file for each ledger (in the case of using different ledgers for different purposes or investment activities).
I know, from here, that I can do joins, when needed, across DBs in different files, so I don't see any reason I have to keep all the tables in one DB, but I also don't see a reason I need to split them up into different files.
How does using one DB in SQLite compare to using multiple DBs? What are the strengths and disadvantages to using one file or using multiple files? Is there a compelling reason for using one format over the other?

Here are a couple of points to consider. Feel free to add more in comments.
Advantages:
You can place each database file on a different physical drive and benefit from parallel read/write operations, making those operations slightly faster.
Disadvantages:
You won't be able to create foreign keys across databases.
Views that rely on tables from several databases will require you to attach all databases all the time, using exactly the same names for the attached databases (querying the view will report an error if the SELECT statement defined inside is incorrect, but it's compiled and validated only when queried).
Triggers cannot operate cross-database, so a trigger on some table can query only tables from the same database.
Transactions will be atomic across databases, but only if the main database is not a :memory: database and the journal mode is not WAL.
In other words, you can achieve some speed boost (assuming you have drives to spare), but you lose some flexibility in database design and it's harder to maintain consistency.
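
For reference, here is a minimal sketch of the multi-file layout discussed in the question, using ATTACH DATABASE from Python's sqlite3 module. The file names, the `ledger` schema name, and the `vendors`/`entries` tables are made up for illustration; only the ATTACH mechanism itself comes from SQLite.

```python
import os
import sqlite3
import tempfile

# Hypothetical layout: one main file for shared data, one file per ledger.
workdir = tempfile.mkdtemp()
main_path = os.path.join(workdir, "main.db")
ledger_path = os.path.join(workdir, "ledger_2024.db")

# Create the per-ledger file with a few entries.
ledger = sqlite3.connect(ledger_path)
ledger.execute("CREATE TABLE entries (vendor_id INTEGER, amount REAL)")
ledger.executemany(
    "INSERT INTO entries VALUES (?, ?)", [(1, 12.5), (1, 7.5), (2, 30.0)]
)
ledger.commit()
ledger.close()

# Open the main file, then attach the ledger under a fixed schema name.
conn = sqlite3.connect(main_path)
conn.execute("CREATE TABLE vendors (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO vendors VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
conn.commit()
conn.execute("ATTACH DATABASE ? AS ledger", (ledger_path,))

# Join across the two files by qualifying tables with their schema names.
totals = conn.execute(
    """
    SELECT v.name, SUM(e.amount)
    FROM main.vendors AS v
    JOIN ledger.entries AS e ON e.vendor_id = v.id
    GROUP BY v.name
    ORDER BY v.name
    """
).fetchall()
print(totals)
conn.close()
```

Note that nothing stops `ledger.entries.vendor_id` from pointing at a vendor that does not exist in the main file, which is the foreign-key limitation mentioned above.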

Related

What did my teacher mean by 'db.sqlite3' will fall short in bigger real-world problems?

I am very new to programming and this site too... An online course that I follow told me that it is not possible to manage bigger databases with db.sqlite3, what does it mean anyway?

Choice of Relational Database Management Systems (RDBMS) is dependent on your use case. The different options available have different pros and cons and hence, for different applications, some are more suitable than others.
I typically use SQLite (only for development purposes) and then switch to MySQL for my Django projects.
SQLite: Is file based. You can actually see the file in your project directory, so all the CRUD (Create, Retrieve, Update, Delete) is done directly on that file. Also, all the underlying code for the RDBMS is quite small in size. All this makes it good for applications which don't require intensive use of databases or perhaps require offline storage, e.g. IoT, small websites etc. When you try to use it for big projects that require intensive use of databases, e.g. online stores, you run into many problems because the RDBMS is not as well developed as MySQL or PostgreSQL. The primary problem is a lack of write concurrency, i.e. only one connection can be writing to the database at a time because write operations are serialised.
MySQL: Is one of the most popularly used and my personal favourite (very easy to configure and use with Django). It's based on the client/server database model and not a file like SQLite and is very scalable i.e. it is capable of way more than SQLite and you can use it for many different applications that require heavy use of the RDBMS. It has better security, allows for concurrent operations and outperforms PostgreSQL in performance when you need to do lots of reading operations.
PostgreSQL: Is also a very strong option and capable of most of the stuff that MySQL can do but handles clients in a different way and it has an edge over MySQL in SELECTs and INSERTs. MySQL is still soooo much more widely used than PostgreSQL though.
There are also many other options on the market. You can take a look at this article which compares a bunch of them. But to answer your question, SQLite is very simplistic compared to the other options and stores everything in a file in your project rather than on a server, so as a result, there is little security, lack of concurrency etc. This is fine when developing and for use cases that do not require major use of databases but will not cut it for big projects.

This is not a matter of how big the DB is. A SQLite DB can be very big: hundreds of gigabytes.
It is a matter of how many users are using the application (you mention Django) concurrently. As SQLite only supports one writer at a time, the others are queued. Fortunately, you can have many concurrent readers.
So if you have a lot of concurrent access (that is not explicitly marked as read-only), then SQLite is not a good choice anymore. You'll prefer something like PostgreSQL.
BTW, everything is better explained in the documentation ;)
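
To make the one-writer rule concrete, here is a small sketch with two connections to the same file in the default rollback-journal mode. The file name, the `hits` table, and the short 0.1 s timeout are assumptions chosen so the demo fails fast; a real application would normally keep a longer timeout and let the second writer wait.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "app.db")

# isolation_level=None: we issue BEGIN/COMMIT ourselves.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("CREATE TABLE hits (n INTEGER)")
writer.execute("INSERT INTO hits VALUES (1)")

# A second connection that gives up after 0.1 s instead of the default 5 s.
other = sqlite3.connect(path, timeout=0.1, isolation_level=None)

writer.execute("BEGIN IMMEDIATE")  # take the write lock
writer.execute("INSERT INTO hits VALUES (2)")

# Reading still works while the first writer has not committed yet;
# the uncommitted row is not visible to the other connection.
rows_seen = other.execute("SELECT COUNT(*) FROM hits").fetchall()[0][0]

# A second writer is refused until the first one finishes.
try:
    other.execute("BEGIN IMMEDIATE")
    second_writer_blocked = False
except sqlite3.OperationalError as exc:
    second_writer_blocked = True
    print("second writer:", exc)

writer.execute("COMMIT")

# Once the first writer has committed, the second one can proceed.
other.execute("BEGIN IMMEDIATE")
other.execute("INSERT INTO hits VALUES (3)")
other.execute("COMMIT")
final_count = other.execute("SELECT COUNT(*) FROM hits").fetchall()[0][0]
print(rows_seen, second_writer_blocked, final_count)
```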

Selecting a database for your project is like selecting any other technology. It depends on your use case.
Size isn't the issue, complexity is. SQLite3 databases can grow as big as 281 terabytes. Limits on number of tables, columns & rows are also pretty decent.
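If your application logic requires SQL operations like:
RIGHT OUTER JOIN, FULL OUTER JOIN (supported only since SQLite 3.39.0)
ALTER TABLE ... ADD CONSTRAINT and most other ALTER TABLE forms (only renaming a table or column and adding or dropping a column are supported)
DELETE, INSERT, or UPDATE on a VIEW (views are read-only unless you define INSTEAD OF triggers)
Custom user permissions to read/write (there is no GRANT/REVOKE)
Then SQLite3 may not be the right choice of database, as these SQL features are missing or only partially implemented in SQLite3.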
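
As an illustration of the view limitation in the list above, the sketch below shows a direct INSERT into a view being rejected, and one possible workaround using an INSTEAD OF trigger. The `vendors` table, the `active_vendors` view, and the trigger name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vendors (
        id INTEGER PRIMARY KEY,
        name TEXT,
        active INTEGER DEFAULT 1
    );
    CREATE VIEW active_vendors AS
        SELECT id, name FROM vendors WHERE active = 1;
    """
)

# Writing to a plain view is rejected by SQLite.
try:
    conn.execute("INSERT INTO active_vendors (name) VALUES ('Acme')")
    direct_insert_ok = True
except sqlite3.OperationalError as exc:
    direct_insert_ok = False
    print("direct insert failed:", exc)

# An INSTEAD OF trigger redirects the write to the underlying table.
conn.execute(
    """
    CREATE TRIGGER active_vendors_insert
    INSTEAD OF INSERT ON active_vendors
    BEGIN
        INSERT INTO vendors (name) VALUES (NEW.name);
    END
    """
)
conn.execute("INSERT INTO active_vendors (name) VALUES ('Acme')")
names = [row[0] for row in conn.execute("SELECT name FROM active_vendors")]
print(names)
```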

in-memory sqlite in production with python

I am creating a python system that needs to handle many files. Each of the file has more than 10 thousand lines of text data.
Because DB (like mysql) can not be used in that environment, when file is uploaded by a user, I think I will save all the data of the uploaded file in in-memory-SQLite so that I can use SQL to fetch specific data from there.
Then, when all operations by program are finished, save the processed data in a file. This is the file users will receive from the system.
But some websites say SQLite shouldn't be used in production. But in my case, I just save them temporarily in memory to use SQL for the data. Is there any problem for using SQLite in production even in this scenario?
Edit:
The data in in-memory-DB doesn't need to be shared between processes. It just creates tables, process data, then discard all data and tables after saving the processed data in file. I just think saving everything in list makes search difficult and slow. So using SQLite is still a problem?

"SQLite shouldn't be used in production" is not a one-size-fits-all rule; it's more of a rule of thumb. Of course there are applications where one could think of a reasonable use of SQLite even in production environments.
However, your case doesn't seem to be one of them. While SQLite supports multi-threaded and multi-process environments, it locks the whole database when it opens a write transaction. You need to ask yourself whether this is a problem for your particular case, but if you're uncertain, go for "yes, it's a problem for me".
You'd probably be okay with in-memory structures alone, unless there are some details you haven't disclosed.

I'm not familiar with the specific context of your system, but it sounds like what you're looking for is a SQL database where:
The database is lightweight.
Access is from a single process and a single thread.
If the system crashes in the middle, you have a good way to recover from it (either restoring the last stable version of the database or just creating it from scratch).
If you meet all these criteria, using SQLite in production is fine. OS X, for example, uses SQLite for a few purposes (e.g. ./var/db/auth.db).
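
A minimal sketch of the workflow described in the question, assuming the uploaded file is CSV-like: load it into a private in-memory database, query it with SQL, and write the result back out. The column names and the `io.StringIO` stand-ins for the uploaded and output files are made up for illustration.

```python
import csv
import io
import sqlite3

# Stand-in for an uploaded file; the columns are hypothetical.
uploaded = io.StringIO("id,category,value\n1,a,10\n2,b,20\n3,a,30\n")

# ":memory:" is private to this connection and disappears on close.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rows (id INTEGER, category TEXT, value REAL)")

reader = csv.DictReader(uploaded)
with conn:  # one transaction for the whole load
    conn.executemany(
        "INSERT INTO rows VALUES (:id, :category, :value)", reader
    )

# Use SQL to fetch the specific data that is needed.
result = conn.execute(
    "SELECT category, SUM(value) FROM rows GROUP BY category ORDER BY category"
).fetchall()

# Write the processed data to the file the user will receive.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["category", "total"])
writer.writerows(result)
conn.close()
print(out.getvalue())
```

Because each upload gets its own connection and nothing is shared between processes, the write-locking concern raised above does not come into play in this sketch.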

Adding Multiple ZODB Databases Together

I have three python object databases which I've constructed through the ZODB module, which I would like to merge into one. The reason I have three and not one is because each object belongs to one of three populations, and was added to the database once my code conducted an analysis of said object. The analysis of each object can definitely be done in parallel. My code takes a few days to run, so to prevent this from being a week long endeavor, I have three computers each processing objects from one of the three populations, and outputting a single ZODB database once it has completed. I couldn't have three computers adding the analysis of objects from different populations to the same database because of the way ZODB handles conflicts. Essentially, until you close the database, it is locked from the inside.
My questions are:
1) How can I merge multiple .fs database files into a single master database? The structure of each database is exactly the same - meaning the dictionary structures are the same between each. As an example, MyDB may represent the ZODB database structure of the first population:
root['MyDB']['ID123456']['property1']
... ['ID123456']['property2']
... ... ...
root['MyDB']['ID123457']['property1']
... ['ID123457']['property2']
... ... ...
...
where the ellipsis represents more of the same. The names of the keys 'property1', 'property2', etc., are all the same for each 'IDXXXXXX' key within the database, though the values will certainly vary.
2) What would have been the smarter thing to do to run this code in parallel while still resulting in a single ZODB structure?
Please let me know if clarification is needed.
Thanks!

The smarter thing would have been to use ZEO to share the ZODB storage among the processes.
ZEO shares a ZODB database across the network and extends the conflict resolution across multiple clients, which can reside on the same machine or elsewhere.
Alternatively, you could use the RelStorage backend to store your ZODB instead of using the standard FileStorage; this backend uses a traditional relational database to provide concurrent access instead.
See zc.lockfile.LockError in ZODB for some usage examples for either option.
The ZODB data structures are otherwise merely persisted Python data structures; merging the three ZODB data structures requires you to open each of the databases and merge the nested structures as needed.

Okay, since ZODB object databases are essentially just dictionaries of python objects, this post happens to be the answer I was looking for. It talks about how to add databases together, and in doing so literally adds together any similar common keys of both databases. It's still the answer I'm looking for because both databases are mutually exclusive, and so the result would be a single ZODB database which contains unmodified entries of the other two.
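
The merge itself can be sketched on plain dictionaries, assuming each population's `root['MyDB']` behaves like a mapping of ID to properties and that the IDs never overlap. The function name and sample values are hypothetical, and no ZODB calls are used here; with real ZODB roots you would open each storage, pass the three mappings in, and commit the transaction on the target. Persistent objects would likely need to be copied rather than referenced across databases, which is why the sketch copies each entry.

```python
def merge_populations(target, sources):
    """Copy every ID entry from each source mapping into target.

    Raises ValueError on a duplicate ID instead of silently overwriting,
    since the populations are expected to be mutually exclusive.
    """
    for source in sources:
        for object_id, properties in source.items():
            if object_id in target:
                raise ValueError(f"duplicate key: {object_id}")
            # Copy so the target does not share objects with the source.
            target[object_id] = dict(properties)
    return target


# Stand-ins for root['MyDB'] of each of the three databases.
pop_a = {"ID123456": {"property1": 1.0, "property2": "x"}}
pop_b = {"ID123457": {"property1": 2.0, "property2": "y"}}
pop_c = {"ID123458": {"property1": 3.0, "property2": "z"}}

master = merge_populations({}, [pop_a, pop_b, pop_c])
print(sorted(master))
```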

Memory usage of file versus database for simple data storage

I'm writing the server for a Javascript app that has a syncing feature. Files and directories being created and modified by the client need to be synced to the server (the same changes made on the client need to be made on the server, including deletes).
Since every file is on the server, I'm debating the need for a MySQL database entry corresponding to each file. The following information needs to be kept on each file/directory for every user:
Whether it was deleted or not (since deletes need to be synced to other clients)
The timestamp of when every file was last modified (so I know whether the file needs updating by the client or not)
I could keep both of those pieces of information in files (e.g. .deleted file and .modified file in every user's directory containing file paths + timestamps in the latter) or in the database.
However, I also have to fit under an 80 MB memory constraint. Between file storage and database storage, which would be more memory-efficient for this purpose?
Edit: Files have to be stored on the filesystem (not in a database), and users have a quota for the storage space they can use.

Probably the filesystem variant will be more efficient memory wise as long as the number of files is low, but that solution probably won't scale. Databases are optimized to do exactly that. Searching the filesystem, opening the file, searching the document, will be expensive as the number of files and requests increase.
But nobody says you have to use MySQL. A NoSQL database like Redis, or maybe something like CouchDB (where you could keep the file itself and include versioning), might be a more attractive solution.
Here is a quick comparison of NoSQL databases.
And a longer comparison.
Edit: From your comments, I would build it as follows: create an API abstracting the backend for all the operations you want to do. Then implement the backend part with the 2 or 3 operations that happen most, or could be most expensive, for the filesystem and for a database (or two). Test and benchmark.
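
A rough sketch of that "abstract the backend, then benchmark" idea follows. Everything here is an assumption made for illustration: the `MetadataStore` interface, its method names, a plain dict standing in for the flat-file variant, and sqlite3 standing in for the database variant (the question mentions MySQL). The timing shown is only a placeholder; memory use under the 80 MB constraint would have to be measured separately on the real server.

```python
import sqlite3
import time
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Hypothetical interface for per-file sync metadata."""

    @abstractmethod
    def touch(self, path, modified): ...

    @abstractmethod
    def mark_deleted(self, path): ...

    @abstractmethod
    def get(self, path): ...


class DictStore(MetadataStore):
    """Stand-in for the flat-file variant: everything lives in one dict."""

    def __init__(self):
        self._data = {}

    def touch(self, path, modified):
        self._data[path] = (modified, False)

    def mark_deleted(self, path):
        modified, _ = self._data[path]
        self._data[path] = (modified, True)

    def get(self, path):
        return self._data.get(path)


class SqliteStore(MetadataStore):
    """Database variant; sqlite3 is used here only as a stand-in."""

    def __init__(self, db_path=":memory:"):
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS files "
            "(path TEXT PRIMARY KEY, modified REAL, deleted INTEGER)"
        )

    def touch(self, path, modified):
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, 0)", (path, modified)
            )

    def mark_deleted(self, path):
        with self._conn:
            self._conn.execute(
                "UPDATE files SET deleted = 1 WHERE path = ?", (path,)
            )

    def get(self, path):
        row = self._conn.execute(
            "SELECT modified, deleted FROM files WHERE path = ?", (path,)
        ).fetchone()
        return None if row is None else (row[0], bool(row[1]))


def exercise(store):
    """Run the same operations against any backend."""
    store.touch("notes/a.txt", 100.0)
    store.touch("notes/b.txt", 200.0)
    store.mark_deleted("notes/a.txt")
    return store.get("notes/a.txt"), store.get("notes/b.txt"), store.get("missing")


for backend in (DictStore(), SqliteStore()):
    start = time.perf_counter()
    outcome = exercise(backend)
    elapsed = time.perf_counter() - start
    print(type(backend).__name__, outcome, f"{elapsed:.6f}s")
```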

I'd go for one of the NoSQL databases. You can store file contents and provide some key function based on users' IDs in order to retrieve those contents when you need them. Redis or Cassandra can be good choices for this case. There are many libs to use these databases in Python as well as in many other languages.

In my opinion, the only real way to be sure is to build a test system and compare the space requirements. It shouldn't take that long to generate some random data programmatically. One might think the file system would be more efficient, but databases can and might compress the data or deduplicate it, or whatever. Don't forget that a database would also make it easier to implement new features, perhaps access control.

Django with huge mysql database

What would be the best way to import multi-million-record CSV files into Django?
Currently, using the Python csv module, it takes 2-4 days to process a 1-million-record file. It checks whether each record already exists, among a few other things.
Can this process be made to run in a few hours?
Can memcache be used somehow?
Update: There are Django ManyToManyField fields that get processed as well. How would these be handled with a direct load?

I'm not sure about your case, but we had a similar scenario with Django where ~30 million records took more than one day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python, we made a radical change of strategy and did the import (only) with Java and JDBC (plus some MySQL tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).

I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.

Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. As an example, you can write a perl/python script to read the CSV file and create a SQL script with insert statements, and then feed the SQL script directly to MySQL.
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.
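
The batching advice above can be sketched with the standard DB-API pattern of `executemany` plus one commit per chunk. This is only an illustration: sqlite3 stands in for the MySQL driver, the `records` table, its columns, and the `chunks`/`load_csv` helpers are hypothetical, and the index is created after the load rather than before it.

```python
import csv
import io
import sqlite3
from itertools import islice


def chunks(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def load_csv(conn, csv_file, batch_size=1000):
    """Insert CSV rows in batches, committing once per batch."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    total = 0
    for chunk in chunks(reader, batch_size):
        with conn:  # one transaction per batch
            conn.executemany(
                "INSERT INTO records (id, name) VALUES (?, ?)", chunk
            )
        total += len(chunk)
    return total


# Stand-in for a large CSV file.
data = io.StringIO("id,name\n" + "".join(f"{i},name{i}\n" for i in range(2500)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, name TEXT)")
loaded = load_csv(conn, data, batch_size=1000)

# Build the index once, after all rows are in.
conn.execute("CREATE INDEX records_id ON records (id)")
count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(loaded, count)
```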

As mentioned, you want to bypass the ORM and go directly to the database. Depending on what type of database you're using, you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high speed data loading, and for MySQL you can use the LOAD DATA INFILE statement. I'm sure there's something similar for Postgres as well.
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes using mysqldump.

Like Craig said, you'd better fill the DB directly first.
That implies creating Django models that just fit the CSV cells (you can then create better models and scripts to move the data).
Then, feeding the DB: a tool of choice for doing this is Navicat; you can grab a functional 30-day demo on their site. It allows you to import CSV into MySQL, save the import profile in XML...
Then I would launch the data control scripts from within Django and, when you're done, migrate your model with South to get what you want or, like I said earlier, create another set of models within your project and use scripts to convert/copy the data.
