We're currently working on a Python project that basically reads and writes M2M data to and from a SQLite database. This database consists of multiple tables, one of them storing current values coming from the cloud. This last table is worrying me a bit, since it is written very often and the application runs on a flash drive.
I've read that virtual tables could be the solution. I've thought about converting the critical table into a virtual one and then linking its contents to a real file (XML or JSON) stored in RAM (/tmp, for example, on Debian). I've been reading this article:
http://drdobbs.com/database/202802959?pgno=1
which explains more or less how to do what I want. It's quite complex, though, and I don't think it's very doable in Python; maybe we'd need to develop our own SQLite extension, I don't know...
Any idea how to place our problematic table in RAM while the rest of the database stays in flash? Any better/simpler approach to taking the virtual table route from Python?
A very simple, SQL-only solution for creating an in-memory table is to use SQLite's ATTACH command with the special ":memory:" pseudo-filename:
ATTACH DATABASE ":memory:" AS memdb;
CREATE TABLE memdb.my_table (...);
Since the whole "memdb" database is kept in RAM, its data will be lost once you close the database connection, so you will have to take care of persistence yourself.
One way to do it could be (see the Python sketch after this list):
Open your main SQLite database file
Attach an in-memory secondary database
Duplicate your performance-critical table in the in-memory database
Run all queries on the duplicate table
Once done, write the in-memory table back to the original table (BEGIN; DELETE FROM real_table; INSERT INTO real_table SELECT * FROM memdb.memory_table; COMMIT;)
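A minimal Python sketch of those steps, using the standard sqlite3 module (the file path is illustrative; the table names follow the snippets above):

import sqlite3

# open the main database file stored on flash (path is illustrative)
conn = sqlite3.connect("/data/main.db")

# attach an in-memory secondary database
conn.execute("ATTACH DATABASE ':memory:' AS memdb")

# duplicate the performance-critical table into RAM
conn.execute("CREATE TABLE memdb.memory_table AS SELECT * FROM real_table")

# ... run all frequent reads/writes against memdb.memory_table ...

# once done (or periodically), write the in-memory table back to flash
with conn:  # wraps both statements in a single transaction
    conn.execute("DELETE FROM real_table")
    conn.execute("INSERT INTO real_table SELECT * FROM memdb.memory_table")

conn.close()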
But the best advice I can give you: Make sure that you really have a performance problem, the simple solution could just as well be fast enough!
Use an in-memory data structure server. Redis is a sexy option, and you can easily implement a table using lists. It also comes with a decent Python driver.
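For instance, a rough sketch with the redis-py client (server address, key name and row format are assumptions, not from the original post):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

# append a "row" to a list that plays the role of a table
r.rpush("current_values", json.dumps({"sensor": "temp1", "value": 21.5}))

# read the whole "table" back
rows = [json.loads(item) for item in r.lrange("current_values", 0, -1)]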
Related
I am creating a Python system that needs to handle many files. Each file has more than 10 thousand lines of text data.
Because a DB (like MySQL) cannot be used in that environment, when a file is uploaded by a user I think I will save all of the uploaded file's data in an in-memory SQLite database, so that I can use SQL to fetch specific data from there.
Then, when all operations by the program are finished, I save the processed data to a file. This is the file users will receive from the system.
Some websites say SQLite shouldn't be used in production, but in my case I just keep the data temporarily in memory so I can use SQL on it. Is there any problem with using SQLite in production even in this scenario?
Edit:
The data in the in-memory DB doesn't need to be shared between processes. The program just creates tables, processes the data, then discards all data and tables after saving the processed data to a file. I just think that keeping everything in a list makes searching difficult and slow. So is using SQLite still a problem?
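For reference, the workflow described above fits in a few lines with Python's built-in sqlite3 module (the file names, table schema and filter below are made up):

import sqlite3

conn = sqlite3.connect(":memory:")  # data lives only in RAM
conn.execute("CREATE TABLE lines (id INTEGER, content TEXT)")

# load the uploaded file
with open("uploaded.txt") as f:
    rows = [(i, line.rstrip("\n")) for i, line in enumerate(f)]
conn.executemany("INSERT INTO lines VALUES (?, ?)", rows)

# use SQL to fetch specific data
wanted = conn.execute("SELECT content FROM lines WHERE content LIKE ?", ("%error%",)).fetchall()

# write the processed data to the file users will receive, then discard everything
with open("result.txt", "w") as out:
    out.writelines(c + "\n" for (c,) in wanted)
conn.close()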
"SQLite shouldn't be used in production" is not a one-size-fits-all rule; it's more of a rule of thumb. Of course there are applications where one could reasonably use SQLite even in production environments.
However, your case doesn't seem to be one of them. While SQLite supports multi-threaded and multi-process environments, it locks the whole database when it opens a write transaction. You need to ask yourself whether this is a problem for your particular case, but if you're uncertain, go for "yes, it's a problem for me".
You'd probably be okay with plain in-memory structures alone, unless there are some details you haven't uncovered.
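For example, a plain dictionary keyed by whatever you search on is often enough (the row contents below are made up):

# rows parsed from the uploaded file (contents are made up)
parsed_rows = [
    {"user_id": "u123", "value": 1},
    {"user_id": "u456", "value": 2},
]

# index by the field you search on
rows_by_user = {}
for row in parsed_rows:
    rows_by_user.setdefault(row["user_id"], []).append(row)

# constant-time lookup instead of a SQL query
matches = rows_by_user.get("u123", [])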
I'm not familiar with the specific context of your system, but if what you're looking for is a SQL database that is:
light
accessed from a single process and a single thread
recoverable if the system crashes in the middle (either by backing up the last stable version of the database or by just recreating it from scratch)
then using SQLite in production is fine. OS X, for example, uses SQLite for a few purposes (e.g. /var/db/auth.db).
Situation: I need to deal with a large amount of data (a ~260 MB CSV data file with about 50 bytes of data per line).
Problem: If I just read from the file every time I need to work with it, it takes a long time. So I decided to push everything into a database. I need a fast database setup to handle the data, as I need to do a lot of reading and writing.
Question: What are the faster choices for a database in Python?
Additional information 1: The data comprises 3 columns and I do not see myself needing any more than that. Would this mean that a NoSQL database is preferable?
Additional information 2: However, if in the future I do need more than one database working together, would it be better to go for a SQL database?
Additional information 3: I think it would help to mention that I am looking at a few different DBs (MongoDB, SQLite, TinyDB), but do suggest other DBs that you know are faster.
I've experienced this same situation many times and I wanted something faster than a typical relational database. Redis is a very fast and scalable key/value database, and you can get started quickly using the Popoto ORM.
Here is an example:
import popoto

class City(popoto.Model):
    id = popoto.UniqueKeyField()   # unique lookup key
    name = popoto.KeyField()       # queryable key
    description = popoto.Field()

# load the tab-separated file into Redis, one City per line
for line in open("cities.csv"):
    csv_row = line.rstrip("\n").split('\t')
    City.create(
        id=csv_row[0],
        name=csv_row[1],
        description=csv_row[2]
    )
new_york = City.query.get(name="New York")
This is about the fastest way to store and retrieve data without having to learn the nuances of a new database system.
Keep in mind that if your database grows beyond ~5 GB, an in-memory database like Redis can start to become expensive compared to slower disk-based databases like Postgres or MySQL.
Full disclosure: I help maintain the open source Popoto project.
I am creating a simple Python program that needs to search a somewhat large database (~40 tables, 6 million or so rows altogether).
Currently I use MySQLdb to query my local MySQL database, and I have some other Python functions that work with the data and return some statistics and other stuff. I would like to share this with others who do not want to construct their own database. At this point the database is used for queries only.
How best can I share the database and Python program as a "package"? Do I have to give up on the SQL approach and switch to some sort of text-file database, or is there an easier way... SQLite maybe?
If the answer is SQLite, how do I go about exporting my current SQL database to the SQLite database? Are there any gotchas I should know about?
Currently I use simple SELECT queries with a few WHERE clauses to locate the data I need. I am afraid that if I switched to a text-based database I would end up having to write a large amount of code to make these queries.
Thank you in advance for any suggestions.
EDIT
So I wrote my little Python program with an sqlite3 database and it works perfectly.
I ended up using a shell script called mysql2sqlite.sh, found here, to convert my MySQL database to SQLite. It worked flawlessly.
I only had to change 2 lines of Python code. Awesome.
My little program runs on OS X, Windows and Linux (Ubuntu and Red Hat) without any changes or hassle. Thanks for the advice!
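The post doesn't say which two lines changed, but a typical MySQLdb-to-sqlite3 switch touches only the import and the connection call; an illustrative sketch (the database, table and column names are made up):

# before:
# import MySQLdb
# conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")

# after:
import sqlite3
conn = sqlite3.connect("mydb.sqlite")

# the query code itself stays largely the same
# (note: sqlite3 uses '?' placeholders where MySQLdb uses '%s')
cur = conn.cursor()
cur.execute("SELECT * FROM my_table WHERE some_column = ?", ("value",))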
Converting your database could be as easy as an SQL dump and then an import, depending on the complexity of your DB. See this post for strategies and alternatives.
I'm a user of a Python application that has poorly indexed tables, and I was wondering if it's possible to improve performance by converting the SQLite database into an in-memory database at application startup. My thinking is that this would reduce the cost of the full table scans, especially since SQLite may be creating automatic indexes (the documentation says that is enabled by default). How can this be accomplished using the SQLAlchemy ORM (which is what the application uses)?
At the start of the program, move the database file to a ramdisk, point SQLAlchemy at it and do your processing, and then move the SQLite file back to non-volatile storage.
It's not a great solution, but it'll help you determine whether caching your database in memory is worthwhile.
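A rough sketch of that approach on Linux, assuming /dev/shm is usable as a ramdisk and with placeholder file paths:

import shutil
from sqlalchemy import create_engine

DISK_PATH = "/var/lib/app/app.db"   # placeholder
RAM_PATH = "/dev/shm/app.db"        # lives on a tmpfs ramdisk

# copy the database onto the ramdisk and point SQLAlchemy at the copy
shutil.copy2(DISK_PATH, RAM_PATH)
engine = create_engine(f"sqlite:///{RAM_PATH}")

# ... run the application's ORM sessions against this engine ...

# when done, dispose of the engine and move the file back to non-volatile storage
engine.dispose()
shutil.copy2(RAM_PATH, DISK_PATH)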
Whenever you set a variable in Python you are instantiating an object, which means you are allocating memory for it.
When you query SQLite you are simply reading information off the disk into memory.
SQLAlchemy is simply an abstraction. You read the data from disk into memory in the same way, by querying the database and assigning the returned data to a variable.
What would be the best way to import multi-million record CSV files into Django?
Currently, using the Python csv module, it takes 2-4 days to process a 1 million record file. It does some checking of whether the record already exists, and a few other things.
Can this process be made to execute in a few hours?
Can memcache be used somehow?
Update: There are Django ManyToManyField fields that get processed as well. How will these be handled with a direct load?
I'm not sure about your case, but we had a similar scenario with Django where ~30 million records took more than one day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python we took a radical strategy change and did the import (and only the import) with Java and JDBC (plus some MySQL tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize thanks to the very good IDE and profiler support).
I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.
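For illustration, a minimal sketch of going through MySQLdb directly with batched executemany calls (the table, columns and credentials are made up):

import csv
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()

with open("records.csv") as f:
    batch = []
    for row in csv.reader(f):
        batch.append(row)
        if len(batch) >= 1000:  # insert in batches of 1000 rows
            cur.executemany("INSERT INTO records (a, b, c) VALUES (%s, %s, %s)", batch)
            conn.commit()
            batch = []
    if batch:  # flush the final partial batch
        cur.executemany("INSERT INTO records (a, b, c) VALUES (%s, %s, %s)", batch)
        conn.commit()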
Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks you can do. For example, you can write a Perl/Python script that reads the CSV file and creates a SQL script of insert statements, and then feed the SQL script directly to MySQL.
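A toy version of such a CSV-to-SQL-script generator, assuming three columns (table and column names are invented; real code would need more careful escaping):

import csv

with open("data.csv") as src, open("load_data.sql", "w") as out:
    out.write("BEGIN;\n")
    for a, b, c in csv.reader(src):
        # naive single-quote escaping; a real script should escape values properly
        out.write("INSERT INTO records (a, b, c) VALUES ('%s', '%s', '%s');\n"
                  % (a.replace("'", "''"), b.replace("'", "''"), c.replace("'", "''")))
    out.write("COMMIT;\n")

# then feed it to MySQL:  mysql mydb < load_data.sql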
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.
As mentioned, you want to bypass the ORM and go directly to the database. Depending on what type of database you're using, you'll probably find good options for loading the CSV data directly. With Oracle you can use external tables for very high-speed data loading, and for MySQL you can use the LOAD DATA INFILE command. I'm sure there's something similar for Postgres as well (COPY).
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes using mysqldump.
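For MySQL's LOAD command, the statement can also be issued from Python; a sketch with hypothetical names (the server and client must both allow LOCAL INFILE):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb",
                       local_infile=1)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE 'records.csv'
    INTO TABLE records
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    (a, b, c)
""")
conn.commit()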
Like Craig said, you'd better fill the db directly first.
This implies creating Django models that just fit the CSV cells (you can then create better models and scripts to move the data).
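For instance, if each CSV row had three text cells, a throwaway model matching them could look like this (the field names are placeholders):

from django.db import models

class CsvRow(models.Model):
    # one CharField per CSV cell; refine the types and relations later
    cell_1 = models.CharField(max_length=255)
    cell_2 = models.CharField(max_length=255)
    cell_3 = models.CharField(max_length=255)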
Then comes the DB feeding: a tool of choice for this is Navicat; you can grab a functional 30-day demo on their site. It allows you to import CSV into MySQL, save the import profile as XML, and so on.
Then I would launch the data-control scripts from within Django, and once you're done, migrate your models with South to get what you want, or, as I said earlier, create another set of models within your project and use scripts to convert/copy the data.