Fast database solutions for Python programming

Situation: I need to deal with a large amount of data (a ~260 MB CSV file with roughly 50 bytes per line).
Problem: If I read from the file every time I need to work with the data, it takes a long time, so I decided to push everything into a database. I need a fast database setup, since I will be doing a lot of reading and writing.
Question: What are the faster choices for a database in Python?
Additional information 1: The data comprises 3 columns and I don't see myself needing more than that. Would this mean that a NoSQL database is preferred?
Additional information 2: However, if in the future I do need more than one database working together, would it be better to go with a SQL database?
Additional information 3: It may help to mention that I am looking at a few different DBs (MongoDB, SQLite, TinyDB), but do suggest other DBs that you know are faster.

I've experienced this same situation many times and wanted something faster than a typical relational database. Redis is a very fast and scalable key/value store, and you can get started quickly using the Popoto ORM.
Here is an example:
import popoto

# id is a unique key, name is a queryable key, description is a plain stored field
class City(popoto.Model):
    id = popoto.UniqueKeyField()
    name = popoto.KeyField()
    description = popoto.Field()

# Load the tab-separated file, creating one City per line
for line in open("cities.csv"):
    csv_row = line.rstrip('\n').split('\t')
    City.create(
        id=csv_row[0],
        name=csv_row[1],
        description=csv_row[2],
    )

new_york = City.query.get(name="New York")
This is the absolute fastest way to store and retrieve data without having to learn the nuances of a new database system.
Keep in mind that if your database grows beyond 5 GB, an in-memory database like Redis can start to become expensive compared to slower disk-based databases like Postgres or MySQL.
Full disclosure: I help maintain the open-source Popoto project.
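For reference, and if you'd rather not add another dependency, here is roughly what the same load could look like with the plain redis-py client. This is only a sketch: the file layout and field names are assumed from the example above, using one hash per city plus a small name-to-id index for lookups.

import csv
import redis

r = redis.Redis()  # assumes a Redis server running locally on the default port

# One hash per city, plus a name -> id index so lookups by name stay fast
with open("cities.csv", newline="") as f:
    for city_id, name, description in csv.reader(f, delimiter="\t"):
        r.hset(f"city:{city_id}", mapping={"name": name, "description": description})
        r.set(f"city:name:{name}", city_id)

# Look up a city by name
city_id = r.get("city:name:New York")
new_york = r.hgetall(f"city:{city_id.decode()}")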

Related

Multi-user efficient time-series storing for Django web app

I'm developing a Django app. Use-case scenario is this:
50 users, each of whom can store up to 300 time series, and each time series has around 7,000 rows.
Each user can ask at any time to retrieve all of their 300 time series and, for each of them, to perform some advanced data analysis on the last N rows. The data analysis cannot be done in SQL, only in Pandas, where it doesn't take much time... but retrieving 300,000 rows into separate dataframes does!
Users can also ask for results of some analyses that can be performed in SQL (like aggregation + sum by date), and that is considerably faster (to the point where I wouldn't be writing this post if that was all of it).
Browsing and thinking around, I've figured storing time series in SQL is not a good solution (read here).
The ideal deployment architecture looks like this (each bucket is a separate server!):
Problem: time series in SQL are too slow to retrieve in a multi-user app.
Researched solutions (from this article):
PyStore: https://github.com/ranaroussi/pystore
Arctic: https://github.com/manahl/arctic
Here are some problems:
1) Although these solutions are massively faster for pulling a multi-million-row time series into a single dataframe, I might need to pull around 500,000 rows into 300 different dataframes. Would that still be as fast?
This is the current db structure I'm using:
class TimeSerie(models.Model):
    ...

class TimeSerieRow(models.Model):
    date = models.DateField()
    timeserie = models.ForeignKey(TimeSerie, on_delete=models.CASCADE)
    number = ...
    another_number = ...
And this is the bottleneck in my application:
# one query (and one dataframe) per time series
for t in TimeSerie.objects.filter(user=user):
    q = TimeSerieRow.objects.filter(timeserie=t).order_by("date")
    q = q.filter( ... time filters ... )
    df = pd.DataFrame(q.values())
    # ... analysis on df
2) Even if PyStore or Arctic can do that faster, it would mean I lose the ability to decouple my DB from the Django instances, effectively using the resources of one machine much better, but being stuck with only one machine and not being scalable (or having to use as many separate databases as machines). Can PyStore/Arctic avoid this and provide an adapter for remote storage?
Is there a Python/Linux solution that can solve this problem? Which architecture can I use to overcome it? Should I drop the scalability of my app and/or accept that for every N new users I'll have to spawn a separate database?
The article you refer to in your post is probably the best answer to your question. Clearly good research, and a few good solutions are being proposed (don't forget to take a look at InfluxDB).
Regarding the decoupling of the storage solution from your instances, I don't see the problem:
Arctic uses MongoDB as a backing store
PyStore uses a file system as a backing store
InfluxDB is a database server in its own right
So as long as you decouple the backing store from your instances and make it shared among them, you'll have the same setup as for your PostgreSQL database: MongoDB or InfluxDB can run on a separate centralised instance, and the file storage for PyStore can be shared, e.g. using a shared mounted volume. The Python libraries that access these stores of course run on your Django instances, like psycopg2 does.
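To make the PyStore option concrete, here is a minimal sketch; the mount point, store/collection/item names and the example dataframe are placeholders, not taken from the original post:

import pandas as pd
import pystore

# Point PyStore at a volume that is mounted on every Django instance
pystore.set_path("/mnt/shared/pystore")

store = pystore.store("timeseries")
collection = store.collection("user_42")

# Write one time series (the index must be a DatetimeIndex)
df = pd.DataFrame(
    {"number": [1.0, 2.0, 3.0]},
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)
collection.write("serie_1", df, metadata={"user": 42})

# Any other instance pointing at the same shared path can read it back
df_back = collection.item("serie_1").to_pandas()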

Storing large dataset of tweets: Text files vs Database

I have collected a large Twitter dataset (>150 GB) that is stored in text files. Currently I retrieve and manipulate the data using custom Python scripts, but I am wondering whether it would make sense to use a database technology to store and query this dataset, especially given its size. If anybody has experience handling Twitter datasets of this size, please share your experiences, especially any suggestions as to which database technology to use and how long the import might take. Thank you.
I recommend using a database for this, especially considering its size (and this is without knowing anything about what the dataset holds). That said, for this and for future questions of this nature, I suggest using the software recommendations site, plus adding more detail about what the dataset looks like.
As for suggesting a specific database, I recommend doing some research into what each one does, but for something that just holds data with no relations, any of them will do and could show a great query improvement over plain text files: queries can be cached, and data is faster to retrieve because of how databases store and look up records, whether that's hashed values or whatever else they use.
Some popular databases:
MySQL, PostgreSQL - relational databases (simple, fast, and easy to use/set up, but they require some knowledge of SQL)
MongoDB - NoSQL database (also easy to use and set up, with no SQL needed; it relies more on dicts to access the DB through the API. It is also memory-mapped, so it can be faster than a relational database, but you need enough RAM for the indexes.)
ZODB - full-Python NoSQL database (kind of like MongoDB, but written in Python)
These are very light and brief descriptions of each DB; be sure to do your research before using them, as they each have their pros and cons. Also, remember these are just a few of the many popular and widely used databases; there are also TinyDB, SQLite (which comes with Python), and PickleDB, which are lightweight and generally meant for small applications.
My experience is mainly with PostgreSQL, TinyDB, and MongoDB, my favorites being MongoDB and PostgreSQL. For you, I'd look at either of those, but don't limit yourself; there's a slew of options, plus many drivers that help you write easier/less code if that's what you want. Remember, Google is your friend! And welcome to Stack Overflow!
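If you do go with MongoDB, a minimal pymongo sketch for bulk-loading line-delimited tweet JSON might look like the following; the database/collection names, file name and indexed field are examples only:

import json
from pymongo import MongoClient

client = MongoClient()              # assumes a local mongod on the default port
tweets = client["twitter"]["tweets"]

# Insert in batches so the whole 150 GB never sits in memory at once
batch = []
with open("tweets.jsonl") as f:
    for line in f:
        batch.append(json.loads(line))
        if len(batch) >= 10000:
            tweets.insert_many(batch)
            batch = []
if batch:
    tweets.insert_many(batch)

# Index the fields you will query on, otherwise every query scans the collection
tweets.create_index("user.screen_name")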
Edit
If your dataset is, and will remain, fairly simple but just large, and you want to stick with text files, consider pandas together with a JSON or CSV format and library. It can greatly increase efficiency when querying and managing data like this from text files, with less memory usage, since it doesn't always (or ever) need the entire dataset in memory.
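For that text-file route, pandas can read the files in chunks so that only a slice of the dataset is ever in memory; a small sketch (file name, separator and column are placeholders):

import pandas as pd

matches = []
# Only one 100,000-row chunk is held in memory at a time
for chunk in pd.read_csv("tweets.csv", sep="\t", chunksize=100000):
    matches.append(chunk[chunk["lang"] == "en"])

result = pd.concat(matches, ignore_index=True)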
You can try using any NoSQL DB; MongoDB would be a good place to start.

Recommendation for manipulating data with python vs. SQL for a django app

Background:
I am developing a Django app for a business application that takes client data and displays charts in a dashboard. I have large databases full of raw information, such as part sales by customer, and I will use that to populate the analyses. I have been able to do this very nicely in the past using Python with pandas, xlsxwriter, etc., and am now in the process of replicating in this web app what I have done before. I am using a PostgreSQL database to store the data, then Django to build the app and FusionCharts for the visualization. To get the information into Postgres, I am using a Python script with SQLAlchemy, which does a great job.
The question:
There are two ways I can manipulate the data that will populate the charts. 1) I can use the same script that exports the data to Postgres to arrange the data the way I want before it is exported. For instance, in certain cases I need to group the data by some parameter (by customer, for instance), then perform calculations on the groups by column. I could do this for each different slice I want and then export a different table for each model class to Postgres.
2) I can upload the entire database to Postgres and manipulate it later with Django commands that produce SQL queries.
I am much more comfortable doing it up front with Python because I have been doing it that way for a while. I also understand that Django's queries are a little more difficult to implement. However, doing it with Python would mean that I need more tables (because I will have grouped them in different ways), and I don't want to do it the way I know just because it is easier, if uploading a single database and using Django/SQL queries would be more efficient in the long run.
Any thoughts or suggestions are appreciated.
Well, it's the usual tradeoff between performance and flexibility. With the first approach you get better performance (your schema is tailored to the exact queries you want to run) but less flexibility (if you need to add more queries, the schema might not match so well - or even not match at all - in which case you'll have to repopulate the database, possibly from the raw sources, with an updated schema). With the second one you (hopefully) have a well-normalized schema, but one that makes queries much more complex and much heavier on the database server.
Now the question is: do you really have to choose ? You could also have both the fully normalized data AND the denormalized (pre-processed) data alongside.
As a side note: the Django ORM is indeed mostly an "80/20" tool - it's designed to make the 80% of simple queries super easy (much easier than, say, SQLAlchemy), and beyond that it does indeed become a bit of a PITA - but nothing forces you to use Django's ORM for everything (you can always drop down to raw SQL or use SQLAlchemy alongside it).
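For those heavier cases, dropping below the ORM is straightforward; a minimal sketch, with the table and column names made up for illustration:

from django.db import connection

def sales_by_customer():
    # Raw SQL for an aggregation that would be awkward through the ORM
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT customer_id, SUM(amount) "
            "FROM dashboard_partsale "
            "GROUP BY customer_id"
        )
        return cursor.fetchall()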
Oh, and yes: your problem is nothing new - you may want to read about OLAP.

SQLite table in RAM instead of FLASH

We're currently working on a Python project that basically reads and writes M2M data into/from an SQLite database. This database consists of multiple tables, one of which stores current values coming from the cloud. That last table is worrying me a bit, since it is written very often and the application runs on a flash drive.
I've read that virtual tables could be the solution. I've thought in converting the critical table into a virtual one and then link its contents to a real file (XML or JSON) stored in RAM (/tmp for example in Debian). I've been reading this article:
http://drdobbs.com/database/202802959?pgno=1
that explains more or less how to do what I want. It's quite complex and I think it is not very doable using Python. Maybe we need to develop our own SQLite extension, I don't know...
Any idea how to "place" our conflicting table in RAM whilst the rest of the database stays in flash? Any better/simpler approach to taking the virtual-table route from Python?
A very simple, SQL-only solution for creating an in-memory table is to use SQLite's ATTACH command with the special ":memory:" pseudo-filename:
ATTACH DATABASE ":memory:" AS memdb;
CREATE TABLE memdb.my_table (...);
Since the whole database "memdb" is kept in RAM, the data will be lost once you close the database connection, so you will have to take care of persistence by yourself.
One way to do it could be (see the Python sketch after this list):
Open your main SQLite database file
Attach an in-memory secondary database
Duplicate your performance-critical table in the in-memory database
Run all queries on the duplicate table
Once done, write the in-memory table back to the original table (BEGIN; DELETE FROM real_table; INSERT INTO real_table SELECT * FROM memory_table;)
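As a rough illustration of those steps with the standard-library sqlite3 module (table, column and file names are placeholders):

import sqlite3

conn = sqlite3.connect("main.db")                     # the database on flash
conn.execute('ATTACH DATABASE ":memory:" AS memdb')   # the RAM-only database

# Duplicate the write-heavy table into RAM
conn.execute("CREATE TABLE memdb.current_values AS SELECT * FROM current_values")

# ... run the frequent reads/writes against memdb.current_values ...
conn.execute("UPDATE memdb.current_values SET value = value + 1 WHERE id = 1")

# Periodically (or on shutdown) write the in-memory table back to flash
with conn:  # runs the two statements in a single transaction
    conn.execute("DELETE FROM current_values")
    conn.execute("INSERT INTO current_values SELECT * FROM memdb.current_values")

conn.close()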
But the best advice I can give you: Make sure that you really have a performance problem, the simple solution could just as well be fast enough!
Use an in-memory data structure server. Redis is a sexy option, and you can easily implement a table using lists. Also, it comes with a decent Python driver.
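The list-as-table idea with redis-py could look roughly like this; key and field names are made up for illustration:

import json
import redis

r = redis.Redis()

# Append rows to the "table"; each row is stored as a JSON blob
r.rpush("current_values", json.dumps({"sensor": "temp1", "value": 21.5}))
r.rpush("current_values", json.dumps({"sensor": "temp2", "value": 19.0}))

# Read the whole "table" back
rows = [json.loads(raw) for raw in r.lrange("current_values", 0, -1)]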

Django with huge mysql database

What would be the best way to import multi-million-record CSV files into Django?
Currently, using the Python csv module, it takes 2-4 days to process a 1 million record file. It does some checking for whether the record already exists, and a few other things.
Can this process be made to finish in a few hours?
Can memcache be used somehow?
Update: There are Django ManyToManyField fields that get processed as well. How will these be used with a direct load?
I'm not sure about your case, but we had a similar scenario with Django where ~30 million records took more than a day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python we made a radical strategy change and did the import (only the import) with Java and JDBC (+ some MySQL tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).
I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.
Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks you can use. As an example, you can write a Perl/Python script to read the CSV file and create an SQL script with insert statements, and then feed that SQL script directly to MySQL.
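A bare-bones sketch of that CSV-to-SQL-script idea in Python; the table and column names are invented, and the escaping is deliberately minimal, so treat it as a starting point only:

import csv

BATCH = 1000  # rows per INSERT, so each statement stays a manageable size

def quote(value):
    # Minimal escaping for a sketch; use a proper library for real data
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

def flush(out, rows):
    out.write("INSERT INTO myapp_record (col_a, col_b, col_c) VALUES\n")
    out.write(",\n".join(rows) + ";\n")

with open("records.csv", newline="") as src, open("load.sql", "w") as out:
    rows = []
    for row in csv.reader(src):
        rows.append("(" + ", ".join(quote(v) for v in row) + ")")
        if len(rows) == BATCH:
            flush(out, rows)
            rows = []
    if rows:
        flush(out, rows)

The resulting file can then be piped straight into the mysql client (e.g. mysql mydb < load.sql).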
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.
As mentioned, you want to bypass the ORM and go directly to the database. Depending on what type of database you're using, you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high-speed data loading, and for MySQL you can use the LOAD DATA command. I'm sure there's something similar for Postgres as well.
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes using mysqldump.
Like Craig said, you'd better fill the DB directly first.
That implies creating Django models that just fit the CSV cells (you can then create better models and scripts to move the data).
Then, for feeding the DB: a tool of choice for doing this is Navicat; you can grab a functional 30-day demo from their site. It allows you to import CSV into MySQL, save the import profile as XML...
Then I would launch the data-control scripts from within Django and, when you're done, migrate your models with South to get what you want or, like I said earlier, create another set of models within your project and use scripts to convert/copy the data.
