I am trying to find a way to improve the speed while pushing data to a MySQL database using pandas in python.
After my performance tests I arrived at the same conclusion that other people did: the best way to push data to a MySQL database is to use the native query 'LOAD DATA INFILE...' instead of the to_sql pandas method (even with improvements like this one or this one).
My problem is that when I want to push my data, it is in memory. So in order to use the native MySQL query, I need to dump it first into a file on the disk and then use the 'LOAD DATA...' query.
So here is my question: is there a way to 'simulate' a file written on the disk so I can avoid dumping my big files (200MB+) to it?
It might happen that dumping a big file can take some minutes, so I would not want to lose too much time there...
This approach may be a viable alternative without touching disk (for the load file):
Write code to create multi-row INSERT statements and execute them. I suggest 1000 rows at a time, with autocommit=ON.
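A minimal sketch of the batching idea above. The helper name `insert_in_batches`, the `readings` table and its columns are hypothetical, and the demo runs against the standard library's sqlite3 only so that it is runnable without a server. With a MySQL driver such as PyMySQL you would keep the default "%s" placeholder instead of "?", and the explicit commit per batch stands in for autocommit=ON.

```python
import sqlite3
from itertools import islice


def insert_in_batches(conn, table, columns, rows, batch_size=1000, placeholder="%s"):
    """Insert rows using one multi-row INSERT statement per batch.

    `table` and `columns` are pasted into the SQL text, so they must come from
    trusted code. The values themselves always go through placeholders.
    """
    rows = iter(rows)
    row_template = "(" + ", ".join([placeholder] * len(columns)) + ")"
    prefix = f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
    cursor = conn.cursor()
    total = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        # One statement carrying len(batch) rows: VALUES (...), (...), ...
        sql = prefix + ", ".join([row_template] * len(batch))
        params = [value for row in batch for value in row]
        cursor.execute(sql, params)
        conn.commit()  # commit after every batch (stand-in for autocommit=ON)
        total += len(batch)
    return total


# Demo on an in-memory SQLite database (hypothetical table and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER, name TEXT, value REAL)")
data = [(i, f"sensor{i}", i * 0.5) for i in range(250)]
inserted = insert_in_batches(
    conn, "readings", ["id", "name", "value"], data,
    batch_size=100, placeholder="?",
)
```

For a pandas DataFrame, `df.itertuples(index=False, name=None)` yields plain tuples that can be passed as `rows`, so nothing has to be written to disk first.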
Related
Situation: Need to deal with large amounts of data (~260MB CSV datafile of about 50B of data per line)
Problem: If I just read from the file every time I need to deal with it, it will take a long time. So I decided to push everything into a database. I need a fast database infrastructure to handle the data as I need to do a lot of reading and writing.
Question: What are the faster choices for database in Python?
Additional information 1: The data comprises 3 columns and I do not see myself needing any more than that. Would this mean that a NoSQL database is preferred?
Additional information 2: However, if in the future I do need more than one database working together, would it be better to go for a SQL database?
Additional information 3: I think it would help to mention that I am looking at a few different DBs (MongoDB, SQLite, tinydb), but do suggest other DBs that you know are faster.
I've experienced this same situation many times and I wanted something faster than a typical relational database. Redis is a very fast and scalable key/value database. You can get started quickly using the Popoto ORM.
Here is an example:
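import popoto

class City(popoto.Model):
    id = popoto.UniqueKeyField()
    name = popoto.KeyField()
    description = popoto.Field()

# Load each tab-separated row into Redis
with open("cities.csv") as f:
    for line in f:
        # Strip the trailing newline so it does not end up in the last field
        csv_row = line.rstrip("\n").split('\t')
        City.create(
            id=csv_row[0],
            name=csv_row[1],
            description=csv_row[2]
        )

# Look up a city by its key field
new_york = City.query.get(name="New York")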
This is the absolute fastest way to store and retrieve data without having to learn the nuances of a new database system.
Keep in mind that if your database grows beyond 5GB, an in-memory database like Redis can start to become expensive compared to slower disk-based databases like Postgres or MySQL.
Full disclosure: I help maintain the open source Popoto project.
Background:
I have an application written in Python to monitor the status of tools. The tools send their data from specific runs and it all gets stored in an Oracle database as JSON files.
My Problem/Solution:
Instead of connecting to the DB and then querying it repeatedly when I want to compare the current run data to the previous run's data, I want to make a copy of the database query so that I can compare the new run data to the copy that I made instead of to the results of the query.
The reason I want to do this is because constantly querying the server for the previous run's data is slow and puts unwanted load/usage on the server.
For the previous run's data there are multiple files associated with it (because there are multiple tools), so each query has more than one file that would need to be copied. Storing copies of the files from the query locally is what I intended to do, but I was wondering what the best way to go about this is, since I am relatively new to doing something like this.
So any help and suggestions on how to efficiently store the results of a query, which are multiple JSON files, would be greatly appreciated!
As you described, querying the DB too many times is not an option. In that case I would do it the following way:
When your program starts, you get the data for all tools as a set of JSON files per tool, right? I am not sure whether you get the data by querying the tools directly or by querying the DB, but it does not matter.
Check whether you have old data in the "cache dictionary" for that tool. If yes, do your comparison and then store the "new data" as "previous data" in the cache, ready for the next run. Do this for all tools. This loops forever :-)
This "cache dictionary" can be implemented in memory or on disk. For your amount of data I think memory is just fine.
With that approach you do not have to query the DB for the old data. The one case where you cannot do the comparison is at program start, when there is no old data in the "cache" yet; you could handle that by fetching it from the DB once (risking a long query time, but what can you do :-)
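The loop described above could be sketched roughly as follows. The class name `RunCache`, the `load_previous` fallback callback and the sample tool data are all made up for illustration, and each run's JSON is assumed to have been parsed into a flat dict already.

```python
class RunCache:
    """Keeps the previous run's data per tool in memory."""

    def __init__(self):
        self._previous = {}  # tool name -> data from the previous run

    def compare_and_update(self, tool, new_data, load_previous=None):
        """Return the sorted keys that changed since the previous run.

        Returns None when there is nothing to compare against. In every case
        `new_data` is stored as the "previous data" for the next run.
        """
        previous = self._previous.get(tool)
        if previous is None and load_previous is not None:
            # Slow path, only hit when the cache is empty (e.g. at startup):
            # the callback would query the database once.
            previous = load_previous(tool)
        self._previous[tool] = new_data
        if previous is None:
            return None
        all_keys = previous.keys() | new_data.keys()
        return sorted(k for k in all_keys if previous.get(k) != new_data.get(k))


cache = RunCache()
first = cache.compare_and_update("tool_a", {"temp": 20, "status": "ok"})
second = cache.compare_and_update("tool_a", {"temp": 25, "status": "ok"})
third = cache.compare_and_update(
    "tool_b", {"temp": 1}, load_previous=lambda tool: {"temp": 1, "fan": "on"}
)
```

If the cache needs to survive a restart, the same dictionary could be dumped to a local JSON file on shutdown and reloaded at startup.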
I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table by chunks (typically 10'000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses. I have not been able to export the full table with this.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() at the C level, which corresponds to using a Cursor or DictCursor at the Python level. This ensures that the database is free again as soon as possible and the client can do with the data whatever it wants.
But it is not suitable for large amounts of data, as the data is cached in the client process. This can be very memory consuming.
In this case, it may be better to use mysql_use_result() (C) or SSCursor / SSDictCursor (Python). This requires you to consume the whole result set and do nothing else with the database connection in the meantime, but it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
I don't know exactly what query you used because you have not given it here, but I suppose you're specifying a limit and offset. Such queries are quick at the beginning of the data but become very slow as the offset grows.
If you have a unique column such as ID, you can keep fetching only the first N rows each time, but modify the WHERE clause:
WHERE ID > (last_id)
This would use the index and would be acceptably fast.
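A small sketch of that "WHERE ID > last_id" pattern. It uses sqlite3 so it can run anywhere; the table `big_table`, its columns and the helper `iter_by_id` are invented for the demo, and ids are assumed to be positive integers.

```python
import sqlite3


def iter_by_id(conn, chunk_size=10000):
    """Yield all rows of big_table in id order, one chunk per query."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM big_table "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield from rows
        # Resume after the last id seen instead of using a growing OFFSET.
        last_id = rows[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO big_table (id, payload) VALUES (?, ?)",
    [(i, f"row{i}") for i in range(1, 26)],
)
exported = list(iter_by_id(conn, chunk_size=10))
```

Each query starts from an index lookup on the id, so the cost per chunk stays roughly constant however far the export has progressed.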
However, it should generally be faster to simply do
SELECT * FROM table
and open a cursor for that query, with a reasonably big fetch size.
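As a rough illustration of the single-query approach, the sketch below reads a cursor with `fetchmany()` and writes tab-separated lines. The helper `export_query` and the table `t` are hypothetical, and sqlite3 stands in for MySQL so the example is runnable. With a MySQL driver the default cursor usually buffers the whole result on the client, so you would pass an unbuffered cursor (SSCursor, as mentioned in the other answer) for this to actually save memory.

```python
import io
import sqlite3


def export_query(cursor, query, out, fetch_size=10000):
    """Run `query` once and stream the rows to `out` in chunks of `fetch_size`."""
    cursor.execute(query)
    written = 0
    while True:
        rows = cursor.fetchmany(fetch_size)
        if not rows:
            break
        for row in rows:
            # Any per-row transformation would go here.
            out.write("\t".join(str(value) for value in row) + "\n")
        written += len(rows)
    return written


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, f"n{i}") for i in range(7)])

buffer = io.StringIO()  # a real export would open a file on disk instead
count = export_query(conn.cursor(), "SELECT id, name FROM t ORDER BY id", buffer, fetch_size=3)
lines = buffer.getvalue().splitlines()
```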
We're currently working on a python project that basically reads and writes M2M data into/from a SQLite database. This database consists of multiple tables, one of them storing current values coming from the cloud. This last table is worrying me a bit since it's being written very often and the application runs on a flash drive.
I've read that virtual tables could be the solution. I've thought about converting the critical table into a virtual one and then linking its contents to a real file (XML or JSON) stored in RAM (/tmp for example in Debian). I've been reading this article:
http://drdobbs.com/database/202802959?pgno=1
that explains more or less how to do what I want. It's quite complex and I think that this is not very doable using Python. Maybe we need to develop our own sqlite extension, I don't know...
Any idea about how to "place" our conflicting table in RAM whilst the rest of the database stays in FLASH? Any better/simpler approach to taking the virtual table route under Python?
A very simple, SQL-only solution to create an in-memory table is using SQLite's ATTACH command with the special ':memory:' pseudo-filename:
ATTACH DATABASE ':memory:' AS memdb;
CREATE TABLE memdb.my_table (...);
Since the whole database "memdb" is kept in RAM, the data will be lost once you close the database connection, so you will have to take care of persistence by yourself.
One way to do it could be:
Open your main SQLite database file
Attach an in-memory secondary database
Duplicate your performance-critical table in the in-memory database
Run all queries on the duplicate table
Once done, write the in-memory table back to the original table (BEGIN; DELETE FROM real_table; INSERT INTO real_table SELECT * FROM memdb.my_table; COMMIT;)
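The steps above could look roughly like this with Python's sqlite3 module. The table name `current_values`, its columns and the `flush` helper are assumptions made up for the example, and a temporary directory stands in for the flash drive.

```python
import os
import sqlite3
import tempfile

# 1. Open the main database file (a temp file stands in for the flash drive).
db_path = os.path.join(tempfile.mkdtemp(), "main.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE current_values (key TEXT PRIMARY KEY, value REAL)")
conn.execute("INSERT INTO current_values VALUES ('temp', 20.0)")
conn.commit()

# 2. Attach an in-memory secondary database.
conn.execute("ATTACH DATABASE ':memory:' AS memdb")

# 3. Duplicate the write-heavy table in the in-memory database.
conn.execute("CREATE TABLE memdb.current_values (key TEXT PRIMARY KEY, value REAL)")
conn.execute("INSERT INTO memdb.current_values SELECT * FROM main.current_values")
conn.commit()

# 4. Frequent writes now only touch RAM.
for value in (21.0, 22.5, 23.0):
    conn.execute(
        "UPDATE memdb.current_values SET value = ? WHERE key = 'temp'", (value,)
    )
conn.commit()


def flush(connection):
    """5. Write the in-memory table back to the on-disk table in one transaction."""
    with connection:  # commits on success, rolls back on error
        connection.execute("DELETE FROM main.current_values")
        connection.execute(
            "INSERT INTO main.current_values SELECT * FROM memdb.current_values"
        )


flush(conn)
conn.close()

# Reopen the file to confirm that the last value reached the disk.
check = sqlite3.connect(db_path)
persisted = check.execute(
    "SELECT value FROM current_values WHERE key = 'temp'"
).fetchone()[0]
check.close()
```

How often `flush` runs is the trade-off: anything written since the last flush is lost if the process dies.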
But the best advice I can give you: Make sure that you really have a performance problem, the simple solution could just as well be fast enough!
Use an in-memory data structure server. Redis is a sexy option, and you can easily implement a table using lists. Also, it comes with a decent python driver.
What would be the best way to import multi-million record CSV files into Django?
Currently, using the Python csv module, it takes 2-4 days to process a 1 million record file. It checks whether each record already exists, plus a few other things.
Can this process be made to run in a few hours?
Can memcache be used somehow?
Update: There are Django ManyToManyField fields that get processed as well. How would these be handled with a direct load?
I'm not sure about your case, but we had a similar scenario with Django where ~30 million records took more than one day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python, we made a radical change of strategy and did the import (only) with Java and JDBC (+ some MySQL tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).
I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.
Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. As an example, you can write a perl/python script to read the CSV file and create a SQL script with insert statements, and then feed the SQL script directly to MySQL.
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
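To make the batching advice concrete, here is a small sketch that commits one transaction per 1000 rows. It bypasses the ORM and uses sqlite3 purely as a runnable stand-in for the real database; the table `records`, its three columns and the helper `load_csv` are hypothetical.

```python
import csv
import io
import sqlite3
from itertools import islice


def load_csv(conn, reader, batch_size=1000):
    """Insert CSV rows with one transaction per `batch_size` rows."""
    total = 0
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        with conn:  # one transaction per batch, committed on exit
            conn.executemany(
                "INSERT INTO records (a, b, c) VALUES (?, ?, ?)", batch
            )
        total += len(batch)
    return total


# Build a small in-memory CSV instead of reading a real file.
csv_text = "".join(f"{i},name{i},{i * 2}\n" for i in range(2500))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (a TEXT, b TEXT, c TEXT)")
loaded = load_csv(conn, csv.reader(io.StringIO(csv_text)), batch_size=1000)
```

The right batch size depends on the database and row width, so it is worth timing a few values rather than taking 1000 as given.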
And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.
As mentioned, you want to bypass the ORM and go directly to the database. Depending on what type of database you're using, you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high speed data loading, and for MySQL you can use the LOAD DATA INFILE command. I'm sure there's something similar for Postgres as well.
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes using mysqldump.
Like Craig said, you'd better fill the DB directly first.
This implies creating Django models that just fit the CSV columns (you can then create better models and scripts to move the data).
Then, feeding the DB: a tool of choice for doing this is Navicat; you can grab a functional 30-day demo on their site. It allows you to import CSV into MySQL, save the import profile as XML...
Then I would launch the data control scripts from within Django and, when you're done, migrate your model with South to get what you want or, like I said earlier, create another set of models within your project and use scripts to convert/copy the data.