What would be the best way to import multi-million-record CSV files into Django?
Currently, using the Python csv module, it takes 2-4 days to process a 1-million-record file. It does some checking of whether the record already exists, and a few other things.
Can this process be made to run in a few hours?
Can memcached be used somehow?
Update: there are Django ManyToManyField fields that get processed as well. How would these be used with a direct load?
I'm not sure about your case, but we had a similar scenario with Django, where ~30 million records took more than a day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python we made a radical strategy change and did the import (only) with Java and JDBC (plus some MySQL tuning), which got the import time down to ~45 minutes. With Java it was very easy to optimize, thanks to the very good IDE and profiler support.
I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.
Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. As an example, you can write a perl/python script to read the CSV file and create a SQL script with insert statements, and then feed the SQL script directly to MySQL.
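For example, here is a minimal sketch of such a converter; the table name 'myrecords', the file names, and the deliberately naive quoting are all assumptions to adapt to your schema:

import csv

def csv_to_sql(csv_path, sql_path, table='myrecords'):
    with open(csv_path, newline='') as src, open(sql_path, 'w') as dst:
        reader = csv.reader(src)
        columns = next(reader)  # assume the first row is the header
        col_list = ', '.join(columns)
        for row in reader:
            # naive quoting; use proper escaping for untrusted data
            values = ', '.join("'" + v.replace("'", "''") + "'" for v in row)
            dst.write("INSERT INTO %s (%s) VALUES (%s);\n" % (table, col_list, values))

csv_to_sql('records.csv', 'records.sql')
# then feed the script to MySQL: mysql mydb < records.sql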
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
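As an illustration, here is a minimal sketch of batching through the ORM at roughly 1,000 rows per transaction; the Record model, and the assumption that the CSV headers match its field names, are hypothetical:

import csv
from django.db import transaction
from myapp.models import Record  # hypothetical model matching the CSV header

BATCH_SIZE = 1000

def load(csv_path):
    batch = []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            batch.append(Record(**row))
            if len(batch) >= BATCH_SIZE:
                with transaction.atomic():  # one transaction per batch
                    Record.objects.bulk_create(batch)
                batch = []
    if batch:
        with transaction.atomic():
            Record.objects.bulk_create(batch)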
And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.
As mentioned, you want to bypass the ORM and go directly to the database. Depending on what type of database you're using, you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high-speed data loading, and for MySQL you can use the LOAD DATA INFILE command. I'm sure there's something similar for Postgres as well (COPY).
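For example, a minimal sketch of running LOAD DATA INFILE from Python, assuming the MySQLdb driver and a hypothetical staging table import_staging (on Postgres, COPY plays the same role):

import MySQLdb

# local_infile=1 lets the client send a local file to the server
conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                       db='mydb', local_infile=1)
cur = conn.cursor()
cur.execute(r"""
    LOAD DATA LOCAL INFILE 'records.csv'
    INTO TABLE import_staging
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
""")
conn.commit()
conn.close()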
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes from a mysqldump file.
Like Craig said, you'd better fill the db directly first.
This implies creating Django models that exactly fit the CSV columns (you can then create better models and scripts to move the data over).
Then, db feeding: a tool of choice for doing this is Navicat; you can grab a functional 30-day demo from their site. It allows you to import CSV into MySQL, save the import profile as XML...
Then I would launch the data-control scripts from within Django and, when you're done, migrate your model with South to get what you want or, like I said earlier, create another set of models within your project and use scripts to convert/copy the data.
I am creating a Python system that needs to handle many files. Each file has more than 10 thousand lines of text data.
Because a DB (like MySQL) cannot be used in that environment, when a file is uploaded by a user I think I will save all the data of the uploaded file in an in-memory SQLite database, so that I can use SQL to fetch specific data from there.
Then, when all operations by the program are finished, the processed data is saved to a file. This is the file users will receive from the system.
But some websites say SQLite shouldn't be used in production. In my case, though, I just keep the data temporarily in memory so I can use SQL on it. Is there any problem with using SQLite in production even in this scenario?
Edit:
The data in the in-memory DB doesn't need to be shared between processes. The program just creates tables, processes the data, then discards all data and tables after saving the processed data to a file. I just think saving everything in a list makes searching difficult and slow. Is using SQLite still a problem in that case?
"SQLite shouldn't be used in production" is not a one-size-fits-all rule; it's more of a rule of thumb. Of course there are applications where one could reasonably use SQLite even in production environments.
However your case doesn't seem to be one of them. While SQLite supports multi-threaded and multi-process environments, it will lock all tables when it opens a write transaction. You need to ask yourself whether this is a problem for your particular case, but if you're uncertain go for "yes, it's a problem for me".
You'd probably be okay with in-memory structures alone, unless there are some details you haven't uncovered.
I'm not familiar with the specific context of your system, but if what you're looking for is a SQL database that is:
lightweight,
accessed from a single process and a single thread, and
recoverable if the system crashes in the middle (either by restoring the last stable backup of the database or by just recreating it from scratch).
If you meet all these criteria, using SQLite in production is fine. OSX, for example, uses SQLite for a few purposes (e.g. /var/db/auth.db).
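For illustration, a minimal sketch of the throwaway in-memory pattern from the question above; the file name, schema, and search term are made up:

import sqlite3

conn = sqlite3.connect(':memory:')  # lives only as long as this connection
conn.execute("CREATE TABLE lines (lineno INTEGER, text TEXT)")

with open('uploaded.txt') as f:  # hypothetical uploaded file
    rows = [(i, line.rstrip('\n')) for i, line in enumerate(f, 1)]
conn.executemany("INSERT INTO lines VALUES (?, ?)", rows)

# use SQL to fetch specific data
matches = conn.execute(
    "SELECT lineno, text FROM lines WHERE text LIKE ?", ('%error%',)
).fetchall()

conn.close()  # all tables and data are discarded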
Background:
I have an application written in Python to monitor the status of tools. The tools send their data from specific runs and it all gets stored in an Oracle database as JSON files.
My Problem/Solution:
Instead of connecting to the DB and then querying it repeatedly whenever I want to compare the current run's data to the previous run's data, I want to make a local copy of the query results, so that I can compare the new run's data to that copy instead of re-running the query.
The reason I want to do this is because constantly querying the server for the previous run's data is slow and puts unwanted load/usage on the server.
For the previous run's data there are multiple files associated with it (because there are multiple tools), so each query has more than one file that would need to be copied. Locally storing copies of the files from the query is what I intended to do, but I was wondering what the best way to go about this is, since I am relatively new to doing something like this.
So any help and suggestions on how to efficiently store the results of a query, which are multiple JSON files, would be greatly appreciated!
As you described, querying the db too many times is not an option. OK, in that case I would do it the following way:
When your program starts, you get the data for all tools as a set of JSON files per tool, right? OK. I am not sure how you get the data (by querying the tools directly or by querying the db), but it does not matter.
You check if you have old data in the "cache dictionary" for that tool. If yes, do your compare and store the "new data" as "previous data" in the cache, ready for the next run. Do this for all tools. This loops forever :-)
This "cache dictionary" can be implemented in memory or on disk. For your amount of data, I think memory is just fine.
With that approach you do not have to query the db for the old data. The case where you cannot do the compare because there is no old data in the "cache" at program start can be handled by trying to get it from the db (risking long query times, but what can you do :-)
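A minimal sketch of what I mean; fetch_current_run(), fetch_previous_run_from_db(), and compare() are hypothetical helpers standing in for your own code:

import time

cache = {}  # tool name -> previous run's data, kept in memory

def process_tool(tool):
    new_data = fetch_current_run(tool)  # however you get the current data
    old_data = cache.get(tool)
    if old_data is None:
        # slow fallback at program start: one query against the db
        old_data = fetch_previous_run_from_db(tool)
    if old_data is not None:
        compare(old_data, new_data)  # your comparison logic
    cache[tool] = new_data  # becomes the "previous data" next time

while True:  # this loops forever :-)
    for tool in ('toolA', 'toolB'):
        process_tool(tool)
    time.sleep(60)  # poll interval; pick whatever fits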
I have developed a Shiny Application. When it starts, it loads, ONCE, some datatables. Around 4 GB of datatables. Then, people connecting to the app can use the interface and play with those datatables.
This application is nice but has some limitations. That's why I am looking for another solution.
My idea is to have Pandas and Django working together. This way, I could develop an interface and a RESTful API at the same time. But what I need is for all requests coming to Django to be able to use pandas data tables that have been loaded once. Imagine if 4 GB of memory had to be loaded for every request... it would be horrible.
I have looked everywhere but couldn't find any way of doing this. I found this question: https://stackoverflow.com/questions/28661255/pandas-sharing-same-dataframe-across-the-request but it has no responses.
Why do I need to have the data in RAM? Because I need the performance to render the requested results quickly. I can't ask MariaDB to calculate and maintain that data, for example, as it involves calculations that only R, or a specialized package in Python or another language, can do.
I have a similar use case where I want to load (instantiate) a certain object only once and use it in all requests, since it takes some time (seconds) to load and I couldn't afford the lag that it would introduce for every request.
I use a feature in Django>=1.7, the AppConfig.ready() method, to load this only once.
Here is the code:
# apps.py
from django.apps import AppConfig
from sexmachine.detector import Detector

class APIConfig(AppConfig):
    name = 'api'

    def ready(self):
        # Singleton utility.
        # We load it here to avoid multiple instantiations across other
        # modules, which would take too much time.
        print("Loading gender detector...")
        global gender_detector
        gender_detector = Detector()
        print("ok!")
Then when you want to use it:
from api.apps import gender_detector
gender_detector.get_gender('John')
Load your data table in the ready() method and then use it from anywhere. I reckon the table will be loaded once for each WSGI worker, so be careful here.
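For illustration, a minimal sketch of the same pattern applied to a pandas DataFrame, mirroring the gender_detector code above; the app name 'core' and the pickle path are assumptions:

# apps.py
import pandas as pd
from django.apps import AppConfig

class CoreConfig(AppConfig):
    name = 'core'

    def ready(self):
        # loaded once per WSGI worker, then shared by every request it serves
        global big_table
        big_table = pd.read_pickle('/data/big_table.pkl')

# elsewhere, after startup:
from core.apps import big_table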
I may be misunderstanding the problem but to me having a 4 GB database table that is readily accessible by users shouldn't be too much of a problem. Is there anything wrong with actually just loading up the data one time upfront like you described? 4GB isn't too much RAM now.
Personally, I'd recommend you just use the database system itself instead of loading stuff into memory and crunching it with Python. If you set up the data properly, you can issue many thousands of queries in just seconds. Pandas is actually written to mimic SQL, so most of the code you are using can probably be translated directly to SQL. Just recently I had a situation at work where I set up a big join operation over a couple hundred files (~4 GB in total, 600k rows per file) using pandas. The total execution time ended up being something like 72 hours, and this was an operation that had to be run once an hour. A coworker ended up rewriting the same Python/pandas code as a pretty simple SQL query that finished in 5 minutes instead of 72 hours.
Anyway, I'd recommend looking into storing your pandas dataframe in an actual database table. Django is built on a database (usually MySQL or Postgres), and pandas even has a function to insert your dataframe directly into a database table: DataFrame.to_sql(table_name, connection). From there you can write the Django code such that responses make a single query to the DB, fetch the values, and return the data in a timely manner.
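As a minimal sketch (the connection URL and table name are made up, and a SQLAlchemy engine is assumed for the connection):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pw@localhost/mydb')
df = pd.read_csv('big_table.csv')
df.to_sql('big_table', engine, if_exists='replace', index=False)
# Django can then hit "big_table" via an unmanaged model or raw SQL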
This might sound like a bit of an odd question - but is it possible to load data from a (in this case MySQL) table to be used in Django without the need for a model to be present?
I realise this isn't really the Django way, but given my current scenario, I don't really know how better to solve the problem.
I'm working on a site which, for one aspect, makes use of a table of data that has been bought from a third party. The columns of interest are likely to remain stable; however, the structure of the table could change with subsequent updates to the data set. The table is also massive (in terms of columns), so I'm not keen on typing out each field in the model one by one. I'd also like to leave the table intact, so coming up with a model that represents only the set of columns I am interested in is not really an ideal solution.
Ideally, I want to have this table in a database somewhere (possibly separate to the main site database) and access its contents directly using SQL.
You can always execute raw SQL directly against the database: see the docs.
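For example, a minimal sketch using Django's connection.cursor(); the table and column names are made up:

from django.db import connection

def fetch_rows(min_score):
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT id, name FROM thirdparty_data WHERE score > %s",
            [min_score],
        )
        return cursor.fetchall()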
There is a feature called inspectdb in Django. For legacy databases like MySQL, it creates models automatically by inspecting your db tables (python manage.py inspectdb > models.py), so you don't need to type all the columns manually. But read the documentation carefully before creating the models, because it may affect the DB data. I hope this will be useful for you.
I guess you can use any SQL library available for Python, for example http://www.sqlalchemy.org/.
You then just have to connect to your database, run your query, and use the data as you wish. I don't think you can use Django without its model system, but nothing prevents you from using another library for this in parallel.
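For illustration, a minimal sketch using SQLAlchemy (1.4+) reflection, so the columns are discovered from the database itself; the connection URL and table name are assumptions:

from sqlalchemy import create_engine, MetaData, Table, select

engine = create_engine('mysql://user:pw@localhost/thirdparty_db')
metadata = MetaData()
# reflect the table's columns from the database instead of declaring them
bought = Table('bought_data', metadata, autoload_with=engine)

with engine.connect() as conn:
    for row in conn.execute(select(bought).limit(10)):
        print(row)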
We're currently working on a python project that basically reads and writes M2M data into/from a SQLite database. This database consists of multiple tables, one of them storing current values coming from the cloud. This last table is worrying me a bit since it's being written very often and the application runs on a flash drive.
I've read that virtual tables could be the solution. I've thought of converting the critical table into a virtual one and then linking its contents to a real file (XML or JSON) stored in RAM (/tmp, for example, on Debian). I've been reading this article:
http://drdobbs.com/database/202802959?pgno=1
that explains more or less how to do what I want. It's quite complex, and I think this is not very doable using Python. Maybe we need to develop our own SQLite extension, I don't know...
Any idea about how to "place" our conflicting table in RAM whilst the rest of the database stays in flash? Any better/simpler approach to taking the virtual-table route under Python?
A very simple, SQL-only solution to create an in-memory table is using SQLite's ATTACH command with the special ":memory:" pseudo-filename:
ATTACH DATABASE ":memory:" AS memdb;
CREATE TABLE memdb.my_table (...);
Since the whole database "memdb" is kept in RAM, the data will be lost once you close the database connection, so you will have to take care of persistence by yourself.
One way to do it could be:
Open your main SQLite database file
Attach an in-memory secondary database
Duplicate your performance-critical table in the in-memory database
Run all queries on the duplicate table
Once done, write the in-memory table back to the original table (BEGIN; DELETE FROM real_table; INSERT INTO real_table SELECT * FROM memory_table;)
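For illustration, a minimal sketch of this flow with Python's sqlite3 module; the database path, table name, and checkpoint timing are assumptions:

import sqlite3

conn = sqlite3.connect('/flash/app.db')  # main database on the flash drive
conn.execute("ATTACH DATABASE ':memory:' AS memdb")
conn.execute("CREATE TABLE memdb.real_table AS SELECT * FROM real_table")

# ... run the frequent reads/writes against memdb.real_table ...

with conn:  # one transaction for the write-back
    conn.execute("DELETE FROM real_table")
    conn.execute("INSERT INTO real_table SELECT * FROM memdb.real_table")
conn.close()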
But the best advice I can give you: make sure that you really have a performance problem; the simple solution could just as well be fast enough!
Use an in-memory data structure server. Redis is a sexy option, and you can easily implement a table using lists. Also, it comes with a decent python driver.
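For example, a minimal sketch of a "table" kept as a Redis list with the redis-py client; the key and field names are made up:

import json
import redis

r = redis.Redis(host='localhost', port=6379)

# append a "row" to the table
r.rpush('current_values', json.dumps({'sensor': 'temp1', 'value': 21.5}))

# read the whole "table" back
rows = [json.loads(x) for x in r.lrange('current_values', 0, -1)]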