I'm using Django to build a website with a MySQL (MyISAM) backend.
The database is populated from a number of XML files that an external script processes and outputs as a JSON file. Whenever a new JSON file differs from the old one, I need to wipe the old MySQL database and recreate it using manage.py loaddata. (At least that's the easy way to do it; I suppose I could diff the JSON files and apply only the changes to the database, but I haven't figured out a good solution for that (I'm neither a very good coder nor a web developer).)
Anyway, the JSON file is around 10 MB and ends up being about 21,000 rows of SQL (it's not expected to grow significantly). There are 7 tables, and they all look something like this:
class Subnetwork(models.Model):
    SubNetwork = models.CharField(max_length=50)
    NetworkElement = models.CharField(max_length=50)
    subNetworkId = models.IntegerField()
    longName = models.CharField(max_length=50)
    shortName = models.CharField(max_length=50)
    suffix = models.CharField(max_length=50)
It takes up to a minute (sometimes only 30 seconds) to import it into MySQL. I don't know if this is to be expected for a file of this size? What can I do (if anything) to improve performance?
For what it's worth, here's some profiler output https://gist.github.com/1287847
There are a couple of solutions, some more decent than others, but here is a workaround that keeps your system's "downtime" minimal without needing to write a db synchronization mechanism (which would probably be a better solution in most cases):
Create a custom settings_build.py file, with from settings import *, that chooses a random name for a new db (probably with the date in the db name), creates it by calling mysqladmin, and updates the name in DATABASES.
Create a custom django management command (let's call it builddb), either by cloning the loaddata command or by calling it. On a successful result, it should write the db name to a one-line dbname text file and execute a shell command that reloads your django (apache/gunicorn/?) server.
Modify your settings.py to load the database name from that text file (see the sketch below).
And now run your build process like this:
./manage.py builddb --settings=settings_build
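A rough sketch of what settings_build.py and the settings.py hook could look like (the database name prefix, credentials and the dbname file path here are placeholders, not part of the original answer):
# settings_build.py -- minimal sketch, assuming the default database alias
import datetime
import subprocess

from settings import *  # start from the normal settings

# pick a fresh database name, e.g. mysite_20111014_1530
_new_db = "mysite_" + datetime.datetime.now().strftime("%Y%m%d_%H%M")

# create the empty database up front with mysqladmin
subprocess.check_call(["mysqladmin", "--user=dbuser", "--password=dbpass",
                       "create", _new_db])

# point Django at the new database for this build run
DATABASES["default"]["NAME"] = _new_db

# and in the regular settings.py, load the name that builddb wrote:
# with open("/path/to/dbname") as f:
#     DATABASES["default"]["NAME"] = f.read().strip()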
I solved it by exporting the processed XML files to CSV instead of JSON, and then using a separate script that called mysqlimport to do the importing.
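A minimal sketch of that kind of wrapper script (database name, credentials and paths are placeholders; note that mysqlimport loads each CSV into the table whose name matches the file name):
import subprocess

subprocess.check_call([
    "mysqlimport",
    "--local",
    "--fields-terminated-by=,",
    '--fields-enclosed-by="',
    "--user=dbuser", "--password=dbpass",
    "mydatabase",
    "/path/to/subnetwork.csv",  # loaded into the table named "subnetwork"
])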
I have a CSV file with 4,500,000 rows in it that needs to be imported into my Django Postgres database. This file includes relations, so it isn't as easy as using COPY to import the CSV file straight into the database.
If I wanted to load it straight into Postgres, I could change the CSV file to match the database tables, but I'm not sure how to handle the relationships, since I need to know the inserted id in order to build them.
Is there a way to generate sql inserts that will get the last id and use that in future statements?
I initially wrote this using the Django ORM, but it's going to take way too long to do that and it seems to be slowing down. I removed all of my indexes and constraints, so that shouldn't be the issue.
The database is running locally on my machine. I figured once I get the data into a database, it wouldn't be hard to dump and reload it on the production database.
So how can I get this data into my database with the correct relationships?
Note that I don't know Java, so the answer suggested here isn't super practical for me: Django with huge mysql database
EDIT:
Here are more details:
I have a model something like this:
class Person(models.Model):
    name = models.CharField(max_length=100)
    offices = models.ManyToManyField(Office)
    job = models.ForeignKey(Job)

class Office(models.Model):
    address = models.CharField(max_length=100)

class Job(models.Model):
    title = models.CharField(max_length=100)
So I have a person who can have 1 job but many offices. (My real model has more fields, but you get the idea).
My CSV file is something like this:
name,office_1,office_2,job
hailey,"123 test st","222 USA ave.",Programmer
There are more fields than that, but I'm only including the relevant ones.
So I need to make the person object and the office objects and relate them. The job objects are already created so all I need to do there is find the job and save it as the person's job.
The original data was not in a database before this. Only the flat file. We are trying to make it relational so there is more flexibility.
Thanks!!!
Well, this is a tough one.
When you say relations, are they all in a single CSV file? I mean, like this, presuming a simple data model with a relation to itself?
id;parent_id;name
4;1;Frank
1;;George
2;1;Costanza
3;1;Stella
If this is the case and it's out of order, I would write a Python script to reorder these and then import them.
I had a scenario a while back where I had a number of CSV files, but they were for individual models, so I loaded the parent one first, then the second, etc.
We wrote custom importers that would read the data from a single CSV and do some processing on it, like checking if it already existed, whether some things were valid, etc. A method for each CSV file.
For CSVs that were big enough, we just split them into smaller files (around 200k records each) and processed them one after the other. The difference is that all the previous data that the big CSV depended on was already in the database, imported by the same method described previously.
Without an example, I can't comment much more.
EDIT
Well, since you gave us your model, and based on the fact that the job model is already there, I would go for something like this:
Create a custom method, even one you can invoke from the shell. A method/function or whatever, that will receive a single line of the file.
In that method, discover how many offices that person is related to. Search to see if the office already exists in the DB. If so, use it to relate the person and the office. If not, create it and relate them.
Look up the job. Does it exist? Yes, then use it. No? Create it and then use it.
Something like this:
def process_line(line):
    data = line.split(";")

    person = Person()
    # fill in the person details that are in the CSV
    person.name = data[1]
    person.save()  # you'll need to save before using the m2m

    offices = get_offices_from_line(line)  # returns the plain data, not Office instances
    for office in offices:
        office_obj, created = Office.objects.get_or_create(address=office)
        person.offices.add(office_obj)

    job_obj, job_created = Job.objects.get_or_create(title=data[5])
    person.job = job_obj
    person.save()
    # repeat for any other fields and relations
Be aware that the function above was not tested or guarded against any kind of errors. You'll need to:
Do that yourself;
Create the function that identifies the offices each person has. I don't know the data, but perhaps if you look at the field preceding the first office and at the first field after all the offices, you'll be able to grasp all of them;
Create a function to parse the top-level file, iterate over its lines and pass them along to your shiny import function (a minimal driver is sketched below).
Here are the docs for get_or_create: https://docs.djangoproject.com/en/1.8/ref/models/querysets/#get-or-create
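A minimal driver along those lines might look like this (the path argument, header handling and line-based reading are assumptions, not part of the original answer):
def import_file(path):
    with open(path) as f:
        next(f)  # skip the header row, if there is one
        for line in f:
            process_line(line.rstrip("\n"))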
I've made a really simple single-user database application with web2py to be deployed to a desktop machine. The reason I chose web2py is its simplicity and its unintrusive web server.
My problem is that I need to migrate an existing database from another application, which I've preprocessed and prepared into a CSV file that can now be imported perfectly into web2py's SQLite database.
Now, I have a problem with an 'upload' field in one of the tables, which corresponds to a small image. I've filled that field in the CSV with the name of the corresponding .jpg file that I extracted from the original database. The problem is that I haven't managed to insert these files correctly into the uploads folder, as the web2py engine automatically changes the filename of users' uploads to a safe format, and copying my files straight into the folder does not work.
My question is, does anyone know a proper way to include this image collection in the uploads folder? I don't know if there is a way to disable this protection or if I have to manually change their names to a valid hash. I've also considered the idea of coding an automatic insert process into the database...
Thanks all for your attention!
EDIT (a working example):
An example database:
db.define_table('product',
    Field('name'),
    Field('color'),
    Field('picture', 'upload'),
)
Then using the default appadmin module from my application I import a csv file with entries of the form:
product.name,product.color,product.picture
"p1","red","p1.jpg"
"p2","blue","p2.jpg"
Then in my application I have the usual download function:
def download():
    return response.download(request, db)
Which I call to retrieve the images uploaded into the database, for example, to include them in a view:
<img src="{{=URL('download', args=product.picture)}}" />
So my problem is that I have all the images corresponding to the database records, and I need to import them into my application by properly including them in the uploads folder.
If you want the files to be named via the standard web2py file upload mechanism (which is a good idea for security reasons) and easily downloaded via the built-in response.download() method, then you can do something like the following.
In /yourapp/controllers/default.py:
def copy_files():
    import os
    for row in db().select(db.product.id, db.product.picture):
        picture = open(os.path.join(request.folder, 'private', row.picture), 'rb')
        row.update_record(picture=db.product.picture.store(picture, row.picture))
    return 'Files copied'
Then place all the files in the /yourapp/private directory and go to the URL /default/copy_files (you only need to do this once). This will copy each file into the /uploads directory and rename it, storing the new name in the db.product.picture field.
Note, the above function doesn't have to be a controller action (though if you do it that way, you should remove the function when finished). Instead, it could be a script that you run via the web2py command line (needs to be run in the app environment to have access to the database connection and model, as well as reference to the proper /uploads folder) -- in that case, you would need to call db.commit() at the end (this is not necessary during HTTP requests).
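A sketch of that script variant, assuming it lives at applications/yourapp/private/copy_files.py and is run in the app environment with something like: python web2py.py -S yourapp -M -R applications/yourapp/private/copy_files.py (the path and invocation are illustrative):
# copy_files.py -- same logic as the controller action above, plus the commit
import os

for row in db().select(db.product.id, db.product.picture):
    picture = open(os.path.join(request.folder, 'private', row.picture), 'rb')
    row.update_record(picture=db.product.picture.store(picture, row.picture))

db.commit()  # needed here because we are outside an HTTP request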
Alternatively, you can leave things as they are and instead (a) manage uploads and downloads manually instead of relying on web2py's built-in mechanisms, or (b) create custom_store and custom_retrieve functions (unfortunately, I don't think these are well documented) for the picture field, which will bypass web2py's built-in store and retrieve functions. Life will probably be easier, though, if you just go through the one-time process described above.
I am trying to build a website that displays stock information and I have a file called populate_stocks.py that populates the database with a given set of stocks. Since these stocks change almost every minute, I need to make sure I update the database with new information by running populate_stocks.py again.
I was wondering if there is any way to let my django application automatically call this file to update the stock information. I searched around and found another person using crontab, which seems a bit complicated, and was wondering if there is another solution.
Though I agree that using crontab is probably the easiest solution to this problem, there is a possible alternative to crontab, which would be to make a management command that runs alongside your server.
Basically in its absolute simplest form you would make a basic loop in the form of a Django NoArgsCommand.
from django.core.management.base import NoArgsCommand
from populate_stocks import yourmysticalfunctionofupdating

class Command(NoArgsCommand):
    help = "This runs the loop of glory, that does as it is told."

    def handle_noargs(self, **options):
        while True:
            yourmysticalfunctionofupdating()
You would need to put this into a management -> commands folder and name the Python file whatever you want the command to be (imagine it is updatify.py in this example).
You could then run the following command to run your watchdog.
./manage.py updatify
Though this may be overkill for your particular problem I have found it very helpful for trickier issues, and I hope it saves someone some time.
Note: Scroll down to the Background section for useful details. Assume the project uses Python-Django and South, in the following illustration.
What's the best way to import the following CSV
"john","doe","savings","personal"
"john","doe","savings","business"
"john","doe","checking","personal"
"john","doe","checking","business"
"jemma","donut","checking","personal"
Into a PostgreSQL database with the related tables Person, Account, and AccountType considering:
Admin users can change the database model and CSV import-representation in real-time via a custom UI
The saved CSV-to-Database table/field mappings are used when regular users import CSV files
So far, two approaches have been considered:
ETL-API Approach: Provide an ETL API with a spreadsheet, my CSV-to-Database table/field mappings, and connection info for the target database. The API would then load the spreadsheet and populate the target database tables. Looking at pygrametl, I don't think what I'm aiming for is possible. In fact, I'm not sure any ETL APIs do this.
Row-level Insert Approach: Parse the CSV-to-Database table/field mappings, parse the spreadsheet, and generate SQL inserts in "join-order".
I implemented the second approach but am struggling with algorithm defects and code complexity. Is there a Python ETL API out there that does what I want? Or an approach that doesn't involve reinventing the wheel?
Background
The company I work at is looking to move hundreds of project-specific design spreadsheets hosted in SharePoint into databases. We're near completing a web application that meets this need by allowing an administrator to define/model a database for each project, store spreadsheets in it, and define the browse experience. At this stage of completion, transitioning to a commercial tool isn't an option. Think of the web application as a django-admin alternative (though it isn't) with a DB modeling UI, CSV import/export functionality, customizable browsing, and modularized code to address project-specific customizations.
The implemented CSV import interface is cumbersome and buggy, so I'm trying to get feedback and find alternate approaches.
How about separating the problem into two separate problems?
Create a Person class which represents a person in the database. This could use Django's ORM, or extend it, or you could do it yourself.
Now you have two issues:
Create a Person instance from a row in the CSV.
Save a Person instance to the database.
Now, instead of just CSV-to-Database, you have CSV-to-Person and Person-to-Database. I think this is conceptually cleaner. When the admins change the schema, that changes the Person-to-Database side. When the admins change the CSV format, they're changing the CSV-to-Person side. Now you can deal with each separately.
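For example, a minimal sketch of that split, assuming hypothetical Django models Person, Account and AccountType (the app name, field names, ORM calls and CSV layout below are illustrative assumptions, not part of the question):
from myapp.models import Person, Account, AccountType  # hypothetical app and models

class PersonRecord(object):
    """Plain in-memory representation, decoupled from both the CSV and the DB."""
    def __init__(self, first_name, last_name, accounts):
        self.first_name = first_name
        self.last_name = last_name
        self.accounts = accounts  # list of (account_name, account_kind) tuples

    @classmethod
    def from_csv_row(cls, row):
        # CSV-to-Person: only this side changes when admins change the CSV layout
        first, last, account, kind = row
        return cls(first, last, [(account, kind)])

    def save(self):
        # Person-to-Database: only this side changes when admins change the schema
        person, _ = Person.objects.get_or_create(first_name=self.first_name,
                                                 last_name=self.last_name)
        for account, kind in self.accounts:
            account_type, _ = AccountType.objects.get_or_create(name=kind)
            Account.objects.get_or_create(person=person, name=account,
                                          account_type=account_type)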
Does that help any?
I write import sub-systems almost every month at work, and since I do that kind of task so much, I wrote django-data-importer some time ago. This importer works like a django form and has readers for CSV, XLS and XLSX files that give you lists of dicts.
With the data_importer readers you can read a file into lists of dicts, iterate over it with a for loop and save the lines to the DB.
With the importer you can do the same, but with the bonus of validating each field of each line, logging errors and actions, and saving everything at the end.
Please take a look at https://github.com/chronossc/django-data-importer. I'm pretty sure that it will solve your problem and help you with processing any kind of CSV file from now on :)
To solve your problem I suggest using data-importer with celery tasks. You upload the file and fire the import task via a simple interface. The celery task will send the file to the importer, and you can validate lines, save them and log errors. With some effort you can even show the progress of the task to the user who uploaded the sheet.
I ended up taking a few steps back to address this problem, per Occam's razor, using updatable SQL views. It meant a few sacrifices:
Removing: South.DB-dependent real-time schema administration API, dynamic model loading, and dynamic ORM syncing
Defining models.py and an initial south migration by hand.
This allows for a simple approach to importing flat datasets (CSV/Excel) into a normalized database:
Define unmanaged models in models.py for each spreadsheet
Map those to updatable SQL Views (INSERT/UPDATE-INSTEAD SQL RULEs) in the initial south migration that adhere to the spreadsheet field layout
Iterate through the CSV/Excel spreadsheet rows and perform an INSERT INTO <VIEW> (<COLUMNS>) VALUES (<CSV-ROW-FIELDS>); for each one, as sketched below
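A rough sketch of that last step, assuming a view named person_account_view whose columns mirror the spreadsheet layout (the view name, columns and CSV layout are illustrative, not from the original answer):
import csv

from django.db import connection

def import_spreadsheet(path):
    cursor = connection.cursor()
    with open(path) as f:
        for row in csv.reader(f):
            # the INSERT is rewritten by the view's INSTEAD rules into inserts
            # on the underlying normalized tables
            cursor.execute(
                "INSERT INTO person_account_view "
                "(first_name, last_name, account, account_type) "
                "VALUES (%s, %s, %s, %s)",
                row,
            )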
Here is another approach that I found on GitHub. Basically it detects the schema and allows overrides. Its whole goal is to generate raw SQL to be executed by psql or whatever driver.
https://github.com/nmccready/csv2psql
% python setup.py install
% csv2psql --schema=public --key=student_id,class_id example/enrolled.csv > enrolled.sql
% psql -f enrolled.sql
There are also a bunch of options for doing alters (creating primary keys from many existing cols) and merging / dumps.
What would be the best way to import multi-million record CSV files into Django?
Currently, using the Python csv module, it takes 2-4 days to process a 1 million record file. It does some checking of whether the record already exists, and a few other things.
Can this process be made to execute in a few hours?
Can memcache be used somehow?
Update: There are Django ManyToManyField fields that get processed as well. How will these be handled with a direct load?
I'm not sure about your case, but we had a similar scenario with Django where ~30 million records took more than one day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python, we took a radical strategy change and did the import (only) with Java and JDBC (plus some MySQL tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).
I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.
Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. As an example, you can write a perl/python script to read the CSV file and create a SQL script with insert statements, and then feed the SQL script directly to MySQL.
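A minimal sketch of that CSV-to-SQL-script idea (the table name, columns and file paths are placeholders, and real code should handle escaping and data types more carefully than this):
import csv

with open('records.csv') as src, open('records.sql', 'w') as out:
    for name, office, job in csv.reader(src):
        out.write(
            "INSERT INTO myapp_person (name, office, job) VALUES ('%s', '%s', '%s');\n"
            % (name.replace("'", "''"), office.replace("'", "''"), job.replace("'", "''"))
        )
# then feed the script to MySQL, e.g.:  mysql mydatabase < records.sql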
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.
As mentioned, you want to bypass the ORM and go directly to the database. Depending on what type of database you're using, you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high speed data loading, and for MySQL you can use the LOAD DATA INFILE command. I'm sure there's something similar for Postgres as well.
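For illustration, a sketch of issuing MySQL's LOAD DATA from a Django cursor (the table, columns and file path are placeholders, and LOCAL requires the client and server to allow local infile):
from django.db import connection

cursor = connection.cursor()
cursor.execute(
    "LOAD DATA LOCAL INFILE '/path/to/records.csv' "
    "INTO TABLE myapp_person "
    "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
    "LINES TERMINATED BY '\\n' "
    "(name, office, job)"
)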
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes using mysqldump.
Like Craig said, you'd better fill the db directly first.
It implies creating Django models that just fit the CSV cells (you can then create better models and scripts to move the data).
Then, for feeding the db: a tool of choice for doing this is Navicat; you can grab a functional 30-day demo on their site. It allows you to import CSV into MySQL, save the import profile as XML, and so on.
Then I would launch the data control scripts from within Django, and when you're done, migrate your model with South to get what you want or, like I said earlier, create another set of models within your project and use scripts to convert/copy the data.