I have a table I am working on that contains roughly 11 million rows. I need to run a migration on this table, but since Django tries to store it all in cache I run out of RAM or disk space (whichever comes first) and the migration comes to an abrupt halt.
I'm curious to know if anyone has faced this issue and come up with a solution to essentially "paginate" migrations, maybe into blocks of 10-20k rows at a time?
Just to give a bit of background: I am using Django 1.10 and Postgres 9.4, and I want to keep this automated if possible (which I still think it can be).
Thanks
Sam
The issue comes from PostgreSQL, which rewrites every row when you add a new column with a default value.
What you need to do is write your own migration in the following way:
1. Add the new column with null=True. In this case the data will not be rewritten and the migration will finish pretty fast.
2. Migrate it.
3. Add a default value.
4. Migrate it again.
That is basically a simple pattern for adding a new column to a huge Postgres table.
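Not the poster's actual code, just a minimal sketch of that pattern, assuming a hypothetical app myapp whose Book model gains a status field:

# myapp/migrations/0002_add_status_nullable.py
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [('myapp', '0001_initial')]
    operations = [
        # Nullable and no default: Postgres only touches the catalog,
        # so this is fast even on ~11 million rows.
        migrations.AddField(
            model_name='book',
            name='status',
            field=models.CharField(max_length=20, null=True),
        ),
    ]

# myapp/migrations/0003_backfill_status.py
from django.db import migrations

BATCH = 20000

def backfill(apps, schema_editor):
    Book = apps.get_model('myapp', 'Book')
    missing = Book.objects.filter(status__isnull=True)
    # Update in fixed-size chunks so no single statement touches every row.
    while True:
        pks = list(missing.values_list('pk', flat=True)[:BATCH])
        if not pks:
            break
        Book.objects.filter(pk__in=pks).update(status='new')

class Migration(migrations.Migration):
    atomic = False  # Django 1.10+: lets each batch commit on its own
    dependencies = [('myapp', '0002_add_status_nullable')]
    operations = [migrations.RunPython(backfill, migrations.RunPython.noop)]

A final AlterField migration can then set null=False and the default once every row has a value.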
Hope you have a great day. I have a table with 470 columns, to be exact. I am working on Django unit testing, and the tests won't execute; I get this error when I run python manage.py test:
Row size too large (> 8126). Changing some columns to TEXT or BLOB or using ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPRESSED may help. In current row format, BLOB prefix of 768 bytes is stored inline
To resolve this issue I am trying to increase innodb_page_size in the MySQL my.cnf file, but when I restart the MySQL server after changing the value, MySQL won't start.
I have tried almost every available solution on stackoverflow but no success.
MySQL version: 5.5.57
Ubuntu version: 16.04
Any help would be greatly appreciated. Thank you
Since I have never seen anyone use the feature of a bigger InnoDB page size, I have no experience with making it work, and I recommend you not be the first to try.
Instead I offer several likely workarounds.
Don't use VARCHAR(255) blindly; make the lengths realistic for the data involved.
Don't use utf8 (or utf8mb4) for columns that can only contain ASCII. Examples: postcode, hex strings, UUIDs, country_code, etc. Use CHARACTER SET ascii.
Vertically partition the table. That is, split it into two tables with the same PRIMARY KEY (see the sketch after this list).
Don't splay arrays across columns; use another table and have multiple rows in it. Example: phone1, phone2, phone3.
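As a rough illustration of the vertical-partition idea at the Django level (model and field names are invented, not from the question): keep the frequently used columns on one model and push the rest into a second model that shares the primary key.

from django.db import models

class Customer(models.Model):
    # Hot columns, with lengths sized to the real data.
    name = models.CharField(max_length=100)
    country_code = models.CharField(max_length=2)   # ASCII-only data
    postcode = models.CharField(max_length=10)

class CustomerExtra(models.Model):
    # Rarely used / bulky columns live in a second table with the same PK.
    customer = models.OneToOneField(
        Customer,
        primary_key=True,
        on_delete=models.CASCADE,
    )
    biography = models.TextField(blank=True)
    preferences = models.TextField(blank=True)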
I have two databases in MySQL with tables built from another program I wrote to collect data. I would like to use Django, but I'm having trouble understanding the model/view layer after going through the tutorial and countless hours of googling. My problem is that I just want to access and display the data. I tried to create routers and ran inspectdb to create models, only to get 1146 (table doesn't exist) errors. I have a unique key, let's say (a, b), and six other columns in the table. I just need to access those six columns row by row. I'm getting so many issues. If you need more details please let me know. Thank you.
inspectdb is far from perfect. If you have an existing db with a bit of complexity, you will probably end up changing a lot of the code generated by this command. Once you're done, though, it should work fine. What's your exact issue? If you run inspectdb and it creates a model of your table, you should be able to import it and use it like a normal model. Can you share more details or the errors you are getting while querying the table you're interested in?
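For what it's worth, a sketch of roughly what inspectdb produces and how it gets used (the table and field names below are invented; the important parts are managed = False and a db_table that matches the real table exactly):

# Generated with:  python manage.py inspectdb > myapp/models.py
from django.db import models

class SensorReading(models.Model):
    device = models.CharField(max_length=50)       # part of the unique key (a)
    recorded_at = models.DateTimeField()           # part of the unique key (b)
    value1 = models.FloatField(blank=True, null=True)
    value2 = models.FloatField(blank=True, null=True)

    class Meta:
        managed = False                  # Django won't create/drop this table
        db_table = 'sensor_reading'      # must match the existing table name
        unique_together = (('device', 'recorded_at'),)

# Example query, e.g. from the Django shell; use .using('other') if the
# table lives in a second database defined in settings.DATABASES:
rows = SensorReading.objects.using('other').values_list('value1', 'value2')[:10]

A 1146 "table doesn't exist" error usually means db_table doesn't match the real table name, or the query is hitting the wrong database.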
I am using postgresql for my django app.
I managed to delete almost 500,000 rows, but the size of my DB didn't shrink significantly.
I deleted them with something like lots.objects.filter(id__in=[ids]).delete(), in chunks (because it's too hard to delete so many rows in one query).
Some columns have db_index=True, so I think indexes were not deleted.
Is it possible to also delete the index entries for the deleted objects from Django?
Is there also a way to see unused indexes from Django?
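For reference, the chunked delete looks roughly like this (a simplified sketch; the model and the helper that selects the ids are placeholders, not the real code):

from myapp.models import Lot          # placeholder app/model

CHUNK = 10000
ids = get_ids_to_delete()             # placeholder for the real selection logic

for start in range(0, len(ids), CHUNK):
    Lot.objects.filter(id__in=ids[start:start + CHUNK]).delete()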
None of this has anything to do with Django. If an item is deleted from a database, it is always automatically deleted from any indexes - otherwise indexing just wouldn't work.
Normally you should let Postgres itself determine the size of the database files. Deleted items are removed when a VACUUM operation is done; again, normally Postgres will do this via a regularly scheduled daemon. If you need to specifically recover space, then you can run VACUUM manually. See the docs.
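If you do need the disk space back right away, here is a rough sketch of running it from Django (the table name is illustrative). Note that plain VACUUM only marks the space as reusable inside the files, while VACUUM FULL rewrites the table and returns space to the OS at the cost of an exclusive lock:

from django.db import connection

# VACUUM cannot run inside a transaction block, so don't wrap this in atomic().
with connection.cursor() as cursor:
    cursor.execute('VACUUM ANALYZE myapp_lots;')
    # To actually shrink the files on disk (locks the table while it runs):
    # cursor.execute('VACUUM FULL myapp_lots;')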
I put my first Django site online eight months ago. It was both a proof of concept and my first experience with Django. Fast forward eight months: I have validated my idea, but since it was a proof of concept and my first Django project, the code is pretty messy. Essentially, I am going to be rewriting the majority of the site, including re-engineering the models.
This is all fine and good. I have all my new models planned out. Essentially, I am going to create a new database to develop off of and let South manage any new database schema changes I make.
It is important to note two things:
I will not be creating a new project, just a new database.
This will be the first time I am incorporating South into the project and I would prefer to start with fresh models and a fresh database.
My question is, when I create the new database, will importing the contents of the old auth_* and django_* tables into the new auth_* and django_* tables create any problems? I have had some users register using the original proof of concept and I don't want to lose their information. I've never had to do this before so I'm not sure if there will be any repercussions.
If you use a SQL dump, such as:
mysqldump -uusername -ppassword db_name table_name > xxxx.sql
mysql -uusername -ppassword new_db_name < xxxx.sql
the database side is fine. If your backend is some other db, you can still find similar commands.
For a new db, I think you need to export/import auth_user; I'm not quite sure whether you need the other contents of the django_* tables. You can do this step by step and see whether the new project works.
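As an alternative to a raw SQL dump, here is a hedged sketch using Django's own dumpdata/loaddata to copy just the user records (the fixture file name is arbitrary; content types and permissions can clash on load, so test against a copy first):

from django.core.management import call_command

# In the old project: export the users to a fixture.
with open('auth_users.json', 'w') as out:
    call_command('dumpdata', 'auth.User', indent=2, stdout=out)

# In the new project, after its initial schema is in place:
# call_command('loaddata', 'auth_users.json')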
What would be the best way to import multi-million record CSV files into Django?
Currently, using the Python csv module, it takes 2-4 days to process a 1 million record file. It does some checking of whether the record already exists, and a few other things.
Can this process be made to execute in a few hours?
Can memcached be used somehow?
Update: There are Django ManyToManyField fields that get processed as well. How will these be handled with a direct load?
I'm not sure about your case, but we had similar scenario with Django where ~30 million records took more than one day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python we took a radical strategy change and did the import (only) with Java and JDBC (+ some MySQL tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).
I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.
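A rough sketch of what "driver directly" could look like, assuming MySQLdb and a made-up import_records table with three columns; rows are sent in batches via executemany and committed periodically:

import csv
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='dbuser', passwd='secret', db='mydb')
cur = conn.cursor()

BATCH = 5000
sql = "INSERT INTO import_records (col_a, col_b, col_c) VALUES (%s, %s, %s)"

with open('data.csv') as f:
    batch = []
    for row in csv.reader(f):
        batch.append(row[:3])
        if len(batch) >= BATCH:
            cur.executemany(sql, batch)   # one round trip per 5000 rows
            conn.commit()
            batch = []
    if batch:
        cur.executemany(sql, batch)
        conn.commit()

cur.close()
conn.close()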
Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. As an example, you can write a perl/python script to read the CSV file and create a SQL script with insert statements, and then feed the SQL script directly to MySQL.
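A minimal sketch of the "turn the CSV into a SQL script" idea (the table name and the quoting/escaping are deliberately simplified):

import csv

with open('data.csv') as src, open('load_data.sql', 'w') as out:
    out.write('SET autocommit=0;\n')
    for row in csv.reader(src):
        values = ', '.join("'%s'" % v.replace("'", "''") for v in row)
        out.write('INSERT INTO import_records VALUES (%s);\n' % values)
    out.write('COMMIT;\n')

# Then feed it to MySQL:  mysql -u user -p db_name < load_data.sql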
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
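A sketch of that batching idea with Django's current transaction API (the model and its fields are invented): commit roughly every 1,000 rows instead of one giant transaction or one per row.

import csv
from django.db import transaction
from myapp.models import ImportRecord   # placeholder model

BATCH = 1000

def load(path):
    with open(path) as f:
        batch = []
        for col_a, col_b, col_c in csv.reader(f):
            batch.append(ImportRecord(col_a=col_a, col_b=col_b, col_c=col_c))
            if len(batch) >= BATCH:
                with transaction.atomic():        # one commit per 1,000 rows
                    ImportRecord.objects.bulk_create(batch)
                batch = []
        if batch:
            with transaction.atomic():
                ImportRecord.objects.bulk_create(batch)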
And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.
As mentioned you want to bypass the ORM and go directly to the database. Depending on what type of database you're using you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high speed data loading, and for mysql you can use the LOAD command. I'm sure there's something similar for Postgres as well.
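For MySQL, a hedged sketch of issuing LOAD DATA through a raw cursor (the file, table, and column names are illustrative, and LOCAL INFILE must be enabled on both client and server); the rough Postgres counterpart is COPY:

from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("""
        LOAD DATA LOCAL INFILE '/tmp/data.csv'
        INTO TABLE import_records
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
        (col_a, col_b, col_c)
    """)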
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes using mysqldump.
Like Craig said, you'd better fill the db directly first.
It implies creating Django models that just fit the CSV cells (you can then create better models and scripts to move the data).
Then, db feeding: a tool of choice for doing this is Navicat; you can grab a functional 30-day demo on their site. It allows you to import CSV into MySQL, save the import profile as XML...
Then I would launch the data control scripts from within Django, and when you're done, migrate your model with South to get what you want or, like I said earlier, create another set of models within your project and use scripts to convert/copy the data.
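A sketch of what such "models that just fit the CSV cells" might look like, i.e. a throwaway staging model that mirrors the file column for column (field names invented):

from django.db import models

class RawCsvRow(models.Model):
    # One generously sized CharField per CSV column, so the raw import is
    # trivial; a later script moves the data into the real models.
    col_1 = models.CharField(max_length=255, blank=True)
    col_2 = models.CharField(max_length=255, blank=True)
    col_3 = models.CharField(max_length=255, blank=True)

    class Meta:
        db_table = 'raw_csv_import'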