Efficient data migration on a large django table - python

I need to add a new column to a large (5m row) django table. I have a south schemamigration that creates the new column. Now I'm writing a datamigration script to populate the new column. It looks like this. (If you're not familiar with south migrations, just ignore the orm. prefixing the model name.)
print "Migrating %s articles." % orm.Article.objects.count()
cnt = 0
for article in orm.Article.objects.iterator():
    if cnt % 500 == 0:
        print " %s done so far" % cnt
    # article.newfield = calculate_newfield(article)
    article.save()
    cnt += 1
I switched from objects.all to objects.iterator to reduce memory requirements. But something is still chewing up vast memory when I run this script. Even with the actually useful line commented out as above, the script still grows to using 10+ GB of ram before getting very far through the table and I give up on it.
Seems like something is holding on to these objects in memory. How can I run this so it's not a memory hog?
FWIW, I'm using python 2.6, django 1.2.1, south 0.7.2, mysql 5.1.

Ensure settings.DEBUG is set to False. With DEBUG=True, Django stores every query sent to the RDBMS in memory, which chews through memory quickly during database-intensive operations.
As of Django 1.8 this matters less, since a hardcoded maximum of 9000 queries is stored instead of the previously unbounded number.
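If turning DEBUG off is not an option, the query log can also be cleared periodically during the loop. A minimal sketch of that pattern, with the reset callable injected so the sketch is self-contained; in a real migration it would be reset_queries imported from django.db, and the rows would come from orm.Article.objects.iterator():

```python
def migrate_in_batches(rows, migrate_row, reset_queries, batch_size=500):
    """Apply migrate_row to every row, clearing Django's DEBUG query log
    (via the injected reset_queries callable) once per batch."""
    for i, row in enumerate(rows):
        migrate_row(row)
        if (i + 1) % batch_size == 0:
            reset_queries()  # drop the queries logged so far, freeing memory
```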

Welcome to Django's ORM. I think this is an inherent problem.
I've also had problems with large databases, dumpdata, loaddata and the like.
You have two choices.
Stop trying to use south and write your own ORM migration. You can have multiple database definitions in your settings. Create "old" and "new". Write your own one-time migrator from the old database to the new database. Once that's tested and works, run it one final time and then switch the database definitions and restart Django.
Ditch south and the ORM and write your own SQL migration. Use raw SQL to copy data out of the old structure to the new structure. Debug separately. When it's good, run it one final time and then switch your setting and restart Django.
It's not that south or the ORM are particularly bad. But, for bulk processing in large databases, they cache too much in memory.
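For the first option, the multiple database definitions might look like the following settings sketch (Django 1.2 supports the DATABASES dict; engine and names here are placeholders, not the asker's real config):

```python
DATABASES = {
    'default': {  # the "new" database with the new schema
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'myapp_new',
    },
    'old': {      # the source database, read-only during the migration
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'myapp_old',
    },
}
```

The one-time migrator then reads with Article.objects.using('old') and writes with .using('default').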

Or, what happens if you create a raw query in situ which implements a rudimentary resultset size limit?
a la: https://docs.djangoproject.com/en/1.3/topics/db/sql/#index-lookups
min_id, batch = 0, 500
while min_id < rowcount:
    articles = Article.objects.raw(
        'SELECT * FROM article WHERE id > %s AND id <= %s',
        [min_id, min_id + batch])
    for old_article in articles:
        # ... create the new article from old_article, then save it ...
        pass
    min_id += batch

orm.Article.objects.iterator()
Does that run the whole query and save the result in memory? Or fetch rows from the database one at a time?
I'm guessing it does it all at once. See if you can replace that loop with a database cursor that pulls the data in an incremental fashion:
eg: http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.fetchmany
db = blah.connect("host='%s' dbname='%s' user='%s' password='%s'" % ...
new, old = db.cursor(), db.cursor()
old.execute("""
    SELECT *
    FROM whatever
    """)
for row in old.fetchmany(size=500):
    (col1, col2, col3, ...) = row
    new.execute("""
        INSERT INTO yourtable (
            col1, col2, col3, ...)
        VALUES (
            %s, %s, %s, ...)
        """, (col1, col2, col3, ...))
new.close()
old.close()
It will be slow. I pulled this from a standalone migration script of mine so ymmv.
fetchmany is standard (PEP249). I've not done exactly what you're looking for so there's a little work still to do from this sample: I've not looped over the loop - to get sets of 500 till done - so you'll need to work that out for yourself.
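One way to fill in that missing outer loop, using sqlite3 so the sketch is self-contained and runnable; the table and column names are invented for the demo:

```python
import sqlite3

def copy_in_batches(src_cur, dst_cur, batch=500):
    """Read the source table in fetchmany() batches and bulk-insert each one."""
    src_cur.execute("SELECT id, name FROM old_table")
    while True:
        rows = src_cur.fetchmany(batch)  # PEP 249: at most `batch` rows
        if not rows:
            break                        # empty list means we're done
        dst_cur.executemany("INSERT INTO new_table (id, name) VALUES (?, ?)", rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE old_table (id INTEGER, name TEXT)")
db.execute("CREATE TABLE new_table (id INTEGER, name TEXT)")
db.executemany("INSERT INTO old_table VALUES (?, ?)",
               [(i, "row%d" % i) for i in range(1200)])
copy_in_batches(db.cursor(), db.cursor())
db.commit()
print(db.execute("SELECT COUNT(*) FROM new_table").fetchone()[0])  # prints 1200
```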

If you don't need full access to the objects, you can always use a combination of only() and values() or values_list() on your queryset. That should help reduce the memory requirements significantly, but I'm not sure whether it will be enough.
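Applied to the question's migration, the values_list idea could be sketched like this. The two callables are injected so the sketch is self-contained; with Django they would be orm.Article.objects.values_list('pk', 'oldfield').iterator() and something like lambda pk, v: orm.Article.objects.filter(pk=pk).update(newfield=v), with 'oldfield' and 'newfield' being assumed names:

```python
def migrate_with_values_list(pk_value_pairs, write_newfield, calculate_newfield):
    """Walk (pk, old_value) tuples and write back a computed value per row,
    so no full model instances are ever held in memory."""
    migrated = 0
    for pk, old_value in pk_value_pairs:
        write_newfield(pk, calculate_newfield(old_value))
        migrated += 1
    return migrated
```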

Related

python postgres can I fetchall() 1 million rows?

I am using the psycopg2 module in Python to read from a Postgres database, and I need to do some operation on all rows of a column that has more than 1 million rows.
I would like to know would cur.fetchall() fail or cause my server to go down? (since my RAM might not be that big to hold all that data)
q = "SELECT names from myTable;"
cur.execute(q)
rows = cur.fetchall()
for row in rows:
    doSomething(row)
what is the smarter way to do this?
The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:
row = cursor.fetchone()
However, I noticed a significant slowdown when fetching rows one by one. I access an external database over an internet connection, which might be the reason for it.
Having a server-side cursor and fetching rows in bunches proved to be the most performant solution. You can change the SQL statements (as in alecxe's answer), but there is also a pure Python approach using the feature provided by psycopg2:
cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)
while True:
    rows = cursor.fetchmany(5000)
    if not rows:
        break
    for row in rows:
        # do something with row
        pass
You can find more about server-side cursors in the psycopg2 wiki.
Consider using a server-side cursor:
When a database query is executed, the Psycopg cursor usually fetches
all the records returned by the backend, transferring them to the
client process. If the query returned an huge amount of data, a
proportionally large amount of memory will be allocated by the client.
If the dataset is too large to be practically handled on the client
side, it is possible to create a server side cursor. Using this kind
of cursor it is possible to transfer to the client only a controlled
amount of data, so that a large dataset can be examined without
keeping it entirely in memory.
Here's an example:
cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
    cursor.execute("FETCH 1000 FROM super_cursor")
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:
        doSomething(row)
fetchall() fetches all remaining rows at once, so to prevent a massive hit on your database you can either fetch rows in manageable batches with fetchmany(), or simply step through the cursor till it's exhausted:
row = cur.fetchone()
while row:
    # do something with row
    row = cur.fetchone()
Here is the code to use for a simple server-side cursor with the speed of fetchmany management.
The principle is to use a named cursor in Psycopg2 and give it a good itersize to load many rows at once, like fetchmany would, but with a single for rec in cursor loop that does an implicit fetchone().
With this code I make queries of 150 million rows from a multi-billion-row table within 1 hour and 200 MB of RAM.
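That pattern can be sketched as a small helper. `conn` is assumed to be an open psycopg2 connection (passing a name to conn.cursor() is what makes the cursor server-side), and the cursor name is arbitrary:

```python
def iter_server_side(conn, sql, itersize=10000):
    """Yield rows from a psycopg2 named (server-side) cursor, one at a time;
    psycopg2 fetches them from the server in blocks of `itersize`."""
    cur = conn.cursor(name='big_read')  # a named cursor is server-side
    cur.itersize = itersize             # rows per behind-the-scenes fetch
    cur.execute(sql)
    for row in cur:                     # implicit batched fetching
        yield row
    cur.close()
```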
EDIT: Using fetchmany (along with fetchone() and fetchall()), even with a row limit (arraysize), still sends the entire result set to the client, keeping it client-side (stored in the underlying C library, I think libpq) for any additional fetchmany() calls, etc. Without using a named cursor (which would require an open transaction), you have to resort to using LIMIT in the SQL with an ORDER BY, then analyzing the results and augmenting the next query with where (ordered_val = %(last_seen_val)s and primary_key > %(last_seen_pk)s OR ordered_val > %(last_seen_val)s)
This is misleading of the library, to say the least, and there should be a blurb in the documentation about it. I don't know why it's not there.
Not sure a named cursor is a good fit without having a need to scroll forward/backward interactively? I could be wrong here.
The fetchmany loop is tedious but I think it's the best solution here. To make life easier, you can use the following:
from functools import partial
from itertools import chain

# chain.from_iterable was added in Python 2.6
from_iterable = chain.from_iterable

# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
    if parms is None:
        curs.execute(sql)
    else:
        curs.execute(sql, parms)
    chunks_until_empty = iter(partial(curs.fetchmany, chunksize), [])
    return from_iterable(chunks_until_empty)

# example scenario
for row in run_and_iterate(cur, 'SELECT * FROM waffles_table WHERE num_waffles > %s', (10,)):
    print 'lots of waffles: %s' % (row,)
As I was reading comments and answers, I thought I should clarify something about fetchone and server-side cursors for future readers.
With normal cursors (client-side), Psycopg fetches all the records returned by the backend, transferring them to the client process. All of the records are buffered in the client's memory. This happens as soon as you execute a query like curs.execute('SELECT * FROM ...').
This question also confirms that.
All the fetch* methods are there for accessing this stored data.
Q: So how can fetchone help us memory-wise?
A: It fetches only one record from the stored data and creates a single Python object to hand to your Python code, while fetchall fetches and creates n Python objects from this data and hands them all to you in one chunk.
So if your table has 1,000,000 records, this is what's going on in memory:
curs.execute --> whole 1,000,000-row result set + fetchone --> 1 Python object
curs.execute --> whole 1,000,000-row result set + fetchall --> 1,000,000 Python objects
Of course fetchone helps, but we still have all the records in client memory. This is where server-side cursors come into play:
PostgreSQL also has its own concept of cursor (sometimes also called
portal). When a database cursor is created, the query is not
necessarily completely processed: the server might be able to produce
results only as they are needed. Only the results requested are
transmitted to the client: if the query result is very large but the
client only needs the first few records it is possible to transmit
only them.
...
their interface is the same, but behind the scene they
send commands to control the state of the cursor on the server (for
instance when fetching new records or when moving using scroll()).
So you won't get the whole result set in one chunk.
The drawback:
The downside is that the server needs to keep track of the partially
processed results, so it uses more memory and resources on the server.

Slow MySQL queries in Python but fast elsewhere

I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quick. In another area, the select queries are VERY slow.
Executing the same query in PhpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed etc. I'm porting this application to Python from PHP/Codeigniter where it ran fine (I foolishly thought getting out of PHP would help speed it up)
PHP/Codeigniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/Codeigniter, but is taking 10 seconds in Python BEFORE any of the analysis of the data is done.
My link to the database is fairly standard...
dbconn=MySQLdb.connect(host="127.0.0.1",user="*",passwd="*",db="*", cursorclass = MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the bottleneck, and it's not. I put dummy code in its place. The entire process did not speed up at all.
db_results = "test"
# Loop Results
for row in cursor:
    a = 0  # this was the dummy code I put in to test
return db_results
The query result itself is only 501 rows (large amount of columns)... took 0.029 seconds outside of Python. Taking significantly longer than that within Python.
The project is related to horse racing. The query is done within this function. The query itself is long, however, it runs well outside of Python. I commented out the code within the loop on purpose for testing... also the print(query) in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
    # Comma Race List
    race_list = ','.join(map(str, race_ids))

    # PPs Query
    query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
             "FROM hrdb_raceindex raceindex "
             "INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
             "INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
             "LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
             "WHERE raceindex.race_id IN (" + race_list + ") "
             "ORDER BY runlines.line_date DESC;")
    print(query)

    # Run Query
    cursor.execute(query)

    # Query Fields
    fields = [i[0] for i in cursor.description]

    # PPs List
    pps = []

    # Loop Results
    for row in cursor:
        a = 0
        #this_pp = {}
        #for i, value in enumerate(row):
        #    this_pp[fields[i]] = value
        #pps.append(this_pp)
    return pps
One final note... I haven't considered the ideal way to handle the result. I believe one cursor allows the result to come back as a set of dictionaries. I haven't even made it to that point yet as the query and return itself is so slow.
Though you have only 501 rows, it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1K that would be close to 27MB of data returned.
To get a sense of how much data mysql is pushing you can add this to your query:
SHOW SESSION STATUS LIKE "bytes_sent"
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using PHPMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, not sure about how that returns results).
Perhaps the Python process is memory-constrained and, when faced with this large result set, it has to page out to disk to handle it.
If you reduce the number of columns called and/or constrain the query to, say, LIMIT 10, do you get improved speed?
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process and how that allocation and usage compares to those same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?
I know this is late; however, I have run into similar issues with MySQL and Python. My solution was to use another language for the queries: I use R, which is blindingly fast, do what I can in R, and then send the data to Python if need be for more general programming, although R has many general-purpose libraries as well. Just wanted to post something that may help someone with a similar problem, though I know this sidesteps the heart of the problem.

Optimizing performance of Postgresql database writes in Django?

I've got a Django 1.1 app that needs to import data from some big JSON files on a daily basis. To give an idea, one of these files is over 100 MB and has 90K entries that are imported to a PostgreSQL database.
The problem I'm experiencing is that it takes a really long time for the data to be imported, i.e. on the order of hours. I would have expected it would take some time to write that number of entries to the database, but certainly not that long, which makes me think I'm doing something inherently wrong. I've read similar stackexchange questions, and the solutions proposed suggest using the transaction.commit_manually or transaction.commit_on_success decorators to commit in batches instead of on every .save(), which I'm already doing.
As I say, I'm wondering if I'm doing anything wrong (e.g. batches to commit are too big?, too many foreign keys?...), or whether I should just go away from Django models for this function and use the DB API directly. Any ideas or suggestions?
Here are the basic models I'm dealing with when importing data (I've removed some of the fields in the original code for the sake of simplicity)
class Template(models.Model):
    template_name = models.TextField(_("Name"), max_length=70)
    sourcepackage = models.TextField(_("Source package"), max_length=70)
    translation_domain = models.TextField(_("Domain"), max_length=70)
    total = models.IntegerField(_("Total"))
    enabled = models.BooleanField(_("Enabled"))
    priority = models.IntegerField(_("Priority"))
    release = models.ForeignKey(Release)

class Translation(models.Model):
    release = models.ForeignKey(Release)
    template = models.ForeignKey(Template)
    language = models.ForeignKey(Language)
    translated = models.IntegerField(_("Translated"))
And here's the bit of code that seems to take ages to complete:
@transaction.commit_manually
def add_translations(translation_data, lp_translation):
    releases = Release.objects.all()
    # There are 5 releases
    for release in releases:
        # translation_data has about 90K entries
        # this is the part that takes a long time
        for lp_translation in translation_data:
            try:
                language = Language.objects.get(
                    code=lp_translation['language'])
            except Language.DoesNotExist:
                continue

            translation = Translation(
                template=Template.objects.get(
                    sourcepackage=lp_translation['sourcepackage'],
                    template_name=lp_translation['template_name'],
                    translation_domain=\
                        lp_translation['translation_domain'],
                    release=release),
                translated=lp_translation['translated'],
                language=language,
                release=release,
            )
            translation.save()

            # I realize I should commit every n entries
            transaction.commit()

        # I've also got another bit of code to fill in some data I'm
        # not getting from the json files

        # Add missing templates
        languages = Language.objects.filter(visible=True)
        languages_total = len(languages)
        for language in languages:
            templates = Template.objects.filter(release=release)
            for template in templates:
                try:
                    translation = Translation.objects.get(
                        template=template,
                        language=language,
                        release=release)
                except Translation.DoesNotExist:
                    translation = Translation(template=template,
                                              language=language,
                                              release=release,
                                              translated=0,
                                              untranslated=0)
                    translation.save()
        transaction.commit()
Going through your app and processing every single row is a lot slower than loading the data directly into the server. Even with optimized code. Also, inserting / updating one row at a time is again a lot slower than processing everything at once.
If the import files are available locally to the server, you can use COPY. Otherwise you could use the meta command \copy in the standard interface psql. You mention JSON; for this to work, you would have to convert the data to a suitable flat format like CSV.
If you just want to add new rows to a table:
COPY tbl FROM '/absolute/path/to/file' (FORMAT csv);
Or if you want to INSERT / UPDATE some rows:
First off: give temp_buffers enough RAM (at least temporarily, if you can) so the temp table does not have to be written to disk. Be aware that this has to be set before accessing any temporary tables in the session.
SET LOCAL temp_buffers='128MB';
The in-memory representation takes somewhat more space than the on-disk representation of the data. So for a 100 MB JSON file, minus the JSON overhead, plus some Postgres overhead, 128 MB may or may not be enough. But you don't have to guess, just do a test run and measure it:
select pg_size_pretty(pg_total_relation_size('tmp_x'));
Create the temporary table:
CREATE TEMP TABLE tmp_x (id int, val_a int, val_b text);
Or, to just duplicate the structure of an existing table:
CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;
Copy values (should take seconds, not hours):
COPY tmp_x FROM '/absolute/path/to/file' (FORMAT csv);
From there INSERT / UPDATE with plain old SQL. As you are planning a complex query, you may even want to add an index or two on the temp table and run ANALYZE:
ANALYZE tmp_x;
For instance, to update existing rows, matched by id:
UPDATE tbl
SET    col_a = tmp_x.col_a
FROM   tmp_x
WHERE  tbl.id = tmp_x.id;
Finally, drop the temporary table:
DROP TABLE tmp_x;
Or have it dropped automatically at the end of the session.
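Since the source data is JSON, the flattening step this approach needs could look like the following sketch (stdlib only; the field names are placeholders echoing the tmp_x example, not the asker's real schema):

```python
import csv
import io
import json

def json_to_csv(json_text, fields):
    """Flatten a JSON array of objects into CSV text, one column per field."""
    out = io.StringIO()
    writer = csv.writer(out)
    for entry in json.loads(json_text):
        writer.writerow([entry[f] for f in fields])
    return out.getvalue()
```

With psycopg2 the result can then be streamed into the temp table without an intermediate file, e.g. cur.copy_expert("COPY tmp_x FROM STDIN WITH (FORMAT csv)", io.StringIO(csv_text)).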

Merging SQLite databases is driving me mad. Help?

I've got 32 SQLite (3.7.9) databases with 3 tables each that I'm trying to merge together using the idiom that I've found elsewhere (each db has the same schema):
attach db1.sqlite3 as toMerge;
insert into tbl1 select * from toMerge.tbl1;
insert into tbl2 select * from toMerge.tbl2;
insert into tbl3 select * from toMerge.tbl3;
detach toMerge;
and rinse-repeating for the entire set of databases. I do this using python and the sqlite3 module:
for fn in filelist:
    completedb = sqlite3.connect("complete.sqlite3")
    c = completedb.cursor()
    c.execute("pragma synchronous = off;")
    c.execute("pragma journal_mode=off;")
    print("Attempting to merge " + fn + ".")
    query = "attach '" + fn + "' as toMerge;"
    c.execute(query)
    try:
        c.execute("insert into tbl1 select * from toMerge.tbl1;")
        c.execute("insert into tbl2 select * from toMerge.tbl2;")
        c.execute("insert into tbl3 select * from toMerge.tbl3;")
        c.execute("detach toMerge;")
        completedb.commit()
    except sqlite3.Error as err:
        print "Error! ", type(err), " Error msg: ", err
        raise
2 of the tables are fairly small, only 50K rows per db, while the third (tbl3) is larger, about 850 - 900K rows. Now, what happens is that the inserts progressively slow down until I get to about the fourth database, when they grind to a near halt (on the order of a megabyte or two in file size added every 1-3 minutes to the combined database). In case it was Python, I've even tried dumping out the tables as INSERTs (.insert; .out foo; sqlite3 complete.db < foo is the skeleton, found here) and combining them in a bash script using the sqlite3 CLI to do the work directly, but I get exactly the same problem.
The table setup of tbl3 isn't too demanding - a text field containing a UUID, two integers, and four real values. My worry is that it's the number of rows, because I ran into exactly the same trouble at exactly the same spot (about four databases in) when the individual databases were an order of magnitude larger in terms of file size with the same number of rows (I trimmed the contents of tbl3 significantly by storing summary stats instead of raw data). Or maybe it's the way I'm performing the operation? Can anyone shed some light on this problem that I'm having before I throw something out the window?
Try adding or removing indexes/primary key for the larger table.
You didn't mention the OS you were using or the db file sizes. Windows can have issues with files that are bigger than 2Gb depending on what version.
In any case, since this is a glorified batch script why not get rid of the for loop, get the filename from sys.argv, and then just run it once for each merge db. That way you will never have to deal with memory issues from doing too much in one process.
Mind you, if you end the loop with the following that will likely also fix things.
c.close()
completedb.close()
You say that the same thing occurs when you follow this process using the CLI and quitting after every db. I assume that you mean the Python CLI, and quitting means that you exit and restart Python. If that is true, and it still develops a problem every 4th database, then something is wrong with your SQLITE shared library. It shouldn't be keeping state like that.
If I were in your shoes, I would stop using attach and just open multiple connections in Python, then move the data in batches of about 1000 records per commit. It would be slower than your technique because all the data moves in and out of Python objects, but I think it would also be more reliable. Open the complete db, then loop around opening a second db, copying, then closing the second db. For the copying, I would use OFFSET and LIMIT on the SELECT statements to process batches of 100 records, then commit, then repeat.
In fact, I would also count the completedb records, and the second db records before copying, then after copying count the completedb records to ensure that I had copied the expected amount. Also, you would be keeping track of the value of the next OFFSET and I would write that to a text file right after committing, so that I could interrupt and restart the process at any time and it would carry on where it left off.
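That batched, verified copy could be sketched like this with sqlite3 directly. The table layout and batch size are invented for the demo, and a real run would persist `copied` to a checkpoint file after each commit so it could resume:

```python
import sqlite3

def copy_table(dst, src, table, batch=1000):
    """Copy `table` from src to dst in LIMIT/OFFSET batches, committing per
    batch, and verify the final row count against the source."""
    (expected,) = src.execute("SELECT COUNT(*) FROM %s" % table).fetchone()
    copied = 0
    while copied < expected:
        rows = src.execute("SELECT * FROM %s LIMIT ? OFFSET ?" % table,
                           (batch, copied)).fetchall()
        placeholders = ",".join("?" * len(rows[0]))
        dst.executemany("INSERT INTO %s VALUES (%s)" % (table, placeholders), rows)
        dst.commit()  # one commit per batch, as suggested
        copied += len(rows)
        # a real script would write `copied` to a checkpoint file here
    (actual,) = dst.execute("SELECT COUNT(*) FROM %s" % table).fetchone()
    return copied, actual == expected
```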

Memory usage with Django + SQLite3

I've got a very large SQLite table with over 500,000 rows with about 15 columns (mostly floats). I'm wanting to transfer data from the SQLite DB to a Django app (which could be backed by many RDBMs, but Postgres in my case). Everything works OK, but as the iteration continues, memory usage jumps by 2-3 meg a second for the Python process. I've tried using 'del' to delete the EVEMapDenormalize and row objects at the end of each iteration, but the bloat continues. Here's an excerpt, any ideas?
class Importer_mapDenormalize(SQLImporter):
    def run_importer(self, conn):
        c = conn.cursor()
        for row in c.execute('select * from mapDenormalize'):
            mapdenorm, created = EVEMapDenormalize.objects.get_or_create(id=row['itemID'])
            mapdenorm.x = row['x']
            mapdenorm.y = row['y']
            mapdenorm.z = row['z']

            if row['typeID']:
                mapdenorm.type = EVEInventoryType.objects.get(id=row['typeID'])

            if row['groupID']:
                mapdenorm.group = EVEInventoryGroup.objects.get(id=row['groupID'])

            if row['solarSystemID']:
                mapdenorm.solar_system = EVESolarSystem.objects.get(id=row['solarSystemID'])

            if row['constellationID']:
                mapdenorm.constellation = EVEConstellation.objects.get(id=row['constellationID'])

            if row['regionID']:
                mapdenorm.region = EVERegion.objects.get(id=row['regionID'])

            mapdenorm.save()
        c.close()
I'm not at all interested in wrapping this SQLite DB with the Django ORM. I'd just really like to figure out how to get the data transferred without sucking all of my RAM.
Silly me, this was addressed in the Django FAQ.
Needed to clear the DB query cache while in DEBUG mode.
from django import db
db.reset_queries()
I think doing a select * from mapDenormalize and loading the result into memory will always be a bad idea. My advice is to split the script into chunks. Use LIMIT to get data in portions.
Get the first portion, work with it, close the cursor, and then get the next portion.
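A sketch of that chunked approach with the sqlite3 module (the table name is borrowed from the question, the chunk size is arbitrary):

```python
import sqlite3

def iter_chunks(conn, table, chunk=5000):
    """Yield every row of `table`, fetching only `chunk` rows per query."""
    offset = 0
    while True:
        rows = conn.execute("SELECT * FROM %s LIMIT ? OFFSET ?" % table,
                            (chunk, offset)).fetchall()
        if not rows:
            break
        for row in rows:
            yield row
        offset += len(rows)
```

Each portion can be processed and, if DEBUG is on, followed by db.reset_queries() as in the accepted answer.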
