I've got a Django 1.1 app that needs to import data from some big JSON files on a daily basis. To give an idea, one of these files is over 100 MB and has 90K entries that are imported into a PostgreSQL database.
The problem I'm experiencing is that it takes a really long time for the data to be imported, i.e. on the order of hours. I would have expected it to take some time to write that number of entries to the database, but certainly not that long, which makes me think I'm doing something inherently wrong. I've read similar stackexchange questions, and the solutions proposed suggest using the transaction.commit_manually or transaction.commit_on_success decorators to commit in batches instead of on every .save(), which I'm already doing.
As I say, I'm wondering if I'm doing anything wrong (e.g. are the batches to commit too big? Are there too many foreign keys?), or whether I should just move away from Django models for this function and use the DB API directly. Any ideas or suggestions?
Here are the basic models I'm dealing with when importing data (I've removed some of the fields in the original code for the sake of simplicity)
class Template(models.Model):
    template_name = models.TextField(_("Name"), max_length=70)
    sourcepackage = models.TextField(_("Source package"), max_length=70)
    translation_domain = models.TextField(_("Domain"), max_length=70)
    total = models.IntegerField(_("Total"))
    enabled = models.BooleanField(_("Enabled"))
    priority = models.IntegerField(_("Priority"))
    release = models.ForeignKey(Release)

class Translation(models.Model):
    release = models.ForeignKey(Release)
    template = models.ForeignKey(Template)
    language = models.ForeignKey(Language)
    translated = models.IntegerField(_("Translated"))
And here's the bit of code that seems to take ages to complete:
@transaction.commit_manually
def add_translations(translation_data, lp_translation):

    releases = Release.objects.all()

    # There are 5 releases
    for release in releases:

        # translation_data has about 90K entries
        # this is the part that takes a long time
        for lp_translation in translation_data:

            try:
                language = Language.objects.get(
                    code=lp_translation['language'])
            except Language.DoesNotExist:
                continue

            translation = Translation(
                template=Template.objects.get(
                    sourcepackage=lp_translation['sourcepackage'],
                    template_name=lp_translation['template_name'],
                    translation_domain=lp_translation['translation_domain'],
                    release=release),
                translated=lp_translation['translated'],
                language=language,
                release=release,
                )

            translation.save()

        # I realize I should commit every n entries
        transaction.commit()

        # I've also got another bit of code to fill in some data I'm
        # not getting from the json files

        # Add missing templates
        languages = Language.objects.filter(visible=True)
        languages_total = len(languages)

        for language in languages:
            templates = Template.objects.filter(release=release)

            for template in templates:
                try:
                    translation = Translation.objects.get(
                        template=template,
                        language=language,
                        release=release)
                except Translation.DoesNotExist:
                    translation = Translation(template=template,
                                              language=language,
                                              release=release,
                                              translated=0,
                                              untranslated=0)
                    translation.save()

            transaction.commit()
Going through your app and processing every single row is a lot slower than loading the data directly into the server, even with optimized code. And inserting / updating one row at a time is again a lot slower than processing everything at once.
If the import files are available locally on the server you can use COPY. Otherwise you could use the meta-command \copy in the standard interface psql. You mention JSON; for this to work, you would have to convert the data to a suitable flat format like CSV.
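That flattening step might look something like this (a rough sketch; the field names come from the question's code and the file paths are just examples):
import csv
import json

# Hypothetical input path; point this at the daily JSON dump.
with open('/path/to/translations.json') as f:
    translation_data = json.load(f)

# Write one flat CSV row per JSON entry, ready for COPY ... (FORMAT csv).
with open('/absolute/path/to/file', 'w') as out:
    writer = csv.writer(out)
    for entry in translation_data:
        writer.writerow([
            entry['sourcepackage'],
            entry['template_name'],
            entry['translation_domain'],
            entry['language'],
            entry['translated'],
        ])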
If you just want to add new rows to a table:
COPY tbl FROM '/absolute/path/to/file' (FORMAT csv);
Or if you want to INSERT / UPDATE some rows:
First off: Use enough RAM for temp_buffers (at least temporarily, if you can) so the temp table does not have to be written to disk. Be aware that this has to be done before accessing any temporary tables in this session.
SET LOCAL temp_buffers='128MB';
The in-memory representation takes somewhat more space than the on-disk representation of the data. So for a 100 MB JSON file .. minus the JSON overhead, plus some Postgres overhead, 128 MB may or may not be enough. But you don't have to guess, just do a test run and measure it:
SELECT pg_size_pretty(pg_total_relation_size('tmp_x'));
Create the temporary table:
CREATE TEMP TABLE tmp_x (id int, val_a int, val_b text);
Or, to just duplicate the structure of an existing table:
CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;
Copy values (should take seconds, not hours):
COPY tmp_x FROM '/absolute/path/to/file' (FORMAT csv);
From there INSERT / UPDATE with plain old SQL. As you are planning a complex query, you may even want to add an index or two on the temp table and run ANALYZE:
ANALYZE tmp_x;
For instance, to update existing rows, matched by id:
UPDATE tbl
SET    col_a = tmp_x.col_a
FROM   tmp_x
WHERE  tbl.id = tmp_x.id;
Finally, drop the temporary table:
DROP TABLE tmp_x;
Or have it dropped automatically at the end of the session.
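Tied together from Python, the whole round trip might look roughly like this (only a sketch, using psycopg2; the connection string and the tbl / col_a names are placeholders):
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")   # hypothetical connection string
cur = conn.cursor()

# Must happen before the temp table is touched in this session.
cur.execute("SET LOCAL temp_buffers = '128MB';")
cur.execute("CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;")

# Stream the CSV through the client connection (the psycopg2 equivalent of \copy),
# so the file does not need to live on the database server.
with open('/absolute/path/to/file') as f:
    cur.copy_expert("COPY tmp_x FROM STDIN (FORMAT csv)", f)

cur.execute("ANALYZE tmp_x;")
cur.execute("""
    UPDATE tbl
    SET    col_a = tmp_x.col_a
    FROM   tmp_x
    WHERE  tbl.id = tmp_x.id;
    """)
cur.execute("DROP TABLE tmp_x;")
conn.commit()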
I have a list of storage accounts and I would like to copy the exact table content from source_table to destination_table, exactly as it is. That means if I add an entry to source_table it should be copied to destination_table, and likewise if I delete an entry from the source I want it deleted from the destination.
So far I have in place this code:
source_table = TableService(account_name="sourcestorageaccount",
                            account_key="source key")
destination_storage = TableService(account_name="destination storage",
                                   account_key="destinationKey")

query_size = 1000

# save data to storage2 and check if there is leftover data in the current table; if yes, recurse
def queryAndSaveAllDataBySize(source_table_name, target_table_name, resp_data: ListGenerator,
                              table_out: TableService, table_in: TableService, query_size: int):
    for item in resp_data:
        tb_name = source_table_name
        del item.etag
        del item.Timestamp
        print("INSERT data:" + str(item) + " into TABLE:" + tb_name)
        table_in.insert_or_replace_entity(target_table_name, item)
    if resp_data.next_marker:
        data = table_out.query_entities(table_name=source_table_name, num_results=query_size,
                                        marker=resp_data.next_marker)
        queryAndSaveAllDataBySize(source_table_name, target_table_name, data, table_out, table_in, query_size)

tbs_out = table_service_out.list_tables()
print(tbs_out)

for tb in tbs_out:
    table = tb.name
    # create table with same name in storage2
    table_service_in.create_table(table_name=table, fail_on_exist=False)
    # first query
    data = table_service_out.query_entities(tb.name, num_results=query_size)
    queryAndSaveAllDataBySize(tb.name, table, data, table_service_out, table_service_in, query_size)
As you can see, this block of code runs just perfectly: it loops over the source storage account's tables and creates the same tables and their content in the destination storage account. But I am missing the part where I check whether a record has been deleted from the source storage and remove the same record from the destination table.
I hope my question/issue is clear enough, and if not please just ask me for more information.
Thank you so much for any help you can provide.
UPDATE:
The more I think about this, the more the logic gets messy.
One of the solutions I thought about and tried is to have 2 lists storing every single table entity:
Source_table_entries
Destination_table_entries
Once I have populated the lists on each run, I can compare the partition keys; if a partition key is present in Destination_table_entries but not in the source, it will be marked for deletion.
But this logic only works flawlessly as long as I have a small table. Unfortunately some tables contain hundreds of thousands of entities (and I have hundreds of storage accounts), which sooner or later will become a mess to manage.
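To illustrate the two-list idea, a keys-only version could look like this (an untested sketch using the same TableService clients as above; it assumes entities are uniquely identified by PartitionKey + RowKey):
def delete_removed_entities(table_name, table_out, table_in):
    # Collect only the keys, not the full entities, to keep memory usage down.
    source_keys = set((e.PartitionKey, e.RowKey)
                      for e in table_out.query_entities(table_name, select='PartitionKey,RowKey'))
    destination_keys = set((e.PartitionKey, e.RowKey)
                           for e in table_in.query_entities(table_name, select='PartitionKey,RowKey'))

    # Anything present in the destination but no longer in the source was deleted upstream.
    for partition_key, row_key in destination_keys - source_keys:
        table_in.delete_entity(table_name, partition_key, row_key)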
So one of the solutions I thought about is to keep the same code I have above and just create a new table every week and delete the oldest one (from the destination storage). For example:
Table week 1
Table week 2
Table week 3 (this will be deleted)
I read around that I could potentially add metadata to the table for the date and leverage that to decide which table should be deleted based on the date and time, but I cannot find anything in the documentation.
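The closest workaround I can think of is to encode the week into the table name itself rather than relying on metadata, roughly like this (a sketch only; the naming scheme and the two-week retention are made up):
import datetime

def weekly_table_name(base_name, date=None):
    # Table names must be alphanumeric, so encode the ISO year and week directly.
    date = date or datetime.date.today()
    iso_year, iso_week, _ = date.isocalendar()
    return "{0}wk{1}{2:02d}".format(base_name, iso_year, iso_week)

def rotate_tables(table_service_in, base_name, keep_weeks=2):
    # Create this week's copy and drop any weekly copies older than keep_weeks.
    table_service_in.create_table(table_name=weekly_table_name(base_name), fail_on_exist=False)
    keep = set(weekly_table_name(base_name, datetime.date.today() - datetime.timedelta(weeks=i))
               for i in range(keep_weeks))
    for tb in table_service_in.list_tables():
        if tb.name.startswith(base_name + "wk") and tb.name not in keep:
            table_service_in.delete_table(tb.name)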
Can anyone please point me to the best approach for this? Thank you so much, I am losing my mind over this last bit.
I have an SQLite database with two tables, each over 28 million rows. Here's the schema:
CREATE TABLE MASTER (ID INTEGER PRIMARY KEY AUTOINCREMENT,PATH TEXT,FILE TEXT,FULLPATH TEXT,MODIFIED_TIME FLOAT);
CREATE TABLE INCREMENTAL (INC_ID INTEGER PRIMARY KEY AUTOINCREMENT,INC_PATH TEXT,INC_FILE TEXT,INC_FULLPATH TEXT,INC_MODIFIED_TIME FLOAT);
Here's an example row from MASTER:
ID PATH FILE FULLPATH MODIFIED_TIME
---------- --------------- ---------- ----------------------- -------------
1 e:\ae/BONDS/0/0 100.bin e:\ae/BONDS/0/0/100.bin 1213903192.5
The tables have mostly identical data, with some differences between MODIFIED_TIME in MASTER and INC_MODIFIED_TIME in INCREMENTAL.
If I execute the following query in sqlite, I get the results I expect:
select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;
That query will pause for a minute or so, return a number of rows, pause again, return some more, etc., and finish without issue. Takes about 2 minutes to fully return everything.
However, if I execute the same query in Python:
changed_files = conn.execute("select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;")
It will never return: I can leave it running for 24 hours and still have nothing. The python32.exe process doesn't start consuming a large amount of CPU or memory; it stays pretty static. The process itself doesn't actually seem to go unresponsive either, but I can't Ctrl-C to break and have to kill the process to actually stop the script.
I do not have these issues with a small test database - everything runs fine in Python.
I realize this is a large amount of data, but if sqlite is handling the actual queries, python shouldn't be choking on it, should it? I can do other large queries from python against this database. For instance, this works:
new_files = conn.execute("SELECT DISTINCT INC_FULLPATH, INC_PATH, INC_FILE from INCREMENTAL where INC_FULLPATH not in (SELECT DISTINCT FULLPATH from MASTER);")
Any ideas? Are the pauses in between sqlite returning data causing a problem for Python? Or is something never occurring at the end to signal the end of the query results (and if so, why does it work with small databases)?
Thanks. This is my first stackoverflow post and I hope I followed the appropriate etiquette.
Python tends to ship with older versions of the SQLite library, especially on Python 2.x, where it is not updated.
However, your actual problem is that the query is slow.
Use the usual mechanisms to optimize it, such as creating a two-column index on INC_FULLPATH and INC_MODIFIED_TIME.
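For example, something like this (a sketch; the index name and database path are arbitrary):
import sqlite3

conn = sqlite3.connect('files.db')   # hypothetical path to your database file

# A two-column index on the join/filter columns lets SQLite look up matches
# instead of scanning all of INCREMENTAL for every row of MASTER.
conn.execute("CREATE INDEX IF NOT EXISTS idx_inc_fullpath_mtime "
             "ON INCREMENTAL (INC_FULLPATH, INC_MODIFIED_TIME);")
conn.commit()

changed_files = conn.execute(
    "SELECT ID FROM MASTER INNER JOIN INCREMENTAL "
    "ON FULLPATH = INC_FULLPATH AND MODIFIED_TIME != INC_MODIFIED_TIME;")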
I am writing a script in Python to QC data in a proprietary ESRI database table. The purpose of the script is not to modify invalid data, but simply to report invalid data to the user via a csv file. I am using ESRI's ArcPy package to access each individual record with arcpy.SearchCursor. The SearchCursor is the only way to access each individual record in the ESRI formats.
As I scroll through each record of the tables, I do multiple QC checks to validate specific business logic. One of those checks is looking for duplicate data in particular fields. One of those fields may be geometry. I have done this by creating an empty container object for each of those fields and as I check each record I use the following logic.
for field in dupCheckFields:
    if row.getValue(field) in fieldValues[field]:
        dupValues.add(row.getValue(idField))
    else:
        fieldValues[field].append(row.getValue(field))
The above code is an example of the basic logic I use. Where I am running into trouble is the fact that each of these tables may contain anywhere from 5000 records to 10 million records. I either run out of memory or the performance grinds to a halt.
I have tried the following container types: sets, lists, dictionaries, ZODB + BList, and Shelve.
With the in-memory types (sets, lists, dictionaries) the process is very fast at the start, but it gets much slower as it progresses. With these types, if the table has many records I will run out of memory. With the persistent data types I don't run out of memory, but it takes a very long time to process.
I only need the data while the script is running and any persistent data files will be deleted upon completion.
Question: Is there a better container type out there to provide low-memory storage of lots of data without a large cost in performance when accessing the data?
System: Win7 64-bit, Python 2.6.5 32-bit, 4gb RAM
Thanks in advance for your help.
EDIT:
Sample SQLite code:
import sqlite3, os, arcpy, timeit

fc = r"path\to\feature\class"
# test feature class was in ESRI ArcSDE format and contained "." characters
# separating database name, owner, and feature class name
fcName = fc.split(".")[-1]

# convert ESRI data types to SQLite data types
dataTypes = {"String": "text", "Guid": "text", "Double": "real", "SmallInteger": "integer"}
fields = [(field.name, dataTypes[field.type]) for field in arcpy.ListFields(fc)
          if field.name != arcpy.Describe(fc).OIDFieldName]

# SQL string to create table in SQLite with same schema as feature class
createTableString = """create table %s(%s,primary key(%s))""" % (
    fcName, ",\n".join('%s %s' % field for field in fields), fields[0][0])

# SQL string to insert data into SQLite table
insertString = """insert into %s values(%s)""" % (
    fcName, ",".join(["?" for i in xrange(len(fields))]))

# location to save SQLite database
loc = r'C:\TEMPORARY_QC_DATA'

def createDB():
    conn = sqlite3.connect(os.path.join(loc, 'database.db'))
    cur = conn.cursor()

    cur.execute(createTableString)
    conn.commit()

    rows = arcpy.SearchCursor(fc)

    i = 0
    for row in rows:
        try:
            cur.execute(insertString, [row.getValue(field[0]) for field in fields])
            if i % 10000 == 0:
                print i, "records"
                conn.commit()
            i += 1
        except sqlite3.IntegrityError:
            pass
    print i, "records"

t1 = timeit.Timer("createDB()", "from __main__ import createDB")

print t1.timeit(1)
Unfortunately I cannot share the test data I used with this code, however it was an ESRI ArcSDE geodatabase table containing approx. 10 fields and approx. 7 mil records.
I tried to use timeit to determine how long this process took, however after 2 hours of processing, only 120,000 records were complete.
If you store hashes in (compressed) files, you could stream through them to compare hashes to look for duplicates. Streaming usually has very low memory requirements — you can set the buffer you want, say, one line per hashed record. The tradeoff is generally time, particularly if you add compression, but if you order the files by some criteria, then you may be able to walk through the uncompressed streams to more quickly compare records.
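The hashing part of that idea can also be combined with the set-based approach from the question: store only fixed-size digests in memory instead of the raw field values, which keeps the per-record footprint small and constant. A rough sketch, reusing the names from the question (the field list is hypothetical):
import hashlib
import arcpy

fc = r"path\to\feature\class"          # as in the question's sample code
idField = arcpy.Describe(fc).OIDFieldName
dupCheckFields = ["NAME", "SHAPE"]     # hypothetical stand-ins for the real dupCheckFields

fieldValues = dict((field, set()) for field in dupCheckFields)
dupValues = set()

def digest(value):
    # Reduce an arbitrarily large field value (e.g. a geometry representation)
    # to a fixed 20-byte fingerprint before storing or comparing it.
    return hashlib.sha1(str(value)).digest()

for row in arcpy.SearchCursor(fc):
    for field in dupCheckFields:
        d = digest(row.getValue(field))
        if d in fieldValues[field]:
            dupValues.add(row.getValue(idField))
        else:
            fieldValues[field].add(d)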
I think I'd evaluate storing persistent data (such as the known field vals and counts) in a SQLite database. It is of course a trade off between memory usage and performance.
If you use a persistence mechanism that supports concurrent access, you can probably parallelise the processing of your data, using multiprocessing. Once complete, a summary of errors can be generated from the database.
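As an example of that last step, once the data is in SQLite (as in your EDIT), a duplicate summary is a single query away. A sketch with placeholder table and column names, assuming the checked column is not also the primary key used for the insert:
import csv
import sqlite3

conn = sqlite3.connect(r'C:\TEMPORARY_QC_DATA\database.db')   # path from the question's sample

# Hypothetical names; substitute the table and the field being QC-ed.
dup_rows = conn.execute("""
    SELECT GUID_FIELD, COUNT(*) AS occurrences
    FROM my_feature_class
    GROUP BY GUID_FIELD
    HAVING COUNT(*) > 1
    """)

with open('duplicates_report.csv', 'wb') as out:   # 'wb' for the csv module on Python 2
    writer = csv.writer(out)
    writer.writerow(['GUID_FIELD', 'occurrences'])
    writer.writerows(dup_rows)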
I need to add a new column to a large (5m row) django table. I have a south schemamigration that creates the new column. Now I'm writing a datamigration script to populate the new column. It looks like this. (If you're not familiar with south migrations, just ignore the orm. prefixing the model name.)
print "Migrating %s articles." % orm.Article.objects.count()
cnt = 0
for article in orm.Article.objects.iterator():
if cnt % 500 == 0:
print " %s done so far" % cnt
# article.newfield = calculate_newfield(article)
article.save()
cnt += 1
I switched from objects.all to objects.iterator to reduce memory requirements, but something is still chewing up vast amounts of memory when I run this script. Even with the actually useful line commented out as above, the script still grows to using 10+ GB of RAM before getting very far through the table, and I give up on it.
Seems like something is holding on to these objects in memory. How can I run this so it's not a memory hog?
FWIW, I'm using python 2.6, django 1.2.1, south 0.7.2, mysql 5.1.
Ensure settings.DEBUG is set to False. DEBUG=True fills up memory, especially during database-intensive operations, since it stores all queries sent to the RDBMS within a view.
With Django 1.8 out, this should no longer be necessary, since a hardcoded maximum of 9000 queries is now stored, instead of an infinite number before.
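If you cannot change the setting for the migration run, the query log can also be cleared by hand inside the loop. A small sketch:
from django import db
from django.conf import settings

for article in orm.Article.objects.iterator():
    # article.newfield = calculate_newfield(article)
    article.save()
    if settings.DEBUG:
        # Drop the per-connection query log that DEBUG=True keeps growing.
        db.reset_queries()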
Welcome to Django's ORM. I think this is an inherent problem.
I've also had problems with large databases, dumpdata, loaddata and the like.
You have two choices.
Stop trying to use south and write your own ORM migration. You can have multiple database definitions in your settings. Create "old" and "new". Write your own one-time migrator from the old database to the new database. Once that's tested and works, run it one final time and then switch the database definitions and restart Django.
Ditch south and the ORM and write your own SQL migration. Use raw SQL to copy data out of the old structure to the new structure. Debug separately. When it's good, run it one final time and then switch your setting and restart Django.
It's not that south or the ORM are particularly bad. But, for bulk processing in large databases, they cache too much in memory.
Or, what happens if you create a raw query in situ which implements a rudimentary resultset size limit?
a la: https://docs.djangoproject.com/en/1.3/topics/db/sql/#index-lookups
min_id = 0
chunk = 500
rowcount = Article.objects.count()
while min_id < rowcount:
    max_id = min_id + chunk
    articles = Article.objects.raw(
        'SELECT * FROM article WHERE id > %s AND id <= %s', [min_id, max_id])
    for old_article in articles:
        # fill in the new field, then save the article
        old_article.save()
    min_id = max_id
orm.Article.objects.iterator()
Does that run the whole query and save the result in memory? Or fetch rows from the database one at a time?
I'm guessing it does it all at once. See if you can replace that loop with a database cursor that pulls the data in an incremental fashion:
eg: http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.fetchmany
db = blah.connect("host='%s' dbname='%s' user='%s' password='%s'" % ...
new, old = db.cursor(), db.cursor()

old.execute("""
    SELECT *
    FROM whatever
    """)

for row in old.fetchmany(size=500):
    (col1, col2, col3...) = row
    new.execute("""
        INSERT INTO yourtable (
            col1, col2, col3...)
        VALUES (
            %s, %s, %s, %s, %s)
        """, (col1, col2, col3, ...))

new.close()
old.close()
It will be slow. I pulled this from a standalone migration script of mine so ymmv.
fetchmany is standard (PEP249). I've not done exactly what you're looking for so there's a little work still to do from this sample: I've not looped over the loop - to get sets of 500 till done - so you'll need to work that out for yourself.
If you don't need full access to the objects, you can always use a combo of only and values or values_list on your queryset. That should help reduce the memory requirements significantly, but I'm not sure whether it will be enough.
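A rough sketch of that combination, pairing values_list with a queryset .update() so no model instances are ever built (here 'headline' and calculate_newfield_from_values are hypothetical stand-ins for whatever columns and logic your calculation really needs):
# Iterate over plain tuples instead of full Article objects, and write the
# new value back with .update(), which issues a single UPDATE per row.
for pk, headline in orm.Article.objects.values_list('pk', 'headline').iterator():
    orm.Article.objects.filter(pk=pk).update(
        newfield=calculate_newfield_from_values(headline))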
I've got a very large SQLite table with over 500,000 rows and about 15 columns (mostly floats). I want to transfer data from the SQLite DB to a Django app (which could be backed by many RDBMSs, but Postgres in my case). Everything works OK, but as the iteration continues, memory usage jumps by 2-3 MB a second for the Python process. I've tried using 'del' to delete the EVEMapDenormalize and row objects at the end of each iteration, but the bloat continues. Here's an excerpt, any ideas?
class Importer_mapDenormalize(SQLImporter):
    def run_importer(self, conn):
        c = conn.cursor()
        for row in c.execute('select * from mapDenormalize'):
            mapdenorm, created = EVEMapDenormalize.objects.get_or_create(id=row['itemID'])
            mapdenorm.x = row['x']
            mapdenorm.y = row['y']
            mapdenorm.z = row['z']

            if row['typeID']:
                mapdenorm.type = EVEInventoryType.objects.get(id=row['typeID'])

            if row['groupID']:
                mapdenorm.group = EVEInventoryGroup.objects.get(id=row['groupID'])

            if row['solarSystemID']:
                mapdenorm.solar_system = EVESolarSystem.objects.get(id=row['solarSystemID'])

            if row['constellationID']:
                mapdenorm.constellation = EVEConstellation.objects.get(id=row['constellationID'])

            if row['regionID']:
                mapdenorm.region = EVERegion.objects.get(id=row['regionID'])

            mapdenorm.save()
        c.close()
I'm not at all interested in wrapping this SQLite DB with the Django ORM. I'd just really like to figure out how to get the data transferred without sucking all of my RAM.
Silly me, this was addressed in the Django FAQ.
Needed to clear the DB query cache while in DEBUG mode.
from django import db
db.reset_queries()
I think a select * from mapDenormalize and loading the result into memory will always be a bad idea. My advice is to split the script into chunks. Use LIMIT to get the data in portions: get the first portion, work with it, close the cursor, and then get the next portion.
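In sqlite3 terms that might look roughly like this (a sketch; the chunk size and the function wrapper are made up, and the per-row get_or_create / save work from the question goes where the pass is):
import sqlite3

def import_in_chunks(db_path, chunk_size=10000):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row        # keeps the row['itemID'] style of access
    offset = 0
    while True:
        rows = conn.execute('SELECT * FROM mapDenormalize LIMIT ? OFFSET ?',
                            (chunk_size, offset)).fetchall()
        if not rows:
            break
        for row in rows:
            pass   # per-row get_or_create / save handling from the question goes here
        offset += chunk_size
    conn.close()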