I need to iterate over a large collection (3 * 10^6 elements) in Django to do some kind of analysis that can't be done using a single SQL statement.
Is it possible to turn off collection caching in Django? (Caching all the data is not acceptable; the data is around 0.5 GB.)
Is it possible to make Django fetch the collection in chunks? It seems that it tries to prefetch the whole collection into memory and then iterate over it. I infer that from the speed of execution:
    iter(Coll.objects.all()).next()          # this takes forever
    iter(Coll.objects.all()[:10000]).next()  # this takes less than a second
Use QuerySet.iterator() to walk over the results instead of loading them all first.
It seems that the problem was caused by the database backend (sqlite), which doesn't support reading in chunks.
I used sqlite because the database will be trashed after I do all the computations, but it seems that sqlite isn't good even for that.
Here is what I found in the Django source code of the sqlite backend:
class DatabaseFeatures(BaseDatabaseFeatures):
    # SQLite cannot handle us only partially reading from a cursor's result set
    # and then writing the same rows to the database in another cursor. This
    # setting ensures we always read result sets fully into memory all in one
    # go.
    can_use_chunked_reads = False
I've just set up a delta-loading data flow between multiple MySQL DBs and a Postgres DB. It's only copying tens of MB every 15 minutes.
Yet, I'd like to set up a process to fully load the data between them in case of emergency...
Python is just crashing and seems not to be fast enough when using SQLAlchemy etc.
I've read that the best approach might be to just dump everything from MySQL into CSV and then use file_fdw to load the entire tables into Postgres.
Has anyone faced a similar issue? If yes, how did you proceed?
Long story short, ORM overhead is killing your performance.
When you're not manipulating the objects involved, it's better to use SQLAlchemy Core expressions ("SQL Expressions"), which are almost as fast as pure SQL.
Solution:
Of course I'm presuming your MySQL and Postgres models have been meticulously synchronized (i.e. the values of an object from MySQL can be used as-is to create an object in the Postgres model, and vice versa).
Overview:
1. get Table objects out of declarative classes
2. select (SQLAlchemy Expression) from one database
3. convert rows to dicts
4. insert into the other database
More or less:
from sqlalchemy import and_, select

# get tables
m_table = ItemMySQL.__table__
pg_table = ItemPG.__table__
# SQL Expression that gets a range of rows quickly
pg_q = select([pg_table]).where(
    and_(
        pg_table.c.id >= id_start,
        pg_table.c.id <= id_end,
    ))
# get PG DB rows
eng_pg = DBSessionPG.get_bind()
conn_pg = eng_pg.connect()
result = conn_pg.execute(pg_q)
rows_pg = result.fetchall()
# MySQL connection (assumes a DBSessionMySQL session exists alongside DBSessionPG)
conn_m = DBSessionMySQL.get_bind().connect()
for row_pg in rows_pg:
    # convert PG row object into dict
    value_d = dict(row_pg)
    # insert into MySQL; the statement does nothing until it is executed
    conn_m.execute(m_table.insert().values(**value_d))
# close row proxy object and connections, else suffer leaks
result.close()
conn_pg.close()
conn_m.close()
For background on performance, see the accepted answer (by SQLAlchemy's principal author himself) to:
Why is SQLAlchemy insert with sqlite 25 times slower than using sqlite3 directly?
Since you seem to have Python crashing, perhaps you're using too much memory? Hence I suggest reading and writing rows in batches.
A further improvement could be using .values to insert a number of rows in one call, see here: http://docs.sqlalchemy.org/en/latest/core/tutorial.html#inserts-and-updates
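A rough sketch of the batching idea, not the code from this answer: it uses two in-memory sqlite3 databases from the standard library as stand-ins for the real MySQL and Postgres connections, and the table `item`, its columns and the helper `copy_in_batches` are made up for illustration. Rows are read with `fetchmany` and written with `executemany`, so only one batch is held in memory at a time:

```python
import sqlite3


def copy_in_batches(src, dst, batch_size=1000):
    """Copy all rows of the hypothetical `item` table from src to dst."""
    cur = src.execute("SELECT id, name FROM item ORDER BY id")
    copied = 0
    while True:
        rows = cur.fetchmany(batch_size)  # at most batch_size rows in memory
        if not rows:
            break
        # one multi-row insert call per batch
        dst.executemany("INSERT INTO item (id, name) VALUES (?, ?)", rows)
        dst.commit()
        copied += len(rows)
    return copied


# stand-in databases; real code would use the MySQL/Postgres connections
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for conn in (src, dst):
    conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO item VALUES (?, ?)",
                [(i, "item-%d" % i) for i in range(1, 2501)])
src.commit()

total = copy_in_batches(src, dst, batch_size=1000)
```

The same loop shape should carry over to SQLAlchemy Core by passing a list of dicts to a single insert execution per batch, but that part is not shown here.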
Q: Which is quicker for this scenario?
My scenario: my application will be storing a list of links in either an array or a PostgreSQL db, so it might look like:
1) mysite.com
    a) /users/login
    b) /users/registration/
    c) /contact/
    d) /locate/search
    e) /priv/admin-login
I will be doing string searches on the entries under 1) to find, for example, any path that contains 'login'.
A given domain could have anywhere from 5 to 100 more entries like a) through e).
The usage: This data structure can potentially change as often as every day, but only once per day. Some key/value pairs will be removed, others will be modified. An individual set looks like:
dict2 = { 'thesite.com': 123, 98.6: 37 };
Each key will represent 1 and only 1 domain.
I've tried searching a bit on this, but cannot seem to find a really good answer to: when should an array be used and when should a db like PostgreSQL be used?
I've always used a db to handle data (MySQL, not PostgreSQL), but I'm trying to do it better from now on, so I wondered whether an array or other data structure would work better when looping and trying to match a given string.
As always, thank you!
A full SQL database would probably be overkill. If you can fit everything in memory, put it all in a dict and then use the pickle module to serialize it and write it to the disk.
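A minimal sketch of the dict-plus-pickle approach, assuming the data is shaped as one key per domain with a list of paths as the value (that shape, the file name and the sample paths are illustrative, taken loosely from the question):

```python
import os
import pickle
import tempfile

# hypothetical data: one key per domain, a list of paths as the value
links = {
    "mysite.com": ["/users/login", "/users/registration/", "/contact/",
                   "/locate/search", "/priv/admin-login"],
}

path = os.path.join(tempfile.mkdtemp(), "links.pickle")

# write the whole dict to disk
with open(path, "wb") as f:
    pickle.dump(links, f)

# load it back later and search in memory
with open(path, "rb") as f:
    loaded = pickle.load(f)

matches = [p for p in loaded["mysite.com"] if "login" in p]
```

Since the data changes at most once per day, rewriting the whole file on each change is probably cheap enough.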
Another good option would be to use one of the dbm modules (dbm/dbm.ndbm, gdbm or anydbm) to store the data in a disk-bound hash table. It will have O(1) lookup times without the need to connect and form a query like in a bigger database.
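A small sketch of the dbm route. In Python 3 the generic entry point is the `dbm` module (the successor of `anydbm`), which picks whichever backend is available. Keys and values are bytes, so a list of paths has to be encoded somehow; JSON is used here as one possible choice, and the file name is made up:

```python
import dbm
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "links")

# "c" opens the database for reading and writing, creating it if needed
with dbm.open(path, "c") as db:
    # keys and values are stored as bytes, so the list is JSON-encoded
    db[b"mysite.com"] = json.dumps(["/users/login", "/contact/"]).encode("utf-8")

# reopen read-only and look up a single domain without loading the rest
with dbm.open(path, "r") as db:
    paths = json.loads(db[b"mysite.com"].decode("utf-8"))
```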
edit: If you have multiple values per key and you don't want a full-blown database, SQLite would be a good choice. There is already a built-in module for it, sqlite3 (as mentioned in the comments)
Test it. It's your dataset, your hardware, your available disk and network IO, your usage pattern. There's no one true answer here. We don't even know how many queries you are planning: are we talking about one per minute or thousands per second?
If your data fits nicely in memory and doesn't take a massive amount of time to load the first time, sticking it into a dictionary in memory will probably be faster.
If you're always looking for full words (like in the login case), you will gain some speed too from splitting the url into parts and indexing those separately.
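One way to read "splitting the url into parts and indexing those separately" is a small inverted index from words to paths. This is only a sketch under that interpretation; the splitting rule (anything that is not a letter or digit separates words) is an assumption:

```python
import re
from collections import defaultdict

paths = ["/users/login", "/users/registration/", "/contact/",
         "/locate/search", "/priv/admin-login"]

# map each word to the set of paths that contain it
index = defaultdict(set)
for path in paths:
    # split on anything that is not a letter or digit: "/", "-", ...
    for word in re.split(r"[^A-Za-z0-9]+", path):
        if word:
            index[word.lower()].add(path)

# a full-word lookup is now a single dict access instead of a scan
login_paths = sorted(index["login"])
```

This only helps for whole-word matches; a search for an arbitrary substring such as 'log' would still need a scan over the paths.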
I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.
So far, I tried two things:
1. Using Python, I browse the table by chunks (typically 10'000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses. I have not been able to export the full table with this.
2. Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() at the C level, which corresponds to having a Cursor or DictCursor at the Python level. This ensures that the database is free again as soon as possible and the client can do whatever it wants with the data.
But it is not suitable for large amounts of data, as the data is cached in the client process. This can be very memory consuming.
In this case, it may be better to use mysql_use_result() (C) or SSCursor / SSDictCursor (Python). This requires you to consume the whole result set and to do nothing else with the database connection in the meantime, but it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
I don't know exactly what query you used because you haven't given it here, but I suppose you're specifying a limit and an offset. Such queries are quite quick at the beginning of the data, but get very slow as the offset grows.
If you have a unique column such as ID, you can still fetch only the first N rows each time, but modify the WHERE clause:
    WHERE ID > (last_id)
This would use the index and would be acceptably fast.
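A minimal sketch of that `WHERE ID > (last_id)` loop, using an in-memory sqlite3 database as a stand-in for MySQL. The table `t`, its columns and the helper `iter_rows_by_id` are hypothetical, and the ids are assumed to be positive integers:

```python
import sqlite3


def iter_rows_by_id(conn, chunk_size=10000):
    """Yield rows of the hypothetical table `t` in id order, chunk by chunk."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM t WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return
        for row in rows:
            yield row
        last_id = rows[-1][0]  # resume after the last id seen


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, "row %d" % i) for i in range(1, 26)])

ids = [row[0] for row in iter_rows_by_id(conn, chunk_size=10)]
```

Each chunk starts from an index lookup on the id rather than skipping over all earlier rows, which is why the cost per chunk stays roughly constant.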
However, it should generally be faster to simply run
    SELECT * FROM table
and open a cursor for that query with a reasonably big fetch size.
I have a memory issue with mongoengine (in python).
Let's say I have a very large number of custom_documents (several thousand).
I want to process them all, like this:
for item in custom_documents.objects():
    process(item)
The problem is that custom_documents.objects() loads every object into memory, and my app uses several GB...
How can I make it more memory-efficient?
Is there a way to make mongoengine query the DB lazily (so that it requests objects as we iterate over the queryset)?
According to the docs (and in my experience), collection.objects returns a lazy QuerySet. Your first problem might be that you're calling the objects attribute, rather than just using it as an iterable. I feel like there must be some other reason your app is using so much memory, perhaps process(object) stores a reference to it somehow? Try the following code and check your app's memory usage:
queryset = custom_documents.objects
print queryset.count()
Since QuerySets are lazy, you can also do things like custom_documents.objects.limit(100).skip(500) in order to return objects 500-600 only.
I think you want to look at querysets - these are the MongoEngine wrapper for cursors:
http://mongoengine.org/docs/v0.4/apireference.html#querying
They let you control the number of objects returned, essentially taking care of the batch size settings etc. that you can set directly in the pymongo driver:
http://api.mongodb.org/python/current/api/pymongo/cursor.html
Cursors generally behave this way by default; you have to go out of your way to get them to return everything in one shot, even in the native mongodb shell.
Apologies for the longish description.
I want to run a transform on every doc in a large-ish MongoDB collection with 10 million records, approx 10 GB. Specifically, I want to apply a geoip transform to the ip field in every doc and either append the result record to that doc or just create a whole other record linked to this one by, say, id (the linking is not critical, I can just create a whole separate record). Then I want to count and group by, say, city (I do know how to do the last part).
The major reason I believe I can't use map-reduce is that I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via limit/skip is out of the question, as it does a "table scan" and is going to get progressively slower.
Any suggestions?
Python or JS preferred, just because I have these geoip libs, but code examples in other languages are welcome.
Since you have to go over "each record", you'll do one full table scan anyway, so a simple cursor (find()), maybe fetching only a few fields (_id, ip), should do it. The Python driver does the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default is not good enough.
If you add a new field and it doesn't fit the previously allocated space, mongo will have to move it to another place, so you might be better off creating a new document.
Actually I am also attempting another approach in parallel (as plan B), which is to use mongoexport. I use it with --csv to dump a large CSV file with just the (id, ip) fields. Then the plan is to use a Python script to do a geoip lookup and then post back to mongo as a new doc on which map-reduce can now be run for count etc. Not sure if this is faster or the cursor is. We'll see.
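A sketch of what the Python side of plan B might look like, under several assumptions: `lookup_city` is a placeholder for whatever geoip library is used, the CSV header (`_id,ip`) and the sample rows are invented, and the new documents are collected in a list here instead of being posted back to mongo:

```python
import csv
import io
from collections import Counter


def lookup_city(ip):
    """Placeholder for a real geoip lookup; the table below is made up."""
    fake_table = {"10.0.0.1": "Paris", "10.0.0.2": "Berlin"}
    return fake_table.get(ip, "unknown")


# stands in for the file written by mongoexport; header and rows are invented
exported = io.StringIO("_id,ip\na1,10.0.0.1\na2,10.0.0.2\na3,10.0.0.1\n")

new_docs = []
counts = Counter()
for row in csv.DictReader(exported):  # streams one row at a time
    city = lookup_city(row["ip"])
    new_docs.append({"source_id": row["_id"], "ip": row["ip"], "city": city})
    counts[city] += 1
```

With 10 million rows the new documents would presumably be inserted in batches rather than kept in one list. The per-city counts fall out of the same pass, so a later map-reduce may not be needed for that part.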