I have a Django app that uses django-piston to send out XML feeds to internal clients. Generally these work pretty well, but we have some XML feeds that currently take over 15 minutes to generate. This causes timeouts, and the feeds become unreliable.
I'm trying to come up with ways to improve this setup. If it requires some restructuring of the data, that could be possible too.
Here is how the data collection currently looks:
class Data(models.Model):
    # fields

class MetadataItem(models.Model):
    data = models.ForeignKey(Data)

# handlers.py
data = Data.objects.filter(**kwargs)
for d in data:
    data_metadata = {}
    for metaitem in d.metadataitem_set.all():
        # There are usually anywhere between 55 and 95 entries in this loop
        label = metaitem.get_label()  # does some formatting here
        data_metadata[label] = metaitem.body
Obviously, the core of the program does much more; I'm just pointing out where the problem lies. When the data list reaches around 300 items, the feed becomes unreliable and times out.
What I've tried:
Getting a collection of all the Data ids, then doing a single large query to get all the MetadataItems, and finally filtering those in my loop. This was meant to cut down the number of queries, which it did.
Using .values() to reduce model instance overhead, which did speed it up but not by much.
One simpler idea I'm considering is to write to a cache in steps to avoid the timeout: I would write the first 50 data sets, save them to the cache, adjust a counter, write the next 50, and so on. I still need to think this through.
Hoping someone can help lead me into the right direction with this.
The problem in the piece of code you posted is that Django doesn't automatically include objects that are connected through a reverse relationship, so you have to make a query for each object. There's a nice way around this, as Daniel Roseman points out in his blog!
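A minimal sketch of that idea, reusing the Data/MetadataItem models from the question: fetch all the related MetadataItems in one query and group them in Python, instead of querying per Data row (similar in spirit to what you already tried, but grouping once by data_id rather than filtering inside the loop).

from collections import defaultdict

data = Data.objects.filter(**kwargs)
data_ids = [d.pk for d in data]

# one query for every MetadataItem belonging to the selected Data rows
metadata_by_data = defaultdict(list)
for metaitem in MetadataItem.objects.filter(data__in=data_ids):
    metadata_by_data[metaitem.data_id].append(metaitem)

for d in data:
    data_metadata = {}
    for metaitem in metadata_by_data[d.pk]:
        label = metaitem.get_label()  # same formatting as before
        data_metadata[label] = metaitem.body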
If this doesn't solve your problem well enough, you could also have a look at trying to get everything in one raw SQL query...
You could maybe further reduce the query count by first getting all the Data ids and then using select_related to get each Data row and its metadata in a single bigger query. This would greatly reduce the number of queries, but the size of the queries might be impractical/too big. Something like:
data_ids = Data.objects.filter(**kwargs).values_list('id', flat=True)
for i in data_ids:
    data = Data.objects.select_related().get(pk=i)
    # data.metadataitem_set.all() can now be called without querying the database again
    for metaitem in data.metadataitem_set.all():
        # ...
However, I would suggest, if possible, precomputing the feeds somewhere outside the webserver. Maybe you could store the result in memcache if it's smaller than 1 MB. Or you could be one of the cool new kids on the block and store the result in a "NoSQL" database like redis. Or you could just write it to a file on disk.
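For instance, a rough sketch of the cache-the-result approach using Django's cache framework; build_feed_xml() is a hypothetical stand-in for whatever currently renders the XML in the piston handler, and the key and timeout are arbitrary choices for the sketch:

from django.core.cache import cache

FEED_CACHE_KEY = 'big_xml_feed'   # placeholder key
FEED_CACHE_TIMEOUT = 60 * 60      # e.g. rebuild at most once an hour

def get_feed():
    xml = cache.get(FEED_CACHE_KEY)
    if xml is None:
        # build_feed_xml() is hypothetical: the slow part that currently
        # runs inside the request
        xml = build_feed_xml()
        cache.set(FEED_CACHE_KEY, xml, FEED_CACHE_TIMEOUT)
    return xml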
If you can change the structure of the data, maybe you can also change the datastore?
The "NoSQL" databases which allow some structure, like CouchDB or MongoDB could actually be useful here.
Let's say for every Data item you have a document. The document would have your normal fields. You would also add a 'metadata' field which is a list of metadata. What about the following datastructure:
{
'id': 'someid',
'field': 'value',
'metadata': [
{ 'key': 'value' },
{ 'key': 'value' }
]
}
You would then be able to easily get to a data record and all of its metadata. For searching, add indexes to the fields in the 'data' document.
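As a rough illustration with pymongo (the database/collection names are placeholders, and the label/body pair is just meant to mirror the metadata dict built in the question):

from pymongo import MongoClient

client = MongoClient()
data_coll = client['mydb']['data']   # placeholder database/collection names

# one document per Data item, with its metadata embedded
data_coll.insert_one({
    '_id': 'someid',
    'field': 'value',
    'metadata': [
        {'label': 'Author', 'body': 'someone'},
        {'label': 'Format', 'body': 'xml'},
    ],
})

# the whole record plus its metadata comes back in a single round trip
doc = data_coll.find_one({'_id': 'someid'})

# index the fields you search on
data_coll.create_index('field')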
I've worked on a system in Erlang/OTP that used Mnesia which is basically a key-value database with some indexing and helpers. We used nested records heavily to great success.
I added this as a separate answer, as it's totally different from the other.
Another idea is to use Celery (www.celeryproject.com), a task queue for Python and Django. You can use it to perform any long-running task asynchronously without holding up your main app server.
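A hedged sketch of what that could look like, assuming a recent Celery with Django integration: the task rebuilds the feed data in the background and drops it into the cache, where the view can pick it up. The app path, cache key and timeout are placeholders.

from celery import shared_task
from django.core.cache import cache

from myapp.models import Data   # hypothetical app path for the question's models

@shared_task
def rebuild_feed(cache_key, **filter_kwargs):
    # reuses the Data/MetadataItem models from the question; the one-hour
    # timeout is an arbitrary choice for the sketch
    payload = []
    for d in Data.objects.filter(**filter_kwargs):
        metadata = dict((m.get_label(), m.body) for m in d.metadataitem_set.all())
        payload.append({'id': d.pk, 'metadata': metadata})
    cache.set(cache_key, payload, 60 * 60)

# kicked off from a view, a periodic schedule or a post-save signal, e.g.:
# rebuild_feed.delay('feed:internal', some_field='value')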
I am currently developing a Python Discord bot that uses a Mongo database to store user data.
As this data is continually changed, the database would be subjected to a massive number of queries to both extract and update the data; so I'm trying to find ways to minimize client-server communication and reduce bot response times.
In this sense, is it a good idea to create a copy of a Mongo collection as a dictionary list as soon as the script is run, and manipulate the data offline instead of continually querying the database?
In particular, every time a piece of data would be searched for with the collection.find() method, it is instead extracted from the list. On the other hand, every time a piece of data needs to be updated with collection.update(), both the list and the database are updated.
I'll give an example to better explain what I'm trying to do. Let's say that my collection contains documents with the following structure:
{"user_id": id_of_the_user, "experience": current_amount_of_experience}
and the experience value must be continually increased.
Here's how I'm implementing it at the moment:
online_collection = db["collection_name"]             # the MongoDB collection
offline_collection = list(online_collection.find())   # an in-memory copy of the collection

def updateExperience(user_id):
    online_collection.update_one({"user_id": user_id}, {"$inc": {"experience": 1}})
    mydocument = next(document for document in offline_collection if document["user_id"] == user_id)
    mydocument["experience"] += 1

def findExperience(user_id):
    mydocument = next(document for document in offline_collection if document["user_id"] == user_id)
    return mydocument["experience"]
As you can see, the database is involved only for the update function.
Is this a valid approach?
For very large collections (millions of documents), does the next() call have the same execution time, or would there still be some slowdown?
Also, while not explicitly asked in the question, I'd be more than happy to get any advice on how to improve the performance of a Discord bot, as long as it doesn't involve using a VPS or sharding, since I'm already using those options.
I don't really see why not, as long as you're aware of the following:
You will need the system resources to load an entire database into memory
It is your responsibility to sync the actual db and your local store
You do need to be the only person/system updating the database
Eventually this pattern will fail i.e. db gets too large, or more than one process needs to update, so it isn't future-proof.
In essence you're talking about a caching solution - so no need to reinvent the wheel - there are many such products/solutions you could use.
It's probably not the traditional way of doing things, but if it works then why not
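One small tweak worth considering on top of that (a sketch only, reusing the db handle from the question): keep the offline copy keyed by user_id in a dict, so lookups stay O(1) instead of the linear scans that next(...) does over a list. The same synchronisation caveats from the list above still apply.

online_collection = db["collection_name"]
offline_by_id = {doc["user_id"]: doc for doc in online_collection.find()}

def updateExperience(user_id):
    online_collection.update_one({"user_id": user_id}, {"$inc": {"experience": 1}})
    offline_by_id[user_id]["experience"] += 1

def findExperience(user_id):
    return offline_by_id[user_id]["experience"]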
I have a City model and fixture data with a list of cities, and I'm currently cleaning up the URLs in the view and template before loading them. So I do the following in a template to get a URL like http://newcity.domain.com:
<a href="http://{{ city.name|lower|cut:" " }}.{{ SITE_URL }}">
The actual city.name is "New City"
Would it be better if I stored already cleaned data (newcity) in a new column (short_name) on MySQL db and just use city.short_name on templates and views?
This seems very opinion-oriented. Is it faster? The only way to know for sure is to measure. Is it faster to a degree that you care about? Probably not. In any event, it's better not to make schema design decisions based on performance unless you've observed measurably bad performance.
All other things being equal, it is generally best to store the data in different columns. It's easier to join it in controller or template code than it is to separate it out into its pieces.
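If you do go the extra-column route, here is a minimal sketch of what the question describes (the field name short_name is taken from the question; overriding save() is just one way to keep it filled in automatically):

from django.db import models

class City(models.Model):
    name = models.CharField(max_length=100)
    short_name = models.CharField(max_length=100, blank=True, db_index=True)

    def save(self, *args, **kwargs):
        if not self.short_name:
            # mirrors the template's lower + cut:" " behaviour: "New City" -> "newcity"
            self.short_name = self.name.lower().replace(" ", "")
        super(City, self).save(*args, **kwargs)

The template then simply becomes <a href="http://{{ city.short_name }}.{{ SITE_URL }}">.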
Storing the short name in a MySQL database requires I/O, and I/O is always slow. For such an easy transformation of the data, it should be faster to keep things as they are and avoid the extra I/O to the database.
If you really want to know the difference, use timeit (https://docs.python.org/2/library/timeit.html); accessing a database is probably much slower.
It really depends. If you have a fixed number of cities in a list, just hardcode them (unless you have so many cities that they would actually put some stress on your server's resources - but I don't think that's the case here). Otherwise you need some type of persistent store for the cities, and a database comes in handy.
Q: Which is quicker for this scenario?
My scenario: my application will store a list of links in either an array or a PostgreSQL db, so it might look like:
1) mysite.com
a) /users/login
b) /users/registration/
c) /contact/
d) /locate/search
e) /priv/admin-login
For the entries under 1) above, I will be doing string searches on these URLs to find, for example, any path that contains 'login'.
The above letters a) through e) could maybe have anywhere from 5-100 more entries for a given domain.
The usage: This data structure can change potentially as much as every day, but only once per day. Some key/values will be removed, others will be modified. An individual set looks like:
dict2 = {'thesite.com': 123, 98.6: 37}
Each key will represent 1 and only 1 domain.
I've tried searching a bit on this, but can't seem to find a really good answer to: when should an array be used and when should a db like PostgreSQL be used?
I've always used a db to handle data (using MySQL, not PostgreSQL), but I'm trying to do things better from now on, so I wondered whether an array or other data structure would work better within a loop while trying to match a given string.
As always, thank you!
A full SQL database would probably be overkill. If you can fit everything in memory, put it all in a dict and then use the pickle module to serialize it and write it to the disk.
Another good option would be to use one of the dbm modules (dbm/dbm.ndbm, gdbm or anydbm) to store the data in a disk-bound hash table. It will have O(1) lookup times without the need to connect and form a query like in a bigger database.
edit: If you have multiple values per key and you don't want a full-blown database, SQLite would be a good choice. There is already a built-in module for it, sqlite3 (as mentioned in the comments)
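A minimal sketch of the dict-plus-pickle idea above, using the example data from the question (the file name is arbitrary):

import pickle

STORE_PATH = 'links.pickle'   # arbitrary file name

links = {
    'mysite.com': ['/users/login', '/users/registration/', '/contact/',
                   '/locate/search', '/priv/admin-login'],
}

# persist after the daily update
with open(STORE_PATH, 'wb') as f:
    pickle.dump(links, f)

# reload at startup
with open(STORE_PATH, 'rb') as f:
    links = pickle.load(f)

# string search inside a loop, e.g. every path containing 'login'
matches = [path for path in links['mysite.com'] if 'login' in path]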
Test it. It's your dataset, your hardware, your available disk and network IO, your usage pattern. There's no one true answer here. We don't even know how many queries you are planning - are we talking about one per minute or thousands per second?
If your data fits nicely in memory and doesn't take a massive amount of time to load the first time, sticking it into a dictionary in memory will probably be faster.
If you're always looking for full words (like in the login case), you will gain some speed too from splitting the url into parts and indexing those separately.
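For instance, a small sketch of that idea using the question's example paths: split each URL into its segments and index them in a dict, so full-word lookups avoid scanning every string.

from collections import defaultdict

paths = ['/users/login', '/users/registration/', '/contact/',
         '/locate/search', '/priv/admin-login']

index = defaultdict(set)
for path in paths:
    for part in path.strip('/').split('/'):
        index[part].add(path)

print(index['login'])   # {'/users/login'}
# note: '/priv/admin-login' is not in the result, because 'admin-login' is a
# different full segment - this is exact-word lookup only, as described above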
Apologies for the longish description.
I want to run a transform on every doc in a largish MongoDB collection with roughly 10 million records (approximately 10 GB). Specifically, I want to apply a geoip transform to the ip field in every doc, and either append the resulting record to that doc or just create a whole other record linked to this one by, say, id (the linking is not critical; I can just create a whole separate record). Then I want to count and group by, say, city (I do know how to do that last part).
The major reason I believe I can't use map-reduce is that I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via limit/skip is out of the question, as it does a "table scan" and gets progressively slower.
Any suggestions?
Python or JS preferred, just because I have these geoip libs, but code examples in other languages are welcome.
Since you have to go over "each record", you'll do one full table scan anyway, so a simple cursor (find()), perhaps fetching only the few fields you need (_id, ip), should do it. The Python driver will do the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default isn't good enough.
If you add a new field and it doesn't fit the previously allocated space, mongo will have to move it to another place, so you might be better off creating a new document.
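A hedged sketch of that cursor approach with pymongo; the database/collection names are placeholders, and lookup_city() stands in for whatever the geoip library exposes. It writes the result into a separate collection, in line with the point above about not growing existing documents:

from pymongo import MongoClient

client = MongoClient()
db = client['mydb']                 # placeholder database name
source = db['records']
results = db['geoip_results']

# project just _id and ip, and hint at a batch size
cursor = source.find({}, {'_id': 1, 'ip': 1}).batch_size(1000)
for doc in cursor:
    city = lookup_city(doc['ip'])   # hypothetical geoip lookup call
    results.insert_one({'record_id': doc['_id'], 'ip': doc['ip'], 'city': city})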
Actually, I am also attempting another approach in parallel (as plan B), which is to use mongoexport. I use it with --csv to dump a large CSV file with just the (id, ip) fields. The plan is then to use a Python script to do the geoip lookup and post the results back to Mongo as new docs, on which map-reduce can then be run for counting etc. Not sure whether this is faster than the cursor. We'll see.
I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
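As a rough sketch of the timed-job variant mentioned above (SavedSearch, Item and notify() are hypothetical stand-ins for the app's own models and alerting, and get_query() is assumed to unpickle the stored Q object from the question):

from django.utils import timezone

def run_saved_searches(last_run):
    # everything added or changed since the previous run
    new_items = Item.objects.filter(last_modified__gte=last_run)
    for search in SavedSearch.objects.all():
        for item in new_items.filter(search.get_query()):
            notify(search.user, item)   # hypothetical notification hook
    return timezone.now()               # persist this as the next last_run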
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as mini-docs and indexing them based on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", which built a list of possibly interesting searches to run; only those searches were then run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
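A minimal sketch of that signal-based approach; SavedSearch, its content_type field, get_query() and notify() are assumptions about how the stored searches and alerts might be modelled, not an existing API:

from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save)
def run_matching_searches(sender, instance, created, **kwargs):
    if not created:
        return
    ct = ContentType.objects.get_for_model(sender)
    # only the saved searches that involve this object type
    for search in SavedSearch.objects.filter(content_type=ct):
        # get_query() would unpickle the stored Q object from the question
        if sender.objects.filter(search.get_query(), pk=instance.pk).exists():
            notify(search.user, instance)   # hypothetical notification hook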