I want to store 'status updates' in MongoDB, so this collection/array can get very big. One option would be to save the documents in an array nested inside the user/group/... document (different collections need their own 'status updates'). The other way would be to create a separate collection, save the messages there, and relate the user/group/... to the status updates via an ObjectId.
I want to know:
which is faster
which is easier to administer and query
I don't think I'm going to use an ORM/ODM, just "plain" pymongo.
I haven't found any clear answer in the docs, maybe someone already tested this?
This is an older presentation, but still relevant for these kinds of questions, and discusses some of the tradeoffs.
http://www.10gen.com/presentations/mongosf2011/schemascale
TL;DR(W) - it depends on how many updates "very big" means, and how you're accessing them. If you always need the full set at once and it fits under the 16MB document limit, you can embed; if you generally need only a few at a time, you can link. There's also a hybrid approach, which is to embed the recent updates and link the rest.
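For illustration only, here's a rough pymongo sketch of that hybrid idea; the collection and field names (users, status_updates, recent_updates) are my own, not from the presentation.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient()
db = client.example_db

# Assumes the user document already exists
user_id = db.users.find_one({"name": "alice"})["_id"]
update = {"text": "hello", "created_at": datetime.now(timezone.utc)}

# Full history lives in its own collection, linked back by user_id
db.status_updates.insert_one(dict(update, user_id=user_id))

# Keep only the 10 most recent updates embedded in the user document
db.users.update_one(
    {"_id": user_id},
    {"$push": {"recent_updates": {"$each": [update], "$slice": -10}}},
)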
I have a nested data structure defined with protocol-buffer messages, and a service that receives these messages. On the server side, I need to store these messages and be able to search for messages that have certain values in different fields, or to find the message(s) referenced in another one.
From what I've researched, having a database that can store these messages (directly or via JSON) and allow querying them seems like a good way to do it.
I searched for what kind of database would support this effectively, but without much success.
One way I found was based on MongoDB: setting up a mirror schema, converting the messages to JSON, and storing them in MongoDB.
I also found ProfaneDB, whose stated problem is very much like what I need. However, it seems to have been dormant for the last 3-4 years, and I'm not sure how stable/scalable it is, or whether there are more recent or more popular solutions.
I suspect there are better solutions for this use case. I'd appreciate advice on a good way to do this.
I think you should discard the binary protobuf messages as soon as you've unmarshaled them on your server, unless you have a legal requirement to retain the transmitted message as-is. The protobuf format is optimized for network transmission (on-the-wire), not searching.
Once you have the message in your preferred language's native structs/types, most databases will be able to store the data. Your focus would then need to be on how you wish to access the data, what levels of reliability, availability, consistency, etc. you need, and how much you want to pay.
One important requirement is whether you want to have structured queries against your data or whether you want free-form (arbitrary|text) searches. For the former, you may consider SQL and NoSQL databases. For the latter, something like Elasticsearch.
There are so many excellent, well-supported, cloud-based (if you want it) databases that can meet your needs, that you should disregard any that aren't popular unless you have very specific needs that are only addressed by a niche solution.
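If the MongoDB-plus-JSON route mentioned in the question is the one you take, the conversion step is small. A rough sketch, assuming a generated message class and made-up collection/field names:
from google.protobuf.json_format import MessageToDict
from pymongo import MongoClient
from my_proto_pb2 import StatusUpdate   # placeholder for your generated message class

client = MongoClient()
collection = client.example_db.messages

msg = StatusUpdate()
msg.ParseFromString(raw_bytes)   # raw_bytes: the serialized payload your service received

# Convert to a plain dict so MongoDB stores the fields individually
doc = MessageToDict(msg, preserving_proto_field_name=True)
collection.insert_one(doc)

# Individual fields of the original message are now directly queryable:
collection.find_one({"user_id": doc.get("user_id")})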
db.test.find_one(ObjectId('4f3dd96d1453373bcb000000'))
or something else entirely? I know that the _id column is indexed automatically and am hoping to capitalize on that efficiency.
Thanks!
Yes, your approach is correct.
Since you're asking about efficiency, remember that when you're optimizing read operations for performance, you may want to read only the attributes that you need. If certain attributes of your documents are large, then this can reduce the IO costs (transferring data from server to client) dramatically. For example, if your document has 20 attributes, but you're only using 5 of them, then don't pull the other 15 over the wire. In pymongo, you can do this using the optional fields parameter of the collection.find function. Obviously you need to balance performance vs code maintainability here, since listing attributes increases maintenance costs.
More optimization suggestions are available in the official docs. Their list includes "Optimization #3: Select only relevant fields" which is just the point that I made above.
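For example, a projection with a recent pymongo might look like this (field names are invented; newer versions call the parameter projection rather than fields):
from bson.objectid import ObjectId
from pymongo import MongoClient

collection = MongoClient().example_db.test

# Fetch only the attributes you actually use
doc = collection.find_one(
    ObjectId("4f3dd96d1453373bcb000000"),
    projection={"title": 1, "status": 1, "updated_at": 1},
)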
If you're getting a value specifically by the _id, then I would say yes this is the most efficient approach.
Depending on your data, it may be more efficient to index that value and search on it.
If you know the _id, then you should call it exactly that way: db.test.find_one(ObjectId('4f3dd96d1453373bcb000000'))
Your full code in pymongo may look like this:
from pymongo import MongoClient
from bson.objectid import ObjectId

# MongoClient replaces the deprecated Connection class; credentials go in the constructor
connection = MongoClient(self.host, self.port,
                         username=self.user_name, password=self.password)
db = connection[self.db_name]
collection = db[self.question_collection]

obj_id = ObjectId(_id)
info = collection.find_one(obj_id)
I have some things that do not need to be indexed or searched (game configurations), so I was thinking of storing JSON in a BLOB. Is this a good idea at all? Or are there alternatives?
If you need to query based on the values within the JSON, it would be better to store the values separately.
If you are just loading a set of configurations like you say you are doing, storing the JSON directly in the database works great and is a very easy solution.
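As a minimal sketch of that (my own table and field names), storing a configuration dict as a JSON blob in a single TEXT column and reading it back could look like:
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game_config (name TEXT PRIMARY KEY, config TEXT)")

config = {"difficulty": "hard", "map": "arena_2", "max_players": 8}
conn.execute(
    "INSERT INTO game_config (name, config) VALUES (?, ?)",
    ("default", json.dumps(config)),
)

# Load it back and parse; the database never needs to understand the JSON
row = conn.execute(
    "SELECT config FROM game_config WHERE name = ?", ("default",)
).fetchone()
loaded = json.loads(row[0])
print(loaded["difficulty"])  # hard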
No different than people storing XML snippets in a database (that doesn't have XML support). Don't see any harm in it, if it really doesn't need to be searched at the DB level. And the great thing about JSON is how parseable it is.
I don't see why not. As a related real-world example, WordPress stores serialized PHP arrays as a single value in many instances.
I think it's better to serialize your data. If you are using Python, cPickle is a good choice.
I'm creating a Django-powered site for my newspaper-ish project. The least obvious, least common-sense task I have come across in putting the site together is how best to generate a "top articles" list for the sidebar of the page.
The first thing that came to mind was some sort of database column that is updated (based on what?) with every view. That seems (to my instincts) ridiculously database-intensive and impractical, so I'd like to find another solution.
Thanks all.
I would give celery a try (with django-celery). While it's not as easy to configure and use as the cache, it enables you to queue tasks like incrementing counters and run them in the background. It can even be combined with the cache technique: in views, increment counters in the cache, and define a PeriodicTask that runs every now and then, resetting the counters and writing them to the database.
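A rough sketch of that cache-plus-periodic-task combination, assuming Celery and Django's cache are already configured and using a hypothetical Article model:
from celery import shared_task
from django.core.cache import cache
from django.db.models import F
from myapp.models import Article   # hypothetical model with a view_count field

def record_view(article_id):
    # views.py side: cheap per-view increment in the cache
    key = "article_views:%s" % article_id
    cache.add(key, 0)   # create the counter only if it doesn't exist yet
    cache.incr(key)

@shared_task
def flush_view_counts(article_ids):
    # schedule this with celery beat (or django-celery's PeriodicTask)
    for article_id in article_ids:
        key = "article_views:%s" % article_id
        count = cache.get(key) or 0
        if count:
            Article.objects.filter(pk=article_id).update(
                view_count=F("view_count") + count)
            cache.set(key, 0)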
I just remembered - I once found this blog entry which provides a nice way of incrementing a 'viewed_count' (or similar) column in the database with an AJAX JS call. If you don't have heavy traffic, maybe it's a good idea?
Also mentioned in this post is django-tracking, but I don't know much about it; I've never used it myself (yet).
Premature optimization: first try the DB way and then see if it really is too database-intensive. Any decent database has good enough caching that it probably won't matter very much. And even if it is a problem, take a look at the other db/cache suggestions here.
Most likely, by the way, each page view will already involve more intensive DB queries than a simple view-count update.
If you do something like sort by top views, it would be fast if you index the view column in the DB. Another option is to only collect the top x articles every hour or so, and toss that value into Django's cache framework.
The nice thing about caching the list is that the algorithm you use to determine top articles can be as complex as you like without hitting the DB hard with every page view. Django's cache framework can use memory, db, or file system. I prefer DB, but many others prefer memory. I believe it uses pickle, so you can also store Python objects directly. It's easy to use, recommended.
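For instance, caching the computed list might be as small as this (hypothetical Article model and key name):
from django.core.cache import cache
from myapp.models import Article   # hypothetical model

def top_articles():
    articles = cache.get("top_articles")
    if articles is None:
        # The ranking query can be as expensive as you like; it only runs on a cache miss
        articles = list(Article.objects.order_by("-view_count")[:10])
        cache.set("top_articles", articles, 60 * 60)   # keep for an hour
    return articles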
An index wouldn't help, as the main problem, I believe, is not so much getting the sorted list as having a DB write with every page view of an article. Another index actually makes that problem worse, albeit only a little.
So I'd go with the cache. I think Django's cache shim is a problem here because it requires timeouts on all keys. I'm not sure if that's imposed by memcached; if not, then go with Redis. Actually, just go with Redis anyway: the Python library is great, I've used it from Django projects before, and it has atomic increments and powerful sorting - everything you need.
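A small sketch of the Redis route with redis-py, using a sorted set (the key name is my own):
import redis

r = redis.Redis()

def record_view(article_id):
    # Atomic increment of this article's score in a sorted set
    r.zincrby("article_views", 1, article_id)

def top_article_ids(n=10):
    # Highest-scoring article ids first; no writes to the relational DB per view
    return r.zrevrange("article_views", 0, n - 1)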
I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
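A sketch of that timed-job variant, with hypothetical Item and SavedSearch models and an unspecified notify() hook:
from datetime import datetime, timezone
from myapp.models import Item, SavedSearch   # hypothetical models

def run_saved_searches(last_run):
    # Everything modified since the previous run
    new_items = Item.objects.filter(modified__gte=last_run)
    for search in SavedSearch.objects.all():
        # as_q() would rebuild the stored (e.g. pickled) Q object for this search
        for item in new_items.filter(search.as_q()):
            notify(search.owner, item)   # however you deliver the alert
    return datetime.now(timezone.utc)    # becomes last_run for the next invocation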
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", and that built a list of possibly interesting searches to run; then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
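To make the idea concrete, here is my own minimal sketch of indexing stored queries by their required terms (not the original system):
from collections import defaultdict

stored_queries = {
    "q1": {"must": {"mongodb", "index"}},
    "q2": {"must": {"django", "cache"}},
}

# Inverted index: term -> set of query ids that require it
term_to_queries = defaultdict(set)
for qid, q in stored_queries.items():
    for term in q["must"]:
        term_to_queries[term].add(qid)

def candidate_queries(doc_terms):
    # Collect every query that mentions any of the document's terms...
    hits = set()
    for term in doc_terms:
        hits |= term_to_queries.get(term, set())
    # ...then keep only those whose required terms all appear in the document
    return [qid for qid in hits if stored_queries[qid]["must"] <= doc_terms]

# Only these few queries would then be run as full searches
print(candidate_queries({"mongodb", "index", "performance"}))  # ['q1']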
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at the last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
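A minimal sketch of that signal-based approach, with a hypothetical SavedSearch model and notification hook:
from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_save
from django.dispatch import receiver
from myapp.models import SavedSearch   # hypothetical model with a content_type field and as_q()

@receiver(post_save)
def run_matching_searches(sender, instance, created, **kwargs):
    if not created:
        return
    ct = ContentType.objects.get_for_model(sender)
    # Only the searches that target this model type are loaded and run
    for search in SavedSearch.objects.filter(content_type=ct):
        if sender.objects.filter(search.as_q(), pk=instance.pk).exists():
            notify_owner(search, instance)   # hypothetical notification hook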