Is it a good idea to store copies of documents from a mongodb collection in a dictionary list, and use this data instead of querying the database? - python

I am currently developing a Python Discord bot that uses a Mongo database to store user data.
As this data is continually changed, the database would be subjected to a massive number of queries to both extract and update the data; so I'm trying to find ways to minimize client-server communication and reduce bot response times.
In this sense, is it a good idea to create a copy of a Mongo collection as a dictionary list as soon as the script is run, and manipulate the data offline instead of continually querying the database?
In particular, every time a piece of data would otherwise be fetched with the collection.find() method, it is instead extracted from the list. On the other hand, every time a piece of data needs to be updated with collection.update(), both the list and the database are updated.
I'll give an example to better explain what I'm trying to do. Let's say that my collection contains documents with the following structure:
{"user_id": id_of_the_user, "experience": current_amount_of_experience}
and the experience value must be continually increased.
Here's how I'm implementing it at the moment:
online_collection = db["collection_name"]  # the MongoDB collection object
offline_collection = list(online_collection.find())  # an in-memory copy of the collection

def updateExperience(user_id):
    online_collection.update_one({"user_id": user_id}, {"$inc": {"experience": 1}})
    mydocument = next(document for document in offline_collection if document["user_id"] == user_id)
    mydocument["experience"] += 1

def findExperience(user_id):
    mydocument = next(document for document in offline_collection if document["user_id"] == user_id)
    return mydocument["experience"]
As you can see, the database is involved only for the update function.
Is this a valid approach?
For very large collections (millions of documents), does the next() call keep the same execution time, or would the linear scan through the list cause slowdowns?
Also, while not explicitly asked in the question, I'd be more than happy to get any advice on how to improve the performance of a Discord bot, as long as it doesn't involve using a VPS or sharding, since I'm already using these options.

I don't really see why not - as long as you're aware of the following:
You will need the system resources to load an entire database into memory
It is your responsibility to sync the actual db and your local store
You do need to be the only person/system updating the database
Eventually this pattern will fail, e.g. the db gets too large or more than one process needs to update, so it isn't future-proof.
In essence you're talking about a caching solution - so no need to reinvent the wheel; there are many such products/solutions you could use.
It's probably not the traditional way of doing things, but if it works then why not
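If the per-user lookup ever becomes the bottleneck (the next() call in the question scans the list linearly), a dictionary keyed by user_id gives O(1) lookups instead. A minimal sketch, assuming pymongo and the document structure from the question; the connection string and names are placeholders:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
collection = client["my_db"]["collection_name"]    # placeholder db/collection names

# Build the cache once at startup, keyed by user_id for O(1) lookups.
cache = {doc["user_id"]: doc for doc in collection.find()}

def find_experience(user_id):
    return cache[user_id]["experience"]

def update_experience(user_id):
    # Write through to MongoDB and keep the in-memory copy in sync.
    collection.update_one({"user_id": user_id}, {"$inc": {"experience": 1}})
    cache[user_id]["experience"] += 1

The write-through keeps the database authoritative, so a restart simply rebuilds the cache from Mongo.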

Related

Is there a good way to store a boolean array into a file or database in python?

I am building an image mosaic that detects whether the user's selected areas are taken or not.
My idea is to store the available_spots in a list, and I would just have to look through the list to check whether a spot is available or not.
The problem is that when I reload the website, available_spots also gets reset to a blank list,
so I want to store this array somewhere that is fast to read and write to.
I am currently thinking about a text file to store this, but that might take forever to read since the array length is over 1.4 million. Are there any other solutions that might be better?
You can't store the data in a file for a few reasons: (1) GAE standard won't let you, (2) the data is lost when your server is restarted, and (3) different instances will have different data.
Of course you can and should store the data in a database of your choice. Firestore is likely a better and cheaper option than SQL. It should be fast enough for you and you can implement caching if needed.
You might be able to store the data in a single Firestore entity and consider using compression if you are getting close to the max entity size.
If you want to store it in a database, you can use the "sqlite3" module.
It is a simple database that is stored in a single file, so you don't have to install a database server. It's great for small projects.
If you want to do more complex things with databases, you can use "sqlalchemy".
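A minimal sketch of that idea for this use case, using only the standard library; the file, table and column names are made up for illustration, with one row per taken spot:

import sqlite3

conn = sqlite3.connect("spots.db")  # the whole database lives in this one file
conn.execute(
    "CREATE TABLE IF NOT EXISTS spots (id INTEGER PRIMARY KEY, taken INTEGER NOT NULL DEFAULT 0)"
)
conn.commit()

def is_available(spot_id):
    row = conn.execute("SELECT taken FROM spots WHERE id = ?", (spot_id,)).fetchone()
    return row is None or row[0] == 0  # no row yet means nobody has taken the spot

def take_spot(spot_id):
    conn.execute("INSERT OR REPLACE INTO spots (id, taken) VALUES (?, 1)", (spot_id,))
    conn.commit()

Lookups go through the primary key index, so they stay fast even with over a million spots, and the data survives a page reload or server restart.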

Basic friend timeline algorithm?

I'm sure a lot of services online today must perform a task similar to what I'm doing. A user has friends, and I want to get all status updates of all the user's friends posted after each friend's last status update date.
That was a mouthful, but here's what I have:
A user has, say, 10 friends. What I want to do is get new status updates for all his friends. So, I prepare a list of dictionaries, one per friend, with each friend's last status date. Something like:
friends = []
for friend in user:
    friends.append({
        'userId': friend.id,
        'lastDate': friend.mostRecentStatusUpdate.date,
    })
Then, on my server side, I do something like this:
for friend in friends:
    userId = friend['userId']
    lastDate = friend['lastDate']
    # Each get below launches an RPC and does a separate table lookup,
    # so if I have 100 friends this seems extremely inefficient.
    get statusUpdates for userId where postDate > lastDate
The problem with the above approach is that on the server side each iteration of the for loop launches a new query, which launches an RPC. So if there are a lot of friends, it would seem to be really inefficient.
Is there a better way to design my data structures to make this task more efficient? How does, say, Twitter do something like this when it fetches new timeline updates?
From the high level, I'd suggest you follow the prescribed app-engine mantra - make writes expensive to make reads cheap.
For each friend, you should keep a collection of known friends and their last status updates. This will allow you to update friends at write time. This is expensive for the write, but saves you processing and querying at read. This also assumes that you read more than you write.
Additionally, if you are just trying to display the N latest updates for each friend, I would suggest you use an NDB StructuredProperty to store the Friend objects - this way you can create a matching data structure. As part of the object, create a collection of keys that correspond to the status updates. When a status update is written, add it to the collection, and potentially remove older entries (if space is a concern).
This way when you need to retrieve the updates, you are getting them by key, instead of a more expensive query types.
An alternative to this that avoids any additional queries, is to keep the entire update instead of just keys. However, this will be a lot bigger for storage - 10 friends all interconnected, means 100 versions of the same update.
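Here is a minimal sketch of the fan-out-on-write idea, assuming the NDB datastore API; the model and function names are illustrative rather than part of the original answer:

from google.appengine.ext import ndb

class StatusUpdate(ndb.Model):
    author_id = ndb.StringProperty()
    body = ndb.TextProperty()
    post_date = ndb.DateTimeProperty(auto_now_add=True)

class Timeline(ndb.Model):
    # One entity per user, keyed by user id; holds keys of recent friend updates.
    update_keys = ndb.KeyProperty(kind=StatusUpdate, repeated=True)

def fan_out(update_key, friend_ids, max_entries=50):
    # Called at write time: push the new update onto each friend's timeline.
    for friend_id in friend_ids:
        timeline = Timeline.get_or_insert(friend_id)
        timeline.update_keys = [update_key] + timeline.update_keys[:max_entries - 1]
        timeline.put()

def read_timeline(user_id):
    # Read time is just a batch get by key - no query at all.
    timeline = Timeline.get_by_id(user_id)
    return ndb.get_multi(timeline.update_keys) if timeline else []

The write fans out to every friend, which is the expensive part; reads become cheap key fetches, matching the "expensive writes, cheap reads" advice above.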

How do I transform every doc in a large Mongodb collection without map/reduce?

Apologies for the longish description.
I want to run a transform on every doc in a largish MongoDB collection: roughly 10 million records, about 10 GB. Specifically I want to apply a geoip transform to the ip field in every doc and either append the result record to that doc or just create a whole other record linked to this one by, say, id (the linking is not critical, I can just create a whole separate record). Then I want to count and group by, say, city - (I do know how to do the last part).
The major reason I believe I cant use map-reduce is I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via limit/skip is out of the question, as it does a "table scan" and gets progressively slower.
Any suggestions?
Python or JS preferred, just because I have these geoip libs, but code examples in other languages are welcome.
Since you have to go over "each record", you'll do one full table scan anyway, so a simple cursor (find()), fetching only the few fields you need (_id, ip), should do it. The Python driver will do the batching under the hood, so you can give it a hint about the optimal batch size (batch_size) if the default is not good enough.
If you add a new field and it doesn't fit the previously allocated space, Mongo will have to move the document to another place, so you might be better off creating new documents.
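A sketch of that cursor approach with pymongo; the geoip lookup is stubbed out since the exact library wasn't specified, and the database/collection names are made up:

from pymongo import MongoClient

client = MongoClient()
source = client["mydb"]["records"]        # assumed collection with an "ip" field
target = client["mydb"]["geoip_results"]  # new documents, linked back by source_id

def geoip_lookup(ip):
    # Placeholder: plug in whichever geoip library you already have.
    return {"city": None, "country": None}

batch = []
# Project only the fields we need and let the driver batch the cursor for us.
for doc in source.find({}, {"_id": 1, "ip": 1}).batch_size(1000):
    result = geoip_lookup(doc["ip"])
    batch.append({"source_id": doc["_id"], "ip": doc["ip"], **result})
    if len(batch) >= 1000:
        target.insert_many(batch)
        batch = []
if batch:
    target.insert_many(batch)

Writing the results as new documents sidesteps the document-moving issue mentioned above, and the count/group-by-city step can then run against the new collection.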
Actually I am also attempting another approach in parallel (as plan B) which is to use mongoexport. I use it with --csv to dump a large csv file with just the (id, ip) fields. Then the plan is to use a python script to do a geoip lookup and then post back to mongo as a new doc on which map-reduce can now be run for count etc. Not sure if this is faster or the cursor is. We'll see.

Django ORM: Organizing massive amounts of data, the right way

I have a Django app that uses django-piston to send out XML feeds to internal clients. Generally these work pretty well, but we have some XML feeds that currently take over 15 minutes to generate. This causes timeouts, and the feeds become unreliable.
I'm trying to ponder ways that I can improve this setup. If it requires some re-structuring of the data, that could be possible too.
Here is how the data collection currently looks:
class Data(models.Model):
    pass  # fields omitted

class MetadataItem(models.Model):
    data = models.ForeignKey(Data)

# handlers.py
data = Data.objects.filter(**kwargs)
for d in data:
    for metaitem in d.metadataitem_set.all():
        # There are usually anywhere between 55 - 95 entries in this loop
        label = metaitem.get_label()  # does some formatting here
        data_metadata[label] = metaitem.body
Obviously, the core of the program is doing much more, but I'm just pointing out where the problem lies. When we have a data list of 300 it just becomes unreliable and times out.
What I've tried:
Getting a collection of all the Data ids, then doing a single large query to get all the MetadataItems, and finally filtering those in my loop. This was meant to save some queries, and it did reduce the count.
Using .values() to reduce model instance overhead, which did speed it up but not by much.
One simpler idea is to write to a cache in steps to avoid the timeout: write the first 50 data sets, save to the cache, adjust some counter, write the next 50, and so on. I still need to ponder this.
Hoping someone can help lead me into the right direction with this.
The problem in the piece of code you posted is that Django doesn't include objects that are connected through a reverse relationship automatically, so you have to make a query for each object. There's a nice way around this, as Daniel Roseman points out in his blog!
If this doesn't solve your problem well, you could also have a look at trying to get everything in one raw SQL query...
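If you are on a Django version that has it (1.4 or later), prefetch_related covers exactly this reverse-relation case. A minimal sketch, assuming the models from the question:

# Two queries in total: one for Data, one for all the related MetadataItems.
data = Data.objects.filter(**kwargs).prefetch_related('metadataitem_set')

for d in data:
    data_metadata = {}
    # Served from the prefetch cache - no extra query per Data object.
    for metaitem in d.metadataitem_set.all():
        data_metadata[metaitem.get_label()] = metaitem.body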
You could maybe further reduce the query count by first getting all the Data ids and then using select_related to get each Data object and its metadata in a single bigger query. This would greatly reduce the number of queries, but the size of the queries might be impractical/too big. Something like:
data_ids = Data.objects.filter(**kwargs).values_list('id', flat=True)
for i in data_ids:
    data = Data.objects.select_related().get(pk=i)
    # data.metadataitem_set.all() can now be called without querying the database
    for metaitem in data.metadataitem_set.all():
        # ...
However, I would suggest, if possible, precomputing the feeds somewhere outside the webserver. Maybe you could store the result in memcache if it's smaller than 1 MB. Or you could be one of the cool new kids on the block and store the result in a "NoSQL" database like Redis. Or you could just write it to a file on disk.
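For the memcache route, Django's cache framework is enough; a small sketch, assuming a cache backend is already configured in settings and that the serialized feed stays under the 1 MB memcached limit (build_feed_xml is a hypothetical stand-in for the existing feed generation):

from django.core.cache import cache

FEED_CACHE_TIMEOUT = 60 * 15  # seconds; tune to how fresh the feed must be

def get_feed(feed_key):
    xml = cache.get(feed_key)
    if xml is None:
        xml = build_feed_xml(feed_key)  # hypothetical: the existing (slow) feed generation
        cache.set(feed_key, xml, FEED_CACHE_TIMEOUT)
    return xml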
If you can change the structure of the data, maybe you can also change the datastore?
The "NoSQL" databases which allow some structure, like CouchDB or MongoDB could actually be useful here.
Let's say for every Data item you have a document. The document would have your normal fields. You would also add a 'metadata' field, which is a list of metadata entries. What about the following data structure:
{
    'id': 'someid',
    'field': 'value',
    'metadata': [
        { 'key': 'value' },
        { 'key': 'value' }
    ]
}
You would then be able to easily get to a data record and fetch all its metadata in one go. For searching, add indexes to the fields in the 'data' document.
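A quick sketch of what that looks like with pymongo (the database and collection names are assumed):

from pymongo import MongoClient

data = MongoClient()["mydb"]["data"]  # assumed database/collection names

# One round trip fetches the record together with all of its metadata.
doc = data.find_one({"id": "someid"})
metadata = doc["metadata"] if doc else []

# Index the embedded metadata so searches on it stay fast.
data.create_index("metadata.key")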
I've worked on a system in Erlang/OTP that used Mnesia which is basically a key-value database with some indexing and helpers. We used nested records heavily to great success.
I added this as a separate answer as it's totally different from the other one.
Another idea is to use Celery (www.celeryproject.com), a task queue for Python and Django. You can use it to perform long-running tasks asynchronously without holding up your main app server.
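A minimal sketch of that idea, assuming a Redis broker and reusing the cache-based helper sketched earlier; the names are illustrative:

from celery import Celery
from django.core.cache import cache

app = Celery("feeds", broker="redis://localhost:6379/0")  # assumed broker URL

@app.task
def rebuild_feed(feed_key):
    # Runs in a worker process, so no web request ever waits 15 minutes.
    xml = build_feed_xml(feed_key)  # hypothetical: the existing feed generation
    cache.set(feed_key, xml, 60 * 15)

The view then only reads the cached value and, when it is missing or stale, kicks off rebuild_feed.delay(feed_key) and returns whatever it has.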

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
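As a concrete illustration of the queue idea, here is a sketch using pika (a Python AMQP client) to publish a "new object" event that a separate worker can consume and match against the saved searches; the queue name is made up:

import json
import pika

# Publisher side: called from the code that writes new objects to the database.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="new_objects", durable=True)

def publish_new_object(obj_type, obj_id):
    channel.basic_publish(
        exchange="",
        routing_key="new_objects",
        body=json.dumps({"type": obj_type, "id": obj_id}),
    )

A consumer process pops these messages, loads the relevant saved searches, and sends the notifications, so the matching work never blocks the write path.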
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", and that built a list of possibly interesting searches to run; then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries - often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That will probably still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it is a straightforward Django approach.
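A minimal sketch of that signal-based approach, assuming a hypothetical SavedSearch model with a content_type field and a pickled Q object (as in the djangozen post linked in the question):

import pickle

from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import SavedSearch  # hypothetical model: owner, content_type, pickled_q

@receiver(post_save)
def run_saved_searches(sender, instance, created, **kwargs):
    if not created:
        return
    content_type = ContentType.objects.get_for_model(sender)
    # Only the searches registered for this model type are loaded and run.
    for search in SavedSearch.objects.filter(content_type=content_type):
        q = pickle.loads(search.pickled_q)
        if sender.objects.filter(q, pk=instance.pk).exists():
            notify_owner(search, instance)  # hypothetical notification hook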
