Which data load method is best for performance? - python

For example, I have a user object stored in a database (Redis).
It has several fields:
String nick
String password
String email
List posts
List comments
Set followers
and so on...
In my Python program I have a class (User) with the same fields for this object. Instances of this class map to objects in the database. The question is how to get the data from the DB for the best performance:
1. Load the values for all fields when the instance is created and initialize the fields with them.
2. Load a field's value each time that field is requested.
3. Same as the second option, but after the value is loaded, replace the field property with the loaded value.
p.s. Redis runs on localhost

The method entirely depends on the requirements.
If there is only one client reading and modifying the properties, this is a rather simple problem. When modifying data, just change the instance attributes in your current Python program and -- at the same time -- keep the DB in sync while keeping your program responsive. To that end, you should outsource blocking calls to another thread or make use of greenlets. If there is only one client, there definitely is no need to fetch a property from the DB on each value lookup.
If there are multiple clients reading the data and only one client modifying the data, you have to think about which level of synchronization you need. If you need 100% synchronization, you will have to fetch data from the DB on each value lookup.
If there are multiple clients changing the data in the database, you had better look into a rock-solid industry-standard solution rather than writing your own DB cache/mapper.
Your distinction between (2) and (3) does not really make sense. If you fetch data on every lookup, there is no need to 'store' data. You see, if there can be multiple clients involved, these things quickly become quite complex and it's really hard to get it right.
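To make the options concrete, here is a minimal sketch of (1) and (3) with redis-py, assuming the user is stored as a hash plus separate list/set keys; all key names and fields are illustrative, not from the question:

import redis
from functools import cached_property

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

class EagerUser:
    """Option 1: load everything once, when the instance is created."""
    def __init__(self, user_id):
        data = r.hgetall(f"user:{user_id}")  # nick, password, email
        self.nick = data.get("nick")
        self.email = data.get("email")
        self.posts = r.lrange(f"user:{user_id}:posts", 0, -1)
        self.followers = r.smembers(f"user:{user_id}:followers")

class LazyUser:
    """Option 3: fetch on first access, then keep the value on the instance."""
    def __init__(self, user_id):
        self.user_id = user_id

    @cached_property
    def nick(self):
        return r.hget(f"user:{self.user_id}", "nick")

    @cached_property
    def followers(self):
        return r.smembers(f"user:{self.user_id}:followers")

Option (2) would simply be LazyUser without the caching, i.e. a plain property that calls Redis on every attribute access.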

Related

Is it a good idea to store copies of documents from a MongoDB collection in a list of dictionaries, and use this data instead of querying the database?

I am currently developing a Python Discord bot that uses a Mongo database to store user data.
As this data is continually changed, the database would be subjected to a massive number of queries to both extract and update the data, so I'm trying to find ways to minimize client-server communication and reduce bot response times.
In this sense, is it a good idea to create a copy of a Mongo collection as a list of dictionaries as soon as the script is run, and manipulate the data offline instead of continually querying the database?
In particular, every time data would be searched with the collection.find() method, it is instead extracted from the list. On the other hand, every time data needs to be updated with collection.update(), both the list and the database are updated.
I'll give an example to better explain what I'm trying to do. Let's say that my collection contains documents with the following structure:
{"user_id": id_of_the_user, "experience": current_amount_of_experience}
and the experience value must be continually increased.
Here's how I'm implementing it at the moment:
online_collection = db["collection_name"]  # MongoDB collection handle
offline_collection = list(online_collection.find())  # in-memory copy of the collection

def updateExperience(user_id):
    # writes go to the database...
    online_collection.update_one({"user_id": user_id}, {"$inc": {"experience": 1}})
    # ...and to the in-memory copy, to keep the two in sync
    mydocument = next(document for document in offline_collection if document["user_id"] == user_id)
    mydocument["experience"] += 1

def findExperience(user_id):
    # reads never touch the database
    mydocument = next(document for document in offline_collection if document["user_id"] == user_id)
    return mydocument["experience"]
As you can see, the database is involved only for the update function.
Is this a valid approach?
For very large collections (millions of documents), does the next() call keep the same execution times, or would there still be some slowdowns?
Also, while not explicitly asked in the question, I'd be more than happy to get any advice on how to improve the performance of a Discord bot, as long as it doesn't include using a VPS or sharding, since I'm already using these options.
I don't really see why not - as long as you're aware of the following:
You will need the system resources to load an entire database into memory
It is your responsibility to sync the actual db and your local store
You do need to be the only person/system updating the database
Eventually this pattern will fail, e.g. the db gets too large or more than one process needs to update it, so it isn't future-proof.
In essence you're talking about a caching solution - so no need to reinvent the wheel - there are many such products/solutions you could use.
It's probably not the traditional way of doing things, but if it works then why not.
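One point worth adding about the next() question: scanning a plain list is linear, so with millions of documents every lookup walks the list. A hedged sketch of indexing the offline copy by user_id instead (same data as the snippet above, but O(1) lookups):

offline_by_id = {document["user_id"]: document for document in offline_collection}

def findExperience(user_id):
    # dictionary lookup instead of scanning the whole list
    return offline_by_id[user_id]["experience"]

def updateExperience(user_id):
    online_collection.update_one({"user_id": user_id}, {"$inc": {"experience": 1}})
    offline_by_id[user_id]["experience"] += 1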

Is there a way to error on related queries in Django ORM?

I have a Django model backed by a very large table (Log) containing millions of rows. This model has a foreign key reference to a much smaller table (Host). Example models:
class Host(Model):
    name = CharField()

class Log(Model):
    value = CharField()
    host = ForeignKey(Host)
In reality there are many more fields and also more foreign keys similar to Log.host.
There is an iter_logs() function that efficiently iterates over Log records using a paginated query scheme. Other places in the program use iter_logs() to process large volumes of Log records, doing things like dumping to a file, etc.
For efficient operation, any code that uses iter_logs() should only access fields like value. But problems arise when someone innocently accesses log.host. In this case Django will issue a separate query each time a new Log record's host is accessed, killing the performance of the efficient paginated query in iter_logs().
I know I can use select_related to efficiently fetch the related host records, but all known uses of iter_logs() should not need this, so it would be wasteful. If a use case for accessing log.host did arise in this context I would want to add a parameter to iter_logs() to optionally use .select_related("host"), but that has not become necessary yet.
I am looking for a way to tell the underlying query logic in Django to never perform additional database queries except those explicitly allowed in iter_logs(). If such a query becomes necessary it should raise an error instead. Is there a way to do that with Django?
One solution I'd prefer to avoid: wrap or otherwise modify objects yielded by iter_logs() to prevent access to foreign keys.
More generally, Django's deferred query logic breaks encapsulation of code that constructs queries. Dependent code must know about the implementation or risk imposing major inefficiencies. This is usually fine at small scale where a little inefficiency does not matter, but becomes a real problem at larger scale. An early error would be much better because it would be easy to detect in small-scale tests rather than deferring the problem to production run time where it manifests as general slowness.
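No answer is recorded here, but one possible direction, sketched under assumptions: Django 2.0+ exposes the documented connection.execute_wrapper() hook, which can be used to fail loudly on any query issued while records are being consumed. The helper names and the way iter_logs() pages through the table below are illustrative, not from the original question:

from contextlib import contextmanager
from django.db import connection

class UnexpectedQueryError(Exception):
    pass

def _block_queries(execute, sql, params, many, context):
    # called for every query issued while the wrapper is installed
    raise UnexpectedQueryError("unexpected query during log iteration: %s" % sql)

@contextmanager
def no_extra_queries():
    with connection.execute_wrapper(_block_queries):
        yield

def iter_logs(page_size=1000):
    last_pk = 0
    while True:
        # the allowed paginated query runs outside the guard
        page = list(Log.objects.filter(pk__gt=last_pk).order_by("pk")[:page_size])
        if not page:
            return
        last_pk = page[-1].pk
        # any stray query (e.g. a lazy log.host access) raises while the
        # caller consumes this page
        with no_extra_queries():
            yield from page

Note that the guard stays active while the caller processes each yielded record, so it blocks all queries during that window, which may be stricter than wanted; code that legitimately needs log.host would then have to opt in via an explicit select_related("host") parameter, as described above.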

Using database as a key prefix in redis

I'm evaluating using redis to store some session values. When constructing the redis client (we will be using this python one) I get to pass in the db to use. Is it appropriate to use the DB as a sort of prefix for my keys? E.g. store all session keys in db 0 and some messages in db 1 and so on? Or should I keep all my applications keys in the same db?
Quoting my answer from this question:
It depends on your use case, but my rule of thumb is: if you have a very large quantity of related data keys that are unrelated to all the rest of your data in Redis, put them in a new database. Reasons being:
You may need to (non-ideally) use the keys command to get all of that data at some point, and having the data segregated makes that much cheaper.
You may want to switch to a second redis server later, and having related data pre-segregated makes this much easier.
You can keep your databases named somewhere, so it's easier for you, or a new employee, to figure out where to look for particular data.
Conversely, if your data is related to other data, they should always live in the same database, so you can easily write pipelines and lua scripts that can access both.
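For reference, a minimal sketch of what the per-database split looks like with the Python redis client; the key names and db assignments are just examples:

import redis

# one client per logical database; the db number acts as the "prefix"
sessions = redis.Redis(host="localhost", port=6379, db=0)
messages = redis.Redis(host="localhost", port=6379, db=1)

sessions.set("session:abc123", "user_42")
messages.lpush("inbox:user_42", "hello")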

Django - Correct way to load a large default value in a model

I have a field called schema in a django model that usually contains a rather large json string. There is a default value (around 2000 characters) that I would like added when any new instance is created.
I find it rather unclean to dump the whole thing into models.py as a variable. What is the best way to load this default value into my schema field (which is a TextField)?
Example:
class LevelSchema(models.Model):
    user = models.ForeignKey(to=User, on_delete=models.CASCADE)
    active = models.BooleanField(default=False)
    schema = models.TextField()  # Need a default value for this
I thought about this a bit. If I am using a json file to store the default value somewhere, what is the best place to put it? Obviously it is preferable if it is in the same app in a folder.
The text is rather massive - it would span half the file, as it is formatted JSON which I would like to keep editable in the future. I prefer loading it from a file (like fixtures); I just want to know if there is a method already present in Django.
In Django, you have two options:
Listen on post_save, and when created is true, set the default value of the object by reading the file (see the sketch after this answer).
Set the default to a callable (a function), and in that method read the file (make sure you close it after), and return its contents.
You can also stick the data in some k/v store (like redis or memcache) for faster access. It would also be better since you won't be constantly opening and closing files.
Finally, the most restrictive option would be to set up a trigger on the database that does the populating for you. You would have to store the json in the database somewhere. Added benefit to this approach is that you can write a django front end to update the json. Downside is it will restrict your application to those database that you decide to support with your trigger.
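A minimal sketch of the first option; the receiver name and the file path are assumptions for illustration:

from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=LevelSchema)
def populate_default_schema(sender, instance, created, **kwargs):
    # only on first save, and only if no schema was provided
    if created and not instance.schema:
        with open("myapp/defaults/level_schema.json") as f:
            instance.schema = f.read()
        instance.save(update_fields=["schema"])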
I find the use of a variable not particularly unclean. But you could "abuse" the fact that the default argument, which all fields support, can be a callable. So you could do this "crazy" thing:
def get_default_json():
    with open('mylargevalue.json') as f:
        return f.read()
and then on your field:
schema = models.TextField(default=get_default_json)
I haven't tried anything like it, but I suppose it could work.

Python sqlite3: saving an instance of a class with another instance as its attribute?

I'm creating a game mod for Counter-Strike in Python, and it's basically all done. The only thing left is to code a real database, and I don't have any experience with sqlite, so I need quite a lot of help.
I have a Player class with the attribute self.steamid, which is unique for every Counter-Strike player (received from the game engine), and self.entity, which holds an Entity for the player. The Entity class is a self-made Python class with many more attributes, such as level and name, and lots of methods.
What would be the best way to implement a database? First of all, how can I properly save instances of Player, with another instance of Entity as an attribute, into a database?
Also, I will need to get that user's data every time he connects to the game server (I have a player_connect event), so how would I retrieve the data back?
All the tutorials I found only taught about saving strings or integers, but nothing about whole instances. Will I have to save every attribute of all instances (the Entity instance has a few more instances as its attributes, and all of them have huge numbers of attributes...), or is there a faster, easier way?
Also, it's going to be a locally saved database, so I can't really use anything other than SQL.
You need an ORM. Either you roll your own (which I never suggest), or you use one that already exists. Probably the two most popular in Python are SQLAlchemy and the ORM bundled with Django.
SQL databases typically can hold only fundamental datatypes. You can use SQLAlchemy if you want to map your models so that their attributes are automatically mapped to SQL types - but it would require a lot of study and trial and error with SQLite on your part.
I think you are not entirely correct when you say "it has to be SQL" - if you are running Python code, you can save whatever format you like.
However, Python allows you to serialize your instance data to a string - which can be persisted in a database.
So, you can create a varchar(65535) field in the SQL table, along with an ID field (which could be the player ID number you mentioned, for example), and persist in it the value returned by:
import pickle
value = pickle.dumps(my_instance)
When retrieving the value you do the reverse:
my_instance = pickle.loads(value)
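Putting those two snippets together, a hedged sketch of what this could look like with the standard-library sqlite3 module; the table and column names are made up for the example:

import pickle
import sqlite3

conn = sqlite3.connect("players.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS players (steamid TEXT PRIMARY KEY, data BLOB)"
)

def save_player(player):
    # pickles the Player together with its nested Entity attribute
    conn.execute(
        "INSERT OR REPLACE INTO players (steamid, data) VALUES (?, ?)",
        (player.steamid, pickle.dumps(player)),
    )
    conn.commit()

def load_player(steamid):
    row = conn.execute(
        "SELECT data FROM players WHERE steamid = ?", (steamid,)
    ).fetchone()
    return pickle.loads(row[0]) if row else None

Keep in mind that pickle only round-trips cleanly if the class definitions are importable when loading, and pickled data should never be loaded from untrusted sources.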
