I have a requirement to store two or more independent sets of key-value pairs in Redis. They need to be independent because they represent completely different datasets.
For example, consider the following independent kv sets:
Mapping between user_id and corresponding session_id
Mapping between game lobby id and corresponding host's user_id
Mapping between user_id and usernames
Now, if I had to store these key-value pairs "in memory" using some data structure in Python, I would use 3 separate dicts (or HashMaps in Java). But how can I achieve this in Redis? The separation is required because almost every user_id will have both a session_id and a username.
I'm referring to this Redis doc for the available data types: https://redis.io/topics/data-types and the closest match for my requirements is Hash. But as mentioned there, hashes are objects and are only space efficient when they have a few fields. So is it okay to store huge numbers of kv pairs? Will that impact lookup time going forward?
Are there any alternatives to achieve this using Redis? I'm completely new to Redis.
You can do that by creating separate databases (sometimes Redis calls them 'key-spaces') for each of your data sets. Then you create your connection/pool objects with the db parameter:
>>> import redis
>>> pool = redis.ConnectionPool(host='redis_host', port=6379, db=0)
>>> r = redis.Redis(connection_pool=pool)
The only drawback of this is that you then need to manage a different pool for each data set. Another (perhaps easier) option is to add a prefix/tag to the keys that indicates which data set they belong to.
Also worth noting that the databases are addressed numerically (you can't name them like in other database systems), and Redis ships with only a small number of them (16 by default, numbered 0-15, configurable via the databases setting in redis.conf).
As for the number of K-V pairs, it's completely fine to store huge numbers of them, and lookups will still be fast. Average O(1) lookup is the main advantage of key-based data structures; the trade-off is that you are (to some extent) limited to accessing the data by its key, as opposed to other databases where you can query on any field in the table.
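For example, a minimal sketch of the key-prefix approach with redis-py (the prefix names session:, lobby_host: and username: are just illustrative assumptions):

import redis

r = redis.Redis(host='redis_host', port=6379, db=0)

# Three logical "dicts" kept apart by key prefix inside one database
r.set('session:42', 'abc123')      # user_id -> session_id
r.set('lobby_host:7', '42')        # lobby_id -> host's user_id
r.set('username:42', 'alice')      # user_id -> username

print(r.get('session:42'))         # b'abc123'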
Related
I have a script that repopulates a large database and generates id values from other tables when needed.
An example would be recording order information when given customer names only. I check whether the customer exists in a CUSTOMER table. If so, I run a SELECT query to get its ID and insert the new record. Otherwise I create a new CUSTOMER entry and get LAST_INSERT_ID().
Since these values repeat a lot and I don't always need to generate a new ID: would it be better for me to store the ID => CUSTOMER relationship as a dictionary that gets checked before reaching the database, or should the script constantly re-query the database? I'm thinking the first approach is best since it reduces load on the database, but I'm concerned about how large the ID dictionary would get and the impact of that.
The script is running on the same box as the database, so network delays are negligible.
"Is it more efficient"?
Well, a dictionary stores its values in a hash table. This should be quite efficient for looking up a value.
The major downside is maintaining the dictionary. If you know the database is not going to be updated, then you can load it once and the in-application memory operations are probably going to be faster than anything you can do with a database.
However, if the data is changing, then you have a real challenge. How do you keep the memory version aligned with the database version? This can be very tricky.
My advice would be to keep the work in the database, with an index on the column you would use as the dictionary key. This should be fast enough for your application. If you need to eke out further speed, then using a dictionary is one possibility -- but no doubt one possibility out of many -- for improving application performance.
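To make the trade-off concrete, here is a rough sketch of the dictionary-cache pattern the question describes, using sqlite3 only so the snippet is self-contained (the original setup is MySQL); the table and column names are assumptions:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT UNIQUE)')

customer_ids = {}  # name -> id, checked before touching the database

def get_or_create_customer_id(name):
    if name in customer_ids:
        return customer_ids[name]                      # cache hit, no query
    row = conn.execute('SELECT id FROM customer WHERE name = ?', (name,)).fetchone()
    if row:
        customer_id = row[0]
    else:
        cur = conn.execute('INSERT INTO customer (name) VALUES (?)', (name,))
        conn.commit()
        customer_id = cur.lastrowid                    # MySQL: LAST_INSERT_ID()
    customer_ids[name] = customer_id
    return customer_id

print(get_or_create_customer_id('Acme Corp'))          # hits the database
print(get_or_create_customer_id('Acme Corp'))          # served from the dict

Note that this only stays correct as long as the script is the sole writer of CUSTOMER rows, which is exactly the cache-invalidation caveat mentioned above.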
Let's say there is a table of People, and let's say there are 1000+ of them in the system. Each People item has the following fields: name, email, occupation, etc.
And we want to allow a People item to have a list of names (nicknames & such) where no other data is associated with the name - a name is just a string.
Is this exactly what PickleType is for? What kind of performance difference is there between using PickleType and creating a Name table, so that the name field of People becomes a one-to-many relationship?
Yes, this is one good use case for SQLAlchemy's PickleType field, documented very well here. There are obvious performance advantages to using it.
Using your example, assume your People item uses a one-to-many relationship. This requires the database to perform a JOIN to collect the sub-elements; in this case, the person's nicknames, if any. However, you get the benefit of having native objects ready to use in your Python code, without the cost of deserializing pickles.
In comparison, the list of strings can be pickled and stored as a PickleType in the database, which is internally stored as a LargeBinary. Querying for a Person will only require the database to hit a single table, with no JOINs, which results in an extremely fast return of data. However, you now incur the "cost" of de-pickling each item back into a Python object, which can be significant if you're not storing native datatypes, e.g. string, int, list, dict.
Additionally, by storing pickles in the database, you also lose the ability for the underlying database to filter results given a WHERE condition, especially with integers and datetime objects. A native database column can return values within a given numeric or date range, but the database has no concept of what the pickled string representing these items really contains.
Lastly, a simple change to a single pickle could allow arbitrary code execution within your application. It's unlikely, but must be stated.
IMHO, storing pickles is a nice way to store certain types of data, but its suitability varies greatly with the type of data. I can tell you we use it pretty extensively in our schema, even on several tables with over half a billion records, quite nicely.
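To illustrate, a minimal sketch of both options under discussion, using SQLAlchemy's declarative API (SQLAlchemy 1.4+; the class and column names are assumptions, not anything prescribed by the question):

from sqlalchemy import Column, ForeignKey, Integer, PickleType, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# Option 1: nicknames pickled into a single column (stored as LargeBinary)
class PersonPickled(Base):
    __tablename__ = 'person_pickled'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    nicknames = Column(PickleType)   # e.g. ['Bob', 'Bobby'] pickled as one blob

# Option 2: a separate one-to-many table, collected with a JOIN
class Person(Base):
    __tablename__ = 'person'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    nicknames = relationship('Nickname', back_populates='person')

class Nickname(Base):
    __tablename__ = 'nickname'
    id = Column(Integer, primary_key=True)
    person_id = Column(Integer, ForeignKey('person.id'))
    value = Column(String)
    person = relationship('Person', back_populates='nicknames')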
It is well-known that the AppEngine datastore is built on top of Bigtable, which is intrinsically sorted by key. It is also known (somewhat) how the keys are generated by the AppEngine datastore: by "combining" the app id, entity kind, instance path, and unique instance identifier (perhaps through concatenation, see here).
What is not clear is whether a transformation is done on that unique instance identifier before it is stored such as would make sequential keys non-sequential in storage (e.g. if I specify key_name="Test", is "Test" just concatenated at the end of the key without transformation?) Of course it makes sense to preserve the app-ids, entity kinds, and paths as-is to take advantage of locality/key-sorting in the Bigtable (Google's other primary storage technology, F1, works similarly with hierarchical keys), but I don't know about the unique instance identifier.
Can I rely upon key_names being preserved as-is in AppEngine's datastore?
Keys are composed using a special Protocol Buffer serialization that preserves the natural order of the fields it encodes. That means that yes, two entities with the same kind and parent will have their keys ordered by key name.
Note, though, that the sort order puts entity kind and parent key first, so two entities of different kinds, or of the same kind but with different parent entities, will not appear sequentially even if their key names are sequential.
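As a rough illustration with the legacy google.appengine.ext.ndb client (this only shows that the supplied key name round-trips unchanged; it says nothing about the internal on-disk encoding):

from google.appengine.ext import ndb

class Person(ndb.Model):
    name = ndb.StringProperty()

key = ndb.Key(Person, 'Test')       # kind + key_name, stored without transformation
Person(key=key, name='Test user').put()

fetched = key.get()
print(fetched.key.id())             # 'Test' -- the key_name comes back as-is
print(fetched.key.kind())           # 'Person'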
In addition to what @Nick explained:
If you use auto-generated numeric IDs, the legacy system used to be semi-increasing (IDs were assigned in increasing blocks), but with the new system they are pretty scattered.
I am considering serializing a big set of database records to cache in Redis, using Python and Cassandra. I either have to serialize each record and persist a string in Redis, or create a dictionary for each record and persist the set in Redis as a list of dictionaries.
Which way is faster? Pickling each record, or creating a dictionary for each record?
And second: is there any way to fetch from the database as a list of dicts (instead of a list of model objects)?
Instead of serializing your dictionaries into strings and storing them in a Redis LIST (which is what it sounds like you are proposing), you can store each dict as a Redis HASH. This should work well if your dicts are relatively simple key/value pairs. After creating each HASH you could add the key for the HASH to a LIST, which would give you an index of keys for the hashes. The benefits of this approach are avoiding or lessening the amount of serialization needed, and it may make it easier to use the data set from other applications and other languages.
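A minimal sketch of that HASH-plus-index idea with redis-py (requires redis-py 3.5+ for the mapping argument; the record:<id> and record:index key names are just assumptions):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

records = [
    {'id': '1', 'name': 'alice', 'score': '10'},
    {'id': '2', 'name': 'bob', 'score': '7'},
]

for rec in records:
    key = 'record:' + rec['id']
    r.hset(key, mapping=rec)        # one Redis HASH per dict, no pickling
    r.rpush('record:index', key)    # LIST acting as an index of hash keys

print(r.hgetall('record:1'))        # {b'id': b'1', b'name': b'alice', b'score': b'10'}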
There are of course many other approaches you can take and that will depend on lots of factors related to what kind of data you are dealing with and how you plan to use it.
If you do go with serialization you might want to at least consider a more language agnostic serialization format, like JSON, BSON, YAML, or one of the many others.
I'm evaluating using redis to store some session values. When constructing the redis client (we will be using this python one) I get to pass in the db to use. Is it appropriate to use the DB as a sort of prefix for my keys? E.g. store all session keys in db 0 and some messages in db 1 and so on? Or should I keep all my applications keys in the same db?
Quoting my answer from this question:
It depends on your use case, but my rule of thumb is: if you have a very large quantity of related data keys that are unrelated to all the rest of your data in Redis, put them in a new database. Reasons being:
You may need to (non-ideally) use the KEYS command to get all of that data at some point, and having the data segregated makes that much cheaper.
You may want to switch to a second Redis server later, and having related data pre-segregated makes this much easier.
You can keep your databases named somewhere, so it's easier for you, or a new employee, to figure out where to look for particular data.
Conversely, if your data is related to other data, it should always live in the same database, so you can easily write pipelines and Lua scripts that can access both.
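As a small illustration of those two points with redis-py (the db numbers and key names are assumptions):

import redis

# Segregated data in its own database: enumerating it with SCAN is cheaper
# because nothing unrelated lives there
sessions = redis.Redis(host='localhost', port=6379, db=1)
for key in sessions.scan_iter(count=1000):
    pass  # inspect, expire, or migrate each session key

# Related data kept in one database: a single pipeline can touch both kinds
# of keys in one round trip
r = redis.Redis(host='localhost', port=6379, db=0)
pipe = r.pipeline()
pipe.get('username:42')
pipe.get('session:42')
username, session_id = pipe.execute()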