I am considering serializing a big set of database records to cache in Redis, using Python and Cassandra. I have to either serialize each record and persist a string in Redis, or create a dictionary for each record and persist a list of dictionaries in Redis.
Which way is faster: pickling each record, or creating a dictionary for each record?
And second: is there any way to fetch from the database as a list of dicts (instead of a list of model objects)?
Instead of serializing your dictionaries into strings and storing them in a Redis LIST (which is what it sounds like you are proposing), you can store each dict as a Redis HASH. This should work well if your dicts are relatively simple key/value pairs. After creating each HASH you could add the key for the HASH to a LIST, which would give you an index of keys for the hashes. The benefits of this approach are that you avoid or lessen the amount of serialization needed, and it may make it easier to use the data set in other applications and from other languages.
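As a minimal sketch of that approach, assuming redis-py (3.5+ for the mapping argument) and simple flat records; the key names here are just examples:

# Store each record as a HASH and keep an index of hash keys in a LIST.
import redis

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

records = [
    {'id': '1', 'name': 'alice', 'score': '42'},
    {'id': '2', 'name': 'bob', 'score': '17'},
]

for rec in records:
    key = 'record:%s' % rec['id']
    r.hset(key, mapping=rec)        # the dict becomes a Redis HASH
    r.rpush('record:index', key)    # index of all hash keys

print(r.hgetall('record:1'))        # comes back as a plain dict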
There are of course many other approaches you can take and that will depend on lots of factors related to what kind of data you are dealing with and how you plan to use it.
If you do go with serialization you might want to at least consider a more language agnostic serialization format, like JSON, BSON, YAML, or one of the many others.
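If you do go the JSON route, the round trip is simple; a sketch, again assuming redis-py and an example key name:

import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

record = {'id': 1, 'name': 'alice', 'score': 42}
r.set('record:1', json.dumps(record))      # whole record stored as one JSON string
restored = json.loads(r.get('record:1'))   # deserialized on the way out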
Related
I have a requirement where I would like to store two or more independent sets of key-value pairs in Redis. They are required to be independent because they represent completely different datasets.
For example, consider the following independent kv sets:
Mapping between user_id and corresponding session_id
Mapping between game lobby id and corresponding host's user_id
Mapping between user_id and usernames
Now, if I had to store these key-value pairs "in memory" using some data structure in Python, I would use 3 separate dicts (or HashMaps in Java). But how can I achieve this in Redis? The separation is required because for almost all user_ids there will be a session_id as well as a username.
I'm referring to this Redis doc for the available datatypes: https://redis.io/topics/data-types and the closest match to my requirements is Hash. But as mentioned there, they are objects and only space efficient if there are a few fields. So is it okay to store huge amounts of kv pairs? Will that impact search time going forward?
Are there any alternatives to achieve this using Redis? I'm completely new to Redis.
You can do that by creating separate databases (sometimes Redis calls them 'key-spaces') for each of your data sets. Then you create your connection/pool objects with the db parameter:
>>> import redis
>>> pool = redis.ConnectionPool(host='redis_host', port=6379, db=0)
>>> r = redis.Redis(connection_pool=pool)
The only drawback of this is that you need to then manage different pools for each data set. Another (perhaps easier) option is to just add a tag to the keys that indicate which data set they belong to.
Also worth noting that the key-spaces are identified numerically (so you can't name them like in other database systems), and Redis only supports a small number of them by default (16, numbered 0 through 15, configurable in redis.conf).
As for the number of K-V pairs, it's completely fine to store huge numbers of them, and lookups should still be pretty quick. O(1) lookup is the main advantage of key-based data structures, with the tradeoff being that you are limited (to some extent anyway) to using the key to access the data (as opposed to other databases where you can query on any field in the table).
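As a rough sketch of the key-tag (prefix) approach mentioned above, assuming redis-py; the prefixes and IDs are just examples:

import redis

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

r.set('session:user_42', 'sess_abc123')   # user_id -> session_id
r.set('lobbyhost:lobby_7', 'user_42')     # lobby_id -> host's user_id
r.set('username:user_42', 'alice')        # user_id -> username

print(r.get('session:user_42'))           # 'sess_abc123'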
I'm trying to maintain a very volatile database in memory (disk would be too slow), always updating as I'm listening to hundreds of JSON websocket streams.
Currently I'm using redis, but it doesn't have dictionary support, just arrays. You can dump a JSON string to a redis db, but redis won't know it's JSON. So you can't access or change a certain key's value; you have to load the whole JSON string, edit it, then dump it back into redis. This of course is very slow, and I'd like to stop doing this.
There's a library called reJSON which allows redis to recognize JSON and edit JSON dictionaries, but I'd have to rewrite a lot of my code to use it. However, if I used reJSON, I could directly access and edit particular keys' values instead of loading and dumping the whole JSON string.
Currently what I'm doing is concatenating key names since they contain dictionaries. The problem here is when there's a dictionary within a dictionary; I'd have to concatenate a 2nd time, and this would produce a -ton- of keys. I don't think this is the optimal approach.
I was also recommended to use Redis commands and hashes instead of just storing JSON strings, but hashes don't seem to support dictionaries, just arrays.
As for the data itself, I'm listening to websocket streams where each update gives me data like this:
https://api.binance.com/api/v1/klines?symbol=XRPBTC&interval=5m&limit=1
This is a "candle" for the trading market "XRPBTC", and the candle completes every "5 minutes". I want to keep create a new key every time the first element of that API data changes (meaning 5 minutes have passed and there's now a new candle. This value is a millisecond epoch). If it didn't change, the current candle isn't new, but other elements of the array changed, and these changes need to be made in the redis DB.
Let's say I got a websocket update where the 4th element of the array of the current candle changed. This is most likely to happen, as it is the "close", or current price of the market.
What I have right now is a key called candles_XRPBTC5MINUTE. This key's value is a dictionary of dictionaries.
https://pastebin.com/rAYs0TaN
The "close" value in redis is here:
candles_XRPBTC5MINUTE["1554792300000"]["close"]
I want to edit the value in redis to be the one I got in the new websocket update. candles_XRPBTC5MINUTE contains nested dictionaries, and is 0.1 megabytes. Currently I load candles_XRPBTC5MINUTE from redis as JSON, update candles_XRPBTC5MINUTE["1554792300000"]["close"] to whatever is in the websocket update, then dump it back to redis as JSON. As you can tell, this is a lot of handling of old, unnecessary data when I'm focusing on the newest key 1554792300000.
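A rough sketch of that load/edit/dump cycle, assuming redis-py (the key and timestamp are the ones above; the new close value is made up):

import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

candles = json.loads(r.get('candles_XRPBTC5MINUTE'))   # load the entire nested dict
candles['1554792300000']['close'] = '0.00007541'       # change a single field
r.set('candles_XRPBTC5MINUTE', json.dumps(candles))    # write the whole 0.1 MB back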
My options seem to be:
A. Use reJSON
B. Keep using vanilla Redis, but concatenate key names again, creating tens of thousands of keys (candles_XRPBTC5MINUTE_1554792300000, candles_XRPBTC5MINUTE_1554792600000, candles_XRPBTC5MINUTE_1554792900000, etc for 1MINUTE, 3MINUTE, 15MINUTE, for hundreds of other markets)
C. Try to store the data I retrieve from the websockets as Redis hashes instead of JSON strings (a rough sketch of this is below)
What is the best option here, and why? Are there any other options?
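For reference, one possible shape for option C, only a sketch: assuming redis-py and one flat HASH per candle, with illustrative key and field names, a single field can then be updated without touching the rest.

import redis

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

key = 'candles_XRPBTC5MINUTE:1554792300000'
r.hset(key, mapping={'open': '0.00007500', 'high': '0.00007560',
                     'low': '0.00007480', 'close': '0.00007530'})

# On a later websocket update, only the changed field is written:
r.hset(key, 'close', '0.00007541')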
I'm evaluating using redis to store some session values. When constructing the redis client (we will be using this python one) I get to pass in the db to use. Is it appropriate to use the DB as a sort of prefix for my keys? E.g. store all session keys in db 0 and some messages in db 1 and so on? Or should I keep all my applications keys in the same db?
Quoting my answer from this question:
It depends on your use case, but my rule of thumb is: if you have a very large quantity of related data keys that are unrelated to all the rest of your data in Redis, put them in a new database. Reasons being:
You may need to (non-ideally) use the KEYS command to get all of that data at some point, and having the data segregated makes that much cheaper.
You may want to switch to a second Redis server later, and having related data pre-segregated makes this much easier.
You can keep your databases named somewhere, so it's easier for you, or a new employee, to figure out where to look for particular data.
Conversely, if your data is related to other data, it should always live in the same database, so you can easily write pipelines and Lua scripts that can access both.
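A tiny sketch of the kind of pipeline that last point is about, assuming redis-py and example key names:

import redis

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

pipe = r.pipeline()
pipe.get('session:user_42')     # two related keys living in the same database...
pipe.get('username:user_42')    # ...can be fetched in a single round trip
session_id, username = pipe.execute()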
Is there an existing (Python) implementation of a hash-like data structure that exists partially on disk? Or, can persist specific keys to some secondary storage, based on some criteria (like last-accessed-time)?
ex: "data at key K has not been accessed in M milliseconds; serialize it to persistent storage (disk?), and delete it".
I was referred to this, but I'm not sure I can digest it.
EDIT:
I've received two excellent answers (sqlite, gdbm); in order to determine a winner, I'll have to wait until I've tested both. Thank you!!
Go for SQLite. A big problem that you will face down the road is concurrency, file corruption, etc., and SQLite makes these very easy to avoid because it provides transactional integrity. Just define a single table with the schema (a string key as the primary key, a string value). SQLite is insanely fast, especially if you wrap batches of writes into a transaction.
GDBM IMHO also has licence problems depending on what you want to do, whereas SQLite is public domain.
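A minimal sketch of that single-table key/value store, using the standard-library sqlite3 module (the file and table names are just examples):

import sqlite3

conn = sqlite3.connect('kvstore.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')

with conn:  # groups the writes into one transaction
    conn.execute('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)',
                 ('K', 'serialized value here'))

row = conn.execute('SELECT value FROM kv WHERE key = ?', ('K',)).fetchone()
print(row[0] if row else None)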
Sounds like you are looking for gdbm:
The gdbm module provides an interface to the GNU DBM library. gdbm objects behave like mappings (dictionaries), except that keys and values are always strings.
That's basically a dictionary on disk. You might have to do a bit of serialization depending on your use.
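A tiny sketch of that, assuming GDBM is available on your system (in Python 3 the module is dbm.gnu; in Python 2 it was gdbm); keys and values come back as bytes:

import dbm.gnu

db = dbm.gnu.open('cache.gdbm', 'c')     # 'c' creates the file if it doesn't exist
db[b'K'] = b'serialized value here'      # anything beyond plain strings needs serializing first
print(db[b'K'])                          # b'serialized value here'
db.close()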
I have a need to store a python set in a database for accessing later. What's the best way to go about doing this? My initial plan was to use a textfield on my model and just store the set as a comma or pipe delimited string, then when I need to pull it back out for use in my app I could initialize a set by calling split on the string. Obviously if there is a simple way to serialize the set to store it in the db so I can pull it back out as a set when I need to use it later that would be best.
If your database is better at storing blobs of binary data, you can pickle your set. pickle's protocol 0 output is plain ASCII text, while the newer protocols (the default in Python 3) are binary, so it can go into either a text or a binary column; either way it might be better than the delimited string approach anyway. Just pickle.dumps(your_set) and unpickled = pickle.loads(database_string) later.
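A small sketch of that round trip (the column type, text vs. binary, depends on the protocol you pick):

import pickle

my_set = {1, 2, 3}

blob = pickle.dumps(my_set, protocol=0)   # protocol 0 is ASCII-safe for text columns
# ... store blob in the database, read it back later as database_string ...
database_string = blob
restored = pickle.loads(database_string)
assert restored == my_set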
There are a number of options here, depending on what kind of data you wish to store in the set.
If it's regular integers, CommaSeparatedIntegerField might work fine, although it often feels like a clumsy storage method to me.
If it's other kinds of Python objects, you can try pickling it before saving it to the database, and unpickling it when you load it again. That seems like a good approach.
If you want something human-readable in your database though, you could even JSON-encode it into a TextField, as long as the data you're storing doesn't include Python objects.
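A sketch of the JSON route; note that json can't serialize a set directly, so it gets converted to a list on the way in and back to a set on the way out:

import json

my_set = {'red', 'green', 'blue'}

encoded = json.dumps(sorted(my_set))    # store this string in a TextField
restored = set(json.loads(encoded))
assert restored == my_set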
Redis natively stores sets (as well as other data structures such as lists, hashes and sorted sets) and provides set operations, and it's rocket fast too. I find it's the Swiss Army knife for Python development.
I know it's not a relational database per se, but it does solve this problem very concisely.
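A sketch of keeping the set in Redis itself, assuming redis-py (the key name is an example):

import redis

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

my_set = {'red', 'green', 'blue'}
r.sadd('colors', *my_set)            # SADD stores the members as a native Redis set
print(r.smembers('colors'))          # returned as a Python set
print(r.sismember('colors', 'red'))  # membership tests run server-side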
What about CommaSeparatedIntegerField?
If you need another type (strings, for example) you can create your own field which works like CommaSeparatedIntegerField but uses strings (ones that don't contain commas).
Or, if you need another type, probably a better way of doing it is to have a dictionary which maps integers to your values.