I am trying to encode some data (actually a very big string) in a memory-efficient way on the Redis side. The Redis docs advise to "use hashes when possible", and they describe two configuration parameters:
The "hash-max-zipmap-entries", which, if I understood correctly, denotes the maximum number of fields every hash key may have (am I right?).
The "hash-max-zipmap-value", which denotes the maximum length for the value. Does it actually refer to the field name or to the value? And is the length in bytes, characters, or something else?
My thought is to split the string (which happens to have a fixed length) into chunks that play well with the above parameters and store them as values. The fields would be just sequence numbers, to ensure consistent decoding.
EDIT: I have benchmarked extensively and it seems that encoding the string in a hash yields ~50% better memory consumption.
Here is my benchmarking script:
import redis, random, sys

def new_db():
    db = redis.Redis(host='localhost', port=6666, db=0)
    db.flushall()
    return db

def db_info(db):
    return " used memory %s " % db.info()["used_memory_human"]

def random_string(_len):
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
    return "".join([letters[random.randint(0, len(letters) - 1)] for i in range(_len)])

def chunk(astr, size):
    while len(astr) > size:
        yield astr[:size]
        astr = astr[size:]
    if len(astr):
        yield astr

def encode_as_dict(astr, size):
    dod = {}
    cnt = 0
    for i in chunk(astr, size):
        dod[cnt] = i
        cnt += 1
    return dod

db = new_db()
r = random_string(1000000)
print "size of string in bytes ", sys.getsizeof(r)
print "default Redis memory consumption", db_info(db)

dict_chunk = 10000

print "*" * 100
print "BENCHMARKING \n"

db = new_db()
db.set("akey", r)
print "as string ", db_info(db)

print "*" * 100
db = new_db()
db.hmset("akey", encode_as_dict(r, dict_chunk))
print "as dict and stored at value", db_info(db)
print "*" * 100
and the results on my machine (32bit Redis instance):
size of string in bytes 1000024
default Redis memory consumption used memory 534.52K
******************************************************************************************
BENCHMARKING
as string used memory 2.98M
******************************************************************************************
as dict and stored at value used memory 1.49M
I am asking whether there is a more efficient way to store the string as a hash, by playing with the parameters I mentioned. So first I need to know what they actually mean. Then I'll benchmark again and see if there is more to gain.
EDIT2: Am I an idiot? The benchmarking is correct, but it's only confirmed for one big string. If I repeat it for many big strings, storing them as plain strings is the definite winner. I think the reason I got those results for a single string lies in the Redis internals.
Actually, the most efficient way to store a large string is as a large string - anything else adds overhead. The optimizations you mention are for dealing with lots of short strings, where empty space between the strings can become an issue.
Performance on storing a large string may not be as good as for small strings due to the need to find more contiguous blocks to store it, but that is unlikely to actually affect anything.
I've tried reading the Redis docs about the settings you mention, and it isn't easy. But it doesn't sound to me like your plan is a good idea. The hashing they describe is designed to save memory for small values. The values are still stored completely in memory. It sounds to me like they are reducing the overhead when they appear many times, for example, when a string is added to many sets. Your string doesn't meet these criteria. I strongly doubt you will save memory using your scheme.
You can of course benchmark it to see.
Take a look at the Redis Memory Usage article, where you can find a good comparison of various data types and their memory consumption.
When you store data in a hash, you skip roughly 100 bytes of per-value overhead!
So when your strings are of comparable length, 100-200 bytes for instance, you may see 30-50% memory savings; for integers it can be up to 10 times less memory!
Here are a couple of links (a small bucketing sketch follows them):
About 100 bytes overhead
Different memory optimization comparison gist
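As a hedged illustration of that idea (not code from the linked articles), here is a minimal bucketing sketch: many small key/value pairs are spread over Redis hashes so that each hash stays below the hash-max-zipmap-entries / hash-max-zipmap-value thresholds. The bucket count, key naming and the crc32 choice are assumptions for the example:
import zlib
import redis

db = redis.Redis(host='localhost', port=6379, db=0)

# Hypothetical bucketing scheme: spread many small key/value pairs across
# hashes so each hash stays below the zipmap/ziplist thresholds.
NUM_BUCKETS = 1000  # assumed value; tune against hash-max-zipmap-entries

def bucket_for(key):
    # crc32 instead of Python's hash() so buckets stay stable across runs
    return "bucket:%d" % (zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS)

def bucketed_set(key, value):
    db.hset(bucket_for(key), key, value)

def bucketed_get(key):
    return db.hget(bucket_for(key), key)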
Related
I'm looking for a set-like data structure in Python that allows fast lookup (O(1), as for sets) for 100 million short strings (or byte strings) of length ~10.
With 10M strings, this already takes 750 MB of RAM on Python 3.7 or 3.10.2 (or 900 MB if we replace the byte strings with str):
S = set(b"a%09i" % i for i in range(10_000_000)) # { b"a000000000", b"a000000001", ... }
whereas the "real data" here is 10 bytes * 10M ~ 100 MB. So there is a 7.5x memory consumption factor because of the set structure, pointers, buckets... (for a study about this in the case of a list, see the answer of Memory usage of a list of millions of strings in Python).
When working with "short" strings, having pointers to the strings (probably taking 64 bit = 8 bytes) in the internal structure is probably already responsible for a 2x factor, and also the buckets structure of the hash-table, etc.
Are there some "short string optimizations" techniques allowing to have a memory-efficient set of short bytes-strings in Python? (or any other structure allowing fast lookup/membership test)
Maybe without pointers to strings, but rather storing the strings directly in the data structure if string length <= 16 characters, etc.
Or would using a bisect or a sorted list help (lookup in O(log n) might be ok), while keeping memory usage small? (smaller than a 7.5x factor as with a set)
Up to now, here are the methods that I tested thanks to the comments, and that seem to work.
Sorted list + bisection search (+ bloom filter)
1. Insert everything in a standard list L, in sorted order. This takes a lot less memory than a set.
2. (optional) Create a Bloom filter; here is a very small piece of code to do it.
3. (optional) First test membership with the Bloom filter (fast).
4. Check whether it really is a match (and not a false positive) with the fast in_sorted_list() from this answer using bisect, which is much faster than a standard lookup b"hello" in L.
If the bisection search is fast enough, we can even bypass the Bloom filter (steps 2 and 3). It will be O(log n).
In my test with 100M strings, even without a Bloom filter, the lookup took 2 µs on average.
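A minimal sketch of the sorted-list approach (steps 1 and 4, without the optional Bloom filter); the in_sorted_list() helper below is a small version written for this example, not the exact code from the linked answer:
from bisect import bisect_left

def in_sorted_list(sorted_list, item):
    # O(log n) membership test on a sorted list
    i = bisect_left(sorted_list, item)
    return i < len(sorted_list) and sorted_list[i] == item

# Build once; b"a%09i" % i already comes out in sorted order
L = [b"a%09i" % i for i in range(10_000_000)]

print(in_sorted_list(L, b"a000012345"))   # True
print(in_sorted_list(L, b"b000000000"))   # False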
Sqlite3
As suggested by @tomalak's comment, inserting all the data into a Sqlite3 database works very well.
Querying if a string exists in the database was done in 50 µs on average on my 8 GB database, even without any index.
Adding an index made the DB grow to 11 GB, but then the queries were still done in ~50 µs on average, so no gain here.
Edit: as mentioned in a comment, using CREATE TABLE t(s TEXT PRIMARY KEY) WITHOUT ROWID; even made the DB smaller: 3.3 GB, and the queries are still done in ~50 µs on average. Sqlite3 is (as always) really amazing.
In this case, it's even possible to load it totally in RAM with the method from How to load existing db file to memory in Python sqlite3?, and then it's ~9 µs per query!
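A hedged sketch of the Sqlite3 variant described above; the file name, table layout and test data are assumptions, and the WITHOUT ROWID clause is the one mentioned in the edit:
import sqlite3

con = sqlite3.connect("strings.db")
con.execute("CREATE TABLE IF NOT EXISTS t (s TEXT PRIMARY KEY) WITHOUT ROWID")
with con:  # commit the inserts as one transaction
    con.executemany("INSERT OR IGNORE INTO t VALUES (?)",
                    (("a%09i" % i,) for i in range(1_000_000)))

def exists(con, s):
    # membership test via the primary-key index
    return con.execute("SELECT 1 FROM t WHERE s = ? LIMIT 1", (s,)).fetchone() is not None

print(exists(con, "a000000042"))   # True
print(exists(con, "b000000000"))   # False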
Bisection in file with sorted lines
Working, and with very fast queries (~35 µs per query), without loading the file into memory! See
Bisection search in the sorted lines of an opened file (not loaded in memory)
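For reference, a hedged sketch of that technique (a simplified version written for this example, not the linked answer's exact code): binary-search over byte offsets, seek, skip the partial line you landed in, and compare the next complete line. The file name is an assumption, and the file must contain sorted lines and be opened in binary mode:
import os

def line_at_or_after(f, pos):
    # Seek to pos, then return the first complete line starting at or after it.
    f.seek(pos)
    if pos:
        f.readline()    # discard the (possibly partial) line we landed in
    return f.readline()

def in_sorted_file(f, target, size):
    # Binary search over byte offsets in a file whose lines are sorted.
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        line = line_at_or_after(f, mid)
        if line == b"" or line.rstrip(b"\n") >= target:
            hi = mid            # EOF counts as "past the target"
        else:
            lo = mid + 1
    return line_at_or_after(f, lo).rstrip(b"\n") == target

with open("sorted_lines.txt", "rb") as f:
    size = os.path.getsize("sorted_lines.txt")
    print(in_sorted_file(f, b"a000012345", size))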
Dict with prefixes as keys and concatenation of suffixes as values
This is the solution described here: Set of 10-char strings in Python is 10 times bigger in RAM as expected.
The idea is: we have a dict D and, for a given word,
prefix, suffix = word[:4], word[4:]
D[prefix] += suffix + b' '
With this method, the RAM used is even smaller than the actual data (I tested with 30M strings of average length 14, and it used 349 MB); the queries seem very fast (2 µs), but the initial creation time of the dict is a bit high.
I also tried with dict values = list of suffixes, but it's much more RAM-consuming.
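A small sketch of this prefix/suffix dict, with made-up test data; the 4-byte prefix and b' ' separator follow the description above, and the buckets are built from lists and joined once to avoid quadratic byte-string concatenation:
# Build the dict bucket by bucket
parts = {}
for i in range(1_000_000):
    word = b"a%09i" % i
    prefix, suffix = word[:4], word[4:]
    parts.setdefault(prefix, [b'']).append(suffix)
D = {p: b' '.join(s) + b' ' for p, s in parts.items()}

def contains(word):
    prefix, suffix = word[:4], word[4:]
    # the surrounding separators prevent matching across suffix boundaries
    return b' ' + suffix + b' ' in D.get(prefix, b'')

print(contains(b"a000012345"))   # True
print(contains(b"a999999999"))   # False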
Maybe without pointers to strings, but rather storing the strings directly in the data structure if string length <= 16 characters, etc.
While not a set data structure but rather a list, I think pyarrow has a quite optimized way of storing a large number of small strings. There is a pandas integration as well, which should make it easy to try out:
https://pythonspeed.com/articles/pandas-string-dtype-memory/
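For example, a hedged sketch assuming pandas >= 1.3 with pyarrow installed; note that, unlike a set, membership via isin() scans the whole column, so this trades lookup speed for memory:
import pandas as pd

# The strings are stored in Arrow buffers rather than one Python object each
data = ["a%09i" % i for i in range(1_000_000)]
s = pd.Series(data, dtype="string[pyarrow]")

print(s.memory_usage(deep=True))      # far less than a set of the same strings
print(s.isin(["a000012345"]).any())   # True, but isin() is O(n) per batch of lookups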
This should be easy. After looking at what's going on, I'm not so sure. Is the writing/reading of a single binary integer atomic? That's how the underlying hardware does reads and writes of 32-bit integers. After some research, I realized Python does not store integers as a collection of bytes. It doesn't even store bytes as a collection of bytes. There's overhead involved. Does this overhead break the atomic nature of binary integers?
Here is some code I used trying to figure this out:
import time
import sys
tm=time.time()
int_tm = int(tm * 1000000)
bin_tm = bin(int_tm)
int_bin_tm = int(bin_tm, 2)
print('tm:', tm, ", Size:", sys.getsizeof(tm))
print('int_tm:', int_tm, ", Size:", sys.getsizeof(int_tm))
print('bin_tm:', bin_tm, ", Size:", sys.getsizeof(bin_tm))
print('int_bin_tm:', int_bin_tm, ", Size:", sys.getsizeof(int_bin_tm))
Output:
tm: 1581435513.076924 , Size: 24
int_tm: 1581435513076924 , Size: 32
bin_tm: 0b101100111100100111010100101111111011111110010111100 , Size: 102
int_bin_tm: 1581435513076924 , Size: 32
For a couple of side questions: does Python's binary representation of integers really consume so much more memory? Am I using the wrong type for converting decimal integers to bytes?
Python doesn't guarantee any atomic operations other than specific mutex constructs like locks and semaphores. Some operations will seem to be atomic because the GIL prevents bytecode from being run on multiple Python threads at once: "This lock is necessary mainly because CPython's memory management is not thread-safe".
Basically, what this means is that Python will ensure an entire bytecode instruction is completely evaluated before allowing another thread to continue. It does not mean, however, that an entire line of code is guaranteed to complete without interruption. This is especially true with function calls. For a deeper look at this, take a look at the dis module.
I will also point out that this talk of atomicity means nothing at the hardware level; the whole idea of an interpreted language is to abstract away the hardware. If you want to consider "actual" hardware atomicity, it will generally be a facility provided by the operating system (which is how Python likely implements things like threading.Lock).
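As a hedged illustration (the counter value and thread count below are made up), the usual remedy when several threads update shared state is to wrap the whole read-modify-write in a threading.Lock:
import threading

counter = 0
lock = threading.Lock()

def safe_increment(n):
    # counter += 1 compiles to several bytecode instructions (LOAD, ADD, STORE),
    # so without the lock two threads could interleave between them
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 400000 every time; without the lock the total may come up short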
Note on data sizes (this is just a quickie, because this is a whole other question):
sizeof(tm): 8 bytes for the 64-bit float value, 8 bytes for a pointer to the type object, and 8 bytes for the reference count.
sizeof(int_tm): ints are a bit more complicated, as some small values are "cached" using a smaller format for efficiency; larger values use a more flexible type where the number of bytes used to store the int can grow to however big it needs to be.
sizeof(bin_tm): this is actually a string, which is why it takes so much more memory than just a number; there is a fairly significant overhead, plus at least one byte per character.
"Am I using the wrong type for converting..?" We need to know what you're trying to do with the result to answer this.
I need to find identical sequences of characters in a collection of texts. Think of it as finding identical/plagiarized sentences.
The naive way is something like this:
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    ht[s] += 1
I usually use Python, but I'm beginning to think that Python is not the best choice for this task. Am I wrong about that? Is there a reasonable way to do it with Python?
If I understand correctly, Python dictionaries use open addressing, which means that the key itself is also saved in the array. If this is indeed the case, it means that a Python dictionary allows efficient lookup but is VERY bad in memory usage; thus, if I have millions of sentences, they are all saved in the dictionary, which is horrible since it exceeds the available memory and makes the Python dictionary an impractical solution.
Can someone confirm the previous paragraph?
One solution that comes into mind is explicitly using a hash function (either use the builtin hash function, implement one or use the hashlib module) and instead of inserting ht[s]+=1, insert:
ht[hash(s)]+=1
This way the key stored in the array is an int (that will be hashed again) instead of the full sentence.
Will that work? Should I expect collisions? any other Pythonic solutions?
Thanks!
Yes, dicts store the keys in memory. If your data fits in memory, this is the easiest approach.
Hashing should work. Try MD5: it is a 16-byte digest, so collisions are unlikely.
Try BerkeleyDB for a disk-based approach.
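A minimal sketch of the MD5 suggestion (the sample sentences are made up): count sentences by their 16-byte digest instead of using the full text as the dict key.
import hashlib
from collections import defaultdict

def fingerprint(sentence):
    # assumes the sentences are unicode strings; adjust the encoding as needed
    return hashlib.md5(sentence.encode("utf-8")).digest()

ht = defaultdict(int)
sentences = ["the cat sat on the mat", "the dog barked", "the cat sat on the mat"]
for s in sentences:
    ht[fingerprint(s)] += 1

print(max(ht.values()))   # 2: one sentence appears twice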
Python dicts are indeed memory monsters. You can hardly operate on millions of keys when storing anything larger than integers. Consider the following code:
import random
d = {}
for x in xrange(5000000): # it's 5 millions
    d[x] = random.getrandbits(BITS)
With BITS = 64 it takes 510 MB of my RAM, with BITS = 128 550 MB, with BITS = 256 650 MB, and with BITS = 512 830 MB. Increasing the number of iterations to 10 million roughly doubles the memory usage. However, consider this snippet:
for x in xrange(5000000): # it's 5 millions
    d[x] = (random.getrandbits(64), random.getrandbits(64))
It takes 1.1 GB of my memory. Conclusion? If you want to keep two 64-bit integers, pack them into one 128-bit integer, like this:
for x in xrange(5000000): # it's still 5 millions
    d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)
It will reduce memory usage by a factor of two.
It depends on your actual memory limit and the number of sentences, but you should be safe using dictionaries with 10-20 million keys when the values are just integers. You have a good idea with hashes, but you probably want to keep a pointer to the sentence as well, so that in case of a collision you can investigate (compare the sentences char by char and probably print them out). You could create a pointer as an integer, for example by packing the file number and the offset into it. If you don't expect a massive number of collisions, you can simply set up another dictionary for storing only the collisions, for example:
hashes = {}
collisions = {}
for s in sentences:
    ptr_value = pointer(s)   # make it an integer
    hash_value = hash(s)     # make it an integer
    if hash_value in hashes:
        collisions.setdefault(hashes[hash_value], []).append(ptr_value)
    else:
        hashes[hash_value] = ptr_value
So at the end you will have a collisions dictionary where each key is a pointer to a sentence and each value is an array of pointers to the sentences it collides with. It sounds pretty hacky, but working with integers is just fine (and fun!).
Perhaps pass the keys through md5: http://docs.python.org/library/md5.html
I'm not sure exactly how large the data set you are comparing is, but I would recommend looking into Bloom filters (be careful of false positives): http://en.wikipedia.org/wiki/Bloom_filter. Another avenue to consider would be something simple like cosine similarity or edit distance between documents, but if you are trying to compare one document against many, I would suggest looking into Bloom filters; you can encode the data however you find most efficient for your problem.
I know there have been some questions regarding file reading, binary data handling and integer conversion using struct before, so I come here to ask about a piece of code I have that I think is taking too much time to run. The file being read is a multichannel datasample recording (short integers), with intercalated intervals of data (hence the nested for statements). The code is as follows:
# channel_content is a dictionary, channel_content[channel]['nsamples'] is a string
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        channel_content[channel]['recording'].extend(
            [struct.unpack("h", f.read(2))[0]
             for iteration in xrange(int(channel_content[channel]['nsamples']))])
With this code, I get 2.2 seconds per megabyte read on a dual-core machine with 2Mb of RAM, and my files typically have 20+ Mb, which gives a very annoying delay (especially considering that another benchmark shareware program I am trying to mirror loads the file WAY faster).
What I would like to know:
If there is some violation of "good practice": badly arranged loops, repetitive operations that take longer than necessary, use of inefficient container types (dictionaries?), etc.
If this reading speed is normal, or normal for Python.
If creating a C++ compiled extension would be likely to improve performance, and if it would be a recommended approach.
(of course) If anyone can suggest some modification to this code, preferably based on previous experience with similar operations.
Thanks for reading
(I have already posted a few questions about this job of mine; I hope they are all conceptually unrelated, and I also hope I'm not being too repetitive.)
Edit: channel_names is a list, so I made the correction suggested by @eumiro (removed the typoed brackets).
Edit: I am currently going with Sebastian's suggestion of using array with the fromfile() method, and will soon put the final code here. Besides, every contribution has been very useful to me, and I gladly thank everyone who kindly answered.
Final form, after going with array.fromfile() once and then alternately extending one array for each channel by slicing the big array:
fullsamples = array('h')
# read everything that remains in the file after the current position
fullsamples.fromfile(f, (os.path.getsize(f.filename) - f.tell()) / fullsamples.itemsize)
position = 0
for rec in xrange(int(self.header['nrecs'])):
    for channel in self.channel_labels:
        samples = int(self.channel_content[channel]['nsamples'])
        self.channel_content[channel]['recording'].extend(
            fullsamples[position:position + samples])
        position += samples
The speed improvement was very impressive over reading the file a bit at a time, or using struct in any form.
You could use array to read your data:
import array
import os
fn = 'data.bin'
a = array.array('h')
a.fromfile(open(fn, 'rb'), os.path.getsize(fn) // a.itemsize)
It is about 40 times faster than struct.unpack from @samplebias's answer.
If the files are only 20-30M, why not read the entire file, decode the numbers in a single call to unpack, and then distribute them among your channels by iterating over the array:
import struct

data = open('data.bin', 'rb').read()
values = struct.unpack('%dh' % (len(data) / 2), data)
del data
# iterate over channels, and assign from values using indices/slices
A quick test showed this resulted in a 10x speedup over struct.unpack('h', f.read(2)) on a 20M file.
A single array.fromfile call is definitely the fastest, but it won't work if the data series is interleaved with other value types.
In such cases, another big speed increase, which can be combined with the previous struct answers, is to precompile a struct.Struct object with the format for each chunk instead of calling the unpack function multiple times. From the docs:
Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.
So for instance, if you wanted to unpack 1000 interleaved shorts and floats at a time, you could write:
chunksize = 1000
structobj = struct.Struct("hf" * chunksize)
while True:
    chunkdata = structobj.unpack(fileobj.read(structobj.size))
(Note that the example is only partial and needs to account for changing the chunksize at the end of the file and breaking the while loop.)
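A hedged completion of that sketch (the file name is an assumption; note that the native "hf" format includes alignment padding, so a standard-size format such as "<hf" may match your file layout better):
import struct

chunksize = 1000
record = struct.Struct("hf")             # one (short, float) pair
chunk = struct.Struct("hf" * chunksize)  # compiled once, reused for every full chunk

values = []
with open("data.bin", "rb") as fileobj:
    while True:
        data = fileobj.read(chunk.size)
        if len(data) == chunk.size:
            values.extend(chunk.unpack(data))
            continue
        # final, shorter read: unpack only the complete records that remain
        nrecords = len(data) // record.size
        if nrecords:
            values.extend(struct.unpack("hf" * nrecords, data[:nrecords * record.size]))
        break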
extend() accepts iterables; that is to say, instead of .extend([...]) you can write .extend(...) with a generator expression. It is likely to speed up the program because extend() will then consume a generator instead of a fully built list.
There is an inconsistency in your code: you first define channel_content = {}, and after that you perform channel_content[channel]['recording'].extend(...), which requires the key channel and the subkey 'recording', with a list as its value, to already exist so that there is something to extend.
What is the nature of self.channel_content[channel]['nsamples'], so that it can be passed to the int() function?
Where does number_of_intervals come from? What is the nature of the intervals?
In the for rec in xrange(number_of_intervals): loop, I don't see rec used anywhere. So it seems to me that you are repeating the same for channel in channel_names: loop as many times as the number expressed by number_of_intervals. Are there number_of_intervals * int(self.channel_content[channel]['nsamples']) * 2 bytes to read in f?
I read in the doc:
class struct.Struct(format)
Return a new Struct object which writes and reads binary data according to the format string format. Creating a Struct object once and calling its methods is more efficient than calling the struct functions with the same format since the format string only needs to be compiled once.
This expresses the same idea as samplebias.
If your aim is to create a dictionary, there is also the possibility to use dict() with a generator as its argument.
EDIT
I propose:
channel_content = {channel: {'recording': []} for channel in channel_names}
for rec in xrange(number_of_intervals):
    for channel in channel_names:
        N = int(self.channel_content[channel]['nsamples'])
        channel_content[channel]['recording'].extend(
            struct.unpack(str(N) + "h", f.read(2 * N)))
I don't know how to take account of J.F. Sebastian's suggestion to use array.
Not sure if it would be faster, but I would try to decode chunks of words instead of one word at a time. For example, you could read 100 bytes of data at a time like:
s = f.read(100)
struct.unpack(str(len(s)/2)+"h", s)
I have a set of lots of big, long strings that I want to do existence lookups for. I don't need the whole string ever to be saved. As far as I can tell, the set() actually stores the strings, which is eating up a lot of my memory.
Does such a data structure exist?
done = hash_only_set()
while len(queue) > 0:
    item = queue.pop()
    if item not in done:
        process(item)
        done.add(item)
(My queue is constantly being filled by other threads so I have no way of dedupping it at the start).
It's certainly possible to keep a set of only hashes:
done = set()
while len(queue) > 0:
    item = queue.pop()
    h = hash(item)
    if h not in done:
        process(item)
        done.add(h)
Notice that because of hash collisions, there is a chance that you consider an item done even though it isn't.
If you cannot accept this risk, you really need to save the full strings to be able to tell whether you have seen it before. Alternatively: perhaps the processing itself would be able to tell?
Yet alternatively: if you cannot accept to keep the strings in memory, keep them in a database, or create files in a directory with the same name as the string.
You can use a data structure called Bloom Filter specifically for this purpose. A Python implementation can be found here.
EDIT: Important notes:
False positives are possible in this data structure, i.e. a check for the existence of a string could return a positive result even though it was not stored.
False negatives (getting a negative result for a string that was stored) are not possible.
That said, the chances of this happening can be brought to a minimum if used properly and so I consider this data structure to be very useful.
If you use a secure hash function (like SHA-256, found in the hashlib module) to hash the strings, it's very unlikely that you will find duplicates (and if you do find some, you can probably win a prize, as with most cryptographic hash functions).
The builtin __hash__() method does not guarantee you won't have duplicates (and since it only uses 32 bits, it's very likely you'll find some).
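A minimal sketch of that hashlib suggestion, keeping only the 32-byte SHA-256 digests in the set (it assumes the items are unicode strings):
import hashlib

done = set()

def mark_done(item):
    done.add(hashlib.sha256(item.encode("utf-8")).digest())

def is_done(item):
    return hashlib.sha256(item.encode("utf-8")).digest() in done

mark_done("some very long string " * 1000)
print(is_done("some very long string " * 1000))   # True
print(is_done("another string"))                  # False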
You need to know the whole string to have 100% certainty. If you have lots of strings with similar prefixes you could save space by using a trie to store the strings. If your strings are long you could also save space by using a large hash function like SHA-1 to make the possibility of hash collisions so remote as to be irrelevant.
If you can make the process() function idempotent - i.e. having it called twice on an item is only a performance issue - then the problem becomes a lot simpler and you can use lossy data structures, such as Bloom filters.
You would have to think about how to do the lookup, since there are two methods that the set needs, __hash__ and __eq__.
The hash is a "loose part" that you can take away, but __eq__ is not something you can drop: you need both strings for the comparison.
If you only need negative confirmation (this item is not part of the set), you could fill a set collection you implemented yourself with your strings, then "finalize" the set by removing all strings except those involved in collisions (those are kept around for __eq__ tests), and promise not to add more objects to your set. Now you have an exclusive test available: you can tell whether an object is not in your set, but you can't be certain whether "obj in Set == True" is a false positive or not.
Edit: This is basically a bloom filter that was cleverly linked, but a bloom filter might use more than one hash per element which is really clever.
Edit2: This is my 3-minute bloom filter:
class BloomFilter(object):
    """
    Let's make a bloom filter
    http://en.wikipedia.org/wiki/Bloom_filter

    __contains__ has false positives, but never false negatives
    """
    def __init__(self, hashes=(hash, )):
        self.hashes = hashes
        self.data = set()

    def __contains__(self, obj):
        return all((h(obj) in self.data) for h in self.hashes)

    def add(self, obj):
        self.data.update(h(obj) for h in self.hashes)
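A possible usage example for the class above (the second hash function, based on crc32, is just an assumption to show more than one hash per element):
import zlib

bf = BloomFilter(hashes=(hash, lambda s: zlib.crc32(s.encode("utf-8"))))
bf.add("hello world")
print("hello world" in bf)   # True
print("goodbye" in bf)       # False (barring a false positive)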
As has been hinted already, if the answers offered here (most of which break down in the face of hash collisions) are not acceptable you would need to use a lossless representation of the strings.
Python's zlib module provides built-in string compression capabilities and could be used to pre-process the strings before you put them in your set. Note, however, that the strings would need to be quite long (which you hint that they are) and have fairly low entropy in order to save much memory. Other compression options might provide better space savings, and some Python-based implementations can be found here.
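A hedged sketch of that idea; it works because zlib.compress is deterministic for a fixed compression level, so equal strings always compress to the same bytes within one run (the example string is made up):
import zlib

seen = set()

def remember(s):
    seen.add(zlib.compress(s.encode("utf-8")))

def was_seen(s):
    return zlib.compress(s.encode("utf-8")) in seen

long_string = "the same long, repetitive sentence " * 200
remember(long_string)
print(was_seen(long_string))         # True
print(was_seen(long_string + "!"))   # False
print(len(zlib.compress(long_string.encode("utf-8"))))   # much smaller than len(long_string)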