I have this property in GAE:
memberNames = db.StringListProperty(indexed = False)
But for unindexed properties, they usually don't cost me any writes (just the basic write to put the file), but with this property, I'm getting writes for every string in the list. Am I not allowed to have non-indexed ListProperties? Is my only other choice to use a JSON array string or is there a way around this?
StringListProperty creates several index rows (one row per value) and is stored only in the index, so you are correct that you need to implement your own serialization (e.g. JSON) of multi-value-properties rather than using StringListProperty to eliminate the index writes.
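A minimal sketch of the JSON route, assuming the list is kept in a single text field. The helper names and the `member_names_json` field are made up for illustration; only the `json` calls are run here.

```python
import json


def encode_names(names):
    """Serialise a list of strings into one JSON string."""
    return json.dumps(list(names))


def decode_names(blob):
    """Inverse of encode_names; an empty or missing blob gives an empty list."""
    return json.loads(blob) if blob else []


# Hypothetical usage on the old `db` API (not executed here):
#   member_names_json = db.TextProperty()   # TextProperty is never indexed
#   entity.member_names_json = encode_names(["alice", "bob"])
#   names = decode_names(entity.member_names_json)
```

The trade-off is that you can no longer filter on individual list members in a query, since the datastore only sees one opaque string.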
I currently have a for loop which is finding and storing combinations in a list. The possible combinations are very large and I need to be able to access the combos.
Can I use an empty relational db like SQLite to store my list on disk instead of using list = []?
Essentially what I am asking is whether there is a db equivalent to list = [] that I can use to store the combinations generated via my script?
Edit:
SQLite is not a must. Any DB will work if it can accomplish my task.
Here is the exact function that is causing me so much trouble. Maybe there is a better solution in general.
Idea - Could I insert the list into the database on each loop and then empty the list? Basically, create a list on each loop, send that list to PostgreSQL and then empty the list in the python to keep the RAM usage down?
from itertools import combinations

def permute(set1, set2):
    set1_combos = list(combinations(set1, 2))
    set2_combos = list(combinations(set2, 8))
    full_sets = []
    for i in set1_combos:
        for j in set2_combos:
            full_sets.append(i + j)
    return full_sets
Ok, a few ideas
My first thought was, why do you expand the combinations objects into lists? But of course, since we have two nested for loops, the iterator in the inner loop would be exhausted after the first iteration of the outer loop if it were not converted to a list.
However, you don't need to expand both objects: you can expand just the smaller one. For instance, if both our sets are made of 50 elements, there are 1225 combinations of 2 elements with a memsize (if the items are integers) of about 120 bytes each, i.e. 147KB, while there are 5.37e+08 combinations of 8 elements with a memsize of about 336 bytes each, i.e. 180GB. So the first thing is: keep the larger combo set as a combinations object and iterate over it in the outer loop. By the way, this will also be much faster.
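To make that concrete, here is a sketch of the function rewritten as a generator. `permute_lazy` and its `k1`/`k2` parameters are names I made up; note that the results come out grouped by the large combination, so the order differs from the original nested loops.

```python
from itertools import combinations
from math import comb


def permute_lazy(set1, set2, k1=2, k2=8):
    """Yield every (set1 combo + set2 combo) tuple without building a list."""
    small = list(combinations(set1, k1))    # small side: cheap to materialise
    for big in combinations(set2, k2):      # large side: consumed lazily
        for s in small:
            yield s + big


def count_results(n1, n2, k1=2, k2=8):
    """Number of tuples permute_lazy will yield for sets of size n1 and n2."""
    return comb(n1, k1) * comb(n2, k2)
```

Since only `small` is ever held in memory, the caller can feed the generator straight into whatever storage is used, one tuple (or one batch) at a time.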
Now the database part. I assume a relational DBMS, be it SQLite or anything.
You want to create a table with a single column defined. Each row of your table will contain one final combination. Instead of appending each combination to a list, you will insert it in the table.
Now the question is, how do you need to access the data you created? Do you just need to iterate over the final combos sequentially, or do you need to query them, for instance finding all the combos which contain one specific value?
In the latter case, you'll want to define your column as the Primary Key, so your queries will be efficient; otherwise, you will save space on disk by using an auto-incrementing integer as the PK (SQLite will create it for you if you don't explicitly define a PK, and so will a few other DBMSs).
One final note: the insert phase may be painfully slow if you don't take some specific measures: check this very interesting SO post for details. In short, with a few optimizations they were able to pass from 85 to over 96K insert per second.
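As a rough sketch of the insert side with the standard `sqlite3` module: the table name `combos`, the single `combo` text column and the comma-joined encoding are all assumptions for illustration. Grouping many rows into one transaction with `executemany` is one of the usual measures for speeding up bulk inserts.

```python
import sqlite3
from itertools import islice


def store_combos(conn, combos, batch_size=10_000):
    """Insert an iterable of tuples into table `combos`, one transaction per batch."""
    conn.execute("CREATE TABLE IF NOT EXISTS combos (combo TEXT)")
    it = iter(combos)
    total = 0
    while True:
        # Pull at most batch_size tuples and encode each as comma-joined text.
        batch = [(",".join(map(str, c)),) for c in islice(it, batch_size)]
        if not batch:
            break
        with conn:  # commits the batch on success, rolls back on error
            conn.executemany("INSERT INTO combos (combo) VALUES (?)", batch)
        total += len(batch)
    return total
```

Only one batch is in memory at any time, so `batch_size` is the knob that trades RAM against the number of commits.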
EDIT: iterating over the saved data
Once we have the data in the DB, iterating over them could be as simple as:
mycursor.execute('SELECT * FROM <table> WHERE <conditions>')
for combo in mycursor.fetchall():
    print(combo)  # or do what you need
But if your conditions don't filter away most of the rows you will meet the same memory issue we started with. A first step could be using fetchmany() or even fetchone() instead of fetchall() but still you may have a problem with the size of the query result set.
So you will probably need to read from the DB a chunk of data at a time, exploiting the LIMIT and OFFSET parameters in your SELECT. The final result may be something like:
chunk_size = 1000  # or whatever number fits your case
chunk_count = 0
while True:
    mycursor.execute(f'SELECT * FROM <table> WHERE <conditions> ORDER BY <primarykey> LIMIT {chunk_size} OFFSET {chunk_size * chunk_count}')
    rows = mycursor.fetchall()
    if not rows:
        break  # no more data
    for combo in rows:
        print(combo)  # or do what you need
    chunk_count += 1
Note that you will usually need the ORDER BY clause to ensure rows are returned as you expect them, and not in a random manner.
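For a version you can actually run, here is the same idea as a generator against SQLite. It assumes a hypothetical table `combos` with a `combo` column and orders by SQLite's implicit `rowid`; adapt the query to your own schema.

```python
import sqlite3


def iter_chunks(conn, chunk_size=1000):
    """Yield lists of rows from table `combos`, at most chunk_size rows each."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT combo FROM combos ORDER BY rowid LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            return  # past the last row
        yield rows
        offset += len(rows)
```

Typical use would be `for chunk in iter_chunks(conn): ...`, processing each chunk before the next one is fetched.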
I don't believe SQLite has a built in array data type. Other DBMSs, such as PostgreSQL, do.
For SQLite, a good recommendation by another user on this site to obtain an array in SQLite can be found here: How to store array in one column in Sqlite3?
Another solution can be found: https://sqlite.org/forum/info/99a33767e8a07e59
In either case, yes it is possible to have a DBMS like SQLite store an array (list) type. However, it may require a little setup depending on the DBMS.
Edit: If you're having memory issues, have you thought about storing your data as a string and accessing the portions of the string you need when you need it?
How could I append an element to an array like this?
Using this code I'm overriding the old data:
let toUpdate = [book.id]
self.refUsers.child(localUser.key!).child("booksPurchased").setValue(toUpdate, withCompletionBlock: { (error, _) in
You could use this method: firebase.firestore.FieldValue.arrayUnion()
Example with angularfire2:
this.afs.collection('collection').doc(id).update( {
array: firebase.firestore.FieldValue.arrayUnion( 'newItem' )
});
For more information: https://firebase.google.com/docs/reference/js/firebase.firestore.FieldValue#arrayunion
In this case, you will have to read the existing data, then write it back with the new value added. Arrays like this are not always the best way to store lists of data if you want to perform a lot of append operations. For that, you're better off pushing data into a location using childByAutoId.
You could set the values of the keys in the array to true, and then set the value directly in an update.
So if 'newId' is the new item to add, maybe something like:
const update = {
  [`/users/${localUser.key}/booksPurchased/${newId}`]: true
};
firebase.db.ref().update(update);
Firebase docs example of an update:
https://firebase.google.com/docs/database/web/read-and-write
Reading and writing lists
Append to a list of data
Use the childByAutoId method to append data to a list in multiuser applications. The childByAutoId method generates a unique key every time a new child is added to the specified Firebase reference. By using these auto-generated keys for each new element in the list, several clients can add children to the same location at the same time without write conflicts. The unique key generated by childByAutoId is based on a timestamp, so list items are automatically ordered chronologically.
You can use the reference to the new data returned by the childByAutoId method to get the value of the child's auto-generated key or set data for the child. Calling getKey on a childByAutoId reference returns the auto-generated key.
You can use these auto-generated keys to simplify flattening your data structure. For more information, see the data fan-out example.
-https://firebase.google.com/docs/database/ios/lists-of-data
As a research project I'm currently writing a document-oriented database from scratch in Python. Like MongoDB, the database supports the creation of indexes on arbitrary document keys. These indexes are currently implemented using two simple dictionaries: The first contains as key the (possibly hashed) value of the indexed field and as value the store keys of all documents associated with that field value, which allows the DB to locate the document on disk. The second dictionary contains the inverse of that, i.e. as a key the store key of a given document and as value the (hashed) value of the indexed field (which makes removing document from the index more efficient). An example:
doc1 = {'foo' : 'bar'} # store-key : doc1
doc2 = {'foo' : 'baz'} # store-key : doc2
doc3 = {'foo' : 'bar'} # store-key : doc3
For the foo field, the index dictionaries for these documents would look like this:
foo_index = {'bar' : ['doc1','doc3'],'baz' : ['doc2']}
foo_reverse_index = {'doc1' : ['bar'],'doc2' : ['baz'], 'doc3' : ['bar']}
(Please note that the reverse index also consists of lists of values, not single values, to accommodate indexing of list fields, in which case each element of the list field would be contained in the index separately.)
During normal operation, the index resides in memory and is updated in real time after each insert/update/delete operation. To persist it, it gets serialized (e.g. as JSON object) and stored to disk, which works reasonably well for index sizes up to a few 100k entries. However, as the database size grows the index loading times at program startup become problematic, and committing changes in realtime to disk becomes nearly impossible since writing of the index incurs a large overhead.
Hence I'm looking for an implementation of a persistent index which allows for efficient incremental updates, or, expressed differently, does not require rewriting the whole index when persisting it to disk. What would be a suitable strategy for approaching this problem? I thought about using a linked-list to implement an addressable storage space to which objects could be written but I'm not sure if this is the right approach.
My suggestion only covers updating the persisted index; the extra loading time at program startup is the lesser problem and cannot really be avoided.
One approach is to use a preallocation of disk space for the index (possibly for other collections too). In the preallocation, you define an empirical size associated with each entry of the index as well as the total size of the index on the disk. For example 1024 bytes for each entry of the index and a total of 1000 entries.
The strategy allows for direct access to each entry of the index on disk. You just have to store the position on the disk along with the index in memory. Any time you update an entry of the index in memory, you point directly to its exact location on the disk and rewrite only a single entry.
If the first index file fills up, just create a second one; always preallocate the space for each file on disk (1024*1000 bytes). You should also preallocate the space for your other data, and use multiple fixed-size files instead of a single large file.
If it happens that some entries of the index require more than 1024 bytes, simply create an extra index file for larger entries; for example 2048 bytes per entry and a total of 100 entries.
The most important point is to use fixed-size index entries so that each one can be accessed directly.
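Here is a small sketch of the fixed-size slot idea, assuming entries are JSON-encoded and padded with spaces up to the slot size. The function names are mine, and the 1024/1000 figures are just the example numbers from above.

```python
import json

ENTRY_SIZE = 1024    # bytes reserved per index entry
MAX_ENTRIES = 1000   # slots per index file


def preallocate(path, entry_size=ENTRY_SIZE, max_entries=MAX_ENTRIES):
    """Create the index file at its full, fixed size (zero-filled)."""
    with open(path, "wb") as f:
        f.truncate(entry_size * max_entries)


def write_entry(path, slot, entry, entry_size=ENTRY_SIZE):
    """Overwrite exactly one slot; the rest of the file is left untouched."""
    data = json.dumps(entry).encode("utf-8")
    if len(data) > entry_size:
        raise ValueError("entry does not fit in a slot")
    with open(path, "r+b") as f:
        f.seek(slot * entry_size)
        f.write(data.ljust(entry_size, b" "))


def read_entry(path, slot, entry_size=ENTRY_SIZE):
    """Read one slot; returns None if the slot was never written."""
    with open(path, "rb") as f:
        f.seek(slot * entry_size)
        raw = f.read(entry_size).strip(b" \x00")
    return json.loads(raw) if raw else None
```

The in-memory index would then only need to remember which slot number belongs to which key, and an update costs a single seek plus one 1024-byte write.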
I hope it helps
I have a dictionary that I would like to write in whole to an NDB on App Engine. The problem is that only the last item in the dictionary is being written. I thought perhaps the writes were too fast so I put a sleep timer in with a very long wait of 20 seconds to see what would happen. I continually refreshed the Datastore Viewer and saw the transaction write, and then later get overwritten by the next transaction, etc. The table started out empty and the dictionary keys are unique. A simple example:
class Stats(ndb.Model):
    desc = ndb.StringProperty(required=True)
    count = ndb.IntegerProperty(required=True)
    update = ndb.DateTimeProperty(auto_now_add=True)

class refresh(webapp2.RequestHandler):
    def get(self):
        statsStore = Stats()
        dict = {"test1": 0, "test2": 1, "test3": 2}
        for key in dict:
            statsStore.desc = key
            statsStore.count = dict.get(key)
            statsStore.put()
What will happen above is that only the final dictionary item will remain in the datastore. Again with a sleep timer I can see each being written but then overwritten. I am using this on my local machine with the local development GAE environment.
Appreciate the help.
The problem with your original code is that you're reusing the same entity (model instance).
During the first put(), a datastore key is generated and assigned to that entity. Then, all the following put() calls are using the same key.
Changing it to create a new model instance on each iteration (the solution you mention in your comment) will ensure a new datastore key is generated each time.
Another option would be to clear the key with "statsStore.key = None" before calling put(). But what you did is probably better.
Not sure what you are trying to do, but here are some hopefully helpful pointers.
If you want to save the dict and then re-use it later by reading from the database, then change your string to a text property, import json, and save the dict as a JSON string value using json.dumps().
If you want to write an entity for every element in your dict, then you will want to move your statsStore creation line inside the for loop, and finish each iteration by adding the new Stats() instance to an array. Once the loop is done, you can batch put all the entities in the array. This batch approach is much faster than calling put() inside your loop, which is usually a poor choice for performance.
If you just want to record all the values in the dict for later reference, and you have a value that you can safely use as a delimiter, then I would create two empty arrays prior to your loop, and append each desc and count to the respective array. Once outside the loop, you can save these values to two text properties in your entity by joining the arrays using the delimiter string. If you do this, I strongly suggest using urllib.quote() to escape your desc text value when appending it, so as to avoid conflicts with your delimiter value.
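A sketch of the delimiter approach in plain Python. It assumes Python 3, where the function lives at `urllib.parse.quote` (on the Python 2 runtime it is `urllib.quote`), and the names `pack`, `unpack` and `DELIM` are made up for the example.

```python
from urllib.parse import quote, unquote

DELIM = ","  # assumed delimiter; quote(..., safe="") escapes it inside desc values


def pack(stats):
    """Turn {desc: count} into two delimiter-joined strings."""
    items = sorted(stats.items())  # fixed order, so the two strings line up
    descs = DELIM.join(quote(k, safe="") for k, _ in items)
    counts = DELIM.join(str(v) for _, v in items)
    return descs, counts


def unpack(descs, counts):
    """Rebuild the dict from the two strings produced by pack()."""
    if not descs:
        return {}
    pairs = zip(descs.split(DELIM), counts.split(DELIM))
    return {unquote(k): int(v) for k, v in pairs}
```

The two returned strings are what you would assign to the two text properties on a single entity.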
Some final notes: You should be careful using this type of process with a StringProperty. You might easily exceed the string size limit depending on the number of items and/or the length of your desc values. Also remember that the items in your dict may not come out in the order you intend. Consider something like: "for k, v in sorted(mydict.items()):" HTH, stevep
I have a large object I'd like to serialize to disk. I'm finding marshal works quite well and is nice and fast.
Right now I'm creating my large object then calling marshal.dump . I'd like to avoid holding the large object in memory if possible - I'd like to dump it incrementally as I build it. Is that possible?
The object is fairly simple, a dictionary of arrays.
The bsddb module's 'hashopen' and 'btopen' functions provide a persistent dictionary-like interface. Perhaps you could use one of these, instead of a regular dictionary, to incrementally serialize the arrays to disk?
import bsddb
import marshal
db = bsddb.hashopen('file.db')
db['array1'] = marshal.dumps(array1)
db['array2'] = marshal.dumps(array2)
...
db.close()
To retrieve the arrays:
db = bsddb.hashopen('file.db')
array1 = marshal.loads(db['array1'])
...
If all your object has to do is be a dictionary of lists, then you may be able to use the shelve module. It presents a dictionary-like interface where the keys and values are stored in a database file instead of in memory. One limitation which may or may not affect you is that keys in Shelf objects must be strings. Value storage will be more efficient if you specify protocol=-1 when creating the Shelf object, so that it uses a more compact binary representation.
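A short sketch of how that could look, written for Python 3. The function names are made up, and `pickle.HIGHEST_PROTOCOL` is the named equivalent of `protocol=-1`.

```python
import pickle
import shelve


def build_on_disk(path, pairs):
    """Store (key, array) pairs one at a time; nothing accumulates in memory."""
    with shelve.open(path, protocol=pickle.HIGHEST_PROTOCOL) as db:
        for key, array in pairs:
            db[key] = array  # pickled and written to the database file


def load_one(path, key):
    """Fetch a single array back without loading the rest of the shelf."""
    with shelve.open(path, flag="r") as db:
        return db[key]
```

`pairs` can itself be a generator, so each array only needs to exist while it is being built and written.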
This very much depends on how you are building the object. Is it an array of sub objects? You could marshal/pickle each array element as you build it. Is it a dictionary? Same idea applies (marshal/pickle keys)
If it is just a big, complex, hairy object, you might want to marshal dump each piece of the object, and then apply whatever your 'building' process is when you read it back in.
You should be able to dump the item piece by piece to the file. The two design questions that need settling are:
How are you building the object when you're putting it in memory?
How do you need your data when it comes back out of storage?
If your build process populates the entire array associated with a given key at a time, you might just dump the key:array pair in a file as a separate dictionary:
big_hairy_dictionary['sample_key'] = pre_existing_array
with open('central_file', 'ab') as f:  # marshal.dump needs an open binary file, not a filename
    marshal.dump({'sample_key': big_hairy_dictionary['sample_key']}, f)
Then, when reading back, each call to marshal.load() on the opened 'central_file' will return one dictionary that you can use to update a central dictionary. But this is really only going to be helpful if, when you need the data back, you want to handle reading 'central_file' once per key.
Alternately, if you are populating arrays element by element in no particular order, maybe try:
big_hairy_dictionary['sample_key'].append(single_element)
with open('marshaled_files/' + 'sample_key', 'ab') as f:  # append mode, binary
    marshal.dump(single_element, f)
Then, when you load it back, you don't necessarily need to build the entire dictionary to get back what you need; you just open 'marshaled_files/sample_key' and call marshal.load() on the file object until it raises EOFError, and you have everything associated with the key.
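Roughly, the element-by-element variant could look like this. The helper names are mine; it relies on `marshal.load` reading one value per call from an open binary file and raising `EOFError` at the end of the file.

```python
import marshal


def append_element(path, element):
    """Append one marshalled value to the file at `path`."""
    with open(path, "ab") as f:
        marshal.dump(element, f)


def load_elements(path):
    """Yield the values back one by one, in the order they were appended."""
    with open(path, "rb") as f:
        while True:
            try:
                yield marshal.load(f)
            except EOFError:
                return  # end of file: every value has been read
```

Because `load_elements` is a generator, reading back is incremental too: you can stop early or process values one at a time.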