NDB Put Transactions are being Overwritten - python

I have a dictionary that I would like to write in whole to an NDB on App Engine. The problem is that only the last item in the dictionary is being written. I thought perhaps the writes were too fast so I put a sleep timer in with a very long wait of 20 seconds to see what would happen. I continually refreshed the Datastore Viewer and saw the transaction write, and then later get overwritten by the next transaction, etc. The table started out empty and the dictionary keys are unique. A simple example:
class Stats(ndb.Model):
    desc = ndb.StringProperty(required=True)
    count = ndb.IntegerProperty(required=True)
    update = ndb.DateTimeProperty(auto_now_add=True)

class refresh(webapp2.RequestHandler):
    def get(self):
        statsStore = Stats()
        dict = {"test1": 0, "test2": 1, "test3": 2}
        for key in dict:
            statsStore.desc = key
            statsStore.count = dict.get(key)
            statsStore.put()
What will happen above is that only the final dictionary item will remain in the datastore. Again with a sleep timer I can see each being written but then overwritten. I am using this on my local machine with the local development GAE environment.
Appreciate the help.

The problem with your original code is that you're reusing the same entity (model instance).
During the first put(), a datastore key is generated and assigned to that entity. Then, all the following put() calls are using the same key.
Changing it to create a new model instance on each iteration (the solution you mention in your comment) will ensure a new datastore key is generated each time.
Another option would be to clear the key with "statsStore.key = None" before calling put(). But what you did is probably better.

Not sure what you are trying to do, but here are some hopefully helpful pointers.
If you want to save the dict and re-use it later by reading it back from the datastore, change your string property to a text property, import json, and save the dict as a JSON string using json.dumps().
If you want to write an entity for every element in your dict, move the line that creates the Stats() instance inside the for loop, and end each iteration by appending that instance to a list. Once the loop is done, you can batch put all the entities in the list (ndb.put_multi() does this). The batch approach is much faster than calling put() inside the loop, which is usually a poor choice for performance.
If you just want to record all the values in the dict for later reference, and you have a value that you can safely use as a delimiter, I would create two empty lists before the loop and append each desc and count to its respective list. Once outside the loop, you can save these values to two text properties in your entity by joining the lists with the delimiter string. If you do this, I strongly suggest using urllib.quote() to escape each desc value when appending it, so as to avoid conflicts with your delimiter.
Some final notes: be careful using this type of process with a StringProperty. You might easily exceed the string size limit, depending on the number of items and the length of your desc values. Also remember that the items in the dict may not come out in the order you intend. Consider something like: "for k, v in sorted(mydict.items()):" HTH, stevep
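A rough sketch of the two serialization ideas from this answer, written as plain Python so it runs outside App Engine. The "|" delimiter and the variable names are assumptions made for the example. The import is the Python 3 location of quote; on Python 2 the same function is urllib.quote, as mentioned above.

```python
import json
from urllib.parse import quote, unquote  # Python 2: urllib.quote / urllib.unquote

stats = {"test1": 0, "test2": 1, "test3": 2}

# Option 1: store the whole dict as a single JSON string.
as_json = json.dumps(stats, sort_keys=True)
restored = json.loads(as_json)

# Option 2: two delimiter-joined strings, with each desc value escaped.
DELIMITER = "|"  # assumed delimiter, chosen for this example
descs, counts = [], []
for desc, count in sorted(stats.items()):   # sorted() gives a stable order
    descs.append(quote(desc, safe=""))      # "|" inside a desc becomes "%7C"
    counts.append(str(count))
desc_text = DELIMITER.join(descs)
count_text = DELIMITER.join(counts)

# Reading the values back: split on the delimiter, then undo the escaping.
decoded_descs = [unquote(d) for d in desc_text.split(DELIMITER)]
decoded_counts = [int(c) for c in count_text.split(DELIMITER)]
```

The two strings would then go into the two text properties; the JSON route is simpler when nothing needs to read the values without parsing them.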

Related

Speed up lookup item in list (via Python)

I have a very large list, and I have to run a lot of lookups for this list.
To be more specific, I am processing a large (> 11 GB) text file. Some items appear more than once, and I only have to process each one the first time it appears.
If the pattern shows up, I process it and put it in a list. If the item appears again, I check for it in the list, and if it is there, I just skip the processing, like this:
[...]
if boundary.match(line):
    if closedreg.match(logentry):
        closedthreads.append(threadid)
    elif threadid in closedthreads:
        pass
    else:
        [...]
The code itself is far from optimal. My main problem is that the 'closedthreads' list contains a few million items, and the whole operation just gets slower and slower.
I think it could help to sort the list (or use a 'sorted list' object) after every append(), but I am not sure about this.
What is the most elegant solution?
You can simply use a set or a hash table which marks if given id already appeared. It should solve your problem with O(1) time complexity for adding and finding an item.
Using a set instead of a list will give you O(1) lookup time, although there may be other ways to optimize this that will work better for your particular data.
closedthreads = set()
# ...
if boundary.match(line):
    if closedreg.match(logentry):
        closedthreads.add(threadid)
    elif threadid in closedthreads:
        pass
    else:
        # ...
Do you need to preserve ordering?
If not - use a set.
If you do - use an OrderedDict. An OrderedDict also lets you store a value with each key (for example, the processing results).
But... do you need to preserve the original values at all? You might look at the 'dbm' module if you absolutely do (or buy a lot of memory!) or, instead of storing the actual text, store SHA-1 digests, or something like that. If all you want to do is make sure you don't run the same element twice, that might work.
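As a sketch of the digest idea: the set below holds 20-byte SHA-1 digests instead of the original strings, so memory per entry stays fixed however long the items are. The names digest, seen and first_time are made up for this example.

```python
import hashlib

seen = set()  # holds 20-byte digests rather than the full item text


def digest(text):
    """Return the raw SHA-1 digest (20 bytes) of a string."""
    return hashlib.sha1(text.encode("utf-8")).digest()


def first_time(item):
    """Return True only the first time an item is offered."""
    d = digest(item)
    if d in seen:        # average O(1) membership test on a set
        return False
    seen.add(d)
    return True


results = [first_time(t) for t in ["thread-1", "thread-2", "thread-1"]]
```

This only answers "have I seen this before?"; the original text cannot be recovered from the digest, which is the trade-off mentioned above.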

Way to save a dictionary as a separate reference in python

When building a set of statistics from a dictionary, I process the various entries (such as by user). Thus, I can build the various statistics for each user. While doing this, I also build the statistics for a dummy user that I call "Total". After the dictionary is completely built, I create a .csv file and output the statistics using the writerow method.
Since Python iterates over the dictionary keys in no particular order, I want to make the Total user print last. If I try to save the generated statistics in a separate variable and output it at the proper time, the saved variable gets reset, because Python variables work by reference rather than by value. That is, the code
mystats = {}
totalstats = {}
for user in mydict:
    # perform calculations to generate mystats dictionary entries
    if user == 'Total':
        totalstats = mystats
    else:
        outfile.writerow(mystats)
outfile.writerow(totalstats)
However, the actual output of totalstats is whatever set of values had been put into mystats last.
Is there a decent way to show that totalstats is to keep the explicit values within mystats that I had at the time of the assignment or do I need to calculate all the statistics at the end or do
for stattype in mystats:
    totalstats[stattype] = mystats[stattype]
While this works, I would rather have something of the type "totalstats = mystats" than do a large loop over the complete set of statistics or calculate the entire set of statistics for Total at the end of processing.
You can use copy.deepcopy:
from copy import deepcopy
totalstats = deepcopy(mystats)
If the dict doesn't contain mutable values then you can simply use dict.copy().
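A small sketch of the difference between plain assignment, dict.copy() and deepcopy. The keys and values are invented for the example.

```python
from copy import deepcopy

mystats = {"logins": 3, "pages": [1, 2]}

alias = mystats            # same dict object under a second name
shallow = mystats.copy()   # new dict, but the nested list is still shared
deep = deepcopy(mystats)   # fully independent copy

# Later changes to mystats, as happens on the next loop iteration:
mystats["logins"] = 99
mystats["pages"].append(3)

# alias follows every change; shallow keeps its own "logins" but sees the
# appended page; deep is unaffected by both changes.
```

So dict.copy() is enough when the values are numbers or strings, and deepcopy is the safe choice when the values are lists or other dicts.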

Google App Engine non-indexed list property?

I have this property in GAE:
memberNames = db.StringListProperty(indexed = False)
But for unindexed properties, they usually don't cost me any writes (just the basic write to put the file), but with this property, I'm getting writes for every string in the list. Am I not allowed to have non-indexed ListProperties? Is my only other choice to use a JSON array string or is there a way around this?
StringListProperty creates several index rows (one row per value) and is stored only in the index, so you are correct that you need to implement your own serialization (e.g. JSON) of multi-value-properties rather than using StringListProperty to eliminate the index writes.
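A minimal sketch of the JSON route, kept datastore-free so it can be run anywhere. The helper names encode_members and decode_members are hypothetical; the encoded string is what you would store in a single unindexed text property in place of the list property.

```python
import json


def encode_members(names):
    """Serialize a list of names to one JSON string for a single text property."""
    return json.dumps(names)


def decode_members(text):
    """Turn the stored JSON string back into a list (empty if nothing stored)."""
    return json.loads(text) if text else []


stored = encode_members(["alice", "bob"])
members = decode_members(stored)
```

The cost of this approach is that you can no longer query on individual member names, since the datastore only sees one opaque string.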

can this python be shorter

I tend to obsess about expressing code the most compactly and succinctly possible without sacrificing runtime efficiency.
Here's my code:
p_audio = plate.parts.filter(content__iendswith=".mp3")
p_video = not p_audio and plate.parts.filter(content__iendswith=".flv")
p_swf = not p_audio and not p_video and plate.parts.filter(content__iendswith=".swf")
extra_context.update({
    'p_audio': p_audio and p_audio[0],
    'p_video': p_video and p_video[0],
    'p_swf': p_swf and p_swf[0]
})
Are there any python/django gurus that can drastically shorten this code?
Actually, in your pursuit of compactness and efficiency, you have managed to come up with code that is terribly inefficient. This is because when you refer to p_audio or not p_audio, that causes that queryset to be evaluated - and because you haven't sliced it before then, that means that the entire filter is brought from the database - eg all the plate objects that end with mp3, and so on.
You should ensure you do the slice for each query first, before you refer to the value of that query. Since you're concerned with code compactness, you probably want to slice with [:1] first, to get a queryset of a single object:
p_audio = plate.parts.filter(content__iendswith=".mp3")[:1]
p_video = not p_audio and plate.parts.filter(content__iendswith=".flv")[:1]
p_swf = not p_audio and not p_video and plate.parts.filter(content__iendswith=".swf")[:1]
and the rest can stay the same.
Edit to add: the slice is enough because you're only interested in the first element of each list, as evidenced by the fact that you only pass [0] of each into the context. But in your code, not p_audio refers to the original, unsliced queryset, and to determine the true/false value of the queryset Django has to evaluate it, which gets all matching elements from the database and converts them into Python objects. Since you don't actually want those objects, you're doing a lot more work than you need.
Note though that it's not re-running it every time: just the first time, since after the first evaluation the queryset is cached internally. But as I say, that's already more work than you want.
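The effect can be imitated without Django. LazyRows below is a toy class written for this illustration and is not Django's QuerySet; it only copies the three behaviours described above: testing truth evaluates the whole query, the result is cached afterwards, and slicing before evaluation produces a new, limited query.

```python
class LazyRows(object):
    """Toy lazy query (NOT Django's QuerySet); only handles [:n] slices."""

    def __init__(self, source, limit=None):
        self._source = source
        self._limit = limit
        self._cache = None
        self.fetched = 0      # number of rows loaded from the "database"

    def _evaluate(self):
        if self._cache is None:   # evaluated once, then served from cache
            rows = self._source if self._limit is None else self._source[:self._limit]
            self._cache = list(rows)
            self.fetched = len(self._cache)
        return self._cache

    def __getitem__(self, index):
        if isinstance(index, slice) and self._cache is None:
            # Slicing an unevaluated query returns a new, limited query.
            return LazyRows(self._source, limit=index.stop)
        return self._evaluate()[index]

    def __bool__(self):
        return bool(self._evaluate())

    __nonzero__ = __bool__    # Python 2 spelling of __bool__


table = ["a.mp3"] * 1000      # pretend these are 1000 matching rows

unsliced = LazyRows(table)
if unsliced:                  # truth test loads all 1000 rows
    first_unsliced = unsliced[0]

sliced = LazyRows(table)[:1]
if sliced:                    # truth test loads a single row
    first_sliced = sliced[0]
```

Both variants end up with the same first element; the only difference is how many rows were loaded to get it (unsliced.fetched versus sliced.fetched).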
Besides featuring less redundancy, this is also way easier to extend with new content types.
kinds = (("p_audio", ".mp3"), ("p_video", ".flv"), ("p_swf", ".swf"))
extra_context.update((key, False) for key, _ in kinds)
for key, ext in kinds:
    entries = plate.parts.filter(content__iendswith=ext)
    if entries:
        extra_context[key] = entries[0]
        break
Just adding this as another answer inspired by Pyroscope's above (as my edit there has to be peer reviewed)
The latest incarnation exploits the fact that the Django template system just disregards nonexistent context items when they are referenced, so mp3, etc. below do not need to be initialized to False (or 0). So, the following meets all the functionality of the code from the OP. The other optimization is that mp3, etc. are used as key names (instead of "p_audio" etc.)
for key in ['mp3', 'flv', 'swf']:
    entries = plate.parts.filter(content__iendswith=key)[:1]
    extra_context[key] = entries and entries[0]
    if extra_context[key]:
        break

How to rewrite this Dictionary For Loop in Python?

I have a Dictionary of Classes where the classes hold attributes that are lists of strings.
I made this function to find the maximum number of items in any one of those lists for a particular person.
def find_max_var_amt(some_person):
    # pass in a patient id number, get back their max number of variables for a type of variable
    max_vars = 0
    for key, value in patients[some_person].__dict__.items():
        challenger = len(value)
        if max_vars < challenger:
            max_vars = challenger
    return max_vars
What I want to do is rewrite it so that I do not have to use the .iteritems() function. This find_max_var_amt function works fine as is, but I am converting my code from using a dictionary to be a database using the dbm module, so typical dictionary functions will no longer work for me even though the syntax for assigning and accessing the key:value pairs will be the same. Thanks for your help!
Since dbm doesn't let you iterate over the values directly, you can iterate over the keys. To do so, you could modify your for loop to look like
for key in patients[some_person].__dict__:
    value = patients[some_person].__dict__[key]
    # then continue as before
I think a bigger issue, though, will be the fact that dbm only stores strings. So you won't be able to store the list directly in the database; you'll have to store a string representation of it. And that means that when you try to compute the length of the list, it won't be as simple as len(value); you'll have to develop some code to figure out the length of the list based on whatever string representation you use. It could just be as simple as len(the_string.split(',')), just be aware that you have to do it.
By the way, your existing function could be rewritten using a generator, like so:
def find_max_var_amt(some_person):
    return max(len(value) for value in patients[some_person].__dict__.itervalues())
and if you did it that way, the change to iterating over keys would look like
def find_max_var_amt(some_person):
    dct = patients[some_person].__dict__
    return max(len(dct[key]) for key in dct)
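Here is a hedged sketch of the string-representation point, using Python 3's dbm.dumb with a throwaway file in a temporary directory. The keys, the sample values and the comma-joined encoding are assumptions made for the example; in Python 3 the stored values come back as bytes, so they are decoded before splitting.

```python
import dbm.dumb
import os
import tempfile


def max_list_length(db):
    """Longest list length among the comma-joined values stored in db."""
    return max(len(db[key].decode("utf-8").split(",")) for key in db.keys())


# Throwaway database file in a temporary directory (path is illustrative only).
tmpdir = tempfile.mkdtemp()
db = dbm.dumb.open(os.path.join(tmpdir, "patient"), "c")

# Each list is stored as a single comma-joined string.
db["allergies"] = ",".join(["pollen", "dust"])
db["medications"] = ",".join(["a", "b", "c"])

longest = max_list_length(db)
db.close()
```

Note that an empty string splits into one empty item, so an empty list needs its own convention if you go this way.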
