Here's my problem. I want to make a collection in MongoDB where I have a word and the number of times it occurs. I'm doing it in Python and it's extremely slow. That is most probably because, for every word I have, I check whether it is already in the database (using *find_one*) and, if yes, get its frequency, increment it and store it back (using update). Of course, when the word is not there, I append it to a list and do a bulk insert periodically.
Is there a better way of doing this? The number of words is huge (different languages possible). Is MongoDB the right thing to use in the first place? I chose MongoDB because it was very easy to install and I picked up the tutorial in 10 minutes...
Edit - added the code as well. When I say large, I mean a file that is some 4 GB in size, full of words...
insertlist = []

def copy_to_db(word):
    global insertlist
    wordCollection = db['words']

    occurrence = wordCollection.find_one({'word': word})
    if occurrence:
        n = occurrence['number']
        n = n + 1
        wordCollection.update({'word': word}, {'$set': {'number': n}})
    else:
        insertlist.append({'word': word, 'number': 1})
        #wordCollection.insert({'word': word, 'number': 1})

    if len(insertlist) >= 5000:
        print("insert triggered ... ")
        wordCollection.insert(insertlist)
        insertlist = []
I call this function for every word.
Sounds like you could use upserts. If you use upserts you don't need to do that fetch/save cycle.
I'm not sure how this is done in the Python driver, but in JavaScript it would look something like:
db.words.update({"_id": "the_word" }, {"$inc": {"frequency": 1}}, true)
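In PyMongo it would presumably be something along these lines (an untested sketch; the exact method name depends on your driver version, with update_one(..., upsert=True) being the modern form):

import pymongo

client = pymongo.MongoClient()      # assumes a local mongod on the default port
db = client['test']

def count_word(word):
    # Upsert: increment the counter, creating the document if it does not exist yet.
    db.words.update_one(
        {'_id': word},               # the word itself as _id (see the index note below)
        {'$inc': {'frequency': 1}},
        upsert=True,
    )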
MongoDB creates an index for the _id field automatically. If you are not using the _id field for your word, then creating an index for your key would most probably help a lot.
Edit: some more ideas for you
As there is a lot of data, you could use the _id field for your word. This way you wouldn't need to create another index and the updates would be slightly faster as only one index needs to be updated while inserting new documents. This is in case inserting speed is the bottleneck.
While taking advantage of batch inserts is generally a good idea when inserting a lot of data, I'm not sure if it helps too much on this case. This depends on your data. If the ratio of unique words is high, then batch inserts might be handy. But if the same words are used over and over again (which I guess is the case with most languages), then batch inserts might not be of too much help.
Also, it looks like you have a problem in your batch insert. Think about what happens when you encounter a word for the first time: it is added to your insertlist. Now, if that same word is encountered again before the previous batch has been inserted, the number attribute for this word would be 1, which would be incorrect.
Are you sure the db is the bottleneck? Have you already made sure that there is no other poorly performing code? But anyway, I guess inserting 4 GB of data will take a while in any case.
Related
This is my first post, please be gentle. I'm attempting to sort some
files into ascending and descending order. Once I have sorted a file, I store it in a list which is assigned to a variable. The user is then to choose a file and search for an item. I get an error message...
TypeError: unorderable types: int() < list()
...whenever I try to search for an item using the variable of my sorted list; the error occurs on line 27 of my code. From research, I know that an int and a list cannot be compared, but I can't for the life of me think how else to search a large (600-item) list for an item.
At the moment I'm just playing around with binary search to get used to it.
Any suggestions would be appreciated.
year = []
with open("Year_1.txt") as file:
    for line in file:
        line = line.strip()
        year.append(line)
def selectionSort(alist):
    for fillslot in range(len(alist)-1, 0, -1):
        positionOfMax = 0
        for location in range(1, fillslot+1):
            if alist[location] > alist[positionOfMax]:
                positionOfMax = location
        temp = alist[fillslot]
        alist[fillslot] = alist[positionOfMax]
        alist[positionOfMax] = temp
def binarySearch(alist, item):
    first = 0
    last = len(alist)-1
    found = False
    while first <= last and not found:
        midpoint = (first + last)//2
        if alist[midpoint] == item:
            found = True
        else:
            if item < alist[midpoint]:
                last = midpoint-1
            else:
                first = midpoint+1
    return found
selectionSort(year)
testlist = []
testlist.append(year)
print(binarySearch(testlist, 2014))
Year_1.txt file consists of 600 items, all years in the format of 2016.
They are listed in descending order and start at 2017, down to 2013. Hope that makes sense.
Is there some reason you're not using Python's bisect module?
Something like:
import bisect
sorted_year = list()
for each in year:
    bisect.insort(sorted_year, each)
... is sufficient to create the sorted list. Then you can search it using functions such as those in the documentation.
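For example, a membership test against the sorted list could be sketched like this (bisect_left finds where the item would go; it's a hit only if that slot already holds an equal value):

import bisect

def contains(sorted_list, item):
    # bisect_left returns the insertion point for item in a sorted list.
    i = bisect.bisect_left(sorted_list, item)
    return i < len(sorted_list) and sorted_list[i] == item

print(contains(sorted_year, "2014"))   # search for the string "2014", not the int 2014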
(Actually you could just use year.sort() to sort the list in place ... bisect.insort() might be marginally more efficient for building the list from the input stream in lieu of your call to year.append() ... but my point about using the bisect module remains).
Also note that 600 items is trivial for modern computing platforms. Even 6,000 won't take more than a few milliseconds. On my laptop, sorting 600,000 random integers takes about 180 ms, and similarly sized strings still take under 200 ms.
So you're probably not gaining anything by sorting this list in this application at that data scale.
On the other hand, Python also includes a number of modules in its standard libraries for managing structured data and data files. For example, you could use Python's sqlite3 module.
Using this, you'd write standard SQL DDL (data definition language) to describe your data structure and schema, SQL DML (data manipulation language: INSERT, UPDATE, and DELETE statements) to manage the contents, and SQL queries to fetch data. Your data can be returned sorted on any column, with any mixture of ascending and descending order over any number of columns, using the standard SQL ORDER BY clause; and you can add indexes to your schema so the data is stored in a manner that enables efficient querying and traversal (table scans) in any order on any key(s) you choose.
Because Python includes SQLite in its standard libraries, and because SQLite provides SQL client/server semantics over simple local files ... there's almost no downside to using it for structured data. It's not like you have to install and maintain additional software, servers, handle network connections to a remote database server nor any of that.
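A rough sketch of what that could look like for this data (the table, column, and file names are just placeholders; it reuses the year list from the question):

import sqlite3

conn = sqlite3.connect("years.db")                                   # a plain local file
conn.execute("CREATE TABLE IF NOT EXISTS years (year INTEGER)")      # DDL
conn.executemany("INSERT INTO years (year) VALUES (?)",              # DML
                 [(int(y),) for y in year])
conn.execute("CREATE INDEX IF NOT EXISTS idx_year ON years (year)")
conn.commit()

# Fetch in whatever order you need, or test membership directly.
rows = conn.execute("SELECT year FROM years ORDER BY year DESC").fetchall()
found = conn.execute("SELECT 1 FROM years WHERE year = ?", (2014,)).fetchone() is not None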
I'm going to walk through some steps before getting to the answer.
You need to post a [mcve] (a minimal, complete, verifiable example). Instead of telling us to read from "Year_1.txt", which we don't have, you need to put the list itself in the code. Do you NEED 600 entries to get the error in your code? No. This is sufficient:
year = ["2001", "2002", "2003"]
If you really need 600 entries, then provide them. Either post the actual data, or
year = [str(x) for x in range(2017-600, 2017)]
The code you post needs to be Cut, Paste, Boom - reproduces the error on my computer just like that.
selectionSort is completely irrelevant to the question, so delete it from the question entirely. In fact, since you say the input was already sorted, I'm not sure what selectionSort is actually supposed to do in your code, either. :)
Next you say testlist = [].append(year). USE YOUR DEBUGGER before you ask here. Simply looking at the value in your variable would have made a problem obvious.
How to append list to second list (concatenate lists)
Fixing that means you now have a list of things to search. Before you were searching a list to see if 2014 matched the one thing in there, which was a complete list of all the years.
Now we get into binarySearch. If you look at the variables, you see you are comparing the integer 2014 with some string, maybe "1716", and the answer to that is useless, if it even lets you do that (I have Python 2.7, so I am not sure exactly what you get there). But the point is you can't find the integer 2014 in a list of strings, so it will always return False.
If you don't have a debugger, then you can place strategic print statements like
print ("debug info: binarySearch comparing ", item, alist[midpoint])
Now here, what VBB said in comments worked for me, after I fixed the other problems. If you are searching for something that isn't even in the list and expecting True, that's wrong. Searching for "2014" returns True, if you provide the correct list to search. Alternatively, you could convert the search term to a string and then search for it, or you could convert all the years to int during the input phase. But the int 2014 is not the same as the string "2014".
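For example, converting at read time is a one-line change (a sketch based on the reading loop from the question):

year = []
with open("Year_1.txt") as file:
    for line in file:
        year.append(int(line.strip()))   # store ints, so searching for the int 2014 can succeed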
I have to label something in a "strongly monotone increasing" fashion. Be it invoice numbers, shipping label numbers or the like.
A number MUST NOT BE used twice
Every number SHOULD BE used when exactly all smaller numbers have been used (no holes).
Fancy way of saying: I need to count 1,2,3,4 ...
The number space I have available is typically 100,000 numbers, and I need perhaps 1,000 a day.
I know this is a hard problem in distributed systems and often we are much better off with GUIDs. But in this case, for legal reasons, I need "traditional numbering".
Can this be implemented on Google AppEngine (preferably in Python)?
If you absolutely have to have sequentially increasing numbers with no gaps, you'll need to use a single entity, which you update in a transaction to 'consume' each new number. You'll be limited, in practice, to about 1-5 numbers generated per second - which sounds like it'll be fine for your requirements.
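A minimal sketch of that single-entity approach, assuming the old ndb datastore API (the model and function names here are made up):

from google.appengine.ext import ndb

class InvoiceCounter(ndb.Model):
    last_number = ndb.IntegerProperty(default=0)    # the last number handed out

@ndb.transactional
def next_invoice_number():
    # The transaction serializes concurrent callers on this one entity,
    # which is what guarantees no number is skipped or used twice.
    counter = InvoiceCounter.get_by_id('invoice') or InvoiceCounter(id='invoice')
    counter.last_number += 1
    counter.put()
    return counter.last_number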
If you drop the requirement that IDs must be strictly sequential, you can use a hierarchical allocation scheme. The basic idea/limitation is that transactions must not affect multiple storage groups.
For example, assuming you have the notion of "users", you can allocate a storage group for each user (creating some global object per user). Each user has a list of reserved IDs. When allocating an ID for a user, pick a reserved one (in a transaction). If no IDs are left, make a new transaction allocating 100 IDs (say) from the global pool, then make a new transaction to add them to the user and simultaneously withdraw one. Assuming each user interacts with the application only sequentially, there will be no concurrency on the user objects.
The gaetk - Google AppEngine Toolkit now comes with a simple library function to get a number in a sequence. It is based on Nick Johnson's transactional approach and can be used quite easily as a foundation for Martin von Löwis' sharding approach:
>>> from gaeth.sequences import *
>>> init_sequence('invoce_number', start=1, end=0xffffffff)
>>> get_numbers('invoce_number', 2)
[1, 2]
The functionality is basically implemented like this:
def _get_numbers_helper(keys, needed):
results = []
for key in keys:
seq = db.get(key)
start = seq.current or seq.start
end = seq.end
avail = end - start
consumed = needed
if avail <= needed:
seq.active = False
consumed = avail
seq.current = start + consumed
seq.put()
results += range(start, start + consumed)
needed -= consumed
if needed == 0:
return results
raise RuntimeError('Not enough sequence space to allocate %d numbers.' % needed)
def get_numbers(needed):
query = gaetkSequence.all(keys_only=True).filter('active = ', True)
return db.run_in_transaction(_get_numbers_helper, query.fetch(5), needed)
If you aren't too strict on the sequential, you can "shard" your incrementer. This could be thought of as an "eventually sequential" counter.
Basically, you have one entity that is the "master" count. Then you have a number of entities (based on the load you need to handle) that have their own counters. These shards reserve chunks of ids from the master and serve out from their range until they run out of values.
Quick algorithm:
You need to get an ID.
Pick a shard at random.
If the shard's start is less than its end, take its start and increment it.
If the shard's start is equal to (or, worse, greater than) its end, go to the master, take its value and add an amount n to it. Set the shard's start to the retrieved value plus one and its end to the retrieved value plus n.
This can scale quite well; however, the amount you can be out by is the number of shards multiplied by your n value. If you only want your records to appear to go up, this will probably work, but if you want them to represent exact order it won't be accurate. It is also important to note that the latest values may have holes, so if you are scanning on them for some reason you will have to mind the gaps.
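A rough sketch of that algorithm, again with the old ndb API (all names and the chunk size are illustrative assumptions; a real implementation would also need to guard the refill against concurrent writers):

import random
from google.appengine.ext import ndb

CHUNK = 100          # how many numbers a shard reserves from the master at a time
NUM_SHARDS = 10

class MasterCounter(ndb.Model):
    next_value = ndb.IntegerProperty(default=1)

class Shard(ndb.Model):
    start = ndb.IntegerProperty(default=0)   # next number this shard will hand out
    end = ndb.IntegerProperty(default=0)     # exclusive end of its reserved range

@ndb.transactional
def _reserve_chunk():
    master = MasterCounter.get_by_id('master') or MasterCounter(id='master')
    start = master.next_value
    master.next_value = start + CHUNK
    master.put()
    return start

@ndb.transactional
def _take_from_shard(shard_id):
    shard = Shard.get_by_id(shard_id)
    if shard is None or shard.start >= shard.end:
        return None                           # shard exhausted, needs a refill
    value = shard.start
    shard.start += 1
    shard.put()
    return value

def next_number():
    shard_id = 'shard-%d' % random.randrange(NUM_SHARDS)
    value = _take_from_shard(shard_id)
    if value is None:                         # refill from the master, keeping one number
        start = _reserve_chunk()
        Shard(id=shard_id, start=start + 1, end=start + CHUNK).put()
        value = start
    return value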
Edit
I needed this for my app (that was why I was searching the question :P ) so I have implemented my solution. It can grab single IDs as well as efficiently grab batches. I have tested it in a controlled environment (on appengine) and it performed very well. You can find the code on github.
Take a look at how the sharded counters are made. It may help you. Also, do you really need them to be numeric? If unique is enough, just use the entity keys.
Alternatively, you could use allocate_ids(), as people have suggested, then create these entities up front (i.e. with placeholder property values).
first, last = MyModel.allocate_ids(1000000)
keys = [Key(MyModel, id) for id in range(first, last+1)]
Then, when creating a new invoice, your code could run through these entries to find the one with the lowest ID such that the placeholder properties have not yet been overwritten with real data.
I haven't put that into practice, but seems like it should work in theory, most likely with the same limitations people have already mentioned.
Remember: sharding increases the probability that you will get a unique, auto-increment value, but does not guarantee it. Please take Nick's advice if you MUST have a unique auto-increment.
I implemented something very simplistic for my blog, which increments an IntegerProperty, iden, rather than the Key ID.
I define max_iden() to find the maximum iden integer currently being used. This function scans through all existing blog posts.
def max_iden():
    max_entity = Post.gql("order by iden desc").get()
    if max_entity:
        return max_entity.iden
    return 1000  # If this is the very first entry, start at number 1000
Then, when creating a new blog post, I assign it an iden property of max_iden() + 1
new_iden = max_iden() + 1
p = Post(parent=blog_key(), header=header, body=body, iden=new_iden)
p.put()
I wonder if you might also want to add some sort of verification function after this, i.e. to ensure the max_iden() has now incremented, before moving onto the next invoice.
Altogether: fragile, inefficient code.
I'm thinking of using the following solution: use CloudSQL (MySQL) to insert the records and assign the sequential ID (maybe with a Task Queue), then later (using a Cron Task) move the records from CloudSQL back to the Datastore.
The entities can also have a UUID, so we can map the entities from the Datastore to CloudSQL, and also have the sequential ID (for legal reasons).
This question already has answers here:
How can I get a random record from MongoDB?
(30 answers)
Closed 6 years ago.
I want a single random document from a MongoDB collection. My collection contains more than 1 billion documents. How do I get a single random document from that collection?
I have never worked with MongoDB from Python, but there is a general solution to your problem. Here is a MongoDB shell script for obtaining a single random document:
N = db.collection.count(condition)
db.collection.find(condition).limit(1).skip(Math.floor(Math.random()*N))
condition here is a MongoDB query. If you want to query the entire collection, use null (or {}) as the condition.
It's a general solution, so it works with any MongoDB driver.
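Since the question is about Python, the same idea in PyMongo might look roughly like this (a sketch; the database and collection names are placeholders, and the counting method is count() on older drivers):

import random
import pymongo

collection = pymongo.MongoClient()['test']['collection']   # placeholder names

condition = {}                                 # {} matches the whole collection
n = collection.count_documents(condition)
# Skip to a random offset and take the first document there.
doc = next(collection.find(condition).limit(1).skip(random.randrange(n)))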
Update
I ran a benchmark to test several implementations. First, I created a test collection with 5567249 documents and an indexed random field rnd.
I chose three methods to compare with each other:
First method:
db.collection.find().limit(1).skip(Math.floor(Math.random()*N))
Second method:
db.collection.find({rnd: {$gte: Math.random()}}).sort({rnd:1}).limit(1)
Third method:
db.collection.findOne({rnd: {$gte: Math.random()}})
I ran each method 10 times and got its average computing time:
method 1: 882.1 msec
method 2: 1.2 msec
method 3: 0.6 msec
This benchmark shows that my solution is not the fastest one.
But the third solution is not a good one either, because it finds the first element in the database (in natural order) with rnd > random(), so its output is not truly random.
I think the second method is the best one for frequent usage. But it has one drawback: it requires adding a random field to every document and maintaining an additional index.
Add an additional field named random to your collection and make sure the value in it is between 0 and 1. You can generate such values with random.random() (for example, [random.random() for _ in range(0, 10)] produces ten of them).
Then:-
import random
collection = mongodb["collection_name"]
rand = random.random() # rand will be a floating point between 0 to 1.
random_record = collection.find_one({'random': {'$gte': rand}})
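To back-fill the random field on documents that already exist, a one-off loop like this would work (a sketch; slow, but it only has to run once, and indexing the field afterwards helps the $gte query):

import random

# Give every existing document without a 'random' value one.
for doc in collection.find({'random': {'$exists': False}}, {'_id': 1}):
    collection.update_one({'_id': doc['_id']},
                          {'$set': {'random': random.random()}})
collection.create_index('random')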
MongoDB will have its native implementation in due course. Filed feature here - https://jira.mongodb.org/browse/SERVER-533
Not yet implemented at time of writing.
Since MongoDB 3.2, this can be done using the aggregation framework with the $sample operator, as described in the docs. It's super fast. The following code will randomly select 20 documents from the collection.
db.collection.aggregate( [ { $sample: {size: 20} } ] )
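Since the question mentions Python: in PyMongo the same pipeline would presumably look like this (a sketch; collection here is assumed to be a pymongo collection handle):

# 20 random documents (requires MongoDB >= 3.2)
docs = list(collection.aggregate([{'$sample': {'size': 20}}]))

# or a single random document
doc = next(collection.aggregate([{'$sample': {'size': 1}}]), None)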
If you need to select random documents matching specific criteria, you can combine it with the $match operator:
db.collection.aggregate([
{ $sample: {size: 20} },
{ $match:{"yourField": value} }
])
But beware of the order! On my small database of around 100k documents, the command above takes 15 ms, while with the order of the stages switched it takes 1750 ms (more than 100 times slower). The reason is obvious, of course. Additionally, with the order shown above you only get the subset of those 20 random documents that happen to match...
In a performant manner? It is hard, to say the least, without changing your data.
Imagine you try to skip to a rand() position of 1,000,000 within 1 billion documents. That will be slow, very slow. This is because MongoDB does not make effective use of indexes when skipping.
As @Calvin said, MongoDB has a feature request to get random documents, however it is not yet implemented.
The most performant way of doing this at the moment, if you need to do it regularly, is to add an auto-incrementing id to your records: http://www.mongodb.org/display/DOCS/How+to+Make+an+Auto+Incrementing+Field and use that to rand() on.
Edit
To clarify: when using the auto-incrementing id you will need to do one query initially (unless you keep track of it another way) to get the highest value of the field. You can either query the counter collection, or query the collection itself sorting in reverse (sort({field: -1})) with limit(1), to get the highest value for rand().
You also need to take into account changes in the data, which means you actually want the $gte of that random position.
My idea can be explained more here: php mongodb find nth entry in collection
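Sketched in PyMongo (assuming collection is a pymongo handle and a hypothetical integer field inc_id maintained via the auto-incrementing pattern linked above):

import random

# One initial query for the current maximum of the auto-incrementing field,
# then jump to a random position with $gte so that holes in the sequence don't matter.
top = collection.find_one(sort=[('inc_id', -1)])['inc_id']
doc = collection.find_one({'inc_id': {'$gte': random.randint(1, top)}})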
If your objects have int id's on them you could do something like
findOne({id: {$gte: rand()}})
I'm writing a program in Python to do a unigram (and eventually bigram, etc.) analysis of movie reviews. The goal is to create feature vectors to feed into libsvm. I have 50,000-odd unique words in my feature vector (which seems rather large to me, but I am relatively sure I'm right about that).
I'm using a Python dictionary as a hashtable to keep track of new words as I meet them, but I'm noticing an enormous slowdown after the first 1000-odd documents are processed. Would I have better efficiency (given the distribution of natural language) if I used several smaller hashtables/dictionaries, or would it be the same or worse?
More info:
The data is split into 1500 or so documents, 500-ish words each. There are between 100 and 300 unique words (with respect to all previous documents) in each document.
My current code:
#processes each individual file, tok == filename, v == predefined class
def processtok(tok, v):
    #n is the number of unique words so far,
    #reference is the mapping reference in case I want to add new data later
    #hash is the hashtable
    #statlist is the massive feature vector I'm trying to build
    global n
    global reference
    global hash
    global statlist

    cin = open(tok, 'r')
    statlist = [0]*43990
    statlist[0] = v
    lines = cin.readlines()
    for l in lines:
        line = l.split(" ")
        for word in line:
            if word in hash.keys():
                if statlist[hash[word]] == 0:
                    statlist[hash[word]] = 1
                else:
                    hash[word] = n
                    n += 1
                    ref.write('['+str(word)+','+str(n)+']'+'\n')
                    statlist[hash[word]] = 1
    cin.close()
    return statlist
Also keep in mind that my input data is about 6 MB and my output data is about 300 MB. I'm simply startled at how long this takes, and I feel that it shouldn't be slowing down so dramatically as it runs.
Slowing down: the first 50 documents take about 5 seconds; the last 50 take about 5 minutes.
@ThatGuy has made the fix, but hasn't actually told you this:
The major cause of your slowdown is the line
if word in hash.keys():
which laboriously makes a list of all the keys so far, then laboriously searches that list for word. The time taken is proportional to the number of keys, i.e. the number of unique words found so far. That's why it starts fast and becomes slower and slower.
All you need is if word in hash: which in 99.9999999% of cases takes time independent of the number of keys -- one of the major reasons for having a dict.
The faffing about with statlist[hash[word]] doesn't help, either. By the way, the fixed size in statlist=[0]*43990 needs explanation.
More problems
Problem A: Either (1) your code suffered from indentation distortion when you published it, or (2) hash will never be updated by that function. Quite simply, if word is not in hash, i.e. it's the first time you've seen it, absolutely nothing happens. The hash[word] = n statement (the ONLY code that updates hash) is NOT executed. So no word will ever be in hash.
It looks like this block of code needs to be shifted left 4 columns, so that it's aligned with the outer if:
else:
    hash[word] = n
    ref.write('['+str(word)+','+str(n)+']'+'\n')
    statlist[hash[word]] = 1
Problem B: There is no code at all to update n (allegedly the number of unique words so far).
I strongly suggest that you take as many of the suggestions that @ThatGuy and I have made as you care to, rip out all the global stuff, fix up your code, chuck in a few print statements at salient points, and run it over, say, 2 documents, each of 3 lines with about 4 words each. Ensure that it is working properly. THEN run it on your big data set (with the prints suppressed). In any case you may want to put out stats (like number of documents, lines, words, unique words, elapsed time, etc.) at regular intervals.
Another problem
Problem C: I mentioned this in a comment on @ThatGuy's answer, and he agreed with me, but you haven't mentioned taking it up:
>>> line = "foo bar foo\n"
>>> line.split(" ")
['foo', 'bar', 'foo\n']
>>> line.split()
['foo', 'bar', 'foo']
>>>
Your use of .split(" ") will lead to spurious "words" and distort your statistics, including the number of unique words that you have. You may well find the need to change that hard-coded magic number.
I say again: there is no code that updates n in the function. Doing hash[word] = n seems very strange, even if n is updated for each document.
I don't think Python's dictionary has anything to do with your slowdown here, especially when you are saying that the entries are around 100. I am hoping that you are referring to insertion and retrieval, which are both O(1) in a dictionary. The problem could be that you are not using iterators (or loading key/value pairs one at a time) when creating the dictionary, and instead are loading all the words into memory. In that case, the slowdown is due to memory consumption.
I think you've got a few problems going on here. Mostly, I am unsure of what you are trying to accomplish with statlist. It seems to me like it is serving as a poor duplicate of your dictionary. Create it after you have found all of your words.
Here is my guess as to what you want:
def processtok(tok, v):
    global n
    global reference
    global hash

    cin = open(tok, 'rb')
    for l in cin:
        line = l.split(" ")
        for word in line:
            if word in hash:
                hash[word] += 1
            else:
                hash[word] = 1
                n += 1
                ref.write('['+str(word)+','+str(n)+']'+'\n')
    cin.close()
    return hash
Note that this means you no longer need an "n", as you can discover the count of unique words by doing len(hash).