I must / have to create unique ID for invoices. I have a table id and another column for this unique number. I use serialization isolation level. Using
var seq = #"SELECT invoice_serial + 1 FROM invoice WHERE ""type""=#type ORDER BY invoice_serial DESC LIMIT 1";
Doesn't help because even using FOR UPDATE it wont read correct value as in serialization level.
Only solution seems to put some retry code.
Sequences do not generate gap-free sets of numbers, and there's really no way of making them do that because a rollback or error will "use" the sequence number.
I wrote up an article on this a while ago. It's directed at Oracle but is really about the fundamental principles of gap-free numbers, and I think the same applies here.
Well, it’s happened again. Someone has asked how to implement a requirement to generate a gap-free series of numbers and a swarm of nay-sayers have descended on them to say (and here I paraphrase slightly) that this will kill system performance, that’s it’s rarely a valid requirement, that whoever wrote the requirement is an idiot blah blah blah.
As I point out on the thread, it is sometimes a genuine legal requirement to generate gap-free series of numbers. Invoice numbers for the 2,000,000+ organisations in the UK that are VAT (sales tax) registered have such a requirement, and the reason for this is rather obvious: that it makes it more difficult to hide the generation of revenue from tax authorities. I’ve seen comments that it is a requirement in Spain and Portugal, and I’d not be surprised if it was not a requirement in many other countries.
So, if we accept that it is a valid requirement, under what circumstances are gap-free series* of numbers a problem? Group-think would often have you believe that it always is, but in fact it is only a potential problem under very particular circumstances.
The series of numbers must have no gaps.
Multiple processes create the entities to which the number is associated (eg. invoices).
The numbers must be generated at the time that the entity is created.
If all of these requirements must be met then you have a point of serialisation in your application, and we’ll discuss that in a moment.
First let’s talk about methods of implementing a series-of-numbers requirement if you can drop any one of those requirements.
If your series of numbers can have gaps (and you have multiple processes requiring instant generation of the number) then use an Oracle Sequence object. They are very high performance and the situations in which gaps can be expected have been very well discussed. It is not too challenging to minimise the amount of numbers skipped by making design efforts to minimise the chance of a process failure between generation of the number and commiting the transaction, if that is important.
If you do not have multiple processes creating the entities (and you need a gap-free series of numbers that must be instantly generated), as might be the case with the batch generation of invoices, then you already have a point of serialisation. That in itself may not be a problem, and may be an efficient way of performing the required operation. Generating the gap-free numbers is rather trivial in this case. You can read the current maximum value and apply an incrementing value to every entity with a number of techniques. For example if you are inserting a new batch of invoices into your invoice table from a temporary working table you might:
insert into
invoices
(
invoice#,
...)
with curr as (
select Coalesce(Max(invoice#)) max_invoice#
from invoices)
select
curr.max_invoice#+rownum,
...
from
tmp_invoice
...
Of course you would protect your process so that only one instance can run at a time (probably with DBMS_Lock if you're using Oracle), and protect the invoice# with a unique key contrainst, and probably check for missing values with separate code if you really, really care.
If you do not need instant generation of the numbers (but you need them gap-free and multiple processes generate the entities) then you can allow the entities to be generated and the transaction commited, and then leave generation of the number to a single batch job. An update on the entity table, or an insert into a separate table.
So if we need the trifecta of instant generation of a gap-free series of numbers by multiple processes? All we can do is to try to minimise the period of serialisation in the process, and I offer the following advice, and welcome any additional advice (or counter-advice of course).
Store your current values in a dedicated table. DO NOT use a sequence.
Ensure that all processes use the same code to generate new numbers by encapsulating it in a function or procedure.
Serialise access to the number generator with DBMS_Lock, making sure that each series has it’s own dedicated lock.
Hold the lock in the series generator until your entity creation transaction is complete by releasing the lock on commit
Delay the generation of the number until the last possible moment.
Consider the impact of an unexpected error after generating the number and before the commit is completed — will the application rollback gracefully and release the lock, or will it hold the lock on the series generator until the session disconnects later? Whatever method is used, if the transaction fails then the series number(s) must be “returned to the pool”.
Can you encapsulate the whole thing in a trigger on the entity’s table? Can you encapsulate it in a table or other API call that inserts the row and commits the insert automatically?
Original article
You could create a sequence with no cache , then get the next value from the sequence and use that as your counter.
CREATE SEQUENCE invoice_serial_seq START 101 CACHE 1;
SELECT nextval('invoice_serial_seq');
More info here
You either lock the table to inserts, and/or need to have retry code. There's no other option available. If you stop to think about what can happen with:
parallel processes rolling back
locks timing out
you'll see why.
In 2006, someone posted a gapless-sequence solution to the PostgreSQL mailing list: http://www.postgresql.org/message-id/44E376F6.7010802#seaworthysys.com
Related
Hi I've a Lambda function treated as a webhook. The webhook may be called multiple time simultaneously with same data. In the lambda function, I check if the transaction record is present in the DynamoDB or not. If it's present in db the Lambda simply returns otherwise it execute further. The problem arises here that when checking if a record in db the Lambda get called again and that check fails because the previous transaction still not inserted in db. and transaction can get executed multiple times.
My question is how to handle this situation. will SQS be helpful in this situation?
You can use optimistic locking for this. I've written a more detailed blog about implementing it, but here are the core ideas.
For each item you track a version number that always gets incremented. Each update to the item will increment the version number by one.
When you want to perform an update, you first read the old item and store its version number locally. Then you change the item locally and increment its version number. When you write it back to the table in your transaction, you add a conditional write. The condition being that the current version number of the item is still the same it was when you read it.
This means the transaction will fail if the item has been update in the mean time. Optimistic Locking helps you with collision detection and is a good solution under the assumption that such collisions are relatively rare. You'd be better served with different locking strategies if they're more frequent.
Optimistic Locking will help you identify the cases you're worried about. It doesn't resolve them, you'll have to implement that yourself. A common conflict resolution approach would be to read the item again and check if your changes have already been applied.
" if it's present in db the lambda simply return otherwise it execute further" given that, is it possible to use FIFO queue and use some "key" from the data as deduplication id (fifo) and that would mean all duplicate messages would never make it to your logic and then you would also need
dynamodb's "strongly consistent" option.
I'm building a simple calendar application with recurring events. I imagine creating two tables, one for all the events and one for the recurrences. If I make an individual event, I'll add it to the events table. If I make a recurring event, I'll add the meta-information to the recurrences table and then all the events in the event table (a monthly recurrence would add 12 events to the events table).
Should I implement the logic about calculating the dates of all the recurring events in Postgresql (https://justatheory.com/2008/01/postgres-recurring-events/) or with Python (our backend API)? Are there performances differences?
Theoretically, in Postgres you only calculate recurrences once (on create/update), but in Python you would recaclucate them on every fetch, so Postgres should waste less cycles. However, I doubt that it's a lot of calculations. It shouldn't make a big difference for performance if it's not some sort of high-load project. I would judge it from a design perspective.
If you pre-calculate and save recurrences in Postgres, you need to make sure that:
They are automatically deleted when the corresponding meta-record is deleted. This is easy, store a foreign key and cascade.
They are automatically updated as time passes (in your example, you need to generate more recurrences after a year passes). This one is more tricky, I'm not sure what is the "correct" way to do scheduled updates in Postgres.
If you calculate in Python, you get rid of this problem with scheduled updates, you just create recurrences on every fetch. But you lose the beautiful ability to just fetch the data as is, from the single source of truth, and just present it to the user. The logic layer becomes more thick.
Overall, the Postgres solution seems nicer to me. I haven't looked much into the article you've linked, but it mentiones views. This probably means that you also have an option to calculate recurrences on each select, but in Postgres instead of Python. I'd just pick whatever is easier to implement and maintain.
If you're really only concerned about performance, I'm sorry, I'm not competent enough to analyze it.
I have a next entities:
class Player(ndb.Model):
player_id = ndb.IntegerProperty()
and
class TimeRecord(ndb.Model):
time = ndb.StringProperty()
So TimeRecord's instance is a child of certain instance of Player.
If I need to put a instance of TimeRecord with certain Player I am doing like this:
tr = TimeRecord(parent = ndb.Key("Player", Player.query(Player.player_id == int(certain_id)).get().key.integer_id()), time = value)
This query is expensive and sophisticated. Accordingly to doc
qry = Account.query(Account.userid == 42)
If you are sure that there was just one Account with that userid, you might prefer to use userid as a key. Account.get_by_id(...) is faster than Account.query(...).get().
As I understand I need to change structure of my datastore:
Use player_id as a key of Player and move TimeRecord (time) to property of Players. player_id is unique value.
class Player(ndb.Model):
time = ndb.StringProperty()
Q: Is that a right approach?
This is similar to mixing different levels of entities inheritance due to as I see every data should be a different entity. And inheritance implemented by ancestor keys.
Upd:
But in this case I can store just one TimeRecord value for a certain Player.
And I need a set of TimeRecords for a Player. Is a repeated property solution of this problem?
The redesign you're proposing is essentially, from the POV of a relational database user, a "de-normalization" -- which is almost a bad word in the relational field, but absolutely "normal" (ha ha) once you move into NoSQL.
If you know how things will be queried and updated, de-normalization improves performance (usually) and/or storage (sometimes) at the expense of some flexibility.
Do be aware of the trade-offs, though. Often, de-normalizing improves performance in querying/reading at the expense of extra burdens in updating -- that can be fine since typically reading is much more frequent than writing, but you need to know whether this is the case for your application.
Examining your specific use case, I see definite savings in storage (esp. if you can use a more specialized type for your time property, see https://cloud.google.com/appengine/docs/python/ndb/properties#Date_and_Time) and fewer interactions with the backend (thus better performance) on retrievals. It also simplifies your code (simplicity is good: fewer risks of bugs, easier to unit-test).
However, if saving new "time records" is a very frequent need for a player, the repeated property grows larger and larger (at some point this slows things down despite it still being a single interaction; at worst it would "bump its head" against a single entity's maximum size, which is one megabyte -- sure, that would take many tens of thousands of "time records" per player, but, not knowing your app at all, I can't tell whether that's a risk... only you can!-).
Queries can also be a problem, again entirely depending on what your app needs. I'm specifically thinking of inequality queries. Suppose you need all players with time records greater than, say, '20141215-10:00:00', and smaller than, say, '20141215-18:00:00'. Alas, an inequality query on a repeated property won't do that for you! That is, if you query for
ndb.AND(Player.time > '20141215-10:00:00',
Player.time < '20141215-18:00:00')
you'll get players with a time greater than the first constant and a time less than the second one -- not necessarily the same time! This means the query may return many more players than you wish it would, and you'll need to "filter" the resulting bunch of players in your app's code.
If you had an entity where time is not a repeated property (such as your original TimeRecord entity), then the query analogous to this one would return exactly the bunch of entities of interest (though if you then needed to fetch the players sporting those times, you'd then need another interaction with the storage back-end, typically an ndb.get_multi, so it's hard to predict performance effects without knowing much more about your app's operational parameters!).
That's what de-normalization usually boils down to: trade-offs between different aspect of "desirability" (simplicity, storage saving, fewer backend interactions, smaller amounts of data going to/from the backend -- and we're not even getting into atomic transactions and applicability of async techniques!-) -- trade-offs that can be made only with some deep understanding of an app's operational parameters.
Indeed it may be worth deploying two or more prototypes, each to a small set of users, to get actual data about how they perform (the new Cloud Monitoring offer can help with the "get actual data" part), before choosing a "definitive" (ha!) architecture -- despite the fact that migrating the data from the prototypes to the "definitive" schema will incur overhead-effort needs.
And if the app is an overnight success and suddenly you get tens of thousands of queries per second, rather than the orders-of-magnitude fewer you had planned for, the performance characteristics may just as suddenly change to the point the pain of re-architecting and migrating again may be warranted (a good problem to have, for sure, but still...).
I'm trying to insert a row if the same primary key does not exist yet (ignore in that case). Doing this from Python, using psycopg2 and Postgres version 9.3.
There are several options how to do this: 1) use subselect, 2) use transaction, 3) let it fail.
It seems easiest to do something like this:
try:
cursor.execute('INSERT...')
except psycopg2.IntegrityError:
pass
Are there any drawbacks to this approach? Is there any performance penalty with the failure?
The foolproof way to do it at the moment is try the insert and let it fail. You can do that at the app level or at the Postgres level; assuming it's not part of a procedure being executed on the server, it doesn't materially matter if it's one or the other when it comes to performance, since either way you're sending a request to the server and retrieving the result. (Where it may matter is in your need to define a save point if you're trying it from within a transaction, for the same reason. Or, as highlighted in Craig's answer, if you've many failed statements.)
In future releases, a proper merge and upsert are on the radar, but as the near-decade long discussion will suggest implementing them properly is rather thorny:
https://wiki.postgresql.org/wiki/SQL_MERGE
https://wiki.postgresql.org/wiki/UPSERT
With respect to the other options you mentioned, the above wiki pages and the links within them should highlight the difficulties. Basically though, using a subselect is cheap, as noted by Erwin, but isn't concurrency-proof (unless you lock properly); using locks basically amounts to locking the entire table (trivial but not great) or reinventing the wheel that's being forged in core (trivial for existing rows, less so for potentially new ones which are inserted concurrently if seek to use predicates instead of a table-level lock); and using a transaction and catching the exception is what you'll end up doing anyway.
Work is ongoing to add a native upsert to PostgreSQL 9.5, which will probably take the form of an INSERT ... ON CONFLICT UPDATE ... statement.
In the mean time, you must attempt the update and if it fails, retry. There's no safe alternative, though you can loop within a PL/PgSQL function to hide this from the application.
Re trying and letting it fail:
Are there any drawbacks to this approach?
It creates a large volume of annoying noise in the log files. It also burns through transaction IDs very rapidly if the conflict rate is high, potentially requiring more frequent VACUUM FREEZE to be run by autovacuum, which can be an issue on large databases.
Is there any performance penalty with the failure?
If the conflict rate is high, you'll be doing a bunch of extra round trips to the database. Otherwise not much really.
I am in the process of migrating an application from Master/Slave to HRD. I would like to hear some comments from who already went through the migration.
I tried a simple example to just post a new entity without ancestor and redirecting to a page to list all entities from that model. I tried it several times and it was always consistent. Them I put 500 indexed properties and again, always consistent...
I was also worried about some claims of a limit of one 1 put() per entity group per second. I put() 30 entities with same ancestor (same HTTP request but put() one by one) and it was basically no difference from puting 30 entities without ancestor. (I am using NDB, could it be doing some kind of optimization?)
I tested this with an empty app without any traffic and I am wondering how much a real traffic would affect the "eventual consistency".
I am aware I can test "eventual consistency" on local development. My question is:
Do I really need to restructure my app to handle eventual consistency?
Or it would be acceptable to leave it the way it is because the eventual consistency is actually consistent in practice for 99%?
If you have a small app then your data probably live on the same part of the same disk and you have one instance. You probably won't notice eventual consistency. As your app grows, you notice it more. Usually it takes milliseconds to reach consistency, but I've seen cases where it takes an hour or more.
Generally, queries is where you notice it most. One way to reduce the impact is to query by keys only and then use ndb.get_multi() to load the entities. Fetching entities by keys ensures that you get the latest version of that entity. It doesn't guarantee that the keys list is strongly consistent, though. So you might get entities that don't match the query conditions, so loop through the entities and skip the ones that don't match.
From what I've noticed, the pain of eventual consistency grows gradually as your app grows. At some point you do need to take it seriously and update the critical areas of your code to handle it.
What's the worst case if you get inconsistent results?
Does a user see some unimportant info that's out of date? That's probably ok.
Will you miscalculate something important, like the price of something? Or the number of items in stock in a store? In that case, you would want to avoid that chance occurence.
From observation only, it seems like eventually consistent results show up more as your dataset gets larger, I suspect as your data is split across more tablets.
Also, if you're reading your entities back with get() requests by key/id, it'll always be consistent. Make sure you're doing a query to get eventually consistent results.
The replication speed is going to be primarily server-workload-dependent. Typically on an unloaded system the replication delay is going to be milliseconds.
But the idea of "eventually consistent" is that you need to write your app so that you don't rely on that; any replication delay needs to be allowable within the constraints of your application.