Together with my team, I am working on a feature to generate invoice numbers. The requirements say that:
there should be no gaps between invoice numbers
the numbers should start from 0 every year (together with the year, the number forms a unique key)
the invoice numbers should grow according to the creation time of the invoices
We are using PHP and PostgreSQL. We thought of implementing this in the following way:
each time a new invoice is persisted on the database we use a BEFORE INSERT trigger
the trigger executes a function that retrieves a new value from a postgres sequence and writes it on the invoice as its number
Considering that multiple invoices could be created during the same transaction, my question is: is this a sufficiently safe approach? What are its flaws? How would you suggest to improve it?
Introduction
I believe the most crucial point here is:
there should be no gaps between invoice numbers
In this case you cannot use a sequence or an auto-increment field (as others propose in the comments). An auto-increment field uses a sequence under the hood, and the nextval(regclass) function increments the sequence's counter regardless of whether the transaction succeeds or fails (you point that out yourself).
Update:
What I mean is that you shouldn't use sequences at all; in particular, the solution you propose doesn't eliminate the possibility of gaps. Your trigger gets a new sequence value, but the INSERT could still fail.
Sequences work this way because they are mainly meant for generating PRIMARY KEY and OID values, where uniqueness and a non-blocking mechanism are the ultimate goals and gaps between values are no big deal.
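To make the gap concrete, here is a toy model in Python rather than a real database. The names sequence, committed and insert_invoice are made up for this illustration, and itertools.count only stands in for a sequence: the point is that the number is consumed before we know whether the insert survives.

```python
import itertools

sequence = itertools.count(1)  # stand-in for a database sequence (assumption)
committed = []                 # invoice numbers that actually got stored


def insert_invoice(should_fail):
    number = next(sequence)  # like nextval(): consumed even if we fail later
    if should_fail:
        # "Rollback": the row is discarded, but the number is not given back.
        return None
    committed.append(number)
    return number


insert_invoice(False)
insert_invoice(True)   # this insert fails after taking number 2
insert_invoice(False)
```

After these three calls the stored numbers are 1 and 3, so a single failed insert is enough to leave a hole.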
In your case, however, the priorities may be different, and there are a couple of things to consider.
Simple solution
A first possible solution is to compute the new number as the maximum of the currently existing ones plus one. It can be done in your trigger:
NEW.invoice_number =
COALESCE(
(SELECT foo.invoice_number
FROM invoices foo
WHERE foo._year = NEW._year
ORDER BY foo.invoice_number DESC NULLS LAST LIMIT 1
), -1) + 1; /*query 1; the -1 default makes the first number of a year 0*/
This query can use your composite UNIQUE INDEX if it was created with the columns in the right order, with the year column first, e.g.:
CREATE UNIQUE INDEX invoice_number_unique
ON invoices (_year, invoice_number DESC NULLS LAST);
In PostgreSQL, UNIQUE CONSTRAINTs are implemented simply as UNIQUE INDEXes, so most of the time it makes no difference which command you use. However, the syntax shown above makes it possible to define the sort order of the index. It's a nice trick that can make /*query 1*/ quicker than a simple SELECT max(invoice_number) FROM invoices WHERE _year = NEW._year as the invoices table grows.
This is a simple solution, but it has one big drawback: a race condition is possible when two transactions try to insert an invoice at the same time. Both could read the same maximum value, and the UNIQUE CONSTRAINT will prevent the second one from committing. Despite that, it could be sufficient in a small system with a special insert policy.
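The race can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 module instead of PostgreSQL, and a single connection plays both sessions, so it only imitates the interleaving; the helper next_number is a made-up name. Both "sessions" read the maximum before either one writes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    " _year INTEGER NOT NULL,"
    " invoice_number INTEGER NOT NULL,"
    " UNIQUE (_year, invoice_number))"
)
conn.execute("INSERT INTO invoices VALUES (2024, 0)")


def next_number(year):
    # Same idea as query 1: highest existing number for the year, plus one.
    row = conn.execute(
        "SELECT COALESCE(MAX(invoice_number), -1) + 1"
        " FROM invoices WHERE _year = ?",
        (year,),
    ).fetchone()
    return row[0]


# Two "sessions" read before either one writes, so both see the same maximum.
number_a = next_number(2024)
number_b = next_number(2024)

conn.execute("INSERT INTO invoices VALUES (2024, ?)", (number_a,))
try:
    conn.execute("INSERT INTO invoices VALUES (2024, ?)", (number_b,))
    second_insert_failed = False
except sqlite3.IntegrityError:
    # The unique index rejects the duplicate; the caller would have to retry.
    second_insert_failed = True
```

The unique index keeps the data correct, but the losing session has to catch the error and try again.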
Better solution
You may create a table
CREATE TABLE invoice_numbers(
_year INTEGER NOT NULL PRIMARY KEY,
next_number_within_year INTEGER
);
to store the next available number for a given year. Then, in an AFTER INSERT trigger, you could:
Lock invoice_numbers so that no other transaction can even read the number: LOCK TABLE invoice_numbers IN ACCESS EXCLUSIVE MODE;
Get the new invoice number: new_invoice_number = (SELECT foo.next_number_within_year FROM invoice_numbers foo WHERE foo._year = NEW._year);
Update the number of the newly added invoice row
Increment the counter: UPDATE invoice_numbers SET next_number_within_year = next_number_within_year + 1 WHERE _year = NEW._year;
Because the table lock is held by the transaction until it commits, this should probably be the last trigger fired (read more about trigger execution order here)
Update:
Instead of locking the whole table with the LOCK command, check the link provided by Craig Ringer
The drawback in this case is a drop in INSERT performance: only one transaction at a time can perform an insert.
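The counter-table idea can be tried out with a small sketch. It uses Python's sqlite3 module rather than PostgreSQL and application code rather than a trigger, so BEGIN IMMEDIATE plays the role of the lock; the function create_invoice and the customer column are assumptions made for the example, and the row for the year is assumed to exist already.

```python
import sqlite3

# isolation_level=None lets us issue BEGIN/COMMIT/ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript(
    """
    CREATE TABLE invoice_numbers (
        _year INTEGER NOT NULL PRIMARY KEY,
        next_number_within_year INTEGER NOT NULL
    );
    CREATE TABLE invoices (
        _year INTEGER NOT NULL,
        invoice_number INTEGER NOT NULL,
        customer TEXT NOT NULL,
        UNIQUE (_year, invoice_number)
    );
    INSERT INTO invoice_numbers VALUES (2024, 0);
    """
)


def create_invoice(year, customer):
    """Allocate the next number and insert the invoice in one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # takes the write lock up front
    try:
        (number,) = conn.execute(
            "SELECT next_number_within_year FROM invoice_numbers"
            " WHERE _year = ?",
            (year,),
        ).fetchone()
        conn.execute(
            "UPDATE invoice_numbers SET next_number_within_year = ?"
            " WHERE _year = ?",
            (number + 1, year),
        )
        conn.execute(
            "INSERT INTO invoices VALUES (?, ?, ?)", (year, number, customer)
        )
        conn.execute("COMMIT")
        return number
    except sqlite3.Error:
        conn.execute("ROLLBACK")  # the counter increment is undone as well
        return None


first = create_invoice(2024, "alice")
failed = create_invoice(2024, None)  # violates NOT NULL, so it rolls back
second = create_invoice(2024, "bob")
```

Because the counter update lives in the same transaction as the insert, the failed attempt gives its number back and the two stored invoices end up numbered 0 and 1.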
I have got a table which has an id (primary key with auto increment), uid (key referring to a user's id, for example) and something else which won't matter for my question.
I want to make, let's call it, different auto-increment keys on id for each uid entry.
So, I will add an entry with uid 10, and the id field for this entry will be 1 because there were no previous entries with a value of 10 in uid. I will add a new one with uid 4 and its id will be 3 because there were already two entries with uid 4.
...Very obvious explanation, but I am trying to be as explanatory and clear as I can to demonstrate the idea... clearly.
What SQL engine can provide such a functionality natively? (non Microsoft/Oracle based)
If there is none, how could I best replicate it? Triggers perhaps?
Does this functionality have a more suitable name?
In case you know about a non-SQL database engine providing such a functionality, name it anyway, I am curious.
Thanks.
MySQL's MyISAM engine can do this. See their manual, in section Using AUTO_INCREMENT:
For MyISAM tables you can specify AUTO_INCREMENT on a secondary column in a multiple-column index. In this case, the generated value for the AUTO_INCREMENT column is calculated as MAX(auto_increment_column) + 1 WHERE prefix=given-prefix. This is useful when you want to put data into ordered groups.
The docs go on after that paragraph, showing an example.
The InnoDB engine in MySQL does not support this feature, which is unfortunate because it's better to use InnoDB in almost all cases.
You can't emulate this behavior using triggers (or any SQL statements limited to transaction scope) without locking tables on INSERT. Consider this sequence of actions:
Mario starts transaction and inserts a new row for user 4.
Bill starts transaction and inserts a new row for user 4.
Mario's session fires a trigger to compute MAX(id)+1 for user 4. It gets 3.
Bill's session fires a trigger to compute MAX(id)+1 for user 4. It also gets 3.
Bill's session finishes his INSERT and commits.
Mario's session tries to finish his INSERT, but the row with (userid=4, id=3) now exists, so Mario gets a primary key conflict.
In general, you can't control the order of execution of these steps without some kind of synchronization.
The solutions to this are either:
Get an exclusive table lock. Before trying an INSERT, lock the table. This is necessary to prevent concurrent INSERTs from creating a race condition like in the example above. It's necessary to lock the whole table because you're trying to restrict INSERT, so there's no specific row to lock (if you were trying to govern access to a given row with UPDATE, you could lock just that row). But locking the table makes access to the table serial, which limits your throughput.
Do it outside transaction scope. Generate the id number in a way that won't be hidden from two concurrent transactions. By the way, this is what AUTO_INCREMENT does. Two concurrent sessions will each get a unique id value, regardless of their order of execution or order of commit. But tracking the last generated id per userid requires access to the database, or a duplicate data store. For example, a memcached key per userid, which can be incremented atomically.
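As a rough illustration of the second option, here is an in-process counter in Python. The class PerKeyCounter is invented for this sketch, and a threading.Lock stands in for the atomic increment a store such as memcached would provide; it says nothing about how any particular store implements it.

```python
import threading
from collections import defaultdict


class PerKeyCounter:
    """Hands out 1, 2, 3, ... per key; a stand-in for an atomic increment."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = defaultdict(int)

    def next_id(self, key):
        # The lock makes read-increment-write a single atomic step.
        with self._lock:
            self._last[key] += 1
            return self._last[key]


counter = PerKeyCounter()
allocated = []


def worker():
    for _ in range(100):
        allocated.append(counter.next_id(4))  # every thread asks for uid 4


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every caller receives a distinct value no matter how the threads interleave, but a caller that takes a value and then abandons its insert still leaves a gap.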
It's relatively easy to ensure that inserts get unique values. But it's hard to ensure they will get consecutive ordinal values. Also consider:
What happens if you INSERT in a transaction but then roll back? You've allocated id value 3 in that transaction, and then I allocated value 4, so if you roll back and I commit, now there's a gap.
What happens if an INSERT fails because of other constraints on the table (e.g. another column is NOT NULL)? You could get gaps this way too.
If you ever DELETE a row, do you need to renumber all the following rows for the same userid? What does that do to your memcached entries if you use that solution?
SQL Server should allow you to do this. If you can't implement this using a computed column (probably not - there are some restrictions), surely you can implement it in a trigger.
MySQL also would allow you to implement this via triggers.
In a comment you ask the question about efficiency. Unless you are dealing with extreme volumes, storing an 8 byte DATETIME isn't much of an overhead compared to using, for example, a 4 byte INT.
It also massively simplifies your data inserts, as well as being able to cope with records being deleted without creating 'holes' in your sequence.
If you DO need this, be careful with the field names. If you have uid and id in a table, I'd expect id to be unique in that table, and uid to refer to something else. Perhaps, instead, use the field names property_id and amendment_id.
In terms of implementation, there are generally two options.
1). A trigger
Implementations vary, but the logic remains the same. As you don't specify an RDBMS (other than NOT MS/Oracle) the general logic is simple...
Start a transaction (often this is implicitly already started inside triggers)
Find the MAX(amendment_id) for the property_id being inserted
Update the newly inserted value with MAX(amendment_id) + 1
Commit the transaction
Things to be aware of are...
- multiple records being inserted at the same time
- records being inserted with amendment_id being already populated
- updates altering existing records
2). A Stored Procedure
If you use a stored procedure to control writes to the table, you gain a lot more control.
Implicitly, you know you're only dealing with one record.
You simply don't provide a parameter for DEFAULT fields.
You know what updates / deletes can and can't happen.
You can implement all the business logic you like without hidden triggers
I personally recommend the Stored Procedure route, but triggers do work.
It is important to get your data types right.
What you are describing is a multi-part key. So use a multi-part key. Don't try to encode everything into a magic integer, you will poison the rest of your code.
If a record is identified by (entity_id,version_number) then embrace that description and use it directly instead of mangling the meaning of your keys. You will have to write queries which constrain the version number but that's OK. Databases are good at this sort of thing.
version_number could be a timestamp, as a_horse_with_no_name suggests. This is quite a good idea. There is no meaningful performance disadvantage to using timestamps instead of plain integers. What you gain is meaning, which is more important.
You could maintain a "latest version" table which contains, for each entity_id, only the record with the most-recent version_number. This will be more work for you, so only do it if you really need the performance.
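A minimal sketch of the multi-part key, using Python's sqlite3 module; the table name amendments, the body column and the sample rows are assumptions made for the example. It shows the kind of query that constrains the version number to the latest one per entity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE amendments (
        entity_id INTEGER NOT NULL,
        version_number INTEGER NOT NULL,
        body TEXT NOT NULL,
        PRIMARY KEY (entity_id, version_number)
    );
    INSERT INTO amendments VALUES
        (10, 1, 'first draft'),
        (4, 1, 'initial'),
        (4, 2, 'revised');
    """
)

# For each entity, keep only the row with the highest version_number.
latest = conn.execute(
    """
    SELECT a.entity_id, a.version_number, a.body
    FROM amendments a
    WHERE a.version_number = (
        SELECT MAX(b.version_number)
        FROM amendments b
        WHERE b.entity_id = a.entity_id
    )
    ORDER BY a.entity_id
    """
).fetchall()
```

The composite primary key already provides the index this lookup needs, so a separate "latest version" table is only worth adding if this query turns out to be too slow.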
I would like to find out what the consequence is if you create a sequence after a table has been created and quite a bit of data has already been inserted.
(This is because PEAR's DataObject's insert() method sometimes skips incremental IDs.)
So here is an example to achieve this, but is this the correct way to do it after that amount of time has passed?
Table definition:
CREATE TABLE departments (
ID NUMBER(10) NOT NULL,
DESCRIPTION VARCHAR2(50) NOT NULL);
ALTER TABLE departments ADD (
CONSTRAINT dept_pk PRIMARY KEY (ID));
CREATE SEQUENCE dept_seq;
Trigger definition:
CREATE OR REPLACE TRIGGER dept_bir
BEFORE INSERT ON departments
FOR EACH ROW
BEGIN
SELECT dept_seq.NEXTVAL
INTO :new.id
FROM dual;
END;
If you mean that you already have data with the ID field inserted without using the trigger, the only thing you have to check is that the "start" of your sequence is at least the max existing ID + 1
CREATE SEQUENCE dept_seq
START WITH 2503
INCREMENT BY 1
Then it should be perfectly fine.
this is because PEAR's DataObject's insert() method sometimes skips incremental IDs
As a complement to Raphaël Althaus's answer, using a sequence will not guarantee anyhow that you don't have "holes" in the IDs. Think about concurrent access, or rollbacks.
To quote the documentation:
When a sequence number is generated, the sequence is incremented, independent of the transaction committing or rolling back. If two users concurrently increment the same sequence, then the sequence numbers each user acquires may have gaps, because sequence numbers are being generated by the other user.
There was an interesting answer to the same question on Asktom:
Sequences will never generate a gap free sequence of numbers.
[...]
You should never count on a sequence generating anything even close to a gap free
sequence of numbers. They are a high speed, extremely scalable multi-user way to
generate surrogate keys for a table.
[...] contigous sequences of numbers are pretty much impossible
with sequences (only takes but one rollback -- and those will happen).
In my database (MySQL) I have a table (MyISAM) containing a field called number. Each value of this field is either 0 or a positive number. The non-zero values must be unique. And the last thing is that the value of the field is generated in my PHP code according to the value of another field (called isNew) in this table. The code follows.
$maxNumber = $db->selectField('select max(number)+1 m from confirmed where isNew = ?', array($isNew), 'm');
$db->query('update confirmed set number = ? where dataid = ?', array($maxNumber, $id));
The first line of code selects the maximum value of the number field and increments it. The second line updates the record by setting its number to the freshly generated value.
This code is being used concurrently by hundreds of clients so I noticed that sometimes duplicates of the number field occur. As I understand this is happening when two clients read value of the number field almost simultaneously and this fact leads to the duplicate.
I have read about the SELECT ... FOR UPDATE statement but I'm not quite sure it is applicable in my case.
So the question is should I just append FOR UPDATE to my SELECT statement? Or create a stored procedure to do the job? Or maybe completely change the way the numbers are being generated?
This is definitely possible to do. MyISAM doesn't offer transaction locking so forget about stuff like FOR UPDATE. There's definitely room for a race condition between the two statements in your example code. The way you've implemented it, this one is like the talking burro. It's amazing it works at all, not that it works badly! :-)
I don't understand what you're doing with this SQL:
select max(number)+1 m from confirmed where isNew = ?
Are the values of number unique throughout the table, or only within sets where isNew has a certain value? Would it work if the values of number were unique throughout the table? That would be easier to create, debug, and maintain.
You need a multi-connection-safe way of getting a number.
You could try this SQL. It will do the setting of the max number in one statement.
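UPDATE confirmed
SET number = (SELECT m FROM
-- the derived table t avoids MySQL error 1093 (target table read in a subquery)
(SELECT 1 + MAX(number) AS m FROM confirmed WHERE isNew = ?) AS t)
WHERE dataid = ?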
This will perform badly. Without a compound index on (isNew, number), and without both those columns declared NOT NULL it will perform very very badly.
If you can use numbers that are unique throughout the table I suggest you create for yourself a sequence setup, which will return a unique number each time you use it. You need to use a series of consecutive SQL statements to do that. Here's how it goes.
First, when you create your tables create yourself a table to use called sequence (or whatever name you like). This is a one-column table.
CREATE TABLE sequence (
sequence_id INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`sequence_id`)
) AUTO_INCREMENT = 990000
This will make the sequence table start issuing numbers at 990,000.
Second, when you need a unique number in your application, do the following things.
INSERT INTO sequence () VALUES ();
DELETE FROM sequence WHERE sequence_id < LAST_INSERT_ID();
UPDATE confirmed
SET number = LAST_INSERT_ID()
WHERE dataid = ?
What's going on here? The MySQL function LAST_INSERT_ID() returns the value of the most recent autoincrement-generated ID number. Because you inserted a row into that sequence table, it gives you back that generated ID number. The DELETE FROM command keeps that table from snarfing up disk space; we don't care about old ID numbers.
LAST_INSERT_ID() is connection-safe. If software on different connections to your database uses it, they all get their own values.
If you need to know the last inserted ID number, you can issue this SQL:
SELECT LAST_INSERT_ID() AS sequence_id
and you'll get it returned.
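The same pattern can be tried locally with Python's sqlite3 module. This is only an analogue: SQLite's AUTOINCREMENT and the cursor's lastrowid stand in for MySQL's AUTO_INCREMENT and LAST_INSERT_ID(), the numbering starts at 1 rather than 990,000, and next_sequence_value is a made-up helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence (sequence_id INTEGER PRIMARY KEY AUTOINCREMENT)"
)


def next_sequence_value():
    cur = conn.execute("INSERT INTO sequence DEFAULT VALUES")
    new_id = cur.lastrowid  # plays the role of LAST_INSERT_ID()
    # Discard older rows; AUTOINCREMENT still never reuses their ids.
    conn.execute("DELETE FROM sequence WHERE sequence_id < ?", (new_id,))
    conn.commit()
    return new_id


values = [next_sequence_value() for _ in range(3)]
rows_left = conn.execute("SELECT COUNT(*) FROM sequence").fetchone()[0]
```

Each call returns a new, larger number while the table itself never holds more than one row.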
If you were using Oracle or PostgreSQL, instead of MySQL, you'd find they provide SEQUENCE objects that basically do this.
Here's the answer to another similar question.
Fastest way to generate 11,000,000 unique ids
There is a large table that holds millions of records. phpMyAdmin reports 1.2G size for the table.
There is a calculation that needs to be done for every row. The calculation is not simple (cannot be put in set col= calc format), it uses a stored function to get the values, so currently we have for each row a single update.
This is extremely slow and we want to optimize it.
Stored function:
https://gist.github.com/a9c2f9275644409dd19d
And this is called by this method for every row:
https://gist.github.com/82adfd97b9e5797feea6
This is performed on a non-live server, and it is usually updated once per week.
What options do we have here?
Why not set up a separate table to hold the computed values, to take the load off your current table? It can have two columns: the primary key for each row in your main table and a column for the computed value.
Then your process can be:
a) Truncate computedValues table - This is faster than trying to identify new rows
b) Compute the values and insert into the computed values table
c) Whenever you need your computed values, you join to the computedValues table using a primary key join, which is fast, and if you need more computations you just add new columns.
d) You can also update the main table using the computed values if you have to
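Steps a) to c) might look roughly like this with Python's sqlite3 module. The table main_table, its gallons and people columns, and expensive_calculation are placeholders invented for the sketch; the real stored function is not shown in the question, so a simple division stands in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE main_table (
        id INTEGER PRIMARY KEY, gallons REAL, people INTEGER
    );
    CREATE TABLE computedValues (id INTEGER PRIMARY KEY, computed REAL);
    INSERT INTO main_table VALUES (1, 300.0, 3), (2, 100.0, 4);
    """
)


def expensive_calculation(gallons, people):
    # Placeholder for the real stored-function logic, which is not shown here.
    return gallons / people


def rebuild_computed_values():
    # a) Empty the table (SQLite has no TRUNCATE, so DELETE is used instead).
    conn.execute("DELETE FROM computedValues")
    # b) Compute every value and insert the results in one batch.
    rows = conn.execute("SELECT id, gallons, people FROM main_table").fetchall()
    conn.executemany(
        "INSERT INTO computedValues VALUES (?, ?)",
        [(row_id, expensive_calculation(g, p)) for row_id, g, p in rows],
    )
    conn.commit()


rebuild_computed_values()

# c) Read the values back through a primary key join.
joined = conn.execute(
    "SELECT m.id, c.computed FROM main_table m"
    " JOIN computedValues c ON c.id = m.id ORDER BY m.id"
).fetchall()
```

Inserting the results in one batch replaces millions of single-row UPDATE statements, which is usually where the time goes.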
Well, the problem doesn't seem to be the UPDATE query because no calculations are performed in the query itself. As it seems the calculations are performed first and then the UPDATE query is run. So the UPDATE should be quick enough.
When you say "this is extremely slow", I assume you are not referring to the UPDATE query but the complete process. Here are some quick thoughts:
As you said, there are millions of records, and updating that many entries is always time-consuming. If there are many columns and indexes defined on the table, that adds to the overhead.
I see that there are many REPLACE INTO queries in the function getNumberOfPeople(). These might well be a reason for the slow process. Have you checked how efficient these REPLACE INTO queries are? Can you try removing them and then see if it has any impact on the UPDATE process?
There are a couple of SELECT queries too in getNumberOfPeople(). Check if they might be impacting the process and if so, try optimizing them.
In procedure updateGPCD(), you may try replacing SELECT COUNT(*) INTO _has_breakdown with SELECT COUNT(1) INTO _has_breakdown. In the same query, the WHERE condition is reading _ACCOUNT but this will fail when _ACCOUNT = 0, no?
On another suggestion, if it is the UPDATE that you think is slow because of reason 1, it might make sense to move the updating column gpcd outside usage_bill to another table. The only other column in the table should be the unique ID from usage_bill.
Hope the above makes sense.
Okay, so let's say I have a mysql database table with two columns, one is for id and the other is for password. If I have three rows of data and the id values go from 1 to 3 and I delete row 3 and then create another row of data, I will see id=4 instead of id=3 on the newly created row. I know this has to do with the auto increment value but I was wondering if I can add some code in a php file that will automatically reset all the id numbers such that you start at id=1 and go up to the last id number in increments of 1 after a row has been deleted?
My goal is to create a form where the user enters a password and the system will match the password with a password value in the database. If there is a match, the row with the matched password will be deleted and the column of id numbers will be reordered such that no id numbers are skipped.
Update: I'm making a rotating banner ad system by setting a random number from 1 to 4 to a variable so that the php file will retrieve a random ad from id=1 to id=4 by using the random number variable. If the random number happens to be 3 and id=3 does not exist, there will be a gap in the row of banner ads. If there is a way to work around big gaps in this situation, please tell me. thanks in advance
Just execute the following SQL query:
ALTER TABLE `tbl_name` AUTO_INCREMENT = 1;
…but it sounds like a terrible idea, so don't do it. Why is the value of your primary key so important? Uniqueness is far more important, and resetting it undermines that.
You can only use
ALTER TABLE `tbl` AUTO_INCREMENT=#
to reset the counter to a number above the highest existing value. If you have 1, 2, 3, and you delete 2, you cannot use this to fill 2. If you delete 3, you could use this to re-use 3 (assuming you haven't inserted anything higher). That is the best you can do.
ALTER TABLE `table` AUTO_INCREMENT = 1;
However, running this code is not the best idea. There is something wrong with your application if you depend on the column having no gaps. Are you trying to count the number of users? If so, use COUNT(id). Are you trying to deal with other tables? If so, use a foreign key.
If you are dead set on doing this the Wrong Way, you could look for the lowest free number and do the incrementing on your own. Keep in mind the race conditions involved, however.
Also, keep in mind that if you change the actual numbers in the database you will need to change all references to it in other tables and in your code.
Well, you can actually just specify the id number you'd like a record to have as part of your insert statement, for example:
INSERT INTO person VALUES(1,'John','Smith','jsmith#devnull.fake','+19995559999');
And if there's not a primary key collision (no record in the database with id=1), then MySQL will happily execute it.
The ALTER TABLE `tbl` AUTO_INCREMENT=# thing also works, and means you don't have to keep track of the counter.
While you're thinking about this, though, you might want to read some of the discussion on natural vs surrogate keys. The idea of having your id # be specifically important is a bit unusual and might be a sign of a troubled design.
You could do that by:
Inventing a mechanism that provides the next available id when you want to insert (e.g. a transaction involving reading and incrementing an integer column somewhere -- pay special attention to the transaction isolation level!)
Using UPDATE to decrement all ids greater than the one you just deleted (again, with a transaction -- don't forget that foreign keys must be ON UPDATE CASCADE!)
But it begs the question: why do you want this? Is it going to be worth the trouble?
It's almost certain that you can achieve whatever your goal is without such witchery.
Update (to address comment):
To select a random number of rows, you can do e.g. in MySQL
SELECT id FROM banners ORDER BY RAND() LIMIT 5
to select 5 random, guaranteed existing banner ids.
A word of caution: there are quite a few people who view ORDER BY RAND() as a performance hog. However, it is IMHO not quite right to put every case in the same basket. If the number of rows in the table is manageable (I would consider anything below 10K to be not that many) then ORDER BY RAND() provides a very nice and succinct solution. Also, the documentation itself suggests this approach:
However, you can retrieve rows in random order like this:
mysql> SELECT * FROM tbl_name ORDER BY RAND();
ORDER BY RAND() combined with LIMIT is useful for selecting a random sample from a set of rows:
mysql> SELECT * FROM table1, table2 WHERE a=b AND c ORDER BY RAND() LIMIT 1000;
RAND() is not meant to be a perfect random generator. It is a fast way to generate random numbers on demand that is portable between platforms for the same MySQL version.
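For the banner case, a small sketch with Python's sqlite3 module shows why gaps stop mattering once the database picks the row. SQLite spells the function RANDOM() where MySQL uses RAND(); the banners table contents and the helper random_banner_id are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE banners (id INTEGER PRIMARY KEY, image TEXT)")
# id 3 is deliberately missing, as if that banner had been deleted.
conn.executemany(
    "INSERT INTO banners VALUES (?, ?)",
    [(1, "a.png"), (2, "b.png"), (4, "d.png")],
)


def random_banner_id():
    # SQLite spells the function RANDOM(); MySQL's equivalent is RAND().
    row = conn.execute(
        "SELECT id FROM banners ORDER BY RANDOM() LIMIT 1"
    ).fetchone()
    return row[0]


picked = {random_banner_id() for _ in range(200)}
```

However many times it is called, the function can only return ids that exist in the table, so the deleted id 3 never comes up.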