Assume that there are around 200,000 records in a database.
Here the do-while loop may execute up to 200,000 times, which means it may call the DB 200,000 times.
My Question: Is this the correct way to do so? Is there a better way?
function get_new_key() {
    do {
        $new_key = 'some_xyz'; // generate a candidate key
    } while ( function_call_to_check_newKey_exists_in_db($new_key) ); // retry while the key already exists
    return $new_key;
}
Option 1) Use a surrogate key. Natural keys have their uses, but surrogate keys have many benefits, flexibility for one. Some SQL servers can generate non-sequential auto-generated surrogate keys.
Option 2) Use http://php.net/manual/en/function.uniqid.php and rely on a try/catch scenario.
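A minimal sketch of that idea, assuming a PDO connection with exceptions enabled and a UNIQUE index on the key column (my_table and my_key are placeholder names, not from the question):

function get_unique_key(PDO $pdo) {
    // Assumes PDO::ERRMODE_EXCEPTION is set and my_table.my_key has a UNIQUE index
    while (true) {
        $candidate = uniqid('', true);   // more entropy with the second argument
        try {
            $stmt = $pdo->prepare('INSERT INTO my_table (my_key) VALUES (?)');
            $stmt->execute([$candidate]);
            return $candidate;           // insert succeeded, so the key is unique
        } catch (PDOException $e) {
            if ($e->getCode() != '23000') {
                throw $e;                // not a duplicate-key error, rethrow
            }
            // duplicate key: loop and generate another candidate
        }
    }
}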
Overall, since you're pumping in around 200,000 records, you may be better off placing them in a text file and using the database's import function.
You would do well to explain how the new_key is generated; without that information there's little we can do to help.
If it is, for example, an incremental number, then it would be better to choose MAX(columnkey)+1 (with the table locked to prevent conflicts), or better still, declare it PRIMARY KEY AUTO_INCREMENT and insert the NULL value for the key. This will set the key automatically. Recover its value with LAST_INSERT_ID():
INSERT INTO ... (all columns except key) VALUES (all values)
SELECT LAST_INSERT_ID() AS newkey;
Or you could use a timestamp of sufficient granularity.
If you want to avoid "holes" in the numbering (not necessarily a bad practice, but please have a better reason to do so than "it looks nicer"!), you can LEFT JOIN the table to itself on condition that the right-hand table ID is one greater than the left-hand side. Those rows where the right side is NULL indicate that their counterpart ID is available (again, lock the tables - in this case, with MySQL, twice):
-- You need an index on mykey
SELECT a.mykey+1 AS available
FROM mytable AS a
LEFT JOIN mytable AS b ON (a.mykey+1 = b.mykey)
WHERE b.mykey IS NULL
ORDER BY a.mykey LIMIT 20
Related
I have found two different ways to, first, get the next invoice number and, then, save the invoice in a multi-tenant database where, of course, each tenant will have its own invoices with independent incremental numbers.
My first (and current) approach is this (it works fine):
Add a new record to the invoices table; the invoice number doesn't matter yet (for example, 0 or empty)
I get the unique ID of THAT created record after the insert
Now I do a "SELECT ... FROM table WHERE ID = $lastcreatedID FOR UPDATE"
Here I get the latest saved invoice number with "SELECT @A:=MAX(NUMBER)+1 FROM TABLE WHERE ..."
Finally I update the previously saved record with that invoice number with an "UPDATE table SET NUMBER = $mynumber WHERE ID = $lastcreatedID"
This works fine, but I don't know if the FOR UPDATE is really needed, or whether this is the correct way to do it in a multi-tenant DB in terms of performance, etc.
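For reference, a rough SQL sketch of the flow above (table and column names are only illustrative):

START TRANSACTION;
INSERT INTO invoices (tenant_id, number) VALUES (42, 0);       -- step 1: number not known yet
SET @new_id = LAST_INSERT_ID();                                 -- step 2: id of that record
SELECT number FROM invoices WHERE id = @new_id FOR UPDATE;      -- step 3: lock the new row
SELECT COALESCE(MAX(number), 0) + 1 INTO @next
  FROM invoices WHERE tenant_id = 42;                           -- step 4: next number for this tenant
UPDATE invoices SET number = @next WHERE id = @new_id;          -- step 5: store the invoice number
COMMIT;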
The second (and simpler) approach is this (and works too, but I don't know if it is a secure approach):
INSERT INTO table (NUMBER,TENANT) SELECT COALESCE(MAX(NUMBER),0)+1,$tenant FROM table WHERE....
That's it
Both methods are working, but I would like to know the differences between them regarding speed, performance, if it may create duplicates, etc.
Or... is there any better way to do this?
I'm using MySQL and PHP. The application is an invoice/sales cloud software that will be used by a lot of customers (tenants).
Thanks
Regardless of whether you're using these values as database IDs or not, re-using IDs is virtually guaranteed to cause problems at some point. Even if you're not re-using IDs, you're going to run into the case where two invoice creation requests run at the same time and get the same MAX()+1 result.
To get around all this you need to implement a simple sequence generator that locks its storage while a value is being issued. E.g.:
CREATE TABLE client_invoice_serial (
-- note: also FK this back to the client record
client_id INTEGER UNSIGNED NOT NULL PRIMARY KEY,
serial INTEGER UNSIGNED NOT NULL DEFAULT 0
);
$dbh = new PDO('mysql:...');
/* this defaults to 'on', making every query an implicit transaction. it needs to
be off for this. you may or may not want to set this globally, or just turn it off
before this, and back on at the end. */
$dbh->setAttribute(PDO::ATTR_AUTOCOMMIT,0);
// Simple best practice: ensures that SQL errors MUST be dealt with. The try/catch below assumes this is enabled.
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->beginTransaction();
try {
// the below will lock the selected row
$select = $dbh->prepare("SELECT * FROM client_invoice_serial WHERE client_id = ? FOR UPDATE;");
$select->execute([$client_id]);
if( $select->rowCount() === 0 ) {
$insert = $dbh->prepare("INSERT INTO client_invoice_serial (client_id, serial) VALUES (?, 1);");
$insert->execute([$client_id]);
$invoice_id = 1;
} else {
$invoice_id = $select->fetch(PDO::FETCH_ASSOC)['serial'] + 1;
$update = $dbh->prepare("UPDATE client_invoice_serial SET serial = serial + 1 WHERE client_id = ?");
$update->execute([$client_id]);
}
$dbh->commit();
} catch(\PDOException $e) {
// make sure that the transaction is cleaned up ASAP, then let the exception bubble up into your general error handling.
$dbh->rollback();
throw $e; // or throw a more pertinent error/exception of your choosing.
}
// both committing and rolling back will release the lock
At a very basic level this is what MySQL is doing in the background for AUTO_INCREMENT columns.
Do not use MAX(id)+1. It will, someday, bite you. There will be two invoices with the same number, and it will take us a few paragraphs to explain why it happened.
Instead, use AUTO_INCREMENT the way it is intended.
INSERT INTO Invoices (id, ...) VALUES (NULL, ...);
SELECT LAST_INSERT_ID(); -- specific to the connection
That is safe even when multiple connections are doing the same thing. No FOR UPDATE, no BEGIN, etc is necessary. (You may want such for other purposes.)
And, never delete rows. Instead, use the standard business practice of invalidating bad invoices. Imagine being audited.
All that said, there is still a potential problem. After a ROLLBACK or system crash, an id may be "burned". Also things like INSERT IGNORE allocate the id before checking to see whether it will be needed.
If you can live with the caveats, use AUTO_INCREMENT.
If not, then create a 1-row, 2-column table to simulate a sequence number generator: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#sequence
Or use MariaDB's SEQUENCE
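A minimal sketch of such a sequence-emulation table (names are illustrative, not taken from the linked page):

CREATE TABLE invoice_seq (
  id TINYINT UNSIGNED NOT NULL PRIMARY KEY,   -- always 1, so the table stays at a single row
  next_val INT UNSIGNED NOT NULL
) ENGINE=InnoDB;
INSERT INTO invoice_seq (id, next_val) VALUES (1, 0);

-- Claim the next value atomically; LAST_INSERT_ID(expr) makes it readable on this connection
UPDATE invoice_seq SET next_val = LAST_INSERT_ID(next_val + 1) WHERE id = 1;
SELECT LAST_INSERT_ID() AS next_invoice_number;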
Both approaches work, but each has its own drawbacks in high-traffic situations.
The first approach runs 3 queries for every invoice you create, putting extra load on your server.
The second approach can lead to duplicates when two invoices are generated almost simultaneously (such that the SELECT returns the same max number for both).
Either way, both approaches may run into problems under high-traffic conditions.
Two solutions to the problems are listed below:
Use generated columns: MySQL supports generated columns, which are derived from the values of other columns in the same row; see the MySQL documentation on generated columns.
Calculate invoice number on the fly: Since you're using the primary key as part of the invoice, let the DB handle generating unique primary keys, and then generate invoice numbers on the fly in your business logic using the id of each invoice, as sketched below.
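For example, a purely illustrative PHP sketch (the prefix and padding format are assumptions, not part of the answer):

// Derive a display invoice number from the auto-increment id at read time
function invoice_number(int $id, string $tenant_prefix): string {
    return sprintf('%s-%06d', $tenant_prefix, $id);   // e.g. "ACME-000042"
}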
I am using the commonly known reddit 'hot' algorithm on my table 'posts'. Now this hot column is a decimal number like this: 'XXXXX,XXXXXXXX'
I want this column to be an index, because when I order by 'hot', I want the query to be as fast as possible. However, I am kind of new to indexes. Does an index need to be unique?
If it has to be unique, would this work and be efficient?
$table->unique('id', 'hot');
If it does not have to be unique, would this be the right approach?
$table->index('hot');
Last question: would the following query be taking advantage of the index?
Post::orderBy('hot', 'desc')->get()
If not, how should I modify it?
Thank you very much!
Do not make it UNIQUE unless you need the constraint that you cannot insert duplicates.
Phrased differently, a UNIQUE key is two things: an INDEX (for speedy searching) and a "constraint" (to give an error when trying to insert a dup).
ORDER BY hot DESC can use INDEX(hot) (or UNIQUE(hot)). I say "can", not "will", because there are other issues where the Optimizer may decide not to use the index. (Without seeing the actual query and knowing more about the dataset, I can't be more specific.)
If id is the PRIMARY KEY, then neither of these is of any use: INDEX(id, hot); UNIQUE(id, hot). Swapping the order of the columns makes sense. Or simply INDEX(hot).
A caveat: EXPLAIN does not say whether the index is used for ORDER BY, only for WHERE. On the other hand, EXPLAIN FORMAT=JSON does give more details. Try that.
(Yes, DECIMAL columns can be indexed.)
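For example, assuming the underlying table is posts (matching the Post model in the question):

-- Shows whether the sort can avoid a filesort; look for "ordering_operation" in the JSON output
EXPLAIN FORMAT=JSON SELECT * FROM posts ORDER BY hot DESC;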
I realized that the response to a MySQL query becomes much faster when you create an index on the column used in ORDER BY, e.g.
SELECT username FROM table ORDER BY registration_date DESC
Now I'm wondering which indices I should create to optimize the request time.
For example I frequently use the following queries:
SELECT username FROM table WHERE
registration_date > ".(time() - 10000)."
SELECT username FROM table WHERE
registration_date > ".(time() - 10000)."
&& status='active'
SELECT username FROM table WHERE
status='active'
SELECT username FROM table ORDER BY registration_date DESC
SELECT username FROM table WHERE
registration_date > ".(time() - 10000)."
&& status='active'
ORDER BY birth_date DESC
Question 1:
Should I set up separate indices for the first three request types? (i.e. one index for the column "registration_date", one index for the column "status", and another index for the combination of both?)
Question 2:
Are different indices independently used for "WHERE" and for "ORDER BY"? Say, I have a combined index for the columns "status" and "registration_date", and another index only for the column "birth_date". Should I setup another combined index for the three columns ("status", "registration_date" and "birth_date")?
There are no hard-and-fast rules for indices or query optimization. Each case needs to be considered and examined.
Generally speaking, however, you can and should add indices to columns that you frequently sort by or use in WHERE clauses. (Answer to Question 2: no, the same indices can potentially be used for both WHERE and ORDER BY.) Whether to use a multi-column index or a single-column one depends on the frequency of the queries. Also note that single-column indices may be combined by MySQL using the Index Merge optimization:
The Index Merge method is used to retrieve rows with several range scans and to merge their results into one. The merge can produce unions, intersections, or unions-of-intersections of its underlying scans. This access method merges index scans from a single table; it does not merge scans across multiple tables.
(more reading: http://dev.mysql.com/doc/refman/5.0/en/index-merge-optimization.html)
Multi-column indices also require that you take care to structure your queries in such a way that your use of indexed columns matches the column order in the index:
MySQL cannot use an index if the columns do not form a leftmost prefix of the index. Suppose that you have the SELECT statements shown here:
SELECT * FROM tbl_name WHERE col1=val1;
SELECT * FROM tbl_name WHERE col1=val1 AND col2=val2;
SELECT * FROM tbl_name WHERE col2=val2;
SELECT * FROM tbl_name WHERE col2=val2 AND col3=val3;
If an index exists on (col1, col2, col3), only the first two queries use the index. The third and fourth queries do involve indexed columns, but (col2) and (col2, col3) are not leftmost prefixes of (col1, col2, col3).
Bear in mind that indices DO have a performance cost of their own -- it is possible to "over-index" a table. Each time a record is inserted or an indexed column is modified, the index/indices have to be updated. This demands resources, and depending on the size and structure of your table, it may cause a decrease in responsiveness while the index maintenance is active.
Use EXPLAIN to find out exactly what is happening in your queries. Analyze, experiment, and don't over-do it. The shotgun approach is not appropriate for database optimization.
Documentation
MySQL EXPLAIN - http://dev.mysql.com/doc/refman/5.0/en/explain.html
How MySQL uses indices - http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
Index Merge Optimization - http://dev.mysql.com/doc/refman/5.0/en/index-merge-optimization.html
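Applied to the queries in the question, one possible set of indices, keeping the leftmost-prefix rule in mind, might be (the column order here is a judgment call, not the only correct choice):

-- Equality column first, then the range column: helps queries 2, 3 and 5 (WHERE part)
ALTER TABLE `table` ADD INDEX idx_status_regdate (status, registration_date);
-- For queries 1 and 4 (range filter / ORDER BY on registration_date alone)
ALTER TABLE `table` ADD INDEX idx_regdate (registration_date);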
To quote this page:
[Indices] will slow down your updates and inserts.
That's the tradeoff you have to calculate. To optimize your table, you should put indices only on the columns you are most likely to apply conditions to - the more indices you have, the slower your data-changing operations become. In that sense, I personally don't see much merit in creating combined indices - if you create all 7 possible combinations of indices for 3 columns, you are most definitely putting more drag on your updates and inserts than just using 3 indices for 3 columns (and even that can be debatable). On the other hand, if the data is edited much, much less than it is SELECTed, then indices can really help you speed things up.
Something else to take into consideration (again quoting the above page):
If your table is very small [...] it's worse to use an index than to leave it out and just let it do a table scan. Indexes really only come in handy with tables that have a lot of rows.
Yes, it is a good idea to have indexes on the columns you use often, both in ORDER BY and in your WHERE clauses.
But be aware: UPDATEs, INSERTs and DELETEs slow down if you have indexes.
That is because after such an operation, the index must be updated too.
So, as a rule-of-thumb: If your application is read-intensive, use the indexes where you think they help.
If your application is often updating the data, be careful, because that may get slow because of the indexes.
When in doubt, simply get your hands dirty and study the output of EXPLAIN.
http://dev.mysql.com/doc/refman/5.6/en/explain.html
As for the first two examples, you can satisfy them with one index: {registration_date, status}. Such an index can support filters on the first item (registration_date) or on both.
It does not work for status alone, however. The question for status is how selective it is, that is, what proportion of records have status = 'active'. If that proportion is high (so that, on average, every database page contains an active record), then an index may not help very much.
The ORDER BYs are trickier. I don't know if MySQL uses indexes for this purpose. Often, using an index for sorting entire records is less efficient than just sorting the records: using the index causes a random access pattern to the records in the pages, which can cause major performance problems for tables larger than the page cache.
Use EXPLAIN on your SELECT statements to determine where your joins are slowing down (the more rows that are referenced, the slower it will be). Then apply your indices to those columns.
EXPLAIN SELECT * FROM table JOIN table2 ON a = b WHERE conditions;
Say I have a table with 1000s of records and I only want to update one record. Does it make the query faster if I specify more 'WHERE' clauses to narrow down the search to fewer possible matching records, or is it faster to just state one WHERE clause (eg. recordID)?
EXAMPLE:
UPDATE table
SET record_name = 'new name'
WHERE record_ID = 'x'
LIMIT 1
or
UPDATE table
SET record_name = 'new name'
WHERE record_ID = 'foo'
AND record_city = 'bah'
LIMIT 1
Assuming record_ID is a primary key, this is definitely the fastest way to update a single record.
If record_ID is the primary key, the first solution is the fastest. I think it is not necessary to add "LIMIT 1", because uniqueness is already implied by the primary key.
It's also worth noting that the two commands are not actually functionally equivalent.
You should always run the query, which does what you want to do.
The optimizer in the database will figure it out.
In this case, even if you specify both conditions, the engine still has to locate the row in order to update it; evaluating an extra WHERE condition against a column is a very minor penalty compared to navigating to the row in the first place.
Note also that using LIMIT without an ORDER BY will break serialization for replication and restoration of binary logs for point in time recovery in STATEMENT mode.
In this case it's better to specify everything that's needed in the WHERE clause and drop the LIMIT.
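Reusing the question's own example, the replication-safe form would simply be:

UPDATE table
SET record_name = 'new name'
WHERE record_ID = 'x';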
I've got two MySQL queries that both insert data into a table. Both have the following format:
CREATE TABLE IF NOT EXISTS `data` (
`id` BIGINT NOT NULL AUTO_INCREMENT UNIQUE,
PRIMARY KEY (`id`)
)
SELECT `field1`, `field2`
WHERE `active` = 1
The only differences between the two queries are how field1 and field2 are determined, and some minor differences in the conditions clause. Both return 12K records or more.
Now, what will be more efficient:
A. Run both queries separately:
if (mysql_query($query1)) {
return mysql_query($query2);
}
return false;
B. OR combine the two queries with a UNION, and run once:
$query = 'SELECT `field1`, `field2` WHERE `active` = 1
UNION
SELECT DO_ONE(`field1`), DO_TWO(`field2`) WHERE `active` = 1
ORDER BY `field1`';
return mysql_query('CREATE TABLE IF NOT EXISTS `data` (
`id` BIGINT NOT NULL AUTO_INCREMENT UNIQUE,
PRIMARY KEY (`id`)
) ' . $query)
The data from the one query is useless without the data from the other, so both need to succeed. DO_ONE and DO_TWO are user defined MySQL functions that change the field data according to some specs.
Aaronmccall's answer is probably the best in general -- the UNION approach does it all in one SQL call. In general that will be the most "efficient", but there could be side issues that could come into play and affect the measure of "efficient" for your particular application.
Specifically, if the UNION requires a temporary table to gather the intermediate results and you are working with very large sets of data, then doing two separate straight SELECTs into the new table might turn out being more efficient in your particular case. This would depend on the internal workings, optimizations done, etc within the database engine (which could change depending on the version of the database engine you are using).
Ultimately, the only way to answer your question on such a specific question like this might be to do timings for your particular application and environment.
You also might want to consider that the difference between the time required for two separate queries vs an "all in one" query might be insignificant in the grand scheme of things... you are probably talking about a difference of a few milliseconds (or even microseconds?) unless your mysql database is on a separate server with huge latency issues. If you are doing thousands of these calls in one shot, then the difference might be significant, but if you are only doing one or two of these calls and your application is spending 99.99% of its time executing other things, then the difference between the two probably won't even be noticed.
---Lawrence
The UNION approach should definitely be faster due to the expense of making two mysql api calls from php vs. one.
Your options do different things. The first one returns the results of the second query if the first query executes correctly (which, by the way, is independent of the results it returns - it can return an empty rowset). The second one returns the results of the first query and the second query together. The first option seems pretty useless to me; probably what you want to achieve is what you did with the UNION (unless I misunderstood you).
EDIT: After reading your comment, I think you are after something like this:
SELECT TRUE WHERE (EXISTS (SELECT field1, field2 ...) AND EXISTS (SELECT field1, field2 ...)).
That way you will have only one query to the DB, which scales better, takes fewer resources from the connection pool, and doesn't double the impact of latency if your DB engine is on a different server; but you will still abort the query if the first condition fails, which is the performance improvement you were looking for with the nested separate queries.
As an optimization, try to put the condition that executes faster first, in case they are not the same. I assume the one that requires those field calculations would be the slower one.