I have a table that contains invoices for several companies, each company needs to have their own incrementing invoice number system.
id | invoiceId | companyId
--------------------------
1 | 1 | 1
2 | 2 | 1
3 | 1 | 2
4 | 1 | 3
I was hoping to achieve this with a unique compound key similar to this approach for MyISAM outlined here, but it seems it is not possible with InnoDB.
I need to return the new ID immediately after insertion and have concerns about creating a race condition if I try and achieve this with PHP.
Is my best option to create a trigger and if yes what would that look like? I have no experience with triggers and my research into using an after insert trigger has me worried with this quote from the MariaDB documentation:
RESTRICTIONS
You can not create an AFTER trigger on a view. You can not update the
NEW values. You can not update the OLD values.
Thanks for any advice
You need to add a unique index besides getting your next value. The next value is best gotten by querying the table with a trigger or by some procedure within a transaction. The remark of trigger on a view is not relevant in that case.
The unique index is on companyId,invoiceId is required to prevent two insert processes running on the same company adding an invoice, which then can end up both with the same invoiceId. Even better is when you switch to InnoDB so you can use transactions: Then 2 processes started at virtually the same time can benefit from transaction isolation with as result that they will be serialized and you get 2 unique incrementing invoice ids returned without having to handle the unique index exception in your code.
As far as I know, mysql's last_id is connection based, and not global and shared between processes.
using this simple script I've validated my concerns
(note this is codeigniter syntax, but it's relatively easy to understand)
I accessed the following function twice, within 10 seconds of each other(in different browser windows)
function test(){
$arr['var'] = rand(0,100000000);
$this->db->insert('test_table',$arr);
sleep(30);
$id = $this->db->insert_id();
var_dump($id);
}
Interesting to note, instead of getting "2" as a response in both of them, I've gotten 1 and two respectfully. This makes even more sense when you look at the underlying function
function insert_id()
{
return #mysqli_insert_id($this->conn_id);
}
This solves the returned ID, Your race condition is the product of the underlying query, which is basically "Select MAX(invoiceId ) WHERE companyID = X" and add +1 to that, and insert it.
This should be possible with a table lock before insert, however this depends on how many times per second you expect this table to get updated.
note, on persistent connection the last_insert_id might work differently, I haven't tested it.
Related
For an MySQL table I am using the InnoDB engine and the structure of my tables looks like this:
Table user
id | username | etc...
----|------------|--------
1 | bruce | ...
2 | clark | ...
3 | tony | ...
Table user-emails
id | person_id | email
----|-------------|---------
1 | 1 | bruce#wayne-ent.com
2 | 1 | ceo#wayne-ent.com
3 | 2 | clark.k#daily-planet.com
To fetch data from the database I've written a tiny framework. E.g. on __construct($id) it checks if there is a person with the given id, if yes it creates the corresponding model and saves only the field id to an array. During runtime, if I need another field from the model it fetches only the value from the database, saves it to the array and returns it. E.g. same with the field emails for that my code accesses the table user-emails and get all the emails for the corresponding user.
For small models this works alright, but now I am working on another project where I have to fetch a lot of data at once for a list and that takes some time. Also I know that many connections to MySQL and many queries are quite stressful for the server, so..
My question now is: Should I fetch all data at once (with left joins etc.) while constructing the model and save the fields as an array or should I use some other method?
Why do people insist on referring to the entities and domain objects as "models".
Unless your entities are extremely large, I would populate the entire entity, when you need it. And, if "email list" is part of that entity, I would populate that too.
As I see it, the question is more related to "what to do with tables, that are related by foreign keys".
Lets say you have Users and Articles tables, where each article has a specific owner associate by user_id foreign key. In this case, when populating the Article entity, I would only retrieve the user_id value instead of pulling in all the information about the user.
But in your example with Users and UserEmails, the emails seem to be a part of the User entity, and something that you would often call via $user->getEmailList().
TL;DR
I would do this in two queries, when populating User entity:
select all you need from Users table and apply to User entity
select all user's emails from the UserEmails table and apply it to User entity.
P.S
You might want to look at data mapper pattern for "how" part.
In my opinion you should fetch all your fields at once, and divide queries in a way that makes your code easier to read/manage.
When we're talking about one query or two, the difference is usually negligible unless the combined query (with JOINs or whatever) is overly complex. Usually an index or two is the solution to a very slow query.
If we're talking about one vs hundreds or thousands of queries, that's when the connection/transmission overhead becomes more significant, and reducing the number of queries can make an impact.
It seems that your framework suffers from premature optimization. You are hyper-concerned about fetching too many fields from a row, but why? Do you have thousands of columns or something?
The time consuming part of your query is almost always the lookup, not the transmission of data. You are causing the database to do the "hard" part over and over again as you pull one field at a time.
I have a PHP script that retrieves rows from a database and then performs work based on the contents. The work can be time consuming (but not necessarily computationally expensive) and so I need to allow multiple scripts to run in parallel.
The rows in the database looks something like this:
+---------------------+---------------+------+-----+---------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------------+---------------+------+-----+---------------------+----------------+
| id | bigint(11) | NO | PRI | NULL | auto_increment |
.....
| date_update_started | datetime | NO | | 0000-00-00 00:00:00 | |
| date_last_updated | datetime | NO | | 0000-00-00 00:00:00 | |
+---------------------+---------------+------+-----+---------------------+----------------+
My script currently selects rows with the oldest dates in date_last_updated (which is updated once the work is done) and does not make use of date_update_started.
If I were to run multiple instances of the script in parallel right now, they would select the same rows (at least some of the time) and duplicate work would be done.
What I'm thinking of doing is using a transaction to select the rows, update the date_update_started column, and then add a WHERE condition to the SQL statement selecting the rows to only select rows with date_update_started greater than some value (to ensure another script isn't working on it). E.g.
$sth = $dbh->prepare('
START TRANSACTION;
SELECT * FROM table WHERE date_update_started > 1 DAY ORDER BY date_last_updated LIMIT 1000;
UPDATE table DAY SET date_update_started = UTC_TIMESTAMP() WHERE id IN (SELECT id FROM table WHERE date_update_started > 1 DAY ORDER BY date_last_updated LIMIT 1000;);
COMMIT;
');
$sth->execute(); // in real code some values will be bound
$rows = $sth->fetchAll(PDO::FETCH_ASSOC);
From what I've read, this is essentially a queue implementation and seems to be frowned upon in MySQL. All the same, I need to find a way to allow multiple scripts to run in parallel, and after the research I've done this is what I've come up with.
Will this type of approach work? Is there a better way?
I think your approach could work, as long as you also add some kind of identifier to the rows you selected that they are currently been worked on, it could be as #JuniusRendel suggested and i would even think about using another string key (random or instance id) for cases where the script resulted in errors and did not complete gracefully, as you will have to clean these fields once you updated the rows back after your work.
The problem with this approach as i see it is the option that there will be 2 scripts that run at the same point and will select the same rows before they were signed as locked. here as i can see it, it really depends on what kind of work you do on the rows, if the end result in these both scripts will be the same, i think the only problem you have is for wasted time and server memory (which are not small issues but i will put them aside for now...). if your work will result in different updates on both scripts your problem will be that you could have the wrong update at the end in the TB.
#Jean has mentioned the second approach you can take that involves using the MySql locks. i am not an expert of the subject but it seems like a good approach and using the 'Select .... FOR UPDATE' statement could give you what you are looking for as you could do on the same call the select & the update - which will be faster than 2 separate queries and could reduce the risk for other instances to select these rows as they will be locked.
The 'SELECT .... FOR UPDATE' allows you to run a select statement and lock those specific rows for updating them, so your statement could look like:
START TRANSACTION;
SELECT * FROM tb where field='value' LIMIT 1000 FOR UPDATE;
UPDATE tb SET lock_field='1' WHERE field='value' LIMIT 1000;
COMMIT;
Locks are powerful but be careful that it wont affect your application in different sections. Check if those selected rows that are currently locked for the update, are they requested somewhere else in your application (maybe for the end user) and what will happen in that case.
Also, Tables must be InnoDB and it is recommended that the fields you are checking the where clause with have a Mysql index as if not you may lock the whole table or encounter the 'Gap Lock'.
There is also a possibility that the locking process and especially when running parallel scripts will be heavy on your CPU & memory.
here is another read on the subject: http://www.percona.com/blog/2006/08/06/select-lock-in-share-mode-and-for-update/
Hope this helps, and would like to hear how you progressed.
We have something like this implemented in production.
To avoid duplicates, we do a MySQL UPDATE like this (I modified the query to resemble your table):
UPDATE queue SET id = LAST_INSERT_ID(id), date_update_started = ...
WHERE date_update_started IS NULL AND ...
LIMIT 1;
We do this UPDATE in a single transaction, and we leverage the LAST_INSERT_ID function. When used like that, with a parameter, it writes in the transaction session the parameter that, in this case, it's the ID of the single (LIMIT 1) queue that has been updated (if there is one).
Just after that, we do:
SELECT LAST_INSERT_ID();
When used without parameter, it retrieves the previously stored value, obtaining the queue item's ID that has to be performed.
Edit: Sorry, I totally misunderstood your question
You should just put a "locked" column on your table put the value to true on the entries your script is working with, and when it's done put it to false.
In my case i have put 3 other timestamp (integer) columns: target_ts , start_ts , done_ts.
You
UPDATE table SET locked = TRUE WHERE target_ts<=UNIX_TIMESTAMP() AND ISNULL(done_ts) AND ISNULL(start_ts);
and then
SELECT * FROM table WHERE target_ts<=UNIX_TIMESTAMP() AND ISNULL(start_ts) AND locked=TRUE;
Do your jobs and update each entry one by one (to avoid data inconcistencies) setting the done_ts property to current timestamp (you can also unlock them now). You can update target_ts to the next update you wish or you can ignore this column and just use done_ts for your select
Each time the script runs I would have the script generate a uniqid.
$sctiptInstance = uniqid();
I would add a script instance column to hold this value as a varchar and put an index on it. When the script runs I would use select for update inside of a transaction to select your rows based on whatever logic, excluding rows with a script instance, and then update those rows with the script instance. Something like:
START TRANSACTION;
SELECT * FROM table WHERE script_instance = '' AND date_update_started > 1 DAY ORDER BY date_last_updated LIMIT 1000 FOR UPDATE;
UPDATE table SET date_update_started = UTC_TIMESTAMP(), script_instance = '{$scriptInstance}' WHERE script_instance = '' AND date_update_started > 1 DAY ORDER BY date_last_updated LIMIT 1000;
COMMIT;
Now those rows will be excluded from other instances of the script. Do you work, and then update the rows to set the script instance back to null or blank, and also update your date last updated column.
You could also use the script instance to write to another table called "current instances" or something like that, and have the script check that table to get a count of running scripts to control the number of concurrent scripts. I would add the PID of the script to the table as well. You could then use that information to create a housekeeping script to run from cron periodically to check for long running or rogue processes and kill them, etc.
I have a system working exactly like this in production. We run a script every minute to do some processing, and sometimes that run can take more than a minute.
We have a table column for status, which is 0 for NOT RUN YET, 1 for FINISHED, and other value for under way.
The first thing the script does is to update the table, setting a line or multiple lines with a value meaning that we are working on that line. We use getmypid() to update the lines that we want to work on, and that are still unprocessed.
When we finish the processing, the script updates the lines that have the same process ID, marking them as finished (status 1).
This way we avoid each of the scripts to try and process a line that is already under processing, and it works like a charm. This doesn't mean that there isn't a better way, but this does get the work done.
I have used a stored procedure for very similar reasons in the past. We used the FOR UPDATE read lock to lock the table while a selected flag was updated to remove that entry from any future selects. It looked something like this:
CREATE PROCEDURE `select_and_lock`()
BEGIN
START TRANSACTION;
SELECT your_fields FROM a_table WHERE some_stuff=something
AND selected = 0 FOR UPDATE;
UPDATE a_table SET selected = 1;
COMMIT;
END$$
No reason it has to be done in a stored procedure though now I think about it.
Consider the following table row:
ID | First Name | Last Name | Email | Age
____________________________________________________
1 | John | Smith | john#smith.com | 23
2 | Mohammad | Naji | me#naji.com | 26
When an update occurs, eg. an email of an account is changed, how should I detect the change was that?
I should bold the changes for website admins.
Current database schema doesn't support it because I don't store previous revisions of the row.
Please advise me with the least cost solution for me now.
You can create a function in php and use it to update the data:
function update_row($new_row, $id)
Parameters:
$new_row is an associative array: array("column name" => new column value)
$id - id of the row to update
Function works like this:
Select current row with id = $id into $old_row
Compare old and new rows and get the columns updated:
$columns_updated = array();
foreach($new_row as $key => $value){
if($new_row[$key] != $value)
{
array_push($key);
}
}
update row where id=$id to $new_row
return $columns_updated
You'll be unable to track changes unless you make some sort of change to the schema. At the very least you'll need a table (or tables) that do that for you.
You can either
a) explicitly track changes as updates are made by modifying the code that makes them. These could be in many places, so this is likely to be time consuming.
b) Track the changes by implementing a mySQL trigger on the database that automatically copies the old version to your new tables each time a row is updated.
In either case, you'll need to query both the current table and the changes table to check for changes you need to highlight.
You'll also need to determine at what point a change no longer needs to be highlighted. Simply deleting the old row from your changes table will remove the change, but you'll need to decide when that should be done. You could use a MySQL event to cull the changes on a regular basis, or you could tie this maintenance to some other trigger or action in your system.
Implementation details will need to be decided based on your system and expectations.
Using Triggers and Events has the advantage that changes can be confined to the database, except where the changes need to be highlighted.
I have a table that stores specific updates for all customers.
Some sample table:
record_id | customer_id | unit_id | time_stamp | data1 | data2 | data3 | data4 | more
When I created the application, I did not realize how much this table would grow -- currently I have over 10mil records within 1 month. I am facing issues, when php stops executing due to amount of time it takes. Some queries produce top-1 results, based on the time_stamp + customer_id + unit_id
How would you suggest handling this type of issues? For example, I can create new table for each customer, although I think it does not a good solution.
I am stuck with no good solution in mind.
If you're on the cloud (where you're charged for moving data between server and db), ignore.
Move all logic to the server
The fastest query is a SELECT WHEREing the PRIMARY. It won't matter how large your database is, it will come back just as fast with a table of 1 row (as long as your hardware isn't unbalanced).
I can't tell exactly what you're doing with your query, but first download all of the sorting and limiting data into PHP. Once you've got what you need, SELECT the data directly WHEREing on record_id (I assume that's your PRIMARY).
It looks like your on demand data is pretty computationally intensive and huge, so I recommend using a faster language. http://blog.famzah.net/2010/07/01/cpp-vs-python-vs-perl-vs-php-performance-benchmark/
Also, when you start sorting and limiting on the server rather than the db, you can start identifying shortcuts to speed it up even further.
This is what the server's for.
I suggest you use partitioning of your data following some criteria.
You can make horizontal or vertical partition of your data.
For example group your customer_id in 10 partitions, using his id module 10.
So, customer_id terminated in 0 goes to partition 0, with ended in 1 goes to partition 1
MySQL can make this for you easily.
What is the count of records within the tables? Often, with relational databases, it's not how much data you have (millions are nothing to relational databases), it's how you're retrieving it.
From the look of your select, in fact, you probably just need to optimize the statement itself and avoid the multiple subselects, which is probably the main cause of the slowdown. Try running an explain on that statement, or just get the ids and run the interior select individually on the ids of the records that you've actually found & retrieved in the first run.
Just the fact that you have those subselects within your overall statement means that you haven't optimized that far into the process anyway. For example, you could be running a nightly or hourly cron job that aggregates into a new table the sets like the one created by SELECT gps_unit.idgps_unit, and then you can run your selects against a previously generated table instead of creating blocks of data that are equivalent of a table on the fly.
If you find yourself unable to effectively optimize that select statement, you have "final" options like:
Categorize via some criteria and split into different tables.
Keep a deep archive, such that anything past the first year or so is migrated to a less used table and requires special retrieval.
Finally, if you have so much small data, you may be able to completely archive certain tables and keep them around in file form only and then truncate past a certain date. Often with web tracking data that isn't that important and is kinda spammy, I end up doing this after a few years, when the data is really not going to do anyone any good any more.
I'm having some trouble coming up with an efficient solution to this problem. Maybe I am making it more complicated than needs to be. I have a table like this:
thing_id | user_id | order
1 1 0
2 1 1
3 1 2
The user may mess around with their things and it may happen that they change thing 1 to thing 3, and thing 3 to thing 1. In my case, it is not that the user is explicitly changing the order. Rather, they are modifying their bank of things, and they may change the thing in slot 1 to be the thing in slot 3, and vice versa. So if the user performs this operation, the table should look like this:
thing_id | user_id | order
3 1 0
2 1 1
1 1 2
What complicates this is that (thing_id, user_id) has a unique constraint, so doing sequential updates does not quite work. If I try to UPDATE tbl SET thing_id=3 WHERE thing_id=1, the unique constraint is broken.
The order column is purely for show, in order to make an alphabetized list. So I suppose I could use PHP to check the order and figure things out like that, but this introduces code that really has nothing to do with the important stuff. I'd like to find a solution that is purely/mostly SQL.
Also, along the same lines, if I were to insert a new row into the table, I would want the order value to be 3. Is there an efficient way to do this in SQL, without first having to SELECT MAX(order) WHERE user_id=1?
My comment seems to have gotten some traction, so I'm posting it as an answer... To avoid your problem, add a new column, without constraints, and just use that for user desired updates.
Why aren't you updating the order instead of the thingid?
UPDATE tbl
SET order = 2
WHERE thing_id=1;
Each row represents a "thing-user" pair. The data is the ordering that you want to use. You don't want to change the entity ("thing-user"). You want to change the data.
By the way, you'll then have to do some additional work to keep unique values in orders.
If you switched this around and put the unique constraint on user_id, order, then it would make sense to update the thing_id.